Anastasis Germanidis is the CTO and Co-Founder of Runway. He discusses Runway’s journey creating Gen-2, and what’s next for creativity.
To begin with, can you tell us the steps you took between releasing your work on Latent Diffusion Models in April 2022, and introducing Gen-2 in June 2023?
We’ve been dreaming of building a text-to-video system for a long time, but the development of the Gen models really came together a year ago, in September 2022. After the Stable Diffusion release the month prior, we saw that once a certain quality threshold is reached for a given modality, it acts as an existence proof that propels the field to make continuous progress. While that had been achieved in image generation, we weren’t quite there with video. The state-of-the-art model around that time was CogVideo, and our first goal was to improve on its results using the latent diffusion architecture.
The initial problem that needed to be solved was temporal consistency. If you use an image generation model to create a video on a per-frame basis, you’ll get a lot of flicker and changes in content between frames. Moreover, you are not able to generate specific movements and actions. While per-frame image generation methods (like Deforum) have created a distinctive style that can be aesthetically pleasing, we were interested in tackling the problem of photorealistic video generation from the beginning. Rather than trying to solve end-to-end generation from the start, we decided to solve a simpler version of the problem with Gen-1: you have an input video as conditioning, which determines the structure of the output video. A few months later, we released Gen-2, which removed the need for structure conditioning and tackled text-guided video generation directly. Most recently, we pushed a large update to the model that allows videos to be generated from arbitrary starting frames via image-to-video.
More broadly, one way to view Gen-2 is as a model that can take any starting image, whether real-world or generated, and predict its motion. The big insight over the past few years in language modeling has been that you can build widely useful, extremely capable AI systems simply by training models to predict the next token in a sentence. In order to solve that next-token prediction task well, the model needs to build a highly detailed representation of the world. The same principle applies in video, where models trained to predict the next frame end up gaining a deep understanding of the visual world.
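The next-frame-prediction objective described above can be sketched in miniature. This is a hypothetical toy illustration, not Runway’s architecture: the “video” here is a sequence of 8-dimensional frames evolving under fixed linear dynamics, and fitting a model to predict frame t+1 from frame t recovers those dynamics, in the same way that a capable video model must internalize how the visual world moves.

```python
import numpy as np

# Toy sketch of next-frame prediction (illustrative only, not Runway's model).
# The "video" follows simple linear dynamics: a slow rotation in the first
# two dimensions. A predictor trained on (frame_t, frame_t+1) pairs must
# learn those dynamics to drive its prediction error to zero.
rng = np.random.default_rng(0)

dim, theta = 8, 0.1
dynamics = np.eye(dim)
dynamics[0, 0], dynamics[0, 1] = np.cos(theta), -np.sin(theta)
dynamics[1, 0], dynamics[1, 1] = np.sin(theta), np.cos(theta)

frames = [rng.normal(size=dim)]
for _ in range(63):
    frames.append(dynamics @ frames[-1])   # each frame is the last one, moved
frames = np.stack(frames)

x, y = frames[:-1], frames[1:]             # (current frame, next frame) pairs
W, *_ = np.linalg.lstsq(x, y, rcond=None)  # fit a linear next-frame predictor
mse = float(np.mean((x @ W - y) ** 2))
print(f"next-frame prediction MSE: {mse:.2e}")
```

Because the toy dynamics are exactly linear, the fitted predictor drives the error to essentially zero; real video models face the same objective over vastly richer dynamics, which is what forces them to build a model of the world.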
Throughout this process, what was your north star? Where were you trying to go?
We’ve always set the ability to generate a two-hour film as a north star. That does not mean that we expect a film to materialize entirely from a simple prompt, but rather that someone would be able to iteratively build an entire movie, scene by scene, using generative models. It also does not mean that our models are only used for generating feature-length films. The point is that solving all the challenges along the way to that feature-length milestone will require building a series of broadly useful systems for storytelling and creativity.
If you take it hierarchically, the first individual frame should be as high fidelity as possible. In most films, shots are a few seconds long, and you need to figure out how to get temporal stability and high fidelity within those shots. And then as you build a scene, you enter the challenges involved in consistency across different shots, in terms of characters, settings, and so forth.
Going one step further, you need to think in terms of storytelling and the overall narrative that you are trying to build. How can generative models assist there in understanding how all the different pieces fit together?
How do you hope the community uses this technology?
A lot of the dream scenarios that we had in mind before we released the Gen models are already happening. It’s been amazing to see the community emerge around AI video over the past months. It’s only going to grow. We’re figuring out new ways of supporting this emerging community with initiatives like Runway Watch, Gen:48, and the AI Film Festival.
For the first time, video is becoming a much more immediate form of expression. The latency from having an idea to seeing it on the screen is now similar to the latency of writing down a story. That was never the case with video storytelling before. It’s a huge change that we’re witnessing.
Now that you’ve launched Gen-2, how are you thinking about making improvements?
A big principle of how we develop models at Runway is incremental deployment. We believe in continuously pushing updates to our models and putting them in the hands of our creators every step of the way. We don’t have a full hypothesis of all the ways the models are going to be used, but instead let use cases emerge organically. There are also many safety considerations in rolling out these models. For both of those reasons, we follow a staged release approach. Gen-2 lived in Discord for about two months before it was released on Runway web and mobile.
Recently, we’ve been focusing a lot on control: the ability to supplement the text description of the content with additional inputs that better guide the generation. This might be a starting frame, a specific camera direction, or the degree of motion you’ll have in a video. Essentially, we think in terms of how a filmmaker would describe a shot, and work backwards from there to the model updates we need to make to support those directions.
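The idea of layering filmmaker-style controls on top of a text prompt can be sketched as a small data structure. Everything here is hypothetical: the field names (`init_frame`, `camera`, `motion`) are illustrative placeholders, not Runway’s actual API.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical sketch of "filmmaker-style" controls supplementing a text
# prompt. Field names are illustrative placeholders, not Runway's API.
@dataclass
class ShotSpec:
    prompt: str                        # text description of the shot
    init_frame: Optional[str] = None   # path to a starting image, if any
    camera: str = "static"             # e.g. "static", "pan_left", "zoom_in"
    motion: int = 5                    # degree of motion, 1 (subtle) to 10

    def to_request(self) -> dict:
        # Drop unset optional fields so the request payload stays minimal.
        return {k: v for k, v in asdict(self).items() if v is not None}

shot = ShotSpec(prompt="aerial shot of a coastline at dusk",
                camera="pan_left", motion=3)
print(shot.to_request())
```

The point of the sketch is the direction of design: start from how a director would describe a shot, then work backwards to the conditioning signals a model would need to honor each field.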
How are you growing your team?
We currently have a variety of open roles on the research team, which you can see at runwayml.com/careers. We’re looking for engineers and researchers who are motivated by this idea of building models that unlock creativity, and who want to help us advance what’s possible with generative models for media.