Selected Papers

A curated list of research papers we are reading.

2022

Diffusion Models for Video Prediction and Infilling

Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi

Predicting and anticipating future outcomes or reasoning about missing information in a sequence are critical skills for agents to be able to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks, but have not been extensively explored in the video domain. This paper presents Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training.

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

Large Language Models (LLMs) have achieved excellent performance in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, researchers demonstrate that an LLM is also capable of self-improving with only unlabeled datasets.

Semi-Parametric Neural Image Synthesis

Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, Björn Ommer

Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. This work questions the underlying paradigm of compressing large training data into ever growing parametric representations.

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

This work proposes to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, a single model is trained first and then split into specialized models, each trained for a specific stage of the iterative generation process. The resulting ensemble, called eDiff-I, improves text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.

FactorMatte: Redefining Video Matting for Re-Composition Tasks

Zeqi Gu, Wenqi Xian, Noah Snavely, Abe Davis

This work proposes "factor matting", an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks.

TAP-Vid: A Benchmark for Tracking Any Point in a Video

Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

The problem of tracking arbitrary physical points on surfaces over longer video clips has been addressed to some extent, but until now no dataset or benchmark existed for evaluating it. In this paper, the problem is formalized and named tracking any point (TAP), and a companion benchmark named TAP-Vid is introduced.

Null-text Inversion for Editing Real Images using Guided Diffusion Models

Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or

In this paper, researchers introduce an accurate inversion technique for real images, which in turn enables intuitive text-based editing of those images with guided diffusion models.

On Distillation of Guided Diffusion Models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans

This work proposes an approach to distilling classifier-free guided diffusion models into models that are fast to sample from. Given a pre-trained classifier-free guided model, a single model is first trained to match the output of the combined conditional and unconditional models; that model is then progressively distilled into a diffusion model that requires far fewer sampling steps.
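As a rough illustration of what "the combined conditional and unconditional models" refers to: in classifier-free guidance, the two predictions are mixed with a guidance weight, so every guided sampling step costs two network evaluations. The sketch below uses one common convention for that mix. The function name and the toy values are hypothetical and are not taken from the paper's code.

```python
def guided_prediction(cond, uncond, w):
    """Mix conditional and unconditional predictions elementwise.

    Assumed convention: (1 + w) * cond - w * uncond, so w = 0 returns
    the conditional prediction unchanged and larger w pushes the result
    further away from the unconditional prediction.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy conditional prediction (hypothetical values)
uncond = [0.5, 1.0]  # toy unconditional prediction (hypothetical values)

print(guided_prediction(cond, uncond, 0.0))  # no guidance
print(guided_prediction(cond, uncond, 1.0))  # guidance weight 1
```

The first distillation stage described in the blurb would train one network to reproduce this mixed output directly, so that a single evaluation per step is enough.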

2021

The Animation Transformer: Visual Correspondence via Segment Matching

Evan Casey, Víctor Pérez, Zhuoru Li, Harry Teitelman, Nick Boyajian, Tim Pulver, Mike Manh, William Grisaitis

Many ML tasks in the video domain rely on creating visual correspondences; that is, matching the parts of each frame that represent the same content across the video, usually at a pixel or patch level. This paper, which focuses on hand-drawn animation, considers correspondences between line-enclosed segments instead of pixels, significantly reducing the amount of computation required while improving accuracy compared to pixel-level approaches.

Layered Neural Atlases for Consistent Video Editing

Yoni Kasten, Dolev Ofri, Oliver Wang, Tali Dekel

Traditional techniques for editing video content based on “keyframe edits” rely on optical flow, which is often inaccurate, or 3D geometry, which is not always available. This paper introduces a new 2D representation called “neural atlases,” which are 2D mosaics of the video content that can be used to magically edit the entire video at once, and can be computed in a differentiable way.

Resolution-robust Large Mask Inpainting with Fourier Convolutions

Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, Victor Lempitsky

This paper presents an image inpainting model that uses a novel operator, the Fast Fourier Convolution, to address one of the biggest limitations in previous inpainting work — hallucinating large missing regions in an image. The results are astonishing!

Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models

Niv Granot, Ben Feinstein, Assaf Shocher, Shai Bagon, Michal Irani

Generative adversarial networks have become the de-facto method for generative modeling in the image domain. Yet they are still time-consuming and difficult to train, and often produce unpredictable artifacts that are hard to control. This paper challenges the notion that GANs are the solution to all generative problems by introducing a simple nearest-neighbor patch-based method for generating new images from a single image, leading to orders of magnitude improvements in speed.
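To make the "nearest-neighbor patch-based" idea concrete, here is a minimal 1D toy sketch, assuming squared-error patch distance and simple averaging of overlapping patches. It only shows the basic building block (replace every patch of a query with its closest patch from a single source), not the paper's full method; all function names are hypothetical.

```python
def patches(signal, size):
    """All contiguous windows of length `size` in a list of numbers."""
    return [signal[i:i + size] for i in range(len(signal) - size + 1)]


def nearest_patch(query, candidates):
    """Candidate patch with the smallest squared distance to `query`."""
    return min(candidates,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(query, p)))


def replace_with_nearest(query_signal, source_signal, size):
    """Rebuild `query_signal` from the nearest patches of `source_signal`.

    Overlapping patches vote for each position and the votes are averaged.
    Assumes len(query_signal) >= size so every position gets a vote.
    """
    source_patches = patches(source_signal, size)
    totals = [0.0] * len(query_signal)
    counts = [0] * len(query_signal)
    for start, query in enumerate(patches(query_signal, size)):
        best = nearest_patch(query, source_patches)
        for offset, value in enumerate(best):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]


source = [0, 0, 10, 10, 0, 0]  # the single "training image"
noisy = [1, 0, 9, 11, 0, 1]    # a perturbed query to be rebuilt
print(replace_with_nearest(noisy, source, size=2))
```

Because no network is trained, the cost is dominated by the patch search, which is consistent with the speed advantage the blurb mentions.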

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski

Manipulate StyleGAN images with text. Multimodal transformers such as CLIP open up so many possibilities for text-based media editing. A new paradigm in creative tools that relies less on precise manipulation of sliders and anchor points and more on imaginative descriptions and prompts. Hello, post-slider interfaces?

Enhancing Photorealism Enhancement

Stephan R. Richter, Hassan Abu AlHaija, Vladlen Koltun

We saw NVIDIA take the first steps towards adding ConvNets as a rendering pass in video games to perform real-time super-resolution with DLSS. This paper takes that approach to the next level by using image-to-image GANs applied to G-buffers from the game engine to generate temporally consistent photorealistic GTA V frames.

Skip-Convolutions for Efficient Video Processing

Amirhossein Habibian, Davide Abati, Taco S. Cohen, Babak Ehteshami Bejnordi

There are decades of work in image and video compression taking advantage of insights on the human perceptual system and the redundancies in videos to reduce bandwidth with techniques such as DCT coding, chroma subsampling, and motion compensation. This is one of a few recent papers that use an idea analogous to motion compensation in the context of DNN inference on video: by operating only on the residuals between frames, they save significant compute.
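The reason residuals can save compute is that convolution is linear: the output for a new frame equals the previous output plus the convolution of the frame difference, and the difference is zero wherever nothing changed. The toy 1D sketch below illustrates only that identity under simplifying assumptions (exact zero residuals, a single layer); it is not the paper's method, and the function names are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D correlation of two lists of numbers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv1d_from_residual(prev_output, prev_frame, frame, kernel):
    """Update `prev_output` using only windows where the input changed.

    Returns the new output and the number of positions actually recomputed.
    """
    k = len(kernel)
    residual = [new - old for old, new in zip(prev_frame, frame)]
    output = list(prev_output)
    updated = 0
    for i in range(len(output)):
        window = residual[i:i + k]
        if any(window):  # skip windows where nothing changed
            output[i] += sum(r * w for r, w in zip(window, kernel))
            updated += 1
    return output, updated


kernel = [1, 0, -1]
prev_frame = [1, 2, 3, 4, 5, 6, 7, 8]
frame = [1, 2, 3, 4, 5, 6, 9, 8]  # only one sample differs

prev_output = conv1d(prev_frame, kernel)
output, updated = conv1d_from_residual(prev_output, prev_frame, frame, kernel)
print(output == conv1d(frame, kernel), updated)
```

In this toy case only the windows overlapping the changed sample are recomputed, while the result matches a full convolution of the new frame.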

Growing 3D Artefacts and Functional Machines with Neural Cellular Automata

Shyam Sudhakaran, Djordje Grbic, Siyan Li, Adam Katona, Elias Najarro, Claire Glanois, Sebastian Risi

Biology has been a consistent source of inspiration for new architectures and techniques in machine learning, from neural networks to genetic algorithms. Neural Cellular Automata (NCAs) bring ideas from morphogenesis, the process by which biological organisms self-assemble from a single cell, to the world of deep neural networks and differentiable computing. In this paper, the authors apply NCAs to Minecraft to evolve complex buildings, and even machines, from a single block!
