Selected Papers

A curated list of research papers we are reading.
Diffusion Models for Video Prediction and Infilling
Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi
Predicting and anticipating future outcomes or reasoning about missing information in a sequence are critical skills for agents to be able to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks, but have not been extensively explored in the video domain. This paper presents Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training.
View Paper
Large Language Models Can Self-Improve
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han
Large Language Models (LLMs) have achieved excellent performance in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, researchers demonstrate that an LLM is also capable of self-improving with only unlabeled datasets.
View Paper
Semi-Parametric Neural Image Synthesis
Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, Björn Ommer
Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. This work questions the underlying paradigm of compressing large training data into ever growing parametric representations.
View Paper
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu
This work proposes training an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, the authors first train a single model, which is then split into specialized models that are trained for specific stages of the iterative generation process. The resulting ensemble, called eDiff-I, improves text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.
View Paper
FactorMatte: Redefining Video Matting for Re-Composition Tasks
Zeqi Gu, Wenqi Xian, Noah Snavely, Abe Davis
This work proposes "factor matting", an alternative formulation of the video matting problem in terms of counterfactual video synthesis that is better suited for re-composition tasks.
View Paper
TAP-Vid: A Benchmark for Tracking Any Point in a Video
Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang
The problem of tracking arbitrary physical points on surfaces over longer video clips has been addressed to some extent, but until now, no dataset or benchmark for evaluation had existed. In this paper, the problem is formalized and named tracking any point (TAP), and a companion benchmark named TAP-Vid is introduced.
View Paper
Null-text Inversion for Editing Real Images using Guided Diffusion Models
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
In this paper, researchers introduce an accurate inversion technique for real images, which in turn enables intuitive text-based editing of those images with guided diffusion models.
View Paper
On Distillation of Guided Diffusion Models
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans
This work proposes an approach to distilling classifier-free guided diffusion models into models that are fast to sample from. Given a pre-trained classifier-free guided model, the authors first learn a single model that matches the output of the combined conditional and unconditional models, and then progressively distill that model into a diffusion model that requires far fewer sampling steps.
View Paper
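As a rough illustration of the first stage described in the entry above, the sketch below combines a conditional and an unconditional prediction into one guided target and measures how far a student's output is from it. The function names, the toy lists standing in for model outputs, and the `(1 + w) * cond - w * uncond` weighting convention are assumptions made for this example, not code from the paper.

```python
def guided_prediction(cond, uncond, w):
    """Combine conditional and unconditional predictions.

    Assumed convention: (1 + w) * cond - w * uncond, so w = 0
    gives back the purely conditional prediction.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]


def distillation_loss(student, cond, uncond, w):
    """Mean squared error between a student output and the guided target."""
    target = guided_prediction(cond, uncond, w)
    return sum((s - t) ** 2 for s, t in zip(student, target)) / len(target)


# Toy values standing in for the two teacher outputs (hypothetical).
cond = [1.0, 2.0]
uncond = [0.5, 1.0]
target = guided_prediction(cond, uncond, w=2.0)
print(target)                                        # guided target
print(distillation_loss(target, cond, uncond, 2.0))  # a perfect student has zero loss
```

The point of the first stage is that one student network replaces two teacher evaluations per sampling step; the progressive reduction of sampling steps is a separate stage not shown here.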
The Animation Transformer: Visual Correspondence via Segment Matching
Evan Casey, Víctor Pérez, Zhuoru Li, Harry Teitelman, Nick Boyajian, Tim Pulver, Mike Manh, William Grisaitis
Many ML tasks in the video domain rely on creating visual correspondences; that is, matching parts of a frame that represent the same content across frames of the video, usually at a pixel or patch level. This paper, which focuses on hand-drawn animation, considers correspondences between line-enclosed segments instead of pixels, significantly reducing the amount of computation required while improving accuracy compared to pixel-level approaches.
View Paper
Layered Neural Atlases for Consistent Video Editing
Yoni Kasten, Dolev Ofri, Oliver Wang, Tali Dekel
Traditional techniques for editing video content based on “keyframe edits” rely on optical flow, which is often inaccurate, or 3D geometry, which is not always available. This paper introduces a new 2D representation called “neural atlases,” which are 2D mosaics of the video content. An edit applied to an atlas carries over to the entire video at once, and the atlases can be computed in a differentiable way.
View Paper
Resolution-robust Large Mask Inpainting with Fourier Convolutions
Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, Victor Lempitsky
This paper presents an image inpainting model that uses a novel operator, the Fast Fourier Convolution, to address one of the biggest limitations in previous inpainting work — hallucinating large missing regions in an image. The results are astonishing!
View Paper
Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models
Niv Granot, Ben Feinstein, Assaf Shocher, Shai Bagon, Michal Irani
Generative adversarial networks have become the de-facto method for generative modeling in the image domain. Yet they are still time-consuming and difficult to train, and often produce unpredictable artifacts that are hard to control. This paper challenges the notion that GANs are the solution to all generative problems by introducing a simple nearest-neighbor patch-based method for generating new images from a single image, leading to orders of magnitude improvements in speed.
View Paper
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, Dani Lischinski
Manipulate StyleGAN images with text. Multimodal transformers such as CLIP open up so many possibilities for text-based media editing. A new paradigm in creative tools that relies less on precise manipulation of sliders and anchor points and more on imaginative descriptions and prompts. Hello, post-slider interfaces?
View Paper
Enhancing Photorealism Enhancement
Stephan R. Richter, Hassan Abu AlHaija, Vladlen Koltun
We saw NVIDIA take the first steps towards adding ConvNets as a rendering pass in video games to perform real-time super-resolution with DLSS. This paper takes that approach to the next level by applying image-to-image GANs to G-buffers from the game engine to generate temporally consistent photorealistic GTA V frames.
View Paper
Skip-Convolutions for Efficient Video Processing
Amirhossein Habibian, Davide Abati, Taco S. Cohen, Babak Ehteshami Bejnordi
There are decades of work in image and video compression that take advantage of insights into the human perceptual system and the redundancies in videos to reduce bandwidth, with techniques such as DCT coding, chroma subsampling, and motion compensation. This is one of a few recent papers that use an idea analogous to motion compensation in the context of DNN inference on video: by operating only on the residuals between frames, they save a significant amount of compute.
View Paper
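The residual idea in the entry above rests on convolution being linear: the output for a new frame equals the output for the previous frame plus the convolution of the frame difference. The toy 1D sketch below shows that identity and skips positions where the residual is all zeros. The function names and the exact-zero skip rule are assumptions for illustration; this is not the paper's gating mechanism.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv_on_residual(prev_output, prev_frame, frame, kernel):
    """Update the previous output using only the frame-to-frame residual."""
    residual = [b - a for a, b in zip(prev_frame, frame)]
    k = len(kernel)
    out = list(prev_output)
    for i in range(len(out)):
        window = residual[i:i + k]
        if any(window):  # skip work wherever nothing changed
            out[i] += sum(r * w for r, w in zip(window, kernel))
    return out


# Two toy "frames" that differ at a single position (hypothetical data).
prev_frame = [1, 2, 3, 4, 5, 6]
frame = [1, 2, 3, 9, 5, 6]
kernel = [1, 0, -1]

prev_output = conv1d(prev_frame, kernel)
updated = conv_on_residual(prev_output, prev_frame, frame, kernel)
print(updated == conv1d(frame, kernel))  # the two routes agree
```

When consecutive frames are mostly identical, most windows are skipped, which is where the compute saving would come from in this simplified setting.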
Growing 3D Artefacts and Functional Machines with Neural Cellular Automata
Shyam Sudhakaran, Djordje Grbic, Siyan Li, Adam Katona, Elias Najarro, Claire Glanois, Sebastian Risi
Biology has been a consistent source of inspiration for new architectures and techniques in machine learning, from neural networks to genetic algorithms. Neural Cellular Automata (NCAs) bring ideas from morphogenesis, the process by which biological organisms self-assemble from a single cell, to the world of deep neural networks and differentiable computing. In this paper, the authors apply NCAs to Minecraft to evolve complex buildings, and even machines, from a single block!
View Paper