Scalable Neural Video Representations with Learnable Positional Features

NeurIPS 2022

Subin Kim*,1, Sihyun Yu*,1, Jaeho Lee2, Jinwoo Shin1

1 KAIST       2 POSTECH

* Denotes equal contribution
[Paper]            [Code]            [Poster]


[Video gallery: City Street, Surfing, FlowerFocus, SunBath, Twilight]

Abstract


Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to handle videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs while (b) achieving high-quality video encoding and (c) maintaining parameter efficiency. To meet all three requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. Then, we propose to utilize existing powerful image and video codecs as a compute-/memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57, even while using more than 8 times fewer parameters. We also show intriguing properties of NVP, e.g., video inpainting and video frame interpolation.
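For intuition, below is a minimal PyTorch sketch of the latent-keyframe idea described above: three learnable 2D feature grids (one per xy, xt, and yt plane) are bilinearly sampled at the projections of a (x, y, t) coordinate, and a small MLP maps the concatenated features to RGB. The grid resolution, channel count, and MLP head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKeyframes(nn.Module):
    """Minimal sketch: three learnable 2D latent keyframes (xy, xt, yt)."""

    def __init__(self, res=256, ch=16, hidden=128):
        super().__init__()
        # One learnable 2D feature grid per spatio-temporal plane.
        self.grids = nn.ParameterDict({
            p: nn.Parameter(0.01 * torch.randn(1, ch, res, res))
            for p in ("xy", "xt", "yt")
        })
        self.mlp = nn.Sequential(
            nn.Linear(3 * ch, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, coords):
        # coords: (N, 3) tensor of (x, y, t), each normalized to [-1, 1].
        x, y, t = coords.unbind(-1)
        feats = []
        for plane, (u, v) in zip(("xy", "xt", "yt"),
                                 ((x, y), (x, t), (y, t))):
            # Bilinearly sample the plane's grid at the projected coordinate.
            uv = torch.stack((u, v), dim=-1).view(1, -1, 1, 2)
            f = F.grid_sample(self.grids[plane], uv, align_corners=True)
            feats.append(f.squeeze(0).squeeze(-1).t())  # (N, ch)
        return self.mlp(torch.cat(feats, dim=-1))       # (N, 3)
```

Because the video content lives in these compact 2D grids rather than in the MLP weights, the fitted latents can subsequently be compressed with off-the-shelf image and video codecs, as described above.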

Comparison to baselines


Compute-efficiency


[Videos: Ground Truth / NeRV / Instant-ngp / NVP (ours) on ReadySetGo (left) and Jockey (right); red boxes mark the zoomed-in regions]

Reconstruction results after training each model for 1 minute or 5 minutes. The red boxes are zoomed in as the images to the right of each video. Our method, NVP, captures fine details in high-resolution videos with dynamic motion within a few minutes, such as the edges of fences filmed with a moving camera (ReadySetGo; left) and the legs of a running horse (Jockey; right). Remarkably, NVP requires 25-43 times fewer parameters than Instant-ngp yet achieves better reconstruction quality in a matter of minutes.

Parameter-efficiency


[Videos: Ground Truth / NeRV / NVP (ours) on YachtRide (left) and ShakeNDry (right); blue and red boxes mark the zoomed-in regions]

Reconstruction results with the number of parameters of each model restricted. The blue and red boxes are zoomed in as the images to the right of each video. FLIP, an evaluation metric, highlights the difference between the ground-truth and reconstructed images. As a succinct neural video representation, our method, NVP, does not suffer from artifacts where some pixels deviate significantly from the ground-truth images.

Video interpolation


[Videos: Ground Truth / NeRV / NVP (ours) on ReadySetGo and Bosphorus]

Interpolation results (8× FPS) from coordinate-based neural video representations. Our method shows consistent transitions along the temporal axis.
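Because a CNR is queried at continuous coordinates, temporal upsampling amounts to evaluating the fitted model at intermediate timestamps never seen during training. A hypothetical example using the LatentKeyframes sketch above (the frame count and resolution are arbitrary choices, not the experimental setup):

```python
import torch

# Hypothetical 8x temporal upsampling: query the fitted model at
# timestamps between the original frames.
model = LatentKeyframes()  # assume already fitted to a 30-frame video
H = W = 64                 # illustrative spatial resolution
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
frames = []
for t in torch.linspace(-1, 1, 8 * 30):  # 8x the original frame count
    coords = torch.stack((xs.flatten(), ys.flatten(), t.expand(H * W)), -1)
    with torch.no_grad():
        frames.append(model(coords).view(H, W, 3).clamp(0, 1))
```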

Video inpainting


[Videos: Frame 1 and Frame 2, each showing Ground truth / Mask / Inpainting result]

Inpainting results from our method (NVP), which successfully removes the masked car.
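This behavior follows from how a CNR is fit: supervise only the visible pixels, and the shared latent structure fills in the masked region. A hedged sketch under that assumption, reusing the LatentKeyframes sketch above (the random data is a placeholder, not the paper's pipeline):

```python
import torch
import torch.nn.functional as F

model = LatentKeyframes()
coords = torch.rand(4096, 3) * 2 - 1   # sampled (x, y, t) coordinates
target = torch.rand(4096, 3)           # ground-truth RGB at those points
masked = torch.rand(4096) < 0.1        # True where pixels are occluded
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    pred = model(coords[~masked])      # supervise visible pixels only
    loss = F.mse_loss(pred, target[~masked])
    opt.zero_grad(); loss.backward(); opt.step()

# After fitting, model(coords) also yields plausible RGB inside the mask,
# since the latent keyframes capture structure shared across the video.
```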

High-resolution videos


[Videos: NVP reconstruction after ~2 minutes and after ~12 hours of training]

Reconstruction results of NVP on 1200 frames at 3840×2160 resolution.

Acknowledgements


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project.