Scalable Neural Video Representations with Learnable Positional Features

NeurIPS 2022

Subin Kim*,1, Sihyun Yu*,1, Jaeho Lee2, Jinwoo Shin1

1 KAIST       2 POSTECH

* Denotes equal contribution
[Paper]            [Code]            [Poster]


[Video gallery: City Street, Surfing, FlowerFocus, SunBath, Twilight]

Abstract


Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to handle videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs while (b) achieving high-quality video encoding and (c) maintaining parameter efficiency. To meet all three requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture built on 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. Then, we propose to utilize existing powerful image and video codecs as a compute-/memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57, even while using more than 8 times fewer parameters. We also show intriguing properties of NVP, e.g., video inpainting and video frame interpolation.
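For intuition, below is a minimal PyTorch sketch of the latent-keyframe idea described above: three learnable 2D feature grids (one per xy, xt, and yt plane) are bilinearly sampled at the projections of a (x, y, t) coordinate, and a small MLP maps the concatenated features to RGB. The grid resolution, channel count, and MLP head are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKeyframes(nn.Module):
    """Minimal sketch: three learnable 2D latent keyframes (xy, xt, yt)."""

    def __init__(self, res=256, ch=16, hidden=128):
        super().__init__()
        # One learnable 2D feature grid per spatio-temporal plane.
        self.grids = nn.ParameterDict({
            p: nn.Parameter(0.01 * torch.randn(1, ch, res, res))
            for p in ("xy", "xt", "yt")
        })
        self.mlp = nn.Sequential(
            nn.Linear(3 * ch, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, coords):
        # coords: (N, 3) tensor of (x, y, t), each normalized to [-1, 1].
        x, y, t = coords.unbind(-1)
        feats = []
        for plane, (u, v) in zip(("xy", "xt", "yt"),
                                 ((x, y), (x, t), (y, t))):
            # Bilinearly sample the plane's grid at the projected coordinate.
            uv = torch.stack((u, v), dim=-1).view(1, -1, 1, 2)
            f = F.grid_sample(self.grids[plane], uv, align_corners=True)
            feats.append(f.squeeze(0).squeeze(-1).t())  # (N, ch)
        return self.mlp(torch.cat(feats, dim=-1))       # (N, 3)
```

Because the video content lives in these compact 2D grids rather than in the MLP weights, the fitted latents can subsequently be compressed with off-the-shelf image and video codecs, as described above.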

Comparison to baselines


Compute-efficiency


[Videos: Ground Truth / NeRV / Instant-ngp / NVP (ours) on ReadySetGo (left) and Jockey (right); red boxes mark the zoomed-in regions]

Reconstruction results after training each model for 1 minute or 5 minutes. The red boxes are zoomed in as the images to the right of each video. Our method, NVP, captures fine details in high-resolution videos with dynamic motion within a few minutes, such as the edges of fences filmed with a moving camera (ReadySetGo; left) and the legs of a running horse (Jockey; right). Remarkably, NVP requires 25-43 times fewer parameters than Instant-ngp yet achieves better reconstruction quality in a matter of minutes.

Parameter-efficiency


[Videos: Ground Truth / NeRV / NVP (ours) on YachtRide (left) and ShakeNDry (right); blue and red boxes mark the zoomed-in regions]

Reconstruction results with the number of parameters of each model restricted. The blue and red boxes are zoomed in as the images to the right of each video. FLIP, an evaluation metric, highlights the difference between the ground-truth and reconstructed images. As a succinct neural video representation, our method, NVP, does not suffer from artifacts where some pixels deviate significantly from the ground-truth images.

Video interpolation


[Videos: Ground Truth / NeRV / NVP (ours) on ReadySetGo and Bosphorus]

Interpolation results (8× FPS) from coordinate-based neural video representations. Our method shows consistent transitions along the temporal axis.
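Because a CNR is queried at continuous coordinates, temporal upsampling amounts to evaluating the fitted model at intermediate timestamps never seen during training. A hypothetical example using the LatentKeyframes sketch above (the frame count and resolution are arbitrary choices, not the experimental setup):

```python
import torch

# Hypothetical 8x temporal upsampling: query the fitted model at
# timestamps between the original frames.
model = LatentKeyframes()  # assume already fitted to a 30-frame video
H = W = 64                 # illustrative spatial resolution
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
frames = []
for t in torch.linspace(-1, 1, 8 * 30):  # 8x the original frame count
    coords = torch.stack((xs.flatten(), ys.flatten(), t.expand(H * W)), -1)
    with torch.no_grad():
        frames.append(model(coords).view(H, W, 3).clamp(0, 1))
```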

Video inpainting


[Videos: Frame 1 and Frame 2, each showing Ground truth / Mask / Inpainting result]

Inpainting results from our method (NVP), which successfully removes the masked car.
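This behavior follows from how a CNR is fit: supervise only the visible pixels, and the shared latent structure fills in the masked region. A hedged sketch under that assumption, reusing the LatentKeyframes sketch above (the random data is a placeholder, not the paper's pipeline):

```python
import torch
import torch.nn.functional as F

model = LatentKeyframes()
coords = torch.rand(4096, 3) * 2 - 1   # sampled (x, y, t) coordinates
target = torch.rand(4096, 3)           # ground-truth RGB at those points
masked = torch.rand(4096) < 0.1        # True where pixels are occluded
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    pred = model(coords[~masked])      # supervise visible pixels only
    loss = F.mse_loss(pred, target[~masked])
    opt.zero_grad(); loss.backward(); opt.step()

# After fitting, model(coords) also yields plausible RGB inside the mask,
# since the latent keyframes capture structure shared across the video.
```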

High-resolution videos


[Videos: NVP reconstruction after ~2 minutes and after ~12 hours of training]

Reconstruction results of NVP on 1200 frames at 3840×2160 resolution.

Acknowledgements


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project.