Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to better exploit the temporal coherence of videos. However, training existing tokenizers on long videos often incurs a huge training cost, as they are trained to reconstruct all frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x, y, t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens needed to encode long video clips. For instance, CoordTok can encode a 128-frame video at 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
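For intuition on the token count: the 1280-token figure for a 128-frame, 128$\times$128 video is consistent with, for example, a $16\times16$ content plane plus two $16\times32$ motion planes. These particular plane sizes are an illustrative assumption, not necessarily the exact configuration used by CoordTok:

$$
\underbrace{16 \times 16}_{xy\ \text{plane}} + \underbrace{16 \times 32}_{yt\ \text{plane}} + \underbrace{16 \times 32}_{xt\ \text{plane}} = 256 + 512 + 512 = 1280\ \text{tokens}.
$$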
We design our encoder to encode a video $\mathbf{x}$ into factorized triplane representations $\mathbf{z} = \left[\mathbf{z}^{xy}, \mathbf{z}^{yt}, \mathbf{z}^{xt}\right]$, which represent the video efficiently with three 2D latent planes. Given the triplane representations $\mathbf{z}$, our decoder learns a mapping from $(x, y, t)$ coordinates to the RGB pixels within the corresponding patches. In particular, we extract coordinate-based representations for $N$ sampled coordinates by querying the triplane representations at those coordinates via bilinear interpolation. The decoder then aggregates and fuses information across coordinates with self-attention layers and projects the outputs into the corresponding patches. This design enables us to train tokenizers on long videos in a compute-efficient manner by avoiding the reconstruction of entire frames at once.
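To make the decoding path concrete, below is a minimal PyTorch sketch of coordinate querying and patch decoding. The shapes, module sizes, and names (`query_triplane`, `CoordinateDecoder`, the per-plane channel dimension, the patch size) are illustrative assumptions rather than the exact CoordTok implementation, and coordinates are normalized to $[-1, 1]$ to match the `grid_sample` convention.

```python
# Minimal sketch of CoordTok-style coordinate querying and decoding.
# Shapes, plane sizes, and module names are illustrative assumptions,
# not the authors' exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def query_triplane(z_xy, z_yt, z_xt, coords):
    """Bilinearly sample the three latent planes at N (x, y, t) coordinates.

    z_xy: (B, C, H, W)  -- content plane indexed by (y, x)
    z_yt: (B, C, H, T)  -- motion plane indexed by (y, t)
    z_xt: (B, C, W, T)  -- motion plane indexed by (x, t)
    coords: (B, N, 3) with (x, y, t) in [-1, 1] (grid_sample convention)
    returns: (B, N, 3 * C) coordinate-based representations
    """
    x, y, t = coords[..., 0:1], coords[..., 1:2], coords[..., 2:3]

    def sample(plane, grid_uv):
        # grid_sample expects a (B, N, 1, 2) grid; last dim is (u, v) = (width, height).
        grid = grid_uv.unsqueeze(2)                       # (B, N, 1, 2)
        out = F.grid_sample(plane, grid, mode="bilinear", align_corners=True)
        return out.squeeze(-1).transpose(1, 2)            # (B, N, C)

    f_xy = sample(z_xy, torch.cat([x, y], dim=-1))        # width axis = x, height axis = y
    f_yt = sample(z_yt, torch.cat([t, y], dim=-1))        # width axis = t, height axis = y
    f_xt = sample(z_xt, torch.cat([t, x], dim=-1))        # width axis = t, height axis = x
    return torch.cat([f_xy, f_yt, f_xt], dim=-1)          # (B, N, 3C)


class CoordinateDecoder(nn.Module):
    """Fuses the N coordinate features with self-attention, then maps each
    feature to an RGB patch (patch_size x patch_size pixels)."""

    def __init__(self, latent_dim=256, width=512, depth=4, patch_size=8):
        super().__init__()
        self.proj_in = nn.Linear(3 * latent_dim, width)
        layer = nn.TransformerEncoderLayer(
            d_model=width, nhead=8, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_patch = nn.Linear(width, 3 * patch_size * patch_size)

    def forward(self, z_xy, z_yt, z_xt, coords):
        tokens = self.proj_in(query_triplane(z_xy, z_yt, z_xt, coords))
        tokens = self.blocks(tokens)      # self-attention across the sampled coordinates
        return self.to_patch(tokens)      # (B, N, 3 * patch_size * patch_size)
```

Because each training step only decodes the $N$ sampled patches rather than every frame, the per-step compute and memory cost is decoupled from the length of the input video.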
128-frame, 128$\times$128 resolution video reconstruction results from CoordTok (Ours) and baselines trained on the UCF-101 dataset.
Unconditional 128-frame, 128$\times$128 resolution video generation results from CoordTok-SiT-L/2 trained on 128-frame videos from the UCF-101 dataset.
Illustration of the factorized triplane representations of CoordTok trained on the UCF-101 dataset. We note that $\mathbf{z}^{xy}$ captures the global content that is shared across time, e.g., the layout and appearance of the scene or objects, while $\mathbf{z}^{yt}$ and $\mathbf{z}^{xt}$ capture the underlying motion in the video along the two spatial axes.
When the global content moves horizontally, the slope of the $x$ value with respect to the $t$-axis is steep in $\mathbf{z}^{xt}$. In contrast, the slope of the $y$ value with respect to the $t$-axis stays flat in $\mathbf{z}^{yt}$.
When the global content moves vertically, the slope of the $y$ value with respect to the $t$-axis is steep in $\mathbf{z}^{yt}$. In contrast, the slope of the $x$ value with respect to the $t$-axis stays flat in $\mathbf{z}^{xt}$.