Efficient tokenization of videos remains a challenge in training vision models that can process long videos.
One promising direction is to develop a tokenizer that can encode long video clips,
as this would enable the tokenizer to better leverage the temporal coherence of videos.
However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once.
In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos,
inspired by recent advances in 3D generative models.
In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x, y, t) coordinates.
We design our encoder to encode a video into factorized triplane representations, and our decoder to map each sampled (x, y, t) coordinate to the corresponding video patch.
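To make the coordinate-to-patch mapping concrete, below is a minimal PyTorch sketch of querying factorized triplanes at sampled (x, y, t) coordinates and decoding the queried features into patches. The tensor shapes, plane resolution, linear decoder, and random targets are illustrative assumptions rather than the actual CoordTok implementation; the point is that only patches at the sampled coordinates are reconstructed, so the cost does not require decoding every frame at once.

```python
import torch
import torch.nn.functional as F

def query_triplanes(planes: dict, coords: torch.Tensor) -> torch.Tensor:
    """Gather features for (x, y, t) coordinates from factorized triplanes.

    planes: {'xy', 'xt', 'yt'} -> latent planes of shape [B, C, R, R]
    coords: [B, N, 3] with (x, y, t) normalized to [-1, 1]
    returns: concatenated plane features of shape [B, N, 3C]
    """
    x, y, t = coords.unbind(dim=-1)
    feats = []
    for name, (u, v) in (("xy", (x, y)), ("xt", (x, t)), ("yt", (y, t))):
        grid = torch.stack([u, v], dim=-1).unsqueeze(2)      # [B, N, 1, 2]
        s = F.grid_sample(planes[name], grid, mode="bilinear",
                          align_corners=True)                # [B, C, N, 1]
        feats.append(s.squeeze(-1).transpose(1, 2))          # [B, N, C]
    return torch.cat(feats, dim=-1)

# Toy usage: random triplanes stand in for the encoder output, a linear
# "decoder" maps queried features to flattened RGB patches, and the MSE
# loss is computed only on the sampled patches.
B, C, R, N, patch_dim = 2, 8, 32, 1024, 4 * 4 * 3
planes = {k: torch.randn(B, C, R, R) for k in ("xy", "xt", "yt")}
coords = torch.rand(B, N, 3) * 2 - 1                         # sampled (x, y, t)
decoder = torch.nn.Linear(3 * C, patch_dim)
pred = decoder(query_triplanes(planes, coords))              # [B, N, patch_dim]
target = torch.randn(B, N, patch_dim)   # placeholder for ground-truth patches
loss = F.mse_loss(pred, target)
```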
128-frame, 128×128 resolution video reconstruction results.
Unconditional 128-frame, 128×128 resolution video generation results.
Illustration of the factorized triplane representations of CoordTok trained on the UCF-101 dataset.
We note that the x-t and y-t planes capture how the video content moves over time:
when the global content is moving horizontally, the slope of the content in the x-t plane reflects that motion,
and when the global content is moving vertically, the slope of the content in the y-t plane reflects it.
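To see why motion appears as a slope in these planes, the toy sketch below (plain tensors, not the learned CoordTok representations) projects a synthetic video of a horizontally moving square onto the (x, t) plane; the bright streak shifts by a fixed amount per frame, i.e. it forms a slope proportional to the horizontal velocity.

```python
import torch

# Synthetic 16-frame video of a bright square moving left to right.
T, H, W = 16, 32, 32
video = torch.zeros(T, H, W)
for t in range(T):
    x0 = 2 * t                       # the square moves 2 pixels per frame
    video[t, 12:20, x0:x0 + 4] = 1.0

# Project onto the (x, t) plane by averaging over the y axis.
xt_view = video.mean(dim=1)          # [T, W]: rows are time, columns are x
# The brightest column shifts right as t increases, so the streak in the
# x-t view has a slope set by the horizontal velocity; vertical motion
# would instead show up in the analogous y-t view (video.mean(dim=2)).
print(xt_view.argmax(dim=1))         # increasing indices trace out the slope
```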