Research · Jul 4, 2026

Apple proposes VideoFlexTok for flexible-length, coarse-to-fine video tokenization

New tokenizer reduces token count by up to 8x compared to 3D grid baselines while maintaining generation quality, enabling longer video generation with lower compute.


TL;DR

Apple’s ML Research team introduces VideoFlexTok, a video tokenizer that outputs variable-length, coarse-to-fine token sequences instead of fixed 3D grids.
On text-to-video tasks, VideoFlexTok achieves comparable quality with a 5x smaller model (1.1B vs 5.2B parameters) using fewer tokens.
The method enables training on 10-second, 81-frame videos using only 672 tokens, 8x fewer than a comparable 3D grid tokenizer.

Apple’s Machine Learning Research team describes VideoFlexTok, a video tokenizer that represents videos as variable-length sequences of tokens organized in a coarse-to-fine hierarchy. The approach departs from the de facto standard of fixed 3D grid tokenization, where each token captures local spatiotemporal details regardless of the video’s complexity. VideoFlexTok’s first tokens capture abstract information such as semantics and motion, while subsequent tokens add fine-grained details, enabling downstream models to focus on higher-level structure when needed.
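The practical consequence of a coarse-to-fine ordering is that any prefix of the token sequence is itself a usable, lower-detail representation. The toy sketch below illustrates only that prefix property. The function name `take_prefix` and the token IDs are hypothetical and are not taken from the VideoFlexTok paper or code.

```python
from typing import List, Sequence


def take_prefix(tokens: Sequence[int], budget: int) -> List[int]:
    """Keep the first `budget` tokens of a coarse-to-fine ordered sequence."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return list(tokens[:budget])


# Hypothetical token IDs, ordered from coarse (semantics, motion)
# to fine (local detail). Real tokenizer output would differ.
video_tokens = [17, 4, 92, 51, 8, 63, 29, 70]

# A small budget keeps only the abstract, high-level tokens.
coarse = take_prefix(video_tokens, 2)

# The full budget keeps every token, including fine-grained detail.
full = take_prefix(video_tokens, len(video_tokens))

print(coarse)
print(full)
```

Under this assumption, a downstream model that needs only high-level structure can train on short prefixes, while a model that needs detail consumes the longer sequence.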

The authors report that VideoFlexTok supports realistic video reconstructions from any token count via a generative flow decoder. On class-to-video and text-to-video generation tasks, they report more efficient training than with 3D grid tokens. Specifically, VideoFlexTok achieves comparable generation quality, measured by gFVD and ViCLIP Score, with a 5x smaller model (1.1B vs 5.2B parameters).

The team demonstrates the method’s scalability by training a text-to-video model on 10-second videos composed of 81 frames using only 672 tokens, which is 8x fewer than a comparable 3D grid tokenizer. This reduction in token count lowers memory and compute requirements, making long video generation more practical under fixed compute budgets.
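The arithmetic behind these figures can be checked with a short calculation. The 3D grid token count below is derived from the reported 8x ratio rather than quoted from the paper, and the attention estimate assumes the downstream generator uses full self-attention, which the article does not state.

```python
# Figures quoted in the article.
flex_tokens = 672   # VideoFlexTok tokens for a 10-second, 81-frame video
reduction = 8       # reported reduction vs. a comparable 3D grid tokenizer
frames = 81

# Implied 3D grid token count (derived here, not quoted in the article).
grid_tokens = flex_tokens * reduction

# Average number of tokens per frame under each scheme.
flex_per_frame = flex_tokens / frames
grid_per_frame = grid_tokens / frames

# Assumption: full self-attention, whose pairwise cost grows with the
# square of the sequence length.
attention_ratio = (grid_tokens / flex_tokens) ** 2

print(grid_tokens)               # 5376
print(round(flex_per_frame, 1))  # 8.3
print(round(grid_per_frame, 1))  # 66.4
print(attention_ratio)           # 64.0
```

If the self-attention assumption holds, an 8x shorter sequence means roughly 64x fewer pairwise attention interactions per layer, which is one reason a lower token count can matter more than the raw ratio suggests.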

VideoFlexTok is positioned within a broader line of Apple ML Research work on flexible tokenization. Related efforts include TrajTok, which decouples video duration from token count using trajectory-based tokens, and FlexTok, which resamples images into 1D token sequences of flexible length. The VideoFlexTok paper lists authors from Apple and the Swiss Federal Institute of Technology Lausanne (EPFL).

Sources

01 Apple, Machine Learning Research, "VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization"
