Research · Jul 4, 2026

Apple proposes VideoFlexTok for flexible-length, coarse-to-fine video tokenization

New tokenizer reduces token count by up to 8x compared to 3D grid baselines while maintaining generation quality, enabling longer video generation with lower compute.

Trust: 79
Hype: Low hype

1 source · cross-referenced

TL;DR
  • Apple’s ML Research team introduces VideoFlexTok, a video tokenizer that outputs variable-length, coarse-to-fine token sequences instead of fixed 3D grids.
  • On text-to-video tasks, the authors report that VideoFlexTok reaches comparable quality with a roughly 5x smaller model (1.1B vs 5.2B parameters) while using fewer tokens.
  • The method enables training on 10-second, 81-frame videos using only 672 tokens, which is 8x fewer than a comparable 3D grid tokenizer.

Apple’s Machine Learning Research team describes VideoFlexTok, a video tokenizer that represents videos as variable-length sequences of tokens organized in a coarse-to-fine hierarchy. The approach departs from the de facto standard of fixed 3D grid tokenization, where each token captures local spatiotemporal details regardless of the video’s complexity. VideoFlexTok’s first tokens capture abstract information such as semantics and motion, while subsequent tokens add fine-grained details, enabling downstream models to focus on higher-level structure when needed.
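The coarse-to-fine ordering described above means that any prefix of the token sequence is itself a usable, coarser representation. The following is a minimal sketch of that idea only, not the authors' code: the function names `truncate_tokens` and `nested_prefixes` and the integer token IDs are hypothetical, and a real tokenizer would pair each prefix with a decoder.

```python
from typing import Dict, List, Sequence


def truncate_tokens(tokens: Sequence[int], budget: int) -> List[int]:
    """Keep the first `budget` tokens of a coarse-to-fine ordered sequence.

    Assumption: earlier tokens carry coarser information, so a prefix is a
    valid lower-detail representation of the same video.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    return list(tokens[:budget])


def nested_prefixes(tokens: Sequence[int], budgets: Sequence[int]) -> Dict[int, List[int]]:
    """Build one prefix per token budget; each smaller prefix is contained in the larger ones."""
    return {budget: truncate_tokens(tokens, budget) for budget in budgets}


# Hypothetical token IDs standing in for one tokenized video.
video_tokens = [17, 4, 92, 8, 55, 31, 70, 2]
views = nested_prefixes(video_tokens, budgets=[1, 4, 8])
```

Here `views[1]` would hold only the coarsest token, while `views[8]` holds the full sequence. Contrast this with a fixed 3D grid, where dropping tokens removes whole spatiotemporal regions rather than detail.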

The authors report that VideoFlexTok supports realistic video reconstructions from any token count via a generative flow decoder. On class-to-video and text-to-video generation tasks, they show that the tokenizer leads to more efficient training than 3D grid tokens. Specifically, VideoFlexTok reaches comparable generation quality, as measured by gFVD and ViCLIP Score, with a roughly 5x smaller model (1.1B vs 5.2B parameters).

The team demonstrates the method’s scalability by training a text-to-video model on 10-second videos composed of 81 frames using only 672 tokens, which is 8x fewer than a comparable 3D grid tokenizer. This reduction in token count lowers memory and compute requirements, making long video generation more practical under fixed compute budgets.
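To make the scale of that reduction concrete, the short calculation below works through the quoted figures. It is a back-of-the-envelope sketch under two assumptions that the article does not state: that "8x fewer" is exact (which implies a baseline of 5,376 tokens), and that the downstream model uses dense self-attention, whose pairwise cost grows with the square of the sequence length. The function names are illustrative.

```python
# Figures quoted in the article.
FLEX_TOKENS = 672   # tokens used for one 10-second clip
NUM_FRAMES = 81     # frames in that clip
REDUCTION = 8       # "8x fewer" than the 3D grid baseline, treated as exact here


def implied_grid_tokens(flex_tokens: int, reduction: int) -> int:
    """Token count of the 3D grid baseline implied by the stated reduction."""
    return flex_tokens * reduction


def dense_attention_pair_ratio(grid_tokens: int, flex_tokens: int) -> float:
    """Ratio of token-pair interactions, assuming dense self-attention (quadratic in length)."""
    return (grid_tokens / flex_tokens) ** 2


grid_tokens = implied_grid_tokens(FLEX_TOKENS, REDUCTION)
tokens_per_frame = FLEX_TOKENS / NUM_FRAMES
pair_ratio = dense_attention_pair_ratio(grid_tokens, FLEX_TOKENS)

print(grid_tokens)                  # 5376
print(round(tokens_per_frame, 1))   # 8.3
print(pair_ratio)                   # 64.0
```

Under these assumptions, an 8x shorter sequence would mean about 64x fewer attention pairs per layer, which is one way to read the claim about lower memory and compute. Actual savings depend on the architecture and are not reported in this form by the source.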

VideoFlexTok is positioned within a broader line of Apple ML Research work on flexible tokenization. Related efforts include TrajTok, which decouples video duration from token count using trajectory-based tokens, and FlexTok, which resamples images into 1D token sequences of flexible length. The VideoFlexTok paper lists authors from Apple and the Swiss Federal Institute of Technology Lausanne (EPFL).

Sources
  1. Apple Machine Learning Research, "VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
