Apple researchers propose a normalizing-flow model for video generation as an alternative to diffusion
STARFlow-V applies likelihood-based generative modeling to video synthesis, with native support for text-to-video, image-to-video, and video-to-video tasks.
- Apple Machine Learning Research published STARFlow-V, a normalizing flow-based video generation model accepted to CVPR (April 2026).
- The model operates in spatiotemporal latent space using a global-local architecture designed to reduce error accumulation over time.
- STARFlow-V supports multiple conditional generation modes (text-to-video, image-to-video, video-to-video) within a single invertible framework.
- The researchers introduced flow-score matching, a lightweight denoiser mechanism to improve autoregressive video generation consistency.
- The approach achieves comparable visual fidelity and temporal consistency to diffusion-based baselines while offering native likelihood estimation.
Apple's Machine Learning Research team has published STARFlow-V, a generative video model that revisits normalizing flows as an alternative to the diffusion-based systems that currently dominate the field. The model was accepted to the 2026 CVPR conference. While normalizing flows have seen renewed attention in image generation, video synthesis has remained largely the domain of diffusion models because of the computational cost and modeling complexity of spatiotemporal data. STARFlow-V addresses this by building on the earlier STARFlow architecture and adapting it for video with a two-level design that separates global latent dependencies from local frame-level interactions.
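The likelihood-based property that distinguishes flows from diffusion models comes from the change-of-variables rule: an invertible map gives exact log-likelihoods, and inversion recovers the input exactly. The sketch below illustrates this with a toy elementwise affine flow; it is a minimal stand-in for intuition, not STARFlow-V's architecture.

```python
import numpy as np

# Change-of-variables rule behind normalizing flows: an invertible map
# z = f(x) gives log p(x) = log p_base(f(x)) + log|det J_f(x)|.
# The elementwise affine transform here is a toy example of such a map.

def affine_forward(x, scale, shift):
    """Invertible transform z = x * scale + shift, with log|det Jacobian|."""
    z = x * scale + shift
    log_det = np.sum(np.log(np.abs(scale)))  # diagonal Jacobian
    return z, log_det

def affine_inverse(z, scale, shift):
    """Exact inverse of affine_forward."""
    return (z - shift) / scale

def log_likelihood(x, scale, shift):
    """Exact log p(x) under a standard-normal base distribution."""
    z, log_det = affine_forward(x, scale, shift)
    log_base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
    return log_base + log_det

x = np.array([0.5, -1.0, 2.0])
scale = np.array([2.0, 0.5, 1.5])
shift = np.array([0.1, 0.0, -0.2])

z, _ = affine_forward(x, scale, shift)
x_rec = affine_inverse(z, scale, shift)
print(np.allclose(x, x_rec))  # True: invertibility gives exact reconstruction
print(log_likelihood(x, scale, shift))
```

Because every transform in a flow is invertible, the same trained network can be run forward for density evaluation and backward for sampling, which is what enables the native likelihood estimation the article mentions.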
The architecture mitigates a known challenge in autoregressive video generation: error accumulation over time. By restricting causal dependencies to a global latent space while preserving rich interactions within frames, the model reduces the compounding effect of per-frame prediction errors. The researchers also introduced flow-score matching, a lightweight mechanism that attaches a causal denoiser to the flow to improve consistency in autoregressive generation, a common bottleneck in sequential video synthesis.
A key computational contribution is a video-aware Jacobi iteration scheme that reformulates inner-loop updates as parallelizable operations without violating temporal causality, improving sampling efficiency relative to prior approaches. Because the underlying structure is invertible, the same trained model natively supports multiple conditional generation tasks (text-to-video, image-to-video, and video-to-video) without task-specific fine-tuning. The research team reports that STARFlow-V achieves visual fidelity and temporal consistency comparable to diffusion baselines while providing native likelihood estimation, a property unavailable in most contemporary video generators.
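The general idea behind Jacobi-style parallelization of a causal recurrence can be sketched as follows. The update function `f` is a hypothetical contractive toy, not the model's actual inner loop: instead of computing timesteps one by one, all positions are updated simultaneously from the previous iterate, and causality guarantees the result matches the sequential rollout after at most T sweeps.

```python
import numpy as np

# Jacobi fixed-point iteration for a causal recurrence x[t] = f(x[t-1]).
# Each sweep updates every timestep in parallel from the previous iterate;
# after sweep k, the first k positions are exact, so T sweeps suffice.

def f(prev):
    """Contractive toy update (hypothetical stand-in for the real step)."""
    return 0.5 * prev + 1.0

def sequential(x0, n_steps):
    """Reference: compute the recurrence one step at a time."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(f(xs[-1]))
    return np.array(xs[1:])

def jacobi(x0, n_steps, sweeps):
    """Parallel: update all timesteps at once per sweep."""
    xs = np.zeros(n_steps)                       # initial guess for all steps
    for _ in range(sweeps):
        prev = np.concatenate(([x0], xs[:-1]))   # each step sees x[t-1]
        xs = f(prev)                             # vectorized parallel update
    return xs

seq = sequential(0.0, 8)
par = jacobi(0.0, 8, sweeps=8)
print(np.allclose(seq, par))  # True: Jacobi matches the sequential rollout
```

For contractive updates the iteration typically converges in far fewer than T sweeps, which is where the sampling-efficiency gain comes from; the "video-aware" scheduling in STARFlow-V is a refinement of this basic pattern whose details the summary does not specify.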
- May 2, 2026 · Apple — Machine Learning Research