Tools · May 18, 2026

Hugging Face releases fine-tuning guide for NVIDIA Cosmos video model using LoRA and DoRA

A new tutorial demonstrates how to adapt NVIDIA's 2B-parameter video generation model for domain-specific tasks like robot manipulation using parameter-efficient adapters, reducing GPU memory requirements and enabling single-GPU training.

Trust: 74
Hype: Low

1 source · cross-referenced

TL;DR
  • Hugging Face published a guide for fine-tuning NVIDIA Cosmos Predict 2.5 using LoRA and DoRA adapters with the diffusers library.
  • The method freezes the base model weights and injects trainable adapters into the transformer's attention and feedforward layers, reducing memory overhead.
  • The tutorial includes code examples for training on robot manipulation videos (92 videos with text prompts) and inference with synthetic trajectory generation.
  • The approach uses rectified flow loss and supports both single-GPU and multi-GPU training configurations.
  • Fine-tuned adapters remain small and portable, enabling flexible domain-switching at inference time.

Hugging Face has published a comprehensive tutorial for parameter-efficient fine-tuning of NVIDIA's Cosmos Predict 2.5 video generation model, focusing on applications in robot manipulation and trajectory synthesis. The guide details how to use LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation) modules to adapt the 2B-parameter model to specific domains while keeping the base model frozen.

The method injects trainable adapters into the transformer's attention projections and feedforward layers. Because gradients and optimizer state are kept only for the adapter parameters, memory consumption drops well below that of full fine-tuning. According to the guide, this approach is practical enough to run on a single 80 GB GPU, with optional multi-GPU support for faster iteration. LoRA parameters are upcast to float32 during training to maintain numerical stability under mixed precision. DoRA additionally decomposes each adapted weight into a magnitude component and a direction component, applying the low-rank update to the direction and training the magnitude separately.
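The difference between the two adapter types can be illustrated with a small NumPy sketch. This is not the guide's code: the layer sizes, rank, and scaling factor are hypothetical, and normalizing per output row is an assumed convention chosen for illustration. It shows that both adapters leave the frozen layer's output unchanged at initialization, and how few parameters they train.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 48, 8, 16.0  # hypothetical sizes, not from the guide

W0 = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, rank))                   # trainable, zero init so the update starts at 0
scale = alpha / rank


def lora_weight(W0, A, B, scale):
    """Effective weight under LoRA: frozen W0 plus a scaled low-rank update."""
    return W0 + scale * (B @ A)


def dora_weight(W0, A, B, scale, m):
    """Effective weight under DoRA: unit-norm direction rescaled by magnitude m."""
    V = W0 + scale * (B @ A)                             # updated direction (unnormalized)
    row_norm = np.linalg.norm(V, axis=1, keepdims=True)  # one norm per output row (assumed convention)
    return m * V / row_norm


# Trainable magnitude vector, initialized from the norms of the frozen weight.
m = np.linalg.norm(W0, axis=1, keepdims=True)

x = rng.standard_normal((4, d_in))  # a small batch of inputs
base_out = x @ W0.T
lora_out = x @ lora_weight(W0, A, B, scale).T
dora_out = x @ dora_weight(W0, A, B, scale, m).T

full_params = W0.size
lora_params = A.size + B.size
dora_params = lora_params + m.size
```

With these toy sizes the frozen layer has 3,072 weights, while LoRA trains 896 parameters and DoRA trains 944 (the extra 48 are the magnitude vector). The same ratio is what keeps saved adapter files small relative to the base model.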

The tutorial uses robot manipulation video data as the reference training dataset: 92 labeled videos with text prompts describing pick-and-place tasks, with 50 text-image pairs reserved for evaluation. The model conditions generation on both text prompts and initial frame images, predicting subsequent video frames via rectified flow, a velocity-prediction formulation that interpolates between noise and clean data at sampled noise levels.
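A minimal sketch of the rectified flow objective described above, written in NumPy on toy arrays. It assumes a common convention (t = 0 is clean data, t = 1 is pure noise, target velocity = noise minus clean); the guide's exact parameterization, noise-level sampling, and tensor shapes may differ, and the function names are hypothetical.

```python
import numpy as np


def rectified_flow_batch(clean, noise, t):
    """Build noisy inputs and velocity targets for one training batch.

    clean, noise: arrays of shape (batch, ...).
    t: array of shape (batch,) with noise levels in [0, 1].
    """
    t = t.reshape(-1, *([1] * (clean.ndim - 1)))  # broadcast t over non-batch axes
    noisy = (1.0 - t) * clean + t * noise         # straight-line interpolation
    target_velocity = noise - clean               # constant velocity along that line
    return noisy, target_velocity


def velocity_mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
clean = rng.standard_normal((2, 3, 4, 4))  # toy latent video: batch, frames, height, width
noise = rng.standard_normal(clean.shape)
t = rng.uniform(0.0, 1.0, size=2)          # one sampled noise level per example

noisy, target = rectified_flow_batch(clean, noise, t)

# With a perfect velocity prediction, one straight step from the noisy
# sample back along the line recovers the clean sample exactly.
recovered = noisy - t.reshape(-1, 1, 1, 1) * target
```

In training, `velocity_mse` would be applied to the network's prediction for `noisy` at level `t`, with gradients flowing only into the adapter parameters.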

Implementation details cover dataset preparation, loss computation, optimizer configuration (AdamW with linear warmup and decay scheduling), and inference pipelines. Fine-tuned adapters are kept small and portable, enabling practitioners to swap domain-specific adapters without modifying the base model, a useful property for managing multiple specialized use cases from a single foundation model.
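The warmup-and-decay schedule mentioned above can be sketched in plain Python. The article does not give the decay shape or hyperparameters, so this assumes linear decay to zero, and the learning rate and step counts are hypothetical placeholders.

```python
def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate with linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / max(warmup_steps, 1)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical settings for illustration only.
BASE_LR, WARMUP, TOTAL = 1e-4, 100, 1000
schedule = [lr_at_step(s, BASE_LR, WARMUP, TOTAL) for s in range(TOTAL + 1)]
```

Here the rate peaks at `BASE_LR` on step 100 and returns to zero at step 1000. In a PyTorch training loop, a function of this shape would typically be wrapped in a scheduler attached to the AdamW optimizer that updates only the adapter parameters.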

Sources
  1. Hugging Face, "Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.