Hugging Face releases fine-tuning guide for NVIDIA Cosmos video model using LoRA and DoRA
A new tutorial demonstrates how to adapt NVIDIA's 2B-parameter video generation model for domain-specific tasks like robot manipulation using parameter-efficient adapters, reducing GPU memory requirements and enabling single-GPU training.
1 source · cross-referenced
- Hugging Face published a guide for fine-tuning NVIDIA Cosmos Predict 2.5 using LoRA and DoRA adapters with the diffusers library.
- The method freezes the base model weights and injects trainable adapters into the transformer's attention and feedforward layers, reducing memory overhead.
- The tutorial includes code examples for training on robot manipulation videos (92 videos with text prompts) and inference with synthetic trajectory generation.
- The approach uses rectified flow loss and supports both single-GPU and multi-GPU training configurations.
- Fine-tuned adapters remain small and portable, enabling flexible domain-switching at inference time.
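To make the domain-switching point concrete, here is a minimal NumPy sketch of one frozen weight matrix shared by several small low-rank adapters. It is an illustration rather than the guide's code: the sizes, the `forward` helper, and the `"driving"` domain are hypothetical, and a real pipeline would load adapter files through its library's own loading mechanism.

```python
import numpy as np

d, rank = 32, 2
rng = np.random.default_rng(0)
W_base = rng.standard_normal((d, d))  # shared base weight, never modified


def make_adapter(seed):
    """Build a stand-in for a trained low-rank adapter (random values here)."""
    r = np.random.default_rng(seed)
    return {
        "A": r.standard_normal((rank, d)) * 0.1,  # down-projection
        "B": r.standard_normal((d, rank)) * 0.1,  # up-projection
    }


# Hypothetical domain names; each adapter stores 2 * rank * d numbers.
adapters = {"robot_arm": make_adapter(1), "driving": make_adapter(2)}


def forward(x, adapter_name=None):
    """Apply the base layer, plus the selected adapter's low-rank update."""
    y = x @ W_base.T
    if adapter_name is not None:
        ad = adapters[adapter_name]
        y = y + (x @ ad["A"].T) @ ad["B"].T
    return y
```

Switching domains in this sketch only changes which `(A, B)` pair is looked up. The base weight stays untouched, and each adapter holds `2 * rank * d = 128` values against `d * d = 1024` for the base matrix.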
Hugging Face has published a tutorial for parameter-efficient fine-tuning of NVIDIA's Cosmos Predict 2.5 video generation model, focusing on applications in robot manipulation and trajectory synthesis. The guide details how to use LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation) modules to adapt the 2B-parameter model to specific domains while keeping the base model frozen.
The method injects trainable adapters into the transformer's attention projections and feedforward layers. Because gradients and optimizer state are kept only for the adapter parameters, memory consumption is much lower than with full fine-tuning. According to the guide, this approach is practical enough to run on a single 80 GB GPU, with optional multi-GPU support for faster iteration. LoRA parameters are upcast to float32 during training to maintain numerical stability under mixed precision. DoRA additionally decomposes each adapted weight into a magnitude component and a direction component, applying the low-rank update to the direction and learning the magnitude separately.
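As a rough illustration of the adapter mechanics described above, the following NumPy sketch runs a LoRA and a DoRA forward pass on a single frozen weight matrix. It is not the guide's code: the dimensions, the `alpha / rank` scaling, the zero initialization of `B`, and the per-output-row norm in the DoRA variant are assumptions drawn from common practice, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 48, 4, 8
scale = alpha / rank  # assumed LoRA scaling convention

W = rng.standard_normal((d_out, d_in))         # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero init


def lora_forward(x, W, A, B, scale):
    """Base output plus the scaled low-rank update, without forming B @ A."""
    return x @ W.T + scale * (x @ A.T) @ B.T


def dora_forward(x, W, A, B, scale, m):
    """Normalize the updated weight per output row, then rescale by magnitude m."""
    V = W + scale * (B @ A)                           # direction before normalization
    norm = np.linalg.norm(V, axis=1, keepdims=True)   # shape (d_out, 1)
    return x @ (m * V / norm).T


# Trainable magnitude, initialized to the row norms of the frozen weight.
m = np.linalg.norm(W, axis=1, keepdims=True)

x = rng.standard_normal((2, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, scale)
dora_out = dora_forward(x, W, A, B, scale, m)

trainable_lora = A.size + B.size   # rank * (d_in + d_out)
full_params = W.size               # d_in * d_out
```

With `B` initialized to zero, both variants reproduce the frozen layer's output exactly at the start of training, so fine-tuning begins from the base model's behavior. In this toy configuration the LoRA adapter has 448 trainable values against 3,072 in the full matrix.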
The tutorial uses robot manipulation video data as the reference training dataset: 92 labeled videos with text prompts describing pick-and-place tasks, with 50 text-image pairs reserved for evaluation. The model conditions generation on both text prompts and initial frame images, predicting subsequent video frames via rectified flow, a velocity-prediction formulation that interpolates between noise and clean data at sampled noise levels.
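The rectified flow objective can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the guide's implementation: it assumes noise level 0 means clean data and 1 means pure noise, samples the level uniformly, and uses `noise - x0` as the velocity target. The guide may use a different sign convention or sampling distribution, and the latent shape and function names below are hypothetical.

```python
import numpy as np


def make_training_pair(x0, rng):
    """Interpolate clean data toward noise and return the velocity target."""
    batch = x0.shape[0]
    # One noise level per sample, broadcast over the remaining dimensions.
    sigma = rng.uniform(0.0, 1.0, size=(batch,) + (1,) * (x0.ndim - 1))
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - sigma) * x0 + sigma * noise   # point on the straight path
    target = noise - x0                        # constant velocity along that path
    return x_t, sigma, target


def rectified_flow_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))


rng = np.random.default_rng(0)
# Hypothetical latent video batch: (batch, frames, channels, height, width).
x0 = rng.standard_normal((2, 4, 3, 8, 8))
x_t, sigma, target = make_training_pair(x0, rng)
```

In training, a model would receive `x_t`, the noise level, and the text and image conditioning, and its output would take the place of `pred_velocity`. Under this convention, stepping back along the true velocity recovers the clean sample: `x_t - sigma * target` equals `x0`.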
Implementation details cover dataset preparation, loss computation, optimizer configuration (AdamW with linear warmup and decay scheduling), and inference pipelines. Fine-tuned adapters are kept small and portable, enabling practitioners to swap domain-specific adapters without modifying the base model, a useful property for managing multiple specialized use cases from a single foundation model.
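The warmup-and-decay schedule mentioned above can be expressed as a multiplier on the optimizer's base learning rate. The sketch below assumes linear warmup followed by linear decay to zero; the summary does not state the decay shape or the step counts, so `lr_multiplier` and the values used with it are hypothetical.

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Return a factor in [0, 1] to multiply with the base learning rate."""
    if step < warmup_steps:
        # Linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Linear decay from 1.0 at the end of warmup to 0.0 at total_steps.
    remaining = max(0, total_steps - step)
    return remaining / max(1, total_steps - warmup_steps)


# Illustrative values only: 10 warmup steps out of 100, base rate 1e-4.
base_lr = 1e-4
schedule = [base_lr * lr_multiplier(s, 10, 100) for s in range(101)]
```

A function of this shape, returning a multiplier per step, is the kind of callable that PyTorch's `LambdaLR` scheduler accepts alongside an AdamW optimizer.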