Tools · May 18, 2026

Hugging Face releases fine-tuning guide for NVIDIA Cosmos video model using LoRA and DoRA

A new tutorial demonstrates how to adapt NVIDIA's 2B-parameter video generation model for domain-specific tasks like robot manipulation using parameter-efficient adapters, reducing GPU memory requirements and enabling single-GPU training.

TL;DR

Hugging Face published a guide for fine-tuning NVIDIA Cosmos Predict 2.5 using LoRA and DoRA adapters with the diffusers library.
The method freezes the base model weights and injects trainable adapters into the transformer's attention and feedforward layers, reducing memory overhead.
The tutorial includes code examples for training on robot manipulation videos (92 videos with text prompts) and inference with synthetic trajectory generation.
The approach uses rectified flow loss and supports both single-GPU and multi-GPU training configurations.
Fine-tuned adapters remain small and portable, enabling flexible domain-switching at inference time.

Hugging Face has published a tutorial for parameter-efficient fine-tuning of NVIDIA's Cosmos Predict 2.5 video generation model, focusing on applications in robot manipulation and trajectory synthesis. The guide details how to use LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation) modules to adapt the 2B-parameter model to specific domains while keeping the base model frozen.

The method injects trainable adapters into the transformer's attention projections and feedforward layers. Because only the adapter weights receive gradients and optimizer state, memory consumption is much lower than with full fine-tuning. According to the guide, this approach is practical enough to run on a single 80 GB GPU, with optional multi-GPU support for faster iteration. LoRA parameters are upcast to float32 during training to maintain numerical stability under mixed precision. DoRA additionally decomposes each weight into a magnitude component and a direction component, applying the low-rank update to the direction and training the magnitude separately.
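The adapter arithmetic can be illustrated with a minimal pure-Python sketch. It is not the guide's code: the function names, the toy dimensions, the `alpha / rank` scaling, and the choice to take DoRA norms per output row are all assumptions made for illustration.

```python
import math
import random

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_norms(w):
    """Euclidean norm of each row (one norm per output feature, an assumed convention)."""
    return [math.sqrt(sum(v * v for v in row)) for row in w]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A; W stays frozen, only A and B train."""
    scale = alpha / len(a)  # len(a) is the rank r
    delta = matmul(b, a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def dora_weight(w, a, b, alpha, magnitude):
    """Normalize each row of the LoRA-updated weight, then rescale by a trainable magnitude."""
    v = lora_weight(w, a, b, alpha)
    return [[m * x / n for x in row]
            for row, n, m in zip(v, row_norms(v), magnitude)]

random.seed(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, far smaller than a real projection layer
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init, so training starts from the base model
magnitude = row_norms(w)                  # DoRA magnitude initialized from the frozen weight

full_params = d_out * d_in           # parameters updated by full fine-tuning
lora_params = rank * (d_in + d_out)  # parameters updated by the adapter
```

With `b` initialized to zero, both `lora_weight` and `dora_weight` reproduce the frozen weight, so the adapted model behaves like the base model at step zero. The trainable parameter count grows with `rank * (d_in + d_out)` rather than `d_out * d_in`, which is where the savings come from at realistic layer sizes.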

The tutorial uses robot manipulation video data as the reference training dataset: 92 labeled videos with text prompts describing pick-and-place tasks, plus 50 text-image pairs reserved for evaluation. The model conditions generation on both text prompts and initial frame images, predicting subsequent video frames via rectified flow, a velocity-prediction formulation that interpolates between noise and clean data at sampled noise levels.
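A rectified flow objective of this kind can be sketched in a few lines of pure Python. The sign and time conventions here (t = 0 is clean data, t = 1 is pure noise, target velocity is noise minus clean) are a common choice but an assumption, as are the function names and the flat lists standing in for video latents; the guide's own implementation may differ.

```python
import random

def interpolate(clean, noise, t):
    """Point on the straight line between clean data (t=0) and noise (t=1)."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]

def velocity_target(clean, noise):
    """Constant velocity along that line under the assumed convention."""
    return [n - c for c, n in zip(clean, noise)]

def rectified_flow_loss(predict_velocity, clean, noise, t):
    """Mean squared error between predicted and target velocity at noise level t."""
    x_t = interpolate(clean, noise, t)
    pred = predict_velocity(x_t, t)
    target = velocity_target(clean, noise)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

random.seed(0)
clean = [random.gauss(0, 1) for _ in range(16)]  # stand-in for encoded video latents
noise = [random.gauss(0, 1) for _ in range(16)]
t = random.random()                              # sampled noise level in [0, 1)

# Hypothetical models: one that returns the exact velocity, one that predicts zero.
oracle = lambda x_t, t: velocity_target(clean, noise)
zero_model = lambda x_t, t: [0.0] * len(x_t)

oracle_loss = rectified_flow_loss(oracle, clean, noise, t)
zero_loss = rectified_flow_loss(zero_model, clean, noise, t)
```

In a real training loop, `predict_velocity` would be the adapter-equipped transformer, conditioned on the text prompt and initial frame, and the loss would be averaged over a batch of sampled noise levels.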

Implementation details cover dataset preparation, loss computation, optimizer configuration (AdamW with linear warmup and decay scheduling), and inference pipelines. Fine-tuned adapters are kept small and portable, enabling practitioners to swap domain-specific adapters without modifying the base model, a useful property for managing multiple specialized use cases from a single foundation model.
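The warmup-and-decay schedule mentioned above can be written as a learning-rate multiplier. This sketch assumes the decay is linear down to zero; the step counts and base learning rate are hypothetical values, not figures from the guide.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier that rises linearly from 0 to 1, then falls linearly back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))

base_lr = 1e-4       # hypothetical peak learning rate
warmup_steps = 10    # hypothetical
total_steps = 100    # hypothetical

# Learning rate at every optimizer step, including the final one.
schedule = [base_lr * linear_warmup_decay(s, warmup_steps, total_steps)
            for s in range(total_steps + 1)]
```

The same multiplier function could be handed to a scheduler that accepts a per-step lambda, so the optimizer's learning rate follows `schedule` without being set by hand.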

Sources

01 Hugging Face: Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for Robot Video Generation
