Researchers demonstrate first in-orbit use of a vision-language model for autonomous Earth observation
The NAVI-Orbital system processed imagery onboard a LEO spacecraft using a zero-shot vision-language model, achieving 88.16% accuracy on a curated ground benchmark and enabling natural-language re-tasking.
1 source · cross-referenced
- NAVI-Orbital is reported to be the first system to perform autonomous multi-modal inference onboard a Low Earth Orbit spacecraft using a vision-language model.
- The system uses Gemma 3 to classify scenes, generate text descriptions, and respond to operator queries via natural-language dialogue.
- Results include 88.16% accuracy on the 7,960-image AID benchmark and successful in-orbit processing of previously unseen Earth imagery.
- The approach aims to reduce downlink bandwidth by semantically compressing observations in orbit.
Researchers report the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard a Low Earth Orbit (LEO) spacecraft. Once deployed, the NAVI-Orbital system executed zero-shot classification, scene description, and natural-language dialogue using a local vision-language model (Gemma 3).
The system enables operators to re-task the spacecraft using plain-English prompts instead of conventional command sequences. It is orchestrated by a graph-based state machine (LangGraph) that coordinates dedicated agents for detection and dialogue.
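The orchestration pattern described above can be illustrated with a minimal sketch of a graph-based state machine in plain Python. It does not use the LangGraph API and is not the authors' implementation; the node names (`route`, `detect`, `dialogue`), the `MissionState` fields, the keyword-based routing rule, and the placeholder label are all assumptions made for illustration. In the reported system, the vision-language model would do the work that the stubs stand in for here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MissionState:
    """Shared state passed between nodes (hypothetical fields)."""
    prompt: str                                      # operator's plain-English request
    labels: List[str] = field(default_factory=list)  # output of the detection agent
    reply: Optional[str] = None                      # output of the dialogue agent
    trace: List[str] = field(default_factory=list)   # nodes visited, in order


def route(state: MissionState) -> Optional[str]:
    """Entry node: decide whether the prompt needs the detection agent."""
    state.trace.append("route")
    keywords = ("classify", "detect", "find")  # assumed routing rule
    wants_detection = any(k in state.prompt.lower() for k in keywords)
    return "detect" if wants_detection else "dialogue"


def detect(state: MissionState) -> Optional[str]:
    """Detection agent stub: a real system would run the model on an image."""
    state.trace.append("detect")
    state.labels = ["placeholder-scene-label"]
    return "dialogue"


def dialogue(state: MissionState) -> Optional[str]:
    """Dialogue agent stub: turn the current state into an operator reply."""
    state.trace.append("dialogue")
    if state.labels:
        state.reply = "Detected: " + ", ".join(state.labels)
    else:
        state.reply = "No detection requested."
    return None  # None marks the end of the graph


# The graph: node name -> function returning the name of the next node.
NODES: Dict[str, Callable[[MissionState], Optional[str]]] = {
    "route": route,
    "detect": detect,
    "dialogue": dialogue,
}


def run_graph(prompt: str, entry: str = "route", max_steps: int = 10) -> MissionState:
    """Walk the graph from the entry node until a node returns None."""
    state = MissionState(prompt=prompt)
    node: Optional[str] = entry
    for _ in range(max_steps):  # guard against accidental cycles
        if node is None:
            break
        node = NODES[node](state)
    return state


if __name__ == "__main__":
    result = run_graph("Classify the scene under the spacecraft")
    print(result.trace, result.reply)
```

Keeping each agent as a node that returns the name of its successor means a new capability can be added by registering another node, without changing the loop that drives the graph.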
Ground benchmarking shows 88.16% accuracy on the curated 7,960-image AID benchmark. Imagery from flatsat validation and live in-orbit captures of previously unseen Earth scenes, including uncorrected YAM-9 imagery, was processed onboard with hardware-accelerated GPU inference and without fine-tuning for the flight instrument.
The authors argue this approach addresses the growing gap between onboard data collection and actionable ground intelligence by semantically compressing Earth observations in orbit, thereby reducing the need to downlink raw data.
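To make the bandwidth argument concrete, the following sketch compares the size of an uncompressed image frame with the size of a short text summary of the same scene. The frame dimensions, the JSON payload layout, and the function names are illustrative assumptions; none of them come from the NAVI-Orbital paper, and real downlinked imagery would normally be compressed, which would narrow the gap.

```python
import json


def raw_frame_bytes(width: int, height: int, channels: int = 3,
                    bits_per_sample: int = 8) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * channels * bits_per_sample // 8


def summary_bytes(label: str, description: str) -> int:
    """Size of a text summary encoded as UTF-8 JSON (assumed payload layout)."""
    payload = json.dumps({"label": label, "description": description})
    return len(payload.encode("utf-8"))


def reduction_factor(raw: int, summary: int) -> float:
    """How many times smaller the summary is than the raw frame."""
    return raw / summary


if __name__ == "__main__":
    # Hypothetical 1024 x 1024 RGB frame with 8 bits per sample.
    raw = raw_frame_bytes(1024, 1024)
    # Hypothetical model output for that frame.
    text = summary_bytes("port", "Harbor with several docked vessels.")
    print(f"raw frame: {raw} bytes, summary: {text} bytes")
    print(f"summary is about {reduction_factor(raw, text):.0f}x smaller")
```

Under these assumptions the summary is smaller than the raw frame by several orders of magnitude, which is the effect the authors describe as semantic compression.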
- Jun 18, 2026 · arXiv cs.AI