Apple researchers introduce MixAtlas framework for optimizing multimodal LLM training data mixtures
A new method uses smaller proxy models to find optimal data mixtures at 1/100th the cost of full-scale training, achieving faster convergence and consistent performance improvements across diverse benchmarks.
- Apple researchers introduced MixAtlas, a framework for optimizing data mixtures in multimodal LLM pretraining, accepted at the NADPFM workshop at ICLR 2026.
- The framework systematically decomposes training data along two axes—image concepts and task supervision—enabling interpretable mixture control and domain-specific performance attribution.
- Using proxy models and Gaussian-process surrogates, MixAtlas explores mixture configurations at 1/100th the computational cost of full-scale training.
- Optimized mixtures achieved up to 3× faster convergence and 2-5% performance gains across benchmarks, with particularly strong results on text-rich tasks: +10% on ChartQA and +13% on TextVQA.
- Mixtures learned from smaller proxy models transferred successfully to larger-scale model training, preserving both efficiency and accuracy gains.
Apple's Machine Learning Research team has introduced MixAtlas, a framework designed to address a gap in multimodal large language model (LLM) training: how to systematically optimize which domains and data types to emphasize during pretraining. Current approaches typically adjust mixtures along single dimensions—such as data format or task type—without considering interactions across multiple factors.
The framework decomposes training data along two interpretable axes: image concepts (what visual domains are represented) and task supervision (the types of learning objectives). This dual-axis decomposition allows researchers to control mixture proportions and trace downstream performance improvements back to specific data sources, moving beyond black-box optimization.
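As a rough illustration of what a two-axis mixture could look like in practice, the sketch below stores one sampling weight per (image concept, task) cell and sums along either axis to show how much of the mixture each concept or task receives. The category names, the grid layout, and the helper functions `normalize` and `marginal` are all assumptions made for this example; the article does not list the actual categories or describe how MixAtlas represents mixtures.

```python
# Hypothetical axis values: the article does not name the real categories.
CONCEPTS = ["charts", "documents", "natural_scenes"]
TASKS = ["captioning", "vqa", "ocr"]


def normalize(raw_weights):
    """Scale non-negative cell weights so they sum to 1 (a sampling mixture)."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("mixture needs at least one positive weight")
    return {cell: w / total for cell, w in raw_weights.items()}


def marginal(mixture, axis):
    """Sum the mixture along one axis: 0 = image concept, 1 = task."""
    shares = {}
    for cell, weight in mixture.items():
        key = cell[axis]
        shares[key] = shares.get(key, 0.0) + weight
    return shares


# Start from a uniform grid, then upweight one (concept, task) cell.
raw = {(c, t): 1.0 for c in CONCEPTS for t in TASKS}
raw[("charts", "vqa")] = 3.0

mixture = normalize(raw)
concept_share = marginal(mixture, 0)  # share of the mixture per image concept
task_share = marginal(mixture, 1)     # share of the mixture per task type

for name, share in sorted(concept_share.items()):
    print(f"{name}: {share:.3f}")
```

Keeping the weights indexed by both axes is what makes attribution readable in this sketch: a change in a benchmark score can be set against the specific cell that was upweighted, not only against an overall data-format ratio.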
MixAtlas pairs smaller proxy models, trained on subsets of the full dataset, with a Gaussian-process surrogate model that maps the mixture space. This approach reduces the computational cost of exploration to roughly 1/100th of what full-scale training would require, making extensive mixture search feasible.
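The general idea of a surrogate-guided mixture search can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the MixAtlas implementation: `proxy_score` is a hypothetical synthetic stand-in for "train a small proxy model on this mixture and evaluate it", the three-way mixture, the RBF kernel, its length scale, and the noise term are all illustrative choices, and only the Gaussian-process posterior mean is used to rank candidates.

```python
import math
import random


def sample_mixture(k, rng):
    """Draw k non-negative weights summing to 1 (uniform on the simplex)."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def proxy_score(mixture):
    """HYPOTHETICAL stand-in for a proxy-model training run plus evaluation.
    A synthetic score that peaks at an arbitrary target mixture."""
    target = [0.5, 0.3, 0.2]
    return -sum((m - t) ** 2 for m, t in zip(mixture, target))


def rbf(a, b, length_scale=0.3):
    """Squared-exponential kernel between two mixtures."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_gp(X, y, noise=1e-4):
    """Fit a GP to (mixture, score) pairs; return its posterior-mean predictor."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    alpha = solve(K, [v - mean_y for v in y])

    def predict(x):
        return mean_y + sum(a * rbf(x, xi) for a, xi in zip(alpha, X))

    return predict


rng = random.Random(0)

# Step 1: score a small number of mixtures with the (cheap) proxy.
observed = [sample_mixture(3, rng) for _ in range(20)]
scores = [proxy_score(m) for m in observed]

# Step 2: fit the surrogate, then rank many candidates without further training.
predict = fit_gp(observed, scores)
candidates = [sample_mixture(3, rng) for _ in range(2000)]
best = max(candidates, key=predict)

print("suggested mixture:", [round(w, 3) for w in best])
```

The cost saving in this toy setup comes from the asymmetry between the two steps: only 20 mixtures are scored by the proxy, while 2,000 candidates are ranked by the surrogate alone. A fuller search would typically also use the Gaussian process's predictive uncertainty to decide which mixture to evaluate next.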
When applied to multimodal benchmarks, optimized mixtures achieved up to 3× faster convergence and consistent gains of 2-5% compared to existing approaches. On text-heavy visual reasoning tasks, improvements were more pronounced: ChartQA improved by 10% and TextVQA by 13%. Crucially, mixtures discovered using smaller proxy models transferred to larger-scale model training without degradation, suggesting the framework's findings generalize across model sizes.
The research was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models (NADPFM) at ICLR 2026, reflecting growing attention to data composition as a lever for model performance independent of scale.