Research · Jun 19, 2026

No evidence of Semitic-specific cross-lingual transfer in large language models

Fine-tuning on Arabic and inference-time reasoning both improve zero-shot reading comprehension across languages, but the gains do not depend on linguistic relatedness.

TL;DR

Seven LLMs (4B–671B parameters) fine-tuned on Arabic showed no Semitic-specific transfer in zero-shot reading comprehension across languages.
Models with weak baselines improved dramatically across all languages, while models with strong baselines showed only marginal gains regardless of language family.
Chain-of-thought reasoning provided similar benefits to fine-tuning, suggesting gains stem from task-format alignment rather than cross-lingual knowledge transfer.

Researchers fine-tuned seven large language models ranging from 4B to 671B parameters on Arabic and evaluated zero-shot reading comprehension across Semitic and non-Semitic languages. The study included both dense and Mixture-of-Experts architectures.

Across architectures, the authors report no evidence of Semitic-specific transfer. Models with weak baselines improved dramatically across all languages after fine-tuning, while models with strong baselines showed only marginal gains regardless of language family.
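One way to make "Semitic-specific transfer" concrete is to compare the average gain on Semitic languages with the average gain on all other languages. The sketch below is an illustration under stated assumptions, not the authors' evaluation code: the language codes, the `SEMITIC` grouping, the `transfer_gap` helper, and every accuracy value are hypothetical placeholders, since the summary does not list the languages or report per-language scores.

```python
from statistics import mean

# Hypothetical grouping (ISO 639-3 codes); the summary does not name the evaluated languages.
SEMITIC = {"heb", "amh", "mlt"}


def transfer_gap(baseline, finetuned):
    """Mean gain on Semitic languages minus mean gain on all other languages.

    Both arguments map a language code to an accuracy in [0, 1].
    A gap near zero means fine-tuning did not favor the Semitic group.
    """
    gains = {lang: finetuned[lang] - baseline[lang] for lang in baseline}
    semitic = [g for lang, g in gains.items() if lang in SEMITIC]
    other = [g for lang, g in gains.items() if lang not in SEMITIC]
    return mean(semitic) - mean(other)


# Placeholder accuracies invented for illustration only (not results from the study).
weak_base = {"heb": 0.30, "amh": 0.28, "mlt": 0.32, "fra": 0.31, "swh": 0.29, "tha": 0.30}
weak_tuned = {"heb": 0.70, "amh": 0.68, "mlt": 0.72, "fra": 0.71, "swh": 0.69, "tha": 0.70}

# Large gains everywhere, yet no advantage for the Semitic group.
print(f"gap = {transfer_gap(weak_base, weak_tuned):+.3f}")
```

In this toy setup a weak-baseline model gains about 40 points in every language, so the gap is effectively zero. That is the pattern the authors describe: large improvements that do not track language family.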

A chain-of-thought ablation reinforced this finding: the models that benefited most from fine-tuning gained a similar amount from inference-time reasoning. The authors interpret this as evidence that both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

Sources

01 arXiv cs.CL: Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
