Research · Jun 25, 2026

Researchers propose automated benchmark generation for neural relational reasoning using LLMs

Method uses LLM-driven evolutionary search and autonomous agentic workflows to produce increasingly challenging problem instances and improve reasoning evaluators.

1 source · cross-referenced


TL;DR

Researchers propose a method to automate the generation of challenging benchmark instances for neural relational reasoning using LLMs.
The approach combines LLM-driven evolutionary search and autonomous agentic search to discover sampling functions that yield hard problem instances.
The same machinery can be applied to novel worlds proposed by LLMs, enabling autonomous research on neural relational reasoning.
The work uses an Edge Transformer as the reasoning evaluator and shows that training it on the generated data improves its generalization to further data perturbations.

A new arXiv preprint introduces Project Auto-World, a framework that uses large language models (LLMs) to automate the generation of challenging benchmark instances for neural relational reasoning. The work targets a persistent challenge in evaluating neural models: determining what makes a problem instance hard and ensuring models can generalize to instances beyond their training distribution.

The authors frame the problem as a search for sampling functions that produce increasingly difficult problem instances. To discover these functions, they combine LLM-driven evolutionary search, based on the FunSearch paradigm, with autonomous agentic search. The approach operates within worlds defined by Datalog rules, with an Edge Transformer serving as the reasoning evaluator.
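To make the idea of searching for hard sampling functions concrete, here is a minimal Python sketch under loudly stated assumptions. It is not the authors' implementation. The world is a single Datalog-style reachability rule, random parameter mutation stands in for the LLM that would rewrite sampling code, and a hop-limited reasoner stands in for the Edge Transformer. Every function and parameter name here is hypothetical.

```python
import random

def closure(edges):
    """Forward-chain the toy rules path(X,Z) :- edge(X,Z) and
    path(X,Z) :- path(X,Y), edge(Y,Z) until no new facts appear."""
    paths = set(edges)
    while True:
        new = {(x, z) for (x, y) in paths for (y2, z) in edges if y == y2} - paths
        if not new:
            return paths
        paths |= new

def bounded_reasoner(edges, query, max_hops=2):
    """Hypothetical stand-in for a learned evaluator: follows at most max_hops edges."""
    src, dst = query
    frontier = {src}
    for _ in range(max_hops):
        frontier = {z for (y, z) in edges if y in frontier}
        if dst in frontier:
            return True
    return False

def sample_instance(params, rng):
    """A parameterized sampling function: random edge facts plus one path query."""
    n = params["n_nodes"]
    edges = {tuple(rng.sample(range(n), 2)) for _ in range(params["n_edges"])}
    query = tuple(rng.sample(range(n), 2))
    return edges, query

def hardness(params, rng, trials=50):
    """Score a sampler by how often the stand-in evaluator answers incorrectly."""
    errors = 0
    for _ in range(trials):
        edges, query = sample_instance(params, rng)
        truth = query in closure(edges)
        errors += bounded_reasoner(edges, query) != truth
    return errors / trials

def mutate(params, rng):
    """Assumption: a random tweak replaces the LLM-proposed rewrite of the sampler."""
    child = dict(params)
    key = rng.choice(sorted(child))
    child[key] = max(2, child[key] + rng.choice([-1, 1]))
    return child

def evolve(seed_params, generations=8, children=3, keep=3, seed=0):
    """Elitist evolutionary loop: score children, keep the hardest samplers."""
    rng = random.Random(seed)
    population = [(hardness(seed_params, rng), seed_params)]
    for _ in range(generations):
        parents = [p for _, p in population]
        for _ in range(children):
            child = mutate(rng.choice(parents), rng)
            population.append((hardness(child, rng), child))
        population = sorted(population, key=lambda t: t[0], reverse=True)[:keep]
    return population[0]

if __name__ == "__main__":
    score, params = evolve({"n_nodes": 4, "n_edges": 3})
    print(f"hardest sampler found: {params} (error rate {score:.2f})")
```

Because the loop keeps the best-scoring candidates from each generation, the reported error rate never drops below that of the seed sampler. In the setup the article describes, the candidates would be sampling programs proposed by an LLM rather than two integer parameters, and the score would come from the trained evaluator.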

The paper demonstrates that the Edge Transformer can be improved using data generated by this process, yielding better generalization to unseen data perturbations. The authors also show that the same machinery can be extended to novel worlds proposed by LLMs, suggesting a path toward autonomous research workflows in neural relational reasoning.

The preprint has been submitted to the NeurIPS 2026 Exposition & Demonstrations track and links to an associated code repository. The authors include Anirban Das, Joanne Boisson, Irtaza Khalid, Sumita Garai, and Steven Schockaert.

Sources

01 arXiv cs.AI: "Project Auto-World: Towards Automated Benchmarking of Neural Relational Reasoners"
