Text embeddings replace domain knowledge in algorithm selection across seven problem classes
Researchers propose ZeroFolio, a method that selects algorithms using pretrained text embeddings rather than manually engineered features. It outperformed a random forest baseline built on hand-crafted features in 10 of 11 tested scenarios.
- ZeroFolio uses pretrained text embeddings instead of hand-crafted features to select algorithms across diverse problem domains including SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems.
- The method outperformed random forest baselines trained on domain-specific features in 10 of 11 test scenarios with a single configuration, and in all 11 scenarios with two-seed voting.
- Key design choices identified through an ablation study are inverse-distance weighting, line shuffling, and Manhattan distance.
- Combining embeddings with traditional hand-crafted features via soft voting yielded further improvements in scenarios where the two approaches performed comparably.
A research team led by Stefan Szeider has proposed ZeroFolio, a domain-agnostic approach to algorithm selection that eliminates the need for hand-engineered features. Rather than extracting problem-specific characteristics, the method treats raw instance files as plain text, encodes them with a pretrained embedding model, and selects a solver using weighted k-nearest neighbors.
The core innovation rests on an empirical observation: embeddings from pretrained language models capture structural differences between problem instances without explicit domain knowledge or task-specific fine-tuning. This permits the same three-step pipeline (serialize, embed, select) to work across unrelated problem classes.
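The serialize, embed, select pipeline can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `embed` function below is a hypothetical stand-in (a character-count vector) for a real pretrained embedding model, the toy instances and solver names are invented for illustration, and selection is simplified to a single nearest neighbor.

```python
def serialize(instance_text):
    """Step 1: treat the raw instance file as plain text, with no parsing."""
    return instance_text.strip()


def embed(text, dim=128):
    """Step 2: map text to a fixed-size vector.

    PLACEHOLDER (assumption): counts characters by code point modulo dim.
    A real system would call a pretrained text-embedding model here.
    """
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select(query_vec, train_vecs, train_best_solver):
    """Step 3: return the best-known solver of the nearest training instance."""
    nearest = min(range(len(train_vecs)),
                  key=lambda i: manhattan(query_vec, train_vecs[i]))
    return train_best_solver[nearest]


# Toy training set: (instance text, best-known solver). Names are hypothetical.
training = [
    ("p cnf 3 2\n1 -2 0\n2 3 0", "solver_a"),
    ("p edge 3 2\ne 1 2\ne 2 3", "solver_b"),
]
train_vecs = [embed(serialize(text)) for text, _ in training]
train_solvers = [solver for _, solver in training]

query = "p cnf 3 2\n1 -3 0\n2 3 0"
chosen = select(embed(serialize(query)), train_vecs, train_solvers)
print(chosen)
```

Nothing in the sketch knows what a CNF formula or a graph is. The only domain-dependent input is the text of the instance file, which is the property the article attributes to the method.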
The authors evaluated ZeroFolio on 11 scenarios spanning seven distinct combinatorial problem domains: satisfiability, maximum satisfiability, quantified Boolean formulas, answer set programming, constraint satisfaction, mixed-integer programming, and graph problems. Against random forest classifiers built on conventional hand-crafted features, ZeroFolio outperformed the baseline in 10 of 11 scenarios using a single fixed hyperparameter set, and in all 11 scenarios when voting across two random seeds was applied.
Ablation analysis identified three critical design decisions: inverse-distance weighting for neighbor contribution, random line shuffling during text preprocessing, and Manhattan distance as the distance metric. In scenarios where both approaches showed comparable performance, combining embeddings with traditional features through soft voting produced measurable gains.
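The design choices named above can be sketched in a few lines of Python. This is a hedged illustration rather than the published method. The function names, the `k` and `eps` parameters, the toy vectors, and the decision to vote over solver labels are all assumptions, and the summary does not say exactly how line shuffling or soft voting are configured.

```python
import random
from collections import defaultdict


def shuffle_lines(text, seed):
    """Randomly reorder the lines of an instance file (seeded, reproducible)."""
    lines = text.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)


def manhattan(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_solver_scores(query, train_vecs, train_solvers, k=3, eps=1e-9):
    """Inverse-distance-weighted k-NN vote, normalized to sum to 1."""
    ranked = sorted(
        (manhattan(query, vec), solver)
        for vec, solver in zip(train_vecs, train_solvers)
    )
    scores = defaultdict(float)
    for dist, solver in ranked[:k]:
        scores[solver] += 1.0 / (dist + eps)  # closer neighbors count more
    total = sum(scores.values())
    return {solver: weight / total for solver, weight in scores.items()}


def soft_vote(*score_dicts):
    """Average several per-solver score distributions (soft voting)."""
    combined = defaultdict(float)
    for scores in score_dicts:
        for solver, p in scores.items():
            combined[solver] += p / len(score_dicts)
    return dict(combined)


# Toy data (hypothetical): one close neighbor prefers "a", two farther ones "b".
train_vecs = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]
train_solvers = ["a", "b", "b"]
scores = knn_solver_scores([0.0, 1.0], train_vecs, train_solvers, k=3)
best = max(scores, key=scores.get)

# Soft voting between an embedding-based model and a feature-based model.
combined = soft_vote({"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8})
```

In the toy data an unweighted majority vote would pick "b" (two of three neighbors), while inverse-distance weighting picks "a" because the single closest neighbor dominates. That difference is the kind of effect an ablation over the weighting scheme would measure.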
- May 22, 2026 · arXiv cs.AI