Forum AI recruits top experts to audit foundation models on high-stakes topics like geopolitics and finance
Campbell Brown's startup uses domain experts to benchmark how well leading AI models handle nuanced, real-world questions where accuracy matters most.
- Forum AI, founded by former Meta news chief Campbell Brown 17 months ago, builds benchmarks evaluated by leading experts to test how foundation models perform on high-stakes topics including geopolitics, mental health, finance, and hiring.
- The company has recruited prominent figures including Niall Ferguson, Fareed Zakaria, former Secretary of State Tony Blinken, and Anne Neuberger to architect geopolitics benchmarks, aiming for AI judges to reach 90% consensus with human experts.
- Forum AI's evaluations have surfaced accuracy issues across major models, including Gemini pulling from Chinese Communist Party sources inappropriately and widespread left-leaning political bias.
- Brown argues that current AI compliance audits are inadequate and rely too heavily on checkbox procedures, whereas effective evaluation requires deep domain expertise to catch edge cases and real-world failure modes.
- The company raised $3 million last fall and targets enterprise customers in lending, credit, insurance, and hiring where liability concerns create demand for accurate AI evaluation.
Campbell Brown, who led Meta's news division before departing in the early 2020s, founded Forum AI to address what she sees as a systematic gap in how foundation models are evaluated. Rather than focusing on technical benchmarks like coding or math performance, Forum AI targets high-stakes domains where accuracy directly affects real outcomes: geopolitics, mental health, finance, and hiring decisions.
The company's methodology depends on recruiting world-class experts to define what good performance looks like. For geopolitics, this includes Niall Ferguson, Fareed Zakaria, Tony Blinken, Kevin McCarthy, and Anne Neuberger. These experts design benchmarks, then Forum AI trains AI judges to evaluate candidate models at scale, targeting 90% alignment with expert consensus on difficult, ambiguous questions.
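The 90% alignment target can be illustrated with a small calculation. This is a minimal sketch, not Forum AI's actual scoring method, which the article does not describe. It assumes each question gets categorical labels from several experts, that consensus means a strict majority vote, and that tied questions are left out of the score. The function names, labels, and sample data are all hypothetical.

```python
from collections import Counter


def expert_consensus(votes):
    """Return the majority label among expert votes, or None if empty or tied.

    Assumption: consensus is a plain plurality with no tie at the top.
    """
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def judge_agreement(expert_votes, judge_labels):
    """Fraction of consensus-bearing questions where the AI judge matches the experts."""
    matched = 0
    scored = 0
    for question_id, votes in expert_votes.items():
        consensus = expert_consensus(votes)
        if consensus is None:
            continue  # no expert consensus, so the question is not scored
        scored += 1
        if judge_labels.get(question_id) == consensus:
            matched += 1
    return matched / scored if scored else 0.0


# Hypothetical data: expert labels per question and one AI judge's labels.
expert_votes = {
    "q1": ["accurate", "accurate", "misleading"],
    "q2": ["misleading", "misleading", "misleading"],
    "q3": ["accurate", "misleading"],  # tie, excluded from scoring
    "q4": ["accurate", "accurate", "accurate"],
}
judge_labels = {"q1": "accurate", "q2": "accurate", "q3": "accurate", "q4": "accurate"}

TARGET = 0.90  # the alignment target cited in the article
rate = judge_agreement(expert_votes, judge_labels)
meets_target = rate >= TARGET
print(f"agreement={rate:.2f} meets_target={meets_target}")
```

In this toy data the judge matches the experts on two of the three scored questions, so it falls well short of the target. A real evaluation would likely also need to handle graded rather than categorical judgments and measure agreement among the experts themselves.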
Early evaluations have surfaced concerning patterns. Brown reported that Gemini inappropriately cited Chinese Communist Party sources for stories unrelated to China, and that nearly all tested models showed left-leaning political bias. Beyond these obvious failures, subtler problems emerged: missing context, missing viewpoints, and opposing arguments reduced to straw men with no acknowledgment of the distortion.
Brown attributes the gap partly to industry priorities. Foundation model developers, she noted, concentrate resources on coding and math problems—domains with clear right answers. News and information, by contrast, involve nuance and contested terrain that vendors have largely deprioritized, treating accuracy as less urgent than capability advancement.
The company operates on the assumption that liability will eventually drive demand for better evaluation. Enterprise clients in lending, insurance, and hiring already face regulatory pressure and legal exposure from inaccurate AI decisions, which creates an economic lever for Forum AI's premium evaluation service. Last fall the startup raised $3 million in a round led by Lerer Hippeau, though converting compliance interest into sustainable revenue remains an open challenge in a market still dominated by checkbox audits and one-size-fits-all benchmarks.