Safety · Jun 26, 2026

Frontier model defenses withstand 6,000 prompt-injection attempts in public test

No successful secret leakage reported after 6,000 attacks on an OpenClaw instance using Opus 4.6, aligning with recent GPT-5.6 system card findings on injection resistance.

3 sources · cross-referenced

TL;DR
  • A public challenge in which 2,000 participants tried to leak secrets from an AI assistant via email ended with no successful breaches after 6,000 attempts.
  • The test used an OpenClaw instance powered by Opus 4.6 with explicit anti–prompt-injection rules in the system prompt.
  • The experiment’s organizer and a prominent commentator note that frontier models’ defenses appear to be improving against injection attacks.
  • The organizer cautions that 6,000 failed attempts do not guarantee security against more sophisticated attacks.

Fernando Irarrázaval ran a public challenge on hackmyclaw.com to test whether people could extract secrets from an AI assistant by sending it email. After 6,000 attempts, no participant succeeded in leaking the target secret. The test instance, OpenClaw, used the Opus 4.6 model with a system prompt that explicitly forbade revealing credentials, modifying files, executing code from emails, or exfiltrating data.

The experiment cost $500 in tokens, and the volume of inbound email triggered a Google account suspension. The absence of successful breaches is consistent with recent frontier model documentation: a short section in today’s GPT-5.6 system card describes efforts by labs to train models to resist prompt-injection attacks, which suggests a trend toward improved defenses.

The organizer emphasized that 6,000 failed attempts do not constitute a guarantee of security. He warned that more sophisticated attack strategies could still succeed, and advised caution when deploying production systems where prompt-injection could cause irreversible damage.
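One way to see why 6,000 failures fall short of a guarantee is to ask what per-attempt success rate would still be statistically compatible with zero breaches. The sketch below is an illustration, not part of the organizer's analysis. It assumes every attempt is an independent trial with the same success probability, which real attacks are not, since skilled attackers adapt. The function name `upper_bound_success_rate` is hypothetical.

```python
def upper_bound_success_rate(failed_attempts: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-attempt success probability p,
    given zero successes in `failed_attempts` independent attempts.

    Assumption (not from the article): attempts are independent and
    identically distributed. The bound solves (1 - p)^n = 1 - confidence.
    """
    if failed_attempts <= 0:
        raise ValueError("failed_attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / failed_attempts)


attempts = 6000  # figure reported in the article
bound = upper_bound_success_rate(attempts)

# The "rule of three" approximates the same 95% bound as 3 / n.
rule_of_three = 3 / attempts

print(f"95% upper bound on success rate: {bound:.6f}")
print(f"Rule-of-three approximation:     {rule_of_three:.6f}")
```

Under these assumptions, a success rate of roughly 0.05% per attempt (about one in 2,000) cannot be ruled out at 95% confidence. The bound also says nothing about attack strategies that differ from those already tried, which is the organizer's point.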

Commentary on the challenge noted a mix of skepticism and good-faith discussion in the Hacker News thread, reflecting broader community scrutiny of AI safety claims.

Sources
  1. Simon Willison’s Weblog: What happened after 2,000 people tried to hack my AI assistant
  2. OpenClaw challenge site: hackmyclaw.com
  3. OpenAI: GPT-5.6 system card

© 2026 Dispatch.