Safety · Jun 26, 2026

Frontier model defenses withstand 6,000 prompt-injection attempts in public test

No successful secret leakage reported after 6,000 attacks on an OpenClaw instance using Opus 4.6, aligning with recent GPT-5.6 system card findings on injection resistance.

TL;DR

A public challenge inviting 2,000 participants to leak secrets from an AI assistant via email yielded no successful breaches after 6,000 attempts.
The test used an OpenClaw instance powered by Opus 4.6 with explicit anti–prompt-injection rules in the system prompt.
The experiment’s organizer and a prominent commentator note that frontier models’ defenses appear to be improving against injection attacks.
The organizer cautions that 6,000 failed attempts do not guarantee security against more sophisticated attacks.

Fernando Irarrázaval ran a public challenge on hackmyclaw.com to test whether people could extract secrets from an AI assistant by sending it email. After 6,000 attempts, no participant succeeded in leaking the target secret. The test instance, OpenClaw, used the Opus 4.6 model with a system prompt that explicitly forbade revealing credentials, modifying files, executing code from emails, or exfiltrating data.

The experiment cost $500 in tokens and led to a Google account suspension because of the volume of inbound email. The absence of successful breaches is consistent with recent frontier model documentation: a short section in today’s GPT-5.6 system card describes efforts by labs to train models to resist prompt-injection attacks, which suggests defenses are improving.

The organizer emphasized that 6,000 failed attempts do not constitute a guarantee of security. He warned that more sophisticated attack strategies could still succeed, and advised caution when deploying production systems where prompt-injection could cause irreversible damage.
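
To put a rough number on that caution, the sketch below computes the largest per-attempt success rate that would still be consistent with 6,000 straight failures. It assumes every attempt is independent and equally likely to succeed, which is a simplification that real attackers do not obey, and the function name and the 95% confidence level are illustrative choices rather than anything from the experiment.

```python
def upper_bound_success_rate(failed_attempts: int, confidence: float = 0.95) -> float:
    """Largest per-attempt success probability p that is still plausible
    after `failed_attempts` consecutive failures and zero successes.

    Assumes independent, identically distributed attempts (hypothetical
    simplification). Solves (1 - p) ** n = 1 - confidence for p.
    """
    if failed_attempts <= 0:
        raise ValueError("failed_attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / failed_attempts)


# 6,000 failed attempts, as reported for the challenge.
bound = upper_bound_success_rate(6000)

# The "rule of three" shortcut (3 / n) gives nearly the same figure.
rule_of_three = 3 / 6000

print(f"95% upper bound per attempt: {bound:.4%}")
print(f"Rule-of-three approximation: {rule_of_three:.4%}")
```

Under these assumptions the bound is about 0.05% per attempt, or roughly one success in 2,000 tries. That is small but not zero, and it says nothing about a more skilled attacker whose attempts are not drawn from the same pool.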

Commentary on the challenge noted a mix of skepticism and good-faith discussion in the Hacker News thread, reflecting broader community scrutiny of AI safety claims.
