Tech

Independent benchmark challenges Anthropic Mythos supremacy in security detection

A June 2026 study suggests Anthropic’s restricted Mythos model is not uniquely superior in identifying security flaws, as cheaper alternatives like Qwen 3.6 and Gemma 4 demonstrate competitive results.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Hacker News

Artificial Intelligence Media Research

Analysis of nine complex vulnerabilities reveals public models can match top-tier performance at a fraction of the cost

An independent analysis published on 23 June 2026 has questioned the exclusive status of Anthropic’s Mythos model, a large language model restricted from general public access due to its reported ability to detect complex security vulnerabilities. The study benchmarks Mythos against a wide array of competing models, including Qwen 3.6, Gemma 4, MiMo, DeepSeek, GPT 5.5, and Gemini 3.1 Pro, to assess their efficacy in identifying security flaws without prior knowledge of the specific bugs.

The author built the benchmark suite from a corpus of nine bugs previously identified by Mythos. To confirm that the bugs were real and not already in the models' training data, the author first verified that top-tier models (such as Opus) could identify them when pointed directly at the code. Beyond the models listed above, the benchmark also covered Mistral Medium and Laguna M.1, among others. Results indicated that while Mythos performed well, several other models delivered competitive performance at significantly lower cost. Qwen 3.6, Gemma 4, MiMo, and DeepSeek were singled out as cost-effective alternatives.

Gemini 3.5 Flash outperformed Gemini 3.1 Pro, finding one more target bug with fewer false positives, though its cost was comparable to that of larger models. Mistral Medium found none of the known vulnerabilities and returned no results at all, which the author attributes to possible safety guardrails rather than a lack of capability. Haiku burned tokens at a prodigious rate (1.6M per case on average), which made it inefficient despite its low per-token price.
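The point about Haiku comes down to simple arithmetic: cost per case is tokens consumed multiplied by the price per token, so a cheap model that uses many tokens can cost more than a pricier model that uses few. The sketch below illustrates this. Only the 1.6M-token average comes from the study; the prices, the second model's token count, and the function name `cost_per_case` are hypothetical values chosen for illustration.

```python
def cost_per_case(tokens_per_case: float, usd_per_million_tokens: float) -> float:
    """Return the dollar cost of running one benchmark case."""
    return tokens_per_case / 1_000_000 * usd_per_million_tokens


# 1.6M tokens per case is the average the study reports for Haiku.
# The $1.00 per million tokens price is a hypothetical placeholder.
cheap_but_verbose = cost_per_case(1_600_000, usd_per_million_tokens=1.0)

# A hypothetical model that costs five times as much per token
# but needs only 200K tokens per case (both numbers are assumptions).
pricey_but_terse = cost_per_case(200_000, usd_per_million_tokens=5.0)

print(f"cheap but verbose: ${cheap_but_verbose:.2f} per case")
print(f"pricey but terse:  ${pricey_but_terse:.2f} per case")
```

Under these assumed numbers the nominally cheaper model ends up costing more per case, which is the kind of inefficiency the author describes.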

The author suspects that Anthropic’s restriction of Mythos may stem from high operational costs and capacity constraints rather than from security concerns alone.

Open-source projects form the digital bedrock of the commercial software industry, yet their decentralized and poorly monitored structure often leads to insecure code. Vulnerabilities in open-source software can cascade into major problems for commercial codebases, as the Log4j vulnerability (Log4Shell) demonstrated in late 2021.

Anthropic’s Mythos model is described as finding "really challenging security bugs" and is currently cordoned off from general users. The author previously built a tool called Nelson to automate bug hunting and noticed surprising differences in how effectively various models identify bugs. The study suggests that Mythos may not be uniquely superior in this domain and that public models can achieve similar results with adequate tools and time.
