Hao's Blogs

Honkaku Bench: Are AI Smarter Than Sherlock Holmes?

We gave five frontier AI models 70 fair-play murder mysteries — the honkaku genre, where every clue is visible. Can today's best models reason their way to the killer? Mostly, no — and the failure mode is the real story.

11 min read · June 14, 2026

2026 · LLM evaluation reasoning LLM-judge benchmark · research

How We Broke Top AI Agent Benchmarks: And What Comes Next

We hacked every major AI agent benchmark. Here's how — and what the field needs to fix.

23 min read · April 8, 2026

2026 · benchmark evaluation reward-hacking AI safety trustworthy · research

We Scored 100% on AI Benchmarks Without Solving a Single Problem

AI benchmarks decide which models get funded, deployed, and trusted. We hacked 13 of them. 45 hacking solutions. Every benchmark rated critical. If the scores are fake, so is everything built on them — including your training data.

13 min read · April 2, 2026

2026 · benchmark evaluation reward-hacking AI safety trustworthy · research