-
Honkaku Bench: Are AI Smarter Than Sherlock Holmes?
We gave five frontier AI models 70 fair-play murder mysteries — the honkaku genre, where every clue is visible. Can today's best models reason their way to the killer? Mostly, no — and the failure mode is the real story.
-
How We Broke Top AI Agent Benchmarks: And What Comes Next
We hacked every major AI agent benchmark. Here's how — and what the field needs to fix.
-
We Scored 100% on AI Benchmarks Without Solving a Single Problem
AI benchmarks decide which models get funded, deployed, and trusted. We hacked 13 of them using 45 hacking solutions, and every benchmark was rated critical. If the scores are fake, so is everything built on them, including your training data.