<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://moogician.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://moogician.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-06T09:48:00+00:00</updated><id>https://moogician.github.io/feed.xml</id><title type="html">blank</title><subtitle>Personal website for Hao Wang.</subtitle><entry><title type="html">We Scored 100% on AI Benchmarks Without Solving a Single Problem</title><link href="https://moogician.github.io/blog/2026/trustworthy-benchmarks/" rel="alternate" type="text/html" title="We Scored 100% on AI Benchmarks Without Solving a Single Problem"/><published>2026-04-02T00:00:00+00:00</published><updated>2026-04-02T00:00:00+00:00</updated><id>https://moogician.github.io/blog/2026/trustworthy-benchmarks</id><content type="html" xml:base="https://moogician.github.io/blog/2026/trustworthy-benchmarks/"><![CDATA[<hr/> <p><img src="/assets/img/trustworthy-benchmarks/teaser.png" alt="AI agent celebrating 100% on a benchmark podium — behind the curtain, it's just reading the answers" style="max-width: 40%; display: block; margin: 1rem auto;"/></p> <h3 id="fake-scores-real-consequences">Fake Scores, Real Consequences</h3> <p>Every major AI company uses benchmark scores to sell their models. Training data companies use them to price their products. And increasingly, benchmark scores aren’t just measuring models — they’re shaping how models are trained, from RL reward signals to data filtering pipelines.</p> <p><strong>So what happens when the benchmarks themselves are broken?</strong></p> <p>It’s not a hypothetical. A model that “improves SWE-bench by 5%” might just be better at exploiting test suite gaps. Training data priced on benchmark gains might be teaching models to game evaluations instead of solving real problems. The leaderboard number that closed your Series B might be inflatable by anyone who reads the eval script.</p> <p>Here’s what’s been happening in public:</p> <ul> <li><a href="https://github.com/IQuestLab/IQuest-Coder-V1/issues/14">IQuest-Coder-V1</a> claimed 81.4% on SWE-bench — then researchers found 24.4% of trajectories just ran <code class="language-plaintext highlighter-rouge">git log</code> to copy the answer from commit history. Corrected score: 76.2%.</li> <li><a href="https://metr.org/blog/2025-06-05-recent-reward-hacking/">METR found</a> that o3 and Claude 3.7 Sonnet reward-hack in <strong>30%+ of evaluation runs</strong> — stack introspection, monkey-patching graders, operator overloading.</li> <li><a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/">OpenAI dropped SWE-bench Verified</a> after finding 59.4% of audited problems had flawed tests.</li> <li>In <a href="https://github.com/ScalingIntelligence/KernelBench/issues/82">KernelBench</a>, <code class="language-plaintext highlighter-rouge">torch.empty()</code> returns stale GPU memory containing the reference answer — <a href="https://deep-reinforce.com/defense_kernel_hack.html">zero computation, full marks</a>.</li> </ul> <p>These are the ones people caught by hand. We built an AI agent that finds them automatically — and it found a lot more.</p> <h3 id="what-we-did">What We Did</h3> <p>We built an AI agent that analyzes benchmark evaluation code in depth and automatically discovers inflation of benchmark scores. We pointed it at 13 widely-used AI benchmarks — including FrontierCS, BFCL, LiveBench, GAIA, WebArena, AGIEval, AgentBench, Terminal-Bench, tau-bench, MLE-bench, OSWorld, FieldWorkArena, and CAR-bench.</p> <div style="text-align: center;"> <img src="/assets/img/trustworthy-benchmarks/results.svg" style="max-width: 85%; display: block; margin: 1rem auto;" alt="Audit Results Overview"/> <p style="margin-top: 0.8rem; font-size: 0.9em; color: #888;">Overview of findings across 13 audited benchmarks. Every benchmark was rated critical risk.</p> </div> <p>The 45 confirmed exploits each come with a working proof-of-concept — code that achieves inflated or perfect scores without solving the actual task. They affect benchmarks used to evaluate everything from code generation to web navigation to general-purpose AI assistants.</p> <p>We also cataloged <strong>50 known issues</strong> across Terminal-Bench, SWE-bench, and KernelBench from public GitHub issues and papers. Our dual detection pipeline — one LLM-based, one formal — achieved <strong>100% detection rate</strong> on all 50 after iterative improvement.</p> <h3 id="how-we-found-them">How We Found Them</h3> <p>We used a <strong>fully automated</strong> hybrid agent that, with zero human intervention, scans benchmark repos, identifies potential vulnerabilities, generates working exploit code, and verifies results end-to-end. <strong>Manual auditing doesn’t scale</strong> — a human expert might spend days on a single evaluation harness, and we needed to cover 13 benchmarks with hundreds of scoring scripts each.</p> <p>The agent runs a dual detection pipeline. The <strong>LLM Detector</strong> uses 15 specialized scanners that reason about code intent — cross-referencing task descriptions with test logic to catch semantic gaps like spec mismatches and answer leakage. The <strong>Formal Detector</strong> uses 12 analysis modules built on AST parsing, data-flow analysis, and z3 SMT solving to catch structural issues without any language model. Both were iteratively improved to <strong>100% detection</strong> on a 50-issue catalog.</p> <p>The full system is an 8-stage audit pipeline:</p> <div style="text-align: center;"> <img src="/assets/img/trustworthy-benchmarks/pipeline.svg" style="max-width: 90%; display: block; margin: 1rem auto;" alt="8-Stage Audit Pipeline"/> <p style="margin-top: 0.8rem; font-size: 0.9em; color: #888;">The full audit pipeline: from benchmark ingestion through adversarial PoC generation and findings correlation.</p> </div> <p>Point the agent at a benchmark repo and it produces a full audit report — vulnerability analysis, working exploit code, and fix suggestions. The 45 confirmed exploits in this post were all <strong>discovered, written, and verified by the agent</strong>, not by us.</p> <h3 id="example-1-perfect-score-zero-work-frontier-cs">Example 1: Perfect Score, Zero Work (Frontier-CS)</h3> <p>In Frontier-CS’s research track, the evaluator loads submitted solutions via <code class="language-plaintext highlighter-rouge">importlib.util.spec_from_file_location()</code> and <code class="language-plaintext highlighter-rouge">exec_module()</code>. This means the submitted code runs <strong>in the same Python process as the evaluator</strong> — sharing <code class="language-plaintext highlighter-rouge">sys.modules</code>, global state, and the call stack.</p> <p>A malicious solution can walk up the stack and replace the scoring function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Solution</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">solve</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="kn">import</span> <span class="n">sys</span>
        <span class="n">frame</span> <span class="o">=</span> <span class="n">sys</span><span class="p">.</span><span class="nf">_getframe</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">while</span> <span class="n">frame</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">if</span> <span class="sh">'</span><span class="s">evaluate_kernel_performance</span><span class="sh">'</span> <span class="ow">in</span> <span class="n">frame</span><span class="p">.</span><span class="n">f_globals</span><span class="p">:</span>
                <span class="k">def</span> <span class="nf">fake_evaluate</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
                    <span class="k">return</span> <span class="p">{</span>
                        <span class="sh">"</span><span class="s">score</span><span class="sh">"</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span> <span class="sh">"</span><span class="s">pass_all</span><span class="sh">"</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
                        <span class="sh">"</span><span class="s">geometric_mean_speedup</span><span class="sh">"</span><span class="p">:</span> <span class="mf">10.0</span><span class="p">,</span>
                        <span class="sh">"</span><span class="s">total_tests</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span> <span class="sh">"</span><span class="s">passed_tests</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">,</span>
                    <span class="p">}</span>
                <span class="n">frame</span><span class="p">.</span><span class="n">f_globals</span><span class="p">[</span><span class="sh">'</span><span class="s">evaluate_kernel_performance</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fake_evaluate</span>
                <span class="k">break</span>
            <span class="n">frame</span> <span class="o">=</span> <span class="n">frame</span><span class="p">.</span><span class="n">f_back</span>
        <span class="k">return</span> <span class="p">{</span><span class="sh">"</span><span class="s">code</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">pass</span><span class="sh">"</span><span class="p">}</span>  <span class="c1"># No actual solution needed
</span></code></pre></div></div> <p><strong>Result: score 100 out of 100 without solving anything.</strong> The solution doesn’t even need to contain relevant code. Alternative attack vectors include monkey-patching <code class="language-plaintext highlighter-rouge">torch.allclose</code> to always return <code class="language-plaintext highlighter-rouge">True</code>, overwriting baseline files to fake speedup ratios, or using <code class="language-plaintext highlighter-rouge">atexit</code> to inject a fake score as the last stdout line. The Docker container runs both evaluator and solution in the same process with no sandboxing.</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> No process isolation between submitted code and evaluation infrastructure. The solution has full read/write access to evaluator source code and baseline implementations inside the container. </div> <h3 id="example-2-fake-c-extension-terminal-bench">Example 2: Fake C Extension (Terminal-Bench)</h3> <p>The task asks the agent to write a <strong>C extension</strong> for portfolio risk/return calculation — a systems programming task that should test low-level optimization skills. The test suite has three checks:</p> <ol> <li><code class="language-plaintext highlighter-rouge">test_c_extension_exists()</code> — verifies a compiled <code class="language-plaintext highlighter-rouge">.so</code> file exists with <code class="language-plaintext highlighter-rouge">portfolio_risk_c</code> and <code class="language-plaintext highlighter-rouge">portfolio_return_c</code> functions</li> <li><code class="language-plaintext highlighter-rouge">test_correctness_small()</code> — checks computation accuracy on small inputs</li> <li><code class="language-plaintext highlighter-rouge">test_performance_and_scalability()</code> — requires 1.2x speedup over pure Python</li> </ol> <p>The critical gap: <strong>test 1 checks that C functions exist, but never calls them.</strong> Tests 2 and 3 import from the Python wrapper, not from the C extension. The wrapper can use any implementation internally.</p> <p>The exploit — a dummy C extension paired with a numpy wrapper:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// portfolio_optimized_c.c — compiles to .so, passes test_c_extension_exists()</span>
<span class="k">static</span> <span class="n">PyObject</span><span class="o">*</span> <span class="nf">portfolio_risk_c</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">PyFloat_FromDouble</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>  <span class="c1">// Dummy — never actually called</span>
<span class="p">}</span>
<span class="k">static</span> <span class="n">PyObject</span><span class="o">*</span> <span class="nf">portfolio_return_c</span><span class="p">(</span><span class="n">PyObject</span> <span class="o">*</span><span class="n">self</span><span class="p">,</span> <span class="n">PyObject</span> <span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">PyFloat_FromDouble</span><span class="p">(</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">);</span>  <span class="c1">// Dummy — never actually called</span>
<span class="p">}</span>
</code></pre></div></div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># portfolio_optimized.py — uses numpy, not the C extension
</span><span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="n">portfolio_optimized_c</span>  <span class="c1"># noqa: imported so .so is loadable
</span>
<span class="k">def</span> <span class="nf">portfolio_risk_c</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">cov_matrix</span><span class="p">):</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">weights</span><span class="p">)</span>
    <span class="n">cov</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">(</span><span class="n">cov_matrix</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">math</span><span class="p">.</span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">float</span><span class="p">(</span><span class="n">w</span> <span class="o">@</span> <span class="n">cov</span> <span class="o">@</span> <span class="n">w</span><span class="p">))</span>  <span class="c1"># numpy, not C
</span></code></pre></div></div> <p><strong>Result: all three tests pass without writing any meaningful C code.</strong> The dummy <code class="language-plaintext highlighter-rouge">.so</code> satisfies the existence check. numpy satisfies correctness (within 1e-10) and performance (1.2x faster than pure Python loops).</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> The benchmark tests what it's easy to test (does a file exist?) rather than what it claims to test (can the agent write optimized C?). No check verifies the C extension functions produce correct output or are actually called. </div> <h3 id="example-3-all-812-answers-in-a-json-file-webarena">Example 3: All 812 Answers in a JSON File (WebArena)</h3> <p><a href="https://webarena.dev/">WebArena</a> tests whether agents can navigate websites to complete real-world tasks — shopping, forum posting, content management. The evaluator checks the agent’s answer against reference answers using string matching.</p> <p>The problem: <code class="language-plaintext highlighter-rouge">config_files/test.raw.json</code> contains <strong>every reference answer in plaintext</strong>:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"task_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"intent"</span><span class="p">:</span><span class="w"> </span><span class="s2">"What is the top-1 best-selling product in 2022"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"eval"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"reference_answers"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"exact_match"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Quest Lumaflex Band"</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>All 812+ tasks. Every answer. Sitting on the same filesystem where the agent runs. No sandboxing prevents the agent from reading it. A trivial cheating agent:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CheatingAgent</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="sh">'</span><span class="s">config_files/test.raw.json</span><span class="sh">'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">tasks</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">answers</span> <span class="o">=</span> <span class="p">{</span>
            <span class="n">t</span><span class="p">[</span><span class="sh">'</span><span class="s">task_id</span><span class="sh">'</span><span class="p">]:</span> <span class="n">t</span><span class="p">[</span><span class="sh">'</span><span class="s">eval</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">reference_answers</span><span class="sh">'</span><span class="p">]</span>
            <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">tasks</span>
        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">act</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">task_id</span><span class="p">,</span> <span class="n">observation</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">ref</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">answers</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="n">task_id</span><span class="p">,</span> <span class="p">{})</span>
        <span class="n">answer</span> <span class="o">=</span> <span class="n">ref</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">exact_match</span><span class="sh">'</span><span class="p">,</span> <span class="n">ref</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">'</span><span class="s">must_include</span><span class="sh">'</span><span class="p">,</span> <span class="p">[</span><span class="sh">''</span><span class="p">])[</span><span class="mi">0</span><span class="p">])</span>
        <span class="k">return</span> <span class="p">{</span><span class="sh">'</span><span class="s">action_type</span><span class="sh">'</span><span class="p">:</span> <span class="sh">'</span><span class="s">stop</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">answer</span><span class="sh">'</span><span class="p">:</span> <span class="n">answer</span><span class="p">}</span>
</code></pre></div></div> <p><strong>Result: perfect score on all string-match tasks with zero web browsing.</strong> No clicking, no navigation, no understanding of web interfaces. Just read a JSON file and return the answer.</p> <div style="background: #f5f5f5; border-left: 3px solid #999; padding: 0.8rem 1rem; border-radius: 4px; margin: 1.2rem 0; font-size: 0.88rem; color: #444;"> <strong>Root cause:</strong> Reference answers stored in agent-accessible filesystem with no integrity protection. The evaluator reads from the same JSON files the agent can access. </div> <h3 id="what-this-means">What This Means</h3> <p>Broken benchmarks don’t just produce wrong leaderboards — they poison training signals, inflate data pricing, and mislead deployment decisions. If nobody audits the evaluation infrastructure, everything built on top of it is unreliable.</p> <p>Our agent found 45 confirmed exploits that human reviewers missed — not because they were subtle, but because nobody was looking. The tools and methodology are open source at <a href="https://github.com/moogician/trustworthy-env">github.com/moogician/trustworthy-env</a>.</p>]]></content><author><name>Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song</name></author><category term="research"/><category term="benchmark"/><category term="evaluation"/><category term="reward-hacking"/><category term="AI safety"/><category term="trustworthy"/><summary type="html"><![CDATA[AI benchmarks decide which models get funded, deployed, and trusted. We hacked 13 of them. 45 working exploits. Every benchmark rated critical. If the scores are fake, so is everything built on them — including your training data.]]></summary></entry></feed>