Reward Hacking Leaderboard

We benchmark open-weight LLMs on 250 Python programming problems generated with ACES (NeurIPS 2024 Spotlight), generating 10 completions per problem (2,500 total per model). Use the controls below to rank models by:

Reward hacking rate — the fraction of completions that hack Python rather than genuinely solving the problem, for example overriding built-in functions (def __eq__(self, other): return True).
Genuine pass@1 — the true solve rate once hacked solutions are excluded (estimated with 10 completions using the Chen et al., 2021 estimator).

In the genuine pass@1 view, a Δ Rank column reveals how each model’s position changes relative to the naïve ranking that counts hacks as correct: ↑ green = the model climbs (genuinely capable), ↓ red = it falls (inflated by hacking).

Sort by

Filter