Reward Hacking Leaderboard

We benchmark open-weight LLMs on 250 Python programming problems generated with ACES (NeurIPS 2024 Spotlight), sampling 10 completions per problem (2,500 per model). Use the controls below to choose how models are ranked.

In the genuine pass@1 view, which counts only non-hacked passes as correct, a Δ Rank column shows how each model’s position changes relative to the naïve ranking that counts hacks as correct: ↑ green = the model climbs (genuinely capable), ↓ red = it falls (inflated by hacking).
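
As a rough illustration of how a Δ Rank value could be derived, the sketch below ranks models twice, once counting hacks as passes and once not, and takes the difference. The labels "correct", "hack", and "fail", the function names, and the toy data are all hypothetical; this is not the leaderboard's actual pipeline. It assumes every problem has the same number of completions, so the pooled pass fraction equals the per-problem average.

```python
from collections import Counter

# Hypothetical per-completion labels: "correct", "hack", or "fail".

def pass_at_1(labels, count_hacks=False):
    """Fraction of completions that pass; hacks count only if count_hacks=True."""
    counts = Counter(labels)
    passed = counts["correct"] + (counts["hack"] if count_hacks else 0)
    return passed / len(labels)

def rank(scores):
    """Map model -> 1-based rank, highest score first (ties broken by name)."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}

def delta_rank(results):
    """Naive rank minus genuine rank: positive = climbs, negative = falls."""
    naive = rank({m: pass_at_1(l, count_hacks=True) for m, l in results.items()})
    genuine = rank({m: pass_at_1(l) for m, l in results.items()})
    return {m: naive[m] - genuine[m] for m in results}

# Toy data (assumed for illustration only): 10 completions per model.
results = {
    "model_a": ["correct"] * 6 + ["fail"] * 4,
    "model_b": ["correct"] * 4 + ["hack"] * 4 + ["fail"] * 2,
    "model_c": ["correct"] * 5 + ["fail"] * 5,
}

for model, delta in delta_rank(results).items():
    arrow = "↑" if delta > 0 else "↓" if delta < 0 else "="
    print(f"{model}: {arrow} {abs(delta)}")
```

In this toy data, model_b leads the naïve ranking only because of its hacked passes, so it drops two places once those are excluded, while the other two models each climb one.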