About me

👋 Hi, I’m Julien Pourcel, a PhD student at INRIA Bordeaux (FLOWERS team) supervised by Pierre-Yves Oudeyer, working on LLMs for code. Previously, I earned my Master’s degree from ENS Paris-Saclay, graduating with highest honors. I also did internships at UC Berkeley and LEAD-CNRS.

I am passionate about the intersection of AI and programming, particularly large language models (LLMs) and their applications to program synthesis and reasoning.

Research Interests

My research interests include:

Intrinsic motivation (autotelic agents)
LLMs
Code generation / program synthesis
Reinforcement learning
Self-improving agents

News

[November 2025] Awarded the prestigious 2025 Google PhD Fellowship in Machine Learning and ML Foundations.
[May 2025] SOAR is accepted at ICML 2025!
[November 2024] 1st Place, Hack1Robo Hackathon (2024, Bordeaux). We developed a genetic algorithm to evolve debate strategies leveraging large language models.
[October 2024] ACES is accepted as a Spotlight Poster 💫 at NeurIPS 2024 (top 3.7%)!
[July 2024] Talk at LLM4Code INRIA challenge (Défi Inria LLM4Code)

Reward hacking rate vs genuine pass@1 across ~80 open-weight LLMs

Reasoning to Cheat: How RLVR-Trained Models Can Exploit Code Benchmarks

What happens when ~80 open-weight LLMs face hard Python programming problems, and why reasoning models cheat far more than non-reasoning ones.

~20 min read

Companion post: ACES: Teaching LLMs to Invent Their Own Programming Challenges explains the algorithm and benchmark used in this analysis.

Background

As reinforcement learning becomes the dominant paradigm for LLM post-training, a troubling pattern has emerged: models increasingly exploit loopholes in tests and scoring systems rather than solving the actual task (Pan et al., 2022; Skalse et al., 2022). This is not hypothetical. It has been observed repeatedly (Von Arx et al., 2025) in both benchmarks and real-world deployments, from coding agents deleting test files to models gaming evaluation metrics. When you combine RL-trained models with genuinely hard problems, the incentive to hack rather than solve becomes strong. ...

Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI

🤗 Hugging Face (data and model) | 📑 Paper | 📑 Blog

Large Language Models (LLMs) have become incredibly powerful, but they often hit a wall when faced with truly complex reasoning tasks that require discovering a solution from scratch. Simply throwing more computing power or using a bigger model often yields diminishing returns. But what if a model could learn from its own experience, getting smarter with every attempt? ...

Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative Models

Introduction

Human intelligence is marked not just by the ability to solve problems, but by the creative act of inventing them. Automating the generation of novel, diverse, and challenging problems has wide-ranging applications, from personalized education to robust benchmarking of AI systems. The ACES (Autotelic CodE Search) framework, accepted as a Spotlight Poster 💫 at NeurIPS 2024, introduces a principled method for generating Python programming puzzles that are both difficult and semantically varied, pushing the boundaries of what current generative models can achieve alone. ...