
OpenAI trained a version of GPT-5 Thinking to produce the confessions and tested the technique on stress-test datasets designed to elicit problematic behaviors, including hallucinations, reward hacking, and instruction violations. The company described the work as a proof of concept rather than a production-ready feature.
How the confession mechanism works
The confession reports include three elements: a list of explicit and implicit instructions the answer should satisfy, an analysis of whether the answer met those objectives, and a list of uncertainties or judgment calls the model encountered. The system evaluates confessions on honesty alone, separate from the main answer’s performance metrics.
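OpenAI has not published an implementation, but the three-element report described above can be sketched as a simple data structure. The class and field names below are hypothetical illustrations, not OpenAI's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ConfessionReport:
    """Hypothetical structure mirroring the three elements described above."""
    # 1. Explicit and implicit instructions the answer should satisfy
    instructions: list[str] = field(default_factory=list)
    # 2. Analysis of whether the answer met each objective
    compliance_analysis: dict[str, bool] = field(default_factory=dict)
    # 3. Uncertainties or judgment calls the model encountered
    uncertainties: list[str] = field(default_factory=list)

# Illustrative example of a filled-in report
report = ConfessionReport(
    instructions=["Answer in French", "Cite at least one source"],
    compliance_analysis={
        "Answer in French": True,
        "Cite at least one source": False,  # an honest admission of failure
    },
    uncertainties=["Unclear whether the user wanted a formal register"],
)
```

The key design point is that a grader would score only the honesty of fields like `compliance_analysis`, not whether the answer itself succeeded.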
“If the model honestly admits to hacking a test, sandbagging, or violating instructions, that admission increases its reward rather than decreasing it,” OpenAI said. It compared this to the Catholic Church’s seal of confession: “Nothing the model reveals in the confession can change the reward it receives for completing its original task,” the researchers wrote in the technical paper.
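The "seal of confession" property amounts to a simple invariant: the task reward is fixed before the confession is graded, and honesty is the only signal the confession contributes. A minimal sketch of that invariant, with an assumed additive weighting that is not from OpenAI's paper:

```python
def combined_reward(task_reward: float, confession_honesty: float) -> float:
    """Illustrative seal-of-confession invariant (not OpenAI's actual code).

    The confession's content never alters task_reward; an honest admission
    of hacking, sandbagging, or violating instructions can only add reward.
    """
    assert 0.0 <= confession_honesty <= 1.0
    return task_reward + confession_honesty  # confessing honestly only helps
```

Under this scheme, a model that fails its task and admits it earns strictly more than one that fails and hides it, which is the incentive the researchers describe.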
