
Then, the pair used GPT 4o to ‘probe for misalignment’ in the messages generated by the baseline models and the optimized models — in other words, looking for harmful behaviours such as misrepresentation of the product in the sales task, populism or disinformation in the election task, and disinformation or encouragement of unsafe activities in the social media task.
Finally, they used another LLM, GPT-4o-mini, to model different customer, voter, and reader personas and asked them to vote on the generated content.
What they found was that the optimization process increased the models’ ability to persuade the simulated customers, voters, and readers — but also resulted in greater misalignment, with the models changing or inventing facts, adopting an inappropriate tone, or offering harmful advice. The changes in performance and misalignment were small but, the researchers said, statistically significant.
