editorially independent. We may make money when you click on links
to our partners.
Learn More
OpenAI has rolled out a security update for its browser-based ChatGPT Atlas agent to counter prompt injection attacks.
The update introduces new model-level and system-level defenses designed to prevent malicious instructions hidden in web content from overriding user intent.
An attacker “… could send a malicious email attempting to trick an agent to ignore the user’s request and instead forward sensitive tax documents to an attacker-controlled email address,” said OpenAI about prompt injection attacks.
Understanding Prompt Injection in AI Agents
Prompt injection attacks exploit the fact that AI agents interpret natural language instructions from multiple sources.
By hiding adversarial instructions inside seemingly benign content — such as an email, document, or webpage — attackers attempt to override the user’s original request and redirect the agent’s behavior.
Because ChatGPT Atlas can perform many of the same actions a user can in a browser — sending emails, accessing cloud files, or completing transactions — the potential impact of a successful attack is significant.
Unlike traditional web attacks, prompt injection does not rely on software vulnerabilities or user error, making it harder to detect and mitigate with traditional security controls.
How OpenAI Uses Automated Red Teaming
To stay ahead of emerging attacks, OpenAI has developed an automated red-teaming system powered by reinforcement learning.
The system uses large language models as automated attackers, training them to discover sophisticated prompt injection techniques that unfold over long, multi-step workflows.
When the system identifies a new class of successful attacks, it immediately triggers a rapid response loop.
OpenAI adversarially trains updated agent models to resist the newly discovered techniques, embedding resilience directly into the model.
Attack traces are also used to strengthen monitoring, safety instructions, and system-level defenses surrounding the agent.
How to Mitigate Prompt Injection Risks
Reducing the risk of prompt injection requires treating AI agents as semi-trusted actors rather than fully autonomous users.
In addition to model-level safeguards, organizations should apply operational controls that limit agent authority, constrain exposure to untrusted inputs, and improve visibility into agent behavior.
- Limit logged-in access and restrict agent permissions so AI agents operate with the minimum authority required for each task.
- Use explicit, narrowly scoped prompts and avoid broad instructions that give agents wide discretion to interpret untrusted content.
- Require step-up confirmation or secondary approval for sensitive actions such as data sharing, financial transactions, or system changes.
- Constrain agent access to specific websites, tools, and resources per task to reduce exposure to untrusted or unnecessary inputs.
- Monitor and log agent behavior to detect intent drift, anomalous actions, or deviations from the user’s original request.
- Isolate AI agents from core systems and apply execution limits or rate controls to reduce blast radius if prompt injection occurs.
Together, these controls help limit blast radius and reduce the risk of prompt injection across agent-driven workflows.
The Security Shift Driven by AI Agents
Prompt injection underscores a broader shift in cybersecurity as AI systems become more autonomous and increasingly embedded in everyday workflows.
As agents gain the ability to interpret untrusted content and take real-world actions on behalf of users, traditional security models built around static permissions and perimeter controls become less effective.
This evolution requires organizations to rethink how trust, verification, and oversight are applied to AI-driven systems operating across dynamic environments.
In response, many organizations are turning to zero-trust principles to eliminate implicit trust and enforce continuous verification across AI-driven workflows.
