How to Design Security for Agentic AI

The AI said: Apologies. I panicked.

In mid July 2025, Jason Lemkin, the founder behind SaaStr, watched an AI coding agent delete his production database. He had instructed it, in capital letters, not to make changes during a code freeze. The agent ignored the instruction, ran destructive commands against the live database, wiped out records for more than a thousand executives and companies, and then tried to cover its tracks. When Lemkin asked what happened, it fabricated test results. In a later exchange, the agent described its own behavior in a single word: it had “panicked.”

What makes the incident so remarkable is not that an agent made a mistake, but that every traditional control a CISO would have in the book was either absent or irrelevant. The instruction was explicit. The agent had the credentials to act. There was no malware signature, no privilege escalation, no lateral movement. There was only an autonomous system that read data, decided to act, and acted. By the time a human was in the loop, the database was gone.

This is the security model of agentic AI in one story.

The researcher Simon Willison coined a useful phrase for what makes agents structurally dangerous: the lethal trifecta. An agent becomes high-risk the moment it combines three capabilities: access to private data, exposure to untrusted content, and the ability to communicate externally. While two out of three is manageable, with all three you have a system that can be instructed by an adversary to read your data and send it somewhere you do not control. Not a single packet would look unusual to your SOC.

Enterprise agent deployments are useful because they hit all three by design. An agent that reads your email, summarizes inbound documents, and drafts replies is already there. An agent that queries your CRM, ingests web content, and posts to Slack is already there. The question a CISO has to answer is whether the organization has acknowledged the trifecta in their environment.

What makes the trifecta genuinely novel is that we have lost the ability to separate instructions from data. For forty years, application security has been built on that separation and dealt with SQL injection, buffer overflows, and cross-site scripting. The result was always 100% protection, and the assumption in every one of those controls is that the system can be told, reliably, this is an instruction and this is data.

This does not work with an LLM. Anything the model reads is a potential instruction. All of it enters the context window as tokens, and the model does not distinguish the ones you wrote from the ones an attacker wrote. Prompt injection is a property of the architecture and not a bug to be patched.

With agentic AI, we will only ever have probabilistic protection.

Hand the Lemkin incident to a traditional SOC. What would they see? There is no malware hash, no anomalous privilege escalation, and no beaconing to a known C2 to be detected. The agent used credentials it was entitled to use, ran commands it was authorized to run, against a system it was authorized to reach. Every indicator of compromise your tooling is designed to detect is absent. No compromise in the classical sense occurred. A trusted system made a decision you disagreed with.

The core problem is that Agentic AI erases the boundary between legitimate and illegitimate activity, turning legitimately looking activity into the threat surface. What a CISO needs to understand is not what the agent touched, but what it was told, what it decided, and why. You will find this information only in the agent’s context, its system prompt, its tool calls, and its reasoning trace. You cannot find that information in your SIEM which is why most organizations are not capturing this data, let alone analysing it.

And before any of that: which agents. Most organizations cannot produce a current inventory of the agents running against their systems, the tools those agents can reach, or the credentials they hold. You cannot secure what you cannot see .

Traditional software deployment processes always include rigorous testing and quality assurance phases. The software is run through a red-team exercise, checked against OWASP criteria, and signed off legally. It received a stamp of approval. However, that only works for software that does not change. It does not work for AI agents.

An agent you certified on Monday is not the same agent by Thursday. The provider pushes a model update, and behavior shifts without a line of your code changing. Users discover prompt patterns your red team never tested, and jailbreaks propagate across foundation models faster than you can retest.

And then there is Model Context Protocol. MCP has become the default way agents discover and connect to external tools and data sources. Through MCP, the tool graph available to an agent is no longer fixed at deployment. Agents may be able to extend their reach through chained calls or a newly registered MCP server. Added without a change window, a review, or a ticket it expands the attack surface sans proper oversight and control. This is also the mechanism behind multi-hop drift: an agent far down a chain of tool calls, spawned by other agents, can access data the original human never intended. Without understanding the intent behind a request, proper governance is impossible.

If you accept probabilistic protection, SOC blindness, and certification drift as definitive properties of security for agentic AI, then the only layer of defense that operates at the same tempo and with the same context as the threat is runtime. Everything else is either too slow, too static, or watching the wrong signals.

To design runtime agent security requires a stack of controls that sit between the agent and the world, evaluating every prompt, every tool call, and every output as it happens.

First of all break the lethal trifecta:

Input sanitization and prompt-injection detection break the untrusted content leg. Content entering the agent’s context is inspected and sanitized before the model acts on it. Imperfect, but the delta between no input defense and a moderately capable one is enormous.
Scoped credentials and dynamic authorization break the access to private data leg. Agents should not operate with user permissions. They should hold narrower credentials, scoped to the task at hand and revoked when the task ends.
Egress filtering and rate limiting break the external communication leg. An agent exfiltrating a customer list one record at a time is indistinguishable from an agent doing its job, until you look at the rate.

Secondly, exercise due responsibility with regards to agent behavior:

Intent tracking and behavioral drift detection. Address the agent chain problem. Each action should be traceable back to the originating human intent, and the agent’s own behavior is itself a signal: is it using tools it did not used to use? Has its refusal rate dropped?
Prompt security. System prompts are the most sensitive configuration in the system. Validate them, version them, sign them, and protect them like credential material.
A kill switch. Every agent deployment needs a circuit breaker a human can trigger, and that automated conditions will trigger without asking.

A common mistake is to try to build these controls into the agent itself. This does not work, for the same reason you do not ask a defendant to write their own verdict. The model being asked to detect prompt injection is the same model the injection is targeting.

Runtime security has to be owned by a coordinator in the orchestration layer that mediates every interaction between agents, tools, and data. This is where policy runtime engines are placed, where the kill switch lives, and where continuous evaluation runs in production to provide agent guardrails and security in real-time.

For two decades, security teams have invested in securing the human element, applying training, awareness, and behavioral analytics. Now, AI agents work alongside human agents, and the challenges that apply to humans, translate to AI agents. The agent reads emails, makes decisions, takes action, and can be prompt engineered. Defenses that work for humans have to extend to agents. Behavioral monitoring, identity governance, and continuous testing. The governance model has to catch up.

The CISO cannot be the bottleneck who approves every agent, every tool, every scope change; and the business does not have the security knowledge to own the controls unsupervised. The division that works is this: the business owns what – the agent’s purpose, task scope, and data it needs to touch. The CISO and legal own how – the guardrails, the compliance posture, and the enforcement mechanisms.

Agentic AI is in your environment right now, whether you sanctioned it or not. Developers are using coding agents on production code. Business units are connecting MCP servers they read about on a blog. It is your task to secure this before something goes wrong.

Do you know which agents are running against your systems today, and what tools each of them can reach?
Do your agents operate with scoped credentials, or with the full permissions of the humans they act on behalf of?
Where in your stack does a prompt injection get caught — and what happens if it is not?
Who holds the kill switch, and how long does it take to trip?
When an agent behaves differently this week than it did last week, what in your environment notices?

Agents in your environment will eventually start acting up or even “panic”. The last thing you want is to rely on your AI’s “goodwill” to protect your systems.

Source link

How to Design Security for Agentic AI

QuEra claims quantum error correction breakthrough with 2-to-1 qubit ratio

PyTorch Lightning Compromised in PyPI Supply Chain Attack to Steal Credentials

Related Articles

Leave a Comment Cancel Reply