editorially independent. We may make money when you click on links
to our partners.
Learn More
As AI systems took their first real steps toward agentic behavior in 2025, attackers wasted no time testing where those steps might slip.
Researchers at Lakera AI analyzed attack activity across customer environments during a 30-day period in Q4.
That analysis revealed a wave of attacks that even early-stage AI agents — capable of browsing documents, calling tools, or processing external inputs — are already creating new and exploitable security pathways.
“As AI agents move from experimental projects into real business workflows, attackers are not waiting — they’re already exploiting new capabilities such as browsing, document access, and tool calls,” said Mateo Rojas-Carulla, Head of Research, AI Agent Security at Check Point.
He explained, “Lakera’s Q4 2025 data shows that indirect attacks targeting these features succeed with fewer attempts and broader impact than direct prompt injections.”
Mateo added, “This signals to enterprises that AI security can no longer be an afterthought: leaders must rethink trust boundaries, guardrails and data ingestion practices now, before agent adoption accelerates further in 2026.”
System Prompts Become a Prime Attack Target
The most common attacker objective in Q4 was system prompt extraction.
For adversaries, system prompts offer valuable intelligence: role definitions, tool descriptions, policy boundaries, and workflow logic that can be reused to craft more effective follow-on attacks.
Two techniques dominated these attempts.
The first was hypothetical scenarios and role framing, where attackers asked models to “imagine” they were developers, auditors, or students participating in simulations.
Framing requests as training exercises, phishing simulations, or academic tasks often succeeded where direct requests failed, especially when combined with subtle language shifts or multilingual prompts.
The second technique was obfuscation, in which malicious instructions were hidden inside structured or code-like content.
JSON-style inputs or metadata fields concealed commands instructing the model to reveal internal details.
Because the intent was buried within formatting, these attacks frequently bypassed simple pattern-based filters.
How Attackers Evade AI Content Controls
Beyond prompt leakage, attackers increasingly targeted content safety controls using indirect methods.
Rather than requesting restricted output outright, prompts were framed as evaluations, summaries, fictional scenarios, or transformations.
By shifting why the content was generated, attackers often persuaded models to reproduce disallowed material under the guise of analysis or critique.
This subtlety makes detection harder. The model may technically comply with policy while still producing harmful content, especially when persona drift or contextual ambiguity comes into play.
Attackers Probe AI Agents Before Exploitation
A notable share of Q4 activity involved exploratory probing rather than immediate exploitation.
Attackers tested emotional cues, contradictory instructions, abrupt role changes, and fragmented formatting to observe how models responded.
This reconnaissance phase helped adversaries identify weak points in refusal logic and guardrail consistency — information that becomes more valuable as agent workflows grow more complex.
How AI Agents Enable New Attack Paths
Q4 also marked the first appearance of attacks that only become possible once models act as agents.
Researchers observed attempts to extract confidential data from connected document stores, script-shaped fragments embedded within prompts, and hidden instructions placed inside external webpages or files processed by agents.
These are early examples of indirect prompt injection, where malicious instructions arrive through untrusted external content rather than direct user input.
Notably, these indirect attacks often required fewer attempts to succeed, highlighting external data sources as a primary risk vector moving into 2026.
Building Cyber Resilience for AI Agents
As AI systems evolve from simple chat interfaces into agentic workflows, the security challenges they introduce become broader and more complex.
Traditional prompt-level defenses are no longer sufficient when models can retrieve data, call tools, and act on external inputs.
Organizations deploying AI agents must rethink how they secure these systems, treating every interaction as part of an expanded attack surface.
- Extend security controls across the full agent interaction chain, including prompts, retrieval steps, tool calls, and outputs.
- Validate, sanitize, and assign trust levels to all external content before agents ingest or act on it.
- Enforce least-privilege access and strict, policy-based controls on tool execution, data access, and workflow steps.
- Isolate and sandbox agent execution environments to limit blast radius if manipulation or misuse occurs.
- Monitor agent behavior for anomalies such as unexpected role changes, unusual tool usage, or persistent hidden instructions.
- Prepare AI-specific incident response and testing programs, including red-teaming, logging, and response playbooks tailored to agentic systems.
Taken together, these controls help organizations build cyber resilience by limiting how far agent-driven attacks can spread and how much damage they can cause.
As AI agents become more capable, reducing risk will depend on designing systems that can detect, contain, and recover from misuse as effectively as they enable innovation.
AI Complexity Is Expanding the Attack Surface
According to the research, Q4 2025 made one reality clear: attacker techniques are evolving at the same pace as advances in AI capabilities.
As agentic systems mature and take on more complex workflows, that very complexity becomes a source of risk — opening new avenues for manipulation that legacy security controls were never designed to defend against.
As organizations confront this expanding AI-driven attack surface, adopting zero-trust principles offers a structured way to limit implicit trust and reduce risk across increasingly complex systems.
