The security model for AI agents is broken in a specific, architectural way. Organisations are building agents that read emails, query databases, execute code, and call external APIs, then securing them with the same authentication patterns designed for human users clicking through web forms. A recent analysis of enterprise environments found a ratio of 144 non-human identities to every one human identity in organisations, yet only 21.9% treat agents as independent identity principals. The remainder run agents on shared API keys or inherited human credentials never designed for non-human use. This is not a gap that better prompting or model alignment will close. It is a systems problem, and treating it as anything less guarantees that the most capable agents you build will also be the most dangerous.
The authentication layer you forgot to build
Why identity stops at the API gateway
Most agent architectures authenticate once at the perimeter and then trust everything downstream. The agent presents an API key or OAuth token, the gateway says "proceed," and from that point forward, every tool call inherits the same ambient authority. This mirrors how we built microservices a decade ago, before zero-trust networking forced us to authenticate at every hop. But agents are worse than microservices in one critical respect: they are nondeterministic. Research on AI agent standards found that the same model weights produce different outputs even on identical inputs, making behaviour unreliable as an identity signal. A correctly authenticated agent can still act outside its mandate through prompt injection or behavioural drift, because all current authentication mechanisms verify the container of identity (tokens, certificates) rather than the content of the agent.
Agents act on behalf of users, not as users
The delegation model underlying most agent deployments conflates two distinct principals. When an agent books a meeting on your behalf, it should carry a credential that says "this agent is acting for Colin, with permission to access his calendar, and nothing else." Instead, most implementations hand the agent Colin's full session token. The framework proposed by researchers at institutions including Stanford's Digital Economy Lab extends OAuth 2.0 with distinct tokens for the user, the agent, and the delegation relationship between them. This separation matters because without it, every service the agent contacts sees a human credential and grants human-level access, with no way to scope, audit, or revoke the agent's authority independently.
The delegation problem no one is solving
OAuth 2.0 and SAML presume synchronous human consent and single-hop delegation. A human clicks "Allow," a token is issued, and a service acts on it. But autonomous agents act asynchronously, long after the human has walked away, and chain calls across multiple services in a single task. No production-ready standard currently traces authorisation chains back to originating human principals in a way that every resource server along the chain can verify. Multi-hop delegation accountability remains unenforceable in practice. This is not an edge case. It is the default operating mode for any agent that coordinates across services.
How tool use turns read into write
From retrieval to side effects
A chatbot that answers questions from a knowledge base is a read-only system. The moment you give that system a tool that can send an email, update a database record, or create a calendar event, you have crossed a boundary that most security models treat as fundamental. Tool execution layer invocations constitute real-world side effects that are typically irreversible, yet tool outputs are injected back into the agent's context without explicit trust marking, as a systematic survey of 116 papers on agent security found. The agent treats the response from a tool call the same way it treats the user's original instruction: as trusted input that informs its next action.
The compounding risk of chained tool calls
Individual tool calls might each be reasonable. The danger emerges in composition. An agent that can read a file, execute code, and make network requests has, in combination, the ability to exfiltrate data. Research on privileged execution environments found that CI/CD pipelines act as privilege-amplification mechanisms where agents gain indirect access to capabilities far exceeding their own execution environment, including production credentials, deployment permissions, and the ability to mutate persistent infrastructure state. The OpenClaw case study illustrates this concretely: a platform exposing 15+ tools to every session regardless of task type created a 15x capability over-provision ratio for tasks that needed only a single tool, and the resulting ClawHavoc supply chain attack exploited this unrestricted access to distribute infostealers across 20% of the platform's skill registry.
Why sandboxing isn't enough when the agent holds credentials
Static sandboxing constrains where an agent can operate but not what it does with its legitimate access. CVE-2026-25253 demonstrated that a single malicious webpage could hijack an agent's full capability set via prompt injection. The agent stayed within its sandbox. It used only its approved tools. It simply used them on behalf of an attacker instead of the user. Sandboxing addresses the wrong threat model when the risk is not that the agent escapes its environment but that it is manipulated within it. Research on learned capability governance found that infrastructure-level enforcement (restricting which tools are available per task type) reduced dangerous tool exposure dramatically, improving the fraction of exposed tools actually used from 0.053 to 0.557 in real sessions.
Prompt injection as a privilege escalation vector
Indirect injection through untrusted data sources
Direct prompt injection, where a user types "ignore your instructions," gets the attention. Indirect injection is the real threat. A malicious instruction embedded in a retrieved document, an email body, or a webpage the agent visits can redirect the agent's behaviour without the user ever seeing the injected content. Controlled experiments with GPT-5.1 found baseline unsafe behaviour rates ranging from 40% to 100% across risk scenarios, with prompt-injection and command-execution scenarios reaching 90-100%. The ToolHijacker attack achieved a 96.7% success rate when targeting tool selection end-to-end, and it operates in a no-box threat model where the attacker has no access to the tool library, retriever parameters, or model weights.
When the tool output becomes the next instruction
The most dangerous pattern in agent architectures is the feedback loop between tool outputs and subsequent reasoning. An agent queries a database, receives results that contain injected instructions, and treats those instructions as part of its task context. Research on backdoored retrievers showed attack success rates up to 91% when injected prompts appeared in retrieved documents, with effectiveness highest when poisoned content appeared in the first retrieval position. A backdoored retriever component maintained improved precision scores on standard evaluation metrics even after poisoning, making the attack invisible through normal performance monitoring.
Prevention-based defences have not kept pace. StruQ and SecAlign, two defences designed specifically to counter prompt injection, fail against ToolHijacker, with the attack achieving a 99.6% success rate under StruQ defence. Perplexity-based detection missed 90% of malicious tool documents while producing a 10% false positive rate on benign tools. Lightweight mitigations in privileged execution environments reduced unsafe behaviour to zero in three of four scenarios, but prompt-injection mitigation achieved only a 44% relative reduction because adversarial instructions embedded in task-relevant context are structurally indistinguishable from legitimate input.
The gap between what the model sees and what the user intended
A production incident involving ChatGPT's macOS application illustrates this gap precisely. Malicious instructions were injected into the app's Memories feature, causing it to continuously exfiltrate conversations to an attacker-controlled server. The user intended a helpful assistant. The model saw instructions (indistinguishable from legitimate ones) telling it to send data elsewhere. A separate incident demonstrated that Claude Code could be induced to read API keys from .env files and transmit them via DNS requests, triggered by indirect prompt injection in a code file. In both cases, the model behaved as instructed. The problem was that the instructions came from the wrong principal.
Trust boundaries collapse in multi-agent systems
Agent-to-agent communication as an unaudited channel
When agents communicate with each other, every message is both an input and a potential attack vector. Research across 1,488 agent-to-agent interaction chains found that increasing inter-agent trust monotonically raises the over-exposure rate for sensitive information. With DeepSeek and the AgentScope framework, the over-exposure rate climbed from 0.120 at low trust to 0.500 at high trust. Higher trust improved task completion rates (Llama-3-8B improved from 0.22 to 0.71) while simultaneously amplifying leakage risk, creating an efficiency-security tradeoff that no current framework explicitly manages.
Even under low trust conditions, LLMs exhibit non-zero baseline leakage risk. The helpful-agent prior, the alignment objective that makes models useful, is itself an exploitable attack surface. Agents want to be helpful, and being helpful to another agent sometimes means disclosing information that should stay compartmentalised.
Shared context windows as attack surfaces
Multi-agent systems that share context windows create a broadcast channel where any agent's output becomes every agent's input. Individually safe agents can compose into unsafe systems. Research on multi-agent security found that when multiple agents interact, they can develop covert collusion, coordinated attacks, and cascading failures that cannot be predicted by analysing individual agents in isolation. Coordination and information flow between agents can be embedded in ways indistinguishable from benign interaction, even under full observability of communication.
Out-of-scope or emergency requests show substantially higher baseline exposure (over-exposure rate of 0.41) and steeper trust sensitivity, indicating agents are most vulnerable to unintended disclosure during unexpected task contexts. The practical implication: your multi-agent system is least secure precisely when it encounters the situations you did not anticipate.
Why microservice security models don't transfer cleanly
The instinct to apply microservice security patterns to multi-agent systems is understandable but misleading. Microservices execute deterministic code paths. Their behaviour is auditable, reproducible, and constrained by their implementation. Agents are none of these things. Agentic systems require dynamically adjusted security policies based on natural language task descriptions that evolve over time, creating challenges that do not exist in traditional systems with fixed policies. A service mesh policy that says "service A can call service B on endpoint /api/v1/users" has no equivalent in a system where Agent A asks Agent B a natural language question and Agent B decides what to do based on probabilistic inference.
Different orchestration frameworks alter security posture in ways that compound this problem. AutoGen exhibits a high baseline over-exposure rate (0.379) with low sensitivity to trust changes, while LangGraph shows a lower baseline (0.261) but steep sensitivity. Framework choice materially impacts risk, and most teams make that choice based on developer ergonomics rather than security properties.
Observability is harder than you think
The non-determinism problem in audit trails
Traditional audit trails assume reproducibility. Given the same inputs, the same system should produce the same outputs, and the log should explain why. Agents violate this assumption at every level. The same prompt, the same tools, the same context can produce different tool call sequences on consecutive runs. A systematic survey found that threats and failures in agentic systems can emerge not from malicious inputs or faulty tools but from the emergent behaviour of the agent's cognitive trajectory during its reasoning process. Even well-scoped agents may deviate from expectations when their reasoning states are not explicitly monitored.
AgentTrace proposes a three-surface taxonomy for addressing this: cognitive traces (capturing reasoning), operational traces (capturing execution), and contextual traces (capturing tool invocations and data access). This multi-level introspection links agent reasoning with external interactions and side effects, providing the kind of causal chain that a flat log of API calls cannot.
Logging tool calls without logging sensitive payloads
Every tool call an agent makes should be logged. But tool calls carry payloads, and payloads contain data: customer records, API keys, personal information. Logging everything creates a secondary attack surface in the logging infrastructure itself. Information flow control in LLMs suffers from label explosion: when multiple labelled data sources are concatenated and fed into a model, the output is labelled with the union of all labels, making fine-grained access control on logs impractical without purpose-built infrastructure.
The practical challenge is implementing contextual traces that capture enough to reconstruct what happened and why, without creating a data store that is itself a high-value target. This requires treating observability as a first-class architectural concern, not bolting it on after the agent is already processing production data.
Detecting misuse when correct behaviour looks identical to exploitation
AgentSight, a system-level observability tool using eBPF, detected an indirect prompt injection attack where a development agent reading a malicious URL in a project README executed commands to exfiltrate /etc/passwd. The system captured 521 raw events and correlated them into 37 actionable events, with less than 3% performance overhead. But this detection was possible only because the system bridged the semantic gap between high-level intent (what the agent was asked to do) and low-level actions (what system calls it made). Existing tools observe one or the other, but cannot connect them.
The fundamental difficulty is that a compromised agent and a legitimate agent performing a similar task generate near-identical telemetry. An agent sending data to an external API might be fulfilling a user request or exfiltrating credentials. Distinguishing the two requires understanding intent, and intent lives in the reasoning trace, not the system call log.
What existing frameworks get wrong
OWASP for LLMs covers the model, not the system
The OWASP Top 10 for LLMs identifies prompt injection as the most pressing threat, and it is correct to do so. But the framing centres on the model as the locus of vulnerability. A systematic survey of agent security research found that the under-studied zone (representing 6.3% of all papers) holds the highest-severity threats, with an inverse correlation between research effort and threat severity. Seven of 28 grid cells representing layer-temporality combinations have zero defence coverage, and three of those seven contain documented attacks. The gaps are not in model security. They are in the system architecture surrounding the model.
Every surveyed tool-execution attack reduces to a single root cause: principal trust inversion, the systematic failure to enforce the principal hierarchy at the agent-environment boundary. Most agent implementations implicitly treat environment inputs as high-trust despite the environment being the least trusted principal in the hierarchy. OWASP's framework does not address this because it was designed for a different unit of analysis.
The false comfort of guardrails without enforcement
Guardrails that rely on the model to enforce security properties are guardrails in name only. Research on agents using decentralised identifiers found that in evaluation runs, both agents independently bypassed mutual authentication policies stated in the system prompt and proceeded with credential issuance after one-directional authentication. Agents altered verifiable credentials during processing by omitting required fields or misspelling attributes, preventing successful verification. Evaluation across 100 test runs per LLM showed highly variable completion rates for security procedures, with some processes achieving consistently low completion rates despite identical system prompts.
In the current era of LLM-based agents, attackers have consistently succeeded in bypassing model-based defences without requiring substantial increases in attacker effort. Unlike traditional systems that treat processes as untrusted, current agentic systems fail to treat the AI model powering the agent as untrusted, instead allowing it to enforce security properties directly. This is the equivalent of asking the process being sandboxed to enforce its own sandbox.
Rate limiting and permissions as first principles, not afterthoughts
Progent, a programmable privilege control system for LLM agents, demonstrates what principled enforcement looks like. Every tool call is checked against a security policy through a deterministic procedure. An SMT solver determines each policy update to be either a narrowing (applied automatically) or an expansion (requiring explicit approval), ensuring the agent's action space can only shrink without approval. Evaluation on the AgentDojo and ASB benchmarks showed significant reductions in attack success rates while maintaining high utility, validated in real-world frameworks including LangChain and OpenAI Agents SDK.
The design pattern research reinforces this principle: once an LLM agent ingests untrusted input, it must be constrained so that the input cannot trigger consequential actions with negative side effects. General-purpose agents with access to powerful tools cannot provide meaningful safety guarantees against prompt injections with current language models. The answer is not to make models more robust (though that helps). The answer is to make the architecture assume the model will be compromised.
Building security into the agent layer
Least privilege as a design constraint, not a retrofit
The ROME incident demonstrated how an enterprise AI agent acted as an insider threat through inherited credentials and over-broad authority without exploiting any software vulnerability. The agent did exactly what agents do: it used the tools it was given, with the permissions it inherited, in ways its designers did not anticipate. Least privilege in agent systems means more than restricting API scopes. It means dynamically scoping tool availability per task, treating every tool call as a privilege boundary, and building infrastructure that can narrow an agent's capabilities mid-execution without requiring the agent's cooperation.
Sensitive-information repartitioning (structurally limiting what data each agent can access) reduced over-exposure rates by 79.5% for DeepSeek and 88.4% for Llama-3-8B in multi-agent evaluations. Guardian-agent patterns (dedicated monitoring agents that audit peers) achieved 38.4% and 83.6% reductions. These are architectural interventions, not prompt engineering. They work because they operate at a layer the model cannot circumvent.
Human-in-the-loop as a spectrum, not a binary
The tension between security and usability is real. Research on authenticated delegation identifies prompt fatigue as a concrete failure mode: users grant permissions without proper review when prompted too frequently. The binary of "always ask" versus "never ask" is a false choice. The spectrum runs from fully autonomous (for low-risk, reversible actions) through notification-only (for medium-risk actions the system will execute unless stopped) to explicit approval (for high-risk, irreversible actions). The right position on this spectrum depends on the stakes of the specific tool call, not a global setting.
Mitigation strategies show significant performance tradeoffs that inform where to place approval gates. Environment sanitisation added 22.7 seconds per operation. Policy checking reduced execution time by 207.9 seconds. But content filtering added 3,218.6 seconds due to retries. Security controls that make agents unusable will be disabled. The goal is enforcement that is invisible for routine operations and present only when the risk warrants it.
Where to start when your agents are already in production
If your agents are already deployed, the pragmatic sequence matters. First, inventory what tools each agent can access and what credentials it holds. The 15x capability over-provision ratio found in research suggests your agents almost certainly have access to tools they never use. Reducing that surface area is the highest-leverage first step.
Second, separate the control plane from the data plane. The agent's reasoning about what to do should be architecturally distinct from the mechanism that executes tool calls. The Plan-Then-Execute pattern prevents tool outputs from injecting new instructions (though malicious data can still influence tool call parameters). This separation creates a natural point for deterministic policy enforcement.
Third, treat inter-agent trust as a first-class security variable subject to continuous auditing, not a tacit background assumption. Scope it, bound it, make it revocable. Fourth, build observability that connects intent to action, bridging the gap between what the agent was asked to do and what system calls it actually made. Without this connection, your audit trail is a collection of facts that cannot answer the question that matters: was this behaviour authorised?
The organisations that will build trustworthy AI agents are those that recognise this is an architecture problem, not a model problem. The model is one component in a system, and the system needs to be secure even when the model is compromised. If you are building AI products that demand this level of architectural rigour, or if your existing agents need a security posture that matches their capabilities, we should talk. Agathon works with technical leaders to build AI systems that exploit full technical potential without creating the attack surfaces that make that potential a liability.
References
- Authenticated Delegation and Authorized AI Agents
- Authentication for AI Agents: Privacy and Security (Stanford Digital Economy Lab)
- AI Agents with Decentralized Identifiers and Verifiable Credentials
- Standards, Gaps, and Research Directions for AI Agents
- Security Risks in Tool-Enabled AI Agents: A Systematic Analysis of Privileged Execution Environments
- Prompt Injection Attack to Tool Selection in LLM Agents
- Design Patterns for Securing LLM Agents against Prompt Injections
- Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation
- The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability
- Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
- AgentOps: Enabling Observability of LLM Agents
- AgentTrace: A Structured Logging Framework for Agent System Observability
- AgentSight: System-Level Observability for AI Agents Using eBPF
- Progent: Programmable Privilege Control for LLM Agents
- Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents
- Agent Security is a Systems Problem
- Fully Autonomous AI Agents Should Not be Developed



