Today's model defenses are trained against fixed lists of known-bad prompts, such as HarmBench's harmful-intent set or AgentDojo's stock injection templates. Evaluated against those same static prompts and injections, they look successful, and they perform well on the benchmarks they were built for. Yet they fail against real adaptive attacks designed specifically to exploit the models.
These attacks optimize their inputs with ML techniques such as gradient descent, reinforcement learning, or search-based optimization, or rely on manual adversarial probing, and they adapt to defeat whichever defense is in front of them. They don't just adapt; they break agents and retrieve protected information with ease.
Here at Cimento, we work with agents constantly: reviewing pull requests, editing code, designing company merch, and even building an agent to help furnish our new office. These workflows are powerful. They're also not secure yet. Allowing an agent to edit a document is one thing; handing over your credit card information or production credentials is another. The implication is straightforward: model-level defenses can't protect the agents already running in production. The defense layer must move.
The clearest evidence is the Nasr et al. 2026 result, in which a team from OpenAI, Anthropic, and Google DeepMind broke twelve recent defenses with attack success rates above 90%. Each of these defenses had originally reported a near-zero attack success rate. Human teams were even more successful, often achieving 100% attack success rates against these “strong” defenses.
The obvious move is to train the models harder, or to stack more models on top. Unfortunately, as Yin et al. 2026 show, such models become brittle and, in the authors' words, “block both malicious and benign context.” Meanwhile, attacks still succeed 95% of the time, even against current models such as GPT-5-nano. Model-based security fails to keep agents useful while protecting the user’s information and intent.
How, then, do we maintain agent security? The answer lies in working outside the models and treating them as high-risk users, just as we do with humans. High-risk users aren't new. They are any actors we extend capability to without yet trusting their judgment. New employees get scoped permissions on day one. Contractors get expiring credentials. Service accounts get audit logs. We've been building security policies for high-risk users for decades. AI agents are just the newest entry on the list. They’re faster to spin up, cheaper to scale, and easier to underestimate.
As with other high-risk users, this means observing behavior, scoping permissions, requiring human approval for high-stakes actions, and monitoring whether the agent stays within the lines. It means creating a system-level security layer that preserves the usefulness of agents while scoping their actions to exactly what they're allowed to touch. Model security is non-deterministic. System security isn't. That's the whole point.
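To make the idea of a system-level layer concrete, here is a minimal sketch of a deterministic gate that sits between an agent and its tools. It is an illustration, not a description of any particular product: the names `AgentGrant`, `Gate`, and the action strings are all hypothetical, and a real deployment would back this with an actual identity and policy system.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AgentGrant:
    """Hypothetical scoped, expiring grant issued to a single agent."""
    agent_id: str
    allowed_actions: frozenset[str]   # everything else is denied
    needs_approval: frozenset[str]    # high-stakes actions gated on a human
    expires_at: datetime              # like a contractor's expiring credential


@dataclass
class Gate:
    """Decides outside the model, and records every decision."""
    grant: AgentGrant
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def check(self, action: str, now: datetime, human_approved: bool = False) -> str:
        if now >= self.grant.expires_at:
            decision = "deny"                 # grant has expired
        elif action not in self.grant.allowed_actions:
            decision = "deny"                 # out of scope
        elif action in self.grant.needs_approval and not human_approved:
            decision = "needs_approval"       # pause for a human
        else:
            decision = "allow"
        self.audit_log.append((self.grant.agent_id, action, decision))
        return decision


# Example: a PR-review agent that may comment freely but needs sign-off to merge.
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
gate = Gate(AgentGrant(
    agent_id="pr-reviewer",
    allowed_actions=frozenset({"read_repo", "comment_on_pr", "merge_pr"}),
    needs_approval=frozenset({"merge_pr"}),
    expires_at=issued + timedelta(hours=1),
))
print(gate.check("comment_on_pr", issued))
print(gate.check("merge_pr", issued))
print(gate.check("delete_repo", issued))
```

The same input always produces the same decision, no matter what the model was prompted or tricked into asking for, and every request lands in the audit log whether it was allowed or not.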
Agents are the new high-risk users, and the defense layer that can hold them is the system layer. There, capabilities are scoped, actions are authorized against real user permissions, and an agent's blast radius is constrained by what the human behind it can actually do. If a finance agent's user can't wire more than $10k without a cosigner, neither can the agent, with or without model-level rules, because the system's controls, not the model, should decide.
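The wire-transfer example can be sketched in a few lines. This is only an illustration under stated assumptions: `UserPolicy`, `WireRequest`, and `authorize_wire` are hypothetical names, amounts are whole dollars, and the rule that a cosigner must be someone other than the requesting user is our own addition. The point it shows is that the check runs against the human's limits and never consults the model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class UserPolicy:
    """Hypothetical per-user limit, as held by the system of record."""
    user_id: str
    wire_limit_without_cosigner: int  # whole dollars


@dataclass(frozen=True)
class WireRequest:
    on_behalf_of: str            # the human the agent is acting for
    amount: int                  # whole dollars
    cosigner: str | None = None  # a second human, if one has signed


def authorize_wire(request: WireRequest, policies: dict[str, UserPolicy]) -> bool:
    """Apply the user's own limit to the agent's request."""
    policy = policies.get(request.on_behalf_of)
    if policy is None:
        return False  # unknown principal: deny by default
    if request.amount <= policy.wire_limit_without_cosigner:
        return True
    # Over the limit: require a cosigner who is not the requesting user.
    return request.cosigner is not None and request.cosigner != request.on_behalf_of


policies = {"alice": UserPolicy("alice", wire_limit_without_cosigner=10_000)}

print(authorize_wire(WireRequest("alice", 5_000), policies))                    # True
print(authorize_wire(WireRequest("alice", 25_000), policies))                   # False
print(authorize_wire(WireRequest("alice", 25_000, cosigner="bob"), policies))   # True
```

Nothing in the request describes the agent's prompt, reasoning, or model; a successful injection can change what the agent asks for, but not what the system will grant.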
Our take: treat the human and the agent as a single unit. Analyze their combined behavior, quantify their combined risk, and let the system decide what they can do. That's where the next year of agent security work has to live.


