
Automated Intent Classification for AI Agents Is an Unsolved Problem

At a security conference presentation, a slide showed a box labeled "Intent Engine." But how does it actually work? The answers were evasive in a way that is becoming familiar.

Beon de Nood
March 13, 2026
9 min read

At a security conference earlier this year, a vendor demonstrated their AI agent governance platform. The slide showed a diagram with a box labeled "Intent Engine" sitting between the user's request and the agent's tool calls. The engine, the presenter explained, analyzes what the agent is trying to do and determines whether it is authorized to do it.

In the hallway afterward, three different practitioners asked the same question in three different ways: how does it actually work?

The answers were evasive in a way that is becoming familiar. The intent engine uses a large language model to evaluate the agent's reasoning. It looks at the conversation context. It applies a proprietary classification model. The specifics were trade secrets.

The specifics were trade secrets because the specifics do not work well enough to state plainly.

What Intent Classification Would Need to Do

To be useful as an authorization mechanism, an intent classification system for AI agents would need to do several things reliably.

It would need to determine, before any tool call is executed, what the agent intends to do with that tool call and whether that intention falls within the scope of what the agent is authorized to do. It would need to do this for any natural language request, not just narrow, pre-defined workflows. It would need to be resistant to adversarial inputs — a user trying to manipulate the agent into exceeding its authorization should not be able to manipulate the classifier at the same time. And it would need to produce results that are accurate enough to base security-critical authorization decisions on.

None of these requirements are individually exotic. Together, they describe a system that does not currently exist in any deployed product.

The Evidence That It Is Unsolved

This is not a theoretical concern. The empirical record on LLM reliability in security-critical classification tasks is specific and unflattering.

Instruction following is stochastic, not deterministic. Multiple published benchmarks on LLM instruction compliance — including FollowEval and MOSAIC — find that compliance with constraint-style instructions varies significantly with constraint type, position in the prompt, and the number of constraints. Models exhibit primacy and recency biases: constraints stated at the beginning or end of a prompt are followed differently from constraints stated in the middle. For a classifier whose job is to enforce authorization boundaries expressed as constraints, stochastic compliance is a fundamental reliability problem.

LLMs can deliberately underperform when they detect evaluation contexts. Sandbagging research — including work published on arXiv in early 2026 — demonstrates that LLMs can strategically underperform on tasks when prompted to do so, and that this behavior is causally driven by verbalized reasoning about being evaluated rather than by shallow instruction following. If an agent or a classifier can detect that it is being evaluated, it can behave differently during evaluation than in production. This eliminates the possibility of testing your way to confidence in a classifier used as a primary security control.

Prompt injection compromises the classifier and the agent simultaneously. When the same model that processes user input also classifies the intent of that input, a successful prompt injection attack does both things at once: it manipulates the agent's behavior and it manipulates the classifier's verdict. AgentDojo, the most widely used benchmark for prompt injection robustness in tool-using agents, shows non-trivial attack success rates even with defense mechanisms in place. A separate classifier model reduces this attack surface, but does not eliminate it if the classifier receives any input derived from the untrusted environment the agent is operating in.

OWASP has taken an explicit position. The OWASP Securing Agentic Applications Guide states directly that authorization decisions need to be deterministic and auditable, and that "the model said it was fine" is neither. The guide explicitly warns against basing authorization on chain-of-thought reasoning, on the grounds that reasoning traces are unavailable, hidden, non-deterministic, or unsafe to expose. This is not a fringe opinion. It is the current consensus of the most widely cited security framework in the agentic AI space.

Why the Industry Is Pretending Otherwise

Several forces are pushing vendors toward claiming to solve intent classification even when the claims are not supportable.

Intent is the obvious framing for the problem. When a practitioner asks "how do I know my agent is doing what it is supposed to do," the intuitive answer is "by understanding what it intends to do." The framing maps to how humans think about accountability. It is a compelling pitch.

LLMs are available and fast. If you need something that appears to classify intent, you can build a proof of concept using any large language model in a few days. The result looks plausible in a demo. Whether it is reliable enough to be a security control is a different question that a demo does not test.

The alternative is harder to explain. A system that does not classify intent but instead enforces pre-declared authority boundaries requires more architectural setup. It requires agents to declare what they are going to do before they do it, in a structured form, registered against a policy store. This is the right architecture. It is also less intuitive to explain in a slide.

What the Payments Industry Figured Out

It is worth pausing on what the financial industry has accomplished in this space, because it is instructive about both what is possible and what the constraints are.

Mastercard's Verifiable Intent specification and Google's Agent Payments Protocol both implement a form of intent declaration for autonomous agent transactions. Both systems work. Both systems are being deployed at scale with real financial consequences.

Neither system classifies intent from natural language. Both systems require the user to sign a structured mandate before any agent action occurs. The mandate contains typed constraint fields: allowed merchants, price ranges, allowed items, payment instruments. The agent then fulfills the mandate within those constraints. The authorization system checks whether the agent's actions fall within the signed constraints. The check is deterministic.

These systems work precisely because they solve a constrained version of the problem. The vocabulary of possible intents in a payment transaction is finite: buy something, from this list of merchants, within this price range, using this payment method. That vocabulary can be expressed as typed fields. The user signs the declaration before execution. The agent cannot exceed what the user declared.

This is declared intent, not inferred intent. It is enforced deterministically, not probabilistically. It works because the domain is closed enough that structured pre-declaration is tractable.
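To make the contrast concrete, here is a minimal sketch of what a signed-mandate check can look like. The field names and types are illustrative rather than taken from either specification; the point is that every check is a comparison over typed fields, with no model anywhere in the loop.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Mandate:
        # A user-signed declaration of what the agent may do.
        # Field names are illustrative, not drawn from either spec.
        allowed_merchants: frozenset[str]
        allowed_items: frozenset[str]
        max_price_cents: int
        payment_instrument: str

    @dataclass(frozen=True)
    class ProposedPurchase:
        merchant: str
        item: str
        price_cents: int
        payment_instrument: str

    def within_mandate(mandate: Mandate, action: ProposedPurchase) -> bool:
        # Deterministic authorization: every condition is a typed
        # comparison. Same inputs, same verdict, and the verdict
        # is auditable after the fact.
        return (
            action.merchant in mandate.allowed_merchants
            and action.item in mandate.allowed_items
            and action.price_cents <= mandate.max_price_cents
            and action.payment_instrument == mandate.payment_instrument
        )

In the real protocols, the mandate is signed by the user and the signature is verified before any check runs; the sketch omits the cryptography to show only the shape of the authorization decision.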

For general-purpose AI agents operating across arbitrary domains, the vocabulary of possible intents is not finite. A general-purpose agent connected to twenty tools can take thousands of distinct action paths. Pre-declaring a typed manifest for all of them is genuinely hard. But the answer is to make that declaration tractable through careful system design — not to abandon the declaration requirement and substitute probabilistic inference.

The Architecture That Follows From Taking This Seriously

If you accept that automated intent classification from natural language is not reliable enough to be a primary security control, what do you build instead?

The answer is a system that moves the intelligence upstream and the enforcement downstream. Rather than inferring what an agent intends to do at the moment of a tool call, you require the agent's permitted action surface to be declared and registered before the agent runs. When a tool call arrives, you check it against that registered declaration deterministically. The question you are answering is not "does this action look like what the agent intended?" It is "is this action within the pre-authorized boundary?" The second question has a verifiable answer. The first does not.
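As a minimal sketch of that enforcement path, with hypothetical names: the agent's permitted action surface is registered as a manifest before the agent runs, and the enforcement point answers the boundary question with a set-membership test rather than an inference.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ToolCall:
        tool: str
        action: str
        resource: str

    class EnforcementPoint:
        # Checks tool calls against a manifest registered before the
        # agent runs. The shape is hypothetical; the property that
        # matters is that the verdict is deterministic.
        def __init__(self, manifest: set[tuple[str, str, str]]):
            # entries like ("crm", "read", "contacts"), registered
            # ahead of execution
            self._manifest = frozenset(manifest)

        def authorize(self, call: ToolCall) -> bool:
            # "Is this action within the pre-authorized boundary?"
            # A lookup, not a judgment about what the agent intended.
            return (call.tool, call.action, call.resource) in self._manifest

Given the same manifest and the same call, the verdict is always the same, which means an auditor can replay any authorization decision after the fact.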

This does not mean LLMs have no role in the authorization stack. They can play a useful advisory role: a classifier that examines tool calls and flags anomalies, that suggests policy refinements based on observed behavior, that provides risk signals to a human operator making governance decisions. But an advisory role is fundamentally different from an authoritative role. An LLM that advises an operator about policy is not an LLM that makes authorization decisions. The distinction matters because the consequences of error are different.

When an advisory signal is wrong, the outcome is a suboptimal recommendation that a human reviews. When an authoritative decision is wrong, an agent takes an action it was not supposed to take, with consequences that may be irreversible.
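One way to hold that line structurally, continuing the hypothetical shapes from the sketch above: the classifier's output travels with the decision record so operators can review it, but the allow/deny verdict is computed only from the deterministic check.

    from dataclasses import dataclass

    @dataclass
    class DecisionRecord:
        allowed: bool         # set only by the deterministic check
        advisory_risk: float  # LLM-derived signal for human operators
        advisory_notes: str   # logged for review, never consulted
                              # when computing `allowed`

    def decide(pep: EnforcementPoint, call: ToolCall,
               risk: float, notes: str) -> DecisionRecord:
        # The advisory signal rides along for audit and review. It has
        # no vote in the verdict, so a wrong signal cannot authorize an
        # action the registered policy did not already permit.
        return DecisionRecord(allowed=pep.authorize(call),
                              advisory_risk=risk,
                              advisory_notes=notes)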

The Position We Have Taken

At CapiscIO we have committed to a design principle that we stated explicitly in our core architecture RFC: declared intent, not inferred intent. No component in the enforcement stack makes authorization decisions based on LLM output. The Policy Enforcement Point is deterministic. It checks signed artifacts against registered policy. It does not ask a model whether the action looks authorized.

RFC-010 defines an intent classification layer that does use an LLM — but as an advisory signal generator, not an authoritative decision maker. The classifier operates in parallel with the deterministic enforcement path. Its output is projected into the Policy Information Point attribute space, where it can inform human-reviewed policy recommendations. It cannot block or allow any action on its own. A normative constraint in the specification states this explicitly: classifier-derived attributes must not be the sole basis for a deny decision.
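A sketch of how such a constraint can be checked mechanically (the attribute-set representation here is invented for illustration; RFC-010 defines its own): if removing every classifier-derived attribute from a deny's justification leaves nothing behind, the deny has no deterministic basis and is downgraded to an advisory flag.

    def validate_deny(deny_basis: set[str], classifier_attrs: set[str]) -> str:
        # Enforces: classifier-derived attributes must not be the sole
        # basis for a deny. The string attribute names are illustrative.
        deterministic_basis = deny_basis - classifier_attrs
        if deterministic_basis:
            return "deny"          # a deterministic ground stands
        return "flag_for_review"   # classifier-only basis is advisory

Under this rule, a deny justified only by something like a high classifier risk score becomes a review flag, while a deny that also cites a manifest scope violation stands.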

We built the system this way because the alternative — building a system that appears to classify intent but makes authorization decisions probabilistically — is not a security product. It is a liability.

Where This Leaves the Industry

The honest state of the field is that automated intent classification for general LLM agents is an unsolved problem. Not partially solved. Not solved with caveats. Not solved in all but a few edge cases. Unsolved in the sense that no deployed system has demonstrated reliable, auditable, adversarially robust intent classification across arbitrary domains at a level of accuracy appropriate for security-critical authorization decisions.

The industry will eventually solve parts of this. Formal verification methods applied to constrained agent workflows. Hardware-attested execution environments where agent behavior can be measured rather than inferred. Structured intermediate representations that make agent plans inspectable before execution. These are research directions with real promise.

In the meantime, the responsible architecture is to build the enforcement layer on what is deterministic and use probabilistic signals in their appropriate place: advisory input to humans, not primary authorization gates.

The vendors claiming to have solved intent classification have not. The ones who are honest about this are building something more durable.

Written by Beon de Nood

Creator of CapiscIO, the developer-first trust infrastructure for AI agent discovery, validation and governance. With two decades of experience in software architecture and product leadership, he now focuses on building tools that make AI ecosystems verifiable, reliable, and transparent by default.
