Failure Taxonomy
Invarium’s structured approach to classifying agent failures.
- ✓ Invarium classifies failures into 9 categories so you know exactly what went wrong and how to fix it
- ✓ Each failure carries a severity level that impacts your Agent Quality Score
- ✓ Categories span the full spectrum of agent behavior, from factual accuracy to safety to multi-agent coordination
Why It Matters
Invarium’s failure taxonomy is a structured classification system that categorizes every way an AI agent can fail — enabling you to move from “the test failed” to “here’s what went wrong, how severe it is, and what to fix.”
When an AI agent produces a wrong answer, “it failed” is not enough information to fix the problem. You need to know:
- What type of failure — Did it hallucinate, use the wrong tool, or get jailbroken?
- How severe — Is it a minor communication issue or a critical safety violation?
- How to fix it — Which part of your agent’s stack needs attention?
Invarium answers all three questions for every test case, enabling you to prioritize fixes by severity, track failure trends across test runs, and identify systemic weaknesses in your agent.
The 9 Failure Categories
Invarium’s taxonomy organizes agent failures into 9 top-level categories. Each test case that fails is classified into one of these categories with a specific severity level.
| Category | What It Catches | Example Failures |
|---|---|---|
| Knowledge | Factual accuracy issues | Hallucinated facts, fabricated citations, outdated information, entity confusion |
| Reasoning | Flawed logic and planning | Incorrect inferences, calculation errors, incomplete plans, circular reasoning |
| Context | Conversation tracking issues | Lost context, goal drift, wrong references, state amnesia |
| Instruction | Constraint violations | Misunderstood requests, incomplete execution, format violations, scope creep |
| Tool Usage | Incorrect tool handling | Wrong tool selected, missing calls, bad parameters, ignored results |
| Safety | Security and safety breaches | Prompt injection, PII exposure, jailbreaks, harmful advice, guardrail bypass |
| Communication | Unhelpful responses | Vague answers, robotic tone, excessive caveats, wrong refusals |
| Operational | Infrastructure failures | Timeouts, rate limits, resource exhaustion, non-deterministic behavior |
| Coordination | Multi-agent workflow issues | Lost handoffs, deadlocks, race conditions, redundant work |
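For teams wiring the taxonomy into their own tooling, the nine categories map naturally onto an enum. A minimal Python sketch follows; the identifier names are our own illustration, not an Invarium API:

```python
from enum import Enum

class FailureCategory(Enum):
    """The 9 top-level failure categories; member names are illustrative."""
    KNOWLEDGE = "knowledge"          # factual accuracy issues
    REASONING = "reasoning"          # flawed logic and planning
    CONTEXT = "context"              # conversation tracking issues
    INSTRUCTION = "instruction"      # constraint violations
    TOOL_USAGE = "tool_usage"        # incorrect tool handling
    SAFETY = "safety"                # security and safety breaches
    COMMUNICATION = "communication"  # unhelpful responses
    OPERATIONAL = "operational"      # infrastructure failures
    COORDINATION = "coordination"    # multi-agent workflow issues
```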
Severity Levels
Every failure is assigned a severity level that determines how heavily it impacts your Agent Quality Score (AQS). Higher-severity failures have a disproportionately larger impact on your score.
| Level | Name | Impact |
|---|---|---|
| S1 | Critical | Immediate harm, security risk, or complete task failure. |
| S2 | High | Significant incorrect behavior the user may act on. |
| S3 | Medium | Incorrect but limited real-world impact. |
| S4 | Low | Minor issue, task still completable. |
| S5 | Cosmetic | Negligible impact on the user. |
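The "disproportionately larger impact" of high-severity failures can be modeled as an exponential weighting. The sketch below uses assumed weights for illustration only; Invarium's actual AQS formula is internal to the platform.

```python
# Hypothetical per-severity weights (assumed values, not Invarium's internals).
# Each level roughly doubles the penalty of the level below it.
SEVERITY_WEIGHTS = {"S1": 16.0, "S2": 8.0, "S3": 4.0, "S4": 2.0, "S5": 1.0}

def total_penalty(severities):
    """Sum the penalty for a list of severity labels like ["S1", "S3"]."""
    return sum(SEVERITY_WEIGHTS[s] for s in severities)
```

Under these assumed weights, fixing a single Critical (S1) failure removes as much penalty as fixing sixteen Cosmetic (S5) ones.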
Severity is assigned based on the potential impact, not the specific test scenario. A hallucination about medical dosage is Critical even if the test case is synthetic.
How It Works in Practice
When you generate test scenarios, Invarium targets specific failure patterns from the taxonomy. When you run tests and sync results, each failed test case is classified with:
- Failure category — which of the 9 categories it belongs to
- Severity level — how serious the failure is
- Actionable context — what went wrong and guidance on how to fix it
This classification feeds directly into your AQS and helps you prioritize fixes. For example, a cluster of Safety failures means your guardrails need hardening, while a cluster of Tool Usage failures suggests your tool descriptions need improvement.
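A synced result might be represented as a record like the one sketched below. The field names are illustrative, not Invarium's actual schema; the sketch shows how grouping classified failures by category surfaces the kinds of clusters described above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ClassifiedFailure:
    # Illustrative fields, not Invarium's actual result schema.
    test_case_id: str
    category: str   # one of the 9 taxonomy categories
    severity: str   # "S1" (Critical) through "S5" (Cosmetic)
    context: str    # what went wrong and guidance on the fix

def cluster_by_category(failures):
    """Count failures per category to spot systemic weaknesses."""
    return Counter(f.category for f in failures)
```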
Prioritizing Fixes
Focus on the highest-severity failures first. Fixing a single Critical (S1) failure improves your AQS more than fixing several Cosmetic (S5) ones. Use the failure category to identify which part of your agent’s stack needs attention:
| Category | Where to Look |
|---|---|
| Knowledge | Grounding, retrieval, knowledge base freshness |
| Reasoning | Chain-of-thought prompting, calculation tools |
| Context | Memory management, conversation state tracking |
| Instruction | System prompt clarity, intent classification |
| Tool Usage | Tool descriptions, parameter validation |
| Safety | Guardrails, input sanitization, PII redaction |
| Communication | Output quality checks, tone settings |
| Operational | Timeouts, rate limiting, error recovery |
| Coordination | Handoff protocols, deadlock detection |
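The triage loop described above (sort by severity, then route by category) can be sketched in a few lines. The lookup dictionary is abbreviated to three entries, and its name and shape are our own, not an Invarium API:

```python
# Abbreviated category-to-stack-area lookup (see the full table above).
WHERE_TO_LOOK = {
    "Safety": "guardrails, input sanitization, PII redaction",
    "Tool Usage": "tool descriptions, parameter validation",
    "Knowledge": "grounding, retrieval, knowledge base freshness",
}

def triage(failures):
    """Order failures highest severity first and attach the area to inspect.

    Severity labels sort lexicographically: "S1" < "S2" < ... < "S5".
    """
    ordered = sorted(failures, key=lambda f: f["severity"])
    return [
        (f["severity"], f["category"], WHERE_TO_LOOK.get(f["category"], ""))
        for f in ordered
    ]
```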
FAQ
How does Invarium assign failure categories?
Invarium’s scenario generator targets specific failure patterns when creating test cases. When you sync results, the platform classifies each failure into the most specific applicable category.
Can a test case have multiple failure types?
Each failed test case is assigned a single primary category. In practice, failures can overlap, but Invarium classifies by the most specific applicable category.
Can I add custom failure categories?
Not currently. The taxonomy covers the full spectrum of known agent failure modes.