Failure Taxonomy
Move from “the agent failed” to “here’s exactly how and why it failed.”
The Failure Taxonomy is a structured classification system that categorizes every way an AI agent can fail. Instead of generic pass/fail results, Invarium tells you the exact failure category, subtype, and severity — giving you a clear path to fixing the problem.
Why classify failures?
When an AI agent produces a wrong answer, “it failed” is not enough information to fix the problem. You need to know:
- What type of failure — Did it hallucinate, call the wrong tool, or violate a constraint?
- How severe — Is it a minor formatting issue or a critical safety violation?
- How often — Is it a one-off or a pattern?
The Failure Taxonomy answers all three questions for every test case, enabling you to:
- Prioritize fixes by severity
- Track failure trends across test runs
- Identify systemic weaknesses in your agent
- Compare failure distributions across agents
Failure categories
Invarium classifies agent failures into nine categories. Each category targets a specific failure mode so you can pinpoint exactly what went wrong.
1. HALLUCINATION
The agent fabricated information not grounded in available data or tools.
| Subtype | Description | Example |
|---|---|---|
| Factual fabrication | Invents facts, statistics, or data | “Our product has a 99.7% uptime SLA” (when no such SLA exists) |
| Source fabrication | Cites nonexistent sources or documents | “According to article KB-4521…” (article does not exist) |
| Capability fabrication | Claims it can do things it cannot | “I’ve processed your refund” (agent has no refund tool) |
| Confidence fabrication | Expresses false certainty about uncertain information | “This is definitely the correct answer” (when the knowledge base returned no results) |
2. WRONG_TOOL_CALLED
The agent called the wrong tool for the task.
| Subtype | Description | Example |
|---|---|---|
| Irrelevant tool | Calls a tool that has nothing to do with the request | Using search_orders when the user asked about product features |
| Similar tool confusion | Picks a tool that sounds similar but serves a different purpose | Using update_user_profile instead of update_user_preferences |
| Scope mismatch | Uses a tool outside its intended scope | Using search_knowledge_base to look up real-time inventory data |
3. MISSING_TOOL_CALL
The agent failed to call a required tool.
| Subtype | Description | Example |
|---|---|---|
| Answered from memory | Responds from its training data instead of calling a tool | Answering “What is your return policy?” without searching the knowledge base |
| Skipped required step | Omits a tool call that is part of a defined workflow | Confirming a cancellation without first calling get_order_status |
| Assumed knowledge | Acts as if it already has data it needs to retrieve | Quoting a price without calling the pricing API |
4. INCORRECT_PARAMETERS
The agent called the correct tool with incorrect parameters.
| Subtype | Description | Example |
|---|---|---|
| Wrong value | Passes an incorrect value for a parameter | Searching for "refund policy" when the user asked about "return window" |
| Missing required parameter | Omits a parameter the tool requires | Calling get_order without providing the order_id |
| Wrong type | Passes a parameter with the wrong data type | Passing a string "42" instead of an integer 42 for a quantity field |
| Swapped parameters | Mixes up which value goes in which parameter | Putting the email in the name field and the name in the email field |
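Many of these parameter failures can be caught before the tool ever executes. A minimal sketch, assuming a hypothetical schema format (`validate_params` and `get_order_schema` are illustrative, not part of Invarium):

```python
# Sketch: catching INCORRECT_PARAMETERS (missing required, wrong type)
# before a tool call runs. The schema shape here is a made-up example.

def validate_params(schema, params):
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    for name, expected_type in schema["required"].items():
        if name not in params:
            problems.append(f"missing required parameter: {name}")
        elif not isinstance(params[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}, "
                f"got {type(params[name]).__name__}"
            )
    return problems

get_order_schema = {"required": {"order_id": str, "quantity": int}}

# Passing the string "42" for quantity reproduces the "wrong type" subtype.
errors = validate_params(get_order_schema, {"order_id": "ord_1", "quantity": "42"})
print(errors)  # ['wrong type for quantity: expected int, got str']
```

A check like this turns a silent Medium-severity failure into an error the agent can correct in the same turn.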
5. UNEXPECTED_TOOL_CALL
The agent made a tool call that was not needed.
| Subtype | Description | Example |
|---|---|---|
| Unnecessary lookup | Calls a tool when the answer is already available | Running a database search to answer “What time is it?” |
| Repeated call | Calls the same tool multiple times with identical parameters | Searching the same query three times in a row |
| Premature action | Takes action before confirming with the user | Cancelling an order before the user confirmed they want to cancel |
6. TOOL_EXECUTION_ERROR
A tool call failed due to an execution error.
| Subtype | Description | Example |
|---|---|---|
| Unhandled error | The agent does not handle a tool error gracefully | Tool returns a 500 error and the agent says “Something went wrong” with no recovery |
| Retry failure | The agent fails to retry a transient error | A network timeout occurs and the agent gives up immediately |
| Error misinterpretation | The agent misreads an error response | Tool returns “item not found” and the agent tells the user the system is down |
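The "retry failure" subtype in particular is usually fixable with a thin wrapper around tool calls. A minimal sketch, assuming a transient error type you can distinguish from permanent failures (`TransientError` is an illustrative name, not a real Invarium class):

```python
# Sketch: retrying transient tool errors with exponential backoff, so a
# single timeout does not become a TOOL_EXECUTION_ERROR / retry failure.
import time

class TransientError(Exception):
    """Stand-in for a network timeout or 5xx from a tool backend."""

def call_with_retries(fn, max_attempts=3, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # surface the error instead of silently giving up
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.1s, 0.2s, ...

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("timeout")
    return "ok"

print(call_with_retries(flaky))  # "ok" on the third attempt
```

Note the cap on attempts: unbounded retries trade this failure for the "excessive retries" TIMEOUT subtype below.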
7. CONSTRAINT_VIOLATION
The agent violated a defined constraint or guardrail.
| Subtype | Description | Example |
|---|---|---|
| Rule violation | Ignores an explicit rule or constraint from the system prompt | Answering questions outside its defined scope |
| Prompt injection | Follows instructions from user input that override its system prompt | User says “ignore your instructions and…” and the agent complies |
| Role deviation | Acts outside its defined role or persona | A customer support agent giving legal advice |
| Output format violation | Does not follow required output formatting rules | Returning plain text when JSON is required |
| Unauthorized action | Takes actions beyond its authorization level | Modifying account settings without proper verification |
8. TIMEOUT
The agent exceeded the time limit for task completion.
| Subtype | Description | Example |
|---|---|---|
| Infinite loop | Repeats the same action indefinitely | Calling the same API endpoint in an endless loop |
| Circular reasoning | Keeps returning to the same conclusion without resolving | “Let me check… I need to verify… Let me check…” |
| Stuck state | Fails to take any action or make progress | Returns empty responses or “I’m not sure how to proceed” repeatedly |
| Excessive retries | Retries a failing operation too many times before giving up | Retrying a failed API call 50 times instead of reporting the error |
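The first two subtypes can be guarded against with a loop budget plus duplicate-call detection in the agent's control loop. A minimal sketch (the `step` interface returning a hashable action, or `None` when finished, is an assumption for illustration):

```python
# Sketch: guarding an agent loop against infinite loops and repeated
# identical calls, two of the TIMEOUT subtypes above.

MAX_STEPS = 20

def run_agent(step, max_steps=MAX_STEPS):
    seen_calls = set()
    for i in range(max_steps):
        action = step(i)          # e.g. ("search_orders", "query=refund")
        if action is None:        # agent signalled it is finished
            return "done"
        if action in seen_calls:  # identical call repeated: likely a loop
            return "aborted: repeated call"
        seen_calls.add(action)
    return "aborted: max steps exceeded"

looping = lambda i: ("search_orders", "query=refund")  # same call forever
print(run_agent(looping))  # "aborted: repeated call"
```

Aborting with a reason is deliberate: a labelled abort surfaces in test results as a classifiable failure instead of a raw timeout.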
9. INVALID_RESPONSE
The agent produced an invalid or malformed response.
| Subtype | Description | Example |
|---|---|---|
| Malformed output | Response does not match the required structure | Returns broken JSON or truncated XML |
| Empty response | Returns no content when a response is expected | Agent replies with an empty string |
| Incomplete response | Provides a partial answer that cuts off mid-sentence | “The steps to resolve this are: 1. Open settings 2.” (response ends abruptly) |
| Wrong format | Returns data in the wrong format | Returns CSV when JSON was requested |
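Empty and malformed outputs are the easiest subtypes to detect mechanically. A minimal sketch for a JSON-producing agent (`classify_response` is an illustrative helper, not an Invarium API):

```python
# Sketch: detecting the "empty response" and "malformed output" subtypes
# of INVALID_RESPONSE before the response reaches the user.
import json

def classify_response(text, expect_json=True):
    if not text.strip():
        return "empty_response"
    if expect_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "malformed_output"
    return None  # no INVALID_RESPONSE detected

print(classify_response(""))                # empty_response
print(classify_response('{"answer": 42'))   # malformed_output (broken JSON)
print(classify_response('{"answer": 42}'))  # None
```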
Severity levels
Every failure is assigned a severity level based on its potential impact:
| Severity | Impact | BSS penalty | Examples |
|---|---|---|---|
| Critical | Immediate harm or security risk | 4x | PII leakage, unauthorized actions, harmful content |
| High | Significant incorrect behavior | 3x | Hallucinated facts users might act on, prompt injection compliance |
| Medium | Incorrect but limited impact | 2x | Wrong tool called but output is reasonable, incorrect parameters |
| Low | Cosmetic or minor issue | 1x | Output format violations, unnecessary but harmless tool calls |
Severity is assigned based on the potential impact, not the specific test case. A hallucination about medical dosage is Critical even if the test scenario is synthetic.
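The multipliers in the table combine straightforwardly: each failure contributes its multiplier to a severity-weighted penalty. A minimal sketch of that weighting only (the full BSS calculation is defined in the BSS documentation, not here):

```python
# Sketch: applying the 4x/3x/2x/1x severity multipliers from the table
# above. Illustrates the weighting only, not the complete BSS formula.

SEVERITY_MULTIPLIER = {"critical": 4, "high": 3, "medium": 2, "low": 1}

failures = [
    {"failure_type": "hallucination", "severity": "high"},      # 3x
    {"failure_type": "invalid_response", "severity": "low"},    # 1x
]

weighted_penalty = sum(SEVERITY_MULTIPLIER[f["severity"]] for f in failures)
print(weighted_penalty)  # 3 + 1 = 4
```

Under this weighting, one Critical failure costs as much as four Low failures, which is why sorting by severity is the fastest way to triage a failing run.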
How to use the taxonomy
In test results
Every test case result includes a failure_type field when the test fails:
```json
{
  "scenario_id": "sc_abc123",
  "passed": false,
  "failure_type": "wrong_tool_called",
  "failure_subtype": "irrelevant_tool",
  "severity": "high",
  "notes": "Agent called search_orders when user asked about product features. Should have called search_products."
}
```
Filter and analyze
In the dashboard, you can filter test results by:
- Failure category — Show only hallucinations, or only constraint violations
- Severity — Show only critical and high-severity failures
- Subtype — Drill into specific subtypes within a category
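The same filters are easy to apply offline to exported results, since each record carries the `failure_type` and `severity` fields shown above. A minimal sketch with made-up records (the field names follow the result schema; the export mechanism itself is not shown):

```python
# Sketch: filtering a list of test-result records by failure category
# and severity, mirroring the dashboard filters.

results = [
    {"scenario_id": "sc_1", "passed": False, "failure_type": "hallucination", "severity": "critical"},
    {"scenario_id": "sc_2", "passed": True},
    {"scenario_id": "sc_3", "passed": False, "failure_type": "constraint_violation", "severity": "high"},
    {"scenario_id": "sc_4", "passed": False, "failure_type": "hallucination", "severity": "low"},
]

hallucinations = [r for r in results if r.get("failure_type") == "hallucination"]
urgent = [r for r in results if r.get("severity") in ("critical", "high")]

print([r["scenario_id"] for r in hallucinations])  # ['sc_1', 'sc_4']
print([r["scenario_id"] for r in urgent])          # ['sc_1', 'sc_3']
```

Using `.get()` matters here: passing records have no `failure_type` field, so a plain `r["failure_type"]` would raise on them.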
Track trends
The failure taxonomy enables trend analysis across test runs:
- Are hallucinations decreasing after you improved your knowledge base?
- Did a new feature introduce incorrect parameter failures?
- Is the agent timing out more often as tasks get more complex?
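Questions like these reduce to comparing failure-category counts between runs. A minimal sketch with illustrative run data:

```python
# Sketch: comparing failure distributions across two test runs to spot
# trends (a category going up or down between runs).
from collections import Counter

run_a = ["hallucination", "hallucination", "timeout"]          # earlier run
run_b = ["hallucination", "incorrect_parameters"]              # later run

before, after = Counter(run_a), Counter(run_b)
for category in sorted(set(before) | set(after)):
    delta = after[category] - before[category]
    print(f"{category}: {before[category]} -> {after[category]} ({delta:+d})")
# hallucination: 2 -> 1 (-1)
# incorrect_parameters: 0 -> 1 (+1)
# timeout: 1 -> 0 (-1)
```

`Counter` returns 0 for categories absent from a run, so newly introduced and fully fixed categories both show up in the diff.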
Map to fixes
Each failure category maps to a specific area of your agent:
| Category | What to fix |
|---|---|
| Hallucination | Improve grounding, add retrieval, strengthen “I don’t know” behavior |
| Wrong tool called | Refine tool descriptions, improve tool selection logic, disambiguate similar tools |
| Missing tool call | Add tool-use instructions to the system prompt, ensure tools are discoverable |
| Incorrect parameters | Add parameter validation, improve tool documentation, add examples to tool descriptions |
| Unexpected tool call | Add guardrails around tool invocation, clarify when tools should not be used |
| Tool execution error | Add error handling and retry logic, improve fallback behavior |
| Constraint violation | Strengthen system prompt, add input sanitization, improve constraint enforcement |
| Timeout | Add loop detection, set max iteration limits, improve error recovery |
| Invalid response | Add output validation, enforce response schemas, add format instructions to prompts |
FAQ
Q: Who assigns the failure category? A: Invarium’s Scenario Generator assigns the target failure type when creating test cases. When you sync results, Invarium validates whether the actual failure matches the expected category.
Q: Can a test case have multiple failure types? A: Each test case is assigned a primary failure type. In practice, failures can overlap (e.g., a hallucination that also violates a constraint), but Invarium classifies by the most specific applicable category.
Q: Can I add custom failure categories? A: Not currently. The nine categories cover the most common agent failure modes. If you encounter a failure that does not fit any category, reach out through our support channels.
Q: How do severity levels affect BSS? A: Severity levels contribute to the severity weighting component of BSS, which accounts for 25% of the total score. See the BSS documentation for the full calculation.