# Generate Test Scenarios
Create targeted behavioral test cases that probe your agent for specific failure patterns.
- ✓ Generates behavioral test scenarios targeting specific failure patterns, not generic edge cases
- ✓ Offers configurable complexity (5 levels), personas (5 types), and failure categories (9 categories, 40+ subtypes)
- ✓ Applies semantic deduplication for diverse, non-redundant test coverage
## Why It Matters
Test scenario generation is the process of automatically creating targeted behavioral test cases that probe your agent for specific failure patterns across 9 failure categories and 40+ subtypes.
Traditional test generation produces generic inputs. Invarium’s generator is taxonomy-driven — it selects specific failure types from a structured taxonomy (for example, “Factual Hallucination” or “Wrong Tool Selection”), then constructs test cases designed to trigger those exact failure modes in the context of your agent’s tools and workflows.
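For illustration, a single taxonomy entry can be pictured as a small record carrying the category, subtype name, severity rating, and detection hints described later in this page. This is a minimal sketch with hypothetical field names, not Invarium's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FailureSubtype:
    """One taxonomy entry (hypothetical shape; field names are illustrative)."""
    category: str                 # one of the 9 top-level categories
    name: str                     # subtype, e.g. "Wrong Tool Selection"
    severity: str                 # "S1" (Critical) through "S5" (Cosmetic)
    detection_hints: list[str] = field(default_factory=list)

# Example entry; the severity value here is invented for illustration.
wrong_tool = FailureSubtype(
    category="tool_usage_failure",
    name="Wrong Tool Selection",
    severity="S2",
    detection_hints=["agent calls a tool unrelated to the user's request"],
)
```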
## How Generation Works
When you generate test scenarios, Invarium analyzes your agent’s blueprint and produces targeted test cases through three stages:
### 1. Target failure patterns
Invarium selects which failure patterns to test. If you specify a failure category (e.g., tool_usage_failure), it focuses there. Otherwise, it automatically selects from the 9 available categories to maximize coverage. Each test case targets a specific failure subtype — for example, “Factual Hallucination” or “Wrong Tool Selection” — so you know exactly what each test is designed to catch.
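One simple way to picture the automatic selection step is round-robin sampling across the nine categories. The sketch below is an assumption about the behavior, not Invarium's documented algorithm, and the category identifiers are placeholders:

```python
from itertools import cycle

# Placeholder identifiers for the 9 top-level categories.
CATEGORIES = [
    "knowledge", "reasoning", "context", "instruction", "tool_usage",
    "safety", "communication", "operational", "coordination",
]

def pick_targets(n_tests: int, category: str | None = None) -> list[str]:
    """Assign one target category per test case.

    If a category is specified, focus there; otherwise cycle through
    all nine categories so coverage is spread as evenly as possible.
    """
    if category is not None:
        return [category] * n_tests
    rotation = cycle(CATEGORIES)
    return [next(rotation) for _ in range(n_tests)]

print(pick_targets(5))
# ['knowledge', 'reasoning', 'context', 'instruction', 'tool_usage']
```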
### 2. Generate test cases
Test cases are generated using your agent’s blueprint — its tools, descriptions, and constraints — combined with the selected failure patterns, complexity level, and persona. Each test case includes a realistic user message and the expected tool call sequence your agent should follow.
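Putting those pieces together, a generated test case can be imagined as a record like the one below. The exact schema is not shown in this guide, so every field name and value here is hypothetical:

```python
# Hypothetical test-case shape assembled from the fields described above;
# Invarium's actual schema may differ.
test_case = {
    "name": "duplicate-charge-refund",
    "target_failure": "Wrong Tool Selection",   # subtype this test is designed to catch
    "complexity": "moderate",
    "persona": "frustrated",
    "user_message": "I was charged twice. Refund the duplicate now.",
    "expected_tool_calls": ["lookup_order", "issue_refund"],  # sequence the agent should follow
    "detection_hints": ["agent issues a refund without first looking up the order"],
}
```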
### 3. Validate and deduplicate
Every generated test case is validated against your agent’s actual tool names — any test referencing a tool that does not exist in your blueprint is removed. Invarium also deduplicates similar test cases to ensure each one tests something distinct, giving you maximum coverage without redundancy.
The result is a set of diverse test cases, each tagged with its target failure type and detection hints.
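In outline, this stage behaves like the sketch below: tests referencing unknown tools are filtered out, and near-duplicates are dropped by comparing names, user messages, and tool sequences. The similarity measure used here (difflib's ratio) is an assumption; the guide does not specify how Invarium scores similarity:

```python
from difflib import SequenceMatcher

def fingerprint(tc: dict) -> str:
    # Compare on the fields the docs mention: name, user message, tool sequence.
    return f"{tc['name']}|{tc['user_message']}|{'>'.join(tc['expected_tool_calls'])}"

def validate_and_dedupe(test_cases: list[dict], blueprint_tools: set[str],
                        threshold: float = 0.9) -> list[dict]:
    kept: list[dict] = []
    for tc in test_cases:
        # Remove any test that references a tool missing from the blueprint.
        if not set(tc["expected_tool_calls"]) <= blueprint_tools:
            continue
        # Drop near-duplicates of tests already kept.
        if any(SequenceMatcher(None, fingerprint(tc), fingerprint(k)).ratio() >= threshold
               for k in kept):
            continue
        kept.append(tc)
    return kept
```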
## Complexity Levels
Complexity controls how many tools each test case expects the agent to use, and what kind of behavior it probes.
| Level | Tool count | What it tests | When to use |
|---|---|---|---|
| simple | Exactly 1 | Single-tool requests with varied phrasing | Verify basic tool selection works correctly |
| moderate | 2-3 | Tool dependencies and sequential workflows | Test standard multi-step operations |
| complex | 3-5 | Conditional logic, error recovery, workflow completeness | Validate end-to-end business processes |
| adversarial | 1-3 | Parameter validation, incorrect tool selection, hallucination | Find bugs and probe failure modes |
| edge_case | 1-3 | Boundary values, rate limiting, timeouts, extreme inputs | Stress-test resource constraints and error handling |
## Personas
Personas shape the communication style and behavior of the simulated user in each test case.
| Persona | Behavior | Example use case |
|---|---|---|
| novice | Simple language, unclear requests, missing terminology, needs guidance | Test whether the agent handles ambiguous or incomplete inputs gracefully |
| expert | Precise technical language, complex multi-step requests, expects detailed responses | Verify the agent supports power-user workflows without oversimplifying |
| frustrated | Short/demanding messages, skips steps, escalates quickly | Check that the agent stays helpful under pressure and does not mirror hostility |
| confused | Contradictory statements, changes mind, incomplete information, unclear questions | Probe the agent’s ability to ask clarifying questions rather than guessing |
| adversarial | Probes boundaries, attempts prompt injection, uses edge-case inputs | Test guardrails, safety constraints, and input validation |
You can combine personas with any complexity level — for example, a frustrated persona at complex complexity generates demanding multi-step requests.
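As a rough picture of how these knobs fit together, a fully customized generation request might carry settings like the following. The option names are hypothetical; in practice you set these values in the dashboard as described under How to Use It:

```python
# Hypothetical generation settings combining the options above;
# actual names in the dashboard or API may differ.
generation_request = {
    "agent_id": "support-agent-v2",            # placeholder agent identifier
    "num_test_cases": 10,                      # up to 25 per request
    "complexity": "complex",                   # simple | moderate | complex | adversarial | edge_case
    "persona": "frustrated",                   # novice | expert | frustrated | confused | adversarial
    "failure_category": "tool_usage_failure",  # optional; omit to auto-select across all 9
}
```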
## Failure Categories
The generator targets 9 top-level failure categories, each with multiple subtypes. Each subtype has a severity rating (S1-Critical through S5-Cosmetic), detection hints, and real-world examples.
| Category | What it catches | Example subtypes |
|---|---|---|
| Knowledge | Hallucinations, outdated info, self-contradictions | Factual Hallucination, Entity Confusion, Outdated Information |
| Reasoning | Logic errors, calculation mistakes, planning failures | Invalid Inference, Arithmetic Mistake, Circular Reasoning |
| Context | Lost conversation context, misinterpreted references | Context Window Overflow, Positional Bias, Reference Misresolution |
| Instruction | Constraint violations, partial execution, priority conflicts | System Prompt Violation, Partial Execution, Conflicting Instructions |
| Tool Usage | Wrong tool, bad parameters, sequence violations | Wrong Tool Selection, Parameter Hallucination, Skipped Steps |
| Safety | Prompt injection, guardrail bypass, unauthorized actions | Direct Injection, Jailbreak, PII Leakage |
| Communication | Unhelpful, unclear, or inappropriate responses | Robotic Tone, Over-Verbose, Emotional Mismatch |
| Operational | Timeouts, rate limits, non-deterministic behavior | Timeout Handling, Rate Limit Handling, Non-Determinism |
| Coordination | Multi-agent deadlocks, lost handoffs, conflicting actions | Handoff Failure, Deadlock, Context Loss Between Agents |
See Failure Taxonomy for the complete taxonomy reference with all 40+ subtypes, severity ratings, detection hints, and prevention strategies.
## How to Use It
### Dashboard: AI Generation Modal

#### 1. Select an agent and describe what to test
Navigate to Scenarios in the sidebar and click AI Generate. In the modal, select the agent you want to generate tests for, then describe the behavior you want to test.

#### 2. Configure and generate
Click Generate to use the defaults (5 test cases, moderate complexity), or click Customize & Generate to set the number of test cases, complexity level, and other options.

The system produces test cases in 10-30 seconds, depending on count and complexity.
#### 3. Review generated scenarios
Click into any scenario to see its test cases, expected tools, tags, and pass/fail status.

## FAQ
### How many test cases can I generate at once?
Up to 25 test cases per generation request. For large test suites, run multiple generation requests targeting different failure categories.
### How long does generation take?
10-30 seconds depending on the number of test cases and complexity level.
### Can I target specific failure patterns?
Yes. You can focus on any of the 9 categories. For example, asking to focus on safety failures will generate test cases targeting prompt injection, guardrail bypass, PII leakage, and other safety-related subtypes.
### What if generation produces similar test cases?
Invarium automatically removes test cases that are too similar, comparing names, user messages, and tool sequences. If you are still seeing overlap, try generating smaller batches targeting different failure categories.
### Do I need to write test cases manually?
Not usually. AI generation handles most cases. However, you can also create scenarios manually for specific edge cases you have identified, or edit AI-generated scenarios to refine the expected behavior.