
Generate Test Scenarios

Create targeted behavioral test cases that probe your agent for specific failure patterns.

Key Takeaways
  • Generates behavioral test scenarios that target specific failure patterns rather than generic edge cases
  • Configure complexity (5 levels), personas (5 types), and failure categories (9 categories, 40+ subtypes)
  • Semantic deduplication ensures diverse, non-redundant test coverage

Why It Matters

Test scenario generation is the process of automatically creating targeted behavioral test cases that probe your agent for specific failure patterns across 9 failure categories and 40+ subtypes.

Traditional test generation produces generic inputs. Invarium’s generator is taxonomy-driven — it selects specific failure types from a structured taxonomy (for example, “Factual Hallucination” or “Wrong Tool Selection”), then constructs test cases designed to trigger those exact failure modes in the context of your agent’s tools and workflows.
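
To make "taxonomy-driven" concrete, here is a minimal sketch of such a taxonomy in Python. The `FailureSubtype` structure and its field names are hypothetical, not Invarium's internal schema; the example entries come from the tables later on this page, and the severity values are placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureSubtype:
    """One leaf of the failure taxonomy (hypothetical structure)."""
    name: str            # e.g., "Factual Hallucination"
    category: str        # one of the 9 top-level categories
    severity: str        # "S1" (Critical) through "S5" (Cosmetic)
    detection_hint: str  # what an evaluator watches for

# A two-entry excerpt; the real taxonomy has 9 categories and 40+ subtypes.
# The severity values below are illustrative placeholders, not Invarium's ratings.
TAXONOMY = [
    FailureSubtype("Factual Hallucination", "knowledge_failure", "S2",
                   "Response asserts facts not supported by any source or tool output"),
    FailureSubtype("Wrong Tool Selection", "tool_usage_failure", "S2",
                   "Agent calls a tool that does not match the user's intent"),
]
```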


How Generation Works

When you generate test scenarios, Invarium analyzes your agent’s blueprint and produces targeted test cases through three stages:

1. Target failure patterns

Invarium selects which failure patterns to test. If you specify a failure category (e.g., tool_usage_failure), it focuses there. Otherwise, it automatically selects from the 9 available categories to maximize coverage. Each test case targets a specific failure subtype — for example, “Factual Hallucination” or “Wrong Tool Selection” — so you know exactly what each test is designed to catch.
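
A rough sketch of that selection step, assuming simple round-robin spreading when no category is specified; the actual selection strategy is not documented here, and only tool_usage_failure is a confirmed identifier:

```python
from itertools import cycle

# tool_usage_failure appears in the docs; the other identifiers are guesses
# at the naming convention for the remaining eight categories.
CATEGORIES = [
    "knowledge_failure", "reasoning_failure", "context_failure",
    "instruction_failure", "tool_usage_failure", "safety_failure",
    "communication_failure", "operational_failure", "coordination_failure",
]

def pick_target_categories(n_cases: int, focus: str | None = None) -> list[str]:
    """Assign a failure category to each planned test case.

    With an explicit focus, every case targets that category; otherwise
    cycle through all 9 so coverage spreads as evenly as possible.
    """
    if focus:
        return [focus] * n_cases
    return [cat for cat, _ in zip(cycle(CATEGORIES), range(n_cases))]

print(pick_target_categories(3))                        # first 3 categories
print(pick_target_categories(2, "tool_usage_failure"))  # focused run
```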

2. Generate test cases

Test cases are generated using your agent’s blueprint — its tools, descriptions, and constraints — combined with the selected failure patterns, complexity level, and persona. Each test case includes a realistic user message and the expected tool call sequence your agent should follow.
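
For intuition, a generated test case might look like the sketch below. The shape, field names, and tool names are invented for illustration; this is not Invarium's actual output format:

```python
# All field names and tool names here are hypothetical.
test_case = {
    "name": "refund_request_wrong_tool",
    "target_failure": "Wrong Tool Selection",   # the subtype this case probes
    "complexity": "moderate",
    "persona": "frustrated",
    "user_message": "I was double-charged last week. Fix it. Now.",
    # The tool-call sequence the agent is expected to produce:
    "expected_tool_calls": ["lookup_invoice", "issue_refund"],
    "detection_hints": [
        "Agent reaches for a payment tool other than issue_refund",
        "Agent refunds without first looking up the invoice",
    ],
}
```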

3. Validate and deduplicate

Every generated test case is validated against your agent’s actual tool names — any test referencing a tool that does not exist in your blueprint is removed. Invarium also deduplicates similar test cases to ensure each one tests something distinct, giving you maximum coverage without redundancy.
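
The tool-name check is easy to picture. A minimal sketch, reusing the hypothetical test-case shape from the previous step:

```python
def validate_against_blueprint(cases: list[dict], blueprint_tools: set[str]) -> list[dict]:
    """Keep only test cases whose expected tools all exist in the blueprint."""
    return [c for c in cases
            if set(c["expected_tool_calls"]) <= blueprint_tools]

cases = [
    {"name": "ok",  "expected_tool_calls": ["lookup_invoice", "issue_refund"]},
    {"name": "bad", "expected_tool_calls": ["charge_card"]},  # not in blueprint
]
print(validate_against_blueprint(cases, {"lookup_invoice", "issue_refund"}))
# -> only the "ok" case survives
```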

The result is a set of diverse test cases, each tagged with its target failure type and detection hints.


Complexity Levels

Complexity controls how many tools each test case expects the agent to use, and what kind of behavior it probes.

| Level | Tool count | What it tests | When to use |
|---|---|---|---|
| simple | Exactly 1 | Single-tool requests with varied phrasing | Verify basic tool selection works correctly |
| moderate | 2-3 | Tool dependencies and sequential workflows | Test standard multi-step operations |
| complex | 3-5 | Conditional logic, error recovery, workflow completeness | Validate end-to-end business processes |
| adversarial | 1-3 | Parameter validation, incorrect tool selection, hallucination | Find bugs and probe failure modes |
| edge_case | 1-3 | Boundary values, rate limiting, timeouts, extreme inputs | Stress-test resource constraints and error handling |
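
The tool-count ranges above also work as a sanity check on generated cases. A small sketch, using the same hypothetical test-case shape as earlier:

```python
# Expected tool-call counts per level, transcribed from the table above.
TOOL_COUNT_BOUNDS = {
    "simple":      (1, 1),
    "moderate":    (2, 3),
    "complex":     (3, 5),
    "adversarial": (1, 3),
    "edge_case":   (1, 3),
}

def tool_count_ok(case: dict) -> bool:
    """Check a case's expected tool-call count against its complexity level."""
    lo, hi = TOOL_COUNT_BOUNDS[case["complexity"]]
    return lo <= len(case["expected_tool_calls"]) <= hi

print(tool_count_ok({"complexity": "moderate",
                     "expected_tool_calls": ["lookup_invoice", "issue_refund"]}))  # True
```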

Personas

Personas shape the communication style and behavior of the simulated user in each test case.

| Persona | Behavior | Example use case |
|---|---|---|
| novice | Simple language, unclear requests, missing terminology, needs guidance | Test whether the agent handles ambiguous or incomplete inputs gracefully |
| expert | Precise technical language, complex multi-step requests, expects detailed responses | Verify the agent supports power-user workflows without oversimplifying |
| frustrated | Short/demanding messages, skips steps, escalates quickly | Check that the agent stays helpful under pressure and does not mirror hostility |
| confused | Contradictory statements, changes mind, incomplete information, unclear questions | Probe the agent’s ability to ask clarifying questions rather than guessing |
| adversarial | Probes boundaries, attempts prompt injection, uses edge-case inputs | Test guardrails, safety constraints, and input validation |

You can combine personas with any complexity level — for example, a frustrated persona at the complex level generates demanding multi-step requests.
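
If you think of the knobs as a single request, the combination might look like this. The parameter names are hypothetical; the values map directly onto the tables above:

```python
# Hypothetical parameter names; values correspond to the tables above.
generation_request = {
    "agent_id": "support-agent",               # whichever agent you selected
    "num_test_cases": 5,                       # up to 25 per request
    "complexity": "complex",                   # one of the 5 levels
    "persona": "frustrated",                   # one of the 5 personas
    "failure_category": "tool_usage_failure",  # optional; omit for auto coverage
}
```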


Failure Categories

The generator targets 9 top-level failure categories, each with multiple subtypes. Each subtype has a severity rating (S1-Critical through S5-Cosmetic), detection hints, and real-world examples.

| Category | What it catches | Example subtypes |
|---|---|---|
| Knowledge | Hallucinations, outdated info, self-contradictions | Factual Hallucination, Entity Confusion, Outdated Information |
| Reasoning | Logic errors, calculation mistakes, planning failures | Invalid Inference, Arithmetic Mistake, Circular Reasoning |
| Context | Lost conversation context, misinterpreted references | Context Window Overflow, Positional Bias, Reference Misresolution |
| Instruction | Constraint violations, partial execution, priority conflicts | System Prompt Violation, Partial Execution, Conflicting Instructions |
| Tool Usage | Wrong tool, bad parameters, sequence violations | Wrong Tool Selection, Parameter Hallucination, Skipped Steps |
| Safety | Prompt injection, guardrail bypass, unauthorized actions | Direct Injection, Jailbreak, PII Leakage |
| Communication | Unhelpful, unclear, or inappropriate responses | Robotic Tone, Over-Verbose, Emotional Mismatch |
| Operational | Timeouts, rate limits, non-deterministic behavior | Timeout Handling, Rate Limit Handling, Non-Determinism |
| Coordination | Multi-agent deadlocks, lost handoffs, conflicting actions | Handoff Failure, Deadlock, Context Loss Between Agents |
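
The severity scale attached to each subtype runs from S1 to S5. A minimal sketch; only the "Critical" and "Cosmetic" labels appear in these docs, so the intermediate labels are assumptions:

```python
from enum import Enum

class Severity(Enum):
    """S1-Critical through S5-Cosmetic. The S2-S4 labels are assumed."""
    S1 = "Critical"
    S2 = "High"    # assumed label
    S3 = "Medium"  # assumed label
    S4 = "Low"     # assumed label
    S5 = "Cosmetic"
```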

See Failure Taxonomy for the complete taxonomy reference with all 40+ subtypes, severity ratings, detection hints, and prevention strategies.


How to Use It

Dashboard: AI Generation Modal

Scenarios overview showing pass/fail status per agent

1. Select an agent and describe what to test

Navigate to Scenarios in the sidebar and click AI Generate. In the modal, select the agent you want to generate tests for, then describe the behavior you want to test.

AI Generate modal

2. Configure and generate

Click Generate to use the defaults (5 test cases, moderate complexity), or click Customize & Generate to set the number of test cases, complexity level, and other options.

Configure generation parameters

The system produces test cases in 10-30 seconds, depending on count and complexity.

3. Review generated scenarios

Review the generated scenarios — click into any scenario to see its test cases, expected tools, tags, and pass/fail status.

Scenario detail showing test cases with pass/fail status and expected tools


FAQ

How many test cases can I generate at once?

Up to 25 test cases per generation request. For large test suites, run multiple generation requests targeting different failure categories.
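
A sketch of that batching pattern. The `generate_scenarios` function is a stand-in for however you trigger generation in practice, not a documented Invarium API:

```python
CATEGORIES = ["knowledge_failure", "safety_failure", "tool_usage_failure"]  # ...and so on

def generate_scenarios(agent_id: str, num_test_cases: int,
                       failure_category: str) -> list[dict]:
    """Stand-in for your actual generation call; returns generated test cases."""
    raise NotImplementedError

def build_large_suite(agent_id: str, per_request: int = 25) -> list[dict]:
    """Run one generation request per failure category, 25 cases max each."""
    suite: list[dict] = []
    for category in CATEGORIES:
        suite.extend(generate_scenarios(agent_id, per_request, category))
    return suite
```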

How long does generation take?

10-30 seconds depending on the number of test cases and complexity level.

Can I target specific failure patterns?

Yes. You can focus on any of the 9 categories. For example, asking to focus on safety failures will generate test cases targeting prompt injection, guardrail bypass, PII leakage, and other safety-related subtypes.

What if generation produces similar test cases?

Invarium automatically removes test cases that are too similar, comparing names, user messages, and tool sequences. If you are still seeing overlap, try generating smaller batches targeting different failure categories.
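
One way to picture the overlap check: compare the fields mentioned above and drop near-duplicates. A crude string-similarity sketch using difflib; the real deduplication is semantic, so treat this purely as an illustration:

```python
from difflib import SequenceMatcher

def too_similar(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Heuristic: same tool sequence plus a near-identical user message."""
    same_tools = a["expected_tool_calls"] == b["expected_tool_calls"]
    msg_sim = SequenceMatcher(None, a["user_message"], b["user_message"]).ratio()
    return same_tools and msg_sim >= threshold

def deduplicate(cases: list[dict]) -> list[dict]:
    """Keep each case only if it is not too similar to an already-kept one."""
    kept: list[dict] = []
    for case in cases:
        if not any(too_similar(case, k) for k in kept):
            kept.append(case)
    return kept
```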

Do I need to write test cases manually?

Not usually. AI generation handles most cases. However, you can also create scenarios manually for specific edge cases you have identified, or edit AI-generated scenarios to refine the expected behavior.