
Overview — What is Invarium?

Behavioral QA testing platform for AI agents

Invarium generates behavioral test cases for your AI agent, runs them against it, and produces a reliability score — so you can find failures before your users do. Think of it as pytest for agentic workflows.

Instead of writing test cases by hand, you describe your agent as a JSON blueprint and Invarium’s Scenario Generator creates targeted test cases that probe for real failure modes: hallucinations, tool misuse, safety violations, instruction drift, and more.

What Invarium does:

  • Generate behavioral tests — Automatically create test scenarios targeting known failure categories
  • Score reliability with BSS — Get a Behavioral Safety Score (0-100) that quantifies how safe your agent is to deploy
  • Classify failures — Every failure is categorized using a structured failure taxonomy so you know exactly what went wrong
  • Visualize with the Agent Intelligence Graph (coming soon) — See how your agent’s tools, workflows, and constraints connect
  • Gate deployments with CI/CD (coming soon) — Set quality thresholds and block deploys that don’t meet your standards

How it works

Upload Blueprint → Generate Tests → Run Tests → Get BSS Score

1. Upload a blueprint

Describe your agent as a JSON blueprint — its tools, workflows, constraints, and expected behaviors. This tells Invarium what your agent is supposed to do.
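This page doesn't spell out the blueprint schema, so as a purely illustrative sketch (every field name below is hypothetical, not Invarium's actual format), a blueprint might describe an agent like this:

```json
{
  "name": "support-agent",
  "description": "Answers billing questions and files refund tickets",
  "tools": [
    { "name": "lookup_invoice", "description": "Fetch an invoice by ID" },
    { "name": "file_refund", "description": "Open a refund ticket" }
  ],
  "constraints": [
    "Never reveal another customer's data",
    "Refunds over $500 require human approval"
  ],
  "expected_behaviors": [
    "Ask for an invoice ID before discussing a charge"
  ]
}
```

The point is that tools, constraints, and expected behaviors are declared up front, so the Scenario Generator knows what "correct" looks like for your agent.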

2. Generate test cases

Invarium’s Scenario Generator analyzes your blueprint and creates behavioral test cases. Each test targets a specific failure type (hallucination, tool misuse, safety violation, etc.) at a specific complexity level.
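The exact shape of a generated test case isn't shown on this page; illustratively (hypothetical field names), one targeting tool misuse against the blueprint above might look like:

```json
{
  "id": "tc-014",
  "failure_type": "tool_misuse",
  "complexity": "medium",
  "user_message": "I lost my invoice ID, just refund my last three charges.",
  "expected_behavior": "Agent asks for an invoice ID instead of calling file_refund blindly"
}
```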

3. Run tests against your agent

Execute the generated test cases against your agent — either manually from your IDE or automatically in CI/CD. Send each test’s user message to your agent and collect the response.
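In code, step 3 amounts to iterating over the generated cases, sending each user message to your agent, and recording the reply. A minimal sketch, assuming your agent is callable as a function and that test cases carry `id` and `user_message` fields (both are assumptions for illustration, not Invarium's actual API):

```python
from typing import Callable

def run_tests(agent: Callable[[str], str], test_cases: list[dict]) -> list[dict]:
    """Send each test's user message to the agent and collect responses."""
    results = []
    for case in test_cases:
        response = agent(case["user_message"])  # your agent call goes here
        results.append({"test_id": case["id"], "response": response})
    return results

# Usage with a trivial stand-in agent:
echo_agent = lambda msg: f"Agent reply to: {msg}"
results = run_tests(echo_agent, [{"id": "tc-1", "user_message": "Hi"}])
```

In practice the stand-in `echo_agent` would be replaced by a call into your real agent runtime, and `results` is what you sync back in step 4.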

4. Get your BSS score

Sync results back to Invarium. You get a Behavioral Safety Score, a failure breakdown by category, and actionable insights about what to fix.
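Invarium computes the BSS server-side and its formula isn't documented on this page. As a rough intuition only — explicitly not Invarium's actual scoring — a naive stand-in is the pass rate scaled to 0-100:

```python
def naive_score(results: list[dict]) -> float:
    """Pass rate scaled to 0-100. NOT Invarium's BSS formula -- illustration only."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["passed"])
    return 100.0 * passed / len(results)

print(naive_score([{"passed": True}, {"passed": True}, {"passed": False}]))
```

The real score also comes with a per-category failure breakdown, which a flat pass rate cannot capture.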


How to use Invarium

There are two ways to interact with the platform:

  • MCP server — Run behavioral tests directly from your IDE or development workflow
  • Web dashboard — Review results and share reports with your team

Most developers start with the MCP server for day-to-day testing, then use the dashboard to review results and share them with the team.


Use cases

AI Safety teams — Validate that agents handle adversarial inputs, PII exposure, and harmful content correctly before deployment.

Agent developers — Catch regressions early by running behavioral tests during development. Know exactly which failure types your agent is vulnerable to.

QA teams — Replace manual testing with automated behavioral test generation. Get structured failure reports instead of vague bug descriptions.

Platform teams — Set up CI/CD quality gates that block deploys when the BSS score drops below a threshold. Enforce reliability standards across all agents in your organization.
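The CI/CD gate in the last use case can be sketched as a small script that reads a results file and fails the build when the score is below a threshold. The results-file layout, the `bss_score` field, and the threshold value here are all hypothetical, not Invarium's actual output format:

```python
import json

THRESHOLD = 85  # minimum acceptable score; tune to your team's standards

def gate(results_path: str) -> int:
    """Return 0 if the score meets the threshold, 1 otherwise (CI exit code)."""
    with open(results_path) as f:
        score = json.load(f)["bss_score"]  # hypothetical field name
    if score < THRESHOLD:
        print(f"FAIL: score {score} is below threshold {THRESHOLD}")
        return 1
    print(f"PASS: score {score} meets threshold {THRESHOLD}")
    return 0

# In a CI step: sys.exit(gate("invarium_results.json"))
```

Returning a nonzero exit code is what makes the CI system treat the step as failed and block the deploy.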

