Test Runs
View every test run, drill into individual test cases, and compare results across runs.
The Test Runs page is where you analyze how your agent performed across testing sessions. Each time you sync results with `invarium_sync_results`, a new test run is created and appears here.
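The exact payload `invarium_sync_results` accepts is not documented on this page; as a rough illustration of the data a test run carries (field names such as `agent_id`, `source`, and `cases` are hypothetical, not the tool's actual schema):

```python
# Hypothetical sketch of a results payload for invarium_sync_results.
# All field names are illustrative assumptions -- consult the MCP
# tool's actual schema before relying on them.
payload = {
    "agent_id": "support-bot",  # hypothetical agent identifier
    "source": "mcp",            # mcp | ci | cli | api
    "cases": [
        {"scenario": "refund request", "status": "passed"},
        {"scenario": "angry customer", "status": "failed",
         "failure_type": "tone", "severity": "major"},
    ],
}

# The Tests / Passed / Failed columns in the list view are simple
# aggregates over the synced cases:
passed = sum(1 for c in payload["cases"] if c["status"] == "passed")
failed = len(payload["cases"]) - passed
print(passed, failed)  # 1 1
```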
Test runs list
The main view shows a chronological list of all test runs in your workspace.
Each row displays:
| Column | Description |
|---|---|
| Run ID | Unique identifier for the test run |
| Agent | The agent that was tested |
| Date | When the test run was synced |
| Tests | Total number of test cases in the run |
| Passed | Number of passing tests (green) |
| Failed | Number of failing tests (red) |
| BSS | BSS score for this run |
| Source | Where tests were run (`mcp`, `ci`, `cli`, `api`) |
Filtering test runs
Use the filter bar above the list to narrow results:
By agent
Select one or more agents from the dropdown to show only their test runs. Useful when your workspace has many agents and you want to focus on one.
By date range
Choose a predefined range (Last 24 hours, Last 7 days, Last 30 days) or set a custom date range. This helps when investigating regressions — narrow the window to find when a score dropped.
By status
Filter by:
- All — Show all test runs
- Has failures — Show only runs with at least one failing test
- All passed — Show only runs where every test passed
By source
Filter by where the tests were executed:
- `mcp` — Tests run from an IDE via the MCP server
- `ci` — Tests run in a CI/CD pipeline
- `cli` — Tests run from the command line
- `api` — Tests run via the REST API
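If you fetch filtered runs programmatically, the filters above translate naturally into query parameters. A minimal sketch — the endpoint path and parameter names here are assumptions, not the documented REST API:

```python
from urllib.parse import urlencode

# Hypothetical sketch: building a filtered test-runs query.
# The path "/api/test-runs" and the parameter names are assumptions.
filters = {
    "agent": "support-bot",     # filter by agent
    "status": "has_failures",   # all | has_failures | all_passed
    "source": "ci",             # mcp | ci | cli | api
    "range": "7d",              # predefined date range
}
url = "/api/test-runs?" + urlencode(filters)
print(url)
```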
Comparing test runs
To compare two test runs side by side:

1. Select two runs by checking their checkboxes in the list.
2. Click Compare in the toolbar.
3. The comparison view shows:
   - BSS score delta between the two runs
   - Tests that changed from pass to fail (regressions)
   - Tests that changed from fail to pass (fixes)
   - Failure category distribution changes
Comparison works best when both runs use the same set of test scenarios. If the test cases are different, only shared scenarios are compared.
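The shared-scenario rule above can be sketched as set logic — only scenarios present in both runs are eligible for regression/fix detection. The data shapes here are illustrative, not the product's internal representation:

```python
# Sketch of the comparison logic: scenario -> status per run.
run_a = {"refund request": "passed", "angry customer": "passed",
         "escalation": "failed"}
run_b = {"refund request": "passed", "angry customer": "failed",
         "new scenario": "passed"}

# Only scenarios in both runs are compared.
shared = run_a.keys() & run_b.keys()
regressions = {s for s in shared
               if run_a[s] == "passed" and run_b[s] == "failed"}
fixes = {s for s in shared
         if run_a[s] == "failed" and run_b[s] == "passed"}

print(sorted(regressions))  # ['angry customer']
print(sorted(fixes))        # []
```

Note that "escalation" (only in run A) and "new scenario" (only in run B) are ignored entirely, which is why comparing runs with different scenario sets tells you less.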
Drill into a test run
Click any test run row to open its detail page. The detail page shows:
Summary bar
- Total tests, passed, failed
- BSS score for this run
- Failure breakdown by category (pie chart)
- Duration and source
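The per-category failure breakdown behind the pie chart is a simple tally over the run's failed cases. A minimal sketch with illustrative records:

```python
from collections import Counter

# Sketch of the failure breakdown shown in the summary bar.
# Test-case records are illustrative, not the product's data model.
cases = [
    {"status": "failed", "failure_type": "hallucination"},
    {"status": "failed", "failure_type": "tone"},
    {"status": "failed", "failure_type": "hallucination"},
    {"status": "passed"},
]

breakdown = Counter(c["failure_type"]
                    for c in cases if c["status"] == "failed")
print(breakdown.most_common())  # [('hallucination', 2), ('tone', 1)]
```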
Test case list
A table of every test case in the run:
| Column | Description |
|---|---|
| Scenario | Description of what the test checks |
| Complexity | simple, moderate, or complex |
| Target failure | The failure category being tested |
| Status | Passed or Failed |
| Severity | If failed, the severity level |
| Failure type | If failed, the specific failure category |
Click any test case to open its trace view.
Test case detail
The test case detail shows:
- User message — The input that was sent to the agent
- Expected behavior — What a correct response looks like
- Agent response — What the agent actually returned
- Tools called — Which tools the agent used, with parameters and results
- Failure analysis — If the test failed, the failure type, subtype, and severity
- Behavioral trace — The full timeline of events (see Behavioral Tracing)
Bulk actions
The test runs list supports bulk actions:
- Export — Select multiple runs and export as CSV or JSON
- Delete — Remove test runs you no longer need (Admin role required)
- Re-score — Recalculate BSS for selected runs (useful after scoring algorithm updates)
Deleting a test run removes all associated test case results and traces. This action cannot be undone.
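A JSON export lends itself to offline post-processing, for example computing per-run pass rates. The export's field names (`runs`, `tests`, `passed`) are assumptions here, not the documented export format:

```python
import json

# Minimal sketch of post-processing a JSON export of test runs.
# Field names are illustrative assumptions about the export shape.
export = json.loads("""
{"runs": [
  {"run_id": "r1", "tests": 20, "passed": 18},
  {"run_id": "r2", "tests": 20, "passed": 20}
]}
""")

rates = {}
for run in export["runs"]:
    rates[run["run_id"]] = run["passed"] / run["tests"]
    print(run["run_id"], f"{rates[run['run_id']]:.0%}")
```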
Tips
- Look for regressions first. Use the Compare feature after deploying agent changes to catch new failures early.
- Filter by Critical severity. Critical failures are the highest priority. Use severity filtering to focus on what matters most.
- Check the source column. CI-sourced runs represent your automated quality gate checks. MCP-sourced runs are developer-driven. Track both.