
Test Runs

View every test run, drill into individual test cases, and compare results across runs.

The Test Runs page is where you analyze how your agent performed across testing sessions. Each time you sync results with invarium_sync_results, a new test run is created and appears here.
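Conceptually, each sync bundles individual test case results into a single run record with aggregate counts. The sketch below illustrates that shape; the field names are assumptions for illustration, not the actual invarium_sync_results schema:

```python
# Illustrative sketch of the kind of record a sync creates.
# Field names are assumptions, not the real invarium_sync_results schema.
results = [
    {"scenario": "refund over limit", "status": "failed", "severity": "critical"},
    {"scenario": "simple balance query", "status": "passed", "severity": None},
]

run = {
    "agent": "support-agent",
    "source": "mcp",  # mcp, ci, cli, or api
    "tests": len(results),
    "passed": sum(r["status"] == "passed" for r in results),
    "failed": sum(r["status"] == "failed" for r in results),
}
print(run["tests"], run["passed"], run["failed"])  # 2 1 1
```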


Test runs list

The main view shows a chronological list of all test runs in your workspace.

Each row displays:

| Column | Description |
| --- | --- |
| Run ID | Unique identifier for the test run |
| Agent | The agent that was tested |
| Date | When the test run was synced |
| Tests | Total number of test cases in the run |
| Passed | Number of passing tests (green) |
| Failed | Number of failing tests (red) |
| BSS | BSS score for this run |
| Source | Where tests were run (mcp, ci, cli, api) |

Filtering test runs

Use the filter bar above the list to narrow results:

By agent

Select one or more agents from the dropdown to show only their test runs. Useful when your workspace has many agents and you want to focus on one.

By date range

Choose a predefined range (Last 24 hours, Last 7 days, Last 30 days) or set a custom date range. This helps when investigating regressions — narrow the window to find when a score dropped.

By status

Filter by:

  • All — Show all test runs
  • Has failures — Show only runs with at least one failing test
  • All passed — Show only runs where every test passed
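The status filters reduce to simple predicates over a run's aggregate counts. A minimal sketch, assuming a run record with tests and failed fields (illustrative, not Invarium's actual data model):

```python
def has_failures(run: dict) -> bool:
    # "Has failures": at least one failing test in the run
    return run["failed"] > 0

def all_passed(run: dict) -> bool:
    # "All passed": every test in the run passed
    return run["tests"] > 0 and run["failed"] == 0

runs = [
    {"id": "run-1", "tests": 10, "failed": 2},
    {"id": "run-2", "tests": 8, "failed": 0},
]
print([r["id"] for r in runs if has_failures(r)])  # ['run-1']
print([r["id"] for r in runs if all_passed(r)])    # ['run-2']
```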

By source

Filter by where the tests were executed:

  • mcp — Tests run from an IDE via the MCP server
  • ci — Tests run in a CI/CD pipeline
  • cli — Tests run from the command line
  • api — Tests run via the REST API

Comparing test runs

To compare two test runs side by side:

  1. Select two runs by checking their checkboxes in the list
  2. Click Compare in the toolbar
  3. The comparison view shows:
    • BSS score delta between the two runs
    • Tests that changed from pass to fail (regressions)
    • Tests that changed from fail to pass (fixes)
    • Failure category distribution changes

Comparison works best when both runs use the same set of test scenarios. If the test cases are different, only shared scenarios are compared.
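The comparison logic above can be sketched as set operations over the scenarios both runs share, with each run treated as a mapping from scenario to outcome (a sketch of the behavior, not Invarium's implementation):

```python
def compare_runs(before: dict, after: dict) -> dict:
    """Compare two runs over their shared scenarios only.

    Each run maps scenario name -> "passed" or "failed". Scenarios
    present in only one run are ignored, mirroring the Compare view.
    """
    shared = before.keys() & after.keys()
    return {
        # pass -> fail between runs: regressions
        "regressions": sorted(s for s in shared
                              if before[s] == "passed" and after[s] == "failed"),
        # fail -> pass between runs: fixes
        "fixes": sorted(s for s in shared
                        if before[s] == "failed" and after[s] == "passed"),
    }

before = {"refund flow": "passed", "login": "passed", "escalation": "failed"}
after = {"refund flow": "failed", "login": "passed", "escalation": "passed",
         "new scenario": "failed"}  # not in `before`, so ignored
print(compare_runs(before, after))
# {'regressions': ['refund flow'], 'fixes': ['escalation']}
```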


Drill into a test run

Click any test run row to open its detail page. The detail page shows:

Summary bar

  • Total tests, passed, failed
  • BSS score for this run
  • Failure breakdown by category (pie chart)
  • Duration and source

Test case list

A table of every test case in the run:

| Column | Description |
| --- | --- |
| Scenario | Description of what the test checks |
| Complexity | simple, moderate, or complex |
| Target failure | The failure category being tested |
| Status | Passed or Failed |
| Severity | If failed, the severity level |
| Failure type | If failed, the specific failure category |

Click any test case to open its trace view.

Test case detail

The test case detail shows:

  • User message — The input that was sent to the agent
  • Expected behavior — What a correct response looks like
  • Agent response — What the agent actually returned
  • Tools called — Which tools the agent used, with parameters and results
  • Failure analysis — If the test failed, the failure type, subtype, and severity
  • Behavioral trace — The full timeline of events (see Behavioral Tracing)
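Taken together, those fields describe one record per test case. A sketch of that shape as a data structure, with field names assumed from the list above (not Invarium's actual API types):

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    parameters: dict
    result: str

@dataclass
class TestCaseDetail:
    # Field names mirror the detail view's sections; they are
    # illustrative assumptions, not a documented schema.
    user_message: str
    expected_behavior: str
    agent_response: str
    tools_called: list = field(default_factory=list)
    failure_type: str = None  # set only when the test failed
    severity: str = None

case = TestCaseDetail(
    user_message="Cancel my subscription",
    expected_behavior="Agent confirms identity before cancelling",
    agent_response="Done, your subscription is cancelled.",
    failure_type="policy_violation",
    severity="critical",
)
print(case.failure_type, case.severity)  # policy_violation critical
```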

Bulk actions

The test runs list supports bulk actions:

  • Export — Select multiple runs and export as CSV or JSON
  • Delete — Remove test runs you no longer need (Admin role required)
  • Re-score — Recalculate BSS for selected runs (useful after scoring algorithm updates)
⚠️ Deleting a test run removes all associated test case results and traces. This action cannot be undone.
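Exports can be post-processed locally with standard tooling. A sketch converting a JSON export of selected runs into CSV, assuming a hypothetical export shape that mirrors the list columns:

```python
import csv
import io
import json

# Hypothetical shape of a JSON export of selected runs; the field
# names are assumptions based on the test runs list columns.
export_json = json.dumps([
    {"run_id": "run-1", "agent": "support-agent", "tests": 10,
     "passed": 8, "failed": 2, "bss": 71.5, "source": "ci"},
    {"run_id": "run-2", "agent": "support-agent", "tests": 10,
     "passed": 10, "failed": 0, "bss": 88.0, "source": "mcp"},
])

rows = json.loads(export_json)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # run_id,agent,tests,passed,failed,bss,source
```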


Tips

  • Look for regressions first. Use the Compare feature after deploying agent changes to catch new failures early.
  • Filter by Critical severity. Critical failures are the highest priority. Use severity filtering to focus on what matters most.
  • Check the source column. CI-sourced runs represent your automated quality gate checks. MCP-sourced runs are developer-driven. Track both.