Agent Quality Score (AQS)
A single, comparable metric for AI agent behavioral reliability.
- ✓AQS is a composite 0-100 metric measuring behavioral reliability from test results
- ✓Four components: pass rate, failure severity, coverage breadth, and cross-run consistency
- ✓Score ranges: 90-100 Excellent, 70-89 Good, 50-69 Degraded, 0-49 Critical
Why It Matters
Agent Quality Score (AQS) is a composite 0-100 metric that measures your AI agent’s behavioral reliability by evaluating test pass rate, failure severity, coverage breadth, and consistency across test runs.
Testing AI agents produces a lot of data: pass rates, failure counts, severity distributions, coverage reports. AQS distills all of that into a single number so you can:
- Compare agents — Is agent A more reliable than agent B?
- Track progress — Is your agent getting better or worse over time?
- Set thresholds — Block deploys when the score drops below an acceptable level
- Communicate risk — Share one intuitive metric with stakeholders who do not need the full breakdown
Without AQS, you are left interpreting dozens of metrics. With AQS, you have one number and a clear action plan.
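The threshold idea above can be sketched as a small deploy gate. This is a hypothetical helper (the function name, signature, and 80-point default are assumptions, not part of any Invarium API); the default follows the customer-facing guidance given later in this page.

```python
def should_block_deploy(aqs: float, threshold: float = 80.0) -> bool:
    """Return True when the score falls below the acceptable level.

    The 80-point default mirrors the suggested bar for customer-facing
    agents; tune it to your own risk tolerance.
    """
    return aqs < threshold

# A 76.4 score fails the default 80-point gate; 91.0 clears it.
print(should_block_deploy(76.4))
print(should_block_deploy(91.0))
```

In CI, you would call this after syncing results and fail the pipeline when it returns `True`.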
Score Ranges
| Range | Rating | What it means |
|---|---|---|
| 90-100 | Excellent | Agent handles edge cases reliably. Safe for production with monitoring. |
| 70-89 | Good | Agent performs well but has gaps. Review failure categories before deploying. |
| 50-69 | Degraded | Significant failures detected. Do not deploy without fixes. |
| 0-49 | Critical | Fundamental safety issues. Block deployment. |
An AQS of 100 does not mean zero risk. It means your agent handled all tested scenarios correctly. Untested scenarios may still reveal failures. Increase coverage breadth to improve confidence.
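The rating bands in the table map directly to code. A minimal sketch (the function name is an illustration, not a provided API):

```python
def aqs_rating(score: float) -> str:
    """Map a 0-100 AQS value to its documented rating band."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Degraded"
    return "Critical"

print(aqs_rating(95))   # top band: safe for production with monitoring
print(aqs_rating(68))   # Degraded: do not deploy without fixes
```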
How AQS Is Calculated
AQS combines four dimensions of agent reliability into a single score. Each dimension captures a different aspect of production readiness.
1. Pass Rate
How many test cases your agent passed. This is the most direct measure of whether the agent does what it is supposed to do. An agent that fails tests will fail in production.
2. Severity Weighting
Not all failures are equal. A hallucination that fabricates medical advice is far worse than a minor formatting issue. AQS penalizes critical and high-severity failures more heavily, so two agents with the same pass rate but different failure profiles get different scores.
3. Coverage Breadth
How many of the nine failure categories were actually tested. An agent with a 100% pass rate across two categories looks great on paper, but you have no idea how it handles the other seven failure modes. Coverage breadth penalizes untested blind spots and rewards comprehensive testing.
4. Consistency
Whether your agent performs evenly across complexity levels. An agent that aces simple requests but collapses on multi-step workflows is inconsistent — and that inconsistency is a risk in production where you cannot control what users ask.
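One way to picture how these four dimensions combine is a weighted blend. The weights below are purely illustrative assumptions; the actual AQS weighting is not specified here. What the sketch does show is the severity effect described above: two agents with identical pass rates but different severity profiles end up with different scores.

```python
def composite_aqs(pass_rate: float, severity: float,
                  coverage: float, consistency: float,
                  weights: tuple = (0.4, 0.25, 0.2, 0.15)) -> float:
    """Blend four component scores (each 0-100) into one 0-100 value.

    The weights are hypothetical, chosen only to illustrate how a
    composite metric reacts to its components.
    """
    components = (pass_rate, severity, coverage, consistency)
    return sum(w * c for w, c in zip(weights, components))

# Same pass rate, different severity profiles -> different scores.
print(composite_aqs(90, 95, 80, 85))
print(composite_aqs(90, 60, 80, 85))
```

With these weights the first agent scores 88.5 and the second 79.75, so severe failures drag an otherwise identical agent down a full rating band.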
How to View Your Score
Dashboard: Agent Detail
Navigate to your agent’s detail page. The AQS is displayed prominently with:
- Current score — the overall 0-100 value with color-coded rating
- Trend chart — score over time across test runs, so you can spot regressions
- Component breakdown — individual scores for pass rate, severity, coverage, and consistency
- Failure distribution — which failure categories are dragging the score down
Use the trend chart to watch for patterns:
- Gradual decline — New features are introducing edge cases your agent does not handle
- Sudden drop — A recent change broke something specific; check the failure breakdown
- Plateau — You have hit a ceiling; look at coverage breadth to find untested categories
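The three patterns above can be detected mechanically from the score history. A heuristic sketch, with made-up thresholds (the 5-point drop and 1-point drift are assumptions, not product behavior):

```python
def classify_trend(scores: list, drop: float = 5.0, drift: float = 1.0) -> str:
    """Label an AQS trend line with one of the patterns to watch for.

    scores: AQS values in chronological order, one per test run.
    drop/drift: hypothetical sensitivity thresholds.
    """
    if len(scores) < 2:
        return "insufficient data"
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if deltas[-1] <= -drop:
        return "sudden drop"       # a recent change broke something specific
    avg = sum(deltas) / len(deltas)
    if avg <= -drift:
        return "gradual decline"   # new features introducing unhandled edge cases
    if abs(avg) < drift and max(scores) - min(scores) < drop:
        return "plateau"           # ceiling hit; check coverage breadth
    return "improving"

print(classify_trend([88, 87, 86, 78]))   # sudden drop
print(classify_trend([85, 85, 84, 85]))   # plateau
```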
How to Improve Each Component
| Component | How to improve |
|---|---|
| Pass rate | Fix failing test cases. Start with the highest-severity failures — one critical fix has more impact than ten low-severity fixes. |
| Severity weighting | Prioritize critical and high-severity failures. Use the failure taxonomy to identify which categories produce the most severe issues. |
| Coverage breadth | Generate tests with `complexity: "mixed"` to cover more failure categories. Check which of the nine categories are untested and create targeted scenarios. |
| Consistency | If complex tests fail more than simple ones, investigate why your agent struggles with multi-step or ambiguous scenarios. Run tests at each complexity level separately to isolate the gap. |
FAQ
How often is AQS recalculated?
AQS is recalculated every time you sync results with `invarium_sync_results`. Each sync creates a new data point on the trend chart.
Can I compare AQS across different agents?
Yes. AQS is normalized to 0-100 regardless of the number of tests or agent complexity. However, comparing agents with very different coverage breadth may be misleading — an agent tested across all nine failure categories has a more meaningful score than one tested on only two.
What AQS is considered production-ready?
It depends on your risk tolerance. For customer-facing agents, aim for 80+. For internal tools with human oversight, 65+ may be acceptable. For safety-critical applications, target 90+.
Does AQS account for test count?
Not directly. An agent with 10 tests and 100% pass rate gets the same pass rate component as one with 1,000 tests and 100% pass rate. However, the coverage breadth and consistency components reward broader testing — so a 10-test agent will typically score lower overall than a 1,000-test agent that covers all nine categories.