Agent Quality Score (AQS)
A single, comparable metric for AI agent behavioral reliability.
- ✓AQS is a composite 0-100 metric measuring behavioral reliability from test results
- ✓Four components: pass rate, failure severity, coverage breadth, and cross-run consistency
- ✓Score ranges: 90-100 Excellent, 70-89 Good, 50-69 Degraded, 0-49 Critical
Why It Matters
Agent Quality Score (AQS) is a composite 0-100 metric that measures your AI agent’s behavioral reliability by evaluating test pass rate, failure severity, coverage breadth, and consistency across test runs.
Testing AI agents produces a lot of data: pass rates, failure counts, severity distributions, coverage reports. AQS distills all of that into a single number so you can:
- Compare agents — Is agent A more reliable than agent B?
- Track progress — Is your agent getting better or worse over time?
- Set thresholds — Block deploys when the score drops below an acceptable level
- Communicate risk — Share one intuitive metric with stakeholders who do not need the full breakdown
Without AQS, you are left interpreting dozens of metrics. With AQS, you have one number and a clear action plan.
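The threshold idea above can be sketched as a small deploy gate. This is a hypothetical helper (the function name, signature, and 80-point default are assumptions, not part of any Invarium API); the default follows the customer-facing guidance given later in this page.

```python
def should_block_deploy(aqs: float, threshold: float = 80.0) -> bool:
    """Return True when the score falls below the acceptable level.

    The 80-point default mirrors the suggested bar for customer-facing
    agents; tune it to your own risk tolerance.
    """
    return aqs < threshold

# A 76.4 score fails the default 80-point gate; 91.0 clears it.
print(should_block_deploy(76.4))
print(should_block_deploy(91.0))
```

In CI, you would call this after syncing results and fail the pipeline when it returns `True`.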
Score Ranges
| Range | Rating | What it means |
|---|---|---|
| 90-100 | Excellent | Agent handles edge cases reliably. Safe for production with monitoring. |
| 70-89 | Good | Agent performs well but has gaps. Review failure categories before deploying. |
| 50-69 | Degraded | Significant failures detected. Do not deploy without fixes. |
| 0-49 | Critical | Fundamental safety issues. Block deployment. |
An AQS of 100 does not mean zero risk. It means your agent handled all tested scenarios correctly. Untested scenarios may still reveal failures. Increase coverage breadth to improve confidence.
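The rating bands in the table map directly to code. A minimal sketch (the function name is an illustration, not a provided API):

```python
def aqs_rating(score: float) -> str:
    """Map a 0-100 AQS value to its documented rating band."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Degraded"
    return "Critical"

print(aqs_rating(95))   # top band: safe for production with monitoring
print(aqs_rating(68))   # Degraded: do not deploy without fixes
```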
How AQS Is Calculated
AQS combines four dimensions of agent reliability into a single score. Each dimension captures a different aspect of production readiness.
1. Pass Rate
How many test cases your agent passed. This is the most direct measure of whether the agent does what it is supposed to do. An agent that fails tests will fail in production.
2. Severity Weighting
Not all failures are equal. A hallucination that fabricates medical advice is far worse than a minor formatting issue. AQS penalizes critical and high-severity failures more heavily, so two agents with the same pass rate but different failure profiles get different scores.
3. Coverage Breadth
How many of the nine failure categories were actually tested. An agent with a 100% pass rate across two categories looks great on paper, but you have no idea how it handles the other seven failure modes. Coverage breadth penalizes untested blind spots and rewards comprehensive testing.
4. Consistency
Whether your agent performs evenly across complexity levels. An agent that aces simple requests but collapses on multi-step workflows is inconsistent — and that inconsistency is a risk in production where you cannot control what users ask.
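One way to picture how these four dimensions combine is a weighted blend. The weights below are purely illustrative assumptions; the actual AQS weighting is not specified here. What the sketch does show is the severity effect described above: two agents with identical pass rates but different severity profiles end up with different scores.

```python
def composite_aqs(pass_rate: float, severity: float,
                  coverage: float, consistency: float,
                  weights: tuple = (0.4, 0.25, 0.2, 0.15)) -> float:
    """Blend four component scores (each 0-100) into one 0-100 value.

    The weights are hypothetical, chosen only to illustrate how a
    composite metric reacts to its components.
    """
    components = (pass_rate, severity, coverage, consistency)
    return sum(w * c for w, c in zip(weights, components))

# Same pass rate, different severity profiles -> different scores.
print(composite_aqs(90, 95, 80, 85))
print(composite_aqs(90, 60, 80, 85))
```

With these weights the first agent scores 88.5 and the second 79.75, so severe failures drag an otherwise identical agent down a full rating band.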
How to View Your Score
Dashboard: Agent Detail
Navigate to your agent’s detail page. The AQS is displayed prominently with:
- Current score — the overall 0-100 value with color-coded rating
- Trend chart — score over time across test runs, so you can spot regressions
- Component breakdown — individual scores for pass rate, severity, coverage, and consistency
- Failure distribution — which failure categories are dragging the score down
Use the trend chart to watch for patterns:
- Gradual decline — New features are introducing edge cases your agent does not handle
- Sudden drop — A recent change broke something specific; check the failure breakdown
- Plateau — You have hit a ceiling; look at coverage breadth to find untested categories
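The three patterns above can be detected mechanically from the score history. A heuristic sketch, with made-up thresholds (the 5-point drop and 1-point drift are assumptions, not product behavior):

```python
def classify_trend(scores: list, drop: float = 5.0, drift: float = 1.0) -> str:
    """Label an AQS trend line with one of the patterns to watch for.

    scores: AQS values in chronological order, one per test run.
    drop/drift: hypothetical sensitivity thresholds.
    """
    if len(scores) < 2:
        return "insufficient data"
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if deltas[-1] <= -drop:
        return "sudden drop"       # a recent change broke something specific
    avg = sum(deltas) / len(deltas)
    if avg <= -drift:
        return "gradual decline"   # new features introducing unhandled edge cases
    if abs(avg) < drift and max(scores) - min(scores) < drop:
        return "plateau"           # ceiling hit; check coverage breadth
    return "improving"

print(classify_trend([88, 87, 86, 78]))   # sudden drop
print(classify_trend([85, 85, 84, 85]))   # plateau
```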
How to Improve Each Component
| Component | How to improve |
|---|---|
| Pass rate | Fix failing test cases. Start with the highest-severity failures — one critical fix has more impact than ten low-severity fixes. |
| Severity weighting | Prioritize critical and high-severity failures. Use the failure taxonomy to identify which categories produce the most severe issues. |
| Coverage breadth | Generate tests with `complexity: "mixed"` to cover more failure categories. Check which of the nine categories are untested and create targeted scenarios. |
| Consistency | If complex tests fail more than simple ones, investigate why your agent struggles with multi-step or ambiguous scenarios. Run tests at each complexity level separately to isolate the gap. |
FAQ
How often is AQS recalculated?
AQS is recalculated every time you sync results with `invarium_sync_results`. Each sync creates a new data point on the trend chart.
Can I compare AQS across different agents?
Yes. AQS is normalized to 0-100 regardless of the number of tests or agent complexity. However, comparing agents with very different coverage breadth may be misleading — an agent tested across all nine failure categories has a more meaningful score than one tested on only two.
What AQS is considered production-ready?
It depends on your risk tolerance. For customer-facing agents, aim for 80+. For internal tools with human oversight, 65+ may be acceptable. For safety-critical applications, target 90+.
Does AQS account for test count?
Not directly. An agent with 10 tests and 100% pass rate gets the same pass rate component as one with 1,000 tests and 100% pass rate. However, the coverage breadth and consistency components reward broader testing — so a 10-test agent will typically score lower overall than a 1,000-test agent that covers all nine categories.