Agent Quality Score (AQS)

A single, comparable metric for AI agent behavioral reliability.

Key Takeaways
  • AQS is a composite 0-100 metric measuring behavioral reliability from test results
  • Four components: pass rate, failure severity, coverage breadth, and cross-run consistency
  • Score ranges: 90-100 Excellent, 70-89 Good, 50-69 Degraded, 0-49 Critical

Why It Matters

Agent Quality Score (AQS) is a composite 0-100 metric that measures your AI agent’s behavioral reliability by evaluating test pass rate, failure severity, coverage breadth, and consistency across test runs.

Testing AI agents produces a lot of data: pass rates, failure counts, severity distributions, coverage reports. AQS distills all of that into a single number so you can:

  • Compare agents — Is agent A more reliable than agent B?
  • Track progress — Is your agent getting better or worse over time?
  • Set thresholds — Block deploys when the score drops below an acceptable level
  • Communicate risk — Share one intuitive metric with stakeholders who do not need the full breakdown

Without AQS, you are left interpreting dozens of metrics. With AQS, you have one number and a clear action plan.
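The threshold use case can be sketched as a tiny deploy gate. The threshold value and function name below are illustrative assumptions, not part of any shipped tooling:

```python
# Illustrative threshold: require "Good" (70+) or better before deploying.
AQS_DEPLOY_THRESHOLD = 70

def should_block_deploy(aqs: float, threshold: float = AQS_DEPLOY_THRESHOLD) -> bool:
    """Return True when the score drops below the acceptable level."""
    return aqs < threshold

# A "Degraded" score of 64 blocks the deploy; a "Good" score of 85 passes.
```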


Score Ranges

Range   | Rating    | What it means
90-100  | Excellent | Agent handles edge cases reliably. Safe for production with monitoring.
70-89   | Good      | Agent performs well but has gaps. Review failure categories before deploying.
50-69   | Degraded  | Significant failures detected. Do not deploy without fixes.
0-49    | Critical  | Fundamental safety issues. Block deployment.

An AQS of 100 does not mean zero risk. It means your agent handled all tested scenarios correctly. Untested scenarios may still reveal failures. Increase coverage breadth to improve confidence.
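The rating bands above map directly to a small lookup, shown here as an illustrative Python sketch:

```python
def aqs_rating(score: float) -> str:
    """Map an AQS value to its rating band from the table above."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Degraded"
    return "Critical"
```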


How AQS Is Calculated

AQS combines four dimensions of agent reliability into a single score. Each dimension captures a different aspect of production readiness.
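Conceptually, the combination is a weighted blend of four normalized sub-scores. The weights and formula below are purely illustrative assumptions; the actual AQS calculation is internal to the product:

```python
# Illustrative weights only -- the real AQS formula is not public.
WEIGHTS = {"pass_rate": 0.4, "severity": 0.3, "coverage": 0.2, "consistency": 0.1}

def aqs(components: dict[str, float]) -> float:
    """Blend four 0-1 component scores into a single 0-100 value."""
    return 100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS)
```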

1. Pass Rate

How many test cases your agent passed. This is the most direct measure of whether the agent does what it is supposed to do. An agent that fails tests will fail in production.

2. Severity Weighting

Not all failures are equal. A hallucination that fabricates medical advice is far worse than a minor formatting issue. AQS penalizes critical and high-severity failures more heavily, so two agents with the same pass rate but different failure profiles get different scores.
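To illustrate the idea, here is a minimal penalty sketch. The specific weight values are assumptions, not the real AQS internals:

```python
# Illustrative severity weights -- a critical failure costs far more than a low one.
SEVERITY_WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

def severity_penalty(failures: dict[str, int]) -> int:
    """Total penalty: each failure costs its severity weight."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in failures.items())

# Two agents with the same failure count but different profiles:
agent_a = {"critical": 2, "low": 0}   # heavier penalty
agent_b = {"critical": 0, "low": 2}   # lighter penalty
```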

3. Coverage Breadth

How many of the nine failure categories were actually tested. An agent with a 100% pass rate across two categories looks great on paper, but you have no idea how it handles the other seven failure modes. Coverage breadth penalizes untested blind spots and rewards comprehensive testing.
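A minimal sketch of the breadth calculation, assuming each of the nine categories counts equally:

```python
NUM_FAILURE_CATEGORIES = 9

def coverage_breadth(tested_categories: set[str]) -> float:
    """Fraction of the nine failure categories with at least one test."""
    return len(tested_categories) / NUM_FAILURE_CATEGORIES

# 100% pass rate across only two categories still leaves seven blind spots.
```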

4. Consistency

Whether your agent performs evenly across complexity levels. An agent that aces simple requests but collapses on multi-step workflows is inconsistent — and that inconsistency is a risk in production where you cannot control what users ask.
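One illustrative way to quantify this (an assumption, not the actual AQS component) is the spread between the best and worst pass rates across complexity levels:

```python
def consistency(pass_rates_by_level: dict[str, float]) -> float:
    """1 minus the gap between the best and worst complexity level (0-1 scale)."""
    rates = list(pass_rates_by_level.values())
    return 1.0 - (max(rates) - min(rates))

# An agent that aces simple requests but collapses on multi-step workflows:
uneven = {"simple": 0.95, "multi_step": 0.40}
```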


How to View Your Score

Dashboard: Agent Detail

Navigate to your agent’s detail page. The AQS is displayed prominently with:

  • Current score — the overall 0-100 value with color-coded rating
  • Trend chart — score over time across test runs, so you can spot regressions
  • Component breakdown — individual scores for pass rate, severity, coverage, and consistency
  • Failure distribution — which failure categories are dragging the score down

Use the trend chart to watch for patterns:

  • Gradual decline — New features are introducing edge cases your agent does not handle
  • Sudden drop — A recent change broke something specific; check the failure breakdown
  • Plateau — You have hit a ceiling; look at coverage breadth to find untested categories

How to Improve Each Component

Component          | How to improve
Pass rate          | Fix failing test cases. Start with the highest-severity failures — one critical fix has more impact than ten low-severity fixes.
Severity weighting | Prioritize critical and high-severity failures. Use the failure taxonomy to identify which categories produce the most severe issues.
Coverage breadth   | Generate tests with complexity: "mixed" to cover more failure categories. Check which of the nine categories are untested and create targeted scenarios.
Consistency        | If complex tests fail more than simple ones, investigate why your agent struggles with multi-step or ambiguous scenarios. Run tests at each complexity level separately to isolate the gap.
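As a purely illustrative sketch of a mixed-complexity generation request — the field names below are hypothetical and not the actual generation API:

```python
# Hypothetical test-generation config -- field names are illustrative only.
generation_config = {
    "complexity": "mixed",        # spread scenarios across complexity levels
    "categories": "all",          # target all nine failure categories
    "scenarios_per_category": 5,  # enough depth to expose uneven behavior
}
```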

FAQ

How often is AQS recalculated?

AQS is recalculated every time you sync results with invarium_sync_results. Each sync creates a new data point on the trend chart.

Can I compare AQS across different agents?

Yes. AQS is normalized to 0-100 regardless of the number of tests or agent complexity. However, comparing agents with very different coverage breadth may be misleading — an agent tested across all nine failure categories has a more meaningful score than one tested on only two.

What AQS counts as production-ready?

It depends on your risk tolerance. For customer-facing agents, aim for 80+. For internal tools with human oversight, 65+ may be acceptable. For safety-critical applications, target 90+.

Does AQS account for test count?

Not directly. An agent with 10 tests and 100% pass rate gets the same pass rate component as one with 1,000 tests and 100% pass rate. However, the coverage breadth and consistency components reward broader testing — so a 10-test agent will typically score lower overall than a 1,000-test agent that covers all nine categories.