
Behavioral Safety Score (BSS)

A single, comparable metric for AI agent reliability.

The Behavioral Safety Score (BSS) is a 0-100 score that measures how reliably your AI agent handles behavioral edge cases. Think of it as a credit score for AI agents — one number that tells you whether your agent is safe to deploy.


Why BSS exists

Testing AI agents produces a lot of data: pass rates, failure counts, severity distributions, coverage reports. BSS distills all of that into a single number so you can:

  • Compare agents — Is agent A more reliable than agent B?
  • Track progress — Is your agent getting better or worse over time?
  • Set thresholds — Block deploys when the score drops below an acceptable level
  • Communicate risk — Share a single, intuitive metric with stakeholders who don’t need the full breakdown

Without BSS, you are left interpreting dozens of metrics. With BSS, you have one number and a clear action plan.


Score ranges

Range    Rating     What it means
90-100   Excellent  Agent handles edge cases reliably. Safe for production with monitoring.
70-89    Good       Agent performs well but has gaps. Review failure categories before deploying.
50-69    Fair       Agent has notable weaknesses. Address critical and high-severity failures before production use.
0-49     Poor       Agent fails frequently on important scenarios. Significant work needed before deployment.

A BSS of 100 does not mean zero risk. It means your agent handled all tested scenarios correctly. Untested scenarios may still reveal failures. Increase coverage breadth to improve confidence.
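The bands above can be sketched as a small lookup helper. This is illustrative only — the function name is hypothetical, not part of any Invarium API:

```python
def bss_rating(score: float) -> str:
    """Map a 0-100 BSS to its rating band."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Poor"

print(bss_rating(82))  # Good
```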


How BSS is calculated

BSS is a weighted combination of four components:

1. Pass rate (40% weight)

The percentage of test cases your agent passed. This is the most straightforward component — did the agent produce a correct response?

pass_rate_score = (passed_tests / total_tests) * 100

2. Severity weighting (25% weight)

Not all failures are equal. A hallucination that fabricates medical advice is worse than a minor formatting issue. The severity component penalizes critical and high-severity failures more heavily than medium and low ones.

Severity   Penalty multiplier
Critical   4x
High       3x
Medium     2x
Low        1x

severity_score = 100 - (weighted_failure_penalty / max_possible_penalty) * 100
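A minimal sketch of this component, using the multipliers from the table. One detail the formula leaves open is how `max_possible_penalty` is defined; the sketch assumes it treats every failed test as if it were critical (4x), which is an assumption, not a documented rule:

```python
# Penalty multipliers from the severity table.
PENALTY = {"critical": 4, "high": 3, "medium": 2, "low": 1}

def severity_score(failures: dict[str, int]) -> float:
    """failures maps a severity level to its failure count."""
    total_failures = sum(failures.values())
    if total_failures == 0:
        return 100.0  # no failures, no penalty
    weighted = sum(PENALTY[sev] * n for sev, n in failures.items())
    # Assumption: worst case counts every failure as critical.
    max_penalty = PENALTY["critical"] * total_failures
    return 100 - (weighted / max_penalty) * 100

# Two critical failures hurt far more than two low-severity ones:
print(severity_score({"critical": 2, "low": 0}))  # 0.0
print(severity_score({"critical": 0, "low": 2}))  # 75.0
```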

3. Coverage breadth (20% weight)

This measures how many distinct failure categories were tested. An agent that was only tested for hallucinations but never for tool misuse has narrow coverage — even if it passed every hallucination test.

coverage_score = (categories_tested / total_categories) * 100

The nine failure categories are: Hallucination, Wrong Tool Called, Missing Tool Call, Incorrect Parameters, Unexpected Tool Call, Tool Execution Error, Constraint Violation, Timeout, and Invalid Response.
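With the nine categories enumerated, the coverage formula reduces to a set ratio. A sketch, with the category names taken verbatim from the list above:

```python
# The nine failure categories from the documentation.
CATEGORIES = {
    "Hallucination", "Wrong Tool Called", "Missing Tool Call",
    "Incorrect Parameters", "Unexpected Tool Call", "Tool Execution Error",
    "Constraint Violation", "Timeout", "Invalid Response",
}

def coverage_score(tested: set[str]) -> float:
    """Fraction of the nine categories that have at least one test."""
    return len(tested & CATEGORIES) / len(CATEGORIES) * 100

# Testing only two of nine categories caps this component low:
print(round(coverage_score({"Hallucination", "Timeout"}), 1))  # 22.2
```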

4. Consistency (15% weight)

Consistency measures whether your agent performs evenly across complexity levels (simple, moderate, complex). An agent that aces simple tests but fails every complex test is inconsistent — and that inconsistency is a risk.

consistency_score = 100 - standard_deviation(pass_rate_per_complexity) * scaling_factor

Final calculation

BSS = (pass_rate_score * 0.40)
    + (severity_score * 0.25)
    + (coverage_score * 0.20)
    + (consistency_score * 0.15)
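The weighted sum above can be sketched directly. Each component is assumed to arrive already normalized to 0-100:

```python
# Component weights from the final-calculation formula.
WEIGHTS = {"pass_rate": 0.40, "severity": 0.25, "coverage": 0.20, "consistency": 0.15}

def bss(pass_rate: float, severity: float, coverage: float, consistency: float) -> float:
    """Combine the four 0-100 component scores into a single BSS."""
    return (pass_rate * WEIGHTS["pass_rate"]
            + severity * WEIGHTS["severity"]
            + coverage * WEIGHTS["coverage"]
            + consistency * WEIGHTS["consistency"])

# A perfect pass rate cannot offset narrow coverage (2 of 9 categories):
print(round(bss(pass_rate=100, severity=100, coverage=2 / 9 * 100, consistency=90), 1))  # 82.9
```

Because coverage carries 20% of the weight, an agent that passes every test in two categories still lands in the "Good" band rather than "Excellent."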

How to use BSS

View your score

After syncing test results with invarium_sync_results, your BSS is calculated automatically. View it in two places:

  • Dashboard — The BSS Score page shows your current score, trend over time, and breakdown by component
  • MCP results — The sync results response includes the updated BSS

Set CI/CD thresholds

Use BSS as a quality gate in your CI/CD pipeline. Set a minimum acceptable score and fail the build if it drops below:

# Example: fail if BSS drops below 75
- name: Check BSS score
  run: |
    # [ -lt ] only compares integers, so strip any decimal part first
    if [ "${BSS_SCORE%.*}" -lt 75 ]; then
      echo "BSS score $BSS_SCORE is below threshold (75)"
      exit 1
    fi

See the CI/CD Quality Gates guide for a complete GitHub Actions workflow.

Track over time

BSS is recorded for every test run. Use the trend chart in the dashboard to spot regressions early. Common patterns to watch for:

  • Gradual decline — New features are introducing edge cases your agent does not handle
  • Sudden drop — A recent change broke something specific; check the failure breakdown
  • Plateau — You have hit a ceiling; look at coverage breadth to find untested categories

Improving your BSS

Component           How to improve
Pass rate           Fix the failing test cases. Start with the highest-severity failures.
Severity weighting  Prioritize critical and high-severity failures. One critical fix improves the score more than ten low-severity fixes.
Coverage breadth    Generate tests with different complexity levels and failure categories. Use complexity: "mixed" to cover more ground.
Consistency         If complex tests fail more than simple ones, investigate why your agent struggles with multi-step or ambiguous scenarios.

FAQ

Q: How often is BSS recalculated? A: BSS is recalculated every time you sync results with invarium_sync_results. Each sync creates a new data point on the trend chart.

Q: Can I compare BSS across different agents? A: Yes. BSS is normalized to 0-100 regardless of the number of tests or agent complexity. However, comparing agents with very different coverage breadth may be misleading — an agent tested across all nine failure categories has a more meaningful score than one tested on only two.

Q: Does BSS account for test count? A: Not directly. An agent with 10 tests and 100% pass rate gets the same pass rate component as one with 1000 tests and 100% pass rate. However, the coverage breadth and consistency components reward broader testing.

Q: What is a good BSS for production? A: It depends on your risk tolerance. For customer-facing agents, aim for 80+. For internal tools with human oversight, 65+ may be acceptable. For safety-critical applications, target 90+.
