Agent Readiness Audit
Static analysis of your agent’s blueprint — identify risks before running any tests.
- ✓ Static analysis of your blueprint: no tests needed, immediate feedback
- ✓ 16 checks across 4 categories: Security (50 pts), Reliability (25 pts), System Design (15 pts), Tool Quality (10 pts)
- ✓ Agent Readiness Score (ARS) 0-100: Red (0-30), Yellow (31-60), Green (61-100)
Why It Matters
The Agent Readiness Audit is a static analysis engine that examines your agent’s blueprint and architecture graph across 16 checks to identify security risks, reliability gaps, and design issues before any tests are run.
Running tests takes time. The audit gives you immediate, actionable feedback the moment you upload a blueprint. It catches structural problems — missing guardrails, exposed secrets, unconstrained permissions — that would otherwise surface as test failures or, worse, production incidents. Think of it as a linter for your agent’s architecture.
Agent Readiness Score (ARS)
The audit produces a single 0-100 score called the Agent Readiness Score (ARS). Unlike AQS, which requires test results, ARS is computed entirely from static analysis of your blueprint and (optionally) your Agent Intelligence Graph.
Score Ranges
| Range | Color | Meaning |
|---|---|---|
| 61-100 | Green | Ready. No critical issues, minor findings only. |
| 31-60 | Yellow | Notable issues. Address high-severity findings before production. |
| 0-30 | Red | Critical risk. Fundamental security or reliability gaps must be resolved. |
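The thresholds map directly to bands; a trivial sketch of that mapping:

```python
def ars_band(score: float) -> str:
    """Map an ARS value to its color band, per the thresholds above."""
    if score >= 61:
        return "green"   # ready: minor findings only
    if score >= 31:
        return "yellow"  # notable issues
    return "red"         # critical risk
```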
How the Score Is Calculated
The ARS starts at 100 and deducts points based on findings across four categories. Each category has a point budget that caps how much it can deduct — so a single category cannot tank the entire score.
Severity penalties:
| Severity | Points deducted |
|---|---|
| Critical | 8 |
| High | 4 |
| Medium | 2 |
| Low | 1 |
Category budgets:
| Category | Budget | Rationale |
|---|---|---|
| Security | 50 | Weighted highest because security failures (exposed secrets, unguarded mutations) create the most severe production risk. A single critical security finding should dominate the score. |
| Reliability | 25 | Error handling, retries, fallbacks, and timeouts determine whether your agent recovers gracefully or fails catastrophically. Reliability issues are dangerous but typically less severe than security gaps. |
| System Design | 15 | System prompt, guardrails, and output constraints shape the agent’s baseline behavior. Missing design elements lead to unpredictable responses, but the damage is usually bounded. |
| Tool Quality | 10 | Incomplete tool definitions and missing input validation cause the LLM to guess, which leads to incorrect tool usage. Important, but the lowest-impact category compared to security and reliability. |
Formula per category:

```
G = sum of global finding penalties
T = sum of per-tool finding penalties
adjusted = G + (T / num_tools)
deduction = min(budget, adjusted)
```

Per-tool findings are divided by the number of tools because a per-tool issue on 1 out of 20 tools is less severe than the same issue on 1 out of 2 tools.

Overall score:

```
ARS = 100 - sum(deductions across all categories)
```

Category budgets sum to exactly 100, so in the worst case (maximum findings in every category) the score reaches 0. In practice, most agents score between 40 and 90 on first upload.
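To make the arithmetic concrete, here is a minimal sketch of the whole calculation in Python. The finding shape (category, severity, and scope fields) is assumed for illustration and is not the audit engine's actual data model:

```python
from collections import defaultdict

# Point deductions per severity, as documented above.
SEVERITY_PENALTY = {"critical": 8, "high": 4, "medium": 2, "low": 1}

# Category budgets cap how much each category can deduct.
CATEGORY_BUDGET = {"security": 50, "reliability": 25,
                   "system_design": 15, "tool_quality": 10}

def ars(findings: list[dict], num_tools: int) -> float:
    """Compute the Agent Readiness Score from a list of findings.

    Each finding is a dict with 'category', 'severity', and 'scope'
    ('global' or 'per_tool'); this shape is assumed for illustration.
    """
    global_pts = defaultdict(float)
    per_tool_pts = defaultdict(float)
    for f in findings:
        penalty = SEVERITY_PENALTY[f["severity"]]
        if f["scope"] == "per_tool":
            per_tool_pts[f["category"]] += penalty
        else:
            global_pts[f["category"]] += penalty

    total_deduction = 0.0
    for category, budget in CATEGORY_BUDGET.items():
        # Per-tool penalties are averaged across the tool count.
        adjusted = global_pts[category] + per_tool_pts[category] / max(num_tools, 1)
        total_deduction += min(budget, adjusted)
    return 100 - total_deduction

# Example: one global critical security finding and one per-tool
# medium reliability finding on a 4-tool agent.
findings = [
    {"category": "security", "severity": "critical", "scope": "global"},
    {"category": "reliability", "severity": "medium", "scope": "per_tool"},
]
print(ars(findings, num_tools=4))  # 100 - 8 - (2 / 4) = 91.5
```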
The 16 Checks
The audit runs 11 blueprint-based checks and 5 graph-based checks. Blueprint checks analyze the static structure of your agent definition. Graph-based checks (marked with *) require Agent Intelligence Graph data and examine the relationships between nodes.
Security (50-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| secret_exposure | Blueprint | API keys, tokens, or credentials embedded in tool parameter defaults, system prompts, or tool descriptions. Scans for known prefixes (sk-, ghp_, Bearer, AKIA, etc.) and credential-like patterns. | Critical |
| tool_permissions | Blueprint | Mutating actions (write, delete, send, payment) without requires_confirmation: true. Also flags single tools that write to multiple systems and write operations without scope constraints. | Critical / High / Medium |
| unguarded_paths* | Graph | Paths from root nodes to sensitive tools (payment, database_write, email_send, file_write) that lack a Guard node anywhere along the path. | Critical / High |
| unguarded_external_access* | Graph | Tool nodes that write to an ExternalService node without a GUARDED_BY edge. External writes without validation gates are a direct attack surface. | High |
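The secret_exposure check above amounts to pattern-matching every string field in the blueprint against credential-like prefixes. A minimal sketch of that idea (the pattern list is a sample, not the audit's actual rule set):

```python
import re

# Illustrative credential patterns; the real check covers more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),           # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),           # GitHub personal access tokens
    re.compile(r"Bearer\s+[A-Za-z0-9._-]{20,}"),  # bearer tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),              # AWS access key IDs
]

def scan_for_secrets(blueprint: dict) -> list[str]:
    """Walk every string in the blueprint and flag credential-like values."""
    findings = []

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}")
        elif isinstance(value, list):
            for i, child in enumerate(value):
                walk(child, f"{path}[{i}]")
        elif isinstance(value, str):
            if any(p.search(value) for p in SECRET_PATTERNS):
                findings.append(f"possible secret at {path}")

    walk(blueprint, "blueprint")
    return findings
```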
Reliability (25-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| error_handling | Blueprint | Tools with no error_handling configured. Flags globally when no tool in the entire blueprint has error handling, and per-tool for each individual tool missing it. | High (global) / Medium (per-tool) |
| retry_logic | Blueprint | Missing or zero max_retries in constraints. Also flags contradictory configs where a tool declares error_handling: "retry" but no max_retries value exists. | Medium (global) / High (contradictory) |
| fallback_behavior | Blueprint | No tool has error_handling: "fallback" for graceful degradation. Also flags tools with side effects that lack fallback handling, and workflow chains with no conditional branching. | Medium (global) / High (side effects) / Low (chains) |
| timeout_config | Blueprint | Missing rate limits and timeout configuration. Escalates to high severity for tools that make API calls without any timeout settings. | Medium (global) / High (API calls) |
| circular_dependencies* | Graph | Cycles in the invocation flow (CAN_INVOKE and CHAINS_TO edges). Circular dependencies can cause infinite loops at runtime. | High |
| missing_error_recovery* | Graph | Chains with 3 or more tool steps where none define error handling. Escalates to high severity when the chain touches external services. | Medium / High |
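The circular_dependencies check is ordinary cycle detection over the invocation edges. A standard three-color depth-first search sketch, with the edge representation assumed for illustration:

```python
def has_cycle(edges: dict[str, list[str]]) -> bool:
    """Detect a cycle in the invocation graph (CAN_INVOKE / CHAINS_TO edges).
    The adjacency-dict shape is assumed for illustration."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in edges)

# Example: refund_tool chains to audit_log, which chains back to refund_tool.
print(has_cycle({"refund_tool": ["audit_log"], "audit_log": ["refund_tool"]}))  # True
```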
System Design (15-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| system_prompt | Blueprint | Missing system prompt entirely (critical). Also flags system prompts that lack output format specification (low). Without a system prompt, the agent has no behavioral constraints. | Critical / Low |
| guardrails | Blueprint | No guardrail instructions defined. The agent has no explicit safety boundaries (e.g., “Never process refunds over $500 without approval”). | High |
| output_constraints | Blueprint | No output constraints found in the system prompt or guardrails. Without constraints on format, length, or content type, agent responses may be inconsistent. | Medium |
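All three checks reduce to presence tests on the blueprint. A rough sketch, using the system_prompt_summary field named in the fixes table below; the other field names and the output-format heuristic are assumptions:

```python
def audit_system_design(blueprint: dict) -> list[tuple[str, str]]:
    """Presence checks for system design elements. Field names beyond
    system_prompt_summary are assumed for illustration."""
    findings = []
    prompt = blueprint.get("system_prompt_summary", "")
    if not prompt:
        findings.append(("critical", "no system prompt defined"))
    elif "format" not in prompt.lower():
        # Crude proxy for a missing output-format specification.
        findings.append(("low", "system prompt lacks output format spec"))
    if not blueprint.get("guardrails"):
        findings.append(("high", "no guardrail instructions defined"))
    return findings
```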
Tool Quality (10-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| tool_completeness | Blueprint | Missing tool descriptions (high — the LLM cannot understand when to use the tool), missing parameter schemas (medium — the LLM will guess parameter formats), and missing return type documentation (low). | High / Medium / Low |
| input_validation | Blueprint | Incomplete parameter schemas missing required arrays or type fields. Also flags numeric parameters without range constraints documented in their descriptions. | Medium / Low |
| unreachable_tools* | Graph | Tool nodes with zero incoming flow edges that are not intentional root nodes. Unreachable tools are dead code — they exist in the blueprint but can never be invoked. | Medium |
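The unreachable_tools check is a simple reachability test on the flow edges. A minimal sketch, with the data shapes assumed:

```python
def unreachable_tools(tools: set[str], flow_edges: list[tuple[str, str]],
                      roots: set[str]) -> list[str]:
    """Flag tool nodes with zero incoming flow edges that are not
    intentional root nodes. Data shapes here are illustrative."""
    has_incoming = {dst for _src, dst in flow_edges}
    return [t for t in tools if t not in has_incoming and t not in roots]
```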
How to Run the Audit
Dashboard: Agent Audit Tab
The audit runs automatically when you upload a blueprint. Navigate to your agent’s detail page and select the Agent Audit tab to see:
- Agent Readiness Score — the 0-100 ARS with color indicator (Red/Yellow/Green)
- Category breakdown — per-category scores showing budget, deduction, and finding count
- Findings list — all findings grouped by severity (critical first), each with details and a specific recommendation
The audit re-runs each time you upload an updated blueprint. Previous audit results are replaced by the latest.
Understanding Findings
Each finding contains four pieces of information:
| Field | Description |
|---|---|
| Category | Which of the 16 checks produced this finding |
| Severity | critical, high, medium, or low — determines the point deduction |
| Details | What was detected and where (includes tool name for per-tool findings) |
| Recommendation | Specific action to resolve the finding |
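Put together, a finding might look like this (a hypothetical example for illustration, not actual audit output):

```python
finding = {
    "category": "tool_permissions",
    "severity": "high",  # deducts 4 points before per-tool adjustment
    "details": "Tool 'send_email' performs a mutating action without requires_confirmation: true",
    "recommendation": "Set requires_confirmation: true on 'send_email' before production use",
}
```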
Global vs. Per-Tool Findings
Findings have one of two scopes:
- Global — Applies to the blueprint as a whole (e.g., “No system prompt defined”). Penalty is applied at full value.
- Per-tool — Applies to a specific tool (e.g., “Tool ‘send_email’ has no error handling”). Penalty is divided by the number of tools, so a single tool issue on a 20-tool agent has less impact than the same issue on a 2-tool agent (see the worked example below).
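A quick worked example of the division, with assumed numbers:

```python
SEVERITY_PENALTY = {"critical": 8, "high": 4, "medium": 2, "low": 1}

# The same high-severity per-tool finding on two different fleet sizes:
print(SEVERITY_PENALTY["high"] / 2)   # 2.0 points deducted on a 2-tool agent
print(SEVERITY_PENALTY["high"] / 20)  # 0.2 points deducted on a 20-tool agent
```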
Fixing Findings
Address findings in severity order: critical first, then high, medium, low. Each critical finding costs 8 points, so resolving one critical finding has the same score impact as resolving two high-severity findings or eight low-severity findings.
Common fixes by check:
| Check | Typical fix |
|---|---|
| secret_exposure | Remove credentials from blueprint. Use environment variables or a secrets manager. |
| tool_permissions | Add requires_confirmation: true to mutating tools. Split multi-system tools into focused single-system tools. |
| error_handling | Add error_handling: "retry" or "fallback" to tools, especially those with side effects. |
| system_prompt | Add a system_prompt_summary with role definition, behavioral rules, and output format. |
| guardrails | Add explicit safety boundaries (e.g., “Never disclose PII”, “Require approval for refunds over $500”). |
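Putting several fixes together, a tool definition that clears these checks might look like the following sketch, expressed as a Python dict. Field names such as requires_confirmation, error_handling, and max_retries come from this doc; the rest (like the timeout key) are assumptions:

```python
send_email_tool = {
    "name": "send_email",
    "description": "Send a transactional email to a single recipient.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string", "description": "Subject line, max 120 chars"},
        },
        "required": ["to", "subject"],       # input_validation: required array present
    },
    "returns": "Message ID of the queued email",   # tool_completeness
    "requires_confirmation": True,           # tool_permissions: mutating action
    "error_handling": "retry",               # error_handling check
    "constraints": {
        "max_retries": 2,                    # retry_logic: "retry" implies max_retries
        "timeout_seconds": 10,               # timeout_config; key name is an assumption
    },
}
```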
Limitations
- Graph-based checks require graph data. Five of the 16 checks (unguarded_paths, unreachable_tools, circular_dependencies, unguarded_external_access, missing_error_recovery) only run when Agent Intelligence Graph data is available. Without graph data, these checks are skipped and the audit covers 11 checks.
- Audit reflects declared architecture. The audit analyzes what your blueprint declares, not what your agent actually does at runtime. If your blueprint says a tool has requires_confirmation: true but the implementation does not enforce it, the audit will not catch that gap. Use AQS (test-based scoring) to validate runtime behavior.
- Category budgets cap deductions. A category can never deduct more than its budget. If your Security category has 60 points of findings, the deduction is capped at 50, so the ARS floor for a single-category disaster is 50, not 0.
FAQ
What is the difference between ARS and AQS?
ARS (Agent Readiness Score) is static analysis — it examines your blueprint and architecture without running any tests. AQS (Agent Quality Score) is test-based — it measures behavioral reliability from actual test results. ARS tells you if your agent is designed safely; AQS tells you if it behaves correctly. Use both: ARS for immediate feedback on architecture, AQS for validated confidence from test runs.
How do I fix critical findings?
Each finding includes a specific recommendation. For example, a secret_exposure finding recommends removing credentials and using environment variables, and a tool_permissions finding recommends adding requires_confirmation: true for mutating operations. Address critical findings first: each carries an 8-point penalty, so fixing a single critical finding recovers as much score as fixing four medium-severity findings combined.
Why are category budgets not equal?
The budgets reflect relative production risk. Security vulnerabilities (exposed credentials, unguarded mutations) can cause immediate, severe damage — data breaches, unauthorized transactions, compliance violations. Reliability issues are serious but typically recoverable. System design and tool quality affect agent behavior quality but rarely cause catastrophic failures. The 50/25/15/10 split encodes this risk hierarchy.
What if I only have blueprint data, no graph?
Eleven of the 16 checks run on blueprint data alone. You get a valid ARS, but it may be higher than your true score because graph-based checks (which catch issues like unguarded paths to payment tools) are skipped. Upload your Agent Intelligence Graph data when available for the most complete audit.