Agent Readiness Audit
Static analysis of your agent’s blueprint — identify risks before running any tests.
- ✓ Static analysis of your blueprint: no tests needed, immediate feedback
- ✓ 16 checks across 4 categories: Security (50 pts), Reliability (25 pts), System Design (15 pts), Tool Quality (10 pts)
- ✓ Agent Readiness Score (ARS) 0-100: Red (0-30), Yellow (31-60), Green (61-100)
Why It Matters
The Agent Readiness Audit is a static analysis engine that examines your agent’s blueprint and architecture graph across 16 checks to identify security risks, reliability gaps, and design issues before any tests are run.
Running tests takes time. The audit gives you immediate, actionable feedback the moment you upload a blueprint. It catches structural problems — missing guardrails, exposed secrets, unconstrained permissions — that would otherwise surface as test failures or, worse, production incidents. Think of it as a linter for your agent’s architecture.
Agent Readiness Score (ARS)
The audit produces a single 0-100 score called the Agent Readiness Score (ARS). Unlike AQS, which requires test results, ARS is computed entirely from static analysis of your blueprint and (optionally) your Agent Intelligence Graph.
Score Ranges
| Range | Color | Meaning |
|---|---|---|
| 61-100 | Green | Ready. No critical issues, minor findings only. |
| 31-60 | Yellow | Notable issues. Address high-severity findings before production. |
| 0-30 | Red | Critical risk. Fundamental security or reliability gaps must be resolved. |
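The thresholds map directly to bands; a trivial sketch of that mapping:

```python
def ars_band(score: float) -> str:
    """Map an ARS value to its color band, per the thresholds above."""
    if score >= 61:
        return "green"   # ready: minor findings only
    if score >= 31:
        return "yellow"  # notable issues
    return "red"         # critical risk
```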
How the Score Is Calculated
The ARS starts at 100 and deducts points based on findings across four categories. Each category has a point budget that caps how much it can deduct — so a single category cannot tank the entire score.
Severity penalties:
| Severity | Points deducted |
|---|---|
| Critical | 8 |
| High | 4 |
| Medium | 2 |
| Low | 1 |
Category budgets:
| Category | Budget | Rationale |
|---|---|---|
| Security | 50 | Weighted highest because security failures (exposed secrets, unguarded mutations) create the most severe production risk. A single critical security finding should dominate the score. |
| Reliability | 25 | Error handling, retries, fallbacks, and timeouts determine whether your agent recovers gracefully or fails catastrophically. Reliability issues are dangerous but typically less severe than security gaps. |
| System Design | 15 | System prompt, guardrails, and output constraints shape the agent’s baseline behavior. Missing design elements lead to unpredictable responses, but the damage is usually bounded. |
| Tool Quality | 10 | Incomplete tool definitions and missing input validation cause the LLM to guess, which leads to incorrect tool usage. Important, but the lowest-impact category compared to security and reliability. |
Formula per category:

```
G = sum of global finding penalties
T = sum of per-tool finding penalties
adjusted = G + (T / num_tools)
deduction = min(budget, adjusted)
```

Per-tool findings are divided by the number of tools because a per-tool issue on 1 out of 20 tools is less severe than the same issue on 1 out of 2 tools.

Overall score:

```
ARS = 100 - sum(deductions across all categories)
```

Category budgets sum to exactly 100, so in the worst case (maximum findings in every category) the score reaches 0. In practice, most agents score between 40 and 90 on first upload.
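To make the arithmetic concrete, here is a minimal sketch of the whole calculation in Python. The finding shape (category, severity, and scope fields) is assumed for illustration and is not the audit engine's actual data model:

```python
from collections import defaultdict

# Point deductions per severity, as documented above.
SEVERITY_PENALTY = {"critical": 8, "high": 4, "medium": 2, "low": 1}

# Category budgets cap how much each category can deduct.
CATEGORY_BUDGET = {"security": 50, "reliability": 25,
                   "system_design": 15, "tool_quality": 10}

def ars(findings: list[dict], num_tools: int) -> float:
    """Compute the Agent Readiness Score from a list of findings.

    Each finding is a dict with 'category', 'severity', and 'scope'
    ('global' or 'per_tool'); this shape is assumed for illustration.
    """
    global_pts = defaultdict(float)
    per_tool_pts = defaultdict(float)
    for f in findings:
        penalty = SEVERITY_PENALTY[f["severity"]]
        if f["scope"] == "per_tool":
            per_tool_pts[f["category"]] += penalty
        else:
            global_pts[f["category"]] += penalty

    total_deduction = 0.0
    for category, budget in CATEGORY_BUDGET.items():
        # Per-tool penalties are averaged across the tool count.
        adjusted = global_pts[category] + per_tool_pts[category] / max(num_tools, 1)
        total_deduction += min(budget, adjusted)
    return 100 - total_deduction

# Example: one global critical security finding and one per-tool
# medium reliability finding on a 4-tool agent.
findings = [
    {"category": "security", "severity": "critical", "scope": "global"},
    {"category": "reliability", "severity": "medium", "scope": "per_tool"},
]
print(ars(findings, num_tools=4))  # 100 - 8 - (2 / 4) = 91.5
```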
The 16 Checks
The audit runs 11 blueprint-based checks and 5 graph-based checks. Blueprint checks analyze the static structure of your agent definition. Graph-based checks (marked with *) require Agent Intelligence Graph data and examine the relationships between nodes.
Security (50-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| secret_exposure | Blueprint | API keys, tokens, or credentials embedded in tool parameter defaults, system prompts, or tool descriptions. Scans for known prefixes (sk-, ghp_, Bearer, AKIA, etc.) and credential-like patterns. | Critical |
| tool_permissions | Blueprint | Mutating actions (write, delete, send, payment) without requires_confirmation: true. Also flags single tools that write to multiple systems and write operations without scope constraints. | Critical / High / Medium |
| unguarded_paths* | Graph | Paths from root nodes to sensitive tools (payment, database_write, email_send, file_write) that lack a Guard node anywhere along the path. | Critical / High |
| unguarded_external_access* | Graph | Tool nodes that write to an ExternalService node without a GUARDED_BY edge. External writes without validation gates are a direct attack surface. | High |
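The secret_exposure check above amounts to pattern-matching every string field in the blueprint against credential-like prefixes. A minimal sketch of that idea (the pattern list is a sample, not the audit's actual rule set):

```python
import re

# Illustrative credential patterns; the real check covers more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),           # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),           # GitHub personal access tokens
    re.compile(r"Bearer\s+[A-Za-z0-9._-]{20,}"),  # bearer tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),              # AWS access key IDs
]

def scan_for_secrets(blueprint: dict) -> list[str]:
    """Walk every string in the blueprint and flag credential-like values."""
    findings = []

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}")
        elif isinstance(value, list):
            for i, child in enumerate(value):
                walk(child, f"{path}[{i}]")
        elif isinstance(value, str):
            if any(p.search(value) for p in SECRET_PATTERNS):
                findings.append(f"possible secret at {path}")

    walk(blueprint, "blueprint")
    return findings
```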
Reliability (25-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| error_handling | Blueprint | Tools with no error_handling configured. Flags globally when no tool in the entire blueprint has error handling, and per-tool for each individual tool missing it. | High (global) / Medium (per-tool) |
| retry_logic | Blueprint | Missing or zero max_retries in constraints. Also flags contradictory configs where a tool declares error_handling: "retry" but no max_retries value exists. | Medium (global) / High (contradictory) |
| fallback_behavior | Blueprint | No tool has error_handling: "fallback" for graceful degradation. Also flags tools with side effects that lack fallback handling, and workflow chains with no conditional branching. | Medium (global) / High (side effects) / Low (chains) |
| timeout_config | Blueprint | Missing rate limits and timeout configuration. Escalates to high severity for tools that make API calls without any timeout settings. | Medium (global) / High (API calls) |
| circular_dependencies* | Graph | Cycles in the invocation flow (CAN_INVOKE and CHAINS_TO edges). Circular dependencies can cause infinite loops at runtime. | High |
| missing_error_recovery* | Graph | Chains with 3 or more tool steps where none define error handling. Escalates to high severity when the chain touches external services. | Medium / High |
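The circular_dependencies check is ordinary cycle detection over the invocation edges. A standard three-color depth-first search sketch, with the edge representation assumed for illustration:

```python
def has_cycle(edges: dict[str, list[str]]) -> bool:
    """Detect a cycle in the invocation graph (CAN_INVOKE / CHAINS_TO edges).
    The adjacency-dict shape is assumed for illustration."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in edges)

# Example: refund_tool chains to audit_log, which chains back to refund_tool.
print(has_cycle({"refund_tool": ["audit_log"], "audit_log": ["refund_tool"]}))  # True
```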
System Design (15-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| system_prompt | Blueprint | Missing system prompt entirely (critical). Also flags system prompts that lack output format specification (low). Without a system prompt, the agent has no behavioral constraints. | Critical / Low |
| guardrails | Blueprint | No guardrail instructions defined. The agent has no explicit safety boundaries (e.g., “Never process refunds over $500 without approval”). | High |
| output_constraints | Blueprint | No output constraints found in the system prompt or guardrails. Without constraints on format, length, or content type, agent responses may be inconsistent. | Medium |
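All three checks reduce to presence tests on the blueprint. A rough sketch, using the system_prompt_summary field named in the fixes table below; the other field names and the output-format heuristic are assumptions:

```python
def audit_system_design(blueprint: dict) -> list[tuple[str, str]]:
    """Presence checks for system design elements. Field names beyond
    system_prompt_summary are assumed for illustration."""
    findings = []
    prompt = blueprint.get("system_prompt_summary", "")
    if not prompt:
        findings.append(("critical", "no system prompt defined"))
    elif "format" not in prompt.lower():
        # Crude proxy for a missing output-format specification.
        findings.append(("low", "system prompt lacks output format spec"))
    if not blueprint.get("guardrails"):
        findings.append(("high", "no guardrail instructions defined"))
    return findings
```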
Tool Quality (10-point budget)
| Check | Type | What it detects | Severity |
|---|---|---|---|
| tool_completeness | Blueprint | Missing tool descriptions (high — the LLM cannot understand when to use the tool), missing parameter schemas (medium — the LLM will guess parameter formats), and missing return type documentation (low). | High / Medium / Low |
| input_validation | Blueprint | Incomplete parameter schemas missing required arrays or type fields. Also flags numeric parameters without range constraints documented in their descriptions. | Medium / Low |
| unreachable_tools* | Graph | Tool nodes with zero incoming flow edges that are not intentional root nodes. Unreachable tools are dead code — they exist in the blueprint but can never be invoked. | Medium |
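The unreachable_tools check is a simple reachability test on the flow edges. A minimal sketch, with the data shapes assumed:

```python
def unreachable_tools(tools: set[str], flow_edges: list[tuple[str, str]],
                      roots: set[str]) -> list[str]:
    """Flag tool nodes with zero incoming flow edges that are not
    intentional root nodes. Data shapes here are illustrative."""
    has_incoming = {dst for _src, dst in flow_edges}
    return [t for t in tools if t not in has_incoming and t not in roots]
```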
How to Run the Audit
Dashboard: Agent Audit Tab
The audit runs automatically when you upload a blueprint. Navigate to your agent’s detail page and select the Agent Audit tab to see:
- Agent Readiness Score — the 0-100 ARS with color indicator (Red/Yellow/Green)
- Category breakdown — per-category scores showing budget, deduction, and finding count
- Findings list — all findings grouped by severity (critical first), each with details and a specific recommendation
The audit re-runs each time you upload an updated blueprint. Previous audit results are replaced by the latest.
Understanding Findings
Each finding contains four pieces of information:
| Field | Description |
|---|---|
| Category | Which of the 16 checks produced this finding |
| Severity | critical, high, medium, or low — determines the point deduction |
| Details | What was detected and where (includes tool name for per-tool findings) |
| Recommendation | Specific action to resolve the finding |
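Put together, a finding might look like this (a hypothetical example for illustration, not actual audit output):

```python
finding = {
    "category": "tool_permissions",
    "severity": "high",  # deducts 4 points before per-tool adjustment
    "details": "Tool 'send_email' performs a mutating action without requires_confirmation: true",
    "recommendation": "Set requires_confirmation: true on 'send_email' before production use",
}
```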
Global vs. Per-Tool Findings
Findings have one of two scopes:
- Global — Applies to the blueprint as a whole (e.g., “No system prompt defined”). Penalty is applied at full value.
- Per-tool — Applies to a specific tool (e.g., “Tool ‘send_email’ has no error handling”). Penalty is divided by the number of tools, so a single tool issue on a 20-tool agent has less impact than the same issue on a 2-tool agent (see the worked example below).
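A quick worked example of the division, with assumed numbers:

```python
SEVERITY_PENALTY = {"critical": 8, "high": 4, "medium": 2, "low": 1}

# The same high-severity per-tool finding on two different fleet sizes:
print(SEVERITY_PENALTY["high"] / 2)   # 2.0 points deducted on a 2-tool agent
print(SEVERITY_PENALTY["high"] / 20)  # 0.2 points deducted on a 20-tool agent
```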
Fixing Findings
Address findings in severity order: critical first, then high, medium, low. Each critical finding costs 8 points, so resolving one critical finding has the same score impact as resolving two high-severity findings or eight low-severity findings.
Common fixes by check:
| Check | Typical fix |
|---|---|
| secret_exposure | Remove credentials from blueprint. Use environment variables or a secrets manager. |
| tool_permissions | Add requires_confirmation: true to mutating tools. Split multi-system tools into focused single-system tools. |
| error_handling | Add error_handling: "retry" or "fallback" to tools, especially those with side effects. |
| system_prompt | Add a system_prompt_summary with role definition, behavioral rules, and output format. |
| guardrails | Add explicit safety boundaries (e.g., “Never disclose PII”, “Require approval for refunds over $500”). |
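Putting several fixes together, a tool definition that clears these checks might look like the following sketch, expressed as a Python dict. Field names such as requires_confirmation, error_handling, and max_retries come from this doc; the rest (like the timeout key) are assumptions:

```python
send_email_tool = {
    "name": "send_email",
    "description": "Send a transactional email to a single recipient.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient address"},
            "subject": {"type": "string", "description": "Subject line, max 120 chars"},
        },
        "required": ["to", "subject"],       # input_validation: required array present
    },
    "returns": "Message ID of the queued email",   # tool_completeness
    "requires_confirmation": True,           # tool_permissions: mutating action
    "error_handling": "retry",               # error_handling check
    "constraints": {
        "max_retries": 2,                    # retry_logic: "retry" implies max_retries
        "timeout_seconds": 10,               # timeout_config; key name is an assumption
    },
}
```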
Limitations
- Graph-based checks require graph data. Five of the 16 checks (unguarded_paths, unreachable_tools, circular_dependencies, unguarded_external_access, missing_error_recovery) only run when Agent Intelligence Graph data is available. Without graph data, these checks are skipped and the audit covers 11 checks.
- Audit reflects declared architecture. The audit analyzes what your blueprint declares, not what your agent actually does at runtime. If your blueprint says a tool has requires_confirmation: true but the implementation does not enforce it, the audit will not catch that gap. Use AQS (test-based scoring) to validate runtime behavior.
- Category budgets cap deductions. A category can never deduct more than its budget. If your Security category has 60 points of findings, the deduction is capped at 50, so the ARS floor for a single-category disaster is 50, not 0.
FAQ
What is the difference between ARS and AQS?
ARS (Agent Readiness Score) is static analysis — it examines your blueprint and architecture without running any tests. AQS (Agent Quality Score) is test-based — it measures behavioral reliability from actual test results. ARS tells you if your agent is designed safely; AQS tells you if it behaves correctly. Use both: ARS for immediate feedback on architecture, AQS for validated confidence from test runs.
How do I fix critical findings?
Each finding includes a specific recommendation. For example, a secret_exposure finding recommends removing credentials and using environment variables, and a tool_permissions finding recommends adding requires_confirmation: true for mutating operations. Address critical findings first: each carries an 8-point penalty, so fixing a single critical finding recovers as much score as fixing four medium-severity findings combined.
Why are category budgets not equal?
The budgets reflect relative production risk. Security vulnerabilities (exposed credentials, unguarded mutations) can cause immediate, severe damage — data breaches, unauthorized transactions, compliance violations. Reliability issues are serious but typically recoverable. System design and tool quality affect agent behavior quality but rarely cause catastrophic failures. The 50/25/15/10 split encodes this risk hierarchy.
What if I only have blueprint data, no graph?
Eleven of the 16 checks run on blueprint data alone. You get a valid ARS, but it may be higher than your true score because graph-based checks (which catch issues like unguarded paths to payment tools) are skipped. Upload your Agent Intelligence Graph data when available for the most complete audit.