Agent-Ready Product Scorecard (2026)

Purpose
Use this scorecard to evaluate whether your product is ready for agentic workflows (plan + tool calls + execution). Score each item 0–2: 0 = missing, 1 = partial, 2 = strong. Total max: 30.

A) Delegation UX (max 6)
1) Plan visibility: Users can see a step-by-step plan before execution.
2) Review surfaces: Users can review changes in diffs/timelines (not only chat logs).
3) Escalation path: Users can hand off to a human easily (queue, assign, or “needs review”).

B) Tooling & Reliability (max 8)
4) Tool contracts: Tools have stable schemas, explicit errors, and versioning.
5) Idempotency: Side-effect tools support idempotency keys to avoid duplicates.
6) Timeouts & retries: Each tool has defined timeouts and limited retries.
7) Tool SLOs: You track p95 latency and failure rates for each tool in production.

C) Budgets & Economics (max 6)
8) Per-task budgets: Hard caps on wall time, tool calls, and dollars per task.
9) Cost per resolved task: You measure total cost (inference + tools + review time) per resolved task.
10) Spend controls: Quotas and alerts by workspace/team, plus a showback report for finance.

D) Security & Compliance (max 8)
11) Least privilege: Agents use scoped identities/service principals, not broad user impersonation.
12) Read vs. act separation: Permissions distinguish drafting from sending/committing.
13) Audit trail: Immutable logs for tool calls, approvals, data accessed, and outputs.
14) Data handling: Redaction, retention windows, export controls, and configurable policy settings.

E) Evaluation & Operations (max 2)
15) Replay evals: You can replay real tasks against new models/policies before shipping.

Interpretation
- 26–30: Enterprise-ready. You can sell “autopilot” for constrained tasks.
- 20–25: Ready for drafts plus supervised execution. Focus next on budgets and audit depth.
- 14–19: Pilot-only. Improve tool contracts, review surfaces, and permissions before scaling.
- 0–13: Demo stage.
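The rubric above (15 items, each 0–2, mapped to the interpretation bands) can be sketched as a small helper. The thresholds and labels come straight from the scorecard; the function name, dictionary shape, and item keys are illustrative assumptions, not a prescribed API.

```python
from typing import Dict, Tuple

# Interpretation bands from the scorecard: (lower bound of total, label).
BANDS = [
    (26, "Enterprise-ready"),
    (20, "Drafts + supervised execution"),
    (14, "Pilot-only"),
    (0, "Demo stage"),
]


def score(items: Dict[str, int]) -> Tuple[int, str]:
    """Sum the 0-2 scores for all 15 items and map the total to a band."""
    if len(items) != 15 or any(v not in (0, 1, 2) for v in items.values()):
        raise ValueError("expected 15 items, each scored 0, 1, or 2")
    total = sum(items.values())  # max 30
    label = next(lbl for lo, lbl in BANDS if total >= lo)
    return total, label
```

For example, a product scoring 1 on every item lands at 15 total, i.e. the pilot-only band.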
In the demo stage, build the system (tools, ledger, constraints) before more UX polish.

Weekly Operating Rhythm (recommended)
- Monday: Review autonomy rate, escalation rate, and incident severity per 1,000 tasks.
- Wednesday: Audit 20 random tasks end-to-end (plan, tools, diffs, approvals).
- Friday: Run a replay eval of last week’s top 1,000 tasks against any policy or model changes.

Non-negotiables for launch
- An approval gate for irreversible actions (email sends, refunds, production config changes).
- A task ledger with correlation IDs across model calls and tool calls.
- Hard budgets (max tool calls + max cost) with deterministic escalation when exceeded.
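The hard-budget non-negotiable can be sketched as a per-task guard that counts tool calls and dollars and fails deterministically, rather than letting the agent retry past its caps. This is a minimal illustration; the class, field, and exception names are assumptions, not a reference implementation.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised to force deterministic escalation (e.g. to a 'needs review' queue)."""


@dataclass
class TaskBudget:
    max_tool_calls: int
    max_cost_usd: float
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one tool call; escalate the moment either cap is crossed."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if self.tool_calls > self.max_tool_calls or self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"budget exceeded: {self.tool_calls} calls, ${self.cost_usd:.2f}"
            )
```

In an agent loop, `charge()` would run around every side-effect tool call, and the surrounding orchestrator would catch `BudgetExceeded` and route the task to human review instead of continuing execution.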