Agent-Ready Product Scorecard (2026)

Purpose
Use this scorecard to evaluate whether your product is ready for agentic workflows (plan + tool calls + execution). Score each item 0–2: 0 = missing, 1 = partial, 2 = strong. Total max: 30.

A) Delegation UX (max 6)
1) Plan visibility: Users can see a step-by-step plan before execution.
2) Review surfaces: Users can review changes in diffs/timelines (not only chat logs).
3) Escalation path: Users can hand off to a human easily (queue, assign, or “needs review”).

B) Tooling & Reliability (max 8)
4) Tool contracts: Tools have stable schemas, explicit errors, and versioning.
5) Idempotency: Side-effect tools support idempotency keys to avoid duplicates.
6) Timeouts & retries: Each tool has defined timeouts and limited retries.
7) Tool SLOs: You track p95 latency and failure rates for each tool in production.

C) Budgets & Economics (max 6)
8) Per-task budgets: Hard caps on wall time, tool calls, and dollars per task.
9) Cost per resolved task: You measure total cost (inference + tools + review time) per resolved task.
10) Spend controls: Quotas and alerts by workspace/team, plus a showback report for finance.

D) Security & Compliance (max 8)
11) Least privilege: Agents use scoped identities/service principals, not broad user impersonation.
12) Read vs. act separation: Permissions distinguish drafting from sending/committing.
13) Audit trail: Immutable logs for tool calls, approvals, data accessed, and outputs.
14) Data handling: Redaction, retention windows, export controls, and configurable policy settings.

E) Evaluation & Operations (max 2)
15) Replay evals: You can replay real tasks against new models/policies before shipping.

Interpretation
- 26–30: Enterprise-ready. You can sell “autopilot” for constrained tasks.
- 20–25: Ready for drafts plus supervised execution. Focus next on budgets and audit depth.
- 14–19: Pilot-only. Improve tool contracts, review surfaces, and permissions before scaling.
- 0–13: Demo stage.
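The rubric above (15 items, each 0–2, mapped to the interpretation bands) can be sketched as a small helper. The thresholds and labels come straight from the scorecard; the function name, dictionary shape, and item keys are illustrative assumptions, not a prescribed API.

```python
from typing import Dict, Tuple

# Interpretation bands from the scorecard: (lower bound of total, label).
BANDS = [
    (26, "Enterprise-ready"),
    (20, "Drafts + supervised execution"),
    (14, "Pilot-only"),
    (0, "Demo stage"),
]


def score(items: Dict[str, int]) -> Tuple[int, str]:
    """Sum the 0-2 scores for all 15 items and map the total to a band."""
    if len(items) != 15 or any(v not in (0, 1, 2) for v in items.values()):
        raise ValueError("expected 15 items, each scored 0, 1, or 2")
    total = sum(items.values())  # max 30
    label = next(lbl for lo, lbl in BANDS if total >= lo)
    return total, label
```

For example, a product scoring 1 on every item lands at 15 total, i.e. the pilot-only band.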
In the demo stage, build the system (tools, ledger, constraints) before more UX polish.

Weekly Operating Rhythm (recommended)
- Monday: Review autonomy rate, escalation rate, and incident severity per 1,000 tasks.
- Wednesday: Audit 20 random tasks end-to-end (plan, tools, diffs, approvals).
- Friday: Run a replay eval of last week’s top 1,000 tasks against any policy or model changes.

Non-negotiables for launch
- An approval gate for irreversible actions (email sends, refunds, production config changes).
- A task ledger with correlation IDs across model calls and tool calls.
- Hard budgets (max tool calls + max cost) with deterministic escalation when exceeded.
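The hard-budget non-negotiable can be sketched as a per-task guard that counts tool calls and dollars and fails deterministically, rather than letting the agent retry past its caps. This is a minimal illustration; the class, field, and exception names are assumptions, not a reference implementation.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised to force deterministic escalation (e.g. to a 'needs review' queue)."""


@dataclass
class TaskBudget:
    max_tool_calls: int
    max_cost_usd: float
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one tool call; escalate the moment either cap is crossed."""
        self.tool_calls += 1
        self.cost_usd += cost_usd
        if self.tool_calls > self.max_tool_calls or self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"budget exceeded: {self.tool_calls} calls, ${self.cost_usd:.2f}"
            )
```

In an agent loop, `charge()` would run around every side-effect tool call, and the surrounding orchestrator would catch `BudgetExceeded` and route the task to human review instead of continuing execution.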