ICMD Agentic AI Reliability Scorecard (2026)

Use this scorecard to move an agent from "pilot" to "production-grade" in ~30 days. Fill it out per workflow (e.g., refunds, lead qualification, invoice reconciliation), not per product.

A) Success Definition (Week 1)
1) Name the workflow and define a single "successful run" outcome (binary). Example: "Refund issued correctly AND ticket updated AND CRM case tagged."
2) List 3–5 non-negotiable constraints: no PII leakage, no cross-tenant access, no refunds >$200 without approval.
3) Choose 2 primary metrics:
   - TSR (Task Success Rate) = successful runs / total runs
   - CPSR (Cost per Successful Run) = total run cost / successful runs

B) Evaluation Harness (Week 1–2)
1) Build an eval set of 100–300 real cases from production history (redact PII).
2) Add automated checks where possible: schema validation, policy citations, correct field updates, SQL row limits, etc.
3) Add human review sampling: 1–5% of production runs weekly.
4) Set release gates: no deploy if TSR drops >3 points vs. baseline or CPSR rises >25% (illustrated in sketch 1 after the scorecard).

C) Guardrails & Permissions (Week 2)
1) Implement least privilege: separate READ tools from WRITE tools.
2) Add approval rules for irreversible actions (100% approval for high-risk writes).
3) Require structured outputs for writes (diffs, JSON patches). No free-form "just do it."
4) Add step budgets (max steps, max retries) and stop conditions (handoff to human); see sketch 2 after the scorecard.

D) Tracing & Replay (Week 2–3)
1) Log run_id, model + version, prompt version, tool calls (inputs/outputs), retrieval sources, and final action.
2) Redact secrets/PII in logs; encrypt traces at rest.
3) Enable replay: store tool responses or snapshots for 30–90 days (policy-dependent).
4) Add dashboards for p50/p95 latency, tool-call validity rate, and escalation rate.

E) Cost & Latency Controls (Week 3)
1) Set budgets at three levels: per step, per run, per user/workspace (monthly cap).
2) Add model routing: planner model for hard reasoning; executor model for deterministic tool calls.
3) Cache retrieval and memoize repeated tool reads within a run; see sketch 3 after the scorecard.
4) Define SLOs: interactive p95 <15s; background jobs p95 <2m.

F) Launch Readiness (Week 4)
1) TSR target: 80–95% depending on risk; document acceptable failure modes.
2) Incident response: define severity levels, a rollback plan, and on-call ownership.
3) Security review checklist: audit logs, tenant isolation, approval gates, data retention.
4) Post-launch: weekly eval refresh; monthly policy review; quarterly permission minimization.

If you can report TSR, CPSR, p95 latency, and approval coverage for high-risk actions, with traces to prove it, you're ready to scale the workflow beyond early adopters.
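
Sketch 1 (illustrative): a minimal calculation of TSR and CPSR from section A.3 and the release gate from B.4. This is a sketch under assumptions, not a prescribed implementation; `RunRecord`, `release_gate_ok`, and the field names are hypothetical, while the formulas and thresholds come straight from A.3 and B.4.

```python
# Sketch of the A.3 metrics and the B.4 release gate.
# RunRecord / release_gate_ok are illustrative names, not a real API.
from dataclasses import dataclass

@dataclass
class RunRecord:
    success: bool      # did the run meet the binary "successful run" definition?
    cost_usd: float    # total model + tool cost attributed to the run

def tsr(runs: list[RunRecord]) -> float:
    """Task Success Rate = successful runs / total runs."""
    return sum(r.success for r in runs) / len(runs) if runs else 0.0

def cpsr(runs: list[RunRecord]) -> float:
    """Cost per Successful Run = total run cost / successful runs."""
    successes = sum(r.success for r in runs)
    return sum(r.cost_usd for r in runs) / successes if successes else float("inf")

def release_gate_ok(candidate: list[RunRecord], baseline: list[RunRecord]) -> bool:
    """Block deploy if TSR drops >3 points or CPSR rises >25% vs. baseline."""
    tsr_drop = (tsr(baseline) - tsr(candidate)) * 100      # percentage points
    cpsr_rise = cpsr(candidate) / cpsr(baseline) - 1.0     # relative increase
    return tsr_drop <= 3.0 and cpsr_rise <= 0.25
```

Run the gate against the eval set from B.1 before every deploy; the same functions can feed the weekly human-review sample from B.3.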
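Sketch 2 (illustrative): one way to enforce the step budget, retry limit, high-risk approval rule, and human-handoff stop condition from C.2 and C.4. The `agent`, `require_approval`, and `handoff_to_human` interfaces are hypothetical placeholders supplied by the caller.

```python
# Sketch of C.2/C.4: step budgets, retry limits, approval gating, and handoff.
# All objects passed in are assumed, caller-defined interfaces.
MAX_STEPS = 12
MAX_RETRIES = 2

def run_workflow(agent, case, require_approval, handoff_to_human):
    retries = 0
    for _ in range(MAX_STEPS):
        action = agent.next_action(case)            # planner proposes the next tool call
        if action.kind == "done":
            return action.result
        if action.kind == "write" and action.high_risk:
            if not require_approval(action):         # 100% approval for high-risk writes
                return handoff_to_human(case, reason="approval_denied")
        try:
            agent.execute(action)                    # structured output (diff / JSON patch)
            retries = 0
        except Exception:
            retries += 1
            if retries > MAX_RETRIES:
                return handoff_to_human(case, reason="retry_budget_exhausted")
    return handoff_to_human(case, reason="step_budget_exhausted")
```

The handoff reasons double as escalation-rate labels for the D.4 dashboards.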
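Sketch 3 (illustrative): a per-step and per-run cost budget plus memoization of repeated READ tool calls within a single run, covering E.1 and E.3. `RunBudget` and its method names are assumptions; the monthly per-user/workspace cap from E.1 would live in a separate accounting layer.

```python
# Sketch of E.1/E.3: per-step and per-run budgets plus read memoization.
# RunBudget is an illustrative name, not a library API.
class RunBudget:
    def __init__(self, per_step_usd: float, per_run_usd: float):
        self.per_step_usd = per_step_usd
        self.per_run_usd = per_run_usd
        self.spent_usd = 0.0
        self._read_cache: dict[tuple, object] = {}

    def charge(self, step_cost_usd: float) -> None:
        """Record a step's cost; raise when either budget is exceeded."""
        if step_cost_usd > self.per_step_usd:
            raise RuntimeError("per-step budget exceeded")
        self.spent_usd += step_cost_usd
        if self.spent_usd > self.per_run_usd:
            raise RuntimeError("per-run budget exceeded; stop and escalate")

    def cached_read(self, tool_name: str, args: tuple, read_fn):
        """Memoize repeated READ tool calls within this run."""
        key = (tool_name, args)
        if key not in self._read_cache:
            self._read_cache[key] = read_fn(*args)
        return self._read_cache[key]
```

Budget-exceeded errors should trigger the same human handoff path as the stop conditions in sketch 2, so overruns show up in the escalation rate rather than in silent cost growth.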