ICMD Agentic AI Reliability Scorecard (2026)

Use this scorecard to move an agent from "pilot" to "production-grade" in ~30 days. Fill it out per workflow (e.g., refunds, lead qualification, invoice reconciliation), not per product.

A) Success Definition (Week 1)
1) Name the workflow and define a single "successful run" outcome (binary). Example: "Refund issued correctly AND ticket updated AND CRM case tagged."
2) List 3–5 non-negotiable constraints: no PII leakage, no cross-tenant access, no refunds >$200 without approval.
3) Choose 2 primary metrics:
   - TSR (Task Success Rate) = successful runs / total runs
   - CPSR (Cost per Successful Run) = total run cost / successful runs

B) Evaluation Harness (Week 1–2)
1) Build an eval set of 100–300 real cases from production history (redact PII).
2) Add automated checks where possible: schema validation, policy citations, correct field updates, SQL row limits, etc.
3) Add human review sampling: 1–5% of production runs weekly.
4) Set release gates: no deploy if TSR drops >3 points vs. baseline or CPSR rises >25% (illustrated in sketch 1 after the scorecard).

C) Guardrails & Permissions (Week 2)
1) Implement least privilege: separate READ tools from WRITE tools.
2) Add approval rules for irreversible actions (100% approval for high-risk writes).
3) Require structured outputs for writes (diffs, JSON patches). No free-form "just do it."
4) Add step budgets (max steps, max retries) and stop conditions (handoff to human); see sketch 2 after the scorecard.

D) Tracing & Replay (Week 2–3)
1) Log run_id, model + version, prompt version, tool calls (inputs/outputs), retrieval sources, and final action.
2) Redact secrets/PII in logs; encrypt traces at rest.
3) Enable replay: store tool responses or snapshots for 30–90 days (policy-dependent).
4) Add dashboards for p50/p95 latency, tool-call validity rate, and escalation rate.

E) Cost & Latency Controls (Week 3)
1) Set budgets at three levels: per step, per run, per user/workspace (monthly cap).
2) Add model routing: planner model for hard reasoning; executor model for deterministic tool calls.
3) Cache retrieval and memoize repeated tool reads within a run; see sketch 3 after the scorecard.
4) Define SLOs: interactive p95 <15s; background jobs p95 <2m.

F) Launch Readiness (Week 4)
1) TSR target: 80–95% depending on risk; document acceptable failure modes.
2) Incident response: define severity levels, a rollback plan, and on-call ownership.
3) Security review checklist: audit logs, tenant isolation, approval gates, data retention.
4) Post-launch: weekly eval refresh; monthly policy review; quarterly permission minimization.

If you can report TSR, CPSR, p95 latency, and approval coverage for high-risk actions, with traces to prove it, you're ready to scale the workflow beyond early adopters.
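
Sketch 1 (illustrative): a minimal calculation of TSR and CPSR from section A.3 and the release gate from B.4. This is a sketch under assumptions, not a prescribed implementation; `RunRecord`, `release_gate_ok`, and the field names are hypothetical, while the formulas and thresholds come straight from A.3 and B.4.

```python
# Sketch of the A.3 metrics and the B.4 release gate.
# RunRecord / release_gate_ok are illustrative names, not a real API.
from dataclasses import dataclass

@dataclass
class RunRecord:
    success: bool      # did the run meet the binary "successful run" definition?
    cost_usd: float    # total model + tool cost attributed to the run

def tsr(runs: list[RunRecord]) -> float:
    """Task Success Rate = successful runs / total runs."""
    return sum(r.success for r in runs) / len(runs) if runs else 0.0

def cpsr(runs: list[RunRecord]) -> float:
    """Cost per Successful Run = total run cost / successful runs."""
    successes = sum(r.success for r in runs)
    return sum(r.cost_usd for r in runs) / successes if successes else float("inf")

def release_gate_ok(candidate: list[RunRecord], baseline: list[RunRecord]) -> bool:
    """Block deploy if TSR drops >3 points or CPSR rises >25% vs. baseline."""
    tsr_drop = (tsr(baseline) - tsr(candidate)) * 100      # percentage points
    cpsr_rise = cpsr(candidate) / cpsr(baseline) - 1.0     # relative increase
    return tsr_drop <= 3.0 and cpsr_rise <= 0.25
```

Run the gate against the eval set from B.1 before every deploy; the same functions can feed the weekly human-review sample from B.3.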
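Sketch 2 (illustrative): one way to enforce the step budget, retry limit, high-risk approval rule, and human-handoff stop condition from C.2 and C.4. The `agent`, `require_approval`, and `handoff_to_human` interfaces are hypothetical placeholders supplied by the caller.

```python
# Sketch of C.2/C.4: step budgets, retry limits, approval gating, and handoff.
# All objects passed in are assumed, caller-defined interfaces.
MAX_STEPS = 12
MAX_RETRIES = 2

def run_workflow(agent, case, require_approval, handoff_to_human):
    retries = 0
    for _ in range(MAX_STEPS):
        action = agent.next_action(case)            # planner proposes the next tool call
        if action.kind == "done":
            return action.result
        if action.kind == "write" and action.high_risk:
            if not require_approval(action):         # 100% approval for high-risk writes
                return handoff_to_human(case, reason="approval_denied")
        try:
            agent.execute(action)                    # structured output (diff / JSON patch)
            retries = 0
        except Exception:
            retries += 1
            if retries > MAX_RETRIES:
                return handoff_to_human(case, reason="retry_budget_exhausted")
    return handoff_to_human(case, reason="step_budget_exhausted")
```

The handoff reasons double as escalation-rate labels for the D.4 dashboards.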
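Sketch 3 (illustrative): a per-step and per-run cost budget plus memoization of repeated READ tool calls within a single run, covering E.1 and E.3. `RunBudget` and its method names are assumptions; the monthly per-user/workspace cap from E.1 would live in a separate accounting layer.

```python
# Sketch of E.1/E.3: per-step and per-run budgets plus read memoization.
# RunBudget is an illustrative name, not a library API.
class RunBudget:
    def __init__(self, per_step_usd: float, per_run_usd: float):
        self.per_step_usd = per_step_usd
        self.per_run_usd = per_run_usd
        self.spent_usd = 0.0
        self._read_cache: dict[tuple, object] = {}

    def charge(self, step_cost_usd: float) -> None:
        """Record a step's cost; raise when either budget is exceeded."""
        if step_cost_usd > self.per_step_usd:
            raise RuntimeError("per-step budget exceeded")
        self.spent_usd += step_cost_usd
        if self.spent_usd > self.per_run_usd:
            raise RuntimeError("per-run budget exceeded; stop and escalate")

    def cached_read(self, tool_name: str, args: tuple, read_fn):
        """Memoize repeated READ tool calls within this run."""
        key = (tool_name, args)
        if key not in self._read_cache:
            self._read_cache[key] = read_fn(*args)
        return self._read_cache[key]
```

Budget-exceeded errors should trigger the same human handoff path as the stop conditions in sketch 2, so overruns show up in the escalation rate rather than in silent cost growth.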