ICMD Agent Production Readiness Checklist (2026)

Use this before you sell “autonomy.” The goal is not a great demo; it’s predictable outcomes, bounded risk, and provable auditability.

1) Define the work
- Pick one queue (e.g., Zendesk tickets, AP invoices, IT access requests).
- Write a task taxonomy: 10–30 task types with clear definitions.
- For each task type, define: success criteria, allowed tools, and refusal conditions.
- Establish baseline metrics from the customer: weekly volume, SLA targets, current cost per item, backlog size.

2) Set reliability targets (per task type)
- Success rate target (e.g., 75%+ in steady state for low-risk tasks).
- Escalation budget (e.g., ≤12% after week 4).
- p95 latency target (e.g., ≤90 seconds for standard tasks).
- Blast radius rules: what the agent must never do without approval (refunds, terminations, production changes).

3) Build “structured autonomy”
- Implement a tool-call envelope with: agent_id, task_id, requested_action, evidence, and policy_context.
- Enforce schema validation on every model output (reject and retry safely).
- Make tool execution deterministic: idempotency keys, retries with backoff, and safe failure modes.
- Never expose raw secrets to the model. Use a tool proxy layer.

4) Identity, permissions, and approvals
- Create agent identities with least-privilege roles in each integrated system.
- Support OAuth scopes and RBAC; add SCIM if selling to enterprise.
- Add approval gates for high-risk actions (role-based approvers).
- Log impersonation and authorization: who allowed what, when.

5) Evals and regression control
- Build a golden dataset per workflow (at least 100–500 examples).
- Add adversarial tests: prompt injection, tool failures, missing context, ambiguous user requests.
- Put eval gates in CI: no deploy if success rate drops beyond threshold.
- Track production drift: compare last 7 days vs last 30 days on success/escalation/cost.

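The CI gate and the 7-day vs 30-day drift comparison from section 5 could be sketched roughly as below. This is a minimal illustration, not a prescribed implementation: the names WindowStats, eval_gate_passes, and drift_report, and the default 2-point max_drop threshold, are assumptions introduced here and do not come from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregate outcomes for one task type over an eval run or time window (hypothetical shape)."""
    total: int
    succeeded: int
    escalated: int

    @property
    def success_rate(self) -> float:
        # An empty window is treated as 0.0 rather than raising.
        return self.succeeded / self.total if self.total else 0.0

    @property
    def escalation_rate(self) -> float:
        return self.escalated / self.total if self.total else 0.0


def eval_gate_passes(baseline: WindowStats, candidate: WindowStats,
                     max_drop: float = 0.02) -> bool:
    """True if the candidate's success rate is no more than max_drop below the baseline."""
    return baseline.success_rate - candidate.success_rate <= max_drop


def drift_report(last_7d: WindowStats, last_30d: WindowStats) -> dict[str, float]:
    """Signed deltas (7-day minus 30-day); a negative success_delta means recent degradation."""
    return {
        "success_delta": last_7d.success_rate - last_30d.success_rate,
        "escalation_delta": last_7d.escalation_rate - last_30d.escalation_rate,
    }
```

In a CI job, the deploy step would run the golden dataset against the candidate build, construct a WindowStats from the results, and exit non-zero when eval_gate_passes returns False. The same structure could be extended with a cost field to cover the cost dimension of drift.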
6) Observability and audit
- Trace every step: retrieval, reasoning summary, tool calls, policy decisions, outputs.
- Provide an “Explain” UI for customers (exportable).
- Export logs to common tooling (Splunk/Datadog/OpenTelemetry pipelines).
- Define incident playbooks: rollback plan, kill switch, customer comms template.

7) Human-in-the-loop operations
- Ship an escalation queue that makes humans faster (prefill context, suggested action, citations).
- Capture reviewer outcomes and feed them into retraining/eval updates.
- Define SLA routing: which escalations need response in 5 min vs 24 hours.

8) Unit economics and pricing
- Measure cost per resolved task (p50 and p95) including retries and tool calls.
- Implement spend caps per tenant and per day/week.
- Add model routing: small model first; fallback only when confidence is low.
- Build an ROI dashboard: tasks resolved, escalations, time saved, estimated cost avoided, net value.

Launch criteria (minimum):
- Clear task boundaries and refusal rules
- Auditable logs for every action
- Evals running in CI + a rollback plan
- Escalation workflow + measurable escalation budget
- p95 cost and latency dashboards per tenant

If you can’t measure it, you can’t sell it, at least not twice.
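As a rough illustration of the unit-economics items in section 8, the sketch below sums every charge (model calls, retries, tool calls) per task, reports p50/p95 over resolved tasks, and enforces a per-tenant daily spend cap. All names (percentile, resolved_task_costs, SpendCap) and the nearest-rank percentile method are assumptions made for this example; the checklist does not specify them.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def resolved_task_costs(charges: list[tuple[str, float]],
                        resolved_ids: set[str]) -> list[float]:
    """Total cost per resolved task; each charge is (task_id, cost_usd) for one call or retry."""
    totals: dict[str, float] = defaultdict(float)
    for task_id, cost_usd in charges:
        totals[task_id] += cost_usd
    return [totals[task_id] for task_id in resolved_ids if task_id in totals]


class SpendCap:
    """In-memory per-tenant daily cap (hypothetical; a real system would persist this)."""

    def __init__(self, daily_cap_usd: float) -> None:
        self.daily_cap_usd = daily_cap_usd
        self._spent: dict[tuple[str, str], float] = defaultdict(float)

    def try_charge(self, tenant_id: str, day: str, cost_usd: float) -> bool:
        """Record the charge and return True, or return False if it would exceed the cap."""
        key = (tenant_id, day)
        if self._spent[key] + cost_usd > self.daily_cap_usd:
            return False
        self._spent[key] += cost_usd
        return True
```

Note that this version excludes spend on unresolved tasks from the percentiles. Whether to amortize failed attempts into the cost per resolved task is a pricing decision the sketch leaves open.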