ICMD Agent Production Readiness Checklist (2026)

Use this before you sell “autonomy.” The goal is not a great demo—it’s predictable outcomes, bounded risk, and provable auditability.

1) Define the work
- Pick one queue (e.g., Zendesk tickets, AP invoices, IT access requests).
- Write a task taxonomy: 10–30 task types with clear definitions.
- For each task type, define: success criteria, allowed tools, and refusal conditions.
- Establish baseline metrics from the customer: weekly volume, SLA targets, current cost per item, backlog size.
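
One way to keep the taxonomy enforceable is to store each task type as data rather than prose. The sketch below is only an illustration: the `TaskType` class, its field names, and the `erp.*` tool names are hypothetical and not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskType:
    """One entry in the task taxonomy (all field names are illustrative)."""
    name: str
    definition: str
    success_criteria: str
    allowed_tools: frozenset
    refusal_conditions: tuple = ()

    def permits(self, tool: str) -> bool:
        # A tool is usable only if this task type lists it explicitly.
        return tool in self.allowed_tools


# Hypothetical entry for an AP-invoice queue.
INVOICE_MATCH = TaskType(
    name="invoice_po_match",
    definition="Match an incoming invoice to an open purchase order.",
    success_criteria="Invoice linked to exactly one PO with matching totals.",
    allowed_tools=frozenset({"erp.read_po", "erp.link_invoice"}),
    refusal_conditions=("no matching PO found", "totals differ"),
)
```

With entries like this, the runtime can call `INVOICE_MATCH.permits(tool)` before every tool call instead of relying on the prompt to enforce the boundary.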

2) Set reliability targets (per task type)
- Success rate target (e.g., 75%+ in steady state for low-risk tasks).
- Escalation budget (e.g., ≤12% of tasks handed to a human after week 4).
- p95 latency target (e.g., ≤90 seconds for standard tasks).
- Blast radius rules: what the agent must never do without approval (refunds, terminations, production changes).
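
The numeric targets above can be checked mechanically per task type. This is a minimal sketch, assuming outcomes are labeled "success", "escalated", or "failed" and that p95 is computed with the nearest-rank method; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReliabilityTargets:
    min_success_rate: float      # e.g. 0.75
    max_escalation_rate: float   # e.g. 0.12
    max_p95_latency_s: float     # e.g. 90.0


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def violated_targets(outcomes, latencies_s, targets):
    """Return the names of the targets missed by one task type's recent runs.

    outcomes: non-empty list of "success", "escalated", or "failed"
    (these labels are an assumption of this sketch).
    """
    n = len(outcomes)
    violations = []
    if outcomes.count("success") / n < targets.min_success_rate:
        violations.append("success_rate")
    if outcomes.count("escalated") / n > targets.max_escalation_rate:
        violations.append("escalation_rate")
    if p95(latencies_s) > targets.max_p95_latency_s:
        violations.append("p95_latency")
    return violations
```

Running this per task type, rather than over the whole queue, keeps one noisy workflow from hiding behind a healthy average.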

3) Build “structured autonomy”
- Implement a tool-call envelope with: agent_id, task_id, requested_action, evidence, and policy_context.
- Enforce schema validation on every model output; reject invalid output and retry a bounded number of times.
- Make tool execution predictable and safe to repeat: idempotency keys, retries with backoff, and safe failure modes.
- Never expose raw secrets to the model. Use a tool proxy layer.
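
The envelope, validation, and idempotency items above might fit together roughly as follows. This is a sketch under stated assumptions: the field types, the `EnvelopeError` and `ToolProxy` names, and the choice to derive the idempotency key from agent, task, and action are all illustrative. Credentials would live inside the handlers, so the model only ever sees the envelope.

```python
import hashlib
import json

# Envelope fields from the checklist; the expected types are an assumption.
REQUIRED_FIELDS = {
    "agent_id": str,
    "task_id": str,
    "requested_action": str,
    "evidence": list,
    "policy_context": dict,
}


class EnvelopeError(ValueError):
    """Raised when model output does not form a valid tool-call envelope."""


def parse_envelope(raw: str) -> dict:
    """Validate raw model output; the caller rejects and retries on error."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise EnvelopeError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise EnvelopeError("envelope must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise EnvelopeError(f"missing or mistyped field: {name}")
    return data


def idempotency_key(envelope: dict) -> str:
    """Same agent, task, and action always map to the same key."""
    parts = [envelope["agent_id"], envelope["task_id"], envelope["requested_action"]]
    return hashlib.sha256(json.dumps(parts).encode()).hexdigest()


class ToolProxy:
    """Runs tools on the agent's behalf; secrets stay inside the handlers."""

    def __init__(self, handlers):
        self._handlers = handlers
        self._results = {}  # idempotency key -> earlier result

    def execute(self, envelope: dict):
        action = envelope["requested_action"]
        if action not in self._handlers:
            raise EnvelopeError(f"unknown action: {action}")
        key = idempotency_key(envelope)
        if key in self._results:
            return self._results[key]  # retry: do not repeat the side effect
        result = self._handlers[action](envelope)
        self._results[key] = result
        return result
```

A production version would persist the key-to-result map and add backoff around the handler call; the in-memory dictionary here only demonstrates the replay behavior.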

4) Identity, permissions, and approvals
- Create agent identities with least-privilege roles in each integrated system.
- Support OAuth scopes and RBAC; add SCIM if selling to enterprise.
- Add approval gates for high-risk actions (role-based approvers).
- Log every impersonation and authorization decision: who allowed what, and when.
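
An approval gate with an audit record could look something like the sketch below. The action names, role names, and decision labels are hypothetical; the point is only that every decision, including the ones that need no approval, leaves a log entry.

```python
from datetime import datetime, timezone

# Hypothetical mapping from high-risk action to the roles allowed to approve it.
APPROVER_ROLES = {
    "issue_refund": {"finance_manager"},
    "terminate_account": {"support_lead"},
    "change_production": {"sre_on_call"},
}


def authorize(action, agent_id, audit_log, approver_id=None, approver_role=None):
    """Decide whether an action may run and record who allowed what, when."""
    required = APPROVER_ROLES.get(action)
    if required is None:
        decision = "allowed"            # low-risk: no approval gate
    elif approver_id is None:
        decision = "pending_approval"   # high-risk and nobody has signed off
    elif approver_role in required:
        decision = "approved"
    else:
        decision = "denied"             # approver lacks a required role
    audit_log.append({
        "action": action,
        "agent_id": agent_id,
        "approver_id": approver_id,
        "approver_role": approver_role,
        "decision": decision,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return decision
```

Only "allowed" and "approved" should let the tool call proceed; "pending_approval" would route the request to the approver queue.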

5) Evals and regression control
- Build a golden dataset per workflow (at least 100 examples; aim for 500).
- Add adversarial tests: prompt injection, tool failures, missing context, ambiguous user requests.
- Put eval gates in CI: block the deploy if the success rate drops below the baseline by more than a set threshold.
- Track production drift: compare last 7 days vs last 30 days on success/escalation/cost.
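
The CI gate and the drift comparison reduce to two small functions. In this sketch the function names and the 0.02 default threshold are assumptions; pick a threshold that matches the noise in your own golden dataset.

```python
def eval_gate(baseline_success, candidate_success, max_drop=0.02):
    """Return True if the candidate build may deploy.

    max_drop is the largest tolerated fall in success rate versus the
    baseline (0.02 is an illustrative default, not a recommendation).
    """
    return (baseline_success - candidate_success) <= max_drop


def drift(last_7d, last_30d):
    """Per-metric change of the 7-day window relative to the 30-day window."""
    return {name: last_7d[name] - last_30d[name] for name in last_30d}
```

In a CI job, a script would exit with a non-zero status when `eval_gate` returns False, which blocks the deploy step. For drift, a positive delta on escalation or cost and a negative delta on success are the signals worth alerting on.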

6) Observability and audit
- Trace every step: retrieval, reasoning summary, tool calls, policy decisions, outputs.
- Provide an “Explain” UI for customers (exportable).
- Export logs to common tooling (Splunk/Datadog/OpenTelemetry pipelines).
- Define incident playbooks: rollback plan, kill switch, customer comms template.
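
As a rough illustration of step-level tracing with a kill switch, the sketch below records each step kind named above and refuses to record (and therefore to proceed) once the switch is set. The `Trace` class, the `KillSwitchEngaged` exception, and the use of a `threading.Event` as the switch are assumptions of this example; a real deployment would more likely use a shared flag in a database or feature-flag service.

```python
import json
import threading
from datetime import datetime, timezone

# Step kinds taken from the checklist item on tracing.
STEP_KINDS = {"retrieval", "reasoning_summary", "tool_call", "policy_decision", "output"}


class KillSwitchEngaged(RuntimeError):
    """Raised when an operator has halted the agent."""


class Trace:
    def __init__(self, task_id, kill_switch):
        self.task_id = task_id
        self._kill_switch = kill_switch  # threading.Event shared by all workers
        self.steps = []

    def record(self, kind, detail):
        """Append one step; call this before acting so a halt stops the run."""
        if self._kill_switch.is_set():
            raise KillSwitchEngaged(f"task {self.task_id} halted by operator")
        if kind not in STEP_KINDS:
            raise ValueError(f"unknown step kind: {kind}")
        self.steps.append({
            "task_id": self.task_id,
            "kind": kind,
            "detail": detail,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def export_jsonl(self):
        """One JSON object per line, a shape most log pipelines can ingest."""
        return "\n".join(json.dumps(step, sort_keys=True) for step in self.steps)
```

The JSON Lines export is what would feed both the customer-facing "Explain" view and the log pipeline.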

7) Human-in-the-loop operations
- Ship an escalation queue that makes humans faster (prefill context, suggested action, citations).
- Capture reviewer outcomes and feed them into retraining/eval updates.
- Define SLA routing: which escalations need a response within 5 minutes and which can wait 24 hours.
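
SLA routing can start as a lookup from escalation reason to response deadline. In the sketch below the 5-minute and 24-hour windows come from the checklist, while the tier names and the set of urgent reasons are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Response windows from the checklist; tier names are illustrative.
SLA_BY_TIER = {
    "urgent": timedelta(minutes=5),
    "standard": timedelta(hours=24),
}

# Hypothetical escalation reasons that warrant the urgent tier.
URGENT_REASONS = {"customer_blocked", "possible_policy_violation"}


def response_deadline(reason, created_at):
    """Return (tier, deadline) for an escalation created at created_at."""
    tier = "urgent" if reason in URGENT_REASONS else "standard"
    return tier, created_at + SLA_BY_TIER[tier]
```

For example, `response_deadline("customer_blocked", datetime.now(timezone.utc))` yields the urgent tier with a deadline five minutes out, which the escalation queue can sort on.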

8) Unit economics and pricing
- Measure cost per resolved task (p50 and p95) including retries and tool calls.
- Implement spend caps per tenant and per day/week.
- Add model routing: try a small model first; fall back to a larger model only when confidence is low.
- Build an ROI dashboard: tasks resolved, escalations, time saved, estimated cost avoided, net value.
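
The cost, cap, and routing items above can be prototyped in a few lines. This is a minimal sketch with several assumptions: attempts are dictionaries with `task_id`, `cost_usd`, and `resolved` keys, caps are tracked per tenant per day in memory, each model is a callable returning `(answer, confidence)`, and the 0.8 confidence threshold is arbitrary.

```python
from collections import defaultdict


def cost_per_resolved_task(attempts):
    """Total cost of each resolved task, counting its failed retries too."""
    totals = defaultdict(float)
    resolved = set()
    for attempt in attempts:
        totals[attempt["task_id"]] += attempt["cost_usd"]
        if attempt["resolved"]:
            resolved.add(attempt["task_id"])
    return {task_id: totals[task_id] for task_id in resolved}


class SpendCap:
    """Per-tenant, per-day spend cap (in-memory for illustration only)."""

    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self._spent = defaultdict(float)  # keyed by (tenant_id, day)

    def try_spend(self, tenant_id, day, amount_usd):
        """Reserve budget; return False if it would exceed the cap."""
        key = (tenant_id, day)
        if self._spent[key] + amount_usd > self.daily_limit_usd:
            return False
        self._spent[key] += amount_usd
        return True


def route(task, small_model, large_model, min_confidence=0.8):
    """Try the small model first; fall back only when confidence is low."""
    answer, confidence = small_model(task)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large_model(task)
    return answer, "large"
```

The p50 and p95 figures for the dashboard would then be percentiles over the values of `cost_per_resolved_task`, computed per tenant.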

Launch criteria (minimum):
- Clear task boundaries and refusal rules
- Auditable logs for every action
- Evals running in CI + a rollback plan
- Escalation workflow + measurable escalation budget
- p95 cost and latency dashboards per tenant

If you can’t measure it, you can’t sell it—at least not twice.