ICMD Agent Readiness & Rollout Checklist (2026)

Use this checklist to graduate an LLM agent from "assistant" to "executor" without creating security or reliability debt.

1) Scope & ROI
- Define one workflow with a clear definition of done (example: "route bug tickets to the correct team within 2 minutes").
- Quantify value: time saved per task (minutes) × volume per month × blended hourly cost.
- Set explicit success metrics: task success rate, human override rate, median latency, and cost per task.

2) Sources of Truth & Data Hygiene
- List authoritative sources (Jira/Linear, GitHub, Datadog, Salesforce, runbooks) and rank them.
- Require citations for any user-facing claim or operational recommendation.
- Add freshness rules (e.g., ignore documents older than 30 days for fast-changing policies).

3) Permissions & Identity
- Default to read-only tool access.
- Implement least privilege: fine-grained tokens, short-lived credentials, per-tool allowlists.
- Separate planner from executor: the model proposes; a constrained service account executes.
- Add approval gates for write actions (PR creation, ticket creation, deploy triggers).

4) Reliability Engineering
- Make actions idempotent where possible (safe retries, no duplicate tickets/PRs).
- Enforce structured outputs with schema validation.
- Add timeouts, retries with backoff, and deterministic fallbacks for tool failures.

5) Evaluation & Testing
- Create an initial eval set from 200–500 real historical tasks.
- Track decision accuracy, tool-call accuracy, citation coverage, and override rate.
- Run evals on every prompt/tool/schema change (treat them as CI).
- Red-team for prompt injection, data exfiltration, and policy bypass.

6) Observability & Governance
- Log prompts and outputs with redaction; define retention (e.g., 30 days) and access controls.
- Emit traces per step: model call, tool call, validator, policy check, approval event.
- Monitor spend per user/day and per workspace/month; cap max steps per task.
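The reliability practices in section 4 can be sketched in a small tool-call wrapper: an idempotency key deduplicates retries, a schema check enforces structured outputs, and exponential backoff plus a deterministic fallback handles tool failures. This is a minimal illustration, not a production harness; the `create_ticket` action name, the required fields, and the in-memory idempotency store are all hypothetical stand-ins.

```python
import hashlib
import json
import time

# Hypothetical in-memory idempotency store; production would use a shared DB/cache.
_seen = {}


def idempotency_key(action, payload):
    """Derive a stable key so retries of the same logical action are deduplicated."""
    raw = action + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def validate_schema(result, required):
    """Minimal structured-output check: each required field present with the right type."""
    for field, ftype in required.items():
        if not isinstance(result.get(field), ftype):
            raise ValueError(f"schema violation: {field!r} is not {ftype.__name__}")


def call_tool(action, payload, fn, retries=3, base_delay=0.5,
              fallback=None, required=None):
    """Idempotent tool call with retries, exponential backoff, and a fallback."""
    required = required or {"id": str, "status": str}  # assumed example schema
    key = idempotency_key(action, payload)
    if key in _seen:  # safe retry: no duplicate tickets/PRs
        return _seen[key]
    for attempt in range(retries):
        try:
            result = fn(payload)
            validate_schema(result, required)
            _seen[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                if fallback is not None:
                    return fallback  # deterministic fallback instead of crashing
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

A timeout would normally wrap `fn(payload)` as well (e.g., via the HTTP client's timeout parameter); it is omitted here to keep the sketch self-contained.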
7) Rollout Plan
- Start in dry-run mode for 1–2 weeks; compare agent decisions against human outcomes.
- Canary rollout: 5% → 25% → 100% of traffic, with automatic rollback thresholds.
- Maintain a rollback procedure (disable tool writes, revert to read-only, pin the model version).

8) Graduation Gates (recommended)
- Tool-call success rate > 99%.
- Decision accuracy > 95% on the eval set (higher for critical workflows).
- Citation coverage = 100% for key claims.
- Human override rate < 10% for low-risk workflows.
- Cost per task within your ROI model (including review time).

If you can't meet these gates, narrow the scope, reduce tool permissions, add rules and validators, and expand the eval set before increasing autonomy.
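The graduation gates above can be expressed as a simple automated check that a rollout pipeline runs before increasing autonomy. The metric names and thresholds below mirror section 8 but are illustrative; adapt both to your own workflow (and note that the cost gate is omitted since it depends on your ROI model).

```python
# Each gate maps a metric name to a pass predicate; thresholds from section 8.
GATES = {
    "tool_call_success_rate": lambda v: v > 0.99,
    "decision_accuracy": lambda v: v > 0.95,
    "citation_coverage": lambda v: v == 1.0,
    "human_override_rate": lambda v: v < 0.10,
}


def failed_gates(metrics):
    """Return names of gates the metrics fail; a missing metric also fails."""
    return [name for name, passes in GATES.items()
            if name not in metrics or not passes(metrics[name])]


def ready_to_graduate(metrics):
    """True only when every gate passes."""
    return not failed_gates(metrics)
```

Treating a missing metric as a failure is deliberate: a gate you are not measuring is a gate you cannot claim to have passed.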