ICMD Agent Readiness & Rollout Checklist (2026)

Use this checklist to graduate an LLM agent from "assistant" to "executor" without creating security or reliability debt.

1) Scope & ROI
- Define one workflow with a clear definition of done (example: "route bug tickets to the correct team within 2 minutes").
- Quantify value: time saved per task (minutes) × volume per month × blended hourly cost.
- Set explicit success metrics: task success rate, human override rate, median latency, and cost per task.

2) Sources of Truth & Data Hygiene
- List authoritative sources (Jira/Linear, GitHub, Datadog, Salesforce, runbooks) and rank them.
- Require citations for any user-facing claim or operational recommendation.
- Add freshness rules (e.g., ignore documents older than 30 days for fast-changing policies).

3) Permissions & Identity
- Default to read-only tool access.
- Implement least privilege: fine-grained tokens, short-lived credentials, per-tool allowlists.
- Separate planner from executor: the model proposes; a constrained service account executes.
- Add approval gates for write actions (PR creation, ticket creation, deploy triggers).

4) Reliability Engineering
- Make actions idempotent where possible (safe retries, no duplicate tickets/PRs).
- Enforce structured outputs with schema validation.
- Add timeouts, retries with backoff, and deterministic fallbacks for tool failures.

5) Evaluation & Testing
- Create an initial eval set from 200–500 real historical tasks.
- Track decision accuracy, tool-call accuracy, citation coverage, and override rate.
- Run evals on every prompt/tool/schema change (treat them as CI).
- Red-team for prompt injection, data exfiltration, and policy bypass.

6) Observability & Governance
- Log prompts and outputs with redaction; define retention (e.g., 30 days) and access controls.
- Emit traces per step: model call, tool call, validator, policy check, approval event.
- Monitor spend per user/day and per workspace/month; cap max steps per task.
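The reliability practices in section 4 can be sketched in a small tool-call wrapper: an idempotency key deduplicates retries, a schema check enforces structured outputs, and exponential backoff plus a deterministic fallback handles tool failures. This is a minimal illustration, not a production harness; the `create_ticket` action name, the required fields, and the in-memory idempotency store are all hypothetical stand-ins.

```python
import hashlib
import json
import time

# Hypothetical in-memory idempotency store; production would use a shared DB/cache.
_seen = {}


def idempotency_key(action, payload):
    """Derive a stable key so retries of the same logical action are deduplicated."""
    raw = action + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()


def validate_schema(result, required):
    """Minimal structured-output check: each required field present with the right type."""
    for field, ftype in required.items():
        if not isinstance(result.get(field), ftype):
            raise ValueError(f"schema violation: {field!r} is not {ftype.__name__}")


def call_tool(action, payload, fn, retries=3, base_delay=0.5,
              fallback=None, required=None):
    """Idempotent tool call with retries, exponential backoff, and a fallback."""
    required = required or {"id": str, "status": str}  # assumed example schema
    key = idempotency_key(action, payload)
    if key in _seen:  # safe retry: no duplicate tickets/PRs
        return _seen[key]
    for attempt in range(retries):
        try:
            result = fn(payload)
            validate_schema(result, required)
            _seen[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                if fallback is not None:
                    return fallback  # deterministic fallback instead of crashing
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

A timeout would normally wrap `fn(payload)` as well (e.g., via the HTTP client's timeout parameter); it is omitted here to keep the sketch self-contained.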
7) Rollout Plan
- Start in dry-run mode for 1–2 weeks; compare agent decisions against human outcomes.
- Canary rollout: 5% → 25% → 100% of traffic, with automatic rollback thresholds.
- Maintain a rollback procedure (disable tool writes, revert to read-only, pin the model version).

8) Graduation Gates (recommended)
- Tool-call success rate > 99%.
- Decision accuracy > 95% on the eval set (higher for critical workflows).
- Citation coverage = 100% for key claims.
- Human override rate < 10% for low-risk workflows.
- Cost per task within your ROI model (including review time).

If you can't meet these gates, narrow the scope, reduce tool permissions, add rules and validators, and expand the eval set before increasing autonomy.
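The graduation gates above can be expressed as a simple automated check that a rollout pipeline runs before increasing autonomy. The metric names and thresholds below mirror section 8 but are illustrative; adapt both to your own workflow (and note that the cost gate is omitted since it depends on your ROI model).

```python
# Each gate maps a metric name to a pass predicate; thresholds from section 8.
GATES = {
    "tool_call_success_rate": lambda v: v > 0.99,
    "decision_accuracy": lambda v: v > 0.95,
    "citation_coverage": lambda v: v == 1.0,
    "human_override_rate": lambda v: v < 0.10,
}


def failed_gates(metrics):
    """Return names of gates the metrics fail; a missing metric also fails."""
    return [name for name, passes in GATES.items()
            if name not in metrics or not passes(metrics[name])]


def ready_to_graduate(metrics):
    """True only when every gate passes."""
    return not failed_gates(metrics)
```

Treating a missing metric as a failure is deliberate: a gate you are not measuring is a gate you cannot claim to have passed.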