ICMD Agent Readiness Checklist (2026)

Use this checklist to take an agent from idea to production without surprise costs, security gaps, or untestable behavior. Copy/paste into your ticket as acceptance criteria.

1) Scope the Unit of Work (UoW)
- Define a single job-to-be-done in one page: trigger, inputs, outputs, and “done” condition.
- List the top 10 edge cases (missing data, conflicting records, rate limits, ambiguous user intent).
- Identify the worst-case failure and ensure it’s reversible (draft vs. send; recommend vs. execute).

2) Tooling & Data Contracts
- Every tool has a typed interface (JSON schema / function signature) and returns machine-readable errors.
- Actions are idempotent (safe to retry) and carry request IDs.
- Retrieval sources are explicit (which KBs, which repos, which DB tables), with access controls.

3) Identity, Permissions, and Secrets
- No shared API keys: use per-run short-lived tokens (TTL of minutes) or delegated OAuth.
- Enforce least privilege: allowlist endpoints and constrain parameters (e.g., refund <= $50).
- Implement an emergency kill switch and automated credential revocation.

4) Budgets and Stop Conditions
- Set per-run budgets: token budget, tool-call budget, and wall-clock timeout.
- Define stop conditions in code: goal met, budget exceeded, policy violation, low confidence.
- Add degrade modes: switch to summarization or human escalation at 70–90% budget use.

5) Observability & Auditability
- Log every run with prompt/tool versions, model version, tool inputs/outputs, latency, and token counts.
- Ensure replayability for incident response (target: >99% of runs replayable).
- Redact or tokenize sensitive fields in logs (PII, secrets, regulated data).

6) Evaluation (Evals)
- Build a golden set (50–300 real tasks) with expected outcomes.
- Add adversarial tests: prompt-injection attempts, malformed tool outputs, missing permissions.
- Run evals on every deployment; fail the release on regression thresholds.
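The budgets and stop conditions in step 4 can be sketched as a small per-run guard. This is a minimal sketch, not a prescribed implementation: the class name, budget values, and 80% degrade threshold are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field
from enum import Enum

class RunState(Enum):
    CONTINUE = "continue"
    DEGRADE = "degrade"   # switch to summarization or escalate to a human
    STOP = "stop"         # hard stop: a budget has been exceeded

@dataclass
class BudgetGuard:
    """Hypothetical per-run guard: token, tool-call, and wall-clock budgets."""
    max_tokens: int = 50_000
    max_tool_calls: int = 25
    max_seconds: float = 120.0
    degrade_at: float = 0.8  # enter degrade mode at 80% of any budget
    tokens_used: int = 0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Call after each model or tool step to account for usage."""
        self.tokens_used += tokens
        self.tool_calls += tool_calls

    def check(self) -> RunState:
        """Return the run state based on the worst (highest) budget ratio."""
        elapsed = time.monotonic() - self.started_at
        worst = max(
            self.tokens_used / self.max_tokens,
            self.tool_calls / self.max_tool_calls,
            elapsed / self.max_seconds,
        )
        if worst >= 1.0:
            return RunState.STOP
        if worst >= self.degrade_at:
            return RunState.DEGRADE
        return RunState.CONTINUE
```

The agent loop calls `guard.check()` before each step and exits (or escalates) on anything other than `CONTINUE`; the "goal met / policy violation / low confidence" stop conditions from step 4 would sit alongside this budget check.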
7) Human-in-the-Loop (HITL) Gates
- Start in “draft mode,” where the agent proposes actions for approval.
- Define escalation rules (low confidence, missing data, high-risk action, budget exceeded).
- Measure escalation rate and approval latency; iterate until both are stable.

8) Rollout Plan
- Canary to internal users first (e.g., 5–10%), then expand by risk tier.
- Monitor success rate, cost per outcome, policy violations, and customer-impact metrics.
- Document rollback: model downgrade, tool disable, feature flag off, revert prompt version.

Success definition (fill in):
- KPI: ______________________ (e.g., cost per resolved ticket, mean time to mitigate)
- Target success rate: _______%
- Target cost per outcome: $________
- Max acceptable policy violations per week: ________
- Max time-to-resolution: ________
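The escalation rules in step 7 can be sketched as a single gate function that decides whether a proposed action stays in draft mode. The action names, fields, and 0.8 confidence threshold below are illustrative assumptions, not part of the checklist.

```python
from dataclasses import dataclass

# Hypothetical high-risk action names; a real deployment would load these
# from the allowlist defined in step 3 (Identity, Permissions, and Secrets).
HIGH_RISK_ACTIONS = {"send_email", "issue_refund", "delete_record"}

@dataclass
class ProposedAction:
    name: str
    confidence: float        # model self-reported confidence in [0, 1]
    has_required_data: bool  # False when inputs are missing or conflicting
    budget_exceeded: bool    # set by the run's budget guard (step 4)

def needs_human_approval(action: ProposedAction,
                         min_confidence: float = 0.8) -> bool:
    """Return True if the action must be routed to a reviewer (draft mode).

    Mirrors the step-7 escalation rules: low confidence, missing data,
    budget exceeded, or a high-risk action.
    """
    return (
        action.confidence < min_confidence
        or not action.has_required_data
        or action.budget_exceeded
        or action.name in HIGH_RISK_ACTIONS
    )
```

Logging each gate decision with its triggering rule gives you the escalation rate and approval latency that step 7 asks you to measure.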