Production Agent Readiness Checklist (2026)

Use this checklist before you let an agent take real actions in core systems (CRM, payments, ticketing, code, IT).

1) Scope & Success Metrics
- Define ONE workflow the agent owns (single system of record + 1–2 downstream actions).
- Write numeric acceptance criteria (e.g., 40% auto-triage rate, <2% misroute rate, p95 <30s).
- Define “stop conditions” (max wall-clock time, max tool calls, max retries).
- Define escalation rules (what goes to a human; what requires manager approval).

2) Tooling Contracts (Propose vs Execute)
- Every tool has a strict JSON/schema contract with required fields + enumerations.
- Validate every model output before executing tools.
- Implement idempotency keys for any write action (refunds, tickets, PRs, CRM updates).
- Separate “propose” from “execute”: model suggests; policy service approves/denies.

3) Identity, Permissions, and Secrets
- Create a dedicated agent identity (OAuth client / service account) per agent.
- Enforce least privilege roles; document why each permission exists.
- No static secrets in prompts, logs, or vector stores; rotate credentials on a schedule (target <30 days).
- Rate-limit high-impact tools (e.g., max 10 writes/minute) and add circuit breakers.

4) Data & Memory Hygiene
- Classify memory types: session state, long-term knowledge, episodic logs.
- Define retention per type (e.g., session 7 days, logs 30–90 days, knowledge versioned).
- Add PII detection/redaction before storage and before sending to model APIs.
- Ensure RAG sources are versioned and traceable (which doc/section supported a decision).

5) Evaluation & Release Process
- Build a regression suite of real tasks (start with 50–200) and keep it under version control.
- Track metrics: task success rate, tool-call accuracy, escalation rate, hallucination on grounded Qs.
- Run canary releases: 5% volume for 48 hours, then 25%, with instant rollback.
- Set a “no-ship” threshold (e.g., success rate drops >2 points vs baseline).

6) Observability & Incident Response
- Emit structured traces for each run: prompts, retrieval hits, tool calls, tool outputs, validations.
- Dashboard p50/p95 latency, failure rate, cost per run, and top escalation reasons.
- Add alerts for abnormal behavior: duplicate actions, spike in retries, spend anomalies.
- Write an incident playbook: disable switch, rollback procedure, and customer comms plan.

7) Unit Economics & Budget Controls
- Define unit cost targets (e.g., <$0.50 per resolved ticket or <$2 per invoice matched).
- Use model routing (cheap router + expensive solver) with explicit escalation criteria.
- Enforce spend caps per day/week and per tenant; fail safe to human handoff.
- Track wasted spend: tokens on failed runs, retries, and rejected tool calls.

Launch Gate: You’re ready when (a) schemas + policy gates exist for every write tool, (b) you can quantify success + cost per completed task, (c) you can audit any action end-to-end, and (d) you can shut the agent off in one click.