Agent Production Readiness Checklist (2026)

Use this checklist to graduate an agent from demo to production. It’s written for founders, tech leads, and operators.

1) Define the workflow boundary (scope)
- Name the workflow (e.g., “Refund handling,” “RMA creation,” “On-call remediation”).
- List systems touched (Stripe, Salesforce, Jira, AWS, internal admin).
- Classify actions as READ vs WRITE vs DESTRUCTIVE.
- Define a rollback path for each write (revert PR, void invoice, undo permission grant).

2) Identity & permissions (least privilege)
- Create a dedicated agent identity (service account) per workflow.
- Grant tool permissions per action, not “admin” access.
- Separate read-only credentials from write credentials.
- Set explicit thresholds (e.g., refunds <= $50 auto; $51–$300 requires approval; >$300 blocked).

3) Policy enforcement (outside the model)
- Put a policy gate between agent output and tool execution.
- Validate tool inputs with strict schemas (JSON Schema / Pydantic).
- Enforce destination allowlists (email domains, Slack channels, export buckets, webhook hosts).
- Add rate limits: max tool calls per run; max runs per minute; max spend per hour.

4) Observability & audit
- Assign a unique run_id per execution and propagate it to every tool call.
- Log: tool name, args (redacted), response codes, latency, and side-effect IDs (refund_id, ticket_id, PR#).
- Store immutable audit events for write actions.
- Set retention rules: short TTL for sensitive payloads; long-term retention for redacted metadata.

5) Quality gates & evals
- Define success metrics: completion rate, critical error rate, escalation rate, AHT, CSAT impact.
- Create a labeled dataset of real cases (at least 200) and run regression tests.
- Add canary releases: start at 1–5% traffic, then ramp weekly.
- Require a “kill switch” to disable writes instantly.

6) Cost governance
- Track cost per successful outcome (inference + tool costs + human review time).
- Implement model routing (cheap model for triage/extraction; frontier model for high-stakes reasoning).
- Add caching for repeated intents and retrieval results (with redaction).
- Cap retries and enforce timeouts to prevent runaway loops.

7) Security & data handling
- Redact secrets/PII before sending to LLMs when possible.
- Block direct access to raw credential stores; use short-lived tokens.
- Review prompt/tool injection risks for any retrieval sources.
- Document data residency requirements and subprocessors for compliance.

Graduation rule of thumb:
- You can move from “assist” to “autopilot” only when you can prove (a) scoped permissions, (b) 100% trace coverage for actions, (c) critical error rate <= 0.1% on write paths, and (d) cost per success below your ROI threshold.
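The policy gate described in sections 2 and 3 (plus the per-run call cap from section 6) is the piece most teams get wrong by leaving it inside the prompt. The following is an added example, not code from the checklist’s author: a gate that sits between the model’s proposed tool call and the tool itself, and that enforces strict input schemas, per-action scopes, a destructive-action ban, the write kill switch, destination allowlists, and the $50/$300 refund thresholds from the text. The tool registry, scope names, channels, and hosts are constructed for illustration; in practice the schema checks would be Pydantic or JSON Schema models rather than the hand-rolled type table used here.

```python
"""Policy gate between agent output and tool execution (added example)."""
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class ActionClass(Enum):
    READ = "read"
    WRITE = "write"
    DESTRUCTIVE = "destructive"


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"


# Hypothetical tool registry: action class, required argument types, and the
# credential scope each tool needs. Read and write scopes are deliberately distinct.
TOOL_REGISTRY = {
    "stripe.get_charge": {"class": ActionClass.READ, "schema": {"charge_id": str}, "scope": "stripe:read"},
    "stripe.refund": {"class": ActionClass.WRITE, "schema": {"charge_id": str, "amount_usd": (int, float)}, "scope": "stripe:write"},
    "slack.post": {"class": ActionClass.WRITE, "schema": {"channel": str, "text": str}, "scope": "slack:write"},
    "webhook.send": {"class": ActionClass.WRITE, "schema": {"url": str, "payload": dict}, "scope": "http:write"},
    "aws.delete_bucket": {"class": ActionClass.DESTRUCTIVE, "schema": {"bucket": str}, "scope": "aws:write"},
}

# Destination allowlists (constructed values).
ALLOWED_SLACK_CHANNELS = {"#support-agent", "#refund-approvals"}
ALLOWED_WEBHOOK_HOSTS = {"hooks.example.com"}

# Refund thresholds from the checklist: <= $50 auto; $51-$300 approval; > $300 blocked.
REFUND_AUTO_MAX = 50
REFUND_APPROVAL_MAX = 300


@dataclass
class PolicyGate:
    granted_scopes: set            # scopes held by this workflow's service account
    max_calls_per_run: int = 20    # rate limit: tool calls per run
    kill_switch_writes: bool = False
    calls_this_run: int = 0

    def evaluate(self, tool: str, args: dict) -> tuple:
        """Return (Decision, reason). Only ALLOW proceeds to the tool; the
        order of checks is fixed so a schema failure never consumes budget."""
        spec = TOOL_REGISTRY.get(tool)
        if spec is None:
            return Decision.BLOCK, "unknown tool"
        # Strict schema: every key present, right type, no extra keys.
        schema = spec["schema"]
        extra = set(args) - set(schema)
        if extra:
            return Decision.BLOCK, f"unexpected args: {sorted(extra)}"
        for key, typ in schema.items():
            if key not in args:
                return Decision.BLOCK, f"missing arg: {key}"
            # bool is a subclass of int, so reject it explicitly for numeric fields
            if not isinstance(args[key], typ) or isinstance(args[key], bool):
                return Decision.BLOCK, f"bad type for {key}"
        # Per-action permission: the agent identity must hold this tool's scope.
        if spec["scope"] not in self.granted_scopes:
            return Decision.BLOCK, f"scope {spec['scope']} not granted"
        # Rate limit per run.
        self.calls_this_run += 1
        if self.calls_this_run > self.max_calls_per_run:
            return Decision.BLOCK, "per-run call limit exceeded"
        cls = spec["class"]
        if cls is ActionClass.DESTRUCTIVE:
            return Decision.BLOCK, "destructive actions are never auto-executed"
        if cls is ActionClass.WRITE and self.kill_switch_writes:
            return Decision.BLOCK, "kill switch engaged for writes"
        # Destination allowlists.
        if tool == "slack.post" and args["channel"] not in ALLOWED_SLACK_CHANNELS:
            return Decision.BLOCK, "slack channel not allowlisted"
        if tool == "webhook.send":
            parsed = urlparse(args["url"])
            if parsed.scheme != "https" or parsed.hostname not in ALLOWED_WEBHOOK_HOSTS:
                return Decision.BLOCK, "webhook destination not allowlisted"
        # Monetary thresholds.
        if tool == "stripe.refund":
            amt = args["amount_usd"]
            if amt <= 0:
                return Decision.BLOCK, "non-positive refund"
            if amt > REFUND_APPROVAL_MAX:
                return Decision.BLOCK, "refund above hard limit"
            if amt > REFUND_AUTO_MAX:
                return Decision.REQUIRE_APPROVAL, "refund above auto limit"
        return Decision.ALLOW, "ok"
```

Because the gate is ordinary code outside the model, its decisions are deterministic and testable, and a REQUIRE_APPROVAL result can be routed to a human queue while the run_id keeps the context attached.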
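Section 4 asks for a run_id on every tool call, redacted arguments, immutable audit events, and separate retention for sensitive payloads versus redacted metadata. The following is an added example, not the checklist author’s code, showing one way to meet all four: a `call_tool` wrapper that stamps each call with the run_id, a recursive `redact` for known-sensitive keys and e-mail-like strings (the key list and regexes are constructed placeholders, not a real DLP engine), and a hash-chained append-only log whose `verify` detects any after-the-fact edit. Raw payloads are held separately under a short TTL while the redacted event is kept.

```python
"""Audit trail for agent tool calls (added example): run_id propagation,
argument redaction, a hash-chained event log, and TTL on raw payloads."""
import hashlib
import json
import re
import time
import uuid
from dataclasses import dataclass, field

# Keys whose values never reach the log in clear text (constructed list).
SENSITIVE_KEYS = {"api_key", "token", "password", "card_number", "email"}
# Illustrative PII pattern; a real deployment would use a proper DLP library.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
GENESIS = "0" * 64


def new_run_id() -> str:
    return uuid.uuid4().hex


def redact(value):
    """Recursively replace sensitive keys and e-mail-looking strings."""
    if isinstance(value, dict):
        return {k: ("[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v))
                for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


def _digest(body: dict) -> str:
    # Canonical JSON so the hash is stable regardless of dict ordering.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    events: list = field(default_factory=list)          # redacted, long retention
    sensitive_ttl_s: int = 3600                          # short TTL for raw payloads
    _payloads: dict = field(default_factory=dict)        # event hash -> (expires_at, raw)

    def append(self, run_id, tool, args, status, latency_ms, side_effect_ids,
               raw_payload=None, now=None):
        now = time.time() if now is None else now
        prev = self.events[-1]["hash"] if self.events else GENESIS
        event = {
            "run_id": run_id, "ts": now, "tool": tool, "args": redact(args),
            "status": status, "latency_ms": latency_ms,
            "side_effect_ids": side_effect_ids, "prev": prev,
        }
        # Each event commits to its predecessor, so editing or dropping one breaks the chain.
        event["hash"] = _digest(event)
        self.events.append(event)
        if raw_payload is not None:
            self._payloads[event["hash"]] = (now + self.sensitive_ttl_s, raw_payload)
        return event["hash"]

    def verify(self) -> bool:
        prev = GENESIS
        for e in self.events:
            body = {k: v for k, v in e.items() if k != "hash"}
            if e["prev"] != prev or _digest(body) != e["hash"]:
                return False
            prev = e["hash"]
        return True

    def expire(self, now=None) -> int:
        """Drop raw payloads past their TTL; returns how many remain."""
        now = time.time() if now is None else now
        for h in [h for h, (exp, _) in self._payloads.items() if exp <= now]:
            del self._payloads[h]
        return len(self._payloads)


def call_tool(log: AuditLog, run_id: str, tool: str, args: dict, impl):
    """Run a tool implementation and record the call under the run_id."""
    start = time.monotonic()
    try:
        result = impl(**args)
        status = "ok"
    except Exception as exc:  # failures are logged, never silently dropped
        result = {"error": type(exc).__name__}
        status = "error"
    latency_ms = int((time.monotonic() - start) * 1000)
    # Side-effect identifiers (refund_id, ticket_id, ...) are what rollback needs later.
    side_effects = {k: v for k, v in result.items() if k.endswith("_id")}
    log.append(run_id, tool, args, status, latency_ms, side_effects, raw_payload=result)
    return result
```

In production the events list would be an append-only store (object lock, WORM bucket, or a database table the agent identity can only insert into), but the chain check is the same.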
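Section 6’s “cap retries and enforce timeouts to prevent runaway loops” and “track cost per successful outcome” are easy to state and easy to skip. As an added example (not part of the checklist), here is a per-run budget that routes each step to a cheap or frontier model, charges tokens against a spend cap, and aborts on a retry cap, a step cap, or a wall-clock deadline, plus a small aggregator for cost per success including human review time. Prices, task kinds, and the step protocol are constructed assumptions.

```python
"""Run budget for an agent loop (added example): spend cap, retry cap,
step cap, deadline, model routing, and cost per successful outcome."""
import time
from dataclasses import dataclass

# Constructed per-1k-token prices for a cheap and a frontier model.
MODEL_PRICE_PER_1K = {"cheap": 0.0005, "frontier": 0.01}


class TransientToolError(Exception):
    """A tool failure that is safe to retry (e.g. a 503)."""


class BudgetExceeded(RuntimeError):
    """Raised when any per-run limit is breached; the caller must stop the run."""


@dataclass
class RunBudget:
    max_spend_usd: float = 0.50
    max_retries: int = 3
    max_steps: int = 25
    deadline_s: float = 60.0
    spent_usd: float = 0.0
    retries: int = 0
    steps: int = 0

    def charge(self, model: str, tokens: int) -> None:
        self.spent_usd += MODEL_PRICE_PER_1K[model] * tokens / 1000
        if self.spent_usd > self.max_spend_usd:
            raise BudgetExceeded(f"spend {self.spent_usd:.4f} > {self.max_spend_usd}")


def route_model(task_kind: str) -> str:
    """Cheap model for triage/extraction; frontier model for high-stakes steps."""
    return "frontier" if task_kind in {"decide", "customer_reply"} else "cheap"


def run_with_budget(step, budget: RunBudget, clock=time.monotonic) -> float:
    """Drive an agent loop. `step()` returns (done, task_kind, tokens_used) or
    raises TransientToolError. Returns the total spend on success."""
    start = clock()
    while True:
        if clock() - start > budget.deadline_s:
            raise BudgetExceeded("deadline exceeded")
        budget.steps += 1
        if budget.steps > budget.max_steps:
            raise BudgetExceeded("step cap exceeded (possible runaway loop)")
        try:
            done, task_kind, tokens = step()
        except TransientToolError:
            budget.retries += 1
            if budget.retries > budget.max_retries:
                raise BudgetExceeded("retry cap exceeded")
            continue
        budget.charge(route_model(task_kind), tokens)
        if done:
            return budget.spent_usd


def cost_per_success(runs, review_minutes: float, reviewer_usd_per_hour: float):
    """runs: iterable of (succeeded, inference_usd, tool_usd).
    Returns total cost (inference + tools + human review) per successful run,
    or None if nothing succeeded."""
    runs = list(runs)
    successes = sum(1 for ok, _, _ in runs if ok)
    if successes == 0:
        return None
    total = sum(inf + tool for _, inf, tool in runs)
    total += review_minutes / 60 * reviewer_usd_per_hour
    return total / successes
```

The deadline uses an injectable clock so the loop can be tested without sleeping; the same hook is where a real deployment would read a cancellation flag from the kill switch.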