Agent Production Readiness Checklist (2026) Use this checklist before moving any agent workflow from “demo” to “production.” The goal is to ship measurable value without creating a security, compliance, or cost incident. 1) Scope and risk tiering - Define the workflow in one sentence (start trigger, end state, systems touched). - Classify actions into Low / Medium / High risk. - Low: draft content, read-only lookups, internal suggestions. - Medium: send external messages, update CRM fields, create tickets. - High: refunds, payments, deleting data, changing permissions. - Decide which tiers are autonomous in v1 (recommended: Low only). 2) Identity, access, and secrets - Provision a dedicated identity per agent (no shared admin accounts). - Enforce least-privilege scopes per tool (resource-level if possible). - Use short-lived tokens and rotate secrets; store in a secrets manager. - Log every tool call with: agent_id, user/request_id, tool_name, timestamp, result. 3) Tooling and schemas - Make tools narrow and typed (explicit JSON schema for inputs/outputs). - Add timeouts, retries with backoff, and idempotency keys for write actions. - Validate tool arguments server-side; reject unexpected fields. - For high-impact tools, require “justification fields” (ticket_id, policy_reference). 4) Policy enforcement and approvals - Implement approval gates by risk tier and $ thresholds (e.g., refunds > $250 require approval). - Add allowlists/blocklists for destinations (approved domains, projects, channels). - Add per-run budgets: max steps, max tool calls, max tokens, max cost in USD. - Ensure a kill-switch exists (disable autonomy in < 5 minutes). 5) Evaluation and regression - Build an eval set from real cases (target: 200–2,000 examples). - Track: task success rate, policy violation rate, tool error rate, p95 latency, cost/run. - Set ship thresholds (example): ≥95% success, <0.5% policy violations on test set. - Gate releases in CI: prompt changes, model changes, tool changes must re-run evals. 6) Observability and operations - Implement tracing across the full run (retrieval → reasoning → tool calls → outcome). - Redact PII in logs; define retention (e.g., prompts 30 days, tool metadata 180 days). - Set alerts: success rate drop, cost spike, p95 latency regression, tool outage. - Write an incident runbook: disable autonomy, force human handoff, revoke tokens, communicate impact. 7) Rollout plan - Start with “assist mode” (human approves) and measure acceptance and edits. - Expand autonomy gradually: Low-risk autopilot first; Medium with approvals; High usually blocked. - Review access and performance quarterly; re-run adversarial tests after major tool/API changes. If you can’t answer “who executed what action, using which permission, under which policy, and at what cost?” you’re not production-ready.