AGENT PRODUCTION READINESS CHECKLIST (2026)

Use this checklist before enabling “autonomous” agent execution for real customers.

1) WORKFLOW SCOPE (DEFINE THE BLAST RADIUS)
- Name the workflow (one sentence) and define the “done” outcome.
- List allowed tools/actions (allow-list). Explicitly list forbidden actions.
- Decide the maximum number of tool calls per run (e.g., 8–15).
- Decide what the agent is allowed to write (records, tickets, messages) vs read only.

2) IDENTITY + PERMISSIONS (LEAST PRIVILEGE)
- Create a distinct agent principal (service identity), not a shared user token.
- Use short-lived credentials (managed identity, workload identity, or OAuth refresh flow).
- Implement granular scopes per connector (read vs write vs admin).
- Add per-action approval rules (e.g., “refund <= $50 auto; above requires approval”).
- Ensure tenants/workspaces are hard-isolated (no cross-customer data paths).

3) AUDIT LOGGING + REPLAY (NON-NEGOTIABLE)
- Assign a unique trace/run ID and propagate it across model + tool calls.
- Log structured tool events: tool name, parameters, response codes, and side effects.
- Store a replay bundle: model/version, prompt template version, retrieved doc hashes, tool schema versions.
- Provide an “agent run timeline” UI and export (CSV/JSON) for enterprise buyers.
- Set retention controls (e.g., 30/90/365 days) and document what’s stored.

4) RELIABILITY CONTROLS (BOUNDED EXECUTION)
- Implement idempotency keys for every write.
- Add timeouts, retries with exponential backoff + jitter, and a dead-letter queue.
- Add circuit breakers for flaky connectors (auto-degrade to read-only or pause runs).
- Add post-action verification (read-back checks to confirm invariants).
- Add a global “safe mode” toggle to disable writes instantly.

5) QUALITY MEASUREMENT (OUTCOME METRICS)
- Define success rate in business terms (e.g., % tickets resolved; % tasks completed correctly).
- Track rollback/override rate (how often humans undo the agent’s work).
- Track tool error rate and “policy block” rate (useful for tuning permissions).
- Build a regression set (100–500 tasks) and run it weekly.
- Set a release gate: block deployments if success drops by more than a threshold (e.g., 2%).

6) UNIT ECONOMICS (COST PER SUCCESS)
- Track $/successful task, not just tokens/run.
- Add token and cost caps per run; define behavior on cap (ask for confirmation, or stop).
- Implement model routing: cheap models for routing/extraction; premium models for hard cases.
- Add caching/memoization for repeated summaries and policy Q&A.
- Review gross margin weekly for top customers; flag accounts that go negative.

7) HUMAN CONTROL (TRUST BUILDERS)
- Add approval queues for high-risk actions with clear diff previews.
- Provide “why did it do this?” explanations based on logged evidence (not vague text).
- Allow customer admins to configure policies: thresholds, allowed tools, and escalation rules.
- Provide a one-click “report bad run” flow that attaches the run ID automatically.

PASS/FAIL RULE OF THUMB
You are production-ready when: (a) every write is gated, reversible, or verifiable; (b) you can replay and explain any run from logs; and (c) you can bound cost per successful task with caps and routing.
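
EXAMPLE SKETCH (ILLUSTRATIVE ONLY)
One way to make parts (a) and (c) of the rule concrete is a small guard that every tool call passes through before it executes. The Python sketch below is a minimal illustration under stated assumptions, not a reference implementation: the names RunPolicy, RunGuard, and check, the tool names, and the default caps are all hypothetical, and the idempotency check is in-memory only (a real system would likely persist keys and enforce them at the connector).

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RunPolicy:
    """Hypothetical per-workflow policy; all names and defaults are assumptions."""
    allowed_tools: Set[str]                 # allow-list (section 1)
    write_tools: Set[str]                   # subset of tools with side effects
    max_tool_calls: int = 12                # within the suggested 8-15 range
    max_cost_usd: float = 0.50              # assumed per-run cost cap (section 6)
    auto_approve_limit_usd: float = 50.0    # "refund <= $50 auto" (section 2)


@dataclass
class RunGuard:
    """Tracks one run's budget and decides whether each tool call may proceed."""
    policy: RunPolicy
    run_id: str                             # trace/run ID to attach to logged decisions
    safe_mode: bool = False                 # global toggle that disables writes
    tool_calls: int = 0
    cost_usd: float = 0.0
    seen_keys: Set[str] = field(default_factory=set)

    def check(self, tool: str, cost_usd: float = 0.0,
              amount_usd: Optional[float] = None,
              idempotency_key: Optional[str] = None) -> str:
        """Return 'allow', 'needs_approval', 'duplicate', or 'block:<reason>'."""
        if tool not in self.policy.allowed_tools:
            return "block:tool_not_allowed"
        if self.tool_calls + 1 > self.policy.max_tool_calls:
            return "block:tool_call_cap"
        if self.cost_usd + cost_usd > self.policy.max_cost_usd:
            return "block:cost_cap"
        if tool in self.policy.write_tools:
            if self.safe_mode:
                return "block:safe_mode"
            if idempotency_key is None:
                return "block:missing_idempotency_key"
            if idempotency_key in self.seen_keys:
                return "duplicate"          # same write already allowed in this run
            if (amount_usd is not None
                    and amount_usd > self.policy.auto_approve_limit_usd):
                return "needs_approval"     # route to a human approval queue
            self.seen_keys.add(idempotency_key)
        # Only allowed calls consume the run's budget.
        self.tool_calls += 1
        self.cost_usd += cost_usd
        return "allow"


policy = RunPolicy(allowed_tools={"lookup_order", "issue_refund"},
                   write_tools={"issue_refund"})
guard = RunGuard(policy, run_id="run-001")

print(guard.check("lookup_order", cost_usd=0.01))                 # allow
print(guard.check("issue_refund", amount_usd=20,
                  idempotency_key="run-001:refund:A1"))           # allow
print(guard.check("issue_refund", amount_usd=20,
                  idempotency_key="run-001:refund:A1"))           # duplicate
print(guard.check("issue_refund", amount_usd=75,
                  idempotency_key="run-001:refund:B2"))           # needs_approval
print(guard.check("delete_account"))                              # block:tool_not_allowed
```

The guard returns a decision string rather than raising, so each outcome can be logged as a structured event alongside the run ID and counted toward the “policy block” rate. Blocked, duplicate, and approval-pending calls deliberately do not consume the tool-call or cost budget in this sketch; whether they should is a policy choice.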