AGENT PRODUCTION READINESS CHECKLIST (2026)

Use this checklist before enabling “autonomous” agent execution for real customers.

1) WORKFLOW SCOPE (DEFINE THE BLAST RADIUS)
- Name the workflow (one sentence) and define the “done” outcome.
- List allowed tools/actions (allow-list). Explicitly list forbidden actions.
- Decide the maximum number of tool calls per run (e.g., 8–15).
- Decide what the agent may write (records, tickets, messages) and what it may only read.
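
The scope rules above can be enforced in code rather than only in the prompt. The following is a minimal sketch, not a definitive implementation: the class name WorkflowScope, the tool names, and the default budget of 12 calls are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


class ScopeViolation(Exception):
    """Raised when a run tries to step outside its declared scope."""


@dataclass
class WorkflowScope:
    allowed_tools: frozenset   # explicit allow-list; anything else is forbidden
    write_tools: frozenset     # subset of allowed_tools that can mutate state
    max_tool_calls: int = 12   # hypothetical budget inside the 8-15 range
    calls_made: int = field(default=0, init=False)

    def authorize(self, tool: str, read_only_run: bool = False) -> None:
        """Check one tool call against the scope; count it if it passes."""
        if tool not in self.allowed_tools:
            raise ScopeViolation(f"tool not on allow-list: {tool}")
        if read_only_run and tool in self.write_tools:
            raise ScopeViolation(f"write tool blocked in read-only run: {tool}")
        if self.calls_made >= self.max_tool_calls:
            raise ScopeViolation("tool-call budget exhausted")
        self.calls_made += 1
```

Calling authorize() before every tool invocation means a rejected call never reaches the connector, and rejected calls do not consume the budget.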

2) IDENTITY + PERMISSIONS (LEAST PRIVILEGE)
- Create a distinct agent principal (service identity), not a shared user token.
- Use short-lived credentials (managed identity, workload identity, or short-lived OAuth access tokens renewed through a refresh flow).
- Implement granular scopes per connector (read vs write vs admin).
- Add per-action approval rules (e.g., “refund <= $50 auto; above requires approval”).
- Ensure tenants/workspaces are hard-isolated (no cross-customer data paths).
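
A per-action approval rule such as the refund example above can be written as a small pure function. This is a sketch under stated assumptions: the scope strings, the APPROVAL_LIMITS table, and the choice to send actions without a rule to a human are illustrative, not a prescribed policy.

```python
from decimal import Decimal
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical policy table: the largest amount the agent may act on alone.
APPROVAL_LIMITS = {"refund": Decimal("50.00")}

# Hypothetical scopes granted to this agent principal.
GRANTED_SCOPES = frozenset({"billing:read", "billing:write"})


def decide(action, required_scope, amount,
           granted_scopes=GRANTED_SCOPES, limits=APPROVAL_LIMITS):
    """Return the decision for one proposed action."""
    if required_scope not in granted_scopes:
        return Decision.DENY                  # least privilege: missing scope
    limit = limits.get(action)
    if limit is None:
        return Decision.NEEDS_APPROVAL        # no rule defined: ask a human
    if amount <= limit:
        return Decision.AUTO_APPROVE
    return Decision.NEEDS_APPROVAL
```

Decimal is used instead of float so that a comparison at exactly $50.00 behaves as written in the policy.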

3) AUDIT LOGGING + REPLAY (NON-NEGOTIABLE)
- Assign a unique trace/run ID and propagate it across model + tool calls.
- Log structured tool events: tool name, parameters, response codes, and side effects.
- Store a replay bundle: model/version, prompt template version, retrieved doc hashes, tool schema versions.
- Provide an “agent run timeline” UI and export (CSV/JSON) for enterprise buyers.
- Set retention controls (e.g., 30/90/365 days) and document what’s stored.
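
The logging items above can be sketched as two record builders: one for structured tool events and one for the replay bundle. The field names and the JSON layout are assumptions for illustration only; a real system would follow its own log schema.

```python
import hashlib
import json
import time
import uuid


def new_run_id() -> str:
    """Unique ID to propagate across every model and tool call in a run."""
    return uuid.uuid4().hex


def doc_hash(text: str) -> str:
    """Content hash of a retrieved document, stored instead of the raw text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def tool_event(run_id, tool, params, status_code, side_effects) -> str:
    """Serialize one tool call as a single JSON log line."""
    return json.dumps({
        "run_id": run_id,
        "ts": time.time(),
        "tool": tool,
        "params": params,
        "status_code": status_code,
        "side_effects": side_effects,
    }, sort_keys=True)


def replay_bundle(run_id, model, prompt_template_version, docs,
                  tool_schema_versions) -> dict:
    """Collect the version information needed to replay a run later."""
    return {
        "run_id": run_id,
        "model": model,
        "prompt_template_version": prompt_template_version,
        "retrieved_doc_hashes": [doc_hash(d) for d in docs],
        "tool_schema_versions": tool_schema_versions,
    }
```

Because every event carries the same run_id, a run timeline can be rebuilt by filtering the log on that one field and sorting by timestamp.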

4) RELIABILITY CONTROLS (BOUNDED EXECUTION)
- Implement idempotency keys for every write.
- Add timeouts, retries with exponential backoff + jitter, and a dead-letter queue.
- Add circuit breakers for flaky connectors (auto-degrade to read-only or pause runs).
- Add post-action verification (read-back checks to confirm invariants).
- Add a global “safe mode” toggle to disable writes instantly.
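
Two of the controls above, idempotency keys and backoff with jitter, are easy to get subtly wrong, so a small sketch may help. It assumes an in-memory result store and a "full jitter" backoff variant; both are illustrative choices, and a production system would keep idempotency records in durable storage.

```python
import hashlib
import json
import random


def idempotency_key(run_id, step, tool, params) -> str:
    """Derive a stable key so a retried write maps to the same operation."""
    payload = json.dumps([run_id, step, tool, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: each delay is a random fraction of min(cap, base * 2**n)."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


class IdempotentWriter:
    """Wraps a write function so each key is executed at most once."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._results = {}          # assumption: in-memory store for the sketch

    def write(self, key, payload):
        if key not in self._results:
            self._results[key] = self._write_fn(payload)
        return self._results[key]   # a retry gets the original result back
```

Deriving the key from the run ID and step number, rather than generating a fresh random value per attempt, is what makes a retry after a timeout safe.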

5) QUALITY MEASUREMENT (OUTCOME METRICS)
- Define success rate in business terms (e.g., % tickets resolved; % tasks completed correctly).
- Track rollback/override rate (how often humans undo the agent’s work).
- Track tool error rate and “policy block” rate (useful for tuning permissions).
- Build a regression set (100–500 tasks) and run it weekly.
- Set a release gate: block deployments if the success rate on the regression set drops by more than a threshold (e.g., 2 percentage points) relative to the current release.
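
The release gate above reduces to a comparison of two success rates. The sketch below assumes the threshold is read as percentage points against the currently deployed baseline, and the function names are hypothetical.

```python
def success_rate(results):
    """Fraction of regression tasks that passed; results is a list of bools."""
    if not results:
        raise ValueError("regression set is empty")
    return sum(results) / len(results)


def release_allowed(baseline_rate, candidate_rate, max_drop_points=2.0):
    """Allow the release unless success fell by more than max_drop_points."""
    drop_points = (baseline_rate - candidate_rate) * 100
    return drop_points <= max_drop_points
```

With a regression set of 100 to 500 tasks, a single run is noisy near the threshold, so it may be worth re-running borderline candidates before blocking or shipping.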

6) UNIT ECONOMICS (COST PER SUCCESS)
- Track $/successful task, not just tokens/run.
- Add token and cost caps per run; define what happens when a cap is hit (ask for confirmation or stop).
- Implement model routing: cheap models for routing/extraction; premium models for hard cases.
- Add caching/memoization for repeated summaries and policy Q&A.
- Review gross margin weekly for top customers; flag accounts that go negative.
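
Cost per successful task and a per-run cost cap can be sketched in a few lines. The names RunCost and CostCap are hypothetical, and the dollar figures in any usage are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    cost_usd: float
    succeeded: bool


def cost_per_success(runs):
    """Total spend divided by successful runs; failed runs still cost money."""
    total = sum(r.cost_usd for r in runs)
    wins = sum(1 for r in runs if r.succeeded)
    return total / wins if wins else None   # None: no successes to divide by


class CostCap:
    """Tracks spend within one run and refuses charges past the cap."""

    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent_usd = 0.0

    def try_charge(self, usd) -> bool:
        if self.spent_usd + usd > self.max_usd:
            return False    # caller decides: ask for confirmation, or stop
        self.spent_usd += usd
        return True
```

Note that the numerator includes failed runs, which is why this metric can worsen even when tokens per run stay flat.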

7) HUMAN CONTROL (TRUST BUILDERS)
- Add approval queues for high-risk actions with clear diff previews.
- Provide “why did it do this?” explanations based on logged evidence (not vague text).
- Allow customer admins to configure policies: thresholds, allowed tools, and escalation rules.
- Provide a one-click “report bad run” flow that attaches the run ID automatically.
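
A diff preview for the approval queue can be produced with Python's standard difflib module. This sketch assumes the record before and after the proposed write can both be serialized to JSON; the labels "current" and "proposed" are illustrative.

```python
import difflib
import json


def diff_preview(before: dict, after: dict) -> str:
    """Unified diff of a record before and after the proposed write."""
    old = json.dumps(before, indent=2, sort_keys=True).splitlines()
    new = json.dumps(after, indent=2, sort_keys=True).splitlines()
    lines = difflib.unified_diff(
        old, new, fromfile="current", tofile="proposed", lineterm=""
    )
    return "\n".join(lines)


def approval_request(run_id: str, action: str, before: dict, after: dict) -> dict:
    """Queue entry that ties the preview to the run that proposed it."""
    return {
        "run_id": run_id,
        "action": action,
        "preview": diff_preview(before, after),
    }
```

Sorting keys before diffing keeps the preview stable, so an approver sees only the fields that would actually change.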

PASS/FAIL RULE OF THUMB
You are production-ready when: (a) every write is gated, reversible, or verifiable; (b) you can replay and explain any run from logs; and (c) you can bound cost per successful task with caps and routing.