Agent Production Readiness Checklist (2026)

Use this checklist before moving any agent workflow from "demo" to "production." The goal is to ship measurable value without creating a security, compliance, or cost incident.

1) Scope and risk tiering
- Define the workflow in one sentence (start trigger, end state, systems touched).
- Classify actions into Low / Medium / High risk.
  - Low: draft content, read-only lookups, internal suggestions.
  - Medium: send external messages, update CRM fields, create tickets.
  - High: refunds, payments, deleting data, changing permissions.
- Decide which tiers are autonomous in v1 (recommended: Low only).

2) Identity, access, and secrets
- Provision a dedicated identity per agent (no shared admin accounts).
- Enforce least-privilege scopes per tool (resource-level if possible).
- Use short-lived tokens and rotate secrets; store them in a secrets manager.
- Log every tool call with: agent_id, user/request_id, tool_name, timestamp, result.

3) Tooling and schemas
- Make tools narrow and typed (explicit JSON schema for inputs/outputs).
- Add timeouts, retries with backoff, and idempotency keys for write actions.
- Validate tool arguments server-side; reject unexpected fields.
- For high-impact tools, require "justification fields" (ticket_id, policy_reference).

4) Policy enforcement and approvals
- Implement approval gates by risk tier and $ thresholds (e.g., refunds > $250 require approval).
- Add allowlists/blocklists for destinations (approved domains, projects, channels).
- Add per-run budgets: max steps, max tool calls, max tokens, max cost in USD.
- Ensure a kill-switch exists (disable autonomy in < 5 minutes).

5) Evaluation and regression
- Build an eval set from real cases (target: 200–2,000 examples).
- Track: task success rate, policy violation rate, tool error rate, p95 latency, cost/run.
- Set ship thresholds (example): ≥95% success, <0.5% policy violations on the test set.
- Gate releases in CI: prompt changes, model changes, and tool changes must re-run evals.

6) Observability and operations
- Implement tracing across the full run (retrieval → reasoning → tool calls → outcome).
- Redact PII in logs; define retention (e.g., prompts 30 days, tool metadata 180 days).
- Set alerts: success rate drop, cost spike, p95 latency regression, tool outage.
- Write an incident runbook: disable autonomy, force human handoff, revoke tokens, communicate impact.

7) Rollout plan
- Start with "assist mode" (human approves) and measure acceptance and edits.
- Expand autonomy gradually: Low-risk autopilot first; Medium with approvals; High usually blocked.
- Review access and performance quarterly; re-run adversarial tests after major tool/API changes.

If you can't answer "who executed what action, using which permission, under which policy, and at what cost?" you're not production-ready.
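As an added example (not part of the checklist itself), the policy gate below applies the rules from sections 2 through 4 to a single tool call before the agent runtime executes it: schema validation that rejects unexpected fields, justification fields on high-tier tools, the "refunds > $250 require approval" threshold, a per-run budget, tier-based autonomy, and an audit record per attempted call. The tool catalogue, tier assignments, cost figures and field names are constructed for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Constructed example, not from the checklist: a policy gate that applies the
# tiering, schema, justification, $-threshold, budget and audit-log rules from
# sections 2-4 to one tool call before the agent runtime executes it.

# Hypothetical tool catalogue: tool_name -> (risk tier, allowed argument fields)
TOOLS = {
    "lookup_order": ("low", {"order_id"}),
    "create_ticket": ("medium", {"subject", "body"}),
    "issue_refund": ("high", {"order_id", "amount_usd", "ticket_id", "policy_reference"}),
}
JUSTIFICATION_FIELDS = {"ticket_id", "policy_reference"}  # required on high tier
REFUND_APPROVAL_THRESHOLD_USD = 250  # "refunds > $250 require approval"
AUDIT_LOG = []  # one record per attempted tool call (section 2 logging fields)

@dataclass
class RunBudget:
    max_tool_calls: int = 20
    max_cost_usd: float = 5.0
    tool_calls: int = 0
    cost_usd: float = 0.0

def evaluate_tool_call(agent_id, request_id, name, args, budget,
                       autonomous_tiers=frozenset({"low"}), est_cost_usd=0.01):
    """Return 'allow', 'needs_approval' or 'deny:<reason>' and audit the outcome."""
    decision = _decide(name, args, budget, autonomous_tiers, est_cost_usd)
    AUDIT_LOG.append({"agent_id": agent_id, "request_id": request_id, "tool_name": name,
                      "timestamp": datetime.now(timezone.utc).isoformat(), "result": decision})
    return decision

def _decide(name, args, budget, autonomous_tiers, est_cost_usd):
    if name not in TOOLS:
        return "deny:unknown_tool"
    tier, allowed_fields = TOOLS[name]
    unexpected = set(args) - allowed_fields
    if unexpected:  # server-side validation: reject fields outside the schema
        return "deny:unexpected_fields:" + ",".join(sorted(unexpected))
    if tier == "high" and not JUSTIFICATION_FIELDS <= set(args):
        return "deny:missing_justification"
    if (budget.tool_calls + 1 > budget.max_tool_calls
            or budget.cost_usd + est_cost_usd > budget.max_cost_usd):
        return "deny:budget_exceeded"  # per-run budget is a hard stop
    budget.tool_calls += 1
    budget.cost_usd += est_cost_usd
    if name == "issue_refund" and args.get("amount_usd", 0) > REFUND_APPROVAL_THRESHOLD_USD:
        return "needs_approval"  # $ threshold applies even to an autonomous tier
    if tier not in autonomous_tiers:
        return "needs_approval"  # risk tier not yet trusted for autopilot
    return "allow"
```

Denied calls are recorded in AUDIT_LOG without consuming budget, so a burst of rejected calls is visible in the trace without exhausting the run.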