Agentic AI Production Readiness Checklist (2026)

Use this checklist to decide whether a workflow is ready to move from prototype → pilot → production. Complete everything in Phase 1 before granting any write-capable autonomy.

PHASE 1 — Define the job and the SLO (must-have)

1) Workflow definition: Write a one-page spec of the workflow (start state, end state, allowed tools, disallowed actions). Include examples of "should refuse" requests.
2) Success metrics: Define an SLO with concrete numbers (e.g., 95% completion rate, p50 latency < 3s, median variable cost < $0.10/run, escalation rate < 8%).
3) Golden task set: Collect at least 100 real tasks with known correct outcomes. Hold out 20% as a regression set.

PHASE 2 — Safety and permissions (must-have for any write actions)

4) Scoped identity: Map every agent run to a tenant and user identity. Use least-privilege scopes (separate read vs. write) and prefer short-lived tokens.
5) Tool schemas: Enforce structured tool inputs and outputs (JSON Schema). Reject free-form parameters for high-risk tools.
6) Guardrails: Run policy checks before tool execution (PII/DLP checks, action allowlists, rate limits, and caps such as "max refunds/day").
7) Rollback: Provide an undo plan for every write operation (soft delete, reversible updates, or compensating transactions).

PHASE 3 — Observability and operations (required for production)

8) Tracing: Produce end-to-end traces for each run (plan → retrieve → tool calls → verification → final action). Store tool parameters with redaction.
9) Cost controls: Track tokens and dollars per run; set budgets per tenant and per workflow; alert on cost anomalies (e.g., +30% week-over-week).
10) Incident playbook: Document kill switches (disable write tools globally or per tenant), escalation paths, and postmortem templates.

PHASE 4 — Deployment discipline (recommended)

11) Staged rollout: Shadow mode (no execution) → suggest mode (human approval) → limited autopilot (low-risk actions) → broader autopilot.
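The guardrail items above (tool schemas and pre-execution policy checks) can be sketched as a guard function that runs before any tool call. This is a minimal illustration, not a prescribed implementation: the tool names, cap values, and the `GuardrailViolation` exception are hypothetical, and a real system would validate against full JSON Schemas rather than the hand-rolled type check shown here.

```python
# Hypothetical per-tool policy table: an action allowlist plus daily
# caps and per-action limits ("max refunds/day"-style guardrails).
TOOL_POLICIES = {
    "issue_refund": {"max_calls_per_day": 50, "max_amount_usd": 200.0},
    "send_email": {"max_calls_per_day": 500},
}

class GuardrailViolation(Exception):
    """Raised when a tool call fails a pre-execution policy check."""

def check_tool_call(tool_name, args, calls_today):
    """Run policy checks before executing a tool; raise on violation."""
    policy = TOOL_POLICIES.get(tool_name)
    if policy is None:
        # Action allowlist: unknown tools are rejected outright.
        raise GuardrailViolation(f"{tool_name} is not on the action allowlist")
    if calls_today >= policy["max_calls_per_day"]:
        # Rate limit / daily cap.
        raise GuardrailViolation(f"{tool_name} exceeded its daily cap")
    if "amount_usd" in args:
        # Structured-input check: reject free-form parameters for a
        # high-risk tool, then enforce the per-action monetary cap.
        if not isinstance(args["amount_usd"], (int, float)):
            raise GuardrailViolation("amount_usd must be numeric")
        if args["amount_usd"] > policy.get("max_amount_usd", float("inf")):
            raise GuardrailViolation("refund amount exceeds per-action cap")
    return True
```

In a deployment, the guard would sit between the agent's planned action and the tool runtime, so a violation blocks execution and can be escalated to a human instead of silently failing.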
12) Continuous evaluation: Run the benchmark suite on every model, prompt, or tool change. Require regression gates (e.g., no more than a 1-point drop in completion rate; no new policy violations).

Decision rule: If you cannot (a) prove the task success rate on held-out tasks, (b) attribute every action to a scoped identity, and (c) reconstruct any run from traces and logs, you are not ready for autonomous execution.
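The regression gate in item 12 can be expressed as a small check run in CI on every model/prompt/tool change. A minimal sketch: the function name is illustrative, and the 1-point threshold mirrors the example in the checklist (expressed here as the fraction 0.01).

```python
def regression_gate(baseline_completion, candidate_completion,
                    new_policy_violations, max_drop=0.01):
    """Return True if the candidate change passes the gate:
    completion rate may drop by at most `max_drop` (a fraction,
    0.01 = 1 percentage point) and no new policy violations
    are allowed."""
    if new_policy_violations > 0:
        return False
    return (baseline_completion - candidate_completion) <= max_drop
```

Wiring this into CI means a failed gate blocks the rollout stage, which keeps the staged-rollout discipline of item 11 enforceable rather than advisory.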