Agentic AI Production Readiness Checklist (2026)

Use this checklist to decide whether a workflow is ready to move from prototype → pilot → production. Complete everything in Phase 1 before granting any write-capable autonomy.

PHASE 1 — Define the job and the SLO (must-have)

1) Workflow definition: Write a one-page spec of the workflow (start state, end state, allowed tools, disallowed actions). Include examples of "should refuse" requests.
2) Success metrics: Define an SLO with concrete numbers (e.g., 95% completion rate, p50 latency < 3s, median variable cost < $0.10/run, escalation rate < 8%).
3) Golden task set: Collect at least 100 real tasks with known correct outcomes. Hold out 20% as a regression set.

PHASE 2 — Safety and permissions (must-have for any write actions)

4) Scoped identity: Map every agent run to a tenant and user identity. Use least-privilege scopes (separate read vs. write) and prefer short-lived tokens.
5) Tool schemas: Enforce structured tool inputs and outputs (JSON Schema). Reject free-form parameters for high-risk tools.
6) Guardrails: Run policy checks before tool execution (PII/DLP checks, action allowlists, rate limits, and caps such as "max refunds/day").
7) Rollback: Provide an undo plan for every write operation (soft delete, reversible updates, or compensating transactions).

PHASE 3 — Observability and operations (required for production)

8) Tracing: Produce end-to-end traces for each run (plan → retrieve → tool calls → verification → final action). Store tool parameters with redaction.
9) Cost controls: Track tokens and dollars per run; set budgets per tenant and per workflow; alert on cost anomalies (e.g., +30% week-over-week).
10) Incident playbook: Document kill switches (disable write tools globally or per tenant), escalation paths, and postmortem templates.

PHASE 4 — Deployment discipline (recommended)

11) Staged rollout: Shadow mode (no execution) → suggest mode (human approval) → limited autopilot (low-risk actions) → broader autopilot.
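The guardrail items above (tool schemas and pre-execution policy checks) can be sketched as a guard function that runs before any tool call. This is a minimal illustration, not a prescribed implementation: the tool names, cap values, and the `GuardrailViolation` exception are hypothetical, and a real system would validate against full JSON Schemas rather than the hand-rolled type check shown here.

```python
# Hypothetical per-tool policy table: an action allowlist plus daily
# caps and per-action limits ("max refunds/day"-style guardrails).
TOOL_POLICIES = {
    "issue_refund": {"max_calls_per_day": 50, "max_amount_usd": 200.0},
    "send_email": {"max_calls_per_day": 500},
}

class GuardrailViolation(Exception):
    """Raised when a tool call fails a pre-execution policy check."""

def check_tool_call(tool_name, args, calls_today):
    """Run policy checks before executing a tool; raise on violation."""
    policy = TOOL_POLICIES.get(tool_name)
    if policy is None:
        # Action allowlist: unknown tools are rejected outright.
        raise GuardrailViolation(f"{tool_name} is not on the action allowlist")
    if calls_today >= policy["max_calls_per_day"]:
        # Rate limit / daily cap.
        raise GuardrailViolation(f"{tool_name} exceeded its daily cap")
    if "amount_usd" in args:
        # Structured-input check: reject free-form parameters for a
        # high-risk tool, then enforce the per-action monetary cap.
        if not isinstance(args["amount_usd"], (int, float)):
            raise GuardrailViolation("amount_usd must be numeric")
        if args["amount_usd"] > policy.get("max_amount_usd", float("inf")):
            raise GuardrailViolation("refund amount exceeds per-action cap")
    return True
```

In a deployment, the guard would sit between the agent's planned action and the tool runtime, so a violation blocks execution and can be escalated to a human instead of silently failing.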
12) Continuous evaluation: Run the benchmark suite on every model, prompt, or tool change. Require regression gates (e.g., no more than a 1-point drop in completion rate; no new policy violations).

Decision rule: If you cannot (a) prove the task success rate on held-out tasks, (b) attribute every action to a scoped identity, and (c) reconstruct any run from traces and logs, you are not ready for autonomous execution.
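The regression gate in item 12 can be expressed as a small check run in CI on every model/prompt/tool change. A minimal sketch: the function name is illustrative, and the 1-point threshold mirrors the example in the checklist (expressed here as the fraction 0.01).

```python
def regression_gate(baseline_completion, candidate_completion,
                    new_policy_violations, max_drop=0.01):
    """Return True if the candidate change passes the gate:
    completion rate may drop by at most `max_drop` (a fraction,
    0.01 = 1 percentage point) and no new policy violations
    are allowed."""
    if new_policy_violations > 0:
        return False
    return (baseline_completion - candidate_completion) <= max_drop
```

Wiring this into CI means a failed gate blocks the rollout stage, which keeps the staged-rollout discipline of item 11 enforceable rather than advisory.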