AGENTIC AI PRODUCTION READINESS CHECKLIST (2026)

Use this checklist to move one workflow from prototype to production without creating an untestable “prompt pile.” Target: a system you can measure, replay, secure, and cost-control.

1) WORKFLOW DEFINITION
- Write a one-sentence job statement (e.g., “Triage inbound support tickets and route to the right queue with a reason code”).
- Define inputs/outputs as a contract (fields, formats, required IDs).
- Define success criteria (e.g., ≥95% correct routing; p95 latency ≤12s).
- List unacceptable failures (e.g., never email customers without approval; never create refunds).

2) TOOLING & PERMISSIONS
- Inventory every tool/API call (read vs. write).
- Add strict JSON schemas for tool inputs/outputs; reject invalid payloads.
- Enforce least privilege per agent and per tenant (scoped tokens, TTL, rotation).
- Add idempotency keys for any write action.

3) ORCHESTRATION & BUDGETS
- Implement explicit steps (state machine): classify → retrieve → draft → validate → execute → log.
- Set hard caps: max tool calls, max tokens per run, max retries, per-step timeouts.
- Add fallbacks: a smaller model for easy steps; a human queue for uncertain cases.

4) EVALUATION (EVALS)
- Create a regression set (start with 200 examples; grow to 1,000+).
- Add a red-team set: prompt injection, policy violations, unsafe tool requests.
- Define automated metrics: task success rate, escalation rate, tool-call intensity, cost per completed task.
- Run evals in CI on every prompt, model, or tool change.

5) OBSERVABILITY & REPLAY
- Log every run with a trace_id and step-level events.
- Capture model version, prompt version, tool calls, and tool outputs (or mocks).
- Implement replay: re-run a historical trace deterministically for debugging.
- Create dashboards: success %, p95 latency, $/task, top failure clusters, escalations.

6) SAFETY & GOVERNANCE
- Put authorization decisions in a deterministic policy engine outside the model.
- Sanitize untrusted content before retrieval and before passing it into prompts.
- Enforce retention rules (store minimal PII; keep encrypted blobs separate from traces).
- Establish a release process: canary (5%), monitor deltas, define rollback criteria.

7) LAUNCH CRITERIA (GO/NO-GO)
- Meets success targets on the regression suite and the first production cohort.
- 100% trace coverage and successful replay drills.
- Zero critical policy bypasses in red-team tests.
- Per-task cost within budget with headroom (e.g., 20%).
- A clear on-call rotation and incident playbook (who owns failures, how to disable actions).

If you can’t measure it, replay it, or roll it back, it’s not production, no matter how good the demo looks.
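The schema and idempotency bullets in section 2 can be sketched in a few lines. This is a minimal, stdlib-only illustration, not a production validator; `ROUTE_TICKET_SCHEMA`, the field names, and the tool name `route_ticket` are hypothetical, and a real system would likely use a full JSON Schema library.

```python
import hashlib
import json

# Hypothetical schema for one tool: field name -> required Python type.
ROUTE_TICKET_SCHEMA = {"ticket_id": str, "queue": str, "reason_code": str}

def validate_tool_input(payload: dict, schema: dict) -> None:
    """Reject payloads with missing, extra, or mistyped fields."""
    missing = set(schema) - set(payload)
    extra = set(payload) - set(schema)
    if missing or extra:
        raise ValueError(f"bad fields: missing={missing}, extra={extra}")
    for field, expected in schema.items():
        if not isinstance(payload[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")

def idempotency_key(tool_name: str, payload: dict) -> str:
    """Stable key so retrying the same write is a no-op server-side."""
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(f"{tool_name}:{canonical}".encode()).hexdigest()

payload = {"ticket_id": "T-123", "queue": "billing", "reason_code": "refund_q"}
validate_tool_input(payload, ROUTE_TICKET_SCHEMA)  # passes silently
key = idempotency_key("route_ticket", payload)     # same payload -> same key
```

Because the key is derived from the canonical payload, a retried write carries the same key and the downstream service can deduplicate it.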
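The hard caps in section 3 amount to a small budget guard that every orchestrator step consults before acting. A minimal sketch, with hypothetical default limits:

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a run blows past any hard cap."""

class RunBudget:
    """Hard caps for one agent run; the default limits are illustrative."""

    def __init__(self, max_tool_calls=10, max_tokens=20_000, step_timeout_s=15.0):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.step_timeout_s = step_timeout_s
        self.tool_calls = 0
        self.tokens = 0

    def charge_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls exceeded")

    def charge_tokens(self, n):
        self.tokens += n
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exceeded")

    def check_step_elapsed(self, started_at):
        """Call with a time.monotonic() timestamp taken at step start."""
        if time.monotonic() - started_at > self.step_timeout_s:
            raise BudgetExceeded("step timeout")
```

The point is that a runaway tool loop fails fast with `BudgetExceeded`, which the orchestrator can translate into an escalation to the human queue instead of burning tokens.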
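The trace-and-replay bullets in section 5 can be sketched as a plain event log: record everything a re-run needs (pinned model and prompt versions, each tool call with its output) so replay can serve recorded outputs as mocks. All names and version strings here are hypothetical.

```python
import time
import uuid

def new_trace():
    """Start a run trace with everything needed to replay it later."""
    return {
        "trace_id": str(uuid.uuid4()),
        "model_version": "model-2026-01",   # hypothetical pinned version
        "prompt_version": "triage-v7",      # hypothetical prompt tag
        "events": [],
    }

def log_step(trace, step, tool=None, tool_input=None, tool_output=None):
    """Append one step-level event; tool outputs are kept for replay mocks."""
    trace["events"].append({
        "ts": time.time(),
        "step": step,
        "tool": tool,
        "tool_input": tool_input,
        "tool_output": tool_output,
    })

def replay_tool_outputs(trace):
    """Yield recorded tool outputs in order, so a re-run hits mocks, not live APIs."""
    for event in trace["events"]:
        if event["tool"] is not None:
            yield event["tool"], event["tool_output"]

trace = new_trace()
log_step(trace, "classify")
log_step(trace, "retrieve", tool="search_kb",
         tool_input={"q": "refund"}, tool_output={"hits": 3})
```

During a replay drill, the orchestrator consumes `replay_tool_outputs` instead of calling real tools, which makes the historical run deterministic to step through.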
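The first bullet of section 6, authorization outside the model, reduces to a deterministic lookup the model cannot talk its way around. A sketch under assumed rules: the tool names and the policy table below are hypothetical, chosen to mirror the "never email without approval / never create refunds" examples from section 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    tool: str
    is_write: bool

# Hypothetical policy table: the model proposes actions, but this table,
# not the model, decides what may execute.
POLICY = {
    "route_ticket":  {"allowed": True,  "needs_human": False},
    "send_email":    {"allowed": True,  "needs_human": True},   # never without approval
    "create_refund": {"allowed": False, "needs_human": True},   # unacceptable failure
}

def authorize(action: Action, human_approved: bool = False) -> bool:
    """Deterministic allow/deny, evaluated entirely outside the model."""
    if not action.is_write:
        return True  # reads are bounded upstream by scoped tokens
    rule = POLICY.get(action.tool)
    if rule is None or not rule["allowed"]:
        return False  # unknown or forbidden writes are denied by default
    if rule["needs_human"] and not human_approved:
        return False
    return True
```

Default-deny on unknown tools is the important design choice: a prompt-injected request for a tool outside the table fails closed rather than open.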