Agentic AI Production Readiness Checklist (2026)

Use this checklist before giving an AI agent access to real tools (CRM, billing, email, ticketing, code repos). Target: measurable task success, bounded risk, and predictable cost.

1) Workflow Definition (1–2 hours)
- Write a one-sentence “workflow contract” (input → outcome). Example: “Given a Zendesk ticket, draft a policy-correct response with cited billing facts.”
- Identify action types: read-only, write, irreversible (refunds, deletions), external comms (email/SMS).
- Define success metrics: task success %, policy violation %, escalation rate, average latency, cost per run.

2) Context & Tooling (platform work)
- Standardize tool interfaces (prefer MCP-style connectors or a single internal tool schema).
- Enforce typed parameters and structured outputs (JSON), with documented failure modes.
- Implement least-privilege scopes per tool: separate read vs write; staging vs production credentials.
- Add idempotency keys for write operations (avoid duplicate refunds/tickets).

3) Evaluation Gates (release discipline)
- Build an initial eval set of 100–300 real cases (stratify edge cases: VIP customers, refunds, angry users, missing data).
- Create 3 eval tiers:
  a) Task success (deterministic checks where possible)
  b) Safety/policy (PII leakage, jailbreak prompts, forbidden actions)
  c) Operational behaviors (max tool calls, max latency, loop detection)
- Set ship/rollback thresholds (example): policy violations must stay <0.5%; runaway tool loops <0.2%.

4) Safety & Governance (non-negotiables)
- Log every tool call: user, timestamp, tool name, parameters, outputs, model version, prompt hash.
- Add approval gates for high-impact actions (refunds, contract changes, outbound sends, data deletion).
- Implement PII/secrets handling: redact before model calls; never place secrets in prompts; use a vault.
- Require provenance for user-facing outputs (citations to docs/tool outputs; verification step “no new facts”).
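
An added example (not part of the original checklist) of one way to combine items from sections 2 and 4: a gateway that enforces per-tool scopes, idempotency keys for writes, approval for irreversible actions, a tool-call budget, and a log record for every attempt. The tool names, scopes, status strings and the `ToolGateway` class are hypothetical, and `backend` stands in for the real connector.

```python
import hashlib
import time

# Hypothetical tool registry: required scope and impact class per tool.
TOOLS = {
    "billing.get_invoice": {"scope": "billing:read", "impact": "read"},
    "billing.refund": {"scope": "billing:write", "impact": "irreversible"},
}

class ToolGateway:
    """Single choke point between the agent and its tools (illustrative)."""

    def __init__(self, scopes, backend, max_tool_calls=5, model_version="m-1", prompt=""):
        self.scopes, self.backend = set(scopes), backend
        self.max_tool_calls, self.calls = max_tool_calls, 0
        self.results = {}  # (tool, idempotency key) -> result of first execution
        self.audit = []    # one record per attempted call, allowed or not
        self.run_meta = {"model_version": model_version,
                         "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()}

    def call(self, user, tool, params, idempotency_key=None, approved_by=None):
        spec = TOOLS.get(tool)
        key = (tool, idempotency_key)
        self.calls += 1  # denied attempts also consume the run's budget
        output = None
        if spec is None:
            status = "denied:unknown_tool"
        elif self.calls > self.max_tool_calls:
            status = "denied:budget"
        elif spec["scope"] not in self.scopes:
            status = "denied:scope"
        elif spec["impact"] != "read" and not idempotency_key:
            status = "denied:missing_idempotency_key"
        elif spec["impact"] == "irreversible" and not approved_by:
            status = "pending:approval"  # a human must approve before execution
        elif key in self.results:
            status, output = "replayed", self.results[key]  # no second side effect
        else:
            status, output = "ok", self.backend(tool, params)
            if idempotency_key:
                self.results[key] = output
        # Redaction of PII/secrets in params and output is assumed to happen upstream.
        self.audit.append({"ts": time.time(), "user": user, "tool": tool,
                           "params": params, "output": output, "status": status,
                           "approved_by": approved_by, **self.run_meta})
        return status, output
```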

5) Cost & Performance (keep margins intact)
- Add token/cost metering per step and per customer.
- Implement routing: small model for classification/extraction; frontier model only for planning/final.
- Add caching: prompt/result cache, semantic cache, retrieval cache for stable corpora.
- Set budgets: max tokens/run, max tool calls/run, max retries/step; alert on budget breaches.

6) Operations & Incident Response
- Create dashboards: success rate, violation rate, latency p95, cost/run, tool error rate.
- Add circuit breakers: auto-disable write actions if violation rate spikes or tools degrade.
- Define human escalation paths (where uncertain, the agent must hand off with context).
- Run game days monthly: simulate tool outages, permission denials, prompt injection attempts.

Exit Criteria (go/no-go)
- You can replay any agent run end-to-end from logs within 5 minutes.
- You can prove least-privilege access for every tool.
- Evals are in CI/CD and block releases when thresholds fail.
- Write actions are bounded by approvals, budgets, and circuit breakers.
- Unit economics are understood: cost per successful outcome is acceptable for your pricing.