Production AI Agent Readiness Checklist (2026)

Use this checklist to take an agent from prototype to production without losing control of security, spend, or reliability.

1) Define the job and success metrics
- Specify one workflow (start narrow): e.g., “refund dispute triage” or “PR review.”
- Define 1 primary outcome metric (e.g., % auto-resolved, median cycle time, $ saved per month).
- Define 2 guard metrics (e.g., escalation rate, customer CSAT delta, error rate).
- Set explicit autonomy tiers (Draft-only / Suggest-with-approval / Autonomous-under-thresholds).

2) Tooling and orchestration requirements
- Choose an orchestrator with durable state (Temporal, Step Functions, Durable Functions, etc.).
- Ensure all tools are typed: JSON schemas, strict input validation, strict output parsing.
- Make tool calls idempotent where possible; include idempotency keys in write actions.
- Add timeouts, retries, and circuit breakers per tool.

3) Identity, permissions, and policy
- Create a dedicated agent identity (service account) per workflow or per tenant.
- Enforce least privilege scopes for every integrated system.
- Implement policy-as-code checks: $ thresholds, data sensitivity rules, and escalation triggers.
- Define “break-glass” access and a kill switch (per workflow + per tenant).

4) Guardrails and verification
- Require sandbox/dry-run for any side-effectful action when available.
- Add deterministic validators (schema checks, allowed commands, safe parameter bounds).
- Add a second-layer verifier for high-risk steps (rules or independent model).
- Define approval checkpoints (who approves, where approvals happen, SLA expectations).

5) Observability and audit
- Log a run record: run_id, model version, tool calls, approvals, outcomes, and cost.
- Implement redaction for PII/secrets; store raw data only with strict access controls.
- Add distributed tracing spans for retrieval, inference, tool calls, retries, approvals.
- Define retention: staging (full), production (redacted), encrypted archives for incidents.

6) Evals and release discipline
- Build a scenario bank (200–2,000 cases) representative of production inputs.
- Run nightly regression evals and gate deployments on score thresholds.
- Maintain an adversarial pack (prompt injection, malformed tool output, policy bypass attempts).
- Version everything: prompts, retrieval configs, tool adapters, and policies.

7) Economics and budgeting
- Track cost per successful outcome (not cost per request).
- Set per-run and per-day budgets; fail gracefully when budget is exceeded.
- Route tasks: smaller models for routine steps, frontier models for ambiguous/high-stakes steps.
- Measure ROI monthly: $ saved or revenue impact vs. total agent COGS/opex.

8) Rollout plan
- Start at 1% traffic or with a single internal team.
- Progress: Draft-only → Suggest-with-approval → Autonomous-under-thresholds.
- Monitor guard metrics continuously; roll back on threshold breaches.
- Write an incident playbook: detection, triage, disablement, customer comms, and postmortems.

If you can’t answer “What did the agent do, under what policy, with what permissions, at what cost?” you’re not ready for production autonomy.
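A few of the checks above can be sketched in code. For the tool-call requirements in section 2, here is a minimal retry wrapper with a stable idempotency key; `tool_fn`, `ToolCallError`, and the retry parameters are illustrative assumptions, not a prescribed API, and a production version would also add per-call timeouts and a circuit breaker:

```python
import time
import uuid

class ToolCallError(Exception):
    """Raised by a tool adapter on transient failure (hypothetical convention)."""

def call_tool(tool_fn, payload, *, retries=3, base_delay=0.5):
    """Invoke a side-effectful tool with bounded retries and one idempotency key.

    The same key is reused across every attempt so the downstream system can
    deduplicate the write even if an earlier attempt actually succeeded.
    """
    idempotency_key = str(uuid.uuid4())  # one key per logical action, not per attempt
    for attempt in range(retries):
        try:
            return tool_fn(payload, idempotency_key=idempotency_key)
        except ToolCallError:
            if attempt == retries - 1:
                raise  # budget exhausted; surface the failure to the orchestrator
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

Generating the key once, outside the retry loop, is the important design choice: a new key per attempt would turn a retried refund into two refunds.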
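The policy-as-code checks in section 3 can be as simple as a pure function that maps a proposed action to a decision. This sketch assumes a hypothetical action shape (`amount_usd`, `data`) and illustrative thresholds; real policies would live in versioned config, per section 6:

```python
def evaluate_action(action, *, max_usd=500.0, sensitive_fields=("ssn", "card_number")):
    """Decide whether a proposed agent action may proceed.

    Returns "allow", "escalate", or "deny". The $ threshold and sensitive-field
    list here are placeholders for tenant- or workflow-specific policy.
    """
    # Data-sensitivity rule: never act autonomously on restricted fields.
    if any(field in action.get("data", {}) for field in sensitive_fields):
        return "deny"
    # $ threshold: route high-value actions to a human approval checkpoint.
    if action.get("amount_usd", 0.0) > max_usd:
        return "escalate"
    return "allow"
```

Because the function is deterministic and side-effect free, it can be unit-tested in the nightly eval suite and audited independently of the model.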
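Section 7's core metric, cost per successful outcome, is worth pinning down precisely: all spend, including failed runs, is charged against only the successful outcomes. A small sketch, assuming run records shaped as `{"cost_usd": float, "succeeded": bool}` (hypothetical field names):

```python
def cost_per_successful_outcome(runs):
    """Total spend divided by successful outcomes only.

    Failed runs still cost money, so they inflate the per-outcome figure;
    that is the point of the metric versus plain cost per request.
    """
    total_cost = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["succeeded"])
    if successes == 0:
        return float("inf")  # no successes: unit economics are undefined
    return total_cost / successes
```

For example, three runs costing $0.10, $0.30, and $0.20 with only the first and last succeeding give $0.60 / 2 = $0.30 per successful outcome, not $0.20 per request.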