AGENT-NATIVE PRODUCTION READINESS CHECKLIST (2026 EDITION)

Use this before promoting an agent from “demo” to “production,” especially if it can take actions in customer systems.

1) Task Contract (Definition of Done)
- Name the workflow and define the business outcome (example: “shorten onboarding cycle time”).
- Specify inputs/outputs as a schema (JSON fields, required/optional, allowed enums).
- Write explicit constraints: forbidden actions, forbidden claims, tone rules, compliance rules.
- Define “success” as an artifact you can audit (approved output, ticket resolved, change committed, etc.).
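
As an illustration of the schema bullet above, here is a minimal sketch of a typed input contract using only the Python standard library. The workflow, the field names (account_id, plan, notes), and the enum values are hypothetical and chosen only for the example; a real contract would come from your own workflow definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Plan(str, Enum):
    # Allowed values for a hypothetical "plan" field.
    FREE = "free"
    PRO = "pro"


@dataclass(frozen=True)
class OnboardingInput:
    account_id: str               # required
    plan: Plan                    # required, restricted to the enum above
    notes: Optional[str] = None   # optional


ALLOWED_FIELDS = {"account_id", "plan", "notes"}


def parse_input(raw: dict) -> OnboardingInput:
    """Reject payloads with unknown fields, missing fields, or bad enum values."""
    unknown = set(raw) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if not isinstance(raw.get("account_id"), str) or not raw["account_id"]:
        raise ValueError("account_id is required and must be a non-empty string")
    return OnboardingInput(
        account_id=raw["account_id"],
        plan=Plan(raw.get("plan")),  # Enum lookup raises ValueError on a bad value
        notes=raw.get("notes"),
    )
```

Rejecting unknown fields is a design choice in this sketch: it keeps the contract strict so that a drifting upstream payload fails loudly instead of being silently ignored.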

2) Autonomy & Permissions
- Implement autonomy tiers (draft → approve → auto) with explicit, documented thresholds for promoting a workflow to the next tier.
- Use least-privilege scopes per connector; avoid broad, shared service accounts.
- Add gates for destructive or external-facing actions (delete, message a customer, refund, terminate access, deploy).
- Enforce environment separation (dev/stage/prod) with separate credentials.
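
The tiers and gates above can be sketched as a single decision function. This is one possible shape, not a prescribed design: the action names and the three return values are assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):
    DRAFT = 1     # agent proposes only; a human performs the action
    APPROVE = 2   # agent acts only after explicit human approval
    AUTO = 3      # agent acts on its own, except for gated actions


# Hypothetical action names. Destructive or external-facing actions
# stay gated even at the AUTO tier.
GATED_ACTIONS = frozenset(
    {"delete", "message_customer", "refund", "terminate_access", "deploy"}
)


def decide(action: str, tier: Tier, approved: bool = False) -> str:
    """Return "propose_only", "needs_approval", or "execute" for one action."""
    if tier is Tier.DRAFT:
        return "propose_only"
    if action in GATED_ACTIONS or tier is Tier.APPROVE:
        return "execute" if approved else "needs_approval"
    return "execute"
```

In this sketch a gated action at the AUTO tier still returns "needs_approval" until a human approves it, so raising a workflow's tier never silently removes a gate.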

3) Logging, Auditability, and Replay
- Log every run: inputs, model version, prompt/template version, tool calls, outputs, and final action.
- Store immutable (append-only) audit events with retention controls that match your customers' compliance and contractual requirements.
- Redact PII/secrets in logs while keeping traceability.
- Support deterministic replay or an equivalent trace mode for debugging and support.
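
One way to redact PII while keeping traceability is to replace each sensitive value with a stable keyed token, so the same value maps to the same token across runs. The sketch below handles only email addresses in free-text fields; the event field names, the token format, and the key handling are assumptions for illustration, and a production system would need broader detectors and managed keys.

```python
import hashlib
import hmac
import json
import re

# Deliberately simple email pattern for the example; not a full RFC 5322 matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str, key: bytes) -> str:
    """Replace each email with a keyed, stable token so runs stay correlatable."""
    def token(match):
        value = match.group(0).lower().encode()
        digest = hmac.new(key, value, hashlib.sha256).hexdigest()[:12]
        return f"<email:{digest}>"
    return EMAIL.sub(token, text)


def audit_event(*, run_id, model_version, prompt_version, tool_calls,
                user_input, output, final_action, key):
    """Build one JSON audit line. Only the free-text fields are redacted here."""
    event = {
        "run_id": run_id,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "tool_calls": tool_calls,
        "input": redact(user_input, key),
        "output": redact(output, key),
        "final_action": final_action,
    }
    return json.dumps(event, sort_keys=True)
```

Using a keyed HMAC rather than a plain hash is intentional: without the key, a reader of the logs cannot confirm a guessed address by hashing it themselves.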

4) Evaluation (Evals) and Regression Testing
- Build a golden set of real tasks labeled with correct outcomes.
- Add adversarial cases: prompt injection attempts, contradictory data, missing permissions, stale docs.
- Run evals in CI for any prompt/agent/tooling change; block deploys on regressions.
- Track: outcome success, safety violations, tool-call error rate, latency, and run cost.
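
A deploy-blocking check for the eval bullets above might look like the following sketch. The EvalResult shape, the baseline pass rate, and the max_drop tolerance are hypothetical; the idea is that a CI step calls gate() and fails the build when it returns not-ok.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    safety_violation: bool = False


def gate(results, baseline_pass_rate, max_drop=0.01):
    """Return (ok, reasons). Blocks on any safety violation or a pass-rate regression."""
    if not results:
        # An empty eval run is treated as a failure, not a pass.
        return False, ["no eval results"]
    reasons = []
    pass_rate = sum(r.passed for r in results) / len(results)
    violations = [r.case_id for r in results if r.safety_violation]
    if violations:
        reasons.append(f"safety violations in: {violations}")
    if pass_rate < baseline_pass_rate - max_drop:
        reasons.append(
            f"pass rate {pass_rate:.3f} is below baseline {baseline_pass_rate:.3f}"
        )
    return not reasons, reasons
```

In CI, a wrapper script could print the reasons and exit non-zero when ok is False. The tolerance exists because golden sets are small and a single flaky case should not always block a release; set it to 0 for a strict gate.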

5) Verification & Guardrails
- Validate outputs and tool-call arguments against schemas.
- Add rule-based policy checks for high-risk domains (refunds, outbound claims, compliance actions).
- Use retrieval grounding and citations where factuality matters.
- Fail closed: if verification fails, escalate to a human with full context.
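
The fail-closed rule above can be sketched as a wrapper that executes a tool call only when verification passes cleanly. The refund fields (order_id, amount_cents), the amount limit, and the execute/escalate callbacks are assumptions for the example.

```python
def verify_refund(args: dict, max_amount_cents: int = 5_000) -> list:
    """Return a list of problems; an empty list means the arguments passed."""
    problems = []
    amount = args.get("amount_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
        problems.append("amount_cents must be a positive integer")
    elif amount > max_amount_cents:
        problems.append(f"amount_cents exceeds the limit of {max_amount_cents}")
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not order_id:
        problems.append("order_id must be a non-empty string")
    return problems


def run_or_escalate(args, execute, escalate) -> str:
    """Execute only on a clean verification; anything else goes to a human."""
    try:
        problems = verify_refund(args)
    except Exception as exc:
        # Fail closed: a crashing verifier counts as a failed verification.
        problems = [f"verifier error: {exc!r}"]
    if problems:
        escalate({"args": args, "problems": problems})
        return "escalated"
    execute(args)
    return "executed"
```

The key property is that every path other than a clean pass ends in escalation, including the case where the verifier itself raises.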

6) Unit Economics
- Measure cost per successful outcome (include retries and human escalation time).
- Use model routing (planner vs worker) and set explicit budgets per run.
- Enforce token discipline: summarize long contexts, store structured state, retrieve only what’s needed.
- Set pricing to survive worst-case run paths, not idealized success paths.
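
Cost per successful outcome divides the total spend across all runs, including failed runs, retries, and human time, by the number of runs that succeeded. A minimal sketch follows; the Run fields and the hourly rate parameter are hypothetical names for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model_cost_usd: float   # model and tool spend for the run, retries included
    human_minutes: float    # escalation or review time spent on the run
    succeeded: bool


def cost_per_successful_outcome(runs, human_rate_usd_per_hour: float) -> float:
    """Total cost of all runs divided by the number of successful runs."""
    total = sum(
        r.model_cost_usd + (r.human_minutes / 60.0) * human_rate_usd_per_hour
        for r in runs
    )
    successes = sum(1 for r in runs if r.succeeded)
    if successes == 0:
        return float("inf")  # no successes: the cost per outcome is unbounded
    return total / successes
```

Failed runs stay in the numerator on purpose. Dropping them would report the idealized success path rather than what the workflow actually costs to operate.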

7) Rollout Plan
- Start in shadow mode: generate proposed actions without executing them, and require human approval before anything runs.
- Ramp autonomy by segment (lower-risk queues/customers first).
- Canary new versions; monitor drift and regressions.
- Prepare rollback: versioned prompts/agents and a kill switch for actions.
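
The canary and kill-switch bullets above can be sketched as follows. This is an in-memory illustration under stated assumptions: the version labels, the customer-ID bucketing scheme, and the KillSwitch class are hypothetical, and a real deployment would back the switch with shared configuration so every worker sees it.

```python
import hashlib


def pick_version(customer_id: str, stable: str, canary: str, canary_percent: int) -> str:
    """Deterministically route a fixed share of customers to the canary version."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return canary if bucket < canary_percent else stable


class KillSwitch:
    """In-memory stand-in for a shared flag that disables all agent actions."""

    def __init__(self):
        self._engaged = False

    def engage(self):
        self._engaged = True

    def release(self):
        self._engaged = False

    def allow_actions(self) -> bool:
        return not self._engaged


def act(action, switch: KillSwitch, execute) -> str:
    """Check the kill switch immediately before every action."""
    if not switch.allow_actions():
        return "blocked"
    execute(action)
    return "executed"
```

Hashing the customer ID keeps each customer on the same version for the whole canary period, which makes regressions easier to attribute. Rolling back is then a matter of setting canary_percent to 0.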

8) Trust UX for Admins
- Provide a run dashboard: what happened, why, which tools were used, and what escalated.
- Provide a permissions UI: per-tool scopes, per-workflow limits, and audit export.
- Provide an incident workflow: how customers report mistakes and how you respond.

Exit Criteria (targets you set)
- Documented failure modes and what triggers escalation.
- Proven tool-call reliability in staging plus clear handling for rate limits and permission errors.
- A tracked safety-event rate with a plan to drive it down.
- A documented cost-per-successful-outcome and an explicit plan to improve it.

If you can’t hit your exit criteria, narrow the task, add verification, or reduce autonomy. Shipping a smaller, trustworthy loop beats shipping a broad, unreliable agent.