Agent Feature Production Readiness Checklist (2026)

Use this checklist before shipping an agent that can take actions (write data, trigger workflows, move money, or deploy code). Target: you can let it run unattended within clearly defined boundaries.

1) Scope & Success Criteria
- Define the workflow in one sentence (e.g., “triage inbound support tickets and draft replies”).
- Define a measurable target (e.g., +20% faster handle time in 30 days, or 15% deflection).
- List explicit non-goals (what the agent must never do).

2) Capabilities & Permissions
- Implement capability-based access (e.g., issue_refund, update_crm_field, open_pr).
- Enforce least-privilege OAuth scopes and short-lived tokens.
- Separate sandbox vs production credentials and environments.
- Require idempotency keys for any write action.

3) Cost, Time, and Loop Controls
- Set per-task ceilings (max tokens, max tool calls, max wall-clock time).
- Set tenant/workspace daily budgets and rate limits.
- Add stop conditions and maximum iteration counts.
- Log cost attribution per task: model spend, tool/API spend, and retries.

4) Tool Contracts & Reliability
- Define tool schemas strictly; validate inputs/outputs.
- Version tools and prompts; ship behind flags.
- Add circuit breakers and retries with exponential backoff.
- Track tool success rate, p95 latency, and rate-limit errors.

5) Observability & Audit Trail (“Flight Recorder”)
- Store structured traces: inputs, retrieved sources (with IDs), decisions, tool parameters, outputs.
- Redact/limit sensitive payloads; document retention (e.g., 365 days).
- Provide a user-visible execution timeline and outcomes.

6) Human Oversight & Reversibility
- Use diff-first review for state changes (field diffs, PR diffs, invoice diffs).
- Configure tiered approvals (by risk level, dollar amount, environment).
- Provide undo/rollback where possible (revert PR, compensating transaction).
- Track override rate and rollback rate.

7) Evaluation & Release Safety
- Build a replay harness to rerun tasks against new models/policies.
- Maintain a small “golden set” plus ongoing production sampling.
- Alert on regressions: completion rate, escalation rate, policy denials, error types.
- Run A/B tests for thresholds (confidence gating vs human review load).

8) Rollout Plan
- Phase 1: Draft-only suggestions.
- Phase 2: Assisted execution (reversible writes) + approvals.
- Phase 3: Delegated execution in scoped domains (e.g., under $50, staging only).
- Phase 4: Managed autonomy with customer-configurable policies and SLAs.

If you can’t answer “Who approved this?”, “What changed?”, “How do we undo it?”, and “How much did it cost?” you’re not ready to ship.