Agent Feature Production Readiness Checklist (2026) Use this checklist before shipping an agent that can take actions (write data, trigger workflows, move money, or deploy code). Target: you can let it run unattended within clearly defined boundaries. 1) Scope & Success Criteria - Define the workflow in one sentence (e.g., “triage inbound support tickets and draft replies”). - Define a measurable target (e.g., +20% faster handle time in 30 days, or 15% deflection). - List explicit non-goals (what the agent must never do). 2) Capabilities & Permissions - Implement capability-based access (e.g., issue_refund, update_crm_field, open_pr). - Enforce least-privilege OAuth scopes and short-lived tokens. - Separate sandbox vs production credentials and environments. - Require idempotency keys for any write action. 3) Cost, Time, and Loop Controls - Set per-task ceilings (max tokens, max tool calls, max wall-clock time). - Set tenant/workspace daily budgets and rate limits. - Add stop conditions and maximum iteration counts. - Log cost attribution per task: model spend, tool/API spend, and retries. 4) Tool Contracts & Reliability - Define tool schemas strictly; validate inputs/outputs. - Version tools and prompts; ship behind flags. - Add circuit breakers and retries with exponential backoff. - Track tool success rate, p95 latency, and rate-limit errors. 5) Observability & Audit Trail (“Flight Recorder”) - Store structured traces: inputs, retrieved sources (with IDs), decisions, tool parameters, outputs. - Redact/limit sensitive payloads; document retention (e.g., 365 days). - Provide a user-visible execution timeline and outcomes. 6) Human Oversight & Reversibility - Use diff-first review for state changes (field diffs, PR diffs, invoice diffs). - Configure tiered approvals (by risk level, dollar amount, environment). - Provide undo/rollback where possible (revert PR, compensating transaction). - Track override rate and rollback rate. 7) Evaluation & Release Safety - Build a replay harness to rerun tasks against new models/policies. - Maintain a small “golden set” plus ongoing production sampling. - Alert on regressions: completion rate, escalation rate, policy denials, error types. - Run A/B tests for thresholds (confidence gating vs human review load). 8) Rollout Plan - Phase 1: Draft-only suggestions. - Phase 2: Assisted execution (reversible writes) + approvals. - Phase 3: Delegated execution in scoped domains (e.g., under $50, staging only). - Phase 4: Managed autonomy with customer-configurable policies and SLAs. If you can’t answer “Who approved this?”, “What changed?”, “How do we undo it?”, and “How much did it cost?” you’re not ready to ship.