Production Agent Readiness Checklist (2026)

Use this checklist to decide if an agent is ready to run with real permissions in a customer environment.

1) Define the work unit and KPI
- Name the unit of work (e.g., “ticket resolved,” “invoice coded,” “PR triaged”).
- Define a primary KPI with a baseline (e.g., AHT, backlog size, $ recovered).
- Set a production success threshold (example targets: TSR ≥ 92%, escalation ≤ 8%).

2) Tooling and integration quality
- Prefer API-based tools over browser automation for systems of record.
- Every tool call is typed (schema-validated inputs/outputs).
- Idempotency: reruns do not duplicate refunds, emails, or record updates.
- Rate limiting and backoff are implemented per connector.

3) Policy and permissioning
- SSO/SAML is supported; RBAC exists for admin vs operator roles.
- Least-privilege scopes for each connector (no shared passwords).
- Policy gates exist for risky actions (money movement, external emails, user provisioning).
- Approval workflow exists with an auditable “who approved what and why.”

4) Auditability and observability
- Log chain is complete: request → retrieved context references → model decision → tool calls → tool responses → final output.
- OpenTelemetry (or equivalent) traces exist for every run.
- A reason taxonomy exists for failures/escalations (policy blocked, ambiguity, tool outage, model error, data missing).

5) Evaluation and change management
- Offline eval suite covers top workflows and edge cases; results are versioned.
- Shadow mode is available to collect real-world diffs before enabling autonomy.
- Canary deployments exist for model/prompt changes (e.g., 5% traffic for 48 hours).
- Automatic rollback triggers are defined (error spike, TSR drop, policy violations).

6) Data handling and compliance
- Encryption in transit and at rest; retention controls are configurable per tenant.
- Customer data is not used for training by default (explicit opt-in only).
- Subprocessor list and DPA are ready; incident response process is documented.
- If targeting enterprise: SOC 2 plan (target Type I within ~6 months, Type II within ~12 months) and pen test schedule.

7) Economic readiness
- Measure blended cost per completed task (inference + tools + infra + human review).
- Confirm pricing aligns to work (platform fee + usage + optional success kicker).
- Confirm margins at target scale even under pessimistic assumptions (higher escalation, slower tool calls).

Go/No-Go rule of thumb: If you can’t (a) quantify TSR and escalation, (b) reconstruct an incident from logs, and (c) block risky actions with policy, you are not ready for production autonomy.
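
The Go/No-Go rule can be turned into a small automated gate. The sketch below is one possible shape, not a prescribed implementation. It reads TSR as "task success rate", which is an assumption because the checklist does not spell the acronym out. It reuses the example targets from section 1 (TSR ≥ 92%, escalation ≤ 8%) as default thresholds and also reports the blended cost per completed task from section 7. The names `RunRecord`, `Readiness`, and `assess_readiness`, and the two boolean inputs for log reconstruction and policy gating, are hypothetical; in practice those booleans would come from your own audit and policy tests.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RunRecord:
    """One agent run (hypothetical shape; adapt to your own run log)."""
    succeeded: bool   # the unit of work was completed correctly
    escalated: bool   # the run was handed off to a human
    cost_usd: float   # inference + tools + infra + human review for this run


@dataclass(frozen=True)
class Readiness:
    tsr: float                      # assumed meaning: task success rate
    escalation_rate: float
    cost_per_completed_task: float
    go: bool


def assess_readiness(
    runs: Sequence[RunRecord],
    logs_reconstructable: bool,     # (b) an incident can be rebuilt from logs
    policy_gates_enforced: bool,    # (c) risky actions are blocked by policy
    min_tsr: float = 0.92,          # example target from section 1
    max_escalation: float = 0.08,   # example target from section 1
) -> Readiness:
    if not runs:
        # (a) with no runs, TSR and escalation cannot be quantified at all
        raise ValueError("no runs recorded; TSR and escalation are unknown")

    total = len(runs)
    completed = sum(1 for r in runs if r.succeeded)
    escalated = sum(1 for r in runs if r.escalated)

    tsr = completed / total
    escalation_rate = escalated / total

    # Blended cost: all spend divided by completed tasks only, so failed
    # and escalated runs raise the cost of each success.
    total_cost = sum(r.cost_usd for r in runs)
    cost_per_completed = total_cost / completed if completed else float("inf")

    go = (
        tsr >= min_tsr
        and escalation_rate <= max_escalation
        and logs_reconstructable
        and policy_gates_enforced
    )
    return Readiness(tsr, escalation_rate, cost_per_completed, go)


if __name__ == "__main__":
    # Illustrative data only: 93 successes and 7 escalations at $0.50 per run.
    sample = [RunRecord(True, False, 0.5)] * 93 + [RunRecord(False, True, 0.5)] * 7
    print(assess_readiness(sample, logs_reconstructable=True, policy_gates_enforced=True))
```

Dividing total spend by completed tasks rather than by all runs is a deliberate choice here: it makes the pessimistic scenarios in section 7 (higher escalation, more failed runs) show up directly in the unit cost.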