Production Agent Readiness Checklist (2026)

Use this checklist to decide whether an agent is ready to run with real permissions in a customer environment.

1) Define the work unit and KPI
- Name the unit of work (e.g., “ticket resolved,” “invoice coded,” “PR triaged”).
- Define a primary KPI with a baseline (e.g., average handle time, backlog size, $ recovered).
- Set a production success threshold (example targets: task success rate (TSR) ≥ 92%, escalation rate ≤ 8%).
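
One way to make the threshold in this section checkable is to compute both rates from per-run outcomes. The sketch below is illustrative only: it assumes TSR means task success rate, and the `RunOutcome` fields and function names are hypothetical. The default targets are the example values from the bullet above, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    succeeded: bool  # unit of work completed correctly (assumed definition)
    escalated: bool  # run was handed off to a human


def readiness_metrics(outcomes):
    """Return (tsr, escalation_rate) over a list of RunOutcome records."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    tsr = sum(o.succeeded for o in outcomes) / total
    escalation_rate = sum(o.escalated for o in outcomes) / total
    return tsr, escalation_rate


def meets_threshold(tsr, escalation_rate, min_tsr=0.92, max_escalation=0.08):
    """Apply the example targets: TSR >= 92% and escalation <= 8%."""
    return tsr >= min_tsr and escalation_rate <= max_escalation
```

Scoring the same outcome records before and after launch keeps the baseline and the production numbers comparable.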

2) Tooling and integration quality
- Prefer API-based tools over browser automation for systems of record.
- Every tool call is typed (schema-validated inputs/outputs).
- Idempotency: reruns do not duplicate refunds, emails, or record updates.
- Rate limiting and backoff are implemented per connector.
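
A minimal sketch of the idempotency and backoff items, under stated assumptions: `TransientError`, `IdempotentExecutor`, and `call_with_backoff` are hypothetical names, the result store is an in-memory dict (a real connector would need a durable one), and the retry delays are placeholders. It shows backoff only; proactive rate limiting would be a separate component.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, HTTP 429/503)."""


class IdempotentExecutor:
    """Runs each action at most once per idempotency key."""

    def __init__(self):
        # In-memory for illustration; a real connector needs a durable store.
        self._results = {}

    def run(self, key, action):
        if key in self._results:
            return self._results[key]  # rerun: return the recorded result
        result = action()
        self._results[key] = result
        return result


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Retry fn on TransientError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Deriving the key from the business object (for example, a refund's order ID) rather than from the run ID is what prevents a retried run from issuing a second refund.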

3) Policy and permissioning
- SSO/SAML is supported; RBAC exists for admin vs operator roles.
- Least-privilege scopes for each connector (no shared passwords).
- Policy gates exist for risky actions (money movement, external emails, user provisioning).
- Approval workflow exists with an auditable “who approved what and why.”
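
The policy gate and approval trail could look roughly like the following. This is a sketch, not a reference design: the action names, `Approval` fields, and `PolicyGate` class are hypothetical, and the audit log is a plain list standing in for an append-only store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Risky action types taken from the checklist item above; names are illustrative.
RISKY_ACTIONS = {"money_movement", "external_email", "user_provisioning"}


@dataclass
class Approval:
    approver: str  # who approved
    reason: str    # why they approved


@dataclass
class PolicyGate:
    audit_log: list = field(default_factory=list)

    def authorize(self, action_type, payload, approval=None):
        """Allow low-risk actions; require an Approval for risky ones.

        Every decision, allowed or blocked, is appended to the audit log.
        """
        needs_approval = action_type in RISKY_ACTIONS
        allowed = (not needs_approval) or approval is not None
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action_type": action_type,
            "payload": payload,
            "allowed": allowed,
            "approver": approval.approver if approval else None,
            "reason": approval.reason if approval else None,
        })
        return allowed
```

Logging blocked attempts as well as approved ones is what lets a reviewer answer "who approved what and why" and also see what the agent tried to do without approval.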

4) Auditability and observability
- Log chain is complete: request → retrieved context references → model decision → tool calls → tool responses → final output.
- OpenTelemetry (or equivalent) traces exist for every run.
- A reason taxonomy exists for failures/escalations (policy blocked, ambiguity, tool outage, model error, data missing).
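
As an illustration of the log chain and reason taxonomy, the sketch below encodes the five reasons from this section as an enum and checks a run record for missing stages. The stage keys and the `missing_stages` helper are hypothetical; the record is a plain dict rather than any particular tracing format.

```python
from enum import Enum


class EscalationReason(Enum):
    """Failure/escalation reasons listed in the checklist item above."""
    POLICY_BLOCKED = "policy_blocked"
    AMBIGUITY = "ambiguity"
    TOOL_OUTAGE = "tool_outage"
    MODEL_ERROR = "model_error"
    DATA_MISSING = "data_missing"


# Stages of the log chain, in order. Key names are assumptions.
REQUIRED_STAGES = (
    "request",
    "context_refs",
    "model_decision",
    "tool_calls",
    "tool_responses",
    "final_output",
)


def missing_stages(run_record):
    """Return the log-chain stages absent from a run record, in chain order.

    A stage counts as present if its key exists, even when the value is
    empty (a run may legitimately make zero tool calls).
    """
    return [stage for stage in REQUIRED_STAGES if stage not in run_record]
```

Running a check like this over a sample of production runs is a cheap test of whether an incident could actually be reconstructed from the logs.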

5) Evaluation and change management
- Offline eval suite covers top workflows and edge cases; results are versioned.
- Shadow mode is available to collect real-world diffs before enabling autonomy.
- Canary deployments exist for model/prompt changes (e.g., 5% traffic for 48 hours).
- Automatic rollback triggers are defined (error spike, TSR drop, policy violations).
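
The rollback triggers above can be written down as an explicit comparison between the canary slice and the baseline. The following is a sketch under assumptions: `WindowStats` and `should_rollback` are hypothetical names, and every threshold (TSR drop, error ratio, error floor, minimum sample size) is a placeholder to be tuned per workflow.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    runs: int
    successes: int
    errors: int
    policy_violations: int


def should_rollback(baseline, canary, max_tsr_drop=0.03, max_error_ratio=2.0,
                    error_floor=0.01, min_runs=50):
    """Return the list of rollback triggers that fired (empty list = keep going)."""
    reasons = []
    # Any policy violation triggers rollback, regardless of sample size.
    if canary.policy_violations > 0:
        reasons.append("policy_violation")
    # Rate comparisons need enough traffic on both sides to be meaningful.
    if canary.runs < min_runs or baseline.runs == 0:
        return reasons
    base_err = baseline.errors / baseline.runs
    canary_err = canary.errors / canary.runs
    if canary_err > max_error_ratio * max(base_err, error_floor):
        reasons.append("error_spike")
    base_tsr = baseline.successes / baseline.runs
    canary_tsr = canary.successes / canary.runs
    if base_tsr - canary_tsr > max_tsr_drop:
        reasons.append("tsr_drop")
    return reasons
```

Returning the reasons rather than a bare boolean gives the rollback event something concrete to log.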

6) Data handling and compliance
- Encryption in transit and at rest; retention controls are configurable per tenant.
- Customer data is not used for training by default (explicit opt-in only).
- Subprocessor list and DPA are ready; incident response process is documented.
- If targeting enterprise: SOC 2 plan (target Type I within ~6 months, Type II within ~12 months) and pen test schedule.

7) Economic readiness
- Measure blended cost per completed task (inference + tools + infra + human review).
- Confirm pricing is tied to the unit of work (platform fee + usage + optional success kicker).
- Confirm margins at target scale even under pessimistic assumptions (higher escalation, slower tool calls).
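
A rough way to run the base-case and pessimistic margin check is sketched below. The function names and the cost model are assumptions: human review is charged only on escalated runs, and the cost of failed attempts is spread over completed tasks. Any dollar figures fed into it are the caller's own estimates.

```python
def blended_cost_per_task(inference, tools, infra, review_cost_per_escalation,
                          escalation_rate, completion_rate):
    """Blended cost per completed task.

    Per-attempt cost is inference + tools + infra plus the expected human
    review cost; dividing by the completion rate charges failed attempts
    to the tasks that did complete.
    """
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    per_attempt = (inference + tools + infra
                   + review_cost_per_escalation * escalation_rate)
    return per_attempt / completion_rate


def gross_margin(price_per_task, cost_per_task):
    """Gross margin as a fraction of price."""
    return (price_per_task - cost_per_task) / price_per_task
```

Evaluating the same inputs twice, once with the expected escalation and completion rates and once with deliberately worse ones, shows how much of the margin depends on the optimistic case holding.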

Go/No-Go rule of thumb: If you can’t (a) quantify TSR and escalation, (b) reconstruct an incident from logs, and (c) block risky actions with policy, you are not ready for production autonomy.