Agent Workflow Production Readiness Checklist (2026)

Use this checklist to ship a production agent workflow that is controlled, audited, and economically predictable. Aim to complete every section before rolling beyond 25% traffic.

1) Scope & Success Metrics
- Define one workflow with a clear start/end state (e.g., “triage ticket,” “create quote,” “provision access”).
- Set measurable targets:
  - Task success rate (e.g., ≥90% on offline eval set)
  - Policy violation rate (e.g., ≤2%)
  - Cost ceiling (e.g., ≤$0.10 per completed task)
  - Latency budget (e.g., p95 ≤20s for async workflows)
- Define a “stop ship” condition (e.g., any unauthorized data export, any delete action, or >0.5% high-severity incidents).

2) Tooling & Permissions (Least Privilege)
- Inventory tools/APIs the agent can call; cap v1 at 5–15 tools.
- Require strict schemas for every tool input/output (JSON, validated types).
- Implement transaction-scoped credentials:
  - Short-lived tokens (target TTL ≤1 hour)
  - No shared API keys in agent runtime
- Add a tool proxy that enforces: schema validation, rate limits, allow/deny lists, and logging.

3) Policy-as-Code & Approvals
- Encode policies outside prompts (rules engine / middleware). Examples:
  - Refunds > $100 require approval
  - Exports > 1,000 rows denied
  - Deletes denied in v1
- Add risk scoring to decide when HITL applies (target humans review only top 5–10% highest-risk runs).
- Ensure approvals are logged with approver identity, timestamp, and decision rationale.

4) Observability & Audit Trail
- Log per run: run_id, user_id/tenant_id, tool calls, parameters (redacted), outputs (redacted), policy decisions, and final state change.
- Store an append-only “action ledger” for compliance (retention based on industry; often 1–7 years).
- Set up dashboards:
  - Success rate, policy blocks, human approvals, tool error rates
  - Token spend per tenant, per workflow, per route
- Alerting: page on unauthorized action attempt, repeated tool failures, or runaway step counts.

5) Evals & Regression
- Build a golden dataset of real tasks (start with 200; grow to 1,000+).
- Score outcomes (state change correctness), not just text quality.
- Run nightly regression on: model changes, prompt changes, tool schema changes, and policy changes.
- Add “canary” evals for known failure modes (prompt injection, PII leakage, invalid tool params).

6) Deployment & Rollout
- Implement kill switches:
  - Global disable
  - Per-tenant disable
  - Per-tool disable
- Roll out gradually: 5% → 25% → 100% with monitoring gates.
- Provide safe fallback: human queue, deterministic rules, or read-only mode.

7) Unit Economics Controls
- Route easy tasks to cheaper models; escalate on uncertainty.
- Compress context via retrieval + summaries; avoid full transcript stuffing.
- Cache embeddings and repeated answers where safe.
- Track cost per outcome weekly; investigate any >20% cost regression.

If you can (1) bound actions, (2) enforce policies in code, (3) measure correctness, and (4) replay failures, you’re ready to operate agents like real production systems.
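
As a rough illustration of the policy-as-code idea in section 3, the three example rules (refunds over $100 need approval, exports over 1,000 rows are denied, deletes are denied in v1) could be sketched as a small middleware check that runs before any tool call. The names here (`ToolCall`, `Decision`, `evaluate_policy`, and the `amount_usd` / `row_count` parameters) are hypothetical and not part of any specific framework; a real deployment would likely sit behind the tool proxy and load its thresholds from config.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical shape of a proposed tool call; tool names are illustrative.
    tool: str
    params: dict = field(default_factory=dict)


# Thresholds taken from the checklist's examples.
REFUND_APPROVAL_THRESHOLD_USD = 100
EXPORT_ROW_LIMIT = 1_000


def evaluate_policy(call: ToolCall) -> Decision:
    """Decide in code, outside the prompt, whether a tool call may proceed."""
    # Deletes are denied outright in v1.
    if call.tool == "delete":
        return Decision.DENY

    if call.tool == "export":
        rows = call.params.get("row_count")
        # Fail closed: a missing row count is treated as over the limit.
        if rows is None or rows > EXPORT_ROW_LIMIT:
            return Decision.DENY

    if call.tool == "refund":
        amount = call.params.get("amount_usd")
        # Fail closed: a missing amount is routed to a human.
        if amount is None or amount > REFUND_APPROVAL_THRESHOLD_USD:
            return Decision.REQUIRE_APPROVAL

    return Decision.ALLOW
```

Because the rules live in ordinary code, they can be unit-tested and included in the nightly regression run alongside prompt and model changes. For example, `evaluate_policy(ToolCall("refund", {"amount_usd": 250}))` returns `Decision.REQUIRE_APPROVAL`, while the same call with `amount_usd` of 40 returns `Decision.ALLOW`.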