Agent Workflow Production Readiness Checklist (2026)

Use this checklist to ship a production agent workflow that is controlled, audited, and economically predictable. Aim to complete every section before rolling beyond 25% traffic.

1) Scope & Success Metrics
- Define one workflow with a clear start/end state (e.g., “triage ticket,” “create quote,” “provision access”).
- Set measurable targets:
  - Task success rate (e.g., ≥90% on offline eval set)
  - Policy violation rate (e.g., ≤2%)
  - Cost ceiling (e.g., ≤$0.10 per completed task)
  - Latency budget (e.g., p95 ≤20s for async workflows)
- Define a “stop ship” condition (e.g., any unauthorized data export, any delete action, or >0.5% high-severity incidents).

2) Tooling & Permissions (Least Privilege)
- Inventory tools/APIs the agent can call; cap v1 at 5–15 tools.
- Require strict schemas for every tool input/output (JSON, validated types).
- Implement transaction-scoped credentials:
  - Short-lived tokens (target TTL ≤1 hour)
  - No shared API keys in agent runtime
- Add a tool proxy that enforces: schema validation, rate limits, allow/deny lists, and logging.

3) Policy-as-Code & Approvals
- Encode policies outside prompts (rules engine / middleware). Examples:
  - Refunds > $100 require approval
  - Exports > 1,000 rows denied
  - Deletes denied in v1
- Add risk scoring to decide when HITL applies (target humans review only top 5–10% highest-risk runs).
- Ensure approvals are logged with approver identity, timestamp, and decision rationale.

4) Observability & Audit Trail
- Log per run: run_id, user_id/tenant_id, tool calls, parameters (redacted), outputs (redacted), policy decisions, and final state change.
- Store an append-only “action ledger” for compliance (retention based on industry; often 1–7 years).
- Set up dashboards:
  - Success rate, policy blocks, human approvals, tool error rates
  - Token spend per tenant, per workflow, per route
- Alerting: page on unauthorized action attempt, repeated tool failures, or runaway step counts.

5) Evals & Regression
- Build a golden dataset of real tasks (start with 200; grow to 1,000+).
- Score outcomes (state change correctness), not just text quality.
- Run nightly regression on: model changes, prompt changes, tool schema changes, and policy changes.
- Add “canary” evals for known failure modes (prompt injection, PII leakage, invalid tool params).

6) Deployment & Rollout
- Implement kill switches:
  - Global disable
  - Per-tenant disable
  - Per-tool disable
- Roll out gradually: 5% → 25% → 100% with monitoring gates.
- Provide safe fallback: human queue, deterministic rules, or read-only mode.

7) Unit Economics Controls
- Route easy tasks to cheaper models; escalate on uncertainty.
- Compress context via retrieval + summaries; avoid full transcript stuffing.
- Cache embeddings and repeated answers where safe.
- Track cost per outcome weekly; investigate any >20% cost regression.

If you can (1) bound actions, (2) enforce policies in code, (3) measure correctness, and (4) replay failures, you’re ready to operate agents like real production systems.