AGENTIC WORKFLOW PRODUCTION READINESS CHECKLIST (2026)

Use this checklist to ship an agent workflow that is measurable, governable, and safe.

1) SCOPE & SUCCESS METRICS
- Define a single workflow (start narrow): trigger → steps → “done.”
- Choose 1–2 KPIs and a baseline (e.g., time-to-first-response, deflection rate, hours saved/week, error rate).
- Decide which actions are READ vs WRITE vs IRREVERSIBLE.
- Set target SLOs by action type (example: READ 99.5% acceptable; WRITE 99.9% with verification; IRREVERSIBLE requires approval).

2) DATA & CONTEXT
- List required data sources (tickets, CRM, billing, logs) and classify sensitivity (PII, PCI, PHI).
- Implement data minimization: retrieve only what the step needs.
- Add redaction/filters for PII where feasible.
- Document retention rules for prompts, retrieved context, and tool outputs.

3) TOOLING & STATE
- Keep workflow state outside the model (DB record + explicit state machine).
- Require idempotency keys for every tool action.
- Implement timeouts, retries with backoff, and circuit breakers.
- Add an explicit kill switch (feature flag) and a safe fallback path.

4) PERMISSIONS & GOVERNANCE
- Separate identities: requester (human), agent runtime, tool/service account.
- Enforce least privilege: per-tool scopes, per-action allowlists.
- Add approval gates for high-impact actions (e.g., refunds > $500, permission changes, mass outbound messages).
- Log: who requested, what context was used, what actions were taken, and what was changed.

5) EVALUATION (EVALS)
- Build a golden set (200–2,000 real examples) with expected outcomes.
- Run regression evals nightly and on every release candidate.
- Track: schema adherence, tool-call accuracy, hallucination rate, escalation rate, latency, and cost per successful task.
- Add adversarial tests: prompt injection attempts, malformed inputs, missing context.

6) VERIFICATION LAYER
- Implement deterministic business rules first (schemas, invariants, constraints).
- Add probabilistic checks where needed (LLM judge, cross-model critique) with thresholds.
- Define escalation rules: when confidence is low or policies block actions.
- Require “evidence” for decisions (links to ticket, records, retrieved docs).

7) OBSERVABILITY & COST CONTROLS
- Trace each request end-to-end (model calls + tool calls + retries).
- Compute dollars per successful task and alert on regressions.
- Use routing (small → large) and caching for repeated queries.
- Set budget alerts and per-tenant rate limits.

8) ROLLOUT PLAN
- Shadow mode first: agent suggests, human executes.
- Limited write access next: caps + canary rollout to a small cohort.
- Full rollout only after stable metrics over a defined window (e.g., 2–4 weeks).
- Post-incident review process: classification, root cause, new tests added.

If you can’t answer “Who did what, using which data, under what permission, and how we verified it?” you’re not ready for production.