AGENTIC WORKFLOW PRODUCTION READINESS CHECKLIST (2026)

Use this checklist to take one agentic workflow from prototype to production with bounded autonomy.

1) SCOPE & SUCCESS CRITERIA
- Pick ONE workflow (not a platform). Define the start and end state.
- Write 3 metrics with targets:
  - Success rate (e.g., 85% of runs complete without human edits)
  - Escalation rate (e.g., <10% routed to humans)
  - p95 time-to-resolution (e.g., <45s) and p95 cost-per-success (e.g., <$0.25)
- Define “never events” (e.g., refund above $50, email external domains, export PII).

2) TOOLING CONTRACTS (MOST IMPORTANT)
- For each tool, specify:
  - JSON schema for inputs/outputs, including required fields
  - Error codes (auth, not-found, validation, rate-limit)
  - Idempotency key behavior (required for writes)
  - Timeouts and retry policy (max retries, backoff)
- Prefer fewer, higher-level tools over many granular tools.

3) PERMISSIONS & IDENTITY
- No shared credentials. Use service principals with least privilege.
- If user-delegated actions are needed, map agent actions to user identity.
- Separate READ authority from WRITE authority (different tools/scopes).
- Add a “break-glass” path for elevated actions with explicit approval.

4) POLICY-AS-CODE GUARDRAILS
- Implement pre-tool and post-tool checks for:
  - Money movement thresholds
  - External communication restrictions
  - Data export / PII handling
  - Allowed objects/fields in systems like Salesforce, Jira, ServiceNow
- Store policies in version control and require code review for changes.

5) VERIFICATION & FALLBACKS
- After every write action, re-read state and verify invariants.
- Add deterministic fallbacks for common cases (templates/rules engine).
- Human-in-the-loop (HITL) thresholds:
  - Amount limits (e.g., >$50)
  - Risk tiers (Enterprise/Gov customers)
  - Low-confidence detections

6) OBSERVABILITY & AUDIT
- Log: prompt inputs, retrieval doc IDs, tool calls, tool payloads, tool responses.
- Add correlation IDs so every action can be replayed end-to-end.
- Track weekly: p95 tool latency, tool error rate, average tool calls/run, loop rate.
- Retention: define storage duration (30/90/365 days) based on compliance needs.

7) EVALUATION HARNESS
- Build an eval set of 100–500 real cases (sample from production history).
- Score: correctness, policy compliance, citation/provenance, and user satisfaction.
- Run evals before every release; track pass rate trend over time.

8) ROLLOUT PLAN
- Start with shadow mode: agent proposes actions, humans approve.
- Then limited autonomy: 5% traffic → 25% → 50% with a kill switch.
- Create an incident playbook: rollback, disable tool writes, route to humans.

9) UNIT ECONOMICS
- Monitor cost-per-success (not cost-per-run).
- Add budgets: max tool calls (e.g., 8) and max cost per run (e.g., $0.25).
- Use model routing: cheap model by default, escalate only on uncertainty.

10) SECURITY TESTING
- Red team for workflow attacks:
  - prompt injection in retrieved docs
  - tool output poisoning
  - privilege escalation via chained actions
- Require structured tool outputs; avoid trusting free-form tool text.

If you can check off sections 1–6, you’re ready for a limited production rollout. If you can check off 7–10, you’re ready to scale and sell into serious enterprises.