AGENTIC WORKFLOW PRODUCTION READINESS CHECKLIST (2026)

Use this checklist to take one agentic workflow from prototype to production with bounded autonomy.

1) SCOPE & SUCCESS CRITERIA
- Pick ONE workflow (not a platform). Define the start and end state.
- Write 3 metrics with targets:
  - Success rate (e.g., 85% of runs complete without human edits)
  - Escalation rate (e.g., <10% routed to humans)
  - p95 time-to-resolution (e.g., <45s) and p95 cost-per-success (e.g., <$0.25)
- Define “never events” (e.g., refund above $50, email external domains, export PII).

2) TOOLING CONTRACTS (MOST IMPORTANT)
- For each tool, specify:
  - JSON schema for inputs/outputs, including required fields
  - Error codes (auth, not-found, validation, rate-limit)
  - Idempotency key behavior (required for writes)
  - Timeouts and retry policy (max retries, backoff)
- Prefer fewer, higher-level tools over many granular tools.

3) PERMISSIONS & IDENTITY
- No shared credentials. Use service principals with least privilege.
- If user-delegated actions are needed, map agent actions to user identity.
- Separate READ authority from WRITE authority (different tools/scopes).
- Add a “break-glass” path for elevated actions with explicit approval.

4) POLICY-AS-CODE GUARDRAILS
- Implement pre-tool and post-tool checks for:
  - Money movement thresholds
  - External communication restrictions
  - Data export / PII handling
  - Allowed objects/fields in systems like Salesforce, Jira, ServiceNow
- Store policies in version control and require code review for changes.

5) VERIFICATION & FALLBACKS
- After every write action, re-read state and verify invariants.
- Add deterministic fallbacks for common cases (templates/rules engine).
- Human-in-the-loop (HITL) thresholds:
  - Amount limits (e.g., >$50)
  - Risk tiers (Enterprise/Gov customers)
  - Low-confidence detections

6) OBSERVABILITY & AUDIT
- Log: prompt inputs, retrieval doc IDs, tool calls, tool payloads, tool responses.
- Add correlation IDs so every action can be replayed end-to-end.
- Track weekly: p95 tool latency, tool error rate, average tool calls/run, loop rate.
- Retention: define storage duration (30/90/365 days) based on compliance needs.

7) EVALUATION HARNESS
- Build an eval set of 100–500 real cases (sample from production history).
- Score: correctness, policy compliance, citation/provenance, and user satisfaction.
- Run evals before every release; track pass rate trend over time.

8) ROLLOUT PLAN
- Start with shadow mode: agent proposes actions, humans approve.
- Then limited autonomy: 5% traffic → 25% → 50% with a kill switch.
- Create an incident playbook: rollback, disable tool writes, route to humans.

9) UNIT ECONOMICS
- Monitor cost-per-success (not cost-per-run).
- Add budgets: max tool calls (e.g., 8) and max cost per run (e.g., $0.25).
- Use model routing: cheap model by default, escalate only on uncertainty.

10) SECURITY TESTING
- Red team for workflow attacks:
  - prompt injection in retrieved docs
  - tool output poisoning
  - privilege escalation via chained actions
- Require structured tool outputs; avoid trusting free-form tool text.

If you can check off sections 1–6, you’re ready for a limited production rollout. If you can check off 7–10, you’re ready to scale and sell into serious enterprises.
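
EXAMPLE: PRE-TOOL GATE (SKETCH)

The pre-tool checks in section 4, the HITL thresholds in section 5, and the run budgets in section 9 can live in one gate that every tool call passes through. The Python below is a minimal sketch under stated assumptions, not a definitive implementation: the tool names (issue_refund, send_email, export_records), the argument fields (amount_cents, to, contains_pii), and the allowed domain are hypothetical, and the limits simply reuse this checklist's example values.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Any

# Limits reuse the checklist's example values; tune them per workflow.
MAX_REFUND_CENTS = 5_000                 # $50 HITL threshold (section 5)
MAX_TOOL_CALLS = 8                       # per-run budget (section 9)
MAX_COST_USD = Decimal("0.25")           # per-run budget (section 9)
ALLOWED_EMAIL_DOMAINS = {"example.com"}  # assumption: the only internal domain


@dataclass
class RunBudget:
    """Tracks per-run spend so a looping agent is stopped deterministically."""
    tool_calls: int = 0
    cost_usd: Decimal = Decimal("0")

    def charge(self, cost_usd: Decimal) -> None:
        # Call once per executed tool call.
        self.tool_calls += 1
        self.cost_usd += cost_usd

    def exhausted(self) -> bool:
        return self.tool_calls >= MAX_TOOL_CALLS or self.cost_usd >= MAX_COST_USD


@dataclass(frozen=True)
class Decision:
    action: str  # "allow", "escalate" (route to a human), or "deny"
    reason: str


def pre_tool_check(tool: str, args: dict[str, Any], budget: RunBudget) -> Decision:
    """Decide whether a proposed tool call may run. Tool names are hypothetical."""
    if budget.exhausted():
        return Decision("escalate", "run budget exhausted")

    if tool == "issue_refund":
        amount = args.get("amount_cents")
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(amount, int) or isinstance(amount, bool) or amount <= 0:
            return Decision("deny", "invalid refund amount")
        if amount > MAX_REFUND_CENTS:
            return Decision("escalate", "refund above threshold")

    if tool == "send_email":
        # Text after the last "@"; a missing or malformed address fails the check.
        domain = str(args.get("to", "")).rpartition("@")[2].lower()
        if domain not in ALLOWED_EMAIL_DOMAINS:
            return Decision("deny", "external or invalid recipient")

    # Fail closed: treat an export as containing PII unless told otherwise.
    if tool == "export_records" and args.get("contains_pii", True):
        return Decision("deny", "PII export")

    return Decision("allow", "ok")
```

In this sketch the orchestrator would call pre_tool_check before every tool invocation, run the tool only on "allow", call budget.charge afterward, and log the Decision alongside the run's correlation ID. Amounts are integer cents and costs are Decimal so that threshold comparisons are not subject to floating-point rounding.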