AGENTIC RELIABILITY READINESS CHECKLIST (2026)

Use this checklist before you let an AI agent touch production systems-of-record.

1) DEFINE THE WORKFLOW CONTRACT
- Name the workflow and its owner (team + on-call rotation).
- Write the "task contract": inputs, outputs, required tools, forbidden tools.
- Define success/failure with 10–20 concrete examples (including edge cases).

2) CHOOSE THE AUTONOMY LEVEL
- Classify the workflow: Read-only / Draft-only / Low-risk write / Revenue-impacting / Security & access.
- Decide the execution mode: auto-execute, approval-gated, or human-in-the-loop.
- Add a rollback strategy (revert patches, undo actions, or compensating transactions).

3) GUARDRAILS THAT PREVENT REAL INCIDENTS
- Enforce structured outputs (schema validation + retries with capped attempts).
- Implement allowlists for tools and fields; deny by default for writes.
- Add rate limits: max tool calls per task and max retries per tool.
- Add budget limits: max $/task (P95) and a monthly cap with alerts.
- Add a kill switch and a "degrade to draft-only" mode.

4) IDENTITY, PERMISSIONS, AND AUDIT
- Give the agent a dedicated service identity.
- Apply least privilege (tool-scoped permissions, read vs. write separation).
- Log every tool call: tool name, timestamp, params hash, result summary, and the record identifiers touched.
- Decide retention (e.g., 30 days for traces; longer for regulated workflows).

5) EVALUATION BEFORE PRODUCTION
- Build a golden set (start with 200 tasks; scale to 2,000 as you mature).
- Create regression gates: block the deploy if the success rate drops more than 2 points or tool errors rise.
- Add adversarial cases: prompt-injection attempts, malformed inputs, stale documents.

6) SAFE ROLLOUT PLAN
- Start in shadow mode (the agent proposes actions; no writes).
- Canary deploy to 1–5% of traffic with extra review sampling.
- Define alert thresholds: success rate, intervention rate, cost per task, unauthorized tool attempts.
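The section-3 guardrails (schema-validated outputs with capped retries, a deny-by-default tool allowlist, and per-task rate/budget limits) can be sketched as below. All names here (TaskState, check_tool_call, the example tool names and limits) are illustrative assumptions, not a real library or your actual configuration.

```python
"""Sketch of section-3 guardrails: validated outputs with capped
retries, deny-by-default allowlists, and rate/budget limits.
All identifiers and limit values are hypothetical."""
from dataclasses import dataclass

MAX_ATTEMPTS = 3            # capped retries for schema validation
MAX_TOOL_CALLS = 20         # rate limit: tool calls per task
BUDGET_PER_TASK_USD = 0.50  # budget limit: max $/task

# Deny-by-default: a write is allowed only if it is explicitly listed.
WRITE_ALLOWLIST = {"update_ticket_status"}
READ_ALLOWLIST = {"search_tickets", "get_ticket"}


@dataclass
class TaskState:
    """Per-task counters checked before every tool call."""
    tool_calls: int = 0
    spend_usd: float = 0.0


def check_tool_call(tool: str, is_write: bool, state: TaskState,
                    cost_usd: float) -> None:
    """Raise before executing a tool call that violates a guardrail."""
    allowed = WRITE_ALLOWLIST if is_write else (READ_ALLOWLIST | WRITE_ALLOWLIST)
    if tool not in allowed:
        raise PermissionError(f"tool {tool!r} not on allowlist (deny by default)")
    if state.tool_calls + 1 > MAX_TOOL_CALLS:
        raise RuntimeError("rate limit: max tool calls per task exceeded")
    if state.spend_usd + cost_usd > BUDGET_PER_TASK_USD:
        raise RuntimeError("budget limit: max $/task exceeded")
    state.tool_calls += 1
    state.spend_usd += cost_usd


def validated_output(generate, validate, max_attempts: int = MAX_ATTEMPTS):
    """Retry generation until the output passes validation, with a hard cap."""
    for _ in range(max_attempts):
        out = generate()
        if validate(out):
            return out
    raise ValueError(f"output failed validation after {max_attempts} attempts")
```

In a real deployment the validate callback would be a JSON Schema (or similar) check on the model's structured output, and a rejected call would be logged as an "unauthorized tool attempt" for the section-6 alert thresholds.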
7) PRODUCTION METRICS (WEEKLY REVIEW)
- Task success rate (overall + by segment).
- Intervention rate (human corrections / total tasks).
- Tool error rate (invalid params, denied actions).
- Unit cost per outcome (P50/P95) and the top cost drivers.
- Time-to-detection for silent failures (goal: hours, not weeks).

8) OPERATING CADENCE
- Weekly failure review: add new real failures to the golden set.
- Monthly permission review: validate allowlists, thresholds, and approvals.
- Model/provider changes require a staged eval + canary, not a flip of a switch.

Exit criteria for "production-ready": (a) an audited write path, (b) a kill switch + rollback, (c) regression evals in CI, (d) a unit-cost SLO with alerting, and (e) a named owner with on-call coverage.
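The section-7 weekly review reduces to a few ratios and percentiles over the task log. A minimal sketch, assuming each logged task carries `succeeded`, `intervened`, and `cost_usd` fields (a hypothetical logging schema, not a standard):

```python
"""Sketch of the section-7 weekly metrics over a task log.
The record fields (succeeded, intervened, cost_usd) are assumed."""
from statistics import quantiles


def weekly_metrics(tasks: list[dict]) -> dict:
    """Compute success rate, intervention rate, and P50/P95 unit cost."""
    n = len(tasks)
    costs = sorted(t["cost_usd"] for t in tasks)
    # quantiles(n=100) returns the 1st..99th percentile cut points;
    # index 49 is P50 and index 94 is P95.
    pct = quantiles(costs, n=100, method="inclusive")
    return {
        "task_success_rate": sum(t["succeeded"] for t in tasks) / n,
        "intervention_rate": sum(t["intervened"] for t in tasks) / n,
        "cost_p50_usd": pct[49],
        "cost_p95_usd": pct[94],
    }
```

Segmenting these by workflow or customer tier (per the "by segment" bullet) is a matter of grouping the task log before calling the same function; alerting fires when any value crosses the thresholds set in section 6.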