AGENTIC RELIABILITY READINESS CHECKLIST (2026)

Use this checklist before you let an AI agent touch production systems-of-record.

1) DEFINE THE WORKFLOW CONTRACT
- Name the workflow and its owner (team + on-call rotation).
- Write the "task contract": inputs, outputs, required tools, forbidden tools.
- Define success/failure with 10–20 concrete examples (including edge cases).

2) CHOOSE THE AUTONOMY LEVEL
- Classify the workflow: Read-only / Draft-only / Low-risk write / Revenue-impacting / Security & access.
- Decide the execution mode: auto-execute, approval-gated, or human-in-the-loop.
- Add a rollback strategy (revert patches, undo actions, or compensating transactions).

3) GUARDRAILS THAT PREVENT REAL INCIDENTS
- Enforce structured outputs (schema validation + retries with capped attempts).
- Implement allowlists for tools and fields; deny by default for writes.
- Add rate limits: max tool calls per task and max retries per tool.
- Add budget limits: max $/task (P95) and a monthly cap with alerts.
- Add a kill switch and a "degrade to draft-only" mode.

4) IDENTITY, PERMISSIONS, AND AUDIT
- Give the agent a dedicated service identity.
- Apply least privilege (tool-scoped permissions, read vs. write separation).
- Log every tool call: tool name, timestamp, params hash, result summary, and the record identifiers touched.
- Decide retention (e.g., 30 days for traces; longer for regulated workflows).

5) EVALUATION BEFORE PRODUCTION
- Build a golden set (start with 200 tasks; scale to 2,000 as you mature).
- Create regression gates: block the deploy if the success rate drops more than 2 points or tool errors rise.
- Add adversarial cases: prompt-injection attempts, malformed inputs, stale documents.

6) SAFE ROLLOUT PLAN
- Start in shadow mode (the agent proposes actions; no writes).
- Canary deploy to 1–5% of traffic with extra review sampling.
- Define alert thresholds: success rate, intervention rate, cost per task, unauthorized tool attempts.
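The section-3 guardrails (schema-validated outputs with capped retries, a deny-by-default tool allowlist, and per-task rate/budget limits) can be sketched as below. All names here (TaskState, check_tool_call, the example tool names and limits) are illustrative assumptions, not a real library or your actual configuration.

```python
"""Sketch of section-3 guardrails: validated outputs with capped
retries, deny-by-default allowlists, and rate/budget limits.
All identifiers and limit values are hypothetical."""
from dataclasses import dataclass

MAX_ATTEMPTS = 3            # capped retries for schema validation
MAX_TOOL_CALLS = 20         # rate limit: tool calls per task
BUDGET_PER_TASK_USD = 0.50  # budget limit: max $/task

# Deny-by-default: a write is allowed only if it is explicitly listed.
WRITE_ALLOWLIST = {"update_ticket_status"}
READ_ALLOWLIST = {"search_tickets", "get_ticket"}


@dataclass
class TaskState:
    """Per-task counters checked before every tool call."""
    tool_calls: int = 0
    spend_usd: float = 0.0


def check_tool_call(tool: str, is_write: bool, state: TaskState,
                    cost_usd: float) -> None:
    """Raise before executing a tool call that violates a guardrail."""
    allowed = WRITE_ALLOWLIST if is_write else (READ_ALLOWLIST | WRITE_ALLOWLIST)
    if tool not in allowed:
        raise PermissionError(f"tool {tool!r} not on allowlist (deny by default)")
    if state.tool_calls + 1 > MAX_TOOL_CALLS:
        raise RuntimeError("rate limit: max tool calls per task exceeded")
    if state.spend_usd + cost_usd > BUDGET_PER_TASK_USD:
        raise RuntimeError("budget limit: max $/task exceeded")
    state.tool_calls += 1
    state.spend_usd += cost_usd


def validated_output(generate, validate, max_attempts: int = MAX_ATTEMPTS):
    """Retry generation until the output passes validation, with a hard cap."""
    for _ in range(max_attempts):
        out = generate()
        if validate(out):
            return out
    raise ValueError(f"output failed validation after {max_attempts} attempts")
```

In a real deployment the validate callback would be a JSON Schema (or similar) check on the model's structured output, and a rejected call would be logged as an "unauthorized tool attempt" for the section-6 alert thresholds.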
7) PRODUCTION METRICS (WEEKLY REVIEW)
- Task success rate (overall + by segment).
- Intervention rate (human corrections / total tasks).
- Tool error rate (invalid params, denied actions).
- Unit cost per outcome (P50/P95) and the top cost drivers.
- Time-to-detection for silent failures (goal: hours, not weeks).

8) OPERATING CADENCE
- Weekly failure review: add new real failures to the golden set.
- Monthly permission review: validate allowlists, thresholds, and approvals.
- Model/provider changes require a staged eval + canary, not a flip of a switch.

Exit criteria for "production-ready": (a) an audited write path, (b) a kill switch + rollback, (c) regression evals in CI, (d) a unit-cost SLO with alerting, and (e) a named owner with on-call coverage.
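The section-7 weekly review reduces to a few ratios and percentiles over the task log. A minimal sketch, assuming each logged task carries `succeeded`, `intervened`, and `cost_usd` fields (a hypothetical logging schema, not a standard):

```python
"""Sketch of the section-7 weekly metrics over a task log.
The record fields (succeeded, intervened, cost_usd) are assumed."""
from statistics import quantiles


def weekly_metrics(tasks: list[dict]) -> dict:
    """Compute success rate, intervention rate, and P50/P95 unit cost."""
    n = len(tasks)
    costs = sorted(t["cost_usd"] for t in tasks)
    # quantiles(n=100) returns the 1st..99th percentile cut points;
    # index 49 is P50 and index 94 is P95.
    pct = quantiles(costs, n=100, method="inclusive")
    return {
        "task_success_rate": sum(t["succeeded"] for t in tasks) / n,
        "intervention_rate": sum(t["intervened"] for t in tasks) / n,
        "cost_p50_usd": pct[49],
        "cost_p95_usd": pct[94],
    }
```

Segmenting these by workflow or customer tier (per the "by segment" bullet) is a matter of grouping the task log before calling the same function; alerting fires when any value crosses the thresholds set in section 6.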