Agentic Runtime Readiness Checklist (2026)

Use this checklist before you allow an agent to take actions in production systems (tickets, billing, CRM, code repos, infra). Aim to complete Sections 1–3 for any external launch; Sections 4–6 for enterprise deals or regulated workflows.

1) Workflow definition (scope and blast radius)
- Define the workflow’s allowed outcomes in one sentence (e.g., “Create a refund draft, never issue a refund”).
- List every system touched (Stripe, Salesforce, Zendesk, GitHub, internal DB) and classify actions as Read vs Write.
- Establish “stop conditions”: max steps, max tool calls, max elapsed time, and what happens on timeout.

2) Tooling design (least privilege)
- Convert each capability into a narrow, typed tool (avoid general-purpose executors early).
- Add idempotency keys for all write actions.
- Enforce tenant scoping and environment scoping (prod vs sandbox) at the tool layer.
- Document tool contracts: required fields, error codes, and expected latency.

3) Policy and approvals
- Write explicit deny rules (restricted actions, restricted data classes like PII/PHI).
- Specify approval thresholds (e.g., refunds > $50 require human approval).
- Add data/command separation: retrieved text cannot directly trigger tool calls without policy checks.

4) Observability and auditability
- Log per request: tenant, workflow, model, tokens in/out, latency, cost estimate.
- Trace tool calls: tool name, status, retries, and redacted argument fingerprints.
- Store a “decision packet”: user intent, retrieved doc IDs, tool calls, final output, and approver identity.
- Define retention (e.g., 90 days) and redaction rules.

5) Evaluation and release gates
- Build an eval set (200–500 real tasks) with expected actions and disallowed actions.
- Add deterministic checks (schema validation, policy checks) plus a quality rubric (LLM-judge or human).
- Require eval pass thresholds before shipping prompt/tool updates.
- Run canaries on 1–5% traffic; enable fast rollback.

6) Unit economics controls
- Track weekly: cost per successful task, p95 latency, tool failure rate, human override rate.
- Set per-tenant budgets and alerting when spend spikes.
- Implement routing (cheap default + escalation) and caching (semantic + tool-result caching).
- Reduce context: summarize stable state, avoid re-sending large histories each step.

Definition of “ready to launch” (minimum)
- The agent cannot perform any high-impact write without explicit approval.
- You can answer: what it did, why it did it, what it cost, and who approved it.
- You can roll back a change (prompt/tool/routing) within minutes and prove regressions via evals.