Agent Infrastructure Readiness Checklist (2026)

Use this checklist to move from “agent demo” to “production automation.” Score each item as: Not Started / In Progress / Done.

1) Workflow Definition (Business)
- Identify one workflow with clear volume and value (e.g., IT ticket triage, PR dependency bumps).
- Define success metrics (completion rate, time saved, $/task) and failure metrics (escalations, silent failures).
- Set hard stop conditions: max tool calls, max wall-clock time, and max retries.
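
A minimal sketch of the hard stop conditions above, assuming the agent loop calls a small budget object before each step. The class name RunBudget, its method names, and the default limits are hypothetical and should be tuned per workflow.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run hits one of its hard stop conditions."""


@dataclass
class RunBudget:
    # All defaults are illustrative assumptions, not recommendations.
    max_tool_calls: int = 20
    max_wall_clock_s: float = 120.0
    max_retries: int = 3
    tool_calls: int = 0
    retries: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Raise if any limit has been reached; call before each step."""
        if self.tool_calls >= self.max_tool_calls:
            raise BudgetExceeded("max tool calls reached")
        if self.retries > self.max_retries:
            raise BudgetExceeded("max retries exceeded")
        if time.monotonic() - self.started_at > self.max_wall_clock_s:
            raise BudgetExceeded("wall-clock limit exceeded")

    def record_tool_call(self) -> None:
        """Check the limits first, then count the call."""
        self.check()
        self.tool_calls += 1

    def record_retry(self) -> None:
        """Count the retry, then fail if it pushed the run over the limit."""
        self.retries += 1
        self.check()
```

The agent loop would call record_tool_call() before every tool invocation and treat BudgetExceeded as a terminal, escalated outcome rather than something to retry.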

2) Tool Contracting (Engineering)
- Every tool has a versioned schema (inputs/outputs), server-side validation, and idempotency keys.
- Replace “general API access” with capability-scoped tools (e.g., create_refund(max_amount_usd)).
- Ensure deterministic tools return structured JSON; avoid free-form text as tool output.
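
The tool-contract items above can be sketched as one capability-scoped tool. This is an illustration under assumptions, not a reference implementation: RefundTool, RefundRequest, the error codes, and the in-memory idempotency store are all hypothetical, and a real service would persist keys durably.

```python
from dataclasses import dataclass

SCHEMA_VERSION = "1.0"  # hypothetical version tag returned with every response


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_usd: float
    idempotency_key: str


class RefundTool:
    """Capability-scoped refund tool: the cap is fixed when the tool is built."""

    def __init__(self, max_amount_usd: float) -> None:
        self.max_amount_usd = max_amount_usd
        self._results = {}  # idempotency_key -> earlier result (in-memory for the sketch)

    def create_refund(self, req: RefundRequest) -> dict:
        # Server-side validation: never trust what the model sends.
        if not req.order_id or not req.idempotency_key:
            return self._error("missing_field")
        if not (0 < req.amount_usd <= self.max_amount_usd):
            return self._error("amount_out_of_scope")
        # Idempotency: a repeated key returns the earlier result, with no second refund.
        if req.idempotency_key in self._results:
            return self._results[req.idempotency_key]
        result = {
            "schema_version": SCHEMA_VERSION,
            "ok": True,
            "order_id": req.order_id,
            "refunded_usd": req.amount_usd,
        }
        self._results[req.idempotency_key] = result
        return result

    @staticmethod
    def _error(code: str) -> dict:
        # Errors are structured too, so the agent never has to parse free text.
        return {"schema_version": SCHEMA_VERSION, "ok": False, "error": code}
```

Because the cap lives in the tool rather than in the prompt, a confused or manipulated agent cannot request a larger refund than the tool was built to allow.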

3) Permissions & Identity (Security)
- Map tools to roles using your IdP (Okta/Entra ID) and least-privilege policies.
- Implement step-up approvals for high-risk actions (money movement, deletes, production deploys).
- Store secrets in a secrets manager; never in prompts or client apps.
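
One way to picture the role mapping and step-up approval is a single authorization check in front of every tool call. The role names, tool names, and the authorize function below are hypothetical; in practice the role would come from IdP group claims rather than a hard-coded table.

```python
from typing import Optional

# Hypothetical role-to-tool mapping (least privilege: list only what each role needs).
ROLE_TOOLS = {
    "support_agent": {"lookup_order", "create_refund"},
    "readonly_bot": {"lookup_order"},
}

# Hypothetical set of tools that always require a human approver.
HIGH_RISK_TOOLS = {"create_refund", "delete_record", "deploy_production"}


def authorize(role: str, tool: str, approved_by: Optional[str] = None) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for a proposed tool call."""
    if tool not in ROLE_TOOLS.get(role, set()):
        return "deny"  # unknown roles and unmapped tools are denied by default
    if tool in HIGH_RISK_TOOLS and approved_by is None:
        return "needs_approval"  # step-up: pause until a human signs off
    return "allow"
```

For example, authorize("support_agent", "create_refund") returns "needs_approval" until an approver identity is supplied, while the same call for "readonly_bot" is denied outright.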

4) Execution Safety (Reliability)
- Default to dry-run/sandbox for destructive actions; promote to execute only after validation.
- Add circuit breakers: rate limits, concurrency caps, and kill switches per workflow.
- Implement backoff and timeouts for flaky external tools.
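
A rough sketch of two of the safety mechanisms above: retry with exponential backoff plus jitter, and a very small circuit breaker. All names and defaults are assumptions. The breaker here only counts consecutive failures and has no timed reset or half-open state, which a production version would likely need; timeouts are assumed to be enforced inside the wrapped call (for example, by the HTTP client).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpen(RuntimeError):
    """Raised when the breaker is open and calls are being blocked."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn: Callable[[], T]) -> T:
        if self.is_open:
            raise CircuitOpen("circuit open; tool calls are blocked")
        try:
            result = fn()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success closes the streak
        return result


def call_with_backoff(fn: Callable[[], T], max_retries: int = 3,
                      base_delay_s: float = 0.5, max_delay_s: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying failures with capped exponential backoff and full jitter."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
            attempt += 1
```

The sleep parameter is injectable so the retry logic can be exercised in tests without waiting.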

5) Observability (Ops)
- Emit a structured event per step: model call, tool call, validator, and final outcome.
- Trace requests end-to-end (run_id) and log cost, latency, tool-call count, and retries.
- Maintain replayability: store inputs, tool responses, and intermediate state for incident analysis.
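
As an illustration of the per-step events above, the sketch below writes one JSON line per step, all sharing a run_id, and rolls them up into the cost, latency, tool-call, and retry totals. The field names and the make_event and summarize helpers are assumptions, not a standard schema.

```python
import json
import time
import uuid
from typing import Any, Optional


def new_run_id() -> str:
    """One id per run, attached to every event so steps can be joined later."""
    return uuid.uuid4().hex


def make_event(run_id: str, step: int, kind: str, *, latency_ms: float,
               cost_usd: float = 0.0, retries: int = 0, outcome: str = "ok",
               payload: Optional[dict] = None) -> str:
    """Serialize one step as a JSON line."""
    event: dict[str, Any] = {
        "run_id": run_id,
        "step": step,
        "kind": kind,  # e.g. "model_call", "tool_call", "validator", "final"
        "ts": time.time(),
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "retries": retries,
        "outcome": outcome,
        "payload": payload or {},  # inputs and tool responses, kept for replay
    }
    return json.dumps(event, sort_keys=True)


def summarize(lines: list) -> dict:
    """Aggregate the events of one run into per-run totals."""
    events = [json.loads(line) for line in lines]
    return {
        "cost_usd": sum(e["cost_usd"] for e in events),
        "latency_ms": sum(e["latency_ms"] for e in events),
        "tool_calls": sum(1 for e in events if e["kind"] == "tool_call"),
        "retries": sum(e["retries"] for e in events),
    }
```

Storing the payload alongside each event is what makes replay possible during incident analysis; sensitive fields should be redacted before they are written.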

6) Evaluation (Quality)
- Build a regression suite (start with 50–100 cases; scale to 500+).
- Re-run the suite on every model/provider upgrade and track score changes; roll out with canary releases.
- Add adversarial tests: prompt injection attempts, ambiguous requests, and missing data cases.
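
A regression suite can start as plain data plus a comparison function. The sketch below assumes the agent can be called as a function from prompt to a short label, which is a simplification; the Case fields, the exact-match scoring, and the tag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    expected: str
    tags: tuple = ()  # e.g. ("adversarial",) for prompt-injection cases


def run_suite(agent: Callable[[str], str], cases: Iterable[Case]) -> dict:
    """Return {case name: passed} using exact-match scoring (an assumption)."""
    return {c.name: agent(c.prompt) == c.expected for c in cases}


def pass_rate(results: dict) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def regressions(baseline: dict, candidate: dict) -> list:
    """Cases that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(name for name, ok in baseline.items()
                  if ok and not candidate.get(name, False))
```

Running the same cases against the current and the candidate model and gating the canary on an empty regressions list is one way to make upgrades routine rather than risky.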

7) Cost Governance (Finance)
- Define target $/task and acceptable variance (e.g., ±10–20%).
- Implement caching for retrieval and repeated tool lookups; route simple intents to cheaper models.
- Set budget alerts and per-workflow spend limits; monitor how much agent loops multiply token usage per task.
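
To make the cost targets concrete: with a hypothetical target of $0.40 per task and a ±20% band, $45 spent on 100 completed tasks gives $0.45 per task, which is within the band. The helpers below are a sketch under those assumptions; the names, the 80% alert fraction, and the choice to block rather than throttle are all illustrative.

```python
from dataclasses import dataclass


def cost_per_task(total_spend_usd: float, completed_tasks: int) -> float:
    if completed_tasks <= 0:
        raise ValueError("completed_tasks must be positive")
    return total_spend_usd / completed_tasks


def within_band(actual: float, target: float, tolerance: float = 0.20) -> bool:
    """True if actual $/task is within +/- tolerance of the target."""
    return abs(actual - target) <= tolerance * target


@dataclass
class WorkflowBudget:
    limit_usd: float
    alert_fraction: float = 0.8  # alert once 80% of the limit is spent (assumption)
    spent_usd: float = 0.0

    def charge(self, amount_usd: float) -> str:
        """Record spend and return 'ok', 'alert', or 'blocked'."""
        if self.spent_usd + amount_usd > self.limit_usd:
            return "blocked"  # spend is not recorded; the caller should stop the run
        self.spent_usd += amount_usd
        if self.spent_usd >= self.alert_fraction * self.limit_usd:
            return "alert"
        return "ok"
```

Charging the budget before each model call, rather than after, keeps a runaway loop from overshooting the limit.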

8) Human-in-the-Loop (Operations)
- Define escalation thresholds: confidence, anomaly score, dollar amount, or policy mismatch.
- Provide operators a “why” view: steps taken, tool calls made, and evidence used.
- Capture operator feedback to improve routing, validators, and tool reliability.
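
The escalation thresholds above could be expressed as a small policy object that returns the reasons a run should go to a human. The threshold values and field names are placeholders, and how confidence and anomaly scores are produced is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    # Placeholder thresholds; tune per workflow.
    min_confidence: float = 0.8
    max_anomaly_score: float = 0.7
    max_amount_usd: float = 100.0


def escalation_reasons(policy: EscalationPolicy, *, confidence: float,
                       anomaly_score: float, amount_usd: float,
                       policy_match: bool) -> list:
    """Return every reason to escalate; an empty list means proceed."""
    reasons = []
    if confidence < policy.min_confidence:
        reasons.append("low_confidence")
    if anomaly_score > policy.max_anomaly_score:
        reasons.append("anomaly")
    if amount_usd > policy.max_amount_usd:
        reasons.append("amount_over_limit")
    if not policy_match:
        reasons.append("policy_mismatch")
    return reasons
```

Returning the full list of reasons, rather than a single boolean, gives operators part of the “why” view for free.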

9) Compliance & Data Handling (Governance)
- Classify data (PII/PCI/PHI) and enforce retention policies for logs and memory stores.
- Redact sensitive fields in prompts/logs; keep an audit trail for approvals and side effects.
- Document third-party risk posture if using hosted models or managed agent platforms.
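
A minimal redaction pass for prompts and logs might look like the following. It assumes sensitive data can be found either by field name or by a simple email pattern; the key list and the regular expression are illustrative only and will miss many real-world formats, so a dedicated PII detection tool is likely needed in practice.

```python
import re

# Hypothetical list of field names that are always masked.
SENSITIVE_KEYS = {"ssn", "card_number", "email"}

# Deliberately simple email pattern; not a full address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value):
    """Recursively mask sensitive keys and email-like strings."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value  # numbers, booleans, None pass through unchanged
```

Applying redact at the single point where events and prompts are serialized is easier to audit than redacting at each call site.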

10) Iteration Loop (Product)
- Run weekly reviews of metrics: completion rate, $/completion, escalation rate, silent failures.
- Maintain a backlog of top failure modes and prioritize fixes (tools, validators, routing, UX).
- Define ownership: who is on-call, who approves changes, and how incidents are reviewed.

If you can confidently mark 7 of the 10 sections as Done, you’re usually ready to scale beyond a single team. If you’re below 5 of 10, focus on tool contracts, permissions, and observability before adding more workflows.
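
The scoring rule can be sketched as follows. Two assumptions are made that the checklist does not state: the 7 and 5 thresholds count sections, and a section counts as Done only when every item in it is Done. The label for the 5 to 6 range is also an invention for the sketch.

```python
def sections_done(scores: dict) -> int:
    """Count sections whose items are all marked 'Done'.

    scores maps a section name to its list of item statuses
    ('Not Started', 'In Progress', or 'Done').
    """
    return sum(
        1 for items in scores.values()
        if items and all(status == "Done" for status in items)
    )


def readiness(done: int) -> str:
    if done >= 7:
        return "ready_to_scale"
    if done < 5:
        return "fix_foundations"  # tool contracts, permissions, observability first
    return "keep_hardening"  # 5 or 6: label assumed for this sketch
```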