Production Agent Readiness Checklist (2026)

Use this as a gate before you ship any tool-using agent into a customer-facing or business-critical workflow.

1) Workflow Definition
- Name one workflow (not a platform): e.g., “Handle address change requests.”
- Define success metrics: accuracy %, time-to-complete, escalation rate, and a hard safety metric (e.g., <0.1% wrong-customer actions).
- Define blast radius: what’s the worst-case damage if the agent is wrong?

2) Tool Access & Permissions
- Expose narrow actions (capabilities) instead of full raw APIs.
- Enforce least privilege at the action level (role-based + context constraints like max $ amount).
- Require approvals for irreversible actions (refunds, deletes, production changes).
- Centralize secrets in a secrets manager; rotate credentials; avoid long-lived tokens.

3) Retrieval (Secure RAG)
- Confirm retrieval respects the same permissions as the source system (row/role-based).
- Log document IDs, versions, and access decisions for every retrieval.
- Add citation-by-construction: answers must reference retrieved passages.
- Add prompt-injection defenses: strip instructions from retrieved text; isolate system prompts.

4) Evaluation & Regression Testing
- Create a golden set of 500+ real tasks with expected outcomes and citations.
- Track metrics separately: answer correctness, policy adherence, and tool correctness.
- Run evals on every change to prompts, tools, retrieval rules, and model versions.
- Add adversarial tests: conflicting docs, partial info, typos, and malicious content.

5) Observability & Incident Response
- Trace every run: prompts, retrieved-context hashes, tool calls, outputs, latency, and errors.
- Implement replay: reproduce failures end-to-end from logs.
- Set alerts on drift: rising tool failures, increased retries, or falling pass rates.
- Define an incident playbook: disable tools, force human-only mode, notify owners.
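The least-privilege and approval rules in section 2 can be sketched as a small authorization gate in front of one narrow action. This is a minimal illustration, not any framework's API: the names (`ActionRequest`, `authorize_refund`) and the $50 cap are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: gate a narrow "issue refund" action instead of
# exposing the raw payments API. Names and thresholds are illustrative.

@dataclass
class ActionRequest:
    customer_id: str
    amount_usd: float
    approved_by: Optional[str] = None  # set by a human reviewer, never by the agent

MAX_AUTO_REFUND_USD = 50.0  # context constraint: cap on unapproved refunds

def authorize_refund(req: ActionRequest) -> tuple[bool, str]:
    """Return (allowed, reason) for a refund request from the agent."""
    if req.amount_usd <= 0:
        return False, "invalid amount"
    if req.amount_usd > MAX_AUTO_REFUND_USD and req.approved_by is None:
        # Irreversible and above the cap: require explicit human approval.
        return False, "approval required"
    return True, "authorized"
```

The point of the shape: the agent can only request the action; the gate, not the model, decides whether it runs, and the `approved_by` field records who signed off on anything above the cap.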
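Section 4's "track metrics separately" can be sketched as a golden-set harness that scores each dimension independently. The sketch below covers two of the three metrics (answer correctness and tool correctness); all names (`GoldenTask`, `run_eval`) are hypothetical, not a real eval framework.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of a golden-set regression eval. Each task records the
# expected answer and the expected tool-call sequence so the two metrics are
# scored separately, not collapsed into one pass/fail number.

@dataclass
class GoldenTask:
    prompt: str
    expected_answer: str
    expected_tools: list[str]

@dataclass
class EvalResult:
    answer_correct: float  # fraction of tasks with the right answer
    tool_correct: float    # fraction of tasks with the right tool calls

def run_eval(
    tasks: list[GoldenTask],
    agent: Callable[[str], tuple[str, list[str]]],
) -> EvalResult:
    answer_hits = tool_hits = 0
    for task in tasks:
        answer, tools_used = agent(task.prompt)
        answer_hits += int(answer == task.expected_answer)
        tool_hits += int(tools_used == task.expected_tools)
    n = len(tasks)
    return EvalResult(answer_hits / n, tool_hits / n)
```

Run this on every change to prompts, tools, retrieval rules, or model versions, and alert when either metric drops below its baseline rather than only watching an aggregate score.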
6) Cost & Performance
- Measure cost per completed task (not just cost per token).
- Add tiering: cheap model for routing/classification, stronger model for hard steps.
- Cap context: retrieval limits, reranking, and summarization into structured state.
- Add caching for repeated, validated answers with freshness checks.

Ship criteria: If you can’t answer “what data did it read, what did it change, and who approved it?” you’re not ready for production.
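Section 6's "caching for repeated, validated answers with freshness checks" can be sketched as a small TTL cache: entries expire after a fixed age so a stale answer forces a fresh model call. The class name and the default TTL are assumptions for illustration, not a specific library.

```python
import time
from typing import Any, Optional

class FreshnessCache:
    """Cache only answers that already passed validation; expire them by age."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def put(self, key: str, validated_answer: Any) -> None:
        # Store the answer with the time it was cached.
        self._store[key] = (time.monotonic(), validated_answer)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if time.monotonic() - stored_at > self.ttl:
            # Stale: evict and force a fresh computation upstream.
            del self._store[key]
            return None
        return answer
```

In practice the key should include everything the answer depends on (normalized question, user permissions, document versions), so a permission or source-document change never serves a cached answer it shouldn't.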