Production Agent Readiness Checklist (2026)

Use this as a gate before you ship any tool-using agent into a customer-facing or business-critical workflow.

1) Workflow Definition
- Name one workflow (not a platform): e.g., “Handle address change requests.”
- Define success metrics: accuracy %, time-to-complete, escalation rate, and a hard safety metric (e.g., <0.1% wrong-customer actions).
- Define blast radius: what’s the worst-case damage if the agent is wrong?

2) Tool Access & Permissions
- Expose narrow actions (capabilities) instead of full raw APIs.
- Enforce least privilege at the action level (role-based + context constraints like max $ amount).
- Require approvals for irreversible actions (refunds, deletes, production changes).
- Centralize secrets in a secrets manager; rotate credentials; avoid long-lived tokens.

3) Retrieval (Secure RAG)
- Confirm retrieval respects the same permissions as the source system (row/role-based).
- Log document IDs, versions, and access decisions for every retrieval.
- Add citation-by-construction: answers must reference retrieved passages.
- Add prompt-injection defenses: strip instructions from retrieved text; isolate system prompts.

4) Evaluation & Regression Testing
- Create a golden set of 500+ real tasks with expected outcomes and citations.
- Track metrics separately: answer correctness, policy adherence, and tool correctness.
- Run evals on every change to prompts, tools, retrieval rules, and model versions.
- Add adversarial tests: conflicting docs, partial info, typos, and malicious content.

5) Observability & Incident Response
- Trace every run: prompts, retrieved-context hashes, tool calls, outputs, latency, and errors.
- Implement replay: reproduce failures end-to-end from logs.
- Set alerts on drift: rising tool failures, increased retries, or falling pass rates.
- Define an incident playbook: disable tools, force human-only mode, notify owners.
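The least-privilege and approval rules in section 2 can be sketched as a small authorization gate in front of one narrow action. This is a minimal illustration, not any framework's API: the names (`ActionRequest`, `authorize_refund`) and the $50 cap are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: gate a narrow "issue refund" action instead of
# exposing the raw payments API. Names and thresholds are illustrative.

@dataclass
class ActionRequest:
    customer_id: str
    amount_usd: float
    approved_by: Optional[str] = None  # set by a human reviewer, never by the agent

MAX_AUTO_REFUND_USD = 50.0  # context constraint: cap on unapproved refunds

def authorize_refund(req: ActionRequest) -> tuple[bool, str]:
    """Return (allowed, reason) for a refund request from the agent."""
    if req.amount_usd <= 0:
        return False, "invalid amount"
    if req.amount_usd > MAX_AUTO_REFUND_USD and req.approved_by is None:
        # Irreversible and above the cap: require explicit human approval.
        return False, "approval required"
    return True, "authorized"
```

The point of the shape: the agent can only request the action; the gate, not the model, decides whether it runs, and the `approved_by` field records who signed off on anything above the cap.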
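Section 4's "track metrics separately" can be sketched as a golden-set harness that scores each dimension independently. The sketch below covers two of the three metrics (answer correctness and tool correctness); all names (`GoldenTask`, `run_eval`) are hypothetical, not a real eval framework.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of a golden-set regression eval. Each task records the
# expected answer and the expected tool-call sequence so the two metrics are
# scored separately, not collapsed into one pass/fail number.

@dataclass
class GoldenTask:
    prompt: str
    expected_answer: str
    expected_tools: list[str]

@dataclass
class EvalResult:
    answer_correct: float  # fraction of tasks with the right answer
    tool_correct: float    # fraction of tasks with the right tool calls

def run_eval(
    tasks: list[GoldenTask],
    agent: Callable[[str], tuple[str, list[str]]],
) -> EvalResult:
    answer_hits = tool_hits = 0
    for task in tasks:
        answer, tools_used = agent(task.prompt)
        answer_hits += int(answer == task.expected_answer)
        tool_hits += int(tools_used == task.expected_tools)
    n = len(tasks)
    return EvalResult(answer_hits / n, tool_hits / n)
```

Run this on every change to prompts, tools, retrieval rules, or model versions, and alert when either metric drops below its baseline rather than only watching an aggregate score.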
6) Cost & Performance
- Measure cost per completed task (not just cost per token).
- Add tiering: cheap model for routing/classification, stronger model for hard steps.
- Cap context: retrieval limits, reranking, and summarization into structured state.
- Add caching for repeated, validated answers with freshness checks.

Ship criteria: If you can’t answer “what data did it read, what did it change, and who approved it?” you’re not ready for production.
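Section 6's "caching for repeated, validated answers with freshness checks" can be sketched as a small TTL cache: entries expire after a fixed age so a stale answer forces a fresh model call. The class name and the default TTL are assumptions for illustration, not a specific library.

```python
import time
from typing import Any, Optional

class FreshnessCache:
    """Cache only answers that already passed validation; expire them by age."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def put(self, key: str, validated_answer: Any) -> None:
        # Store the answer with the time it was cached.
        self._store[key] = (time.monotonic(), validated_answer)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if time.monotonic() - stored_at > self.ttl:
            # Stale: evict and force a fresh computation upstream.
            del self._store[key]
            return None
        return answer
```

In practice the key should include everything the answer depends on (normalized question, user permissions, document versions), so a permission or source-document change never serves a cached answer it shouldn't.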