Production Agent Readiness Kit (2026)

Use this checklist to take an agent from demo to dependable production.

1) Define the job (scope + success metrics)
- Choose ONE workflow with clear inputs/outputs (e.g., “close tier-1 tickets,” “provision sandbox,” “draft QBR”).
- Pick 2–3 metrics and record a baseline for each, for example: task success rate (%), escalation rate (%), median time-to-complete, or cost per task ($).
- Define the failure mode: what should happen when the agent is unsure (escalate, ask a clarifying question, or stop).

2) Authority model (permissions and approvals)
- List allowed tools/APIs and map each to least-privilege credentials.
- Create authority tiers: Read-only → Suggest-only → Execute-with-approval → Autonomous.
- Require approvals for irreversible, money-moving, external-communication, deletion, or permission-grant actions.
- Add a “kill switch” (feature flag) that disables tool execution but keeps read-only mode.

3) Policy layer (enforceable constraints)
- Encode deterministic rules in code (amount caps, domain restrictions, required fields).
- Add a small classifier for ambiguous cases (e.g., “is this request a refund?”) but keep the final enforcement deterministic.
- Log every policy decision with: timestamp, rule_id, input summary, and outcome.

4) Budget-first orchestration (cost control)
- Set per-request caps: max model calls, max tool calls, and max total cost ($).
- Implement routing: small model for intent/extraction; larger model only for hard cases.
- Add caching for retrieval and repeated tool results.
- Define a hard stop: if the budget is exhausted, escalate or return partial results.

5) Evals as CI (quality and safety)
- Build a labeled eval set (start with 200 cases; aim for 1,000+).
- Measure workflow outcomes, not just answer quality: success/failure, correct tool usage, and policy compliance.
- Gate releases: block deployment if key metrics regress beyond thresholds (e.g., >1–2 pp drop in success rate).
- Include adversarial tests: prompt injection, missing context, malformed inputs, rate limits, and tool failures.

6) Observability + incident response
- Require trace_id per request; log tool calls, latencies, retries, and cost.
- Redact PII in logs by default; restrict trace access via RBAC.
- Set alerts: spike in retries, tool errors, policy denials, or cost per task.
- Create runbooks: how to replay traces, how to disable tool execution, and how to roll back.

7) Rollout plan (safe deployment)
- Start with shadow mode (no writes), then 1–5% traffic with review, then ramp.
- Use risk-based sampling for human review (higher-risk requests get higher review rate).
- Review weekly: top failure reasons, policy blocks, and cost hotspots; prioritize fixes.

Outcome: If you can’t explain your agent’s authority, costs, and failure handling in 60 seconds, you’re not ready for production.
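
Appendix: one way sections 2–4 could fit together is a single deterministic check that runs before every tool call. The sketch below is illustrative only. The names (Tier, Budget, Action, decide), the category strings, the rule_id values, and the $100 cap are all assumptions made for this example and are not part of any particular framework.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Authority tiers from section 2 (names are illustrative)."""
    READ_ONLY = 0
    SUGGEST_ONLY = 1
    EXECUTE_WITH_APPROVAL = 2
    AUTONOMOUS = 3


# Hypothetical category labels for actions that always need human approval.
APPROVAL_REQUIRED = {
    "irreversible", "money_moving", "external_communication",
    "deletion", "permission_grant",
}


@dataclass
class Budget:
    """Per-request caps from section 4, plus running counters."""
    max_model_calls: int
    max_tool_calls: int
    max_cost_usd: float
    model_calls: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0

    def exhausted(self) -> bool:
        return (self.model_calls >= self.max_model_calls
                or self.tool_calls >= self.max_tool_calls
                or self.cost_usd >= self.max_cost_usd)


@dataclass
class Action:
    tool: str
    category: str            # "read", "write", or one of APPROVAL_REQUIRED
    amount_usd: float = 0.0


def decide(action, tier, budget, kill_switch_on, amount_cap_usd=100.0):
    """Return (outcome, rule_id); the caller should log both (section 3)."""
    if budget.exhausted():
        return ("escalate", "budget_exhausted")      # hard stop
    if action.category == "read":
        return ("execute", "read_allowed")           # reads survive the kill switch
    if kill_switch_on:
        return ("deny", "kill_switch")
    if tier == Tier.READ_ONLY:
        return ("deny", "tier_read_only")
    if action.amount_usd > amount_cap_usd:
        return ("deny", "amount_cap")                # deterministic rule
    if tier == Tier.SUGGEST_ONLY:
        return ("suggest", "tier_suggest_only")
    if action.category in APPROVAL_REQUIRED:
        return ("needs_approval", "approval_required_category")
    if tier == Tier.EXECUTE_WITH_APPROVAL:
        return ("needs_approval", "tier_requires_approval")
    return ("execute", "allowed")


budget = Budget(max_model_calls=5, max_tool_calls=3, max_cost_usd=0.50)
refund = Action(tool="issue_refund", category="money_moving", amount_usd=40.0)
print(decide(refund, Tier.AUTONOMOUS, budget, kill_switch_on=False))
# ('needs_approval', 'approval_required_category')
```

In this sketch the budget check comes first, so an exhausted request escalates regardless of tier, and a money-moving action still needs approval at the Autonomous tier. A classifier could supply action.category, but the enforcement itself stays in plain code.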