Agent Production Readiness Kit (2026) Use this as a go/no-go template before shipping an agent that can take actions (tickets, refunds, CRM updates, code changes). 1) Scope and permissions - List every tool/action the agent can execute (e.g., create_ticket, update_customer, issue_refund_request). - For each action, define: allowed parameters, deny-by-default rules, and required auth checks on the server. - Identify “irreversible” actions (delete data, close account, refunds, production deploys). Mark them CONFIRMATION REQUIRED. 2) Evaluation gates (CI) - Build an eval dataset of at least 100 scenarios from real workflows. - Categorize failures: (A) cosmetic, (B) incorrect outcome, (C) unsafe/policy violation. - Release gate suggestion: - Scenario pass rate: >= 95% - Category C failures: 0 across 1,000 adversarial prompts - Regression rule: every incident adds >= 1 new eval case - Calibrate model-graded evals monthly with a human-labeled sample (200–500 cases) and track agreement. 3) Observability and audit - Enable tracing across: model calls, retrieval docs (with provenance), tool inputs/outputs, policy decisions. - Store an immutable audit log for any state-changing action (who/what/when + trace_id). - Add dashboards for p95 latency, tokens per task, tool error rate, and policy block rate. 4) Budgets and fallbacks - Define per-task budgets: max_steps, max_tool_calls, max_tokens, p95_latency_slo. - Implement a deterministic fallback when budgets are exceeded (ask clarifying question, escalate to human queue, or switch to cheaper model). - Add alerts when average cost per task exceeds threshold (e.g., +20% week-over-week). 5) Safety and data handling - Define data classes (PII, secrets, regulated data). Enforce redaction and access control before retrieval and before tool export. - Block external tool calls for sensitive data unless explicitly approved. - Run periodic red-team tests against prompt injection and data exfiltration. 6) Incident runbook (minimum viable) - Kill switch: ability to disable the agent or specific tools in minutes. - Rollback: pinned model + prompt + tool versions; redeploy previous known-good config. - Postmortem within 48 hours for any high-severity incident: - Root cause (retrieval, tool misuse, planning loop, policy gap) - Fix (code/policy/tool schema) - New regression eval added Decision rule If you cannot (1) reproduce failures with traces and (2) prevent repeats with eval gates, you are not shipping a product—you are shipping a demo.