Agent Production Readiness Kit (2026)

Use this as a go/no-go template before shipping an agent that can take actions (tickets, refunds, CRM updates, code changes).

1) Scope and permissions
- List every tool/action the agent can execute (e.g., create_ticket, update_customer, issue_refund_request).
- For each action, define: allowed parameters, deny-by-default rules, and required auth checks on the server.
- Identify “irreversible” actions (delete data, close account, refunds, production deploys). Mark them CONFIRMATION REQUIRED.

2) Evaluation gates (CI)
- Build an eval dataset of at least 100 scenarios from real workflows.
- Categorize failures: (A) cosmetic, (B) incorrect outcome, (C) unsafe/policy violation.
- Release gate suggestion:
  - Scenario pass rate: >= 95%
  - Category C failures: 0 across 1,000 adversarial prompts
  - Regression rule: every incident adds >= 1 new eval case
- Calibrate model-graded evals monthly with a human-labeled sample (200–500 cases) and track agreement.

3) Observability and audit
- Enable tracing across: model calls, retrieval docs (with provenance), tool inputs/outputs, policy decisions.
- Store an immutable audit log for any state-changing action (who/what/when + trace_id).
- Add dashboards for p95 latency, tokens per task, tool error rate, and policy block rate.

4) Budgets and fallbacks
- Define per-task budgets: max_steps, max_tool_calls, max_tokens, p95_latency_slo.
- Implement a deterministic fallback when budgets are exceeded (ask a clarifying question, escalate to a human queue, or switch to a cheaper model).
- Add alerts when average cost per task exceeds a threshold (e.g., +20% week-over-week).

5) Safety and data handling
- Define data classes (PII, secrets, regulated data). Enforce redaction and access control before retrieval and before tool export.
- Block external tool calls for sensitive data unless explicitly approved.
- Run periodic red-team tests against prompt injection and data exfiltration.
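Sections 1, 3 and 4 describe a server-side tool-call gate in prose. The following is an added example, not part of the kit, showing one way to implement it in Python: a deny-by-default allow-list with parameter validation and the CONFIRMATION REQUIRED marking for irreversible actions, per-task max_steps/max_tool_calls budgets with a deterministic fallback, and an audit record carrying who/what/when + trace_id. The tool names are the kit's own examples; the parameter sets, budget defaults and the Budget/TaskState/gate_tool_call names are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Deny-by-default registry (section 1): only listed tools may run, only listed parameters are
# accepted, and irreversible actions need explicit confirmation. Parameter sets are constructed.
TOOL_POLICY = {
    "create_ticket": {"params": {"customer_id", "summary"}, "irreversible": False},
    "update_customer": {"params": {"customer_id", "email"}, "irreversible": False},
    "issue_refund_request": {"params": {"customer_id", "amount"}, "irreversible": True},
}

@dataclass
class Budget:  # per-task budgets (section 4); defaults are illustrative
    max_steps: int = 8
    max_tool_calls: int = 5

@dataclass
class TaskState:
    task_id: str
    trace_id: str
    actor: str  # who is acting: agent identity plus the end user it serves
    budget: Budget = field(default_factory=Budget)
    steps: int = 0
    tool_calls: int = 0
    audit_log: list = field(default_factory=list)  # append-only here; immutable storage in production

def gate_tool_call(state: TaskState, tool: str, params: dict, confirmed: bool = False) -> tuple:
    """Server-side check run before any tool executes. Returns (decision, reason), where
    decision is one of allow | confirm | deny | fallback, and records it in the audit log."""
    state.steps += 1
    policy = TOOL_POLICY.get(tool)
    if state.steps > state.budget.max_steps or state.tool_calls >= state.budget.max_tool_calls:
        # Deterministic fallback when a budget is exceeded: hand the task to the human queue.
        decision = ("fallback", "budget_exceeded:escalate_to_human")
    elif policy is None:
        decision = ("deny", "tool_not_allowlisted")
    elif set(params) - policy["params"]:
        decision = ("deny", "unexpected_params:" + ",".join(sorted(set(params) - policy["params"])))
    elif policy["irreversible"] and not confirmed:
        decision = ("confirm", "CONFIRMATION_REQUIRED")
    else:
        state.tool_calls += 1
        decision = ("allow", "ok")
    # Audit record for every decision, allowed or not (section 3): who/what/when + trace_id.
    state.audit_log.append({"ts": datetime.now(timezone.utc).isoformat(), "actor": state.actor,
                            "task_id": state.task_id, "trace_id": state.trace_id, "tool": tool,
                            "params": sorted(params), "decision": decision[0], "reason": decision[1]})
    return decision
```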
6) Incident runbook (minimum viable)
- Kill switch: ability to disable the agent or specific tools in minutes.
- Rollback: pinned model + prompt + tool versions; redeploy previous known-good config.
- Postmortem within 48 hours for any high-severity incident:
  - Root cause (retrieval, tool misuse, planning loop, policy gap)
  - Fix (code/policy/tool schema)
  - New regression eval added

Decision rule
If you cannot (1) reproduce failures with traces and (2) prevent repeats with eval gates, you are not shipping a product—you are shipping a demo.