Production Agent Readiness Kit (2026)

Use this checklist to take an agent from demo to dependable production.

1) Define the job (scope + success metrics)
- Choose ONE workflow with clear inputs/outputs (e.g., “close tier-1 tickets,” “provision sandbox,” “draft QBR”).
- Pick 2–3 metrics and record a baseline for each, e.g., task success rate (%), escalation rate (%), median time-to-complete, or cost per task ($).
- Define the failure mode: what the agent does when it is unsure (escalate, ask a clarifying question, or stop).
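
One way to pin down the three items above is to write them as a small, versioned spec. The sketch below is illustrative only: the class names, the 0.7 confidence floor, and every baseline/target number are placeholder assumptions, not measurements or recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class UnsureAction(Enum):
    ESCALATE = "escalate"
    ASK = "ask_clarifying_question"
    STOP = "stop"


@dataclass(frozen=True)
class Metric:
    name: str
    baseline: float  # measured before the agent is introduced
    target: float
    higher_is_better: bool = True

    def met(self, observed: float) -> bool:
        """True when the observed value reaches the target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


@dataclass(frozen=True)
class JobSpec:
    workflow: str
    metrics: tuple[Metric, ...]
    on_unsure: UnsureAction
    confidence_floor: float = 0.7  # hypothetical threshold

    def next_step(self, confidence: float) -> str:
        """Proceed when confident enough, otherwise use the declared failure mode."""
        if confidence >= self.confidence_floor:
            return "proceed"
        return self.on_unsure.value


# Placeholder values for illustration only.
spec = JobSpec(
    workflow="close tier-1 tickets",
    metrics=(
        Metric("task_success_rate", baseline=0.70, target=0.85),
        Metric("escalation_rate", baseline=0.30, target=0.15, higher_is_better=False),
    ),
    on_unsure=UnsureAction.ESCALATE,
)
```

Keeping the spec in code means the failure mode is a reviewed decision rather than an emergent behavior of the prompt.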

2) Authority model (permissions and approvals)
- List allowed tools/APIs and map each to a least-privilege credential.
- Create authority tiers: Read-only → Suggest-only → Execute-with-approval → Autonomous.
- Require approvals for irreversible, money-moving, external-communication, deletion, or permission-grant actions.
- Add a “kill switch” (feature flag) that disables tool execution but keeps read-only mode.
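
The tiers, approval categories, and kill switch above can be combined into a single authorization check. This is a minimal sketch under assumed names (`Tier`, `APPROVAL_REQUIRED`, `authorize`, and the tag strings are all hypothetical); a real system would read the kill switch from its feature-flag service.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0
    SUGGEST_ONLY = 1
    EXECUTE_WITH_APPROVAL = 2
    AUTONOMOUS = 3


# Action categories that need human approval regardless of tier.
APPROVAL_REQUIRED = frozenset({
    "irreversible",
    "money_moving",
    "external_communication",
    "deletion",
    "permission_grant",
})


def authorize(tier, action_tags, is_write, kill_switch_on, approved=False):
    """Return 'allow', 'deny', 'suggest', or 'needs_approval' for one action."""
    if not is_write:
        return "allow"  # reads stay available even with the kill switch on
    if kill_switch_on or tier == Tier.READ_ONLY:
        return "deny"
    if tier == Tier.SUGGEST_ONLY:
        return "suggest"  # surface the proposed action, do not execute it
    needs_human = tier == Tier.EXECUTE_WITH_APPROVAL or bool(
        set(action_tags) & APPROVAL_REQUIRED
    )
    if needs_human:
        return "allow" if approved else "needs_approval"
    return "allow"
```

For example, `authorize(Tier.AUTONOMOUS, {"deletion"}, True, False)` returns `"needs_approval"`: even the highest tier cannot delete without a human sign-off.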

3) Policy layer (enforceable constraints)
- Encode deterministic rules in code (amount caps, domain restrictions, required fields).
- Add a small classifier for ambiguous cases (e.g., “is this request a refund?”) but keep the final enforcement deterministic.
- Log every policy decision with: timestamp, rule_id, input summary, and outcome.
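
A deterministic policy layer with decision logging might look like the sketch below. The rule ids, the 100 cap, the `example.com` domain, and the `is_refund` field (assumed to be filled in by the classifier) are all hypothetical. Note that a missing `is_refund` label is treated as a refund, so an unsure classifier cannot bypass the cap.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    check: Callable[[dict], bool]  # returns True when the request passes


# Rules run in order; the first failure denies the request.
RULES = [
    Rule("required_fields", lambda r: all(k in r for k in ("amount", "email"))),
    # Fail closed: an unlabeled request is treated as a refund.
    Rule("refund_cap", lambda r: not r.get("is_refund", True) or r["amount"] <= 100),
    Rule("allowed_domain", lambda r: r["email"].endswith("@example.com")),
]


def evaluate(request: dict, log: list) -> bool:
    """Apply every rule in order, logging each decision; stop at the first denial."""
    for rule in RULES:
        passed = rule.check(request)
        log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "rule_id": rule.rule_id,
            # Summarize rather than copy the input, to keep raw PII out of logs.
            "input_summary": {"fields": sorted(request), "amount": request.get("amount")},
            "outcome": "allow" if passed else "deny",
        })
        if not passed:
            return False
    return True
```

The classifier only labels the request; whether it proceeds is decided by plain comparisons that can be unit-tested and audited.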

4) Budget-first orchestration (cost control)
- Set per-request caps: max model calls, max tool calls, and max total cost ($).
- Implement routing: small model for intent/extraction; larger model only for hard cases.
- Add caching for retrieval and repeated tool results.
- Define a hard stop: if the budget is exhausted, escalate or return partial results.
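
The per-request caps and the hard stop can be sketched as a small budget object that every model or tool call must charge before running. The names (`Budget`, `BudgetExhausted`, `run_steps`) and the step format are assumptions for illustration; routing and caching are left out to keep the example short.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised when a call would exceed one of the per-request caps."""


@dataclass
class Budget:
    max_model_calls: int
    max_tool_calls: int
    max_cost_usd: float
    model_calls: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, kind: str, cost_usd: float = 0.0) -> None:
        """Reserve one call of the given kind, or raise before any cap is exceeded."""
        if kind not in ("model", "tool"):
            raise ValueError(f"unknown call kind: {kind}")
        model_calls = self.model_calls + (1 if kind == "model" else 0)
        tool_calls = self.tool_calls + (1 if kind == "tool" else 0)
        cost = self.cost_usd + cost_usd
        if (
            model_calls > self.max_model_calls
            or tool_calls > self.max_tool_calls
            or cost > self.max_cost_usd
        ):
            raise BudgetExhausted(kind)
        self.model_calls, self.tool_calls, self.cost_usd = model_calls, tool_calls, cost


def run_steps(steps, budget):
    """steps: iterable of (kind, cost_usd, result). Returns (results, status)."""
    results = []
    for kind, cost, result in steps:
        try:
            budget.charge(kind, cost)
        except BudgetExhausted:
            return results, "partial"  # hard stop: hand back what was completed
        results.append(result)
    return results, "complete"
```

Charging before the call, rather than after, is the design choice that makes the cap a hard limit instead of a report of overspend.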

5) Evals as CI (quality and safety)
- Build a labeled eval set (start with 200 cases; aim for 1,000+).
- Measure workflow outcomes, not just answer quality: success/failure, correct tool usage, and policy compliance.
- Gate releases: block deployment if key metrics regress beyond thresholds (e.g., a drop of more than 1–2 percentage points in success rate).
- Include adversarial tests: prompt injection, missing context, malformed inputs, rate limits, and tool failures.
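
A release gate of this kind reduces to scoring the eval run and comparing it against the last accepted baseline. The sketch below assumes each eval result is a dict with boolean `success`, `correct_tools`, and `policy_ok` fields; those field names and the thresholds in the usage note are hypothetical.

```python
def score(results):
    """Aggregate per-case workflow outcomes into rates between 0 and 1."""
    n = len(results)
    return {
        "success_rate": sum(r["success"] for r in results) / n,
        "tool_accuracy": sum(r["correct_tools"] for r in results) / n,
        "policy_compliance": sum(r["policy_ok"] for r in results) / n,
    }


def gate_release(baseline, candidate, max_drop_pp):
    """Return a list of regressions; an empty list means the release may proceed.

    max_drop_pp maps each gated metric to its allowed drop in percentage points.
    """
    failures = []
    for metric, allowed in max_drop_pp.items():
        drop_pp = (baseline[metric] - candidate[metric]) * 100
        if drop_pp > allowed:
            failures.append(f"{metric}: dropped {drop_pp:.1f} pp (limit {allowed} pp)")
    return failures
```

In CI, a non-empty return value would fail the build. With a baseline success rate of 0.90, a candidate at 0.87, and a 2 pp limit, the gate reports one regression for `success_rate`.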

6) Observability + incident response
- Require trace_id per request; log tool calls, latencies, retries, and cost.
- Redact PII in logs by default; restrict trace access via RBAC.
- Set alerts: spike in retries, tool errors, policy denials, or cost per task.
- Create runbooks: how to replay traces, how to disable tool execution, and how to roll back.
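
As a rough illustration of the logging items above, the wrapper below records one structured line per tool call, keyed by trace_id, with retries, latency, and cost. It is a sketch under stated assumptions: `traced_tool_call` and `redact` are hypothetical helpers, the email-only regex stands in for a real PII redaction step, and retrying on any exception is a simplification.

```python
import json
import re
import time
import uuid

# Deliberately minimal: only email addresses are redacted in this sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def traced_tool_call(trace_id, tool_name, fn, arg, sink, cost_usd=0.0, max_retries=2):
    """Call fn(arg) with retries and append one JSON log line to sink."""
    retries = 0
    start = time.perf_counter()
    while True:
        try:
            result = fn(arg)
            status = "ok"
            break
        except Exception as exc:  # simplification: every error is retried
            if retries >= max_retries:
                result, status = None, f"error:{type(exc).__name__}"
                break
            retries += 1
    sink.append(json.dumps({
        "trace_id": trace_id,
        "tool": tool_name,
        "status": status,
        "retries": retries,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "cost_usd": cost_usd,
        "arg": redact(str(arg)),  # redact before the value reaches storage
    }))
    return result


# One trace_id is generated per request and passed to every call it makes.
trace_id = str(uuid.uuid4())
```

Because every line carries the same trace_id, replaying a request during an incident is a filter on one field, and the alert conditions (retries, errors, cost) are already present as numeric columns.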

7) Rollout plan (safe deployment)
- Start with shadow mode (no writes), then 1–5% traffic with review, then ramp.
- Use risk-based sampling for human review (higher-risk requests get higher review rate).
- Review weekly: top failure reasons, policy blocks, and cost hotspots; prioritize fixes.

Outcome: If you can’t explain your agent’s authority, costs, and failure handling in 60 seconds, you’re not ready for production.