LLM Ops Production Readiness Pack (2026)

Use this as a pre-launch and post-launch checklist for any LLM feature, especially tool-using agents.

1) Define the job and the blast radius
- Primary task: Write a single-sentence definition (e.g., "Resolve tier-1 billing tickets").
- Allowed actions: List the exact actions the agent may take (3–7 to start).
- Disallowed actions: Explicitly forbid data exports, credential handling, and irreversible actions.
- Autonomy tier: Suggest → Draft → Execute with approval → Execute under threshold.
- Risk thresholds: Set dollar/time thresholds (e.g., refunds ≤ $50 auto-approve; > $50 requires approval).

2) Observability (must-have)
- Trace completeness: Log the prompt, system prompt, retrieved doc IDs, tool calls, model name/version, and final output.
- Correlation ID: Every step in a request shares a single request ID.
- Metrics dashboard:
  - Task success rate (% of tasks completed without escalation)
  - Policy violation rate (% of outputs violating rules)
  - Tool failure rate (% of tool calls that error)
  - p50/p95 latency (seconds)
  - Tokens per successful task (or $ per successful task)

3) Evaluation and release gating
- Build an eval set of at least 200 real cases; tag each by segment (customer tier, region, language).
- Add deterministic checks: JSON schema validation, required fields, forbidden content, and business-rule validators.
- Add qualitative checks: tone and helpfulness judged against a calibrated rubric; sample 20–50 cases weekly for human review.
- Define release gates (example):
  - No more than a 1% regression in task success vs. baseline
  - Policy violations < 0.5% on the eval set
  - Tool-call error rate < 0.2% in canary

4) Security controls
- Least privilege: Scope tools and data access to the minimum required.
- Data boundaries: Redact or tokenize PII where possible; keep "instructions" separate from "retrieved content."
- Prompt injection defense: Treat emails, web pages, and PDFs as untrusted; never allow them to modify system instructions.
- Audit logs: Keep immutable records of tool execution, approvals, and policy decisions.

5) Cost controls
- Context budget: Set a maximum prompt size and enforce truncation/summarization rules.
- Routing: Use cheaper models for classification and routine tasks; reserve frontier models for hard cases.
- Caching: Implement semantic caching for common requests; cache tool schemas and static context.
- Retries: Cap retries; measure the retry rate; prefer deterministic validators over LLM self-check loops.

6) Rollout plan
- Canary: Start at 1–5% of traffic; compare metrics to baseline.
- Rollback: Define automatic rollback triggers (e.g., +2% tool failures, +3% escalations).
- Incident playbook: Define who gets paged, how to disable tool execution, and how to communicate with users.

If you can't prove traceability, pass eval gates, and enforce authorization, you don't have a production agent; you have a demo.
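The deterministic checks from section 3 and the refund threshold from section 1 can be combined into a single pre-execution gate. A minimal Python sketch follows; the field names, forbidden-phrase list, and $50 limit are all illustrative assumptions, not a fixed schema:

```python
# Deterministic pre-execution gate: validate an agent's proposed action
# before it runs. Field names, rules, and the $50 threshold are
# illustrative assumptions; swap in your own schema and policies.

FORBIDDEN_PHRASES = {"ssn", "password", "api key"}  # crude forbidden-content list
AUTO_APPROVE_REFUND_LIMIT = 50.00                   # dollars; above this, require approval

def validate_action(action: dict) -> tuple[bool, list[str]]:
    """Return (needs_approval, violations). Any violation blocks execution."""
    violations = []

    # Schema check: required fields and their types.
    for field, ftype in {"type": str, "amount": float, "reply_text": str}.items():
        if not isinstance(action.get(field), ftype):
            violations.append(f"missing or mistyped field: {field}")

    # Forbidden-content check on the customer-facing text.
    text = str(action.get("reply_text", "")).lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in text:
            violations.append(f"forbidden content: {phrase}")

    # Business rule: refunds over the threshold need human approval.
    needs_approval = (
        action.get("type") == "refund"
        and isinstance(action.get("amount"), float)
        and action["amount"] > AUTO_APPROVE_REFUND_LIMIT
    )
    return needs_approval, violations

# Usage: a $30 refund auto-approves; a $120 refund is routed to a human.
ok_action  = {"type": "refund", "amount": 30.0,  "reply_text": "Refund issued."}
big_action = {"type": "refund", "amount": 120.0, "reply_text": "Refund issued."}
print(validate_action(ok_action))   # → (False, [])
print(validate_action(big_action))  # → (True, [])
```

Because the gate is deterministic, it can run both offline against the eval set and online before every tool execution, with identical behavior in both places.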
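The rollback triggers in section 6 amount to comparing canary metrics against baseline and firing when any delta exceeds its limit. A minimal sketch, assuming the metric names and the +2%/+3% example thresholds above (adapt both to your own dashboard):

```python
# Automatic rollback check: compare canary metrics to baseline and report
# any breached trigger. Metric names and thresholds follow the examples
# in section 6 and are assumptions, not a standard.

# Trigger = maximum allowed increase, in percentage points, over baseline.
ROLLBACK_TRIGGERS = {
    "tool_failure_rate": 2.0,   # +2% tool failures
    "escalation_rate": 3.0,     # +3% escalations
}

def should_rollback(baseline: dict, canary: dict) -> list[str]:
    """Return the list of breached metrics; any non-empty result means roll back."""
    breached = []
    for metric, max_delta in ROLLBACK_TRIGGERS.items():
        delta = canary[metric] - baseline[metric]
        if delta > max_delta:
            breached.append(f"{metric}: +{delta:.1f}% (limit +{max_delta:.1f}%)")
    return breached

# Usage: tool failures rose +2.5% (over the +2.0% limit), escalations only +1.0%.
baseline = {"tool_failure_rate": 1.0, "escalation_rate": 5.0}
canary   = {"tool_failure_rate": 3.5, "escalation_rate": 6.0}
print(should_rollback(baseline, canary))  # → ['tool_failure_rate: +2.5% (limit +2.0%)']
```

Running this check on a schedule (or per metrics-window) makes the rollback decision auditable: the breached-trigger strings can go straight into the incident log.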