LLM Ops Production Readiness Pack (2026)

Use this as a pre-launch and post-launch checklist for any LLM feature, especially tool-using agents.

1) Define the job and the blast radius
- Primary task: Write a single-sentence definition (e.g., "Resolve tier-1 billing tickets").
- Allowed actions: List the exact actions the agent may take (3–7 to start).
- Disallowed actions: Explicitly forbid data exports, credential handling, and irreversible actions.
- Autonomy tier: Suggest → Draft → Execute with approval → Execute under threshold.
- Risk thresholds: Set dollar/time thresholds (e.g., refunds ≤ $50 auto-approve; > $50 requires approval).

2) Observability (must-have)
- Trace completeness: Log the prompt, system prompt, retrieved doc IDs, tool calls, model name/version, and final output.
- Correlation ID: Every step in a request shares a single request ID.
- Metrics dashboard:
  - Task success rate (% of tasks completed without escalation)
  - Policy violation rate (% of outputs violating rules)
  - Tool failure rate (% of tool calls that error)
  - p50/p95 latency (seconds)
  - Tokens per successful task (or $ per successful task)

3) Evaluation and release gating
- Build an eval set of at least 200 real cases; tag each by segment (customer tier, region, language).
- Add deterministic checks: JSON schema validation, required fields, forbidden content, and business-rule validators.
- Add qualitative checks: tone and helpfulness judged against a calibrated rubric; sample 20–50 cases weekly for human review.
- Define release gates (example):
  - No more than a 1% regression in task success vs. baseline
  - Policy violations < 0.5% on the eval set
  - Tool-call error rate < 0.2% in canary

4) Security controls
- Least privilege: Scope tools and data access to the minimum required.
- Data boundaries: Redact or tokenize PII where possible; keep "instructions" separate from "retrieved content."
- Prompt injection defense: Treat emails, web pages, and PDFs as untrusted; never allow them to modify system instructions.
- Audit logs: Keep immutable records of tool execution, approvals, and policy decisions.

5) Cost controls
- Context budget: Set a maximum prompt size and enforce truncation/summarization rules.
- Routing: Use cheaper models for classification and routine tasks; reserve frontier models for hard cases.
- Caching: Implement semantic caching for common requests; cache tool schemas and static context.
- Retries: Cap retries; measure the retry rate; prefer deterministic validators over LLM self-check loops.

6) Rollout plan
- Canary: Start at 1–5% of traffic; compare metrics to baseline.
- Rollback: Define automatic rollback triggers (e.g., +2% tool failures, +3% escalations).
- Incident playbook: Define who gets paged, how to disable tool execution, and how to communicate with users.

If you can't prove traceability, pass eval gates, and enforce authorization, you don't have a production agent; you have a demo.
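The deterministic checks from section 3 and the refund threshold from section 1 can be combined into a single pre-execution gate. A minimal Python sketch follows; the field names, forbidden-phrase list, and $50 limit are all illustrative assumptions, not a fixed schema:

```python
# Deterministic pre-execution gate: validate an agent's proposed action
# before it runs. Field names, rules, and the $50 threshold are
# illustrative assumptions; swap in your own schema and policies.

FORBIDDEN_PHRASES = {"ssn", "password", "api key"}  # crude forbidden-content list
AUTO_APPROVE_REFUND_LIMIT = 50.00                   # dollars; above this, require approval

def validate_action(action: dict) -> tuple[bool, list[str]]:
    """Return (needs_approval, violations). Any violation blocks execution."""
    violations = []

    # Schema check: required fields and their types.
    for field, ftype in {"type": str, "amount": float, "reply_text": str}.items():
        if not isinstance(action.get(field), ftype):
            violations.append(f"missing or mistyped field: {field}")

    # Forbidden-content check on the customer-facing text.
    text = str(action.get("reply_text", "")).lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in text:
            violations.append(f"forbidden content: {phrase}")

    # Business rule: refunds over the threshold need human approval.
    needs_approval = (
        action.get("type") == "refund"
        and isinstance(action.get("amount"), float)
        and action["amount"] > AUTO_APPROVE_REFUND_LIMIT
    )
    return needs_approval, violations

# Usage: a $30 refund auto-approves; a $120 refund is routed to a human.
ok_action  = {"type": "refund", "amount": 30.0,  "reply_text": "Refund issued."}
big_action = {"type": "refund", "amount": 120.0, "reply_text": "Refund issued."}
print(validate_action(ok_action))   # → (False, [])
print(validate_action(big_action))  # → (True, [])
```

Because the gate is deterministic, it can run both offline against the eval set and online before every tool execution, with identical behavior in both places.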
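The rollback triggers in section 6 amount to comparing canary metrics against baseline and firing when any delta exceeds its limit. A minimal sketch, assuming the metric names and the +2%/+3% example thresholds above (adapt both to your own dashboard):

```python
# Automatic rollback check: compare canary metrics to baseline and report
# any breached trigger. Metric names and thresholds follow the examples
# in section 6 and are assumptions, not a standard.

# Trigger = maximum allowed increase, in percentage points, over baseline.
ROLLBACK_TRIGGERS = {
    "tool_failure_rate": 2.0,   # +2% tool failures
    "escalation_rate": 3.0,     # +3% escalations
}

def should_rollback(baseline: dict, canary: dict) -> list[str]:
    """Return the list of breached metrics; any non-empty result means roll back."""
    breached = []
    for metric, max_delta in ROLLBACK_TRIGGERS.items():
        delta = canary[metric] - baseline[metric]
        if delta > max_delta:
            breached.append(f"{metric}: +{delta:.1f}% (limit +{max_delta:.1f}%)")
    return breached

# Usage: tool failures rose +2.5% (over the +2.0% limit), escalations only +1.0%.
baseline = {"tool_failure_rate": 1.0, "escalation_rate": 5.0}
canary   = {"tool_failure_rate": 3.5, "escalation_rate": 6.0}
print(should_rollback(baseline, canary))  # → ['tool_failure_rate: +2.5% (limit +2.0%)']
```

Running this check on a schedule (or per metrics-window) makes the rollback decision auditable: the breached-trigger strings can go straight into the incident log.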