AGENTIC ML OPS READINESS CHECKLIST (2026)

Use this checklist before you expand an AI agent from "draft" to "execute." It's written for founders, platform engineers, and ML operators running tool-using LLM systems in production.

1) TRACE & OBSERVABILITY (Week 1–2)
- Assign a trace_id to every user session and propagate it across the LLM call, retrieval calls, and every tool/API call.
- Log: model name/version, prompt template version, retrieved document IDs + timestamps, tool call args/results, latency per step, and final outcome.
- PII handling: define redaction rules (names, emails, addresses, account IDs) and verify redaction is applied before logs are stored.
- Build 3 dashboards: (a) cost per task (P50/P95), (b) tool success rate per tool, (c) top failure reasons with example traces.

2) EVALUATION FOUNDATION (Week 2–4)
- Create an initial eval set from production: sample 200–500 traces across the top workflows.
- Define 5–10 rubric items that map to business outcomes (e.g., "correct tool selection," "policy compliance," "escalates appropriately," "no invented facts").
- Set a regression gate: no release if key metrics drop more than 1% on the eval suite.
- Calibrate graders: measure judge disagreement on 50 cases; if disagreement exceeds ~10–15%, tighten the rubric or add human review.

3) POLICY & TOOL GOVERNANCE (Week 3–6)
- Put a centralized tool gateway in front of every action tool (refunds, emails, ticketing, provisioning).
- Encode constraints as policy-as-code: max spend per action, allowed tools per agent, required approvals over thresholds.
- Implement capability tiers: Draft-only → Execute-with-limits → Execute-with-approvals → Full execute.
- Audit logs: store allow/deny decisions with reasons, plus who/what approved high-risk actions.

4) COST & LATENCY BUDGETS (Week 4–8)
- Establish budgets: max tokens per task, max tool calls per task, max total latency per session.
- Add model routing: a fast model for extraction/classification; a stronger model only for complex planning.
- Track token growth: alert if average tokens per task increases by >20% week-over-week.
- Add backpressure: if tool latency spikes, downgrade to read-only mode or force escalation.

5) INCIDENT RESPONSE (Week 6–10)
- Define "kill switches": disable tool writes globally; force human approval; roll back the prompt/model version.
- Write 3 incident playbooks: hallucinated policy statement, incorrect tool execution, and data leakage.
- Run a fire drill: simulate a bad release and confirm you can detect, roll back, and explain root cause within 60 minutes.

6) GO/NO-GO CRITERIA FOR EXPANDING PERMISSIONS

You're ready to expand agent permissions when:
- Trace coverage is above 95% of sessions.
- Tool success rate is above 99.5% per tool (including retries).
- Policy violations are below 0.1% of attempted actions.
- Eval suite is stable and runs automatically on every release.
- Cost budgets are enforced and P95 cost per task is predictable.

If any item is missing, expand autonomy slowly: keep the agent in draft mode or require approvals. Reliability and governance are what let you scale agent capability without scaling risk.
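The trace propagation and redaction-before-logging steps in section 1 can be sketched in a few lines. This is a minimal illustration, not a production logger: the `log_step` helper, the email-only redaction rule, and the model name are all assumptions for the example.

```python
import contextvars
import json
import re
import time
import uuid

# Trace context propagated across the LLM call, retrieval, and tool calls.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

# Hypothetical redaction rule: mask email addresses before logs are stored.
# A real deployment would also cover names, addresses, and account IDs.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def start_session() -> str:
    """Assign a fresh trace_id at the start of each user session."""
    tid = str(uuid.uuid4())
    trace_id_var.set(tid)
    return tid

def log_step(step: str, **fields) -> dict:
    """Emit one structured log line carrying the session's trace_id."""
    entry = {
        "trace_id": trace_id_var.get(),
        "step": step,
        "ts": time.time(),
        **{k: redact(v) if isinstance(v, str) else v for k, v in fields.items()},
    }
    print(json.dumps(entry))
    return entry

start_session()
entry = log_step(
    "tool_call",
    tool="send_email",
    args="to=jane@example.com",
    latency_ms=142,
    model="hypothetical-model-v3",
)
```

Because every log line shares the session's trace_id, the dashboards in section 1 (cost per task, tool success rate, failure reasons with example traces) can be built by grouping log entries on that one field.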
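The regression gate in section 2 ("no release if key metrics drop more than 1%") reduces to a small comparison function. A sketch, assuming relative drops and illustrative metric names taken from the rubric examples:

```python
def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.01):
    """Return (ok, failures): ok is False if any metric drops more than
    max_drop (relative) versus the baseline eval run."""
    failures = []
    for metric, base in baseline.items():
        cand = candidate.get(metric, 0.0)
        if base > 0 and (base - cand) / base > max_drop:
            failures.append(f"{metric}: {base:.3f} -> {cand:.3f}")
    return len(failures) == 0, failures

# Illustrative eval-suite scores (fractions of cases passing each rubric item).
baseline = {"correct_tool_selection": 0.97, "policy_compliance": 0.995}
candidate = {"correct_tool_selection": 0.93, "policy_compliance": 0.996}

ok, failures = regression_gate(baseline, candidate)
# correct_tool_selection dropped ~4%, beyond the 1% gate, so ok is False
```

Wiring this into CI as a hard release blocker is what makes the eval suite "run automatically on every release," as the go/no-go criteria in section 6 require.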
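The policy-as-code gateway and capability tiers in section 3 can be sketched as a single choke point that every action tool passes through. The field names and tier labels below are assumptions mirroring the checklist, not a real schema; the point is that every allow/deny decision is recorded with a reason, as the audit-log bullet requires.

```python
from dataclasses import dataclass, field

# Capability tiers from section 3, least to most autonomous.
TIERS = ("draft_only", "execute_with_limits",
         "execute_with_approvals", "full_execute")

@dataclass
class Policy:
    allowed_tools: set          # allowed tools per agent
    max_spend_per_action: float # hard spend cap
    approval_threshold: float   # spend above this needs human approval
    tier: str                   # one of TIERS

@dataclass
class Gateway:
    policy: Policy
    audit_log: list = field(default_factory=list)

    def check(self, tool: str, spend: float, approved: bool = False) -> bool:
        """Allow or deny one action, recording the decision and reason."""
        if tool not in self.policy.allowed_tools:
            return self._deny(tool, "tool not allowed for this agent")
        if self.policy.tier == "draft_only":
            return self._deny(tool, "agent is draft-only")
        if spend > self.policy.max_spend_per_action:
            return self._deny(tool, "exceeds max spend per action")
        if spend > self.policy.approval_threshold and not approved:
            return self._deny(tool, "requires human approval above threshold")
        self.audit_log.append({"tool": tool, "decision": "allow", "reason": None})
        return True

    def _deny(self, tool: str, reason: str) -> bool:
        self.audit_log.append({"tool": tool, "decision": "deny", "reason": reason})
        return False

policy = Policy(allowed_tools={"refund"}, max_spend_per_action=500.0,
                approval_threshold=100.0, tier="execute_with_limits")
gw = Gateway(policy)
r1 = gw.check("refund", 50.0)            # allowed: within limits
r2 = gw.check("refund", 250.0)           # denied: needs approval above threshold
r3 = gw.check("provision_server", 0.0)   # denied: tool not in allow-list
```

Promoting an agent up the tiers is then a one-field policy change rather than a code change, which is what makes the graduated rollout in section 6 enforceable.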