AGENTIC OPS STACK STARTER CHECKLIST (2026)

Purpose: Launch one production-grade agent in 30 days with measurable ROI and a controlled blast radius.

1) Pick the workflow (Day 1)
- Choose ONE high-volume, low-to-medium-risk workflow (e.g., ticket triage, CRM updates, invoice matching, PR test generation).
- Define success metrics with numbers:
  • Task success rate target (e.g., 85% correct)
  • Escalation/handoff ceiling (e.g., <30%)
  • Cost per successful outcome (e.g., <$0.05)
  • Latency (e.g., P95 <8s)
- Define “unacceptable outcomes” (e.g., wrong refund, wrong customer contacted, policy violation).

2) Set autonomy level (Days 1–2)
- Start at L0 (draft only) or L1 (recommend + prefill).
- Write the rule that moves you to L2/L3 (example: 2 weeks with 0 policy violations + success rate >85% on the gold set).

3) Permissions + tool contracts (Days 3–7)
- Create a dedicated service account per agent.
- Enforce least privilege: read-only by default; allowlist tools; scope OAuth grants.
- Define tool input/output schemas (JSON Schema or typed interfaces).
- Add parameter validation and rate limits.
- Add a “kill switch” feature flag that removes tool write access within 30 seconds.

4) Data + retrieval (Days 4–10)
- Identify authoritative sources (policies, docs, playbooks) and label them “trusted.”
- Label all external/user text as “untrusted.”
- Require citations for any customer-facing claim.
- Enforce access control in retrieval (agents must not retrieve restricted docs).

5) Build the gold eval set (Days 8–14)
- Collect 200–500 real historical cases; redact PII.
- For each case, store:
  • Inputs (ticket/email/task)
  • Expected action/outcome
  • References to the policy/doc source
  • Risk label (low/med/high)
- Add at least 20 adversarial cases (prompt injection attempts, ambiguous requests).

6) Observability (Days 15–21)
- Log per run: agent run ID, model, prompt/context hash, tool calls, outputs, latency, token cost.
- Store the approval chain (who approved what, when).
- Dashboard minimums:
  • Success rate, escalation rate
  • Policy violation count
  • Cost per success, total daily cost
  • P95 latency

7) Evals + release gate (Days 15–25)
- Run nightly evals on the gold set.
- Gate releases on:
  • No increase in policy violations
  • A regression threshold (e.g., <2–3% drop in success rate)
  • A cost budget (e.g., <10% increase per success)

8) Launch sequence (Days 22–30)
- Shadow mode: the agent drafts; a human executes.
- Capture failure reasons in 5 buckets (bad retrieval, tool error, policy ambiguity, hallucination, edge case).
- Promote to limited execution only for low-risk cases with thresholds (e.g., refunds under $50; duplicates-only ticket closing).

9) Ongoing operations (weekly)
- Review the top 20 failures; update policies, schemas, and the eval set.
- Add 10–50 new labeled cases per week.
- Rotate credentials and review tool scopes monthly.
- Run an “agent incident drill” quarterly: simulate a bad tool call, verify logs, verify the kill switch.

Outcome: If you can demonstrate stable metrics + auditability after 30 days, you have a credible foundation to expand agent fleets across teams.
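The release gate in step 7 can be sketched as a small comparison between baseline and candidate eval metrics. This is a minimal illustration, not a prescribed implementation: the metric names, dict shape, and default thresholds below are assumptions for the example.

```python
def release_gate(baseline: dict, candidate: dict,
                 max_success_drop: float = 0.03,
                 max_cost_increase: float = 0.10) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for promoting a new agent version.

    Metric dicts are assumed to look like:
    {"success_rate": 0.87, "policy_violations": 0, "cost_per_success": 0.04}
    """
    reasons = []
    # Gate 1: no increase in policy violations.
    if candidate["policy_violations"] > baseline["policy_violations"]:
        reasons.append("policy violations increased")
    # Gate 2: success rate may not regress beyond the threshold (e.g., 3%).
    if candidate["success_rate"] < baseline["success_rate"] - max_success_drop:
        reasons.append("success rate regressed beyond threshold")
    # Gate 3: cost per success may not exceed the budget (e.g., +10%).
    if candidate["cost_per_success"] > baseline["cost_per_success"] * (1 + max_cost_increase):
        reasons.append("cost per success over budget")
    return (not reasons, reasons)
```

Wiring this into the nightly eval job gives a yes/no promotion decision plus an audit trail of which gate failed, which fits the approval-chain logging in step 6.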
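The tool-contract rules from step 3 (allowlist, parameter validation, kill switch) can also be sketched as a single dispatch wrapper. Everything here is hypothetical scaffolding: the tool names, the flag store, and the refund limit are invented for illustration, and a real system would validate parameters against a full JSON Schema rather than inline checks.

```python
ALLOWLIST = {"search_kb", "issue_refund", "close_ticket"}   # hypothetical tools
WRITE_TOOLS = {"issue_refund", "close_ticket"}              # tools with side effects

def call_tool(name: str, params: dict, flags: dict) -> dict:
    """Dispatch a tool call only if it passes the contract checks."""
    # Least privilege: reject anything not explicitly allowlisted.
    if name not in ALLOWLIST:
        raise PermissionError(f"tool not allowlisted: {name}")
    # Kill switch: the feature flag strips write access without a deploy.
    if name in WRITE_TOOLS and flags.get("agent_kill_switch", False):
        raise PermissionError("write access disabled by kill switch")
    # Parameter validation stub (real code: JSON Schema per tool contract).
    if name == "issue_refund" and not (0 < params.get("amount_usd", -1) <= 50):
        raise ValueError("refund outside approved limit")
    return {"tool": name, "status": "dispatched"}  # placeholder for real dispatch
```

Because the flag is checked on every call, flipping `agent_kill_switch` in the flag store disables writes on the next request, which is how the "within 30 seconds" guarantee would be met.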