AGENTIC OPS STACK STARTER CHECKLIST (2026)

Purpose: Launch one production-grade agent in 30 days with measurable ROI and a controlled blast radius.

1) Pick the workflow (Day 1)
- Choose ONE high-volume, low-to-medium-risk workflow (e.g., ticket triage, CRM updates, invoice matching, PR test generation).
- Define success metrics with numbers:
  • Task success rate target (e.g., 85% correct)
  • Escalation/handoff ceiling (e.g., <30%)
  • Cost per successful outcome (e.g., <$0.05)
  • Latency (e.g., P95 <8s)
- Define “unacceptable outcomes” (e.g., wrong refund, wrong customer contacted, policy violation).

2) Set autonomy level (Days 1–2)
- Start at L0 (draft only) or L1 (recommend + prefill).
- Write the rule that moves you to L2/L3 (example: 2 weeks with 0 policy violations + success rate >85% on the gold set).

3) Permissions + tool contracts (Days 3–7)
- Create a dedicated service account per agent.
- Enforce least privilege: read-only by default; allowlist tools; scope OAuth grants.
- Define tool input/output schemas (JSON Schema or typed interfaces).
- Add parameter validation and rate limits.
- Add a “kill switch” feature flag that removes tool write access within 30 seconds.

4) Data + retrieval (Days 4–10)
- Identify authoritative sources (policies, docs, playbooks) and label them “trusted.”
- Label all external/user text as “untrusted.”
- Require citations for any customer-facing claim.
- Enforce access control in retrieval (agents must not retrieve restricted docs).

5) Build the gold eval set (Days 8–14)
- Collect 200–500 real historical cases; redact PII.
- For each case, store:
  • Inputs (ticket/email/task)
  • Expected action/outcome
  • References to the policy/doc source
  • Risk label (low/med/high)
- Add at least 20 adversarial cases (prompt injection attempts, ambiguous requests).

6) Observability (Days 15–21)
- Log per run: agent run ID, model, prompt/context hash, tool calls, outputs, latency, token cost.
- Store the approval chain (who approved what, when).
- Dashboard minimums:
  • Success rate, escalation rate
  • Policy violation count
  • Cost per success, total daily cost
  • P95 latency

7) Evals + release gate (Days 15–25)
- Run nightly evals on the gold set.
- Gate releases on:
  • No increase in policy violations
  • A regression threshold (e.g., <2–3% drop in success rate)
  • A cost budget (e.g., <10% increase per success)

8) Launch sequence (Days 22–30)
- Shadow mode: the agent drafts; a human executes.
- Capture failure reasons in 5 buckets (bad retrieval, tool error, policy ambiguity, hallucination, edge case).
- Promote to limited execution only for low-risk cases with thresholds (e.g., refunds under $50; duplicates-only ticket closing).

9) Ongoing operations (weekly)
- Review the top 20 failures; update policies, schemas, and the eval set.
- Add 10–50 new labeled cases per week.
- Rotate credentials and review tool scopes monthly.
- Run an “agent incident drill” quarterly: simulate a bad tool call, verify logs, verify the kill switch.

Outcome: If you can demonstrate stable metrics + auditability after 30 days, you have a credible foundation to expand agent fleets across teams.
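The release gate in step 7 can be sketched as a small comparison between baseline and candidate eval metrics. This is a minimal illustration, not a prescribed implementation: the metric names, dict shape, and default thresholds below are assumptions for the example.

```python
def release_gate(baseline: dict, candidate: dict,
                 max_success_drop: float = 0.03,
                 max_cost_increase: float = 0.10) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for promoting a new agent version.

    Metric dicts are assumed to look like:
    {"success_rate": 0.87, "policy_violations": 0, "cost_per_success": 0.04}
    """
    reasons = []
    # Gate 1: no increase in policy violations.
    if candidate["policy_violations"] > baseline["policy_violations"]:
        reasons.append("policy violations increased")
    # Gate 2: success rate may not regress beyond the threshold (e.g., 3%).
    if candidate["success_rate"] < baseline["success_rate"] - max_success_drop:
        reasons.append("success rate regressed beyond threshold")
    # Gate 3: cost per success may not exceed the budget (e.g., +10%).
    if candidate["cost_per_success"] > baseline["cost_per_success"] * (1 + max_cost_increase):
        reasons.append("cost per success over budget")
    return (not reasons, reasons)
```

Wiring this into the nightly eval job gives a yes/no promotion decision plus an audit trail of which gate failed, which fits the approval-chain logging in step 6.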
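The tool-contract rules from step 3 (allowlist, parameter validation, kill switch) can also be sketched as a single dispatch wrapper. Everything here is hypothetical scaffolding: the tool names, the flag store, and the refund limit are invented for illustration, and a real system would validate parameters against a full JSON Schema rather than inline checks.

```python
ALLOWLIST = {"search_kb", "issue_refund", "close_ticket"}   # hypothetical tools
WRITE_TOOLS = {"issue_refund", "close_ticket"}              # tools with side effects

def call_tool(name: str, params: dict, flags: dict) -> dict:
    """Dispatch a tool call only if it passes the contract checks."""
    # Least privilege: reject anything not explicitly allowlisted.
    if name not in ALLOWLIST:
        raise PermissionError(f"tool not allowlisted: {name}")
    # Kill switch: the feature flag strips write access without a deploy.
    if name in WRITE_TOOLS and flags.get("agent_kill_switch", False):
        raise PermissionError("write access disabled by kill switch")
    # Parameter validation stub (real code: JSON Schema per tool contract).
    if name == "issue_refund" and not (0 < params.get("amount_usd", -1) <= 50):
        raise ValueError("refund outside approved limit")
    return {"tool": name, "status": "dispatched"}  # placeholder for real dispatch
```

Because the flag is checked on every call, flipping `agent_kill_switch` in the flag store disables writes on the next request, which is how the "within 30 seconds" guarantee would be met.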