AgentOps Launch Pack (2026)

Goal: Ship one audited, cost-controlled agent into production in 30–60 days.

1) Pick the right first workflow (Week 0)
- Choose ONE narrow workflow with clear inputs/outputs (examples: ticket triage, refund recommendation, access request routing).
- Define success in business terms: time saved (min/case), error cost ($/incident), escalation rate (%), and customer impact (CSAT delta).
- Establish a baseline from the last 2–4 weeks: median handle time, p95 handle time, rework rate, and current automation rate.
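
The baseline figures above can be computed from an export of historical cases. The following is a minimal sketch, not a prescribed method: the field names (handle_min, reworked, automated) are hypothetical, and p95 uses the nearest-rank definition, which is one of several common conventions.

```python
import statistics

def baseline_metrics(cases):
    """Summarize historical cases into the Week 0 baseline.

    Each case is a dict with hypothetical keys:
    handle_min (float), reworked (bool), automated (bool).
    """
    if not cases:
        raise ValueError("need at least one historical case")
    times = sorted(c["handle_min"] for c in cases)
    n = len(times)
    # Nearest-rank p95: smallest value with at least 95% of cases at or below it.
    # Integer arithmetic avoids float rounding in ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "median_handle_min": statistics.median(times),
        "p95_handle_min": times[rank - 1],
        "rework_rate": sum(1 for c in cases if c["reworked"]) / n,
        "automation_rate": sum(1 for c in cases if c["automated"]) / n,
    }
```

Running the same function over the shadow-mode cohort later gives a like-for-like comparison against this baseline.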

2) Tooling & permissions design (Week 1)
- Replace generic tools with purpose-built endpoints (e.g., create_refund_request vs refund_anything).
- Enforce least privilege: scoped API keys per agent, per environment, per tenant.
- Add strict schemas for every tool call (required fields, enums, min/max amounts).
- Make actions idempotent: require idempotency_key for any mutation (refunds, updates, provisioning).

3) Observability requirements (Week 1–2)
- Log every run with: workflow_version, model_id, retrieved document IDs, tool calls (args + results), tokens, latency, final action.
- Redact or tokenize PII in logs; store raw PII only in the system of record.
- Retention policy: minimum 30 days; target 90–180 days for high-risk workflows.
- Add a kill switch: feature flag + immediate credential revocation procedure.
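
One possible shape for the run log is sketched below. It handles only email addresses as PII, via a simple regular expression, and replaces each with a keyed hash so the same address maps to the same token across runs. The function names, the token format, and the secret handling are assumptions for illustration; real redaction usually needs broader detection (names, phone numbers, account IDs).

```python
import hashlib
import hmac
import json
import re

# Deliberately simple email pattern; an assumption, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def tokenize_pii(text: str, secret: bytes) -> str:
    """Replace each email with a stable keyed token such as <email:1a2b3c4d5e6f>."""
    def _token(match):
        value = match.group(0).lower().encode()
        digest = hmac.new(secret, value, hashlib.sha256).hexdigest()
        return f"<email:{digest[:12]}>"
    return EMAIL_RE.sub(_token, text)

def build_run_record(*, workflow_version, model_id, retrieved_doc_ids,
                     tool_calls, tokens, latency_ms, final_action, secret):
    """Return one JSON log line with the fields listed above, PII tokenized."""
    record = {
        "workflow_version": workflow_version,
        "model_id": model_id,
        "retrieved_doc_ids": retrieved_doc_ids,
        "tool_calls": tool_calls,  # each entry: name, args, result
        "tokens": tokens,
        "latency_ms": latency_ms,
        "final_action": final_action,
    }
    # Tokenize after serialization so nested tool args and results are covered.
    return tokenize_pii(json.dumps(record, sort_keys=True), secret)
```

Because the token is a keyed hash, investigators can still correlate runs that involve the same customer without the log containing the address itself.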

4) Evaluation harness (Week 2–3)
- Build a regression set of 200–1,000 historical cases.
- Define pass/fail criteria per case (expected tool, expected outcome, forbidden actions).
- Run nightly regressions and alert on: success rate drop >2%, tool error increase >1%, cost/run increase >20%.
- Add red-team cases: prompt injection attempts, malformed inputs, missing context, and over-permission scenarios.
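
The nightly alert thresholds can be expressed as a small comparison against the last accepted baseline. This sketch assumes the success-rate and tool-error thresholds are absolute percentage points and the cost threshold is relative; the text above does not say which is intended, so adjust to match your definition. The metric key names are hypothetical.

```python
def regression_alerts(baseline, current):
    """Compare tonight's regression metrics to the baseline and list breached thresholds.

    Both arguments are dicts with hypothetical keys:
    success_rate and tool_error_rate (fractions in [0, 1]), cost_per_run (USD).
    """
    alerts = []
    # Assumption: ">2%" means more than 2 percentage points of absolute drop.
    if baseline["success_rate"] - current["success_rate"] > 0.02:
        alerts.append("success_rate_drop")
    # Assumption: ">1%" means more than 1 percentage point of absolute increase.
    if current["tool_error_rate"] - baseline["tool_error_rate"] > 0.01:
        alerts.append("tool_error_increase")
    # Assumption: ">20%" is relative to the baseline cost per run.
    if current["cost_per_run"] > baseline["cost_per_run"] * 1.20:
        alerts.append("cost_per_run_increase")
    return alerts
```

An empty list means the nightly run is within tolerance; any entry should page or open a ticket according to your alerting setup.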

5) Safety gates & autonomy tiers (Week 3–4)
- Create deterministic policy-as-code rules:
  - Money threshold approvals (e.g., >$100 requires human approval).
  - PII access restrictions (who/what can read, summarize, or export).
  - Admin actions blocked by default.
- Define autonomy tiers:
  - Green: auto-execute low-risk actions.
  - Yellow: propose + require approval.
  - Red: disallowed unless human-initiated.
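
A deterministic gate for the rules and tiers above might look like the following sketch. The action names and the mapping of actions to tiers are hypothetical; the $100 threshold is the example value from the list above.

```python
from enum import Enum

class Tier(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"

APPROVAL_THRESHOLD_USD = 100  # example threshold from the rules above
ADMIN_ACTIONS = frozenset({"grant_admin", "delete_tenant"})  # hypothetical names
PII_EXPORT_ACTIONS = frozenset({"export_customer_data"})     # hypothetical names

def classify(action, amount_usd=0):
    """Map a proposed action to an autonomy tier using fixed rules only."""
    if action in ADMIN_ACTIONS:
        return Tier.RED  # admin actions are blocked by default
    if action in PII_EXPORT_ACTIONS or amount_usd > APPROVAL_THRESHOLD_USD:
        return Tier.YELLOW
    return Tier.GREEN

def gate(action, amount_usd=0, human_initiated=False):
    """Return 'execute', 'needs_approval', or 'blocked' for a proposed action."""
    tier = classify(action, amount_usd)
    if tier is Tier.GREEN:
        return "execute"
    if tier is Tier.YELLOW:
        return "needs_approval"
    # Red tier: only allowed when a human started the action.
    return "execute" if human_initiated else "blocked"
```

Keeping the gate free of model calls means the same input always yields the same decision, which makes it straightforward to unit test and audit.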

6) Economics & SLOs (Week 4–5)
- Track cost per successful outcome (not cost per run).
- Set budgets in code: max tokens/run, max tool calls/run, max retries.
- Set SLOs: p95 latency target and task success target (e.g., 97%+ on regression set before scaling).
- Implement fast/slow paths: small model or rules first; escalate to larger model only when needed.
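
The budget and cost points above can be sketched as a per-run budget object plus a cost-per-successful-outcome calculation. Class, field, and key names here are hypothetical, and the limits are whatever your team agrees on.

```python
from dataclasses import dataclass

class BudgetExceeded(RuntimeError):
    """Raised when a run goes over one of its hard limits."""

@dataclass
class RunBudget:
    max_tokens: int
    max_tool_calls: int
    max_retries: int
    tokens: int = 0
    tool_calls: int = 0
    retries: int = 0

    def charge(self, *, tokens=0, tool_calls=0, retries=0):
        """Record usage and stop the run as soon as any limit is exceeded."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        self.retries += retries
        for name, used, limit in (
            ("tokens", self.tokens, self.max_tokens),
            ("tool_calls", self.tool_calls, self.max_tool_calls),
            ("retries", self.retries, self.max_retries),
        ):
            if used > limit:
                raise BudgetExceeded(f"{name}: {used} > {limit}")

def cost_per_successful_outcome(runs):
    """Total spend across ALL runs divided by the number of successful runs.

    Each run is a dict with hypothetical keys cost_usd (float) and success (bool).
    """
    total = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["success"])
    return total / successes if successes else float("inf")
```

For example, four runs at $0.25 each with two successes cost $0.25 per run but $0.50 per successful outcome, which is why the second figure is the one to track.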

7) Rollout plan (Week 5–8)
- Shadow mode for 1–2 weeks: the agent recommends; humans execute. Measure the delta between the agent's recommendations and the actions humans actually take.
- Canary release: 5% traffic, then 20%, then 50% as metrics hold.
- Document incident response: disable agent, rotate keys, rollback workflow_version, notify stakeholders.
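
One way to implement the canary split is deterministic hash bucketing, sketched below. The function name and the use of a case ID as the routing key are assumptions; any stable identifier works.

```python
import hashlib

CANARY_STAGES = (5, 20, 50)  # percent of traffic, per the rollout plan above

def in_canary(case_id: str, percent: int) -> bool:
    """Route a case to the agent if its stable hash bucket falls under the current percentage."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return bucket < percent
```

Because the bucket depends only on the ID, a case that is in the 5% cohort stays in the cohort at 20% and 50%, so widening the canary never moves already-exposed cases back to the manual path. Setting the percentage to 0 routes nothing to the agent, which pairs naturally with the kill-switch feature flag.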

Operational dashboard (minimum)
- Task success rate (by workflow_version)
- Escalation rate and reasons
- Tool-call validity and tool-call success rates
- Cost per successful outcome ($)
- p50/p95 latency (end-to-end and per tool)
- Policy gate trigger counts (approvals, blocks)

Exit criteria for “Production Ready”
- Regression pass rate stable for 7 days (no drop greater than 2%).
- Tool-call validity ≥99% on regression set.
- Clear audit trail + retention policy implemented.
- Kill switch tested and documented.
- Unit economics proven on a real cohort (cost per outcome below agreed threshold).