AgentOps Launch Pack (2026)

Goal: Ship one audited, cost-controlled agent into production in 30–60 days.

1) Pick the right first workflow (Week 0)
- Choose ONE narrow workflow with clear inputs/outputs (examples: ticket triage, refund recommendation, access request routing).
- Define success in business terms: time saved (min/case), error cost ($/incident), escalation rate (%), and customer impact (CSAT delta).
- Establish a baseline from the last 2–4 weeks: median handle time, p95 handle time, rework rate, and current automation rate.

2) Tooling & permissions design (Week 1)
- Replace generic tools with purpose-built endpoints (e.g., create_refund_request vs refund_anything).
- Enforce least privilege: scoped API keys per agent, per environment, per tenant.
- Add strict schemas for every tool call (required fields, enums, min/max amounts).
- Make actions idempotent: require idempotency_key for any mutation (refunds, updates, provisioning).

3) Observability requirements (Week 1–2)
- Log every run with: workflow_version, model_id, retrieved_docs IDs, tool calls (args + results), tokens, latency, final action.
- Redact or tokenize PII in logs; store raw PII only in the system of record.
- Retention policy: minimum 30 days; target 90–180 days for high-risk workflows.
- Add a kill switch: feature flag + immediate credential revocation procedure.

4) Evaluation harness (Week 2–3)
- Build a regression set of 200–1,000 historical cases.
- Define pass/fail criteria per case (expected tool, expected outcome, forbidden actions).
- Run nightly regressions and alert on: success rate drop >2%, tool error increase >1%, cost/run increase >20%.
- Add red-team cases: prompt injection attempts, malformed inputs, missing context, and over-permission scenarios.

5) Safety gates & autonomy tiers (Week 3–4)
- Create deterministic policy-as-code rules:
  - Money threshold approvals (e.g., >$100 requires human approval).
  - PII access restrictions (who/what can read, summarize, or export).
  - Admin actions blocked by default.
- Define autonomy tiers:
  - Green: auto-execute low-risk actions.
  - Yellow: propose + require approval.
  - Red: disallowed unless human initiated.

6) Economics & SLOs (Week 4–5)
- Track cost per successful outcome (not cost per run).
- Set budgets in code: max tokens/run, max tool calls/run, max retries.
- Set SLOs: p95 latency target and task success target (e.g., 97%+ on regression set before scaling).
- Implement fast/slow paths: small model or rules first; escalate to larger model only when needed.

7) Rollout plan (Week 5–8)
- Shadow mode 1–2 weeks: agent recommends; humans execute. Measure delta.
- Canary release: 5% traffic, then 20%, then 50% as metrics hold.
- Document incident response: disable agent, rotate keys, rollback workflow_version, notify stakeholders.

Operational dashboard (minimum)
- Task success rate (by workflow_version)
- Escalation rate and reasons
- Tool-call validity and tool-call success rates
- Cost per successful outcome ($)
- p50/p95 latency (end-to-end and per tool)
- Policy gate trigger counts (approvals, blocks)

Exit criteria for “Production Ready”
- Regression pass rate stable for 7 days (no >2% drop).
- Tool-call validity ≥99% on regression set.
- Clear audit trail + retention policy implemented.
- Kill switch tested and documented.
- Unit economics proven on a real cohort (cost per outcome below agreed threshold).
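
Sketch for section 5 (safety gates & autonomy tiers): one possible way to express the deterministic rules in Python. Only the >$100 approval example, the admin-blocked default, and the Green/Yellow/Red tiers come from the checklist. The action names, the `ProposedAction` type, and the `classify` and `decide` functions are hypothetical, and letting a human-initiated Red action proceed without further approval is an assumption that a real policy might tighten.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GREEN = "green"    # auto-execute low-risk actions
    YELLOW = "yellow"  # propose + require approval
    RED = "red"        # disallowed unless human initiated


# Assumed values for illustration; the $100 figure mirrors the checklist example.
APPROVAL_THRESHOLD_USD = 100.0
ADMIN_ACTIONS = frozenset({"grant_admin_role", "delete_tenant"})  # hypothetical names


@dataclass(frozen=True)
class ProposedAction:
    name: str
    amount_usd: float = 0.0
    human_initiated: bool = False


def classify(action: ProposedAction) -> Tier:
    """Assign an autonomy tier using fixed rules only (no model output involved)."""
    if action.name in ADMIN_ACTIONS:
        return Tier.RED  # admin actions are blocked by default
    if action.amount_usd > APPROVAL_THRESHOLD_USD:
        return Tier.YELLOW  # strictly above the threshold needs a human
    return Tier.GREEN


def decide(action: ProposedAction) -> str:
    """Return 'execute', 'needs_approval', or 'blocked' for a proposed action."""
    tier = classify(action)
    if tier is Tier.GREEN:
        return "execute"
    if tier is Tier.YELLOW:
        return "needs_approval"
    # Red tier: only proceeds when a human started the action.
    return "execute" if action.human_initiated else "blocked"
```

Keeping `classify` free of model calls means the same input always produces the same tier, which makes the gate straightforward to unit test and to count on the dashboard as approvals and blocks.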
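
Sketch for sections 4 and 6 (nightly regression alerts and cost per successful outcome): a minimal Python example of how the thresholds could be checked. The checklist does not say whether ">2%" and ">1%" are relative or absolute, so this sketch assumes absolute percentage points for success rate and tool error rate, and a relative increase for cost per run. `RunStats` and `regression_alerts` are hypothetical names, and the tool error rate is assumed to mean the share of runs with at least one tool error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    runs: int              # total regression runs (assumed > 0)
    successes: int         # runs that met their pass criteria
    tool_errors: int       # runs with at least one tool error (assumed definition)
    total_cost_usd: float  # total spend across all runs

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs

    @property
    def tool_error_rate(self) -> float:
        return self.tool_errors / self.runs

    @property
    def cost_per_run(self) -> float:
        return self.total_cost_usd / self.runs

    @property
    def cost_per_successful_outcome(self) -> float:
        # Failed runs still cost money, so divide total spend by successes only.
        if self.successes == 0:
            return float("inf")
        return self.total_cost_usd / self.successes


def regression_alerts(baseline: RunStats, current: RunStats) -> list[str]:
    """Compare a nightly run against a baseline and list the triggered alerts."""
    alerts = []
    if baseline.success_rate - current.success_rate > 0.02:   # drop > 2 points
        alerts.append("success_rate_drop")
    if current.tool_error_rate - baseline.tool_error_rate > 0.01:  # rise > 1 point
        alerts.append("tool_error_increase")
    if current.cost_per_run > baseline.cost_per_run * 1.20:   # rise > 20% relative
        alerts.append("cost_per_run_increase")
    return alerts
```

For example, with a baseline of `RunStats(500, 485, 5, 50.0)` and a nightly run of `RunStats(500, 470, 12, 65.0)`, all three alerts fire. The same run shows why the checklist prefers cost per successful outcome: the cost per run is $0.13, but the cost per successful outcome is 65/470, roughly $0.138.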