BOUNDED AUTONOMY LAUNCH CHECKLIST (2026)

Use this checklist to ship an agentic feature that can safely take actions (tool calls) in production. The goal is not maximum autonomy on day one; the goal is measurable progress with controlled risk.

1) Define the job and success criteria
- Pick one narrow workflow (e.g., “resolve password reset tickets,” “draft renewal quote,” “classify and route security alerts”).
- Define a single primary success metric (e.g., task success rate, median time-to-resolution) and 2–3 secondary metrics (cost/task, escalation rate, latency).
- Write a one-page “action policy” describing what the agent is allowed to do.

2) Build the autonomy ladder
- Level 0: Draft-only (no external side effects).
- Level 1: Low-risk actions (create ticket, tag record, suggest response).
- Level 2: Medium-risk actions (update CRM fields, schedule meeting) with validation.
- Level 3: High-risk actions (refunds, pricing changes, production deployments) gated by approval.
- Document which tools are enabled at each level and what confidence/conditions are required.

3) Instrumentation and cost controls (non-negotiable)
- Assign a trace ID to every task; log model, prompt version, retrieved docs, tool calls, and outputs.
- Add budgets: max tokens per task, max tool calls per task, and max retries per tool.
- Add loop detection: escalate if tool calls exceed a threshold or if state doesn’t change after N steps.

4) Reliability harness
- Create a golden set of 200–1,000 real tasks with expected outcomes.
- Run evals on every change (prompt edits, model swap, tool schema changes, retrieval tuning).
- Track: schema pass rate for tool calls, task success rate, and unauthorized action rate.

5) Safety and governance
- Use scoped OAuth / least-privilege permissions for every integration.
- Implement RBAC and audit logs; ensure logs are searchable and exportable.
- Add PII handling rules (redaction or blocked fields) and define data retention.
- Build a rollback plan: feature flag the agent and each tool separately.

6) Rollout plan
- Shadow mode for 1–2 weeks: agent proposes actions; humans execute.
- Canary release to 5–10% of traffic with strict budgets and alerting.
- Weekly review: top failure reasons, highest-cost tasks, and the biggest sources of escalation.

7) Commercial packaging
- Pick a pricing unit tied to value (per resolved ticket, per processed invoice, per seat with usage bands).
- Publish usage limits (tool calls/task, max actions/day) and what triggers human review.
- Track gross margin AFTER inference and after any human review cost.

Exit criteria to scale autonomy:
- Tool-call schema pass rate ≥ 99%.
- Unauthorized action rate = 0 in production sampling.
- Stable cost per successful task for 2–4 weeks.
- Documented incident playbook and rollback tested in staging.
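
Added example (not part of the original checklist): one way to enforce the autonomy ladder from section 2 together with the budgets and loop detection from section 3 as a single gate in front of every tool call. The tool names, the TaskGuard class, its default budgets and the hash-based state comparison are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Assumed mapping of hypothetical tool names onto the ladder's levels 0-3.
TOOL_LEVEL = {"draft_reply": 0, "create_ticket": 1, "update_crm_field": 2, "issue_refund": 3}
APPROVAL_LEVEL = 3  # high-risk actions are gated by approval


@dataclass
class TaskGuard:
    trace_id: str             # one trace ID per task, stamped on every decision
    max_level: int            # highest autonomy level enabled for this workflow
    max_tool_calls: int = 8   # budget: tool calls per task (assumed default)
    max_retries: int = 2      # budget: retries per tool (assumed default)
    stall_steps: int = 3      # N: escalate once state is unchanged for N steps
    calls: int = 0
    unchanged: int = 0
    last_hash: str = ""
    retries: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)

    def check(self, tool, state, is_retry=False, approved=False):
        """Return 'allow', 'deny' or 'escalate' for one proposed tool call."""
        decision = self._decide(tool, state, is_retry, approved)
        self.audit.append({"trace_id": self.trace_id, "tool": tool, "decision": decision})
        return decision

    def _decide(self, tool, state, is_retry, approved):
        level = TOOL_LEVEL.get(tool)
        if level is None or level > self.max_level:
            return "deny"        # unknown tool, or above the enabled level
        if level >= APPROVAL_LEVEL and not approved:
            return "escalate"    # approval gate for high-risk actions
        if self.calls >= self.max_tool_calls:
            return "escalate"    # tool-call budget exhausted
        if is_retry:
            self.retries[tool] = self.retries.get(tool, 0) + 1
            if self.retries[tool] > self.max_retries:
                return "escalate"  # retry budget exhausted
        digest = hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()
        self.unchanged = self.unchanged + 1 if digest == self.last_hash else 0
        self.last_hash = digest
        if self.unchanged >= self.stall_steps:
            return "escalate"    # loop: state did not change after N steps
        self.calls += 1
        return "allow"
```

Call check() with the task state observed just before each proposed tool call; the audit list records every decision, including denials, under the task's trace ID.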