AGENTIC FEATURE LAUNCH CHECKLIST (2026)

1) ROLE CARD (PRODUCT SPEC)
- Agent name + purpose in one sentence (e.g., “Support Triage Teammate: classify, route, and draft responses for inbound tickets”).
- Primary workflow: define start/end states and what “done” means.
- In-scope tasks (max 3–5) and explicitly out-of-scope tasks.
- Authority ladder: Read-only → Draft → Execute w/ approval → Constrained autonomy.
- Escalation rules: confidence threshold, missing data, policy conflicts, customer tier, $ amount thresholds.
- Success metrics: coverage (% cases handled), precision (correct when acting), time-to-resolution, override rate, incident rate.

2) PERMISSIONS + GOVERNANCE
- Identity & access: SSO (Okta/Entra), SCIM, role-based permissions per action.
- Data boundaries: what objects/fields can be read/written; tenant isolation guarantees.
- Retention: log retention policy, customer-configurable options, deletion/DSR workflow.
- Auditability: every action logged with who/what/when, sources used, and rollback path.
- Safety controls: kill switch per workspace; rate limits; blast-radius caps (max actions/day).

3) TOOLING + RELIABILITY
- Tools are idempotent (retries won’t duplicate refunds/emails/updates).
- Tool schemas versioned; breaking changes gated behind tests.
- Sandbox mode available; “dry run” outputs include a diff of planned changes.
- Undo/rollback implemented for every write action.

4) EVALUATION (BEFORE GA)
- Golden set: 200–1,000 real cases with expected outputs; include edge/adversarial examples.
- Nightly regression: run against prompt updates, model/provider changes, and tool schema changes.
- Metrics dashboard: coverage, precision, p50/p95 latency, cost per task (p50/p95), policy violations.
- Human review rubric for sampled outputs (tone, correctness, policy compliance).

5) COST + UNIT ECONOMICS
- Define budget per task (e.g., triage $0.05, complex billing $0.25).
- Routing policy: small model default; escalate to larger model only on low confidence/high value.
- Caching + dedupe plan (retrieval cache, context compression, retry limits).
- Metering spec: what counts as an “agent action,” what’s billable, and how customers see usage.

6) LAUNCH PLAN
- Shadow mode (1–2 weeks): drafts + logs only; measure precision/coverage.
- Approve-to-act beta: limited cohort; batch approvals; track approval fatigue.
- Constrained autonomy: confidence thresholds + spend caps + alerts.
- Incident playbook: disable switch, rollback steps, customer comms template, postmortem process.

7) PRICING + PACKAGING
- Choose a measurable unit tied to value (tickets triaged, invoices reconciled, leads qualified).
- Include predictability: monthly included quota + overage, or hard caps with admin controls.
- Provide an “AI Usage” page: actions, costs, models used, and who triggered runs.

GA EXIT CRITERIA (SUGGESTED)
- Precision meets target for the workflow (e.g., ≥90% on golden set) AND policy violations near zero.
- Undo/rollback works end-to-end; audit logs complete.
- p95 cost per task within budget and gross margin model validated with real usage.
- Security review complete: retention, access controls, subprocessors, and audit export verified.