Constrained Autonomy Launch Checklist (2026)

Use this to ship an LLM feature that can take actions without turning into an incident generator.

1) Pick the right first workflow
- Choose one workflow with clear success criteria and limited blast radius (e.g., create a ticket, draft a response, schedule a meeting).
- Define what “done” means in system-of-record terms (ticket created with correct fields, calendar event created, etc.).
- Write down the top 10 edge cases operators complain about; these become initial regression tests.

2) Identity, permissions, and tenancy (non-negotiable)
- Every request must be bound to an end-user identity (not a shared “assistant” account).
- Enforce tenant isolation in code before the model sees data.
- Create a permission map per tool: which roles can call which tools, and which parameters require extra approval.
- Log: actor, tenant, tool called, records read, records changed.

3) Runtime context design
- Prefer structured state pulled from APIs (subscription status, entitlements, order state) over retrieved prose.
- Minimize context: only include what is needed for the immediate decision.
- Attach provenance: where each field came from (system name, record id, timestamp).
- Implement “context refresh” for long-running sessions so the model doesn’t act on stale state.

4) Tool contracts (design for a probabilistic caller)
- Strongly typed inputs; avoid many optional fields.
- Add dry_run=true support for any action that mutates state.
- Require idempotency_key for all mutating tools.
- Return explicit error codes and next-step messages (e.g., NEEDS_APPROVAL, NOT_FOUND, PERMISSION_DENIED).
- Separate preview from commit (e.g., compute_refund_preview vs issue_refund).

5) Safety gates
- Put policy decisions in code, not in prompts.
- Add approvals for high-risk actions (refunds, deletions, permission changes).
- Rate-limit tool execution per user and per tenant.
- Add a “kill switch” to disable tool execution without redeploying.

6) Tracing and observability
- Capture traces for: prompts, tool calls, tool results, errors, and final user-visible output.
- Redact or tokenize sensitive fields before storing traces.
- Create an operator view: “what did it do, what did it read, why did it decide that?”

7) Evals that gate releases
- Build a regression suite from real failures and edge cases.
- Test permission boundaries explicitly (cross-tenant, cross-role, restricted records).
- Test tool calling correctness: required args present, dry_run honored, idempotency respected.
- Run evals on every prompt/model/tool schema change; block release on failures.

8) Rollout plan
- Start read-only, then draft-only, then scoped execution.
- Monitor: tool error rates, policy denials, and manual override frequency.
- Every incident becomes a new test case within 24 hours.

If you can’t explain your assistant’s last action using logs and policy checks, you don’t have a product; you have a demo.
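
As a rough illustration of item 7, a release gate can be as small as a list of cases and a function that reports which ones fail. The names below (EvalCase, run_gate, fake_lookup_tool) and the sample records are hypothetical; a real suite would drive the actual tool layer rather than a stand-in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    request: dict
    expected_status: str


def run_gate(tool: Callable[[dict], str], cases: list[EvalCase]) -> list[str]:
    """Return one line per failing case; an empty list means the gate passes."""
    failures = []
    for case in cases:
        try:
            actual = tool(case.request)
        except Exception as exc:  # a crash counts as a failure, not a skip
            actual = f"EXCEPTION:{type(exc).__name__}"
        if actual != case.expected_status:
            failures.append(
                f"{case.name}: expected {case.expected_status}, got {actual}")
    return failures


def fake_lookup_tool(request: dict) -> str:
    """Hypothetical system under test: a read tool with tenant and role checks."""
    records = {"rec-1": "tenant-a"}  # record_id -> owning tenant
    owner = records.get(request["record_id"])
    if owner is None or owner != request["tenant"]:
        return "NOT_FOUND"  # cross-tenant reads look like missing records
    if request["role"] != "support_agent":
        return "PERMISSION_DENIED"
    return "OK"


CASES = [
    EvalCase("same-tenant read",
             {"tenant": "tenant-a", "role": "support_agent", "record_id": "rec-1"},
             "OK"),
    EvalCase("cross-tenant read",
             {"tenant": "tenant-b", "role": "support_agent", "record_id": "rec-1"},
             "NOT_FOUND"),
    EvalCase("cross-role read",
             {"tenant": "tenant-a", "role": "viewer", "record_id": "rec-1"},
             "PERMISSION_DENIED"),
]
```

In CI, the release step would call run_gate(fake_lookup_tool, CASES) against the real tool, print the returned lines, and exit non-zero when the list is not empty, so any prompt, model, or schema change that breaks a boundary blocks the release.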