Constrained Autonomy Launch Checklist (2026)

Use this to ship an LLM feature that can take actions without turning into an incident generator.

1) Pick the right first workflow
- Choose one workflow with clear success criteria and limited blast radius (e.g., create a ticket, draft a response, schedule a meeting).
- Define what “done” means in system-of-record terms (ticket created with correct fields, calendar event created, etc.).
- Write down the top 10 edge cases operators complain about; these become initial regression tests.

2) Identity, permissions, and tenancy (non-negotiable)
- Every request must be bound to an end-user identity (not a shared “assistant” account).
- Enforce tenant isolation in code before the model sees data.
- Create a permission map per tool: which roles can call which tools, and which parameters require extra approval.
- Log for every tool call: actor, tenant, tool called, records read, records changed.
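
One way the items above could fit together is a single authorization function that checks tenant, role, and approval-gated parameters, then writes an audit entry. This is a minimal sketch: the tool names, roles, the TOOL_ROLES and APPROVAL_PARAMS maps, and the Actor and AuditLog types are all hypothetical stand-ins, not part of any real framework.

```python
from dataclasses import dataclass, field

# Hypothetical permission map: tool name -> roles allowed to call it.
TOOL_ROLES = {
    "create_ticket": {"agent", "admin"},
    "issue_refund": {"admin"},
}
# Hypothetical map of parameters that trigger extra approval when present.
APPROVAL_PARAMS = {"issue_refund": {"amount"}}


@dataclass(frozen=True)
class Actor:
    user_id: str    # the end user, never a shared assistant account
    tenant_id: str
    role: str


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, actor, tool, decision, records_read=(), records_changed=()):
        self.entries.append({
            "actor": actor.user_id,
            "tenant": actor.tenant_id,
            "tool": tool,
            "decision": decision,
            "records_read": list(records_read),
            "records_changed": list(records_changed),
        })


def authorize(actor, tool, params, record_tenant_id, log):
    """Return 'ALLOW', 'NEEDS_APPROVAL', or 'PERMISSION_DENIED'."""
    if record_tenant_id != actor.tenant_id:
        # Tenant isolation is checked first, before any data reaches the model.
        decision = "PERMISSION_DENIED"
    elif actor.role not in TOOL_ROLES.get(tool, set()):
        decision = "PERMISSION_DENIED"
    elif set(params) & APPROVAL_PARAMS.get(tool, set()):
        decision = "NEEDS_APPROVAL"
    else:
        decision = "ALLOW"
    log.record(actor, tool, decision)
    return decision
```

Unknown tools fall through to PERMISSION_DENIED because the role lookup defaults to an empty set, so the map acts as an allowlist.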

3) Runtime context design
- Prefer structured state pulled from APIs (subscription status, entitlements, order state) over retrieved prose.
- Minimize context: only include what is needed for the immediate decision.
- Attach provenance: where each field came from (system name, record id, timestamp).
- Implement “context refresh” for long-running sessions so the model doesn’t act on stale state.
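
Provenance and context refresh can share one data structure. The sketch below assumes a hypothetical ContextField record and a caller-supplied fetch function. The five-minute default for max_age is an arbitrary illustration, not a recommendation.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, List


@dataclass(frozen=True)
class ContextField:
    name: str
    value: Any
    source_system: str   # provenance: which system supplied the value
    record_id: str       # provenance: which record it came from
    fetched_at: datetime # provenance: when it was read


def refresh_stale(
    fields: List[ContextField],
    fetch: Callable[[ContextField], Any],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> List[ContextField]:
    """Re-fetch every field older than max_age; leave fresh fields untouched."""
    refreshed = []
    for f in fields:
        if now - f.fetched_at > max_age:
            # Keep the provenance, replace only the value and the timestamp.
            f = replace(f, value=fetch(f), fetched_at=now)
        refreshed.append(f)
    return refreshed
```

Calling refresh_stale immediately before any mutating tool call is one way to avoid acting on state that changed mid-session.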

4) Tool contracts (design for a probabilistic caller)
- Use strongly typed inputs and keep optional fields to a minimum.
- Add dry_run=true support for any action that mutates state.
- Require idempotency_key for all mutating tools.
- Return explicit error codes and next-step messages (e.g., NEEDS_APPROVAL, NOT_FOUND, PERMISSION_DENIED).
- Separate preview from commit (e.g., compute_refund_preview vs issue_refund).
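
A refund tool following the contract above might look like the sketch below. The in-memory ORDERS store, the ToolResult shape, the INVALID_AMOUNT code, and the approval threshold are assumptions made for illustration. Only compute_refund_preview, issue_refund, dry_run, idempotency_key, and the NOT_FOUND and NEEDS_APPROVAL codes come from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    code: str      # machine-readable: "OK", "NOT_FOUND", "NEEDS_APPROVAL", ...
    message: str   # next-step hint the model can act on
    data: dict


# Hypothetical stand-ins for a real order store and idempotency table.
ORDERS = {"ord_1": {"paid_cents": 5000, "refunded_cents": 0}}
_COMPLETED = {}  # idempotency_key -> ToolResult of the committed call
APPROVAL_THRESHOLD_CENTS = 2000  # assumed policy value


def compute_refund_preview(order_id: str, amount_cents: int) -> ToolResult:
    """Read-only: validate the request and describe what would happen."""
    order = ORDERS.get(order_id)
    if order is None:
        return ToolResult(False, "NOT_FOUND", f"No order {order_id}; check the id.", {})
    remaining = order["paid_cents"] - order["refunded_cents"]
    if amount_cents <= 0 or amount_cents > remaining:
        return ToolResult(False, "INVALID_AMOUNT",
                          f"Amount must be between 1 and {remaining} cents.", {})
    return ToolResult(True, "OK", "Preview only; nothing changed.",
                      {"order_id": order_id, "amount_cents": amount_cents,
                       "remaining_after_cents": remaining - amount_cents})


def issue_refund(order_id: str, amount_cents: int, idempotency_key: str,
                 dry_run: bool = True, approved: bool = False) -> ToolResult:
    """Mutating: defaults to dry_run so a careless call changes nothing."""
    if idempotency_key in _COMPLETED:
        return _COMPLETED[idempotency_key]  # a retry returns the first result
    preview = compute_refund_preview(order_id, amount_cents)
    if not preview.ok or dry_run:
        return preview
    if amount_cents > APPROVAL_THRESHOLD_CENTS and not approved:
        return ToolResult(False, "NEEDS_APPROVAL",
                          "Ask a human approver, then retry with approval.", preview.data)
    ORDERS[order_id]["refunded_cents"] += amount_cents
    result = ToolResult(True, "OK", "Refund issued.", preview.data)
    _COMPLETED[idempotency_key] = result
    return result
```

Only committed calls are stored under the idempotency key, so a dry run followed by a commit with the same key still executes exactly once.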

5) Safety gates
- Put policy decisions in code, not in prompts.
- Add approvals for high-risk actions (refunds, deletions, permission changes).
- Rate-limit tool execution per user and per tenant.
- Add a “kill switch” to disable tool execution without redeploying.
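
The kill switch and rate limits can live in one gate that every tool call passes through. This is a rough sketch with a hypothetical ToolGate class and made-up result codes. It keeps state in memory and takes the current time as an argument, whereas a real deployment would likely read the switch from a runtime flag store and share counters across processes.

```python
from collections import defaultdict, deque


class ToolGate:
    def __init__(self, per_user_limit: int, per_tenant_limit: int,
                 window_seconds: float = 60.0):
        self.enabled = True  # kill switch: flip to False to stop all tool execution
        self.per_user_limit = per_user_limit
        self.per_tenant_limit = per_tenant_limit
        self.window = window_seconds
        self._calls = defaultdict(deque)  # key -> timestamps of allowed calls

    def _under_limit(self, key, limit: int, now: float) -> bool:
        q = self._calls[key]
        # Drop timestamps that have aged out of the sliding window.
        while q and now - q[0] >= self.window:
            q.popleft()
        return len(q) < limit

    def check(self, user_id: str, tenant_id: str, now: float) -> str:
        """Return 'ALLOW', 'RATE_LIMITED', or 'TOOLS_DISABLED'."""
        if not self.enabled:
            return "TOOLS_DISABLED"
        user_key = ("user", tenant_id, user_id)
        tenant_key = ("tenant", tenant_id)
        if not (self._under_limit(user_key, self.per_user_limit, now)
                and self._under_limit(tenant_key, self.per_tenant_limit, now)):
            return "RATE_LIMITED"
        # Only allowed calls count against the limits.
        self._calls[user_key].append(now)
        self._calls[tenant_key].append(now)
        return "ALLOW"
```

Because the decision is made in code, disabling tools or tightening a limit does not depend on the model following an instruction.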

6) Tracing and observability
- Capture traces of prompts, tool calls, tool results, errors, and the final user-visible output.
- Redact or tokenize sensitive fields before storing traces.
- Create an operator view: “what did it do, what did it read, why did it decide that?”
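
Redaction before storage can be a small recursive pass over each trace payload. The sketch below assumes a hypothetical SENSITIVE_KEYS set and uses a keyed HMAC so that the same value always maps to the same token, which keeps traces joinable without storing the raw value. The key names and the token format are illustrative only.

```python
import hashlib
import hmac

# Hypothetical list of field names treated as sensitive.
SENSITIVE_KEYS = {"email", "phone", "card_number"}


def tokenize(value, secret: bytes) -> str:
    """Replace a value with a stable, keyed token (not reversible without a lookup)."""
    digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def redact(payload, secret: bytes):
    """Return a copy of payload with sensitive fields tokenized, at any nesting depth."""
    if isinstance(payload, dict):
        return {
            k: tokenize(v, secret) if k in SENSITIVE_KEYS else redact(v, secret)
            for k, v in payload.items()
        }
    if isinstance(payload, list):
        return [redact(v, secret) for v in payload]
    return payload


def trace_event(kind: str, payload, secret: bytes) -> dict:
    """Build the record that gets stored; the raw payload is never written."""
    return {"kind": kind, "payload": redact(payload, secret)}
```

Matching on key names misses sensitive values embedded in free text such as prompts, so this covers structured tool arguments and results only.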

7) Evals that gate releases
- Build a regression suite from real failures and edge cases.
- Test permission boundaries explicitly (cross-tenant, cross-role, restricted records).
- Test tool calling correctness: required args present, dry_run honored, idempotency respected.
- Run evals on every prompt/model/tool schema change; block release on failures.
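
A release gate over tool-calling correctness can be quite small. In this sketch the EvalCase shape, the dictionary format of a tool call, and the agent callable are assumptions. A real suite would call the actual model and tool schema instead of a stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected_tool: str
    required_args: frozenset
    must_dry_run: bool = False


def check_call(case: EvalCase, call: dict) -> List[str]:
    """Compare one proposed tool call against a case; return a list of failures."""
    failures = []
    args = call.get("args", {})
    if call.get("tool") != case.expected_tool:
        failures.append(f"expected tool {case.expected_tool}, got {call.get('tool')}")
    missing = sorted(case.required_args - set(args))
    if missing:
        failures.append(f"missing args: {missing}")
    if case.must_dry_run and args.get("dry_run") is not True:
        failures.append("dry_run not set")
    return failures


def run_suite(cases: List[EvalCase], agent: Callable[[str], dict]) -> Dict[str, List[str]]:
    """agent(prompt) returns the tool call the system would make for that prompt."""
    return {case.name: check_call(case, agent(case.prompt)) for case in cases}


def release_allowed(results: Dict[str, List[str]]) -> bool:
    """Block the release if any case has at least one failure."""
    return all(not failures for failures in results.values())
```

Wiring release_allowed into CI as a required check is one way to make the suite a gate instead of a report.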

8) Rollout plan
- Start read-only, then draft-only, then scoped execution.
- Monitor: tool error rates, policy denials, and manual override frequency.
- Every incident becomes a new test case within 24 hours.
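
The read-only, draft-only, scoped-execution progression can be enforced in code as well. The sketch below is one possible shape: the Stage enum, the three tool kinds, and the allowlist parameter are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    READ_ONLY = 1
    DRAFT_ONLY = 2
    SCOPED_EXECUTION = 3


# Hypothetical classification: the earliest stage at which each tool kind is allowed.
MIN_STAGE = {
    "read": Stage.READ_ONLY,
    "draft": Stage.DRAFT_ONLY,
    "execute": Stage.SCOPED_EXECUTION,
}


def tool_allowed(stage: Stage, tool_kind: str, tool_name: str,
                 execution_allowlist: frozenset = frozenset()) -> bool:
    """Allow a tool only if the rollout stage has reached its kind."""
    required = MIN_STAGE.get(tool_kind)
    if required is None or stage < required:
        return False  # unknown kinds are denied by default
    if tool_kind == "execute":
        # "Scoped" execution: only explicitly listed tools may mutate state.
        return tool_name in execution_allowlist
    return True
```

Keeping the stage in configuration means a rollback to read-only is a setting change, not a redeploy.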

If you can’t explain your assistant’s last action using logs and policy checks, you don’t have a product; you have a demo.