Protocol-First AI Feature Shipping Checklist (v1)

Use this checklist before shipping any AI workflow that touches customer data or production systems. The goal is simple: make behavior enforceable, traceable, and reversible.

1) Workflow Definition (Scope)
- Name the job in one sentence (e.g., “Draft support reply from internal docs and ticket history”).
- List the system-of-record objects touched (tickets, docs, CRM records, repos).
- Define the “stop condition” (what counts as done) and explicit failure states.
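
One way to keep the scope honest is to write it down as data the code can check, rather than prose in a design doc. The sketch below is illustrative only: `WorkflowSpec`, `Outcome`, and the specific stop condition and failure states are assumptions, not part of this checklist.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical terminal states for a single run."""
    DONE = "done"
    NEEDS_HUMAN = "needs_human"
    TOOL_FAILURE = "tool_failure"
    PERMISSION_DENIED = "permission_denied"


@dataclass(frozen=True)
class WorkflowSpec:
    job: str                    # the one-sentence job description
    objects_touched: tuple      # system-of-record objects in scope
    stop_condition: str         # what counts as done
    failure_states: tuple       # explicit ways a run may end without success


# Example values are assumed for illustration.
SUPPORT_REPLY = WorkflowSpec(
    job="Draft support reply from internal docs and ticket history",
    objects_touched=("tickets", "docs"),
    stop_condition="A draft reply is attached to the ticket for agent review",
    failure_states=(
        Outcome.NEEDS_HUMAN,
        Outcome.TOOL_FAILURE,
        Outcome.PERMISSION_DENIED,
    ),
)
```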

2) Tool Boundary Protocol
- Create a tool allowlist (start small). No “call any endpoint” access.
- Write JSON schemas for tool arguments and validate them server-side.
- Decide what happens on tool failure: retry rules, fallback behavior, and user messaging.
- Add idempotency where side effects exist (avoid duplicate writes).
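
The tool boundary above might be sketched as a small server-side gateway. This is a minimal illustration, not a reference implementation: `ToolGateway`, `ToolRejected`, and the `add_ticket_note` tool are hypothetical names, and the type-map check is a simplified stand-in for full JSON Schema validation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical allowlist: tool name -> argument names and expected types.
# A real system would likely validate against full JSON Schemas instead.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "add_ticket_note": {"ticket_id": str, "body": str},
}


class ToolRejected(Exception):
    """Raised when a call is outside the allowlist or fails validation."""


def validate_call(tool: str, args: dict[str, Any]) -> None:
    if tool not in TOOL_SCHEMAS:
        raise ToolRejected(f"tool not allowlisted: {tool}")
    schema = TOOL_SCHEMAS[tool]
    if set(args) != set(schema):
        raise ToolRejected(f"unexpected or missing arguments for {tool}")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            raise ToolRejected(f"{name} must be {expected.__name__}")


@dataclass
class ToolGateway:
    handlers: dict[str, Callable[..., Any]]
    _results: dict[str, Any] = field(default_factory=dict)

    def call(self, tool: str, args: dict[str, Any], idempotency_key: str) -> Any:
        validate_call(tool, args)
        # Replaying a key returns the stored result instead of writing again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self.handlers[tool](**args)
        self._results[idempotency_key] = result
        return result
```

Here the idempotency cache lives in memory for brevity; a production version would presumably persist keys alongside the write so that retries after a crash are also deduplicated.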

3) Identity & Permissions
- Decide: act-as-user vs service account per action. Default to act-as-user.
- Enforce least privilege with existing RBAC where possible.
- Log: user ID, workspace/tenant ID, tool called, object IDs touched.
- Define admin visibility: what can admins audit, and what is private to users.
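
As a rough sketch of the permission check and audit record described above, assuming a simple role-to-tool mapping (the `Actor` type, `ROLE_TOOLS` table, and role names are all hypothetical; in practice the check would defer to your existing RBAC system):

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    user_id: str
    tenant_id: str
    roles: frozenset


# Hypothetical mapping; a real deployment would query existing RBAC.
ROLE_TOOLS = {
    "agent": {"read_ticket", "add_ticket_note"},
    "viewer": {"read_ticket"},
}


def is_allowed(actor: Actor, tool: str) -> bool:
    """Act-as-user check: the tool must be granted by one of the user's roles."""
    return any(tool in ROLE_TOOLS.get(role, set()) for role in actor.roles)


def audit_record(actor: Actor, tool: str, object_ids: list, allowed: bool) -> str:
    """Serialize the fields the checklist asks to log for each tool call."""
    return json.dumps(
        {
            "user_id": actor.user_id,
            "tenant_id": actor.tenant_id,
            "tool": tool,
            "object_ids": sorted(object_ids),
            "allowed": allowed,
        },
        sort_keys=True,
    )
```

Logging denied calls as well as allowed ones is a design choice made here so that audits can see what the workflow attempted, not only what it did.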

4) Memory Protocol
- Default to ephemeral context unless users opt into durable memory.
- For durable memory, store structured “facts” not raw transcripts.
- Define scope (user-level vs workspace-level) and retention/deletion paths.
- Make deletion verifiable in your system (record deletion events in audit logs).
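
A minimal sketch of durable memory that stores structured facts and records deletions, assuming an in-memory store (`Fact`, `MemoryStore`, and the event names are hypothetical; a real store would be a database with its own retention jobs):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    fact_id: str
    scope: str      # "user" or "workspace"
    owner_id: str   # user ID or workspace ID, depending on scope
    key: str
    value: str


@dataclass
class MemoryStore:
    facts: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _audit(self, event: str, fact_id: str, actor_id: str) -> None:
        self.audit_log.append({
            "event": event,
            "fact_id": fact_id,
            "actor_id": actor_id,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def remember(self, fact: Fact, actor_id: str) -> None:
        if fact.scope not in ("user", "workspace"):
            raise ValueError(f"unknown scope: {fact.scope}")
        self.facts[fact.fact_id] = fact
        self._audit("memory.write", fact.fact_id, actor_id)

    def forget(self, fact_id: str, actor_id: str) -> bool:
        """Delete a fact and record the attempt, even if nothing was stored."""
        existed = self.facts.pop(fact_id, None) is not None
        event = "memory.delete" if existed else "memory.delete_miss"
        self._audit(event, fact_id, actor_id)
        return existed
```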

5) Verification Gates
- Require a review surface for non-trivial changes: diffs, citations, or previews.
- Add invariants: permission checks, schema checks, policy rules, and safe mode.
- Use sandboxing for risky actions (read-only queries first; staged writes).
- Define rollback: how to revert changes and how users trigger it.
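
The review-then-commit flow above could look something like the following. It is a sketch under simple assumptions: the "system of record" is a plain dict, `StagedEdit` is a hypothetical name, and approval is passed in as a boolean rather than coming from a real review surface.

```python
import difflib


class StagedEdit:
    """Hold a proposed change until a reviewer approves it."""

    def __init__(self, store: dict, key: str, new_text: str):
        self.store = store
        self.key = key
        self.new_text = new_text
        self.previous = None
        self.committed = False

    def preview(self) -> str:
        """Return a unified diff of current vs proposed text; writes nothing."""
        old = self.store.get(self.key, "")
        return "\n".join(difflib.unified_diff(
            old.splitlines(),
            self.new_text.splitlines(),
            fromfile="current",
            tofile="proposed",
            lineterm="",
        ))

    def commit(self, approved: bool) -> None:
        if not approved:
            raise PermissionError("edit requires reviewer approval")
        self.previous = self.store.get(self.key)  # keep for rollback
        self.store[self.key] = self.new_text
        self.committed = True

    def rollback(self) -> None:
        """Restore the value that existed before commit, if any."""
        if not self.committed:
            return
        if self.previous is None:
            self.store.pop(self.key, None)
        else:
            self.store[self.key] = self.previous
        self.committed = False
```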

6) Observability & Debugging
- Assign a trace/correlation ID per run.
- Capture: plan, tool calls, tool-result summaries, model output, and commit outcome.
- Redact sensitive fields in logs; restrict access to traces.
- Add replay capability for debugging (same inputs, same tool responses when possible).
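
A per-run trace with redaction might be sketched as below. The `RunTrace` class, the event shape, and the contents of `SENSITIVE_KEYS` are assumptions for illustration; which fields count as sensitive depends on your data.

```python
import uuid

# Assumed list of sensitive field names; adjust to your own data model.
SENSITIVE_KEYS = {"email", "api_key", "password"}


def redact(value):
    """Recursively replace values of sensitive keys before logging."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


class RunTrace:
    """Collect ordered events for one run under a single correlation ID."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.events = []

    def record(self, kind: str, payload: dict) -> None:
        self.events.append({
            "trace_id": self.trace_id,
            "seq": len(self.events),
            "kind": kind,          # e.g. "plan", "tool_call", "tool_result"
            "payload": redact(payload),
        })
```

Because events carry a sequence number and the recorded tool results, the same list could later feed a replay harness that returns stored responses instead of calling live tools.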

7) Evaluation (Minimum Viable)
- Write 10–30 representative test cases from real scenarios.
- Define pass/fail criteria per case (not vibes). Examples: correct fields changed; citation present; refusal when permission missing.
- Run evals before release and on any model/prompt/tool change.
- Set a rollback trigger: what signals “stop the rollout.”
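
A minimum viable eval harness can be very small. In this sketch, `EvalCase`, `run_evals`, and `should_block_release` are hypothetical names, and the default pass-rate threshold of 1.0 is an assumption; pick the threshold that matches your own rollback trigger.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    inputs: dict
    check: Callable[[dict], bool]   # explicit pass/fail predicate on the output


def run_evals(workflow: Callable[[dict], dict], cases: list) -> dict:
    """Run every case; an exception in the workflow or check counts as a fail."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(workflow(case.inputs)))
        except Exception:
            results[case.name] = False
    return results


def should_block_release(results: dict, min_pass_rate: float = 1.0) -> bool:
    """Return True when the pass rate is below the threshold (or no cases ran)."""
    if not results:
        return True
    pass_rate = sum(results.values()) / len(results)
    return pass_rate < min_pass_rate
```

A case for "refusal when permission missing" would then be an `EvalCase` whose `check` asserts that the output reports a refusal, run against the same workflow entry point used in production.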

If you can’t complete sections 2, 3, and 5, you’re not shipping an AI workflow; you’re shipping a support burden.