Protocol-First AI Feature Shipping Checklist (v1)

Use this checklist before shipping any AI workflow that touches customer data or production systems. The goal is simple: make behavior enforceable, traceable, and reversible.

1) Workflow Definition (Scope)
- Name the job in one sentence (e.g., “Draft support reply from internal docs and ticket history”).
- List the system-of-record objects touched (tickets, docs, CRM records, repos).
- Define the “stop condition” (what counts as done) and explicit failure states.

2) Tool Boundary Protocol
- Create a tool allowlist (start small). No “call any endpoint” access.
- Write JSON schemas for tool arguments and validate them server-side.
- Decide what happens on tool failure: retry rules, fallback behavior, and user messaging.
- Add idempotency where side effects exist (avoid duplicate writes).

3) Identity & Permissions
- Decide: act-as-user vs service account per action. Default to act-as-user.
- Enforce least privilege with existing RBAC where possible.
- Log: user ID, workspace/tenant ID, tool called, object IDs touched.
- Define admin visibility: what can admins audit, and what is private to users.

4) Memory Protocol
- Default to ephemeral context unless users opt into durable memory.
- For durable memory, store structured “facts” not raw transcripts.
- Define scope (user-level vs workspace-level) and retention/deletion paths.
- Make deletion verifiable in your system (record deletion events in audit logs).

5) Verification Gates
- Require a review surface for non-trivial changes: diffs, citations, or previews.
- Add invariants: permission checks, schema checks, policy rules, and safe mode.
- Use sandboxing for risky actions (read-only queries first; staged writes).
- Define rollback: how to revert changes and how users trigger it.

6) Observability & Debugging
- Assign a trace/correlation ID per run.
- Capture: plan, tool calls, tool results summaries, model output, and commit outcome.
- Redact sensitive fields in logs; restrict access to traces.
- Add replay capability for debugging (same inputs, same tool responses when possible).

7) Evaluation (Minimum Viable)
- Write 10–30 representative test cases from real scenarios.
- Define pass/fail criteria per case (not vibes). Examples: correct fields changed; citation present; refusal when permission missing.
- Run evals before release and on any model/prompt/tool change.
- Set a rollback trigger: what signals “stop the rollout.”

If you can’t complete sections 2, 3, and 5, you’re not shipping an AI workflow; you’re shipping a support burden.
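
Sketch for section 2 (Tool Boundary Protocol). The following is a minimal illustration, not a reference implementation, of how an allowlist, server-side argument validation, and idempotency could fit together in one gateway. All names here (`ToolGateway`, `ToolSpec`, `ToolError`, `post_reply`) are hypothetical, and the hand-rolled type check is only a stand-in for real JSON Schema validation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class ToolError(Exception):
    """Raised when a tool call is rejected at the boundary."""


@dataclass
class ToolSpec:
    handler: Callable[..., Any]
    arg_types: dict[str, type]        # argument name -> expected type (schema stand-in)
    has_side_effects: bool = False


@dataclass
class ToolGateway:
    allowlist: dict[str, ToolSpec]
    # Assumption: an in-memory map; a real system would persist idempotency keys.
    seen: dict[str, Any] = field(default_factory=dict)

    def call(self, tool: str, args: dict[str, Any],
             idempotency_key: Optional[str] = None) -> Any:
        # 1. Allowlist: anything not explicitly registered is refused.
        spec = self.allowlist.get(tool)
        if spec is None:
            raise ToolError(f"tool not allowlisted: {tool}")

        # 2. Validation: reject unknown, missing, or wrongly typed arguments.
        unknown = set(args) - set(spec.arg_types)
        if unknown:
            raise ToolError(f"unexpected arguments: {sorted(unknown)}")
        for name, expected in spec.arg_types.items():
            if name not in args:
                raise ToolError(f"missing argument: {name}")
            if not isinstance(args[name], expected):
                raise ToolError(f"wrong type for argument: {name}")

        # 3. Idempotency: a repeated key returns the stored result, no second write.
        if spec.has_side_effects:
            if idempotency_key is None:
                raise ToolError("side-effecting tool requires an idempotency key")
            if idempotency_key in self.seen:
                return self.seen[idempotency_key]

        result = spec.handler(**args)
        if spec.has_side_effects:
            self.seen[idempotency_key] = result
        return result


# Usage: a hypothetical write tool, called twice with the same key (e.g. a retry).
sent: list[tuple[str, str]] = []


def post_reply(ticket_id: str, body: str) -> int:
    sent.append((ticket_id, body))
    return len(sent)


gateway = ToolGateway(allowlist={
    "post_reply": ToolSpec(post_reply, {"ticket_id": str, "body": str},
                           has_side_effects=True),
})
call_args = {"ticket_id": "T-1", "body": "Thanks, looking into it."}
first = gateway.call("post_reply", call_args, idempotency_key="run-1:step-1")
second = gateway.call("post_reply", call_args, idempotency_key="run-1:step-1")
```

Routing every call through one gateway keeps the allowlist and validation in a single enforceable place; retry and fallback rules from the same section would wrap `call` rather than live inside individual tools.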
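
Sketch for section 7 (Evaluation). One possible way to make "pass/fail criteria per case" concrete is to attach an explicit check function to each test case and gate release on the results. This is an illustrative sketch under assumptions: `EvalCase`, `run_evals`, `should_ship`, and `stub_workflow` are hypothetical names, the stub stands in for the real AI workflow, and the 100% default pass threshold is an arbitrary example value.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    inputs: dict
    check: Callable[[dict], bool]     # explicit pass/fail criterion for this case


def run_evals(workflow: Callable[[dict], dict],
              cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case; a crash in the workflow or the check counts as a failure."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(workflow(case.inputs)))
        except Exception:
            results[case.name] = False
    return results


def should_ship(results: dict[str, bool], min_pass_rate: float = 1.0) -> bool:
    """Release gate: no results, or a pass rate below the threshold, blocks the release."""
    if not results:
        return False
    return sum(results.values()) / len(results) >= min_pass_rate


# Hypothetical stand-in for the workflow under test.
def stub_workflow(inputs: dict) -> dict:
    if not inputs.get("has_permission"):
        return {"refused": True, "citations": []}
    return {"refused": False, "reply": "See the reset guide.", "citations": ["doc-17"]}


cases = [
    EvalCase("citation_present",
             {"has_permission": True},
             lambda out: len(out["citations"]) > 0),
    EvalCase("refuses_without_permission",
             {"has_permission": False},
             lambda out: out["refused"] is True),
]

results = run_evals(stub_workflow, cases)
ship = should_ship(results)
```

The same `should_ship` check could be rerun on any model, prompt, or tool change, and a pass rate falling below the threshold is one candidate for the "stop the rollout" signal.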