Transactional AI Build Checklist (v1)

Goal: convert an LLM-powered feature from “helpful text” into a production workflow that can safely perform side effects (writes) with auditability and replay.

1) Define the transaction boundary
- Write down the exact side effects the system is allowed to perform (create ticket, refund, change setting, send email).
- For each side effect, define a single owner service (the executor). The LLM never writes directly.
- Identify the highest-risk actions and require explicit approval, from a human or a policy gate, before any of them executes.
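
One way to make the boundary concrete is an allowlist that names each side effect, its owning executor, and its approval requirement. The sketch below is illustrative only: the action names come from the examples above, while the service names, the `Approval` levels, and `resolve_action` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Approval(Enum):
    NONE = "none"
    POLICY_GATE = "policy_gate"
    HUMAN = "human"


@dataclass(frozen=True)
class ActionSpec:
    name: str
    executor: str       # the single service that owns this side effect
    approval: Approval  # what must sign off before execution


# Hypothetical allowlist: anything not listed here cannot be executed.
ALLOWED_ACTIONS = {
    spec.name: spec
    for spec in (
        ActionSpec("create_ticket", "ticketing-service", Approval.NONE),
        ActionSpec("send_email", "notification-service", Approval.POLICY_GATE),
        ActionSpec("refund", "billing-service", Approval.HUMAN),
    )
}


def resolve_action(name: str) -> ActionSpec:
    """Return the spec for an allowed action, or raise if it is out of bounds."""
    try:
        return ALLOWED_ACTIONS[name]
    except KeyError:
        raise PermissionError(
            f"action {name!r} is outside the transaction boundary"
        ) from None
```

Keeping the registry in code (or reviewed config) means adding a new side effect is a deliberate change, not something the model can request into existence.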

2) Model output must be structured
- Define a schema for every tool call payload (JSON Schema, Pydantic, Zod, or protobuf; pick one and use it everywhere).
- Reject any payload that fails validation. Do not “fix it in the prompt.”
- Store the validated payload as the “intent record” before executing.
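
A minimal sketch of validate-then-record, using only the standard library as a stand-in for whichever schema tool is chosen. The `RefundPayload` fields, the `ValidationError` class, and the in-memory `INTENT_LOG` are assumptions for illustration; a real intent record would go to durable storage.

```python
import json
import uuid
from dataclasses import asdict, dataclass


class ValidationError(ValueError):
    pass


@dataclass(frozen=True)
class RefundPayload:
    order_id: str
    amount_cents: int
    reason: str

    @classmethod
    def parse(cls, raw: str) -> "RefundPayload":
        """Parse raw model output strictly; reject instead of repairing."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValidationError(f"not valid JSON: {exc}") from exc
        if not isinstance(data, dict):
            raise ValidationError("payload must be a JSON object")
        expected = {"order_id": str, "amount_cents": int, "reason": str}
        if set(data) != set(expected):
            raise ValidationError(
                f"expected fields {sorted(expected)}, got {sorted(data)}"
            )
        for name, typ in expected.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(data[name], typ) or isinstance(data[name], bool):
                raise ValidationError(f"field {name!r} must be {typ.__name__}")
        if data["amount_cents"] <= 0:
            raise ValidationError("amount_cents must be positive")
        return cls(**data)


INTENT_LOG: list[dict] = []  # stand-in for a durable store


def record_intent(tool: str, payload: RefundPayload) -> str:
    """Persist the validated payload before any executor is called."""
    intent_id = str(uuid.uuid4())
    INTENT_LOG.append(
        {"intent_id": intent_id, "tool": tool, "payload": asdict(payload)}
    )
    return intent_id
```

Note that `parse` never coerces or repairs: a string where an integer is expected is a rejection, which keeps the intent record an exact copy of what was validated.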

3) Identity and permissions
- Identify the acting principal for each action (end user, admin, service account).
- Use scoped tokens (OAuth where applicable). Ensure scopes map to business permissions.
- Never let the model expand scope. Permission checks live in code.
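
A sketch of a permission check that lives in code rather than in the prompt. The scope strings, the `REQUIRED_SCOPE` mapping, and the `Principal` type are hypothetical; in practice the scopes would come from a verified token, not from anything the model produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    subject: str            # end user, admin, or service account
    scopes: frozenset[str]  # taken from the verified token, never from the model


# Hypothetical mapping from tool name to the scope it requires.
REQUIRED_SCOPE = {
    "create_ticket": "tickets:write",
    "refund": "billing:refund",
}


def authorize(principal: Principal, tool: str) -> None:
    """Raise PermissionError unless the principal holds the tool's scope."""
    required = REQUIRED_SCOPE.get(tool)
    if required is None:
        # Deny by default: an unmapped tool is never implicitly allowed.
        raise PermissionError(f"no scope mapping for tool {tool!r}")
    if required not in principal.scopes:
        raise PermissionError(f"{principal.subject} lacks scope {required!r}")
```

The executor would call `authorize` with the acting principal before every tool call, so a model-generated payload can never widen what that principal is allowed to do.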

4) Idempotency and retries
- Require an idempotency key for every mutation.
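- Ensure the executor de-duplicates on that key server-side, so a repeated request returns the original result instead of repeating the side effect.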
- Define retry behavior per tool: which calls are safe to retry, which need compensation.
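
A minimal sketch of server-side de-duplication, assuming an in-memory store and a hypothetical `RefundExecutor`. A production executor would keep the key table in a durable store with a uniqueness constraint; the lock here only illustrates that the check and the write must be atomic.

```python
import threading


class RefundExecutor:
    """Hypothetical executor that applies each idempotency key at most once."""

    def __init__(self):
        self._results = {}             # idempotency key -> (request, result)
        self._lock = threading.Lock()
        self.side_effects = 0          # counts real writes, for demonstration

    def execute(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        if not idempotency_key:
            raise ValueError("idempotency key is required for every mutation")
        request = (order_id, amount_cents)
        with self._lock:
            if idempotency_key in self._results:
                stored_request, stored_result = self._results[idempotency_key]
                if stored_request != request:
                    # Same key, different payload: a client bug, not a retry.
                    raise ValueError("idempotency key reused with a different payload")
                return stored_result   # replay the original result
            self.side_effects += 1     # the real write would happen here
            result = {
                "refund_id": f"rf_{self.side_effects}",
                "order_id": order_id,
                "amount_cents": amount_cents,
            }
            self._results[idempotency_key] = (request, result)
            return result
```

Rejecting a reused key with a different payload is a design choice: it surfaces client bugs early instead of silently returning a result for a request that was never made.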

5) Durable logs and traceability
- Generate one correlation ID per workflow run and attach it to every log entry and tool call in that run.
- Log: input (sanitized), model version, prompt/template version, retrieved sources, validated tool payloads, tool responses, final outcome.
- Make logs queryable by customer/account for support and compliance.
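
A sketch of one structured log event per workflow step, keyed by correlation ID and account so it can be queried later. The field names and the `build_log_event` helper are assumptions; the point is that every event carries the same identifiers and is machine-readable.

```python
import json
import uuid
from datetime import datetime, timezone


def new_correlation_id() -> str:
    """One ID per workflow run, attached to every event in that run."""
    return uuid.uuid4().hex


def build_log_event(correlation_id: str, account_id: str, step: str, **fields) -> str:
    """Return one JSON line; extra fields carry step-specific detail."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,
        "account_id": account_id,   # lets support query by customer/account
        "step": step,
        **fields,
    }
    return json.dumps(event, sort_keys=True)


# Example: the same correlation ID ties the model call to the tool call.
cid = new_correlation_id()
model_line = build_log_event(
    cid, "acct_1", "model_call",
    model_version="model-v1", prompt_version="refund-template-3",
)
tool_line = build_log_event(
    cid, "acct_1", "tool_call",
    tool="refund", payload={"order_id": "o1", "amount_cents": 500},
)
```

The model and prompt version strings above are placeholders; record whatever identifiers the deployment actually uses.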

6) Retrieval governance (if using RAG)
- Enforce entitlements at retrieval time (per-user/per-group), not just per-tenant.
- Track provenance: which documents/chunks were used.
- Treat retrieved text as untrusted input; never let it directly dictate tool execution.
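
The first two points can be sketched as a filter applied to retrieval results before they reach the prompt, plus a provenance record of what survived. The `Chunk` type and the group-based entitlement model are hypothetical; a real system would usually push this filter into the retrieval query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    allowed_groups: frozenset[str]  # groups entitled to read this chunk
    text: str


def filter_entitled(chunks, user_groups):
    """Keep only chunks the requesting user is entitled to see."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


def provenance(chunks):
    """Record which documents and chunks were actually used."""
    return [{"doc_id": c.doc_id, "chunk_id": c.chunk_id} for c in chunks]
```

The provenance list belongs in the same log entry as the model call, so a reviewer can see exactly which sources informed a proposed action.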

7) Human override and UX
- For high-risk actions, provide a review screen that shows: proposed action, inputs, and expected side effect.
- Provide an undo or compensation path where possible.
- Make failures legible: show which validation gate failed and how to fix it.

8) Pre-launch tests
- Create a small eval set of real workflow cases (including edge cases and ambiguous inputs).
- Add adversarial tests: prompt injection in retrieved docs, malformed tool payloads, partial outages.
- Run chaos-style tests on retries to confirm idempotency.
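
A sketch of a chaos-style retry test: a fake executor applies the write and then randomly loses the response, and the test asserts that retries never apply the effect twice. `FlakyExecutor`, `call_with_retries`, the failure rate, and the seed are all hypothetical test fixtures, not part of any real library.

```python
import random


class FlakyExecutor:
    """Hypothetical executor that applies the write, then sometimes times out."""

    def __init__(self, failure_rate: float, seed: int):
        self._rng = random.Random(seed)  # seeded so the run is reproducible
        self._failure_rate = failure_rate
        self._seen = {}                  # idempotency key -> stored result
        self.applied = []                # every real side effect, in order

    def execute(self, key: str, payload: dict) -> dict:
        if key not in self._seen:
            self.applied.append(payload)
            self._seen[key] = {"status": "ok", "payload": payload}
        # Simulate the worst case: the write landed but the caller never hears back.
        if self._rng.random() < self._failure_rate:
            raise TimeoutError("response lost after the write was applied")
        return self._seen[key]


def call_with_retries(executor, key: str, payload: dict, max_attempts: int = 50) -> dict:
    """Retry with the same idempotency key until a response arrives."""
    for _ in range(max_attempts):
        try:
            return executor.execute(key, payload)
        except TimeoutError:
            continue
    raise RuntimeError("gave up after retries")


def test_retries_apply_effect_once():
    executor = FlakyExecutor(failure_rate=0.5, seed=7)
    for i in range(50):
        result = call_with_retries(executor, key=f"k{i}", payload={"n": i})
        assert result["payload"] == {"n": i}
    # 50 logical requests, many retries, exactly 50 side effects.
    assert len(executor.applied) == 50
```

The same harness can be pointed at a staging executor by swapping `FlakyExecutor` for a client that injects timeouts at the network layer.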

Definition of done: the workflow can be replayed from the intent record, produces the same side effect exactly once, and can be explained end-to-end via logs.