LLM RELEASE-GATING CHECKLIST (2026)

Goal: Ship ONE AI workflow that behaves like software: testable, traceable, and bounded.

1) Define the workflow boundary
- Name the workflow (e.g., “refund eligibility decision”, “support ticket triage”, “incident summary”).
- List irreversible actions the workflow can trigger (refund, export, delete, admin change). If any exist, require strict schemas + deterministic checks.
- Decide explicit refusal conditions (missing inputs, ambiguous identity, insufficient permissions, no sources, policy conflicts).

2) Write contracts (before prompts)
- Define structured output schema (JSON Schema / Pydantic model). Include enums and required fields.
- Define tool schemas: exact function names, required args, allowed ranges, and role/permission constraints.
- Add a “reason” field for decisions that must be auditable (short, factual, citeable).

3) Instrument traces you can replay
- Log: model name/version, system prompt, user input, retrieved docs (IDs + snippets), tool calls (args + results), final output.
- Add correlation IDs across steps (request_id, user_id/tenant_id, workflow_id).
- Decide retention and redaction: what must be masked (PII, secrets) and what must remain for incident response.

4) Build eval sets tied to product requirements
- Golden set: real historical cases with expected outcomes.
- Edge set: missing fields, weird formatting, long inputs, conflicting instructions.
- Adversarial set: prompt injection strings inside retrieved docs; attempts to exfiltrate data; attempts to bypass permissions.
- Permissions set: cross-tenant lookups; role-limited actions; “should refuse” cases.

5) Implement release-blocking eval gates
- Schema gate: output parses and validates.
- Tool gate: correct tool selection + correct arguments; no loops/duplicate calls.
- Grounding gate (if RAG): answer must cite exact retrieved passages; if none, refuse.
- Security gate: no cross-tenant data; respects ACL filters; refuses disallowed actions.
- Budget gate: token/tool usage stays within your product’s cost/latency budget.

6) Failure handling (design it, don’t improvise it)
- On schema/tool failure: ask a clarifying question OR fall back to a safe non-agent path.
- On grounding failure: refuse and show what sources are missing.
- On policy failure: refuse with a short reason + next step.

7) Deployment discipline
- Treat prompt and model changes as code changes (PRs, review, changelog).
- Run eval suite in CI on every change.
- Canary in production for high-risk workflows; keep fast rollback.

8) Incident readiness
- Define what a “bad action” is (wrong refund, wrong access, wrong customer).
- Ensure you can reconstruct: input → retrieval → tool calls → output.
- Add an operator kill-switch for tool execution if you suspect ongoing abuse.

Use this checklist on one workflow first. If you can’t make one workflow pass gates consistently, scaling to “agents everywhere” will only scale your failure rate.