Deterministic LLM System Checklist (Action Apps)

Use this when your LLM can take actions: send messages, modify data, trigger workflows, touch infrastructure, or spend money. The goal is simple: the model is allowed to suggest; the system decides.

1) Define “allowed actions” as tools, not prose
- List every action as a named tool (e.g., send_email, create_ticket, update_crm_record).
- For each tool: document required inputs, forbidden inputs, and side effects.
- Put tools behind an allowlist; deny by default.
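
A minimal sketch of a deny-by-default dispatcher, assuming tools are plain Python callables. The tool bodies, the ToolNotAllowed exception, and the dispatch function are hypothetical names chosen for illustration, not part of any particular framework.

```python
from typing import Any, Callable, Dict


# Hypothetical tool implementations; real ones would call your mail or ticket system.
def send_email(to: str, subject: str, body: str) -> str:
    return f"queued email to {to}"


def create_ticket(title: str) -> str:
    return f"created ticket: {title}"


# The allowlist: only names registered here can ever run.
ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {
    "send_email": send_email,
    "create_ticket": create_ticket,
}


class ToolNotAllowed(Exception):
    """Raised when the model asks for a tool that is not on the allowlist."""


def dispatch(tool_name: str, arguments: Dict[str, Any]) -> Any:
    tool = ALLOWED_TOOLS.get(tool_name)
    if tool is None:
        # Deny by default: anything not explicitly registered is refused.
        raise ToolNotAllowed(tool_name)
    return tool(**arguments)
```

Whatever tool name the model proposes, only a lookup hit in ALLOWED_TOOLS leads to execution; there is no fallback path.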

2) Add typed contracts
- Define a JSON Schema (or equivalent) per tool call.
- Reject unknown fields (in JSON Schema, set "additionalProperties": false).
- Version contracts (tool_v1, tool_v2) so that a schema change does not silently break existing callers while you iterate.
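
The contract for one tool might look like the sketch below. The schema uses standard JSON Schema keywords, but validate_args is a deliberately tiny hand-rolled checker (strings only) written for this example; in practice you would likely use a full JSON Schema validator instead. SEND_EMAIL_V1 and its field names are assumptions.

```python
from typing import Any, Dict, List

# Hypothetical contract for version 1 of a send_email tool.
SEND_EMAIL_V1: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "to": {"type": "string"},
        "subject": {"type": "string"},
        "body": {"type": "string"},
    },
    "required": ["to", "subject", "body"],
    "additionalProperties": False,  # unknown fields are rejected
}

# This sketch only understands string fields; a real validator covers far more.
_PY_TYPES = {"string": str}


def validate_args(schema: Dict[str, Any], args: Any) -> List[str]:
    """Return a list of contract violations; an empty list means the call is valid."""
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    errors: List[str] = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unknown field: {name}")
            continue
        expected = _PY_TYPES[props[name]["type"]]
        if not isinstance(value, expected):
            errors.append(f"wrong type for field: {name}")
    return errors
```

A call is executed only when validate_args returns an empty list; anything else is rejected before any side effect happens.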

3) Enforce invariants outside the model
Write 5–15 hard rules that must never be violated. Examples:
- Never email external domains without explicit approval.
- Never execute model-written SQL; run only predefined, parameterized queries.
- Never delete records; soft-delete only.
- Never access documents outside the user’s permissions.
Implement these checks in code. Prompts can restate them, but prompts don’t enforce.
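
As a rough illustration, two of the example rules could be enforced like this before any tool runs. The internal domain list, the "mode" argument on delete_record, and the enforce function are all assumptions made for the sketch.

```python
from typing import Any, Callable, Dict, Optional

# Assumption: domains your organization treats as internal.
INTERNAL_DOMAINS = {"example.com"}


class InvariantViolation(Exception):
    """Raised when a proposed tool call breaks a hard rule."""


def check_send_email(args: Dict[str, Any], approved_by: Optional[str]) -> None:
    domain = args["to"].rsplit("@", 1)[-1].lower()
    if domain not in INTERNAL_DOMAINS and approved_by is None:
        raise InvariantViolation(f"external domain {domain} requires approval")


def check_delete_record(args: Dict[str, Any], approved_by: Optional[str]) -> None:
    # Hypothetical "mode" argument; only soft deletes are ever allowed.
    if args.get("mode") != "soft":
        raise InvariantViolation("hard deletes are not allowed; use soft-delete")


INVARIANTS: Dict[str, Callable[[Dict[str, Any], Optional[str]], None]] = {
    "send_email": check_send_email,
    "delete_record": check_delete_record,
}


def enforce(tool_name: str, args: Dict[str, Any], approved_by: Optional[str] = None) -> None:
    """Run the hard-rule check for this tool, if one exists, before execution."""
    check = INVARIANTS.get(tool_name)
    if check is not None:
        check(args, approved_by)
```

The checks run on the proposed arguments, so they hold no matter what the prompt said or how the model was steered.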

4) Make state explicit and server-side
- Store workflow state in a database or durable workflow engine.
- Never rely on the model to “remember” whether an action happened; read that from stored state.
- Use idempotency keys for tool calls so retries don’t duplicate side effects.
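
One way to sketch idempotent tool calls, using an in-memory SQLite table as a stand-in for a durable store. The table name, the key derivation, and run_once are assumptions for this example. The sketch is single-process only: a concurrent system would need to reserve the key atomically before running the action.

```python
import hashlib
import json
import sqlite3
from typing import Any, Callable, Dict

# Stand-in for a durable database; use a real file or server in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls (idempotency_key TEXT PRIMARY KEY, result TEXT NOT NULL)"
)


def idempotency_key(workflow_id: str, step: int, tool_name: str, args: Dict[str, Any]) -> str:
    """Derive a stable key from the workflow position and the call itself."""
    payload = json.dumps([workflow_id, step, tool_name, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_once(key: str, action: Callable[[], str]) -> str:
    """Run the action only if this key has no recorded result; otherwise replay it."""
    row = conn.execute(
        "SELECT result FROM tool_calls WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # a retry: return the stored result, no second side effect
    result = action()
    conn.execute("INSERT INTO tool_calls VALUES (?, ?)", (key, result))
    conn.commit()
    return result
```

Because the record of what happened lives in the table, a retry after a timeout or crash replays the stored result instead of sending a second email.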

5) Add an audit trail you can hand to security
For each request, record:
- Model name/version (as reported by the provider)
- System prompt + user input (with redaction rules)
- Retrieved context (document IDs, chunk IDs, URLs) when using RAG
- Tool calls (name, arguments) and tool results
- Final user-visible output
Set retention and redaction policies (especially for PII).
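
A minimal sketch of one audit record per request, assuming a single redaction rule that masks email addresses in the user input and final output. The field names and the regular expression are illustrative; real redaction would cover more PII types and would also need to handle tool arguments and results, which this sketch stores unredacted.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List

# Illustrative pattern only; it is not a complete email or PII matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


@dataclass
class AuditRecord:
    request_id: str
    model: str  # name/version as reported by the provider
    system_prompt: str
    user_input: str
    retrieved_context: List[str] = field(default_factory=list)  # document or chunk IDs
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)  # name, arguments, result
    final_output: str = ""

    def to_json(self) -> str:
        """Serialize for storage, redacting the free-text user-facing fields."""
        data = asdict(self)
        data["user_input"] = redact(data["user_input"])
        data["final_output"] = redact(data["final_output"])
        return json.dumps(data, sort_keys=True)
```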

6) Build an eval gate before you scale
- Create a small “golden set” of real tasks and edge cases.
- Run automated evals on every prompt/model/retrieval change.
- Track regressions by category (format errors, wrong tool choice, policy violation).
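
The gate itself can be small. The sketch below assumes a propose function that wraps your model call and returns a dict with a "tool" key (or raises ValueError on unparseable output); the golden-set entries and category names are made up for illustration.

```python
from collections import Counter
from typing import Any, Callable, Dict, List

# Hypothetical golden set; expected_tool of None means "no tool should be called".
GOLDEN_SET: List[Dict[str, Any]] = [
    {"input": "Open a ticket for the login bug", "expected_tool": "create_ticket"},
    {"input": "Email the weekly report to the team", "expected_tool": "send_email"},
    {"input": "Delete every customer record", "expected_tool": None},
]


def run_evals(propose: Callable[[str], Dict[str, Any]], cases: List[Dict[str, Any]]) -> Counter:
    """Run every case and count failures by category."""
    failures: Counter = Counter()
    for case in cases:
        try:
            call = propose(case["input"])
        except ValueError:
            failures["format_error"] += 1
            continue
        tool = call.get("tool")
        if case["expected_tool"] is None and tool is not None:
            failures["policy_violation"] += 1
        elif tool != case["expected_tool"]:
            failures["wrong_tool_choice"] += 1
    return failures


def gate(failures: Counter, max_failures: int = 0) -> bool:
    """Return True only if the change is allowed to ship."""
    return sum(failures.values()) <= max_failures
```

Wiring gate into CI means a prompt, model, or retrieval change that introduces a regression fails the build, with the failing categories in the output.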

7) Add human gates for high-risk transitions
- Identify actions that require approval (refunds, outbound email to customers, production changes).
- Insert an approval step with a clear diff: what will change, who will be affected.
- Log approver identity.
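
A sketch of an approval gate, assuming high-risk tools are listed by name and the record being changed can be represented as a flat dict. The tool names, PendingAction, and render_diff are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Assumption: tool names your team has classified as high risk.
HIGH_RISK_TOOLS = {"issue_refund", "send_customer_email", "deploy_to_production"}


@dataclass
class PendingAction:
    tool: str
    arguments: Dict[str, Any]
    approved_by: Optional[str] = None  # approver identity, kept for the audit log


def render_diff(current: Dict[str, Any], proposed: Dict[str, Any]) -> str:
    """Show the approver exactly which fields would change."""
    lines = []
    for key in sorted(set(current) | set(proposed)):
        old, new = current.get(key), proposed.get(key)
        if old != new:
            lines.append(f"{key}: {old!r} -> {new!r}")
    return "\n".join(lines)


def approve(action: PendingAction, approver: str) -> PendingAction:
    action.approved_by = approver
    return action


def may_execute(action: PendingAction) -> bool:
    """Low-risk tools run directly; high-risk tools need a recorded approver."""
    return action.tool not in HIGH_RISK_TOOLS or action.approved_by is not None
```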

8) Plan for provider/model drift
- Treat model upgrades like dependency upgrades.
- Pin versions where possible; test before switching.
- Keep a rollback plan (previous model + previous prompt + previous retrieval settings).
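
One possible shape for pinning and rollback: treat model, prompt, and retrieval settings as a single versioned release, and keep the previous one around. ReleaseConfig, ReleaseHistory, and the example values are assumptions, not any provider's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReleaseConfig:
    model: str  # a pinned model identifier, not a floating alias
    prompt_version: str
    retrieval_top_k: int


class ReleaseHistory:
    """Tracks the active release and the ones before it, so rollback is one call."""

    def __init__(self, initial: ReleaseConfig) -> None:
        self._history: List[ReleaseConfig] = [initial]

    @property
    def active(self) -> ReleaseConfig:
        return self._history[-1]

    def promote(self, candidate: ReleaseConfig, evals_passed: bool) -> None:
        # Test before switching: an untested candidate is never promoted.
        if not evals_passed:
            raise ValueError("candidate did not pass the eval gate")
        self._history.append(candidate)

    def rollback(self) -> ReleaseConfig:
        if len(self._history) < 2:
            raise RuntimeError("no previous release to roll back to")
        self._history.pop()
        return self.active
```

Rolling back restores the model, prompt, and retrieval settings together, which avoids running an old prompt against a new model by accident.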

If you only do two things this week: (1) enforce schemas on tool calls, and (2) write invariants in code. Those two steps alone eliminate a large class of “agent went rogue” failures.