AI Infrastructure Readiness Checklist (2026)

Use this checklist to move from “LLM feature” to “operated subsystem.” It’s written for founders, engineering leads, and platform teams.

1) Inventory and boundaries
- List every place your product calls an LLM (including background jobs).
- For each call, document: provider/model, purpose, input data types, output consumers.
- Define a hard policy for what data may be sent (and what may never be sent): secrets, credentials, payment data, health data, customer identifiers.
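
One way to keep the inventory and the data policy checkable is to store both as data. The sketch below is illustrative only: the `LLMCallSite` record, the category names, and the choice to forbid all five categories named above are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

# Assumption: every category named in the policy bullet is forbidden outright.
FORBIDDEN_DATA_TYPES = frozenset(
    {"secrets", "credentials", "payment_data", "health_data", "customer_identifiers"}
)


@dataclass(frozen=True)
class LLMCallSite:
    """One inventory entry: where the product calls a model and with what."""
    name: str
    provider_model: str
    purpose: str
    input_data_types: frozenset
    output_consumers: tuple


def policy_violations(site: LLMCallSite) -> list:
    """Return the forbidden data types this call site declares, sorted."""
    return sorted(site.input_data_types & FORBIDDEN_DATA_TYPES)
```

Running `policy_violations` over every inventory entry in CI gives a cheap check that a newly added call site has declared its inputs and does not declare forbidden ones.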

2) Gateway and routing
- Put model calls behind a single internal interface (an API or library).
- Implement consistent timeouts, retries, and rate limits.
- Support routing by use case (for example, a cheaper model for draft text and a stronger model for final decisions).
- Add a “kill switch” to disable a workflow without a deploy.
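
A minimal sketch of such an interface follows. Everything named here is hypothetical: the model names, the in-memory `DISABLED_WORKFLOWS` set (in practice this would be read from a runtime config or flag service so it can change without a deploy), and the `send` callable standing in for a provider client. Rate limiting is left out to keep the example short.

```python
import time

# Hypothetical routing table: use case -> model name.
ROUTES = {"draft": "cheap-model", "decision": "strong-model"}

# Kill switch state. Assumed to be backed by runtime config in a real system.
DISABLED_WORKFLOWS = set()


class WorkflowDisabled(Exception):
    """Raised when a workflow has been switched off."""


def call_model(workflow, prompt, send, retries=2, timeout_s=10.0, backoff_s=0.5):
    """Route a call, enforce the kill switch, and retry on timeouts.

    `send(model, prompt, timeout_s)` is a stand-in for the provider call and
    is expected to raise TimeoutError when the deadline is exceeded.
    """
    if workflow in DISABLED_WORKFLOWS:
        raise WorkflowDisabled(workflow)
    model = ROUTES[workflow]
    last_error = None
    for attempt in range(retries + 1):
        try:
            return send(model, prompt, timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < retries:
                # Exponential backoff between attempts.
                time.sleep(backoff_s * (2 ** attempt))
    raise last_error
```

Because every caller goes through `call_model`, timeouts and retries stay consistent, and adding a workflow name to `DISABLED_WORKFLOWS` stops it immediately.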

3) Logging and audit
- Log: request metadata, tool calls, retrieval sources, and model versions.
- Decide what you will NOT log (for example, full prompts containing sensitive data) and enforce redaction before anything is written.
- Ensure logs are access-controlled per tenant/customer where applicable.
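
Redaction can be sketched as a small set of patterns applied before a log record is built. The patterns below are illustrative assumptions (in particular, the `sk-` key prefix is a made-up format) and would miss many real cases; treat this as a starting shape, not a complete scrubber.

```python
import re

# Illustrative patterns only. Order matters: each runs over the output of the last.
REDACTIONS = [
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),          # 13-16 digit runs
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),     # email addresses
    (re.compile(r"\bsk-[A-Za-z0-9]{8,}\b"), "[API_KEY]"),         # hypothetical key format
]


def redact(text: str) -> str:
    """Replace anything matching a redaction pattern with a placeholder."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


def build_log_record(tenant_id, model_version, prompt, tool_calls, retrieval_sources):
    """Assemble the fields listed above; the prompt is stored only in redacted form."""
    return {
        "tenant_id": tenant_id,  # lets the log store enforce per-tenant access
        "model_version": model_version,
        "prompt_redacted": redact(prompt),
        "tool_calls": list(tool_calls),
        "retrieval_sources": list(retrieval_sources),
    }
```

Keeping `tenant_id` on every record is what makes per-tenant access control enforceable downstream.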

4) Evaluation gates
- Create a small but representative evaluation set per workflow (realistic prompts + expected behavior).
- Track regressions across: correctness, refusal behavior, policy compliance, citation quality (for RAG).
- Add a canary rollout process: route a small slice of traffic to new prompts/models and monitor the results before widening the rollout.
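
The gate and the canary split can both be small functions. In this sketch the metric names, the 0.02 tolerance, and hashing a request ID into 100 buckets are assumptions chosen for illustration; scores are taken to be in the range 0 to 1, higher is better.

```python
import hashlib


def in_canary(request_id: str, percent: int) -> bool:
    """Deterministically assign a request to the canary slice.

    Hashing the ID (rather than random sampling) keeps a given request or
    user on the same side of the split across retries.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return metrics where the candidate is worse than baseline by more than tolerance.

    A metric missing from the candidate is treated as a score of 0.0.
    """
    failed = {}
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric, 0.0)
        if cand_score < base_score - tolerance:
            failed[metric] = (base_score, cand_score)
    return failed
```

A release gate would then block the rollout whenever `regressions(...)` is non-empty for any workflow's evaluation set.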

5) Tool contracts and safety
- Define typed schemas for tool calls (strict fields, enums, no extra properties).
- Validate tool inputs server-side; never trust model-generated parameters.
- Require human approval for high-impact actions (money movement, permission changes, outbound comms, code merges).
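
As an illustration of a strict tool contract, the sketch below validates arguments for a hypothetical `issue_refund` tool using only the standard library. The tool name, its fields, and the currency enum are invented for the example; a schema library could do the same job.

```python
from enum import Enum


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


REFUND_FIELDS = {"order_id", "amount_cents", "currency"}

# Assumption: money movement always requires a human in the loop.
HIGH_IMPACT_TOOLS = {"issue_refund"}


def validate_refund_args(args: dict) -> dict:
    """Validate model-generated arguments server-side; raise ValueError if invalid."""
    extra = set(args) - REFUND_FIELDS
    missing = REFUND_FIELDS - set(args)
    if extra or missing:
        raise ValueError(f"extra={sorted(extra)} missing={sorted(missing)}")
    order_id = args["order_id"]
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    amount = args["amount_cents"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    currency = Currency(args["currency"])  # raises ValueError for unknown values
    return {"order_id": order_id, "amount_cents": amount, "currency": currency}


def requires_human_approval(tool_name: str) -> bool:
    """True when a tool call must wait for explicit human sign-off."""
    return tool_name in HIGH_IMPACT_TOOLS
```

Rejecting unknown fields outright, rather than ignoring them, makes it visible when a model starts inventing parameters.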

6) RAG hygiene
- Identify the sources of truth (docs, tickets, CRM, code, runbooks).
- Assign owners to keep content current; RAG will surface stale policies as confidently as current ones.
- Implement chunking and indexing rules; re-index on meaningful content changes.
- Require citations in user-facing answers for knowledge workflows.
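
Chunking and re-index rules can be made explicit in code. The sketch below assumes fixed-size character chunks with overlap and treats a change as "meaningful" when the whitespace-normalized text differs; both are simplifying assumptions, and the sizes are placeholders.

```python
import hashlib


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list:
    """Split text into chunks of at most `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def content_hash(text: str) -> str:
    """Hash of the text with runs of whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reindex(doc_id: str, text: str, indexed_hashes: dict) -> bool:
    """True when the document is new or its normalized content has changed."""
    return indexed_hashes.get(doc_id) != content_hash(text)
```

Storing the hash alongside each indexed document lets a scheduled job skip unchanged content and re-embed only what actually moved.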

7) Failure modes
- Define fallback behavior per workflow: a deterministic template, a retrieval-only answer, or escalation to a human.
- Add user-visible uncertainty handling (ask clarifying questions; show sources).
- Write an incident playbook for prompt injection/data exposure and practice it.
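
A fallback chain can be expressed as an ordered list of handlers tried in turn. This is a sketch under assumptions: handlers are plain callables that return an answer, return `None` to decline, or raise on failure, and the step names are made up. A real version would log each failure rather than silently continuing.

```python
def answer_with_fallback(question, primary, fallbacks):
    """Try the model first, then each fallback in order.

    `fallbacks` is a list of (name, handler) pairs, for example a
    deterministic template followed by a retrieval-only answer.
    Returns (step_name, answer); escalates to a human if nothing succeeds.
    """
    for name, handler in [("model", primary), *fallbacks]:
        try:
            result = handler(question)
        except Exception:
            # Assumption: any handler error means "move to the next step".
            continue
        if result is not None:
            return name, result
    return "escalate_to_human", None
```

Returning the step name alongside the answer makes it easy to count how often each workflow falls back, which is itself a useful health signal.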

Operational target: a new engineer should be able to answer, in under an hour, where model calls live, what data flows into them, how changes are evaluated, and how to roll back safely.