AI Infrastructure Readiness Checklist (2026)

Use this checklist to move from “LLM feature” to “operated subsystem.” It’s written for founders, engineering leads, and platform teams.

1) Inventory and boundaries
- List every place your product calls an LLM (including background jobs).
- For each call, document: provider/model, purpose, input data types, output consumers.
- Define a hard policy for what data may be sent (and what may never be sent): secrets, credentials, payment data, health data, customer identifiers.

2) Gateway and routing
- Put model calls behind a single internal interface (an API or library).
- Implement consistent timeouts, retries, and rate limits.
- Support routing by use case (cheap model for draft text, stronger model for final decisions).
- Add a “kill switch” to disable a workflow without a deploy.

3) Logging and audit
- Log: request metadata, tool calls, retrieval sources, and model versions.
- Decide what you will NOT log (full prompts with sensitive data) and enforce redaction.
- Ensure logs are access-controlled per tenant/customer where applicable.

4) Evaluation gates
- Create a small but representative evaluation set per workflow (realistic prompts + expected behavior).
- Track regressions across: correctness, refusal behavior, policy compliance, citation quality (for RAG).
- Add a canary rollout process: route a small slice of traffic to new prompts/models and monitor.

5) Tool contracts and safety
- Define typed schemas for tool calls (strict fields, enums, no extra properties).
- Validate tool inputs server-side; never trust model-generated parameters.
- Require human approval for high-impact actions (money movement, permission changes, outbound comms, code merges).

6) RAG hygiene
- Identify the sources of truth (docs, tickets, CRM, code, runbooks).
- Assign owners to keep content current; RAG will surface stale policies.
- Implement chunking and indexing rules; re-index on meaningful content changes.
- Require citations in user-facing answers for knowledge workflows.

7) Failure modes
- Define fallback behavior per workflow: deterministic template, retrieval-only, escalate to human.
- Add user-visible uncertainty handling (ask clarifying questions; show sources).
- Write an incident playbook for prompt injection/data exposure and practice it.

Operational target: a new engineer should be able to answer, in under an hour, where model calls live, what data flows into them, how changes are evaluated, and how to roll back safely.
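
Example sketch for item 2 (gateway and routing): the Python below is one possible shape for a single internal interface with a kill switch, routing by use case, a per-call timeout, and bounded retries. It is illustrative only. Gateway, ROUTES, WorkflowDisabled, the model names, and the call_model signature are all hypothetical, and call_model stands in for whatever provider client you actually use. Rate limiting is left out to keep the sketch short.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

# Hypothetical route table: use case -> model name (placeholder names).
ROUTES: Dict[str, str] = {
    "draft": "cheap-model",      # low-stakes text
    "decision": "strong-model",  # final decisions
}


class WorkflowDisabled(Exception):
    """Raised when a workflow's kill switch is on."""


@dataclass
class Gateway:
    # call_model(model, prompt, timeout_s) -> str is an assumed signature
    # for a wrapper around your real provider client.
    call_model: Callable[[str, str, float], str]
    timeout_s: float = 10.0
    max_attempts: int = 3
    backoff_s: float = 0.5
    # Kill switch state. In practice this would be read from a config
    # store or feature-flag service so it can change without a deploy.
    disabled: Set[str] = field(default_factory=set)

    def complete(self, workflow: str, use_case: str, prompt: str) -> str:
        if workflow in self.disabled:
            raise WorkflowDisabled(workflow)
        model = ROUTES[use_case]  # unknown use cases fail loudly (KeyError)
        last_error: Optional[Exception] = None
        for attempt in range(self.max_attempts):
            try:
                return self.call_model(model, prompt, self.timeout_s)
            except TimeoutError as exc:
                last_error = exc
                # Exponential backoff between attempts, none after the last.
                if attempt < self.max_attempts - 1:
                    time.sleep(self.backoff_s * 2 ** attempt)
        raise RuntimeError(f"model call failed for {workflow}") from last_error


if __name__ == "__main__":
    def fake_provider(model: str, prompt: str, timeout_s: float) -> str:
        # Stand-in for a real network call.
        return f"[{model}] {prompt}"

    gateway = Gateway(call_model=fake_provider)
    print(gateway.complete("support_reply", "draft", "Summarize the ticket"))
    gateway.disabled.add("support_reply")  # flip the kill switch
    try:
        gateway.complete("support_reply", "draft", "Summarize the ticket")
    except WorkflowDisabled as exc:
        print("disabled:", exc)
```

Keeping the route table and the disabled set as plain data makes both easy to move into configuration later. Only timeouts are retried here; whether other error types should be retried depends on the provider and is left open.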