AI LAST-MILE READINESS CHECKLIST (ENTERPRISE-GRADE) Use this to pressure-test whether your AI feature is a product or just a model wrapper. The goal: permissioned access, provable behavior, safe tool actions, and repeatable evaluation. 1) IDENTITY + ACCESS - SSO: Do you support Okta or Microsoft Entra ID (SAML/OIDC) end-to-end? - Provisioning: Do you support SCIM for automatic user/group lifecycle? - RBAC mapping: Can you map org roles/groups to what the AI can read and do? - Least privilege: Is there a documented list of scopes/permissions required per connector? - Tenant isolation: Is customer data isolated (storage, indices, caches), and can you explain how? 2) DATA FLOW + RETRIEVAL - Data flow diagram: Can you show where prompts, retrieved docs, tool outputs, and logs go? - Permission-aware retrieval: For every retrieval, do you enforce user/tenant permissions at request time or via a proven isolation scheme? - Citations: Can the AI return citations with stable identifiers (doc ID, URL, timestamp/version)? - Freshness: Do you have a strategy for document updates (re-indexing, invalidation, change tracking)? - Sensitive data handling: Are secrets/PII filtered from prompts and logs where required? 3) TOOL/ACTION SAFETY - Tool catalog: Is every tool/action enumerated with inputs, outputs, and side effects? - Approval gates: Do risky actions require explicit user approval (or policy-based approval)? - Idempotency: Are actions designed to be repeatable safely (dedupe keys, retries)? - Rate limits + backoff: Are API limits handled predictably? - Rollback: For each action category, is there a rollback or remediation playbook? 4) OBSERVABILITY + AUDIT - Tracing: Do you generate trace IDs that tie together retrieval, model calls, and tool calls? - Audit logs: Can an admin export who asked what, what sources were used, and what actions were taken? - Retention controls: Can customers configure retention and deletion for logs and stored artifacts? - Incident readiness: Do you have a documented process for disabling tools, revoking tokens, and scoping impact? 5) EVALUATION (RELEASE GATES) - Definition of done: For one workflow, is success measurable (pass/fail criteria)? - Golden set: Do you have a fixed suite of real cases with expected outputs or acceptance checks? - Regression runs: Do you run evals on every prompt change, model change, and connector change? - Human review loop: Is there a workflow for reviewing failures and updating prompts/policies/tests? - Red-team tests: Do you test prompt injection, data exfiltration attempts, and unsafe tool use? 6) COMMERCIAL REALITY CHECK - Buyer: Is there a clear budget owner (security, ops, support, finance, engineering productivity)? - Procurement: Can you answer standard vendor questions (data handling, sub-processors, access controls)? - “Swap test”: If you change model providers, does the product still work with similar behavior? If you can’t produce artifacts for sections 1–5, don’t scale distribution. Fix the last mile first.