AI LAST-MILE READINESS CHECKLIST (ENTERPRISE-GRADE)

Use this to pressure-test whether your AI feature is a product or just a model wrapper. The goal: permissioned access, provable behavior, safe tool actions, and repeatable evaluation.

1) IDENTITY + ACCESS
- SSO: Do you support Okta or Microsoft Entra ID (SAML/OIDC) end-to-end?
- Provisioning: Do you support SCIM for automatic user/group lifecycle?
- RBAC mapping: Can you map org roles/groups to what the AI can read and do?
- Least privilege: Is there a documented list of scopes/permissions required per connector?
- Tenant isolation: Is customer data isolated (storage, indices, caches), and can you explain how?
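
The RBAC mapping and least-privilege items above can be sketched as a deny-by-default lookup from directory groups to scopes. This is a minimal illustration, not a reference design: the group names, scope strings, and the `Principal` type are hypothetical, and a real deployment would load the policy from SCIM-provisioned groups rather than a hard-coded dictionary.

```python
from dataclasses import dataclass

# Hypothetical group-to-scope policy; every name here is illustrative only.
GROUP_SCOPES: dict[str, frozenset[str]] = {
    "support-agents": frozenset({"tickets:read", "tickets:comment"}),
    "support-leads": frozenset({"tickets:read", "tickets:comment", "tickets:refund"}),
    "finance": frozenset({"invoices:read"}),
}


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    groups: tuple[str, ...]


def effective_scopes(principal: Principal) -> frozenset[str]:
    """Union of the scopes granted by each group; unknown groups grant nothing."""
    scopes: set[str] = set()
    for group in principal.groups:
        scopes |= GROUP_SCOPES.get(group, frozenset())
    return frozenset(scopes)


def is_allowed(principal: Principal, required_scope: str) -> bool:
    """Deny by default: allow only if the scope was explicitly granted."""
    return required_scope in effective_scopes(principal)
```

Calling `is_allowed(principal, "tickets:refund")` before the AI reads or acts keeps the model's reach bounded by the same roles an admin already manages.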

2) DATA FLOW + RETRIEVAL
- Data flow diagram: Can you show where prompts, retrieved docs, tool outputs, and logs go?
- Permission-aware retrieval: For every retrieval, do you enforce user/tenant permissions at request time, or rely on an isolation scheme you have verified (for example, per-tenant indices)?
- Citations: Can the AI return citations with stable identifiers (doc ID, URL, timestamp/version)?
- Freshness: Do you have a strategy for document updates (re-indexing, invalidation, change tracking)?
- Sensitive data handling: Are secrets/PII filtered from prompts and logs where required?
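
One way to read the permission-aware retrieval and citation items together is as a filter applied at request time, with each surviving document returned alongside a stable citation. The sketch below assumes documents carry a tenant ID and a group ACL; the `Doc` and `Citation` types and the `retrieve` function are hypothetical, and ranking is left out (candidates are assumed to arrive already ranked).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    allowed_groups: frozenset[str]
    version: str
    url: str
    text: str


@dataclass(frozen=True)
class Citation:
    doc_id: str
    url: str
    version: str


def retrieve(
    candidates: list[Doc],
    tenant_id: str,
    user_groups: frozenset[str],
    limit: int = 5,
) -> list[tuple[str, Citation]]:
    """Drop anything the caller may not see, then attach a citation to each hit."""
    visible = [
        d for d in candidates
        # Both checks run on every request: same tenant AND at least one shared group.
        if d.tenant_id == tenant_id and d.allowed_groups & user_groups
    ]
    return [(d.text, Citation(d.doc_id, d.url, d.version)) for d in visible[:limit]]
```

Filtering before the text reaches the prompt means a document the user cannot open never influences the answer, and the version field lets a reviewer reproduce exactly what the model saw.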

3) TOOL/ACTION SAFETY
- Tool catalog: Is every tool/action enumerated with inputs, outputs, and side effects?
- Approval gates: Do risky actions require explicit user approval (or policy-based approval)?
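- Idempotency: Are actions safe to repeat (dedupe keys, so a retry does not duplicate the side effect)?
- Rate limits + backoff: Are upstream API limits handled predictably (bounded retries with backoff, clear errors when a limit is exhausted)?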
- Rollback: For each action category, is there a rollback or remediation playbook?
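
The approval-gate and idempotency items above can be combined in a single wrapper around tool execution. This is a sketch under stated assumptions: the `issue_refund` tool name, the `execute_tool` function, and the in-memory `_completed` store are hypothetical, and a production system would keep completed keys in a durable store shared across workers.

```python
import hashlib
import json
from typing import Callable

RISKY_TOOLS = {"issue_refund"}  # hypothetical list of tools that need approval
_completed: dict[str, dict] = {}  # in-memory stand-in for a durable dedupe store


def idempotency_key(tool: str, args: dict, request_id: str) -> str:
    """Stable key: the same tool, arguments, and request always hash the same."""
    payload = json.dumps({"tool": tool, "args": args, "request": request_id}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def execute_tool(
    tool: str,
    args: dict,
    request_id: str,
    approved: bool,
    run: Callable[[dict], object],
) -> dict:
    # Approval gate: risky tools never run without an explicit approval flag.
    if tool in RISKY_TOOLS and not approved:
        return {"status": "needs_approval", "tool": tool}
    key = idempotency_key(tool, args, request_id)
    # Replay the stored result instead of repeating the side effect.
    if key in _completed:
        return _completed[key]
    result = {"status": "done", "output": run(args)}
    _completed[key] = result
    return result
```

With this shape, a retry after a timeout replays the recorded result, so the side effect happens at most once per request.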

4) OBSERVABILITY + AUDIT
- Tracing: Do you generate trace IDs that tie together retrieval, model calls, and tool calls?
- Audit logs: Can an admin export who asked what, what sources were used, and what actions were taken?
- Retention controls: Can customers configure retention and deletion for logs and stored artifacts?
- Incident readiness: Do you have a documented process for disabling tools, revoking tokens, and scoping impact?
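
To make the tracing and audit-log items concrete, the sketch below tags every retrieval, model call, and tool call with one trace ID and exports a trace as JSON Lines. The field names and the `audit_event` and `export_trace` helpers are assumptions chosen for illustration, not a standard schema.

```python
import json
import uuid
from datetime import datetime, timezone


def new_trace_id() -> str:
    """Random 32-character hex ID shared by every event in one user request."""
    return uuid.uuid4().hex


def audit_event(trace_id: str, user_id: str, tenant_id: str, kind: str, detail: dict) -> dict:
    """One audit record; kind is e.g. 'retrieval', 'model_call', or 'tool_call'."""
    return {
        "trace_id": trace_id,
        "user_id": user_id,
        "tenant_id": tenant_id,
        "kind": kind,
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),
    }


def export_trace(events: list[dict], trace_id: str) -> str:
    """JSON Lines export of every event that belongs to the given trace."""
    return "\n".join(
        json.dumps(e, sort_keys=True) for e in events if e["trace_id"] == trace_id
    )
```

An admin export then reduces to filtering by user, tenant, or trace ID, and each line answers who asked, which sources were used, and which actions ran.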

5) EVALUATION (RELEASE GATES)
- Definition of done: For at least one workflow, is success measurable (explicit pass/fail criteria)?
- Golden set: Do you have a fixed suite of real cases with expected outputs or acceptance checks?
- Regression runs: Do you run evals on every prompt change, model change, and connector change?
- Human review loop: Is there a workflow for reviewing failures and updating prompts/policies/tests?
- Red-team tests: Do you test prompt injection, data exfiltration attempts, and unsafe tool use?
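
The golden-set and regression items above amount to a release gate: run a fixed suite against the current system and block the release if the pass rate drops. The following is a minimal sketch; `GoldenCase`, `run_gate`, and the example checks are hypothetical, and `system` stands in for whatever callable wraps your prompt, model, and connectors.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # acceptance check applied to the system output


def run_gate(
    cases: list[GoldenCase],
    system: Callable[[str], str],
    min_pass_rate: float = 1.0,
) -> tuple[bool, list[str]]:
    """Return (release_ok, failed_case_ids) for one run of the suite."""
    failures = [c.case_id for c in cases if not c.check(system(c.prompt))]
    # An empty suite proves nothing, so it is treated as a failing gate.
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate >= min_pass_rate, failures
```

Red-team cases fit the same structure: a prompt-injection case is simply a `GoldenCase` whose check asserts that a secret or forbidden action does not appear in the output.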

6) COMMERCIAL REALITY CHECK
- Buyer: Is there a clear budget owner (security, ops, support, finance, engineering productivity)?
- Procurement: Can you answer standard vendor questions (data handling, sub-processors, access controls)?
- “Swap test”: If you change model providers, does the product still work with similar behavior?

If you can’t produce artifacts for sections 1–5, don’t scale distribution. Fix the last mile first.