MCP Production Readiness Checklist (Control Plane First) Goal: ship tool-using LLM agents without turning your tool layer into an un-auditable software supply chain. 1) Inventory and ownership - List every MCP server and every tool it exposes. - Assign a human owner per tool (not per repo). Include an on-call rotation or escalation path. - Record the tool’s side effects (read-only vs write) and the systems it touches (CRM, billing, infra, code, docs). - Define a deprecation policy: who can remove tools, and how clients are notified. 2) Gateway architecture (recommended baseline) - Put a single gateway in front of tool execution. - Require all tool calls to pass through the gateway for policy checks, identity binding, logging, and quotas. - Decide where tools run: developer laptops (prototype only), central services, or isolated sandboxes for untrusted tools. 3) Identity and authorization - Define execution identity: service principal, delegated user token, or both. - Make attribution mandatory: every tool action must map to an actor and tenant. - Implement least-privilege scopes per tool action (e.g., “tickets:read”, “tickets:write”), not broad “admin” access. - Add approval gates for high-risk actions (money movement, prod changes, exports of sensitive data). 4) Auditability (incident-grade logs) - Log a correlation ID per agent run and per tool call. - Store: tool name + version, actor identity, policy decision, timestamp, redacted parameters, and redacted outputs. - Ensure logs are searchable and retained long enough for investigations. - Validate you can reconstruct: prompt/context source → tool selection → parameters → side effects. 5) Budgets and rate limiting - Set quotas per user and per tenant for: tool calls, downstream API calls, and model usage. - Enforce budgets at the gateway; fail closed when attribution is missing. - Surface spend/usage to operators and, where appropriate, to end users. 6) Tool engineering standards - Use strict schemas; reject ambiguous inputs. - Make destructive actions explicit; support “dry run” where possible. - Add idempotency keys to prevent duplicate creates/charges on retries. - Treat tools like production APIs: tests, versioning, code review, and monitoring. 7) Runbooks and kill switches - Create a documented kill switch to disable a tool (or entire MCP server) quickly. - Define rollback procedures for each high-risk tool (how to undo a write). - Write a “tool incident” runbook: triage steps, who gets paged, what data to pull, and how to communicate. Exit criteria before production - You can answer: who ran this tool call, under what permissions, with what inputs, and what changed. - High-risk tools require explicit approvals. - Budgets are enforced and visible. - There is a tested kill switch and a runbook.