MCP Production Readiness Checklist (Control Plane First)

Goal: ship tool-using LLM agents without turning your tool layer into an unauditable software supply chain.

1) Inventory and ownership
- List every MCP server and every tool it exposes.
- Assign a human owner per tool (not per repo). Include an on-call rotation or escalation path.
- Record the tool’s side effects (read-only vs write) and the systems it touches (CRM, billing, infra, code, docs).
- Define a deprecation policy: who can remove tools, and how clients are notified.
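
A small per-tool record is one way to make the inventory checkable by a script. The sketch below is illustrative only: the `ToolRecord` fields, the `missing_fields` helper, and the example servers and tools are assumptions, not part of MCP.

```python
from dataclasses import dataclass, field
from enum import Enum


class SideEffect(Enum):
    READ_ONLY = "read-only"
    WRITE = "write"


@dataclass(frozen=True)
class ToolRecord:
    """One inventory entry per tool (hypothetical schema)."""
    server: str
    tool: str
    owner: str                 # a named human, not a repo or team alias
    escalation: str            # on-call rotation or escalation path
    side_effects: SideEffect
    systems: tuple = field(default_factory=tuple)  # e.g. ("CRM", "billing")


def missing_fields(record: ToolRecord) -> list:
    """Return the names of required fields that are empty."""
    required = ("server", "tool", "owner", "escalation", "systems")
    return [name for name in required if not getattr(record, name)]


inventory = [
    ToolRecord("crm-server", "create_ticket", "alice@example.com",
               "crm-oncall", SideEffect.WRITE, ("CRM",)),
    # Incomplete on purpose: no owner and no escalation path.
    ToolRecord("docs-server", "search_docs", "", "",
               SideEffect.READ_ONLY, ("docs",)),
]

for rec in inventory:
    gaps = missing_fields(rec)
    if gaps:
        print(f"{rec.server}/{rec.tool}: missing {', '.join(gaps)}")
```

Running a check like this in CI would flag tools that ship without an owner, though where the inventory lives (a file, a registry service) is left open here.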

2) Gateway architecture (recommended baseline)
- Put a single gateway in front of tool execution.
- Require all tool calls to pass through the gateway for policy checks, identity binding, logging, and quotas.
- Decide where tools run: developer laptops (prototype only), central services, or isolated sandboxes for untrusted tools.
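
The gateway pattern above can be sketched as a single `call` entry point that runs every check before dispatching to a tool. This is a minimal in-process illustration under stated assumptions: the `Gateway` class, the `require_actor` check, and the `echo` tool are hypothetical names, and a real gateway would sit on the network path rather than in the same process.

```python
from typing import Any, Callable, Dict, List

ToolFn = Callable[[Dict[str, Any]], Any]
Check = Callable[[str, str, Dict[str, Any]], None]


class Gateway:
    """Single choke point for tool execution (illustrative sketch)."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}
        self._checks: List[Check] = []
        self.log: List[Dict[str, Any]] = []

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn

    def add_check(self, check: Check) -> None:
        """A check denies a call by raising PermissionError."""
        self._checks.append(check)

    def call(self, actor: str, tool: str, params: Dict[str, Any]) -> Any:
        if tool not in self._tools:
            raise KeyError(f"unknown tool: {tool}")
        try:
            for check in self._checks:
                check(actor, tool, params)
        except PermissionError as exc:
            # Denied calls are logged too, then the error propagates.
            self.log.append({"actor": actor, "tool": tool,
                             "decision": f"deny: {exc}"})
            raise
        self.log.append({"actor": actor, "tool": tool, "decision": "allow"})
        return self._tools[tool](params)


def require_actor(actor: str, tool: str, params: Dict[str, Any]) -> None:
    """Identity binding: refuse calls that carry no actor."""
    if not actor:
        raise PermissionError("no actor bound to call")


gw = Gateway()
gw.add_check(require_actor)
gw.register("echo", lambda p: p["text"])
print(gw.call("alice", "echo", {"text": "hi"}))
```

Because tools are only reachable through `call`, policy, logging, and quota checks can be added in one place instead of in each tool.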

3) Identity and authorization
- Define the identity each tool call executes under: a service principal, a delegated user token, or both.
- Make attribution mandatory: every tool action must map to an actor and tenant.
- Implement least-privilege scopes per tool action (e.g., “tickets:read”, “tickets:write”), not broad “admin” access.
- Add approval gates for high-risk actions (money movement, prod changes, exports of sensitive data).
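
The authorization rules above might be expressed as a single decision function. In this sketch the scope names other than “tickets:read” and “tickets:write”, the action names, and the rule that an approver must differ from the caller are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from tool action to the one scope it requires.
REQUIRED_SCOPE = {
    "get_ticket": "tickets:read",
    "update_ticket": "tickets:write",
    "issue_refund": "billing:write",
}
# Actions that need a human approval in addition to the scope.
HIGH_RISK = {"issue_refund"}


@dataclass(frozen=True)
class Caller:
    actor: str
    tenant: str
    scopes: frozenset


def authorize(caller: Caller, action: str,
              approved_by: Optional[str] = None) -> str:
    """Return 'allow', 'deny', or 'needs_approval'."""
    if not caller.actor or not caller.tenant:
        return "deny"  # attribution is mandatory
    required = REQUIRED_SCOPE.get(action)
    if required is None or required not in caller.scopes:
        return "deny"  # unknown action or missing scope
    if action in HIGH_RISK and approved_by in (None, caller.actor):
        return "needs_approval"  # assumed rule: no self-approval
    return "allow"


alice = Caller("alice", "acme", frozenset({"tickets:read", "billing:write"}))
for action in ("get_ticket", "update_ticket", "issue_refund"):
    print(action, "->", authorize(alice, action))
```

Unknown actions are denied by default, so adding a tool without registering its scope cannot silently widen access.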

4) Auditability (incident-grade logs)
- Log a correlation ID per agent run and per tool call.
- Store: tool name + version, actor identity, policy decision, timestamp, redacted parameters, and redacted outputs.
- Ensure logs are searchable and retained for at least as long as your incident-investigation and compliance windows require.
- Validate you can reconstruct: prompt/context source → tool selection → parameters → side effects.
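
A log record covering the fields listed above could look like the following. This is a sketch, not a prescribed format: the field names and the `SENSITIVE_KEYS` list are assumptions, and key-name matching is only one (fairly crude) redaction strategy.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed list of parameter names whose values must never be logged.
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number"}


def redact(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def audit_record(run_id, tool, version, actor, tenant, decision, params, output):
    """Build one log entry per tool call."""
    return {
        "run_id": run_id,                # correlation ID for the agent run
        "call_id": str(uuid.uuid4()),    # correlation ID for this tool call
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "tool_version": version,
        "actor": actor,
        "tenant": tenant,
        "policy_decision": decision,
        "params": redact(params),
        "output": redact(output),
    }


record = audit_record(
    run_id=str(uuid.uuid4()), tool="update_ticket", version="1.2.0",
    actor="alice", tenant="acme", decision="allow",
    params={"ticket_id": 42, "token": "s3cret"},
    output={"status": "ok"},
)
print(json.dumps(record, indent=2))
```

Emitting one structured record per call, keyed by `run_id`, is what makes it possible to replay the chain from context to side effects during an investigation.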

5) Budgets and rate limiting
- Set quotas per user and per tenant for: tool calls, downstream API calls, and model usage.
- Enforce budgets at the gateway; fail closed when attribution is missing.
- Surface spend/usage to operators and, where appropriate, to end users.
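
As a rough sketch of gateway-side enforcement, the class below counts tool calls per user and per tenant and refuses calls that carry no attribution. The names `BudgetEnforcer` and `QuotaExceeded` are hypothetical, and the counters never reset; a real limiter would track a time window and would also cover downstream API calls and model usage.

```python
from collections import defaultdict


class QuotaExceeded(Exception):
    """Raised when a user or tenant has used up its call budget."""


class BudgetEnforcer:
    """Counts tool calls per tenant and per (tenant, user)."""

    def __init__(self, per_user_limit: int, per_tenant_limit: int) -> None:
        self.per_user_limit = per_user_limit
        self.per_tenant_limit = per_tenant_limit
        self.user_calls = defaultdict(int)
        self.tenant_calls = defaultdict(int)

    def charge(self, tenant: str, user: str) -> None:
        """Record one call, or raise if it is not allowed."""
        if not tenant or not user:
            # Fail closed: an unattributed call is never executed.
            raise PermissionError("missing attribution")
        if self.tenant_calls[tenant] >= self.per_tenant_limit:
            raise QuotaExceeded(f"tenant {tenant} is over budget")
        if self.user_calls[(tenant, user)] >= self.per_user_limit:
            raise QuotaExceeded(f"user {user} is over budget")
        self.tenant_calls[tenant] += 1
        self.user_calls[(tenant, user)] += 1

    def usage(self, tenant: str) -> dict:
        """Usage summary suitable for an operator dashboard."""
        return {"tenant": tenant,
                "calls": self.tenant_calls.get(tenant, 0),
                "limit": self.per_tenant_limit}


budget = BudgetEnforcer(per_user_limit=2, per_tenant_limit=3)
budget.charge("acme", "alice")
budget.charge("acme", "bob")
print(budget.usage("acme"))
```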

6) Tool engineering standards
- Use strict schemas; reject ambiguous inputs.
- Make destructive actions explicit; support “dry run” where possible.
- Add idempotency keys to prevent duplicate creates/charges on retries.
- Treat tools like production APIs: tests, versioning, code review, and monitoring.

7) Runbooks and kill switches
- Create a documented kill switch that can quickly disable a single tool or an entire MCP server.
- Define rollback procedures for each high-risk tool (how to undo a write).
- Write a “tool incident” runbook: triage steps, who gets paged, what data to pull, and how to communicate.
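
A kill switch can be as simple as a deny list consulted before every call. The sketch below keeps state in memory and uses hypothetical names (`KillSwitch`, `guarded_call`); in practice the state would live in a shared flag or config store so that every gateway instance sees a change at once.

```python
class KillSwitch:
    """Tracks disabled tools and servers; checked before every call."""

    def __init__(self) -> None:
        self.disabled_tools = set()     # entries are (server, tool) pairs
        self.disabled_servers = set()

    def disable_tool(self, server: str, tool: str) -> None:
        self.disabled_tools.add((server, tool))

    def enable_tool(self, server: str, tool: str) -> None:
        self.disabled_tools.discard((server, tool))

    def disable_server(self, server: str) -> None:
        self.disabled_servers.add(server)

    def enable_server(self, server: str) -> None:
        self.disabled_servers.discard(server)

    def is_allowed(self, server: str, tool: str) -> bool:
        return (server not in self.disabled_servers
                and (server, tool) not in self.disabled_tools)


def guarded_call(switch: KillSwitch, server: str, tool: str, fn, *args, **kwargs):
    """Run fn only if neither the tool nor its server is disabled."""
    if not switch.is_allowed(server, tool):
        raise RuntimeError(f"{server}/{tool} is disabled by kill switch")
    return fn(*args, **kwargs)


switch = KillSwitch()
print(guarded_call(switch, "billing-server", "issue_refund", lambda: "refunded"))
switch.disable_tool("billing-server", "issue_refund")
print(switch.is_allowed("billing-server", "issue_refund"))
```

Testing the switch means exercising both directions: confirm that a disabled tool is refused, and that re-enabling it restores service.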

Exit criteria before production
- You can answer: who ran this tool call, under what permissions, with what inputs, and what changed.
- High-risk tools require explicit approvals.
- Budgets are enforced and visible.
- There is a tested kill switch and a runbook.