TOOL CONTRACT SHIPPING CHECKLIST (2026)

Use this before you let an LLM call tools that create, update, delete, send, or publish anything.

1) TOOL DEFINITION (PER TOOL)
- Tool name + semantic version (v1, v1.1…)
- One-sentence purpose (what it does, what it never does)
- Side effects explicitly listed (creates records, sends emails, posts messages, modifies configs)
- Owner (team) + on-call/rotation or escalation path
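
One possible way to keep these fields honest is to store them as a typed record and refuse to register a tool with blanks. The sketch below assumes a hypothetical ToolContract class and a hypothetical send_invoice tool; neither name comes from any real framework.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToolContract:
    """One record per tool, mirroring the checklist items above (hypothetical shape)."""
    name: str
    version: str
    purpose: str
    side_effects: tuple[str, ...]  # write ("none",) explicitly for read-only tools
    owner: str
    escalation: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical example tool, for illustration only.
send_invoice = ToolContract(
    name="send_invoice",
    version="v1.1",
    purpose="Emails an existing invoice to a customer; never creates or edits invoices.",
    side_effects=("sends emails",),
    owner="billing-platform",
    escalation="",  # left blank on purpose to show the check
)
print(send_invoice.missing_fields())
```

A registration step could reject any contract whose missing_fields() list is non-empty.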

2) SCHEMA & VALIDATION
- Inputs defined with strict types and constraints (enums, min/max lengths, required fields)
- Outputs defined (including error shapes)
- Validator is enforced server-side (never rely on the model to comply)
- “Repair loop” defined: what error message the model sees when validation fails
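
A minimal sketch of server-side validation with a repair message, using only the standard library. The schema format, the send_email fields, and the error payload shape are all assumptions made for illustration; a production system would more likely use an established schema validator.

```python
from typing import Any

# Hypothetical schema for a hypothetical send_email tool.
SEND_EMAIL_SCHEMA: dict[str, dict[str, Any]] = {
    "to": {"type": str, "required": True, "max_len": 254},
    "subject": {"type": str, "required": True, "max_len": 200},
    "priority": {"type": str, "required": False, "enum": {"low", "normal", "high"}},
}


def validate(args: dict[str, Any], schema: dict[str, dict[str, Any]]) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    # Reject fields the schema does not know about.
    for key in args:
        if key not in schema:
            errors.append(f"unexpected field '{key}'")
    for name, rule in schema.items():
        if name not in args:
            if rule.get("required"):
                errors.append(f"missing required field '{name}'")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"field '{name}' must be of type {rule['type'].__name__}")
            continue
        if "max_len" in rule and len(value) > rule["max_len"]:
            errors.append(f"field '{name}' exceeds {rule['max_len']} characters")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"field '{name}' must be one of {sorted(rule['enum'])}")
    return errors


def repair_message(errors: list[str]) -> dict[str, Any]:
    """Build the error payload the model sees so it can correct its call."""
    return {"ok": False, "error": "validation_failed", "details": errors}
```

Returning every error at once, rather than only the first, gives the model a chance to fix the call in a single repair attempt.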

3) PERMISSIONS & SCOPING
- Who can call the tool (RBAC/ABAC rules)
- Resource scoping (which workspace/project/customer/account)
- Credential model: scoped OAuth tokens, short-lived tokens, or service accounts with boundaries
- Explicit “deny by default” behavior if policy context is missing
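
The deny-by-default rule can be sketched as a single check that returns False whenever anything is missing or unknown. PolicyContext, ALLOWED_ROLES, and the tool names below are hypothetical; a real deployment would load rules from its own policy store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyContext:
    """Who is calling and where (hypothetical shape)."""
    user_id: str
    roles: frozenset[str]
    workspace_id: str


# Hypothetical rule table: tool name -> roles allowed to call it.
ALLOWED_ROLES: dict[str, frozenset[str]] = {
    "create_record": frozenset({"admin", "editor"}),
    "delete_record": frozenset({"admin"}),
}


def is_allowed(tool: str, resource_workspace: str, ctx: Optional[PolicyContext]) -> bool:
    """Allow only when every check passes; anything missing means deny."""
    if ctx is None:
        return False  # no policy context: deny by default
    allowed = ALLOWED_ROLES.get(tool)
    if allowed is None:
        return False  # unknown tool: deny by default
    if ctx.workspace_id != resource_workspace:
        return False  # resource lives outside the caller's scope
    return bool(ctx.roles & allowed)
```

The function has no code path that returns True without a context, a known tool, a matching workspace, and a matching role.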

4) APPROVALS (HUMAN-IN-THE-LOOP)
- List actions requiring approval (external sends, irreversible deletes, finance/security changes)
- Approval UI shows an action diff: tool name, parameters, target resources, recipients
- Approval identity captured (who approved, when)
- Editable plan: user can modify parameters before execution
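
As an illustration of the approval items above, the sketch below models one pending action that a reviewer can inspect, edit, and then approve. ApprovalRequest and its method names are assumptions, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class ApprovalRequest:
    """One pending action awaiting human review (hypothetical shape)."""
    tool: str
    params: dict[str, Any]
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None

    def summary(self) -> str:
        """Render what the reviewer sees: tool name plus every parameter."""
        lines = [f"tool: {self.tool}"]
        lines += [f"  {key}: {value!r}" for key, value in sorted(self.params.items())]
        return "\n".join(lines)

    def edit(self, **changes: Any) -> None:
        """Let the reviewer change parameters, but only before approval."""
        if self.approved_by is not None:
            raise ValueError("cannot edit an approved request")
        self.params.update(changes)

    def approve(self, approver: str, now: Optional[datetime] = None) -> None:
        """Record who approved and when (UTC)."""
        self.approved_by = approver
        self.approved_at = now or datetime.now(timezone.utc)
```

Blocking edits after approval keeps the executed parameters identical to the ones the reviewer actually saw.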

5) STATE, RETRIES, AND IDEMPOTENCY
- Task state machine stored in a database (planned → approved → executing → done/failed)
- Idempotency key strategy to prevent double execution on retries
- Safe retry rules per tool (which errors retry, which fail fast)
- Partial failure handling (what happens if step 3 fails after steps 1–2 succeeded)
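
A small sketch of the state machine and one possible idempotency key strategy. The in-memory dictionary stands in for the database the checklist calls for, and the key derivation (a SHA-256 hash over task, step, tool, and parameters) is an assumption, not a prescribed scheme.

```python
import hashlib
import json
from typing import Any, Callable

# Allowed transitions: planned -> approved -> executing -> done/failed.
TRANSITIONS: dict[str, set[str]] = {
    "planned": {"approved"},
    "approved": {"executing"},
    "executing": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


def idempotency_key(task_id: str, step: int, tool: str, params: dict[str, Any]) -> str:
    """Derive a stable key; sort_keys makes parameter order irrelevant."""
    payload = json.dumps([task_id, step, tool, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Executor:
    """Runs each keyed action at most once (in-memory stand-in for a durable store)."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # retry: return the stored result
        result = action()
        self._results[key] = result
        return result
```

Because the step number is part of the key, a retry of step 3 reuses the stored result while steps 1 and 2 are not executed again.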

6) OBSERVABILITY & AUDIT
- Structured logs for every tool call (tool, params, user, resource IDs, timestamps, result)
- Trace IDs link every stage of a task: user request → retrieval → model call → tool execution
- Redaction rules for sensitive fields (PII, secrets)
- Audit log retention policy and access controls
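
One way the structured log and redaction items might look in practice. The SENSITIVE_KEYS set and the record field names are illustrative assumptions; real redaction rules would come from your own data classification.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional

# Hypothetical list of parameter names that must never reach the logs.
SENSITIVE_KEYS = {"password", "api_key", "ssn"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_record(
    trace_id: str,
    tool: str,
    params: dict[str, Any],
    user: str,
    resource_ids: list[str],
    result: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one tool call as a single JSON line."""
    record = {
        "trace_id": trace_id,
        "tool": tool,
        "params": redact(params),
        "user": user,
        "resource_ids": resource_ids,
        "result": result,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Redacting before serialization means the secret never exists in the log line, so retention and access controls only have to protect the already-cleaned record.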

7) SECURITY AGAINST PROMPT INJECTION
- Treat retrieved text as untrusted input; never allow it to directly authorize actions
- The system, not the model, enforces policy; permission checks run outside the model so it cannot bypass them
- Content filtering strategy for connector sources (docs, tickets, emails)
- Separate “draft” vs “action” modes; draft mode cannot execute tools
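
The draft/action split and the rule that the system enforces policy can be sketched as a dispatcher that sits between the model and the tools. Mode, ToolExecutionBlocked, and dispatch are hypothetical names; the important assumption is that policy_allows is computed server-side from the user's session, never from model output or retrieved text.

```python
from enum import Enum
from typing import Any, Callable


class Mode(Enum):
    DRAFT = "draft"
    ACTION = "action"


class ToolExecutionBlocked(Exception):
    """Raised when the dispatcher refuses to run a tool."""


def dispatch(
    mode: Mode,
    tool: str,
    params: dict[str, Any],
    policy_allows: bool,
    registry: dict[str, Callable[..., Any]],
) -> Any:
    """Run a tool only in action mode and only when server-side policy allows it."""
    if mode is Mode.DRAFT:
        raise ToolExecutionBlocked("draft mode cannot execute tools")
    # policy_allows must come from the server's own permission check.
    # Nothing the model or a retrieved document says can set it.
    if not policy_allows:
        raise ToolExecutionBlocked(f"policy denied tool '{tool}'")
    if tool not in registry:
        raise ToolExecutionBlocked(f"unknown tool '{tool}'")
    return registry[tool](**params)
```

Since the model only proposes a tool name and parameters, an instruction injected into a retrieved document can at most change the proposal; it cannot flip the mode or the policy decision.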

8) EVALUATION (REGRESSION TESTS)
- Golden set of scenarios for: wrong recipient, wrong resource, missing permission, ambiguous user intent
- Adversarial cases: injection attempts inside retrieved docs, tool parameter smuggling
- Release gate: tool changes require passing evals just like unit tests
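
A release gate over a golden set can be as small as the sketch below. Scenario, run_gate, and toy_decide are hypothetical; toy_decide is a stand-in for the real assistant-plus-policy pipeline, and the three outcomes ("allow", "deny", "ask") are an assumed vocabulary.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    request: dict[str, Any]
    expected: str  # "allow", "deny", or "ask"


def toy_decide(request: dict[str, Any]) -> str:
    """Stand-in for the system under test (not a real policy engine)."""
    if not request.get("has_permission"):
        return "deny"
    if request.get("source") == "retrieved_doc":
        return "deny"  # instructions found in retrieved text never authorize actions
    if request.get("recipient") is None:
        return "ask"  # ambiguous intent: ask the user instead of guessing
    return "allow"


GOLDEN_SET = [
    Scenario("missing permission",
             {"has_permission": False, "recipient": "a@example.com", "source": "user"}, "deny"),
    Scenario("ambiguous intent",
             {"has_permission": True, "recipient": None, "source": "user"}, "ask"),
    Scenario("injected instruction",
             {"has_permission": True, "recipient": "x@example.net", "source": "retrieved_doc"}, "deny"),
    Scenario("normal send",
             {"has_permission": True, "recipient": "a@example.com", "source": "user"}, "allow"),
]


def run_gate(decide: Callable[[dict[str, Any]], str], scenarios: list[Scenario]) -> list[str]:
    """Return the names of failing scenarios; an empty list means the gate passes."""
    return [s.name for s in scenarios if decide(s.request) != s.expected]
```

Wiring run_gate into CI so that a non-empty result blocks the release treats tool changes the same way as failing unit tests.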

SHIP CRITERIA
If you can’t answer “What exactly can this assistant do, on whose behalf, to which resources, with what review, and with what audit trail?”, you’re not ready to ship actions.