TOOL CONTRACT SHIPPING CHECKLIST (2026)

Use this before you let an LLM call tools that create, update, delete, send, or publish anything.

1) TOOL DEFINITION (PER TOOL)
- Tool name + semantic version (v1, v1.1…)
- One-sentence purpose (what it does, what it never does)
- Side effects explicitly listed (creates records, sends emails, posts messages, modifies configs)
- Owner (team) + on-call/rotation or escalation path

2) SCHEMA & VALIDATION
- Inputs defined with strict types and constraints (enums, min/max lengths, required fields)
- Outputs defined (including error shapes)
- Validator is enforced server-side (never rely on the model to comply)
- “Repair loop” defined: what error message the model sees when validation fails

3) PERMISSIONS & SCOPING
- Who can call the tool (RBAC/ABAC rules)
- Resource scoping (which workspace/project/customer/account)
- Credential model: scoped OAuth tokens, short-lived tokens, or service accounts with boundaries
- Explicit “deny by default” behavior if policy context is missing

4) APPROVALS (HUMAN-IN-THE-LOOP)
- List actions requiring approval (external sends, irreversible deletes, finance/security changes)
- Approval UI shows an action diff: tool name, parameters, target resources, recipients
- Approval identity captured (who approved, when)
- Editable plan: user can modify parameters before execution

5) STATE, RETRIES, AND IDEMPOTENCY
- Task state machine stored in a database (planned → approved → executing → done/failed)
- Idempotency key strategy to prevent double execution on retries
- Safe retry rules per tool (which errors retry, which fail fast)
- Partial failure handling (what happens if step 3 fails after steps 1–2 succeeded)

6) OBSERVABILITY & AUDIT
- Structured logs for every tool call (tool, params, user, resource IDs, timestamps, result)
- Trace IDs link: user request → retrieval → model call → tool execution
- Redaction rules for sensitive fields (PII, secrets)
- Audit log retention policy and access controls

7) SECURITY AGAINST PROMPT INJECTION
- Treat retrieved text as untrusted input; never allow it to directly authorize actions
- System enforces policy; model never bypasses permission checks
- Content filtering strategy for connector sources (docs, tickets, emails)
- Separate “draft” vs “action” modes; draft mode cannot execute tools

8) EVALUATION (REGRESSION TESTS)
- Golden set of scenarios for: wrong recipient, wrong resource, missing permission, ambiguous user intent
- Adversarial cases: injection attempts inside retrieved docs, tool parameter smuggling
- Release gate: tool changes require passing evals just like unit tests

SHIP CRITERIA
If you can’t answer: “What exactly can this assistant do, on whose behalf, to which resources, with what review and what audit trail?” you’re not ready to ship actions.
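
Illustrative sketch for sections 2, 3, and 5: one way a server-side gateway might combine deny-by-default policy checks, input validation with repair-loop error messages, and an idempotency key. Everything here is an assumption made for illustration, not a prescribed implementation: the `send_message` tool, the `PolicyContext` and `ToolGateway` classes, the field names, and the limits are all hypothetical, and the in-memory dict stands in for a durable store.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical constraints for an illustrative "send_message" tool.
ALLOWED_CHANNELS = {"email", "slack"}
MAX_BODY_LEN = 2000


@dataclass(frozen=True)
class PolicyContext:
    workspace_id: str          # the only workspace this caller may touch
    allowed_tools: frozenset   # tools this caller may invoke


def idempotency_key(task_id, tool, params):
    """Derive a stable key so a retried call maps to the same execution."""
    payload = json.dumps([task_id, tool, params], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate(params):
    """Return error strings the model can read to repair its call."""
    errors = []
    channel = params.get("channel")
    if not isinstance(channel, str) or channel not in ALLOWED_CHANNELS:
        errors.append("channel must be one of: " + ", ".join(sorted(ALLOWED_CHANNELS)))
    recipient = params.get("recipient")
    if not isinstance(recipient, str) or not recipient:
        errors.append("recipient is required and must be a non-empty string")
    body = params.get("body")
    if not isinstance(body, str) or not (1 <= len(body) <= MAX_BODY_LEN):
        errors.append(f"body must be a string of 1..{MAX_BODY_LEN} characters")
    return errors


@dataclass
class ToolGateway:
    # In-memory stand-in for a durable store: idempotency key -> first result.
    _results: dict = field(default_factory=dict)
    executions: int = 0  # counts real side-effecting executions

    def call(self, tool, params, policy, task_id):
        # Deny by default: a missing policy, unknown tool, or wrong scope all fail.
        if policy is None or tool not in policy.allowed_tools:
            return {"status": "denied", "errors": ["no policy grants this tool"]}
        if params.get("workspace_id") != policy.workspace_id:
            return {"status": "denied", "errors": ["resource is outside the caller's scope"]}

        # Server-side validation: the model's output is never trusted as-is.
        errors = validate(params)
        if errors:
            return {"status": "invalid", "errors": errors}

        # Idempotency: a retry with the same key returns the stored result.
        key = idempotency_key(task_id, tool, params)
        if key in self._results:
            return self._results[key]

        self.executions += 1  # the real side effect would happen here
        result = {"status": "ok", "idempotency_key": key}
        self._results[key] = result
        return result
```

In this sketch the policy check runs before validation, so a caller without access learns nothing about the schema, and the `errors` list is what would be fed back to the model in the repair loop. A production version would likely persist the key and result in the same transaction as the side effect; that part is left out here.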