Agent Reliability Readiness Checklist (Control Plane + Evals)

Use this before you ship any agent that can take actions (create/edit/delete/send/run).

1) Define the action surface
- List every tool/action the agent can call (API endpoints, internal functions, database writes).
- Mark which actions are irreversible (payments, deletions) vs reversible (drafts, PRs).
- For each action, define the expected “undo” or rollback path (link, API, or workflow).

2) Identity + permissions
- Decide: does the agent act as the user (delegated authority) or as a service identity (bot account)?
- Enforce least privilege with explicit scopes per tool (avoid “admin” tokens).
- Use time-limited credentials and revocation mechanisms.
- Confirm tenant isolation: no shared caches or retrieval paths that can mix tenants.

3) Tool contracts (treat the model as a hostile client)
- Define strict schemas for tool inputs; reject unknown fields.
- Ensure tools return explicit error codes/messages; don’t rely on free-form text.
- Add idempotency keys for actions that can be retried.
- Add rate limits per user and per agent.

4) Human gates
- Decide which categories require approval: sending messages, merging code, config changes, financial actions.
- Implement approvals as first-class artifacts (draft, diff, approver, timestamp).
- Provide a “preview” UI for high-impact actions (who will be emailed, what will change).

5) Evaluation coverage
- Build test cases for: correct tool selection, correct arguments, and safe refusal.
- Include adversarial prompts (prompt injection attempts, role escalation requests).
- Test partial failures: timeouts, 429 rate limits, malformed tool responses.
- Run evals on every model change, prompt change, tool/schema change.

6) Observability + audit
- Log: user request ID, model response IDs, tool call payloads (redacted), tool results, side effects.
- Use trace IDs across the whole flow.
- Make logs searchable by customer, agent, tool, and time window.
- Ensure you can investigate incidents without reading sensitive user content whenever possible.

7) Memory governance
- Separate “preferences” from “facts observed”; keep memory typed and scoped.
- Prevent untrusted text from writing directly into durable memory without review.
- Provide user controls to view/edit/delete memory entries.
- Set retention and deletion policies that match your product promises.

8) Kill switches + incident response
- Implement an immediate kill switch: disable agent actions, disable specific tools, revoke tokens.
- Document the on-call owner for each agent and how to page them.
- Prepare runbooks: rollback steps, customer communication templates, and a postmortem checklist.

If you can’t complete this checklist for a proposed agent feature, don’t ship actions yet. Ship read-only assistance, add observability, then expand the action surface with gates.
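
To make items 3 and 4 concrete, here is a minimal Python sketch of a tool gateway that rejects unknown fields, returns explicit error codes, replays results for a repeated idempotency key, and blocks gated actions until an approver is named. Every name in it (ToolGateway, ToolResult, SCHEMAS, REQUIRES_APPROVAL, the send_email tool and its fields) is hypothetical and chosen only for illustration. A real system would persist idempotency keys and approvals rather than keep them in memory.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical contract: tool name -> required fields and their types.
# This sketch assumes every tool schema includes an "idempotency_key".
SCHEMAS: dict[str, dict[str, type]] = {
    "send_email": {"to": str, "subject": str, "body": str, "idempotency_key": str},
}

# Hypothetical policy: tools that need a named human approver.
REQUIRES_APPROVAL = {"send_email"}


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    code: str          # machine-readable code, e.g. "OK" or "UNKNOWN_FIELD"
    message: str = ""  # human-readable detail; callers should branch on code


class ToolGateway:
    """In-memory stand-in for the layer between the model and real tools."""

    def __init__(self) -> None:
        self._completed: dict[str, ToolResult] = {}
        self.side_effects: list[dict[str, Any]] = []

    def call(self, tool: str, args: dict[str, Any],
             approved_by: Optional[str] = None) -> ToolResult:
        # 1. Strict schema: unknown tools, unknown fields, missing fields,
        #    and wrong types are all rejected with explicit codes.
        schema = SCHEMAS.get(tool)
        if schema is None:
            return ToolResult(False, "UNKNOWN_TOOL", tool)
        unknown = sorted(set(args) - set(schema))
        if unknown:
            return ToolResult(False, "UNKNOWN_FIELD", ", ".join(unknown))
        missing = sorted(set(schema) - set(args))
        if missing:
            return ToolResult(False, "MISSING_FIELD", ", ".join(missing))
        for name, expected in schema.items():
            if not isinstance(args[name], expected):
                return ToolResult(False, "INVALID_TYPE", name)

        # 2. Idempotency: a retry with the same key returns the stored
        #    result instead of repeating the side effect.
        key = f"{tool}:{args['idempotency_key']}"
        if key in self._completed:
            return self._completed[key]

        # 3. Human gate: gated tools do nothing until an approver is named.
        if tool in REQUIRES_APPROVAL and approved_by is None:
            return ToolResult(False, "APPROVAL_REQUIRED", tool)

        # 4. Execute (simulated here by recording the side effect).
        self.side_effects.append(
            {"tool": tool, "args": dict(args), "approved_by": approved_by}
        )
        result = ToolResult(True, "OK")
        self._completed[key] = result
        return result
```

In this sketch an unapproved call is not stored under its idempotency key, so the same request can be retried once approval arrives. Only completed actions are replayed. Whether rejected attempts should also be recorded is a design choice that depends on your audit requirements.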