AGENT RUNBOOK TEMPLATE (copy/paste)

1) Purpose and scope
- Agent name:
- Owner (team + on-call rotation):
- What business outcome it drives:
- What it is explicitly NOT allowed to do:

2) Systems touched (inventory)
For each system (e.g., Salesforce, Stripe, GitHub, AWS):
- Read operations:
- Write operations:
- Data classes involved (PII, financial, secrets, IP):
- Rate limits / quotas to respect:

3) Identity and access model
- Auth mechanism per system (OAuth, service account, API key):
- Credential lifetime (short-lived vs long-lived):
- Tenant isolation approach (per-tenant creds vs shared):
- Permission allow-list (actions + resource scopes):
- Break-glass procedure (who can elevate, how logged, expiry):

4) Tooling contract (capabilities)
Define tools as capabilities, not raw endpoints. For each tool:
- Tool name:
- Allowed parameters + validation rules:
- Hard limits (amount caps, max recipients, project scope):
- Idempotency key strategy:
- Expected side effects (what changes, where):

5) Approval policy
- Actions that require human approval:
- Approval UX requirements (must show diff/impact, not reasoning text):
- Two-person rule triggers (money, access, bulk actions):
- Timeout behavior (auto-cancel vs auto-approve is usually wrong):

6) Observability and audit
- Required log fields: timestamp, tenant_id, user_id, agent_version, tool, params, result, correlation_id
- Where logs live and retention:
- How to replay a request safely (dry-run mode):
- Alerts: abnormal volume, repeated failures, unusual targets

7) Failure modes and circuit breakers
List top failure modes and the stop conditions.
- External API outage: fallback behavior:
- Partial completion: how to detect and reconcile:
- Tool call retries: max attempts, backoff, and when to stop:
- Kill switch: where it is, who can trigger, what it disables:

8) Rollback and reversibility
For each write action:
- Is rollback possible? (Y/N)
- Rollback mechanism (API undo, compensating transaction, manual steps):
- Max time window where rollback is valid:
- Customer comms plan if rollback is not possible:

9) Security review checklist (pre-ship)
- Least-privilege permissions verified
- Parameter validation for every capability
- No secrets in prompts/logs
- Tenant boundary tests (cannot access cross-tenant resources)
- Prompt injection considered at tool boundary (untrusted text cannot directly become an action)

10) Incident response (operational)
- Severity definitions for agent incidents
- On-call steps: disable agent, snapshot logs, identify affected resources
- Customer notification criteria
- Post-incident actions: permission reductions, new approval points, new tests

Ship criteria (simple rule): if you can’t explain how to stop it, trace it, and undo it, it’s not ready for production.
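
Example (illustrative only): the sketch below shows one way the tooling contract (section 4), the approval policy (section 5), the required log fields (section 6), and the kill switch (section 7) could fit together in a single capability. Everything named here is an assumption made for illustration, not part of the template: the RefundCapability class, the issue_refund tool, the two cent limits, and the in-memory stores. A real agent would call an actual payment provider and write to durable storage.

```python
import time
import uuid
from dataclasses import dataclass, field

# Hypothetical limits for illustration; real values belong in the runbook.
MAX_REFUND_CENTS = 50_000          # hard limit: blocked even with approval
APPROVAL_THRESHOLD_CENTS = 10_000  # at or above this, a human must approve


class CapabilityError(Exception):
    """Raised when a call is blocked before any side effect happens."""


@dataclass
class RefundCapability:
    agent_version: str
    kill_switch_on: bool = False                   # section 7: kill switch
    audit_log: list = field(default_factory=list)  # section 6: audit trail
    _results: dict = field(default_factory=dict)   # (tenant, key) -> prior result

    def issue_refund(self, *, tenant_id, user_id, charge_id, amount_cents,
                     idempotency_key, approved_by=None, correlation_id=None):
        correlation_id = correlation_id or str(uuid.uuid4())
        params = {"charge_id": charge_id, "amount_cents": amount_cents}
        try:
            result = self._check_and_run(tenant_id, params, idempotency_key, approved_by)
        except CapabilityError as exc:
            # Blocked calls are logged too, so they can be traced and alerted on.
            self._log(tenant_id, user_id, params, f"blocked: {exc}", correlation_id)
            raise
        self._log(tenant_id, user_id, params, result, correlation_id)
        return result

    def _check_and_run(self, tenant_id, params, idempotency_key, approved_by):
        if self.kill_switch_on:
            raise CapabilityError("kill switch is on")
        amount = params["amount_cents"]
        # Parameter validation (bool is excluded because it subclasses int).
        if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
            raise CapabilityError("amount_cents must be a positive integer")
        if amount > MAX_REFUND_CENTS:
            raise CapabilityError("amount exceeds hard limit")
        if amount >= APPROVAL_THRESHOLD_CENTS and approved_by is None:
            raise CapabilityError("human approval required")
        key = (tenant_id, idempotency_key)  # keys are scoped per tenant
        if key in self._results:
            return self._results[key]       # retry: no second side effect
        # Placeholder for the real side effect (a payment provider call).
        result = f"refunded {amount} on {params['charge_id']}"
        self._results[key] = result
        return result

    def _log(self, tenant_id, user_id, params, result, correlation_id):
        # One record per call, carrying the section 6 required fields.
        self.audit_log.append({
            "timestamp": time.time(),
            "tenant_id": tenant_id,
            "user_id": user_id,
            "agent_version": self.agent_version,
            "tool": "issue_refund",
            "params": params,
            "result": result,
            "correlation_id": correlation_id,
        })


cap = RefundCapability(agent_version="0.1.0")
print(cap.issue_refund(tenant_id="t1", user_id="u1", charge_id="ch_1",
                       amount_cents=500, idempotency_key="k1"))
# prints: refunded 500 on ch_1
```

The agent only ever sees issue_refund, never the raw payment endpoint, so the caps, the approval gate, and the kill switch cannot be bypassed by a crafted prompt. Retrying with the same idempotency key returns the earlier result instead of refunding twice.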