AGENT RUNBOOK TEMPLATE (copy/paste)

1) Purpose and scope
- Agent name:
- Owner (team + on-call rotation):
- What business outcome it drives:
- What it is explicitly NOT allowed to do:

2) Systems touched (inventory)
For each system (e.g., Salesforce, Stripe, GitHub, AWS):
- Read operations:
- Write operations:
- Data classes involved (PII, financial, secrets, IP):
- Rate limits / quotas to respect:

3) Identity and access model
- Auth mechanism per system (OAuth, service account, API key):
- Credential lifetime (short-lived vs long-lived):
- Tenant isolation approach (per-tenant creds vs shared):
- Permission allow-list (actions + resource scopes):
- Break-glass procedure (who can elevate, how logged, expiry):

4) Tooling contract (capabilities)
Define tools as capabilities, not raw endpoints.
For each tool:
- Tool name:
- Allowed parameters + validation rules:
- Hard limits (amount caps, max recipients, project scope):
- Idempotency key strategy:
- Expected side effects (what changes, where):

5) Approval policy
- Actions that require human approval:
- Approval UX requirements (must show diff/impact, not reasoning text):
- Two-person rule triggers (money, access, bulk actions):
- Timeout behavior (auto-cancel vs auto-approve is usually wrong):

6) Observability and audit
- Required log fields: timestamp, tenant_id, user_id, agent_version, tool, params, result, correlation_id
- Where logs live and retention:
- How to replay a request safely (dry-run mode):
- Alerts: abnormal volume, repeated failures, unusual targets

7) Failure modes and circuit breakers
List top failure modes and the stop conditions.
- External API outage: fallback behavior:
- Partial completion: how to detect and reconcile:
- Tool call retries: max attempts, backoff, and when to stop:
- Kill switch: where it is, who can trigger, what it disables:

8) Rollback and reversibility
For each write action:
- Is rollback possible? (Y/N)
- Rollback mechanism (API undo, compensating transaction, manual steps):
- Max time window where rollback is valid:
- Customer comms plan if rollback is not possible:

9) Security review checklist (pre-ship)
- Least-privilege permissions verified
- Parameter validation for every capability
- No secrets in prompts/logs
- Tenant boundary tests (cannot access cross-tenant resources)
- Prompt injection considered at tool boundary (untrusted text cannot directly become an action)

10) Incident response (operational)
- Severity definitions for agent incidents
- On-call steps: disable agent, snapshot logs, identify affected resources
- Customer notification criteria
- Post-incident actions: permission reductions, new approval points, new tests

Ship criteria (simple rule): if you can’t explain how to stop it, trace it, and undo it, it’s not ready for production.