ICMD Agent Operating Model (AOM) — Runbook Pack

Use this pack to operationalize AI agents with human accountability. Copy into your docs repo and require it for any agent that touches customers, money, or production.

1) One-Page Agent Runbook Template
- Agent name:
- Business purpose (one sentence):
- Owner / DRI (name + team + on-call alias):
- Users impacted (internal teams, customer segments):
- Systems touched (Jira, GitHub, Zendesk, Salesforce, prod):
- Allowed tools/actions (explicit list):
- Forbidden actions (explicit list):
- Data accessed (PII/PHI/PCI? Y/N; fields):
- Permission model (service account, OAuth scopes, token TTL):
- Rollout mode: Shadow / Draft-only / Execute
- Kill switch (flag name + who can toggle):
- Known failure modes (top 3):
- Escalation path (first 15 minutes):
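
Example (sketch): the template above can also be kept as structured data so CI can reject incomplete runbooks. This is a minimal illustration, not a required schema. The class, field, and function names (AgentRunbook, missing_fields, is_action_permitted) are assumptions made for this example, only a subset of the template fields is modeled, and the deny-by-default check is one reasonable reading of "explicit list".

```python
from dataclasses import dataclass, fields
from enum import Enum


class RolloutMode(Enum):
    SHADOW = "Shadow"
    DRAFT_ONLY = "Draft-only"
    EXECUTE = "Execute"


@dataclass
class AgentRunbook:
    # Hypothetical subset of the one-page template fields.
    agent_name: str
    business_purpose: str
    owner: str
    allowed_actions: list[str]
    forbidden_actions: list[str]
    rollout_mode: RolloutMode
    kill_switch_flag: str
    escalation_path: str


def missing_fields(runbook: AgentRunbook) -> list[str]:
    """Return the names of fields left blank (empty string or empty list)."""
    missing = []
    for f in fields(runbook):
        value = getattr(runbook, f.name)
        if isinstance(value, str) and not value.strip():
            missing.append(f.name)
        elif isinstance(value, list) and not value:
            missing.append(f.name)
    return missing


def is_action_permitted(runbook: AgentRunbook, action: str) -> bool:
    """Deny by default: the action must be listed as allowed and not forbidden."""
    return (
        action in runbook.allowed_actions
        and action not in runbook.forbidden_actions
    )
```

A pre-merge check could fail the build whenever missing_fields returns a non-empty list, which enforces "no runbook, no ship" mechanically.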

2) SLO + Audit Template (pick 2–3 per agent)
- Outcome metrics:
  a) Accuracy (min): ____ (e.g., >=0.98 factual accuracy in audits)
  b) Quality (max): ____ (e.g., <=2 tone complaints/week)
  c) Speed (target): ____ (e.g., first-response time reduced by 30%)
- Audit plan:
  - Sampling rate: ____ items/week
  - Reviewer(s): ____
  - Pass/fail rubric link:
  - Where results are stored (sheet/dashboard):
- Error budget rule:
  - If an SLO is breached for ____ days OR the agent causes ____ Sev2 incidents, the agent moves to Draft-only.
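
Example (sketch): the audit score and error budget rule above can be evaluated in a few lines. The names here (ErrorBudgetRule, audit_accuracy, next_rollout_mode) are hypothetical. The sketch assumes the two blanks are inclusive thresholds and that only agents in Execute mode are demoted; adjust both assumptions to match your policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetRule:
    max_breach_days: int     # first blank in the rule (assumed inclusive)
    max_sev2_incidents: int  # second blank in the rule (assumed inclusive)


def audit_accuracy(sample_passed: list[bool]) -> float:
    """Share of audited items that passed the rubric."""
    if not sample_passed:
        raise ValueError("audit sample is empty")
    return sum(sample_passed) / len(sample_passed)


def next_rollout_mode(
    current_mode: str,
    breach_days: int,
    sev2_incidents: int,
    rule: ErrorBudgetRule,
) -> str:
    """Demote an Execute-mode agent to Draft-only once the budget is spent."""
    exhausted = (
        breach_days >= rule.max_breach_days
        or sev2_incidents >= rule.max_sev2_incidents
    )
    if exhausted and current_mode == "Execute":
        return "Draft-only"
    # Shadow and Draft-only agents stay where they are (assumption).
    return current_mode
```

For instance, with ErrorBudgetRule(max_breach_days=3, max_sev2_incidents=2), an Execute-mode agent with three breach days is returned as "Draft-only".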

3) Change Management Checklist (required before any prompt/model/tool change)
- Prompt/policy version updated in git
- Peer review completed (name/date)
- Security review needed? Y/N (required if permissions or data access change)
- Canary plan: ____% for ____ days
- Shadow-mode evaluation run? Y/N; link to results
- Rollback plan tested (kill switch + comms)
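
Example (sketch): one common way to implement the canary percentage is deterministic hash bucketing, so the same ticket or user always lands on the same side of the split. This is an illustration under assumptions, not a prescribed mechanism: in_canary and the salt value are hypothetical, and the salt should change with each rollout so cohorts are reshuffled.

```python
import hashlib


def in_canary(item_id: str, canary_percent: float, salt: str = "change-001") -> bool:
    """Return True if item_id falls inside the canary cohort.

    canary_percent is on a 0-100 scale. The salt (hypothetical value) ties
    the cohort to one specific change.
    """
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100
```

Items outside the cohort keep running the previous prompt/model/tool version, which also gives the rollback plan a known-good baseline to compare against.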

4) Incident Playbook (Agent-caused customer or prod impact)
- Trigger criteria (examples): wrong email sent, unauthorized data access, bad deploy suggestion merged, refund error
- First 5 minutes:
  1) Disable agent (kill switch)
  2) Preserve logs (prompts, tool calls, outputs)
  3) Assign incident commander + scribe
- First 30 minutes:
  - Identify blast radius (customers, endpoints, dollars)
  - Name the customer comms decision owner
  - Containment steps (rate limit, revoke tokens, block tool calls)
- Post-incident (within 5 business days):
  - Postmortem with “why allowed?” and “what control failed?” sections
  - Update runbook, SLOs, guardrails; add a regression test
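
Example (sketch): the "disable agent" and "block tool calls" steps above work best when every tool call passes through a single gate that checks the kill switch. The in-memory store below stands in for whatever feature-flag system you actually use. KillSwitch, AgentDisabledError, and guarded_tool_call are hypothetical names for this illustration.

```python
class AgentDisabledError(RuntimeError):
    """Raised when a disabled agent attempts a tool call."""


class KillSwitch:
    """In-memory stand-in for a real feature-flag service (assumption)."""

    def __init__(self):
        self._disabled = {}

    def disable(self, agent_name, by, reason):
        # Record who pulled the switch and why, for the postmortem.
        self._disabled[agent_name] = {"by": by, "reason": reason}

    def enable(self, agent_name):
        self._disabled.pop(agent_name, None)

    def is_disabled(self, agent_name):
        return agent_name in self._disabled


def guarded_tool_call(switch, agent_name, tool, *args, **kwargs):
    """Run tool(*args, **kwargs) only if the agent has not been disabled."""
    if switch.is_disabled(agent_name):
        raise AgentDisabledError(f"{agent_name} is disabled by kill switch")
    return tool(*args, **kwargs)
```

Because the check happens at call time, flipping the switch stops in-flight workflows at their next tool call rather than only blocking new sessions.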

5) Governance Bar (minimum viable controls)
- Dedicated agent identity (no shared human tokens)
- Least privilege scopes + quarterly access review
- Log retention 30–90 days with correlation IDs
- PII redaction for stored logs
- Rollout modes + canary support
- Kill switch owned by on-call
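
Example (sketch): correlation IDs and PII redaction can be applied at the point where a log record is written. The patterns below are deliberately simple approximations (emails and 13 to 16 digit card-like numbers) and will miss other PII; treat them as a starting point, not a compliant redactor. The function names redact and log_record are hypothetical.

```python
import json
import re
import uuid

# Approximate patterns; real deployments need a broader, reviewed set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


def new_correlation_id() -> str:
    """One ID per agent task, attached to every prompt, tool call, and output."""
    return uuid.uuid4().hex


def log_record(correlation_id: str, event: str, text: str) -> str:
    """Build a JSON log line whose free-text field is redacted before storage."""
    return json.dumps(
        {"correlation_id": correlation_id, "event": event, "text": redact(text)}
    )
```

Reusing one correlation ID across the prompt, each tool call, and the final output makes it possible to reconstruct a full agent action during an incident without storing raw PII.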

30-Day Deployment Plan (summary)
Week 1: Inventory all agents/workflows; classify blast radius.
Week 2: Assign DRIs; write runbooks; add permissions discipline.
Week 3: Define SLOs; start audits; create dashboards.
Week 4: Implement kill switches, canaries, and incident simulation.

If you can’t write the one-page runbook, don’t ship the agent.