BOUNDED AGENT PRODUCTION READINESS CHECKLIST (v1)

Use this as a gate before an “agent” is allowed to touch real systems (tickets, emails, refunds, deploys, CRM updates). If you can’t fill a line item, the agent is not ready.

1) SCOPE & AUTHORITY
- Single-sentence job: what outcome does the agent own?
- Explicit “won’t do” list (systems, actions, user segments).
- Definition of DONE (machine-checkable if possible).

2) STATE MODEL (WRITE IT DOWN)
- Persisted run state exists outside the LLM (DB/workflow engine).
- State schema includes: current step, inputs, tool results, errors, and a terminal status.
- Deterministic transition rules: what events move the run forward vs stop it.

3) TOOL CONTRACTS
- Every tool has a versioned schema (args + return shape).
- Tool calls are validated before execution (type checks, required fields).
- Timeouts and retries are specified per tool.

4) SIDE EFFECT SAFETY
- All write operations are idempotent (idempotency key strategy documented).
- Audit log captures: who initiated, what changed, when, and which tool executed.
- Dry-run mode exists for high-risk actions.

5) PERMISSIONS & SECRETS
- Least privilege: no shared “agent-admin” credential.
- Clear impersonation model (which user is the action taken as?).
- Secrets rotation path documented; revoked credentials fail safely.

6) EVALUATION & REGRESSION TESTS
- You maintain a small, real test set of representative tasks.
- Acceptance criteria are explicit (format, citations, required fields, constraints).
- Any prompt/tool change triggers regression runs (CI job or release checklist).

7) HUMAN CHECKPOINTS
- Actions are categorized by risk (low/medium/high).
- High-risk actions require pre-flight approval OR mid-flight checkpoint.
- Review UI shows plan, diffs/side effects, and supporting evidence.

8) OBSERVABILITY
- Trace ID links model calls, tool calls, and workflow steps.
- Structured logs exist (not just chat transcripts).
- Error taxonomy is defined (policy block, tool timeout, schema mismatch, eval fail).

9) FAILURE MODES
- The agent can stop safely with a clear reason (no silent success).
- Partial completion behavior is defined (compensations, rollbacks, or “needs human”).
- Rate limit and upstream outage behavior is defined.

10) DEPLOYMENT HYGIENE
- Environment separation (sandbox vs production tools).
- Feature flags or kill switch exists.
- Rollout plan includes monitoring and a revert path.

If your agent passes everything above, it’s not “smart.” It’s controllable. That’s the point.
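
APPENDIX: A MINIMAL SKETCH OF ITEMS 2, 3, AND 4

To make the state model, tool contract, and idempotency items concrete, here is one possible shape in Python. It is a sketch under assumptions, not a reference implementation: the state lives in memory here, whereas item 2 asks for it to be persisted in a DB or workflow engine. Every name below (RunState, TRANSITIONS, REFUND_V1, validate_args, idempotency_key, and the event strings) is hypothetical and chosen for illustration. Treating "needs human" as a terminal status for the automated run is also an assumption.

```python
import hashlib
import json
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    NEEDS_HUMAN = "needs_human"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Assumption: once a run reaches one of these, the agent takes no further steps.
TERMINAL = {Status.NEEDS_HUMAN, Status.SUCCEEDED, Status.FAILED}

# Deterministic transition table: (current status, event) -> next status.
# Event names reuse the error taxonomy from item 8 where possible.
TRANSITIONS = {
    (Status.RUNNING, "step_ok"): Status.RUNNING,
    (Status.RUNNING, "done_check_passed"): Status.SUCCEEDED,
    (Status.RUNNING, "tool_timeout"): Status.FAILED,
    (Status.RUNNING, "schema_mismatch"): Status.FAILED,
    (Status.RUNNING, "policy_block"): Status.NEEDS_HUMAN,
}


@dataclass
class RunState:
    """Run state kept outside the LLM (item 2)."""
    run_id: str
    current_step: int = 0
    inputs: dict = field(default_factory=dict)
    tool_results: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    status: Status = Status.RUNNING


def apply_event(state: RunState, event: str) -> RunState:
    """Advance or stop the run; unknown events and finished runs are rejected."""
    if state.status in TERMINAL:
        raise ValueError(f"run {state.run_id} already ended: {state.status.value}")
    key = (state.status, event)
    if key not in TRANSITIONS:
        raise ValueError(f"no transition defined for event {event!r}")
    state.status = TRANSITIONS[key]
    if event == "step_ok":
        state.current_step += 1
    elif event != "done_check_passed":
        state.errors.append(event)  # record why the run stopped
    return state


# Hypothetical versioned tool schema (item 3): field name -> required type.
REFUND_V1 = {"order_id": str, "amount_cents": int}


def validate_args(schema: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call may execute."""
    problems = []
    for name, expected in schema.items():
        if name not in args:
            problems.append(f"missing field: {name}")
        elif type(args[name]) is not expected:
            problems.append(f"wrong type for {name}")
    for name in args:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


def idempotency_key(run_id: str, step: int, tool: str, args: dict) -> str:
    """Same run, step, tool, and args always give the same key (item 4)."""
    payload = json.dumps([run_id, step, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A caller would run validate_args before every tool call, send the idempotency key with each write so a retried step cannot apply the same change twice, and feed the outcome into apply_event. Keeping the transitions in a table rather than in prompt text is the design choice that makes "what moves the run forward vs stops it" reviewable and testable.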