BOUNDED AGENT PRODUCTION READINESS CHECKLIST (v1)

Use this as a gate before an “agent” is allowed to touch real systems (tickets, emails, refunds, deploys, CRM updates). If you can’t fill a line item, the agent is not ready.

1) SCOPE & AUTHORITY
- Single-sentence job: what outcome does the agent own?
- Explicit “won’t do” list (systems, actions, user segments).
- Definition of DONE (machine-checkable if possible).

2) STATE MODEL (WRITE IT DOWN)
- Persisted run state exists outside the LLM (DB/workflow engine).
- State schema includes: current step, inputs, tool results, errors, and a terminal status.
- Deterministic transition rules: what events move the run forward vs stop it.
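
One way to write the state model down is as a record plus a transition table. The sketch below is illustrative only: the status names, event names, and the `RunState` and `apply_event` helpers are assumptions, not a prescribed schema, and a real system would persist the record in a DB or workflow engine rather than in memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    NEEDS_HUMAN = "needs_human"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Hypothetical transition table: (current status, event) -> next status.
TRANSITIONS = {
    (Status.RUNNING, "step_ok"): Status.RUNNING,
    (Status.RUNNING, "done_check_passed"): Status.SUCCEEDED,
    (Status.RUNNING, "tool_error"): Status.FAILED,
    (Status.RUNNING, "approval_required"): Status.NEEDS_HUMAN,
    (Status.NEEDS_HUMAN, "approved"): Status.RUNNING,
    (Status.NEEDS_HUMAN, "rejected"): Status.FAILED,
}

TERMINAL = {Status.SUCCEEDED, Status.FAILED}


@dataclass
class RunState:
    run_id: str
    current_step: int = 0
    inputs: dict = field(default_factory=dict)
    tool_results: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    status: Status = Status.RUNNING


def apply_event(state: RunState, event: str) -> RunState:
    """Advance the run by table lookup; unknown events are rejected, not guessed."""
    if state.status in TERMINAL:
        raise ValueError(f"run {state.run_id} is terminal ({state.status.value})")
    key = (state.status, event)
    if key not in TRANSITIONS:
        raise ValueError(f"no transition for {event!r} from {state.status.value}")
    if event == "step_ok":
        state.current_step += 1
    state.status = TRANSITIONS[key]
    return state
```

Because the table is data, it can be reviewed and diffed like any other config, and the LLM never decides on its own whether a run is finished.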

3) TOOL CONTRACTS
- Every tool has a versioned schema (args + return shape).
- Tool calls are validated before execution (type checks, required fields).
- Timeouts and retries are specified per tool.
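
A tool contract can be as small as a frozen record plus a validator that runs before the call. This is a minimal sketch using only the standard library; `ToolSpec`, `validate_call`, and the `issue_refund` tool are hypothetical names, and a real setup might use JSON Schema or a validation library instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    version: str
    required_args: dict  # arg name -> expected Python type
    timeout_s: float
    max_retries: int


# Hypothetical tool; the name, fields, and limits are illustrative only.
REFUND_V1 = ToolSpec(
    name="issue_refund",
    version="1.0.0",
    required_args={"order_id": str, "amount_cents": int},
    timeout_s=10.0,
    max_retries=2,
)


def validate_call(spec: ToolSpec, args: dict) -> list:
    """Return a list of problems; an empty list means the call may execute."""
    problems = []
    for name, expected in spec.required_args.items():
        if name not in args:
            problems.append(f"missing required field: {name}")
        # Exact type match, so True is not silently accepted as an int.
        elif type(args[name]) is not expected:
            problems.append(f"{name}: expected {expected.__name__}")
    for name in args:
        if name not in spec.required_args:
            problems.append(f"unexpected field: {name}")
    return problems
```

Rejecting unexpected fields is a design choice: it catches a model inventing arguments the tool would otherwise ignore.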

4) SIDE EFFECT SAFETY
- All write operations are idempotent (idempotency key strategy documented).
- Audit log captures: who initiated, what changed, when, and which tool executed.
- Dry-run mode exists for high-risk actions.
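
One possible idempotency key strategy is to hash what the write is (run, step, tool, arguments) rather than when it was attempted, so a retry maps to the same key. The sketch below assumes an in-memory dict as the store and a hypothetical `execute_write` wrapper; production would need a durable store with a uniqueness guarantee on the key.

```python
import hashlib
import json

# In-memory stand-in for a durable store (assumption: a real system would use
# a DB table with a unique constraint on the key).
_completed = {}


def idempotency_key(run_id: str, step: int, tool: str, args: dict) -> str:
    """Derive a stable key; sort_keys makes argument order irrelevant."""
    payload = json.dumps([run_id, step, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def execute_write(run_id, step, tool, args, do_write, dry_run=False):
    key = idempotency_key(run_id, step, tool, args)
    if key in _completed:
        # Replay: return the recorded result instead of writing a second time.
        return _completed[key]
    if dry_run:
        # Describe the side effect without performing or recording it.
        return {"dry_run": True, "would_call": tool, "args": args}
    result = do_write(**args)
    _completed[key] = result
    return result
```

If the downstream API accepts its own idempotency key, passing this value through gives protection on both sides of the call.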

5) PERMISSIONS & SECRETS
- Least privilege: no shared “agent-admin” credential.
- Clear impersonation model (which user is the action taken as?).
- Secrets rotation path documented; revoked credentials fail safely.

6) EVALUATION & REGRESSION TESTS
- A small test set of real, representative tasks is maintained.
- Acceptance criteria are explicit (format, citations, required fields, constraints).
- Any prompt/tool change triggers regression runs (CI job or release checklist).
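
Explicit acceptance criteria can be expressed as data and checked by a small harness that a CI job runs on every prompt or tool change. This is a sketch under assumptions: the criteria keys (`required_fields`, `max_reply_chars`), the case shape, and `agent_fn` are all hypothetical.

```python
def check_output(output: dict, criteria: dict) -> list:
    """Return the list of failed acceptance criteria for one agent output."""
    failures = []
    for name in criteria.get("required_fields", []):
        if output.get(name) in (None, "", []):
            failures.append(f"missing field: {name}")
    limit = criteria.get("max_reply_chars")
    if limit is not None and len(output.get("reply", "")) > limit:
        failures.append(f"reply longer than {limit} chars")
    return failures


def run_regression(agent_fn, cases: list) -> dict:
    """Run every case through agent_fn; return failures keyed by case id."""
    report = {}
    for case in cases:
        failures = check_output(agent_fn(case["input"]), case["criteria"])
        if failures:
            report[case["id"]] = failures
    return report
```

An empty report means every case passed; anything else should block the release.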

7) HUMAN CHECKPOINTS
- Actions are categorized by risk (low/medium/high).
- High-risk actions require pre-flight approval or a mid-flight checkpoint.
- Review UI shows plan, diffs/side effects, and supporting evidence.
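
The risk tiers can be enforced with a lookup that runs before any tool call. In this sketch the action names and the `may_execute` helper are hypothetical; the one deliberate choice is that an unclassified action is treated as high risk, so new tools fail closed until someone categorizes them.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical action names, for illustration only.
ACTION_RISK = {
    "add_ticket_comment": Risk.LOW,
    "update_crm_field": Risk.MEDIUM,
    "issue_refund": Risk.HIGH,
}


def may_execute(action: str, approved_by: Optional[str] = None) -> bool:
    """High-risk (and unknown) actions need a named approver; others pass."""
    risk = ACTION_RISK.get(action, Risk.HIGH)
    if risk is Risk.HIGH:
        return approved_by is not None
    return True
```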

8) OBSERVABILITY
- Trace ID links model calls, tool calls, and workflow steps.
- Structured logs exist (not just chat transcripts).
- Error taxonomy is defined (policy block, tool timeout, schema mismatch, eval fail).
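
A structured log can be one JSON line per event, with the trace ID and an error code from a fixed taxonomy. The sketch below uses the four error kinds listed above; the field names and the `log_event` helper are assumptions, and printing to stdout stands in for a real log pipeline.

```python
import json
import time
import uuid
from enum import Enum


class ErrorKind(Enum):
    POLICY_BLOCK = "policy_block"
    TOOL_TIMEOUT = "tool_timeout"
    SCHEMA_MISMATCH = "schema_mismatch"
    EVAL_FAIL = "eval_fail"


def new_trace_id() -> str:
    return uuid.uuid4().hex


def log_event(trace_id, kind, name, error=None, **fields):
    """Emit one JSON line; kind is e.g. model_call, tool_call, workflow_step."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "kind": kind,
        "name": name,
        "error": error.value if error else None,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

Generating the trace ID once per run and passing it to every call is what lets a single query reconstruct the whole run.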

9) FAILURE MODES
- The agent can stop safely with a clear reason (no silent success).
- Partial completion behavior is defined (compensations, rollbacks, or “needs human”).
- Rate limit and upstream outage behavior is defined.
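
Defined outage behavior usually means bounded retries followed by an explicit non-success status. A minimal sketch, assuming a hypothetical `UpstreamUnavailable` exception raised by the tool wrapper and exponential backoff as the retry policy:

```python
import time


class UpstreamUnavailable(Exception):
    """Hypothetical error a tool wrapper raises on a rate limit or outage."""


def call_with_backoff(fn, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Retry a flaky call; on exhaustion, return an explicit non-success outcome."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"status": "ok", "result": fn(), "attempts": attempt}
        except UpstreamUnavailable as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Delay doubles each time: base, 2x base, 4x base, ...
                sleep(base_delay_s * 2 ** (attempt - 1))
    # Never report success here: hand off with a reason a human can act on.
    return {
        "status": "needs_human",
        "reason": f"upstream unavailable: {last_error}",
        "attempts": max_attempts,
    }
```

The `sleep` parameter is injectable so the policy can be tested without waiting.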

10) DEPLOYMENT HYGIENE
- Environment separation (sandbox vs production tools).
- A feature flag or kill switch exists.
- Rollout plan includes monitoring and a revert path.
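
A kill switch only counts if every tool call checks it. The sketch below assumes the flag is an environment variable named `AGENT_ENABLED` (a hypothetical name; a real deployment would more likely read a feature-flag service) and defaults to disabled when the flag is missing.

```python
import os


class KillSwitchEngaged(Exception):
    """Raised instead of executing a tool while the agent is disabled."""


def agent_enabled(env=os.environ) -> bool:
    # Assumption: AGENT_ENABLED is the flag name; a missing flag means "off".
    return env.get("AGENT_ENABLED", "false").strip().lower() == "true"


def guarded_execute(tool_fn, *args, env=os.environ, **kwargs):
    """Run tool_fn only while the flag is on; otherwise stop loudly."""
    if not agent_enabled(env):
        raise KillSwitchEngaged("agent disabled by kill switch")
    return tool_fn(*args, **kwargs)
```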

If your agent passes everything above, it’s not “smart.” It’s controllable. That’s the point.