Enterprise Agent Launch Checklist (2026)

Use this checklist to move from a working demo to a durable enterprise deployment.

1) Scope and success definition
- Define the job-to-be-done in one sentence (e.g., “reconcile invoices to POs and route exceptions”).
- Define “success” in measurable terms: completion criteria, acceptable error rate, and who signs off.
- Tier actions (read-only, low-risk write, high-risk write). Decide which tiers require approvals.

2) Tooling and contracts
- Make every tool call schema-driven (JSON Schema / function calling). No free-form parsing.
- Enforce idempotency for write actions (idempotency keys for tickets, refunds, provisioning).
- Add timeouts, retries, and backoff policies per tool. Document expected failure modes.

3) Security and governance
- Apply least privilege: separate read and write credentials; narrow scopes for mutation tools.
- Add a policy layer that checks every tool call (tool allowlist plus parameter validation).
- Build prompt-injection defenses for untrusted inputs (emails, PDFs, web pages): isolate content, treat retrieved text as untrusted, and block instruction-like patterns from influencing tool selection.
- Ensure tenant isolation: no cross-tenant memory, logs, or caches.

4) Evaluation and reliability
- Create an evaluation suite of 100–1,000 representative tasks (include messy inputs and edge cases).
- Add a simulator for tool failures (timeouts, 500s, permission denied) and verify recovery paths.
- Track SLOs: task success rate, p95 latency, cost per successful task, escalation rate, and tool-call failure rate.
- Make evals part of CI: every prompt, model, or tool change triggers regression tests.

5) Cost controls
- Implement routing: send extraction and formatting to smaller models; reserve frontier models for hard reasoning.
- Set per-task budgets (soft and hard). Define behavior when a budget is exceeded (fallback model or human escalation).
- Reduce token bloat: compress prompts, remove redundant context, cap max tokens per step.
- Add caching where safe: embeddings, retrieval results for common queries, deterministic tool outputs.

6) Human-in-the-loop UX
- Provide an approval UI for high-risk actions with evidence: retrieved sources, proposed action, exact parameters.
- Make runs replayable (run IDs, trace view). Allow operators to annotate failure reasons.
- Build clear escalation paths: who is paged, what context is included, and how to resume safely.

7) Observability and audit
- Capture end-to-end traces: user request, model/version, retrieved evidence, tool calls, outputs.
- Store immutable audit logs for high-impact actions (and define retention policies).
- Set alerts: anomaly detection on action volume, cost spikes, unusual tools, or repeated failures.

Exit criteria for production
- Meets the success-rate target on production-like evals (typically ≥98–99%).
- High-risk writes are approval-gated and auditable.
- Cost per successful task is within budget at expected scale.
- An on-call playbook exists and has been tested via at least one game day.
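Several of the controls above compose naturally into a single gate that every proposed tool call passes through before execution: the action tiers and approvals from section 1, the allowlist and parameter validation from section 3, idempotency keys from section 2, and per-task budgets from section 5. The sketch below shows one way to wire them together; the tool names, tiers, and cost estimates are illustrative assumptions, not from any specific framework.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical tool registry: names, risk tiers, and allowed parameter sets
# are illustrative, not tied to any real product or API.
TOOL_POLICY = {
    "search_invoices": {"tier": "read",            "params": {"query"}},
    "create_ticket":   {"tier": "low_risk_write",  "params": {"title", "body"}},
    "issue_refund":    {"tier": "high_risk_write", "params": {"invoice_id", "amount"}},
}

APPROVAL_REQUIRED = {"high_risk_write"}  # tiers that need a human sign-off


@dataclass
class PolicyGate:
    budget_usd: float                 # hard per-task budget
    spent_usd: float = 0.0
    seen_keys: set = field(default_factory=set)  # idempotency keys already executed

    def check(self, tool: str, params: dict, est_cost_usd: float,
              approved: bool = False) -> None:
        """Validate a proposed tool call before execution; raises on any violation."""
        policy = TOOL_POLICY.get(tool)
        if policy is None:
            raise PermissionError(f"tool not on allowlist: {tool}")
        extra = set(params) - policy["params"]
        if extra:
            raise ValueError(f"unexpected parameters for {tool}: {extra}")
        if policy["tier"] in APPROVAL_REQUIRED and not approved:
            raise PermissionError(f"{tool} is {policy['tier']}: approval required")
        if self.spent_usd + est_cost_usd > self.budget_usd:
            raise RuntimeError("per-task budget exceeded: fall back or escalate")
        self.spent_usd += est_cost_usd

    def idempotency_key(self, tool: str, params: dict) -> str:
        """Deterministic key so a retried write is applied at most once."""
        payload = json.dumps({"tool": tool, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute_once(self, tool: str, params: dict, fn):
        """Run fn(**params) unless this exact write was already executed."""
        key = self.idempotency_key(tool, params)
        if key in self.seen_keys:
            return None  # duplicate write suppressed
        self.seen_keys.add(key)
        return fn(**params)
```

In practice the registry would be generated from the same JSON Schemas used for function calling, approvals would come from the approval UI in section 6, and every `check` outcome would be written to the audit log from section 7, but the control flow stays the same: validate, gate, then execute exactly once.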