Enterprise Agent Launch Checklist (2026)

Use this checklist to move from a working demo to a durable enterprise deployment.

1) Scope and success definition
- Define the job-to-be-done in one sentence (e.g., “reconcile invoices to POs and route exceptions”).
- Define “success” in measurable terms: completion criteria, acceptable error rate, and who signs off.
- Tier actions (read-only, low-risk write, high-risk write). Decide which tiers require approvals.

2) Tooling and contracts
- Make every tool call schema-driven (JSON Schema / function calling). No free-form parsing.
- Enforce idempotency for write actions (idempotency keys for tickets, refunds, provisioning).
- Add timeouts, retries, and backoff policies per tool. Document expected failure modes.

3) Security and governance
- Apply least privilege: separate read and write credentials; narrow scopes for mutation tools.
- Add a policy layer that checks every tool call (tool allowlist plus parameter validation).
- Build prompt-injection defenses for untrusted inputs (emails, PDFs, web pages): isolate content, treat retrieved text as untrusted, and block instruction-like patterns from influencing tool selection.
- Ensure tenant isolation: no cross-tenant memory, logs, or caches.

4) Evaluation and reliability
- Create an evaluation suite of 100–1,000 representative tasks (include messy inputs and edge cases).
- Add a simulator for tool failures (timeouts, 500s, permission denied) and verify recovery paths.
- Track SLOs: task success rate, p95 latency, cost per successful task, escalation rate, and tool-call failure rate.
- Make evals part of CI: every prompt, model, or tool change triggers regression tests.

5) Cost controls
- Implement routing: send extraction and formatting to smaller models; reserve frontier models for hard reasoning.
- Set per-task budgets (soft and hard). Define behavior when a budget is exceeded (fallback model or human escalation).
- Reduce token bloat: compress prompts, remove redundant context, cap max tokens per step.
- Add caching where safe: embeddings, retrieval results for common queries, deterministic tool outputs.

6) Human-in-the-loop UX
- Provide an approval UI for high-risk actions with evidence: retrieved sources, proposed action, exact parameters.
- Make runs replayable (run IDs, trace view). Allow operators to annotate failure reasons.
- Build clear escalation paths: who is paged, what context is included, and how to resume safely.

7) Observability and audit
- Capture end-to-end traces: user request, model/version, retrieved evidence, tool calls, outputs.
- Store immutable audit logs for high-impact actions (and define retention policies).
- Set alerts: anomaly detection on action volume, cost spikes, unusual tools, or repeated failures.

Exit criteria for production
- Meets the success-rate target on production-like evals (typically ≥98–99%).
- High-risk writes are approval-gated and auditable.
- Cost per successful task is within budget at expected scale.
- An on-call playbook exists and has been tested via at least one game day.
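Several of the controls above compose naturally into a single gate that every proposed tool call passes through before execution: the action tiers and approvals from section 1, the allowlist and parameter validation from section 3, idempotency keys from section 2, and per-task budgets from section 5. The sketch below shows one way to wire them together; the tool names, tiers, and cost estimates are illustrative assumptions, not from any specific framework.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Hypothetical tool registry: names, risk tiers, and allowed parameter sets
# are illustrative, not tied to any real product or API.
TOOL_POLICY = {
    "search_invoices": {"tier": "read",            "params": {"query"}},
    "create_ticket":   {"tier": "low_risk_write",  "params": {"title", "body"}},
    "issue_refund":    {"tier": "high_risk_write", "params": {"invoice_id", "amount"}},
}

APPROVAL_REQUIRED = {"high_risk_write"}  # tiers that need a human sign-off


@dataclass
class PolicyGate:
    budget_usd: float                 # hard per-task budget
    spent_usd: float = 0.0
    seen_keys: set = field(default_factory=set)  # idempotency keys already executed

    def check(self, tool: str, params: dict, est_cost_usd: float,
              approved: bool = False) -> None:
        """Validate a proposed tool call before execution; raises on any violation."""
        policy = TOOL_POLICY.get(tool)
        if policy is None:
            raise PermissionError(f"tool not on allowlist: {tool}")
        extra = set(params) - policy["params"]
        if extra:
            raise ValueError(f"unexpected parameters for {tool}: {extra}")
        if policy["tier"] in APPROVAL_REQUIRED and not approved:
            raise PermissionError(f"{tool} is {policy['tier']}: approval required")
        if self.spent_usd + est_cost_usd > self.budget_usd:
            raise RuntimeError("per-task budget exceeded: fall back or escalate")
        self.spent_usd += est_cost_usd

    def idempotency_key(self, tool: str, params: dict) -> str:
        """Deterministic key so a retried write is applied at most once."""
        payload = json.dumps({"tool": tool, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def execute_once(self, tool: str, params: dict, fn):
        """Run fn(**params) unless this exact write was already executed."""
        key = self.idempotency_key(tool, params)
        if key in self.seen_keys:
            return None  # duplicate write suppressed
        self.seen_keys.add(key)
        return fn(**params)
```

In practice the registry would be generated from the same JSON Schemas used for function calling, approvals would come from the approval UI in section 6, and every `check` outcome would be written to the audit log from section 7, but the control flow stays the same: validate, gate, then execute exactly once.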