Agentic AI Production Readiness Template (2026)

Use this template to move from prototype to production without reliability, security, or cost blowups.

1) Scope & Success Definition
- Pick ONE workflow (e.g., “triage IT tickets” or “draft support replies with citations”).
- Define success metrics with thresholds:
  - Task success rate (offline eval): ____% (target: 85–95%, depending on risk)
  - Policy violations per tool call: ____% (target: <0.1%)
  - P95 time-to-resolution: ____ seconds (target: <120s for interactive use)
  - P95 cost per task: $____ (target: set a hard cap)
- Define “stop conditions” (when the agent must ask or escalate): missing required fields, contradictory user input, tool errors after >N retries.

2) Tooling Design (Least Privilege)
- Expose narrow tools, not broad endpoints.
- Add server-side validation (amount limits, schema checks, row-level security).
- Prefer idempotent actions; include a “dry_run=true” mode.
- Map agent identity to real RBAC/IAM roles; rotate secrets and avoid long-lived tokens.

3) Policy & Approvals
- Categorize actions into tiers:
  - Tier 0: Read-only (safe)
  - Tier 1: Low-risk writes (e.g., create a ticket)
  - Tier 2: Financial/legal/security impact (refunds, account changes, firewall rules)
- For Tier 2, require approval above a threshold: $____ or risk score >____.
- Implement hard session budgets:
  - Max tool calls: ____
  - Max wall-clock seconds: ____
  - Max model spend: $____

4) Evaluation & Testing
- Build a golden eval set (100–1,000 tasks) with edge cases and past incidents.
- Run nightly regression tests; pin model versions for critical flows.
- Add adversarial tests: prompt injection in retrieved docs, malicious URLs, conflicting instructions.
- Define acceptance gates for release: no regression >____% on key metrics.

5) Observability & Audit
- Trace every task with a correlation ID.
- Log: tool name, parameters, result status, retrieved sources/citations, policy decision, model and prompt versions.
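A minimal sketch of the logging fields listed above, as one structured audit record per tool call. All field names, and the `log_tool_call` helper itself, are illustrative assumptions, not a standard schema; align them with your own tracing/SIEM conventions.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent.audit")

def log_tool_call(correlation_id, tool_name, params, status,
                  sources, policy_decision, model_version, prompt_version):
    """Emit one structured audit record per tool call (illustrative schema)."""
    record = {
        "ts": time.time(),
        "correlation_id": correlation_id,    # ties every event to one task
        "tool": tool_name,
        "params": params,                    # redact secrets/PII before logging
        "status": status,                    # e.g. "ok", "error", "denied"
        "sources": sources,                  # retrieved citations, if any
        "policy_decision": policy_decision,  # e.g. "allowed", "needs_approval"
        "model_version": model_version,
        "prompt_version": prompt_version,
    }
    logger.info(json.dumps(record))
    return record

# Usage: mint one correlation ID per task and reuse it across all its tool calls.
task_id = str(uuid.uuid4())
log_tool_call(task_id, "create_ticket", {"queue": "IT"}, "ok",
              [], "allowed", "model-2026-01", "prompt-v12")
```

Keeping the record flat and JSON-serializable makes it easy to ship to whatever log pipeline or dashboard backs the metrics in this section.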
- Produce an “action ledger” for write operations (who/what/when/why).
- Create dashboards: success rate, escalation rate, tool error rate, cost per task, unsafe action attempts.

6) Rollout Plan
- Phase 1: Internal users only, read-only.
- Phase 2: Low-risk writes with approvals and tight caps.
- Phase 3: Expand traffic via canary (1–5%); auto-rollback on error spikes.
- Establish incident response: an on-call owner, runbooks, and postmortems.

7) Operations (Ongoing)
- Prompt and policy changes require PR review and a changelog entry.
- Quarterly access review and security assessment.
- Monthly cost review: top workflows by spend, top tools by retries, escalation root causes.

If you can’t measure it (success, cost, and policy compliance), you can’t ship it safely.
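The hard session budgets from section 3 can be enforced with a small guard checked before every tool or model invocation. This is a sketch under assumed caps; the `SessionBudget` class, its limits, and the `charge` method are all illustrative names, not part of any framework.

```python
import time

class BudgetExceeded(Exception):
    """Raised when a session hits a hard cap; the agent must stop and escalate."""

class SessionBudget:
    # Illustrative defaults; fill in the template values for your workflow.
    def __init__(self, max_tool_calls=20, max_seconds=120, max_spend_usd=0.50):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.max_spend_usd = max_spend_usd
        self.tool_calls = 0
        self.spend_usd = 0.0
        self.started = time.monotonic()

    def charge(self, cost_usd=0.0):
        """Call before each tool/model invocation; raises if any cap is hit."""
        self.tool_calls += 1
        self.spend_usd += cost_usd
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded("model spend cap reached")

# Usage: one budget per task/session, charged on every invocation.
budget = SessionBudget(max_tool_calls=3, max_spend_usd=0.10)
budget.charge(0.02)  # first tool call: within budget
```

Raising an exception (rather than silently truncating) forces the caller into the “stop conditions” path defined in section 1: ask, escalate, or abort.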