Agentic AI Production Readiness Template (2026)

Use this template to move from prototype to production without reliability, security, or cost blowups.

1) Scope & Success Definition
- Pick ONE workflow (e.g., “triage IT tickets” or “draft support replies with citations”).
- Define success metrics with thresholds:
  - Task success rate (offline eval): ____% (target: 85–95%, depending on risk)
  - Policy violations per tool call: ____% (target: <0.1%)
  - P95 time-to-resolution: ____ seconds (target: <120s for interactive use)
  - P95 cost per task: $____ (target: set a hard cap)
- Define “stop conditions” (when the agent must ask or escalate): missing required fields, contradictory user input, tool errors after >N retries.

2) Tooling Design (Least Privilege)
- Expose narrow tools, not broad endpoints.
- Add server-side validation (amount limits, schema checks, row-level security).
- Prefer idempotent actions; include a “dry_run=true” mode.
- Map agent identity to real RBAC/IAM roles; rotate secrets and avoid long-lived tokens.

3) Policy & Approvals
- Categorize actions into tiers:
  - Tier 0: Read-only (safe)
  - Tier 1: Low-risk writes (e.g., create a ticket)
  - Tier 2: Financial/legal/security impact (refunds, account changes, firewall rules)
- For Tier 2, require approval above a threshold: $____ or risk score >____.
- Implement hard session budgets:
  - Max tool calls: ____
  - Max wall-clock seconds: ____
  - Max model spend: $____

4) Evaluation & Testing
- Build a golden eval set (100–1,000 tasks) with edge cases and past incidents.
- Run nightly regression tests; pin model versions for critical flows.
- Add adversarial tests: prompt injection in retrieved docs, malicious URLs, conflicting instructions.
- Define acceptance gates for release: no regression >____% on key metrics.

5) Observability & Audit
- Trace every task with a correlation ID.
- Log: tool name, parameters, result status, retrieved sources/citations, policy decision, model and prompt versions.
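A minimal sketch of the logging fields listed above, as one structured audit record per tool call. All field names, and the `log_tool_call` helper itself, are illustrative assumptions, not a standard schema; align them with your own tracing/SIEM conventions.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agent.audit")

def log_tool_call(correlation_id, tool_name, params, status,
                  sources, policy_decision, model_version, prompt_version):
    """Emit one structured audit record per tool call (illustrative schema)."""
    record = {
        "ts": time.time(),
        "correlation_id": correlation_id,    # ties every event to one task
        "tool": tool_name,
        "params": params,                    # redact secrets/PII before logging
        "status": status,                    # e.g. "ok", "error", "denied"
        "sources": sources,                  # retrieved citations, if any
        "policy_decision": policy_decision,  # e.g. "allowed", "needs_approval"
        "model_version": model_version,
        "prompt_version": prompt_version,
    }
    logger.info(json.dumps(record))
    return record

# Usage: mint one correlation ID per task and reuse it across all its tool calls.
task_id = str(uuid.uuid4())
log_tool_call(task_id, "create_ticket", {"queue": "IT"}, "ok",
              [], "allowed", "model-2026-01", "prompt-v12")
```

Keeping the record flat and JSON-serializable makes it easy to ship to whatever log pipeline or dashboard backs the metrics in this section.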
- Produce an “action ledger” for write operations (who/what/when/why).
- Create dashboards: success rate, escalation rate, tool error rate, cost per task, unsafe action attempts.

6) Rollout Plan
- Phase 1: Internal users only, read-only.
- Phase 2: Low-risk writes with approvals and tight caps.
- Phase 3: Expand traffic via canary (1–5%); auto-rollback on error spikes.
- Establish incident response: an on-call owner, runbooks, and postmortems.

7) Operations (Ongoing)
- Prompt and policy changes require PR review and a changelog entry.
- Quarterly access review and security assessment.
- Monthly cost review: top workflows by spend, top tools by retries, escalation root causes.

If you can’t measure it (success, cost, and policy compliance), you can’t ship it safely.
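The hard session budgets from section 3 can be enforced with a small guard checked before every tool or model invocation. This is a sketch under assumed caps; the `SessionBudget` class, its limits, and the `charge` method are all illustrative names, not part of any framework.

```python
import time

class BudgetExceeded(Exception):
    """Raised when a session hits a hard cap; the agent must stop and escalate."""

class SessionBudget:
    # Illustrative defaults; fill in the template values for your workflow.
    def __init__(self, max_tool_calls=20, max_seconds=120, max_spend_usd=0.50):
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.max_spend_usd = max_spend_usd
        self.tool_calls = 0
        self.spend_usd = 0.0
        self.started = time.monotonic()

    def charge(self, cost_usd=0.0):
        """Call before each tool/model invocation; raises if any cap is hit."""
        self.tool_calls += 1
        self.spend_usd += cost_usd
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max tool calls reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded("model spend cap reached")

# Usage: one budget per task/session, charged on every invocation.
budget = SessionBudget(max_tool_calls=3, max_spend_usd=0.10)
budget.charge(0.02)  # first tool call: within budget
```

Raising an exception (rather than silently truncating) forces the caller into the “stop conditions” path defined in section 1: ask, escalate, or abort.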