AX LAUNCH KIT (2026)

Purpose
Use this kit to plan, ship, and scale an action-capable AI agent (not just chat). It focuses on reliability, safety, observability, and unit economics.

1) Define the Job (one workflow)
- Name the job in verb-object form (e.g., “triage inbound support tickets”).
- Write “done criteria” with measurable outputs (e.g., tags applied + draft reply + escalation reason).
- List allowed tools/APIs and explicitly disallow everything else.
- Set a scope boundary for v1 (one segment, one language, one product area).

2) Autonomy Ladder (permissions plan)
- Read-only: retrieval + summarization only.
- Draft mode: agent proposes actions; human approves.
- Supervised action: agent executes low-risk actions with confirmations.
- Limited autonomy: agent executes within thresholds (e.g., refunds < $200).
- Full autonomy (rare): requires audits, rollbacks, and contractual controls.

3) Safety & Access Controls
- Role-based tool access (scopes per user/tenant).
- PII handling: redaction rules + retention policy for prompts/logs.
- Action audit log: who requested, what the agent did, when, and in which system.
- Rollback strategy per tool (undo, compensating transaction, or human escalation).

4) Verification (trust layer)
- Require citations for any factual claim from internal docs.
- Validate tool outputs with schemas; reject on mismatch (see the validation sketch at the end of this kit).
- Add business-rule checks (bounds, thresholds, required fields).
- High-risk actions must show a diff/preview before execution.

5) Observability (agent cockpit)
Instrument from day one:
- Session trace ID linking model calls + tool calls + retries.
- Metrics: task success rate, human intervention rate, p50/p95 latency, tool error rate.
- Cost: tokens, tool-call counts, and cost per SUCCESSFUL task.
- Feedback capture: thumbs, correction reasons, and “why escalated” categories.

6) Unit Economics (budgets)
- Set per-task budgets (USD), max tokens, max tool calls, max runtime (see the budget-guard sketch at the end of this kit).
- Define fallback behavior when budgets are exceeded (ask user, cheaper model, escalate).
- Target: keep AI COGS under 15–25% of AI revenue for sustainable margins.

7) Launch Gates (go/no-go)
- Offline eval coverage: at least 100–500 realistic cases for v1.
- Shadow mode: run new logic without user impact for 1–2 weeks.
- Canary cohort: 1–5% of users with heavier logging and a clear rollback.
- Incident plan: on-call owner, escalation path, and a disable switch.

Metric Definitions (paste into your doc; a calculation sketch follows at the end of this kit)
- TTFCA (Time to First Correct Action): time from user request to first verified correct step.
- Human intervention rate: % of tasks requiring a human approval, edit, or escalation.
- Cost per successful task: total AI + tool costs divided by completed tasks (not runs).
- Tool-call error rate: errors per 100 tool calls (auth, rate limit, schema mismatch).

If you can’t answer “what happened, why, and how much it cost” for any agent session, you’re not ready for autonomy.
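
Sketch: Tool-Output Validation (item 4)
A minimal Python sketch of the schema + business-rule check from item 4, gating a refund-style action before execution. The field names (ticket_id, amount_usd, currency) and the $200 threshold are illustrative assumptions, not a prescribed schema; adapt both to your tools.

# Sketch: validate a tool's JSON output against a minimal schema plus
# business-rule checks before the agent is allowed to act on it.
# All field names, bounds, and thresholds below are illustrative.

from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    reasons: list

REQUIRED_FIELDS = {          # field name -> expected type
    "ticket_id": str,
    "amount_usd": (int, float),
    "currency": str,
}

def validate_tool_output(payload: dict, max_amount_usd: float = 200.0) -> CheckResult:
    reasons = []

    # 1) Schema check: required fields present and of the expected type.
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in payload:
            reasons.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected):
            reasons.append(f"wrong type for {field_name}: {type(payload[field_name]).__name__}")

    # 2) Business-rule checks: bounds, thresholds, allowed values.
    amount = payload.get("amount_usd")
    if isinstance(amount, (int, float)):
        if amount <= 0:
            reasons.append("amount_usd must be positive")
        if amount > max_amount_usd:
            reasons.append(f"amount_usd {amount} exceeds autonomy threshold {max_amount_usd}")
    if payload.get("currency") not in {"USD"}:
        reasons.append("unsupported currency")

    return CheckResult(ok=not reasons, reasons=reasons)

if __name__ == "__main__":
    good = {"ticket_id": "T-123", "amount_usd": 45.0, "currency": "USD"}
    bad = {"ticket_id": "T-124", "amount_usd": 950.0}
    print(validate_tool_output(good))   # ok=True
    print(validate_tool_output(bad))    # ok=False, reasons list the mismatches

On mismatch, route the task to the fallback defined in your autonomy ladder (draft mode or human escalation) rather than retrying blindly.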
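
Sketch: Per-Task Budget Guard (item 6)
A minimal Python sketch of the per-task budget and fallback behavior from item 6. The limit values and the fallback action names are assumptions for illustration; wire the returned action into whatever control loop your agent runtime uses.

# Sketch: enforce per-task budgets (spend, tokens, tool calls, runtime) and
# choose a fallback when a limit is hit. Limits and fallback names are
# illustrative, not a prescribed policy.

import time
from dataclasses import dataclass, field

@dataclass
class TaskBudget:
    max_usd: float = 0.50
    max_tokens: int = 20_000
    max_tool_calls: int = 10
    max_runtime_s: float = 120.0

@dataclass
class TaskMeter:
    started_at: float = field(default_factory=time.monotonic)
    usd: float = 0.0
    tokens: int = 0
    tool_calls: int = 0

    def charge(self, usd: float = 0.0, tokens: int = 0, tool_calls: int = 0) -> None:
        # Call after every model or tool step to accumulate spend.
        self.usd += usd
        self.tokens += tokens
        self.tool_calls += tool_calls

def check_budget(meter: TaskMeter, budget: TaskBudget) -> str:
    """Return 'continue' or a fallback action when any limit is exceeded."""
    if meter.usd > budget.max_usd:
        return "switch_to_cheaper_model"
    if meter.tokens > budget.max_tokens or meter.tool_calls > budget.max_tool_calls:
        return "ask_user_to_confirm_continuing"
    if time.monotonic() - meter.started_at > budget.max_runtime_s:
        return "escalate_to_human"
    return "continue"

if __name__ == "__main__":
    meter, budget = TaskMeter(), TaskBudget()
    meter.charge(usd=0.12, tokens=6_000, tool_calls=3)   # one agent step
    print(check_budget(meter, budget))                   # continue
    meter.charge(usd=0.55)                               # spend blows past the cap
    print(check_budget(meter, budget))                   # switch_to_cheaper_model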
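
Sketch: Metric Calculations (metric definitions)
A minimal Python sketch computing the kit's core metrics from per-session records. The record fields (succeeded, intervened, cost_usd, tool_calls, tool_errors) are assumptions about your logging schema, not a standard format; map them to whatever your trace store emits.

from dataclasses import dataclass

@dataclass
class SessionRecord:
    succeeded: bool      # task completed and verified
    intervened: bool     # a human had to approve, edit, or escalate
    cost_usd: float      # model + tool costs for this run
    tool_calls: int
    tool_errors: int     # auth, rate-limit, or schema-mismatch failures

def summarize(sessions: list) -> dict:
    completed = sum(s.succeeded for s in sessions)
    total_cost = sum(s.cost_usd for s in sessions)
    tool_calls = sum(s.tool_calls for s in sessions)
    tool_errors = sum(s.tool_errors for s in sessions)
    return {
        # total AI + tool cost divided by completed tasks, not runs
        "cost_per_successful_task": total_cost / completed if completed else None,
        "human_intervention_rate": sum(s.intervened for s in sessions) / len(sessions),
        "tool_call_error_rate_per_100": 100 * tool_errors / tool_calls if tool_calls else 0.0,
        "task_success_rate": completed / len(sessions),
    }

if __name__ == "__main__":
    sample = [
        SessionRecord(True, False, 0.18, 4, 0),
        SessionRecord(True, True, 0.31, 7, 1),
        SessionRecord(False, True, 0.22, 5, 2),
    ]
    print(summarize(sample))
    # cost_per_successful_task = 0.71 / 2, intervention rate = 2/3,
    # tool-call error rate = 3 errors per 16 calls = 18.75 per 100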