Internal Support Runbook
Ticket categorization and handling procedures for agent products, covering "agent output error", "task stuck", "bill spike", and other ticket types that traditional SaaS products do not have.
Trigger condition
Inbound customer support ticket. All tickets enter this runbook.
Roles and responsibilities
| Role | Primary responsibility |
|---|---|
| Tier-1 support | Ticket categorization, standardized handling, SLA-bound response |
| Tier-2 support | Escalation handling for tier-1; agent-specific troubleshooting |
| CSM | High-severity ticket relay for key accounts; coordinate with tier-1 to avoid multiple touches |
| Engineering | Product bug / system-level issue escalation |
| Product | Ticket pattern analysis; periodic review of top-three issue types to drive product improvements |
Agent product ticket categories
Unlike traditional SaaS, agent products require dedicated ticket categories. Tier-1’s first action upon receiving a ticket is to categorize it using this table:
| Category | Typical ticket | Tier-1 / Tier-2 | SLA |
|---|---|---|---|
| Agent output error | “agent gave the wrong answer”, “wrong label applied” | Tier-2 | 4 business hours |
| Task stuck | “agent stopped halfway”, “task perpetually in_progress” | Tier-2 | 2 business hours |
| Bill higher than expected | “why is this month’s bill 3× higher” | Tier-1 → Tier-2 | 24 business hours |
| Cache hit rate drop | “our costs went up recently” (root may be cache miss) | Tier-2 | 1 business day |
| HITL too frequent | “why does every task require my confirmation” | Tier-2 | 1 business day |
| Tier cap related | “why was I suddenly locked out” (hit cap) | Tier-1 | 4 business hours |
| Integration / API issue | “BUA extension won’t connect”, “API 401” | Tier-1 | 4 business hours |
| Account / permissions | “new employee can’t access”, “forgot password” | Tier-1 | 4 business hours |
| Feature inquiry | “can agent do X”, “how do I configure a workflow” | Tier-1 | 4 business hours |
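For tooling that needs the table in machine-readable form, a minimal sketch of a routing lookup follows. The category keys, the `Route` type, and the `route` function are hypothetical names chosen for illustration; the owner and SLA strings are copied from the table as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str  # tier that handles the ticket, as listed in the table
    sla: str    # SLA string exactly as written in the table


# Hypothetical category keys; values mirror the categorization table.
ROUTES = {
    "agent_output_error": Route("tier-2", "4 business hours"),
    "task_stuck": Route("tier-2", "2 business hours"),
    "bill_higher_than_expected": Route("tier-1 -> tier-2", "24 business hours"),
    "cache_hit_rate_drop": Route("tier-2", "1 business day"),
    "hitl_too_frequent": Route("tier-2", "1 business day"),
    "tier_cap_related": Route("tier-1", "4 business hours"),
    "integration_api_issue": Route("tier-1", "4 business hours"),
    "account_permissions": Route("tier-1", "4 business hours"),
    "feature_inquiry": Route("tier-1", "4 business hours"),
}


def route(category: str) -> Route:
    """Return the owner and SLA for a category, rejecting unknown categories."""
    try:
        return ROUTES[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```

Rejecting unknown categories, rather than defaulting to one, keeps a miscategorized ticket from silently landing in the wrong queue.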
Standard procedures by category
Agent output error
- Collect: ask the customer to provide the task ID, original prompt, agent output, and expected output
- Tier-2 verification: pull task-level log from admin console (token usage, tool call sequence, model version)
- Categorize:
  - Reproducible prompt issue: customer prompt unclear → respond with optimization suggestions + recommended prompt templates
  - Non-reproducible / probabilistic issue: same prompt yields divergent results across runs → escalate to engineering for prompt tuning or model switch evaluation
  - Clear product bug: agent completely misinterprets the task or picks the wrong tool → immediate engineering engagement
- Respond: within 72 hours, give the customer the specific categorization and the follow-up actions
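The three-way categorization above can be sketched as a small decision function. This is an illustration only: the function name, its parameters, and the returned labels are assumptions, and it presumes tier-2 has already rerun the customer's prompt several times and judged from the task-level log whether the agent misread the task or picked the wrong tool. Whether a prompt is actually unclear still needs a human read.

```python
def classify_output_error(rerun_outputs, misread_task_or_wrong_tool=False):
    """Map tier-2 findings to one of the three runbook categories.

    rerun_outputs: outputs from rerunning the same prompt (at least two runs).
    misread_task_or_wrong_tool: tier-2's judgment from the task-level log.
    """
    if len(rerun_outputs) < 2:
        raise ValueError("need at least two reruns to judge reproducibility")
    if misread_task_or_wrong_tool:
        return "clear_product_bug"          # immediate engineering engagement
    if len(set(rerun_outputs)) > 1:
        return "non_reproducible"           # escalate for prompt tuning / model switch
    return "reproducible_prompt_issue"      # respond with prompt suggestions
```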
Task stuck
- Immediately kill the task in admin console (prevent continued token drain)
- Tier-2 verifies task-level log, identifies stuck location (which step, which tool call, what error)
- Categorize:
  - Timeout / upstream model unavailable: standard retry, notify customer it’s recovered
  - HITL waiting: user hasn’t confirmed → educate customer on HITL workflow
  - Agent logic bug / runaway loop: engineering escalation (matches the “hard termination” trigger in incident-response)
- Refund the corresponding token quota in the ticket
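The ordering matters here: the kill comes before any diagnosis. A minimal sketch of the sequence as a checklist builder, following the bullets in the order they appear above; the cause keys and the `stuck_task_plan` function are hypothetical names, and the steps are plain strings rather than calls to any real admin console API.

```python
# Hypothetical cause keys; actions paraphrase the runbook's categorization.
NEXT_ACTION = {
    "upstream_timeout": "standard retry, then notify the customer it has recovered",
    "hitl_waiting": "educate the customer on the HITL workflow",
    "logic_bug_or_runaway_loop": "escalate to engineering",
}


def stuck_task_plan(cause):
    """Return the ordered handling steps for a stuck task with the given cause."""
    if cause not in NEXT_ACTION:
        raise ValueError(f"unknown stuck cause: {cause!r}")
    return [
        "kill the task in the admin console",              # first: stop token drain
        "pull the task-level log and locate the stuck step",
        NEXT_ACTION[cause],
        "refund the corresponding token quota in the ticket",
    ]
```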
Bill higher than expected
This is the most sensitive ticket type: it involves money, and customers are typically upset.
- Tier-1 commits to investigation first — do not promise refund or explain cause before pulling logs
- Escalate to tier-2 for verification:
  - Pull 30-day usage distribution (by tier / task type / user)
  - Compare against prior month, flag abnormal increases
- Categorize:
  - Real customer usage increase: explain growth source in detail (which user, which workflow); if customer accepts as reasonable, close ticket; if not, guide toward tier upgrade or usage caps
  - Product-side cache hit rate decline causing same usage but higher cost: CSM takes over, proactively informs customer, explains product-side fix path, considers bill adjustment
  - Clear billing bug: immediate refund + engineering fix + platform-wide scan for other affected customers
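The month-over-month comparison in the verification step can be sketched as follows. The function name, the dict-based input shape, and the 2× threshold are assumptions for illustration; the runbook does not define what counts as an "abnormal" increase.

```python
def flag_abnormal_increases(current, previous, ratio_threshold=2.0):
    """Flag keys (users, task types, or tiers) whose usage grew sharply.

    current, previous: dicts mapping a key to usage for each 30-day period.
    Returns (key, ratio) pairs; ratio is None when the key had no prior usage.
    The default threshold of 2.0 is an assumed value, not a runbook figure.
    """
    flagged = []
    for key, now in current.items():
        before = previous.get(key, 0)
        if before == 0:
            if now > 0:
                flagged.append((key, None))  # brand-new usage source
            continue
        ratio = now / before
        if ratio >= ratio_threshold:
            flagged.append((key, round(ratio, 2)))
    return flagged
```

Running it once per breakdown (by user, by task type, by tier) gives tier-2 the "which user, which workflow" detail needed when explaining a real usage increase.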
Cache hit rate drop
- Tier-2 pulls the customer’s cache hit history curve from admin console
- Categorize:
  - Customer-side prompt became dynamic: a customer integration update introduced variable prompts → feedback to customer to adjust
  - Product-side prompt structure change: a release modified the system prompt → engineering evaluates whether to roll back
- Frequently coupled with “bill higher than expected”; follow that flow
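To tell the two causes apart, it helps to know when the hit rate fell so the date can be matched against customer integration updates or product releases. A minimal sketch, assuming the history curve is available as a list of daily hit rates between 0 and 1; the function name, the trailing-window size, and the 0.15 drop threshold are illustrative assumptions.

```python
from statistics import mean


def first_drop(daily_hit_rates, window=3, min_drop=0.15):
    """Return the index of the first day whose hit rate falls at least
    `min_drop` below the mean of the preceding `window` days, or None."""
    for i in range(window, len(daily_hit_rates)):
        baseline = mean(daily_hit_rates[i - window:i])
        if baseline - daily_hit_rates[i] >= min_drop:
            return i
    return None
```

For example, `first_drop([0.80, 0.82, 0.81, 0.79, 0.55, 0.54])` points at index 4, the first day of the sustained decline.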
Escalation criteria
Tier-1 must escalate to engineering when:
- Same ticket type appears in ≥ 3 distinct customers within 24 hours → possible platform-level issue, trigger incident-response
- Customer loss explicitly attributed to agent behavior → legal liability implications, escalate to engineering + legal (see compliance/model-output-liability)
- Data leak / compliance event → immediate escalation to engineering + legal + IC (bypass support channel, go to incident-response)
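The first criterion (same ticket type from at least 3 distinct customers within 24 hours) is easy to miss by eye and straightforward to check mechanically. A minimal sketch, assuming tickets are available as `(category, customer_id, created_at)` tuples; that shape and the function name are assumptions, not an existing export format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def platform_level_suspects(tickets, min_customers=3, window=timedelta(hours=24)):
    """Return categories reported by >= min_customers distinct customers
    within any span of `window`, anchored at each ticket's creation time."""
    by_category = defaultdict(list)
    for category, customer, created in tickets:
        by_category[category].append((created, customer))

    suspects = set()
    for category, events in by_category.items():
        events.sort()  # chronological order
        for i, (start, _) in enumerate(events):
            # Distinct customers with a ticket in [start, start + window]
            customers = {c for t, c in events[i:] if t - start <= window}
            if len(customers) >= min_customers:
                suspects.add(category)
                break
    return suspects
```

Counting distinct customers rather than tickets keeps one customer filing the same ticket repeatedly from triggering incident-response.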
Critical things not to do
- Do not let tier-1 explain agent internals — tier-1 lacks knowledge of prompt design / tool chain architecture; wrong explanations seed the next wave of customer dissatisfaction. Technical explanations come from tier-2 / engineering
- Do not commit to fix timelines in tickets — unless engineering has confirmed; missing committed dates damages trust
- Do not categorize “agent output error” as “feature inquiry” — the former is a quality issue requiring traceability, the latter is education; misclassification means product misses quality signals
- Do not respond to a customer’s repeated same-category ticket with templated answers — customers notice “this is the same response as last time”, which signals “you’re not actually solving it”
Measurement metrics
Weekly review:
- SLA attainment rate by category — identify which categories regularly miss deadlines
- First-contact resolution rate — was the ticket closed at first touch
- Escalation rate: share of tickets escalated from tier-1 to tier-2 / engineering (healthy range 15-25%; a rate below this range suggests tier-1 is holding on to tickets beyond its capability)
- Repeat ticket rate — same customer / same problem reopened within 30 days
- Ticket → product issue conversion rate — how many tickets eventually trigger product improvement tickets (healthy baseline 5-10%)
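Three of these metrics can be computed directly from a week's closed tickets. A minimal sketch follows; the `Ticket` fields and the `weekly_review` function are hypothetical and would need to be mapped onto whatever the ticketing system actually exports. Only the 15-25% healthy range comes from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Ticket:
    category: str
    met_sla: bool
    resolved_first_contact: bool
    escalated: bool


def weekly_review(tickets):
    """Compute SLA attainment by category, first-contact resolution rate,
    and escalation rate. Returns None when there are no tickets."""
    total = len(tickets)
    if total == 0:
        return None
    per_category = defaultdict(lambda: [0, 0])  # category -> [met, count]
    for t in tickets:
        per_category[t.category][0] += t.met_sla
        per_category[t.category][1] += 1
    escalation_rate = sum(t.escalated for t in tickets) / total
    return {
        "sla_attainment": {c: met / n for c, (met, n) in per_category.items()},
        "first_contact_resolution": sum(t.resolved_first_contact for t in tickets) / total,
        "escalation_rate": escalation_rate,
        # Healthy range of 15-25% as stated in the metrics list
        "escalation_in_healthy_range": 0.15 <= escalation_rate <= 0.25,
    }
```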
Cross-section connections
- Platform-level incident external communication: incident-response
- Agent output liability allocation: compliance/model-output-liability
- Cost model basis for bill anomalies: economics/cost-model
- Cache hit rate and pricing mechanism: pricing/tier-design