Internal Support Runbook

Ticket categorization and handling procedures for agent products, covering "agent output error", "task stuck", "bill spike", and other ticket types that traditional SaaS support does not have.

Trigger condition

Inbound customer support ticket. All tickets enter this runbook.

Roles and responsibilities

Role | Primary responsibility
Tier-1 support | Ticket categorization, standardized handling, SLA-bound response
Tier-2 support | Escalation handling for tier-1; agent-specific troubleshooting
CSM | High-severity ticket relay for key accounts; coordinate with tier-1 to avoid multiple touches
Engineering | Product bug / system-level issue escalation
Product | Ticket pattern analysis; periodic review of top-three issue types to drive product improvements

Agent product ticket categories

Unlike traditional SaaS, agent products require dedicated ticket categories. Tier-1’s first action upon receiving a ticket is to categorize using this table:

Category | Typical ticket | Handling tier | SLA
Agent output error | “agent gave the wrong answer”, “wrong label applied” | Tier-2 | 4 business hours
Task stuck | “agent stopped halfway”, “task perpetually in_progress” | Tier-2 | 2 business hours
Bill higher than expected | “why is this month’s bill 3× higher” | Tier-1 → Tier-2 | 24 business hours
Cache hit rate drop | “our costs went up recently” (root may be cache miss) | Tier-2 | 1 business day
HITL too frequent | “why does every task require my confirmation” | Tier-2 | 1 business day
Tier cap related | “why was I suddenly locked out” (hit cap) | Tier-1 | 4 business hours
Integration / API issue | “BUA extension won’t connect”, “API 401” | Tier-1 | 4 business hours
Account / permissions | “new employee can’t access”, “forgot password” | Tier-1 | 4 business hours
Feature inquiry | “can agent do X”, “how do I configure a workflow” | Tier-1 | 4 business hours
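
If the ticketing tool supports automation, the category table can be encoded as a lookup so that routing and SLA are never retyped by hand. The sketch below is illustrative only: the category slugs, the `Route` type, and `route_ticket` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str  # who handles the ticket after categorization
    sla: str    # SLA exactly as stated in the category table


# Hypothetical slugs; owners and SLAs mirror the category table.
ROUTING = {
    "agent_output_error": Route("tier-2", "4 business hours"),
    "task_stuck": Route("tier-2", "2 business hours"),
    "bill_higher_than_expected": Route("tier-1 -> tier-2", "24 business hours"),
    "cache_hit_rate_drop": Route("tier-2", "1 business day"),
    "hitl_too_frequent": Route("tier-2", "1 business day"),
    "tier_cap_related": Route("tier-1", "4 business hours"),
    "integration_api_issue": Route("tier-1", "4 business hours"),
    "account_permissions": Route("tier-1", "4 business hours"),
    "feature_inquiry": Route("tier-1", "4 business hours"),
}


def route_ticket(category: str) -> Route:
    """Return owner and SLA; reject unknown categories instead of guessing."""
    try:
        return ROUTING[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```

Rejecting unknown categories forces an explicit categorization decision, which keeps quality tickets from silently landing in a default bucket.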

Standard procedures by category

Agent output error

  1. Collect: ticket requires customer to provide task ID, original prompt, agent output, expected output
  2. Tier-2 verification: pull task-level log from admin console (token usage, tool call sequence, model version)
  3. Categorize:
    • Reproducible prompt issue — customer prompt unclear → respond with optimization suggestions + recommended prompt templates
    • Non-reproducible / probabilistic issue — same prompt yields divergent results across runs → escalate to engineering for prompt tuning or model switch evaluation
    • Clear product bug — agent completely misinterprets task or picks wrong tool → engineering immediate engagement
  4. Respond: within 72 hours give customer specific categorization + follow-up actions

Task stuck

  1. Immediately kill the task in admin console (prevent continued token drain)
  2. Tier-2 verifies task-level log, identifies stuck location (which step, which tool call, what error)
  3. Categorize:
    • Timeout / upstream model unavailable — standard retry, notify customer it’s recovered
    • HITL waiting — user hasn’t confirmed → educate customer on HITL workflow
    • Agent logic bug / runaway loop — engineering escalation (matches the “hard termination” trigger in incident-response)
  4. Refund the corresponding token quota in the ticket

Bill higher than expected

The most sensitive ticket type — involves money, customer emotion is typically elevated.

  1. Tier-1 commits to investigation first — do not promise refund or explain cause before pulling logs
  2. Escalate to tier-2 for verification:
    • Pull 30-day usage distribution (by tier / task type / user)
    • Compare against prior month, flag abnormal increases
  3. Categorize:
    • Real customer usage increase — explain growth source in detail (which user, which workflow); if customer accepts as reasonable, close ticket; if not, guide toward tier upgrade or usage caps
    • Product-side cache hit rate decline causing same usage but higher cost — CSM takes over, proactively informs customer, explains product-side fix path, considers bill adjustment
    • Clear billing bug — immediate refund + engineering fix + platform-wide scan for other affected customers
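
The month-over-month comparison in step 2 can be sketched as follows, assuming usage can be exported as `(user, task_type, tokens)` records. The record shape, the function name, and the 2× threshold are assumptions for illustration; the admin console's actual export format may differ.

```python
from collections import defaultdict


def flag_usage_increases(current, prior, ratio_threshold=2.0):
    """Flag (user, task_type) pairs whose token usage grew abnormally.

    current / prior: iterables of (user, task_type, tokens) records for the
    current and the previous 30-day window (hypothetical export format).
    Returns (key, prior_tokens, current_tokens) rows, largest increase first.
    """
    def totals(records):
        sums = defaultdict(int)
        for user, task_type, tokens in records:
            sums[(user, task_type)] += tokens
        return sums

    cur, prev = totals(current), totals(prior)
    flagged = []
    for key, cur_tokens in cur.items():
        if cur_tokens == 0:
            continue
        prev_tokens = prev.get(key, 0)
        # Usage with no prior baseline is always flagged for review.
        if prev_tokens == 0 or cur_tokens / prev_tokens >= ratio_threshold:
            flagged.append((key, prev_tokens, cur_tokens))
    flagged.sort(key=lambda row: row[2] - row[1], reverse=True)
    return flagged
```

The flagged rows give tier-2 the "which user, which workflow" detail needed to explain a real usage increase; if nothing is flagged while the bill still rose, that points toward the cache or billing-bug branches.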

Cache hit rate drop

  1. Tier-2 pulls the customer’s cache hit history curve from admin console
  2. Categorize:
    • Customer-side prompt became dynamic: a customer integration update introduced variable prompts → advise the customer to adjust
    • Product-side prompt structure change — a release modified the system prompt → engineering evaluates whether to roll back
  3. Frequently coupled with “bill higher than expected”; follow that flow
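
To tell the two causes apart, it helps to know the day the hit rate first fell, so it can be lined up against the customer's integration updates and the product's release dates. A minimal sketch, assuming daily `(day, cache_hits, total_requests)` rows can be exported; the row format, the function name, and the 10-point drop threshold are all hypothetical.

```python
def first_drop_day(daily, drop=0.10, baseline_days=7):
    """Return the first day whose hit rate falls well below its recent baseline.

    daily: chronological list of (day_label, cache_hits, total_requests).
    A day is reported when its hit rate is at least `drop` (absolute) below
    the mean hit rate of the preceding `baseline_days` days.
    Returns None if no such day exists.
    """
    # Skip days without traffic; a hit rate is undefined for them.
    rates = [(day, hits / total) for day, hits, total in daily if total > 0]
    for i in range(baseline_days, len(rates)):
        window = [rate for _, rate in rates[i - baseline_days:i]]
        baseline = sum(window) / len(window)
        if baseline - rates[i][1] >= drop:
            return rates[i][0]
    return None
```

A drop day that coincides with a product release suggests the product-side branch; one that coincides with a customer deployment suggests the customer-side branch.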

Escalation criteria

Tier-1 must escalate to engineering when:

  • Same ticket type appears in ≥ 3 distinct customers within 24 hours → possible platform-level issue, trigger incident-response
  • Customer loss explicitly attributed to agent behavior → legal liability implications, escalate to engineering + legal (see compliance/model-output-liability)
  • Data leak / compliance event → immediate escalation to engineering + legal + IC (bypass support channel, go to incident-response)
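
The first criterion (same ticket type from ≥ 3 distinct customers within 24 hours) is easy to miss by eye when tickets are spread across agents. The sketch below shows one way to check it, assuming tickets are available as `(opened_at, category, customer_id)` tuples; that shape and the function name are assumptions, not an existing integration.

```python
from datetime import datetime, timedelta


def platform_level_suspects(tickets, min_customers=3, window=timedelta(hours=24)):
    """Return categories reported by >= min_customers distinct customers
    within any span of at most `window`.

    tickets: iterable of (opened_at: datetime, category: str, customer_id: str).
    """
    by_category = {}
    for opened_at, category, customer_id in tickets:
        by_category.setdefault(category, []).append((opened_at, customer_id))

    suspects = set()
    for category, rows in by_category.items():
        rows.sort()  # chronological order
        start = 0
        for end in range(len(rows)):
            # Slide the left edge until the span fits inside the window.
            while rows[end][0] - rows[start][0] > window:
                start += 1
            customers = {cid for _, cid in rows[start:end + 1]}
            if len(customers) >= min_customers:
                suspects.add(category)
                break
    return suspects
```

Counting distinct customers rather than tickets matters here: one customer filing the same ticket three times is a repeat-ticket problem, not a platform-level signal.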

Critical things not to do

  • Do not let tier-1 explain agent internals — tier-1 lacks knowledge of prompt design / tool chain architecture; wrong explanations seed the next wave of customer dissatisfaction. Technical explanations come from tier-2 / engineering
  • Do not commit to fix timelines in tickets — unless engineering has confirmed; missing committed dates damages trust
  • Do not categorize “agent output error” as “feature inquiry” — the former is a quality issue requiring traceability, the latter is education; misclassification means product misses quality signals
  • Do not respond to a customer’s repeated same-category ticket with templated answers — customers notice “this is the same response as last time”, which signals “you’re not actually solving it”

Measurement metrics

Weekly review:

  • SLA attainment rate by category — identify which categories regularly miss deadlines
  • First-contact resolution rate — was the ticket closed at first touch
  • Escalation rate: share of tickets that tier-1 escalates to tier-2 / engineering (healthy range 15-25%; a rate below that suggests tier-1 is holding on to tickets beyond its capability)
  • Repeat ticket rate — same customer / same problem reopened within 30 days
  • Ticket → product issue conversion rate — how many tickets eventually trigger product improvement tickets (healthy baseline 5-10%)
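
Several of these metrics reduce to simple ratios over the week's tickets. A minimal sketch, assuming each ticket is a dict with the hypothetical fields `category`, `met_sla`, `closed_first_touch`, and `escalated`; the field names are illustrative and the 15-25% range is the one stated above.

```python
from collections import Counter


def weekly_metrics(tickets):
    """Compute SLA attainment by category, first-contact resolution rate,
    and escalation rate for one review window.

    tickets: list of dicts with keys category (str), met_sla (bool),
    closed_first_touch (bool), escalated (bool). Hypothetical schema.
    """
    if not tickets:
        raise ValueError("no tickets in the review window")

    total = len(tickets)
    per_category = Counter(t["category"] for t in tickets)
    met = Counter(t["category"] for t in tickets if t["met_sla"])
    escalation_rate = sum(t["escalated"] for t in tickets) / total

    return {
        "sla_attainment": {c: met[c] / n for c, n in per_category.items()},
        "first_contact_resolution": sum(t["closed_first_touch"] for t in tickets) / total,
        "escalation_rate": escalation_rate,
        # Healthy range taken from the escalation-rate metric: 15-25%.
        "escalation_in_healthy_range": 0.15 <= escalation_rate <= 0.25,
    }
```

Repeat ticket rate and ticket → product issue conversion need data joined across weeks and across the product tracker, so they are left out of this sketch.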

Cross-section connections
