Internal Support Runbook

Trigger condition

Inbound customer support ticket. All tickets enter this runbook.

Roles and responsibilities

Role	Primary responsibility
Tier-1 support	Ticket categorization, standardized handling, SLA-bound response
Tier-2 support	Handles escalations from tier-1; agent-specific troubleshooting
CSM	Relays high-severity tickets for key accounts; coordinates with tier-1 to avoid duplicate customer touches
Engineering	Product bug / system-level issue escalation
Product	Ticket pattern analysis; periodic review of top-three issue types to drive product improvements

Agent product ticket categories

Unlike traditional SaaS, agent products require dedicated ticket categories. Tier-1’s first action upon receiving a ticket is to categorize it using this table:

Category	Typical ticket	Owner	SLA
Agent output error	“agent gave the wrong answer”, “wrong label applied”	Tier-2	4 business hours
Task stuck	“agent stopped halfway”, “task perpetually in_progress”	Tier-2	2 business hours
Bill higher than expected	“why is this month’s bill 3× higher”	Tier-1 → Tier-2	24 business hours
Cache hit rate drop	“our costs went up recently” (root cause may be cache misses)	Tier-2	1 business day
HITL too frequent	“why does every task require my confirmation”	Tier-2	1 business day
Tier cap related	“why was I suddenly locked out” (hit cap)	Tier-1	4 business hours
Integration / API issue	“BUA extension won’t connect”, “API 401”	Tier-1	4 business hours
Account / permissions	“new employee can’t access”, “forgot password”	Tier-1	4 business hours
Feature inquiry	“can agent do X”, “how do I configure a workflow”	Tier-1	4 business hours
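
The table above can be held as a lookup so that categorization always yields an owner and an SLA. This is a minimal sketch, not an existing tool: the dictionary keys, the Route class, and the route function are hypothetical names chosen for illustration; only the owners and SLA strings come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner: str  # tier that handles the ticket
    sla: str    # SLA text copied from the category table


# Keys are hypothetical identifiers; owners and SLAs mirror the table.
ROUTES = {
    "agent_output_error": Route("tier-2", "4 business hours"),
    "task_stuck": Route("tier-2", "2 business hours"),
    "bill_higher_than_expected": Route("tier-1 -> tier-2", "24 business hours"),
    "cache_hit_rate_drop": Route("tier-2", "1 business day"),
    "hitl_too_frequent": Route("tier-2", "1 business day"),
    "tier_cap": Route("tier-1", "4 business hours"),
    "integration_api": Route("tier-1", "4 business hours"),
    "account_permissions": Route("tier-1", "4 business hours"),
    "feature_inquiry": Route("tier-1", "4 business hours"),
}


def route(category: str) -> Route:
    """Return owner and SLA; unknown categories raise instead of defaulting."""
    try:
        return ROUTES[category]
    except KeyError:
        raise ValueError(f"unknown category: {category!r}") from None
```

Raising on an unknown category, rather than falling back to a default, is a deliberate choice in this sketch: it keeps an uncategorized ticket from being filed silently under the wrong type.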

Standard procedures by category

Agent output error

Collect: ask the customer for the task ID, original prompt, agent output, and expected output
Tier-2 verification: pull the task-level log from the admin console (token usage, tool call sequence, model version)
Categorize:
- Reproducible prompt issue: the customer prompt is unclear → respond with optimization suggestions + recommended prompt templates
- Non-reproducible / probabilistic issue: the same prompt yields divergent results across runs → escalate to engineering for prompt tuning or model switch evaluation
- Clear product bug: the agent completely misinterprets the task or picks the wrong tool → engage engineering immediately
Respond: within 72 hours, give the customer the specific categorization + follow-up actions
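
The collect and categorize steps above can be sketched as two small checks. This is illustrative only: the field names, the function names, and the order in which the categories are tested (product bug first, then probabilistic, then prompt issue) are assumptions, not something the runbook prescribes.

```python
# Hypothetical field names for the four items the customer must provide.
REQUIRED_FIELDS = ("task_id", "original_prompt", "agent_output", "expected_output")


def missing_fields(ticket: dict) -> list:
    """List required fields that are absent or blank in the ticket."""
    return [f for f in REQUIRED_FIELDS if not str(ticket.get(f) or "").strip()]


def triage(misread_task_or_wrong_tool: bool, diverges_across_runs: bool) -> str:
    """Map tier-2's findings to one of the three categories.

    The precedence used here is an assumption of this sketch.
    """
    if misread_task_or_wrong_tool:
        return "product_bug: engage engineering immediately"
    if diverges_across_runs:
        return "probabilistic: escalate to engineering for prompt tuning or model switch evaluation"
    return "prompt_issue: reply with optimization suggestions and prompt templates"
```

A ticket with missing fields would go back to the customer before tier-2 pulls logs, since the categorization depends on comparing the agent output with the expected output.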

Task stuck

Immediately kill the task in the admin console (prevents continued token drain)
Tier-2 reviews the task-level log and identifies where the task got stuck (which step, which tool call, what error)
Categorize:
- Timeout / upstream model unavailable: run the standard retry and notify the customer once the task has recovered
- HITL waiting: the user hasn’t confirmed → educate the customer on the HITL workflow
- Agent logic bug / runaway loop: escalate to engineering (matches the “hard termination” trigger in incident-response)
Refund the token quota consumed by the stuck task and record the refund in the ticket

Bill higher than expected

The most sensitive ticket type: it involves money, and customer emotions typically run high.

Tier-1 commits to an investigation first; do not promise a refund or explain the cause before the logs have been pulled
Escalate to tier-2 for verification:
- Pull the 30-day usage distribution (by tier / task type / user)
- Compare against the prior month and flag abnormal increases
Categorize:
- Real customer usage increase: explain the source of the growth in detail (which user, which workflow); if the customer accepts it as reasonable, close the ticket; if not, guide them toward a tier upgrade or usage caps
- Product-side cache hit rate decline (same usage, higher cost): CSM takes over, proactively informs the customer, explains the product-side fix path, and considers a bill adjustment
- Clear billing bug: immediate refund + engineering fix + platform-wide scan for other affected customers
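
The month-over-month comparison in the verification step can be sketched as follows. The runbook does not define what counts as an abnormal increase, so the 1.5× ratio threshold is an assumption, as are the function name and the shape of the inputs (usage totals keyed by user, workflow, or task type).

```python
def flag_increases(current: dict, prior: dict, ratio_threshold: float = 1.5) -> list:
    """Return (key, prior_usage, current_usage) rows that grew abnormally.

    A key is flagged when current / prior >= ratio_threshold, or when it has
    usage now but had none in the prior period. Rows are sorted by absolute
    increase, largest first. The threshold is an assumed value.
    """
    flagged = []
    for key, now in current.items():
        before = prior.get(key, 0.0)
        if before == 0.0:
            if now > 0.0:
                flagged.append((key, before, now))  # new usage, no baseline
        elif now / before >= ratio_threshold:
            flagged.append((key, before, now))
    return sorted(flagged, key=lambda row: row[2] - row[1], reverse=True)
```

Sorting by absolute increase puts the user or workflow that explains most of the bill at the top, which is the detail the customer-facing explanation needs.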

Cache hit rate drop

Tier-2 pulls the customer’s cache hit history curve from the admin console
Categorize:
- Customer-side prompt became dynamic: a customer integration update introduced variable prompts → advise the customer to adjust
- Product-side prompt structure change: a release modified the system prompt → engineering evaluates whether to roll back
This category is frequently coupled with “bill higher than expected”; when it is, follow that flow
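
When reading the history curve, it helps to pin down the day the drop began so it can be lined up against customer integration updates and product releases. A minimal sketch, assuming the curve is available as a list of daily hit rates between 0 and 1; the 7-day baseline window and the 0.15 minimum drop are illustrative values, not thresholds defined by this runbook.

```python
from statistics import mean
from typing import Optional


def find_drop_onset(rates: list, window: int = 7, min_drop: float = 0.15) -> Optional[int]:
    """Index of the first day whose hit rate is at least min_drop below
    the mean of the preceding `window` days, or None if no such day exists."""
    for i in range(window, len(rates)):
        baseline = mean(rates[i - window:i])
        if baseline - rates[i] >= min_drop:
            return i
    return None
```

The returned index is only a starting point: whether the cause is customer-side or product-side still has to be decided by checking what changed on that date.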

Escalation criteria

Tier-1 must escalate to engineering when:

The same ticket type is reported by ≥ 3 distinct customers within 24 hours → possible platform-level issue, trigger incident-response
A customer explicitly attributes a loss to agent behavior → legal liability implications, escalate to engineering + legal (see compliance/model-output-liability)
Data leak / compliance event → immediate escalation to engineering + legal + IC (bypass the support channel, go to incident-response)
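
The first criterion (same ticket type, ≥ 3 distinct customers, 24 hours) is mechanical enough to check automatically. A minimal sketch, assuming tickets are available as (category, customer_id, created_at) tuples and that the 24 hours are measured as a trailing window from the time of the check; both are assumptions of this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def platform_level_categories(tickets, now, window=timedelta(hours=24), min_customers=3):
    """Return categories reported by at least min_customers distinct
    customers within the trailing window ending at `now`."""
    customers = defaultdict(set)
    for category, customer_id, created_at in tickets:
        if timedelta(0) <= now - created_at <= window:
            customers[category].add(customer_id)  # a set dedupes repeat tickets
    return sorted(c for c, ids in customers.items() if len(ids) >= min_customers)
```

Counting distinct customers rather than tickets matters here: one customer filing the same ticket three times does not indicate a platform-level issue.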

Critical things not to do

Do not let tier-1 explain agent internals — tier-1 lacks knowledge of prompt design / tool chain architecture; wrong explanations seed the next wave of customer dissatisfaction. Technical explanations come from tier-2 / engineering
Do not commit to fix timelines in tickets unless engineering has confirmed them; missing a committed date damages trust
Do not categorize “agent output error” as “feature inquiry” — the former is a quality issue requiring traceability, the latter is education; misclassification means product misses quality signals
Do not respond to a customer’s repeated same-category ticket with templated answers — customers notice “this is the same response as last time”, which signals “you’re not actually solving it”

Measurement metrics

Weekly review:

SLA attainment rate by category: identify which categories regularly miss deadlines
First-contact resolution rate: share of tickets closed at first touch
Escalation rate: share of tickets escalated from tier-1 to tier-2 / engineering (healthy range 15-25%; a rate that is too low suggests tier-1 is holding on to tickets beyond its capability)
Repeat ticket rate: same customer / same problem reopened within 30 days
Ticket → product issue conversion rate: share of tickets that eventually trigger a product improvement ticket (healthy baseline 5-10%)
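
The escalation-rate check for the weekly review can be sketched as below. The 15-25% bounds come from the list above; the function names and the choice to report an empty week as a rate of 0.0 are assumptions of this example.

```python
def rate(part: int, total: int) -> float:
    """Share of `part` in `total`; an empty week is reported as 0.0."""
    return part / total if total else 0.0


def escalation_status(escalated: int, total: int, low: float = 0.15, high: float = 0.25) -> str:
    """Classify the weekly escalation rate against the healthy range."""
    r = rate(escalated, total)
    if r < low:
        return "below range"
    if r > high:
        return "above range"
    return "healthy"
```

For example, 20 escalations out of 100 tickets falls inside the range, while 5 out of 100 is flagged as below it and warrants a look at whether tier-1 is keeping tickets it should hand off.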

Cross-section connections

Platform-level incident external communication: incident-response
Agent output liability allocation: compliance/model-output-liability
Cost model basis for bill anomalies: economics/cost-model
Cache hit rate and pricing mechanism: pricing/tier-design