Incident Response Playbook
External communication templates for agent system-level failures — inference errors, token exhaustion, upstream model unavailability, severe output errors.
Trigger conditions
Any of:
- Agent completion rate drops > 20% platform-wide (sustained > 15 minutes)
- Upstream model API (Anthropic / OpenAI / self-hosted) unavailable
-
5 consecutive customer complaints about the same erroneous output on the same task class
- Systematic token billing anomaly (users over- or under-charged by > 5%)
- Customer’s important data accidentally sent or leaked through agent
Roles and responsibilities
| Role | Primary responsibility |
|---|---|
| On-call engineer | First responder; technical diagnosis; service restoration |
| Incident Commander (IC) | Coordination; escalation decisions; approves external comms |
| Customer Success (CSM) | Proactive outreach to key accounts; collect impact scope |
| Support | Handle ticket surge; execute customer communication templates |
| Product owner | Decide external announcement wording and distribution scope |
| Legal | Engaged for data leaks, compliance events, model output liability |
Differences from SaaS incidents
Agent products have several incident forms SaaS lacks, requiring different communication strategies:
1. “Agent behavior anomaly” vs “system outage”
SaaS outage: clear “works / doesn’t work”. Communication templates are mature.
Agent behavior anomaly: users can connect, but the agent makes wrong decisions or quality drops. Example: prompt regression causing platform-wide task completion rate to fall from 95% to 75% — technically the service is “normal”, but customer experience is at incident level.
Communication point: do not paper over with “system status: normal”. Publish “we are observing task quality decline on category X, investigating”. This transparency builds more trust than pretending nothing is wrong.
2. Token billing anomaly
Involves money; more sensitive than technical failure.
- Over-charge: must proactively refund without waiting for customer to discover; announcement specifies refund timeline
- Under-charge: in principle do not back-bill customers (reputation cost exceeds the loss); announcement states issue is fixed
3. Model output errors causing customer loss
The most severe form: agent made a wrong decision for the customer (sent wrong email, clicked wrong link, wrote wrong contract clause).
Legal liability allocation depends on contract terms (see compliance), but communication must immediately accept ownership — agent is part of the product, the vendor bears primary responsibility, root cause analysis follows later.
Response timeline
T+0 to T+15 min
On-call engineer:
- Confirm incident nature (category, severity, scope)
- Open incident channel (Slack / war room)
- Assess if IC engagement required (> 5% users or > 1 enterprise customer affected: must escalate)
IC (if engaged):
- Decide on external announcement (default: yes — delay cost exceeds internal-acknowledgment cost)
- Decide communication scope (status page / in-app / proactive email)
- Assign customer success to reach out to key accounts proactively
T+15 to T+60 min
On-call engineer:
- Continue technical investigation and restoration
- Sync progress to incident channel every 15 minutes (even “no new progress” should be said)
Customer Success:
- Contact top 10 customer primary contacts (phone / Slack, not email)
- Proactively notify of incident nature and impact (don’t wait for customer inquiry)
- Collect customer-observed phenomena
Support:
- Monitor ticket surge
- Execute pre-approved communication templates to avoid conflicting messages
T+1 hour
IC:
- Decide on escalation (unresolved > 1 hour: escalate to VP Engineering + CEO)
- Publish first external announcement (status page + email)
Product owner:
- Draft announcement including: what happened, scope, current progress, expected restoration time, compensation policy (if applicable)
- Legal review (if compliance or customer loss involved)
T+restored
Incident Commander:
- Publish “resolved” announcement
- Proactively email all affected customers (including those not visible on status page, e.g., key accounts)
Within 72 hours post-incident:
- Publish full external postmortem — agent product customer expectation for transparency exceeds SaaS
- For data-related incidents, within 72 hours confirm with legal whether regulatory notification is required (GDPR 72-hour window, etc.)
Communication template (medium-severity incident)
Subject: [Incident Notice] Agent completion rate degradation event
Customers:
At UTC HH:MM we detected agent completion rate on [task type] dropping from
~95% to ~75%. Scope: approximately [%] of users running [task type] tasks,
starting at HH:MM.
Initial root cause: [brief description, e.g., "yesterday's prompt update
introduced a regression"].
Current status: [rolled back / rolling back / investigating].
Expected restoration: [time, or "continuous updates"].
Compensation: failed tasks during the affected period will not count against
your monthly quota; will be reflected in next month's invoice.
We will update progress every 30 minutes. Latest status: [status page link].
A full postmortem will follow.
Apologies for the disruption.
[Team signature]
Critical things not to do
- Do not defend decisions during the incident — customer loss is in progress; defending “why this was reasonable” worsens the situation. Acknowledge, remediate, analyze in postmortem
- Do not publicly promise “this will never happen again” — agent product systems are complex; absolute promises tend to break. Use “specific improvements we are making” instead
- Do not let support explain technical issues to customers in lieu of engineering — frontline support lacks full information; wrong explanations seed the next wave of customer dissatisfaction
- Do not externalize blame entirely to upstream model providers in the postmortem — even when root cause is upstream, the customer bought the vendor’s service; upstream can be mentioned, but primary responsibility cannot be deflected
Cross-section connections
- Monitoring metrics and alerting mechanism: operations/overview
- Model output liability compliance boundary: compliance
- CSM proactive outreach process: onboarding