Incident Response Playbook

Trigger conditions

Any of:

Agent completion rate drops > 20% platform-wide (sustained > 15 minutes)
Upstream model API (Anthropic / OpenAI / self-hosted) unavailable
5 consecutive customer complaints about the same erroneous output on the same task class
Systematic token billing anomaly (users over- or under-charged by > 5%)
Customer’s important data accidentally sent or leaked through the agent
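
The trigger list above can be read as a set of independent checks, any one of which opens an incident. A minimal Python sketch follows. The `PlatformSnapshot` fields and the function name are hypothetical, and the 20% drop is interpreted here as percentage points against a baseline, which is an assumption the playbook does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlatformSnapshot:
    """Hypothetical metrics snapshot; all field names are illustrative."""
    baseline_completion_rate: float  # e.g. 0.95
    current_completion_rate: float   # e.g. 0.70
    degraded_minutes: float          # how long the drop has been sustained
    upstream_api_available: bool
    same_issue_complaints: int       # consecutive complaints, same output, same task class
    billing_error_ratio: float       # signed: +0.08 means over-charged by 8%
    data_leak_suspected: bool


def triggered_conditions(s: PlatformSnapshot) -> List[str]:
    """Return the names of every trigger condition that currently holds."""
    reasons = []
    # Assumption: "drops > 20%" is measured in percentage points.
    drop = s.baseline_completion_rate - s.current_completion_rate
    if drop > 0.20 and s.degraded_minutes > 15:
        reasons.append("completion_rate_drop")
    if not s.upstream_api_available:
        reasons.append("upstream_unavailable")
    if s.same_issue_complaints >= 5:
        reasons.append("repeated_complaints")
    if abs(s.billing_error_ratio) > 0.05:
        reasons.append("billing_anomaly")
    if s.data_leak_suspected:
        reasons.append("data_leak")
    return reasons
```

An incident is opened when the returned list is non-empty; returning the reasons rather than a bare boolean gives the on-call engineer the incident category up front.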

Roles and responsibilities

Role	Primary responsibility
On-call engineer	First responder; technical diagnosis; service restoration
Incident Commander (IC)	Coordination; escalation decisions; approves external comms
Customer Success (CSM)	Proactive outreach to key accounts; collect impact scope
Support	Handle ticket surge; execute customer communication templates
Product owner	Decide external announcement wording and distribution scope
Legal	Engaged for data leaks, compliance events, model output liability

Differences from SaaS incidents

Agent products have several incident forms that conventional SaaS lacks, and these require different communication strategies:

1. “Agent behavior anomaly” vs “system outage”

SaaS outage: clear “works / doesn’t work”. Communication templates are mature.

Agent behavior anomaly: users can connect, but the agent makes wrong decisions or quality drops. Example: prompt regression causing platform-wide task completion rate to fall from 95% to 75% — technically the service is “normal”, but customer experience is at incident level.

Communication point: do not paper over the problem with “system status: normal”. Publish something like “we are observing a task quality decline on category X and are investigating”. This transparency builds more trust than pretending nothing is wrong.

2. Token billing anomaly

Involves money; more sensitive than technical failure.

Over-charge: proactively refund without waiting for the customer to discover it; the announcement specifies the refund timeline
Under-charge: in principle, do not back-bill customers (the reputation cost exceeds the loss); the announcement states that the issue is fixed
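
As a rough illustration of the asymmetric policy above (refund over-charges, waive under-charges), the sketch below computes the adjustment for a single customer. The function name, the returned keys, and the per-customer use of the 5% threshold are assumptions for illustration, not part of any existing billing system.

```python
from decimal import Decimal

ANOMALY_THRESHOLD = Decimal("0.05")  # mirrors the "> 5%" trigger condition


def billing_adjustment(charged: Decimal, correct: Decimal) -> dict:
    """Classify a billing discrepancy and compute the refund owed."""
    if correct <= 0:
        raise ValueError("correct amount must be positive")
    diff = charged - correct
    return {
        "anomaly": abs(diff) / correct > ANOMALY_THRESHOLD,
        # Over-charge: refund the full difference proactively.
        # Under-charge: refund is zero and nothing is back-billed.
        "refund": diff if diff > 0 else Decimal("0"),
    }
```

Using `Decimal` rather than floats avoids rounding drift when the amounts are later summed across many customers.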

3. Model output errors causing customer loss

The most severe form: the agent made a wrong decision for the customer (sent the wrong email, clicked the wrong link, wrote the wrong contract clause).

Legal liability allocation depends on contract terms (see compliance), but communication must accept ownership immediately: the agent is part of the product and the vendor bears primary responsibility. Root cause analysis follows later.

Response timeline

T+0 to T+15 min

On-call engineer:

Confirm incident nature (category, severity, scope)
Open incident channel (Slack / war room)
Assess whether IC engagement is required (more than 5% of users or more than 1 enterprise customer affected: must escalate)
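
The IC escalation rule in the last step can be written as a single predicate. This is a sketch under the literal reading of the thresholds (strictly more than 5% of users, strictly more than one enterprise customer); the function and parameter names are hypothetical.

```python
def ic_required(affected_users: int, total_users: int,
                affected_enterprise_customers: int) -> bool:
    """Return True when the on-call engineer must bring in the IC."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    # Integer comparison avoids floating-point edge cases at exactly 5%.
    over_user_threshold = affected_users * 100 > 5 * total_users
    return over_user_threshold or affected_enterprise_customers > 1
```

For example, 60 affected users out of 1,000 requires escalation, while exactly 50 out of 1,000 with a single enterprise customer affected does not.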

IC (if engaged):

Decide on external announcement (default: yes, because the cost of a delayed announcement exceeds the cost of acknowledging the incident)
Decide communication scope (status page / in-app / proactive email)
Assign customer success to reach out to key accounts proactively

T+15 to T+60 min

On-call engineer:

Continue technical investigation and restoration
Sync progress to the incident channel every 15 minutes (post an update even when there is no new progress)

Customer Success:

Contact the primary contacts at the top 10 customers (by phone or Slack, not email)
Proactively notify them of the incident’s nature and impact (do not wait for the customer to ask)
Collect the symptoms customers are observing

Support:

Monitor ticket surge
Execute pre-approved communication templates to avoid conflicting messages

T+1 hour

IC:

Decide on escalation (unresolved > 1 hour: escalate to VP Engineering + CEO)
Publish first external announcement (status page + email)
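
The one-hour escalation rule above is simple enough to encode directly, for instance in a bot that reminds the incident channel. This is a minimal sketch; the function name and the idea of returning role names as strings are assumptions.

```python
from datetime import timedelta
from typing import List

ESCALATION_AFTER = timedelta(hours=1)


def escalation_targets(elapsed: timedelta, resolved: bool) -> List[str]:
    """Return who the IC should escalate to, given time since incident start."""
    if resolved or elapsed <= ESCALATION_AFTER:
        return []
    return ["VP Engineering", "CEO"]
```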

Product owner:

Draft announcement including: what happened, scope, current progress, expected restoration time, compensation policy (if applicable)
Send the draft for legal review (if compliance or customer loss is involved)

T+restored

IC:

Publish “resolved” announcement
Proactively email all affected customers (including those not visible on status page, e.g., key accounts)

Within 72 hours post-incident:

Publish a full external postmortem (customers of agent products expect more transparency than SaaS customers do)
For data-related incidents, confirm with legal as early as possible whether regulatory notification is required: the GDPR 72-hour window runs from when the organization becomes aware of the breach, not from restoration
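
The two 72-hour clocks are easy to confuse, so it can help to compute both deadlines explicitly. The sketch below assumes the postmortem clock starts at restoration and that the GDPR notification clock starts when the team became aware of the breach (per GDPR Article 33); the function name and dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=72)


def post_incident_deadlines(aware_at: datetime, restored_at: datetime) -> dict:
    """Compute the postmortem and regulatory-notification deadlines (UTC)."""
    return {
        # Assumption: "72 hours post-incident" counts from restoration.
        "postmortem_due": restored_at + WINDOW,
        # GDPR window counts from awareness of the breach, not from restoration.
        "gdpr_notification_due": aware_at + WINDOW,
    }


deadlines = post_incident_deadlines(
    aware_at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    restored_at=datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
)
```

In this example the regulatory deadline falls four hours before the postmortem deadline, which is why the legal check should not be left to the end of the post-incident window.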

Communication template (medium-severity incident)

Subject: [Incident Notice] Agent completion rate degradation event

Customers:

At HH:MM UTC we detected the agent completion rate on [task type] dropping from
~95% to ~75%. Scope: approximately [%] of users running [task type] tasks,
starting at HH:MM UTC.

Initial root cause: [brief description, e.g., "yesterday's prompt update
introduced a regression"].
Current status: [rolled back / rolling back / investigating].
Expected restoration: [time, or "continuous updates"].

Compensation: failed tasks during the affected period will not count against
your monthly quota; this will be reflected in next month's invoice.

We will update progress every 30 minutes. Latest status: [status page link].
A full postmortem will follow.

Apologies for the disruption.

[Team signature]
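
To keep support from sending a notice with an unfilled placeholder, the template can be rendered programmatically. The sketch below uses Python's standard `string.Template` on an abbreviated version of the notice; the `$`-style field names are hypothetical stand-ins for the bracketed placeholders above.

```python
from string import Template

# Abbreviated notice; field names are illustrative, not an existing schema.
NOTICE = Template(
    "At $detected_utc UTC we detected agent completion rate on $task_type "
    "dropping from ~$baseline_pct% to ~$current_pct%.\n"
    "Initial root cause: $root_cause.\n"
    "Current status: $status.\n"
    "Expected restoration: $eta.\n"
    "Latest status: $status_page_url"
)


def render_notice(**fields: str) -> str:
    """Fill in the notice; fails loudly if any field is missing."""
    # substitute() raises KeyError on a missing field, so a notice with an
    # unfilled placeholder is never produced.
    return NOTICE.substitute(fields)
```

Choosing `substitute()` over `safe_substitute()` is deliberate here: a hard failure during drafting is cheaper than a customer-facing email that still contains a raw placeholder.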

Critical things not to do

Do not defend decisions during the incident — customer loss is in progress; defending “why this was reasonable” worsens the situation. Acknowledge, remediate, analyze in postmortem
Do not publicly promise “this will never happen again” — agent product systems are complex; absolute promises tend to break. Use “specific improvements we are making” instead
Do not let support explain technical issues to customers in lieu of engineering — frontline support lacks full information; wrong explanations seed the next wave of customer dissatisfaction
Do not externalize blame entirely to upstream model providers in the postmortem — even when root cause is upstream, the customer bought the vendor’s service; upstream can be mentioned, but primary responsibility cannot be deflected

Cross-section connections

Monitoring metrics and alerting mechanism: operations/overview
Model output liability compliance boundary: compliance
CSM proactive outreach process: onboarding