@musashishao/agent-kit 1.6.1 → 1.7.0
- package/.agent/.shared/ui-ux-pro-max/data/charts.csv +26 -0
- package/.agent/.shared/ui-ux-pro-max/data/colors.csv +97 -0
- package/.agent/.shared/ui-ux-pro-max/data/icons.csv +101 -0
- package/.agent/.shared/ui-ux-pro-max/data/landing.csv +31 -0
- package/.agent/.shared/ui-ux-pro-max/data/products.csv +97 -0
- package/.agent/.shared/ui-ux-pro-max/data/prompts.csv +24 -0
- package/.agent/.shared/ui-ux-pro-max/data/react-performance.csv +45 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/flutter.csv +53 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/html-tailwind.csv +56 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/jetpack-compose.csv +53 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/nextjs.csv +53 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/nuxt-ui.csv +51 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/nuxtjs.csv +59 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/react-native.csv +52 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/react.csv +54 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/shadcn.csv +61 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/svelte.csv +54 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/swiftui.csv +51 -0
- package/.agent/.shared/ui-ux-pro-max/data/stacks/vue.csv +50 -0
- package/.agent/.shared/ui-ux-pro-max/data/styles.csv +59 -0
- package/.agent/.shared/ui-ux-pro-max/data/typography.csv +58 -0
- package/.agent/.shared/ui-ux-pro-max/data/ui-reasoning.csv +101 -0
- package/.agent/.shared/ui-ux-pro-max/data/ux-guidelines.csv +100 -0
- package/.agent/.shared/ui-ux-pro-max/data/web-interface.csv +31 -0
- package/.agent/.shared/ui-ux-pro-max/scripts/core.py +258 -0
- package/.agent/.shared/ui-ux-pro-max/scripts/design_system.py +487 -0
- package/.agent/.shared/ui-ux-pro-max/scripts/search.py +76 -0
- package/.agent/adr/ADR-TEMPLATE.md +57 -0
- package/.agent/adr/README.md +30 -0
- package/.agent/agents/backend-specialist.md +1 -1
- package/.agent/agents/devops-engineer.md +1 -1
- package/.agent/agents/performance-optimizer.md +1 -1
- package/.agent/agents/security-auditor.md +1 -1
- package/.agent/dashboard/index.html +169 -0
- package/.agent/rules/REFERENCE.md +14 -0
- package/.agent/skills/ai-incident-management/SKILL.md +517 -0
- package/.agent/skills/ai-security-guardrails/SKILL.md +405 -0
- package/.agent/skills/ai-security-guardrails/owasp-llm-top10.md +160 -0
- package/.agent/skills/ai-security-guardrails/scripts/prompt_injection_scanner.py +230 -0
- package/.agent/skills/compliance-for-ai/SKILL.md +411 -0
- package/.agent/skills/observability-patterns/SKILL.md +484 -0
- package/.agent/skills/observability-patterns/scripts/otel_validator.py +330 -0
- package/.agent/skills/opentelemetry-expert/SKILL.md +738 -0
- package/.agent/skills/opentelemetry-expert/scripts/trace_analyzer.py +351 -0
- package/.agent/skills/privacy-preserving-dev/SKILL.md +442 -0
- package/.agent/skills/privacy-preserving-dev/scripts/pii_scanner.py +285 -0
- package/.agent/workflows/autofix.md +4 -1
- package/.agent/workflows/brainstorm.md +1 -1
- package/.agent/workflows/context.md +3 -1
- package/.agent/workflows/create.md +1 -1
- package/.agent/workflows/dashboard.md +4 -1
- package/.agent/workflows/debug.md +1 -1
- package/.agent/workflows/deploy.md +1 -1
- package/.agent/workflows/enhance.md +1 -1
- package/.agent/workflows/next.md +4 -1
- package/.agent/workflows/orchestrate.md +1 -1
- package/.agent/workflows/plan.md +1 -1
- package/.agent/workflows/preview.md +1 -1
- package/.agent/workflows/quality.md +1 -1
- package/.agent/workflows/spec.md +1 -1
- package/.agent/workflows/status.md +1 -1
- package/.agent/workflows/test.md +1 -1
- package/.agent/workflows/ui-ux-pro-max.md +1 -1
- package/package.json +4 -1
@@ -0,0 +1,517 @@
---
name: ai-incident-management
description: AI-specific incident response playbook, hallucination detection patterns, degradation vs outage classification, rollback strategies, post-mortem templates.
allowed-tools: Read, Glob, Grep
skills:
  - systematic-debugging
  - observability-patterns
---

# AI Incident Management

> When AI breaks, respond fast and learn faster.

---

## 1. AI-Specific Incident Types

### Taxonomy

| Category | Examples | Severity |
|----------|----------|----------|
| **Hallucination** | Factually wrong, made-up info | Medium-Critical |
| **Toxicity** | Harmful, offensive output | Critical |
| **Data Leakage** | PII in responses, prompt leak | Critical |
| **Performance** | High latency, timeouts | Medium |
| **Availability** | Model API down | High-Critical |
| **Drift** | Quality degradation over time | Medium |
| **Prompt Injection** | Security bypass | Critical |

### Severity Matrix

```
┌─────────────────┬──────────────────┬─────────────────────┐
│                 │ Impact: Low      │ Impact: High        │
├─────────────────┼──────────────────┼─────────────────────┤
│ Frequency:      │ P3 - Monitor     │ P2 - Investigate    │
│ Rare            │ Next sprint      │ Within 24h          │
├─────────────────┼──────────────────┼─────────────────────┤
│ Frequency:      │ P2 - Fix Soon    │ P1 - DROP ALL       │
│ Common          │ Within 24h       │ Immediate           │
└─────────────────┴──────────────────┴─────────────────────┘
```
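
The matrix can be read as a lookup from (frequency, impact) to a priority and response window. A minimal sketch, assuming a two-level scale on each axis; the names `Frequency`, `Impact`, and `classify_priority` are illustrative and not part of the skill:

```python
from enum import Enum
from typing import Tuple


class Frequency(Enum):
    RARE = "rare"
    COMMON = "common"


class Impact(Enum):
    LOW = "low"
    HIGH = "high"


# (frequency, impact) -> (priority, response window), copied from the matrix.
SEVERITY_MATRIX = {
    (Frequency.RARE, Impact.LOW): ("P3", "Next sprint"),
    (Frequency.RARE, Impact.HIGH): ("P2", "Within 24h"),
    (Frequency.COMMON, Impact.LOW): ("P2", "Within 24h"),
    (Frequency.COMMON, Impact.HIGH): ("P1", "Immediate"),
}


def classify_priority(frequency: Frequency, impact: Impact) -> Tuple[str, str]:
    """Return (priority, response window) for one cell of the matrix."""
    return SEVERITY_MATRIX[(frequency, impact)]
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the table when thresholds change.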

---

## 2. Incident Response Playbook

### Phase 1: Detection (0-5 minutes)

```markdown
## Detection Checklist
- [ ] Alert received and acknowledged
- [ ] Initial severity assessed
- [ ] On-call notified (if P1/P2)
- [ ] Incident channel created
- [ ] User impact estimated
```

### Phase 2: Triage (5-15 minutes)

```markdown
## Triage Questions
1. What type of AI incident? (hallucination/toxicity/etc.)
2. How widespread? (single user / all users / specific segment)
3. Is it ongoing? (active / resolved / intermittent)
4. What changed recently? (model / prompt / data)
5. Can we reproduce it?

## Quick Actions
- [ ] Check model health dashboard
- [ ] Review recent deployments
- [ ] Check provider status page
- [ ] Sample recent outputs for patterns
```

### Phase 3: Mitigation (15-60 minutes)

| Strategy | When to Use | Impact |
|----------|-------------|--------|
| **Kill Switch** | Toxicity, data leak | Service degraded |
| **Fallback Model** | Primary unavailable | Quality may differ |
| **Circuit Breaker** | High error rate | Automatic recovery |
| **Content Filter** | Specific bad patterns | Targeted fix |
| **Rollback Prompt** | Prompt regression | Quick if versioned |
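
The circuit breaker row is the least self-explanatory of these strategies, so here is a minimal sketch of one. It assumes a consecutive-failure threshold and a fixed cooldown; the class name, parameters, and defaults are hypothetical and would need tuning against real error-rate data:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after too many consecutive failures, then recovers after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True if a call to the model may proceed."""
        if self._opened_at is None:
            return True
        # After the cooldown, allow requests again; the next recorded
        # failure re-opens the breaker immediately.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Callers check `allow_request()` before each model call and report the outcome with `record_success()` or `record_failure()`; while the breaker is open they serve the fallback instead.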

### Phase 4: Resolution

```markdown
## Resolution Checklist
- [ ] Root cause identified
- [ ] Fix implemented and tested
- [ ] Affected users notified (if needed)
- [ ] Monitoring confirmed improvement
- [ ] Incident timeline documented
```

---

## 3. Hallucination Response

### Detection Patterns

```python
import re

# Heuristic indicators only: a match raises suspicion, it does not prove a hallucination.
HALLUCINATION_INDICATORS = [
    # Over-confidence on uncertain topics
    "definitely", "certainly", "always", "never",

    # Made-up citations
    r"\d{4}\)", r"According to Dr\.",

    # Fictional data
    r"\d+%", "studies show", "research indicates",

    # Self-referential confusion
    "as an AI", "I don't have", "I cannot",
]


def has_numeric_claims(response: str) -> bool:
    """Placeholder (not defined in the original): True if the text contains any digit."""
    return re.search(r"\d", response) is not None


def assess_hallucination_risk(response: str) -> float:
    """Score 0-1 for hallucination risk."""
    score = 0.0

    # Check indicators
    for pattern in HALLUCINATION_INDICATORS:
        if re.search(pattern, response, re.IGNORECASE):
            score += 0.1

    # Check for specific claims
    if has_numeric_claims(response):
        score += 0.2

    return min(score, 1.0)
```

### Response Flow

```
Hallucination Detected
├── Severity Assessment
│   ├── Factual error, low impact → Log, monitor
│   ├── Medical/Legal/Financial → Immediate action
│   └── User reported → Investigate
├── Immediate Actions
│   ├── Flag response for review
│   ├── If critical: suppress similar queries
│   └── If pattern: add to filter
└── Long-term
    ├── Add to evaluation dataset
    ├── Consider prompt improvements
    └── Update knowledge base
```
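
The flow above can be sketched as a small routing function. This is an illustration under stated assumptions: the `HallucinationReport` fields, the action names, and the choice to check high-stakes domains before user reports are all hypothetical and not defined by the skill:

```python
from dataclasses import dataclass
from typing import List

# Domains the flow treats as requiring immediate action.
HIGH_STAKES_DOMAINS = {"medical", "legal", "financial"}


@dataclass
class HallucinationReport:
    domain: str                      # e.g. "medical", "general"
    user_reported: bool = False
    recurring_pattern: bool = False  # the same failure has been seen before


def plan_response(report: HallucinationReport) -> List[str]:
    """Return the ordered list of actions for one detected hallucination."""
    actions = ["flag_for_review"]  # every detection is flagged

    if report.domain.lower() in HIGH_STAKES_DOMAINS:
        actions += ["immediate_action", "suppress_similar_queries"]
    elif report.user_reported:
        actions.append("investigate")
    else:
        actions.append("log_and_monitor")

    if report.recurring_pattern:
        actions.append("add_to_filter")

    # Long-term follow-up applies to every case.
    actions.append("add_to_evaluation_dataset")
    return actions
```

Returning action names instead of executing them keeps the routing testable; the caller maps each name to whatever tooling the team actually uses.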

---

## 4. Toxicity Response

### Severity Levels

| Level | Description | Response Time |
|-------|-------------|---------------|
| **L1** | Mildly inappropriate | Review next day |
| **L2** | Offensive to groups | Fix within hours |
| **L3** | Harmful instructions | Immediate block |
| **L4** | Illegal content | Emergency + Legal |

### Emergency Response

```typescript
async function handleToxicOutput(incident: ToxicityIncident): Promise<void> {
  // 1. Immediate containment
  await blockSimilarQueries(incident.queryPatterns);

  // 2. Notify
  await notifyOnCall({
    severity: incident.level,
    description: incident.summary, // Never include actual content
    affectedUsers: incident.userCount,
  });

  // 3. Evidence preservation
  await preserveForReview({
    incidentId: incident.id,
    // Store hashed/redacted version
    contentHash: hash(incident.content),
    timestamp: Date.now(),
  });

  // 4. If L3+, activate circuit breaker
  if (incident.level >= 3) {
    await circuitBreaker.open('chat', {
      fallbackMessage: 'Service temporarily unavailable',
      duration: 30 * 60 * 1000, // 30 minutes
    });
  }
}
```

---

## 5. Rollback Strategies

### What to Roll Back

| Component | Rollback Method | Time to Effect |
|-----------|----------------|----------------|
| **Prompt** | Version control | Seconds |
| **Fine-tuned Model** | Model registry | Minutes |
| **RAG Data** | Snapshot restore | Minutes |
| **Base Model** | Provider switch | Varies |
| **Feature Flag** | Kill switch | Seconds |

### Prompt Versioning

```typescript
interface PromptVersion {
  version: string;
  content: string;
  deployedAt: Date;
  deployedBy: string;
  rollbackTarget?: string; // Previous version to use
}

class PromptManager {
  // getCurrent, getVersion, deploy, and logRollback are assumed to be
  // implemented elsewhere in this class.
  async rollback(promptId: string): Promise<void> {
    const current = await this.getCurrent(promptId);

    // rollbackTarget is optional, so fail loudly when there is nothing to roll back to
    if (!current.rollbackTarget) {
      throw new Error(`No rollback target recorded for prompt ${promptId}`);
    }
    const previous = await this.getVersion(promptId, current.rollbackTarget);

    // Deploy previous version
    await this.deploy(promptId, previous.content);

    // Log rollback
    await this.logRollback({
      promptId,
      fromVersion: current.version,
      toVersion: previous.version,
      reason: 'incident_rollback',
      timestamp: new Date(),
    });
  }
}
```

### Model Fallback Chain

```yaml
# config/ai-fallback.yaml
fallback_chain:
  primary:
    provider: openai
    model: gpt-4
    timeout_ms: 30000

  secondary:
    provider: anthropic
    model: claude-3-sonnet
    timeout_ms: 30000
    trigger:
      - primary_unavailable
      - error_rate > 0.1

  tertiary:
    provider: internal
    model: fallback-model
    timeout_ms: 10000
    trigger:
      - secondary_unavailable
      - degraded_mode

  circuit_breaker:
    type: static_response
    message: "Service temporarily unavailable. Please try again later."
    trigger:
      - all_failed
      - critical_incident
```
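
One way to act on a chain like this is to try each tier in order and fall through to the static response when every tier fails. The sketch below is hypothetical: `complete_with_fallback` and the tier callables are illustrative names, and it only handles the "unavailable" triggers (any raised exception), not the `error_rate` or `degraded_mode` conditions:

```python
from typing import Callable, List, Tuple

STATIC_FALLBACK = "Service temporarily unavailable. Please try again later."

# Each tier is (tier name, callable that takes a prompt and returns a reply).
Tier = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, tiers: List[Tier]) -> Tuple[str, str]:
    """Try each tier in order; return (name of the tier that answered, reply)."""
    for name, call_model in tiers:
        try:
            return name, call_model(prompt)
        except Exception:
            # Treat any error (timeout, provider outage) as "tier unavailable"
            # and move on to the next tier.
            continue
    # all_failed: every tier raised, so serve the static response.
    return "circuit_breaker", STATIC_FALLBACK
```

Returning the tier name alongside the reply lets the caller log which tier served each request, which is useful when judging whether quality differed during an incident.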

---

## 6. Post-Mortem Template

```markdown
# AI Incident Post-Mortem: [Title]

**Date:** [YYYY-MM-DD]
**Duration:** [Start time] - [End time] ([Duration])
**Severity:** P[1-4]
**Incident Commander:** [Name]

## Summary
[2-3 sentence description of what happened]

## Impact
- **Users Affected:** [Number/Percentage]
- **Requests Impacted:** [Count]
- **Business Impact:** [Revenue/Reputation/Legal]

## Timeline
| Time | Event |
|------|-------|
| HH:MM | [First detection] |
| HH:MM | [Incident declared] |
| HH:MM | [Mitigation started] |
| HH:MM | [Resolved] |

## Root Cause
[Detailed explanation of why this happened]

### Contributing Factors
1. [Factor 1]
2. [Factor 2]

## Detection
- **How Detected:** [Alert/User report/Manual]
- **Time to Detect:** [Duration]
- **Detection Gap:** [What could have detected this sooner]

## Response
- **What Worked:** [Effective actions]
- **What Didn't:** [Ineffective or delayed actions]
- **Escalation:** [Was escalation appropriate?]

## Lessons Learned
### What Went Well
- [Positive 1]
- [Positive 2]

### What Went Poorly
- [Issue 1]
- [Issue 2]

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Action 1] | [Name] | [Date] | [ ] |
| [Action 2] | [Name] | [Date] | [ ] |

## AI-Specific Analysis
### Model Behavior
- **Expected:** [What should have happened]
- **Actual:** [What happened]
- **Gap:** [Why the difference]

### Prompt Analysis
- **Prompt Version:** [v1.2.3]
- **Recent Changes:** [What changed]
- **Rollback Used:** [Yes/No]

### Prevention
- [ ] Added to evaluation dataset
- [ ] Updated content filters
- [ ] Added monitoring for similar patterns
- [ ] Documented in knowledge base
```
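
The Duration and Time to Detect fields can be derived from the timeline rather than estimated by hand. A minimal sketch, assuming the incident start time is known in addition to the timeline events; `incident_durations` and its parameter names are illustrative:

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_durations(started: datetime, detected: datetime,
                       mitigation_started: datetime,
                       resolved: datetime) -> Dict[str, timedelta]:
    """Derive the template's timing fields from four timeline events."""
    if not started <= detected <= mitigation_started <= resolved:
        raise ValueError("timeline events must be in chronological order")
    return {
        "time_to_detect": detected - started,
        "time_to_mitigate": mitigation_started - detected,
        "duration": resolved - started,
    }
```

For example, an incident that starts at 10:00, is detected at 10:05, has mitigation begin at 10:20, and is resolved at 11:05 gives a 5-minute time to detect and a 65-minute duration.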

---

## 7. Degradation vs Outage

### Classification

| State | Characteristics | User Experience |
|-------|----------------|-----------------|
| **Healthy** | Normal metrics | Full service |
| **Degraded** | High latency, reduced quality | Slow but works |
| **Partial Outage** | Some features down | Core works |
| **Full Outage** | Service unavailable | Error pages |

### Degradation Handling

```typescript
enum ServiceState {
  HEALTHY = 'healthy',
  DEGRADED = 'degraded',
  PARTIAL_OUTAGE = 'partial_outage',
  OUTAGE = 'outage',
}

function determineServiceState(metrics: AIMetrics): ServiceState {
  const { errorRate, p99Latency, successRate } = metrics;

  if (successRate < 0.5) return ServiceState.OUTAGE;
  if (successRate < 0.9) return ServiceState.PARTIAL_OUTAGE;
  if (errorRate > 0.05 || p99Latency > 30000) return ServiceState.DEGRADED;
  return ServiceState.HEALTHY;
}

// Communicate appropriately
function getStatusMessage(state: ServiceState): string {
  switch (state) {
    case ServiceState.DEGRADED:
      return "AI responses may be slower than usual. Thank you for your patience.";
    case ServiceState.PARTIAL_OUTAGE:
      return "Some AI features are currently limited. Basic functions are available.";
    case ServiceState.OUTAGE:
      return "AI services are temporarily unavailable. Please try again later.";
    default:
      return "";
  }
}
```

---

## 8. Communication Templates

### Internal Alert

```markdown
🚨 **AI Incident Declared**

**Severity:** P[X]
**Type:** [Hallucination/Toxicity/Outage]
**Status:** Investigating

**Impact:** [Brief description]
**Incident Commander:** @[name]
**Channel:** #incident-[date]-ai

Updates every [15/30] minutes until resolved.
```

### User-Facing (if needed)

```markdown
**Service Notice**

We're currently experiencing issues with our AI assistant. You may notice:
- Slower response times
- Reduced functionality

Our team is actively working on a resolution. We apologize for any inconvenience.

Last updated: [Time]
```

---

## 9. On-Call Runbook

### First Responder Checklist

```markdown
## When Paged for AI Incident

1. **Acknowledge** the alert (do this first!)

2. **Quick Assessment** (2 min max)
   - [ ] Check status page for provider outages
   - [ ] Check error rate dashboard
   - [ ] Check recent deployments

3. **Declare or Escalate**
   - If clear cause → Fix
   - If unclear → Declare incident
   - If serious → Page secondary

4. **Communicate**
   - [ ] Update status in #on-call
   - [ ] Create incident channel if P1/P2
   - [ ] Set timer for next update

5. **Mitigate**
   - Refer to playbooks above
   - When in doubt, use fallback/kill switch
```

### Key Links

```yaml
quick_links:
  dashboards:
    - AI Health: https://grafana.internal/ai-health
    - LLM Metrics: https://grafana.internal/llm
    - Error Tracking: https://sentry.internal/ai-service

  provider_status:
    - OpenAI: https://status.openai.com
    - Anthropic: https://status.anthropic.com

  runbooks:
    - Hallucination: /docs/runbooks/hallucination.md
    - Toxicity: /docs/runbooks/toxicity.md
    - Provider Outage: /docs/runbooks/provider-outage.md
```

---

## 10. Checklist

### Preparation

- [ ] Incident response team defined
- [ ] Escalation paths documented
- [ ] Kill switches tested
- [ ] Fallback models configured
- [ ] Post-mortem template ready
- [ ] Communication templates approved

### During Incident

- [ ] Incident acknowledged
- [ ] Severity determined
- [ ] Incident commander assigned
- [ ] Communication channel created
- [ ] Timeline being tracked
- [ ] Updates being shared

### Post-Incident

- [ ] Post-mortem scheduled
- [ ] Action items tracked
- [ ] Lessons shared with team
- [ ] Monitoring improved
- [ ] Playbook updated

---

> **Remember:** The goal isn't to prevent all incidents; it's to detect fast, respond faster, and never repeat.