arkaos 2.0.0 → 2.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +100 -74
- package/VERSION +1 -1
- package/bin/arkaos +1 -1
- package/core/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/agents/__pycache__/validator.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/advisor_db.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/display.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/matcher.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/persistence.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/profiler.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/prompts.cpython-313.pyc +0 -0
- package/core/conclave/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/governance/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/governance/__pycache__/constitution.cpython-313.pyc +0 -0
- package/core/registry/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/registry/__pycache__/generator.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/base.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/claude_code.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/codex_cli.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/cursor.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/gemini_cli.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/registry.cpython-313.pyc +0 -0
- package/core/runtime/__pycache__/subagent.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/manager.cpython-313.pyc +0 -0
- package/core/specs/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/registry.cpython-313.pyc +0 -0
- package/core/squads/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/cache.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/engine.cpython-313.pyc +0 -0
- package/core/synapse/__pycache__/layers.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/manager.cpython-313.pyc +0 -0
- package/core/tasks/__pycache__/schema.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/__init__.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/engine.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/loader.cpython-313.pyc +0 -0
- package/core/workflow/__pycache__/schema.cpython-313.pyc +0 -0
- package/departments/dev/skills/agent-design/SKILL.md +4 -0
- package/departments/dev/skills/agent-design/references/architecture-patterns.md +223 -0
- package/departments/dev/skills/ai-security/SKILL.md +4 -0
- package/departments/dev/skills/ai-security/references/prompt-injection-catalog.md +230 -0
- package/departments/dev/skills/ci-cd-pipeline/SKILL.md +4 -0
- package/departments/dev/skills/ci-cd-pipeline/references/github-actions-patterns.md +202 -0
- package/departments/dev/skills/db-schema/SKILL.md +4 -0
- package/departments/dev/skills/db-schema/references/indexing-strategy.md +197 -0
- package/departments/dev/skills/dependency-audit/SKILL.md +4 -0
- package/departments/dev/skills/dependency-audit/references/license-matrix.md +191 -0
- package/departments/dev/skills/incident/SKILL.md +4 -0
- package/departments/dev/skills/incident/references/severity-playbook.md +221 -0
- package/departments/dev/skills/observability/SKILL.md +4 -0
- package/departments/dev/skills/observability/references/slo-design.md +200 -0
- package/departments/dev/skills/rag-architect/SKILL.md +5 -0
- package/departments/dev/skills/rag-architect/references/chunking-strategies.md +129 -0
- package/departments/dev/skills/rag-architect/references/evaluation-guide.md +158 -0
- package/departments/dev/skills/red-team/SKILL.md +4 -0
- package/departments/dev/skills/red-team/references/mitre-attack-web.md +165 -0
- package/departments/dev/skills/security-audit/SKILL.md +4 -0
- package/departments/dev/skills/security-audit/references/owasp-2025-deep.md +409 -0
- package/departments/dev/skills/security-compliance/SKILL.md +117 -0
- package/departments/finance/skills/ciso-advisor/SKILL.md +4 -0
- package/departments/finance/skills/ciso-advisor/references/compliance-roadmap.md +172 -0
- package/departments/marketing/skills/programmatic-seo/SKILL.md +4 -0
- package/departments/marketing/skills/programmatic-seo/references/template-playbooks.md +289 -0
- package/departments/ops/skills/gdpr-compliance/SKILL.md +104 -0
- package/departments/ops/skills/iso27001/SKILL.md +113 -0
- package/departments/ops/skills/quality-management/SKILL.md +118 -0
- package/departments/ops/skills/risk-management/SKILL.md +120 -0
- package/departments/ops/skills/soc2-compliance/SKILL.md +120 -0
- package/departments/strategy/skills/cto-advisor/SKILL.md +4 -0
- package/departments/strategy/skills/cto-advisor/references/build-vs-buy-framework.md +190 -0
- package/installer/cli.js +13 -2
- package/installer/index.js +1 -2
- package/installer/migrate.js +123 -0
- package/installer/update.js +28 -15
- package/package.json +1 -1
- package/pyproject.toml +1 -1
- package/core/agents/__pycache__/registry_gen.cpython-313.pyc +0 -0
@@ -0,0 +1,191 @@
# Open Source License Compatibility Matrix — Deep Reference

> Companion to `dependency-audit/SKILL.md`. License obligations, compatibility rules, and commercial implications.

## License Classification

| License | Type | OSI Approved | Copyleft Strength |
|---------|------|-------------|-------------------|
| MIT | Permissive | Yes | None |
| ISC | Permissive | Yes | None |
| BSD-2-Clause | Permissive | Yes | None |
| BSD-3-Clause | Permissive | Yes | None |
| Apache-2.0 | Permissive | Yes | None (patent grant) |
| MPL-2.0 | Weak copyleft | Yes | File-level |
| LGPL-2.1 | Weak copyleft | Yes | Library-level |
| LGPL-3.0 | Weak copyleft | Yes | Library-level |
| GPL-2.0 | Strong copyleft | Yes | Entire work |
| GPL-3.0 | Strong copyleft | Yes | Entire work |
| AGPL-3.0 | Network copyleft | Yes | Entire work + network use |
| SSPL | Source-available | No | Entire service stack |
| BSL 1.1 | Source-available | No | Time-delayed open source |
| Proprietary | Proprietary | No | Full restriction |

## Compatibility Matrix

Can you combine code under these licenses in the SAME project?

| Dependency License | Your Project: MIT | Your Project: Apache-2.0 | Your Project: GPL-3.0 | Your Project: Proprietary |
|-------------------|:-:|:-:|:-:|:-:|
| MIT | OK | OK | OK | OK |
| BSD-2/3 | OK | OK | OK | OK |
| Apache-2.0 | OK | OK | OK | OK |
| MPL-2.0 | OK (keep MPL files) | OK (keep MPL files) | OK | OK (keep MPL files) |
| LGPL-2.1/3.0 | OK (dynamic link) | OK (dynamic link) | OK | OK (dynamic link only) |
| GPL-2.0 | NO | NO | OK (if "or later") | NO |
| GPL-3.0 | NO | NO | OK | NO |
| AGPL-3.0 | NO | NO | OK (becomes AGPL) | NO |

**Key:** OK = compatible. NO = license violation.
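
For automation, the matrix above can be encoded as a lookup table. The sketch below is illustrative only — the `COMPAT` table and `can_use` helper are ours, not part of any published tool — and covers just the combinations shown:

```python
# Illustrative encoding of the compatibility matrix above.
# COMPAT and can_use() are hypothetical helpers, not a real API.
COMPAT = {
    # dependency license -> project licenses it may be combined with
    "MIT":        {"MIT", "Apache-2.0", "GPL-3.0", "Proprietary"},
    "BSD-3":      {"MIT", "Apache-2.0", "GPL-3.0", "Proprietary"},
    "Apache-2.0": {"MIT", "Apache-2.0", "GPL-3.0", "Proprietary"},
    "MPL-2.0":    {"MIT", "Apache-2.0", "GPL-3.0", "Proprietary"},  # keep MPL files open
    "LGPL-3.0":   {"MIT", "Apache-2.0", "GPL-3.0", "Proprietary"},  # dynamic linking only
    "GPL-2.0":    set(),          # "or later" variants may combine with GPL-3.0
    "GPL-3.0":    {"GPL-3.0"},
    "AGPL-3.0":   {"GPL-3.0"},    # combined work becomes AGPL
}

def can_use(dep_license: str, project_license: str) -> bool:
    """True if the dependency license is compatible with the project license.

    Unknown license = all rights reserved: never assume compatibility.
    """
    return project_license in COMPAT.get(dep_license, set())

print(can_use("MIT", "Proprietary"))       # True
print(can_use("AGPL-3.0", "Proprietary"))  # False
```

A table like this only captures the coarse rules; edge cases (linking mode, file-level obligations, license exceptions) still need the notes in the matrix cells.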
## Obligations by License

### Permissive Licenses (MIT, BSD, ISC, Apache-2.0)

| Obligation | MIT | BSD-2 | BSD-3 | Apache-2.0 |
|------------|:---:|:-----:|:-----:|:----------:|
| Include copyright notice | Yes | Yes | Yes | Yes |
| Include license text | Yes | Yes | Yes | Yes |
| State changes | No | No | No | Yes |
| Patent grant | No | No | No | Yes |
| No endorsement clause | No | No | Yes | No |
| NOTICE file preservation | No | No | No | Yes |

### Copyleft Licenses (MPL, LGPL, GPL, AGPL)

| Obligation | MPL-2.0 | LGPL-3.0 | GPL-3.0 | AGPL-3.0 |
|------------|:-------:|:--------:|:-------:|:--------:|
| Source code of modified files | Yes | Yes | Yes | Yes |
| Source code of entire work | No | No | Yes | Yes |
| Network use triggers copyleft | No | No | No | Yes |
| Allow reverse engineering | No | Yes | Yes | Yes |
| Provide installation info | No | No | Yes | Yes |
| Anti-tivoization clause | No | No | Yes | Yes |

## Commercial Use Decision Tree

```
START: Is the dependency's license identified?
  NO --> STOP. Do not use. Unknown license = all rights reserved.
  YES --> Is it permissive (MIT, BSD, ISC, Apache)?
    YES --> SAFE for commercial use. Include attribution.
    NO --> Is it weak copyleft (MPL, LGPL)?
      YES --> SAFE if:
              MPL: Keep modified MPL files open-source
              LGPL: Dynamic linking only (no static linking in proprietary)
      NO --> Is it strong copyleft (GPL)?
        YES --> NOT SAFE for proprietary. Your entire project becomes GPL.
                Exception: GPL with Classpath Exception (like OpenJDK)
        NO --> Is it network copyleft (AGPL)?
          YES --> NOT SAFE. Even SaaS use triggers copyleft.
          NO --> Is it source-available (SSPL, BSL)?
            YES --> Review specific terms. Usually NOT safe for competing services.
            NO --> Legal review required.
```

## Common Ecosystem License Risks

### Node.js (npm)

| Risk Pattern | Example | Action |
|-------------|---------|--------|
| Transitive GPL dependency | `node-sass` (deprecated, had GPL deps) | Audit full tree with `license-checker` |
| License field missing in package.json | Small utilities | Check source repo manually |
| "SEE LICENSE IN" custom file | Some enterprise libs | Read the actual file |
| Dual license with one GPL option | Some databases | Verify you are using the permissive option |

### PHP (Composer)

| Risk Pattern | Example | Action |
|-------------|---------|--------|
| Laravel itself is MIT | Framework code | Safe |
| Packages wrapping GPL tools | ImageMagick bindings | Check if wrapper license differs from tool |
| WordPress ecosystem (GPL-2.0+) | Themes, plugins | All derivatives must be GPL |

### Python (pip)

| Risk Pattern | Example | Action |
|-------------|---------|--------|
| PSF license (Python itself) | Standard lib | Safe, permissive |
| LGPL in scientific computing | Some NumPy deps historically | Dynamic linking usually fine |
| GPL in CLI tools used via subprocess | `pandoc` (GPL) | Subprocess call != linking (generally safe) |

## Dual Licensing Strategies

| Strategy | How It Works | Example |
|----------|-------------|---------|
| Open core | AGPL for community, proprietary for enterprise | MongoDB (was AGPL, now SSPL) |
| GPL + commercial | GPL for open use, paid license for proprietary embedding | MySQL, Qt |
| MIT + paid support | MIT code, paid for SLA/support/features | Many SaaS tools |
| BSL time-delay | Source-available now, becomes open source after N years | CockroachDB, Sentry |

## License Audit Automation

### npm

```bash
# List all licenses
npx license-checker --summary

# Fail CI on copyleft
npx license-checker --failOn "GPL-2.0;GPL-3.0;AGPL-3.0"

# Export for legal review
npx license-checker --csv --out licenses.csv
```

### Composer (PHP)

```bash
# List licenses
composer licenses

# Detailed with dependencies
composer licenses --format=json
```

### pip (Python)

```bash
pip-licenses --format=table --with-urls
pip-licenses --fail-on="GPL-2.0;GPL-3.0;AGPL-3.0"
```

## License Change Risks

| Event | Risk | Action |
|-------|------|--------|
| Maintainer relicenses (MIT -> BSL) | Future versions restricted | Pin to last MIT version |
| CLA-based project changes license | All contributor code affected | Monitor project governance |
| Acquisition changes license | New owner may restrict | Evaluate fork options |
| License removed from package | Defaults to all-rights-reserved | Contact maintainer, find alternative |

### Notable License Changes (Precedents)

| Project | From | To | Year | Impact |
|---------|------|----|------|--------|
| Elasticsearch | Apache-2.0 | SSPL | 2021 | AWS forked as OpenSearch |
| Redis | BSD-3-Clause | RSALv2/SSPLv1 | 2024 | Community fork (Valkey) |
| HashiCorp (Terraform) | MPL-2.0 | BSL 1.1 | 2023 | OpenTofu fork |
| MongoDB | AGPL-3.0 | SSPL | 2018 | Cloud providers stopped offering managed MongoDB |

## NOTICE File Template

When using Apache-2.0 licensed code, preserve and extend the NOTICE file:

```
This product includes software developed by [Original Author].
Licensed under the Apache License, Version 2.0.

Modifications by [Your Company], [Year].
```

## Quick Reference: "Can I Use This?"

| Your Project Type | Safe Licenses | Caution | Blocked |
|-------------------|--------------|---------|---------|
| Proprietary SaaS | MIT, BSD, ISC, Apache-2.0 | MPL, LGPL (check linking) | GPL, AGPL, SSPL |
| Proprietary desktop app | MIT, BSD, ISC, Apache-2.0 | LGPL (dynamic link only) | GPL, AGPL |
| Open source (MIT) | MIT, BSD, ISC, Apache-2.0 | MPL | GPL, AGPL (would change your license) |
| Open source (GPL-3.0) | GPL-compatible licenses (MIT, BSD, Apache-2.0, LGPL, MPL) | SSPL, BSL | Proprietary, GPL-2.0-only |
| Internal tool (never distributed) | All (copyleft not triggered) | AGPL (network use triggers) | SSPL (service use triggers) |
@@ -123,3 +123,7 @@ Surface these issues WITHOUT being asked:
1. [Key takeaway]
2. [Key takeaway]
```

## References

- [severity-playbook.md](references/severity-playbook.md) — SEV1-4 definitions, escalation paths, communication templates, and PIR checklist
@@ -0,0 +1,221 @@
# Severity Playbook — Deep Reference

> SEV1-4 definitions, escalation paths, communication templates, PIR framework, and anti-patterns.

## Severity Definitions with Examples

| Level | Definition | User Impact | Examples |
|-------|-----------|-------------|---------|
| **SEV1** | Complete service outage, data loss, active security breach | 100% of users or data integrity compromised | Database corruption, payment system down, credentials leaked, entire site returning 500s |
| **SEV2** | Major feature degraded, >25% of users affected | Significant functionality lost | Search broken, checkout intermittent, API latency >10x baseline, auth failures for a subset of users |
| **SEV3** | Single feature broken, workaround exists | Minor inconvenience, <10% of users | Export fails (manual workaround), slow dashboard, broken notification emails |
| **SEV4** | Cosmetic, dev/staging only, no user impact | None or negligible | UI alignment bug, staging environment down, deprecation warning, flaky test |
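
These definitions can be approximated in triage tooling. A minimal sketch using the thresholds from the table above — the `classify_severity` helper is hypothetical, and real classification still requires human judgment:

```python
def classify_severity(pct_users_affected: float,
                      data_loss: bool = False,
                      security_breach: bool = False,
                      workaround: bool = False) -> str:
    """Rough first-pass severity per the table above (hypothetical helper)."""
    # SEV1: total outage, data loss, or active breach
    if data_loss or security_breach or pct_users_affected >= 100:
        return "SEV1"
    # SEV2: major degradation, >25% of users
    if pct_users_affected > 25:
        return "SEV2"
    # SEV3 only applies when a workaround exists; otherwise treat as SEV2
    if pct_users_affected > 0:
        return "SEV3" if workaround else "SEV2"
    # SEV4: no user impact
    return "SEV4"

print(classify_severity(5, workaround=True))  # SEV3
```

Such a helper is a starting point for alert routing; the upgrade/downgrade criteria later in this document still apply as the incident evolves.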
## Escalation Paths

### SEV1 — Full Escalation

```
T+0min   Alert fires or report received
T+5min   On-call engineer acknowledges, starts investigation
T+10min  Incident Commander assigned, war room opened
T+15min  First stakeholder notification sent
T+15min  Engineering lead and CTO notified
T+30min  If no mitigation path: escalate to vendor/cloud provider
T+60min  If unresolved: executive briefing, consider public status page
T+4h     If unresolved: assemble cross-team tiger team
```

### SEV2 — Team Escalation

```
T+0min   Alert fires or report received
T+15min  On-call engineer acknowledges
T+30min  Team lead notified, incident channel created
T+1h     First stakeholder update
T+2h     If unresolved: escalate to engineering manager
T+4h     If unresolved: consider SEV1 upgrade
```

### SEV3 — Standard Response

```
T+0min   Ticket created automatically or manually
T+2h     Engineer assigned during business hours
T+1d     Initial investigation and fix ETA
T+3d     Fix deployed or workaround documented
```

### SEV4 — Backlog

```
T+0min       Ticket created, tagged low priority
Next sprint  Triaged and prioritized
```

## Severity Upgrade/Downgrade Criteria

| Trigger | Action |
|---------|--------|
| Impact expands beyond initial scope | Upgrade severity |
| Duration exceeds 2x expected MTTR | Upgrade severity |
| Data integrity concerns emerge | Upgrade to SEV1 |
| Workaround found and confirmed | Consider downgrade |
| Impact narrower than initial assessment | Downgrade severity |

## Communication Templates

### Initial Notification (SEV1/SEV2)

```
INCIDENT: [SEV{N}] {Service Name} - {Brief Description}

Impact: {What users experience, how many affected}
Start time: {ISO 8601 timestamp, timezone}
Status: INVESTIGATING

Incident Commander: {Name}
Technical Lead: {Name}
War Room: {Slack channel / Zoom link}

Next update: {Time, max 15min for SEV1, 30min for SEV2}
```

### Status Update

```
INCIDENT UPDATE #{N}: [SEV{level}] {Service Name}

Status: INVESTIGATING | IDENTIFIED | MITIGATING | MONITORING | RESOLVED
Duration: {elapsed time}

What we know:
- {Finding 1}
- {Finding 2}

Actions taken:
- {Action 1}
- {Action 2}

Next steps:
- {Planned action with owner}

ETA to resolution: {estimate or "Under investigation"}
Next update: {time}
```

### Resolution Notification

```
RESOLVED: [SEV{level}] {Service Name} - {Brief Description}

Duration: {start} to {end} ({total})
Root cause: {1-2 sentence summary}
Fix applied: {what was done}
Users affected: {count or percentage}

Post-Incident Review scheduled: {date/time}
Action items will be tracked in: {ticket link}
```

### Customer-Facing Status Page

```
[Investigating] We are aware of issues with {feature}. Our team is actively
investigating. We will provide an update within {timeframe}.

[Identified] We have identified the cause of {issue}. A fix is being implemented.
Expected resolution: {ETA}.

[Resolved] The issue with {feature} has been resolved. All systems are operating
normally. We apologize for the inconvenience.
```

## Post-Incident Review (PIR) Template

### Header

| Field | Value |
|-------|-------|
| Incident ID | INC-YYYY-NNN |
| Severity | SEV{N} |
| Date | YYYY-MM-DD |
| Duration | {start} to {end} ({total}) |
| Incident Commander | {Name} |
| Technical Lead | {Name} |
| PIR Author | {Name} |
| PIR Date | {date, within 48h of resolution} |

### Timeline (Required for SEV1/SEV2)

| Time (UTC) | Event | Source |
|------------|-------|--------|
| HH:MM | Alert fired: {description} | Monitoring |
| HH:MM | On-call acknowledged | PagerDuty |
| HH:MM | IC assigned, war room opened | Manual |
| HH:MM | Root cause identified: {description} | Investigation |
| HH:MM | Mitigation applied: {action} | Deployment |
| HH:MM | Service confirmed restored | Monitoring |

### Root Cause Analysis

**5 Whys format:**

```
1. Why did the service go down?
   -> Database connection pool exhausted
2. Why was the pool exhausted?
   -> Slow query holding connections for 30s+
3. Why was the query slow?
   -> Missing index on users.email after migration
4. Why was the index missing?
   -> Migration script did not include index creation
5. Why was the missing index not caught?
   -> No performance test in CI for migration scripts
```

### Action Items Table

| # | Action | Type | Owner | Due Date | Priority | Status |
|---|--------|------|-------|----------|----------|--------|
| 1 | Add index on users.email | Fix | {name} | {date} | P0 | Done |
| 2 | Add migration perf tests to CI | Prevent | {name} | {date} | P1 | Open |
| 3 | Add connection pool alert at 80% | Detect | {name} | {date} | P1 | Open |
| 4 | Document DB migration checklist | Process | {name} | {date} | P2 | Open |

Action item types: **Fix** (address this incident), **Prevent** (stop recurrence), **Detect** (catch it earlier), **Process** (improve response).

## PIR Quality Checklist

- [ ] Timeline is complete with timestamps from monitoring (not memory)
- [ ] Root cause goes deep enough (5 Whys or equivalent)
- [ ] Action items have owners and due dates (no orphaned items)
- [ ] Action items include detection improvements, not just fixes
- [ ] Blameless language throughout (systems, not people)
- [ ] Shared with broader engineering team
- [ ] Runbooks updated with new knowledge
- [ ] Follow-up review scheduled for action item completion

## Anti-Patterns

| Anti-Pattern | Why It Hurts | Fix |
|-------------|-------------|-----|
| Skipping severity classification | Wrong response level, wasted effort or delayed response | Classify within first 5 minutes, always |
| Hero culture (one person does everything) | Burnout, no knowledge sharing, SPOF | Separate IC and Tech Lead roles |
| No communication cadence | Stakeholders assume the worst, escalate unnecessarily | Set timer for updates, even if "still investigating" |
| Blame-focused PIR | People hide mistakes, no systemic improvement | Blameless by policy, focus on systems |
| PIR action items with no owners | Nothing gets done, same incident recurs | Every action item requires name + date |
| Never upgrading severity | A SEV3 that is actually a SEV1 gets a slow response | Review upgrade criteria at every status update |
| Fix-only action items | Catches this incident but not the next variant | Always include Detect and Prevent items |
| PIR delayed beyond 1 week | Details forgotten, momentum lost | Schedule within 48 hours, hard deadline 5 days |

## Metrics to Track

| Metric | Target | Measures |
|--------|--------|----------|
| MTTD (Mean Time to Detect) | < 5 min | Monitoring effectiveness |
| MTTA (Mean Time to Acknowledge) | < 10 min (SEV1) | On-call responsiveness |
| MTTR (Mean Time to Resolve) | < 1h (SEV1), < 4h (SEV2) | Resolution efficiency |
| PIR completion rate | 100% for SEV1/SEV2 | Learning culture |
| Action item completion rate | > 90% within due date | Follow-through |
| Recurrence rate | < 5% same root cause | Prevention effectiveness |
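
Computed from monitoring timestamps (not memory), these metrics are simple time deltas. A minimal sketch with hypothetical incident data:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 UTC timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

# One incident's timeline (hypothetical data)
alert    = "2025-01-10T14:00:00Z"  # alert fired
ack      = "2025-01-10T14:06:00Z"  # on-call acknowledged
resolved = "2025-01-10T14:52:00Z"  # service confirmed restored

mtta = minutes_between(alert, ack)       # 6.0  — within the SEV1 target
mttr = minutes_between(alert, resolved)  # 52.0 — within the 1h SEV1 target
```

Averaging these per-incident deltas over a quarter gives the MTTA/MTTR figures in the table.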
@@ -0,0 +1,200 @@
# SLO Design Guide — Deep Reference

> SLI/SLO/SLA framework, error budgets, burn rate alerts, and production SLO documents.

## Terminology

| Term | Definition | Owner | Example |
|------|-----------|-------|---------|
| **SLI** (Service Level Indicator) | Quantitative measure of service behavior | Engineering | Request latency p99 |
| **SLO** (Service Level Objective) | Target value for an SLI over a time window | Engineering + Product | p99 latency < 200ms over 30 days |
| **SLA** (Service Level Agreement) | Contract with consequences for missing targets | Business + Legal | 99.9% uptime or service credits |
| **Error Budget** | Allowed amount of unreliability | Engineering | 0.1% of requests can fail per month |

Relationship: SLI measures reality. SLO sets internal targets. SLA sets external commitments. SLO should always be stricter than SLA.

## Step 1: Define SLIs

### SLI Selection by Service Type

| Service Type | Primary SLI | Secondary SLIs |
|-------------|------------|----------------|
| **API / Web Service** | Availability (successful responses / total) | Latency p50/p95/p99, error rate |
| **Data Pipeline** | Freshness (time since last successful run) | Throughput, completeness |
| **Storage System** | Durability (data loss events) | Availability, latency |
| **Batch Processing** | Completion rate within deadline | Processing time, error rate |
| **Streaming** | End-to-end latency | Throughput, ordering guarantees |

### SLI Specification Template

```
SLI Name: API Availability
Definition: Proportion of valid requests served successfully
Good event: HTTP response with status code != 5xx, latency < 1000ms
Valid event: All HTTP requests excluding health checks
Measurement: Load balancer access logs
Aggregation: Rolling 30-day window
```

### Common SLI Mistakes

| Mistake | Problem | Fix |
|---------|---------|-----|
| Using server-side metrics only | Misses client-perceived failures | Measure at the edge/load balancer |
| Counting health checks | Inflates availability numbers | Exclude synthetic traffic |
| Averaging latency | Hides tail latency issues | Use percentiles (p50, p95, p99) |
| Boolean up/down | Too coarse, misses partial failures | Use request-level success ratio |
| No "valid event" filter | Includes bot traffic, attacks | Define what counts as a real request |

## Step 2: Set SLO Targets

### Target Selection Guide

| Availability | Downtime/Month | Downtime/Year | Typical Use Case |
|-------------|---------------|---------------|-----------------|
| 99% (two 9s) | 7.3 hours | 3.65 days | Internal tools, dev environments |
| 99.5% | 3.65 hours | 1.83 days | Non-critical B2B services |
| 99.9% (three 9s) | 43.8 minutes | 8.76 hours | Standard production services |
| 99.95% | 21.9 minutes | 4.38 hours | Important customer-facing services |
| 99.99% (four 9s) | 4.38 minutes | 52.6 minutes | Payment systems, auth services |
| 99.999% (five 9s) | 26.3 seconds | 5.26 minutes | Safety-critical (rarely achievable) |
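
The downtime figures above follow directly from the availability target; a quick sketch, assuming a 730-hour month as in the table:

```python
def downtime_per_month(availability_pct: float, hours_in_month: float = 730) -> float:
    """Allowed downtime in minutes per month for a given availability target."""
    return (1 - availability_pct / 100) * hours_in_month * 60

print(round(downtime_per_month(99.9), 1))   # 43.8 minutes (three 9s)
print(round(downtime_per_month(99.99), 2))  # 4.38 minutes (four 9s)
```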
### Setting Targets Checklist

- [ ] Based on current performance (set SLO at current p10 performance, not aspirational)
- [ ] Aligned with user expectations (survey or infer from behavior)
- [ ] Achievable with current architecture (do not promise what you cannot deliver)
- [ ] Stricter than SLA by at least 0.1% (buffer for reaction time)
- [ ] Different SLOs for different user segments if needed (paid vs free)
- [ ] Reviewed quarterly and adjusted based on data

## Step 3: Calculate Error Budgets

### Formula

```
Error Budget = 1 - SLO target

Example: SLO = 99.9% availability over 30 days
Error Budget = 0.1% = 0.001
Total requests/month = 10,000,000
Allowed failures = 10,000,000 * 0.001 = 10,000 failed requests
```
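
The same arithmetic in code (the numbers are the example's above, not a real service):

```python
slo = 0.999                        # 99.9% availability over 30 days
error_budget = 1 - slo             # 0.001 (0.1% of requests may fail)
total_requests = 10_000_000        # requests per month

allowed_failures = total_requests * error_budget
print(round(allowed_failures))     # 10000 failed requests per month
```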
### Error Budget Policy

| Budget Remaining | Action |
|-----------------|--------|
| > 50% | Normal development velocity, deploy freely |
| 25-50% | Increased caution, review risky deployments |
| 10-25% | Freeze non-critical deployments, focus on reliability |
| < 10% | Emergency mode: only reliability fixes ship |
| Exhausted (0%) | Full deployment freeze until budget recovers |

### Budget Consumption Tracking

```
Daily budget = Error Budget / 30
Burn rate = actual_errors / expected_daily_budget

Burn rate = 1.0: consuming budget exactly as planned
Burn rate > 1.0: consuming faster than sustainable
Burn rate = 10.0: will exhaust 30-day budget in 3 days
```
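
Time to budget exhaustion is just the window divided by the burn rate. A small sketch (the `hours_to_exhaustion` helper is ours, not a standard API):

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is fully consumed at a constant burn rate."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(1.0))   # 720.0 — budget lasts the full 30-day window
print(hours_to_exhaustion(10.0))  # 72.0  — exhausted in 3 days, as above
print(hours_to_exhaustion(14.4))  # ~50   — the SEV1 paging threshold below
```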
|
|
104
|
+
|
|
105
|
+
## Step 4: Configure Burn Rate Alerts

### Multi-Window Burn Rate Alerting

| Alert | Burn Rate | Long Window | Short Window | Severity | Budget Consumed |
|-------|-----------|-------------|--------------|----------|-----------------|
| **Page (SEV1)** | 14.4x | 1 hour | 5 min | Critical | 2% in 1h |
| **Page (SEV2)** | 6x | 6 hours | 30 min | High | 5% in 6h |
| **Ticket** | 3x | 3 days | 6 hours | Medium | 30% in 3d |
| **Ticket** | 1x | 30 days | 3 days | Low | Budget tracking |
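The burn-rate multipliers in the table follow directly from the budget consumed and the window length: burn rate = budget fraction consumed × (SLO period / window). A quick check of that arithmetic (our helper, assuming a 30-day SLO period):

```python
def burn_rate_for(budget_fraction: float, window_hours: float,
                  slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours

print(burn_rate_for(0.02, 1))    # SEV1 page: 2% in 1h  → 14.4
print(burn_rate_for(0.05, 6))    # SEV2 page: 5% in 6h  → 6.0
print(burn_rate_for(0.30, 72))   # ticket: 30% in 3d    → 3.0
```

The same formula also explains the SEV1 annotation below: at 14.4x, a 30-day budget lasts 720 / 14.4 = 50 hours.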
### Why Multi-Window?

- **Long window** prevents alerting on brief spikes (high precision)
- **Short window** catches sudden onset (low detection time)
- Both conditions must be true simultaneously to fire

### Alert Configuration Example (Prometheus)
```yaml
# SEV1: 14.4x burn rate over 1h, confirmed by 5min window
- alert: SLOBurnRateCritical
  expr: |
    (
      sum(rate(http_requests_total{code=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{code=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High error burn rate - SEV1"
    budget_impact: "Will exhaust 30-day error budget in 50 hours"
```
## Step 5: Document the SLO

### SLO Document Template
```markdown
# SLO: {Service Name} - {SLI Name}

| Field | Value |
|-------|-------|
| Service | {service name} |
| Owner | {team name} |
| SLI | {definition} |
| SLO Target | {percentage} over {window} |
| SLA (if applicable) | {percentage} with {consequence} |
| Error Budget | {number} per {period} |
| Measurement Source | {logs / metrics / synthetic} |
| Dashboard | {link} |
| Alert Runbook | {link} |

## SLI Definition
Good event: {definition}
Valid event: {definition}
Exclusions: {health checks, synthetic monitoring, etc.}

## Error Budget Policy
{Copy from error budget policy table, customized for this service}

## Review Schedule
- Weekly: error budget consumption in standup
- Monthly: SLO performance review
- Quarterly: SLO target adjustment if needed
```
## Common Mistakes

| Mistake | Why It Hurts | Fix |
|---------|--------------|-----|
| SLO = 100% | Zero error budget, no deployments possible | Start at 99.9%, adjust based on data |
| SLO set without measurement | Cannot track compliance | Implement SLI measurement first |
| Same SLO for all services | Over-invests in non-critical, under-invests in critical | Tier services, different SLOs per tier |
| No error budget policy | SLO exists but nobody acts on it | Define actions per budget threshold |
| Alerting on SLI instead of burn rate | Too noisy (brief spikes trigger) | Use multi-window burn rate alerts |
| SLO not reviewed | Target drifts from reality | Quarterly review cadence |
| SLA stricter than SLO | No reaction time before breach | SLO should be 0.1-0.5% stricter than SLA |
| Too many SLOs per service | Focus diluted, alert fatigue | 1-3 SLOs per service maximum |
## SLO Maturity Model

| Level | Characteristics | Next Step |
|-------|-----------------|-----------|
| **0 - None** | No SLIs or SLOs defined | Define 1 SLI per critical service |
| **1 - Measured** | SLIs exist, dashboards built | Set SLO targets based on current performance |
| **2 - Targeted** | SLOs set, error budgets calculated | Implement burn rate alerts |
| **3 - Alerted** | Multi-window burn rate alerts active | Define error budget policy |
| **4 - Managed** | Error budget drives deployment decisions | Automate deployment freeze on budget exhaustion |
| **5 - Optimized** | SLOs reviewed quarterly, drive architecture decisions | Tie SLOs to business KPIs |
- Storage: ~$X/month for <N> vectors
- Query cost: ~$X per 1K queries
```

## References

- [chunking-strategies.md](references/chunking-strategies.md) — Decision tree and benchmarks for chunking approaches
- [evaluation-guide.md](references/evaluation-guide.md) — RAGAS metrics and ground truth dataset creation