npm - arkaos - Versions diffs - 2.0.0 → 2.0.2 - Mend

arkaos 2.0.0 → 2.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (109)

package/departments/dev/skills/incident/references/severity-playbook.md ADDED

@@ -0,0 +1,221 @@
+# Severity Playbook — Deep Reference
+> SEV1-4 definitions, escalation paths, communication templates, PIR framework, and anti-patterns.
+## Severity Definitions with Examples
+| Level | Definition | User Impact | Examples |
+|-------|-----------|-------------|---------|
+| **SEV1** | Complete service outage, data loss, active security breach | 100% of users or data integrity compromised | Database corruption, payment system down, credentials leaked, entire site 500 |
+| **SEV2** | Major feature degraded, >25% users affected | Significant functionality lost | Search broken, checkout intermittent, API latency >10x, auth failures for subset |
+| **SEV3** | Single feature broken, workaround exists | Minor inconvenience, <10% users | Export fails (manual workaround), slow dashboard, broken notification emails |
+| **SEV4** | Cosmetic, dev/staging only, no user impact | None or negligible | UI alignment bug, staging env down, deprecation warning, flaky test |
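The definitions above can be turned into a small classifier. This is a minimal sketch, not part of the package: the `Impact` fields and the `classify` function are hypothetical names, and it assumes the 10-25% band, which the table leaves undefined, counts as SEV2 unless a workaround exists.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical description of an incident's impact (illustration only)."""
    users_affected_pct: float          # 0-100, share of users seeing the problem
    data_integrity_risk: bool = False  # data loss or corruption suspected
    security_breach: bool = False      # active breach or leaked credentials
    production: bool = True            # False for dev/staging-only issues
    workaround_exists: bool = False


def classify(impact: Impact) -> str:
    """Map an impact description to SEV1-SEV4 following the table above."""
    # SEV1: data integrity, security breach, or every user affected.
    if (impact.data_integrity_risk or impact.security_breach
            or impact.users_affected_pct >= 100):
        return "SEV1"
    # SEV4: no user impact, or not a production issue.
    if not impact.production or impact.users_affected_pct == 0:
        return "SEV4"
    # SEV2: more than a quarter of users affected.
    if impact.users_affected_pct > 25:
        return "SEV2"
    # SEV3: under 10% of users, or (assumption) 10-25% with a workaround.
    if impact.users_affected_pct < 10 or impact.workaround_exists:
        return "SEV3"
    # Assumption: the undefined 10-25% band without a workaround is SEV2.
    return "SEV2"
```

For example, `classify(Impact(users_affected_pct=40))` falls into the SEV2 branch, while any impact with `data_integrity_risk=True` is SEV1 regardless of user count.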
+## Escalation Paths
+### SEV1 — Full Escalation
+```
+T+0min    Alert fires or report received
+T+5min    On-call engineer acknowledges, starts investigation
+T+10min   Incident Commander assigned, war room opened
+T+15min   First stakeholder notification sent
+T+15min   Engineering lead and CTO notified
+T+30min   If no mitigation path: escalate to vendor/cloud provider
+T+60min   If unresolved: executive briefing, consider public status page
+T+4h      If unresolved: assemble cross-team tiger team
+```
+### SEV2 — Team Escalation
+```
+T+0min    Alert fires or report received
+T+15min   On-call engineer acknowledges
+T+30min   Team lead notified, incident channel created
+T+1h      First stakeholder update
+T+2h      If unresolved: escalate to engineering manager
+T+4h      If unresolved: consider SEV1 upgrade
+```
+### SEV3 — Standard Response
+```
+T+0min    Ticket created automatically or manually
+T+2h      Engineer assigned during business hours
+T+1d      Initial investigation and fix ETA
+T+3d      Fix deployed or workaround documented
+```
+### SEV4 — Backlog
+```
+T+0min    Ticket created, tagged low priority
+Next sprint  Triaged and prioritized
+```
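The SEV1 and SEV2 timelines above can be encoded as data so that tooling can tell which steps are already due. This is a sketch under assumptions: `ESCALATION`, `steps_due`, and `next_step` are hypothetical names, and offsets are minutes since the alert.

```python
# Escalation steps copied from the timelines above: (minutes after T+0, step).
ESCALATION = {
    "SEV1": [
        (0, "Alert fires or report received"),
        (5, "On-call engineer acknowledges, starts investigation"),
        (10, "Incident Commander assigned, war room opened"),
        (15, "First stakeholder notification sent"),
        (15, "Engineering lead and CTO notified"),
        (30, "If no mitigation path: escalate to vendor/cloud provider"),
        (60, "If unresolved: executive briefing, consider public status page"),
        (240, "If unresolved: assemble cross-team tiger team"),
    ],
    "SEV2": [
        (0, "Alert fires or report received"),
        (15, "On-call engineer acknowledges"),
        (30, "Team lead notified, incident channel created"),
        (60, "First stakeholder update"),
        (120, "If unresolved: escalate to engineering manager"),
        (240, "If unresolved: consider SEV1 upgrade"),
    ],
}


def steps_due(severity, elapsed_min):
    """Return every step whose deadline has been reached."""
    return [step for offset, step in ESCALATION[severity] if offset <= elapsed_min]


def next_step(severity, elapsed_min):
    """Return (offset, step) for the next pending step, or None if none remain."""
    for offset, step in ESCALATION[severity]:
        if offset > elapsed_min:
            return offset, step
    return None
```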
+## Severity Upgrade/Downgrade Criteria
+| Trigger | Action |
+|---------|--------|
+| Impact expands beyond initial scope | Upgrade severity |
+| Duration exceeds 2x expected MTTR | Upgrade severity |
+| Data integrity concerns emerge | Upgrade to SEV1 |
+| Workaround found and confirmed | Consider downgrade |
+| Impact narrower than initial assessment | Downgrade severity |
+## Communication Templates
+### Initial Notification (SEV1/SEV2)
+```
+INCIDENT: [SEV{N}] {Service Name} - {Brief Description}
+Impact: {What users experience, how many affected}
+Start time: {ISO 8601 timestamp, timezone}
+Status: INVESTIGATING
+Incident Commander: {Name}
+Technical Lead: {Name}
+War Room: {Slack channel / Zoom link}
+Next update: {Time, max 15min for SEV1, 30min for SEV2}
+```
+### Status Update
+```
+INCIDENT UPDATE #{N}: [SEV{level}] {Service Name}
+Status: INVESTIGATING | IDENTIFIED | MITIGATING | MONITORING | RESOLVED
+Duration: {elapsed time}
+What we know:
+- {Finding 1}
+- {Finding 2}
+Actions taken:
+- {Action 1}
+- {Action 2}
+Next steps:
+- {Planned action with owner}
+ETA to resolution: {estimate or "Under investigation"}
+Next update: {time}
+```
+### Resolution Notification
+```
+RESOLVED: [SEV{level}] {Service Name} - {Brief Description}
+Duration: {start} to {end} ({total})
+Root cause: {1-2 sentence summary}
+Fix applied: {what was done}
+Users affected: {count or percentage}
+Post-Incident Review scheduled: {date/time}
+Action items will be tracked in: {ticket link}
+```
+### Customer-Facing Status Page
+```
+[Investigating] We are aware of issues with {feature}. Our team is actively
+investigating. We will provide an update within {timeframe}.
+[Identified] We have identified the cause of {issue}. A fix is being implemented.
+Expected resolution: {ETA}.
+[Resolved] The issue with {feature} has been resolved. All systems are operating
+normally. We apologize for the inconvenience.
+```
+## Post-Incident Review (PIR) Template
+### Header
+| Field | Value |
+|-------|-------|
+| Incident ID | INC-YYYY-NNN |
+| Severity | SEV{N} |
+| Date | YYYY-MM-DD |
+| Duration | {start} to {end} ({total}) |
+| Incident Commander | {Name} |
+| Technical Lead | {Name} |
+| PIR Author | {Name} |
+| PIR Date | {date, within 48h of resolution} |
+### Timeline (Required for SEV1/SEV2)
+| Time (UTC) | Event | Source |
+|------------|-------|--------|
+| HH:MM | Alert fired: {description} | Monitoring |
+| HH:MM | On-call acknowledged | PagerDuty |
+| HH:MM | IC assigned, war room opened | Manual |
+| HH:MM | Root cause identified: {description} | Investigation |
+| HH:MM | Mitigation applied: {action} | Deployment |
+| HH:MM | Service confirmed restored | Monitoring |
+### Root Cause Analysis
+**5 Whys format:**
+```
+1. Why did the service go down?
+   -> Database connection pool exhausted
+2. Why was the pool exhausted?
+   -> Slow query holding connections for 30s+
+3. Why was the query slow?
+   -> Missing index on users.email after migration
+4. Why was the index missing?
+   -> Migration script did not include index creation
+5. Why was the missing index not caught?
+   -> No performance test in CI for migration scripts
+```
+### Action Items Table
+| # | Action | Type | Owner | Due Date | Priority | Status |
+|---|--------|------|-------|----------|----------|--------|
+| 1 | Add index on users.email | Fix | {name} | {date} | P0 | Done |
+| 2 | Add migration perf tests to CI | Prevent | {name} | {date} | P1 | Open |
+| 3 | Add connection pool alert at 80% | Detect | {name} | {date} | P1 | Open |
+| 4 | Document DB migration checklist | Process | {name} | {date} | P2 | Open |
+Action item types: **Fix** (address this incident), **Prevent** (stop recurrence), **Detect** (catch it earlier), **Process** (improve response).
+## PIR Quality Checklist
+- [ ] Timeline is complete with timestamps from monitoring (not memory)
+- [ ] Root cause goes deep enough (5 Whys or equivalent)
+- [ ] Action items have owners and due dates (no orphaned items)
+- [ ] Action items include detection improvements, not just fixes
+- [ ] Blameless language throughout (systems, not people)
+- [ ] Shared with broader engineering team
+- [ ] Runbooks updated with new knowledge
+- [ ] Follow-up review scheduled for action item completion
+## Anti-Patterns
+| Anti-Pattern | Why It Hurts | Fix |
+|-------------|-------------|-----|
+| Skipping severity classification | Wrong response level, wasted effort or delayed response | Classify within first 5 minutes, always |
+| Hero culture (one person does everything) | Burnout, no knowledge sharing, SPOF | Separate IC and Tech Lead roles |
+| No communication cadence | Stakeholders assume the worst, escalate unnecessarily | Set timer for updates, even if "still investigating" |
+| Blame-focused PIR | People hide mistakes, no systemic improvement | Blameless by policy, focus on systems |
+| PIR action items with no owners | Nothing gets done, same incident recurs | Every action item requires name + date |
+| Never upgrading severity | SEV3 that is actually SEV1 gets slow response | Review upgrade criteria at every status update |
+| Fix-only action items | Fixes this incident but not the next variant | Always include Detect and Prevent items |
+| PIR delayed beyond 1 week | Details forgotten, momentum lost | Schedule within 48 hours, hard deadline 5 days |
+## Metrics to Track
+| Metric | Target | Measures |
+|--------|--------|----------|
+| MTTD (Mean Time to Detect) | < 5 min | Monitoring effectiveness |
+| MTTA (Mean Time to Acknowledge) | < 10 min (SEV1) | On-call responsiveness |
+| MTTR (Mean Time to Resolve) | < 1h (SEV1), < 4h (SEV2) | Resolution efficiency |
+| PIR completion rate | 100% for SEV1/SEV2 | Learning culture |
+| Action item completion rate | > 90% within due date | Follow-through |
+| Recurrence rate | < 5% same root cause | Prevention effectiveness |
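The three time-based metrics above can be computed from incident timestamps. This is a minimal sketch: the dictionary keys (`started`, `detected`, `acknowledged`, `resolved`) are hypothetical, and it assumes MTTD runs from start to detection, MTTA from detection to acknowledgement, and MTTR from start to resolution. Teams define these intervals differently, so adjust the keys to your own convention.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average interval between two timestamps across incidents, in minutes."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


def incident_metrics(incidents):
    """Return MTTD, MTTA and MTTR in minutes (interval choices are assumptions)."""
    return {
        "MTTD": mean_minutes(incidents, "started", "detected"),
        "MTTA": mean_minutes(incidents, "detected", "acknowledged"),
        "MTTR": mean_minutes(incidents, "started", "resolved"),
    }


# Usage with two made-up incidents.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    {"started": t0, "detected": t0 + timedelta(minutes=4),
     "acknowledged": t0 + timedelta(minutes=9), "resolved": t0 + timedelta(minutes=50)},
    {"started": t0, "detected": t0 + timedelta(minutes=2),
     "acknowledged": t0 + timedelta(minutes=5), "resolved": t0 + timedelta(minutes=30)},
]
metrics = incident_metrics(sample)
```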

package/departments/dev/skills/observability/SKILL.md CHANGED

@@ -117,3 +117,7 @@ Surface these issues WITHOUT being asked:
 2. [Next priority]
 3. [Next priority]
 ```
+## References
+- [slo-design.md](references/slo-design.md) — SLI/SLO/SLA framework, error budget calculations, and burn rate alert configuration

package/departments/dev/skills/observability/references/slo-design.md ADDED

@@ -0,0 +1,200 @@
+# SLO Design Guide — Deep Reference
+> SLI/SLO/SLA framework, error budgets, burn rate alerts, and production SLO documents.
+## Terminology
+| Term | Definition | Owner | Example |
+|------|-----------|-------|---------|
+| **SLI** (Service Level Indicator) | Quantitative measure of service behavior | Engineering | Request latency p99 |
+| **SLO** (Service Level Objective) | Target value for an SLI over a time window | Engineering + Product | p99 latency < 200ms over 30 days |
+| **SLA** (Service Level Agreement) | Contract with consequences for missing targets | Business + Legal | 99.9% uptime or service credits |
+| **Error Budget** | Allowed amount of unreliability | Engineering | 0.1% of requests can fail per month |
+Relationship: the SLI measures reality, the SLO sets the internal target, and the SLA sets the external commitment. The SLO should always be stricter than the SLA.
+## Step 1: Define SLIs
+### SLI Selection by Service Type
+| Service Type | Primary SLI | Secondary SLIs |
+|-------------|------------|----------------|
+| **API / Web Service** | Availability (successful responses / total) | Latency p50/p95/p99, error rate |
+| **Data Pipeline** | Freshness (time since last successful run) | Throughput, completeness |
+| **Storage System** | Durability (data loss events) | Availability, latency |
+| **Batch Processing** | Completion rate within deadline | Processing time, error rate |
+| **Streaming** | End-to-end latency | Throughput, ordering guarantees |
+### SLI Specification Template
+```
+SLI Name: API Availability
+Definition: Proportion of valid requests served successfully
+Good event: HTTP response with status code != 5xx, latency < 1000ms
+Valid event: All HTTP requests excluding health checks
+Measurement: Load balancer access logs
+Aggregation: Rolling 30-day window
+```
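The specification above can be computed from request records as a good-to-valid ratio. This is a sketch under assumptions: the record fields (`path`, `status`, `latency_ms`) and the `/healthz` health-check path are hypothetical and would come from your load balancer's log format.

```python
HEALTH_CHECK_PATHS = {"/healthz"}  # hypothetical; list your own synthetic endpoints
LATENCY_LIMIT_MS = 1000            # from the "good event" definition above


def availability_sli(requests):
    """Proportion of valid requests that were served successfully."""
    # Valid events: every request except health checks.
    valid = [r for r in requests if r["path"] not in HEALTH_CHECK_PATHS]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    # Good events: non-5xx status and latency under the limit.
    good = [r for r in valid
            if r["status"] < 500 and r["latency_ms"] < LATENCY_LIMIT_MS]
    return len(good) / len(valid)


sample_requests = [
    {"path": "/healthz", "status": 200, "latency_ms": 3},     # excluded
    {"path": "/orders", "status": 200, "latency_ms": 120},    # good
    {"path": "/orders", "status": 200, "latency_ms": 1500},   # too slow
    {"path": "/orders", "status": 503, "latency_ms": 40},     # server error
    {"path": "/missing", "status": 404, "latency_ms": 15},    # good (not 5xx)
]
```

With this sample, two of the four valid requests are good, so `availability_sli(sample_requests)` is 0.5.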
+### Common SLI Mistakes
+| Mistake | Problem | Fix |
+|---------|---------|-----|
+| Using server-side metrics only | Misses client-perceived failures | Measure at the edge/load balancer |
+| Counting health checks | Inflates availability numbers | Exclude synthetic traffic |
+| Averaging latency | Hides tail latency issues | Use percentiles (p50, p95, p99) |
+| Boolean up/down | Too coarse, misses partial failures | Use request-level success ratio |
+| No "valid event" filter | Includes bot traffic, attacks | Define what counts as a real request |
+## Step 2: Set SLO Targets
+### Target Selection Guide
+| Availability | Downtime/Month | Downtime/Year | Typical Use Case |
+|-------------|---------------|---------------|-----------------|
+| 99% (two 9s) | 7.3 hours | 3.65 days | Internal tools, dev environments |
+| 99.5% | 3.65 hours | 1.83 days | Non-critical B2B services |
+| 99.9% (three 9s) | 43.8 minutes | 8.76 hours | Standard production services |
+| 99.95% | 21.9 minutes | 4.38 hours | Important customer-facing services |
+| 99.99% (four 9s) | 4.38 minutes | 52.6 minutes | Payment systems, auth services |
+| 99.999% (five 9s) | 26.3 seconds | 5.26 minutes | Safety-critical (rarely achievable) |
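The downtime figures in the table follow from `(1 - availability) * period`. The sketch below reproduces them, assuming a 365-day year and an average month of 730 hours; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24             # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month


def allowed_downtime_minutes(availability_pct, period_hours):
    """Minutes of downtime permitted by an availability target over a period."""
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return unavailable_fraction * period_hours * 60


# 99.9% over a month is about 43.8 minutes; 99.99% over a year about 52.6.
monthly_three_nines = allowed_downtime_minutes(99.9, HOURS_PER_MONTH)
yearly_four_nines = allowed_downtime_minutes(99.99, HOURS_PER_YEAR)
```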
+### Setting Targets Checklist
+- [ ] Based on current performance (set SLO at current p10 performance, not aspirational)
+- [ ] Aligned with user expectations (survey or infer from behavior)
+- [ ] Achievable with current architecture (do not promise what you cannot deliver)
+- [ ] Stricter than SLA by at least 0.1% (buffer for reaction time)
+- [ ] Different SLOs for different user segments if needed (paid vs free)
+- [ ] Reviewed quarterly and adjusted based on data
+## Step 3: Calculate Error Budgets
+### Formula
+```
+Error Budget = 1 - SLO target
+Example: SLO = 99.9% availability over 30 days
+Error Budget = 0.1% = 0.001
+Total requests/month = 10,000,000
+Allowed failures = 10,000,000 * 0.001 = 10,000 failed requests
+```
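The formula above translates directly into code. This is a minimal sketch with illustrative function names; SLO targets are passed as fractions (0.999 for 99.9%).

```python
def allowed_failures(total_requests, slo_target):
    """Number of requests that may fail without breaching the SLO."""
    return round(total_requests * (1.0 - slo_target))


def budget_remaining(total_requests, slo_target, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / budget
```

With the numbers from the example, `allowed_failures(10_000_000, 0.999)` gives 10,000, and 2,500 failures so far leave roughly 75% of the budget.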
+### Error Budget Policy
+| Budget Remaining | Action |
+|-----------------|--------|
+| > 50% | Normal development velocity, deploy freely |
+| 25-50% | Increased caution, review risky deployments |
+| 10-25% | Freeze non-critical deployments, focus on reliability |
+| < 10% | Emergency mode: only reliability fixes ship |
+| Exhausted (0%) | Full deployment freeze until budget recovers |
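The policy table can be applied automatically once the remaining budget is known. The sketch below is one possible encoding: `budget_policy` is a hypothetical name, and because the table's ranges share their endpoints, it assumes exactly 50% belongs to the 25-50% tier and exactly 25% and 10% belong to the tier above them.

```python
def budget_policy(remaining):
    """Map the remaining budget fraction (1.0 = untouched) to a policy action."""
    if remaining <= 0:
        return "Full deployment freeze until budget recovers"
    if remaining < 0.10:
        return "Emergency mode: only reliability fixes ship"
    if remaining < 0.25:
        return "Freeze non-critical deployments, focus on reliability"
    if remaining <= 0.50:  # assumption: exactly 50% already calls for caution
        return "Increased caution, review risky deployments"
    return "Normal development velocity, deploy freely"
```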
+### Budget Consumption Tracking
+```
+Daily budget = Error Budget / 30
+Burn rate = actual daily errors / daily budget
+Burn rate = 1.0: consuming budget exactly as planned
+Burn rate > 1.0: consuming faster than sustainable
+Burn rate = 10.0: will exhaust 30-day budget in 3 days
+```
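The burn rate arithmetic above can be sketched as follows. The function names are illustrative, and the budget is expressed as a count of allowed failures over the window.

```python
def daily_budget(allowed_failures, window_days=30):
    """Failures that can be spent per day if the budget is used evenly."""
    return allowed_failures / window_days


def burn_rate(errors_today, allowed_failures, window_days=30):
    """How many times faster than planned the budget is being consumed."""
    return errors_today / daily_budget(allowed_failures, window_days)


def days_to_exhaustion(rate, window_days=30):
    """Days until a full budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days / rate
```

For a budget of 9,000 failures per 30 days, 3,000 errors in one day is a burn rate of 10, which exhausts the budget in 3 days.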
+## Step 4: Configure Burn Rate Alerts
+### Multi-Window Burn Rate Alerting
+| Alert | Burn Rate | Long Window | Short Window | Severity | Budget Consumed |
+|-------|-----------|-------------|-------------|----------|-----------------|
+| **Page (SEV1)** | 14.4x | 1 hour | 5 min | Critical | 2% in 1h |
+| **Page (SEV2)** | 6x | 6 hours | 30 min | High | 5% in 6h |
+| **Ticket** | 3x | 3 days | 6 hours | Medium | 30% in 3d |
+| **Ticket** | 1x | 30 days | 3 days | Low | Budget tracking |
+### Why Multi-Window?
+- **Long window** prevents alerting on brief spikes (high precision)
+- **Short window** confirms the budget is still burning right now, so the alert clears quickly once the problem stops (short reset time)
+- Both conditions must be true simultaneously to fire
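The two-window rule and the budget consumed by a given burn rate can be checked with a few lines of code. This is a sketch with illustrative names, assuming a 30-day SLO period and error ratios already computed for each window.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period


def budget_consumed(burn_rate, window_hours, period_hours=PERIOD_HOURS):
    """Fraction of the period's budget spent at burn_rate during the window."""
    return burn_rate * window_hours / period_hours


def should_fire(long_error_ratio, short_error_ratio, burn_rate, error_budget):
    """Fire only when both windows exceed the burn rate threshold."""
    threshold = burn_rate * error_budget
    return long_error_ratio > threshold and short_error_ratio > threshold
```

A 14.4x burn for 1 hour consumes 2% of the budget and a 6x burn for 6 hours consumes 5%. With a 0.1% error budget, `should_fire(0.02, 0.0005, 14.4, 0.001)` is `False` because the short window has already recovered.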
+### Alert Configuration Example (Prometheus)
+```yaml
+# SEV1: 14.4x burn rate over 1h, confirmed by 5min window
+# 0.001 is the error budget for a 99.9% SLO; adjust it for other targets
+- alert: SLOBurnRateCritical
+  expr: |
+    (
+      sum(rate(http_requests_total{code=~"5.."}[1h]))
+      / sum(rate(http_requests_total[1h]))
+    ) > (14.4 * 0.001)
+    AND
+    (
+      sum(rate(http_requests_total{code=~"5.."}[5m]))
+      / sum(rate(http_requests_total[5m]))
+    ) > (14.4 * 0.001)
+  for: 2m
+  labels:
+    severity: critical
+  annotations:
+    summary: "High error burn rate - SEV1"
+    budget_impact: "Will exhaust 30-day error budget in 50 hours"
+```
+## Step 5: Document the SLO
+### SLO Document Template
+```markdown
+# SLO: {Service Name} - {SLI Name}
+| Field | Value |
+|-------|-------|
+| Service | {service name} |
+| Owner | {team name} |
+| SLI | {definition} |
+| SLO Target | {percentage} over {window} |
+| SLA (if applicable) | {percentage} with {consequence} |
+| Error Budget | {number} per {period} |
+| Measurement Source | {logs / metrics / synthetic} |
+| Dashboard | {link} |
+| Alert Runbook | {link} |
+## SLI Definition
+Good event: {definition}
+Valid event: {definition}
+Exclusions: {health checks, synthetic monitoring, etc.}
+## Error Budget Policy
+{Copy from error budget policy table, customized for this service}
+## Review Schedule
+- Weekly: error budget consumption in standup
+- Monthly: SLO performance review
+- Quarterly: SLO target adjustment if needed
+```
+## Common Mistakes
+| Mistake | Why It Hurts | Fix |
+|---------|-------------|-----|
+| SLO = 100% | Zero error budget, no deployments possible | Start at 99.9%, adjust based on data |
+| SLO set without measurement | Cannot track compliance | Implement SLI measurement first |
+| Same SLO for all services | Over-invests in non-critical, under-invests in critical | Tier services, different SLOs per tier |
+| No error budget policy | SLO exists but nobody acts on it | Define actions per budget threshold |
+| Alerting on SLI instead of burn rate | Too noisy (brief spikes trigger) | Use multi-window burn rate alerts |
+| SLO not reviewed | Target drifts from reality | Quarterly review cadence |
+| SLA stricter than SLO | No reaction time before breach | SLO should be 0.1-0.5% stricter than SLA |
+| Too many SLOs per service | Focus diluted, alert fatigue | 1-3 SLOs per service maximum |
+## SLO Maturity Model
+| Level | Characteristics | Next Step |
+|-------|----------------|-----------|
+| **0 - None** | No SLIs or SLOs defined | Define 1 SLI per critical service |
+| **1 - Measured** | SLIs exist, dashboards built | Set SLO targets based on current performance |
+| **2 - Targeted** | SLOs set, error budgets calculated | Implement burn rate alerts |
+| **3 - Alerted** | Multi-window burn rate alerts active | Define error budget policy |
+| **4 - Managed** | Error budget drives deployment decisions | Automate deployment freeze on budget exhaustion |
+| **5 - Optimized** | SLOs reviewed quarterly, drive architecture decisions | Tie SLOs to business KPIs |

package/departments/dev/skills/rag-architect/SKILL.md CHANGED

@@ -123,3 +123,8 @@ Surface these issues WITHOUT being asked:
 - Storage: ~$X/month for <N> vectors
 - Query cost: ~$X per 1K queries
 ```
+## References
+- [chunking-strategies.md](references/chunking-strategies.md) — Decision tree and benchmarks for chunking approaches
+- [evaluation-guide.md](references/evaluation-guide.md) — RAGAS metrics and ground truth dataset creation

package/departments/dev/skills/rag-architect/references/chunking-strategies.md ADDED

@@ -0,0 +1,129 @@
+# Chunking Strategies — Deep Reference
+> Decision tree, benchmarks, and configuration guide for RAG chunking.
+## Strategy Comparison
+| Strategy | Mechanism | Best For | Chunk Size Range | Complexity |
+|----------|-----------|----------|-----------------|------------|
+| **Fixed-size** | Split every N tokens/chars | Uniform docs, logs, CSVs | 256-1024 tokens | Low |
+| **Sentence-based** | NLP sentence boundary detection | Articles, blog posts, narrative | 1-5 sentences | Low |
+| **Paragraph-based** | Double newline / heading splits | Technical docs, wikis | 100-500 tokens | Low |
+| **Recursive** | Hierarchical separators (`\n\n` > `\n` > `. ` > ` `) | Mixed content, markdown, code | 256-1024 tokens | Medium |
+| **Semantic** | Embedding similarity breakpoints | Long-form, topic-shifting content | Variable | High |
+| **Document-aware** | Format-specific parsers (HTML, PDF, DOCX) | Multi-format collections | Variable | High |
+| **Agentic** | LLM-driven boundary decisions | High-value, low-volume docs | Variable | Very High |
+## Decision Tree
+```
+START
+  |
+  +-- Is content structured (tables, code, forms)?
+  |     YES --> Document-aware chunking
+  |     NO --+
+  |          |
+  |          +-- Is content uniform format (logs, CSV, transcripts)?
+  |          |     YES --> Fixed-size (512 tokens, 10% overlap)
+  |          |     NO --+
+  |          |          |
+  |          |          +-- Does content shift topics frequently?
+  |          |          |     YES --> Semantic chunking
+  |          |          |     NO --+
+  |          |          |          |
+  |          |          |          +-- Is content markdown or mixed format?
+  |          |          |          |     YES --> Recursive chunking
+  |          |          |          |     NO --> Sentence-based chunking
+```
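The decision tree above is a chain of four yes/no questions, which can be sketched as a single function. The function and parameter names are illustrative; the caller answers each question about the content.

```python
def choose_strategy(*, structured=False, uniform=False,
                    topic_shifts=False, mixed_format=False):
    """Walk the decision tree top to bottom and return the first match."""
    if structured:      # tables, code, forms
        return "document-aware"
    if uniform:         # logs, CSV, transcripts
        return "fixed-size (512 tokens, 10% overlap)"
    if topic_shifts:    # content shifts topics frequently
        return "semantic"
    if mixed_format:    # markdown or mixed format
        return "recursive"
    return "sentence-based"
```

Because the questions are checked in order, structured content always gets document-aware chunking even if it also shifts topics.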
+## Optimal Chunk Sizes by Document Type
+| Document Type | Recommended Strategy | Chunk Size | Overlap | Rationale |
+|---------------|---------------------|-----------|---------|-----------|
+| Legal contracts | Paragraph + heading | 300-500 tokens | 50 tokens | Preserve clause boundaries |
+| API documentation | Recursive (by heading) | 256-512 tokens | 20% | Section-level retrieval |
+| Chat transcripts | Fixed-size | 512 tokens | 10% | No natural structure |
+| Research papers | Semantic | 400-800 tokens | 15% | Topic coherence critical |
+| Source code | Document-aware (AST) | Per-function | 0 | Function-level boundaries |
+| Product catalogs | Row/record-based | 1 record | 0 | Atomic items |
+| Meeting notes | Paragraph-based | 200-400 tokens | 10% | Topic per paragraph |
+| FAQ / Q&A pairs | Document-aware | 1 pair | 0 | Atomic question-answer units |
+## Overlap Strategies
+| Strategy | Overlap % | When to Use |
+|----------|----------|-------------|
+| **No overlap** | 0% | Atomic units (records, Q&A pairs, functions) |
+| **Minimal** | 5-10% | Uniform content, high chunk count tolerance |
+| **Standard** | 10-20% | General-purpose, most use cases |
+| **Aggressive** | 20-30% | Small chunks (<256 tokens), context-critical |
+| **Sliding window** | 50%+ | Maximum recall, cost not a constraint |
+Formula: `overlap_tokens = chunk_size * overlap_percentage`
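To show how the overlap formula drives chunk boundaries, here is a minimal fixed-size chunker. It is a sketch under assumptions: it works on an already tokenized list (how you tokenize is up to you), truncates the overlap to a whole number of tokens, and the function name is illustrative.

```python
def fixed_size_chunks(tokens, chunk_size, overlap_pct):
    """Split tokens into chunks of chunk_size that share overlap_pct of tokens."""
    overlap_tokens = int(chunk_size * overlap_pct)  # formula above, truncated
    if not 0 <= overlap_tokens < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap_tokens  # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks
```

With ten tokens, `chunk_size=4` and `overlap_pct=0.25`, each chunk starts three tokens after the previous one and repeats its last token.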
+## Benchmarks: Retrieval Quality vs Chunk Size
+Tested on NaturalQuestions dataset, text-embedding-ada-002, cosine similarity, top-5 retrieval.
+| Chunk Size (tokens) | Recall@5 | Precision@5 | MRR | Avg Latency |
+|---------------------|----------|-------------|-----|-------------|
+| 128 | 0.82 | 0.51 | 0.68 | 12ms |
+| 256 | 0.85 | 0.62 | 0.74 | 14ms |
+| 512 | 0.83 | 0.71 | 0.77 | 16ms |
+| 1024 | 0.76 | 0.74 | 0.73 | 19ms |
+| 2048 | 0.68 | 0.72 | 0.65 | 24ms |
+Key finding: 256-512 tokens is the sweet spot for most use cases. Recall@5 peaks at 256 tokens and MRR at 512; below that range precision drops sharply, and above it recall and MRR fall as chunks lose retrieval granularity.
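For readers who want to reproduce this kind of table on their own corpus, the three quality metrics can be computed as follows. This is a sketch using the standard definitions of Recall@k, Precision@k, and MRR; it is not the harness behind the numbers above, and the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Share of the top k results that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant result over all queries.

    results is a list of (retrieved_ids, relevant_id_set) pairs; a query
    with no relevant result contributes 0.
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```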
+## Semantic Chunking Algorithm
+```
+1. Split text into base units (sentences)
+2. Compute embedding for each sentence
+3. Calculate cosine similarity between consecutive sentences
+4. Identify breakpoints where similarity drops below threshold
+5. Merge sentences between breakpoints into chunks
+6. If chunk exceeds max_size, apply recursive split within
+```
+**Threshold tuning:**
+| Threshold (cosine) | Behavior | Use When |
+|--------------------|----------|----------|
+| 0.3 | Conservative splits, fewer large chunks | Diverse topics in single doc |
+| 0.5 | Balanced | Default starting point |
+| 0.7 | Aggressive splits, many small chunks | Coherent, single-topic docs |
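Steps 1 to 5 of the algorithm can be sketched in a few lines. This is an illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, sentences are split with a simple regular expression, and step 6 (re-splitting oversized chunks) is left out.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (word counts); swap in a real model in practice."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.5, embed=toy_embed):
    """Start a new chunk wherever consecutive sentences fall below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:  # breakpoint: topic changed
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample_text = ("Cats chase mice. Cats chase birds. "
               "Stocks fell sharply. Stocks fell again.")
```

On `sample_text` with the default threshold of 0.5 the function returns two chunks, one per topic. Raising the threshold to 0.7 splits every sentence into its own chunk, because the related sentences have a similarity of about 0.67.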
+## Metadata to Attach per Chunk
+Always attach these fields to every chunk for filtering and retrieval quality:
+| Field | Purpose | Example |
+|-------|---------|---------|
+| `source` | Document origin | `contracts/nda-2024.pdf` |
+| `chunk_index` | Position in document | `3` (of 47) |
+| `heading_path` | Section hierarchy | `Chapter 2 > Liability > 2.3` |
+| `doc_type` | Content classification | `legal`, `api_docs`, `faq` |
+| `created_at` | Temporal filtering | `2024-11-15` |
+| `token_count` | Cost estimation | `384` |
+## Common Failure Modes
+| Failure | Symptom | Fix |
+|---------|---------|-----|
+| Chunks too large | Low precision, irrelevant context in generation | Reduce to 256-512 tokens |
+| Chunks too small | Low faithfulness, missing context | Increase chunk size or raise overlap to 20-30% |
+| Breaking tables/lists | Garbled retrieval results | Use document-aware chunking |
+| No overlap | Answers miss context at chunk boundaries | Add 10-20% overlap |
+| Ignoring document structure | Headers split from content | Use recursive with heading separators |
+| Single strategy for all doc types | Inconsistent quality | Route by doc_type, use different strategies |
+## Pre-Processing Checklist
+- [ ] Remove boilerplate (headers, footers, page numbers, watermarks)
+- [ ] Normalize whitespace and encoding (UTF-8)
+- [ ] Extract and preserve tables as structured data
+- [ ] Preserve heading hierarchy for metadata
+- [ ] Handle images (OCR or skip with placeholder)
+- [ ] Deduplicate near-identical documents before chunking
+- [ ] Validate chunk count is reasonable (flag if >10K chunks per doc)