cortex-agents 2.3.1 → 4.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.opencode/agents/{plan.md → architect.md} +104 -58
- package/.opencode/agents/audit.md +183 -0
- package/.opencode/agents/{fullstack.md → coder.md} +10 -54
- package/.opencode/agents/debug.md +76 -201
- package/.opencode/agents/devops.md +16 -123
- package/.opencode/agents/docs-writer.md +195 -0
- package/.opencode/agents/fix.md +207 -0
- package/.opencode/agents/implement.md +433 -0
- package/.opencode/agents/perf.md +151 -0
- package/.opencode/agents/refactor.md +163 -0
- package/.opencode/agents/security.md +20 -85
- package/.opencode/agents/testing.md +1 -151
- package/.opencode/skills/data-engineering/SKILL.md +221 -0
- package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
- package/README.md +315 -224
- package/dist/cli.js +85 -17
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +60 -22
- package/dist/registry.d.ts +8 -3
- package/dist/registry.d.ts.map +1 -1
- package/dist/registry.js +16 -2
- package/dist/tools/branch.d.ts +2 -2
- package/dist/tools/cortex.d.ts +2 -2
- package/dist/tools/cortex.js +7 -7
- package/dist/tools/docs.d.ts +2 -2
- package/dist/tools/environment.d.ts +31 -0
- package/dist/tools/environment.d.ts.map +1 -0
- package/dist/tools/environment.js +93 -0
- package/dist/tools/github.d.ts +42 -0
- package/dist/tools/github.d.ts.map +1 -0
- package/dist/tools/github.js +200 -0
- package/dist/tools/plan.d.ts +28 -4
- package/dist/tools/plan.d.ts.map +1 -1
- package/dist/tools/plan.js +232 -4
- package/dist/tools/quality-gate.d.ts +28 -0
- package/dist/tools/quality-gate.d.ts.map +1 -0
- package/dist/tools/quality-gate.js +233 -0
- package/dist/tools/repl.d.ts +55 -0
- package/dist/tools/repl.d.ts.map +1 -0
- package/dist/tools/repl.js +291 -0
- package/dist/tools/task.d.ts +2 -0
- package/dist/tools/task.d.ts.map +1 -1
- package/dist/tools/task.js +25 -30
- package/dist/tools/worktree.d.ts +5 -32
- package/dist/tools/worktree.d.ts.map +1 -1
- package/dist/tools/worktree.js +75 -447
- package/dist/utils/change-scope.d.ts +33 -0
- package/dist/utils/change-scope.d.ts.map +1 -0
- package/dist/utils/change-scope.js +198 -0
- package/dist/utils/github.d.ts +104 -0
- package/dist/utils/github.d.ts.map +1 -0
- package/dist/utils/github.js +243 -0
- package/dist/utils/ide.d.ts +76 -0
- package/dist/utils/ide.d.ts.map +1 -0
- package/dist/utils/ide.js +307 -0
- package/dist/utils/plan-extract.d.ts +28 -0
- package/dist/utils/plan-extract.d.ts.map +1 -1
- package/dist/utils/plan-extract.js +90 -1
- package/dist/utils/repl.d.ts +145 -0
- package/dist/utils/repl.d.ts.map +1 -0
- package/dist/utils/repl.js +547 -0
- package/dist/utils/terminal.d.ts +53 -1
- package/dist/utils/terminal.d.ts.map +1 -1
- package/dist/utils/terminal.js +642 -5
- package/package.json +1 -1
- package/.opencode/agents/build.md +0 -294
- package/.opencode/agents/review.md +0 -314
- package/dist/plugin.d.ts +0 -1
- package/dist/plugin.d.ts.map +0 -1
- package/dist/plugin.js +0 -4
--- /dev/null
+++ package/.opencode/skills/data-engineering/SKILL.md
@@ -0,0 +1,221 @@
---
name: data-engineering
description: ETL pipelines, data validation, streaming patterns, message queues, and data partitioning strategies
license: Apache-2.0
compatibility: opencode
---

# Data Engineering Skill

This skill provides patterns for building reliable data pipelines, processing large datasets, and managing data infrastructure.

## When to Use

Use this skill when:
- Designing ETL/ELT pipelines
- Implementing data validation and schema enforcement
- Working with message queues (Kafka, RabbitMQ, SQS)
- Building streaming data processing systems
- Designing data partitioning and sharding strategies
- Handling batch vs real-time data processing

## ETL Pipeline Design

### Batch vs Streaming

| Aspect | Batch Processing | Stream Processing |
|--------|-----------------|-------------------|
| **Latency** | Minutes to hours | Milliseconds to seconds |
| **Data volume** | Large datasets at once | Continuous flow |
| **Complexity** | Simpler error handling | Complex state management |
| **Use cases** | Reports, analytics, migrations | Real-time dashboards, alerts, events |
| **Tools** | Airflow, dbt, Spark | Kafka Streams, Flink, Pulsar |

### ETL vs ELT

| Pattern | When to Use |
|---------|-------------|
| **ETL** (Extract → Transform → Load) | Data warehouse with strict schema, transform before loading |
| **ELT** (Extract → Load → Transform) | Cloud data lakes, transform after loading using SQL/Spark |

### Pipeline Design Principles
- **Idempotency** — Running the same pipeline twice produces the same result
- **Incremental processing** — Process only new/changed data, not full reloads
- **Schema evolution** — Handle schema changes gracefully (add columns, not remove)
- **Backfill capability** — Ability to reprocess historical data
- **Monitoring** — Track pipeline health, data quality, processing lag

### Pipeline Architecture

```
Source → Extract → Validate → Transform → Load → Verify
                       ↓
            Dead Letter Queue (failed records)
```

### Error Handling Strategies
- **Skip and log** — Log bad records, continue processing (good for analytics)
- **Dead letter queue** — Route failures to a separate queue for manual review
- **Fail fast** — Stop pipeline on first error (good for critical data)
- **Retry with backoff** — Retry transient errors with exponential backoff
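The retry-with-backoff strategy above can be sketched as follows. This is a minimal illustration, not part of the cortex-agents API; the function names are invented for the example.

```typescript
// Sketch of "retry with backoff": compute an exponential delay schedule,
// then wrap a flaky async operation with it.

/** Exponential backoff schedule: base, base*2, base*4, ... capped at maxMs. */
function backoffDelays(attempts: number, baseMs: number, maxMs = 30_000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}

/** Retry an async operation, waiting per the schedule between failures. */
async function retry<T>(op: () => Promise<T>, delays: number[]): Promise<T> {
  let lastErr: unknown;
  for (const delay of [0, ...delays]) {
    if (delay > 0) await new Promise((r) => setTimeout(r, delay));
    try {
      return await op();
    } catch (err) {
      lastErr = err; // transient error: fall through to the next attempt
    }
  }
  throw lastErr; // budget exhausted: route to the dead letter queue or fail fast
}
```

Note that only truly transient errors (timeouts, connection resets) deserve a retry; a malformed record will fail on every attempt and belongs in the dead letter queue immediately.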
## Data Validation

### Schema Enforcement
- Validate data types, required fields, and constraints at ingestion
- Use schema registries (Avro, Protobuf, JSON Schema) for contract enforcement
- Version schemas — never break backward compatibility

### Validation Layers

| Layer | What to Check | Example |
|-------|---------------|---------|
| **Structural** | Schema conformance, types, required fields | Missing `email` field, wrong type |
| **Semantic** | Business rules, value ranges, relationships | Age < 0, end_date before start_date |
| **Referential** | Foreign key integrity, cross-dataset consistency | Order references non-existent customer |
| **Statistical** | Distribution anomalies, volume checks | 10x fewer records than yesterday |

### Data Quality Dimensions
- **Completeness** — Are all required fields populated?
- **Accuracy** — Do values reflect reality?
- **Consistency** — Are the same facts represented the same way?
- **Timeliness** — Is data available when needed?
- **Uniqueness** — Are there duplicate records?
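A minimal sketch of the structural and semantic layers from the table above, applied to a single record. The record shape and field names are hypothetical examples, not a schema from this package.

```typescript
// Two validation layers on one record: structural (types, required fields)
// and semantic (business rules, value ranges).

interface OrderRecord {
  email?: string;
  amount?: number;
  start_date?: string; // ISO 8601, so lexicographic comparison is safe
  end_date?: string;
}

function validateOrder(rec: OrderRecord): string[] {
  const errors: string[] = [];
  // Structural: required fields and types
  if (typeof rec.email !== "string") errors.push("missing or non-string email");
  if (typeof rec.amount !== "number") errors.push("missing or non-numeric amount");
  // Semantic: business rules and value ranges
  if (typeof rec.amount === "number" && rec.amount < 0) errors.push("amount must be >= 0");
  if (rec.start_date && rec.end_date && rec.end_date < rec.start_date)
    errors.push("end_date before start_date");
  return errors; // empty array means the record passed both layers
}
```

Records with a non-empty error list would be routed to the dead letter queue rather than loaded.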
## Idempotency Patterns

### Why Idempotency Matters
Pipelines fail and retry. Without idempotency, retries cause:
- Duplicate records in the target
- Incorrect aggregations (double-counting)
- Inconsistent state

### Patterns

| Pattern | How It Works | Trade-off |
|---------|-------------|-----------|
| **Upsert (MERGE)** | Insert or update based on key | Requires natural/business key |
| **Delete + Insert** | Delete partition, then insert | Simple but risky window of missing data |
| **Deduplication** | Assign unique IDs, deduplicate at read or write | Extra storage for IDs |
| **Exactly-once semantics** | Transactional writes with offset tracking | Complex, framework-dependent |
| **Tombstone + Compact** | Write delete markers, compact later | Kafka log compaction pattern |

### Idempotency Keys
- Use deterministic IDs: `hash(source + key + timestamp)`
- Store processing watermarks: "last processed offset/timestamp"
- Use database transactions: read offset + write data atomically
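The `hash(source + key + timestamp)` bullet above can be sketched with Node's standard `crypto` module; the function name is illustrative. Because the ID is deterministic, a retried load upserts the same row instead of inserting a duplicate.

```typescript
import { createHash } from "node:crypto";

// Deterministic record ID: the same (source, key, timestamp) always maps
// to the same ID, making reloads safe under an upsert strategy.
function idempotencyKey(source: string, key: string, timestamp: string): string {
  // Join with a separator so ("a","bc") and ("ab","c") cannot collide
  return createHash("sha256")
    .update(`${source}\u0000${key}\u0000${timestamp}`)
    .digest("hex");
}
```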
## Message Queue Patterns

### When to Use Which

| Queue | Best For | Key Feature |
|-------|----------|-------------|
| **Kafka** | High-throughput event streaming, log-based | Durable, ordered, replayable |
| **RabbitMQ** | Task queues, RPC, complex routing | Flexible routing, acknowledgments |
| **SQS** | Simple cloud-native queuing | Managed, auto-scaling, no ops |
| **Redis Streams** | Lightweight streaming with existing Redis | Low latency, familiar API |
| **NATS** | High-performance pub/sub | Ultra-low latency, cloud-native |

### Consumer Patterns

| Pattern | Description | Use Case |
|---------|-------------|----------|
| **Competing consumers** | Multiple consumers share a queue | Parallel task processing |
| **Fan-out** | One message delivered to all consumers | Event notifications |
| **Consumer groups** | Partitioned consumption across group members | Kafka-style parallel processing |
| **Request-reply** | Send request, await response on reply queue | Async RPC |

### Delivery Guarantees

| Guarantee | Meaning | Trade-off |
|-----------|---------|-----------|
| **At-most-once** | Message may be lost, never duplicated | Fastest, lossy |
| **At-least-once** | Message never lost, may be duplicated | Requires idempotent consumers |
| **Exactly-once** | Message processed exactly once | Complex, performance overhead |

### Backpressure Handling
- **Bounded queues** — Reject/block producers when queue is full
- **Rate limiting** — Limit consumer processing rate
- **Circuit breaker** — Stop consuming when downstream is unhealthy
- **Autoscaling** — Add consumers when queue depth exceeds threshold
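The bounded-queue tactic above can be sketched as an in-memory structure whose producer-side `offer` rejects once capacity is reached; the class and method names are illustrative, not any queue library's API.

```typescript
// Bounded queue: rejecting offers when full is what propagates
// backpressure to producers instead of growing memory unbounded.
class BoundedQueue<T> {
  private items: T[] = [];
  constructor(private readonly capacity: number) {}

  /** Returns false (rejects) when full, applying backpressure. */
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }

  get depth(): number {
    return this.items.length; // expose for queue-depth alerting
  }
}
```

A rejected `offer` is the producer's signal to slow down, buffer upstream, or shed load.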
## Streaming Patterns

### Windowing

| Window Type | Description | Use Case |
|-------------|-------------|----------|
| **Tumbling** | Fixed-size, non-overlapping | Hourly aggregation |
| **Sliding** | Fixed-size, overlapping | Moving average |
| **Session** | Gap-based, variable size | User session activity |
| **Global** | All events in one window | Running totals |

### Event Time vs Processing Time
- **Event time** — When the event actually occurred (embedded in data)
- **Processing time** — When the system processes the event
- **Watermarks** — Track progress through event time, handle late arrivals
- Always prefer event time for correctness; use processing time only for real-time approximation

### Stateful Stream Processing
- **State stores** — Local key-value stores for aggregations, joins
- **Changelog topics** — Back up state to Kafka for fault tolerance
- **State checkpointing** — Periodic snapshots for recovery (Flink pattern)
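Tumbling-window aggregation over event time can be sketched as below: each event lands in exactly one fixed-size, non-overlapping bucket keyed by the window start. This is a simplification; real engines such as Flink also handle watermarks and late data.

```typescript
// Assign each event to its tumbling window by flooring the event time,
// then sum values per window. Names are illustrative.
interface StreamEvent {
  eventTimeMs: number; // event time, embedded in the data
  value: number;
}

/** Sum values per tumbling window; keys are window start times (ms). */
function tumblingSums(events: StreamEvent[], windowMs: number): Map<number, number> {
  const sums = new Map<number, number>();
  for (const e of events) {
    const windowStart = Math.floor(e.eventTimeMs / windowMs) * windowMs;
    sums.set(windowStart, (sums.get(windowStart) ?? 0) + e.value);
  }
  return sums;
}
```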
## Data Partitioning & Sharding

### Partitioning Strategies

| Strategy | How | Best For |
|----------|-----|----------|
| **Range partitioning** | Partition by value range (date, ID range) | Time-series data, sequential access |
| **Hash partitioning** | Hash key modulo partition count | Even distribution, point lookups |
| **List partitioning** | Partition by discrete values (country, region) | Known categories, geographic data |
| **Composite** | Combine strategies (hash + range) | Multi-tenant time-series |

### Partition Key Selection
- Choose keys with **high cardinality** (many distinct values)
- Avoid **hot partitions** (one key getting disproportionate traffic)
- Consider **query patterns** — partition by how data is most often read
- Plan for **partition growth** — avoid partition count that requires redistribution

### Sharding Considerations
- **Shard key immutability** — Changing a shard key requires data migration
- **Cross-shard queries** — Avoid joins across shards (denormalize instead)
- **Rebalancing** — Use consistent hashing to minimize data movement
- **Shard splitting** — Plan for splitting hot shards without downtime
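Hash partitioning from the table above reduces to a stable string hash modulo the partition count. The sketch below uses FNV-1a, but any stable hash works; the names are illustrative.

```typescript
// FNV-1a: a simple, stable 32-bit string hash.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // coerce to unsigned 32-bit
}

/** Stable partition assignment: same key always lands on the same partition. */
function partitionFor(key: string, partitionCount: number): number {
  return fnv1a(key) % partitionCount;
}
```

Note the caveat from the bullets above: modulo assignment reshuffles almost every key when the partition count changes, which is why rebalancing-sensitive systems prefer consistent hashing.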
## Data Pipeline Tools

### Orchestration
- **Airflow** — DAG-based workflow orchestration (Python)
- **Dagster** — Software-defined assets, strong typing
- **Prefect** — Python-native, dynamic workflows
- **Temporal** — Durable execution for long-running pipelines

### Transformation
- **dbt** — SQL-based transformations in the warehouse
- **Spark** — Distributed processing for large datasets
- **Pandas/Polars** — Single-machine data transformation
- **Flink** — Stream and batch processing (JVM)

### Storage
- **Data Lake** — Raw, unstructured (S3, GCS, ADLS)
- **Data Warehouse** — Structured, optimized for analytics (BigQuery, Snowflake, Redshift)
- **Data Lakehouse** — Combines both (Delta Lake, Iceberg, Hudi)

## Checklist

When building a data pipeline:
- [ ] Idempotent operations — safe to retry without side effects
- [ ] Schema validation at ingestion boundary
- [ ] Dead letter queue for failed records
- [ ] Monitoring: processing lag, error rate, throughput
- [ ] Backfill capability — can reprocess historical data
- [ ] Incremental processing — not full reloads on every run
- [ ] Data quality checks after transformation
- [ ] Partition strategy aligned with query patterns
- [ ] Exactly-once or at-least-once with idempotent consumers
- [ ] Schema evolution plan (backward compatible changes)
- [ ] Alerting on pipeline failures and data quality anomalies
- [ ] Documentation of data lineage and transformation logic
--- /dev/null
+++ package/.opencode/skills/monitoring-observability/SKILL.md
@@ -0,0 +1,251 @@
---
name: monitoring-observability
description: Structured logging, metrics instrumentation, distributed tracing, health checks, and alerting patterns
license: Apache-2.0
compatibility: opencode
---

# Monitoring & Observability Skill

This skill provides patterns for making applications observable in production through logging, metrics, tracing, and alerting.

## When to Use

Use this skill when:
- Adding logging to new features or services
- Instrumenting code with metrics (counters, histograms, gauges)
- Implementing distributed tracing across services
- Designing health check endpoints
- Setting up alerting and SLO definitions
- Debugging production issues through observability data

## The Three Pillars of Observability

### 1. Logs — What Happened
Structured, contextual records of discrete events.

### 2. Metrics — How Much / How Fast
Numeric measurements aggregated over time.

### 3. Traces — The Journey
End-to-end request paths across service boundaries.

## Structured Logging

### Principles
- **Always use structured logging** (JSON) — never unstructured `console.log` in production
- **Log levels matter**: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
- **Include context**: correlation IDs, user IDs, request IDs, operation names
- **Never log secrets**: passwords, tokens, PII, credit card numbers

### Log Levels Guide

| Level | When to Use | Example |
|-------|-------------|---------|
| **ERROR** | Something failed and needs attention | Database connection lost, payment failed |
| **WARN** | Degraded but still functioning | Cache miss fallback, retry attempt, rate limit approaching |
| **INFO** | Significant business events | User signed up, order placed, deployment completed |
| **DEBUG** | Development/troubleshooting detail | SQL query executed, cache hit/miss, function entry/exit |

### Structured Log Format

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Order placed successfully",
  "service": "order-service",
  "traceId": "abc123",
  "spanId": "def456",
  "userId": "user_789",
  "orderId": "order_012",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 145
}
```

### Correlation IDs
- Generate a unique request ID at the entry point (API gateway, load balancer)
- Propagate it through all downstream calls via headers (`X-Request-ID`, `traceparent`)
- Include it in every log line for cross-service correlation
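A minimal structured logger carrying a correlation ID, in the spirit of the principles above. A real service would use pino or winston; this sketch only shows the shape, and the field names mirror the example format rather than any fixed schema.

```typescript
// Minimal JSON logger bound to a service name and trace ID, emitting one
// JSON object per line for log shippers. Names are illustrative.
type Level = "error" | "warn" | "info" | "debug";

function makeLogger(service: string, traceId: string) {
  return (level: Level, message: string, fields: Record<string, unknown> = {}): string => {
    const line = JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      message,
      service,
      traceId, // correlation ID propagated from the request entry point
      ...fields,
    });
    console.log(line);
    return line;
  };
}
```

Binding the trace ID once at request entry means every downstream log line correlates automatically.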
### What to Log

**DO log:**
- Request/response boundaries (method, path, status, duration)
- Business events (user actions, state transitions, transactions)
- Error details with stack traces and context
- Performance-relevant data (query times, cache hit rates)
- Security events (auth failures, permission denials, rate limits)

**DO NOT log:**
- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or PII without masking
- High-frequency debug data in production (use sampling)
- Request/response bodies containing sensitive data
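One way to enforce the "DO NOT log" list is to mask sensitive fields before they reach the logger. The denylist below is a hypothetical example; real redaction should also walk nested objects and match library-specific conventions.

```typescript
// Replace known-sensitive field values with a marker before logging.
const SENSITIVE = new Set(["password", "token", "apiKey", "ssn", "cardNumber"]);

function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SENSITIVE.has(key) ? "[REDACTED]" : value;
  }
  return out;
}
```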
## Metrics Instrumentation

### Metric Types

| Type | Use Case | Example |
|------|----------|---------|
| **Counter** | Monotonically increasing value | Total requests, errors, orders placed |
| **Gauge** | Value that goes up and down | Active connections, queue depth, memory usage |
| **Histogram** | Distribution of values | Request latency, response size, batch processing time |
| **Summary** | Pre-calculated quantiles | P50/P95/P99 latency (client-side) |

### Naming Conventions
- Use snake_case: `http_requests_total`, `request_duration_seconds`
- Include units in the name: `_seconds`, `_bytes`, `_total`
- Use `_total` suffix for counters
- Prefix with service/subsystem: `api_http_requests_total`

### Key Metrics to Track

**RED Method (Request-driven services):**
- **R**ate — Requests per second
- **E**rrors — Error rate (4xx, 5xx)
- **D**uration — Request latency distribution

**USE Method (Resource-oriented):**
- **U**tilization — % of resource capacity used
- **S**aturation — Queue depth, backpressure
- **E**rrors — Error count per resource

### Cardinality Warning
- Avoid high-cardinality labels (user IDs, request IDs, URLs with path params)
- Keep label combinations < 1000 per metric
- Use bounded values: HTTP methods (GET, POST), status codes (2xx, 4xx, 5xx), endpoints (normalized)
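A toy in-memory counter and histogram capturing the RED signals above. In production these would be prom-client or OpenTelemetry instruments; this sketch only shows what each metric type records, and all names are illustrative.

```typescript
// Counter: monotonically increasing (Rate; Errors with a status label).
class Counter {
  private value = 0;
  inc(by = 1): void {
    this.value += by;
  }
  get total(): number {
    return this.value;
  }
}

// Histogram: distribution of observed values (Duration).
class Histogram {
  private readonly observations: number[] = [];
  observe(v: number): void {
    this.observations.push(v);
  }
  /** Nearest-rank quantile, e.g. q = 0.95 for P95 latency. */
  quantile(q: number): number {
    const sorted = [...this.observations].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.ceil(q * sorted.length) - 1)];
  }
}

const httpRequestsTotal = new Counter(); // Rate
const requestDurationSeconds = new Histogram(); // Duration
```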
## Distributed Tracing

### OpenTelemetry Patterns
- **Span** — A single operation within a trace (e.g., HTTP request, DB query, function call)
- **Trace** — A tree of spans representing an end-to-end request
- **Context Propagation** — Passing trace context across service boundaries via headers

### What to Trace
- HTTP requests (client and server)
- Database queries
- Cache operations
- Message queue publish/consume
- External API calls
- Significant business operations

### Span Attributes
```
http.method: GET
http.url: /api/users/123
http.status_code: 200
db.system: postgresql
db.statement: SELECT * FROM users WHERE id = $1
messaging.system: kafka
messaging.destination: orders
```

### Sampling Strategies
- **Head-based sampling**: Decide at trace start (e.g., sample 10% of requests)
- **Tail-based sampling**: Decide after trace completes (keep errors, slow requests, sample normal)
- **Priority sampling**: Always sample errors, high-value transactions; sample routine requests
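A head-based sampling decision can be sketched by hashing the trace ID into [0, 1) and keeping the trace when it falls under the sample rate. Hashing rather than `Math.random()` keeps the decision consistent for a given trace ID across services; the function name is illustrative.

```typescript
import { createHash } from "node:crypto";

// Deterministic head-based sampling: every service that sees the same
// trace ID makes the same keep/drop decision.
function shouldSample(traceId: string, sampleRate: number): boolean {
  const h = createHash("sha256").update(traceId).digest();
  const fraction = h.readUInt32BE(0) / 0x1_0000_0000; // uniform in [0, 1)
  return fraction < sampleRate;
}
```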
## Health Check Endpoints

### Liveness vs Readiness

| Check | Purpose | Failure Action |
|-------|---------|----------------|
| **Liveness** (`/healthz`) | Is the process alive? | Restart the container |
| **Readiness** (`/readyz`) | Can it serve traffic? | Remove from load balancer |
| **Startup** (`/startupz`) | Has it finished initializing? | Wait before liveness checks |

### Health Check Response Format
```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5 },
    "cache": { "status": "healthy", "latency_ms": 1 },
    "external_api": { "status": "degraded", "latency_ms": 2500, "message": "Slow response" }
  },
  "version": "1.2.3",
  "uptime_seconds": 86400
}
```

### Best Practices
- Health checks should be **fast** (< 1 second)
- Liveness should check the process only — NOT external dependencies
- Readiness should check critical dependencies (database, cache)
- Return appropriate HTTP status: 200 (healthy), 503 (unhealthy)
- Include dependency health in readiness but not liveness
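Aggregating dependency checks into the readiness response shape shown above can be sketched as taking the worst individual status; one unhealthy dependency yields a 503, which pulls the instance out of the load balancer. The check names and shapes are illustrative.

```typescript
// Readiness aggregation: overall status is the worst dependency status.
type Status = "healthy" | "degraded" | "unhealthy";

interface CheckResult {
  status: Status;
  latency_ms: number;
  message?: string;
}

const RANK: Record<Status, number> = { healthy: 0, degraded: 1, unhealthy: 2 };

function readiness(checks: Record<string, CheckResult>) {
  const worst = Object.values(checks).reduce<Status>(
    (acc, c) => (RANK[c.status] > RANK[acc] ? c.status : acc),
    "healthy"
  );
  return {
    status: worst,
    httpStatus: worst === "unhealthy" ? 503 : 200, // 503 removes traffic
    checks,
  };
}
```

Note that a degraded dependency still returns 200 here, matching the "degraded but still functioning" semantics of WARN-level conditions.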
## Alerting & SLOs

### SLO Definitions
- **SLI** (Service Level Indicator): The metric you measure (e.g., request latency P99)
- **SLO** (Service Level Objective): The target (e.g., P99 latency < 500ms for 99.9% of requests)
- **Error Budget**: Allowable failures (e.g., 0.1% of requests can exceed 500ms)

### Alert Design Principles
- **Alert on symptoms, not causes** — Alert on "users can't log in", not "CPU is high"
- **Alert on SLO burn rate** — Alert when error budget is being consumed too fast
- **Avoid alert fatigue** — Every alert should require human action
- **Include runbook links** — Every alert should link to resolution steps

### Severity Levels

| Severity | Response Time | Example |
|----------|--------------|---------|
| **P1 — Critical** | Immediate (< 5 min) | Service down, data loss, security breach |
| **P2 — High** | Within 1 hour | Degraded performance, partial outage |
| **P3 — Medium** | Within 1 business day | Non-critical feature broken, elevated error rate |
| **P4 — Low** | Next sprint | Performance degradation, tech debt alert |

### Useful Alert Patterns
- Error rate exceeds N% for M minutes
- Latency P99 exceeds threshold for M minutes
- Error budget burn rate > 1x for 1 hour (fast burn)
- Error budget burn rate > 0.1x for 6 hours (slow burn)
- Queue depth exceeds threshold (backpressure)
- Certificate expiry within N days
- Disk usage exceeds N%
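The burn rate referenced in the alert patterns above is the ratio of the observed error rate to the error rate the SLO allows: a burn rate of 1 spends the whole error budget exactly over the SLO window. A sketch, with an illustrative function name:

```typescript
// Burn rate = observed error rate / allowed error rate.
// E.g. a 99.9% SLO allows a 0.001 error rate; observing 0.01 burns 10x.
function burnRate(badEvents: number, totalEvents: number, sloTarget: number): number {
  const observedErrorRate = badEvents / totalEvents;
  const allowedErrorRate = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / allowedErrorRate;
}
```

The fast-burn and slow-burn alert thresholds above compare this value over short and long windows respectively.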
## Technology Selection

### Logging
- **Node.js**: pino, winston, bunyan
- **Python**: structlog, python-json-logger
- **Go**: zerolog, zap, slog (stdlib)
- **Rust**: tracing, log + env_logger

### Metrics
- **Prometheus** — Pull-based, widely adopted, great with Kubernetes
- **StatsD/Datadog** — Push-based, hosted
- **OpenTelemetry Metrics** — Vendor-neutral, emerging standard

### Tracing
- **OpenTelemetry** — Vendor-neutral standard (recommended)
- **Jaeger** — Open-source trace backend
- **Zipkin** — Lightweight trace backend

### Dashboards
- **Grafana** — Open-source, works with Prometheus/Loki/Tempo
- **Datadog** — Hosted all-in-one
- **New Relic** — Hosted APM

## Checklist

When adding observability to a feature:
- [ ] Structured logging with correlation IDs at request boundaries
- [ ] Error logging with stack traces and context
- [ ] Business event logging (significant state changes)
- [ ] RED metrics for request-driven endpoints
- [ ] Histogram for latency-sensitive operations
- [ ] Trace spans for cross-service calls and database queries
- [ ] Health check endpoint updated if new dependency added
- [ ] No secrets or PII in logs
- [ ] Appropriate log levels (not everything is INFO)
- [ ] Dashboard updated with new metrics
- [ ] Alerts defined for SLO violations