cortex-agents 2.3.1 → 4.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70)
  1. package/.opencode/agents/{plan.md → architect.md} +104 -58
  2. package/.opencode/agents/audit.md +183 -0
  3. package/.opencode/agents/{fullstack.md → coder.md} +10 -54
  4. package/.opencode/agents/debug.md +76 -201
  5. package/.opencode/agents/devops.md +16 -123
  6. package/.opencode/agents/docs-writer.md +195 -0
  7. package/.opencode/agents/fix.md +207 -0
  8. package/.opencode/agents/implement.md +433 -0
  9. package/.opencode/agents/perf.md +151 -0
  10. package/.opencode/agents/refactor.md +163 -0
  11. package/.opencode/agents/security.md +20 -85
  12. package/.opencode/agents/testing.md +1 -151
  13. package/.opencode/skills/data-engineering/SKILL.md +221 -0
  14. package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
  15. package/README.md +315 -224
  16. package/dist/cli.js +85 -17
  17. package/dist/index.d.ts.map +1 -1
  18. package/dist/index.js +60 -22
  19. package/dist/registry.d.ts +8 -3
  20. package/dist/registry.d.ts.map +1 -1
  21. package/dist/registry.js +16 -2
  22. package/dist/tools/branch.d.ts +2 -2
  23. package/dist/tools/cortex.d.ts +2 -2
  24. package/dist/tools/cortex.js +7 -7
  25. package/dist/tools/docs.d.ts +2 -2
  26. package/dist/tools/environment.d.ts +31 -0
  27. package/dist/tools/environment.d.ts.map +1 -0
  28. package/dist/tools/environment.js +93 -0
  29. package/dist/tools/github.d.ts +42 -0
  30. package/dist/tools/github.d.ts.map +1 -0
  31. package/dist/tools/github.js +200 -0
  32. package/dist/tools/plan.d.ts +28 -4
  33. package/dist/tools/plan.d.ts.map +1 -1
  34. package/dist/tools/plan.js +232 -4
  35. package/dist/tools/quality-gate.d.ts +28 -0
  36. package/dist/tools/quality-gate.d.ts.map +1 -0
  37. package/dist/tools/quality-gate.js +233 -0
  38. package/dist/tools/repl.d.ts +55 -0
  39. package/dist/tools/repl.d.ts.map +1 -0
  40. package/dist/tools/repl.js +291 -0
  41. package/dist/tools/task.d.ts +2 -0
  42. package/dist/tools/task.d.ts.map +1 -1
  43. package/dist/tools/task.js +25 -30
  44. package/dist/tools/worktree.d.ts +5 -32
  45. package/dist/tools/worktree.d.ts.map +1 -1
  46. package/dist/tools/worktree.js +75 -447
  47. package/dist/utils/change-scope.d.ts +33 -0
  48. package/dist/utils/change-scope.d.ts.map +1 -0
  49. package/dist/utils/change-scope.js +198 -0
  50. package/dist/utils/github.d.ts +104 -0
  51. package/dist/utils/github.d.ts.map +1 -0
  52. package/dist/utils/github.js +243 -0
  53. package/dist/utils/ide.d.ts +76 -0
  54. package/dist/utils/ide.d.ts.map +1 -0
  55. package/dist/utils/ide.js +307 -0
  56. package/dist/utils/plan-extract.d.ts +28 -0
  57. package/dist/utils/plan-extract.d.ts.map +1 -1
  58. package/dist/utils/plan-extract.js +90 -1
  59. package/dist/utils/repl.d.ts +145 -0
  60. package/dist/utils/repl.d.ts.map +1 -0
  61. package/dist/utils/repl.js +547 -0
  62. package/dist/utils/terminal.d.ts +53 -1
  63. package/dist/utils/terminal.d.ts.map +1 -1
  64. package/dist/utils/terminal.js +642 -5
  65. package/package.json +1 -1
  66. package/.opencode/agents/build.md +0 -294
  67. package/.opencode/agents/review.md +0 -314
  68. package/dist/plugin.d.ts +0 -1
  69. package/dist/plugin.d.ts.map +0 -1
  70. package/dist/plugin.js +0 -4
@@ -0,0 +1,221 @@
---
name: data-engineering
description: ETL pipelines, data validation, streaming patterns, message queues, and data partitioning strategies
license: Apache-2.0
compatibility: opencode
---

# Data Engineering Skill

This skill provides patterns for building reliable data pipelines, processing large datasets, and managing data infrastructure.

## When to Use

Use this skill when:
- Designing ETL/ELT pipelines
- Implementing data validation and schema enforcement
- Working with message queues (Kafka, RabbitMQ, SQS)
- Building streaming data processing systems
- Designing data partitioning and sharding strategies
- Handling batch vs real-time data processing
## ETL Pipeline Design

### Batch vs Streaming

| Aspect | Batch Processing | Stream Processing |
|--------|-----------------|-------------------|
| **Latency** | Minutes to hours | Milliseconds to seconds |
| **Data volume** | Large datasets at once | Continuous flow |
| **Complexity** | Simpler error handling | Complex state management |
| **Use cases** | Reports, analytics, migrations | Real-time dashboards, alerts, events |
| **Tools** | Airflow, dbt, Spark | Kafka Streams, Flink, Pulsar |

### ETL vs ELT

| Pattern | When to Use |
|---------|-------------|
| **ETL** (Extract → Transform → Load) | Data warehouse with strict schema, transform before loading |
| **ELT** (Extract → Load → Transform) | Cloud data lakes, transform after loading using SQL/Spark |

### Pipeline Design Principles
- **Idempotency** — Running the same pipeline twice produces the same result
- **Incremental processing** — Process only new/changed data, not full reloads
- **Schema evolution** — Handle schema changes gracefully (add columns, not remove)
- **Backfill capability** — Ability to reprocess historical data
- **Monitoring** — Track pipeline health, data quality, processing lag

### Pipeline Architecture

```
Source → Extract → Validate → Transform → Load → Verify
                       ↓
          Dead Letter Queue (failed records)
```

### Error Handling Strategies
- **Skip and log** — Log bad records, continue processing (good for analytics)
- **Dead letter queue** — Route failures to a separate queue for manual review
- **Fail fast** — Stop pipeline on first error (good for critical data)
- **Retry with backoff** — Retry transient errors with exponential backoff
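A retry-with-backoff loop that falls through to a dead letter queue can be combined in a small driver. This is a minimal sketch, not a production runner; the `transform` callback and the in-memory `dead_letters` list are stand-ins for a real pipeline stage and DLQ:

```python
import time

def run_with_retry(record, transform, dead_letters, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; park permanent
    failures in a dead letter queue instead of halting the pipeline."""
    for attempt in range(max_attempts):
        try:
            return transform(record)
        except Exception:
            if attempt + 1 == max_attempts:
                dead_letters.append(record)  # route to DLQ for manual review
                return None
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...
```

Swapping the final `append` for a `raise` turns the same loop into the fail-fast strategy.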
## Data Validation

### Schema Enforcement
- Validate data types, required fields, and constraints at ingestion
- Use schema registries (Avro, Protobuf, JSON Schema) for contract enforcement
- Version schemas — never break backward compatibility

### Validation Layers

| Layer | What to Check | Example |
|-------|---------------|---------|
| **Structural** | Schema conformance, types, required fields | Missing `email` field, wrong type |
| **Semantic** | Business rules, value ranges, relationships | Age < 0, end_date before start_date |
| **Referential** | Foreign key integrity, cross-dataset consistency | Order references non-existent customer |
| **Statistical** | Distribution anomalies, volume checks | 10x fewer records than yesterday |

### Data Quality Dimensions
- **Completeness** — Are all required fields populated?
- **Accuracy** — Do values reflect reality?
- **Consistency** — Are the same facts represented the same way?
- **Timeliness** — Is data available when needed?
- **Uniqueness** — Are there duplicate records?
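The structural and semantic layers above can be sketched as one validator that accumulates errors instead of failing on the first. The `validate_order` function and its field list are illustrative, not part of any real schema:

```python
def validate_order(record):
    """Return a list of validation errors, checking layers in order."""
    errors = []
    # Structural layer: required fields and types
    for field, ftype in (("order_id", str), ("email", str), ("amount", (int, float))):
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}")
    # Semantic layer: business rules on values
    if isinstance(record.get("amount"), (int, float)) and record["amount"] < 0:
        errors.append("amount must be non-negative")
    return errors
```

Collecting all errors at once makes it easy to log the full failure reason to a dead letter queue.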
## Idempotency Patterns

### Why Idempotency Matters
Pipelines fail and retry. Without idempotency, retries cause:
- Duplicate records in the target
- Incorrect aggregations (double-counting)
- Inconsistent state

### Patterns

| Pattern | How It Works | Trade-off |
|---------|-------------|-----------|
| **Upsert (MERGE)** | Insert or update based on key | Requires natural/business key |
| **Delete + Insert** | Delete partition, then insert | Simple but risky window of missing data |
| **Deduplication** | Assign unique IDs, deduplicate at read or write | Extra storage for IDs |
| **Exactly-once semantics** | Transactional writes with offset tracking | Complex, framework-dependent |
| **Tombstone + Compact** | Write delete markers, compact later | Kafka log compaction pattern |

### Idempotency Keys
- Use deterministic IDs: `hash(source + key + timestamp)`
- Store processing watermarks: "last processed offset/timestamp"
- Use database transactions: read offset + write data atomically
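A deterministic key plus a keyed upsert is the simplest way to make a load step replay-safe. A minimal sketch, where a plain dict stands in for a real target table with MERGE support:

```python
import hashlib

def idempotency_key(source: str, business_key: str, event_time: str) -> str:
    """Deterministic ID: the same input record always hashes to the same key."""
    raw = f"{source}|{business_key}|{event_time}"
    return hashlib.sha256(raw.encode()).hexdigest()

def upsert(target: dict, record: dict) -> None:
    """Keyed write: replaying the same batch leaves the target unchanged."""
    key = idempotency_key(record["source"], record["id"], record["ts"])
    target[key] = record
```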
## Message Queue Patterns

### When to Use Which

| Queue | Best For | Key Feature |
|-------|----------|-------------|
| **Kafka** | High-throughput event streaming, log-based | Durable, ordered, replayable |
| **RabbitMQ** | Task queues, RPC, complex routing | Flexible routing, acknowledgments |
| **SQS** | Simple cloud-native queuing | Managed, auto-scaling, no ops |
| **Redis Streams** | Lightweight streaming with existing Redis | Low latency, familiar API |
| **NATS** | High-performance pub/sub | Ultra-low latency, cloud-native |

### Consumer Patterns

| Pattern | Description | Use Case |
|---------|-------------|----------|
| **Competing consumers** | Multiple consumers share a queue | Parallel task processing |
| **Fan-out** | One message delivered to all consumers | Event notifications |
| **Consumer groups** | Partitioned consumption across group members | Kafka-style parallel processing |
| **Request-reply** | Send request, await response on reply queue | Async RPC |

### Delivery Guarantees

| Guarantee | Meaning | Trade-off |
|-----------|---------|-----------|
| **At-most-once** | Message may be lost, never duplicated | Fastest, lossy |
| **At-least-once** | Message never lost, may be duplicated | Requires idempotent consumers |
| **Exactly-once** | Message processed exactly once | Complex, performance overhead |

### Backpressure Handling
- **Bounded queues** — Reject/block producers when queue is full
- **Rate limiting** — Limit consumer processing rate
- **Circuit breaker** — Stop consuming when downstream is unhealthy
- **Autoscaling** — Add consumers when queue depth exceeds threshold
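At-least-once delivery pairs naturally with a deduplicating consumer. A minimal sketch, assuming message IDs are stable across redeliveries; a production consumer would persist the seen-set (e.g. a keyed store with a TTL) rather than hold it in memory:

```python
class IdempotentConsumer:
    """Consumer-side deduplication: makes at-least-once delivery safe."""

    def __init__(self):
        self.seen = set()    # in production: durable store with expiry
        self.processed = []

    def handle(self, message: dict) -> bool:
        msg_id = message["id"]
        if msg_id in self.seen:
            return False     # duplicate redelivery: acknowledge but skip
        self.seen.add(msg_id)
        self.processed.append(message["payload"])  # real work goes here
        return True
```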
## Streaming Patterns

### Windowing

| Window Type | Description | Use Case |
|-------------|-------------|----------|
| **Tumbling** | Fixed-size, non-overlapping | Hourly aggregation |
| **Sliding** | Fixed-size, overlapping | Moving average |
| **Session** | Gap-based, variable size | User session activity |
| **Global** | All events in one window | Running totals |

### Event Time vs Processing Time
- **Event time** — When the event actually occurred (embedded in data)
- **Processing time** — When the system processes the event
- **Watermarks** — Track progress through event time, handle late arrivals
- Always prefer event time for correctness; use processing time only for real-time approximation

### Stateful Stream Processing
- **State stores** — Local key-value stores for aggregations, joins
- **Changelog topics** — Back up state to Kafka for fault tolerance
- **State checkpointing** — Periodic snapshots for recovery (Flink pattern)
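Tumbling windows keyed by event time reduce to integer bucketing of timestamps. A sketch using epoch-second event times; real stream engines layer watermarks on top of this to close windows despite late arrivals:

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window, keyed by event time
    (the timestamp embedded in the event, not arrival time)."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```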
## Data Partitioning & Sharding

### Partitioning Strategies

| Strategy | How | Best For |
|----------|-----|----------|
| **Range partitioning** | Partition by value range (date, ID range) | Time-series data, sequential access |
| **Hash partitioning** | Hash key modulo partition count | Even distribution, point lookups |
| **List partitioning** | Partition by discrete values (country, region) | Known categories, geographic data |
| **Composite** | Combine strategies (hash + range) | Multi-tenant time-series |

### Partition Key Selection
- Choose keys with **high cardinality** (many distinct values)
- Avoid **hot partitions** (one key getting disproportionate traffic)
- Consider **query patterns** — partition by how data is most often read
- Plan for **partition growth** — avoid a partition count that will force redistribution later

### Sharding Considerations
- **Shard key immutability** — Changing a shard key requires data migration
- **Cross-shard queries** — Avoid joins across shards (denormalize instead)
- **Rebalancing** — Use consistent hashing to minimize data movement
- **Shard splitting** — Plan for splitting hot shards without downtime
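Hash partitioning is just a stable digest modulo the partition count. A sketch; it deliberately uses a fixed digest rather than Python's per-process-randomized `hash()`, so the key-to-partition mapping survives restarts and is the same on every worker:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the
    same partition, across processes and restarts."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note the modulo makes the mapping sensitive to `num_partitions`; when partitions must grow over time, a consistent-hashing ring avoids reshuffling most keys.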
## Data Pipeline Tools

### Orchestration
- **Airflow** — DAG-based workflow orchestration (Python)
- **Dagster** — Software-defined assets, strong typing
- **Prefect** — Python-native, dynamic workflows
- **Temporal** — Durable execution for long-running pipelines

### Transformation
- **dbt** — SQL-based transformations in the warehouse
- **Spark** — Distributed processing for large datasets
- **Pandas/Polars** — Single-machine data transformation
- **Flink** — Stream and batch processing (JVM)

### Storage
- **Data Lake** — Raw, unstructured (S3, GCS, ADLS)
- **Data Warehouse** — Structured, optimized for analytics (BigQuery, Snowflake, Redshift)
- **Data Lakehouse** — Combines both (Delta Lake, Iceberg, Hudi)

## Checklist

When building a data pipeline:
- [ ] Idempotent operations — safe to retry without side effects
- [ ] Schema validation at ingestion boundary
- [ ] Dead letter queue for failed records
- [ ] Monitoring: processing lag, error rate, throughput
- [ ] Backfill capability — can reprocess historical data
- [ ] Incremental processing — not full reloads on every run
- [ ] Data quality checks after transformation
- [ ] Partition strategy aligned with query patterns
- [ ] Exactly-once or at-least-once with idempotent consumers
- [ ] Schema evolution plan (backward compatible changes)
- [ ] Alerting on pipeline failures and data quality anomalies
- [ ] Documentation of data lineage and transformation logic
@@ -0,0 +1,251 @@
---
name: monitoring-observability
description: Structured logging, metrics instrumentation, distributed tracing, health checks, and alerting patterns
license: Apache-2.0
compatibility: opencode
---

# Monitoring & Observability Skill

This skill provides patterns for making applications observable in production through logging, metrics, tracing, and alerting.

## When to Use

Use this skill when:
- Adding logging to new features or services
- Instrumenting code with metrics (counters, histograms, gauges)
- Implementing distributed tracing across services
- Designing health check endpoints
- Setting up alerting and SLO definitions
- Debugging production issues through observability data

## The Three Pillars of Observability

### 1. Logs — What Happened
Structured, contextual records of discrete events.

### 2. Metrics — How Much / How Fast
Numeric measurements aggregated over time.

### 3. Traces — The Journey
End-to-end request paths across service boundaries.
## Structured Logging

### Principles
- **Always use structured logging** (JSON) — never unstructured `console.log` in production
- **Log levels matter**: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
- **Include context**: correlation IDs, user IDs, request IDs, operation names
- **Never log secrets**: passwords, tokens, PII, credit card numbers

### Log Levels Guide

| Level | When to Use | Example |
|-------|-------------|---------|
| **ERROR** | Something failed and needs attention | Database connection lost, payment failed |
| **WARN** | Degraded but still functioning | Cache miss fallback, retry attempt, rate limit approaching |
| **INFO** | Significant business events | User signed up, order placed, deployment completed |
| **DEBUG** | Development/troubleshooting detail | SQL query executed, cache hit/miss, function entry/exit |

### Structured Log Format

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Order placed successfully",
  "service": "order-service",
  "traceId": "abc123",
  "spanId": "def456",
  "userId": "user_789",
  "orderId": "order_012",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 145
}
```

### Correlation IDs
- Generate a unique request ID at the entry point (API gateway, load balancer)
- Propagate it through all downstream calls via headers (`X-Request-ID`, `traceparent`)
- Include it in every log line for cross-service correlation

### What to Log

**DO log:**
- Request/response boundaries (method, path, status, duration)
- Business events (user actions, state transitions, transactions)
- Error details with stack traces and context
- Performance-relevant data (query times, cache hit rates)
- Security events (auth failures, permission denials, rate limits)

**DO NOT log:**
- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or PII without masking
- High-frequency debug data in production (use sampling)
- Request/response bodies containing sensitive data
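A structured logger that stamps every line with the service name and propagated context can be sketched with the standard library alone; in practice you would reach for a library such as pino or structlog as listed under Technology Selection:

```python
import json
import time

def make_logger(service, context=None):
    """Build a log function that emits one JSON object per call.
    `context` holds per-request fields (correlation/trace IDs) repeated
    on every line so logs can be joined across services."""
    base = dict(context or {})
    def log(level, message, **fields):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "level": level,
            "message": message,
            "service": service,
            **base,
            **fields,
        }
        return json.dumps(entry)  # in production: write this line to stdout
    return log
```

For example, `make_logger("order-service", {"traceId": "abc123"})("info", "Order placed successfully", orderId="order_012")` yields a line shaped like the sample above.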
## Metrics Instrumentation

### Metric Types

| Type | Use Case | Example |
|------|----------|---------|
| **Counter** | Monotonically increasing value | Total requests, errors, orders placed |
| **Gauge** | Value that goes up and down | Active connections, queue depth, memory usage |
| **Histogram** | Distribution of values | Request latency, response size, batch processing time |
| **Summary** | Pre-calculated quantiles | P50/P95/P99 latency (client-side) |

### Naming Conventions
- Use snake_case: `http_requests_total`, `request_duration_seconds`
- Include units in the name: `_seconds`, `_bytes`, `_total`
- Use `_total` suffix for counters
- Prefix with service/subsystem: `api_http_requests_total`

### Key Metrics to Track

**RED Method (Request-driven services):**
- **R**ate — Requests per second
- **E**rrors — Error rate (4xx, 5xx)
- **D**uration — Request latency distribution

**USE Method (Resource-oriented):**
- **U**tilization — % of resource capacity used
- **S**aturation — Queue depth, backpressure
- **E**rrors — Error count per resource

### Cardinality Warning
- Avoid high-cardinality labels (user IDs, request IDs, URLs with path params)
- Keep label combinations < 1000 per metric
- Use bounded values: HTTP methods (GET, POST), status codes (2xx, 4xx, 5xx), endpoints (normalized)
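The histogram semantics most client libraries implement (cumulative bucket counts plus a running sum and count, from which backends derive rates and approximate quantiles) can be shown with a hand-rolled sketch; this is an illustration of the model, not a replacement for a real metrics client:

```python
class Histogram:
    """Minimal Prometheus-style histogram: cumulative buckets + sum + count."""

    def __init__(self, buckets=(0.05, 0.1, 0.5, 1.0, 5.0)):
        self.buckets = buckets
        self.counts = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        for i, upper in enumerate(self.buckets):
            if value <= upper:
                self.counts[i] += 1  # cumulative: counted in every bucket with a larger bound
        self.counts[-1] += 1         # +Inf bucket counts every observation
```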
## Distributed Tracing

### OpenTelemetry Patterns
- **Span** — A single operation within a trace (e.g., HTTP request, DB query, function call)
- **Trace** — A tree of spans representing an end-to-end request
- **Context Propagation** — Passing trace context across service boundaries via headers

### What to Trace
- HTTP requests (client and server)
- Database queries
- Cache operations
- Message queue publish/consume
- External API calls
- Significant business operations

### Span Attributes
```
http.method: GET
http.url: /api/users/123
http.status_code: 200
db.system: postgresql
db.statement: SELECT * FROM users WHERE id = $1
messaging.system: kafka
messaging.destination: orders
```

### Sampling Strategies
- **Head-based sampling**: Decide at trace start (e.g., sample 10% of requests)
- **Tail-based sampling**: Decide after trace completes (keep errors, slow requests, sample normal)
- **Priority sampling**: Always sample errors, high-value transactions; sample routine requests
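Context propagation reduces to three moves: reuse the trace ID found in incoming headers, record the caller's span ID as the parent, and inject your own IDs into outgoing calls. A sketch using simplified, hypothetical `X-Trace-Id`/`X-Span-Id` headers rather than the full W3C `traceparent` format an OpenTelemetry SDK would use:

```python
import uuid

def start_span(name, headers=None):
    """Start a span, joining the trace carried in incoming headers if present."""
    headers = headers or {}
    return {
        "name": name,
        "traceId": headers.get("X-Trace-Id") or uuid.uuid4().hex,
        "spanId": uuid.uuid4().hex[:16],
        "parentSpanId": headers.get("X-Span-Id"),  # None at the trace root
    }

def inject(span):
    """Headers to attach to outgoing calls so downstream spans join this trace."""
    return {"X-Trace-Id": span["traceId"], "X-Span-Id": span["spanId"]}
```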
## Health Check Endpoints

### Liveness vs Readiness

| Check | Purpose | Failure Action |
|-------|---------|----------------|
| **Liveness** (`/healthz`) | Is the process alive? | Restart the container |
| **Readiness** (`/readyz`) | Can it serve traffic? | Remove from load balancer |
| **Startup** (`/startupz`) | Has it finished initializing? | Wait before liveness checks |

### Health Check Response Format
```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5 },
    "cache": { "status": "healthy", "latency_ms": 1 },
    "external_api": { "status": "degraded", "latency_ms": 2500, "message": "Slow response" }
  },
  "version": "1.2.3",
  "uptime_seconds": 86400
}
```

### Best Practices
- Health checks should be **fast** (< 1 second)
- Liveness should check the process only — NOT external dependencies
- Readiness should check critical dependencies (database, cache)
- Return appropriate HTTP status: 200 (healthy), 503 (unhealthy)
- Include dependency health in readiness but not liveness
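A readiness endpoint is essentially an aggregation over dependency checks: any unhealthy dependency makes the whole service unready. A sketch where each check is a callable returning a status dict shaped like the response format above:

```python
def readiness(checks):
    """Aggregate dependency checks into (http_status, body)."""
    results = {}
    overall = "healthy"
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:
            results[name] = {"status": "unhealthy", "message": str(exc)}
        if results[name]["status"] == "unhealthy":
            overall = "unhealthy"                 # fail readiness: 503
        elif results[name]["status"] == "degraded" and overall == "healthy":
            overall = "degraded"                  # still serving: 200
    status_code = 200 if overall != "unhealthy" else 503
    return status_code, {"status": overall, "checks": results}
```

A liveness handler, by contrast, would skip the dependency loop entirely and just return 200 while the process can respond.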
## Alerting & SLOs

### SLO Definitions
- **SLI** (Service Level Indicator): The metric you measure (e.g., request latency P99)
- **SLO** (Service Level Objective): The target (e.g., P99 latency < 500ms for 99.9% of requests)
- **Error Budget**: Allowable failures (e.g., 0.1% of requests can exceed 500ms)

### Alert Design Principles
- **Alert on symptoms, not causes** — Alert on "users can't log in", not "CPU is high"
- **Alert on SLO burn rate** — Alert when error budget is being consumed too fast
- **Avoid alert fatigue** — Every alert should require human action
- **Include runbook links** — Every alert should link to resolution steps

### Severity Levels

| Severity | Response Time | Example |
|----------|--------------|---------|
| **P1 — Critical** | Immediate (< 5 min) | Service down, data loss, security breach |
| **P2 — High** | Within 1 hour | Degraded performance, partial outage |
| **P3 — Medium** | Within 1 business day | Non-critical feature broken, elevated error rate |
| **P4 — Low** | Next sprint | Performance degradation, tech debt alert |

### Useful Alert Patterns
- Error rate exceeds N% for M minutes
- Latency P99 exceeds threshold for M minutes
- Error budget burn rate > 1x for 1 hour (fast burn)
- Error budget burn rate > 0.1x for 6 hours (slow burn)
- Queue depth exceeds threshold (backpressure)
- Certificate expiry within N days
- Disk usage exceeds N%
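Burn rate, the quantity the fast-burn and slow-burn alerts above are defined over, is just the observed error rate divided by the error budget:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed.
    1.0 means exactly on budget; >1 means the budget runs out early."""
    budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_rate / budget
```

With a 99.9% SLO, a 1% error rate burns the budget at 10x, i.e. a 30-day budget would be gone in about 3 days if it continued.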
## Technology Selection

### Logging
- **Node.js**: pino, winston, bunyan
- **Python**: structlog, python-json-logger
- **Go**: zerolog, zap, slog (stdlib)
- **Rust**: tracing, log + env_logger

### Metrics
- **Prometheus** — Pull-based, widely adopted, great with Kubernetes
- **StatsD/Datadog** — Push-based, hosted
- **OpenTelemetry Metrics** — Vendor-neutral, emerging standard

### Tracing
- **OpenTelemetry** — Vendor-neutral standard (recommended)
- **Jaeger** — Open-source trace backend
- **Zipkin** — Lightweight trace backend

### Dashboards
- **Grafana** — Open-source, works with Prometheus/Loki/Tempo
- **Datadog** — Hosted all-in-one
- **New Relic** — Hosted APM

## Checklist

When adding observability to a feature:
- [ ] Structured logging with correlation IDs at request boundaries
- [ ] Error logging with stack traces and context
- [ ] Business event logging (significant state changes)
- [ ] RED metrics for request-driven endpoints
- [ ] Histogram for latency-sensitive operations
- [ ] Trace spans for cross-service calls and database queries
- [ ] Health check endpoint updated if new dependency added
- [ ] No secrets or PII in logs
- [ ] Appropriate log levels (not everything is INFO)
- [ ] Dashboard updated with new metrics
- [ ] Alerts defined for SLO violations