cortex-agents 3.4.0 → 4.0.0

Files changed (48)
  1. package/.opencode/agents/architect.md +81 -89
  2. package/.opencode/agents/audit.md +57 -188
  3. package/.opencode/agents/{crosslayer.md → coder.md} +8 -52
  4. package/.opencode/agents/debug.md +151 -0
  5. package/.opencode/agents/devops.md +142 -0
  6. package/.opencode/agents/docs-writer.md +195 -0
  7. package/.opencode/agents/fix.md +118 -189
  8. package/.opencode/agents/implement.md +114 -74
  9. package/.opencode/agents/perf.md +151 -0
  10. package/.opencode/agents/refactor.md +163 -0
  11. package/.opencode/agents/{guard.md → security.md} +20 -85
  12. package/.opencode/agents/testing.md +115 -0
  13. package/.opencode/skills/data-engineering/SKILL.md +221 -0
  14. package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
  15. package/README.md +302 -287
  16. package/dist/cli.js +6 -9
  17. package/dist/index.d.ts.map +1 -1
  18. package/dist/index.js +26 -28
  19. package/dist/registry.d.ts +4 -4
  20. package/dist/registry.d.ts.map +1 -1
  21. package/dist/registry.js +6 -6
  22. package/dist/tools/branch.d.ts +2 -2
  23. package/dist/tools/docs.d.ts +2 -2
  24. package/dist/tools/github.d.ts +3 -3
  25. package/dist/tools/plan.d.ts +28 -4
  26. package/dist/tools/plan.d.ts.map +1 -1
  27. package/dist/tools/plan.js +232 -4
  28. package/dist/tools/quality-gate.d.ts +28 -0
  29. package/dist/tools/quality-gate.d.ts.map +1 -0
  30. package/dist/tools/quality-gate.js +233 -0
  31. package/dist/tools/repl.d.ts +5 -0
  32. package/dist/tools/repl.d.ts.map +1 -1
  33. package/dist/tools/repl.js +58 -7
  34. package/dist/tools/worktree.d.ts +5 -32
  35. package/dist/tools/worktree.d.ts.map +1 -1
  36. package/dist/tools/worktree.js +75 -458
  37. package/dist/utils/change-scope.d.ts +33 -0
  38. package/dist/utils/change-scope.d.ts.map +1 -0
  39. package/dist/utils/change-scope.js +198 -0
  40. package/dist/utils/plan-extract.d.ts +21 -0
  41. package/dist/utils/plan-extract.d.ts.map +1 -1
  42. package/dist/utils/plan-extract.js +65 -0
  43. package/dist/utils/repl.d.ts +31 -0
  44. package/dist/utils/repl.d.ts.map +1 -1
  45. package/dist/utils/repl.js +126 -13
  46. package/package.json +1 -1
  47. package/.opencode/agents/qa.md +0 -265
  48. package/.opencode/agents/ship.md +0 -249
@@ -0,0 +1,251 @@
1
+ ---
2
+ name: monitoring-observability
3
+ description: Structured logging, metrics instrumentation, distributed tracing, health checks, and alerting patterns
4
+ license: Apache-2.0
5
+ compatibility: opencode
6
+ ---
7
+
8
+ # Monitoring & Observability Skill
9
+
10
+ This skill provides patterns for making applications observable in production through logging, metrics, tracing, and alerting.
11
+
12
+ ## When to Use
13
+
14
+ Use this skill when:
15
+ - Adding logging to new features or services
16
+ - Instrumenting code with metrics (counters, histograms, gauges)
17
+ - Implementing distributed tracing across services
18
+ - Designing health check endpoints
19
+ - Setting up alerting and SLO definitions
20
+ - Debugging production issues through observability data
21
+
22
+ ## The Three Pillars of Observability
23
+
24
+ ### 1. Logs — What Happened
25
+ Structured, contextual records of discrete events.
26
+
27
+ ### 2. Metrics — How Much / How Fast
28
+ Numeric measurements aggregated over time.
29
+
30
+ ### 3. Traces — The Journey
31
+ End-to-end request paths across service boundaries.
32
+
33
+ ## Structured Logging
34
+
35
+ ### Principles
36
+ - **Always use structured logging** (JSON) — never unstructured `console.log` in production
37
+ - **Log levels matter**: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
38
+ - **Include context**: correlation IDs, user IDs, request IDs, operation names
39
+ - **Never log secrets**: passwords, tokens, PII, credit card numbers
40
+
41
+ ### Log Levels Guide
42
+
43
+ | Level | When to Use | Example |
44
+ |-------|-------------|---------|
45
+ | **ERROR** | Something failed and needs attention | Database connection lost, payment failed |
46
+ | **WARN** | Degraded but still functioning | Cache miss fallback, retry attempt, rate limit approaching |
47
+ | **INFO** | Significant business events | User signed up, order placed, deployment completed |
48
+ | **DEBUG** | Development/troubleshooting detail | SQL query executed, cache hit/miss, function entry/exit |
49
+
50
+ ### Structured Log Format
51
+
52
+ ```json
53
+ {
54
+ "timestamp": "2025-01-15T10:30:00.000Z",
55
+ "level": "info",
56
+ "message": "Order placed successfully",
57
+ "service": "order-service",
58
+ "traceId": "abc123",
59
+ "spanId": "def456",
60
+ "userId": "user_789",
61
+ "orderId": "order_012",
62
+ "amount": 99.99,
63
+ "currency": "USD",
64
+ "duration_ms": 145
65
+ }
66
+ ```
67
+
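One way to emit records in the shape shown above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not the package's implementation: the `JsonFormatter` class, the `build_logger` helper, and the `context` attribute name are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Assumed convention: callers pass extra={"context": {...}} and the
        # keys are merged into the top level of the JSON object.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("order-service").info("Order placed successfully", extra={"context": {"orderId": "order_012"}})` would then produce a single JSON line carrying the extra fields.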
68
+ ### Correlation IDs
69
+ - Generate a unique request ID at the entry point (API gateway, load balancer)
70
+ - Propagate it through all downstream calls via headers (`X-Request-ID`, `traceparent`)
71
+ - Include it in every log line for cross-service correlation
72
+
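The three steps above can be sketched with `contextvars`, which keeps a per-request value visible to any code running in the same task or thread. The function names (`begin_request`, `outbound_headers`, `log_context`) and the `requestId` field name are hypothetical, chosen only for this example.

```python
import contextvars
import uuid

# Holds the current request ID for whichever request is being handled.
_request_id = contextvars.ContextVar("request_id", default=None)


def begin_request(headers: dict) -> str:
    """Reuse the caller's X-Request-ID if present, otherwise generate one."""
    incoming = {k.lower(): v for k, v in headers.items()}
    rid = incoming.get("x-request-id") or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def outbound_headers() -> dict:
    """Headers to attach to downstream calls so the ID propagates."""
    rid = _request_id.get()
    return {"X-Request-ID": rid} if rid else {}


def log_context() -> dict:
    """Fields to merge into every log line for cross-service correlation."""
    return {"requestId": _request_id.get()}
```

Calling `begin_request` at the entry point and merging `log_context()` into each log record keeps the ID consistent without threading it through every function signature.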
73
+ ### What to Log
74
+
75
+ **DO log:**
76
+ - Request/response boundaries (method, path, status, duration)
77
+ - Business events (user actions, state transitions, transactions)
78
+ - Error details with stack traces and context
79
+ - Performance-relevant data (query times, cache hit rates)
80
+ - Security events (auth failures, permission denials, rate limits)
81
+
82
+ **DO NOT log:**
83
+ - Passwords, tokens, API keys, secrets
84
+ - Full credit card numbers, SSNs, or PII without masking
85
+ - High-frequency debug data in production (use sampling)
86
+ - Request/response bodies containing sensitive data
87
+
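A redaction pass applied before a payload reaches the logger is one way to enforce the "DO NOT log" list. The sketch below is illustrative only: the key list in `SENSITIVE_KEYS` and the helper names are assumptions, and a real deployment would need its own list.

```python
# Assumed set of field names to treat as sensitive; extend per application.
SENSITIVE_KEYS = {
    "password", "token", "api_key", "secret",
    "authorization", "ssn", "card_number",
}


def redact(value, key=None):
    """Return a copy of value with sensitive fields replaced by a marker."""
    if key is not None and str(key).lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def mask_card(number: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return "*" * max(len(digits) - 4, 0) + digits[-4:]
```

Matching on key names is deliberately simple; it will not catch secrets embedded in free-text messages, so it complements rather than replaces care at the call site.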
88
+ ## Metrics Instrumentation
89
+
90
+ ### Metric Types
91
+
92
+ | Type | Use Case | Example |
93
+ |------|----------|---------|
94
+ | **Counter** | Monotonically increasing value | Total requests, errors, orders placed |
95
+ | **Gauge** | Value that goes up and down | Active connections, queue depth, memory usage |
96
+ | **Histogram** | Distribution of values | Request latency, response size, batch processing time |
97
+ | **Summary** | Pre-calculated quantiles | P50/P95/P99 latency (client-side) |
98
+
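To make the difference between a counter and a histogram concrete, here is a small in-memory sketch. It is not a client library: `Counter` and `Histogram` are hypothetical classes written for this example, with the histogram using cumulative "less than or equal" buckets in the style Prometheus exposes.

```python
import bisect


class Counter:
    """Monotonically increasing value; negative increments are rejected."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Histogram:
    """Counts observations into buckets defined by upper bounds."""

    def __init__(self, buckets) -> None:
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bound that is >= value, which is
        # the bucket whose upper bound ("le") contains the observation.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        total, out = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            total += n
            out.append((bound, total))
        return out
```

Because the buckets are cumulative, quantiles can be estimated server-side across many instances, which is the main practical difference from a client-side summary.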
99
+ ### Naming Conventions
100
+ - Use snake_case: `http_requests_total`, `request_duration_seconds`
101
+ - Include units in the name: `_seconds`, `_bytes`, `_total`
102
+ - Use `_total` suffix for counters
103
+ - Prefix with service/subsystem: `api_http_requests_total`
104
+
105
+ ### Key Metrics to Track
106
+
107
+ **RED Method (Request-driven services):**
108
+ - **R**ate — Requests per second
109
+ - **E**rrors — Error rate (4xx, 5xx)
110
+ - **D**uration — Request latency distribution
111
+
112
+ **USE Method (Resource-oriented):**
113
+ - **U**tilization — % of resource capacity used
114
+ - **S**aturation — Queue depth, backpressure
115
+ - **E**rrors — Error count per resource
116
+
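As a rough illustration of the RED method, the function below derives rate, error rate, and a latency percentile from a batch of request records. The record shape (`status`, `duration_s`) and the function name `red_summary` are assumptions; in practice these figures would normally come from a metrics backend rather than raw records.

```python
import math


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, Duration (p95) for one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p95_s": None}
    # Counts 4xx and 5xx as errors, following the definition above.
    errors = sum(1 for r in requests if r["status"] >= 400)
    durations = sorted(r["duration_s"] for r in requests)
    rank = math.ceil(0.95 * total)  # nearest-rank percentile
    return {
        "rate_per_s": total / window_seconds,
        "error_rate": errors / total,
        "p95_s": durations[rank - 1],
    }
```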
117
+ ### Cardinality Warning
118
+ - Avoid high-cardinality labels (user IDs, request IDs, URLs with path params)
119
+ - Keep label combinations < 1000 per metric
120
+ - Use bounded values: HTTP methods (GET, POST), status codes (2xx, 4xx, 5xx), endpoints (normalized)
121
+
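Bounding label values usually means normalizing them before they are attached to a metric. A minimal sketch, assuming IDs appear as numeric or UUID path segments; the `:id` placeholder and both function names are illustrative choices.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def normalize_path(path: str) -> str:
    """Replace ID-like path segments with a placeholder and drop the query."""
    path = path.split("?", 1)[0]
    parts = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            parts.append(":id")
        else:
            parts.append(segment)
    return "/".join(parts)


def status_class(code: int) -> str:
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"
```

Where the web framework exposes the matched route template, using that as the label is generally more reliable than pattern-matching the raw path.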
122
+ ## Distributed Tracing
123
+
124
+ ### OpenTelemetry Patterns
125
+ - **Span** — A single operation within a trace (e.g., HTTP request, DB query, function call)
126
+ - **Trace** — A tree of spans representing an end-to-end request
127
+ - **Context Propagation** — Passing trace context across service boundaries via headers
128
+
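Context propagation over HTTP commonly uses the W3C `traceparent` header, which has the form `version-traceid-parentid-flags`. The sketch below parses and re-emits the version `00` layout only; the function names are hypothetical, and an OpenTelemetry SDK would normally handle this for you.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str):
    """Return trace context from a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx: dict) -> str:
    """Build the header for a downstream call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"
```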
129
+ ### What to Trace
130
+ - HTTP requests (client and server)
131
+ - Database queries
132
+ - Cache operations
133
+ - Message queue publish/consume
134
+ - External API calls
135
+ - Significant business operations
136
+
137
+ ### Span Attributes
138
+ ```
139
+ http.method: GET
140
+ http.url: /api/users/123
141
+ http.status_code: 200
142
+ db.system: postgresql
143
+ db.statement: SELECT * FROM users WHERE id = $1
144
+ messaging.system: kafka
145
+ messaging.destination: orders
146
+ ```
147
+
148
+ ### Sampling Strategies
149
+ - **Head-based sampling**: Decide at trace start (e.g., sample 10% of requests)
150
+ - **Tail-based sampling**: Decide after the trace completes (keep errors and slow requests; sample the rest)
151
+ - **Priority sampling**: Always keep errors and high-value transactions; sample routine requests at a lower rate
152
+
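The head-based and tail-based decisions can be contrasted in a few lines. This is a simplified sketch: the span dictionary shape (`error`, `duration_ms`), the thresholds, and the function names are assumptions, and real tail-based sampling typically runs in a collector that buffers complete traces.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, deterministically from the trace ID.

    Using the ID (rather than a random draw) means every service that
    sees the same trace reaches the same decision.
    """
    bucket = int(trace_id[-8:], 16) / 0x100000000  # value in [0, 1)
    return bucket < rate


def tail_sample(trace_id: str, spans, slow_ms: float, baseline_rate: float) -> bool:
    """Decide after the trace completes: keep errors and slow traces,
    then fall back to a baseline sample of everything else."""
    if any(span["error"] for span in spans):
        return True
    if max((span["duration_ms"] for span in spans), default=0) >= slow_ms:
        return True
    return head_sample(trace_id, baseline_rate)
```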
153
+ ## Health Check Endpoints
154
+
155
+ ### Liveness vs Readiness
156
+
157
+ | Check | Purpose | Failure Action |
158
+ |-------|---------|----------------|
159
+ | **Liveness** (`/healthz`) | Is the process alive? | Restart the container |
160
+ | **Readiness** (`/readyz`) | Can it serve traffic? | Remove from load balancer |
161
+ | **Startup** (`/startupz`) | Has it finished initializing? | Wait before liveness checks |
162
+
163
+ ### Health Check Response Format
164
+ ```json
165
+ {
166
+ "status": "healthy",
167
+ "checks": {
168
+ "database": { "status": "healthy", "latency_ms": 5 },
169
+ "cache": { "status": "healthy", "latency_ms": 1 },
170
+ "external_api": { "status": "degraded", "latency_ms": 2500, "message": "Slow response" }
171
+ },
172
+ "version": "1.2.3",
173
+ "uptime_seconds": 86400
174
+ }
175
+ ```
176
+
177
+ ### Best Practices
178
+ - Health checks should be **fast** (< 1 second)
179
+ - Liveness should check the process only — NOT external dependencies
180
+ - Readiness should check critical dependencies (database, cache)
181
+ - Return appropriate HTTP status: 200 (healthy), 503 (unhealthy)
182
+ - Include dependency health in readiness but not liveness
183
+
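The split between liveness and readiness can be sketched as two small functions. The names are hypothetical, and treating a `degraded` dependency as still ready (HTTP 200) is an assumption of this example rather than something the guidance above prescribes.

```python
def liveness():
    """Process-only check: no external dependencies are consulted."""
    return 200, {"status": "healthy"}


def readiness(checks: dict):
    """Aggregate dependency results into an overall status and HTTP code.

    `checks` maps a dependency name to a dict with at least a "status"
    key of "healthy", "degraded", or "unhealthy".
    """
    statuses = [check["status"] for check in checks.values()]
    if "unhealthy" in statuses:
        overall = "unhealthy"
    elif "degraded" in statuses:
        overall = "degraded"
    else:
        overall = "healthy"
    # Assumption: only an unhealthy dependency takes the instance out of rotation.
    code = 503 if overall == "unhealthy" else 200
    return code, {"status": overall, "checks": checks}
```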
184
+ ## Alerting & SLOs
185
+
186
+ ### SLO Definitions
187
+ - **SLI** (Service Level Indicator): The metric you measure (e.g., the fraction of requests served in under 500ms)
188
+ - **SLO** (Service Level Objective): The target (e.g., 99.9% of requests complete in under 500ms over a 30-day window)
189
+ - **Error Budget**: Allowable failures (e.g., 0.1% of requests can exceed 500ms)
190
+
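The error budget follows directly from the SLO target: it is the share of requests allowed to miss the objective. A small worked sketch, with `error_budget` as a hypothetical helper name:

```python
def error_budget(slo_target: float, total_requests: int, bad_requests: int):
    """Return (allowed_bad, remaining_bad) request counts for a window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% objective.
    A negative remaining value means the budget is exhausted.
    """
    allowed = (1.0 - slo_target) * total_requests
    return allowed, allowed - bad_requests
```

For example, with a 99.9% target and 1,000,000 requests in the window, roughly 1,000 requests may miss the objective; if 250 already have, about 750 remain.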
191
+ ### Alert Design Principles
192
+ - **Alert on symptoms, not causes** — Alert on "users can't log in", not "CPU is high"
193
+ - **Alert on SLO burn rate** — Alert when error budget is being consumed too fast
194
+ - **Avoid alert fatigue** — Every alert should require human action
195
+ - **Include runbook links** — Every alert should link to resolution steps
196
+
197
+ ### Severity Levels
198
+
199
+ | Severity | Response Time | Example |
200
+ |----------|--------------|---------|
201
+ | **P1 — Critical** | Immediate (< 5 min) | Service down, data loss, security breach |
202
+ | **P2 — High** | Within 1 hour | Degraded performance, partial outage |
203
+ | **P3 — Medium** | Within 1 business day | Non-critical feature broken, elevated error rate |
204
+ | **P4 — Low** | Next sprint | Performance degradation, tech debt alert |
205
+
206
+ ### Useful Alert Patterns
207
+ - Error rate exceeds N% for M minutes
208
+ - Latency P99 exceeds threshold for M minutes
209
+ - Error budget burn rate > 14.4x for 1 hour (fast burn, about 2% of a 30-day budget)
210
+ - Error budget burn rate > 6x for 6 hours (slow burn, about 5% of a 30-day budget)
211
+ - Queue depth exceeds threshold (backpressure)
212
+ - Certificate expiry within N days
213
+ - Disk usage exceeds N%
214
+
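Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below shows the arithmetic; the function names and the 30-day (720-hour) default window are assumptions. The multipliers 14.4 and 6 used in the usage note are the ones suggested in Google's SRE Workbook for a 30-day window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(burn: float, alert_window_hours: float,
                    slo_window_hours: float = 720.0) -> float:
    """Fraction of the whole error budget spent if `burn` is sustained
    for the alert window (720 hours = 30 days, an assumed SLO window)."""
    return burn * alert_window_hours / slo_window_hours


def should_alert(long_window_burn: float, short_window_burn: float,
                 threshold: float) -> bool:
    """Multi-window check: both windows must exceed the threshold, so the
    alert clears soon after the problem stops."""
    return long_window_burn > threshold and short_window_burn > threshold
```

With a 99.9% SLO, an error ratio of 1.44% corresponds to a burn rate of about 14.4, which sustained for one hour spends roughly 2% of a 30-day budget; a burn rate of 6 sustained for six hours spends roughly 5%.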
215
+ ## Technology Selection
216
+
217
+ ### Logging
218
+ - **Node.js**: pino, winston, bunyan
219
+ - **Python**: structlog, python-json-logger
220
+ - **Go**: zerolog, zap, slog (stdlib)
221
+ - **Rust**: tracing, log + env_logger
222
+
223
+ ### Metrics
224
+ - **Prometheus** — Pull-based, widely adopted, great with Kubernetes
225
+ - **StatsD/Datadog** — Push-based, hosted
226
+ - **OpenTelemetry Metrics** — Vendor-neutral, emerging standard
227
+
228
+ ### Tracing
229
+ - **OpenTelemetry** — Vendor-neutral standard (recommended)
230
+ - **Jaeger** — Open-source trace backend
231
+ - **Zipkin** — Lightweight trace backend
232
+
233
+ ### Dashboards
234
+ - **Grafana** — Open-source, works with Prometheus/Loki/Tempo
235
+ - **Datadog** — Hosted all-in-one
236
+ - **New Relic** — Hosted APM
237
+
238
+ ## Checklist
239
+
240
+ When adding observability to a feature:
241
+ - [ ] Structured logging with correlation IDs at request boundaries
242
+ - [ ] Error logging with stack traces and context
243
+ - [ ] Business event logging (significant state changes)
244
+ - [ ] RED metrics for request-driven endpoints
245
+ - [ ] Histogram for latency-sensitive operations
246
+ - [ ] Trace spans for cross-service calls and database queries
247
+ - [ ] Health check endpoint updated if new dependency added
248
+ - [ ] No secrets or PII in logs
249
+ - [ ] Appropriate log levels (not everything is INFO)
250
+ - [ ] Dashboard updated with new metrics
251
+ - [ ] Alerts defined for SLO violations