cortex-agents 3.4.0 → 4.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.opencode/agents/architect.md +81 -89
- package/.opencode/agents/audit.md +57 -188
- package/.opencode/agents/{crosslayer.md → coder.md} +8 -52
- package/.opencode/agents/debug.md +151 -0
- package/.opencode/agents/devops.md +142 -0
- package/.opencode/agents/docs-writer.md +195 -0
- package/.opencode/agents/fix.md +118 -189
- package/.opencode/agents/implement.md +114 -74
- package/.opencode/agents/perf.md +151 -0
- package/.opencode/agents/refactor.md +163 -0
- package/.opencode/agents/{guard.md → security.md} +20 -85
- package/.opencode/agents/testing.md +115 -0
- package/.opencode/skills/data-engineering/SKILL.md +221 -0
- package/.opencode/skills/monitoring-observability/SKILL.md +251 -0
- package/README.md +302 -287
- package/dist/cli.js +6 -9
- package/dist/index.d.ts.map +1 -1
- package/dist/index.js +26 -28
- package/dist/registry.d.ts +4 -4
- package/dist/registry.d.ts.map +1 -1
- package/dist/registry.js +6 -6
- package/dist/tools/branch.d.ts +2 -2
- package/dist/tools/docs.d.ts +2 -2
- package/dist/tools/github.d.ts +3 -3
- package/dist/tools/plan.d.ts +28 -4
- package/dist/tools/plan.d.ts.map +1 -1
- package/dist/tools/plan.js +232 -4
- package/dist/tools/quality-gate.d.ts +28 -0
- package/dist/tools/quality-gate.d.ts.map +1 -0
- package/dist/tools/quality-gate.js +233 -0
- package/dist/tools/repl.d.ts +5 -0
- package/dist/tools/repl.d.ts.map +1 -1
- package/dist/tools/repl.js +58 -7
- package/dist/tools/worktree.d.ts +5 -32
- package/dist/tools/worktree.d.ts.map +1 -1
- package/dist/tools/worktree.js +75 -458
- package/dist/utils/change-scope.d.ts +33 -0
- package/dist/utils/change-scope.d.ts.map +1 -0
- package/dist/utils/change-scope.js +198 -0
- package/dist/utils/plan-extract.d.ts +21 -0
- package/dist/utils/plan-extract.d.ts.map +1 -1
- package/dist/utils/plan-extract.js +65 -0
- package/dist/utils/repl.d.ts +31 -0
- package/dist/utils/repl.d.ts.map +1 -1
- package/dist/utils/repl.js +126 -13
- package/package.json +1 -1
- package/.opencode/agents/qa.md +0 -265
- package/.opencode/agents/ship.md +0 -249
@@ -0,0 +1,251 @@
---
name: monitoring-observability
description: Structured logging, metrics instrumentation, distributed tracing, health checks, and alerting patterns
license: Apache-2.0
compatibility: opencode
---

# Monitoring & Observability Skill

This skill provides patterns for making applications observable in production through logging, metrics, tracing, and alerting.

## When to Use

Use this skill when:
- Adding logging to new features or services
- Instrumenting code with metrics (counters, histograms, gauges)
- Implementing distributed tracing across services
- Designing health check endpoints
- Setting up alerting and SLO definitions
- Debugging production issues through observability data

## The Three Pillars of Observability

### 1. Logs — What Happened
Structured, contextual records of discrete events.

### 2. Metrics — How Much / How Fast
Numeric measurements aggregated over time.

### 3. Traces — The Journey
End-to-end request paths across service boundaries.

## Structured Logging

### Principles
- **Always use structured logging** (JSON) — never unstructured `console.log` in production
- **Log levels matter**: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
- **Include context**: correlation IDs, user IDs, request IDs, operation names
- **Never log secrets**: passwords, tokens, PII, credit card numbers

### Log Levels Guide

| Level | When to Use | Example |
|-------|-------------|---------|
| **ERROR** | Something failed and needs attention | Database connection lost, payment failed |
| **WARN** | Degraded but still functioning | Cache miss fallback, retry attempt, rate limit approaching |
| **INFO** | Significant business events | User signed up, order placed, deployment completed |
| **DEBUG** | Development/troubleshooting detail | SQL query executed, cache hit/miss, function entry/exit |

### Structured Log Format

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Order placed successfully",
  "service": "order-service",
  "traceId": "abc123",
  "spanId": "def456",
  "userId": "user_789",
  "orderId": "order_012",
  "amount": 99.99,
  "currency": "USD",
  "duration_ms": 145
}
```
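
One way to produce entries in this shape is a small JSON formatter on top of Python's standard `logging` module. This is a minimal sketch, not a recommended library setup: the `JsonFormatter` and `build_logger` names and the convention of passing per-event fields through `extra={"context": {...}}` are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Assumption: callers pass extra={"context": {...}} for per-event fields.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Demo: write one event to an in-memory stream and print the JSON line.
buffer = io.StringIO()
log = build_logger("order-service", buffer)
log.info(
    "Order placed successfully",
    extra={"context": {"traceId": "abc123", "orderId": "order_012", "duration_ms": 145}},
)
print(buffer.getvalue(), end="")
```

In a real service the stream would typically be `sys.stdout`, with a log shipper collecting the lines.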

### Correlation IDs
- Generate a unique request ID at the entry point (API gateway, load balancer)
- Propagate it through all downstream calls via headers (`X-Request-ID`, `traceparent`)
- Include it in every log line for cross-service correlation
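
The three steps above can be sketched with a context variable that holds the ID for the lifetime of a request. The helper names (`begin_request`, `outgoing_headers`, `log_fields`) and the `requestId` field name are hypothetical; a real framework would call `begin_request` from middleware.

```python
import uuid
from contextvars import ContextVar
from typing import Dict

# Holds the request ID for the current request; safe across threads and async tasks.
_request_id: ContextVar[str] = ContextVar("request_id", default="")


def begin_request(headers: Dict[str, str]) -> str:
    """Reuse the caller's X-Request-ID if present, otherwise generate one."""
    incoming = {key.lower(): value for key, value in headers.items()}
    request_id = incoming.get("x-request-id") or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to every downstream call made during this request."""
    return {"X-Request-ID": _request_id.get()}


def log_fields() -> Dict[str, str]:
    """Fields to merge into every log line written during this request."""
    return {"requestId": _request_id.get()}
```

Storing the ID in a `ContextVar` rather than a global keeps concurrent requests from overwriting each other's IDs.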

### What to Log

**DO log:**
- Request/response boundaries (method, path, status, duration)
- Business events (user actions, state transitions, transactions)
- Error details with stack traces and context
- Performance-relevant data (query times, cache hit rates)
- Security events (auth failures, permission denials, rate limits)

**DO NOT log:**
- Passwords, tokens, API keys, secrets
- Full credit card numbers, SSNs, or PII without masking
- High-frequency debug data in production (use sampling)
- Request/response bodies containing sensitive data
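
The "DO NOT log" rules can be enforced mechanically by passing every event through a redaction step before it reaches the logger. The sketch below is illustrative only: the key list and the card-number pattern are assumptions and would need tuning for a real payload schema.

```python
import re

# Assumed set of field names that must never appear in logs.
SENSITIVE_KEYS = {"password", "token", "api_key", "secret", "authorization", "ssn"}

# 16 to 19 digits, optionally separated by spaces or dashes; keeps the last four.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}(\d{4})\b")


def redact(value):
    """Return a copy of value with sensitive fields and card numbers masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return CARD_PATTERN.sub(lambda match: "****" + match.group(1), value)
    return value


event = {
    "user": "user_789",
    "password": "hunter2",
    "payment": {"card": "4111 1111 1111 1111", "Token": "t0k"},
    "notes": ["paid with 4111111111111111"],
}
clean = redact(event)
```

The function builds a new structure instead of mutating its input, so the original event stays usable by the request handler.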

## Metrics Instrumentation

### Metric Types

| Type | Use Case | Example |
|------|----------|---------|
| **Counter** | Monotonically increasing value | Total requests, errors, orders placed |
| **Gauge** | Value that goes up and down | Active connections, queue depth, memory usage |
| **Histogram** | Distribution of values | Request latency, response size, batch processing time |
| **Summary** | Pre-calculated quantiles | P50/P95/P99 latency (client-side) |
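
To make the differences concrete, here is a minimal in-memory sketch of the first three types. These classes are hypothetical teaching aids, not the API of any metrics library; a production service would use a client such as the Prometheus or OpenTelemetry SDK.

```python
import bisect
from typing import List


class Counter:
    """Monotonically increasing value."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up and down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations per bucket; the last bucket catches everything larger."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.count += 1
        self.sum += value
```

A histogram stores only bucket counts, so quantiles can be estimated and aggregated on the server side, whereas a summary computes them in the client.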

### Naming Conventions
- Use snake_case: `http_requests_total`, `request_duration_seconds`
- Include base units in the name: `_seconds`, `_bytes`
- Use the `_total` suffix for counters
- Prefix with service/subsystem: `api_http_requests_total`

### Key Metrics to Track

**RED Method (Request-driven services):**
- **R**ate — Requests per second
- **E**rrors — Error rate (4xx, 5xx)
- **D**uration — Request latency distribution

**USE Method (Resource-oriented):**
- **U**tilization — % of resource capacity used
- **S**aturation — Queue depth, backpressure
- **E**rrors — Error count per resource
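
As an illustration of the RED method, the sketch below derives all three signals from the raw request records of one time window. `RequestRecord` and `red_summary` are hypothetical names, and the nearest-rank P95 is one of several reasonable percentile definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    status: int        # HTTP status code
    duration_s: float  # request latency in seconds


def red_summary(records: List[RequestRecord], window_s: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(records)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    errors = sum(1 for record in records if record.status >= 400)
    durations = sorted(record.duration_s for record in records)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total,
        "p95_s": durations[rank - 1],
    }


# Demo: 20 requests observed over a 10-second window.
window = [RequestRecord(200, 0.1)] * 18 + [RequestRecord(404, 0.2), RequestRecord(500, 1.0)]
summary = red_summary(window, window_s=10.0)
```

In practice these values usually come from counters and histograms queried by the metrics backend rather than from raw records held in memory.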

### Cardinality Warning
- Avoid high-cardinality labels (user IDs, request IDs, URLs with path params)
- Keep label combinations < 1000 per metric
- Use bounded values: HTTP methods (GET, POST), status codes (2xx, 4xx, 5xx), endpoints (normalized)
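
Bounded label values can be derived from raw request data before a metric is recorded. The following sketch assumes that numeric and UUID path segments are identifiers; the `:id` placeholder and the function names are illustrative choices, and a framework's own route template is preferable when one is available.

```python
import re
from typing import Dict

# Path segments that look like identifiers: all digits, or a UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def normalize_path(path: str) -> str:
    """Drop the query string and replace identifier segments with ':id'."""
    path = path.split("?", 1)[0]
    return "/".join(":id" if _ID_SEGMENT.match(segment) else segment for segment in path.split("/"))


def status_class(code: int) -> str:
    """Collapse a status code into its class, for example 404 into '4xx'."""
    return f"{code // 100}xx"


def metric_labels(method: str, path: str, code: int) -> Dict[str, str]:
    """Build a low-cardinality label set for an HTTP request metric."""
    return {
        "method": method.upper(),
        "endpoint": normalize_path(path),
        "status": status_class(code),
    }
```

With this normalization, `/api/users/123` and `/api/users/456` share the single label value `/api/users/:id` instead of creating one time series per user.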

## Distributed Tracing

### OpenTelemetry Patterns
- **Span** — A single operation within a trace (e.g., HTTP request, DB query, function call)
- **Trace** — A tree of spans representing an end-to-end request
- **Context Propagation** — Passing trace context across service boundaries via headers
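
Context propagation is normally handled by the OpenTelemetry SDK, but the mechanics are easy to see by hand. The sketch below parses and emits the W3C `traceparent` header (version `00` only); `SpanContext`, `parse_traceparent`, `child_span`, and `format_traceparent` are hypothetical helper names, not OpenTelemetry APIs.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_span(parent: Optional[SpanContext]) -> SpanContext:
    """Start a span in the parent's trace, or a new trace if there is no parent."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def format_traceparent(context: SpanContext) -> str:
    """Serialize a span context for an outgoing request header."""
    flags = "01" if context.sampled else "00"
    return f"00-{context.trace_id}-{context.span_id}-{flags}"
```

Every span in the request shares the same trace ID, while each hop generates a fresh span ID and sends it downstream as the new parent.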

### What to Trace
- HTTP requests (client and server)
- Database queries
- Cache operations
- Message queue publish/consume
- External API calls
- Significant business operations

### Span Attributes
```
http.method: GET
http.url: /api/users/123
http.status_code: 200
db.system: postgresql
db.statement: SELECT * FROM users WHERE id = $1
messaging.system: kafka
messaging.destination: orders
```

### Sampling Strategies
- **Head-based sampling**: Decide at trace start (e.g., sample 10% of requests)
- **Tail-based sampling**: Decide after the trace completes (keep errors and slow requests, sample the rest)
- **Priority sampling**: Always sample errors and high-value transactions; sample routine requests at a lower rate
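
The head-based and tail-based decisions can be sketched as two small functions. This is an illustration under assumptions: the trace ID is a 32-character hex string, and the `slow_ms` and `baseline_rate` defaults are arbitrary example values.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start, deterministically from the trace ID.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is either kept in full or dropped in full.
    """
    # Treat the low 64 bits of the trace ID as a uniform integer.
    bucket = int(trace_id[-16:], 16)
    return bucket < int(rate * 2**64)


def tail_sample(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_ms: float = 1000.0,
    baseline_rate: float = 0.05,
) -> bool:
    """Decide after the trace completes: keep errors and slow requests, sample the rest."""
    if had_error or duration_ms >= slow_ms:
        return True
    return head_sample(trace_id, baseline_rate)
```

Tail-based sampling requires buffering complete traces somewhere, typically in a collector, before the decision can be made.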

## Health Check Endpoints

### Liveness vs Readiness

| Check | Purpose | Failure Action |
|-------|---------|----------------|
| **Liveness** (`/healthz`) | Is the process alive? | Restart the container |
| **Readiness** (`/readyz`) | Can it serve traffic? | Remove from load balancer |
| **Startup** (`/startupz`) | Has it finished initializing? | Wait before liveness checks |

### Health Check Response Format
```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5 },
    "cache": { "status": "healthy", "latency_ms": 1 },
    "external_api": { "status": "degraded", "latency_ms": 2500, "message": "Slow response" }
  },
  "version": "1.2.3",
  "uptime_seconds": 86400
}
```

### Best Practices
- Health checks should be **fast** (< 1 second)
- Liveness should check the process only — NOT external dependencies
- Readiness should check critical dependencies (database, cache)
- Return appropriate HTTP status: 200 (healthy), 503 (unhealthy)
- Include dependency health in readiness but not liveness
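
The split between liveness and readiness can be sketched as two handlers that return a status code and a response body. The function names and the convention that a dependency check raises an exception on failure are assumptions for illustration; the sketch does not enforce a timeout, which a real readiness check would need to stay fast.

```python
import time
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """The process is running and can answer; no dependencies are touched."""
    return 200, {"status": "healthy"}


def readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run each dependency check; any failure makes the service not ready."""
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()  # assumed to raise on failure
            result = {"status": "healthy"}
        except Exception as error:
            result = {"status": "unhealthy", "message": str(error)}
        result["latency_ms"] = round((time.monotonic() - started) * 1000)
        results[name] = result
    healthy = all(result["status"] == "healthy" for result in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body
```

Keeping dependency checks out of `liveness` avoids restart loops when a database outage would otherwise make every container look dead.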

## Alerting & SLOs

### SLO Definitions
- **SLI** (Service Level Indicator): The metric you measure (e.g., the share of requests completing in under 500ms)
- **SLO** (Service Level Objective): The target for that metric (e.g., 99.9% of requests complete in under 500ms)
- **Error Budget**: Allowable failures (e.g., 0.1% of requests can exceed 500ms)
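
The error budget follows directly from the SLO target. As a worked example under assumed traffic numbers, a 99.9% target over 1,000,000 requests allows 1,000 slow or failed requests; the helper names below are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to miss the objective."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget still unspent; negative once the SLO is breached."""
    return 1.0 - bad_requests / error_budget(slo_target, total_requests)


# Demo: 99.9% target, 1,000,000 requests, 250 of them missed the objective.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
```

Here `allowed` is about 1,000 requests and `remaining` is about 0.75, meaning a quarter of the budget has been spent.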

### Alert Design Principles
- **Alert on symptoms, not causes** — Alert on "users can't log in", not "CPU is high"
- **Alert on SLO burn rate** — Alert when error budget is being consumed too fast
- **Avoid alert fatigue** — Every alert should require human action
- **Include runbook links** — Every alert should link to resolution steps

### Severity Levels

| Severity | Response Time | Example |
|----------|--------------|---------|
| **P1 — Critical** | Immediate (< 5 min) | Service down, data loss, security breach |
| **P2 — High** | Within 1 hour | Degraded performance, partial outage |
| **P3 — Medium** | Within 1 business day | Non-critical feature broken, elevated error rate |
| **P4 — Low** | Next sprint | Performance degradation, tech debt alert |

### Useful Alert Patterns
- Error rate exceeds N% for M minutes
- Latency P99 exceeds threshold for M minutes
- Error budget burn rate > 14.4x over 1 hour (fast burn; spends 2% of a 30-day budget)
- Error budget burn rate > 6x over 6 hours (slow burn; spends 5% of a 30-day budget)
- Queue depth exceeds threshold (backpressure)
- Certificate expiry within N days
- Disk usage exceeds N%
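
Burn rate is the observed error ratio divided by the error ratio the SLO allows, so 1x means the budget is being spent exactly on pace for the SLO period. The sketch below shows the arithmetic; the function names, the 30-day default period, and the two-window check in `should_alert` are illustrative assumptions rather than a prescribed alerting policy.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's error budget spent during the window."""
    return rate * window_hours / slo_period_hours


def should_alert(long_window_rate: float, short_window_rate: float, threshold: float) -> bool:
    """Fire only while both a long and a short window exceed the threshold.

    The short window lets the alert clear quickly once the problem stops.
    """
    return long_window_rate > threshold and short_window_rate > threshold
```

For example, 144 failures in 10,000 requests against a 99.9% target is a burn rate of roughly 14.4x, and sustaining it for one hour spends about 2% of a 30-day budget.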

## Technology Selection

### Logging
- **Node.js**: pino, winston, bunyan
- **Python**: structlog, python-json-logger
- **Go**: zerolog, zap, slog (stdlib)
- **Rust**: tracing, log + env_logger

### Metrics
- **Prometheus** — Pull-based, widely adopted, great with Kubernetes
- **StatsD/Datadog** — Push-based, hosted
- **OpenTelemetry Metrics** — Vendor-neutral, emerging standard

### Tracing
- **OpenTelemetry** — Vendor-neutral standard (recommended)
- **Jaeger** — Open-source trace backend
- **Zipkin** — Lightweight trace backend

### Dashboards
- **Grafana** — Open-source, works with Prometheus/Loki/Tempo
- **Datadog** — Hosted all-in-one
- **New Relic** — Hosted APM

## Checklist

When adding observability to a feature:
- [ ] Structured logging with correlation IDs at request boundaries
- [ ] Error logging with stack traces and context
- [ ] Business event logging (significant state changes)
- [ ] RED metrics for request-driven endpoints
- [ ] Histogram for latency-sensitive operations
- [ ] Trace spans for cross-service calls and database queries
- [ ] Health check endpoint updated if new dependency added
- [ ] No secrets or PII in logs
- [ ] Appropriate log levels (not everything is INFO)
- [ ] Dashboard updated with new metrics
- [ ] Alerts defined for SLO violations