cubeapm-mcp 1.1.1 → 1.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +110 -8
- package/dist/index.js +316 -7
- package/package.json +1 -1
package/README.md
CHANGED

@@ -2,6 +2,7 @@
 
 [](https://www.npmjs.com/package/cubeapm-mcp)
 [](https://opensource.org/licenses/MIT)
+[](https://lobehub.com/mcp/technicalrhino-cubeapm-mcp)
 
 A [Model Context Protocol (MCP)](https://modelcontextprotocol.io) server for [CubeAPM](https://cubeapm.com) - enabling AI assistants like Claude to query your observability data including traces, metrics, and logs.
 
@@ -180,9 +181,33 @@ You can now ask Claude questions like:
 |------|-------------|
 | `ingest_metrics_prometheus` | Send metrics in Prometheus text exposition format |
 
+## Prompts
+
+Pre-defined templates for common observability tasks:
+
+| Prompt | Description |
+|--------|-------------|
+| `investigate-service` | Comprehensive service investigation - checks errors, latency, and traces |
+| `check-latency` | Get P50, P95, P99 latency percentiles for a service |
+| `find-slow-traces` | Find slowest traces to identify performance bottlenecks |
+
+**Usage Example:**
+```
+Use the investigate-service prompt for Kratos-Prod
+```
+
+## Resources
+
+Readable resources exposing CubeAPM data and configuration:
+
+| Resource URI | Description |
+|--------------|-------------|
+| `cubeapm://config` | Current CubeAPM connection configuration |
+| `cubeapm://query-patterns` | Query patterns and naming conventions reference |
+
 ## CubeAPM Query Patterns
 
-### Metrics
+### Metrics (PromQL / MetricsQL)
 
 CubeAPM uses specific naming conventions that differ from standard OpenTelemetry:
 
@@ -192,7 +217,7 @@ CubeAPM uses specific naming conventions that differ from standard OpenTelemetry
 | Service label | `service` (NOT `server` or `service_name`) |
 | Common labels | `env`, `service`, `span_kind`, `status_code`, `http_code` |
 
-
+#### Histogram Queries (P50, P90, P95, P99)
 
 CubeAPM uses **VictoriaMetrics-style histograms** with `vmrange` labels instead of Prometheus `le` buckets:
 
@@ -208,7 +233,9 @@ histogram_quantile(0.95, sum by (le) (rate(http_request_duration_bucket[5m])))
 
 > **Note:** Latency values are returned in **seconds** (0.05 = 50ms)
 
-### Logs
+### Logs (LogsQL)
+
+#### Stream Selectors
 
 Log labels vary by source. Use `*` query first to discover available labels:
 
@@ -224,17 +251,92 @@ Log labels vary by source. Use `*` query first to discover available labels:
 # Lambda function logs
 {faas.name="my-lambda-prod"}
 
-#
-{faas.name=~".*-prod"}
+# Regex match
+{faas.name=~".*-prod"}
+
+# Text filter with boolean operators
+{faas.name=~".*"} AND "error" AND NOT "retry"
+```
+
+#### Pipe Operators
+
+Chain after any query with `|`:
+
+| Pipe | Syntax | Description |
+|------|--------|-------------|
+| `copy` | `\| copy src AS dst` | Copy field value |
+| `drop` | `\| drop field1, field2` | Remove fields from output |
+| `extract_regexp` | `\| extract_regexp "(?P<name>re)"` | Extract via named capture groups |
+| `join` | `\| join by (field) (...subquery...)` | Join with subquery results |
+| `keep` | `\| keep field1, field2` | Keep only specified fields |
+| `limit` | `\| limit N` | Return at most N results |
+| `math` | `\| math result = f1 + f2` | Arithmetic (+, -, *, /, %) |
+| `rename` | `\| rename src AS dst` | Rename a field |
+| `replace` | `\| replace (field, "old", "new")` | Substring replacement |
+| `replace_regexp` | `\| replace_regexp (field, "re", "repl")` | Regex replacement |
+| `sort` | `\| sort by (field) [asc\|desc]` | Sort results |
+| `stats` | `\| stats <func> as alias [by (fields)]` | Aggregate results |
+| `unpack_json` | `\| unpack_json` | Extract fields from JSON body |
+
+#### Stats Functions
+
+Used with the `| stats` pipe:
+
+| Function | Description |
+|----------|-------------|
+| `avg(field)` | Arithmetic mean |
+| `count()` | Total matching entries |
+| `count_empty(field)` | Entries where field is empty |
+| `count_uniq(field)` | Distinct values |
+| `max(field)` | Maximum value |
+| `median(field)` | Median (50th percentile) |
+| `min(field)` | Minimum value |
+| `quantile(p, field)` | p-th quantile (e.g., `quantile(0.95, duration)`) |
+| `sum(field)` | Sum of values |
+
+#### Example Log Queries
+
+```logsql
+# Count errors per Lambda function
+{faas.name=~".*"} AND "error" | stats count() as errors by (faas.name)
+
+# Top 10 slowest requests
+{service_name="my-service"} | sort by (duration) desc | limit 10
+
+# Extract and aggregate from JSON logs
+{service_name="api"} | unpack_json | stats avg(response_time) as avg_rt by (endpoint)
+```
+
+### Traces
+
+Trace queries use the same pipe syntax as logs: `{stream_selector} | pipe1 | pipe2`
+
+**Important notes:**
+- `query`, `env`, `service` are REQUIRED parameters
+- Duration is in **milliseconds** (not seconds like metrics)
+- `p95` is NOT a valid stats function — use `quantile(0.95, duration)`
+- Service names are case-sensitive (e.g., `"Kratos-Prod"` not `"kratos"`)
+
+#### Example Trace Queries
+
+```
+# P95 latency for a service
+{service="Kratos-Prod", span_kind="server"} | stats quantile(0.95, duration) as p95_ms
+
+# Error count by endpoint
+{service="Kratos-Prod", status_code="ERROR"} | stats count() as errors by (http_route)
+
+# Slowest spans
+{service="Kratos-Prod"} | sort by (duration) desc | limit 20
 ```
 
-## Example Queries
+## Example Natural Language Queries
 
 ### Logs
 ```
 "Show me logs from webhook-lambda-prod"
 "Find all logs containing 'timeout' in the last hour"
-"
+"Count errors per Lambda function in the last 24h"
 ```
 
 ### Metrics
@@ -246,7 +348,7 @@ Log labels vary by source. Use `*` query first to discover available labels:
 
 ### Traces
 ```
-"Find
+"Find the P95 latency for Kratos-Prod using trace stats"
 "Show me traces with errors in the production environment"
 "Get the full waterfall for trace ID abc123"
 ```
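The trace notes above stress that `p95` is not a stats function and that `quantile(0.95, duration)` must be used instead. As a rough illustration of what a nearest-rank quantile over span durations returns (a sketch for intuition only, not CubeAPM's implementation; the `quantile` helper below is hypothetical):

```javascript
// Illustration only: nearest-rank quantile over a list of span durations,
// approximating what "stats quantile(p, duration)" reports.
function quantile(p, values) {
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank index for the p-th quantile in the sorted array.
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Durations in milliseconds, as trace queries return them.
const durations = [12, 40, 95, 110, 240, 310, 480, 520, 900, 1500];
console.log(quantile(0.95, durations)); // the 95th-percentile duration
```

Note that for traces this value is in milliseconds, while metric latencies are in seconds.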
package/dist/index.js
CHANGED

@@ -41,11 +41,39 @@ LogsQL Query Syntax:
 - Negation: {faas.name!="unwanted-service"}
 - Combine filters: {env="UNSET", faas.name=~".*-lambda.*"}
 
+PIPE OPERATORS — chain after query with "|":
+| copy src_field AS dst_field — copy field value
+| drop field1, field2 — remove fields from output
+| extract_regexp "(?P<name>regex)" — extract fields via named capture groups
+| join by (field) (...subquery...) — join with subquery results
+| keep field1, field2 — keep only specified fields
+| limit N — return at most N results
+| math result = field1 + field2 — compute derived fields (+, -, *, /, %)
+| rename src AS dst — rename a field
+| replace (field, "old", "new") — replace substring in field
+| replace_regexp (field, "re", "replacement") — regex replacement
+| sort by (field) [asc|desc] — sort results
+| stats <func> as alias [by (group_fields)] — aggregate results
+| unpack_json — extract fields from JSON log body
+
+STATS FUNCTIONS (use with | stats pipe):
+avg(field) — arithmetic mean
+count() — total number of matching entries
+count_empty(field) — count entries where field is empty
+count_uniq(field) — count distinct values
+max(field) — maximum value
+median(field) — median (50th percentile)
+min(field) — minimum value
+quantile(p, field) — p-th quantile (e.g., quantile(0.95, duration))
+sum(field) — sum of values
+
 Example queries:
 - All logs (discover structure): *
 - Lambda logs: {faas.name="webhook-lambda-prod"}
 - Search errors: {faas.name=~".*"} AND "error"
-- By environment: {env="production"}
+- By environment: {env="production"}
+- Count errors per function: {faas.name=~".*"} AND "error" | stats count() as error_count by (faas.name)
+- Top 10 slowest: {service_name="my-service"} | sort by (duration) desc | limit 10`, {
 query: z.string().describe("The log search query including stream filters (LogsQL syntax)"),
 start: z.string().describe("Start time in RFC3339 format (e.g., 2024-01-01T00:00:00Z) or Unix timestamp in seconds"),
 end: z.string().describe("End time in RFC3339 format or Unix timestamp in seconds"),
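The tool description above documents `| stats count() as error_count by (faas.name)`. To make the grouping semantics concrete, here is a minimal sketch of what "count grouped by a field" produces over log entries (an illustration only; `statsCountBy` is a hypothetical helper, not part of the package):

```javascript
// Illustration only: group log entries by a field and count each group,
// mirroring what "| stats count() as errors by (faas.name)" reports.
function statsCountBy(entries, field) {
  const counts = {};
  for (const entry of entries) {
    const key = entry[field];
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

const logs = [
  { "faas.name": "webhook-lambda-prod", msg: "error: timeout" },
  { "faas.name": "webhook-lambda-prod", msg: "error: 502" },
  { "faas.name": "billing-lambda-prod", msg: "error: retry" },
];
console.log(statsCountBy(logs, "faas.name"));
```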
@@ -206,15 +234,45 @@ IMPORTANT - Required Parameters:
 - Use env="UNSET" if environment is not configured
 - Service names are case-sensitive (e.g., "Kratos-Prod" not "kratos")
 
-
+TRACE QUERY SYNTAX:
+The query parameter supports pipe syntax similar to LogsQL:
+{stream_selector} | pipe1 | pipe2
+
+Stream selectors filter spans:
+{service="Kratos-Prod"}
+{service="Kratos-Prod", span_kind="server"}
+{http_code=~"5.."}
+
+PIPE OPERATORS — same as LogsQL pipes:
+| copy, | drop, | extract_regexp, | join, | keep, | limit,
+| math, | rename, | replace, | replace_regexp, | sort,
+| stats <func> as alias [by (group_fields)], | unpack_json
+
+STATS FUNCTIONS (for aggregate trace analysis):
+avg(field) — arithmetic mean
+count() — total matching spans
+count_uniq(field) — distinct values
+max(field) — maximum value
+median(field) — median value
+min(field) — minimum value
+quantile(p, field) — p-th quantile (e.g., quantile(0.95, duration))
+sum(field) — sum of values
+
+IMPORTANT: Duration in traces is in MILLISECONDS.
+IMPORTANT: "p95" is NOT a valid function — use quantile(0.95, duration).
+
+Example queries:
+Wildcard: query="*"
+P95 latency: query='{service="Kratos-Prod", span_kind="server"} | stats quantile(0.95, duration) as p95_latency'
+Error count by endpoint: query='{service="Kratos-Prod", status_code="ERROR"} | stats count() as errors by (http_route)'
+Slow spans: query='{service="Kratos-Prod"} | sort by (duration) desc | limit 20'
+
+Optional filters (applied in addition to query):
 - spanKind: server, client, consumer, producer
 - sortBy: duration (to find slow traces)
 
 To discover available service names, first query metrics:
-count by (service) (cube_apm_calls_total{env="UNSET"})
-
-Example: Find slow server spans in Shopify-Prod
-query="*", env="UNSET", service="Shopify-Prod", spanKind="server", sortBy="duration"`, {
+count by (service) (cube_apm_calls_total{env="UNSET"})`, {
 query: z.string().default("*").describe("The traces search query (use * for wildcard)"),
 env: z.string().default("UNSET").describe("Environment name (use UNSET if not configured)"),
 service: z.string().describe("Service name to filter by (REQUIRED, case-sensitive)"),
@@ -224,13 +282,16 @@ Example: Find slow server spans in Shopify-Prod
 spanKind: z.string().optional().describe("Filter by span kind: server, client, consumer, producer"),
 sortBy: z.string().optional().describe("Sort results by: duration"),
 }, async ({ query, env, service, start, end, limit, spanKind, sortBy }) => {
+// Generate a random index value for the API call
+const index = Math.random().toString(36).substring(2, 15);
 const params = new URLSearchParams({
 query,
 env,
 service,
 start,
 end,
-limit: String(limit)
+limit: String(limit),
+index
 });
 if (spanKind)
 params.append("spanKind", spanKind);
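The hunk above adds a random `index` value to the query string sent by the traces tool. A self-contained sketch of that parameter construction (parameter names mirror the diff; `buildTraceParams` is a hypothetical helper, not an export of the package):

```javascript
// Sketch of the search_traces parameter construction added in this version:
// a random "index" string is generated per call and appended to the params.
function buildTraceParams({ query, env, service, start, end, limit }) {
  // Random base-36 token, as in the diff.
  const index = Math.random().toString(36).substring(2, 15);
  return new URLSearchParams({
    query,
    env,
    service,
    start,
    end,
    limit: String(limit),
    index,
  });
}

const params = buildTraceParams({
  query: "*",
  env: "UNSET",
  service: "Kratos-Prod",
  start: "2024-01-01T00:00:00Z",
  end: "2024-01-01T01:00:00Z",
  limit: 100,
});
console.log(params.toString());
```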
@@ -319,6 +380,254 @@ server.tool("ingest_metrics_prometheus", "Send metrics to CubeAPM in Prometheus
 ],
 };
 });
+// ============================================
+// PROMPTS - Pre-defined templates for common tasks
+// ============================================
+server.prompt("investigate-service", "Investigate issues with a specific service - checks errors, latency, and recent traces", {
+service: z.string().describe("The service name to investigate (case-sensitive)"),
+timeRange: z.string().optional().default("1h").describe("Time range to look back (e.g., 1h, 6h, 24h)"),
+}, async ({ service, timeRange }) => {
+return {
+messages: [
+{
+role: "user",
+content: {
+type: "text",
+text: `Please investigate the ${service} service for the last ${timeRange}:
+
+1. First, check the error rate using metrics:
+- Query: sum(increase(cube_apm_calls_total{service="${service}", status_code="ERROR"}[${timeRange}])) / sum(increase(cube_apm_calls_total{service="${service}"}[${timeRange}])) * 100
+
+2. Check P95 latency:
+- Query: histogram_quantiles("phi", 0.95, sum by (vmrange) (increase(cube_apm_latency_bucket{service="${service}", span_kind="server"}[5m])))
+
+3. Search for error traces:
+- Use search_traces with service="${service}", query="status_code=ERROR"
+
+4. Check recent logs if available:
+- Query logs with {service_name="${service}"} or discover labels first with *
+
+Summarize any issues found and provide recommendations.`,
+},
+},
+],
+};
+});
+server.prompt("check-latency", "Check P50, P95, and P99 latency percentiles for a service", {
+service: z.string().describe("The service name (case-sensitive)"),
+}, async ({ service }) => {
+return {
+messages: [
+{
+role: "user",
+content: {
+type: "text",
+text: `Check the latency percentiles for ${service} service:
+
+Use query_metrics_instant with these queries:
+
+1. P50 (median):
+histogram_quantiles("phi", 0.5, sum by (vmrange, service) (increase(cube_apm_latency_bucket{service="${service}", span_kind="server"}[5m])))
+
+2. P95:
+histogram_quantiles("phi", 0.95, sum by (vmrange, service) (increase(cube_apm_latency_bucket{service="${service}", span_kind="server"}[5m])))
+
+3. P99:
+histogram_quantiles("phi", 0.99, sum by (vmrange, service) (increase(cube_apm_latency_bucket{service="${service}", span_kind="server"}[5m])))
+
+Note: Latency values are in SECONDS (multiply by 1000 for milliseconds).
+
+Present the results in a clear table format.`,
+},
+},
+],
+};
+});
+server.prompt("find-slow-traces", "Find the slowest traces for a service to identify performance bottlenecks", {
+service: z.string().describe("The service name (case-sensitive)"),
+minDuration: z.string().optional().default("1s").describe("Minimum duration to filter (e.g., 500ms, 1s, 2s)"),
+}, async ({ service, minDuration }) => {
+return {
+messages: [
+{
+role: "user",
+content: {
+type: "text",
+text: `Find slow traces for ${service} service (minimum duration: ${minDuration}):
+
+1. Search for traces using search_traces with:
+- service: "${service}"
+- sortBy: "duration"
+- spanKind: "server"
+- limit: 10
+
+2. For the slowest trace found, use get_trace to fetch the full trace details
+
+3. Analyze the trace waterfall to identify:
+- Which span took the longest
+- Any external dependencies causing delays
+- Database queries or API calls that are slow
+
+Provide a summary of performance bottlenecks and recommendations.`,
+},
+},
+],
+};
+});
+// ============================================
+// RESOURCES - Expose CubeAPM data as readable resources
+// ============================================
+server.resource("config", "cubeapm://config", {
+description: "Current CubeAPM connection configuration",
+mimeType: "application/json",
+}, async () => {
+return {
+contents: [
+{
+uri: "cubeapm://config",
+mimeType: "application/json",
+text: JSON.stringify({
+queryUrl: queryBaseUrl,
+ingestUrl: ingestBaseUrl,
+usingFullUrl: !!CUBEAPM_URL,
+host: CUBEAPM_HOST,
+queryPort: CUBEAPM_QUERY_PORT,
+ingestPort: CUBEAPM_INGEST_PORT,
+}, null, 2),
+},
+],
+};
+});
+server.resource("query-patterns", "cubeapm://query-patterns", {
+description: "CubeAPM-specific query patterns and naming conventions",
+mimeType: "text/markdown",
+}, async () => {
+return {
+contents: [
+{
+uri: "cubeapm://query-patterns",
+mimeType: "text/markdown",
+text: `# CubeAPM Query Patterns — Full Reference
+
+## Metrics (PromQL / MetricsQL)
+
+### Naming Conventions
+- Prefix: \`cube_apm_*\` (e.g., cube_apm_calls_total, cube_apm_latency_bucket)
+- Service label: \`service\` (NOT "server" or "service_name")
+- Common labels: env, service, span_kind, status_code, http_code
+
+### Histogram Queries (Percentiles)
+CubeAPM uses VictoriaMetrics-style histograms with \`vmrange\` label (NOT Prometheus \`le\` buckets):
+
+\`\`\`promql
+# ✅ Correct — use histogram_quantiles() with vmrange
+histogram_quantiles("phi", 0.95, sum by (vmrange, service) (
+increase(cube_apm_latency_bucket{service="MyService", span_kind="server"}[5m])
+))
+
+# ❌ Wrong — standard Prometheus syntax won't work
+histogram_quantile(0.95, sum by (le) (rate(http_request_duration_bucket[5m])))
+\`\`\`
+
+Latency values are in SECONDS (0.05 = 50ms).
+
+---
+
+## Logs (LogsQL)
+
+### Stream Selectors
+- \`*\` — wildcard, discover all labels
+- \`{faas.name="my-function"}\` — exact match
+- \`{faas.name=~".*-prod"}\` — regex match
+- \`{faas.name!="unwanted"}\` — negation
+- \`{env="production", faas.name=~".*"}\` — combine filters
+
+### Text Filters
+- \`{stream} AND "error"\` — text search
+- \`{stream} AND "timeout" AND NOT "retry"\` — boolean operators
+
+### Pipe Operators
+Chain after query with \`|\`:
+
+| Pipe | Syntax | Description |
+|------|--------|-------------|
+| copy | \`copy src AS dst\` | Copy field value |
+| drop | \`drop field1, field2\` | Remove fields from output |
+| extract_regexp | \`extract_regexp "(?P<name>re)"\` | Extract via named capture groups |
+| join | \`join by (field) (...subquery...)\` | Join with subquery |
+| keep | \`keep field1, field2\` | Keep only specified fields |
+| limit | \`limit N\` | Return at most N results |
+| math | \`math result = f1 + f2\` | Arithmetic (+, -, *, /, %) |
+| rename | \`rename src AS dst\` | Rename a field |
+| replace | \`replace (field, "old", "new")\` | Substring replacement |
+| replace_regexp | \`replace_regexp (field, "re", "repl")\` | Regex replacement |
+| sort | \`sort by (field) [asc/desc]\` | Sort results |
+| stats | \`stats <func> as alias [by (fields)]\` | Aggregate results |
+| unpack_json | \`unpack_json\` | Extract fields from JSON body |
+
+### Stats Functions
+Used with \`| stats\` pipe:
+
+| Function | Description |
+|----------|-------------|
+| \`avg(field)\` | Arithmetic mean |
+| \`count()\` | Total matching entries |
+| \`count_empty(field)\` | Entries where field is empty |
+| \`count_uniq(field)\` | Distinct values |
+| \`max(field)\` | Maximum value |
+| \`median(field)\` | Median (50th percentile) |
+| \`min(field)\` | Minimum value |
+| \`quantile(p, field)\` | p-th quantile (e.g., quantile(0.95, duration)) |
+| \`sum(field)\` | Sum of values |
+
+### Example Log Queries
+\`\`\`logsql
+# Discover labels
+*
+
+# Count errors per Lambda function
+{faas.name=~".*"} AND "error" | stats count() as errors by (faas.name)
+
+# Top 10 slowest requests
+{service_name="my-service"} | sort by (duration) desc | limit 10
+\`\`\`
+
+---
+
+## Traces
+
+### Query Syntax
+Trace queries use the same pipe syntax:
+\`{stream_selector} | pipe1 | pipe2\`
+
+### Stream Selectors
+- \`{service="Kratos-Prod"}\`
+- \`{service="Kratos-Prod", span_kind="server"}\`
+- \`{http_code=~"5.."}\`
+
+### Important Notes
+- query, env, service are REQUIRED parameters
+- Duration is in MILLISECONDS (not seconds like metrics)
+- \`p95\` is NOT a valid function — use \`quantile(0.95, duration)\`
+- Use \`sortBy=duration\` param to find slow traces
+- Filter by spanKind: server, client, consumer, producer
+
+### Example Trace Queries
+\`\`\`
+# P95 latency for a service
+{service="Kratos-Prod", span_kind="server"} | stats quantile(0.95, duration) as p95_ms
+
+# Error count by endpoint
+{service="Kratos-Prod", status_code="ERROR"} | stats count() as errors by (http_route)
+
+# Slowest spans
+{service="Kratos-Prod"} | sort by (duration) desc | limit 20
+\`\`\`
+`,
+},
+],
+};
+});
 // Start the server
 async function main() {
 const transport = new StdioServerTransport();
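The `investigate-service` prompt added above asks for an error rate computed as errors divided by total calls, times 100. The arithmetic behind that PromQL expression, sketched in plain JavaScript (`errorRatePercent` is a hypothetical helper used only for illustration):

```javascript
// Illustration of the error-rate formula from the investigate-service prompt:
// sum(increase(calls{status_code="ERROR"})) / sum(increase(calls)) * 100.
function errorRatePercent(errorCalls, totalCalls) {
  if (totalCalls === 0) return 0; // no traffic means no measurable error rate
  return (errorCalls / totalCalls) * 100;
}

console.log(errorRatePercent(12, 400)); // 3
```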
package/package.json
CHANGED