@fluentcommerce/ai-skills 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. package/LICENSE +21 -0
  2. package/README.md +622 -0
  3. package/bin/cli.mjs +1973 -0
  4. package/content/cli/agents/fluent-cli/agent.json +149 -0
  5. package/content/cli/agents/fluent-cli.md +132 -0
  6. package/content/cli/skills/fluent-bootstrap/SKILL.md +181 -0
  7. package/content/cli/skills/fluent-cli-index/SKILL.md +63 -0
  8. package/content/cli/skills/fluent-cli-mcp-cicd/SKILL.md +77 -0
  9. package/content/cli/skills/fluent-cli-reference/SKILL.md +1031 -0
  10. package/content/cli/skills/fluent-cli-retailer/SKILL.md +85 -0
  11. package/content/cli/skills/fluent-cli-settings/SKILL.md +106 -0
  12. package/content/cli/skills/fluent-connect/SKILL.md +886 -0
  13. package/content/cli/skills/fluent-module-deploy/SKILL.md +349 -0
  14. package/content/cli/skills/fluent-profile/SKILL.md +180 -0
  15. package/content/cli/skills/fluent-workflow/SKILL.md +310 -0
  16. package/content/dev/agents/fluent-dev/agent.json +88 -0
  17. package/content/dev/agents/fluent-dev.md +525 -0
  18. package/content/dev/reference-modules/catalog.json +4754 -0
  19. package/content/dev/skills/fluent-build/SKILL.md +192 -0
  20. package/content/dev/skills/fluent-connection-analysis/SKILL.md +386 -0
  21. package/content/dev/skills/fluent-custom-code/SKILL.md +895 -0
  22. package/content/dev/skills/fluent-data-module-scaffold/SKILL.md +714 -0
  23. package/content/dev/skills/fluent-e2e-test/SKILL.md +394 -0
  24. package/content/dev/skills/fluent-event-api/SKILL.md +945 -0
  25. package/content/dev/skills/fluent-feature-explain/SKILL.md +603 -0
  26. package/content/dev/skills/fluent-feature-plan/PLAN_TEMPLATE.md +695 -0
  27. package/content/dev/skills/fluent-feature-plan/SKILL.md +227 -0
  28. package/content/dev/skills/fluent-job-batch/SKILL.md +138 -0
  29. package/content/dev/skills/fluent-mermaid-validate/SKILL.md +86 -0
  30. package/content/dev/skills/fluent-module-scaffold/SKILL.md +1928 -0
  31. package/content/dev/skills/fluent-module-validate/SKILL.md +775 -0
  32. package/content/dev/skills/fluent-pre-deploy-check/SKILL.md +1108 -0
  33. package/content/dev/skills/fluent-retailer-config/SKILL.md +1111 -0
  34. package/content/dev/skills/fluent-rule-scaffold/SKILL.md +385 -0
  35. package/content/dev/skills/fluent-scope-decompose/SKILL.md +1021 -0
  36. package/content/dev/skills/fluent-session-audit-export/SKILL.md +632 -0
  37. package/content/dev/skills/fluent-session-summary/SKILL.md +195 -0
  38. package/content/dev/skills/fluent-settings/SKILL.md +1058 -0
  39. package/content/dev/skills/fluent-source-onboard/SKILL.md +632 -0
  40. package/content/dev/skills/fluent-system-monitoring/SKILL.md +767 -0
  41. package/content/dev/skills/fluent-test-data/SKILL.md +513 -0
  42. package/content/dev/skills/fluent-trace/SKILL.md +1143 -0
  43. package/content/dev/skills/fluent-transition-api/SKILL.md +346 -0
  44. package/content/dev/skills/fluent-version-manage/SKILL.md +744 -0
  45. package/content/dev/skills/fluent-workflow-analyzer/SKILL.md +959 -0
  46. package/content/dev/skills/fluent-workflow-builder/SKILL.md +319 -0
  47. package/content/dev/skills/fluent-workflow-deploy/SKILL.md +267 -0
  48. package/content/mcp-extn/agents/fluent-mcp.md +69 -0
  49. package/content/mcp-extn/skills/fluent-mcp-tools/SKILL.md +461 -0
  50. package/content/mcp-official/agents/fluent-mcp-core.md +91 -0
  51. package/content/mcp-official/skills/fluent-mcp-core/SKILL.md +94 -0
  52. package/content/rfl/agents/fluent-rfl.md +56 -0
  53. package/content/rfl/skills/fluent-rfl-assess/SKILL.md +172 -0
  54. package/docs/CAPABILITY_MAP.md +77 -0
  55. package/docs/CLI_COVERAGE.md +47 -0
  56. package/docs/DEV_WORKFLOW.md +802 -0
  57. package/docs/FLOW_RUN.md +142 -0
  58. package/docs/USE_CASES.md +404 -0
  59. package/metadata.json +156 -0
  60. package/package.json +51 -0
@@ -0,0 +1,767 @@
---
name: fluent-system-monitoring
description: Monitor Fluent Commerce event processing health. Query Prometheus metrics, interpret event volume analytics, detect failure rate anomalies, triage alerts, and identify workflow gaps. Triggers on "system monitoring", "event analytics", "failure rate", "event volume", "top events", "alert triage", "monitoring dashboard", "prometheus", "metrics", "latency", "throughput".
user-invocable: true
allowed-tools: Bash, Read, Write, Edit, Glob, Grep
argument-hint: [--from <ISO-datetime>] [--entity-type ORDER|FULFILMENT] [--top-n 20]
---

# System Monitoring

Monitor Fluent Commerce event processing health using Prometheus metrics and event analytics. Detect anomalies, triage alerts, and identify workflow gaps.

## Ownership Boundary

This skill owns aggregate event analytics interpretation, anomaly detection, and alert triage workflows.

- **Per-entity event queries** → `/fluent-event-api`
- **Root cause diagnosis** → `/fluent-trace`
- **MCP tool syntax** → `/fluent-mcp-tools`
- **Workflow structure** → `/fluent-workflow-analyzer`

## Platform Metrics Architecture

Fluent Commerce captures metrics at four instrumentation points in the event processing pipeline:

```
Customer Request
        │
        ▼
┌──────────────┐  core_event_received_total (Counter)
│  Fluent API  │  core_event_last_received_seconds (Gauge)
│    (Core)    │  ← events RECEIVED by the platform
└──────┬───────┘
       │ enqueue
       ▼
┌──────────────┐  rubix_event_received_total (Counter)
│   Internal   │  rubix_event_inflight_latency_seconds (Histogram)
│    Queue     │  ← time spent WAITING in queue
└──────┬───────┘
       │ dequeue
       ▼
┌──────────────┐  rubix_event_runtime_seconds (Histogram)
│    Rubix     │  ← time spent PROCESSING (includes status label)
│ (Orchestr.)  │
└──────────────┘

Batch path (inventory):
┌──────────────┐  bpp_records_processed_total (Counter)
│  Batch Pre-  │  bpp_records_changed_total (Counter)
│  Processing  │  bpp_records_unchanged_total (Counter)
│   (Dedup)    │  bpp_last_run_timestamp_seconds (Gauge)
└──────────────┘

Feed export path:
┌──────────────┐  feed_sent_total (Counter)
│  Inventory   │  feed_last_run_timestamp_seconds (Gauge)
│    Feeds     │
└──────────────┘
```

### Key Distinction: Received vs Processed

This is the most common source of confusion. Understand it before writing any query:

| Metric | Layer | What it counts | Has `status` label? |
|--------|-------|----------------|---------------------|
| `core_event_received_total` | API | Events the platform **received** | NO |
| `rubix_event_received_total` | Rubix | Events the orchestration engine **received** from queue | NO |
| `rubix_event_runtime_seconds_count` | Rubix | Events the orchestration engine **processed** | YES (`COMPLETE`, `FAILED`, `NO_MATCH`) |

- To count events received → use `core_event_received_total`
- To count events processed with success/failure breakdown → use `rubix_event_runtime_seconds_count`
- To measure processing latency → use the `rubix_event_runtime_seconds` histogram
- To measure queue wait time → use the `rubix_event_inflight_latency_seconds` histogram

**Never aggregate `core_event_received_total` by `status`** — the label does not exist on that metric and will produce null/misleading results.

+
78
+ ### Metric/Status Naming Drift (Doc Reconciliation)
79
+
80
+ Some product/feature docs use alternate metric families and status labels. Normalize before analysis:
81
+
82
+ | Semantic intent | Canonical metrics/status in this skill | Alternate names seen in some docs |
83
+ |---|---|---|
84
+ | Total events processed | `rubix_event_runtime_seconds_count` | `fluent_events_total` |
85
+ | Failed terminal state | `FAILED` | `ERROR` |
86
+ | In-flight / not terminal | `PENDING` / `SCHEDULED` | `PROCESSING` |
87
+
88
+ Guidelines:
89
+ - Prefer `rubix_*` and `core_*` metrics for Fluent platform observability unless tenant evidence proves otherwise.
90
+ - If a provided query uses `fluent_events_*`, translate it to the canonical metric family first.
91
+ - Do not filter Rubix metrics by `status="ERROR"` or `status="PROCESSING"`; use canonical status labels.
92
+
93
+ ### What Is Captured
94
+
95
+ Metrics are captured for events entering via:
96
+ - `POST /api/v4.1/event/async`
97
+ - `POST /api/v4.1/event/sync`
98
+ - `POST /api/v4.1/job/{jobId}/batch`
99
+ - Action events (sent directly to the Event API) — captured against the event name, entity type, and other labels
100
+
101
+ **Not captured:**
102
+ - Inline events triggered within a workflow for the currently orchestrated entity (internal rule-to-rule calls within the same orchestration context)
103
+
104
+ ## Complete Metrics Reference
105
+
106
+ ### Fluent API Metrics
107
+
108
+ | Metric | Type | Labels | Description |
109
+ |--------|------|--------|-------------|
110
+ | `core_event_received_total` | Counter | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Number of events received by the platform |
111
+ | `core_event_last_received_seconds` | Gauge | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Timestamp of last event received (epoch seconds) |
112
+
113
+ ### Orchestration Engine (Rubix) Metrics
114
+
115
+ | Metric | Type | Labels | Description |
116
+ |--------|------|--------|-------------|
117
+ | `rubix_event_received_total` | Counter | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Events received by Rubix from queue or direct HTTP API |
118
+ | `rubix_event_inflight_latency_seconds` | Histogram | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Queue wait time before Rubix processes |
119
+ | `rubix_event_inflight_latency_seconds_sum` | Counter | (same as above) | Cumulative queue wait time |
120
+ | `rubix_event_inflight_latency_seconds_count` | Counter | (same as above) | Count of events observed in queue |
121
+ | `rubix_event_inflight_latency_seconds_bucket` | Counter | (same as above) + `le` | Bucketed queue latency distribution |
122
+ | `rubix_event_runtime_seconds` | Histogram | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source`, **`status`** | Rubix processing time per event |
123
+ | `rubix_event_runtime_seconds_sum` | Counter | (same as above) | Cumulative processing time |
124
+ | `rubix_event_runtime_seconds_count` | Counter | (same as above) | Count of events processed |
125
+ | `rubix_event_runtime_seconds_bucket` | Counter | (same as above) + `le` | Bucketed processing time distribution |
126
+
127
+ ### Batch Pre-Processing Metrics (Inventory Deduplication)
128
+
129
+ | Metric | Type | Labels | Description |
130
+ |--------|------|--------|-------------|
131
+ | `bpp_records_processed_total` | Counter | `account_id`, `run_id`, `stage`, `first_batch_received`, `deduplication_finished` | Total batch items processed by dedup job |
132
+ | `bpp_records_unchanged_total` | Counter | `account_id`, `run_id`, `stage` | Items filtered out (no change detected) |
133
+ | `bpp_records_changed_total` | Counter | `account_id`, `run_id`, `stage` | Items that changed and were sent to Rubix |
134
+ | `bpp_last_run_timestamp_seconds` | Gauge | `account_id`, `run_id`, `stage`, `status` | Completion timestamp. Status: `SUCCESS`, `ERROR` |
135
+
136
+ ### Inventory Feed Metrics (Data Loading)
137
+
138
+ | Metric | Type | Labels | Description |
139
+ |--------|------|--------|-------------|
140
+ | `feed_sent_total` | Counter | `account_id`, `feed_ref`, `run_id`, `data_type` | Records exported. data_type: `INVENTORY_POSITION`, `INVENTORY_CATALOGUE`, `VIRTUAL_POSITION`, `VIRTUAL_CATALOGUE` |
141
+ | `feed_last_run_timestamp_seconds` | Gauge | `account_id`, `feed_ref`, `run_id`, `status` | Completion timestamp. Status: `SUCCESS`, `ERROR`, `NO_RECORDS` |
142
+
143
### Histogram Bucket Boundaries

Default histogram bucket boundaries (in seconds): `0.005`, `0.01`, `0.025`, `0.05`, `0.075`, `0.1`, `0.25`, `0.5`, `0.75`, `1.0`, `2.5`, `5.0`, `7.5`, `10.0`, `+Inf`.

The `le` label marks the upper bound of each bucket. A bucket with `le="0.5"` counts all events that took ≤500ms.

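For intuition, here is a minimal Python sketch (not platform code) of the linear interpolation that `histogram_quantile` performs over cumulative `le` buckets. The bucket counts below are hypothetical:

```python
# Sketch of histogram_quantile-style interpolation over cumulative ("le")
# buckets. Bucket counts are hypothetical, not real platform data.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total                       # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):      # +Inf bucket: fall back to last finite bound
                return prev_bound
            # linear interpolation within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical distribution: 80 events ≤0.1s, 95 ≤0.5s, 100 total.
buckets = [(0.1, 80), (0.5, 95), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))    # → 0.0625 (median falls in the first bucket)
print(histogram_quantile(0.99, buckets))   # → 0.9
```

Prometheus applies the same interpolation server-side; the sketch only illustrates why coarse upper buckets limit quantile precision.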
### Labels Reference

| Label | Appears on | Values |
|-------|-----------|--------|
| `account_id` | All metrics | Fluent account identifier |
| `retailer_id` | API + Rubix metrics | Retailer within the account |
| `event_name` | API + Rubix metrics | Event name (e.g., `CREATE`, `UPSERT_PRODUCT`) |
| `entity_type` | API + Rubix metrics | Entity type (e.g., `ORDER`, `FULFILMENT`, `INVENTORY_POSITION`) |
| `source` | API + Rubix metrics | Origin: `event`, `event-sync`, `batch`, `internal`, or custom |
| `status` | **Only** `rubix_event_runtime_seconds*` | `COMPLETE`, `FAILED`, `NO_MATCH` |
| `le` | Histogram `_bucket` metrics only | Upper bound of histogram bucket (seconds) |
| `run_id` | BPP + Feed metrics | Execution run identifier |
| `stage` | BPP metrics | Processing stage |
| `feed_ref` | Feed metrics | Feed reference identifier |
| `data_type` | `feed_sent_total` | Type of exported records |
| `first_batch_received` | `bpp_records_processed_total` | Timestamp of first batch in run |
| `deduplication_finished` | `bpp_records_processed_total` | Timestamp of dedup completion |

### Source Label Values

| Source Value | Meaning |
|-------------|---------|
| `event` | Async endpoint (`/api/v4.1/event/async`) with no explicit source |
| `event-sync` | Sync endpoint (`/api/v4.1/event/sync`) with no explicit source |
| `batch` | Batch endpoint (`/api/v4.1/job/{jobId}/batch`) |
| `internal` | Scheduled events, cross-domain events (platform-originated) |
| Custom value | Events with explicit `source` field set in payload (e.g., `POS`, `ERP`) |

## Data Access

### Query Path

Fluent metrics are accessed exclusively through the **GraphQL API**, which acts as an auth + tenant-isolation proxy over Prometheus. The platform does not expose raw Prometheus REST endpoints (`/api/v1/query` returns 400).

**MCP Extension Tools (recommended for AI analysis)**

The `metrics.query` tool routes PromQL through the GraphQL `metricInstant` / `metricRange` queries automatically:

```json
{
  "query": "sum by (event_name) (increase(rubix_event_runtime_seconds_count[1h]))",
  "type": "instant"
}
```

**GraphQL queries (for OMX dashboards, programmatic access, and external tools)**

The same GraphQL queries are available directly:
- `metricInstant(query, time?)` — point-in-time query
- `metricRange(query, start, end, step)` — time-series query

**Required headers:**
- `Authorization: Bearer <token>` — user/service token with `METRICS_VIEW` permission
- `fluent.account: <ACCOUNT_ID>` — identifies which Prometheus workspace to query
- `Content-Type: application/json`

**metricInstant — point-in-time snapshot:**
```graphql
query ($PromQL: String!, $time: DateTime) {
  metricInstant(query: $PromQL, time: $time) {
    status
    error
    errorType
    warnings
    data {
      resultType
      result {
        metric
        value
      }
    }
  }
}
```

Variables:
```json
{
  "PromQL": "sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count[1h])) > 0",
  "time": "2026-02-22T12:00:00Z"
}
```

**metricRange — time-series over a window:**
```graphql
query ($PromQLRange: String!, $start: DateTime, $end: DateTime, $step: String) {
  metricRange(query: $PromQLRange, start: $start, end: $end, step: $step) {
    status
    error
    errorType
    warnings
    data {
      resultType
      result {
        metric
        values
      }
    }
  }
}
```

Variables:
```json
{
  "PromQLRange": "(sum by (source, retailer_id, event_name, entity_type) ((last_over_time(core_event_received_total{}[30m]) - core_event_received_total{} offset 30m) or last_over_time(core_event_received_total{}[30m]))) > 0",
  "start": "2026-02-22T00:00:00Z",
  "end": "2026-02-22T23:59:00Z",
  "step": "30m"
}
```

**Response shape:** Both queries return the same structure — `status` ("success"/"error"), optional `error`/`errorType`/`warnings`, and `data` containing `resultType` ("vector" for instant, "matrix" for range) and `result[]`, where each entry has a `metric` label map and either a single `value` (instant) or a `values` array (range: `[timestamp, value]` pairs at each step).

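As a sketch of programmatic access, the following hypothetical Python client builds the `metricInstant` request described above using only the standard library. The endpoint URL, token, and account ID are placeholders, not real values:

```python
import json
import urllib.request

# Placeholder endpoint and credentials -- substitute your environment's
# GraphQL URL and a token that carries the METRICS_VIEW permission.
GRAPHQL_URL = "https://example.invalid/graphql"
TOKEN = "<bearer-token>"
ACCOUNT_ID = "<ACCOUNT_ID>"

METRIC_INSTANT = """
query ($PromQL: String!, $time: DateTime) {
  metricInstant(query: $PromQL, time: $time) {
    status error errorType warnings
    data { resultType result { metric value } }
  }
}
"""

def metric_instant(promql, time=None):
    """POST a metricInstant query with the documented headers."""
    payload = {"query": METRIC_INSTANT, "variables": {"PromQL": promql, "time": time}}
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "fluent.account": ACCOUNT_ID,   # selects the Prometheus workspace
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["metricInstant"]

# Example call (not executed here): failure count for the last hour.
# metric_instant('sum(increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h]))')
```

The same request body works from any HTTP client; only the three headers and the GraphQL document change between `metricInstant` and `metricRange`.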
**Query limits and defaults:**
- Maximum query time range for `metricRange`: **32 days** between `start` and `end`. Queries exceeding this will fail.
- `metricRange` defaults when parameters are omitted: `start` = 30 minutes before current server time, `end` = current server time, `step` = 1 minute
- `metricInstant` default: `time` = current server time

**Access requirements:**
- User/token must have `METRICS_VIEW` permission
- Metrics are isolated per account per environment (workspace-level segregation)
- GraphQL API identifies the account from the user token + `fluent.account` header
- Retention: 150 days from ingestion
- Custom metrics creation is not available — only the platform-instrumented metrics listed above are captured

**MCP tools vs direct GraphQL:** Both use the same GraphQL `metricInstant` / `metricRange` queries underneath. The MCP `metrics.query` tool abstracts the GraphQL construction — just pass PromQL + type. Use MCP tools for AI-driven analysis; use direct GraphQL for OMX dashboards, external monitoring tools (Elasticsearch, Grafana), or programmatic access from scripts.

### If Prometheus Is Unavailable

Once the `metrics.*` MCP tools are available, `metrics.topEvents` and `metrics.healthCheck` will fall back to the Event API for aggregation automatically. Until then, use `event.list` with appropriate filters to aggregate event data manually. Event API queries have a practical window of roughly 30 days.

## Tools

> **Note:** The `metrics.healthCheck`, `metrics.query`, and `metrics.topEvents` MCP tools described below are planned but not yet implemented in the current MCP extension server. Currently, use the Fluent Commerce admin console for metrics dashboards, or query the GraphQL `metricInstant`/`metricRange` endpoints directly via the `graphql.query` MCP tool with the PromQL patterns from the cookbook section below.

### metrics.healthCheck (Primary — Single-Call Health Assessment)

Runs all anomaly checks locally in the MCP server. One call replaces the multi-step Quick Health Check workflow.

```json
{
  "window": "1h",
  "includeTopEvents": true,
  "topN": 10
}
```

**Response shape:**
- `healthy` — boolean, true if no findings
- `source` — "prometheus" or "event_api" (automatic fallback)
- `summary` — window, totalEvents, failureRate, pendingRate, statusBreakdown
- `findings[]` — severity + type + message for each detected anomaly
- `topEvents[]` — ranked event breakdown (if includeTopEvents)
- `recommendations[]` — actionable next steps based on findings

**Custom thresholds:**
```json
{
  "window": "6h",
  "thresholds": { "failureRate": 2, "pendingRate": 5, "dominanceRate": 30 }
}
```

### metrics.query (Prometheus PromQL)

Query Prometheus metrics via GraphQL `metricInstant`/`metricRange` with PromQL. Supports instant and range queries.

**Instant query — aggregated failure summary:**
```json
{
  "query": "sum by (event_name, entity_type, status) (increase(rubix_event_runtime_seconds_count[1h]))",
  "type": "instant"
}
```

**Range query — failure rate over time:**
```json
{
  "query": "rate(rubix_event_runtime_seconds_count{status=\"FAILED\"}[5m])",
  "type": "range",
  "start": "2026-02-22T00:00:00Z",
  "end": "2026-02-22T06:00:00Z",
  "step": "1m"
}
```

### metrics.topEvents (Convenience Aggregation)

Pre-built ranked summary from the Event API. Useful when Prometheus is unavailable or when you need per-event-name breakdowns with failure rates.

```json
{
  "from": "2026-02-22T00:00:00Z",
  "to": "2026-02-22T12:00:00Z",
  "topN": 20,
  "eventType": "ORCHESTRATION"
}
```

**Filter to specific event status** (e.g., top failing events only):
```json
{
  "from": "2026-02-22T00:00:00Z",
  "eventStatus": "FAILED",
  "topN": 10
}
```

Supported `eventStatus` values: `COMPLETE`, `FAILED`, `NO_MATCH`, `SUCCESS`, `PENDING`.

**Response includes:**
- `totalEvents` — total events in the time window
- `failureRate` — percentage of FAILED events
- `statusBreakdown` — counts by status (SUCCESS, FAILED, COMPLETE, NO_MATCH, etc.)
- `topEvents[]` — ranked list: name + entityType + status + count + percentage
- `uniqueEventNames` / `uniqueEntityTypes` — cardinality metrics

### When to Use Which

| Scenario | Tool | Why |
|----------|------|-----|
| One-call health assessment | `metrics.healthCheck` | Runs all checks locally, minimal tokens |
| Quick failure rate check | `metrics.query` (instant) | Single PromQL, no pagination |
| Failure rate trend over time | `metrics.query` (range) | Time-series data with step intervals |
| Top-N event ranking with breakdowns | `metrics.topEvents` | Pre-aggregated, structured response |
| Custom PromQL (p99 latency, throughput) | `metrics.query` | Full PromQL flexibility |
| Top failing events only | `metrics.topEvents` | Use `eventStatus: "FAILED"` for server-side filtering |
| Prometheus unavailable | `metrics.topEvents` | Falls back to Event API aggregation |
| BPP / Feed job monitoring | `metrics.query` | Only path — Event API doesn't cover batch jobs |

## PromQL Cookbook

Recipes for common monitoring questions. All use `metrics.query` with `type: "instant"` unless noted.

### Event Volume

**Total events received in the last hour:**
```promql
sum(increase(core_event_received_total[1h]))
```

**Events received by entity type in the last hour:**
```promql
sum by (entity_type) (increase(core_event_received_total[1h]))
```

**Events received in the last 24h (period delta with offset fallback):**
```promql
sum by (source, retailer_id, event_name, entity_type) (
  (last_over_time(core_event_received_total[1d]) - core_event_received_total offset 1d)
  or
  last_over_time(core_event_received_total[1d])
)
```
The `or` clause handles cases where no data point exists at the offset time (counter reset, new series, or sparse scrapes). Without it, the subtraction returns null for those series.

**Events received per retailer per hour (range query for trending):**
```promql
sum by (retailer_id) (increase(core_event_received_total[1h]))
```
Use `type: "range"` with `step: "1h"` for an hourly breakdown.

### Failure Analysis

**Overall failure rate (percentage):**
```promql
sum(increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h]))
/
sum(increase(rubix_event_runtime_seconds_count[1h]))
* 100
```

**Failed events by event name:**
```promql
sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h])) > 0
```

**NO_MATCH events (workflow configuration gaps):**
```promql
sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count{status="NO_MATCH"}[1h])) > 0
```

**Failure rate trend over time (range query):**
```promql
sum(rate(rubix_event_runtime_seconds_count{status="FAILED"}[5m]))
/
sum(rate(rubix_event_runtime_seconds_count[5m]))
```
Use `type: "range"`, `step: "5m"` for 5-minute granularity.

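When a single PromQL expression is inconvenient, the same failure rate can be derived client-side from two instant vectors. A minimal Python sketch using the documented response shape (each result entry carries a label map and a `[timestamp, value]` pair); the counts are hypothetical:

```python
# Client-side failure-rate computation from two hypothetical metricInstant
# result vectors, shaped as the Response shape section describes.
failed = [{"metric": {}, "value": [1760000000, "12"]}]
total  = [{"metric": {}, "value": [1760000000, "480"]}]

def vector_sum(result):
    # Prometheus returns sample values as strings; sum them as floats.
    return sum(float(entry["value"][1]) for entry in result)

failure_rate = vector_sum(failed) / vector_sum(total) * 100
print(f"{failure_rate:.2f}%")  # 12 / 480 → 2.50%
```

This mirrors the `FAILED / total * 100` division above; doing it client-side is handy when you already fetched both vectors for other checks.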
### Latency Analysis

**Average processing time per event:**
```promql
sum by (event_name, entity_type) (rate(rubix_event_runtime_seconds_sum[1h]))
/
sum by (event_name, entity_type) (rate(rubix_event_runtime_seconds_count[1h]))
```
This divides the per-second rate of total latency by the per-second rate of event counts, yielding average seconds per event.

**WRONG — do not use `avg_over_time` on `_count`:**
```promql
# INCORRECT: avg_over_time on _count gives the average of a cumulative counter, NOT average latency
avg_over_time(rubix_event_inflight_latency_seconds_count[1d])
```

**P99 processing latency:**
```promql
histogram_quantile(0.99,
  sum by (le, event_name) (rate(rubix_event_runtime_seconds_bucket[5m]))
)
```

**P50 (median) processing latency:**
```promql
histogram_quantile(0.50,
  sum by (le) (rate(rubix_event_runtime_seconds_bucket[5m]))
)
```

**Average queue wait time (inflight latency):**
```promql
sum by (event_name, entity_type) (rate(rubix_event_inflight_latency_seconds_sum[1h]))
/
sum by (event_name, entity_type) (rate(rubix_event_inflight_latency_seconds_count[1h]))
```

**P99 queue wait time:**
```promql
histogram_quantile(0.99,
  sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[5m]))
)
```

### Throughput

**Events processed per second (instant rate):**
```promql
sum(rate(rubix_event_runtime_seconds_count[5m]))
```

**Events processed per second by event name (range):**
```promql
sum by (event_name) (rate(rubix_event_runtime_seconds_count[5m]))
```

### Batch Pre-Processing (Inventory Deduplication)

**Total records processed in the latest run:**
```promql
max by (run_id) (bpp_records_processed_total)
```

**Dedup efficiency (changed vs processed):**
```promql
max by (run_id) (bpp_records_changed_total)
/
max by (run_id) (bpp_records_processed_total)
* 100
```
A low percentage means most records are unchanged — good for dedup efficiency, but 0% could indicate stale data.

**Last BPP run status:**
```promql
bpp_last_run_timestamp_seconds
```
Filter by `status="SUCCESS"` or `status="ERROR"` to check job health.

### Inventory Feeds (Data Loading)

**Total records exported by feed:**
```promql
sum by (feed_ref, data_type) (increase(feed_sent_total[24h]))
```

**Last feed run timestamp:**
```promql
feed_last_run_timestamp_seconds
```
Filter by `status="NO_RECORDS"` to find feeds that ran but had nothing to export — this could be normal or could indicate a stale pipeline.

**Feed export volume trend (range query):**
```promql
sum by (feed_ref) (increase(feed_sent_total[1h]))
```

### Source Attribution

**Events by source (who is sending them):**
```promql
sum by (source) (increase(core_event_received_total[1h]))
```

**High-volume external source detection:**
```promql
topk(5, sum by (source, event_name) (increase(core_event_received_total[1h])))
```
Use this to detect runaway integrations (e.g., a POS system sending millions of `UPSERT_PRODUCT` events).

## Anomaly Detection Heuristics

### High Failure Rate
- **Threshold:** >5% failure rate
- **Detection:** `metrics.query({ query: "sum(increase(rubix_event_runtime_seconds_count{status='FAILED'}[1h])) / sum(increase(rubix_event_runtime_seconds_count[1h]))", type: "instant" })` (returns a ratio: 0.05 = 5%)
- **Action:** Filter failed events with `event.list({ eventStatus: "FAILED", from: "...", count: 50 })`, then hand off to `/fluent-trace` for root cause

### Volume Spikes
- **Threshold:** >10x normal baseline for any event name
- **Detection:** Compare `increase(...[1h])` against `avg_over_time(increase(...[1h])[24h:1h])` for the same event
- **Action:** Investigate whether a runaway loop or bulk import is occurring. Check if a SendEvent rule is creating circular event chains.

### NO_MATCH Events Present
- **Threshold:** Any NO_MATCH events
- **Meaning:** Events fired with names that don't match any ruleset in the workflow
- **Action:** Check event name spelling, entity subtype, and workflow deployment status. Use `/fluent-workflow-analyzer` to verify ruleset names match.

### Sustained PENDING Queue
- **Threshold:** >10% of events in PENDING status
- **Meaning:** Async processing queue is backed up
- **Action:** Check platform health. PENDING events should clear within seconds under normal load.

### Missing Expected Events
- **Indicator:** An entity type that should have events (e.g., FULFILMENT) shows zero events in the time window
- **Action:** Verify workflows are deployed for that entity type. Check if the event send pipeline is functioning.

### Slow Processing
- **Threshold:** P99 latency >10s or average >2s for any event
- **Detection:** Use the P99 and average latency queries from the cookbook above
- **Action:** Check which rules are executing for the slow event. Common causes: complex GraphQL queries in rules, external webhook timeouts, large entity attribute payloads.

### Queue Backup
- **Threshold:** Average inflight latency >5s
- **Detection:** Use the average queue wait time query from the cookbook
- **Action:** This indicates Rubix is not keeping up with incoming events. Check if there's a volume spike, or if processing times have degraded.

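The volume-spike heuristic above can be sketched as a simple client-side check once per-event hourly counts are in hand. Event names, counts, and the helper itself are hypothetical illustrations, not platform tooling:

```python
# Sketch of the volume-spike heuristic: flag any event whose last-hour count
# exceeds 10x its 24h hourly baseline. All values below are hypothetical.
SPIKE_FACTOR = 10

def spike_findings(last_hour, hourly_baseline):
    """Both args map event_name -> events per hour; returns (event, multiple) pairs."""
    findings = []
    for event, count in last_hour.items():
        baseline = hourly_baseline.get(event, 0)
        if baseline and count > SPIKE_FACTOR * baseline:
            findings.append((event, count / baseline))
    return findings

last_hour = {"UPSERT_PRODUCT": 120000, "CREATE": 900}
baseline  = {"UPSERT_PRODUCT": 5000, "CREATE": 850}
print(spike_findings(last_hour, baseline))  # flags UPSERT_PRODUCT at 24x baseline
```

Events with no baseline (new series) are skipped here; in practice treat them as a separate "new event name appeared" finding rather than a spike.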
## Monitoring Workflows

### Quick Health Check
```
1. metrics.healthCheck({ window: "1h" })
2. If healthy=true → done
3. If findings present → review severity and recommendations
4. For HIGH/CRITICAL findings → drill down with event.list using recommended filters
5. Hand off to /fluent-trace for root cause of specific failures
```

### Pre-Deployment Baseline
```
1. metrics.healthCheck({ window: "24h" }) → record baseline
2. Deploy changes
3. Wait for processing window
4. metrics.healthCheck({ window: "1h" }) → check post-deploy
5. Compare: new findings? Changed failure rate? New dominant events?
```

### Post-Incident Analysis
```
1. Identify incident time window
2. metrics.query({
     query: "sum by (event_name, status) (increase(rubix_event_runtime_seconds_count[5m]))",
     type: "range",
     start: "<incident_start>",
     end: "<incident_end>",
     step: "1m"
   })
3. Look for: spikes in specific event names, new FAILED events, NO_MATCH appearances
4. For each anomalous event → event.list with specific filters → event.get for details
5. Hand off to /fluent-trace for root cause of specific failures
```

### Latency Investigation
```
1. Check P99 processing latency:
   metrics.query({
     query: "histogram_quantile(0.99, sum by (le, event_name) (rate(rubix_event_runtime_seconds_bucket[5m])))",
     type: "range", start: "<start>", end: "<end>", step: "5m"
   })
2. Check queue wait time:
   metrics.query({
     query: "histogram_quantile(0.99, sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[5m])))",
     type: "range", start: "<start>", end: "<end>", step: "5m"
   })
3. If processing latency is high → investigate rule execution for those events
4. If queue latency is high → check overall throughput and volume
5. Use /fluent-trace for slow event investigation
```

### Batch Processing Health
```
1. Check latest BPP run:
   metrics.query({ query: "bpp_last_run_timestamp_seconds", type: "instant" })
2. Check for ERROR status:
   metrics.query({ query: "bpp_last_run_timestamp_seconds{status=\"ERROR\"}", type: "instant" })
3. Check dedup efficiency:
   metrics.query({
     query: "max by (run_id) (bpp_records_changed_total) / max by (run_id) (bpp_records_processed_total) * 100",
     type: "instant"
   })
4. Check feed export status:
   metrics.query({ query: "feed_last_run_timestamp_seconds", type: "instant" })
5. Look for NO_RECORDS feeds that should have data
```

### Runaway Integration Detection
```
1. Check top sources by volume:
   metrics.query({
     query: "topk(10, sum by (source, event_name) (increase(core_event_received_total[1h])))",
     type: "instant"
   })
2. If any source+event exceeds expected volume (e.g., >100k/hour for a POS integration):
661
+ - Check if the source system has a misconfiguration
662
+ - Check if dedup is filtering effectively (bpp_records_unchanged_total should be high)
663
+ - Consider rate-limiting at the integration layer
664
+ ```

## IPU/IPC Visibility

IPU (Inventory Processing Units) and IPC (Inventory Processing Credits) consumption is tracked via the Fluent dashboard, not directly through Prometheus metrics.

**Self-service dashboard:** Access via the Fluent Admin Console → Account dashboard. See [Self-Service IPU/IPC Visibility Overview](https://docs.fluentcommerce.com/essential-knowledge/self-service-ipu-ipc-visibility-overview).

The dashboard uses an internal GraphQL metrics query (not yet publicly documented) that aggregates account-level consumption. For custom IPU tracking, correlate event volumes from `core_event_received_total` and `rubix_event_runtime_seconds_count` with your contracted IPU thresholds.
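
As a rough sketch of that correlation, a daily received-event total (extrapolated by `increase()`, so treat it as an estimate rather than a ledger figure):

```promql
sum(increase(core_event_received_total[24h]))
```

Compare this figure against your contracted thresholds; for higher accuracy over wide windows, prefer the `last_over_time` minus offset pattern described below.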

## Prometheus Interpretation Guidelines

**Treat metrics as trends, not ledger-accurate totals.** Prometheus functions extrapolate based on scrape intervals and windows. Minor discrepancies between metrics counts and Event API counts are normal and expected.

**Retention:** 150 days from ingestion in the account-specific workspace; after that, data is deleted.

**Counter behavior:** Counters are monotonically increasing and reset on process restart. Use `increase()` or `rate()` to get meaningful deltas — never compare raw counter values directly.

**`last_over_time` vs `increase()` for counter deltas:**

`increase(counter[1h])` extrapolates based on the rate observed within the window. For short windows (5m, 15m) this is fine, but over wide windows (hours, days) the extrapolation can drift from actual observed values.

`last_over_time(counter[window])` returns the most recent actual sample within the lookback window — no extrapolation. This is more accurate for wide-window deltas because you're subtracting two real counter values:

```promql
last_over_time(counter[window]) - counter offset <period>
```

**Use `increase()` for:** short-window aggregations (5m, 15m), rate-over-time range queries, quick estimates.
**Use `last_over_time` minus offset for:** accurate daily/hourly counts, external reporting, billing-adjacent metrics.

**The `or` fallback pattern for counter deltas with offset:**
```promql
(last_over_time(metric[window]) - metric offset <period>)
or
last_over_time(metric[window])
```
The `or` clause is necessary because:
1. If no data point exists at the offset time (new series, counter reset, irregular scrapes), the subtraction returns null
2. The fallback returns the current value instead of dropping the series entirely
3. This may overcount for new series, but avoids data loss
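
Instantiated for per-event daily counts (metric and label taken from the reference above; the 1d window and offset are illustrative):

```promql
(
  sum by (event_name) (last_over_time(core_event_received_total[1d]))
  -
  sum by (event_name) (core_event_received_total offset 1d)
)
or
sum by (event_name) (last_over_time(core_event_received_total[1d]))
```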

**Histogram math:**
- Average = `rate(_sum) / rate(_count)` — always use `rate()` to handle counter resets
- Percentiles = `histogram_quantile(quantile, rate(_bucket))` — quantile is 0-1 (0.99 = p99)
- Never use `avg_over_time()` on `_count` — it averages the raw cumulative counter, not the observed values
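
Both rules applied to queue wait time (p95 chosen for illustration; this assumes the `_sum`/`_count` companion series that Prometheus histograms expose alongside `_bucket`):

```promql
# Average queue wait over the last hour
sum(rate(rubix_event_inflight_latency_seconds_sum[1h]))
/
sum(rate(rubix_event_inflight_latency_seconds_count[1h]))
```

```promql
# p95 queue wait — the quantile argument is 0-1
histogram_quantile(0.95, sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[1h])))
```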

## Common Query Pitfalls

### 1) Wrong labels for a metric

`core_event_received_total` does **not** carry a `status` label. Aggregating by `status` on that metric will produce null label sets. Check the metrics reference table above before adding labels to `sum by (...)`.
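
A wrong-vs-right pair (`source` and `event_name` are the labels this metric carries, per the reference table):

```promql
# Wrong — status is not a label here; everything collapses into one group with an empty status
sum by (status) (increase(core_event_received_total[1h]))
```

```promql
# Right — group by labels the metric actually exposes
sum by (source, event_name) (increase(core_event_received_total[1h]))
```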

### 2) Counter deltas with sparse/offset data

See the `or` pattern in the Prometheus Interpretation Guidelines above.

### 3) Histogram average computed incorrectly

Do not use `avg_over_time(..._count)`. For average latency, divide `rate(..._sum)` by `rate(..._count)`:

```promql
sum by (event_name) (rate(rubix_event_runtime_seconds_sum[1h]))
/
sum by (event_name) (rate(rubix_event_runtime_seconds_count[1h]))
```

### 4) Using `increase()` vs `rate()`

- `increase(metric[1h])` = total increase over 1 hour (absolute count)
- `rate(metric[5m])` = per-second rate averaged over 5 minutes
- Use `increase()` for "how many events in the last hour" questions
- Use `rate()` for time-series graphs, histogram math, and throughput measurements

### 5) BPP metrics have different labels

BPP and Feed metrics use `run_id`, `stage`, `feed_ref`, `data_type` — not `retailer_id` or `event_name`. Don't mix them with event processing metrics in the same aggregation.

### 6) Missing `> 0` filter

Counter-based queries often return zero-value results for inactive series. Append `> 0` to filter these out for cleaner output.
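
For example, listing only the event names that actually failed in the last hour (metric and `status` label from the reference above):

```promql
sum by (event_name) (increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h])) > 0
```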

### 7) Invalid metric name returns success with empty results

Querying a non-existent metric (e.g., `core_event_received_incorrect`) does **not** return an error. The API returns `status: "success"` with an empty `result` array. Always verify metric names against the reference tables above before debugging empty responses.
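
One way to sanity-check a metric name before debugging an empty response is the standard series-count meta-query (the regex is illustrative):

```promql
count by (__name__) ({__name__=~"core_event_.*"})
```

If the name you expected is absent from the output, the query was empty because the metric does not exist, not because there was no activity.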

### 8) Staleness window on metricInstant

`metricInstant` identifies data as "stale" if no new data point exists within 5 minutes. If you query a specific timestamp and the latest data point is older than 5 minutes, you get no value — not an error. For sparse or infrequent events, use `metricRange` with a wider window instead.

### 9) 32-day maximum range

`metricRange` queries cannot span more than 32 days between `start` and `end`. For longer analysis windows, split into multiple queries or use `metricInstant` with different `time` values.
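
A sketch of the split, in the same step style as the workflows above (chunk size and step are illustrative):

```
1. Split [start, end] into consecutive windows of ≤32 days
2. For each window:
   metrics.query({ query: "<expr>", type: "range", start: "<window_start>", end: "<window_end>", step: "1h" })
3. Concatenate the per-window series client-side before analysis
```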

## Integration with Other Skills

| Need | Skill |
|------|-------|
| Drill into specific event details | `/fluent-event-api` |
| Root cause diagnosis for failures | `/fluent-trace` |
| Verify workflow configuration | `/fluent-workflow-analyzer` |
| MCP tool payload syntax | `/fluent-mcp-tools` |
| Run targeted test sequences | `/fluent-e2e-test` |
| Validate settings for webhook rules | `/fluent-settings` |