@fluentcommerce/ai-skills 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. package/LICENSE +21 -0
  2. package/README.md +622 -0
  3. package/bin/cli.mjs +1973 -0
  4. package/content/cli/agents/fluent-cli/agent.json +149 -0
  5. package/content/cli/agents/fluent-cli.md +132 -0
  6. package/content/cli/skills/fluent-bootstrap/SKILL.md +181 -0
  7. package/content/cli/skills/fluent-cli-index/SKILL.md +63 -0
  8. package/content/cli/skills/fluent-cli-mcp-cicd/SKILL.md +77 -0
  9. package/content/cli/skills/fluent-cli-reference/SKILL.md +1031 -0
  10. package/content/cli/skills/fluent-cli-retailer/SKILL.md +85 -0
  11. package/content/cli/skills/fluent-cli-settings/SKILL.md +106 -0
  12. package/content/cli/skills/fluent-connect/SKILL.md +886 -0
  13. package/content/cli/skills/fluent-module-deploy/SKILL.md +349 -0
  14. package/content/cli/skills/fluent-profile/SKILL.md +180 -0
  15. package/content/cli/skills/fluent-workflow/SKILL.md +310 -0
  16. package/content/dev/agents/fluent-dev/agent.json +88 -0
  17. package/content/dev/agents/fluent-dev.md +525 -0
  18. package/content/dev/reference-modules/catalog.json +4754 -0
  19. package/content/dev/skills/fluent-build/SKILL.md +192 -0
  20. package/content/dev/skills/fluent-connection-analysis/SKILL.md +386 -0
  21. package/content/dev/skills/fluent-custom-code/SKILL.md +895 -0
  22. package/content/dev/skills/fluent-data-module-scaffold/SKILL.md +714 -0
  23. package/content/dev/skills/fluent-e2e-test/SKILL.md +394 -0
  24. package/content/dev/skills/fluent-event-api/SKILL.md +945 -0
  25. package/content/dev/skills/fluent-feature-explain/SKILL.md +603 -0
  26. package/content/dev/skills/fluent-feature-plan/PLAN_TEMPLATE.md +695 -0
  27. package/content/dev/skills/fluent-feature-plan/SKILL.md +227 -0
  28. package/content/dev/skills/fluent-job-batch/SKILL.md +138 -0
  29. package/content/dev/skills/fluent-mermaid-validate/SKILL.md +86 -0
  30. package/content/dev/skills/fluent-module-scaffold/SKILL.md +1928 -0
  31. package/content/dev/skills/fluent-module-validate/SKILL.md +775 -0
  32. package/content/dev/skills/fluent-pre-deploy-check/SKILL.md +1108 -0
  33. package/content/dev/skills/fluent-retailer-config/SKILL.md +1111 -0
  34. package/content/dev/skills/fluent-rule-scaffold/SKILL.md +385 -0
  35. package/content/dev/skills/fluent-scope-decompose/SKILL.md +1021 -0
  36. package/content/dev/skills/fluent-session-audit-export/SKILL.md +632 -0
  37. package/content/dev/skills/fluent-session-summary/SKILL.md +195 -0
  38. package/content/dev/skills/fluent-settings/SKILL.md +1058 -0
  39. package/content/dev/skills/fluent-source-onboard/SKILL.md +632 -0
  40. package/content/dev/skills/fluent-system-monitoring/SKILL.md +767 -0
  41. package/content/dev/skills/fluent-test-data/SKILL.md +513 -0
  42. package/content/dev/skills/fluent-trace/SKILL.md +1143 -0
  43. package/content/dev/skills/fluent-transition-api/SKILL.md +346 -0
  44. package/content/dev/skills/fluent-version-manage/SKILL.md +744 -0
  45. package/content/dev/skills/fluent-workflow-analyzer/SKILL.md +959 -0
  46. package/content/dev/skills/fluent-workflow-builder/SKILL.md +319 -0
  47. package/content/dev/skills/fluent-workflow-deploy/SKILL.md +267 -0
  48. package/content/mcp-extn/agents/fluent-mcp.md +69 -0
  49. package/content/mcp-extn/skills/fluent-mcp-tools/SKILL.md +461 -0
  50. package/content/mcp-official/agents/fluent-mcp-core.md +91 -0
  51. package/content/mcp-official/skills/fluent-mcp-core/SKILL.md +94 -0
  52. package/content/rfl/agents/fluent-rfl.md +56 -0
  53. package/content/rfl/skills/fluent-rfl-assess/SKILL.md +172 -0
  54. package/docs/CAPABILITY_MAP.md +77 -0
  55. package/docs/CLI_COVERAGE.md +47 -0
  56. package/docs/DEV_WORKFLOW.md +802 -0
  57. package/docs/FLOW_RUN.md +142 -0
  58. package/docs/USE_CASES.md +404 -0
  59. package/metadata.json +156 -0
  60. package/package.json +51 -0
@@ -0,0 +1,767 @@
---
name: fluent-system-monitoring
description: Monitor Fluent Commerce event processing health. Query Prometheus metrics, interpret event volume analytics, detect failure rate anomalies, triage alerts, and identify workflow gaps. Triggers on "system monitoring", "event analytics", "failure rate", "event volume", "top events", "alert triage", "monitoring dashboard", "prometheus", "metrics", "latency", "throughput".
user-invocable: true
allowed-tools: Bash, Read, Write, Edit, Glob, Grep
argument-hint: [--from <ISO-datetime>] [--entity-type ORDER|FULFILMENT] [--top-n 20]
---

# System Monitoring

Monitor Fluent Commerce event processing health using Prometheus metrics and event analytics. Detect anomalies, triage alerts, and identify workflow gaps.

## Ownership Boundary

This skill owns aggregate event analytics interpretation, anomaly detection, and alert triage workflows.

- **Per-entity event queries** → `/fluent-event-api`
- **Root cause diagnosis** → `/fluent-trace`
- **MCP tool syntax** → `/fluent-mcp-tools`
- **Workflow structure** → `/fluent-workflow-analyzer`

## Platform Metrics Architecture

Fluent Commerce captures metrics at four instrumentation points in the event processing pipeline:

```
Customer Request
        │
        ▼
┌──────────────┐  core_event_received_total (Counter)
│  Fluent API  │  core_event_last_received_seconds (Gauge)
│    (Core)    │  ← events RECEIVED by the platform
└──────┬───────┘
       │ enqueue
       ▼
┌──────────────┐  rubix_event_received_total (Counter)
│   Internal   │  rubix_event_inflight_latency_seconds (Histogram)
│    Queue     │  ← time spent WAITING in queue
└──────┬───────┘
       │ dequeue
       ▼
┌──────────────┐  rubix_event_runtime_seconds (Histogram)
│    Rubix     │  ← time spent PROCESSING (includes status label)
│ (Orchestr.)  │
└──────────────┘

Batch path (inventory):
┌──────────────┐  bpp_records_processed_total (Counter)
│  Batch Pre-  │  bpp_records_changed_total (Counter)
│  Processing  │  bpp_records_unchanged_total (Counter)
│   (Dedup)    │  bpp_last_run_timestamp_seconds (Gauge)
└──────────────┘

Feed export path:
┌──────────────┐  feed_sent_total (Counter)
│  Inventory   │  feed_last_run_timestamp_seconds (Gauge)
│    Feeds     │
└──────────────┘
```

### Key Distinction: Received vs Processed

This is the most common source of confusion. Understand it before writing any query:

| Metric | Layer | What it counts | Has `status` label? |
|--------|-------|----------------|---------------------|
| `core_event_received_total` | API | Events the platform **received** | NO |
| `rubix_event_received_total` | Rubix | Events the orchestration engine **received** from queue | NO |
| `rubix_event_runtime_seconds_count` | Rubix | Events the orchestration engine **processed** | YES (`COMPLETE`, `FAILED`, `NO_MATCH`) |

- To count events received → use `core_event_received_total`
- To count events processed with success/failure breakdown → use `rubix_event_runtime_seconds_count`
- To measure processing latency → use the `rubix_event_runtime_seconds` histogram
- To measure queue wait time → use the `rubix_event_inflight_latency_seconds` histogram

**Never aggregate `core_event_received_total` by `status`** — the label does not exist on that metric and will produce null/misleading results.

+
78
+ ### Metric/Status Naming Drift (Doc Reconciliation)
79
+
80
+ Some product/feature docs use alternate metric families and status labels. Normalize before analysis:
81
+
82
+ | Semantic intent | Canonical metrics/status in this skill | Alternate names seen in some docs |
83
+ |---|---|---|
84
+ | Total events processed | `rubix_event_runtime_seconds_count` | `fluent_events_total` |
85
+ | Failed terminal state | `FAILED` | `ERROR` |
86
+ | In-flight / not terminal | `PENDING` / `SCHEDULED` | `PROCESSING` |
87
+
88
+ Guidelines:
89
+ - Prefer `rubix_*` and `core_*` metrics for Fluent platform observability unless tenant evidence proves otherwise.
90
+ - If a provided query uses `fluent_events_*`, translate it to the canonical metric family first.
91
+ - Do not filter Rubix metrics by `status="ERROR"` or `status="PROCESSING"`; use canonical status labels.
92
+
93
+ ### What Is Captured
94
+
95
+ Metrics are captured for events entering via:
96
+ - `POST /api/v4.1/event/async`
97
+ - `POST /api/v4.1/event/sync`
98
+ - `POST /api/v4.1/job/{jobId}/batch`
99
+ - Action events (sent directly to the Event API) — captured against the event name, entity type, and other labels
100
+
101
+ **Not captured:**
102
+ - Inline events triggered within a workflow for the currently orchestrated entity (internal rule-to-rule calls within the same orchestration context)
103
+
104
+ ## Complete Metrics Reference
105
+
106
+ ### Fluent API Metrics
107
+
108
+ | Metric | Type | Labels | Description |
109
+ |--------|------|--------|-------------|
110
+ | `core_event_received_total` | Counter | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Number of events received by the platform |
111
+ | `core_event_last_received_seconds` | Gauge | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Timestamp of last event received (epoch seconds) |
112
+
113
+ ### Orchestration Engine (Rubix) Metrics
114
+
115
+ | Metric | Type | Labels | Description |
116
+ |--------|------|--------|-------------|
117
+ | `rubix_event_received_total` | Counter | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Events received by Rubix from queue or direct HTTP API |
118
+ | `rubix_event_inflight_latency_seconds` | Histogram | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source` | Queue wait time before Rubix processes |
119
+ | `rubix_event_inflight_latency_seconds_sum` | Counter | (same as above) | Cumulative queue wait time |
120
+ | `rubix_event_inflight_latency_seconds_count` | Counter | (same as above) | Count of events observed in queue |
121
+ | `rubix_event_inflight_latency_seconds_bucket` | Counter | (same as above) + `le` | Bucketed queue latency distribution |
122
+ | `rubix_event_runtime_seconds` | Histogram | `account_id`, `retailer_id`, `event_name`, `entity_type`, `source`, **`status`** | Rubix processing time per event |
123
+ | `rubix_event_runtime_seconds_sum` | Counter | (same as above) | Cumulative processing time |
124
+ | `rubix_event_runtime_seconds_count` | Counter | (same as above) | Count of events processed |
125
+ | `rubix_event_runtime_seconds_bucket` | Counter | (same as above) + `le` | Bucketed processing time distribution |
126
+
127
+ ### Batch Pre-Processing Metrics (Inventory Deduplication)
128
+
129
+ | Metric | Type | Labels | Description |
130
+ |--------|------|--------|-------------|
131
+ | `bpp_records_processed_total` | Counter | `account_id`, `run_id`, `stage`, `first_batch_received`, `deduplication_finished` | Total batch items processed by dedup job |
132
+ | `bpp_records_unchanged_total` | Counter | `account_id`, `run_id`, `stage` | Items filtered out (no change detected) |
133
+ | `bpp_records_changed_total` | Counter | `account_id`, `run_id`, `stage` | Items that changed and were sent to Rubix |
134
+ | `bpp_last_run_timestamp_seconds` | Gauge | `account_id`, `run_id`, `stage`, `status` | Completion timestamp. Status: `SUCCESS`, `ERROR` |
135
+
136
+ ### Inventory Feed Metrics (Data Loading)
137
+
138
+ | Metric | Type | Labels | Description |
139
+ |--------|------|--------|-------------|
140
+ | `feed_sent_total` | Counter | `account_id`, `feed_ref`, `run_id`, `data_type` | Records exported. data_type: `INVENTORY_POSITION`, `INVENTORY_CATALOGUE`, `VIRTUAL_POSITION`, `VIRTUAL_CATALOGUE` |
141
+ | `feed_last_run_timestamp_seconds` | Gauge | `account_id`, `feed_ref`, `run_id`, `status` | Completion timestamp. Status: `SUCCESS`, `ERROR`, `NO_RECORDS` |
142
+
143
### Histogram Bucket Boundaries

Default histogram bucket boundaries (in seconds): `0.005`, `0.01`, `0.025`, `0.05`, `0.075`, `0.1`, `0.25`, `0.5`, `0.75`, `1.0`, `2.5`, `5.0`, `7.5`, `10.0`, `+Inf`.

The `le` label marks the upper bound of each bucket. A bucket with `le="0.5"` counts all events that took ≤500ms.

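For intuition, here is a minimal Python sketch (not platform code) of the linear interpolation that `histogram_quantile` performs over cumulative `le` buckets. The bucket counts below are hypothetical:

```python
# Sketch of histogram_quantile-style interpolation over cumulative ("le")
# buckets. Bucket counts are hypothetical, not real platform data.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total                       # target cumulative count
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):      # +Inf bucket: fall back to last finite bound
                return prev_bound
            # linear interpolation within the bucket, as Prometheus does
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical distribution: 80 events ≤0.1s, 95 ≤0.5s, 100 total.
buckets = [(0.1, 80), (0.5, 95), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))    # → 0.0625 (median falls in the first bucket)
print(histogram_quantile(0.99, buckets))   # → 0.9
```

Prometheus applies the same interpolation server-side; the sketch only illustrates why coarse upper buckets limit quantile precision.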
### Labels Reference

| Label | Appears on | Values |
|-------|-----------|--------|
| `account_id` | All metrics | Fluent account identifier |
| `retailer_id` | API + Rubix metrics | Retailer within the account |
| `event_name` | API + Rubix metrics | Event name (e.g., `CREATE`, `UPSERT_PRODUCT`) |
| `entity_type` | API + Rubix metrics | Entity type (e.g., `ORDER`, `FULFILMENT`, `INVENTORY_POSITION`) |
| `source` | API + Rubix metrics | Origin: `event`, `event-sync`, `batch`, `internal`, or custom |
| `status` | **Only** `rubix_event_runtime_seconds*` | `COMPLETE`, `FAILED`, `NO_MATCH` |
| `le` | Histogram `_bucket` metrics only | Upper bound of histogram bucket (seconds) |
| `run_id` | BPP + Feed metrics | Execution run identifier |
| `stage` | BPP metrics | Processing stage |
| `feed_ref` | Feed metrics | Feed reference identifier |
| `data_type` | `feed_sent_total` | Type of exported records |
| `first_batch_received` | `bpp_records_processed_total` | Timestamp of first batch in run |
| `deduplication_finished` | `bpp_records_processed_total` | Timestamp of dedup completion |

### Source Label Values

| Source Value | Meaning |
|-------------|---------|
| `event` | Async endpoint (`/api/v4.1/event/async`) with no explicit source |
| `event-sync` | Sync endpoint (`/api/v4.1/event/sync`) with no explicit source |
| `batch` | Batch endpoint (`/api/v4.1/job/{jobId}/batch`) |
| `internal` | Scheduled events, cross-domain events (platform-originated) |
| Custom value | Events with explicit `source` field set in payload (e.g., `POS`, `ERP`) |

## Data Access

### Query Path

Fluent metrics are accessed exclusively through the **GraphQL API**, which acts as an auth + tenant-isolation proxy over Prometheus. The platform does not expose raw Prometheus REST endpoints (`/api/v1/query` returns 400).

**MCP Extension Tools (recommended for AI analysis)**

The `metrics.query` tool routes PromQL through the GraphQL `metricInstant` / `metricRange` queries automatically:

```json
{
  "query": "sum by (event_name) (increase(rubix_event_runtime_seconds_count[1h]))",
  "type": "instant"
}
```

**GraphQL queries (for OMX dashboards, programmatic access, and external tools)**

The same GraphQL queries are available directly:
- `metricInstant(query, time?)` — point-in-time query
- `metricRange(query, start, end, step)` — time-series query

**Required headers:**
- `Authorization: Bearer <token>` — user/service token with `METRICS_VIEW` permission
- `fluent.account: <ACCOUNT_ID>` — identifies which Prometheus workspace to query
- `Content-Type: application/json`

**metricInstant — point-in-time snapshot:**
```graphql
query ($PromQL: String!, $time: DateTime) {
  metricInstant(query: $PromQL, time: $time) {
    status
    error
    errorType
    warnings
    data {
      resultType
      result {
        metric
        value
      }
    }
  }
}
```

Variables:
```json
{
  "PromQL": "sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count[1h])) > 0",
  "time": "2026-02-22T12:00:00Z"
}
```

**metricRange — time-series over a window:**
```graphql
query ($PromQLRange: String!, $start: DateTime, $end: DateTime, $step: String) {
  metricRange(query: $PromQLRange, start: $start, end: $end, step: $step) {
    status
    error
    errorType
    warnings
    data {
      resultType
      result {
        metric
        values
      }
    }
  }
}
```

Variables:
```json
{
  "PromQLRange": "(sum by (source, retailer_id, event_name, entity_type) ((last_over_time(core_event_received_total{}[30m]) - core_event_received_total{} offset 30m) or last_over_time(core_event_received_total{}[30m]))) > 0",
  "start": "2026-02-22T00:00:00Z",
  "end": "2026-02-22T23:59:00Z",
  "step": "30m"
}
```

**Response shape:** Both queries return the same structure — `status` ("success"/"error"), optional `error`/`errorType`/`warnings`, and `data` containing `resultType` ("vector" for instant, "matrix" for range) and `result[]`, where each entry has a `metric` label map and either a single `value` (instant) or a `values` array (range: `[timestamp, value]` pairs at each step).

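As a sketch of programmatic access, the following hypothetical Python client builds the `metricInstant` request described above using only the standard library. The endpoint URL, token, and account ID are placeholders, not real values:

```python
import json
import urllib.request

# Placeholder endpoint and credentials -- substitute your environment's
# GraphQL URL and a token that carries the METRICS_VIEW permission.
GRAPHQL_URL = "https://example.invalid/graphql"
TOKEN = "<bearer-token>"
ACCOUNT_ID = "<ACCOUNT_ID>"

METRIC_INSTANT = """
query ($PromQL: String!, $time: DateTime) {
  metricInstant(query: $PromQL, time: $time) {
    status error errorType warnings
    data { resultType result { metric value } }
  }
}
"""

def metric_instant(promql, time=None):
    """POST a metricInstant query with the documented headers."""
    payload = {"query": METRIC_INSTANT, "variables": {"PromQL": promql, "time": time}}
    req = urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "fluent.account": ACCOUNT_ID,   # selects the Prometheus workspace
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"]["metricInstant"]

# Example call (not executed here): failure count for the last hour.
# metric_instant('sum(increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h]))')
```

The same request body works from any HTTP client; only the three headers and the GraphQL document change between `metricInstant` and `metricRange`.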
**Query limits and defaults:**
- Maximum query time range for `metricRange`: **32 days** between `start` and `end`. Queries exceeding this will fail.
- `metricRange` defaults when parameters are omitted: `start` = 30 minutes before current server time, `end` = current server time, `step` = 1 minute
- `metricInstant` default: `time` = current server time

**Access requirements:**
- User/token must have `METRICS_VIEW` permission
- Metrics are isolated per account per environment (workspace-level segregation)
- GraphQL API identifies the account from the user token + `fluent.account` header
- Retention: 150 days from ingestion
- Custom metrics creation is not available — only the platform-instrumented metrics listed above are captured

**MCP tools vs direct GraphQL:** Both use the same GraphQL `metricInstant` / `metricRange` queries underneath. The MCP `metrics.query` tool abstracts the GraphQL construction — just pass PromQL + type. Use MCP tools for AI-driven analysis; use direct GraphQL for OMX dashboards, external monitoring tools (Elasticsearch, Grafana), or programmatic access from scripts.

### If Prometheus Is Unavailable

Once the `metrics.*` MCP tools are available, `metrics.topEvents` and `metrics.healthCheck` will fall back to the Event API for aggregation automatically. Until then, use `event.list` with appropriate filters to aggregate event data manually. Event API queries have a practical window of roughly 30 days.

## Tools

> **Note:** The `metrics.healthCheck`, `metrics.query`, and `metrics.topEvents` MCP tools described below are planned but not yet implemented in the current MCP extension server. Currently, use the Fluent Commerce admin console for metrics dashboards, or query the GraphQL `metricInstant`/`metricRange` endpoints directly via the `graphql.query` MCP tool with the PromQL patterns from the cookbook section below.

### metrics.healthCheck (Primary — Single-Call Health Assessment)

Runs all anomaly checks locally in the MCP server. One call replaces the multi-step Quick Health Check workflow.

```json
{
  "window": "1h",
  "includeTopEvents": true,
  "topN": 10
}
```

**Response shape:**
- `healthy` — boolean, true if no findings
- `source` — "prometheus" or "event_api" (automatic fallback)
- `summary` — window, totalEvents, failureRate, pendingRate, statusBreakdown
- `findings[]` — severity + type + message for each detected anomaly
- `topEvents[]` — ranked event breakdown (if includeTopEvents)
- `recommendations[]` — actionable next steps based on findings

**Custom thresholds:**
```json
{
  "window": "6h",
  "thresholds": { "failureRate": 2, "pendingRate": 5, "dominanceRate": 30 }
}
```

### metrics.query (Prometheus PromQL)

Query Prometheus metrics via GraphQL `metricInstant`/`metricRange` with PromQL. Supports instant and range queries.

**Instant query — aggregated failure summary:**
```json
{
  "query": "sum by (event_name, entity_type, status) (increase(rubix_event_runtime_seconds_count[1h]))",
  "type": "instant"
}
```

**Range query — failure rate over time:**
```json
{
  "query": "rate(rubix_event_runtime_seconds_count{status=\"FAILED\"}[5m])",
  "type": "range",
  "start": "2026-02-22T00:00:00Z",
  "end": "2026-02-22T06:00:00Z",
  "step": "1m"
}
```

### metrics.topEvents (Convenience Aggregation)

Pre-built ranked summary from the Event API. Useful when Prometheus is unavailable or when you need per-event-name breakdowns with failure rates.

```json
{
  "from": "2026-02-22T00:00:00Z",
  "to": "2026-02-22T12:00:00Z",
  "topN": 20,
  "eventType": "ORCHESTRATION"
}
```

**Filter to specific event status** (e.g., top failing events only):
```json
{
  "from": "2026-02-22T00:00:00Z",
  "eventStatus": "FAILED",
  "topN": 10
}
```

Supported `eventStatus` values: `COMPLETE`, `FAILED`, `NO_MATCH`, `SUCCESS`, `PENDING`.

**Response includes:**
- `totalEvents` — total events in the time window
- `failureRate` — percentage of FAILED events
- `statusBreakdown` — counts by status (SUCCESS, FAILED, COMPLETE, NO_MATCH, etc.)
- `topEvents[]` — ranked list: name + entityType + status + count + percentage
- `uniqueEventNames` / `uniqueEntityTypes` — cardinality metrics

### When to Use Which

| Scenario | Tool | Why |
|----------|------|-----|
| One-call health assessment | `metrics.healthCheck` | Runs all checks locally, minimal tokens |
| Quick failure rate check | `metrics.query` (instant) | Single PromQL, no pagination |
| Failure rate trend over time | `metrics.query` (range) | Time-series data with step intervals |
| Top-N event ranking with breakdowns | `metrics.topEvents` | Pre-aggregated, structured response |
| Custom PromQL (p99 latency, throughput) | `metrics.query` | Full PromQL flexibility |
| Top failing events only | `metrics.topEvents` | Use `eventStatus: "FAILED"` for server-side filtering |
| Prometheus unavailable | `metrics.topEvents` | Falls back to Event API aggregation |
| BPP / Feed job monitoring | `metrics.query` | Only path — Event API doesn't cover batch jobs |

## PromQL Cookbook

Recipes for common monitoring questions. All use `metrics.query` with `type: "instant"` unless noted.

### Event Volume

**Total events received in the last hour:**
```promql
sum(increase(core_event_received_total[1h]))
```

**Events received by entity type in the last hour:**
```promql
sum by (entity_type) (increase(core_event_received_total[1h]))
```

**Events received in the last 24h (period delta with offset fallback):**
```promql
sum by (source, retailer_id, event_name, entity_type) (
  (last_over_time(core_event_received_total[1d]) - core_event_received_total offset 1d)
  or
  last_over_time(core_event_received_total[1d])
)
```
The `or` clause handles cases where no data point exists at the offset time (counter reset, new series, or sparse scrapes). Without it, the subtraction returns null for those series.

**Events received per retailer per hour (range query for trending):**
```promql
sum by (retailer_id) (increase(core_event_received_total[1h]))
```
Use `type: "range"` with `step: "1h"` for an hourly breakdown.

### Failure Analysis

**Overall failure rate (percentage):**
```promql
sum(increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h]))
/
sum(increase(rubix_event_runtime_seconds_count[1h]))
* 100
```

**Failed events by event name:**
```promql
sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h])) > 0
```

**NO_MATCH events (workflow configuration gaps):**
```promql
sum by (event_name, entity_type) (increase(rubix_event_runtime_seconds_count{status="NO_MATCH"}[1h])) > 0
```

**Failure rate trend over time (range query):**
```promql
sum(rate(rubix_event_runtime_seconds_count{status="FAILED"}[5m]))
/
sum(rate(rubix_event_runtime_seconds_count[5m]))
```
Use `type: "range"`, `step: "5m"` for 5-minute granularity.

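When a single PromQL expression is inconvenient, the same failure rate can be derived client-side from two instant vectors. A minimal Python sketch using the documented response shape (each result entry carries a label map and a `[timestamp, value]` pair); the counts are hypothetical:

```python
# Client-side failure-rate computation from two hypothetical metricInstant
# result vectors, shaped as the Response shape section describes.
failed = [{"metric": {}, "value": [1760000000, "12"]}]
total  = [{"metric": {}, "value": [1760000000, "480"]}]

def vector_sum(result):
    # Prometheus returns sample values as strings; sum them as floats.
    return sum(float(entry["value"][1]) for entry in result)

failure_rate = vector_sum(failed) / vector_sum(total) * 100
print(f"{failure_rate:.2f}%")  # 12 / 480 → 2.50%
```

This mirrors the `FAILED / total * 100` division above; doing it client-side is handy when you already fetched both vectors for other checks.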
### Latency Analysis

**Average processing time per event:**
```promql
sum by (event_name, entity_type) (rate(rubix_event_runtime_seconds_sum[1h]))
/
sum by (event_name, entity_type) (rate(rubix_event_runtime_seconds_count[1h]))
```
This divides the per-second rate of total latency by the per-second rate of event counts, yielding average seconds per event.

**WRONG — do not use `avg_over_time` on `_count`:**
```promql
# INCORRECT: avg_over_time on _count gives the average of a cumulative counter, NOT average latency
avg_over_time(rubix_event_inflight_latency_seconds_count[1d])
```

**P99 processing latency:**
```promql
histogram_quantile(0.99,
  sum by (le, event_name) (rate(rubix_event_runtime_seconds_bucket[5m]))
)
```

**P50 (median) processing latency:**
```promql
histogram_quantile(0.50,
  sum by (le) (rate(rubix_event_runtime_seconds_bucket[5m]))
)
```

**Average queue wait time (inflight latency):**
```promql
sum by (event_name, entity_type) (rate(rubix_event_inflight_latency_seconds_sum[1h]))
/
sum by (event_name, entity_type) (rate(rubix_event_inflight_latency_seconds_count[1h]))
```

**P99 queue wait time:**
```promql
histogram_quantile(0.99,
  sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[5m]))
)
```

### Throughput

**Events processed per second (instant rate):**
```promql
sum(rate(rubix_event_runtime_seconds_count[5m]))
```

**Events processed per second by event name (range):**
```promql
sum by (event_name) (rate(rubix_event_runtime_seconds_count[5m]))
```

### Batch Pre-Processing (Inventory Deduplication)

**Total records processed in the latest run:**
```promql
max by (run_id) (bpp_records_processed_total)
```

**Dedup efficiency (changed vs processed):**
```promql
max by (run_id) (bpp_records_changed_total)
/
max by (run_id) (bpp_records_processed_total)
* 100
```
A low percentage means most records are unchanged — good for dedup efficiency, but 0% could indicate stale data.

**Last BPP run status:**
```promql
bpp_last_run_timestamp_seconds
```
Filter by `status="SUCCESS"` or `status="ERROR"` to check job health.

### Inventory Feeds (Data Loading)

**Total records exported by feed:**
```promql
sum by (feed_ref, data_type) (increase(feed_sent_total[24h]))
```

**Last feed run timestamp:**
```promql
feed_last_run_timestamp_seconds
```
Filter by `status="NO_RECORDS"` to find feeds that ran but had nothing to export — this could be normal or could indicate a stale pipeline.

**Feed export volume trend (range query):**
```promql
sum by (feed_ref) (increase(feed_sent_total[1h]))
```

### Source Attribution

**Events by source (who is sending them):**
```promql
sum by (source) (increase(core_event_received_total[1h]))
```

**High-volume external source detection:**
```promql
topk(5, sum by (source, event_name) (increase(core_event_received_total[1h])))
```
Use this to detect runaway integrations (e.g., a POS system sending millions of `UPSERT_PRODUCT` events).

## Anomaly Detection Heuristics

### High Failure Rate
- **Threshold:** >5% failure rate
- **Detection:** `metrics.query({ query: "sum(increase(rubix_event_runtime_seconds_count{status='FAILED'}[1h])) / sum(increase(rubix_event_runtime_seconds_count[1h]))", type: "instant" })` (returns a ratio: 0.05 = 5%)
- **Action:** Filter failed events with `event.list({ eventStatus: "FAILED", from: "...", count: 50 })`, then hand off to `/fluent-trace` for root cause

### Volume Spikes
- **Threshold:** >10x normal baseline for any event name
- **Detection:** Compare `increase(...[1h])` against `avg_over_time(increase(...[1h])[24h:1h])` for the same event
- **Action:** Investigate whether a runaway loop or bulk import is occurring. Check if a SendEvent rule is creating circular event chains.

### NO_MATCH Events Present
- **Threshold:** Any NO_MATCH events
- **Meaning:** Events fired with names that don't match any ruleset in the workflow
- **Action:** Check event name spelling, entity subtype, and workflow deployment status. Use `/fluent-workflow-analyzer` to verify ruleset names match.

### Sustained PENDING Queue
- **Threshold:** >10% of events in PENDING status
- **Meaning:** Async processing queue is backed up
- **Action:** Check platform health. PENDING events should clear within seconds under normal load.

### Missing Expected Events
- **Indicator:** An entity type that should have events (e.g., FULFILMENT) shows zero events in the time window
- **Action:** Verify workflows are deployed for that entity type. Check if the event send pipeline is functioning.

### Slow Processing
- **Threshold:** P99 latency >10s or average >2s for any event
- **Detection:** Use the P99 and average latency queries from the cookbook above
- **Action:** Check which rules are executing for the slow event. Common causes: complex GraphQL queries in rules, external webhook timeouts, large entity attribute payloads.

### Queue Backup
- **Threshold:** Average inflight latency >5s
- **Detection:** Use the average queue wait time query from the cookbook
- **Action:** This indicates Rubix is not keeping up with incoming events. Check if there's a volume spike, or if processing times have degraded.

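The volume-spike heuristic above can be sketched as a simple client-side check once per-event hourly counts are in hand. Event names, counts, and the helper itself are hypothetical illustrations, not platform tooling:

```python
# Sketch of the volume-spike heuristic: flag any event whose last-hour count
# exceeds 10x its 24h hourly baseline. All values below are hypothetical.
SPIKE_FACTOR = 10

def spike_findings(last_hour, hourly_baseline):
    """Both args map event_name -> events per hour; returns (event, multiple) pairs."""
    findings = []
    for event, count in last_hour.items():
        baseline = hourly_baseline.get(event, 0)
        if baseline and count > SPIKE_FACTOR * baseline:
            findings.append((event, count / baseline))
    return findings

last_hour = {"UPSERT_PRODUCT": 120000, "CREATE": 900}
baseline  = {"UPSERT_PRODUCT": 5000, "CREATE": 850}
print(spike_findings(last_hour, baseline))  # flags UPSERT_PRODUCT at 24x baseline
```

Events with no baseline (new series) are skipped here; in practice treat them as a separate "new event name appeared" finding rather than a spike.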
## Monitoring Workflows

### Quick Health Check
```
1. metrics.healthCheck({ window: "1h" })
2. If healthy=true → done
3. If findings present → review severity and recommendations
4. For HIGH/CRITICAL findings → drill down with event.list using recommended filters
5. Hand off to /fluent-trace for root cause of specific failures
```

### Pre-Deployment Baseline
```
1. metrics.healthCheck({ window: "24h" }) → record baseline
2. Deploy changes
3. Wait for processing window
4. metrics.healthCheck({ window: "1h" }) → check post-deploy
5. Compare: new findings? Changed failure rate? New dominant events?
```

### Post-Incident Analysis
```
1. Identify incident time window
2. metrics.query({
     query: "sum by (event_name, status) (increase(rubix_event_runtime_seconds_count[5m]))",
     type: "range",
     start: "<incident_start>",
     end: "<incident_end>",
     step: "1m"
   })
3. Look for: spikes in specific event names, new FAILED events, NO_MATCH appearances
4. For each anomalous event → event.list with specific filters → event.get for details
5. Hand off to /fluent-trace for root cause of specific failures
```

### Latency Investigation
```
1. Check P99 processing latency:
   metrics.query({
     query: "histogram_quantile(0.99, sum by (le, event_name) (rate(rubix_event_runtime_seconds_bucket[5m])))",
     type: "range", start: "<start>", end: "<end>", step: "5m"
   })
2. Check queue wait time:
   metrics.query({
     query: "histogram_quantile(0.99, sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[5m])))",
     type: "range", start: "<start>", end: "<end>", step: "5m"
   })
3. If processing latency is high → investigate rule execution for those events
4. If queue latency is high → check overall throughput and volume
5. Use /fluent-trace for slow event investigation
```

### Batch Processing Health
```
1. Check latest BPP run:
   metrics.query({ query: "bpp_last_run_timestamp_seconds", type: "instant" })
2. Check for ERROR status:
   metrics.query({ query: "bpp_last_run_timestamp_seconds{status=\"ERROR\"}", type: "instant" })
3. Check dedup efficiency:
   metrics.query({
     query: "max by (run_id) (bpp_records_changed_total) / max by (run_id) (bpp_records_processed_total) * 100",
     type: "instant"
   })
4. Check feed export status:
   metrics.query({ query: "feed_last_run_timestamp_seconds", type: "instant" })
5. Look for NO_RECORDS feeds that should have data
```

### Runaway Integration Detection
```
1. Check top sources by volume:
   metrics.query({
     query: "topk(10, sum by (source, event_name) (increase(core_event_received_total[1h])))",
     type: "instant"
   })
2. If any source+event exceeds expected volume (e.g., >100k/hour for a POS integration):
661
+ - Check if the source system has a misconfiguration
662
+ - Check if dedup is filtering effectively (bpp_records_unchanged_total should be high)
663
+ - Consider rate-limiting at the integration layer
664
+ ```

## IPU/IPC Visibility

IPU (Inventory Processing Units) and IPC (Inventory Processing Credits) consumption is tracked via the Fluent dashboard, not directly through Prometheus metrics.

**Self-service dashboard:** Access via the Fluent Admin Console → Account dashboard. See [Self-Service IPU/IPC Visibility Overview](https://docs.fluentcommerce.com/essential-knowledge/self-service-ipu-ipc-visibility-overview).

The dashboard uses an internal GraphQL metrics query (not yet publicly documented) that aggregates account-level consumption. For custom IPU tracking, correlate event volumes from `core_event_received_total` and `rubix_event_runtime_seconds_count` with your contracted IPU thresholds.
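
As a rough sketch of that correlation, a daily received-event total (extrapolated by `increase()`, so treat it as an estimate rather than a ledger figure):

```promql
sum(increase(core_event_received_total[24h]))
```

Compare this figure against your contracted thresholds; for higher accuracy over wide windows, prefer the `last_over_time` minus offset pattern described below.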

## Prometheus Interpretation Guidelines

**Treat metrics as trends, not ledger-accurate totals.** Prometheus functions extrapolate based on scrape intervals and windows. Minor discrepancies between metrics counts and Event API counts are normal and expected.

**Retention:** 150 days from ingestion in the account-specific workspace; after that, data is deleted.

**Counter behavior:** Counters are monotonically increasing and reset on process restart. Use `increase()` or `rate()` to get meaningful deltas — never compare raw counter values directly.

**`last_over_time` vs `increase()` for counter deltas:**

`increase(counter[1h])` extrapolates based on the rate observed within the window. For short windows (5m, 15m) this is fine, but over wide windows (hours, days) the extrapolation can drift from actual observed values.

`last_over_time(counter[window])` returns the most recent actual sample within the lookback window — no extrapolation. This is more accurate for wide-window deltas because you're subtracting two real counter values:

```promql
last_over_time(counter[window]) - counter offset <period>
```

**Use `increase()` for:** short-window aggregations (5m, 15m), rate-over-time range queries, quick estimates.
**Use `last_over_time` minus offset for:** accurate daily/hourly counts, external reporting, billing-adjacent metrics.

**The `or` fallback pattern for counter deltas with offset:**
```promql
(last_over_time(metric[window]) - metric offset <period>)
or
last_over_time(metric[window])
```
The `or` clause is necessary because:
1. If no data point exists at the offset time (new series, counter reset, irregular scrapes), the subtraction returns null
2. The fallback returns the current value instead of dropping the series entirely
3. This may overcount for new series, but avoids data loss
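
Instantiated for per-event daily counts (metric and label taken from the reference above; the 1d window and offset are illustrative):

```promql
(
  sum by (event_name) (last_over_time(core_event_received_total[1d]))
  -
  sum by (event_name) (core_event_received_total offset 1d)
)
or
sum by (event_name) (last_over_time(core_event_received_total[1d]))
```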

**Histogram math:**
- Average = `rate(_sum) / rate(_count)` — always use `rate()` to handle counter resets
- Percentiles = `histogram_quantile(quantile, rate(_bucket))` — quantile is 0-1 (0.99 = p99)
- Never use `avg_over_time()` on `_count` — it averages the raw cumulative counter, not the observed values
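
Both rules applied to queue wait time (p95 chosen for illustration; this assumes the `_sum`/`_count` companion series that Prometheus histograms expose alongside `_bucket`):

```promql
# Average queue wait over the last hour
sum(rate(rubix_event_inflight_latency_seconds_sum[1h]))
/
sum(rate(rubix_event_inflight_latency_seconds_count[1h]))
```

```promql
# p95 queue wait — the quantile argument is 0-1
histogram_quantile(0.95, sum by (le) (rate(rubix_event_inflight_latency_seconds_bucket[1h])))
```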

## Common Query Pitfalls

### 1) Wrong labels for a metric

`core_event_received_total` does **not** carry a `status` label. Aggregating by `status` on that metric will produce null label sets. Check the metrics reference table above before adding labels to `sum by (...)`.
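
A wrong-vs-right pair (`source` and `event_name` are the labels this metric carries, per the reference table):

```promql
# Wrong — status is not a label here; everything collapses into one group with an empty status
sum by (status) (increase(core_event_received_total[1h]))
```

```promql
# Right — group by labels the metric actually exposes
sum by (source, event_name) (increase(core_event_received_total[1h]))
```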

### 2) Counter deltas with sparse/offset data

See the `or` pattern in the Prometheus Interpretation Guidelines above.

### 3) Histogram average computed incorrectly

Do not use `avg_over_time(..._count)`. For average latency, divide `rate(..._sum)` by `rate(..._count)`:

```promql
sum by (event_name) (rate(rubix_event_runtime_seconds_sum[1h]))
/
sum by (event_name) (rate(rubix_event_runtime_seconds_count[1h]))
```

### 4) Using `increase()` vs `rate()`

- `increase(metric[1h])` = total increase over 1 hour (absolute count)
- `rate(metric[5m])` = per-second rate averaged over 5 minutes
- Use `increase()` for "how many events in the last hour" questions
- Use `rate()` for time-series graphs, histogram math, and throughput measurements

### 5) BPP metrics have different labels

BPP and Feed metrics use `run_id`, `stage`, `feed_ref`, `data_type` — not `retailer_id` or `event_name`. Don't mix them with event processing metrics in the same aggregation.

### 6) Missing `> 0` filter

Counter-based queries often return zero-value results for inactive series. Append `> 0` to filter these out for cleaner output.
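
For example, listing only the event names that actually failed in the last hour (metric and `status` label from the reference above):

```promql
sum by (event_name) (increase(rubix_event_runtime_seconds_count{status="FAILED"}[1h])) > 0
```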

### 7) Invalid metric name returns success with empty results

Querying a non-existent metric (e.g., `core_event_received_incorrect`) does **not** return an error. The API returns `status: "success"` with an empty `result` array. Always verify metric names against the reference tables above before debugging empty responses.
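
One way to sanity-check a metric name before debugging an empty response is the standard series-count meta-query (the regex is illustrative):

```promql
count by (__name__) ({__name__=~"core_event_.*"})
```

If the name you expected is absent from the output, the query was empty because the metric does not exist, not because there was no activity.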

### 8) Staleness window on metricInstant

`metricInstant` identifies data as "stale" if no new data point exists within 5 minutes. If you query a specific timestamp and the latest data point is older than 5 minutes, you get no value — not an error. For sparse or infrequent events, use `metricRange` with a wider window instead.

### 9) 32-day maximum range

`metricRange` queries cannot span more than 32 days between `start` and `end`. For longer analysis windows, split into multiple queries or use `metricInstant` with different `time` values.
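
A sketch of the split, in the same step style as the workflows above (chunk size and step are illustrative):

```
1. Split [start, end] into consecutive windows of ≤32 days
2. For each window:
   metrics.query({ query: "<expr>", type: "range", start: "<window_start>", end: "<window_end>", step: "1h" })
3. Concatenate the per-window series client-side before analysis
```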

## Integration with Other Skills

| Need | Skill |
|------|-------|
| Drill into specific event details | `/fluent-event-api` |
| Root cause diagnosis for failures | `/fluent-trace` |
| Verify workflow configuration | `/fluent-workflow-analyzer` |
| MCP tool payload syntax | `/fluent-mcp-tools` |
| Run targeted test sequences | `/fluent-e2e-test` |
| Validate settings for webhook rules | `/fluent-settings` |