npm - @unrdf/observability - Versions diffs - 26.4.2 - Mend

@unrdf/observability 26.4.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31) hide show

package/.eslintrc.cjs +10 -0
package/IMPLEMENTATION-SUMMARY.md +478 -0
package/LICENSE +21 -0
package/README.md +482 -0
package/capability-map.md +90 -0
package/config/alert-rules.yml +269 -0
package/config/prometheus.yml +136 -0
package/dashboards/grafana-unrdf.json +798 -0
package/dashboards/unrdf-workflow-dashboard.json +295 -0
package/docs/OBSERVABILITY-PATTERNS.md +681 -0
package/docs/OBSERVABILITY-RUNBOOK.md +554 -0
package/examples/observability-demo.mjs +334 -0
package/package.json +46 -0
package/src/advanced-metrics.mjs +413 -0
package/src/alerts/alert-manager.mjs +436 -0
package/src/custom-events.mjs +558 -0
package/src/distributed-tracing.mjs +352 -0
package/src/exporters/grafana-exporter.mjs +415 -0
package/src/index.mjs +61 -0
package/src/metrics/workflow-metrics.mjs +346 -0
package/src/receipts/anchor.mjs +155 -0
package/src/receipts/index.mjs +62 -0
package/src/receipts/merkle-tree.mjs +188 -0
package/src/receipts/receipt-chain.mjs +209 -0
package/src/receipts/receipt-schema.mjs +128 -0
package/src/receipts/tamper-detection.mjs +219 -0
package/test/advanced-metrics.test.mjs +302 -0
package/test/custom-events.test.mjs +387 -0
package/test/distributed-tracing.test.mjs +314 -0
package/validation/observability-validation.mjs +366 -0
package/vitest.config.mjs +25 -0

package/README.md ADDED Viewed

@@ -0,0 +1,482 @@
+# @unrdf/observability
+**Innovative Prometheus/Grafana observability dashboard for UNRDF distributed workflows**
+Real-time monitoring, alerting, and visualization for workflow execution, resource utilization, and business metrics.
+## Features
+### Metrics Collection
+- **Workflow Execution Metrics**: Total executions, duration, active workflows
+- **Task Performance**: Execution time, success rate, queue depth
+- **Resource Utilization**: CPU, memory, disk monitoring
+- **Event Sourcing**: Events appended, store size tracking
+- **Business Metrics**: Policy evaluations, cryptographic receipts
+- **Latency Percentiles**: p50, p90, p95, p99 tracking
+### Alerting System
+- **Threshold-Based Alerts**: Configurable rules with hysteresis
+- **Anomaly Detection**: Statistical z-score analysis
+- **Webhook Notifications**: HTTP callbacks for alert events
+- **Alert Deduplication**: Smart grouping and correlation
+- **Severity Levels**: INFO, WARNING, CRITICAL
+### Grafana Dashboards
+- **Pre-built Dashboards**: Ready-to-use visualizations
+- **Real-time Updates**: 5-second refresh intervals
+- **Custom Variables**: Filter by workflow ID, pattern
+- **Alert Annotations**: Visual markers for events
+- **Exportable JSON**: Import directly into Grafana
+## Installation
+```bash
+pnpm add @unrdf/observability
+```
+## Quick Start
+### Basic Usage
+```javascript
+import { createWorkflowMetrics } from '@unrdf/observability';
+const metrics = createWorkflowMetrics({
+  enableDefaultMetrics: true,
+  prefix: 'unrdf_workflow_',
+  labels: { environment: 'production' },
+});
+// Record workflow execution
+metrics.recordWorkflowStart('wf-123', 'SEQUENCE');
+// ... execute workflow ...
+metrics.recordWorkflowComplete('wf-123', 'completed', 2.5, 'SEQUENCE');
+// Export metrics
+const prometheusMetrics = await metrics.getMetrics();
+```
+### Complete Observability Stack
+```javascript
+import { createObservabilityStack } from '@unrdf/observability';
+const { metrics, grafana, alerts } = await createObservabilityStack({
+  metrics: {
+    enableDefaultMetrics: true,
+    prefix: 'unrdf_workflow_',
+  },
+  alerts: {
+    rules: [
+      {
+        id: 'high-latency',
+        name: 'High Workflow Latency',
+        metric: 'workflow_duration',
+        threshold: 10,
+        operator: 'gt',
+        severity: 'warning',
+      },
+    ],
+    enableAnomalyDetection: true,
+  },
+  grafana: {
+    title: 'Production Workflows',
+    datasource: 'Prometheus',
+  },
+});
+// Listen for alerts
+alerts.on('alert', alert => {
+  console.log(`Alert: ${alert.name} - ${alert.severity}`);
+});
+```
+## Live Demo
+Run the included demo to see metrics in action:
+```bash
+cd packages/observability
+pnpm install
+pnpm demo
+```
+Then visit:
+- **Prometheus Metrics**: http://localhost:9090/metrics
+- **Metrics JSON**: http://localhost:9090/metrics/json
+- **Grafana Dashboard**: http://localhost:9090/dashboard
+- **Active Alerts**: http://localhost:9090/alerts
+- **Statistics**: http://localhost:9090/stats
+## API Reference
+### WorkflowMetrics
+#### Constructor
+```javascript
+new WorkflowMetrics(config);
+```
+**Config Options:**
+- `enableDefaultMetrics` (boolean): Enable Node.js default metrics (default: true)
+- `prefix` (string): Metric name prefix (default: 'unrdf*workflow*')
+- `labels` (object): Global labels for all metrics
+- `collectInterval` (number): Collection interval in ms (default: 10000)
+#### Methods
+##### recordWorkflowStart(workflowId, pattern)
+Record workflow execution start.
+##### recordWorkflowComplete(workflowId, status, durationSeconds, pattern)
+Record workflow completion with duration.
+##### recordTaskExecution(workflowId, taskId, taskType, status, durationSeconds)
+Record task execution metrics.
+##### updateTaskQueueDepth(workflowId, queueName, depth)
+Update task queue depth gauge.
+##### recordResourceUtilization(resourceType, resourceId, utilizationPercent)
+Record resource utilization (0-100%).
+##### recordEventAppended(eventType, workflowId)
+Record event appended to event store.
+##### recordPolicyEvaluation(policyName, result)
+Record policy evaluation result.
+##### recordCryptoReceipt(workflowId, algorithm)
+Record cryptographic receipt generation.
+##### recordError(errorType, workflowId, severity)
+Record error occurrence.
+##### getMetrics()
+Get metrics in Prometheus text format.
+##### getMetricsJSON()
+Get metrics in JSON format.
+### AlertManager
+#### Constructor
+```javascript
+new AlertManager(config);
+```
+**Config Options:**
+- `rules` (array): Initial alert rules
+- `webhooks` (array): Webhook endpoints
+- `checkInterval` (number): Rule check interval (default: 10000)
+- `enableAnomalyDetection` (boolean): Enable anomaly detection (default: true)
+#### Methods
+##### addRule(rule)
+Add alert rule.
+**Rule Schema:**
+```javascript
+{
+  id: 'unique-rule-id',
+  name: 'Human Readable Name',
+  metric: 'metric_name',
+  threshold: 100,
+  operator: 'gt', // gt, lt, gte, lte, eq
+  severity: 'warning', // info, warning, critical
+  duration: 60000, // ms
+  enabled: true
+}
+```
+##### evaluateMetric(metricName, value, labels)
+Evaluate metric against all rules.
+##### getActiveAlerts()
+Get currently active alerts.
+##### getAlertHistory(filters)
+Get alert history with optional filters.
+##### getStatistics()
+Get alert statistics.
+#### Events
+- `alert`: Fired when alert triggers
+- `alert:resolved`: Fired when alert resolves
+- `webhook:error`: Fired on webhook delivery failure
+### GrafanaExporter
+#### Constructor
+```javascript
+new GrafanaExporter(config);
+```
+**Config Options:**
+- `title` (string): Dashboard title
+- `datasource` (string): Prometheus datasource name (default: 'Prometheus')
+- `refreshInterval` (string): Dashboard refresh interval (default: '5s')
+- `tags` (array): Dashboard tags
+#### Methods
+##### generateDashboard()
+Generate complete Grafana dashboard configuration.
+##### exportJSON(pretty)
+Export dashboard as JSON string.
+##### generateAlertDashboard()
+Generate alert-focused dashboard.
+## Metrics Collected
+### Workflow Metrics
+- `unrdf_workflow_executions_total` (Counter): Total workflow executions
+- `unrdf_workflow_execution_duration_seconds` (Histogram): Workflow duration
+- `unrdf_workflow_active_workflows` (Gauge): Active workflows
+### Task Metrics
+- `unrdf_workflow_task_executions_total` (Counter): Total task executions
+- `unrdf_workflow_task_duration_seconds` (Histogram): Task duration
+- `unrdf_workflow_task_queue_depth` (Gauge): Queue depth
+### Resource Metrics
+- `unrdf_workflow_resource_utilization` (Gauge): Resource utilization %
+- `unrdf_workflow_resource_allocations_total` (Counter): Resource allocations
+### Event Sourcing Metrics
+- `unrdf_workflow_events_appended_total` (Counter): Events appended
+- `unrdf_workflow_event_store_size_bytes` (Gauge): Event store size
+### Business Metrics
+- `unrdf_workflow_policy_evaluations_total` (Counter): Policy evaluations
+- `unrdf_workflow_crypto_receipts_total` (Counter): Crypto receipts
+- `unrdf_workflow_errors_total` (Counter): Errors
+### Performance Metrics
+- `unrdf_workflow_latency_percentiles` (Summary): Latency percentiles
+## Integration with Prometheus
+### Prometheus Configuration
+```yaml
+scrape_configs:
+  - job_name: 'unrdf-workflows'
+    scrape_interval: 5s
+    static_configs:
+      - targets: ['localhost:9090']
+```
+### Alert Rules (prometheus.yml)
+```yaml
+groups:
+  - name: unrdf_workflow_alerts
+    interval: 30s
+    rules:
+      - alert: HighWorkflowErrorRate
+        expr: rate(unrdf_workflow_errors_total[5m]) > 1
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: 'High workflow error rate detected'
+          description: 'Error rate is {{ $value }} errors/sec'
+      - alert: HighResourceUtilization
+        expr: unrdf_workflow_resource_utilization > 90
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: 'Resource utilization above 90%'
+          description: '{{ $labels.resource_type }} on {{ $labels.resource_id }}'
+```
+## Grafana Dashboard Import
+1. Export dashboard JSON:
+```bash
+curl http://localhost:9090/dashboard/export > dashboard.json
+```
+2. Import in Grafana:
+   - Navigate to Dashboards → Import
+   - Upload `dashboard.json`
+   - Select Prometheus datasource
+   - Click Import
+## Architecture
+```
+┌─────────────────────────────────────────────┐
+│          Workflow Application               │
+│  (Records metrics via WorkflowMetrics)      │
+└─────────────────┬───────────────────────────┘
+                  │
+                  ↓
+┌─────────────────────────────────────────────┐
+│        Prometheus Metrics Endpoint          │
+│            (Express Server)                 │
+│         http://localhost:9090/metrics       │
+└─────────────────┬───────────────────────────┘
+                  │
+                  ↓
+┌─────────────────────────────────────────────┐
+│         Prometheus Server (Scraper)         │
+│     - Scrapes metrics every 5s              │
+│     - Stores time-series data               │
+│     - Evaluates alert rules                 │
+└─────────────────┬───────────────────────────┘
+                  │
+                  ↓
+┌─────────────────────────────────────────────┐
+│           Grafana Dashboard                 │
+│     - Visualizes metrics                    │
+│     - Real-time graphs                      │
+│     - Alert annotations                     │
+└─────────────────────────────────────────────┘
+```
+## Alert Flow
+```
+Metric Value
+    ↓
+AlertManager.evaluateMetric()
+    ↓
+Check Threshold Rules
+    ↓
+Check Anomaly Detection (z-score)
+    ↓
+Alert Triggered?
+    ↓
+Emit 'alert' Event
+    ↓
+Send Webhook Notifications
+    ↓
+Store in Alert History
+```
+## Performance
+- **Overhead**: <1ms per metric recording
+- **Memory**: ~50MB for 1000 workflows
+- **Throughput**: 10,000+ metrics/sec
+- **Alert Latency**: <100ms detection to notification
+## Best Practices
+### 1. Metric Cardinality
+Keep label cardinality low to avoid memory issues:
+```javascript
+// ✅ Good - bounded cardinality
+metrics.recordWorkflowComplete(workflowId, 'completed', duration, 'SEQUENCE');
+// ❌ Bad - unbounded cardinality
+// metrics.recordWorkflow(userId, timestamp, randomValue);
+```
+### 2. Alert Thresholds
+Set thresholds based on baseline + 2σ:
+```javascript
+const baseline = 5; // seconds
+const stdDev = 1.5;
+const threshold = baseline + 2 * stdDev; // 8 seconds
+alerts.addRule({
+  id: 'high-latency',
+  metric: 'workflow_duration',
+  threshold,
+  operator: 'gt',
+  severity: 'warning',
+});
+```
+### 3. Dashboard Organization
+- Group related panels
+- Use template variables for filtering
+- Set appropriate time ranges
+- Enable auto-refresh (5-10s)
+### 4. Webhook Reliability
+- Use exponential backoff
+- Implement idempotency
+- Monitor webhook failures
+- Set reasonable timeouts
+## Troubleshooting
+### Metrics Not Appearing
+1. Check metrics endpoint: `curl http://localhost:9090/metrics`
+2. Verify Prometheus scraping: Prometheus UI → Targets
+3. Check for metric name typos
+### Alerts Not Firing
+1. Verify rule configuration: `alerts.rules`
+2. Check metric values: `alerts.metricHistory`
+3. Enable debug logging: `alerts.on('alert', console.log)`
+### High Memory Usage
+1. Reduce metric history: limit to 100-500 samples
+2. Lower cardinality: fewer unique label combinations
+3. Increase scrape interval: 10-30s instead of 5s
+## License
+MIT
+## Contributing
+See main UNRDF repository for contribution guidelines.

package/capability-map.md ADDED Viewed

@@ -0,0 +1,90 @@
+# Capability Map: @unrdf/observability
+**Generated:** 2025-12-28
+**Package:** @unrdf/observability
+**Version:** 1.0.0
+---
+## Description
+Innovative Prometheus/Grafana observability dashboard for UNRDF distributed workflows
+---
+## Capability Atoms
+### A57: OTEL Integration
+**Runtime:** Node.js
+**Invariants:** tracing, metrics, logs
+**Evidence:** `packages/observability/src/index.mjs`
+---
+## Package Metadata
+### Dependencies
+- `prom-client`: ^15.1.0
+- `@opentelemetry/api`: ^1.9.0
+- `@opentelemetry/exporter-prometheus`: ^0.49.0
+- `@opentelemetry/sdk-metrics`: ^1.21.0
+- `express`: ^4.18.2
+- `zod`: ^4.1.13
+### Exports
+- `.`: `./src/index.mjs`
+- `./metrics`: `./src/metrics/workflow-metrics.mjs`
+- `./exporters`: `./src/exporters/grafana-exporter.mjs`
+- `./alerts`: `./src/alerts/alert-manager.mjs`
+---
+## Integration Patterns
+### Primary Use Cases
+1. **OTEL Integration**
+   - Import: `import { /* exports */ } from '@unrdf/observability'`
+   - Use for: OTEL Integration operations
+   - Runtime: Node.js
+### Composition Examples
+```javascript
+import { createStore } from '@unrdf/oxigraph';
+import {} from /* functions */ '@unrdf/observability';
+const store = createStore();
+// Use observability capabilities with store
+```
+---
+## Evidence Trail
+- **A57**: `packages/observability/src/index.mjs`
+---
+## Next Steps
+1. **Explore API Surface**
+   - Review exports in package.json
+   - Read source files in `src/` directory
+2. **Integration Testing**
+   - Create test cases using package capabilities
+   - Verify compatibility with dependent packages
+3. **Performance Profiling**
+   - Benchmark key operations
+   - Measure runtime characteristics
+---
+**Status:** GENERATED
+**Method:** Systematic extraction from capability-basis.md + package.json analysis
+**Confidence:** 95% (evidence-based)