@manifest-cyber/observability-ts 0.2.0 → 0.2.2

@@ -1,758 +0,0 @@
1
- # Distributed Tracing Implementation Plan for Manifest Cyber
2
-
3
- **Status:** Implementation Ready
4
- **Date:** November 11, 2025
5
- **Scope:** End-to-end distributed tracing across web-app, manifest-api, message queues, and job-\* services
6
-
7
- ---
8
-
9
- ## Executive Summary
10
-
11
- This document outlines the comprehensive approach to implement distributed tracing across Manifest Cyber's event-driven architecture using OpenTelemetry (OTEL) standards, supporting both SaaS (ECS) and on-premise (K8s) deployments.
12
-
13
- ### Architecture Overview
14
-
15
- ```
- ┌──────────┐   HTTP    ┌──────────────┐    SQS/RabbitMQ     ┌────────────┐
- │ web-app  │ ────────► │ manifest-api │ ──────────────────► │   job-*    │
- │ (React)  │           │  (Express)   │                     │ (Workers)  │
- └──────────┘           └──────────────┘                     └────────────┘
-      │                        │                                   │
-      │ W3C Traceparent        │ W3C Traceparent                   │
-      │ via HTTP Headers       │ via Message Attributes            │
-      │                        │                                   │
-      └────────────────────────┴───────────────────────────────────┘
-                               │
-                               ▼
-                       ┌────────────────┐
-                       │     Vector     │
-                       │   (Sidecar)    │
-                       └────────────────┘
-                               │
-                               ▼
-                       ┌────────────────┐
-                       │ VictoriaTraces │
-                       │ (OTLP 4317/18) │
-                       └────────────────┘
-                               │
-                               ▼
-                       ┌────────────────┐
-                       │    Grafana     │
-                       │   (Tempo UI)   │
-                       └────────────────┘
- ```
44
-
45
- ---
46
-
47
- ## Current State Analysis
48
-
49
- ### ✅ What's Already Working
50
-
51
- 1. **Logger-ts with Trace Support**
52
- - `TraceContext` class with trace_id, span_id, parent_id
53
- - `OtelLogger` with automatic trace extraction
54
- - W3C traceparent parsing in messaging layer
55
- - AsyncLocalStorage for context propagation
56
-
57
- 2. **Messaging Infrastructure (job-common)**
58
- - W3C traceparent creation/parsing utilities
59
- - Trace context propagation via message attributes
60
- - Integration with logger-ts for trace-aware logging
61
-
62
- 3. **Observability Stack**
63
- - Vector deployed as ECS sidecar
64
- - VictoriaTraces endpoint: `https://victoria-traces.development.manifestcyber.dev`
65
- - OTLP ports: 4317 (gRPC), 4318 (HTTP)
66
- - Grafana for visualization
67
-
68
- 4. **Metrics Infrastructure**
69
- - Prometheus metrics collection via Vector
70
- - VictoriaMetrics storage
71
- - Service-level instrumentation pattern established
72
-
73
- ### 🔴 What's Missing
74
-
75
- 1. **OpenTelemetry Instrumentation**
76
- - No OTEL SDK integration in any service
77
- - No automatic span creation for HTTP requests
78
- - No automatic span creation for queue operations
79
- - No OTLP exporter configured
80
-
81
- 2. **Trace Propagation Standards**
82
- - No standardized baggage/tracestate handling
83
- - No sampling strategy defined
84
- - No trace ID generation at entry points
85
-
86
- 3. **Service-Level Integration**
87
- - web-app: No trace injection in HTTP calls
88
- - manifest-api: No auto-instrumentation
89
- - job-\*: No trace continuation from messages
90
-
91
- 4. **Infrastructure Configuration**
92
- - Vector not configured for OTLP ingestion
93
- - No trace collection pipeline defined
94
- - K8s on-prem configuration missing
95
-
96
- ---
97
-
98
- ## Technical Architecture
99
-
100
- ### 1. Trace Flow: Web App → API → Queue → Job
101
-
102
- #### Entry Point: User-Initiated (web-app)
103
-
104
- ```typescript
- // web-app/src/api/client.ts
- import axios from 'axios';
- import { trace, context } from '@opentelemetry/api';
-
- // Tracer name is illustrative
- const tracer = trace.getTracer('web-app');
-
- const span = tracer.startSpan('http.request', {
-   attributes: {
-     'http.method': 'POST',
-     'http.url': '/api/sboms/upload',
-   },
- });
-
- context.with(trace.setSpan(context.active(), span), () => {
-   // No manual traceparent header: OTEL instrumentation injects it automatically
-   axios.post('/api/sboms/upload', data).finally(() => span.end());
- });
- ```
123
-
124
- #### Ingress: API Receives Request (manifest-api)
125
-
126
- ```typescript
127
- // manifest-api auto-instrumentation extracts traceparent
128
- // Express middleware creates span automatically
129
- // Pass to logger:
130
- const logger = OtelLogger.create(baseLogger, { autoTrace: true });
131
- logger.info('Processing SBOM upload', { userId, organizationId });
132
- ```
133
-
134
- #### Egress: API → Queue (manifest-api)
135
-
136
- ```typescript
- // Use job-common messaging with trace context
- import { createTraceparentFromLogger } from '@manifest-cyber/job-common';
-
- const traceparent = createTraceparentFromLogger(logger);
- await messageClient.send('sbom-process-queue', {
-   sbomId,
-   organizationId,
-   traceContext: { traceparent },
- });
- ```
147
-
148
- #### Ingress: Job Receives Message (job-sbom-process)
149
-
150
- ```typescript
151
- // job-common automatically extracts trace context
152
- import { createLoggerWithMessageTrace } from '@manifest-cyber/job-common';
153
-
154
- const logger = createLoggerWithMessageTrace(message, baseLogger);
155
- // Logger now has trace context from message
156
- ```
157
-
158
- #### Entry Point: System-Initiated (cron job)
159
-
160
- ```typescript
- // job-daily-vuln-match/src/invoke.ts
- import { OtelLogger, TraceContext } from '@manifest-cyber/logger-ts';
-
- // No upstream caller exists, so the job starts a new trace
- const traceContext = TraceContext.generateRandom();
- const logger = OtelLogger.create(baseLogger, {
-   fields: { trace_id: traceContext.trace_id, span_id: traceContext.span_id },
- });
- ```
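A minimal sketch of what generating a fresh root trace context could look like for a system-initiated job. This is not the `TraceContext.generateRandom()` implementation from logger-ts, which is not shown in this plan; `newRootTraceContext`, `randomHexId`, and the `RootTraceContext` shape are hypothetical names, and the sketch assumes IDs follow the W3C sizes (16-byte trace ID, 8-byte span ID).

```typescript
import { randomBytes } from 'node:crypto';

// Hypothetical shape; field names mirror the trace_id/span_id fields used above.
interface RootTraceContext {
  trace_id: string; // 32 lowercase hex characters
  span_id: string; // 16 lowercase hex characters
}

// Returns `bytes` random bytes as lowercase hex. Retries in the extremely
// unlikely case of an all-zero value, which W3C Trace Context treats as invalid.
function randomHexId(bytes: number): string {
  let id = randomBytes(bytes).toString('hex');
  while (/^0+$/.test(id)) {
    id = randomBytes(bytes).toString('hex');
  }
  return id;
}

// Creates the IDs for a brand-new trace (no parent span).
function newRootTraceContext(): RootTraceContext {
  return { trace_id: randomHexId(16), span_id: randomHexId(8) };
}
```

Because a cron job has no upstream caller, every invocation becomes its own trace root; anything it enqueues can then carry these IDs downstream.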
169
-
170
- ### 2. OpenTelemetry SDK Integration
171
-
172
- #### Core Dependencies
173
-
174
- ```json
- {
-   "@opentelemetry/sdk-node": "^0.54.0",
-   "@opentelemetry/auto-instrumentations-node": "^0.51.0",
-   "@opentelemetry/exporter-trace-otlp-grpc": "^0.54.0",
-   "@opentelemetry/resources": "^1.28.0",
-   "@opentelemetry/semantic-conventions": "^1.28.0"
- }
- ```
183
-
184
- #### Initialization Pattern (Backend Services)
185
-
186
- ```typescript
- // src/tracing.ts
- import { NodeSDK } from '@opentelemetry/sdk-node';
- import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
- import { Resource } from '@opentelemetry/resources';
- import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
- import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
-
- const sdk = new NodeSDK({
-   resource: new Resource({
-     [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
-     environment: process.env.ENV || 'development',
-     'service.version': process.env.VERSION || '0.0.0',
-   }),
-   traceExporter: new OTLPTraceExporter({
-     url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
-   }),
-   instrumentations: [
-     getNodeAutoInstrumentations({
-       '@opentelemetry/instrumentation-http': { enabled: true },
-       '@opentelemetry/instrumentation-express': { enabled: true },
-       '@opentelemetry/instrumentation-mongodb': { enabled: true },
-       '@opentelemetry/instrumentation-aws-sdk': { enabled: true },
-     }),
-   ],
- });
-
- sdk.start();
-
- process.on('SIGTERM', () => {
-   sdk.shutdown().finally(() => process.exit(0));
- });
- ```
219
-
220
- ### 3. Message Queue Trace Propagation
221
-
222
- #### W3C Traceparent Format
223
-
224
- ```
225
- 00-{trace-id:32-hex}-{span-id:16-hex}-{flags:2-hex}
226
- Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
227
- ```
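To make the format concrete, here is a small sketch of parsing and formatting a `traceparent` value. The helper names (`parseTraceparent`, `formatTraceparent`) and the `ParsedTraceparent` type are hypothetical and are not the job-common utilities mentioned earlier. The sketch only accepts the four-field form shown above and rejects the all-zero IDs that W3C Trace Context defines as invalid.

```typescript
// Hypothetical helper types and names, for illustration only.
interface ParsedTraceparent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// version (2 hex) - trace-id (32 hex) - span-id (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid four-field traceparent.
function parseTraceparent(header: string): ParsedTraceparent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" is forbidden; all-zero trace and span IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Builds a version-00 traceparent from already-validated IDs.
function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}
```

Applied to the example value above, parsing yields trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, span ID `00f067aa0ba902b7`, and a set sampled flag.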
228
-
229
- #### SQS Message Attributes
230
-
231
- ```typescript
- // Producer (manifest-api)
- await sqs.sendMessage({
-   QueueUrl: queueUrl,
-   MessageBody: JSON.stringify(payload),
-   MessageAttributes: {
-     traceparent: {
-       DataType: 'String',
-       StringValue: traceparent, // W3C format
-     },
-     // Optional: tracestate carries vendor-specific trace data
-     // (W3C Baggage travels in a separate `baggage` header)
-     tracestate: {
-       DataType: 'String',
-       StringValue: 'key1=value1,key2=value2',
-     },
-   },
- });
- ```
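The producer example has a consumer-side counterpart. The following is a sketch under stated assumptions, not the job-common implementation: `extractTraceHeaders` and the local types are hypothetical and only mirror the attribute shape used by the producer above.

```typescript
// Minimal local types mirroring the SQS attribute shape used by the producer.
interface StringAttribute {
  DataType: string;
  StringValue?: string;
}
type MessageAttributes = Record<string, StringAttribute | undefined>;

interface IncomingTraceHeaders {
  traceparent: string | undefined;
  tracestate: string | undefined;
}

// Hypothetical consumer-side helper: reads the trace attributes written by
// the producer and ignores attributes that are missing or not strings.
function extractTraceHeaders(attributes: MessageAttributes | undefined): IncomingTraceHeaders {
  const read = (key: string): string | undefined => {
    const attr = attributes?.[key];
    return attr?.DataType === 'String' ? attr.StringValue : undefined;
  };
  return { traceparent: read('traceparent'), tracestate: read('tracestate') };
}
```

Note that SQS returns message attributes only when the consumer asks for them via `MessageAttributeNames` on `ReceiveMessage`, so a consumer that omits that parameter would see no trace context at all.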
249
-
250
- #### RabbitMQ Message Properties
251
-
252
- ```typescript
- // Producer (manifest-api)
- await channel.sendToQueue(queueName, Buffer.from(JSON.stringify(payload)), {
-   headers: {
-     traceparent: traceparent, // W3C format
-     tracestate: 'key1=value1,key2=value2', // Optional
-   },
-   persistent: true,
- });
- ```
262
-
263
- ### 4. Sampling Strategy
264
-
265
- #### Development/Staging
266
-
267
- - **Rate:** 100% (always sample)
268
- - **Reason:** Full visibility for debugging
269
-
270
- #### Production
271
-
272
- - **Rate:** 10% (probabilistic sampling)
273
- - **Head-based sampling:** Decision at trace root
274
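- - **Override:** Always sample errors (500 status, exceptions). A head-based sampler decides before the outcome is known, so this override needs a tail-based decision (for example, in the collector)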
275
-
276
- ```typescript
- import {
-   ParentBasedSampler,
-   TraceIdRatioBasedSampler,
- } from '@opentelemetry/sdk-trace-base';
-
- const sampler = new ParentBasedSampler({
-   root: new TraceIdRatioBasedSampler(process.env.ENV === 'production' ? 0.1 : 1.0),
- });
- ```
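To illustrate why a ratio-based decision is consistent across services, here is a simplified sketch. It is not the algorithm used by `TraceIdRatioBasedSampler`; `sampleByTraceId` and `shouldSample` are hypothetical names, and the bucketing (first 8 hex characters of the trace ID) is an assumption chosen for readability.

```typescript
// Simplified illustration only; not the library's algorithm.
// Maps the first 8 hex characters of the trace ID onto [0, 1) and samples
// when that value falls below the configured ratio. The same trace ID always
// produces the same decision, regardless of which service evaluates it.
function sampleByTraceId(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

// Parent-based behaviour: honour the parent's decision when there is one,
// otherwise fall back to the ratio decision at the trace root.
function shouldSample(traceId: string, ratio: number, parentSampled?: boolean): boolean {
  return parentSampled ?? sampleByTraceId(traceId, ratio);
}
```

With this shape, a job that receives a `traceparent` whose sampled flag is set keeps recording even at a 10% root ratio, which is what keeps traces complete end to end.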
286
-
287
- ---
288
-
289
- ## Implementation Roadmap
290
-
291
- ### Phase 1: Core Library Support (Week 1) ✅ COMPLETE
292
-
293
- **Goal:** Create unified observability library with metrics + tracing
294
-
295
- - [x] Rename package to `@manifest-cyber/observability-ts`
296
- - [x] Add OpenTelemetry dependencies
297
- - [x] Create src/tracing/ module with SDK initialization helpers
298
- - [x] Create span management utilities (spans.ts)
299
- - [x] Create context propagation helpers (context.ts)
300
- - [x] Reorganize metrics into src/metrics/ for modularity
301
- - [x] Configure subpath exports for tree-shaking
302
- - [x] Update documentation with tracing examples
303
-
304
- **Deliverables:**
305
-
306
- - `@manifest-cyber/observability-ts` v0.2.0 published
307
- - Deprecated `@manifest-cyber/metrics` on NPM
308
- - [TRACING_GUIDE.md](./TRACING_GUIDE.md) - Complete tracing documentation
309
- - [MIGRATION_GUIDE.md](./MIGRATION_GUIDE.md) - Migration instructions
310
- - Backward-compatible metrics API
311
-
312
- ### Phase 2: Pilot Service (manifest-api) (Week 2)
313
-
314
- **Goal:** End-to-end tracing for one critical service
315
-
316
- **Tasks:**
317
-
318
- 1. Initialize OTEL SDK in manifest-api
319
- 2. Configure OTLP exporter (Vector/VictoriaTraces)
320
- 3. Enable auto-instrumentation (HTTP, Express, MongoDB)
321
- 4. Add custom spans for business logic
322
- 5. Test trace propagation to downstream jobs
323
-
324
- **Success Criteria:**
325
-
326
- - Traces visible in Grafana
327
- - HTTP spans include request metadata
328
- - Traces propagate through SQS to job-sbom-process
329
- - Logs include trace_id/span_id
330
-
331
- ### Phase 3: Message Queue Jobs (Week 3)
332
-
333
- **Goal:** Instrument 3 high-volume jobs
334
-
335
- **Services:**
336
-
337
- - job-sbom-process
338
- - job-vulnerability-match
339
- - job-component-process
340
-
341
- **Tasks:**
342
-
343
- 1. Add OTEL SDK initialization to each job
344
- 2. Update message consumers to extract trace context
345
- 3. Create spans for processing stages
346
- 4. Link spans to parent API traces
347
- 5. Add error tracking with span status
348
-
349
- **Success Criteria:**
350
-
351
- - Full trace visibility from API → Queue → Job
352
- - Job spans linked to API spans
353
- - Error traces captured
354
- - Average tracing overhead <500ms per request
355
-
356
- ### Phase 4: Web App (Week 4)
357
-
358
- **Goal:** Frontend trace initiation
359
-
360
- **Tasks:**
361
-
362
- 1. Add the OpenTelemetry web SDK (`@opentelemetry/sdk-trace-web`)
363
- 2. Configure trace initiation on user actions
364
- 3. Inject traceparent in HTTP requests
365
- 4. Add resource timing spans
366
- 5. Configure sampling (1% production)
367
-
368
- **Success Criteria:**
369
-
370
- - User action → API trace correlation
371
- - Performance metrics captured
372
- - Minimal bundle size impact (<50KB)
373
-
374
- ### Phase 5: Infrastructure & Rollout (Week 5-6)
375
-
376
- **Goal:** Production-ready infrastructure
377
-
378
- **Tasks:**
379
-
380
- 1. Configure Vector for OTLP ingestion (ECS)
381
- 2. Deploy OTEL collector for K8s
382
- 3. Configure Grafana Tempo datasource
383
- 4. Create trace dashboards
384
- 5. Instrument remaining 20+ services
385
- 6. Production rollout with 10% sampling
386
-
387
- **Success Criteria:**
388
-
389
- - 100% service coverage
390
- - <5% trace data loss
391
- - Query latency <3s in Grafana
392
- - Cost <$200/month trace storage
393
-
394
- ---
395
-
396
- ## Deployment Considerations
397
-
398
- ### ECS (SaaS) Configuration
399
-
400
- #### Vector Sidecar Update
401
-
402
- ```yaml
- # infrastructure/vector/vector-config.yaml
- sources:
-   otlp_grpc:
-     type: 'opentelemetry'
-     address: '0.0.0.0:4317'
-     grpc:
-       enabled: true
-
-   otlp_http:
-     type: 'opentelemetry'
-     address: '0.0.0.0:4318'
-     http:
-       enabled: true
-
- sinks:
-   victoria_traces:
-     type: 'opentelemetry'
-     inputs: ['otlp_grpc', 'otlp_http']
-     endpoint: 'https://victoria-traces.development.manifestcyber.dev'
-     compression: 'gzip'
-     batch:
-       max_events: 1000
-       timeout_secs: 10
- ```
427
-
428
- #### ECS Task Environment Variables
429
-
430
- ```bash
431
- OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
432
- OTEL_SERVICE_NAME=${SERVICE_NAME}
433
- OTEL_RESOURCE_ATTRIBUTES=environment=${ENV},cluster=${CLUSTER_NAME}
434
- OTEL_TRACES_SAMPLER=parentbased_traceidratio
435
- OTEL_TRACES_SAMPLER_ARG=0.1 # 10% in production
436
- ```
437
-
438
- ### Kubernetes (On-Prem) Configuration
439
-
440
- #### OTEL Collector DaemonSet
441
-
442
- ```yaml
- # k3s-on-prem/kustomize/apps/otel-collector/daemonset.yaml
- apiVersion: apps/v1
- kind: DaemonSet
- metadata:
-   name: otel-collector
-   namespace: observability
- spec:
-   selector:
-     matchLabels:
-       app: otel-collector
-   template:
-     metadata:
-       labels:
-         app: otel-collector
-     spec:
-       containers:
-         - name: otel-collector
-           image: otel/opentelemetry-collector-contrib:latest
-           ports:
-             - containerPort: 4317 # OTLP gRPC
-               hostPort: 4317
-             - containerPort: 4318 # OTLP HTTP
-               hostPort: 4318
-           volumeMounts:
-             - name: config
-               mountPath: /etc/otel
-           env:
-             - name: VICTORIA_TRACES_ENDPOINT
-               value: 'http://victoria-traces.observability.svc.cluster.local:4317'
-       volumes:
-         - name: config
-           configMap:
-             name: otel-collector-config
- ```
477
-
478
- #### OTEL Collector ConfigMap
479
-
480
- ```yaml
- apiVersion: v1
- kind: ConfigMap
- metadata:
-   name: otel-collector-config
-   namespace: observability
- data:
-   config.yaml: |
-     receivers:
-       otlp:
-         protocols:
-           grpc:
-             endpoint: 0.0.0.0:4317
-           http:
-             endpoint: 0.0.0.0:4318
-
-     processors:
-       batch:
-         timeout: 10s
-         send_batch_size: 1024
-
-       resource:
-         attributes:
-           - key: deployment.environment
-             value: ${env:ENVIRONMENT}
-             action: insert
-
-     exporters:
-       otlp:
-         endpoint: victoria-traces.observability.svc.cluster.local:4317
-         tls:
-           insecure: true
-
-     service:
-       pipelines:
-         traces:
-           receivers: [otlp]
-           processors: [batch, resource]
-           exporters: [otlp]
- ```
520
-
521
- ---
522
-
523
- ## Best Practices & Patterns
524
-
525
- ### 1. Span Naming Conventions
526
-
527
- Follow OpenTelemetry semantic conventions:
528
-
529
- ```typescript
- // ✅ Good
- tracer.startSpan('http.client.request', {
-   attributes: {
-     'http.method': 'POST',
-     'http.url': '/api/users',
-     'http.status_code': 200,
-   },
- });
-
- // ❌ Bad
- tracer.startSpan('API Call', {
-   attributes: { url: '/api/users' },
- });
- ```
544
-
545
- ### 2. Error Handling
546
-
547
- Always set span status on errors:
548
-
549
- ```typescript
- import { SpanStatusCode } from '@opentelemetry/api';
-
- try {
-   await processData();
-   span.setStatus({ code: SpanStatusCode.OK });
- } catch (error) {
-   // `error` is `unknown` under strict TypeScript, so narrow it before use
-   const err = error instanceof Error ? error : new Error(String(error));
-   span.setStatus({
-     code: SpanStatusCode.ERROR,
-     message: err.message,
-   });
-   span.recordException(err);
-   throw error;
- } finally {
-   span.end();
- }
- ```
564
-
565
- ### 3. Cardinality Management
566
-
567
- ```typescript
568
- // ✅ Good - bounded cardinality
569
- span.setAttribute('http.status_code', 200);
570
- span.setAttribute('user.role', 'admin');
571
-
572
- // ❌ Bad - unbounded cardinality
573
- span.setAttribute('user.id', 'user-12345-abcde-...'); // Too unique
574
- span.setAttribute('request.body', JSON.stringify(body)); // Too large
575
- ```
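One way to enforce these rules in code is to pass attributes through a small filter before setting them on a span. This is a minimal sketch; `sanitizeAttributes`, the allowlist contents, and the 256-character limit are all hypothetical choices, not part of observability-ts.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical allowlist and size limit; tune both to your own conventions.
const ALLOWED_KEYS = new Set(['http.status_code', 'user.role', 'organization.tier']);
const MAX_STRING_LENGTH = 256;

// Keeps only allowlisted keys and truncates long string values, so unbounded
// identifiers and large payloads never reach the span.
function sanitizeAttributes(
  input: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const output: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(input)) {
    if (!ALLOWED_KEYS.has(key)) continue;
    output[key] = typeof value === 'string' ? value.slice(0, MAX_STRING_LENGTH) : value;
  }
  return output;
}
```

The result can be handed to `span.setAttributes(...)`. An allowlist is used rather than a denylist so that newly introduced keys are dropped by default until someone reviews them.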
576
-
577
- ### 4. Context Propagation in Async Operations
578
-
579
- ```typescript
- import { context, trace } from '@opentelemetry/api';
-
- // Tracer name is illustrative
- const tracer = trace.getTracer('background-worker');
-
- async function processInBackground(data: any) {
-   // Capture current context
-   const currentContext = context.active();
-
-   // Run in background with context
-   setTimeout(() => {
-     context.with(currentContext, () => {
-       const span = tracer.startSpan('background.process');
-       // Process data with trace context
-       span.end();
-     });
-   }, 1000);
- }
- ```
596
-
597
- ---
598
-
599
- ## Monitoring & Alerting
600
-
601
- ### Key Metrics to Track
602
-
603
- 1. **Trace Volume**
604
- - `traces_received_total`
605
- - `traces_dropped_total`
606
- - `trace_export_errors_total`
607
-
608
- 2. **Trace Latency**
609
- - `trace_export_duration_seconds`
610
- - `span_processing_duration_seconds`
611
-
612
- 3. **Sampling Rates**
613
- - `traces_sampled_ratio`
614
- - `traces_head_sampled_total`
615
-
616
- ### Alerts
617
-
618
- ```yaml
- # vmalert/alerts/tracing.yaml
- groups:
-   - name: tracing
-     interval: 1m
-     rules:
-       - alert: HighTraceDropRate
-         expr: |
-           rate(traces_dropped_total[5m]) / rate(traces_received_total[5m]) > 0.05
-         for: 5m
-         annotations:
-           summary: 'High trace drop rate detected'
-           description: '{{ $value | humanizePercentage }} of traces are being dropped'
-
-       - alert: TraceExportFailures
-         expr: rate(trace_export_errors_total[5m]) > 10
-         for: 5m
-         annotations:
-           summary: 'Trace export failures detected'
- ```
638
-
639
- ---
640
-
641
- ## Cost Analysis
642
-
643
- ### Expected Costs (Production at Scale)
644
-
645
- | Component | Monthly Cost | Notes |
646
- | ----------------------- | ------------ | ---------------------------------- |
647
- | VictoriaTraces Storage | $50-100 | 7 days retention, 10% sampling |
648
- | Vector CPU/Memory (ECS) | $30-50 | Sidecar overhead across tasks |
649
- | OTEL Collector (K8s) | $20-30 | DaemonSet across nodes |
650
- | Network Transfer | $10-20 | OTLP data to VictoriaTraces |
651
- | **Total** | **$110-200** | **Compared to $800/mo CloudWatch** |
652
-
653
- ### Cost Optimization Strategies
654
-
655
- 1. **Sampling:** 10% production, 100% dev/staging
656
- 2. **Retention:** 7 days hot, 30 days cold (S3)
657
- 3. **Filtering:** Drop health check traces
658
- 4. **Compression:** gzip for OTLP payloads
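The filtering strategy can be sketched as a simple predicate. The path list and the `isHealthCheck` name are hypothetical; adjust them to the routes your services actually expose.

```typescript
// Hypothetical health-check paths; adjust to the routes your services expose.
const HEALTH_CHECK_PATHS = new Set(['/health', '/healthz', '/ready']);

// Returns true when the request URL targets a health-check route.
function isHealthCheck(url: string | undefined): boolean {
  if (!url) return false;
  // Ignore any query string and trailing slashes before comparing.
  const path = url.split('?')[0].replace(/\/+$/, '') || '/';
  return HEALTH_CHECK_PATHS.has(path);
}
```

One place to wire this in is the `ignoreIncomingRequestHook` option of `@opentelemetry/instrumentation-http`, for example `ignoreIncomingRequestHook: (req) => isHealthCheck(req.url)`, so that no span is created for those requests in the first place.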
659
-
660
- ---
661
-
662
- ## Testing Strategy
663
-
664
- ### Unit Tests
665
-
666
- ```typescript
- import type { Tracer } from '@opentelemetry/api';
- import {
-   BasicTracerProvider,
-   InMemorySpanExporter,
-   SimpleSpanProcessor,
- } from '@opentelemetry/sdk-trace-base';
-
- describe('Tracing', () => {
-   let exporter: InMemorySpanExporter;
-   let tracer: Tracer;
-
-   beforeEach(() => {
-     exporter = new InMemorySpanExporter();
-     // SimpleSpanProcessor hands each span to the exporter as soon as it ends
-     const provider = new BasicTracerProvider({
-       spanProcessors: [new SimpleSpanProcessor(exporter)],
-     });
-     tracer = provider.getTracer('test');
-   });
-
-   it('should create span with correct attributes', () => {
-     const span = tracer.startSpan('test.operation');
-     span.setAttribute('test.key', 'value');
-     span.end();
-
-     const spans = exporter.getFinishedSpans();
-     expect(spans).toHaveLength(1);
-     expect(spans[0].attributes['test.key']).toBe('value');
-   });
- });
- ```
689
-
690
- ### Integration Tests
691
-
692
- ```typescript
- describe('Trace Propagation', () => {
-   it('should propagate trace from API to job', async () => {
-     // 1. Create trace in API
-     const response = await request(app).post('/api/sboms/upload').send(sbomData);
-
-     const traceparent = response.headers['traceparent'];
-
-     // 2. Verify message has trace context
-     const message = await queueClient.receiveMessage();
-     expect(message.traceContext.traceparent).toBe(traceparent);
-
-     // 3. Verify job creates child span
-     const jobSpans = await getSpansForTrace(traceparent);
-     expect(jobSpans).toContainEqual(
-       expect.objectContaining({
-         name: 'job.process.sbom',
-         parentSpanId: extractSpanId(traceparent),
-       }),
-     );
-   });
- });
- ```
715
-
716
- ---
717
-
718
- ## Migration Checklist
719
-
720
- - [ ] **Week 1:** Implement tracing in observability-ts library
721
- - [ ] **Week 1:** Update job-common with OTEL integration
722
- - [ ] **Week 2:** Deploy to manifest-api (pilot)
723
- - [ ] **Week 2:** Verify traces in Grafana
724
- - [ ] **Week 3:** Instrument 3 priority jobs
725
- - [ ] **Week 3:** Test end-to-end trace propagation
726
- - [ ] **Week 4:** Add web-app frontend tracing
727
- - [ ] **Week 4:** Create trace dashboards in Grafana
728
- - [ ] **Week 5:** Configure Vector/OTEL collector
729
- - [ ] **Week 5:** Deploy to K8s staging
730
- - [ ] **Week 6:** Rollout to remaining services
731
- - [ ] **Week 6:** Production validation (30 days)
732
-
733
- ---
734
-
735
- ## Success Criteria
736
-
737
- 1. ✅ **100% Service Coverage:** All services emit traces
738
- 2. ✅ **End-to-End Visibility:** Traces from web-app → job completion
739
- 3. ✅ **Performance:** <500ms tracing overhead per request
740
- 4. ✅ **Reliability:** <1% trace data loss
741
- 5. ✅ **Cost:** <$200/month trace infrastructure
742
- 6. ✅ **Adoption:** Team uses traces for 80% of incidents
743
-
744
- ---
745
-
746
- ## References
747
-
748
- - [OpenTelemetry Docs](https://opentelemetry.io/docs/)
749
- - [W3C Trace Context](https://www.w3.org/TR/trace-context/)
750
- - [OTLP Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md)
751
- - [VictoriaMetrics Tracing](https://docs.victoriametrics.com/victorialogs/)
752
- - [Vector OTLP Source](https://vector.dev/docs/reference/configuration/sources/opentelemetry/)
753
-
754
- ---
755
-
756
- **Document Owner:** Platform Team
757
- **Last Updated:** November 11, 2025
758
- **Status:** Ready for Implementation