@manifest-cyber/observability-ts 0.2.0 → 0.2.2
- package/package.json +2 -5
- package/MIGRATION_GUIDE.md +0 -299
- package/TRACING_GUIDE.md +0 -791
- package/TRACING_IMPLEMENTATION_PLAN.md +0 -758
@@ -1,758 +0,0 @@
# Distributed Tracing Implementation Plan for Manifest Cyber

**Status:** Implementation Ready
**Date:** November 11, 2025
**Scope:** End-to-end distributed tracing across web-app, manifest-api, message queues, and job-\* services

---

## Executive Summary

This document outlines the approach for implementing distributed tracing across Manifest Cyber's event-driven architecture using OpenTelemetry (OTEL) standards, supporting both SaaS (ECS) and on-premise (K8s) deployments.

### Architecture Overview

```
┌──────────┐   HTTP    ┌──────────────┐    SQS/RabbitMQ     ┌────────────┐
│ web-app  │ ────────► │ manifest-api │ ──────────────────► │   job-*    │
│ (React)  │           │  (Express)   │                     │ (Workers)  │
└──────────┘           └──────────────┘                     └────────────┘
     │                          │                                    │
     │    W3C Traceparent       │      W3C Traceparent               │
     │    via HTTP Headers      │      via Message Attributes        │
     │                          │                                    │
     └──────────────────────────┴────────────────────────────────────┘
                                │
                                ▼
                        ┌────────────────┐
                        │     Vector     │
                        │   (Sidecar)    │
                        └────────────────┘
                                │
                                ▼
                        ┌────────────────┐
                        │ VictoriaTraces │
                        │ (OTLP 4317/18) │
                        └────────────────┘
                                │
                                ▼
                        ┌────────────────┐
                        │    Grafana     │
                        │   (Tempo UI)   │
                        └────────────────┘
```

---

## Current State Analysis

### ✅ What's Already Working

1. **Logger-ts with Trace Support**
   - `TraceContext` class with trace_id, span_id, parent_id
   - `OtelLogger` with automatic trace extraction
   - W3C traceparent parsing in messaging layer
   - AsyncLocalStorage for context propagation

2. **Messaging Infrastructure (job-common)**
   - W3C traceparent creation/parsing utilities
   - Trace context propagation via message attributes
   - Integration with logger-ts for trace-aware logging

3. **Observability Stack**
   - Vector deployed as ECS sidecar
   - VictoriaTraces endpoint: `https://victoria-traces.development.manifestcyber.dev`
   - OTLP ports: 4317 (gRPC), 4318 (HTTP)
   - Grafana for visualization

4. **Metrics Infrastructure**
   - Prometheus metrics collection via Vector
   - VictoriaMetrics storage
   - Service-level instrumentation pattern established

### 🔴 What's Missing

1. **OpenTelemetry Instrumentation**
   - No OTEL SDK integration in any service
   - No automatic span creation for HTTP requests
   - No automatic span creation for queue operations
   - No OTLP exporter configured

2. **Trace Propagation Standards**
   - No standardized baggage/tracestate handling
   - No sampling strategy defined
   - No trace ID generation at entry points

3. **Service-Level Integration**
   - web-app: No trace injection in HTTP calls
   - manifest-api: No auto-instrumentation
   - job-\*: No trace continuation from messages

4. **Infrastructure Configuration**
   - Vector not configured for OTLP ingestion
   - No trace collection pipeline defined
   - K8s on-prem configuration missing

---

## Technical Architecture

### 1. Trace Flow: Web App → API → Queue → Job

#### Entry Point: User-Initiated (web-app)

```typescript
// web-app/src/api/client.ts
import axios from 'axios';
import { trace, context } from '@opentelemetry/api';

// Tracer name is illustrative
const tracer = trace.getTracer('web-app');

const span = tracer.startSpan('http.request', {
  attributes: {
    'http.method': 'POST',
    'http.url': '/api/sboms/upload',
  },
});

context.with(trace.setSpan(context.active(), span), () => {
  axios
    .post('/api/sboms/upload', data, {
      headers: {
        // traceparent is auto-injected by OTEL instrumentation
      },
    })
    // End the span once the request settles so it is exported
    .finally(() => span.end());
});
```

#### Ingress: API Receives Request (manifest-api)

```typescript
// manifest-api auto-instrumentation extracts traceparent
// Express middleware creates span automatically
// Pass to logger:
const logger = OtelLogger.create(baseLogger, { autoTrace: true });
logger.info('Processing SBOM upload', { userId, organizationId });
```

#### Egress: API → Queue (manifest-api)

```typescript
// Use job-common messaging with trace context
import { createTraceparentFromLogger } from '@manifest-cyber/job-common';

const traceparent = createTraceparentFromLogger(logger);
await messageClient.send('sbom-process-queue', {
  sbomId,
  organizationId,
  traceContext: { traceparent },
});
```

#### Ingress: Job Receives Message (job-sbom-process)

```typescript
// job-common automatically extracts trace context
import { createLoggerWithMessageTrace } from '@manifest-cyber/job-common';

const logger = createLoggerWithMessageTrace(message, baseLogger);
// Logger now has trace context from message
```

#### Entry Point: System-Initiated (cron job)

```typescript
// job-daily-vuln-match/src/invoke.ts
import { OtelLogger, TraceContext } from '@manifest-cyber/logger-ts';

// No upstream caller exists, so the job starts a new trace itself
const traceContext = TraceContext.generateRandom();
const logger = OtelLogger.create(baseLogger, {
  fields: { trace_id: traceContext.trace_id, span_id: traceContext.span_id },
});
```

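The internals of `TraceContext.generateRandom()` are not shown in this plan. As a rough sketch of what starting a root trace involves under the W3C rules (a 32-hex-character trace ID and a 16-hex-character span ID, neither of them all zeros), the hypothetical helpers below use `Math.random` for brevity; a real implementation would more likely draw from a cryptographic random source.

```typescript
// Hypothetical helpers for illustration; not the logger-ts implementation.
interface RootTraceContext {
  trace_id: string; // 32 lowercase hex characters
  span_id: string; // 16 lowercase hex characters
}

function randomHex(length: number): string {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  // W3C Trace Context treats all-zero IDs as invalid, so retry in that unlikely case.
  return /^0+$/.test(out) ? randomHex(length) : out;
}

function generateRootTraceContext(): RootTraceContext {
  return { trace_id: randomHex(32), span_id: randomHex(16) };
}
```
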
### 2. OpenTelemetry SDK Integration

#### Core Dependencies

```json
{
  "@opentelemetry/sdk-node": "^0.54.0",
  "@opentelemetry/auto-instrumentations-node": "^0.51.0",
  "@opentelemetry/exporter-trace-otlp-grpc": "^0.54.0",
  "@opentelemetry/resources": "^1.28.0",
  "@opentelemetry/semantic-conventions": "^1.28.0"
}
```

#### Initialization Pattern (Backend Services)

```typescript
// src/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || 'unknown-service',
    environment: process.env.ENV || 'development',
    'service.version': process.env.VERSION || '0.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-mongodb': { enabled: true },
      '@opentelemetry/instrumentation-aws-sdk': { enabled: true },
    }),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

### 3. Message Queue Trace Propagation

#### W3C Traceparent Format

```
00-{trace-id:32-hex}-{span-id:16-hex}-{flags:2-hex}
Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

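To make the format concrete, here is a minimal sketch of parsing and formatting a version-`00` header. The names `parseTraceparent` and `formatTraceparent` are hypothetical and are not the job-common utilities, whose signatures this plan does not show. The sketch rejects all-zero IDs and reads the sampled decision from bit 0 of the flags byte.

```typescript
// Hypothetical helpers for illustration; job-common ships its own utilities.
interface ParsedTraceparent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): ParsedTraceparent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid under the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte carries the "sampled" decision.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, spanId, sampled };
}

function formatTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}
```

Parsing the example header above yields the trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, the span ID `00f067aa0ba902b7`, and `sampled: true`.
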
#### SQS Message Attributes

```typescript
// Producer (manifest-api)
await sqs.sendMessage({
  QueueUrl: queueUrl,
  MessageBody: JSON.stringify(payload),
  MessageAttributes: {
    traceparent: {
      DataType: 'String',
      StringValue: traceparent, // W3C format
    },
    // Optional: tracestate carries vendor-specific trace data
    // (W3C baggage travels in a separate `baggage` entry)
    tracestate: {
      DataType: 'String',
      StringValue: 'key1=value1,key2=value2',
    },
  },
});
```

#### RabbitMQ Message Properties

```typescript
// Producer (manifest-api)
await channel.sendToQueue(queueName, Buffer.from(JSON.stringify(payload)), {
  headers: {
    traceparent: traceparent, // W3C format
    tracestate: 'key1=value1,key2=value2', // Optional
  },
  persistent: true,
});
```

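The producer snippets above have a consumer-side counterpart: reading the same two values back off the message. The sketch below is a hedged illustration of that step, not the job-common implementation. The `SqsLikeMessage` and `AmqpLikeMessage` shapes are hypothetical and cover only the fields read here. With SQS, the consumer also has to request the attributes when receiving (for example via `MessageAttributeNames`), otherwise they are not returned.

```typescript
// Hypothetical message shapes covering only the fields this sketch reads.
interface SqsLikeMessage {
  MessageAttributes?: Record<string, { DataType?: string; StringValue?: string }>;
}

interface AmqpLikeMessage {
  properties: { headers?: Record<string, unknown> };
}

interface CarriedTraceContext {
  traceparent?: string;
  tracestate?: string;
}

function fromSqs(message: SqsLikeMessage): CarriedTraceContext {
  const attrs = message.MessageAttributes ?? {};
  return {
    traceparent: attrs['traceparent']?.StringValue,
    tracestate: attrs['tracestate']?.StringValue,
  };
}

function fromAmqp(message: AmqpLikeMessage): CarriedTraceContext {
  const headers = message.properties.headers ?? {};
  // Header values are loosely typed, so keep only plain strings.
  const asString = (value: unknown): string | undefined =>
    typeof value === 'string' ? value : undefined;
  return {
    traceparent: asString(headers['traceparent']),
    tracestate: asString(headers['tracestate']),
  };
}
```

Returning the same `CarriedTraceContext` shape from both transports lets the rest of the job stay unaware of which broker delivered the message.
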
### 4. Sampling Strategy

#### Development/Staging

- **Rate:** 100% (always sample)
- **Reason:** Full visibility for debugging

#### Production

- **Rate:** 10% (probabilistic sampling)
- **Head-based sampling:** Decision at trace root
- **Override:** Always sample errors (500 status, exceptions); because a head-based decision is made before the outcome is known, this override has to be applied downstream (for example, by tail-based sampling in a collector)

```typescript
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(process.env.ENV === 'production' ? 0.1 : 1.0),
});
```

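To illustrate why a parent-based ratio sampler keeps traces whole, the following simplified sketch shows the decision logic: child spans follow the parent's sampled flag, and root spans derive a deterministic decision from the trace ID, so every service reaches the same answer for the same trace. This is an assumption-laden illustration, not the OpenTelemetry SDK's algorithm, and the function names are hypothetical.

```typescript
// Simplified illustration; not the OpenTelemetry SDK implementation.
function ratioDecision(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the last 8 hex characters of the trace ID as a uniform 32-bit number.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

function shouldSample(traceId: string, ratio: number, parentSampled?: boolean): boolean {
  // Parent-based: a child span always follows its parent's decision.
  if (parentSampled !== undefined) return parentSampled;
  // Root span: decide once, deterministically, from the trace ID.
  return ratioDecision(traceId, ratio);
}
```
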
---

## Implementation Roadmap

### Phase 1: Core Library Support (Week 1) ✅ COMPLETE

**Goal:** Create unified observability library with metrics + tracing

- [x] Rename package to `@manifest-cyber/observability-ts`
- [x] Add OpenTelemetry dependencies
- [x] Create src/tracing/ module with SDK initialization helpers
- [x] Create span management utilities (spans.ts)
- [x] Create context propagation helpers (context.ts)
- [x] Reorganize metrics into src/metrics/ for modularity
- [x] Configure subpath exports for tree-shaking
- [x] Update documentation with tracing examples

**Deliverables:**

- `@manifest-cyber/observability-ts` v0.2.0 published
- Deprecated `@manifest-cyber/metrics` on NPM
- [TRACING_GUIDE.md](./TRACING_GUIDE.md) - Complete tracing documentation
- [MIGRATION_GUIDE.md](./MIGRATION_GUIDE.md) - Migration instructions
- Backward-compatible metrics API

### Phase 2: Pilot Service (manifest-api) (Week 2)

**Goal:** End-to-end tracing for one critical service

**Tasks:**

1. Initialize OTEL SDK in manifest-api
2. Configure OTLP exporter (Vector/VictoriaTraces)
3. Enable auto-instrumentation (HTTP, Express, MongoDB)
4. Add custom spans for business logic
5. Test trace propagation to downstream jobs

**Success Criteria:**

- Traces visible in Grafana
- HTTP spans include request metadata
- Traces propagate through SQS to job-sbom-process
- Logs include trace_id/span_id

### Phase 3: Message Queue Jobs (Week 3)

**Goal:** Instrument 3 high-volume jobs

**Services:**

- job-sbom-process
- job-vulnerability-match
- job-component-process

**Tasks:**

1. Add OTEL SDK initialization to each job
2. Update message consumers to extract trace context
3. Create spans for processing stages
4. Link spans to parent API traces
5. Add error tracking with span status

**Success Criteria:**

- Full trace visibility from API → Queue → Job
- Job spans linked to API spans
- Error traces captured
- Average tracing overhead <500ms per request

### Phase 4: Web App (Week 4)

**Goal:** Frontend trace initiation

**Tasks:**

1. Add the OpenTelemetry web SDK (`@opentelemetry/sdk-trace-web`)
2. Configure trace initiation on user actions
3. Inject traceparent in HTTP requests
4. Add resource timing spans
5. Configure sampling (1% production)

**Success Criteria:**

- User action → API trace correlation
- Performance metrics captured
- Minimal bundle size impact (<50KB)

### Phase 5: Infrastructure & Rollout (Week 5-6)

**Goal:** Production-ready infrastructure

**Tasks:**

1. Configure Vector for OTLP ingestion (ECS)
2. Deploy OTEL collector for K8s
3. Configure Grafana Tempo datasource
4. Create trace dashboards
5. Instrument remaining 20+ services
6. Production rollout with 10% sampling

**Success Criteria:**

- 100% service coverage
- <5% trace data loss
- Query latency <3s in Grafana
- Cost <$200/month trace storage

---

## Deployment Considerations

### ECS (SaaS) Configuration

#### Vector Sidecar Update

```yaml
# infrastructure/vector/vector-config.yaml
sources:
  otlp_grpc:
    type: 'opentelemetry'
    address: '0.0.0.0:4317'
    grpc:
      enabled: true

  otlp_http:
    type: 'opentelemetry'
    address: '0.0.0.0:4318'
    http:
      enabled: true

sinks:
  victoria_traces:
    type: 'opentelemetry'
    inputs: ['otlp_grpc', 'otlp_http']
    endpoint: 'https://victoria-traces.development.manifestcyber.dev'
    compression: 'gzip'
    batch:
      max_events: 1000
      timeout_secs: 10
```

#### ECS Task Environment Variables

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=${SERVICE_NAME}
OTEL_RESOURCE_ATTRIBUTES=environment=${ENV},cluster=${CLUSTER_NAME}
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10% in production
```

### Kubernetes (On-Prem) Configuration

#### OTEL Collector DaemonSet

```yaml
# k3s-on-prem/kustomize/apps/otel-collector/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          ports:
            - containerPort: 4317 # OTLP gRPC
              hostPort: 4317
            - containerPort: 4318 # OTLP HTTP
              hostPort: 4318
          volumeMounts:
            - name: config
              mountPath: /etc/otel
          env:
            - name: VICTORIA_TRACES_ENDPOINT
              value: 'http://victoria-traces.observability.svc.cluster.local:4317'
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
```

#### OTEL Collector ConfigMap

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: observability
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024

      resource:
        attributes:
          - key: deployment.environment
            value: ${env:ENVIRONMENT}
            action: insert

    exporters:
      otlp:
        endpoint: victoria-traces.observability.svc.cluster.local:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch, resource]
          exporters: [otlp]
```

---

## Best Practices & Patterns

### 1. Span Naming Conventions

Follow OpenTelemetry semantic conventions:

```typescript
// ✅ Good
tracer.startSpan('http.client.request', {
  attributes: {
    'http.method': 'POST',
    'http.url': '/api/users',
    'http.status_code': 200,
  },
});

// ❌ Bad
tracer.startSpan('API Call', {
  attributes: { url: '/api/users' },
});
```

### 2. Error Handling

Always set span status on errors:

```typescript
import { SpanStatusCode } from '@opentelemetry/api';

try {
  await processData();
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  // `error` is `unknown` in strict TypeScript, so narrow it before use
  span.setStatus({
    code: SpanStatusCode.ERROR,
    message: (error as Error).message,
  });
  span.recordException(error as Error);
  throw error;
} finally {
  span.end();
}
```

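Repeating this try/catch/finally block around every operation is easy to get wrong, so one option is to wrap it in a helper. The sketch below is a hypothetical `withSpan` wrapper written against a minimal `SpanLike` interface so that it stays self-contained; in a real service the span would come from `@opentelemetry/api`, whose `SpanStatusCode` values (OK = 1, ERROR = 2) the constants mirror.

```typescript
// Minimal span shape for this sketch; the real Span type lives in @opentelemetry/api.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): void;
  recordException(error: Error): void;
  end(): void;
}

// Mirrors SpanStatusCode in @opentelemetry/api (UNSET = 0, OK = 1, ERROR = 2).
const STATUS_OK = 1;
const STATUS_ERROR = 2;

async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    const result = await fn();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (err) {
    // Normalize non-Error throwables so the span always gets a message.
    const error = err instanceof Error ? err : new Error(String(err));
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    span.recordException(error);
    throw err; // rethrow the original value for the caller
  } finally {
    span.end(); // always end the span, on success or failure
  }
}
```

A call site then shrinks to `await withSpan(span, () => processData())`, with status, exception recording, and `end()` handled in one place.
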
### 3. Cardinality Management

```typescript
// ✅ Good - bounded cardinality
span.setAttribute('http.status_code', 200);
span.setAttribute('user.role', 'admin');

// ❌ Bad - unbounded cardinality
span.setAttribute('user.id', 'user-12345-abcde-...'); // Too unique
span.setAttribute('request.body', JSON.stringify(body)); // Too large
```

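One way to enforce these rules consistently is to pass attributes through a small filter before they reach the span. The sketch below is illustrative only: `sanitizeAttributes`, the blocked-key list, and the 256-character cap are hypothetical policy choices, not part of this plan or of any OpenTelemetry API.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical policy values for illustration.
const BLOCKED_KEYS = new Set(['request.body', 'user.id']);
const MAX_STRING_LENGTH = 256;

function sanitizeAttributes(
  attributes: Record<string, AttributeValue>,
): Record<string, AttributeValue> {
  const safe: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (BLOCKED_KEYS.has(key)) continue; // drop unbounded or oversized fields
    safe[key] =
      typeof value === 'string' && value.length > MAX_STRING_LENGTH
        ? value.slice(0, MAX_STRING_LENGTH) // cap long strings
        : value;
  }
  return safe;
}
```

The result can then be applied in one call, for example `span.setAttributes(sanitizeAttributes(rawAttributes))`.
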
### 4. Context Propagation in Async Operations

```typescript
import { context, trace } from '@opentelemetry/api';

// Tracer name is illustrative
const tracer = trace.getTracer('background-jobs');

async function processInBackground(data: any) {
  // Capture current context
  const currentContext = context.active();

  // Run in background with context
  setTimeout(() => {
    context.with(currentContext, () => {
      const span = tracer.startSpan('background.process');
      // Process data with trace context
      span.end();
    });
  }, 1000);
}
```

---

## Monitoring & Alerting

### Key Metrics to Track

1. **Trace Volume**
   - `traces_received_total`
   - `traces_dropped_total`
   - `trace_export_errors_total`

2. **Trace Latency**
   - `trace_export_duration_seconds`
   - `span_processing_duration_seconds`

3. **Sampling Rates**
   - `traces_sampled_ratio`
   - `traces_head_sampled_total`

### Alerts

```yaml
# vmalert/alerts/tracing.yaml
groups:
  - name: tracing
    interval: 1m
    rules:
      - alert: HighTraceDropRate
        expr: |
          rate(traces_dropped_total[5m]) / rate(traces_received_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: 'High trace drop rate detected'
          # $value is a ratio (0-1), so format it as a percentage
          description: '{{ $value | humanizePercentage }} of traces are being dropped'

      - alert: TraceExportFailures
        expr: rate(trace_export_errors_total[5m]) > 10
        for: 5m
        annotations:
          summary: 'Trace export failures detected'
```

---

## Cost Analysis

### Expected Costs (Production at Scale)

| Component               | Monthly Cost | Notes                              |
| ----------------------- | ------------ | ---------------------------------- |
| VictoriaTraces Storage  | $50-100      | 7 days retention, 10% sampling     |
| Vector CPU/Memory (ECS) | $30-50       | Sidecar overhead across tasks      |
| OTEL Collector (K8s)    | $20-30       | DaemonSet across nodes             |
| Network Transfer        | $10-20       | OTLP data to VictoriaTraces        |
| **Total**               | **$110-200** | **Compared to $800/mo CloudWatch** |

### Cost Optimization Strategies

1. **Sampling:** 10% production, 100% dev/staging
2. **Retention:** 7 days hot, 30 days cold (S3)
3. **Filtering:** Drop health check traces
4. **Compression:** gzip for OTLP payloads

---

## Testing Strategy

### Unit Tests

```typescript
import { Tracer } from '@opentelemetry/api';
import {
  BasicTracerProvider,
  InMemorySpanExporter,
  SimpleSpanProcessor,
} from '@opentelemetry/sdk-trace-base';

describe('Tracing', () => {
  let exporter: InMemorySpanExporter;
  let tracer: Tracer;

  beforeEach(() => {
    exporter = new InMemorySpanExporter();
    // Configure test tracer with in-memory exporter
    const provider = new BasicTracerProvider({
      spanProcessors: [new SimpleSpanProcessor(exporter)],
    });
    tracer = provider.getTracer('test');
  });

  it('should create span with correct attributes', () => {
    const span = tracer.startSpan('test.operation');
    span.setAttribute('test.key', 'value');
    span.end();

    const spans = exporter.getFinishedSpans();
    expect(spans).toHaveLength(1);
    expect(spans[0].attributes['test.key']).toBe('value');
  });
});
```

### Integration Tests

```typescript
describe('Trace Propagation', () => {
  it('should propagate trace from API to job', async () => {
    // 1. Create trace in API
    const response = await request(app).post('/api/sboms/upload').send(sbomData);

    const traceparent = response.headers['traceparent'];

    // 2. Verify message has trace context
    const message = await queueClient.receiveMessage();
    expect(message.traceContext.traceparent).toBe(traceparent);

    // 3. Verify job creates child span
    const jobSpans = await getSpansForTrace(traceparent);
    expect(jobSpans).toContainEqual(
      expect.objectContaining({
        name: 'job.process.sbom',
        parentSpanId: extractSpanId(traceparent),
      }),
    );
  });
});
```

---

## Migration Checklist

- [ ] **Week 1:** Implement tracing in observability-ts library
- [ ] **Week 1:** Update job-common with OTEL integration
- [ ] **Week 2:** Deploy to manifest-api (pilot)
- [ ] **Week 2:** Verify traces in Grafana
- [ ] **Week 3:** Instrument 3 priority jobs
- [ ] **Week 3:** Test end-to-end trace propagation
- [ ] **Week 4:** Add web-app frontend tracing
- [ ] **Week 4:** Create trace dashboards in Grafana
- [ ] **Week 5:** Configure Vector/OTEL collector
- [ ] **Week 5:** Deploy to K8s staging
- [ ] **Week 6:** Rollout to remaining services
- [ ] **Week 6:** Production validation (30 days)

---

## Success Criteria

1. ✅ **100% Service Coverage:** All services emit traces
2. ✅ **End-to-End Visibility:** Traces from web-app → job completion
3. ✅ **Performance:** <500ms tracing overhead per request
4. ✅ **Reliability:** <1% trace data loss
5. ✅ **Cost:** <$200/month trace infrastructure
6. ✅ **Adoption:** Team uses traces for 80% of incidents

---

## References

- [OpenTelemetry Docs](https://opentelemetry.io/docs/)
- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
- [OTLP Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md)
- [VictoriaMetrics Tracing](https://docs.victoriametrics.com/victorialogs/)
- [Vector OTLP Source](https://vector.dev/docs/reference/configuration/sources/opentelemetry/)

---

**Document Owner:** Platform Team
**Last Updated:** November 11, 2025
**Status:** Ready for Implementation