@aigentsphere/openclaw-otel-observability 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. package/.github/workflows/ci.yml +52 -0
  2. package/.github/workflows/docs.yml +25 -0
  3. package/LICENSE +15 -0
  4. package/README.md +300 -0
  5. package/collector/README.md +186 -0
  6. package/collector/otel-collector-config.yaml +230 -0
  7. package/docker-compose.yaml +32 -0
  8. package/docs/architecture.md +319 -0
  9. package/docs/backends/dynatrace.md +168 -0
  10. package/docs/backends/generic-otlp.md +166 -0
  11. package/docs/backends/grafana.md +167 -0
  12. package/docs/backends/index.md +49 -0
  13. package/docs/backends/otel-collector.md +210 -0
  14. package/docs/configuration.md +276 -0
  15. package/docs/development.md +198 -0
  16. package/docs/getting-started.md +295 -0
  17. package/docs/index.md +139 -0
  18. package/docs/limitations.md +95 -0
  19. package/docs/security/detection.md +274 -0
  20. package/docs/security/tetragon.md +454 -0
  21. package/docs/telemetry/metrics.md +283 -0
  22. package/docs/telemetry/tokens.md +188 -0
  23. package/docs/telemetry/traces.md +165 -0
  24. package/dynatrace/security-slo-dql.md +263 -0
  25. package/index.ts +191 -0
  26. package/instrumentation/preload.mjs +59 -0
  27. package/mkdocs.yml +90 -0
  28. package/openclaw.plugin.json +99 -0
  29. package/package.json +49 -0
  30. package/src/config.ts +72 -0
  31. package/src/diagnostics.ts +214 -0
  32. package/src/hooks.ts +575 -0
  33. package/src/openllmetry.ts +27 -0
  34. package/src/security.ts +396 -0
  35. package/src/telemetry.ts +282 -0
  36. package/tetragon-policies/01-process-exec.yaml +20 -0
  37. package/tetragon-policies/02-sensitive-files.yaml +86 -0
  38. package/tetragon-policies/04-privilege-escalation.yaml +25 -0
  39. package/tetragon-policies/05-dangerous-commands.yaml +97 -0
  40. package/tetragon-policies/06-kernel-modules.yaml +27 -0
  41. package/tetragon-policies/07-prompt-injection-shell.yaml +73 -0
  42. package/tetragon-policies/README.md +143 -0
  43. package/tsconfig.json +17 -0
package/.github/workflows/ci.yml ADDED
@@ -0,0 +1,52 @@
+ name: CI & Publish
+
+ on:
+   push:
+     branches: [main]
+     tags: ["v*"]
+   pull_request:
+     branches: [main]
+
+ permissions:
+   contents: read
+
+ jobs:
+   typecheck:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - uses: actions/setup-node@v4
+         with:
+           node-version: "22"
+           cache: npm
+
+       - name: Install dependencies
+         run: npm ci
+
+       - name: Typecheck
+         run: npm run typecheck
+
+   publish:
+     needs: typecheck
+     if: startsWith(github.ref, 'refs/tags/v')
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - uses: actions/setup-node@v4
+         with:
+           node-version: "22"
+           cache: npm
+           registry-url: "https://registry.npmjs.org"
+
+       - name: Install dependencies
+         run: npm ci
+
+       - name: Typecheck
+         run: npm run typecheck
+
+       - name: Publish to npm
+         run: npm publish --access public
+         env:
+           NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
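The `publish` job in the CI workflow is gated by `if: startsWith(github.ref, 'refs/tags/v')`, so it should only run for pushes of tags beginning with `v`. The sketch below restates that gate as a plain function to make the behavior easy to check. The function name `shouldPublish` is hypothetical and not part of the package; the lowercase comparison reflects the fact that GitHub's `startsWith` expression function is documented as case-insensitive.

```typescript
// Hypothetical helper mirroring the workflow condition:
//   if: startsWith(github.ref, 'refs/tags/v')
// GitHub's startsWith() ignores case, so the ref is lowercased before testing.
function shouldPublish(ref: string): boolean {
  return ref.toLowerCase().startsWith("refs/tags/v");
}

// Example refs: a version tag push, a branch push, and a pull request merge ref.
const exampleRefs = ["refs/tags/v0.2.1", "refs/heads/main", "refs/pull/7/merge"];
for (const ref of exampleRefs) {
  console.log(`${ref} -> ${shouldPublish(ref) ? "publish" : "skip"}`);
}
```

Under this reading, pushes to `main` and pull requests run only the `typecheck` job, while a tag such as `v0.2.1` also triggers the npm publish step.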
package/.github/workflows/docs.yml ADDED
@@ -0,0 +1,25 @@
+ name: Deploy Documentation
+
+ on:
+   push:
+     branches: [main]
+   workflow_dispatch:
+
+ permissions:
+   contents: write
+
+ jobs:
+   deploy:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v4
+
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.12"
+
+       - name: Install MkDocs Material
+         run: pip install mkdocs-material
+
+       - name: Deploy to GitHub Pages
+         run: mkdocs gh-deploy --force
package/LICENSE ADDED
@@ -0,0 +1,15 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ Licensed under the Apache License, Version 2.0 (the "License");
6
+ you may not use this file except in compliance with the License.
7
+ You may obtain a copy of the License at
8
+
9
+ http://www.apache.org/licenses/LICENSE-2.0
10
+
11
+ Unless required by applicable law or agreed to in writing, software
12
+ distributed under the License is distributed on an "AS IS" BASIS,
13
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ See the License for the specific language governing permissions and
15
+ limitations under the License.
package/README.md ADDED
@@ -0,0 +1,300 @@
1
+ # OpenClaw Observability
2
+
3
+ [![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://henrikrexed.github.io/openclaw-observability-plugin/)
4
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
+
6
+ OpenTelemetry observability for [OpenClaw](https://github.com/openclaw/openclaw) AI agents.
7
+
8
+ 📖 **[Full Documentation](https://henrikrexed.github.io/openclaw-observability-plugin/)** — Setup guides, configuration reference, and backend examples.
9
+
10
+ ## Two Approaches to Observability
11
+
12
+ This repository documents **two complementary approaches** to monitoring OpenClaw:
13
+
14
+ | Approach | Best For | Setup Complexity |
15
+ |----------|----------|------------------|
16
+ | **Official Plugin** | Operational metrics, Gateway health, cost tracking | Simple config |
17
+ | **Custom Plugin** | Deep tracing, tool call visibility, request lifecycle | Plugin installation |
18
+
19
+ **Recommendation:** Use both for complete observability.
20
+
21
+ ---
22
+
23
+ ## Approach 1: Official Diagnostics Plugin (Built-in)
24
+
25
+ OpenClaw v2026.2+ includes **built-in OpenTelemetry support**. To enable it, add the following to `openclaw.json`:
26
+
27
+ ```json
+ {
+   "diagnostics": {
+     "enabled": true,
+     "otel": {
+       "enabled": true,
+       "endpoint": "http://localhost:4318",
+       "serviceName": "openclaw-gateway",
+       "traces": true,
+       "metrics": true,
+       "logs": true
+     }
+   }
+ }
+ ```
42
+
43
+ Then restart:
44
+
45
+ ```bash
46
+ openclaw gateway restart
47
+ ```
48
+
49
+ ### What It Captures
50
+
51
+ **Metrics:**
52
+ - `openclaw.tokens` — Token usage by type (input/output/cache)
53
+ - `openclaw.cost.usd` — Estimated model cost
54
+ - `openclaw.run.duration_ms` — Agent run duration
55
+ - `openclaw.context.tokens` — Context window usage
56
+ - `openclaw.webhook.*` — Webhook processing stats
57
+ - `openclaw.message.*` — Message processing stats
58
+ - `openclaw.queue.*` — Queue depth and wait times
59
+ - `openclaw.session.*` — Session state transitions
60
+
61
+ **Traces:** Model usage, webhook processing, message processing, stuck sessions
62
+
63
+ **Logs:** All Gateway logs via OTLP with severity, subsystem, and code location
64
+
65
+ ---
66
+
67
+ ## Approach 2: Custom Hook-Based Plugin (This Repo)
68
+
69
+ For **deeper observability**, install the custom plugin from this repo. It uses OpenClaw's typed plugin hooks to capture the full agent lifecycle.
70
+
71
+ ### What It Adds
72
+
73
+ **Connected Traces:**
74
+ ```
75
+ openclaw.request (root span)
76
+ ├── openclaw.agent.turn
77
+ │ ├── tool.Read (file read)
78
+ │ ├── tool.exec (shell command)
79
+ │ ├── tool.Write (file write)
80
+ │ └── tool.web_search
81
+ └── (child spans connected via trace context)
82
+ ```
83
+
84
+ **Per-Tool Visibility:**
85
+ - Individual spans for each tool call
86
+ - Tool execution time
87
+ - Result size (characters)
88
+ - Error tracking per tool
89
+
90
+ **Request Lifecycle:**
91
+ - Full message → response tracing
92
+ - Session context propagation
93
+ - Agent turn duration with token breakdown
94
+
95
+ ### Installation
96
+
97
+ 1. Clone this repository:
98
+ ```bash
99
+ git clone https://github.com/henrikrexed/openclaw-observability-plugin.git
100
+ ```
101
+
102
+ 2. Add to your `openclaw.json`:
103
+ ```json
+ {
+   "plugins": {
+     "load": {
+       "paths": ["/path/to/openclaw-observability-plugin"]
+     },
+     "entries": {
+       "otel-observability": {
+         "enabled": true,
+         "config": {
+           "endpoint": "http://localhost:4318",
+           "serviceName": "openclaw-gateway"
+         }
+       }
+     }
+   }
+ }
+ ```
121
+
122
+ 3. Clear cache and restart:
123
+ ```bash
124
+ rm -rf /tmp/jiti
125
+ systemctl --user restart openclaw-gateway
126
+ ```
127
+
128
+ ---
129
+
130
+ ## Comparing the Two Approaches
131
+
132
+ | Feature | Official Plugin | Custom Plugin |
133
+ |---------|-----------------|---------------|
134
+ | Token metrics | ✅ Per model | ✅ Per session + model |
135
+ | Cost tracking | ✅ Yes | ✅ Yes (from diagnostics) |
136
+ | Gateway health | ✅ Webhooks, queues, sessions | ❌ Not focused |
137
+ | Session state | ✅ State transitions | ❌ Not tracked |
138
+ | **Tool call tracing** | ❌ No | ✅ Individual tool spans |
139
+ | **Request lifecycle** | ❌ No | ✅ Full request → response |
140
+ | **Connected traces** | ❌ Separate spans | ✅ Parent-child hierarchy |
141
+ | Setup complexity | 🟢 Config only | 🟡 Plugin installation |
142
+
143
+ ---
144
+
145
+ ## Backend Examples
146
+
147
+ ### Dynatrace (Direct)
148
+
149
+ ```json
+ {
+   "diagnostics": {
+     "enabled": true,
+     "otel": {
+       "enabled": true,
+       "endpoint": "https://{env-id}.live.dynatrace.com/api/v2/otlp",
+       "headers": {
+         "Authorization": "Api-Token {your-token}"
+       },
+       "serviceName": "openclaw-gateway",
+       "traces": true,
+       "metrics": true,
+       "logs": true
+     }
+   }
+ }
+ ```
167
+
168
+ ### Grafana Cloud
169
+
170
+ ```json
+ {
+   "diagnostics": {
+     "enabled": true,
+     "otel": {
+       "enabled": true,
+       "endpoint": "https://otlp-gateway-{region}.grafana.net/otlp",
+       "headers": {
+         "Authorization": "Basic {base64-credentials}"
+       },
+       "serviceName": "openclaw-gateway",
+       "traces": true,
+       "metrics": true
+     }
+   }
+ }
+ ```
187
+
188
+ ### Local OTel Collector
189
+
190
+ ```json
+ {
+   "diagnostics": {
+     "enabled": true,
+     "otel": {
+       "enabled": true,
+       "endpoint": "http://localhost:4318",
+       "serviceName": "openclaw-gateway",
+       "traces": true,
+       "metrics": true,
+       "logs": true
+     }
+   }
+ }
+ ```
205
+
206
+ ---
207
+
208
+ ## Configuration Reference
209
+
210
+ ### Official Plugin Options
211
+
212
+ | Option | Type | Default | Description |
213
+ |--------|------|---------|-------------|
214
+ | `diagnostics.enabled` | boolean | false | Enable diagnostics system |
215
+ | `diagnostics.otel.enabled` | boolean | false | Enable OTel export |
216
+ | `diagnostics.otel.endpoint` | string | — | OTLP endpoint URL |
217
+ | `diagnostics.otel.protocol` | string | "http/protobuf" | Protocol |
218
+ | `diagnostics.otel.headers` | object | — | Custom headers |
219
+ | `diagnostics.otel.serviceName` | string | "openclaw" | Service name |
220
+ | `diagnostics.otel.traces` | boolean | true | Enable traces |
221
+ | `diagnostics.otel.metrics` | boolean | true | Enable metrics |
222
+ | `diagnostics.otel.logs` | boolean | false | Enable logs |
223
+ | `diagnostics.otel.sampleRate` | number | 1.0 | Trace sampling (0-1) |
224
+
225
+ ### Custom Plugin Options
226
+
227
+ | Option | Type | Default | Description |
228
+ |--------|------|---------|-------------|
229
+ | `endpoint` | string | — | OTLP endpoint URL |
230
+ | `serviceName` | string | "openclaw-gateway" | Service name |
231
+ | `exporterType` | string | "otlp" | Exporter type |
232
+ | `enableTraces` | boolean | true | Enable traces |
233
+ | `enableMetrics` | boolean | true | Enable metrics |
234
+
235
+ ---
236
+
237
+ ## Documentation
238
+
239
+ - [Getting Started](./docs/getting-started.md) — Setup guide
240
+ - [Configuration](./docs/configuration.md) — All options
241
+ - [Architecture](./docs/architecture.md) — How it works
242
+ - [Limitations](./docs/limitations.md) — Known constraints
243
+ - [Backends](./docs/backends/) — Backend-specific guides
244
+
245
+ ---
246
+
247
+ ## Optional: Kernel-Level Security with Tetragon
248
+
249
+ For **defense in depth**, add [Tetragon](https://tetragon.io) eBPF-based monitoring. While the plugins above capture application-level telemetry, Tetragon sees what happens at the kernel level — file access, process execution, network connections, and privilege changes.
250
+
251
+ ### Why Tetragon?
252
+
253
+ - **Tamper-proof**: Even a compromised agent can't hide its kernel-level actions
254
+ - **Sensitive file detection**: Alert when `.env`, SSH keys, or credentials are accessed
255
+ - **Dangerous command detection**: Catch `rm`, `curl | sh`, `chmod 777`, etc.
256
+ - **Privilege escalation**: Detect `setuid`/`setgid` attempts
257
+
258
+ ### Quick Setup
259
+
260
+ ```bash
261
+ # Install Tetragon
262
+ curl -LO https://github.com/cilium/tetragon/releases/download/v1.6.0/tetragon-v1.6.0-amd64.tar.gz
263
+ tar -xzf tetragon-v1.6.0-amd64.tar.gz && cd tetragon-v1.6.0-amd64
264
+ sudo ./install.sh
265
+
266
+ # Create OpenClaw policies directory
267
+ sudo mkdir -p /etc/tetragon/tetragon.tp.d/openclaw
268
+
269
+ # Add policies (see docs/security/tetragon.md for full examples)
270
+ # Start Tetragon
271
+ sudo systemctl enable --now tetragon
272
+ ```
273
+
274
+ Tetragon events are exported to `/var/log/tetragon/tetragon.log` and can be ingested by the OTel Collector using the `filelog` receiver.
275
+
276
+ ### Complete Observability Stack
277
+
278
+ | Layer | Source | What It Shows |
279
+ |-------|--------|---------------|
280
+ | **Application** | Custom Plugin | Tool calls, tokens, request flow |
281
+ | **Gateway** | Official Plugin | Session health, queues, costs |
282
+ | **Kernel** | Tetragon | System calls, file access, network |
283
+
284
+ See [Security: Tetragon](./docs/security/tetragon.md) for the full installation and configuration guide.
285
+
286
+ ---
287
+
288
+ ## Known Limitations
289
+
290
+ **Auto-instrumentation not possible:** OpenLLMetry/IITM breaks `@mariozechner/pi-ai` named exports due to ESM/CJS module isolation. All telemetry is captured via hooks, not direct SDK instrumentation.
291
+
292
+ **No per-LLM-call spans:** Individual API calls to Claude/OpenAI cannot be traced. Token usage is aggregated per agent turn.
293
+
294
+ See [Limitations](./docs/limitations.md) for details.
295
+
296
+ ---
297
+
298
+ ## License
299
+
300
+ MIT
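The "Custom Plugin Options" table in the README above lists each option with its default. As a minimal sketch of how those defaults might be applied, the TypeScript below merges user config over the documented defaults and derives per-signal URLs. The type and function names (`OtelPluginConfig`, `resolveConfig`, `signalUrls`) are hypothetical and are not taken from the package's `src/config.ts`. Treating `endpoint` as required is an assumption, since the table lists no default for it. The `/v1/traces` and `/v1/metrics` suffixes are the standard OTLP/HTTP paths; whether the plugin appends them itself is also an assumption.

```typescript
// Option names and defaults follow the README's "Custom Plugin Options" table.
// All type and function names here are hypothetical, for illustration only.
interface OtelPluginConfig {
  endpoint: string;
  serviceName: string;
  exporterType: string;
  enableTraces: boolean;
  enableMetrics: boolean;
}

// Assumption: endpoint must be supplied, because the table shows no default.
type OtelPluginInput = Partial<OtelPluginConfig> & { endpoint: string };

// Merge user-supplied values over the documented defaults.
function resolveConfig(input: OtelPluginInput): OtelPluginConfig {
  return {
    serviceName: "openclaw-gateway",
    exporterType: "otlp",
    enableTraces: true,
    enableMetrics: true,
    ...input,
  };
}

// Build per-signal URLs using the standard OTLP/HTTP paths.
// Trailing slashes on the base endpoint are trimmed first.
function signalUrls(cfg: OtelPluginConfig): Record<string, string> {
  const base = cfg.endpoint.replace(/\/+$/, "");
  const urls: Record<string, string> = {};
  if (cfg.enableTraces) urls.traces = `${base}/v1/traces`;
  if (cfg.enableMetrics) urls.metrics = `${base}/v1/metrics`;
  return urls;
}

const local = resolveConfig({ endpoint: "http://localhost:4318" });
console.log(local.serviceName, signalUrls(local));
```

With only `endpoint` supplied, the resolved config keeps `serviceName` as `openclaw-gateway` and leaves both signals enabled, matching the table.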
package/collector/README.md ADDED
@@ -0,0 +1,186 @@
1
+ # OTel Collector Configuration
2
+
3
+ This directory contains a ready-to-use OpenTelemetry Collector configuration for OpenClaw observability.
4
+
5
+ ## What It Collects
6
+
7
+ | Source | Receiver | Data Type | Description |
8
+ |--------|----------|-----------|-------------|
9
+ | OpenClaw Plugin | `otlp` | Traces | Request lifecycle, tool calls |
10
+ | OpenClaw Plugin | `otlp` | Metrics | Token usage, costs |
11
+ | OpenClaw Plugin | `otlp` | Logs | Application logs |
12
+ | Host | `hostmetrics` | Metrics | CPU, memory, disk, network |
13
+ | Tetragon | `filelog/tetragon` | Logs | Kernel security events |
14
+
15
+ ## Quick Start
16
+
17
+ ### 1. Install the Collector
18
+
19
+ ```bash
20
+ # Download otelcol-contrib (includes all receivers/processors)
21
+ curl -LO https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.144.0/otelcol-contrib_0.144.0_linux_amd64.tar.gz
22
+ tar -xzf otelcol-contrib_0.144.0_linux_amd64.tar.gz
23
+ sudo mv otelcol-contrib /usr/local/bin/
24
+ ```
25
+
26
+ ### 2. Configure Environment Variables
27
+
28
+ For Dynatrace:
29
+ ```bash
30
+ export DT_ENDPOINT="https://YOUR_ENV.live.dynatrace.com/api/v2/otlp"
31
+ export DT_API_TOKEN="dt0c01.xxxxx"
32
+ ```
33
+
34
+ ### 3. Run the Collector
35
+
36
+ ```bash
37
+ otelcol-contrib --config otel-collector-config.yaml
38
+ ```
39
+
40
+ ### 4. Run as a Service (systemd)
41
+
42
+ ```bash
43
+ # Create service file
44
+ sudo tee /etc/systemd/system/otelcol-contrib.service << 'EOF'
45
+ [Unit]
46
+ Description=OpenTelemetry Collector
47
+ After=network.target
48
+
49
+ [Service]
50
+ Type=simple
51
+ User=otelcol-contrib
52
+ ExecStart=/usr/local/bin/otelcol-contrib --config /etc/otelcol-contrib/config.yaml
53
+ Restart=always
54
+ RestartSec=5
55
+
56
+ [Install]
57
+ WantedBy=multi-user.target
58
+ EOF
59
+
60
+ # Create override for environment
61
+ sudo mkdir -p /etc/systemd/system/otelcol-contrib.service.d
62
+ sudo tee /etc/systemd/system/otelcol-contrib.service.d/override.conf << 'EOF'
63
+ [Service]
64
+ Environment="DT_ENDPOINT=https://YOUR_ENV.live.dynatrace.com/api/v2/otlp"
65
+ Environment="DT_API_TOKEN=dt0c01.xxxxx"
66
+ EOF
67
+
68
+ # Copy config and start
69
+ sudo mkdir -p /etc/otelcol-contrib
70
+ sudo cp otel-collector-config.yaml /etc/otelcol-contrib/config.yaml
71
+ sudo systemctl daemon-reload
72
+ sudo systemctl enable --now otelcol-contrib
73
+ ```
74
+
75
+ ## Alternative Backends
76
+
77
+ ### Grafana Cloud
78
+
79
+ Replace the exporter section:
80
+
81
+ ```yaml
+ exporters:
+   otlphttp/grafana:
+     endpoint: "https://otlp-gateway-prod-us-central-0.grafana.net/otlp"
+     headers:
+       Authorization: "Basic ${env:GRAFANA_CLOUD_TOKEN}"
+ ```
88
+
89
+ ### Jaeger (Local)
90
+
91
+ ```yaml
+ exporters:
+   otlp/jaeger:
+     endpoint: "localhost:4317"
+     tls:
+       insecure: true
+ ```
98
+
99
+ ### Generic OTLP
100
+
101
+ ```yaml
+ exporters:
+   otlphttp:
+     endpoint: "https://your-otlp-endpoint.com"
+     headers:
+       Authorization: "Bearer ${env:API_TOKEN}"
+ ```
108
+
109
+ ## Pipelines
110
+
111
+ The configuration defines four pipelines:
112
+
113
+ | Pipeline | Receivers | Purpose |
114
+ |----------|-----------|---------|
115
+ | `traces` | otlp | OpenClaw request traces |
116
+ | `metrics` | otlp, hostmetrics | Token usage + system metrics |
117
+ | `logs/openclaw` | otlp | OpenClaw application logs |
118
+ | `logs/tetragon` | filelog/tetragon | Kernel security events |
119
+
120
+ ## Tetragon Integration
121
+
122
+ The Tetragon pipeline:
123
+
124
+ 1. **Reads** JSON events from `/var/log/tetragon/tetragon.log`
+ 2. **Parses** the JSON and extracts timestamps
+ 3. **Transforms** events to extract:
+    - `tetragon.type` — event type (kprobe, exec, exit)
+    - `tetragon.policy` — which policy triggered
+    - `process.binary`, `process.pid`, `process.uid`
+    - `tetragon.function` — syscall name
+ 4. **Assigns** security risk levels:
+    - `critical` — privilege-escalation, kernel-modules
+    - `high` — sensitive-files, dangerous-commands
+    - `low` — process-exec
+ 5. **Exports** to your backend with `service.name: openclaw-security`
136
+
137
+ ### Prerequisites for Tetragon
138
+
139
+ ```bash
140
+ # Install Tetragon
141
+ # See ../tetragon-policies/README.md
142
+
143
+ # Ensure collector can read the log
144
+ sudo chmod 644 /var/log/tetragon/tetragon.log
145
+
146
+ # Or add collector user to appropriate group
147
+ sudo usermod -a -G adm otelcol-contrib
148
+ ```
149
+
150
+ ## Troubleshooting
151
+
152
+ ### Collector not starting
153
+
154
+ ```bash
155
+ # Validate config
156
+ otelcol-contrib validate --config otel-collector-config.yaml
157
+
158
+ # Check for missing env vars
159
+ echo $DT_ENDPOINT
160
+ echo $DT_API_TOKEN
161
+ ```
162
+
163
+ ### Tetragon events not appearing
164
+
165
+ ```bash
166
+ # Check Tetragon is writing events
167
+ sudo tail -f /var/log/tetragon/tetragon.log
168
+
169
+ # Check file permissions
170
+ ls -la /var/log/tetragon/tetragon.log
171
+
172
+ # Check collector logs
173
+ journalctl -u otelcol-contrib -f | grep tetragon
174
+ ```
175
+
176
+ ### High memory usage
177
+
178
+ Reduce batch sizes:
179
+
180
+ ```yaml
+ processors:
+   batch:
+     timeout: 5s
+     send_batch_size: 256
+     send_batch_max_size: 512
+ ```
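The "Tetragon Integration" section of the collector README describes assigning a risk level to each event based on the policy that triggered it. A minimal sketch of that mapping is shown below, using only the policy names and levels listed there. The names `RiskLevel`, `RISK_BY_POLICY`, and `riskFor` are hypothetical, and the exact-match lookup is an assumption; the collector configuration may match policy names differently (for example by pattern).

```typescript
// Risk levels and policy names come from the collector README's
// "Tetragon Integration" list. Identifier names are hypothetical.
type RiskLevel = "critical" | "high" | "low";

const RISK_BY_POLICY: Record<string, RiskLevel> = {
  "privilege-escalation": "critical",
  "kernel-modules": "critical",
  "sensitive-files": "high",
  "dangerous-commands": "high",
  "process-exec": "low",
};

// Assumption: exact match on the policy name. Policies not listed in the
// README yield undefined instead of a guessed level.
function riskFor(policy: string): RiskLevel | undefined {
  return RISK_BY_POLICY[policy];
}

for (const policy of ["kernel-modules", "sensitive-files", "process-exec"]) {
  console.log(`${policy} -> ${riskFor(policy)}`);
}
```

Returning `undefined` for unlisted policies keeps the sketch from inventing a level the README does not state.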