cw-datadog 2.23.0.2 → 2.23.0.3
- checksums.yaml +4 -4
- data/ext/datadog_profiling_native_extension/extconf.rb +4 -2
- data/ext/libdatadog_api/library_config.c +12 -11
- data/ext/libdatadog_extconf_helpers.rb +1 -1
- data/lib/datadog/appsec/api_security/route_extractor.rb +20 -5
- data/lib/datadog/appsec/api_security/sampler.rb +3 -1
- data/lib/datadog/appsec/assets/blocked.html +8 -0
- data/lib/datadog/appsec/assets/blocked.json +1 -1
- data/lib/datadog/appsec/assets/blocked.text +3 -1
- data/lib/datadog/appsec/assets.rb +1 -1
- data/lib/datadog/appsec/remote.rb +4 -0
- data/lib/datadog/appsec/response.rb +18 -4
- data/lib/datadog/core/cloudwise/client.rb +364 -25
- data/lib/datadog/core/cloudwise/component.rb +197 -52
- data/lib/datadog/core/cloudwise/docc_heartbeat_worker.rb +105 -0
- data/lib/datadog/core/cloudwise/docc_operation_worker.rb +191 -0
- data/lib/datadog/core/cloudwise/docc_registration_worker.rb +89 -0
- data/lib/datadog/core/cloudwise/license_worker.rb +3 -1
- data/lib/datadog/core/cloudwise/probe_state.rb +134 -12
- data/lib/datadog/core/configuration/components.rb +10 -9
- data/lib/datadog/core/configuration/settings.rb +28 -0
- data/lib/datadog/core/configuration/supported_configurations.rb +5 -2
- data/lib/datadog/core/remote/client/capabilities.rb +7 -0
- data/lib/datadog/core/remote/component.rb +2 -2
- data/lib/datadog/core/remote/transport/config.rb +2 -10
- data/lib/datadog/core/remote/transport/http/config.rb +9 -9
- data/lib/datadog/core/remote/transport/http/negotiation.rb +17 -8
- data/lib/datadog/core/remote/transport/http.rb +2 -0
- data/lib/datadog/core/remote/transport/negotiation.rb +2 -18
- data/lib/datadog/core/remote/worker.rb +23 -35
- data/lib/datadog/core/telemetry/component.rb +26 -13
- data/lib/datadog/core/telemetry/event/app_started.rb +67 -49
- data/lib/datadog/core/telemetry/event/synth_app_client_configuration_change.rb +27 -4
- data/lib/datadog/core/telemetry/transport/http/telemetry.rb +5 -6
- data/lib/datadog/core/telemetry/transport/telemetry.rb +1 -2
- data/lib/datadog/core/telemetry/worker.rb +51 -6
- data/lib/datadog/core/transport/http/adapters/net.rb +2 -0
- data/lib/datadog/core/transport/http/client.rb +69 -0
- data/lib/datadog/core/utils/only_once_successful.rb +6 -2
- data/lib/datadog/data_streams/transport/http/client.rb +4 -32
- data/lib/datadog/data_streams/transport/stats.rb +1 -1
- data/lib/datadog/di/probe_notification_builder.rb +35 -13
- data/lib/datadog/di/transport/diagnostics.rb +2 -2
- data/lib/datadog/di/transport/http/diagnostics.rb +2 -4
- data/lib/datadog/di/transport/http/input.rb +2 -4
- data/lib/datadog/di/transport/input.rb +2 -2
- data/lib/datadog/open_feature/component.rb +60 -0
- data/lib/datadog/open_feature/configuration.rb +27 -0
- data/lib/datadog/open_feature/evaluation_engine.rb +59 -0
- data/lib/datadog/open_feature/exposures/batch_builder.rb +32 -0
- data/lib/datadog/open_feature/exposures/buffer.rb +43 -0
- data/lib/datadog/open_feature/exposures/deduplicator.rb +30 -0
- data/lib/datadog/open_feature/exposures/event.rb +60 -0
- data/lib/datadog/open_feature/exposures/reporter.rb +40 -0
- data/lib/datadog/open_feature/exposures/worker.rb +116 -0
- data/lib/datadog/open_feature/ext.rb +13 -0
- data/lib/datadog/open_feature/noop_evaluator.rb +26 -0
- data/lib/datadog/open_feature/provider.rb +134 -0
- data/lib/datadog/open_feature/remote.rb +74 -0
- data/lib/datadog/open_feature/resolution_details.rb +35 -0
- data/lib/datadog/open_feature/transport.rb +72 -0
- data/lib/datadog/open_feature.rb +19 -0
- data/lib/datadog/profiling/component.rb +6 -0
- data/lib/datadog/profiling/profiler.rb +4 -0
- data/lib/datadog/profiling.rb +1 -2
- data/lib/datadog/single_step_instrument.rb +1 -1
- data/lib/datadog/tracing/contrib/cloudwise/propagation.rb +164 -7
- data/lib/datadog/tracing/contrib/graphql/unified_trace.rb +22 -17
- data/lib/datadog/tracing/contrib/karafka/framework.rb +30 -0
- data/lib/datadog/tracing/contrib/karafka/patcher.rb +14 -0
- data/lib/datadog/tracing/contrib/rack/middlewares.rb +6 -2
- data/lib/datadog/tracing/contrib/waterdrop/configuration/settings.rb +27 -0
- data/lib/datadog/tracing/contrib/waterdrop/distributed/propagation.rb +48 -0
- data/lib/datadog/tracing/contrib/waterdrop/ext.rb +17 -0
- data/lib/datadog/tracing/contrib/waterdrop/integration.rb +43 -0
- data/lib/datadog/tracing/contrib/waterdrop/middleware.rb +46 -0
- data/lib/datadog/tracing/contrib/waterdrop/patcher.rb +46 -0
- data/lib/datadog/tracing/contrib/waterdrop/producer.rb +50 -0
- data/lib/datadog/tracing/contrib/waterdrop.rb +37 -0
- data/lib/datadog/tracing/contrib.rb +1 -0
- data/lib/datadog/tracing/transport/http/api.rb +40 -1
- data/lib/datadog/tracing/transport/http/client.rb +12 -26
- data/lib/datadog/tracing/transport/http/traces.rb +4 -2
- data/lib/datadog/tracing/transport/trace_formatter.rb +16 -0
- data/lib/datadog/version.rb +2 -2
- data/lib/datadog.rb +1 -0
- metadata +38 -15
- data/lib/datadog/core/cloudwise/IMPLEMENTATION_V2.md +0 -517
- data/lib/datadog/core/cloudwise/QUICKSTART.md +0 -398
- data/lib/datadog/core/cloudwise/README.md +0 -722
- data/lib/datadog/core/remote/transport/http/client.rb +0 -49
- data/lib/datadog/core/telemetry/transport/http/client.rb +0 -49
- data/lib/datadog/di/transport/http/client.rb +0 -47
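@@ -1,722 +0,0 @@
-# Cloudwise Integration
-
-## Overview
-
-The Cloudwise integration provides probe management, license verification, and application registration.
-
-**Key features**:
-- **Asynchronous initialization**: does not block application startup
-- **Lazy loading**: Datadog components are initialized only after the Host ID is ready
-- **Unlimited retries**: Host ID retrieval is retried indefinitely on failure
-- **Uninterrupted service**: the application handles requests normally while the monitoring components initialize
-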
-## Features
-
-### 1. Host ID Generation
-- **Endpoint**: `POST /v2/app/generateHostId`
-- **When**: at application startup (asynchronously, in the background)
-- **Retry policy**: retried every 30 seconds on failure, **indefinitely until it succeeds**
-- **Blocking behavior**:
-  - ✅ Does not block application startup
-  - ❌ Blocks initialization of the Cloudwise and Datadog components
-  - ✅ The application can handle requests normally
-- **Environment variable**:
-  - On success, `ENV['CLOUDWISE_ACCOUNT_ID']` is set automatically
-  - **This is a prerequisite for Cloudwise to work**
-
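The retry policy described for Host ID generation can be illustrated with a minimal sketch. This is not the gem's `HostIdWorker`: `retry_until_success` and its `sleeper` parameter are hypothetical names introduced here, and the block passed to it merely stands in for the `generateHostId` request.

```ruby
# Hypothetical helper: calls the block until it returns a truthy value.
# `interval` mirrors the 30-second delay from the README; `sleeper` is
# injectable so the loop can be exercised without real waiting.
def retry_until_success(interval: 30, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  loop do
    attempts += 1
    result = begin
      yield(attempts)
    rescue StandardError
      nil # any error counts as a failed attempt; keep retrying
    end
    return [result, attempts] if result

    sleeper.call(interval)
  end
end

# Stand-in for the generateHostId call: fails twice, then succeeds.
waits = []
account_id, attempts = retry_until_success(sleeper: ->(s) { waits << s }) do |n|
  raise IOError, 'connection refused' if n < 3

  'acc-1234567890'
end
```

In a real worker this loop would presumably run on a background thread, which is what keeps application startup unblocked.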
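-### 2. Heartbeat Registration
-- **Endpoint**: `POST /api/v1/agent/heartbeat`
-- **When**: started **first** once the Host ID succeeds (blocks subsequent initialization)
-- **Interval**: 30 seconds
-- **Retry policy**: on failure within a cycle, retried every 30 seconds, up to 3 times
-- **Failure handling**:
-  - Request succeeds but code ≠ 1000: marked inactive; no data is collected or reported
-  - Request fails 3 times in a row: marked inactive; the failure count is reset to 0
-  - Tries again in the next cycle (30s later) until it succeeds
-  - Any cycle returning code = 1000: marked active; initialization continues
-- **Blocking behavior**: must succeed once (code = 1000) before the flow continues
-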
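One heartbeat cycle, as described in the list above, might look roughly like the following sketch. The names `run_heartbeat_cycle`, `HEARTBEAT_SUCCESS_CODE`, and `HEARTBEAT_MAX_ATTEMPTS` are assumptions made for illustration; `request` is a hypothetical callable that returns the response code or raises on a transport failure.

```ruby
HEARTBEAT_SUCCESS_CODE = 1000
HEARTBEAT_MAX_ATTEMPTS = 3

# Runs a single heartbeat cycle and returns :active or :inactive.
# A response with any code other than 1000 ends the cycle as :inactive;
# transport failures are retried up to three times within the cycle.
def run_heartbeat_cycle(request, retry_interval: 30, sleeper: ->(s) { sleep(s) })
  failures = 0
  while failures < HEARTBEAT_MAX_ATTEMPTS
    begin
      code = request.call
      return code == HEARTBEAT_SUCCESS_CODE ? :active : :inactive
    rescue StandardError
      failures += 1
      sleeper.call(retry_interval) if failures < HEARTBEAT_MAX_ATTEMPTS
    end
  end
  :inactive # three consecutive failures; counting starts from 0 next cycle
end

# A request that fails twice and then answers with code 1000.
calls = 0
flaky = lambda do
  calls += 1
  raise IOError, 'timeout' if calls < 3

  1000
end
outcome = run_heartbeat_cycle(flaky, sleeper: ->(_) {})
```

Because `failures` is a local variable, every cycle starts counting from zero, which matches the "failure count is reset" rule without extra bookkeeping.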
-### 3. Application Registration
-- **Endpoint**: `POST /api/v1/application/register`
-- **When**: started after the heartbeat succeeds (**does not block** subsequent initialization)
-- **Interval**: 3 minutes
-- **Retry policy**: on failure, tries again in the next cycle
-- **Failure handling**:
-  - Success: marked registered
-  - Failure: marked unregistered; retried after 3 minutes
-- **Blocking behavior**: **non-blocking**; runs in the background, and failures do not affect the rest of the flow
-
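-### 4. License Verification
-- **Endpoint**: `POST /api/v1/license/verify`
-- **When**: started **immediately** after App Registration starts (blocks subsequent initialization)
-- **Interval**: 5 minutes
-- **Retry policy**: one attempt per cycle; failures accumulate in a counter
-- **Failure handling**:
-  - Request succeeds but code ≠ 1000: marked invalid; no data is collected or reported
-  - Request fails 3 times in a row: marked invalid; the failure count is reset to 0
-  - Tries again in the next cycle (5 min later) until it succeeds
-  - Any cycle returning code = 1000: marked valid; initialization continues
-- **Blocking behavior**: must succeed once (code = 1000) before the Datadog components are initialized
-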
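Unlike the heartbeat, the license check makes one attempt per cycle and counts failures across cycles. The sketch below shows one way to model that counter. `LicenseTracker` is a hypothetical class, not part of the package, and it assumes that any response received from the server (valid or not) breaks the streak of consecutive failures, which the README does not state explicitly.

```ruby
# Hypothetical model of the cross-cycle failure counter.
class LicenseTracker
  MAX_FAILURES = 3
  SUCCESS_CODE = 1000

  attr_reader :failure_count

  def initialize
    @valid = false
    @failure_count = 0
  end

  def valid?
    @valid
  end

  # outcome: the Integer response code, or :error for a failed request.
  def record(outcome)
    if outcome == :error
      @failure_count += 1
      if @failure_count >= MAX_FAILURES
        @valid = false
        @failure_count = 0 # counting starts over in the next cycle
      end
    else
      @failure_count = 0 # assumption: a received response ends the streak
      @valid = (outcome == SUCCESS_CODE)
    end
    @valid
  end
end

tracker = LicenseTracker.new
history = [1000, :error, :error, :error, 1000].map { |outcome| tracker.record(outcome) }
```

With this model a valid license survives one or two failed requests and is invalidated only on the third consecutive failure.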
-## Configuration
-
-### Environment Variables
-
-```bash
-# Required
-export DD_CLOUDWISE_ENABLED=true
-export DD_CLOUDWISE_LICENSE_KEY=your-license-key-here
-export DD_SERVICE=my-ruby-app # used as server_name
-export DD_AGENT_HOST=127.0.0.1 # host part of the API base URL
-export DD_AGENT_PORT=8126 # port part of the API base URL
-
-# Optional (with defaults)
-export DD_CLOUDWISE_HEARTBEAT_INTERVAL=30 # default: 30 seconds
-export DD_CLOUDWISE_LICENSE_CHECK_INTERVAL=300 # default: 5 minutes
-export DD_CLOUDWISE_APP_REGISTRATION_INTERVAL=180 # default: 3 minutes
-```
-
-**Notes**:
-- `DD_SERVICE`: the service name, passed to the Cloudwise API as the `server_name` parameter
-- `DD_AGENT_HOST` + `DD_AGENT_PORT`: combined into `base_url`, for example `http://127.0.0.1:8126`
-
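To make the mapping from environment variables to settings concrete, here is a small sketch. `CloudwiseEnvConfig` and `cloudwise_config_from` are hypothetical names, and the fallback host and port (`127.0.0.1`, `8126`) are assumed defaults taken from the example values, not something the README specifies.

```ruby
# Hypothetical value object holding the settings derived from the environment.
CloudwiseEnvConfig = Struct.new(
  :enabled, :license_key, :server_name, :base_url,
  :heartbeat_interval, :license_check_interval, :app_registration_interval,
  keyword_init: true
)

# `env` may be ENV or any Hash-like object responding to #[] and #fetch.
def cloudwise_config_from(env)
  host = env.fetch('DD_AGENT_HOST', '127.0.0.1') # assumed default
  port = env.fetch('DD_AGENT_PORT', '8126')      # assumed default
  CloudwiseEnvConfig.new(
    enabled: env.fetch('DD_CLOUDWISE_ENABLED', 'false') == 'true',
    license_key: env['DD_CLOUDWISE_LICENSE_KEY'],
    server_name: env['DD_SERVICE'],
    base_url: "http://#{host}:#{port}",
    heartbeat_interval: Integer(env.fetch('DD_CLOUDWISE_HEARTBEAT_INTERVAL', '30')),
    license_check_interval: Integer(env.fetch('DD_CLOUDWISE_LICENSE_CHECK_INTERVAL', '300')),
    app_registration_interval: Integer(env.fetch('DD_CLOUDWISE_APP_REGISTRATION_INTERVAL', '180'))
  )
end

config = cloudwise_config_from(
  'DD_CLOUDWISE_ENABLED' => 'true',
  'DD_CLOUDWISE_LICENSE_KEY' => 'your-license-key-here',
  'DD_SERVICE' => 'my-ruby-app',
  'DD_AGENT_HOST' => '127.0.0.1',
  'DD_AGENT_PORT' => '8126'
)
```

Using `Integer(...)` rather than `to_i` makes a malformed interval fail loudly instead of silently becoming 0.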
-### Code Configuration
-
-```ruby
-require 'datadog'
-
-Datadog.configure do |c|
-  # Cloudwise settings
-  c.cloudwise.enabled = true
-  c.cloudwise.license_key = 'your-license-key-here'
-
-  # Optional: custom intervals
-  c.cloudwise.heartbeat_interval = 30 # 30 seconds
-  c.cloudwise.license_check_interval = 300 # 5 minutes
-  c.cloudwise.app_registration_interval = 180 # 3 minutes
-
-  # General Datadog settings (used by Cloudwise)
-  c.service = 'my-ruby-app' # server_name
-  c.agent.host = '127.0.0.1' # base_url host
-  c.agent.port = 8126 # base_url port
-end
-```
-
-## Initialization Flow
-
-### Full Flowchart
-
-```
-Web application starts
-    ↓
-Components.initialize(settings)
-    ↓
-┌────────────────────────────────────────────────────────────────┐
-│ [1] Cloudwise.initialize_async(callback)                       │  ← async, non-blocking
-│   ├─ start HostIdWorker (background thread, unlimited retries) │
-│   └─ start the watcher thread                                  │
-│       └─ wait for the Host ID to be ready                      │
-│           └─ then call the callback to initialize Datadog      │
-└────────────────────────────────────────────────────────────────┘
-    ↓ (returns immediately, does not wait)
-    ↓
-✅ [2] Application startup complete
-    ↓
-✅ [3] Application starts handling requests 🚀
-    │
-    │ State at this point:
-    │   • Cloudwise: HostIdWorker running in the background ⏳
-    │   • Datadog components: not yet initialized ❌
-    │   • Data collection: not started ❌
-    │   • Application service: running normally ✅
-    │
-    ↓ (keeps running in the background...)
-    │
-┌────────────────────────────────────────────────────────────────┐
-│ [4] Background: HostIdWorker retries indefinitely              │
-│     Attempt 1: failed → wait 30s                               │
-│     Attempt 2: failed → wait 30s                               │
-│     Attempt 3: failed → wait 30s                               │
-│     ...                                                        │
-│     Attempt N: success! ✅                                     │
-│       └─ set ENV['CLOUDWISE_ACCOUNT_ID']                       │
-│       └─ probe_state.mark_host_id_ready!                       │
-└────────────────────────────────────────────────────────────────┘
-    ↓
-┌────────────────────────────────────────────────────────────────┐
-│ [5] Heartbeat verification (blocking, must succeed)            │
-│                                                                │
-│ Thread: Cloudwise-Initializer                                  │
-│   ├─ start HeartbeatWorker                                     │
-│   └─ wait for the first success ⏳                             │
-│                                                                │
-│ Thread: Heartbeat-Worker                                       │
-│   ├─ Attempt 1: POST /api/v1/agent/heartbeat                   │
-│   │   ├─ failed → wait 30s → retry                             │
-│   │   ├─ failed → wait 30s → retry                             │
-│   │   ├─ failed → wait 30s → retry                             │
-│   │   └─ failed 3 times → mark_heartbeat_inactive!             │
-│   │        → failure count reset to 0                          │
-│   │        → try again next cycle (30s later)                  │
-│   │                                                            │
-│   ├─ Attempt 2 (30s later): POST /api/v1/agent/heartbeat       │
-│   │   └─ success! code = 1000 ✅                               │
-│   │       └─ mark_heartbeat_active!                            │
-│   │                                                            │
-│   └─ continue the periodic task (every 30s)                    │
-│       ├─ success (code=1000): stay active                      │
-│       ├─ failure or code≠1000: mark_inactive (keep trying)     │
-│       └─ 3 consecutive failures: reset count, retry next cycle │
-└────────────────────────────────────────────────────────────────┘
-    ↓ (after the first Heartbeat success)
-    ↓
-┌────────────────────────────────────────────────────────────────┐
-│ [6] Start App Registration (non-blocking, in background)       │
-│                                                                │
-│ Thread: Cloudwise-Initializer                                  │
-│   └─ start AppRegistrationWorker ✅                            │
-│                                                                │
-│ Thread: AppRegistration-Worker                                 │
-│   ├─ runs every 3 minutes                                      │
-│   ├─ POST /api/v1/application/register                         │
-│   │   ├─ success: mark_app_registered!                         │
-│   │   └─ failure: mark_app_unregistered!                       │
-│   │              (retried after 3 minutes)                     │
-│   └─ failures do not affect the rest of the flow               │
-└────────────────────────────────────────────────────────────────┘
-    ↓ (continues immediately, does not wait)
-    ↓
-┌────────────────────────────────────────────────────────────────┐
-│ [7] License verification (blocking, must succeed)              │
-│                                                                │
-│ Thread: Cloudwise-Initializer                                  │
-│   ├─ start LicenseWorker                                       │
-│   └─ wait for the first success ⏳                             │
-│                                                                │
-│ Thread: License-Worker                                         │
-│   ├─ Attempt 1: POST /api/v1/license/verify                    │
-│   │   ├─ failed → failure_count++                              │
-│   │   ├─ failed → failure_count++                              │
-│   │   ├─ failed → failure_count++                              │
-│   │   └─ failed 3 times → mark_license_invalid!                │
-│   │        → failure count reset to 0                          │
-│   │        → try again next cycle (5 minutes later)            │
-│   │                                                            │
-│   ├─ Attempt 2 (5 minutes later): POST /api/v1/license/verify  │
-│   │   └─ success! code = 1000 ✅                               │
-│   │       └─ mark_license_valid!                               │
-│   │                                                            │
-│   └─ continue the periodic task (every 5 minutes)              │
-│       ├─ success (code=1000): stay valid                       │
-│       ├─ failure or code≠1000: mark_invalid (keep trying)      │
-│       └─ 3 consecutive failures: reset count, retry next cycle │
-└────────────────────────────────────────────────────────────────┘
-    ↓ (after the first License success)
-    ↓
-┌────────────────────────────────────────────────────────────────┐
-│ [8] Initialize Datadog components 🎉                           │
-│                                                                │
-│ Thread: Cloudwise-Initializer                                  │
-│   └─ block.call  # invoke the callback                         │
-│       └─ initialize_datadog_components(settings)               │
-│           ├─ telemetry                                         │
-│           ├─ tracer                                            │
-│           ├─ profiler                                          │
-│           ├─ runtime_metrics                                   │
-│           ├─ health_metrics                                    │
-│           ├─ appsec                                            │
-│           ├─ dynamic_instrumentation                           │
-│           ├─ error_tracking                                    │
-│           ├─ data_streams                                      │
-│           └─ remote                                            │
-└────────────────────────────────────────────────────────────────┘
-    ↓
-✅ [9] Cloudwise & Datadog fully ready
-    ↓
-✅ [10] Data collection and reporting begin 🎉
-```
-
-### Status Timeline
-
-| Time | Application | Cloudwise | Datadog components | Data collection |
-|------|-------------|-----------|--------------------|-----------------|
-| T0 | Starting | Initializing (async) | Not initialized ❌ | Not started ❌ |
-| T1 | Running ✅ | Waiting for Host ID ⏳ | Not initialized ❌ | Not started ❌ |
-| | Handling requests | (retrying in background) | | |
-| T2 | Running ✅ | Host ID ready ✅ | Not initialized ❌ | Not started ❌ |
-| | Handling requests | Waiting for Heartbeat ⏳ | | |
-| T3 | Running ✅ | Heartbeat ready ✅ | Not initialized ❌ | Not started ❌ |
-| | Handling requests | App Reg started<br>Waiting for License ⏳ | | |
-| T4 | Running ✅ | License ready ✅ | Initializing ⏳ | Not started ❌ |
-| | Handling requests | (initializing Datadog) | | |
-| T5 | Running ✅ | Working normally ✅ | Working normally ✅ | Collecting ✅ |
-| | Handling requests | (all workers running) | | |
-
-## Worker Execution Order in Detail
-
-### Why Run Sequentially?
-
-Cloudwise uses a **sequential verification** strategy: each critical step must succeed before the next one starts:
-
-1. **Host ID** - prerequisite for every API call; must be obtained first
-2. **Heartbeat** - verifies that the server allows the probe to run
-3. **App Registration** - registers application information (optional, non-blocking)
-4. **License** - verifies that the license is valid
-5. **Datadog initialization** - happens only after all verifications pass
-
-### Execution Order
-
-```
-Host ID ready
-    ↓
-Start HeartbeatWorker (blocks until the first success)
-  ├─ Cycle 1: 3 attempts (30s apart)
-  │   └─ all failed → mark_inactive, reset count
-  ├─ Cycle 2 (30s later): try again
-  │   └─ success (code=1000) → mark_active ✅
-  └─ continue with the rest of the flow
-    ↓
-Start AppRegistrationWorker (non-blocking, runs in the background)
-  ├─ one attempt every 3 minutes
-  └─ failures do not affect the rest of the flow
-    ↓
-Start LicenseWorker (blocks until the first success)
-  ├─ Cycle 1: 1 attempt
-  │   └─ failed → failure_count++
-  ├─ Cycle 2 (5 min later): 1 attempt
-  │   └─ failed → failure_count++
-  ├─ Cycle 3 (10 min later): 1 attempt
-  │   └─ failed → failure_count++ (reaches 3)
-  │       → mark_invalid, reset count
-  ├─ Cycle 4 (15 min later): try again
-  │   └─ success (code=1000) → mark_valid ✅
-  └─ continue with the rest of the flow
-    ↓
-Initialize Datadog components ✅
-    ↓
-Data collection and reporting begin 🎉
-```
-
-### Key Characteristics
-
-#### 1. HeartbeatWorker (blocking)
-
-- **Interval**: 30 seconds
-- **Retries**: retried within the same cycle on failure (30s apart, up to 3 times)
-- **Failure handling**:
-  - 3 consecutive failures → `mark_heartbeat_inactive!`
-  - Failure count reset to 0
-  - Tries again in the next cycle (30s later)
-- **Success condition**: `code = 1000`
-- **Blocking**: ✅ must succeed once before the flow continues
-
-#### 2. AppRegistrationWorker (non-blocking)
-
-- **Interval**: 3 minutes
-- **Retries**: tries again in the next cycle after a failure
-- **Failure handling**:
-  - Failure → `mark_app_unregistered!`
-  - Retried after 3 minutes
-- **Success condition**: `code = 1000`
-- **Blocking**: ❌ non-blocking; runs in the background
-
-#### 3. LicenseWorker (blocking)
-
-- **Interval**: 5 minutes
-- **Retries**: 1 attempt per cycle; failures accumulate in a counter
-- **Failure handling**:
-  - Failure → `failure_count++`
-  - 3 consecutive failures → `mark_license_invalid!`
-  - Failure count reset to 0
-  - Tries again in the next cycle (5 min later)
-- **Success condition**: `code = 1000`
-- **Blocking**: ✅ must succeed once before Datadog is initialized
-
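-### Failure Recovery
-
-All workers support **automatic recovery**:
-
-- **HeartbeatWorker**: any cycle returning `code=1000` → `mark_heartbeat_active!` → initialization continues
-- **LicenseWorker**: any cycle returning `code=1000` → `mark_license_valid!` → initialization continues
-- **Failure count reset**: once consecutive failures reach the maximum, the count is reset to 0 and counting starts over in the next cycle
-
-### Why This Design?
-
-1. **Progressive verification**: each step must succeed before the next one starts, which avoids collecting data that would be invalid
-2. **Persistent retries**: a failure is never final; the workers keep trying until they succeed
-3. **Non-blocking for the application**: all verification runs on background threads and does not affect normal operation
-4. **Flexibility**: App Registration is optional and does not block the critical path
-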
-## Probe State Management
-
-### The ProbeState Component
-
-`ProbeState` manages the probe's global state and tracks several independent conditions:
-
-```ruby
-class ProbeState
-  def initialize
-    @host_id_ready = false # whether the Host ID is ready
-    @heartbeat_active = false # whether the heartbeat is active
-    @license_valid = false # whether the license is valid
-    @app_registered = false # whether the application is registered
-  end
-
-  # Whether data may be collected
-  def can_collect_data?
-    @host_id_ready && @heartbeat_active && @license_valid
-  end
-
-  # Whether the probe is suspended
-  def suspended?
-    !can_collect_data?
-  end
-
-  # Whether the probe is active
-  def active?
-    can_collect_data?
-  end
-end
-```
-
-### State Transitions
-
-```
-Initial state:
-  host_id_ready = false
-  heartbeat_active = false
-  license_valid = false
-  → suspended = true (no data collected)
-
-Host ID generated successfully:
-  host_id_ready = true
-  heartbeat_active = false
-  license_valid = false
-  → suspended = true (still no data collected)
-
-First heartbeat succeeds:
-  host_id_ready = true
-  heartbeat_active = true
-  license_valid = false
-  → suspended = true (still no data collected)
-
-First license verification succeeds:
-  host_id_ready = true
-  heartbeat_active = true
-  license_valid = true
-  → active = true (data collection starts) ✅
-
-Heartbeat fails 3 times:
-  heartbeat_active = false
-  → suspended = true (data collection stops) ❌
-
-License verification fails 3 times:
-  license_valid = false
-  → suspended = true (data collection stops) ❌
-
-Heartbeat recovers:
-  heartbeat_active = true
-  license_valid = true
-  → active = true (data collection resumes) ✅
-```
-
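The transitions above can be replayed against a small stand-in for `ProbeState`. The `mark_*!` method names appear in the README's flowchart, but their bodies are not shown there, so the implementation below (including the `Mutex`, added because several worker threads would update the flags) is an assumption. `SketchProbeState` is a hypothetical class name.

```ruby
# Hypothetical, thread-safe stand-in for ProbeState.
class SketchProbeState
  FLAGS = %i[host_id_ready heartbeat_active license_valid].freeze

  def initialize
    @mutex = Mutex.new
    @flags = FLAGS.to_h { |flag| [flag, false] }
  end

  def mark_host_id_ready!
    set(:host_id_ready, true)
  end

  def mark_heartbeat_active!
    set(:heartbeat_active, true)
  end

  def mark_heartbeat_inactive!
    set(:heartbeat_active, false)
  end

  def mark_license_valid!
    set(:license_valid, true)
  end

  def mark_license_invalid!
    set(:license_valid, false)
  end

  # Data is collected only while all three conditions hold.
  def can_collect_data?
    @mutex.synchronize { @flags.values.all? }
  end

  def suspended?
    !can_collect_data?
  end

  private

  def set(flag, value)
    @mutex.synchronize { @flags[flag] = value }
  end
end

state = SketchProbeState.new
steps = %i[mark_host_id_ready! mark_heartbeat_active! mark_license_valid!
           mark_heartbeat_inactive! mark_heartbeat_active!]
# Record whether the probe is suspended after each transition.
timeline = steps.map do |step|
  state.public_send(step)
  state.suspended?
end
```

The probe stays suspended until the third condition is met, is suspended again when the heartbeat goes inactive, and resumes as soon as the heartbeat recovers.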
-## Architecture
-
-### Component Overview
-
-```
-Cloudwise::Component
-├─ Client (HTTP client)
-├─ ProbeState (state management)
-├─ HostIdWorker (Host ID generation)
-├─ HeartbeatWorker (heartbeat)
-├─ LicenseWorker (license verification)
-└─ AppRegistrationWorker (application registration)
-```
-
-### Core Mechanisms
-
-#### 1. Callback Mechanism
-
-Cloudwise uses a Ruby block (callback) to defer initialization:
-
-```ruby
-# components.rb
-@cloudwise.initialize_async do
-  initialize_datadog_components(settings) # callback
-end
-```
-
-Once the Host ID is ready, Cloudwise calls this callback to initialize the Datadog components.
-
-#### 2. Threading Model
-
-```
-Main Thread (application main thread)
-  ├─ creates the Cloudwise component
-  └─ returns immediately; application startup complete ✅
-
-Thread: Host-ID-Worker (HostIdWorker)
-  └─ calls generateHostId in an endless loop (retry every 30s)
-      └─ on success, sets host_id_generated = true
-
-Thread: Cloudwise-Initializer (watcher thread)
-  ├─ [Step 1] poll host_id_generated? (blocking)
-  ├─ [Step 2] start HeartbeatWorker, poll heartbeat_active? (blocking)
-  ├─ [Step 3] start AppRegistrationWorker (non-blocking)
-  ├─ [Step 4] start LicenseWorker, poll license_valid? (blocking)
-  └─ [Step 5] call the callback to initialize Datadog components
-
-Thread: Heartbeat-Worker (HeartbeatWorker)
-  ├─ calls the heartbeat endpoint every 30 seconds
-  ├─ retries within the same cycle on failure (30s apart, up to 3 times)
-  └─ resets the count after 3 failures and continues in the next cycle
-
-Thread: AppRegistration-Worker (AppRegistrationWorker)
-  ├─ calls the application registration endpoint every 3 minutes
-  └─ failures do not affect the rest of the flow; continues in the next cycle
-
-Thread: License-Worker (LicenseWorker)
-  ├─ calls the license endpoint every 5 minutes
-  ├─ accumulates failures (up to 3)
-  └─ resets the count after 3 failures and continues in the next cycle
-```
-
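The five steps of the `Cloudwise-Initializer` thread amount to a sequence of polling gates. The sketch below shows that ordering under simplifying assumptions: `wait_until` and `run_initializer` are hypothetical helpers, the shared state is a plain Hash rather than a `ProbeState`, and the "workers" flip their flag immediately instead of making HTTP calls.

```ruby
# Polls the block until it returns true, sleeping between checks.
def wait_until(poll_interval: 1, sleeper: ->(s) { sleep(s) })
  sleeper.call(poll_interval) until yield
end

# Walks through the gates in the documented order. `start_worker` and
# `on_ready` are callables supplied by the caller.
def run_initializer(state, start_worker:, on_ready:, sleeper: ->(s) { sleep(s) })
  wait_until(sleeper: sleeper) { state[:host_id_ready] }    # step 1 (blocking)
  start_worker.call(:heartbeat)
  wait_until(sleeper: sleeper) { state[:heartbeat_active] } # step 2 (blocking)
  start_worker.call(:app_registration)                      # step 3 (not awaited)
  start_worker.call(:license)
  wait_until(sleeper: sleeper) { state[:license_valid] }    # step 4 (blocking)
  on_ready.call                                             # step 5
end

state = { host_id_ready: true, heartbeat_active: false, license_valid: false }
order = []
fake_worker = lambda do |name|
  order << name
  state[:heartbeat_active] = true if name == :heartbeat
  state[:license_valid] = true if name == :license
end

# Run on a separate thread, as the watcher thread would, then wait for it.
Thread.new do
  run_initializer(state, start_worker: fake_worker,
                         on_ready: -> { order << :datadog_ready },
                         sleeper: ->(_) {})
end.join
```

Application registration is started but never awaited, which is why a registration failure cannot delay the license gate or the final callback.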
-#### 3. Data Collection Control
-
-Data collection is controlled by the `cloudwise_probe_suspended?` method in `traces.rb`:
-
-```ruby
-# lib/datadog/tracing/transport/http/traces.rb
-def call(env, &block)
-  # Check the Cloudwise probe state
-  if cloudwise_probe_suspended?
-    Datadog.logger.debug { 'Cloudwise: Probe suspended, skipping trace submission' }
-    return build_mock_response(env)
-  end
-
-  # Submit the trace as usual
-  # ...
-end
-
-def cloudwise_probe_suspended?
-  return false unless defined?(Datadog.components)
-  return false unless Datadog.components.respond_to?(:cloudwise)
-
-  cloudwise = Datadog.components.cloudwise
-  return false unless cloudwise&.enabled?
-
-  cloudwise.probe_state.suspended?
-end
-```
-
-## Usage Examples
-
-### Basic Usage
-
-```ruby
-require 'datadog'
-
-Datadog.configure do |c|
-  c.service = 'my-ruby-app'
-  c.agent.host = '127.0.0.1'
-  c.agent.port = 8126
-
-  c.cloudwise.enabled = true
-  c.cloudwise.license_key = 'your-license-key'
-end
-
-# The application starts immediately without waiting for Cloudwise
-# Cloudwise initializes in the background
-```
-
-### Checking Status
-
-```ruby
-# Check the Cloudwise status
-status = Datadog.components.cloudwise.status
-puts status
-# => {
-#   enabled: true,
-#   account_id: "acc-1234567890",
-#   host_id_generated: true,
-#   host_id_ready: true,
-#   heartbeat_active: true,
-#   license_valid: true,
-#   app_registered: true,
-#   can_collect_data: true,
-#   heartbeat_running: true,
-#   license_running: true,
-#   app_registration_running: true
-# }
-
-# Check whether data can be collected
-if Datadog.components.cloudwise.probe_state.active?
-  puts "Probe is active, collecting data"
-else
-  puts "Probe is suspended, NOT collecting data"
-end
-```
-
-## Troubleshooting
-
-### Common Issues
-
-#### 1. Host ID Retrieval Keeps Failing
-
-**Symptom**: the application starts normally, but the logs show the Host ID request being retried continuously
-
-**Causes**:
-- The API server is unreachable
-- Network problems
-- Authentication failure
-
-**Solution**:
-```bash
-# Check network connectivity
-curl -X POST http://127.0.0.1:8126/v2/app/generateHostId \
-  -H "Content-Type: application/json" \
-  -d '{
-    "serverName": "my-ruby-app",
-    "licenseKey": "your-license-key",
-    "timestamp": 1234567890
-  }'
-
-# Check the logs
-grep "Host ID generation failed" application.log
-```
-
-#### 2. Data Is Not Reported
-
-**Symptom**: the application runs normally, but no data is reported to Datadog
-
-**Steps to check**:
-
-1. Check whether the Host ID is ready:
-```ruby
-Datadog.components.cloudwise.status[:host_id_ready]
-```
-
-2. Check the probe state:
-```ruby
-Datadog.components.cloudwise.probe_state.active?
-```
-
-3. Check each ProbeState condition:
-```ruby
-status = Datadog.components.cloudwise.probe_state.status
-# => {
-#   host_id_ready: true/false,
-#   heartbeat_active: true/false,
-#   license_valid: true/false,
-#   app_registered: true/false,
-#   can_collect_data: true/false
-# }
-```
-
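When working through the checks above, it can help to reduce the status hash to the conditions that are actually holding data collection back. The helper below is a hypothetical convenience, not part of the package; it assumes a hash shaped like the `probe_state.status` output shown in the README and ignores `app_registered`, since registration does not gate collection.

```ruby
# The three conditions that `can_collect_data?` depends on.
REQUIRED_PROBE_CONDITIONS = %i[host_id_ready heartbeat_active license_valid].freeze

# Returns the required conditions that are currently false (or missing).
def blocking_conditions(status)
  REQUIRED_PROBE_CONDITIONS.reject { |key| status[key] }
end

sample_status = {
  host_id_ready: true,
  heartbeat_active: false,
  license_valid: true,
  app_registered: false,
  can_collect_data: false
}
blockers = blocking_conditions(sample_status)
```

An empty result means all three prerequisites hold, so a missing-data problem most likely lies elsewhere.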
-#### 3. Heartbeat Failures
-
-**Symptom**: the logs show heartbeat failures
-
-**Solution**:
-```bash
-# Test the heartbeat endpoint manually
-curl -X POST http://127.0.0.1:8126/api/v1/agent/heartbeat \
-  -H "Content-Type: application/json" \
-  -d '{
-    "accountId": "acc-1234567890",
-    "serverName": "my-ruby-app",
-    "timestamp": 1234567890
-  }'
-```
-
-## Development and Testing
-
-### Unit Tests
-
-Run the Cloudwise tests:
-
-```bash
-bundle exec rspec spec/datadog/core/cloudwise/
-```
-
-### Integration Tests
-
-Create a test application:
-
-```ruby
-# test_app.rb
-require 'datadog'
-require 'sinatra'
-
-Datadog.configure do |c|
-  c.service = 'test-app'
-  c.agent.host = '127.0.0.1'
-  c.agent.port = 8126
-
-  c.cloudwise.enabled = true
-  c.cloudwise.license_key = 'test-license-key'
-end
-
-get '/' do
-  "Hello, Cloudwise!"
-end
-
-get '/status' do
-  Datadog.components.cloudwise.status.to_json
-end
-```
-
-Run the application and check its status:
-
-```bash
-ruby test_app.rb
-
-# In another terminal
-curl http://localhost:4567/status
-```
-
-## Performance Considerations
-
-### Resource Usage
-
-- **Memory**: the Cloudwise component uses about 1-2 MB
-- **CPU**: the background workers use almost no CPU (they sleep most of the time)
-- **Network**:
-  - Host ID: a one-off request (retried every 30s; stops once it succeeds)
-  - Heartbeat: once every 30 seconds
-  - License: once every 5 minutes
-  - App registration: once every 3 minutes
-
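As a rough illustration of the network cost, the default intervals translate into the following steady-state request rates. This is simple arithmetic on the documented defaults; it ignores the one-off Host ID request and any in-cycle heartbeat retries.

```ruby
# Default intervals in seconds, as listed in the configuration section.
DEFAULT_INTERVALS = { heartbeat: 30, license: 300, app_registration: 180 }.freeze

# Requests per hour for each periodic worker.
requests_per_hour = DEFAULT_INTERVALS.transform_values { |seconds| 3600 / seconds }
# heartbeat: 120, license: 12, app_registration: 20

total_per_hour = requests_per_hour.values.sum # 152
```

The heartbeat dominates the total, so `DD_CLOUDWISE_HEARTBEAT_INTERVAL` is the setting with the largest effect on request volume.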
-### Impact on Startup Time
-
-- ✅ **No impact**: Cloudwise is asynchronous and does not block application startup
-- ✅ **No impact**: the application can handle requests immediately
-- ⚠️ **Delay**: Datadog components are initialized later (after waiting for the Host ID)
-
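-## Best Practices
-
-1. **Configuration management**: use environment variables rather than hard-coded settings
-2. **Log monitoring**: check the Cloudwise logs regularly
-3. **Health checks**: include the Cloudwise status in the application's health-check endpoint
-4. **Error handling**: a Cloudwise failure should never crash the application
-5. **Resource limits**: make sure the container has enough memory and network access
-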
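Practices 3 and 4 can be combined in a health-check endpoint that reports the Cloudwise status without ever failing because of it. The sketch below is a Rack-style callable written without any gem dependency; `cloudwise_health_app` and `status_provider` are hypothetical names. In an application the provider might be `-> { Datadog.components.cloudwise.status }`, using the `status` method shown earlier in the README.

```ruby
require 'json'

# Builds a Rack-compatible callable. `status_provider` is any callable that
# returns a Hash; errors raised by it are reported in the body instead of
# propagating, so the health check itself keeps answering.
def cloudwise_health_app(status_provider)
  lambda do |_env|
    status = begin
      status_provider.call
    rescue StandardError => e
      { error: e.class.name }
    end
    body = JSON.generate({ cloudwise: status })
    [200, { 'content-type' => 'application/json' }, [body]]
  end
end

healthy = cloudwise_health_app(-> { { can_collect_data: true } })
broken  = cloudwise_health_app(-> { raise IOError, 'status unavailable' })

healthy_response = healthy.call({})
broken_response  = broken.call({})
```

Returning 200 in both cases is a deliberate choice in this sketch: the endpoint describes the probe, while the application's own liveness stays independent of Cloudwise.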
-## Version History
-
-### v2.0 (current)
-- ✅ Asynchronous initialization; does not block application startup
-- ✅ Lazy loading of Datadog components
-- ✅ Unlimited Host ID retries
-- ✅ Configured through DD_SERVICE and DD_AGENT_HOST/PORT
-
-### v1.0 (deprecated)
-- ❌ Synchronous initialization; blocks application startup
-- ❌ Datadog components initialized immediately
-- ❌ Limited Host ID retries
-
-## References
-
-- [Implementation Guide V2](./IMPLEMENTATION_V2.md) - detailed implementation notes
-- [Static Typing Guide](../../docs/StaticTypingGuide.md) - RBS type definitions
-- [Pull Request Template](../../../.github/PULL_REQUEST_TEMPLATE.md) - contribution guide