kempnerpulse-0.1.0.tar.gz

@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Kempner Institute, Harvard University
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,273 @@
+ Metadata-Version: 2.4
+ Name: kempnerpulse
+ Version: 0.1.0
+ Summary: Real-time GPU monitoring dashboard for DCGM Prometheus metrics
+ Author: Kempner Institute, Harvard University
+ License: MIT License
+
+ Copyright (c) 2026 Kempner Institute, Harvard University
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ Project-URL: Homepage, https://github.com/KempnerInstitute/kempnerpulse
+ Project-URL: Issues, https://github.com/KempnerInstitute/kempnerpulse/issues
+ Keywords: gpu,monitoring,dcgm,nvidia,dashboard,hpc
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Environment :: Console
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Intended Audience :: System Administrators
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: POSIX :: Linux
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.9
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: System :: Monitoring
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: rich>=13.0
+ Dynamic: license-file
+
+ # KempnerPulse
+
+ > `nvidia-smi` says 100% GPU utilization, but are your tensor cores even active? KempnerPulse shows what's *actually* happening.
+
+ Real-time GPU monitoring dashboard for DCGM Prometheus metrics. A single-file
+ Rich-based TUI that streams
+ [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) `/metrics` and
+ renders four interactive views in the terminal.
+
+ ![KempnerPulse Demo](docs/images/kempner_pulse_screen_record.gif)
+
+ ## Features
+
+ - **Fleet View**: All GPUs at a glance: utilization, memory, power,
+   temperature, PCIe/NVLink bandwidth, sparkline bars.
+ - **Focus View**: Deep dive into one GPU with per-metric sparkline history.
+ - **Plot View**: Stacked line charts across all GPUs.
+ - **Job View**: Running GPU compute processes with per-GPU metrics.
+ - **Real Utilization**: Weighted composite metric from SM active, tensor pipe,
+   DRAM active, and GR engine counters (customizable weights with presets for
+   AI/ML, HPC, and memory-bound workflows).
+ - **Workload Classification**: 12-category status based on NVIDIA DCGM
+   profiling metric guidance (idle, tensor-heavy compute, memory-bound, I/O,
+   etc.).
+ - **Health Monitoring**: Temperature, PCIe replay errors, and ECC errors
+   with color-coded alerts.
+ - **SLURM/CUDA Aware**: Automatically detects `CUDA_VISIBLE_DEVICES`,
+   `SLURM_JOB_GPUS`, etc. to show only your allocated GPUs.
+ - **Zero Dependencies** beyond Python 3.9+ and `rich`.
+
+ ## Screenshots
+
+ ### Fleet View
+
+ All GPUs at a glance with utilization bars, memory, power, temperature, and bandwidth.
+
+ ![Fleet View](docs/images/fleet_view.png)
+
+ ### Focus View
+
+ Deep dive into a single GPU with per-metric sparkline history.
+
+ ![Focus View](docs/images/focus_view.png)
+
+ ### Plot View
+
+ Stacked line charts across all GPUs.
+
+ ![Plot View](docs/images/plot_view.png)
+
+ ### Job View
+
+ Running GPU compute processes with per-GPU metrics.
+
+ ![Job View](docs/images/job_view.png)
+
+ ## Requirements
+
+ - Linux with NVIDIA GPUs
+ - [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) running and
+   exposing `/metrics` (default: `http://localhost:9400/metrics`)
+ - Python >= 3.9
+ - `nvidia-smi` on the PATH (for hardware queries and process listing)
+
+ ## Installation
+
+ Install locally (recommended until published on PyPI):
+
+ ```bash
+ pip install .
+ ```
+
+ Or run directly (installs only the `rich` dependency):
+
+ ```bash
+ pip install rich
+ python3 kempner_pulse.py
+ ```
+
+ ## Quick Start
+
+ ```bash
+ # Default: connect to localhost:9400/metrics, show SLURM/CUDA-visible GPUs
+ kempnerpulse
+
+ # Explicit source and GPU selection
+ kempnerpulse --source http://gpu-node:9400/metrics --gpus 0,1,2,3
+
+ # Show all GPUs on the node
+ kempnerpulse --show-all
+
+ # Start in focus view for GPU 0
+ kempnerpulse --focus-gpu 0
+
+ # Use the HPC weight preset
+ kempnerpulse --hpc-weights
+
+ # Custom weights (SM, Tensor, DRAM, GR; normalized automatically)
+ kempnerpulse --weights 0.40,0.30,0.20,0.10
+ ```
+
+ ## Interactive Commands
+
+ | Command | Action |
+ |---------------|---------------------------------------------|
+ | `:focus <id>` | Enter focused view for a specific GPU |
+ | `:plot` | Enter plot view (line charts) |
+ | `:job` | Enter job view (running GPU processes) |
+ | `:q` | Return to fleet view (or exit if already in fleet view) |
+ | `:exit` | Exit the dashboard |
+ | `Ctrl+C` | Exit the dashboard |
+ | `Esc` | Cancel an unfinished `:` command |
+
+ ## CLI Reference
+
+ | Flag | Type | Default | Description |
+ |------|------|---------|-------------|
+ | `--version` | | | Show version and exit. |
+ | `--source URL` | string | `http://localhost:9400/metrics` | dcgm-exporter `/metrics` endpoint or a local text file. |
+ | `--poll SECS` | float | `1.0` | Dashboard redraw interval in seconds (does not change the DCGM sampling rate). |
+ | `--history N` | int | `120` | Number of samples kept for sparkline history. |
+ | `--focus-gpu ID` | string | | Start in Focus View for the given GPU id (e.g. `0`). |
+ | `--once` | flag | | Render a single snapshot and exit instead of running live. |
+ | `--gpus IDS` | string | | Explicit GPU ids or ranges (`0,1` or `0-3`). Overrides SLURM/CUDA env vars. |
+ | `--show-all` | flag | | Ignore SLURM/CUDA visibility env vars; show every GPU in the source. |
+ | `--weights W` | 4 floats | `0.35,0.35,0.20,0.10` | Comma-separated Real Util weights: SM,TENSOR,DRAM,GR. Auto-normalized. |
+ | `--ai-weights` | preset | | AI/LLM training preset `(0.35, 0.35, 0.20, 0.10)`. This is the default. |
+ | `--hpc-weights` | preset | | HPC / mixed CUDA preset `(0.45, 0.15, 0.25, 0.15)`. |
+ | `--mem-weights` | preset | | Memory-bound / bandwidth-heavy preset `(0.35, 0.10, 0.40, 0.15)`. |
+
+ ### GPU Visibility Selection
+
+ The dashboard determines which GPUs to display from the first available source, in this order:
+
+ 1. `--gpus` flag
+ 2. `CUDA_VISIBLE_DEVICES` env var
+ 3. `NVIDIA_VISIBLE_DEVICES` env var
+ 4. `SLURM_STEP_GPUS` env var
+ 5. `SLURM_JOB_GPUS` env var
+
+ If none is set, all GPUs on the node are shown. Use `--show-all` to
+ explicitly override all env vars. All GPU selections are filtered against the
+ GPUs accessible to the current process (as reported by `nvidia-smi`),
+ which respects cgroup and container restrictions.
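The selection order above can be sketched as follows (function names and the `None`-means-no-filter convention are illustrative, not the package's actual internals):

```python
import os
from typing import List, Optional

def parse_gpu_ids(spec: str) -> List[str]:
    """Expand an id spec like '0,2' or '0-3' into a list of GPU id strings."""
    ids: List[str] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            ids.extend(str(i) for i in range(int(lo), int(hi) + 1))
        elif part:
            ids.append(part)
    return ids

def select_visible_gpus(gpus_flag: Optional[str] = None,
                        show_all: bool = False) -> Optional[List[str]]:
    """Return the GPU ids to display, or None for 'show everything'."""
    if show_all:
        return None                      # --show-all overrides all env vars
    if gpus_flag:
        return parse_gpu_ids(gpus_flag)  # --gpus overrides env vars
    # Environment variables, checked in the documented precedence order
    for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES",
                "SLURM_STEP_GPUS", "SLURM_JOB_GPUS"):
        value = os.environ.get(var)
        if value:
            return parse_gpu_ids(value)
    return None                          # nothing set: show all GPUs
```

In the real tool, any resulting selection is additionally intersected with the GPUs that `nvidia-smi` reports as accessible.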
+
+ ## Weight Presets
+
+ | Preset | Flag | SM | Tensor | DRAM | GR | Best For |
+ |-----------------|------------------|-------|--------|-------|-------|----------|
+ | AI/ML (default) | `--ai-weights` | 0.35 | 0.35 | 0.20 | 0.10 | DL training, LLM inference, transformers |
+ | HPC | `--hpc-weights` | 0.45 | 0.15 | 0.25 | 0.15 | Scientific computing, mixed CUDA |
+ | Memory-bound | `--mem-weights` | 0.35 | 0.10 | 0.40 | 0.15 | Bandwidth-heavy workloads, stencil codes |
+
+ Custom: `--weights 0.40,0.30,0.20,0.10` (values are normalized automatically).
+
+ ## How It Works
+
+ KempnerPulse reads Prometheus text-format metrics from dcgm-exporter over HTTP
+ (or from a local file). It computes a **Real Utilization** score as a weighted
+ combination of four DCGM profiling counters:
+
+ ```
+ Real Util = clamp(0, 100,
+     W_sm     × SM_ACTIVE
+   + W_tensor × TENSOR_ACTIVE
+   + W_dram   × DRAM_ACTIVE
+   + W_gr     × GR_ENGINE_ACTIVE)
+ ```
+
+ This gives a more accurate picture of GPU utilization than `nvidia-smi`'s
+ `GPU-Util` alone, which reports only the kernel-launch duty cycle.
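With the default AI/ML weights, the formula can be written as a small function (a sketch, not the package's internals; it assumes the DCGM activity ratios have already been scaled to 0-100 %):

```python
def real_util(sm, tensor, dram, gr, weights=(0.35, 0.35, 0.20, 0.10)):
    """Weighted Real Utilization score from four DCGM activity values (0-100).

    Weights are normalized to sum to 1, mirroring --weights auto-normalization.
    """
    total = sum(weights)
    w_sm, w_tensor, w_dram, w_gr = (w / total for w in weights)
    score = w_sm * sm + w_tensor * tensor + w_dram * dram + w_gr * gr
    return max(0.0, min(100.0, score))  # clamp(0, 100, ...)
```

A fully saturated GPU (all four counters at 100) scores 100 regardless of the chosen weights, since they always normalize to a convex combination.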
+
+ ## Workload Classification
+
+ Each GPU is classified into one of **12 categories** on every refresh cycle,
+ based on thresholds from
+ [NVIDIA's DCGM profiling metric guidance](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#profiling).
+ Categories are evaluated in order, and the first matching rule wins.
+
+ | Status | Thresholds | Rationale |
+ |--------|------------|-----------|
+ | **idle** | Real Util < 5 %, GR < 5 %, DRAM < 5 %, no I/O | Nothing running. |
+ | **tensor-heavy compute** | Tensor ≥ 50 % and SM ≥ 60 % | DL training / large-scale inference. |
+ | **tensor compute** | Tensor ≥ 15 % and SM ≥ 40 % | Mixed-precision, moderate tensor use. |
+ | **FP64 / HPC compute** | FP64 ≥ 20 % and SM ≥ 50 % | Scientific double-precision workload. |
+ | **I/O or data-loading** | Memcpy ≥ 40 % or PCIe ≥ 1 GB/s, SM < 30 % | Heavy transfer; SMs idle. |
+ | **memory-bound** | DRAM ≥ 50 % and SM < 50 % | Bandwidth limited. |
+ | **compute-heavy** | SM ≥ 80 % | NVIDIA guidance: ≥ 80 % indicates effective SM use. |
+ | **compute-active** | SM ≥ 50 % | Moderate compute, no tensor dominance. |
+ | **memory-active** | DRAM ≥ 40 % | Significant DRAM traffic. |
+ | **busy, low SM use** | GR ≥ 40 % and SM < 25 % | Overhead / sync / small kernels. |
+ | **low utilization** | GR < 15 %, SM < 15 %, DRAM < 15 % | Barely active. |
+ | **mixed / moderate** | *(fallthrough)* | No single dominant pattern. |
+
+ Full details, bottleneck color key, and NVIDIA reference points:
+ [docs/classification.md](docs/classification.md)
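The first-match evaluation can be sketched as an ordered rule list (function name and argument names are illustrative; thresholds are those of the table above, with all inputs in percent and PCIe in GB/s):

```python
def classify(real, gr, sm, tensor, dram, fp64, memcpy, pcie_gbs):
    """Return the first matching workload category for one GPU sample."""
    io_active = memcpy >= 40 or pcie_gbs >= 1.0
    rules = [
        ("idle", real < 5 and gr < 5 and dram < 5 and not io_active),
        ("tensor-heavy compute", tensor >= 50 and sm >= 60),
        ("tensor compute", tensor >= 15 and sm >= 40),
        ("FP64 / HPC compute", fp64 >= 20 and sm >= 50),
        ("I/O or data-loading", io_active and sm < 30),
        ("memory-bound", dram >= 50 and sm < 50),
        ("compute-heavy", sm >= 80),
        ("compute-active", sm >= 50),
        ("memory-active", dram >= 40),
        ("busy, low SM use", gr >= 40 and sm < 25),
        ("low utilization", gr < 15 and sm < 15 and dram < 15),
    ]
    for status, matched in rules:  # evaluated in order; first match wins
        if matched:
            return status
    return "mixed / moderate"      # fallthrough
```

Because evaluation stops at the first match, a GPU with Tensor ≥ 50 % and SM ≥ 80 % reports "tensor-heavy compute", not "compute-heavy".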
+
+ ## Health Monitoring
+
+ | Status | Condition | Meaning |
+ |--------|-----------|---------|
+ | **OK** | *(none of the below)* | Normal operation. |
+ | **WARN** | PCIe replay rate > 0/s | PCIe link retransmissions occurring. |
+ | **HOT** | GPU or memory temp ≥ warning threshold | Approaching thermal throttling. |
+ | **CRIT** | Row-remap failure > 0 or uncorrectable remapped rows > 0 | Hardware memory errors. Remove from production. |
+
+ Temperature warning thresholds are per-model (A100: 93 °C, H100/H200: 95 °C,
+ RTX 6000: 92 °C, default: 93 °C). Full threshold table:
+ [docs/classification.md](docs/classification.md#temperature-thresholds-by-gpu-model)
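A sketch of the health check, assuming the conditions are evaluated most-severe-first (the table does not state an ordering; that ordering, the function name, and its parameters are assumptions for illustration):

```python
def health_status(pcie_replays_per_s, gpu_temp_c, mem_temp_c,
                  remap_failures, uncorrectable_rows, warn_temp_c=93.0):
    """Return the worst applicable health status for one GPU sample.

    warn_temp_c is the per-model warning threshold (93 °C default,
    95 °C for H100/H200, 92 °C for RTX 6000).
    """
    if remap_failures > 0 or uncorrectable_rows > 0:
        return "CRIT"   # hardware memory errors: remove from production
    if gpu_temp_c >= warn_temp_c or mem_temp_c >= warn_temp_c:
        return "HOT"    # approaching thermal throttling
    if pcie_replays_per_s > 0:
        return "WARN"   # PCIe link retransmissions occurring
    return "OK"
```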
+
+ ## DCGM Metrics
+
+ KempnerPulse consumes ~30 DCGM fields covering profiling counters, memory,
+ temperature, power, clocks, PCIe, NVLink, and error counters. The complete
+ list with descriptions and NVIDIA doc links:
+ [docs/metrics.md](docs/metrics.md)
+
+ ## Performance Overhead
+
+ KempnerPulse adds minimal runtime overhead: roughly 8.2% of a single CPU core on an AMD EPYC 9374F, with memory usage below the reporting resolution of `top`.
+
+ ## License
+
+ MIT. See [LICENSE](LICENSE) for details.