gpu-usage-audit 1.0.2__tar.gz → 1.1.0__tar.gz
This diff shows the changes between two publicly released versions of the package, as they appear in their public registry. It is provided for informational purposes only.
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.gitignore +2 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/CHANGELOG.md +26 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/PKG-INFO +130 -241
- gpu_usage_audit-1.1.0/README.ko.md +230 -0
- gpu_usage_audit-1.1.0/README.md +230 -0
- gpu_usage_audit-1.1.0/docs/work-specs/0001-gua-board-cloud-sync.ko.md +238 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/handoff.ko.md +27 -23
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/status.ko.md +39 -13
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/pyproject.toml +1 -1
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/__main__.py +248 -33
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/__init__.py +14 -0
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/client.py +145 -0
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/config.py +100 -0
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/snapshot.py +112 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/daemon.py +30 -8
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/db.py +253 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/doctor.py +24 -14
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/identity.py +19 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/model.py +17 -1
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/nvml.py +30 -1
- gpu_usage_audit-1.1.0/src/gpu_usage_audit/paths.py +22 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/render.py +4 -1
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/report.py +55 -21
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/tier.py +25 -4
- gpu_usage_audit-1.1.0/tests/test_cloud_cli.py +275 -0
- gpu_usage_audit-1.1.0/tests/test_cloud_client.py +159 -0
- gpu_usage_audit-1.1.0/tests/test_cloud_config.py +61 -0
- gpu_usage_audit-1.1.0/tests/test_cloud_snapshot.py +156 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_daemon.py +4 -1
- gpu_usage_audit-1.1.0/tests/test_db.py +238 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_doctor.py +1 -3
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_identity.py +23 -1
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_nvml.py +62 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_report.py +60 -1
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_smoke.py +5 -2
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/uv.lock +1 -1
- gpu_usage_audit-1.0.2/README.md +0 -341
- gpu_usage_audit-1.0.2/src/gpu_usage_audit/db.py +0 -135
- gpu_usage_audit-1.0.2/tests/test_db.py +0 -105
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.github/workflows/ci.yml +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.github/workflows/release.yml +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/LICENSE +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/plan.ko.md +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/scripts/check-tag-version.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/scripts/smoke-dist-wheel.sh +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/__init__.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/classify.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/summarize.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/__init__.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_classify.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_render.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_summarize.py +0 -0
- {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_tier.py +0 -0
|
@@ -1,5 +1,31 @@
 # Changelog
 
+## 1.1.0 - 2026-06-17
+
+- Added optional GUA Board cloud sync. `gua enroll` claims a one-time
+enrollment token from a GUA Board workspace and stores a host-scoped,
+write-only agent token in `~/.gua/cloud.json` (mode 0600). `gua sync-once`
+collects one snapshot, writes it to the local history database first, then
+pushes the latest state to GUA Board; a failed push never blocks or rolls
+back the local write. Cloud sync is entirely optional — local collection,
+storage, and `gua report` are unchanged when no host is enrolled, and no new
+runtime dependency is added (the client uses the standard library).
+- Enriched NVML collection with per-GPU name, total/used memory, temperature,
+power, and physical index, plus per-process name (from `/proc/<pid>/comm`;
+full command lines are never collected). The local SQLite schema gained
+these columns plus a normalized `gpu_device` table. The migration is
+additive (nullable columns), so existing `~/.gua/gua.db` databases upgrade
+in place and `gua report` output is unaffected.
+
+## 1.0.3 - 2026-05-27
+
+- Changed default `gua` state paths to `~/.gua/gua.db`, `~/.gua/gua.pid`,
+and `~/.gua/gua.log`; the default database now acts as an appendable local
+history database.
+- Record daemon run intervals in SQLite and attach samples to a run, so
+`gua report` uses recorded intervals by default. `--interval` is now an
+override and a fallback for legacy rows without interval metadata.
+
 ## 1.0.2 - 2026-05-15
 
 - Hardened `gua status` and `gua stop` so stale PID files do not act on
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-usage-audit
-Version: 1.0
+Version: 1.1.0
 Summary: Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss.
 Project-URL: Homepage, https://github.com/AI-Ocean/gpu-usage-audit
 Project-URL: Issues, https://github.com/AI-Ocean/gpu-usage-audit/issues
@@ -223,72 +223,58 @@ Description-Content-Type: text/markdown
|
|
|
223
223
|
|
|
224
224
|
# gpu-usage-audit
|
|
225
225
|
|
|
226
|
-
|
|
227
|
-
SQLite and produces a retrospective report separating *active* use from
|
|
228
|
-
*allocated-but-idle* ("idle-held") and *truly idle* (no process at all).
|
|
226
|
+
Single-host NVIDIA GPU usage audit for finding **idle-held** GPUs: cards that look idle by utilization, but are still held by a process through GPU memory.
|
|
229
227
|
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
|
|
233
|
-
|
|
234
|
-
*unusable* by anyone else. This tool measures that.
|
|
228
|
+
[](https://pypi.org/project/gpu-usage-audit/)
|
|
229
|
+
[](https://pypi.org/project/gpu-usage-audit/)
|
|
230
|
+
[](LICENSE)
|
|
231
|
+
[](https://github.com/AI-Ocean/gpu-usage-audit/releases)
|
|
235
232
|
|
|
236
|
-
|
|
237
|
-
> `gua doctor` checks only the current machine. `daemon` records NVML
|
|
238
|
-
> telemetry from the current NVIDIA host, `report` reads the resulting
|
|
239
|
-
> SQLite database, and `demo` runs anywhere with fake telemetry. The Go
|
|
240
|
-
> v0.1.0 implementation remains downloadable at tag `v0.1.0` / branch
|
|
241
|
-
> [`go-archive`](https://github.com/AI-Ocean/gpu-usage-audit/tree/go-archive).
|
|
233
|
+
[English](README.md) · [한국어](README.ko.md) · [Releases](https://github.com/AI-Ocean/gpu-usage-audit/releases) · [Issues](https://github.com/AI-Ocean/gpu-usage-audit/issues)
|
|
242
234
|
|
|
243
|
-
|
|
235
|
+
---
|
|
244
236
|
|
|
245
|
-
|
|
237
|
+
## About
|
|
246
238
|
|
|
247
|
-
|
|
248
|
-
uv creates the isolated tool environment and manages the needed Python
|
|
249
|
-
runtime. If Python downloads are disabled by local policy, install Python
|
|
250
|
-
3.12+ first.
|
|
239
|
+
gpu-usage-audit records local NVIDIA/NVML telemetry into SQLite and renders a retrospective report that separates GPU card-ticks into:
|
|
251
240
|
|
|
252
|
-
|
|
253
|
-
|
|
241
|
+
- `active`: utilization is high enough that the GPU is doing real work
|
|
242
|
+
- `idle-held`: utilization is low, but a process still holds GPU memory
|
|
243
|
+
- `truly-idle`: no meaningful GPU process memory is present
|
|
254
244
|
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
|
|
260
|
-
|
|
245
|
+
The second category is the point. A notebook can sit at 1% SM utilization while keeping an 8 GB tensor allocated. Conventional dashboards usually flatten that into “idle”; this tool shows that the card is effectively unavailable.
|
|
246
|
+
|
|
247
|
+
## Features
|
|
248
|
+
|
|
249
|
+
- Single-host, bare-metal NVIDIA GPU audit
|
|
250
|
+
- `gua doctor` readiness check for `/dev/nvidia*`, `nvidia-smi`, NVML, and DB path
|
|
251
|
+
- Background collector with `gua daemon`, `gua status`, and `gua stop`
|
|
252
|
+
- SQLite history database at `~/.gua/gua.db` by default
|
|
253
|
+
- Report sections for headline split, idle capacity, per-GPU state, top identities, and time-of-day heatmap
|
|
254
|
+
- Daemon interval metadata stored per run, so reports compute GPU-hours correctly across mixed 30s / 10s runs
|
|
255
|
+
- GPU-less `gua demo` command with deterministic fake telemetry
|
|
256
|
+
- No cluster runtime dependency; no Kubernetes, Slurm, Docker, or remote-node scan in the 1.0 scope
|
|
261
257
|
|
|
262
|
-
|
|
263
|
-
machine: OS/kernel/Python, `/dev/nvidia*`, `nvidia-smi -L`, NVML
|
|
264
|
-
load/init/device count/driver version, and the database path the daemon
|
|
265
|
-
would write to. The default is `/tmp/gua.db`; pass `gua doctor --db PATH`
|
|
266
|
-
when you plan to use a different daemon database.
|
|
258
|
+
## Installation
|
|
267
259
|
|
|
268
|
-
|
|
269
|
-
The JSON includes local paths, command stderr, and `nvidia-smi -L` output
|
|
270
|
-
with GPU UUIDs, so review it before sharing it outside your team.
|
|
271
|
-
`gua doctor` does not need `sudo`; run it as the same user that will run
|
|
272
|
-
the daemon.
|
|
260
|
+
The recommended install path is PyPI via [uv](https://docs.astral.sh/uv/):
|
|
273
261
|
|
|
274
|
-
|
|
275
|
-
|
|
262
|
+
```sh
|
|
263
|
+
uv tool install gpu-usage-audit
|
|
264
|
+
```
|
|
276
265
|
|
|
277
|
-
Update or remove
|
|
266
|
+
Update or remove it with:
|
|
278
267
|
|
|
279
268
|
```sh
|
|
280
269
|
uv tool upgrade gpu-usage-audit
|
|
281
270
|
uv tool uninstall gpu-usage-audit
|
|
282
271
|
```
|
|
283
272
|
|
|
284
|
-
|
|
285
|
-
its `gua` / `gpu-usage-audit` commands.
|
|
286
|
-
|
|
287
|
-
GitHub Release assets are also available for manual download:
|
|
273
|
+
Manual wheel downloads are available from GitHub Releases:
|
|
288
274
|
|
|
289
275
|
```sh
|
|
290
|
-
BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.
|
|
291
|
-
WHEEL="gpu_usage_audit-1.0.
|
|
276
|
+
BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.3"
|
|
277
|
+
WHEEL="gpu_usage_audit-1.0.3-py3-none-any.whl"
|
|
292
278
|
|
|
293
279
|
curl -fsSLO "$BASE/$WHEEL"
|
|
294
280
|
curl -fsSLO "$BASE/SHA256SUMS"
|
|
@@ -297,202 +283,94 @@ sha256sum -c SHA256SUMS --ignore-missing
|
|
|
297
283
|
uvx --from "./$WHEEL" gua doctor
|
|
298
284
|
```
|
|
299
285
|
|
|
300
|
-
##
|
|
286
|
+
## Quick Start
|
|
287
|
+
|
|
288
|
+
On an NVIDIA GPU host:
|
|
301
289
|
|
|
290
|
+
```sh
|
|
291
|
+
gua doctor
|
|
292
|
+
gua daemon --interval 30s
|
|
293
|
+
gua status
|
|
294
|
+
gua report --since 1h
|
|
295
|
+
gua stop
|
|
302
296
|
```
|
|
303
|
-
|
|
297
|
+
|
|
298
|
+
`gua doctor` is read-only. It does not need `sudo`; run it as the same user that will run the daemon.
|
|
299
|
+
|
|
300
|
+
Default local state lives under `~/.gua/`:
|
|
301
|
+
|
|
302
|
+
| Path | Purpose |
|
|
303
|
+
| --- | --- |
|
|
304
|
+
| `~/.gua/gua.db` | SQLite history database |
|
|
305
|
+
| `~/.gua/gua.pid` | background daemon PID file |
|
|
306
|
+
| `~/.gua/gua.log` | daemon stdout/stderr log |
|
|
307
|
+
|
|
308
|
+
The default DB is an appendable local history database; later daemon runs append to it. If you pass a custom `--db PATH`, the daemon still refuses to start when that file already exists, to avoid mixing ad hoc runs by accident.
|
|
309
|
+
|
|
310
|
+
## Report Preview
|
|
311
|
+
|
|
312
|
+
```text
|
|
313
|
+
$ gua report --since 1h
|
|
304
314
|
gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
|
|
305
315
|
|
|
306
316
|
§1 Headline
|
|
307
317
|
basis: one sample = one GPU card at one daemon tick
|
|
308
318
|
rules: active >=10% util; idle-held <10% util with >100 MB process memory
|
|
309
|
-
█████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
|
|
310
319
|
active █ 15.7%
|
|
311
|
-
idle-held ▒ 45.1%
|
|
320
|
+
idle-held ▒ 45.1%
|
|
312
321
|
truly-idle ░ 39.2%
|
|
313
322
|
(51 samples)
|
|
314
323
|
|
|
315
324
|
§2 Idle capacity
|
|
316
|
-
converted from card-ticks to GPU-hours using
|
|
325
|
+
converted from card-ticks to GPU-hours using recorded daemon interval
|
|
317
326
|
idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
|
|
318
327
|
truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
|
|
319
328
|
|
|
320
329
|
§3 Per-GPU
|
|
321
|
-
per-card share of samples in the same three states
|
|
322
|
-
GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
|
|
323
|
-
GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
|
|
324
|
-
GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
|
|
325
|
-
|
|
326
330
|
§4 Top identities
|
|
327
|
-
one identity counts once per GPU/tick after its processes are summed
|
|
328
|
-
identity gpu-hours idle-held samples
|
|
329
|
-
alice 0.42 42.9% 51
|
|
330
|
-
bob 0.28 100.0% 34
|
|
331
|
-
|
|
332
331
|
§5 Time-of-day heatmap (UTC)
|
|
333
|
-
darker means higher active share; blank means no samples
|
|
334
|
-
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
|
|
335
|
-
Mon .
|
|
336
|
-
```
|
|
337
|
-
|
|
338
|
-
The 3-bar collapses every card × every tick over the window into the
|
|
339
|
-
active / idle-held / truly-idle split. **`idle-held` rows are the
|
|
340
|
-
embarrassing category**: a process is holding GPU memory but the SM
|
|
341
|
-
utilization is below 10%. §2 converts those card-ticks into GPU-hours
|
|
342
|
-
with `--interval`; §4 groups process rows by identity, GPU, and tick
|
|
343
|
-
before ranking users, so multiple same-user processes on one GPU/tick
|
|
344
|
-
count once.
|
|
345
|
-
|
|
346
|
-
## Demo (no GPU required)
|
|
347
|
-
|
|
348
|
-
The `demo` subcommand records 30 ticks of fake telemetry and prints the
|
|
349
|
-
report — all in one process, no second shell needed.
|
|
350
|
-
|
|
351
|
-
```sh
|
|
352
|
-
gua demo
|
|
353
332
|
```
|
|
354
333
|
|
|
355
|
-
|
|
356
|
-
active learning → idle-held memory → cleanup — so the output is the
|
|
357
|
-
same every run. Adjust the shape with `--ticks N` and `--interval D`.
|
|
358
|
-
|
|
359
|
-
## Real NVIDIA GPU host
|
|
334
|
+
Reports can run while the daemon is writing; SQLite WAL mode handles concurrent reads. Reports also work after the daemon has stopped, as long as the DB file exists.
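The concurrent-read behaviour described above can be illustrated with a minimal sketch using Python's standard `sqlite3` module. This is not the project's own code: the `sample` table, its columns, and the helper names are hypothetical and exist only to show a reader seeing committed rows while a writer holds an open transaction in WAL mode.

```python
import os
import sqlite3
import tempfile


def open_wal(path: str) -> sqlite3.Connection:
    """Open a connection and switch the database to write-ahead logging."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def rows_visible_during_write() -> int:
    """Return how many rows a reader sees while a writer has an open transaction."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")

        # Hypothetical schema, for illustration only.
        writer = open_wal(path)
        writer.execute("CREATE TABLE sample (ts INTEGER, util INTEGER)")
        writer.execute("INSERT INTO sample VALUES (1, 42)")
        writer.commit()

        # Start a second write but do not commit yet.
        writer.execute("INSERT INTO sample VALUES (2, 7)")

        # A separate connection can still read the last committed state.
        reader = sqlite3.connect(path)
        count = reader.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
        reader.close()

        writer.commit()
        writer.close()
        return count
```

In this sketch the reader is not blocked by the pending insert and only sees the row that was already committed.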
|
|
360
335
|
|
|
361
|
-
|
|
336
|
+
## Commands
|
|
362
337
|
|
|
363
|
-
|
|
364
|
-
|
|
365
|
-
|
|
338
|
+
| Command | Description |
|
|
339
|
+
| --- | --- |
|
|
340
|
+
| `gua doctor` | Check local NVIDIA/NVML readiness and DB path status |
|
|
341
|
+
| `gua daemon` | Start background collection on the local NVIDIA host |
|
|
342
|
+
| `gua start` | Alias for `gua daemon` |
|
|
343
|
+
| `gua status` | Show whether the managed background collector is running |
|
|
344
|
+
| `gua stop` | Stop the managed background collector |
|
|
345
|
+
| `gua report` | Render the retrospective report from SQLite |
|
|
346
|
+
| `gua demo` | Generate a fake local report without a GPU |
|
|
347
|
+
| `gua enroll` | Connect this host to a GUA Board workspace (optional cloud sync) |
|
|
348
|
+
| `gua sync-once` | Collect one snapshot and push the latest state to GUA Board |
|
|
349
|
+
| `gua version` | Print version |
|
|
366
350
|
|
|
367
|
-
|
|
368
|
-
files, `nvidia-smi -L` GPUs, NVML device count, and `/tmp/gua.db` status.
|
|
369
|
-
`nvidia-ml-py` is installed by default with `gpu-usage-audit`; if doctor
|
|
370
|
-
reports that `pynvml` is not importable, reinstall the isolated tool
|
|
371
|
-
environment:
|
|
351
|
+
## Important Options
|
|
372
352
|
|
|
373
353
|
```sh
|
|
374
|
-
uv tool install --force gpu-usage-audit
|
|
375
|
-
```
|
|
376
|
-
|
|
377
|
-
If `pynvml` imports but NVML init fails, fix the host NVIDIA driver
|
|
378
|
-
installation instead. `libnvidia-ml.so.1` must be available and match the
|
|
379
|
-
loaded kernel driver; `nvidia-smi -L` should list GPUs before the daemon
|
|
380
|
-
can collect real telemetry.
|
|
381
|
-
|
|
382
|
-
Then run the collector:
|
|
383
|
-
|
|
384
|
-
```sh
|
|
385
|
-
gua daemon --interval 30s
|
|
386
|
-
gua status
|
|
387
|
-
```
|
|
388
|
-
|
|
389
|
-
Run the report:
|
|
390
|
-
|
|
391
|
-
```sh
|
|
392
|
-
gua report --since 1h --interval 30s
|
|
393
|
-
```
|
|
394
|
-
|
|
395
|
-
Stop the background collector when the collection window is done:
|
|
396
|
-
|
|
397
|
-
```sh
|
|
398
|
-
gua stop
|
|
399
|
-
```
|
|
400
|
-
|
|
401
|
-
If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
|
|
402
|
-
`daemon` refuses to start when that database file already exists, so a
|
|
403
|
-
new collection run does not silently append to an old test database. If
|
|
404
|
-
`gua doctor` reports that the database already exists, either run
|
|
405
|
-
`gua report` against the existing data or choose a fresh `--db PATH` for
|
|
406
|
-
the next daemon run.
|
|
407
|
-
|
|
408
|
-
> The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
|
|
409
|
-
> driverless host it exits with a friendly NVML initialization error. For
|
|
410
|
-
> a driverless box, use `demo` instead.
|
|
411
|
-
|
|
412
|
-
## Usage
|
|
413
|
-
|
|
414
|
-
`gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
|
|
415
|
-
point remains installed for compatibility, but new examples use `gua`.
|
|
416
|
-
|
|
417
|
-
| Command | What it does |
|
|
418
|
-
| -------- | ----------------------------------------------------------- |
|
|
419
|
-
| `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
|
|
420
|
-
| `start` | Alias for `gua daemon`. |
|
|
421
|
-
| `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
|
|
422
|
-
| `stop` | Stops the background collector with SIGTERM. |
|
|
423
|
-
| `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
|
|
424
|
-
| `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
|
|
425
|
-
|
|
426
|
-
### `daemon` / `start`
|
|
427
|
-
|
|
428
|
-
```
|
|
429
354
|
gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
|
|
430
|
-
gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
|
|
431
355
|
gua daemon --foreground [--db PATH] [--interval D]
|
|
432
|
-
```
|
|
433
|
-
|
|
434
|
-
- `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
|
|
435
|
-
to. The daemon exits with an error if the file already exists. WAL mode
|
|
436
|
-
is enabled automatically.
|
|
437
|
-
- `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
|
|
438
|
-
`1m`, `200ms`, etc.
|
|
439
|
-
- `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
|
|
440
|
-
- `--log-file PATH` (default `/tmp/gua.log`) — stdout/stderr from the
|
|
441
|
-
background collector.
|
|
442
|
-
- `--foreground` — keep the collector attached to the current process.
|
|
443
|
-
Use this for systemd or debugging.
|
|
444
|
-
|
|
445
|
-
By default, `gua daemon` returns after the collector starts. Each tick is
|
|
446
|
-
written to the log file; on shutdown the cumulative row count is written
|
|
447
|
-
there too. `gua daemon --foreground` prints the tick summaries directly
|
|
448
|
-
to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
|
|
449
|
-
`gua status` and `gua stop` verify that the PID file points to the
|
|
450
|
-
managed collector before acting on it; stale PID files are cleared.
|
|
451
|
-
|
|
452
|
-
### `report`
|
|
453
|
-
|
|
454
|
-
```
|
|
455
356
|
gua report [--db PATH] [--since D] [--interval D] [--width N]
|
|
456
|
-
```
|
|
457
|
-
|
|
458
|
-
- `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
|
|
459
|
-
to. The report exits with an error if the file does not exist.
|
|
460
|
-
- `--since D` (default `1h`) — the report window. **No upper bound** —
|
|
461
|
-
`--since 365d` is accepted. The effective window is min(`--since`, age
|
|
462
|
-
of oldest sample), so passing a huge `--since` is the same as "all
|
|
463
|
-
data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
|
|
464
|
-
- `--interval D` (default `30s`) — **must match what the daemon used**.
|
|
465
|
-
This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
|
|
466
|
-
to GPU-hours. Mismatched intervals → wrong GPU-hours.
|
|
467
|
-
- `--width N` (default `60`) — width of the §1 three-bar in characters.
|
|
468
|
-
|
|
469
|
-
### `demo`
|
|
470
|
-
|
|
471
|
-
```
|
|
472
357
|
gua demo [--db PATH] [--ticks N] [--interval D]
|
|
473
358
|
```
|
|
474
359
|
|
|
475
|
-
- `--
|
|
476
|
-
|
|
477
|
-
- `--
|
|
478
|
-
|
|
479
|
-
- `--interval D` (default `1s`) — tick spacing.
|
|
360
|
+
- `--interval` on `daemon` controls sampling cadence. Default: `30s`.
|
|
361
|
+
- `--interval` on `report` is optional. New DB rows use the interval recorded by each daemon run. Use report `--interval D` only as an override or for legacy rows without interval metadata.
|
|
362
|
+
- `--since` accepts `ms`, `s`, `m`, `h`, and `d`, with no upper bound.
|
|
363
|
+
- `--foreground` is intended for systemd and debugging.
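To see why per-run interval metadata matters for the GPU-hours figures, here is a small sketch of the arithmetic. It assumes each daemon run contributes a number of card-ticks at its own interval; the `RunTicks` type and `gpu_hours` function are hypothetical names, not part of the package.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RunTicks:
    """Card-ticks recorded by one daemon run (hypothetical structure)."""

    interval_seconds: float
    card_ticks: int


def gpu_hours(runs: Iterable[RunTicks]) -> float:
    """Convert card-ticks to GPU-hours, weighting each run by its own interval."""
    total_seconds = sum(run.card_ticks * run.interval_seconds for run in runs)
    return total_seconds / 3600.0


# One run sampled every 30s, a later run sampled every 10s.
mixed = [RunTicks(interval_seconds=30, card_ticks=120),
         RunTicks(interval_seconds=10, card_ticks=360)]
```

With these numbers each run covers one GPU-hour, so the total is 2.0. Applying a single 30s interval to all 480 card-ticks would report 4.0 instead, which is the kind of error a blanket `--interval` can introduce on mixed data.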
|
|
480
364
|
|
|
481
|
-
|
|
365
|
+
## Demo Without a GPU
|
|
482
366
|
|
|
483
|
-
|
|
484
|
-
|
|
485
|
-
|
|
486
|
-
§4 (Top identities) needs hours; §5 (Heatmap) needs days.
|
|
487
|
-
- **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
|
|
488
|
-
cleaned up automatically when the last connection closes.
|
|
489
|
-
- **DB size**: ~50 MB per host per 30 days at 12 GPUs (extrapolated
|
|
490
|
-
from Go v0.1.0; not yet re-measured for the Python rewrite).
|
|
367
|
+
```sh
|
|
368
|
+
gua demo
|
|
369
|
+
```
|
|
491
370
|
|
|
492
|
-
|
|
371
|
+
The demo records deterministic fake telemetry and immediately prints the report shape.
|
|
493
372
|
|
|
494
|
-
|
|
495
|
-
`/etc/systemd/system/gpu-usage-audit.service`:
|
|
373
|
+
## Systemd Example
|
|
496
374
|
|
|
497
375
|
```ini
|
|
498
376
|
[Unit]
|
|
@@ -509,56 +387,67 @@ User=gua
|
|
|
509
387
|
WantedBy=multi-user.target
|
|
510
388
|
```
|
|
511
389
|
|
|
512
|
-
Then
|
|
390
|
+
Then run:
|
|
513
391
|
|
|
514
|
-
|
|
515
|
-
|
|
516
|
-
|
|
392
|
+
```sh
|
|
393
|
+
systemctl enable --now gpu-usage-audit
|
|
394
|
+
```
|
|
517
395
|
|
|
518
|
-
|
|
519
|
-
- per-process: `mem_used_mb` per `(card, pid)`
|
|
396
|
+
## Cloud Sync (GUA Board, optional)
|
|
520
397
|
|
|
521
|
-
|
|
398
|
+
`gpu-usage-audit` runs fully local by default. If you also use GUA Board (a separate service that shows the latest GPU availability across several servers in one place), you can optionally connect a host:
|
|
522
399
|
|
|
400
|
+
```sh
|
|
401
|
+
# 1. In the GUA Board web UI, register a server and copy the one-time enrollment token.
|
|
402
|
+
# 2. On the GPU host:
|
|
403
|
+
gua enroll --server-url https://board.example.com --enrollment-token <TOKEN>
|
|
404
|
+
# 3. Push the current snapshot (run on a timer or after `gua daemon`):
|
|
405
|
+
gua sync-once
|
|
523
406
|
```
|
|
524
|
-
|
|
525
|
-
|
|
526
|
-
|
|
407
|
+
|
|
408
|
+
How it works and what it does not do:
|
|
409
|
+
|
|
410
|
+
- `enroll` exchanges the one-time token for a host-scoped, write-only agent token, stored in `~/.gua/cloud.json` with mode `0600`. The token can only write this host's observations — it cannot read reservations, users, or other hosts.
|
|
411
|
+
- `sync-once` collects one snapshot, **writes it to the local database first**, then pushes only the latest state. A failed push never blocks or rolls back the local write.
|
|
412
|
+
- Only the latest snapshot is sent. Historical ticks are kept locally and are never replayed to the server.
|
|
413
|
+
- Process telemetry is limited to PID, Linux user, process name (`/proc/<pid>/comm`), and GPU memory — never full command lines.
|
|
414
|
+
- Cloud sync adds no new runtime dependency (the client uses the Python standard library).
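The local-first ordering described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: `sync_once`, `collect`, `write_local`, and `push` are hypothetical names, and the push is assumed to fail with `OSError` (which `urllib.error.URLError` subclasses) when the network call does not succeed.

```python
import sys
from typing import Any, Callable

Snapshot = dict[str, Any]


def sync_once(
    collect: Callable[[], Snapshot],
    write_local: Callable[[Snapshot], None],
    push: Callable[[Snapshot], None],
) -> bool:
    """Collect one snapshot, persist it locally, then try to push it.

    Returns True when the push succeeded and False when it failed.
    The local write always happens before any network activity.
    """
    snapshot = collect()
    write_local(snapshot)  # local history is written first

    try:
        push(snapshot)
    except OSError as exc:
        # The local write is kept; a failed push is only reported.
        print(f"push failed, local write kept: {exc}", file=sys.stderr)
        return False
    return True
```

Passing the three steps in as callables keeps the ordering visible and makes the failure path easy to exercise without a server.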
|
|
415
|
+
|
|
416
|
+
Override the config or database path with `--config PATH` / `--db PATH`, and use `gua sync-once --fake` to exercise the flow without a GPU.
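Storing a token file with mode `0600`, as described for `~/.gua/cloud.json`, can be done with the standard library alone. The sketch below is one common way to do it on Linux, not necessarily how the package does it; `write_private_json` and the payload keys are hypothetical.

```python
import json
import os
import stat
import tempfile


def write_private_json(path: str, payload: dict) -> None:
    """Write JSON so that only the owning user can read or write the file."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    # Create the file with 0600 from the start rather than tightening it later.
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # Enforce the mode even if the file already existed with wider permissions.
    os.chmod(path, 0o600)


def demo_mode() -> int:
    """Write a sample config to a temporary directory and return its mode bits."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cloud.json")
        write_private_json(path, {"agent_token": "example-not-a-real-token"})
        return stat.S_IMODE(os.stat(path).st_mode)
```

Creating the file with restrictive permissions up front avoids a window in which the token is readable by other users.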
|
|
417
|
+
|
|
418
|
+
## Classification Rules
|
|
419
|
+
|
|
420
|
+
Each daemon tick records per-card utilization and per-process GPU memory. The report classifies each GPU card at each tick with these rules:
|
|
421
|
+
|
|
422
|
+
```text
|
|
423
|
+
util >= 10 -> active
|
|
424
|
+
util < 10 AND mem > 100 -> idle-held
|
|
425
|
+
util < 10 AND mem <= 100 -> truly-idle
|
|
527
426
|
```
|
|
528
427
|
|
|
529
|
-
The 100 MB threshold absorbs
|
|
530
|
-
importing torch doesn't count as "holding the GPU".
|
|
428
|
+
The 100 MB threshold absorbs runtime baselines such as importing PyTorch or TensorFlow.
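The three rules above translate directly into a small function. This is a sketch of the stated thresholds only; the function and parameter names are hypothetical, and it assumes utilization is a percentage and process memory is the total in MB for one card at one tick.

```python
ACTIVE_UTIL_PERCENT = 10   # util >= 10 counts as active
HELD_MEMORY_MB = 100       # more than 100 MB of process memory counts as held


def classify(util_percent: float, process_mem_mb: float) -> str:
    """Classify one GPU card at one tick as active, idle-held, or truly-idle."""
    if util_percent >= ACTIVE_UTIL_PERCENT:
        return "active"
    if process_mem_mb > HELD_MEMORY_MB:
        return "idle-held"
    return "truly-idle"
```

For example, a notebook at 1% utilization holding 8192 MB is classified as `idle-held`, while a card at exactly 100 MB with low utilization is still `truly-idle` because the memory rule is a strict inequality.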
|
|
531
429
|
|
|
532
430
|
## Development
|
|
533
431
|
|
|
534
|
-
Requires [uv](https://docs.astral.sh/uv/) (uv pins the Python version
|
|
535
|
-
automatically; `requires-python = ">=3.12"`).
|
|
536
|
-
|
|
537
432
|
```sh
|
|
538
433
|
git clone https://github.com/AI-Ocean/gpu-usage-audit
|
|
539
434
|
cd gpu-usage-audit
|
|
540
|
-
uv sync
|
|
541
|
-
uv run
|
|
542
|
-
uv run ruff check
|
|
543
|
-
uv run
|
|
544
|
-
uv run
|
|
435
|
+
uv sync
|
|
436
|
+
uv run python -m pytest
|
|
437
|
+
uv run ruff check
|
|
438
|
+
uv run ruff format --check
|
|
439
|
+
uv run python -m mypy
|
|
440
|
+
uv run gua demo
|
|
545
441
|
```
|
|
546
442
|
|
|
547
|
-
CI runs ruff
|
|
548
|
-
the wheel on every push and PR. Tag pushes (`v*`) rerun the same checks,
|
|
549
|
-
build sdist + wheel, smoke-test the wheel, and create a GitHub Release
|
|
550
|
-
with auto-generated notes. Release tags also publish the wheel and sdist
|
|
551
|
-
to PyPI through Trusted Publishing.
|
|
443
|
+
CI runs ruff, format check, mypy, pytest, build, and wheel smoke tests. Tag pushes (`v*`) build release assets and publish to PyPI through Trusted Publishing.
|
|
552
444
|
|
|
553
445
|
## Non-goals
|
|
554
446
|
|
|
555
|
-
This is a
|
|
556
|
-
|
|
557
|
-
|
|
558
|
-
for bare-metal 1.0. Those belong above the host layer. If this tool
|
|
559
|
-
surfaces enough idle-held to make scheduling worth solving, see
|
|
560
|
-
[ocean-all](https://github.com/AI-Ocean).
|
|
447
|
+
This is a single-host retrospective tool. Live dashboards, multi-host aggregation, quotas, Kubernetes cluster scans, Slurm joins, Docker/Podman runtime fallback, and pod-name resolution are outside the bare-metal 1.0 scope.
|
|
448
|
+
|
|
449
|
+
The Go v0.1.0 implementation remains available at tag `v0.1.0` and branch [`go-archive`](https://github.com/AI-Ocean/gpu-usage-audit/tree/go-archive).
|
|
561
450
|
|
|
562
451
|
## License
|
|
563
452
|
|
|
564
|
-
Apache License 2.0
|
|
453
|
+
Apache License 2.0. See [LICENSE](LICENSE).
|