gpu-usage-audit 1.0.0__tar.gz → 1.0.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- gpu_usage_audit-1.0.2/CHANGELOG.md +38 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/PKG-INFO +70 -35
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/README.md +69 -34
- gpu_usage_audit-1.0.2/projects/bare-metal-1.0/handoff.ko.md +83 -0
- gpu_usage_audit-1.0.2/projects/bare-metal-1.0/status.ko.md +120 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/pyproject.toml +2 -2
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/scripts/smoke-dist-wheel.sh +1 -1
- gpu_usage_audit-1.0.2/src/gpu_usage_audit/__main__.py +737 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/doctor.py +5 -5
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/nvml.py +13 -1
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/render.py +25 -14
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/report.py +70 -29
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_doctor.py +5 -3
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_render.py +26 -12
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_report.py +38 -7
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_smoke.py +253 -1
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/uv.lock +1 -4
- gpu_usage_audit-1.0.0/CHANGELOG.md +0 -17
- gpu_usage_audit-1.0.0/projects/bare-metal-1.0/handoff.ko.md +0 -84
- gpu_usage_audit-1.0.0/projects/bare-metal-1.0/status.ko.md +0 -96
- gpu_usage_audit-1.0.0/src/gpu_usage_audit/__main__.py +0 -374
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.github/workflows/ci.yml +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.github/workflows/release.yml +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.gitignore +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/LICENSE +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/projects/bare-metal-1.0/plan.ko.md +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/scripts/check-tag-version.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/__init__.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/classify.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/daemon.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/db.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/identity.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/model.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/summarize.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/tier.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/__init__.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_classify.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_daemon.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_db.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_identity.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_nvml.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_summarize.py +0 -0
- {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_tier.py +0 -0
@@ -0,0 +1,38 @@
+# Changelog
+
+## 1.0.2 - 2026-05-15
+
+- Hardened `gua status` and `gua stop` so stale PID files do not act on
+unrelated live processes.
+- Clarified report output by explaining sample units, classification rules,
+interval-dependent GPU-hours, and heatmap density.
+- Split §2 from generic "Waste" into idle-held capacity and truly-idle
+capacity. The equivalent-GPU figures now use GPUs present in the report
+window instead of the entire database.
+- Made §4 Top identities aggregate by identity/GPU/tick before converting to
+GPU-hours, so reports may show lower per-user GPU-hours when one user has
+multiple processes on the same GPU at the same tick.
+- Warn when NVML process-list visibility is unavailable for a GPU.
+
+## 1.0.1 - 2026-05-15
+
+- Made `gua` the documented command surface for daemon, report, demo, and doctor output.
+- Made `gua daemon` start the collector in the background by default, with
+`gua daemon --foreground` available for systemd and debugging.
+- Added `gua start`, `gua status`, and `gua stop` for background collector management.
+
+## 1.0.0 - 2026-05-15
+
+Bare-metal 1.0 narrows `gpu-usage-audit` to one clear workflow: inspect the
+current NVIDIA Linux host, collect NVML telemetry into SQLite, and render a
+retrospective active / idle-held / truly-idle report.
+
+- Reset the product surface to a single local bare-metal host.
+- Added `gua doctor` for read-only local NVIDIA/NVML/database readiness checks.
+- Made `nvidia-ml-py` a default dependency while keeping the `nvml` extra as a
+compatibility alias.
+- Defaulted `daemon` and `report` to `/tmp/gua.db`.
+- Made `daemon` refuse an existing database and `report` refuse a missing one.
+- Kept the schema at v1: `host`, `gpu_sample`, `proc_sample`.
+- Removed post-1.0 auto-runtime planning artifacts and runtime-detection code.
+- Preserved `demo` for GPU-less output checks with fake telemetry.
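The 1.0.2 entry for §4 says process rows are collapsed to identity/GPU/tick before the conversion to GPU-hours. A minimal sketch of that idea follows. The row layout `(tick, gpu, identity, pid)` and both function names are assumptions made for illustration; the package's real schema columns and SQL are not shown in this diff.

```python
from collections import Counter

# Hypothetical process rows as (tick, gpu, identity, pid) tuples.
# The column layout is an assumption, not the package's schema.
ROWS = [
    (0, "GPU-0", "alice", 101),
    (0, "GPU-0", "alice", 102),  # second alice process, same GPU and tick
    (0, "GPU-1", "bob", 201),
    (1, "GPU-0", "alice", 101),
]


def per_process_ticks(rows):
    """Count one tick per process row (per-process counting)."""
    return Counter(identity for _tick, _gpu, identity, _pid in rows)


def per_identity_gpu_tick(rows):
    """Count each (identity, gpu, tick) combination once."""
    unique = {(identity, gpu, tick) for tick, gpu, identity, _pid in rows}
    return Counter(identity for identity, _gpu, _tick in unique)


before = per_process_ticks(ROWS)
after = per_identity_gpu_tick(ROWS)
print(before["alice"], after["alice"])
```

With these rows, alice's two processes on GPU-0 at tick 0 count as a single card-tick after grouping, which is why the changelog warns that per-user GPU-hours may come out lower than in earlier versions.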
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-usage-audit
-Version: 1.0.
+Version: 1.0.2
 Summary: Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss.
 Project-URL: Homepage, https://github.com/AI-Ocean/gpu-usage-audit
 Project-URL: Issues, https://github.com/AI-Ocean/gpu-usage-audit/issues
@@ -233,7 +233,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to
 lunch — `nvidia-smi` will show 1% utilization, but the card is
 *unusable* by anyone else. This tool measures that.
 
-> **Status:** bare-metal 1.0
+> **Status:** bare-metal 1.0.
 > `gua doctor` checks only the current machine. `daemon` records NVML
 > telemetry from the current NVIDIA host, `report` reads the resulting
 > SQLite database, and `demo` runs anywhere with fake telemetry. The Go
@@ -253,8 +253,10 @@ runtime. If Python downloads are disabled by local policy, install Python
 uv tool install gpu-usage-audit
 
 gua doctor
-
-
+gua daemon --interval 30s
+gua status
+gua report --since 1h --interval 30s
+gua stop
 ```
 
 `gua doctor` is intentionally read-only. It checks only the current
@@ -269,7 +271,8 @@ with GPU UUIDs, so review it before sharing it outside your team.
 `gua doctor` does not need `sudo`; run it as the same user that will run
 the daemon.
 
-Available `gua` subcommands: `doctor
+Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`,
+`stop`, `report`, `demo`, `version`, `help`.
 
 Update or remove the installed tool with uv:
 
@@ -284,8 +287,8 @@ its `gua` / `gpu-usage-audit` commands.
 GitHub Release assets are also available for manual download:
 
 ```sh
-BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.
-WHEEL="gpu_usage_audit-1.0.
+BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
+WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
 
 curl -fsSLO "$BASE/$WHEEL"
 curl -fsSLO "$BASE/SHA256SUMS"
@@ -297,30 +300,37 @@ uvx --from "./$WHEEL" gua doctor
 ## What you get
 
 ```
-$
-
+$ gua report --since 1h --interval 30s
+gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
 
 §1 Headline
+basis: one sample = one GPU card at one daemon tick
+rules: active >=10% util; idle-held <10% util with >100 MB process memory
 █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
 active █ 15.7%
 idle-held ▒ 45.1% ← this is the number conventional tools miss
 truly-idle ░ 39.2%
 (51 samples)
 
-§2
-
+§2 Idle capacity
+converted from card-ticks to GPU-hours using the report --interval
+idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
+truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
 
 §3 Per-GPU
+per-card share of samples in the same three states
 GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
 GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
 GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
 
 §4 Top identities
-identity
-
-
+one identity counts once per GPU/tick after its processes are summed
+identity gpu-hours idle-held samples
+alice 0.42 42.9% 51
+bob 0.28 100.0% 34
 
 §5 Time-of-day heatmap (UTC)
+darker means higher active share; blank means no samples
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
 Mon .
 ```
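The `rules:` line in the sample report gives two thresholds: active at 10% utilization or more, and idle-held below 10% utilization with more than 100 MB of process memory. The sketch below applies those thresholds to a single sample. It assumes that truly-idle is simply everything else. The function name and parameters are hypothetical and are not taken from the package's `classify.py`, whose contents are not shown in this diff.

```python
ACTIVE_UTIL_PCT = 10.0      # "active >=10% util" from the report's rules line
HELD_MEMORY_MB = 100.0      # "idle-held <10% util with >100 MB process memory"


def classify_sample(util_pct: float, process_mem_mb: float) -> str:
    """Classify one GPU card at one tick into the report's three states.

    Assumption: any sample that is neither active nor idle-held is truly-idle.
    """
    if util_pct >= ACTIVE_UTIL_PCT:
        return "active"
    if process_mem_mb > HELD_MEMORY_MB:
        return "idle-held"
    return "truly-idle"


# The README's motivating case: 1% utilization with an 8 GB tensor resident.
print(classify_sample(1.0, 8 * 1024))
```

Under these assumptions the notebook-left-open case from the README introduction lands in `idle-held`, the state that utilization alone would not reveal.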
@@ -328,7 +338,10 @@ gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
 The 3-bar collapses every card × every tick over the window into the
 active / idle-held / truly-idle split. **`idle-held` rows are the
 embarrassing category**: a process is holding GPU memory but the SM
-utilization is below 10%.
+utilization is below 10%. §2 converts those card-ticks into GPU-hours
+with `--interval`; §4 groups process rows by identity, GPU, and tick
+before ranking users, so multiple same-user processes on one GPU/tick
+count once.
 
 ## Demo (no GPU required)
 
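§2 converts card-ticks into GPU-hours using `--interval`, which is why the report interval has to match the one the daemon sampled with. The arithmetic can be sketched as below. The function name and the 120-tick figure are illustrative assumptions, not values from the package or from the sample report.

```python
def card_ticks_to_gpu_hours(card_ticks: int, interval_seconds: float) -> float:
    """Treat each card-tick as covering one sampling interval on one GPU."""
    return card_ticks * interval_seconds / 3600.0


# Hypothetical: 120 idle-held card-ticks collected by a daemon sampling every 30 s.
matched = card_ticks_to_gpu_hours(120, 30)     # report run with --interval 30s
mismatched = card_ticks_to_gpu_hours(120, 60)  # report run with --interval 1m
print(matched, mismatched)
```

The same stored ticks give twice the GPU-hours when the report is told the interval was one minute, which is the failure the "must match what the daemon used" note guards against.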
@@ -336,7 +349,7 @@ The `demo` subcommand records 30 ticks of fake telemetry and prints the
 report — all in one process, no second shell needed.
 
 ```sh
-
+gua demo
 ```
 
 The bundled `FakeTier` produces a deterministic 5-tick workload —
@@ -369,21 +382,28 @@ can collect real telemetry.
 Then run the collector:
 
 ```sh
-
+gua daemon --interval 30s
+gua status
 ```
 
-Run the report
+Run the report:
 
 ```sh
-
+gua report --since 1h --interval 30s
+```
+
+Stop the background collector when the collection window is done:
+
+```sh
+gua stop
 ```
 
 If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
 `daemon` refuses to start when that database file already exists, so a
 new collection run does not silently append to an old test database. If
 `gua doctor` reports that the database already exists, either run
-`
-
+`gua report` against the existing data or choose a fresh `--db PATH` for
+the next daemon run.
 
 > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
 > driverless host it exits with a friendly NVML initialization error. For
@@ -391,18 +411,24 @@ new collection run does not silently append to an old test database. If
 
 ## Usage
 
-`
+`gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
+point remains installed for compatibility, but new examples use `gua`.
 
 | Command | What it does |
 | -------- | ----------------------------------------------------------- |
-| `daemon` |
+| `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
+| `start` | Alias for `gua daemon`. |
+| `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
+| `stop` | Stops the background collector with SIGTERM. |
 | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
 | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
 
-### `daemon`
+### `daemon` / `start`
 
 ```
-
+gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
+gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
+gua daemon --foreground [--db PATH] [--interval D]
 ```
 
 - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
@@ -410,14 +436,23 @@ gpu-usage-audit daemon [--db PATH] [--interval D]
 is enabled automatically.
 - `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
 `1m`, `200ms`, etc.
-
-
-
+- `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
+- `--log-file PATH` (default `/tmp/gua.log`) — stdout/stderr from the
+background collector.
+- `--foreground` — keep the collector attached to the current process.
+Use this for systemd or debugging.
+
+By default, `gua daemon` returns after the collector starts. Each tick is
+written to the log file; on shutdown the cumulative row count is written
+there too. `gua daemon --foreground` prints the tick summaries directly
+to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
+`gua status` and `gua stop` verify that the PID file points to the
+managed collector before acting on it; stale PID files are cleared.
 
 ### `report`
 
 ```
-
+gua report [--db PATH] [--since D] [--interval D] [--width N]
 ```
 
 - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
@@ -427,14 +462,14 @@ gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
 of oldest sample), so passing a huge `--since` is the same as "all
 data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
 - `--interval D` (default `30s`) — **must match what the daemon used**.
-This is how §2 (
+This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
 to GPU-hours. Mismatched intervals → wrong GPU-hours.
 - `--width N` (default `60`) — width of the §1 three-bar in characters.
 
 ### `demo`
 
 ```
-
+gua demo [--db PATH] [--ticks N] [--interval D]
 ```
 
 - `--db PATH` (optional) — if omitted, a fresh temporary database is
@@ -446,7 +481,7 @@ gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
 ### Operational notes
 
 - **Same `--interval` on both sides.** If you ran the daemon with
-`--interval 30s`, run `report --interval 30s` too.
+`--interval 30s`, run `gua report --interval 30s` too.
 - **Let it run for a while.** §1/§3 are meaningful after one tick;
 §4 (Top identities) needs hours; §5 (Heatmap) needs days.
 - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
@@ -461,12 +496,12 @@ For a long-running deployment, drop a unit file in
 
 ```ini
 [Unit]
-Description=
+Description=gua daemon
 After=network.target
 
 [Service]
 Type=simple
-ExecStart=/usr/local/bin/
+ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s
 Restart=on-failure
 User=gua
 
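The `--interval 30s` argument in the unit file uses the duration syntax the README documents for `--since` and `--interval`: units `ms`, `s`, `m`, `h`, `d`, with no `w`. A minimal parser for that syntax might look like the sketch below. It is not the package's own parser. The function name is hypothetical, and accepting only whole-number values is an assumption, since the README does not say whether fractions such as `1.5h` are allowed.

```python
import re

# Seconds per unit, for the units the README lists (there is no "w" unit).
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_DURATION = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text: str) -> float:
    """Convert a string such as '30s' or '200ms' to seconds.

    Assumption: only whole-number values are accepted.
    """
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


print(parse_duration("30s"), parse_duration("1m"), parse_duration("7d"))
```

A string such as `1w` raises `ValueError` here, mirroring the README's advice to write `7d` instead.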
@@ -506,7 +541,7 @@ uv sync # create .venv, install dev deps
 uv run pytest # run the test suite
 uv run ruff check # lint
 uv run mypy # type-check (strict)
-uv run
+uv run gua demo # see the report shape locally
 ```
 
 CI runs ruff + format check + mypy + pytest, then builds and smoke-tests
@@ -10,7 +10,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to
 lunch — `nvidia-smi` will show 1% utilization, but the card is
 *unusable* by anyone else. This tool measures that.
 
-> **Status:** bare-metal 1.0
+> **Status:** bare-metal 1.0.
 > `gua doctor` checks only the current machine. `daemon` records NVML
 > telemetry from the current NVIDIA host, `report` reads the resulting
 > SQLite database, and `demo` runs anywhere with fake telemetry. The Go
@@ -30,8 +30,10 @@ runtime. If Python downloads are disabled by local policy, install Python
 uv tool install gpu-usage-audit
 
 gua doctor
-
-
+gua daemon --interval 30s
+gua status
+gua report --since 1h --interval 30s
+gua stop
 ```
 
 `gua doctor` is intentionally read-only. It checks only the current
@@ -46,7 +48,8 @@ with GPU UUIDs, so review it before sharing it outside your team.
 `gua doctor` does not need `sudo`; run it as the same user that will run
 the daemon.
 
-Available `gua` subcommands: `doctor
+Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`,
+`stop`, `report`, `demo`, `version`, `help`.
 
 Update or remove the installed tool with uv:
 
@@ -61,8 +64,8 @@ its `gua` / `gpu-usage-audit` commands.
 GitHub Release assets are also available for manual download:
 
 ```sh
-BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.
-WHEEL="gpu_usage_audit-1.0.
+BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
+WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
 
 curl -fsSLO "$BASE/$WHEEL"
 curl -fsSLO "$BASE/SHA256SUMS"
@@ -74,30 +77,37 @@ uvx --from "./$WHEEL" gua doctor
 ## What you get
 
 ```
-$
-
+$ gua report --since 1h --interval 30s
+gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
 
 §1 Headline
+basis: one sample = one GPU card at one daemon tick
+rules: active >=10% util; idle-held <10% util with >100 MB process memory
 █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
 active █ 15.7%
 idle-held ▒ 45.1% ← this is the number conventional tools miss
 truly-idle ░ 39.2%
 (51 samples)
 
-§2
-
+§2 Idle capacity
+converted from card-ticks to GPU-hours using the report --interval
+idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
+truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
 
 §3 Per-GPU
+per-card share of samples in the same three states
 GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
 GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
 GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
 
 §4 Top identities
-identity
-
-
+one identity counts once per GPU/tick after its processes are summed
+identity gpu-hours idle-held samples
+alice 0.42 42.9% 51
+bob 0.28 100.0% 34
 
 §5 Time-of-day heatmap (UTC)
+darker means higher active share; blank means no samples
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
 Mon .
 ```
@@ -105,7 +115,10 @@ gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
 The 3-bar collapses every card × every tick over the window into the
 active / idle-held / truly-idle split. **`idle-held` rows are the
 embarrassing category**: a process is holding GPU memory but the SM
-utilization is below 10%.
+utilization is below 10%. §2 converts those card-ticks into GPU-hours
+with `--interval`; §4 groups process rows by identity, GPU, and tick
+before ranking users, so multiple same-user processes on one GPU/tick
+count once.
 
 ## Demo (no GPU required)
 
@@ -113,7 +126,7 @@ The `demo` subcommand records 30 ticks of fake telemetry and prints the
 report — all in one process, no second shell needed.
 
 ```sh
-
+gua demo
 ```
 
 The bundled `FakeTier` produces a deterministic 5-tick workload —
@@ -146,21 +159,28 @@ can collect real telemetry.
 Then run the collector:
 
 ```sh
-
+gua daemon --interval 30s
+gua status
 ```
 
-Run the report
+Run the report:
 
 ```sh
-
+gua report --since 1h --interval 30s
+```
+
+Stop the background collector when the collection window is done:
+
+```sh
+gua stop
 ```
 
 If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
 `daemon` refuses to start when that database file already exists, so a
 new collection run does not silently append to an old test database. If
 `gua doctor` reports that the database already exists, either run
-`
-
+`gua report` against the existing data or choose a fresh `--db PATH` for
+the next daemon run.
 
 > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
 > driverless host it exits with a friendly NVML initialization error. For
@@ -168,18 +188,24 @@ new collection run does not silently append to an old test database. If
 
 ## Usage
 
-`
+`gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
+point remains installed for compatibility, but new examples use `gua`.
 
 | Command | What it does |
 | -------- | ----------------------------------------------------------- |
-| `daemon` |
+| `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
+| `start` | Alias for `gua daemon`. |
+| `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
+| `stop` | Stops the background collector with SIGTERM. |
 | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
 | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
 
-### `daemon`
+### `daemon` / `start`
 
 ```
-
+gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
+gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
+gua daemon --foreground [--db PATH] [--interval D]
 ```
 
 - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
@@ -187,14 +213,23 @@ gpu-usage-audit daemon [--db PATH] [--interval D]
 is enabled automatically.
 - `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
 `1m`, `200ms`, etc.
-
-
-
+- `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
+- `--log-file PATH` (default `/tmp/gua.log`) — stdout/stderr from the
+background collector.
+- `--foreground` — keep the collector attached to the current process.
+Use this for systemd or debugging.
+
+By default, `gua daemon` returns after the collector starts. Each tick is
+written to the log file; on shutdown the cumulative row count is written
+there too. `gua daemon --foreground` prints the tick summaries directly
+to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
+`gua status` and `gua stop` verify that the PID file points to the
+managed collector before acting on it; stale PID files are cleared.
 
 ### `report`
 
 ```
-
+gua report [--db PATH] [--since D] [--interval D] [--width N]
 ```
 
 - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
@@ -204,14 +239,14 @@ gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
 of oldest sample), so passing a huge `--since` is the same as "all
 data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
 - `--interval D` (default `30s`) — **must match what the daemon used**.
-This is how §2 (
+This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
 to GPU-hours. Mismatched intervals → wrong GPU-hours.
 - `--width N` (default `60`) — width of the §1 three-bar in characters.
 
 ### `demo`
 
 ```
-
+gua demo [--db PATH] [--ticks N] [--interval D]
 ```
 
 - `--db PATH` (optional) — if omitted, a fresh temporary database is
@@ -223,7 +258,7 @@ gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
 ### Operational notes
 
 - **Same `--interval` on both sides.** If you ran the daemon with
-`--interval 30s`, run `report --interval 30s` too.
+`--interval 30s`, run `gua report --interval 30s` too.
 - **Let it run for a while.** §1/§3 are meaningful after one tick;
 §4 (Top identities) needs hours; §5 (Heatmap) needs days.
 - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
@@ -238,12 +273,12 @@ For a long-running deployment, drop a unit file in
 
 ```ini
 [Unit]
-Description=
+Description=gua daemon
 After=network.target
 
 [Service]
 Type=simple
-ExecStart=/usr/local/bin/
+ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s
 Restart=on-failure
 User=gua
 
@@ -283,7 +318,7 @@ uv sync # create .venv, install dev deps
 uv run pytest # run the test suite
 uv run ruff check # lint
 uv run mypy # type-check (strict)
-uv run
+uv run gua demo # see the report shape locally
 ```
 
 CI runs ruff + format check + mypy + pytest, then builds and smoke-tests
@@ -0,0 +1,83 @@
+# Bare Metal 1.0 Handoff
+
+Updated: 2026-05-15
+
+## What to read first when picking this up
+
+- `projects/bare-metal-1.0/status.ko.md`: current completion status, 1.0.1 verification results, 1.0.2 release prep status.
+- `README.md`: the actual user documentation and the release/install/runbook/report surface.
+- `src/gpu_usage_audit/__main__.py`: `gua` CLI, background daemon lifecycle, PID handling.
+- `src/gpu_usage_audit/report.py`: report SQL aggregation.
+- `src/gpu_usage_audit/render.py`: human-readable report output.
+- `.github/workflows/release.yml`: tag release, GitHub Release, and PyPI publish path.
+
+## Fixed decisions
+
+- 1.0 covers only a single local bare-metal NVIDIA host.
+- Kubernetes, Slurm, Docker/Podman fallback, remote nodes, and cluster-wide reports are out of scope for 1.0.
+- `nvidia-ml-py` is a default dependency.
+- The `gpu-usage-audit[nvml]` extra remains as an empty alias for compatibility.
+- The DB schema stays at v1: `host`, `gpu_sample`, `proc_sample`.
+- The default DB is `/tmp/gua.db`.
+- `gua daemon` runs in the background by default.
+- `gua daemon --foreground` is for systemd and debugging.
+- `gua start` is an alias for `gua daemon`.
+- `gua status` and `gua stop` manage the background collector through the PID file.
+- `daemon` fails if the DB file already exists.
+- `report` fails if the DB file is missing.
+- `daemon` and `demo` always record `env_kind` as `"bare"` in the host row.
+- The auto-runtime proposal/project documents were deleted. Restarting the Kubernetes/Slurm/Docker/Podman
+expansion begins with a new proposal.
+
+## Current status
+
+- PR A: implemented in PR #9.
+- PR B: implemented in PR #10.
+- Post-1.0 cleanup: completed in PR #11.
+- Bare-metal 1.0 release: completed in PR #12 and tag `v1.0.0`.
+- 1.0.1 command surface/background daemon release: completed in PR #13 and tag `v1.0.1`.
+- GitHub Release `v1.0.1`: published.
+- PyPI `gpu-usage-audit 1.0.1`: published.
+- NVIDIA host acceptance: the user confirmed on a real host that collection works correctly.
+- 1.0.2 release prep: in progress. The #14 lifecycle/report cleanup ships as a patch release.
+The package version has been bumped to `1.0.2`, and the local build and wheel smoke test passed.
+
+## Last local verification
+
+```sh
+uv run ruff check
+uv run ruff format --check
+uv run mypy
+uv run pytest
+uv build --out-dir /tmp/gua-dist-1.0.2-prep
+bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-1.0.2-prep/gpu_usage_audit-1.0.2-py3-none-any.whl
+env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py
+```
+
+Results: `pytest` 124 passed, `mypy` 25 source files, `ruff format` 26 files.
+
+## Direction of the current cleanup PR
+
+- Because `/tmp/gua.pid` can point to a different process after PID reuse, `status`/`stop` first
+verify that the PID really is the managed `gpu_usage_audit daemon` process.
+- Report §2 separates `idle-held` from `truly-idle` instead of lumping all low-utilization samples together as "waste".
+- Report §4 first collapses rows to identity/GPU/tick, rather than counting process rows, before computing per-user GPU-hours.
+- The report output itself briefly states what a sample means, the classification rules, the `--interval` dependence, and how to read the heatmap.
+- A failed NVML process-list query can undercount idle-held, so it is surfaced as a warning.
+- The 1.0.2 release prep aligns the package version, the README release-asset examples, and the CHANGELOG to `1.0.2`.
+
+## Caveats
+
+- The current local development machine is not an NVIDIA host, so `gua doctor` reporting unsupported is expected.
+- `/tmp/gua.db` already exists, so the daemon refusing to start on the default path is expected behavior.
+- `report --interval` must match the daemon's collection interval for GPU-hours to be correct.
+- SQLite WAL sidecar files (`*.db-wal`, `*.db-shm`) are cleaned up when the last connection closes.
+- When cutting 1.0.2, `env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py`
+must pass.
+
+## Recommended order for the next session
+
+1. First check for user changes with `git status --short`.
+2. Check the cleanup PR's CI results and review comments.
+3. If needed, polish the report wording once more so that real operators can read it easily.
+4. If a patch release is needed after the merge, handle the version bump and changelog in a separate PR.
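The handoff notes that `/tmp/gua.pid` can end up pointing at an unrelated process after PID reuse, so `status` and `stop` confirm the process identity before acting. One way such a check could work on Linux is to inspect `/proc/<pid>/cmdline`, as sketched below. The actual check in `__main__.py` is not shown in this diff. Every function name here is hypothetical, and matching on the strings `gpu_usage_audit` and `daemon` in the argument vector is an assumption.

```python
from pathlib import Path


def read_cmdline(pid: int) -> list[str]:
    """Return the argv of a live Linux process, or [] if it is gone or unreadable."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return []
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def looks_like_collector(argv: list[str], marker: str = "gpu_usage_audit") -> bool:
    """Assumed heuristic: the marker appears in some argument and 'daemon' is one of them."""
    return any(marker in arg for arg in argv) and "daemon" in argv


def pid_file_points_to_collector(pid_file: Path) -> bool:
    """False for a missing file, an unparsable PID, a dead PID, or an unrelated process."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return False
    return looks_like_collector(read_cmdline(pid))


print(looks_like_collector(["python", "-m", "gpu_usage_audit", "daemon", "--foreground"]))
print(looks_like_collector(["sleep", "100"]))
```

A caller would treat a `False` result as a stale PID file and clear it rather than sending SIGTERM, which matches the behavior the changelog describes for 1.0.2.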