gpu-usage-audit 1.0.0.tar.gz → 1.0.2.tar.gz

Files changed (43)
  1. gpu_usage_audit-1.0.2/CHANGELOG.md +38 -0
  2. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/PKG-INFO +70 -35
  3. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/README.md +69 -34
  4. gpu_usage_audit-1.0.2/projects/bare-metal-1.0/handoff.ko.md +83 -0
  5. gpu_usage_audit-1.0.2/projects/bare-metal-1.0/status.ko.md +120 -0
  6. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/pyproject.toml +2 -2
  7. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/scripts/smoke-dist-wheel.sh +1 -1
  8. gpu_usage_audit-1.0.2/src/gpu_usage_audit/__main__.py +737 -0
  9. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/doctor.py +5 -5
  10. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/nvml.py +13 -1
  11. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/render.py +25 -14
  12. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/report.py +70 -29
  13. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_doctor.py +5 -3
  14. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_render.py +26 -12
  15. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_report.py +38 -7
  16. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_smoke.py +253 -1
  17. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/uv.lock +1 -4
  18. gpu_usage_audit-1.0.0/CHANGELOG.md +0 -17
  19. gpu_usage_audit-1.0.0/projects/bare-metal-1.0/handoff.ko.md +0 -84
  20. gpu_usage_audit-1.0.0/projects/bare-metal-1.0/status.ko.md +0 -96
  21. gpu_usage_audit-1.0.0/src/gpu_usage_audit/__main__.py +0 -374
  22. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.github/workflows/ci.yml +0 -0
  23. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.github/workflows/release.yml +0 -0
  24. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/.gitignore +0 -0
  25. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/LICENSE +0 -0
  26. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/projects/bare-metal-1.0/plan.ko.md +0 -0
  27. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/scripts/check-tag-version.py +0 -0
  28. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/__init__.py +0 -0
  29. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/classify.py +0 -0
  30. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/daemon.py +0 -0
  31. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/db.py +0 -0
  32. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/identity.py +0 -0
  33. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/model.py +0 -0
  34. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/summarize.py +0 -0
  35. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/tier.py +0 -0
  36. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/__init__.py +0 -0
  37. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_classify.py +0 -0
  38. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_daemon.py +0 -0
  39. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_db.py +0 -0
  40. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_identity.py +0 -0
  41. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_nvml.py +0 -0
  42. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_summarize.py +0 -0
  43. {gpu_usage_audit-1.0.0 → gpu_usage_audit-1.0.2}/tests/test_tier.py +0 -0
@@ -0,0 +1,38 @@
1
+ # Changelog
2
+
3
+ ## 1.0.2 - 2026-05-15
4
+
5
+ - Hardened `gua status` and `gua stop` so stale PID files do not act on
6
+ unrelated live processes.
7
+ - Clarified report output by explaining sample units, classification rules,
8
+ interval-dependent GPU-hours, and heatmap density.
9
+ - Split §2 from generic "Waste" into idle-held capacity and truly-idle
10
+ capacity. The equivalent-GPU figures now use GPUs present in the report
11
+ window instead of the entire database.
12
+ - Made §4 Top identities aggregate by identity/GPU/tick before converting to
13
+ GPU-hours, so reports may show lower per-user GPU-hours when one user has
14
+ multiple processes on the same GPU at the same tick.
15
+ - Warn when NVML process-list visibility is unavailable for a GPU.
16
+
17
+ ## 1.0.1 - 2026-05-15
18
+
19
+ - Made `gua` the documented command surface for daemon, report, demo, and doctor output.
20
+ - Made `gua daemon` start the collector in the background by default, with
21
+ `gua daemon --foreground` available for systemd and debugging.
22
+ - Added `gua start`, `gua status`, and `gua stop` for background collector management.
23
+
24
+ ## 1.0.0 - 2026-05-15
25
+
26
+ Bare-metal 1.0 narrows `gpu-usage-audit` to one clear workflow: inspect the
27
+ current NVIDIA Linux host, collect NVML telemetry into SQLite, and render a
28
+ retrospective active / idle-held / truly-idle report.
29
+
30
+ - Reset the product surface to a single local bare-metal host.
31
+ - Added `gua doctor` for read-only local NVIDIA/NVML/database readiness checks.
32
+ - Made `nvidia-ml-py` a default dependency while keeping the `nvml` extra as a
33
+ compatibility alias.
34
+ - Defaulted `daemon` and `report` to `/tmp/gua.db`.
35
+ - Made `daemon` refuse an existing database and `report` refuse a missing one.
36
+ - Kept the schema at v1: `host`, `gpu_sample`, `proc_sample`.
37
+ - Removed post-1.0 auto-runtime planning artifacts and runtime-detection code.
38
+ - Preserved `demo` for GPU-less output checks with fake telemetry.
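
The 1.0.2 entry for §4 (aggregate by identity/GPU/tick before converting to GPU-hours) can be illustrated with a small Python sketch. This is not the package's implementation: the function name `gpu_hours_by_identity` and the `(identity, gpu, tick)` tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict


def gpu_hours_by_identity(proc_rows, interval_seconds):
    """Convert process samples to per-identity GPU-hours.

    proc_rows: iterable of (identity, gpu, tick) tuples, one per process sample
    (hypothetical layout, not the package's schema).
    """
    # Collapse duplicates first: several processes from one identity on the
    # same GPU at the same tick must count as a single card-tick.
    unique = {(identity, gpu, tick) for identity, gpu, tick in proc_rows}

    ticks = defaultdict(int)
    for identity, _gpu, _tick in unique:
        ticks[identity] += 1

    # One card-tick represents `interval_seconds` of wall time on one GPU.
    return {who: n * interval_seconds / 3600.0 for who, n in ticks.items()}


rows = [
    ("alice", 0, 1),
    ("alice", 0, 1),  # second process, same GPU and tick: counted once
    ("alice", 0, 2),
    ("bob", 1, 1),
]
hours = gpu_hours_by_identity(rows, interval_seconds=30)
```

Counting raw process rows would credit `alice` with three card-ticks here; collapsing first credits two, which is why the changelog warns that per-user GPU-hours may come out lower than in 1.0.0.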
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: gpu-usage-audit
3
- Version: 1.0.0
3
+ Version: 1.0.2
4
4
  Summary: Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss.
5
5
  Project-URL: Homepage, https://github.com/AI-Ocean/gpu-usage-audit
6
6
  Project-URL: Issues, https://github.com/AI-Ocean/gpu-usage-audit/issues
@@ -233,7 +233,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to
233
233
  lunch — `nvidia-smi` will show 1% utilization, but the card is
234
234
  *unusable* by anyone else. This tool measures that.
235
235
 
236
- > **Status:** bare-metal 1.0 release candidate.
236
+ > **Status:** bare-metal 1.0.
237
237
  > `gua doctor` checks only the current machine. `daemon` records NVML
238
238
  > telemetry from the current NVIDIA host, `report` reads the resulting
239
239
  > SQLite database, and `demo` runs anywhere with fake telemetry. The Go
@@ -253,8 +253,10 @@ runtime. If Python downloads are disabled by local policy, install Python
253
253
  uv tool install gpu-usage-audit
254
254
 
255
255
  gua doctor
256
- gpu-usage-audit daemon --interval 30s
257
- gpu-usage-audit report --since 1h --interval 30s
256
+ gua daemon --interval 30s
257
+ gua status
258
+ gua report --since 1h --interval 30s
259
+ gua stop
258
260
  ```
259
261
 
260
262
  `gua doctor` is intentionally read-only. It checks only the current
@@ -269,7 +271,8 @@ with GPU UUIDs, so review it before sharing it outside your team.
269
271
  `gua doctor` does not need `sudo`; run it as the same user that will run
270
272
  the daemon.
271
273
 
272
- Available `gua` subcommands: `doctor`.
274
+ Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`,
275
+ `stop`, `report`, `demo`, `version`, `help`.
273
276
 
274
277
  Update or remove the installed tool with uv:
275
278
 
@@ -284,8 +287,8 @@ its `gua` / `gpu-usage-audit` commands.
284
287
  GitHub Release assets are also available for manual download:
285
288
 
286
289
  ```sh
287
- BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.0"
288
- WHEEL="gpu_usage_audit-1.0.0-py3-none-any.whl"
290
+ BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
291
+ WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
289
292
 
290
293
  curl -fsSLO "$BASE/$WHEEL"
291
294
  curl -fsSLO "$BASE/SHA256SUMS"
@@ -297,30 +300,37 @@ uvx --from "./$WHEEL" gua doctor
297
300
  ## What you get
298
301
 
299
302
  ```
300
- $ gpu-usage-audit report --since 1h --interval 30s
301
- gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
303
+ $ gua report --since 1h --interval 30s
304
+ gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
302
305
 
303
306
  §1 Headline
307
+ basis: one sample = one GPU card at one daemon tick
308
+ rules: active >=10% util; idle-held <10% util with >100 MB process memory
304
309
  █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
305
310
  active █ 15.7%
306
311
  idle-held ▒ 45.1% ← this is the number conventional tools miss
307
312
  truly-idle ░ 39.2%
308
313
  (51 samples)
309
314
 
310
- §2 Waste
311
- ~0.43 GPU-hours idle, ~2.53 GPUs equivalently unused
315
+ §2 Idle capacity
316
+ converted from card-ticks to GPU-hours using the report --interval
317
+ idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
318
+ truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
312
319
 
313
320
  §3 Per-GPU
321
+ per-card share of samples in the same three states
314
322
  GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
315
323
  GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
316
324
  GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
317
325
 
318
326
  §4 Top identities
319
- identity gpu-hours idle-held
320
- alice 0.42 42.9%
321
- bob 0.28 100.0%
327
+ one identity counts once per GPU/tick after its processes are summed
328
+ identity gpu-hours idle-held samples
329
+ alice 0.42 42.9% 51
330
+ bob 0.28 100.0% 34
322
331
 
323
332
  §5 Time-of-day heatmap (UTC)
333
+ darker means higher active share; blank means no samples
324
334
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
325
335
  Mon .
326
336
  ```
@@ -328,7 +338,10 @@ gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
328
338
  The 3-bar collapses every card × every tick over the window into the
329
339
  active / idle-held / truly-idle split. **`idle-held` rows are the
330
340
  embarrassing category**: a process is holding GPU memory but the SM
331
- utilization is below 10%.
341
+ utilization is below 10%. §2 converts those card-ticks into GPU-hours
342
+ with `--interval`; §4 groups process rows by identity, GPU, and tick
343
+ before ranking users, so multiple same-user processes on one GPU/tick
344
+ count once.
332
345
 
333
346
  ## Demo (no GPU required)
334
347
 
@@ -336,7 +349,7 @@ The `demo` subcommand records 30 ticks of fake telemetry and prints the
336
349
  report — all in one process, no second shell needed.
337
350
 
338
351
  ```sh
339
- gpu-usage-audit demo
352
+ gua demo
340
353
  ```
341
354
 
342
355
  The bundled `FakeTier` produces a deterministic 5-tick workload —
@@ -369,21 +382,28 @@ can collect real telemetry.
369
382
  Then run the collector:
370
383
 
371
384
  ```sh
372
- gpu-usage-audit daemon --interval 30s
385
+ gua daemon --interval 30s
386
+ gua status
373
387
  ```
374
388
 
375
- Run the report from another shell:
389
+ Run the report:
376
390
 
377
391
  ```sh
378
- gpu-usage-audit report --since 1h --interval 30s
392
+ gua report --since 1h --interval 30s
393
+ ```
394
+
395
+ Stop the background collector when the collection window is done:
396
+
397
+ ```sh
398
+ gua stop
379
399
  ```
380
400
 
381
401
  If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
382
402
  `daemon` refuses to start when that database file already exists, so a
383
403
  new collection run does not silently append to an old test database. If
384
404
  `gua doctor` reports that the database already exists, either run
385
- `gpu-usage-audit report` against the existing data or choose a fresh
386
- `--db PATH` for the next daemon run.
405
+ `gua report` against the existing data or choose a fresh `--db PATH` for
406
+ the next daemon run.
387
407
 
388
408
  > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
389
409
  > driverless host it exits with a friendly NVML initialization error. For
@@ -391,18 +411,24 @@ new collection run does not silently append to an old test database. If
391
411
 
392
412
  ## Usage
393
413
 
394
- `gpu-usage-audit` has three commands sharing one SQLite file:
414
+ `gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
415
+ point remains installed for compatibility, but new examples use `gua`.
395
416
 
396
417
  | Command | What it does |
397
418
  | -------- | ----------------------------------------------------------- |
398
- | `daemon` | Long-running background process. Samples real NVML telemetry on every tick and writes to a new database. Stop with Ctrl+C (SIGINT) or `systemctl stop`. NVIDIA host required. |
419
+ | `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
420
+ | `start` | Alias for `gua daemon`. |
421
+ | `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
422
+ | `stop` | Stops the background collector with SIGTERM. |
399
423
  | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
400
424
  | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
401
425
 
402
- ### `daemon`
426
+ ### `daemon` / `start`
403
427
 
404
428
  ```
405
- gpu-usage-audit daemon [--db PATH] [--interval D]
429
+ gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
430
+ gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
431
+ gua daemon --foreground [--db PATH] [--interval D]
406
432
  ```
407
433
 
408
434
  - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
@@ -410,14 +436,23 @@ gpu-usage-audit daemon [--db PATH] [--interval D]
410
436
  is enabled automatically.
411
437
  - `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
412
438
  `1m`, `200ms`, etc.
413
-
414
- Each tick prints a one-line summary to stdout; on shutdown the cumulative
415
- row count is printed.
439
+ - `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
440
+ - `--log-file PATH` (default `/tmp/gua.log`) stdout/stderr from the
441
+ background collector.
442
+ - `--foreground` — keep the collector attached to the current process.
443
+ Use this for systemd or debugging.
444
+
445
+ By default, `gua daemon` returns after the collector starts. Each tick is
446
+ written to the log file; on shutdown the cumulative row count is written
447
+ there too. `gua daemon --foreground` prints the tick summaries directly
448
+ to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
449
+ `gua status` and `gua stop` verify that the PID file points to the
450
+ managed collector before acting on it; stale PID files are cleared.
416
451
 
417
452
  ### `report`
418
453
 
419
454
  ```
420
- gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
455
+ gua report [--db PATH] [--since D] [--interval D] [--width N]
421
456
  ```
422
457
 
423
458
  - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
@@ -427,14 +462,14 @@ gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
427
462
  of oldest sample), so passing a huge `--since` is the same as "all
428
463
  data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
429
464
  - `--interval D` (default `30s`) — **must match what the daemon used**.
430
- This is how §2 (Waste) and §4 (Top identities) convert tick counts
465
+ This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
431
466
  to GPU-hours. Mismatched intervals → wrong GPU-hours.
432
467
  - `--width N` (default `60`) — width of the §1 three-bar in characters.
433
468
 
434
469
  ### `demo`
435
470
 
436
471
  ```
437
- gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
472
+ gua demo [--db PATH] [--ticks N] [--interval D]
438
473
  ```
439
474
 
440
475
  - `--db PATH` (optional) — if omitted, a fresh temporary database is
@@ -446,7 +481,7 @@ gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
446
481
  ### Operational notes
447
482
 
448
483
  - **Same `--interval` on both sides.** If you ran the daemon with
449
- `--interval 30s`, run `report --interval 30s` too.
484
+ `--interval 30s`, run `gua report --interval 30s` too.
450
485
  - **Let it run for a while.** §1/§3 are meaningful after one tick;
451
486
  §4 (Top identities) needs hours; §5 (Heatmap) needs days.
452
487
  - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
@@ -461,12 +496,12 @@ For a long-running deployment, drop a unit file in
461
496
 
462
497
  ```ini
463
498
  [Unit]
464
- Description=gpu-usage-audit daemon
499
+ Description=gua daemon
465
500
  After=network.target
466
501
 
467
502
  [Service]
468
503
  Type=simple
469
- ExecStart=/usr/local/bin/gpu-usage-audit daemon --db /var/lib/gua/gua.db --interval 30s
504
+ ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s
470
505
  Restart=on-failure
471
506
  User=gua
472
507
 
@@ -506,7 +541,7 @@ uv sync # create .venv, install dev deps
506
541
  uv run pytest # run the test suite
507
542
  uv run ruff check # lint
508
543
  uv run mypy # type-check (strict)
509
- uv run gpu-usage-audit demo # see the report shape locally
544
+ uv run gua demo # see the report shape locally
510
545
  ```
511
546
 
512
547
  CI runs ruff + format check + mypy + pytest, then builds and smoke-tests
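
The rule line added to §1 (active >=10% util; idle-held <10% util with >100 MB process memory) and the §2 conversion from card-ticks to GPU-hours can be sketched as follows. The names `classify` and `gpu_hours` are hypothetical, and treating every remaining sample as truly-idle is an assumption, since the report text only spells out the first two rules.

```python
def classify(util_pct: float, proc_mem_mb: float) -> str:
    """Classify one sample (one GPU card at one daemon tick)."""
    if util_pct >= 10:
        return "active"
    if proc_mem_mb > 100:
        return "idle-held"
    # Assumption: anything else is truly-idle.
    return "truly-idle"


def gpu_hours(card_ticks: int, interval_seconds: float) -> float:
    """Each card-tick stands for one sampling interval on one GPU."""
    return card_ticks * interval_seconds / 3600.0


# (utilization %, process memory MB) per sample; made-up values.
samples = [(47.0, 8192.0), (1.0, 8192.0), (0.0, 0.0), (3.0, 50.0)]

counts = {"active": 0, "idle-held": 0, "truly-idle": 0}
for util, mem in samples:
    counts[classify(util, mem)] += 1

idle_held_hours = gpu_hours(counts["idle-held"], interval_seconds=30)
```

The second sample is the notebook-left-open case from the README: 1% utilization with 8 GB held, so it lands in idle-held rather than truly-idle.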
@@ -10,7 +10,7 @@ Jupyter notebook open with an 8 GB tensor on the GPU and went to
10
10
  lunch — `nvidia-smi` will show 1% utilization, but the card is
11
11
  *unusable* by anyone else. This tool measures that.
12
12
 
13
- > **Status:** bare-metal 1.0 release candidate.
13
+ > **Status:** bare-metal 1.0.
14
14
  > `gua doctor` checks only the current machine. `daemon` records NVML
15
15
  > telemetry from the current NVIDIA host, `report` reads the resulting
16
16
  > SQLite database, and `demo` runs anywhere with fake telemetry. The Go
@@ -30,8 +30,10 @@ runtime. If Python downloads are disabled by local policy, install Python
30
30
  uv tool install gpu-usage-audit
31
31
 
32
32
  gua doctor
33
- gpu-usage-audit daemon --interval 30s
34
- gpu-usage-audit report --since 1h --interval 30s
33
+ gua daemon --interval 30s
34
+ gua status
35
+ gua report --since 1h --interval 30s
36
+ gua stop
35
37
  ```
36
38
 
37
39
  `gua doctor` is intentionally read-only. It checks only the current
@@ -46,7 +48,8 @@ with GPU UUIDs, so review it before sharing it outside your team.
46
48
  `gua doctor` does not need `sudo`; run it as the same user that will run
47
49
  the daemon.
48
50
 
49
- Available `gua` subcommands: `doctor`.
51
+ Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`,
52
+ `stop`, `report`, `demo`, `version`, `help`.
50
53
 
51
54
  Update or remove the installed tool with uv:
52
55
 
@@ -61,8 +64,8 @@ its `gua` / `gpu-usage-audit` commands.
61
64
  GitHub Release assets are also available for manual download:
62
65
 
63
66
  ```sh
64
- BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.0"
65
- WHEEL="gpu_usage_audit-1.0.0-py3-none-any.whl"
67
+ BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
68
+ WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
66
69
 
67
70
  curl -fsSLO "$BASE/$WHEEL"
68
71
  curl -fsSLO "$BASE/SHA256SUMS"
@@ -74,30 +77,37 @@ uvx --from "./$WHEEL" gua doctor
74
77
  ## What you get
75
78
 
76
79
  ```
77
- $ gpu-usage-audit report --since 1h --interval 30s
78
- gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
80
+ $ gua report --since 1h --interval 30s
81
+ gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
79
82
 
80
83
  §1 Headline
84
+ basis: one sample = one GPU card at one daemon tick
85
+ rules: active >=10% util; idle-held <10% util with >100 MB process memory
81
86
  █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
82
87
  active █ 15.7%
83
88
  idle-held ▒ 45.1% ← this is the number conventional tools miss
84
89
  truly-idle ░ 39.2%
85
90
  (51 samples)
86
91
 
87
- §2 Waste
88
- ~0.43 GPU-hours idle, ~2.53 GPUs equivalently unused
92
+ §2 Idle capacity
93
+ converted from card-ticks to GPU-hours using the report --interval
94
+ idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
95
+ truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
89
96
 
90
97
  §3 Per-GPU
98
+ per-card share of samples in the same three states
91
99
  GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
92
100
  GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
93
101
  GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
94
102
 
95
103
  §4 Top identities
96
- identity gpu-hours idle-held
97
- alice 0.42 42.9%
98
- bob 0.28 100.0%
104
+ one identity counts once per GPU/tick after its processes are summed
105
+ identity gpu-hours idle-held samples
106
+ alice 0.42 42.9% 51
107
+ bob 0.28 100.0% 34
99
108
 
100
109
  §5 Time-of-day heatmap (UTC)
110
+ darker means higher active share; blank means no samples
101
111
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
102
112
  Mon .
103
113
  ```
@@ -105,7 +115,10 @@ gpu-usage-audit — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
105
115
  The 3-bar collapses every card × every tick over the window into the
106
116
  active / idle-held / truly-idle split. **`idle-held` rows are the
107
117
  embarrassing category**: a process is holding GPU memory but the SM
108
- utilization is below 10%.
118
+ utilization is below 10%. §2 converts those card-ticks into GPU-hours
119
+ with `--interval`; §4 groups process rows by identity, GPU, and tick
120
+ before ranking users, so multiple same-user processes on one GPU/tick
121
+ count once.
109
122
 
110
123
  ## Demo (no GPU required)
111
124
 
@@ -113,7 +126,7 @@ The `demo` subcommand records 30 ticks of fake telemetry and prints the
113
126
  report — all in one process, no second shell needed.
114
127
 
115
128
  ```sh
116
- gpu-usage-audit demo
129
+ gua demo
117
130
  ```
118
131
 
119
132
  The bundled `FakeTier` produces a deterministic 5-tick workload —
@@ -146,21 +159,28 @@ can collect real telemetry.
146
159
  Then run the collector:
147
160
 
148
161
  ```sh
149
- gpu-usage-audit daemon --interval 30s
162
+ gua daemon --interval 30s
163
+ gua status
150
164
  ```
151
165
 
152
- Run the report from another shell:
166
+ Run the report:
153
167
 
154
168
  ```sh
155
- gpu-usage-audit report --since 1h --interval 30s
169
+ gua report --since 1h --interval 30s
170
+ ```
171
+
172
+ Stop the background collector when the collection window is done:
173
+
174
+ ```sh
175
+ gua stop
156
176
  ```
157
177
 
158
178
  If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
159
179
  `daemon` refuses to start when that database file already exists, so a
160
180
  new collection run does not silently append to an old test database. If
161
181
  `gua doctor` reports that the database already exists, either run
162
- `gpu-usage-audit report` against the existing data or choose a fresh
163
- `--db PATH` for the next daemon run.
182
+ `gua report` against the existing data or choose a fresh `--db PATH` for
183
+ the next daemon run.
164
184
 
165
185
  > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
166
186
  > driverless host it exits with a friendly NVML initialization error. For
@@ -168,18 +188,24 @@ new collection run does not silently append to an old test database. If
168
188
 
169
189
  ## Usage
170
190
 
171
- `gpu-usage-audit` has three commands sharing one SQLite file:
191
+ `gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
192
+ point remains installed for compatibility, but new examples use `gua`.
172
193
 
173
194
  | Command | What it does |
174
195
  | -------- | ----------------------------------------------------------- |
175
- | `daemon` | Long-running background process. Samples real NVML telemetry on every tick and writes to a new database. Stop with Ctrl+C (SIGINT) or `systemctl stop`. NVIDIA host required. |
196
+ | `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
197
+ | `start` | Alias for `gua daemon`. |
198
+ | `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
199
+ | `stop` | Stops the background collector with SIGTERM. |
176
200
  | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
177
201
  | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
178
202
 
179
- ### `daemon`
203
+ ### `daemon` / `start`
180
204
 
181
205
  ```
182
- gpu-usage-audit daemon [--db PATH] [--interval D]
206
+ gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
207
+ gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
208
+ gua daemon --foreground [--db PATH] [--interval D]
183
209
  ```
184
210
 
185
211
  - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
@@ -187,14 +213,23 @@ gpu-usage-audit daemon [--db PATH] [--interval D]
187
213
  is enabled automatically.
188
214
  - `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
189
215
  `1m`, `200ms`, etc.
190
-
191
- Each tick prints a one-line summary to stdout; on shutdown the cumulative
192
- row count is printed.
216
+ - `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
217
+ - `--log-file PATH` (default `/tmp/gua.log`) stdout/stderr from the
218
+ background collector.
219
+ - `--foreground` — keep the collector attached to the current process.
220
+ Use this for systemd or debugging.
221
+
222
+ By default, `gua daemon` returns after the collector starts. Each tick is
223
+ written to the log file; on shutdown the cumulative row count is written
224
+ there too. `gua daemon --foreground` prints the tick summaries directly
225
+ to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
226
+ `gua status` and `gua stop` verify that the PID file points to the
227
+ managed collector before acting on it; stale PID files are cleared.
193
228
 
194
229
  ### `report`
195
230
 
196
231
  ```
197
- gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
232
+ gua report [--db PATH] [--since D] [--interval D] [--width N]
198
233
  ```
199
234
 
200
235
  - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
@@ -204,14 +239,14 @@ gpu-usage-audit report [--db PATH] [--since D] [--interval D] [--width N]
204
239
  of oldest sample), so passing a huge `--since` is the same as "all
205
240
  data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
206
241
  - `--interval D` (default `30s`) — **must match what the daemon used**.
207
- This is how §2 (Waste) and §4 (Top identities) convert tick counts
242
+ This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
208
243
  to GPU-hours. Mismatched intervals → wrong GPU-hours.
209
244
  - `--width N` (default `60`) — width of the §1 three-bar in characters.
210
245
 
211
246
  ### `demo`
212
247
 
213
248
  ```
214
- gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
249
+ gua demo [--db PATH] [--ticks N] [--interval D]
215
250
  ```
216
251
 
217
252
  - `--db PATH` (optional) — if omitted, a fresh temporary database is
@@ -223,7 +258,7 @@ gpu-usage-audit demo [--db PATH] [--ticks N] [--interval D]
223
258
  ### Operational notes
224
259
 
225
260
  - **Same `--interval` on both sides.** If you ran the daemon with
226
- `--interval 30s`, run `report --interval 30s` too.
261
+ `--interval 30s`, run `gua report --interval 30s` too.
227
262
  - **Let it run for a while.** §1/§3 are meaningful after one tick;
228
263
  §4 (Top identities) needs hours; §5 (Heatmap) needs days.
229
264
  - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
@@ -238,12 +273,12 @@ For a long-running deployment, drop a unit file in
238
273
 
239
274
  ```ini
240
275
  [Unit]
241
- Description=gpu-usage-audit daemon
276
+ Description=gua daemon
242
277
  After=network.target
243
278
 
244
279
  [Service]
245
280
  Type=simple
246
- ExecStart=/usr/local/bin/gpu-usage-audit daemon --db /var/lib/gua/gua.db --interval 30s
281
+ ExecStart=/usr/local/bin/gua daemon --foreground --db /var/lib/gua/gua.db --interval 30s
247
282
  Restart=on-failure
248
283
  User=gua
249
284
 
@@ -283,7 +318,7 @@ uv sync # create .venv, install dev deps
283
318
  uv run pytest # run the test suite
284
319
  uv run ruff check # lint
285
320
  uv run mypy # type-check (strict)
286
- uv run gpu-usage-audit demo # see the report shape locally
321
+ uv run gua demo # see the report shape locally
287
322
  ```
288
323
 
289
324
  CI runs ruff + format check + mypy + pytest, then builds and smoke-tests
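
The README says durations take the units `ms`, `s`, `m`, `h`, `d` (no `w`) and that a mismatched `--interval` yields wrong GPU-hours. A minimal sketch of both points, assuming integer values only; `parse_duration` is a hypothetical helper, not the package's parser.

```python
import re

# Seconds per unit; "w" is deliberately absent, matching the README.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_DURATION = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text: str) -> float:
    """Parse strings such as '30s', '1m', '200ms' into seconds."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


# 120 card-ticks collected by a daemon that sampled every 30s.
ticks = 120
correct = ticks * parse_duration("30s") / 3600.0   # report uses the same interval
inflated = ticks * parse_duration("1m") / 3600.0   # report uses a mismatched interval
```

With the wrong interval the same 120 card-ticks are reported as twice as many GPU-hours, which is the failure the operational notes warn about.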
@@ -0,0 +1,83 @@
1
+ # Bare Metal 1.0 Handoff
2
+
3
+ Updated: 2026-05-15
4
+
5
+ ## What to read first when picking this up
6
+
7
+ - `projects/bare-metal-1.0/status.ko.md`: current completion status, 1.0.1 verification results, and 1.0.2 release prep status.
8
+ - `README.md`: the actual user documentation and the release/install/runbook/report surface.
9
+ - `src/gpu_usage_audit/__main__.py`: `gua` CLI, background daemon lifecycle, PID handling.
10
+ - `src/gpu_usage_audit/report.py`: report SQL aggregation.
11
+ - `src/gpu_usage_audit/render.py`: human-readable report output.
12
+ - `.github/workflows/release.yml`: tag release, GitHub Release, and PyPI publish path.
13
+
14
+ ## Fixed decisions
15
+
16
+ - 1.0 covers only a single local bare-metal NVIDIA host.
17
+ - Kubernetes, Slurm, Docker/Podman fallback, remote nodes, and cluster-wide reports are out of scope for 1.0.
18
+ - `nvidia-ml-py` is a default dependency.
19
+ - The `gpu-usage-audit[nvml]` extra remains as an empty alias for compatibility.
20
+ - The DB schema stays at v1: `host`, `gpu_sample`, `proc_sample`.
21
+ - The default DB is `/tmp/gua.db`.
22
+ - `gua daemon` runs in the background by default.
23
+ - `gua daemon --foreground` is for systemd and debugging.
24
+ - `gua start` is an alias for `gua daemon`.
25
+ - `gua status` and `gua stop` manage the background collector through the PID file.
26
+ - `daemon` fails if the DB file already exists.
27
+ - `report` fails if the DB file does not exist.
28
+ - `daemon` and `demo` always record the host row's `env_kind` as `"bare"`.
29
+ - The auto-runtime proposal/project documents have been deleted. Any restart of the
30
+ Kubernetes/Slurm/Docker/Podman expansion should begin with a new proposal.
31
+
32
+ ## Current status
33
+
34
+ - PR A: implemented in PR #9.
35
+ - PR B: implemented in PR #10.
36
+ - Post-1.0 cleanup: completed in PR #11.
37
+ - Bare-metal 1.0 release: completed in PR #12 and tag `v1.0.0`.
38
+ - 1.0.1 command surface/background daemon release: completed in PR #13 and tag `v1.0.1`.
39
+ - GitHub Release `v1.0.1`: published.
40
+ - PyPI `gpu-usage-audit 1.0.1`: published.
41
+ - NVIDIA host acceptance: a user confirmed that collection works correctly on a real host.
42
+ - 1.0.2 release prep: in progress. The #14 lifecycle/report cleanup ships as a patch release.
43
+ The package version has been bumped to `1.0.2`, and the local build and wheel smoke test passed.
44
+
45
+ ## Last local verification
46
+
47
+ ```sh
48
+ uv run ruff check
49
+ uv run ruff format --check
50
+ uv run mypy
51
+ uv run pytest
52
+ uv build --out-dir /tmp/gua-dist-1.0.2-prep
53
+ bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-1.0.2-prep/gpu_usage_audit-1.0.2-py3-none-any.whl
54
+ env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py
55
+ ```
56
+
57
+ Results: `pytest` 124 passed, `mypy` checked 25 source files, `ruff format` checked 26 files.
58
+
59
+ ## Direction of the current cleanup PR
60
+
61
+ - Because `/tmp/gua.pid` can point to a different process after PID reuse, `status`/`stop` first
62
+ confirm that the PID is actually the managed `gpu_usage_audit daemon` process.
63
+ - Report §2 does not lump all low-utilization samples into "waste"; it separates `idle-held` from `truly-idle`.
64
+ - Report §4 first collapses rows to identity/GPU/tick, rather than counting process rows, before computing per-user GPU-hours.
65
+ - The report output itself briefly states what a sample means, the classification rules, the `--interval` dependency, and how to read the heatmap.
66
+ - A failed NVML process-list query can underestimate idle-held, so it is surfaced as a warning.
67
+ - The 1.0.2 release prep aligns the package version, the README release asset examples, and the CHANGELOG to `1.0.2`.
68
+
69
+ ## Things to watch out for
70
+
71
+ - The current local development machine is not an NVIDIA host. It is normal for `gua doctor` to report unsupported.
72
+ - `/tmp/gua.db` already exists. The daemon refusing to run on the default path is expected behavior.
73
+ - `report --interval` must equal the daemon's collection interval for GPU-hours to be correct.
74
+ - SQLite WAL sidecars (`*.db-wal`, `*.db-shm`) are cleaned up when the last connection closes.
75
+ - When cutting 1.0.2, `env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py`
76
+ must pass.
77
+
78
+ ## Recommended order for the next session
79
+
80
+ 1. Check for user changes first with `git status --short`.
81
+ 2. Check the cleanup PR's CI results and review comments.
82
+ 3. If needed, polish the report wording once more so real operators can read it easily.
83
+ 4. If a patch release is needed after merge, handle the version bump and changelog in a separate PR.
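
The handoff notes describe checking, before `status`/`stop` act, that the PID in the PID file really belongs to the managed collector, because a reused PID could belong to an unrelated process. One way this kind of check could look on Linux is sketched below. It is an illustration under stated assumptions, not the code in `__main__.py`: the helper names, the `/proc/<pid>/cmdline` approach, and the argv markers are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def read_cmdline(pid: int) -> list[str]:
    """Return the argv of a live process via /proc (Linux), or [] if unreadable."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return []
    return [part.decode(errors="replace") for part in raw.split(b"\0") if part]


def looks_like_collector(argv: list[str]) -> bool:
    # Assumption: a managed collector mentions the package name and the
    # "daemon" subcommand somewhere in its argv.
    return any("gpu_usage_audit" in arg for arg in argv) and "daemon" in argv


def check_pid_file(pid_file: Path) -> str:
    """Report 'running', 'stale' (and clear the file), or 'not running'."""
    try:
        text = pid_file.read_text()
    except FileNotFoundError:
        return "not running"
    try:
        pid = int(text.strip())
    except ValueError:
        pid = -1  # unparseable content is treated as stale
    if pid > 0 and looks_like_collector(read_cmdline(pid)):
        return "running"
    # Missing or unrelated process: clear the file and never signal the PID.
    pid_file.unlink(missing_ok=True)
    return "stale"


# Demo: a PID file pointing at this interpreter, which is a live but unrelated process.
with tempfile.TemporaryDirectory() as tmp:
    pid_file = Path(tmp) / "gua.pid"
    pid_file.write_text(str(os.getpid()))
    result = check_pid_file(pid_file)
    cleared = not pid_file.exists()
```

The point of the design is that liveness alone is not enough: the demo PID is alive, yet the file is treated as stale because the process identity does not match.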