gpu-usage-audit 1.0.2__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (53)
  1. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.gitignore +2 -0
  2. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/CHANGELOG.md +26 -0
  3. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/PKG-INFO +130 -241
  4. gpu_usage_audit-1.1.0/README.ko.md +230 -0
  5. gpu_usage_audit-1.1.0/README.md +230 -0
  6. gpu_usage_audit-1.1.0/docs/work-specs/0001-gua-board-cloud-sync.ko.md +238 -0
  7. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/handoff.ko.md +27 -23
  8. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/status.ko.md +39 -13
  9. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/pyproject.toml +1 -1
  10. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/__main__.py +248 -33
  11. gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/__init__.py +14 -0
  12. gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/client.py +145 -0
  13. gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/config.py +100 -0
  14. gpu_usage_audit-1.1.0/src/gpu_usage_audit/cloud/snapshot.py +112 -0
  15. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/daemon.py +30 -8
  16. gpu_usage_audit-1.1.0/src/gpu_usage_audit/db.py +253 -0
  17. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/doctor.py +24 -14
  18. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/identity.py +19 -0
  19. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/model.py +17 -1
  20. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/nvml.py +30 -1
  21. gpu_usage_audit-1.1.0/src/gpu_usage_audit/paths.py +22 -0
  22. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/render.py +4 -1
  23. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/report.py +55 -21
  24. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/tier.py +25 -4
  25. gpu_usage_audit-1.1.0/tests/test_cloud_cli.py +275 -0
  26. gpu_usage_audit-1.1.0/tests/test_cloud_client.py +159 -0
  27. gpu_usage_audit-1.1.0/tests/test_cloud_config.py +61 -0
  28. gpu_usage_audit-1.1.0/tests/test_cloud_snapshot.py +156 -0
  29. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_daemon.py +4 -1
  30. gpu_usage_audit-1.1.0/tests/test_db.py +238 -0
  31. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_doctor.py +1 -3
  32. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_identity.py +23 -1
  33. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_nvml.py +62 -0
  34. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_report.py +60 -1
  35. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_smoke.py +5 -2
  36. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/uv.lock +1 -1
  37. gpu_usage_audit-1.0.2/README.md +0 -341
  38. gpu_usage_audit-1.0.2/src/gpu_usage_audit/db.py +0 -135
  39. gpu_usage_audit-1.0.2/tests/test_db.py +0 -105
  40. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.github/workflows/ci.yml +0 -0
  41. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.github/workflows/release.yml +0 -0
  42. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/LICENSE +0 -0
  43. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/projects/bare-metal-1.0/plan.ko.md +0 -0
  44. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/scripts/check-tag-version.py +0 -0
  45. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/scripts/smoke-dist-wheel.sh +0 -0
  46. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/__init__.py +0 -0
  47. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/classify.py +0 -0
  48. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/src/gpu_usage_audit/summarize.py +0 -0
  49. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/__init__.py +0 -0
  50. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_classify.py +0 -0
  51. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_render.py +0 -0
  52. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_summarize.py +0 -0
  53. {gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/tests/test_tier.py +0 -0
{gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/.gitignore
@@ -38,3 +38,5 @@ Thumbs.db
38
38
  .vscode/
39
39
  *.swp
40
40
  *.swo
41
+
42
+ ignore/
{gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/CHANGELOG.md
@@ -1,5 +1,31 @@
1
1
  # Changelog
2
2
 
3
+ ## 1.1.0 - 2026-06-17
4
+
5
+ - Added optional GUA Board cloud sync. `gua enroll` claims a one-time
6
+ enrollment token from a GUA Board workspace and stores a host-scoped,
7
+ write-only agent token in `~/.gua/cloud.json` (mode 0600). `gua sync-once`
8
+ collects one snapshot, writes it to the local history database first, then
9
+ pushes the latest state to GUA Board; a failed push never blocks or rolls
10
+ back the local write. Cloud sync is entirely optional — local collection,
11
+ storage, and `gua report` are unchanged when no host is enrolled, and no new
12
+ runtime dependency is added (the client uses the standard library).
13
+ - Enriched NVML collection with per-GPU name, total/used memory, temperature,
14
+ power, and physical index, plus per-process name (from `/proc/<pid>/comm`;
15
+ full command lines are never collected). The local SQLite schema gained
16
+ these columns plus a normalized `gpu_device` table. The migration is
17
+ additive (nullable columns), so existing `~/.gua/gua.db` databases upgrade
18
+ in place and `gua report` output is unaffected.
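
The additive migration described in the 1.1.0 entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's `db.py`: the table name `gpu_sample`, the column names, and the helper `add_missing_columns` are all hypothetical. It relies only on SQLite's `ALTER TABLE ... ADD COLUMN`, which keeps existing rows and reads the new columns back as `NULL`.

```python
import sqlite3

# Hypothetical new columns; the real schema is not shown in this diff.
NEW_COLUMNS = {
    "gpu_name": "TEXT",
    "mem_total_mb": "INTEGER",
    "temperature_c": "INTEGER",
}


def add_missing_columns(conn: sqlite3.Connection, table: str) -> list[str]:
    """Add any absent nullable columns and return the names that were added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            # No NOT NULL or DEFAULT clause, so old rows simply read back NULL.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added


# Simulate a pre-upgrade database with one existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gpu_sample (ts INTEGER, util_pct INTEGER)")
conn.execute("INSERT INTO gpu_sample VALUES (1, 5)")

print(add_missing_columns(conn, "gpu_sample"))  # columns added on the first run
print(add_missing_columns(conn, "gpu_sample"))  # nothing left to add on a rerun
```

Because the helper checks the current columns first, running it on every startup is harmless, which is one plausible way an in-place upgrade could stay idempotent.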
19
+
20
+ ## 1.0.3 - 2026-05-27
21
+
22
+ - Changed default `gua` state paths to `~/.gua/gua.db`, `~/.gua/gua.pid`,
23
+ and `~/.gua/gua.log`; the default database now acts as an appendable local
24
+ history database.
25
+ - Record daemon run intervals in SQLite and attach samples to a run, so
26
+ `gua report` uses recorded intervals by default. `--interval` is now an
27
+ override and a fallback for legacy rows without interval metadata.
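
The interval handling described in the 1.0.3 entry above can be illustrated with a short sketch. It is not the package's `report.py`; the function names and the precedence rule are assumptions. Here an explicit `--interval` value wins over everything, recorded intervals are used otherwise, and legacy rows without metadata fall back to the daemon's documented 30s default.

```python
from typing import Iterable, Optional

DEFAULT_INTERVAL_S = 30.0  # documented daemon default, assumed for legacy rows


def effective_interval(recorded_s: Optional[float], override_s: Optional[float]) -> float:
    """Pick the interval for one sample (assumed precedence: override, recorded, default)."""
    if override_s is not None:
        return override_s
    if recorded_s is not None:
        return recorded_s
    return DEFAULT_INTERVAL_S


def gpu_hours(recorded_intervals: Iterable[Optional[float]],
              override_s: Optional[float] = None) -> float:
    """Convert card-ticks to GPU-hours, one interval per sample."""
    total_s = sum(effective_interval(r, override_s) for r in recorded_intervals)
    return total_s / 3600.0


# 60 ticks from a 30s run, 180 ticks from a 10s run, 60 legacy ticks without metadata.
samples = [30.0] * 60 + [10.0] * 180 + [None] * 60
print(gpu_hours(samples))        # uses each run's recorded interval
print(gpu_hours(samples, 30.0))  # explicit override applied to every tick
```

Multiplying the tick count by a single interval would overcount the 10s run by a factor of three, which is the error that per-run metadata avoids.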
28
+
3
29
  ## 1.0.2 - 2026-05-15
4
30
 
5
31
  - Hardened `gua status` and `gua stop` so stale PID files do not act on
{gpu_usage_audit-1.0.2 → gpu_usage_audit-1.1.0}/PKG-INFO
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: gpu-usage-audit
3
- Version: 1.0.2
3
+ Version: 1.1.0
4
4
  Summary: Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss.
5
5
  Project-URL: Homepage, https://github.com/AI-Ocean/gpu-usage-audit
6
6
  Project-URL: Issues, https://github.com/AI-Ocean/gpu-usage-audit/issues
@@ -223,72 +223,58 @@ Description-Content-Type: text/markdown
223
223
 
224
224
  # gpu-usage-audit
225
225
 
226
- A single-host diagnostic daemon that records NVIDIA GPU utilization to
227
- SQLite and produces a retrospective report separating *active* use from
228
- *allocated-but-idle* ("idle-held") and *truly idle* (no process at all).
226
+ Single-host NVIDIA GPU usage audit for finding **idle-held** GPUs: cards that look idle by utilization, but are still held by a process through GPU memory.
229
227
 
230
- Conventional dashboards collapse the latter two. **Surfacing
231
- idle-held as its own number is the entire point.** Someone left a
232
- Jupyter notebook open with an 8 GB tensor on the GPU and went to
233
- lunch — `nvidia-smi` will show 1% utilization, but the card is
234
- *unusable* by anyone else. This tool measures that.
228
+ [![PyPI](https://img.shields.io/pypi/v/gpu-usage-audit.svg)](https://pypi.org/project/gpu-usage-audit/)
229
+ [![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://pypi.org/project/gpu-usage-audit/)
230
+ [![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
231
+ [![GitHub Release](https://img.shields.io/github/v/release/AI-Ocean/gpu-usage-audit)](https://github.com/AI-Ocean/gpu-usage-audit/releases)
235
232
 
236
- > **Status:** bare-metal 1.0.
237
- > `gua doctor` checks only the current machine. `daemon` records NVML
238
- > telemetry from the current NVIDIA host, `report` reads the resulting
239
- > SQLite database, and `demo` runs anywhere with fake telemetry. The Go
240
- > v0.1.0 implementation remains downloadable at tag `v0.1.0` / branch
241
- > [`go-archive`](https://github.com/AI-Ocean/gpu-usage-audit/tree/go-archive).
233
+ [English](README.md) · [한국어](README.ko.md) · [Releases](https://github.com/AI-Ocean/gpu-usage-audit/releases) · [Issues](https://github.com/AI-Ocean/gpu-usage-audit/issues)
242
234
 
243
- ## Install
235
+ ---
244
236
 
245
- The recommended install path is PyPI via uv.
237
+ ## About
246
238
 
247
- Requires [uv](https://docs.astral.sh/uv/). In normal online environments,
248
- uv creates the isolated tool environment and manages the needed Python
249
- runtime. If Python downloads are disabled by local policy, install Python
250
- 3.12+ first.
239
+ gpu-usage-audit records local NVIDIA/NVML telemetry into SQLite and renders a retrospective report that separates GPU card-ticks into:
251
240
 
252
- ```sh
253
- uv tool install gpu-usage-audit
241
+ - `active`: utilization is doing real work
242
+ - `idle-held`: utilization is low, but a process still holds GPU memory
243
+ - `truly-idle`: no meaningful GPU process memory is present
254
244
 
255
- gua doctor
256
- gua daemon --interval 30s
257
- gua status
258
- gua report --since 1h --interval 30s
259
- gua stop
260
- ```
245
+ The second category is the point. A notebook can sit at 1% SM utilization while keeping an 8 GB tensor allocated. Conventional dashboards usually flatten that into “idle”; this tool shows that the card is effectively unavailable.
246
+
247
+ ## Features
248
+
249
+ - Single-host, bare-metal NVIDIA GPU audit
250
+ - `gua doctor` readiness check for `/dev/nvidia*`, `nvidia-smi`, NVML, and DB path
251
+ - Background collector with `gua daemon`, `gua status`, and `gua stop`
252
+ - SQLite history database at `~/.gua/gua.db` by default
253
+ - Report sections for headline split, idle capacity, per-GPU state, top identities, and time-of-day heatmap
254
+ - Daemon interval metadata stored per run, so reports compute GPU-hours correctly across mixed 30s / 10s runs
255
+ - GPU-less `gua demo` command with deterministic fake telemetry
256
+ - No cluster runtime dependency; no Kubernetes, Slurm, Docker, or remote-node scan in the 1.0 scope
261
257
 
262
- `gua doctor` is intentionally read-only. It checks only the current
263
- machine: OS/kernel/Python, `/dev/nvidia*`, `nvidia-smi -L`, NVML
264
- load/init/device count/driver version, and the database path the daemon
265
- would write to. The default is `/tmp/gua.db`; pass `gua doctor --db PATH`
266
- when you plan to use a different daemon database.
258
+ ## Installation
267
259
 
268
- Use `gua doctor --json` for the same report in a machine-readable form.
269
- The JSON includes local paths, command stderr, and `nvidia-smi -L` output
270
- with GPU UUIDs, so review it before sharing it outside your team.
271
- `gua doctor` does not need `sudo`; run it as the same user that will run
272
- the daemon.
260
+ The recommended install path is PyPI via [uv](https://docs.astral.sh/uv/):
273
261
 
274
- Available `gua` subcommands: `doctor`, `daemon`, `start`, `status`,
275
- `stop`, `report`, `demo`, `version`, `help`.
262
+ ```sh
263
+ uv tool install gpu-usage-audit
264
+ ```
276
265
 
277
- Update or remove the installed tool with uv:
266
+ Update or remove it with:
278
267
 
279
268
  ```sh
280
269
  uv tool upgrade gpu-usage-audit
281
270
  uv tool uninstall gpu-usage-audit
282
271
  ```
283
272
 
284
- `uv tool uninstall gpu-usage-audit` removes the installed Python tool and
285
- its `gua` / `gpu-usage-audit` commands.
286
-
287
- GitHub Release assets are also available for manual download:
273
+ Manual wheel downloads are available from GitHub Releases:
288
274
 
289
275
  ```sh
290
- BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
291
- WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
276
+ BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.3"
277
+ WHEEL="gpu_usage_audit-1.0.3-py3-none-any.whl"
292
278
 
293
279
  curl -fsSLO "$BASE/$WHEEL"
294
280
  curl -fsSLO "$BASE/SHA256SUMS"
@@ -297,202 +283,94 @@ sha256sum -c SHA256SUMS --ignore-missing
297
283
  uvx --from "./$WHEEL" gua doctor
298
284
  ```
299
285
 
300
- ## What you get
286
+ ## Quick Start
287
+
288
+ On an NVIDIA GPU host:
301
289
 
290
+ ```sh
291
+ gua doctor
292
+ gua daemon --interval 30s
293
+ gua status
294
+ gua report --since 1h
295
+ gua stop
302
296
  ```
303
- $ gua report --since 1h --interval 30s
297
+
298
+ `gua doctor` is read-only. It does not need `sudo`; run it as the same user that will run the daemon.
299
+
300
+ Default local state lives under `~/.gua/`:
301
+
302
+ | Path | Purpose |
303
+ | --- | --- |
304
+ | `~/.gua/gua.db` | SQLite history database |
305
+ | `~/.gua/gua.pid` | background daemon PID file |
306
+ | `~/.gua/gua.log` | daemon stdout/stderr log |
307
+
308
+ The default DB is an appendable local history database. Later daemon runs append to it. If you pass a custom `--db PATH`, daemon still refuses an existing file to avoid mixing ad hoc runs by accident.
309
+
310
+ ## Report Preview
311
+
312
+ ```text
313
+ $ gua report --since 1h
304
314
  gua — lab-a100 (bare, driver 560.35.05) Window: 1:00:00
305
315
 
306
316
  §1 Headline
307
317
  basis: one sample = one GPU card at one daemon tick
308
318
  rules: active >=10% util; idle-held <10% util with >100 MB process memory
309
- █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
310
319
  active █ 15.7%
311
- idle-held ▒ 45.1% ← this is the number conventional tools miss
320
+ idle-held ▒ 45.1%
312
321
  truly-idle ░ 39.2%
313
322
  (51 samples)
314
323
 
315
324
  §2 Idle capacity
316
- converted from card-ticks to GPU-hours using the report --interval
325
+ converted from card-ticks to GPU-hours using recorded daemon interval
317
326
  idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
318
327
  truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
319
328
 
320
329
  §3 Per-GPU
321
- per-card share of samples in the same three states
322
- GPU-0 active 47.1% idle-held 35.3% truly-idle 17.6%
323
- GPU-1 active 0.0% idle-held 100.0% truly-idle 0.0%
324
- GPU-2 active 0.0% idle-held 0.0% truly-idle 100.0%
325
-
326
330
  §4 Top identities
327
- one identity counts once per GPU/tick after its processes are summed
328
- identity gpu-hours idle-held samples
329
- alice 0.42 42.9% 51
330
- bob 0.28 100.0% 34
331
-
332
331
  §5 Time-of-day heatmap (UTC)
333
- darker means higher active share; blank means no samples
334
- 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
335
- Mon .
336
- ```
337
-
338
- The 3-bar collapses every card × every tick over the window into the
339
- active / idle-held / truly-idle split. **`idle-held` rows are the
340
- embarrassing category**: a process is holding GPU memory but the SM
341
- utilization is below 10%. §2 converts those card-ticks into GPU-hours
342
- with `--interval`; §4 groups process rows by identity, GPU, and tick
343
- before ranking users, so multiple same-user processes on one GPU/tick
344
- count once.
345
-
346
- ## Demo (no GPU required)
347
-
348
- The `demo` subcommand records 30 ticks of fake telemetry and prints the
349
- report — all in one process, no second shell needed.
350
-
351
- ```sh
352
- gua demo
353
332
  ```
354
333
 
355
- The bundled `FakeTier` produces a deterministic 5-tick workload
356
- active learning → idle-held memory → cleanup — so the output is the
357
- same every run. Adjust the shape with `--ticks N` and `--interval D`.
358
-
359
- ## Real NVIDIA GPU host
334
+ Reports can run while the daemon is writing; SQLite WAL mode handles concurrent reads. Reports also work after the daemon has stopped, as long as the DB file exists.
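
The concurrent-read behaviour mentioned above can be reproduced with Python's standard `sqlite3` module. This is an illustrative sketch only: the table name `sample` and the temporary database path are assumptions, and the package's own schema is not shown here.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def count_rows(conn: sqlite3.Connection) -> int:
    """Read the row count, fully consuming the cursor so the read lock is released."""
    return conn.execute("SELECT COUNT(*) FROM sample").fetchall()[0][0]


# "Daemon" side: switch the database to WAL and create a table.
writer = sqlite3.connect(db_path)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute("CREATE TABLE sample (ts INTEGER, util_pct INTEGER)")
writer.commit()

# The writer now holds an open, uncommitted transaction...
writer.execute("INSERT INTO sample VALUES (1, 42)")

# ...while a separate "report" connection can still read a consistent snapshot.
reader = sqlite3.connect(db_path)
rows_during_write = count_rows(reader)

writer.commit()
rows_after_commit = count_rows(reader)

print(mode, rows_during_write, rows_after_commit)
```

In WAL mode the reader is not blocked by the open write transaction; it sees only committed rows and picks up the new row once the writer commits.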
360
335
 
361
- On an NVIDIA host, start with doctor:
336
+ ## Commands
362
337
 
363
- ```sh
364
- gua doctor
365
- ```
338
+ | Command | Description |
339
+ | --- | --- |
340
+ | `gua doctor` | Check local NVIDIA/NVML readiness and DB path status |
341
+ | `gua daemon` | Start background collection on the local NVIDIA host |
342
+ | `gua start` | Alias for `gua daemon` |
343
+ | `gua status` | Show whether the managed background collector is running |
344
+ | `gua stop` | Stop the managed background collector |
345
+ | `gua report` | Render the retrospective report from SQLite |
346
+ | `gua demo` | Generate a fake local report without a GPU |
347
+ | `gua enroll` | Connect this host to a GUA Board workspace (optional cloud sync) |
348
+ | `gua sync-once` | Collect one snapshot and push the latest state to GUA Board |
349
+ | `gua version` | Print version |
366
350
 
367
- Doctor should show the current machine, visible `/dev/nvidia*` device
368
- files, `nvidia-smi -L` GPUs, NVML device count, and `/tmp/gua.db` status.
369
- `nvidia-ml-py` is installed by default with `gpu-usage-audit`; if doctor
370
- reports that `pynvml` is not importable, reinstall the isolated tool
371
- environment:
351
+ ## Important Options
372
352
 
373
353
  ```sh
374
- uv tool install --force gpu-usage-audit
375
- ```
376
-
377
- If `pynvml` imports but NVML init fails, fix the host NVIDIA driver
378
- installation instead. `libnvidia-ml.so.1` must be available and match the
379
- loaded kernel driver; `nvidia-smi -L` should list GPUs before the daemon
380
- can collect real telemetry.
381
-
382
- Then run the collector:
383
-
384
- ```sh
385
- gua daemon --interval 30s
386
- gua status
387
- ```
388
-
389
- Run the report:
390
-
391
- ```sh
392
- gua report --since 1h --interval 30s
393
- ```
394
-
395
- Stop the background collector when the collection window is done:
396
-
397
- ```sh
398
- gua stop
399
- ```
400
-
401
- If `--db` is omitted, both `daemon` and `report` use `/tmp/gua.db`.
402
- `daemon` refuses to start when that database file already exists, so a
403
- new collection run does not silently append to an old test database. If
404
- `gua doctor` reports that the database already exists, either run
405
- `gua report` against the existing data or choose a fresh `--db PATH` for
406
- the next daemon run.
407
-
408
- > The daemon requires the NVIDIA driver and `libnvidia-ml.so.1`. On a
409
- > driverless host it exits with a friendly NVML initialization error. For
410
- > a driverless box, use `demo` instead.
411
-
412
- ## Usage
413
-
414
- `gua` has commands sharing one SQLite file. The `gpu-usage-audit` entry
415
- point remains installed for compatibility, but new examples use `gua`.
416
-
417
- | Command | What it does |
418
- | -------- | ----------------------------------------------------------- |
419
- | `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
420
- | `start` | Alias for `gua daemon`. |
421
- | `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
422
- | `stop` | Stops the background collector with SIGTERM. |
423
- | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
424
- | `demo` | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
425
-
426
- ### `daemon` / `start`
427
-
428
- ```
429
354
  gua daemon [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
430
- gua start [--db PATH] [--interval D] [--pid-file PATH] [--log-file PATH]
431
355
  gua daemon --foreground [--db PATH] [--interval D]
432
- ```
433
-
434
- - `--db PATH` (default `/tmp/gua.db`) — SQLite file to create and write
435
- to. The daemon exits with an error if the file already exists. WAL mode
436
- is enabled automatically.
437
- - `--interval D` (default `30s`) — how often to sample. Accepts `30s`,
438
- `1m`, `200ms`, etc.
439
- - `--pid-file PATH` (default `/tmp/gua.pid`) — background PID file.
440
- - `--log-file PATH` (default `/tmp/gua.log`) — stdout/stderr from the
441
- background collector.
442
- - `--foreground` — keep the collector attached to the current process.
443
- Use this for systemd or debugging.
444
-
445
- By default, `gua daemon` returns after the collector starts. Each tick is
446
- written to the log file; on shutdown the cumulative row count is written
447
- there too. `gua daemon --foreground` prints the tick summaries directly
448
- to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
449
- `gua status` and `gua stop` verify that the PID file points to the
450
- managed collector before acting on it; stale PID files are cleared.
451
-
452
- ### `report`
453
-
454
- ```
455
356
  gua report [--db PATH] [--since D] [--interval D] [--width N]
456
- ```
457
-
458
- - `--db PATH` (default `/tmp/gua.db`) — same SQLite file the daemon writes
459
- to. The report exits with an error if the file does not exist.
460
- - `--since D` (default `1h`) — the report window. **No upper bound** —
461
- `--since 365d` is accepted. The effective window is min(`--since`, age
462
- of oldest sample), so passing a huge `--since` is the same as "all
463
- data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
464
- - `--interval D` (default `30s`) — **must match what the daemon used**.
465
- This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
466
- to GPU-hours. Mismatched intervals → wrong GPU-hours.
467
- - `--width N` (default `60`) — width of the §1 three-bar in characters.
468
-
469
- ### `demo`
470
-
471
- ```
472
357
  gua demo [--db PATH] [--ticks N] [--interval D]
473
358
  ```
474
359
 
475
- - `--db PATH` (optional) if omitted, a fresh temporary database is
476
- created and its path is printed to stderr.
477
- - `--ticks N` (default `30`) how many fake ticks to record before
478
- printing the report.
479
- - `--interval D` (default `1s`) — tick spacing.
360
+ - `--interval` on `daemon` controls sampling cadence. Default: `30s`.
361
+ - `--interval` on `report` is optional. New DB rows use the interval recorded by each daemon run. Use report `--interval D` only as an override or for legacy rows without interval metadata.
362
+ - `--since` accepts `ms`, `s`, `m`, `h`, and `d`, with no upper bound.
363
+ - `--foreground` is intended for systemd and debugging.
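
The duration syntax listed above (`ms`, `s`, `m`, `h`, `d`) can be parsed with a few lines of Python. This is a hypothetical sketch, not the parser shipped in the package; `parse_duration` and its integer-only grammar are assumptions made for illustration.

```python
import re

# Seconds per supported unit; "ms" is listed before "m" in the pattern on purpose.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_DURATION_RE = re.compile(r"^(\d+)(ms|s|m|h|d)$")


def parse_duration(text: str) -> float:
    """Convert a string such as '30s' or '365d' into seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


print(parse_duration("30s"))
print(parse_duration("1h"))
print(parse_duration("365d"))  # no upper bound is enforced
```

Unlisted units such as `w` raise `ValueError` in this sketch, so a week would be written as `7d`.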
480
364
 
481
- ### Operational notes
365
+ ## Demo Without a GPU
482
366
 
483
- - **Same `--interval` on both sides.** If you ran the daemon with
484
- `--interval 30s`, run `gua report --interval 30s` too.
485
- - **Let it run for a while.** §1/§3 are meaningful after one tick;
486
- §4 (Top identities) needs hours; §5 (Heatmap) needs days.
487
- - **WAL leaves sidecar files** (`gua.db-wal`, `gua.db-shm`). They are
488
- cleaned up automatically when the last connection closes.
489
- - **DB size**: ~50 MB per host per 30 days at 12 GPUs (extrapolated
490
- from Go v0.1.0; not yet re-measured for the Python rewrite).
367
+ ```sh
368
+ gua demo
369
+ ```
491
370
 
492
- ### Running as a systemd service
371
+ The demo records deterministic fake telemetry and immediately prints the report shape.
493
372
 
494
- For a long-running deployment, drop a unit file in
495
- `/etc/systemd/system/gpu-usage-audit.service`:
373
+ ## Systemd Example
496
374
 
497
375
  ```ini
498
376
  [Unit]
@@ -509,56 +387,67 @@ User=gua
509
387
  WantedBy=multi-user.target
510
388
  ```
511
389
 
512
- Then `systemctl enable --now gpu-usage-audit`.
390
+ Then run:
513
391
 
514
- ## How the classification works
515
-
516
- Each tick of the daemon records:
392
+ ```sh
393
+ systemctl enable --now gpu-usage-audit
394
+ ```
517
395
 
518
- - per-card: `util_pct` (SM utilization)
519
- - per-process: `mem_used_mb` per `(card, pid)`
396
+ ## Cloud Sync (GUA Board, optional)
520
397
 
521
- The report aggregates per card × per tick:
398
+ `gpu-usage-audit` runs fully local by default. If you also use GUA Board (a separate service that shows the latest GPU availability across several servers in one place), you can optionally connect a host:
522
399
 
400
+ ```sh
401
+ # 1. In the GUA Board web UI, register a server and copy the one-time enrollment token.
402
+ # 2. On the GPU host:
403
+ gua enroll --server-url https://board.example.com --enrollment-token <TOKEN>
404
+ # 3. Push the current snapshot (run on a timer or after `gua daemon`):
405
+ gua sync-once
523
406
  ```
524
- util >= 10 → active (compute is happening)
525
- util < 10 AND mem > 100 → idle-held (memory is held, SM is cold)
526
- util < 10 AND mem <= 100 → truly-idle (the card is genuinely free)
407
+
408
+ How it works and what it does not do:
409
+
410
+ - `enroll` exchanges the one-time token for a host-scoped, write-only agent token, stored in `~/.gua/cloud.json` with mode `0600`. The token can only write this host's observations — it cannot read reservations, users, or other hosts.
411
+ - `sync-once` collects one snapshot, **writes it to the local database first**, then pushes only the latest state. A failed push never blocks or rolls back the local write.
412
+ - Only the latest snapshot is sent. Historical ticks are kept locally and are never replayed to the server.
413
+ - Process telemetry is limited to PID, Linux user, process name (`/proc/<pid>/comm`), and GPU memory — never full command lines.
414
+ - Cloud sync adds no new runtime dependency (the client uses the Python standard library).
415
+
416
+ Override the config or database path with `--config PATH` / `--db PATH`, and use `gua sync-once --fake` to exercise the flow without a GPU.
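
The local-first ordering described above can be sketched as a small function. This is not the code in `cloud/client.py`; the `collect`, `write_local`, and `push` callables are hypothetical stand-ins, and treating `OSError` as the push failure is an assumption (standard-library network errors such as `urllib.error.URLError` derive from it).

```python
from typing import Callable


def sync_once(
    collect: Callable[[], dict],
    write_local: Callable[[dict], None],
    push: Callable[[dict], None],
) -> bool:
    """Return True if the push succeeded; the local write happens either way."""
    snapshot = collect()
    write_local(snapshot)  # local history is written first
    try:
        push(snapshot)  # only this latest snapshot is sent
    except OSError:
        # A failed push is reported, but the local write is never rolled back.
        return False
    return True


local_history: list[dict] = []


def failing_push(snapshot: dict) -> None:
    raise ConnectionError("board unreachable")  # simulated network failure


pushed = sync_once(
    collect=lambda: {"gpu": 0, "util_pct": 3, "mem_used_mb": 8192},
    write_local=local_history.append,
    push=failing_push,
)
print(pushed, len(local_history))
```

Keeping the push inside its own `try` block after the local write is what makes the cloud side optional: the history database stays complete whether or not the server is reachable.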
417
+
418
+ ## Classification Rules
419
+
420
+ Each daemon tick records per-card utilization and per-process GPU memory. The report classifies each GPU card at each tick with these rules:
421
+
422
+ ```text
423
+ util >= 10 -> active
424
+ util < 10 AND mem > 100 -> idle-held
425
+ util < 10 AND mem <= 100 -> truly-idle
527
426
  ```
528
427
 
529
- The 100 MB threshold absorbs the PyTorch/TF runtime baseline so
530
- importing torch doesn't count as "holding the GPU".
428
+ The 100 MB threshold absorbs runtime baselines such as importing PyTorch or TensorFlow.
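
The three rules above translate directly into a small function. This is an illustrative sketch rather than the contents of `classify.py`; the function name, constant names, and the choice to pass memory in MB are assumptions.

```python
UTIL_ACTIVE_PCT = 10  # utilization at or above this counts as active
MEM_HELD_MB = 100     # process memory above this counts as holding the card


def classify(util_pct: float, mem_mb: float) -> str:
    """Classify one GPU card at one daemon tick."""
    if util_pct >= UTIL_ACTIVE_PCT:
        return "active"
    if mem_mb > MEM_HELD_MB:
        return "idle-held"
    return "truly-idle"


print(classify(1, 8192))  # a notebook holding an 8 GB tensor at 1% utilization
print(classify(55, 4096))
print(classify(0, 60))
```

Checking utilization first means a busy card is always `active`, regardless of how much memory its processes hold.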
531
429
 
532
430
  ## Development
533
431
 
534
- Requires [uv](https://docs.astral.sh/uv/) (uv pins the Python version
535
- automatically; `requires-python = ">=3.12"`).
536
-
537
432
  ```sh
538
433
  git clone https://github.com/AI-Ocean/gpu-usage-audit
539
434
  cd gpu-usage-audit
540
- uv sync # create .venv, install dev deps
541
- uv run pytest # run the test suite
542
- uv run ruff check # lint
543
- uv run mypy # type-check (strict)
544
- uv run gua demo # see the report shape locally
435
+ uv sync
436
+ uv run python -m pytest
437
+ uv run ruff check
438
+ uv run ruff format --check
439
+ uv run python -m mypy
440
+ uv run gua demo
545
441
  ```
546
442
 
547
- CI runs ruff + format check + mypy + pytest, then builds and smoke-tests
548
- the wheel on every push and PR. Tag pushes (`v*`) rerun the same checks,
549
- build sdist + wheel, smoke-test the wheel, and create a GitHub Release
550
- with auto-generated notes. Release tags also publish the wheel and sdist
551
- to PyPI through Trusted Publishing.
443
+ CI runs ruff, format check, mypy, pytest, build, and wheel smoke tests. Tag pushes (`v*`) build release assets and publish to PyPI through Trusted Publishing.
552
444
 
553
445
  ## Non-goals
554
446
 
555
- This is a **single-host retrospective** tool. Live dashboards, multi-host
556
- aggregation, quotas, Kubernetes cluster scans, Slurm scheduler joins,
557
- Docker/Podman fallback runtimes, and pod-name resolution are out of scope
558
- for bare-metal 1.0. Those belong above the host layer. If this tool
559
- surfaces enough idle-held to make scheduling worth solving, see
560
- [ocean-all](https://github.com/AI-Ocean).
447
+ This is a single-host retrospective tool. Live dashboards, multi-host aggregation, quotas, Kubernetes cluster scans, Slurm joins, Docker/Podman runtime fallback, and pod-name resolution are outside the bare-metal 1.0 scope.
448
+
449
+ The Go v0.1.0 implementation remains available at tag `v0.1.0` and branch [`go-archive`](https://github.com/AI-Ocean/gpu-usage-audit/tree/go-archive).
561
450
 
562
451
  ## License
563
452
 
564
- Apache License 2.0 see [LICENSE](LICENSE).
453
+ Apache License 2.0. See [LICENSE](LICENSE).