PyPI - gpu-usage-audit - Versions diffs - 1.0.1__tar.gz → 1.0.2__tar.gz - Mend

gpu-usage-audit 1.0.1__tar.gz → 1.0.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/CHANGELOG.md RENAMED

@@ -1,5 +1,19 @@
 # Changelog
+## 1.0.2 - 2026-05-15
+- Hardened `gua status` and `gua stop` so stale PID files do not act on
+  unrelated live processes.
+- Clarified report output by explaining sample units, classification rules,
+  interval-dependent GPU-hours, and heatmap density.
+- Split §2 from generic "Waste" into idle-held capacity and truly-idle
+  capacity. The equivalent-GPU figures now use GPUs present in the report
+  window instead of the entire database.
+- Made §4 Top identities aggregate by identity/GPU/tick before converting to
+  GPU-hours, so reports may show lower per-user GPU-hours when one user has
+  multiple processes on the same GPU at the same tick.
+- Warn when NVML process-list visibility is unavailable for a GPU.
 ## 1.0.1 - 2026-05-15
 - Made `gua` the documented command surface for daemon, report, demo, and doctor output.
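
The §4 entry in the 1.0.2 changelog is easier to follow with numbers. The sketch below is not the package's SQL from `report.py`; the row shape `(identity, gpu, tick)`, the sample rows, and both function names are assumptions made for illustration. It shows why collapsing to identity/GPU/tick first can lower a user's GPU-hours when that user runs several processes on one GPU at the same tick.

```python
from collections import Counter
from datetime import timedelta

# Hypothetical process rows: (identity, gpu_index, tick_number).
# alice has two processes on GPU 0 at tick 1.
ROWS = [
    ("alice", 0, 1),
    ("alice", 0, 1),
    ("alice", 0, 2),
    ("bob", 1, 1),
]


def naive_gpu_hours(rows, interval):
    """Count every process row: overcounts same-user processes on one GPU/tick."""
    per_tick_hours = interval.total_seconds() / 3600
    counts = Counter(identity for identity, _gpu, _tick in rows)
    return {identity: n * per_tick_hours for identity, n in counts.items()}


def collapsed_gpu_hours(rows, interval):
    """Collapse to unique (identity, gpu, tick) before converting to GPU-hours."""
    per_tick_hours = interval.total_seconds() / 3600
    counts = Counter(identity for identity, _gpu, _tick in set(rows))
    return {identity: n * per_tick_hours for identity, n in counts.items()}


interval = timedelta(seconds=30)
naive = naive_gpu_hours(ROWS, interval)
collapsed = collapsed_gpu_hours(ROWS, interval)
```

With these rows, alice is charged three ticks by the naive count and two ticks after collapsing, while bob is unchanged.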

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-usage-audit
-Version: 1.0.1
+Version: 1.0.2
 Summary: Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss.
 Project-URL: Homepage, https://github.com/AI-Ocean/gpu-usage-audit
 Project-URL: Issues, https://github.com/AI-Ocean/gpu-usage-audit/issues
@@ -287,8 +287,8 @@ its `gua` / `gpu-usage-audit` commands.
 GitHub Release assets are also available for manual download:
 ```sh
-BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.1"
-WHEEL="gpu_usage_audit-1.0.1-py3-none-any.whl"
+BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
+WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
 curl -fsSLO "$BASE/$WHEEL"
 curl -fsSLO "$BASE/SHA256SUMS"
@@ -304,26 +304,33 @@ $ gua report --since 1h --interval 30s
 gua — lab-a100 (bare, driver 560.35.05)  Window: 1:00:00
 §1 Headline
+  basis: one sample = one GPU card at one daemon tick
+  rules: active >=10% util; idle-held <10% util with >100 MB process memory
   █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
   active       █   15.7%
   idle-held    ▒   45.1%       ← this is the number conventional tools miss
   truly-idle   ░   39.2%
   (51 samples)
-§2 Waste
-  ~0.43 GPU-hours idle, ~2.53 GPUs equivalently unused
+§2 Idle capacity
+  converted from card-ticks to GPU-hours using the report --interval
+  idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
+  truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
 §3 Per-GPU
+  per-card share of samples in the same three states
   GPU-0     active  47.1%  idle-held  35.3%  truly-idle  17.6%
   GPU-1     active   0.0%  idle-held 100.0%  truly-idle   0.0%
   GPU-2     active   0.0%  idle-held   0.0%  truly-idle 100.0%
 §4 Top identities
-  identity              gpu-hours   idle-held
-  alice                      0.42       42.9%
-  bob                        0.28      100.0%
+  one identity counts once per GPU/tick after its processes are summed
+  identity              gpu-hours   idle-held   samples
+  alice                      0.42       42.9%        51
+  bob                        0.28      100.0%        34
 §5 Time-of-day heatmap (UTC)
+  darker means higher active share; blank means no samples
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
   Mon               .
 ```
@@ -331,7 +338,10 @@ gua — lab-a100 (bare, driver 560.35.05)  Window: 1:00:00
 The 3-bar collapses every card × every tick over the window into the
 active / idle-held / truly-idle split. **`idle-held` rows are the
 embarrassing category**: a process is holding GPU memory but the SM
-utilization is below 10%.
+utilization is below 10%. §2 converts those card-ticks into GPU-hours
+with `--interval`; §4 groups process rows by identity, GPU, and tick
+before ranking users, so multiple same-user processes on one GPU/tick
+count once.
 ## Demo (no GPU required)
@@ -408,7 +418,7 @@ point remains installed for compatibility, but new examples use `gua`.
 | -------- | ----------------------------------------------------------- |
 | `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
 | `start`  | Alias for `gua daemon`. |
-| `status` | Shows whether the background collector PID is still running. |
+| `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
 | `stop`   | Stops the background collector with SIGTERM. |
 | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
 | `demo`   | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
@@ -436,6 +446,8 @@ By default, `gua daemon` returns after the collector starts. Each tick is
 written to the log file; on shutdown the cumulative row count is written
 there too. `gua daemon --foreground` prints the tick summaries directly
 to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
+`gua status` and `gua stop` verify that the PID file points to the
+managed collector before acting on it; stale PID files are cleared.
 ### `report`
@@ -450,7 +462,7 @@ gua report [--db PATH] [--since D] [--interval D] [--width N]
   of oldest sample), so passing a huge `--since` is the same as "all
   data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
 - `--interval D` (default `30s`) — **must match what the daemon used**.
-  This is how §2 (Waste) and §4 (Top identities) convert tick counts
+  This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
   to GPU-hours. Mismatched intervals → wrong GPU-hours.
 - `--width N` (default `60`) — width of the §1 three-bar in characters.
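
The "rules" line added to §1 can be read as a three-way classifier over card samples. The following is a minimal sketch under that reading, not the package's implementation: the function name and constants are hypothetical, and treating every remaining low-utilization sample as `truly-idle` is an assumption, since the report text only spells out the `active` and `idle-held` rules.

```python
from collections import Counter

ACTIVE_UTIL_PCT = 10      # "active >=10% util" from the report's rules line
HELD_MEMORY_MB = 100      # "idle-held <10% util with >100 MB process memory"


def classify(util_pct: float, process_memory_mb: float) -> str:
    """Classify one sample (one GPU card at one daemon tick)."""
    if util_pct >= ACTIVE_UTIL_PCT:
        return "active"
    if process_memory_mb > HELD_MEMORY_MB:
        return "idle-held"
    # Assumption: anything else counts as truly idle.
    return "truly-idle"


def shares(samples):
    """Return the percentage of samples in each state."""
    counts = Counter(classify(util, mem) for util, mem in samples)
    total = sum(counts.values())
    return {state: 100 * n / total for state, n in counts.items()}


# Hypothetical (util %, process memory MB) samples.
demo = [(55, 4000), (3, 4000), (0, 0), (0, 20)]
demo_shares = shares(demo)
```

Here one sample is active, one is idle-held, and two are truly-idle, so the shares come out as 25%, 25%, and 50%.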

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/README.md RENAMED

@@ -64,8 +64,8 @@ its `gua` / `gpu-usage-audit` commands.
 GitHub Release assets are also available for manual download:
 ```sh
-BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.1"
-WHEEL="gpu_usage_audit-1.0.1-py3-none-any.whl"
+BASE="https://github.com/AI-Ocean/gpu-usage-audit/releases/download/v1.0.2"
+WHEEL="gpu_usage_audit-1.0.2-py3-none-any.whl"
 curl -fsSLO "$BASE/$WHEEL"
 curl -fsSLO "$BASE/SHA256SUMS"
@@ -81,26 +81,33 @@ $ gua report --since 1h --interval 30s
 gua — lab-a100 (bare, driver 560.35.05)  Window: 1:00:00
 §1 Headline
+  basis: one sample = one GPU card at one daemon tick
+  rules: active >=10% util; idle-held <10% util with >100 MB process memory
   █████████▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒░░░░░░░░░░░░░░░░░░░░░░░░
   active       █   15.7%
   idle-held    ▒   45.1%       ← this is the number conventional tools miss
   truly-idle   ░   39.2%
   (51 samples)
-§2 Waste
-  ~0.43 GPU-hours idle, ~2.53 GPUs equivalently unused
+§2 Idle capacity
+  converted from card-ticks to GPU-hours using the report --interval
+  idle-held: ~0.31 GPU-hours, ~1.53 GPUs equivalently unavailable
+  truly-idle: ~0.12 GPU-hours, ~1.00 GPUs equivalently free
 §3 Per-GPU
+  per-card share of samples in the same three states
   GPU-0     active  47.1%  idle-held  35.3%  truly-idle  17.6%
   GPU-1     active   0.0%  idle-held 100.0%  truly-idle   0.0%
   GPU-2     active   0.0%  idle-held   0.0%  truly-idle 100.0%
 §4 Top identities
-  identity              gpu-hours   idle-held
-  alice                      0.42       42.9%
-  bob                        0.28      100.0%
+  one identity counts once per GPU/tick after its processes are summed
+  identity              gpu-hours   idle-held   samples
+  alice                      0.42       42.9%        51
+  bob                        0.28      100.0%        34
 §5 Time-of-day heatmap (UTC)
+  darker means higher active share; blank means no samples
         0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
   Mon               .
 ```
@@ -108,7 +115,10 @@ gua — lab-a100 (bare, driver 560.35.05)  Window: 1:00:00
 The 3-bar collapses every card × every tick over the window into the
 active / idle-held / truly-idle split. **`idle-held` rows are the
 embarrassing category**: a process is holding GPU memory but the SM
-utilization is below 10%.
+utilization is below 10%. §2 converts those card-ticks into GPU-hours
+with `--interval`; §4 groups process rows by identity, GPU, and tick
+before ranking users, so multiple same-user processes on one GPU/tick
+count once.
 ## Demo (no GPU required)
@@ -185,7 +195,7 @@ point remains installed for compatibility, but new examples use `gua`.
 | -------- | ----------------------------------------------------------- |
 | `daemon` | Starts the collector in the background. Samples real NVML telemetry on every tick and writes to a new database. NVIDIA host required. |
 | `start`  | Alias for `gua daemon`. |
-| `status` | Shows whether the background collector PID is still running. |
+| `status` | Shows whether the background collector PID is still running. Also clears a stale PID file when it points to a missing or unrelated process. |
 | `stop`   | Stops the background collector with SIGTERM. |
 | `report` | One-shot read against the accumulated database. Safe to run **while the daemon is still writing** — SQLite WAL mode handles the concurrency. |
 | `demo`   | Self-contained showcase. Records N fake ticks and immediately prints the report. No GPU, no second shell, no operational meaning — just to see the output shape. |
@@ -213,6 +223,8 @@ By default, `gua daemon` returns after the collector starts. Each tick is
 written to the log file; on shutdown the cumulative row count is written
 there too. `gua daemon --foreground` prints the tick summaries directly
 to the terminal and exits on Ctrl+C, SIGTERM, or `systemctl stop`.
+`gua status` and `gua stop` verify that the PID file points to the
+managed collector before acting on it; stale PID files are cleared.
 ### `report`
@@ -227,7 +239,7 @@ gua report [--db PATH] [--since D] [--interval D] [--width N]
   of oldest sample), so passing a huge `--since` is the same as "all
   data". Units: `ms`, `s`, `m`, `h`, `d` (no `w`; use `7d`).
 - `--interval D` (default `30s`) — **must match what the daemon used**.
-  This is how §2 (Waste) and §4 (Top identities) convert tick counts
+  This is how §2 (Idle capacity) and §4 (Top identities) convert tick counts
   to GPU-hours. Mismatched intervals → wrong GPU-hours.
 - `--width N` (default `60`) — width of the §1 three-bar in characters.
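
The warning that `--interval` must match the daemon follows from the conversion itself. A minimal sketch, assuming the conversion is simply card-ticks multiplied by the interval (the function name is hypothetical, and this is not meant to reproduce the figures in the sample report above):

```python
from datetime import timedelta


def card_ticks_to_gpu_hours(card_ticks: int, interval: timedelta) -> float:
    """Convert a count of card-ticks to GPU-hours for a given tick interval."""
    return card_ticks * interval.total_seconds() / 3600


ticks = 120  # hypothetical: 120 card-ticks recorded by a daemon ticking every 30s
matched = card_ticks_to_gpu_hours(ticks, timedelta(seconds=30))
mismatched = card_ticks_to_gpu_hours(ticks, timedelta(seconds=60))
```

The same 120 card-ticks become 1.0 GPU-hours with the matching 30s interval and 2.0 GPU-hours if the report is run with `--interval 60s`, which is why a mismatch scales every GPU-hour figure by the ratio of the two intervals.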

gpu_usage_audit-1.0.2/projects/bare-metal-1.0/handoff.ko.md ADDED

@@ -0,0 +1,83 @@
+# Bare Metal 1.0 Handoff
+Updated: 2026-05-15
+## What to read first when picking this up
+- `projects/bare-metal-1.0/status.ko.md`: current completion status, 1.0.1 verification results, and 1.0.2 release prep status.
+- `README.md`: the actual user documentation and the release/install/runbook/report surface.
+- `src/gpu_usage_audit/__main__.py`: `gua` CLI, background daemon lifecycle, PID handling.
+- `src/gpu_usage_audit/report.py`: report SQL aggregation.
+- `src/gpu_usage_audit/render.py`: human-readable report output.
+- `.github/workflows/release.yml`: tag release, GitHub Release, and PyPI publish path.
+## Fixed decisions
+- 1.0 covers only a single local bare-metal NVIDIA host.
+- Kubernetes, Slurm, Docker/Podman fallback, remote nodes, and cluster-wide reports are out of scope for 1.0.
+- `nvidia-ml-py` is a default dependency.
+- The `gpu-usage-audit[nvml]` extra remains as an empty alias for compatibility.
+- The DB schema stays at v1: `host`, `gpu_sample`, `proc_sample`.
+- The default DB is `/tmp/gua.db`.
+- `gua daemon` runs in the background by default.
+- `gua daemon --foreground` is for systemd and debugging.
+- `gua start` is an alias for `gua daemon`.
+- `gua status` and `gua stop` manage the background collector through its PID file.
+- `daemon` fails if the DB file already exists.
+- `report` fails if the DB file does not exist.
+- `daemon` and `demo` always record the host row's `env_kind` as `"bare"`.
+- The auto-runtime proposal/project documents were deleted. Restarting the Kubernetes/Slurm/Docker/Podman
+  extension work starts from a new proposal.
+## Current status
+- PR A: implemented in PR #9.
+- PR B: implemented in PR #10.
+- Post-1.0 cleanup: completed in PR #11.
+- Bare-metal 1.0 release: completed in PR #12 and tag `v1.0.0`.
+- 1.0.1 command surface/background daemon release: completed in PR #13 and tag `v1.0.1`.
+- GitHub Release `v1.0.1`: published.
+- PyPI `gpu-usage-audit 1.0.1`: published.
+- NVIDIA host acceptance: the user confirmed that collection works correctly on a real host.
+- 1.0.2 release prep: in progress. The #14 lifecycle/report cleanup ships as a patch release.
+  The package version was bumped to `1.0.2`, and the local build and wheel smoke test passed.
+## Last local verification
+```sh
+uv run ruff check
+uv run ruff format --check
+uv run mypy
+uv run pytest
+uv build --out-dir /tmp/gua-dist-1.0.2-prep
+bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-1.0.2-prep/gpu_usage_audit-1.0.2-py3-none-any.whl
+env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py
+```
+Results: `pytest` 124 passed, `mypy` 25 source files, `ruff format` 26 files.
+## Direction of the current cleanup PR
+- Because PID reuse can leave `/tmp/gua.pid` pointing at a different process, `status`/`stop` first
+  verify that the PID is actually the managed `gpu_usage_audit daemon` process.
+- Report §2 splits `idle-held` and `truly-idle` instead of merging all low-util samples into "waste".
+- Report §4 first collapses rows to identity/GPU/tick, rather than counting process rows, before computing per-user GPU-hours.
+- The report output itself briefly states what a sample means, the classification rules, the `--interval` dependency, and how to read the heatmap.
+- A failed NVML process-list query can understate idle-held, so it is logged as a warning.
+- The 1.0.2 release prep aligns the package version, the README release asset examples, and the CHANGELOG to `1.0.2`.
+## Caveats
+- The current local development machine is not an NVIDIA host. `gua doctor` reporting unsupported is normal.
+- `/tmp/gua.db` already exists. The daemon refusing to start on the default path is expected behavior.
+- `report --interval` must equal the daemon's collection interval for GPU-hours to be correct.
+- The SQLite WAL sidecars (`*.db-wal`, `*.db-shm`) are cleaned up when the last connection closes.
+- When cutting 1.0.2, `env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py` must
+  pass.
+## Recommended order for the next session
+1. Check for user changes first with `git status --short`.
+2. Check the cleanup PR's CI results and review comments.
+3. If needed, polish the report wording once more so real operators can read it easily.
+4. If a patch release is needed after merge, handle the version bump and changelog in a separate PR.
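
The handoff notes rely on two SQLite WAL behaviors: a reader can query while the writer connection is still open, and the `-wal` sidecar is normally removed once the last connection closes. The sketch below exercises both with the standard `sqlite3` module. The table name and file name are hypothetical and unrelated to the package's v1 schema.

```python
import os
import sqlite3
import tempfile


def wal_demo(db_path: str):
    """Write in WAL mode, read from a second connection, then close both."""
    writer = sqlite3.connect(db_path)
    # The PRAGMA returns the journal mode that is now in effect.
    mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    writer.execute("CREATE TABLE tick (n INTEGER)")
    writer.execute("INSERT INTO tick VALUES (1)")
    writer.commit()

    # The writer connection is still open; a separate reader sees committed rows.
    reader = sqlite3.connect(db_path)
    count = reader.execute("SELECT COUNT(*) FROM tick").fetchone()[0]

    reader.close()
    writer.close()
    # After the last connection closes, SQLite normally checkpoints and
    # deletes the -wal sidecar; report whether it is still present.
    wal_left_behind = os.path.exists(db_path + "-wal")
    return mode, count, wal_left_behind


with tempfile.TemporaryDirectory() as tmp:
    mode, count, wal_left_behind = wal_demo(os.path.join(tmp, "wal-demo.db"))
```

On a local filesystem `mode` is expected to be `wal` and `count` to be 1. Whether the sidecar is gone afterwards can depend on the filesystem and on other open connections, so the value is returned rather than assumed.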

gpu_usage_audit-1.0.2/projects/bare-metal-1.0/status.ko.md ADDED

@@ -0,0 +1,120 @@
+# Bare Metal 1.0 Status
+Updated: 2026-05-15
+## Summary
+Bare Metal 1.0 targets only a single NVIDIA bare-metal host. It has been released
+through 1.0.1, and 1.0.2 release prep is in progress. The `v1.0.1` GitHub Release and
+PyPI publish are complete, and the user confirmed that telemetry collection works
+correctly on a real NVIDIA host.
+The 1.0.2 candidate is a patch release that ships the code-quality cleanup done after 1.0.1.
+The main focus is background daemon PID safety, visibility of report semantics, and consistency of internal documents.
+## Implementation status
+| Area | Status | Notes |
+| --- | --- | --- |
+| Scope reset | Done | Removed the Kubernetes/Slurm/Docker/remote runtime surface. |
+| `gua doctor` | Done | Diagnoses only the current machine's `/dev/nvidia*`, `nvidia-smi -L`, NVML, and DB path. |
+| Packaging UX | Done | `nvidia-ml-py` is a default dependency and the `nvml` extra is an empty compatibility alias. |
+| `gua` command surface | Done | Provides `doctor`, `daemon`, `start`, `status`, `stop`, `report`, `demo`. |
+| Background daemon UX | Done | `gua daemon` runs in the background by default; `--foreground` is for systemd/debugging. |
+| `daemon`/`report` DB UX | Done | Default DB is `/tmp/gua.db`; daemon rejects an existing DB and report rejects a missing DB. |
+| README bare-metal docs | Done | Install, runbook, systemd example, and operational notes are current for 1.0.2. |
+| Release | In progress | Package version is `1.0.2`; local build and wheel smoke done; the release prep PR and tag publish remain. |
+| NVIDIA host acceptance | Done | Confirmed that collection works correctly on a real NVIDIA host. |
+## Last verification results
+2026-05-15 1.0.2 release prep local verification:
+```sh
+uv run ruff format --check
+uv run ruff check
+uv run mypy
+uv run pytest
+env GITHUB_REF_NAME=v1.0.2 uv run python scripts/check-tag-version.py
+uv build --out-dir /tmp/gua-dist-1.0.2-prep
+bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-1.0.2-prep/gpu_usage_audit-1.0.2-py3-none-any.whl
+```
+Results:
+- `ruff format --check`: 26 files already formatted.
+- `ruff check`: pass.
+- `mypy`: no issues in 25 source files.
+- `pytest`: 124 passed.
+- tag-version check: `v1.0.2` matches the `pyproject.toml` version.
+- `uv build`: sdist/wheel build succeeded.
+- wheel smoke: succeeded.
+2026-05-15 1.0.1 status check:
+```sh
+git status --short
+uv run ruff check
+uv run ruff format --check
+uv run mypy
+uv run pytest
+env GITHUB_REF_NAME=v1.0.1 uv run python scripts/check-tag-version.py
+uv build --out-dir /tmp/gua-dist-1.0.1-status
+bash scripts/smoke-dist-wheel.sh /tmp/gua-dist-1.0.1-status/gpu_usage_audit-1.0.1-py3-none-any.whl
+```
+Results:
+- Working tree clean.
+- `ruff check`: pass.
+- `ruff format --check`: 26 files already formatted.
+- `mypy`: no issues in 25 source files.
+- `pytest`: 114 passed.
+- tag-version check: `v1.0.1` matches the `pyproject.toml` version.
+- `uv build`: sdist/wheel build succeeded.
+- wheel smoke: succeeded.
+- Release workflow: `v1.0.1` success.
+- PyPI latest: `gpu-usage-audit 1.0.1`.
+## What changed in 1.0.1
+- Made `gua` the documented command surface.
+- `gua daemon` starts the collector in the background.
+- `gua daemon --foreground` is kept for systemd and debugging.
+- Added `gua start`, `gua status`, and `gua stop`.
+- The README install/run/report examples now use `gua`.
+## Current cleanup review findings
+- If `gua stop` trusts only the number in `/tmp/gua.pid` and sends SIGTERM, PID reuse can
+  make it hit a different process. It must verify that the pid is actually a
+  `python -m gpu_usage_audit daemon` process.
+- If the §2 report merges `idle-held` and `truly-idle` into a single "idle/waste" figure,
+  the product message is blurred. Capacity that users cannot use and capacity that is actually free must be shown separately.
+- §4 Top identities can overcount when it counts process rows directly and one user has
+  several processes on the same GPU/tick. Rows must first be collapsed to identity/GPU/tick.
+- The report should explain the meaning of "sample", the thresholds, and the `--interval`
+  dependency better in the output itself.
+- When the NVML process list cannot be read, a low-util GPU can look `truly-idle`,
+  so at least a warning is needed.
+## Local `doctor` status
+The current development machine is not an NVIDIA host, so `unsupported` is the
+expected result of `uv run gua doctor`.
+Observed blockers:
+- No `/dev/nvidia*`.
+- `nvidia-smi` is not on PATH.
+- NVML init failed: `libnvidia-ml.so.1` not found.
+- `/tmp/gua.db` already exists, so the daemon does not start on the default path.
+These results are limits of the local environment and are not treated as a product regression.
+## Next steps
+1. In the 1.0.2 release prep PR, update the version, the README release asset examples, and the CHANGELOG.
+2. Re-run `uv run ruff check`, `uv run ruff format --check`, `uv run mypy`, `uv run pytest`,
+   `uv build`, the wheel smoke test, and the tag-version check.
+3. After the PR merges, push the `v1.0.2` tag to run the GitHub Release and PyPI publish workflows.

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "gpu-usage-audit"
-version = "1.0.1"
+version = "1.0.2"
 description = "Single-host daemon that surfaces 'idle-held' NVIDIA GPU memory — the embarrassing category conventional dashboards miss."
 readme = "README.md"
 license = { file = "LICENSE" }

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/__main__.py RENAMED

@@ -48,17 +48,17 @@ from .nvml import NVMLNotAvailableError, NVMLTier
 from .render import (
     render_headline,
     render_heatmap,
+    render_idle_capacity,
     render_per_gpu,
     render_top_identities,
-    render_waste,
 )
 from .report import (
     load_headline,
     load_heatmap,
     load_host,
+    load_idle_capacity,
     load_per_gpu,
     load_top_identities,
-    load_waste,
 )
 from .tier import FakeTier
@@ -137,7 +137,7 @@ def build_parser() -> argparse.ArgumentParser:
         "--interval",
         type=_duration,
         default=timedelta(seconds=30),
-        help="Daemon tick interval — for §2 Waste / §4 time conversion [default: 30s]",
+        help="Daemon tick interval — for §2 Idle capacity / §4 time conversion [default: 30s]",
     )
     p_report.add_argument(
         "--width",
@@ -206,7 +206,7 @@ def _add_report_args(parser: argparse.ArgumentParser) -> None:
         "--interval",
         type=_duration,
         default=timedelta(seconds=30),
-        help="Daemon tick interval — for §2 Waste / §4 time conversion [default: 30s]",
+        help="Daemon tick interval — for §2 Idle capacity / §4 time conversion [default: 30s]",
     )
     parser.add_argument(
         "--width",
@@ -355,10 +355,15 @@ def _cmd_gua_start(args: argparse.Namespace) -> int:
     log_path = Path(args.log_file)
     existing_pid = _read_pid(pid_path)
-    if existing_pid is not None and _pid_alive(existing_pid):
-        print(f"gua daemon: already running (pid {existing_pid})")
-        return 0
     if existing_pid is not None:
+        if _pid_alive(existing_pid) and _pid_is_managed_daemon(existing_pid):
+            print(f"gua daemon: already running (pid {existing_pid})")
+            return 0
+        if _pid_alive(existing_pid):
+            print(
+                f"gua daemon: pid {existing_pid} belongs to another process; "
+                "clearing stale pid file"
+            )
         _unlink_if_exists(pid_path)
     if db_path.exists():
@@ -418,13 +423,20 @@ def _cmd_gua_status(args: argparse.Namespace) -> int:
     if pid is None:
         print("gua daemon: not running")
         return 0
-    if _pid_alive(pid):
+    if _pid_alive(pid) and _pid_is_managed_daemon(pid):
         print(f"gua daemon: running (pid {pid})")
         print(f"  pid file: {pid_path}")
         print(f"  log: {log_path}")
         return 0
-    print(f"gua daemon: not running (stale pid {pid})")
-    _unlink_if_exists(pid_path)
+    if _pid_alive(pid):
+        _unlink_if_exists(pid_path)
+        print(
+            f"gua daemon: not running (pid {pid} belongs to another process; "
+            "cleared stale pid file)"
+        )
+    else:
+        print(f"gua daemon: not running (stale pid {pid})")
+        _unlink_if_exists(pid_path)
     return 0
@@ -438,7 +450,17 @@ def _cmd_gua_stop(args: argparse.Namespace) -> int:
         _unlink_if_exists(pid_path)
         print(f"gua daemon: not running (removed stale pid {pid})")
         return 0
+    if not _pid_is_managed_daemon(pid):
+        _unlink_if_exists(pid_path)
+        print(
+            f"gua daemon: not running (pid {pid} belongs to another process; "
+            "cleared stale pid file)"
+        )
+        return 0
+    # The identity check above closes the common stale-PID-file case. A tiny
+    # check-then-kill race remains if the process exits and the OS reuses the
+    # PID before SIGTERM; avoiding that needs a stronger lock model.
     try:
         os.kill(pid, signal.SIGTERM)
     except PermissionError:
@@ -525,12 +547,12 @@ def _cmd_report(args: argparse.Namespace) -> int:
         cutoff = datetime.now(UTC) - args.since
         host = load_host(conn)
         headline = load_headline(conn, cutoff)
-        waste = load_waste(conn, cutoff, args.interval)
+        idle_capacity = load_idle_capacity(conn, cutoff, args.interval)
         per_gpu = load_per_gpu(conn, cutoff)
         top = load_top_identities(conn, cutoff, args.interval)
         heat = load_heatmap(conn, cutoff)
         render_headline(sys.stdout, host, headline, args.since, args.width)
-        render_waste(sys.stdout, waste)
+        render_idle_capacity(sys.stdout, idle_capacity)
         render_per_gpu(sys.stdout, per_gpu)
         render_top_identities(sys.stdout, top)
         render_heatmap(sys.stdout, heat)
@@ -586,7 +608,7 @@ def _cmd_demo(args: argparse.Namespace) -> int:
         cutoff = datetime.now(UTC) - window
         loaded_host = load_host(conn)
         render_headline(sys.stdout, loaded_host, load_headline(conn, cutoff), window, width=60)
-        render_waste(sys.stdout, load_waste(conn, cutoff, args.interval))
+        render_idle_capacity(sys.stdout, load_idle_capacity(conn, cutoff, args.interval))
         render_per_gpu(sys.stdout, load_per_gpu(conn, cutoff))
         render_top_identities(sys.stdout, load_top_identities(conn, cutoff, args.interval))
         render_heatmap(sys.stdout, load_heatmap(conn, cutoff))
@@ -677,6 +699,27 @@ def _pid_alive(pid: int) -> bool:
     return True
+def _pid_is_managed_daemon(pid: int) -> bool:
+    """Return True for the subprocess shape created by `_cmd_gua_start`.
+    Keep this in sync with the spawn command in `_cmd_gua_start`; status/stop
+    use it to avoid acting on unrelated processes from stale PID files.
+    """
+    args = _read_proc_cmdline(pid)
+    for i, arg in enumerate(args):
+        if arg == "-m" and args[i + 1 : i + 3] == ["gpu_usage_audit", "daemon"]:
+            return True
+    return False
+def _read_proc_cmdline(pid: int) -> list[str]:
+    try:
+        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
+    except OSError:
+        return []
+    return [part.decode("utf-8", errors="replace") for part in raw.split(b"\0") if part]
 def _unlink_if_exists(path: Path) -> None:
     with contextlib.suppress(FileNotFoundError):
         path.unlink()
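
To see what the new identity check accepts, the matching logic can be run against sample command lines without touching `/proc`. The two helpers below mirror `_pid_is_managed_daemon` and `_read_proc_cmdline` from the hunk above but take raw bytes as input. Their names and the sample command lines are illustrative, not part of the package.

```python
def split_cmdline(raw: bytes) -> list[str]:
    """Split NUL-separated /proc/<pid>/cmdline bytes into arguments."""
    return [part.decode("utf-8", errors="replace") for part in raw.split(b"\0") if part]


def looks_like_managed_daemon(args: list[str]) -> bool:
    """True when the arguments contain '-m gpu_usage_audit daemon' in sequence."""
    for i, arg in enumerate(args):
        if arg == "-m" and args[i + 1 : i + 3] == ["gpu_usage_audit", "daemon"]:
            return True
    return False


# Hypothetical command lines as they would appear in /proc/<pid>/cmdline.
managed = split_cmdline(b"/usr/bin/python3\0-m\0gpu_usage_audit\0daemon\0--foreground\0")
unrelated = split_cmdline(b"/usr/bin/python3\0-m\0http.server\0")
console_script = split_cmdline(b"/usr/bin/python3\0/usr/local/bin/gua\0daemon\0--foreground\0")
unreadable = split_cmdline(b"")
```

Only `managed` passes. An unreadable or empty command line fails closed, and so does a collector started through the console script rather than `python -m`, which matches the docstring's note that the check is tied to the spawn command in `_cmd_gua_start`.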

{gpu_usage_audit-1.0.1 → gpu_usage_audit-1.0.2}/src/gpu_usage_audit/nvml.py RENAMED

@@ -10,11 +10,14 @@ Development/CI/demo environments without a GPU must keep working, so import/init
 from __future__ import annotations
 import contextlib
+import logging
 from datetime import datetime
 from typing import Any
 from .model import GPUSample, ProcSample, Snapshot
+logger = logging.getLogger(__name__)
 class NVMLNotAvailableError(RuntimeError):
     """pynvml is not installed or NVML initialization failed. Also used as the user-facing message."""
@@ -59,6 +62,7 @@ class NVMLTier:
     def __init__(self) -> None:
         self._nvml: Any | None = None  # pynvml ModuleType
         self._initialized = False
+        self._process_list_warning_uuids: set[str] = set()
     def __enter__(self) -> NVMLTier:
         return self
@@ -97,7 +101,15 @@ class NVMLTier:
             # Clear only this card's process list and continue.
             try:
                 running = nvml.nvmlDeviceGetComputeRunningProcesses(h)
-            except nvml.NVMLError:
+            except nvml.NVMLError as e:
+                if uuid not in self._process_list_warning_uuids:
+                    logger.warning(
+                        "NVML process list unavailable for %s; idle-held classification "
+                        "may be understated: %s",
+                        uuid,
+                        e,
+                    )
+                    self._process_list_warning_uuids.add(uuid)
                 running = []
             for p in running:
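
The `_process_list_warning_uuids` set added above is a warn-once-per-key pattern: the collector ticks repeatedly, so an unguarded warning would repeat on every tick for the same card. A standalone sketch of the pattern follows. The class name, logger name, and UUID strings are made up for illustration.

```python
import logging

logger = logging.getLogger("warn_once_demo")


class OncePerKeyWarner:
    """Log a warning the first time a key is seen and stay quiet afterwards."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def warn(self, key: str, message: str, *args: object) -> bool:
        """Return True if a warning was emitted for this call."""
        if key in self._seen:
            return False
        logger.warning(message, *args)
        self._seen.add(key)
        return True


warner = OncePerKeyWarner()
# Simulate three ticks where the same hypothetical GPU keeps failing.
emitted = [
    warner.warn("GPU-aaaa", "process list unavailable for %s", "GPU-aaaa")
    for _ in range(3)
]
other = warner.warn("GPU-bbbb", "process list unavailable for %s", "GPU-bbbb")
```

Only the first tick logs for `GPU-aaaa`, and a different card still gets its own first warning. Because the set lives on the instance, the warning would repeat after a restart, which seems to fit a long-running collector.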
