kernelmeter 0.2.0.tar.gz → 0.3.0.tar.gz

Files changed (31)

{kernelmeter-0.2.0/src/kernelmeter.egg-info → kernelmeter-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kernelmeter
-Version: 0.2.0
+Version: 0.3.0
 Summary: Query every CUDA device attribute without profiling a kernel, and benchmark your kernels against the hardware's speed of light.
 Author: nuemaan
 License: MIT
@@ -24,6 +24,11 @@ Dynamic: license-file
 # kernelmeter
+[![PyPI](https://img.shields.io/pypi/v/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![CI](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml/badge.svg)](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml)
+[![Python](https://img.shields.io/pypi/pyversions/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 Small tools for one question: **is my GPU kernel actually good, and if
 not, what exactly is holding it back?** All in one package with zero
 required dependencies.
@@ -36,7 +41,9 @@ required dependencies.
   reference, and scores it against the roofline: the best your card could
   possibly do for that kernel's mix of math and memory traffic. 240 GB/s
   means nothing on its own; "76% of attainable" tells you how much room
-  is left.
+  is left. While the kernel runs it also samples the real clocks, power
+  and temperature through NVML, and re-scores against the ceiling the
+  card actually held.
 * `kernelmeter roofline` draws your card's roofline in the terminal and
   shows where a kernel sits on it.
 * `kernelmeter occupancy` answers "why is my occupancy 50%?" from block
@@ -137,7 +144,7 @@ kernelmeter bench mybench.py
 ```text
 kernel                    median ms      GB/s   TFLOP/s  bound    %roof   vs ref  correct
 ------------------------------------------------------------------------------------------
-my_add                       3.3393     241.2         -    mem    75.3%    1.01x     PASS
+my_add                       3.2725     246.1         -    mem    76.9%    1.03x     PASS
 ```
 * **correct** - your output matched the reference. If this says FAIL,
@@ -151,8 +158,39 @@ my_add                       3.3393     241.2         -    mem    75.3%    1.01x
 Pass `flops_per_call` too and the roofline model places your kernel
 precisely; pass `peak_tflops=...` if your kernel runs on tensor cores so
-it gets judged against the right ceiling. Raw `%peak bw` and `%fp32`
-numbers are always in the `--json` output.
+it gets judged against the right ceiling (`kernelmeter info` prints the
+derived fp16/tf32 tensor peaks for your card). Raw `%peak bw` and
+`%fp32` numbers are always in the `--json` output.
+When NVML is available (it ships with the driver) a second table follows
+with what the card was doing during each measurement:
+```text
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+my_add                    1062/1590      5000    42C   53.1W      76.9%
+```
+`%roof@clk` is the same roofline score, but against the ceiling at the
+clocks the card actually held. If `%roof` looks bad but `%roof@clk` is
+high, your kernel is fine: the card is thermal or power limited, and no
+amount of kernel work will change that. A real example, cuBLAS fp32
+matmul on a 70 W T4:
+```text
+kernel                    median ms      GB/s   TFLOP/s  bound   %roof   vs ref  correct
+----------------------------------------------------------------------------------------
+fp32_matmul                 32.0354       6.3      4.29   comp   52.7%        -        -
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+fp32_matmul                877/1590      5000    46C   70.4W      95.5%
+```
+53% of peak looks like a kernel problem. The telemetry shows it is not:
+the card hit its 70 W power limit and dropped to 877 MHz, and at those
+clocks the kernel was at 95.5% of what the silicon could deliver. cuBLAS
+was never the problem.
 Timing uses CUDA events with warmup, and the L2 cache is flushed between
 iterations so small workloads can't fake huge bandwidth numbers from
@@ -194,6 +232,9 @@ The `o` is your kernel, the `x` is the ridge point. Left of the ridge,
 more FLOPs are free: the memory traffic is the bill you are paying
 anyway. That is the whole argument for kernel fusion, in one picture.
 No GPU around? `--peak-bw` and `--peak-tflops` let you draw any card.
+`--tensor` swaps in the fp16 tensor-core roof, which moves the ridge
+point far to the left; that picture explains why tensor-core kernels
+are almost always memory-bound.
 ## Why is my occupancy low?
@@ -280,15 +321,15 @@ whether your kernels are any good:
 ## Caveats
 * Theoretical peaks are computed from the max boost clock the driver
-  reports. Sustained clocks under load are lower; `kernelmeter ceiling`
-  measures what you can actually reach.
-* The derived compute peak is for plain FP32 on CUDA cores. For
-  tensor-core kernels pass `peak_tflops=...` to the benchmark decorator
-  so the roofline uses the right roof.
+  reports. Sustained clocks under load are lower; the telemetry table
+  and `kernelmeter ceiling` both show what you can actually reach.
+* The tensor-core peaks are dense rates with fp16 accumulate. GeForce
+  cards run tensor cores at half rate when accumulating in fp32, and
+  sparse rates are double; pass `peak_tflops=...` when those apply.
 * The occupancy command implements the standard calculator model. Real
   occupancy can differ (launch bounds, driver decisions); confirm with
   Nsight Compute when it matters.
-* Attribute names above id 121 are best-effort against the CUDA 12.x
+* Attribute names above id 143 are best-effort against the CUDA 12.x
   headers. Values are always read live from your driver. PRs that extend
   the name table are welcome.
@@ -304,6 +345,10 @@ them on plain GitHub runners. For an end-to-end check on a real GPU there
 is a [Modal](https://modal.com) script: `modal run scripts/modal_gpu_test.py`.
 The numbers in this README come from that script on a T4.
+Releases are tag-driven: bump the version in `pyproject.toml`, add a
+[CHANGELOG.md](CHANGELOG.md) entry, push a `v*` tag. CI tests, builds and
+publishes to PyPI through trusted publishing.
 ## License
 MIT

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/README.md RENAMED

@@ -1,5 +1,10 @@
 # kernelmeter
+[![PyPI](https://img.shields.io/pypi/v/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![CI](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml/badge.svg)](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml)
+[![Python](https://img.shields.io/pypi/pyversions/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 Small tools for one question: **is my GPU kernel actually good, and if
 not, what exactly is holding it back?** All in one package with zero
 required dependencies.
@@ -12,7 +17,9 @@ required dependencies.
   reference, and scores it against the roofline: the best your card could
   possibly do for that kernel's mix of math and memory traffic. 240 GB/s
   means nothing on its own; "76% of attainable" tells you how much room
-  is left.
+  is left. While the kernel runs it also samples the real clocks, power
+  and temperature through NVML, and re-scores against the ceiling the
+  card actually held.
 * `kernelmeter roofline` draws your card's roofline in the terminal and
   shows where a kernel sits on it.
 * `kernelmeter occupancy` answers "why is my occupancy 50%?" from block
@@ -113,7 +120,7 @@ kernelmeter bench mybench.py
 ```text
 kernel                    median ms      GB/s   TFLOP/s  bound    %roof   vs ref  correct
 ------------------------------------------------------------------------------------------
-my_add                       3.3393     241.2         -    mem    75.3%    1.01x     PASS
+my_add                       3.2725     246.1         -    mem    76.9%    1.03x     PASS
 ```
 * **correct** - your output matched the reference. If this says FAIL,
@@ -127,8 +134,39 @@ my_add                       3.3393     241.2         -    mem    75.3%    1.01x
 Pass `flops_per_call` too and the roofline model places your kernel
 precisely; pass `peak_tflops=...` if your kernel runs on tensor cores so
-it gets judged against the right ceiling. Raw `%peak bw` and `%fp32`
-numbers are always in the `--json` output.
+it gets judged against the right ceiling (`kernelmeter info` prints the
+derived fp16/tf32 tensor peaks for your card). Raw `%peak bw` and
+`%fp32` numbers are always in the `--json` output.
+When NVML is available (it ships with the driver) a second table follows
+with what the card was doing during each measurement:
+```text
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+my_add                    1062/1590      5000    42C   53.1W      76.9%
+```
+`%roof@clk` is the same roofline score, but against the ceiling at the
+clocks the card actually held. If `%roof` looks bad but `%roof@clk` is
+high, your kernel is fine: the card is thermal or power limited, and no
+amount of kernel work will change that. A real example, cuBLAS fp32
+matmul on a 70 W T4:
+```text
+kernel                    median ms      GB/s   TFLOP/s  bound   %roof   vs ref  correct
+----------------------------------------------------------------------------------------
+fp32_matmul                 32.0354       6.3      4.29   comp   52.7%        -        -
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+fp32_matmul                877/1590      5000    46C   70.4W      95.5%
+```
+53% of peak looks like a kernel problem. The telemetry shows it is not:
+the card hit its 70 W power limit and dropped to 877 MHz, and at those
+clocks the kernel was at 95.5% of what the silicon could deliver. cuBLAS
+was never the problem.
 Timing uses CUDA events with warmup, and the L2 cache is flushed between
 iterations so small workloads can't fake huge bandwidth numbers from
@@ -170,6 +208,9 @@ The `o` is your kernel, the `x` is the ridge point. Left of the ridge,
 more FLOPs are free: the memory traffic is the bill you are paying
 anyway. That is the whole argument for kernel fusion, in one picture.
 No GPU around? `--peak-bw` and `--peak-tflops` let you draw any card.
+`--tensor` swaps in the fp16 tensor-core roof, which moves the ridge
+point far to the left; that picture explains why tensor-core kernels
+are almost always memory-bound.
 ## Why is my occupancy low?
@@ -256,15 +297,15 @@ whether your kernels are any good:
 ## Caveats
 * Theoretical peaks are computed from the max boost clock the driver
-  reports. Sustained clocks under load are lower; `kernelmeter ceiling`
-  measures what you can actually reach.
-* The derived compute peak is for plain FP32 on CUDA cores. For
-  tensor-core kernels pass `peak_tflops=...` to the benchmark decorator
-  so the roofline uses the right roof.
+  reports. Sustained clocks under load are lower; the telemetry table
+  and `kernelmeter ceiling` both show what you can actually reach.
+* The tensor-core peaks are dense rates with fp16 accumulate. GeForce
+  cards run tensor cores at half rate when accumulating in fp32, and
+  sparse rates are double; pass `peak_tflops=...` when those apply.
 * The occupancy command implements the standard calculator model. Real
   occupancy can differ (launch bounds, driver decisions); confirm with
   Nsight Compute when it matters.
-* Attribute names above id 121 are best-effort against the CUDA 12.x
+* Attribute names above id 143 are best-effort against the CUDA 12.x
   headers. Values are always read live from your driver. PRs that extend
   the name table are welcome.
@@ -280,6 +321,10 @@ them on plain GitHub runners. For an end-to-end check on a real GPU there
 is a [Modal](https://modal.com) script: `modal run scripts/modal_gpu_test.py`.
 The numbers in this README come from that script on a T4.
+Releases are tag-driven: bump the version in `pyproject.toml`, add a
+[CHANGELOG.md](CHANGELOG.md) entry, push a `v*` tag. CI tests, builds and
+publishes to PyPI through trusted publishing.
 ## License
 MIT
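
The T4 matmul example in the README hunks above can be checked by hand. This is a minimal sketch, assuming a compute-bound kernel whose ceiling scales linearly with the SM clock (the same scaling `sustained_peaks` in `bench.py` applies). The helper name `rescore_at_clock` is hypothetical and not part of kernelmeter.

```python
def rescore_at_clock(pct_roof: float, held_mhz: float, max_mhz: float) -> float:
    """Re-express a roofline score against a ceiling scaled to the held clock.

    Hypothetical helper: assumes the roof is proportional to the SM clock,
    which only holds for compute-bound kernels.
    """
    if held_mhz <= 0 or max_mhz <= 0:
        raise ValueError("clocks must be positive")
    # The ceiling shrinks by held/max, so the score grows by max/held.
    return pct_roof * max_mhz / held_mhz


# fp32_matmul row: 52.7% of the boost-clock roof, 877 of 1590 MHz held.
print(f"{rescore_at_clock(52.7, 877, 1590):.1f}%")
```

With the figures from the telemetry table this lands on the 95.5% that the README reports as `%roof@clk`.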

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "kernelmeter"
-version = "0.2.0"
+version = "0.3.0"
 description = "Query every CUDA device attribute without profiling a kernel, and benchmark your kernels against the hardware's speed of light."
 readme = "README.md"
 license = { text = "MIT" }

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter/__init__.py RENAMED

@@ -7,7 +7,7 @@ from . import occupancy, roofline
 from .occupancy import Occupancy
 from .peaks import Peaks
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 __all__ = [
     "BenchResult",

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter/attrs.py RENAMED

@@ -132,6 +132,28 @@ KNOWN_ATTRS: dict[int, str] = {
     119: "mempool_supported_handle_types",
     120: "cluster_launch",
     121: "deferred_mapping_cuda_array_supported",
+    122: "can_use_64_bit_stream_mem_ops",
+    123: "can_use_stream_wait_value_nor",
+    124: "dma_buf_supported",
+    125: "ipc_event_supported",
+    126: "mem_sync_domain_count",
+    127: "tensor_map_access_supported",
+    128: "handle_type_fabric_supported",
+    129: "unified_function_pointers",
+    130: "numa_config",
+    131: "numa_id",
+    132: "multicast_supported",
+    133: "mps_enabled",
+    134: "host_numa_id",
+    135: "d3d12_cig_supported",
+    136: "mem_decompress_algorithm_mask",
+    137: "mem_decompress_maximum_length",
+    138: "vulkan_cig_supported",
+    139: "gpu_pci_device_id",
+    140: "gpu_pci_subsystem_id",
+    141: "host_numa_virtual_memory_management_supported",
+    142: "host_numa_memory_pools_supported",
+    143: "host_numa_multinode_ipc_supported",
 }

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter/bench.py RENAMED

@@ -46,8 +46,14 @@ class BenchResult:
     intensity: float | None = None
     bound: str | None = None
     pct_roofline: float | None = None
+    pct_roof_sustained: float | None = None
     pct_peak_bw: float | None = None
     pct_peak_fp32: float | None = None
+    sm_clock_mhz: float | None = None
+    max_sm_clock_mhz: int | None = None
+    mem_clock_mhz: float | None = None
+    temperature_c: int | None = None
+    power_w: float | None = None
     ref_ms_median: float | None = None
     speedup_vs_ref: float | None = None
     correct: bool | None = None
@@ -159,6 +165,21 @@ def roofline_score(
     return None, None, None
+def sustained_peaks(peaks: _peaks.Peaks, telemetry, peak_tflops_override: float | None = None) -> _peaks.Peaks:
+    """Scale the theoretical peaks down to the clocks the card actually
+    held while the kernel ran."""
+    tf = peak_tflops_override or peaks.fp32_tflops
+    return _peaks.Peaks(
+        mem_bandwidth_gbs=(
+            peaks.mem_bandwidth_gbs * telemetry.mem_clock_fraction
+            if peaks.mem_bandwidth_gbs
+            else None
+        ),
+        fp32_tflops=tf * telemetry.sm_clock_fraction if tf else None,
+        compute_capability=peaks.compute_capability,
+    )
 def diff_results(baseline: list[dict], results: list["BenchResult"], threshold_pct: float = 5.0):
     """Compare a run against a saved baseline. Returns (rows, regressions)
     where rows are (name, old_ms, new_ms, delta_pct) and regressions lists
@@ -238,7 +259,25 @@ def run(spec: BenchSpec, peaks: _peaks.Peaks | None = None, flush_l2: bool = Tru
     if spec.ref is not None:
         correct, max_err = _check_correctness(spec, args)
+    monitor = None
+    try:
+        from . import nvml as _nvml
+        monitor = _nvml.Monitor()
+        monitor.start()
+    except Exception:
+        monitor = None
     times = _time_fn(spec.fn, args, spec.warmup, spec.iters, flush_l2)
+    telemetry = None
+    if monitor is not None:
+        try:
+            telemetry = monitor.stop()
+            monitor.close()
+        except Exception:
+            telemetry = None
     ms_mean, ms_median, ms_min = summarize_times(times)
     nbytes = _resolve(spec.bytes_per_call, args)
@@ -249,6 +288,11 @@ def run(spec: BenchSpec, peaks: _peaks.Peaks | None = None, flush_l2: bool = Tru
         nbytes, nflops, gbps, tflops, peaks, spec.peak_tflops
     )
+    pct_sustained = None
+    if telemetry is not None:
+        scaled = sustained_peaks(peaks, telemetry, spec.peak_tflops)
+        pct_sustained = roofline_score(nbytes, nflops, gbps, tflops, scaled)[2]
     ref_ms = speedup = None
     if spec.ref is not None:
         ref_times = _time_fn(spec.ref, args, spec.warmup, spec.iters, flush_l2)
@@ -265,8 +309,14 @@ def run(spec: BenchSpec, peaks: _peaks.Peaks | None = None, flush_l2: bool = Tru
         intensity=ai,
         bound=kernel_bound,
         pct_roofline=pct_roof,
+        pct_roof_sustained=pct_sustained,
         pct_peak_bw=pct_of_peak(gbps, peaks.mem_bandwidth_gbs) if gbps else None,
         pct_peak_fp32=pct_of_peak(tflops, peaks.fp32_tflops) if tflops else None,
+        sm_clock_mhz=telemetry.sm_clock_mhz if telemetry else None,
+        max_sm_clock_mhz=telemetry.max_sm_clock_mhz if telemetry else None,
+        mem_clock_mhz=telemetry.mem_clock_mhz if telemetry else None,
+        temperature_c=telemetry.temperature_c if telemetry else None,
+        power_w=telemetry.power_w if telemetry else None,
         ref_ms_median=ref_ms,
         speedup_vs_ref=speedup,
         correct=correct,
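
How the new `sustained_peaks` rescales the roof can be sketched with stand-in types. `FakePeaks`, `FakeTelemetry` and `scale_peaks` are hypothetical simplifications of the classes in this diff. The T4 figures (320 GB/s, 8.14 TFLOP/s fp32) are assumed spec values, and the memory clock fraction of 1.0 is an assumption too, since the telemetry table prints only the held memory clock.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FakePeaks:  # stand-in for kernelmeter's Peaks, reduced to two fields
    mem_bandwidth_gbs: float | None
    fp32_tflops: float | None


@dataclass
class FakeTelemetry:  # stand-in carrying only the two clock fractions
    sm_clock_fraction: float
    mem_clock_fraction: float


def scale_peaks(peaks, telemetry, peak_tflops_override=None):
    """Scale bandwidth by the memory clock fraction, compute by the SM one."""
    tf = peak_tflops_override or peaks.fp32_tflops
    return FakePeaks(
        mem_bandwidth_gbs=(
            peaks.mem_bandwidth_gbs * telemetry.mem_clock_fraction
            if peaks.mem_bandwidth_gbs
            else None
        ),
        fp32_tflops=tf * telemetry.sm_clock_fraction if tf else None,
    )


# Clocks from the fp32_matmul telemetry row: 877 of 1590 MHz held.
held = FakeTelemetry(sm_clock_fraction=877 / 1590, mem_clock_fraction=1.0)
scaled = scale_peaks(FakePeaks(320.0, 8.14), held)
print(f"{scaled.fp32_tflops:.2f} TFLOP/s, {scaled.mem_bandwidth_gbs:.0f} GB/s")
```

Under these assumptions the compute roof drops to roughly 4.49 TFLOP/s, and the measured 4.29 TFLOP/s from the README example is then about 95.5% of it.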

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter/cli.py RENAMED

@@ -30,6 +30,28 @@ def _device_attrs(ordinal: int = 0) -> dict[str, int]:
     return _attrs.query_all(driver, driver.device(ordinal))
+def _print_live_telemetry(ordinal: int) -> None:
+    """Current clocks/temp/power via NVML; quietly skipped when absent."""
+    try:
+        from . import nvml as _nvml
+        n = _nvml.Nvml()
+    except Exception:
+        return
+    try:
+        h = n.device(ordinal)
+        print(
+            f"  live: sm {n.sm_clock_mhz(h)}/{n.max_sm_clock_mhz(h)} MHz, "
+            f"mem {n.mem_clock_mhz(h)}/{n.max_mem_clock_mhz(h)} MHz, "
+            f"{n.temperature_c(h)}C, "
+            f"{n.power_w(h):.1f}/{n.power_limit_w(h):.0f}W"
+        )
+    except Exception:
+        pass
+    finally:
+        n.close()
 # ---------------------------------------------------------------------------
 # info
 # ---------------------------------------------------------------------------
@@ -79,6 +101,17 @@ def cmd_info(args: argparse.Namespace) -> int:
             "  theoretical FP32 peak     : "
             + _fmt(derived["theoretical_fp32_tflops"], " TFLOP/s", nd=2)
         )
+        if derived.get("theoretical_fp16_tensor_tflops"):
+            print(
+                "  theoretical fp16 tensor   : "
+                + _fmt(derived["theoretical_fp16_tensor_tflops"], " TFLOP/s (dense)", nd=2)
+            )
+        if derived.get("theoretical_tf32_tensor_tflops"):
+            print(
+                "  theoretical tf32 tensor   : "
+                + _fmt(derived["theoretical_tf32_tensor_tflops"], " TFLOP/s (dense)", nd=2)
+            )
+        _print_live_telemetry(dev["ordinal"])
         print(f"\n  {'attribute':<48} value")
         print(f"  {'-' * 48} {'-' * 12}")
         for name, value in dev["attributes"].items():
@@ -137,6 +170,27 @@ def cmd_bench(args: argparse.Namespace) -> int:
                 f"{_fmt(r.pct_roofline, '%'):>7} {speedup:>8} {correct:>8}"
             )
+        if any(r.sm_clock_mhz for r in results):
+            print()
+            theader = (
+                f"{'telemetry':<24} {'sm MHz':>10} {'mem MHz':>9} "
+                f"{'temp':>6} {'power':>7} {'%roof@clk':>10}"
+            )
+            print(theader)
+            print("-" * len(theader))
+            for r in results:
+                if not r.sm_clock_mhz:
+                    continue
+                print(
+                    f"{r.name:<24} {r.sm_clock_mhz:>5.0f}/{r.max_sm_clock_mhz:<4} "
+                    f"{_fmt(r.mem_clock_mhz, nd=0):>9} {r.temperature_c or '-':>5}C "
+                    f"{_fmt(r.power_w, 'W'):>7} {_fmt(r.pct_roof_sustained, '%'):>10}"
+                )
+            print(
+                "%roof@clk scores against the ceiling at the clocks the card "
+                "actually held during the run"
+            )
     ok = all(r.error is None and r.correct in (None, True) for r in results)
     if args.compare:
@@ -169,7 +223,17 @@ def cmd_roofline(args: argparse.Namespace) -> int:
             name = dev.name
             peaks = _peaks.derive(_attrs.query_all(driver, dev))
             peak_bw = peak_bw or peaks.mem_bandwidth_gbs
-            peak_tf = peak_tf or peaks.fp32_tflops
+            if args.tensor:
+                if peaks.fp16_tensor_tflops is None:
+                    print(
+                        "error: no tensor-core rate known for this card; "
+                        "pass --peak-tflops instead.",
+                        file=sys.stderr,
+                    )
+                    return 1
+                peak_tf = peak_tf or peaks.fp16_tensor_tflops
+            else:
+                peak_tf = peak_tf or peaks.fp32_tflops
         except CudaNotAvailableError:
             pass
     if not peak_bw or not peak_tf:
@@ -183,8 +247,9 @@ def cmd_roofline(args: argparse.Namespace) -> int:
     ridge = _roofline.ridge_point(peak_tf, peak_bw)
     if name:
         print(f"Device {args.device}: {name}")
+    roof_kind = "fp16 tensor" if args.tensor else "fp32"
     print(f"  peak bandwidth : {peak_bw:.1f} GB/s")
-    print(f"  peak compute   : {peak_tf:.2f} TFLOP/s")
+    print(f"  peak compute   : {peak_tf:.2f} TFLOP/s ({roof_kind})")
     print(f"  ridge point    : {ridge:.1f} flop/byte\n")
     for line in _roofline.render(peak_tf, peak_bw, ai=args.ai):
         print(line)
@@ -278,6 +343,7 @@ def main(argv: list[str] | None = None) -> int:
     p_roof = sub.add_parser("roofline", help="draw the device roofline")
     p_roof.add_argument("--ai", type=float, help="mark a kernel at this arithmetic intensity")
     p_roof.add_argument("--device", type=int, default=0)
+    p_roof.add_argument("--tensor", action="store_true", help="use the fp16 tensor-core roof")
     p_roof.add_argument("--peak-bw", type=float, help="override bandwidth in GB/s")
     p_roof.add_argument("--peak-tflops", type=float, help="override compute in TFLOP/s")
     p_roof.set_defaults(func=cmd_roofline)
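
The effect of `--tensor` on the ridge point can be worked through numerically. The body of `_roofline.ridge_point` is not shown in this diff, so the sketch below assumes the usual definition (peak compute divided by peak bandwidth). The function name `ridge_flop_per_byte` is hypothetical and the T4 peaks are assumed spec values.

```python
def ridge_flop_per_byte(peak_tflops: float, peak_bw_gbs: float) -> float:
    """Arithmetic intensity at which the memory and compute roofs meet."""
    if peak_tflops <= 0 or peak_bw_gbs <= 0:
        raise ValueError("peaks must be positive")
    # Convert TFLOP/s and GB/s to base units before dividing.
    return (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)


# Assumed T4 figures: 320 GB/s, 8.14 TFLOP/s fp32, 65.13 TFLOP/s fp16 tensor.
fp32_ridge = ridge_flop_per_byte(8.14, 320.0)
tensor_ridge = ridge_flop_per_byte(65.13, 320.0)
print(f"fp32 ridge: {fp32_ridge:.1f} flop/byte")
print(f"fp16 tensor ridge: {tensor_ridge:.1f} flop/byte")
```

With these inputs the ridge rises from about 25 to about 204 flop/byte. A kernel has to do roughly eight times more math per byte before it stops being memory-bound against the tensor roof.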

kernelmeter-0.3.0/src/kernelmeter/nvml.py ADDED

@@ -0,0 +1,172 @@
+"""Live telemetry through NVML (the library behind nvidia-smi).
+libnvidia-ml ships with the driver, so this costs no extra dependency.
+The point: theoretical peaks assume the max boost clock, but cards
+downclock under load. Sampling the actual SM and memory clocks while a
+kernel runs lets the bench report what the ceiling really was during the
+measurement, not what the spec sheet promised.
+"""
+from __future__ import annotations
+import ctypes
+import statistics
+import sys
+import threading
+from dataclasses import dataclass
+NVML_SUCCESS = 0
+NVML_CLOCK_SM = 1
+NVML_CLOCK_MEM = 2
+NVML_TEMPERATURE_GPU = 0
+class NvmlError(RuntimeError):
+    def __init__(self, func: str, code: int):
+        super().__init__(f"{func} failed with NVML code {code}")
+class NvmlNotAvailableError(RuntimeError):
+    pass
+def load_library() -> ctypes.CDLL:
+    if sys.platform == "darwin":
+        raise NvmlNotAvailableError("NVML is not available on macOS")
+    names = ("nvml.dll",) if sys.platform == "win32" else ("libnvidia-ml.so.1", "libnvidia-ml.so")
+    for name in names:
+        try:
+            if sys.platform == "win32":
+                return ctypes.WinDLL(name)  # pragma: no cover
+            return ctypes.CDLL(name)
+        except OSError:
+            continue
+    raise NvmlNotAvailableError("could not load libnvidia-ml; is the NVIDIA driver installed?")
+class Nvml:
+    """Minimal wrapper. Like cudadrv.Driver, the lib is injectable so the
+    tests can run on machines with no NVIDIA driver."""
+    def __init__(self, lib=None):
+        self._lib = lib if lib is not None else load_library()
+        self._check("nvmlInit_v2", self._lib.nvmlInit_v2())
+    def _check(self, func: str, code: int) -> None:
+        if code != NVML_SUCCESS:
+            raise NvmlError(func, code)
+    def close(self) -> None:
+        self._lib.nvmlShutdown()
+    def device(self, index: int = 0) -> ctypes.c_void_p:
+        handle = ctypes.c_void_p()
+        self._check(
+            "nvmlDeviceGetHandleByIndex",
+            self._lib.nvmlDeviceGetHandleByIndex_v2(index, ctypes.byref(handle)),
+        )
+        return handle
+    def _uint_query(self, func_name: str, handle, *args) -> int:
+        out = ctypes.c_uint(0)
+        fn = getattr(self._lib, func_name)
+        self._check(func_name, fn(handle, *args, ctypes.byref(out)))
+        return out.value
+    def sm_clock_mhz(self, handle) -> int:
+        return self._uint_query("nvmlDeviceGetClockInfo", handle, NVML_CLOCK_SM)
+    def mem_clock_mhz(self, handle) -> int:
+        return self._uint_query("nvmlDeviceGetClockInfo", handle, NVML_CLOCK_MEM)
+    def max_sm_clock_mhz(self, handle) -> int:
+        return self._uint_query("nvmlDeviceGetMaxClockInfo", handle, NVML_CLOCK_SM)
+    def max_mem_clock_mhz(self, handle) -> int:
+        return self._uint_query("nvmlDeviceGetMaxClockInfo", handle, NVML_CLOCK_MEM)
+    def temperature_c(self, handle) -> int:
+        return self._uint_query("nvmlDeviceGetTemperature", handle, NVML_TEMPERATURE_GPU)
+    def power_w(self, handle) -> float:
+        return self._uint_query("nvmlDeviceGetPowerUsage", handle) / 1000.0
+    def power_limit_w(self, handle) -> float:
+        return self._uint_query("nvmlDeviceGetEnforcedPowerLimit", handle) / 1000.0
+@dataclass
+class Telemetry:
+    sm_clock_mhz: float
+    mem_clock_mhz: float
+    max_sm_clock_mhz: int
+    max_mem_clock_mhz: int
+    temperature_c: int
+    power_w: float
+    @property
+    def sm_clock_fraction(self) -> float:
+        return self.sm_clock_mhz / self.max_sm_clock_mhz if self.max_sm_clock_mhz else 1.0
+    @property
+    def mem_clock_fraction(self) -> float:
+        return self.mem_clock_mhz / self.max_mem_clock_mhz if self.max_mem_clock_mhz else 1.0
+def summarize_samples(
+    sm: list[int], mem: list[int], temp: list[int], power: list[float],
+    max_sm: int, max_mem: int,
+) -> Telemetry:
+    return Telemetry(
+        sm_clock_mhz=statistics.fmean(sm),
+        mem_clock_mhz=statistics.fmean(mem),
+        max_sm_clock_mhz=max_sm,
+        max_mem_clock_mhz=max_mem,
+        temperature_c=max(temp),
+        power_w=statistics.fmean(power),
+    )
+class Monitor:
+    """Samples clocks/temperature/power on a background thread while a
+    kernel benchmark runs in the main thread."""
+    def __init__(self, device_index: int = 0, interval_s: float = 0.02, nvml: Nvml | None = None):
+        self._nvml = nvml if nvml is not None else Nvml()
+        self._handle = self._nvml.device(device_index)
+        self._interval = interval_s
+        self._stop = threading.Event()
+        self._thread: threading.Thread | None = None
+        self._sm: list[int] = []
+        self._mem: list[int] = []
+        self._temp: list[int] = []
+        self._power: list[float] = []
+    def _sample(self) -> None:
+        self._sm.append(self._nvml.sm_clock_mhz(self._handle))
+        self._mem.append(self._nvml.mem_clock_mhz(self._handle))
+        self._temp.append(self._nvml.temperature_c(self._handle))
+        self._power.append(self._nvml.power_w(self._handle))
+    def _loop(self) -> None:
+        while not self._stop.wait(self._interval):
+            self._sample()
+    def start(self) -> None:
+        self._stop.clear()
+        self._sample()  # always have at least one sample
+        self._thread = threading.Thread(target=self._loop, daemon=True)
+        self._thread.start()
+    def stop(self) -> Telemetry:
+        self._stop.set()
+        if self._thread is not None:
+            self._thread.join()
+        return summarize_samples(
+            self._sm, self._mem, self._temp, self._power,
+            self._nvml.max_sm_clock_mhz(self._handle),
+            self._nvml.max_mem_clock_mhz(self._handle),
+        )
+    def close(self) -> None:
+        self._nvml.close()
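
The sampling pattern `Monitor` uses (one eager sample, then a background loop gated on `threading.Event.wait`) can be tried without an NVIDIA driver. The `Sampler` class and its `read` callable are hypothetical stand-ins for `Monitor` and the NVML queries. This is a sketch of the pattern, not the package's code.

```python
import statistics
import threading
import time


class Sampler:
    """Polls `read()` on a background thread until stopped."""

    def __init__(self, read, interval_s: float = 0.005):
        self._read = read
        self._interval = interval_s
        self._stop = threading.Event()
        self._samples: list[float] = []
        self._thread: threading.Thread | None = None

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() sets it,
        # so the loop sleeps and checks for shutdown in a single call.
        while not self._stop.wait(self._interval):
            self._samples.append(self._read())

    def start(self) -> None:
        self._stop.clear()
        self._samples.append(self._read())  # guarantee at least one sample
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop(self) -> float:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        return statistics.fmean(self._samples)


# A constant fake sensor stands in for the SM clock query.
sampler = Sampler(read=lambda: 877)
sampler.start()
time.sleep(0.03)  # the "kernel" runs here in the main thread
mean_mhz = sampler.stop()
print(mean_mhz)
```

Taking the first sample eagerly in `start()` keeps the mean defined even when the timed region finishes before the first polling interval elapses.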

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter/peaks.py RENAMED

@@ -32,6 +32,41 @@ def fp32_cores_per_sm(major: int, minor: int) -> int:
     return 64
+# Dense tensor-core FLOPs per SM per clock, fp16 inputs with fp16
+# accumulate. Derived from published board specs (V100 125 TF, T4 65 TF,
+# A100 312 TF, 4090 330 TF, H100 SXM 989 TF, ...), which all divide out
+# to clean powers of two per SM per clock. GeForce parts run at half
+# this rate when accumulating in fp32.
+_FP16_TENSOR_FLOPS_PER_SM: dict[tuple[int, int], int] = {
+    (7, 0): 1024, (7, 2): 1024, (7, 5): 1024,
+    (8, 0): 2048, (8, 6): 1024, (8, 7): 1024, (8, 9): 1024,
+    (9, 0): 4096,
+    (12, 0): 1024, (12, 1): 1024,
+}
+# Same idea for tf32 (only exists on Ampere and newer).
+_TF32_TENSOR_FLOPS_PER_SM: dict[tuple[int, int], int] = {
+    (8, 0): 1024, (8, 6): 256, (8, 7): 256, (8, 9): 256,
+    (9, 0): 2048,
+    (12, 0): 256, (12, 1): 256,
+}
+def _tensor_tflops(table: dict, sm_count: int, clock_khz: int, major: int, minor: int) -> float | None:
+    rate = table.get((major, minor))
+    if rate is None:
+        return None
+    return rate * sm_count * clock_khz * 1e3 / 1e12
+def fp16_tensor_tflops(sm_count: int, clock_khz: int, major: int, minor: int) -> float | None:
+    return _tensor_tflops(_FP16_TENSOR_FLOPS_PER_SM, sm_count, clock_khz, major, minor)
+def tf32_tensor_tflops(sm_count: int, clock_khz: int, major: int, minor: int) -> float | None:
+    return _tensor_tflops(_TF32_TENSOR_FLOPS_PER_SM, sm_count, clock_khz, major, minor)
 @dataclass
 class Peaks:
     """Theoretical per-device ceilings derived from driver attributes."""
@@ -39,11 +74,15 @@ class Peaks:
     mem_bandwidth_gbs: float | None
     fp32_tflops: float | None
     compute_capability: tuple[int, int] | None
+    fp16_tensor_tflops: float | None = None
+    tf32_tensor_tflops: float | None = None
     def as_dict(self) -> dict:
         return {
             "theoretical_mem_bandwidth_gb_s": self.mem_bandwidth_gbs,
             "theoretical_fp32_tflops": self.fp32_tflops,
+            "theoretical_fp16_tensor_tflops": self.fp16_tensor_tflops,
+            "theoretical_tf32_tensor_tflops": self.tf32_tensor_tflops,
             "compute_capability": (
                 f"{self.compute_capability[0]}.{self.compute_capability[1]}"
                 if self.compute_capability
@@ -72,12 +111,19 @@ def derive(attrs: dict[str, int]) -> Peaks:
         )
     cc = None
-    flops = None
+    flops = fp16 = tf32 = None
     if "compute_capability_major" in attrs and "compute_capability_minor" in attrs:
         cc = (attrs["compute_capability_major"], attrs["compute_capability_minor"])
         if "multiprocessor_count" in attrs and "clock_rate_khz" in attrs:
-            flops = fp32_tflops(
-                attrs["multiprocessor_count"], attrs["clock_rate_khz"], cc[0], cc[1]
-            )
-    return Peaks(mem_bandwidth_gbs=bw, fp32_tflops=flops, compute_capability=cc)
+            sm, clk = attrs["multiprocessor_count"], attrs["clock_rate_khz"]
+            flops = fp32_tflops(sm, clk, cc[0], cc[1])
+            fp16 = fp16_tensor_tflops(sm, clk, cc[0], cc[1])
+            tf32 = tf32_tensor_tflops(sm, clk, cc[0], cc[1])
+    return Peaks(
+        mem_bandwidth_gbs=bw,
+        fp32_tflops=flops,
+        compute_capability=cc,
+        fp16_tensor_tflops=fp16,
+        tf32_tensor_tflops=tf32,
+    )
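
The arithmetic behind `_tensor_tflops` above can be checked by hand. The sketch below is a standalone illustration, not the package's code: the rates are copied from the `_TF32_TENSOR_FLOPS_PER_SM` table in this diff, the SM counts and clocks come from the test comments further down, and the names `TF32_RATE` and `tensor_tflops` are hypothetical.

```python
# Per-SM, per-clock TF32 rates copied from the diff's table
# (hypothetical name; only a subset of the entries is reproduced here).
TF32_RATE = {(8, 0): 1024, (8, 6): 256, (9, 0): 2048}


def tensor_tflops(rate: int, sm_count: int, clock_khz: int) -> float:
    """FLOPs per SM per clock * SM count * clock in Hz, expressed in TFLOP/s."""
    return rate * sm_count * clock_khz * 1e3 / 1e12


# A100: 108 SMs at 1410 MHz, compute capability 8.0
a100_tf32 = tensor_tflops(TF32_RATE[(8, 0)], 108, 1_410_000)

# RTX 3090: 82 SMs at 1695 MHz, compute capability 8.6
rtx3090_tf32 = tensor_tflops(TF32_RATE[(8, 6)], 82, 1_695_000)

print(f"A100 tf32:     {a100_tf32:.1f} TFLOP/s")
print(f"RTX 3090 tf32: {rtx3090_tf32:.1f} TFLOP/s")
```

Both values land within 1% of the figures asserted in `tests/test_tensor_peaks.py` (156 and 35.6 TFLOP/s).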

{kernelmeter-0.2.0 → kernelmeter-0.3.0/src/kernelmeter.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: kernelmeter
-Version: 0.2.0
+Version: 0.3.0
 Summary: Query every CUDA device attribute without profiling a kernel, and benchmark your kernels against the hardware's speed of light.
 Author: nuemaan
 License: MIT
@@ -24,6 +24,11 @@ Dynamic: license-file
 # kernelmeter
+[![PyPI](https://img.shields.io/pypi/v/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![CI](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml/badge.svg)](https://github.com/nuemaan/kernelmeter/actions/workflows/ci.yml)
+[![Python](https://img.shields.io/pypi/pyversions/kernelmeter)](https://pypi.org/project/kernelmeter/)
+[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 Small tools for one question: **is my GPU kernel actually good, and if
 not, what exactly is holding it back?** All in one package with zero
 required dependencies.
@@ -36,7 +41,9 @@ required dependencies.
   reference, and scores it against the roofline: the best your card could
   possibly do for that kernel's mix of math and memory traffic. 240 GB/s
   means nothing on its own; "76% of attainable" tells you how much room
-  is left.
+  is left. While the kernel runs it also samples the real clocks, power
+  and temperature through NVML, and re-scores against the ceiling the
+  card actually held.
 * `kernelmeter roofline` draws your card's roofline in the terminal and
   shows where a kernel sits on it.
 * `kernelmeter occupancy` answers "why is my occupancy 50%?" from block
@@ -137,7 +144,7 @@ kernelmeter bench mybench.py
 ```text
 kernel                    median ms      GB/s   TFLOP/s  bound    %roof   vs ref  correct
 ------------------------------------------------------------------------------------------
-my_add                       3.3393     241.2         -    mem    75.3%    1.01x     PASS
+my_add                       3.2725     246.1         -    mem    76.9%    1.03x     PASS
 ```
 * **correct** - your output matched the reference. If this says FAIL,
@@ -151,8 +158,39 @@ my_add                       3.3393     241.2         -    mem    75.3%    1.01x
 Pass `flops_per_call` too and the roofline model places your kernel
 precisely; pass `peak_tflops=...` if your kernel runs on tensor cores so
-it gets judged against the right ceiling. Raw `%peak bw` and `%fp32`
-numbers are always in the `--json` output.
+it gets judged against the right ceiling (`kernelmeter info` prints the
+derived fp16/tf32 tensor peaks for your card). Raw `%peak bw` and
+`%fp32` numbers are always in the `--json` output.
+When NVML is available (it ships with the driver) a second table follows
+with what the card was doing during each measurement:
+```text
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+my_add                    1062/1590      5000    42C   53.1W      76.9%
+```
+`%roof@clk` is the same roofline score, but against the ceiling at the
+clocks the card actually held. If `%roof` looks bad but `%roof@clk` is
+high, your kernel is fine: the card is thermal or power limited, and no
+amount of kernel work will change that. A real example, cuBLAS fp32
+matmul on a 70 W T4:
+```text
+kernel                    median ms      GB/s   TFLOP/s  bound   %roof   vs ref  correct
+----------------------------------------------------------------------------------------
+fp32_matmul                 32.0354       6.3      4.29   comp   52.7%        -        -
+telemetry                    sm MHz   mem MHz   temp   power  %roof@clk
+-----------------------------------------------------------------------
+fp32_matmul                877/1590      5000    46C   70.4W      95.5%
+```
+53% of peak looks like a kernel problem. The telemetry shows it is not:
+the card hit its 70 W power limit and dropped to 877 MHz, and at those
+clocks the kernel was at 95.5% of what the silicon could deliver. cuBLAS
+was never the problem.
 Timing uses CUDA events with warmup, and the L2 cache is flushed between
 iterations so small workloads can't fake huge bandwidth numbers from
@@ -194,6 +232,9 @@ The `o` is your kernel, the `x` is the ridge point. Left of the ridge,
 more FLOPs are free: the memory traffic is the bill you are paying
 anyway. That is the whole argument for kernel fusion, in one picture.
 No GPU around? `--peak-bw` and `--peak-tflops` let you draw any card.
+`--tensor` swaps in the fp16 tensor-core roof, which moves the ridge
+point far to the right; that picture explains why tensor-core kernels
+are almost always memory-bound.
 ## Why is my occupancy low?
@@ -280,15 +321,15 @@ whether your kernels are any good:
 ## Caveats
 * Theoretical peaks are computed from the max boost clock the driver
-  reports. Sustained clocks under load are lower; `kernelmeter ceiling`
-  measures what you can actually reach.
-* The derived compute peak is for plain FP32 on CUDA cores. For
-  tensor-core kernels pass `peak_tflops=...` to the benchmark decorator
-  so the roofline uses the right roof.
+  reports. Sustained clocks under load are lower; the telemetry table
+  and `kernelmeter ceiling` both show what you can actually reach.
+* The tensor-core peaks are dense rates with fp16 accumulate. GeForce
+  cards run tensor cores at half rate when accumulating in fp32, and
+  sparse rates are double; pass `peak_tflops=...` when those apply.
 * The occupancy command implements the standard calculator model. Real
   occupancy can differ (launch bounds, driver decisions); confirm with
   Nsight Compute when it matters.
-* Attribute names above id 121 are best-effort against the CUDA 12.x
+* Attribute names above id 143 are best-effort against the CUDA 12.x
   headers. Values are always read live from your driver. PRs that extend
   the name table are welcome.
@@ -304,6 +345,10 @@ them on plain GitHub runners. For an end-to-end check on a real GPU there
 is a [Modal](https://modal.com) script: `modal run scripts/modal_gpu_test.py`.
 The numbers in this README come from that script on a T4.
+Releases are tag-driven: bump the version in `pyproject.toml`, add a
+[CHANGELOG.md](CHANGELOG.md) entry, push a `v*` tag. CI tests, builds and
+publishes to PyPI through trusted publishing.
 ## License
 MIT
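
The roofline reasoning in the README text above can be reproduced in a few lines. This is a minimal sketch of the standard roofline model, not kernelmeter's implementation: the function names are hypothetical, and the T4 figures (320 GB/s, 40 SMs at 1590 MHz, 128 fp32 FLOPs and 1024 fp16 tensor FLOPs per SM per clock) are assumptions taken from the T4 spec and the test comments in this diff.

```python
def attainable_tflops(intensity: float, peak_tflops: float, peak_bw_gbs: float) -> float:
    """Roofline: the lower of the compute roof and the memory roof.

    intensity is arithmetic intensity in FLOP per byte of memory traffic.
    """
    memory_roof_tflops = intensity * peak_bw_gbs / 1e3  # GB/s * FLOP/B = GFLOP/s
    return min(peak_tflops, memory_roof_tflops)


def ridge_point(peak_tflops: float, peak_bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/B) at which the two roofs meet."""
    return peak_tflops * 1e3 / peak_bw_gbs


# Assumed T4 numbers (see lead-in).
T4_BW_GBS = 320.0
T4_FP32_TFLOPS = 40 * 128 * 1.59e9 / 1e12          # about 8.14
T4_FP16_TENSOR_TFLOPS = 40 * 1024 * 1.59e9 / 1e12  # about 65.13

fp32_ridge = ridge_point(T4_FP32_TFLOPS, T4_BW_GBS)             # about 25.4 FLOP/B
tensor_ridge = ridge_point(T4_FP16_TENSOR_TFLOPS, T4_BW_GBS)    # about 203.5 FLOP/B

# A kernel at 50 FLOP/B sits right of the fp32 ridge (compute-bound)
# but left of the tensor-core ridge (memory-bound).
on_fp32_roof = attainable_tflops(50.0, T4_FP32_TFLOPS, T4_BW_GBS)
on_tensor_roof = attainable_tflops(50.0, T4_FP16_TENSOR_TFLOPS, T4_BW_GBS)

print(f"ridge fp32: {fp32_ridge:.1f} FLOP/B, ridge tensor: {tensor_ridge:.1f} FLOP/B")
print(f"attainable at 50 FLOP/B: {on_fp32_roof:.2f} vs {on_tensor_roof:.2f} TFLOP/s")
```

With the same bandwidth and a roughly 8x higher compute roof, the kernel needs roughly 8x more FLOPs per byte before it stops being limited by memory traffic.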

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/src/kernelmeter.egg-info/SOURCES.txt RENAMED

@@ -7,6 +7,7 @@ src/kernelmeter/bench.py
 src/kernelmeter/ceiling.py
 src/kernelmeter/cli.py
 src/kernelmeter/cudadrv.py
+src/kernelmeter/nvml.py
 src/kernelmeter/occupancy.py
 src/kernelmeter/peaks.py
 src/kernelmeter/roofline.py
@@ -21,6 +22,8 @@ tests/test_bench_math.py
 tests/test_bench_roofline.py
 tests/test_cli.py
 tests/test_cli_new_commands.py
+tests/test_nvml.py
 tests/test_occupancy.py
 tests/test_peaks.py
-tests/test_roofline.py
+tests/test_roofline.py
+tests/test_tensor_peaks.py

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/tests/test_attrs.py RENAMED

@@ -24,6 +24,13 @@ def test_unknown_but_supported_ids_get_generic_names(fake_driver):
     assert result["attribute_150"] == 7
+def test_cuda12_range_names(fake_driver):
+    dev = fake_driver.device(0)
+    result = attrs.query_all(fake_driver, dev)
+    assert result["numa_id"] == -1
+    assert result["gpu_pci_device_id"] == 0x1EB810DE
 def test_device_metadata(fake_driver):
     dev = fake_driver.device(0)
     assert dev.name == "NVIDIA GeForce RTX 3090"

{kernelmeter-0.2.0 → kernelmeter-0.3.0}/tests/test_cli_new_commands.py RENAMED

@@ -20,6 +20,13 @@ def test_roofline_from_device(patched_driver, capsys):
     assert "*" in out  # the chart got drawn
+def test_roofline_tensor_roof(patched_driver, capsys):
+    assert cli.main(["roofline", "--tensor"]) == 0
+    out = capsys.readouterr().out
+    assert "fp16 tensor" in out
+    assert "142.33 TFLOP/s" in out
 def test_roofline_manual_peaks_need_no_device(monkeypatch, capsys):
     from kernelmeter.cudadrv import CudaNotAvailableError

kernelmeter-0.3.0/tests/test_nvml.py ADDED

@@ -0,0 +1,91 @@
+import time
+import pytest
+from kernelmeter import nvml
+NVML_SUCCESS = 0
+class FakeNvmlLib:
+    """Duck-types the libnvidia-ml entry points the wrapper uses.
+    Simulates a card boosting to 1590 MHz but holding 1530 under load."""
+    def __init__(self):
+        self.shutdown_called = False
+    def nvmlInit_v2(self):
+        return NVML_SUCCESS
+    def nvmlShutdown(self):
+        self.shutdown_called = True
+        return NVML_SUCCESS
+    def nvmlDeviceGetHandleByIndex_v2(self, index, ptr):
+        ptr._obj.value = 42
+        return NVML_SUCCESS
+    def nvmlDeviceGetClockInfo(self, handle, clock_type, ptr):
+        ptr._obj.value = 1530 if clock_type == nvml.NVML_CLOCK_SM else 4985
+        return NVML_SUCCESS
+    def nvmlDeviceGetMaxClockInfo(self, handle, clock_type, ptr):
+        ptr._obj.value = 1590 if clock_type == nvml.NVML_CLOCK_SM else 5001
+        return NVML_SUCCESS
+    def nvmlDeviceGetTemperature(self, handle, sensor, ptr):
+        ptr._obj.value = 63
+        return NVML_SUCCESS
+    def nvmlDeviceGetPowerUsage(self, handle, ptr):
+        ptr._obj.value = 45200  # milliwatts
+        return NVML_SUCCESS
+    def nvmlDeviceGetEnforcedPowerLimit(self, handle, ptr):
+        ptr._obj.value = 70000
+        return NVML_SUCCESS
+def test_wrapper_reads_values():
+    n = nvml.Nvml(lib=FakeNvmlLib())
+    h = n.device(0)
+    assert n.sm_clock_mhz(h) == 1530
+    assert n.max_sm_clock_mhz(h) == 1590
+    assert n.mem_clock_mhz(h) == 4985
+    assert n.temperature_c(h) == 63
+    assert n.power_w(h) == pytest.approx(45.2)
+    assert n.power_limit_w(h) == pytest.approx(70.0)
+def test_error_code_raises():
+    class Broken(FakeNvmlLib):
+        def nvmlDeviceGetTemperature(self, handle, sensor, ptr):
+            return 999
+    n = nvml.Nvml(lib=Broken())
+    with pytest.raises(nvml.NvmlError):
+        n.temperature_c(n.device(0))
+def test_summarize_samples():
+    t = nvml.summarize_samples(
+        sm=[1500, 1560], mem=[4985, 4985], temp=[60, 63], power=[44.0, 46.0],
+        max_sm=1590, max_mem=5001,
+    )
+    assert t.sm_clock_mhz == pytest.approx(1530)
+    assert t.temperature_c == 63
+    assert t.power_w == pytest.approx(45.0)
+    assert t.sm_clock_fraction == pytest.approx(1530 / 1590)
+    assert t.mem_clock_fraction == pytest.approx(4985 / 5001)
+def test_monitor_collects_while_running():
+    lib = FakeNvmlLib()
+    mon = nvml.Monitor(nvml=nvml.Nvml(lib=lib), interval_s=0.001)
+    mon.start()
+    time.sleep(0.02)
+    t = mon.stop()
+    mon.close()
+    assert t.sm_clock_mhz == pytest.approx(1530)
+    assert t.max_sm_clock_mhz == 1590
+    assert lib.shutdown_called
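
The `nvml.py` module itself is not part of this excerpt, so the following is only a guess at what `summarize_samples` might compute, written to be consistent with `test_summarize_samples` above. It assumes clocks and power are averaged and temperature is reported as the maximum seen; the test would also pass if the last temperature sample were used. `TelemetrySummary` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class TelemetrySummary:
    """Hypothetical stand-in for the package's telemetry result type."""
    sm_clock_mhz: float
    mem_clock_mhz: float
    temperature_c: int
    power_w: float
    sm_clock_fraction: float
    mem_clock_fraction: float


def summarize(sm, mem, temp, power, max_sm, max_mem) -> TelemetrySummary:
    """Reduce per-sample readings to one row per measurement."""
    sm_mean = fmean(sm)
    mem_mean = fmean(mem)
    return TelemetrySummary(
        sm_clock_mhz=sm_mean,
        mem_clock_mhz=mem_mean,
        temperature_c=max(temp),        # assumption: peak temperature, not the mean
        power_w=fmean(power),
        sm_clock_fraction=sm_mean / max_sm,     # share of max boost actually held
        mem_clock_fraction=mem_mean / max_mem,
    )


summary = summarize(
    sm=[1500, 1560], mem=[4985, 4985], temp=[60, 63], power=[44.0, 46.0],
    max_sm=1590, max_mem=5001,
)
print(summary)
```

The two clock fractions are the quantities a sustained-clock roofline would scale the theoretical peaks by.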

kernelmeter-0.3.0/tests/test_tensor_peaks.py ADDED

@@ -0,0 +1,50 @@
+import pytest
+from kernelmeter import attrs, peaks
+def test_t4_fp16_tensor():
+    # T4: 40 SMs, 1590 MHz, CC 7.5 -> spec sheet says 65 TFLOPS fp16
+    tf = peaks.fp16_tensor_tflops(40, 1_590_000, 7, 5)
+    assert tf == pytest.approx(65.1, rel=0.01)
+def test_a100_fp16_and_tf32():
+    # A100: 108 SMs, 1410 MHz -> 312 TFLOPS fp16, 156 TFLOPS tf32
+    assert peaks.fp16_tensor_tflops(108, 1_410_000, 8, 0) == pytest.approx(312, rel=0.01)
+    assert peaks.tf32_tensor_tflops(108, 1_410_000, 8, 0) == pytest.approx(156, rel=0.01)
+def test_h100_fp16_tensor():
+    # H100 SXM: 132 SMs, 1830 MHz -> 989 TFLOPS dense fp16
+    assert peaks.fp16_tensor_tflops(132, 1_830_000, 9, 0) == pytest.approx(989, rel=0.01)
+def test_pre_tensor_core_cards_return_none():
+    assert peaks.fp16_tensor_tflops(20, 1_700_000, 6, 1) is None
+    assert peaks.tf32_tensor_tflops(40, 1_590_000, 7, 5) is None  # Turing has no tf32
+def test_derive_includes_tensor_peaks(fake_driver):
+    # fake is a 3090: CC 8.6, 82 SMs, 1695 MHz -> 1024 flops/sm/clk = 142 TF
+    dev = fake_driver.device(0)
+    p = peaks.derive(attrs.query_all(fake_driver, dev))
+    assert p.fp16_tensor_tflops == pytest.approx(142.3, rel=0.01)
+    assert p.tf32_tensor_tflops == pytest.approx(35.6, rel=0.01)
+def test_sustained_peaks_scaling():
+    from kernelmeter import bench
+    from kernelmeter.nvml import Telemetry
+    base = peaks.Peaks(mem_bandwidth_gbs=320.0, fp32_tflops=8.0, compute_capability=(7, 5))
+    t = Telemetry(
+        sm_clock_mhz=1431, mem_clock_mhz=4500, max_sm_clock_mhz=1590,
+        max_mem_clock_mhz=5000, temperature_c=70, power_w=60.0,
+    )
+    scaled = bench.sustained_peaks(base, t)
+    assert scaled.fp32_tflops == pytest.approx(8.0 * 1431 / 1590)
+    assert scaled.mem_bandwidth_gbs == pytest.approx(320.0 * 0.9)
+    # with an override the override gets scaled instead
+    scaled2 = bench.sustained_peaks(base, t, peak_tflops_override=65.0)
+    assert scaled2.fp32_tflops == pytest.approx(65.0 * 1431 / 1590)
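
`test_sustained_peaks_scaling` above pins down the scaling rule: the compute peak follows the SM clock fraction and bandwidth follows the memory clock fraction. The sketch below applies that rule to the T4 matmul example from the README. It is not the package's `bench.sustained_peaks`; the simplified `Peaks` and `Telemetry` classes and the 8.1408 TFLOP/s fp32 peak (40 SMs, 128 FLOPs per SM per clock, 1590 MHz) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Peaks:
    """Simplified stand-in for kernelmeter's Peaks (hypothetical)."""
    mem_bandwidth_gbs: float
    fp32_tflops: float


@dataclass(frozen=True)
class Telemetry:
    """Simplified stand-in for kernelmeter's Telemetry (hypothetical)."""
    sm_clock_mhz: float
    mem_clock_mhz: float
    max_sm_clock_mhz: float
    max_mem_clock_mhz: float


def sustained_peaks(base: Peaks, t: Telemetry) -> Peaks:
    """Scale theoretical peaks down to the clocks the card actually held."""
    return replace(
        base,
        mem_bandwidth_gbs=base.mem_bandwidth_gbs * t.mem_clock_mhz / t.max_mem_clock_mhz,
        fp32_tflops=base.fp32_tflops * t.sm_clock_mhz / t.max_sm_clock_mhz,
    )


# README example: cuBLAS fp32 matmul on a T4, 4.29 TFLOP/s at 877 of 1590 MHz.
base = Peaks(mem_bandwidth_gbs=320.0, fp32_tflops=8.1408)
held = Telemetry(sm_clock_mhz=877, mem_clock_mhz=5000,
                 max_sm_clock_mhz=1590, max_mem_clock_mhz=5000)
achieved_tflops = 4.29

pct_roof = 100 * achieved_tflops / base.fp32_tflops
pct_roof_at_clk = 100 * achieved_tflops / sustained_peaks(base, held).fp32_tflops

print(f"%roof     = {pct_roof:.1f}")
print(f"%roof@clk = {pct_roof_at_clk:.1f}")
```

Under these assumptions the two scores come out at 52.7 and 95.5, matching the `%roof` and `%roof@clk` columns in the README's telemetry example.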