mfu-tracker 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,22 @@
+ name: pypi
+
+ on:
+   push:
+     tags:
+       - "v*"
+
+ jobs:
+   publish:
+     runs-on: ubuntu-latest
+     environment: pypi
+     permissions:
+       id-token: write # required for trusted publishing
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - uses: astral-sh/setup-uv@v5
+
+       - run: uv build
+
+       - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,210 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[codz]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py.cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # UV
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ #uv.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+ #poetry.toml
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
+ # https://pdm-project.org/en/latest/usage/project/#working-with-version-control
+ #pdm.lock
+ #pdm.toml
+ .pdm-python
+ .pdm-build/
+
+ # pixi
+ # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
+ #pixi.lock
+ # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
+ # in the .venv directory. It is recommended not to include this directory in version control.
+ .pixi
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .envrc
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
+
+ # Abstra
+ # Abstra is an AI-powered process automation framework.
+ # Ignore directories containing user credentials, local state, and settings.
+ # Learn more at https://abstra.io/docs
+ .abstra/
+
+ # Visual Studio Code
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
+ # you could uncomment the following to ignore the entire vscode folder
+ # .vscode/
+
+ # Ruff stuff:
+ .ruff_cache/
+
+ # PyPI configuration file
+ .pypirc
+
+ # Cursor
+ # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
+ # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
+ # refer to https://docs.cursor.com/context/ignore-files
+ .cursorignore
+ .cursorindexingignore
+
+ # Marimo
+ marimo/_static/
+ marimo/_lsp/
+ __marimo__/
+
+ # Vibe Coding Stuff
+ .serena/
@@ -0,0 +1,75 @@
+ # mfu-tracker
+
+ PyPI library for tracking Model FLOPs Utilization (MFU) and Model Bandwidth Utilization (MBU).
+
+ ## Architecture
+
+ - [src/mfu_tracker/gpu.py](src/mfu_tracker/gpu.py) — queries `torch.cuda.get_device_properties()` to derive peak TFLOPS and memory bandwidth from first principles. Uses `_FP16_FLOPS_PER_SM_PER_CLOCK` keyed by `(major, minor)` compute capability tuple (empirically validated against spec sheets). Supports per-dtype peak ceilings (fp16, bf16, int8, fp8, int4, fp4).
+ - [src/mfu_tracker/flops.py](src/mfu_tracker/flops.py) — FLOP counting via `torch.utils.flop_counter.FlopCounterMode` (PyTorch 2.1+), with `thop` as fallback. `FlopCounterMode` hooks at the ATen dispatch level, so it counts `F.scaled_dot_product_attention` (SDPA / native flash attention) automatically on CUDA — no manual correction needed for modern transformer models. SDPA is NOT counted on CPU (the kernel dispatches differently); profile with a CUDA model for accurate counts. The `thop` fallback handles PyTorch < 2.1. For the rare `flash_attn` C extension (direct `flash_attn_func` calls), use `flash_attn_flops(B, S, H, D)` to compute the missing FLOPs manually. For kwargs-only models (e.g. HF), both paths wrap the model in `_KwargsAdapter`. `param_bytes()` accepts `trainable_only=True` for PEFT/LoRA backward MBU estimates.
+ - [src/mfu_tracker/tracker.py](src/mfu_tracker/tracker.py) — `track()` context manager and `compute_mfu`/`compute_mbu` standalone functions. All accept a `dtype` parameter. `UtilizationResult` is a mutable dataclass yielded by `track()`; fields start as `None` and are populated after the block exits. CUDA-event-backed fields (from `track_step()`) are resolved lazily on first attribute access; `_resolve()` is idempotent and calls `synchronize` exactly once (see the sketch after this list).
+ - [src/mfu_tracker/optim.py](src/mfu_tracker/optim.py) — `MFUOptimizerWrapper`. Wraps any `torch.optim.Optimizer` and exposes a `track_step()` context manager. Uses a fixed `backward_factor` (default 2.0, giving the standard 3× total convention) — no gradient hook. `zero_grad()` is called automatically at the **start** of `track_step()`; call `optimizer.step()` **after** the block to keep it outside the timing window. Profile the uncompiled model via `wrapper.profile()` before calling `torch.compile`.
+ - [src/mfu_tracker/integrations/hf_trainer.py](src/mfu_tracker/integrations/hf_trainer.py) — `MFUCallback(TrainerCallback)`. Profiles the model once at `on_train_begin` (moves the sample batch to the model device automatically). Records two non-blocking CUDA events per step (`on_step_begin` / `on_step_end`) and defers `torch.cuda.synchronize()` to `on_log`, amortising the sync cost across the logging interval. Logs `throughput/mfu` and `throughput/mbu` (configurable prefix). Does NOT read `state.total_flos` — HF Trainer uses the dense 6ND formula for all models including MoE, overcounting MoE by up to 4×. Skips silently when CUDA is unavailable.
+
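+ A minimal sketch of that lazy-resolution pattern (illustrative only; `LazyResult` is not the library's actual class):
+
+ ```python
+ import torch
+
+ class LazyResult:
+     """Holds two recorded CUDA events; resolves elapsed time on first read."""
+
+     def __init__(self, start_event: torch.cuda.Event, end_event: torch.cuda.Event):
+         self._start, self._end = start_event, end_event
+         self._elapsed_sec = None
+
+     def _resolve(self) -> float:
+         # Idempotent: synchronize exactly once, on first access.
+         if self._elapsed_sec is None:
+             torch.cuda.synchronize()
+             self._elapsed_sec = self._start.elapsed_time(self._end) / 1000.0
+         return self._elapsed_sec
+
+     @property
+     def elapsed_sec(self) -> float:
+         return self._resolve()
+ ```
+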
+ ## Key design decisions
+
+ - `(major, minor)` tuple keys for GPU lookup — CC 8.0 (A100) and CC 8.6 (RTX 3090) have genuinely different per-SM throughput (1024 vs 512 FP16 FLOPs/SM/clock) despite both being Ampere. Major-version-only keys would be wrong (see the sketch after this list).
+ - Ada Lovelace is CC 8.9 — it gets FP8 support via a special case in `_fp8_supported()` even though its major version is 8 (below the FP8 min_major of 9 for Hopper).
+ - `thop` over `calflops` — calflops unconditionally imports `transformers` in `__init__.py`, making it a 600MB transitive dep that defeats the lightweight goal.
+ - **Fixed backward factor, not dynamic measurement.** The original design used a gradient hook on `trainable[-1]` to measure the forward/backward time split dynamically. This was abandoned because: (1) gradient hooks fire on the CPU autograd thread based on scheduling, not GPU completion, making the "start of backward" event unreliable; (2) with `optimizer.step()` inside the timing window, the measured factor absorbed optimizer time and gave ~4× instead of ~2×; (3) `torch.compile` restructures the backward graph so the hook often never fires, producing inconsistent FLOP multipliers between compiled and uncompiled runs. A fixed `backward_factor=2.0` (the 3× total convention) with a user-overridable parameter is simpler and consistent. Users with gradient checkpointing set `backward_factor=3.0–4.0` explicitly.
+ - `optimizer.step()` belongs **outside** `track_step()`. `zero_grad()` is called at the start of the block, so gradients are valid until the block exits. This way timing captures forward + backward only.
+ - `UtilizationResult` is mutable (no `frozen=True`) so the context manager pattern works correctly — yield first, populate after the block exits.
+ - `MFUOptimizerWrapper.track_step()` is lazy — no `synchronize` in `finally`, only on first attribute access of the result. Skipping `result.mfu` on some steps incurs zero sync cost for those steps.
+ - HF integration uses `TrainerCallback`, not a monkey-patch — cleaner, composable, and avoids patching internal Trainer methods. `MFUCallback` inherits from `TrainerCallback` so HF Trainer can dispatch every callback event to it.
+ - `metric_prefix="throughput"` on `MFUCallback` — logs `throughput/mfu` and `throughput/mbu`. WandB groups metrics by the `/` separator, placing these in their own "throughput" section away from `loss`/`lr`. Set `metric_prefix=""` for bare keys.
+ - HF Trainer forwards the `logs` dict from `on_log` to all configured integrations (WandB, TensorBoard, MLflow) automatically — no extra configuration needed.
+ - Graceful degradation: an unknown compute capability emits a `UserWarning` and falls back to the closest known major version.
+ - MBU is always reported alongside MFU.
+ - `num_gpus` parameter on `track()`, `compute_mfu()`, `compute_mbu()`, `MFUCallback`, and `MFUOptimizerWrapper` scales the peak ceiling. Default 1, correct for all parallelism strategies when using `profile_flops` — per-GPU MFU equals global MFU for DDP, FSDP, tensor, and pipeline parallelism because the N factors cancel (per-GPU FLOPs = total/N, wall time is the same across all GPUs). Only set `num_gpus > 1` when pairing analytically-derived full-model FLOPs (e.g. `6 × params × tokens`) with a total-job peak.
+ - `torch.compile` does not change the FLOP count (same math, faster execution). Profile the *uncompiled* model — `FlopCounterMode` may not trace compiled graphs correctly. The MFU improvement from compilation is captured automatically via CUDA event timing of real steps.
+ - `src/` layout for correct PyPI packaging (hatchling build backend).
+
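+ A hedged sketch of that capability lookup and its fallback (the two table values are the ones quoted above; the helper name and fallback details are assumptions, not the real `gpu.py`):
+
+ ```python
+ import warnings
+
+ # FP16 FLOPs per SM per clock, keyed by (major, minor) compute capability.
+ _FP16_FLOPS_PER_SM_PER_CLOCK = {
+     (8, 0): 1024,  # A100
+     (8, 6): 512,   # RTX 3090: same major version, half the per-SM rate
+     # ... other capabilities elided ...
+ }
+
+ def _fp16_flops_per_sm(major: int, minor: int) -> int:
+     key = (major, minor)
+     if key in _FP16_FLOPS_PER_SM_PER_CLOCK:
+         return _FP16_FLOPS_PER_SM_PER_CLOCK[key]
+     # Unknown capability: warn, then fall back to the closest known entry
+     # (nearest major version first, then nearest minor).
+     fallback = min(
+         _FP16_FLOPS_PER_SM_PER_CLOCK,
+         key=lambda k: (abs(k[0] - major), abs(k[1] - minor)),
+     )
+     warnings.warn(f"Unknown compute capability {key}; using table entry for {fallback}")
+     return _FP16_FLOPS_PER_SM_PER_CLOCK[fallback]
+ ```
+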
+ ## Benchmark findings (RTX 4080, GPT-2 124M, fp16)
+
+ From `examples/benchmark_mfu.py` — uses `track()` with `profile_flops(with_backward=True)` (fixed 3× convention) for consistent results across all configurations:
+
+ | Configuration | MFU | ms/step |
+ |-----------------------|-------|---------|
+ | batch=1 \| eager | ~2.7% | ~40ms |
+ | batch=8 \| eager | ~9% | ~93ms |
+ | batch=8 \| sdpa | ~12% | ~74ms |
+ | batch=8 \| sdpa+compile | ~17% | ~50ms |
+ | batch=16 \| sdpa+compile | ~16% | ~104ms |
+
+ Key observations:
+ - Low MFU (~2–17%) is expected for GPT-2 on modern hardware — the model is too small to saturate tensor cores. Large models (LLaMA-70B) reach 40–60% MFU.
+ - Low MBU (~0.2–0.7%) at batch≥4 means memory bandwidth is **not** the bottleneck — the model is compute-bound (or kernel-launch-bound). MBU is more meaningful for inference at batch=1.
+ - Low MFU and low MBU together indicate **kernel launch overhead**: the GPU idles between small operations waiting for the CPU to dispatch the next kernel. `torch.compile` addresses this by fusing kernels, giving a +5–8pp MFU improvement.
+ - `sdpa` over `eager` attention: +2–3pp MFU from avoiding materialising the full B×H×S×S attention matrix (flash attention tiling).
+
+ ## Testing
+
+ Two test tiers:
+
+ - **Mock-based** (`test_flops.py`, `test_gpu.py`, `test_tracker.py`, `test_hf_callback.py`, `test_optim.py`) — no GPU required, run anywhere.
+ - **GPU integration** (`test_integration_gpu.py`) — skipped automatically without CUDA. Validated on RTX 4080 (CC 8.9). Covers: spec detection without warnings, thop FLOP counts matching theory within 1% for `Linear` and `Conv2d` (see the sketch after this list), MFU/MBU in `(0, 1]` on real hardware, `compute_mfu` agreeing with `track()`, larger batch → higher MFU.
+
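+ A minimal version of that `Linear` sanity check (assumed form; the real test may differ): thop reports MACs, so doubling them should match the 2·B·d_in·d_out matmul theory.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from thop import profile
+
+ B, d_in, d_out = 32, 1024, 4096
+ layer = nn.Linear(d_in, d_out)
+ macs, _params = profile(layer, inputs=(torch.randn(B, d_in),))
+
+ theory_flops = 2 * B * d_in * d_out  # one multiply + one add per MAC
+ assert abs(2 * macs - theory_flops) / theory_flops < 0.01
+ ```
+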
+ Known faithfulness limitations:
+ - Peak ceiling is from NVIDIA spec sheets, not our own measurements.
+ - `F.scaled_dot_product_attention` (SDPA) is counted automatically on CUDA via `FlopCounterMode`. Models using `flash_attn_func` directly (rare — older HF with `use_flash_attention_2=True`) still need the `flash_attn_flops()` correction.
+ - SDPA is not counted when profiling on CPU — profile the CUDA model for accurate counts.
+ - bitsandbytes INT8/NF4 quantized layers (QLoRA) are opaque CUDA kernels not visible to either counter. NF4 dequantizes to fp16 before matmul, so FLOPs are approximately correct. Pass `dtype="int8"` to get the right peak ceiling.
+ - CUDA event timing is accurate; CPU-timer `track()` requires a `synchronize` at block boundaries.
+ - The dynamic backward factor measurement (gradient hook on `trainable[-1]`) was removed — it gave unreliable results (~4× instead of ~2×) due to CPU/GPU async timing and broke comparisons between compiled and uncompiled models.
+
+ `transformers` and `accelerate` are dev dependencies (needed to test `MFUCallback` and run the HF Trainer example).
+
+ ```bash
+ uv sync --group dev
+ .venv/bin/pytest tests/ -v                          # mock tests only (no GPU needed)
+ .venv/bin/pytest tests/test_integration_gpu.py -v   # GPU tests
+ ```
+
+ ## Examples
+
+ - `examples/benchmark_mfu.py` — benchmarks MFU/MBU across batch size, attention implementation (`eager` vs `sdpa`), and `torch.compile` using GPT-2 (124M). Uses `track()` with pre-profiled FLOPs for consistent results.
+ - `examples/hf_trainer_mfu.py` — demonstrates `MFUCallback` with HF Trainer on synthetic data. Metrics appear as `throughput/mfu` / `throughput/mbu` in training logs and WandB. Run with `--wandb` to enable WandB logging.
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Jeremias Lino Ferrao
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
@@ -0,0 +1,235 @@
+ Metadata-Version: 2.4
+ Name: mfu-tracker
+ Version: 0.1.0
+ Summary: Lightweight Model FLOPs Utilization and Bandwidth Utilization tracker for PyTorch
+ License: MIT License
+
+ Copyright (c) 2026 Jeremias Lino Ferrao
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+ License-File: LICENSE
+ Requires-Python: >=3.9
+ Requires-Dist: numpy>=2.0.2
+ Requires-Dist: thop>=0.1.1.post2209072238
+ Requires-Dist: torch>=2.0
+ Provides-Extra: dev
+ Requires-Dist: pytest-cov; extra == 'dev'
+ Requires-Dist: pytest>=7.0; extra == 'dev'
+ Provides-Extra: hf
+ Requires-Dist: transformers>=4.30; extra == 'hf'
+ Description-Content-Type: text/markdown
+
+ # mfu-tracker
+
+ When profiling training runs, I found that most existing tools either lacked MFU/MBU support entirely or dragged in hundreds of megabytes of transitive dependencies. This library is an attempt at a self-contained alternative.
+
+ **mfu-tracker** is a PyTorch library for measuring Model FLOPs Utilization (MFU) and Model Bandwidth Utilization (MBU). It supports bare PyTorch training loops, an optimizer wrapper, and a HuggingFace Trainer callback.
+
+ - **Minimal dependencies** — PyTorch, `thop`, and NumPy only
+ - **Profiled FLOPs, not formula estimates** — uses `FlopCounterMode` to count the FLOPs your model actually executes rather than a formula like `6 × params × tokens`. For Mixture-of-Experts models this means only active experts are counted, giving a more accurate numerator than parameter-based estimates (see the sketch after this list).
+ - **Three integration styles** — context manager, optimizer wrapper, HF Trainer callback
+ - **WandB / TensorBoard / MLflow** — metrics are logged through HF Trainer's existing pipeline when using `MFUCallback`
+
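+ The counting mechanism behind the *profiled FLOPs* bullet, as a minimal sketch (assumes PyTorch ≥ 2.1; the toy MLP stands in for a real model):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from torch.utils.flop_counter import FlopCounterMode
+
+ model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
+ x = torch.randn(8, 128, 512)
+
+ with FlopCounterMode(display=False) as counter:
+     model(x)
+
+ # Counts only the ops actually dispatched at the ATen level; for an MoE
+ # model, that means only the experts that ran, unlike a dense 6ND estimate.
+ print(counter.get_total_flops())
+ ```
+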
+ MFU as a training efficiency metric was introduced in the [PaLM paper](https://arxiv.org/abs/2204.02311) (Chowdhery et al., 2022).
+
+ ---
+
+ ## What MFU and MBU measure
+
+ **MFU (Model FLOPs Utilization)** is the ratio of observed FLOP throughput to the GPU's theoretical peak for the given dtype. A value of 0.50 means the model is executing at half the GPU's rated peak. Well-optimized large models on modern hardware typically fall in the 0.40–0.60 range; small models often land much lower due to kernel dispatch overhead relative to compute time.
+
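+ Written in the same style as the MBU formula below, with `flops_per_step` the profiled per-step FLOPs:
+
+ ```
+ MFU = (flops_per_step / elapsed_sec) / peak_flops_per_sec
+ ```
+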
+ **MBU (Model Bandwidth Utilization)** as computed here is a proxy, not a direct DRAM measurement. It is defined as:
+
+ ```
+ MBU = (param_bytes / elapsed_sec) / peak_memory_bandwidth
+ ```
+
+ where `param_bytes` is the total size of model parameters and `elapsed_sec` is wall time. This assumes one full pass through model weights per step and does not account for activation memory, gradients, optimizer state, or data layout effects. It is most useful as a relative indicator across runs rather than an absolute efficiency measure.
+
+ If both MFU and MBU are low simultaneously, the GPU is underutilized. Two common causes: kernel dispatch overhead (the CPU cannot issue kernels fast enough to keep the GPU busy — `torch.compile` reduces this by fusing operations), or CPU-side pipeline stalls (slow DataLoader, heavy host preprocessing, or host-to-device transfers in the hot path).
+
+ ---
+
+ ## Installation
+
+ ```bash
+ pip install mfu-tracker
+ ```
+
+ HuggingFace Trainer integration requires no extra install — if you are already running HF Trainer, `transformers` is already available. Import `MFUCallback` directly.
+
+ ---
+
+ ## Usage
+
+ ### Context manager (bare PyTorch)
+
+ ```python
+ from mfu_tracker import track, profile_flops, param_bytes
+
+ # Profile once on the uncompiled model before training begins
+ first_batch = next(iter(dataloader))
+ sample = {"input_ids": first_batch["input_ids"][:1]}
+ flops = profile_flops(model, kwargs=sample, with_backward=True)
+ p_bytes = param_bytes(model)
+
+ for batch in dataloader:
+     optimizer.zero_grad()
+     with track(flops, p_bytes, dtype="bf16") as result:
+         loss = model(**batch).loss
+         loss.backward()
+     optimizer.step()  # outside the timing window
+
+     print(f"MFU: {result.mfu:.3f} MBU: {result.mbu:.3f} {result.elapsed_sec*1000:.0f} ms/step")
+ ```
+
+ ### Optimizer wrapper
+
+ ```python
+ from mfu_tracker import MFUOptimizerWrapper
+
+ base_optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
+ optimizer = MFUOptimizerWrapper(
+     base_optimizer, model,
+     sample_batch={"input_ids": sample_ids},
+     dtype="bf16",
+ )
+
+ # Profile before compiling — FlopCounterMode may not trace compiled graphs
+ optimizer.profile()
+ model = torch.compile(model)
+
+ for step, batch in enumerate(dataloader):
+     with optimizer.track_step() as result:  # calls zero_grad() at block entry
+         loss = model(**batch).loss
+         loss.backward()
+     optimizer.step()  # outside the timing window
+
+     if step % 10 == 0:
+         print(f"MFU {result.mfu:.3f} MBU {result.mbu:.3f}")
+ ```
+
+ ### HuggingFace Trainer
+
+ ```python
+ from mfu_tracker.integrations.hf_trainer import MFUCallback
+
+ sample_batch = {k: v[:batch_size] for k, v in next(iter(train_dataloader)).items()}
+
+ callback = MFUCallback(
+     sample_batch=sample_batch,
+     dtype="bf16",
+     metric_prefix="throughput",  # logs throughput/mfu and throughput/mbu
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     callbacks=[callback],
+ )
+ trainer.train()
+ ```
+
+ `throughput/mfu` and `throughput/mbu` are added to the Trainer log dict at each logging step and forwarded automatically to any configured integrations (WandB, TensorBoard, MLflow). WandB groups metrics by the `/` separator, so these appear in a distinct "throughput" section rather than alongside loss and learning rate.
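+
+ For example, routing these metrics to WandB needs only the Trainer's own logging configuration (a minimal sketch; field values are placeholders):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="out",
+     report_to=["wandb"],  # or ["tensorboard"], ["mlflow"]
+     logging_steps=10,     # MFU/MBU are synchronized and logged at this cadence
+ )
+ ```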
+
+ ---
+
+ ## FLOP counting
+
+ ```python
+ from mfu_tracker import profile_flops, flash_attn_flops, param_bytes
+
+ # Standard models — FlopCounterMode counts SDPA automatically on CUDA
+ flops = profile_flops(model, kwargs=batch, with_backward=True)
+
+ # Models calling flash_attn_func directly (rare; older HF with use_flash_attention_2=True)
+ # need a manual correction since the C extension is opaque to FlopCounterMode:
+ flops += flash_attn_flops(batch_size=B, seq_len=S, num_heads=H, head_dim=D)
+
+ # PEFT / LoRA — restrict param_bytes to trainable parameters only
+ p_bytes = param_bytes(model, trainable_only=True)
+ ```
+
+ `with_backward=True` applies the standard 3× convention (1× forward + 2× backward). For gradient checkpointing, pass `backward_factor=3.0` or `4.0` to `MFUOptimizerWrapper` or `MFUCallback`.
+
+ ---
+
+ ## GPU spec
+
+ ```python
+ from mfu_tracker import get_gpu_spec
+
+ spec = get_gpu_spec()
+ print(spec.name)                       # e.g. "NVIDIA GeForce RTX 4080"
+ print(spec.peak_tflops("fp16"))        # e.g. 97.6
+ print(spec.peak_tflops("fp8"))         # Ada Lovelace (CC 8.9) and Hopper (CC 9.0)+
+ print(spec.peak_memory_bandwidth_tbs)  # e.g. 0.717
+ ```
+
+ Supported dtypes: `fp32`, `fp16`, `bf16`, `int8`, `fp8`, `int4`, `fp4`. Unrecognized compute capabilities fall back to the nearest known major version with a `UserWarning`.
+
+ ---
+
+ ## Benchmark (RTX 4080, GPT-2 124M, fp16)
+
+ | Configuration | MFU | ms/step |
+ |---|---|---|
+ | batch=1 · eager | ~0.027 | ~40 ms |
+ | batch=8 · eager | ~0.09 | ~93 ms |
+ | batch=8 · sdpa | ~0.12 | ~74 ms |
+ | batch=8 · sdpa + compile | ~0.17 | ~50 ms |
+ | batch=16 · sdpa + compile | ~0.16 | ~104 ms |
+
+ GPT-2 (124M) is a small model relative to the compute capacity of a modern GPU, so low MFU is expected — the model spends a large fraction of step time waiting for kernel dispatch rather than doing arithmetic. Larger models (e.g. LLaMA-70B) typically reach 0.40–0.60 MFU. The improvement from `torch.compile` reflects kernel fusion reducing dispatch overhead. I'll add some testing on this later.
+
+ ```bash
+ python examples/benchmark_mfu.py --help
+ python examples/hf_trainer_mfu.py --dtype bf16 --batch-size 16
+ ```
+
+ ---
+
+ ## Multi-GPU
+
+ Leave `num_gpus=1` (the default) when using `profile_flops` as the FLOP source. For data-parallel strategies (DDP, FSDP), per-GPU FLOPs equal total FLOPs divided by N and wall time is the same on all ranks, so per-GPU MFU equals global MFU and the N factors cancel. Set `num_gpus > 1` only when pairing an analytically-derived full-model FLOP count (e.g. `6 × params × tokens`) with a total-job peak ceiling.
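+
+ Written out, with F the total FLOPs per step, t the per-step wall time, and P the per-GPU peak:
+
+ ```
+ per-GPU: MFU = (F / N) / (t * P)
+ global:  MFU =  F      / (t * N * P)
+ ```
+
+ Both reduce to F / (t * N * P), so the default `num_gpus=1` with per-GPU profiled FLOPs already reports the global figure.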
+
+ ---
+
+ ## Limitations
+
+ - **SDPA on CPU is not counted** — `FlopCounterMode` does not intercept flash attention dispatch on CPU. Profile with a CUDA model.
+ - **bitsandbytes quantized layers** — INT8/NF4 kernels are opaque to `FlopCounterMode`. NF4 dequantizes to fp16 before the matmul, so FLOP counts are approximately correct. Pass the appropriate dtype to use the right peak ceiling.
+ - **`flash_attn_func` direct calls** — models bypassing `F.scaled_dot_product_attention` need a manual `flash_attn_flops()` correction (see above).
+ - **Peak ceilings from spec sheets** — these are not independently measured. MFU > 1.0 indicates the ceiling is underestimated.
+ - **MBU is a proxy** — the formula uses parameter bytes as a stand-in for memory traffic; actual DRAM traffic (activations, gradients, optimizer state) is higher and not measured.
+ - I have not tested the library extensively yet; please open an issue if you encounter any bugs or unexpected behavior.
+
+ ---
+
+ ## Requirements
+
+ - Python 3.9+
+ - PyTorch 2.0+ (2.1+ recommended for `FlopCounterMode`)
+ - A CUDA GPU is required for meaningful results; CPU timing works but MFU will be near zero for any realistic model
+
+ ---
+
+ ## License
+
+ MIT