torch-amd-setup 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,17 @@
__pycache__/
*.py[cod]
*.so
*.egg
*.egg-info/
dist/
build/
.eggs/
.venv*/
venv*/
.env
*.log
.pytest_cache/
.mypy_cache/
.ruff_cache/
htmlcov/
.coverage
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 ChharithOeun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,210 @@
Metadata-Version: 2.4
Name: torch-amd-setup
Version: 0.1.0
Summary: Auto-detects the best PyTorch compute device for AMD GPUs, with gfx1010 ROCm override support (RX 5700 XT, RX 5600 XT, Navi 10)
Project-URL: Homepage, https://github.com/ChharithOeun/torch-amd-setup
Project-URL: Repository, https://github.com/ChharithOeun/torch-amd-setup
Project-URL: Issues, https://github.com/ChharithOeun/torch-amd-setup/issues
Project-URL: Documentation, https://github.com/ChharithOeun/torch-amd-setup/tree/main/docs
License: MIT
License-File: LICENSE
Keywords: amd,device-detection,directml,gfx1010,gpu,machine-learning,navi10,pytorch,rocm,rx5700xt
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Provides-Extra: cpu
Requires-Dist: torch>=2.2.0; extra == 'cpu'
Provides-Extra: cuda
Requires-Dist: torch>=2.2.0; extra == 'cuda'
Requires-Dist: torchaudio; extra == 'cuda'
Requires-Dist: torchvision; extra == 'cuda'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: directml
Requires-Dist: torch-directml; extra == 'directml'
Requires-Dist: torch==2.3.0; extra == 'directml'
Provides-Extra: rocm
Requires-Dist: torch>=2.2.0; extra == 'rocm'
Requires-Dist: torchaudio; extra == 'rocm'
Requires-Dist: torchvision; extra == 'rocm'
Description-Content-Type: text/markdown

# torch-amd-setup

**Auto-detects the best PyTorch compute device for AMD GPUs** — with first-class support for cards that are not in ROCm's default allow-list (RX 5700 XT, RX 5600 XT, RX 5500 XT / gfx1010–gfx1012).

One import. No manual env var hunting. Works on Windows, Linux, WSL2, and macOS.

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype

device_type = get_best_device()  # "rocm" | "dml" | "cuda" | "mps" | "cpu"
device = get_torch_device()      # torch.device ready for model.to()
dtype = get_dtype()              # torch.float16 or torch.float32
```

---

## The problem this solves

AMD GPUs that use the **gfx1010 architecture** (Navi 10 — RX 5700 XT, RX 5700, RX 5600 XT) are not in ROCm's default supported GPU list. PyTorch on ROCm will silently fall back to CPU unless you set:

```bash
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

...but it has to be set *before* Python imports torch, which means you either:
- Remember to set it in every shell session, or
- Bake it into a shell script wrapper, or
- Set it in your Python script before the first `import torch`

`torch-amd-setup` handles all of that automatically. It also detects DirectML on Windows (no ROCm required), Apple MPS on macOS, NVIDIA CUDA, and falls back to CPU — so you can ship one codebase that works everywhere.
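The third option can be sketched in a few lines. This is only an illustration of the ordering constraint, not the package's actual code, and `ensure_gfx_override` is a hypothetical helper name:

```python
import os
import sys

def ensure_gfx_override(version: str = "10.3.0") -> None:
    """Set HSA_OVERRIDE_GFX_VERSION, failing loudly if torch beat us to it."""
    if "torch" in sys.modules:
        # Too late: ROCm caches the detected architecture when torch initializes.
        raise RuntimeError("torch already imported; the override would be ignored")
    os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", version)

ensure_gfx_override()
# import torch  # now safe: the override is visible to ROCm at init time
```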

---

## Detection priority

| Priority | Backend | Platform | Requirement |
|----------|---------------|---------------------|--------------------------------------|
| 1 | NVIDIA CUDA | Any | Standard `pip install torch` |
| 2 | AMD ROCm | Linux / WSL2 | ROCm PyTorch + AMD driver ≥22.20 |
| 3 | AMD DirectML | Windows | `pip install torch-directml`, Py≤3.11 |
| 4 | Apple MPS | macOS Apple Silicon | Standard `pip install torch` |
| 5 | CPU | Any | Always available, always slow |
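Read top to bottom, the table amounts to a first-match walk. A standalone sketch of that walk (simplified probes, not the package's internal checks):

```python
def sketch_best_device() -> str:
    """Return the first available backend, following the priority table."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        # ROCm builds answer through the CUDA API but set torch.version.hip
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    try:
        import torch_directml  # noqa: F401  (Windows, DirectX 12 GPUs)
        return "dml"
    except ImportError:
        pass
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```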

---

## Install

```bash
pip install torch-amd-setup
```

> `torch` is not a hard dependency — install the appropriate torch variant for your hardware first (see [Tutorial](docs/tutorial.md)).

---

## Quick start

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype
import torch

device_type = get_best_device()
device = get_torch_device(device_type)
dtype = get_dtype(device_type)

print(f"Using: {device_type} → {device} @ {dtype}")

# Load your model
model = MyModel().to(device).to(dtype)
```

### Diagnostics CLI

```bash
python -m torch_amd_setup
```

Output:
```
── torch-amd-setup diagnostics ──────────────────────────────
python_version     3.10.12
platform           Linux-6.6.x-WSL2-x86_64
best_device        rocm
cuda_available     True
cuda_device_name   AMD Radeon RX 5700 XT
cuda_vram_mb       8176
rocm_available     True
torch_version      2.6.0+rocm6.1
...
```

---

## API Reference

### `get_best_device() → str`
Returns the best available device type as a string: `"cuda"`, `"rocm"`, `"dml"`, `"mps"`, or `"cpu"`.

### `get_torch_device(device_type=None) → torch.device`
Returns a `torch.device` object (or a DirectML device object for `"dml"`) ready for `model.to()`. If `device_type` is `None`, calls `get_best_device()` automatically.

### `get_dtype(device_type=None) → torch.dtype`
Returns `torch.float16` for CUDA/ROCm/MPS, and `torch.float32` for DirectML/CPU. DirectML float16 support is unreliable; this keeps you safe.

### `device_info() → dict`
Returns a diagnostic dictionary with all detected hardware info. Useful for logging and bug reports.

### `get_install_guide() → str`
Returns platform-appropriate install instructions as a formatted string.

### `get_wsl2_install_guide() → str`
Returns the full WSL2 + ROCm setup walkthrough for AMD GPUs on Windows.

### `AMD_ROCM_ENV: dict`
The environment variable overrides applied for gfx1010 support. You can inspect or override these before calling `get_best_device()`.

---

## AMD GPU compatibility

| GPU | Architecture | HSA Override | Tested |
|-------------------------|-------------|----------------|--------|
| RX 5700 XT | gfx1010 | `10.3.0` | ✅ |
| RX 5700 | gfx1010 | `10.3.0` | ✅ |
| RX 5600 XT | gfx1010 | `10.3.0` | ✅ |
| RX 5500 XT | gfx1012 | `10.3.0` | ⚠️ reported |
| RX 6000 series (gfx1030+) | RDNA2 | Not needed | ✅ native ROCm |
| RX 7000 series (gfx1100+) | RDNA3 | Not needed | ✅ native ROCm |

If your card isn't listed, check `GFX_OVERRIDE_MAP` in `detect.py` and open a PR.
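For orientation, the shape of such a map might look like the sketch below. This is a hypothetical reconstruction; the real entries live in `detect.py` and may differ:

```python
from typing import Optional

# Hypothetical reconstruction -- see detect.py for the authoritative map.
GFX_OVERRIDE_MAP = {
    "gfx1010": "10.3.0",  # Navi 10: RX 5700 XT / RX 5700 / RX 5600 XT
}

def override_for(arch: str) -> Optional[str]:
    """Return the HSA override for an architecture, or None if none is needed."""
    return GFX_OVERRIDE_MAP.get(arch)
```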

---

## Windows users: DirectML vs WSL2

| Feature | DirectML | WSL2 + ROCm |
|----------------------|--------------------|--------------------|
| Setup difficulty | Easy | Medium |
| float16 support | ❌ (float32 only) | ✅ |
| Python version limit | 3.11 max | Any |
| GPU memory usage | ~1.5× higher | Native |
| Best for | Quick experiments | Production workloads |

---

## Contributing

PRs welcome. Especially interested in:
- Verified gfx override values for additional GPU models
- ROCm 6.2+ compatibility reports
- Windows DirectML on NVIDIA/Intel test results

Please open an issue before large PRs.

---

## License

MIT — see [LICENSE](LICENSE).

---

## Background

This package was extracted from a private AI music pipeline project. The gfx1010 ROCm workaround was discovered the hard way — through several hours of cascading PyTorch installs, ROCm SDK conflicts, and dependency hell. The goal is that nobody else has to spend that time.

See [docs/lessons-learned.md](docs/lessons-learned.md) for the full story.
@@ -0,0 +1,167 @@
# torch-amd-setup

**Auto-detects the best PyTorch compute device for AMD GPUs** — with first-class support for cards that are not in ROCm's default allow-list (RX 5700 XT, RX 5600 XT, RX 5500 XT / gfx1010–gfx1012).

One import. No manual env var hunting. Works on Windows, Linux, WSL2, and macOS.

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype

device_type = get_best_device()  # "rocm" | "dml" | "cuda" | "mps" | "cpu"
device = get_torch_device()      # torch.device ready for model.to()
dtype = get_dtype()              # torch.float16 or torch.float32
```

---

## The problem this solves

AMD GPUs that use the **gfx1010 architecture** (Navi 10 — RX 5700 XT, RX 5700, RX 5600 XT) are not in ROCm's default supported GPU list. PyTorch on ROCm will silently fall back to CPU unless you set:

```bash
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

...but it has to be set *before* Python imports torch, which means you either:
- Remember to set it in every shell session, or
- Bake it into a shell script wrapper, or
- Set it in your Python script before the first `import torch`

`torch-amd-setup` handles all of that automatically. It also detects DirectML on Windows (no ROCm required), Apple MPS on macOS, NVIDIA CUDA, and falls back to CPU — so you can ship one codebase that works everywhere.

---

## Detection priority

| Priority | Backend | Platform | Requirement |
|----------|---------------|---------------------|--------------------------------------|
| 1 | NVIDIA CUDA | Any | Standard `pip install torch` |
| 2 | AMD ROCm | Linux / WSL2 | ROCm PyTorch + AMD driver ≥22.20 |
| 3 | AMD DirectML | Windows | `pip install torch-directml`, Py≤3.11 |
| 4 | Apple MPS | macOS Apple Silicon | Standard `pip install torch` |
| 5 | CPU | Any | Always available, always slow |

---

## Install

```bash
pip install torch-amd-setup
```

> `torch` is not a hard dependency — install the appropriate torch variant for your hardware first (see [Tutorial](docs/tutorial.md)).

---

## Quick start

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype
import torch

device_type = get_best_device()
device = get_torch_device(device_type)
dtype = get_dtype(device_type)

print(f"Using: {device_type} → {device} @ {dtype}")

# Load your model
model = MyModel().to(device).to(dtype)
```

### Diagnostics CLI

```bash
python -m torch_amd_setup
```

Output:
```
── torch-amd-setup diagnostics ──────────────────────────────
python_version     3.10.12
platform           Linux-6.6.x-WSL2-x86_64
best_device        rocm
cuda_available     True
cuda_device_name   AMD Radeon RX 5700 XT
cuda_vram_mb       8176
rocm_available     True
torch_version      2.6.0+rocm6.1
...
```

---

## API Reference

### `get_best_device() → str`
Returns the best available device type as a string: `"cuda"`, `"rocm"`, `"dml"`, `"mps"`, or `"cpu"`.

### `get_torch_device(device_type=None) → torch.device`
Returns a `torch.device` object (or a DirectML device object for `"dml"`) ready for `model.to()`. If `device_type` is `None`, calls `get_best_device()` automatically.

### `get_dtype(device_type=None) → torch.dtype`
Returns `torch.float16` for CUDA/ROCm/MPS, and `torch.float32` for DirectML/CPU. DirectML float16 support is unreliable; this keeps you safe.

### `device_info() → dict`
Returns a diagnostic dictionary with all detected hardware info. Useful for logging and bug reports.

### `get_install_guide() → str`
Returns platform-appropriate install instructions as a formatted string.

### `get_wsl2_install_guide() → str`
Returns the full WSL2 + ROCm setup walkthrough for AMD GPUs on Windows.

### `AMD_ROCM_ENV: dict`
The environment variable overrides applied for gfx1010 support. You can inspect or override these before calling `get_best_device()`.

---

## AMD GPU compatibility

| GPU | Architecture | HSA Override | Tested |
|-------------------------|-------------|----------------|--------|
| RX 5700 XT | gfx1010 | `10.3.0` | ✅ |
| RX 5700 | gfx1010 | `10.3.0` | ✅ |
| RX 5600 XT | gfx1010 | `10.3.0` | ✅ |
| RX 5500 XT | gfx1012 | `10.3.0` | ⚠️ reported |
| RX 6000 series (gfx1030+) | RDNA2 | Not needed | ✅ native ROCm |
| RX 7000 series (gfx1100+) | RDNA3 | Not needed | ✅ native ROCm |

If your card isn't listed, check `GFX_OVERRIDE_MAP` in `detect.py` and open a PR.

---

## Windows users: DirectML vs WSL2

| Feature | DirectML | WSL2 + ROCm |
|----------------------|--------------------|--------------------|
| Setup difficulty | Easy | Medium |
| float16 support | ❌ (float32 only) | ✅ |
| Python version limit | 3.11 max | Any |
| GPU memory usage | ~1.5× higher | Native |
| Best for | Quick experiments | Production workloads |

---

## Contributing

PRs welcome. Especially interested in:
- Verified gfx override values for additional GPU models
- ROCm 6.2+ compatibility reports
- Windows DirectML on NVIDIA/Intel test results

Please open an issue before large PRs.

---

## License

MIT — see [LICENSE](LICENSE).

---

## Background

This package was extracted from a private AI music pipeline project. The gfx1010 ROCm workaround was discovered the hard way — through several hours of cascading PyTorch installs, ROCm SDK conflicts, and dependency hell. The goal is that nobody else has to spend that time.

See [docs/lessons-learned.md](docs/lessons-learned.md) for the full story.
@@ -0,0 +1,155 @@
# Lessons Learned: Building AMD ROCm + PyTorch Support from Scratch

**Date:** 2026-03-23
**Context:** Extracting `torch-amd-setup` from a private AI audio pipeline project.
**Hardware:** AMD Radeon RX 5700 XT (gfx1010 / Navi 10), Windows 11, WSL2 Ubuntu 22.04.

This document is a raw account of every mistake made, every dependency wall hit, and every workaround discovered while getting AMD GPU acceleration working with PyTorch and Seamless M4T. Written so you don't have to spend the same time.

---

## 1. The gfx1010 problem — your GPU exists but ROCm ignores it

The single biggest source of confusion: the AMD RX 5700 XT is a capable GPU: it's supported by the AMD Adrenalin driver and works fine for gaming. But ROCm (AMD's GPU compute stack) has an explicit list of officially supported GPU architectures, and gfx1010 is not on it.

When you install the ROCm version of PyTorch and run `torch.cuda.is_available()`, it returns `False`. No error, no explanation — just `False`. This led to hours of assuming the ROCm install was broken, when the actual issue was a single missing environment variable:

```bash
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

This has to be set **before Python imports torch**. Setting it after `import torch` does nothing. The reason: ROCm checks the GPU architecture at init time and caches the result. If the env var isn't present at that moment, the GPU is invisible for the rest of the process.

**Lesson:** If `torch.cuda.is_available()` returns `False` on ROCm, check the env var before anything else. Don't re-install ROCm.
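A cheap sanity check along these lines (an illustrative snippet, not part of the package):

```python
import os
import sys

def rocm_override_ok(expected: str = "10.3.0") -> bool:
    """True if the override is set and torch has not been imported yet."""
    return (
        os.environ.get("HSA_OVERRIDE_GFX_VERSION") == expected
        and "torch" not in sys.modules
    )
```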

---

## 2. Ubuntu 22.04 ships its own broken rocminfo

Ubuntu 22.04's default `apt` repos include `rocminfo 5.0.0-1`. This package exists to provide stub implementations of ROCm tools. When you add AMD's official ROCm 6.1 repository and try to install `rocm-hip-sdk`, apt sees the conflict and fails:

```
rocm-hip-runtime: Depends: rocminfo (= 1.0.0.60100-82~22.04)
                  but 5.0.0-1 is to be installed
```

The version numbers look backwards (5.0.0 > 1.0.0) but they're not comparable — AMD's rocminfo uses a different versioning scheme entirely. `5.0.0-1` is Ubuntu's stub; `1.0.0.60100` is AMD's real package at ROCm 6.1.

**Fix:** Remove Ubuntu's ROCm stubs before installing from AMD's repo, then pin the AMD repo to priority 1001 so it always wins in future apt operations. See [Troubleshooting](troubleshooting.md#rocm-61-install-blocked-by-ubuntus-rocminfo-500).

**Lesson:** Always purge Ubuntu's ROCm stubs before adding AMD's ROCm repo. Add the apt pin immediately.
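The pin itself is a small apt preferences file, e.g. `/etc/apt/preferences.d/rocm-pin-600`. The filename and the `o=repo.radeon.com` origin below are illustrative; check `apt-cache policy` for the exact origin string on your system:

```
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 1001
```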

---

## 3. `set -e` + `grep` = silent script death

The automated setup script (`wsl2_rocm_setup.sh`) was configured with `set -euo pipefail` for safety. However, certain commands that pipe through `grep -v` would cause the entire script to silently exit with no error message.

The cause: when `apt-get -qq` runs with nothing to output (the package is already installed, or there are no packages matching), the `grep -v` that follows gets empty input and returns exit code 1 — "no lines matched the invert pattern." With `set -e`, exit code 1 from any command is fatal. The script dies silently at the first line that runs `| grep -v anything` on empty input.

The debug session was confusing because there was no error — just a prompt returning after printing one progress message.

**Fix:** `|| true` after any `grep` in a pipeline where empty output is possible. Also drop `-u` from `set -euo pipefail` if you have variables that might be unset legitimately.

**Lesson:** When a `set -e` script exits silently, check every pipe for commands that could return non-zero on "no results" — grep, awk, `wc -l` comparisons, etc.
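The failure is easy to reproduce in isolation. A minimal standalone sketch (not taken from `wsl2_rocm_setup.sh`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# grep exits 1 when it outputs nothing, which is fatal under `set -e`.
# Without `|| true`, this assignment kills the script with no message.
kept=$(printf '' | grep -cv '^#' || true)
echo "kept ${kept} lines"
```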

---

## 4. Dependency packages silently replace your ROCm torch

Installing packages whose PyPI dependencies pin specific PyTorch versions will overwrite your ROCm build. This happened twice:

- `fairseq2==0.3.0` pins `torch==2.5.1`. pip fetched that version from PyPI, which is the standard CUDA build. ROCm build gone.
- After reinstalling ROCm torch 2.6.0, torchaudio 2.2.2 was installed separately, causing a version mismatch (`libcudart.so.13` error from the torchaudio build expecting torch 2.2.x).

Each iteration added 10–20 minutes of reinstall time and debugging.

**Lesson:** Install torch last, always. Use `--no-deps` for packages that try to pull their own torch. After any package install, verify `torch.version.hip` is still set. Consider using pip constraints or a lock file.
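A pip constraints file is one way to enforce this; pass it with `pip install -c constraints.txt ...`. The version tag below matches the build used in this write-up, so adjust it to your own install:

```
# constraints.txt -- stops transitive dependencies from replacing the ROCm build
torch==2.6.0+rocm6.1
```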

---

## 5. fairseq2n CPU binary doesn't exist for 0.2.1

`seamless_communication 1.0.0` requires `fairseq2==0.2.*`. `fairseq2 0.2.1` requires `fairseq2n==0.2.1` (a C extension binary). The `fairseq2n` package on PyPI ships a CUDA-linked binary — it needs `libcudart.so.12` to import.

Meta provides a CPU build server: `https://fair-src-fairseq2-build-publish.s3.amazonaws.com/whl/cpu/index.html` — but it only has builds for fairseq2n 0.3.x, not 0.2.1. So the official CPU binary for the version required by seamless_communication simply does not exist.

The solution that worked: install `nvidia-cuda-runtime-cu12` via pip, which provides `libcudart.so.12` inside the venv's site-packages, then set `LD_LIBRARY_PATH` to point at it. This lets the CUDA-linked `fairseq2n.so` load correctly even on a machine with no NVIDIA GPU.

**Lesson:** When a package claims to need CUDA but you don't have CUDA, try installing the CUDA runtime stub wheel first before assuming you need to rebuild from source.
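Rather than hard-coding the path, the stub's lib directory can be derived from site-packages. A sketch: the `nvidia/cuda_runtime/lib` layout is what the wheel installed in this setup, but verify it on your machine, and remember `LD_LIBRARY_PATH` only affects processes launched afterwards:

```python
import os
import sysconfig

# Directory where the nvidia-cuda-runtime-cu12 wheel drops libcudart.so.12
site_packages = sysconfig.get_paths()["purelib"]
cudart_dir = os.path.join(site_packages, "nvidia", "cuda_runtime", "lib")

# Prepend for child processes; the current process's loader has already
# read LD_LIBRARY_PATH, so export this before starting Python itself.
existing = [p for p in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
os.environ["LD_LIBRARY_PATH"] = os.pathsep.join([cudart_dir] + existing)
```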

---

## 6. torch-directml requires Python ≤3.11

`torch-directml` is Microsoft's DirectML backend for PyTorch. It provides AMD (and any DirectX 12) GPU acceleration on Windows without needing ROCm. It's genuinely useful and easy to install — but it has a hard Python version ceiling of 3.11.

This is a significant limitation because many projects now target Python 3.12+. The workaround is to maintain a separate `venv311` environment specifically for DirectML workloads. This is awkward but workable.

The underlying reason is that `torch-directml` contains compiled C extensions that were built against Python 3.11's ABI. Microsoft hasn't released 3.12 wheels as of the time of writing.

**Lesson:** Plan for a separate Python 3.11 venv on Windows if DirectML is on your path. Build your code to be venv-agnostic so switching is easy.
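A guard like the following keeps the constraint explicit in code (a hypothetical helper, reflecting wheel availability at the time of writing):

```python
import sys

def directml_supported(version: tuple = sys.version_info[:2]) -> bool:
    """torch-directml publishes wheels only up to CPython 3.11 (as of writing)."""
    return version <= (3, 11)

if not directml_supported():
    print("Python too new for torch-directml; switch to a 3.11 venv instead")
```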

---

## 7. numpy 2.x breaks fairseq2 0.2.1

`fairseq2 0.2.1` was compiled against NumPy 1.x. NumPy 2.0 introduced breaking C extension ABI changes. If pip installs NumPy 2.x (which it does by default now), importing `fairseq2` crashes:

```
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.6
_ARRAY_API not found
```

Fix: `pip install "numpy~=1.23" --force-reinstall`.

**Lesson:** Any package with compiled C extensions and a `numpy~=1.x` pin is going to break if pip installs numpy 2.x before it. Add an explicit numpy pin to your requirements file before installing such packages.

---

## 8. WSL2 GPU passthrough needs /dev/kfd

Even with a recent AMD driver, `/dev/kfd` (the AMD GPU compute device node) may not appear in WSL2 if:
- The AMD Adrenalin driver version is below 22.20
- The Windows build is older than Windows 10 21H2

In our case, `/dev/kfd` was missing because the driver hadn't been updated yet. This caused `rocminfo` inside WSL2 to report no agents, even though ROCm was installed correctly.

**Lesson:** Verify `/dev/kfd` exists before troubleshooting anything else. If it doesn't exist, the fix is a driver update in Windows — nothing inside WSL2 will help.
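The check is a few lines of shell, worth putting at the top of any ROCm setup script:

```shell
#!/usr/bin/env bash
# No /dev/kfd means no ROCm compute, full stop.
if [ -e /dev/kfd ]; then
  echo "/dev/kfd present: ROCm can reach the GPU"
else
  echo "/dev/kfd missing: update the Windows AMD driver (>= 22.20), then restart WSL2"
fi
```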

---

## 9. ROCm uses the CUDA compatibility layer — model.to("cuda") works

ROCm PyTorch exposes AMD GPUs through a CUDA compatibility layer. From the Python API perspective, `torch.cuda.is_available()` returns `True`, `torch.cuda.get_device_name(0)` returns the AMD card name, and `model.to("cuda:0")` puts the model on the AMD GPU.

This is by design. The practical consequence: code written for NVIDIA CUDA often works on AMD ROCm with zero changes. The catch is that some CUDA-specific operations (`torch.cuda.amp`, certain custom CUDA kernels) may not be supported.

**Lesson:** Don't create separate "CUDA" and "ROCm" code paths. Use `get_torch_device()`, which returns `torch.device("cuda:0")` for both — the ROCm PyTorch build handles the rest.

---

## 10. The model download hits the disk hard

The SeamlessM4T v2 large model is ~8.5GB for the main checkpoint plus ~160MB for the vocoder. On first run, it downloads to `~/.cache/huggingface/hub/`. This is inside the WSL2 virtual disk, which lives on the Windows C: drive.

On a machine with limited C: drive space, this is immediately a problem. The WSL2 virtual disk is also not easily inspectable from Windows Explorer, so users may not realize a 9GB file just appeared.

**Lesson:** Warn users about the model download size before first run. Consider setting `HF_HOME` to redirect the cache to a larger drive. On a machine with an external drive, this is essential.
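Redirecting the cache is a single environment variable; the path below is a hypothetical mount for a larger drive:

```shell
# Send Hugging Face downloads (~9GB for SeamlessM4T v2) to a roomier drive.
export HF_HOME="${HF_HOME:-/mnt/d/hf-cache}"
echo "HF cache: ${HF_HOME}"
```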
139
+
140
+ ---
141
+
142
+ ## Summary: What the setup actually requires
143
+
144
+ Getting AMD ROCm + fairseq2 + seamless_communication working requires touching at least 8 separate failure points that are not documented together anywhere:
145
+
146
+ 1. Remove Ubuntu's conflicting ROCm stubs
147
+ 2. Pin the ROCm apt repo to priority 1001
148
+ 3. Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` before importing torch
149
+ 4. Add user to `render` and `video` groups
150
+ 5. Install PyTorch from the ROCm-specific index URL
151
+ 6. Install `nvidia-cuda-runtime-cu12` for the CUDA stub
152
+ 7. Set `LD_LIBRARY_PATH` to the stub's lib directory
153
+ 8. Pin numpy to `~=1.23` before installing fairseq2
154
+
155
+ None of these steps are individually complex. But they're scattered across AMD documentation, Meta's fairseq2 GitHub issues, Ubuntu Launchpad bug reports, and Stack Overflow threads. The goal of `torch-amd-setup` is to encode as much of this as possible so future projects don't start from scratch.