torch-amd-setup 0.1.0 (tar.gz)
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- torch_amd_setup-0.1.0/.gitignore +17 -0
- torch_amd_setup-0.1.0/LICENSE +21 -0
- torch_amd_setup-0.1.0/PKG-INFO +210 -0
- torch_amd_setup-0.1.0/README.md +167 -0
- torch_amd_setup-0.1.0/docs/lessons-learned.md +155 -0
- torch_amd_setup-0.1.0/docs/troubleshooting.md +274 -0
- torch_amd_setup-0.1.0/docs/tutorial.md +317 -0
- torch_amd_setup-0.1.0/examples/basic_usage.py +50 -0
- torch_amd_setup-0.1.0/pyproject.toml +81 -0
- torch_amd_setup-0.1.0/torch_amd_setup/__init__.py +37 -0
- torch_amd_setup-0.1.0/torch_amd_setup/__main__.py +13 -0
- torch_amd_setup-0.1.0/torch_amd_setup/detect.py +475 -0

+++ torch_amd_setup-0.1.0/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 ChharithOeun

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

+++ torch_amd_setup-0.1.0/PKG-INFO
@@ -0,0 +1,210 @@
Metadata-Version: 2.4
Name: torch-amd-setup
Version: 0.1.0
Summary: Auto-detects the best PyTorch compute device for AMD GPUs, with gfx1010 ROCm override support (RX 5700 XT, RX 5600 XT, Navi 10)
Project-URL: Homepage, https://github.com/ChharithOeun/torch-amd-setup
Project-URL: Repository, https://github.com/ChharithOeun/torch-amd-setup
Project-URL: Issues, https://github.com/ChharithOeun/torch-amd-setup/issues
Project-URL: Documentation, https://github.com/ChharithOeun/torch-amd-setup/tree/main/docs
License: MIT
License-File: LICENSE
Keywords: amd,device-detection,directml,gfx1010,gpu,machine-learning,navi10,pytorch,rocm,rx5700xt
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Provides-Extra: cpu
Requires-Dist: torch>=2.2.0; extra == 'cpu'
Provides-Extra: cuda
Requires-Dist: torch>=2.2.0; extra == 'cuda'
Requires-Dist: torchaudio; extra == 'cuda'
Requires-Dist: torchvision; extra == 'cuda'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: directml
Requires-Dist: torch-directml; extra == 'directml'
Requires-Dist: torch==2.3.0; extra == 'directml'
Provides-Extra: rocm
Requires-Dist: torch>=2.2.0; extra == 'rocm'
Requires-Dist: torchaudio; extra == 'rocm'
Requires-Dist: torchvision; extra == 'rocm'
Description-Content-Type: text/markdown

# torch-amd-setup

**Auto-detects the best PyTorch compute device for AMD GPUs** — with first-class support for cards that are not in ROCm's default allow-list (RX 5700 XT, RX 5600 XT, RX 5500 XT / gfx1010–gfx1012).

One import. No manual env var hunting. Works on Windows, Linux, WSL2, and macOS.

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype

device_type = get_best_device()  # "rocm" | "dml" | "cuda" | "mps" | "cpu"
device = get_torch_device()      # torch.device ready for model.to()
dtype = get_dtype()              # torch.float16 or torch.float32
```

---

## The problem this solves

AMD GPUs that use the **gfx1010 architecture** (Navi 10 — RX 5700 XT, RX 5700, RX 5600 XT) are not in ROCm's default supported GPU list. PyTorch on ROCm will silently fall back to CPU unless you set:

```bash
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

...but it has to be set *before* Python imports torch, which means you either:
- Remember to set it in every shell session, or
- Bake it into a shell script wrapper, or
- Set it in your Python script before the first `import torch`

`torch-amd-setup` handles all of that automatically. It also detects DirectML on Windows (no ROCm required), Apple MPS on macOS, NVIDIA CUDA, and falls back to CPU — so you can ship one codebase that works everywhere.

---

## Detection priority

| Priority | Backend      | Platform            | Requirement                           |
|----------|--------------|---------------------|---------------------------------------|
| 1        | NVIDIA CUDA  | Any                 | Standard `pip install torch`          |
| 2        | AMD ROCm     | Linux / WSL2        | ROCm PyTorch + AMD driver ≥22.20      |
| 3        | AMD DirectML | Windows             | `pip install torch-directml`, Py≤3.11 |
| 4        | Apple MPS    | macOS Apple Silicon | Standard `pip install torch`          |
| 5        | CPU          | Any                 | Always available, always slow         |

---

## Install

```bash
pip install torch-amd-setup
```

> `torch` is not a hard dependency — install the appropriate torch variant for your hardware first (see [Tutorial](docs/tutorial.md)).

---

## Quick start

```python
from torch_amd_setup import get_best_device, get_torch_device, get_dtype
import torch

device_type = get_best_device()
device = get_torch_device(device_type)
dtype = get_dtype(device_type)

print(f"Using: {device_type} → {device} @ {dtype}")

# Load your model
model = MyModel().to(device).to(dtype)
```

### Diagnostics CLI

```bash
python -m torch_amd_setup
```

Output:
```
── torch-amd-setup diagnostics ──────────────────────────────
python_version     3.10.12
platform           Linux-6.6.x-WSL2-x86_64
best_device        rocm
cuda_available     True
cuda_device_name   AMD Radeon RX 5700 XT
cuda_vram_mb       8176
rocm_available     True
torch_version      2.6.0+rocm6.1
...
```

---

## API Reference

### `get_best_device() → str`
Returns the best available device type as a string: `"cuda"`, `"rocm"`, `"dml"`, `"mps"`, or `"cpu"`.

### `get_torch_device(device_type=None) → torch.device`
Returns a `torch.device` object (or a DirectML device object for `"dml"`) ready for `model.to()`. If `device_type` is `None`, calls `get_best_device()` automatically.

### `get_dtype(device_type=None) → torch.dtype`
Returns `torch.float16` for CUDA/ROCm/MPS, and `torch.float32` for DirectML/CPU. DirectML float16 support is unreliable; this keeps you safe.
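
That rule is simple enough to sketch without torch installed. Below, strings stand in for `torch.dtype` objects, and `pick_dtype` is an illustrative restatement of the documented behavior, not the package's actual implementation:

```python
# Half precision where it is reliable; full precision where it is not.
HALF_PRECISION_BACKENDS = {"cuda", "rocm", "mps"}

def pick_dtype(device_type: str) -> str:
    # DirectML float16 is unreliable and CPU float16 is slow, so both
    # fall back to float32, mirroring get_dtype() as documented above.
    return "float16" if device_type in HALF_PRECISION_BACKENDS else "float32"

print(pick_dtype("rocm"))  # float16
print(pick_dtype("dml"))   # float32
```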

### `device_info() → dict`
Returns a diagnostic dictionary with all detected hardware info. Useful for logging and bug reports.

### `get_install_guide() → str`
Returns platform-appropriate install instructions as a formatted string.

### `get_wsl2_install_guide() → str`
Returns the full WSL2 + ROCm setup walkthrough for AMD GPUs on Windows.

### `AMD_ROCM_ENV: dict`
The environment variable overrides applied for gfx1010 support. You can inspect or override these before calling `get_best_device()`.

---

## AMD GPU compatibility

| GPU                       | Architecture | HSA Override | Tested         |
|---------------------------|--------------|--------------|----------------|
| RX 5700 XT                | gfx1010      | `10.3.0`     | ✅             |
| RX 5700                   | gfx1010      | `10.3.0`     | ✅             |
| RX 5600 XT                | gfx1010      | `10.3.0`     | ✅             |
| RX 5500 XT                | gfx1012      | `10.3.0`     | ⚠️ reported    |
| RX 6000 series (gfx1030+) | RDNA2        | Not needed   | ✅ native ROCm |
| RX 7000 series (gfx1100+) | RDNA3        | Not needed   | ✅ native ROCm |

If your card isn't listed, check `GFX_OVERRIDE_MAP` in `detect.py` and open a PR.

---

## Windows users: DirectML vs WSL2

| Feature              | DirectML          | WSL2 + ROCm          |
|----------------------|-------------------|----------------------|
| Setup difficulty     | Easy              | Medium               |
| float16 support      | ❌ (float32 only) | ✅                   |
| Python version limit | 3.11 max          | Any                  |
| GPU memory usage     | ~1.5× higher      | Native               |
| Best for             | Quick experiments | Production workloads |

---

## Contributing

PRs welcome. Especially interested in:
- Verified gfx override values for additional GPU models
- ROCm 6.2+ compatibility reports
- Windows DirectML on NVIDIA/Intel test results

Please open an issue before large PRs.

---

## License

MIT — see [LICENSE](LICENSE).

---

## Background

This package was extracted from a private AI music pipeline project. The gfx1010 ROCm workaround was discovered the hard way — through several hours of cascading PyTorch installs, ROCm SDK conflicts, and dependency hell. The goal is that nobody else has to spend that time.

See [docs/lessons-learned.md](docs/lessons-learned.md) for the full story.

+++ torch_amd_setup-0.1.0/README.md
@@ -0,0 +1,167 @@
(167 added lines, identical to the Markdown description embedded in PKG-INFO, so not repeated here.)

+++ torch_amd_setup-0.1.0/docs/lessons-learned.md
@@ -0,0 +1,155 @@
# Lessons Learned: Building AMD ROCm + PyTorch Support from Scratch

**Date:** 2026-03-23
**Context:** Extracting `torch-amd-setup` from a private AI audio pipeline project.
**Hardware:** AMD Radeon RX 5700 XT (gfx1010 / Navi 10), Windows 11, WSL2 Ubuntu 22.04.

This document is a raw account of every mistake made, every dependency wall hit, and every workaround discovered while getting AMD GPU acceleration working with PyTorch and Seamless M4T. Written so you don't have to spend the same time.

---

## 1. The gfx1010 problem — your GPU exists but ROCm ignores it

The single biggest source of confusion: the AMD RX 5700 XT is a capable GPU, it's supported by the AMD Adrenalin driver, and it works fine for gaming. But ROCm (AMD's GPU compute stack) has an explicit list of officially supported GPU architectures, and gfx1010 is not on it.

When you install the ROCm version of PyTorch and run `torch.cuda.is_available()`, it returns `False`. No error, no explanation — just `False`. This led to hours of assuming the ROCm install was broken, when the actual issue was a single missing environment variable:

```bash
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```

This has to be set **before Python imports torch**. Setting it after `import torch` does nothing. The reason: ROCm checks the GPU architecture at init time and caches the result. If the env var isn't present at that moment, the GPU is invisible for the rest of the process.

**Lesson:** If `torch.cuda.is_available()` returns False on ROCm, check the env var before anything else. Don't re-install ROCm.
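
In code, the ordering constraint looks like this (a minimal sketch; the actual torch import is left as a comment so the snippet stands alone even where torch isn't installed):

```python
import os

# Must run before torch is first imported anywhere in the process:
# ROCm probes the GPU architecture once at init and caches the result.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

print(os.environ["HSA_OVERRIDE_GFX_VERSION"])

# Only now: `import torch` -- the override is visible to ROCm's init.
```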

---

## 2. Ubuntu 22.04 ships its own broken rocminfo

Ubuntu 22.04's default `apt` repos include `rocminfo 5.0.0-1`. This package exists to provide stub implementations of ROCm tools. When you add AMD's official ROCm 6.1 repository and try to install `rocm-hip-sdk`, apt sees the conflict and fails:

```
rocm-hip-runtime: Depends: rocminfo (= 1.0.0.60100-82~22.04)
                  but 5.0.0-1 is to be installed
```

The version numbers look backwards (5.0.0 > 1.0.0) but they're not comparable — AMD's rocminfo uses a different versioning scheme entirely. `5.0.0-1` is Ubuntu's stub; `1.0.0.60100` is AMD's real package at ROCm 6.1.

**Fix:** Remove Ubuntu's ROCm stubs before installing from AMD's repo, then pin the AMD repo to priority 1001 so it always wins in future apt operations. See [Troubleshooting](troubleshooting.md#rocm-61-install-blocked-by-ubuntus-rocminfo-500).

**Lesson:** Always purge Ubuntu's ROCm stubs before adding AMD's ROCm repo. Add the apt pin immediately.
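
The purge-and-pin sequence can be sketched as follows. Treat it as a config fragment, not a script to paste blindly: the stub package list is illustrative, and `o=repo.radeon.com` is the origin AMD's repo advertises; verify with `apt-cache policy` on your machine.

```shell
# Remove Ubuntu's ROCm stubs so they can't satisfy AMD's dependencies
# (package list is illustrative; adjust to whatever apt reports as conflicting).
sudo apt-get purge -y rocminfo

# Pin AMD's repo above Ubuntu's own packages. A priority above 1000 wins
# even over installed versions, so the stubs cannot come back in an upgrade.
sudo tee /etc/apt/preferences.d/rocm-pin <<'EOF'
Package: *
Pin: release o=repo.radeon.com
Pin-Priority: 1001
EOF

sudo apt-get update
```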

---

## 3. set -e + grep = silent script death

When writing the automated setup script (`wsl2_rocm_setup.sh`), the script was configured with `set -euo pipefail` for safety. However, certain commands that pipe through `grep -v` would cause the entire script to silently exit with no error message.

The cause: when `apt-get -qq` runs with nothing to output (the package is already installed, or there are no packages matching), the `grep -v` that follows gets empty input and returns exit code 1 — "no lines matched the invert pattern." With `set -e`, exit code 1 from any command is fatal. The script dies silently at the first line that runs `| grep -v anything` on empty input.

The debug session was confusing because there was no error — just a prompt returning after printing one progress message.

**Fix:** `|| true` after any `grep` in a pipeline where empty output is possible. Also drop `-u` from `set -euo pipefail` if you have variables that might be unset legitimately.

**Lesson:** When a `set -e` script exits silently, check every pipe for commands that could return non-zero on "no results" — grep, awk, wc -l comparisons, etc.
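
The failure mode and the fix reproduce in a few lines, safe to run anywhere (it only pipes `printf` output through `grep`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# grep exits 1 when it selects no lines. Here its input is empty, so without
# the `|| true` guard this pipeline would kill the whole script under set -e.
printf '' | grep -v 'some-pattern' || true
echo "survived empty grep"
```

Delete the `|| true` and rerun: the script exits before the `echo`, with no error message, which is exactly the silent death described above.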

---

## 4. Dependency packages silently replace your ROCm torch

Installing packages with PyPI dependencies that pin specific PyTorch versions will overwrite your ROCm build. This happened twice:

- `fairseq2==0.3.0` pins `torch==2.5.1`. pip fetched that version from PyPI, which is the standard CUDA build. ROCm build gone.
- After reinstalling ROCm torch 2.6.0, torchaudio 2.2.2 was installed separately, causing a version mismatch (`libcudart.so.13` error from the torchaudio build expecting torch 2.2.x).

Each iteration added 10–20 minutes of reinstall time and debugging.

**Lesson:** Install torch last, always. Use `--no-deps` for packages that try to pull their own torch. After any package install, verify `torch.version.hip` is still set. Consider using pip constraints or a lock file.
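
The `torch.version.hip` check can live in a tiny guard you run after every install step. This is a sketch: `assert_rocm_torch` is a hypothetical helper, and a stub object stands in for the real torch module so the example runs on any machine.

```python
from types import SimpleNamespace

def assert_rocm_torch(torch_module) -> None:
    # ROCm builds of torch set torch.version.hip; CUDA/CPU wheels leave it None.
    hip = getattr(getattr(torch_module, "version", None), "hip", None)
    if not hip:
        raise RuntimeError(
            "torch is no longer a ROCm build; reinstall from the ROCm index"
        )

# Stub standing in for `import torch` on a machine with the ROCm wheel.
rocm_like = SimpleNamespace(version=SimpleNamespace(hip="6.1"))
assert_rocm_torch(rocm_like)
print("ROCm torch still installed")
```

In a real pipeline you would call `assert_rocm_torch(torch)` right after each `pip install` of a heavy dependency, so a silent replacement fails loudly.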

---

## 5. fairseq2n CPU binary doesn't exist for 0.2.1

`seamless_communication 1.0.0` requires `fairseq2==0.2.*`. `fairseq2 0.2.1` requires `fairseq2n==0.2.1` (a C extension binary). The `fairseq2n` package on PyPI ships a CUDA-linked binary — it needs `libcudart.so.12` to import.

Meta provides a CPU build server: `https://fair-src-fairseq2-build-publish.s3.amazonaws.com/whl/cpu/index.html` — but it only has builds for fairseq2n 0.3.x, not 0.2.1. So the official CPU binary for the version required by seamless_communication simply does not exist.

The solution that worked: install `nvidia-cuda-runtime-cu12` via pip, which provides `libcudart.so.12` inside the venv's site-packages, then set `LD_LIBRARY_PATH` to point at it. This lets the CUDA-linked `fairseq2n.so` load correctly even on a machine with no NVIDIA GPU.

**Lesson:** When a package claims to need CUDA but you don't have CUDA, try installing the CUDA runtime stub wheel first before assuming you need to rebuild from source.
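
Locating the wheel's lib directory can be sketched like this. The `nvidia/cuda_runtime/lib` layout matches the `nvidia-cuda-runtime-cu12` wheel at the time of writing, but verify it on your install; on a machine without the wheel the candidate list is simply empty.

```python
import os
import site
from pathlib import Path

# nvidia-cuda-runtime-cu12 unpacks libcudart.so.12 under
# <site-packages>/nvidia/cuda_runtime/lib (wheel layout as of writing).
candidates = [Path(p, "nvidia", "cuda_runtime", "lib") for p in site.getsitepackages()]
existing = [str(p) for p in candidates if p.is_dir()]

# Note: the dynamic linker reads LD_LIBRARY_PATH at process start, so in
# practice you export the path in the shell that launches Python; setting it
# here only helps subprocesses you spawn afterwards.
prior = os.environ.get("LD_LIBRARY_PATH")
os.environ["LD_LIBRARY_PATH"] = os.pathsep.join(existing + ([prior] if prior else []))

print(f"{len(existing)} candidate lib dir(s) found")
```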

---

## 6. torch-directml requires Python ≤3.11

`torch-directml` is Microsoft's DirectML backend for PyTorch. It provides AMD (and any DirectX 12) GPU acceleration on Windows without needing ROCm. It's genuinely useful and easy to install — but it has a hard Python version ceiling of 3.11.

This is a significant limitation because many projects now target Python 3.12+. The workaround is to maintain a separate `venv311` environment specifically for DirectML workloads. This is awkward but workable.

The underlying reason is that `torch-directml` contains compiled C extensions that were built against Python 3.11's ABI. Microsoft hasn't released 3.12 wheels as of the time of writing.

**Lesson:** Plan for a separate Python 3.11 venv on Windows if DirectML is on your path. Build your code to be venv-agnostic so switching is easy.

---

## 7. numpy 2.x breaks fairseq2 0.2.1

`fairseq2 0.2.1` was compiled against NumPy 1.x. NumPy 2.0 introduced breaking C extension ABI changes. If pip installs NumPy 2.x (which it does by default now), importing `fairseq2` crashes:

```
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.6
_ARRAY_API not found
```

Fix: `pip install "numpy~=1.23" --force-reinstall`.

**Lesson:** Any package with compiled C extensions and a `numpy~=1.x` pin is going to break if pip installs numpy 2.x before it. Add an explicit numpy pin to your requirements file before installing such packages.

---

## 8. WSL2 GPU passthrough needs /dev/kfd

Even with a recent AMD driver, `/dev/kfd` (the AMD GPU compute device node) may not appear in WSL2 if:
- The AMD Adrenalin driver version is below 22.20
- The Windows version is below 10 21H2

In our case, `/dev/kfd` was missing because the driver hadn't been verified yet. This caused `rocminfo` inside WSL2 to report no agents, even though ROCm was installed correctly.

**Lesson:** Verify `/dev/kfd` exists before troubleshooting anything else. If it doesn't exist, the fix is a driver update in Windows — nothing inside WSL2 will help.
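
The check is a one-liner to run inside the WSL2 shell; it prints a hint either way:

```shell
if [ -e /dev/kfd ]; then
    echo "/dev/kfd present: GPU passthrough is wired up"
else
    echo "/dev/kfd missing: update the Windows AMD Adrenalin driver (>= 22.20)"
fi
```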

---

## 9. ROCm uses the CUDA compatibility layer — model.to("cuda") works

ROCm PyTorch exposes AMD GPUs through a CUDA compatibility layer. From the Python API perspective, `torch.cuda.is_available()` returns `True`, `torch.cuda.get_device_name(0)` returns the AMD card name, and `model.to("cuda:0")` puts the model on the AMD GPU.

This is intentional and by design. The practical consequence: code written for NVIDIA CUDA often works on AMD ROCm with zero changes. The catch is that some CUDA-specific operations (`torch.cuda.amp`, certain custom CUDA kernels) may not be supported.

**Lesson:** Don't create separate "CUDA" and "ROCm" code paths. Use `get_torch_device()` which returns `torch.device("cuda:0")` for both — the ROCm PyTorch build handles the rest.

---

## 10. The model download hits the disk hard

The SeamlessM4T v2 large model is ~8.5GB for the main checkpoint plus ~160MB for the vocoder. On first run, it downloads to `~/.cache/huggingface/hub/`. This is inside the WSL2 virtual disk, which lives on the Windows C: drive.

On a machine with limited C: drive space, this is immediately a problem. The WSL2 virtual disk is also not easily inspectable from Windows Explorer, so users may not realize a 9GB file just appeared.

**Lesson:** Warn users about the model download size before first run. Consider setting `HF_HOME` to redirect the cache to a larger drive. On a machine with an external drive, this is essential.
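
Redirecting the cache must happen before the first Hugging Face import, since the libraries read `HF_HOME` at import time. A minimal sketch, where `/mnt/d/hf-cache` is a placeholder for whatever larger drive you have:

```python
import os

# Point the Hugging Face cache at a bigger drive. setdefault respects a
# value the user already exported in the shell.
os.environ.setdefault("HF_HOME", "/mnt/d/hf-cache")

print(os.environ["HF_HOME"])

# Only after this: import huggingface_hub / transformers / seamless pipelines.
```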

---

## Summary: What the setup actually requires

Getting AMD ROCm + fairseq2 + seamless_communication working requires touching at least 8 separate failure points that are not documented together anywhere:

1. Remove Ubuntu's conflicting ROCm stubs
2. Pin the ROCm apt repo to priority 1001
3. Set `HSA_OVERRIDE_GFX_VERSION=10.3.0` before importing torch
4. Add user to `render` and `video` groups
5. Install PyTorch from the ROCm-specific index URL
6. Install `nvidia-cuda-runtime-cu12` for the CUDA stub
7. Set `LD_LIBRARY_PATH` to the stub's lib directory
8. Pin numpy to `~=1.23` before installing fairseq2

None of these steps are individually complex. But they're scattered across AMD documentation, Meta's fairseq2 GitHub issues, Ubuntu Launchpad bug reports, and Stack Overflow threads. The goal of `torch-amd-setup` is to encode as much of this as possible so future projects don't start from scratch.