torch-sla 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. torch_sla-0.1.0/LICENSE +9 -0
  2. torch_sla-0.1.0/MANIFEST.in +25 -0
  3. torch_sla-0.1.0/PKG-INFO +520 -0
  4. torch_sla-0.1.0/README.md +457 -0
  5. torch_sla-0.1.0/TODO.md +32 -0
  6. torch_sla-0.1.0/csrc/cudss/cudss_spsolve.cu +309 -0
  7. torch_sla-0.1.0/csrc/cusolver/cusolver_spsolve.cu +347 -0
  8. torch_sla-0.1.0/csrc/spsolve/spsolve.cpp +199 -0
  9. torch_sla-0.1.0/pyproject.toml +117 -0
  10. torch_sla-0.1.0/setup.cfg +4 -0
  11. torch_sla-0.1.0/setup.py +132 -0
  12. torch_sla-0.1.0/tests/test_batch_solve.py +295 -0
  13. torch_sla-0.1.0/tests/test_distributed.py +426 -0
  14. torch_sla-0.1.0/tests/test_distributed_matvec.py +142 -0
  15. torch_sla-0.1.0/tests/test_distributed_multiprocess.py +323 -0
  16. torch_sla-0.1.0/tests/test_distributed_solve.py +241 -0
  17. torch_sla-0.1.0/tests/test_io.py +362 -0
  18. torch_sla-0.1.0/tests/test_io_distributed.py +192 -0
  19. torch_sla-0.1.0/tests/test_matvec_multiprocess.py +146 -0
  20. torch_sla-0.1.0/tests/test_real_distributed.py +212 -0
  21. torch_sla-0.1.0/tests/test_sparse_tensor.py +833 -0
  22. torch_sla-0.1.0/tests/test_spsolve.py +321 -0
  23. torch_sla-0.1.0/torch_sla/__init__.py +184 -0
  24. torch_sla-0.1.0/torch_sla/backends/__init__.py +442 -0
  25. torch_sla-0.1.0/torch_sla/backends/pytorch_backend.py +1639 -0
  26. torch_sla-0.1.0/torch_sla/backends/scipy_backend.py +354 -0
  27. torch_sla-0.1.0/torch_sla/batch_solve.py +468 -0
  28. torch_sla-0.1.0/torch_sla/distributed.py +2873 -0
  29. torch_sla-0.1.0/torch_sla/io.py +1070 -0
  30. torch_sla-0.1.0/torch_sla/linear_solve.py +627 -0
  31. torch_sla-0.1.0/torch_sla/nonlinear_solve.py +647 -0
  32. torch_sla-0.1.0/torch_sla/sparse_tensor.py +3928 -0
  33. torch_sla-0.1.0/torch_sla.egg-info/SOURCES.txt +30 -0
@@ -0,0 +1,9 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright © 2024 walker chi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,25 @@
1
+ # Include C++ source files
2
+ recursive-include csrc *.cpp *.cu *.h *.cuh
3
+
4
+ # Include documentation
5
+ include README.md
6
+ include LICENSE
7
+ include TODO.md
8
+
9
+ # Include configuration files
10
+ include pyproject.toml
11
+ include setup.py
12
+
13
+ # Exclude compiled files
14
+ global-exclude *.pyc
15
+ global-exclude *.pyo
16
+ global-exclude __pycache__
17
+ global-exclude *.so
18
+ global-exclude *.o
19
+ global-exclude *.a
20
+
21
+ # Exclude build directories
22
+ prune build
23
+ prune dist
24
+ prune *.egg-info
25
+
@@ -0,0 +1,520 @@
1
+ Metadata-Version: 2.4
2
+ Name: torch-sla
3
+ Version: 0.1.0
4
+ Summary: PyTorch Sparse Linear Algebra - Differentiable sparse solvers with CUDA support
5
+ Home-page: https://github.com/walkerchi/torch-sla
6
+ Author: walkerchi
7
+ Author-email: walkerchi <walkerchi@example.com>
8
+ Maintainer-email: walkerchi <walkerchi@example.com>
9
+ License: MIT
10
+ Project-URL: Homepage, https://pypi.org/project/torch-sla/
11
+ Project-URL: Documentation, https://github.com/walkerchi/torch-sla#readme
12
+ Project-URL: Repository, https://github.com/walkerchi/torch-sla
13
+ Project-URL: PyPI, https://pypi.org/project/torch-sla/
14
+ Project-URL: Issues, https://github.com/walkerchi/torch-sla/issues
15
+ Keywords: pytorch,sparse,linear-algebra,cuda,cusolver,cudss,sparse-matrix,linear-solver,differentiable,autograd
16
+ Classifier: Development Status :: 4 - Beta
17
+ Classifier: Intended Audience :: Developers
18
+ Classifier: Intended Audience :: Science/Research
19
+ Classifier: License :: OSI Approved :: MIT License
20
+ Classifier: Operating System :: OS Independent
21
+ Classifier: Programming Language :: Python :: 3
22
+ Classifier: Programming Language :: Python :: 3.8
23
+ Classifier: Programming Language :: Python :: 3.9
24
+ Classifier: Programming Language :: Python :: 3.10
25
+ Classifier: Programming Language :: Python :: 3.11
26
+ Classifier: Programming Language :: Python :: 3.12
27
+ Classifier: Programming Language :: C++
28
+ Classifier: Topic :: Scientific/Engineering
29
+ Classifier: Topic :: Scientific/Engineering :: Mathematics
30
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
31
+ Requires-Python: >=3.8
32
+ Description-Content-Type: text/markdown
33
+ License-File: LICENSE
34
+ Requires-Dist: torch>=1.10.0
35
+ Requires-Dist: ninja
36
+ Provides-Extra: test
37
+ Requires-Dist: pytest>=6.0; extra == "test"
38
+ Requires-Dist: numpy>=1.19.0; extra == "test"
39
+ Requires-Dist: scipy>=1.5.0; extra == "test"
40
+ Provides-Extra: dev
41
+ Requires-Dist: pytest>=6.0; extra == "dev"
42
+ Requires-Dist: numpy>=1.19.0; extra == "dev"
43
+ Requires-Dist: scipy>=1.5.0; extra == "dev"
44
+ Requires-Dist: black; extra == "dev"
45
+ Requires-Dist: isort; extra == "dev"
46
+ Requires-Dist: mypy; extra == "dev"
47
+ Requires-Dist: pre-commit; extra == "dev"
48
+ Provides-Extra: docs
49
+ Requires-Dist: sphinx>=4.0; extra == "docs"
50
+ Requires-Dist: furo; extra == "docs"
51
+ Requires-Dist: sphinx-autodoc-typehints; extra == "docs"
52
+ Provides-Extra: cuda
53
+ Requires-Dist: nvidia-cudss-cu12>=0.7.0; extra == "cuda"
54
+ Provides-Extra: all
55
+ Requires-Dist: pytest>=6.0; extra == "all"
56
+ Requires-Dist: numpy>=1.19.0; extra == "all"
57
+ Requires-Dist: scipy>=1.5.0; extra == "all"
58
+ Requires-Dist: nvidia-cudss-cu12>=0.7.0; extra == "all"
59
+ Dynamic: author
60
+ Dynamic: home-page
61
+ Dynamic: license-file
62
+ Dynamic: requires-python
63
+
64
+ <p align="center">
65
+ <img src="assets/logo.jpg" alt="torch-sla logo" width="200">
66
+ </p>
67
+
68
+ <h1 align="center">torch-sla</h1>
69
+
70
+ <p align="center">
71
+ <b>PyTorch Sparse Linear Algebra</b> - A differentiable sparse linear equation solver library with multiple backends.
72
+ </p>
73
+
74
+ <p align="center">
75
+ <a href="https://badge.fury.io/py/torch-sla"><img src="https://badge.fury.io/py/torch-sla.svg" alt="PyPI version"></a>
76
+ <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
77
+ <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.8+-blue.svg" alt="Python 3.8+"></a>
78
+ </p>
79
+
80
+ ## Features
81
+
82
+ - 🔥 **Differentiable**: Full gradient support through `torch.autograd`
83
+ - 🚀 **Multiple Backends**: SciPy, Eigen (CPU), cuSOLVER, cuDSS, PyTorch-native (CUDA)
84
+ - 📦 **Batched Operations**: Support for batched sparse tensors `[..., M, N, ...]`
85
+ - 🎯 **Property Detection**: Auto-detect symmetry and positive definiteness
86
+ - ⚡ **High Performance**: Auto-selects best solver based on device, dtype, and problem size
87
+ - 🌐 **Distributed**: Domain decomposition with halo exchange (CFD/FEM style)
88
+ - 🔧 **Easy to Use**: `SparseTensor` class with solve, norm, eigs methods
89
+ - 🧮 **Nonlinear Solve**: Adjoint-based Newton/Anderson solvers with implicit differentiation
90
+
91
+ ## Installation
92
+
93
+ ```bash
94
+ # Basic installation
95
+ pip install torch-sla
96
+
97
+ # With cuDSS support (CUDA 12+, recommended for GPU)
98
+ pip install torch-sla[cuda]
99
+
100
+ # Full installation with all dependencies
101
+ pip install torch-sla[all]
102
+
103
+ # From source (for development)
104
+ git clone https://github.com/walkerchi/torch-sla.git
105
+ cd torch-sla
106
+ pip install -e ".[dev]"
107
+ ```
108
+
109
+ > **Note**: cuDSS (`nvidia-cudss-cu12`) is now available on PyPI! Installing `torch-sla[cuda]` will automatically include it.
110
+
111
+ ## Quick Start
112
+
113
+ ### Basic Solve
114
+
115
+ ```python
116
+ import torch
117
+ from torch_sla import SparseTensor
118
+
119
+ # Create sparse matrix in COO format
120
+ val = torch.tensor([4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0], dtype=torch.float64)
121
+ row = torch.tensor([0, 0, 1, 1, 1, 2, 2])
122
+ col = torch.tensor([0, 1, 0, 1, 2, 1, 2])
123
+
124
+ # Create SparseTensor
125
+ A = SparseTensor(val, row, col, (3, 3))
126
+
127
+ # Solve Ax = b
128
+ b = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float64)
129
+ x = A.solve(b)
130
+
131
+ # Specify backend and method
132
+ x = A.solve(b, backend='scipy', method='superlu')
133
+ ```
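As a cross-check on the Quick Start example, the sketch below rebuilds the same 3x3 system from its COO triplets using only plain Python lists and solves it with a small Gaussian elimination. It does not call `torch_sla`; the helpers `coo_to_dense` and `gauss_solve` are hypothetical names introduced here for illustration.

```python
# COO triplets from the Quick Start example (plain lists instead of tensors).
val = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
row = [0, 0, 1, 1, 1, 2, 2]
col = [0, 1, 0, 1, 2, 1, 2]
b = [1.0, 2.0, 3.0]
n = 3


def coo_to_dense(val, row, col, n):
    """Scatter COO entries into a dense n x n list of lists (duplicates add up)."""
    dense = [[0.0] * n for _ in range(n)]
    for v, i, j in zip(val, row, col):
        dense[i][j] += v
    return dense


def gauss_solve(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(matrix[i]) + [rhs[i]] for i in range(size)]
    for k in range(size):
        pivot = max(range(k, size), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, size):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, size + 1):
                aug[r][c] -= factor * aug[k][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (aug[i][size] - tail) / aug[i][i]
    return x


A_dense = coo_to_dense(val, row, col, n)
x = gauss_solve(A_dense, b)
```

The matrix is the tridiagonal `[[4, -1, 0], [-1, 4, -1], [0, -1, 4]]`, so the exact solution is `[13/28, 6/7, 27/28]`, which a direct backend should reproduce to machine precision.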
134
+
135
+ ### CUDA Solve
136
+
137
+ ```python
138
+ # Move to CUDA
139
+ A_cuda = A.cuda()
140
+ b_cuda = b.cuda()
141
+
142
+ # Auto-selects cudss+cholesky (best for CUDA)
143
+ x = A_cuda.solve(b_cuda)
144
+
145
+ # Or explicitly specify
146
+ x = A_cuda.solve(b_cuda, backend='cudss', method='cholesky')
147
+
148
+ # For very large problems (DOF > 2M), use iterative
149
+ x = A_cuda.solve(b_cuda, backend='pytorch', method='cg')
150
+ ```
151
+
152
+ ## Recommended Backends
153
+
154
+ Based on benchmarks on 2D Poisson equations (tested up to **169M DOF**):
155
+
156
+ | Problem Size | CPU | CUDA | Notes |
157
+ |-------------|-----|------|-------|
158
+ | **Small (< 100K DOF)** | `scipy+superlu` | `cudss+cholesky` | Direct solvers, machine precision |
159
+ | **Medium (100K - 2M DOF)** | `scipy+superlu` | `cudss+cholesky` | cuDSS is fastest on GPU |
160
+ | **Large (2M - 169M DOF)** | N/A | `pytorch+cg` | **Iterative only**, ~1e-6 precision |
161
+
162
+ ### Key Insights
163
+
164
+ 1. **PyTorch CG+Jacobi scales to 169M+ DOF** with near-linear O(n^1.1) complexity
165
+ 2. **Direct solvers limited to ~2M DOF** due to memory (O(n^1.5) fill-in)
166
+ 3. **Use float64** for best convergence with iterative solvers
167
+ 4. **Trade-off**: Direct = machine precision; Iterative = ~1e-6 but roughly 37x to 127x faster at 2M DOF (see Benchmark Results)
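To make "CG+Jacobi" concrete, here is a minimal pure-Python sketch of a Jacobi-preconditioned conjugate gradient loop over COO triplets. It is a textbook formulation written for illustration, not the code in `pytorch_backend.py`; the function names `coo_matvec` and `jacobi_pcg` and the relative-residual stopping rule are assumptions.

```python
def coo_matvec(val, row, col, x):
    """Compute y = A @ x for a matrix stored as COO triplets."""
    y = [0.0] * len(x)
    for v, i, j in zip(val, row, col):
        y[i] += v * x[j]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(val, row, col, b, tol=1e-10, max_iter=100):
    """Preconditioned CG for an SPD matrix, with M = diag(A) as preconditioner."""
    n = len(b)
    # Jacobi preconditioner: store the inverse of the diagonal of A.
    diag = [0.0] * n
    for v, i, j in zip(val, row, col):
        if i == j:
            diag[i] += v
    inv_diag = [1.0 / d for d in diag]

    x = [0.0] * n
    r = list(b)  # residual b - A @ x for the zero initial guess
    z = [m * ri for m, ri in zip(inv_diag, r)]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for iteration in range(1, max_iter + 1):
        Ap = coo_matvec(val, row, col, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        # Stop once the relative residual norm drops below tol.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, iteration
        z = [m * ri for m, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Same SPD tridiagonal system as in the Quick Start example.
val = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
row = [0, 0, 1, 1, 1, 2, 2]
col = [0, 1, 0, 1, 2, 1, 2]
x, iterations = jacobi_pcg(val, row, col, [1.0, 2.0, 3.0])
```

Each iteration needs one sparse matrix-vector product plus a handful of vector operations, and the only stored state is a few length-n vectors. That is why this kind of solver can keep memory at O(n) while a direct factorization pays for fill-in.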
168
+
169
+ ## Backends and Methods
170
+
171
+ ### Available Backends
172
+
173
+ | Backend | Device | Description | Recommended For |
174
+ |---------|--------|-------------|-----------------|
175
+ | `scipy` | CPU | SciPy (SuperLU/UMFPACK) | **CPU default** - fast + machine precision |
176
+ | `eigen` | CPU | Eigen C++ (CG, BiCGStab) | Alternative CPU iterative |
177
+ | `cudss` | CUDA | NVIDIA cuDSS (LU, Cholesky, LDLT) | **CUDA default** - fastest direct |
178
+ | `cusolver` | CUDA | NVIDIA cuSOLVER | Not recommended (slower, no float32) |
179
+ | `pytorch` | CUDA | PyTorch-native (CG, BiCGStab) | Very large problems (> 2M DOF) |
180
+
181
+ ### Solver Methods
182
+
183
+ | Method | Backends | Best For | Precision |
184
+ |--------|----------|----------|-----------|
185
+ | `superlu` | scipy | General matrices | Machine precision |
186
+ | `cholesky` | cudss, cusolver | **SPD matrices (fastest)** | Machine precision |
187
+ | `ldlt` | cudss | Symmetric matrices | Machine precision |
188
+ | `lu` | cudss, cusolver | General matrices | Machine precision |
189
+ | `cg` | scipy, eigen, pytorch | SPD matrices (iterative) | ~1e-6 to 1e-7 |
190
+ | `bicgstab` | scipy, eigen, pytorch | General (iterative) | ~1e-6 to 1e-7 |
191
+
192
+ ## Batched Solve
193
+
194
+ ```python
195
+ # Batched matrices: same structure, different values
196
+ batch_size = 4
197
+ val_batch = val.unsqueeze(0).expand(batch_size, -1).clone()
198
+
199
+ # Create batched SparseTensor [B, M, N]
200
+ A = SparseTensor(val_batch, row, col, (batch_size, 3, 3))
201
+
202
+ # Batched solve
203
+ b = torch.randn(batch_size, 3, dtype=torch.float64)
204
+ x = A.solve(b) # Shape: [batch_size, 3]
205
+ ```
206
+
207
+ ## Distributed Computing (DSparseMatrix)
208
+
209
+ For large-scale problems across multiple GPUs, use domain decomposition:
210
+
211
+ ```python
212
+ import torch.distributed as dist
213
+ from torch_sla.distributed import DSparseMatrix, partition_simple
214
+
215
+ # Initialize distributed (each process runs this)
216
+ dist.init_process_group(backend='nccl') # or 'gloo' for CPU
217
+ rank = dist.get_rank()
218
+ world_size = dist.get_world_size()
219
+
220
+ # Each rank creates its local partition
221
+ A = DSparseMatrix.from_global(
222
+ val, row, col, shape,
223
+ num_partitions=world_size,
224
+ my_partition=rank,
225
+ partition_ids=partition_simple(n, world_size),
226
+ device=f'cuda:{rank}'
227
+ )
228
+
229
+ # Distributed CG solve (default: distributed=True)
230
+ x_owned = A.solve(b_owned, atol=1e-10)
231
+
232
+ # Distributed LOBPCG eigenvalues
233
+ eigenvalues, eigenvectors_owned = A.eigsh(k=5)
234
+
235
+ # Local subdomain solve (no global communication)
236
+ x_local = A.solve(b_owned, distributed=False)
237
+ ```
238
+
239
+ ```bash
240
+ # Run with 4 GPUs
241
+ torchrun --standalone --nproc_per_node=4 your_script.py
242
+ ```
243
+
244
+ ## Gradient Support
245
+
246
+ All operations support automatic differentiation:
247
+
248
+ ```python
249
+ val = val.requires_grad_(True)
250
+ b = b.requires_grad_(True)
251
+
252
+ x = A.solve(b)
253
+ loss = x.sum()
254
+ loss.backward()
255
+
256
+ print(val.grad) # Gradient w.r.t. matrix values
257
+ print(b.grad) # Gradient w.r.t. RHS
258
+ ```
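The adjoint idea behind differentiating a linear solve can be checked by hand on a tiny dense system. For `x = A⁻¹ b` and a scalar loss `L(x)`, one extra solve with the transpose, `Aᵀ λ = ∂L/∂x`, gives `∂L/∂b = λ` and `∂L/∂A[i][j] = -λ[i] * x[j]`. The sketch below uses a 2x2 matrix and Cramer's rule purely for illustration; it is not the library's implementation, and `solve2` and `loss` are hypothetical helpers.

```python
def solve2(a, rhs):
    """Solve a 2x2 dense system with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det,
        (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det,
    ]


def loss(a, rhs):
    """Scalar loss L = sum(x) where x solves a @ x = rhs."""
    return sum(solve2(a, rhs))


A = [[4.0, -1.0], [-2.0, 3.0]]
b = [1.0, 2.0]

# Forward solve.
x = solve2(A, b)

# Adjoint solve: A^T lam = dL/dx, and dL/dx is all ones for L = sum(x).
A_T = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
lam = solve2(A_T, [1.0, 1.0])

grad_b = lam
grad_A = [[-lam[i] * x[j] for j in range(2)] for i in range(2)]

# Central finite difference on A[0][1] as an independent check.
eps = 1e-6
A_plus = [[4.0, -1.0 + eps], [-2.0, 3.0]]
A_minus = [[4.0, -1.0 - eps], [-2.0, 3.0]]
fd_A01 = (loss(A_plus, b) - loss(A_minus, b)) / (2 * eps)
```

The backward pass costs one additional solve regardless of how many iterations the forward solver took, which is consistent with the "O(1) graph nodes" note in the summary table.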
259
+
260
+ ### Gradient Support Summary
261
+
262
+ #### SparseTensor
263
+
264
+ | Operation | CPU | CUDA | Notes |
265
+ |-----------|-----|------|-------|
266
+ | `solve()` | ✓ | ✓ | Adjoint method, O(1) graph nodes |
267
+ | `eigsh()` / `eigs()` | ✓ | ✓ | Adjoint method, O(1) graph nodes |
268
+ | `svd()` | ✓ | ✓ | Power iteration, differentiable |
269
+ | `nonlinear_solve()` | ✓ | ✓ | Adjoint, params only |
270
+ | `@` (A @ x, SpMV) | ✓ | ✓ | Standard autograd |
271
+ | `@` (A @ B, SpSpM) | ✓ | ✓ | Sparse gradients |
272
+ | `+`, `-`, `*` | ✓ | ✓ | Element-wise ops |
273
+ | `T()` (transpose) | ✓ | ✓ | View-like, gradients flow through |
274
+ | `norm()`, `sum()`, `mean()` | ✓ | ✓ | Standard autograd |
275
+ | `to_dense()` | ✓ | ✓ | Standard autograd |
276
+
277
+ #### DSparseMatrix (Multi-GPU)
278
+
279
+ | Operation | CPU (Gloo) | CUDA (NCCL) | Notes |
280
+ |-----------|------------|-------------|-------|
281
+ | `matvec()` | ✓ | ✓ | Halo exchange + local SpMV |
282
+ | `solve()` | ✓ | ✓ | Distributed CG (default `distributed=True`) |
283
+ | `eigsh()` | ✓ | ✓ | Distributed LOBPCG |
284
+ | `halo_exchange()` | ✓ | ✓ | P2P communication with neighbors |
285
+
286
+ **Communication per iteration**:
287
+ - `solve()`: Halo exchange + 2 all_reduce
288
+ - `eigsh()`: Halo exchange + O(k²) all_reduce
289
+
290
+ > **Note**: DSparseMatrix uses true distributed algorithms that only require distributed matvec + global reductions. No data gather is needed for core operations.
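The "halo exchange + local SpMV" pattern can be illustrated without any GPUs by simulating the ranks inside one process. The sketch below is only a toy model under stated assumptions: `contiguous_partition`, `local_view`, and `simulated_distributed_matvec` are hypothetical names, the partitioning rule is a guess and need not match `partition_simple`, and real halo exchange would use point-to-point communication instead of dictionary lookups.

```python
def contiguous_partition(n, parts):
    """Hypothetical helper: assign row i to partition i * parts // n."""
    return [i * parts // n for i in range(n)]


def build_laplacian_1d(n):
    """COO triplets of the 1D Laplacian stencil (-1, 2, -1)."""
    val, row, col = [], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                val.append(v)
                row.append(i)
                col.append(j)
    return val, row, col


def local_view(val, row, col, part_ids, rank):
    """Rows owned by `rank`, the halo columns they touch, and the local entries."""
    owned = [i for i, p in enumerate(part_ids) if p == rank]
    owned_set = set(owned)
    entries = [(v, i, j) for v, i, j in zip(val, row, col) if i in owned_set]
    halo = sorted({j for _, _, j in entries if j not in owned_set})
    return owned, halo, entries


def simulated_distributed_matvec(val, row, col, part_ids, x_global, parts):
    # Each simulated rank starts with only the x entries it owns.
    x_owned = [
        {i: x_global[i] for i, p in enumerate(part_ids) if p == r}
        for r in range(parts)
    ]
    y = [0.0] * len(x_global)
    for rank in range(parts):
        _, halo, entries = local_view(val, row, col, part_ids, rank)
        # Halo exchange: fetch each halo value from the rank that owns it.
        x_local = dict(x_owned[rank])
        for j in halo:
            x_local[j] = x_owned[part_ids[j]][j]
        # Local SpMV over the rows this rank owns.
        for v, i, j in entries:
            y[i] += v * x_local[j]
    return y


n, parts = 8, 2
val, row, col = build_laplacian_1d(n)
part_ids = contiguous_partition(n, parts)
x_global = [float(i + 1) for i in range(n)]
y = simulated_distributed_matvec(val, row, col, part_ids, x_global, parts)
halo_rank0 = local_view(val, row, col, part_ids, 0)[1]
```

With two contiguous partitions of a tridiagonal matrix, each rank needs exactly one value from its neighbor, so communication scales with the partition boundary and not with the subdomain size.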
291
+
292
+ ## Persistence (I/O)
293
+
294
+ Save and load sparse tensors using `safetensors` format:
295
+
296
+ ```python
297
+ from torch_sla import SparseTensor, DSparseTensor, DSparseMatrix
298
+ from torch_sla import load_sparse_as_partition, load_distributed_as_sparse
299
+
300
+ # Save SparseTensor
301
+ A = SparseTensor(val, row, col, shape)
302
+ A.save("matrix.safetensors")
303
+
304
+ # Load SparseTensor
305
+ A = SparseTensor.load("matrix.safetensors", device="cuda")
306
+
307
+ # Save as partitioned (for distributed loading)
308
+ A.save_distributed("matrix_dist", num_partitions=4)
309
+
310
+ # Each rank loads only its partition
311
+ rank = dist.get_rank()
312
+ partition = DSparseMatrix.load("matrix_dist", rank, world_size)
313
+
314
+ # Load partitioned data as single SparseTensor
315
+ A = load_distributed_as_sparse("matrix_dist")
316
+
317
+ # Load single file as partition (each rank reads full file, keeps its part)
318
+ partition = load_sparse_as_partition("matrix.safetensors", rank, world_size)
319
+ ```
320
+
321
+ ### Cross-Format Conversion
322
+
323
+ | Save Format | Load as SparseTensor | Load as DSparseMatrix |
324
+ |------------|---------------------|----------------------|
325
+ | `A.save("file.safetensors")` | `SparseTensor.load("file")` | `load_sparse_as_partition("file", rank, world_size)` |
326
+ | `A.save_distributed("dir", n)` | `load_distributed_as_sparse("dir")` | `DSparseMatrix.load("dir", rank, world_size)` |
327
+ | `D.save("dir")` | `load_distributed_as_sparse("dir")` | `DSparseTensor.load("dir")` |
328
+
329
+ ## Nonlinear Solve (Adjoint Method)
330
+
331
+ Solve nonlinear equations `F(u, A, θ) = 0` with automatic differentiation using the adjoint method:
332
+
333
+ ```python
334
+ from torch_sla import SparseTensor
335
+
336
+ # Create sparse matrix (e.g., FEM stiffness matrix)
337
+ A = SparseTensor(val, row, col, (n, n))
338
+
339
+ # Define nonlinear residual: A @ u + u² = f
340
+ def residual(u, A, f):
341
+ return A @ u + u**2 - f
342
+
343
+ # Parameters with gradients
344
+ f = torch.randn(n, requires_grad=True)
345
+ u0 = torch.zeros(n)
346
+
347
+ # Solve with Newton-Raphson
348
+ u = A.nonlinear_solve(residual, u0, f, method='newton')
349
+
350
+ # Gradients flow via adjoint method
351
+ loss = u.sum()
352
+ loss.backward()
353
+ print(f.grad) # ∂L/∂f via implicit differentiation
354
+ ```
355
+
356
+ **Methods:**
357
+ - `newton`: Newton-Raphson with line search (default, fast convergence)
358
+ - `picard`: Fixed-point iteration (simple, slow)
359
+ - `anderson`: Anderson acceleration (memory efficient)
360
+
361
+ **Key Features:**
362
+ - Memory-efficient adjoint method (no Jacobian storage)
363
+ - Jacobian-free Newton-Krylov via autograd
364
+ - Multiple parameters with mixed requires_grad
365
+ - Seamless integration with `SparseTensor` class
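A scalar analogue shows how Newton iteration and implicit differentiation fit together. For the residual `F(u, f) = a*u + u**2 - f`, differentiating `F(u(f), f) = 0` gives `(a + 2u) * du/df - 1 = 0`, so `du/df = 1 / (a + 2u)` at the converged solution. The sketch below is plain Python with a hypothetical `newton_solve` helper; it mirrors the structure of the README example but does not use `torch_sla`, and it omits the line search.

```python
def newton_solve(residual, jacobian, u0, tol=1e-12, max_iter=50):
    """Scalar Newton-Raphson: repeat u <- u - F(u) / F'(u) until |F(u)| < tol."""
    u = u0
    for _ in range(max_iter):
        r = residual(u)
        if abs(r) < tol:
            break
        u -= r / jacobian(u)
    return u


a, f = 3.0, 4.0

# Forward solve of a*u + u^2 = f starting from u0 = 0.
u = newton_solve(lambda v: a * v + v * v - f, lambda v: a + 2.0 * v, 0.0)

# Implicit differentiation: dF/du * du/df + dF/df = 0 with dF/df = -1.
du_df = 1.0 / (a + 2.0 * u)

# Finite-difference check that re-solves the nonlinear problem twice.
eps = 1e-6
u_plus = newton_solve(lambda v: a * v + v * v - (f + eps), lambda v: a + 2.0 * v, 0.0)
u_minus = newton_solve(lambda v: a * v + v * v - (f - eps), lambda v: a + 2.0 * v, 0.0)
du_df_fd = (u_plus - u_minus) / (2 * eps)
```

The gradient needs only the Jacobian at the converged `u`, not the history of Newton iterates. In the vector case the division becomes a single linear solve with the transposed Jacobian.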
366
+
367
+ ## Matrix Operations
368
+
369
+ ```python
370
+ A = SparseTensor(val, row, col, shape)
371
+
372
+ # Norms
373
+ norm = A.norm('fro') # Frobenius norm
374
+
375
+ # Eigenvalues
376
+ eigenvalues, eigenvectors = A.eigsh(k=6)
377
+
378
+ # SVD
379
+ U, S, Vt = A.svd(k=10)
380
+
381
+ # Matrix-vector product
382
+ y = A @ x
383
+
384
+ # LU factorization for repeated solves
385
+ lu = A.lu()
386
+ x = lu.solve(b)
387
+ ```
388
+
389
+ ## Benchmark Results
390
+
391
+ 2D Poisson equation (5-point stencil), NVIDIA H200 (140GB), float64:
392
+
393
+ ### Performance Comparison
394
+
395
+ ![Solver Performance](assets/benchmarks/performance.png)
396
+
397
+ | DOF | SciPy SuperLU | cuDSS Cholesky | PyTorch CG+Jacobi |
398
+ |----:|-------------:|---------------:|------------------:|
399
+ | 10K | 24ms | 128ms | 20ms |
400
+ | 100K | 29ms | 630ms | 43ms |
401
+ | 1M | 19.4s | 7.3s | 190ms |
402
+ | 2M | 52.9s | 15.6s | 418ms |
403
+ | 16M | - | - | 7.3s |
404
+ | 81M | - | - | 75.9s |
405
+ | **169M** | - | - | **224s** |
406
+
407
+ ### Memory Usage
408
+
409
+ ![Memory Usage](assets/benchmarks/memory.png)
410
+
411
+ | Method | Memory Scaling | Notes |
412
+ |--------|---------------|-------|
413
+ | **SciPy SuperLU** | O(n^1.5) fill-in | CPU only, limited to ~2M DOF |
414
+ | **cuDSS Cholesky** | O(n^1.5) fill-in | GPU, limited to ~2M DOF |
415
+ | **PyTorch CG+Jacobi** | **O(n) ~443 bytes/DOF** | Scales to 169M+ DOF |
416
+
417
+ ### Accuracy
418
+
419
+ ![Accuracy](assets/benchmarks/accuracy.png)
420
+
421
+ | Method | Precision | Notes |
422
+ |--------|-----------|-------|
423
+ | **Direct solvers** | ~1e-14 | Machine precision |
424
+ | **Iterative (tol=1e-6)** | ~1e-6 | User-configurable tolerance |
425
+
426
+ ### Key Findings
427
+
428
+ 1. **Iterative solver scales to 169M DOF** with O(n^1.1) time complexity
429
+ 2. **Direct solvers limited to ~2M DOF** due to O(n^1.5) to O(n^2) memory fill-in
430
+ 3. **PyTorch CG+Jacobi is roughly 37x faster than cuDSS Cholesky and 127x faster than SciPy SuperLU** at 2M DOF (418ms vs 15.6s and 52.9s)
431
+ 4. **Memory efficient**: 443 bytes/DOF (vs theoretical minimum 144 bytes/DOF)
432
+ 5. **Trade-off**: Direct solvers achieve machine precision, iterative achieves ~1e-6
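The speedup and memory figures can be recomputed directly from the tables above. The short calculation below uses only numbers quoted in this README. It assumes that "169M" means exactly 169,000,000 DOF and that 1 GB is 1e9 bytes.

```python
# Timings at 2M DOF, copied from the performance table (seconds).
times_2m = {
    "scipy_superlu": 52.9,
    "cudss_cholesky": 15.6,
    "pytorch_cg_jacobi": 0.418,
}

# Speedup of the iterative solver over each direct solver.
cg_time = times_2m["pytorch_cg_jacobi"]
speedup = {
    name: t / cg_time
    for name, t in times_2m.items()
    if name != "pytorch_cg_jacobi"
}

# Memory estimate for the largest run at the quoted 443 bytes/DOF.
bytes_per_dof = 443
dof = 169_000_000  # assumption: "169M" taken as exactly 169e6
memory_gb = bytes_per_dof * dof / 1e9

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x slower than CG+Jacobi at 2M DOF")
print(f"Estimated memory at 169M DOF: {memory_gb:.1f} GB")
```

The ratios come out near 127x against SuperLU and 37x against cuDSS Cholesky, and the memory estimate is about 75 GB, which fits within the 140GB stated for a single H200.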
433
+
434
+ ### Distributed Solve (Multi-GPU)
435
+
436
+ 4x NVIDIA H200 GPUs with NCCL backend, 4x CPU processes with Gloo:
437
+
438
+ ![Distributed Benchmark](assets/benchmarks/distributed_benchmark.png)
439
+
440
+ **CUDA (4 GPU, NCCL)**:
441
+
442
+ | DOF | Time | Residual | Memory/GPU |
443
+ |----:|-----:|---------:|-----------:|
444
+ | 10K | 0.18s | 7.5e-9 | 0.03 GB |
445
+ | 100K | 0.61s | 1.2e-8 | 0.05 GB |
446
+ | 500K | 1.64s | 1.2e-7 | 0.15 GB |
447
+ | 1M | 2.82s | 4.0e-7 | 0.27 GB |
448
+ | **2M** | 6.02s | 1.3e-6 | **0.50 GB** |
449
+
450
+ **CPU (4 proc, Gloo)**:
451
+
452
+ | DOF | Time | Residual |
453
+ |----:|-----:|---------:|
454
+ | 10K | 0.37s | 7.5e-9 |
455
+ | 100K | 7.42s | 1.1e-8 |
456
+
457
+ **Key Findings**:
458
+ - **CUDA 12x faster than CPU**: 0.6s vs 7.4s for 100K DOF
459
+ - **Memory evenly distributed**: Each GPU uses only 0.5GB for 2M DOF
460
+ - **Theoretically scales to 500M+ DOF**: H200 has 140GB per GPU
461
+
462
+ ```bash
463
+ # Run distributed solve with 4 GPUs
464
+ torchrun --standalone --nproc_per_node=4 examples/distributed/distributed_solve.py
465
+ ```
466
+
467
+ ## API Reference
468
+
469
+ ### Core Classes
470
+
471
+ - `SparseTensor` - Wrapper with batched solve, norm, eigs, svd methods
472
+ - `SparseTensorList` - List of SparseTensors with different structures
473
+ - `DSparseTensor` - Distributed sparse tensor with halo exchange
474
+ - `LUFactorization` - LU factorization for repeated solves
475
+
476
+ ### Main Functions
477
+
478
+ - `spsolve(val, row, col, shape, b, backend='auto', method='auto')` - Solve Ax=b
479
+ - `spsolve_coo(A_sparse, b, **kwargs)` - Solve using PyTorch sparse tensor
480
+ - `nonlinear_solve(residual_fn, u0, *params, method='newton')` - Solve F(u,θ)=0 with adjoint gradients
481
+
482
+ ### Backend Utilities
483
+
484
+ - `get_available_backends()` - List available backends
485
+ - `get_backend_methods(backend)` - List methods for a backend
486
+ - `select_backend(device, n, dtype)` - Auto-select backend
487
+ - `is_scipy_available()`, `is_cudss_available()`, etc.
488
+
489
+ ## Performance Tips
490
+
491
+ 1. **Use float64** for iterative solvers (better convergence)
492
+ 2. **Use cholesky** for SPD matrices (2x faster than LU)
493
+ 3. **Use scipy+superlu** for CPU (all sizes)
494
+ 4. **Use cudss+cholesky** for CUDA (up to ~2M DOF)
495
+ 5. **Use pytorch+cg** for very large problems (> 2M DOF)
496
+ 6. **Avoid cuSOLVER** - slower than cudss, no float32 support
497
+ 7. **Use LU factorization** for repeated solves with same matrix
498
+
499
+ ## Requirements
500
+
501
+ - Python >= 3.8
502
+ - PyTorch >= 1.10.0
503
+ - SciPy (recommended for CPU)
504
+ - CUDA Toolkit (for GPU backends)
505
+ - nvidia-cudss-cu12 (optional, for cuDSS backend)
506
+
507
+ ## License
508
+
509
+ MIT License - see [LICENSE](LICENSE)
510
+
511
+ ## Citation
512
+
513
+ ```bibtex
514
+ @software{torch_sla,
515
+ title = {torch-sla: PyTorch Sparse Linear Algebra},
516
+ author = {walkerchi},
517
+ year = {2024},
518
+ url = {https://github.com/walkerchi/torch-sla}
519
+ }
520
+ ```