slowai 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
slowai-0.3.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Rico Allen
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
slowai-0.3.0/PKG-INFO ADDED
@@ -0,0 +1,183 @@
+ Metadata-Version: 2.4
+ Name: slowai
+ Version: 0.3.0
+ Summary: One command to find why your PyTorch model is slow — and fix it.
+ Author-email: Rico Allen <ricardojallen37@gmail.com>
+ License-Expression: MIT
+ Project-URL: Homepage, https://github.com/ricojallen37-sketch/slowai
+ Project-URL: Repository, https://github.com/ricojallen37-sketch/slowai
+ Project-URL: Issues, https://github.com/ricojallen37-sketch/slowai/issues
+ Keywords: pytorch,profiling,performance,gpu,edge-ai,optimization,cuda,jetson
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: System :: Benchmark
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: torch>=2.1
+ Provides-Extra: dev
+ Requires-Dist: pytest>=7.0; extra == "dev"
+ Requires-Dist: ruff>=0.1; extra == "dev"
+ Requires-Dist: mypy>=1.0; extra == "dev"
+ Dynamic: license-file
+
+ # slowai
+
+ **One command to find why your PyTorch model is slow — and fix it.**
+
+ slowai diagnoses which performance regime your workload is stuck in (compute-bound, memory-bound, or overhead-bound), prescribes the right fix, auto-applies it, and proves the speedup with before/after measurements. No guesswork. No manual profiler interpretation.
+
+ ```
+ $ slowai fix model.py
+
+ ==============================================================
+ BASELINE: model.py: COMPUTE_BOUND (confidence: 0.85)
+ wall time: 7.523s
+ ==============================================================
+
+ Tried 4 remedies:
+
+ 1. [10.00x] bf16_autocast ** BEST **
+    Run under bfloat16 automatic mixed precision
+    7.523s >>> 0.752s
+    regime: compute (confidence: 0.85)
+
+ 2. [6.32x] tf32_tensor_cores
+    Enable TF32 tensor cores (~2x matmul throughput on Ampere+)
+    7.523s >>> 1.191s
+
+ 3. [6.22x] high_matmul_precision
+    Set float32 matmul precision to 'high'
+    7.523s >>> 1.210s
+
+ 4. [1.31x] cudnn_benchmark
+    Enable cuDNN auto-tuner for conv kernels
+    7.523s >>> 5.719s
+
+ --------------------------------------------------------------
+ WINNER: bf16_autocast
+ 7.523s >>> 0.752s (10.00x, +900% faster)
+ How: Run under bfloat16 automatic mixed precision
+ --------------------------------------------------------------
+ ```
+
+ ## Why this exists
+
+ Every deep learning workload is stuck in one of three performance regimes ([Horace He, 2022](https://horace.io/brrr_intro.html)):
+
+ | Regime | What's happening | Wrong fix = no speedup |
+ |--------|-----------------|----------------------|
+ | **Compute-bound** | GPU is saturated doing math (matmuls, convolutions) | Fusing ops won't help — the math itself is the bottleneck |
+ | **Memory-bound** | GPU is waiting for data (pointwise ops, activations) | Smaller model won't help — you need less data movement |
+ | **Overhead-bound** | GPU is idle waiting for Python/dispatcher (tiny ops) | Lower precision won't help — you need fewer, bigger ops |
+
+ The fix for each regime is different. Applying a compute-bound fix to a memory-bound workload does nothing. Engineers waste hours in profiler UIs figuring this out manually.
+
+ slowai does it in one command.
+
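As a back-of-envelope illustration of these regimes (not part of slowai itself), arithmetic intensity, i.e. FLOPs per byte of memory traffic, predicts whether a kernel is compute- or memory-bound: if a kernel's intensity is above the machine balance (peak FLOP/s divided by memory bandwidth), it can keep the GPU busy; below it, the GPU waits on memory. The hardware numbers below are illustrative, not Jetson specs.

```python
# Back-of-envelope regime check: compare a kernel's arithmetic intensity
# (FLOPs per byte of DRAM traffic) against the machine balance
# (peak FLOP/s divided by memory bandwidth). Hardware numbers are made up.

def gemm_intensity(n: int, dtype_bytes: int = 4) -> float:
    """An n x n x n matmul: 2*n^3 FLOPs over three n x n tensors moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * dtype_bytes
    return flops / bytes_moved

def pointwise_intensity(dtype_bytes: int = 4) -> float:
    """A pointwise op does ~1 FLOP per element while reading and writing it."""
    return 1 / (2 * dtype_bytes)

# Hypothetical accelerator: 10 TFLOP/s peak, 100 GB/s DRAM bandwidth.
machine_balance = 10e12 / 100e9  # = 100 FLOPs per byte needed to stay busy

print(gemm_intensity(4096) > machine_balance)   # large GEMM: compute-bound
print(pointwise_intensity() > machine_balance)  # pointwise: memory-bound
```

A 4096-wide GEMM lands far above the balance point while a pointwise chain lands far below it, which is why the table above prescribes different fixes for each.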
84
+
85
+ ## How it works
86
+
87
+ ```bash
88
+ slowai diagnose model.py # Classify the regime + prescribe fixes
89
+ slowai fix model.py # ^ plus auto-apply fixes and measure speedup
90
+ ```
91
+
92
+ Under the hood:
93
+
94
+ 1. **Profile** — Runs your workload under `torch.profiler` with CUDA timing, warmup pass, and op-level statistics
95
+ 2. **Classify** — A heuristic classifier analyzes op shares (matmul, normalization, pointwise, tiny-op fraction) to determine the dominant regime
96
+ 3. **Prescribe** — Returns a ranked list of fixes for that regime, cheapest first
97
+ 4. **Remediate** — Auto-applies each applicable fix (TF32 tensor cores, bf16/fp16 autocast, cuDNN benchmark, matmul precision), re-profiles, and ranks by measured speedup
98
+
99
+ No code changes required. Remedies are environment-level transforms — they modify PyTorch's runtime settings, not your model code.
100
+
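For reference, the four remedies named above correspond to standard PyTorch runtime switches. A minimal sketch of what applying them amounts to (slowai's internal implementation may differ in structure and naming):

```python
import contextlib
import torch

def apply_remedy(name: str):
    """Apply one environment-level remedy; return a context manager to run under.

    These are stock PyTorch runtime switches, shown here for illustration;
    slowai's own remedy engine may wrap them differently.
    """
    if name == "tf32_tensor_cores":
        torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
        torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convs
        return contextlib.nullcontext()
    if name == "cudnn_benchmark":
        torch.backends.cudnn.benchmark = True         # auto-tune conv kernels
        return contextlib.nullcontext()
    if name == "high_matmul_precision":
        torch.set_float32_matmul_precision("high")
        return contextlib.nullcontext()
    if name == "bf16_autocast":
        return torch.autocast(device_type="cuda", dtype=torch.bfloat16)
    raise ValueError(f"unknown remedy: {name}")

# Usage: re-run the workload's main() under the remedy, then re-time it.
# with apply_remedy("bf16_autocast"):
#     main()
```

Note that the first three are process-global flags (the context manager is a no-op), while autocast genuinely scopes to the `with` block.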
+ ## Benchmarks
+
+ Tested on NVIDIA Jetson Orin Nano Super (Ampere GPU, 1024 CUDA cores, JetPack 6.2, PyTorch 2.8.0):
+
+ | Workload | Regime | Baseline | Best remedy | After | Speedup |
+ |----------|--------|----------|-------------|-------|---------|
+ | Dense GEMM (4096x4096) | Compute | 7.523s | bf16_autocast | 0.752s | **10.00x** |
+ | Pointwise chain (8192x8192) | Memory | 2.400s | tf32_tensor_cores | 0.575s | **4.17x** |
+ | Tiny ops (5000 micro-ops) | Overhead | 3.281s | tf32_tensor_cores | 1.141s | **2.88x** |
+ | MobileNetV2 inference | Compute | 1.778s | cudnn_benchmark | 1.681s | **1.06x** |
+ | ResNet-50 inference | Compute | 2.105s | bf16_autocast | 1.969s | **1.07x** |
+
+ Production models (MobileNet, ResNet) show modest gains because PyTorch already optimizes common architectures well. The real value is that slowai **finds the right fix automatically** — cuDNN benchmark wins for convolution-heavy MobileNet, bf16 autocast wins for matmul-heavy ResNet. Different models, different winners, zero guesswork.
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/ricojallen37-sketch/slowai.git
+ cd slowai
+ pip install -e .
+ ```
+
+ Requires Python 3.10+ and PyTorch >= 2.1 with CUDA support.
+
+ ## Writing a workload
+
+ slowai profiles any Python script that exposes a `main()` function:
+
+ ```python
+ # my_model.py
+ import torch
+ from torchvision.models import resnet50
+
+ model = resnet50().cuda().eval()
+ data = torch.randn(8, 3, 224, 224, device="cuda")
+
+ def main():
+     with torch.no_grad():
+         for _ in range(30):
+             model(data)
+ ```
+
+ ```bash
+ slowai fix my_model.py
+ ```
+
+ ## Architecture
+
+ ```
+ slowai/
+   schema.py      # Regime enum, Diagnosis dataclass — the product thesis in types
+   profiler.py    # torch.profiler wrapper → ProfileResult (op stats + wall time)
+   diagnose.py    # Heuristic classifier → Diagnosis (regime + confidence + prescriptions)
+   remediate.py   # Auto-fix engine → FixReport (before/after speedup per remedy)
+   cli.py         # CLI entry points: diagnose, fix
+ ```
+
+ The classifier is a pure function of profiler output — no torch dependency, fully unit-testable. The remediate engine applies fixes as environment transforms (global flags, autocast context managers) so it never modifies user code.
+
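A pure-function classifier of the kind described could look like the following sketch. The field names and thresholds here are invented for illustration; slowai's actual heuristics in `diagnose.py` may differ.

```python
# Illustrative regime classifier: a pure function of op-share statistics,
# with no torch dependency, so it can be unit-tested in isolation.
from dataclasses import dataclass
from enum import Enum

class Regime(str, Enum):
    COMPUTE = "compute"
    MEMORY = "memory"
    OVERHEAD = "overhead"

@dataclass
class OpShares:
    matmul: float     # fraction of GPU time in matmul/conv kernels
    pointwise: float  # fraction in elementwise/activation kernels
    tiny_op: float    # fraction of ops too small to fill the GPU

def classify(shares: OpShares) -> tuple[Regime, float]:
    """Heuristic regime call plus a crude confidence score (hypothetical rules)."""
    if shares.tiny_op > 0.5:               # dominated by dispatch overhead
        return Regime.OVERHEAD, shares.tiny_op
    if shares.matmul > shares.pointwise:   # math dominates data movement
        return Regime.COMPUTE, shares.matmul
    return Regime.MEMORY, shares.pointwise

regime, conf = classify(OpShares(matmul=0.7, pointwise=0.2, tiny_op=0.1))
# → (Regime.COMPUTE, 0.7)
```

Keeping the classifier free of torch types is what makes the "fully unit-testable" claim cheap to uphold: tests feed in plain numbers, no GPU required.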
+ ## What's different
+
+ Other tools in this space are profiler UIs that show you data and leave interpretation to you. slowai is the only tool that goes from raw workload to regime classification to ranked prescriptions to auto-applied fixes to measured speedup in a single CLI command.
+
+ | Tool | Profiles | Classifies regime | Prescribes fixes | Auto-applies | Measures speedup |
+ |------|----------|------------------|-----------------|-------------|-----------------|
+ | PyTorch Profiler | Yes | No | No | No | No |
+ | NVIDIA Nsight | Yes | No | No | No | No |
+ | torch.utils.bottleneck | Yes | No | No | No | No |
+ | DeepSpeed Flops Profiler | Yes | No | No | No | No |
+ | **slowai** | **Yes** | **Yes** | **Yes** | **Yes** | **Yes** |
+
+ ## Roadmap
+
+ - **V1** (shipped) — Profile + classify regime for synthetic workloads
+ - **V2** (shipped) — Noise filtering (sync ops, init ops), normalization-aware classification, real model support
+ - **V3** (shipped) — Auto-remediate: apply fixes and measure before/after speedup
+ - **V4** (next) — torch.compile integration, channels_last auto-transform, batch size search, export optimized config
+
+ ## Built by
+
+ Rico Allen — [@ricojallen37-sketch](https://github.com/ricojallen37-sketch)
+
+ Built and tested on NVIDIA Jetson Orin Nano Super Developer Kit.
slowai-0.3.0/README.md ADDED
@@ -0,0 +1,153 @@
+ (body identical to the long description embedded in PKG-INFO above; omitted here to avoid duplication)
slowai-0.3.0/pyproject.toml ADDED
@@ -0,0 +1,56 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "slowai"
+ version = "0.3.0"
+ description = "One command to find why your PyTorch model is slow — and fix it."
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = "MIT"
+ authors = [{ name = "Rico Allen", email = "ricardojallen37@gmail.com" }]
+ keywords = ["pytorch", "profiling", "performance", "gpu", "edge-ai", "optimization", "cuda", "jetson"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "Intended Audience :: Science/Research",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.10",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+     "Topic :: Software Development :: Libraries :: Python Modules",
+     "Topic :: System :: Benchmark",
+ ]
+ dependencies = [
+     "torch>=2.1",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/ricojallen37-sketch/slowai"
+ Repository = "https://github.com/ricojallen37-sketch/slowai"
+ Issues = "https://github.com/ricojallen37-sketch/slowai/issues"
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=7.0",
+     "ruff>=0.1",
+     "mypy>=1.0",
+ ]
+
+ [project.scripts]
+ slowai = "slowai.cli:main"
+
+ # Explicit package list — stops setuptools from trying to auto-discover
+ # and tripping on the sibling `workloads/` folder (which is a test corpus,
+ # NOT a shipped package).
+ [tool.setuptools]
+ packages = ["slowai"]
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py310"
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
slowai-0.3.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
slowai-0.3.0/slowai/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """slowai — diagnose and auto-fix PyTorch performance bottlenecks."""
+
+ from slowai.schema import Diagnosis, Regime
+
+ __version__ = "0.3.0"
+ __all__ = ["Diagnosis", "Regime"]
slowai-0.3.0/slowai/cli.py ADDED
@@ -0,0 +1,199 @@
+ """slowai CLI — V3.
+
+ Usage:
+     slowai diagnose <workload.py>
+     slowai diagnose <workload.py> --json
+     slowai fix <workload.py>
+     slowai fix <workload.py> --json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import dataclasses
+ import json
+ import sys
+
+
+ def _diagnosis_to_dict(diag) -> dict:
+     """Flatten a Diagnosis into plain dicts/lists for JSON output."""
+     d = dataclasses.asdict(diag)
+     # Regime is a str Enum — asdict leaves it as the value already,
+     # but be defensive in case of subclassing.
+     if hasattr(diag.regime, "value"):
+         d["regime"] = diag.regime.value
+     return d
+
+
+ def _cmd_diagnose(args) -> int:
+     # Lazy imports so `slowai --help` works without torch installed.
+     from slowai.diagnose import classify
+     from slowai.profiler import profile_workload
+
+     try:
+         result = profile_workload(args.workload)
+     except FileNotFoundError as e:
+         print(f"error: {e}", file=sys.stderr)
+         return 2
+     except Exception as e:  # noqa: BLE001
+         print(f"error: failed to profile workload: {e}", file=sys.stderr)
+         return 3
+
+     diag = classify(result)
+
+     if args.json:
+         print(json.dumps(_diagnosis_to_dict(diag), indent=2, default=str))
+         return 0
+
+     # Human-readable output.
+     print(diag.summary())
+     print(f"  device: {result.device}")
+     print(f"  wall time: {result.wall_time_s:.3f}s")
+     print(f"  total ops: {result.total_op_count}")
+     if result.op_stats:
+         top = result.top_op
+         print(f"  top op: {top.name} (x{top.count}, {top.total_us / 1000:.1f} ms total)")
+     if diag.evidence:
+         print("  evidence:")
+         for k, v in diag.evidence.items():
+             if isinstance(v, float):
+                 print(f"    {k}: {v:.3f}")
+             else:
+                 print(f"    {k}: {v}")
+     if diag.top_fixes:
+         print("  top fixes (cheapest first):")
+         for i, fix in enumerate(diag.top_fixes, 1):
+             print(f"    {i}. {fix}")
+     return 0
+
+
+ def _cmd_fix(args) -> int:
+     """The V3 command: diagnose + auto-apply fixes + show before/after speedup."""
+     from slowai.remediate import auto_fix
+
+     print(f"slowai fix: profiling baseline for {args.workload} ...")
+     try:
+         report = auto_fix(args.workload)
+     except FileNotFoundError as e:
+         print(f"error: {e}", file=sys.stderr)
+         return 2
+     except Exception as e:  # noqa: BLE001
+         print(f"error: failed to run auto-fix: {e}", file=sys.stderr)
+         return 3
+
+     bd = report.baseline_diagnosis
+
+     if args.json:
+         out = {
+             "workload": report.workload,
+             "baseline": _diagnosis_to_dict(bd),
+             "remedies": [
+                 {
+                     "name": r.remedy_name,
+                     "description": r.remedy_description,
+                     "baseline_wall_s": r.baseline_wall_s,
+                     "remedied_wall_s": r.remedied_wall_s,
+                     "speedup": r.speedup,
+                     "regime_before": r.regime_before.value,
+                     "regime_after": r.regime_after.value,
+                     "confidence_after": r.confidence_after,
+                     "error": r.error,
+                 }
+                 for r in report.results
+             ],
+             "best": report.best.remedy_name if report.best else None,
+         }
+         print(json.dumps(out, indent=2, default=str))
+         return 0
+
+     # Human-readable output.
+     print()
+     print("=" * 62)
+     print(f"  BASELINE: {bd.summary()}")
+     print(f"  wall time: {bd.evidence.get('wall_time_s', 0):.3f}s")
+     print("=" * 62)
+
+     if not report.results:
+         print("\n  No auto-remedies available for this regime.")
+         if bd.top_fixes:
+             print("  Manual fixes (from `slowai diagnose`):")
+             for i, fix in enumerate(bd.top_fixes, 1):
+                 print(f"    {i}. {fix}")
+         return 0
+
+     print(f"\n  Tried {len(report.results)} remedies:\n")
+
+     for i, rr in enumerate(report.results, 1):
+         if rr.error:
+             status = "FAILED"
+             detail = f"       error: {rr.error}"
+         else:
+             arrow = ">>>" if rr.speedup > 1.05 else "---" if rr.speedup >= 0.95 else "<<<"
+             status = f"{rr.speedup:.2f}x"
+             detail = f"       {rr.baseline_wall_s:.3f}s {arrow} {rr.remedied_wall_s:.3f}s"
+
+         best_marker = "  ** BEST **" if report.best and rr.remedy_name == report.best.remedy_name else ""
+         print(f"  {i}. [{status}] {rr.remedy_name}{best_marker}")
+         print(f"     {rr.remedy_description}")
+         print(detail)
+         if not rr.error:
+             print(f"       regime: {rr.regime_after.value} (confidence: {rr.confidence_after:.2f})")
+         print()
+
+     best = report.best
+     if best:
+         pct = (best.speedup - 1.0) * 100
+         print("-" * 62)
+         print(f"  WINNER: {best.remedy_name}")
+         print(f"  {best.baseline_wall_s:.3f}s >>> {best.remedied_wall_s:.3f}s ({best.speedup:.2f}x, +{pct:.0f}% faster)")
+         print(f"  How: {best.remedy_description}")
+         print("-" * 62)
+     else:
+         print("-" * 62)
+         print("  No remedy produced a speedup. The workload may already be")
+         print("  well-optimized, or the fixes require manual code changes.")
+         print("  Run `slowai diagnose` for the full prescription list.")
+         print("-" * 62)
+
+     return 0
+
+
+ def main(argv: list[str] | None = None) -> int:
+     parser = argparse.ArgumentParser(
+         prog="slowai",
+         description="Diagnose and auto-fix PyTorch performance bottlenecks.",
+     )
+     sub = parser.add_subparsers(dest="command", required=True)
+
+     # --- slowai diagnose ---
+     diag = sub.add_parser(
+         "diagnose", help="Profile a workload and classify its regime."
+     )
+     diag.add_argument("workload", help="Path to the Python script to diagnose.")
+     diag.add_argument(
+         "--json",
+         action="store_true",
+         help="Emit the Diagnosis as JSON instead of human-readable text.",
+     )
+     diag.set_defaults(func=_cmd_diagnose)
+
+     # --- slowai fix ---
+     fix = sub.add_parser(
+         "fix", help="Auto-apply fixes and show before/after speedup."
+     )
+     fix.add_argument("workload", help="Path to the Python script to fix.")
+     fix.add_argument(
+         "--json",
+         action="store_true",
+         help="Emit the FixReport as JSON instead of human-readable text.",
+     )
+     fix.set_defaults(func=_cmd_fix)
+
+     args = parser.parse_args(argv)
+     return args.func(args)
+
+
+ if __name__ == "__main__":
+     sys.exit(main())