slowai 0.3.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
slowai-0.3.0/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Rico Allen
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
slowai-0.3.0/PKG-INFO ADDED
@@ -0,0 +1,183 @@
+ Metadata-Version: 2.4
+ Name: slowai
+ Version: 0.3.0
+ Summary: One command to find why your PyTorch model is slow — and fix it.
+ Author-email: Rico Allen <ricardojallen37@gmail.com>
+ License-Expression: MIT
+ Project-URL: Homepage, https://github.com/ricojallen37-sketch/slowai
+ Project-URL: Repository, https://github.com/ricojallen37-sketch/slowai
+ Project-URL: Issues, https://github.com/ricojallen37-sketch/slowai/issues
+ Keywords: pytorch,profiling,performance,gpu,edge-ai,optimization,cuda,jetson
+ Classifier: Development Status :: 3 - Alpha
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
+ Classifier: Topic :: System :: Benchmark
+ Requires-Python: >=3.10
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: torch>=2.1
+ Provides-Extra: dev
+ Requires-Dist: pytest>=7.0; extra == "dev"
+ Requires-Dist: ruff>=0.1; extra == "dev"
+ Requires-Dist: mypy>=1.0; extra == "dev"
+ Dynamic: license-file
+
+ # slowai
+
+ **One command to find why your PyTorch model is slow — and fix it.**
+
+ slowai diagnoses which performance regime your workload is stuck in (compute-bound, memory-bound, or overhead-bound), prescribes the right fix, auto-applies it, and proves the speedup with before/after measurements. No guesswork. No manual profiler interpretation.
+
+ ```
+ $ slowai fix model.py
+
+ ==============================================================
+ BASELINE: model.py: COMPUTE_BOUND (confidence: 0.85)
+ wall time: 7.523s
+ ==============================================================
+
+ Tried 4 remedies:
+
+ 1. [10.00x] bf16_autocast ** BEST **
+    Run under bfloat16 automatic mixed precision
+    7.523s >>> 0.752s
+    regime: compute (confidence: 0.85)
+
+ 2. [6.32x] tf32_tensor_cores
+    Enable TF32 tensor cores (~2x matmul throughput on Ampere+)
+    7.523s >>> 1.191s
+
+ 3. [6.22x] high_matmul_precision
+    Set float32 matmul precision to 'high'
+    7.523s >>> 1.210s
+
+ 4. [1.31x] cudnn_benchmark
+    Enable cuDNN auto-tuner for conv kernels
+    7.523s >>> 5.719s
+
+ --------------------------------------------------------------
+ WINNER: bf16_autocast
+ 7.523s >>> 0.752s (10.00x, +900% faster)
+ How: Run under bfloat16 automatic mixed precision
+ --------------------------------------------------------------
+ ```
+
+ ## Why this exists
+
+ Every deep learning workload is stuck in one of three performance regimes ([Horace He, 2022](https://horace.io/brrr_intro.html)):
+
+ | Regime | What's happening | Wrong fix = no speedup |
+ |--------|-----------------|----------------------|
+ | **Compute-bound** | GPU is saturated doing math (matmuls, convolutions) | Fusing ops won't help — the math itself is the bottleneck |
+ | **Memory-bound** | GPU is waiting for data (pointwise ops, activations) | Smaller model won't help — you need less data movement |
+ | **Overhead-bound** | GPU is idle waiting for Python/dispatcher (tiny ops) | Lower precision won't help — you need fewer, bigger ops |
+
+ The fix for each regime is different. Applying a compute-bound fix to a memory-bound workload does nothing. Engineers waste hours in profiler UIs figuring this out manually.
+
+ slowai does it in one command.
+
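As a back-of-envelope illustration of these regimes (not part of slowai itself), arithmetic intensity, i.e. FLOPs per byte of memory traffic, predicts whether a kernel is compute- or memory-bound: if a kernel's intensity is above the machine balance (peak FLOP/s divided by memory bandwidth), it can keep the GPU busy; below it, the GPU waits on memory. The hardware numbers below are illustrative, not Jetson specs.

```python
# Back-of-envelope regime check: compare a kernel's arithmetic intensity
# (FLOPs per byte of DRAM traffic) against the machine balance
# (peak FLOP/s divided by memory bandwidth). Hardware numbers are made up.

def gemm_intensity(n: int, dtype_bytes: int = 4) -> float:
    """An n x n x n matmul: 2*n^3 FLOPs over three n x n tensors moved."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * dtype_bytes
    return flops / bytes_moved

def pointwise_intensity(dtype_bytes: int = 4) -> float:
    """A pointwise op does ~1 FLOP per element while reading and writing it."""
    return 1 / (2 * dtype_bytes)

# Hypothetical accelerator: 10 TFLOP/s peak, 100 GB/s DRAM bandwidth.
machine_balance = 10e12 / 100e9  # = 100 FLOPs per byte needed to stay busy

print(gemm_intensity(4096) > machine_balance)   # large GEMM: compute-bound
print(pointwise_intensity() > machine_balance)  # pointwise: memory-bound
```

A 4096-wide GEMM lands far above the balance point while a pointwise chain lands far below it, which is why the table above prescribes different fixes for each.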
84
+
85
+ ## How it works
86
+
87
+ ```bash
88
+ slowai diagnose model.py # Classify the regime + prescribe fixes
89
+ slowai fix model.py # ^ plus auto-apply fixes and measure speedup
90
+ ```
91
+
92
+ Under the hood:
93
+
94
+ 1. **Profile** — Runs your workload under `torch.profiler` with CUDA timing, warmup pass, and op-level statistics
95
+ 2. **Classify** — A heuristic classifier analyzes op shares (matmul, normalization, pointwise, tiny-op fraction) to determine the dominant regime
96
+ 3. **Prescribe** — Returns a ranked list of fixes for that regime, cheapest first
97
+ 4. **Remediate** — Auto-applies each applicable fix (TF32 tensor cores, bf16/fp16 autocast, cuDNN benchmark, matmul precision), re-profiles, and ranks by measured speedup
98
+
99
+ No code changes required. Remedies are environment-level transforms — they modify PyTorch's runtime settings, not your model code.
100
+
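For reference, the four remedies named above correspond to standard PyTorch runtime switches. A minimal sketch of what applying them amounts to (slowai's internal implementation may differ in structure and naming):

```python
import contextlib
import torch

def apply_remedy(name: str):
    """Apply one environment-level remedy; return a context manager to run under.

    These are stock PyTorch runtime switches, shown here for illustration;
    slowai's own remedy engine may wrap them differently.
    """
    if name == "tf32_tensor_cores":
        torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matmuls
        torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convs
        return contextlib.nullcontext()
    if name == "cudnn_benchmark":
        torch.backends.cudnn.benchmark = True         # auto-tune conv kernels
        return contextlib.nullcontext()
    if name == "high_matmul_precision":
        torch.set_float32_matmul_precision("high")
        return contextlib.nullcontext()
    if name == "bf16_autocast":
        return torch.autocast(device_type="cuda", dtype=torch.bfloat16)
    raise ValueError(f"unknown remedy: {name}")

# Usage: re-run the workload's main() under the remedy, then re-time it.
# with apply_remedy("bf16_autocast"):
#     main()
```

Note that the first three are process-global flags (the context manager is a no-op), while autocast genuinely scopes to the `with` block.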
+ ## Benchmarks
+
+ Tested on NVIDIA Jetson Orin Nano Super (Ampere GPU, 1024 CUDA cores, JetPack 6.2, PyTorch 2.8.0):
+
+ | Workload | Regime | Baseline | Best remedy | After | Speedup |
+ |----------|--------|----------|-------------|-------|---------|
+ | Dense GEMM (4096x4096) | Compute | 7.523s | bf16_autocast | 0.752s | **10.00x** |
+ | Pointwise chain (8192x8192) | Memory | 2.400s | tf32_tensor_cores | 0.575s | **4.17x** |
+ | Tiny ops (5000 micro-ops) | Overhead | 3.281s | tf32_tensor_cores | 1.141s | **2.88x** |
+ | MobileNetV2 inference | Compute | 1.778s | cudnn_benchmark | 1.681s | **1.06x** |
+ | ResNet-50 inference | Compute | 2.105s | bf16_autocast | 1.969s | **1.07x** |
+
+ Production models (MobileNet, ResNet) show modest gains because PyTorch already optimizes common architectures well. The real value is that slowai **finds the right fix automatically** — cuDNN benchmark wins for convolution-heavy MobileNet, bf16 autocast wins for matmul-heavy ResNet. Different models, different winners, zero guesswork.
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/ricojallen37-sketch/slowai.git
+ cd slowai
+ pip install -e .
+ ```
+
+ Requires Python 3.10+ and PyTorch >= 2.1 with CUDA support.
+
+ ## Writing a workload
+
+ slowai profiles any Python script that exposes a `main()` function:
+
+ ```python
+ # my_model.py
+ import torch
+ from torchvision.models import resnet50
+
+ model = resnet50().cuda().eval()
+ data = torch.randn(8, 3, 224, 224, device="cuda")
+
+ def main():
+     with torch.no_grad():
+         for _ in range(30):
+             model(data)
+ ```
+
+ ```bash
+ slowai fix my_model.py
+ ```
+
+ ## Architecture
+
+ ```
+ slowai/
+   schema.py      # Regime enum, Diagnosis dataclass — the product thesis in types
+   profiler.py    # torch.profiler wrapper → ProfileResult (op stats + wall time)
+   diagnose.py    # Heuristic classifier → Diagnosis (regime + confidence + prescriptions)
+   remediate.py   # Auto-fix engine → FixReport (before/after speedup per remedy)
+   cli.py         # CLI entry points: diagnose, fix
+ ```
+
+ The classifier is a pure function of profiler output — no torch dependency, fully unit-testable. The remediate engine applies fixes as environment transforms (global flags, autocast context managers) so it never modifies user code.
+
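A pure-function classifier of the kind described could look like the following sketch. The field names and thresholds here are invented for illustration; slowai's actual heuristics in `diagnose.py` may differ.

```python
# Illustrative regime classifier: a pure function of op-share statistics,
# with no torch dependency, so it can be unit-tested in isolation.
from dataclasses import dataclass
from enum import Enum

class Regime(str, Enum):
    COMPUTE = "compute"
    MEMORY = "memory"
    OVERHEAD = "overhead"

@dataclass
class OpShares:
    matmul: float     # fraction of GPU time in matmul/conv kernels
    pointwise: float  # fraction in elementwise/activation kernels
    tiny_op: float    # fraction of ops too small to fill the GPU

def classify(shares: OpShares) -> tuple[Regime, float]:
    """Heuristic regime call plus a crude confidence score (hypothetical rules)."""
    if shares.tiny_op > 0.5:               # dominated by dispatch overhead
        return Regime.OVERHEAD, shares.tiny_op
    if shares.matmul > shares.pointwise:   # math dominates data movement
        return Regime.COMPUTE, shares.matmul
    return Regime.MEMORY, shares.pointwise

regime, conf = classify(OpShares(matmul=0.7, pointwise=0.2, tiny_op=0.1))
# → (Regime.COMPUTE, 0.7)
```

Keeping the classifier free of torch types is what makes the "fully unit-testable" claim cheap to uphold: tests feed in plain numbers, no GPU required.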
+ ## What's different
+
+ Other tools in this space are profiler UIs that show you data and leave interpretation to you. slowai is the only tool that goes from raw workload to regime classification to ranked prescriptions to auto-applied fixes to measured speedup in a single CLI command.
+
+ | Tool | Profiles | Classifies regime | Prescribes fixes | Auto-applies | Measures speedup |
+ |------|----------|------------------|-----------------|-------------|-----------------|
+ | PyTorch Profiler | Yes | No | No | No | No |
+ | NVIDIA Nsight | Yes | No | No | No | No |
+ | torch.utils.bottleneck | Yes | No | No | No | No |
+ | DeepSpeed Flops Profiler | Yes | No | No | No | No |
+ | **slowai** | **Yes** | **Yes** | **Yes** | **Yes** | **Yes** |
+
+ ## Roadmap
+
+ - **V1** (shipped) — Profile + classify regime for synthetic workloads
+ - **V2** (shipped) — Noise filtering (sync ops, init ops), normalization-aware classification, real model support
+ - **V3** (shipped) — Auto-remediate: apply fixes and measure before/after speedup
+ - **V4** (next) — torch.compile integration, channels_last auto-transform, batch size search, export optimized config
+
+ ## Built by
+
+ Rico Allen — [@ricojallen37-sketch](https://github.com/ricojallen37-sketch)
+
+ Built and tested on NVIDIA Jetson Orin Nano Super Developer Kit.
slowai-0.3.0/README.md ADDED
@@ -0,0 +1,153 @@
+ (body identical to the long description embedded in PKG-INFO above; omitted here to avoid duplication)
slowai-0.3.0/pyproject.toml ADDED
@@ -0,0 +1,56 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "slowai"
+ version = "0.3.0"
+ description = "One command to find why your PyTorch model is slow — and fix it."
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = "MIT"
+ authors = [{ name = "Rico Allen", email = "ricardojallen37@gmail.com" }]
+ keywords = ["pytorch", "profiling", "performance", "gpu", "edge-ai", "optimization", "cuda", "jetson"]
+ classifiers = [
+     "Development Status :: 3 - Alpha",
+     "Intended Audience :: Developers",
+     "Intended Audience :: Science/Research",
+     "Programming Language :: Python :: 3",
+     "Programming Language :: Python :: 3.10",
+     "Programming Language :: Python :: 3.11",
+     "Programming Language :: Python :: 3.12",
+     "Topic :: Scientific/Engineering :: Artificial Intelligence",
+     "Topic :: Software Development :: Libraries :: Python Modules",
+     "Topic :: System :: Benchmark",
+ ]
+ dependencies = [
+     "torch>=2.1",
+ ]
+
+ [project.urls]
+ Homepage = "https://github.com/ricojallen37-sketch/slowai"
+ Repository = "https://github.com/ricojallen37-sketch/slowai"
+ Issues = "https://github.com/ricojallen37-sketch/slowai/issues"
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=7.0",
+     "ruff>=0.1",
+     "mypy>=1.0",
+ ]
+
+ [project.scripts]
+ slowai = "slowai.cli:main"
+
+ # Explicit package list — stops setuptools from trying to auto-discover
+ # and tripping on the sibling `workloads/` folder (which is a test corpus,
+ # NOT a shipped package).
+ [tool.setuptools]
+ packages = ["slowai"]
+
+ [tool.ruff]
+ line-length = 100
+ target-version = "py310"
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
slowai-0.3.0/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
slowai-0.3.0/slowai/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """slowai — diagnose and auto-fix PyTorch performance bottlenecks."""
+
+ from slowai.schema import Diagnosis, Regime
+
+ __version__ = "0.3.0"
+ __all__ = ["Diagnosis", "Regime"]
slowai-0.3.0/slowai/cli.py ADDED
@@ -0,0 +1,199 @@
+ """slowai CLI — V3.
+
+ Usage:
+     slowai diagnose <workload.py>
+     slowai diagnose <workload.py> --json
+     slowai fix <workload.py>
+     slowai fix <workload.py> --json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import dataclasses
+ import json
+ import sys
+
+
+ def _diagnosis_to_dict(diag) -> dict:
+     """Flatten a Diagnosis into plain dicts/lists for JSON output."""
+     d = dataclasses.asdict(diag)
+     # Regime is a str Enum — asdict leaves it as the value already,
+     # but be defensive in case of subclassing.
+     if hasattr(diag.regime, "value"):
+         d["regime"] = diag.regime.value
+     return d
+
+
+ def _cmd_diagnose(args) -> int:
+     # Lazy imports so `slowai --help` works without torch installed.
+     from slowai.diagnose import classify
+     from slowai.profiler import profile_workload
+
+     try:
+         result = profile_workload(args.workload)
+     except FileNotFoundError as e:
+         print(f"error: {e}", file=sys.stderr)
+         return 2
+     except Exception as e:  # noqa: BLE001
+         print(f"error: failed to profile workload: {e}", file=sys.stderr)
+         return 3
+
+     diag = classify(result)
+
+     if args.json:
+         print(json.dumps(_diagnosis_to_dict(diag), indent=2, default=str))
+         return 0
+
+     # Human-readable output.
+     print(diag.summary())
+     print(f"  device: {result.device}")
+     print(f"  wall time: {result.wall_time_s:.3f}s")
+     print(f"  total ops: {result.total_op_count}")
+     if result.op_stats:
+         top = result.top_op
+         print(f"  top op: {top.name} (x{top.count}, {top.total_us / 1000:.1f} ms total)")
+     if diag.evidence:
+         print("  evidence:")
+         for k, v in diag.evidence.items():
+             if isinstance(v, float):
+                 print(f"    {k}: {v:.3f}")
+             else:
+                 print(f"    {k}: {v}")
+     if diag.top_fixes:
+         print("  top fixes (cheapest first):")
+         for i, fix in enumerate(diag.top_fixes, 1):
+             print(f"    {i}. {fix}")
+     return 0
+
+
+ def _cmd_fix(args) -> int:
+     """The V3 command: diagnose + auto-apply fixes + show before/after speedup."""
+     from slowai.remediate import auto_fix
+
+     print(f"slowai fix: profiling baseline for {args.workload} ...")
+     try:
+         report = auto_fix(args.workload)
+     except FileNotFoundError as e:
+         print(f"error: {e}", file=sys.stderr)
+         return 2
+     except Exception as e:  # noqa: BLE001
+         print(f"error: failed to run auto-fix: {e}", file=sys.stderr)
+         return 3
+
+     bd = report.baseline_diagnosis
+
+     if args.json:
+         out = {
+             "workload": report.workload,
+             "baseline": _diagnosis_to_dict(bd),
+             "remedies": [
+                 {
+                     "name": r.remedy_name,
+                     "description": r.remedy_description,
+                     "baseline_wall_s": r.baseline_wall_s,
+                     "remedied_wall_s": r.remedied_wall_s,
+                     "speedup": r.speedup,
+                     "regime_before": r.regime_before.value,
+                     "regime_after": r.regime_after.value,
+                     "confidence_after": r.confidence_after,
+                     "error": r.error,
+                 }
+                 for r in report.results
+             ],
+             "best": report.best.remedy_name if report.best else None,
+         }
+         print(json.dumps(out, indent=2, default=str))
+         return 0
+
+     # Human-readable output.
+     print()
+     print("=" * 62)
+     print(f"  BASELINE: {bd.summary()}")
+     print(f"  wall time: {bd.evidence.get('wall_time_s', 0):.3f}s")
+     print("=" * 62)
+
+     if not report.results:
+         print("\n  No auto-remedies available for this regime.")
+         if bd.top_fixes:
+             print("  Manual fixes (from `slowai diagnose`):")
+             for i, fix in enumerate(bd.top_fixes, 1):
+                 print(f"    {i}. {fix}")
+         return 0
+
+     print(f"\n  Tried {len(report.results)} remedies:\n")
+
+     for i, rr in enumerate(report.results, 1):
+         if rr.error:
+             status = "FAILED"
+             detail = f"       error: {rr.error}"
+         else:
+             arrow = ">>>" if rr.speedup > 1.05 else "---" if rr.speedup >= 0.95 else "<<<"
+             status = f"{rr.speedup:.2f}x"
+             detail = f"       {rr.baseline_wall_s:.3f}s {arrow} {rr.remedied_wall_s:.3f}s"
+
+         best_marker = "  ** BEST **" if report.best and rr.remedy_name == report.best.remedy_name else ""
+         print(f"  {i}. [{status}] {rr.remedy_name}{best_marker}")
+         print(f"     {rr.remedy_description}")
+         print(detail)
+         if not rr.error:
+             print(f"       regime: {rr.regime_after.value} (confidence: {rr.confidence_after:.2f})")
+         print()
+
+     best = report.best
+     if best:
+         pct = (best.speedup - 1.0) * 100
+         print("-" * 62)
+         print(f"  WINNER: {best.remedy_name}")
+         print(f"  {best.baseline_wall_s:.3f}s >>> {best.remedied_wall_s:.3f}s ({best.speedup:.2f}x, +{pct:.0f}% faster)")
+         print(f"  How: {best.remedy_description}")
+         print("-" * 62)
+     else:
+         print("-" * 62)
+         print("  No remedy produced a speedup. The workload may already be")
+         print("  well-optimized, or the fixes require manual code changes.")
+         print("  Run `slowai diagnose` for the full prescription list.")
+         print("-" * 62)
+
+     return 0
+
+
+ def main(argv: list[str] | None = None) -> int:
+     parser = argparse.ArgumentParser(
+         prog="slowai",
+         description="Diagnose and auto-fix PyTorch performance bottlenecks.",
+     )
+     sub = parser.add_subparsers(dest="command", required=True)
+
+     # --- slowai diagnose ---
+     diag = sub.add_parser(
+         "diagnose", help="Profile a workload and classify its regime."
+     )
+     diag.add_argument("workload", help="Path to the Python script to diagnose.")
+     diag.add_argument(
+         "--json",
+         action="store_true",
+         help="Emit the Diagnosis as JSON instead of human-readable text.",
+     )
+     diag.set_defaults(func=_cmd_diagnose)
+
+     # --- slowai fix ---
+     fix = sub.add_parser(
+         "fix", help="Auto-apply fixes and show before/after speedup."
+     )
+     fix.add_argument("workload", help="Path to the Python script to fix.")
+     fix.add_argument(
+         "--json",
+         action="store_true",
+         help="Emit the FixReport as JSON instead of human-readable text.",
+     )
+     fix.set_defaults(func=_cmd_fix)
+
+     args = parser.parse_args(argv)
+     return args.func(args)
+
+
+ if __name__ == "__main__":
+     sys.exit(main())