@mcptoolshop/backpropagate 1.0.5 → 1.1.0
- package/LICENSE +21 -21
- package/README.es.md +359 -0
- package/README.fr.md +359 -0
- package/README.hi.md +359 -0
- package/README.it.md +359 -0
- package/README.ja.md +359 -0
- package/README.md +306 -56
- package/README.pt-BR.md +359 -0
- package/README.zh.md +359 -0
- package/bin/backpropagate.js +6 -6
- package/package.json +12 -3
package/README.md
CHANGED
|
@@ -1,109 +1,359 @@
|
|
|
1
|
+
<p align="center">
|
|
2
|
+
<a href="README.ja.md">日本語</a> | <a href="README.zh.md">中文</a> | <a href="README.es.md">Español</a> | <a href="README.fr.md">Français</a> | <a href="README.hi.md">हिन्दी</a> | <a href="README.it.md">Italiano</a> | <a href="README.pt-BR.md">Português (BR)</a>
|
|
3
|
+
</p>
|
|
4
|
+
|
|
1
5
|
<p align="center">
|
|
2
6
|
<img src="https://raw.githubusercontent.com/mcp-tool-shop-org/brand/main/logos/backpropagate/readme.png" alt="Backpropagate" width="400">
|
|
3
7
|
</p>
|
|
4
8
|
|
|
5
9
|
<p align="center">
|
|
6
|
-
<a href="https://
|
|
7
|
-
<a href="https://
|
|
8
|
-
<a href="https://
|
|
10
|
+
<a href="https://github.com/mcp-tool-shop-org/backpropagate/actions/workflows/ci.yml"><img src="https://github.com/mcp-tool-shop-org/backpropagate/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
|
|
11
|
+
<a href="https://pypi.org/project/backpropagate/"><img src="https://img.shields.io/pypi/v/backpropagate" alt="PyPI"></a>
|
|
12
|
+
<a href="https://codecov.io/gh/mcp-tool-shop-org/backpropagate"><img src="https://img.shields.io/codecov/c/github/mcp-tool-shop-org/backpropagate" alt="Coverage"></a>
|
|
13
|
+
<a href="LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue" alt="MIT License"></a>
|
|
9
14
|
<a href="https://mcp-tool-shop-org.github.io/backpropagate/"><img src="https://img.shields.io/badge/Landing_Page-live-blue" alt="Landing Page"></a>
|
|
10
15
|
</p>
|
|
11
16
|
|
|
12
|
-
Headless LLM fine-tuning
|
|
17
|
+
**Headless LLM fine-tuning in 3 lines. Smart defaults, VRAM-aware batch sizing, multi-run SLAO, and one-click GGUF export for Ollama.**
|
|
13
18
|
|
|
14
|
-
|
|
19
|
+
*SLAO is Single LoRA Continual Learning via Asymmetric Merging — the merge-between-runs technique that prevents catastrophic forgetting in extended fine-tuning campaigns ([paper](https://arxiv.org/abs/2512.23017)).*
|
|
15
20
|
|
|
16
|
-
|
|
21
|
+
*Train LLMs in 3 lines of code. Export to Ollama in one more.*
|
|
22
|
+
|
|
23
|
+
## Quick Start
|
|
17
24
|
|
|
18
25
|
```bash
|
|
19
|
-
|
|
26
|
+
pip install backpropagate[standard]
|
|
20
27
|
```
|
|
21
28
|
|
|
22
|
-
|
|
29
|
+
```python
|
|
30
|
+
from backpropagate import Trainer
|
|
31
|
+
|
|
32
|
+
trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")
|
|
33
|
+
trainer.train("examples/quickstart.jsonl", steps=10)
|
|
34
|
+
trainer.export("gguf", quantization="q4_k_m") # Ready for Ollama
|
|
35
|
+
```
|
|
36
|
+
|
|
37
|
+
The repo ships a small `examples/quickstart.jsonl` (5 ShareGPT-format examples) so the snippet above runs end-to-end on a clean install. For your own training, see [Dataset Format](#dataset-format) below.
|
|
38
|
+
|
|
39
|
+
### No-code path: Web UI
|
|
40
|
+
|
|
41
|
+
Prefer a UI to a Python REPL? Install the same extra and run:
|
|
23
42
|
|
|
24
43
|
```bash
|
|
25
|
-
|
|
44
|
+
pip install backpropagate[standard]
|
|
45
|
+
backprop ui --port 7862
|
|
26
46
|
```
|
|
27
47
|
|
|
28
|
-
|
|
48
|
+
The Reflex (Radix UI) interface lets you point at a JSONL file, pick a model, train, and export — no Python required. The UI is local-first; for public-internet exposure see [Web UI](#web-ui) below for the `--share` + `--auth` security contract and supported tunnel options (Cloudflare Tunnel, ngrok).
|
|
49
|
+
|
|
50
|
+
## Dataset Format
|
|
51
|
+
|
|
52
|
+
Your JSONL training file should have one example per line. The simplest format is ShareGPT chat:
|
|
53
|
+
|
|
54
|
+
```jsonl
|
|
55
|
+
{"conversations": [{"from": "human", "value": "What is Python?"}, {"from": "gpt", "value": "A programming language."}]}
|
|
56
|
+
{"conversations": [{"from": "human", "value": "Explain recursion."}, {"from": "gpt", "value": "A function that calls itself."}]}
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
Alpaca (`instruction`/`output`), OpenAI chat (`messages`), and raw text formats are also supported. See `examples/quickstart.jsonl` for a copyable starting point.
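Before starting a long run, it can help to sanity-check a JSONL file against the ShareGPT shape shown above. The following is a minimal sketch using only the standard library; `validate_sharegpt_line` and `count_valid_lines` are hypothetical helper names for illustration, not part of the Backpropagate API, and the checks cover only the `conversations` / `from` / `value` keys visible in the example.

```python
import json


def validate_sharegpt_line(line: str) -> bool:
    """Return True if one JSONL line looks like a ShareGPT conversation."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    # Every turn needs a string speaker ("from") and a string message ("value").
    return all(
        isinstance(t, dict)
        and isinstance(t.get("from"), str)
        and isinstance(t.get("value"), str)
        for t in turns
    )


def count_valid_lines(text: str) -> tuple[int, int]:
    """Return (valid, invalid) counts for the non-blank lines of a JSONL string."""
    valid = invalid = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped, not counted as errors
        if validate_sharegpt_line(line):
            valid += 1
        else:
            invalid += 1
    return valid, invalid
```

Calling `count_valid_lines(open("my_data.jsonl", encoding="utf-8").read())` gives a quick count of lines that would need attention. Alpaca, OpenAI chat, and raw text files would need their own checks.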
|
|
60
|
+
|
|
61
|
+
## Why Backpropagate?
|
|
62
|
+
|
|
63
|
+
| Problem | Solution |
|
|
64
|
+
|---------|----------|
|
|
65
|
+
| Fine-tuning is complex | 3 lines: load, train, save |
|
|
66
|
+
| Windows is a nightmare | First-class Windows support |
|
|
67
|
+
| VRAM management is hard | Auto batch sizing, GPU monitoring |
|
|
68
|
+
| Model export is confusing | One-click GGUF + Ollama registration |
|
|
69
|
+
| Long runs cause forgetting | Multi-run SLAO training |
|
|
70
|
+
|
|
71
|
+
## Key Features
|
|
72
|
+
|
|
73
|
+
- **Headless by Design**: Built for CI/CD pipelines, automated workflows, and programmatic execution.
|
|
74
|
+
- **Smart Defaults**: Automatically configures optimal hyperparameters based on your hardware and dataset.
|
|
75
|
+
- **Multi-Run SLAO Training**: Advanced training strategies to prevent catastrophic forgetting during long runs.
|
|
76
|
+
- **First-Class Windows Support**: Tested and optimized for Windows environments, avoiding common PyTorch/CUDA pitfalls.
|
|
77
|
+
- **Seamless Export**: One-click export to GGUF format and automatic registration with Ollama.
|
|
78
|
+
- **Modular Architecture**: Install only the dependencies you need (e.g., `[unsloth]`, `[ui]`, `[export]`).
|
|
79
|
+
|
|
80
|
+
## Installation
|
|
29
81
|
|
|
30
82
|
```bash
|
|
31
|
-
|
|
32
|
-
#
|
|
83
|
+
pip install backpropagate # Core only (minimal)
|
|
84
|
+
pip install backpropagate[unsloth] # + Unsloth 2x faster training
|
|
85
|
+
pip install backpropagate[ui] # + Reflex (Radix UI) web interface
|
|
86
|
+
pip install backpropagate[standard] # unsloth + ui (recommended)
|
|
87
|
+
pip install backpropagate[full] # Everything
|
|
33
88
|
```
|
|
34
89
|
|
|
35
|
-
|
|
90
|
+
| Extra | Description | Dependencies |
|
|
91
|
+
|-------|-------------|--------------|
|
|
92
|
+
| `unsloth` | 2x faster training, 50% less VRAM | unsloth |
|
|
93
|
+
| `ui` | Reflex (Radix UI) web interface | reflex>=0.9.2, fastapi>=0.115 |
|
|
94
|
+
| `validation` | Pydantic config validation | pydantic, pydantic-settings |
|
|
95
|
+
| `export` | GGUF export for Ollama | llama-cpp-python |
|
|
96
|
+
| `monitoring` | WandB + system monitoring (auto-wired into trainer in v1.1.0) | wandb, psutil |
|
|
97
|
+
| `observability` | OpenTelemetry tracing | opentelemetry-api, opentelemetry-sdk |
|
|
98
|
+
| `logging` | Structured logging | structlog |
|
|
99
|
+
| `security` | JWT auth + token generation | PyJWT, cryptography |
|
|
100
|
+
| `production` | unsloth + ui + validation + logging + security | (bundle) |
|
|
101
|
+
|
|
102
|
+
**Requirements:** Python 3.10+ · CUDA GPU (8GB+ VRAM) · PyTorch 2.0+
|
|
103
|
+
|
|
104
|
+
### Platform prerequisites
|
|
105
|
+
|
|
106
|
+
Backpropagate handles the runtime quirks (multiprocessing, xformers on RTX 40/50, dataloader workers on Windows). It does **not** handle the install-time platform pain — fix those first:
|
|
107
|
+
|
|
108
|
+
- **CUDA toolkit version.** PyTorch is published per-CUDA — picking the wrong wheel silently installs CPU-only torch. Use the picker at <https://pytorch.org/get-started/locally/> for the exact `pip install torch ...` command for your driver. Run `nvidia-smi` to see your driver / CUDA version.
|
|
109
|
+
- **Windows.** Visual Studio Build Tools (C++) and CMake are required for the `[export]` extra (`llama-cpp-python` builds from source). `bitsandbytes` now publishes native Windows wheels (>= 0.43); older guides mentioning `bitsandbytes-windows` are stale.
|
|
110
|
+
- **macOS.** GPU training is **not supported** — no CUDA. You can install Backpropagate to run *inference* on an exported GGUF via Ollama, but `trainer.train()` raises `DEP_GPU_NOT_AVAILABLE`. Use a CUDA machine for training.
|
|
111
|
+
- **Linux.** Most distros work out of the box. If you're using the PyPI binary release, note that the Linux build uses CPU-only torch (to stay under GitHub's 2 GB release-asset cap); install the matching CUDA build of torch from pytorch.org first.
|
|
112
|
+
|
|
113
|
+
For the long-form install troubleshooting, see [the troubleshooting handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting/).
|
|
114
|
+
|
|
115
|
+
## Configuration
|
|
116
|
+
|
|
117
|
+
All settings can be overridden with environment variables using the `BACKPROPAGATE_` prefix (e.g., `BACKPROPAGATE_LOG_LEVEL=debug`). A `.env` file in the project root is loaded automatically when the `[validation]` extra is installed.
|
|
118
|
+
|
|
119
|
+
Common knobs (see [the full env-vars reference](https://mcp-tool-shop-org.github.io/backpropagate/handbook/env-vars/) for everything):
|
|
120
|
+
|
|
121
|
+
| Variable | Default | Notes |
|
|
122
|
+
|----------|---------|-------|
|
|
123
|
+
| `BACKPROPAGATE_LOG_LEVEL` | `INFO` | `DEBUG` / `INFO` / `WARNING` / `ERROR` |
|
|
124
|
+
| `BACKPROPAGATE_LOG_JSON` | auto | Force JSON (`true`) or console (`false`) logs |
|
|
125
|
+
| `BACKPROPAGATE_LOG_FILE` | unset | Path to mirror logs into |
|
|
126
|
+
| `BACKPROPAGATE_DEFER_FEATURE_DETECTION` | unset | Skip optional-dep detection at startup for the fastest CLI cold start |
|
|
127
|
+
| `BACKPROPAGATE_SECURITY__REQUIRE_AUTH_FOR_SHARE` | `true` | When `true`, refuses `backprop ui --share` without `--auth` |
|
|
128
|
+
| `BACKPROPAGATE_UI__OUTPUT_DIR` | `~/.backpropagate/ui-outputs` | Sandbox base for all UI filesystem writes; denylist-validated |
|
|
129
|
+
| `BACKPROPAGATE_MODEL__NAME` | `Qwen/Qwen2.5-7B-Instruct` | Default model |
|
|
130
|
+
| `BACKPROPAGATE_TRAINING__LEARNING_RATE` | `2e-4` | Learning rate |
|
|
131
|
+
| `BACKPROPAGATE_LORA__R` | `16` | LoRA rank |
|
|
132
|
+
|
|
133
|
+
Nested keys use double underscore as the delimiter (Pydantic `env_nested_delimiter` convention).
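To make the naming rule concrete, here is a small sketch of how a prefixed, double-underscore-delimited variable name maps onto a nested setting. It is an illustration of the convention only: `nest_env` is a hypothetical helper, the real parsing is done by Pydantic settings, and values stay as strings here whereas Pydantic would coerce them to the declared field types.

```python
def nest_env(environ: dict[str, str], prefix: str = "BACKPROPAGATE_") -> dict:
    """Fold PREFIX_SECTION__KEY variables into a nested, lower-cased dict."""
    settings: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variables such as PATH are ignored
        # "LORA__R" -> ["lora", "r"]; "LOG_LEVEL" -> ["log_level"]
        path = name[len(prefix):].lower().split("__")
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return settings


env = {
    "BACKPROPAGATE_LORA__R": "32",
    "BACKPROPAGATE_TRAINING__LEARNING_RATE": "1e-4",
    "BACKPROPAGATE_LOG_LEVEL": "debug",
    "PATH": "/usr/bin",
}
print(nest_env(env))
```

Note that a single underscore (`LOG_LEVEL`) stays part of the key name; only the double underscore introduces a level of nesting.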
|
|
36
134
|
|
|
37
135
|
## Usage
|
|
38
136
|
|
|
39
|
-
|
|
40
|
-
# Show system info (GPU, Python, PyTorch, CUDA)
|
|
41
|
-
backpropagate info
|
|
137
|
+
### Basic Training
|
|
42
138
|
|
|
43
|
-
|
|
44
|
-
backpropagate
|
|
139
|
+
```python
|
|
140
|
+
from backpropagate import Trainer
|
|
45
141
|
|
|
46
|
-
|
|
47
|
-
|
|
142
|
+
trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")
|
|
143
|
+
trainer.train("my_data.jsonl", steps=100)
|
|
144
|
+
trainer.save("./my-model")
|
|
145
|
+
trainer.export("gguf", quantization="q4_k_m")
|
|
146
|
+
```
|
|
147
|
+
|
|
148
|
+
`Qwen/Qwen2.5-7B-Instruct` is the canonical default — the value `Trainer()` resolves when called with no model argument (see [`config.py`](backpropagate/config.py) `ModelConfig.name`). Older examples pinned the pre-quantized `unsloth/Qwen2.5-7B-Instruct-bnb-4bit`; we switched the default to the official Qwen weights for better reliability ([CHANGELOG v0.1.3](CHANGELOG.md)). Either model works.
|
|
149
|
+
|
|
150
|
+
### Multi-Run SLAO Training
|
|
48
151
|
|
|
49
|
-
|
|
50
|
-
backpropagate
|
|
152
|
+
```python
|
|
153
|
+
from backpropagate import Trainer
|
|
51
154
|
|
|
52
|
-
|
|
53
|
-
backpropagate ui
|
|
155
|
+
trainer = Trainer("Qwen/Qwen2.5-7B-Instruct")
|
|
54
156
|
|
|
55
|
-
|
|
56
|
-
|
|
157
|
+
result = trainer.multi_run(
|
|
158
|
+
dataset="HuggingFaceH4/ultrachat_200k",
|
|
159
|
+
num_runs=5,
|
|
160
|
+
steps_per_run=100,
|
|
161
|
+
samples_per_run=1000,
|
|
162
|
+
merge_mode="slao", # Single LoRA Continual Learning via Asymmetric Merging
|
|
163
|
+
)
|
|
57
164
|
```
|
|
58
165
|
|
|
59
|
-
|
|
166
|
+
SLAO (Single LoRA Continual Learning via Asymmetric Merging) implements the [Merge before Forget](https://arxiv.org/abs/2512.23017) paper: orthogonal A-matrix init via QR decomposition, asymmetric A/B handling, and time-aware `λ(i) = 1/√i` scaling. The CLI flag is `--samples` (the underlying field is `samples_per_run`).
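The time-aware factor is easy to tabulate. The sketch below computes `λ(i) = 1/√i` for a 1-based run index and, as a clearly labeled assumption, applies it as a plain interpolation toward the newest weights. The `blend` function is a stand-in for illustration and is not taken from the paper or from `slao.py`; the QR-based initialization and the asymmetric A/B handling are not shown.

```python
import math


def slao_lambda(run_index: int) -> float:
    """Time-aware scaling factor lambda(i) = 1 / sqrt(i) for run i (1-based)."""
    if run_index < 1:
        raise ValueError("run_index is 1-based")
    return 1.0 / math.sqrt(run_index)


def blend(merged: list[float], new: list[float], run_index: int) -> list[float]:
    """ASSUMED update rule: move merged weights toward the new ones by lambda(i)."""
    lam = slao_lambda(run_index)
    return [m + lam * (n - m) for m, n in zip(merged, new)]


for i in range(1, 6):
    print(f"run {i}: lambda = {slao_lambda(i):.3f}")
```

The factor shrinks as runs accumulate, so under this assumed rule later runs move the merged weights less than earlier ones.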
|
|
60
167
|
|
|
61
|
-
|
|
62
|
-
- **VRAM-Aware** — Automatic batch sizing and GPU memory management
|
|
63
|
-
- **Multi-Run SLAO** — Prevents catastrophic forgetting during long training runs
|
|
64
|
-
- **One-Click Export** — GGUF export with automatic Ollama registration
|
|
65
|
-
- **Windows-First** — Tested and optimized for Windows, Linux, and macOS
|
|
66
|
-
- **Headless** — Built for CI/CD pipelines and automated workflows
|
|
168
|
+
### Export to Ollama
|
|
67
169
|
|
|
68
|
-
|
|
170
|
+
```python
|
|
171
|
+
# Export to GGUF
|
|
172
|
+
result = trainer.export("gguf", quantization="q4_k_m")
|
|
173
|
+
|
|
174
|
+
# Register with Ollama separately
|
|
175
|
+
from backpropagate import register_with_ollama
|
|
176
|
+
register_with_ollama(result.path, "my-finetuned-model")
|
|
177
|
+
# ollama run my-finetuned-model
|
|
178
|
+
```
|
|
69
179
|
|
|
70
|
-
|
|
180
|
+
### CLI
|
|
71
181
|
|
|
72
182
|
```bash
|
|
73
|
-
|
|
183
|
+
backprop train --data my_data.jsonl --model Qwen/Qwen2.5-7B-Instruct --steps 100
|
|
184
|
+
backprop multi-run --data my_data.jsonl --runs 5 --steps 100
|
|
185
|
+
backprop export ./output/lora --format gguf --quantization q4_k_m --ollama --ollama-name my-model
|
|
186
|
+
backprop ui --port 7862
|
|
187
|
+
backprop info
|
|
188
|
+
backprop list-runs # v1.1.0: query past training runs
|
|
189
|
+
backprop show-run <run-id> # v1.1.0: detail view
|
|
190
|
+
backprop resume <run-id> # v1.1.0: resume a crashed multi-run
|
|
191
|
+
backprop push ./output/lora --repo me/my-model # v1.1.0: push adapter to HF Hub
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
See the [CLI reference](https://mcp-tool-shop-org.github.io/backpropagate/handbook/cli-reference/) for every subcommand and flag, or run `backprop <subcommand> --help`.
|
|
195
|
+
|
|
196
|
+
### Resume from checkpoint (v1.1.0)
|
|
197
|
+
|
|
198
|
+
A 5-run multi-run that crashes at run 4 is now recoverable. Every multi-run session writes its run_id into both `run_history.json` and the on-disk checkpoint manifest, so picking up where you left off is one command:
|
|
199
|
+
|
|
200
|
+
```bash
|
|
201
|
+
backprop resume <run-id> # picks up the in-progress session
|
|
202
|
+
backprop multi-run --data ... --resume <run-id> # explicit form
|
|
203
|
+
backprop train --data ... --resume <run-id> # single-run resume (continues run_id)
|
|
74
204
|
```
|
|
75
205
|
|
|
76
|
-
The
|
|
206
|
+
By default, `backprop multi-run` (no `--resume`) auto-detects an in-progress entry for the same output directory and continues it. To force a clean session, pass `resume_from="off"` (Python API), or on the CLI omit `--resume` and start in a fresh output directory.
|
|
207
|
+
|
|
208
|
+
When a multi-run resumes, the latest checkpoint for that run_id is loaded into the model, the SLAO merger state is restored from `slao/` next to the checkpoint, and the run loop continues from `last_completed_run + 1`. The history entry's `status` flips back to `running` so `backprop list-runs --status running` shows the live session.
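The bookkeeping part of that resume step can be sketched as follows. This assumes a history entry is a dict carrying the `status` and `last_completed_run` fields named above; the dict layout and the `plan_resume` helper are hypothetical, and loading the checkpoint and SLAO merger state is left out.

```python
def plan_resume(entry: dict, num_runs: int) -> range:
    """Return the run indices still to execute and mark the entry as running."""
    # A session that never finished a run restarts at run 1.
    next_run = entry.get("last_completed_run", 0) + 1
    entry["status"] = "running"  # makes the session visible as live again
    return range(next_run, num_runs + 1)


# A 5-run session that crashed during run 4 has 3 completed runs on record.
entry = {"run_id": "abcd1234", "status": "failed", "last_completed_run": 3}
remaining = plan_resume(entry, num_runs=5)
print(list(remaining), entry["status"])
```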
|
|
77
209
|
|
|
78
|
-
|
|
210
|
+
### Experiment tracking (v1.1.0)
|
|
79
211
|
|
|
80
|
-
|
|
212
|
+
`Trainer` auto-detects installed experiment trackers (`wandb`, `tensorboard`, `mlflow`) and wires them into the underlying `transformers.TrainingArguments`. The default `report_to="auto"` picks up whatever's importable:
|
|
213
|
+
|
|
214
|
+
```bash
|
|
215
|
+
pip install backpropagate[monitoring] # installs wandb + psutil
|
|
216
|
+
wandb login # one-time
|
|
217
|
+
backprop train --data my_data.jsonl # W&B run gets the same run_id prefix as the on-disk history
|
|
218
|
+
```
|
|
81
219
|
|
|
82
|
-
|
|
83
|
-
2. Downloads the matching binary from [GitHub Releases](https://github.com/mcp-tool-shop-org/backpropagate/releases)
|
|
84
|
-
3. Verifies the SHA256 checksum
|
|
85
|
-
4. Caches the binary locally (~/.cache/mcptoolshop/ or %LOCALAPPDATA%\mcptoolshop\)
|
|
86
|
-
5. Runs the binary with full argument passthrough
|
|
220
|
+
Override with `Trainer(report_to=["wandb"])`, `Trainer(report_to=["tensorboard"])`, or `Trainer(report_to="none")` to opt out explicitly. For MLflow add `pip install mlflow`; for TensorBoard add `pip install tensorboard`. The W&B run name is `backprop-<run_id_prefix>` so an operator can grep across W&B, our logs, and `run_history.json` by the same identifier.
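One way to picture the `"auto"` behavior is a check for which tracker packages are importable. The sketch below uses the standard library's `importlib.util.find_spec`; the helper names are hypothetical and this is not necessarily how `Trainer` performs its detection.

```python
from importlib.util import find_spec

CANDIDATES = ("wandb", "tensorboard", "mlflow")


def detect_trackers(candidates=CANDIDATES) -> list[str]:
    """Return the candidate packages that are importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


def resolve_report_to(report_to="auto"):
    """Expand "auto" to the detected trackers; pass explicit choices through."""
    if report_to == "auto":
        return detect_trackers() or "none"
    return report_to


print(resolve_report_to())           # depends on what is installed
print(resolve_report_to(["wandb"]))  # explicit choice is kept as-is
```

`find_spec` only looks the package up without importing it, so the check is cheap and has no side effects such as starting a tracker session.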
|
|
87
221
|
|
|
88
|
-
|
|
222
|
+
### Training history
|
|
89
223
|
|
|
90
|
-
|
|
224
|
+
Every `backprop train` and `backprop multi-run` invocation records a row in `<output>/run_history.json` with the run_id, model, dataset, hyperparameters, status, final loss, loss history, and (for multi-run) the SLAO merge timeline. List recent runs:
|
|
91
225
|
|
|
92
226
|
```bash
|
|
93
|
-
#
|
|
94
|
-
|
|
227
|
+
backprop list-runs # most recent 20 runs, all statuses
|
|
228
|
+
backprop list-runs --status failed # filter
|
|
229
|
+
backprop list-runs --json --limit 100 # machine-readable
|
|
230
|
+
backprop show-run abcd1234 # detail view (partial run_id ok)
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
Run history survives across processes — the `Runs` tab in the web UI is a separate, in-memory view; the on-disk history is the source of truth for `list-runs` / `show-run` / `resume`.
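Because the history is a plain JSON file, it can also be inspected from a script. The sketch below assumes the file holds a JSON list of entries, oldest first, each with at least `run_id` and `status`; that layout is an assumption for illustration, and `list_runs` / `find_run` are hypothetical helpers that mimic the filtering and partial-id lookup described above.

```python
import json
from typing import Optional


def list_runs(history_json: str, status: Optional[str] = None, limit: int = 20) -> list[dict]:
    """Newest-first view of the history, optionally filtered by status."""
    runs = json.loads(history_json)  # assumed: a JSON list, oldest entry first
    if status is not None:
        runs = [r for r in runs if r.get("status") == status]
    return runs[::-1][:limit]


def find_run(runs: list[dict], partial_id: str) -> dict:
    """Resolve a partial run_id to exactly one entry, or raise LookupError."""
    matches = [r for r in runs if r["run_id"].startswith(partial_id)]
    if len(matches) != 1:
        raise LookupError(f"{len(matches)} runs match {partial_id!r}")
    return matches[0]
```

Rejecting an ambiguous prefix, rather than picking the first match, avoids resuming or inspecting the wrong run.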
|
|
234
|
+
|
|
235
|
+
### Web UI
|
|
236
|
+
|
|
237
|
+
Launch the Reflex interface locally:
|
|
238
|
+
|
|
239
|
+
```bash
|
|
240
|
+
backprop ui --port 7862
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
To expose a public-internet URL, you must pair `--share` with `--auth`:
|
|
244
|
+
|
|
245
|
+
```bash
|
|
246
|
+
backprop ui --share --auth alice:hunter2
|
|
247
|
+
```
|
|
248
|
+
|
|
249
|
+
`backprop ui --share` without `--auth` exits with code `1` and the structured error `[INPUT_AUTH_REQUIRED]`. The rationale: `--share` publishes a `*.gradio.live` URL that anyone on the internet can hit, and without auth that means anyone can drive your training pipeline.
|
|
250
|
+
|
|
251
|
+
To explicitly opt out (e.g. in an internal dev environment), set the env var `BACKPROPAGATE_SECURITY__REQUIRE_AUTH_FOR_SHARE=false`. A loud warning prints on every launch, and there is a 5-second grace period before the unauthenticated UI binds, so you can press `Ctrl-C` if it looks wrong.
|
|
252
|
+
|
|
253
|
+
Filesystem writes from the UI are sandboxed to a single directory:
|
|
254
|
+
|
|
255
|
+
- Default: `~/.backpropagate/ui-outputs`
|
|
256
|
+
- Override: `BACKPROPAGATE_UI__OUTPUT_DIR=/path/you/own`
|
|
257
|
+
- The override is **denylist-validated** — system / credential paths (`/etc`, `/var`, `~/.ssh`, `~/.aws`, `C:\Windows\System32`, etc.) are refused with `[UI_OUTPUT_DIR_FORBIDDEN]`.
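A denylist check of this kind can be sketched with pure path arithmetic. The example below is an illustration under stated assumptions, not the code in `ui_security.py`: it covers only the POSIX entries listed above, takes the home directory as a parameter, collapses `..` segments lexically, and does not resolve symlinks or handle Windows paths.

```python
import posixpath
from pathlib import PurePosixPath

FORBIDDEN = ("/etc", "/var", "~/.ssh", "~/.aws")  # subset of the list above


def _expand(path: str, home: str) -> PurePosixPath:
    """Expand a leading ~ and collapse "." / ".." segments lexically."""
    if path.startswith("~"):
        path = home + path[1:]
    return PurePosixPath(posixpath.normpath(path))


def is_forbidden(candidate: str, home: str = "/home/alice") -> bool:
    """True if candidate is a denylisted directory or lives inside one."""
    path = _expand(candidate, home)
    for entry in FORBIDDEN:
        root = _expand(entry, home)
        if path == root or root in path.parents:
            return True
    return False


print(is_forbidden("~/.backpropagate/ui-outputs"))
print(is_forbidden("/home/alice/outputs/../../../etc"))
```

Comparing whole path components, rather than string prefixes, keeps a directory such as `/etcetera` from being mistaken for `/etc`.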
|
|
258
|
+
|
|
259
|
+
## Windows Support
|
|
260
|
+
|
|
261
|
+
Backpropagate is designed to work on Windows out of the box:
|
|
262
|
+
|
|
263
|
+
- Pre-tokenization to avoid multiprocessing crashes
|
|
264
|
+
- Automatic xformers disable for RTX 40/50 series
|
|
265
|
+
- Safe dataloader settings
|
|
266
|
+
- Tested on RTX 5080 (16GB VRAM)
|
|
267
|
+
|
|
268
|
+
## Model Presets
|
|
269
|
+
|
|
270
|
+
| Preset | VRAM | Speed | Quality |
|
|
271
|
+
|--------|------|-------|---------|
|
|
272
|
+
| Qwen 2.5 7B | ~12GB | Medium | Best |
|
|
273
|
+
| Qwen 2.5 3B | ~8GB | Fast | Good |
|
|
274
|
+
| Llama 3.2 3B | ~8GB | Fast | Good |
|
|
275
|
+
| Llama 3.2 1B | ~6GB | Fastest | Basic |
|
|
276
|
+
| Mistral 7B | ~12GB | Medium | Good |
|
|
277
|
+
|
|
278
|
+
## Architecture
|
|
95
279
|
|
|
96
|
-
# Clear cached binaries
|
|
97
|
-
backpropagate --clear-cache
|
|
98
280
|
```
|
|
281
|
+
backpropagate/
|
|
282
|
+
├── trainer.py # Core Trainer class
|
|
283
|
+
├── multi_run.py # Multi-run SLAO training
|
|
284
|
+
├── slao.py # SLAO LoRA merging algorithm
|
|
285
|
+
├── datasets.py # Dataset loading, filtering & curriculum
|
|
286
|
+
├── export.py # GGUF/Ollama export
|
|
287
|
+
├── config.py # Pydantic settings + training presets
|
|
288
|
+
├── gpu_safety.py # GPU monitoring & safety
|
|
289
|
+
├── cli.py # CLI entry point (backprop command)
|
|
290
|
+
├── checkpoints.py # Checkpoint management
|
|
291
|
+
├── exceptions.py # Structured error hierarchy
|
|
292
|
+
├── feature_flags.py # Optional feature detection
|
|
293
|
+
├── security.py # Path traversal & torch security
|
|
294
|
+
├── logging_config.py # Structured logging setup
|
|
295
|
+
├── ui_theme.py # Radix theme tokens + CSS (Reflex era)
|
|
296
|
+
├── ui_state.py # rx.State subclasses
|
|
297
|
+
├── ui_app/ # Reflex web interface (Radix UI)
|
|
298
|
+
│ ├── app.py # rx.App entry point
|
|
299
|
+
│ ├── chrome.py # Header / LeftNav / SideRail / Footer
|
|
300
|
+
│ ├── pages/ # Train / Multi-Run / Export / Dataset
|
|
301
|
+
│ └── components/ # Bp* primitives (status pill, sparkline, event log…)
|
|
302
|
+
├── ui_security.py # Rate limiting, CSRF, file validation (framework-agnostic)
|
|
303
|
+
├── ui_gradio_legacy.py # DEPRECATED — preserved as v1.0 reference; removed in v1.2
|
|
304
|
+
└── theme_gradio_legacy.py # DEPRECATED — same
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
## Troubleshooting
|
|
308
|
+
|
|
309
|
+
A short index of the most common first-run failures. The full reverse index lives at [the troubleshooting handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting/); every code below is documented at [error codes](https://mcp-tool-shop-org.github.io/backpropagate/handbook/error-codes/).
|
|
310
|
+
|
|
311
|
+
| Symptom | Code | Fix |
|
|
312
|
+
|---------|------|-----|
|
|
313
|
+
| GPU runs out of memory mid-training | `RUNTIME_GPU_OOM` | OOM auto-recovery (B-002) halves batch size up to 3 times automatically. To opt out: `Trainer(oom_recovery=False)`. To force smaller: `--batch-size 1`. |
|
|
314
|
+
| HF Hub returns 401 / "model not found" | `DEP_MODEL_LOAD_FAILED` | `huggingface-cli login` and re-try. For typos, copy the exact id from <https://huggingface.co/models>. |
|
|
315
|
+
| Bad model name typo | `INPUT_VALIDATION_FAILED` or `DEP_MODEL_LOAD_FAILED` | Verify the `org/name` identifier at <https://huggingface.co/models>. |
|
|
316
|
+
| `register_with_ollama` connection refused | `DEP_OLLAMA_REGISTRATION_FAILED` | Start the daemon: `ollama serve`. Install from <https://ollama.com>. Retryable. |
|
|
317
|
+
| Disk full during checkpoint save | `STATE_CHECKPOINT_INVALID` | Atomic writes leave a `.partial` directory on crash — safe to delete. Previous good checkpoint is intact. |
|
|
318
|
+
| Training paused / aborted on GPU overheat | `RUNTIME_GPU_TEMPERATURE_CRITICAL` | B-003 monitor pauses on NVML temp threshold; resumes automatically as the GPU cools. Improve airflow or lower sustained load. |
|
|
319
|
+
| `backprop ui --share` rejected | `INPUT_AUTH_REQUIRED` | Pass `--auth user:password`, or set `BACKPROPAGATE_SECURITY__REQUIRE_AUTH_FOR_SHARE=false` to opt out (loud warning). |
|
|
320
|
+
| Multi-run "validation overlap" | `CONFIG_INVALID` (Stage A backend B-001) | Lower `--samples` below the training-pool size, enlarge the dataset, or disable validation. |
|
|
321
|
+
| GGUF export failed on first try | `RUNTIME_GGUF_EXPORT_FAILED` | `pip install backpropagate[export]`; on Windows you also need Visual C++ Build Tools + CMake. |
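The batch-halving recovery described in the first row can be pictured as a small retry loop. This is a sketch of the general idea only: `run_with_oom_recovery` is a hypothetical helper, and Python's built-in `MemoryError` stands in for a GPU out-of-memory error so the example runs without a GPU.

```python
from typing import Callable


def run_with_oom_recovery(step: Callable[[int], str], batch_size: int,
                          max_halvings: int = 3) -> str:
    """Call step(batch_size); on MemoryError halve the batch size and retry."""
    for attempt in range(max_halvings + 1):
        try:
            return step(batch_size)
        except MemoryError:
            # Give up once the retry budget is spent or nothing is left to halve.
            if attempt == max_halvings or batch_size == 1:
                raise
            batch_size = max(1, batch_size // 2)
    raise AssertionError("unreachable")


def fake_step(batch_size: int) -> str:
    """Pretend anything above batch size 4 does not fit in memory."""
    if batch_size > 4:
        raise MemoryError("simulated OOM")
    return f"ok@{batch_size}"


print(run_with_oom_recovery(fake_step, batch_size=16))
```

With a starting batch size of 16 the loop tries 16, then 8, then succeeds at 4. If the step still fails after three halvings, the original error propagates to the caller.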
|
|
322
|
+
|
|
323
|
+
## Reporting bugs
|
|
99
324
|
|
|
100
|
-
|
|
325
|
+
When something fails, Backpropagate prints a `run_started run_id=<uuid>` line at startup and binds the same id to checkpoint manifests, SLAO merge history, and structured log lines. Include the `run_id` in any bug report — it lets a maintainer correlate every log line, every checkpoint, and every merge for that exact run.
|
|
101
326
|
|
|
102
|
-
|
|
103
|
-
|
|
104
|
-
|
|
105
|
-
|
|
327
|
+
A good bug report includes:
|
|
328
|
+
|
|
329
|
+
1. **`run_id`** — the uuid printed at startup (also available as `TrainingRun.run_id` and `RunResult.run_id`).
|
|
330
|
+
2. **The error code** — the `[CODE_NAME]: message` line in stderr is what to grep for; see [error codes](https://mcp-tool-shop-org.github.io/backpropagate/handbook/error-codes/) for the catalog.
|
|
331
|
+
3. **The redacted command line.** Stderr in non-verbose mode is automatically redacted (Bearer tokens, `sk-*`, `hf_*`, AWS keys, `password=`/`token=`/`api_key=` pairs are scrubbed) — safe to paste. For the full unredacted traceback, re-run with `--verbose`, but review before posting.
|
|
332
|
+
4. **Python / PyTorch versions, GPU model, OS.** `backprop info` prints all of this in one go.
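If you want to double-check a log excerpt before posting it, a scrubber along the lines of item 3 can be sketched with regular expressions. The patterns below are illustrative assumptions covering Bearer tokens, `sk-*` / `hf_*` prefixes, and `password=` / `token=` / `api_key=` pairs; they are not the patterns Backpropagate uses, and AWS keys are not handled.

```python
import re

_PATTERNS = [
    # "Bearer <token>" in HTTP-style headers
    (re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # API keys recognizable by their sk- or hf_ prefix
    (re.compile(r"\b(sk-|hf_)[A-Za-z0-9_-]+"), r"\1[REDACTED]"),
    # key=value pairs for common secret names
    (re.compile(r"\b(password|token|api_key)=\S+", re.IGNORECASE), r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Apply each pattern in order and return the scrubbed text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("Authorization: Bearer abc.def token=xyz hf_AbC123"))
```

Pattern-based scrubbing can miss secrets in unexpected formats, so a manual read-through before posting is still worthwhile.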
|
|
333
|
+
|
|
334
|
+
## Privacy
|
|
335
|
+
|
|
336
|
+
All training happens locally on your GPU. Backpropagate makes no network requests except to download models from HuggingFace (which you initiate). No telemetry, no cloud dependency.
|
|
337
|
+
|
|
338
|
+
## Scorecard
|
|
339
|
+
|
|
340
|
+
| Category | Score | Notes |
|
|
341
|
+
|----------|-------|-------|
|
|
342
|
+
| A. Security | 6/8 | SECURITY.md, trust model, no secrets/telemetry, safe_path(). MCP items skipped |
|
|
343
|
+
| B. Error Handling | 5/7 | Structured exception shape (`code`/`message`/`hint`/`cause`/`retryable`) via ERROR_CODES registry; CLI exit codes 0/1/2/3; no raw stack traces without `--verbose`; `run_id` correlation; redacted stderr; `--share`+`--auth` gating. MCP/desktop/vscode skipped. |
|
|
344
|
+
| C. Operator Docs | 4/7 | README, CHANGELOG, LICENSE, --help. Logging/MCP/complex skipped |
|
|
345
|
+
| D. Shipping Hygiene | 6/9 | verify.sh, version=tag, 5 scanners in CI, dependabot, python_requires, clean build |
|
|
346
|
+
| E. Identity | 4/4 | Logo, translations, landing page, metadata |
|
|
347
|
+
| **Total** | **25/31** | 14 items skipped with justification · `shipcheck audit` passes 100% · Audit date: 2026-05-21 (B-row re-graded after Stage B + Stage A CLI exit-code work) |
|
|
348
|
+
|
|
349
|
+
Design history and what each line item maps to: see [ROADMAP.md](ROADMAP.md) — all Week 1–4 items are shipped in v1.1.0.
|
|
106
350
|
|
|
107
351
|
## License
|
|
108
352
|
|
|
109
|
-
MIT
|
|
353
|
+
MIT — see [LICENSE](LICENSE) for details.
|
|
354
|
+
|
|
355
|
+
---
|
|
356
|
+
|
|
357
|
+
<p align="center">
|
|
358
|
+
Built by <a href="https://mcp-tool-shop.github.io/">MCP Tool Shop</a>
|
|
359
|
+
</p>
|