@mcptoolshop/backpropagate 1.3.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.es.md +175 -121
- package/README.fr.md +178 -124
- package/README.hi.md +165 -111
- package/README.it.md +171 -117
- package/README.ja.md +169 -115
- package/README.md +73 -19
- package/README.pt-BR.md +170 -116
- package/README.zh.md +171 -117
- package/bin/backpropagate.js +25 -7
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -28,7 +28,7 @@ trainer.export("gguf", quantization="q4_k_m")
|
|
|
28
28
|
```
|
|
29
29
|
|
|
30
30
|
```bash
|
|
31
|
-
backprop
|
|
31
|
+
backprop export ./output/lora --format gguf --quantization q4_k_m --ollama --ollama-name my-model
|
|
32
32
|
ollama run my-model
|
|
33
33
|
```
|
|
34
34
|
|
|
@@ -61,11 +61,11 @@ Prefer Docker? `docker pull ghcr.io/mcp-tool-shop-org/backpropagate:latest` work
|
|
|
61
61
|
There are several good libraries for fine-tuning LLMs. They're each great at different things:
|
|
62
62
|
|
|
63
63
|
- **[Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)** — if you like YAML configs and want a community of recipes to copy from
|
|
64
|
-
- **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** — if you want a web GUI
|
|
64
|
+
- **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** — if you want DPO/PPO/RLHF and a web GUI
|
|
65
65
|
- **[Unsloth](https://github.com/unslothai/unsloth)** — if you need the fastest possible training and you're on a supported model family
|
|
66
66
|
- **[torchtune](https://github.com/pytorch/torchtune)** — if you want Meta's first-party PyTorch-native recipes you can edit
|
|
67
67
|
|
|
68
|
-
Backpropagate is the missing option: **a 3-line Python API for solo operators on a single consumer GPU who want to train an adapter and ship it.** No YAML, no GUI, no
|
|
68
|
+
Backpropagate is the missing option: **a 3-line Python API for solo operators on a single consumer GPU who want to train an adapter and ship it.** No YAML, no GUI, no online RL (PPO/GRPO), no multi-node. Just the loop everyone actually needs and the export step that gets in the way.
|
|
69
69
|
|
|
70
70
|
If you tried one of the libraries above and bounced off the config-file ceremony, or hit a model-family gap, or wanted Windows-first defaults — Backpropagate is for you.
|
|
71
71
|
|
|
@@ -76,20 +76,25 @@ Here's the practical envelope on a 16GB card (RTX 4080 / 5080 / 4070 Ti Super):
|
|
|
76
76
|
| Model | Method | Status |
|
|
77
77
|
|---|---|---|
|
|
78
78
|
| Qwen-3.5-4B / Phi-4-mini-3.8B / SmolLM3-3B | LoRA / QLoRA / DoRA | Comfortable. Full sequence length, room to spare. |
|
|
79
|
+
| SmolLM3-3B / Qwen2.5-3B / Llama-3.2-3B / Llama-3.2-1B | `mode="full"` (full fine-tuning) | v1.4 — pass `--mode=full` on `backprop train` or `Trainer(..., mode="full")`. Loads full-precision (bf16) weights — no 4-bit, no adapter; gradient checkpointing + paged 8-bit Adam keep the footprint inside 16GB. |
|
|
79
80
|
| Qwen-2.5-7B / Llama-3.1-8B / Mistral-7B | QLoRA | Standard. ~7-8 GB. Backpropagate's default presets. |
|
|
80
81
|
| Llama-3 13B | QLoRA + sample packing | Tight but works. Use shorter sequences. |
|
|
81
|
-
| Mixtral 8x7B (47B total parameters) |
|
|
82
|
+
| Mixtral 8x7B (47B total parameters) | — | Out of scope — 2-bit (AQLM / QuIP#) breaks the mergeable-adapter + GGUF-export contract, so it was retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md). On a 16GB card, use a ≤8B base. |
|
|
83
|
+
|
|
84
|
+
`mode="full"` admits models up to **4B parameters**. The four presets in the full-FT row above are genuine ~3B (true parameter count 3.08–3.24B) and fit a 16GB card. The 3.8–4B class (Phi-4-mini-3.8B, Qwen-3.5-4B) is also accepted by the ceiling but needs a **24GB+** card for full FT — weights + gradients alone approach 16GB before the optimizer and activations — so on a 16GB card use `mode="lora"` for those (they're in the LoRA row). Models >4B exit with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`.
|
|
85
|
+
|
|
86
|
+
2-bit quantization (AQLM / QuIP#) is **out of scope**. It was scoped for v1.4, then retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md): a 2-bit base can't be cleanly merged back into full-precision weights, which breaks Backpropagate's mergeable-adapter → GGUF → Ollama export contract (the whole point of the pipeline). The headroom levers Backpropagate ships instead are the v1.5 **FP8 compute path** (`--fp8`, Blackwell/Hopper) and `mode="full"` for ≤4B models — both stay mergeable and exportable.
|
|
82
87
|
|
|
83
|
-
For models 3B and smaller, full fine-tuning (not just LoRA) is feasible on 16GB and
|
|
88
|
+
For models 3B and smaller, full fine-tuning (not just LoRA) is feasible on 16GB and now ships in v1.4 as `mode="full"`. Pass `Trainer(..., mode="full")` or `backprop train --mode=full --model phi-4-mini-3.8b` to enable it. A hard gate refuses the mode for models > 4B with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, naming LoRA + the sub-4B presets as the recovery options. See [the full fine-tuning handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) for the configuration math + Biderman 2024 / Thinking Machines 2025 quality comparison. For 7B+ models, full fine-tuning needs a 24GB+ GPU — consider an A100 cloud rental, or stick with LoRA, which recent research shows matches full fine-tuning quality on most post-training tasks anyway (see [the anti-pitch section](#what-backpropagate-is-not-for) for citations).
|
|
84
89
|
|
|
85
90
|
## What Backpropagate is NOT for
|
|
86
91
|
|
|
87
|
-
|
|
92
|
+
If your use case is below, you'll have a better time with a different library — Backpropagate is not the right pick and trying to make it work would cost more than just reaching for the right tool. Reading this section before you start saves the install-and-bounce cycle:
|
|
88
93
|
|
|
89
|
-
- **Full-parameter fine-tuning of 7B+ models** — Backpropagate uses LoRA / QLoRA, which trains a small adapter rather than updating every weight. For models 7B and larger, full fine-tuning needs 24GB+ of GPU memory and doesn't fit on a 16GB consumer card. For models 3B and smaller, full fine-tuning IS feasible on 16GB
|
|
90
|
-
- **
|
|
94
|
+
- **Full-parameter fine-tuning of 7B+ models** — Backpropagate uses LoRA / QLoRA, which trains a small adapter rather than updating every weight. For models 7B and larger, full fine-tuning needs 24GB+ of GPU memory and doesn't fit on a 16GB consumer card. For models 3B and smaller, full fine-tuning IS feasible on 16GB and ships in v1.4 as `mode="full"` (pass `Trainer(..., mode="full")` or `--mode=full` on the CLI; a hard gate raises `RUNTIME_FULL_FT_MODEL_TOO_LARGE` for models > 4B and names LoRA + the sub-4B presets as recoveries). The bigger picture: recent research ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) shows that LoRA at correct configuration matches full fine-tuning quality on most post-training tasks (instruction-following, domain adaptation, persona/style) at 67% of the compute — so for the work most operators actually want, you don't lose anything by sticking with LoRA. `mode="full"` exists for the cases where you've measured a quality gap and decided to spend the extra compute. If you genuinely need full fine-tuning of a 7B+ model, use HuggingFace `transformers.Trainer` directly on a 24GB+ card.
|
|
95
|
+
- **Online RL — PPO / GRPO / RLVR** — Backpropagate does single-stage SFT plus reference-free preference tuning (ORPO ships in v1.5; SimPO/KTO are planned). What it does *not* do is online reinforcement learning — PPO, GRPO, or RLVR — which needs a reward model or a generation-and-scoring loop on top of the training step. For those, use TRL directly or LLaMA-Factory. (Reference-free preference tuning fits the single-stage envelope because there's no separate reference model to hold in memory; see the ORPO note under [Quick Start](#quick-start).)
|
|
91
96
|
- **Multi-node training** — single GPU on one machine only. Multi-GPU on one machine works (via `accelerate launch`) but isn't officially supported.
|
|
92
|
-
- **macOS training** — Apple Silicon doesn't have CUDA, so
|
|
97
|
+
- **macOS training on the CUDA rail** — Apple Silicon doesn't have CUDA, so the CUDA path has to run on a Linux or Windows box with an NVIDIA GPU. You can still run the trained model on a Mac via Ollama. **New in v1.5:** an experimental MLX rail (`--backend mlx`) trains a LoRA adapter natively on Apple Silicon — see [Apple Silicon (MLX)](#apple-silicon-mlx--experimental-v15). It is LoRA-SFT-only and built-but-not-yet-dogfood-verified on real silicon, so for anything beyond a LoRA SFT (ORPO, full fine-tune, FP8, multi-run) you still want the CUDA rail.
|
|
93
98
|
- **Anything outside the tested model families** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B. Other models often work but aren't pinned in CI.
|
|
94
99
|
|
|
95
100
|
If you need any of those things, reach for one of the libraries listed above. They're better at them.
|
|
@@ -140,6 +145,53 @@ This trains a Qwen 2.5 7B adapter on 5 short ShareGPT-format conversations, then
|
|
|
140
145
|
|
|
141
146
|
Alpaca (`instruction` / `output`), OpenAI chat (`messages`), and raw text formats also work — Backpropagate auto-detects the format.
|
|
142
147
|
|
|
148
|
+
### Preference tuning (ORPO)
|
|
149
|
+
|
|
150
|
+
New in v1.5: train on preferences instead of plain demonstrations. ORPO is reference-free and single-stage — it folds the preference signal into the SFT step, so there's no separate reward or reference model and the 3-line shape is unchanged. Pass `--method orpo` (CLI) or `method="orpo"` (Python) and feed it a dataset of `{prompt, chosen, rejected}` (or just `{chosen, rejected}`) rows:
|
|
151
|
+
|
|
152
|
+
```jsonl
|
|
153
|
+
{"prompt": "What is Python?", "chosen": "A high-level programming language known for readability.", "rejected": "idk look it up"}
|
|
154
|
+
{"prompt": "Explain recursion.", "chosen": "A function that calls itself with a smaller input until a base case.", "rejected": "when something repeats"}
|
|
155
|
+
```
|
|
156
|
+
|
|
157
|
+
```python
|
|
158
|
+
from backpropagate import Trainer
|
|
159
|
+
|
|
160
|
+
trainer = Trainer("Qwen/Qwen2.5-7B-Instruct", method="orpo")
|
|
161
|
+
trainer.train("preferences.jsonl", steps=100)
|
|
162
|
+
trainer.export("gguf", quantization="q4_k_m")
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
```bash
|
|
166
|
+
backprop train --data preferences.jsonl --method orpo --steps 100
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
The default learning rate auto-lowers to `8e-6` for ORPO (the loss is sharper than plain SFT); tune `--orpo-beta` (default `0.1`) to weight the odds-ratio penalty. In v1.5 ORPO is `mode="lora"` only. SimPO and KTO are planned follow-ons; for online RL (PPO/GRPO) see [What Backpropagate is NOT for](#what-backpropagate-is-not-for).
|
|
170
|
+
|
|
171
|
+
### Reasoning-trace SFT (R1 distillation)
|
|
172
|
+
|
|
173
|
+
New in v1.5: distill a reasoning model the easy way. Pass `--reasoning-trace` (CLI) or `Trainer(..., reasoning_trace=True)` (Python) and feed it traces that keep a `<think>...</think>` chain-of-thought inside the assistant turn — the pure-SFT half of [DeepSeek-R1](https://arxiv.org/abs/2501.12948) distillation, no RL required. Backpropagate keeps `<think>` in the training target, drops empty / over-long traces (trace-length filtering), and raises the default `max_seq_length` to 8192 for the longer CoT. Critically, `<think>` stays **plain text** — no special tokens, no embedding resize — so the merged GGUF still exports to Ollama like any other fine-tune. SFT only. See the [reasoning-trace recipe](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) for the dataset shape and the tunable token band.
|
|
174
|
+
|
|
175
|
+
### Apple Silicon (MLX) — experimental, v1.5
|
|
176
|
+
|
|
177
|
+
New in v1.5: **one API, two rails.** CUDA stays the canonical, verified backend; MLX is a second rail that trains on an M-series Mac via Apple's [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) toolchain (unified memory, no CUDA). The same 3-line shape picks the rail by hardware — `backend='auto'` (the default) routes to CUDA on NVIDIA and to MLX on Apple Silicon, so existing CUDA rigs are byte-identical:
|
|
178
|
+
|
|
179
|
+
```python
|
|
180
|
+
from backpropagate import Trainer
|
|
181
|
+
|
|
182
|
+
# On an M-series Mac with `pip install 'backpropagate[mlx]'`:
|
|
183
|
+
trainer = Trainer("mlx-community/Qwen2.5-0.5B-Instruct-4bit", backend="mlx")
|
|
184
|
+
trainer.train("examples/quickstart.jsonl", steps=100)
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
```bash
|
|
188
|
+
backprop train --data my_data.jsonl --backend mlx --steps 100
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
In v1.5 the MLX rail is **LoRA SFT only** — no ORPO, no FP8, no `mode='full'`, no multi-run on MLX yet (each is rejected with `CONFIG_INVALID_SETTING`; use `backend='cuda'`/`'auto'` on an NVIDIA box for those). The resulting adapter is plain safetensors and exports to Ollama through the same path as the CUDA rail.
|
|
192
|
+
|
|
193
|
+
> ⚠️ **Honest status:** the MLX rail ships in v1.5 **built + unit-tested (mocked)** but **NOT yet dogfood-verified on real Apple Silicon** — `mlx-lm` is Apple-only and could not be run on the NVIDIA rig this was authored on. Treat it as experimental (same framing as the FP8 path), and please [report anomalies](#reporting-bugs) once it runs on an M-series Mac. Forcing `--backend mlx` on a non-Apple host errors with `CONFIG_INVALID_SETTING`; a missing `mlx_lm` toolchain on a Mac raises `DEP_MLX_UNAVAILABLE`.
|
|
194
|
+
|
|
143
195
|
For more end-to-end workflows (fine-tune-and-push-to-HF-Hub, resume after OOM, multi-run SLAO across a long campaign, etc.) see the [handbook recipes page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/).
|
|
144
196
|
|
|
145
197
|
### Web UI (optional)
|
|
@@ -151,7 +203,7 @@ pipx install "backpropagate[ui]"
|
|
|
151
203
|
backprop ui --port 7862
|
|
152
204
|
```
|
|
153
205
|
|
|
154
|
-
A local web interface opens at `http://localhost:7862`
|
|
206
|
+
A local web interface opens at `http://localhost:7862` for browsing datasets, validating formats, and assembling a training config visually. Training itself runs via `backprop train` (UI-driven training is on the roadmap — the Start button currently surfaces that note). The UI is local-only by default. To expose it to other devices, see [Web UI](#web-ui) below for the `--share` + `--auth` security contract.
|
|
155
207
|
|
|
156
208
|
## Multi-run training
|
|
157
209
|
|
|
@@ -221,7 +273,9 @@ The Reflex web interface is opt-in — install with `pipx install "backpropagate
|
|
|
221
273
|
backprop ui --port 7862
|
|
222
274
|
```
|
|
223
275
|
|
|
224
|
-
The UI runs locally on `http://localhost:7862`.
|
|
276
|
+
The UI runs locally on `http://localhost:7862`. Today it covers the **browse / validate / configure** half of the workflow — point it at a dataset, check the auto-detected format and stats, pick a model, and assemble a run config. **Launching the run is done from the CLI** (`backprop train` / `backprop multi-run`); the in-UI Start button surfaces a note pointing there. UI-driven training is a planned follow-up — until then the UI is the on-ramp and the CLI is the trigger.
|
|
277
|
+
|
|
278
|
+
To expose it to other devices (other people on your network, a public URL, etc.) you must pair `--share` (or `--host`) with `--auth`:
|
|
225
279
|
|
|
226
280
|
```bash
|
|
227
281
|
backprop ui --share --auth alice:hunter2
|
|
@@ -249,14 +303,14 @@ Filesystem writes from the UI are sandboxed to a single directory:
|
|
|
249
303
|
|
|
250
304
|
**Requirements:** Python 3.10+ · CUDA GPU (8GB+ VRAM) · PyTorch 2.0+
|
|
251
305
|
|
|
252
|
-
Python 3.10 reaches upstream end-of-life in October 2026
|
|
306
|
+
Python 3.10 reaches upstream end-of-life in October 2026 and is scheduled for drop in v1.5. For new installs, prefer Python 3.11 or 3.12 — 3.11 is the most-tested floor.
|
|
253
307
|
|
|
254
308
|
Backpropagate handles the runtime quirks of training on different platforms, but it can't fix install-time problems. The two most common are:
|
|
255
309
|
|
|
256
310
|
- **Wrong CUDA wheel.** PyTorch is published one binary per CUDA version. If you pick the wrong one, you silently get CPU-only PyTorch and training is impossibly slow. Use the wheel picker at <https://pytorch.org/get-started/locally/> for your driver. Run `nvidia-smi` to see your driver / CUDA version.
|
|
257
311
|
- **Windows + GGUF export.** The `[export]` extra builds `llama-cpp-python` from source, which needs Visual Studio Build Tools (C++ component) and CMake.
|
|
258
312
|
|
|
259
|
-
**macOS:**
|
|
313
|
+
**macOS:** the CUDA rail is not supported (no CUDA) — a CUDA-routed `trainer.train()` raises `DEP_GPU_NOT_AVAILABLE`, and you can run the trained adapter on a Mac via Ollama. **New in v1.5:** an experimental MLX rail (`--backend mlx`, `pip install 'backpropagate[mlx]'`) trains a LoRA adapter natively on Apple Silicon via `mlx_lm.lora` — LoRA SFT only, and built + unit-tested but not yet dogfood-verified on real silicon (see [Apple Silicon (MLX)](#apple-silicon-mlx--experimental-v15)). For the CUDA path, or for ORPO / full fine-tune / FP8 / multi-run, use a CUDA Linux or Windows machine.
|
|
260
314
|
|
|
261
315
|
See the [troubleshooting handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting/) for the long-form install fix-it guide, and the dedicated [CUDA troubleshooting page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting-cuda/) for driver / VRAM / xformers / bf16-vs-fp16 issues.
|
|
262
316
|
|
|
@@ -322,7 +376,7 @@ A short index of the most common first-run failures. The full reverse index is a
|
|
|
322
376
|
| `register_with_ollama` connection refused | `DEP_OLLAMA_REGISTRATION_FAILED` | Start the daemon: `ollama serve`. Install from <https://ollama.com>. Retryable. |
|
|
323
377
|
| Disk full during checkpoint save | `STATE_CHECKPOINT_INVALID` | Atomic writes leave a `.partial` directory on crash — safe to delete. The previous good checkpoint is intact. |
|
|
324
378
|
| Training paused on GPU overheat | `RUNTIME_GPU_TEMPERATURE_CRITICAL` | Automatic — Backpropagate pauses on the temperature threshold and resumes as the GPU cools. Improve airflow if it keeps happening. |
|
|
325
|
-
| `backprop ui --share` rejected | `
|
|
379
|
+
| `backprop ui --share` rejected | `RUNTIME_UI_AUTH_NOT_ENFORCED` | Pass `--auth user:password`, or use SSH port-forwarding instead (see [Web UI](#web-ui)). |
|
|
326
380
|
| GGUF export failed on first try | `RUNTIME_GGUF_EXPORT_FAILED` | `pip install backpropagate[export]`; on Windows you also need Visual C++ Build Tools + CMake. |
|
|
327
381
|
|
|
328
382
|
## Reporting bugs
|
|
@@ -331,12 +385,12 @@ When something fails, Backpropagate prints a line at startup like `run_started r
|
|
|
331
385
|
|
|
332
386
|
A good bug report includes:
|
|
333
387
|
|
|
334
|
-
1. **The `run_id`** — the UUID printed at startup.
|
|
335
|
-
2. **The error code** — the `[CODE_NAME]: message` line in stderr. See [error codes](https://mcp-tool-shop-org.github.io/backpropagate/handbook/error-codes/) for the catalog.
|
|
336
|
-
3. **The redacted
|
|
337
|
-
4. **Python / PyTorch
|
|
388
|
+
1. **The `run_id`** — the UUID printed at startup. One UUID lets a maintainer correlate every log line, every checkpoint, and every Weights & Biases entry for that exact run.
|
|
389
|
+
2. **The error code** — the `[CODE_NAME]: message` line in stderr. See [error codes](https://mcp-tool-shop-org.github.io/backpropagate/handbook/error-codes/) for the catalog of stable codes.
|
|
390
|
+
3. **The redacted traceback.** Stderr is automatically redacted in non-verbose mode (Bearer tokens, `sk-*`, `hf_*`, AWS keys, `password=` / `token=` / `api_key=` pairs are scrubbed) — safe to paste. For the full unredacted traceback, re-run with `BACKPROPAGATE_DEBUG=1` (or `--verbose`); review before posting.
|
|
391
|
+
4. **The `backprop info` output.** One command prints Python / PyTorch / CUDA / GPU model / VRAM / OS / installed extras — everything the maintainer needs to bisect a platform-specific regression.
|
|
338
392
|
|
|
339
|
-
Questions, ideas, or "is this expected" threads belong in [GitHub Discussions](https://github.com/mcp-tool-shop-org/backpropagate/discussions). Security issues should be reported privately via the [GitHub Security Advisory](https://github.com/mcp-tool-shop-org/backpropagate/security/advisories/new) form — see [SECURITY.md](SECURITY.md) for the policy.
|
|
393
|
+
The [bug report template](https://github.com/mcp-tool-shop-org/backpropagate/issues/new?template=bug_report.yml) prompts for each of these explicitly so triage moves fast. Questions, ideas, or "is this expected?" threads belong in [GitHub Discussions](https://github.com/mcp-tool-shop-org/backpropagate/discussions). Security issues should be reported privately via the [GitHub Security Advisory](https://github.com/mcp-tool-shop-org/backpropagate/security/advisories/new) form — see [SECURITY.md](SECURITY.md) for the policy and response timelines.
|
|
340
394
|
|
|
341
395
|
## Privacy
|
|
342
396
|
|