npm - @mcptoolshop/backpropagate - Versions diffs - 1.6.0 → 1.7.0 - Mend

@mcptoolshop/backpropagate 1.6.0 → 1.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.it.md CHANGED

@@ -15,9 +15,9 @@
   <a href="https://mcp-tool-shop-org.github.io/backpropagate/"><img src="https://img.shields.io/badge/Landing_Page-live-blue" alt="Landing Page"></a>
 </p>
-# Addestra un adattatore. Caricalo su Ollama. Passa oltre
+# Ottimizza un modello QLoRA da 32 miliardi di parametri oppure un modello end-to-end da 7 miliardi di parametri su una singola GPU. Caricalo su Ollama
-Backpropagate è una libreria Python per l'affinamento di modelli linguistici di grandi dimensioni su una singola GPU. Tre righe di codice addestrano un modello da 7B su una scheda da 16 GB. Un comando aggiuntivo lo esporta su Ollama in modo che tu possa eseguire `ollama run` sul tuo modello affinato. Funziona perfettamente su Windows.
+Esegui il fine-tuning di modelli linguistici di grandi dimensioni su una **singola** GPU, dimensionata in base alla scheda che hai effettivamente. Tre righe di codice Python per il fine-tuning di un modello da 7 a 34 miliardi di parametri su una singola scheda consumer da 32 GB (RTX 5090); un flag — `--full-ft-offload` — esegue il fine-tuning completo di un modello di classe 7B scaricando lo stato dell'ottimizzatore nella RAM del sistema. Un comando aggiuntivo esporta i risultati su Ollama, quindi esegui `ollama run` con il tuo modello ottimizzato. Si adatta in modo efficiente fino a 16 GB. Ottime prestazioni su Windows.
 ```python
 from backpropagate import Trainer
@@ -69,33 +69,38 @@ Backpropagate è l'opzione mancante: **un'API Python a 3 righe per gli utenti si
 Se hai provato una delle librerie sopra elencate e ti sei scontrato con la procedura del file di configurazione, o hai riscontrato un problema con la famiglia di modelli, o desideravi impostazioni predefinite per Windows, Backpropagate è quello che fa per te.
-## Cosa puoi affinare su una GPU consumer da 16 GB
+## Cosa puoi ottimizzare su una singola GPU
-Ecco i limiti pratici su una scheda da 16 GB (RTX 4080 / 5080 / 4070 Ti Super):
+Backpropagate dimensiona l'esecuzione in base alla tua scheda. Ecco i limiti pratici su una GPU consumer da **32 GB** (RTX 5090) con 64 GB di RAM del sistema, la configurazione su cui la libreria è stata messa a punto:
-| Modello | Metodo | Stato |
+| Dimensione del modello | Metodo | Stato su una scheda da 32 GB |
 |---|---|---|
-| Qwen-3.5-4B / Phi-4-mini-3.8B / SmolLM3-3B | LoRA / QLoRA / DoRA | Ottimo. Lunghezza di sequenza completa, con spazio a sufficienza. |
-| SmolLM3-3B / Qwen2.5-3B / Llama-3.2-3B / Llama-3.2-1B | `mode="full"` (affinamento completo) | v1.4: passa `--mode=full` in `backprop train` o `Trainer(..., mode="full")`. Carica i pesi a piena precisione (bf16), senza 4 bit, senza adattatore; il checkpointing del gradiente e l'Adam a 8 bit con paging mantengono l'impronta entro i 16 GB. |
-| Qwen-2.5-7B / Llama-3.1-8B / Mistral-7B | QLoRA | Standard. Circa 7-8 GB. Impostazioni predefinite di Backpropagate. |
-| Llama-3 13B | QLoRA + sample packing | Stretto, ma funziona. Utilizza sequenze più brevi. |
-| Mixtral 8x7B (47 miliardi di parametri totali) | — | Fuori portata: la quantizzazione a 2 bit (AQLM / QuIP#) interrompe il contratto dell'adattatore unificabile + esportazione GGUF, quindi è stata abbandonata nella [breve descrizione della traiettoria v1.5](docs/V1_5_BRIEF.md). Su una scheda da 16 GB, utilizza una base ≤8B. |
+| 7B (Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B) | QLoRA | Ottimo — circa 7–8 GB. Lunghezza della sequenza completa, ampio margine di manovra. |
+| **14B** (Qwen2.5-14B) | QLoRA | **Il punto ideale per l'uso quotidiano — circa 8,5 GB**, misurato. rank/alpha 32, paged 8-bit AdamW, 4096 ctx. |
+| 24B (Mistral-Small-24B) | QLoRA | Circa 18 GB. Si adatta con un buon margine a 4096 ctx. |
+| **32B** (Qwen2.5-32B) | QLoRA | **Si adatta appena — circa 26 GB** con `max_len 2048` + paged 8-bit AdamW. Limite massimo. |
+| ≤6B | `mode="full"` (affinamento completo) | Fine-tuning completo su GPU pura: pesi bf16, nessun adattatore. Il limite massimo per la scheda è di 6B su 32 GB. |
+| **Classe 7B** (Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B) | `mode="full" --full-ft-offload` | **Fine-tuning completo tramite CPU-offload FSDP2:** scarica i parametri e l'ottimizzatore nella RAM del sistema da 64 GB. Più lento (limitato dalla larghezza di banda); Linux/WSL2. |
-`mode="full"` consente modelli fino a **4 miliardi di parametri**. Le quattro impostazioni nella riga dell'affinamento completo sopra sono autentici ~3B (numero effettivo di parametri 3,08–3,24B) e si adattano a una scheda da 16 GB. La classe 3,8–4B (Phi-4-mini-3,8B, Qwen-3,5-4B) è accettata anche dal limite massimo, ma richiede una scheda da **24 GB o superiore** per l'affinamento completo: i soli pesi e gradienti si avvicinano già a 16 GB prima dell'ottimizzatore e delle attivazioni; quindi, su una scheda da 16 GB, utilizza `mode="lora"` per questi (si trovano nella riga LoRA). I modelli >4B restituiscono un errore con `RUNTIME_FULL_FT_MODEL_TOO_LARGE`.
+Due cose per cui la maggior parte delle librerie per singola GPU ti indirizzano altrove: **QLoRA da 24–34B** e **fine-tuning completo su scheda singola di classe 7B**. Backpropagate esegue queste operazioni su una singola scheda consumer, quindi esporta direttamente il risultato su Ollama.
-La quantizzazione a 2 bit (AQLM / QuIP#) è **fuori portata**. È stata prevista per la v1.4, quindi abbandonata nella [breve descrizione della traiettoria v1.5](docs/V1_5_BRIEF.md): una base a 2 bit non può essere unificata in modo pulito con i pesi a piena precisione, il che interrompe il contratto dell'adattatore unificabile → GGUF → Ollama (il punto principale della pipeline). Le opzioni di ottimizzazione offerte da Backpropagate sono invece il percorso di calcolo FP8 v1.5 (`--fp8`, Blackwell/Hopper) e `mode="full"` per i modelli ≤4B: entrambi rimangono unificabili ed esportabili.
+**Il limite massimo per il fine-tuning completo è adattato alla scheda.** Deriva dall'aritmetica della memoria di addestramento a 4 termini (pesi + gradienti + ottimizzatore + attivazioni) rispetto alla VRAM *rilevata*: **16 GB → 4B, 24 GB → 5B, 32 GB → 6B** su GPU pura. `--full-ft-offload` lo estende alla **classe 7B** scaricando i parametri e lo stato dell'ottimizzatore nella RAM del sistema tramite FSDP2 `fully_shard` + `CPUOffloadPolicy` (più lento, limitato dalla larghezza di banda PCIe/CPU; richiede circa 64 GB di RAM del sistema e un backend NCCL, ovvero Linux/WSL2). Sovrascrivi esplicitamente il limite con `--full-ft-ceiling-billions`. Un modello che supera anche il limite di offload termina con `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, indicando la soluzione (`--full-ft-offload` o LoRA/QLoRA). Consulta [la pagina del manuale sul fine-tuning completo](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) per i calcoli della VRAM e il confronto sulla qualità di Biderman 2024 / Thinking Machines 2025.
-Per i modelli da 3B o inferiori, l'affinamento completo (non solo LoRA) è fattibile su 16 GB ed è ora disponibile nella v1.4 come `mode="full"`. Passa `Trainer(..., mode="full")` o `backprop train --mode=full --model phi-4-mini-3.8b` per abilitarlo. Un blocco rigido rifiuta la modalità per i modelli > 4B con `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, indicando LoRA e le impostazioni predefinite inferiori a 4B come opzioni di ripristino. Consulta [la pagina del manuale sull'affinamento completo](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) per i calcoli della configurazione e il confronto sulla qualità con Biderman 2024 / Thinking Machines 2025. Per i modelli da 7B o superiori, l'affinamento completo richiede una GPU da 24 GB o superiore: valuta la possibilità di noleggiare un A100 nel cloud oppure attieniti a LoRA, che le ricerche più recenti dimostrano essere altrettanto efficace dell'affinamento completo nella maggior parte delle attività post-addestramento (consulta [la sezione "anti-pitch"](#what-backpropagate-is-not-for) per i riferimenti).
+### Si adatta fino a 16 GB
+Il limite di 16 GB (RTX 4080 / 5080 / 4070 Ti Super) offre comunque ottime prestazioni: QLoRA da 7B con circa 7–8 GB e vero fine-tuning completo di un modello reale da ~3B (SmolLM3-3B, Qwen2.5-3B, Llama-3.2-3B/1B) all'interno di 16 GB tramite `mode="full"` (pesi bf16 + checkpointing del gradiente + paged 8-bit AdamW). Lo stesso codice seleziona la dimensione del batch e il limite massimo per il fine-tuning completo in base alla scheda rilevata, senza flag da modificare tra le diverse configurazioni.
+La quantizzazione a 2 bit (AQLM / QuIP#) è **fuori dall'ambito**: una base a 2 bit non può essere unita correttamente ai pesi in piena precisione, il che interrompe il contratto di esportazione dell'adattatore unificabile → GGUF → Ollama (che è lo scopo principale della pipeline). Al suo posto, Backpropagate offre i seguenti strumenti: QLoRA, `mode="full"`, `--full-ft-offload` e il percorso di calcolo FP8 (`--fp8`, Blackwell/Hopper), che rimangono tutti unificabili ed esportabili.
 ## Per cosa NON è adatto Backpropagate
 Se il tuo caso d'uso rientra nelle seguenti categorie, otterrai risultati migliori con una libreria diversa: Backpropagate non è la scelta giusta e cercare di farlo funzionare costerebbe più che semplicemente utilizzare lo strumento corretto. Leggere questa sezione prima di iniziare ti eviterà di installare e poi abbandonare il progetto:
-- **Ottimizzazione completa dei parametri per modelli da 7B+** — Backpropagate utilizza LoRA/QLoRA, che addestra un piccolo adattatore anziché aggiornare tutti i pesi. Per i modelli da 7B e superiori, l'ottimizzazione completa richiede 24 GB o più di memoria GPU e non è adatta per una scheda consumer da 16 GB. Per i modelli da 3B e inferiori, l'ottimizzazione completa È fattibile con 16 GB ed è disponibile nella versione 1.4 come `mode="full"` (passare `Trainer(..., mode="full")` o `--mode=full` dalla riga di comando; un controllo rigido genera `RUNTIME_FULL_FT_MODEL_TOO_LARGE` per i modelli > 4B e nomina LoRA + le configurazioni predefinite inferiori a 4B come soluzioni alternative). Nel complesso: ricerche recenti ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) mostrano che LoRA, con la configurazione corretta, corrisponde alla qualità dell'ottimizzazione completa nella maggior parte delle attività di post-addestramento (seguimento delle istruzioni, adattamento al dominio, personalità/stile) con il 67% della potenza di calcolo. Quindi, per il lavoro che la maggior parte degli utenti desidera effettivamente, non si perde nulla utilizzando LoRA. `mode="full"` è disponibile per i casi in cui si è misurata una differenza di qualità e si è deciso di investire più risorse computazionali. Se si ha realmente bisogno dell'ottimizzazione completa di un modello da 7B+, utilizzare direttamente HuggingFace `transformers.Trainer` su una scheda da 24 GB o superiore.
-- **RL online — PPO / GRPO / RLVR** — Backpropagate esegue l'addestramento SFT in una sola fase più l'ottimizzazione delle preferenze senza riferimenti (ORPO nella versione 1.5; SimPO + KTO nella versione 1.6). Non esegue l'apprendimento per rinforzo online, come PPO, GRPO o RLVR, che richiede un modello di ricompensa o un ciclo di generazione e valutazione in aggiunta alla fase di addestramento. Per questi casi, utilizzare direttamente TRL o LLaMA-Factory. (L'ottimizzazione delle preferenze senza riferimenti si adatta all'ambito della singola fase perché non è necessario mantenere un modello di riferimento separato in memoria; vedere la nota su ORPO sotto [Guida rapida](#guida-rapida)).
-- **Addestramento multi-nodo** — una sola GPU su una sola macchina. L'addestramento multi-GPU su una singola macchina funziona (tramite `accelerate launch`), ma non è ufficialmente supportato.
-- **Addestramento macOS con CUDA** — Apple Silicon non dispone di CUDA, quindi il percorso CUDA deve essere eseguito su un sistema Linux o Windows con una GPU NVIDIA. È comunque possibile eseguire il modello addestrato su un Mac tramite Ollama. **Novità nella versione 1.5:** un percorso MLX sperimentale (`--backend mlx`) addestra in modo nativo un adattatore LoRA su Apple Silicon; vedere [Apple Silicon (MLX)](#apple-silicon-mlx--sperimentale-v15). È disponibile solo l'addestramento LoRA-SFT ed è stato implementato, ma non ancora verificato su hardware reale. Pertanto, per qualsiasi operazione che vada oltre un addestramento LoRA SFT (ORPO, ottimizzazione completa, FP8, esecuzioni multiple), si consiglia di utilizzare il percorso CUDA.
-- **Qualsiasi cosa al di fuori delle famiglie di modelli testate** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B. Altri modelli spesso funzionano, ma non sono inclusi nei test CI.
+- **Ottimizzazione fine con tutti i parametri oltre il limite di offload (≈13B+)** — Backpropagate esegue l'ottimizzazione fine completa fino a **~6B su GPU pura e fino alla classe 7B tramite `--full-ft-offload`** su una scheda da 32 GB (vedere [la sezione dedicata](#what-you-can-fine-tune-on-one-gpu)). Un'ottimizzazione fine *veramente completa* di un modello da 13B+ supera tale limite: richiede FSDP multi-GPU o una scheda più grande (utilizzare `transformers.Trainer` su più GPU oppure noleggiare una A100/H100). Prima di impiegare tutta questa potenza di calcolo, tuttavia: ricerche recenti ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) dimostrano che LoRA, con la configurazione corretta, offre una qualità paragonabile a quella dell'ottimizzazione fine completa per la maggior parte delle attività post-addestramento (esecuzione delle istruzioni, adattamento al dominio, personalità/stile) con circa il 67% della potenza di calcolo. Quindi QLoRA fino a 34B, che Backpropagate esegue su una singola scheda, non comporta alcuna perdita per il lavoro che la maggior parte degli utenti desidera effettivamente svolgere.
+- **Apprendimento per rinforzo online — PPO / GRPO / RLVR** — Backpropagate esegue l'ottimizzazione fine monostadio (SFT) più l'ottimizzazione delle preferenze senza riferimenti (ORPO nella versione 1.5; SimPO + KTO nella versione 1.6). Non esegue l'apprendimento per rinforzo online — PPO, GRPO o RLVR —, che richiede un modello di ricompensa o un ciclo di generazione e valutazione in aggiunta alla fase di addestramento. Per queste attività, utilizzare direttamente TRL o LLaMA-Factory. (L'ottimizzazione delle preferenze senza riferimenti si adatta all'ambito monostadio perché non è necessario memorizzare un modello di riferimento separato; vedere la nota su ORPO nella sezione [Quick Start](#quick-start).)
+- **Addestramento multi-nodo** — una singola GPU su una sola macchina. L'utilizzo di più GPU su una singola macchina funziona (tramite `accelerate launch`), ma non è ufficialmente supportato.
+- **Addestramento macOS con CUDA** — Apple Silicon non dispone di CUDA, quindi il percorso CUDA viene eseguito su un sistema Linux o Windows con una GPU NVIDIA. È comunque possibile eseguire il modello addestrato su un Mac tramite Ollama. Un percorso MLX **sperimentale e non verificato** (`--backend mlx`) addestra in modo nativo un adattatore LoRA su Apple Silicon — vedere [Apple Silicon (MLX)](#apple-silicon-mlx--unverified-preview). È solo per LoRA-SFT e **non è stato testato su hardware reale** (nessun supporto), quindi, per qualsiasi cosa oltre a un SFT LoRA (ORPO, ottimizzazione fine completa, FP8, esecuzioni multiple), è consigliabile utilizzare il percorso CUDA.
+- **Qualsiasi modello al di fuori delle famiglie di modelli testate** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B. Altri modelli spesso funzionano, ma non sono inclusi nei test CI.
 Se si necessita di una qualsiasi di queste funzionalità, utilizzare una delle librerie elencate sopra. Sono più adatte a questo scopo.
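
The card-scaled full fine-tuning ceiling quoted in this hunk (16 GB → 4B, 24 GB → 5B, 32 GB → 6B on pure GPU) can be read as a simple tier lookup. The sketch below is illustrative only and is not taken from the package source: the names `PURE_GPU_TIERS`, `full_ft_ceiling_billions`, and `fits_full_ft` are hypothetical, the tier values are copied from the README text, and the behaviour below 16 GB (no ceiling) and above 32 GB (same as 32 GB) is an assumption the README does not state. The offload ceiling is left out because the diff only describes it as "7B-class".

```python
from typing import Optional

# (minimum VRAM in GB, pure-GPU ceiling in billions of parameters).
# Values copied from the README text; the table itself is a hypothetical helper.
PURE_GPU_TIERS = [(32, 6.0), (24, 5.0), (16, 4.0)]


def full_ft_ceiling_billions(vram_gb: float) -> Optional[float]:
    """Return the pure-GPU full fine-tuning ceiling for a card.

    Assumption: cards under 16 GB get no ceiling (None), and cards over
    32 GB reuse the 32 GB tier, since the README quotes only three tiers.
    """
    for min_vram, ceiling in PURE_GPU_TIERS:
        if vram_gb >= min_vram:
            return ceiling
    return None


def fits_full_ft(params_billions: float, vram_gb: float) -> bool:
    """True when the model is at or under the pure-GPU ceiling for this card."""
    ceiling = full_ft_ceiling_billions(vram_gb)
    return ceiling is not None and params_billions <= ceiling


if __name__ == "__main__":
    for vram in (16, 24, 32):
        print(vram, "GB ->", full_ft_ceiling_billions(vram), "B")
```

Under these assumptions a ~3B model passes on a 16 GB card, while a 7B-class model fails the pure-GPU check even at 32 GB, which is the case the README routes to `--full-ft-offload` or LoRA/QLoRA.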
@@ -174,7 +179,9 @@ Il tasso di apprendimento predefinito si riduce automaticamente a `8e-6` per ORP
 Novità nella versione 1.5: distilla un modello di ragionamento in modo semplice. Passa `--reasoning-trace` (CLI) o `Trainer(..., reasoning_trace=True)` (Python) e fornisci tracce che mantengono una catena di pensiero `<think>...</think>` all'interno del turno dell'assistente — la metà SFT pura della distillazione di [DeepSeek-R1](https://arxiv.org/abs/2501.12948), senza necessità di RL. Backpropagate mantiene `<think>` nell'obiettivo di addestramento, elimina le tracce vuote o troppo lunghe (filtraggio della lunghezza delle tracce) e aumenta il valore predefinito di `max_seq_length` a 8192 per la catena di pensiero più lunga. Fondamentalmente, `<think>` rimane in **testo semplice** — nessun token speciale, nessun ridimensionamento dell'embedding — quindi l'adattatore unificato viene ancora esportato in GGUF e può essere utilizzato con Ollama come qualsiasi altro modello ottimizzato. Solo SFT. Consulta la [ricetta per la traccia del ragionamento](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) per la forma del set di dati e i token regolabili.
-### Apple Silicon (MLX) — sperimentale, versione 1.5
+### Apple Silicon (MLX) — anteprima non verificata
+> ⚠️ **Anteprima non verificata: non fa parte delle funzionalità supportate.** Il percorso MLX è stato creato ed è stato sottoposto a test unitari, ma **non** è stato testato su hardware Apple Silicon reale (`mlx-lm` è disponibile solo per Apple e non può essere eseguito sui sistemi NVIDIA su cui viene sviluppato Backpropagate). Considerare tutto quanto segue come sperimentale, utilizzarlo a proprio rischio e [segnalare eventuali anomalie](#reporting-bugs) se lo si esegue su un Mac della serie M.
 Novità nella versione 1.5: **un'API, due opzioni.** CUDA rimane il backend canonico e verificato; MLX è una seconda opzione che esegue l'addestramento su un Mac della serie M tramite lo strumento [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) di Apple (memoria unificata, nessuna necessità di CUDA). La stessa struttura a 3 righe seleziona l'opzione in base all'hardware: `backend='auto'` (predefinito) indirizza verso CUDA su NVIDIA e verso MLX su Apple Silicon, quindi le configurazioni CUDA esistenti sono identiche.
@@ -192,7 +199,7 @@ backprop train --data my_data.jsonl --backend mlx --steps 100
 Nella versione 1.5, l'opzione MLX è **solo SFT LoRA** — nessun ORPO, nessun FP8, nessuna modalità `'full'`, nessun addestramento multiplo su MLX (ognuno viene rifiutato con `CONFIG_INVALID_SETTING`; utilizza `backend='cuda'`/`'auto'` su una macchina NVIDIA per queste opzioni). L'adattatore risultante è in formato safetensors e può essere esportato verso Ollama tramite lo stesso percorso dell'opzione CUDA.
-> ⚠️ **Stato reale:** l'opzione MLX viene fornita nella versione 1.5 **costruita + testata con unità (simulata)** ma **NON ancora verificata su Apple Silicon reale** — `mlx-lm` è solo per Apple e non poteva essere eseguito sulla macchina NVIDIA su cui è stato creato questo progetto. Considerala sperimentale — lo stesso approccio che è stato utilizzato per il percorso FP8 nella versione 1.5 (FP8 è passato alla fase di verifica su Blackwell nella versione 1.6; MLX deve ancora superare questa fase su hardware reale) — e segnala eventuali anomalie [qui](#reporting-bugs) una volta che verrà eseguita su un Mac della serie M. Forzare `--backend mlx` su un host non Apple genera un errore `CONFIG_INVALID_SETTING`; la mancanza dello strumento `mlx_lm` su un Mac genera `DEP_MLX_UNAVAILABLE`.
+> Forzare `--backend mlx` su un host non Apple genera l'errore `CONFIG_INVALID_SETTING`; la mancanza di una toolchain `mlx_lm` su un Mac genera `DEP_MLX_UNAVAILABLE`.
 Per flussi di lavoro end-to-end più completi (ottimizzazione e caricamento su HF Hub, ripresa dopo esaurimento della memoria, SLAO multi-esecuzione in una lunga campagna, ecc.), consulta la [pagina delle ricette del manuale](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/).
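
The routing rule described in this hunk (`backend='auto'` picks CUDA on NVIDIA hosts and MLX on Apple Silicon, and forcing `--backend mlx` on a non-Apple host is rejected with `CONFIG_INVALID_SETTING`) could be sketched roughly as follows. This is a hedged illustration, not the package's implementation: `resolve_backend` is a hypothetical name, the error is modelled as a plain `ValueError` carrying the code as text, and treating every non-Apple host as a CUDA host is a simplifying assumption (no NVIDIA detection is attempted).

```python
import platform
from typing import Optional


def resolve_backend(requested: str = "auto",
                    system: Optional[str] = None,
                    machine: Optional[str] = None) -> str:
    """Pick a training backend name from the request and the host platform.

    `system` and `machine` default to the running host; passing them
    explicitly makes the rule easy to exercise without a Mac.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    on_apple_silicon = system == "Darwin" and machine == "arm64"

    if requested == "auto":
        # Assumption: anything that is not Apple Silicon is treated as CUDA.
        return "mlx" if on_apple_silicon else "cuda"
    if requested == "mlx" and not on_apple_silicon:
        raise ValueError("CONFIG_INVALID_SETTING: --backend mlx needs Apple Silicon")
    if requested not in ("cuda", "mlx"):
        raise ValueError(f"CONFIG_INVALID_SETTING: unknown backend {requested!r}")
    return requested


if __name__ == "__main__":
    print(resolve_backend("auto"))
```

Keeping the platform values as parameters means the same rule can be checked for a Linux host and an M-series Mac from a single machine.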
@@ -364,8 +371,12 @@ Le chiavi nidificate utilizzano il doppio underscore (`MODEL__NAME`, non `MODEL_
 | Llama 3.2 3B | ~8 GB | Comunità Llama | Un'alternativa valida a Qwen 3B con alcune limitazioni. |
 | Llama 3.2 1B | ~6 GB | Comunità Llama | Per esperimenti rapidi su schede di piccole dimensioni. |
 | Mistral 7B | ~12 GB | Apache 2.0 | Simile a Qwen 7B, con un modello di chat diverso. |
+| Llama-3.1-8B | ~7-8 GB (QLoRA) | Llama-3.1-Community | 8B QLoRA, contesto nativo di 128K (la clausola >700M-MAU richiede una licenza Meta separata). |
+| **Qwen2.5-14B** | ~8,5 GB (QLoRA) | Apache 2.0 | **Il punto ideale per l'utilizzo quotidiano con 32 GB** — rank/alpha 32, paged 8-bit AdamW, 4096 ctx. |
+| Mistral-Small-24B | ~18 GB (QLoRA) | Apache 2.0 | 24B QLoRA su una scheda da 32 GB con margine di 4096 ctx. |
+| **Qwen2.5-32B** | ~26 GB (QLoRA) | Apache 2.0 | **Limite massimo per 32 GB** — si adatta a malapena con `max_len 2048` + paged 8-bit AdamW. |
-Altri modelli spesso funzionano, ma solo questi otto sono configurati in CI. Utilizzare `--lora-preset=quality` (impostazione predefinita) per obiettivi di rango 256 / completamente lineari secondo Biderman 2024 + Thinking Machines 2025, oppure `--lora-preset=fast` per l'obiettivo legacy di rango 16 / q+v se è necessario il footprint della versione 1.2.x.
+Altri modelli spesso funzionano; le righe sopra riportate sono le configurazioni predefinite curate: la fascia da 14B a 32B è ottimizzata con QLoRA per una scheda da 32 GB (l'ambito misurato). Utilizzare `--lora-preset=quality` (impostazione predefinita) per i target rank-256 / all-linear secondo Biderman 2024 + Thinking Machines 2025, oppure `--lora-preset=fast` per il target legacy rank-16 / q+v se è necessario l'ingombro della versione 1.2.x.
 ## Risoluzione dei problemi

package/README.ja.md CHANGED

@@ -15,9 +15,9 @@
   <a href="https://mcp-tool-shop-org.github.io/backpropagate/"><img src="https://img.shields.io/badge/Landing_Page-live-blue" alt="Landing Page"></a>
 </p>
-# アダプターをトレーニングします。Ollamaにデプロイします。次に進みます
+# 320億パラメータのQLoRAモデル、または70億パラメータのエンドツーエンドモデルを1つのGPUで微調整します。Ollamaにデプロイします
-Backpropagateは、単一のGPUで大規模言語モデルを微調整するためのPythonライブラリです。3行のコードで、16GBのカード上で7Bモデルをトレーニングできます。さらに1つのコマンドで、微調整したモデルをOllamaにエクスポートし、`ollama run`コマンドで実行できるようにします。Windowsで最適に動作します。
+大規模言語モデルの微調整を**単一の**GPU上で実行し、実際に使用しているカードに合わせてサイズを調整します。3行のPythonコードで、70億～340億パラメータのモデルを1つの32GBのコンシューマーカード（RTX 5090）で実行できます。`--full-ft-offload`フラグを使用すると、最適化の状態をホストRAMにオフロードすることで、70億パラメータ規模のモデルを完全に微調整できます。さらにコマンドを実行してOllamaにエクスポートし、`ollama run`で微調整したモデルを実行します。16GBまでスムーズにスケールダウンできます。Windowsでも優れたパフォーマンスを発揮します。
 ```python
 from backpropagate import Trainer
@@ -69,33 +69,38 @@ Backpropagateは、不足している選択肢です。**単一のコンシュ
 上記のいずれかのライブラリを試して、設定ファイルの操作に苦労したり、モデルファミリーの制限に遭遇したり、Windowsを優先するデフォルト設定が必要になった場合は、Backpropagateが最適です。
-## 16GBのコンシューマーGPUで微調整できること
+## 1つのGPUで微調整できるもの
-16GBのカード（RTX 4080 / 5080 / 4070 Ti Super）で実際に使用できる範囲は次のとおりです。
+Backpropagateは、実行に必要なリソースをカードに合わせて調整します。以下は、64GBのホストRAMを備えた**32GB**のコンシューマーGPU（RTX 5090）での実用的な上限です。
-| モデル | 方法 | 状態 |
+| モデルサイズ | 方法 | 32GBカードでの状況 |
 |---|---|---|
-| Qwen-3.5-4B / Phi-4-mini-3.8B / SmolLM3-3B | LoRA / QLoRA / DoRA | 快適。完全なシーケンス長で、余裕があります。 |
-| SmolLM3-3B / Qwen2.5-3B / Llama-3.2-3B / Llama-3.2-1B | `mode="full"`（完全な微調整） | v1.4 — `backprop train`コマンドまたは`Trainer(..., mode="full")`で`--mode=full`を指定します。完全な精度（bf16）の重みをロードします。4ビット、アダプターは使用しません。勾配チェックポイントとページ化された8ビットAdamにより、フットプリントを16GB以内に収めることができます。 |
-| Qwen-2.5-7B / Llama-3.1-8B / Mistral-7B | QLoRA | 標準。約7〜8GB。Backpropagateのデフォルトプリセット。 |
-| Llama-3 13B | QLoRA + サンプルパッキング | ぎりぎりですが、動作します。短いシーケンスを使用してください。 |
-| Mixtral 8x7B（合計470億パラメータ） | — | 範囲外 — 2ビット（AQLM / QuIP#）は、マージ可能なアダプターとGGUFエクスポートの契約を破るため、[v1.5の概要](docs/V1_5_BRIEF.md)で廃止されました。16GBのカードでは、≤8Bのベースモデルを使用してください。 |
+| 70億パラメータ（Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B） | QLoRA | 快適に動作します。約7～8GBを使用し、十分な余裕があります。完全なシーケンス長で実行できます。 |
+| **140億パラメータ（Qwen2.5-14B）** | QLoRA | **日常的な使用に最適なサイズ — 約8.5GB**。rank/alpha 32、ページングされた8ビットAdamW、4096のコンテキスト長で測定。 |
+| 240億パラメータ（Mistral-Small-24B） | QLoRA | 約18GBを使用します。4096のコンテキスト長で実行しても余裕があります。 |
+| **320億パラメータ（Qwen2.5-32B）** | QLoRA | **ギリギリ動作するサイズ — `max_len 2048` + ページングされた8ビットAdamWで約26GB**。上限に近い状態です。 |
+| 60億パラメータ以下 | `mode="full"`（完全な微調整） | GPUのみを使用した完全な微調整 — bf16の重みを使用し、アダプターは使用しません。32GBでは、カードがサポートできる上限は60億パラメータです。 |
+| **70億パラメータ規模（Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B）** | `mode="full" --full-ft-offload` | **FSDP2を使用したCPUオフロードによる完全な微調整** — パラメータと最適化の状態を64GBのホストRAMにオフロードします。速度は遅くなります（帯域幅がボトルネック）。Linux/WSL2でのみ動作します。 |
-`mode="full"`は、最大**40億パラメータ**のモデルをサポートします。上記の完全な微調整行にある4つのプリセットは、実際には約3B（実際のパラメータ数は3.08〜3.24B）であり、16GBのカードに適合します。3.8〜4Bクラス（Phi-4-mini-3.8B、Qwen-3.5-4B）も上限の範囲内で受け付けられますが、完全な微調整には**24GB以上の**カードが必要です。重みと勾配だけで、オプティマイザーと活性化を加える前にすでに16GBに近づくためです。そのため、16GBのカードでは、これらのモデルに対して`mode="lora"`を使用してください（LoRA行にあります）。40億を超えるモデルは、`RUNTIME_FULL_FT_MODEL_TOO_LARGE`というエラーで終了します。
+ほとんどの単一GPUライブラリでは、**24～340億パラメータのQLoRA**と**単一カードでの70億パラメータ規模の完全な微調整**のために別の場所に誘導しますが、Backpropagateはこれらを1つのコンシューマーカードで実行し、結果をOllamaに直接エクスポートします。
-2ビット量子化（AQLM / QuIP#）は、**範囲外**です。v1.4で検討された後、[v1.5の概要](docs/V1_5_BRIEF.md)で廃止されました。2ビットのベースモデルを、完全な精度の重みにクリーンにマージすることはできません。これにより、Backpropagateのマージ可能なアダプター→GGUF→Ollamaエクスポートの契約（パイプラインの目的）が破られます。代わりに、Backpropagateが提供するヘッドルームは、v1.5の**FP8コンピューティングパス**（`--fp8`、Blackwell / Hopper）と、≤40億のモデルに対する`mode="full"`です。どちらもマージ可能で、エクスポート可能です。
+**完全な微調整の上限は、カードに合わせて調整されます。** これは、4つの要素（重み + 勾配 + 最適化器 + 活性化）のトレーニングメモリ計算に基づいており、*検出された*VRAMに対して行われます。**16GB → 40億パラメータ、24GB → 50億パラメータ、32GB → 60億パラメータ**がGPUのみで使用できる上限です。`--full-ft-offload`を使用すると、パラメータと最適化の状態をFSDP2の`fully_shard` + `CPUOffloadPolicy`を通じてホストRAMにオフロードすることで、**70億パラメータ規模**まで拡張できます（速度は遅くなり、PCIe/CPU帯域幅がボトルネックになります。約64GBのホストRAMとNCCLバックエンドが必要です。つまり、Linux/WSL2でのみ動作します）。`--full-ft-ceiling-billions`を使用して、上限を明示的にオーバーライドできます。モデルがオフロードの上限を超える場合、`RUNTIME_FULL_FT_MODEL_TOO_LARGE`というエラーが発生し、回復方法（`--full-ft-offload`またはLoRA/QLoRA）が表示されます。[完全な微調整に関するハンドブック](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/)を参照して、VRAMの計算とBiderman 2024 / Thinking Machines 2025による品質比較を確認してください。
-30億以下のモデルの場合、16GBで完全な微調整（LoRAだけでなく）が可能になり、v1.4で`mode="full"`として提供されます。`Trainer(..., mode="full")`または`backprop train --mode=full --model phi-4-mini-3.8b`を指定して有効にします。40億を超えるモデルの場合、`RUNTIME_FULL_FT_MODEL_TOO_LARGE`というエラーが発生し、LoRAと40億以下のプリセットが代替手段として提案されます。構成の計算と、Biderman 2024 / Thinking Machines 2025による品質比較については、[完全な微調整のハンドブック](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/)をご覧ください。70億以上のモデルの場合、完全な微調整には24GB以上のGPUが必要です。A100クラウドレンタルを検討するか、最新の研究では、ほとんどのポストトレーニングタスクで完全な微調整の品質に匹敵することが示されているため、LoRAを使用してください（[アンチピッチセクション](#what-backpropagate-is-not-for)に参考文献を参照）。
+### 16GBまでスケールダウン可能
+16GB（RTX 4080 / 5080 / 4070 Ti Super）の環境でも優れたパフォーマンスを発揮します。70億パラメータのQLoRAは約7～8GBを使用し、真の完全な微調整を約30億パラメータ（SmolLM3-3B、Qwen2.5-3B、Llama-3.2-3B/1B）に対して16GB内で`mode="full"`を使用して実行できます（bf16の重み + 勾配チェックポイント + ページングされた8ビットAdamW）。同じコードは、検出されたカードに合わせてバッチサイズと完全な微調整の上限を自動的に選択します。設定を変更する必要はありません。
+2ビット量子化（AQLM / QuIP#）は**対象外**です — 2ビットのベースモデルを完全に精度の高い重みにクリーンにマージすることはできず、マージ可能なアダプター → GGUF → Ollamaのエクスポートという一連の流れが中断されます（このパイプライン全体の目的）。Backpropagateでは、代わりにQLoRA、`mode="full"`、`--full-ft-offload`、およびFP8計算パス（`--fp8`、Blackwell/Hopper）などの機能を提供し、これらはすべてマージ可能でエクスポート可能です。
 ## Backpropagateが適さない場合
 以下のユースケースに該当する場合は、別のライブラリを使用する方が良い結果が得られます。Backpropagateは適切な選択肢ではなく、無理に使用しようとすると、適切なツールを選択するよりも多くの労力がかかります。インストールを開始する前に、このセクションを読んでください。
-- **7B以上のモデルに対するフルパラメータのファインチューニング** — BackpropagateはLoRA / QLoRAを使用し、すべての重みを更新するのではなく、小さなアダプターをトレーニングします。7B以上のモデルの場合、フルファインチューニングには24GB以上のGPUメモリが必要であり、16GBのコンシューマーカードでは実行できません。3B以下のモデルの場合、16GBでフルファインチューニングが可能であり、v1.4で`mode="full"`として提供されます（`Trainer(..., mode="full")`、またはCLIで`--mode=full`を渡します。4Bを超えるモデルの場合、ハードゲートが`RUNTIME_FULL_FT_MODEL_TOO_LARGE`を発生させ、LoRAと4B未満のプリセットをリカバリとして指定します）。より大きな視点として、最近の研究（[Biderman 2024](https://arxiv.org/abs/2405.09673)、[Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)）は、適切な設定のLoRAが、ほとんどのポストトレーニングタスク（指示への追従、ドメインへの適応、ペルソナ/スタイル）において、フルファインチューニングの品質に匹敵し、計算コストは67%で済むことを示しています。したがって、ほとんどのオペレーターが実際に求める作業においては、LoRAを使用し続けることで何も失うことはありません。`mode="full"`は、品質の差を測定し、追加の計算コストを費やすことを決定した場合に使用します。7B以上のモデルのフルファインチューニングを本当に必要とする場合は、HuggingFaceの`transformers.Trainer`を24GB以上のカードで直接使用してください。
-- **オンラインRL — PPO / GRPO / RLVR** — Backpropagateは、単一ステージのSFTと参照なしの優先度調整（ORPOはv1.5で提供され、SimPO / KTOは計画中）を実行します。実行しないのは、PPO、GRPO、またはRLVRなどのオンライン強化学習です。これには、報酬モデルまたはトレーニングステップの上に生成とスコアリングのループが必要です。これらの場合は、TRLまたはLLaMA-Factoryを直接使用してください。（参照なしの優先度調整は、単一ステージの範囲に適合します。なぜなら、メモリに保持する必要のある個別の参照モデルがないからです。詳細は、[クイックスタート](#quick-start)のORPOの注記を参照してください。）
-- **マルチノードトレーニング** — 単一のGPUを1つのマシンでのみ使用します。1つのマシンでのマルチGPUは機能しますが、公式にはサポートされていません（`accelerate launch`経由）。
-- **CUDAレール上のmacOSトレーニング** — Apple SiliconにはCUDAがないため、CUDAパスは、NVIDIA GPUを搭載したLinuxまたはWindowsマシンで実行する必要があります。トレーニングされたモデルは、Ollamaを介してMacで引き続き実行できます。**v1.5の新機能：** 実験的なMLXレール（`--backend mlx`）は、Apple Silicon上でLoRAアダプターをネイティブにトレーニングします。詳細は、[Apple Silicon（MLX）](#apple-silicon-mlx--experimental-v15)を参照してください。これはLoRA-SFTのみであり、構築済みですが、実際のシリコン上ではまだ検証されていません。したがって、LoRA SFT以外のもの（ORPO、フルファインチューニング、FP8、マルチラン）については、引き続きCUDAレールを使用することをお勧めします。
-- **テストされたモデルファミリー以外のもの** — Qwen 2.5 / 3.5（7B / 4B）、Phi-4-mini-3.8B、SmolLM3-3B、Llama 3.2（3B / 1B）、Mistral 7B。他のモデルも多くの場合機能しますが、CIでは固定されていません。
+- **オフロードの上限を超えるフルパラメータの微調整（約13B以上）** — Backpropagateは、32GBのカード上で、**GPUのみで約6Bまで、`--full-ft-offload`を使用して7Bクラスまで**のフル微調整を実行する（[この範囲](#what-you-can-fine-tune-on-one-gpu)を参照）。13B以上のモデルに対する*真のフル*微調整はその上限を超えるため、マルチGPU FSDPまたはより大きな容量のカードが必要になる（複数のGPUにわたって`transformers.Trainer`を使用するか、A100/H100をレンタルする）。ただし、その計算リソースを費やす前に確認しておきたい点として、最近の研究（[Biderman 2024](https://arxiv.org/abs/2405.09673)、[Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)）によると、適切な設定でLoRAを使用すると、ほとんどの事後学習タスク（指示への追従、ドメイン適応、ペルソナ/スタイル）において、フル微調整と同等の品質が得られ、計算リソースは約67%で済む。したがって、Backpropagateが単一のカード上で実行する最大34BのQLoRAは、ほとんどのユーザーが実際に求めるタスクにおいて何も損なわない。
+- **オンライン強化学習 — PPO / GRPO / RLVR** — Backpropagateは、シングルステージSFTと参照不要の嗜好調整（v1.5ではORPO、v1.6ではSimPO + KTO）を実行する。ただし、オンライン強化学習—PPO、GRPO、またはRLVR—は実行しない。これらは、報酬モデルまたはトレーニングステップに加えて、生成とスコアリングを行うループを必要とする。これらの場合は、TRLを直接使用するか、LLaMA-Factoryを使用する（参照不要の嗜好調整は、メモリ内に保持する必要のある個別の参照モデルがないため、シングルステージの範囲に適合する。詳細は[Quick Start](#quick-start)のORPOに関する注を参照）。
+- **マルチノードトレーニング** — 単一のマシン上の単一GPUのみ。単一のマシン上で複数のGPUを使用することも可能（`accelerate launch`経由）だが、公式にはサポートされていない。
+- **CUDA環境でのmacOSトレーニング** — Apple SiliconはCUDAをサポートしていないため、CUDAパスはNVIDIA GPUを備えたLinuxまたはWindowsマシンで実行される。ただし、トレーニングされたモデルはOllamaを使用してMac上で引き続き実行できる。**実験的で検証されていないMLXレール（`--backend mlx`）**を使用すると、Apple Silicon上でLoRAアダプターをネイティブにトレーニングできる（[Apple Silicon (MLX)](#apple-silicon-mlx--unverified-preview)を参照）。これはLoRA-SFTのみであり、**実際のシリコン上での検証は行われていない**（サポートなし）ため、LoRA SFT以外のタスク（ORPO、フル微調整、FP8、複数回の実行など）にはCUDAレールを使用する必要がある。
+- **テストされたモデルファミリー外のモデル** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B。他のモデルも多くの場合動作するが、CIで固定されていない。
 これらの機能が必要な場合は、上記のライブラリのいずれかを使用してください。それらの機能により優れています。
@@ -174,7 +179,9 @@ backprop train --data preferences.jsonl --method orpo --steps 100
 v1.5の新機能：推論モデルを簡単に蒸留します。`--reasoning-trace`（CLI）または `Trainer(..., reasoning_trace=True)`（Python）を渡し、アシスタントの応答内に `<think>...</think>` の連鎖的な思考を保持するトレースを入力します。これは、[DeepSeek-R1](https://arxiv.org/abs/2501.12948) 蒸留の純粋なSFT部分であり、RLは必要ありません。Backpropagateは `<think>` をトレーニングターゲットに保持し、空のトレースや長すぎるトレースを削除し（トレース長フィルタリング）、より長いCoTのためにデフォルトの `max_seq_length` を8192に引き上げます。重要な点として、`<think>` は **プレーンテキスト** のままです。特別なトークンや、埋め込みのリサイズは行われません。そのため、マージされたアダプターは引き続きGGUFにエクスポートでき、他のファインチューンと同様にOllamaで利用できます。SFTのみです。データセットの形状と調整可能なトークンバンドについては、[reasoning-trace recipe](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) を参照してください。
-### Apple Silicon（MLX）—実験的、v1.5
+### Apple Silicon (MLX) — 検証されていないプレビュー
+> ⚠️ **検証されていないプレビュー — サポートされている機能セットの一部ではない。** MLXレールは構築され、ユニットテストも行われているが、**実際のApple Silicon上での検証（`mlx-lm`はApple専用であり、Backpropagateの開発に使用されるNVIDIAリグでは実行できない）は行われていない**。以下すべてを実験的なものとして扱い、自己責任で使用し、MシリーズのMacで実行した場合は[バグ報告](#reporting-bugs)を行うこと。
 v1.5の新機能：**1つのAPI、2つのレール。** CUDAは、引き続き標準で検証済みのバックエンドです。MLXは、Appleの [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) ツールチェーンを介して、MシリーズMacでトレーニングを行う2番目のレールです（統合メモリ、CUDAは不要）。同じ3行のコードで、ハードウェアによってレールを選択します。`backend='auto'`（デフォルト）は、NVIDIAではCUDAに、Apple SiliconではMLXにルーティングするため、既存のCUDA環境はバイト単位で同一です。
@@ -192,7 +199,7 @@ backprop train --data my_data.jsonl --backend mlx --steps 100
 In v1.5 the MLX rail is **LoRA SFT only**: ORPO, FP8, `mode='full'`, and multi-run are not yet supported on MLX (each is rejected with `CONFIG_INVALID_SETTING`; use `backend='cuda'` / `'auto'` on an NVIDIA box for those). The resulting adapter is plain safetensors and exports to Ollama through the same path as the CUDA rail.
-> ⚠️ **Current status:** in v1.5 the MLX rail is **built and unit-tested (mocked)** but **not yet dogfood-verified on real Apple Silicon**; `mlx-lm` is Apple-only and cannot run on the NVIDIA rig this code was written on. Treat it as experimental. This is the same framing the FP8 path had in v1.5 (FP8 passed dogfood verification on Blackwell in v1.6; MLX still needs that pass on real silicon). Please [report anomalies](#reporting-bugs) once you run it on an M-series Mac. Forcing `--backend mlx` on a non-Apple host errors with `CONFIG_INVALID_SETTING`; a missing `mlx_lm` toolchain on a Mac raises `DEP_MLX_UNAVAILABLE`.
+> Forcing `--backend mlx` on a non-Apple host errors with `CONFIG_INVALID_SETTING`; a missing `mlx_lm` toolchain on a Mac raises `DEP_MLX_UNAVAILABLE`.
 For more end-to-end workflows (fine-tune and push to HF Hub, resume after OOM, multi-run SLAO across a long campaign, etc.), see the [handbook recipes page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/).
@@ -364,8 +371,12 @@ backprop export-runs --format jsonl    # bulk export run history
 | Llama 3.2 3B | ~8GB | Llama Community | A solid alternative to Qwen 3B, with license restrictions. |
 | Llama 3.2 1B | ~6GB | Llama Community | For quick experiments on small cards. |
 | Mistral 7B | ~12GB | Apache 2.0 | Comparable to Qwen 7B, with a different chat template. |
+| Llama-3.1-8B | ~7-8GB (QLoRA) | Llama-3.1-Community | 8B QLoRA, 128K native context (the >700M-MAU clause requires a separate Meta license). |
+| **Qwen2.5-14B** | ~8.5GB (QLoRA) | Apache 2.0 | **The sweet spot on a 32 GB card**: rank/alpha 32, paged 8-bit AdamW, 4096 ctx. |
+| Mistral-Small-24B | ~18GB (QLoRA) | Apache 2.0 | 24B QLoRA on a 32 GB card with 4096-ctx headroom. |
+| **Qwen2.5-32B** | ~26GB (QLoRA) | Apache 2.0 | **Top of the 32 GB envelope**: just fits at `max_len 2048` + paged 8-bit AdamW. |
-Other models may also work, but only these eight are pinned in CI. Pass `--lora-preset=quality` (default) for rank-256 / all-linear targets per Biderman 2024 + Thinking Machines 2025, or `--lora-preset=fast` for the legacy rank-16 / q+v target if you need the v1.2.x footprint.
+Other models often work; the rows above are the curated presets. The 14B–32B tier is QLoRA-tuned for a 32 GB card (the measured envelope). Pass `--lora-preset=quality` (default) for rank-256 / all-linear targets per Biderman 2024 + Thinking Machines 2025, or `--lora-preset=fast` for the legacy rank-16 / q+v target if you need the v1.2.x footprint.
 ## Troubleshooting

package/README.md CHANGED

@@ -15,9 +15,9 @@
   <a href="https://mcp-tool-shop-org.github.io/backpropagate/"><img src="https://img.shields.io/badge/Landing_Page-live-blue" alt="Landing Page"></a>
 </p>
-# Train an adapter. Ship it to Ollama. Move on.
+# Fine-tune a 32B QLoRA — or a 7B end to end — on one GPU. Ship it to Ollama.
-Backpropagate is a Python library for fine-tuning large language models on a single GPU. Three lines of code train a 7B model on a 16GB card. One more command exports it to Ollama so you can `ollama run` your finetune. Works first-class on Windows.
+Backpropagate fine-tunes large language models on a **single** GPU, sized for the card you actually have. Three lines of Python run QLoRA on a 7B–34B model on one 32 GB consumer card (RTX 5090); one flag, `--full-ft-offload`, full-fine-tunes a 7B-class model by spilling parameters and optimizer state to host RAM. One more command exports to Ollama, so you can `ollama run` your fine-tune. Scales cleanly down to 16 GB. First-class on Windows.
 ```python
 from backpropagate import Trainer
@@ -69,32 +69,37 @@ Backpropagate is the missing option: **a 3-line Python API for solo operators on
 If you tried one of the libraries above and bounced off the config-file ceremony, or hit a model-family gap, or wanted Windows-first defaults — Backpropagate is for you.
-## What you can fine-tune on a 16GB consumer GPU
+## What you can fine-tune on one GPU
-Here's the practical envelope on a 16GB card (RTX 4080 / 5080 / 4070 Ti Super):
+Backpropagate sizes the run to your card. Here's the practical envelope on a **32 GB** consumer GPU (RTX 5090) with 64 GB host RAM — the rig it's tuned on:
-| Model | Method | Status |
+| Model size | Method | Status on a 32 GB card |
 |---|---|---|
-| Qwen-3.5-4B / Phi-4-mini-3.8B / SmolLM3-3B | LoRA / QLoRA / DoRA | Comfortable. Full sequence length, room to spare. |
-| SmolLM3-3B / Qwen2.5-3B / Llama-3.2-3B / Llama-3.2-1B | `mode="full"` (full fine-tuning) | v1.4 — pass `--mode=full` on `backprop train` or `Trainer(..., mode="full")`. Loads full-precision (bf16) weights — no 4-bit, no adapter; gradient checkpointing + paged 8-bit Adam keep the footprint inside 16GB. |
-| Qwen-2.5-7B / Llama-3.1-8B / Mistral-7B | QLoRA | Standard. ~7-8 GB. Backpropagate's default presets. |
-| Llama-3 13B | QLoRA + sample packing | Tight but works. Use shorter sequences. |
-| Mixtral 8x7B (47B total parameters) | — | Out of scope — 2-bit (AQLM / QuIP#) breaks the mergeable-adapter + GGUF-export contract, so it was retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md). On a 16GB card, use a ≤8B base. |
+| 7B (Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B) | QLoRA | Comfortable — ~7–8 GB. Full sequence length, lots of headroom. |
+| **14B** (Qwen2.5-14B) | QLoRA | **The daily-driver sweet spot — ~8.5 GB** measured. rank/alpha 32, paged 8-bit AdamW, 4096 ctx. |
+| 24B (Mistral-Small-24B) | QLoRA | ~18 GB. Fits with headroom at 4096 ctx. |
+| **32B** (Qwen2.5-32B) | QLoRA | **Just fits — ~26 GB** at `max_len 2048` + paged 8-bit AdamW. Top of the envelope. |
+| ≤6B | `mode="full"` (true full fine-tuning) | Pure-GPU full FT — bf16 weights, no adapter. The card-aware ceiling is 6B on 32 GB. |
+| **7B-class** (Qwen 2.5 7B / Llama-3.1-8B / Mistral 7B) | `mode="full" --full-ft-offload` | **Full fine-tuning via FSDP2 CPU-offload** — spills params + optimizer to 64 GB host RAM. Slower (bandwidth-bound); Linux/WSL2. |
-`mode="full"` admits models up to **4B parameters**. The four presets in the full-FT row above are genuine ~3B (true parameter count 3.08–3.24B) and fit a 16GB card. The 3.8–4B class (Phi-4-mini-3.8B, Qwen-3.5-4B) is also accepted by the ceiling but needs a **24GB+** card for full FT — weights + gradients alone approach 16GB before the optimizer and activations — so on a 16GB card use `mode="lora"` for those (they're in the LoRA row). Models >4B exit with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`.
+Two things most single-GPU libraries send you elsewhere for — **24–34B QLoRA** and **single-card 7B-class full fine-tuning** — Backpropagate does on one consumer card, then exports the result straight to Ollama.
-2-bit quantization (AQLM / QuIP#) is **out of scope**. It was scoped for v1.4, then retired in the [v1.5 trajectory brief](docs/V1_5_BRIEF.md): a 2-bit base can't be cleanly merged back into full-precision weights, which breaks Backpropagate's mergeable-adapter → GGUF → Ollama export contract (the whole point of the pipeline). The headroom levers Backpropagate ships instead are the v1.5 **FP8 compute path** (`--fp8`, Blackwell/Hopper) and `mode="full"` for ≤4B models — both stay mergeable and exportable.
+**The full-FT ceiling is card-aware.** It's derived from the 4-addend training-memory arithmetic (weights + gradients + optimizer + activations) against your *detected* VRAM: **16 GB → 4B, 24 GB → 5B, 32 GB → 6B** pure-GPU. `--full-ft-offload` lifts it to **7B-class** by spilling params + optimizer state into host RAM via FSDP2 `fully_shard` + `CPUOffloadPolicy` (slower, PCIe/CPU-bandwidth-bound; needs ~64 GB host RAM and an NCCL backend, i.e. Linux/WSL2). Override the ceiling explicitly with `--full-ft-ceiling-billions`. A model past even the offload ceiling exits with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, naming the recovery (`--full-ft-offload`, or LoRA/QLoRA). See [the full fine-tuning handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) for the VRAM math + the Biderman 2024 / Thinking Machines 2025 quality comparison.
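The four-addend arithmetic in the paragraph above can be sketched roughly as follows. This is an illustration, not Backpropagate's code: `estimate_full_ft`, `ceiling_for`, the per-parameter byte counts (bf16 weights and gradients at 2 bytes each, two 8-bit optimizer moments at 1 byte each) and the flat activation budget are all assumptions, and the rule for cards between the listed VRAM tiers is a guess. The sum will not reproduce the README's ceilings exactly, because those depend on how the library accounts for paged optimizer state and activations, which this section does not specify.

```python
from dataclasses import dataclass

# Pure-GPU full-FT ceilings quoted in the README, keyed by card VRAM in GB.
CEILING_BILLIONS_BY_VRAM_GB = {16: 4, 24: 5, 32: 6}


@dataclass(frozen=True)
class FullFtEstimate:
    weights_gb: float
    gradients_gb: float
    optimizer_gb: float
    activations_gb: float

    @property
    def total_gb(self) -> float:
        # The "4-addend" sum: weights + gradients + optimizer + activations.
        return (self.weights_gb + self.gradients_gb
                + self.optimizer_gb + self.activations_gb)


def estimate_full_ft(params_billions: float, *, weight_bytes: float = 2.0,
                     grad_bytes: float = 2.0, optimizer_bytes: float = 2.0,
                     activations_gb: float = 2.0) -> FullFtEstimate:
    """Hypothetical estimate; every default here is an illustrative assumption."""
    # 1e9 parameters at 1 byte each is 1 GB (decimal), so billions * bytes = GB.
    return FullFtEstimate(
        weights_gb=params_billions * weight_bytes,
        gradients_gb=params_billions * grad_bytes,
        optimizer_gb=params_billions * optimizer_bytes,
        activations_gb=activations_gb,
    )


def ceiling_for(vram_gb: float):
    """Pick the ceiling of the largest listed tier the card meets (assumed rule)."""
    eligible = [tier for tier in CEILING_BILLIONS_BY_VRAM_GB if vram_gb >= tier]
    return CEILING_BILLIONS_BY_VRAM_GB[max(eligible)] if eligible else None
```

Calling `estimate_full_ft(3)` shows how quickly the addends stack up even for a 3B model, which is why gradient checkpointing, paged optimizers, and CPU offload matter for fitting a given card.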
-For models 3B and smaller, full fine-tuning (not just LoRA) is feasible on 16GB and now ships in v1.4 as `mode="full"`. Pass `Trainer(..., mode="full")` or `backprop train --mode=full --model phi-4-mini-3.8b` to enable it. A hard gate refuses the mode for models > 4B with `RUNTIME_FULL_FT_MODEL_TOO_LARGE`, naming LoRA + the sub-4B presets as the recovery options. See [the full fine-tuning handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/full-fine-tuning/) for the configuration math + Biderman 2024 / Thinking Machines 2025 quality comparison. For 7B+ models, full fine-tuning needs a 24GB+ GPU — consider an A100 cloud rental, or stick with LoRA, which recent research shows matches full fine-tuning quality on most post-training tasks anyway (see [the anti-pitch section](#what-backpropagate-is-not-for) for citations).
+### Scales down to 16 GB
+The 16 GB envelope (RTX 4080 / 5080 / 4070 Ti Super) is still first-class: 7B QLoRA at ~7–8 GB, and true full fine-tuning of a genuine ~3B (SmolLM3-3B, Qwen2.5-3B, Llama-3.2-3B/1B) inside 16 GB via `mode="full"` (bf16 weights + gradient checkpointing + paged 8-bit AdamW). The same code picks the batch size and full-FT ceiling that fit whatever card it detects — no flags to change between rigs.
+2-bit quantization (AQLM / QuIP#) stays **out of scope** — a 2-bit base can't be cleanly merged back into full-precision weights, which breaks the mergeable-adapter → GGUF → Ollama export contract (the whole point of the pipeline). The headroom levers Backpropagate ships instead — QLoRA, `mode="full"`, `--full-ft-offload`, and the FP8 compute path (`--fp8`, Blackwell/Hopper) — all stay mergeable and exportable.
 ## What Backpropagate is NOT for
 If your use case is below, you'll have a better time with a different library — Backpropagate is not the right pick and trying to make it work would cost more than just reaching for the right tool. Reading this section before you start saves the install-and-bounce cycle:
-- **Full-parameter fine-tuning of 7B+ models** — Backpropagate uses LoRA / QLoRA, which trains a small adapter rather than updating every weight. For models 7B and larger, full fine-tuning needs 24GB+ of GPU memory and doesn't fit on a 16GB consumer card. For models 3B and smaller, full fine-tuning IS feasible on 16GB and ships in v1.4 as `mode="full"` (pass `Trainer(..., mode="full")` or `--mode=full` on the CLI; a hard gate raises `RUNTIME_FULL_FT_MODEL_TOO_LARGE` for models > 4B and names LoRA + the sub-4B presets as recoveries). The bigger picture: recent research ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) shows that LoRA at correct configuration matches full fine-tuning quality on most post-training tasks (instruction-following, domain adaptation, persona/style) at 67% of the compute — so for the work most operators actually want, you don't lose anything by sticking with LoRA. `mode="full"` exists for the cases where you've measured a quality gap and decided to spend the extra compute. If you genuinely need full fine-tuning of a 7B+ model, use HuggingFace `transformers.Trainer` directly on a 24GB+ card.
+- **Full-parameter fine-tuning past the offload ceiling (≈13B+)** — Backpropagate full-fine-tunes up to **~6B pure-GPU and ~7B-class via `--full-ft-offload`** on a 32 GB card (see [the envelope](#what-you-can-fine-tune-on-one-gpu)). A *true full* fine-tune of a 13B+ model is past that — it wants multi-GPU FSDP or a bigger card (reach for `transformers.Trainer` across multiple GPUs, or rent an A100/H100). Before spending that compute, though: recent research ([Biderman 2024](https://arxiv.org/abs/2405.09673), [Thinking Machines 2025](https://thinkingmachines.ai/blog/lora/)) shows LoRA at correct configuration matches full fine-tuning quality on most post-training tasks (instruction-following, domain adaptation, persona/style) at ~67% of the compute — so QLoRA up to 34B, which Backpropagate does on one card, loses nothing for the work most operators actually want.
 - **Online RL — PPO / GRPO / RLVR** — Backpropagate does single-stage SFT plus reference-free preference tuning (ORPO in v1.5; SimPO + KTO in v1.6). What it does *not* do is online reinforcement learning — PPO, GRPO, or RLVR — which needs a reward model or a generation-and-scoring loop on top of the training step. For those, use TRL directly or LLaMA-Factory. (Reference-free preference tuning fits the single-stage envelope because there's no separate reference model to hold in memory; see the ORPO note under [Quick Start](#quick-start).)
 - **Multi-node training** — single GPU on one machine only. Multi-GPU on one machine works (via `accelerate launch`) but isn't officially supported.
-- **macOS training on the CUDA rail** — Apple Silicon doesn't have CUDA, so the CUDA path has to run on a Linux or Windows box with an NVIDIA GPU. You can still run the trained model on a Mac via Ollama. **New in v1.5:** an experimental MLX rail (`--backend mlx`) trains a LoRA adapter natively on Apple Silicon — see [Apple Silicon (MLX)](#apple-silicon-mlx--experimental-v15). It is LoRA-SFT-only and built-but-not-yet-dogfood-verified on real silicon, so for anything beyond a LoRA SFT (ORPO, full fine-tune, FP8, multi-run) you still want the CUDA rail.
+- **macOS training on the CUDA rail** — Apple Silicon doesn't have CUDA, so the CUDA path runs on a Linux or Windows box with an NVIDIA GPU. You can still run the trained model on a Mac via Ollama. An **experimental, unverified-preview** MLX rail (`--backend mlx`) trains a LoRA adapter natively on Apple Silicon — see [Apple Silicon (MLX)](#apple-silicon-mlx--unverified-preview). It is LoRA-SFT-only and **not dogfood-verified on real silicon** (no support), so for anything beyond a LoRA SFT (ORPO, full fine-tune, FP8, multi-run) you want the CUDA rail.
 - **Anything outside the tested model families** — Qwen 2.5 / 3.5 (7B / 4B), Phi-4-mini-3.8B, SmolLM3-3B, Llama 3.2 (3B / 1B), Mistral 7B. Other models often work but aren't pinned in CI.
 If you need any of those things, reach for one of the libraries listed above. They're better at them.
@@ -172,11 +177,13 @@ The default learning rate auto-lowers to `8e-6` for ORPO (the loss is sharper th
 ### Reasoning-trace SFT (R1 distillation)
-New in v1.5: distill a reasoning model the easy way. Pass `--reasoning-trace` (CLI) or `Trainer(..., reasoning_trace=True)` (Python) and feed it traces that keep a `<think>...</think>` chain-of-thought inside the assistant turn — the pure-SFT half of [DeepSeek-R1](https://arxiv.org/abs/2501.12948) distillation, no RL required. Backpropagate keeps `<think>` in the training target, drops empty / over-long traces (trace-length filtering), and raises the default `max_seq_length` to 8192 for the longer CoT. Critically, `<think>` stays **plain text** — no special tokens, no embedding resize — so the merged GGUF still exports to Ollama like any other fine-tune. SFT only. See the [reasoning-trace recipe](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) for the dataset shape and the tunable token band.
+Distill a reasoning model the easy way. Pass `--reasoning-trace` (CLI) or `Trainer(..., reasoning_trace=True)` (Python) and feed it traces that keep a `<think>...</think>` chain-of-thought inside the assistant turn — the pure-SFT half of [DeepSeek-R1](https://arxiv.org/abs/2501.12948) distillation, no RL required. Backpropagate keeps `<think>` in the training target, drops empty / over-long traces (trace-length filtering), and raises the default `max_seq_length` to 8192 for the longer CoT. Critically, `<think>` stays **plain text** — no special tokens, no embedding resize — so the merged GGUF still exports to Ollama like any other fine-tune. SFT only. See the [reasoning-trace recipe](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/#reasoning-trace-sft-r1-distillation) for the dataset shape and the tunable token band.
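The trace-length filtering described above might look something like this sketch. It is not the library's implementation: `keep_trace`, `filter_traces`, the whitespace-based token count, and the default band of 1 to 4096 tokens are hypothetical stand-ins, and dropping examples that have no `<think>` block at all is an assumption.

```python
import re

# <think> stays plain text, so an ordinary regex is enough to find it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def think_token_count(assistant_text):
    """Return a rough token count of the first <think> block, or None if absent."""
    match = THINK_RE.search(assistant_text)
    if match is None:
        return None
    # Whitespace splitting is a stand-in for a real tokenizer.
    return len(match.group(1).split())


def keep_trace(assistant_text, min_tokens=1, max_tokens=4096):
    """Keep a trace only if its <think> block falls inside the token band."""
    count = think_token_count(assistant_text)
    return count is not None and min_tokens <= count <= max_tokens


def filter_traces(assistant_texts, **band):
    """Drop empty and over-long traces from a list of assistant turns."""
    return [text for text in assistant_texts if keep_trace(text, **band)]
```

A real pipeline would count tokens with the model's own tokenizer, since whitespace counts can differ substantially from subword counts.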
+### Apple Silicon (MLX) — unverified preview
-### Apple Silicon (MLX) — experimental, v1.5
+> ⚠️ **Unverified preview — not part of the supported feature set.** The MLX rail is built and unit-tested but has **not** been dogfood-verified on real Apple Silicon (`mlx-lm` is Apple-only and can't run on the NVIDIA rigs Backpropagate is developed on). Treat everything below as experimental, use at your own risk, and [report anomalies](#reporting-bugs) if you run it on an M-series Mac.
-New in v1.5: **one API, two rails.** CUDA stays the canonical, verified backend; MLX is a second rail that trains on an M-series Mac via Apple's [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) toolchain (unified memory, no CUDA). The same 3-line shape picks the rail by hardware — `backend='auto'` (the default) routes to CUDA on NVIDIA and to MLX on Apple Silicon, so existing CUDA rigs are byte-identical:
+**One API, two rails.** CUDA is the canonical, verified backend; MLX is a second rail that trains on an M-series Mac via Apple's [`mlx_lm.lora`](https://github.com/ml-explore/mlx-lm) toolchain (unified memory, no CUDA). The 3-line shape picks the rail by hardware — `backend='auto'` (the default) routes to CUDA on NVIDIA and to MLX on Apple Silicon, so existing CUDA rigs are byte-identical:
 ```python
 from backpropagate import Trainer
@@ -190,9 +197,9 @@ trainer.train("examples/quickstart.jsonl", steps=100)
 backprop train --data my_data.jsonl --backend mlx --steps 100
 ```
-In v1.5 the MLX rail is **LoRA SFT only** — no ORPO, no FP8, no `mode='full'`, no multi-run on MLX yet (each is rejected with `CONFIG_INVALID_SETTING`; use `backend='cuda'`/`'auto'` on an NVIDIA box for those). The resulting adapter is plain safetensors and exports to Ollama through the same path as the CUDA rail.
+The MLX rail is **LoRA SFT only** — no ORPO, no FP8, no `mode='full'`, no multi-run (each is rejected with `CONFIG_INVALID_SETTING`; use `backend='cuda'`/`'auto'` on an NVIDIA box for those). The resulting adapter is plain safetensors and exports to Ollama through the same path as the CUDA rail.
-> ⚠️ **Honest status:** the MLX rail ships in v1.5 **built + unit-tested (mocked)** but **NOT yet dogfood-verified on real Apple Silicon** — `mlx-lm` is Apple-only and could not be run on the NVIDIA rig this was authored on. Treat it as experimental — the same framing the FP8 path had in v1.5 (FP8 graduated to dogfood-verified on Blackwell in v1.6; MLX still needs that pass on real silicon) — and please [report anomalies](#reporting-bugs) once it runs on an M-series Mac. Forcing `--backend mlx` on a non-Apple host errors with `CONFIG_INVALID_SETTING`; a missing `mlx_lm` toolchain on a Mac raises `DEP_MLX_UNAVAILABLE`.
+> Forcing `--backend mlx` on a non-Apple host errors with `CONFIG_INVALID_SETTING`; a missing `mlx_lm` toolchain on a Mac raises `DEP_MLX_UNAVAILABLE`.
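The routing and error behaviour described in this section can be summarised in a small sketch. The function `pick_backend`, the `BackendError` class, and its keyword arguments are hypothetical and exist only for illustration; the error codes are the ones named in the README. What `auto` does on a host with neither CUDA nor Apple Silicon is an assumption here.

```python
class BackendError(Exception):
    """Hypothetical error type carrying one of the README's error codes."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def pick_backend(requested, *, system, machine, has_cuda, has_mlx_lm):
    """Resolve 'auto' / 'cuda' / 'mlx' to a concrete rail, or raise a coded error.

    `system` and `machine` are the values platform.system() and
    platform.machine() would return on the host.
    """
    apple_silicon = system == "Darwin" and machine == "arm64"

    if requested == "auto":
        if has_cuda:
            requested = "cuda"
        elif apple_silicon:
            requested = "mlx"
        else:
            # Assumption: with no usable accelerator, auto fails like the CUDA rail.
            raise BackendError("DEP_GPU_NOT_AVAILABLE")

    if requested == "mlx":
        if not apple_silicon:
            raise BackendError("CONFIG_INVALID_SETTING")
        if not has_mlx_lm:
            raise BackendError("DEP_MLX_UNAVAILABLE")
    elif requested == "cuda":
        if not has_cuda:
            raise BackendError("DEP_GPU_NOT_AVAILABLE")
    else:
        raise BackendError("CONFIG_INVALID_SETTING")
    return requested
```

Passing the host facts in as arguments, rather than probing them inside the function, keeps the decision table easy to test on any machine.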
 For more end-to-end workflows (fine-tune-and-push-to-HF-Hub, resume after OOM, multi-run SLAO across a long campaign, etc.) see the [handbook recipes page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/recipes/).
@@ -312,7 +319,7 @@ Backpropagate handles the runtime quirks of training on different platforms, but
 - **Wrong CUDA wheel.** PyTorch is published one binary per CUDA version. If you pick the wrong one, you silently get CPU-only PyTorch and training is impossibly slow. Use the wheel picker at <https://pytorch.org/get-started/locally/> for your driver. Run `nvidia-smi` to see your driver / CUDA version.
 - **Windows + GGUF export.** The `[export]` extra builds `llama-cpp-python` from source, which needs Visual Studio Build Tools (C++ component) and CMake.
-**macOS:** the CUDA rail is not supported (no CUDA) — a CUDA-routed `trainer.train()` raises `DEP_GPU_NOT_AVAILABLE`, and you can run the trained adapter on a Mac via Ollama. **New in v1.5:** an experimental MLX rail (`--backend mlx`, `pip install 'backpropagate[mlx]'`) trains a LoRA adapter natively on Apple Silicon via `mlx_lm.lora` — LoRA SFT only, and built + unit-tested but not yet dogfood-verified on real silicon (see [Apple Silicon (MLX)](#apple-silicon-mlx--experimental-v15)). For the CUDA path, or for ORPO / full fine-tune / FP8 / multi-run, use a CUDA Linux or Windows machine.
+**macOS:** the CUDA rail is not supported (no CUDA) — a CUDA-routed `trainer.train()` raises `DEP_GPU_NOT_AVAILABLE`, and you can run the trained adapter on a Mac via Ollama. An **experimental, unverified-preview** MLX rail (`--backend mlx`, `pip install 'backpropagate[mlx]'`) trains a LoRA adapter natively on Apple Silicon via `mlx_lm.lora` — LoRA SFT only, and **not dogfood-verified on real silicon** (see [Apple Silicon (MLX)](#apple-silicon-mlx--unverified-preview)). For the CUDA path, or for ORPO / full fine-tune / FP8 / multi-run, use a CUDA Linux or Windows machine.
 See the [troubleshooting handbook page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting/) for the long-form install fix-it guide, and the dedicated [CUDA troubleshooting page](https://mcp-tool-shop-org.github.io/backpropagate/handbook/troubleshooting-cuda/) for driver / VRAM / xformers / bf16-vs-fp16 issues.
@@ -364,8 +371,12 @@ Nested keys use double underscore (`MODEL__NAME`, not `MODEL_NAME`). The full re
 | Llama 3.2 3B | ~8GB | Llama Community | Solid alternative to Qwen 3B with permissive caveats. |
 | Llama 3.2 1B | ~6GB | Llama Community | For quick experiments on small cards. |
 | Mistral 7B | ~12GB | Apache 2.0 | Comparable to Qwen 7B, different chat template. |
+| Llama-3.1-8B | ~7-8GB (QLoRA) | Llama-3.1-Community | 8B QLoRA, 128K native context (the >700M-MAU clause needs a separate Meta license). |
+| **Qwen2.5-14B** | ~8.5GB (QLoRA) | Apache 2.0 | **The 32 GB daily-driver sweet spot** — rank/alpha 32, paged 8-bit AdamW, 4096 ctx. |
+| Mistral-Small-24B | ~18GB (QLoRA) | Apache 2.0 | 24B QLoRA on a 32 GB card with 4096-ctx headroom. |
+| **Qwen2.5-32B** | ~26GB (QLoRA) | Apache 2.0 | **Top of the 32 GB envelope** — just fits at `max_len 2048` + paged 8-bit AdamW. |
-Other models often work, but only these eight are pinned in CI. Pass `--lora-preset=quality` (default) for rank-256 / all-linear targets per Biderman 2024 + Thinking Machines 2025, or `--lora-preset=fast` for the legacy rank-16 / q+v target if you need the v1.2.x footprint.
+Other models often work; the rows above are the curated presets — the 14B–32B tier is QLoRA-tuned for a 32 GB card (the measured envelope). Pass `--lora-preset=quality` (default) for rank-256 / all-linear targets per Biderman 2024 + Thinking Machines 2025, or `--lora-preset=fast` for the legacy rank-16 / q+v target if you need the v1.2.x footprint.
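To get a feel for the footprint gap between the two presets, the sketch below counts trainable LoRA parameters. The only fact it relies on is that a rank-r adapter on a d_in × d_out linear layer adds r × (d_in + d_out) parameters. The layer shapes (hidden size 4096, MLP width 11008, 32 layers, no grouped-query attention) and the module names are illustrative assumptions, not the shapes of any preset listed above.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters a rank-`rank` LoRA adds to one d_in x d_out linear layer."""
    # A is (rank x d_in) and B is (d_out x rank), so the total is rank * (d_in + d_out).
    return rank * (d_in + d_out)


# Hypothetical 7B-style shapes, chosen only for illustration.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def adapter_params(rank, targets):
    """Total adapter parameters across all layers for the given target modules."""
    per_layer = sum(lora_params(*SHAPES[name], rank) for name in targets)
    return per_layer * LAYERS


quality = adapter_params(256, SHAPES)            # rank 256, all linear targets
fast = adapter_params(16, ["q_proj", "v_proj"])  # rank 16, q+v only
print(f"quality: {quality:,}  fast: {fast:,}  ratio: {quality / fast:.2f}x")
```

Adapter weights are only part of the memory picture, since their gradients and optimizer state scale with the same count, so the ratio gives a rough sense of why the `fast` preset preserves the older footprint.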
 ## Troubleshooting