PyPI - soup-cli - Versions diffs - 0.53.2__tar.gz → 0.53.4__tar.gz - Mend

soup-cli 0.53.2__tar.gz → 0.53.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (525)

{soup_cli-0.53.2 → soup_cli-0.53.4}/CONTRIBUTING.md RENAMED

@@ -111,7 +111,7 @@ soup_cli/
   templates/          - 17 built-in soup.yaml templates (YAML + manifest.json) with load_template loader (v0.39.0, +bco v0.40.0)
   ui/                 - Web UI (FastAPI + HTML/JS SPA)
-tests/                - Test suite (185 files, 7842 tests)
+tests/                - Test suite (187 files, 7935 tests)
 examples/             - Real-world config examples and datasets
 ```

{soup_cli-0.53.2 → soup_cli-0.53.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: soup-cli
-Version: 0.53.2
+Version: 0.53.4
 Summary: Fine-tune LLMs in one command. No SSH, no config hell.
 Project-URL: Homepage, https://github.com/MakazhanAlpamys/Soup
 Project-URL: Repository, https://github.com/MakazhanAlpamys/Soup
@@ -134,14 +134,14 @@ soup train
 Latest highlights only. Full history: [GitHub Releases](https://github.com/MakazhanAlpamys/Soup/releases).
-**v0.53.2 — Modality II live trainers**: Four v0.52.0 deferred stubs lifted into real, end-to-end-trainable wrappers — knowledge distillation, sequence classification, EBFT / GDPO loss kernels, and gpt-oss-style `reasoning_effort` system-prompt injection.
+**v0.53.4 — Long Context + Architecture**: Six closes — LongLoRA hardened, LLaMA Pro block expansion lifted from deferred stub to live wiring, and CUDA-OOM hint upgraded.
-- **`soup train` with `task: distill`.** New `DistillTrainerWrapper`: student + frozen teacher both load via `AutoModelForCausalLM` (separate `trust_remote_code` resolution for each), KL / forward_KL / reverse_KL / JS divergence kernels scaled by `temperature**2` per the Hinton paper. Device-bridge: teacher inputs auto-move to the teacher's device, teacher logits move back onto the student's device before the KL kernel — survives HF Trainer's auto-CUDA promotion on a CPU-tagged run. `DataCollatorForSeq2Seq(label_pad_token_id=-100)` handles variable-length pre-tokenised loss-masked rows correctly.
-- **`soup train` with `task: classifier | reranker | cross_encoder`.** New `ClassifierTrainerWrapper`: `AutoModelForSequenceClassification` with `num_labels` and `label_names`, auto-routes `single_label_classification` / `multi_label_classification` from `tcfg.classifier_kind`. Multi-label string labels resolved via the `label_names` map with a 1024-entry cap + dedup. Training Setup Panel renders `Head: num_labels=N, kind=...` instead of LoRA r/alpha for the classifier family.
-- **EBFT structured / strided + GDPO standard / length_normalized / margin loss kernels.** `apply_ebft_loss` and `apply_gdpo_loss` exit the v0.52.0 `NotImplementedError` stubs with finite-only-input guards and bool-rejected numeric params. `attach_ebft_compute_loss(trainer, tcfg)` (SFT) and `attach_gdpo_compute_loss(trainer, tcfg)` (DPO) wrap `Trainer.compute_loss` idempotently — re-attach is a no-op via a marker attribute on the wrapped method. Auto-attached when the corresponding `*_variant` field is set on `TrainingConfig`.
-- **gpt-oss `reasoning_effort` + `train_on_eot`.** `apply_reasoning_effort_prefix(messages, level)` injects `<|reasoning_effort|>{low,medium,high}<|/reasoning_effort|>` into the system turn (creates one if absent), returning a new list (caller's messages immutable). `build_assistant_only_labels(train_on_eot=True)` keeps the EOT/EOS token unmasked at the assistant-turn boundary so the model learns when to stop. Both gated to the SFT-family at config-load.
-- **+120 net new tests** (7722 → 7842) across `test_v0532.py`. Four review agents (python / code / security / tdd) ran; every CRITICAL / HIGH / MEDIUM / LOW finding fixed — separate `trust_remote_code` resolution for student vs teacher, idempotent attach hooks with regression tests, 1024-entry multi-label cap, `dpo_margin` defaults to `None` (not `0.0`) so missing values raise rather than silently zero, source-grep regression guards on the trainer-routing call sites use the full instantiation expression (no comment-only false-positives), Panel renders the classifier head instead of LoRA r/alpha.
-- **Local end-to-end CPU smoke** confirms both new wrappers train 2 steps with finite loss on `hf-internal-testing/tiny-random-gpt2`. Two real bugs surfaced and were fixed during the smoke (collator label padding + teacher / student device mismatch) — both have source-level regression guards in the test suite. ONNX export QA: pipeline integrity proven on tiny-gpt2; TinyLlama-1.1B full export is host-RAM-bound (documented in `tests/qa/v053_qa.md`).
+- **LLaMA Pro is live.** `soup train` and `soup data` pretraining now honour `training.expand_layers: N` by deep-copying the last N decoder blocks, zero-initialising their residual projections (so the appended block initially acts as identity per the LLaMA Pro paper §3.1), and appending to `model.layers`. Pair with `freeze_trainable_layers: N` to train only the new blocks. Centralised via `block_expansion.apply_block_expansion_if_configured` so SFT and Pretrain stay in lock-step.
+- **LongLoRA arch allowlist expanded** to Llama / CodeLlama / Mistral / Qwen / Phi via word-boundary helpers (`is_mistral_model`, `is_qwen_model`, `is_phi_model`). Mixtral is intentionally excluded — its MoE attention requires a dedicated helper still tracked for a future release.
+- **LongLoRA + FlashAttention v3 reject.** New `flash_attn.is_flash_attn_v3_available()` probe — when FA-v3 is installed the schema now rejects `use_longlora: true` at load time with an actionable error (S² shifted-sparse + FA-v3 custom-mask both rewrite the kernel; allowing both would silently corrupt outputs).
+- **Llama 3.1 RoPE auto-detect.** Pass `rope_scaling_type: None` (or omit it) on a Llama 3.1 base, and `apply_long_context_config` now reads `model.config.rope_scaling`. If it carries a `llama3` block, the long-context path picks `LLAMA3_DEFAULT_*` instead of falling through to `dynamic`. Explicit caller picks still win.
+- **CUDA-OOM hint upgrade.** `format_friendly_error` now points users at the exact CLI flags — `--batch-size <half>` and `--grad-accum <double>` to preserve effective batch size — before the legacy `quantization: 4bit` fallback.
+- **+56 net new tests** (7879 → 7935) across the new `test_v0534.py`. Four review agents (python / code / security / tdd) ran; every CRITICAL → LOW finding was fixed — shared centralised helper for trainer drift, `is None` over falsy guards, defensive non-string surface on `is_supported_longlora_arch`, 64-char base-name truncation in error messages, null-byte rejection on `task`/`backend`, and a `warnings.warn` when block expansion runs on non-Llama-shaped architectures.
 ## Why Soup?
@@ -917,9 +917,9 @@ training:
 **YaRN.** Best quality for 4-8x extension. Tunables (`yarn_factor`, `yarn_attn_factor`, `yarn_beta_fast`, `yarn_beta_slow`) only apply when `rope_scaling_type=yarn`; the schema rejects them otherwise. Pure-Python math kernels are exposed at `soup_cli.utils.long_context.yarn_*` for reference / config-emit. The actual RoPE rotation runs inside HF Transformers.
-**Llama 3.1 NTK-aware.** Use `rope_scaling_type: llama3` for the canonical Llama 3.1 frequency-band scaling (`scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, `old_context_len=8192`). `detect_llama3_rope_in_config` auto-detects the block in any HF model config dict.
+**Llama 3.1 NTK-aware.** Use `rope_scaling_type: llama3` for the canonical Llama 3.1 frequency-band scaling (`scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, `old_context_len=8192`). `detect_llama3_rope_in_config` auto-detects the block in any HF model config dict. Omit `rope_scaling_type` from your YAML (so it stays `None`) on a Llama 3.1 base and `apply_long_context_config` will auto-pick `llama3` by reading `model.config.rope_scaling` at load time — explicit caller picks still win.
-**LongLoRA S² (schema-only this release).** `training.use_longlora: true` requires `task=sft`, `backend=transformers`, Llama-family base, and `use_ring_attention=false`. The schema gate fails fast at config load; live forward override mirroring LlamaFactory `model/model_utils/longlora.py` lands in v0.49.1.
+**LongLoRA S² (schema-only this release).** `training.use_longlora: true` requires `task=sft`, `backend=transformers`, a base in the architecture allowlist (Llama / CodeLlama / Mistral / Qwen / Phi — Mixtral excluded), and `use_ring_attention=false`. The schema also rejects the combo with FlashAttention v3 installed (the S² custom-mask kernel conflicts with FA-v3 native custom-mask). The schema gate fails fast at config load; live forward override mirroring LlamaFactory `model/model_utils/longlora.py` lands in a follow-up release.
 ```yaml
 # Llama 3.1 with NTK-aware scaling out to 128k
@@ -931,6 +931,27 @@ data:
   max_length: 131072
 ```
+## LLaMA Pro Block Expansion
+Add `N` zero-initialised transformer blocks to a base model and train **only the new blocks** — keeps the original behaviour intact while adding capacity for a new domain (per the LLaMA Pro paper, `arxiv.org/abs/2401.02415`).
+```yaml
+# soup.yaml — LLaMA Pro continued-training on a Llama-3.1 base
+base: meta-llama/Llama-3.1-8B
+task: sft
+data:
+  train: ./domain.jsonl
+training:
+  expand_layers: 4              # append 4 zero-init decoder blocks
+  freeze_trainable_layers: 4    # train only the appended blocks
+  lr: 5e-5
+  epochs: 1
+```
+**What happens at trainer start.** Soup deep-copies the last `expand_layers` decoder blocks, zero-inits each clone's residual projections (`mlp.down_proj` + `self_attn.o_proj`) so the appended block initially acts as identity, appends them to `model.model.layers`, and updates `config.num_hidden_layers`. When `freeze_trainable_layers > 0` is set, every parameter except the appended blocks is frozen — this is the canonical LLaMA Pro "train only new blocks" recipe.
+**Scope.** Works on both `task: sft` and `task: pretrain` with `backend: transformers`. Bounds: `expand_layers ∈ [1, 64]`. Over-expansion (more new blocks than the base has layers) silently clamps to the base layer count. Non-Llama-shaped architectures (e.g. Falcon's `dense_4h_to_h`) emit a `warnings.warn` because the residual zero-init heuristic only matches the standard `down_proj` / `o_proj` names — the appended blocks are still appended + trainable, but lose the identity-init guarantee.
 ## Optimizer & PEFT Zoo
 Pick from a wider catalogue of optimizers, target individual modules with their own LR, and use quantization-aware LoRA initialisation:

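Note on the LLaMA Pro hunk above: the expansion steps it describes (validate the count, deep-copy the last N blocks, zero the residual projections, append, optionally freeze everything else) can be sketched in plain Python. This is a toy illustration under stated assumptions, not soup-cli's implementation: `Linear`, `Block`, `forward`, `expand_blocks`, and `freeze_all_but_tail` are hypothetical stand-ins that use nested lists instead of tensors, and raising `ValueError` for an invalid count is an assumption (the diff only states the `[1, 64]` bound and the bool guard).

```python
import copy


class Linear:
    """Toy linear layer: weight is a list of rows (no tensors, no bias)."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def zero_(self):
        self.weight = [[0.0 for _ in row] for row in self.weight]


class Block:
    """Toy decoder block with two residual branches."""

    def __init__(self, o_proj, down_proj):
        self.o_proj = o_proj        # stands in for self_attn.o_proj
        self.down_proj = down_proj  # stands in for mlp.down_proj
        self.trainable = True

    def __call__(self, x):
        h = [a + b for a, b in zip(x, self.o_proj(x))]        # attention residual
        return [a + b for a, b in zip(h, self.down_proj(h))]  # MLP residual


def forward(layers, x):
    for block in layers:
        x = block(x)
    return x


def expand_blocks(layers, num_new):
    """Append zero-initialised clones of the last blocks; return how many were added."""
    # bool is a subclass of int, so reject it explicitly before the range check.
    if isinstance(num_new, bool) or not isinstance(num_new, int):
        raise ValueError("num_new must be an int")
    if not 1 <= num_new <= 64:
        raise ValueError("num_new must be in [1, 64]")
    count = min(num_new, len(layers))  # over-expansion clamps instead of raising
    clones = [copy.deepcopy(block) for block in layers[-count:]]
    for clone in clones:
        # With both residual projections zeroed, the clone returns its input unchanged.
        clone.o_proj.zero_()
        clone.down_proj.zero_()
    layers.extend(clones)
    return count


def freeze_all_but_tail(layers, count):
    """Freeze every block, then unfreeze the last `count`; return the trainable block count."""
    for block in layers:
        block.trainable = False
    for block in layers[-count:]:
        block.trainable = True
    return sum(block.trainable for block in layers)


layers = [
    Block(Linear([[1.0, 2.0], [3.0, 4.0]]), Linear([[0.5, 0.0], [0.0, 0.5]]))
    for _ in range(2)
]
x = [1.0, -2.0]
before = forward(layers, x)
added = expand_blocks(layers, 4)   # asks for 4, but the base has only 2 blocks
after = forward(layers, x)
trainable = freeze_all_but_tail(layers, added)
```

Because the clones start as identity, `after` equals `before`: the expanded model reproduces the base model's output until the new blocks are trained. `copy.deepcopy` keeps the original blocks' weights untouched when the clones are zeroed.
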
{soup_cli-0.53.2 → soup_cli-0.53.4}/README.md RENAMED

@@ -43,14 +43,14 @@ soup train
 Latest highlights only. Full history: [GitHub Releases](https://github.com/MakazhanAlpamys/Soup/releases).
-**v0.53.2 — Modality II live trainers**: Four v0.52.0 deferred stubs lifted into real, end-to-end-trainable wrappers — knowledge distillation, sequence classification, EBFT / GDPO loss kernels, and gpt-oss-style `reasoning_effort` system-prompt injection.
+**v0.53.4 — Long Context + Architecture**: Six closes — LongLoRA hardened, LLaMA Pro block expansion lifted from deferred stub to live wiring, and CUDA-OOM hint upgraded.
-- **`soup train` with `task: distill`.** New `DistillTrainerWrapper`: student + frozen teacher both load via `AutoModelForCausalLM` (separate `trust_remote_code` resolution for each), KL / forward_KL / reverse_KL / JS divergence kernels scaled by `temperature**2` per the Hinton paper. Device-bridge: teacher inputs auto-move to the teacher's device, teacher logits move back onto the student's device before the KL kernel — survives HF Trainer's auto-CUDA promotion on a CPU-tagged run. `DataCollatorForSeq2Seq(label_pad_token_id=-100)` handles variable-length pre-tokenised loss-masked rows correctly.
-- **`soup train` with `task: classifier | reranker | cross_encoder`.** New `ClassifierTrainerWrapper`: `AutoModelForSequenceClassification` with `num_labels` and `label_names`, auto-routes `single_label_classification` / `multi_label_classification` from `tcfg.classifier_kind`. Multi-label string labels resolved via the `label_names` map with a 1024-entry cap + dedup. Training Setup Panel renders `Head: num_labels=N, kind=...` instead of LoRA r/alpha for the classifier family.
-- **EBFT structured / strided + GDPO standard / length_normalized / margin loss kernels.** `apply_ebft_loss` and `apply_gdpo_loss` exit the v0.52.0 `NotImplementedError` stubs with finite-only-input guards and bool-rejected numeric params. `attach_ebft_compute_loss(trainer, tcfg)` (SFT) and `attach_gdpo_compute_loss(trainer, tcfg)` (DPO) wrap `Trainer.compute_loss` idempotently — re-attach is a no-op via a marker attribute on the wrapped method. Auto-attached when the corresponding `*_variant` field is set on `TrainingConfig`.
-- **gpt-oss `reasoning_effort` + `train_on_eot`.** `apply_reasoning_effort_prefix(messages, level)` injects `<|reasoning_effort|>{low,medium,high}<|/reasoning_effort|>` into the system turn (creates one if absent), returning a new list (caller's messages immutable). `build_assistant_only_labels(train_on_eot=True)` keeps the EOT/EOS token unmasked at the assistant-turn boundary so the model learns when to stop. Both gated to the SFT-family at config-load.
-- **+120 net new tests** (7722 → 7842) across `test_v0532.py`. Four review agents (python / code / security / tdd) ran; every CRITICAL / HIGH / MEDIUM / LOW finding fixed — separate `trust_remote_code` resolution for student vs teacher, idempotent attach hooks with regression tests, 1024-entry multi-label cap, `dpo_margin` defaults to `None` (not `0.0`) so missing values raise rather than silently zero, source-grep regression guards on the trainer-routing call sites use the full instantiation expression (no comment-only false-positives), Panel renders the classifier head instead of LoRA r/alpha.
-- **Local end-to-end CPU smoke** confirms both new wrappers train 2 steps with finite loss on `hf-internal-testing/tiny-random-gpt2`. Two real bugs surfaced and were fixed during the smoke (collator label padding + teacher / student device mismatch) — both have source-level regression guards in the test suite. ONNX export QA: pipeline integrity proven on tiny-gpt2; TinyLlama-1.1B full export is host-RAM-bound (documented in `tests/qa/v053_qa.md`).
+- **LLaMA Pro is live.** `soup train` and `soup data` pretraining now honour `training.expand_layers: N` by deep-copying the last N decoder blocks, zero-initialising their residual projections (so the appended block initially acts as identity per the LLaMA Pro paper §3.1), and appending to `model.layers`. Pair with `freeze_trainable_layers: N` to train only the new blocks. Centralised via `block_expansion.apply_block_expansion_if_configured` so SFT and Pretrain stay in lock-step.
+- **LongLoRA arch allowlist expanded** to Llama / CodeLlama / Mistral / Qwen / Phi via word-boundary helpers (`is_mistral_model`, `is_qwen_model`, `is_phi_model`). Mixtral is intentionally excluded — its MoE attention requires a dedicated helper still tracked for a future release.
+- **LongLoRA + FlashAttention v3 reject.** New `flash_attn.is_flash_attn_v3_available()` probe — when FA-v3 is installed the schema now rejects `use_longlora: true` at load time with an actionable error (S² shifted-sparse + FA-v3 custom-mask both rewrite the kernel; allowing both would silently corrupt outputs).
+- **Llama 3.1 RoPE auto-detect.** Pass `rope_scaling_type: None` (or omit it) on a Llama 3.1 base, and `apply_long_context_config` now reads `model.config.rope_scaling`. If it carries a `llama3` block, the long-context path picks `LLAMA3_DEFAULT_*` instead of falling through to `dynamic`. Explicit caller picks still win.
+- **CUDA-OOM hint upgrade.** `format_friendly_error` now points users at the exact CLI flags — `--batch-size <half>` and `--grad-accum <double>` to preserve effective batch size — before the legacy `quantization: 4bit` fallback.
+- **+56 net new tests** (7879 → 7935) across the new `test_v0534.py`. Four review agents (python / code / security / tdd) ran; every CRITICAL → LOW finding was fixed — shared centralised helper for trainer drift, `is None` over falsy guards, defensive non-string surface on `is_supported_longlora_arch`, 64-char base-name truncation in error messages, null-byte rejection on `task`/`backend`, and a `warnings.warn` when block expansion runs on non-Llama-shaped architectures.
 ## Why Soup?
@@ -826,9 +826,9 @@ training:
 **YaRN.** Best quality for 4-8x extension. Tunables (`yarn_factor`, `yarn_attn_factor`, `yarn_beta_fast`, `yarn_beta_slow`) only apply when `rope_scaling_type=yarn`; the schema rejects them otherwise. Pure-Python math kernels are exposed at `soup_cli.utils.long_context.yarn_*` for reference / config-emit. The actual RoPE rotation runs inside HF Transformers.
-**Llama 3.1 NTK-aware.** Use `rope_scaling_type: llama3` for the canonical Llama 3.1 frequency-band scaling (`scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, `old_context_len=8192`). `detect_llama3_rope_in_config` auto-detects the block in any HF model config dict.
+**Llama 3.1 NTK-aware.** Use `rope_scaling_type: llama3` for the canonical Llama 3.1 frequency-band scaling (`scale_factor=8`, `low_freq_factor=1`, `high_freq_factor=4`, `old_context_len=8192`). `detect_llama3_rope_in_config` auto-detects the block in any HF model config dict. Omit `rope_scaling_type` from your YAML (so it stays `None`) on a Llama 3.1 base and `apply_long_context_config` will auto-pick `llama3` by reading `model.config.rope_scaling` at load time — explicit caller picks still win.
-**LongLoRA S² (schema-only this release).** `training.use_longlora: true` requires `task=sft`, `backend=transformers`, Llama-family base, and `use_ring_attention=false`. The schema gate fails fast at config load; live forward override mirroring LlamaFactory `model/model_utils/longlora.py` lands in v0.49.1.
+**LongLoRA S² (schema-only this release).** `training.use_longlora: true` requires `task=sft`, `backend=transformers`, a base in the architecture allowlist (Llama / CodeLlama / Mistral / Qwen / Phi — Mixtral excluded), and `use_ring_attention=false`. The schema also rejects the combo with FlashAttention v3 installed (the S² custom-mask kernel conflicts with FA-v3 native custom-mask). The schema gate fails fast at config load; live forward override mirroring LlamaFactory `model/model_utils/longlora.py` lands in a follow-up release.
 ```yaml
 # Llama 3.1 with NTK-aware scaling out to 128k
@@ -840,6 +840,27 @@ data:
   max_length: 131072
 ```
+## LLaMA Pro Block Expansion
+Add `N` zero-initialised transformer blocks to a base model and train **only the new blocks** — keeps the original behaviour intact while adding capacity for a new domain (per the LLaMA Pro paper, `arxiv.org/abs/2401.02415`).
+```yaml
+# soup.yaml — LLaMA Pro continued-training on a Llama-3.1 base
+base: meta-llama/Llama-3.1-8B
+task: sft
+data:
+  train: ./domain.jsonl
+training:
+  expand_layers: 4              # append 4 zero-init decoder blocks
+  freeze_trainable_layers: 4    # train only the appended blocks
+  lr: 5e-5
+  epochs: 1
+```
+**What happens at trainer start.** Soup deep-copies the last `expand_layers` decoder blocks, zero-inits each clone's residual projections (`mlp.down_proj` + `self_attn.o_proj`) so the appended block initially acts as identity, appends them to `model.model.layers`, and updates `config.num_hidden_layers`. When `freeze_trainable_layers > 0` is set, every parameter except the appended blocks is frozen — this is the canonical LLaMA Pro "train only new blocks" recipe.
+**Scope.** Works on both `task: sft` and `task: pretrain` with `backend: transformers`. Bounds: `expand_layers ∈ [1, 64]`. Over-expansion (more new blocks than the base has layers) silently clamps to the base layer count. Non-Llama-shaped architectures (e.g. Falcon's `dense_4h_to_h`) emit a `warnings.warn` because the residual zero-init heuristic only matches the standard `down_proj` / `o_proj` names — the appended blocks are still appended + trainable, but lose the identity-init guarantee.
 ## Optimizer & PEFT Zoo
 Pick from a wider catalogue of optimizers, target individual modules with their own LR, and use quantization-aware LoRA initialisation:

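Note on the LongLoRA allowlist bullet above: the word-boundary matching it mentions can be sketched with the standard `re` module. This is a minimal sketch, not the package's code: `matches_family_token` and `_normalise` are hypothetical names, and the pattern follows the `(?:^|[^a-z0-9])…(?:[^a-z0-9]|$)` idiom quoted in the SECURITY.md diff. The input guard (reject `bool` and non-strings, null bytes, and names over 512 characters) follows that description. How the real helpers treat version suffixes such as `qwen2.5` is not shown in this diff, so only the `mistral` case is exercised here.

```python
import re

_MAX_NAME_CHARS = 512  # longer inputs are treated as "no match"


def _normalise(model_name):
    """Return a lowercased name, or None when the input should never match."""
    if isinstance(model_name, bool) or not isinstance(model_name, str):
        return None
    if "\x00" in model_name or len(model_name) > _MAX_NAME_CHARS:
        return None
    return model_name.lower()


def matches_family_token(model_name, token):
    """True when `token` appears as a whole token, not as a substring of a longer word."""
    name = _normalise(model_name)
    if name is None:
        return False
    pattern = rf"(?:^|[^a-z0-9]){re.escape(token.lower())}(?:[^a-z0-9]|$)"
    return re.search(pattern, name) is not None


hits = {
    name: matches_family_token(name, "mistral")
    for name in (
        "mistralai/Mistral-7B-v0.1",
        "my-mistralish-finetune",
        "unmistral-7b",
        "mistralai/Mixtral-8x7B",
    )
}
```

Only the first name matches: `mistralish` and `unmistral` have an alphanumeric neighbour on one side of the token, and `mixtral` is a different token altogether, which is how Mixtral stays off the allowlist without a dedicated exclusion rule.
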
{soup_cli-0.53.2 → soup_cli-0.53.4}/SECURITY.md RENAMED

@@ -9,7 +9,9 @@ We provide security updates for the following versions:
 - **Versions older than 3 minor versions:** No support
 Example:
-- v0.53.2 -- Full support (latest)
+- v0.53.4 -- Full support (latest)
+- v0.53.3 -- Full support
+- v0.53.2 -- Full support
 - v0.53.1 -- Full support
 - v0.53.0 -- Full support
 - v0.52.0 -- Full support
@@ -148,6 +150,9 @@ No known critical vulnerabilities in current releases.
 - **v0.32.0 — Training Stability & Auto-Tuning**: `--find-lr-output` containment via shared `utils/paths.is_under_cwd` (prevents writes outside cwd); `save_lr_finder_report` rejects NaN / Infinity floats in `lrs` / `losses` and serialises with `allow_nan=False` (keeps the report parser-safe); `compute_lr_schedule` rejects non-positive `start_lr`, inverted ranges, and `num_steps` outside `[2, 10_000]`; `pick_mixed_precision` rejects empty / null-byte / >200-char model names and resolves multi-version quirks (`qwen2.5` vs `qwen2`, `phi-3.5` vs `phi-3`) by longest-substring-first iteration so an added family can never accidentally make a more-specific entry dead code; `compute_warmup_steps` clamps to `[10, 1000]` with a `ratio==0.0` short-circuit matching HF Trainer's "no warmup" convention; `SpikeRecoveryStrategy` is `@dataclass(frozen=True)` (post-construction mutation cannot bypass validation), `max_attempts ∈ [1, 10]`, `lr_decay ∈ (0, 1)`, `min_lr > 0`; cross-validator `_validate_spike_recovery_requires_watchdog` rejects `loss_spike_recovery=true, loss_watchdog=false` at config-load (fails fast instead of never triggering); `convergence_window ∈ [5, 10_000]`, `convergence_rel_tol ∈ (0, 1]`, `recommend_action` reuses `detect_plateau` so plateau heuristic stays single-source-of-truth; `GradAccumMonitor.recommend()` caps doubled `accum` at `MAX_ACCUM=1024` so a runaway advisory loop cannot blow up DataLoader prefetch; `generate_config` validates BOTH the YAML output path AND the embedded `decisions["output"]` field via `is_under_cwd` (closes the gap where a crafted `decisions["output"]="../../etc"` would have silently propagated into the rendered YAML)
 - **v0.34.0 — Observability & Dev UX**: `.crash` bundle generator (`utils/crash.py`) recursively redacts `hf_*` / `sk-*` / `Bearer …` token-shaped strings in any captured `config` and metric tail before serialisation, so a `.crash` file shared on a public GitHub issue cannot leak credentials; `output_dir` is reduced to `os.path.basename` so `$HOME` doesn't leak; `write_crash_bundle` uses `os.path.realpath + commonpath` for cwd containment (Windows-safe; raises `ValueError` not `PermissionError` so callers cannot silently swallow with `except OSError`); filename appends `secrets.token_hex(4)` so two crashes in the same UTC second don't collide; bundle truncated to `MAX_BUNDLE_BYTES=1_000_000`. `train.py` crash-write surfaces failures to the user (no silent missing-bundle). `profiling.py` `resolve_trace_path` rejects empty / `.` / `..` / `/` / `\\` / null-byte `run_id` (closes the `output_dir/profiles/../trace.json` escape) and uses `os.path.realpath + is_under_cwd`; profiles dir is created only on successful torch import (no stale empty dirs on torch-less CI). `tracker.get_run` LIKE-prefix match escapes `%` / `_` / `\\` and uses `ESCAPE '\\'` so a crafted `run_id` cannot widen the match (mirrors v0.26.0 registry policy). Lazy schema migration (`_ensure_schema`) tolerates the "duplicate column" race when two CLI processes start simultaneously on a fresh DB (fork-based multi-GPU training, TUI auto-refresh). `runs.py show/replay/clean` switched user `run_id` rendering to `markup_escape` and switched `clean` containment from broken `Path.resolve() + relative_to()` to project-standard `os.path.realpath + is_under_cwd`. `tui_app.py` lazy-imports `ExperimentTracker` and `markup_escape`s every DB-sourced string before passing into Textual widgets so a crafted base_model / experiment_name cannot inject `[bold red]…[/]` markup. 
`run_cost.estimate_run_cost_usd` rejects `bool` in `num_gpus` (bool is a subclass of int — same defence as v0.30.0 `Candidate.__post_init__`); duration clamped to `[0, 1 year]`; unknown GPU returns `None` so callers render `—` instead of fabricating `$0.00`. `log_level.parse_log_level` rejects non-string + null-byte input.
 - **v0.33.0 — Live Wire**: RLVR `code_exec_reward` adds OS-level isolation (Linux best-effort `os.unshare(CLONE_NEWUSER|CLONE_NEWNET|CLONE_NEWPID)`, macOS `sandbox-exec` with default-deny `MACOS_SANDBOX_PROFILE` narrowed to a 3-name `mach-lookup` allowlist to prevent DNS / NSURLSession bypass of `(deny network*)`); `prune_checkpoints` switches to TOCTOU-safe `os.lstat + S_ISLNK` + `shutil.rmtree(onerror=_abort_on_symlink)` so a symlink encountered mid-walk aborts rather than escapes; `run_gate` wraps each task scorer in a typed `try/except` so backend failures produce `score=None, error=str(exc)` (never silent `score=1.0`); `_parse_judge_url` removes the bare `http://` catch-all (defence-in-depth after the Pydantic GateTask validator); `soup can run` requires `--yes` or explicit consent callback and raises `ValueError` (not `PermissionError`, which is an `OSError` subclass that broad `except` blocks would swallow); GGUF `rglob` result for ollama deploy is `realpath+commonpath` checked against extract_dir (prevents symlink escape from a crafted can); `DeployTarget.path` validator normalises mixed `\\`/`/` separators before splitting (closes a Windows `..` bypass); `CAN_FORMAT_VERSION` 1→2 (additive — v1 still loads); `soup can publish` validates `repo_id` via `utils/hf.validate_repo_id`, resolves token via `resolve_token`, sanitises commit messages (first-line, 200-char cap), uses HTTPS-only HfApi; `_write_spike_recovery_hint` adds `is_under_cwd` containment check on `args.output_dir` from raw HF `TrainingArguments`; `lookup_entry_by_output_dir` emits `ResourceWarning` when 1000-row scan limit is hit (no silent miss); `CrossDocCollator` no longer mutates input feature dicts (HF Dataset rows are cached and reused — mutation broke subsequent batches); `Candidate` rejects `bool` in `score`/`latency_ms` (was sneaking past `int` isinstance check); `evaluate_candidate` latency mean now divides by *completed* prompts (excludes crashed) so a broken candidate isn't 
artificially fast; `auto_quant.run_auto_quant_picker` soft-falls-back to highest-scored candidate when no candidate clears `min_score` (server still binds); `build_logits_processors` returns `[]` when neither `outlines` nor `lm-format-enforcer` is installed (server degrades to free-form rather than 500); MII server uses loopback-only CORS, max_tokens cap [1, 16384], stream rejection, generic 500 with no stack-trace leak; `os.execvp` auto-reexec uses list args (no shell), all forwarded flags pre-validated; `cleanup_extract_dir` uses `os.path.commonpath` (Windows-safe) instead of `startswith`; `_run_subprocess` catches `TimeoutExpired` and returns rc=124 (coreutils convention) instead of an unhandled traceback; new `eval_results` and `tensorrt` artifact kinds in `RegistryStore._VALID_KINDS`
+- **v0.53.4 — Long Context + Architecture**: six closes covering LongLoRA hardening, LLaMA Pro live wiring, and a CUDA-OOM-hint UX upgrade. (#11 OOM hint) `format_friendly_error` upgrades the CUDA-OOM and `OutOfMemoryError` patterns to point users at the explicit `--batch-size <half>` / `--grad-accum <double>` CLI flags before the legacy `quantization: 4bit` fallback — closes #11 with no functional change to the security surface. (#122 FlashAttention v3 incompatibility) New `soup_cli/utils/flash_attn.is_flash_attn_v3_available() -> bool` is a defensive probe (never raises, False on missing `flash_attn` / non-string `__version__` / unparseable / major < 3). `validate_longlora_compat` calls it AFTER the existing task / backend / architecture / ring-attention checks so the FA-v3 error only surfaces on otherwise-valid LongLoRA configs (avoids spurious confusion on unrelated misconfig). The check is loaded via a function-scoped import to keep `validate_longlora_compat` import-cheap and avoid CUDA-side effects at config load time on machines without `flash_attn` installed. (#120 LongLoRA arch allowlist) `soup_cli/utils/longlora.py` ships three new word-boundary regex helpers (`is_mistral_model`, `is_qwen_model`, `is_phi_model`) — same regex policy as v0.39.0 `is_gemma4_model` (rejects substring matches like `"my-mistralish-finetune"` or `"unmistral-7b"`). Shared `_check_model_name` input guard rejects `bool` BEFORE the `isinstance(str)` check (because bool is a subclass of int and would otherwise fall through silently — matches v0.53.3 `is_known_vlm_base` policy), rejects null bytes via explicit substring check, and returns `None` (→ helper returns False) for inputs >512 chars (avoids ReDoS-style overhead on adversarial input). New `is_supported_longlora_arch(model_name: object) -> bool` is the union accessor with defensive non-string surface (returns False rather than propagating TypeError, matches v0.53.3 / v0.52.0 model-detection policy). 
`validate_longlora_compat` also gained per-call null-byte rejection + bool/non-string TypeError on `task` and `backend` (matches v0.50.0 `validate_long_context_grpo_compat`); new `_truncate_for_message(value, limit=64)` helper bounds the `base` echo in error messages (security-review MEDIUM fix mirroring v0.53.3 `validate_vision_grpo_compat` redaction — defends against adversarial / long bases bloating stderr + log files). Mixtral is INTENTIONALLY excluded from the allowlist — regex matches `mistral` as a word-boundary token, NOT `mixtral`; documented at the docstring so a future contributor adding Mixtral support adds it explicitly. (#121 Llama 3.1 RoPE auto-detect) `apply_long_context_config` extended with `rope_scaling_type=None` auto-detect path — reads `model_config.rope_scaling` and runs `detect_llama3_rope_in_config` (v0.49.0 Part D helper) on it. If the existing block declares `llama3` (either via the legacy `type` key OR the newer `rope_type` alias), the auto-detect picks `"llama3"` + the upstream `LLAMA3_DEFAULT_*` constants; otherwise falls back to `"dynamic"`. Explicit caller pick still wins (any non-None value). Back-compat preserved by keeping the legacy default kwarg `rope_scaling_type="dynamic"`. The detect helper rejects non-Mapping config input via `TypeError` (no SSRF / file-read risk — the function is pure-Python data inspection). (#83 LLaMA Pro live block expansion) `soup_cli/utils/block_expansion.expand_model_blocks` lifts the v0.41.0 Part C `NotImplementedError` stub with a real implementation: clones the last `min(num_new_blocks, original_count)` decoder blocks via `copy.deepcopy` (full independent storage — no shared buffers), zero-inits each clone's residual projections (`mlp.down_proj.weight + bias` and `self_attn.o_proj.weight + bias`) so the appended block initially acts as identity per the LLaMA Pro paper §3.1, appends to `model.model.layers`, and updates `model.config.num_hidden_layers`. 
Validates `num_new_blocks` via `validate_expand_layers` (bool-guard + `[1, 64]`) BEFORE any model mutation. `_get_layers_module` uses explicit `is None` check (not falsy shortcut) to defend against `nn.Module.__bool__` overrides on subclasses (code-review HIGH fix). `_zero_init_block_residual` returns `bool` and the caller emits `warnings.warn` when neither standard projection path matches the cloned block (non-Llama-shaped arch — security-review LOW fix surfaces silent-degradation to operators training on Falcon-style models). Over-expansion silently clamps to `min(n, original_count)` rather than raising — matches the project's defensive-fallback policy for advisory operations. New `apply_llama_pro_freeze(model, num_new_blocks) -> int` is the canonical "train only new blocks" companion (global `requires_grad=False` pass, then unfreeze the tail N blocks; returns trainable parameter count). New shared helper `apply_block_expansion_if_configured(model, tcfg, console)` centralises the "if `expand_layers` is set, expand + optionally freeze + print" sequence — used identically by SFT and Pretrain trainers (matches v0.40.6 `peft_wiring` centralisation policy; defends against drift between trainer call sites which would otherwise produce subtle inconsistent behaviour). (#74 HF push surface QA) Manual QA of `soup push`, `soup train --push-as`, `soup data push`, `soup deploy hf-space` deferred to a contributor with private HF credentials — entry recorded in `tests/qa/v053_qa.md` with the full test plan + acceptance criteria. The HF push security surface (repo_id validation, token resolution, commit message sanitization, model card injection defence, Space template containment) is unchanged from v0.29.0 / v0.40.2 and remains covered by `test_hf_integration.py` + `test_v0402_part_a.py`. Test surface: 1 new test file (`tests/test_v0534.py`) carrying 49 new tests + 7 net updates to v0.49.0 / v0.41.0 / v0.10.x regression tests. 
Known limitations: (1) LongLoRA S² forward override still deferred to v0.49.1 — schema gate hardened, live monkeypatch is the next deliverable. (2) Mixtral excluded from LongLoRA allowlist (MoE attention forward signature differs). (3) Block-expansion zero-init covers Llama-shaped blocks only — non-standard arches still get appended + trainable, but lose the LLaMA Pro identity-init guarantee (and emit a runtime warning). (4) Llama 3.1 RoPE auto-detect only fires when caller passes `rope_scaling_type=None` (explicit pick wins). (5) #74 live QA against a private HF repo is the v0.53.5+ follow-up. (v0.53.4)
+- **v0.53.3 — GRPO Plus partial wiring (#128 grpo_fp16, #129 vision-VLM probe)**: lifts two surgical v0.50.0 GRPO Plus deferred stubs while keeping the project's hardening invariants; the four larger items (#127 stability callback, #123 6 GRPO variant loss kernels, #126 PRMTrainerWrapper, #68 multi-objective preference live combine) are scope-deferred to v0.53.4. (#128 grpo_fp16 routing) New `_validate_grpo_fp16_amp_exclusive` SoupConfig cross-validator rejects the silent-mutex combo `grpo_fp16=True + auto_mixed_precision=True` at config load — both flags pick the mixed-precision dtype via different codepaths; combining them is a footgun where downstream behaviour depends on validator execution order. Cross-validator short-circuits when `task != 'grpo'` so the v0.50.0 stability task-gate diagnosis fires first (keeps the most actionable error at the front; code-review HIGH fix). New `GRPOTrainerWrapper._build_precision_kwargs(self) -> dict[str, bool]` returns the `{fp16, bf16}` HF kwargs per `(device, grpo_fp16)` matrix: non-CUDA (CPU / MPS / XPU) → both False (HF Trainer's fp16/bf16 kwargs are CUDA-specific, MPS / XPU use their own mixed-precision paths), CUDA + `grpo_fp16=True` → `fp16=True, bf16=False` (unsloth parity), default CUDA → `fp16=False, bf16=True` (legacy v0.50.0 path). Direct attribute access on `self.config.training.grpo_fp16` (no `getattr` fallback — Pydantic-guaranteed field). (#129 vision-GRPO base probe) New `soup_cli/utils/prm.KNOWN_VLM_REGEX` compiled regex with 10 word-boundary alternatives covering Qwen2-VL / Qwen2.5-VL / QVQ / Pixtral / InternVL / InternVL2_5 / InternVL3 / Llama-3.2-Vision (any size via `[a-z0-9._-]*vision` glob) / LLaVA / MiniCPM-V / Idefics / ShareGPT4V / Fuyu. Word-boundary idiom `(?:^|[^a-z0-9])…(?:[^a-z0-9]|$)` mirrors v0.39.0 `is_gemma4_model` / v0.44.0 `is_llama4_model` / v0.49.0 `is_llama_model` policy — rejects substring noise like `"my-pixtralish"`. 
New `is_known_vlm_base(name: object) -> bool` is defensive — returns False (never raises) on non-string / bool / empty / null-byte / `>_MAX_BASE_NAME_LEN=512`. Extended `validate_vision_grpo_compat` with optional `base: str | None = None` kwarg — `None` / empty-string skips the probe (back-compat for legacy v0.50.0 Part E callers); non-empty-non-VLM raises `ValueError` with friendly message naming the expected families (Qwen2-VL / Pixtral / InternVL / Llama-3.2-Vision / LLaVA / MiniCPM-V). Error message **truncates the echoed `base` to 64 chars** before serialisation (security-review MEDIUM fix mirroring v0.34.0 `crash.py` `output_dir` basename policy — defends against adversarial / long bases bloating error logs and from leaking unredacted user input into operator-facing tracebacks). `_validate_vision_grpo` in SoupConfig threads `base=self.base` so a YAML pairing `vision_grpo: true` with a non-VLM checkpoint is rejected at schema-load instead of surfacing as a cryptic `"module has no attribute 'vision_tower'"` runtime error. Test surface: 1 new test file (`test_v0533.py`) carrying 37 new tests covering: every `_build_precision_kwargs` matrix cell (CUDA + grpo_fp16 / default CUDA / CPU / MPS), every cross-validator branch (mutex rejection / task-gate priority / both-off pass), every regex alternative (Qwen2-VL / Pixtral / QVQ / Llama-3.2-Vision variants / negative matches), every defensive guard (bool / non-string / null-byte / 512-byte boundary), error-message truncation (security-review M regression), and end-to-end YAML load (happy + reject). Known limitations: (1) Scope-deferred — 4 larger v0.53.3 items moved to v0.53.4 because each requires deep TRL subclassing and warrants its own focused release; the v0.40.x stub-then-live cadence shipped 5 patch releases over 6 weeks, mirroring that here. (2) VLM allowlist is static name-regex only; a legitimate VLM published under an org whose checkpoint name lacks any of those tokens (e.g. 
a custom internal fork) is rejected at schema-load and operators must omit `vision_grpo: true` until a future release adds a runtime `model.config.vision_config` probe. (3) `_build_precision_kwargs` is GRPO-only — other RL trainers (PPO / RewardModel) follow their existing mixed-precision conventions. (v0.53.3)
 - **v0.53.2 — Modality II live trainers**: lifts four v0.52.0 deferred stubs (#137, #135, #133, #132) into real trainer wrappers while keeping the project's hardening invariants. (#137 reasoning_effort + train_on_eot) `apply_reasoning_effort_prefix` follows v0.41.0 / v0.51.0 validator policy (bool-first, null-byte / empty / oversize / case-insensitive normalisation); messages list is treated as immutable (returns a new list — matches v0.33.0 #47 `CrossDocCollator` policy). `build_assistant_only_labels(train_on_eot=True)` reuses the existing v0.36.0 mask infrastructure — same null-byte / max_length / bool guards. (#135 EBFT / GDPO) `apply_ebft_loss` and `apply_gdpo_loss` enforce **finite-only inputs** (`torch.isfinite` guard on tensor inputs + `math.isfinite` on scalar params) — NaN / Inf would silently corrupt training otherwise. `dpo_margin` defaults to `None` (not `0.0`) per security-review M3 fix: silent zeroing in the `margin` variant when the operator forgot to set the margin would have looked like training success but produced a meaningless gradient. Both attach hooks (`attach_ebft_compute_loss`, `attach_gdpo_compute_loss`) are **idempotent** via a marker attribute on the wrapped method — re-attach is a no-op and a dedicated test class verifies the invariant (code-review M2 fix). (#133 DistillTrainerWrapper) **Separate trust_remote_code resolution for student and teacher** (security-review L2 fix): `model_requires_trust_remote_code(teacher)` runs independently of the student probe, otherwise a malicious teacher could piggy-back on the student's opt-in. Teacher is loaded with `device_map="cpu" if device == "cpu" else "auto"`, frozen via `requires_grad_(False)` + `.eval()` immediately after load — never participates in gradient computation. 
`_DistillTrainer.compute_loss` device-bridge: `teacher_device = next(teacher_ref.parameters()).device`, `teacher_inputs.to(teacher_device)` before teacher forward, `teacher_logits.to(student_logits.device)` before KL kernel — defends against HF Trainer's auto-CUDA promotion silently producing cross-device `index_select` crashes. **DataCollator correctness fix** (surfaced during Wave 3 CPU smoke): `DataCollatorForLanguageModeling` does NOT pad pre-tokenised `labels` — switched to `DataCollatorForSeq2Seq(label_pad_token_id=-100, padding=True)` so variable-length loss-masked rows batch correctly without runtime crash. (#132 ClassifierTrainerWrapper) `_normalise_label` caps multi-label entries at **1024 per row** (matches v0.52.0 schema cap; security-review HIGH fix — unbounded would allow OOM via crafted JSONL), dedups via set conversion, validates `label_names` map entries reject null bytes + empty strings. `problem_type` is set explicitly from `tcfg.classifier_kind` (not silently inferred from labels) so a multi-label-shaped row in a single-label config raises rather than mis-trains. Training Setup Panel renders `Head: num_labels=N, kind=...` for classifier-family tasks instead of meaningless LoRA r/alpha lines (code-review L3 cosmetic fix — Panel no longer mis-represents what the wrapper is doing). (Cross-cutting) `commands/train.py` task routing branches added for `distill` and `classifier` / `reranker` / `cross_encoder` — source-grep regression guards in the test suite use the **full instantiation expression** `DistillTrainerWrapper(cfg, **trainer_kwargs)` so comment-only mentions of the class name cannot satisfy the regression check (TDD-review hardening). Both new factories (`build_distill_trainer`, `build_classifier_trainer`) reject unknown kwargs via Python signature contract — dedicated `pytest.raises(TypeError)` tests cover the path (TDD-review L1 fix). Test surface: 1 new test file (`test_v0532.py`) carrying 120 new tests across 14 classes. 
Known limitations: (1) `#71` TinyLlama-1.1B-LoRA full ONNX export is host-RAM-bound (≥16 GB free RAM needed for the `onnx.load(load_external_data=True)` post-process step); tiny-gpt2 smoke proves pipeline integrity — recorded in `tests/qa/v053_qa.md`. (2) Distillation supports same-tokenizer pairs only — cross-tokenizer (Llama → Qwen) needs a projection or sequence-level loss, out of scope. (3) Classifier wrapper has no LoRA path — full head + base training; LoRA classifier finetuning is a follow-up. (4) EBFT / GDPO auto-attach only fires when the corresponding `*_variant` field is set; manual `attach_*` invocation from custom training loops is supported and idempotent. (5) `reasoning_effort` injection happens at data-prep time inside `build_format_row`; changing the level between runs requires re-rendering the dataset. (v0.53.2)
 - **v0.53.1 — Quant Menu II + Export pipeline live**: lifts six v0.53.0 deferred stubs to live wiring while keeping the project's hardening invariants. New shared helper `soup_cli/utils/paths.enforce_under_cwd_and_no_symlink` consolidates the v0.33.0 #22 TOCTOU pattern (cwd containment via `os.path.realpath + os.path.commonpath` + `os.lstat + S_ISLNK` rejection) — used by `commands/merge.py`, `commands/export.py`, `utils/save_formats.py`, and `utils/gguf_quant.py` so the same boundary check fires at every CLI dispatch point. `merge_4bit` and `export_torchao` (`utils/save_formats.py`): cwd containment + symlink rejection on `merged_dir` / `model_dir` / `output_dir`; `load_quant_config` enforces `yaml.safe_load` only + 256 KB cap + extension allowlist (`.yaml`/`.yml`); **per-scheme closed kwarg allowlist** rejects dunder keys + unknown params before the splat into `torchao.<scheme>Config(**kwargs)` (security-review HIGH fix — `Int4WeightOnly` accepts `{group_size, inner_k_tiles}`, `NVFP4` accepts nothing extra). Corrected BNB-4bit skip-modules kwarg name from `llm_int8_skip_modules` to `bnb_4bit_skip_modules`. `export_advanced_gguf` (`utils/gguf_quant.py`): all three subprocess invocations (`convert_hf_to_gguf.py`, `llama-imatrix`, `llama-quantize`) use argv-list form with no shell, 30-min timeout, `sys.executable` for the convert script; `_run_convert_to_f16` realpath-verifies that `convert_hf_to_gguf.py` stays inside the `llama_cpp_dir` after resolution (security-review HIGH M5 fix — defends against a symlinked script escape). `_prepare_calibration_text` strips null bytes, collapses newlines to spaces, caps per-line at 8 KB + total at 50 MB (security-review M1), uses POSIX `O_NOFOLLOW` to refuse symlinks at the kernel level (security-review M3 — closes the TOCTOU window between the dispatch-time check and the actual `open()`); requires ≥ 1 usable row before invoking imatrix. 
`_safe_stderr` Rich-markup-escapes subprocess stderr before embedding in `RuntimeError` (security-review L4) so a crafted llama.cpp error cannot inject `[red]...[/]` into the operator-facing panel. UD-prefix stripped from flavour arg before passing to llama-quantize (`UD-Q4_K_XL` → `Q4_K_XL`). Calibration data path containment + symlink rejection fires at CLI dispatch in `commands/export.py::_export_gguf_advanced`. `detect_prequantized_format_from_path` (`autopilot/decisions.py`): cwd containment + `os.lstat + S_ISLNK` on `<model_dir>/config.json` (security-review HIGH H2 — out-of-cwd model paths silently return `None` to preserve soft-probe semantics so HF Hub repo IDs aren't rejected); null-byte rejection on `model_dir`. `commands/merge.py`: early `is_under_cwd(output)` check at CLI boundary (security-review M4) — consistent with the v0.20.0 / v0.40.2 containment-at-the-boundary policy. `deploy_measure.py`: cache file written atomically via `tempfile.mkstemp` + `os.replace` with `os.lstat + S_ISLNK` rejection on BOTH `load_cache` and `save_cache` (security-review M2 — was missing on the load side); env override `SOUP_DEPLOY_AUTOPILOT_CACHE` rejects null bytes + control chars before any path resolution and confines the override to home / cwd / tempdir; cache file gets best-effort 0o600 perms on POSIX (matches v0.26.0 registry.db policy); 1 MB cache-file cap. `_DEPLOY_MEASURE_BEFORE_GEN` / `_AFTER_FACTORY` module-level callables are documented as a non-public escape hatch (deferred until v0.46.1 live model-loader). Test surface: 4 new test files (`test_v0531_82.py` / `test_v0531_109.py` / `test_v0531_139.py` / `test_v0531_142.py`) carrying 112 new tests covering happy paths + failure modes + every security guard (POSIX symlink rejection, per-scheme kwarg allowlist, TOCTOU defences, `_MAX_CANDIDATES` cap, MINOR-verdict band, mxfp4 word boundary, BNB-alias detection, render-table markup escape). 
Known limitations: (1) `_DEPLOY_MEASURE_BEFORE_GEN` / `_AFTER_FACTORY` are a stop-gap until v0.46.1 ships first-party transformers / vLLM generator factories. (2) `#70` GGUF and `#72` AWQ/GPTQ manual QA smokes remain pending — require CUDA + llama.cpp build; recipes scripted in `tests/qa/v053_qa.md`. (3) BNB-4bit merge + TorchAO PTQ live happy-path is mock-covered only — CPU-only CI cannot execute the real BNB / torchao kernels. (4) `_prepare_calibration_text` accepts JSONL with `text` / `prompt` / `content` aliases + raw text fallback; other formats (parquet / markdown) are out of scope. (5) Cache key truncates `base_sha` to 16 hex chars at the call site (collision probability ≈ 1-in-2³² across ~4 billion entries). (6) Pre-quantized detection is heuristic — name regex + local `config.json` probe; HF Hub repo IDs without local download fall back to name-only matching. (7) `enforce_under_cwd_and_no_symlink` checks only the leaf path; deeper traversal relies on the per-file leaf check at each site. (v0.53.1)
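
The Llama 3.1 RoPE auto-detect rule from the v0.53.4 entry (#121) can be sketched as a small resolver. This is an illustration under stated assumptions, not the package's code: the names `declares_llama3_rope` and `resolve_rope_scaling_type` are hypothetical, and treating a missing (`None`) `rope_scaling` block as "not llama3" is an assumption the changelog does not spell out.

```python
from collections.abc import Mapping
from typing import Optional


def declares_llama3_rope(rope_scaling: object) -> bool:
    """Return True when an existing rope_scaling block declares llama3.

    Hypothetical stand-in for the detect helper: either the legacy
    ``type`` key or the newer ``rope_type`` alias may carry the value.
    """
    if not isinstance(rope_scaling, Mapping):
        raise TypeError("rope_scaling must be a mapping")
    return any(rope_scaling.get(key) == "llama3" for key in ("rope_type", "type"))


def resolve_rope_scaling_type(
    rope_scaling: object,
    requested: Optional[str] = "dynamic",  # legacy default kept for back-compat
) -> str:
    """Pick the scaling type: an explicit (non-None) request always wins."""
    if requested is not None:
        return requested
    # Assumption: a model config without a rope_scaling block is not llama3.
    if rope_scaling is None:
        return "dynamic"
    return "llama3" if declares_llama3_rope(rope_scaling) else "dynamic"
```

Only a caller that passes `None` opts in to detection; a caller that omits the argument keeps the legacy `"dynamic"` default, which is the back-compat behaviour the entry describes.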

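
The shared containment check from the v0.53.1 entry combines three standard-library calls named there: `os.path.realpath`, `os.path.commonpath`, and `os.lstat` with `stat.S_ISLNK`. A minimal sketch follows, assuming a hypothetical function name and a hypothetical `root` parameter (the changelog says the real helper checks against the current working directory). As the entry's own limitation (7) notes, only the leaf path is inspected.

```python
import os
import stat
from typing import Optional


def check_contained_no_symlink(path: str, root: Optional[str] = None) -> str:
    """Return the resolved path, or raise ValueError if it is unsafe.

    ``root`` is an illustrative parameter; it defaults to the current
    working directory.
    """
    base = os.path.realpath(root if root is not None else os.getcwd())
    try:
        # lstat inspects the leaf itself and does not follow a symlink.
        if stat.S_ISLNK(os.lstat(path).st_mode):
            raise ValueError(f"refusing symlink: {path!r}")
    except FileNotFoundError:
        pass  # a not-yet-created output path has no leaf to inspect
    resolved = os.path.realpath(path)
    # commonpath equals base only when resolved lives under base.
    if os.path.commonpath([base, resolved]) != base:
        raise ValueError(f"path escapes {base!r}: {path!r}")
    return resolved
```

The check runs at dispatch time, so a window remains between it and the later `open()`. The changelog closes that window separately for calibration files by opening with `O_NOFOLLOW`.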
{soup_cli-0.53.2 → soup_cli-0.53.4}/pyproject.toml RENAMED Viewed

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "soup-cli"
-version = "0.53.2"
+version = "0.53.4"
 description = "Fine-tune LLMs in one command. No SSH, no config hell."
 readme = "README.md"
 license = "Apache-2.0"

{soup_cli-0.53.2 → soup_cli-0.53.4}/soup_cli/__init__.py RENAMED Viewed

@@ -1,3 +1,3 @@
 """Soup CLI — Fine-tune LLMs in one command."""
-__version__ = "0.53.2"
+__version__ = "0.53.4"

{soup_cli-0.53.2 → soup_cli-0.53.4}/soup_cli/config/schema.py RENAMED Viewed

@@ -2512,6 +2512,7 @@ class SoupConfig(BaseModel):
                 task=self.task,
                 modality=self.modality,
                 backend=self.backend,
+                base=self.base,  # v0.53.3 #129 — name-regex VLM probe
             )
         except ValueError as exc:
             raise ValueError(str(exc)) from exc
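
The `base=self.base` argument threaded in above feeds the name-regex probe described in the v0.53.3 changelog entry. The word-boundary idiom and the defensive guards can be sketched as below. This is an illustration, not the package's code: the pattern lists only three families instead of the ten alternatives the changelog mentions, and `looks_like_vlm_base`, `truncate_for_message`, and `check_vision_base` are hypothetical names.

```python
import re

MAX_BASE_NAME_LEN = 512  # cap quoted in the changelog

# Illustrative subset only. A match must be delimited by a non-alphanumeric
# character (or the string edge) on both sides.
_VLM_RE = re.compile(
    r"(?:^|[^a-z0-9])(?:qwen2-vl|pixtral|llava)(?:[^a-z0-9]|$)"
)


def looks_like_vlm_base(name: object) -> bool:
    """Return False (never raise) for anything that is not a plausible VLM name."""
    if not isinstance(name, str):  # also covers bool and None
        return False
    if not name or "\x00" in name or len(name) > MAX_BASE_NAME_LEN:
        return False
    return _VLM_RE.search(name.lower()) is not None


def truncate_for_message(value: str, limit: int = 64) -> str:
    """Bound how much caller-supplied text is echoed into an error message."""
    return value[:limit]


def check_vision_base(base) -> None:
    """Raise ValueError for a non-empty base that does not look like a VLM."""
    if not base:  # None or "" skips the probe (back-compat path)
        return
    if not looks_like_vlm_base(base):
        raise ValueError(
            "vision_grpo expects a vision-language base, got "
            f"{truncate_for_message(base)!r}"
        )
```

With this pattern `"mistralai/Pixtral-12B-2409"` matches because `pixtral` is bounded by `/` and `-`, while `"my-pixtralish"` does not because the token is followed by a letter.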
@@ -2557,6 +2558,35 @@ class SoupConfig(BaseModel):
             )
         return self

+    @model_validator(mode="after")
+    def _validate_grpo_fp16_amp_exclusive(self) -> "SoupConfig":
+        """v0.53.3 #128 — ``grpo_fp16`` and ``auto_mixed_precision`` are
+        mutually exclusive.
+
+        Both flags pick the mixed-precision dtype but go through different
+        codepaths (``grpo_fp16`` forces ``fp16=True, bf16=False`` on
+        GRPOConfig directly; ``auto_mixed_precision`` runs the v0.32.0
+        per-model + per-GPU picker). Combining them is a footgun where the
+        downstream behaviour depends on order-of-evaluation — fail fast at
+        config-load with a friendly message naming both flags so the user
+        picks one.
+        """
+        # Short-circuit when task is not 'grpo' so the v0.50.0 stability
+        # task-gate error fires first (code-review HIGH fix — keeps a
+        # consistent "wrong-task" diagnosis ahead of the mutual-exclusion
+        # one, regardless of validator execution order).
+        if self.task != "grpo":
+            return self
+        if self.training.grpo_fp16 and self.training.auto_mixed_precision:
+            raise ValueError(
+                "grpo_fp16=True and auto_mixed_precision=True are mutually "
+                "exclusive — both pick the mixed-precision dtype but go "
+                "through different codepaths. Pick one: grpo_fp16 forces "
+                "FP16 (unsloth parity), auto_mixed_precision uses the "
+                "v0.32.0 per-GPU picker."
+            )
+        return self
+
     @model_validator(mode="after")
     def _validate_hub_supported(self) -> "SoupConfig":
         """v0.51.0 Part E — ``hub`` other than ``hf`` requires a non-mlx
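
To see which flag combinations the new `_validate_grpo_fp16_amp_exclusive` validator rejects, the same two checks can be run over every combination outside Pydantic. This is a sketch with hypothetical names: the free function below mirrors the added method's logic but is not part of the package.

```python
from itertools import product


def check_grpo_fp16_amp_exclusive(
    task: str, grpo_fp16: bool, auto_mixed_precision: bool
) -> None:
    """Standalone mirror of the validator body shown in the hunk above."""
    # Task gate first: non-GRPO configs are left to other validators.
    if task != "grpo":
        return
    if grpo_fp16 and auto_mixed_precision:
        raise ValueError(
            "grpo_fp16=True and auto_mixed_precision=True are mutually exclusive"
        )


def rejected_combinations() -> list:
    """Enumerate every (task, grpo_fp16, auto_mixed_precision) that raises."""
    rejected = []
    for task, fp16, amp in product(("grpo", "sft"), (False, True), (False, True)):
        try:
            check_grpo_fp16_amp_exclusive(task, fp16, amp)
        except ValueError:
            rejected.append((task, fp16, amp))
    return rejected


print(rejected_combinations())
```

Only the GRPO case with both flags set is rejected here. The same pair under another task passes this check, so the task-gate error from the earlier validator is the one the user sees.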

{soup_cli-0.53.2 → soup_cli-0.53.4}/soup_cli/trainer/grpo.py RENAMED Viewed

@@ -54,6 +54,32 @@ class GRPOTrainerWrapper:
         self.tokenizer = None
         self.trainer = None

+    def _build_precision_kwargs(self) -> dict[str, bool]:
+        """Resolve fp16/bf16 kwargs for GRPOConfig (v0.53.3 #128).
+
+        Priority:
+        - Non-CUDA device (CPU / MPS / XPU) → no mixed precision (both
+          False). HF Trainer's fp16/bf16 kwargs are CUDA-specific; non-CUDA
+          backends must use their own mixed-precision path (MPS Metal,
+          XPU IPEX). Documented explicitly so future MPS work doesn't
+          regress this branch silently.
+        - ``grpo_fp16=True`` (CUDA) → ``fp16=True, bf16=False`` (unsloth
+          parity).
+        - Default CUDA → ``fp16=False, bf16=True`` (legacy v0.50.0 path).
+
+        ``auto_mixed_precision`` is mutually exclusive with ``grpo_fp16``
+        (rejected at schema load via ``_validate_grpo_fp16_amp_exclusive``);
+        when only ``auto_mixed_precision`` is set, the v0.32.0 picker runs
+        elsewhere in the training loop and overrides this default.
+        """
+        if self.device != "cuda":
+            return {"fp16": False, "bf16": False}
+        # grpo_fp16 is a Pydantic field with default=False; direct attribute
+        # access (no getattr fallback) so a typo would fail loudly.
+        if self.config.training.grpo_fp16:
+            return {"fp16": True, "bf16": False}
+        return {"fp16": False, "bf16": True}
+
     def setup(self, dataset: dict):
         """Load model, tokenizer, apply LoRA, create GRPO trainer."""
         from datasets import Dataset
@@ -166,7 +192,7 @@ class GRPOTrainerWrapper:
             "logging_steps": tcfg.logging_steps,
             "save_steps": tcfg.save_steps,
             "save_total_limit": 3,
-            "bf16": self.device == "cuda",
+            **self._build_precision_kwargs(),
             "report_to": self.report_to,
             "remove_unused_columns": False,
             "deepspeed": self.deepspeed_config,
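
The `(device, grpo_fp16)` matrix behind `_build_precision_kwargs` can be tabulated with a standalone copy of its logic. The free function below is a hypothetical stand-in that takes the two inputs as arguments instead of reading them from `self`.

```python
def build_precision_kwargs(device: str, grpo_fp16: bool) -> dict:
    """Standalone mirror of the method added in the hunk above."""
    if device != "cuda":
        # CPU / MPS / XPU: neither Trainer flag is set.
        return {"fp16": False, "bf16": False}
    if grpo_fp16:
        return {"fp16": True, "bf16": False}
    return {"fp16": False, "bf16": True}


# Print one row per matrix cell.
for device in ("cuda", "cpu", "mps", "xpu"):
    for flag in (False, True):
        print(f"{device:5} grpo_fp16={flag!s:5} -> {build_precision_kwargs(device, flag)}")
```

With `grpo_fp16` left at its default of `False`, the `bf16` value equals `device == "cuda"`, which is what the removed line computed. Existing configs therefore keep their previous precision.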

{soup_cli-0.53.2 → soup_cli-0.53.4}/soup_cli/trainer/pretrain.py RENAMED Viewed

@@ -264,6 +264,13 @@ class PretrainTrainerWrapper:
         if tcfg.quantization in ("4bit", "8bit", "mxfp4"):
             self.model = prepare_model_for_kbit_training(self.model)

+        # v0.53.4 #83 — LLaMA Pro block expansion (centralised — see SFT).
+        from soup_cli.utils.block_expansion import (
+            apply_block_expansion_if_configured,
+        )
+
+        apply_block_expansion_if_configured(self.model, tcfg, console)
+
         # LoRA — with MoE-aware target modules if moe_lora is enabled
         target_modules = tcfg.lora.target_modules
         if target_modules == "auto":

{soup_cli-0.53.2 → soup_cli-0.53.4}/soup_cli/trainer/sft.py RENAMED Viewed

@@ -491,6 +491,17 @@ class SFTTrainerWrapper:
                 f"[green]Freeze training:[/] {frozen} parameters frozen"
             )

+        # v0.53.4 #83 — LLaMA Pro block expansion. Run BEFORE LoRA so PEFT's
+        # target-module matcher sees the new blocks. Centralised in
+        # ``block_expansion.apply_block_expansion_if_configured`` to avoid
+        # drift between SFT and Pretrain trainers (matches v0.40.6 peft_wiring
+        # centralisation policy).
+        from soup_cli.utils.block_expansion import (
+            apply_block_expansion_if_configured,
+        )
+
+        apply_block_expansion_if_configured(self.model, tcfg, console)
+
         # LoRA — with MoE-aware target modules if moe_lora is enabled
         target_modules = tcfg.lora.target_modules
         if target_modules == "auto":
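
Both trainers now call the same expansion helper. The changelog says it clones the tail blocks, zero-initialises their residual projections so each clone starts as an identity, and can freeze everything except the new blocks. The toy below illustrates those steps with plain Python floats instead of tensors. `ToyBlock` and the three function names are hypothetical stand-ins, and the arithmetic in `forward` is arbitrary. It is not the package's PyTorch implementation.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyBlock:
    """Scalar stand-in for a decoder block with two residual branches."""

    o_proj: float      # stands in for self_attn.o_proj.weight
    down_proj: float   # stands in for mlp.down_proj.weight
    trainable: bool = True

    def forward(self, h: float) -> float:
        h = h + self.o_proj * (2.0 * h)      # attention residual branch
        return h + self.down_proj * (h * h)  # MLP residual branch


def expand_blocks(layers: list, num_new_blocks: int) -> int:
    """Append zero-initialised clones of the tail blocks; return how many."""
    # Validate before touching the model (bool is rejected explicitly).
    if isinstance(num_new_blocks, bool) or not isinstance(num_new_blocks, int):
        raise TypeError("num_new_blocks must be an int")
    if not 1 <= num_new_blocks <= 64:
        raise ValueError("num_new_blocks must be in [1, 64]")
    count = min(num_new_blocks, len(layers))  # over-expansion clamps
    clones = [copy.deepcopy(block) for block in layers[-count:]]
    for clone in clones:
        clone.o_proj = 0.0     # zeroed residual projections make the
        clone.down_proj = 0.0  # clone an identity map at initialisation
    layers.extend(clones)
    return count


def freeze_all_but_tail(layers: list, num_new_blocks: int) -> int:
    """Freeze every block, then unfreeze the last ones; return that count."""
    for block in layers:
        block.trainable = False
    tail = layers[-num_new_blocks:]
    for block in tail:
        block.trainable = True
    return len(tail)


def run(layers: list, h: float) -> float:
    """Pass an activation through every block in order."""
    for block in layers:
        h = block.forward(h)
    return h
```

Because both residual branches of a clone are scaled by zero, `run` returns the same value before and after `expand_blocks`. That is the identity-init property the changelog attributes to the LLaMA Pro recipe, and training then moves the new blocks away from identity.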
