@synsci/cli-darwin-x64 1.1.70 → 1.1.72
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/citation-management/SKILL.md +1109 -0
- package/bin/skills/citation-management/assets/bibtex_template.bib +264 -0
- package/bin/skills/citation-management/assets/citation_checklist.md +386 -0
- package/bin/skills/citation-management/references/bibtex_formatting.md +908 -0
- package/bin/skills/citation-management/references/citation_validation.md +794 -0
- package/bin/skills/citation-management/references/google_scholar_search.md +725 -0
- package/bin/skills/citation-management/references/metadata_extraction.md +870 -0
- package/bin/skills/citation-management/references/pubmed_search.md +839 -0
- package/bin/skills/citation-management/scripts/doi_to_bibtex.py +182 -0
- package/bin/skills/citation-management/scripts/extract_metadata.py +570 -0
- package/bin/skills/citation-management/scripts/format_bibtex.py +349 -0
- package/bin/skills/citation-management/scripts/search_google_scholar.py +251 -0
- package/bin/skills/citation-management/scripts/search_pubmed.py +348 -0
- package/bin/skills/citation-management/scripts/validate_citations.py +494 -0
- package/bin/skills/clinical-decision-support/README.md +129 -0
- package/bin/skills/clinical-decision-support/SKILL.md +506 -0
- package/bin/skills/clinical-decision-support/assets/biomarker_report_template.tex +380 -0
- package/bin/skills/clinical-decision-support/assets/clinical_pathway_template.tex +222 -0
- package/bin/skills/clinical-decision-support/assets/cohort_analysis_template.tex +359 -0
- package/bin/skills/clinical-decision-support/assets/color_schemes.tex +149 -0
- package/bin/skills/clinical-decision-support/assets/example_gbm_cohort.md +208 -0
- package/bin/skills/clinical-decision-support/assets/recommendation_strength_guide.md +328 -0
- package/bin/skills/clinical-decision-support/assets/treatment_recommendation_template.tex +529 -0
- package/bin/skills/clinical-decision-support/references/biomarker_classification.md +719 -0
- package/bin/skills/clinical-decision-support/references/clinical_decision_algorithms.md +604 -0
- package/bin/skills/clinical-decision-support/references/evidence_synthesis.md +840 -0
- package/bin/skills/clinical-decision-support/references/outcome_analysis.md +640 -0
- package/bin/skills/clinical-decision-support/references/patient_cohort_analysis.md +427 -0
- package/bin/skills/clinical-decision-support/references/treatment_recommendations.md +521 -0
- package/bin/skills/clinical-decision-support/scripts/biomarker_classifier.py +383 -0
- package/bin/skills/clinical-decision-support/scripts/build_decision_tree.py +417 -0
- package/bin/skills/clinical-decision-support/scripts/create_cohort_tables.py +509 -0
- package/bin/skills/clinical-decision-support/scripts/generate_survival_analysis.py +441 -0
- package/bin/skills/clinical-decision-support/scripts/validate_cds_document.py +326 -0
- package/bin/skills/clinical-reports/IMPLEMENTATION_SUMMARY.md +641 -0
- package/bin/skills/clinical-reports/README.md +236 -0
- package/bin/skills/clinical-reports/SKILL.md +1127 -0
- package/bin/skills/clinical-reports/assets/case_report_template.md +352 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_csr_template.md +353 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_sae_template.md +359 -0
- package/bin/skills/clinical-reports/assets/consult_note_template.md +305 -0
- package/bin/skills/clinical-reports/assets/discharge_summary_template.md +453 -0
- package/bin/skills/clinical-reports/assets/hipaa_compliance_checklist.md +395 -0
- package/bin/skills/clinical-reports/assets/history_physical_template.md +305 -0
- package/bin/skills/clinical-reports/assets/lab_report_template.md +309 -0
- package/bin/skills/clinical-reports/assets/pathology_report_template.md +249 -0
- package/bin/skills/clinical-reports/assets/quality_checklist.md +338 -0
- package/bin/skills/clinical-reports/assets/radiology_report_template.md +318 -0
- package/bin/skills/clinical-reports/assets/soap_note_template.md +253 -0
- package/bin/skills/clinical-reports/references/case_report_guidelines.md +570 -0
- package/bin/skills/clinical-reports/references/clinical_trial_reporting.md +693 -0
- package/bin/skills/clinical-reports/references/data_presentation.md +530 -0
- package/bin/skills/clinical-reports/references/diagnostic_reports_standards.md +629 -0
- package/bin/skills/clinical-reports/references/medical_terminology.md +588 -0
- package/bin/skills/clinical-reports/references/patient_documentation.md +744 -0
- package/bin/skills/clinical-reports/references/peer_review_standards.md +585 -0
- package/bin/skills/clinical-reports/references/regulatory_compliance.md +577 -0
- package/bin/skills/clinical-reports/scripts/check_deidentification.py +332 -0
- package/bin/skills/clinical-reports/scripts/compliance_checker.py +78 -0
- package/bin/skills/clinical-reports/scripts/extract_clinical_data.py +97 -0
- package/bin/skills/clinical-reports/scripts/format_adverse_events.py +97 -0
- package/bin/skills/clinical-reports/scripts/generate_report_template.py +149 -0
- package/bin/skills/clinical-reports/scripts/terminology_validator.py +126 -0
- package/bin/skills/clinical-reports/scripts/validate_case_report.py +323 -0
- package/bin/skills/clinical-reports/scripts/validate_trial_report.py +88 -0
- package/bin/skills/fireworks-ai/SKILL.md +665 -0
- package/bin/skills/generate-image/SKILL.md +178 -0
- package/bin/skills/generate-image/scripts/generate_image.py +254 -0
- package/bin/skills/groq/SKILL.md +347 -0
- package/bin/skills/hypothesis-generation/SKILL.md +293 -0
- package/bin/skills/hypothesis-generation/assets/FORMATTING_GUIDE.md +672 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_generation.sty +307 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_report_template.tex +572 -0
- package/bin/skills/hypothesis-generation/references/experimental_design_patterns.md +329 -0
- package/bin/skills/hypothesis-generation/references/hypothesis_quality_criteria.md +198 -0
- package/bin/skills/hypothesis-generation/references/literature_search_strategies.md +622 -0
- package/bin/skills/latex-posters/README.md +417 -0
- package/bin/skills/latex-posters/SKILL.md +1602 -0
- package/bin/skills/latex-posters/assets/baposter_template.tex +257 -0
- package/bin/skills/latex-posters/assets/beamerposter_template.tex +244 -0
- package/bin/skills/latex-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/latex-posters/assets/tikzposter_template.tex +251 -0
- package/bin/skills/latex-posters/references/latex_poster_packages.md +745 -0
- package/bin/skills/latex-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/latex-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/latex-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/latex-posters/scripts/review_poster.sh +214 -0
- package/bin/skills/literature-review/SKILL.md +641 -0
- package/bin/skills/literature-review/assets/review_template.md +412 -0
- package/bin/skills/literature-review/references/citation_styles.md +166 -0
- package/bin/skills/literature-review/references/database_strategies.md +455 -0
- package/bin/skills/literature-review/scripts/generate_pdf.py +184 -0
- package/bin/skills/literature-review/scripts/search_databases.py +310 -0
- package/bin/skills/literature-review/scripts/verify_citations.py +218 -0
- package/bin/skills/market-research-reports/SKILL.md +904 -0
- package/bin/skills/market-research-reports/assets/FORMATTING_GUIDE.md +428 -0
- package/bin/skills/market-research-reports/assets/market_report_template.tex +1380 -0
- package/bin/skills/market-research-reports/assets/market_research.sty +564 -0
- package/bin/skills/market-research-reports/references/data_analysis_patterns.md +548 -0
- package/bin/skills/market-research-reports/references/report_structure_guide.md +999 -0
- package/bin/skills/market-research-reports/references/visual_generation_guide.md +1077 -0
- package/bin/skills/market-research-reports/scripts/generate_market_visuals.py +472 -0
- package/bin/skills/markitdown/INSTALLATION_GUIDE.md +318 -0
- package/bin/skills/markitdown/LICENSE.txt +22 -0
- package/bin/skills/markitdown/OPENROUTER_INTEGRATION.md +359 -0
- package/bin/skills/markitdown/QUICK_REFERENCE.md +309 -0
- package/bin/skills/markitdown/README.md +184 -0
- package/bin/skills/markitdown/SKILL.md +486 -0
- package/bin/skills/markitdown/SKILL_SUMMARY.md +307 -0
- package/bin/skills/markitdown/assets/example_usage.md +463 -0
- package/bin/skills/markitdown/references/api_reference.md +399 -0
- package/bin/skills/markitdown/references/file_formats.md +542 -0
- package/bin/skills/markitdown/scripts/batch_convert.py +195 -0
- package/bin/skills/markitdown/scripts/convert_literature.py +262 -0
- package/bin/skills/markitdown/scripts/convert_with_ai.py +224 -0
- package/bin/skills/ml-paper-writing/SKILL.md +937 -0
- package/bin/skills/ml-paper-writing/references/checklists.md +361 -0
- package/bin/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- package/bin/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- package/bin/skills/ml-paper-writing/references/sources.md +159 -0
- package/bin/skills/ml-paper-writing/references/writing-guide.md +476 -0
- package/bin/skills/ml-paper-writing/templates/README.md +251 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- package/bin/skills/ml-paper-writing/templates/acl/README.md +50 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- package/bin/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- package/bin/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- package/bin/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- package/bin/skills/paper-2-web/SKILL.md +491 -0
- package/bin/skills/paper-2-web/references/installation.md +141 -0
- package/bin/skills/paper-2-web/references/paper2poster.md +346 -0
- package/bin/skills/paper-2-web/references/paper2video.md +305 -0
- package/bin/skills/paper-2-web/references/paper2web.md +187 -0
- package/bin/skills/paper-2-web/references/usage_examples.md +436 -0
- package/bin/skills/peer-review/SKILL.md +702 -0
- package/bin/skills/peer-review/references/calibration_guidelines.md +196 -0
- package/bin/skills/peer-review/references/common_issues.md +552 -0
- package/bin/skills/peer-review/references/paper_mechanics.md +269 -0
- package/bin/skills/peer-review/references/reporting_standards.md +290 -0
- package/bin/skills/peer-review/references/scoring_rubric.md +239 -0
- package/bin/skills/pptx-posters/SKILL.md +410 -0
- package/bin/skills/pptx-posters/assets/poster_html_template.html +257 -0
- package/bin/skills/pptx-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/pptx-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/pptx-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/pptx-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/research-grants/README.md +285 -0
- package/bin/skills/research-grants/SKILL.md +938 -0
- package/bin/skills/research-grants/assets/budget_justification_template.md +453 -0
- package/bin/skills/research-grants/assets/nih_specific_aims_template.md +166 -0
- package/bin/skills/research-grants/assets/nsf_project_summary_template.md +92 -0
- package/bin/skills/research-grants/references/broader_impacts.md +392 -0
- package/bin/skills/research-grants/references/darpa_guidelines.md +636 -0
- package/bin/skills/research-grants/references/doe_guidelines.md +586 -0
- package/bin/skills/research-grants/references/nih_guidelines.md +851 -0
- package/bin/skills/research-grants/references/nsf_guidelines.md +570 -0
- package/bin/skills/research-grants/references/specific_aims_guide.md +458 -0
- package/bin/skills/research-lookup/README.md +156 -0
- package/bin/skills/research-lookup/SKILL.md +606 -0
- package/bin/skills/research-lookup/examples.py +174 -0
- package/bin/skills/research-lookup/lookup.py +187 -0
- package/bin/skills/research-lookup/research_lookup.py +483 -0
- package/bin/skills/research-lookup/scripts/research_lookup.py +483 -0
- package/bin/skills/scholar-evaluation/SKILL.md +289 -0
- package/bin/skills/scholar-evaluation/references/evaluation_framework.md +663 -0
- package/bin/skills/scholar-evaluation/scripts/calculate_scores.py +366 -0
- package/bin/skills/scientific-critical-thinking/SKILL.md +566 -0
- package/bin/skills/scientific-critical-thinking/references/common_biases.md +364 -0
- package/bin/skills/scientific-critical-thinking/references/evidence_hierarchy.md +484 -0
- package/bin/skills/scientific-critical-thinking/references/experimental_design.md +496 -0
- package/bin/skills/scientific-critical-thinking/references/logical_fallacies.md +478 -0
- package/bin/skills/scientific-critical-thinking/references/scientific_method.md +169 -0
- package/bin/skills/scientific-critical-thinking/references/statistical_pitfalls.md +506 -0
- package/bin/skills/scientific-schematics/QUICK_REFERENCE.md +207 -0
- package/bin/skills/scientific-schematics/README.md +327 -0
- package/bin/skills/scientific-schematics/SKILL.md +615 -0
- package/bin/skills/scientific-schematics/example_usage.sh +89 -0
- package/bin/skills/scientific-schematics/references/best_practices.md +559 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic.py +135 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic_ai.py +807 -0
- package/bin/skills/scientific-schematics/test_ai_generation.py +243 -0
- package/bin/skills/scientific-slides/SKILL.md +942 -0
- package/bin/skills/scientific-slides/assets/timing_guidelines.md +597 -0
- package/bin/skills/scientific-slides/references/data_visualization_slides.md +708 -0
- package/bin/skills/scientific-slides/references/presentation_structure.md +642 -0
- package/bin/skills/scientific-slides/references/slide_design_principles.md +849 -0
- package/bin/skills/scientific-slides/references/talk_types_guide.md +687 -0
- package/bin/skills/scientific-slides/references/visual_review_workflow.md +775 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image.py +143 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image_ai.py +748 -0
- package/bin/skills/scientific-slides/scripts/pdf_to_images.py +201 -0
- package/bin/skills/scientific-slides/scripts/slides_to_pdf.py +220 -0
- package/bin/skills/scientific-slides/scripts/validate_presentation.py +367 -0
- package/bin/skills/scientific-writing/SKILL.md +714 -0
- package/bin/skills/scientific-writing/assets/REPORT_FORMATTING_GUIDE.md +574 -0
- package/bin/skills/scientific-writing/assets/scientific_report.sty +606 -0
- package/bin/skills/scientific-writing/assets/scientific_report_template.tex +449 -0
- package/bin/skills/scientific-writing/references/citation_styles.md +720 -0
- package/bin/skills/scientific-writing/references/figures_tables.md +806 -0
- package/bin/skills/scientific-writing/references/imrad_structure.md +686 -0
- package/bin/skills/scientific-writing/references/professional_report_formatting.md +664 -0
- package/bin/skills/scientific-writing/references/reporting_guidelines.md +748 -0
- package/bin/skills/scientific-writing/references/writing_principles.md +824 -0
- package/bin/skills/tinker/SKILL.md +2 -3
- package/bin/skills/together-ai/SKILL.md +722 -0
- package/bin/skills/treatment-plans/README.md +488 -0
- package/bin/skills/treatment-plans/SKILL.md +1579 -0
- package/bin/skills/treatment-plans/assets/STYLING_QUICK_REFERENCE.md +185 -0
- package/bin/skills/treatment-plans/assets/chronic_disease_management_plan.tex +665 -0
- package/bin/skills/treatment-plans/assets/general_medical_treatment_plan.tex +547 -0
- package/bin/skills/treatment-plans/assets/medical_treatment_plan.sty +222 -0
- package/bin/skills/treatment-plans/assets/mental_health_treatment_plan.tex +774 -0
- package/bin/skills/treatment-plans/assets/one_page_treatment_plan.tex +193 -0
- package/bin/skills/treatment-plans/assets/pain_management_plan.tex +799 -0
- package/bin/skills/treatment-plans/assets/perioperative_care_plan.tex +753 -0
- package/bin/skills/treatment-plans/assets/quality_checklist.md +471 -0
- package/bin/skills/treatment-plans/assets/rehabilitation_treatment_plan.tex +756 -0
- package/bin/skills/treatment-plans/references/goal_setting_frameworks.md +411 -0
- package/bin/skills/treatment-plans/references/intervention_guidelines.md +507 -0
- package/bin/skills/treatment-plans/references/regulatory_compliance.md +476 -0
- package/bin/skills/treatment-plans/references/specialty_specific_guidelines.md +655 -0
- package/bin/skills/treatment-plans/references/treatment_plan_standards.md +485 -0
- package/bin/skills/treatment-plans/scripts/check_completeness.py +322 -0
- package/bin/skills/treatment-plans/scripts/generate_template.py +233 -0
- package/bin/skills/treatment-plans/scripts/timeline_generator.py +385 -0
- package/bin/skills/treatment-plans/scripts/validate_treatment_plan.py +369 -0
- package/bin/skills/unsloth/SKILL.md +565 -47
- package/bin/skills/unsloth/docs/advanced-rl.md +222 -0
- package/bin/skills/unsloth/docs/chat-templates.md +141 -0
- package/bin/skills/unsloth/docs/datasets.md +489 -0
- package/bin/skills/unsloth/docs/docker-extended.md +99 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-2.0.md +116 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-aider.md +118 -0
- package/bin/skills/unsloth/docs/faq.md +91 -0
- package/bin/skills/unsloth/docs/fp16-vs-bf16.md +61 -0
- package/bin/skills/unsloth/docs/fp8-rl.md +224 -0
- package/bin/skills/unsloth/docs/glm-4.7-flash.md +997 -0
- package/bin/skills/unsloth/docs/inference-deployment-overview.md +17 -0
- package/bin/skills/unsloth/docs/inference.md +27 -0
- package/bin/skills/unsloth/docs/installation-docker.md +155 -0
- package/bin/skills/unsloth/docs/installation-pip.md +148 -0
- package/bin/skills/unsloth/docs/kernels-packing.md +190 -0
- package/bin/skills/unsloth/docs/kimi-k2.5.md +634 -0
- package/bin/skills/unsloth/docs/lm-studio.md +235 -0
- package/bin/skills/unsloth/docs/lora-hot-swapping.md +75 -0
- package/bin/skills/unsloth/docs/lora-hyperparameters.md +363 -0
- package/bin/skills/unsloth/docs/memory-efficient-rl.md +267 -0
- package/bin/skills/unsloth/docs/model-selection.md +70 -0
- package/bin/skills/unsloth/docs/models.md +532 -0
- package/bin/skills/unsloth/docs/multi-gpu-ddp.md +90 -0
- package/bin/skills/unsloth/docs/notebooks.md +223 -0
- package/bin/skills/unsloth/docs/overview.md +110 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next-extended.md +900 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next.md +900 -0
- package/bin/skills/unsloth/docs/requirements.md +45 -0
- package/bin/skills/unsloth/docs/reward-hacking.md +25 -0
- package/bin/skills/unsloth/docs/saving-to-gguf.md +138 -0
- package/bin/skills/unsloth/docs/saving-to-ollama.md +46 -0
- package/bin/skills/unsloth/docs/sglang-guide.md +278 -0
- package/bin/skills/unsloth/docs/speculative-decoding.md +70 -0
- package/bin/skills/unsloth/docs/tool-calling.md +334 -0
- package/bin/skills/unsloth/docs/troubleshooting-faq.md +204 -0
- package/bin/skills/unsloth/docs/troubleshooting-inference.md +26 -0
- package/bin/skills/unsloth/docs/tts-fine-tuning.md +149 -0
- package/bin/skills/unsloth/docs/tutorial-grpo.md +273 -0
- package/bin/skills/unsloth/docs/tutorial-llama3-ollama.md +356 -0
- package/bin/skills/unsloth/docs/vision-fine-tuning.md +135 -0
- package/bin/skills/unsloth/docs/vision-rl.md +170 -0
- package/bin/skills/unsloth/docs/vllm-engine-arguments.md +43 -0
- package/bin/skills/unsloth/docs/vllm-guide.md +98 -0
- package/bin/skills/venue-templates/SKILL.md +686 -0
- package/bin/skills/venue-templates/assets/examples/cell_summary_example.md +247 -0
- package/bin/skills/venue-templates/assets/examples/medical_structured_abstract.md +313 -0
- package/bin/skills/venue-templates/assets/examples/nature_abstract_examples.md +213 -0
- package/bin/skills/venue-templates/assets/examples/neurips_introduction_example.md +245 -0
- package/bin/skills/venue-templates/assets/grants/nih_specific_aims.tex +235 -0
- package/bin/skills/venue-templates/assets/grants/nsf_proposal_template.tex +375 -0
- package/bin/skills/venue-templates/assets/journals/nature_article.tex +171 -0
- package/bin/skills/venue-templates/assets/journals/neurips_article.tex +283 -0
- package/bin/skills/venue-templates/assets/journals/plos_one.tex +317 -0
- package/bin/skills/venue-templates/assets/posters/beamerposter_academic.tex +311 -0
- package/bin/skills/venue-templates/references/cell_press_style.md +483 -0
- package/bin/skills/venue-templates/references/conferences_formatting.md +564 -0
- package/bin/skills/venue-templates/references/cs_conference_style.md +463 -0
- package/bin/skills/venue-templates/references/grants_requirements.md +787 -0
- package/bin/skills/venue-templates/references/journals_formatting.md +486 -0
- package/bin/skills/venue-templates/references/medical_journal_styles.md +535 -0
- package/bin/skills/venue-templates/references/ml_conference_style.md +556 -0
- package/bin/skills/venue-templates/references/nature_science_style.md +405 -0
- package/bin/skills/venue-templates/references/posters_guidelines.md +628 -0
- package/bin/skills/venue-templates/references/reviewer_expectations.md +417 -0
- package/bin/skills/venue-templates/references/venue_writing_styles.md +321 -0
- package/bin/skills/venue-templates/scripts/customize_template.py +195 -0
- package/bin/skills/venue-templates/scripts/query_template.py +266 -0
- package/bin/skills/venue-templates/scripts/validate_format.py +250 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/unsloth/references/index.md +0 -7
- package/bin/skills/unsloth/references/llms-full.md +0 -16799
- package/bin/skills/unsloth/references/llms-txt.md +0 -12044
- package/bin/skills/unsloth/references/llms.md +0 -82
package/bin/skills/unsloth/docs/inference-deployment-overview.md
@@ -0,0 +1,17 @@

# Inference & Deployment

You can also run your fine-tuned models by using [Unsloth's 2x faster inference](https://unsloth.ai/docs/basics/inference-and-deployment/unsloth-inference).

## Deployment Options

- [llama.cpp - Saving to GGUF](saving-to-gguf.md)
- [vLLM](vllm-guide.md)
- [Ollama](saving-to-ollama.md)
- [LM Studio](lm-studio.md)
- [SGLang](sglang-guide.md)
- [Unsloth Inference](inference.md)
- [Troubleshooting](troubleshooting-inference.md)
- [llama-server & OpenAI endpoint](llama-server-and-openai-endpoint.md)
- [vLLM Engine Arguments](vllm-engine-arguments.md)
- [LoRA Hotswapping](lora-hot-swapping-guide.md)
- [Tool Calling](tool-calling-guide-for-local-llms.md)
package/bin/skills/unsloth/docs/inference.md
@@ -0,0 +1,27 @@

# Unsloth Inference

Unsloth natively supports 2x faster inference. For our inference-only notebook, click [here](https://colab.research.google.com/drive/1aqlNQi7MMJbynFDyOQteD2t0yVfjb9Zh?usp=sharing).

All QLoRA, LoRA and non-LoRA inference paths are 2x faster, with no code changes or new dependencies required.

```python
from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)
```
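The snippet above assumes `inputs` already holds a tokenized prompt. A minimal sketch of how that prompt is typically built beforehand; the Alpaca-style template and the example instruction are illustrative assumptions, not part of this package:

```python
# Hypothetical Alpaca-style prompt template, similar to the ones used in
# Unsloth's example notebooks (illustrative, not the package's own template).
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

# Leave the response slot empty so the model fills it in during generation.
prompt = alpaca_prompt.format("Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8", "")

# With the model and tokenizer loaded as above (GPU required):
# inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
# _ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=64)
print(prompt)
```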

#### NotImplementedError: A UTF-8 locale is required. Got ANSI

Sometimes when you execute a cell, [this error](https://github.com/googlecolab/colabtools/issues/3409) can appear. To solve it, run the following in a new cell:

```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"
```
package/bin/skills/unsloth/docs/installation-docker.md
@@ -0,0 +1,155 @@

# Install Unsloth via Docker

Learn how to use our Docker containers with all dependencies pre-installed. No setup required; just run the container and start training!

Unsloth Docker image: [**`unsloth/unsloth`**](https://hub.docker.com/r/unsloth/unsloth)

{% hint style="success" %}
You can now use our main Docker image `unsloth/unsloth` for Blackwell and 50-series GPUs - no separate image needed.
{% endhint %}

### ⚡ Quickstart

{% stepper %}
{% step %}
**Install Docker and NVIDIA Container Toolkit.**

Install Docker via [Linux](https://docs.docker.com/engine/install/) or [Desktop](https://docs.docker.com/desktop/) (other platforms).\
Then install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation):

```bash
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get update && sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
```

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-41cae231ed4761f844ce9836e03b17aabd7c803c%2Fnvidia%20toolkit.png?alt=media" alt=""><figcaption></figcaption></figure>
{% endstep %}

{% step %}
**Run the container.**

[**`unsloth/unsloth`**](https://hub.docker.com/r/unsloth/unsloth) is Unsloth's only Docker image. For Blackwell and 50-series GPUs, use this same image - no separate one needed.

```bash
docker run -d -e JUPYTER_PASSWORD="mypassword" \
  -p 8888:8888 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-2b50d78c5d54eaf189c0a40d46c405585ea23082%2Fdocker%20run.png?alt=media" alt=""><figcaption></figcaption></figure>
{% endstep %}

{% step %}
**Access Jupyter Lab.**

Go to [http://localhost:8888](http://localhost:8888/) and open Unsloth.

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-828df0a668fd94025c1193c24a7f09c1d58dcbd8%2Fjupyter.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

Open the `unsloth-notebooks` tab to see Unsloth's notebooks.

<div><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-e7a3f620a3ec5bff335632ff9b0cb422f76528a1%2FScreenshot_from_2025-09-30_21-38-15.png?alt=media" alt=""><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-531882c33eb96dec24e2d7673471d6a3928a3951%2FScreenshot_from_2025-09-30_21-39-41.png?alt=media" alt=""><figcaption></figcaption></figure></div>
{% endstep %}

{% step %}
**Start training with Unsloth.**

If you're new, follow our step-by-step [Fine-tuning Guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide) or [RL Guide](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide), or just save/copy any of our premade [notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks).

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fgit-blob-665f900b008991ddcd8fdabb773b292de3c41e72%2FScreenshot_from_2025-09-30_21-40-29.png?alt=media" alt=""><figcaption></figcaption></figure>
{% endstep %}
{% endstepper %}
|
|
68
|
+
#### 📂 Container Structure
|
|
69
|
+
|
|
70
|
+
* `/workspace/work/` — Your mounted work directory
|
|
71
|
+
* `/workspace/unsloth-notebooks/` — Example fine-tuning notebooks
|
|
72
|
+
* `/home/unsloth/` — User home directory
|
|
73
|
+
|
|
74
|
+
### 📖 Usage Examples

#### Full Example

```bash
docker run -d -e JUPYTER_PORT=8000 \
  -e JUPYTER_PASSWORD="mypassword" \
  -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
  -e USER_PASSWORD="unsloth2024" \
  -p 8000:8000 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```
#### Setting up SSH Key

If you don't have an SSH key pair:

```bash
# Generate new key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/container_key

# Use the public key in docker run
-e "SSH_KEY=$(cat ~/.ssh/container_key.pub)"

# Connect via SSH
ssh -i ~/.ssh/container_key -p 2222 unsloth@localhost
```
### 🦥 Why Unsloth Containers?

* **Reliable**: Curated environment with stable & maintained package versions. Just 7 GB compressed (vs. 10–11 GB elsewhere)
* **Ready-to-use**: Pre-installed notebooks in `/workspace/unsloth-notebooks/`
* **Secure**: Runs safely as a non-root user
* **Universal**: Compatible with all transformer-based models (TTS, BERT, etc.)

### ⚙️ Advanced Settings
```bash
# Generate SSH key pair
ssh-keygen -t rsa -b 4096 -f ~/.ssh/container_key

# Connect to container
ssh -i ~/.ssh/container_key -p 2222 unsloth@localhost
```
| Variable | Description | Default |
| ------------------ | ---------------------------------- | --------- |
| `JUPYTER_PASSWORD` | Jupyter Lab password | `unsloth` |
| `JUPYTER_PORT` | Jupyter Lab port inside container | `8888` |
| `SSH_KEY` | SSH public key for authentication | `None` |
| `USER_PASSWORD` | Password for `unsloth` user (sudo) | `unsloth` |

```bash
-p <host_port>:<container_port>
```
* Jupyter Lab: `-p 8000:8888`
* SSH access: `-p 2222:22`

{% hint style="warning" %}
**Important**: Use volume mounts to preserve your work between container runs.
{% endhint %}

```bash
-v <local_folder>:<container_folder>
```
```bash
docker run -d -e JUPYTER_PORT=8000 \
  -e JUPYTER_PASSWORD="mypassword" \
  -e "SSH_KEY=$(cat ~/.ssh/container_key.pub)" \
  -e USER_PASSWORD="unsloth2024" \
  -p 8000:8000 -p 2222:22 \
  -v $(pwd)/work:/workspace/work \
  --gpus all \
  unsloth/unsloth
```
### **🔒 Security Notes**

* Container runs as non-root `unsloth` user by default
* Use `USER_PASSWORD` for sudo operations inside container
# Install Unsloth via pip and uv
## **Recommended installation method**

**Install with pip (recommended) for the latest pip release:**

```bash
pip install unsloth
```
To use **uv**:

```bash
pip install --upgrade pip && pip install uv
uv pip install unsloth
```

To install **vLLM and Unsloth** together, do:

```bash
uv pip install unsloth vllm
```
To install the **latest main branch** of Unsloth, do:

{% code overflow="wrap" %}

```bash
pip install unsloth
pip uninstall unsloth unsloth_zoo -y && pip install --no-deps git+https://github.com/unslothai/unsloth_zoo.git && pip install --no-deps git+https://github.com/unslothai/unsloth.git
```

{% endcode %}
For **venv and virtual environment installs**, use venv to isolate your installation so it doesn't break (or get broken by) your system packages:

{% code overflow="wrap" %}

```bash
apt install python3.10-venv python3.11-venv python3.12-venv python3.13-venv -y
python -m venv unsloth_env
source unsloth_env/bin/activate
pip install --upgrade pip && pip install uv
uv pip install unsloth
```

{% endcode %}

If you're installing Unsloth in Jupyter, Colab, or other notebooks, be sure to prefix the command with `!`. This isn't necessary when using a terminal.

{% hint style="info" %}
Python 3.13 is now supported!
{% endhint %}

## Uninstall or Reinstall

If you're still encountering dependency issues with Unsloth, many users have resolved them by force-uninstalling and reinstalling Unsloth:

{% code overflow="wrap" %}

```bash
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo
```

{% endcode %}

***
## Advanced Pip Installation

{% hint style="warning" %}
Do **NOT** use this if you have [Conda](https://unsloth.ai/docs/get-started/install/conda-install).
{% endhint %}

Pip is a bit more complex since there are dependency issues. The pip command differs for `torch 2.2, 2.3, 2.4, 2.5` and for different CUDA versions.

For other torch versions, we support `torch211`, `torch212`, `torch220`, `torch230` and `torch240`; for CUDA versions, we support `cu118`, `cu121` and `cu124`. For Ampere devices (A100, H100, RTX 3090) and above, use `cu118-ampere`, `cu121-ampere` or `cu124-ampere`.

For example, if you have `torch 2.4` and `CUDA 12.1`, use:

```bash
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
```

Another example, if you have `torch 2.5` and `CUDA 12.4`, use:

```bash
pip install --upgrade pip
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
```
And other examples:

```bash
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
```
Or, run the below in a terminal to get the **optimal** pip installation command:

```bash
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
```
Or, run the below manually in a Python REPL:

{% code overflow="wrap" %}

```python
# Licensed under the Apache License, Version 2.0 (the "License")
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
import re
v = V(re.match(r"[0-9\.]{3,}", torch.__version__).group(0))
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
USE_ABI = torch._C._GLIBCXX_USE_CXX11_ABI
if cuda not in ("11.8", "12.1", "12.4", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.5.1'): x = 'cu{}{}-torch250'
elif v <= V('2.5.1'): x = 'cu{}{}-torch251'
elif v < V('2.7.0'): x = 'cu{}{}-torch260'
elif v < V('2.7.9'): x = 'cu{}{}-torch270'
elif v < V('2.8.0'): x = 'cu{}{}-torch271'
elif v < V('2.8.9'): x = 'cu{}{}-torch280'
elif v < V('2.9.1'): x = 'cu{}{}-torch290'
elif v < V('2.9.2'): x = 'cu{}{}-torch291'
else: raise RuntimeError(f"Torch = {v} too new!")
if v > V('2.6.9') and cuda not in ("11.8", "12.6", "12.8", "13.0"): raise RuntimeError(f"CUDA = {cuda} not supported!")
x = x.format(cuda.replace(".", ""), "-ampere" if False else "")  # is_ampere is broken due to flash-attn
print(f'pip install --upgrade pip && pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git" --no-build-isolation')
```

{% endcode %}
# 3x Faster LLM Training with Unsloth Kernels + Packing
Unsloth now supports up to **5× faster** (typically 3x) training with our new custom **RoPE and MLP Triton kernels**, plus our new smart auto packing. Unsloth's new kernels and features not only increase training speed, but also further **reduce VRAM use (30% - 90%)** with no accuracy loss. [Unsloth GitHub](https://github.com/unslothai/unsloth)\
\
This means you can now train LLMs like [Qwen3](https://unsloth.ai/docs/models/qwen3-how-to-run-and-fine-tune)-4B not only on just **3GB VRAM**, but also 3x faster.

Our auto [**padding-free**](#padding-free-by-default) uncontaminated packing is enabled automatically for all training runs, with no code changes, and works with all fast attention backends (FlashAttention 3, xFormers, SDPA). [Benchmarks](#analysis-and-benchmarks) show training losses match non-packing runs **exactly**.

* **2.3x faster QK Rotary Embedding** fused Triton kernel with packing support
* Updated SwiGLU and GeGLU kernels with **int64 indexing for long context**
* **2.5x to 5x faster uncontaminated packing** with xFormers, SDPA, FA3 backends
* **2.1x faster padding-free, 50% less VRAM**, 0% accuracy change
* Unsloth also now has improved SFT loss stability and more predictable GPU utilization.
* This new upgrade works **for all training methods**, e.g. full fine-tuning, pretraining etc.

### :drum:Fused QK RoPE Triton Kernel with packing
Back in December 2023, we introduced a RoPE kernel written in Triton as part of our Unsloth launch. In March 2024, a community member made end-to-end training 1-2% faster by optimizing the RoPE kernel to launch one block for a group of heads. See [PR 238](https://github.com/unslothai/unsloth/pull/238).

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FewadBu05vK7zAmJRJcj6%2Frope_varlen_qk_rope_kernel_benchmark_v5.png?alt=media&token=04d277d4-c289-4943-9312-e3d3e2d60bec" alt="" width="563"><figcaption></figcaption></figure>

One issue is that each of Q and K launched 2 Triton kernels. We have now merged them into 1 Triton kernel, and enabled variable-length RoPE, which was imperative for padding-free and packing support. In micro benchmarks this makes the RoPE kernel **2.3x faster on longer context lengths**, and 1.9x faster on shorter context lengths.

We also eliminated all clones and contiguous transpose operations, so **RoPE is now fully inplace**, further reducing GPU memory. Note for the backward pass, we see that `sin1 = -sin1` since:
```
Q * cos + rotate_half(Q) * sin
    is equivalent to
Q * cos + Q @ R * sin
    where R is the rotation matrix [ 0, I]
                                   [-I, 0]
dL/dQ = dY * cos + dY @ R.T * sin
    where R.T is again the same    [ 0, -I]
    but the minus is transposed.   [ I, 0]
```
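
As a concrete illustration of the identity above, here is a toy pure-Python `rotate_half` (a sketch of the Hugging-Face-style helper, not Unsloth's actual kernel code):

```python
# Hypothetical toy rotate_half: maps [x1, x2] -> [-x2, x1],
# which is exactly multiplication by the rotation matrix R above.
def rotate_half(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])

print(rotate_half([1, 2, 3, 4]))  # [-3, -4, 1, 2]
```

Applying it twice negates the vector, consistent with R being a 90-degree rotation (R @ R = -I).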
### :railway\_car:Int64 Indexing for Triton Kernels

During 500K long-context training, which we introduced in [500k-context-length-fine-tuning](https://unsloth.ai/docs/blog/500k-context-length-fine-tuning "mention"), we would get CUDA out-of-bounds errors. This was because the MLP kernels for SwiGLU and GeGLU used int32 indexing, which is the default in Triton and CUDA.

We can't just do `tl.program_id(0).to(tl.int64)` since training would be slightly slower due to int64 indexing. We instead make this a `LONG_INDEXING: tl.constexpr` variable so the Triton compiler can specialize on it. This allows both shorter and longer context runs to run great!

{% code overflow="wrap" %}

```python
block_idx = tl.program_id(0)
if LONG_INDEXING:
    offsets = block_idx.to(tl.int64) * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE).to(tl.int64)
    n_elements = tl.cast(n_elements, tl.int64)
else:
    offsets = block_idx * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
```

{% endcode %}
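
To see why int32 offsets overflow, here is a quick back-of-the-envelope check; the intermediate size 14336 is an assumption (a Llama-3-8B-style MLP), not a value from the post:

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647: largest representable int32 offset

seq_len, intermediate = 500_000, 14336   # 500K-context MLP activation
n_elements = seq_len * intermediate      # 7_168_000_000 elements
print(n_elements > INT32_MAX)            # True -> int32 offsets overflow
```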
### :abacus:Why is padding needed & mathematical speedup

GPUs cannot process variable-length sequences directly, so we pad every sequence in a batch to the same length with 0s. This is wasteful. Assume a dataset of 50% short sequences of length S and 50% long sequences of length L; in the worst case, padding makes the token usage $\text{batchsize} \times L$, since the longest sequence length dominates.

By packing multiple examples into a single, long one-dimensional tensor, we can eliminate a significant amount of padding. In fact we get the below token usage:

$$
\text{Token Usage} = \frac{\text{batchsize}}{2}L+\frac{\text{batchsize}}{2}S
$$

By some math and algebra, we can work out the speedup via:

$$
\text{Speedup} = \frac{\text{batchsize} \times L}{\frac{\text{batchsize}}{2}L+\frac{\text{batchsize}}{2}S} = 2 \frac{L}{L + S}
$$

By assuming $S\rightarrow0$ we get a 2x theoretical speedup, since $2 \frac{L}{L + 0} = 2$.

By changing the 50/50 ratio and assuming we have MORE short sequences, e.g. 20% long sequences and 80% short sequences, we get $\frac{L}{0.2L + 0.8S}\rightarrow\frac{L}{0.2L}=5$, so 5x faster training! Packing's speedup therefore depends on how many short rows your dataset has (the shorter the rows, the bigger the speedup).
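
The arithmetic above can be checked with a tiny helper (the function and its arguments are illustrative, not an Unsloth API):

```python
def packing_speedup(frac_long, L, S):
    # Padded: every row in the batch is padded up to the longest length L.
    padded = L
    # Packed: each row contributes only its true length on average.
    packed = frac_long * L + (1.0 - frac_long) * S
    return padded / packed

print(packing_speedup(0.5, 1024, 0))  # 2.0 (50/50 split, S -> 0)
print(packing_speedup(0.2, 1024, 0))  # 5.0 (20% long, 80% short, S -> 0)
```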
### :clapper:Padding-Free by Default

In addition to the large throughput gains available when setting `packing = True` in your `SFTConfig`, we now **automatically use padding-free batching**. This reduces padding waste and increases tokens/s throughput, while producing the ***exact same loss*** as the previous version of Unsloth.
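
Conceptually, padding-free batching concatenates the batch into one 1-D token stream plus cumulative sequence-length offsets, the shape that varlen attention kernels such as FlashAttention consume. A minimal pure-Python sketch (the token ids are made up for illustration):

```python
seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # a batch of 3 ragged sequences

flat, cu_seqlens = [], [0]
for s in seqs:
    flat.extend(s)                             # no padding tokens at all
    cu_seqlens.append(cu_seqlens[-1] + len(s))

print(flat)        # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(cu_seqlens)  # [0, 3, 5, 9] -> sample i spans flat[cu[i]:cu[i+1]]
```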

For example, for Qwen3-8B and Qwen3-32B, we see memory usage decrease by 60%, training run 2x faster, and the exact same loss and grad-norm curves!
<div><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FPATEJoJwIotXNPsYT1hu%2FW%26B%20Chart%2010_12_2025%2C%203_57_51%20am.png?alt=media&token=e31ee2cd-cd6e-4fd2-9c59-7f2148179815" alt=""><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FjjnXdgPSgUxL9WNzx9wc%2FW%26B%20Chart%2010_12_2025%2C%203_58_19%20am.png?alt=media&token=54368c73-2ce1-4faa-a1f4-c82341638be3" alt=""><figcaption></figcaption></figure></div>
<div><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FA61fgCtUj0K9dhrHCt0C%2FW%26B%20Chart%2010_12_2025%2C%203_54_40%20am.png?alt=media&token=b8472635-4b05-430e-9df1-3820ed381c3f" alt="" width="563"><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FPh0Xfaup0CTz8REL6P9F%2FW%26B%20Chart%2010_12_2025%2C%203_56_38%20am.png?alt=media&token=86ed33c3-e5ac-4b71-82c3-f8f86ca79862" alt="" width="563"><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2F6FKM6pkRkQX3gzQLdcIP%2FW%26B%20Chart%2010_12_2025%2C%203_55_38%20am.png?alt=media&token=48d2c1d3-e6f2-420c-8209-70ef247ce63d" alt="" width="563"><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FxbhcPeELu78xh3M01xkf%2FW%26B%20Chart%2010_12_2025%2C%203_56_07%20am.png?alt=media&token=375c3f27-5af8-43e5-93eb-3818cb401f95" alt="" width="563"><figcaption></figcaption></figure></div>

### :spades:Uncontaminated Packing 2-5x faster training

Real datasets contain sequences of different lengths, so increasing the batch size to, say, 32 causes padding, making training slower and using more VRAM.

{% hint style="success" %}
In the past, increasing `batch_size` to large numbers (>32) would make training SLOWER, not faster, due to padding. We can now eliminate this issue via `packing = True`, so training is FASTER!
{% endhint %}

When we pack multiple samples into a single one-dimensional tensor, we keep sequence-length metadata around in order to properly mask samples, without leaking attention between samples. We also need the RoPE kernel described in [#fused-qk-rope-triton-kernel-with-packing](#fused-qk-rope-triton-kernel-with-packing "mention") to allow resetting position ids.

{% columns %}
{% column width="41.66666666666667%" %}

<div align="center" data-full-width="false"><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2F508zS4YN2sYnYjYkt8ej%2Fimage.png?alt=media&token=a05f917a-f593-4abd-a834-2f3f6652ca5a" alt="" width="563"><figcaption><p>4 examples without packing wastes space</p></figcaption></figure></div>

{% endcolumn %}

{% column width="58.33333333333333%" %}

<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2F8azEy9wbeF2RWSWbNdma%2Fimage.png?alt=media&token=a4b96567-244a-45ed-8b18-b16074bac88c" alt=""><figcaption><p>Uncontaminated packing creates correct attention pattern</p></figcaption></figure>
{% endcolumn %}
{% endcolumns %}
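
The masking described above can be sketched in a few lines of pure Python (variable names are ours, for illustration only): attention is permitted only within a sample, and causally within it:

```python
lengths = [3, 2, 4]            # lengths of the packed samples
total = sum(lengths)

# sample id of every position in the packed 1-D sequence
ids = [i for i, n in enumerate(lengths) for _ in range(n)]

# allowed[q][k]: query q may attend key k only inside its own sample,
# and only causally (k <= q) -> a block-diagonal causal mask
allowed = [[ids[q] == ids[k] and k <= q for k in range(total)]
           for q in range(total)]

print(allowed[3][2])  # False: sample 1 cannot see sample 0
print(allowed[4][3])  # True: causal attention within sample 1
```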
### :beach:Analysis and Benchmarks

To demonstrate the improvements when training with our new kernels and packed data, we ran fine-tuning runs with [Qwen3-32B](https://unsloth.ai/docs/models/qwen3-how-to-run-and-fine-tune), Qwen3-8B, and Llama 3 8B on the `yahma/alpaca-cleaned` dataset and measured various [training loss](#padding-free-by-default), throughput, and efficiency metrics. We compared our new runs against a standard optimized training run with our own kernels/optimizations turned on and kernels like Flash Attention 3 (FA3) enabled. We fixed `max_length = 1024` and varied the batch size in {1, 2, 4, 8, 16, 32}, letting the maximum token count per batch vary in {1024, 2048, 4096, 8192, 16K, 32K}.
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FFfmdok7AmeretPSGlZjg%2Fnew%20rope%20kernel%20graph.png?alt=media&token=d890fd95-c8c0-4817-9ee3-18e3095cde5f" alt="" width="563"><figcaption></figcaption></figure>

The above shows how tokens-per-second (tokens/s) training throughput varies with batch size for the new Unsloth. This translates into training an epoch of your dataset **1.7-3x faster (sometimes even 5x or more)**! The gains are more pronounced if your data contains many short sequences and if you have longer training runs, as described in [#why-is-padding-needed-and-mathematical-speedup](#why-is-padding-needed-and-mathematical-speedup "mention").
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FReViERxLWBHnT8GOv0ql%2Fpacking_efficiency_by_per_device_train_batch_size.png?alt=media&token=1c4a78c7-a611-4374-ac03-94aabb1d3184" alt="" width="563"><figcaption></figcaption></figure>

The above shows the average percentage of tokens per batch that are valid (i.e., non-padding). As the batch size grows, many more padding tokens appear in the unpacked case, while the packed case achieves high packing efficiency regardless of max sequence length.

Note that, since the batching logic trims batches to the maximum sequence length seen in the batch, when the batch size is 1 the unpacked data is all valid tokens (i.e., no padding). However, as more examples are added to the batch, padding increases on average, hitting nearly 50% padding at batch size 8! Our sample packing implementation eliminates that waste.
<div><figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FRfzNZVz9uzDPhEe3frGe%2Funknown.png?alt=media&token=e9fe893e-6b94-4c0d-b144-ef8315067c1e" alt="" width="563"><figcaption></figcaption></figure> <figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FTHtcpIdQ0z0mGRYYBfuF%2Funknown.png?alt=media&token=ec52b2c1-1e60-4ed2-969a-d760af26be2a" alt="" width="563"><figcaption></figcaption></figure></div>

The first graph (above) plots progress on `yahma/alpaca-cleaned` with `max_length = 2048`: new Unsloth with packing + kernels (maroon) vs. old Unsloth (gray). Both are trained with `max_steps = 500`, but we plot the x-axis in wall-clock time. Notice that in the packed case we train on nearly 40% of an epoch in the same number of steps (and only a bit more wall-clock time) that it takes to train less than 5% of an epoch in the unpacked case.

Similarly, the 2nd graph (above) plots loss from the same runs, this time with training steps on the x-axis. Notice that the losses match in scale and trend, but the loss in the packing case is less variable since the model sees more tokens per training step.

### :sparkles:How to enable packing?

**Update Unsloth first; padding-free is then enabled by default**! All training is immediately 1.1-2x faster with at least 30% less memory usage and zero change in the loss curve!

{% code overflow="wrap" %}

```bash
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth_zoo
```

{% endcode %}
We also support Flash Attention 3 (via xFormers), SDPA, and Flash Attention 2, and this works on old GPUs (Tesla T4, RTX 2080) as well as new GPUs like H100s, B200s etc! Sample packing works *regardless of the attention backend or model family*, so you keep the same speedups these fast attention implementations already provide!

If you want to enable explicit packing, add `packing = True` to enable up to 5x faster training!

{% hint style="warning" %}
Note `packing=True` will change the training loss, and the reported number of dataset rows will shrink, since multiple short sequences are packed into 1 sequence.

To keep training loss numbers unchanged, simply set `packing=False`; we will still enable auto padding-free, which already makes training faster!
{% endhint %}

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-14B",
)

trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        max_length = 4096,
        …,
        packing = True, # required to enable sample packing!
    ),
)
trainer.train()
```
All our notebooks are automatically faster (no need to do anything). See [unsloth-notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks "mention")

{% columns %}
{% column %}
Qwen3 14B faster:

{% embed url="https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb" %}
{% endcolumn %}

{% column %}
Llama 3.2 Conversational faster:

{% embed url="https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb" %}
{% endcolumn %}
{% endcolumns %}