@synsci/cli-darwin-arm64 1.1.70 → 1.1.72
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/citation-management/SKILL.md +1109 -0
- package/bin/skills/citation-management/assets/bibtex_template.bib +264 -0
- package/bin/skills/citation-management/assets/citation_checklist.md +386 -0
- package/bin/skills/citation-management/references/bibtex_formatting.md +908 -0
- package/bin/skills/citation-management/references/citation_validation.md +794 -0
- package/bin/skills/citation-management/references/google_scholar_search.md +725 -0
- package/bin/skills/citation-management/references/metadata_extraction.md +870 -0
- package/bin/skills/citation-management/references/pubmed_search.md +839 -0
- package/bin/skills/citation-management/scripts/doi_to_bibtex.py +182 -0
- package/bin/skills/citation-management/scripts/extract_metadata.py +570 -0
- package/bin/skills/citation-management/scripts/format_bibtex.py +349 -0
- package/bin/skills/citation-management/scripts/search_google_scholar.py +251 -0
- package/bin/skills/citation-management/scripts/search_pubmed.py +348 -0
- package/bin/skills/citation-management/scripts/validate_citations.py +494 -0
- package/bin/skills/clinical-decision-support/README.md +129 -0
- package/bin/skills/clinical-decision-support/SKILL.md +506 -0
- package/bin/skills/clinical-decision-support/assets/biomarker_report_template.tex +380 -0
- package/bin/skills/clinical-decision-support/assets/clinical_pathway_template.tex +222 -0
- package/bin/skills/clinical-decision-support/assets/cohort_analysis_template.tex +359 -0
- package/bin/skills/clinical-decision-support/assets/color_schemes.tex +149 -0
- package/bin/skills/clinical-decision-support/assets/example_gbm_cohort.md +208 -0
- package/bin/skills/clinical-decision-support/assets/recommendation_strength_guide.md +328 -0
- package/bin/skills/clinical-decision-support/assets/treatment_recommendation_template.tex +529 -0
- package/bin/skills/clinical-decision-support/references/biomarker_classification.md +719 -0
- package/bin/skills/clinical-decision-support/references/clinical_decision_algorithms.md +604 -0
- package/bin/skills/clinical-decision-support/references/evidence_synthesis.md +840 -0
- package/bin/skills/clinical-decision-support/references/outcome_analysis.md +640 -0
- package/bin/skills/clinical-decision-support/references/patient_cohort_analysis.md +427 -0
- package/bin/skills/clinical-decision-support/references/treatment_recommendations.md +521 -0
- package/bin/skills/clinical-decision-support/scripts/biomarker_classifier.py +383 -0
- package/bin/skills/clinical-decision-support/scripts/build_decision_tree.py +417 -0
- package/bin/skills/clinical-decision-support/scripts/create_cohort_tables.py +509 -0
- package/bin/skills/clinical-decision-support/scripts/generate_survival_analysis.py +441 -0
- package/bin/skills/clinical-decision-support/scripts/validate_cds_document.py +326 -0
- package/bin/skills/clinical-reports/IMPLEMENTATION_SUMMARY.md +641 -0
- package/bin/skills/clinical-reports/README.md +236 -0
- package/bin/skills/clinical-reports/SKILL.md +1127 -0
- package/bin/skills/clinical-reports/assets/case_report_template.md +352 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_csr_template.md +353 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_sae_template.md +359 -0
- package/bin/skills/clinical-reports/assets/consult_note_template.md +305 -0
- package/bin/skills/clinical-reports/assets/discharge_summary_template.md +453 -0
- package/bin/skills/clinical-reports/assets/hipaa_compliance_checklist.md +395 -0
- package/bin/skills/clinical-reports/assets/history_physical_template.md +305 -0
- package/bin/skills/clinical-reports/assets/lab_report_template.md +309 -0
- package/bin/skills/clinical-reports/assets/pathology_report_template.md +249 -0
- package/bin/skills/clinical-reports/assets/quality_checklist.md +338 -0
- package/bin/skills/clinical-reports/assets/radiology_report_template.md +318 -0
- package/bin/skills/clinical-reports/assets/soap_note_template.md +253 -0
- package/bin/skills/clinical-reports/references/case_report_guidelines.md +570 -0
- package/bin/skills/clinical-reports/references/clinical_trial_reporting.md +693 -0
- package/bin/skills/clinical-reports/references/data_presentation.md +530 -0
- package/bin/skills/clinical-reports/references/diagnostic_reports_standards.md +629 -0
- package/bin/skills/clinical-reports/references/medical_terminology.md +588 -0
- package/bin/skills/clinical-reports/references/patient_documentation.md +744 -0
- package/bin/skills/clinical-reports/references/peer_review_standards.md +585 -0
- package/bin/skills/clinical-reports/references/regulatory_compliance.md +577 -0
- package/bin/skills/clinical-reports/scripts/check_deidentification.py +332 -0
- package/bin/skills/clinical-reports/scripts/compliance_checker.py +78 -0
- package/bin/skills/clinical-reports/scripts/extract_clinical_data.py +97 -0
- package/bin/skills/clinical-reports/scripts/format_adverse_events.py +97 -0
- package/bin/skills/clinical-reports/scripts/generate_report_template.py +149 -0
- package/bin/skills/clinical-reports/scripts/terminology_validator.py +126 -0
- package/bin/skills/clinical-reports/scripts/validate_case_report.py +323 -0
- package/bin/skills/clinical-reports/scripts/validate_trial_report.py +88 -0
- package/bin/skills/fireworks-ai/SKILL.md +665 -0
- package/bin/skills/generate-image/SKILL.md +178 -0
- package/bin/skills/generate-image/scripts/generate_image.py +254 -0
- package/bin/skills/groq/SKILL.md +347 -0
- package/bin/skills/hypothesis-generation/SKILL.md +293 -0
- package/bin/skills/hypothesis-generation/assets/FORMATTING_GUIDE.md +672 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_generation.sty +307 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_report_template.tex +572 -0
- package/bin/skills/hypothesis-generation/references/experimental_design_patterns.md +329 -0
- package/bin/skills/hypothesis-generation/references/hypothesis_quality_criteria.md +198 -0
- package/bin/skills/hypothesis-generation/references/literature_search_strategies.md +622 -0
- package/bin/skills/latex-posters/README.md +417 -0
- package/bin/skills/latex-posters/SKILL.md +1602 -0
- package/bin/skills/latex-posters/assets/baposter_template.tex +257 -0
- package/bin/skills/latex-posters/assets/beamerposter_template.tex +244 -0
- package/bin/skills/latex-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/latex-posters/assets/tikzposter_template.tex +251 -0
- package/bin/skills/latex-posters/references/latex_poster_packages.md +745 -0
- package/bin/skills/latex-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/latex-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/latex-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/latex-posters/scripts/review_poster.sh +214 -0
- package/bin/skills/literature-review/SKILL.md +641 -0
- package/bin/skills/literature-review/assets/review_template.md +412 -0
- package/bin/skills/literature-review/references/citation_styles.md +166 -0
- package/bin/skills/literature-review/references/database_strategies.md +455 -0
- package/bin/skills/literature-review/scripts/generate_pdf.py +184 -0
- package/bin/skills/literature-review/scripts/search_databases.py +310 -0
- package/bin/skills/literature-review/scripts/verify_citations.py +218 -0
- package/bin/skills/market-research-reports/SKILL.md +904 -0
- package/bin/skills/market-research-reports/assets/FORMATTING_GUIDE.md +428 -0
- package/bin/skills/market-research-reports/assets/market_report_template.tex +1380 -0
- package/bin/skills/market-research-reports/assets/market_research.sty +564 -0
- package/bin/skills/market-research-reports/references/data_analysis_patterns.md +548 -0
- package/bin/skills/market-research-reports/references/report_structure_guide.md +999 -0
- package/bin/skills/market-research-reports/references/visual_generation_guide.md +1077 -0
- package/bin/skills/market-research-reports/scripts/generate_market_visuals.py +472 -0
- package/bin/skills/markitdown/INSTALLATION_GUIDE.md +318 -0
- package/bin/skills/markitdown/LICENSE.txt +22 -0
- package/bin/skills/markitdown/OPENROUTER_INTEGRATION.md +359 -0
- package/bin/skills/markitdown/QUICK_REFERENCE.md +309 -0
- package/bin/skills/markitdown/README.md +184 -0
- package/bin/skills/markitdown/SKILL.md +486 -0
- package/bin/skills/markitdown/SKILL_SUMMARY.md +307 -0
- package/bin/skills/markitdown/assets/example_usage.md +463 -0
- package/bin/skills/markitdown/references/api_reference.md +399 -0
- package/bin/skills/markitdown/references/file_formats.md +542 -0
- package/bin/skills/markitdown/scripts/batch_convert.py +195 -0
- package/bin/skills/markitdown/scripts/convert_literature.py +262 -0
- package/bin/skills/markitdown/scripts/convert_with_ai.py +224 -0
- package/bin/skills/ml-paper-writing/SKILL.md +937 -0
- package/bin/skills/ml-paper-writing/references/checklists.md +361 -0
- package/bin/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- package/bin/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- package/bin/skills/ml-paper-writing/references/sources.md +159 -0
- package/bin/skills/ml-paper-writing/references/writing-guide.md +476 -0
- package/bin/skills/ml-paper-writing/templates/README.md +251 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- package/bin/skills/ml-paper-writing/templates/acl/README.md +50 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- package/bin/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- package/bin/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- package/bin/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- package/bin/skills/paper-2-web/SKILL.md +491 -0
- package/bin/skills/paper-2-web/references/installation.md +141 -0
- package/bin/skills/paper-2-web/references/paper2poster.md +346 -0
- package/bin/skills/paper-2-web/references/paper2video.md +305 -0
- package/bin/skills/paper-2-web/references/paper2web.md +187 -0
- package/bin/skills/paper-2-web/references/usage_examples.md +436 -0
- package/bin/skills/peer-review/SKILL.md +702 -0
- package/bin/skills/peer-review/references/calibration_guidelines.md +196 -0
- package/bin/skills/peer-review/references/common_issues.md +552 -0
- package/bin/skills/peer-review/references/paper_mechanics.md +269 -0
- package/bin/skills/peer-review/references/reporting_standards.md +290 -0
- package/bin/skills/peer-review/references/scoring_rubric.md +239 -0
- package/bin/skills/pptx-posters/SKILL.md +410 -0
- package/bin/skills/pptx-posters/assets/poster_html_template.html +257 -0
- package/bin/skills/pptx-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/pptx-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/pptx-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/pptx-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/research-grants/README.md +285 -0
- package/bin/skills/research-grants/SKILL.md +938 -0
- package/bin/skills/research-grants/assets/budget_justification_template.md +453 -0
- package/bin/skills/research-grants/assets/nih_specific_aims_template.md +166 -0
- package/bin/skills/research-grants/assets/nsf_project_summary_template.md +92 -0
- package/bin/skills/research-grants/references/broader_impacts.md +392 -0
- package/bin/skills/research-grants/references/darpa_guidelines.md +636 -0
- package/bin/skills/research-grants/references/doe_guidelines.md +586 -0
- package/bin/skills/research-grants/references/nih_guidelines.md +851 -0
- package/bin/skills/research-grants/references/nsf_guidelines.md +570 -0
- package/bin/skills/research-grants/references/specific_aims_guide.md +458 -0
- package/bin/skills/research-lookup/README.md +156 -0
- package/bin/skills/research-lookup/SKILL.md +606 -0
- package/bin/skills/research-lookup/examples.py +174 -0
- package/bin/skills/research-lookup/lookup.py +187 -0
- package/bin/skills/research-lookup/research_lookup.py +483 -0
- package/bin/skills/research-lookup/scripts/research_lookup.py +483 -0
- package/bin/skills/scholar-evaluation/SKILL.md +289 -0
- package/bin/skills/scholar-evaluation/references/evaluation_framework.md +663 -0
- package/bin/skills/scholar-evaluation/scripts/calculate_scores.py +366 -0
- package/bin/skills/scientific-critical-thinking/SKILL.md +566 -0
- package/bin/skills/scientific-critical-thinking/references/common_biases.md +364 -0
- package/bin/skills/scientific-critical-thinking/references/evidence_hierarchy.md +484 -0
- package/bin/skills/scientific-critical-thinking/references/experimental_design.md +496 -0
- package/bin/skills/scientific-critical-thinking/references/logical_fallacies.md +478 -0
- package/bin/skills/scientific-critical-thinking/references/scientific_method.md +169 -0
- package/bin/skills/scientific-critical-thinking/references/statistical_pitfalls.md +506 -0
- package/bin/skills/scientific-schematics/QUICK_REFERENCE.md +207 -0
- package/bin/skills/scientific-schematics/README.md +327 -0
- package/bin/skills/scientific-schematics/SKILL.md +615 -0
- package/bin/skills/scientific-schematics/example_usage.sh +89 -0
- package/bin/skills/scientific-schematics/references/best_practices.md +559 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic.py +135 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic_ai.py +807 -0
- package/bin/skills/scientific-schematics/test_ai_generation.py +243 -0
- package/bin/skills/scientific-slides/SKILL.md +942 -0
- package/bin/skills/scientific-slides/assets/timing_guidelines.md +597 -0
- package/bin/skills/scientific-slides/references/data_visualization_slides.md +708 -0
- package/bin/skills/scientific-slides/references/presentation_structure.md +642 -0
- package/bin/skills/scientific-slides/references/slide_design_principles.md +849 -0
- package/bin/skills/scientific-slides/references/talk_types_guide.md +687 -0
- package/bin/skills/scientific-slides/references/visual_review_workflow.md +775 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image.py +143 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image_ai.py +748 -0
- package/bin/skills/scientific-slides/scripts/pdf_to_images.py +201 -0
- package/bin/skills/scientific-slides/scripts/slides_to_pdf.py +220 -0
- package/bin/skills/scientific-slides/scripts/validate_presentation.py +367 -0
- package/bin/skills/scientific-writing/SKILL.md +714 -0
- package/bin/skills/scientific-writing/assets/REPORT_FORMATTING_GUIDE.md +574 -0
- package/bin/skills/scientific-writing/assets/scientific_report.sty +606 -0
- package/bin/skills/scientific-writing/assets/scientific_report_template.tex +449 -0
- package/bin/skills/scientific-writing/references/citation_styles.md +720 -0
- package/bin/skills/scientific-writing/references/figures_tables.md +806 -0
- package/bin/skills/scientific-writing/references/imrad_structure.md +686 -0
- package/bin/skills/scientific-writing/references/professional_report_formatting.md +664 -0
- package/bin/skills/scientific-writing/references/reporting_guidelines.md +748 -0
- package/bin/skills/scientific-writing/references/writing_principles.md +824 -0
- package/bin/skills/tinker/SKILL.md +2 -3
- package/bin/skills/together-ai/SKILL.md +722 -0
- package/bin/skills/treatment-plans/README.md +488 -0
- package/bin/skills/treatment-plans/SKILL.md +1579 -0
- package/bin/skills/treatment-plans/assets/STYLING_QUICK_REFERENCE.md +185 -0
- package/bin/skills/treatment-plans/assets/chronic_disease_management_plan.tex +665 -0
- package/bin/skills/treatment-plans/assets/general_medical_treatment_plan.tex +547 -0
- package/bin/skills/treatment-plans/assets/medical_treatment_plan.sty +222 -0
- package/bin/skills/treatment-plans/assets/mental_health_treatment_plan.tex +774 -0
- package/bin/skills/treatment-plans/assets/one_page_treatment_plan.tex +193 -0
- package/bin/skills/treatment-plans/assets/pain_management_plan.tex +799 -0
- package/bin/skills/treatment-plans/assets/perioperative_care_plan.tex +753 -0
- package/bin/skills/treatment-plans/assets/quality_checklist.md +471 -0
- package/bin/skills/treatment-plans/assets/rehabilitation_treatment_plan.tex +756 -0
- package/bin/skills/treatment-plans/references/goal_setting_frameworks.md +411 -0
- package/bin/skills/treatment-plans/references/intervention_guidelines.md +507 -0
- package/bin/skills/treatment-plans/references/regulatory_compliance.md +476 -0
- package/bin/skills/treatment-plans/references/specialty_specific_guidelines.md +655 -0
- package/bin/skills/treatment-plans/references/treatment_plan_standards.md +485 -0
- package/bin/skills/treatment-plans/scripts/check_completeness.py +322 -0
- package/bin/skills/treatment-plans/scripts/generate_template.py +233 -0
- package/bin/skills/treatment-plans/scripts/timeline_generator.py +385 -0
- package/bin/skills/treatment-plans/scripts/validate_treatment_plan.py +369 -0
- package/bin/skills/unsloth/SKILL.md +565 -47
- package/bin/skills/unsloth/docs/advanced-rl.md +222 -0
- package/bin/skills/unsloth/docs/chat-templates.md +141 -0
- package/bin/skills/unsloth/docs/datasets.md +489 -0
- package/bin/skills/unsloth/docs/docker-extended.md +99 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-2.0.md +116 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-aider.md +118 -0
- package/bin/skills/unsloth/docs/faq.md +91 -0
- package/bin/skills/unsloth/docs/fp16-vs-bf16.md +61 -0
- package/bin/skills/unsloth/docs/fp8-rl.md +224 -0
- package/bin/skills/unsloth/docs/glm-4.7-flash.md +997 -0
- package/bin/skills/unsloth/docs/inference-deployment-overview.md +17 -0
- package/bin/skills/unsloth/docs/inference.md +27 -0
- package/bin/skills/unsloth/docs/installation-docker.md +155 -0
- package/bin/skills/unsloth/docs/installation-pip.md +148 -0
- package/bin/skills/unsloth/docs/kernels-packing.md +190 -0
- package/bin/skills/unsloth/docs/kimi-k2.5.md +634 -0
- package/bin/skills/unsloth/docs/lm-studio.md +235 -0
- package/bin/skills/unsloth/docs/lora-hot-swapping.md +75 -0
- package/bin/skills/unsloth/docs/lora-hyperparameters.md +363 -0
- package/bin/skills/unsloth/docs/memory-efficient-rl.md +267 -0
- package/bin/skills/unsloth/docs/model-selection.md +70 -0
- package/bin/skills/unsloth/docs/models.md +532 -0
- package/bin/skills/unsloth/docs/multi-gpu-ddp.md +90 -0
- package/bin/skills/unsloth/docs/notebooks.md +223 -0
- package/bin/skills/unsloth/docs/overview.md +110 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next-extended.md +900 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next.md +900 -0
- package/bin/skills/unsloth/docs/requirements.md +45 -0
- package/bin/skills/unsloth/docs/reward-hacking.md +25 -0
- package/bin/skills/unsloth/docs/saving-to-gguf.md +138 -0
- package/bin/skills/unsloth/docs/saving-to-ollama.md +46 -0
- package/bin/skills/unsloth/docs/sglang-guide.md +278 -0
- package/bin/skills/unsloth/docs/speculative-decoding.md +70 -0
- package/bin/skills/unsloth/docs/tool-calling.md +334 -0
- package/bin/skills/unsloth/docs/troubleshooting-faq.md +204 -0
- package/bin/skills/unsloth/docs/troubleshooting-inference.md +26 -0
- package/bin/skills/unsloth/docs/tts-fine-tuning.md +149 -0
- package/bin/skills/unsloth/docs/tutorial-grpo.md +273 -0
- package/bin/skills/unsloth/docs/tutorial-llama3-ollama.md +356 -0
- package/bin/skills/unsloth/docs/vision-fine-tuning.md +135 -0
- package/bin/skills/unsloth/docs/vision-rl.md +170 -0
- package/bin/skills/unsloth/docs/vllm-engine-arguments.md +43 -0
- package/bin/skills/unsloth/docs/vllm-guide.md +98 -0
- package/bin/skills/venue-templates/SKILL.md +686 -0
- package/bin/skills/venue-templates/assets/examples/cell_summary_example.md +247 -0
- package/bin/skills/venue-templates/assets/examples/medical_structured_abstract.md +313 -0
- package/bin/skills/venue-templates/assets/examples/nature_abstract_examples.md +213 -0
- package/bin/skills/venue-templates/assets/examples/neurips_introduction_example.md +245 -0
- package/bin/skills/venue-templates/assets/grants/nih_specific_aims.tex +235 -0
- package/bin/skills/venue-templates/assets/grants/nsf_proposal_template.tex +375 -0
- package/bin/skills/venue-templates/assets/journals/nature_article.tex +171 -0
- package/bin/skills/venue-templates/assets/journals/neurips_article.tex +283 -0
- package/bin/skills/venue-templates/assets/journals/plos_one.tex +317 -0
- package/bin/skills/venue-templates/assets/posters/beamerposter_academic.tex +311 -0
- package/bin/skills/venue-templates/references/cell_press_style.md +483 -0
- package/bin/skills/venue-templates/references/conferences_formatting.md +564 -0
- package/bin/skills/venue-templates/references/cs_conference_style.md +463 -0
- package/bin/skills/venue-templates/references/grants_requirements.md +787 -0
- package/bin/skills/venue-templates/references/journals_formatting.md +486 -0
- package/bin/skills/venue-templates/references/medical_journal_styles.md +535 -0
- package/bin/skills/venue-templates/references/ml_conference_style.md +556 -0
- package/bin/skills/venue-templates/references/nature_science_style.md +405 -0
- package/bin/skills/venue-templates/references/posters_guidelines.md +628 -0
- package/bin/skills/venue-templates/references/reviewer_expectations.md +417 -0
- package/bin/skills/venue-templates/references/venue_writing_styles.md +321 -0
- package/bin/skills/venue-templates/scripts/customize_template.py +195 -0
- package/bin/skills/venue-templates/scripts/query_template.py +266 -0
- package/bin/skills/venue-templates/scripts/validate_format.py +250 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/unsloth/references/index.md +0 -7
- package/bin/skills/unsloth/references/llms-full.md +0 -16799
- package/bin/skills/unsloth/references/llms-txt.md +0 -12044
- package/bin/skills/unsloth/references/llms.md +0 -82
|
@@ -0,0 +1,116 @@
|
|
|
1
|
+
# Unsloth Dynamic 2.0 GGUFs
|
|
2
|
+
|
|
3
|
+
We're excited to introduce our Dynamic v2.0 quantization method - a major upgrade to our previous quants. This new method outperforms leading quantization methods and sets new benchmarks for 5-shot MMLU and KL Divergence.
|
|
4
|
+
|
|
5
|
+
This means you can now run + fine-tune quantized LLMs while preserving as much accuracy as possible! You can run the 2.0 GGUFs on any inference engine like llama.cpp, Ollama, Open WebUI etc.
|
|
6
|
+
|
|
7
|
+
> **Sept 10, 2025 update:** You asked for tougher benchmarks, so we're showcasing Aider Polyglot results! Our Dynamic 3-bit DeepSeek V3.1 GGUF scores **75.6%**, surpassing many full-precision SOTA LLMs.
|
|
8
|
+
|
|
9
|
+
The **key advantage** of using the Unsloth package and models is our active role in **fixing critical bugs** in major models. We've collaborated directly with teams behind Qwen3, Meta (Llama 4), Mistral (Devstral), Google (Gemma 1-3) and Microsoft (Phi-3/4), contributing essential fixes that significantly boost accuracy.
|
|
10
|
+
|
|
11
|
+
## What's New in Dynamic v2.0?
|
|
12
|
+
|
|
13
|
+
* **Revamped Layer Selection for GGUFs + safetensors:** Unsloth Dynamic 2.0 now selectively quantizes layers much more intelligently and extensively. Rather than modifying only select layers, we now dynamically adjust the quantization type of every possible layer, and the combinations will differ for each layer and model.
|
|
14
|
+
* Current selected and all future GGUF uploads will utilize Dynamic 2.0 and our new calibration dataset. The dataset contains more than 1.5M **tokens** (depending on model) and comprises high-quality, hand-curated and cleaned data - to greatly enhance conversational chat performance.
|
|
15
|
+
* Previously, our Dynamic quantization (DeepSeek-R1 1.58-bit GGUF) was effective only for MoE architectures. **Dynamic 2.0 quantization now works on all models (including MoEs & non-MoEs)**.
|
|
16
|
+
* **Model-Specific Quants:** Each model now uses a custom-tailored quantization scheme. E.g. the layers quantized in Gemma 3 differ significantly from those in Llama 4.
|
|
17
|
+
* To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add Q4_NL, Q5_1, Q5_0, Q4_1, and Q4_0 formats.
|
|
18
|
+
|
|
19
|
+
## Why KL Divergence?
|
|
20
|
+
|
|
21
|
+
[Accuracy is Not All You Need](https://arxiv.org/pdf/2407.09141) showcases how pruning layers, even seemingly unnecessary ones, still yields vast differences in terms of "flips". A "flip" is defined as an answer changing from incorrect to correct or vice versa.
|
|
22
|
+
|
|
23
|
+
> **KL Divergence** should be the **gold standard for reporting quantization errors** as per the research paper. **Using perplexity is incorrect** since output token values can cancel out, so we must use KLD!
|
|
24
|
+
|
|
25
|
+
## Calibration Dataset Overfitting
|
|
26
|
+
|
|
27
|
+
Most frameworks report perplexity and KL Divergence using a test set of Wikipedia articles. However, using a calibration dataset that is also Wikipedia-related causes quants to overfit. **Also, instruct models have unique chat templates, and using text-only calibration datasets is not effective for instruct models** (for base models it is).
|
|
28
|
+
|
|
29
|
+
We utilize the Calibration_v3 and Calibration_v5 datasets for fair testing, which include some Wikitext data among other sources.
|
|
30
|
+
|
|
31
|
+
## MMLU Replication
|
|
32
|
+
|
|
33
|
+
* Replicating MMLU 5 shot was nightmarish. We **could not** replicate MMLU results for many models including Llama 3.1 (8B) Instruct, Gemma 3 (12B) due to **subtle implementation issues**.
|
|
34
|
+
* Llama 3.1 (8B) **tokenizes "A" and " A" (A with a space in front) as different token ids**. If we consider both spaced and non-spaced tokens, we get 68.2% (+0.4%).
|
|
35
|
+
* Llama 3 as per Eleuther AI's LLM Harness also appends **"The best answer is"** to the question.
|
|
36
|
+
|
|
37
|
+
## Gemma 3 QAT Benchmarks
|
|
38
|
+
|
|
39
|
+
The Gemma team released two QAT (quantization aware training) versions of Gemma 3:
|
|
40
|
+
1. Q4_0 GGUF
|
|
41
|
+
2. int4 version (TorchAO int4 style)
|
|
42
|
+
|
|
43
|
+
Key results for Gemma 3 (12B):
|
|
44
|
+
|
|
45
|
+
| Metric | Value |
|
|
46
|
+
|--------|-------|
|
|
47
|
+
| MMLU 5 shot (QAT Q4_0) | **67.07%** (67.15% BF16) |
|
|
48
|
+
| Disk Space | 7.52GB |
|
|
49
|
+
|
|
50
|
+
KL Divergence improvements (Gemma 3 12B):
|
|
51
|
+
|
|
52
|
+
| Quant | Baseline KLD | GB | New KLD | GB |
|
|
53
|
+
|-------|---|---|---|---|
|
|
54
|
+
| IQ1_S | 1.035688 | 5.83 | 0.972932 | 6.06 |
|
|
55
|
+
| IQ2_XXS | 0.535764 | 7.16 | 0.521039 | 7.31 |
|
|
56
|
+
| Q2_K_XL | 0.229671 | 9.78 | 0.220937 | 9.95 |
|
|
57
|
+
| Q3_K_XL | 0.087845 | 12.51 | 0.080617 | 12.76 |
|
|
58
|
+
| Q4_K_XL | 0.024916 | 15.41 | 0.023701 | 15.64 |
|
|
59
|
+
|
|
60
|
+
Gemma 3 (27B) MMLU results:
|
|
61
|
+
|
|
62
|
+
| Quant | Unsloth | Unsloth + QAT | Disk Size | Efficiency |
|
|
63
|
+
|-------|---------|---------------|-----------|------------|
|
|
64
|
+
| IQ2_XXS | 59.20 | 56.57 | 7.31 | 4.32 |
|
|
65
|
+
| Q2_K_XL | 68.70 | 67.77 | 9.95 | 4.30 |
|
|
66
|
+
| Q3_K_XL | 70.87 | 69.50 | 12.76 | 3.49 |
|
|
67
|
+
| **Q4_K_XL** | **71.47** | **71.07** | **15.64** | **2.94** |
|
|
68
|
+
| **Google QAT** | | **70.64** | **17.2** | **2.65** |
|
|
69
|
+
|
|
70
|
+
Key finding: **Our dynamic 4bit version is 2GB smaller whilst having +1% extra accuracy vs the QAT version!**
|
|
71
|
+
|
|
72
|
+
## Llama 4 Bug Fixes
|
|
73
|
+
|
|
74
|
+
We helped and fixed several Llama 4 bugs:
|
|
75
|
+
* Llama 4 Scout changed the RoPE Scaling configuration - we helped resolve issues in llama.cpp
|
|
76
|
+
* Llama 4's QK Norm's epsilon should be 1e-05, not 1e-06
|
|
77
|
+
* QK Norm being shared across all heads was fixed (MMLU Pro increased from 68.58% to 71.53%)
|
|
78
|
+
|
|
79
|
+
### Running Llama 4 Scout
|
|
80
|
+
|
|
81
|
+
```bash
|
|
82
|
+
apt-get update
|
|
83
|
+
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
|
|
84
|
+
git clone https://github.com/ggml-org/llama.cpp
|
|
85
|
+
cmake llama.cpp -B llama.cpp/build \
|
|
86
|
+
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
|
|
87
|
+
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
|
|
88
|
+
cp llama.cpp/build/bin/llama-* llama.cpp
|
|
89
|
+
```
|
|
90
|
+
|
|
91
|
+
```python
|
|
92
|
+
import os
|
|
93
|
+
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
|
|
94
|
+
from huggingface_hub import snapshot_download
|
|
95
|
+
snapshot_download(
|
|
96
|
+
repo_id = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
|
|
97
|
+
local_dir = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF",
|
|
98
|
+
allow_patterns = ["*IQ2_XXS*"],
|
|
99
|
+
)
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
```bash
|
|
103
|
+
./llama.cpp/llama-cli \
|
|
104
|
+
--model unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/Llama-4-Scout-17B-16E-Instruct-UD-IQ2_XXS.gguf \
|
|
105
|
+
--threads 32 \
|
|
106
|
+
--ctx-size 16384 \
|
|
107
|
+
--n-gpu-layers 99 \
|
|
108
|
+
-ot ".ffn_.*_exps.=CPU" \
|
|
109
|
+
--seed 3407 \
|
|
110
|
+
--prio 3 \
|
|
111
|
+
--temp 0.6 \
|
|
112
|
+
--min-p 0.01 \
|
|
113
|
+
--top-p 0.9 \
|
|
114
|
+
-no-cnv \
|
|
115
|
+
--prompt "<|header_start|>user<|header_end|>\n\nCreate a Flappy Bird game.<|eot|><|header_start|>assistant<|header_end|>\n\n"
|
|
116
|
+
```
|
|
@@ -0,0 +1,118 @@
|
|
|
1
|
+
# Unsloth Dynamic GGUFs on Aider Polyglot
|
|
2
|
+
|
|
3
|
+
We're showcasing how Unsloth Dynamic GGUFs makes it possible to quantize LLMs like DeepSeek-V3.1 (671B) down to just **1-bit** or **3-bit**, and still be able to outperform SOTA models like **GPT-4.5, GPT-4.1** (April 2025) and **Claude-4-Opus** (May 2025).
|
|
4
|
+
|
|
5
|
+
## Key Results
|
|
6
|
+
|
|
7
|
+
* Our **1-bit** Unsloth Dynamic GGUF shrinks DeepSeek-V3.1 from **671GB to 192GB (-75% size)** and no-thinking mode greatly outperforms GPT-4.1, GPT-4.5, and DeepSeek-V3-0324.
|
|
8
|
+
* **3-bit** Unsloth DeepSeek-V3.1 (thinking) GGUF: Outperforms Claude-4-Opus-20250514 (thinking).
|
|
9
|
+
* **5-bit** Unsloth DeepSeek-V3.1 (non-thinking) GGUF: Matches Claude-4-Opus-20250514 (non-thinking) performance.
|
|
10
|
+
* Unsloth Dynamic GGUFs perform consistently better than other non-Unsloth Dynamic imatrix GGUFs.
|
|
11
|
+
* Other non-Unsloth 1-bit and 2-bit DeepSeek-V3.1 quantizations either failed to load or produced gibberish.
|
|
12
|
+
|
|
13
|
+
## Reasoning Model Aider Benchmarks
|
|
14
|
+
|
|
15
|
+
| Model | Accuracy |
|
|
16
|
+
|-------|----------|
|
|
17
|
+
| GPT-5 | 86.7 |
|
|
18
|
+
| Gemini 2.5 Pro (June) | 83.1 |
|
|
19
|
+
| o3 | 76.9 |
|
|
20
|
+
| DeepSeek V3.1 | 76.1 |
|
|
21
|
+
| **(3 bit) DeepSeek V3.1 Unsloth** | **75.6** |
|
|
22
|
+
| Claude-4-Opus (May) | 72 |
|
|
23
|
+
| o4-mini (High) | 72 |
|
|
24
|
+
| DeepSeek R1 0528 | 71.4 |
|
|
25
|
+
| **(2 bit) DeepSeek V3.1 Unsloth** | **66.7** |
|
|
26
|
+
| Claude-3.7-Sonnet (Feb) | 64.9 |
|
|
27
|
+
| **(1 bit) DeepSeek V3.1 Unsloth** | **57.8** |
|
|
28
|
+
| DeepSeek R1 | 56.9 |
|
|
29
|
+
|
|
30
|
+
## Non-Reasoning Model Aider Benchmarks
|
|
31
|
+
|
|
32
|
+
| Model | Accuracy |
|
|
33
|
+
|-------|----------|
|
|
34
|
+
| DeepSeek V3.1 | 71.6 |
|
|
35
|
+
| Claude-4-Opus (May) | 70.7 |
|
|
36
|
+
| **(5 bit) DeepSeek V3.1 Unsloth** | **70.7** |
|
|
37
|
+
| **(4 bit) DeepSeek V3.1 Unsloth** | **69.7** |
|
|
38
|
+
| **(3 bit) DeepSeek V3.1 Unsloth** | **68.4** |
|
|
39
|
+
| **(2 bit) DeepSeek V3.1 Unsloth** | **65.8** |
|
|
40
|
+
| Qwen3 235B A22B | 59.6 |
|
|
41
|
+
| Kimi K2 | 59.1 |
|
|
42
|
+
| **(1 bit) DeepSeek V3.1 Unsloth** | **55.7** |
|
|
43
|
+
| DeepSeek V3-0324 | 55.1 |
|
|
44
|
+
| GPT-4.1 (April, 2025) | 52.4 |
|
|
45
|
+
| ChatGPT 4o (March, 2025) | 45.3 |
|
|
46
|
+
| GPT-4.5 | 44.9 |
|
|
47
|
+
|
|
48
|
+
## Dynamic Quantization Methodology
|
|
49
|
+
|
|
50
|
+
**Dynamic 1 bit makes important layers in 8 or 16 bits and un-important layers in 1,2,3,4,5 or 6bits.**
|
|
51
|
+
|
|
52
|
+
In Nov 2024, our 4-bit Dynamic Quants showcased how you could largely restore QLoRA fine-tuning & model accuracy by just **selectively quantizing layers**. We later applied this to DeepSeek-R1's MoE architecture, where we quantized some layers to as low as 1-bit and important layers to higher bits.
|
|
53
|
+
|
|
54
|
+
## Comparison to Other Quants
|
|
55
|
+
|
|
56
|
+
| Quant | Quant Size (GB) | Unsloth Accuracy % | Comparison Accuracy % |
|
|
57
|
+
|-------|-----------------|--------------------|-----------------------|
|
|
58
|
+
| TQ1_0 | 170 | 50.7 | |
|
|
59
|
+
| IQ1_M | 206 | 55.7 | |
|
|
60
|
+
| IQ2_XXS | 225 | 61.2 | |
|
|
61
|
+
| IQ2_M | 235 | 64.3 | |
|
|
62
|
+
| Q2_K_XL | 255 | 65.8 | |
|
|
63
|
+
| IQ3_XXS | 279 | 66.8 | |
|
|
64
|
+
| Q3_K_XL | 300 | 68.4 | |
|
|
65
|
+
| IQ4_XS | 357 | 69.2 | |
|
|
66
|
+
| Q4_K_XL | 387 | 69.7 | |
|
|
67
|
+
| Q5_K_XL | 484 | 70.7 | |
|
|
68
|
+
| IQ2_XXS | 164 | | 43.6 |
|
|
69
|
+
| IQ2_M | 215 | | 56.6 |
|
|
70
|
+
| Q2_K_L | 239 | | 64.0 |
|
|
71
|
+
| IQ3_XXS | 268 | | 65.6 |
|
|
72
|
+
| Q3_K_S | 293 | | 65.2 |
|
|
73
|
+
| IQ4_XS | 360 | | 66.3 |
|
|
74
|
+
| Q4_K_M | 409 | | 67.7 |
|
|
75
|
+
| Q5_K_M | 478 | | 68.9 |
|
|
76
|
+
|
|
77
|
+
## Dynamic Quantization Ablations
|
|
78
|
+
|
|
79
|
+
We did ablations to confirm our calibration dataset and dynamic methodology works. Key finding: `attn_k_b` and other tensors in DeepSeek V3.1 are highly important / sensitive to quantization and should be left in higher precision to retain accuracy!
|
|
80
|
+
|
|
81
|
+
## Chat Template Bug Fixes
|
|
82
|
+
|
|
83
|
+
During testing we found some lower bit quants not enclosing `<think> </think>` properly. We had to change llama.cpp's minja usage:
|
|
84
|
+
|
|
85
|
+
```
|
|
86
|
+
# From:
|
|
87
|
+
{%- set content = content.split("</think>", 1)[1] -%}
|
|
88
|
+
|
|
89
|
+
# To:
|
|
90
|
+
{%- set splitted = content.split("</think>") -%}
|
|
91
|
+
{%- set content = splitted[1:] | join("</think>") -%}
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
## Run DeepSeek V3.1 Dynamic Quants
|
|
95
|
+
|
|
96
|
+
```bash
|
|
97
|
+
apt-get update
|
|
98
|
+
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
|
|
99
|
+
git clone https://github.com/ggml-org/llama.cpp
|
|
100
|
+
cmake llama.cpp -B llama.cpp/build \
|
|
101
|
+
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
|
|
102
|
+
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
|
|
103
|
+
cp llama.cpp/build/bin/llama-* llama.cpp
|
|
104
|
+
```
|
|
105
|
+
|
|
106
|
+
```bash
|
|
107
|
+
export LLAMA_CACHE="unsloth/DeepSeek-V3.1-GGUF"
|
|
108
|
+
./llama.cpp/llama-cli \
|
|
109
|
+
-hf unsloth/DeepSeek-V3.1-GGUF:Q2_K_XL \
|
|
110
|
+
--jinja \
|
|
111
|
+
--n-gpu-layers 99 \
|
|
112
|
+
--temp 0.6 \
|
|
113
|
+
--top_p 0.95 \
|
|
114
|
+
--min_p 0.01 \
|
|
115
|
+
--ctx-size 8192 \
|
|
116
|
+
--seed 3407 \
|
|
117
|
+
-ot ".ffn_.*_exps.=CPU"
|
|
118
|
+
```
|
|
@@ -0,0 +1,91 @@
|
|
|
1
|
+
# FAQ + Is Fine-tuning Right For Me?
|
|
2
|
+
|
|
3
|
+
## Understanding Fine-Tuning
|
|
4
|
+
|
|
5
|
+
Fine-tuning an LLM customizes its behavior, deepens its domain expertise, and optimizes its performance for specific tasks. By refining a pre-trained model (e.g. *Llama-3.1-8B*) with specialized data, you can:
|
|
6
|
+
|
|
7
|
+
* **Update Knowledge** – Introduce new, domain-specific information that the base model didn’t originally include.
|
|
8
|
+
* **Customize Behavior** – Adjust the model’s tone, personality, or response style to fit specific needs or a brand voice.
|
|
9
|
+
* **Optimize for Tasks** – Improve accuracy and relevance on particular tasks or queries your use-case requires.
|
|
10
|
+
|
|
11
|
+
Think of fine-tuning as creating a specialized expert out of a generalist model. Some debate whether to use Retrieval-Augmented Generation (RAG) instead of fine-tuning, but fine-tuning can incorporate knowledge and behaviors directly into the model in ways RAG cannot. In practice, combining both approaches yields the best results - leading to greater accuracy, better usability, and fewer hallucinations.
|
|
12
|
+
|
|
13
|
+
### Real-World Applications of Fine-Tuning
|
|
14
|
+
|
|
15
|
+
Fine-tuning can be applied across various domains and needs. Here are a few practical examples of how it makes a difference:
|
|
16
|
+
|
|
17
|
+
* **Sentiment Analysis for Finance** – Train an LLM to determine if a news headline impacts a company positively or negatively, tailoring its understanding to financial context.
|
|
18
|
+
* **Customer Support Chatbots** – Fine-tune on past customer interactions to provide more accurate and personalized responses in a company’s style and terminology.
|
|
19
|
+
* **Legal Document Assistance** – Fine-tune on legal texts (contracts, case law, regulations) for tasks like contract analysis, case law research, or compliance support, ensuring the model uses precise legal language.
|
|
20
|
+
|
|
21
|
+
## The Benefits of Fine-Tuning
|
|
22
|
+
|
|
23
|
+
Fine-tuning offers several notable benefits beyond what a base model or a purely retrieval-based system can provide:
|
|
24
|
+
|
|
25
|
+
#### Fine-Tuning vs. RAG: What’s the Difference?
|
|
26
|
+
|
|
27
|
+
Fine-tuning can do mostly everything RAG can - but not the other way around. During training, fine-tuning embeds external knowledge directly into the model. This allows the model to handle niche queries, summarize documents, and maintain context without relying on an outside retrieval system. That’s not to say RAG lacks advantages, as it excels at accessing up-to-date information from external databases. It is in fact possible to retrieve fresh data with fine-tuning as well, however it is better to combine RAG with fine-tuning for efficiency.
|
|
28
|
+
|
|
29
|
+
#### Task-Specific Mastery
|
|
30
|
+
|
|
31
|
+
Fine-tuning deeply integrates domain knowledge into the model. This makes it highly effective at handling structured, repetitive, or nuanced queries, scenarios where RAG-alone systems often struggle. In other words, a fine-tuned model becomes a specialist in the tasks or content it was trained on.
|
|
32
|
+
|
|
33
|
+
#### Independence from Retrieval
|
|
34
|
+
|
|
35
|
+
A fine-tuned model has no dependency on external data sources at inference time. It remains reliable even if a connected retrieval system fails or is incomplete, because all needed information is already within the model’s own parameters. This self-sufficiency means fewer points of failure in production.
|
|
36
|
+
|
|
37
|
+
#### Faster Responses
|
|
38
|
+
|
|
39
|
+
Fine-tuned models don’t need to call out to an external knowledge base during generation. Skipping the retrieval step means they can produce answers much more quickly. This speed makes fine-tuned models ideal for time-sensitive applications where every second counts.
|
|
40
|
+
|
|
41
|
+
#### Custom Behavior and Tone
|
|
42
|
+
|
|
43
|
+
Fine-tuning allows precise control over how the model communicates. This ensures the model’s responses stay consistent with a brand’s voice, adhere to regulatory requirements, or match specific tone preferences. You get a model that not only knows *what* to say, but *how* to say it in the desired style.
|
|
44
|
+
|
|
45
|
+
#### Reliable Performance
|
|
46
|
+
|
|
47
|
+
Even in a hybrid setup that uses both fine-tuning and RAG, the fine-tuned model provides a reliable fallback. If the retrieval component fails to find the right information or returns incorrect data, the model’s built-in knowledge can still generate a useful answer. This guarantees more consistent and robust performance for your system.
|
|
48
|
+
|
|
49
|
+
## Common Misconceptions
|
|
50
|
+
|
|
51
|
+
Despite fine-tuning’s advantages, a few myths persist. Let’s address two of the most common misconceptions about fine-tuning:
|
|
52
|
+
|
|
53
|
+
### Does Fine-Tuning Add New Knowledge to a Model?
|
|
54
|
+
|
|
55
|
+
**Yes - it absolutely can.** A common myth suggests that fine-tuning doesn’t introduce new knowledge, but in reality it does. If your fine-tuning dataset contains new domain-specific information, the model will learn that content during training and incorporate it into its responses. In effect, fine-tuning *can and does* teach the model new facts and patterns from scratch.
|
|
56
|
+
|
|
57
|
+
### Is RAG Always Better Than Fine-Tuning?
|
|
58
|
+
|
|
59
|
+
**Not necessarily.** Many assume RAG will consistently outperform a fine-tuned model, but that’s not the case when fine-tuning is done properly. In fact, a well-tuned model often matches or even surpasses RAG-based systems on specialized tasks. Claims that “RAG is always better” often stem from fine-tuning attempts that weren’t optimally configured - for example, using incorrect [LoRA parameters](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide) or insufficient training.
|
|
60
|
+
|
|
61
|
+
Unsloth takes care of these complexities by automatically selecting the best parameter configurations for you. All you need is a good-quality dataset, and you'll get a fine-tuned model that performs to its fullest potential.
|
|
62
|
+
|
|
63
|
+
### Is Fine-Tuning Expensive?
|
|
64
|
+
|
|
65
|
+
**Not at all!** While full fine-tuning or pretraining can be costly, these are not necessary (pretraining is especially not necessary). In most cases, LoRA or QLoRA fine-tuning can be done for minimal cost. In fact, with Unsloth’s [free notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks) for Colab or Kaggle, you can fine-tune models without spending a dime. Better yet, you can even fine-tune locally on your own device.
|
|
66
|
+
|
|
67
|
+
## FAQ:
|
|
68
|
+
|
|
69
|
+
### Why You Should Combine RAG & Fine-Tuning
|
|
70
|
+
|
|
71
|
+
Instead of choosing between RAG and fine-tuning, consider using **both** together for the best results. Combining a retrieval system with a fine-tuned model brings out the strengths of each approach. Here’s why:
|
|
72
|
+
|
|
73
|
+
* **Task-Specific Expertise** – Fine-tuning excels at specialized tasks or formats (making the model an expert in a specific area), while RAG keeps the model up-to-date with the latest external knowledge.
|
|
74
|
+
* **Better Adaptability** – A fine-tuned model can still give useful answers even if the retrieval component fails or returns incomplete information. Meanwhile, RAG ensures the system stays current without requiring you to retrain the model for every new piece of data.
|
|
75
|
+
* **Efficiency** – Fine-tuning provides a strong foundational knowledge base within the model, and RAG handles dynamic or quickly-changing details without the need for exhaustive re-training from scratch. This balance yields an efficient workflow and reduces overall compute costs.
|
|
76
|
+
|
|
77
|
+
### LoRA vs. QLoRA: Which One to Use?
|
|
78
|
+
|
|
79
|
+
When it comes to implementing fine-tuning, two popular techniques can dramatically cut down the compute and memory requirements: **LoRA** and **QLoRA**. Here’s a quick comparison of each:
|
|
80
|
+
|
|
81
|
+
* **LoRA (Low-Rank Adaptation)** – Fine-tunes only a small set of additional “adapter” weight matrices (in 16-bit precision), while leaving most of the original model unchanged. This significantly reduces the number of parameters that need updating during training.
|
|
82
|
+
* **QLoRA (Quantized LoRA)** – Combines LoRA with 4-bit quantization of the model weights, enabling efficient fine-tuning of very large models on minimal hardware. By using 4-bit precision where possible, it dramatically lowers memory usage and compute overhead.
|
|
83
|
+
|
|
84
|
+
We recommend starting with **QLoRA**, as it’s one of the most efficient and accessible methods available. Thanks to Unsloth’s [dynamic 4-bit](https://unsloth.ai/blog/dynamic-4bit) quants, the accuracy loss compared to standard 16-bit LoRA fine-tuning is now negligible.
|
|
85
|
+
|
|
86
|
+
### Experimentation is Key
|
|
87
|
+
|
|
88
|
+
There’s no single “best” approach to fine-tuning - only best practices for different scenarios. It’s important to experiment with different methods and configurations to find what works best for your dataset and use case. A great starting point is **QLoRA (4-bit)**, which offers a very cost-effective, resource-friendly way to fine-tune models without heavy computational requirements.
|
|
89
|
+
|
|
90
|
+
{% content-ref url="../fine-tuning-llms-guide/lora-hyperparameters-guide" %}
|
|
91
|
+
[lora-hyperparameters-guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide)
|
|
@@ -0,0 +1,61 @@
|
|
|
1
|
+
# FP16 vs BF16 for RL
|
|
2
|
+
|
|
3
|
+
### Float16 vs Bfloat16
|
|
4
|
+
|
|
5
|
+
There was a paper titled "**Defeating the Training-Inference Mismatch via FP16**" <https://arxiv.org/pdf/2510.26788> showing how using float16 precision can dramatically be better than using bfloat16 when doing reinforcement learning.
|
|
6
|
+
|
|
7
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Frec4qe1aQS0xyMzGvS9c%2Fimage.png?alt=media&token=2137e766-0f1f-48ec-b25f-2292d6f149f4" alt=""><figcaption></figcaption></figure>
|
|
8
|
+
|
|
9
|
+
In fact the longer the generation, the worse it gets when using bfloat16:
|
|
10
|
+
|
|
11
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FWs7ioB2lraTbDbUCOAnn%2Fimage.png?alt=media&token=ac2b4f8e-210f-4bcc-bcbb-6e68f80781a6" alt=""><figcaption></figcaption></figure>
|
|
12
|
+
|
|
13
|
+
We did an investigation, and **DO find float16 to be more stable** than bfloat16, with much smaller gradient norms; see <https://x.com/danielhanchen/status/1985557028295827482> and <https://x.com/danielhanchen/status/1985562902531850472>
|
|
14
|
+
|
|
15
|
+
{% columns %}
|
|
16
|
+
{% column width="50%" %}
|
|
17
|
+
|
|
18
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FhvQ1W5wtV6TTfsetp7y2%2FG44d7ZFbIAANBBd.jpg?alt=media&token=35181a07-de3e-4321-b54e-4436b4a201ff" alt=""><figcaption></figcaption></figure>
|
|
19
|
+
|
|
20
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2F62HkxnGcaKvxnSxbZMZu%2FG44c20SbwAAGo8j.jpg?alt=media&token=e0c7ecb8-6f0c-4ecf-b1a0-50f1b2a9a807" alt=""><figcaption></figcaption></figure>
|
|
21
|
+
{% endcolumn %}
|
|
22
|
+
|
|
23
|
+
{% column width="50%" %}
|
|
24
|
+
|
|
25
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2Fsi18IkGqE4IuUvzroyHh%2FG44ix5FbQAM0L5l.jpg?alt=media&token=bc3b97ce-5df4-4b69-aa50-a8e339f21601" alt=""><figcaption></figcaption></figure>
|
|
26
|
+
{% endcolumn %}
|
|
27
|
+
{% endcolumns %}
|
|
28
|
+
|
|
29
|
+
### :exploding\_head:A100 Cascade Attention Bug
|
|
30
|
+
|
|
31
|
+
As per <https://x.com/RichardYRLi/status/1984858850143715759> and <https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda>, older vLLM versions (before 0.11.0) had broken attention mechanisms for A100 and similar GPUs. Please update vLLM! We also by default disable cascade attention in vLLM during Unsloth reinforcement learning if we detect an older vLLM version.
|
|
32
|
+
|
|
33
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FnkCLRVIIGLADXBSCe58e%2Fimage.png?alt=media&token=6669642f-8690-44bf-b2de-6aa89acf2332" alt=""><figcaption></figcaption></figure>
|
|
34
|
+
|
|
35
|
+
Different hardware also changes results, where newer and more expensive GPUs have less KL difference between the inference and training sides:
|
|
36
|
+
|
|
37
|
+
<figure><img src="https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FxhOjnexMCB3dmuQFQ2Zq%2Fuploads%2FaroTTz68zzyofy6nagtH%2Fimage.webp?alt=media&token=3be09506-b8a0-42eb-8d17-af72496a9cd1" alt=""><figcaption></figcaption></figure>
|
|
38
|
+
|
|
39
|
+
### :fire:Using float16 in Unsloth RL
|
|
40
|
+
|
|
41
|
+
To use float16 precision in Unsloth GRPO and RL, you just need to set `dtype = torch.float16` and we'll take care of the rest!
|
|
42
|
+
|
|
43
|
+
{% code overflow="wrap" %}
|
|
44
|
+
|
|
45
|
+
```python
|
|
46
|
+
from unsloth import FastLanguageModel
|
|
47
|
+
import torch
|
|
48
|
+
max_seq_length = 2048 # Can increase for longer reasoning traces
|
|
49
|
+
lora_rank = 32 # Larger rank = smarter, but slower
|
|
50
|
+
|
|
51
|
+
model, tokenizer = FastLanguageModel.from_pretrained(
|
|
52
|
+
model_name = "unsloth/Qwen3-4B-Base",
|
|
53
|
+
max_seq_length = max_seq_length,
|
|
54
|
+
load_in_4bit = False, # False for LoRA 16bit
|
|
55
|
+
fast_inference = True, # Enable vLLM fast inference
|
|
56
|
+
max_lora_rank = lora_rank,
|
|
57
|
+
gpu_memory_utilization = 0.9, # Reduce if out of memory
|
|
58
|
+
|
|
59
|
+
dtype = torch.float16, # Use torch.float16, torch.bfloat16
|
|
60
|
+
)
|
|
61
|
+
```
|