@synsci/cli-darwin-x64 1.1.70 → 1.1.72
- package/bin/skills/citation-management/SKILL.md +1109 -0
- package/bin/skills/citation-management/assets/bibtex_template.bib +264 -0
- package/bin/skills/citation-management/assets/citation_checklist.md +386 -0
- package/bin/skills/citation-management/references/bibtex_formatting.md +908 -0
- package/bin/skills/citation-management/references/citation_validation.md +794 -0
- package/bin/skills/citation-management/references/google_scholar_search.md +725 -0
- package/bin/skills/citation-management/references/metadata_extraction.md +870 -0
- package/bin/skills/citation-management/references/pubmed_search.md +839 -0
- package/bin/skills/citation-management/scripts/doi_to_bibtex.py +182 -0
- package/bin/skills/citation-management/scripts/extract_metadata.py +570 -0
- package/bin/skills/citation-management/scripts/format_bibtex.py +349 -0
- package/bin/skills/citation-management/scripts/search_google_scholar.py +251 -0
- package/bin/skills/citation-management/scripts/search_pubmed.py +348 -0
- package/bin/skills/citation-management/scripts/validate_citations.py +494 -0
- package/bin/skills/clinical-decision-support/README.md +129 -0
- package/bin/skills/clinical-decision-support/SKILL.md +506 -0
- package/bin/skills/clinical-decision-support/assets/biomarker_report_template.tex +380 -0
- package/bin/skills/clinical-decision-support/assets/clinical_pathway_template.tex +222 -0
- package/bin/skills/clinical-decision-support/assets/cohort_analysis_template.tex +359 -0
- package/bin/skills/clinical-decision-support/assets/color_schemes.tex +149 -0
- package/bin/skills/clinical-decision-support/assets/example_gbm_cohort.md +208 -0
- package/bin/skills/clinical-decision-support/assets/recommendation_strength_guide.md +328 -0
- package/bin/skills/clinical-decision-support/assets/treatment_recommendation_template.tex +529 -0
- package/bin/skills/clinical-decision-support/references/biomarker_classification.md +719 -0
- package/bin/skills/clinical-decision-support/references/clinical_decision_algorithms.md +604 -0
- package/bin/skills/clinical-decision-support/references/evidence_synthesis.md +840 -0
- package/bin/skills/clinical-decision-support/references/outcome_analysis.md +640 -0
- package/bin/skills/clinical-decision-support/references/patient_cohort_analysis.md +427 -0
- package/bin/skills/clinical-decision-support/references/treatment_recommendations.md +521 -0
- package/bin/skills/clinical-decision-support/scripts/biomarker_classifier.py +383 -0
- package/bin/skills/clinical-decision-support/scripts/build_decision_tree.py +417 -0
- package/bin/skills/clinical-decision-support/scripts/create_cohort_tables.py +509 -0
- package/bin/skills/clinical-decision-support/scripts/generate_survival_analysis.py +441 -0
- package/bin/skills/clinical-decision-support/scripts/validate_cds_document.py +326 -0
- package/bin/skills/clinical-reports/IMPLEMENTATION_SUMMARY.md +641 -0
- package/bin/skills/clinical-reports/README.md +236 -0
- package/bin/skills/clinical-reports/SKILL.md +1127 -0
- package/bin/skills/clinical-reports/assets/case_report_template.md +352 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_csr_template.md +353 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_sae_template.md +359 -0
- package/bin/skills/clinical-reports/assets/consult_note_template.md +305 -0
- package/bin/skills/clinical-reports/assets/discharge_summary_template.md +453 -0
- package/bin/skills/clinical-reports/assets/hipaa_compliance_checklist.md +395 -0
- package/bin/skills/clinical-reports/assets/history_physical_template.md +305 -0
- package/bin/skills/clinical-reports/assets/lab_report_template.md +309 -0
- package/bin/skills/clinical-reports/assets/pathology_report_template.md +249 -0
- package/bin/skills/clinical-reports/assets/quality_checklist.md +338 -0
- package/bin/skills/clinical-reports/assets/radiology_report_template.md +318 -0
- package/bin/skills/clinical-reports/assets/soap_note_template.md +253 -0
- package/bin/skills/clinical-reports/references/case_report_guidelines.md +570 -0
- package/bin/skills/clinical-reports/references/clinical_trial_reporting.md +693 -0
- package/bin/skills/clinical-reports/references/data_presentation.md +530 -0
- package/bin/skills/clinical-reports/references/diagnostic_reports_standards.md +629 -0
- package/bin/skills/clinical-reports/references/medical_terminology.md +588 -0
- package/bin/skills/clinical-reports/references/patient_documentation.md +744 -0
- package/bin/skills/clinical-reports/references/peer_review_standards.md +585 -0
- package/bin/skills/clinical-reports/references/regulatory_compliance.md +577 -0
- package/bin/skills/clinical-reports/scripts/check_deidentification.py +332 -0
- package/bin/skills/clinical-reports/scripts/compliance_checker.py +78 -0
- package/bin/skills/clinical-reports/scripts/extract_clinical_data.py +97 -0
- package/bin/skills/clinical-reports/scripts/format_adverse_events.py +97 -0
- package/bin/skills/clinical-reports/scripts/generate_report_template.py +149 -0
- package/bin/skills/clinical-reports/scripts/terminology_validator.py +126 -0
- package/bin/skills/clinical-reports/scripts/validate_case_report.py +323 -0
- package/bin/skills/clinical-reports/scripts/validate_trial_report.py +88 -0
- package/bin/skills/fireworks-ai/SKILL.md +665 -0
- package/bin/skills/generate-image/SKILL.md +178 -0
- package/bin/skills/generate-image/scripts/generate_image.py +254 -0
- package/bin/skills/groq/SKILL.md +347 -0
- package/bin/skills/hypothesis-generation/SKILL.md +293 -0
- package/bin/skills/hypothesis-generation/assets/FORMATTING_GUIDE.md +672 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_generation.sty +307 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_report_template.tex +572 -0
- package/bin/skills/hypothesis-generation/references/experimental_design_patterns.md +329 -0
- package/bin/skills/hypothesis-generation/references/hypothesis_quality_criteria.md +198 -0
- package/bin/skills/hypothesis-generation/references/literature_search_strategies.md +622 -0
- package/bin/skills/latex-posters/README.md +417 -0
- package/bin/skills/latex-posters/SKILL.md +1602 -0
- package/bin/skills/latex-posters/assets/baposter_template.tex +257 -0
- package/bin/skills/latex-posters/assets/beamerposter_template.tex +244 -0
- package/bin/skills/latex-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/latex-posters/assets/tikzposter_template.tex +251 -0
- package/bin/skills/latex-posters/references/latex_poster_packages.md +745 -0
- package/bin/skills/latex-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/latex-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/latex-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/latex-posters/scripts/review_poster.sh +214 -0
- package/bin/skills/literature-review/SKILL.md +641 -0
- package/bin/skills/literature-review/assets/review_template.md +412 -0
- package/bin/skills/literature-review/references/citation_styles.md +166 -0
- package/bin/skills/literature-review/references/database_strategies.md +455 -0
- package/bin/skills/literature-review/scripts/generate_pdf.py +184 -0
- package/bin/skills/literature-review/scripts/search_databases.py +310 -0
- package/bin/skills/literature-review/scripts/verify_citations.py +218 -0
- package/bin/skills/market-research-reports/SKILL.md +904 -0
- package/bin/skills/market-research-reports/assets/FORMATTING_GUIDE.md +428 -0
- package/bin/skills/market-research-reports/assets/market_report_template.tex +1380 -0
- package/bin/skills/market-research-reports/assets/market_research.sty +564 -0
- package/bin/skills/market-research-reports/references/data_analysis_patterns.md +548 -0
- package/bin/skills/market-research-reports/references/report_structure_guide.md +999 -0
- package/bin/skills/market-research-reports/references/visual_generation_guide.md +1077 -0
- package/bin/skills/market-research-reports/scripts/generate_market_visuals.py +472 -0
- package/bin/skills/markitdown/INSTALLATION_GUIDE.md +318 -0
- package/bin/skills/markitdown/LICENSE.txt +22 -0
- package/bin/skills/markitdown/OPENROUTER_INTEGRATION.md +359 -0
- package/bin/skills/markitdown/QUICK_REFERENCE.md +309 -0
- package/bin/skills/markitdown/README.md +184 -0
- package/bin/skills/markitdown/SKILL.md +486 -0
- package/bin/skills/markitdown/SKILL_SUMMARY.md +307 -0
- package/bin/skills/markitdown/assets/example_usage.md +463 -0
- package/bin/skills/markitdown/references/api_reference.md +399 -0
- package/bin/skills/markitdown/references/file_formats.md +542 -0
- package/bin/skills/markitdown/scripts/batch_convert.py +195 -0
- package/bin/skills/markitdown/scripts/convert_literature.py +262 -0
- package/bin/skills/markitdown/scripts/convert_with_ai.py +224 -0
- package/bin/skills/ml-paper-writing/SKILL.md +937 -0
- package/bin/skills/ml-paper-writing/references/checklists.md +361 -0
- package/bin/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- package/bin/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- package/bin/skills/ml-paper-writing/references/sources.md +159 -0
- package/bin/skills/ml-paper-writing/references/writing-guide.md +476 -0
- package/bin/skills/ml-paper-writing/templates/README.md +251 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- package/bin/skills/ml-paper-writing/templates/acl/README.md +50 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- package/bin/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- package/bin/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- package/bin/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- package/bin/skills/paper-2-web/SKILL.md +491 -0
- package/bin/skills/paper-2-web/references/installation.md +141 -0
- package/bin/skills/paper-2-web/references/paper2poster.md +346 -0
- package/bin/skills/paper-2-web/references/paper2video.md +305 -0
- package/bin/skills/paper-2-web/references/paper2web.md +187 -0
- package/bin/skills/paper-2-web/references/usage_examples.md +436 -0
- package/bin/skills/peer-review/SKILL.md +702 -0
- package/bin/skills/peer-review/references/calibration_guidelines.md +196 -0
- package/bin/skills/peer-review/references/common_issues.md +552 -0
- package/bin/skills/peer-review/references/paper_mechanics.md +269 -0
- package/bin/skills/peer-review/references/reporting_standards.md +290 -0
- package/bin/skills/peer-review/references/scoring_rubric.md +239 -0
- package/bin/skills/pptx-posters/SKILL.md +410 -0
- package/bin/skills/pptx-posters/assets/poster_html_template.html +257 -0
- package/bin/skills/pptx-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/pptx-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/pptx-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/pptx-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/research-grants/README.md +285 -0
- package/bin/skills/research-grants/SKILL.md +938 -0
- package/bin/skills/research-grants/assets/budget_justification_template.md +453 -0
- package/bin/skills/research-grants/assets/nih_specific_aims_template.md +166 -0
- package/bin/skills/research-grants/assets/nsf_project_summary_template.md +92 -0
- package/bin/skills/research-grants/references/broader_impacts.md +392 -0
- package/bin/skills/research-grants/references/darpa_guidelines.md +636 -0
- package/bin/skills/research-grants/references/doe_guidelines.md +586 -0
- package/bin/skills/research-grants/references/nih_guidelines.md +851 -0
- package/bin/skills/research-grants/references/nsf_guidelines.md +570 -0
- package/bin/skills/research-grants/references/specific_aims_guide.md +458 -0
- package/bin/skills/research-lookup/README.md +156 -0
- package/bin/skills/research-lookup/SKILL.md +606 -0
- package/bin/skills/research-lookup/examples.py +174 -0
- package/bin/skills/research-lookup/lookup.py +187 -0
- package/bin/skills/research-lookup/research_lookup.py +483 -0
- package/bin/skills/research-lookup/scripts/research_lookup.py +483 -0
- package/bin/skills/scholar-evaluation/SKILL.md +289 -0
- package/bin/skills/scholar-evaluation/references/evaluation_framework.md +663 -0
- package/bin/skills/scholar-evaluation/scripts/calculate_scores.py +366 -0
- package/bin/skills/scientific-critical-thinking/SKILL.md +566 -0
- package/bin/skills/scientific-critical-thinking/references/common_biases.md +364 -0
- package/bin/skills/scientific-critical-thinking/references/evidence_hierarchy.md +484 -0
- package/bin/skills/scientific-critical-thinking/references/experimental_design.md +496 -0
- package/bin/skills/scientific-critical-thinking/references/logical_fallacies.md +478 -0
- package/bin/skills/scientific-critical-thinking/references/scientific_method.md +169 -0
- package/bin/skills/scientific-critical-thinking/references/statistical_pitfalls.md +506 -0
- package/bin/skills/scientific-schematics/QUICK_REFERENCE.md +207 -0
- package/bin/skills/scientific-schematics/README.md +327 -0
- package/bin/skills/scientific-schematics/SKILL.md +615 -0
- package/bin/skills/scientific-schematics/example_usage.sh +89 -0
- package/bin/skills/scientific-schematics/references/best_practices.md +559 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic.py +135 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic_ai.py +807 -0
- package/bin/skills/scientific-schematics/test_ai_generation.py +243 -0
- package/bin/skills/scientific-slides/SKILL.md +942 -0
- package/bin/skills/scientific-slides/assets/timing_guidelines.md +597 -0
- package/bin/skills/scientific-slides/references/data_visualization_slides.md +708 -0
- package/bin/skills/scientific-slides/references/presentation_structure.md +642 -0
- package/bin/skills/scientific-slides/references/slide_design_principles.md +849 -0
- package/bin/skills/scientific-slides/references/talk_types_guide.md +687 -0
- package/bin/skills/scientific-slides/references/visual_review_workflow.md +775 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image.py +143 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image_ai.py +748 -0
- package/bin/skills/scientific-slides/scripts/pdf_to_images.py +201 -0
- package/bin/skills/scientific-slides/scripts/slides_to_pdf.py +220 -0
- package/bin/skills/scientific-slides/scripts/validate_presentation.py +367 -0
- package/bin/skills/scientific-writing/SKILL.md +714 -0
- package/bin/skills/scientific-writing/assets/REPORT_FORMATTING_GUIDE.md +574 -0
- package/bin/skills/scientific-writing/assets/scientific_report.sty +606 -0
- package/bin/skills/scientific-writing/assets/scientific_report_template.tex +449 -0
- package/bin/skills/scientific-writing/references/citation_styles.md +720 -0
- package/bin/skills/scientific-writing/references/figures_tables.md +806 -0
- package/bin/skills/scientific-writing/references/imrad_structure.md +686 -0
- package/bin/skills/scientific-writing/references/professional_report_formatting.md +664 -0
- package/bin/skills/scientific-writing/references/reporting_guidelines.md +748 -0
- package/bin/skills/scientific-writing/references/writing_principles.md +824 -0
- package/bin/skills/tinker/SKILL.md +2 -3
- package/bin/skills/together-ai/SKILL.md +722 -0
- package/bin/skills/treatment-plans/README.md +488 -0
- package/bin/skills/treatment-plans/SKILL.md +1579 -0
- package/bin/skills/treatment-plans/assets/STYLING_QUICK_REFERENCE.md +185 -0
- package/bin/skills/treatment-plans/assets/chronic_disease_management_plan.tex +665 -0
- package/bin/skills/treatment-plans/assets/general_medical_treatment_plan.tex +547 -0
- package/bin/skills/treatment-plans/assets/medical_treatment_plan.sty +222 -0
- package/bin/skills/treatment-plans/assets/mental_health_treatment_plan.tex +774 -0
- package/bin/skills/treatment-plans/assets/one_page_treatment_plan.tex +193 -0
- package/bin/skills/treatment-plans/assets/pain_management_plan.tex +799 -0
- package/bin/skills/treatment-plans/assets/perioperative_care_plan.tex +753 -0
- package/bin/skills/treatment-plans/assets/quality_checklist.md +471 -0
- package/bin/skills/treatment-plans/assets/rehabilitation_treatment_plan.tex +756 -0
- package/bin/skills/treatment-plans/references/goal_setting_frameworks.md +411 -0
- package/bin/skills/treatment-plans/references/intervention_guidelines.md +507 -0
- package/bin/skills/treatment-plans/references/regulatory_compliance.md +476 -0
- package/bin/skills/treatment-plans/references/specialty_specific_guidelines.md +655 -0
- package/bin/skills/treatment-plans/references/treatment_plan_standards.md +485 -0
- package/bin/skills/treatment-plans/scripts/check_completeness.py +322 -0
- package/bin/skills/treatment-plans/scripts/generate_template.py +233 -0
- package/bin/skills/treatment-plans/scripts/timeline_generator.py +385 -0
- package/bin/skills/treatment-plans/scripts/validate_treatment_plan.py +369 -0
- package/bin/skills/unsloth/SKILL.md +565 -47
- package/bin/skills/unsloth/docs/advanced-rl.md +222 -0
- package/bin/skills/unsloth/docs/chat-templates.md +141 -0
- package/bin/skills/unsloth/docs/datasets.md +489 -0
- package/bin/skills/unsloth/docs/docker-extended.md +99 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-2.0.md +116 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-aider.md +118 -0
- package/bin/skills/unsloth/docs/faq.md +91 -0
- package/bin/skills/unsloth/docs/fp16-vs-bf16.md +61 -0
- package/bin/skills/unsloth/docs/fp8-rl.md +224 -0
- package/bin/skills/unsloth/docs/glm-4.7-flash.md +997 -0
- package/bin/skills/unsloth/docs/inference-deployment-overview.md +17 -0
- package/bin/skills/unsloth/docs/inference.md +27 -0
- package/bin/skills/unsloth/docs/installation-docker.md +155 -0
- package/bin/skills/unsloth/docs/installation-pip.md +148 -0
- package/bin/skills/unsloth/docs/kernels-packing.md +190 -0
- package/bin/skills/unsloth/docs/kimi-k2.5.md +634 -0
- package/bin/skills/unsloth/docs/lm-studio.md +235 -0
- package/bin/skills/unsloth/docs/lora-hot-swapping.md +75 -0
- package/bin/skills/unsloth/docs/lora-hyperparameters.md +363 -0
- package/bin/skills/unsloth/docs/memory-efficient-rl.md +267 -0
- package/bin/skills/unsloth/docs/model-selection.md +70 -0
- package/bin/skills/unsloth/docs/models.md +532 -0
- package/bin/skills/unsloth/docs/multi-gpu-ddp.md +90 -0
- package/bin/skills/unsloth/docs/notebooks.md +223 -0
- package/bin/skills/unsloth/docs/overview.md +110 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next-extended.md +900 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next.md +900 -0
- package/bin/skills/unsloth/docs/requirements.md +45 -0
- package/bin/skills/unsloth/docs/reward-hacking.md +25 -0
- package/bin/skills/unsloth/docs/saving-to-gguf.md +138 -0
- package/bin/skills/unsloth/docs/saving-to-ollama.md +46 -0
- package/bin/skills/unsloth/docs/sglang-guide.md +278 -0
- package/bin/skills/unsloth/docs/speculative-decoding.md +70 -0
- package/bin/skills/unsloth/docs/tool-calling.md +334 -0
- package/bin/skills/unsloth/docs/troubleshooting-faq.md +204 -0
- package/bin/skills/unsloth/docs/troubleshooting-inference.md +26 -0
- package/bin/skills/unsloth/docs/tts-fine-tuning.md +149 -0
- package/bin/skills/unsloth/docs/tutorial-grpo.md +273 -0
- package/bin/skills/unsloth/docs/tutorial-llama3-ollama.md +356 -0
- package/bin/skills/unsloth/docs/vision-fine-tuning.md +135 -0
- package/bin/skills/unsloth/docs/vision-rl.md +170 -0
- package/bin/skills/unsloth/docs/vllm-engine-arguments.md +43 -0
- package/bin/skills/unsloth/docs/vllm-guide.md +98 -0
- package/bin/skills/venue-templates/SKILL.md +686 -0
- package/bin/skills/venue-templates/assets/examples/cell_summary_example.md +247 -0
- package/bin/skills/venue-templates/assets/examples/medical_structured_abstract.md +313 -0
- package/bin/skills/venue-templates/assets/examples/nature_abstract_examples.md +213 -0
- package/bin/skills/venue-templates/assets/examples/neurips_introduction_example.md +245 -0
- package/bin/skills/venue-templates/assets/grants/nih_specific_aims.tex +235 -0
- package/bin/skills/venue-templates/assets/grants/nsf_proposal_template.tex +375 -0
- package/bin/skills/venue-templates/assets/journals/nature_article.tex +171 -0
- package/bin/skills/venue-templates/assets/journals/neurips_article.tex +283 -0
- package/bin/skills/venue-templates/assets/journals/plos_one.tex +317 -0
- package/bin/skills/venue-templates/assets/posters/beamerposter_academic.tex +311 -0
- package/bin/skills/venue-templates/references/cell_press_style.md +483 -0
- package/bin/skills/venue-templates/references/conferences_formatting.md +564 -0
- package/bin/skills/venue-templates/references/cs_conference_style.md +463 -0
- package/bin/skills/venue-templates/references/grants_requirements.md +787 -0
- package/bin/skills/venue-templates/references/journals_formatting.md +486 -0
- package/bin/skills/venue-templates/references/medical_journal_styles.md +535 -0
- package/bin/skills/venue-templates/references/ml_conference_style.md +556 -0
- package/bin/skills/venue-templates/references/nature_science_style.md +405 -0
- package/bin/skills/venue-templates/references/posters_guidelines.md +628 -0
- package/bin/skills/venue-templates/references/reviewer_expectations.md +417 -0
- package/bin/skills/venue-templates/references/venue_writing_styles.md +321 -0
- package/bin/skills/venue-templates/scripts/customize_template.py +195 -0
- package/bin/skills/venue-templates/scripts/query_template.py +266 -0
- package/bin/skills/venue-templates/scripts/validate_format.py +250 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/unsloth/references/index.md +0 -7
- package/bin/skills/unsloth/references/llms-full.md +0 -16799
- package/bin/skills/unsloth/references/llms-txt.md +0 -12044
- package/bin/skills/unsloth/references/llms.md +0 -82
@@ -0,0 +1,45 @@ package/bin/skills/unsloth/docs/requirements.md
# Unsloth Requirements

## System Requirements

* **Operating System**: Works on Linux and [Windows](https://docs.unsloth.ai/get-started/install-and-update/windows-installation)
* Supports NVIDIA GPUs from 2018 onward, including [Blackwell RTX 50](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth)
* Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20 & 50, A100, H100, L40, etc.) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070 and 1080 work, but are slow.
* The official [Unsloth Docker image](https://hub.docker.com/r/unsloth/unsloth) `unsloth/unsloth` is available on Docker Hub; see the [Docker installation guide](https://unsloth.ai/docs/get-started/install/docker)
* Unsloth works on [AMD](https://unsloth.ai/docs/get-started/fine-tuning-for-beginners/broken-reference) and [Intel](https://github.com/unslothai/unsloth/pull/2621) GPUs! Apple Silicon/MLX support is in the works
* If you have different versions of torch, transformers, etc., `pip install unsloth` will automatically install the latest versions of those libraries, so you don't need to worry about version compatibility.
* Your device should have `xformers`, `torch`, `bitsandbytes`, and `triton` support.

{% hint style="info" %}
Python 3.13 is now supported!
{% endhint %}

## Fine-tuning VRAM Requirements

How much GPU memory do I need for LLM fine-tuning using Unsloth?

{% hint style="info" %}
A common cause of OOM (out-of-memory) errors is setting the batch size too high. Set it to 1, 2, or 3 to use less VRAM (a minimal sketch follows this hint).

**For context length benchmarks, see** [**here**](https://unsloth.ai/docs/basics/unsloth-benchmarks#context-length-benchmarks)**.**
{% endhint %}
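For example, a minimal sketch with TRL's `SFTConfig` (the trainer configuration used in most Unsloth notebooks); the exact numbers are illustrative, and gradient accumulation is one common way to keep the effective batch size up while lowering VRAM use:

```python
from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size = 1,  # lower this first if you hit OOM
    gradient_accumulation_steps = 4,  # effective batch size stays 1 * 4 = 4
    max_steps = 60,
    output_dir = "outputs",
)
```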
Check this table for VRAM requirements, sorted by model parameters and fine-tuning method. QLoRA uses 4-bit weights, LoRA uses 16-bit. Keep in mind that some models may need more VRAM, so these numbers are the absolute minimum:

| Model parameters | QLoRA (4-bit) VRAM | LoRA (16-bit) VRAM |
| ---------------- | ------------------ | ------------------ |
| 3B               | 3.5 GB             | 8 GB               |
| 7B               | 5 GB               | 19 GB              |
| 8B               | 6 GB               | 22 GB              |
| 9B               | 6.5 GB             | 24 GB              |
| 11B              | 7.5 GB             | 29 GB              |
| 14B              | 8.5 GB             | 33 GB              |
| 27B              | 22 GB              | 64 GB              |
| 32B              | 26 GB              | 76 GB              |
| 40B              | 30 GB              | 96 GB              |
| 70B              | 41 GB              | 164 GB             |
| 81B              | 48 GB              | 192 GB             |
| 90B              | 53 GB              | 212 GB             |
@@ -0,0 +1,25 @@ package/bin/skills/unsloth/docs/reward-hacking.md
# RL Reward Hacking

The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric). But RL can **cheat.** When the RL algorithm learns a trick or exploits something to increase the reward without actually doing the task at hand, this is called "**Reward Hacking**".

It's the reason models learn to modify unit tests to pass coding challenges, and it is a critical blocker for real-world deployment. [Wikipedia](https://en.wikipedia.org/wiki/Reward_hacking) has some other good examples.

<div align="center"><figure><img src="https://i.pinimg.com/originals/55/e0/1b/55e01b94a9c5546b61b59ae300811c83.gif" alt="" width="188"><figcaption></figcaption></figure></div>

**Can you counter reward hacking? Yes!** In our [free gpt-oss RL notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-\(20B\)-GRPO.ipynb) we explore how to counter reward hacking in a code generation setting and showcase tangible solutions to common error modes. We saw the model edit the timing function, outsource to other libraries, cache the results, and outright cheat. After countering these, our model generates genuinely optimized matrix multiplication kernels, not clever cheats.

## :trophy: Reward Hacking Overview

Some common examples of reward hacking during RL include:

#### Laziness

RL learns to call NumPy, Torch, and other libraries, which invoke optimized CUDA kernels. We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports non-standard Python libraries (see the sketch below).
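A hypothetical sketch of such a check; the whitelist and function name are illustrative, not Unsloth's actual implementation:

```python
import ast

ALLOWED_MODULES = {"math", "itertools"}  # hypothetical whitelist of permitted imports

def uses_banned_imports(code: str) -> bool:
    """Return True if the generated code imports a non-whitelisted library."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] not in ALLOWED_MODULES for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] not in ALLOWED_MODULES:
                return True
    return False

print(uses_banned_imports("import numpy as np"))  # True -> zero out the reward
```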
#### Caching & Cheating

RL learns to cache the output, and it also learns to find the actual expected output by inspecting Python global variables.

We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix, as sketched below. We also have to benchmark carefully with multiple loops and turns.
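A rough sketch of that idea, assuming a CUDA device; the buffer size and names are illustrative:

```python
import torch

def wipe_gpu_cache(size_mb: int = 256):
    """Thrash GPU memory with a throwaway matrix so earlier results can't be reused."""
    n = int((size_mb * 1024 * 1024 / 4) ** 0.5)  # side length for ~size_mb of fp32
    fake = torch.randn(n, n, device = "cuda")
    waste = fake @ fake                          # force the memory to actually be touched
    del fake, waste
    torch.cuda.empty_cache()

wipe_gpu_cache()  # call between benchmark loops before re-timing candidate code
```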
#### Cheating
@@ -0,0 +1,138 @@ package/bin/skills/unsloth/docs/saving-to-gguf.md
# Saving to GGUF

## Locally

To save to GGUF, use the below to save locally:

```python
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("directory", tokenizer, quantization_method = "f16")
```

To push to the Hugging Face Hub:

```python
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/directory", tokenizer, quantization_method = "q8_0")
```

All supported quantization options for `quantization_method` are listed below:

```python
# https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp#L19
# From https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html
ALLOWED_QUANTS = \
{
    "not_quantized"  : "Recommended. Fast conversion. Slow inference, big files.",
    "fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
    "quantized"      : "Recommended. Slow conversion. Fast inference, small files.",
    "f32"     : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
    "f16"     : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
    "q8_0"    : "Fast conversion. High resource use, but generally acceptable.",
    "q4_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
    "q5_k_m"  : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
    "q2_k"    : "Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.",
    "q3_k_l"  : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_m"  : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
    "q3_k_s"  : "Uses Q3_K for all tensors",
    "q4_0"    : "Original quant method, 4-bit.",
    "q4_1"    : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
    "q4_k_s"  : "Uses Q4_K for all tensors",
    "q4_k"    : "alias for q4_k_m",
    "q5_k"    : "alias for q5_k_m",
    "q5_0"    : "Higher accuracy, higher resource usage and slower inference.",
    "q5_1"    : "Even higher accuracy, resource usage and slower inference.",
    "q5_k_s"  : "Uses Q5_K for all tensors",
    "q6_k"    : "Uses Q8_K for all tensors",
    "iq2_xxs" : "2.06 bpw quantization",
    "iq2_xs"  : "2.31 bpw quantization",
    "iq3_xxs" : "3.06 bpw quantization",
    "q3_k_xs" : "3-bit extra small quantization",
}
```

## Manual Saving

First save your model to 16-bit:

```python
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
```

Then use the terminal and do:

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

python llama.cpp/convert_hf_to_gguf.py FOLDER --outfile OUTPUT --outtype f16
```

Or follow the steps at <https://rentry.org/llama-cpp-conversions#merging-loras-into-a-model> using the model name "merged_model" to merge to GGUF.

### Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

You might sometimes encounter an issue where your model runs and produces good results in Unsloth, but on another platform like Ollama or vLLM the results are poor, or you get gibberish, endless/infinite generations, or repeated outputs.

* The most common cause of this error is using an **incorrect chat template**. It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template (see the sketch after this list).
* You must use the correct `eos token`. If not, you might get gibberish on longer generations.
* It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, omits a required one), so make sure you check both hypotheses!
* **Use our conversational notebooks to force the chat template - this will fix most issues.**
  * Qwen-3 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb)
  * Gemma-3 4B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B).ipynb)
  * Llama-3.2 3B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
  * Phi-4 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
  * Mistral v0.3 7B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)
* **More notebooks in our [notebooks docs](https://unsloth.ai/docs/get-started/unsloth-notebooks)**
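A minimal sketch of checking the template and EOS token outside Unsloth, using the standard Hugging Face `apply_chat_template` API (the model name here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize = False, add_generation_prompt = True,
)
print(prompt)               # compare this string with what your inference engine sends
print(tokenizer.eos_token)  # make sure the engine stops on the same token
```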
### Saving to GGUF / vLLM 16-bit crashes

You can try reducing the maximum GPU usage during saving by changing `maximum_memory_usage`.

The default is `model.save_pretrained(..., maximum_memory_usage = 0.75)`. Reduce it to, say, 0.5 to use 50% of GPU peak memory or lower. This can reduce OOM crashes during saving.
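For example, assuming the GGUF saving call accepts the same argument (a sketch, not a verified signature for every Unsloth version):

```python
model.save_pretrained_gguf(
    "directory", tokenizer,
    quantization_method = "q4_k_m",
    maximum_memory_usage = 0.5,  # cap saving at ~50% of peak GPU memory
)
```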
### How do I manually save to GGUF?

First save your model to 16-bit via:

```python
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
```

Compile llama.cpp from source like below:

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

Then, save the model to F16:

```bash
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-F16.gguf --outtype f16 \
    --split-max-size 50G
```

```bash
# For BF16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 \
    --split-max-size 50G

# For Q8_0:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 \
    --split-max-size 50G
```
@@ -0,0 +1,46 @@ package/bin/skills/unsloth/docs/saving-to-ollama.md
# Saving to Ollama

See our [Tutorial: How to Finetune Llama-3 and Use in Ollama](tutorial-llama3-ollama.md) for the complete process of saving to [Ollama](https://github.com/ollama/ollama).

### Saving on Google Colab

You can save the finetuned model as a small 100MB file called a LoRA adapter. You can also push it to the Hugging Face Hub if you want to upload your model! Remember to get a Hugging Face token via <https://huggingface.co/settings/tokens> and add your token!

After saving the model, we can again use Unsloth to run the model itself! Use `FastLanguageModel` again to call it for inference, as in the sketch below.
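A minimal sketch of that round trip; the local folder name is just an example:

```python
from unsloth import FastLanguageModel

# Save the LoRA adapter locally (or use push_to_hub to upload it instead):
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# Reload it later for inference:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path
```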
### Exporting to Ollama

Finally, we can export our finetuned model to Ollama itself! First we have to install Ollama in the Colab notebook.

Then we export the finetuned model to llama.cpp's GGUF formats.

Remember to change `False` to `True` for only 1 row, rather than every row, or else you'll be waiting for a very long time! We normally suggest setting the first row to `True`, so we can quickly export the finetuned model to the `Q8_0` format (8-bit quantization). We also allow you to export to a whole list of quantization methods, a popular one being `q4_k_m`.

Head over to <https://github.com/ggerganov/llama.cpp> to learn more about GGUF. We also have manual instructions for exporting to GGUF here: <https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf>

You will see a long list of text - please wait 5 to 10 minutes!

### Automatic `Modelfile` creation

The trick Unsloth provides is that we automatically create a `Modelfile`, which Ollama requires! This is just a list of settings, and it includes the chat template we used for the finetune process! You can also print the generated `Modelfile`.

We then ask Ollama to create an Ollama-compatible model by using the `Modelfile`, as shown below.
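In a terminal (or prefixed with `!` in Colab), that step looks roughly like this; `unsloth_model` and the `Modelfile` path are example names:

```bash
ollama create unsloth_model -f ./model/Modelfile
ollama run unsloth_model "What is 2+2?"
```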
### Ollama Inference

We can now call the model for inference, calling the Ollama server itself, which runs on your own local machine / in the free Colab notebook in the background.

### Running in Unsloth works well, but after exporting & running on Ollama, the results are poor

You might sometimes encounter an issue where your model runs and produces good results in Unsloth, but on another platform like Ollama the results are poor, or you get gibberish, endless/infinite generations, or repeated outputs.

* The most common cause of this error is using an **incorrect chat template**. It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.
* You must use the correct `eos token`. If not, you might get gibberish on longer generations.
* It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, omits a required one), so make sure you check both hypotheses!
* **Use our conversational notebooks to force the chat template - this will fix most issues.**
  * Qwen-3 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb)
  * Gemma-3 4B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B).ipynb)
  * Llama-3.2 3B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
  * Phi-4 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
  * Mistral v0.3 7B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)
* **More notebooks in our [notebooks docs](https://unsloth.ai/docs/get-started/unsloth-notebooks)**
@@ -0,0 +1,278 @@ package/bin/skills/unsloth/docs/sglang-guide.md
# SGLang Deployment & Inference Guide

You can serve any LLM or fine-tuned model via [SGLang](https://github.com/sgl-project/sglang) for low-latency, high-throughput inference. SGLang supports text and image/video model inference on any GPU setup, with support for some GGUFs.

### Installing SGLang

To install SGLang and Unsloth on NVIDIA GPUs, you can use the below in a virtual environment (which won't break your other Python libraries):

```bash
# OPTIONAL: use a virtual environment
python -m venv unsloth_env
source unsloth_env/bin/activate

# Install Rust and outlines-core prerequisites, then SGLang
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env && sudo apt-get install -y pkg-config libssl-dev
pip install --upgrade pip && pip install uv
uv pip install "sglang" && uv pip install unsloth
```

For **Docker** setups, run:

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

### Debugging SGLang installation issues

If you see the below, update Rust and outlines-core as specified above:

```
hint: This usually indicates a problem with the package or the build environment.
help: `outlines-core` (v0.1.26) was included because `sglang` (v0.5.5.post2) depends on `outlines` (v0.1.11) which depends on `outlines-core`
```

If you see a Flashinfer issue like below:

```
/home/daniel/.cache/flashinfer/...batch_prefill_ragged_kernel_mask_1.cu:1:10: fatal error: flashinfer/attention/prefill.cuh: No such file or directory
```

Remove the flashinfer cache via `rm -rf .cache/flashinfer` and also `rm -rf ~/.cache/flashinfer`.

### Deploying SGLang models

To deploy any model, for example [unsloth/Llama-3.2-1B-Instruct](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct), run the below in a separate terminal:

```bash
python3 -m sglang.launch_server \
    --model-path unsloth/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 --port 30000
```

You can then use the OpenAI chat completions library to call the model (in another terminal or using tmux):

```python
# Install openai via pip install openai
from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30000/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Llama-3.2-1B-Instruct",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
```

And you will get `2 + 2 = 4.`
### Deploying Unsloth finetunes in SGLang

After fine-tuning or using our notebooks, you can save or deploy your models directly through SGLang within a single workflow. An example Unsloth finetuning script:

```python
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(model)
```

**To save to 16-bit for SGLang, use:**

```python
model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
## OR to upload to HuggingFace:
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
```

**To save just the LoRA adapters**, either use:

```python
model.save_pretrained("finetuned_model")
tokenizer.save_pretrained("finetuned_model")
```

Or just use our built-in function to do that:

```python
model.save_pretrained_merged("model", tokenizer, save_method = "lora")
## OR to upload to HuggingFace
model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
```

### gpt-oss-20b: Unsloth & SGLang Deployment Guide

Below is a step-by-step tutorial for training gpt-oss-20b with Unsloth and deploying it with SGLang. It includes performance benchmarks across multiple quantization formats.

#### Step 1: Unsloth Fine-tuning and Exporting Formats

After training, you can export the model in multiple formats:

```python
model.save_pretrained_merged(
    "finetuned_model",
    tokenizer,
    save_method = "merged_16bit",
)
## For gpt-oss specific mxfp4 conversions:
model.save_pretrained_merged(
    "finetuned_model",
    tokenizer,
    save_method = "mxfp4",  # (ONLY FOR gpt-oss, otherwise choose "merged_16bit")
)
```

#### Step 2: Deployment with SGLang

We saved our gpt-oss finetune to the folder "finetuned_model", so in a new terminal we can launch the finetuned model as an inference endpoint with SGLang:

```bash
python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002
```

You might have to wait a bit on `Capturing batches (bs=1 avail_mem=20.84 GB):`!

#### Step 3: Calling the inference endpoint

To call the inference endpoint, first launch a new terminal. We can then call the model like below:

```python
from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://0.0.0.0:30002/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "finetuned_model",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

## OUTPUT ##
# 2 + 2 equals 4.
```
### FP8 Online Quantization

To deploy models with FP8 online quantization, which allows 30 to 50% more throughput, 50% less memory usage, and 2x longer context length support in SGLang:

```bash
python -m sglang.launch_server \
    --model-path unsloth/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 --port 30002 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3
```

You can also use `--kv-cache-dtype fp8_e5m2`, which has a larger dynamic range and might solve FP8 inference issues if you see them. Or use our pre-quantized float8 quants listed at <https://huggingface.co/unsloth/models?search=-fp8>.

### Benchmarking SGLang

Below is some code you can run to test the performance of your finetuned model:

```bash
python -m sglang.launch_server \
    --model-path finetuned_model \
    --host 0.0.0.0 --port 30002
```

Then in another terminal or via tmux:

```bash
# Batch Size=8, Input=1024, Output=1024
python -m sglang.bench_one_batch_server \
    --model finetuned_model \
    --base-url http://0.0.0.0:30002 \
    --batch-size 8 \
    --input-len 1024 \
    --output-len 1024
```

We used 1x B200 GPU with gpt-oss-20b and got the below results (~2,500 tokens/s output throughput):

| Batch/Input/Output | TTFT (s) | ITL (s) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| 8/1024/1024 | 0.40 | 3.59 | 20,718.95 | 2,562.87 |
| 8/8192/1024 | 0.42 | 3.74 | 154,459.01 | 2,473.84 |

See <https://docs.sglang.ai/advanced_features/server_arguments.html> for SGLang's server arguments.

### SGLang Interactive Offline Mode

You can also use SGLang in offline mode (i.e. not as a server) inside a Python interactive environment.

```python
import sglang as sgl
engine = sgl.Engine(model_path = "unsloth/Qwen3-0.6B", random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
```

### GGUFs in SGLang

SGLang also, interestingly, supports GGUFs! **Qwen3 MoE is still under construction, but most dense models (Llama 3, Qwen 3, Mistral, etc.) are supported.**

First install the latest gguf Python package via:

```bash
pip install -e "git+https://github.com/ggml-org/llama.cpp.git#egg=gguf&subdirectory=gguf-py"
```

Then, for example in SGLang's offline mode, you can do:

```python
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    "unsloth/Qwen3-32B-GGUF",
    filename = "Qwen3-32B-UD-Q4_K_XL.gguf",
)
import sglang as sgl
engine = sgl.Engine(model_path = model_path, random_seed = 42)

prompt = "Today is a sunny day and I like"
sampling_params = {"temperature": 0, "max_new_tokens": 256}
outputs = engine.generate(prompt, sampling_params)["text"]
print(outputs)
engine.shutdown()
```

### High throughput GGUF serving with SGLang

First download the specific GGUF file like below:

```python
from huggingface_hub import hf_hub_download
hf_hub_download("unsloth/Qwen3-32B-GGUF", filename="Qwen3-32B-UD-Q4_K_XL.gguf", local_dir=".")
```

Then serve the specific file `Qwen3-32B-UD-Q4_K_XL.gguf`, set `--served-model-name unsloth/Qwen3-32B`, and point `--tokenizer-path` at the Hugging Face-compatible tokenizer:

```bash
python -m sglang.launch_server \
    --model-path Qwen3-32B-UD-Q4_K_XL.gguf \
    --host 0.0.0.0 --port 30002 \
    --served-model-name unsloth/Qwen3-32B \
    --tokenizer-path unsloth/Qwen3-32B
```
@@ -0,0 +1,70 @@ package/bin/skills/unsloth/docs/speculative-decoding.md
# Speculative Decoding

## Speculative Decoding in llama.cpp, llama-server

Speculative decoding in llama.cpp can be enabled in both `llama-cli` and `llama-server` via the `--model-draft` argument. Note you must have a draft model, which is generally a smaller model, but it must use the same tokenizer.

### Spec Decoding for GLM 4.7

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Can sometimes rate limit, so set to 0 to disable
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/GLM-4.7-GGUF",
    local_dir = "unsloth/GLM-4.7-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],  # Dynamic 2-bit. Use "*UD-TQ1_0*" for Dynamic 1-bit
)
snapshot_download(
    repo_id = "unsloth/GLM-4.5-Air-GGUF",
    local_dir = "unsloth/GLM-4.5-Air-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # Dynamic 4-bit. Use "*UD-TQ1_0*" for Dynamic 1-bit
)
```

```bash
./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --threads -1 \
    --fit on \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --jinja
```

With speculative decoding using a draft model:

```bash
./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --model-draft unsloth/GLM-4.5-Air-GGUF/UD-Q4_K_XL/GLM-4.5-Air-UD-Q4_K_XL-00001-of-00002.gguf \
    --threads -1 \
    --fit on \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --ctx-size-draft 16384 \
    --jinja \
    --device CUDA0 \
    --device-draft CUDA0,CUDA1
```

Using llama-server:

```bash
./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --alias "unsloth/GLM-4.7" \
    --threads -1 \
    --fit on \
    --prio 3 \
    --temp 1.0 \
    --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001 \
    --jinja
```