@synsci/cli-darwin-x64 1.1.70 → 1.1.72
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/citation-management/SKILL.md +1109 -0
- package/bin/skills/citation-management/assets/bibtex_template.bib +264 -0
- package/bin/skills/citation-management/assets/citation_checklist.md +386 -0
- package/bin/skills/citation-management/references/bibtex_formatting.md +908 -0
- package/bin/skills/citation-management/references/citation_validation.md +794 -0
- package/bin/skills/citation-management/references/google_scholar_search.md +725 -0
- package/bin/skills/citation-management/references/metadata_extraction.md +870 -0
- package/bin/skills/citation-management/references/pubmed_search.md +839 -0
- package/bin/skills/citation-management/scripts/doi_to_bibtex.py +182 -0
- package/bin/skills/citation-management/scripts/extract_metadata.py +570 -0
- package/bin/skills/citation-management/scripts/format_bibtex.py +349 -0
- package/bin/skills/citation-management/scripts/search_google_scholar.py +251 -0
- package/bin/skills/citation-management/scripts/search_pubmed.py +348 -0
- package/bin/skills/citation-management/scripts/validate_citations.py +494 -0
- package/bin/skills/clinical-decision-support/README.md +129 -0
- package/bin/skills/clinical-decision-support/SKILL.md +506 -0
- package/bin/skills/clinical-decision-support/assets/biomarker_report_template.tex +380 -0
- package/bin/skills/clinical-decision-support/assets/clinical_pathway_template.tex +222 -0
- package/bin/skills/clinical-decision-support/assets/cohort_analysis_template.tex +359 -0
- package/bin/skills/clinical-decision-support/assets/color_schemes.tex +149 -0
- package/bin/skills/clinical-decision-support/assets/example_gbm_cohort.md +208 -0
- package/bin/skills/clinical-decision-support/assets/recommendation_strength_guide.md +328 -0
- package/bin/skills/clinical-decision-support/assets/treatment_recommendation_template.tex +529 -0
- package/bin/skills/clinical-decision-support/references/biomarker_classification.md +719 -0
- package/bin/skills/clinical-decision-support/references/clinical_decision_algorithms.md +604 -0
- package/bin/skills/clinical-decision-support/references/evidence_synthesis.md +840 -0
- package/bin/skills/clinical-decision-support/references/outcome_analysis.md +640 -0
- package/bin/skills/clinical-decision-support/references/patient_cohort_analysis.md +427 -0
- package/bin/skills/clinical-decision-support/references/treatment_recommendations.md +521 -0
- package/bin/skills/clinical-decision-support/scripts/biomarker_classifier.py +383 -0
- package/bin/skills/clinical-decision-support/scripts/build_decision_tree.py +417 -0
- package/bin/skills/clinical-decision-support/scripts/create_cohort_tables.py +509 -0
- package/bin/skills/clinical-decision-support/scripts/generate_survival_analysis.py +441 -0
- package/bin/skills/clinical-decision-support/scripts/validate_cds_document.py +326 -0
- package/bin/skills/clinical-reports/IMPLEMENTATION_SUMMARY.md +641 -0
- package/bin/skills/clinical-reports/README.md +236 -0
- package/bin/skills/clinical-reports/SKILL.md +1127 -0
- package/bin/skills/clinical-reports/assets/case_report_template.md +352 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_csr_template.md +353 -0
- package/bin/skills/clinical-reports/assets/clinical_trial_sae_template.md +359 -0
- package/bin/skills/clinical-reports/assets/consult_note_template.md +305 -0
- package/bin/skills/clinical-reports/assets/discharge_summary_template.md +453 -0
- package/bin/skills/clinical-reports/assets/hipaa_compliance_checklist.md +395 -0
- package/bin/skills/clinical-reports/assets/history_physical_template.md +305 -0
- package/bin/skills/clinical-reports/assets/lab_report_template.md +309 -0
- package/bin/skills/clinical-reports/assets/pathology_report_template.md +249 -0
- package/bin/skills/clinical-reports/assets/quality_checklist.md +338 -0
- package/bin/skills/clinical-reports/assets/radiology_report_template.md +318 -0
- package/bin/skills/clinical-reports/assets/soap_note_template.md +253 -0
- package/bin/skills/clinical-reports/references/case_report_guidelines.md +570 -0
- package/bin/skills/clinical-reports/references/clinical_trial_reporting.md +693 -0
- package/bin/skills/clinical-reports/references/data_presentation.md +530 -0
- package/bin/skills/clinical-reports/references/diagnostic_reports_standards.md +629 -0
- package/bin/skills/clinical-reports/references/medical_terminology.md +588 -0
- package/bin/skills/clinical-reports/references/patient_documentation.md +744 -0
- package/bin/skills/clinical-reports/references/peer_review_standards.md +585 -0
- package/bin/skills/clinical-reports/references/regulatory_compliance.md +577 -0
- package/bin/skills/clinical-reports/scripts/check_deidentification.py +332 -0
- package/bin/skills/clinical-reports/scripts/compliance_checker.py +78 -0
- package/bin/skills/clinical-reports/scripts/extract_clinical_data.py +97 -0
- package/bin/skills/clinical-reports/scripts/format_adverse_events.py +97 -0
- package/bin/skills/clinical-reports/scripts/generate_report_template.py +149 -0
- package/bin/skills/clinical-reports/scripts/terminology_validator.py +126 -0
- package/bin/skills/clinical-reports/scripts/validate_case_report.py +323 -0
- package/bin/skills/clinical-reports/scripts/validate_trial_report.py +88 -0
- package/bin/skills/fireworks-ai/SKILL.md +665 -0
- package/bin/skills/generate-image/SKILL.md +178 -0
- package/bin/skills/generate-image/scripts/generate_image.py +254 -0
- package/bin/skills/groq/SKILL.md +347 -0
- package/bin/skills/hypothesis-generation/SKILL.md +293 -0
- package/bin/skills/hypothesis-generation/assets/FORMATTING_GUIDE.md +672 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_generation.sty +307 -0
- package/bin/skills/hypothesis-generation/assets/hypothesis_report_template.tex +572 -0
- package/bin/skills/hypothesis-generation/references/experimental_design_patterns.md +329 -0
- package/bin/skills/hypothesis-generation/references/hypothesis_quality_criteria.md +198 -0
- package/bin/skills/hypothesis-generation/references/literature_search_strategies.md +622 -0
- package/bin/skills/latex-posters/README.md +417 -0
- package/bin/skills/latex-posters/SKILL.md +1602 -0
- package/bin/skills/latex-posters/assets/baposter_template.tex +257 -0
- package/bin/skills/latex-posters/assets/beamerposter_template.tex +244 -0
- package/bin/skills/latex-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/latex-posters/assets/tikzposter_template.tex +251 -0
- package/bin/skills/latex-posters/references/latex_poster_packages.md +745 -0
- package/bin/skills/latex-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/latex-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/latex-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/latex-posters/scripts/review_poster.sh +214 -0
- package/bin/skills/literature-review/SKILL.md +641 -0
- package/bin/skills/literature-review/assets/review_template.md +412 -0
- package/bin/skills/literature-review/references/citation_styles.md +166 -0
- package/bin/skills/literature-review/references/database_strategies.md +455 -0
- package/bin/skills/literature-review/scripts/generate_pdf.py +184 -0
- package/bin/skills/literature-review/scripts/search_databases.py +310 -0
- package/bin/skills/literature-review/scripts/verify_citations.py +218 -0
- package/bin/skills/market-research-reports/SKILL.md +904 -0
- package/bin/skills/market-research-reports/assets/FORMATTING_GUIDE.md +428 -0
- package/bin/skills/market-research-reports/assets/market_report_template.tex +1380 -0
- package/bin/skills/market-research-reports/assets/market_research.sty +564 -0
- package/bin/skills/market-research-reports/references/data_analysis_patterns.md +548 -0
- package/bin/skills/market-research-reports/references/report_structure_guide.md +999 -0
- package/bin/skills/market-research-reports/references/visual_generation_guide.md +1077 -0
- package/bin/skills/market-research-reports/scripts/generate_market_visuals.py +472 -0
- package/bin/skills/markitdown/INSTALLATION_GUIDE.md +318 -0
- package/bin/skills/markitdown/LICENSE.txt +22 -0
- package/bin/skills/markitdown/OPENROUTER_INTEGRATION.md +359 -0
- package/bin/skills/markitdown/QUICK_REFERENCE.md +309 -0
- package/bin/skills/markitdown/README.md +184 -0
- package/bin/skills/markitdown/SKILL.md +486 -0
- package/bin/skills/markitdown/SKILL_SUMMARY.md +307 -0
- package/bin/skills/markitdown/assets/example_usage.md +463 -0
- package/bin/skills/markitdown/references/api_reference.md +399 -0
- package/bin/skills/markitdown/references/file_formats.md +542 -0
- package/bin/skills/markitdown/scripts/batch_convert.py +195 -0
- package/bin/skills/markitdown/scripts/convert_literature.py +262 -0
- package/bin/skills/markitdown/scripts/convert_with_ai.py +224 -0
- package/bin/skills/ml-paper-writing/SKILL.md +937 -0
- package/bin/skills/ml-paper-writing/references/checklists.md +361 -0
- package/bin/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- package/bin/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- package/bin/skills/ml-paper-writing/references/sources.md +159 -0
- package/bin/skills/ml-paper-writing/references/writing-guide.md +476 -0
- package/bin/skills/ml-paper-writing/templates/README.md +251 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- package/bin/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- package/bin/skills/ml-paper-writing/templates/acl/README.md +50 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- package/bin/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- package/bin/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- package/bin/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- package/bin/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- package/bin/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- package/bin/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- package/bin/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- package/bin/skills/paper-2-web/SKILL.md +491 -0
- package/bin/skills/paper-2-web/references/installation.md +141 -0
- package/bin/skills/paper-2-web/references/paper2poster.md +346 -0
- package/bin/skills/paper-2-web/references/paper2video.md +305 -0
- package/bin/skills/paper-2-web/references/paper2web.md +187 -0
- package/bin/skills/paper-2-web/references/usage_examples.md +436 -0
- package/bin/skills/peer-review/SKILL.md +702 -0
- package/bin/skills/peer-review/references/calibration_guidelines.md +196 -0
- package/bin/skills/peer-review/references/common_issues.md +552 -0
- package/bin/skills/peer-review/references/paper_mechanics.md +269 -0
- package/bin/skills/peer-review/references/reporting_standards.md +290 -0
- package/bin/skills/peer-review/references/scoring_rubric.md +239 -0
- package/bin/skills/pptx-posters/SKILL.md +410 -0
- package/bin/skills/pptx-posters/assets/poster_html_template.html +257 -0
- package/bin/skills/pptx-posters/assets/poster_quality_checklist.md +358 -0
- package/bin/skills/pptx-posters/references/poster_content_guide.md +748 -0
- package/bin/skills/pptx-posters/references/poster_design_principles.md +806 -0
- package/bin/skills/pptx-posters/references/poster_layout_design.md +900 -0
- package/bin/skills/research-grants/README.md +285 -0
- package/bin/skills/research-grants/SKILL.md +938 -0
- package/bin/skills/research-grants/assets/budget_justification_template.md +453 -0
- package/bin/skills/research-grants/assets/nih_specific_aims_template.md +166 -0
- package/bin/skills/research-grants/assets/nsf_project_summary_template.md +92 -0
- package/bin/skills/research-grants/references/broader_impacts.md +392 -0
- package/bin/skills/research-grants/references/darpa_guidelines.md +636 -0
- package/bin/skills/research-grants/references/doe_guidelines.md +586 -0
- package/bin/skills/research-grants/references/nih_guidelines.md +851 -0
- package/bin/skills/research-grants/references/nsf_guidelines.md +570 -0
- package/bin/skills/research-grants/references/specific_aims_guide.md +458 -0
- package/bin/skills/research-lookup/README.md +156 -0
- package/bin/skills/research-lookup/SKILL.md +606 -0
- package/bin/skills/research-lookup/examples.py +174 -0
- package/bin/skills/research-lookup/lookup.py +187 -0
- package/bin/skills/research-lookup/research_lookup.py +483 -0
- package/bin/skills/research-lookup/scripts/research_lookup.py +483 -0
- package/bin/skills/scholar-evaluation/SKILL.md +289 -0
- package/bin/skills/scholar-evaluation/references/evaluation_framework.md +663 -0
- package/bin/skills/scholar-evaluation/scripts/calculate_scores.py +366 -0
- package/bin/skills/scientific-critical-thinking/SKILL.md +566 -0
- package/bin/skills/scientific-critical-thinking/references/common_biases.md +364 -0
- package/bin/skills/scientific-critical-thinking/references/evidence_hierarchy.md +484 -0
- package/bin/skills/scientific-critical-thinking/references/experimental_design.md +496 -0
- package/bin/skills/scientific-critical-thinking/references/logical_fallacies.md +478 -0
- package/bin/skills/scientific-critical-thinking/references/scientific_method.md +169 -0
- package/bin/skills/scientific-critical-thinking/references/statistical_pitfalls.md +506 -0
- package/bin/skills/scientific-schematics/QUICK_REFERENCE.md +207 -0
- package/bin/skills/scientific-schematics/README.md +327 -0
- package/bin/skills/scientific-schematics/SKILL.md +615 -0
- package/bin/skills/scientific-schematics/example_usage.sh +89 -0
- package/bin/skills/scientific-schematics/references/best_practices.md +559 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic.py +135 -0
- package/bin/skills/scientific-schematics/scripts/generate_schematic_ai.py +807 -0
- package/bin/skills/scientific-schematics/test_ai_generation.py +243 -0
- package/bin/skills/scientific-slides/SKILL.md +942 -0
- package/bin/skills/scientific-slides/assets/timing_guidelines.md +597 -0
- package/bin/skills/scientific-slides/references/data_visualization_slides.md +708 -0
- package/bin/skills/scientific-slides/references/presentation_structure.md +642 -0
- package/bin/skills/scientific-slides/references/slide_design_principles.md +849 -0
- package/bin/skills/scientific-slides/references/talk_types_guide.md +687 -0
- package/bin/skills/scientific-slides/references/visual_review_workflow.md +775 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image.py +143 -0
- package/bin/skills/scientific-slides/scripts/generate_slide_image_ai.py +748 -0
- package/bin/skills/scientific-slides/scripts/pdf_to_images.py +201 -0
- package/bin/skills/scientific-slides/scripts/slides_to_pdf.py +220 -0
- package/bin/skills/scientific-slides/scripts/validate_presentation.py +367 -0
- package/bin/skills/scientific-writing/SKILL.md +714 -0
- package/bin/skills/scientific-writing/assets/REPORT_FORMATTING_GUIDE.md +574 -0
- package/bin/skills/scientific-writing/assets/scientific_report.sty +606 -0
- package/bin/skills/scientific-writing/assets/scientific_report_template.tex +449 -0
- package/bin/skills/scientific-writing/references/citation_styles.md +720 -0
- package/bin/skills/scientific-writing/references/figures_tables.md +806 -0
- package/bin/skills/scientific-writing/references/imrad_structure.md +686 -0
- package/bin/skills/scientific-writing/references/professional_report_formatting.md +664 -0
- package/bin/skills/scientific-writing/references/reporting_guidelines.md +748 -0
- package/bin/skills/scientific-writing/references/writing_principles.md +824 -0
- package/bin/skills/tinker/SKILL.md +2 -3
- package/bin/skills/together-ai/SKILL.md +722 -0
- package/bin/skills/treatment-plans/README.md +488 -0
- package/bin/skills/treatment-plans/SKILL.md +1579 -0
- package/bin/skills/treatment-plans/assets/STYLING_QUICK_REFERENCE.md +185 -0
- package/bin/skills/treatment-plans/assets/chronic_disease_management_plan.tex +665 -0
- package/bin/skills/treatment-plans/assets/general_medical_treatment_plan.tex +547 -0
- package/bin/skills/treatment-plans/assets/medical_treatment_plan.sty +222 -0
- package/bin/skills/treatment-plans/assets/mental_health_treatment_plan.tex +774 -0
- package/bin/skills/treatment-plans/assets/one_page_treatment_plan.tex +193 -0
- package/bin/skills/treatment-plans/assets/pain_management_plan.tex +799 -0
- package/bin/skills/treatment-plans/assets/perioperative_care_plan.tex +753 -0
- package/bin/skills/treatment-plans/assets/quality_checklist.md +471 -0
- package/bin/skills/treatment-plans/assets/rehabilitation_treatment_plan.tex +756 -0
- package/bin/skills/treatment-plans/references/goal_setting_frameworks.md +411 -0
- package/bin/skills/treatment-plans/references/intervention_guidelines.md +507 -0
- package/bin/skills/treatment-plans/references/regulatory_compliance.md +476 -0
- package/bin/skills/treatment-plans/references/specialty_specific_guidelines.md +655 -0
- package/bin/skills/treatment-plans/references/treatment_plan_standards.md +485 -0
- package/bin/skills/treatment-plans/scripts/check_completeness.py +322 -0
- package/bin/skills/treatment-plans/scripts/generate_template.py +233 -0
- package/bin/skills/treatment-plans/scripts/timeline_generator.py +385 -0
- package/bin/skills/treatment-plans/scripts/validate_treatment_plan.py +369 -0
- package/bin/skills/unsloth/SKILL.md +565 -47
- package/bin/skills/unsloth/docs/advanced-rl.md +222 -0
- package/bin/skills/unsloth/docs/chat-templates.md +141 -0
- package/bin/skills/unsloth/docs/datasets.md +489 -0
- package/bin/skills/unsloth/docs/docker-extended.md +99 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-2.0.md +116 -0
- package/bin/skills/unsloth/docs/dynamic-ggufs-aider.md +118 -0
- package/bin/skills/unsloth/docs/faq.md +91 -0
- package/bin/skills/unsloth/docs/fp16-vs-bf16.md +61 -0
- package/bin/skills/unsloth/docs/fp8-rl.md +224 -0
- package/bin/skills/unsloth/docs/glm-4.7-flash.md +997 -0
- package/bin/skills/unsloth/docs/inference-deployment-overview.md +17 -0
- package/bin/skills/unsloth/docs/inference.md +27 -0
- package/bin/skills/unsloth/docs/installation-docker.md +155 -0
- package/bin/skills/unsloth/docs/installation-pip.md +148 -0
- package/bin/skills/unsloth/docs/kernels-packing.md +190 -0
- package/bin/skills/unsloth/docs/kimi-k2.5.md +634 -0
- package/bin/skills/unsloth/docs/lm-studio.md +235 -0
- package/bin/skills/unsloth/docs/lora-hot-swapping.md +75 -0
- package/bin/skills/unsloth/docs/lora-hyperparameters.md +363 -0
- package/bin/skills/unsloth/docs/memory-efficient-rl.md +267 -0
- package/bin/skills/unsloth/docs/model-selection.md +70 -0
- package/bin/skills/unsloth/docs/models.md +532 -0
- package/bin/skills/unsloth/docs/multi-gpu-ddp.md +90 -0
- package/bin/skills/unsloth/docs/notebooks.md +223 -0
- package/bin/skills/unsloth/docs/overview.md +110 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next-extended.md +900 -0
- package/bin/skills/unsloth/docs/qwen3-coder-next.md +900 -0
- package/bin/skills/unsloth/docs/requirements.md +45 -0
- package/bin/skills/unsloth/docs/reward-hacking.md +25 -0
- package/bin/skills/unsloth/docs/saving-to-gguf.md +138 -0
- package/bin/skills/unsloth/docs/saving-to-ollama.md +46 -0
- package/bin/skills/unsloth/docs/sglang-guide.md +278 -0
- package/bin/skills/unsloth/docs/speculative-decoding.md +70 -0
- package/bin/skills/unsloth/docs/tool-calling.md +334 -0
- package/bin/skills/unsloth/docs/troubleshooting-faq.md +204 -0
- package/bin/skills/unsloth/docs/troubleshooting-inference.md +26 -0
- package/bin/skills/unsloth/docs/tts-fine-tuning.md +149 -0
- package/bin/skills/unsloth/docs/tutorial-grpo.md +273 -0
- package/bin/skills/unsloth/docs/tutorial-llama3-ollama.md +356 -0
- package/bin/skills/unsloth/docs/vision-fine-tuning.md +135 -0
- package/bin/skills/unsloth/docs/vision-rl.md +170 -0
- package/bin/skills/unsloth/docs/vllm-engine-arguments.md +43 -0
- package/bin/skills/unsloth/docs/vllm-guide.md +98 -0
- package/bin/skills/venue-templates/SKILL.md +686 -0
- package/bin/skills/venue-templates/assets/examples/cell_summary_example.md +247 -0
- package/bin/skills/venue-templates/assets/examples/medical_structured_abstract.md +313 -0
- package/bin/skills/venue-templates/assets/examples/nature_abstract_examples.md +213 -0
- package/bin/skills/venue-templates/assets/examples/neurips_introduction_example.md +245 -0
- package/bin/skills/venue-templates/assets/grants/nih_specific_aims.tex +235 -0
- package/bin/skills/venue-templates/assets/grants/nsf_proposal_template.tex +375 -0
- package/bin/skills/venue-templates/assets/journals/nature_article.tex +171 -0
- package/bin/skills/venue-templates/assets/journals/neurips_article.tex +283 -0
- package/bin/skills/venue-templates/assets/journals/plos_one.tex +317 -0
- package/bin/skills/venue-templates/assets/posters/beamerposter_academic.tex +311 -0
- package/bin/skills/venue-templates/references/cell_press_style.md +483 -0
- package/bin/skills/venue-templates/references/conferences_formatting.md +564 -0
- package/bin/skills/venue-templates/references/cs_conference_style.md +463 -0
- package/bin/skills/venue-templates/references/grants_requirements.md +787 -0
- package/bin/skills/venue-templates/references/journals_formatting.md +486 -0
- package/bin/skills/venue-templates/references/medical_journal_styles.md +535 -0
- package/bin/skills/venue-templates/references/ml_conference_style.md +556 -0
- package/bin/skills/venue-templates/references/nature_science_style.md +405 -0
- package/bin/skills/venue-templates/references/posters_guidelines.md +628 -0
- package/bin/skills/venue-templates/references/reviewer_expectations.md +417 -0
- package/bin/skills/venue-templates/references/venue_writing_styles.md +321 -0
- package/bin/skills/venue-templates/scripts/customize_template.py +195 -0
- package/bin/skills/venue-templates/scripts/query_template.py +266 -0
- package/bin/skills/venue-templates/scripts/validate_format.py +250 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/unsloth/references/index.md +0 -7
- package/bin/skills/unsloth/references/llms-full.md +0 -16799
- package/bin/skills/unsloth/references/llms-txt.md +0 -12044
- package/bin/skills/unsloth/references/llms.md +0 -82
@@ -0,0 +1,334 @@

# Tool Calling Guide for Local LLMs

Tool calling is when an LLM is allowed to trigger specific functions (like "search my files," "run a calculator," or "call an API") by emitting a structured request instead of guessing the answer in text. You use tool calls because they make outputs **more reliable and up-to-date**, and they let the model **take real actions** (query systems, validate facts, enforce schemas) rather than hallucinating.
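Concretely, the structured request arrives as a `tool_calls` entry on an assistant message in the OpenAI chat format. The sketch below shows its general shape; the `add_number` function and its arguments are illustrative placeholders, not part of any fixed API:

```python
import json

# Illustrative shape of an assistant message carrying a tool call.
# The function name "add_number" and its arguments are hypothetical.
tool_call_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "add_number",
            # Arguments arrive as a JSON-encoded string, not a dict.
            "arguments": json.dumps({"a": "3", "b": "4"}),
        },
    }],
}

# The caller decodes the argument string and dispatches to the real function.
args = json.loads(tool_call_message["tool_calls"][0]["function"]["arguments"])
```

The key detail is that `arguments` is a string of JSON, so every consumer has to `json.loads` it before calling the underlying function.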
In this tutorial, you will learn how to use tool calling with local LLMs, with mathematical, story, Python-code and terminal function examples. Inference is done locally via llama.cpp's llama-server, which exposes an OpenAI-compatible endpoint.

Our guide should work for nearly any model, including:

* **Qwen3-Coder-Next**, Qwen3-Coder, and other **Qwen** models
* **GLM-4.7**, GLM-4.7-Flash and **Kimi K2.5**, Kimi K2 Thinking
* **DeepSeek-V3.1**, DeepSeek-V3.2 and **MiniMax**
* **gpt-oss**, **NVIDIA Nemotron 3 Nano** and **Devstral 2**

## Tool Calling Setup

Our first step is to obtain the latest `llama.cpp` from GitHub and build it with the instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```
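Once built, `llama-server` must be running before the Python client further below can connect to it. A typical invocation might look like the following sketch; `model.gguf` is a placeholder for whichever model you downloaded, and the exact flags (context size, GPU offload) are up to your hardware:

```shell
# Serve a local GGUF on the port the OpenAI client uses later (8001).
# model.gguf is a placeholder path. --jinja enables the model's chat
# template so tool calls are formatted and parsed correctly.
./llama.cpp/llama-server \
    --model model.gguf \
    --port 8001 \
    --jinja \
    -ngl 99   # offload all layers to the GPU; drop this for CPU-only
```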
In a new terminal, we create some tools, such as adding two numbers, executing Python code, running terminal commands and much more:

```python
import json, subprocess, random
from typing import Any

def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)

def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)

def subtract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)

def write_a_story() -> str:
    return random.choice([
        "A long time ago in a galaxy far far away...",
        "There were 2 friends who loved sloths and code...",
        "The world was ending because every sloth evolved to have superhuman intelligence...",
        "Unbeknownst to one friend, the other accidentally coded a program to evolve sloths...",
    ])

def terminal(command: str) -> str:
    # Refuse obviously destructive commands before shelling out.
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"

def python(code: str) -> str:
    # Run model-supplied code and return the resulting globals.
    # Note: exec is not sandboxed, so only use this in trusted setups.
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)

MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "subtract_number": subtract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}

tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "string", "description": "The first number."},
                    "b": {"type": "string", "description": "The second number."},
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "string", "description": "The first number."},
                    "b": {"type": "string", "description": "The second number."},
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "subtract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "string", "description": "The first number."},
                    "b": {"type": "string", "description": "The second number."},
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Writes a random story.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations from the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string", "description": "The command you wish to launch, e.g. `ls`, `pwd`, ..."},
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter with some Python code that will be run.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "The Python code to run"},
                },
                "required": ["code"],
            },
        },
    },
]
```
|
|
160
|
+
|
|
161
|
+
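The inference loop in the next section dispatches each tool call through `MAP_FN`, a mapping from tool name to a Python callable. A minimal sketch covering the tools declared above — the function bodies here are illustrative assumptions, not the exact implementations (and the multiply tool is registered the same way under its declared name):

```python
import io
import contextlib
import subprocess

# Illustrative tool implementations (assumptions for this sketch).
def substract_number(a, b):
    # Arguments arrive as strings, per the schema above.
    return float(a) - float(b)

def write_a_story():
    return "Once upon a time, a sloth learned to fine-tune language models."

def terminal(command):
    # WARNING: executes arbitrary shell commands the model asks for.
    result = subprocess.run(command, shell = True, capture_output = True, text = True)
    return result.stdout + result.stderr

def python(code):
    # WARNING: executes arbitrary model-generated Python; captures its stdout.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

# Tool name -> callable, looked up by the inference loop below.
MAP_FN = {
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
```

Sandboxing the `terminal` and `python` tools is strongly advised before letting a model drive them.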
## Inference Function

We use the function below, which parses tool calls automatically and calls the OpenAI-compatible endpoint for any model:

```python
import json
from openai import OpenAI

# Relies on the `tools` list and the MAP_FN tool-name -> function mapping defined earlier.
def unsloth_inference(
    messages,
    temperature = 0.7,
    top_p = 0.95,
    top_k = 40,
    min_p = 0.01,
    repetition_penalty = 1.0,
):
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "repetition_penalty": repetition_penalty},
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content})
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out)})
        # Keep querying the model while it requests tools; stop once it answers directly.
        if not tool_calls:
            has_tool_calls = False
    return messages
```

## Examples

### Writing a story

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Could you write me a story?"}],
}]
unsloth_inference(messages, temperature = 0.15, top_p = 1.0, top_k = -1, min_p = 0.00)
```

### Mathematical operations

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is today's date plus 3 days?"}],
}]
unsloth_inference(messages, temperature = 0.15, top_p = 1.0, top_k = -1, min_p = 0.00)
```

### Execute generated Python code

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 0.15, top_p = 1.0, top_k = -1, min_p = 0.00)
```

### Execute arbitrary terminal functions

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Write 'I'm a happy Sloth' to a file, then print it back to me."}],
}]
messages = unsloth_inference(messages, temperature = 0.15, top_p = 1.0, top_k = -1, min_p = 0.00)
```

## Qwen3-Coder-Next Tool Calling

Use Qwen3-Coder-Next's optimal parameters of `temperature = 1.0, top_p = 0.95, top_k = 40`.

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = 40, min_p = 0.00)
```

## GLM-4.7-Flash + GLM 4.7 Tool Calling

We first download GLM-4.7 via some Python code, then launch it via llama-server in a separate terminal:

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/GLM-4.7-GGUF",
    local_dir = "unsloth/GLM-4.7-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"],
)
```

Now launch it via llama-server:

```bash
./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-GGUF/UD-Q2_K_XL/GLM-4.7-UD-Q2_K_XL-00001-of-00003.gguf \
    --alias "unsloth/GLM-4.7" \
    --threads -1 \
    --fit on \
    --prio 3 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --port 8001 \
    --jinja
```

Use GLM 4.7's optimal parameters of `temperature = 0.7` and `top_p = 1.0`:

```python
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is today's date plus 3 days?"}],
}]
unsloth_inference(messages, temperature = 0.7, top_p = 1.0, top_k = -1, min_p = 0.00)
```

## Devstral 2 Tool Calling

We first download Devstral 2 via some Python code, then launch it via llama-server:

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF",
    local_dir = "unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*", "*mmproj-F16*"],
)
```

```bash
./llama.cpp/llama-server \
    --model unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --mmproj unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/mmproj-F16.gguf \
    --alias "unsloth/Devstral-Small-2-24B-Instruct-2512" \
    --threads -1 \
    --fit on \
    --prio 3 \
    --min_p 0.01 \
    --ctx-size 16384 \
    --port 8001 \
    --jinja
```

Use Devstral's suggested parameters of `temperature = 0.15`.

# Troubleshooting & FAQs

If you're still encountering any issues with versions or dependencies, please use our [Docker image](installation-docker.md), which has everything pre-installed.

> **Always try updating Unsloth first if you run into any issues.**
> `pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo`

## Fine-tuning a new model not supported by Unsloth?

Unsloth works with any model supported by `transformers`. If a model isn't in our uploads or doesn't run out of the box, it's usually still supported. Enable compatibility by setting `trust_remote_code=True`:

```python
from unsloth import FastVisionModel
from transformers import AutoModel

model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit = False,
    auto_model = AutoModel,
    trust_remote_code = True,
    unsloth_force_compile = True,
    use_gradient_checkpointing = "unsloth",
)
```

## Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

* The most common cause is using an **incorrect chat template**. Use the SAME template for training and inference.
* You must use the correct `eos token`.
* Check if your inference engine adds an unnecessary "start of sequence" token.
* **Use our conversational notebooks to force the chat template - this will fix most issues.**

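One quick way to catch the duplicated start-of-sequence problem from the list above is to inspect the prompt your serving engine actually feeds the model and check whether it begins with two BOS tokens. A small self-contained check — the `<s>` default is an assumption; substitute your model's actual BOS string:

```python
def has_duplicate_bos(rendered_prompt: str, bos_token: str = "<s>") -> bool:
    # True if the prompt starts with two consecutive BOS tokens, e.g. when the
    # engine prepends BOS on top of a template that already includes it.
    stripped = rendered_prompt.lstrip()
    return stripped.startswith(bos_token + bos_token) or \
           stripped.startswith(bos_token + " " + bos_token)

print(has_duplicate_bos("<s><s> [INST] Hello [/INST]"))  # duplicated -> True
print(has_duplicate_bos("<s> [INST] Hello [/INST]"))     # correct -> False
```

Most engines can log the fully rendered prompt (for example, llama-server prints it at verbose log levels), which is the string to pass in here.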
## Saving to GGUF / vLLM 16bit crashes

Reduce `maximum_memory_usage`:
`model.save_pretrained(..., maximum_memory_usage = 0.5)` (default is 0.75).

## How do I manually save to GGUF?

```python
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")
```

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp

python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-F16.gguf --outtype f16 --split-max-size 50G

# For BF16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 --split-max-size 50G

# For Q8_0:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 --split-max-size 50G
```

## Why is Q8_K_XL slower than Q8_0 GGUF?

On Mac devices, BF16 can be slower than F16, and Q8_K_XL upcasts some layers to BF16. We are working on making F16 the default.

## How to do Evaluation

Split your dataset into training and test splits (always shuffle!):

```python
new_dataset = dataset.train_test_split(
    test_size = 0.01,
    shuffle = True,
    seed = 3407,
)
train_dataset = new_dataset["train"]
eval_dataset = new_dataset["test"]
```

```python
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        eval_strategy = "steps",
        eval_steps = 1,
    ),
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
)
```

## How do I do Early Stopping?

```python
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        output_dir = "training_checkpoints",
        save_strategy = "steps",
        save_steps = 10,
        save_total_limit = 3,
        eval_strategy = "steps",
        eval_steps = 10,
        load_best_model_at_end = True,
        metric_for_best_model = "eval_loss",
        greater_is_better = False,
    ),
)

from transformers import EarlyStoppingCallback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,
    early_stopping_threshold = 0.0,
)
trainer.add_callback(early_stopping_callback)
trainer.train()
```

## Evaluation Loop - Out of Memory or crashing

Set `per_device_eval_batch_size = 1` and use `fp16_full_eval = True`, which roughly halves evaluation memory use.

## Downloading gets stuck at 90 to 95%

```python
import os
os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"
from unsloth import FastLanguageModel
```

## RuntimeError: CUDA error: device-side assert triggered

```python
# Set these before importing unsloth:
import os
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
```

## All labels in your dataset are -100

This means the `instruction_part` / `response_part` markers passed to `train_on_responses_only` don't match that model's chat template.

For Llama 3.1/3.2/3.3:
```python
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```

For Gemma 2/3/3n:
```python
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
```

## Unsloth is slower than expected?

`torch.compile` typically takes ~5 minutes to warm up. Measure throughput **after** it's fully loaded. To disable:

```python
import os
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
```

## Some weights were not initialized from model checkpoint

Fix by upgrading:

```bash
pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo
pip install --upgrade --force-reinstall --no-cache-dir --no-deps transformers timm
```

## NotImplementedError: A UTF-8 locale is required. Got ANSI

```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"
```

## Citing Unsloth

```bibtex
@misc{unsloth,
  author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
  title        = {Unsloth},
  year         = {2025},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/unslothai/unsloth}}
}
```

# Troubleshooting Inference

### Running in Unsloth works well, but after exporting & running on other platforms, the results are poor

You might sometimes encounter an issue where your model runs and produces good results in Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor: gibberish, endless/infinite generations, or repeated outputs.

* The most common cause of this error is using an **incorrect chat template**. It's essential to use the SAME chat template that was used when training the model in Unsloth when you later run it in another framework, such as llama.cpp or Ollama. When inferencing from a saved model, it's crucial to apply the correct template.
* You must use the correct `eos token`. If not, you might get gibberish on longer generations.
* It might also be that your inference engine adds an unnecessary "start of sequence" token (or, conversely, omits a required one), so check both hypotheses!
* **Use our conversational notebooks to force the chat template - this will fix most issues.**
  * Qwen-3 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb)
  * Gemma-3 4B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B).ipynb)
  * Llama-3.2 3B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
  * Phi-4 14B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
  * Mistral v0.3 7B Conversational notebook [Open in Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb)
  * **More notebooks in our [notebooks repo](https://github.com/unslothai/notebooks).**

### Saving to `safetensors`, not `bin` format in Colab

We save to `.bin` in Colab because it's roughly 4x faster, but you can set `safe_serialization = None` to force saving to `.safetensors`: `model.save_pretrained(..., safe_serialization = None)` or `model.push_to_hub(..., safe_serialization = None)`.

### If saving to GGUF or vLLM 16bit crashes

You can try reducing the maximum GPU usage during saving by changing `maximum_memory_usage`.

The default is `model.save_pretrained(..., maximum_memory_usage = 0.75)`. Reduce it to, say, 0.5 to use 50% of peak GPU memory or lower. This can reduce OOM crashes during saving.