EvoScientist 0.0.1.dev1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev1.dist-info/METADATA +222 -0
  103. evoscientist-0.0.1.dev1.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev1.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev1.dist-info/entry_points.txt +2 -0
  106. evoscientist-0.0.1.dev1.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev1.dist-info/top_level.txt +1 -0
EvoScientist/skills/bitsandbytes/SKILL.md
@@ -0,0 +1,411 @@
+ ---
+ name: bitsandbytes
+ description: Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with minimal accuracy loss. Use when GPU memory is limited, when you need to fit larger models, or when you want faster inference. Supports INT8, NF4, FP4 formats, QLoRA training, and 8-bit optimizers. Works with HuggingFace Transformers.
+ version: 1.0.0
+ author: Orchestra Research
+ license: MIT
+ tags: [Optimization, Bitsandbytes, Quantization, 8-Bit, 4-Bit, Memory Optimization, QLoRA, NF4, INT8, HuggingFace, Efficient Inference]
+ dependencies: [bitsandbytes, transformers, accelerate, torch]
+ ---
+
+ # bitsandbytes - LLM Quantization
+
+ ## Quick start
+
+ bitsandbytes reduces LLM weight memory by 50% (8-bit) or 75% (4-bit) with minimal accuracy loss (typically under 0.5% for 8-bit and 1-2% for 4-bit).
+
+ **Installation**:
+ ```bash
+ pip install bitsandbytes transformers accelerate
+ ```
+
+ **8-bit quantization** (50% memory reduction):
+ ```python
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ config = BitsAndBytesConfig(load_in_8bit=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     quantization_config=config,
+     device_map="auto"
+ )
+
+ # Memory: 14GB → 7GB
+ ```
+
+ **4-bit quantization** (75% memory reduction):
+ ```python
+ import torch  # needed for bnb_4bit_compute_dtype
+
+ config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     quantization_config=config,
+     device_map="auto"
+ )
+
+ # Memory: 14GB → 3.5GB
+ ```
+
+ ## Common workflows
+
+ ### Workflow 1: Load large model in limited GPU memory
+
+ Copy this checklist:
+
+ ```
+ Quantization Loading:
+ - [ ] Step 1: Calculate memory requirements
+ - [ ] Step 2: Choose quantization level (4-bit or 8-bit)
+ - [ ] Step 3: Configure quantization
+ - [ ] Step 4: Load and verify model
+ ```
+
+ **Step 1: Calculate memory requirements**
+
+ Estimate model memory (weights only; activations and the KV cache add a few extra GB):
+ ```
+ FP16 memory (GB) = Parameters × 2 bytes / 1e9
+ INT8 memory (GB) = Parameters × 1 byte / 1e9
+ INT4 memory (GB) = Parameters × 0.5 bytes / 1e9
+
+ Example (Llama 2 7B):
+ FP16: 7B × 2 / 1e9 = 14 GB
+ INT8: 7B × 1 / 1e9 = 7 GB
+ INT4: 7B × 0.5 / 1e9 = 3.5 GB
+ ```
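+
+ The same arithmetic as a small Python sketch (the helper name is illustrative, not part of bitsandbytes):
+ ```python
+ def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
+     """Weight memory only; activations and the KV cache are extra."""
+     return num_params * (bits_per_param / 8) / 1e9
+
+ for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
+     print(f"Llama 2 7B @ {label}: {estimate_weight_memory_gb(7e9, bits):.1f} GB")
+ # FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
+ ```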
+
+ **Step 2: Choose quantization level**
+
+ | GPU VRAM | Model Size | Recommended |
+ |----------|------------|-------------|
+ | 8 GB | 3B | 4-bit |
+ | 12 GB | 7B | 4-bit |
+ | 16 GB | 7B | 8-bit or 4-bit |
+ | 24 GB | 13B | 8-bit or 4-bit |
+ | 40+ GB | 70B | 4-bit (8-bit needs ~70 GB of weights) |
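+
+ A rough, illustrative heuristic that encodes this choice (the headroom needed for activations and CUDA overhead varies with batch size and sequence length, so treat the table above as the reference):
+ ```python
+ def pick_quantization(vram_gb: float, num_params_b: float, headroom_gb: float = 2.0) -> str:
+     """Prefer 8-bit when weights + headroom fit, else 4-bit, else offload."""
+     if num_params_b * 1.0 + headroom_gb <= vram_gb:
+         return "8-bit"
+     if num_params_b * 0.5 + headroom_gb <= vram_gb:
+         return "4-bit"
+     return "4-bit + CPU/disk offload, or a smaller model"
+
+ print(pick_quantization(24, 13))  # 8-bit
+ print(pick_quantization(40, 70))  # 4-bit
+ ```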
+
+ **Step 3: Configure quantization**
+
+ For 8-bit (better accuracy):
+ ```python
+ from transformers import BitsAndBytesConfig
+ import torch
+
+ config = BitsAndBytesConfig(
+     load_in_8bit=True,
+     llm_int8_threshold=6.0,  # Outlier threshold
+     llm_int8_has_fp16_weight=False
+ )
+ ```
+
+ For 4-bit (maximum memory savings):
+ ```python
+ config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
+     bnb_4bit_quant_type="nf4",             # NormalFloat4 (recommended)
+     bnb_4bit_use_double_quant=True         # Nested quantization
+ )
+ ```
+
+ **Step 4: Load and verify model**
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-13b-hf",
+     quantization_config=config,
+     device_map="auto",  # Automatic device placement
+     torch_dtype=torch.float16
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
+
+ # Test inference
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_length=50)
+ print(tokenizer.decode(outputs[0]))
+
+ # Check memory
+ print(f"Memory allocated: {torch.cuda.memory_allocated()/1e9:.2f}GB")
+ ```
+
+ ### Workflow 2: Fine-tune with QLoRA (4-bit training)
+
+ QLoRA enables fine-tuning large models on consumer GPUs.
+
+ Copy this checklist:
+
+ ```
+ QLoRA Fine-tuning:
+ - [ ] Step 1: Install dependencies
+ - [ ] Step 2: Configure 4-bit base model
+ - [ ] Step 3: Add LoRA adapters
+ - [ ] Step 4: Train with standard Trainer
+ ```
+
+ **Step 1: Install dependencies**
+
+ ```bash
+ pip install bitsandbytes transformers peft accelerate datasets
+ ```
+
+ **Step 2: Configure 4-bit base model**
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ import torch
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.float16,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     quantization_config=bnb_config,
+     device_map="auto"
+ )
+
+ # Tokenizer is used by the Trainer in Step 4
+ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
+ ```
+
+ **Step 3: Add LoRA adapters**
+
+ ```python
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ # Prepare model for training
+ model = prepare_model_for_kbit_training(model)
+
+ # Configure LoRA
+ lora_config = LoraConfig(
+     r=16,            # LoRA rank
+     lora_alpha=32,   # LoRA alpha
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM"
+ )
+
+ # Add LoRA adapters
+ model = get_peft_model(model, lora_config)
+ model.print_trainable_parameters()
+ # Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.06%
+ ```
+
+ **Step 4: Train with standard Trainer**
+
+ ```python
+ from transformers import Trainer, TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./qlora-output",
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     num_train_epochs=3,
+     learning_rate=2e-4,
+     fp16=True,
+     logging_steps=10,
+     save_strategy="epoch"
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,  # your tokenized dataset
+     tokenizer=tokenizer
+ )
+
+ trainer.train()
+
+ # Save LoRA adapters (only ~20MB)
+ model.save_pretrained("./qlora-adapters")
+ ```
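+
+ To use the result later, the saved adapters can be reloaded on top of the 4-bit base model. A minimal sketch with PEFT, assuming the same `bnb_config` and adapter path as above:
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM
+
+ # Reload the quantized base model (same bnb_config as in Step 2)
+ base = AutoModelForCausalLM.from_pretrained(
+     "meta-llama/Llama-2-7b-hf",
+     quantization_config=bnb_config,
+     device_map="auto"
+ )
+
+ # Attach the saved LoRA adapters for inference
+ model = PeftModel.from_pretrained(base, "./qlora-adapters")
+ model.eval()
+ ```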
+
+ ### Workflow 3: 8-bit optimizer for memory-efficient training
+
+ Use 8-bit Adam/AdamW to reduce optimizer memory by 75%.
+
+ ```
+ 8-bit Optimizer Setup:
+ - [ ] Step 1: Replace standard optimizer
+ - [ ] Step 2: Compare optimizer memory
+ - [ ] Step 3: Monitor memory savings
+ ```
+
+ **Step 1: Replace standard optimizer**
+
+ ```python
+ from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
+
+ # Instead of torch.optim.AdamW, use the paged 8-bit AdamW provided by
+ # bitsandbytes (must be installed) via the Trainer's `optim` argument
+ model = AutoModelForCausalLM.from_pretrained("model-name")
+
+ training_args = TrainingArguments(
+     output_dir="./output",
+     per_device_train_batch_size=8,
+     optim="paged_adamw_8bit",  # 8-bit optimizer
+     learning_rate=5e-5
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset
+ )
+
+ trainer.train()
+ ```
+
+ **Manual optimizer usage**:
+ ```python
+ import bitsandbytes as bnb
+
+ optimizer = bnb.optim.AdamW8bit(
+     model.parameters(),
+     lr=1e-4,
+     betas=(0.9, 0.999),
+     eps=1e-8
+ )
+
+ # Training loop
+ for batch in dataloader:
+     loss = model(**batch).loss
+     loss.backward()
+     optimizer.step()
+     optimizer.zero_grad()
+ ```
+
+ **Step 2: Compare optimizer memory**
+
+ ```
+ Standard AdamW optimizer memory = model_params × 8 bytes (two FP32 states)
+ 8-bit AdamW memory = model_params × 2 bytes
+ Savings = 75% of optimizer memory
+
+ Example (Llama 2 7B):
+ Standard: 7B × 8 = 56 GB
+ 8-bit: 7B × 2 = 14 GB
+ Savings: 42 GB
+ ```
+
+ **Step 3: Monitor memory savings**
+
+ ```python
+ import torch
+
+ before = torch.cuda.memory_allocated()
+
+ # Training step
+ optimizer.step()
+
+ after = torch.cuda.memory_allocated()
+ print(f"Memory used: {(after-before)/1e9:.2f}GB")
+ ```
+
+ ## When to use vs alternatives
+
+ **Use bitsandbytes when:**
+ - GPU memory is limited (you need to fit a larger model)
+ - Training with QLoRA (fine-tune a 70B model on a single GPU)
+ - Inference only (50-75% memory reduction)
+ - Using HuggingFace Transformers
+ - A 0-2% accuracy degradation is acceptable
+
+ **Use alternatives instead:**
+ - **GPTQ/AWQ**: Production serving (faster inference than bitsandbytes)
+ - **GGUF**: CPU inference (llama.cpp)
+ - **FP8**: H100 GPUs (hardware FP8 is faster)
+ - **Full precision**: Accuracy is critical and memory is not constrained
+
+ ## Common issues
+
+ **Issue: CUDA error during loading**
+
+ Reinstall bitsandbytes so its binaries match your CUDA toolkit:
+ ```bash
+ # Check CUDA version
+ nvcc --version
+
+ # Reinstall matching bitsandbytes
+ pip install bitsandbytes --no-cache-dir
+ ```
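+
+ If the error persists, recent bitsandbytes releases also ship a built-in diagnostic you can run (output and availability depend on the installed version):
+ ```bash
+ # Print the detected CUDA setup and library paths
+ python -m bitsandbytes
+ ```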
+
+ **Issue: Model loading is slow**
+
+ Cap GPU memory and offload the remainder to CPU for large models:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     quantization_config=config,
+     device_map="auto",
+     max_memory={0: "20GB", "cpu": "30GB"}  # Offload to CPU
+ )
+ ```
+
+ **Issue: Lower accuracy than expected**
+
+ Try 8-bit instead of 4-bit:
+ ```python
+ config = BitsAndBytesConfig(load_in_8bit=True)
+ # 8-bit has <0.5% accuracy loss vs 1-2% for 4-bit
+ ```
+
+ Or use NF4 with double quantization:
+ ```python
+ config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",       # Better than fp4
+     bnb_4bit_use_double_quant=True   # Nested quantization; saves memory with negligible accuracy impact
+ )
+ ```
+
+ **Issue: OOM even with 4-bit**
+
+ Enable CPU offload:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained(
+     "model-name",
+     quantization_config=config,
+     device_map="auto",
+     offload_folder="offload",  # Disk offload
+     offload_state_dict=True
+ )
+ ```
+
+ ## Advanced topics
+
+ **QLoRA training guide**: See [references/qlora-training.md](references/qlora-training.md) for complete fine-tuning workflows, hyperparameter tuning, and multi-GPU training.
+
+ **Quantization formats**: See [references/quantization-formats.md](references/quantization-formats.md) for INT8, NF4, FP4 comparison, double quantization, and custom quantization configs.
+
+ **Memory optimization**: See [references/memory-optimization.md](references/memory-optimization.md) for CPU offloading strategies, gradient checkpointing, and memory profiling.
+
+ ## Hardware requirements
+
+ - **GPU**: NVIDIA with compute capability 7.0+ (Turing, Ampere, Hopper)
+ - **VRAM**: Depends on model and quantization
+   - 4-bit Llama 2 7B: ~4GB
+   - 4-bit Llama 2 13B: ~8GB
+   - 4-bit Llama 2 70B: ~35GB
+ - **CUDA**: 11.1+ (12.0+ recommended)
+ - **PyTorch**: 2.0+
+
+ **Supported platforms**: NVIDIA GPUs (primary), AMD ROCm, Intel GPUs (experimental)
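+
+ A quick way to check these requirements on your machine, using only standard PyTorch calls (illustrative snippet, not part of bitsandbytes):
+ ```python
+ import torch
+
+ print("PyTorch:", torch.__version__)
+ print("CUDA available:", torch.cuda.is_available())
+ print("CUDA runtime:", torch.version.cuda)
+ if torch.cuda.is_available():
+     major, minor = torch.cuda.get_device_capability(0)
+     vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
+     print(f"Compute capability: {major}.{minor} (7.0+ required)")
+     print(f"VRAM: {vram_gb:.1f} GB")
+ ```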
+
+ ## Resources
+
+ - GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
+ - HuggingFace docs: https://huggingface.co/docs/transformers/quantization/bitsandbytes
+ - QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
+ - LLM.int8() paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
+