EvoScientist 0.0.1.dev2__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (107)
  1. EvoScientist/EvoScientist.py +157 -0
  2. EvoScientist/__init__.py +24 -0
  3. EvoScientist/__main__.py +4 -0
  4. EvoScientist/backends.py +392 -0
  5. EvoScientist/cli.py +1553 -0
  6. EvoScientist/middleware.py +35 -0
  7. EvoScientist/prompts.py +277 -0
  8. EvoScientist/skills/accelerate/SKILL.md +332 -0
  9. EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
  10. EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
  11. EvoScientist/skills/accelerate/references/performance.md +525 -0
  12. EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
  13. EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
  14. EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
  15. EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
  16. EvoScientist/skills/find-skills/SKILL.md +133 -0
  17. EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
  18. EvoScientist/skills/flash-attention/SKILL.md +367 -0
  19. EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
  20. EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
  21. EvoScientist/skills/llama-cpp/SKILL.md +258 -0
  22. EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
  23. EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
  24. EvoScientist/skills/llama-cpp/references/server.md +125 -0
  25. EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
  26. EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
  27. EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
  28. EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
  29. EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
  30. EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
  31. EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
  32. EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
  33. EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
  34. EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
  35. EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
  36. EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
  37. EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
  38. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
  39. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
  40. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
  41. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
  42. EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
  43. EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
  44. EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
  45. EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
  46. EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
  47. EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
  48. EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
  49. EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
  50. EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
  51. EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
  52. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
  53. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
  54. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
  55. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
  56. EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
  57. EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
  58. EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
  59. EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
  60. EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
  61. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
  62. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
  63. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
  64. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
  65. EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
  66. EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
  67. EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
  68. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
  69. EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
  70. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
  71. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
  72. EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
  73. EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
  74. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
  75. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
  76. EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
  77. EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
  78. EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
  79. EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
  80. EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
  81. EvoScientist/skills/peft/SKILL.md +431 -0
  82. EvoScientist/skills/peft/references/advanced-usage.md +514 -0
  83. EvoScientist/skills/peft/references/troubleshooting.md +480 -0
  84. EvoScientist/skills/ray-data/SKILL.md +326 -0
  85. EvoScientist/skills/ray-data/references/integration.md +82 -0
  86. EvoScientist/skills/ray-data/references/transformations.md +83 -0
  87. EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
  88. EvoScientist/skills/skill-creator/SKILL.md +356 -0
  89. EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
  90. EvoScientist/skills/skill-creator/references/workflows.md +28 -0
  91. EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
  92. EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
  93. EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
  94. EvoScientist/stream/__init__.py +53 -0
  95. EvoScientist/stream/emitter.py +94 -0
  96. EvoScientist/stream/formatter.py +168 -0
  97. EvoScientist/stream/tracker.py +115 -0
  98. EvoScientist/stream/utils.py +255 -0
  99. EvoScientist/subagent.yaml +147 -0
  100. EvoScientist/tools.py +135 -0
  101. EvoScientist/utils.py +207 -0
  102. evoscientist-0.0.1.dev2.dist-info/METADATA +227 -0
  103. evoscientist-0.0.1.dev2.dist-info/RECORD +107 -0
  104. evoscientist-0.0.1.dev2.dist-info/WHEEL +5 -0
  105. evoscientist-0.0.1.dev2.dist-info/entry_points.txt +5 -0
  106. evoscientist-0.0.1.dev2.dist-info/licenses/LICENSE +21 -0
  107. evoscientist-0.0.1.dev2.dist-info/top_level.txt +1 -0
EvoScientist/skills/bitsandbytes/references/qlora-training.md
@@ -0,0 +1,521 @@
# QLoRA Training

Complete guide to fine-tuning large language models using 4-bit quantization with QLoRA (Quantized Low-Rank Adaptation).

## Overview

QLoRA enables fine-tuning 70B+ parameter models on consumer GPUs by:
- Loading the base model in 4-bit (75% memory reduction vs FP16)
- Training only small LoRA adapters (~20MB)
- Maintaining near-full-precision quality

**Memory savings**:
- Llama 2 70B: 140GB (FP16) → 35GB (4-bit) + 20MB (LoRA) = **35GB total**
- Fits on a single A100 80GB!

**Accuracy**: typically <1% degradation vs full fine-tuning (Dettmers et al., 2023)

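The arithmetic behind these numbers is worth sanity-checking. A minimal weights-only sketch (real usage adds activations, gradients, and CUDA overhead):

```python
# Weights-only memory estimate for a 70B-parameter model
params = 70e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param -> ~140 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits/param  -> ~35 GB
print(fp16_gb, nf4_gb)        # 140.0 35.0
```
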
## Quick Start

### Basic QLoRA Fine-tuning

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Step 1: Load model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Step 2: Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Step 3: Add LoRA adapters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 335M || all params: 70B || trainable%: 0.48%

# Step 4: Train (assumes `dataset` and `tokenizer` are already prepared;
# see Workflow 1 below)
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./qlora-70b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
    optim="paged_adamw_8bit",
    logging_steps=10,
    save_strategy="epoch"
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)

trainer.train()
```

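As a quick sanity check after loading, Hugging Face models expose `get_memory_footprint()`; for a 4-bit 70B model this should report roughly 35 GB:

```python
# Weight memory of the quantized model (excludes activations and optimizer state)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```
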
## Complete Training Workflows

### Workflow 1: Single GPU Training (Consumer GPU)

Train Llama 2 13B on an RTX 4090 (24GB).

**Step 1: Prepare dataset**

```python
from datasets import load_dataset

# Load instruction dataset. The guanaco "text" column already uses the
# "### Human: ... ### Assistant: ..." format, so it can be used as-is.
dataset = load_dataset("timdettmers/openassistant-guanaco")

# For datasets with separate columns (field names here are illustrative),
# map them into a single formatted "text" column instead:
def format_instruction(example):
    return {
        "text": f"### Human: {example['prompt']}\n### Assistant: {example['response']}"
    }

# dataset = dataset.map(format_instruction)
```

**Step 2: Configure quantization**

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 for stability
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (recommended)
    bnb_4bit_use_double_quant=True          # Nested quantization
)
```

**Step 3: Load and prepare model**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Enable gradient checkpointing (further memory savings)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```

**Step 4: Configure LoRA**

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # LoRA rank (lower = less memory)
    lora_alpha=32,                # Scaling factor
    target_modules="all-linear",  # Apply to all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```

**Step 5: Train**

```python
training_args = TrainingArguments(
    output_dir="./qlora-13b-results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    warmup_steps=100,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    optim="paged_adamw_8bit",       # 8-bit optimizer
    max_grad_norm=0.3,
    max_steps=1000
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    max_seq_length=512
)

trainer.train()
```

**Memory usage**: ~18GB on an RTX 4090 (24GB)

### Workflow 2: Multi-GPU Training (FSDP + QLoRA)

Train Llama 2 70B on 8×A100 (80GB each).

**Step 1: Configure FSDP-compatible quantization**

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16  # CRITICAL for FSDP!
)
```

**Important**: `bnb_4bit_quant_storage=torch.bfloat16` ensures 4-bit layers are wrapped identically to regular BF16 layers, so FSDP can shard them.

**Step 2: Launch with accelerate**

Create `fsdp_config.yaml`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_sharding_strategy: 1  # FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_processes: 8
```

**Launch training**:
```bash
accelerate launch --config_file fsdp_config.yaml train_qlora.py
```

**train_qlora.py**:
```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

# Rest same as single-GPU workflow
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(...)
trainer.train()
```

**Memory per GPU**: ~40GB (70B model sharded across 8 GPUs)

### Workflow 3: Extremely Large Models (405B)

Train Llama 3.1 405B on 8×H100 (80GB each).

**Requirements**:
- 8×H100 80GB GPUs
- 256GB+ system RAM
- FSDP + QLoRA

**Configuration**:
```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16
)

lora_config = LoraConfig(
    r=32,                         # Higher rank for 405B
    lora_alpha=64,
    target_modules="all-linear",
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

training_args = TrainingArguments(
    output_dir="./qlora-405b",
    per_device_train_batch_size=1,   # Small batch
    gradient_accumulation_steps=32,  # Effective batch = 1 × 32 × 8 GPUs = 256
    learning_rate=1e-4,              # Lower LR for large model
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True
)
```

**Memory per GPU**: ~70GB (4-bit weights sharded across 8 GPUs, plus activations and optimizer state)

## Hyperparameter Tuning

### LoRA Rank (r)

Controls adapter capacity:

| Model Size | Recommended r | Trainable Params | Use Case |
|------------|---------------|------------------|----------|
| 7B | 8-16 | ~4M | Simple tasks |
| 13B | 16-32 | ~8M | General fine-tuning |
| 70B | 32-64 | ~80M | Complex tasks |
| 405B | 64-128 | ~300M | Maximum capacity |

**Trade-off**: Higher r = more capacity, but more memory and slower training

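The parameter counts above follow from LoRA adding `r × (d_in + d_out)` weights per adapted linear layer. A rough check for the 7B row (a sketch assuming Llama 2 7B's 32 decoder layers and hidden size 4096, with adapters on `q_proj` and `v_proj` only):

```python
# LoRA adds r * (d_in + d_out) params per adapted linear layer
d, layers, adapted_per_layer, r = 4096, 32, 2, 8
trainable = r * (d + d) * adapted_per_layer * layers
print(f"{trainable / 1e6:.1f}M")  # ~4.2M, in line with the ~4M row above
```
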
### LoRA Alpha

Scaling factor for LoRA updates. The adapter output is multiplied by `lora_alpha / r`, so as a rule of thumb:

```python
effective_update_scale = learning_rate * (lora_alpha / r)
```

**Recommended**: `lora_alpha = 2 × r`
- r=16 → alpha=32
- r=64 → alpha=128

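A quick illustration of why `lora_alpha = 2 × r` is a convenient default: scaling alpha with r keeps the update scale constant as rank varies:

```python
# With alpha = 2 * r, the LoRA scale lora_alpha / r stays fixed at 2.0
for r in (8, 16, 32, 64):
    lora_alpha = 2 * r
    print(r, lora_alpha, lora_alpha / r)  # scale is always 2.0
```
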
### Target Modules

**Options**:
- `"all-linear"`: All linear layers (recommended for QLoRA)
- `["q_proj", "v_proj"]`: Only attention Q/V (minimal)
- `["q_proj", "k_proj", "v_proj", "o_proj"]`: All attention
- `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`: Attention + FFN

**Trade-off**: More modules = better performance, but more memory

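To pick `target_modules` explicitly, it helps to list which linear layers the loaded model actually contains. A minimal sketch (bitsandbytes' 4-bit layers subclass `nn.Linear`, so the isinstance check should cover them, though this is worth verifying on your version):

```python
import torch.nn as nn

# Collect leaf names of all linear layers; exclude lm_head, which is
# normally not adapted
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(sorted(linear_names - {"lm_head"}))
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```
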
### Learning Rate

| Model Size | Recommended LR |
|------------|----------------|
| 7-13B | 2e-4 to 3e-4 |
| 70B | 1e-4 to 2e-4 |
| 405B | 5e-5 to 1e-4 |

**Rule**: Larger models need lower learning rates

### Batch Size

```python
# e.g. 4 * 4 * 8 GPUs = 128
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
```

**Recommended effective batch sizes**:
- Instruction tuning: 64-128
- Continued pretraining: 256-512

### Quantization Dtype

These are values for `bnb_4bit_compute_dtype` (the dtype used for matmuls; weights stay 4-bit):

| Dtype | Speed | Accuracy | Use Case |
|-------|-------|----------|----------|
| `torch.float32` | Slow | Best | Debugging |
| `torch.bfloat16` | Fast | Good | **Recommended** |
| `torch.float16` | Fastest | Risky | May have precision issues |

## Advanced Techniques

### Gradient Checkpointing

Save memory by recomputing activations during the backward pass:

```python
model.gradient_checkpointing_enable()
# use_gradient_checkpointing=True also enables checkpointing, so the
# explicit call above is redundant but harmless
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```

**Memory savings**: ~30-40% of activation memory
**Cost**: ~20% slower training

### Nested Quantization

Quantize the quantization constants themselves:

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True  # Enable nested (double) quantization
)
```

**Memory savings**: Additional ~2-3% reduction
**Accuracy**: Minimal impact

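The saving is small but essentially free. Per the QLoRA paper, double quantization saves roughly 0.37 bits per parameter by quantizing the per-block FP32 quantization constants down to 8-bit:

```python
# ~0.37 bits/param saved by double quantization (QLoRA paper figure)
params = 65e9
print(f"{params * 0.37 / 8 / 1e9:.1f} GB saved")  # ~3.0 GB for a 65B model
```
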
### CPU Offloading

For models that still don't fit:

```python
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config,
    device_map="auto",
    max_memory={0: "40GB", "cpu": "100GB"}
)
```

**Trade-off**: Much slower, but enables larger models

### Paged Optimizers

Use paged memory for optimizer states:

```python
training_args = TrainingArguments(
    output_dir="./out",
    optim="paged_adamw_8bit"  # Or paged_adamw_32bit
)
```

**Benefit**: Prevents OOM spikes from optimizer states

## Deployment

### Save LoRA Adapters

```python
# Save only the adapters (~20MB)
model.save_pretrained("./qlora-adapters")
tokenizer.save_pretrained("./qlora-adapters")
```

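A quick way to confirm that only the adapters were saved (the checkpoint should be tens of MB, not tens of GB), sketched below:

```python
import os

# Total on-disk size of the saved adapter directory
total_bytes = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk("./qlora-adapters")
    for f in files
)
print(f"{total_bytes / 1e6:.0f} MB")
```
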
### Load for Inference

```python
from peft import PeftModel

# Load base model in 4-bit
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load adapters
model = PeftModel.from_pretrained(base_model, "./qlora-adapters")

# Inference
inputs = tokenizer("Question here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Merge Adapters (Optional)

```python
# Merge LoRA into the base weights
model = model.merge_and_unload()

# Save merged model
model.save_pretrained("./merged-model")
```

**Note**: The merged model is saved dequantized (FP16/BF16), not 4-bit; merging directly into a quantized base can also introduce small rounding errors.

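Because of that, a common pattern (a sketch, not the only option) is to reload the base model unquantized in BF16 and merge the adapters there:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in BF16, attach the saved adapters, then merge
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./qlora-adapters").merge_and_unload()
merged.save_pretrained("./merged-model-bf16")
```
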
## Troubleshooting

### OOM During Training

1. Reduce batch size:
   ```python
   per_device_train_batch_size=1
   ```

2. Increase gradient accumulation:
   ```python
   gradient_accumulation_steps=16
   ```

3. Lower LoRA rank:
   ```python
   r=8  # Instead of 16
   ```

4. Enable gradient checkpointing

5. Use CPU offloading

### Low Quality Results

1. Increase LoRA rank:
   ```python
   r=64  # Instead of 16
   ```

2. Train longer:
   ```python
   num_train_epochs=3  # Instead of 1
   ```

3. Use more target modules:
   ```python
   target_modules="all-linear"
   ```

4. Check the learning rate (try 1e-4 to 3e-4)

### Slow Training

1. Disable gradient checkpointing (if memory allows)

2. Increase batch size

3. Use BF16:
   ```python
   bf16=True
   ```

4. Use a paged optimizer

## Best Practices

1. **Start small**: Test on 7B before 70B
2. **Monitor loss**: It should decrease steadily
3. **Use validation**: Track eval loss to detect overfitting
4. **Save checkpoints**: Every 100-500 steps
5. **Log hyperparameters**: For reproducibility
6. **Test inference**: Verify quality before full training

## Example: Complete Training Script

See the full working example at `examples/qlora_training.py` in the repository.

## References

- QLoRA paper: Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs", 2023
- bitsandbytes GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
- PEFT documentation: https://huggingface.co/docs/peft
- FSDP+QLoRA guide: https://huggingface.co/blog/fsdp-qlora