EvoScientist 0.1.0rc1__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- EvoScientist/EvoScientist.py +1 -1
- EvoScientist/cli.py +450 -178
- EvoScientist/middleware.py +5 -1
- EvoScientist/skills/accelerate/SKILL.md +332 -0
- EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
- EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
- EvoScientist/skills/accelerate/references/performance.md +525 -0
- EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
- EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
- EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
- EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
- EvoScientist/skills/clip/SKILL.md +253 -0
- EvoScientist/skills/clip/references/applications.md +207 -0
- EvoScientist/skills/find-skills/SKILL.md +133 -0
- EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
- EvoScientist/skills/flash-attention/SKILL.md +367 -0
- EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
- EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
- EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
- EvoScientist/skills/llama-cpp/SKILL.md +258 -0
- EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
- EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
- EvoScientist/skills/llama-cpp/references/server.md +125 -0
- EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
- EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
- EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
- EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
- EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- EvoScientist/skills/peft/SKILL.md +431 -0
- EvoScientist/skills/peft/references/advanced-usage.md +514 -0
- EvoScientist/skills/peft/references/troubleshooting.md +480 -0
- EvoScientist/skills/ray-data/SKILL.md +326 -0
- EvoScientist/skills/ray-data/references/integration.md +82 -0
- EvoScientist/skills/ray-data/references/transformations.md +83 -0
- EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
- EvoScientist/skills/skill-creator/SKILL.md +356 -0
- EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
- EvoScientist/skills/skill-creator/references/workflows.md +28 -0
- EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
- EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
- EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
- EvoScientist/skills/tensorboard/SKILL.md +629 -0
- EvoScientist/skills/tensorboard/references/integrations.md +638 -0
- EvoScientist/skills/tensorboard/references/profiling.md +545 -0
- EvoScientist/skills/tensorboard/references/visualization.md +620 -0
- EvoScientist/skills/vllm/SKILL.md +364 -0
- EvoScientist/skills/vllm/references/optimization.md +226 -0
- EvoScientist/skills/vllm/references/quantization.md +284 -0
- EvoScientist/skills/vllm/references/server-deployment.md +255 -0
- EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +26 -3
- evoscientist-0.1.0rc2.dist-info/RECORD +119 -0
- evoscientist-0.1.0rc1.dist-info/RECORD +0 -21
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/entry_points.txt +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0

EvoScientist/skills/bitsandbytes/references/quantization-formats.md
@@ -0,0 +1,447 @@

# Quantization Formats

Complete guide to INT8, NF4, FP4 quantization formats, double quantization, and custom configurations in bitsandbytes.

## Overview

bitsandbytes supports multiple quantization formats:
- **INT8**: 8-bit integer quantization (LLM.int8())
- **NF4**: 4-bit NormalFloat (for normally distributed weights)
- **FP4**: 4-bit floating point (for uniformly distributed weights)
- **Double Quantization**: Quantize the quantization constants

## INT8 Quantization

### LLM.int8() Algorithm

LLM.int8() uses mixed 8-bit/16-bit matrix multiplication:
- Most features (>99.9%) computed in INT8
- Outlier features (magnitude above a threshold) computed in FP16
- Results combined for the final output

**Memory**: 50% reduction (2 bytes → 1 byte per parameter)
**Accuracy**: <0.5% degradation

### Configuration

```python
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,            # Outlier threshold
    llm_int8_has_fp16_weight=False,    # Use INT8 storage
    llm_int8_skip_modules=["lm_head"]  # Skip certain layers
)
```

### Parameters Explained

**`llm_int8_threshold`** (default: 6.0):
- Activations with magnitude > threshold are kept in FP16
- Lower = more FP16 (slower but more accurate)
- Higher = more INT8 (faster but less accurate)

```python
# Conservative (more accurate)
llm_int8_threshold=5.0

# Aggressive (faster)
llm_int8_threshold=8.0
```

**`llm_int8_has_fp16_weight`** (default: False):
- `False`: Store weights in INT8 (50% memory savings)
- `True`: Store in FP16, quantize only during computation (no memory savings)

**`llm_int8_skip_modules`**:
```python
# Skip specific layers (keep in FP16)
llm_int8_skip_modules=["lm_head", "embed_tokens"]
```

### Example

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=config,
    device_map="auto"
)

# Memory: 26GB (FP16) → 13GB (INT8)
```
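
For reference, the memory figure above is just per-parameter arithmetic; a quick sketch (weights only, decimal GB, approximate parameter count; activations and KV cache come on top):

```python
# Rough arithmetic behind "26GB (FP16) -> 13GB (INT8)" for a 13B model.
params = 13e9                              # approximate parameter count
print(f"FP16: {params * 2 / 1e9:.0f} GB")  # 2 bytes per parameter -> ~26 GB
print(f"INT8: {params * 1 / 1e9:.0f} GB")  # 1 byte per parameter  -> ~13 GB
```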

### When to Use INT8

✅ **Use INT8 when**:
- Need high accuracy (<0.5% loss)
- Model fits with 50% reduction
- Have Turing+ GPU (tensor cores)

❌ **Don't use when**:
- Need maximum memory savings (use 4-bit)
- Inference speed critical (use GPTQ/AWQ)

## 4-Bit Quantization

### NormalFloat4 (NF4)

Optimized for normally distributed weights (most neural networks).

**How it works**:
- Bins chosen to minimize quantization error for normal distribution
- Asymmetric quantization bins
- Better for transformer weights

**Configuration**:
```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"  # NormalFloat4
)
```

**Memory**: 75% reduction (2 bytes → 0.5 bytes per parameter)
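
To make the "bins chosen to minimize quantization error" idea concrete, here is a hedged sketch that builds an NF4-like codebook from equal-probability quantiles of N(0, 1) and round-trips a weight tensor through it. It is illustrative only: the exact 16-value codebook and the per-block kernels used by bitsandbytes come from the QLoRA paper and its CUDA implementation, not from this construction.

```python
import torch

def normalfloat_levels(num_bits: int = 4) -> torch.Tensor:
    """NF4-like codebook: quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** num_bits                                          # 16 levels for 4 bits
    normal = torch.distributions.Normal(0.0, 1.0)
    probs = torch.linspace(1 / (2 * n), 1 - 1 / (2 * n), n)    # avoid the infinite tails
    levels = normal.icdf(probs)
    return levels / levels.abs().max()

def round_trip(w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max()                                      # absmax scale (real NF4 scales per block)
    idx = (w / scale).unsqueeze(-1).sub(levels).abs().argmin(dim=-1)
    return levels[idx] * scale

levels = normalfloat_levels()
w = torch.randn(4096)                                          # transformer weights are roughly normal
print("mean abs error:", (w - round_trip(w, levels)).abs().mean().item())
```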

### FloatingPoint4 (FP4)

Standard 4-bit floating point for more uniform weight distributions.

**How it works**:
- Symmetric quantization bins
- Better for weights with a broader dynamic range
- Less common for transformers

**Configuration**:
```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="fp4"  # FloatingPoint4
)
```

### NF4 vs FP4 Comparison

| Aspect | NF4 | FP4 |
|--------|-----|-----|
| Distribution | Normal | Uniform |
| Typical use | **Transformers** | CNNs, unusual architectures |
| Accuracy | **Better for LLMs** | Worse for LLMs |
| Speed | Same | Same |
| Recommendation | ✅ Default | Use only if NF4 fails |

**Rule of thumb**: Always use NF4 for transformers.

### Example Comparison

```python
# NF4 (recommended)
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# FP4 (alternative)
fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4"
)

# Load and compare
model_nf4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=nf4_config
)

model_fp4 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=fp4_config
)

# Typical results on MMLU:
# NF4: 45.2%
# FP4: 43.8%
# FP16: 45.9%
```

## Compute Dtype

The `bnb_4bit_compute_dtype` controls the precision used for actual computation.

### Options

**torch.bfloat16** (recommended):
```python
bnb_4bit_compute_dtype=torch.bfloat16
```
- Good balance of speed and accuracy
- Recommended for A100/H100
- Prevents numerical instability

**torch.float16**:
```python
bnb_4bit_compute_dtype=torch.float16
```
- Slightly faster than BF16
- Risk of overflow/underflow
- Use only if BF16 unavailable

**torch.float32**:
```python
bnb_4bit_compute_dtype=torch.float32
```
- Most accurate
- Slowest (no tensor core acceleration)
- Debugging only

### Performance Comparison

| Dtype | Speed | Accuracy | Memory |
|-------|-------|----------|--------|
| FP32 | 1× (baseline) | 100% | 4 bytes |
| FP16 | 3-4× | 99.5% | 2 bytes |
| BF16 | 3-4× | **99.8%** | 2 bytes |

**Recommendation**: Always use `torch.bfloat16` if supported.
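
A small, hedged helper pattern for picking the compute dtype at runtime (assumes a CUDA GPU is present; it simply falls back to FP16 when the GPU lacks BF16 support):

```python
import torch
from transformers import BitsAndBytesConfig

# Use BF16 when the GPU supports it, otherwise fall back to FP16.
compute_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)
```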

## Double Quantization

Quantize the quantization constants for additional memory savings.

### How It Works

Standard 4-bit quantization stores:
- 4-bit quantized weights
- FP32 scaling factors (4 bytes per block)

Double quantization stores:
- 4-bit quantized weights
- **INT8 quantized scaling factors** (1 byte per block)

**Additional savings**: ~2-3% memory reduction
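
As a hedged back-of-the-envelope check of where those savings come from (block sizes here follow the QLoRA paper and are assumptions; bitsandbytes internals may differ):

```python
def scale_overhead_bits(block: int = 64, double_quant: bool = False, dq_block: int = 256) -> float:
    """Per-parameter overhead of the scaling factors, in bits."""
    if not double_quant:
        return 32 / block                      # one FP32 absmax scale per block of weights
    # 8-bit quantized scale per block, plus a second-level FP32 scale per dq_block blocks
    return 8 / block + 32 / (block * dq_block)

base = scale_overhead_bits()                   # 0.500 bits/param
dq = scale_overhead_bits(double_quant=True)    # ~0.127 bits/param
print(f"scale overhead: {base:.3f} -> {dq:.3f} bits per parameter")
print(f"saving relative to FP16 weights: {(base - dq) / 16:.1%}")  # ~2.3%, i.e. the "~2-3%" above
```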

### Configuration

```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True  # Enable double quantization
)
```

### Example

```python
# Without double quant
model_single = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False
    )
)
# Memory: ~36GB

# With double quant
model_double = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True
    )
)
# Memory: ~35GB (saves ~1GB)
```

**Accuracy impact**: Negligible (<0.1%)

**Recommendation**: Always enable for maximum memory savings.

## Quantization Storage

Controls the storage dtype for quantized weights (important for FSDP).

### Configuration

```python
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_storage=torch.bfloat16  # Storage dtype
)
```

### When to Use

**Default (uint8)**:
- Single-GPU training/inference
- No special requirements

**torch.bfloat16** (for FSDP):
```python
bnb_4bit_quant_storage=torch.bfloat16
```
- **Required for FSDP+QLoRA**
- Ensures 4-bit layers are wrapped like regular layers
- Enables proper model sharding

### Example: FSDP Configuration

```python
# CRITICAL: Set quant_storage for FSDP
fsdp_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16  # Must match torch_dtype!
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=fsdp_config,
    torch_dtype=torch.bfloat16  # Must match quant_storage!
)
```

## Recommended Configurations

### Production Inference (Best Accuracy)

```python
BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
```

**Use case**: Maximum accuracy with 50% memory savings

### Production Inference (Maximum Memory Savings)

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
```

**Use case**: 75% memory reduction with <1% accuracy loss

### QLoRA Training (Single GPU)

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
```

**Use case**: Fine-tune models up to ~33B on a single 24 GB GPU (e.g., RTX 3090); ~65-70B needs a single 48 GB GPU

### FSDP + QLoRA (Multi-GPU)

```python
BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_storage=torch.bfloat16  # CRITICAL!
)
```

**Use case**: Fine-tune 405B on 8×H100

## Advanced: Block-wise Quantization

bitsandbytes uses block-wise quantization:
- Weights divided into blocks (typically 64 or 128 elements)
- Each block has its own scaling factor
- Better accuracy than tensor-wise quantization

**Block size** (automatically determined):
```python
# Typical block sizes
# 4-bit: 64 elements per block
# 8-bit: 64 elements per block
```

**Cannot be configured** (internal implementation detail).
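
To see why per-block scales beat a single per-tensor scale, here is a small illustrative sketch of block-wise absmax INT8 quantization in plain PyTorch (not the bitsandbytes CUDA kernels); the injected outlier inflates only one block's scale instead of the whole tensor's:

```python
import torch

def blockwise_absmax_quant(w: torch.Tensor, block_size: int = 64):
    blocks = w.reshape(-1, block_size)                       # assumes numel divisible by block_size
    scales = blocks.abs().amax(dim=1, keepdim=True) / 127.0  # one scale per block
    q = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return q, scales

def blockwise_dequant(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(-1)

w = torch.randn(4096)
w[10] = 8.0                                                  # inject an outlier into one block
q, scales = blockwise_absmax_quant(w)
err_block = (blockwise_dequant(q, scales) - w).abs().mean()

# Per-tensor baseline: a single scale for the whole tensor
scale = w.abs().max() / 127.0
err_tensor = ((w / scale).round().clamp(-127, 127) * scale - w).abs().mean()
print(f"block-wise error {err_block:.5f} vs per-tensor error {err_tensor:.5f}")
```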

## Quantization Quality Metrics

### Perplexity (Lower is Better)

| Model | FP16 | INT8 | NF4 | NF4+DQ |
|-------|------|------|-----|--------|
| Llama 2 7B | 5.12 | 5.14 | 5.18 | 5.19 |
| Llama 2 13B | 4.88 | 4.90 | 4.93 | 4.94 |
| Llama 2 70B | 3.32 | 3.33 | 3.35 | 3.36 |

**Conclusion**: <1% degradation for all quantization methods

### MMLU Accuracy (Higher is Better)

| Model | FP16 | INT8 | NF4 | FP4 |
|-------|------|------|-----|-----|
| Llama 2 7B | 45.9% | 45.7% | 45.2% | 43.8% |
| Llama 2 13B | 54.8% | 54.6% | 54.1% | 52.9% |
| Llama 2 70B | 68.9% | 68.7% | 68.4% | 67.2% |

**Conclusion**: NF4 is significantly better than FP4 for transformers

## Troubleshooting

### "Quantization failed" Error

Try a different quant type:
```python
# If NF4 fails
bnb_4bit_quant_type="fp4"
```

### Numerical Instability

Use BF16 compute:
```python
bnb_4bit_compute_dtype=torch.bfloat16
```

### Poor Quality with 4-bit

1. Try 8-bit instead:
```python
load_in_8bit=True
```

2. Enable double quantization:
```python
bnb_4bit_use_double_quant=True
```

3. Use BF16 compute dtype

### FSDP Errors

Ensure quant_storage matches torch_dtype:
```python
bnb_4bit_quant_storage=torch.bfloat16
torch_dtype=torch.bfloat16  # Must match!
```

## References

- LLM.int8() paper: "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
- QLoRA paper: "QLoRA: Efficient Finetuning of Quantized LLMs" (2023)
- bitsandbytes GitHub: https://github.com/bitsandbytes-foundation/bitsandbytes
- Hugging Face quantization docs: https://huggingface.co/docs/transformers/quantization/bitsandbytes
EvoScientist/skills/clip/SKILL.md
@@ -0,0 +1,253 @@

---
name: clip
description: OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Multimodal, CLIP, Vision-Language, Zero-Shot, Image Classification, OpenAI, Image Search, Cross-Modal Retrieval, Content Moderation]
dependencies: [transformers, torch, pillow]
---

# CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images through natural language supervision.

## When to use CLIP

**Use when:**
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)

**Metrics**:
- **25,300+ GitHub stars**
- Trained on 400M image-text pairs
- Zero-shot ImageNet accuracy on par with a supervised ResNet-50
- MIT License

**Use alternatives instead**:
- **BLIP-2**: Better captioning
- **LLaVA**: Vision-language chat
- **Segment Anything**: Image segmentation

## Quick start

### Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

### Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
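
The same zero-shot classification can also be run through Hugging Face `transformers` (listed in this skill's dependencies) instead of the `clip` package; a minimal sketch, assuming the standard `openai/clip-vit-base-patch32` checkpoint on the Hub:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint: the Hub mirror of ViT-B/32
model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")
labels = ["a dog", "a cat", "a bird", "a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, prob in zip(labels, probs.tolist()):
    print(f"{label}: {prob:.2%}")
```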

## Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]

model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |

## Image-text similarity

```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```

## Semantic image search

```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []

for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)

image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)

for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```

## Content moderation

```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]

text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()

print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```

## Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```

## Integration with vector databases

```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings (image_embeddings computed as in the search example)
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

results = collection.query(
    query_embeddings=[text_embedding[0].cpu().numpy().tolist()],
    n_results=5
)
```

## Best practices

1. **Use ViT-B/32 for most cases** - Good balance
2. **Normalize embeddings** - Required for cosine similarity
3. **Batch processing** - More efficient
4. **Cache embeddings** - Expensive to recompute (see the sketch after this list)
5. **Use descriptive labels** - Better zero-shot performance
6. **GPU recommended** - 10-50× faster
7. **Preprocess images** - Use the provided preprocess function
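
A minimal sketch of practice 4 (cache embeddings), reusing `model`, `preprocess`, `device`, and `image_paths` from the examples above; the cache file name is illustrative:

```python
import os
import torch

CACHE_PATH = "image_embeddings.pt"   # illustrative file name

if os.path.exists(CACHE_PATH):
    cache = torch.load(CACHE_PATH)
    image_paths, image_embeddings = cache["paths"], cache["embeddings"]
else:
    image_embeddings = []
    for img_path in image_paths:     # image_paths as in the semantic search example
        img = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
        with torch.no_grad():
            emb = model.encode_image(img)
        image_embeddings.append(emb / emb.norm(dim=-1, keepdim=True))
    image_embeddings = torch.cat(image_embeddings).cpu()
    torch.save({"paths": image_paths, "embeddings": image_embeddings}, CACHE_PATH)
```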

## Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

## Limitations

1. **Not for fine-grained tasks** - Best for broad categories
2. **Requires descriptive text** - Vague labels perform poorly
3. **Trained on web data** - Inherits web-scale dataset biases
4. **No bounding boxes** - Whole image only
5. **Limited spatial understanding** - Position/counting weak

## Resources

- **GitHub**: https://github.com/openai/CLIP ⭐ 25,300+
- **Paper**: https://arxiv.org/abs/2103.00020
- **Colab**: https://colab.research.google.com/github/openai/clip/
- **License**: MIT