EvoScientist 0.1.0rc1__py3-none-any.whl → 0.1.0rc2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- EvoScientist/EvoScientist.py +1 -1
- EvoScientist/cli.py +450 -178
- EvoScientist/middleware.py +5 -1
- EvoScientist/skills/accelerate/SKILL.md +332 -0
- EvoScientist/skills/accelerate/references/custom-plugins.md +453 -0
- EvoScientist/skills/accelerate/references/megatron-integration.md +489 -0
- EvoScientist/skills/accelerate/references/performance.md +525 -0
- EvoScientist/skills/bitsandbytes/SKILL.md +411 -0
- EvoScientist/skills/bitsandbytes/references/memory-optimization.md +521 -0
- EvoScientist/skills/bitsandbytes/references/qlora-training.md +521 -0
- EvoScientist/skills/bitsandbytes/references/quantization-formats.md +447 -0
- EvoScientist/skills/clip/SKILL.md +253 -0
- EvoScientist/skills/clip/references/applications.md +207 -0
- EvoScientist/skills/find-skills/SKILL.md +133 -0
- EvoScientist/skills/find-skills/scripts/install_skill.py +211 -0
- EvoScientist/skills/flash-attention/SKILL.md +367 -0
- EvoScientist/skills/flash-attention/references/benchmarks.md +215 -0
- EvoScientist/skills/flash-attention/references/transformers-integration.md +293 -0
- EvoScientist/skills/langgraph-docs/SKILL.md +36 -0
- EvoScientist/skills/llama-cpp/SKILL.md +258 -0
- EvoScientist/skills/llama-cpp/references/optimization.md +89 -0
- EvoScientist/skills/llama-cpp/references/quantization.md +213 -0
- EvoScientist/skills/llama-cpp/references/server.md +125 -0
- EvoScientist/skills/lm-evaluation-harness/SKILL.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/api-evaluation.md +490 -0
- EvoScientist/skills/lm-evaluation-harness/references/benchmark-guide.md +488 -0
- EvoScientist/skills/lm-evaluation-harness/references/custom-tasks.md +602 -0
- EvoScientist/skills/lm-evaluation-harness/references/distributed-eval.md +519 -0
- EvoScientist/skills/ml-paper-writing/SKILL.md +937 -0
- EvoScientist/skills/ml-paper-writing/references/checklists.md +361 -0
- EvoScientist/skills/ml-paper-writing/references/citation-workflow.md +562 -0
- EvoScientist/skills/ml-paper-writing/references/reviewer-guidelines.md +367 -0
- EvoScientist/skills/ml-paper-writing/references/sources.md +159 -0
- EvoScientist/skills/ml-paper-writing/references/writing-guide.md +476 -0
- EvoScientist/skills/ml-paper-writing/templates/README.md +251 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/README.md +534 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex +144 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex +952 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bib +111 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.bst +1493 -0
- EvoScientist/skills/ml-paper-writing/templates/aaai2026/aaai2026.sty +315 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/README.md +50 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl.sty +312 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_latex.tex +377 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_lualatex.tex +101 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/acl_natbib.bst +1940 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/anthology.bib.txt +26 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/custom.bib +70 -0
- EvoScientist/skills/ml-paper-writing/templates/acl/formatting.md +326 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/README.md +3 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bib +11 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.sty +218 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/colm2025_conference.tex +305 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/colm2025/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/fancyhdr.sty +485 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib +24 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst +1440 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty +246 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex +414 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/math_commands.tex +508 -0
- EvoScientist/skills/ml-paper-writing/templates/iclr2026/natbib.sty +1246 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithm.sty +79 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/algorithmic.sty +201 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.bib +75 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/example_paper.tex +662 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/fancyhdr.sty +864 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.bst +1443 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml2026.sty +767 -0
- EvoScientist/skills/ml-paper-writing/templates/icml2026/icml_numpapers.pdf +0 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/Makefile +36 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/extra_pkgs.tex +53 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/main.tex +38 -0
- EvoScientist/skills/ml-paper-writing/templates/neurips2025/neurips.sty +382 -0
- EvoScientist/skills/peft/SKILL.md +431 -0
- EvoScientist/skills/peft/references/advanced-usage.md +514 -0
- EvoScientist/skills/peft/references/troubleshooting.md +480 -0
- EvoScientist/skills/ray-data/SKILL.md +326 -0
- EvoScientist/skills/ray-data/references/integration.md +82 -0
- EvoScientist/skills/ray-data/references/transformations.md +83 -0
- EvoScientist/skills/skill-creator/LICENSE.txt +202 -0
- EvoScientist/skills/skill-creator/SKILL.md +356 -0
- EvoScientist/skills/skill-creator/references/output-patterns.md +82 -0
- EvoScientist/skills/skill-creator/references/workflows.md +28 -0
- EvoScientist/skills/skill-creator/scripts/init_skill.py +303 -0
- EvoScientist/skills/skill-creator/scripts/package_skill.py +110 -0
- EvoScientist/skills/skill-creator/scripts/quick_validate.py +95 -0
- EvoScientist/skills/tensorboard/SKILL.md +629 -0
- EvoScientist/skills/tensorboard/references/integrations.md +638 -0
- EvoScientist/skills/tensorboard/references/profiling.md +545 -0
- EvoScientist/skills/tensorboard/references/visualization.md +620 -0
- EvoScientist/skills/vllm/SKILL.md +364 -0
- EvoScientist/skills/vllm/references/optimization.md +226 -0
- EvoScientist/skills/vllm/references/quantization.md +284 -0
- EvoScientist/skills/vllm/references/server-deployment.md +255 -0
- EvoScientist/skills/vllm/references/troubleshooting.md +447 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/METADATA +26 -3
- evoscientist-0.1.0rc2.dist-info/RECORD +119 -0
- evoscientist-0.1.0rc1.dist-info/RECORD +0 -21
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/WHEEL +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/entry_points.txt +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/licenses/LICENSE +0 -0
- {evoscientist-0.1.0rc1.dist-info → evoscientist-0.1.0rc2.dist-info}/top_level.txt +0 -0

EvoScientist/skills/llama-cpp/references/quantization.md
@@ -0,0 +1,213 @@
# GGUF Quantization Guide

Complete guide to GGUF quantization formats and model conversion.

## Quantization Overview

**GGUF** (GPT-Generated Unified Format) - Standard format for llama.cpp models.

### Format Comparison

| Format | Perplexity | Size (7B) | Tokens/sec | Notes |
|--------|------------|-----------|------------|-------|
| FP16 | 5.9565 (baseline) | 13.0 GB | 15 tok/s | Original quality |
| Q8_0 | 5.9584 (+0.03%) | 7.0 GB | 25 tok/s | Nearly lossless |
| **Q6_K** | 5.9642 (+0.13%) | 5.5 GB | 30 tok/s | Best quality/size |
| **Q5_K_M** | 5.9796 (+0.39%) | 4.8 GB | 35 tok/s | Balanced |
| **Q4_K_M** | 6.0565 (+1.68%) | 4.1 GB | 40 tok/s | **Recommended** |
| Q4_K_S | 6.1125 (+2.62%) | 3.9 GB | 42 tok/s | Faster, lower quality |
| Q3_K_M | 6.3184 (+6.07%) | 3.3 GB | 45 tok/s | Small models only |
| Q2_K | 6.8673 (+15.3%) | 2.7 GB | 50 tok/s | Not recommended |

**Recommendation**: Use **Q4_K_M** for the best balance of quality and speed.
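
The sizes in the table track bits per weight fairly closely. A rough back-of-envelope check in Python (the bits-per-weight values are approximate assumptions for illustration, not exact llama.cpp figures):

```python
# Rough size estimate: parameter count * bits-per-weight / 8.
# Bits-per-weight values below are approximate assumptions, not llama.cpp constants.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

def estimated_size_gb(n_params: float, fmt: str) -> float:
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"7B @ {fmt}: ~{estimated_size_gb(7e9, fmt):.1f} GB")
```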

## Converting Models

### HuggingFace to GGUF

```bash
# 1. Download HuggingFace model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf \
    --local-dir models/llama-2-7b-chat/

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py \
    models/llama-2-7b-chat/ \
    --outtype f16 \
    --outfile models/llama-2-7b-chat-f16.gguf

# 3. Quantize to Q4_K_M
./llama-quantize \
    models/llama-2-7b-chat-f16.gguf \
    models/llama-2-7b-chat-Q4_K_M.gguf \
    Q4_K_M
```

### Batch quantization

```bash
# Quantize to multiple formats
for quant in Q4_K_M Q5_K_M Q6_K Q8_0; do
    ./llama-quantize \
        model-f16.gguf \
        model-${quant}.gguf \
        $quant
done
```

## K-Quantization Methods

**K-quants** use mixed precision for better quality:
- Attention weights: Higher precision
- Feed-forward weights: Lower precision

**Variants**:
- `_S` (Small): Faster, lower quality
- `_M` (Medium): Balanced (recommended)
- `_L` (Large): Better quality, larger size

**Example**: `Q4_K_M`
- `Q4`: 4-bit quantization
- `K`: Mixed precision method
- `M`: Medium quality
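
The naming scheme is regular enough to parse mechanically. A minimal sketch (the helper below is ours, not part of llama.cpp):

```python
# Hypothetical helper: split a GGUF quant name such as "Q4_K_M" into its parts.
def parse_quant_name(name: str) -> dict:
    parts = name.split("_")
    info = {
        "bits": int(parts[0][1:]),                      # "Q4" -> 4-bit base quantization
        "k_quant": len(parts) > 1 and parts[1] == "K",  # mixed-precision K-quant?
    }
    if info["k_quant"] and len(parts) > 2:
        info["variant"] = {"S": "small", "M": "medium", "L": "large"}[parts[2]]
    return info

print(parse_quant_name("Q4_K_M"))  # {'bits': 4, 'k_quant': True, 'variant': 'medium'}
print(parse_quant_name("Q8_0"))    # {'bits': 8, 'k_quant': False}
```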

## Quality Testing

```bash
# Calculate perplexity (quality metric)
./llama-perplexity \
    -m model.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    -c 512

# Lower perplexity = better quality
# Baseline (FP16): ~5.96
# Q4_K_M: ~6.06 (+1.7%)
# Q2_K: ~6.87 (+15.3% - too much degradation)
```

## Use Case Guide

### General purpose (chatbots, assistants)
```
Q4_K_M - Best balance
Q5_K_M - If you have extra RAM
```

### Code generation
```
Q5_K_M or Q6_K - Higher precision helps with code
```

### Creative writing
```
Q4_K_M - Sufficient quality
Q3_K_M - Acceptable for draft generation
```

### Technical/medical
```
Q6_K or Q8_0 - Maximum accuracy
```

### Edge devices (Raspberry Pi)
```
Q2_K or Q3_K_S - Fit in limited RAM
```
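
If model selection is scripted, the guidance above can be encoded directly. A minimal sketch (the mapping simply mirrors the recommendations above; the helper name is ours):

```python
# Hypothetical helper encoding the use-case recommendations above.
RECOMMENDED_QUANT = {
    "general": ["Q4_K_M", "Q5_K_M"],
    "code": ["Q5_K_M", "Q6_K"],
    "creative": ["Q4_K_M", "Q3_K_M"],
    "technical": ["Q6_K", "Q8_0"],
    "edge": ["Q2_K", "Q3_K_S"],
}

def pick_quant(use_case: str) -> str:
    """Return the primary recommended format; the list also holds an alternative."""
    return RECOMMENDED_QUANT[use_case][0]

print(pick_quant("code"))  # Q5_K_M
```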

## Model Size Scaling

### 7B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 2.7 GB | 5 GB |
| Q3_K_M | 3.3 GB | 6 GB |
| Q4_K_M | 4.1 GB | 7 GB |
| Q5_K_M | 4.8 GB | 8 GB |
| Q6_K | 5.5 GB | 9 GB |
| Q8_0 | 7.0 GB | 11 GB |

### 13B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 5.1 GB | 8 GB |
| Q3_K_M | 6.2 GB | 10 GB |
| Q4_K_M | 7.9 GB | 12 GB |
| Q5_K_M | 9.2 GB | 14 GB |
| Q6_K | 10.7 GB | 16 GB |

### 70B parameter models

| Format | Size | RAM needed |
|--------|------|------------|
| Q2_K | 26 GB | 32 GB |
| Q3_K_M | 32 GB | 40 GB |
| Q4_K_S | 39 GB | 46 GB |
| Q4_K_M | 41 GB | 48 GB |
| Q5_K_M | 48 GB | 56 GB |

**Recommendation for 70B**: Use Q3_K_M or Q4_K_S to fit in consumer hardware.
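
The RAM columns are roughly the model file plus KV cache and runtime overhead. A rough estimator (the overhead terms are assumptions chosen to approximate the tables, not measured llama.cpp constants):

```python
# Rough RAM estimate: model file size + KV cache + runtime overhead.
# ~0.5 MB per context token is an assumption in the right range for a 7B model
# with an FP16 KV cache; the 1 GB runtime overhead is likewise an assumption.
def estimated_ram_gb(model_size_gb: float, context_tokens: int = 4096) -> float:
    kv_cache_gb = context_tokens * 0.5 / 1024
    runtime_overhead_gb = 1.0
    return model_size_gb + kv_cache_gb + runtime_overhead_gb

print(f"7B Q4_K_M: ~{estimated_ram_gb(4.1):.1f} GB")  # table says 7 GB
print(f"7B Q8_0:   ~{estimated_ram_gb(7.0):.1f} GB")  # table says 11 GB
```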

## Finding Pre-Quantized Models

**TheBloke** on HuggingFace:
- https://huggingface.co/TheBloke
- Most models available in all GGUF formats
- No conversion needed

**Example**:
```bash
# Download pre-quantized Llama 2-7B
huggingface-cli download \
    TheBloke/Llama-2-7B-Chat-GGUF \
    llama-2-7b-chat.Q4_K_M.gguf \
    --local-dir models/
```
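
The same download can be scripted with the `huggingface_hub` Python API (a minimal sketch; the repo and file names mirror the CLI example above):

```python
from huggingface_hub import hf_hub_download

# Fetch a single pre-quantized GGUF file, equivalent to the CLI example above.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models/",
)
print(path)
```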

## Importance Matrices (imatrix)

**What**: Calibration data to improve quantization quality.

**Benefits**:
- 10-20% perplexity improvement with Q4
- Essential for Q3 and below

**Usage**:
```bash
# 1. Generate importance matrix
./llama-imatrix \
    -m model-f16.gguf \
    -f calibration-data.txt \
    -o model.imatrix

# 2. Quantize with imatrix
./llama-quantize \
    --imatrix model.imatrix \
    model-f16.gguf \
    model-Q4_K_M.gguf \
    Q4_K_M
```

**Calibration data**:
- Use domain-specific text (e.g., code for code models)
- ~100 MB of representative text
- Higher quality data = better quantization
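
One simple way to assemble such a calibration file from domain text (the source directory and the 100 MB cap are illustrative assumptions):

```python
from pathlib import Path

# Concatenate domain-specific text into calibration-data.txt, roughly capped at 100 MB.
# "domain-text/" is an assumed source directory.
cap_chars = 100 * 1024 * 1024
written = 0
with open("calibration-data.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("domain-text/").glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        remaining = cap_chars - written
        out.write(text[:remaining])
        written += min(len(text), remaining)
        if written >= cap_chars:
            break
```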

## Troubleshooting

**Model outputs gibberish**:
- Quantization too aggressive (e.g., Q2_K)
- Try Q4_K_M or Q5_K_M
- Verify the model converted correctly

**Out of memory**:
- Use a smaller quantization (Q4_K_S instead of Q5_K_M)
- Offload fewer layers to the GPU (`-ngl`)
- Use a smaller context (`-c 2048`)

**Slow inference**:
- Higher-precision formats use more compute
- Q8_0 is much slower than Q4_K_M
- Consider the speed vs. quality trade-off

EvoScientist/skills/llama-cpp/references/server.md
@@ -0,0 +1,125 @@
# Server Deployment Guide

Production deployment of llama.cpp server with OpenAI-compatible API.

## Server Modes

### llama-server

```bash
# Basic server
./llama-server \
    -m models/llama-2-7b-chat.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096  # Context size

# With GPU acceleration
./llama-server \
    -m models/llama-2-70b.Q4_K_M.gguf \
    -ngl 40  # Offload 40 layers to GPU
```

## OpenAI-Compatible API

### Chat completions
```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-2",
        "messages": [
            {"role": "system", "content": "You are helpful"},
            {"role": "user", "content": "Hello"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```

### Streaming
```bash
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-2",
        "messages": [{"role": "user", "content": "Count to 10"}],
        "stream": true
    }'
```
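
Because the endpoint follows the OpenAI schema, the official `openai` Python client can also be pointed at it (a minimal sketch; llama-server ignores the API key, but the client requires some value):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama-server endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-2",
    messages=[
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming works the same way with stream=True.
for chunk in client.chat.completions.create(
    model="llama-2",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```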

## Docker Deployment

**Dockerfile**:
```dockerfile
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y git build-essential
RUN git clone https://github.com/ggerganov/llama.cpp
WORKDIR /llama.cpp
# Note: building with LLAMA_CUDA=1 requires the CUDA toolkit in the build image
# (e.g., an nvidia/cuda devel base image) in addition to build-essential.
RUN make LLAMA_CUDA=1
COPY models/ /models/
EXPOSE 8080
CMD ["./llama-server", "-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080"]
```

**Build and run**:
```bash
docker build -t llama-cpp:latest .
docker run --gpus all -p 8080:8080 llama-cpp:latest
```

## Monitoring

```bash
# Server metrics endpoint (start llama-server with --metrics to enable it)
curl http://localhost:8080/metrics

# Health check
curl http://localhost:8080/health
```

**Metrics**:
- requests_total
- tokens_generated
- prompt_tokens
- completion_tokens
- kv_cache_tokens
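
A small readiness probe against `/health` (retry count and interval are arbitrary; uses the `requests` package):

```python
import time
import requests

# Poll the llama-server health endpoint until it responds, or give up.
def wait_until_healthy(url: str = "http://localhost:8080/health",
                       retries: int = 30, interval_s: float = 2.0) -> bool:
    for _ in range(retries):
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(interval_s)
    return False

print("server ready" if wait_until_healthy() else "server not responding")
```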

## Load Balancing

**NGINX**:
```nginx
upstream llama_cpp {
    server llama1:8080;
    server llama2:8080;
}

server {
    location / {
        proxy_pass http://llama_cpp;
        proxy_read_timeout 300s;
    }
}
```

## Performance Tuning

**Parallel requests**:
```bash
./llama-server \
    -m model.gguf \
    -np 4  # 4 parallel slots
```

**Continuous batching**:
```bash
./llama-server \
    -m model.gguf \
    --cont-batching  # Enable continuous batching
```

**Context caching**:
```bash
./llama-server \
    -m model.gguf \
    --cache-prompt  # Cache processed prompts
```
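
With `-np 4` and continuous batching, the server interleaves concurrent requests. A small load sketch using `requests` and a thread pool (the prompts and worker count are arbitrary; four workers match the four slots above):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = ["Hello", "Count to 10", "Name three colors", "What is GGUF?"]  # arbitrary examples

def ask(prompt: str) -> str:
    payload = {
        "model": "llama-2",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    r = requests.post(URL, json=payload, timeout=300)
    return r.json()["choices"][0]["message"]["content"]

# Four workers to match the four parallel slots configured with -np 4.
with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(PROMPTS, pool.map(ask, PROMPTS)):
        print(f"{prompt!r} -> {answer[:60]!r}")
```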