RubyGems - red-candle - Versions diffs - 1.8.0-aarch64-linux - Mend

red-candle 1.8.0-aarch64-linux

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (76)

checksums.yaml +7 -0
data/Cargo.lock +5021 -0
data/Cargo.toml +6 -0
data/Gemfile +3 -0
data/LICENSE +22 -0
data/README.md +1171 -0
data/Rakefile +167 -0
data/bin/console +11 -0
data/bin/setup +17 -0
data/ext/candle/Cargo.toml +38 -0
data/ext/candle/build.rs +117 -0
data/ext/candle/extconf.rb +79 -0
data/ext/candle/rustfmt.toml +63 -0
data/ext/candle/src/gvl.rs +58 -0
data/ext/candle/src/lib.rs +59 -0
data/ext/candle/src/llm/constrained_generation_test.rs +395 -0
data/ext/candle/src/llm/gemma.rs +313 -0
data/ext/candle/src/llm/generation_config.rs +63 -0
data/ext/candle/src/llm/glm4.rs +236 -0
data/ext/candle/src/llm/granite.rs +308 -0
data/ext/candle/src/llm/granitemoehybrid.rs +315 -0
data/ext/candle/src/llm/llama.rs +396 -0
data/ext/candle/src/llm/mistral.rs +309 -0
data/ext/candle/src/llm/mod.rs +49 -0
data/ext/candle/src/llm/phi.rs +369 -0
data/ext/candle/src/llm/quantized_gguf.rs +734 -0
data/ext/candle/src/llm/qwen.rs +261 -0
data/ext/candle/src/llm/qwen3.rs +257 -0
data/ext/candle/src/llm/text_generation.rs +284 -0
data/ext/candle/src/ruby/device.rs +234 -0
data/ext/candle/src/ruby/dtype.rs +39 -0
data/ext/candle/src/ruby/embedding_model.rs +477 -0
data/ext/candle/src/ruby/errors.rs +16 -0
data/ext/candle/src/ruby/llm.rs +730 -0
data/ext/candle/src/ruby/mod.rs +24 -0
data/ext/candle/src/ruby/ner.rs +444 -0
data/ext/candle/src/ruby/reranker.rs +488 -0
data/ext/candle/src/ruby/result.rs +3 -0
data/ext/candle/src/ruby/structured.rs +92 -0
data/ext/candle/src/ruby/tensor.rs +731 -0
data/ext/candle/src/ruby/tokenizer.rs +343 -0
data/ext/candle/src/ruby/utils.rs +96 -0
data/ext/candle/src/ruby/vlm.rs +330 -0
data/ext/candle/src/structured/integration_test.rs +130 -0
data/ext/candle/src/structured/mod.rs +31 -0
data/ext/candle/src/structured/schema_processor.rs +215 -0
data/ext/candle/src/structured/vocabulary_adapter.rs +152 -0
data/ext/candle/src/structured/vocabulary_adapter_real_test.rs +66 -0
data/ext/candle/src/structured/vocabulary_adapter_simple_test.rs +70 -0
data/ext/candle/src/tokenizer/loader.rs +108 -0
data/ext/candle/src/tokenizer/mod.rs +104 -0
data/ext/candle/tests/device_tests.rs +43 -0
data/ext/candle/tests/tensor_tests.rs +162 -0
data/lib/candle/3.1/candle.so +0 -0
data/lib/candle/3.2/candle.so +0 -0
data/lib/candle/3.3/candle.so +0 -0
data/lib/candle/3.4/candle.so +0 -0
data/lib/candle/4.0/candle.so +0 -0
data/lib/candle/agent.rb +68 -0
data/lib/candle/build_info.rb +67 -0
data/lib/candle/device_utils.rb +10 -0
data/lib/candle/embedding_model.rb +75 -0
data/lib/candle/embedding_model_type.rb +31 -0
data/lib/candle/llm.rb +595 -0
data/lib/candle/logger.rb +149 -0
data/lib/candle/ner.rb +368 -0
data/lib/candle/reranker.rb +45 -0
data/lib/candle/tensor.rb +99 -0
data/lib/candle/tokenizer.rb +139 -0
data/lib/candle/tool.rb +47 -0
data/lib/candle/tool_call_parser.rb +57 -0
data/lib/candle/version.rb +5 -0
data/lib/candle/vlm.rb +31 -0
data/lib/candle.rb +29 -0
data/lib/red-candle.rb +1 -0
metadata +309 -0

data/README.md ADDED

@@ -0,0 +1,1171 @@
+<img src="/docs/assets/logo-title.png" alt="red-candle" height="160px">
+[![build](https://github.com/scientist-labs/red-candle/actions/workflows/build.yml/badge.svg)](https://github.com/scientist-labs/red-candle/actions/workflows/build.yml)
+[![Gem Version](https://badge.fury.io/rb/red-candle.svg)](https://badge.fury.io/rb/red-candle)
+Run state-of-the-art **language models directly from Ruby**. No Python, no APIs, no external services - just Ruby with blazing-fast Rust under the hood. Hardware accelerated with **Metal (Mac)** and **CUDA (NVIDIA)**. Red-Candle leverages the Rust ecosystem, notably [Candle](https://github.com/huggingface/candle) and [Magnus](https://github.com/matsadler/magnus), to provide a fast and efficient way to run LLMs in Ruby. See [Dependencies](#dependencies) for more.
+## Install & Chat in 30 Seconds
+[![red-candle quickstart](https://img.youtube.com/vi/hbyFCyh8esk/0.jpg)](https://www.youtube.com/watch?v=hbyFCyh8esk)
+```bash
+# Install the gem
+gem install red-candle
+```
+```ruby
+require 'candle'
+# Download a model (one-time, ~650MB) - Mistral, Llama3, Gemma all work!
+llm = Candle::LLM.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
+                                  gguf_file: "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
+# Chat with it - no API calls, running locally in your Ruby process!
+messages = [
+  { role: "user", content: "Explain Ruby in one sentence" }
+]
+puts llm.chat(messages)
+# => "Ruby is a dynamic, object-oriented programming language known for its
+#     simplicity, elegance, and productivity, often used for web development
+#     with frameworks like Rails."
+```
+## What Just Happened?
+You just ran a 1.1-billion-parameter AI model inside Ruby. The model lives in your process memory, runs on your hardware (CPU/GPU), and responds without any network round trip.
+## Stream Responses Like a Pro
+```ruby
+# Watch the AI think in real-time
+llm.chat_stream(messages) do |token|
+  print token
+end
+```
+## Why This Matters
+- **Privacy**: Your data never leaves your machine
+- **Speed**: No network overhead, direct memory access
+- **Control**: Fine-tune generation parameters, access raw tokens
+- **Integration**: It's just Ruby objects - use it anywhere Ruby runs
+## Supports
+- **Tokenizers**: Access the tokenizer directly
+- **EmbeddingModel**: Generate embeddings for text
+- **Reranker**: Rerank documents based on relevance
+- **NER**: Named Entity Recognition directly from Ruby
+- **LLM**: Chat with Large Language Models (e.g., Llama, Mistral, Gemma, Qwen, Phi)
+- **Structured Generation**: Generate JSON from a schema or match a regular expression
+## Model Storage
+Models are automatically downloaded and cached when you first use them. They are stored in:
+- **Location**: `~/.cache/huggingface/hub/`
+- **Size**: Models range from ~100MB (embeddings) to several GB (LLMs)
+- **Reuse**: Models are downloaded once and reused across sessions
+To check your cache or manage storage:
+```bash
+# View cache contents
+ls -la ~/.cache/huggingface/hub/
+# Check total cache size
+du -sh ~/.cache/huggingface/
+# Clear cache if needed (removes all downloaded models)
+rm -rf ~/.cache/huggingface/hub/
+```
+----
+## Usage
+```ruby
+require "candle"
+x = Candle::Tensor.new([1, 2, 3, 4, 5, 6], :f32)
+x = x.reshape([3, 2])
+# [[1., 2.],
+#  [3., 4.],
+#  [5., 6.]]
+# Tensor[[3, 2], f32]
+```
+```ruby
+require 'candle'
+# Default model (JinaBERT) on CPU
+model = Candle::EmbeddingModel.from_pretrained
+embedding = model.embedding("Hi there!")
+# Specify device (CPU, Metal, or CUDA)
+device = Candle::Device.cpu     # or Candle::Device.metal, Candle::Device.cuda
+model = Candle::EmbeddingModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", device: device)
+embedding = model.embedding("Hi there!")
+# Reranker also supports device selection
+reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2", device: device)
+results = reranker.rerank("query", ["doc1", "doc2", "doc3"])
+```
+## LLM Support
+Red-Candle now supports Large Language Models (LLMs) with GPU acceleration!
+### Supported Models
+- **Gemma**: Google's Gemma models (e.g., `google/gemma-2b`, `google/gemma-7b`, `google/gemma-2b-it`)
+- **Llama**: Llama 2 and Llama 3 models (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, `meta-llama/Llama-2-7b-hf`, `NousResearch/Llama-2-7b-hf`)
+- **Mistral**: All Mistral models (e.g., `mistralai/Mistral-7B-Instruct-v0.1`)
+- **Qwen**: Qwen 2, 2.5, and 3 models (e.g., `Qwen/Qwen2-1.5B`, `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen3-0.6B`)
+- **Phi**: Microsoft's Phi-2, Phi-3, Phi-3.5, and Phi-4 models (e.g., `microsoft/phi-2`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/phi-4`)
+  - ⚠️ ⚠️ ⚠️ Note: Phi-3 and Phi-4 GGUF models have a known issue with KV cache persistence between generations. The `reset_cache` parameter doesn't work for GGUF models. Recreate the model instance for each generation.
+  - `candle` pull request about phi-3 gguf models: https://github.com/huggingface/candle/pull/2937
+### Quantized Model Support (GGUF)
+Red-Candle supports quantized models in GGUF format, offering 4-8x memory reduction:
+> **Note on GGUF Support**: Red-Candle now uses a unified GGUF loader that automatically detects the model architecture from the GGUF file. This means all GGUF models (including Mistral models from TheBloke) should now work correctly! The loader automatically selects the appropriate tokenizer based on the model type to ensure proper text generation.
+```ruby
+# Load quantized models - always specify the GGUF filename
+llm = Candle::LLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF",
+                                  device: device,
+                                  gguf_file: "llama-2-7b-chat.Q4_K_M.gguf")
+# Register custom tokenizer mappings for your models
+Candle::LLM.register_tokenizer("my-org/my-model-GGUF", "my-org/my-tokenizer")
+# Popular quantized model sources:
+# - TheBloke: Extensive collection of GGUF models
+# - Search HuggingFace for "GGUF" models
+```
+**Memory usage comparison (7B models):**
+- Full precision: ~28 GB
+- Q8_0 (8-bit): ~7 GB - Best quality, larger size
+- Q5_K_M (5-bit): ~4.5 GB - Very good quality
+- Q4_K_M (4-bit): ~4 GB - Recommended default, best balance
+- Q3_K_M (3-bit): ~3 GB - Good for memory-constrained systems
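The figures above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times bits per weight, divided by 8. The sketch below uses nominal bit widths, which is an assumption. Real GGUF files also store per-block scale data, so actual sizes run somewhat larger than these estimates. The names `NOMINAL_BITS` and `estimated_weight_gb` are illustrative and not part of Red-Candle.

```ruby
# Rough weight-memory estimate: parameters * bits per weight / 8 bytes.
# Nominal bit widths are an assumption; per-block scale metadata in real
# GGUF files makes the actual size slightly larger.
NOMINAL_BITS = {
  "F32"    => 32,
  "Q8_0"   => 8,
  "Q5_K_M" => 5,
  "Q4_K_M" => 4,
  "Q3_K_M" => 3
}.freeze

def estimated_weight_gb(param_count, bits_per_weight)
  bytes = param_count * bits_per_weight / 8.0
  (bytes / 1e9).round(1)
end

# Print an estimate for a 7-billion-parameter model at each level.
NOMINAL_BITS.each do |name, bits|
  puts format("%-7s ~%.1f GB", name, estimated_weight_gb(7_000_000_000, bits))
end
```

The estimate covers weights only. Activations and the KV cache add to it at runtime.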
+**Quantization levels explained:**
+- **Q8_0**: Almost identical to full model, use when quality is paramount
+- **Q5_K_M**: Excellent quality with good compression
+- **Q4_K_M**: Best balance of quality/size/speed (recommended default)
+- **Q3_K_M**: Noticeable quality reduction but very compact
+- **Q2_K**: ⚠️ **Not recommended** - Can cause inference errors due to extreme quantization
+> **Warning**: Q2_K quantization can lead to "weight is negative, too large or not a valid number" errors during inference. Use Q3_K_M or higher for stable operation.
+> ### ⚠️ Huggingface login warning
+>
+> Many models, including the one below, require you to agree to the terms. You'll need to:
+> 1. Login to [Huggingface](https://huggingface.co)
+> 2. Agree to the terms. For example: [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
+> 3. Authenticate your session. The simplest way is with `huggingface-cli login`. Details here: [Huggingface CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
+>
+> More details here: [Huggingface Authentication](docs/HUGGINGFACE.md)
+```ruby
+require 'candle'
+# Choose your device
+device = Candle::Device.cpu     # CPU (default)
+device = Candle::Device.metal   # Apple GPU (Metal)
+device = Candle::Device.cuda    # NVIDIA GPU (CUDA)
+# Load a model
+llm = Candle::LLM.from_pretrained("google/gemma-2b-it", device: device)  # Gemma
+# llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device: device)  # Llama
+# llm = Candle::LLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", device: device)  # Mistral
+# Generate text
+response = llm.generate("What is Ruby?", config: Candle::GenerationConfig.balanced)
+# Stream generation
+llm.generate_stream("Tell me a story", config: Candle::GenerationConfig.balanced) do |token|
+  print token
+end
+# Chat interface
+messages = [
+  { role: "system", content: "You are a helpful assistant." },
+  { role: "user", content: "Explain Ruby in one sentence." }
+]
+response = llm.chat(messages)
+```
+### GPU Acceleration
+We see an 18x speedup running LLMs under CUDA vs CPU and a >3x speedup running under Metal vs CPU. Details [here](docs/DEVICE_SUPPORT.md#performance-considerations).
+```ruby
+# CPU works for all models
+device = Candle::Device.cpu
+llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device: device)
+# Metal
+device = Candle::Device.metal
+# CUDA support (for NVIDIA GPUs)
+device = Candle::Device.cuda   # Linux/Windows with NVIDIA GPU
+```
+### Debugging Token Generation
+For debugging purposes, you can enable raw token output to see both token IDs and their raw representations:
+```ruby
+# Enable debug mode to see raw tokens during generation
+config = Candle::GenerationConfig.balanced(debug_tokens: true)
+# Non-streaming generation with debug tokens
+result = llm.generate("Hello, world!", config: config)
+puts result
+# Output: [15043:Hello][11:,][1917:world][0:!]
+# Streaming generation with debug tokens
+llm.generate_stream("Hello, world!", config: config) do |text|
+  print text  # Will show each token as it's generated: [15043:Hello][11:,][1917:world][0:!]
+end
+# Works with all models (Llama, Mistral, Gemma, and quantized GGUF models)
+```
+This is particularly useful for:
+- Debugging tokenization issues
+- Understanding how the model processes text
+- Troubleshooting generation problems
+- Analyzing model behavior
+## Structured Generation
+Red Candle supports structured generation to constrain LLM outputs to follow specific patterns like JSON schemas or regular expressions:
+```ruby
+# Define a JSON schema
+schema = {
+  type: "object",
+  properties: {
+    answer: { type: "string", enum: ["yes", "no"] },
+    confidence: { type: "number", minimum: 0, maximum: 1 }
+  },
+  required: ["answer"]
+}
+# Generate and parse in one step
+result = llm.generate_structured("Is Ruby easy to learn?", schema: schema)
+puts result["answer"]      # "yes"
+puts result["confidence"]  # 0.9
+# Or use regex patterns for non-JSON outputs
+phone_constraint = llm.constraint_from_regex('\d{3}-\d{3}-\d{4}')
+config = Candle::GenerationConfig.balanced(constraint: phone_constraint)
+phone = llm.generate("Generate a phone number:", config: config)
+```
+See [STRUCTURED_GENERATION.md](docs/STRUCTURED_GENERATION.md) for detailed documentation.
+**Note on Reliability**: Structured generation constrains the model's output tokens, but success rates vary by model size and schema complexity. Smaller models (< 7B parameters) may occasionally produce incomplete or invalid JSON, especially with complex schemas. Consider implementing retry logic or fallback strategies in production applications. Larger models generally perform much better with structured generation.
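One way to implement the retry logic suggested above is sketched below. It is a generic helper and not part of Red-Candle. The name `generate_json_with_retry` is hypothetical, and `generate` is any block that returns a raw string. In practice that block would wrap a model call.

```ruby
require 'json'

# Retry a generation step until its output parses as JSON and contains
# every required key. `generate` is any block returning a String; here it
# is stubbed, in practice it would wrap a call to the model (assumption).
def generate_json_with_retry(required_keys:, max_attempts: 3, &generate)
  max_attempts.times do |attempt|
    raw = generate.call(attempt)
    begin
      parsed = JSON.parse(raw)
    rescue JSON::ParserError
      next # incomplete or invalid JSON: try again
    end
    return parsed if parsed.is_a?(Hash) && required_keys.all? { |k| parsed.key?(k) }
  end
  nil # all attempts failed; the caller decides on a fallback
end

# Stubbed generator: the first attempt is truncated, the second is valid.
outputs = ['{"answer": "ye', '{"answer": "yes", "confidence": 0.9}']
result = generate_json_with_retry(required_keys: ["answer"]) { |i| outputs[i] }
puts result.inspect
```

Returning `nil` instead of raising keeps the fallback decision with the caller, for example switching to a larger model or a simpler schema.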
+## Tool Calling
+Red-candle supports tool/function calling, enabling models to invoke external functions during generation. This works best with models fine-tuned for tool calling, such as Qwen3.
+### Defining Tools
+```ruby
+get_weather = Candle::Tool.new(
+  name: "get_weather",
+  description: "Get the current weather for a city",
+  parameters: {
+    type: "object",
+    properties: { city: { type: "string", description: "City name" } },
+    required: ["city"]
+  }
+) { |args| { city: args["city"], temperature: 72, condition: "sunny" } }
+```
+### Extracting Tool Calls
+`chat_with_tools` injects tool definitions into the system prompt, generates a response, and parses any `<tool_call>` tags from the output. It does **not** feed results back to the model — it just tells you what the model wants to call. You decide what to do with it:
+```ruby
+llm = Candle::LLM.from_pretrained("Qwen/Qwen3-0.6B")
+messages = [{ role: "user", content: "What's the weather in San Francisco?" }]
+result = llm.chat_with_tools(messages, tools: [get_weather],
+  config: Candle::GenerationConfig.deterministic(max_length: 500))
+if result.has_tool_calls?
+  result.tool_calls.each do |tc|
+    puts "#{tc.name}(#{tc.arguments})"
+    output = get_weather.call(tc.arguments)
+    puts "=> #{output}"
+  end
+else
+  puts result.text_response
+end
+```
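To illustrate what parsing `<tool_call>` tags involves, here is a minimal standalone sketch. It is not Red-Candle's own parser. It assumes the Qwen-style payload, a JSON object with `name` and `arguments` keys inside each tag, and the names `TOOL_CALL_PATTERN` and `extract_tool_calls` are hypothetical.

```ruby
require 'json'

# Match each <tool_call>...</tool_call> block, capturing the body.
TOOL_CALL_PATTERN = %r{<tool_call>\s*(.*?)\s*</tool_call>}m

# Parse each captured body as JSON. Assumes a payload of the form
# {"name": ..., "arguments": {...}}; malformed blocks are skipped.
def extract_tool_calls(text)
  text.scan(TOOL_CALL_PATTERN).filter_map do |(body)|
    begin
      call = JSON.parse(body)
      { name: call["name"], arguments: call["arguments"] || {} } if call.is_a?(Hash) && call["name"]
    rescue JSON::ParserError
      nil
    end
  end
end

sample = <<~TEXT
  <think>The user wants the weather.</think>
  <tool_call>
  {"name": "get_weather", "arguments": {"city": "San Francisco"}}
  </tool_call>
TEXT

extract_tool_calls(sample).each { |tc| puts "#{tc[:name]}(#{tc[:arguments]})" }
```

Skipping malformed blocks rather than raising lets the caller fall back to treating the output as plain text.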
+Pass `execute: true` to automatically run the tools (but still no round-trip back to the model):
+```ruby
+result = llm.chat_with_tools(messages, tools: [get_weather], execute: true,
+  config: Candle::GenerationConfig.deterministic(max_length: 500))
+result.tool_results.each do |tr|
+  puts "#{tr[:tool_call].name} => #{tr[:result]}"
+end
+```
+### Agent (Multi-Turn Tool Loop)
+`Candle::Agent` completes the round-trip: generate → parse tool calls → execute → feed results back to the model → repeat until the model produces a final text answer or hits `max_iterations`. This is a convenience wrapper for quick prototyping — for production use, frameworks like [RubyLLM](https://github.com/crmne/ruby_llm) manage this loop for you via the [ruby_llm-red_candle](https://github.com/scientist-labs/ruby_llm-red_candle) plugin:
+```ruby
+agent = Candle::Agent.new(llm, tools: [get_weather, lookup_price], max_iterations: 5)
+result = agent.run("What's the weather in Paris, and how much does a widget cost?",
+  config: Candle::GenerationConfig.deterministic(max_length: 1000))
+puts result.response         # Final text answer from the model
+puts result.iterations       # Number of generate cycles
+puts result.tool_calls_made  # Number of tools invoked
+```
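The loop described above can be sketched in plain Ruby with a stubbed model. This is not the `Candle::Agent` implementation. The `run_tool_loop` name and the reply shape (`{ tool:, args: }` or `{ text: }`) are hypothetical stand-ins for generation plus tool-call parsing.

```ruby
# Minimal multi-turn loop. `model` is any callable that takes the message
# list and returns either { tool: name, args: hash } or { text: string }.
# That reply shape is hypothetical; it stands in for generate + parse.
def run_tool_loop(model, tools, prompt, max_iterations: 5)
  messages = [{ role: "user", content: prompt }]
  max_iterations.times do |i|
    reply = model.call(messages)
    return { response: reply[:text], iterations: i + 1 } if reply[:text]

    result = tools.fetch(reply[:tool]).call(reply[:args])
    messages << { role: "tool", content: result.to_s } # feed the result back
  end
  { response: nil, iterations: max_iterations } # hit the iteration cap
end

tools = { "get_weather" => ->(args) { "72F and sunny in #{args["city"]}" } }

# Stub model: requests the tool first, then answers from the tool result.
model = lambda do |messages|
  last = messages.last
  if last[:role] == "tool"
    { text: "It is #{last[:content]}." }
  else
    { tool: "get_weather", args: { "city" => "Paris" } }
  end
end

outcome = run_tool_loop(model, tools, "What's the weather in Paris?")
puts outcome[:response]    # It is 72F and sunny in Paris.
puts outcome[:iterations]  # 2
```

The iteration cap keeps the loop from running forever when a model keeps requesting tools.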
+### Model Recommendations
+Tool calling quality depends heavily on model size:
+| Model | Tool Calling Quality |
+|-------|---------------------|
+| **Qwen3-8B GGUF** (~5 GB) | Calls correct tools, self-corrects errors, but may hallucinate values from tool results |
+| **Qwen3-4B GGUF** (~2.5 GB) | Calls correct tools, occasional reasoning errors |
+| **Qwen3-0.6B** (~1.2 GB) | Single-turn works, needs `max_length: 500+` for thinking |
+| SmolLM2-360M | Does not work |
+| TinyLlama-1.1B | Does not work (not fine-tuned for tool calling) |
+**Tip:** Qwen3 models use a `<think>` reasoning block before producing tool calls. Set `max_length` high enough (500+ for 0.6B, 1000+ for larger models) to allow room for both thinking and the tool call.
+## ⚠️ Model Format Requirements
+### EmbeddingModels and Rerankers: Safetensors Only
+Red-Candle **only supports embedding models and rerankers that provide their weights in the [safetensors](https://github.com/huggingface/safetensors) format** (i.e., the model repo must contain a `model.safetensors` file). If the model repo does not provide the required file, loading will fail with a clear error. Most official BERT and DistilBERT models do **not** provide safetensors; many Sentence Transformers and JinaBERT models do.
+**If you encounter an error like:**
+```
+RuntimeError: model.safetensors not found after download. Only safetensors models are supported. Please ensure your model repo contains model.safetensors.
+```
+this means the selected model is not compatible. Please choose a model repo that provides the required file.
+### LLMs: Safetensors and GGUF Support
+LLM models support two formats:
+1. **Safetensors format** - Standard HuggingFace models (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
+2. **GGUF quantized format** - Memory-efficient quantized models (e.g., `TheBloke/Llama-2-7B-Chat-GGUF`)
+See the [Quantized Model Support](#quantized-model-support-gguf) section for details on using GGUF models.
+## Supported Embedding Models
+Red-Candle supports the following embedding model types from Hugging Face:
+1. `Candle::EmbeddingModelType::JINA_BERT` - Jina BERT models (e.g., `jinaai/jina-embeddings-v2-base-en`) (**safetensors required**)
+2. `Candle::EmbeddingModelType::MINILM` - MiniLM models (e.g., `sentence-transformers/all-MiniLM-L6-v2`) (**safetensors required**)
+3. `Candle::EmbeddingModelType::DISTILBERT` - DistilBERT models (e.g., `distilbert-base-uncased-finetuned-sst-2-english`) (**safetensors required**)
+4. `Candle::EmbeddingModelType::STANDARD_BERT` - Standard BERT models (e.g., `scientistcom/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`) (**safetensors required**)
+> **Note:** Most official BERT and DistilBERT models do _not_ provide safetensors. Please check the model repo before use.
+You can get a list of all supported model types and suggested model paths:
+```ruby
+Candle::EmbeddingModelType.all  # Returns all supported model types
+Candle::EmbeddingModelType.suggested_model_paths  # Returns hash of suggested models for each type
+```
+## A note on memory usage
+The default model (`jinaai/jina-embeddings-v2-base-en` with the `sentence-transformers/all-MiniLM-L6-v2` tokenizer, both from [HuggingFace](https://huggingface.co)) takes a little more than 3GB of memory running on a Mac. The memory is held by each instantiated `Candle::EmbeddingModel`, so instantiating more than one uses proportionally more memory. Likewise, if you let a model go out of scope and run the garbage collector, that memory is freed. For example:
+```ruby
+> require 'candle'
+# Ruby memory = 25.9 MB
+> model = Candle::EmbeddingModel.from_pretrained
+# Ruby memory = 3.50 GB
+> model2 = Candle::EmbeddingModel.from_pretrained
+# Ruby memory = 7.04 GB
+> model2 = nil
+> GC.start
+# Ruby memory = 3.56 GB
+> model = nil
+> GC.start
+# Ruby memory = 55.2 MB
+```
+## A note on returned embeddings
+The embeddings should match those generated by the Python `transformers` library. For instance, locally I was able to generate the same embedding for the text "Hi there!" using the Python code:
+```python
+from transformers import AutoModel
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
+sentence = ['Hi there!']
+embedding = model.encode(sentence)
+print(embedding)
+```
+And the following Ruby:
+```ruby
+require 'candle'
+model = Candle::EmbeddingModel.from_pretrained
+embedding = model.embedding("Hi there!")
+```
+## Document Reranking
+Red-Candle includes support for cross-encoder reranking models, which can be used to reorder documents by relevance to a query. This is particularly useful for improving search results or implementing retrieval-augmented generation (RAG) systems.
+### Supported Models
+Red-Candle supports both **BERT** and **XLM-RoBERTa** reranker architectures. The model type is auto-detected from `config.json`.
+| Model | Architecture | Params | Notes |
+|-------|-------------|--------|-------|
+| `BAAI/bge-reranker-base` | XLM-RoBERTa | 278M | Recommended — strong quality, multilingual |
+| `BAAI/bge-reranker-large` | XLM-RoBERTa | 560M | Best quality, higher resource usage |
+| `BAAI/bge-reranker-v2-m3` | XLM-RoBERTa | 278M | Multilingual, very strong |
+| `cross-encoder/ms-marco-MiniLM-L-12-v2` | BERT | 33M | Lightweight, English only |
+### Basic Usage
+```ruby
+require 'candle'
+# Initialize the reranker (BGE reranker recommended for quality)
+reranker = Candle::Reranker.from_pretrained("BAAI/bge-reranker-base")
+# Or use the lighter BERT-based model
+reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2")
+# Or with custom max_length for truncation (default is 512)
+reranker = Candle::Reranker.from_pretrained(
+  "cross-encoder/ms-marco-MiniLM-L-12-v2",
+  max_length: 256  # Faster processing with less context
+)
+# Define your query and candidate documents
+query = "How many people live in London?"
+documents = [
+  "London is known for its financial district",
+  "Around 9 Million people live in London",
+  "The weather in London is often rainy",
+  "London is the capital of England"
+]
+# Rerank documents by relevance to the query (raw logits)
+ranked_results = reranker.rerank(query, documents, pooling_method: "pooler", apply_sigmoid: false)
+# Or apply sigmoid activation to get scores between 0 and 1
+sigmoid_results = reranker.rerank(query, documents, pooling_method: "pooler", apply_sigmoid: true)
+# The pooler method is the default and is recommended for cross-encoders, as is apply_sigmoid, so the above is the same as:
+ranked_results = reranker.rerank(query, documents)
+# Results are returned as an array of hashes, sorted by relevance, e.g.:
+ranked_results.each do |result|
+  puts "Score: #{result[:score].round(4)} - Doc ##{result[:doc_id]}: #{result[:text]}"
+end
+# Output:
+# Score: 1.0 - Doc #1: Around 9 Million people live in London
+# Score: 0.0438 - Doc #3: London is the capital of England
+# Score: 0.0085 - Doc #0: London is known for its financial district
+# Score: 0.0005 - Doc #2: The weather in London is often rainy
+```
+### Arguments & Activation Functions
+By default, `apply_sigmoid` is `true` (scores between 0 and 1). Set it to `false` to get raw logits. You can also select the pooling method:
+- `pooling_method: "pooler"` (default)
+- `pooling_method: "cls"`
+- `pooling_method: "mean"`
+Example without sigmoid activation:
+```ruby
+# Get raw logits
+ranked_results = reranker.rerank(query, documents, apply_sigmoid: false)
+ranked_results.each do |result|
+  puts "Score: #{result[:score].round(4)} - Doc ##{result[:doc_id]}: #{result[:text]}"
+end
+# Output:
+# Score: 10.3918 - Doc #1: Around 9 Million people live in London
+# Score: -3.0829 - Doc #3: London is the capital of England
+# Score: -4.7619 - Doc #0: London is known for its financial district
+# Score: -7.5251 - Doc #2: The weather in London is often rainy
+```
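The two example outputs in this section are related by the logistic sigmoid, `1 / (1 + e^(-x))`. The short check below applies it to the raw logits listed above and reproduces the 0-to-1 scores from the earlier example. The `sigmoid` helper is defined here for illustration and is not a Red-Candle method.

```ruby
# Logistic sigmoid: maps a raw logit to a score in (0, 1).
def sigmoid(logit)
  1.0 / (1.0 + Math.exp(-logit))
end

# Raw logits taken from the example output above.
logits = [10.3918, -3.0829, -4.7619, -7.5251]
logits.each do |logit|
  puts "#{logit} -> #{sigmoid(logit).round(4)}"
end
```

Because the sigmoid is monotonic, the ranking is the same with or without it. Only the scale of the scores changes.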
+### Output Format
+The reranker returns an array of hashes, each with the following keys:
+- `:text` – The original document text
+- `:score` – The relevance score (raw logit or sigmoid-activated)
+- `:doc_id` – The original 0-based index of the document in the input array
+This format is compatible with the Informers gem, which returns results as hashes with `:doc_id` and `:score` keys. The `doc_id` allows you to map results back to your original data structure.
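As a small illustration of mapping results back through `:doc_id`, the sketch below joins reranker-shaped hashes to application records. The records, the sample scores, and the `attach_records` helper are all hypothetical.

```ruby
# Hypothetical application records, in the same order as the documents
# that were passed to rerank.
records = [
  { id: 101, title: "Finance" },
  { id: 102, title: "Population" },
  { id: 103, title: "Weather" }
]

# Hashes shaped like the reranker output described above (sample values).
results = [
  { doc_id: 1, score: 0.98, text: "Around 9 Million people live in London" },
  { doc_id: 0, score: 0.01, text: "London is known for its financial district" }
]

# Attach the original record to each result using its 0-based doc_id.
def attach_records(results, records)
  results.map { |r| r.merge(record: records.fetch(r[:doc_id])) }
end

attach_records(results, records).each do |r|
  puts "#{r[:record][:title]} (#{r[:score]})"
end
```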
+### Pooling Methods
+The reranker supports different pooling strategies for aggregating BERT embeddings:
+```ruby
+# Use alternative pooling methods
+# "pooler" (default) - Uses the pooler layer with tanh activation (most accurate for cross-encoders)
+# "cls" - Uses raw [CLS] token embeddings without the pooler layer
+# "mean" - Mean pooling across all tokens (not recommended for cross-encoders)
+# With raw logits
+results = reranker.rerank_with_pooling(query, documents, "cls")
+# With sigmoid activation
+results = reranker.rerank_sigmoid_with_pooling(query, documents, "cls")
+```
+Note: Pooling methods only apply to BERT-based models. XLM-RoBERTa models (e.g., BGE rerankers) have a built-in classification head and ignore the `pooling_method` parameter. For BERT models, the default "pooler" method is recommended as it matches how cross-encoder models are trained.
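A toy illustration of how "cls" and "mean" pooling differ, in plain Ruby on made-up vectors. The "pooler" option is left out because it depends on the model's trained dense layer. Real mean pooling usually also respects the attention mask, which this sketch ignores. Both helper names are hypothetical.

```ruby
# token_embeddings: one vector per token, with [CLS] first (toy 3-dim values).

# "cls" pooling: take the [CLS] token's vector as-is.
def cls_pooling(token_embeddings)
  token_embeddings.first
end

# "mean" pooling: average each dimension across all tokens.
def mean_pooling(token_embeddings)
  count = token_embeddings.length.to_f
  token_embeddings.transpose.map { |dim| dim.sum / count }
end

embeddings = [
  [1.0, 0.0, 2.0], # [CLS]
  [3.0, 2.0, 0.0],
  [2.0, 4.0, 4.0]
]
p cls_pooling(embeddings)   # => [1.0, 0.0, 2.0]
p mean_pooling(embeddings)  # => [2.0, 2.0, 2.0]
```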
+### CUDA Support
+For faster inference on NVIDIA GPUs:
+```ruby
+# Initialize with CUDA if available (falls back to CPU if not)
+reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2", cuda: true)
+```
+### How It Works
+Cross-encoder reranking models differ from bi-encoder embedding models:
+- **Bi-encoders** (like the embedding models above) encode queries and documents separately into dense vectors
+- **Cross-encoders** process the query and document together, allowing for more nuanced relevance scoring
+The reranker concatenates the query and document with special tokens and processes them jointly through transformer layers to produce a single relevance score.
+**BERT models** (e.g., MiniLM): Use a pooler layer (dense + tanh) on the [CLS] token, then a classifier layer. Pooling method is configurable (`pooler`, `cls`, `mean`).
+**XLM-RoBERTa models** (e.g., BGE rerankers): Use a built-in classification head that returns logits directly. The `pooling_method` parameter is ignored — the model handles its own pooling internally.
+This joint processing allows cross-encoders to capture subtle semantic relationships between queries and documents, making them more accurate for reranking tasks, though at the cost of higher computational requirements.
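For contrast, the sketch below shows how bi-encoder scoring typically works. Each text is embedded on its own, and relevance is then computed from the vectors alone, here with cosine similarity. The toy vectors and the `cosine_similarity` helper are assumptions for illustration, and cosine similarity is a common choice rather than something this README prescribes.

```ruby
# Cosine similarity between two equal-length vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Toy vectors standing in for embeddings produced separately per text.
query_vec = [1.0, 0.0]
doc_vecs = { "doc_a" => [2.0, 0.0], "doc_b" => [0.0, 3.0] }

# Rank documents by similarity to the query, highest first.
ranked = doc_vecs.sort_by { |_, vec| -cosine_similarity(query_vec, vec) }
ranked.each { |name, vec| puts "#{name}: #{cosine_similarity(query_vec, vec)}" }
```

Because document vectors can be computed ahead of time, this style suits first-stage retrieval, with a cross-encoder then reranking the shortlisted candidates.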
+### Performance Considerations
+**Important**: The Reranker automatically truncates documents to ensure stable performance. The default maximum is 512 tokens, but this is configurable.
+#### Configurable Truncation
+You can adjust the `max_length` parameter to balance performance and context:
+```ruby
+# Default: 512 tokens (maximum context, ~300ms per doc on CPU)
+reranker = Candle::Reranker.from_pretrained(model_id)
+# Faster: 256 tokens (~60% faster, ~120ms per doc on CPU)
+reranker = Candle::Reranker.from_pretrained(model_id, max_length: 256)
+# Fastest: 128 tokens (~80% faster, ~60ms per doc on CPU)
+reranker = Candle::Reranker.from_pretrained(model_id, max_length: 128)
+```
+Choose based on your needs:
+- **512 tokens**: Maximum context for complex queries (default)
+- **256 tokens**: Good balance of speed and context
+- **128 tokens**: Fast processing for simple matching
+#### Performance Guidelines
+1. **Document Length**: Documents longer than ~400 words will be truncated
+   - The first 512 tokens (roughly 300-400 words) are used
+   - Consider splitting very long documents into chunks if full coverage is needed
+2. **Batch Size**: Process multiple documents in one call for efficiency
+   ```ruby
+   # Good: Single call with multiple documents
+   results = reranker.rerank(query, documents)
+   # Less efficient: Multiple calls
+   documents.map { |doc| reranker.rerank(query, [doc]) }
+   ```
+3. **Expected Performance**:
+   - **CPU**: ~0.3-0.5s per query-document pair
+   - **GPU (Metal/CUDA)**: ~0.05-0.1s per query-document pair
+   - Performance is consistent regardless of document length due to truncation
+4. **Chunking Strategy** for long documents:
+   ```ruby
+   # The reranker is passed in because a method body cannot see outer local variables
+   def rerank_long_document(reranker, query, long_text, chunk_size: 300, overlap: 50)
+     # Split into overlapping chunks of chunk_size words
+     words = long_text.split
+     chunks = []
+     (0...words.length).step(chunk_size - overlap) do |i|
+       chunks << words[i...(i + chunk_size)].join(" ")
+     end
+     # Rerank chunks against the query
+     results = reranker.rerank(query, chunks)
+     # Return the best-scoring chunk
+     results.max_by { |r| r[:score] }
+   end
+   ```
+5. **Memory Usage**:
+   - Model size: ~125MB
+   - Each batch processes all documents simultaneously
+   - Consider batching if you have many documents
+## Tokenizer
+Red-Candle provides direct access to tokenizers for text preprocessing and analysis. This is useful for understanding how models process text, debugging issues, and building custom NLP pipelines.
+### Basic Usage
+```ruby
+require 'candle'
+# Load a tokenizer from HuggingFace
+tokenizer = Candle::Tokenizer.from_pretrained("bert-base-uncased")
+# Encode text to token IDs
+token_ids = tokenizer.encode("Hello, world!")
+# => [101, 7592, 1010, 2088, 999, 102]
+# Decode token IDs back to text
+text = tokenizer.decode(token_ids)
+# => "hello, world!"
+# Get token strings (subwords) - useful for visualization
+tokens = tokenizer.encode_to_tokens("Hello, world!")
+# => ["[CLS]", "hello", ",", "world", "!", "[SEP]"]
+# Get both IDs and tokens together
+result = tokenizer.encode_with_tokens("preprocessing")
+# => {"ids" => [101, 3653, 22618, 2527, 102],
+#     "tokens" => ["[CLS]", "prep", "##ro", "##ces", "##sing", "[SEP]"]}
+```
+### Batch Processing
+```ruby
+# Encode multiple texts at once
+texts = ["Hello world", "How are you?", "Tokenizers are cool"]
+batch_ids = tokenizer.encode_batch(texts)
+# Get token strings for multiple texts
+batch_tokens = tokenizer.encode_batch_to_tokens(texts)
+```
+### Vocabulary Access
+```ruby
+# Get vocabulary size
+vocab_size = tokenizer.vocab_size
+# => 30522
+# Get full vocabulary as a hash
+vocab = tokenizer.get_vocab
+# vocab["hello"] => 7592
+# Convert a specific token ID to its string
+token_str = tokenizer.id_to_token(7592)
+# => "hello"
+# Get special tokens
+special = tokenizer.get_special_tokens
+# => {"cls_token" => 101, "sep_token" => 102, "pad_token" => 0, ...}
+```
+### Configuration
+```ruby
+# Create a tokenizer with padding enabled
+padded_tokenizer = tokenizer.with_padding(length: 128)
+# Create a tokenizer with truncation
+truncated_tokenizer = tokenizer.with_truncation(512)
+# Configure padding with more options
+padded_tokenizer = tokenizer.with_padding(
+  length: 128,          # Fixed length padding
+  direction: "right",   # Pad on the right (default)
+  pad_token: "[PAD]"    # Padding token
+)
+```
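Conceptually, padding and truncation only reshape the array of token IDs. The following is a minimal plain-Ruby sketch of that idea, not the gem's implementation: `pad_ids`, `truncate_ids`, and `PAD_ID` are hypothetical names, and the pad ID of 0 is taken from the `pad_token` value shown under Vocabulary Access. Real tokenizers typically preserve special tokens such as `[SEP]` when truncating, which this naive version does not.

```ruby
PAD_ID = 0 # assumed padding ID, matching the "pad_token" => 0 entry shown earlier

# Cut a sequence down to at most max_length IDs (naive: may drop a trailing [SEP]).
def truncate_ids(ids, max_length)
  ids.first(max_length)
end

# Pad a sequence to `length` IDs on the given side; longer sequences are returned unchanged.
def pad_ids(ids, length, direction: "right", pad_id: PAD_ID)
  padding = Array.new([length - ids.size, 0].max, pad_id)
  direction == "left" ? padding + ids : ids + padding
end

ids = [101, 7592, 1010, 2088, 999, 102]
p pad_ids(ids, 8)                     # => [101, 7592, 1010, 2088, 999, 102, 0, 0]
p pad_ids(ids, 8, direction: "left")  # => [0, 0, 101, 7592, 1010, 2088, 999, 102]
p truncate_ids(ids, 4)                # => [101, 7592, 1010, 2088]
```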
+### Model Integration
+All models expose their tokenizers:
+```ruby
+# From LLM
+llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+llm_tokenizer = llm.tokenizer
+# From EmbeddingModel
+embedding_model = Candle::EmbeddingModel.from_pretrained
+emb_tokenizer = embedding_model.tokenizer
+# From Reranker
+reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2")
+rank_tokenizer = reranker.tokenizer
+```
+### Understanding Subword Tokenization
+Modern tokenizers split unknown or rare words into subword pieces:
+```ruby
+# See how words are split into subwords
+result = tokenizer.encode_with_tokens("unbelievable")
+# => {"ids" => [101, 4895, 6499, 102],
+#     "tokens" => ["[CLS]", "un", "##believable", "[SEP]"]}
+# The ## prefix marks a piece that continues the current word instead of starting a new one
+complex = tokenizer.encode_to_tokens("preprocessing tokenization")
+# => ["[CLS]", "prep", "##ro", "##ces", "##sing", "token", "##ization", "[SEP]"]
+```
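The splitting shown above can be approximated with a greedy longest-match-first search over a vocabulary, which is how WordPiece-style tokenizers are commonly described. This is a simplified sketch in plain Ruby with a toy vocabulary (`TOY_VOCAB` and `wordpiece` are hypothetical names), so its pieces differ from what `bert-base-uncased` actually produces.

```ruby
# Toy vocabulary for illustration; a real BERT vocabulary has about 30,000 entries.
TOY_VOCAB = %w[un ##believ ##able token ##ization [UNK]].freeze

# Split one lowercase word into pieces, always taking the longest vocabulary match.
def wordpiece(word, vocab = TOY_VOCAB)
  pieces = []
  start = 0
  while start < word.length
    stop = word.length
    piece = nil
    # Shrink the candidate from the right until it appears in the vocabulary.
    while stop > start
      candidate = word[start...stop]
      candidate = "##" + candidate if start > 0 # non-initial pieces carry the ## prefix
      if vocab.include?(candidate)
        piece = candidate
        break
      end
      stop -= 1
    end
    return ["[UNK]"] if piece.nil? # nothing matched: treat the whole word as unknown
    pieces << piece
    start = stop
  end
  pieces
end

p wordpiece("unbelievable") # => ["un", "##believ", "##able"]
p wordpiece("tokenization") # => ["token", "##ization"]
p wordpiece("xyz")          # => ["[UNK]"]
```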
+### Use Cases
+- **Token Analysis**: Understand how your text is being processed by models
+- **Debugging**: See why certain inputs might cause unexpected model behavior
+- **Custom Preprocessing**: Build your own text processing pipelines
+- **Educational**: Teach how modern NLP models handle text
+- **NER Preparation**: Get aligned tokens for named entity recognition tasks
+## Named Entity Recognition (NER)
+Red-Candle includes comprehensive Named Entity Recognition capabilities for extracting entities like people, organizations, locations, and custom entity types from text.
+### Model-based NER
+Load pre-trained NER models from HuggingFace:
+```ruby
+require 'candle'
+# Load a pre-trained NER model
+ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")
+# Or load a model with a specific tokenizer (for models without tokenizer.json)
+ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")
+# Extract entities from text
+text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, California."
+entities = ner.extract_entities(text)
+entities.each do |entity|
+  puts "#{entity[:text]} (#{entity[:label]}) - confidence: #{entity[:confidence].round(2)}"
+end
+# Output:
+# Apple Inc. (ORG) - confidence: 0.99
+# Steve Jobs (PER) - confidence: 0.99
+# Steve Wozniak (PER) - confidence: 0.98
+# Cupertino (LOC) - confidence: 0.97
+# California (LOC) - confidence: 0.98
+# Adjust confidence threshold (default: 0.9)
+entities = ner.extract_entities(text, confidence_threshold: 0.95)
+# Get token-level predictions for detailed analysis
+tokens = ner.predict_tokens(text)
+```
+### Pattern-based Recognition
+For domain-specific entities, use regex patterns:
+```ruby
+# Create pattern-based recognizers
+email_recognizer = Candle::PatternEntityRecognizer.new("EMAIL", [
+  /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
+])
+phone_recognizer = Candle::PatternEntityRecognizer.new("PHONE", [
+  /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/,         # 555-123-4567
+  /\(\d{3}\)\s*\d{3}[-.]?\d{4}\b/,        # (555) 123-4567 (no leading \b: it fails before "(" after a space)
+  /\+1\s*\d{3}[-.]?\d{3}[-.]?\d{4}\b/     # +1 555-123-4567 (no leading \b, for the same reason)
+])
+# Extract entities
+text = "Contact us at info@example.com or call 555-123-4567"
+email_entities = email_recognizer.recognize(text)
+phone_entities = phone_recognizer.recognize(text)
+```
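To show what pattern-based recognition involves, here is a rough plain-Ruby sketch that scans a text and records character offsets for each match. `scan_patterns` is a hypothetical helper and the hash keys are this sketch's own choice; it is not how `PatternEntityRecognizer` is implemented.

```ruby
# Hypothetical helper: scan `text` with each pattern and record match offsets.
def scan_patterns(text, label, patterns)
  patterns.flat_map do |pattern|
    matches = []
    text.scan(pattern) do
      m = Regexp.last_match # MatchData for the current match
      matches << { text: m[0], label: label, start: m.begin(0), end: m.end(0) }
    end
    matches
  end
end

text  = "Contact us at info@example.com or call 555-123-4567"
email = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
phone = /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/

entities = scan_patterns(text, "EMAIL", [email]) + scan_patterns(text, "PHONE", [phone])
entities.each do |e|
  # Offsets are half-open: text[start...end] reproduces the matched string.
  puts "#{e[:label]} #{e[:start]}...#{e[:end]} #{text[e[:start]...e[:end]]}"
end
# EMAIL 14...30 info@example.com
# PHONE 39...51 555-123-4567
```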
+### Gazetteer-based Recognition
+Use dictionaries for known entities:
+```ruby
+# Create gazetteer recognizers
+companies = ["Apple", "Google", "Microsoft", "Amazon", "Tesla"]
+company_recognizer = Candle::GazetteerEntityRecognizer.new("COMPANY", companies)
+# Load from file
+drug_recognizer = Candle::GazetteerEntityRecognizer.new("DRUG")
+drug_recognizer.load_from_file("drug_names.txt")
+# Case-sensitive matching
+product_recognizer = Candle::GazetteerEntityRecognizer.new("PRODUCT",
+  ["iPhone", "iPad", "MacBook"],
+  case_sensitive: true
+)
+```
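Dictionary matching of this kind can be sketched in plain Ruby by compiling the terms into one alternation anchored on word boundaries. `match_terms` is a hypothetical helper written for illustration, under the assumption that longer terms should win over their prefixes; the gem's own matching rules may differ.

```ruby
# Hypothetical helper: find dictionary terms in `text` on word boundaries.
def match_terms(text, label, terms, case_sensitive: false)
  options = case_sensitive ? 0 : Regexp::IGNORECASE
  # Longest terms first so "MacBook Pro" is preferred over "MacBook" at the same position.
  alternation = terms.sort_by { |t| -t.length }.map { |t| Regexp.escape(t) }.join("|")
  pattern = Regexp.new("\\b(?:#{alternation})\\b", options)
  found = []
  text.scan(pattern) do
    m = Regexp.last_match
    found << { text: m[0], label: label, start: m.begin(0), end: m.end(0) }
  end
  found
end

text = "The new iphone pairs with a MacBook Pro."
p match_terms(text, "PRODUCT", ["iPhone", "MacBook", "MacBook Pro"]).map { |e| e[:text] }
# => ["iphone", "MacBook Pro"]
p match_terms(text, "PRODUCT", ["iPhone", "MacBook"], case_sensitive: true).map { |e| e[:text] }
# => ["MacBook"]
```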
+### Hybrid NER
+Combine ML models with rule-based approaches for best results:
+```ruby
+# Create hybrid NER system
+hybrid = Candle::HybridNER.new("Babelscape/wikineural-multilingual-ner")
+# Add pattern recognizers
+hybrid.add_pattern_recognizer("EMAIL", [/\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}\b/])
+hybrid.add_pattern_recognizer("PHONE", [/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/])
+# Add gazetteer recognizers
+hybrid.add_gazetteer_recognizer("COMPANY", ["Apple", "Google", "Microsoft"])
+hybrid.add_gazetteer_recognizer("PRODUCT", ["iPhone", "Android", "Windows"])
+# Extract all entities
+text = "John Smith (john@apple.com) from Apple called about the new iPhone. Reach him at 555-123-4567."
+entities = hybrid.extract_entities(text)
+# Results include entities from all recognizers
+# Overlapping entities are automatically resolved (highest confidence wins)
+```
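The "highest confidence wins" rule can be sketched as a greedy pass over candidates sorted by confidence. This plain-Ruby version is an illustration under assumptions, not the gem's code: `resolve_overlaps` and `overlap?` are hypothetical names, the confidence values are invented, and treating rule-based matches as confidence 1.0 is an assumption.

```ruby
# Two half-open spans overlap when each one starts before the other ends.
def overlap?(a, b)
  a[:start] < b[:end] && b[:start] < a[:end]
end

# Keep the highest-confidence entity from each group of overlapping spans.
def resolve_overlaps(entities)
  kept = []
  entities.sort_by { |e| -e[:confidence] }.each do |candidate|
    kept << candidate unless kept.any? { |e| overlap?(e, candidate) }
  end
  kept.sort_by { |e| e[:start] }
end

# Candidates for "John Smith (john@apple.com) from Apple ..." (scores are made up).
candidates = [
  { text: "John Smith",     label: "PER",     start: 0,  end: 10, confidence: 0.99, source: "model" },
  { text: "john@apple.com", label: "EMAIL",   start: 12, end: 26, confidence: 1.0,  source: "pattern" },
  { text: "apple.com",      label: "ORG",     start: 17, end: 26, confidence: 0.62, source: "model" },
  { text: "Apple",          label: "ORG",     start: 33, end: 38, confidence: 0.95, source: "model" },
  { text: "Apple",          label: "COMPANY", start: 33, end: 38, confidence: 1.0,  source: "gazetteer" }
]

p resolve_overlaps(candidates).map { |e| e[:label] }
# => ["PER", "EMAIL", "COMPANY"]
```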
+### Custom Entity Types
+Pattern recognizers also suit specialized domains:
+```ruby
+# Biomedical entities
+gene_patterns = [
+  /\b[A-Z][A-Z0-9]{2,10}\b/,    # TP53, BRCA1, EGFR (bounded for safety)
+  /\bCD\d+\b/,                  # CD4, CD8, CD34
+  /\b[A-Z]+\d+[A-Z]?\b/         # RAD51C, PALB2
+]
+gene_recognizer = Candle::PatternEntityRecognizer.new("GENE", gene_patterns)
+# Financial entities
+ticker_patterns = [
+  /\$[A-Z]{1,5}\b/,             # $AAPL, $GOOGL
+  /\b[A-Z]{1,5}\.NYSE\b/,       # AAPL.NYSE
+  /\b[A-Z]{1,5}\.NASDAQ\b/      # GOOGL.NASDAQ
+]
+ticker_recognizer = Candle::PatternEntityRecognizer.new("TICKER", ticker_patterns)
+# Legal entities
+case_patterns = [
+  /\b\d+\s+F\.\d+[a-z]*\s+\d+\b/,  # 123 F.3d 456
+  /\b\d+\s+U\.S\.\s+\d+\b/,     # 123 U.S. 456
+  /\bNo\.\s+\d+-\d+\b/          # No. 20-1234
+]
+case_recognizer = Candle::PatternEntityRecognizer.new("CASE", case_patterns)
+```
+### Available Pre-trained Models
+Popular NER models on HuggingFace:
+```ruby
+# General multilingual NER (4 entity types: PER, ORG, LOC, MISC)
+ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")
+# English NER (requires separate tokenizer)
+ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")
+# Multilingual NER
+ner = Candle::NER.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")
+# OntoNotes 5 (18 entity types including DATE, TIME, MONEY, etc.)
+ner = Candle::NER.from_pretrained("flair/ner-english-ontonotes-large")
+# Biomedical NER
+ner = Candle::NER.from_pretrained("dmis-lab/biobert-base-cased-v1.2")
+ner = Candle::NER.from_pretrained("allenai/scibert_scivocab_uncased")
+```
+### Performance Tips
+1. **Device Selection**: Use GPU for faster inference
+   ```ruby
+   ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner", device: Candle::Device.metal)
+   ```
+2. **Batch Processing**: Process multiple texts together when possible
+3. **Confidence Threshold**: Balance precision/recall with appropriate thresholds
+4. **Entity Resolution**: The hybrid NER automatically handles overlapping entities
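To make the precision/recall trade-off in tip 3 concrete, here is a toy calculation in plain Ruby. The predictions and the gold-entity count are invented for illustration, and `precision_recall` is a hypothetical helper, not part of the gem.

```ruby
# Toy predictions as [confidence, correct?] pairs; values are invented for illustration.
PREDICTIONS = [
  [0.99, true], [0.97, true], [0.93, false], [0.91, true], [0.72, true], [0.60, false]
].freeze
TOTAL_GOLD = 4 # number of true entities in the toy text

# Precision and recall when only predictions at or above `threshold` are kept.
def precision_recall(predictions, total_gold, threshold)
  kept = predictions.select { |confidence, _| confidence >= threshold }
  correct = kept.count { |_, ok| ok }
  precision = kept.empty? ? 0.0 : correct.to_f / kept.size
  [precision.round(2), (correct.to_f / total_gold).round(2)]
end

[0.95, 0.9, 0.5].each do |threshold|
  precision, recall = precision_recall(PREDICTIONS, TOTAL_GOLD, threshold)
  puts "threshold #{threshold}: precision #{precision}, recall #{recall}"
end
# threshold 0.95: precision 1.0, recall 0.5
# threshold 0.9: precision 0.75, recall 0.75
# threshold 0.5: precision 0.67, recall 1.0
```

Raising the threshold removes the low-confidence mistakes but also drops correct entities, which is why the right value depends on whether missed entities or false positives cost more.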
+### Output Format
+All NER methods return entities in a consistent format:
+```ruby
+{
+  text: "Apple Inc.",             # The entity text
+  label: "ORG",                   # Entity type
+  start: 0,                       # Character start position (inclusive)
+  end: 10,                        # Character end position (exclusive)
+  confidence: 0.99,               # Confidence score (0-1)
+  token_start: 0,                 # Token start index (model-based only)
+  token_end: 2,                   # Token end index (model-based only)
+  source: "model"                 # Source: "model", "pattern", or "gazetteer"
+}
+```
+## Vision-Language Models (VLM)
+Red-Candle supports vision-language models for understanding and describing images. The VLM module uses LLaVA (Large Language and Vision Assistant), which combines a CLIP vision encoder with a Llama language model.
+### Basic Usage
+```ruby
+require 'candle'
+# Load a LLaVA model (requires ~13GB download on first use)
+vlm = Candle::VLM.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
+# Describe an image
+description = vlm.describe("photo.jpg")
+# Ask a question about an image
+answer = vlm.ask("photo.jpg", "What animal is in this image?")
+# => "The animal in the image is a cat."
+# Control output length
+vlm.describe("photo.jpg", max_length: 500)
+vlm.ask("photo.jpg", "What colors do you see?", max_length: 50)
+```
+### How It Works
+1. **CLIP Vision Encoder**: Converts the image into a sequence of visual feature tokens (576 patches from a 336x336 image)
+2. **MM Projector**: Projects vision features into the language model's embedding space
+3. **Llama LLM**: Processes the combined image+text embeddings and generates a text response
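The 576 figure in step 1 follows from the patch grid. The arithmetic below assumes 14x14-pixel patches, as in the CLIP ViT-L/14 encoder at 336px; the patch size is not stated above, and `patch_count` is a hypothetical helper.

```ruby
# Number of square patches when an image is cut into a regular grid.
def patch_count(image_size, patch_size)
  per_side = image_size / patch_size # 336 / 14 = 24 patches per side
  per_side * per_side
end

puts patch_count(336, 14) # prints 576, i.e. a 24 x 24 grid of visual tokens
```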
+### Supported Models
+| Model | LLM Backend | Size | Notes |
+|:------|:-----------|:-----|:------|
+| `llava-hf/llava-v1.6-vicuna-7b-hf` | Llama (Vicuna) | 13GB | Recommended, LLaVA-Next with Llama backend |
+### Notes
+- First load downloads ~13GB of model weights (cached for subsequent use)
+- Image preprocessing is automatic (resize, normalize to CLIP format)
+- Generation uses greedy decoding
+- Multiple calls work correctly (KV cache is reset between queries)
+## Common Runtime Errors
+### Weight is negative, too large or not a valid number
+**Error:**
+```
+/Users/cpetersen/src/scientist/red-candle/lib/candle/llm.rb:25:in `_generate_stream': Generation failed: A weight is negative, too large or not a valid number (RuntimeError)
+    from /Users/cpetersen/src/scientist/red-candle/lib/candle/llm.rb:25:in `generate_stream'
+    ...
+```
+**Cause:** This error occurs with overly aggressive quantization levels (particularly Q2_K), which can make inference numerically unstable. With 2-bit weights the model can produce NaN/Inf values, and the sampling step then rejects the resulting token probabilities (the "weights" named in the message).
+**Solution:** Use a higher quantization level. Recommended options:
+- Q4_K_M (4-bit) - Best balance of quality and size
+- Q5_K_M (5-bit) - Higher quality with slightly larger size
+- Q3_K_M (3-bit) - Minimum recommended quantization
+```ruby
+device = Candle::Device.cpu  # or Candle::Device.metal on Apple Silicon
+llm = Candle::LLM.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
+                                  device: device,
+                                  gguf_file: "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
+```
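A small guard can catch risky files before loading. The sketch below is plain Ruby with hypothetical helper names (`quant_tag`, `quant_bits`, `risky_quantization?`); it assumes the common `name.Q4_K_M.gguf` naming convention and uses the 3-bit minimum recommended above.

```ruby
# Pull the quantization tag (e.g. "Q4_K_M") out of a GGUF filename, or nil if absent.
def quant_tag(filename)
  filename[/\.(Q\d+(?:_[A-Z0-9]+)*)\.gguf\z/i, 1]&.upcase
end

# Nominal bit width implied by the tag, e.g. "Q4_K_M" -> 4.
def quant_bits(tag)
  tag[/\AQ(\d+)/, 1].to_i
end

# True when the filename names a quantization level below the recommended minimum.
def risky_quantization?(filename, min_bits: 3)
  tag = quant_tag(filename)
  !tag.nil? && quant_bits(tag) < min_bits
end

p quant_tag("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")           # => "Q4_K_M"
p risky_quantization?("tinyllama-1.1b-chat-v1.0.Q2_K.gguf")   # => true
p risky_quantization?("tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf") # => false
```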
+### Cannot find tensor model.embed_tokens.weight
+**Error:**
+```
+Failed to load quantized model: cannot find tensor model.embed_tokens.weight (RuntimeError)
+```
+**Cause:** This error was common in earlier versions when loading GGUF files with incompatible tensor naming conventions. The unified GGUF loader in version 1.0.0+ should handle most GGUF files correctly.
+**If you still encounter this error:**
+1. Ensure you're using the latest version of red-candle (1.0.0 or higher)
+2. Make sure to specify the exact GGUF filename:
+   ```ruby
+   device = Candle::Device.cpu  # or Candle::Device.metal on Apple Silicon
+   llm = Candle::LLM.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
+                                     device: device,
+                                     gguf_file: "mistral-7b-instruct-v0.2.Q4_K_M.gguf")
+   ```
+3. If the error persists, the GGUF file may use an unsupported architecture or format
+### No GGUF file found in repository
+**Error:**
+```
+Failed to load quantized model: No GGUF file found in repository TheBloke/model-name-GGUF. Try specifying a quantization level like Q4_K_M, Q5_K_M, or Q8_0. (RuntimeError)
+```
+**Cause:** The automatic GGUF file detection couldn't find a matching file, often due to naming variations.
+**Solution:** Specify the exact GGUF filename:
+```ruby
+# Visit the HuggingFace repository to find the exact filename
+device = Candle::Device.cpu  # or Candle::Device.metal on Apple Silicon
+llm = Candle::LLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF",
+                                  device: device,
+                                  gguf_file: "llama-2-7b-chat.Q4_K_M.gguf")
+```
+### Failed to download tokenizer
+**Error:**
+```
+Failed to load quantized model: Failed to download tokenizer: request error: HTTP status client error (404 Not Found)
+```
+**Cause:** GGUF repositories often don't include separate tokenizer files because the tokenizer data is embedded in the GGUF file itself.
+**Solution:** Red-Candle now includes fallback tokenizer loading for this case. If you still encounter this error, ensure you're using the latest version of red-candle.
+### Missing metadata in GGUF file
+**Error:**
+```
+Failed to load GGUF model: cannot find gemma3.attention.head_count in metadata (RuntimeError)
+```
+or
+```
+Failed to load GGUF model: cannot find llama.attention.head_count in metadata (RuntimeError)
+```
+**Cause:** Some GGUF files may have been created with older conversion tools that don't include all required metadata fields.
+**Solution:**
+- Try a different GGUF file from the same model
+- Look for GGUF files from TheBloke or other reputable sources
+- Check if a newer version of the GGUF file is available
+- Some Gemma GGUF files may not be compatible with the current loader
+**Known compatibility issues:**
+- `lmstudio-ai/gemma-2b-it-GGUF` - Missing required metadata fields
+- Gemma 3 GGUF files may require specific tokenizers that are not publicly available
+- For best compatibility, use Llama or Mistral GGUF files from TheBloke
+## Development
+FORK IT!
+```bash
+git clone https://github.com/scientist-labs/red-candle
+cd red-candle
+bundle
+bundle exec rake compile
+```
+Pull requests are welcome.
+## Testing
+Red-Candle has comprehensive tests at both the Ruby and Rust levels:
+### Ruby Tests
+```bash
+# Run all Ruby tests
+bundle exec rake test
+# Run specific test suites
+bundle exec rake test:device         # Device compatibility tests
+bundle exec rake test:benchmark      # Benchmark tests
+bundle exec rake test:llm:mistral    # Model-specific tests
+```
+### Rust Tests
+```bash
+# Run Rust unit and integration tests
+cd ext/candle && cargo test
+# Or use the Rake task
+bundle exec rake rust:test
+```
+The Rust tests include:
+- Unit tests within source files (using `#[cfg(test)]` modules)
+- Integration tests for external dependencies (candle_core operations)
+- Tests for structured generation, tokenization, and text generation
+### Code Coverage
+#### Rust Code Coverage
+Red-Candle uses `cargo-llvm-cov` for Rust code coverage analysis:
+```bash
+# Generate HTML coverage report (written to target/llvm-cov/html/index.html)
+bundle exec rake rust:coverage:html
+# Show coverage summary in terminal
+bundle exec rake rust:coverage:summary
+# Generate detailed coverage report
+bundle exec rake rust:coverage:report
+# Generate LCOV format for CI integration
+bundle exec rake rust:coverage:lcov
+# Clean coverage data
+bundle exec rake rust:coverage:clean
+```
+**Note**: Overall Rust coverage shows ~17% because most code consists of Ruby FFI bindings that are tested through Ruby tests. The testable Rust components have high coverage:
+- Constrained generation: 99.59%
+- Schema processing: 90.99%
+- Integration tests: 97.12%
+#### Ruby Code Coverage
+Ruby test coverage is generated automatically when running tests:
+```bash
+bundle exec rake test
+# Coverage report generated in coverage/index.html
+```
+## Release
+1. Update version number in `lib/candle/version.rb` and commit.
+2. `bundle exec rake build`
+3. `git tag VERSION_NUMBER`
+4. `git push --follow-tags`
+5. `gem push pkg/red-candle-VERSION_NUMBER.gem`
+## Dependencies
+- [Candle](https://github.com/huggingface/candle)
+- [Magnus](https://github.com/matsadler/magnus)
+- [Outlines-core](https://github.com/dottxt-ai/outlines-core)