red-candle 1.8.0-aarch64-linux

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (76)
  1. checksums.yaml +7 -0
  2. data/Cargo.lock +5021 -0
  3. data/Cargo.toml +6 -0
  4. data/Gemfile +3 -0
  5. data/LICENSE +22 -0
  6. data/README.md +1171 -0
  7. data/Rakefile +167 -0
  8. data/bin/console +11 -0
  9. data/bin/setup +17 -0
  10. data/ext/candle/Cargo.toml +38 -0
  11. data/ext/candle/build.rs +117 -0
  12. data/ext/candle/extconf.rb +79 -0
  13. data/ext/candle/rustfmt.toml +63 -0
  14. data/ext/candle/src/gvl.rs +58 -0
  15. data/ext/candle/src/lib.rs +59 -0
  16. data/ext/candle/src/llm/constrained_generation_test.rs +395 -0
  17. data/ext/candle/src/llm/gemma.rs +313 -0
  18. data/ext/candle/src/llm/generation_config.rs +63 -0
  19. data/ext/candle/src/llm/glm4.rs +236 -0
  20. data/ext/candle/src/llm/granite.rs +308 -0
  21. data/ext/candle/src/llm/granitemoehybrid.rs +315 -0
  22. data/ext/candle/src/llm/llama.rs +396 -0
  23. data/ext/candle/src/llm/mistral.rs +309 -0
  24. data/ext/candle/src/llm/mod.rs +49 -0
  25. data/ext/candle/src/llm/phi.rs +369 -0
  26. data/ext/candle/src/llm/quantized_gguf.rs +734 -0
  27. data/ext/candle/src/llm/qwen.rs +261 -0
  28. data/ext/candle/src/llm/qwen3.rs +257 -0
  29. data/ext/candle/src/llm/text_generation.rs +284 -0
  30. data/ext/candle/src/ruby/device.rs +234 -0
  31. data/ext/candle/src/ruby/dtype.rs +39 -0
  32. data/ext/candle/src/ruby/embedding_model.rs +477 -0
  33. data/ext/candle/src/ruby/errors.rs +16 -0
  34. data/ext/candle/src/ruby/llm.rs +730 -0
  35. data/ext/candle/src/ruby/mod.rs +24 -0
  36. data/ext/candle/src/ruby/ner.rs +444 -0
  37. data/ext/candle/src/ruby/reranker.rs +488 -0
  38. data/ext/candle/src/ruby/result.rs +3 -0
  39. data/ext/candle/src/ruby/structured.rs +92 -0
  40. data/ext/candle/src/ruby/tensor.rs +731 -0
  41. data/ext/candle/src/ruby/tokenizer.rs +343 -0
  42. data/ext/candle/src/ruby/utils.rs +96 -0
  43. data/ext/candle/src/ruby/vlm.rs +330 -0
  44. data/ext/candle/src/structured/integration_test.rs +130 -0
  45. data/ext/candle/src/structured/mod.rs +31 -0
  46. data/ext/candle/src/structured/schema_processor.rs +215 -0
  47. data/ext/candle/src/structured/vocabulary_adapter.rs +152 -0
  48. data/ext/candle/src/structured/vocabulary_adapter_real_test.rs +66 -0
  49. data/ext/candle/src/structured/vocabulary_adapter_simple_test.rs +70 -0
  50. data/ext/candle/src/tokenizer/loader.rs +108 -0
  51. data/ext/candle/src/tokenizer/mod.rs +104 -0
  52. data/ext/candle/tests/device_tests.rs +43 -0
  53. data/ext/candle/tests/tensor_tests.rs +162 -0
  54. data/lib/candle/3.1/candle.so +0 -0
  55. data/lib/candle/3.2/candle.so +0 -0
  56. data/lib/candle/3.3/candle.so +0 -0
  57. data/lib/candle/3.4/candle.so +0 -0
  58. data/lib/candle/4.0/candle.so +0 -0
  59. data/lib/candle/agent.rb +68 -0
  60. data/lib/candle/build_info.rb +67 -0
  61. data/lib/candle/device_utils.rb +10 -0
  62. data/lib/candle/embedding_model.rb +75 -0
  63. data/lib/candle/embedding_model_type.rb +31 -0
  64. data/lib/candle/llm.rb +595 -0
  65. data/lib/candle/logger.rb +149 -0
  66. data/lib/candle/ner.rb +368 -0
  67. data/lib/candle/reranker.rb +45 -0
  68. data/lib/candle/tensor.rb +99 -0
  69. data/lib/candle/tokenizer.rb +139 -0
  70. data/lib/candle/tool.rb +47 -0
  71. data/lib/candle/tool_call_parser.rb +57 -0
  72. data/lib/candle/version.rb +5 -0
  73. data/lib/candle/vlm.rb +31 -0
  74. data/lib/candle.rb +29 -0
  75. data/lib/red-candle.rb +1 -0
  76. metadata +309 -0
data/README.md ADDED
@@ -0,0 +1,1171 @@
1
+ <img src="/docs/assets/logo-title.png" alt="red-candle" height="160px">
2
+
3
+ [![build](https://github.com/scientist-labs/red-candle/actions/workflows/build.yml/badge.svg)](https://github.com/scientist-labs/red-candle/actions/workflows/build.yml)
4
+ [![Gem Version](https://badge.fury.io/rb/red-candle.svg)](https://badge.fury.io/rb/red-candle)
5
+
6
+ Run state-of-the-art **language models directly from Ruby**. No Python, no APIs, no external services - just Ruby with blazing-fast Rust under the hood. Hardware accelerated with **Metal (Mac)** and **CUDA (NVIDIA).** Red candle leverages the Rust ecosystem, notably [Candle](https://github.com/huggingface/candle) and [Magnus](https://github.com/matsadler/magnus), to provide a fast and efficient way to run LLMs in Ruby. See [Dependencies](#dependencies) for more.
7
+
8
+ ## Install & Chat in 30 Seconds
9
+
10
+ [![red-candle quickstart](https://img.youtube.com/vi/hbyFCyh8esk/0.jpg)](https://www.youtube.com/watch?v=hbyFCyh8esk)
11
+
12
+ ```bash
13
+ # Install the gem
14
+ gem install red-candle
15
+ ```
16
+
17
+ ```ruby
18
+ require 'candle'
19
+
20
+ # Download a model (one-time, ~650MB) - Mistral, Llama3, Gemma all work!
21
+ llm = Candle::LLM.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
22
+ gguf_file: "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
23
+
24
+ # Chat with it - no API calls, running locally in your Ruby process!
25
+ messages = [
26
+ { role: "user", content: "Explain Ruby in one sentence" }
27
+ ]
28
+
29
+ puts llm.chat(messages)
30
+ # => "Ruby is a dynamic, object-oriented programming language known for its
31
+ # simplicity, elegance, and productivity, often used for web development
32
+ # with frameworks like Rails."
33
+ ```
34
+
35
+ ## What Just Happened?
36
+
37
+ You just ran a 1.1-billion parameter AI model inside Ruby. The model lives in your process memory, runs on your hardware (CPU/GPU), and responds without any network round-trip.
38
+
39
+ ## Stream Responses Like a Pro
40
+
41
+ ```ruby
42
+ # Watch the AI think in real-time
43
+ llm.chat_stream(messages) do |token|
44
+ print token
45
+ end
46
+ ```
47
+
48
+ ## Why This Matters
49
+
50
+ - **Privacy**: Your data never leaves your machine
51
+ - **Speed**: No network overhead, direct memory access
52
+ - **Control**: Fine-tune generation parameters, access raw tokens
53
+ - **Integration**: It's just Ruby objects - use it anywhere Ruby runs
54
+
55
+ ## Supports
56
+
57
+ - **Tokenizers**: Access the tokenizer directly
58
+ - **EmbeddingModel**: Generate embeddings for text
59
+ - **Reranker**: Rerank documents based on relevance
60
+ - **NER**: Named Entity Recognition directly from Ruby
61
+ - **LLM**: Chat with Large Language Models (e.g., Llama, Mistral, Gemma, Qwen, Phi)
62
+ - **Structured Generation**: Generate JSON from a schema or match a regular expression
63
+
64
+ ## Model Storage
65
+
66
+ Models are automatically downloaded and cached when you first use them. They are stored in:
67
+ - **Location**: `~/.cache/huggingface/hub/`
68
+ - **Size**: Models range from ~100MB (embeddings) to several GB (LLMs)
69
+ - **Reuse**: Models are downloaded once and reused across sessions
70
+
71
+ To check your cache or manage storage:
72
+ ```bash
73
+ # View cache contents
74
+ ls -la ~/.cache/huggingface/hub/
75
+
76
+ # Check total cache size
77
+ du -sh ~/.cache/huggingface/
78
+
79
+ # Clear cache if needed (removes all downloaded models)
80
+ rm -rf ~/.cache/huggingface/hub/
81
+ ```
82
+
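The same check can be done from Ruby. The sketch below is only an illustration: `directory_size_bytes` is a hypothetical helper (not part of red-candle), and it assumes the default cache location listed above. Symlinks are skipped because the Hugging Face cache links snapshot files to shared blobs, so following them would count the same file twice.

```ruby
require "find"

# Hypothetical helper: sums the size of regular files under +path+.
# Symlinks are skipped so linked blobs are only counted once.
def directory_size_bytes(path)
  return 0 unless File.directory?(path)

  total = 0
  Find.find(path) do |entry|
    next if File.symlink?(entry)

    total += File.size(entry) if File.file?(entry)
  end
  total
end

# Assumed default cache location (see "Location" above).
cache_dir = File.join(ENV.fetch("HOME", "."), ".cache", "huggingface", "hub")
puts format("%.2f GB", directory_size_bytes(cache_dir) / 1e9)
```

A missing cache directory is reported as 0 bytes rather than raising, which keeps the check safe to run before any model has been downloaded.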
83
+ ----
84
+
85
+ ## Usage
86
+
87
+ ```ruby
88
+ require "candle"
89
+
90
+ x = Candle::Tensor.new([1, 2, 3, 4, 5, 6], :f32)
91
+ x = x.reshape([3, 2])
92
+ # [[1., 2.],
93
+ # [3., 4.],
94
+ # [5., 6.]]
95
+ # Tensor[[3, 2], f32]
96
+ ```
97
+
98
+ ```ruby
99
+ require 'candle'
100
+
101
+ # Default model (JinaBERT) on CPU
102
+ model = Candle::EmbeddingModel.from_pretrained
103
+ embedding = model.embedding("Hi there!")
104
+
105
+ # Specify device (CPU, Metal, or CUDA)
106
+ device = Candle::Device.cpu # or Candle::Device.metal, Candle::Device.cuda
107
+ model = Candle::EmbeddingModel.from_pretrained("jinaai/jina-embeddings-v2-base-en", device: device)
108
+ embedding = model.embedding("Hi there!")
109
+
110
+ # Reranker also supports device selection
111
+ reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2", device: device)
112
+ results = reranker.rerank("query", ["doc1", "doc2", "doc3"])
113
+ ```
114
+
115
+ ## LLM Support
116
+
117
+ Red-Candle now supports Large Language Models (LLMs) with GPU acceleration!
118
+
119
+ ### Supported Models
120
+
121
+ - **Gemma**: Google's Gemma models (e.g., `google/gemma-2b`, `google/gemma-7b`, `google/gemma-2b-it`)
122
+ - **Llama**: Llama 2 and Llama 3 models (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, `meta-llama/Llama-2-7b-hf`, `NousResearch/Llama-2-7b-hf`)
123
+ - **Mistral**: All Mistral models (e.g., `mistralai/Mistral-7B-Instruct-v0.1`)
124
+ - **Qwen**: Qwen 2 and 2.5 models (e.g., `Qwen/Qwen2-1.5B`, `Qwen/Qwen2.5-7B-Instruct`)
125
+ - **Phi**: Microsoft's Phi-2, Phi-3, Phi-3.5, and Phi-4 models (e.g., `microsoft/phi-2`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/phi-4`)
126
+ - ⚠️ ⚠️ ⚠️ Note: Phi-3 and Phi-4 GGUF models have a known issue with KV cache persistence between generations. The `reset_cache` parameter doesn't work for GGUF models. Recreate the model instance for each generation.
127
+ - `candle` pull request about phi-3 gguf models: https://github.com/huggingface/candle/pull/2937
128
+
129
+ ### Quantized Model Support (GGUF)
130
+
131
+ Red-Candle supports quantized models in GGUF format, offering 4-8x memory reduction:
132
+
133
+ > **Note on GGUF Support**: Red-Candle now uses a unified GGUF loader that automatically detects the model architecture from the GGUF file. This means all GGUF models (including Mistral models from TheBloke) should now work correctly! The loader automatically selects the appropriate tokenizer based on the model type to ensure proper text generation.
134
+
135
+ ```ruby
136
+ # Load quantized models - always specify the GGUF filename
137
+ llm = Candle::LLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF",
138
+ device: device,
139
+ gguf_file: "llama-2-7b-chat.Q4_K_M.gguf")
140
+
141
+ # Register custom tokenizer mappings for your models
142
+ Candle::LLM.register_tokenizer("my-org/my-model-GGUF", "my-org/my-tokenizer")
143
+
144
+ # Popular quantized model sources:
145
+ # - TheBloke: Extensive collection of GGUF models
146
+ # - Search HuggingFace for "GGUF" models
147
+ ```
148
+
149
+ **Memory usage comparison (7B models):**
150
+ - Full precision: ~28 GB
151
+ - Q8_0 (8-bit): ~7 GB - Best quality, larger size
152
+ - Q5_K_M (5-bit): ~4.5 GB - Very good quality
153
+ - Q4_K_M (4-bit): ~4 GB - Recommended default, best balance
154
+ - Q3_K_M (3-bit): ~3 GB - Good for memory-constrained systems
155
+
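As a rough sanity check on these figures, weight memory can be estimated as parameters × bits per weight ÷ 8. The snippet below is a back-of-the-envelope sketch, not something red-candle provides: `estimated_weight_gb` is a hypothetical helper and the bit widths are nominal. The figures listed above for the quantized formats are somewhat higher than these nominal estimates, presumably because quantized files also store scaling metadata and keep some tensors at higher precision.

```ruby
# Hypothetical helper: weights-only estimate in GB (10^9 bytes).
# Ignores activations, the KV cache, and per-block quantization metadata.
def estimated_weight_gb(params, bits_per_weight)
  (params * bits_per_weight / 8.0) / 1e9
end

PARAMS_7B = 7_000_000_000

{ "f32" => 32, "8-bit" => 8, "5-bit" => 5, "4-bit" => 4, "3-bit" => 3 }.each do |name, bits|
  puts format("%-6s ~%.1f GB", name, estimated_weight_gb(PARAMS_7B, bits))
end
```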
156
+ **Quantization levels explained:**
157
+ - **Q8_0**: Almost identical to full model, use when quality is paramount
158
+ - **Q5_K_M**: Excellent quality with good compression
159
+ - **Q4_K_M**: Best balance of quality/size/speed (recommended default)
160
+ - **Q3_K_M**: Noticeable quality reduction but very compact
161
+ - **Q2_K**: ⚠️ **Not recommended** - Can cause inference errors due to extreme quantization
162
+
163
+ > **Warning**: Q2_K quantization can lead to "weight is negative, too large or not a valid number" errors during inference. Use Q3_K_M or higher for stable operation.
164
+
165
+ > ### ⚠️ Huggingface login warning
166
+ >
167
+ > Many models, including the one below, require you to agree to the terms. You'll need to:
168
+ > 1. Login to [Huggingface](https://huggingface.co)
169
+ > 2. Agree to the terms. For example: [here](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)
170
+ > 3. Authenticate your session. The simplest way is with `huggingface-cli login`. Details here: [Huggingface CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
171
+ >
172
+ > More details here: [Huggingface Authentication](docs/HUGGINGFACE.md)
173
+
174
+ ```ruby
175
+ require 'candle'
176
+
177
+ # Choose your device
178
+ device = Candle::Device.cpu # CPU (default)
179
+ device = Candle::Device.metal # Apple GPU (Metal)
180
+ device = Candle::Device.cuda # NVIDIA GPU (CUDA)
181
+
182
+ # Load a model
183
+ llm = Candle::LLM.from_pretrained("google/gemma-2b-it", device: device) # Gemma
184
+ # llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device: device) # Llama
185
+ # llm = Candle::LLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", device: device) # Mistral
186
+
187
+ # Generate text
188
+ response = llm.generate("What is Ruby?", config: Candle::GenerationConfig.balanced)
189
+
190
+ # Stream generation
191
+ llm.generate_stream("Tell me a story", config: Candle::GenerationConfig.balanced) do |token|
192
+ print token
193
+ end
194
+
195
+ # Chat interface
196
+ messages = [
197
+ { role: "system", content: "You are a helpful assistant." },
198
+ { role: "user", content: "Explain Ruby in one sentence." }
199
+ ]
200
+ response = llm.chat(messages)
201
+ ```
202
+
203
+ ### GPU Acceleration
204
+
205
+ We see an 18x speed up running LLMs under CUDA vs CPU and a >3x speed up running under Metal vs CPU. Details [here](docs/DEVICE_SUPPORT.md#performance-considerations).
206
+
207
+ ```ruby
208
+ # CPU works for all models
209
+ device = Candle::Device.cpu
210
+ llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device: device)
211
+
212
+ # Metal
213
+ device = Candle::Device.metal
214
+
215
+ # CUDA support (for NVIDIA GPUs)
216
+ device = Candle::Device.cuda # Linux/Windows with NVIDIA GPU
217
+ ```
218
+
219
+ ### Debugging Token Generation
220
+
221
+ For debugging purposes, you can enable raw token output to see both token IDs and their raw representations:
222
+
223
+ ```ruby
224
+ # Enable debug mode to see raw tokens during generation
225
+ config = Candle::GenerationConfig.balanced(debug_tokens: true)
226
+
227
+ # Non-streaming generation with debug tokens
228
+ result = llm.generate("Hello, world!", config: config)
229
+ puts result
230
+ # Output: [15043:Hello][11:,][1917:world][0:!]
231
+
232
+ # Streaming generation with debug tokens
233
+ llm.generate_stream("Hello, world!", config: config) do |text|
234
+ print text # Will show each token as it's generated: [15043:Hello][11:,][1917:world][0:!]
235
+ end
236
+
237
+ # Works with all models (Llama, Mistral, Gemma, and quantized GGUF models)
238
+ ```
239
+
240
+ This is particularly useful for:
241
+ - Debugging tokenization issues
242
+ - Understanding how the model processes text
243
+ - Troubleshooting generation problems
244
+ - Analyzing model behavior
245
+
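If you want to post-process the debug output, it can be split back into ID/text pairs. This is a minimal sketch that assumes the `[id:text]` layout shown in the sample output above. `parse_debug_tokens` is a hypothetical helper, and it will mis-split tokens whose text itself contains `]`.

```ruby
# Assumes the "[id:text]" layout from the sample output above.
DEBUG_TOKEN_PATTERN = /\[(\d+):(.*?)\]/m

# Hypothetical helper: returns an array of { id:, text: } hashes.
def parse_debug_tokens(output)
  output.scan(DEBUG_TOKEN_PATTERN).map { |id, text| { id: id.to_i, text: text } }
end

tokens = parse_debug_tokens("[15043:Hello][11:,][1917:world][0:!]")
tokens.each { |t| puts "#{t[:id]}\t#{t[:text].inspect}" }
```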
246
+ ## Structured Generation
247
+
248
+ Red Candle supports structured generation to constrain LLM outputs to follow specific patterns like JSON schemas or regular expressions:
249
+
250
+ ```ruby
251
+ # Define a JSON schema
252
+ schema = {
253
+ type: "object",
254
+ properties: {
255
+ answer: { type: "string", enum: ["yes", "no"] },
256
+ confidence: { type: "number", minimum: 0, maximum: 1 }
257
+ },
258
+ required: ["answer"]
259
+ }
260
+
261
+ # Generate and parse in one step
262
+ result = llm.generate_structured("Is Ruby easy to learn?", schema: schema)
263
+ puts result["answer"] # "yes"
264
+ puts result["confidence"] # 0.9
265
+
266
+ # Or use regex patterns for non-JSON outputs
267
+ phone_constraint = llm.constraint_from_regex('\d{3}-\d{3}-\d{4}')
268
+ config = Candle::GenerationConfig.balanced(constraint: phone_constraint)
269
+ phone = llm.generate("Generate a phone number:", config: config)
270
+ ```
271
+
272
+ See [STRUCTURED_GENERATION.md](docs/STRUCTURED_GENERATION.md) for detailed documentation.
273
+
274
+ **Note on Reliability**: Structured generation constrains the model's output tokens, but success rates vary by model size and schema complexity. Smaller models (< 7B parameters) may occasionally produce incomplete or invalid JSON, especially with complex schemas. Consider implementing retry logic or fallback strategies in production applications. Larger models generally perform much better with structured generation.
275
+
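One possible shape for the retry logic mentioned above is sketched here. It is an illustration under assumptions: `generate_with_retry` is a hypothetical helper, and the block stands in for whatever call produces the model output (for example a call to `generate_structured`), so the snippet runs without a model.

```ruby
require "json"

# Hypothetical helper: calls the block until it yields JSON (or an already
# parsed Hash) containing every required key, or gives up after max_attempts.
def generate_with_retry(required_keys:, max_attempts: 3)
  max_attempts.times do |attempt|
    raw = yield(attempt)
    begin
      parsed = raw.is_a?(String) ? JSON.parse(raw) : raw
    rescue JSON::ParserError
      next # incomplete or malformed JSON: try again
    end
    return parsed if parsed.is_a?(Hash) && required_keys.all? { |key| parsed.key?(key) }
  end
  nil # the caller decides on a fallback
end

# Stand-in for the model: truncated JSON on the first attempt, valid on the second.
fake_outputs = ['{"answer": "ye', '{"answer": "yes", "confidence": 0.9}']
result = generate_with_retry(required_keys: ["answer"]) { |i| fake_outputs[i] }
puts result.inspect
```

Returning `nil` after the last attempt, rather than raising, leaves the fallback strategy to the caller.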
276
+ ## Tool Calling
277
+
278
+ Red-candle supports tool/function calling, enabling models to invoke external functions during generation. This works best with models fine-tuned for tool calling, such as Qwen3.
279
+
280
+ ### Defining Tools
281
+
282
+ ```ruby
283
+ get_weather = Candle::Tool.new(
284
+ name: "get_weather",
285
+ description: "Get the current weather for a city",
286
+ parameters: {
287
+ type: "object",
288
+ properties: { city: { type: "string", description: "City name" } },
289
+ required: ["city"]
290
+ }
291
+ ) { |args| { city: args["city"], temperature: 72, condition: "sunny" } }
292
+ ```
293
+
294
+ ### Extracting Tool Calls
295
+
296
+ `chat_with_tools` injects tool definitions into the system prompt, generates a response, and parses any `<tool_call>` tags from the output. It does **not** feed results back to the model — it just tells you what the model wants to call. You decide what to do with it:
297
+
298
+ ```ruby
299
+ llm = Candle::LLM.from_pretrained("Qwen/Qwen3-0.6B")
300
+
301
+ messages = [{ role: "user", content: "What's the weather in San Francisco?" }]
302
+ result = llm.chat_with_tools(messages, tools: [get_weather],
303
+ config: Candle::GenerationConfig.deterministic(max_length: 500))
304
+
305
+ if result.has_tool_calls?
306
+ result.tool_calls.each do |tc|
307
+ puts "#{tc.name}(#{tc.arguments})"
308
+ output = get_weather.call(tc.arguments)
309
+ puts "=> #{output}"
310
+ end
311
+ else
312
+ puts result.text_response
313
+ end
314
+ ```
315
+
316
+ Pass `execute: true` to automatically run the tools (but still no round-trip back to the model):
317
+
318
+ ```ruby
319
+ result = llm.chat_with_tools(messages, tools: [get_weather], execute: true,
320
+ config: Candle::GenerationConfig.deterministic(max_length: 500))
321
+
322
+ result.tool_results.each do |tr|
323
+ puts "#{tr[:tool_call].name} => #{tr[:result]}"
324
+ end
325
+ ```
326
+
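To make the parsing step concrete, here is a simplified sketch of pulling `<tool_call>` tags out of model output. It is not the gem's parser: `extract_tool_calls` is a hypothetical helper, and it assumes each tag wraps a JSON object with `name` and `arguments` keys.

```ruby
require "json"

# Assumed payload shape: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_PATTERN = /<tool_call>\s*(\{.*?\})\s*<\/tool_call>/m

# Hypothetical helper: returns an array of { name:, arguments: } hashes.
def extract_tool_calls(text)
  text.scan(TOOL_CALL_PATTERN).flatten.filter_map do |payload|
    call = JSON.parse(payload)
    { name: call["name"], arguments: call["arguments"] || {} }
  rescue JSON::ParserError
    nil # skip malformed payloads
  end
end

output = <<~TEXT
  <think>The user wants weather data.</think>
  <tool_call>
  {"name": "get_weather", "arguments": {"city": "San Francisco"}}
  </tool_call>
TEXT

calls = extract_tool_calls(output)
calls.each { |c| puts "#{c[:name]}(#{c[:arguments]})" }
```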
327
+ ### Agent (Multi-Turn Tool Loop)
328
+
329
+ `Candle::Agent` completes the round-trip: generate → parse tool calls → execute → feed results back to the model → repeat until the model produces a final text answer or hits `max_iterations`. This is a convenience wrapper for quick prototyping — for production use, frameworks like [RubyLLM](https://github.com/crmne/ruby_llm) manage this loop for you via the [ruby_llm-red_candle](https://github.com/scientist-labs/ruby_llm-red_candle) plugin:
330
+
331
+ ```ruby
332
+ agent = Candle::Agent.new(llm, tools: [get_weather, lookup_price], max_iterations: 5)
333
+ result = agent.run("What's the weather in Paris, and how much does a widget cost?",
334
+ config: Candle::GenerationConfig.deterministic(max_length: 1000))
335
+
336
+ puts result.response # Final text answer from the model
337
+ puts result.iterations # Number of generate cycles
338
+ puts result.tool_calls_made # Number of tools invoked
339
+ ```
340
+
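The control flow of such a loop can be sketched without a model. Everything below is a stand-in for illustration, not `Candle::Agent` itself: `run_tool_loop` is a hypothetical function, the model is a lambda that returns either a tool request or final text, and tools are plain lambdas keyed by name.

```ruby
# Hypothetical sketch of generate -> execute -> feed back -> repeat.
def run_tool_loop(model, tools, prompt, max_iterations: 5)
  messages = [{ role: "user", content: prompt }]

  max_iterations.times do |i|
    reply = model.call(messages)
    # A reply without a tool request is treated as the final answer.
    return { response: reply[:text], iterations: i + 1 } unless reply[:tool]

    result = tools.fetch(reply[:tool]).call(reply[:arguments])
    messages << { role: "assistant", content: "called #{reply[:tool]}" }
    messages << { role: "tool", content: result.to_s }
  end

  { response: nil, iterations: max_iterations } # iteration budget exhausted
end

tools = { "get_weather" => ->(args) { { city: args["city"], temperature: 72 } } }

# Stand-in model: asks for the tool once, then answers using its result.
fake_model = lambda do |messages|
  tool_message = messages.find { |m| m[:role] == "tool" }
  if tool_message
    { text: "Tool said: #{tool_message[:content]}" }
  else
    { tool: "get_weather", arguments: { "city" => "Paris" } }
  end
end

outcome = run_tool_loop(fake_model, tools, "What's the weather in Paris?")
puts outcome[:response]
puts outcome[:iterations]
```

The `max_iterations` cap matters because a model that keeps requesting tools would otherwise never terminate.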
341
+ ### Model Recommendations
342
+
343
+ Tool calling quality depends heavily on model size:
344
+
345
+ | Model | Tool Calling Quality |
346
+ |-------|---------------------|
347
+ | **Qwen3-8B GGUF** (~5 GB) | Calls correct tools, self-corrects errors, but may hallucinate values from tool results |
348
+ | **Qwen3-4B GGUF** (~2.5 GB) | Calls correct tools, occasional reasoning errors |
349
+ | **Qwen3-0.6B** (~1.2 GB) | Single-turn works, needs `max_length: 500+` for thinking |
350
+ | SmolLM2-360M | Does not work |
351
+ | TinyLlama-1.1B | Does not work (not fine-tuned for tool calling) |
352
+
353
+ **Tip:** Qwen3 models use a `<think>` reasoning block before producing tool calls. Set `max_length` high enough (500+ for 0.6B, 1000+ for larger models) to allow room for both thinking and the tool call.
354
+
355
+ ## ⚠️ Model Format Requirements
356
+
357
+ ### EmbeddingModels and Rerankers: Safetensors Only
358
+
359
+ Red-Candle **only supports embedding models and rerankers that provide their weights in the [safetensors](https://github.com/huggingface/safetensors) format** (i.e., the model repo must contain a `model.safetensors` file). If the model repo does not provide the required file, loading will fail with a clear error. Most official BERT and DistilBERT models do **not** provide safetensors; many Sentence Transformers and JinaBERT models do.
360
+
361
+ **If you encounter an error like:**
362
+
363
+ ```
364
+ RuntimeError: model.safetensors not found after download. Only safetensors models are supported. Please ensure your model repo contains model.safetensors.
365
+ ```
366
+
367
+ this means the selected model is not compatible. Please choose a model repo that provides the required file.
368
+
369
+ ### LLMs: Safetensors and GGUF Support
370
+
371
+ LLM models support two formats:
372
+ 1. **Safetensors format** - Standard HuggingFace models (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
373
+ 2. **GGUF quantized format** - Memory-efficient quantized models (e.g., `TheBloke/Llama-2-7B-Chat-GGUF`)
374
+
375
+ See the [Quantized Model Support](#quantized-model-support-gguf) section for details on using GGUF models.
376
+
377
+ ## Supported Embedding Models
378
+
379
+ Red-Candle supports the following embedding model types from Hugging Face:
380
+
381
+ 1. `Candle::EmbeddingModelType::JINA_BERT` - Jina BERT models (e.g., `jinaai/jina-embeddings-v2-base-en`) (**safetensors required**)
382
+ 2. `Candle::EmbeddingModelType::MINILM` - MiniLM models (e.g., `sentence-transformers/all-MiniLM-L6-v2`) (**safetensors required**)
383
+ 3. `Candle::EmbeddingModelType::DISTILBERT` - DistilBERT models (e.g., `distilbert-base-uncased-finetuned-sst-2-english`) (**safetensors required**)
384
+ 4. `Candle::EmbeddingModelType::STANDARD_BERT` - Standard BERT models (e.g., `scientistcom/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`) (**safetensors required**)
385
+
386
+ > **Note:** Most official BERT and DistilBERT models do _not_ provide safetensors. Please check the model repo before use.
387
+
388
+ You can get a list of all supported model types and suggested model paths:
389
+
390
+ ```ruby
391
+ Candle::EmbeddingModelType.all # Returns all supported model types
392
+ Candle::EmbeddingModelType.suggested_model_paths # Returns hash of suggested models for each type
393
+ ```
394
+
395
+ ## A note on memory usage
396
+ The default model (`jinaai/jina-embeddings-v2-base-en` with the `sentence-transformers/all-MiniLM-L6-v2` tokenizer, both from [HuggingFace](https://huggingface.co)) takes a little more than 3GB of memory running on a Mac. The memory is held by each instantiated `Candle::EmbeddingModel` object, so instantiating more than one uses proportionally more memory. Likewise, if you let a model go out of scope and run the garbage collector, the memory is freed. For example:
397
+
398
+ ```ruby
399
+ > require 'candle'
400
+ # Ruby memory = 25.9 MB
401
+ > model = Candle::EmbeddingModel.from_pretrained
402
+ # Ruby memory = 3.50 GB
403
+ > model2 = Candle::EmbeddingModel.from_pretrained
404
+ # Ruby memory = 7.04 GB
405
+ > model2 = nil
406
+ > GC.start
407
+ # Ruby memory = 3.56 GB
408
+ > model = nil
409
+ > GC.start
410
+ # Ruby memory = 55.2 MB
411
+ ```
412
+
413
+ ## A note on returned embeddings
414
+
415
+ The embeddings should match those generated by the Python `transformers` library. For instance, locally I was able to generate the same embedding for the text "Hi there!" using this Python code:
416
+
417
+ ```python
418
+ from transformers import AutoModel
419
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
420
+ sentence = ['Hi there!']
421
+ embedding = model.encode(sentence)
422
+ print(embedding)
423
+ ```
424
+
425
+ And the following Ruby:
426
+
427
+ ```ruby
428
+ require 'candle'
429
+ model = Candle::EmbeddingModel.from_pretrained
430
+ embedding = model.embedding("Hi there!")
431
+ ```
432
+
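One way to check that two embeddings match is to compare them numerically rather than by eye. The helpers below are a sketch under assumptions: `cosine_similarity` and `max_abs_diff` are hypothetical names, and both expect the embeddings to have already been converted to flat Ruby arrays of floats (that conversion is not shown here).

```ruby
# Hypothetical helper: cosine similarity of two equal-length float arrays.
def cosine_similarity(a, b)
  raise ArgumentError, "length mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Hypothetical helper: largest element-wise absolute difference.
def max_abs_diff(a, b)
  a.zip(b).map { |x, y| (x - y).abs }.max
end

# Toy vectors standing in for a Ruby embedding and a Python embedding.
ruby_embedding   = [0.12, -0.40, 0.33]
python_embedding = [0.12, -0.40, 0.33]

puts cosine_similarity(ruby_embedding, python_embedding)
puts max_abs_diff(ruby_embedding, python_embedding)
```

A cosine similarity very close to 1 together with a small maximum difference indicates the two libraries agree up to floating-point noise.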
433
+ ## Document Reranking
434
+
435
+ Red-Candle includes support for cross-encoder reranking models, which can be used to reorder documents by relevance to a query. This is particularly useful for improving search results or implementing retrieval-augmented generation (RAG) systems.
436
+
437
+ ### Supported Models
438
+
439
+ Red-Candle supports both **BERT** and **XLM-RoBERTa** reranker architectures. The model type is auto-detected from `config.json`.
440
+
441
+ | Model | Architecture | Params | Notes |
442
+ |-------|-------------|--------|-------|
443
+ | `BAAI/bge-reranker-base` | XLM-RoBERTa | 278M | Recommended — strong quality, multilingual |
444
+ | `BAAI/bge-reranker-large` | XLM-RoBERTa | 560M | Best quality, higher resource usage |
445
+ | `BAAI/bge-reranker-v2-m3` | XLM-RoBERTa | 278M | Multilingual, very strong |
446
+ | `cross-encoder/ms-marco-MiniLM-L-12-v2` | BERT | 33M | Lightweight, English only |
447
+
448
+ ### Basic Usage
449
+
450
+ ```ruby
451
+ require 'candle'
452
+
453
+ # Initialize the reranker (BGE reranker recommended for quality)
454
+ reranker = Candle::Reranker.from_pretrained("BAAI/bge-reranker-base")
455
+
456
+ # Or use the lighter BERT-based model
457
+ reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2")
458
+
459
+ # Or with custom max_length for truncation (default is 512)
460
+ reranker = Candle::Reranker.from_pretrained(
461
+ "cross-encoder/ms-marco-MiniLM-L-12-v2",
462
+ max_length: 256 # Faster processing with less context
463
+ )
464
+
465
+ # Define your query and candidate documents
466
+ query = "How many people live in London?"
467
+ documents = [
468
+ "London is known for its financial district",
469
+ "Around 9 Million people live in London",
470
+ "The weather in London is often rainy",
471
+ "London is the capital of England"
472
+ ]
473
+
474
+ # Rerank documents by relevance to the query (raw logits)
475
+ ranked_results = reranker.rerank(query, documents, pooling_method: "pooler", apply_sigmoid: false)
476
+
477
+ # Or apply sigmoid activation to get scores between 0 and 1
478
+ sigmoid_results = reranker.rerank(query, documents, pooling_method: "pooler", apply_sigmoid: true)
479
+
480
+ # The pooler method is the default and is recommended for cross-encoders, as is apply_sigmoid, so the above is the same as:
481
+ ranked_results = reranker.rerank(query, documents)
482
+
483
+ # Results are returned as an array of hashes, sorted by relevance
484
+ # e.g.
485
+ ranked_results.each do |result|
486
+ puts "Score: #{result[:score].round(4)} - Doc ##{result[:doc_id]}: #{result[:text]}"
487
+ end
488
+ # Output:
489
+ # Score: 1.0 - Doc #1: Around 9 Million people live in London
490
+ # Score: 0.0438 - Doc #3: London is the capital of England
491
+ # Score: 0.0085 - Doc #0: London is known for its financial district
492
+ # Score: 0.0005 - Doc #2: The weather in London is often rainy
493
+ ```
494
+
495
+ ### Arguments & Activation Functions
496
+
497
+ By default, `apply_sigmoid` is `true` (scores between 0 and 1). Set it to `false` to get raw logits. You can also select the pooling method:
498
+
499
+ - `pooling_method: "pooler"` (default)
500
+ - `pooling_method: "cls"`
501
+ - `pooling_method: "mean"`
502
+
503
+ Example without sigmoid activation:
504
+
505
+ ```ruby
506
+ # Get raw logits
507
+ ranked_results = reranker.rerank(query, documents, apply_sigmoid: false)
508
+
509
+ ranked_results.each do |result|
510
+ puts "Score: #{result[:score].round(4)} - Doc ##{result[:doc_id]}: #{result[:text]}"
511
+ end
512
+ # Output:
513
+ # Score: 10.3918 - Doc #1: Around 9 Million people live in London
514
+ # Score: -3.0829 - Doc #3: London is the capital of England
515
+ # Score: -4.7619 - Doc #0: London is known for its financial district
516
+ # Score: -7.5251 - Doc #2: The weather in London is often rainy
517
+ ```
518
+
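The two score scales are related by the logistic sigmoid, 1 / (1 + e^(-x)). As a quick check in plain Ruby (the `sigmoid` helper is defined here for illustration and is not a red-candle method), applying it to the raw logits above and rounding to four places gives 1.0, 0.0438, 0.0085 and 0.0005, the same scores shown under Basic Usage.

```ruby
# Logistic sigmoid: maps any real-valued logit into the range (0, 1).
def sigmoid(logit)
  1.0 / (1.0 + Math.exp(-logit))
end

# Raw logits copied from the example output above.
raw_logits = [10.3918, -3.0829, -4.7619, -7.5251]

raw_logits.each do |logit|
  puts "logit #{logit} -> score #{sigmoid(logit).round(4)}"
end
```

Because the sigmoid is monotonic, the ranking order is identical with or without it; only the scale of the scores changes.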
519
+ ### Output Format
520
+
521
+ The reranker returns an array of hashes, each with the following keys:
522
+ - `:text` – The original document text
523
+ - `:score` – The relevance score (raw logit or sigmoid-activated)
524
+ - `:doc_id` – The original 0-based index of the document in the input array
525
+
526
+ This format is compatible with the Informers gem, which returns results as hashes with `:doc_id` and `:score` keys. The `doc_id` allows you to map results back to your original data structure.
527
+
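A small sketch of that mapping follows. The `records` array and its `id`/`body` fields are made up for illustration, and `ranked` is a hard-coded copy of the result shape described above rather than live reranker output.

```ruby
# Hypothetical application records; the reranker would only see the :body strings.
records = [
  { id: "a1", body: "London is known for its financial district" },
  { id: "b2", body: "Around 9 Million people live in London" },
  { id: "c3", body: "The weather in London is often rainy" },
  { id: "d4", body: "London is the capital of England" }
]

# Hard-coded stand-in for reranker output (sorted by score, descending).
ranked = [
  { doc_id: 1, score: 1.0,    text: records[1][:body] },
  { doc_id: 3, score: 0.0438, text: records[3][:body] },
  { doc_id: 0, score: 0.0085, text: records[0][:body] },
  { doc_id: 2, score: 0.0005, text: records[2][:body] }
]

# Use :doc_id to recover the original record for the top two results.
top = ranked.first(2).map { |r| records[r[:doc_id]].merge(score: r[:score]) }
top.each { |r| puts "#{r[:id]} (#{r[:score]}): #{r[:body]}" }
```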
528
+ ### Pooling Methods
529
+
530
+ The reranker supports different pooling strategies for aggregating BERT embeddings:
531
+
532
+ ```ruby
533
+ # Use alternative pooling methods
534
+ # "pooler" (default) - Uses the pooler layer with tanh activation (most accurate for cross-encoders)
535
+ # "cls" - Uses raw [CLS] token embeddings without the pooler layer
536
+ # "mean" - Mean pooling across all tokens (not recommended for cross-encoders)
537
+
538
+ # With raw logits
539
+ results = reranker.rerank_with_pooling(query, documents, "cls")
540
+
541
+ # With sigmoid activation
542
+ results = reranker.rerank_sigmoid_with_pooling(query, documents, "cls")
543
+ ```
544
+
545
+ Note: Pooling methods only apply to BERT-based models. XLM-RoBERTa models (e.g., BGE rerankers) have a built-in classification head and ignore the `pooling_method` parameter. For BERT models, the default "pooler" method is recommended as it matches how cross-encoder models are trained.
546
+
547
+ ### CUDA Support
548
+
549
+ For faster inference on NVIDIA GPUs:
550
+
551
+ ```ruby
552
+ # Run the reranker on an NVIDIA GPU, using the same device: option shown in Usage
553
+ reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2", device: Candle::Device.cuda)
554
+ ```
555
+
556
+ ### How It Works
557
+
558
+ Cross-encoder reranking models differ from bi-encoder embedding models:
559
+
560
+ - **Bi-encoders** (like the embedding models above) encode queries and documents separately into dense vectors
561
+ - **Cross-encoders** process the query and document together, allowing for more nuanced relevance scoring
562
+
563
+ The reranker concatenates the query and document with special tokens and processes them jointly through transformer layers to produce a single relevance score.
564
+
565
+ **BERT models** (e.g., MiniLM): Use a pooler layer (dense + tanh) on the [CLS] token, then a classifier layer. Pooling method is configurable (`pooler`, `cls`, `mean`).
566
+
567
+ **XLM-RoBERTa models** (e.g., BGE rerankers): Use a built-in classification head that returns logits directly. The `pooling_method` parameter is ignored — the model handles its own pooling internally.
568
+
569
+ This joint processing allows cross-encoders to capture subtle semantic relationships between queries and documents, making them more accurate for reranking tasks, though at the cost of higher computational requirements.
570
+
571
+ ### Performance Considerations
572
+
573
+ **Important**: The Reranker automatically truncates documents to ensure stable performance. The default maximum is 512 tokens, but this is configurable.
574
+
575
+ #### Configurable Truncation
576
+
577
+ You can adjust the `max_length` parameter to balance performance and context:
578
+
579
+ ```ruby
580
+ # Default: 512 tokens (maximum context, ~300ms per doc on CPU)
581
+ reranker = Candle::Reranker.from_pretrained(model_id)
582
+
583
+ # Faster: 256 tokens (~60% faster, ~120ms per doc on CPU)
584
+ reranker = Candle::Reranker.from_pretrained(model_id, max_length: 256)
585
+
586
+ # Fastest: 128 tokens (~80% faster, ~60ms per doc on CPU)
587
+ reranker = Candle::Reranker.from_pretrained(model_id, max_length: 128)
588
+ ```
589
+
590
+ Choose based on your needs:
591
+ - **512 tokens**: Maximum context for complex queries (default)
592
+ - **256 tokens**: Good balance of speed and context
593
+ - **128 tokens**: Fast processing for simple matching
594
+
595
+ #### Performance Guidelines
596
+
597
+ 1. **Document Length**: Documents longer than ~400 words will be truncated
598
+ - The first 512 tokens (roughly 300-400 words) are used
599
+ - Consider splitting very long documents into chunks if full coverage is needed
600
+
601
+ 2. **Batch Size**: Process multiple documents in one call for efficiency
602
+ ```ruby
603
+ # Good: Single call with multiple documents
604
+ results = reranker.rerank(query, documents)
605
+
606
+ # Less efficient: Multiple calls
607
+ documents.map { |doc| reranker.rerank(query, [doc]) }
608
+ ```
609
+
610
+ 3. **Expected Performance**:
611
+ - **CPU**: ~0.3-0.5s per query-document pair
612
+ - **GPU (Metal/CUDA)**: ~0.05-0.1s per query-document pair
613
+ - Latency stays bounded regardless of document length, because inputs are truncated to `max_length`
614
+
615
+ 4. **Chunking Strategy** for long documents:
616
+ ```ruby
617
+ def rerank_long_document(reranker, query, long_text, chunk_size: 300)
618
+ # Split into overlapping chunks
619
+ words = long_text.split
620
+ chunks = []
621
+
622
+ (0...words.length).step(chunk_size - 50) do |i|
623
+ chunk = words[i...(i + chunk_size)].join(" ")
624
+ chunks << chunk
625
+ end
626
+
627
+ # Rerank chunks
628
+ results = reranker.rerank(query, chunks)
629
+
630
+ # Return best chunk
631
+ results.max_by { |r| r[:score] }
632
+ end
633
+ ```
634
+
635
+ 5. **Memory Usage**:
636
+ - Model size: ~125MB
637
+ - Each batch processes all documents simultaneously
638
+ - Consider splitting very large document sets into smaller batches to limit peak memory
639
+
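One way to apply the batching advice above is to slice the document list and merge the scored results. This is a minimal sketch: `rerank_in_batches` is a hypothetical helper, the batch size of 32 is an arbitrary starting point, and it assumes `rerank` returns hashes with a `:score` key, as the chunking example does.

```ruby
# Hypothetical helper: rerank a large document set in fixed-size batches
# so that only `batch_size` documents are processed at a time.
# Assumes reranker.rerank(query, docs) returns hashes with a :score key.
def rerank_in_batches(reranker, query, documents, batch_size: 32)
  scored = documents.each_slice(batch_size).flat_map do |batch|
    reranker.rerank(query, batch)
  end
  # Merge all batches into one list, best score first
  scored.sort_by { |result| -result[:score] }
end
```

Any per-call index the reranker includes in its results would be relative to the batch it came from, so rely on the document text (or your own IDs) when mapping results back.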
640
+ ## Tokenizer
641
+
642
+ Red-Candle provides direct access to tokenizers for text preprocessing and analysis. This is useful for understanding how models process text, debugging issues, and building custom NLP pipelines.
643
+
644
+ ### Basic Usage
645
+
646
+ ```ruby
647
+ require 'candle'
648
+
649
+ # Load a tokenizer from HuggingFace
650
+ tokenizer = Candle::Tokenizer.from_pretrained("bert-base-uncased")
651
+
652
+ # Encode text to token IDs
653
+ token_ids = tokenizer.encode("Hello, world!")
654
+ # => [101, 7592, 1010, 2088, 999, 102]
655
+
656
+ # Decode token IDs back to text
657
+ text = tokenizer.decode(token_ids)
658
+ # => "hello, world!"
659
+
660
+ # Get token strings (subwords) - useful for visualization
661
+ tokens = tokenizer.encode_to_tokens("Hello, world!")
662
+ # => ["[CLS]", "hello", ",", "world", "!", "[SEP]"]
663
+
664
+ # Get both IDs and tokens together
665
+ result = tokenizer.encode_with_tokens("preprocessing")
666
+ # => {"ids" => [101, ..., 102], # six IDs, one per token below
667
+ # "tokens" => ["[CLS]", "prep", "##ro", "##ces", "##sing", "[SEP]"]}
668
+ ```
669
+
670
+ ### Batch Processing
671
+
672
+ ```ruby
673
+ # Encode multiple texts at once
674
+ texts = ["Hello world", "How are you?", "Tokenizers are cool"]
675
+ batch_ids = tokenizer.encode_batch(texts)
676
+
677
+ # Get token strings for multiple texts
678
+ batch_tokens = tokenizer.encode_batch_to_tokens(texts)
679
+ ```
680
+
681
+ ### Vocabulary Access
682
+
683
+ ```ruby
684
+ # Get vocabulary size
685
+ vocab_size = tokenizer.vocab_size
686
+ # => 30522
687
+
688
+ # Get full vocabulary as a hash
689
+ vocab = tokenizer.get_vocab
690
+ # vocab["hello"] => 7592
691
+
692
+ # Convert a specific token ID to its string
693
+ token_str = tokenizer.id_to_token(7592)
694
+ # => "hello"
695
+
696
+ # Get special tokens
697
+ special = tokenizer.get_special_tokens
698
+ # => {"cls_token" => 101, "sep_token" => 102, "pad_token" => 0, ...}
699
+ ```
700
+
701
+ ### Configuration
702
+
703
+ ```ruby
704
+ # Create a tokenizer with padding enabled
705
+ padded_tokenizer = tokenizer.with_padding(length: 128)
706
+
707
+ # Create a tokenizer with truncation
708
+ truncated_tokenizer = tokenizer.with_truncation(512)
709
+
710
+ # Configure padding with more options
711
+ padded_tokenizer = tokenizer.with_padding(
712
+ length: 128, # Fixed length padding
713
+ direction: "right", # Pad on the right (default)
714
+ pad_token: "[PAD]" # Padding token
715
+ )
716
+ ```
717
+
718
+ ### Model Integration
719
+
720
+ All models expose their tokenizers:
721
+
722
+ ```ruby
723
+ # From LLM
724
+ llm = Candle::LLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
725
+ llm_tokenizer = llm.tokenizer
726
+
727
+ # From EmbeddingModel
728
+ embedding_model = Candle::EmbeddingModel.from_pretrained
729
+ emb_tokenizer = embedding_model.tokenizer
730
+
731
+ # From Reranker
732
+ reranker = Candle::Reranker.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2")
733
+ rank_tokenizer = reranker.tokenizer
734
+ ```
735
+
736
+ ### Understanding Subword Tokenization
737
+
738
+ Modern tokenizers split unknown or rare words into subword pieces:
739
+
740
+ ```ruby
741
+ # See how words are split into subwords
742
+ result = tokenizer.encode_with_tokens("unbelievable")
743
+ # => {"ids" => [101, 4895, 6499, 102],
744
+ # "tokens" => ["[CLS]", "un", "##believable", "[SEP]"]}
745
+
746
+ # The ## prefix indicates a continuation of the previous token
747
+ complex = tokenizer.encode_to_tokens("preprocessing tokenization")
748
+ # => ["[CLS]", "prep", "##ro", "##ces", "##sing", "token", "##ization", "[SEP]"]
749
+ ```
750
+
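To make the `##` convention concrete, the sketch below merges WordPiece tokens back into whole words. It is plain Ruby operating on the token strings shown above; `merge_wordpieces` is a hypothetical helper, not part of the Tokenizer API.

```ruby
# Hypothetical helper: rebuild whole words from WordPiece tokens.
# Tokens starting with "##" are glued onto the previous word;
# special tokens such as [CLS] and [SEP] are dropped.
def merge_wordpieces(tokens, special: ["[CLS]", "[SEP]", "[PAD]"])
  tokens.each_with_object([]) do |token, words|
    next if special.include?(token)

    if token.start_with?("##") && !words.empty?
      words[-1] += token.delete_prefix("##")
    else
      words << token
    end
  end
end

tokens = ["[CLS]", "prep", "##ro", "##ces", "##sing", "token", "##ization", "[SEP]"]
p merge_wordpieces(tokens)
# => ["preprocessing", "tokenization"]
```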
751
+ ### Use Cases
752
+
753
+ - **Token Analysis**: Understand how your text is being processed by models
754
+ - **Debugging**: See why certain inputs might cause unexpected model behavior
755
+ - **Custom Preprocessing**: Build your own text processing pipelines
756
+ - **Educational**: Teach how modern NLP models handle text
757
+ - **NER Preparation**: Get aligned tokens for named entity recognition tasks
758
+
759
+ ## Named Entity Recognition (NER)
760
+
761
+ Red-Candle includes comprehensive Named Entity Recognition capabilities for extracting entities like people, organizations, locations, and custom entity types from text.
762
+
763
+ ### Model-based NER
764
+
765
+ Load pre-trained NER models from HuggingFace:
766
+
767
+ ```ruby
768
+ require 'candle'
769
+
770
+ # Load a pre-trained NER model
771
+ ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")
772
+
773
+ # Or load a model with a specific tokenizer (for models without tokenizer.json)
774
+ ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")
775
+
776
+ # Extract entities from text
777
+ text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, California."
778
+ entities = ner.extract_entities(text)
779
+
780
+ entities.each do |entity|
781
+ puts "#{entity[:text]} (#{entity[:label]}) - confidence: #{entity[:confidence].round(2)}"
782
+ end
783
+ # Output:
784
+ # Apple Inc. (ORG) - confidence: 0.99
785
+ # Steve Jobs (PER) - confidence: 0.99
786
+ # Steve Wozniak (PER) - confidence: 0.98
787
+ # Cupertino (LOC) - confidence: 0.97
788
+ # California (LOC) - confidence: 0.98
789
+
790
+ # Adjust confidence threshold (default: 0.9)
791
+ entities = ner.extract_entities(text, confidence_threshold: 0.95)
792
+
793
+ # Get token-level predictions for detailed analysis
794
+ tokens = ner.predict_tokens(text)
795
+ ```
796
+
797
+ ### Pattern-based Recognition
798
+
799
+ For domain-specific entities, use regex patterns:
800
+
801
+ ```ruby
802
+ # Create pattern-based recognizers
803
+ email_recognizer = Candle::PatternEntityRecognizer.new("EMAIL", [
804
+ /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
805
+ ])
806
+
807
+ phone_recognizer = Candle::PatternEntityRecognizer.new("PHONE", [
808
+ /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/, # 555-123-4567
809
+ /\(\d{3}\)\s*\d{3}[-.]?\d{4}\b/, # (555) 123-4567
810
+ /\+1\s*\d{3}[-.]?\d{3}[-.]?\d{4}\b/ # +1 555-123-4567
811
+ ])
812
+
813
+ # Extract entities
814
+ text = "Contact us at info@example.com or call 555-123-4567"
815
+ email_entities = email_recognizer.recognize(text)
816
+ phone_entities = phone_recognizer.recognize(text)
817
+ ```
818
+
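Conceptually, a pattern recognizer scans the text with each regex and reports every match with its character offsets. The sketch below shows that idea in plain Ruby; `scan_patterns` is a hypothetical stand-in, and the hash keys are assumptions rather than the exact output of `PatternEntityRecognizer#recognize`.

```ruby
# Hypothetical stand-in for a pattern recognizer: returns one hash per match
# with the matched text, the label, and character offsets.
def scan_patterns(text, label, patterns)
  patterns.flat_map do |pattern|
    matches = []
    pos = 0
    while (m = pattern.match(text, pos))
      matches << { text: m[0], label: label, start: m.begin(0), end: m.end(0) }
      # Always advance, even on an empty match, to avoid looping forever
      pos = [m.end(0), m.begin(0) + 1].max
    end
    matches
  end
end

text = "Contact us at info@example.com or call 555-123-4567"
phones = scan_patterns(text, "PHONE", [/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/])
p phones.map { |e| e[:text] }
```

Note that `\b` only matches next to a word character, so patterns that begin with a symbol such as `(` or `+` should not start with `\b`.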
819
+ ### Gazetteer-based Recognition
820
+
821
+ Use dictionaries for known entities:
822
+
823
+ ```ruby
824
+ # Create gazetteer recognizers
825
+ companies = ["Apple", "Google", "Microsoft", "Amazon", "Tesla"]
826
+ company_recognizer = Candle::GazetteerEntityRecognizer.new("COMPANY", companies)
827
+
828
+ # Load from file
829
+ drug_recognizer = Candle::GazetteerEntityRecognizer.new("DRUG")
830
+ drug_recognizer.load_from_file("drug_names.txt")
831
+
832
+ # Case-sensitive matching
833
+ product_recognizer = Candle::GazetteerEntityRecognizer.new("PRODUCT",
834
+ ["iPhone", "iPad", "MacBook"],
835
+ case_sensitive: true
836
+ )
837
+ ```
838
+
839
+ ### Hybrid NER
840
+
841
+ Combine ML models with rule-based approaches for best results:
842
+
843
+ ```ruby
844
+ # Create hybrid NER system
845
+ hybrid = Candle::HybridNER.new("Babelscape/wikineural-multilingual-ner")
846
+
847
+ # Add pattern recognizers
848
+ hybrid.add_pattern_recognizer("EMAIL", [/\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}\b/])
849
+ hybrid.add_pattern_recognizer("PHONE", [/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/])
850
+
851
+ # Add gazetteer recognizers
852
+ hybrid.add_gazetteer_recognizer("COMPANY", ["Apple", "Google", "Microsoft"])
853
+ hybrid.add_gazetteer_recognizer("PRODUCT", ["iPhone", "Android", "Windows"])
854
+
855
+ # Extract all entities
856
+ text = "John Smith (john@apple.com) from Apple called about the new iPhone. Reach him at 555-123-4567."
857
+ entities = hybrid.extract_entities(text)
858
+
859
+ # Results include entities from all recognizers
860
+ # Overlapping entities are automatically resolved (highest confidence wins)
861
+ ```
862
+
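The "highest confidence wins" rule can be sketched as a greedy pass over the candidates. This is an illustration of the idea only, not red-candle's actual resolution code; `resolve_overlaps` is a hypothetical helper, and the symbol keys and confidence values are assumptions for the example.

```ruby
# Hypothetical sketch of overlap resolution: visit entities from highest to
# lowest confidence and keep each one only if it does not overlap a kept span.
def resolve_overlaps(entities)
  kept = []
  entities.sort_by { |e| -e[:confidence] }.each do |candidate|
    overlaps = kept.any? do |k|
      candidate[:start] < k[:end] && k[:start] < candidate[:end]
    end
    kept << candidate unless overlaps
  end
  kept.sort_by { |e| e[:start] }
end

entities = [
  { text: "John Smith", label: "PER", start: 0, end: 10, confidence: 0.98 },
  { text: "john@apple.com", label: "EMAIL", start: 12, end: 26, confidence: 1.0 },
  { text: "apple", label: "ORG", start: 17, end: 22, confidence: 0.85 }
]
resolved = resolve_overlaps(entities)
p resolved.map { |e| e[:label] }
# => ["PER", "EMAIL"]
```

Here the lower-confidence `ORG` span inside the email address is dropped because it overlaps the `EMAIL` match.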
863
+ ### Custom Entity Types
864
+
865
+ Pattern recognizers are well suited to specialized domains:
866
+
867
+ ```ruby
868
+ # Biomedical entities
869
+ gene_patterns = [
870
+ /\b[A-Z][A-Z0-9]{2,10}\b/, # TP53, BRCA1, EGFR (bounded for safety)
871
+ /\bCD\d+\b/, # CD4, CD8, CD34
872
+ /\b[A-Z]+\d+[A-Z]?\d*\b/ # RAD51C, PALB2
873
+ ]
874
+ gene_recognizer = Candle::PatternEntityRecognizer.new("GENE", gene_patterns)
875
+
876
+ # Financial entities
877
+ ticker_patterns = [
878
+ /\$[A-Z]{1,5}\b/, # $AAPL, $GOOGL
879
+ /\b[A-Z]{1,5}\.NYSE\b/, # AAPL.NYSE
880
+ /\b[A-Z]{1,5}\.NASDAQ\b/ # GOOGL.NASDAQ
881
+ ]
882
+ ticker_recognizer = Candle::PatternEntityRecognizer.new("TICKER", ticker_patterns)
883
+
884
+ # Legal entities
885
+ case_patterns = [
886
+ /\b\d+\s+F\.\d+[a-z]{1,2}\s+\d+\b/, # 123 F.3d 456
887
+ /\b\d+\s+U\.S\.\s+\d+\b/, # 123 U.S. 456
888
+ /\bNo\.\s+\d+-\d+\b/ # No. 20-1234
889
+ ]
890
+ case_recognizer = Candle::PatternEntityRecognizer.new("CASE", case_patterns)
891
+ ```
892
+
893
+ ### Available Pre-trained Models
894
+
895
+ Popular NER models on HuggingFace:
896
+
897
+ ```ruby
898
+ # General multilingual NER (4 entity types: PER, ORG, LOC, MISC)
899
+ ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")
900
+
901
+ # English NER (requires separate tokenizer)
902
+ ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")
903
+
904
+ # Multilingual NER
905
+ ner = Candle::NER.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")
906
+
907
+ # OntoNotes 5 (18 entity types including DATE, TIME, MONEY, etc.)
908
+ ner = Candle::NER.from_pretrained("flair/ner-english-ontonotes-large")
909
+
910
+ # Biomedical NER
911
+ ner = Candle::NER.from_pretrained("dmis-lab/biobert-base-cased-v1.2")
912
+ ner = Candle::NER.from_pretrained("allenai/scibert_scivocab_uncased")
913
+ ```
914
+
915
+ ### Performance Tips
916
+
917
+ 1. **Device Selection**: Use GPU for faster inference
918
+ ```ruby
919
+ ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner", device: Candle::Device.metal)
920
+ ```
921
+
922
+ 2. **Batch Processing**: Process multiple texts together when possible
923
+
924
+ 3. **Confidence Threshold**: Balance precision/recall with appropriate thresholds
925
+
926
+ 4. **Entity Resolution**: The hybrid NER automatically handles overlapping entities
927
+
928
+ ### Output Format
929
+
930
+ All NER methods return entities in a consistent format:
931
+
932
+ ```ruby
933
+ {
934
+ "text" => "Apple Inc.", # The entity text
935
+ "label" => "ORG", # Entity type
936
+ "start" => 0, # Character start position
937
+ "end" => 10, # Character end position
938
+ "confidence" => 0.99, # Confidence score (0-1)
939
+ "token_start" => 0, # Token start index (model-based only)
940
+ "token_end" => 2, # Token end index (model-based only)
941
+ "source" => "model" # Source: "model", "pattern", or "gazetteer"
942
+ }
943
+ ```
944
+
945
+ ## Vision-Language Models (VLM)
946
+
947
+ Red-Candle supports vision-language models for understanding and describing images. The VLM module uses LLaVA (Large Language and Vision Assistant), which combines a CLIP vision encoder with a Llama language model.
948
+
949
+ ### Basic Usage
950
+
951
+ ```ruby
952
+ require 'candle'
953
+
954
+ # Load a LLaVA model (requires ~13GB download on first use)
955
+ vlm = Candle::VLM.from_pretrained("llava-hf/llava-v1.6-vicuna-7b-hf")
956
+
957
+ # Describe an image
958
+ description = vlm.describe("photo.jpg")
959
+
960
+ # Ask a question about an image
961
+ answer = vlm.ask("photo.jpg", "What animal is in this image?")
962
+ # => "The animal in the image is a cat."
963
+
964
+ # Control output length
965
+ vlm.describe("photo.jpg", max_length: 500)
966
+ vlm.ask("photo.jpg", "What colors do you see?", max_length: 50)
967
+ ```
968
+
969
+ ### How It Works
970
+
971
+ 1. **CLIP Vision Encoder**: Converts the image into a sequence of visual feature tokens (576 patches from a 336x336 image)
972
+ 2. **MM Projector**: Projects vision features into the language model's embedding space
973
+ 3. **Llama LLM**: Processes the combined image+text embeddings and generates a text response
974
+
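The 576-token figure in step 1 follows from the patch grid. The arithmetic below assumes the 14-pixel patch size of the CLIP ViT-L/14 encoder, which is not stated in the text above.

```ruby
image_size = 336 # pixels per side, from step 1
patch_size = 14  # assumption: CLIP ViT-L/14 uses 14x14 pixel patches

patches_per_side = image_size / patch_size
visual_tokens = patches_per_side**2

puts "#{patches_per_side} x #{patches_per_side} = #{visual_tokens} visual tokens"
# prints "24 x 24 = 576 visual tokens"
```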
975
+ ### Supported Models
976
+
977
+ | Model | LLM Backend | Size | Notes |
978
+ |:------|:-----------|:-----|:------|
979
+ | `llava-hf/llava-v1.6-vicuna-7b-hf` | Llama (Vicuna) | 13GB | Recommended, LLaVA-Next with Llama backend |
980
+
981
+ ### Notes
982
+
983
+ - First load downloads ~13GB of model weights (cached for subsequent use)
984
+ - Image preprocessing is automatic (resize, normalize to CLIP format)
985
+ - Generation uses greedy decoding
986
+ - Multiple calls work correctly (KV cache is reset between queries)
987
+
988
+ ## Common Runtime Errors
989
+
990
+ ### Weight is negative, too large or not a valid number
991
+
992
+ **Error:**
993
+ ```
994
+ /Users/cpetersen/src/scientist/red-candle/lib/candle/llm.rb:25:in `_generate_stream': Generation failed: A weight is negative, too large or not a valid number (RuntimeError)
995
+ from /Users/cpetersen/src/scientist/red-candle/lib/candle/llm.rb:25:in `generate_stream'
996
+ ...
997
+ ```
998
+
999
+ **Cause:** This error occurs when using overly aggressive quantization levels (particularly Q2_K) that result in numerical instability during inference. The 2-bit quantization can cause weights to become corrupted or produce NaN/Inf values.
1000
+
1001
+ **Solution:** Use a higher quantization level. Recommended options:
1002
+ - Q4_K_M (4-bit) - Best balance of quality and size
1003
+ - Q5_K_M (5-bit) - Higher quality with slightly larger size
1004
+ - Q3_K_M (3-bit) - Minimum recommended quantization
1005
+
1006
+ ```ruby
1007
+ llm = Candle::LLM.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
1008
+ device: device,
1009
+ gguf_file: "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")
1010
+ ```
1011
+
1012
+ ### Cannot find tensor model.embed_tokens.weight
1013
+
1014
+ **Error:**
1015
+ ```
1016
+ Failed to load quantized model: cannot find tensor model.embed_tokens.weight (RuntimeError)
1017
+ ```
1018
+
1019
+ **Cause:** This error was common in earlier versions when loading GGUF files with incompatible tensor naming conventions. The unified GGUF loader in version 1.0.0+ should handle most GGUF files correctly.
1020
+
1021
+ **If you still encounter this error:**
1022
+ 1. Ensure you're using the latest version of red-candle (1.0.0 or higher)
1023
+ 2. Make sure to specify the exact GGUF filename:
1024
+ ```ruby
1025
+ llm = Candle::LLM.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
1026
+ device: device,
1027
+ gguf_file: "mistral-7b-instruct-v0.2.Q4_K_M.gguf")
1028
+ ```
1029
+ 3. If the error persists, the GGUF file may use an unsupported architecture or format
1030
+
1031
+ ### No GGUF file found in repository
1032
+
1033
+ **Error:**
1034
+ ```
1035
+ Failed to load quantized model: No GGUF file found in repository TheBloke/model-name-GGUF. Try specifying a quantization level like Q4_K_M, Q5_K_M, or Q8_0. (RuntimeError)
1036
+ ```
1037
+
1038
+ **Cause:** The automatic GGUF file detection couldn't find a matching file, often due to naming variations.
1039
+
1040
+ **Solution:** Specify the exact GGUF filename:
1041
+ ```ruby
1042
+ # Visit the HuggingFace repository to find the exact filename
1043
+ llm = Candle::LLM.from_pretrained("TheBloke/Llama-2-7B-Chat-GGUF",
1044
+ device: device,
1045
+ gguf_file: "llama-2-7b-chat.Q4_K_M.gguf")
1046
+ ```
1047
+
1048
+ ### Failed to download tokenizer
1049
+
1050
+ **Error:**
1051
+ ```
1052
+ Failed to load quantized model: Failed to download tokenizer: request error: HTTP status client error (404 Not Found)
1053
+ ```
1054
+
1055
+ **Cause:** GGUF repositories often don't include separate tokenizer files since they're embedded in the GGUF format.
1056
+
1057
+ **Solution:** The code now includes fallback tokenizer loading. If you still encounter this error, ensure you're using the latest version of red-candle.
1058
+
1059
+ ### Missing metadata in GGUF file
1060
+
1061
+ **Error:**
1062
+ ```
1063
+ Failed to load GGUF model: cannot find gemma3.attention.head_count in metadata (RuntimeError)
1064
+ ```
1065
+ or
1066
+ ```
1067
+ Failed to load GGUF model: cannot find llama.attention.head_count in metadata (RuntimeError)
1068
+ ```
1069
+
1070
+ **Cause:** Some GGUF files may have been created with older conversion tools that don't include all required metadata fields.
1071
+
1072
+ **Solution:**
1073
+ - Try a different GGUF file from the same model
1074
+ - Look for GGUF files from TheBloke or other reputable sources
1075
+ - Check if a newer version of the GGUF file is available
1076
+ - Some Gemma GGUF files may not be compatible with the current loader
1077
+
1078
+ **Known compatibility issues:**
1079
+ - `lmstudio-ai/gemma-2b-it-GGUF` - Missing required metadata fields
1080
+ - Gemma 3 GGUF files may require specific tokenizers that are not publicly available
1081
+ - For best compatibility, use Llama or Mistral GGUF files from TheBloke
1082
+
1083
+ ## Development
1084
+
1085
+ FORK IT!
1086
+
1087
+ ```
1088
+ git clone https://github.com/scientist-labs/red-candle
1089
+ cd red-candle
1090
+ bundle
1091
+ bundle exec rake compile
1092
+ ```
1093
+
1094
+ Pull requests are welcome.
1095
+
1096
+ ## Testing
1097
+
1098
+ Red Candle has comprehensive tests at both the Ruby and Rust levels:
1099
+
1100
+ ### Ruby Tests
1101
+ ```bash
1102
+ # Run all Ruby tests
1103
+ bundle exec rake test
1104
+
1105
+ # Run specific test suites
1106
+ bundle exec rake test:device # Device compatibility tests
1107
+ bundle exec rake test:benchmark # Benchmark tests
1108
+ bundle exec rake test:llm:mistral # Model-specific tests
1109
+ ```
1110
+
1111
+ ### Rust Tests
1112
+ ```bash
1113
+ # Run Rust unit and integration tests
1114
+ cd ext/candle && cargo test
1115
+
1116
+ # Or use the Rake task
1117
+ bundle exec rake rust:test
1118
+ ```
1119
+
1120
+ The Rust tests include:
1121
+ - Unit tests within source files (using `#[cfg(test)]` modules)
1122
+ - Integration tests for external dependencies (candle_core operations)
1123
+ - Tests for structured generation, tokenization, and text generation
1124
+
1125
+ ### Code Coverage
1126
+
1127
+ #### Rust Code Coverage
1128
+ Red Candle uses `cargo-llvm-cov` for Rust code coverage analysis:
1129
+
1130
+ ```bash
1131
+ # Generate HTML coverage report (written to target/llvm-cov/html/index.html)
1132
+ bundle exec rake rust:coverage:html
1133
+
1134
+ # Show coverage summary in terminal
1135
+ bundle exec rake rust:coverage:summary
1136
+
1137
+ # Generate detailed coverage report
1138
+ bundle exec rake rust:coverage:report
1139
+
1140
+ # Generate LCOV format for CI integration
1141
+ bundle exec rake rust:coverage:lcov
1142
+
1143
+ # Clean coverage data
1144
+ bundle exec rake rust:coverage:clean
1145
+ ```
1146
+
1147
+ **Note**: Overall Rust coverage shows ~17% because most code consists of Ruby FFI bindings that are tested through Ruby tests. The testable Rust components have high coverage:
1148
+ - Constrained generation: 99.59%
1149
+ - Schema processing: 90.99%
1150
+ - Integration tests: 97.12%
1151
+
1152
+ #### Ruby Code Coverage
1153
+ Ruby test coverage is generated automatically when running tests:
1154
+ ```bash
1155
+ bundle exec rake test
1156
+ # Coverage report generated in coverage/index.html
1157
+ ```
1158
+
1159
+ ## Release
1160
+
1161
+ 1. Update version number in `lib/candle/version.rb` and commit.
1162
+ 2. `bundle exec rake build`
1163
+ 3. `git tag VERSION_NUMBER`
1164
+ 4. `git push --follow-tags`
1165
+ 5. `gem push pkg/red-candle-VERSION_NUMBER.gem`
1166
+
1167
+ ## Dependencies
1168
+
1169
+ - [Candle](https://github.com/huggingface/candle)
1170
+ - [Magnus](https://github.com/matsadler/magnus)
1171
+ - [Outlines-core](https://github.com/dottxt-ai/outlines-core)