rag_embeddings 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/LICENSE +21 -0
- data/README.md +119 -0
- data/Rakefile +6 -0
- data/ext/rag_embeddings/embedding.c +117 -0
- data/ext/rag_embeddings/extconf.rb +2 -0
- data/lib/rag_embeddings/database.rb +39 -0
- data/lib/rag_embeddings/engine.rb +13 -0
- data/lib/rag_embeddings/version.rb +3 -0
- data/lib/rag_embeddings.rb +4 -0
- metadata +94 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: cd1dc0ac570ef83c9c79142cb516411454137919c649ab7f446e18748f7f7717
+  data.tar.gz: d21f7e9b2eee1324b4a0f7a5d892d3f55f1767bb42a965e0828225ba823461b8
+SHA512:
+  metadata.gz: 3fd776d3ff4b3082eac778534f2fc56aa541ea048d70d69ddbfb2738da9fb252012d6d7ffa4af0f2a1fbed229ac86960ada844cd062a4c8294ecd4999b4cd0ed
+  data.tar.gz: b2682f1d217d9689a73e78fdbcfd74973de598f323dc12a35868530cb95e0833a584d29468d91c2e6e01afcf66585fe18f8388e54cee04b71b8e7dc955a10181
data/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Marco Mastrodonato
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,119 @@
+# Rag Embeddings
+
+[Gem Version](https://badge.fury.io/rb/rag_embeddings)
+
+**rag_embeddings** is a native Ruby library for efficient storage and comparison of AI-generated embedding vectors (float arrays) using high-performance C extensions. It is designed for seamless integration with external LLMs (Ollama, OpenAI, Mistral, etc) and works perfectly for RAG (Retrieval-Augmented Generation) applications.
+
+- **C extension for maximum speed** in cosine similarity and vector allocation
+- **Compatible with langchainrb** for embedding generation (Ollama, OpenAI, etc)
+- **SQLite-based storage** with vector search capabilities
+- **RSpec tested**
+
+---
+
+## Features
+
+- Creation of embedding objects from LLM-generated float arrays
+- Cosine similarity calculation in C for speed and safety
+- Embedding + text storage in SQLite (BLOB)
+- Retrieve top-K most similar texts to a query using cosine similarity
+- Memory-safe and 100% Ruby compatible
+- Plug-and-play for RAG, semantic search, and retrieval AI
+
+---
+
+## Installation
+
+Add to your Gemfile:
+
+```ruby
+gem "rag_embeddings"
+gem "langchainrb"
+gem "faraday"
+gem "sqlite3"
+```
+
+bundle install
+rake compile
+
+(Requires a working C compiler!)
+
+## Running the test suite
+
+To run all specs (RSpec required):
+
+`bundle exec rspec`
+
+## Practical examples
+
+### 1. Generate an embedding from text
+
+```ruby
+require "rag_embeddings"
+
+text = "Hello world, this is RAG!"
+embedding = RagEmbeddings.embed(text)
+# embedding is a float array
+```
+
+### 2. Create a C embedding object
+
+```ruby
+c_embedding = RagEmbeddings::Embedding.from_array(embedding)
+puts "Dimension: #{c_embedding.dim}"
+puts "Ruby array: #{c_embedding.to_a.inspect}"
+```
+
+### 3. Compute similarity between two texts
+
+```ruby
+emb1 = RagEmbeddings.embed("Hello world!")
+emb2 = RagEmbeddings.embed("Hi universe!")
+obj1 = RagEmbeddings::Embedding.from_array(emb1)
+obj2 = RagEmbeddings::Embedding.from_array(emb2)
+sim = obj1.cosine_similarity(obj2)
+puts "Cosine similarity: #{sim}" # Value between -1 and 1
+```
+
+### 4. Store and search embeddings in a database
+
+```ruby
+db = RagEmbeddings::Database.new("embeddings.db")
+db.insert("Hello world!", RagEmbeddings.embed("Hello world!"))
+db.insert("Completely different sentence", RagEmbeddings.embed("Completely different sentence"))
+
+# Find the most similar text to a query
+result = db.top_k_similar("Hello!", k: 1)
+puts "Most similar text: #{result.first[1]}, score: #{result.first[2]}"
+```
+
+## How it works
+
+- Embeddings are managed as dynamic C objects for efficiency (variable dimension).
+- The only correct way to construct an embedding object is using .from_array.
+- Langchainrb integration lets you easily change the embedding provider (Ollama, OpenAI, etc).
+- Storage uses local SQLite with embeddings as BLOB, for maximum portability and simplicity.
+
+## Customization
+
+- Embedding provider: switch model/provider in engine.rb (Ollama, OpenAI, etc)
+- Database: set the SQLite file path as desired
+
+## Requirements
+
+- Ruby >= 3.3
+- langchainrb (for embedding)
+- sqlite3 (for storage)
+- A working C compiler
+
+## Notes
+
+- Always create embeddings with .from_array
+- All memory management is idiomatic and safe
+- For millions of vectors, consider vector DBs (Faiss, sqlite-vss, etc.)
+
+## Contact & Issues
+Open an issue or contact the maintainer for questions, suggestions, or bugs.
+
+
+Happy RAG!
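Reviewer note on the README's example 4: the retrieval step can be tried without Ollama, SQLite, or the compiled extension. The sketch below is only an illustration of the score-and-rank idea. The `cosine` and `top_k` helpers and the `CORPUS` constant are hypothetical names with hand-written toy vectors, not part of the gem.

```ruby
# Pure-Ruby stand-in for the native cosine similarity (illustration only).
def cosine(a, b)
  raise ArgumentError, "Dimension mismatch" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # Same small epsilon as the C code, so zero vectors do not divide by zero.
  dot / (norm_a * norm_b + 1e-8)
end

# Hypothetical toy corpus with hand-written 3-dimensional vectors.
CORPUS = [
  ["Hello world!", [1.0, 0.0, 0.0]],
  ["Completely different sentence", [0.0, 1.0, 0.0]],
  ["Hello there!", [0.9, 0.1, 0.0]]
].freeze

# Score every stored vector against the query and keep the k best,
# highest score first.
def top_k(query_vec, k: 1)
  CORPUS.map { |text, vec| [text, cosine(vec, query_vec)] }
        .max_by(k) { |_, score| score }
end

best = top_k([1.0, 0.0, 0.0], k: 2)
best.each { |text, score| puts format("%-30s %.4f", text, score) }
```

Like `top_k_similar` in the gem, this scans every stored vector on each query, which is why the README points to dedicated vector databases for very large collections.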
data/ext/rag_embeddings/embedding.c
ADDED
@@ -0,0 +1,117 @@
+#include <ruby.h> // Ruby API
+#include <stdint.h> // For integer types like uint16_t
+#include <stdlib.h> // For memory allocation functions
+#include <math.h> // For math functions like sqrt
+
+// Main data structure for storing embeddings
+// Flexible array member (values[]) allows variable length arrays
+typedef struct {
+  uint16_t dim; // Dimension of the embedding vector
+  float values[]; // Flexible array member to store the actual values
+} embedding_t;
+
+// Callback for freeing memory when Ruby's GC collects our object
+static void embedding_free(void *ptr) {
+  xfree(ptr); // Ruby's memory free function
+}
+
+// Callback to report memory usage to Ruby's GC
+static size_t embedding_memsize(const void *ptr) {
+  const embedding_t *emb = (const embedding_t *)ptr;
+  return emb ? sizeof(embedding_t) + emb->dim * sizeof(float) : 0;
+}
+
+// Type information for Ruby's GC:
+// Tells Ruby how to manage our C data structure
+static const rb_data_type_t embedding_type = {
+  "RagEmbeddings/Embedding", // Type name
+  {0, embedding_free, embedding_memsize,}, // Functions: mark, free, size
+  0, 0, // Parent type, data
+  RUBY_TYPED_FREE_IMMEDIATELY // Flags
+};
+
+// Class method: RagEmbeddings::Embedding.from_array([1.0, 2.0, ...])
+// Creates a new embedding from a Ruby array
+static VALUE embedding_from_array(VALUE klass, VALUE rb_array) {
+  Check_Type(rb_array, T_ARRAY); // Ensure argument is a Ruby array
+  uint16_t dim = (uint16_t)RARRAY_LEN(rb_array);
+
+  // Allocate memory for struct + array of floats
+  embedding_t *ptr = xmalloc(sizeof(embedding_t) + dim * sizeof(float));
+  ptr->dim = dim;
+
+  // Copy values from Ruby array to our C array
+  for (int i = 0; i < dim; ++i)
+    ptr->values[i] = (float)NUM2DBL(rb_ary_entry(rb_array, i));
+
+  // Wrap our C struct in a Ruby object
+  VALUE obj = TypedData_Wrap_Struct(klass, &embedding_type, ptr);
+  return obj;
+}
+
+// Instance method: embedding.dim
+// Returns the dimension of the embedding
+static VALUE embedding_dim(VALUE self) {
+  embedding_t *ptr;
+  // Get the C struct from the Ruby object
+  TypedData_Get_Struct(self, embedding_t, &embedding_type, ptr);
+  return INT2NUM(ptr->dim);
+}
+
+// Instance method: embedding.to_a
+// Converts the embedding back to a Ruby array
+static VALUE embedding_to_a(VALUE self) {
+  embedding_t *ptr;
+  TypedData_Get_Struct(self, embedding_t, &embedding_type, ptr);
+
+  // Create a new Ruby array with pre-allocated capacity
+  VALUE arr = rb_ary_new2(ptr->dim);
+
+  // Copy each float value to the Ruby array
+  for (int i = 0; i < ptr->dim; ++i)
+    rb_ary_push(arr, DBL2NUM(ptr->values[i]));
+
+  return arr;
+}
+
+// Instance method: embedding.cosine_similarity(other_embedding)
+// Calculate cosine similarity between two embeddings
+static VALUE embedding_cosine_similarity(VALUE self, VALUE other) {
+  embedding_t *a, *b;
+  // Get C structs for both embeddings
+  TypedData_Get_Struct(self, embedding_t, &embedding_type, a);
+  TypedData_Get_Struct(other, embedding_t, &embedding_type, b);
+
+  // Ensure dimensions match
+  if (a->dim != b->dim)
+    rb_raise(rb_eArgError, "Dimension mismatch");
+
+  float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
+
+  // Calculate dot product and vector magnitudes
+  for (int i = 0; i < a->dim; ++i) {
+    dot += a->values[i] * b->values[i]; // Dot product
+    norm_a += a->values[i] * a->values[i]; // Square of magnitude for vector a
+    norm_b += b->values[i] * b->values[i]; // Square of magnitude for vector b
+  }
+
+  // Apply cosine similarity formula: dot(a,b)/(|a|*|b|)
+  // Small epsilon (1e-8) added to prevent division by zero
+  return DBL2NUM(dot / (sqrt(norm_a) * sqrt(norm_b) + 1e-8));
+}
+
+// Ruby extension initialization function
+// This function is called when the extension is loaded
+void Init_embedding(void) {
+  // Define module and class
+  VALUE mRag = rb_define_module("RagEmbeddings");
+  VALUE cEmbedding = rb_define_class_under(mRag, "Embedding", rb_cObject);
+
+  // Register class methods
+  rb_define_singleton_method(cEmbedding, "from_array", embedding_from_array, 1);
+
+  // Register instance methods
+  rb_define_method(cEmbedding, "dim", embedding_dim, 0);
+  rb_define_method(cEmbedding, "to_a", embedding_to_a, 0);
+  rb_define_method(cEmbedding, "cosine_similarity", embedding_cosine_similarity, 1);
+}
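Reviewer note on the dimension field: `from_array` stores the array length through the cast `(uint16_t)RARRAY_LEN(rb_array)`, so only the low 16 bits of the length survive. Arrays longer than 65,535 elements would therefore be stored with a wrapped, smaller dimension instead of raising an error. The Ruby sketch below mimics that arithmetic. `dim_after_cast`, `check_dim!`, and `MAX_DIM` are hypothetical helpers for illustration, not functions of the gem.

```ruby
# Mimics the C cast (uint16_t)RARRAY_LEN(rb_array): keep only the low 16 bits.
def dim_after_cast(length)
  length & 0xFFFF
end

puts dim_after_cast(1536)    # typical embedding size, unchanged: 1536
puts dim_after_cast(65_536)  # wraps to 0
puts dim_after_cast(70_000)  # wraps to 4464

# Largest dimension a uint16_t field can represent.
MAX_DIM = 0xFFFF

# Hypothetical caller-side guard applied before handing a vector to from_array.
def check_dim!(values)
  if values.size > MAX_DIM
    raise ArgumentError, "too many dimensions: #{values.size} (max #{MAX_DIM})"
  end

  values
end
```

Common embedding models produce vectors far below this limit, so the wrap is mostly a robustness concern for unexpected input.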
data/lib/rag_embeddings/database.rb
ADDED
@@ -0,0 +1,39 @@
+require "sqlite3"
+
+module RagEmbeddings
+  class Database
+    def initialize(path = "embeddings.db")
+      @db = SQLite3::Database.new(path)
+      @db.execute <<~SQL
+        CREATE TABLE IF NOT EXISTS embeddings (
+          id INTEGER PRIMARY KEY AUTOINCREMENT,
+          content TEXT NOT NULL,
+          embedding BLOB NOT NULL
+        );
+      SQL
+    end
+
+    def insert(text, embedding)
+      blob = embedding.pack("f*")
+      @db.execute("INSERT INTO embeddings (content, embedding) VALUES (?, ?)", [text, blob])
+    end
+
+    def all
+      @db.execute("SELECT id, content, embedding FROM embeddings").map do |id, content, blob|
+        [id, content, blob.unpack("f*")]
+      end
+    end
+
+    # "Raw" search: returns the N texts most similar to the query
+    def top_k_similar(query_text, k: 5)
+      query_embedding = RagEmbeddings.embed(query_text)
+      query_obj = RagEmbeddings::Embedding.from_array(query_embedding)
+
+      all.map do |id, content, emb|
+        emb_obj = RagEmbeddings::Embedding.from_array(emb)
+        similarity = emb_obj.cosine_similarity(query_obj)
+        [id, content, similarity]
+      end.sort_by { |_,_,sim| -sim }.first(k)
+    end
+  end
+end
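Reviewer note on the BLOB encoding: `insert` and `all` serialize vectors with `pack("f*")` and `unpack("f*")`, which means native-endian 32-bit floats. The sketch below shows what that round trip does to the values, using a made-up four-element vector. The `"e*"` variant is offered only as an assumption about what a portable encoding could look like, not something the gem does.

```ruby
vector = [0.5, -1.25, 3.0, 0.1]

# Same directive database.rb uses: native-endian single-precision floats.
blob = vector.pack("f*")
restored = blob.unpack("f*")

puts blob.bytesize      # 4 bytes per value, so 16 here
puts restored.inspect   # 0.1 comes back as the nearest 32-bit float

# Values exactly representable in 32 bits survive unchanged; others are rounded.
exact = restored[0, 3] == vector[0, 3]
rounded = restored[3] != vector[3]
puts "exact: #{exact}, rounded: #{rounded}"

# "e*" is the little-endian single-precision directive. A blob written with it
# decodes to the same values regardless of the machine's byte order.
portable = vector.pack("e*").unpack("e*")
puts portable == restored
```

Because the C struct also stores `float`, the rounding on write matches the precision used during comparison. The native byte order only matters if a database file is moved between machines with different endianness.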
data/lib/rag_embeddings/engine.rb
ADDED
@@ -0,0 +1,13 @@
+require "langchainrb"
+
+module RagEmbeddings
+  MODEL = "gemma3".freeze
+
+  def self.llm
+    @llm ||= Langchain::LLM::Ollama.new(url: "http://localhost:11434", default_options: { temperature: 0.1, model: MODEL })
+  end
+
+  def self.embed(text)
+    llm.embed(text: text).embedding
+  end
+end
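Reviewer note on the provider wiring: `engine.rb` hardcodes the Ollama URL and model, so every call to `RagEmbeddings.embed` needs a running local server. One possible way to exercise code that consumes embeddings without that server is to hide the provider behind a small object. The `EmbeddingSource` class below is a hypothetical wrapper written for this note. It is not part of the gem or of langchainrb, and the fake feature vector is invented.

```ruby
# Hypothetical wrapper, not part of rag_embeddings: the embedder is any block
# that takes a text and returns an Array of numbers.
class EmbeddingSource
  def initialize(&embedder)
    raise ArgumentError, "embedder block required" unless embedder

    @embedder = embedder
  end

  # Returns the vector as Floats, rejecting anything that is not numeric.
  def embed(text)
    vector = @embedder.call(text)
    unless vector.is_a?(Array) && vector.all?(Numeric)
      raise TypeError, "embedder must return an Array of numbers"
    end

    vector.map(&:to_f)
  end
end

# Deterministic fake for tests: length, vowel count, space count (made up).
fake = EmbeddingSource.new do |text|
  [text.length, text.count("aeiou"), text.count(" ")]
end

p fake.embed("Hello world")
```

With the gem installed and Ollama running, the same wrapper could presumably be built as `EmbeddingSource.new { |text| RagEmbeddings.embed(text) }`.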
metadata
ADDED
@@ -0,0 +1,94 @@
+--- !ruby/object:Gem::Specification
+name: rag_embeddings
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Marco Mastrodonato
+bindir: bin
+cert_chain: []
+date: 1980-01-02 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: sqlite3
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: langchainrb
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: faraday
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Manage AI vector embeddings in C with Ruby integration
+email:
+- m.mastrodonato@gmail.com
+executables: []
+extensions:
+- ext/rag_embeddings/extconf.rb
+extra_rdoc_files: []
+files:
+- LICENSE
+- README.md
+- Rakefile
+- ext/rag_embeddings/embedding.c
+- ext/rag_embeddings/extconf.rb
+- lib/rag_embeddings.rb
+- lib/rag_embeddings/database.rb
+- lib/rag_embeddings/engine.rb
+- lib/rag_embeddings/version.rb
+homepage: https://rubygems.org/gems/rag_embeddings
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://rubygems.org/gems/rag_embeddings
+  source_code_uri: https://github.com/marcomd/rag_embeddings
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.6.9
+specification_version: 4
+summary: Efficient RAG embedding storage and retrieval
+test_files: []
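Reviewer note on the version constraint: the README lists "Ruby >= 3.3" under Requirements, while the published metadata records `required_ruby_version` as `>= 0`, so RubyGems would accept an install on any Ruby. The sketch below uses the standard `Gem::Requirement` and `Gem::Version` classes to show the difference. The sample interpreter version `3.1.4` is an arbitrary choice for illustration.

```ruby
require "rubygems"

published  = Gem::Requirement.new(">= 0")    # as recorded in the metadata
documented = Gem::Requirement.new(">= 3.3")  # as stated in the README

old_ruby = Gem::Version.new("3.1.4")

puts published.satisfied_by?(old_ruby)   # true: install would be allowed
puts documented.satisfied_by?(old_ruby)  # false: README says unsupported
```

If the README figure is the intended floor, declaring `required_ruby_version = ">= 3.3"` in the gemspec would let RubyGems enforce it at install time. The gemspec itself is not shown in this diff, so that is an assumption about how the package is built.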