mini_embed 0.1.1 → 0.2.1
- checksums.yaml +4 -4
- data/README.md +9 -5
- data/ext/mini_embed/mini_embed.c +788 -603
- data/lib/mini_embed.rb +14 -0
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '038f53048205e4db0def9faa8fa718580f9e089a2eb057ca64ee86a9794f8fa6'
+  data.tar.gz: d5d37dd58c4bb3671053acb280db02ebb2ef78722d9c115f57f2594ad3a9ab50
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9af0cca4fe5cf57f8ac43f1b410f37faac267090e7cb54aa52aecc990343c283899b81675d62314a0982574756027e1d367b3ab180196ad8a5a68e4cd6d0cc2e
+  data.tar.gz: f5bb3db889b9c51348daed59c3fbab9496237c3e9a64cb908ef386a1093e5e678531a5ad10eb051d0614dbe1fb9217d93a32049e6a5b8392b053d2474d6e9606
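The digests in checksums.yaml cover the `metadata.gz` and `data.tar.gz` members of the packaged gem, so they can be recomputed from the extracted files. A minimal sketch of that comparison follows, assuming the files have already been extracted from the `.gem` archive; the helper name `checksum_matches?` and the sample bytes are illustrative and not part of mini_embed or RubyGems.

```ruby
require 'digest'
require 'yaml'

# Compare a file's digests with the entries recorded in a parsed checksums.yaml.
# `checksums` looks like { "SHA256" => { "data.tar.gz" => "<hex>" }, "SHA512" => { ... } }.
# Hypothetical helper for illustration only.
def checksum_matches?(checksums, name, bytes)
  {
    'SHA256' => Digest::SHA256,
    'SHA512' => Digest::SHA512
  }.all? do |algo, digest_class|
    expected = checksums.dig(algo, name)
    # A missing entry counts as a mismatch rather than a pass.
    !expected.nil? && digest_class.hexdigest(bytes) == expected
  end
end

# Self-contained demonstration with stand-in bytes instead of a real archive member.
sample_bytes = 'stand-in for data.tar.gz contents'
sample_yaml = <<~YAML
  ---
  SHA256:
    data.tar.gz: '#{Digest::SHA256.hexdigest(sample_bytes)}'
  SHA512:
    data.tar.gz: '#{Digest::SHA512.hexdigest(sample_bytes)}'
YAML
checksums = YAML.safe_load(sample_yaml)

puts checksum_matches?(checksums, 'data.tar.gz', sample_bytes)
```

For a real check, `bytes` would come from something like `File.binread('data.tar.gz')`. The digests are quoted in the sample YAML so that a value that happens to look numeric is still parsed as a string.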
data/README.md
CHANGED
@@ -52,15 +52,19 @@ require 'mini_embed'
 # Load a GGUF model (F32, F16, Q8_0, Q4_K, etc. are all supported)
 model = MiniEmbed.new(model: '/path/to/gte-small.Q8_0.gguf')
 
-# Get
-
-
-# Get an embedding as an array of floats
-embedding = binary.unpack('e*')
+# Get embedding as an array of floats (default)
+embedding = model.embeddings(text: 'hello world')
 puts embedding.size # e.g. 384
 puts embedding[0..4] # e.g. [0.0123, -0.0456, ...]
+
+# Or get the raw binary string (little‑endian 32‑bit floats)
+binary = model.embeddings(text: 'hello world', type: :binary)
+embedding_from_binary = binary.unpack('e*')
 ```
 
+Note: The type parameter is optional – it defaults to :vector which returns a Ruby `Array<Float>`. Use `type: :binary` to get the raw binary string (compatible with the original C extension).
+
+
 ## Simple tokenization note
 MiniEmbed uses a naive space‑based tokenizer. This means it splits input on spaces and looks up each token exactly in the model's vocabulary. For models trained with subword tokenization (like BERT), this will not work for out‑of‑vocabulary words.
 If you need proper subword tokenization, you can:
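The README change in this hunk describes the `type: :binary` result as little-endian 32-bit floats that are decoded with `unpack('e*')`. The sketch below illustrates that byte layout using only Ruby's built-in `pack` and `unpack`; it does not load a model, and the `floats` array is a made-up stand-in for a real embedding.

```ruby
# Stand-in for the string returned with `type: :binary`; the values are made up
# and chosen to be exactly representable as 32-bit floats.
floats = [0.5, -1.25, 3.0]
binary = floats.pack('e*')        # 'e': single-precision float, little-endian

bytes_per_value = binary.bytesize / floats.size   # 4 bytes per component
decoded = binary.unpack('e*')     # back to an Array of Ruby Floats

# Ruby Floats are 64-bit, so a value such as 0.1 is rounded when stored as 32 bits.
lossy = [0.1].pack('e').unpack1('e')

puts bytes_per_value
puts decoded.inspect
puts lossy == 0.1
```

One consequence of this layout is that the byte size of the binary string divided by 4 gives the embedding dimension, and that values read back from it carry single precision only.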