ruby_llm-tokenizer 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -1
- data/README.md +5 -5
- data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb +6 -4
- data/lib/ruby_llm/tokenizer/registry.rb +1 -1
- data/lib/ruby_llm/tokenizer/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7ea1085db79537e78ef113cf8bea5f648cf30145c3f716959ffb909a011bba6c
+  data.tar.gz: 481525307059a6103dee95ce4bd8d87b49ecd8eb51a776da74597f894444ce5e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7be46d79826054f97f494c8d9b3fa8a2e12cd5e1433dac64c86372c67846d84ee8cfbf8d4e1e381120f802670abf1aba38a0977c6e02b361253f6bb5ffa2e233
+  data.tar.gz: 760d2875b2572703749db37f018af27f8167b3d507b9ff15899a82538378a4cbec1c12ab5f99d73793dc3c4e86afb9eb4dc9cc0f23c79b038da971a03c22147e
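The digests above cover the `metadata.gz` and `data.tar.gz` members of the gem archive. As a minimal sketch of how such entries could be checked locally, assuming the archive has already been unpacked into a directory and `checksums.yaml` has been parsed into a nested hash, the helper below recomputes each digest. The name `checksum_mismatches` is hypothetical and not part of RubyGems.

```ruby
require "digest/sha2"

# Hypothetical helper (not a RubyGems API): recomputes digests for files in
# `dir` and returns a list of "ALGORITHM filename" strings that do not match.
# `checksums` is shaped like a parsed checksums.yaml:
#   { "SHA256" => { "data.tar.gz" => "<hex>" }, "SHA512" => { ... } }
def checksum_mismatches(dir, checksums)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  mismatches = []

  checksums.each do |algorithm, files|
    digest_class = algorithms.fetch(algorithm)
    files.each do |name, expected|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      mismatches << "#{algorithm} #{name}" unless actual == expected
    end
  end

  mismatches
end
```

An empty return value means every listed file matched its recorded digest.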
data/CHANGELOG.md
CHANGED
@@ -1,4 +1,9 @@
-## [
+## [0.1.2] - 2026-06-13
+
+- Bundled a default SentencePiece model for Gemini so it works out of the box,
+  while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+- Tightened the README wording around SentencePiece and Gemini usage.
+- Updated the gem version to prepare for the next RubyGems release.
 
 ## [0.1.1] - 2026-06-11
 
@@ -27,4 +32,13 @@
 - Hugging Face tokenizers fetched from the Hub are persisted under `cache_dir` for
   later offline reuse.
 
+## [Unreleased]
+
+## [Unreleased]
+
+- Bundled a default SentencePiece model for Gemini so it works out of the box,
+  while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+- Tightened the README wording around SentencePiece and Gemini usage.
+- Updated the gem version to prepare for the next RubyGems release.
+
 
data/README.md
CHANGED
@@ -4,7 +4,7 @@
 [](https://rubygems.org/gems/ruby_llm-tokenizer)
 
 Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
-A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the
+A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the right tokenizer for counting, analyzing, and truncating text locally.
 No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
 
 ## Installation
@@ -65,7 +65,7 @@ implementation may still retain the kept portion in memory.
 | Family | Backend | Encoding / Repo |
 |-----------------------------------------------------------|-----------------|------------------------------------------|
 | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
-| `gemini` | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE`
+| `gemini` | `sentencepiece` | bundled `.model`, override with `GEMINI_TOKENIZER_MODEL_FILE` |
 | `llama-3` / `meta-llama` | `hugging_face` | `meta-llama/Meta-Llama-3-8B-Instruct` |
 | `mistral` / `mixtral` | `hugging_face` | `mistralai/Mistral-7B-Instruct-v0.2` |
 | `deepseek` | `hugging_face` | `deepseek-ai/DeepSeek-V2` |
@@ -75,7 +75,7 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 
 OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
 
-If a model ships a SentencePiece `.model` file instead of `tokenizer.json`,
+If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, register it with the `sentencepiece` backend:
 
 ```ruby
 RubyLLM::Tokenizer.register(
@@ -85,7 +85,7 @@ RubyLLM::Tokenizer.register(
 )
 ```
 
-This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem.
+This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. Add `sentencepiece` to your bundle and install the native SentencePiece library on your system.
 
 Common install commands from the upstream project:
 
@@ -103,7 +103,7 @@ If you install the gem directly on Apple Silicon, upstream also notes that you m
 gem install sentencepiece -- --with-opt-dir=/opt/homebrew
 ```
 
-Gemini
+Gemini uses the bundled `lib/ruby_llm/tokenizer/data/gemini_tokenizer.model` by default; set `GEMINI_TOKENIZER_MODEL_FILE` to override it.
 
 ## Claude / Anthropic
 
data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb
CHANGED
@@ -8,9 +8,9 @@ module RubyLLM
 class SentencePiece < Base
   attr_reader :model_file
 
-  def initialize(model_file: nil, model_file_env: nil)
+  def initialize(model_file: nil, model_file_env: nil, default_model_file: nil)
     super()
-    @model_file = resolve_model_file(model_file, model_file_env)
+    @model_file = resolve_model_file(model_file, model_file_env, default_model_file)
     processor_class = load_sentencepiece_processor_class
     @tokenizer = processor_class.new(model_file: @model_file)
   rescue StandardError => e
@@ -38,7 +38,7 @@ module RubyLLM
 
   private
 
-  def resolve_model_file(model_file, model_file_env)
+  def resolve_model_file(model_file, model_file_env, default_model_file)
     return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
 
     if model_file_env && !model_file_env.to_s.empty?
@@ -46,8 +46,10 @@ module RubyLLM
       return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
     end
 
+    return default_model_file.to_s unless default_model_file.nil? || default_model_file.to_s.empty?
+
     raise BackendError,
-          "SentencePiece backend requires :model_file or :
+          "SentencePiece backend requires :model_file, :model_file_env, or :default_model_file with a configured path"
   end
 
   def load_sentencepiece_processor_class
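The hunks above give `resolve_model_file` a three-step lookup: an explicit `model_file` wins, then the environment variable named by `model_file_env`, then the new `default_model_file`. The standalone sketch below mirrors that order outside the gem. The `env:` parameter and the `blank_value?` helper are assumptions added for illustration, `ArgumentError` stands in for the gem's `BackendError`, and the line that reads the environment is not visible in the diff, so `env[...]` is a guess at its shape.

```ruby
# Standalone sketch of the lookup order; not the gem's actual method.
# `env` is injected (assumption) so the example does not depend on the
# real process environment.
def resolve_model_file(model_file, model_file_env, default_model_file, env: ENV)
  # 1. An explicitly passed path always wins.
  return model_file.to_s unless blank_value?(model_file)

  # 2. Otherwise consult the named environment variable, if any.
  unless blank_value?(model_file_env)
    env_value = env[model_file_env.to_s]
    return env_value.to_s unless blank_value?(env_value)
  end

  # 3. Finally fall back to the default path.
  return default_model_file.to_s unless blank_value?(default_model_file)

  # ArgumentError stands in for the gem's BackendError here.
  raise ArgumentError, "no SentencePiece model file configured"
end

# Hypothetical helper: true for nil or an empty string representation.
def blank_value?(value)
  value.nil? || value.to_s.empty?
end

env = { "GEMINI_TOKENIZER_MODEL_FILE" => "/opt/models/custom.model" }
resolve_model_file(nil, "GEMINI_TOKENIZER_MODEL_FILE", "bundled.model", env: env)
# => "/opt/models/custom.model"
resolve_model_file(nil, "GEMINI_TOKENIZER_MODEL_FILE", "bundled.model", env: {})
# => "bundled.model"
```

With all three sources empty the sketch raises, which corresponds to the reworded error message in the last hunk.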
data/lib/ruby_llm/tokenizer/registry.rb
CHANGED
@@ -108,7 +108,7 @@ module RubyLLM
 when :tiktoken then Backend::Tiktoken.new(**entry.options)
 when :tiktoken_auto then build_tiktoken_auto(model)
 when :hugging_face then Backend::HuggingFace.new(**entry.options)
-when :sentencepiece then Backend::SentencePiece.new(**entry.options)
+when :sentencepiece then Backend::SentencePiece.new(**entry.options, default_model_file: File.expand_path("data/gemini_tokenizer.model", __dir__))
 when :approximate then Backend::Approximate.new(**entry.options)
 else
   raise BackendError, "Unknown backend: #{entry.backend.inspect}"
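The changed `when :sentencepiece` branch builds the default path with `File.expand_path("data/gemini_tokenizer.model", __dir__)` and merges it into the entry's options. The sketch below shows how those two pieces behave in isolation. `build_sentencepiece_options` and the example directory are hypothetical, with the directory standing in for `__dir__` of `registry.rb`.

```ruby
# Hypothetical helper illustrating the two Ruby features used in the hunk:
# File.expand_path with a base directory, and merging a keyword into a
# splatted options hash.
def build_sentencepiece_options(entry_options, registry_dir)
  # Resolves the relative path against registry_dir instead of the
  # current working directory.
  default_model = File.expand_path("data/gemini_tokenizer.model", registry_dir)

  # The explicit key is added after the splat, as in the diff.
  { **entry_options, default_model_file: default_model }
end

options = build_sentencepiece_options(
  { model_file_env: "GEMINI_TOKENIZER_MODEL_FILE" },
  "/gems/ruby_llm-tokenizer/lib/ruby_llm/tokenizer" # stand-in for __dir__
)
puts options[:default_model_file]
# On a Unix-like system this prints:
# /gems/ruby_llm-tokenizer/lib/ruby_llm/tokenizer/data/gemini_tokenizer.model
```

As written in the diff, the same default is passed for every `:sentencepiece` entry, so any such entry without its own model file or environment value would fall back to the bundled Gemini model.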