ruby_llm-tokenizer 0.1.1 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 8eb30ee6604d821956446b091836d48eea001f881dfac2c2e16249d7fe6fef03
-   data.tar.gz: e699a883413608d735fde1fff538f84c0686adf4f82ebb8658fdef371fd1faeb
+   metadata.gz: 7ea1085db79537e78ef113cf8bea5f648cf30145c3f716959ffb909a011bba6c
+   data.tar.gz: 481525307059a6103dee95ce4bd8d87b49ecd8eb51a776da74597f894444ce5e
  SHA512:
-   metadata.gz: 3cfb259665b740f0fdbe49a40aebd0e61c2aea5ac69efac57c1c60f6ea5e2531e235d58133c3640b117a6a493e3c364b20db2019eb808298e3cd7d797502d264
-   data.tar.gz: d6a5dd3e6b8f520947fc14a07f048eab8746ea28e4b4dafd86f7ff0496675285042cf86c9e76cb99505874b4d12ae04f1de72b46e4feee33fe5a0468f92195d3
+   metadata.gz: 7be46d79826054f97f494c8d9b3fa8a2e12cd5e1433dac64c86372c67846d84ee8cfbf8d4e1e381120f802670abf1aba38a0977c6e02b361253f6bb5ffa2e233
+   data.tar.gz: 760d2875b2572703749db37f018af27f8167b3d507b9ff15899a82538378a4cbec1c12ab5f99d73793dc3c4e86afb9eb4dc9cc0f23c79b038da971a03c22147e
data/CHANGELOG.md CHANGED
@@ -1,4 +1,9 @@
- ## [Unreleased]
+ ## [0.1.2] - 2026-06-13
+
+ - Bundled a default SentencePiece model for Gemini so it works out of the box,
+   while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+ - Tightened the README wording around SentencePiece and Gemini usage.
+ - Updated the gem version to prepare for the next RubyGems release.
 
  ## [0.1.1] - 2026-06-11
 
@@ -27,4 +32,13 @@
  - Hugging Face tokenizers fetched from the Hub are persisted under `cache_dir` for
    later offline reuse.
 
+ ## [Unreleased]
+
+ ## [Unreleased]
+
+ - Bundled a default SentencePiece model for Gemini so it works out of the box,
+   while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+ - Tightened the README wording around SentencePiece and Gemini usage.
+ - Updated the gem version to prepare for the next RubyGems release.
+
 
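The changelog describes the bundled model as a Gemini default. In the registry hunk later in this diff, however, the same path is passed as `default_model_file` to every `:sentencepiece` entry. The sketch below illustrates what that keyword merge implies for a custom entry registered without a usable path. `Entry`, `sentencepiece_kwargs`, the option hashes, and the `/gem/...` directory are hypothetical stand-ins; the real contents of `entry.options` are not shown in the diff.

```ruby
# Hypothetical stand-in for a registry entry; the real class is not shown in the diff.
Entry = Struct.new(:backend, :options)

# Mirrors the shape of the new constructor call: entry options first,
# then the bundled default. `base_dir` plays the role of `__dir__`.
def sentencepiece_kwargs(entry, base_dir)
  {
    **entry.options,
    default_model_file: File.expand_path("data/gemini_tokenizer.model", base_dir)
  }
end

base_dir = "/gem/lib/ruby_llm/tokenizer" # placeholder install location (assumption)
gemini = Entry.new(:sentencepiece, { model_file_env: "GEMINI_TOKENIZER_MODEL_FILE" })
custom = Entry.new(:sentencepiece, {}) # custom model registered without a path

# Both entries end up carrying the Gemini model as their default.
p sentencepiece_kwargs(gemini, base_dir)
p sentencepiece_kwargs(custom, base_dir)
```

Under these assumptions, a custom SentencePiece model registered without `model_file` would fall back to the Gemini model rather than raise `BackendError`. Passing `model_file:` explicitly avoids that fallback.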
data/README.md CHANGED
@@ -4,7 +4,7 @@
  [![Gem Version](https://badge.fury.io/rb/ruby_llm-tokenizer.svg)](https://rubygems.org/gems/ruby_llm-tokenizer)
 
  Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
- A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
+ A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the right tokenizer for counting, analyzing, and truncating text locally.
  No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
 
  ## Installation
@@ -65,7 +65,7 @@ implementation may still retain the kept portion in memory.
  | Family | Backend | Encoding / Repo |
  |-----------------------------------------------------------|-----------------|------------------------------------------|
  | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
- | `gemini` | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE` |
+ | `gemini` | `sentencepiece` | bundled `.model`, override with `GEMINI_TOKENIZER_MODEL_FILE` |
  | `llama-3` / `meta-llama` | `hugging_face` | `meta-llama/Meta-Llama-3-8B-Instruct` |
  | `mistral` / `mixtral` | `hugging_face` | `mistralai/Mistral-7B-Instruct-v0.2` |
  | `deepseek` | `hugging_face` | `deepseek-ai/DeepSeek-V2` |
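
The new `gemini` row implies a fallback chain, which the `resolve_model_file` hunk later in this diff implements: an explicit `model_file` first, then the environment variable named by `model_file_env`, then the new `default_model_file`. A minimal standalone sketch of that ordering follows. The method name `resolve_model_file_sketch`, the injectable `env:` parameter, the `ArgumentError`, and the sample paths are assumptions for illustration. The `ENV` lookup sits on a line the diff does not show, so its exact form is assumed as well.

```ruby
# Illustrative sketch only; not the gem's code.
def resolve_model_file_sketch(model_file: nil, model_file_env: nil,
                              default_model_file: nil, env: ENV)
  blank = ->(value) { value.nil? || value.to_s.empty? }

  # 1. An explicit path always wins.
  return model_file.to_s unless blank.call(model_file)

  # 2. Otherwise consult the environment variable named by model_file_env.
  unless blank.call(model_file_env)
    env_value = env[model_file_env.to_s]
    return env_value.to_s unless blank.call(env_value)
  end

  # 3. Otherwise fall back to the bundled default (new in 0.1.2).
  return default_model_file.to_s unless blank.call(default_model_file)

  # The gem raises BackendError here; ArgumentError keeps the sketch self-contained.
  raise ArgumentError, "no SentencePiece model file configured"
end

bundled = "/gem/data/gemini_tokenizer.model" # placeholder path (assumption)
env_set = { "GEMINI_TOKENIZER_MODEL_FILE" => "/opt/models/custom.model" }

with_override = resolve_model_file_sketch(
  model_file_env: "GEMINI_TOKENIZER_MODEL_FILE", default_model_file: bundled, env: env_set
)
without_override = resolve_model_file_sketch(
  model_file_env: "GEMINI_TOKENIZER_MODEL_FILE", default_model_file: bundled, env: {}
)
puts with_override    # /opt/models/custom.model
puts without_override # /gem/data/gemini_tokenizer.model
```

An unset or empty environment variable falls through to the bundled file, which matches the "works out of the box" behaviour the changelog describes.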
@@ -75,7 +75,7 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 
  OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
 
- If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, you can register it with the `sentencepiece` backend:
+ If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, register it with the `sentencepiece` backend:
 
  ```ruby
  RubyLLM::Tokenizer.register(
@@ -85,7 +85,7 @@ RubyLLM::Tokenizer.register(
  )
  ```
 
- This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. If you want to use it in your app, add `sentencepiece` to your bundle and make sure the SentencePiece native library is installed on your system.
+ This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. Add `sentencepiece` to your bundle and install the native SentencePiece library on your system.
 
  Common install commands from the upstream project:
 
@@ -103,7 +103,7 @@ If you install the gem directly on Apple Silicon, upstream also notes that you m
  gem install sentencepiece -- --with-opt-dir=/opt/homebrew
  ```
 
- Gemini models are wired to this backend by default and read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`.
+ Gemini uses the bundled `lib/ruby_llm/tokenizer/data/gemini_tokenizer.model` by default; set `GEMINI_TOKENIZER_MODEL_FILE` to override it.
 
  ## Claude / Anthropic
 
@@ -8,9 +8,9 @@ module RubyLLM
  class SentencePiece < Base
    attr_reader :model_file
 
-   def initialize(model_file: nil, model_file_env: nil)
+   def initialize(model_file: nil, model_file_env: nil, default_model_file: nil)
      super()
-     @model_file = resolve_model_file(model_file, model_file_env)
+     @model_file = resolve_model_file(model_file, model_file_env, default_model_file)
      processor_class = load_sentencepiece_processor_class
      @tokenizer = processor_class.new(model_file: @model_file)
    rescue StandardError => e
@@ -38,7 +38,7 @@ module RubyLLM
 
    private
 
-   def resolve_model_file(model_file, model_file_env)
+   def resolve_model_file(model_file, model_file_env, default_model_file)
      return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
 
      if model_file_env && !model_file_env.to_s.empty?
@@ -46,8 +46,10 @@ module RubyLLM
        return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
      end
 
+     return default_model_file.to_s unless default_model_file.nil? || default_model_file.to_s.empty?
+
      raise BackendError,
-           "SentencePiece backend requires :model_file or :model_file_env with a configured path"
+           "SentencePiece backend requires :model_file, :model_file_env, or :default_model_file with a configured path"
    end
 
    def load_sentencepiece_processor_class
@@ -108,7 +108,7 @@ module RubyLLM
  when :tiktoken then Backend::Tiktoken.new(**entry.options)
  when :tiktoken_auto then build_tiktoken_auto(model)
  when :hugging_face then Backend::HuggingFace.new(**entry.options)
- when :sentencepiece then Backend::SentencePiece.new(**entry.options)
+ when :sentencepiece then Backend::SentencePiece.new(**entry.options, default_model_file: File.expand_path("data/gemini_tokenizer.model", __dir__))
  when :approximate then Backend::Approximate.new(**entry.options)
  else
    raise BackendError, "Unknown backend: #{entry.backend.inspect}"
@@ -2,6 +2,6 @@
 
  module RubyLLM
    module Tokenizer
-     VERSION = "0.1.1"
+     VERSION = "0.1.2"
    end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-tokenizer
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.1.2
  platform: ruby
  authors:
  - Sal Scotto