RubyGems - ruby_llm-tokenizer - Versions diffs - 0.1.1 → 0.1.2 - Mend

ruby_llm-tokenizer 0.1.1 → 0.1.2


Files changed (7)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -1
data/README.md +5 -5
data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb +6 -4
data/lib/ruby_llm/tokenizer/registry.rb +1 -1
data/lib/ruby_llm/tokenizer/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 8eb30ee6604d821956446b091836d48eea001f881dfac2c2e16249d7fe6fef03
-  data.tar.gz: e699a883413608d735fde1fff538f84c0686adf4f82ebb8658fdef371fd1faeb
+  metadata.gz: 7ea1085db79537e78ef113cf8bea5f648cf30145c3f716959ffb909a011bba6c
+  data.tar.gz: 481525307059a6103dee95ce4bd8d87b49ecd8eb51a776da74597f894444ce5e
 SHA512:
-  metadata.gz: 3cfb259665b740f0fdbe49a40aebd0e61c2aea5ac69efac57c1c60f6ea5e2531e235d58133c3640b117a6a493e3c364b20db2019eb808298e3cd7d797502d264
-  data.tar.gz: d6a5dd3e6b8f520947fc14a07f048eab8746ea28e4b4dafd86f7ff0496675285042cf86c9e76cb99505874b4d12ae04f1de72b46e4feee33fe5a0468f92195d3
+  metadata.gz: 7be46d79826054f97f494c8d9b3fa8a2e12cd5e1433dac64c86372c67846d84ee8cfbf8d4e1e381120f802670abf1aba38a0977c6e02b361253f6bb5ffa2e233
+  data.tar.gz: 760d2875b2572703749db37f018af27f8167b3d507b9ff15899a82538378a4cbec1c12ab5f99d73793dc3c4e86afb9eb4dc9cc0f23c79b038da971a03c22147e

data/CHANGELOG.md CHANGED

@@ -1,4 +1,9 @@
-## [Unreleased]
+## [0.1.2] - 2026-06-13
+- Bundled a default SentencePiece model for Gemini so it works out of the box,
+  while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+- Tightened the README wording around SentencePiece and Gemini usage.
+- Updated the gem version to prepare for the next RubyGems release.
 ## [0.1.1] - 2026-06-11
@@ -27,4 +32,13 @@
 - Hugging Face tokenizers fetched from the Hub are persisted under `cache_dir` for
   later offline reuse.
+## [Unreleased]
+## [Unreleased]
+- Bundled a default SentencePiece model for Gemini so it works out of the box,
+  while still allowing `GEMINI_TOKENIZER_MODEL_FILE` overrides.
+- Tightened the README wording around SentencePiece and Gemini usage.
+- Updated the gem version to prepare for the next RubyGems release.

data/README.md CHANGED

@@ -4,7 +4,7 @@
 [![Gem Version](https://badge.fury.io/rb/ruby_llm-tokenizer.svg)](https://rubygems.org/gems/ruby_llm-tokenizer)
 Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
-A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
+A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the right tokenizer for counting, analyzing, and truncating text locally.
 No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
 ## Installation
@@ -65,7 +65,7 @@ implementation may still retain the kept portion in memory.
 | Family                                                    | Backend         | Encoding / Repo                          |
 |-----------------------------------------------------------|-----------------|------------------------------------------|
 | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
-| `gemini`                                                  | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE`            |
+| `gemini`                                                  | `sentencepiece` | bundled `.model`, override with `GEMINI_TOKENIZER_MODEL_FILE` |
 | `llama-3` / `meta-llama`                                  | `hugging_face`  | `meta-llama/Meta-Llama-3-8B-Instruct`    |
 | `mistral` / `mixtral`                                     | `hugging_face`  | `mistralai/Mistral-7B-Instruct-v0.2`     |
 | `deepseek`                                                | `hugging_face`  | `deepseek-ai/DeepSeek-V2`                |
@@ -75,7 +75,7 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
-If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, you can register it with the `sentencepiece` backend:
+If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, register it with the `sentencepiece` backend:
 ```ruby
 RubyLLM::Tokenizer.register(
@@ -85,7 +85,7 @@ RubyLLM::Tokenizer.register(
 )
 ```
-This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. If you want to use it in your app, add `sentencepiece` to your bundle and make sure the SentencePiece native library is installed on your system.
+This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. Add `sentencepiece` to your bundle and install the native SentencePiece library on your system.
 Common install commands from the upstream project:
@@ -103,7 +103,7 @@ If you install the gem directly on Apple Silicon, upstream also notes that you m
 gem install sentencepiece -- --with-opt-dir=/opt/homebrew
 ```
-Gemini models are wired to this backend by default and read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`.
+Gemini uses the bundled `lib/ruby_llm/tokenizer/data/gemini_tokenizer.model` by default; set `GEMINI_TOKENIZER_MODEL_FILE` to override it.
 ## Claude / Anthropic

data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb CHANGED

@@ -8,9 +8,9 @@ module RubyLLM
       class SentencePiece < Base
         attr_reader :model_file
-        def initialize(model_file: nil, model_file_env: nil)
+        def initialize(model_file: nil, model_file_env: nil, default_model_file: nil)
           super()
-          @model_file = resolve_model_file(model_file, model_file_env)
+          @model_file = resolve_model_file(model_file, model_file_env, default_model_file)
           processor_class = load_sentencepiece_processor_class
           @tokenizer = processor_class.new(model_file: @model_file)
         rescue StandardError => e
@@ -38,7 +38,7 @@ module RubyLLM
         private
-        def resolve_model_file(model_file, model_file_env)
+        def resolve_model_file(model_file, model_file_env, default_model_file)
           return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
           if model_file_env && !model_file_env.to_s.empty?
@@ -46,8 +46,10 @@ module RubyLLM
             return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
           end
+          return default_model_file.to_s unless default_model_file.nil? || default_model_file.to_s.empty?
           raise BackendError,
-                "SentencePiece backend requires :model_file or :model_file_env with a configured path"
+                "SentencePiece backend requires :model_file, :model_file_env, or :default_model_file with a configured path"
         end
         def load_sentencepiece_processor_class
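
Read together, the hunks above appear to give `resolve_model_file` a three-step precedence: an explicit `model_file`, then the environment variable named by `model_file_env`, then the new `default_model_file`. The following is a minimal standalone sketch of that order, not the gem's code. The names `pick_model_file` and `given?` are hypothetical, the `ENV[...]` lookup is an assumption (the line that assigns `env_value` falls outside the displayed hunks), and `ArgumentError` stands in for the gem's `BackendError`.

```ruby
# Standalone sketch of the lookup order; not the gem's actual method.
# `given?` and `pick_model_file` are hypothetical names used for illustration.
def given?(value)
  !value.nil? && !value.to_s.empty?
end

def pick_model_file(model_file: nil, model_file_env: nil, default_model_file: nil, env: ENV)
  # 1. An explicit path always wins.
  return model_file.to_s if given?(model_file)

  # 2. Otherwise consult the named environment variable.
  #    Assumption: the gem reads it via ENV[...]; that line is not in the diff.
  if given?(model_file_env)
    env_value = env[model_file_env.to_s]
    return env_value.to_s if given?(env_value)
  end

  # 3. Fall back to the default path introduced in 0.1.2.
  return default_model_file.to_s if given?(default_model_file)

  # ArgumentError stands in for the gem's BackendError.
  raise ArgumentError, "no SentencePiece model file configured"
end
```

Passing a plain Hash as `env:` makes the order easy to check without touching the real environment, for example `pick_model_file(model_file_env: "GEMINI_TOKENIZER_MODEL_FILE", default_model_file: "bundled.model", env: {})` falls through to the default.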

data/lib/ruby_llm/tokenizer/registry.rb CHANGED

@@ -108,7 +108,7 @@ module RubyLLM
         when :tiktoken      then Backend::Tiktoken.new(**entry.options)
         when :tiktoken_auto then build_tiktoken_auto(model)
         when :hugging_face  then Backend::HuggingFace.new(**entry.options)
-        when :sentencepiece then Backend::SentencePiece.new(**entry.options)
+        when :sentencepiece then Backend::SentencePiece.new(**entry.options, default_model_file: File.expand_path("data/gemini_tokenizer.model", __dir__))
         when :approximate   then Backend::Approximate.new(**entry.options)
         else
           raise BackendError, "Unknown backend: #{entry.backend.inspect}"
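
One thing a reader may want to check in this hunk: the Gemini model path is passed as `default_model_file` for every `:sentencepiece` entry, not only for Gemini, and because the keyword follows `**entry.options` it is merged last. The sketch below illustrates only that Ruby merge behaviour. `sentencepiece_kwargs` and `GEMINI_DEFAULT` are hypothetical names, and whether registry entries ever carry their own `default_model_file` is not shown in the diff.

```ruby
# Illustration of how a trailing keyword interacts with a double-splatted hash.
# Hypothetical names; the path is the one mentioned in the README hunk.
GEMINI_DEFAULT = File.join("lib", "ruby_llm", "tokenizer", "data", "gemini_tokenizer.model")

def sentencepiece_kwargs(entry_options)
  # Same shape as the call in the hunk: splat first, explicit keyword last,
  # so the explicit keyword replaces any same-named key from the splat.
  { **entry_options, default_model_file: GEMINI_DEFAULT }
end

# An entry that sets its own model_file keeps it and gains the fallback.
custom = sentencepiece_kwargs(model_file: "my.model")

# An entry with no options ends up with the Gemini model as its fallback.
bare = sentencepiece_kwargs({})
```

If that reading is right, a custom SentencePiece registration that supplies neither `model_file` nor a populated `model_file_env` would fall back to the Gemini model instead of raising.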

data/lib/ruby_llm/tokenizer/version.rb CHANGED

@@ -2,6 +2,6 @@
 module RubyLLM
   module Tokenizer
-    VERSION = "0.1.1"
+    VERSION = "0.1.2"
   end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-tokenizer
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.2
 platform: ruby
 authors:
 - Sal Scotto