RubyGems - ruby_llm-tokenizer - Versions diffs - 0.1.0 → 0.1.1 - Mend

ruby_llm-tokenizer 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +4 -4
data/CHANGELOG.md +6 -0
data/README.md +32 -1
data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb +71 -0
data/lib/ruby_llm/tokenizer/models.yml +9 -0
data/lib/ruby_llm/tokenizer/registry.rb +2 -0
data/lib/ruby_llm/tokenizer/version.rb +1 -1
metadata +32 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 9c7e5f2afe7b48ff4ad5185afee11b931f499c52777ee2873e28ea232880aa72
-  data.tar.gz: 4609b223108101ac0cabd6f3fb5b9d757f95bf2e6501fe00b7d749670aa74987
+  metadata.gz: 8eb30ee6604d821956446b091836d48eea001f881dfac2c2e16249d7fe6fef03
+  data.tar.gz: e699a883413608d735fde1fff538f84c0686adf4f82ebb8658fdef371fd1faeb
 SHA512:
-  metadata.gz: 425438a6b8b8e1c2f53b79ae81bcc5b1a4eed6fcf8d9200ca0be5e17241f8d87d1e15c5cd37b2869c9fee4caa859df59e522b570c0a9f66ff241a93fa20821f2
-  data.tar.gz: a96c74f1f8619bbd8cbf1a923d19d1ea576c0cf0ec2751befa1982502a38765e1dc805128ad5957a088ebae935d96209bb6af995eee5f890b69fc20c004c6287
+  metadata.gz: 3cfb259665b740f0fdbe49a40aebd0e61c2aea5ac69efac57c1c60f6ea5e2531e235d58133c3640b117a6a493e3c364b20db2019eb808298e3cd7d797502d264
+  data.tar.gz: d6a5dd3e6b8f520947fc14a07f048eab8746ea28e4b4dafd86f7ff0496675285042cf86c9e76cb99505874b4d12ae04f1de72b46e4feee33fe5a0468f92195d3
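A .gem file is a tar archive whose members include metadata.gz, data.tar.gz, and checksums.yaml.gz, so the digests listed above can be rechecked locally. The following is a minimal sketch and not part of the gem: `checksum_mismatches` is a hypothetical helper, and it assumes the two archive members have already been read into memory as binary strings and that checksums.yaml has been parsed into a nested hash.

```ruby
require "digest"

# Maps the section names used in checksums.yaml to Ruby digest classes.
DIGEST_CLASSES = {
  "SHA256" => Digest::SHA256,
  "SHA512" => Digest::SHA512
}.freeze

# Hypothetical helper: returns [algorithm, member] pairs whose digest does
# not match the expected hex string. An empty array means everything matched.
#
# expected: { "SHA256" => { "metadata.gz" => "<hex>", ... }, "SHA512" => { ... } }
# contents: { "metadata.gz" => "<bytes>", "data.tar.gz" => "<bytes>" }
def checksum_mismatches(expected, contents)
  mismatches = []
  expected.each do |algorithm, entries|
    digest_class = DIGEST_CLASSES.fetch(algorithm)
    entries.each do |member, expected_hex|
      actual_hex = digest_class.hexdigest(contents.fetch(member))
      mismatches << [algorithm, member] unless actual_hex == expected_hex
    end
  end
  mismatches
end
```

For an intact package the helper should return an empty array; any pair it does return names the algorithm and archive member that disagreed.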

data/CHANGELOG.md CHANGED

@@ -1,5 +1,11 @@
 ## [Unreleased]
+## [0.1.1] - 2026-06-11
+- Bumped the gem version.
+- Added a post-install notice explaining that SentencePiece-backed models require
+  the native SentencePiece library and how to install it on macOS and Debian/Ubuntu.
 ## [0.1.0] - 2026-06-05
 - Initial release.

data/README.md CHANGED

@@ -4,7 +4,7 @@
 [![Gem Version](https://badge.fury.io/rb/ruby_llm-tokenizer.svg)](https://rubygems.org/gems/ruby_llm-tokenizer)
 Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
-A pure-Ruby facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby) and OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby) that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
+A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
 No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
 ## Installation
@@ -65,6 +65,7 @@ implementation may still retain the kept portion in memory.
 | Family                                                    | Backend         | Encoding / Repo                          |
 |-----------------------------------------------------------|-----------------|------------------------------------------|
 | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
+| `gemini`                                                  | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE`            |
 | `llama-3` / `meta-llama`                                  | `hugging_face`  | `meta-llama/Meta-Llama-3-8B-Instruct`    |
 | `mistral` / `mixtral`                                     | `hugging_face`  | `mistralai/Mistral-7B-Instruct-v0.2`     |
 | `deepseek`                                                | `hugging_face`  | `deepseek-ai/DeepSeek-V2`                |
@@ -74,6 +75,36 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
+If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, you can register it with the `sentencepiece` backend:
+```ruby
+RubyLLM::Tokenizer.register(
+  match: /^gemma-/,
+  backend: :sentencepiece,
+  model_file: "/path/to/tokenizer.model"
+)
+```
+This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. If you want to use it in your app, add `sentencepiece` to your bundle and make sure the SentencePiece native library is installed on your system.
+Common install commands from the upstream project:
+```bash
+# macOS
+brew install sentencepiece
+# Ubuntu / Debian
+sudo apt-get install sentencepiece libsentencepiece-dev
+```
+If you install the gem directly on Apple Silicon, upstream also notes that you may need to point RubyGems at Homebrew's prefix:
+```bash
+gem install sentencepiece -- --with-opt-dir=/opt/homebrew
+```
+Gemini models are wired to this backend by default and read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`.
 ## Claude / Anthropic
 Anthropic does not publish Claude's tokenizer. By default, `model: "claude-..."` raises `UnknownModelError`.
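The README change above says Gemini models read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`. Because a wrong or unset path would only surface when the backend is first built, an application might want to check the variable at boot. The sketch below is an assumption for illustration only: `sentencepiece_model_status` is a hypothetical helper, not part of ruby_llm-tokenizer, and it only inspects the filesystem without loading the model.

```ruby
# Hypothetical preflight helper; not part of ruby_llm-tokenizer.
# Reports whether the named environment variable points at a readable file.
# The env: keyword accepts any Hash-like object so it can be tested without
# touching the real process environment.
def sentencepiece_model_status(env_name = "GEMINI_TOKENIZER_MODEL_FILE", env: ENV)
  path = env[env_name].to_s
  return :unset if path.empty?                  # variable missing or blank
  return :missing unless File.file?(path)       # path does not name a regular file
  return :unreadable unless File.readable?(path)

  :ok
end
```

A boot-time check could log a warning for any status other than `:ok`, which keeps the failure close to the misconfiguration instead of the first tokenization call.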

data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb ADDED

@@ -0,0 +1,71 @@
+# frozen_string_literal: true
+require_relative "../backend"
+module RubyLLM
+  module Tokenizer
+    module Backend
+      class SentencePiece < Base
+        attr_reader :model_file
+        def initialize(model_file: nil, model_file_env: nil)
+          super()
+          @model_file = resolve_model_file(model_file, model_file_env)
+          processor_class = load_sentencepiece_processor_class
+          @tokenizer = processor_class.new(model_file: @model_file)
+        rescue StandardError => e
+          raise BackendError, "Failed to load SentencePiece model #{@model_file.inspect}: #{e.message}"
+        end
+        def encode(text)
+          @tokenizer.public_send(:encode_as_ids, text.to_s)
+        end
+        def decode(ids)
+          @tokenizer.public_send(:decode, Array(ids))
+        end
+        def analyze(text)
+          text = text.to_s
+          ids = @tokenizer.public_send(:encode_as_ids, text)
+          tokens = @tokenizer.public_send(:encode, text, out_type: "str")
+          Analysis.new(tokens: tokens, ids: ids, model: identifier)
+        end
+        def identifier
+          "sentencepiece:#{model_file}"
+        end
+        private
+        def resolve_model_file(model_file, model_file_env)
+          return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
+          if model_file_env && !model_file_env.to_s.empty?
+            env_value = ENV.fetch(model_file_env.to_s, nil)
+            return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
+          end
+          raise BackendError,
+                "SentencePiece backend requires :model_file or :model_file_env with a configured path"
+        end
+        def load_sentencepiece_processor_class
+          Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+        rescue NameError
+          begin
+            require "sentencepiece"
+            Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+          rescue LoadError => e
+            raise BackendError,
+                  "SentencePiece backend requires the sentencepiece gem and a compiled SentencePiece library: #{e.message}"
+          rescue NameError => e
+            raise BackendError,
+                  "SentencePiece backend requires SentencePieceProcessor to be available: #{e.message}"
+          end
+        end
+      end
+    end
+  end
+end
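The precedence implemented by `resolve_model_file` in the added file (an explicit `model_file` wins, then the environment variable named by `model_file_env`, otherwise an error) can be sketched as standalone Ruby. This is an illustration rather than the gem's code: `resolve_path` is a hypothetical name, it raises `ArgumentError` instead of the gem's `BackendError`, and it takes an injectable `env:` so the behavior can be exercised without changing `ENV`.

```ruby
# Hypothetical standalone mirror of the precedence in resolve_model_file.
# Order: explicit path, then the named environment variable, then failure.
def resolve_path(model_file: nil, model_file_env: nil, env: ENV)
  explicit = model_file.to_s
  return explicit unless explicit.empty?

  unless model_file_env.to_s.empty?
    from_env = env[model_file_env.to_s].to_s
    return from_env unless from_env.empty?
  end

  # The gem raises BackendError here; ArgumentError keeps this sketch self-contained.
  raise ArgumentError, "need :model_file or :model_file_env with a configured path"
end
```

One observation on the diff itself: `initialize` rescues `StandardError` around the whole setup, so if `BackendError` descends from `StandardError` (an assumption, since errors.rb is not part of this diff), a resolution failure would be re-raised with the "Failed to load SentencePiece model nil" prefix in front of the original message.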

data/lib/ruby_llm/tokenizer/models.yml CHANGED

@@ -22,6 +22,15 @@
 - match: "/^(gpt-|gpt[0-9]|chatgpt-|o[1-9]|text-|code-|davinci\\b|curie\\b|babbage\\b|ada\\b|ft:|codex-)/"
   backend: tiktoken_auto
+# --- Google Gemini: SentencePiece backend ------------------------------------
+# Gemini models use SentencePiece tokenization. Set GEMINI_TOKENIZER_MODEL_FILE
+# to point at the local tokenizer.model you want to use, or override the match
+# at runtime with RubyLLM::Tokenizer.register(...).
+- match: "/^gemini/i"
+  backend: sentencepiece
+  model_file_env: GEMINI_TOKENIZER_MODEL_FILE
 # --- Open weights: Hugging Face backend (tokenizer.json fetched lazily) -------
 # Some repos below are gated and require HF_TOKEN. Override with
 # RubyLLM::Tokenizer.register(...) if you want a different mirror.
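The `match` values in models.yml are strings written like Ruby regexp literals, such as `"/^gemini/i"`. How the registry turns them into `Regexp` objects is not shown in this diff. One plausible approach, assumed here purely for illustration, is to split the string into a pattern body and trailing flags; `parse_pattern` below is a hypothetical helper, not the gem's implementation.

```ruby
# Hypothetical helper: converts a "/body/flags" string into a Regexp.
# Only the i, m, and x flags are recognized in this sketch.
def parse_pattern(source)
  literal = %r{\A/(.*)/([imx]*)\z}m.match(source)
  raise ArgumentError, "not a regexp literal: #{source.inspect}" unless literal

  body, flags = literal[1], literal[2]
  options = 0
  options |= Regexp::IGNORECASE if flags.include?("i")
  options |= Regexp::MULTILINE if flags.include?("m")
  options |= Regexp::EXTENDED if flags.include?("x")
  Regexp.new(body, options)
end

gemini = parse_pattern("/^gemini/i")
gemini.match?("Gemini-1.5-pro") # case-insensitive, so this matches
gemini.match?("gpt-4o")         # does not start with "gemini"
```

Under this reading, the `i` flag on the Gemini entry means identifiers match regardless of case, while the anchored `^` keeps the entry from catching names that merely contain "gemini".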

data/lib/ruby_llm/tokenizer/registry.rb CHANGED

@@ -4,6 +4,7 @@ require "yaml"
 require_relative "errors"
 require_relative "backend/tiktoken"
 require_relative "backend/hugging_face"
+require_relative "backend/sentencepiece"
 require_relative "backend/approximate"
 module RubyLLM
@@ -107,6 +108,7 @@ module RubyLLM
         when :tiktoken      then Backend::Tiktoken.new(**entry.options)
         when :tiktoken_auto then build_tiktoken_auto(model)
         when :hugging_face  then Backend::HuggingFace.new(**entry.options)
+        when :sentencepiece then Backend::SentencePiece.new(**entry.options)
         when :approximate   then Backend::Approximate.new(**entry.options)
         else
           raise BackendError, "Unknown backend: #{entry.backend.inspect}"

data/lib/ruby_llm/tokenizer/version.rb CHANGED

@@ -2,6 +2,6 @@
 module RubyLLM
   module Tokenizer
-    VERSION = "0.1.0"
+    VERSION = "0.1.1"
   end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-tokenizer
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Sal Scotto
@@ -37,12 +37,27 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.5'
+- !ruby/object:Gem::Dependency
+  name: sentencepiece
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
 description: |
-  Pure-Ruby facade over Hugging Face `tokenizers` and OpenAI `tiktoken_ruby`
-  that maps ruby_llm model identifiers (gpt-4o, llama-3, mistral, ...) to the
-  correct tokenizer and exposes a small API for counting, analyzing, and
-  truncating text against a model's context window. Includes an opt-in
-  approximation backend for models with no published tokenizer (Claude).
+  Pure-Ruby facade over Hugging Face `tokenizers`, OpenAI `tiktoken_ruby`, and
+  SentencePiece bindings that maps ruby_llm model identifiers (gpt-4o,
+  llama-3, mistral, ...) to the correct tokenizer and exposes a small API for
+  counting, analyzing, and truncating text against a model's context window.
+  Includes an opt-in approximation backend for models with no published
+  tokenizer (Claude).
 email:
 - sal.scotto@gmail.com
 executables: []
@@ -60,6 +75,7 @@ files:
 - lib/ruby_llm/tokenizer/backend.rb
 - lib/ruby_llm/tokenizer/backend/approximate.rb
 - lib/ruby_llm/tokenizer/backend/hugging_face.rb
+- lib/ruby_llm/tokenizer/backend/sentencepiece.rb
 - lib/ruby_llm/tokenizer/backend/tiktoken.rb
 - lib/ruby_llm/tokenizer/configuration.rb
 - lib/ruby_llm/tokenizer/errors.rb
@@ -75,6 +91,16 @@ metadata:
   source_code_uri: https://github.com/washu/ruby_llm-tokenizer/tree/main
   changelog_uri: https://github.com/washu/ruby_llm-tokenizer/blob/main/CHANGELOG.md
   rubygems_mfa_required: 'true'
+post_install_message: |
+  ruby_llm-tokenizer includes a SentencePiece backend. If you use Gemini or any
+  other SentencePiece-based model, install the native SentencePiece library too:
+    macOS:       brew install sentencepiece
+    Ubuntu/Debian: sudo apt-get install sentencepiece libsentencepiece-dev
+  On Apple Silicon, direct gem installs may need:
+    gem install sentencepiece -- --with-opt-dir=/opt/homebrew
 rdoc_options: []
 require_paths:
 - lib