ruby_llm-tokenizer 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 9c7e5f2afe7b48ff4ad5185afee11b931f499c52777ee2873e28ea232880aa72
-   data.tar.gz: 4609b223108101ac0cabd6f3fb5b9d757f95bf2e6501fe00b7d749670aa74987
+   metadata.gz: 8eb30ee6604d821956446b091836d48eea001f881dfac2c2e16249d7fe6fef03
+   data.tar.gz: e699a883413608d735fde1fff538f84c0686adf4f82ebb8658fdef371fd1faeb
  SHA512:
-   metadata.gz: 425438a6b8b8e1c2f53b79ae81bcc5b1a4eed6fcf8d9200ca0be5e17241f8d87d1e15c5cd37b2869c9fee4caa859df59e522b570c0a9f66ff241a93fa20821f2
-   data.tar.gz: a96c74f1f8619bbd8cbf1a923d19d1ea576c0cf0ec2751befa1982502a38765e1dc805128ad5957a088ebae935d96209bb6af995eee5f890b69fc20c004c6287
+   metadata.gz: 3cfb259665b740f0fdbe49a40aebd0e61c2aea5ac69efac57c1c60f6ea5e2531e235d58133c3640b117a6a493e3c364b20db2019eb808298e3cd7d797502d264
+   data.tar.gz: d6a5dd3e6b8f520947fc14a07f048eab8746ea28e4b4dafd86f7ff0496675285042cf86c9e76cb99505874b4d12ae04f1de72b46e4feee33fe5a0468f92195d3
data/CHANGELOG.md CHANGED
@@ -1,5 +1,11 @@
  ## [Unreleased]

+ ## [0.1.1] - 2026-06-11
+
+ - Bumped the gem version.
+ - Added a post-install notice explaining that SentencePiece-backed models require
+   the native SentencePiece library and how to install it on macOS and Debian/Ubuntu.
+
  ## [0.1.0] - 2026-06-05

  - Initial release.
data/README.md CHANGED
@@ -4,7 +4,7 @@
  [![Gem Version](https://badge.fury.io/rb/ruby_llm-tokenizer.svg)](https://rubygems.org/gems/ruby_llm-tokenizer)

  Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
- A pure-Ruby facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby) and OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby) that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
+ A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
  No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.

  ## Installation
@@ -65,6 +65,7 @@ implementation may still retain the kept portion in memory.
  | Family | Backend | Encoding / Repo |
  |-----------------------------------------------------------|-----------------|------------------------------------------|
  | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
+ | `gemini` | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE` |
  | `llama-3` / `meta-llama` | `hugging_face` | `meta-llama/Meta-Llama-3-8B-Instruct` |
  | `mistral` / `mixtral` | `hugging_face` | `mistralai/Mistral-7B-Instruct-v0.2` |
  | `deepseek` | `hugging_face` | `deepseek-ai/DeepSeek-V2` |
@@ -74,6 +75,36 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 
  OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
 
+ If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, you can register it with the `sentencepiece` backend:
+
+ ```ruby
+ RubyLLM::Tokenizer.register(
+   match: /^gemma-/,
+   backend: :sentencepiece,
+   model_file: "/path/to/tokenizer.model"
+ )
+ ```
+
+ This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. If you want to use it in your app, add `sentencepiece` to your bundle and make sure the SentencePiece native library is installed on your system.
+
+ Common install commands from the upstream project:
+
+ ```bash
+ # macOS
+ brew install sentencepiece
+
+ # Ubuntu / Debian
+ sudo apt-get install sentencepiece libsentencepiece-dev
+ ```
+
+ If you install the gem directly on Apple Silicon, upstream also notes that you may need to point RubyGems at Homebrew's prefix:
+
+ ```bash
+ gem install sentencepiece -- --with-opt-dir=/opt/homebrew
+ ```
+
+ Gemini models are wired to this backend by default and read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`.
+
  ## Claude / Anthropic

  Anthropic does not publish Claude's tokenizer. By default, `model: "claude-..."` raises `UnknownModelError`.
@@ -0,0 +1,71 @@
+ # frozen_string_literal: true
+
+ require_relative "../backend"
+
+ module RubyLLM
+   module Tokenizer
+     module Backend
+       class SentencePiece < Base
+         attr_reader :model_file
+
+         def initialize(model_file: nil, model_file_env: nil)
+           super()
+           @model_file = resolve_model_file(model_file, model_file_env)
+           processor_class = load_sentencepiece_processor_class
+           @tokenizer = processor_class.new(model_file: @model_file)
+         rescue StandardError => e
+           raise BackendError, "Failed to load SentencePiece model #{@model_file.inspect}: #{e.message}"
+         end
+
+         def encode(text)
+           @tokenizer.public_send(:encode_as_ids, text.to_s)
+         end
+
+         def decode(ids)
+           @tokenizer.public_send(:decode, Array(ids))
+         end
+
+         def analyze(text)
+           text = text.to_s
+           ids = @tokenizer.public_send(:encode_as_ids, text)
+           tokens = @tokenizer.public_send(:encode, text, out_type: "str")
+           Analysis.new(tokens: tokens, ids: ids, model: identifier)
+         end
+
+         def identifier
+           "sentencepiece:#{model_file}"
+         end
+
+         private
+
+         def resolve_model_file(model_file, model_file_env)
+           return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
+
+           if model_file_env && !model_file_env.to_s.empty?
+             env_value = ENV.fetch(model_file_env.to_s, nil)
+             return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
+           end
+
+           raise BackendError,
+                 "SentencePiece backend requires :model_file or :model_file_env with a configured path"
+         end
+
+         def load_sentencepiece_processor_class
+           Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+         rescue NameError
+           begin
+             require "sentencepiece"
+             Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+           rescue LoadError => e
+             raise BackendError,
+                   "SentencePiece backend requires the sentencepiece gem and a compiled SentencePiece library: #{e.message}"
+           rescue NameError => e
+             raise BackendError,
+                   "SentencePiece backend requires SentencePieceProcessor to be available: #{e.message}"
+           end
+         end
+       end
+     end
+   end
+ end
+
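
Note on the hunk above: `resolve_model_file` prefers an explicit `model_file`, then falls back to the environment variable named by `model_file_env`, and raises if neither yields a non-empty path. A minimal standalone sketch of that precedence follows. The name `resolve_path`, the injectable `env:` parameter, and the use of `ArgumentError` are illustrative assumptions and are not part of the gem.

```ruby
# Standalone sketch of the precedence used by resolve_model_file in the diff.
# `resolve_path` and `env:` are hypothetical names, introduced for illustration.
def resolve_path(model_file: nil, model_file_env: nil, env: ENV)
  # 1. An explicit, non-empty path always wins.
  explicit = model_file.to_s
  return explicit unless explicit.empty?

  # 2. Otherwise look up the named environment variable, if one was given.
  unless model_file_env.to_s.empty?
    from_env = env[model_file_env.to_s].to_s
    return from_env unless from_env.empty?
  end

  # 3. Nothing usable was configured.
  raise ArgumentError, "need :model_file or :model_file_env with a configured path"
end
```

Passing a plain `Hash` as `env:` makes the fallback easy to exercise without touching the real process environment.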
@@ -22,6 +22,15 @@
  - match: "/^(gpt-|gpt[0-9]|chatgpt-|o[1-9]|text-|code-|davinci\\b|curie\\b|babbage\\b|ada\\b|ft:|codex-)/"
    backend: tiktoken_auto

+ # --- Google Gemini: SentencePiece backend ------------------------------------
+ # Gemini models use SentencePiece tokenization. Set GEMINI_TOKENIZER_MODEL_FILE
+ # to point at the local tokenizer.model you want to use, or override the match
+ # at runtime with RubyLLM::Tokenizer.register(...).
+
+ - match: "/^gemini/i"
+   backend: sentencepiece
+   model_file_env: GEMINI_TOKENIZER_MODEL_FILE
+
  # --- Open weights: Hugging Face backend (tokenizer.json fetched lazily) -------
  # Some repos below are gated and require HF_TOKEN. Override with
  # RubyLLM::Tokenizer.register(...) if you want a different mirror.
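
Note on the hunk above: the registry entry stores its pattern as a string in `/source/flags` form (`"/^gemini/i"`). How the gem converts that string into a `Regexp` is not shown in this diff. The sketch below is one plausible approach under that assumption; `pattern_from` is a hypothetical helper, not the gem's API.

```ruby
# Hypothetical helper: turn a "/source/flags" string into a Regexp.
# This is an assumption about how such registry entries could be parsed.
def pattern_from(str)
  match = %r{\A/(?<source>.*)/(?<flags>[imx]*)\z}m.match(str)
  raise ArgumentError, "not a /regex/ literal: #{str.inspect}" unless match

  # Translate the trailing flag letters into Regexp option bits.
  options = 0
  options |= Regexp::IGNORECASE if match[:flags].include?("i")
  options |= Regexp::MULTILINE if match[:flags].include?("m")
  options |= Regexp::EXTENDED if match[:flags].include?("x")
  Regexp.new(match[:source], options)
end

gemini = pattern_from("/^gemini/i")
puts gemini.match?("Gemini-1.5-pro")
puts gemini.match?("gpt-4o")
```

Because of the `i` flag, the entry would match model names regardless of case, while the `^` anchor restricts it to names that begin with `gemini`.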
@@ -4,6 +4,7 @@ require "yaml"
  require_relative "errors"
  require_relative "backend/tiktoken"
  require_relative "backend/hugging_face"
+ require_relative "backend/sentencepiece"
  require_relative "backend/approximate"

  module RubyLLM
@@ -107,6 +108,7 @@ module RubyLLM
  when :tiktoken then Backend::Tiktoken.new(**entry.options)
  when :tiktoken_auto then build_tiktoken_auto(model)
  when :hugging_face then Backend::HuggingFace.new(**entry.options)
+ when :sentencepiece then Backend::SentencePiece.new(**entry.options)
  when :approximate then Backend::Approximate.new(**entry.options)
  else
    raise BackendError, "Unknown backend: #{entry.backend.inspect}"
@@ -2,6 +2,6 @@
 
  module RubyLLM
    module Tokenizer
-     VERSION = "0.1.0"
+     VERSION = "0.1.1"
    end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-tokenizer
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  platform: ruby
  authors:
  - Sal Scotto
@@ -37,12 +37,27 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '0.5'
+ - !ruby/object:Gem::Dependency
+   name: sentencepiece
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.2'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.2'
  description: |
-   Pure-Ruby facade over Hugging Face `tokenizers` and OpenAI `tiktoken_ruby`
-   that maps ruby_llm model identifiers (gpt-4o, llama-3, mistral, ...) to the
-   correct tokenizer and exposes a small API for counting, analyzing, and
-   truncating text against a model's context window. Includes an opt-in
-   approximation backend for models with no published tokenizer (Claude).
+   Pure-Ruby facade over Hugging Face `tokenizers`, OpenAI `tiktoken_ruby`, and
+   SentencePiece bindings that maps ruby_llm model identifiers (gpt-4o,
+   llama-3, mistral, ...) to the correct tokenizer and exposes a small API for
+   counting, analyzing, and truncating text against a model's context window.
+   Includes an opt-in approximation backend for models with no published
+   tokenizer (Claude).
  email:
  - sal.scotto@gmail.com
  executables: []
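
Note on the hunk above: the new `sentencepiece` runtime dependency is declared with the pessimistic constraint `~> 0.2`. As a small illustration of which versions that constraint admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The candidate version numbers are made up for the example and do not refer to actual `sentencepiece` releases.

```ruby
require "rubygems"

# "~> 0.2" is RubyGems' pessimistic constraint: at least 0.2, but below 1.0.
requirement = Gem::Requirement.new("~> 0.2")

# Illustrative candidates only; not a list of real releases.
%w[0.1.9 0.2.0 0.2.7 0.9.0 1.0.0].each do |candidate|
  allowed = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{allowed ? 'allowed' : 'rejected'}"
end
```

In practice this means any future 0.x release of the dependency from 0.2 onward can be resolved, while a 1.0 release would need the constraint to be relaxed.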
@@ -60,6 +75,7 @@ files:
  - lib/ruby_llm/tokenizer/backend.rb
  - lib/ruby_llm/tokenizer/backend/approximate.rb
  - lib/ruby_llm/tokenizer/backend/hugging_face.rb
+ - lib/ruby_llm/tokenizer/backend/sentencepiece.rb
  - lib/ruby_llm/tokenizer/backend/tiktoken.rb
  - lib/ruby_llm/tokenizer/configuration.rb
  - lib/ruby_llm/tokenizer/errors.rb
@@ -75,6 +91,16 @@ metadata:
    source_code_uri: https://github.com/washu/ruby_llm-tokenizer/tree/main
    changelog_uri: https://github.com/washu/ruby_llm-tokenizer/blob/main/CHANGELOG.md
    rubygems_mfa_required: 'true'
+ post_install_message: |
+   ruby_llm-tokenizer includes a SentencePiece backend. If you use Gemini or any
+   other SentencePiece-based model, install the native SentencePiece library too:
+
+   macOS: brew install sentencepiece
+   Ubuntu/Debian: sudo apt-get install sentencepiece libsentencepiece-dev
+
+   On Apple Silicon, direct gem installs may need:
+
+   gem install sentencepiece -- --with-opt-dir=/opt/homebrew
  rdoc_options: []
  require_paths:
  - lib