ruby_llm-tokenizer 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +6 -0
- data/README.md +32 -1
- data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb +71 -0
- data/lib/ruby_llm/tokenizer/models.yml +9 -0
- data/lib/ruby_llm/tokenizer/registry.rb +2 -0
- data/lib/ruby_llm/tokenizer/version.rb +1 -1
- metadata +32 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8eb30ee6604d821956446b091836d48eea001f881dfac2c2e16249d7fe6fef03
+  data.tar.gz: e699a883413608d735fde1fff538f84c0686adf4f82ebb8658fdef371fd1faeb
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3cfb259665b740f0fdbe49a40aebd0e61c2aea5ac69efac57c1c60f6ea5e2531e235d58133c3640b117a6a493e3c364b20db2019eb808298e3cd7d797502d264
+  data.tar.gz: d6a5dd3e6b8f520947fc14a07f048eab8746ea28e4b4dafd86f7ff0496675285042cf86c9e76cb99505874b4d12ae04f1de72b46e4feee33fe5a0468f92195d3
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,11 @@
 ## [Unreleased]
 
+## [0.1.1] - 2026-06-11
+
+- Bumped the gem version.
+- Added a post-install notice explaining that SentencePiece-backed models require
+  the native SentencePiece library and how to install it on macOS and Debian/Ubuntu.
+
 ## [0.1.0] - 2026-06-05
 
 - Initial release.
data/README.md
CHANGED
@@ -4,7 +4,7 @@
 [](https://rubygems.org/gems/ruby_llm-tokenizer)
 
 Local, model-aware token counting for [ruby_llm](https://github.com/crmne/ruby_llm).
-A
+A facade over Hugging Face [`tokenizers`](https://github.com/ankane/tokenizers-ruby), OpenAI [`tiktoken_ruby`](https://github.com/IAPark/tiktoken_ruby), and SentencePiece bindings that maps model identifiers (`gpt-4o`, `llama-3`, `mistral`, ...) to the correct tokenizer and exposes a small API for counting, analyzing, and truncating text against a model's context window — without making an LLM API call.
 No Rust toolchain required: cross-compiled binaries are inherited from the upstream gems.
 
 ## Installation
@@ -65,6 +65,7 @@ implementation may still retain the kept portion in memory.
 | Family | Backend | Encoding / Repo |
 |-----------------------------------------------------------|-----------------|------------------------------------------|
 | All OpenAI families (gpt-3.5/4/4o/4.1/4.5/5, o-series, gpt-oss, embeddings, ft:, legacy) | `tiktoken_auto` | resolved via `Tiktoken.encoding_for_model` |
+| `gemini` | `sentencepiece` | `GEMINI_TOKENIZER_MODEL_FILE` |
 | `llama-3` / `meta-llama` | `hugging_face` | `meta-llama/Meta-Llama-3-8B-Instruct` |
 | `mistral` / `mixtral` | `hugging_face` | `mistralai/Mistral-7B-Instruct-v0.2` |
 | `deepseek` | `hugging_face` | `deepseek-ai/DeepSeek-V2` |
@@ -74,6 +75,36 @@ OpenAI model resolution is delegated to `tiktoken_ruby` — new OpenAI models be
 
 OpenAI encodings are bundled with `tiktoken_ruby` (no network needed). Hugging Face `tokenizer.json` files are downloaded lazily on first use, then persisted under `cache_dir` for later offline reuse. Some HF repos (Llama 3, recent Mistral) are gated and require an HF token — see [Configuration](#configuration).
 
+If a model ships a SentencePiece `.model` file instead of `tokenizer.json`, you can register it with the `sentencepiece` backend:
+
+```ruby
+RubyLLM::Tokenizer.register(
+  match: /^gemma-/,
+  backend: :sentencepiece,
+  model_file: "/path/to/tokenizer.model"
+)
+```
+
+This backend uses the [`sentencepiece.rb`](https://github.com/yoshoku/sentencepiece.rb) gem. If you want to use it in your app, add `sentencepiece` to your bundle and make sure the SentencePiece native library is installed on your system.
+
+Common install commands from the upstream project:
+
+```bash
+# macOS
+brew install sentencepiece
+
+# Ubuntu / Debian
+sudo apt-get install sentencepiece libsentencepiece-dev
+```
+
+If you install the gem directly on Apple Silicon, upstream also notes that you may need to point RubyGems at Homebrew's prefix:
+
+```bash
+gem install sentencepiece -- --with-opt-dir=/opt/homebrew
+```
+
+Gemini models are wired to this backend by default and read the tokenizer path from `GEMINI_TOKENIZER_MODEL_FILE`.
+
 ## Claude / Anthropic
 
 Anthropic does not publish Claude's tokenizer. By default, `model: "claude-..."` raises `UnknownModelError`.
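The README changes above describe two ways to give the SentencePiece backend its model path: an explicit `model_file:` passed to `register`, or the `GEMINI_TOKENIZER_MODEL_FILE` environment variable. The backend added in this release checks the explicit path first, then the environment variable, and raises if neither yields a value. Here is a minimal standalone sketch of that lookup order. The `env:` parameter and the use of `ArgumentError` are assumptions made so the example runs on its own; the gem reads `ENV` directly and raises `BackendError`.

```ruby
# Standalone sketch of the lookup order, not the gem's own method.
# `env:` is a hypothetical injection point so the example needs no real ENV.
def resolve_model_file(model_file: nil, model_file_env: nil, env: ENV)
  # 1. An explicit, non-empty path always wins.
  explicit = model_file.to_s
  return explicit unless explicit.empty?

  # 2. Otherwise read the named environment variable, if one was given.
  unless model_file_env.to_s.empty?
    from_env = env.fetch(model_file_env.to_s, nil).to_s
    return from_env unless from_env.empty?
  end

  # 3. Nothing configured: fail loudly instead of loading a nil path.
  raise ArgumentError, "need :model_file or :model_file_env with a configured path"
end

fake_env = { "GEMINI_TOKENIZER_MODEL_FILE" => "/models/gemini.model" }

puts resolve_model_file(model_file: "/a.model",
                        model_file_env: "GEMINI_TOKENIZER_MODEL_FILE",
                        env: fake_env)
puts resolve_model_file(model_file_env: "GEMINI_TOKENIZER_MODEL_FILE", env: fake_env)
```

The first call prints the explicit path and the second falls back to the value stored under the variable name.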
data/lib/ruby_llm/tokenizer/backend/sentencepiece.rb
ADDED
@@ -0,0 +1,71 @@
+# frozen_string_literal: true
+
+require_relative "../backend"
+
+module RubyLLM
+  module Tokenizer
+    module Backend
+      class SentencePiece < Base
+        attr_reader :model_file
+
+        def initialize(model_file: nil, model_file_env: nil)
+          super()
+          @model_file = resolve_model_file(model_file, model_file_env)
+          processor_class = load_sentencepiece_processor_class
+          @tokenizer = processor_class.new(model_file: @model_file)
+        rescue StandardError => e
+          raise BackendError, "Failed to load SentencePiece model #{@model_file.inspect}: #{e.message}"
+        end
+
+        def encode(text)
+          @tokenizer.public_send(:encode_as_ids, text.to_s)
+        end
+
+        def decode(ids)
+          @tokenizer.public_send(:decode, Array(ids))
+        end
+
+        def analyze(text)
+          text = text.to_s
+          ids = @tokenizer.public_send(:encode_as_ids, text)
+          tokens = @tokenizer.public_send(:encode, text, out_type: "str")
+          Analysis.new(tokens: tokens, ids: ids, model: identifier)
+        end
+
+        def identifier
+          "sentencepiece:#{model_file}"
+        end
+
+        private
+
+        def resolve_model_file(model_file, model_file_env)
+          return model_file.to_s unless model_file.nil? || model_file.to_s.empty?
+
+          if model_file_env && !model_file_env.to_s.empty?
+            env_value = ENV.fetch(model_file_env.to_s, nil)
+            return env_value.to_s unless env_value.nil? || env_value.to_s.empty?
+          end
+
+          raise BackendError,
+                "SentencePiece backend requires :model_file or :model_file_env with a configured path"
+        end
+
+        def load_sentencepiece_processor_class
+          Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+        rescue NameError
+          begin
+            require "sentencepiece"
+            Object.const_get(:SentencePiece).const_get(:SentencePieceProcessor)
+          rescue LoadError => e
+            raise BackendError,
+                  "SentencePiece backend requires the sentencepiece gem and a compiled SentencePiece library: #{e.message}"
+          rescue NameError => e
+            raise BackendError,
+                  "SentencePiece backend requires SentencePieceProcessor to be available: #{e.message}"
+          end
+        end
+      end
+    end
+  end
+end
+
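`load_sentencepiece_processor_class` above defers `require "sentencepiece"` until a SentencePiece backend is actually built, and it resolves the constant through `Object.const_get`. That second choice is presumably because a bare `SentencePiece` inside `RubyLLM::Tokenizer::Backend` would resolve to the backend class of the same name. The same pattern can be sketched with a standard-library feature. `load_optional_constant` and `MissingDependencyError` are hypothetical names for illustration, not part of the gem.

```ruby
# Hypothetical error type standing in for the gem's BackendError.
class MissingDependencyError < StandardError; end

# Return a top-level constant, requiring the library that defines it only
# when the constant is not already loaded.
def load_optional_constant(const_name, feature)
  Object.const_get(const_name)
rescue NameError
  begin
    require feature
    Object.const_get(const_name)
  rescue LoadError => e
    # The library itself is missing (for a native gem: not installed or not compiled).
    raise MissingDependencyError, "#{feature} is not installed: #{e.message}"
  rescue NameError => e
    # The library loaded but does not define the expected constant.
    raise MissingDependencyError, "#{feature} loaded but #{const_name} is missing: #{e.message}"
  end
end

string_io_class = load_optional_constant(:StringIO, "stringio")
puts string_io_class.name
```

Deferring the `require` this way means applications that never touch a SentencePiece-backed model do not pay for loading the native extension, and a missing library surfaces as one descriptive error.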
data/lib/ruby_llm/tokenizer/models.yml
CHANGED
@@ -22,6 +22,15 @@
 - match: "/^(gpt-|gpt[0-9]|chatgpt-|o[1-9]|text-|code-|davinci\\b|curie\\b|babbage\\b|ada\\b|ft:|codex-)/"
   backend: tiktoken_auto
 
+# --- Google Gemini: SentencePiece backend ------------------------------------
+# Gemini models use SentencePiece tokenization. Set GEMINI_TOKENIZER_MODEL_FILE
+# to point at the local tokenizer.model you want to use, or override the match
+# at runtime with RubyLLM::Tokenizer.register(...).
+
+- match: "/^gemini/i"
+  backend: sentencepiece
+  model_file_env: GEMINI_TOKENIZER_MODEL_FILE
+
 # --- Open weights: Hugging Face backend (tokenizer.json fetched lazily) -------
 # Some repos below are gated and require HF_TOKEN. Override with
 # RubyLLM::Tokenizer.register(...) if you want a different mirror.
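The `match` values in `models.yml` above are strings written in regex-literal form, such as `"/^gemini/i"`. The diff does not show how the registry turns them into patterns, so the following is only a sketch of one plausible approach. `pattern_from` is a hypothetical helper, and treating a non-slash string as an exact match is an assumption.

```ruby
# Hypothetical helper: convert a "/source/flags" string into a Regexp.
# Strings not written in slash form are treated as exact model names (assumption).
def pattern_from(match)
  if (parts = match.match(%r{\A/(.*)/([imx]*)\z}m))
    flags = 0
    flags |= Regexp::IGNORECASE if parts[2].include?("i")
    flags |= Regexp::MULTILINE if parts[2].include?("m")
    flags |= Regexp::EXTENDED if parts[2].include?("x")
    Regexp.new(parts[1], flags)
  else
    Regexp.new("\\A#{Regexp.escape(match)}\\z")
  end
end

gemini = pattern_from("/^gemini/i")

%w[gemini-1.5-pro Gemini-2.0-flash gpt-4o].each do |model|
  puts "#{model}: #{gemini.match?(model)}"
end
```

Because of the `i` flag, both `gemini-1.5-pro` and `Gemini-2.0-flash` match, while `gpt-4o` does not.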
data/lib/ruby_llm/tokenizer/registry.rb
CHANGED
@@ -4,6 +4,7 @@ require "yaml"
 require_relative "errors"
 require_relative "backend/tiktoken"
 require_relative "backend/hugging_face"
+require_relative "backend/sentencepiece"
 require_relative "backend/approximate"
 
 module RubyLLM
@@ -107,6 +108,7 @@ module RubyLLM
 when :tiktoken then Backend::Tiktoken.new(**entry.options)
 when :tiktoken_auto then build_tiktoken_auto(model)
 when :hugging_face then Backend::HuggingFace.new(**entry.options)
+when :sentencepiece then Backend::SentencePiece.new(**entry.options)
 when :approximate then Backend::Approximate.new(**entry.options)
 else
   raise BackendError, "Unknown backend: #{entry.backend.inspect}"
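The new `when :sentencepiece` branch above splats `entry.options` into the backend constructor, which is how a YAML key such as `model_file_env` can end up as a keyword argument. Here is a minimal sketch of that flow under stated assumptions: `DemoBackend` and `build_backend` are hypothetical stand-ins, and treating every key other than `match` and `backend` as an option is a guess at how the registry builds `entry.options`.

```ruby
require "yaml"

# Hypothetical stand-in; only the keyword signature mirrors the real backend.
class DemoBackend
  attr_reader :model_file, :model_file_env

  def initialize(model_file: nil, model_file_env: nil)
    @model_file = model_file
    @model_file_env = model_file_env
  end
end

ENTRY_YAML = <<~YAML
  - match: "/^gemini/i"
    backend: sentencepiece
    model_file_env: GEMINI_TOKENIZER_MODEL_FILE
YAML

# Everything except :match and :backend is forwarded as keyword options (assumption).
def build_backend(entry)
  options = entry.reject { |key, _| %i[match backend].include?(key) }
  case entry[:backend].to_sym
  when :sentencepiece then DemoBackend.new(**options)
  else raise ArgumentError, "Unknown backend: #{entry[:backend].inspect}"
  end
end

entry = YAML.safe_load(ENTRY_YAML, symbolize_names: true).first
backend = build_backend(entry)
puts backend.model_file_env
```

With keyword splatting, a misspelled option key in the YAML raises `ArgumentError` at construction time rather than being silently ignored.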
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-tokenizer
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 platform: ruby
 authors:
 - Sal Scotto
@@ -37,12 +37,27 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '0.5'
+- !ruby/object:Gem::Dependency
+  name: sentencepiece
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.2'
 description: |
-  Pure-Ruby facade over Hugging Face `tokenizers
-  that maps ruby_llm model identifiers (gpt-4o,
-  correct tokenizer and exposes a small API for
-  truncating text against a model's context window.
-  approximation backend for models with no published
+  Pure-Ruby facade over Hugging Face `tokenizers`, OpenAI `tiktoken_ruby`, and
+  SentencePiece bindings that maps ruby_llm model identifiers (gpt-4o,
+  llama-3, mistral, ...) to the correct tokenizer and exposes a small API for
+  counting, analyzing, and truncating text against a model's context window.
+  Includes an opt-in approximation backend for models with no published
+  tokenizer (Claude).
 email:
 - sal.scotto@gmail.com
 executables: []
@@ -60,6 +75,7 @@ files:
 - lib/ruby_llm/tokenizer/backend.rb
 - lib/ruby_llm/tokenizer/backend/approximate.rb
 - lib/ruby_llm/tokenizer/backend/hugging_face.rb
+- lib/ruby_llm/tokenizer/backend/sentencepiece.rb
 - lib/ruby_llm/tokenizer/backend/tiktoken.rb
 - lib/ruby_llm/tokenizer/configuration.rb
 - lib/ruby_llm/tokenizer/errors.rb
@@ -75,6 +91,16 @@ metadata:
   source_code_uri: https://github.com/washu/ruby_llm-tokenizer/tree/main
   changelog_uri: https://github.com/washu/ruby_llm-tokenizer/blob/main/CHANGELOG.md
   rubygems_mfa_required: 'true'
+post_install_message: |
+  ruby_llm-tokenizer includes a SentencePiece backend. If you use Gemini or any
+  other SentencePiece-based model, install the native SentencePiece library too:
+
+  macOS: brew install sentencepiece
+  Ubuntu/Debian: sudo apt-get install sentencepiece libsentencepiece-dev
+
+  On Apple Silicon, direct gem installs may need:
+
+  gem install sentencepiece -- --with-opt-dir=/opt/homebrew
 rdoc_options: []
 require_paths:
 - lib