riptoken 0.2.1__tar.gz → 0.2.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {riptoken-0.2.1 → riptoken-0.2.3}/.github/workflows/ci.yml +1 -1
- {riptoken-0.2.1 → riptoken-0.2.3}/.github/workflows/release.yml +2 -2
- {riptoken-0.2.1 → riptoken-0.2.3}/CHANGELOG.md +59 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/Cargo.lock +1 -1
- {riptoken-0.2.1 → riptoken-0.2.3}/Cargo.toml +1 -1
- {riptoken-0.2.1 → riptoken-0.2.3}/PKG-INFO +2 -1
- {riptoken-0.2.1 → riptoken-0.2.3}/pyproject.toml +2 -1
- riptoken-0.2.3/python/riptoken/__init__.py +316 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/src/lib.rs +27 -6
- riptoken-0.2.1/python/riptoken/__init__.py +0 -183
- {riptoken-0.2.1 → riptoken-0.2.3}/.cargo/config.toml +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/.gitignore +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/LICENSE +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/README.md +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/benches/bpe.rs +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/examples/parallel_bench.rs +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/python/riptoken/_riptoken.pyi +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/python/riptoken/py.typed +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/rust-toolchain.toml +0 -0
- {riptoken-0.2.1 → riptoken-0.2.3}/tests/integration.rs +0 -0
@@ -28,7 +28,7 @@ jobs:
         uses: PyO3/maturin-action@v1
         with:
           target: ${{ matrix.target }}
-          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13
+          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13 python3.14
           manylinux: auto
       - name: Upload wheels
         uses: actions/upload-artifact@v4
@@ -51,7 +51,7 @@ jobs:
         uses: PyO3/maturin-action@v1
         with:
           target: ${{ matrix.target }}
-          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13
+          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13 python3.14
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
@@ -7,6 +7,65 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+## [0.2.3] — 2026-04-11
+
+Minor release. Closes the remaining tiktoken API surface gap by wrapping
+the Rust `CoreBPE` in a new Python `Encoding` class that forwards every
+non-hot-path attribute and method to the underlying `tiktoken.Encoding`.
+Code written against `tiktoken.Encoding` — `n_vocab`, `eot_token`,
+`special_tokens_set`, `encode_with_unstable`, `decode_single_token_bytes`,
+etc. — now works unchanged against a `riptoken.get_encoding` result.
+Also adds Python 3.14 to CI and the release wheels matrix.
+
+### Added
+
+- `riptoken.Encoding` — a drop-in replacement for `tiktoken.Encoding`.
+  Hot-path methods (`encode`, `encode_ordinary`, `decode`, `decode_bytes`,
+  `encode_batch`, `encode_ordinary_batch`) execute in riptoken's Rust
+  core and release the GIL; every other attribute falls through to the
+  wrapped `tiktoken.Encoding` via `__getattr__`. `get_encoding` and
+  `encoding_for_model` now return `Encoding` instances instead of bare
+  `CoreBPE` instances.
+- `encode` and `encode_batch` accept `allowed_special="all"` as a
+  sentinel meaning "every special token in the vocabulary", matching
+  tiktoken.
+- Python 3.14 support — added to the CI matrix and to the release
+  workflow's interpreter list for Linux and macOS wheel builds.
+
+### Changed
+
+- `n_vocab` is now a property (int), not a callable, matching tiktoken.
+  Code that called `enc.n_vocab()` needs to drop the parentheses.
+  `CoreBPE.n_vocab()` is unchanged for direct Rust-core users.
+
+## [0.2.2] — 2026-04-11
+
+Patch release. Closes two tiktoken API compatibility gaps in the Python
+bindings so that code written against `tiktoken.Encoding` — including
+the examples in Sebastian Raschka's *Build a Large Language Model
+(From Scratch)* — runs unchanged against a `riptoken.get_encoding`
+instance.
+
+### Added
+
+- `CoreBPE.decode(tokens)` — returns a Python `str`, matching
+  `tiktoken.Encoding.decode`. Invalid UTF-8 sequences (which can occur
+  mid-stream when a multi-byte character spans a token boundary) are
+  replaced with U+FFFD, matching tiktoken's default
+  `errors="replace"` behavior. The existing `decode_bytes` method is
+  unchanged and remains the right choice for strict / streaming
+  decoding.
+
+### Fixed
+
+- `CoreBPE.encode(text)` now works without an explicit
+  `allowed_special` argument, matching tiktoken. Previously
+  `allowed_special` was a required positional parameter, so
+  `tokenizer.encode(raw_text)` raised a `TypeError`. The parameter is
+  now optional and defaults to an empty set (no special tokens
+  recognized), and can still be passed as a positional or keyword
+  argument. `CoreBPE.encode_batch` received the same treatment.
+
 ## [0.2.1] — 2026-04-11

 Patch release. Makes `tiktoken` a required runtime dependency so that
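The `n_vocab` change in the 0.2.3 entry means the same attribute name is a property on `Encoding` but still a method on `CoreBPE`. A minimal sketch of a compatibility shim follows, assuming calling code has to accept either object. The helper name `vocab_size` and the two stand-in classes are hypothetical and are not part of riptoken or tiktoken.

```python
from typing import Any


def vocab_size(enc: Any) -> int:
    """Return the vocabulary size whether n_vocab is a property or a method."""
    value = enc.n_vocab
    # A property has already produced an int; a bound method still needs a call.
    return value() if callable(value) else value


class PropertyStyle:
    """Hypothetical stand-in for an object exposing n_vocab as a property."""

    @property
    def n_vocab(self) -> int:
        return 3


class MethodStyle:
    """Hypothetical stand-in for an object exposing n_vocab as a method."""

    def n_vocab(self) -> int:
        return 3


print(vocab_size(PropertyStyle()), vocab_size(MethodStyle()))
```

Code that only ever receives the result of `riptoken.get_encoding` does not need such a shim; dropping the parentheses, as the changelog says, is enough.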
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: riptoken
-Version: 0.2.
+Version: 0.2.3
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
@@ -14,6 +14,7 @@ Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
 Classifier: Programming Language :: Rust
 Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Text Processing :: Linguistic
@@ -1,6 +1,6 @@
 [project]
 name = "riptoken"
-version = "0.2.
+version = "0.2.3"
 description = "Fast BPE tokenizer for LLMs — a faster, drop-in compatible reimplementation of tiktoken"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -21,6 +21,7 @@ classifiers = [
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
     "Programming Language :: Rust",
     "Topic :: Scientific/Engineering :: Artificial Intelligence",
     "Topic :: Text Processing :: Linguistic",
@@ -0,0 +1,316 @@
+"""riptoken — fast BPE tokenizer for LLMs.
+
+A drop-in compatible, faster reimplementation of OpenAI's tiktoken. riptoken
+reads the same ``.tiktoken`` vocabulary files and produces byte-identical
+output for every tested corpus.
+
+Quick start
+-----------
+
+The simplest way to get a tokenizer:
+
+    >>> import riptoken
+    >>> enc = riptoken.get_encoding("o200k_base")
+    >>> enc.encode_ordinary("Hello, world!")
+    [13225, 11, 2375, 0]
+
+The returned object is a :class:`riptoken.Encoding` — a drop-in replacement
+for :class:`tiktoken.Encoding`. The hot-path methods (``encode``,
+``encode_ordinary``, ``decode``, ``decode_bytes``, and their batch variants)
+run through riptoken's fast Rust core; every other attribute and method
+forwards transparently to the underlying ``tiktoken.Encoding`` instance, so
+code written against the ``tiktoken`` API works unchanged.
+
+Or construct a :class:`CoreBPE` directly from a ``.tiktoken`` file if you
+want to avoid the ``tiktoken`` dependency entirely:
+
+    >>> import riptoken
+    >>> ranks = riptoken.load_tiktoken_bpe("o200k_base.tiktoken")
+    >>> special_tokens = {"<|endoftext|>": 199999}
+    >>> pat = r"\\w+|\\s+"  # simplified pattern; use the full o200k pattern in prod
+    >>> enc = riptoken.CoreBPE(ranks, special_tokens, pat)
+    >>> enc.decode_bytes(enc.encode_ordinary("Hello, world!"))
+    b'Hello, world!'
+
+See https://github.com/daechoi/riptoken for full documentation, benchmarks,
+and the canonical ``o200k_base`` regex pattern.
+"""
+
+from __future__ import annotations
+
+import base64
+from pathlib import Path
+from typing import Any, Iterable, Union
+
+from riptoken._riptoken import CoreBPE
+
+__version__ = "0.2.3"
+__all__ = [
+    "CoreBPE",
+    "Encoding",
+    "encoding_for_model",
+    "get_encoding",
+    "load_tiktoken_bpe",
+]
+
+
+def load_tiktoken_bpe(path: Union[str, Path]) -> dict[bytes, int]:
+    """Load a tiktoken ``.tiktoken`` vocabulary file.
+
+    The file format is one token per line::
+
+        <base64-encoded token bytes> <integer rank>
+
+    This is the same format OpenAI ships for ``cl100k_base``, ``o200k_base``,
+    etc. You can obtain vocabularies from the ``tiktoken`` package's cache
+    directory or download them directly from OpenAI's public CDN.
+
+    Parameters
+    ----------
+    path:
+        Filesystem path to the ``.tiktoken`` file.
+
+    Returns
+    -------
+    dict[bytes, int]
+        A dictionary mapping token bytes to their integer rank, suitable for
+        passing to :class:`CoreBPE`'s first positional argument.
+
+    Raises
+    ------
+    FileNotFoundError
+        If ``path`` does not exist.
+    ValueError
+        If a line in the file is malformed.
+    """
+    ranks: dict[bytes, int] = {}
+    with open(path, "rb") as f:
+        for lineno, line in enumerate(f, start=1):
+            if not line.strip():
+                continue
+            parts = line.split()
+            if len(parts) != 2:
+                raise ValueError(
+                    f"{path}: line {lineno}: expected '<b64> <rank>', got {line!r}"
+                )
+            token_b64, rank_str = parts
+            try:
+                token_bytes = base64.b64decode(token_b64)
+                rank = int(rank_str)
+            except (ValueError, base64.binascii.Error) as e:
+                raise ValueError(
+                    f"{path}: line {lineno}: failed to parse token/rank: {e}"
+                ) from e
+            ranks[token_bytes] = rank
+    return ranks
+
+
+class Encoding:
+    """Drop-in replacement for :class:`tiktoken.Encoding`.
+
+    Holds a riptoken :class:`CoreBPE` and the underlying
+    :class:`tiktoken.Encoding` it was built from. Hot-path methods —
+    ``encode``, ``encode_ordinary``, ``decode``, ``decode_bytes``, and their
+    batch variants — execute in riptoken's Rust core and release the GIL.
+    Every other attribute and method is forwarded to the wrapped
+    ``tiktoken.Encoding`` via :meth:`__getattr__`, so properties like
+    ``n_vocab``, ``eot_token``, ``special_tokens_set``, and less-common
+    methods like ``encode_with_unstable`` work transparently.
+
+    You usually obtain an ``Encoding`` via :func:`riptoken.get_encoding` or
+    :func:`riptoken.encoding_for_model`; constructing one directly is
+    unusual.
+
+    Notes
+    -----
+    This class is duck-typed as a ``tiktoken.Encoding`` — it does not
+    subclass it. Code that checks ``isinstance(enc, tiktoken.Encoding)``
+    will need to be updated; code that calls methods on it will not.
+    """
+
+    __slots__ = ("_core", "_tiktoken")
+
+    def __init__(self, core: CoreBPE, tiktoken_encoding: Any) -> None:
+        object.__setattr__(self, "_core", core)
+        object.__setattr__(self, "_tiktoken", tiktoken_encoding)
+
+    # ---- Hot-path methods: run in riptoken's Rust core ----
+
+    def encode_ordinary(self, text: str) -> list[int]:
+        """Encode ``text`` without recognizing any special tokens.
+
+        Equivalent to ``tiktoken.Encoding.encode_ordinary``. Releases the
+        GIL.
+        """
+        return self._core.encode_ordinary(text)
+
+    def encode(
+        self,
+        text: str,
+        allowed_special: Union[set[str], str, None] = None,
+        disallowed_special: Union[set[str], str, None] = None,
+    ) -> list[int]:
+        """Encode ``text``, optionally recognizing special tokens.
+
+        ``allowed_special`` may be a set of special-token strings to
+        recognize, or the sentinel string ``"all"`` to allow every special
+        token in the vocabulary. ``None`` (the default) disallows all.
+
+        ``disallowed_special`` is accepted for ``tiktoken`` signature
+        compatibility but is currently treated as a hint — unrecognized
+        specials pass through as ordinary text rather than raising.
+        """
+        if allowed_special is None:
+            allowed: set[str] = set()
+        elif allowed_special == "all":
+            allowed = set(self._tiktoken.special_tokens_set)
+        else:
+            allowed = set(allowed_special)
+        return self._core.encode(text, allowed)
+
+    def encode_ordinary_batch(self, texts: Iterable[str]) -> list[list[int]]:
+        """Parallel batch version of :meth:`encode_ordinary`.
+
+        Fans out to ``rayon``'s global thread pool, releasing the GIL for
+        the full batch.
+        """
+        return self._core.encode_ordinary_batch(list(texts))
+
+    def encode_batch(
+        self,
+        texts: Iterable[str],
+        allowed_special: Union[set[str], str, None] = None,
+        disallowed_special: Union[set[str], str, None] = None,
+    ) -> list[list[int]]:
+        """Parallel batch version of :meth:`encode`.
+
+        Fans out to ``rayon``'s global thread pool, releasing the GIL for
+        the full batch.
+        """
+        if allowed_special is None:
+            allowed: set[str] = set()
+        elif allowed_special == "all":
+            allowed = set(self._tiktoken.special_tokens_set)
+        else:
+            allowed = set(allowed_special)
+        return self._core.encode_batch(list(texts), allowed)
+
+    def decode(self, tokens: list[int], errors: str = "replace") -> str:
+        """Decode ``tokens`` into a string.
+
+        ``errors`` is accepted for tiktoken signature compatibility; only
+        ``"replace"`` (the default, matching tiktoken) is honored — invalid
+        UTF-8 byte sequences are replaced with U+FFFD.
+        """
+        return self._core.decode(tokens)
+
+    def decode_bytes(self, tokens: list[int]) -> bytes:
+        """Decode ``tokens`` into raw bytes (no UTF-8 conversion)."""
+        return self._core.decode_bytes(tokens)
+
+    # ---- Everything else: forward to tiktoken ----
+
+    def __getattr__(self, name: str) -> Any:
+        # __getattr__ is only called when normal attribute lookup fails, so
+        # the hot-path methods defined above are never routed through here.
+        return getattr(self._tiktoken, name)
+
+    def __repr__(self) -> str:
+        name = getattr(self._tiktoken, "name", "?")
+        return f"<riptoken.Encoding {name!r}>"
+
+
+def _build_encoding(tiktoken_encoding: Any) -> Encoding:
+    """Build a :class:`riptoken.Encoding` from a :class:`tiktoken.Encoding`.
+
+    Reads the vocab, pattern string, and special tokens from stable (if
+    technically private) attributes that tiktoken has exposed since its
+    first release.
+    """
+    # These three attrs have been stable since tiktoken 0.3 and are what
+    # tiktoken itself passes into its own Rust constructor.
+    mergeable_ranks = tiktoken_encoding._mergeable_ranks  # type: ignore[attr-defined]
+    special_tokens = tiktoken_encoding._special_tokens  # type: ignore[attr-defined]
+    pat_str = tiktoken_encoding._pat_str  # type: ignore[attr-defined]
+    core = CoreBPE(mergeable_ranks, special_tokens, pat_str)
+    return Encoding(core, tiktoken_encoding)
+
+
+def get_encoding(name: str) -> Encoding:
+    """Load a named tiktoken encoding (``"gpt2"``, ``"cl100k_base"``, ``"o200k_base"``, ...).
+
+    This is the riptoken equivalent of :func:`tiktoken.get_encoding`. It
+    uses the ``tiktoken`` package to supply the vocabulary file, regex
+    pattern, and special-token map, then wraps them in a
+    :class:`riptoken.Encoding` whose hot-path methods execute in
+    riptoken's faster Rust core. Subsequent calls with the same ``name``
+    reuse tiktoken's in-process cache, and the vocabulary download (if
+    needed) uses tiktoken's on-disk cache at ``~/.cache/tiktoken/`` (or
+    wherever ``TIKTOKEN_CACHE_DIR`` points).
+
+    Parameters
+    ----------
+    name:
+        A tiktoken encoding name, e.g. ``"gpt2"``, ``"r50k_base"``,
+        ``"p50k_base"``, ``"cl100k_base"``, ``"o200k_base"``.
+
+    Returns
+    -------
+    Encoding
+        A drop-in replacement for the corresponding
+        :class:`tiktoken.Encoding`, producing byte-identical output on all
+        hot-path methods.
+
+    Raises
+    ------
+    ImportError
+        If the ``tiktoken`` package is not installed. Install it with
+        ``pip install tiktoken``, or construct :class:`CoreBPE` directly
+        from a ``.tiktoken`` file via :func:`load_tiktoken_bpe`.
+    """
+    try:
+        import tiktoken
+    except ImportError as e:
+        raise ImportError(
+            "riptoken.get_encoding requires the `tiktoken` package to supply "
+            "vocabulary files and regex patterns. Install it with "
+            "`pip install tiktoken`, or construct CoreBPE directly using "
+            "riptoken.load_tiktoken_bpe()."
+        ) from e
+    return _build_encoding(tiktoken.get_encoding(name))
+
+
+def encoding_for_model(model_name: str) -> Encoding:
+    """Load the encoding used by a specific OpenAI model.
+
+    This is the riptoken equivalent of :func:`tiktoken.encoding_for_model`.
+    For example, ``encoding_for_model("gpt-4o")`` returns an ``o200k_base``
+    encoder wrapped in a :class:`riptoken.Encoding`. See
+    :func:`get_encoding` for dependency notes.
+
+    Parameters
+    ----------
+    model_name:
+        An OpenAI model name, e.g. ``"gpt-4"``, ``"gpt-4o"``, ``"gpt-3.5-turbo"``.
+
+    Returns
+    -------
+    Encoding
+        A drop-in replacement for the corresponding
+        :class:`tiktoken.Encoding`.
+
+    Raises
+    ------
+    ImportError
+        If the ``tiktoken`` package is not installed.
+    """
+    try:
+        import tiktoken
+    except ImportError as e:
+        raise ImportError(
+            "riptoken.encoding_for_model requires the `tiktoken` package to "
+            "supply vocabulary files and regex patterns. Install it with "
+            "`pip install tiktoken`, or construct CoreBPE directly using "
+            "riptoken.load_tiktoken_bpe()."
+        ) from e
+    return _build_encoding(tiktoken.encoding_for_model(model_name))
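The forwarding behaviour of the new `Encoding` class rests on a general Python rule: `__getattr__` runs only when normal attribute lookup fails. The sketch below shows that rule in isolation with two stand-in classes. `Inner` and `Wrapper` are hypothetical names for illustration; neither riptoken nor tiktoken is imported.

```python
class Inner:
    """Hypothetical stand-in for the wrapped object."""

    name = "demo"

    def encode(self, text):
        return ["inner", text]

    def extra(self):
        return "from inner"


class Wrapper:
    """Overrides one method and forwards every other attribute to Inner."""

    __slots__ = ("_inner",)

    def __init__(self, inner):
        self._inner = inner

    def encode(self, text):
        # Found by normal lookup on the class, so __getattr__ never sees it.
        return ["wrapper", text]

    def __getattr__(self, attr):
        # Reached only for names that Wrapper itself does not define.
        return getattr(self._inner, attr)


w = Wrapper(Inner())
print(w.encode("x"), w.extra(), w.name, isinstance(w, Inner))
```

The final `isinstance` check is false, which mirrors the note in the class docstring: a wrapper of this kind is duck-typed, not a subclass of the object it wraps.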
@@ -885,10 +885,16 @@ impl CoreBPE {
         py.detach(|| self.encode_ordinary(text))
     }

-    #[pyo3(name = "encode")]
-    fn py_encode(
+    #[pyo3(name = "encode", signature = (text, allowed_special = None))]
+    fn py_encode(
+        &self,
+        py: Python<'_>,
+        text: &str,
+        allowed_special: Option<HashSet<String>>,
+    ) -> Vec<Rank> {
         py.detach(|| {
-            let
+            let allowed = allowed_special.unwrap_or_default();
+            let allowed_refs: HashSet<&str> = allowed.iter().map(|s| s.as_str()).collect();
             self.encode(text, &allowed_refs)
         })
     }
@@ -901,16 +907,17 @@ impl CoreBPE {
         })
     }

-    #[pyo3(name = "encode_batch")]
+    #[pyo3(name = "encode_batch", signature = (texts, allowed_special = None))]
     fn py_encode_batch(
         &self,
         py: Python<'_>,
         texts: Vec<String>,
-        allowed_special: HashSet<String
+        allowed_special: Option<HashSet<String>>,
     ) -> Vec<Vec<Rank>> {
         py.detach(|| {
             let refs: Vec<&str> = texts.iter().map(|s| s.as_str()).collect();
-            let
+            let allowed = allowed_special.unwrap_or_default();
+            let allowed_refs: HashSet<&str> = allowed.iter().map(|s| s.as_str()).collect();
             self.encode_batch(&refs, &allowed_refs)
         })
     }
@@ -931,6 +938,20 @@ impl CoreBPE {
         pyo3::types::PyBytes::new(py, &bytes)
     }

+    /// Decode tokens into a Python `str`, matching `tiktoken.Encoding.decode`.
+    ///
+    /// Invalid UTF-8 sequences (which can occur mid-stream when a multi-byte
+    /// character spans a token boundary) are replaced with U+FFFD, matching
+    /// tiktoken's default `errors="replace"` behavior. For strict decoding or
+    /// raw bytes, use [`decode_bytes`].
+    #[pyo3(name = "decode")]
+    fn py_decode(&self, py: Python<'_>, tokens: Vec<Rank>) -> String {
+        py.detach(|| {
+            let bytes = self.decode_bytes(&tokens);
+            String::from_utf8_lossy(&bytes).into_owned()
+        })
+    }
+
     #[pyo3(name = "decode_single_token_bytes")]
     fn py_decode_single_token_bytes<'py>(
         &self,
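The U+FFFD replacement described in the `decode` doc comment can be illustrated with Python's built-in UTF-8 decoder. This is an analogy in plain Python only: it does not call riptoken, and the chunk boundary is made up so that it falls inside a two-byte character, the way a token boundary can.

```python
data = "héllo".encode("utf-8")  # "é" occupies two bytes: 0xC3 0xA9

# Hypothetical split that lands between the two bytes of "é".
chunks = [data[:2], data[2:]]

# Decoding each chunk on its own with errors="replace" loses the character:
# each orphaned byte turns into one U+FFFD replacement character.
lossy = "".join(chunk.decode("utf-8", errors="replace") for chunk in chunks)

# Joining the raw bytes first and decoding once keeps the text intact.
strict = b"".join(chunks).decode("utf-8")

print(repr(lossy))
print(repr(strict))
```

This is the reason the changelog recommends `decode_bytes` for streaming: accumulate bytes across tokens, then decode once a full character is available.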
@@ -1,183 +0,0 @@
-"""riptoken — fast BPE tokenizer for LLMs.
-
-A drop-in compatible, faster reimplementation of OpenAI's tiktoken. riptoken
-reads the same ``.tiktoken`` vocabulary files and produces byte-identical
-output for every tested corpus.
-
-Quick start
------------
-
-The simplest way to get a tokenizer, if you have ``tiktoken`` installed:
-
-    >>> import riptoken
-    >>> enc = riptoken.get_encoding("o200k_base")
-    >>> enc.encode_ordinary("Hello, world!")
-    [13225, 11, 2375, 0]
-
-Or construct a :class:`CoreBPE` directly from a ``.tiktoken`` file:
-
-    >>> import riptoken
-    >>> ranks = riptoken.load_tiktoken_bpe("o200k_base.tiktoken")
-    >>> special_tokens = {"<|endoftext|>": 199999}
-    >>> pat = r"\\w+|\\s+"  # simplified pattern; use the full o200k pattern in prod
-    >>> enc = riptoken.CoreBPE(ranks, special_tokens, pat)
-    >>> enc.decode_bytes(enc.encode_ordinary("Hello, world!"))
-    b'Hello, world!'
-
-See https://github.com/daechoi/riptoken for full documentation, benchmarks,
-and the canonical ``o200k_base`` regex pattern.
-"""
-
-from __future__ import annotations
-
-import base64
-from pathlib import Path
-from typing import Union
-
-from riptoken._riptoken import CoreBPE
-
-__version__ = "0.2.1"
-__all__ = ["CoreBPE", "encoding_for_model", "get_encoding", "load_tiktoken_bpe"]
-
-
-def load_tiktoken_bpe(path: Union[str, Path]) -> dict[bytes, int]:
-    """Load a tiktoken ``.tiktoken`` vocabulary file.
-
-    The file format is one token per line::
-
-        <base64-encoded token bytes> <integer rank>
-
-    This is the same format OpenAI ships for ``cl100k_base``, ``o200k_base``,
-    etc. You can obtain vocabularies from the ``tiktoken`` package's cache
-    directory or download them directly from OpenAI's public CDN.
-
-    Parameters
-    ----------
-    path:
-        Filesystem path to the ``.tiktoken`` file.
-
-    Returns
-    -------
-    dict[bytes, int]
-        A dictionary mapping token bytes to their integer rank, suitable for
-        passing to :class:`CoreBPE`'s first positional argument.
-
-    Raises
-    ------
-    FileNotFoundError
-        If ``path`` does not exist.
-    ValueError
-        If a line in the file is malformed.
-    """
-    ranks: dict[bytes, int] = {}
-    with open(path, "rb") as f:
-        for lineno, line in enumerate(f, start=1):
-            if not line.strip():
-                continue
-            parts = line.split()
-            if len(parts) != 2:
-                raise ValueError(
-                    f"{path}: line {lineno}: expected '<b64> <rank>', got {line!r}"
-                )
-            token_b64, rank_str = parts
-            try:
-                token_bytes = base64.b64decode(token_b64)
-                rank = int(rank_str)
-            except (ValueError, base64.binascii.Error) as e:
-                raise ValueError(
-                    f"{path}: line {lineno}: failed to parse token/rank: {e}"
-                ) from e
-            ranks[token_bytes] = rank
-    return ranks
-
-
-def _wrap_tiktoken_encoding(enc: object) -> CoreBPE:
-    """Build a :class:`CoreBPE` from a :class:`tiktoken.Encoding` instance.
-
-    Reads the vocab, pattern string, and special tokens from stable (if
-    technically private) attributes that tiktoken has exposed since its
-    first release.
-    """
-    # These three attrs have been stable since tiktoken 0.3 and are what
-    # tiktoken itself passes into its own Rust constructor.
-    mergeable_ranks = enc._mergeable_ranks  # type: ignore[attr-defined]
-    special_tokens = enc._special_tokens  # type: ignore[attr-defined]
-    pat_str = enc._pat_str  # type: ignore[attr-defined]
-    return CoreBPE(mergeable_ranks, special_tokens, pat_str)
-
-
-def get_encoding(name: str) -> CoreBPE:
-    """Load a named tiktoken encoding (``"gpt2"``, ``"cl100k_base"``, ``"o200k_base"``, ...).
-
-    This is the riptoken equivalent of :func:`tiktoken.get_encoding`. It
-    soft-depends on the ``tiktoken`` package to supply the vocabulary file,
-    regex pattern, and special-token map, then wraps them in riptoken's
-    faster Rust core. Subsequent calls with the same ``name`` reuse
-    tiktoken's in-process cache, and the vocabulary download (if needed)
-    uses tiktoken's on-disk cache at ``~/.cache/tiktoken/`` (or wherever
-    ``TIKTOKEN_CACHE_DIR`` points).
-
-    Parameters
-    ----------
-    name:
-        A tiktoken encoding name, e.g. ``"gpt2"``, ``"r50k_base"``,
-        ``"p50k_base"``, ``"cl100k_base"``, ``"o200k_base"``.
-
-    Returns
-    -------
-    CoreBPE
-        A riptoken encoder producing byte-identical output to
-        ``tiktoken.get_encoding(name)``.
-
-    Raises
-    ------
-    ImportError
-        If the ``tiktoken`` package is not installed. Install it with
-        ``pip install tiktoken``, or construct :class:`CoreBPE` directly
-        from a ``.tiktoken`` file via :func:`load_tiktoken_bpe`.
-    """
-    try:
-        import tiktoken
-    except ImportError as e:
-        raise ImportError(
-            "riptoken.get_encoding requires the `tiktoken` package to supply "
-            "vocabulary files and regex patterns. Install it with "
-            "`pip install tiktoken`, or construct CoreBPE directly using "
-            "riptoken.load_tiktoken_bpe()."
-        ) from e
-    return _wrap_tiktoken_encoding(tiktoken.get_encoding(name))
-
-
-def encoding_for_model(model_name: str) -> CoreBPE:
-    """Load the encoding used by a specific OpenAI model.
-
-    This is the riptoken equivalent of :func:`tiktoken.encoding_for_model`.
-    For example, ``encoding_for_model("gpt-4o")`` returns an ``o200k_base``
-    encoder. See :func:`get_encoding` for dependency notes.
-
-    Parameters
-    ----------
-    model_name:
-        An OpenAI model name, e.g. ``"gpt-4"``, ``"gpt-4o"``, ``"gpt-3.5-turbo"``.
-
-    Returns
-    -------
-    CoreBPE
-        A riptoken encoder producing byte-identical output to
-        ``tiktoken.encoding_for_model(model_name)``.
-
-    Raises
-    ------
-    ImportError
-        If the ``tiktoken`` package is not installed.
-    """
-    try:
-        import tiktoken
-    except ImportError as e:
-        raise ImportError(
-            "riptoken.encoding_for_model requires the `tiktoken` package to "
-            "supply vocabulary files and regex patterns. Install it with "
-            "`pip install tiktoken`, or construct CoreBPE directly using "
-            "riptoken.load_tiktoken_bpe()."
-        ) from e
-    return _wrap_tiktoken_encoding(tiktoken.encoding_for_model(model_name))
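The `.tiktoken` line format that `load_tiktoken_bpe` reads (base64 token bytes, whitespace, integer rank) is carried over unchanged between the two versions. A minimal sketch of that format follows, working on in-memory bytes so it runs without a vocabulary file. The helper `parse_tiktoken_lines` and the two sample tokens are hypothetical, and the sketch omits the per-line error reporting that the real loader performs.

```python
import base64


def parse_tiktoken_lines(lines):
    """Parse '<base64 token> <rank>' byte lines into a {token_bytes: rank} dict."""
    ranks = {}
    for line in lines:
        if not line.strip():
            continue  # blank lines are skipped, as in the loader above
        token_b64, rank_str = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank_str)
    return ranks


# Build a tiny two-token vocabulary in the same on-disk layout.
tokens = [b"hello", b" world"]
blob = b"".join(
    base64.b64encode(tok) + b" " + str(rank).encode("ascii") + b"\n"
    for rank, tok in enumerate(tokens)
)

ranks = parse_tiktoken_lines(blob.splitlines())
print(ranks)
```

Base64 is what lets tokens containing spaces or arbitrary bytes survive in a whitespace-separated text file.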
File without changes