riptoken 0.2.1__tar.gz → 0.2.3__tar.gz

@@ -61,7 +61,7 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest, macos-latest, windows-latest]
-        python-version: ["3.9", "3.11", "3.13"]
+        python-version: ["3.9", "3.11", "3.13", "3.14"]
     steps:
       - uses: actions/checkout@v4
 
@@ -28,7 +28,7 @@ jobs:
         uses: PyO3/maturin-action@v1
         with:
           target: ${{ matrix.target }}
-          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13
+          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13 python3.14
           manylinux: auto
       - name: Upload wheels
         uses: actions/upload-artifact@v4
@@ -51,7 +51,7 @@ jobs:
         uses: PyO3/maturin-action@v1
         with:
           target: ${{ matrix.target }}
-          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13
+          args: --release --out dist --features python --interpreter python3.9 python3.10 python3.11 python3.12 python3.13 python3.14
       - name: Upload wheels
         uses: actions/upload-artifact@v4
         with:
@@ -7,6 +7,65 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.2.3] — 2026-04-11
+
+Minor release. Closes the remaining tiktoken API surface gap by wrapping
+the Rust `CoreBPE` in a new Python `Encoding` class that forwards every
+non-hot-path attribute and method to the underlying `tiktoken.Encoding`.
+Code written against `tiktoken.Encoding` — `n_vocab`, `eot_token`,
+`special_tokens_set`, `encode_with_unstable`, `decode_single_token_bytes`,
+etc. — now works unchanged against a `riptoken.get_encoding` result.
+Also adds Python 3.14 to CI and the release wheels matrix.
+
+### Added
+
+- `riptoken.Encoding` — a drop-in replacement for `tiktoken.Encoding`.
+  Hot-path methods (`encode`, `encode_ordinary`, `decode`, `decode_bytes`,
+  `encode_batch`, `encode_ordinary_batch`) execute in riptoken's Rust
+  core and release the GIL; every other attribute falls through to the
+  wrapped `tiktoken.Encoding` via `__getattr__`. `get_encoding` and
+  `encoding_for_model` now return `Encoding` instances instead of bare
+  `CoreBPE` instances.
+- `encode` and `encode_batch` accept `allowed_special="all"` as a
+  sentinel meaning "every special token in the vocabulary", matching
+  tiktoken.
+- Python 3.14 support — added to the CI matrix and to the release
+  workflow's interpreter list for Linux and macOS wheel builds.
+
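The split described in the first bullet (a few methods handled by a fast core, everything else falling through `__getattr__`) can be illustrated with a small, self-contained sketch. `ReferenceEncoding`, `FastCore`, and `ForwardingEncoding` are hypothetical stand-ins invented for this illustration, not riptoken or tiktoken classes, and the byte-level "tokenizer" is a toy.

```python
from typing import Any


class ReferenceEncoding:
    """Hypothetical stand-in for the wrapped reference encoding."""

    name = "toy"
    n_vocab = 256
    eot_token = 255

    def encode_ordinary(self, text: str) -> list:
        return list(text.encode("utf-8"))


class FastCore:
    """Hypothetical stand-in for an accelerated core; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def encode_ordinary(self, text: str) -> list:
        self.calls += 1
        return list(text.encode("utf-8"))


class ForwardingEncoding:
    """Routes hot-path calls to the core and everything else to the wrapped object."""

    __slots__ = ("_core", "_wrapped")

    def __init__(self, core: FastCore, wrapped: ReferenceEncoding) -> None:
        self._core = core
        self._wrapped = wrapped

    def encode_ordinary(self, text: str) -> list:
        # Defined on the class, so normal lookup finds it first.
        return self._core.encode_ordinary(text)

    def __getattr__(self, name: str) -> Any:
        # Only invoked when normal attribute lookup fails.
        return getattr(self._wrapped, name)


core = FastCore()
enc = ForwardingEncoding(core, ReferenceEncoding())
tokens = enc.encode_ordinary("hi")   # handled by FastCore
vocab_size = enc.n_vocab             # forwarded to ReferenceEncoding
```

Because the wrapper does not subclass the reference class, an `isinstance` check against the reference class would fail for it, which is the usual trade-off of this kind of duck-typed forwarding.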
+### Changed
+
+- `n_vocab` is now a property (int), not a callable, matching tiktoken.
+  Code that called `enc.n_vocab()` needs to drop the parentheses.
+  `CoreBPE.n_vocab()` is unchanged for direct Rust-core users.
+
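Code that has to run against both the callable and the property form of `n_vocab` could use a small shim along these lines. This is a sketch only: `MethodStyle`, `PropertyStyle`, and `vocab_size` are hypothetical names made up for the example, and the value 50257 is an arbitrary placeholder.

```python
class MethodStyle:
    """Hypothetical object exposing n_vocab as a method."""

    def n_vocab(self) -> int:
        return 50257


class PropertyStyle:
    """Hypothetical object exposing n_vocab as a property."""

    @property
    def n_vocab(self) -> int:
        return 50257


def vocab_size(enc) -> int:
    """Return the vocabulary size whether n_vocab is a method or a property."""
    value = enc.n_vocab
    return value() if callable(value) else value


old_size = vocab_size(MethodStyle())
new_size = vocab_size(PropertyStyle())
```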
+## [0.2.2] — 2026-04-11
+
+Patch release. Closes two tiktoken API compatibility gaps in the Python
+bindings so that code written against `tiktoken.Encoding` — including
+the examples in Sebastian Raschka's *Build a Large Language Model
+(From Scratch)* — runs unchanged against a `riptoken.get_encoding`
+instance.
+
+### Added
+
+- `CoreBPE.decode(tokens)` — returns a Python `str`, matching
+  `tiktoken.Encoding.decode`. Invalid UTF-8 sequences (which can occur
+  mid-stream when a multi-byte character spans a token boundary) are
+  replaced with U+FFFD, matching tiktoken's default
+  `errors="replace"` behavior. The existing `decode_bytes` method is
+  unchanged and remains the right choice for strict / streaming
+  decoding.
+
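The U+FFFD behavior can be reproduced with plain Python bytes, without any tokenizer. The sketch below pretends that a token boundary falls in the middle of the two-byte UTF-8 encoding of "é"; the chunk split is an assumption chosen for illustration. It also shows why keeping raw bytes is preferable for streaming: an incremental decoder from the standard library can buffer the incomplete sequence instead of replacing it.

```python
import codecs

data = "café".encode("utf-8")            # b'caf\xc3\xa9'
first_chunk, second_chunk = data[:4], data[4:]

# Lossy decoding of a partial chunk: the dangling lead byte becomes U+FFFD.
lossy = first_chunk.decode("utf-8", errors="replace")

# Strict decoding of the same partial chunk raises instead.
try:
    first_chunk.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Streaming over raw bytes: the incremental decoder holds back the
# incomplete sequence until the next chunk completes it.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = decoder.decode(first_chunk) + decoder.decode(second_chunk, final=True)
```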
+### Fixed
+
+- `CoreBPE.encode(text)` now works without an explicit
+  `allowed_special` argument, matching tiktoken. Previously
+  `allowed_special` was a required positional parameter, so
+  `tokenizer.encode(raw_text)` raised a `TypeError`. The parameter is
+  now optional and defaults to an empty set (no special tokens
+  recognized), and can still be passed as a positional or keyword
+  argument. `CoreBPE.encode_batch` received the same treatment.
+
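Taken together with the `"all"` sentinel from 0.2.3, the optional `allowed_special` argument amounts to a small normalization step before encoding. The helper below is a standalone sketch of that step; `normalize_allowed_special` and its `all_special_tokens` parameter are hypothetical names for this example, not part of riptoken's API.

```python
from __future__ import annotations

from typing import Union


def normalize_allowed_special(
    allowed_special: Union[set[str], str, None],
    all_special_tokens: set[str],
) -> set[str]:
    """Turn the user-facing argument into a concrete set of special tokens."""
    if allowed_special is None:
        # Omitted argument: recognize no special tokens.
        return set()
    if allowed_special == "all":
        # Sentinel: recognize every special token in the vocabulary.
        return set(all_special_tokens)
    # Otherwise treat the argument as an iterable of token strings.
    return set(allowed_special)


specials = {"<|endoftext|>", "<|fim_prefix|>"}
none_case = normalize_allowed_special(None, specials)
all_case = normalize_allowed_special("all", specials)
subset_case = normalize_allowed_special({"<|endoftext|>"}, specials)
```

One caveat of this shape: any string other than `"all"` would be split into individual characters by `set(...)`, so callers should pass a set rather than a bare token string.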
 ## [0.2.1] — 2026-04-11
 
 Patch release. Makes `tiktoken` a required runtime dependency so that
@@ -258,7 +258,7 @@ checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a"
 
 [[package]]
 name = "riptoken"
-version = "0.2.1"
+version = "0.2.3"
 dependencies = [
  "base64",
  "fancy-regex",
@@ -1,6 +1,6 @@
 [package]
 name = "riptoken"
-version = "0.2.1"
+version = "0.2.3"
 edition = "2024"
 rust-version = "1.85"
 description = "Fast BPE tokenizer for LLMs — a faster, drop-in compatible reimplementation of tiktoken"
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: riptoken
-Version: 0.2.1
+Version: 0.2.3
 Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Developers
 Classifier: Intended Audience :: Science/Research
@@ -14,6 +14,7 @@ Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
 Classifier: Programming Language :: Rust
 Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Text Processing :: Linguistic
@@ -1,6 +1,6 @@
 [project]
 name = "riptoken"
-version = "0.2.1"
+version = "0.2.3"
 description = "Fast BPE tokenizer for LLMs — a faster, drop-in compatible reimplementation of tiktoken"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -21,6 +21,7 @@ classifiers = [
     "Programming Language :: Python :: 3.11",
     "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
     "Programming Language :: Rust",
     "Topic :: Scientific/Engineering :: Artificial Intelligence",
     "Topic :: Text Processing :: Linguistic",
@@ -0,0 +1,316 @@
+"""riptoken — fast BPE tokenizer for LLMs.
+
+A drop-in compatible, faster reimplementation of OpenAI's tiktoken. riptoken
+reads the same ``.tiktoken`` vocabulary files and produces byte-identical
+output for every tested corpus.
+
+Quick start
+-----------
+
+The simplest way to get a tokenizer:
+
+    >>> import riptoken
+    >>> enc = riptoken.get_encoding("o200k_base")
+    >>> enc.encode_ordinary("Hello, world!")
+    [13225, 11, 2375, 0]
+
+The returned object is a :class:`riptoken.Encoding` — a drop-in replacement
+for :class:`tiktoken.Encoding`. The hot-path methods (``encode``,
+``encode_ordinary``, ``decode``, ``decode_bytes``, and their batch variants)
+run through riptoken's fast Rust core; every other attribute and method
+forwards transparently to the underlying ``tiktoken.Encoding`` instance, so
+code written against the ``tiktoken`` API works unchanged.
+
+Or construct a :class:`CoreBPE` directly from a ``.tiktoken`` file if you
+want to avoid the ``tiktoken`` dependency entirely:
+
+    >>> import riptoken
+    >>> ranks = riptoken.load_tiktoken_bpe("o200k_base.tiktoken")
+    >>> special_tokens = {"<|endoftext|>": 199999}
+    >>> pat = r"\\w+|\\s+"  # simplified pattern; use the full o200k pattern in prod
+    >>> enc = riptoken.CoreBPE(ranks, special_tokens, pat)
+    >>> enc.decode_bytes(enc.encode_ordinary("Hello, world!"))
+    b'Hello, world!'
+
+See https://github.com/daechoi/riptoken for full documentation, benchmarks,
+and the canonical ``o200k_base`` regex pattern.
+"""
+
+from __future__ import annotations
+
+import base64
+from pathlib import Path
+from typing import Any, Iterable, Union
+
+from riptoken._riptoken import CoreBPE
+
+__version__ = "0.2.3"
+__all__ = [
+    "CoreBPE",
+    "Encoding",
+    "encoding_for_model",
+    "get_encoding",
+    "load_tiktoken_bpe",
+]
+
+
+def load_tiktoken_bpe(path: Union[str, Path]) -> dict[bytes, int]:
+    """Load a tiktoken ``.tiktoken`` vocabulary file.
+
+    The file format is one token per line::
+
+        <base64-encoded token bytes> <integer rank>
+
+    This is the same format OpenAI ships for ``cl100k_base``, ``o200k_base``,
+    etc. You can obtain vocabularies from the ``tiktoken`` package's cache
+    directory or download them directly from OpenAI's public CDN.
+
+    Parameters
+    ----------
+    path:
+        Filesystem path to the ``.tiktoken`` file.
+
+    Returns
+    -------
+    dict[bytes, int]
+        A dictionary mapping token bytes to their integer rank, suitable for
+        passing to :class:`CoreBPE`'s first positional argument.
+
+    Raises
+    ------
+    FileNotFoundError
+        If ``path`` does not exist.
+    ValueError
+        If a line in the file is malformed.
+    """
+    ranks: dict[bytes, int] = {}
+    with open(path, "rb") as f:
+        for lineno, line in enumerate(f, start=1):
+            if not line.strip():
+                continue
+            parts = line.split()
+            if len(parts) != 2:
+                raise ValueError(
+                    f"{path}: line {lineno}: expected '<b64> <rank>', got {line!r}"
+                )
+            token_b64, rank_str = parts
+            try:
+                token_bytes = base64.b64decode(token_b64)
+                rank = int(rank_str)
+            except (ValueError, base64.binascii.Error) as e:
+                raise ValueError(
+                    f"{path}: line {lineno}: failed to parse token/rank: {e}"
+                ) from e
+            ranks[token_bytes] = rank
+    return ranks
+
+
+class Encoding:
+    """Drop-in replacement for :class:`tiktoken.Encoding`.
+
+    Holds a riptoken :class:`CoreBPE` and the underlying
+    :class:`tiktoken.Encoding` it was built from. Hot-path methods —
+    ``encode``, ``encode_ordinary``, ``decode``, ``decode_bytes``, and their
+    batch variants — execute in riptoken's Rust core and release the GIL.
+    Every other attribute and method is forwarded to the wrapped
+    ``tiktoken.Encoding`` via :meth:`__getattr__`, so properties like
+    ``n_vocab``, ``eot_token``, ``special_tokens_set``, and less-common
+    methods like ``encode_with_unstable`` work transparently.
+
+    You usually obtain an ``Encoding`` via :func:`riptoken.get_encoding` or
+    :func:`riptoken.encoding_for_model`; constructing one directly is
+    unusual.
+
+    Notes
+    -----
+    This class is duck-typed as a ``tiktoken.Encoding`` — it does not
+    subclass it. Code that checks ``isinstance(enc, tiktoken.Encoding)``
+    will need to be updated; code that calls methods on it will not.
+    """
+
+    __slots__ = ("_core", "_tiktoken")
+
+    def __init__(self, core: CoreBPE, tiktoken_encoding: Any) -> None:
+        object.__setattr__(self, "_core", core)
+        object.__setattr__(self, "_tiktoken", tiktoken_encoding)
+
+    # ---- Hot-path methods: run in riptoken's Rust core ----
+
+    def encode_ordinary(self, text: str) -> list[int]:
+        """Encode ``text`` without recognizing any special tokens.
+
+        Equivalent to ``tiktoken.Encoding.encode_ordinary``. Releases the
+        GIL.
+        """
+        return self._core.encode_ordinary(text)
+
+    def encode(
+        self,
+        text: str,
+        allowed_special: Union[set[str], str, None] = None,
+        disallowed_special: Union[set[str], str, None] = None,
+    ) -> list[int]:
+        """Encode ``text``, optionally recognizing special tokens.
+
+        ``allowed_special`` may be a set of special-token strings to
+        recognize, or the sentinel string ``"all"`` to allow every special
+        token in the vocabulary. ``None`` (the default) disallows all.
+
+        ``disallowed_special`` is accepted for ``tiktoken`` signature
+        compatibility but is currently treated as a hint — unrecognized
+        specials pass through as ordinary text rather than raising.
+        """
+        if allowed_special is None:
+            allowed: set[str] = set()
+        elif allowed_special == "all":
+            allowed = set(self._tiktoken.special_tokens_set)
+        else:
+            allowed = set(allowed_special)
+        return self._core.encode(text, allowed)
+
+    def encode_ordinary_batch(self, texts: Iterable[str]) -> list[list[int]]:
+        """Parallel batch version of :meth:`encode_ordinary`.
+
+        Fans out to ``rayon``'s global thread pool, releasing the GIL for
+        the full batch.
+        """
+        return self._core.encode_ordinary_batch(list(texts))
+
+    def encode_batch(
+        self,
+        texts: Iterable[str],
+        allowed_special: Union[set[str], str, None] = None,
+        disallowed_special: Union[set[str], str, None] = None,
+    ) -> list[list[int]]:
+        """Parallel batch version of :meth:`encode`.
+
+        Fans out to ``rayon``'s global thread pool, releasing the GIL for
+        the full batch.
+        """
+        if allowed_special is None:
+            allowed: set[str] = set()
+        elif allowed_special == "all":
+            allowed = set(self._tiktoken.special_tokens_set)
+        else:
+            allowed = set(allowed_special)
+        return self._core.encode_batch(list(texts), allowed)
+
+    def decode(self, tokens: list[int], errors: str = "replace") -> str:
+        """Decode ``tokens`` into a string.
+
+        ``errors`` is accepted for tiktoken signature compatibility; only
+        ``"replace"`` (the default, matching tiktoken) is honored — invalid
+        UTF-8 byte sequences are replaced with U+FFFD.
+        """
+        return self._core.decode(tokens)
+
+    def decode_bytes(self, tokens: list[int]) -> bytes:
+        """Decode ``tokens`` into raw bytes (no UTF-8 conversion)."""
+        return self._core.decode_bytes(tokens)
+
+    # ---- Everything else: forward to tiktoken ----
+
+    def __getattr__(self, name: str) -> Any:
+        # __getattr__ is only called when normal attribute lookup fails, so
+        # the hot-path methods defined above are never routed through here.
+        return getattr(self._tiktoken, name)
+
+    def __repr__(self) -> str:
+        name = getattr(self._tiktoken, "name", "?")
+        return f"<riptoken.Encoding {name!r}>"
+
+
+def _build_encoding(tiktoken_encoding: Any) -> Encoding:
+    """Build a :class:`riptoken.Encoding` from a :class:`tiktoken.Encoding`.
+
+    Reads the vocab, pattern string, and special tokens from stable (if
+    technically private) attributes that tiktoken has exposed since its
+    first release.
+    """
+    # These three attrs have been stable since tiktoken 0.3 and are what
+    # tiktoken itself passes into its own Rust constructor.
+    mergeable_ranks = tiktoken_encoding._mergeable_ranks  # type: ignore[attr-defined]
+    special_tokens = tiktoken_encoding._special_tokens  # type: ignore[attr-defined]
+    pat_str = tiktoken_encoding._pat_str  # type: ignore[attr-defined]
+    core = CoreBPE(mergeable_ranks, special_tokens, pat_str)
+    return Encoding(core, tiktoken_encoding)
+
+
+def get_encoding(name: str) -> Encoding:
+    """Load a named tiktoken encoding (``"gpt2"``, ``"cl100k_base"``, ``"o200k_base"``, ...).
+
+    This is the riptoken equivalent of :func:`tiktoken.get_encoding`. It
+    uses the ``tiktoken`` package to supply the vocabulary file, regex
+    pattern, and special-token map, then wraps them in a
+    :class:`riptoken.Encoding` whose hot-path methods execute in
+    riptoken's faster Rust core. Subsequent calls with the same ``name``
+    reuse tiktoken's in-process cache, and the vocabulary download (if
+    needed) uses tiktoken's on-disk cache at ``~/.cache/tiktoken/`` (or
+    wherever ``TIKTOKEN_CACHE_DIR`` points).
+
+    Parameters
+    ----------
+    name:
+        A tiktoken encoding name, e.g. ``"gpt2"``, ``"r50k_base"``,
+        ``"p50k_base"``, ``"cl100k_base"``, ``"o200k_base"``.
+
+    Returns
+    -------
+    Encoding
+        A drop-in replacement for the corresponding
+        :class:`tiktoken.Encoding`, producing byte-identical output on all
+        hot-path methods.
+
+    Raises
+    ------
+    ImportError
+        If the ``tiktoken`` package is not installed. Install it with
+        ``pip install tiktoken``, or construct :class:`CoreBPE` directly
+        from a ``.tiktoken`` file via :func:`load_tiktoken_bpe`.
+    """
+    try:
+        import tiktoken
+    except ImportError as e:
+        raise ImportError(
+            "riptoken.get_encoding requires the `tiktoken` package to supply "
+            "vocabulary files and regex patterns. Install it with "
+            "`pip install tiktoken`, or construct CoreBPE directly using "
+            "riptoken.load_tiktoken_bpe()."
+        ) from e
+    return _build_encoding(tiktoken.get_encoding(name))
+
+
+def encoding_for_model(model_name: str) -> Encoding:
+    """Load the encoding used by a specific OpenAI model.
+
+    This is the riptoken equivalent of :func:`tiktoken.encoding_for_model`.
+    For example, ``encoding_for_model("gpt-4o")`` returns an ``o200k_base``
+    encoder wrapped in a :class:`riptoken.Encoding`. See
+    :func:`get_encoding` for dependency notes.
+
+    Parameters
+    ----------
+    model_name:
+        An OpenAI model name, e.g. ``"gpt-4"``, ``"gpt-4o"``, ``"gpt-3.5-turbo"``.
+
+    Returns
+    -------
+    Encoding
+        A drop-in replacement for the corresponding
+        :class:`tiktoken.Encoding`.
+
+    Raises
+    ------
+    ImportError
+        If the ``tiktoken`` package is not installed.
+    """
+    try:
+        import tiktoken
+    except ImportError as e:
+        raise ImportError(
+            "riptoken.encoding_for_model requires the `tiktoken` package to "
+            "supply vocabulary files and regex patterns. Install it with "
+            "`pip install tiktoken`, or construct CoreBPE directly using "
+            "riptoken.load_tiktoken_bpe()."
+        ) from e
+    return _build_encoding(tiktoken.encoding_for_model(model_name))
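The one-token-per-line format documented in `load_tiktoken_bpe` above can be exercised without riptoken installed. The following sketch writes a three-entry toy vocabulary to a temporary file and reads it back with the same split-and-decode steps the loader uses; the vocabulary contents and the file name `toy.tiktoken` are made up for the example.

```python
import base64
import os
import tempfile

# Toy vocabulary: token bytes mapped to integer ranks (illustrative values).
vocab = {b"a": 0, b"b": 1, b"ab": 2}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.tiktoken")

    # Write "<base64 token> <rank>" lines.
    with open(path, "wb") as f:
        for token, rank in vocab.items():
            f.write(base64.b64encode(token) + b" " + str(rank).encode("ascii") + b"\n")

    # Read the file back: skip blank lines, split on whitespace, decode.
    parsed = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank_str = line.split()
            parsed[base64.b64decode(token_b64)] = int(rank_str)
```

Keys stay as `bytes` rather than `str` because individual BPE tokens are not guaranteed to be valid UTF-8 on their own.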
@@ -885,10 +885,16 @@ impl CoreBPE {
         py.detach(|| self.encode_ordinary(text))
     }
 
-    #[pyo3(name = "encode")]
-    fn py_encode(&self, py: Python<'_>, text: &str, allowed_special: HashSet<String>) -> Vec<Rank> {
+    #[pyo3(name = "encode", signature = (text, allowed_special = None))]
+    fn py_encode(
+        &self,
+        py: Python<'_>,
+        text: &str,
+        allowed_special: Option<HashSet<String>>,
+    ) -> Vec<Rank> {
         py.detach(|| {
-            let allowed_refs: HashSet<&str> = allowed_special.iter().map(|s| s.as_str()).collect();
+            let allowed = allowed_special.unwrap_or_default();
+            let allowed_refs: HashSet<&str> = allowed.iter().map(|s| s.as_str()).collect();
             self.encode(text, &allowed_refs)
         })
     }
@@ -901,16 +907,17 @@ impl CoreBPE {
         })
     }
 
-    #[pyo3(name = "encode_batch")]
+    #[pyo3(name = "encode_batch", signature = (texts, allowed_special = None))]
     fn py_encode_batch(
         &self,
         py: Python<'_>,
         texts: Vec<String>,
-        allowed_special: HashSet<String>,
+        allowed_special: Option<HashSet<String>>,
     ) -> Vec<Vec<Rank>> {
         py.detach(|| {
             let refs: Vec<&str> = texts.iter().map(|s| s.as_str()).collect();
-            let allowed_refs: HashSet<&str> = allowed_special.iter().map(|s| s.as_str()).collect();
+            let allowed = allowed_special.unwrap_or_default();
+            let allowed_refs: HashSet<&str> = allowed.iter().map(|s| s.as_str()).collect();
             self.encode_batch(&refs, &allowed_refs)
         })
     }
@@ -931,6 +938,20 @@ impl CoreBPE {
         pyo3::types::PyBytes::new(py, &bytes)
     }
 
+    /// Decode tokens into a Python `str`, matching `tiktoken.Encoding.decode`.
+    ///
+    /// Invalid UTF-8 sequences (which can occur mid-stream when a multi-byte
+    /// character spans a token boundary) are replaced with U+FFFD, matching
+    /// tiktoken's default `errors="replace"` behavior. For strict decoding or
+    /// raw bytes, use [`decode_bytes`].
+    #[pyo3(name = "decode")]
+    fn py_decode(&self, py: Python<'_>, tokens: Vec<Rank>) -> String {
+        py.detach(|| {
+            let bytes = self.decode_bytes(&tokens);
+            String::from_utf8_lossy(&bytes).into_owned()
+        })
+    }
+
     #[pyo3(name = "decode_single_token_bytes")]
     fn py_decode_single_token_bytes<'py>(
         &self,
@@ -1,183 +0,0 @@
-"""riptoken — fast BPE tokenizer for LLMs.
-
-A drop-in compatible, faster reimplementation of OpenAI's tiktoken. riptoken
-reads the same ``.tiktoken`` vocabulary files and produces byte-identical
-output for every tested corpus.
-
-Quick start
------------
-
-The simplest way to get a tokenizer, if you have ``tiktoken`` installed:
-
-    >>> import riptoken
-    >>> enc = riptoken.get_encoding("o200k_base")
-    >>> enc.encode_ordinary("Hello, world!")
-    [13225, 11, 2375, 0]
-
-Or construct a :class:`CoreBPE` directly from a ``.tiktoken`` file:
-
-    >>> import riptoken
-    >>> ranks = riptoken.load_tiktoken_bpe("o200k_base.tiktoken")
-    >>> special_tokens = {"<|endoftext|>": 199999}
-    >>> pat = r"\\w+|\\s+"  # simplified pattern; use the full o200k pattern in prod
-    >>> enc = riptoken.CoreBPE(ranks, special_tokens, pat)
-    >>> enc.decode_bytes(enc.encode_ordinary("Hello, world!"))
-    b'Hello, world!'
-
-See https://github.com/daechoi/riptoken for full documentation, benchmarks,
-and the canonical ``o200k_base`` regex pattern.
-"""
-
-from __future__ import annotations
-
-import base64
-from pathlib import Path
-from typing import Union
-
-from riptoken._riptoken import CoreBPE
-
-__version__ = "0.2.1"
-__all__ = ["CoreBPE", "encoding_for_model", "get_encoding", "load_tiktoken_bpe"]
-
-
-def load_tiktoken_bpe(path: Union[str, Path]) -> dict[bytes, int]:
-    """Load a tiktoken ``.tiktoken`` vocabulary file.
-
-    The file format is one token per line::
-
-        <base64-encoded token bytes> <integer rank>
-
-    This is the same format OpenAI ships for ``cl100k_base``, ``o200k_base``,
-    etc. You can obtain vocabularies from the ``tiktoken`` package's cache
-    directory or download them directly from OpenAI's public CDN.
-
-    Parameters
-    ----------
-    path:
-        Filesystem path to the ``.tiktoken`` file.
-
-    Returns
-    -------
-    dict[bytes, int]
-        A dictionary mapping token bytes to their integer rank, suitable for
-        passing to :class:`CoreBPE`'s first positional argument.
-
-    Raises
-    ------
-    FileNotFoundError
-        If ``path`` does not exist.
-    ValueError
-        If a line in the file is malformed.
-    """
-    ranks: dict[bytes, int] = {}
-    with open(path, "rb") as f:
-        for lineno, line in enumerate(f, start=1):
-            if not line.strip():
-                continue
-            parts = line.split()
-            if len(parts) != 2:
-                raise ValueError(
-                    f"{path}: line {lineno}: expected '<b64> <rank>', got {line!r}"
-                )
-            token_b64, rank_str = parts
-            try:
-                token_bytes = base64.b64decode(token_b64)
-                rank = int(rank_str)
-            except (ValueError, base64.binascii.Error) as e:
-                raise ValueError(
-                    f"{path}: line {lineno}: failed to parse token/rank: {e}"
-                ) from e
-            ranks[token_bytes] = rank
-    return ranks
-
-
-def _wrap_tiktoken_encoding(enc: object) -> CoreBPE:
-    """Build a :class:`CoreBPE` from a :class:`tiktoken.Encoding` instance.
-
-    Reads the vocab, pattern string, and special tokens from stable (if
-    technically private) attributes that tiktoken has exposed since its
-    first release.
-    """
-    # These three attrs have been stable since tiktoken 0.3 and are what
-    # tiktoken itself passes into its own Rust constructor.
-    mergeable_ranks = enc._mergeable_ranks  # type: ignore[attr-defined]
-    special_tokens = enc._special_tokens  # type: ignore[attr-defined]
-    pat_str = enc._pat_str  # type: ignore[attr-defined]
-    return CoreBPE(mergeable_ranks, special_tokens, pat_str)
-
-
-def get_encoding(name: str) -> CoreBPE:
-    """Load a named tiktoken encoding (``"gpt2"``, ``"cl100k_base"``, ``"o200k_base"``, ...).
-
-    This is the riptoken equivalent of :func:`tiktoken.get_encoding`. It
-    soft-depends on the ``tiktoken`` package to supply the vocabulary file,
-    regex pattern, and special-token map, then wraps them in riptoken's
-    faster Rust core. Subsequent calls with the same ``name`` reuse
-    tiktoken's in-process cache, and the vocabulary download (if needed)
-    uses tiktoken's on-disk cache at ``~/.cache/tiktoken/`` (or wherever
-    ``TIKTOKEN_CACHE_DIR`` points).
-
-    Parameters
-    ----------
-    name:
-        A tiktoken encoding name, e.g. ``"gpt2"``, ``"r50k_base"``,
-        ``"p50k_base"``, ``"cl100k_base"``, ``"o200k_base"``.
-
-    Returns
-    -------
-    CoreBPE
-        A riptoken encoder producing byte-identical output to
-        ``tiktoken.get_encoding(name)``.
-
-    Raises
-    ------
-    ImportError
-        If the ``tiktoken`` package is not installed. Install it with
-        ``pip install tiktoken``, or construct :class:`CoreBPE` directly
-        from a ``.tiktoken`` file via :func:`load_tiktoken_bpe`.
-    """
-    try:
-        import tiktoken
-    except ImportError as e:
-        raise ImportError(
-            "riptoken.get_encoding requires the `tiktoken` package to supply "
-            "vocabulary files and regex patterns. Install it with "
-            "`pip install tiktoken`, or construct CoreBPE directly using "
-            "riptoken.load_tiktoken_bpe()."
-        ) from e
-    return _wrap_tiktoken_encoding(tiktoken.get_encoding(name))
-
-
-def encoding_for_model(model_name: str) -> CoreBPE:
-    """Load the encoding used by a specific OpenAI model.
-
-    This is the riptoken equivalent of :func:`tiktoken.encoding_for_model`.
-    For example, ``encoding_for_model("gpt-4o")`` returns an ``o200k_base``
-    encoder. See :func:`get_encoding` for dependency notes.
-
-    Parameters
-    ----------
-    model_name:
-        An OpenAI model name, e.g. ``"gpt-4"``, ``"gpt-4o"``, ``"gpt-3.5-turbo"``.
-
-    Returns
-    -------
-    CoreBPE
-        A riptoken encoder producing byte-identical output to
-        ``tiktoken.encoding_for_model(model_name)``.
-
-    Raises
-    ------
-    ImportError
-        If the ``tiktoken`` package is not installed.
-    """
-    try:
-        import tiktoken
-    except ImportError as e:
-        raise ImportError(
-            "riptoken.encoding_for_model requires the `tiktoken` package to "
-            "supply vocabulary files and regex patterns. Install it with "
-            "`pip install tiktoken`, or construct CoreBPE directly using "
-            "riptoken.load_tiktoken_bpe()."
-        ) from e
-    return _wrap_tiktoken_encoding(tiktoken.encoding_for_model(model_name))
File without changes