@hyvmind/tiktoken-ts 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +557 -0
- package/dist/bpe.d.ts +171 -0
- package/dist/bpe.d.ts.map +1 -0
- package/dist/bpe.js +478 -0
- package/dist/bpe.js.map +1 -0
- package/dist/core/byte-pair-encoding.d.ts +49 -0
- package/dist/core/byte-pair-encoding.d.ts.map +1 -0
- package/dist/core/byte-pair-encoding.js +154 -0
- package/dist/core/byte-pair-encoding.js.map +1 -0
- package/dist/core/encoding-definitions.d.ts +95 -0
- package/dist/core/encoding-definitions.d.ts.map +1 -0
- package/dist/core/encoding-definitions.js +202 -0
- package/dist/core/encoding-definitions.js.map +1 -0
- package/dist/core/index.d.ts +12 -0
- package/dist/core/index.d.ts.map +1 -0
- package/dist/core/index.js +17 -0
- package/dist/core/index.js.map +1 -0
- package/dist/core/model-to-encoding.d.ts +36 -0
- package/dist/core/model-to-encoding.d.ts.map +1 -0
- package/dist/core/model-to-encoding.js +299 -0
- package/dist/core/model-to-encoding.js.map +1 -0
- package/dist/core/tiktoken.d.ts +126 -0
- package/dist/core/tiktoken.d.ts.map +1 -0
- package/dist/core/tiktoken.js +295 -0
- package/dist/core/tiktoken.js.map +1 -0
- package/dist/core/vocab-loader.d.ts +77 -0
- package/dist/core/vocab-loader.d.ts.map +1 -0
- package/dist/core/vocab-loader.js +176 -0
- package/dist/core/vocab-loader.js.map +1 -0
- package/dist/encodings/cl100k-base.d.ts +43 -0
- package/dist/encodings/cl100k-base.d.ts.map +1 -0
- package/dist/encodings/cl100k-base.js +142 -0
- package/dist/encodings/cl100k-base.js.map +1 -0
- package/dist/encodings/claude-estimation.d.ts +136 -0
- package/dist/encodings/claude-estimation.d.ts.map +1 -0
- package/dist/encodings/claude-estimation.js +160 -0
- package/dist/encodings/claude-estimation.js.map +1 -0
- package/dist/encodings/index.d.ts +9 -0
- package/dist/encodings/index.d.ts.map +1 -0
- package/dist/encodings/index.js +13 -0
- package/dist/encodings/index.js.map +1 -0
- package/dist/encodings/o200k-base.d.ts +58 -0
- package/dist/encodings/o200k-base.d.ts.map +1 -0
- package/dist/encodings/o200k-base.js +191 -0
- package/dist/encodings/o200k-base.js.map +1 -0
- package/dist/encodings/p50k-base.d.ts +44 -0
- package/dist/encodings/p50k-base.d.ts.map +1 -0
- package/dist/encodings/p50k-base.js +64 -0
- package/dist/encodings/p50k-base.js.map +1 -0
- package/dist/index.d.ts +61 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +109 -0
- package/dist/index.js.map +1 -0
- package/dist/models.d.ts +92 -0
- package/dist/models.d.ts.map +1 -0
- package/dist/models.js +320 -0
- package/dist/models.js.map +1 -0
- package/dist/tiktoken.d.ts +198 -0
- package/dist/tiktoken.d.ts.map +1 -0
- package/dist/tiktoken.js +331 -0
- package/dist/tiktoken.js.map +1 -0
- package/dist/tokenizer.d.ts +181 -0
- package/dist/tokenizer.d.ts.map +1 -0
- package/dist/tokenizer.js +436 -0
- package/dist/tokenizer.js.map +1 -0
- package/dist/types.d.ts +127 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/types.js +6 -0
- package/dist/types.js.map +1 -0
- package/dist/utils.d.ts +152 -0
- package/dist/utils.d.ts.map +1 -0
- package/dist/utils.js +244 -0
- package/dist/utils.js.map +1 -0
- package/package.json +78 -0
package/LICENSE
ADDED
MIT License

Copyright (c) 2026 Evandro Camargo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

package/README.md
ADDED
# tiktoken-ts

A pure TypeScript port of [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs), a Rust implementation of OpenAI's tiktoken, providing exact BPE (Byte-Pair Encoding) tokenization compatible with OpenAI's models.

## Features

- **Exact BPE tokenization** - Direct port of tiktoken-rs algorithm, produces identical tokens
- **All OpenAI encodings** - `r50k_base`, `p50k_base`, `p50k_edit`, `cl100k_base`, `o200k_base`, `o200k_harmony`
- **Zero dependencies** - Pure TypeScript, works in Node.js and browsers
- **Lazy vocabulary loading** - Vocabularies loaded on-demand from OpenAI CDN (~4-10MB each)
- **Caching** - Vocabularies and tokenizer instances are cached for performance
- **Fast estimation API** - Synchronous heuristic-based counting for quick estimates
- **Model-aware** - Automatic encoding selection for GPT-4, GPT-4o, GPT-5, o-series, and more

## Installation

```bash
npm install tiktoken-ts
```

## Quick Start

### Exact BPE Tokenization (Async)

Use this for exact token counts that match OpenAI's tokenizer:

```typescript
import {
  getEncodingAsync,
  countTokensAsync,
  encodeAsync,
  decodeAsync,
} from "tiktoken-ts";

// Load encoding and tokenize
const tiktoken = await getEncodingAsync("cl100k_base");
const tokens = tiktoken.encode("Hello, world!");
console.log(tokens); // [9906, 11, 1917, 0]

// Decode back to text (round-trip works!)
const text = tiktoken.decode(tokens);
console.log(text); // "Hello, world!"

// Count tokens
const count = tiktoken.countTokens("Hello, world!");
console.log(count); // 4

// Or use convenience functions
const count2 = await countTokensAsync("Hello, world!", "cl100k_base");
const tokens2 = await encodeAsync("Hello!", "o200k_base");
const decoded = await decodeAsync(tokens2, "o200k_base");
```

### For a Specific Model (Async)

```typescript
import {
  getEncodingForModelAsync,
  countTokensForModelAsync,
} from "tiktoken-ts";

// Automatically selects the correct encoding for the model
const tiktoken = await getEncodingForModelAsync("gpt-4o");
const tokens = tiktoken.encode("Hello!");

// Or count directly
const count = await countTokensForModelAsync("Hello!", "gpt-4o");
```

### Token Estimation (Sync)

Use this for fast approximate counts when exact accuracy isn't required:

```typescript
import {
  countTokens,
  estimateMaxTokens,
  getTokenEstimation,
  fitsInContext,
} from "tiktoken-ts";

// Fast token estimation (no vocabulary loading)
const count = countTokens("Hello, world!", { model: "gpt-4o" });

// Estimate safe max_tokens to avoid truncation
const maxTokens = estimateMaxTokens(promptText, "gpt-4o", {
  desiredOutputTokens: 1000,
  safetyMargin: 0.1,
});

// Get detailed estimation with warnings
const estimation = getTokenEstimation(promptText, "gpt-4o");
if (!estimation.fitsInContext) {
  console.warn(estimation.warning);
}

// Check if text fits in context
if (fitsInContext(longText, "gpt-4o", 1000)) {
  // Text fits with 1000 tokens reserved for output
}
```

## API Reference

### Exact BPE API (Async)

#### `getEncodingAsync(encodingName)`

Load an encoding by name. Returns a `Tiktoken` instance.

```typescript
const tiktoken = await getEncodingAsync("cl100k_base");
```

#### `getEncodingForModelAsync(modelName)`

Get the appropriate encoding for a model.

```typescript
const tiktoken = await getEncodingForModelAsync("gpt-4o");
// Uses o200k_base for GPT-4o
```

#### `Tiktoken` Class Methods

```typescript
const tiktoken = await getEncodingAsync("cl100k_base");

// Encode text to tokens
const tokens = tiktoken.encode("Hello!"); // [9906, 0]
const ordinaryTokens = tiktoken.encodeOrdinary("Hello!"); // Same, no special token handling
const specialTokens = tiktoken.encodeWithSpecialTokens("<|endoftext|>"); // Handles special tokens

// Decode tokens to text
const text = tiktoken.decode(tokens); // "Hello!"
const bytes = tiktoken.decodeBytes(tokens); // Uint8Array

// Count tokens
const count = tiktoken.countTokens("Hello!"); // 2

// Properties
tiktoken.vocabSize; // Vocabulary size (excluding special tokens)
tiktoken.totalVocabSize; // Total vocabulary size
tiktoken.loaded; // Whether vocabulary is loaded
tiktoken.name; // Encoding name

// Special tokens
tiktoken.getSpecialTokens(); // Set of special token strings
tiktoken.isSpecialToken(100257); // Check if token ID is special
```

#### Convenience Functions

```typescript
// Encode/decode without managing instances
const tokens = await encodeAsync("Hello!", "cl100k_base");
const text = await decodeAsync(tokens, "cl100k_base");
const count = await countTokensAsync("Hello!", "cl100k_base");
const modelCount = await countTokensForModelAsync("Hello!", "gpt-4o");
```

### Estimation API (Sync)

#### `countTokens(text, options?)`

Fast heuristic-based token counting.

```typescript
// With default encoding (o200k_base)
const defaultCount = countTokens("Hello, world!");

// With specific model
const modelCount = countTokens("Hello, world!", { model: "gpt-4o" });

// With specific encoding
const encodingCount = countTokens("Hello, world!", { encoding: "cl100k_base" });
```

#### `countChatTokens(messages, model?)`

Count tokens in chat messages, including message overhead.

```typescript
const messages = [
  { role: "system", content: "You are helpful." },
  { role: "user", content: "Hello!" },
];
const count = countChatTokens(messages, "gpt-4o");
```

#### `estimateMaxTokens(promptText, model, options?)`

Estimate a safe `max_tokens` value for API calls.

```typescript
const maxTokens = estimateMaxTokens(prompt, "gpt-4o", {
  desiredOutputTokens: 1000,
  safetyMargin: 0.1, // 10% safety margin
  minOutputTokens: 100,
  maxOutputTokensCap: 4096,
});
```

#### `getTokenEstimation(promptText, model, options?)`

Get detailed estimation with context fit analysis.

```typescript
const estimation = getTokenEstimation(longPrompt, "gpt-4o", {
  desiredOutputTokens: 2000,
});

console.log({
  promptTokens: estimation.promptTokens,
  recommendedMaxTokens: estimation.recommendedMaxTokens,
  contextLimit: estimation.contextLimit,
  fitsInContext: estimation.fitsInContext,
  warning: estimation.warning,
});
```

### Utility Functions

```typescript
// Check context fit
fitsInContext(text, "gpt-4o", 1000); // reservedOutputTokens

// Truncate to fit
const truncated = truncateToTokenLimit(longText, 1000, "gpt-4o");

// Split into chunks
const chunks = splitIntoChunks(longText, 500, 100, "gpt-4o"); // maxTokens, overlap
```

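These utilities compose naturally: check whether a document fits, then either truncate it or process it in overlapping chunks. A minimal sketch, with `documentText` and the per-chunk handling as illustrative placeholders:

```typescript
import {
  countTokens,
  fitsInContext,
  truncateToTokenLimit,
  splitIntoChunks,
} from "tiktoken-ts";

const documentText = "..."; // some long document (placeholder)
const reservedForOutput = 1000; // tokens kept free for the model's reply

if (fitsInContext(documentText, "gpt-4o", reservedForOutput)) {
  // Small enough: send as a single prompt
  console.log("prompt tokens ~", countTokens(documentText, { model: "gpt-4o" }));
} else {
  // Too large: split into 500-token chunks with 100 tokens of overlap
  const chunks = splitIntoChunks(documentText, 500, 100, "gpt-4o");
  for (const chunk of chunks) {
    console.log("chunk tokens ~", countTokens(chunk, { model: "gpt-4o" }));
  }

  // Or simply cut the text down to a hard budget instead of chunking
  const truncated = truncateToTokenLimit(documentText, 1000, "gpt-4o");
  console.log(truncated.length);
}
```
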
### Model Configuration

```typescript
import {
  getModelConfig,
  getModelContextLimit,
  getModelMaxOutputTokens,
  getEncodingForModel,
  listModels,
} from "tiktoken-ts";

// Get full model config
const config = getModelConfig("gpt-4o");
// { name: "gpt-4o", encoding: "o200k_base", contextLimit: 128000, maxOutputTokens: 16384, family: "gpt-4o" }

// Get specific values
getModelContextLimit("gpt-4o"); // 128000
getModelMaxOutputTokens("gpt-4o"); // 16384
getEncodingForModel("gpt-4o"); // "o200k_base"

// List all supported models
listModels(); // ["gpt-5", "gpt-4o", "gpt-4", ...]
```

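These helpers pair naturally with the estimation API when budgeting a request. A minimal sketch, with `promptText` as a placeholder; the printed numbers depend on the model table shipped with your version:

```typescript
import {
  getModelContextLimit,
  getModelMaxOutputTokens,
  countTokens,
  estimateMaxTokens,
} from "tiktoken-ts";

const promptText = "..."; // placeholder prompt

// How much room does the model leave after the estimated prompt?
const contextLimit = getModelContextLimit("gpt-4o");
const promptEstimate = countTokens(promptText, { model: "gpt-4o" });
const headroom = contextLimit - promptEstimate;

// Ask the library for a safe max_tokens, capped at the model's output ceiling
const desired = Math.min(Math.max(headroom, 0), getModelMaxOutputTokens("gpt-4o"));
const maxTokens = estimateMaxTokens(promptText, "gpt-4o", {
  desiredOutputTokens: desired,
  safetyMargin: 0.1,
});
console.log({ contextLimit, promptEstimate, headroom, maxTokens });
```
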
## Encoding Selection Guide

This section explains **which encoding to use for each model** and **why**.

### Quick Reference Table

| Model Family                     | Encoding            | Type       | Accuracy       | When to Use                       |
| -------------------------------- | ------------------- | ---------- | -------------- | --------------------------------- |
| GPT-4o, GPT-4.1, GPT-5, o-series | `o200k_base`        | Exact BPE  | 100%           | Billing, debugging, decode needed |
| GPT-4, GPT-3.5-turbo             | `cl100k_base`       | Exact BPE  | 100%           | Billing, debugging, decode needed |
| Claude (all versions)            | `claude_estimation` | Estimation | ~80-90% (safe) | Context management, API limits    |
| DeepSeek, Gemini                 | `cl100k_base`       | Estimation | ~70-85%        | Rough estimates only              |
| Legacy GPT-3                     | `r50k_base`         | Exact BPE  | 100%           | Legacy applications               |
| Codex                            | `p50k_base`         | Exact BPE  | 100%           | Legacy code models                |

### Detailed Encoding Guide

#### `o200k_base` - Modern OpenAI Models (Recommended)

**Use for:** GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-5, GPT-5-mini, o1, o3, o4-mini

```typescript
// Exact tokenization (async, loads vocabulary)
const tiktoken = await getEncodingAsync("o200k_base");
const tokens = tiktoken.encode("Hello!"); // Exact tokens

// Or use model name (auto-selects o200k_base)
const tiktokenForModel = await getEncodingForModelAsync("gpt-4o");
```

**Characteristics:**

- 200,000 token vocabulary
- Most efficient for modern text (~4 chars/token)
- Required for exact billing calculations
- Supports round-trip encode/decode

#### `cl100k_base` - GPT-4 Era Models

**Use for:** GPT-4, GPT-4-turbo, GPT-3.5-turbo, text-embedding-ada-002, text-embedding-3-\*

```typescript
const tiktoken = await getEncodingAsync("cl100k_base");
```

**Characteristics:**

- 100,256 token vocabulary
- Slightly less efficient than o200k_base
- Still widely used for embeddings

#### `claude_estimation` - Anthropic Claude Models

**Use for:** All Claude models (`claude-4.5-*`, `claude-4.1-*`, `claude-4-*`, `claude-3.5-*`, `claude-3-*`, `claude-2.*`)

```typescript
// Automatic (recommended)
const count = countTokens("Hello!", { model: "claude-3-5-sonnet" });

// Explicit encoding
const explicitCount = countTokens("Hello!", { encoding: "claude_estimation" });

// Content-aware (for code, adds extra safety margin)
import { estimateClaudeTokens } from "tiktoken-ts";
const codeCount = estimateClaudeTokens(pythonCode, "code");
```

**IMPORTANT - Claude is estimation only:**

- Claude uses a **proprietary tokenizer** (not publicly available)
- We apply a **1.25x safety multiplier** to prevent API truncation
- Estimates are intentionally **conservative** (over-count)
- For exact counts, use [Anthropic's Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting)

**Why 1.25x multiplier?**

- Research shows Claude produces 16-30% more tokens than GPT-4
- English text: +16%, Math: +21%, Code: +30%
- 1.25x covers worst-case while remaining practical (see the worked numbers below)

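To make those figures concrete, here is back-of-the-envelope arithmetic using only the multipliers documented above. The 1,000-token baseline is a made-up example, and this is not the library's internal formula:

```typescript
// Hypothetical prompt that a GPT-4-style tokenizer counts as 1,000 tokens
const gpt4StyleTokens = 1000;

// Documented worst case: code-heavy content runs about +30% on Claude
const claudeWorstCase = gpt4StyleTokens * 1.3; // 1,300

// The 1.25x safety multiplier covers the English (+16%) and math (+21%) cases
const safeEstimate = Math.ceil(gpt4StyleTokens * 1.25); // 1,250

// Code gets an additional +10% via the content-aware estimateClaudeTokens path,
// which also covers the +30% worst case
const safeCodeEstimate = Math.ceil(gpt4StyleTokens * 1.25 * 1.1); // 1,375 > 1,300
```
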
#### `p50k_base` / `p50k_edit` - Legacy Codex

**Use for:** code-davinci-002, text-davinci-003, text-davinci-edit-001

```typescript
const tiktoken = await getEncodingAsync("p50k_base");
```

#### `r50k_base` - Legacy GPT-3

**Use for:** davinci, curie, babbage, ada (original GPT-3 models)

```typescript
const tiktoken = await getEncodingAsync("r50k_base");
```

### Decision Flowchart

```
Is the model from OpenAI?
├─ YES → Is it GPT-4o, GPT-4.1, GPT-5, or o-series?
│        ├─ YES → Use o200k_base (exact)
│        └─ NO → Is it GPT-4 or GPT-3.5?
│                ├─ YES → Use cl100k_base (exact)
│                └─ NO → Is it Codex or text-davinci?
│                        ├─ YES → Use p50k_base (exact)
│                        └─ NO → Use r50k_base (exact)
├─ Is the model from Anthropic (Claude)?
│  └─ YES → Use claude_estimation (safe estimate, 1.25x multiplier)
└─ Other (DeepSeek, Gemini, etc.)
   └─ Use cl100k_base estimation (rough approximation only)
```

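The same routing is exposed programmatically through the sync model helpers shown earlier. A small sketch; which model strings are accepted depends on the model table in your installed version (check `listModels()`), and the commented results follow the tables above rather than guaranteed output:

```typescript
import {
  getEncodingForModel,
  usesClaudeEstimation,
  listModels,
} from "tiktoken-ts";

// Modern and classic OpenAI models map to their exact BPE encodings
getEncodingForModel("gpt-4o"); // "o200k_base"
getEncodingForModel("gpt-4"); // expected "cl100k_base"
getEncodingForModel("text-davinci-003"); // expected "p50k_base"

// Claude models are routed to safe estimation instead of exact BPE
usesClaudeEstimation("claude-3-opus"); // true

// Everything the installed version knows about
console.log(listModels());
```
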
### Exact vs Estimation: When to Use Which

| Scenario                   | Use Exact (Async) | Use Estimation (Sync) |
| -------------------------- | ----------------- | --------------------- |
| Billing/cost calculation   | ✅                | ❌                    |
| Debugging tokenization     | ✅                | ❌                    |
| Need to decode tokens      | ✅                | ❌                    |
| Context window management  | Either            | ✅ (faster)           |
| Real-time UI feedback      | ❌ (too slow)     | ✅                    |
| Claude models              | N/A               | ✅ (only option)      |
| Batch processing           | ✅                | Either                |

## Supported Encodings

| Encoding            | Vocab Size | Type       | Models                            |
| ------------------- | ---------- | ---------- | --------------------------------- |
| `o200k_base`        | 200,000    | Exact BPE  | GPT-4o, GPT-4.1, GPT-5, o-series  |
| `o200k_harmony`     | 200,000    | Exact BPE  | gpt-oss                           |
| `cl100k_base`       | 100,256    | Exact BPE  | GPT-4, GPT-3.5-turbo, embeddings  |
| `p50k_base`         | 50,257     | Exact BPE  | Code-davinci, text-davinci-003    |
| `p50k_edit`         | 50,257     | Exact BPE  | text-davinci-edit-001             |
| `r50k_base`         | 50,257     | Exact BPE  | GPT-3 (davinci, curie, etc.)      |
| `claude_estimation` | ~22,000\*  | Estimation | All Claude models (safe estimate) |

\*Claude's actual vocabulary size is estimated at ~22,000 based on research, but the encoding uses cl100k_base patterns with a safety multiplier.

## Supported Models

### OpenAI

- **GPT-5 series**: gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-turbo
- **GPT-4.1 series**: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano (1M context!)
- **GPT-4o series**: gpt-4o, gpt-4o-mini, chatgpt-4o-latest
- **GPT-4 series**: gpt-4, gpt-4-turbo, gpt-4-32k
- **GPT-3.5 series**: gpt-3.5-turbo, gpt-3.5-turbo-16k
- **o-series**: o1, o1-mini, o3, o3-mini, o4-mini (reasoning models)
- **Embeddings**: text-embedding-ada-002, text-embedding-3-small/large
- **Fine-tuned**: ft:gpt-4o, ft:gpt-4, ft:gpt-3.5-turbo

### Anthropic Claude (Safe Estimation)

Claude models use a dedicated `claude_estimation` encoding that provides **safe** token estimates with a built-in safety margin. This is designed to prevent API truncation by intentionally over-counting tokens.

**Why is Claude different?**

Claude uses a proprietary tokenizer that is NOT publicly available. Based on research:

- Claude 3+ uses ~22,000 token vocabulary (vs OpenAI's 100K-200K)
- Claude produces 16-30% MORE tokens than GPT-4 for equivalent content
- Average ~3.5 characters per token (vs GPT-4's ~4)

**Our solution:**

The `claude_estimation` encoding applies a **1.25x safety multiplier** to ensure estimates err on over-counting. This prevents API truncation while still providing useful estimates.

```typescript
import {
  countTokens,
  usesClaudeEstimation,
  estimateClaudeTokens,
} from "tiktoken-ts";

// Automatic safe estimation for Claude models
const count = countTokens("Hello, Claude!", { model: "claude-4-5-sonnet" });

// Check if model uses Claude estimation
if (usesClaudeEstimation("claude-3-opus")) {
  console.log("This uses safe Claude estimation");
}

// Content-aware estimation (code has additional +10% multiplier)
const codeCount = estimateClaudeTokens(pythonCode, "code");
```

**For exact Claude token counts**, use Anthropic's official API:

- [Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting)

Supported Claude models:

- Claude 4.5, 4.1, 4, 3.5, 3, 2 series

### Others (Estimation only)

- DeepSeek, Gemini (using cl100k_base approximation)

## Accuracy

### Exact BPE API

The async API produces **identical tokens** to OpenAI's tiktoken and tiktoken-rs. Use this when:

- You need exact token counts for billing
- You're debugging tokenization issues
- You need to decode tokens back to text

### Estimation API

The sync estimation API uses heuristics and is:

- **Fast** - No vocabulary loading (instant)
- **Approximate** - Typically within ±10-15% for English (OpenAI models)
- **Conservative** - Tends to slightly over-estimate, safer for API calls

Use estimation when:

- You need quick approximate counts
- You're doing context window management
- Exact counts aren't critical (see the comparison sketch below)

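One practical way to judge that trade-off is to compare the instant estimate against the exact async count on a sample of your own text. A minimal sketch; the sample sentence is a placeholder and the actual numbers depend on your content:

```typescript
import { countTokens, countTokensAsync } from "tiktoken-ts";

const sample = "The quick brown fox jumps over the lazy dog.";

// Instant heuristic estimate (no vocabulary download)
const estimated = countTokens(sample, { model: "gpt-4o" });

// Exact BPE count (downloads and caches the o200k_base vocabulary on first use)
const exact = await countTokensAsync(sample, "o200k_base");

console.log({ estimated, exact, driftPct: ((estimated - exact) / exact) * 100 });
```
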
### Claude Estimation

For Claude models, the estimation is **intentionally conservative** with a 1.25x safety multiplier because:

- Claude's tokenizer is proprietary (not publicly available)
- Claude produces 16-30% more tokens than GPT-4 for equivalent content
- Over-estimation is safer than under-estimation for API limits

For exact Claude counts, use [Anthropic's Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting).

## Browser Usage

The exact BPE API works in browsers but requires fetching vocabulary files (~4-10MB each). Vocabularies are cached after first load.

```typescript
// Works in browsers
const tiktoken = await getEncodingAsync("cl100k_base");
const tokens = tiktoken.encode("Hello!");
```

For bundle-size-sensitive applications, consider:

1. Using the estimation API (zero network requests)
2. Pre-loading vocabularies at app startup (see the sketch below)
3. Using a service worker to cache vocabularies

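A minimal pre-loading sketch, assuming the app only needs the two modern OpenAI encodings (adjust the list to what you actually use); it relies on the vocabulary/instance caching described under Features:

```typescript
import { getEncodingAsync } from "tiktoken-ts";

// Kick off vocabulary downloads once, early in app startup.
// Later getEncodingAsync calls for the same encodings resolve from the cache.
export const tokenizersReady = Promise.all([
  getEncodingAsync("o200k_base"),
  getEncodingAsync("cl100k_base"),
]);

// Anywhere else in the app:
const [o200k] = await tokenizersReady;
console.log(o200k.countTokens("warmed up"));
```
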
## Comparison with Other Libraries

| Library         | Exact BPE | Sync API        | Bundle Size | Dependencies   |
| --------------- | --------- | --------------- | ----------- | -------------- |
| **tiktoken-ts** | ✅        | ✅ (estimation) | ~50KB       | 0              |
| tiktoken (WASM) | ✅        | ✅              | ~4MB        | WASM           |
| gpt-tokenizer   | ✅        | ✅              | ~10MB       | Embedded vocab |
| gpt-3-encoder   | ❌        | ✅              | ~2MB        | r50k only      |

## Development

```bash
# Install dependencies
npm install

# Build
npm run build

# Run tests
npm test

# Type check
npm run typecheck

# Lint
npm run lint

# Format
npm run format
```

## Architecture

See [ARCHITECTURE.md](./ARCHITECTURE.md) for detailed implementation notes.

Key design decisions:

- Vocabularies loaded from CDN (not embedded) to keep package small
- Dual API: exact async + fast sync estimation
- Direct port of tiktoken-rs BPE algorithm for correctness
- Global caching of vocabularies and instances

## License

MIT

## Credits

- [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs) - Original Rust implementation
- [tiktoken](https://github.com/openai/tiktoken) - OpenAI's Python implementation