@hyvmind/tiktoken-ts 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (75)
  1. package/LICENSE +21 -0
  2. package/README.md +557 -0
  3. package/dist/bpe.d.ts +171 -0
  4. package/dist/bpe.d.ts.map +1 -0
  5. package/dist/bpe.js +478 -0
  6. package/dist/bpe.js.map +1 -0
  7. package/dist/core/byte-pair-encoding.d.ts +49 -0
  8. package/dist/core/byte-pair-encoding.d.ts.map +1 -0
  9. package/dist/core/byte-pair-encoding.js +154 -0
  10. package/dist/core/byte-pair-encoding.js.map +1 -0
  11. package/dist/core/encoding-definitions.d.ts +95 -0
  12. package/dist/core/encoding-definitions.d.ts.map +1 -0
  13. package/dist/core/encoding-definitions.js +202 -0
  14. package/dist/core/encoding-definitions.js.map +1 -0
  15. package/dist/core/index.d.ts +12 -0
  16. package/dist/core/index.d.ts.map +1 -0
  17. package/dist/core/index.js +17 -0
  18. package/dist/core/index.js.map +1 -0
  19. package/dist/core/model-to-encoding.d.ts +36 -0
  20. package/dist/core/model-to-encoding.d.ts.map +1 -0
  21. package/dist/core/model-to-encoding.js +299 -0
  22. package/dist/core/model-to-encoding.js.map +1 -0
  23. package/dist/core/tiktoken.d.ts +126 -0
  24. package/dist/core/tiktoken.d.ts.map +1 -0
  25. package/dist/core/tiktoken.js +295 -0
  26. package/dist/core/tiktoken.js.map +1 -0
  27. package/dist/core/vocab-loader.d.ts +77 -0
  28. package/dist/core/vocab-loader.d.ts.map +1 -0
  29. package/dist/core/vocab-loader.js +176 -0
  30. package/dist/core/vocab-loader.js.map +1 -0
  31. package/dist/encodings/cl100k-base.d.ts +43 -0
  32. package/dist/encodings/cl100k-base.d.ts.map +1 -0
  33. package/dist/encodings/cl100k-base.js +142 -0
  34. package/dist/encodings/cl100k-base.js.map +1 -0
  35. package/dist/encodings/claude-estimation.d.ts +136 -0
  36. package/dist/encodings/claude-estimation.d.ts.map +1 -0
  37. package/dist/encodings/claude-estimation.js +160 -0
  38. package/dist/encodings/claude-estimation.js.map +1 -0
  39. package/dist/encodings/index.d.ts +9 -0
  40. package/dist/encodings/index.d.ts.map +1 -0
  41. package/dist/encodings/index.js +13 -0
  42. package/dist/encodings/index.js.map +1 -0
  43. package/dist/encodings/o200k-base.d.ts +58 -0
  44. package/dist/encodings/o200k-base.d.ts.map +1 -0
  45. package/dist/encodings/o200k-base.js +191 -0
  46. package/dist/encodings/o200k-base.js.map +1 -0
  47. package/dist/encodings/p50k-base.d.ts +44 -0
  48. package/dist/encodings/p50k-base.d.ts.map +1 -0
  49. package/dist/encodings/p50k-base.js +64 -0
  50. package/dist/encodings/p50k-base.js.map +1 -0
  51. package/dist/index.d.ts +61 -0
  52. package/dist/index.d.ts.map +1 -0
  53. package/dist/index.js +109 -0
  54. package/dist/index.js.map +1 -0
  55. package/dist/models.d.ts +92 -0
  56. package/dist/models.d.ts.map +1 -0
  57. package/dist/models.js +320 -0
  58. package/dist/models.js.map +1 -0
  59. package/dist/tiktoken.d.ts +198 -0
  60. package/dist/tiktoken.d.ts.map +1 -0
  61. package/dist/tiktoken.js +331 -0
  62. package/dist/tiktoken.js.map +1 -0
  63. package/dist/tokenizer.d.ts +181 -0
  64. package/dist/tokenizer.d.ts.map +1 -0
  65. package/dist/tokenizer.js +436 -0
  66. package/dist/tokenizer.js.map +1 -0
  67. package/dist/types.d.ts +127 -0
  68. package/dist/types.d.ts.map +1 -0
  69. package/dist/types.js +6 -0
  70. package/dist/types.js.map +1 -0
  71. package/dist/utils.d.ts +152 -0
  72. package/dist/utils.d.ts.map +1 -0
  73. package/dist/utils.js +244 -0
  74. package/dist/utils.js.map +1 -0
  75. package/package.json +78 -0
package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Evandro Camargo
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,557 @@
1
+ # tiktoken-ts
2
+
3
+ A pure TypeScript port of [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs), a Rust implementation of OpenAI's tokenizer, providing exact BPE (Byte-Pair Encoding) tokenization compatible with OpenAI's models.
4
+
5
+ ## Features
6
+
7
+ - **Exact BPE tokenization** - Direct port of the tiktoken-rs algorithm; produces tokens identical to OpenAI's tiktoken
8
+ - **All OpenAI encodings** - `r50k_base`, `p50k_base`, `p50k_edit`, `cl100k_base`, `o200k_base`, `o200k_harmony`
9
+ - **Zero dependencies** - Pure TypeScript, works in Node.js and browsers
10
+ - **Lazy vocabulary loading** - Vocabularies loaded on-demand from OpenAI CDN (~4-10MB each)
11
+ - **Caching** - Vocabularies and tokenizer instances are cached for performance
12
+ - **Fast estimation API** - Synchronous heuristic-based counting for quick estimates
13
+ - **Model-aware** - Automatic encoding selection for GPT-4, GPT-4o, GPT-5, o-series, and more
14
+
15
+ ## Installation
16
+
17
+ ```bash
18
+ npm install tiktoken-ts
19
+ ```
20
+
21
+ ## Quick Start
22
+
23
+ ### Exact BPE Tokenization (Async)
24
+
25
+ Use this for exact token counts that match OpenAI's tokenizer:
26
+
27
+ ```typescript
28
+ import {
29
+   getEncodingAsync,
30
+   countTokensAsync,
31
+   encodeAsync,
32
+   decodeAsync,
33
+ } from "tiktoken-ts";
34
+
35
+ // Load encoding and tokenize
36
+ const tiktoken = await getEncodingAsync("cl100k_base");
37
+ const tokens = tiktoken.encode("Hello, world!");
38
+ console.log(tokens); // [9906, 11, 1917, 0]
39
+
40
+ // Decode back to text (round-trip works!)
41
+ const text = tiktoken.decode(tokens);
42
+ console.log(text); // "Hello, world!"
43
+
44
+ // Count tokens
45
+ const count = tiktoken.countTokens("Hello, world!");
46
+ console.log(count); // 4
47
+
48
+ // Or use convenience functions
49
+ const count2 = await countTokensAsync("Hello, world!", "cl100k_base");
50
+ const tokens2 = await encodeAsync("Hello!", "o200k_base");
51
+ const decoded = await decodeAsync(tokens2, "o200k_base");
52
+ ```
53
+
54
+ ### For a Specific Model (Async)
55
+
56
+ ```typescript
57
+ import {
58
+   getEncodingForModelAsync,
59
+   countTokensForModelAsync,
60
+ } from "tiktoken-ts";
61
+
62
+ // Automatically selects the correct encoding for the model
63
+ const tiktoken = await getEncodingForModelAsync("gpt-4o");
64
+ const tokens = tiktoken.encode("Hello!");
65
+
66
+ // Or count directly
67
+ const count = await countTokensForModelAsync("Hello!", "gpt-4o");
68
+ ```
69
+
70
+ ### Token Estimation (Sync)
71
+
72
+ Use this for fast approximate counts when exact accuracy isn't required:
73
+
74
+ ```typescript
75
+ import {
76
+   countTokens,
77
+   estimateMaxTokens,
78
+   getTokenEstimation,
79
+   fitsInContext,
80
+ } from "tiktoken-ts";
81
+
82
+ // Fast token estimation (no vocabulary loading)
83
+ const count = countTokens("Hello, world!", { model: "gpt-4o" });
84
+
85
+ // Estimate safe max_tokens to avoid truncation
86
+ const maxTokens = estimateMaxTokens(promptText, "gpt-4o", {
87
+   desiredOutputTokens: 1000,
88
+   safetyMargin: 0.1,
89
+ });
90
+
91
+ // Get detailed estimation with warnings
92
+ const estimation = getTokenEstimation(promptText, "gpt-4o");
93
+ if (!estimation.fitsInContext) {
94
+   console.warn(estimation.warning);
95
+ }
96
+
97
+ // Check if text fits in context
98
+ if (fitsInContext(longText, "gpt-4o", 1000)) {
99
+   // Text fits with 1000 tokens reserved for output
100
+ }
101
+ ```
102
+
103
+ ## API Reference
104
+
105
+ ### Exact BPE API (Async)
106
+
107
+ #### `getEncodingAsync(encodingName)`
108
+
109
+ Load an encoding by name. Returns a `Tiktoken` instance.
110
+
111
+ ```typescript
112
+ const tiktoken = await getEncodingAsync("cl100k_base");
113
+ ```
114
+
115
+ #### `getEncodingForModelAsync(modelName)`
116
+
117
+ Get the appropriate encoding for a model.
118
+
119
+ ```typescript
120
+ const tiktoken = await getEncodingForModelAsync("gpt-4o");
121
+ // Uses o200k_base for GPT-4o
122
+ ```
123
+
124
+ #### `Tiktoken` Class Methods
125
+
126
+ ```typescript
127
+ const tiktoken = await getEncodingAsync("cl100k_base");
128
+
129
+ // Encode text to tokens
130
+ const tokens = tiktoken.encode("Hello!"); // [9906, 0]
131
+ const ordinaryTokens = tiktoken.encodeOrdinary("Hello!"); // Same tokens, no special token handling
132
+ const specialTokens = tiktoken.encodeWithSpecialTokens("<|endoftext|>"); // Handles special tokens
133
+
134
+ // Decode tokens to text
135
+ const text = tiktoken.decode(tokens); // "Hello!"
136
+ const bytes = tiktoken.decodeBytes(tokens); // Uint8Array
137
+
138
+ // Count tokens
139
+ const count = tiktoken.countTokens("Hello!"); // 2
140
+
141
+ // Properties
142
+ tiktoken.vocabSize; // Vocabulary size (excluding special tokens)
143
+ tiktoken.totalVocabSize; // Total vocabulary size
144
+ tiktoken.loaded; // Whether vocabulary is loaded
145
+ tiktoken.name; // Encoding name
146
+
147
+ // Special tokens
148
+ tiktoken.getSpecialTokens(); // Set of special token strings
149
+ tiktoken.isSpecialToken(100257); // Check if token ID is special
150
+ ```
151
+
152
+ #### Convenience Functions
153
+
154
+ ```typescript
155
+ // Encode/decode without managing instances
156
+ const tokens = await encodeAsync("Hello!", "cl100k_base");
157
+ const text = await decodeAsync(tokens, "cl100k_base");
158
+ const count = await countTokensAsync("Hello!", "cl100k_base");
159
+ const modelCount = await countTokensForModelAsync("Hello!", "gpt-4o");
160
+ ```
161
+
162
+ ### Estimation API (Sync)
163
+
164
+ #### `countTokens(text, options?)`
165
+
166
+ Fast heuristic-based token counting.
167
+
168
+ ```typescript
169
+ // With default encoding (o200k_base)
170
+ const count = countTokens("Hello, world!");
171
+
172
+ // With specific model
173
+ const count = countTokens("Hello, world!", { model: "gpt-4o" });
174
+
175
+ // With specific encoding
176
+ const count = countTokens("Hello, world!", { encoding: "cl100k_base" });
177
+ ```
178
+
179
+ #### `countChatTokens(messages, model?)`
180
+
181
+ Count tokens in chat messages, including message overhead.
182
+
183
+ ```typescript
184
+ const messages = [
185
+ { role: "system", content: "You are helpful." },
186
+ { role: "user", content: "Hello!" },
187
+ ];
188
+ const count = countChatTokens(messages, "gpt-4o");
189
+ ```
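+
+ The exact per-message overhead is internal to the package; the sketch below is an illustrative approximation only, assuming the commonly cited figure of roughly 3-4 extra tokens per message plus a few tokens priming the reply, layered on top of the documented `countTokens`. The `ChatMessageLike` interface and `approximateChatTokens` helper are hypothetical names, not package exports.
+
+ ```typescript
+ import { countTokens } from "tiktoken-ts";
+
+ // Hypothetical message shape for illustration; the package may define its own type.
+ interface ChatMessageLike {
+   role: "system" | "user" | "assistant";
+   content: string;
+ }
+
+ // Rough reconstruction of chat overhead accounting (not the library's internals):
+ // assume ~4 overhead tokens per message (role, separators) and ~3 tokens that
+ // prime the assistant's reply, on top of the per-message content estimates.
+ function approximateChatTokens(messages: ChatMessageLike[], model = "gpt-4o"): number {
+   const TOKENS_PER_MESSAGE = 4; // assumed overhead per message
+   const REPLY_PRIMER = 3; // assumed tokens priming the reply
+   let total = REPLY_PRIMER;
+   for (const message of messages) {
+     total += TOKENS_PER_MESSAGE + countTokens(message.content, { model });
+   }
+   return total;
+ }
+ ```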
190
+
191
+ #### `estimateMaxTokens(promptText, model, options?)`
192
+
193
+ Estimate a safe `max_tokens` value for API calls.
194
+
195
+ ```typescript
196
+ const maxTokens = estimateMaxTokens(prompt, "gpt-4o", {
197
+   desiredOutputTokens: 1000,
198
+   safetyMargin: 0.1, // 10% safety margin
199
+   minOutputTokens: 100,
200
+   maxOutputTokensCap: 4096,
201
+ });
202
+ ```
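+
+ For intuition, the arithmetic behind such an estimate can be sketched from the documented helpers. This is not the package's internal formula; `sketchMaxTokens` is a hypothetical name, and the clamping policy (how `minOutputTokens` and `maxOutputTokensCap` interact) is simplified.
+
+ ```typescript
+ import {
+   countTokens,
+   getModelContextLimit,
+   getModelMaxOutputTokens,
+ } from "tiktoken-ts";
+
+ // Example numbers: gpt-4o has a 128,000-token context. With a 100,000-token
+ // prompt and a 10% safety margin, 128,000 - 100,000 * 1.1 = 18,000 tokens remain,
+ // which is then clamped to the model's max output (16,384) and the desired output size.
+ function sketchMaxTokens(prompt: string, model: string, desired = 1000, margin = 0.1): number {
+   const promptTokens = countTokens(prompt, { model });
+   const available = getModelContextLimit(model) - Math.ceil(promptTokens * (1 + margin));
+   const cap = Math.min(desired, getModelMaxOutputTokens(model));
+   return Math.max(0, Math.min(cap, available));
+ }
+ ```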
203
+
204
+ #### `getTokenEstimation(promptText, model, options?)`
205
+
206
+ Get detailed estimation with context fit analysis.
207
+
208
+ ```typescript
209
+ const estimation = getTokenEstimation(longPrompt, "gpt-4o", {
210
+   desiredOutputTokens: 2000,
211
+ });
212
+
213
+ console.log({
214
+   promptTokens: estimation.promptTokens,
215
+   recommendedMaxTokens: estimation.recommendedMaxTokens,
216
+   contextLimit: estimation.contextLimit,
217
+   fitsInContext: estimation.fitsInContext,
218
+   warning: estimation.warning,
219
+ });
220
+ ```
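+
+ One way to act on the estimation before making an API call is sketched below; it combines `getTokenEstimation` with `truncateToTokenLimit` (documented under Utility Functions). The 2,000-token output budget and the trimming policy are arbitrary choices for the example, not package defaults.
+
+ ```typescript
+ import { getTokenEstimation, truncateToTokenLimit } from "tiktoken-ts";
+
+ // If the prompt does not fit, warn and trim it so the desired output still has room.
+ function preparePrompt(prompt: string, model = "gpt-4o"): string {
+   const estimation = getTokenEstimation(prompt, model, { desiredOutputTokens: 2000 });
+   if (estimation.fitsInContext) return prompt;
+   console.warn(estimation.warning);
+   const budget = estimation.contextLimit - 2000; // leave room for the reply
+   return truncateToTokenLimit(prompt, budget, model);
+ }
+ ```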
221
+
222
+ ### Utility Functions
223
+
224
+ ```typescript
225
+ // Check context fit
226
+ fitsInContext(text, "gpt-4o", 1000); // reservedOutputTokens
227
+
228
+ // Truncate to fit
229
+ const truncated = truncateToTokenLimit(longText, 1000, "gpt-4o");
230
+
231
+ // Split into chunks
232
+ const chunks = splitIntoChunks(longText, 500, 100, "gpt-4o"); // maxTokens, overlap
233
+ ```
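+
+ A typical use of `splitIntoChunks` is preparing a long document for embedding or summarization; the sketch below assumes it returns an array of strings and that `longDocument` already holds your text.
+
+ ```typescript
+ import { splitIntoChunks, countTokens } from "tiktoken-ts";
+
+ declare const longDocument: string; // your own text
+
+ // At most ~500 tokens per chunk; consecutive chunks share ~100 tokens of overlap
+ // so nothing meaningful is lost at a boundary (counts are heuristic estimates).
+ const docChunks = splitIntoChunks(longDocument, 500, 100, "gpt-4o");
+ for (const [i, chunk] of docChunks.entries()) {
+   console.log(`chunk ${i}: ~${countTokens(chunk, { model: "gpt-4o" })} tokens`);
+ }
+ ```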
234
+
235
+ ### Model Configuration
236
+
237
+ ```typescript
238
+ import {
239
+   getModelConfig,
240
+   getModelContextLimit,
241
+   getModelMaxOutputTokens,
242
+   getEncodingForModel,
243
+   listModels,
244
+ } from "tiktoken-ts";
245
+
246
+ // Get full model config
247
+ const config = getModelConfig("gpt-4o");
248
+ // { name: "gpt-4o", encoding: "o200k_base", contextLimit: 128000, maxOutputTokens: 16384, family: "gpt-4o" }
249
+
250
+ // Get specific values
251
+ getModelContextLimit("gpt-4o"); // 128000
252
+ getModelMaxOutputTokens("gpt-4o"); // 16384
253
+ getEncodingForModel("gpt-4o"); // "o200k_base"
254
+
255
+ // List all supported models
256
+ listModels(); // ["gpt-5", "gpt-4o", "gpt-4", ...]
257
+ ```
258
+
259
+ ## Encoding Selection Guide
260
+
261
+ This section explains **which encoding to use for each model** and **why**.
262
+
263
+ ### Quick Reference Table
264
+
265
+ | Model Family | Encoding | Type | Accuracy | When to Use |
266
+ | -------------------------------- | ------------------- | ---------- | -------------- | --------------------------------- |
267
+ | GPT-4o, GPT-4.1, GPT-5, o-series | `o200k_base` | Exact BPE | 100% | Billing, debugging, decode needed |
268
+ | GPT-4, GPT-3.5-turbo | `cl100k_base` | Exact BPE | 100% | Billing, debugging, decode needed |
269
+ | Claude (all versions) | `claude_estimation` | Estimation | ~80-90% (safe) | Context management, API limits |
270
+ | DeepSeek, Gemini | `cl100k_base` | Estimation | ~70-85% | Rough estimates only |
271
+ | Legacy GPT-3 | `r50k_base` | Exact BPE | 100% | Legacy applications |
272
+ | Codex | `p50k_base` | Exact BPE | 100% | Legacy code models |
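+
+ Applied in code, the table above boils down to "exact BPE when the model maps to an OpenAI encoding, safe estimation otherwise". The router below is a hypothetical helper (not a package export) built only from the documented `usesClaudeEstimation`, `getEncodingForModel`, `countTokens`, and `countTokensAsync`, and it assumes `getEncodingForModel` resolves every non-Claude model the package knows about.
+
+ ```typescript
+ import {
+   countTokens,
+   countTokensAsync,
+   getEncodingForModel,
+   usesClaudeEstimation,
+ } from "tiktoken-ts";
+
+ async function countForAnyModel(text: string, model: string): Promise<number> {
+   if (usesClaudeEstimation(model)) {
+     // Claude: proprietary tokenizer, so use the conservative sync estimate.
+     return countTokens(text, { model });
+   }
+   // OpenAI models: load the exact encoding (o200k_base, cl100k_base, ...).
+   return countTokensAsync(text, getEncodingForModel(model));
+ }
+ ```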
273
+
274
+ ### Detailed Encoding Guide
275
+
276
+ #### `o200k_base` - Modern OpenAI Models (Recommended)
277
+
278
+ **Use for:** GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini, GPT-5, GPT-5-mini, o1, o3, o4-mini
279
+
280
+ ```typescript
281
+ // Exact tokenization (async, loads vocabulary)
282
+ const tiktoken = await getEncodingAsync("o200k_base");
283
+ const tokens = tiktoken.encode("Hello!"); // Exact tokens
284
+
285
+ // Or use model name (auto-selects o200k_base)
286
+ const modelTiktoken = await getEncodingForModelAsync("gpt-4o");
287
+ ```
288
+
289
+ **Characteristics:**
290
+
291
+ - 200,000 token vocabulary
292
+ - Most efficient for modern text (~4 chars/token)
293
+ - Required for exact billing calculations
294
+ - Supports round-trip encode/decode
295
+
296
+ #### `cl100k_base` - GPT-4 Era Models
297
+
298
+ **Use for:** GPT-4, GPT-4-turbo, GPT-3.5-turbo, text-embedding-ada-002, text-embedding-3-\*
299
+
300
+ ```typescript
301
+ const tiktoken = await getEncodingAsync("cl100k_base");
302
+ ```
303
+
304
+ **Characteristics:**
305
+
306
+ - 100,256 token vocabulary
307
+ - Slightly less efficient than o200k_base
308
+ - Still widely used for embeddings
309
+
310
+ #### `claude_estimation` - Anthropic Claude Models
311
+
312
+ **Use for:** All Claude models (claude-4.5-\*, claude-4.1-\*, claude-4-\*, claude-3.5-\*, claude-3-\*, claude-2.\*)
313
+
314
+ ```typescript
315
+ // Automatic (recommended)
316
+ const count = countTokens("Hello!", { model: "claude-3-5-sonnet" });
317
+
318
+ // Explicit encoding
319
+ const count = countTokens("Hello!", { encoding: "claude_estimation" });
320
+
321
+ // Content-aware (for code, adds extra safety margin)
322
+ import { estimateClaudeTokens } from "tiktoken-ts";
323
+ const codeCount = estimateClaudeTokens(pythonCode, "code");
324
+ ```
325
+
326
+ **IMPORTANT - Claude is estimation only:**
327
+
328
+ - Claude uses a **proprietary tokenizer** (not publicly available)
329
+ - We apply a **1.25x safety multiplier** to prevent API truncation
330
+ - Estimates are intentionally **conservative** (over-count)
331
+ - For exact counts, use [Anthropic's Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting)
332
+
333
+ **Why 1.25x multiplier?**
334
+
335
+ - Research shows Claude produces 16-30% more tokens than GPT-4
336
+ - English text: +16%, Math: +21%, Code: +30%
337
+ - 1.25x covers the common range while remaining practical; code-heavy input gets an extra margin via `estimateClaudeTokens` (see the sketch below)
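+
+ To make the margin concrete, the comparison below uses only the estimation API documented above. The exact ratio depends on the package's internal heuristics, so treat the ~25% gap as indicative rather than guaranteed.
+
+ ```typescript
+ import { countTokens } from "tiktoken-ts";
+
+ const sample = "Explain byte pair encoding in two short paragraphs.";
+
+ // GPT-style estimate vs. the deliberately conservative Claude estimate.
+ // The Claude figure should come out roughly 25% higher for plain English text;
+ // that headroom is what protects against API truncation.
+ const gptEstimate = countTokens(sample, { model: "gpt-4o" });
+ const claudeEstimate = countTokens(sample, { model: "claude-3-5-sonnet" });
+ console.log({ gptEstimate, claudeEstimate });
+ ```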
338
+
339
+ #### `p50k_base` / `p50k_edit` - Legacy Codex
340
+
341
+ **Use for:** code-davinci-002, text-davinci-003, text-davinci-edit-001
342
+
343
+ ```typescript
344
+ const tiktoken = await getEncodingAsync("p50k_base");
345
+ ```
346
+
347
+ #### `r50k_base` - Legacy GPT-3
348
+
349
+ **Use for:** davinci, curie, babbage, ada (original GPT-3 models)
350
+
351
+ ```typescript
352
+ const tiktoken = await getEncodingAsync("r50k_base");
353
+ ```
354
+
355
+ ### Decision Flowchart
356
+
357
+ ```
358
+ Is the model from OpenAI?
359
+ ├─ YES → Is it GPT-4o, GPT-4.1, GPT-5, or o-series?
360
+ │        ├─ YES → Use o200k_base (exact)
361
+ │        └─ NO → Is it GPT-4 or GPT-3.5?
362
+ │                 ├─ YES → Use cl100k_base (exact)
363
+ │                 └─ NO → Is it Codex or text-davinci?
364
+ │                          ├─ YES → Use p50k_base (exact)
365
+ │                          └─ NO → Use r50k_base (exact)
366
+ └─ NO → Is the model from Anthropic (Claude)?
367
+          ├─ YES → Use claude_estimation (safe estimate, 1.25x multiplier)
368
+          └─ NO → Other (DeepSeek, Gemini, etc.)
369
+                   └─ Use cl100k_base estimation (rough approximation only)
370
+ ```
371
+
372
+ ### Exact vs Estimation: When to Use Which
373
+
374
+ | Scenario | Use Exact (Async) | Use Estimation (Sync) |
375
+ | ------------------------- | ----------------- | --------------------- |
376
+ | Billing/cost calculation | ✅ | ❌ |
377
+ | Debugging tokenization | ✅ | ❌ |
378
+ | Need to decode tokens | ✅ | ❌ |
379
+ | Context window management | Either | ✅ (faster) |
380
+ | Real-time UI feedback | ❌ (too slow) | ✅ |
381
+ | Claude models | N/A | ✅ (only option) |
382
+ | Batch processing | ✅ | Either |
383
+
384
+ ## Supported Encodings
385
+
386
+ | Encoding | Vocab Size | Type | Models |
387
+ | ------------------- | ---------- | ---------- | --------------------------------- |
388
+ | `o200k_base` | 200,000 | Exact BPE | GPT-4o, GPT-4.1, GPT-5, o-series |
389
+ | `o200k_harmony` | 200,000 | Exact BPE | gpt-oss |
390
+ | `cl100k_base` | 100,256 | Exact BPE | GPT-4, GPT-3.5-turbo, embeddings |
391
+ | `p50k_base` | 50,257 | Exact BPE | Code-davinci, text-davinci-003 |
392
+ | `p50k_edit` | 50,257 | Exact BPE | text-davinci-edit-001 |
393
+ | `r50k_base` | 50,257 | Exact BPE | GPT-3 (davinci, curie, etc.) |
394
+ | `claude_estimation` | ~22,000\* | Estimation | All Claude models (safe estimate) |
395
+
396
+ \*Claude's actual vocabulary size is estimated at ~22,000 based on research, but the encoding uses cl100k_base patterns with a safety multiplier.
397
+
398
+ ## Supported Models
399
+
400
+ ### OpenAI
401
+
402
+ - **GPT-5 series**: gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-turbo
403
+ - **GPT-4.1 series**: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano (1M context!)
404
+ - **GPT-4o series**: gpt-4o, gpt-4o-mini, chatgpt-4o-latest
405
+ - **GPT-4 series**: gpt-4, gpt-4-turbo, gpt-4-32k
406
+ - **GPT-3.5 series**: gpt-3.5-turbo, gpt-3.5-turbo-16k
407
+ - **o-series**: o1, o1-mini, o3, o3-mini, o4-mini (reasoning models)
408
+ - **Embeddings**: text-embedding-ada-002, text-embedding-3-small/large
409
+ - **Fine-tuned**: ft:gpt-4o, ft:gpt-4, ft:gpt-3.5-turbo
410
+
411
+ ### Anthropic Claude (Safe Estimation)
412
+
413
+ Claude models use a dedicated `claude_estimation` encoding that provides **safe** token estimates with a built-in safety margin. This is designed to prevent API truncation by intentionally over-counting tokens.
414
+
415
+ **Why is Claude different?**
416
+
417
+ Claude uses a proprietary tokenizer that is NOT publicly available. Based on research:
418
+
419
+ - Claude 3+ uses ~22,000 token vocabulary (vs OpenAI's 100K-200K)
420
+ - Claude produces 16-30% MORE tokens than GPT-4 for equivalent content
421
+ - Average ~3.5 characters per token (vs GPT-4's ~4)
422
+
423
+ **Our solution:**
424
+
425
+ The `claude_estimation` encoding applies a **1.25x safety multiplier** to ensure estimates err on over-counting. This prevents API truncation while still providing useful estimates.
426
+
427
+ ```typescript
428
+ import {
429
+   countTokens,
430
+   usesClaudeEstimation,
431
+   estimateClaudeTokens,
432
+ } from "tiktoken-ts";
433
+
434
+ // Automatic safe estimation for Claude models
435
+ const count = countTokens("Hello, Claude!", { model: "claude-4-5-sonnet" });
436
+
437
+ // Check if model uses Claude estimation
438
+ if (usesClaudeEstimation("claude-3-opus")) {
439
+ console.log("This uses safe Claude estimation");
440
+ }
441
+
442
+ // Content-aware estimation (code has additional +10% multiplier)
443
+ const codeCount = estimateClaudeTokens(pythonCode, "code");
444
+ ```
445
+
446
+ **For exact Claude token counts**, use Anthropic's official API:
447
+
448
+ - [Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting)
449
+
450
+ Supported Claude models:
451
+
452
+ - Claude 4.5, 4.1, 4, 3.5, 3, 2 series
453
+
454
+ ### Others (Estimation only)
455
+
456
+ - DeepSeek, Gemini (using cl100k_base approximation)
457
+
458
+ ## Accuracy
459
+
460
+ ### Exact BPE API
461
+
462
+ The async API produces **identical tokens** to OpenAI's tiktoken and tiktoken-rs. Use this when:
463
+
464
+ - You need exact token counts for billing
465
+ - You're debugging tokenization issues
466
+ - You need to decode tokens back to text
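+
+ A quick way to sanity-check the exactness claim in your own environment is a round-trip test using the async API shown earlier; the expected IDs below are the cl100k_base values quoted in the Quick Start.
+
+ ```typescript
+ import { getEncodingAsync } from "tiktoken-ts";
+
+ // Encode then decode should return the original text, and the IDs should match
+ // what OpenAI's tiktoken produces for cl100k_base.
+ const tiktoken = await getEncodingAsync("cl100k_base");
+ const sample = "Hello, world!";
+ const tokens = tiktoken.encode(sample); // [9906, 11, 1917, 0]
+ console.assert(tiktoken.decode(tokens) === sample, "round-trip mismatch");
+ console.assert(tokens.length === tiktoken.countTokens(sample), "count mismatch");
+ ```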
467
+
468
+ ### Estimation API
469
+
470
+ The sync estimation API uses heuristics and is:
471
+
472
+ - **Fast** - No vocabulary loading (instant)
473
+ - **Approximate** - Typically within ±10-15% for English (OpenAI models)
474
+ - **Conservative** - Tends to slightly over-estimate, safer for API calls
475
+
476
+ Use estimation when:
477
+
478
+ - You need quick approximate counts
479
+ - You're doing context window management
480
+ - Exact counts aren't critical
481
+
482
+ ### Claude Estimation
483
+
484
+ For Claude models, the estimation is **intentionally conservative** with a 1.25x safety multiplier because:
485
+
486
+ - Claude's tokenizer is proprietary (not publicly available)
487
+ - Claude produces 16-30% more tokens than GPT-4 for equivalent content
488
+ - Over-estimation is safer than under-estimation for API limits
489
+
490
+ For exact Claude counts, use [Anthropic's Token Counting API](https://docs.anthropic.com/en/docs/build-with-claude/token-counting).
491
+
492
+ ## Browser Usage
493
+
494
+ The exact BPE API works in browsers but requires fetching vocabulary files (~4-10MB each). Vocabularies are cached after first load.
495
+
496
+ ```typescript
497
+ // Works in browsers
498
+ const tiktoken = await getEncodingAsync("cl100k_base");
499
+ const tokens = tiktoken.encode("Hello!");
500
+ ```
501
+
502
+ If the vocabulary download (~4-10MB per encoding) is a concern, consider:
503
+
504
+ 1. Using the estimation API (zero network requests)
505
+ 2. Pre-loading vocabularies at app startup (see the sketch after this list)
506
+ 3. Using a service worker to cache vocabularies
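+
+ A minimal way to apply option 2 is sketched below, using only the documented `getEncodingAsync`; since instances and vocabularies are cached, later calls reuse what was loaded at startup. Which encodings to warm is an application choice.
+
+ ```typescript
+ import { getEncodingAsync } from "tiktoken-ts";
+
+ // Warm the vocabulary cache while the app boots so the first real tokenization
+ // call does not pay the ~4-10MB download. Encodings you never request are never fetched.
+ export async function preloadEncodings(): Promise<void> {
+   await Promise.all([
+     getEncodingAsync("o200k_base"),
+     getEncodingAsync("cl100k_base"),
+   ]);
+ }
+ ```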
507
+
508
+ ## Comparison with Other Libraries
509
+
510
+ | Library | Exact BPE | Sync API | Bundle Size | Dependencies |
511
+ | --------------- | --------- | --------------- | ----------- | -------------- |
512
+ | **tiktoken-ts** | ✅ | ✅ (estimation) | ~50KB | 0 |
513
+ | tiktoken (WASM) | ✅ | ✅ | ~4MB | WASM |
514
+ | gpt-tokenizer | ✅ | ✅ | ~10MB | Embedded vocab |
515
+ | gpt-3-encoder | ❌ | ✅ | ~2MB | r50k only |
516
+
517
+ ## Development
518
+
519
+ ```bash
520
+ # Install dependencies
521
+ npm install
522
+
523
+ # Build
524
+ npm run build
525
+
526
+ # Run tests
527
+ npm test
528
+
529
+ # Type check
530
+ npm run typecheck
531
+
532
+ # Lint
533
+ npm run lint
534
+
535
+ # Format
536
+ npm run format
537
+ ```
538
+
539
+ ## Architecture
540
+
541
+ See [ARCHITECTURE.md](./ARCHITECTURE.md) for detailed implementation notes.
542
+
543
+ Key design decisions:
544
+
545
+ - Vocabularies loaded from CDN (not embedded) to keep package small
546
+ - Dual API: exact async + fast sync estimation
547
+ - Direct port of tiktoken-rs BPE algorithm for correctness
548
+ - Global caching of vocabularies and instances
549
+
550
+ ## License
551
+
552
+ MIT
553
+
554
+ ## Credits
555
+
556
+ - [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs) - Original Rust implementation
557
+ - [tiktoken](https://github.com/openai/tiktoken) - OpenAI's Python implementation