bpe-lite 0.3.1 → 0.4.1

package/README.md CHANGED
@@ -62,6 +62,36 @@ tok.count('Hello, world!'); // → 4
 
  Vocab files are bundled in the package — no network required at runtime or install time.
 
+ ## Performance
+
+ Benchmarked on Node v24, win32/x64, against [js-tiktoken](https://github.com/dqbd/tiktoken) and [ai-tokenizer](https://github.com/nicepkg/ai-tokenizer) (`node --expose-gc scripts/bench.js`).
+
+ **OpenAI cl100k — large text (~54 KB)**
+
+ | impl | ops/s | tokens/s | MB/s |
+ |------|------:|---------:|-----:|
+ | bpe-lite | **257** | **3.15M** | **13.6** |
+ | ai-tokenizer | 201 | 2.46M | 10.7 |
+ | js-tiktoken | 23 | 282k | 1.2 |
+
+ **Anthropic — large text (~54 KB)**
+
+ | impl | ops/s | tokens/s | MB/s |
+ |------|------:|---------:|-----:|
+ | bpe-lite | **257** | 3.15M | **13.6** |
+ | ai-tokenizer | 253 | **4.62M** | 13.4 |
+
+ **Gemini — large text (8 KB)**
+
+ | impl | ops/s | tokens/s | MB/s | note |
+ |------|------:|---------:|-----:|------|
+ | bpe-lite | **3,800** | **6.23M** | **29.7** | actual Gemma3 SPM |
+ | ai-tokenizer | 1,220 | 2.01M | 9.6 | o200k BPE — different algorithm, different results |
+
+ ai-tokenizer does not implement Gemini tokenization. The row above uses their o200k encoding on the same input string; it produces different token ids and counts than the Gemini tokenizer, so it is not a real comparison.
+
+ Numbers vary by machine — run the bench script locally for results on your hardware.
+
  ## API
 
  ### `countTokens(text, provider?)`
@@ -86,12 +116,11 @@ Returns the token count if `text` is within `limit` tokens, or `false` if exceed
 
  ## Why not tiktoken?
 
- `tiktoken` is accurate for OpenAI but requires Rust/WASM native bindings, which can break in Docker containers, edge runtimes, and serverless environments. `bpe-lite` is pure JavaScript — it runs anywhere Node 18+ runs.
+ `tiktoken` is accurate for OpenAI but requires Rust/WASM native bindings, which can break in Docker containers, edge runtimes, and serverless environments. `bpe-lite` is pure JavaScript — it runs anywhere Node 18+ runs, with no native compilation step.
 
  ## Caveats
 
  - **Anthropic**: Anthropic has not released the Claude 3+ tokenizer. The cl100k approximation is accurate to ~95% for most text.
- - **Speed**: Pure JS is slower than tiktoken's native implementation. For token counting (not bulk processing) the difference is negligible.
  - **Node version**: Requires Node 18+ for Unicode property escapes (`\p{L}`, `\p{N}`) in the pre-tokenization regex.
 
  ## License
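
The throughput columns added in the Performance tables (ops/s, tokens/s, MB/s) are related by simple arithmetic over a timed loop. A minimal sketch of that relationship, using a stand-in whitespace tokenizer rather than bpe-lite's actual bench script:

```javascript
// Stand-in tokenizer — NOT bpe-lite's API, just something to time.
const tokenize = (text) => text.split(/\s+/).filter(Boolean);

const text = 'lorem ipsum dolor sit amet '.repeat(2000); // ~53 KB sample
const bytes = Buffer.byteLength(text, 'utf8');

const iterations = 50;
const t0 = process.hrtime.bigint();
let totalTokens = 0;
for (let i = 0; i < iterations; i++) totalTokens += tokenize(text).length;
const seconds = Number(process.hrtime.bigint() - t0) / 1e9;

// ops/s   = full-text tokenizations per second
// tokens/s = tokens emitted per second
// MB/s    = input bytes processed per second
const opsPerSec = iterations / seconds;
const tokensPerSec = totalTokens / seconds;
const mbPerSec = (bytes * iterations) / seconds / (1024 * 1024);
console.log({ opsPerSec, tokensPerSec, mbPerSec });
```

Note that tokens/s divided by ops/s is just the token count of the sample, which is why the ratio between those two columns is constant per table row.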
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "bpe-lite",
- "version": "0.3.1",
+ "version": "0.4.1",
  "description": "Offline BPE tokenizer for OpenAI, Anthropic, and Gemini — zero dependencies",
  "main": "src/index.js",
  "types": "index.d.ts",
@@ -20,8 +20,22 @@
  ],
  "scripts": {
  "build": "node scripts/build-vocabs.js",
+ "bench": "node --expose-gc scripts/bench.js",
+ "compat": "node scripts/compat.js",
  "test": "node tests/run-tests.js"
  },
- "keywords": ["tokenizer", "bpe", "openai", "anthropic", "gemini", "tokens", "llm"],
- "license": "MIT"
+ "keywords": [
+ "tokenizer",
+ "bpe",
+ "openai",
+ "anthropic",
+ "gemini",
+ "tokens",
+ "llm"
+ ],
+ "license": "MIT",
+ "devDependencies": {
+ "ai-tokenizer": "^1.0.6",
+ "js-tiktoken": "^1.0.21"
+ }
  }
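
For readers unfamiliar with the technique the package name refers to, the core of byte-pair encoding is a loop that repeatedly merges the adjacent pair with the best learned rank. A toy sketch with made-up merge ranks — illustrative only, not bpe-lite's implementation or vocab:

```javascript
// Toy BPE merge loop. Lower rank = merged earlier (higher priority).
// These three ranks are invented for the example.
const ranks = new Map([
  ['l l', 0],  // l + l  -> ll
  ['ll o', 1], // ll + o -> llo
  ['h e', 2],  // h + e  -> he
]);

function bpe(word) {
  let parts = [...word]; // start from individual characters
  for (;;) {
    // Find the adjacent pair with the lowest (best) rank.
    let best = null, bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const r = ranks.get(parts[i] + ' ' + parts[i + 1]);
      if (r !== undefined && r < bestRank) { bestRank = r; best = i; }
    }
    if (best === null) break; // no mergeable pair left
    // Merge that pair in place and rescan.
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  return parts;
}

console.log(bpe('hello')); // → [ 'he', 'llo' ]
```

A real tokenizer maps the resulting pieces to integer ids from a vocab file; the SentencePiece (SPM) path used for Gemini follows a different, unigram-style procedure, which is why the README stresses that o200k BPE numbers are not comparable to it.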