@cyberlangke/tokkit-microsoft 1.8.0 → 1.10.0
- package/COPYRIGHT +1 -0
- package/README.md +41 -4
- package/dist/generated/bitnet_b1_58_2b_4t.cjs +25 -0
- package/dist/generated/bitnet_b1_58_2b_4t.js +5 -0
- package/dist/index.cjs +6 -0
- package/dist/index.js +6 -0
- package/package.json +2 -2
package/COPYRIGHT
CHANGED
@@ -1,6 +1,7 @@
 Copyright (c) 2026 cyberlangke
 
 This package bundles tokenizer assets derived from official Hugging Face repositories under the microsoft organization:
+- microsoft/bitnet-b1.58-2B-4T tokenizer.json
 - microsoft/phi-1 tokenizer.json
 - microsoft/Phi-3-mini-4k-instruct tokenizer.json
 - microsoft/Phi-3-medium-4k-instruct tokenizer.json
package/README.md
CHANGED
@@ -1,10 +1,47 @@
 # @cyberlangke/tokkit-microsoft
 
-tokkit for official Microsoft tokenizers
+tokkit subpackage for official Microsoft tokenizers. It contains only plain-text tokenizers that are currently MIT-compatible and can be verified directly against the official tokenizer assets.
 
-##
+## Supported families
 
--
+- `bitnet-b1.58-2b-4t`: covers `bitnet-b1.58-2B-4T`
+- `phi-1`: covers `phi-1 / phi-1_5 / phi-2`
+- `phi-3-mini`: covers `Phi-3-mini-4k-instruct / Phi-3-mini-128k-instruct`
+- `phi-3-medium`: covers `Phi-3-medium-4k-instruct / Phi-3-medium-128k-instruct`
+- `phi-3.5`: covers `Phi-3.5-mini-instruct / Phi-3.5-MoE-instruct`
+- `phi-moe`: covers `Phi-mini-MoE-instruct / Phi-tiny-MoE-instruct`
+- `phi-4`: covers `phi-4`
+- `phi-4-mini`: covers `Phi-4-mini-instruct`
+- `phi-4-mini-reasoning`: covers `Phi-4-mini-reasoning`
+- `phi-4-mini-flash`: covers `Phi-4-mini-flash-reasoning`
+- `phi-4-reasoning`: covers `Phi-4-reasoning / Phi-4-reasoning-plus`
+
+## Currently not included
+
+- `Phi-3-small-8k-instruct`
+- `Phi-3-small-128k-instruct`
+
+These two official mainline text repositories still do not publish a `tokenizer.json` at the repository root, so they do not meet the current onboarding bar of "download the official tokenizer snapshot and compare against it directly".
+
+- `Phi-3-vision-128k-instruct`
+- `Phi-3.5-vision-instruct`
+- `Phi-4-multimodal-instruct`
+- `Phi-4-reasoning-vision-15B`
+
+These models belong to the vision / multimodal line and are outside the scope of the current plain-text BPE mainline.
+
+- `NextCoder-*`
+- `UserLM-8b`
+- `FrogBoss-*`
+- `FrogMini-*`
+
+These repositories currently all carry the `base_model:finetune` signal, which marks them as official fine-tuned / derived models. They are not maintenance targets for the current mainline tokenizers.
+
+- `*-onnx`
+- `*-gguf`
+- `*-pytdml`
+
+These repositories are export formats or derivative distributions and are not maintenance targets for the current official mainline tokenizers.
 
 ## Usage
 
@@ -15,7 +52,7 @@ npm install @cyberlangke/tokkit-microsoft
 ```ts
 import { getTokenizer } from "@cyberlangke/tokkit-microsoft"
 
-const tokenizer = await getTokenizer("
+const tokenizer = await getTokenizer("bitnet-b1.58-2b-4t")
 
 console.log(tokenizer.vocabSize)
 ```
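
The family list added to the README maps each family ID to the upstream model names it covers. As a rough illustration of how a caller might pick the right ID for a given model, here is a minimal sketch. The `FAMILY_COVERAGE` table and the `familyForModel` helper are hypothetical names introduced for this example and are not part of the package; the table only transcribes the family list from the README diff.

```ts
// Hypothetical lookup table, transcribed from the family list in the README diff.
// It is not exported by @cyberlangke/tokkit-microsoft.
const FAMILY_COVERAGE: Record<string, readonly string[]> = {
  "bitnet-b1.58-2b-4t": ["bitnet-b1.58-2B-4T"],
  "phi-1": ["phi-1", "phi-1_5", "phi-2"],
  "phi-3-mini": ["Phi-3-mini-4k-instruct", "Phi-3-mini-128k-instruct"],
  "phi-3-medium": ["Phi-3-medium-4k-instruct", "Phi-3-medium-128k-instruct"],
  "phi-3.5": ["Phi-3.5-mini-instruct", "Phi-3.5-MoE-instruct"],
  "phi-moe": ["Phi-mini-MoE-instruct", "Phi-tiny-MoE-instruct"],
  "phi-4": ["phi-4"],
  "phi-4-mini": ["Phi-4-mini-instruct"],
  "phi-4-mini-reasoning": ["Phi-4-mini-reasoning"],
  "phi-4-mini-flash": ["Phi-4-mini-flash-reasoning"],
  "phi-4-reasoning": ["Phi-4-reasoning", "Phi-4-reasoning-plus"],
};

// Hypothetical helper: returns the family ID covering a model name,
// or undefined when the model is not in the table (for example the
// Phi-3-small repositories, which the README lists as not included).
function familyForModel(modelName: string): string | undefined {
  // Accept both "microsoft/phi-2" and "phi-2"; matching is case-sensitive.
  const bare = modelName.replace(/^microsoft\//, "");
  for (const [family, models] of Object.entries(FAMILY_COVERAGE)) {
    if (models.includes(bare)) {
      return family;
    }
  }
  return undefined;
}
```

Assuming the package accepts the family IDs exactly as written in the README, the returned ID could then be passed to `getTokenizer`, as in the README snippet above.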