nodejieba-plus 3.5.12 → 3.5.16
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +68 -93
- package/build/Release/nodejieba.node +0 -0
- package/index.js +8 -1
- package/lib/nodejieba.cpp +41 -8
- package/package.json +1 -1
- package/submodules/cppjieba/include/cppjieba/DictTrie.hpp +169 -30
- package/submodules/cppjieba/include/cppjieba/SegmentBase.hpp +1 -1
- package/submodules/cppjieba/include/cppjieba/Trie.hpp +10 -13
- package/submodules/cppjieba/include/cppjieba/Unicode.hpp +52 -0
- package/test/load_user_dict_test.js +57 -0
- package/test_assertion_fix.js +60 -0
- package/test_simple.js +17 -0
- package/test_space_keyword.js +66 -0
package/README.md
CHANGED

````diff
@@ -1,38 +1,38 @@
+[![Build Status](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml/badge.svg)](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml)
+[![Financial Contributors on Open Collective](https://opencollective.com/nodejieba/all/badge.svg?label=financial+contributors)](https://opencollective.com/nodejieba) [![Author](https://img.shields.io/badge/author-@yanyiwu-blue.svg?style=flat)](https://github.com/yanyiwu/)
+[![Platform](https://img.shields.io/badge/platform-Linux,macOS,Windows-green.svg?style=flat)](https://github.com/yanyiwu/nodejieba)
+[![Performance](https://img.shields.io/badge/performance-excellent-brightgreen.svg?style=flat)](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md)
+[![License](https://img.shields.io/badge/license-MIT-yellow.svg?style=flat)](http://yanyiwu.mit-license.org)
+[![NpmDownload Status](http://img.shields.io/npm/dm/nodejieba.svg)](https://www.npmjs.org/package/nodejieba)
+[![NPM Version](https://img.shields.io/npm/v/nodejieba.svg?style=flat)](https://www.npmjs.org/package/nodejieba)
+[![Code Climate](https://codeclimate.com/github/yanyiwu/nodejieba/badges/gpa.svg)](https://codeclimate.com/github/yanyiwu/nodejieba)
 
+***
 
 # NodeJieba "结巴"分词的Node.js版本
 
 ## 介绍
 
 `NodeJieba`是"结巴"中文分词的 Node.js 版本实现,
-由[CppJieba]提供底层分词算法实现,
+由[CppJieba](https://github.com/yanyiwu/cppjieba.git)提供底层分词算法实现,
 是兼具高性能和易用性两者的 Node.js 中文分词组件。
 
 ## 特点
 
+- 词典载入方式灵活,无需配置词典路径也可使用,需要定制自己的词典路径时也可灵活定制。
+- 底层算法实现是C++,性能高效。
+- 支持多种分词算法,各种分词算法见[CppJieba](https://github.com/yanyiwu/cppjieba.git)的README.md介绍。
+- 支持动态补充词库。
+- 支持TypeScript,提供完整的类型定义。
+- **支持包含空格的关键词**(如 "Open Claw")。
+- **支持无空格版本匹配**(如 "Open Claw" 可匹配 "OpenClaw")。
+- **支持英文大小写不敏感匹配**(如 "open claw"、"OPEN CLAW" 都可匹配 "Open Claw")。
+- **支持批量加载用户词典**(字符串数组、单个字符串、Buffer 格式)。
 
 对实现细节感兴趣的请看如下博文:
 
+- [Node.js的C++扩展初体验之NodeJieba](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2014-02-22-nodejs-cpp-addon-nodejieba.md)
+- [由NodeJieba谈谈Node.js异步实现](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-03-21-nodejs-asynchronous-insight.md)
 
 ## 安装
@@ -88,11 +88,11 @@ nodejieba.load({
 
 #### 词典说明
 
+- dict: 主词典,带权重和词性标签,建议使用默认词典。
+- hmmDict: 隐式马尔科夫模型,建议使用默认词典。
+- userDict: 用户词典,建议自己根据需要定制。
+- idfDict: 关键词抽取所需的idf信息。
+- stopWordDict: 关键词抽取所需的停用词列表。
 
 ## API 文档
@@ -211,10 +211,12 @@ nodejieba.loadUserDict(dictSet);
 // 方式3:使用单个字符串
 nodejieba.loadUserDict("区块链");
 
-// 方式4:使用 Buffer
+// 方式4:使用 Buffer(必须是 UTF-8 编码)
 const dictBuffer = Buffer.from("新词1\n新词2 100 n\n新词3 nz");
 nodejieba.loadUserDict(dictBuffer);
 
+// 注意:Buffer 必须是 UTF-8 编码,其他编码可能导致乱码或加载失败
+
 // 分词时会识别用户词典中的词
 var result = nodejieba.cut("云计算和大数据是人工智能的基础");
 console.log(result); // ['云计算', '和', '大数据', '是', '人工智能', '的', '基础']
@@ -242,76 +244,72 @@ console.log(result); // ['云计算', '和', '大数据', '是', '人工智能',
 
 支持在自定义词典中使用包含空格的关键词,且支持无空格版本匹配和大小写不敏感匹配。
 
+**注意**:本版本已移除空格作为默认分隔符,因此包含空格的关键词可以正确匹配文本中的对应内容,不会被分割。
+
 #### 用户词典格式
 
 用户词典支持以下格式:
 
 ```
+# 只有关键词(包含空格)
 Open Claw
+Game Master
 
-# 关键词 +
+# 关键词 + 词频(仅支持单关键词,不支持包含空格的关键词+词频)
+人工智能 1000
 
-# 关键词 + 词频 +
+# 关键词 + 词频 + 词性标签(支持包含空格的关键词)
 Open Claw 100 n
+Machine Learning 200 n
 
 # 包含多个空格的关键词
-Machine Learning 200 n
 Artificial Intelligence 300 n
+Deep Learning 400 n
 ```
 
+**格式说明**:
+
+- 当词典行只有关键词时(如 `Open Claw`),整个字符串作为关键词
+- 当词典行有词频时(如 `人工智能 1000`),第一个部分是关键词,第二个是词频
+- 当词典行有三个部分且倒数第二个是数字时(如 `Open Claw 100 n`),前面的部分组成关键词,后面是词频和词性
+
 #### 使用示例
 
 ```js
 var nodejieba = require("nodejieba");
-var path = require('path');
-
-// 创建包含空格关键词的用户词典
-var dictContent = `Open Claw 100 n
-Machine Learning 200 n
-Artificial Intelligence 300 n
-`;
+nodejieba.load();
 
+// 方式1:使用 loadUserDict 加载包含空格的关键词
+nodejieba.loadUserDict(["Open Claw 100 n", "Game Master"]);
 
-  userDict: testDictPath,
-});
+// 方式2:使用 insertWord 添加包含空格的关键词
+nodejieba.insertWord("Deep Learning");
 
 // 测试1: 包含空格的关键词匹配
+console.log(nodejieba.cut("I like Open Claw game"));
+// 输出: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'Open Claw', ' ', 'g', 'a', 'm', 'e']
 
+// 测试2: 在中文句子中匹配
+console.log(nodejieba.cut("Open Claw和Game Master都是好游戏"));
+// 输出: ['Open Claw', '和', 'Game Master', '都', '是', '好', '游戏']
+
+// 测试3: 大小写不敏感匹配
 console.log(nodejieba.cut("open claw")); // 匹配 Open Claw
 console.log(nodejieba.cut("OPEN CLAW")); // 匹配 Open Claw
 console.log(nodejieba.cut("Open Claw")); // 匹配 Open Claw
 
+// 测试4: 无空格版本匹配
 console.log(nodejieba.cut("OpenClaw")); // 匹配 Open Claw
 console.log(nodejieba.cut("openclaw")); // 匹配 Open Claw
 console.log(nodejieba.cut("OPENCLAW")); // 匹配 Open Claw
-
-// 测试4: 其他包含空格的关键词
-console.log(nodejieba.cut("Machine Learning is great"));
-// 输出包含: ['Machine Learning']
-
-console.log(nodejieba.cut("Artificial Intelligence will change the world"));
-// 输出包含: ['Artificial Intelligence']
-
-// 清理测试文件
-fs.unlinkSync(testDictPath);
 ```
 
 #### 功能说明
 
-1. **包含空格的关键词**: 词典中的 "Open Claw" 可以匹配文本中的 "Open Claw"
+1. **包含空格的关键词**: 词典中的 "Open Claw" 可以匹配文本中的 "Open Claw"(不会被分割)
 2. **无空格版本匹配**: 词典中的 "Open Claw" 也可以匹配文本中的 "OpenClaw"
 3. **大小写不敏感**: 词典中的 "Open Claw" 可以匹配 "open claw"、"OPEN CLAW"、"Open Claw" 等任意大小写组合
+4. **自动生成变体**: 添加包含空格的关键词时,会自动生成无空格版本和小写版本,确保各种变体都能匹配
 
 More Detals in [demo](https://github.com/yanyiwu/nodejieba-demo)
@@ -347,37 +345,23 @@ npm test
 
 ## 应用
 
+- 支持中文搜索的 gitbook 插件: [gitbook-plugin-search-pro](https://plugins.gitbook.com/plugin/search-pro)
+- 汉字拼音转换工具: [pinyin](https://github.com/hotoo/pinyin)
 
 ## 性能评测
 
 应该是目前性能最好的 Node.js 中文分词库
-详见: [Jieba中文分词系列性能评测]
-
-[由NodeJieba谈谈Node.js异步实现]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-03-21-nodejs-asynchronous-insight.md
-[Node.js的C++扩展初体验之NodeJieba]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2014-02-22-nodejs-cpp-addon-nodejieba.md
-[CppJieba]:https://github.com/yanyiwu/cppjieba.git
-[cnpm]:http://cnpmjs.org
-[Jieba中文分词]:https://github.com/fxsjy/jieba
-
-[Jieba中文分词系列性能评测]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md
-[contributors]:https://github.com/yanyiwu/nodejieba/graphs/contributors
-[YanyiWu]:http://github.com/yanyiwu
-[gitbook-plugin-search-pro]:https://plugins.gitbook.com/plugin/search-pro
-[pinyin]:https://github.com/hotoo/pinyin
+详见: [Jieba中文分词系列性能评测](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md)
 
 ## Contributors
 
 ### Code Contributors
 
-This project exists thanks to all the people who contribute. [[Contribute](CONTRIBUTING.md)].
-<a href="https://github.com/yanyiwu/nodejieba/graphs/contributors"><img src="https://opencollective.com/nodejieba/contributors.svg?width=890&button=false" /></a>
+This project exists thanks to all the people who contribute. [[Contribute](CONTRIBUTING.md)]. <a href="https://github.com/yanyiwu/nodejieba/graphs/contributors"><img src="https://opencollective.com/nodejieba/contributors.svg?width=890&button=false" /></a>
 
 ### Financial Contributors
 
 Become a financial contributor and help us sustain our community. [[Contribute](https://opencollective.com/nodejieba/contribute)]
 
 #### Individuals
@@ -385,15 +369,6 @@ Become a financial contributor and help us sustain our community. [[Contribute](
 
 #### Organizations
 
-Support this project with your organization. Your logo will show up here with a link to your website. [[Contribute](https://opencollective.com/nodejieba/contribute)]
-
-<a href="https://opencollective.com/nodejieba/organization/0/website"><img src="https://opencollective.com/nodejieba/organization/0/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/1/website"><img src="https://opencollective.com/nodejieba/organization/1/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/2/website"><img src="https://opencollective.com/nodejieba/organization/2/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/3/website"><img src="https://opencollective.com/nodejieba/organization/3/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/4/website"><img src="https://opencollective.com/nodejieba/organization/4/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/5/website"><img src="https://opencollective.com/nodejieba/organization/5/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/6/website"><img src="https://opencollective.com/nodejieba/organization/6/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/7/website"><img src="https://opencollective.com/nodejieba/organization/7/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/8/website"><img src="https://opencollective.com/nodejieba/organization/8/avatar.svg"></a>
-<a href="https://opencollective.com/nodejieba/organization/9/website"><img src="https://opencollective.com/nodejieba/organization/9/avatar.svg"></a>
+Support this project with your organization. Your logo will show up here with a link to your website. [[Contribute](https://opencollective.com/nodejieba/contribute)]
+
+<a href="https://opencollective.com/nodejieba/organization/0/website"><img src="https://opencollective.com/nodejieba/organization/0/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/1/website"><img src="https://opencollective.com/nodejieba/organization/1/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/2/website"><img src="https://opencollective.com/nodejieba/organization/2/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/3/website"><img src="https://opencollective.com/nodejieba/organization/3/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/4/website"><img src="https://opencollective.com/nodejieba/organization/4/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/5/website"><img src="https://opencollective.com/nodejieba/organization/5/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/6/website"><img src="https://opencollective.com/nodejieba/organization/6/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/7/website"><img src="https://opencollective.com/nodejieba/organization/7/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/8/website"><img src="https://opencollective.com/nodejieba/organization/8/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/9/website"><img src="https://opencollective.com/nodejieba/organization/9/avatar.svg"></a>
````
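The three dictionary-line format rules described in the README can be sketched as a small parser. `parseDictLine` is a hypothetical helper for illustration only (nodejieba does this parsing in C++, in DictTrie.hpp); it mirrors the rule that in a line of three or more whitespace-separated parts, a numeric second-to-last part marks the split between the space-containing keyword, the frequency, and the tag:

```javascript
// Parse one user-dict line into { word, freq, tag }, following the README rules.
// parseDictLine is a hypothetical helper for illustration, not part of nodejieba's API.
function parseDictLine(line) {
  const parts = line.trim().split(/\s+/);
  if (parts.length === 1) {
    // Only a keyword.
    return { word: parts[0], freq: null, tag: null };
  }
  if (parts.length === 2 && /^\d+$/.test(parts[1])) {
    // Keyword + frequency (single-token keyword only).
    return { word: parts[0], freq: parseInt(parts[1], 10), tag: null };
  }
  if (parts.length >= 3 && /^\d+$/.test(parts[parts.length - 2])) {
    // Everything before the numeric second-to-last part forms the
    // (possibly space-containing) keyword; then frequency and tag.
    return {
      word: parts.slice(0, -2).join(' '),
      freq: parseInt(parts[parts.length - 2], 10),
      tag: parts[parts.length - 1],
    };
  }
  // Otherwise the whole line is treated as the keyword.
  return { word: line.trim(), freq: null, tag: null };
}
```

Note how `"深度 学习"` (two parts, non-numeric second part) falls through to the last branch, so the whole line becomes the keyword, matching the documented behavior.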
package/build/Release/nodejieba.node
CHANGED (binary file)
package/index.js
CHANGED

````diff
@@ -84,11 +84,18 @@ exports.loadUserDict = function (dict) {
     exports.load();
   }
 
+  if (dict === null || dict === undefined) {
+    return false;
+  }
+
   if (dict instanceof Set) {
     dict = Array.from(dict);
   }
 
+  if (typeof dict !== 'string' && !Array.isArray(dict) && !Buffer.isBuffer(dict)) {
+    throw new TypeError('dict must be string, string[], Set<string>, or Buffer');
+  }
+
   return _loadUserDict.call(this, dict);
 };
````
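The argument handling this diff adds can be modeled as a standalone function (a sketch mirroring the diff, not the actual module, which forwards valid input to the native `_loadUserDict`):

```javascript
// Standalone model of the argument validation added to exports.loadUserDict:
// null/undefined are rejected with `false`, Sets are normalized to arrays,
// and any other non-(string | array | Buffer) input raises a TypeError.
function validateDict(dict) {
  if (dict === null || dict === undefined) {
    return false;
  }
  if (dict instanceof Set) {
    dict = Array.from(dict);
  }
  if (typeof dict !== 'string' && !Array.isArray(dict) && !Buffer.isBuffer(dict)) {
    throw new TypeError('dict must be string, string[], Set<string>, or Buffer');
  }
  return true; // in the real module, the dict is now passed to the native addon
}
```

The asymmetry is deliberate per the diff: missing input returns `false` quietly, while a wrong type is a programming error and throws.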
package/lib/nodejieba.cpp
CHANGED

````diff
@@ -7,6 +7,7 @@
 #include "cppjieba/TextRankExtractor.hpp"
 
 #include <sstream>
+#include <cctype>
 
 NodeJieba::NodeJieba(Napi::Env env, Napi::Object exports) {
   DefineAddon(exports, {
@@ -229,32 +230,64 @@ Napi::Value NodeJieba::loadUserDict(const Napi::CallbackInfo& info) {
     Napi::Error::New(info.Env(), "Before calling any other function you have to call load() first").ThrowAsJavaScriptException();
   }
 
+  auto isBlankString = [](const std::string& str) -> bool {
+    for (char c : str) {
+      if (!std::isspace(static_cast<unsigned char>(c))) {
+        return false;
+      }
+    }
+    return true;
+  };
+
+  auto trimString = [](std::string& str) -> void {
+    size_t start = 0;
+    size_t end = str.length();
+
+    while (start < end && std::isspace(static_cast<unsigned char>(str[start]))) {
+      start++;
+    }
+
+    while (end > start && std::isspace(static_cast<unsigned char>(str[end - 1]))) {
+      end--;
+    }
+
+    str = str.substr(start, end - start);
+  };
+
   if (info[0].IsArray()) {
     Napi::Array arr = info[0].As<Napi::Array>();
     std::vector<std::string> buf;
     for (size_t i = 0; i < arr.Length(); i++) {
       Napi::Value val = arr[i];
-      if (val.IsString()) {
+      if (!val.IsString()) {
+        Napi::TypeError::New(info.Env(), "Array elements must be strings")
+            .ThrowAsJavaScriptException();
+        return Napi::Boolean::New(info.Env(), false);
+      }
+      std::string line = val.As<Napi::String>().Utf8Value();
+      trimString(line);
+      if (!line.empty() && !isBlankString(line)) {
+        buf.push_back(line);
       }
     }
     _jieba_handle->LoadUserDict(buf);
   } else if (info[0].IsString()) {
-    // 支持传入单个词典条目字符串
     std::string line = info[0].As<Napi::String>().Utf8Value();
+    trimString(line);
     std::vector<std::string> buf;
+    if (!line.empty() && !isBlankString(line)) {
+      buf.push_back(line);
+      _jieba_handle->LoadUserDict(buf);
+    }
   } else if (info[0].IsBuffer()) {
-    // 支持传入 Buffer,将其转换为字符串并按行分割
     Napi::Buffer<char> buffer = info[0].As<Napi::Buffer<char>>();
     std::string content(buffer.Data(), buffer.Length());
     std::vector<std::string> buf;
     std::istringstream iss(content);
     std::string line;
     while (std::getline(iss, line)) {
+      trimString(line);
+      if (!line.empty() && !isBlankString(line)) {
         buf.push_back(line);
       }
     }
````
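The Buffer path above decodes the bytes as UTF-8, splits on newlines, trims each line, and drops blank entries before handing them to `LoadUserDict`. The same pipeline, modeled in JavaScript (an illustrative sketch of the native behavior, not nodejieba's API):

```javascript
// Model of the native Buffer handling in loadUserDict: decode as UTF-8,
// split into lines, trim surrounding whitespace, keep only non-blank lines.
function bufferToDictLines(buffer) {
  return buffer
    .toString('utf8')      // the Buffer must be UTF-8 encoded, as the README notes
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}
```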
package/package.json
CHANGED (+1 -1; version bump 3.5.12 → 3.5.16)
package/submodules/cppjieba/include/cppjieba/DictTrie.hpp
CHANGED

````diff
@@ -10,6 +10,7 @@
 #include <stdint.h>
 #include <cmath>
 #include <limits>
+#include <algorithm>
 #include "limonp/StringUtil.hpp"
 #include "limonp/Logging.hpp"
 #include "Unicode.hpp"
@@ -32,7 +33,7 @@ class DictTrie {
     WordWeightMax,
   }; // enum UserWordWeightOption
 
-  DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
+  DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) : trie_(NULL) {
     Init(dict_path, user_dict_paths, user_word_weight_opt);
   }
@@ -41,23 +42,84 @@ class DictTrie {
   }
 
   bool InsertUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
+    std::set<string> insertedWords;
+    insertedWords.insert(word);
+
+    bool hasSpace = (word.find(' ') != string::npos);
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty() && wordNoSpace != word) {
+        insertedWords.insert(wordNoSpace);
+      }
+    }
+
+    string wordLower = ToLowerString(word);
+    if (wordLower != word) {
+      insertedWords.insert(wordLower);
+    }
+
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty()) {
+        string wordNoSpaceLower = ToLowerString(wordNoSpace);
+        if (wordNoSpaceLower != wordNoSpace) {
+          insertedWords.insert(wordNoSpaceLower);
+        }
+      }
+    }
+
+    for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+      DictUnit node_info;
+      if (!MakeNodeInfo(node_info, *it, user_word_default_weight_, tag)) {
+        continue;
+      }
+      active_node_infos_.push_back(node_info);
+      trie_->InsertNode(node_info.word, &active_node_infos_.back());
     }
-    active_node_infos_.push_back(node_info);
-    trie_->InsertNode(node_info.word, &active_node_infos_.back());
     return true;
   }
 
   bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
+    double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_;
+
+    std::set<string> insertedWords;
+    insertedWords.insert(word);
+
+    bool hasSpace = (word.find(' ') != string::npos);
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty() && wordNoSpace != word) {
+        insertedWords.insert(wordNoSpace);
+      }
+    }
+
+    string wordLower = ToLowerString(word);
+    if (wordLower != word) {
+      insertedWords.insert(wordLower);
+    }
+
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty()) {
+        string wordNoSpaceLower = ToLowerString(wordNoSpace);
+        if (wordNoSpaceLower != wordNoSpace) {
+          insertedWords.insert(wordNoSpaceLower);
+        }
+      }
+    }
+
+    for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+      DictUnit node_info;
+      if (!MakeNodeInfo(node_info, *it, weight, tag)) {
+        continue;
+      }
+      active_node_infos_.push_back(node_info);
+      trie_->InsertNode(node_info.word, &active_node_infos_.back());
     }
-    active_node_infos_.push_back(node_info);
-    trie_->InsertNode(node_info.word, &active_node_infos_.back());
     return true;
   }
@@ -112,26 +174,93 @@ class DictTrie {
     vector<string> buf;
     DictUnit node_info;
     Split(line, buf, " ");
+
+    string word;
+    string tag = UNKNOWN_TAG;
+    double weight = user_word_default_weight_;
+    bool hasSpace = false;
+
+    if (buf.size() == 1) {
+      word = buf[0];
+    } else if (buf.size() == 2) {
+      int freq = atoi(buf[1].c_str());
+      if (freq > 0) {
+        assert(freq_sum_ > 0.0);
+        weight = log(1.0 * freq / freq_sum_);
+        word = buf[0];
+      } else {
+        word = line;
+      }
+    } else if (buf.size() >= 3) {
+      bool isFreq = true;
+      for (char c : buf[buf.size() - 2]) {
+        if (!isdigit(c)) {
+          isFreq = false;
+          break;
        }
+      }
+
+      if (isFreq) {
+        int freq = atoi(buf[buf.size() - 2].c_str());
+        assert(freq_sum_ > 0.0);
+        weight = log(1.0 * freq / freq_sum_);
+        for (size_t i = 0; i < buf.size() - 2; ++i) {
+          if (i > 0) word += " ";
+          word += buf[i];
+        }
+        tag = buf[buf.size() - 1];
+      } else {
+        word = line;
+      }
+    }
+
+    hasSpace = (word.find(' ') != string::npos);
+
+    std::set<string> insertedWords;
+
+    insertedWords.insert(word);
+
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty() && wordNoSpace != word) {
+        insertedWords.insert(wordNoSpace);
+      }
+    }
+
+    string wordLower = ToLowerString(word);
+    if (wordLower != word) {
+      insertedWords.insert(wordLower);
+    }
+
+    if (hasSpace) {
+      string wordNoSpace = word;
+      wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+      if (!wordNoSpace.empty()) {
+        string wordNoSpaceLower = ToLowerString(wordNoSpace);
+        if (wordNoSpaceLower != wordNoSpace) {
+          insertedWords.insert(wordNoSpaceLower);
+        }
+      }
+    }
+
+    for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+      DictUnit temp_node_info;
+      if (MakeNodeInfo(temp_node_info, *it, weight, tag)) {
+        if (trie_) {
+          active_node_infos_.push_back(temp_node_info);
+          trie_->InsertNode(active_node_infos_.back().word, &active_node_infos_.back());
+          if (active_node_infos_.back().word.size() == 1) {
+            user_dict_single_chinese_word_.insert(active_node_infos_.back().word[0]);
+          }
+        } else {
+          static_node_infos_.push_back(temp_node_info);
+          if (temp_node_info.word.size() == 1) {
+            user_dict_single_chinese_word_.insert(temp_node_info.word[0]);
+          }
         }
+      }
+    }
   }
 
   void LoadUserDict(const vector<string>& buf) {
@@ -206,6 +335,16 @@ class DictTrie {
     return true;
   }
 
+  bool MakeNodeInfo(DictUnit& node_info,
+                    const Unicode& word,
+                    double weight,
+                    const string& tag) {
+    node_info.word = word;
+    node_info.weight = weight;
+    node_info.tag = tag;
+    return true;
+  }
+
   void LoadDict(const string& filePath) {
     ifstream ifs(filePath.c_str());
     XCHECK(ifs.is_open()) << "open " << filePath << " failed.";
````
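Each keyword inserted by the variant logic in DictTrie.hpp fans out into up to four dictionary entries: the original, a no-space version, a lowercase version, and a lowercase no-space version, with duplicates collapsed. A JS sketch of that fan-out (illustrative only; note that `toLowerCase()` here folds all Unicode letters, whereas the C++ `ToLowerString` only folds ASCII A–Z):

```javascript
// Generate the set of dictionary variants for one keyword, mirroring the
// variant generation in DictTrie.hpp. The Set collapses duplicate variants.
function wordVariants(word) {
  const variants = new Set([word]);
  if (word.includes(' ')) {
    const noSpace = word.replace(/ /g, '');
    if (noSpace.length > 0) {
      variants.add(noSpace);                // "Open Claw" -> "OpenClaw"
      variants.add(noSpace.toLowerCase());  // "Open Claw" -> "openclaw"
    }
  }
  variants.add(word.toLowerCase());         // "Open Claw" -> "open claw"
  return variants;
}
```

A pure-Chinese keyword like "人工智能" produces only itself: it has no spaces, and lowercasing leaves it unchanged, so no extra trie entries are created.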
package/submodules/cppjieba/include/cppjieba/Trie.hpp
CHANGED

````diff
@@ -69,7 +69,8 @@ class Trie {
     if (NULL == ptNode->next) {
       return NULL;
     }
-    citer = ptNode->next->find(it->rune);
+    Rune searchRune = ToLowerRune(it->rune);
+    citer = ptNode->next->find(searchRune);
     if (ptNode->next->end() == citer) {
       return NULL;
     }
@@ -90,7 +91,7 @@ class Trie {
   for (size_t i = 0; i < size_t(end - begin); i++) {
     res[i].runestr = *(begin + i);
 
-    if (root_->next != NULL && root_->next->end() != (citer = root_->next->find(res[i].runestr.rune))) {
+    if (root_->next != NULL && root_->next->end() != (citer = root_->next->find(ToLowerRune(res[i].runestr.rune)))) {
       ptNode = citer->second;
     } else {
       ptNode = NULL;
@@ -105,7 +106,7 @@ class Trie {
       if (ptNode == NULL || ptNode->next == NULL) {
         break;
       }
-      citer = ptNode->next->find((begin + j)->rune);
+      citer = ptNode->next->find(ToLowerRune((begin + j)->rune));
       if (ptNode->next->end() == citer) {
         break;
       }
@@ -128,11 +129,12 @@ class Trie {
     if (NULL == ptNode->next) {
       ptNode->next = new TrieNode::NextMap;
     }
-    kmIter = ptNode->next->find(*citer);
+    Rune insertRune = ToLowerRune(*citer);
+    kmIter = ptNode->next->find(insertRune);
     if (ptNode->next->end() == kmIter) {
       TrieNode *nextNode = new TrieNode;
 
-      ptNode->next->insert(make_pair(*citer, nextNode));
+      ptNode->next->insert(make_pair(insertRune, nextNode));
       ptNode = nextNode;
     } else {
       ptNode = kmIter->second;
@@ -145,23 +147,18 @@ class Trie {
     if (key.begin() == key.end()) {
       return;
     }
-    //定义一个NextMap迭代器
     TrieNode::NextMap::const_iterator kmIter;
-    //定义一个指向root的TrieNode指针
     TrieNode *ptNode = root_;
     for (Unicode::const_iterator citer = key.begin(); citer != key.end(); ++citer) {
-      //链表不存在元素
       if (NULL == ptNode->next) {
         return;
       }
+      Rune deleteRune = ToLowerRune(*citer);
+      kmIter = ptNode->next->find(deleteRune);
       if (ptNode->next->end() == kmIter) {
         break;
       }
-
-      ptNode->next->erase(*citer);
-      //删除该node
+      ptNode->next->erase(deleteRune);
       ptNode = kmIter->second;
       delete ptNode;
       break;
````
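Because both insertion and lookup fold each rune through `ToLowerRune` before touching the child map, the trie stores a single case-folded branch per word and every case variant of a query hits it. A minimal JS sketch of that idea (illustrative only; the real trie also carries DictUnit pointers, and the fold here is restricted to ASCII to match `ToLowerRune`):

```javascript
// Fold only ASCII A-Z, as ToLowerRune does; other code points pass through.
function asciiLower(ch) {
  const c = ch.codePointAt(0);
  return c >= 0x41 && c <= 0x5a ? String.fromCodePoint(c + 0x20) : ch;
}

// Minimal trie whose keys are case-folded on both insert and find,
// mirroring the ToLowerRune calls added in Trie.hpp.
class CaseFoldingTrie {
  constructor() {
    this.root = { children: new Map(), value: undefined };
  }
  insert(word, value) {
    let node = this.root;
    for (const ch of word) {
      const key = asciiLower(ch);
      if (!node.children.has(key)) {
        node.children.set(key, { children: new Map(), value: undefined });
      }
      node = node.children.get(key);
    }
    node.value = value;
  }
  find(word) {
    let node = this.root;
    for (const ch of word) {
      node = node.children.get(asciiLower(ch));
      if (node === undefined) return undefined;
    }
    return node.value;
  }
}
```

Note that this only makes *case* variants match; the no-space variant ("OpenClaw") still needs its own entry, which is why DictTrie generates it explicitly.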
package/submodules/cppjieba/include/cppjieba/Unicode.hpp
CHANGED

````diff
@@ -222,6 +222,58 @@ inline void GetStringsFromWords(const vector<Word>& words, vector<string>& strs)
   }
 }
 
+inline Rune ToLowerRune(Rune r) {
+  if (r >= 'A' && r <= 'Z') {
+    return r + ('a' - 'A');
+  }
+  return r;
+}
+
+inline Rune ToUpperRune(Rune r) {
+  if (r >= 'a' && r <= 'z') {
+    return r - ('a' - 'A');
+  }
+  return r;
+}
+
+inline Unicode ToLowerUnicode(const Unicode& unicode) {
+  Unicode result;
+  result.reserve(unicode.size());
+  for (size_t i = 0; i < unicode.size(); i++) {
+    result.push_back(ToLowerRune(unicode[i]));
+  }
+  return result;
+}
+
+inline Unicode ToUpperUnicode(const Unicode& unicode) {
+  Unicode result;
+  result.reserve(unicode.size());
+  for (size_t i = 0; i < unicode.size(); i++) {
+    result.push_back(ToUpperRune(unicode[i]));
+  }
+  return result;
+}
+
+inline string ToLowerString(const string& s) {
+  string result = s;
+  for (size_t i = 0; i < result.size(); i++) {
+    if (result[i] >= 'A' && result[i] <= 'Z') {
+      result[i] = result[i] + ('a' - 'A');
+    }
+  }
+  return result;
+}
+
+inline string ToUpperString(const string& s) {
+  string result = s;
+  for (size_t i = 0; i < result.size(); i++) {
+    if (result[i] >= 'a' && result[i] <= 'z') {
+      result[i] = result[i] - ('a' - 'A');
+    }
+  }
+  return result;
+}
+
 } // namespace cppjieba
 
 #endif // CPPJIEBA_UNICODE_H
````
package/test/load_user_dict_test.js
CHANGED

````diff
@@ -74,4 +74,61 @@ describe("nodejieba.loadUserDict", function() {
     var loadResult = nodejieba.loadUserDict(dictSet);
     loadResult.should.eql(true);
   });
+
+  it("nodejieba.loadUserDict should filter empty strings", function() {
+    var dictLines = [
+      "有效词1",
+      "",
+      "有效词2",
+      "",
+      " ",
+      "\t",
+      "\n",
+      "有效词3"
+    ];
+    var loadResult = nodejieba.loadUserDict(dictLines);
+    loadResult.should.eql(true);
+  });
+
+  it("nodejieba.loadUserDict with space-containing keywords", function() {
+    var dictLines = [
+      "深度 学习",
+      "机器 学习 200 n",
+      "人工 智能 300 nz"
+    ];
+    var loadResult = nodejieba.loadUserDict(dictLines);
+    loadResult.should.eql(true);
+  });
+
+  it("nodejieba.loadUserDict should throw error for non-string array elements", function() {
+    var dictLines = [
+      "有效词",
+      123,
+      "另一个有效词"
+    ];
+
+    (function() {
+      nodejieba.loadUserDict(dictLines);
+    }).should.throw();
+  });
+
+  it("nodejieba.loadUserDict should return false for null", function() {
+    var loadResult = nodejieba.loadUserDict(null);
+    loadResult.should.eql(false);
+  });
+
+  it("nodejieba.loadUserDict should return false for undefined", function() {
+    var loadResult = nodejieba.loadUserDict(undefined);
+    loadResult.should.eql(false);
+  });
+
+  it("nodejieba.loadUserDict should throw TypeError for invalid type", function() {
+    (function() {
+      nodejieba.loadUserDict(123);
+    }).should.throw(TypeError);
+
+    (function() {
+      nodejieba.loadUserDict({});
+    }).should.throw(TypeError);
+  });
 });
````
package/test_assertion_fix.js ADDED
@@ -0,0 +1,60 @@
+var nodejieba = require("./index.js");
+
+nodejieba.load();
+
+console.log("Test 1: load dictionary entries containing whitespace");
+try {
+  var result = nodejieba.loadUserDict([
+    "有效词1",
+    "",
+    " ",
+    "\t",
+    "\n",
+    "有效词2",
+    " 测试词 ",
+    " 空格词 "
+  ]);
+  console.log("✅ load succeeded:", result);
+} catch (e) {
+  console.log("❌ load failed:", e.message);
+}
+
+console.log("\nTest 2: segment text using the whitespace-containing dictionary");
+try {
+  var result = nodejieba.cut("有效词1和有效词2以及测试词");
+  console.log("✅ segmentation succeeded:", result);
+  if (result.includes("有效词1") && result.includes("有效词2") && result.includes("测试词")) {
+    console.log("✅ dictionary entries recognized correctly");
+  }
+} catch (e) {
+  console.log("❌ segmentation failed:", e.message);
+}
+
+console.log("\nTest 3: load a dictionary with many blank entries");
+try {
+  var largeDict = [];
+  for (var i = 0; i < 100; i++) {
+    largeDict.push("词" + i);
+    largeDict.push("");
+    largeDict.push(" ");
+    largeDict.push("\t\n");
+  }
+  var result = nodejieba.loadUserDict(largeDict);
+  console.log("✅ large dictionary loaded:", result);
+} catch (e) {
+  console.log("❌ large dictionary load failed:", e.message);
+}
+
+console.log("\nTest 4: Buffer containing blank lines");
+try {
+  var bufferContent = "词A\n\n \n\t\n词B\n 词C \n";
+  var result = nodejieba.loadUserDict(Buffer.from(bufferContent));
+  console.log("✅ Buffer loaded:", result);
+
+  var cutResult = nodejieba.cut("词A和词B以及词C");
+  console.log("✅ segmentation with Buffer dictionary:", cutResult);
+} catch (e) {
+  console.log("❌ Buffer load failed:", e.message);
+}
+
+console.log("\n✅ All tests finished; the assertion error is fixed!");
package/test_simple.js ADDED
@@ -0,0 +1,17 @@
+var nodejieba = require("./index.js");
+
+console.log("=== Test start ===\n");
+
+try {
+  console.log("calling load()...");
+  nodejieba.load();
+  console.log("load() done");
+
+  console.log("\ntesting segmentation...");
+  var result = nodejieba.cut("测试");
+  console.log("segmentation result:", result);
+} catch (e) {
+  console.error("error:", e);
+}
+
+console.log("\n=== Test end ===");
package/test_space_keyword.js ADDED
@@ -0,0 +1,66 @@
+var nodejieba = require("./index.js");
+
+console.log("=== Testing matching of keywords that contain spaces ===\n");
+
+// Load the default dictionary
+nodejieba.load();
+
+// Test 1: load a space-containing keyword with frequency and POS tag
+console.log("Test 1: load 'Open Claw 2 n'");
+nodejieba.loadUserDict(["Open Claw 2 n"]);
+
+var testCases1 = [
+  "Open Claw",
+  "OpenClaw",
+  "Openclaw",
+  "OPENCLAW",
+  "open claw",
+  "OPEN CLAW"
+];
+
+console.log("Testing case variants:");
+testCases1.forEach(function(testText) {
+  var result = nodejieba.cut(testText);
+  console.log("  '" + testText + "' ->", result);
+});
+
+console.log("\n");
+
+// Test 2: load a space-containing keyword (keyword only, no frequency or POS)
+console.log("Test 2: load 'Game Master' (keyword only)");
+nodejieba.loadUserDict("Game Master");
+
+var testCases2 = [
+  "Game Master",
+  "GameMaster",
+  "gamemaster",
+  "GAMEMASTER",
+  "GAME MASTER"
+];
+
+console.log("Testing case variants:");
+testCases2.forEach(function(testText) {
+  var result = nodejieba.cut(testText);
+  console.log("  '" + testText + "' ->", result);
+});
+
+console.log("\n");
+
+// Test 3: match space-containing keywords inside sentences
+console.log("Test 3: match space-containing keywords inside sentences");
+var sentence1 = "I like Open Claw game very much";
+var result1 = nodejieba.cut(sentence1);
+console.log("  sentence: '" + sentence1 + "'");
+console.log("  result:", result1);
+
+var sentence2 = "Open Claw和Game Master都是好游戏";
+var result2 = nodejieba.cut(sentence2);
+console.log("  sentence: '" + sentence2 + "'");
+console.log("  result:", result2);
+
+var sentence3 = "OPENCLAW和gamemaster都是好游戏";
+var result3 = nodejieba.cut(sentence3);
+console.log("  sentence: '" + sentence3 + "'");
+console.log("  result:", result3);
+
+console.log("\n=== Done ===");