nodejieba-plus 3.5.12 → 3.5.16

This diff shows the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between the package versions as they appear in their public registries.
package/README.md CHANGED
@@ -1,38 +1,38 @@
- [![Build Status](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml/badge.svg)](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml)
- [![Financial Contributors on Open Collective](https://opencollective.com/nodejieba/all/badge.svg?label=financial+contributors)](https://opencollective.com/nodejieba) [![Author](https://img.shields.io/badge/author-@yanyiwu-blue.svg?style=flat)](https://github.com/yanyiwu/)
- [![Platform](https://img.shields.io/badge/platform-Linux,macOS,Windows-green.svg?style=flat)](https://github.com/yanyiwu/nodejieba)
- [![Performance](https://img.shields.io/badge/performance-excellent-brightgreen.svg?style=flat)](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md)
- [![License](https://img.shields.io/badge/license-MIT-yellow.svg?style=flat)](http://yanyiwu.mit-license.org)
- [![NpmDownload Status](http://img.shields.io/npm/dm/nodejieba.svg)](https://www.npmjs.org/package/nodejieba)
- [![NPM Version](https://img.shields.io/npm/v/nodejieba.svg?style=flat)](https://www.npmjs.org/package/nodejieba)
- [![Code Climate](https://codeclimate.com/github/yanyiwu/nodejieba/badges/gpa.svg)](https://codeclimate.com/github/yanyiwu/nodejieba)
+ [!\[Build Status\](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml/badge.svg null)](https://github.com/yanyiwu/nodejieba/actions/workflows/test.yml)
+ [!\[Financial Contributors on Open Collective\](https://opencollective.com/nodejieba/all/badge.svg?label=financial+contributors null)](https://opencollective.com/nodejieba) [!\[Author\](https://img.shields.io/badge/author-@yanyiwu-blue.svg?style=flat null)](https://github.com/yanyiwu/)
+ [!\[Platform\](https://img.shields.io/badge/platform-Linux,macOS,Windows-green.svg?style=flat null)](https://github.com/yanyiwu/nodejieba)
+ [!\[Performance\](https://img.shields.io/badge/performance-excellent-brightgreen.svg?style=flat null)](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md)
+ [!\[License\](https://img.shields.io/badge/license-MIT-yellow.svg?style=flat null)](http://yanyiwu.mit-license.org)
+ [!\[NpmDownload Status\](http://img.shields.io/npm/dm/nodejieba.svg null)](https://www.npmjs.org/package/nodejieba)
+ [!\[NPM Version\](https://img.shields.io/npm/v/nodejieba.svg?style=flat null)](https://www.npmjs.org/package/nodejieba)
+ [!\[Code Climate\](https://codeclimate.com/github/yanyiwu/nodejieba/badges/gpa.svg null)](https://codeclimate.com/github/yanyiwu/nodejieba)
 
- - - -
+ ***
 
  # NodeJieba: Node.js version of the "Jieba" Chinese word segmenter
 
- ## Introduction
+ ## Introduction
 
  `NodeJieba` is the Node.js implementation of the "Jieba" Chinese word segmenter;
- the underlying segmentation algorithms are provided by [CppJieba],
+ the underlying segmentation algorithms are provided by [CppJieba](https://github.com/yanyiwu/cppjieba.git),
  making it a Node.js Chinese segmentation component that combines high performance with ease of use.
 
  ## Features
 
- + Flexible dictionary loading: usable without configuring any dictionary path, and custom dictionary paths can be configured when needed.
- + The underlying algorithms are implemented in C++ and perform well.
- + Multiple segmentation algorithms are supported; see the README.md of [CppJieba] for details.
- + Supports adding words to the dictionary at runtime.
- + Supports TypeScript, with complete type definitions.
- + **Supports keywords that contain spaces** (e.g. "Open Claw").
- + **Supports matching the no-space variant** (e.g. "OpenClaw" can match "Open Claw").
- + **Supports case-insensitive matching for English** (e.g. "open claw" and "OPEN CLAW" both match "Open Claw").
- + **Supports batch-loading user dictionaries** (string array, single string, or Buffer).
+ - Flexible dictionary loading: usable without configuring any dictionary path, and custom dictionary paths can be configured when needed.
+ - The underlying algorithms are implemented in C++ and perform well.
+ - Multiple segmentation algorithms are supported; see the README.md of [CppJieba](https://github.com/yanyiwu/cppjieba.git) for details.
+ - Supports adding words to the dictionary at runtime.
+ - Supports TypeScript, with complete type definitions.
+ - **Supports keywords that contain spaces** (e.g. "Open Claw").
+ - **Supports matching the no-space variant** (e.g. "Open Claw" can match "OpenClaw").
+ - **Supports case-insensitive matching for English** (e.g. "open claw" and "OPEN CLAW" both match "Open Claw").
+ - **Supports batch-loading user dictionaries** (string array, single string, or Buffer).
 
  For implementation details, see these blog posts:
 
- + [Node.js的C++扩展初体验之NodeJieba]
- + [由NodeJieba谈谈Node.js异步实现]
+ - [Node.js的C++扩展初体验之NodeJieba](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2014-02-22-nodejs-cpp-addon-nodejieba.md)
+ - [由NodeJieba谈谈Node.js异步实现](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-03-21-nodejs-asynchronous-insight.md)
 
  ## Installation
 
@@ -88,11 +88,11 @@ nodejieba.load({
 
  #### Dictionaries
 
- + dict: the main dictionary, with weights and part-of-speech tags; the default dictionary is recommended.
- + hmmDict: the hidden Markov model; the default dictionary is recommended.
- + userDict: the user dictionary; customize it as needed.
- + idfDict: the IDF data required for keyword extraction.
- + stopWordDict: the stop-word list required for keyword extraction.
+ - dict: the main dictionary, with weights and part-of-speech tags; the default dictionary is recommended.
+ - hmmDict: the hidden Markov model; the default dictionary is recommended.
+ - userDict: the user dictionary; customize it as needed.
+ - idfDict: the IDF data required for keyword extraction.
+ - stopWordDict: the stop-word list required for keyword extraction.
 
  ## API documentation
 
@@ -211,10 +211,12 @@ nodejieba.loadUserDict(dictSet);
  // Option 3: a single string
  nodejieba.loadUserDict("区块链");
 
- // Option 4: a Buffer
+ // Option 4: a Buffer (must be UTF-8 encoded)
  const dictBuffer = Buffer.from("新词1\n新词2 100 n\n新词3 nz");
  nodejieba.loadUserDict(dictBuffer);
 
+ // Note: the Buffer must be UTF-8; other encodings may produce garbled text or fail to load
+
  // Words from the user dictionary are recognized during segmentation
  var result = nodejieba.cut("云计算和大数据是人工智能的基础");
  console.log(result); // ['云计算', '和', '大数据', '是', '人工智能', '的', '基础']
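The native Buffer path in this release trims each line and drops blank ones before loading. That normalization can be sketched in plain JavaScript; the helper name `bufferToDictLines` is hypothetical and not part of the package API:

```javascript
// Decode a dictionary Buffer as UTF-8, split into lines,
// trim each line, and drop blank entries (mirrors the native behavior).
function bufferToDictLines(buf) {
  return buf
    .toString("utf8") // the native side also assumes UTF-8
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

const lines = bufferToDictLines(Buffer.from("新词1\n\n  \n新词2 100 n\n"));
// lines: ['新词1', '新词2 100 n']
```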
@@ -242,76 +244,72 @@ console.log(result); // ['云计算', '和', '大数据', '是', '人工智能',
 
  Custom dictionaries support keywords that contain spaces, with no-space-variant matching and case-insensitive matching.
 
+ **Note**: this release removes the space character from the default separators, so a keyword that contains spaces matches the corresponding text without being split.
+
  #### User dictionary format
 
  The user dictionary supports the following formats:
 
  ```
- # keyword only
+ # keyword only (contains spaces)
  Open Claw
+ Game Master
 
- # keyword + part-of-speech tag
- Open Claw n
+ # keyword + frequency (single keywords only; a space-containing keyword plus frequency is not supported)
+ 人工智能 1000
 
- # keyword + frequency + part-of-speech tag
+ # keyword + frequency + part-of-speech tag (space-containing keywords are supported)
  Open Claw 100 n
+ Machine Learning 200 n
 
  # keywords containing multiple spaces
- Machine Learning 200 n
  Artificial Intelligence 300 n
+ Deep Learning 400 n
  ```
 
+ **Format rules**:
+
+ - When a line contains only a keyword (e.g. `Open Claw`), the whole string is the keyword
+ - When a line contains a frequency (e.g. `人工智能 1000`), the first field is the keyword and the second is the frequency
+ - When a line contains three or more fields and the second-to-last is numeric (e.g. `Open Claw 100 n`), the leading fields joined together form the keyword, followed by the frequency and the part-of-speech tag
+
276
  #### 使用示例
265
277
 
266
278
  ```js
267
279
  var nodejieba = require("nodejieba");
268
- var fs = require('fs');
269
- var path = require('path');
270
-
271
- // 创建包含空格关键词的用户词典
272
- var dictContent = `Open Claw 100 n
273
- Machine Learning 200 n
274
- Artificial Intelligence 300 n
275
- `;
280
+ nodejieba.load();
276
281
 
277
- var testDictPath = path.join(__dirname, 'user_dict.utf8');
278
- fs.writeFileSync(testDictPath, dictContent);
282
+ // 方式1:使用 loadUserDict 加载包含空格的关键词
283
+ nodejieba.loadUserDict(["Open Claw 100 n", "Game Master"]);
279
284
 
280
- // 加载词典
281
- nodejieba.load({
282
- userDict: testDictPath,
283
- });
285
+ // 方式2:使用 insertWord 添加包含空格的关键词
286
+ nodejieba.insertWord("Deep Learning");
284
287
 
285
288
  // 测试1: 包含空格的关键词匹配
286
- console.log(nodejieba.cut("I want to use Open Claw tool"));
287
- // 输出包含: ['Open Claw']
289
+ console.log(nodejieba.cut("I like Open Claw game"));
290
+ // 输出: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 'Open Claw', ' ', 'g', 'a', 'm', 'e']
288
291
 
289
- // 测试2: 大小写不敏感匹配
292
+ // 测试2: 在中文句子中匹配
293
+ console.log(nodejieba.cut("Open Claw和Game Master都是好游戏"));
294
+ // 输出: ['Open Claw', '和', 'Game Master', '都', '是', '好', '游戏']
295
+
296
+ // 测试3: 大小写不敏感匹配
290
297
  console.log(nodejieba.cut("open claw")); // 匹配 Open Claw
291
298
  console.log(nodejieba.cut("OPEN CLAW")); // 匹配 Open Claw
292
299
  console.log(nodejieba.cut("Open Claw")); // 匹配 Open Claw
293
300
 
294
- // 测试3: 无空格版本匹配
301
+ // 测试4: 无空格版本匹配
295
302
  console.log(nodejieba.cut("OpenClaw")); // 匹配 Open Claw
296
303
  console.log(nodejieba.cut("openclaw")); // 匹配 Open Claw
297
304
  console.log(nodejieba.cut("OPENCLAW")); // 匹配 Open Claw
298
-
299
- // 测试4: 其他包含空格的关键词
300
- console.log(nodejieba.cut("Machine Learning is great"));
301
- // 输出包含: ['Machine Learning']
302
-
303
- console.log(nodejieba.cut("Artificial Intelligence will change the world"));
304
- // 输出包含: ['Artificial Intelligence']
305
-
306
- // 清理测试文件
307
- fs.unlinkSync(testDictPath);
308
305
  ```
309
306
 
310
307
  #### 功能说明
311
308
 
312
- 1. **包含空格的关键词**: 词典中的 "Open Claw" 可以匹配文本中的 "Open Claw"
309
+ 1. **包含空格的关键词**: 词典中的 "Open Claw" 可以匹配文本中的 "Open Claw"(不会被分割)
313
310
  2. **无空格版本匹配**: 词典中的 "Open Claw" 也可以匹配文本中的 "OpenClaw"
314
311
  3. **大小写不敏感**: 词典中的 "Open Claw" 可以匹配 "open claw"、"OPEN CLAW"、"Open Claw" 等任意大小写组合
312
+ 4. **自动生成变体**: 添加包含空格的关键词时,会自动生成无空格版本和小写版本,确保各种变体都能匹配
315
313
 
316
314
  More Detals in [demo](https://github.com/yanyiwu/nodejieba-demo)
317
315
 
@@ -347,37 +345,23 @@ npm test
 
  ## Applications
 
- + gitbook plugin with Chinese search support: [gitbook-plugin-search-pro]
- + Chinese-characters-to-pinyin conversion tool: [pinyin]
+ - gitbook plugin with Chinese search support: [gitbook-plugin-search-pro](https://plugins.gitbook.com/plugin/search-pro)
+ - Chinese-characters-to-pinyin conversion tool: [pinyin](https://github.com/hotoo/pinyin)
 
  ## Benchmarks
 
  Probably the best-performing Chinese word segmentation library for Node.js at the moment;
- see [Jieba中文分词系列性能评测] for details.
-
-
- [由NodeJieba谈谈Node.js异步实现]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-03-21-nodejs-asynchronous-insight.md
- [Node.js的C++扩展初体验之NodeJieba]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2014-02-22-nodejs-cpp-addon-nodejieba.md
- [CppJieba]:https://github.com/yanyiwu/cppjieba.git
- [cnpm]:http://cnpmjs.org
- [Jieba中文分词]:https://github.com/fxsjy/jieba
-
- [Jieba中文分词系列性能评测]:https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md
- [contributors]:https://github.com/yanyiwu/nodejieba/graphs/contributors
- [YanyiWu]:http://github.com/yanyiwu
- [gitbook-plugin-search-pro]:https://plugins.gitbook.com/plugin/search-pro
- [pinyin]:https://github.com/hotoo/pinyin
+ see [Jieba中文分词系列性能评测](https://github.com/yanyiwu/blog/blob/posts2023archive/_posts/2015-06-14-jieba-series-performance-test.md) for details.
 
  ## Contributors
 
  ### Code Contributors
 
- This project exists thanks to all the people who contribute. [[Contribute](CONTRIBUTING.md)].
- <a href="https://github.com/yanyiwu/nodejieba/graphs/contributors"><img src="https://opencollective.com/nodejieba/contributors.svg?width=890&button=false" /></a>
+ This project exists thanks to all the people who contribute. \[[Contribute](CONTRIBUTING.md)]. <a href="https://github.com/yanyiwu/nodejieba/graphs/contributors"><img src="https://opencollective.com/nodejieba/contributors.svg?width=890&button=false" /></a>
 
  ### Financial Contributors
 
- Become a financial contributor and help us sustain our community. [[Contribute](https://opencollective.com/nodejieba/contribute)]
+ Become a financial contributor and help us sustain our community. \[[Contribute](https://opencollective.com/nodejieba/contribute)]
 
  #### Individuals
 
@@ -385,15 +369,6 @@ Become a financial contributor and help us sustain our community. [[Contribute](
 
  #### Organizations
 
- Support this project with your organization. Your logo will show up here with a link to your website. [[Contribute](https://opencollective.com/nodejieba/contribute)]
-
- <a href="https://opencollective.com/nodejieba/organization/0/website"><img src="https://opencollective.com/nodejieba/organization/0/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/1/website"><img src="https://opencollective.com/nodejieba/organization/1/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/2/website"><img src="https://opencollective.com/nodejieba/organization/2/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/3/website"><img src="https://opencollective.com/nodejieba/organization/3/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/4/website"><img src="https://opencollective.com/nodejieba/organization/4/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/5/website"><img src="https://opencollective.com/nodejieba/organization/5/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/6/website"><img src="https://opencollective.com/nodejieba/organization/6/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/7/website"><img src="https://opencollective.com/nodejieba/organization/7/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/8/website"><img src="https://opencollective.com/nodejieba/organization/8/avatar.svg"></a>
- <a href="https://opencollective.com/nodejieba/organization/9/website"><img src="https://opencollective.com/nodejieba/organization/9/avatar.svg"></a>
+ Support this project with your organization. Your logo will show up here with a link to your website. \[[Contribute](https://opencollective.com/nodejieba/contribute)]
+
+ <a href="https://opencollective.com/nodejieba/organization/0/website"><img src="https://opencollective.com/nodejieba/organization/0/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/1/website"><img src="https://opencollective.com/nodejieba/organization/1/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/2/website"><img src="https://opencollective.com/nodejieba/organization/2/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/3/website"><img src="https://opencollective.com/nodejieba/organization/3/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/4/website"><img src="https://opencollective.com/nodejieba/organization/4/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/5/website"><img src="https://opencollective.com/nodejieba/organization/5/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/6/website"><img src="https://opencollective.com/nodejieba/organization/6/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/7/website"><img src="https://opencollective.com/nodejieba/organization/7/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/8/website"><img src="https://opencollective.com/nodejieba/organization/8/avatar.svg"></a> <a href="https://opencollective.com/nodejieba/organization/9/website"><img src="https://opencollective.com/nodejieba/organization/9/avatar.svg"></a>
Binary file
package/index.js CHANGED
@@ -84,11 +84,18 @@ exports.loadUserDict = function (dict) {
    exports.load();
  }
 
- // If it is a Set, convert it to an array
+ if (dict === null || dict === undefined) {
+   return false;
+ }
+
  if (dict instanceof Set) {
    dict = Array.from(dict);
  }
 
+ if (typeof dict !== 'string' && !Array.isArray(dict) && !Buffer.isBuffer(dict)) {
+   throw new TypeError('dict must be string, string[], Set<string>, or Buffer');
+ }
+
  return _loadUserDict.call(this, dict);
 };
 
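The guard logic this hunk adds can be exercised in isolation; `normalizeDictArg` below is a hypothetical standalone version of the wrapper's checks, not a package export:

```javascript
// Reject null/undefined, convert Set -> Array, and type-check everything else,
// matching the accepted input types of loadUserDict in this release.
function normalizeDictArg(dict) {
  if (dict === null || dict === undefined) return null; // caller returns false
  if (dict instanceof Set) dict = Array.from(dict); // Set<string> -> string[]
  if (typeof dict !== "string" && !Array.isArray(dict) && !Buffer.isBuffer(dict)) {
    throw new TypeError("dict must be string, string[], Set<string>, or Buffer");
  }
  return dict;
}
```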
package/lib/nodejieba.cpp CHANGED
@@ -7,6 +7,7 @@
  #include "cppjieba/TextRankExtractor.hpp"
 
  #include <sstream>
+ #include <cctype>
 
  NodeJieba::NodeJieba(Napi::Env env, Napi::Object exports) {
    DefineAddon(exports, {
@@ -229,32 +230,64 @@ Napi::Value NodeJieba::loadUserDict(const Napi::CallbackInfo& info) {
    Napi::Error::New(info.Env(), "Before calling any other function you have to call load() first").ThrowAsJavaScriptException();
  }
 
- // Accept a string array, a single string, or a Buffer
+ auto isBlankString = [](const std::string& str) -> bool {
+   for (char c : str) {
+     if (!std::isspace(static_cast<unsigned char>(c))) {
+       return false;
+     }
+   }
+   return true;
+ };
+
+ auto trimString = [](std::string& str) -> void {
+   size_t start = 0;
+   size_t end = str.length();
+
+   while (start < end && std::isspace(static_cast<unsigned char>(str[start]))) {
+     start++;
+   }
+
+   while (end > start && std::isspace(static_cast<unsigned char>(str[end - 1]))) {
+     end--;
+   }
+
+   str = str.substr(start, end - start);
+ };
+
  if (info[0].IsArray()) {
    Napi::Array arr = info[0].As<Napi::Array>();
    std::vector<std::string> buf;
    for (size_t i = 0; i < arr.Length(); i++) {
      Napi::Value val = arr[i];
-     if (val.IsString()) {
-       buf.push_back(val.As<Napi::String>().Utf8Value());
+     if (!val.IsString()) {
+       Napi::TypeError::New(info.Env(), "Array elements must be strings")
+         .ThrowAsJavaScriptException();
+       return Napi::Boolean::New(info.Env(), false);
+     }
+     std::string line = val.As<Napi::String>().Utf8Value();
+     trimString(line);
+     if (!line.empty() && !isBlankString(line)) {
+       buf.push_back(line);
      }
    }
    _jieba_handle->LoadUserDict(buf);
  } else if (info[0].IsString()) {
-   // Accept a single dictionary entry as a string
    std::string line = info[0].As<Napi::String>().Utf8Value();
+   trimString(line);
    std::vector<std::string> buf;
-   buf.push_back(line);
-   _jieba_handle->LoadUserDict(buf);
+   if (!line.empty() && !isBlankString(line)) {
+     buf.push_back(line);
+     _jieba_handle->LoadUserDict(buf);
+   }
  } else if (info[0].IsBuffer()) {
-   // Accept a Buffer: convert it to a string and split it into lines
    Napi::Buffer<char> buffer = info[0].As<Napi::Buffer<char>>();
    std::string content(buffer.Data(), buffer.Length());
    std::vector<std::string> buf;
    std::istringstream iss(content);
    std::string line;
    while (std::getline(iss, line)) {
-     if (!line.empty()) {
+     trimString(line);
+     if (!line.empty() && !isBlankString(line)) {
        buf.push_back(line);
      }
    }
package/package.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "nodejieba-plus",
    "description": "chinese word segmentation for node",
-   "version": "3.5.12",
+   "version": "3.5.16",
    "author": "Yanyi Wu <wuyanyi09@foxmail.com>",
    "maintainers": [
      "Yanyi Wu <wuyanyi09@foxmail.com>"
@@ -10,6 +10,7 @@
  #include <stdint.h>
  #include <cmath>
  #include <limits>
+ #include <algorithm>
  #include "limonp/StringUtil.hpp"
  #include "limonp/Logging.hpp"
  #include "Unicode.hpp"
@@ -32,7 +33,7 @@ class DictTrie {
    WordWeightMax,
  }; // enum UserWordWeightOption
 
- DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) {
+ DictTrie(const string& dict_path, const string& user_dict_paths = "", UserWordWeightOption user_word_weight_opt = WordWeightMedian) : trie_(NULL) {
    Init(dict_path, user_dict_paths, user_word_weight_opt);
  }
 
@@ -41,23 +42,84 @@ class DictTrie {
  }
 
  bool InsertUserWord(const string& word, const string& tag = UNKNOWN_TAG) {
-   DictUnit node_info;
-   if (!MakeNodeInfo(node_info, word, user_word_default_weight_, tag)) {
-     return false;
-   }
-   active_node_infos_.push_back(node_info);
-   trie_->InsertNode(node_info.word, &active_node_infos_.back());
+   std::set<string> insertedWords;
+   insertedWords.insert(word);
+
+   bool hasSpace = (word.find(' ') != string::npos);
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty() && wordNoSpace != word) {
+       insertedWords.insert(wordNoSpace);
+     }
+   }
+
+   string wordLower = ToLowerString(word);
+   if (wordLower != word) {
+     insertedWords.insert(wordLower);
+   }
+
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty()) {
+       string wordNoSpaceLower = ToLowerString(wordNoSpace);
+       if (wordNoSpaceLower != wordNoSpace) {
+         insertedWords.insert(wordNoSpaceLower);
+       }
+     }
+   }
+
+   for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+     DictUnit node_info;
+     if (!MakeNodeInfo(node_info, *it, user_word_default_weight_, tag)) {
+       continue;
+     }
+     active_node_infos_.push_back(node_info);
+     trie_->InsertNode(node_info.word, &active_node_infos_.back());
+   }
    return true;
  }
 
  bool InsertUserWord(const string& word,int freq, const string& tag = UNKNOWN_TAG) {
-   DictUnit node_info;
-   double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_ ;
-   if (!MakeNodeInfo(node_info, word, weight , tag)) {
-     return false;
-   }
-   active_node_infos_.push_back(node_info);
-   trie_->InsertNode(node_info.word, &active_node_infos_.back());
+   double weight = freq ? log(1.0 * freq / freq_sum_) : user_word_default_weight_;
+
+   std::set<string> insertedWords;
+   insertedWords.insert(word);
+
+   bool hasSpace = (word.find(' ') != string::npos);
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty() && wordNoSpace != word) {
+       insertedWords.insert(wordNoSpace);
+     }
+   }
+
+   string wordLower = ToLowerString(word);
+   if (wordLower != word) {
+     insertedWords.insert(wordLower);
+   }
+
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty()) {
+       string wordNoSpaceLower = ToLowerString(wordNoSpace);
+       if (wordNoSpaceLower != wordNoSpace) {
+         insertedWords.insert(wordNoSpaceLower);
+       }
+     }
+   }
+
+   for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+     DictUnit node_info;
+     if (!MakeNodeInfo(node_info, *it, weight, tag)) {
+       continue;
+     }
+     active_node_infos_.push_back(node_info);
+     trie_->InsertNode(node_info.word, &active_node_infos_.back());
+   }
    return true;
  }
 
@@ -112,26 +174,93 @@ class DictTrie {
    vector<string> buf;
    DictUnit node_info;
    Split(line, buf, " ");
-   if(buf.size() == 1){
-     MakeNodeInfo(node_info,
-                  buf[0],
-                  user_word_default_weight_,
-                  UNKNOWN_TAG);
-   } else if (buf.size() == 2) {
-     MakeNodeInfo(node_info,
-                  buf[0],
-                  user_word_default_weight_,
-                  buf[1]);
-   } else if (buf.size() == 3) {
-     int freq = atoi(buf[1].c_str());
-     assert(freq_sum_ > 0.0);
-     double weight = log(1.0 * freq / freq_sum_);
-     MakeNodeInfo(node_info, buf[0], weight, buf[2]);
-   }
-   static_node_infos_.push_back(node_info);
-   if (node_info.word.size() == 1) {
-     user_dict_single_chinese_word_.insert(node_info.word[0]);
-   }
+
+   string word;
+   string tag = UNKNOWN_TAG;
+   double weight = user_word_default_weight_;
+   bool hasSpace = false;
+
+   if (buf.size() == 1) {
+     word = buf[0];
+   } else if (buf.size() == 2) {
+     int freq = atoi(buf[1].c_str());
+     if (freq > 0) {
+       assert(freq_sum_ > 0.0);
+       weight = log(1.0 * freq / freq_sum_);
+       word = buf[0];
+     } else {
+       word = line;
+     }
+   } else if (buf.size() >= 3) {
+     bool isFreq = true;
+     for (char c : buf[buf.size() - 2]) {
+       if (!isdigit(c)) {
+         isFreq = false;
+         break;
+       }
+     }
+
+     if (isFreq) {
+       int freq = atoi(buf[buf.size() - 2].c_str());
+       assert(freq_sum_ > 0.0);
+       weight = log(1.0 * freq / freq_sum_);
+       for (size_t i = 0; i < buf.size() - 2; ++i) {
+         if (i > 0) word += " ";
+         word += buf[i];
+       }
+       tag = buf[buf.size() - 1];
+     } else {
+       word = line;
+     }
+   }
+
+   hasSpace = (word.find(' ') != string::npos);
+
+   std::set<string> insertedWords;
+
+   insertedWords.insert(word);
+
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty() && wordNoSpace != word) {
+       insertedWords.insert(wordNoSpace);
+     }
+   }
+
+   string wordLower = ToLowerString(word);
+   if (wordLower != word) {
+     insertedWords.insert(wordLower);
+   }
+
+   if (hasSpace) {
+     string wordNoSpace = word;
+     wordNoSpace.erase(remove(wordNoSpace.begin(), wordNoSpace.end(), ' '), wordNoSpace.end());
+     if (!wordNoSpace.empty()) {
+       string wordNoSpaceLower = ToLowerString(wordNoSpace);
+       if (wordNoSpaceLower != wordNoSpace) {
+         insertedWords.insert(wordNoSpaceLower);
+       }
+     }
+   }
+
+   for (std::set<string>::const_iterator it = insertedWords.begin(); it != insertedWords.end(); ++it) {
+     DictUnit temp_node_info;
+     if (MakeNodeInfo(temp_node_info, *it, weight, tag)) {
+       if (trie_) {
+         active_node_infos_.push_back(temp_node_info);
+         trie_->InsertNode(active_node_infos_.back().word, &active_node_infos_.back());
+         if (active_node_infos_.back().word.size() == 1) {
+           user_dict_single_chinese_word_.insert(active_node_infos_.back().word[0]);
+         }
+       } else {
+         static_node_infos_.push_back(temp_node_info);
+         if (temp_node_info.word.size() == 1) {
+           user_dict_single_chinese_word_.insert(temp_node_info.word[0]);
+         }
+       }
+     }
+   }
  }
 
  void LoadUserDict(const vector<string>& buf) {
@@ -206,6 +335,16 @@ class DictTrie {
    return true;
  }
 
+ bool MakeNodeInfo(DictUnit& node_info,
+                   const Unicode& word,
+                   double weight,
+                   const string& tag) {
+   node_info.word = word;
+   node_info.weight = weight;
+   node_info.tag = tag;
+   return true;
+ }
+
  void LoadDict(const string& filePath) {
    ifstream ifs(filePath.c_str());
    XCHECK(ifs.is_open()) << "open " << filePath << " failed.";
@@ -8,7 +8,7 @@
 
  namespace cppjieba {
 
- const char* const SPECIAL_SEPARATORS = " \t\n\xEF\xBC\x8C\xE3\x80\x82";
+ const char* const SPECIAL_SEPARATORS = "\t\n\xEF\xBC\x8C\xE3\x80\x82";
 
  using namespace limonp;
 
@@ -69,7 +69,8 @@ class Trie {
    if (NULL == ptNode->next) {
      return NULL;
    }
-   citer = ptNode->next->find(it->rune);
+   Rune searchRune = ToLowerRune(it->rune);
+   citer = ptNode->next->find(searchRune);
    if (ptNode->next->end() == citer) {
      return NULL;
    }
@@ -90,7 +91,7 @@ class Trie {
    for (size_t i = 0; i < size_t(end - begin); i++) {
      res[i].runestr = *(begin + i);
 
-     if (root_->next != NULL && root_->next->end() != (citer = root_->next->find(res[i].runestr.rune))) {
+     if (root_->next != NULL && root_->next->end() != (citer = root_->next->find(ToLowerRune(res[i].runestr.rune)))) {
        ptNode = citer->second;
      } else {
        ptNode = NULL;
@@ -105,7 +106,7 @@ class Trie {
      if (ptNode == NULL || ptNode->next == NULL) {
        break;
      }
-     citer = ptNode->next->find((begin + j)->rune);
+     citer = ptNode->next->find(ToLowerRune((begin + j)->rune));
      if (ptNode->next->end() == citer) {
        break;
      }
@@ -128,11 +129,12 @@ class Trie {
    if (NULL == ptNode->next) {
      ptNode->next = new TrieNode::NextMap;
    }
-   kmIter = ptNode->next->find(*citer);
+   Rune insertRune = ToLowerRune(*citer);
+   kmIter = ptNode->next->find(insertRune);
    if (ptNode->next->end() == kmIter) {
      TrieNode *nextNode = new TrieNode;
 
-     ptNode->next->insert(make_pair(*citer, nextNode));
+     ptNode->next->insert(make_pair(insertRune, nextNode));
      ptNode = nextNode;
    } else {
      ptNode = kmIter->second;
@@ -145,23 +147,18 @@ class Trie {
    if (key.begin() == key.end()) {
      return;
    }
-   // declare a NextMap iterator
    TrieNode::NextMap::const_iterator kmIter;
-   // a TrieNode pointer starting at the root
    TrieNode *ptNode = root_;
    for (Unicode::const_iterator citer = key.begin(); citer != key.end(); ++citer) {
-     // the node has no children
      if (NULL == ptNode->next) {
        return;
      }
-     kmIter = ptNode->next->find(*citer);
-     // not found in the map; break out of the loop
+     Rune deleteRune = ToLowerRune(*citer);
+     kmIter = ptNode->next->find(deleteRune);
      if (ptNode->next->end() == kmIter) {
        break;
      }
-     // erase the entry from the unordered_map
-     ptNode->next->erase(*citer);
-     // delete the node
+     ptNode->next->erase(deleteRune);
      ptNode = kmIter->second;
      delete ptNode;
      break;
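In isolation, the case-folded insert/find pattern these hunks implement looks roughly like the following `Map`-based toy trie. The names are illustrative, and only ASCII case is folded, as in the C++:

```javascript
// Fold ASCII case for one character; everything else passes through.
const asciiLower = (ch) => (ch >= "A" && ch <= "Z" ? ch.toLowerCase() : ch);

// Insert a word under its case-folded path, remembering the canonical form.
function trieInsert(root, word) {
  let node = root;
  for (const ch of word) {
    const key = asciiLower(ch);
    if (!node.has(key)) node.set(key, new Map());
    node = node.get(key);
  }
  node.set("$end", word);
}

// Look up text along the same case-folded path.
function trieFind(root, text) {
  let node = root;
  for (const ch of text) {
    node = node.get(asciiLower(ch));
    if (!node) return null;
  }
  return node.get("$end") || null;
}

const root = new Map();
trieInsert(root, "Open Claw");
trieFind(root, "OPEN CLAW"); // → "Open Claw"
```

Because both insertion and lookup fold case, a single stored entry matches any casing without duplicating nodes.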
@@ -222,6 +222,58 @@ inline void GetStringsFromWords(const vector<Word>& words, vector<string>& strs)
    }
  }
 
+ inline Rune ToLowerRune(Rune r) {
+   if (r >= 'A' && r <= 'Z') {
+     return r + ('a' - 'A');
+   }
+   return r;
+ }
+
+ inline Rune ToUpperRune(Rune r) {
+   if (r >= 'a' && r <= 'z') {
+     return r - ('a' - 'A');
+   }
+   return r;
+ }
+
+ inline Unicode ToLowerUnicode(const Unicode& unicode) {
+   Unicode result;
+   result.reserve(unicode.size());
+   for (size_t i = 0; i < unicode.size(); i++) {
+     result.push_back(ToLowerRune(unicode[i]));
+   }
+   return result;
+ }
+
+ inline Unicode ToUpperUnicode(const Unicode& unicode) {
+   Unicode result;
+   result.reserve(unicode.size());
+   for (size_t i = 0; i < unicode.size(); i++) {
+     result.push_back(ToUpperRune(unicode[i]));
+   }
+   return result;
+ }
+
+ inline string ToLowerString(const string& s) {
+   string result = s;
+   for (size_t i = 0; i < result.size(); i++) {
+     if (result[i] >= 'A' && result[i] <= 'Z') {
+       result[i] = result[i] + ('a' - 'A');
+     }
+   }
+   return result;
+ }
+
+ inline string ToUpperString(const string& s) {
+   string result = s;
+   for (size_t i = 0; i < result.size(); i++) {
+     if (result[i] >= 'a' && result[i] <= 'z') {
+       result[i] = result[i] - ('a' - 'A');
+     }
+   }
+   return result;
+ }
+
 } // namespace cppjieba
 
 #endif // CPPJIEBA_UNICODE_H
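These helpers fold case only in the ASCII range, so multibyte runes (Chinese characters, accented letters) pass through unchanged. An equivalent sketch on JavaScript code points; `toLowerRune` mirrors the C++ helper and is not a package API:

```javascript
// ASCII-only case fold on a Unicode code point, as in the C++ ToLowerRune.
function toLowerRune(cp) {
  return cp >= 0x41 && cp <= 0x5a ? cp + 0x20 : cp; // 'A'..'Z' -> 'a'..'z'
}

// Apply the fold code point by code point (surrogate-pair safe via iteration).
function toLowerCodePoints(s) {
  return [...s]
    .map((ch) => String.fromCodePoint(toLowerRune(ch.codePointAt(0))))
    .join("");
}

toLowerCodePoints("Open Claw 结巴"); // → "open claw 结巴"
```

Restricting the fold to ASCII avoids locale-dependent behavior and keeps all Chinese dictionary entries byte-for-byte intact.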
@@ -74,4 +74,61 @@ describe("nodejieba.loadUserDict", function() {
    var loadResult = nodejieba.loadUserDict(dictSet);
    loadResult.should.eql(true);
  });
+
+ it("nodejieba.loadUserDict should filter empty strings", function() {
+   var dictLines = [
+     "有效词1",
+     "",
+     "有效词2",
+     "",
+     " ",
+     "\t",
+     "\n",
+     "有效词3"
+   ];
+   var loadResult = nodejieba.loadUserDict(dictLines);
+   loadResult.should.eql(true);
+ });
+
+ it("nodejieba.loadUserDict with space-containing keywords", function() {
+   var dictLines = [
+     "深度 学习",
+     "机器 学习 200 n",
+     "人工 智能 300 nz"
+   ];
+   var loadResult = nodejieba.loadUserDict(dictLines);
+   loadResult.should.eql(true);
+ });
+
+ it("nodejieba.loadUserDict should throw error for non-string array elements", function() {
+   var dictLines = [
+     "有效词",
+     123,
+     "另一个有效词"
+   ];
+
+   (function() {
+     nodejieba.loadUserDict(dictLines);
+   }).should.throw();
+ });
+
+ it("nodejieba.loadUserDict should return false for null", function() {
+   var loadResult = nodejieba.loadUserDict(null);
+   loadResult.should.eql(false);
+ });
+
+ it("nodejieba.loadUserDict should return false for undefined", function() {
+   var loadResult = nodejieba.loadUserDict(undefined);
+   loadResult.should.eql(false);
+ });
+
+ it("nodejieba.loadUserDict should throw TypeError for invalid type", function() {
+   (function() {
+     nodejieba.loadUserDict(123);
+   }).should.throw(TypeError);
+
+   (function() {
+     nodejieba.loadUserDict({});
+   }).should.throw(TypeError);
+ });
  });
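The contract these tests assert (null/undefined rejected softly, wrong types rejected with a `TypeError`, whitespace-only entries dropped) can be sketched as a standalone helper. This is an illustration of the expected behavior, not the package's internal implementation:

```javascript
// Illustrative validator mirroring the test contract above:
// null/undefined -> false, non-arrays -> TypeError,
// non-string elements -> TypeError, blank entries filtered out.
function validateAndFilter(lines) {
  if (lines === null || lines === undefined) return false;
  if (!Array.isArray(lines)) throw new TypeError("expected an array of strings");
  var kept = [];
  for (var line of lines) {
    if (typeof line !== "string") throw new TypeError("entries must be strings");
    if (line.trim().length > 0) kept.push(line);
  }
  return kept;
}

console.log(validateAndFilter(["有效词1", "", " ", "\t", "有效词2"]));
// [ '有效词1', '有效词2' ]
console.log(validateAndFilter(null)); // false
```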
@@ -0,0 +1,60 @@
+ var nodejieba = require("./index.js");
+
+ nodejieba.load();
+
+ console.log("Test 1: load dictionary entries containing whitespace");
+ try {
+   var result = nodejieba.loadUserDict([
+     "有效词1",
+     "",
+     " ",
+     "\t",
+     "\n",
+     "有效词2",
+     " 测试词 ",
+     " 空格词 "
+   ]);
+   console.log("✅ loaded:", result);
+ } catch (e) {
+   console.log("❌ load failed:", e.message);
+ }
+
+ console.log("\nTest 2: segment text using the whitespace-laden dictionary");
+ try {
+   var result = nodejieba.cut("有效词1和有效词2以及测试词");
+   console.log("✅ segmented:", result);
+   if (result.includes("有效词1") && result.includes("有效词2") && result.includes("测试词")) {
+     console.log("✅ dictionary entries recognized correctly");
+   }
+ } catch (e) {
+   console.log("❌ segmentation failed:", e.message);
+ }
+
+ console.log("\nTest 3: load a large dictionary full of whitespace entries");
+ try {
+   var largeDict = [];
+   for (var i = 0; i < 100; i++) {
+     largeDict.push("词" + i);
+     largeDict.push("");
+     largeDict.push(" ");
+     largeDict.push("\t\n");
+   }
+   var result = nodejieba.loadUserDict(largeDict);
+   console.log("✅ large dictionary loaded:", result);
+ } catch (e) {
+   console.log("❌ large dictionary load failed:", e.message);
+ }
+
+ console.log("\nTest 4: Buffer containing blank lines");
+ try {
+   var bufferContent = "词A\n\n \n\t\n词B\n 词C \n";
+   var result = nodejieba.loadUserDict(Buffer.from(bufferContent));
+   console.log("✅ Buffer loaded:", result);
+
+   var cutResult = nodejieba.cut("词A和词B以及词C");
+   console.log("✅ Buffer dictionary segmentation:", cutResult);
+ } catch (e) {
+   console.log("❌ Buffer load failed:", e.message);
+ }
+
+ console.log("\n✅ All tests finished; the assertion error is fixed!");
package/test_simple.js ADDED
@@ -0,0 +1,17 @@
+ var nodejieba = require("./index.js");
+
+ console.log("=== Test start ===\n");
+
+ try {
+   console.log("calling load()...");
+   nodejieba.load();
+   console.log("load() done");
+
+   console.log("\nsegmenting...");
+   var result = nodejieba.cut("测试");
+   console.log("segmentation result:", result);
+ } catch (e) {
+   console.error("error:", e);
+ }
+
+ console.log("\n=== Test end ===");
@@ -0,0 +1,66 @@
+ var nodejieba = require("./index.js");
+
+ console.log("=== Testing space-containing keyword matching ===\n");
+
+ // Load the dictionaries
+ nodejieba.load();
+
+ // Test 1: load a space-containing keyword (with frequency and POS tag)
+ console.log("Test 1: load 'Open Claw 2 n'");
+ nodejieba.loadUserDict(["Open Claw 2 n"]);
+
+ var testCases1 = [
+   "Open Claw",
+   "OpenClaw",
+   "Openclaw",
+   "OPENCLAW",
+   "open claw",
+   "OPEN CLAW"
+ ];
+
+ console.log("testing case variants:");
+ testCases1.forEach(function(testText) {
+   var result = nodejieba.cut(testText);
+   console.log("  '" + testText + "' ->", result);
+ });
+
+ console.log("\n");
+
+ // Test 2: load a space-containing keyword (keyword only)
+ console.log("Test 2: load 'Game Master' (keyword only)");
+ nodejieba.loadUserDict("Game Master");
+
+ var testCases2 = [
+   "Game Master",
+   "GameMaster",
+   "gamemaster",
+   "GAMEMASTER",
+   "GAME MASTER"
+ ];
+
+ console.log("testing case variants:");
+ testCases2.forEach(function(testText) {
+   var result = nodejieba.cut(testText);
+   console.log("  '" + testText + "' ->", result);
+ });
+
+ console.log("\n");
+
+ // Test 3: match inside sentences
+ console.log("Test 3: match space-containing keywords inside sentences");
+ var sentence1 = "I like Open Claw game very much";
+ var result1 = nodejieba.cut(sentence1);
+ console.log("  sentence: '" + sentence1 + "'");
+ console.log("  segmentation:", result1);
+
+ var sentence2 = "Open Claw和Game Master都是好游戏";
+ var result2 = nodejieba.cut(sentence2);
+ console.log("  sentence: '" + sentence2 + "'");
+ console.log("  segmentation:", result2);
+
+ var sentence3 = "OPENCLAW和gamemaster都是好游戏";
+ var result3 = nodejieba.cut(sentence3);
+ console.log("  sentence: '" + sentence3 + "'");
+ console.log("  segmentation:", result3);
+
+ console.log("\n=== Test finished ===");
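All the variants exercised above (`Open Claw`, `OpenClaw`, `OPENCLAW`, `open claw`, …) collapse onto a single lookup key under lowercasing plus whitespace removal. A hypothetical sketch of that normalization idea (not the package's actual matching code):

```javascript
// Hypothetical normalization that maps every tested variant of a
// space-containing keyword onto one canonical key.
function normalizeKeyword(s) {
  return s.toLowerCase().replace(/\s+/g, "");
}

["Open Claw", "OpenClaw", "Openclaw", "OPENCLAW", "open claw", "OPEN CLAW"]
  .forEach(function(v) {
    console.log(normalizeKeyword(v)); // each prints "openclaw"
  });
```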