jieba-rb 5.0.0

Files changed (117)
  1. checksums.yaml +7 -0
  2. data/.gitignore +18 -0
  3. data/.gitmodules +3 -0
  4. data/.travis.yml +19 -0
  5. data/Gemfile +4 -0
  6. data/LICENSE.txt +22 -0
  7. data/README.md +85 -0
  8. data/Rakefile +15 -0
  9. data/ext/cppjieba/.gitignore +17 -0
  10. data/ext/cppjieba/.travis.yml +22 -0
  11. data/ext/cppjieba/CMakeLists.txt +28 -0
  12. data/ext/cppjieba/ChangeLog.md +236 -0
  13. data/ext/cppjieba/README.md +285 -0
  14. data/ext/cppjieba/README_EN.md +111 -0
  15. data/ext/cppjieba/appveyor.yml +32 -0
  16. data/ext/cppjieba/deps/CMakeLists.txt +1 -0
  17. data/ext/cppjieba/deps/gtest/CMakeLists.txt +5 -0
  18. data/ext/cppjieba/deps/gtest/include/gtest/gtest-death-test.h +283 -0
  19. data/ext/cppjieba/deps/gtest/include/gtest/gtest-message.h +230 -0
  20. data/ext/cppjieba/deps/gtest/include/gtest/gtest-param-test.h +1421 -0
  21. data/ext/cppjieba/deps/gtest/include/gtest/gtest-param-test.h.pump +487 -0
  22. data/ext/cppjieba/deps/gtest/include/gtest/gtest-printers.h +796 -0
  23. data/ext/cppjieba/deps/gtest/include/gtest/gtest-spi.h +232 -0
  24. data/ext/cppjieba/deps/gtest/include/gtest/gtest-test-part.h +176 -0
  25. data/ext/cppjieba/deps/gtest/include/gtest/gtest-typed-test.h +259 -0
  26. data/ext/cppjieba/deps/gtest/include/gtest/gtest.h +2155 -0
  27. data/ext/cppjieba/deps/gtest/include/gtest/gtest_pred_impl.h +358 -0
  28. data/ext/cppjieba/deps/gtest/include/gtest/gtest_prod.h +58 -0
  29. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-death-test-internal.h +308 -0
  30. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-filepath.h +210 -0
  31. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-internal.h +1226 -0
  32. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-linked_ptr.h +233 -0
  33. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-param-util-generated.h +4822 -0
  34. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-param-util-generated.h.pump +301 -0
  35. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-param-util.h +619 -0
  36. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-port.h +1788 -0
  37. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-string.h +350 -0
  38. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-tuple.h +968 -0
  39. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-tuple.h.pump +336 -0
  40. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-type-util.h +3330 -0
  41. data/ext/cppjieba/deps/gtest/include/gtest/internal/gtest-type-util.h.pump +296 -0
  42. data/ext/cppjieba/deps/gtest/src/.deps/.dirstamp +0 -0
  43. data/ext/cppjieba/deps/gtest/src/.deps/gtest-all.Plo +681 -0
  44. data/ext/cppjieba/deps/gtest/src/.deps/gtest_main.Plo +509 -0
  45. data/ext/cppjieba/deps/gtest/src/.dirstamp +0 -0
  46. data/ext/cppjieba/deps/gtest/src/gtest-all.cc +48 -0
  47. data/ext/cppjieba/deps/gtest/src/gtest-death-test.cc +1234 -0
  48. data/ext/cppjieba/deps/gtest/src/gtest-filepath.cc +380 -0
  49. data/ext/cppjieba/deps/gtest/src/gtest-internal-inl.h +1038 -0
  50. data/ext/cppjieba/deps/gtest/src/gtest-port.cc +746 -0
  51. data/ext/cppjieba/deps/gtest/src/gtest-printers.cc +356 -0
  52. data/ext/cppjieba/deps/gtest/src/gtest-test-part.cc +110 -0
  53. data/ext/cppjieba/deps/gtest/src/gtest-typed-test.cc +110 -0
  54. data/ext/cppjieba/deps/gtest/src/gtest.cc +4898 -0
  55. data/ext/cppjieba/deps/gtest/src/gtest_main.cc +39 -0
  56. data/ext/cppjieba/deps/limonp/ArgvContext.hpp +70 -0
  57. data/ext/cppjieba/deps/limonp/BlockingQueue.hpp +49 -0
  58. data/ext/cppjieba/deps/limonp/BoundedBlockingQueue.hpp +67 -0
  59. data/ext/cppjieba/deps/limonp/BoundedQueue.hpp +65 -0
  60. data/ext/cppjieba/deps/limonp/Closure.hpp +206 -0
  61. data/ext/cppjieba/deps/limonp/Colors.hpp +31 -0
  62. data/ext/cppjieba/deps/limonp/Condition.hpp +38 -0
  63. data/ext/cppjieba/deps/limonp/Config.hpp +103 -0
  64. data/ext/cppjieba/deps/limonp/FileLock.hpp +74 -0
  65. data/ext/cppjieba/deps/limonp/ForcePublic.hpp +7 -0
  66. data/ext/cppjieba/deps/limonp/LocalVector.hpp +139 -0
  67. data/ext/cppjieba/deps/limonp/Logging.hpp +76 -0
  68. data/ext/cppjieba/deps/limonp/Md5.hpp +411 -0
  69. data/ext/cppjieba/deps/limonp/MutexLock.hpp +51 -0
  70. data/ext/cppjieba/deps/limonp/NonCopyable.hpp +21 -0
  71. data/ext/cppjieba/deps/limonp/StdExtension.hpp +159 -0
  72. data/ext/cppjieba/deps/limonp/StringUtil.hpp +365 -0
  73. data/ext/cppjieba/deps/limonp/Thread.hpp +44 -0
  74. data/ext/cppjieba/deps/limonp/ThreadPool.hpp +86 -0
  75. data/ext/cppjieba/dict/README.md +31 -0
  76. data/ext/cppjieba/dict/hmm_model.utf8 +34 -0
  77. data/ext/cppjieba/dict/idf.utf8 +258826 -0
  78. data/ext/cppjieba/dict/jieba.dict.utf8 +348982 -0
  79. data/ext/cppjieba/dict/pos_dict/char_state_tab.utf8 +6653 -0
  80. data/ext/cppjieba/dict/pos_dict/prob_emit.utf8 +166 -0
  81. data/ext/cppjieba/dict/pos_dict/prob_start.utf8 +259 -0
  82. data/ext/cppjieba/dict/pos_dict/prob_trans.utf8 +5222 -0
  83. data/ext/cppjieba/dict/stop_words.utf8 +1534 -0
  84. data/ext/cppjieba/dict/user.dict.utf8 +4 -0
  85. data/ext/cppjieba/include/cppjieba/DictTrie.hpp +227 -0
  86. data/ext/cppjieba/include/cppjieba/FullSegment.hpp +93 -0
  87. data/ext/cppjieba/include/cppjieba/HMMModel.hpp +129 -0
  88. data/ext/cppjieba/include/cppjieba/HMMSegment.hpp +190 -0
  89. data/ext/cppjieba/include/cppjieba/Jieba.hpp +108 -0
  90. data/ext/cppjieba/include/cppjieba/KeywordExtractor.hpp +153 -0
  91. data/ext/cppjieba/include/cppjieba/MPSegment.hpp +137 -0
  92. data/ext/cppjieba/include/cppjieba/MixSegment.hpp +109 -0
  93. data/ext/cppjieba/include/cppjieba/PosTagger.hpp +77 -0
  94. data/ext/cppjieba/include/cppjieba/PreFilter.hpp +54 -0
  95. data/ext/cppjieba/include/cppjieba/QuerySegment.hpp +90 -0
  96. data/ext/cppjieba/include/cppjieba/SegmentBase.hpp +46 -0
  97. data/ext/cppjieba/include/cppjieba/SegmentTagged.hpp +24 -0
  98. data/ext/cppjieba/include/cppjieba/TextRankExtractor.hpp +190 -0
  99. data/ext/cppjieba/include/cppjieba/Trie.hpp +174 -0
  100. data/ext/cppjieba/include/cppjieba/Unicode.hpp +215 -0
  101. data/ext/jieba/extconf.rb +28 -0
  102. data/ext/jieba/jieba.c +11 -0
  103. data/ext/jieba/jieba.h +11 -0
  104. data/ext/jieba/keyword.cc +92 -0
  105. data/ext/jieba/keyword.h +17 -0
  106. data/ext/jieba/segment.cc +107 -0
  107. data/ext/jieba/segment.h +17 -0
  108. data/ext/jieba/tagging.cc +76 -0
  109. data/ext/jieba/tagging.h +17 -0
  110. data/jieba_rb.gemspec +51 -0
  111. data/lib/jieba-rb.rb +66 -0
  112. data/lib/jieba_rb/version.rb +3 -0
  113. data/test/test_keyword.rb +17 -0
  114. data/test/test_segment.rb +32 -0
  115. data/test/test_tagging.rb +22 -0
  116. data/test/user.dict.utf8 +23 -0
  117. metadata +219 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 60638207c3ca1e4d26af778155a834f48e32c5fdc9abe7f9f5daa3f6708e8f32
+   data.tar.gz: f3aaa0b8450bce6f29955d5484c17b69017d80e87d8ebc29dd27b44bdd3ff2b1
+ SHA512:
+   metadata.gz: 158508d61cc0742194156b72f1e9afc678292c614a7024bc782d4fc6cec94b63b8a9a08d47868f9138d6754cfd453374919cc5a2d0a5c8571f1450db0f0a4869
+   data.tar.gz: d1eb4d94ac74fa04c3ab96d18381f6b6fc1392e7838970242562060c5726ae465ee23068a64d596130d03ca16e1ae81f30729f274a649529f50268454bb21d0b
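The digests recorded in checksums.yaml can be compared against the two archives packed inside the .gem file. The following is a minimal sketch, assuming `metadata.gz`, `data.tar.gz` and `checksums.yaml` have been unpacked into one directory; the helper name `verify_gem_checksums` and that directory layout are illustrative assumptions, not part of RubyGems.

```ruby
require 'digest'
require 'digest/sha2'
require 'yaml'

# Hypothetical helper (not a RubyGems API): compares every digest listed in
# checksums.yaml with the digest of the file of the same name in `dir`.
# Returns a hash such as { "SHA256/metadata.gz" => true, ... }.
def verify_gem_checksums(dir)
  expected = YAML.safe_load(File.read(File.join(dir, 'checksums.yaml')))
  results = {}
  expected.each do |algorithm, files|
    # "SHA256" -> Digest::SHA256, "SHA512" -> Digest::SHA512
    digest_class = Digest.const_get(algorithm)
    files.each do |name, hex|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      results["#{algorithm}/#{name}"] = (actual == hex.to_s)
    end
  end
  results
end
```

A caller would treat the gem as intact only if every value in the returned hash is `true`.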
data/.gitignore ADDED
@@ -0,0 +1,18 @@
+ *.gem
+ *.rbc
+ .bundle
+ .config
+ .yardoc
+ Gemfile.lock
+ InstalledFiles
+ _yardoc
+ coverage
+ doc/
+ lib/bundler/man
+ lib/*.bundle
+ pkg
+ rdoc
+ spec/reports
+ test/tmp
+ test/version_tmp
+ tmp
data/.gitmodules ADDED
@@ -0,0 +1,3 @@
+ [submodule "ext/cppjieba"]
+     path = ext/cppjieba
+     url = https://github.com/aszxqw/cppjieba.git
data/.travis.yml ADDED
@@ -0,0 +1,19 @@
+ language: ruby
+ rvm:
+   - 1.9.3
+   - 2.0.0
+   - 2.1.10
+   - 2.2.5
+   - 2.3.1
+   - ruby-head
+   - rbx-2
+
+ before_install:
+   - gem install bundler
+
+ before_script: rake compile
+
+ matrix:
+   allow_failures:
+     - rvm: ruby-head
+     - rvm: rbx-2
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source 'https://rubygems.org'
+
+ # Specify your gem's dependencies in jieba_rb.gemspec
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
+ Copyright (c) 2014 Li Qi
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,85 @@
+ # JiebaRb
+
+ [![Gem Version](https://badge.fury.io/rb/jieba_rb.svg)](http://badge.fury.io/rb/jieba_rb)
+
+ [![Build Status](https://travis-ci.org/altkatz/jieba_rb.png?branch=master)](https://travis-ci.org/altkatz/jieba_rb)
+
+
+ Ruby extension for [Cppjieba](https://github.com/aszxqw/cppjieba)
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+     gem 'jieba_rb'
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install jieba_rb
+
+ ## Word segment Usage
+
+ Mix Segment mode (HMM with Max Prob, default):
+
+     require 'jieba_rb'
+     seg = JiebaRb::Segment.new # equivalent to "JiebaRb::Segment.new mode: :mix"
+     words = seg.cut "令狐冲是云计算行业的专家"
+     # 令狐冲 是 云 计算 行业 的 专家
+
+ Mix Segment mode with user-defined dictionary:
+
+     seg = JiebaRb::Segment.new mode: :mix, user_dict: "ext/cppjieba/dict/user.dict.utf8"
+     words = seg.cut "令狐冲是云计算行业的专家"
+     # 令狐冲 是 云计算 行业 的 专家
+
+ HMM or Max probability (mp) Segment mode:
+
+     seg = JiebaRb::Segment.new mode: :hmm # or mode: :mp
+     words = seg.cut "令狐冲是云计算行业的专家"
+
+ ## Word tagging Usage
+
+ Default tagging:
+
+     require 'jieba_rb'
+     tagging = JiebaRb::Tagging.new
+     pairs = tagging.tag "我是蓝翔技工拖拉机学院手扶拖拉机专业的。"
+     # [{"我"=>"r"}, {"是"=>"v"}, {"蓝翔"=>"x"}, {"技工"=>"n"}, {"拖拉机"=>"n"}, {"学院"=>"n"}, {"手扶拖拉机"=>"n"}, {"专业"=>"n"}, {"的"=>"uj"}, {"。"=>"x"}]
+
+ Tagging with user-defined dictionary:
+
+     require 'jieba_rb'
+     tagging = JiebaRb::Tagging.new user_dict: :default
+     pairs = tagging.tag "我是蓝翔技工拖拉机学院手扶拖拉机专业的。"
+     # [{"我"=>"r"}, {"是"=>"v"}, {"蓝翔"=>"nz"}, {"技工"=>"n"}, {"拖拉机"=>"n"}, {"学院"=>"n"}, {"手扶拖拉机"=>"n"}, {"专业"=>"n"}, {"的"=>"uj"}, {"。"=>"x"}]
+
+ ## Keyword Extractor Usage
+
+ * Only TF-IDF is supported currently
+
+ ```
+ keyword = JiebaRb::Keyword.new
+ keywords_weights = keyword.extract "我是拖拉机学院手扶拖拉机专业的。不用多久,我就会升职加薪,当上CEO,走上人生巅峰。", 5
+
+ [
+   ["CEO", 11.739204307083542],
+   ["升职", 10.8561552143],
+   ["加薪", 10.642581114],
+   ["手扶拖拉机", 10.0088573539],
+   ["巅峰", 9.49395840471]
+ ]
+ ```
+
+
+
+ ## Contributing
+
+ 1. Fork it ( http://github.com/<my-github-username>/jieba_rb/fork )
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create new Pull Request
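The README's keyword extractor section names TF-IDF as the only supported method. To illustrate how such a ranking works in general, here is a minimal pure-Ruby sketch over already segmented text. It is not the gem's implementation: the IDF table, stop-word list, fallback weight and the name `extract_keywords` are all made up for illustration.

```ruby
# Toy TF-IDF keyword ranking over an already segmented text.
# All values below are illustrative assumptions, not data shipped with the gem.
IDF = { 'tractor' => 9.5, 'college' => 6.0, 'promotion' => 10.8 }.freeze
STOP_WORDS = %w[the of a].freeze
FALLBACK_IDF = 8.0 # assumed weight for words missing from the IDF table

# Returns the top_n [word, weight] pairs, where weight = term frequency * IDF.
def extract_keywords(words, top_n)
  tf = Hash.new(0)
  words.each { |w| tf[w] += 1 unless STOP_WORDS.include?(w) }
  weights = tf.map { |word, count| [word, count * IDF.fetch(word, FALLBACK_IDF)] }
  # Highest weight first; ties are broken alphabetically for stable output.
  weights.sort_by { |word, weight| [-weight, word] }.first(top_n)
end

words = %w[the tractor college of the tractor promotion]
p extract_keywords(words, 2)
# => [["tractor", 19.0], ["promotion", 10.8]]
```

"tractor" ranks first because it occurs twice (2 × 9.5), even though "promotion" has the higher IDF.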
data/Rakefile ADDED
@@ -0,0 +1,15 @@
+ require "bundler/gem_tasks"
+ require 'rake/testtask'
+ require 'rake/extensiontask'
+
+ Rake::ExtensionTask.new "jieba"
+
+ Rake::TestTask.new do |t|
+   t.libs << 'test'
+ end
+ desc "clean compile files"
+ task :clean_compile do
+   system "rm -r tmp"
+   system "rm lib/*.bundle"
+ end
+ task :default => :test
data/ext/cppjieba/.gitignore ADDED
@@ -0,0 +1,17 @@
+ tags
+ *.demo
+ *swp
+ *.out
+ *.o
+ *.d
+ *.ut
+ log
+ main
+ lib*.a
+ *_demo
+ segdict*
+ prior.gbk
+ tmp
+ t.*
+ *.pid
+ build
data/ext/cppjieba/.travis.yml ADDED
@@ -0,0 +1,22 @@
+ language: cpp
+ before_install:
+   - if [ $TRAVIS_OS_NAME == linux ]; then sudo apt-get install cmake; fi
+   - if [ $TRAVIS_OS_NAME == osx ]; then brew install cmake; fi
+ script:
+   - mkdir build
+   - cd build
+   - cmake ..
+   - make
+   - make test
+ os:
+   - linux
+   - osx
+ compiler:
+   - clang
+   - gcc
+ notifications:
+   recipients:
+     - i@yanyiwu.com
+   email:
+     on_success: change
+     on_failure: always
data/ext/cppjieba/CMakeLists.txt ADDED
@@ -0,0 +1,28 @@
+ PROJECT(CPPJIEBA)
+
+ CMAKE_MINIMUM_REQUIRED (VERSION 2.6)
+
+ INCLUDE_DIRECTORIES(${PROJECT_SOURCE_DIR}/deps
+   ${PROJECT_SOURCE_DIR}/include)
+
+ if (CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
+     set (CMAKE_INSTALL_PREFIX "/usr/local/cppjieba" CACHE PATH "default install path" FORCE )
+ endif()
+
+ ADD_DEFINITIONS(-O3 -Wall -g)
+ IF(APPLE) # mac os
+     ADD_DEFINITIONS(-std=c++0x)
+ endif()
+
+ # cmake .. -DENC=GBK
+ # IF (DEFINED ENC)
+ #     ADD_DEFINITIONS(-DCPPJIEBA_${ENC})
+ # ENDIF()
+
+ ADD_SUBDIRECTORY(deps)
+ ADD_SUBDIRECTORY(test)
+
+ ENABLE_TESTING()
+ ADD_TEST(NAME ./test/test.run COMMAND ./test/test.run)
+ ADD_TEST(NAME ./load_test COMMAND ./load_test)
+ ADD_TEST(NAME ./demo COMMAND ./demo)
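data/ext/cppjieba/ChangeLog.md ADDED
@@ -0,0 +1,236 @@
+ # CppJieba ChangeLog
+
+ ## v5.0.0
+
+ + Notice (**API changed**): the Jieba class now takes 5 arguments instead of 3, and KeywordExtractor is now used inside Jieba
+
+ ## v4.8.1
+
+ + add TextRankExtractor by [@questionfish] in [pull request 65](https://github.com/yanyiwu/cppjieba/pull/65)
+ + add the Jieba::ResetSeparators API for special cases such as [issue67](https://github.com/yanyiwu/cppjieba/issues/67)
+ + fix [issue70](https://github.com/yanyiwu/cppjieba/issues/70)
+ + support the (word, freq, tag) format in user_dict; see details in [pr74](https://github.com/yanyiwu/cppjieba/pull/74)
+
+ ## v4.8.0
+
+ + rewrite QuerySegment so that `Jieba::CutForSearch` behaves the same as the [jieba] `cut_for_search` API
+ + remove Jieba::SetQuerySegmentThreshold
+
+ ## v4.7.0
+
+ API changes:
+
+ + overload the Cut functions to add location information to Word results;
+ + remove LevelSegment;
+ + remove Jieba::Locate;
+
+ upgrade:
+
+ + limonp -> v0.6.1
+
+ ## v4.6.0
+
+ + Change Jieba::Locate (deprecated) to be a static function.
+ + Change the return value of KeywordExtractor::Extract from bool to void.
+ + Add KeywordExtractor::Word and more overloads of KeywordExtractor::Extract
+
+ ## v4.5.3
+
+ + Upgrade limonp to v0.6.0
+
+ ## v4.5.2
+
+ + Upgrade limonp to v0.5.6 to fix a latent problem.
+
+ ## v4.5.1
+
+ + Upgrade limonp to v0.5.5 to resolve macro name conflicts in some special cases.
+
+ ## v4.5.0
+
+ + Remove from Trie the earlier, poorly conceived uint16-specific optimization that used an array in place of a map.
+ Its main problem was the assumption that every Unicode character fits in a uint16, which prevented fuller support for Unicode characters from other languages.
+ + Change the Rune type from 16 bits to 32 bits to support more Unicode characters, including some rare Chinese characters.
+
+ ## v4.4.1
+
+ + Checked for memory leaks with valgrind and tracked down a leak caused by HMM model initialization. The leak is not critical,
+ because it occurs only while the dictionary is loaded, which normally happens just once, so it does not cause serious problems.
+ + Many thanks to [qinwf] for helping me find this bug.
+
+ ## v4.4.0
+
+ + Adding code is easy and deleting it is hard; after much thought, I decided to split the Server source code out of this project.
+ + This returns [cppjieba] to its original simplicity: the non-essential server code is removed so that the project can travel light and focus on the core segmentation code.
+ + By the way, if you really need the old server code, you can find it in the new repository [cppjieba-server](https://github.com/yanyiwu/cppjieba-server).
+
+ ## v4.3.3
+
+ + Yet Another Incompatibility Problem Repair: Upgrade [limonp] to version v0.5.3, fix an incompatibility problem on Windows
+
+ ## v4.3.2
+
+ + Upgrade [limonp] to version v0.5.2, fix an incompatibility problem on Windows
+
+ ## v4.3.1
+
+ + Overload the KeywordExtractor constructor so that a Jieba instance can be passed in to construct the dictionary and model.
+
+ ## v4.3.0
+
+ Source directory layout changes:
+
+ 1. src/ -> include/cppjieba/
+ 2. src/limonp/ -> deps/limonp/
+ 3. server/husky -> deps/husky/
+ 4. test/unittest/gtest -> deps/gtest
+
+ Dependency upgrades:
+
+ 1. [limonp] to version v0.5.1
+ 2. [husky] to version v0.2.0
+
+ ## v4.2.1
+
+ 1. Upgrade [limonp] to version v0.4.1, [husky] to version v0.2.0
+
+ ## v4.2.0
+
+ 1. Fix the problem with the multi-dictionary path separator on Windows reported in [issue50]: the separator changes from ':' to '|' or ';'.
+
+ ## v4.1.2
+
+ 1. Add the Jieba::Locate function, which computes the position of each word in a segmentation result; this is useful in some scenarios, such as highlighting search results.
+
+ ## v4.1.1
+
+ 1. Add the part-of-speech tagging function Jieba::Tag to class Jieba
+
+ ## v4.1.0
+
+ 1. QuerySegment adds a check when cutting: a long word that satisfies IsAllAscii (for example, an English word) is not split into finer-grained words.
+ 2. QuerySegment adds the SetMaxWordLen and GetMaxWordLen functions for configuring the word-length threshold that triggers the second segmentation pass.
+ 3. Jieba adds SetQuerySegmentThreshold to set the word-length threshold of the CutForSearch function.
+
+ ## v4.0.0
+
+ 1. Support loading multiple user dictionaries; the paths are separated by an ASCII colon (:), as a nod to the PATH environment variable.
+ 2. The user dictionary carries no weights. New user words previously defaulted to the maximum word-frequency weight; this is now configurable and defaults to the median.
+ 3. (Compatibility warning) Some code style changes, such as lowercasing the namespace from CppJieba to cppjieba.
+ 4. (Compatibility warning) Application.hpp is deprecated in favor of Jieba.hpp. The interface has also changed substantially: the function style is more uniform and closer to the Python version of Jieba.
+
+ ## v3.2.1
+
+ 1. Fix a bug caused by an incorrect header guard in Jieba.hpp.
+
+ ## v3.2.0
+
+ 1. Use a Trie optimization that is somewhat tricky from an engineering standpoint. The previous `Aho-Corasick-Automation` implementation is dropped; the new code is more readable and faster.
+ 2. Add a hierarchical segmenter: LevelSegment.
+ 3. Add fine-grained segmentation to MPSegment.
+ 4. Add class Jieba, which provides a more readable interface.
+ 5. Drop the unified ISegment interface, because a unified interface limited the flexibility of the segmentation methods and made some features hard to add.
+ 6. Add an optional parameter hmm, which enables new-word discovery by default, so that both MixSegment and QuerySegment can switch new-word discovery on or off.
+
+ ## v3.1.0
+
+ 1. Add an API for adding dictionary words dynamically: insertUserWord
+ 2. The cut function gains a default parameter and uses the Mix segmentation algorithm by default. See README.md for details on the segmentation algorithms.
+
+ ## v3.0.1
+
+ 1. Improve compatibility and fix compile errors in certain environments.
+
+ ## v3.0.0
+
+ 1. QuerySegment now supports a user-defined dictionary (optional parameter).
+ 2. KeywordExtractor now supports a user-defined dictionary (optional parameter).
+ 3. Change the code style to follow the Google code style.
+ 4. Add more detailed error logging and use LogFatal where appropriate during initialization.
+ 5. Add the Application class, which integrates all CppJieba functionality; from now on users only need this class.
+ 6. Change the cjserver service so that the segmentation algorithm can be selected through an HTTP parameter.
+ 7. Change the make install directory so that everything is installed into the single directory /usr/local/cppjieba.
+
+ ## v2.4.4
+
+ 1. Revise two finer-grained special filtering rules so that runs of digits (including floating-point numbers) and runs of letters are cut out as separate tokens (rather than mixed together).
+ 2. Change the DAG data structure used by the dynamic programming step of the max-probability method (and the Trie's DAG query function accordingly), improving segmentation speed by 8%.
+ 3. Improve performance with optimizations such as using the `Aho-Corasick-Automation` algorithm to speed up Trie lookups.
+ 4. Add two special rules for part-of-speech tagging.
+
+ ## v2.4.3
+
+ 1. Update the [husky] service code; the new [husky] is a simple thread-pool-based server framework. Also fix possible data loss when the body of an HTTP POST request is too long.
+ 2. Change the PosTagger parameters: remove currently unused ones and add a parameter for a user-defined dictionary, that is, support for **user-defined part-of-speech tags**.
+ 3. Better support for `mac osx` (apologies that the author only got around to buying a `mac` this late).
+ 4. Support `Docker`; see the `Dockerfile` for details.
+
+ ## v2.4.2
+
+ 1. On top of using `vector` where appropriate, optimizations such as using `limonp/LocalVector.hpp` as the type of `Unicode` improve performance by about `30%`.
+ 2. Let `cjserver` support user-defined dictionaries, configured through `user_dict_path` in `conf/server.conf`.
+ 3. Fix incomplete `MPSegment` results when the sentence contains special characters.
+ 4. Reduce the memory usage of `FullSegment`.
+ 5. Fix compilation failures with `-std=c++0x` or `-std=c++11`.
+
+ ## v2.4.1
+
+ 1. Improve the segmentation of some special characters and letter strings.
+ 2. Speed up keyword extraction.
+ 3. Provide an interface for user-defined dictionaries.
+ 4. Move the server-related code into its own `server/` directory.
+ 5. Fix single-character words in the user-defined dictionary being ignored by MixSegment's new-word discovery. In other words, the user dictionary now has the highest priority, followed by the built-in dictionary, and then the words found by new-word discovery.
+
+ ## v2.4.0
+
+ 1. Support older versions of `g++` and `cmake`; tested with `g++ 4.1.2` and `cmake 2.6`.
+ 2. Change some test case files to reduce compile time when testing.
+ 3. Fix problems related to `make install`.
+ 4. Add a POST request interface to the HTTP service.
+ 5. Split `Trie.hpp` into `DictTrie.hpp` and `Trie.hpp`, abstracting the trie data structure; also fix latent bugs in the Trie class and improve its unit tests.
+ 6. Rewrite the starting and stopping of cjserver; see README.md for the new procedure.
+
+ ## v2.3.4
+
+ 1. Fix a design problem: remove the `TrieManager` class to avoid potential hazards.
+ 2. Add the `stop_words.utf8` dictionary and change the `KeywordExtractor` initializer to use it.
+ 3. Improve the code structure of the `Trie`-related parts.
+
+ ## v2.3.3
+
+ 1. Fix results differing across machines because of unordered_map.
+ 2. Change some data structures from unordered_map to map, improving segmentation speed by about 1/6. (unordered_map is fast for lookups but less efficient for range iteration.)
+
+ ## v2.3.2
+
+ 1. Fix unit test problems: some cases gave different results on x86 and x64.
+ 2. Merge in a simple version of part-of-speech tagging.
+
+ ## v2.3.1
+
+ 1. Fix the service startup problem during installation (installing the segmentation service is only an add-on feature on Linux and does not affect the core code).
+
+ ## v2.3.0
+
+ 1. Add `KeywordExtractor.hpp` for keyword extraction.
+ 2. Use `gtest` for unit tests.
+
+ ## v2.2.0
+
+ 1. Performance optimization: segmentation is about 6 times faster.
+ 2. Nothing else comes to mind for now.
+
+ ## v2.1.1 (everything before v2.1.1 is recorded here under v2.1.1)
+
+ 1. Implement the __max-probability segmentation algorithm__ and the __HMM segmentation algorithm__, and combine them into `MixSegment`, which gives the best results.
+ 2. Heavy refactoring; most of the functional code is now in hpp files.
+ 3. Use `cmake` to manage the project.
+ 4. Use [limonp] as the utility library for common functions such as logging and string operations.
+ 5. Use [husky] as the server framework for a simple segmentation service.
+
+ [limonp]:http://github.com/yanyiwu/limonp.git
+ [husky]:http://github.com/yanyiwu/husky.git
+ [issue50]:https://github.com/yanyiwu/cppjieba/issues/50
+ [qinwf]:https://github.com/yanyiwu/cppjieba/pull/53#issuecomment-176264929
+ [jieba]:https://github.com/fxsjy/jieba
+ [@questionfish]:https://github.com/questionfish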
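The changelog refers several times to the max-probability method, which runs dynamic programming over a DAG of dictionary matches. The idea can be sketched in a few lines of Ruby. This is a toy illustration, not cppjieba's implementation: the word frequencies, the `UNKNOWN_FREQ` fallback and the name `mp_cut` are assumptions made for the example.

```ruby
# Toy max-probability segmentation. For every start position, the dictionary
# words beginning there form the edges of a DAG; dynamic programming from
# right to left then picks the path with the highest total log probability.
UNKNOWN_FREQ = 1.0 # assumed frequency for single characters not in the dictionary

def mp_cut(sentence, dict)
  total = dict.values.sum.to_f
  score = ->(word) { Math.log(dict.fetch(word, UNKNOWN_FREQ) / total) }
  chars = sentence.chars
  n = chars.size
  # best[i] = [best score for chars[i..], index where the first word ends]
  best = Array.new(n + 1)
  best[n] = [0.0, n]
  (n - 1).downto(0) do |i|
    # A single character is always a candidate, so every sentence can be cut.
    ends = (i...n).select { |j| j == i || dict.key?(chars[i..j].join) }
    best[i] = ends.map { |j| [score.call(chars[i..j].join) + best[j + 1][0], j + 1] }
                  .max_by(&:first)
  end
  # Walk the recorded choices from the left to recover the words.
  words = []
  i = 0
  while i < n
    j = best[i][1]
    words << chars[i...j].join
    i = j
  end
  words
end

base = { '云' => 50, '计算' => 300, '行业' => 200, '的' => 1000 }
p mp_cut('云计算行业', base)
# => ["云", "计算", "行业"]
p mp_cut('云计算行业', base.merge('云计算' => 20))
# => ["云计算", "行业"]
```

Adding a single entry to the dictionary changes the result, which is the same effect the README shows for its user-defined dictionary example.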