pluskid-rmmseg-cpp 0.2.0
- data/README +51 -0
- data/data/chars.dic +12638 -0
- data/data/punctuation.dic +79 -0
- data/data/words.dic +120330 -0
- data/ext/rmmseg/algor.cpp +216 -0
- data/ext/rmmseg/algor.h +75 -0
- data/ext/rmmseg/chunk.h +58 -0
- data/ext/rmmseg/dict.cpp +228 -0
- data/ext/rmmseg/dict.h +34 -0
- data/ext/rmmseg/extconf.rb +6 -0
- data/ext/rmmseg/memory.cpp +9 -0
- data/ext/rmmseg/memory.h +43 -0
- data/ext/rmmseg/rmmseg.cpp +174 -0
- data/ext/rmmseg/rules.h +87 -0
- data/ext/rmmseg/token.h +19 -0
- data/ext/rmmseg/word.h +44 -0
- data/lib/rmmseg/dictionary.rb +54 -0
- data/lib/rmmseg/ferret.rb +112 -0
- data/lib/rmmseg.rb +3 -0
- metadata +74 -0
data/README
ADDED
@@ -0,0 +1,51 @@
= rmmseg-cpp

== Background

rmmseg (http://rmmseg.rubyforge.org) is a Chinese word segmentation library
written for and in Ruby. It features full integration with Ruby. However,
its performance (both time and memory) is terrible in some cases, especially
when you use the complex algorithm.

So I re-implemented rmmseg in C++ and wrapped it as a Ruby extension. This
gem is aimed at high performance and is thus less extensible than the
pure-Ruby rmmseg gem. There are also some differences between them:

* The dictionary format is different. For performance reasons, the words
  dictionary of rmmseg-cpp includes the word length information.
  See the rdoc of the Dictionary class for more information on the format.
  I will likely upgrade rmmseg's dictionary format to make the two
  compatible; writing a Ruby script to convert the dictionary
  is almost trivial.

  While the dictionaries in rmmseg are loaded automatically when needed,
  you need to load dictionaries explicitly in rmmseg-cpp.

* Only the complex algorithm is provided currently. I don't see any need
  to implement the simple algorithm here as long as the complex algorithm
  has good performance and much better accuracy.

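The dictionary conversion mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's script: it assumes the source word list has one word per line and that the target format is "<length> <word>" with the length counted in characters. The authoritative format is the one described in the rdoc of the Dictionary class, so check it before relying on this. The helper names +convert_words+ and +convert_file+ are hypothetical.

```ruby
# Sketch only: both the input and the output format are assumptions.
# Input:  one word per line (assumed layout of a plain word list).
# Output: "<length> <word>" per line, with length counted in characters.

# Turn an enumerable of raw lines into "<length> <word>" strings.
def convert_words(lines)
  lines.map(&:strip)                            # drop newlines and stray spaces
       .reject(&:empty?)                        # skip blank lines
       .map { |word| "#{word.length} #{word}" } # String#length counts characters
end

# Hypothetical helper: convert a whole file, reading and writing UTF-8.
def convert_file(src_path, dst_path)
  words = File.readlines(src_path, encoding: 'UTF-8')
  output = convert_words(words).map { |line| line + "\n" }.join
  File.write(dst_path, output, encoding: 'UTF-8')
end
```

For example, <tt>convert_file('words.txt', 'words.dic')</tt> would write one converted line per non-blank input line; both file names here are placeholders.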
== Install

This project is hosted on GitHub (http://github.com/pluskid/rmmseg-cpp/).
You can use RubyGems to install rmmseg-cpp:

  sudo gem install pluskid-rmmseg-cpp --source=http://gems.github.com

Unfortunately, the command is long, and you currently have to use the
'pluskid' prefix. Alternatively, you can check out the latest source code:

  git clone git://github.com/pluskid/rmmseg-cpp.git

== Usage

Integration with Ferret is almost identical to before. To run the example for
rmmseg (http://rmmseg.rubyforge.org/#Analyzer-for-Ferret), only one minor
change is needed:

  RMMSeg::Dictionary.load_dictionaries

Add the line above to explicitly load the default dictionaries before
using the analyzer. You can also add your own dictionaries through
<tt>Dictionary#add</tt>.