pluskid-rmmseg-cpp 0.2.0
- data/README +51 -0
- data/data/chars.dic +12638 -0
- data/data/punctuation.dic +79 -0
- data/data/words.dic +120330 -0
- data/ext/rmmseg/algor.cpp +216 -0
- data/ext/rmmseg/algor.h +75 -0
- data/ext/rmmseg/chunk.h +58 -0
- data/ext/rmmseg/dict.cpp +228 -0
- data/ext/rmmseg/dict.h +34 -0
- data/ext/rmmseg/extconf.rb +6 -0
- data/ext/rmmseg/memory.cpp +9 -0
- data/ext/rmmseg/memory.h +43 -0
- data/ext/rmmseg/rmmseg.cpp +174 -0
- data/ext/rmmseg/rules.h +87 -0
- data/ext/rmmseg/token.h +19 -0
- data/ext/rmmseg/word.h +44 -0
- data/lib/rmmseg/dictionary.rb +54 -0
- data/lib/rmmseg/ferret.rb +112 -0
- data/lib/rmmseg.rb +3 -0
- metadata +74 -0
data/README
ADDED
@@ -0,0 +1,51 @@
= rmmseg-cpp

== Background

rmmseg (http://rmmseg.rubyforge.org) is a Chinese word segmentation library
written for and in Ruby, so it integrates fully with Ruby. However, its
performance (both time and memory) is terrible in some cases, especially
when you use the complex algorithm.

So I re-implemented rmmseg in C++ and wrapped it as a Ruby extension. This
gem aims at high performance and is therefore less extensible than the
pure-Ruby rmmseg gem. There are also some differences between them:

* The dictionary format is different. For performance reasons, the words
  dictionary of rmmseg-cpp includes the word length information.
  See the rdoc of the Dictionary class for more information on the format.
  I will likely upgrade rmmseg's dictionary format to make the two
  compatible; writing a Ruby script to convert the dictionary is almost
  trivial.

  While the dictionaries in rmmseg are loaded automatically when needed,
  you need to load dictionaries explicitly in rmmseg-cpp.

* Only the complex algorithm is provided currently. I don't see any need
  to implement the simple algorithm here if the complex algorithm has
  good performance and much better accuracy.

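The conversion script mentioned in the first point could look roughly like
the sketch below. It is only an illustration: the helper name
+add_length_prefix+ is hypothetical, and the output layout (one
"<length> <word>" pair per line, with the length counted in characters or
bytes) is an assumption. The authoritative description of the format is the
rdoc of the Dictionary class.

```ruby
# Hypothetical helper, not part of rmmseg-cpp.
# ASSUMPTION: the target format is "<length> <word>", one entry per line.
# Check the rdoc of the Dictionary class for the real layout and for
# whether the length is counted in characters or in bytes.
def add_length_prefix(lines, unit: :chars)
  lines.map { |line| line.strip }          # drop newlines and padding
       .reject { |word| word.empty? }      # skip blank lines
       .map do |word|
         length = unit == :bytes ? word.bytesize : word.length
         "#{length} #{word}"
       end
end

# Convert a small in-memory word list (one word per line).
converted = add_length_prefix(["中国\n", "人\n", "\n"])
puts converted
```

To convert a real file under the same assumption, the lines could be read
with <tt>File.readlines</tt> and the result written back with
<tt>File.write</tt>.
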
== Install

This project is hosted on GitHub (http://github.com/pluskid/rmmseg-cpp/).
You can use RubyGems to install rmmseg-cpp:

  sudo gem install pluskid-rmmseg-cpp --source=http://gems.github.com

Unfortunately, the command is long and currently requires the 'pluskid'
prefix. Alternatively, you can check out the latest source code:

  git clone git://github.com/pluskid/rmmseg-cpp.git

== Usage

Integration with Ferret is almost identical to that of rmmseg. To run the
rmmseg example (http://rmmseg.rubyforge.org/#Analyzer-for-Ferret), only one
minor change is needed:

  RMMSeg::Dictionary.load_dictionaries

Add the line above to explicitly load the default dictionaries before
using the analyzer. You can also add your own dictionaries through
<tt>Dictionary#add</tt>.

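As a rough sketch of where the call goes, the snippet below loads the
dictionaries first and only then builds the analyzer. The analyzer setup is
assumed to match the Ferret example in the rmmseg documentation linked
above. The +gems_loaded+ flag is a hypothetical guard added here so the
snippet does nothing when the rmmseg-cpp and ferret gems are not installed.

```ruby
# Sketch only: needs the rmmseg-cpp and ferret gems to do anything useful.
begin
  require 'rubygems'
  require 'rmmseg'
  require 'rmmseg/ferret'
  gems_loaded = true
rescue LoadError
  gems_loaded = false   # hypothetical guard: skip the demo without the gems
end

if gems_loaded
  # The one extra step compared with pure-Ruby rmmseg:
  # load the default dictionaries before the analyzer is used.
  RMMSeg::Dictionary.load_dictionaries

  # ASSUMPTION: same analyzer setup as in the rmmseg Ferret example.
  analyzer = RMMSeg::Ferret::Analyzer.new { |tokenizer|
    Ferret::Analysis::LowerCaseFilter.new(tokenizer)
  }
  index = Ferret::Index::Index.new(:analyzer => analyzer)
end
```
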