natto 0.9.3 → 0.9.4
- data/CHANGELOG +13 -2
- data/README.md +47 -22
- data/lib/natto.rb +242 -140
- data/lib/natto/binding.rb +40 -29
- data/lib/natto/utils.rb +16 -0
- data/lib/natto/version.rb +3 -3
- metadata +39 -60
- data/test/natto/tc_binding.rb +0 -41
- data/test/natto/tc_dictionaryinfo.rb +0 -65
- data/test/natto/tc_mecab.rb +0 -295
- data/test/natto/tc_mecabnode.rb +0 -101
- data/test/test_natto.rb +0 -12
data/CHANGELOG
CHANGED
@@ -1,7 +1,18 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
-
- __2012/02/
|
4
|
-
-
|
3
|
+
- __2012/02/26__: 0.9.4 release.
|
4
|
+
- Exposing the underlying FFI pointer as @tagger in Natto::MeCab
|
5
|
+
- Exposing the underlying FFI pointer as @pointer in Natto::MeCabNode
|
6
|
+
- Fixed node formatting behavior for both regular and N-Best node parsing
|
7
|
+
- Adding is_nor?, is_unk?, is_bos?, is_eos?, is_eon? to Natto::MeCabNode
|
8
|
+
- Adding is_sysdic?, is_usrdic?, is_unkdic? to Natto::DictionaryInfo
|
9
|
+
- Adding readnodes to allow for method-chaining on list of parsed nodes
|
10
|
+
- Adding readlines to allow for method-chaining on list of string output
|
11
|
+
- Updating README to include description of DevKit install for RubyInstaller
|
12
|
+
- Refactored the codebase, introducing more tests
|
13
|
+
|
14
|
+
- __2012/02/14__: 0.9.3 release.
|
15
|
+
- Updating structures, tests following the latest release of MeCab 0.993
|
5
16
|
- Ending support for 0.98 MeCab.
|
6
17
|
- Support for MeCab 0.993 in Cygwin is still not confirmed.
|
7
18
|
|
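The 0.9.4 changelog entries above mention node-type predicates and a `readnodes` method intended for method-chaining. The following is a minimal sketch of what such chaining could look like. It does not load natto or MeCab: `FakeNode` and the sample node list are hypothetical stand-ins, and only the predicate names and the stat values (0 for normal, 3 for end-of-sentence) are taken from the diff.

```ruby
# Stat values as they appear in the Natto::MeCabNode constants in this diff.
NOR_NODE = 0
EOS_NODE = 3

# Hypothetical stand-in for Natto::MeCabNode, for illustration only.
FakeNode = Struct.new(:surface, :stat) do
  # True for a normal node found in the dictionary.
  def is_nor?
    stat == NOR_NODE
  end

  # True for the virtual end-of-sentence node.
  def is_eos?
    stat == EOS_NODE
  end
end

# A hand-built list standing in for the Array that readnodes returns.
nodes = [
  FakeNode.new('卓球', NOR_NODE),
  FakeNode.new('なんて', NOR_NODE),
  FakeNode.new('', EOS_NODE)
]

# Chain Enumerable methods: keep normal nodes only, then collect surfaces.
surfaces = nodes.select { |n| n.is_nor? }.map { |n| n.surface }
puts surfaces.join(' / ')
```

With natto and MeCab installed, the equivalent call would presumably be `nm.readnodes(str).select { |n| n.is_nor? }.map { |n| n.surface }`.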
data/README.md
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
A Tasty Ruby Binding with MeCab
|
3
3
|
|
4
4
|
## What is natto?
|
5
|
-
natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.
|
5
|
+
natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech and morphological analyzer for the Japanese language.
|
6
6
|
|
7
7
|
natto is a gem bridging Ruby and MeCab using FFI (foreign function interface). No compilation is necessary, as natto is _not_ a C extension. natto will run on CRuby (mri/yarv) and JRuby (jvm) equally well. natto will also run on Windows, Unix/Linux, and Mac.
|
8
8
|
|
@@ -15,8 +15,22 @@ natto requires the following:
|
|
15
15
|
- [ffi _0.6.3 or greater_](http://rubygems.org/gems/ffi)
|
16
16
|
- Ruby _1.8.7 or greater_
|
17
17
|
|
18
|
-
## Installation
|
18
|
+
## Installation on *NIX/Mac/Cygwin
|
19
19
|
Install natto with the following gem command:
|
20
|
+
|
21
|
+
gem install natto
|
22
|
+
|
23
|
+
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which is what natto uses to bind to the <tt>mecab</tt> library.
|
24
|
+
|
25
|
+
## Installation on Windows
|
26
|
+
However, if you are using CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), which is an MSYS/MinGW-based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including <tt>ffi</tt>.
|
27
|
+
|
28
|
+
1. Download the latest release for RubyInstaller for Windows platforms and the corresponding DevKit from the [RubyInstaller for Windows downloads page](http://rubyinstaller.org/downloads/).
|
29
|
+
2. After installing RubyInstaller for Windows, double-click on the DevKit-tdm installer <tt>.exe</tt>, and expand the contents to an appropriate location, for example <tt>C:\devkit</tt>.
|
30
|
+
3. Open a command window under <tt>C:\devkit</tt>, and execute: <tt>ruby dk.rb init</tt>. This will locate all known ruby installations, and add them to <tt>C:\devkit\config.yml</tt>.
|
31
|
+
4. Next, execute: <tt>ruby dk.rb install</tt>, which will add the DevKit to all of the installed rubies listed in your <tt>C:\devkit\config.yml</tt>.
|
32
|
+
5. Now you should be able to install and build the <tt>ffi</tt> rubygem correctly on your Windows-installed ruby, so you can install <tt>natto</tt> with:
|
33
|
+
|
20
34
|
gem install natto
|
21
35
|
|
22
36
|
## Configuration
|
@@ -24,12 +38,19 @@ Install natto with the following gem command:
|
|
24
38
|
- In case of <tt>LoadError</tt>, please set the <tt>MECAB_PATH</tt> environment variable to the exact name/path to your <tt>mecab</tt> library.
|
25
39
|
|
26
40
|
e.g., for bash on UNIX/Linux
|
41
|
+
|
27
42
|
export MECAB_PATH=/usr/local/lib/libmecab.so
|
43
|
+
|
28
44
|
e.g., on Windows
|
45
|
+
|
29
46
|
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
|
47
|
+
|
30
48
|
e.g., for Cygwin
|
49
|
+
|
31
50
|
export MECAB_PATH=cygmecab-1
|
51
|
+
|
32
52
|
e.g., from within a Ruby program
|
53
|
+
|
33
54
|
ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
|
34
55
|
|
35
56
|
## Usage
|
@@ -38,15 +59,16 @@ e.g., from within a Ruby program
|
|
38
59
|
|
39
60
|
nm = Natto::MeCab.new
|
40
61
|
=> #<Natto::MeCab:0x28d30748
|
41
|
-
@
|
62
|
+
@tagger=#<FFI::Pointer address=0x28a97d50>, \
|
42
63
|
@options={}, \
|
43
|
-
@dicts=[#<Natto::DictionaryInfo:0x28d3061c
|
44
|
-
|
45
|
-
|
46
|
-
|
64
|
+
@dicts=[#<Natto::DictionaryInfo:0x28d3061c \
|
65
|
+
type="0", \
|
66
|
+
filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
67
|
+
charset="utf8">], \
|
68
|
+
@version="0.993">
|
47
69
|
|
48
70
|
puts nm.version
|
49
|
-
=> "0.
|
71
|
+
=> "0.993"
|
50
72
|
|
51
73
|
sysdic = nm.dicts.first
|
52
74
|
|
@@ -55,22 +77,25 @@ e.g., from within a Ruby program
|
|
55
77
|
|
56
78
|
puts sysdic.charset
|
57
79
|
=> "utf8"
|
58
|
-
|
59
|
-
nm.parse('
|
80
|
+
|
81
|
+
nm.parse('ピンチの時には必ずヒーローが現れる。') do |n|
|
60
82
|
puts "#{n.surface}\t#{n.feature}"
|
61
83
|
end
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
。
|
72
|
-
|
73
|
-
|
84
|
+
ピンチ 名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
|
85
|
+
の 助詞,連体化,*,*,*,*,の,ノ,ノ
|
86
|
+
時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
|
87
|
+
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
|
88
|
+
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
|
89
|
+
必ず 副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
|
90
|
+
ヒーロー 名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
|
91
|
+
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
|
92
|
+
現れる 動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
|
93
|
+
。 記号,句点,*,*,*,*,。,。,。
|
94
|
+
BOS/EOS,*,*,*,*,*,*,*,*
|
95
|
+
|
96
|
+
|
97
|
+
## Learn more
|
98
|
+
- You can read more about natto on the [project Wiki](https://bitbucket.org/buruzaemon/natto/wiki/Home).
|
74
99
|
|
75
100
|
## Contributing to natto
|
76
101
|
- Use [mercurial](http://mercurial.selenic.com/) and [check out the latest code at bitbucket](https://bitbucket.org/buruzaemon/natto/src/) to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
|
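The Configuration section of the README diff above says to set <tt>MECAB_PATH</tt> when a <tt>LoadError</tt> occurs. As a rough sketch, this could be done from Ruby before natto is required. The helper name `example_mecab_path` is hypothetical, and the paths are simply the README's own examples, not guaranteed install locations; adjust them for your system (the fallback branch in particular will not suit every platform).

```ruby
require 'rbconfig'

# Hypothetical helper: maps a host OS string to one of the example
# library names/paths quoted in the README. These are examples only.
def example_mecab_path(host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /cygwin/
    'cygmecab-1'
  when /mswin|mingw/
    'C:\Program Files\MeCab\bin\libmecab.dll'
  else
    '/usr/local/lib/libmecab.so'
  end
end

# Keep any value the user already exported; otherwise use the example path.
ENV['MECAB_PATH'] ||= example_mecab_path

# require 'natto' would follow here, once MECAB_PATH is in place.
puts ENV['MECAB_PATH']
```

Using `||=` leaves an existing environment setting untouched, so a shell-level `export MECAB_PATH=...` still takes precedence.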
data/lib/natto.rb
CHANGED
@@ -1,13 +1,14 @@
|
|
1
1
|
# coding: utf-8
|
2
2
|
require 'rubygems' if RUBY_VERSION.to_f < 1.9
|
3
3
|
require 'natto/binding'
|
4
|
+
require 'natto/utils'
|
4
5
|
|
5
6
|
module Natto
|
6
7
|
require 'ffi'
|
7
8
|
require 'optparse'
|
8
9
|
|
9
|
-
# <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt>
|
10
|
-
# Options to the <tt>mecab</tt>
|
10
|
+
# <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> tagger.
|
11
|
+
# Options to the <tt>mecab</tt> tagger are passed in as a string
|
11
12
|
# (MeCab command-line style) or as a Ruby-style hash at
|
12
13
|
# initialization.
|
13
14
|
#
|
@@ -16,32 +17,42 @@ module Natto
|
|
16
17
|
# require 'rubygems' if RUBY_VERSION.to_f < 1.9
|
17
18
|
# require 'natto'
|
18
19
|
#
|
19
|
-
# nm = Natto::MeCab.new('-
|
20
|
+
# nm = Natto::MeCab.new('-Ochasen')
|
20
21
|
# => #<Natto::MeCab:0x28d3bdc8 \
|
21
|
-
# @
|
22
|
-
# @options={:output_format_type=>"
|
23
|
-
# @dicts=[#<Natto::DictionaryInfo:0x289a1f14
|
22
|
+
# @tagger=#<FFI::Pointer address=0x28afb980>, \
|
23
|
+
# @options={:output_format_type=>"chasen"}, \
|
24
|
+
# @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
|
25
|
+
# type="0", \
|
24
26
|
# filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
25
27
|
# charset="utf8">], \
|
26
28
|
# @version="0.993">
|
27
29
|
#
|
28
|
-
# nm.parse('
|
30
|
+
# nm.parse('凡人にしか見えねえ風景ってのがあるんだよ。') do |n|
|
29
31
|
# puts "#{n.surface}\t#{n.feature}"
|
30
|
-
# end
|
31
|
-
#
|
32
|
-
#
|
33
|
-
#
|
34
|
-
#
|
35
|
-
#
|
36
|
-
#
|
37
|
-
#
|
32
|
+
# end
|
33
|
+
# 凡人 名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
|
34
|
+
# に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
|
35
|
+
# しか 助詞,係助詞,*,*,*,*,しか,シカ,シカ
|
36
|
+
# 見え 動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
|
37
|
+
# ねえ 助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
|
38
|
+
# 風景 名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
|
39
|
+
# って 助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
|
40
|
+
# の 名詞,非自立,一般,*,*,*,の,ノ,ノ
|
41
|
+
# が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
|
42
|
+
# ある 動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
|
43
|
+
# ん 名詞,非自立,一般,*,*,*,ん,ン,ン
|
44
|
+
# だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
|
45
|
+
# よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
|
46
|
+
# 。 記号,句点,*,*,*,*,。,。,。
|
47
|
+
# BOS/EOS,*,*,*,*,*,*,*,*
|
38
48
|
#
|
39
49
|
class MeCab
|
40
50
|
include Natto::Binding
|
51
|
+
include Natto::Utils
|
41
52
|
|
42
|
-
attr_reader :options, :dicts, :version
|
53
|
+
attr_reader :tagger, :options, :dicts, :version
|
43
54
|
|
44
|
-
# Mapping of mecab short-style configuration options to the <tt>mecab</tt>
|
55
|
+
# Mapping of mecab short-style configuration options to the <tt>mecab</tt> tagger.
|
45
56
|
# See the <tt>mecab</tt> help for more details.
|
46
57
|
SUPPORTED_OPTS = { '-r' => :rcfile,
|
47
58
|
'-d' => :dicdir,
|
@@ -69,7 +80,7 @@ module Natto
|
|
69
80
|
# - :rcfile -- resource file
|
70
81
|
# - :dicdir -- system dicdir
|
71
82
|
# - :userdic -- user dictionary
|
72
|
-
# - :lattice_level -- lattice information level (
|
83
|
+
# - :lattice_level -- lattice information level (DEPRECATED)
|
73
84
|
# - :output_format_type -- output format type (wakati, chasen, yomi, etc.)
|
74
85
|
# - :all_morphs -- output all morphs (default false)
|
75
86
|
# - :nbest -- output N best results (integer, default 1), requires lattice level >= 1
|
@@ -90,24 +101,30 @@ module Natto
|
|
90
101
|
# e.g.<br/>
|
91
102
|
#
|
92
103
|
# nm = Natto::MeCab.new(:node_format=>'%m¥t%f[7]¥n')
|
93
|
-
# => #<Natto::MeCab:0x28d2ae10
|
94
|
-
# @
|
95
|
-
# @
|
104
|
+
# => #<Natto::MeCab:0x28d2ae10
|
105
|
+
# @tagger=#<FFI::Pointer address=0x28a97980>, \
|
106
|
+
# @options={:node_format=>"%m¥t%f[7]¥n"}, \
|
107
|
+
# @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
|
108
|
+
# type="0", \
|
96
109
|
# filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
97
110
|
# charset="utf8">], \
|
98
111
|
# @version="0.993">
|
99
|
-
#
|
100
|
-
# puts nm.parse('
|
101
|
-
#
|
102
|
-
#
|
103
|
-
#
|
104
|
-
#
|
105
|
-
#
|
106
|
-
#
|
107
|
-
#
|
108
|
-
#
|
112
|
+
#
|
113
|
+
# puts nm.parse('才能とは求める人間に与えられるものではない。')
|
114
|
+
# 才能 サイノウ
|
115
|
+
# と ト
|
116
|
+
# は ハ
|
117
|
+
# 求める モトメル
|
118
|
+
# 人間 ニンゲン
|
119
|
+
# に ニ
|
120
|
+
# 与え アタエ
|
121
|
+
# られる ラレル
|
122
|
+
# もの モノ
|
123
|
+
# で デ
|
124
|
+
# は ハ
|
125
|
+
# ない ナイ
|
126
|
+
# 。 。
|
109
127
|
# EOS
|
110
|
-
# => nil
|
111
128
|
#
|
112
129
|
# @param [Hash or String]
|
113
130
|
# @raise [MeCabError] if <tt>mecab</tt> cannot be initialized with the given <tt>options</tt>
|
@@ -118,48 +135,87 @@ module Natto
|
|
118
135
|
@dicts = []
|
119
136
|
|
120
137
|
opt_str = self.class.build_options_str(@options)
|
121
|
-
@
|
122
|
-
raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @
|
138
|
+
@tagger = self.mecab_new2(opt_str)
|
139
|
+
raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @tagger.address == 0x0
|
123
140
|
|
124
|
-
|
125
|
-
self.
|
126
|
-
self.
|
127
|
-
self.mecab_set_all_morphs(@ptr, 1) if @options[:all_morphs]
|
141
|
+
self.mecab_set_theta(@tagger, @options[:theta]) if @options[:theta]
|
142
|
+
self.mecab_set_lattice_level(@tagger, @options[:lattice_level]) if @options[:lattice_level]
|
143
|
+
self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
|
128
144
|
|
129
145
|
# Set mecab parsing implementations for N-best and regular parsing,
|
130
146
|
# for both parsing as string and yielding a node object
|
131
147
|
# N-Best parsing implementations
|
132
148
|
if @options[:nbest] && @options[:nbest] > 1
|
133
149
|
# nbest parsing require lattice level >= 1
|
134
|
-
self.mecab_set_lattice_level(@
|
135
|
-
@parse_tostr = lambda
|
136
|
-
self.
|
137
|
-
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
142
|
-
|
143
|
-
|
150
|
+
self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
151
|
+
@parse_tostr = lambda do |str|
|
152
|
+
return self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], str) ||
|
153
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
154
|
+
end
|
155
|
+
@parse_tonodes = lambda do |str|
|
156
|
+
nodes = []
|
157
|
+
if @options[:nbest] && @options[:nbest] > 1
|
158
|
+
self.mecab_nbest_init(@tagger, str)
|
159
|
+
n = self.mecab_nbest_next_tonode(@tagger)
|
160
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
161
|
+
nlen = @options[:nbest]
|
162
|
+
nlen.times do
|
163
|
+
s = str.bytes.to_a
|
164
|
+
while n && n.address != 0x0
|
165
|
+
mn = Natto::MeCabNode.new(n)
|
166
|
+
if mn.is_nor?
|
167
|
+
slen, sarr = mn.length, []
|
168
|
+
slen.times { sarr << s.shift }
|
169
|
+
surf = sarr.pack('C*')
|
170
|
+
mn.surface = self.class.force_enc(surf)
|
171
|
+
if @options[:output_format_type] || @options[:node_format]
|
172
|
+
mn.feature = self.class.force_enc(self.mecab_format_node(@tagger, n))
|
173
|
+
end
|
174
|
+
end
|
175
|
+
nodes << mn if !mn.is_bos?
|
176
|
+
n = mn.next
|
177
|
+
end
|
178
|
+
n = self.mecab_nbest_next_tonode(@tagger)
|
179
|
+
end
|
180
|
+
end
|
181
|
+
return nodes
|
182
|
+
end
|
144
183
|
else
|
145
184
|
# default parsing implementations
|
146
|
-
@parse_tostr = lambda
|
147
|
-
return self.mecab_sparse_tostr(@
|
148
|
-
|
149
|
-
|
185
|
+
@parse_tostr = lambda do |str|
|
186
|
+
return self.mecab_sparse_tostr(@tagger, str) ||
|
187
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
188
|
+
end
|
189
|
+
@parse_tonodes = lambda do |str|
|
190
|
+
nodes = []
|
191
|
+
n = self.mecab_sparse_tonode(@tagger, str)
|
192
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
193
|
+
mn = Natto::MeCabNode.new(n)
|
194
|
+
n = mn.next if mn.next.address!=0x0 && mn.is_bos?
|
195
|
+
s = str.bytes.to_a
|
196
|
+
while n && n.address!=0x0
|
197
|
+
mn = Natto::MeCabNode.new(n)
|
198
|
+
if mn.is_nor?
|
199
|
+
slen, sarr = mn.length, []
|
200
|
+
slen.times { sarr << s.shift }
|
201
|
+
surf = sarr.pack('C*')
|
202
|
+
mn.surface = self.class.force_enc(surf)
|
203
|
+
end
|
204
|
+
nodes << mn
|
205
|
+
n = mn.next
|
206
|
+
end
|
207
|
+
return nodes
|
208
|
+
end
|
150
209
|
end
|
151
210
|
|
152
|
-
|
153
|
-
@dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@ptr))
|
211
|
+
@dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
|
154
212
|
while @dicts.last.next.address != 0x0
|
155
213
|
@dicts << Natto::DictionaryInfo.new(@dicts.last.next)
|
156
214
|
end
|
157
215
|
|
158
|
-
# set ref to mecab version string
|
159
216
|
@version = self.mecab_version
|
160
217
|
|
161
|
-
|
162
|
-
ObjectSpace.define_finalizer(self, self.class.create_free_proc(@ptr))
|
218
|
+
ObjectSpace.define_finalizer(self, self.class.create_free_proc(@tagger))
|
163
219
|
end
|
164
220
|
|
165
221
|
# Parses the given string <tt>str</tt>. If a block is passed to this method,
|
@@ -167,47 +223,48 @@ module Natto
|
|
167
223
|
#
|
168
224
|
# @param [String] str
|
169
225
|
# @return parsing result from <tt>mecab</tt>
|
170
|
-
# @raise [MeCabError] if the <tt>mecab</tt>
|
226
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
171
227
|
# @see MeCabNode
|
172
228
|
def parse(str)
|
173
229
|
if block_given?
|
174
|
-
|
175
|
-
|
176
|
-
if head && head[:next].address != 0x0
|
177
|
-
node = Natto::MeCabNode.new(head[:next])
|
178
|
-
i = 0
|
179
|
-
while node.nil? == false
|
180
|
-
if node.length > 0
|
181
|
-
node.surface = str.bytes.to_a()[i, node.length].pack('C*')
|
182
|
-
end
|
183
|
-
yield node
|
184
|
-
if node[:next].address != 0x0
|
185
|
-
i += node.length
|
186
|
-
node = Natto::MeCabNode.new(node[:next])
|
187
|
-
else
|
188
|
-
break
|
189
|
-
end
|
190
|
-
end
|
191
|
-
end
|
230
|
+
nodes = @parse_tonodes.call(str)
|
231
|
+
nodes.each {|n| yield n }
|
192
232
|
else
|
193
|
-
|
194
|
-
result.force_encoding(Encoding.default_external) if result.respond_to?(:encoding) && result.encoding!=Encoding.default_external
|
195
|
-
result
|
233
|
+
self.class.force_enc(@parse_tostr.call(str))
|
196
234
|
end
|
197
235
|
end
|
198
236
|
|
199
|
-
#
|
237
|
+
# Parses the given string <tt>str</tt>, and returns
|
238
|
+
# a list of <tt>mecab</tt> nodes.
|
239
|
+
# @param [String] str
|
240
|
+
# @return [Array] of parsed <tt>mecab</tt> nodes.
|
241
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
242
|
+
# @see MeCabNode
|
243
|
+
def readnodes(str)
|
244
|
+
@parse_tonodes.call(str)
|
245
|
+
end
|
246
|
+
|
247
|
+
# Parses the given string <tt>str</tt>, and returns
|
248
|
+
# a list of <tt>mecab</tt> result strings.
|
249
|
+
# @param [String] str
|
250
|
+
# @return [Array] of parsed <tt>mecab</tt> result strings.
|
251
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
252
|
+
def readlines(str)
|
253
|
+
self.class.force_enc(@parse_tostr.call(str)).lines.to_a
|
254
|
+
end
|
255
|
+
|
256
|
+
# Returns human-readable details for the wrapped <tt>mecab</tt> tagger.
|
200
257
|
# Overrides <tt>Object#to_s</tt>.
|
201
258
|
#
|
202
259
|
# - encoded object id
|
203
|
-
# - FFI pointer to <tt>mecab</tt>
|
260
|
+
# - underlying FFI pointer to the <tt>mecab</tt> tagger
|
204
261
|
# - options hash
|
205
262
|
# - list of dictionaries
|
206
263
|
# - MeCab version
|
207
264
|
#
|
208
|
-
# @return [String] encoded object id, FFI pointer, options hash, list of dictionaries, and MeCab version
|
265
|
+
# @return [String] encoded object id, underlying FFI pointer, options hash, list of dictionaries, and MeCab version
|
209
266
|
def to_s
|
210
|
-
%(#{super.chop} @
|
267
|
+
%(#{super.chop} @tagger=#{@tagger}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
|
211
268
|
end
|
212
269
|
|
213
270
|
# Overrides <tt>Object#inspect</tt>.
|
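The hunk above routes string results through `self.class.force_enc`, which comes from the new `natto/utils` file whose body does not appear in this section. Judging only from the inline `force_encoding` logic it replaces in `parse`, a minimal sketch of such a helper might look like the following. This is an assumption about its behavior, not the gem's actual implementation.

```ruby
# Sketch of an encoding helper, modeled on the inline code removed from
# MeCab#parse in this diff. Not the actual Natto::Utils implementation.
def force_enc(str)
  # String#encoding is not available on Ruby 1.8, hence the respond_to? guard.
  if str.respond_to?(:encoding) && str.encoding != Encoding.default_external
    # Re-tag the bytes with the external encoding; no transcoding happens.
    str.force_encoding(Encoding.default_external)
  end
  str
end

# Array#pack('C*') yields a binary (ASCII-8BIT) string, as in the diff's
# surface-building code.
raw = [0xE3, 0x81, 0xAE].pack('C*')
fixed = force_enc(raw)
puts fixed.encoding
```

Note that `force_encoding` only changes the encoding label on the existing bytes, which is why the byte sequence is identical before and after the call.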
@@ -284,10 +341,10 @@ module Natto
|
|
284
341
|
end
|
285
342
|
|
286
343
|
# Returns a string-representation of the options to
|
287
|
-
# be passed in the construction of <tt>mecab</tt
|
344
|
+
# be passed in the construction of the <tt>mecab</tt> tagger.
|
288
345
|
#
|
289
346
|
# @param [Hash] options
|
290
|
-
# @return [String] representation of the options to the <tt>mecab</tt>
|
347
|
+
# @return [String] representation of the options to the <tt>mecab</tt> tagger
|
291
348
|
def self.build_options_str(options={})
|
292
349
|
opt = []
|
293
350
|
SUPPORTED_OPTS.values.each do |k|
|
@@ -352,7 +409,16 @@ module Natto
|
|
352
409
|
#
|
353
410
|
# puts sysdic.charset
|
354
411
|
# => "utf8"
|
412
|
+
#
|
413
|
+
# puts sysdic.is_sysdic?
|
414
|
+
# => true
|
355
415
|
class DictionaryInfo < MeCabStruct
|
416
|
+
# System dictionary.
|
417
|
+
SYS_DIC = 0
|
418
|
+
# User dictionary.
|
419
|
+
USR_DIC = 1
|
420
|
+
# Unknown dictionary.
|
421
|
+
UNK_DIC = 2
|
356
422
|
|
357
423
|
layout :filename, :string,
|
358
424
|
:charset, :string,
|
@@ -381,12 +447,13 @@ module Natto
|
|
381
447
|
# Overrides <tt>Object#to_s</tt>.
|
382
448
|
#
|
383
449
|
# - encoded object id
|
450
|
+
# - dictionary type
|
384
451
|
# - full-path dictionary filename
|
385
452
|
# - dictionary charset
|
386
453
|
#
|
387
|
-
# @return [String] encoded object id, dictionary filename, and charset
|
454
|
+
# @return [String] encoded object id, type, dictionary filename, and charset
|
388
455
|
def to_s
|
389
|
-
%(#{super.chop} filename="#{self.filename}", charset="#{self.charset}">)
|
456
|
+
%(#{super.chop} type="#{self.type}", filename="#{self.filename}", charset="#{self.charset}">)
|
390
457
|
end
|
391
458
|
|
392
459
|
# Overrides <tt>Object#inspect</tt>.
|
@@ -396,6 +463,24 @@ module Natto
|
|
396
463
|
def inspect
|
397
464
|
self.to_s
|
398
465
|
end
|
466
|
+
|
467
|
+
# Returns <tt>true</tt> if this is a system dictionary.
|
468
|
+
# @return [Boolean]
|
469
|
+
def is_sysdic?
|
470
|
+
self.type == SYS_DIC
|
471
|
+
end
|
472
|
+
|
473
|
+
# Returns <tt>true</tt> if this is a user dictionary.
|
474
|
+
# @return [Boolean]
|
475
|
+
def is_usrdic?
|
476
|
+
self.type == USR_DIC
|
477
|
+
end
|
478
|
+
|
479
|
+
# Returns <tt>true</tt> if this is an unknown dictionary type.
|
480
|
+
# @return [Boolean]
|
481
|
+
def is_unkdic?
|
482
|
+
self.type == UNK_DIC
|
483
|
+
end
|
399
484
|
end
|
400
485
|
|
401
486
|
# <tt>MeCabNode</tt> is a wrapper for the structure holding
|
@@ -430,61 +515,57 @@ module Natto
|
|
430
515
|
# - :cost
|
431
516
|
#
|
432
517
|
# <h2>Usage</h2>
|
433
|
-
# An instance of <tt>MeCabNode</tt> is yielded to
|
434
|
-
# used with <tt>MeCab#parse</tt
|
435
|
-
#
|
436
|
-
# node attributes may be accessed.
|
518
|
+
# An instance of <tt>MeCabNode</tt> is yielded to the block
|
519
|
+
# used with <tt>MeCab#parse</tt>, where the above-mentioned
|
520
|
+
# node attributes may be accessed by name.
|
437
521
|
#
|
438
522
|
# nm = Natto::MeCab.new
|
439
523
|
#
|
440
|
-
# nm.parse('
|
441
|
-
# puts "#{n.surface}
|
524
|
+
# nm.parse('卓球なんて死ぬまでの暇つぶしだよ。') do |n|
|
525
|
+
# puts "#{n.surface}\t#{n.cost}" if n.is_nor?
|
442
526
|
# end
|
443
|
-
#
|
444
|
-
#
|
445
|
-
#
|
446
|
-
#
|
447
|
-
#
|
448
|
-
#
|
449
|
-
#
|
450
|
-
#
|
451
|
-
#
|
452
|
-
# 困っ 43797
|
453
|
-
# て 42178
|
454
|
-
# まし 46708
|
455
|
-
# た 46111
|
456
|
-
# 。 42677
|
457
|
-
# 41141
|
458
|
-
# => nil
|
527
|
+
# 卓球 2874
|
528
|
+
# なんて 4398
|
529
|
+
# 死ぬ 9261
|
530
|
+
# まで 9386
|
531
|
+
# の 10007
|
532
|
+
# 暇つぶし 13324
|
533
|
+
# だ 15346
|
534
|
+
# よ 14396
|
535
|
+
# 。 10194
|
459
536
|
#
|
460
537
|
# It is also possible to use the <tt>Symbol</tt> for the
|
461
538
|
# <tt>mecab</tt> node member to index into the
|
462
539
|
# <tt>FFI::Struct</tt> layout associative array like so:
|
463
540
|
#
|
464
|
-
# nm.parse('
|
465
|
-
#
|
466
|
-
#
|
467
|
-
#
|
468
|
-
#
|
469
|
-
#
|
470
|
-
#
|
471
|
-
#
|
472
|
-
#
|
541
|
+
# nm.parse('あいつ笑うと結構可愛い顔してんよ。') {|n| puts n[:feature] }
|
542
|
+
# 名詞,代名詞,一般,*,*,*,あいつ,アイツ,アイツ
|
543
|
+
# 動詞,自立,*,*,五段・ワ行促音便,基本形,笑う,ワラウ,ワラウ
|
544
|
+
# 助詞,接続助詞,*,*,*,*,と,ト,ト
|
545
|
+
# 副詞,一般,*,*,*,*,結構,ケッコウ,ケッコー
|
546
|
+
# 形容詞,自立,*,*,形容詞・イ段,基本形,可愛い,カワイイ,カワイイ
|
547
|
+
# 名詞,一般,*,*,*,*,顔,カオ,カオ
|
548
|
+
# 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
|
549
|
+
# 動詞,非自立,*,*,一段,体言接続特殊,てる,テン,テン
|
550
|
+
# 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
|
551
|
+
# 記号,句点,*,*,*,*,。,。,。
|
473
552
|
# BOS/EOS,*,*,*,*,*,*,*,*
|
474
|
-
# => nil
|
475
553
|
#
|
476
554
|
class MeCabNode < MeCabStruct
|
555
|
+
include Natto::Utils
|
556
|
+
|
477
557
|
attr_accessor :surface, :feature
|
558
|
+
attr_reader :pointer
|
478
559
|
|
479
|
-
# Normal <tt>mecab</tt> node.
|
560
|
+
# Normal <tt>mecab</tt> node defined in the dictionary.
|
480
561
|
NOR_NODE = 0
|
481
|
-
# Unknown <tt>mecab</tt> node.
|
562
|
+
# Unknown <tt>mecab</tt> node not defined in the dictionary.
|
482
563
|
UNK_NODE = 1
|
483
|
-
#
|
564
|
+
# Virtual node representing the beginning of the sentence.
|
484
565
|
BOS_NODE = 2
|
485
|
-
#
|
566
|
+
# Virtual node representing the end of the sentence.
|
486
567
|
EOS_NODE = 3
|
487
|
-
#
|
568
|
+
# Virtual node representing the end of an N-Best <tt>mecab</tt> node list.
|
488
569
|
EON_NODE = 4
|
489
570
|
|
490
571
|
layout :prev, :pointer,
|
@@ -525,39 +606,30 @@ module Natto
|
|
525
606
|
end
|
526
607
|
|
527
608
|
# Initializes this node instance.
|
528
|
-
# Sets the <
|
609
|
+
# Sets the <tt>MeCab</tt> feature value for this node.
|
529
610
|
#
|
530
611
|
# @param [FFI::Pointer]
|
531
612
|
def initialize(ptr)
|
532
613
|
super(ptr)
|
614
|
+
@pointer = ptr
|
533
615
|
|
534
616
|
if self[:feature]
|
535
|
-
@feature = self[:feature]
|
536
|
-
@feature.force_encoding(Encoding.default_external) if @feature.respond_to?(:encoding) && @feature.encoding!=Encoding.default_external
|
617
|
+
@feature = self.class.force_enc(self[:feature])
|
537
618
|
end
|
538
619
|
end
|
539
620
|
|
540
|
-
# Sets the morpheme surface value for this node.
|
541
|
-
#
|
542
|
-
# @param [String]
|
543
|
-
def surface=(str)
|
544
|
-
if str && self[:length] > 0
|
545
|
-
@surface = str
|
546
|
-
@surface.force_encoding(Encoding.default_external) if @surface.respond_to?(:encoding) && @surface.encoding!=Encoding.default_external
|
547
|
-
end
|
548
|
-
end
|
549
|
-
|
550
621
|
# Returns human-readable details for the <tt>mecab</tt> node.
|
551
622
|
# Overrides <tt>Object#to_s</tt>.
|
552
623
|
#
|
553
624
|
# - encoded object id
|
554
|
-
# -
|
625
|
+
# - underlying FFI pointer to MeCab Node
|
626
|
+
# - stat (node type: NOR, UNK, BOS/EOS, EON)
|
555
627
|
# - surface
|
556
628
|
# - feature
|
557
629
|
#
|
558
|
-
# @return [String] encoded object id, stat, surface, and feature
|
630
|
+
# @return [String] encoded object id, underlying FFI pointer, stat, surface, and feature
|
559
631
|
def to_s
|
560
|
-
%(#{super.chop} stat=#{self[:stat]}, surface="#{self.surface}", feature="#{self.feature}">)
|
632
|
+
%(#{super.chop} @pointer=#{@pointer}, stat=#{self[:stat]}, @surface="#{self.surface}", @feature="#{self.feature}">)
|
561
633
|
end
|
562
634
|
|
563
635
|
# Overrides <tt>Object#inspect</tt>.
|
@@ -567,5 +639,35 @@ module Natto
|
|
567
639
|
def inspect
|
568
640
|
self.to_s
|
569
641
|
end
|
642
|
+
|
643
|
+
# Returns <tt>true</tt> if this is a normal <tt>mecab</tt> node found in the dictionary.
|
644
|
+
# @return [Boolean]
|
645
|
+
def is_nor?
|
646
|
+
self.stat == NOR_NODE
|
647
|
+
end
|
648
|
+
|
649
|
+
# Returns <tt>true</tt> if this is an unknown <tt>mecab</tt> node not found in the dictionary.
|
650
|
+
# @return [Boolean]
|
651
|
+
def is_unk?
|
652
|
+
self.stat == UNK_NODE
|
653
|
+
end
|
654
|
+
|
655
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the beginning of the sentence.
|
656
|
+
# @return [Boolean]
|
657
|
+
def is_bos?
|
658
|
+
self.stat == BOS_NODE
|
659
|
+
end
|
660
|
+
|
661
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the sentence.
|
662
|
+
# @return [Boolean]
|
663
|
+
def is_eos?
|
664
|
+
self.stat == EOS_NODE
|
665
|
+
end
|
666
|
+
|
667
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the node list.
|
668
|
+
# @return [Boolean]
|
669
|
+
def is_eon?
|
670
|
+
self.stat == EON_NODE
|
671
|
+
end
|
570
672
|
end
|
571
673
|
end
|
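The new `@parse_tonodes` lambdas in this file rebuild each node's surface by taking `mn.length` bytes off the front of the input string's byte array and packing them back into a string. The sketch below isolates that byte-slicing step without MeCab. The method name `slice_surfaces` and the hard-coded byte lengths are hypothetical; in natto the lengths come from each parsed node.

```ruby
# Splits str into pieces using a list of byte lengths, mirroring the
# shift-and-pack approach used in the diff's node-building loops.
def slice_surfaces(str, byte_lengths)
  bytes = str.bytes.to_a
  byte_lengths.map do |len|
    # Take the next len bytes, pack them into a binary string, then
    # re-tag that string with the encoding of the original input.
    bytes.shift(len).pack('C*').force_encoding(str.encoding)
  end
end

# In UTF-8 each of these kana/kanji characters occupies 3 bytes,
# so the three morphemes are 9, 3 and 3 bytes long.
surfaces = slice_surfaces('ピンチの時', [9, 3, 3])
puts surfaces.inspect
```

Working on bytes rather than characters matters here because the lengths are byte counts, so indexing the string by character would cut multibyte text in the wrong places.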