natto 0.9.3 → 0.9.4
- data/CHANGELOG +13 -2
- data/README.md +47 -22
- data/lib/natto.rb +242 -140
- data/lib/natto/binding.rb +40 -29
- data/lib/natto/utils.rb +16 -0
- data/lib/natto/version.rb +3 -3
- metadata +39 -60
- data/test/natto/tc_binding.rb +0 -41
- data/test/natto/tc_dictionaryinfo.rb +0 -65
- data/test/natto/tc_mecab.rb +0 -295
- data/test/natto/tc_mecabnode.rb +0 -101
- data/test/test_natto.rb +0 -12
data/CHANGELOG
CHANGED
@@ -1,7 +1,18 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
-
- __2012/02/
|
4
|
-
-
|
3
|
+
- __2012/02/26__: 0.9.4 release.
|
4
|
+
- Exposing the underlying FFI pointer as @tagger in Natto::MeCab
|
5
|
+
- Exposing the underlying FFI pointer as @pointer in Natto::MeCabNode
|
6
|
+
- Fixed node formatting behavior for both regular and N-Best node parsing
|
7
|
+
- Adding is_nor?, is_unk?, is_bos?, is_eos?, is_eon? to Natto::MeCabNode
|
8
|
+
- Adding is_sysdic?, is_usrdic?, is_unkdic? to Natto::DictionaryInfo
|
9
|
+
- Adding readnodes to allow for method-chaining on list of parsed nodes
|
10
|
+
- Adding readlines to allow for method-chaining on list of string output
|
11
|
+
- Updating README to include description of DevKit install for RubyInstaller
|
12
|
+
- Refactored the codebase, introducing more tests
|
13
|
+
|
14
|
+
- __2012/02/14__: 0.9.3 release.
|
15
|
+
- Updating structures, tests following the latest release of MeCab 0.993
|
5
16
|
- Ending support for 0.98 MeCab.
|
6
17
|
- Support for MeCab 0.993 in Cygwin is still not confirmed.
|
7
18
|
|
data/README.md
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
A Tasty Ruby Binding with MeCab
|
3
3
|
|
4
4
|
## What is natto?
|
5
|
-
natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.
|
5
|
+
natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech and morphological analyzer for the Japanese language.
|
6
6
|
|
7
7
|
natto is a gem bridging Ruby and MeCab using FFI (foreign function interface). No compilation is necessary, as natto is _not_ a C extension. natto will run on CRuby (mri/yarv) and JRuby (jvm) equally well. natto will also run on Windows, Unix/Linux, and Mac.
|
8
8
|
|
@@ -15,8 +15,22 @@ natto requires the following:
|
|
15
15
|
- [ffi _0.6.3 or greater_](http://rubygems.org/gems/ffi)
|
16
16
|
- Ruby _1.8.7 or greater_
|
17
17
|
|
18
|
-
## Installation
|
18
|
+
## Installation on *NIX/Mac/Cygwin
|
19
19
|
Install natto with the following gem command:
|
20
|
+
|
21
|
+
gem install natto
|
22
|
+
|
23
|
+
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which is what natto uses to bind to the <tt>mecab</tt> library.
|
24
|
+
|
25
|
+
## Installation on Windows
|
26
|
+
However, if you are using CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), which is an MSYS/MinGW-based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including <tt>ffi</tt>.
|
27
|
+
|
28
|
+
1. Download the latest release for RubyInstaller for Windows platforms and the corresponding DevKit from the [RubyInstaller for Windows downloads page](http://rubyinstaller.org/downloads/).
|
29
|
+
2. After installing RubyInstaller for Windows, double-click on the DevKit-tdm installer <tt>.exe</tt>, and expand the contents to an appropriate location, for example <tt>C:\devkit</tt>.
|
30
|
+
3. Open a command window under <tt>C:\devkit</tt>, and execute: <tt>ruby dk.rb init</tt>. This will locate all known ruby installations, and add them to <tt>C:\devkit\config.yml</tt>.
|
31
|
+
4. Next, execute: <tt>ruby dk.rb install</tt>, which will add the DevKit to all of the installed rubies listed in your <tt>C:\devkit\config.yml</tt>.
|
32
|
+
5. Now you should be able to install and build the <tt>ffi</tt> rubygem correctly on your Windows-installed ruby, so you can install <tt>natto</tt> with:
|
33
|
+
|
20
34
|
gem install natto
|
21
35
|
|
22
36
|
## Configuration
|
@@ -24,12 +38,19 @@ Install natto with the following gem command:
|
|
24
38
|
- In case of <tt>LoadError</tt>, please set the <tt>MECAB_PATH</tt> environment variable to the exact name/path to your <tt>mecab</tt> library.
|
25
39
|
|
26
40
|
e.g., for bash on UNIX/Linux
|
41
|
+
|
27
42
|
export MECAB_PATH=/usr/local/lib/libmecab.so
|
43
|
+
|
28
44
|
e.g., on Windows
|
45
|
+
|
29
46
|
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
|
47
|
+
|
30
48
|
e.g., for Cygwin
|
49
|
+
|
31
50
|
export MECAB_PATH=cygmecab-1
|
51
|
+
|
32
52
|
e.g., from within a Ruby program
|
53
|
+
|
33
54
|
ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
|
34
55
|
|
35
56
|
## Usage
|
@@ -38,15 +59,16 @@ e.g., from within a Ruby program
|
|
38
59
|
|
39
60
|
nm = Natto::MeCab.new
|
40
61
|
=> #<Natto::MeCab:0x28d30748
|
41
|
-
@
|
62
|
+
@tagger=#<FFI::Pointer address=0x28a97d50>, \
|
42
63
|
@options={}, \
|
43
|
-
@dicts=[#<Natto::DictionaryInfo:0x28d3061c
|
44
|
-
|
45
|
-
|
46
|
-
|
64
|
+
@dicts=[#<Natto::DictionaryInfo:0x28d3061c \
|
65
|
+
type="0", \
|
66
|
+
filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
67
|
+
charset="utf8">], \
|
68
|
+
@version="0.993">
|
47
69
|
|
48
70
|
puts nm.version
|
49
|
-
=> "0.
|
71
|
+
=> "0.993"
|
50
72
|
|
51
73
|
sysdic = nm.dicts.first
|
52
74
|
|
@@ -55,22 +77,25 @@ e.g., from within a Ruby program
|
|
55
77
|
|
56
78
|
puts sysdic.charset
|
57
79
|
=> "utf8"
|
58
|
-
|
59
|
-
nm.parse('
|
80
|
+
|
81
|
+
nm.parse('ピンチの時には必ずヒーローが現れる。') do |n|
|
60
82
|
puts "#{n.surface}\t#{n.feature}"
|
61
83
|
end
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
。
|
72
|
-
|
73
|
-
|
84
|
+
ピンチ 名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
|
85
|
+
の 助詞,連体化,*,*,*,*,の,ノ,ノ
|
86
|
+
時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
|
87
|
+
に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
|
88
|
+
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
|
89
|
+
必ず 副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
|
90
|
+
ヒーロー 名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
|
91
|
+
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
|
92
|
+
現れる 動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
|
93
|
+
。 記号,句点,*,*,*,*,。,。,。
|
94
|
+
BOS/EOS,*,*,*,*,*,*,*,*
|
95
|
+
|
96
|
+
|
97
|
+
## Learn more
|
98
|
+
- You can read more about natto on the [project Wiki](https://bitbucket.org/buruzaemon/natto/wiki/Home).
|
74
99
|
|
75
100
|
## Contributing to natto
|
76
101
|
- Use [mercurial](http://mercurial.selenic.com/) and [check out the latest code at bitbucket](https://bitbucket.org/buruzaemon/natto/src/) to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
|
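The README usage example above prints each node as a surface followed by a comma-separated feature string. As a minimal sketch, assuming the nine-field IPADIC layout seen in that sample output, such a line could be split into fields as follows. The helper name `split_feature` is hypothetical and is not part of natto.

```ruby
# coding: utf-8
# Hypothetical helper, not part of natto: splits one "surface\tfeature" line,
# as printed by the README example, into its surface and feature fields.
def split_feature(line)
  surface, feature = line.chomp.split("\t", 2)
  { :surface => surface, :fields => feature.to_s.split(',') }
end

parsed = split_feature("現れる\t動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル")
puts parsed[:surface]        # surface form of the morpheme
puts parsed[:fields].first   # first feature field (part of speech)
puts parsed[:fields].size    # number of comma-separated feature fields
```

Other dictionaries may use a different number or order of feature fields, so the field positions should be treated as dictionary-specific.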
data/lib/natto.rb
CHANGED
@@ -1,13 +1,14 @@
|
|
1
1
|
# coding: utf-8
|
2
2
|
require 'rubygems' if RUBY_VERSION.to_f < 1.9
|
3
3
|
require 'natto/binding'
|
4
|
+
require 'natto/utils'
|
4
5
|
|
5
6
|
module Natto
|
6
7
|
require 'ffi'
|
7
8
|
require 'optparse'
|
8
9
|
|
9
|
-
# <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt>
|
10
|
-
# Options to the <tt>mecab</tt>
|
10
|
+
# <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> tagger.
|
11
|
+
# Options to the <tt>mecab</tt> tagger are passed in as a string
|
11
12
|
# (MeCab command-line style) or as a Ruby-style hash at
|
12
13
|
# initialization.
|
13
14
|
#
|
@@ -16,32 +17,42 @@ module Natto
|
|
16
17
|
# require 'rubygems' if RUBY_VERSION.to_f < 1.9
|
17
18
|
# require 'natto'
|
18
19
|
#
|
19
|
-
# nm = Natto::MeCab.new('-
|
20
|
+
# nm = Natto::MeCab.new('-Ochasen')
|
20
21
|
# => #<Natto::MeCab:0x28d3bdc8 \
|
21
|
-
# @
|
22
|
-
# @options={:output_format_type=>"
|
23
|
-
# @dicts=[#<Natto::DictionaryInfo:0x289a1f14
|
22
|
+
# @tagger=#<FFI::Pointer address=0x28afb980>, \
|
23
|
+
# @options={:output_format_type=>"chasen"}, \
|
24
|
+
# @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
|
25
|
+
# type="0", \
|
24
26
|
# filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
25
27
|
# charset="utf8">], \
|
26
28
|
# @version="0.993">
|
27
29
|
#
|
28
|
-
# nm.parse('
|
30
|
+
# nm.parse('凡人にしか見えねえ風景ってのがあるんだよ。') do |n|
|
29
31
|
# puts "#{n.surface}\t#{n.feature}"
|
30
|
-
# end
|
31
|
-
#
|
32
|
-
#
|
33
|
-
#
|
34
|
-
#
|
35
|
-
#
|
36
|
-
#
|
37
|
-
#
|
32
|
+
# end
|
33
|
+
# 凡人 名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
|
34
|
+
# に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
|
35
|
+
# しか 助詞,係助詞,*,*,*,*,しか,シカ,シカ
|
36
|
+
# 見え 動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
|
37
|
+
# ねえ 助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
|
38
|
+
# 風景 名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
|
39
|
+
# って 助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
|
40
|
+
# の 名詞,非自立,一般,*,*,*,の,ノ,ノ
|
41
|
+
# が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
|
42
|
+
# ある 動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
|
43
|
+
# ん 名詞,非自立,一般,*,*,*,ん,ン,ン
|
44
|
+
# だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
|
45
|
+
# よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
|
46
|
+
# 。 記号,句点,*,*,*,*,。,。,。
|
47
|
+
# BOS/EOS,*,*,*,*,*,*,*,*
|
38
48
|
#
|
39
49
|
class MeCab
|
40
50
|
include Natto::Binding
|
51
|
+
include Natto::Utils
|
41
52
|
|
42
|
-
attr_reader :options, :dicts, :version
|
53
|
+
attr_reader :tagger, :options, :dicts, :version
|
43
54
|
|
44
|
-
# Mapping of mecab short-style configuration options to the <tt>mecab</tt>
|
55
|
+
# Mapping of mecab short-style configuration options to the <tt>mecab</tt> tagger.
|
45
56
|
# See the <tt>mecab</tt> help for more details.
|
46
57
|
SUPPORTED_OPTS = { '-r' => :rcfile,
|
47
58
|
'-d' => :dicdir,
|
@@ -69,7 +80,7 @@ module Natto
|
|
69
80
|
# - :rcfile -- resource file
|
70
81
|
# - :dicdir -- system dicdir
|
71
82
|
# - :userdic -- user dictionary
|
72
|
-
# - :lattice_level -- lattice information level (
|
83
|
+
# - :lattice_level -- lattice information level (DEPRECATED)
|
73
84
|
# - :output_format_type -- output format type (wakati, chasen, yomi, etc.)
|
74
85
|
# - :all_morphs -- output all morphs (default false)
|
75
86
|
# - :nbest -- output N best results (integer, default 1), requires lattice level >= 1
|
@@ -90,24 +101,30 @@ module Natto
|
|
90
101
|
# e.g.<br/>
|
91
102
|
#
|
92
103
|
# nm = Natto::MeCab.new(:node_format=>'%m¥t%f[7]¥n')
|
93
|
-
# => #<Natto::MeCab:0x28d2ae10
|
94
|
-
# @
|
95
|
-
# @
|
104
|
+
# => #<Natto::MeCab:0x28d2ae10
|
105
|
+
# @tagger=#<FFI::Pointer address=0x28a97980>, \
|
106
|
+
# @options={:node_format=>"%m¥t%f[7]¥n"}, \
|
107
|
+
# @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
|
108
|
+
# type="0", \
|
96
109
|
# filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
|
97
110
|
# charset="utf8">], \
|
98
111
|
# @version="0.993">
|
99
|
-
#
|
100
|
-
# puts nm.parse('
|
101
|
-
#
|
102
|
-
#
|
103
|
-
#
|
104
|
-
#
|
105
|
-
#
|
106
|
-
#
|
107
|
-
#
|
108
|
-
#
|
112
|
+
#
|
113
|
+
# puts nm.parse('才能とは求める人間に与えられるものではない。')
|
114
|
+
# 才能 サイノウ
|
115
|
+
# と ト
|
116
|
+
# は ハ
|
117
|
+
# 求める モトメル
|
118
|
+
# 人間 ニンゲン
|
119
|
+
# に ニ
|
120
|
+
# 与え アタエ
|
121
|
+
# られる ラレル
|
122
|
+
# もの モノ
|
123
|
+
# で デ
|
124
|
+
# は ハ
|
125
|
+
# ない ナイ
|
126
|
+
# 。 。
|
109
127
|
# EOS
|
110
|
-
# => nil
|
111
128
|
#
|
112
129
|
# @param [Hash or String]
|
113
130
|
# @raise [MeCabError] if <tt>mecab</tt> cannot be initialized with the given <tt>options</tt>
|
@@ -118,48 +135,87 @@ module Natto
|
|
118
135
|
@dicts = []
|
119
136
|
|
120
137
|
opt_str = self.class.build_options_str(@options)
|
121
|
-
@
|
122
|
-
raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @
|
138
|
+
@tagger = self.mecab_new2(opt_str)
|
139
|
+
raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @tagger.address == 0x0
|
123
140
|
|
124
|
-
|
125
|
-
self.
|
126
|
-
self.
|
127
|
-
self.mecab_set_all_morphs(@ptr, 1) if @options[:all_morphs]
|
141
|
+
self.mecab_set_theta(@tagger, @options[:theta]) if @options[:theta]
|
142
|
+
self.mecab_set_lattice_level(@tagger, @options[:lattice_level]) if @options[:lattice_level]
|
143
|
+
self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
|
128
144
|
|
129
145
|
# Set mecab parsing implementations for N-best and regular parsing,
|
130
146
|
# for both parsing as string and yielding a node object
|
131
147
|
# N-Best parsing implementations
|
132
148
|
if @options[:nbest] && @options[:nbest] > 1
|
133
149
|
# nbest parsing requires lattice level >= 1
|
134
|
-
self.mecab_set_lattice_level(@
|
135
|
-
@parse_tostr = lambda
|
136
|
-
self.
|
137
|
-
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
142
|
-
|
143
|
-
|
150
|
+
self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
151
|
+
@parse_tostr = lambda do |str|
|
152
|
+
return self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], str) ||
|
153
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
154
|
+
end
|
155
|
+
@parse_tonodes = lambda do |str|
|
156
|
+
nodes = []
|
157
|
+
if @options[:nbest] && @options[:nbest] > 1
|
158
|
+
self.mecab_nbest_init(@tagger, str)
|
159
|
+
n = self.mecab_nbest_next_tonode(@tagger)
|
160
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
161
|
+
nlen = @options[:nbest]
|
162
|
+
nlen.times do
|
163
|
+
s = str.bytes.to_a
|
164
|
+
while n && n.address != 0x0
|
165
|
+
mn = Natto::MeCabNode.new(n)
|
166
|
+
if mn.is_nor?
|
167
|
+
slen, sarr = mn.length, []
|
168
|
+
slen.times { sarr << s.shift }
|
169
|
+
surf = sarr.pack('C*')
|
170
|
+
mn.surface = self.class.force_enc(surf)
|
171
|
+
if @options[:output_format_type] || @options[:node_format]
|
172
|
+
mn.feature = self.class.force_enc(self.mecab_format_node(@tagger, n))
|
173
|
+
end
|
174
|
+
end
|
175
|
+
nodes << mn if !mn.is_bos?
|
176
|
+
n = mn.next
|
177
|
+
end
|
178
|
+
n = self.mecab_nbest_next_tonode(@tagger)
|
179
|
+
end
|
180
|
+
end
|
181
|
+
return nodes
|
182
|
+
end
|
144
183
|
else
|
145
184
|
# default parsing implementations
|
146
|
-
@parse_tostr = lambda
|
147
|
-
return self.mecab_sparse_tostr(@
|
148
|
-
|
149
|
-
|
185
|
+
@parse_tostr = lambda do |str|
|
186
|
+
return self.mecab_sparse_tostr(@tagger, str) ||
|
187
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
188
|
+
end
|
189
|
+
@parse_tonodes = lambda do |str|
|
190
|
+
nodes = []
|
191
|
+
n = self.mecab_sparse_tonode(@tagger, str)
|
192
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
193
|
+
mn = Natto::MeCabNode.new(n)
|
194
|
+
n = mn.next if mn.next.address!=0x0 && mn.is_bos?
|
195
|
+
s = str.bytes.to_a
|
196
|
+
while n && n.address!=0x0
|
197
|
+
mn = Natto::MeCabNode.new(n)
|
198
|
+
if mn.is_nor?
|
199
|
+
slen, sarr = mn.length, []
|
200
|
+
slen.times { sarr << s.shift }
|
201
|
+
surf = sarr.pack('C*')
|
202
|
+
mn.surface = self.class.force_enc(surf)
|
203
|
+
end
|
204
|
+
nodes << mn
|
205
|
+
n = mn.next
|
206
|
+
end
|
207
|
+
return nodes
|
208
|
+
end
|
150
209
|
end
|
151
210
|
|
152
|
-
|
153
|
-
@dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@ptr))
|
211
|
+
@dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
|
154
212
|
while @dicts.last.next.address != 0x0
|
155
213
|
@dicts << Natto::DictionaryInfo.new(@dicts.last.next)
|
156
214
|
end
|
157
215
|
|
158
|
-
# set ref to mecab version string
|
159
216
|
@version = self.mecab_version
|
160
217
|
|
161
|
-
|
162
|
-
ObjectSpace.define_finalizer(self, self.class.create_free_proc(@ptr))
|
218
|
+
ObjectSpace.define_finalizer(self, self.class.create_free_proc(@tagger))
|
163
219
|
end
|
164
220
|
|
165
221
|
# Parses the given string <tt>str</tt>. If a block is passed to this method,
|
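The new `@parse_tonodes` lambdas in the hunk above recover each node's surface by consuming `mn.length` bytes from the front of the input string and repacking them. The following is a minimal sketch of that byte-consuming idea without any MeCab call. The `lengths` array is a stand-in for the node lengths MeCab would report, `slice_surfaces` is a hypothetical name, and the input string's own encoding is used here instead of the gem's `force_enc` helper.

```ruby
# coding: utf-8
# Sketch only: `lengths` stands in for the mn.length values reported by MeCab.
def slice_surfaces(str, lengths)
  bytes = str.bytes.to_a
  lengths.map do |len|
    chunk = bytes.shift(len)   # take the next len bytes off the front
    # pack('C*') yields a binary string, so restore the input's encoding
    chunk.pack('C*').force_encoding(str.encoding)
  end
end

surfaces = slice_surfaces('ピンチの時', [9, 3, 3])
puts surfaces.inspect
```

Working on bytes rather than characters matters because MeCab reports lengths in bytes, and each of the characters in this UTF-8 sample occupies three bytes.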
@@ -167,47 +223,48 @@ module Natto
|
|
167
223
|
#
|
168
224
|
# @param [String] str
|
169
225
|
# @return parsing result from <tt>mecab</tt>
|
170
|
-
# @raise [MeCabError] if the <tt>mecab</tt>
|
226
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
171
227
|
# @see MeCabNode
|
172
228
|
def parse(str)
|
173
229
|
if block_given?
|
174
|
-
|
175
|
-
|
176
|
-
if head && head[:next].address != 0x0
|
177
|
-
node = Natto::MeCabNode.new(head[:next])
|
178
|
-
i = 0
|
179
|
-
while node.nil? == false
|
180
|
-
if node.length > 0
|
181
|
-
node.surface = str.bytes.to_a()[i, node.length].pack('C*')
|
182
|
-
end
|
183
|
-
yield node
|
184
|
-
if node[:next].address != 0x0
|
185
|
-
i += node.length
|
186
|
-
node = Natto::MeCabNode.new(node[:next])
|
187
|
-
else
|
188
|
-
break
|
189
|
-
end
|
190
|
-
end
|
191
|
-
end
|
230
|
+
nodes = @parse_tonodes.call(str)
|
231
|
+
nodes.each {|n| yield n }
|
192
232
|
else
|
193
|
-
|
194
|
-
result.force_encoding(Encoding.default_external) if result.respond_to?(:encoding) && result.encoding!=Encoding.default_external
|
195
|
-
result
|
233
|
+
self.class.force_enc(@parse_tostr.call(str))
|
196
234
|
end
|
197
235
|
end
|
198
236
|
|
199
|
-
#
|
237
|
+
# Parses the given string <tt>str</tt>, and returns
|
238
|
+
# a list of <tt>mecab</tt> nodes.
|
239
|
+
# @param [String] str
|
240
|
+
# @return [Array] of parsed <tt>mecab</tt> nodes.
|
241
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
242
|
+
# @see MeCabNode
|
243
|
+
def readnodes(str)
|
244
|
+
@parse_tonodes.call(str)
|
245
|
+
end
|
246
|
+
|
247
|
+
# Parses the given string <tt>str</tt>, and returns
|
248
|
+
# a list of <tt>mecab</tt> result strings.
|
249
|
+
# @param [String] str
|
250
|
+
# @return [Array] of parsed <tt>mecab</tt> result strings.
|
251
|
+
# @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
|
252
|
+
def readlines(str)
|
253
|
+
self.class.force_enc(@parse_tostr.call(str)).lines.to_a
|
254
|
+
end
|
255
|
+
|
256
|
+
# Returns human-readable details for the wrapped <tt>mecab</tt> tagger.
|
200
257
|
# Overrides <tt>Object#to_s</tt>.
|
201
258
|
#
|
202
259
|
# - encoded object id
|
203
|
-
# - FFI pointer to <tt>mecab</tt>
|
260
|
+
# - underlying FFI pointer to the <tt>mecab</tt> tagger
|
204
261
|
# - options hash
|
205
262
|
# - list of dictionaries
|
206
263
|
# - MeCab version
|
207
264
|
#
|
208
|
-
# @return [String] encoded object id, FFI pointer, options hash, list of dictionaries, and MeCab version
|
265
|
+
# @return [String] encoded object id, underlying FFI pointer, options hash, list of dictionaries, and MeCab version
|
209
266
|
def to_s
|
210
|
-
%(#{super.chop} @
|
267
|
+
%(#{super.chop} @tagger=#{@tagger}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
|
211
268
|
end
|
212
269
|
|
213
270
|
# Overrides <tt>Object#inspect</tt>.
|
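The `readnodes` and `readlines` methods added in the hunk above return Arrays, which is what makes method-chaining possible. The sketch below shows the chaining style using a stand-in class so that it runs without MeCab installed. `FakeTagger`, `FakeNode` and the canned output are assumptions made for illustration, and only the return shapes mirror the diff.

```ruby
# coding: utf-8
# Stand-ins so the sketch runs without MeCab; not part of natto.
FakeNode = Struct.new(:surface, :feature)

class FakeTagger
  CANNED = "才能\tサイノウ\nと\tト\nは\tハ\nEOS\n"

  # Like readlines in the diff: the string output split into an Array of lines.
  def readlines(_str)
    CANNED.lines.to_a
  end

  # Like readnodes in the diff: an Array of node objects.
  def readnodes(_str)
    CANNED.lines.reject { |l| l.chomp == 'EOS' }.map do |l|
      FakeNode.new(*l.chomp.split("\t", 2))
    end
  end
end

tagger   = FakeTagger.new
readings = tagger.readlines('才能とは').map(&:chomp).reject { |l| l == 'EOS' }
surfaces = tagger.readnodes('才能とは').map(&:surface)
puts readings.inspect
puts surfaces.inspect
```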
@@ -284,10 +341,10 @@ module Natto
|
|
284
341
|
end
|
285
342
|
|
286
343
|
# Returns a string-representation of the options to
|
287
|
-
# be passed in the construction of <tt>mecab</tt
|
344
|
+
# be passed in the construction of the <tt>mecab</tt> tagger.
|
288
345
|
#
|
289
346
|
# @param [Hash] options
|
290
|
-
# @return [String] representation of the options to the <tt>mecab</tt>
|
347
|
+
# @return [String] representation of the options to the <tt>mecab</tt> tagger
|
291
348
|
def self.build_options_str(options={})
|
292
349
|
opt = []
|
293
350
|
SUPPORTED_OPTS.values.each do |k|
|
@@ -352,7 +409,16 @@ module Natto
|
|
352
409
|
#
|
353
410
|
# puts sysdic.charset
|
354
411
|
# => "utf8"
|
412
|
+
#
|
413
|
+
# puts sysdic.is_sysdic?
|
414
|
+
# => true
|
355
415
|
class DictionaryInfo < MeCabStruct
|
416
|
+
# System dictionary.
|
417
|
+
SYS_DIC = 0
|
418
|
+
# User dictionary.
|
419
|
+
USR_DIC = 1
|
420
|
+
# Unknown dictionary.
|
421
|
+
UNK_DIC = 2
|
356
422
|
|
357
423
|
layout :filename, :string,
|
358
424
|
:charset, :string,
|
@@ -381,12 +447,13 @@ module Natto
|
|
381
447
|
# Overrides <tt>Object#to_s</tt>.
|
382
448
|
#
|
383
449
|
# - encoded object id
|
450
|
+
# - dictionary type
|
384
451
|
# - full-path dictionary filename
|
385
452
|
# - dictionary charset
|
386
453
|
#
|
387
|
-
# @return [String] encoded object id, dictionary filename, and charset
|
454
|
+
# @return [String] encoded object id, type, dictionary filename, and charset
|
388
455
|
def to_s
|
389
|
-
%(#{super.chop} filename="#{self.filename}", charset="#{self.charset}">)
|
456
|
+
%(#{super.chop} type="#{self.type}", filename="#{self.filename}", charset="#{self.charset}">)
|
390
457
|
end
|
391
458
|
|
392
459
|
# Overrides <tt>Object#inspect</tt>.
|
@@ -396,6 +463,24 @@ module Natto
|
|
396
463
|
def inspect
|
397
464
|
self.to_s
|
398
465
|
end
|
466
|
+
|
467
|
+
# Returns <tt>true</tt> if this is a system dictionary.
|
468
|
+
# @return [Boolean]
|
469
|
+
def is_sysdic?
|
470
|
+
self.type == SYS_DIC
|
471
|
+
end
|
472
|
+
|
473
|
+
# Returns <tt>true</tt> if this is a user dictionary.
|
474
|
+
# @return [Boolean]
|
475
|
+
def is_usrdic?
|
476
|
+
self.type == USR_DIC
|
477
|
+
end
|
478
|
+
|
479
|
+
# Returns <tt>true</tt> if this is an unknown dictionary type.
|
480
|
+
# @return [Boolean]
|
481
|
+
def is_unkdic?
|
482
|
+
self.type == UNK_DIC
|
483
|
+
end
|
399
484
|
end
|
400
485
|
|
401
486
|
# <tt>MeCabNode</tt> is a wrapper for the structure holding
|
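The `is_sysdic?`, `is_usrdic?` and `is_unkdic?` predicates added above compare the dictionary type against the new `SYS_DIC`, `USR_DIC` and `UNK_DIC` constants. A minimal sketch of how they might be used to filter a dictionary list is shown below. `DictSketch` is a plain Struct standing in for the FFI-backed `Natto::DictionaryInfo`, and the `user.dic` filename is hypothetical.

```ruby
# Stand-in for Natto::DictionaryInfo: a plain Struct replaces the FFI struct,
# while the constant values are copied from the hunk above.
class DictSketch < Struct.new(:type, :filename)
  SYS_DIC = 0
  USR_DIC = 1
  UNK_DIC = 2

  def is_sysdic?; type == SYS_DIC; end
  def is_usrdic?; type == USR_DIC; end
  def is_unkdic?; type == UNK_DIC; end
end

dicts = [
  DictSketch.new(DictSketch::SYS_DIC, '/usr/local/lib/mecab/dic/ipadic/sys.dic'),
  DictSketch.new(DictSketch::USR_DIC, 'user.dic')   # hypothetical user dictionary
]

user_dicts = dicts.select { |d| d.is_usrdic? }
puts user_dicts.map(&:filename).inspect
```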
@@ -430,61 +515,57 @@ module Natto
|
|
430
515
|
# - :cost
|
431
516
|
#
|
432
517
|
# <h2>Usage</h2>
|
433
|
-
# An instance of <tt>MeCabNode</tt> is yielded to
|
434
|
-
# used with <tt>MeCab#parse</tt
|
435
|
-
#
|
436
|
-
# node attributes may be accessed.
|
518
|
+
# An instance of <tt>MeCabNode</tt> is yielded to the block
|
519
|
+
# used with <tt>MeCab#parse</tt>, where the above-mentioned
|
520
|
+
# node attributes may be accessed by name.
|
437
521
|
#
|
438
522
|
# nm = Natto::MeCab.new
|
439
523
|
#
|
440
|
-
# nm.parse('
|
441
|
-
# puts "#{n.surface}
|
524
|
+
# nm.parse('卓球なんて死ぬまでの暇つぶしだよ。') do |n|
|
525
|
+
# puts "#{n.surface}\t#{n.cost}" if n.is_nor?
|
442
526
|
# end
|
443
|
-
#
|
444
|
-
#
|
445
|
-
#
|
446
|
-
#
|
447
|
-
#
|
448
|
-
#
|
449
|
-
#
|
450
|
-
#
|
451
|
-
#
|
452
|
-
# 困っ 43797
|
453
|
-
# て 42178
|
454
|
-
# まし 46708
|
455
|
-
# た 46111
|
456
|
-
# 。 42677
|
457
|
-
# 41141
|
458
|
-
# => nil
|
527
|
+
# 卓球 2874
|
528
|
+
# なんて 4398
|
529
|
+
# 死ぬ 9261
|
530
|
+
# まで 9386
|
531
|
+
# の 10007
|
532
|
+
# 暇つぶし 13324
|
533
|
+
# だ 15346
|
534
|
+
# よ 14396
|
535
|
+
# 。 10194
|
459
536
|
#
|
460
537
|
# It is also possible to use the <tt>Symbol</tt> for the
|
461
538
|
# <tt>mecab</tt> node member to index into the
|
462
539
|
# <tt>FFI::Struct</tt> layout associative array like so:
|
463
540
|
#
|
464
|
-
# nm.parse('
|
465
|
-
#
|
466
|
-
#
|
467
|
-
#
|
468
|
-
#
|
469
|
-
#
|
470
|
-
#
|
471
|
-
#
|
472
|
-
#
|
541
|
+
# nm.parse('あいつ笑うと結構可愛い顔してんよ。') {|n| puts n[:feature] }
|
542
|
+
# 名詞,代名詞,一般,*,*,*,あいつ,アイツ,アイツ
|
543
|
+
# 動詞,自立,*,*,五段・ワ行促音便,基本形,笑う,ワラウ,ワラウ
|
544
|
+
# 助詞,接続助詞,*,*,*,*,と,ト,ト
|
545
|
+
# 副詞,一般,*,*,*,*,結構,ケッコウ,ケッコー
|
546
|
+
# 形容詞,自立,*,*,形容詞・イ段,基本形,可愛い,カワイイ,カワイイ
|
547
|
+
# 名詞,一般,*,*,*,*,顔,カオ,カオ
|
548
|
+
# 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
|
549
|
+
# 動詞,非自立,*,*,一段,体言接続特殊,てる,テン,テン
|
550
|
+
# 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
|
551
|
+
# 記号,句点,*,*,*,*,。,。,。
|
473
552
|
# BOS/EOS,*,*,*,*,*,*,*,*
|
474
|
-
# => nil
|
475
553
|
#
|
476
554
|
class MeCabNode < MeCabStruct
|
555
|
+
include Natto::Utils
|
556
|
+
|
477
557
|
attr_accessor :surface, :feature
|
558
|
+
attr_reader :pointer
|
478
559
|
|
479
|
-
# Normal <tt>mecab</tt> node.
|
560
|
+
# Normal <tt>mecab</tt> node defined in the dictionary.
|
480
561
|
NOR_NODE = 0
|
481
|
-
# Unknown <tt>mecab</tt> node.
|
562
|
+
# Unknown <tt>mecab</tt> node not defined in the dictionary.
|
482
563
|
UNK_NODE = 1
|
483
|
-
#
|
564
|
+
# Virtual node representing the beginning of the sentence.
|
484
565
|
BOS_NODE = 2
|
485
|
-
#
|
566
|
+
# Virtual node representing the end of the sentence.
|
486
567
|
EOS_NODE = 3
|
487
|
-
#
|
568
|
+
# Virtual node representing the end of an N-Best <tt>mecab</tt> node list.
|
488
569
|
EON_NODE = 4
|
489
570
|
|
490
571
|
layout :prev, :pointer,
|
@@ -525,39 +606,30 @@ module Natto
|
|
525
606
|
end
|
526
607
|
|
527
608
|
# Initializes this node instance.
|
528
|
-
# Sets the <
|
609
|
+
# Sets the <tt>MeCab</tt> feature value for this node.
|
529
610
|
#
|
530
611
|
# @param [FFI::Pointer]
|
531
612
|
def initialize(ptr)
|
532
613
|
super(ptr)
|
614
|
+
@pointer = ptr
|
533
615
|
|
534
616
|
if self[:feature]
|
535
|
-
@feature = self[:feature]
|
536
|
-
@feature.force_encoding(Encoding.default_external) if @feature.respond_to?(:encoding) && @feature.encoding!=Encoding.default_external
|
617
|
+
@feature = self.class.force_enc(self[:feature])
|
537
618
|
end
|
538
619
|
end
|
539
620
|
|
540
|
-
# Sets the morpheme surface value for this node.
|
541
|
-
#
|
542
|
-
# @param [String]
|
543
|
-
def surface=(str)
|
544
|
-
if str && self[:length] > 0
|
545
|
-
@surface = str
|
546
|
-
@surface.force_encoding(Encoding.default_external) if @surface.respond_to?(:encoding) && @surface.encoding!=Encoding.default_external
|
547
|
-
end
|
548
|
-
end
|
549
|
-
|
550
621
|
# Returns human-readable details for the <tt>mecab</tt> node.
|
551
622
|
# Overrides <tt>Object#to_s</tt>.
|
552
623
|
#
|
553
624
|
# - encoded object id
|
554
|
-
# -
|
625
|
+
# - underlying FFI pointer to MeCab Node
|
626
|
+
# - stat (node type: NOR, UNK, BOS/EOS, EON)
|
555
627
|
# - surface
|
556
628
|
# - feature
|
557
629
|
#
|
558
|
-
# @return [String] encoded object id, stat, surface, and feature
|
630
|
+
# @return [String] encoded object id, underlying FFI pointer, stat, surface, and feature
|
559
631
|
def to_s
|
560
|
-
%(#{super.chop} stat=#{self[:stat]}, surface="#{self.surface}", feature="#{self.feature}">)
|
632
|
+
%(#{super.chop} @pointer=#{@pointer}, stat=#{self[:stat]}, @surface="#{self.surface}", @feature="#{self.feature}">)
|
561
633
|
end
|
562
634
|
|
563
635
|
# Overrides <tt>Object#inspect</tt>.
|
@@ -567,5 +639,35 @@ module Natto
|
|
567
639
|
def inspect
|
568
640
|
self.to_s
|
569
641
|
end
|
642
|
+
|
643
|
+
# Returns <tt>true</tt> if this is a normal <tt>mecab</tt> node found in the dictionary.
|
644
|
+
# @return [Boolean]
|
645
|
+
def is_nor?
|
646
|
+
self.stat == NOR_NODE
|
647
|
+
end
|
648
|
+
|
649
|
+
# Returns <tt>true</tt> if this is an unknown <tt>mecab</tt> node not found in the dictionary.
|
650
|
+
# @return [Boolean]
|
651
|
+
def is_unk?
|
652
|
+
self.stat == UNK_NODE
|
653
|
+
end
|
654
|
+
|
655
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the beginning of the sentence.
|
656
|
+
# @return [Boolean]
|
657
|
+
def is_bos?
|
658
|
+
self.stat == BOS_NODE
|
659
|
+
end
|
660
|
+
|
661
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the sentence.
|
662
|
+
# @return [Boolean]
|
663
|
+
def is_eos?
|
664
|
+
self.stat == EOS_NODE
|
665
|
+
end
|
666
|
+
|
667
|
+
# Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the node list.
|
668
|
+
# @return [Boolean]
|
669
|
+
def is_eon?
|
670
|
+
self.stat == EON_NODE
|
671
|
+
end
|
570
672
|
end
|
571
673
|
end
|
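The node predicates added above (`is_nor?`, `is_unk?`, `is_bos?`, `is_eos?`, `is_eon?`) compare the node's `stat` against the `*_NODE` constants. As a rough sketch, they could be used to keep only dictionary words from a node list. `NodeSketch` is a stand-in for the FFI-backed `Natto::MeCabNode`, and only two of the five predicates are reproduced here.

```ruby
# coding: utf-8
# Stand-in for Natto::MeCabNode; constant values are copied from the diff.
class NodeSketch < Struct.new(:stat, :surface)
  NOR_NODE = 0
  UNK_NODE = 1
  BOS_NODE = 2
  EOS_NODE = 3
  EON_NODE = 4

  def is_nor?; stat == NOR_NODE; end
  def is_eos?; stat == EOS_NODE; end
end

nodes = [
  NodeSketch.new(NodeSketch::NOR_NODE, '卓球'),
  NodeSketch.new(NodeSketch::NOR_NODE, 'なんて'),
  NodeSketch.new(NodeSketch::EOS_NODE, nil)   # end-of-sentence node has no surface
]

# Keep only normal nodes, dropping the virtual end-of-sentence node.
words = nodes.select { |n| n.is_nor? }.map { |n| n.surface }
puts words.join(' ')
```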