natto 0.9.3 → 0.9.4

data/CHANGELOG CHANGED
@@ -1,7 +1,18 @@
1
1
  ## CHANGELOG
2
2
 
3
- - __2012/02/xx__: 0.9.3 release.
4
- - Updating structures, tests following the latest release of MeCab 0.993a
3
+ - __2012/02/26__: 0.9.4 release.
4
+ - Exposing the underlying FFI pointer as @tagger in Natto::MeCab
5
+ - Exposing the underlying FFI pointer as @pointer in Natto::MeCabNode
6
+ - Fixed node formatting behavior for both regular and N-Best node parsing
7
+ - Adding is_nor?, is_unk?, is_bos?, is_eos?, is_eon? to Natto::MeCabNode
8
+ - Adding is_sysdic?, is_usrdic?, is_unkdic? to Natto::DictionaryInfo
9
+ - Adding readnodes to allow for method-chaining on list of parsed nodes
10
+ - Adding readlines to allow for method-chaining on list of string output
11
+ - Updating README to include description of DevKit install for RubyInstaller
12
+ - Refactored the codebase, introducing more tests
13
+
14
+ - __2012/02/14__: 0.9.3 release.
15
+ - Updating structures, tests following the latest release of MeCab 0.993
5
16
  - Ending support for 0.98 MeCab.
6
17
  - Support for MeCab 0.993 in Cygwin is still not confirmed.
7
18
 
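The 0.9.4 changelog entries for `readnodes`, `readlines`, and the `is_nor?`-style predicates describe method chaining over parsed results. The following is a minimal sketch of that chaining style, not natto itself: `SketchNode` is a hypothetical stand-in for `Natto::MeCabNode` so the example runs without MeCab installed, and the sample data is hand-written rather than real parser output.

```ruby
# coding: utf-8
# Node status values as listed in the natto.rb diff (NOR_NODE, EOS_NODE).
NOR_NODE = 0
EOS_NODE = 3

# Hypothetical stand-in for Natto::MeCabNode; only the members used
# below are modelled.
SketchNode = Struct.new(:surface, :stat) do
  def is_nor?; stat == NOR_NODE; end
  def is_eos?; stat == EOS_NODE; end
end

# Assumed shape of a readnodes result: an Array of nodes that ends
# with the virtual end-of-sentence node.
nodes = [
  SketchNode.new('ピンチ', NOR_NODE),
  SketchNode.new('の', NOR_NODE),
  SketchNode.new('時', NOR_NODE),
  SketchNode.new(nil, EOS_NODE)
]

# Chain Enumerable methods: keep normal nodes, then collect surfaces.
surfaces = nodes.select(&:is_nor?).map(&:surface)
puts surfaces.join(' / ')

# Assumed shape of the string output that readlines splits into lines.
output = "ピンチ\t名詞\nの\t助詞\nEOS\n"
lines = output.lines.to_a.map(&:chomp).reject { |l| l == 'EOS' }
puts lines.length
```

With a working MeCab installation, the same chains would presumably start from `nm.readnodes(str)` and `nm.readlines(str)` instead of the hand-built arrays.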
data/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
  A Tasty Ruby Binding with MeCab
3
3
 
4
4
  ## What is natto?
5
- natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.sourceforge.net/), the part-of-speech and morphological analyzer for the Japanese language.
5
+ natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech and morphological analyzer for the Japanese language.
6
6
 
7
7
  natto is a gem bridging Ruby and MeCab using FFI (foreign function interface). No compilation is necessary, as natto is _not_ a C extension. natto will run on CRuby (mri/yarv) and JRuby (jvm) equally well. natto will also run on Windows, Unix/Linux, and Mac.
8
8
 
@@ -15,8 +15,22 @@ natto requires the following:
15
15
  - [ffi _0.6.3 or greater_](http://rubygems.org/gems/ffi)
16
16
  - Ruby _1.8.7 or greater_
17
17
 
18
- ## Installation
18
+ ## Installation on *NIX/Mac/Cygwin
19
19
  Install natto with the following gem command:
20
+
21
+ gem install natto
22
+
23
+ This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which is what natto uses to bind to the <tt>mecab</tt> library.
24
+
25
+ ## Installation on Windows
26
+ However, if you are using CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), which is an MSYS/MinGW-based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including <tt>ffi</tt>.
27
+
28
+ 1. Download the latest release for RubyInstaller for Windows platforms and the corresponding DevKit from the [RubyInstaller for Windows downloads page](http://rubyinstaller.org/downloads/).
29
+ 2. After installing RubyInstaller for Windows, double-click on the DevKit-tdm installer <tt>.exe</tt>, and expand the contents to an appropriate location, for example <tt>C:\devkit</tt>.
30
+ 3. Open a command window under <tt>C:\devkit</tt>, and execute: <tt>ruby dk.rb init</tt>. This will locate all known ruby installations, and add them to <tt>C:\devkit\config.yml</tt>.
31
+ 4. Next, execute: <tt>ruby dk.rb install</tt>, which will add the DevKit to all of the installed rubies listed in your <tt>C:\devkit\config.yml</tt>.
32
+ 5. Now you should be able to install and build the <tt>ffi</tt> rubygem correctly on your Windows-installed ruby, so you can install <tt>natto</tt> with:
33
+
20
34
  gem install natto
21
35
 
22
36
  ## Configuration
@@ -24,12 +38,19 @@ Install natto with the following gem command:
24
38
  - In case of <tt>LoadError</tt>, please set the <tt>MECAB_PATH</tt> environment variable to the exact name/path to your <tt>mecab</tt> library.
25
39
 
26
40
  e.g., for bash on UNIX/Linux
41
+
27
42
  export MECAB_PATH=/usr/local/lib/libmecab.so
43
+
28
44
  e.g., on Windows
45
+
29
46
  set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
47
+
30
48
  e.g., for Cygwin
49
+
31
50
  export MECAB_PATH=cygmecab-1
51
+
32
52
  e.g., from within a Ruby program
53
+
33
54
  ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
34
55
 
35
56
  ## Usage
@@ -38,15 +59,16 @@ e.g., from within a Ruby program
38
59
 
39
60
  nm = Natto::MeCab.new
40
61
  => #<Natto::MeCab:0x28d30748
41
- @ptr=#<FFI::Pointer address=0x28a97d50>, \
62
+ @tagger=#<FFI::Pointer address=0x28a97d50>, \
42
63
  @options={}, \
43
- @dicts=[#<Natto::DictionaryInfo:0x28d3061c
44
- filename="/usr/local/lib/mecab/dic/ipadic/sys.dic",
45
- charset="utf8">],
46
- @version="0.992">
64
+ @dicts=[#<Natto::DictionaryInfo:0x28d3061c \
65
+ type="0", \
66
+ filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
67
+ charset="utf8">], \
68
+ @version="0.993">
47
69
 
48
70
  puts nm.version
49
- => "0.992"
71
+ => "0.993"
50
72
 
51
73
  sysdic = nm.dicts.first
52
74
 
@@ -55,22 +77,25 @@ e.g., from within a Ruby program
55
77
 
56
78
  puts sysdic.charset
57
79
  => "utf8"
58
-
59
- nm.parse('暑い日にはもってこいの一品ですね。') do |n|
80
+
81
+ nm.parse('ピンチの時には必ずヒーローが現れる。') do |n|
60
82
  puts "#{n.surface}\t#{n.feature}"
61
83
  end
62
- 暑い 形容詞,自立,*,*,形容詞・アウオ段,基本形,暑い,アツイ,アツイ
63
- 日 名詞,非自立,副詞可能,*,*,*,日,ヒ,ヒ
64
- に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
65
- は 助詞,係助詞,*,*,*,*,は,ハ,ワ
66
- もってこい 名詞,一般,*,*,*,*,もってこい,モッテコイ,モッテコイ
67
- の 助詞,連体化,*,*,*,*,の,ノ,ノ
68
- 一品 名詞,一般,*,*,*,*,一品,イッピン,イッピン
69
- です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
70
- ね 助詞,終助詞,*,*,*,*,ね,ネ,ネ
71
- 。 記号,句点,*,*,*,*,。,。,。
72
- BOS/EOS,*,*,*,*,*,*,*,*
73
- => nil
84
+ ピンチ 名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
85
+ の 助詞,連体化,*,*,*,*,の,ノ,ノ
86
+ 時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
87
+ に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
88
+ は 助詞,係助詞,*,*,*,*,は,ハ,ワ
89
+ 必ず 副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
90
+ ヒーロー 名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
91
+ が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
92
+ 現れる 動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
93
+ 。 記号,句点,*,*,*,*,。,。,。
94
+ BOS/EOS,*,*,*,*,*,*,*,*
95
+
96
+
97
+ ## Learn more
98
+ - You can read more about natto on the [project Wiki](https://bitbucket.org/buruzaemon/natto/wiki/Home).
74
99
 
75
100
  ## Contributing to natto
76
101
  - Use [mercurial](http://mercurial.selenic.com/) and [check out the latest code at bitbucket](https://bitbucket.org/buruzaemon/natto/src/) to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
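The feature strings printed in the README usage example are comma-separated. The sketch below splits one into named fields. The field names are assumptions read off the sample IPADIC output (part of speech, three subcategories, conjugation type, conjugation form, base form, reading, pronunciation). Other dictionaries may use a different layout, and `feature_to_hash` is a hypothetical helper, not part of natto.

```ruby
# coding: utf-8
# Assumed IPADIC field order, inferred from the sample output in the README.
FIELDS = [:pos, :pos1, :pos2, :pos3, :ctype, :cform, :base, :reading, :pron]

# Hypothetical helper: maps a feature string onto the assumed field names.
# Fields beyond the listed names are dropped; missing fields become nil.
def feature_to_hash(feature)
  Hash[FIELDS.zip(feature.split(','))]
end

h = feature_to_hash('名詞,非自立,副詞可能,*,*,*,時,トキ,トキ')
puts h[:pos]      # part of speech
puts h[:reading]  # index 7, the field selected by %f[7] in a node format
```

Index 7 matches the field that the `:node_format=>'%m¥t%f[7]¥n'` example in the natto.rb documentation prints next to each surface.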
data/lib/natto.rb CHANGED
@@ -1,13 +1,14 @@
1
1
  # coding: utf-8
2
2
  require 'rubygems' if RUBY_VERSION.to_f < 1.9
3
3
  require 'natto/binding'
4
+ require 'natto/utils'
4
5
 
5
6
  module Natto
6
7
  require 'ffi'
7
8
  require 'optparse'
8
9
 
9
- # <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> parser.
10
- # Options to the <tt>mecab</tt> parser are passed in as a string
10
+ # <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> tagger.
11
+ # Options to the <tt>mecab</tt> tagger are passed in as a string
11
12
  # (MeCab command-line style) or as a Ruby-style hash at
12
13
  # initialization.
13
14
  #
@@ -16,32 +17,42 @@ module Natto
16
17
  # require 'rubygems' if RUBY_VERSION.to_f < 1.9
17
18
  # require 'natto'
18
19
  #
19
- # nm = Natto::MeCab.new('-O chasen2')
20
+ # nm = Natto::MeCab.new('-Ochasen')
20
21
  # => #<Natto::MeCab:0x28d3bdc8 \
21
- # @ptr=#<FFI::Pointer address=0x28afb980>, \
22
- # @options={:output_format_type=>"chasen2"}, \
23
- # @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
22
+ # @tagger=#<FFI::Pointer address=0x28afb980>, \
23
+ # @options={:output_format_type=>"chasen"}, \
24
+ # @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
25
+ # type="0", \
24
26
  # filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
25
27
  # charset="utf8">], \
26
28
  # @version="0.993">
27
29
  #
28
- # nm.parse('ネバネバの組み合わせ美味しいです。') do |n|
30
+ # nm.parse('凡人にしか見えねえ風景ってのがあるんだよ。') do |n|
29
31
  # puts "#{n.surface}\t#{n.feature}"
30
- # end
31
- #
32
- # ネバネバ 名詞,サ変接続,*,*,*,*,ネバネバ,ネバネバ,ネバネバ
33
- # の 助詞,連体化,*,*,*,*,の,ノ,ノ
34
- # 組み合わせ 名詞,一般,*,*,*,*,組み合わせ,クミアワセ,クミアワセ
35
- # 美味しい 形容詞,自立,*,*,形容詞・イ段,基本形,美味しい,オイシイ,オイシイ
36
- # です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
37
- # 。 記号,句点,*,*,*,*,。,。,。
32
+ # end
33
+ # 凡人 名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
34
+ # に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
35
+ # しか 助詞,係助詞,*,*,*,*,しか,シカ,シカ
36
+ # 見え 動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
37
+ # ねえ 助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
38
+ # 風景 名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
39
+ # って 助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
40
+ # の 名詞,非自立,一般,*,*,*,の,ノ,ノ
41
+ # が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
42
+ # ある 動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
43
+ # ん 名詞,非自立,一般,*,*,*,ん,ン,ン
44
+ # だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
45
+ # よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
46
+ # 。 記号,句点,*,*,*,*,。,。,。
47
+ # BOS/EOS,*,*,*,*,*,*,*,*
38
48
  #
39
49
  class MeCab
40
50
  include Natto::Binding
51
+ include Natto::Utils
41
52
 
42
- attr_reader :options, :dicts, :version
53
+ attr_reader :tagger, :options, :dicts, :version
43
54
 
44
- # Mapping of mecab short-style configuration options to the <tt>mecab</tt> parser.
55
+ # Mapping of mecab short-style configuration options to the <tt>mecab</tt> tagger.
45
56
  # See the <tt>mecab</tt> help for more details.
46
57
  SUPPORTED_OPTS = { '-r' => :rcfile,
47
58
  '-d' => :dicdir,
@@ -69,7 +80,7 @@ module Natto
69
80
  # - :rcfile -- resource file
70
81
  # - :dicdir -- system dicdir
71
82
  # - :userdic -- user dictionary
72
- # - :lattice_level -- lattice information level (integer, default 0)
83
+ # - :lattice_level -- lattice information level (DEPRECATED)
73
84
  # - :output_format_type -- output format type (wakati, chasen, yomi, etc.)
74
85
  # - :all_morphs -- output all morphs (default false)
75
86
  # - :nbest -- output N best results (integer, default 1), requires lattice level >= 1
@@ -90,24 +101,30 @@ module Natto
90
101
  # e.g.<br/>
91
102
  #
92
103
  # nm = Natto::MeCab.new(:node_format=>'%m¥t%f[7]¥n')
93
- # => #<Natto::MeCab:0x28d2ae10 @ptr=#<FFI::Pointer address=0x28a97980>, \
94
- # @options={:node_format=>"%m¥t%f[7]¥n"}, \
95
- # @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
104
+ # => #<Natto::MeCab:0x28d2ae10
105
+ # @tagger=#<FFI::Pointer address=0x28a97980>, \
106
+ # @options={:node_format=>"%m¥t%f[7]¥n"}, \
107
+ # @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
108
+ # type="0", \
96
109
  # filename="/usr/local/lib/mecab/dic/ipadic/sys.dic" \
97
110
  # charset="utf8">], \
98
111
  # @version="0.993">
99
- #
100
- # puts nm.parse('簡単で美味しくて良いですよね。')
101
- # 簡単 カンタン
102
- # で デ
103
- # 美味しくて オイシクテ
104
- # 良い ヨイ
105
- # です デス
106
- # よ ヨ
107
- # ね ネ
108
- #
112
+ #
113
+ # puts nm.parse('才能とは求める人間に与えられるものではない。')
114
+ # 才能 サイノウ
115
+ # と ト
116
+ # は ハ
117
+ # 求める モトメル
118
+ # 人間 ニンゲン
119
+ # に ニ
120
+ # 与え アタエ
121
+ # られる ラレル
122
+ # もの モノ
123
+ # で デ
124
+ # は ハ
125
+ # ない ナイ
126
+ # 。 。
109
127
  # EOS
110
- # => nil
111
128
  #
112
129
  # @param [Hash or String]
113
130
  # @raise [MeCabError] if <tt>mecab</tt> cannot be initialized with the given <tt>options</tt>
@@ -118,48 +135,87 @@ module Natto
118
135
  @dicts = []
119
136
 
120
137
  opt_str = self.class.build_options_str(@options)
121
- @ptr = self.mecab_new2(opt_str)
122
- raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @ptr.address == 0x0
138
+ @tagger = self.mecab_new2(opt_str)
139
+ raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @tagger.address == 0x0
123
140
 
124
- # set mecab parsing options
125
- self.mecab_set_theta(@ptr, @options[:theta]) if @options[:theta]
126
- self.mecab_set_lattice_level(@ptr, @options[:lattice_level]) if @options[:lattice_level]
127
- self.mecab_set_all_morphs(@ptr, 1) if @options[:all_morphs]
141
+ self.mecab_set_theta(@tagger, @options[:theta]) if @options[:theta]
142
+ self.mecab_set_lattice_level(@tagger, @options[:lattice_level]) if @options[:lattice_level]
143
+ self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
128
144
 
129
145
  # Set mecab parsing implementations for N-best and regular parsing,
130
146
  # for both parsing as string and yielding a node object
131
147
  # N-Best parsing implementations
132
148
  if @options[:nbest] && @options[:nbest] > 1
133
149
  # nbest parsing require lattice level >= 1
134
- self.mecab_set_lattice_level(@ptr, (@options[:lattice_level] || 1))
135
- @parse_tostr = lambda { |str|
136
- self.mecab_nbest_init(@ptr, str)
137
- return self.mecab_nbest_sparse_tostr(@ptr, @options[:nbest], str) ||
138
- raise(MeCabError.new(self.mecab_strerror(@ptr)))
139
- }
140
- @parse_tonode = lambda { |str|
141
- self.mecab_nbest_init(@ptr, str)
142
- return self.mecab_nbest_next_tonode(@ptr)
143
- }
150
+ self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
151
+ @parse_tostr = lambda do |str|
152
+ return self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], str) ||
153
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
154
+ end
155
+ @parse_tonodes = lambda do |str|
156
+ nodes = []
157
+ if @options[:nbest] && @options[:nbest] > 1
158
+ self.mecab_nbest_init(@tagger, str)
159
+ n = self.mecab_nbest_next_tonode(@tagger)
160
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
161
+ nlen = @options[:nbest]
162
+ nlen.times do
163
+ s = str.bytes.to_a
164
+ while n && n.address != 0x0
165
+ mn = Natto::MeCabNode.new(n)
166
+ if mn.is_nor?
167
+ slen, sarr = mn.length, []
168
+ slen.times { sarr << s.shift }
169
+ surf = sarr.pack('C*')
170
+ mn.surface = self.class.force_enc(surf)
171
+ if @options[:output_format_type] || @options[:node_format]
172
+ mn.feature = self.class.force_enc(self.mecab_format_node(@tagger, n))
173
+ end
174
+ end
175
+ nodes << mn if !mn.is_bos?
176
+ n = mn.next
177
+ end
178
+ n = self.mecab_nbest_next_tonode(@tagger)
179
+ end
180
+ end
181
+ return nodes
182
+ end
144
183
  else
145
184
  # default parsing implementations
146
- @parse_tostr = lambda { |str|
147
- return self.mecab_sparse_tostr(@ptr, str) || raise(MeCabError.new(self.mecab_strerror(@ptr)))
148
- }
149
- @parse_tonode = lambda { |str| return self.mecab_sparse_tonode(@ptr, str) }
185
+ @parse_tostr = lambda do |str|
186
+ return self.mecab_sparse_tostr(@tagger, str) ||
187
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
188
+ end
189
+ @parse_tonodes = lambda do |str|
190
+ nodes = []
191
+ n = self.mecab_sparse_tonode(@tagger, str)
192
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
193
+ mn = Natto::MeCabNode.new(n)
194
+ n = mn.next if mn.next.address!=0x0 && mn.is_bos?
195
+ s = str.bytes.to_a
196
+ while n && n.address!=0x0
197
+ mn = Natto::MeCabNode.new(n)
198
+ if mn.is_nor?
199
+ slen, sarr = mn.length, []
200
+ slen.times { sarr << s.shift }
201
+ surf = sarr.pack('C*')
202
+ mn.surface = self.class.force_enc(surf)
203
+ end
204
+ nodes << mn
205
+ n = mn.next
206
+ end
207
+ return nodes
208
+ end
150
209
  end
151
210
 
152
- # set ref to dictionaries
153
- @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@ptr))
211
+ @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
154
212
  while @dicts.last.next.address != 0x0
155
213
  @dicts << Natto::DictionaryInfo.new(@dicts.last.next)
156
214
  end
157
215
 
158
- # set ref to mecab version string
159
216
  @version = self.mecab_version
160
217
 
161
- # set Proc for freeing mecab pointer
162
- ObjectSpace.define_finalizer(self, self.class.create_free_proc(@ptr))
218
+ ObjectSpace.define_finalizer(self, self.class.create_free_proc(@tagger))
163
219
  end
164
220
 
165
221
  # Parses the given string <tt>str</tt>. If a block is passed to this method,
@@ -167,47 +223,48 @@ module Natto
167
223
  #
168
224
  # @param [String] str
169
225
  # @return parsing result from <tt>mecab</tt>
170
- # @raise [MeCabError] if the <tt>mecab</tt> parser cannot parse the given string <tt>str</tt>
226
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
171
227
  # @see MeCabNode
172
228
  def parse(str)
173
229
  if block_given?
174
- m_node_ptr = @parse_tonode.call(str)
175
- head = Natto::MeCabNode.new(m_node_ptr)
176
- if head && head[:next].address != 0x0
177
- node = Natto::MeCabNode.new(head[:next])
178
- i = 0
179
- while node.nil? == false
180
- if node.length > 0
181
- node.surface = str.bytes.to_a()[i, node.length].pack('C*')
182
- end
183
- yield node
184
- if node[:next].address != 0x0
185
- i += node.length
186
- node = Natto::MeCabNode.new(node[:next])
187
- else
188
- break
189
- end
190
- end
191
- end
230
+ nodes = @parse_tonodes.call(str)
231
+ nodes.each {|n| yield n }
192
232
  else
193
- result = @parse_tostr.call(str)
194
- result.force_encoding(Encoding.default_external) if result.respond_to?(:encoding) && result.encoding!=Encoding.default_external
195
- result
233
+ self.class.force_enc(@parse_tostr.call(str))
196
234
  end
197
235
  end
198
236
 
199
- # Returns human-readable details for the wrapped <tt>mecab</tt> parser.
237
+ # Parses the given string <tt>str</tt>, and returns
238
+ # a list of <tt>mecab</tt> nodes.
239
+ # @param [String] str
240
+ # @return [Array] of parsed <tt>mecab</tt> nodes.
241
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
242
+ # @see MeCabNode
243
+ def readnodes(str)
244
+ @parse_tonodes.call(str)
245
+ end
246
+
247
+ # Parses the given string <tt>str</tt>, and returns
248
+ # a list of <tt>mecab</tt> result strings.
249
+ # @param [String] str
250
+ # @return [Array] of parsed <tt>mecab</tt> result strings.
251
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
252
+ def readlines(str)
253
+ self.class.force_enc(@parse_tostr.call(str)).lines.to_a
254
+ end
255
+
256
+ # Returns human-readable details for the wrapped <tt>mecab</tt> tagger.
200
257
  # Overrides <tt>Object#to_s</tt>.
201
258
  #
202
259
  # - encoded object id
203
- # - FFI pointer to <tt>mecab</tt> object
260
+ # - underlying FFI pointer to the <tt>mecab</tt> tagger
204
261
  # - options hash
205
262
  # - list of dictionaries
206
263
  # - MeCab version
207
264
  #
208
- # @return [String] encoded object id, FFI pointer, options hash, list of dictionaries, and MeCab version
265
+ # @return [String] encoded object id, underlying FFI pointer, options hash, list of dictionaries, and MeCab version
209
266
  def to_s
210
- %(#{super.chop} @ptr=#{@ptr.to_s}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
267
+ %(#{super.chop} @tagger=#{@tagger}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
211
268
  end
212
269
 
213
270
  # Overrides <tt>Object#inspect</tt>.
@@ -284,10 +341,10 @@ module Natto
284
341
  end
285
342
 
286
343
  # Returns a string-representation of the options to
287
- # be passed in the construction of <tt>mecab</tt>.
344
+ # be passed in the construction of the <tt>mecab</tt> tagger.
288
345
  #
289
346
  # @param [Hash] options
290
- # @return [String] representation of the options to the <tt>mecab</tt> parser
347
+ # @return [String] representation of the options to the <tt>mecab</tt> tagger
291
348
  def self.build_options_str(options={})
292
349
  opt = []
293
350
  SUPPORTED_OPTS.values.each do |k|
@@ -352,7 +409,16 @@ module Natto
352
409
  #
353
410
  # puts sysdic.charset
354
411
  # => "utf8"
412
+ #
413
+ # puts sysdic.is_sysdic?
414
+ # => true
355
415
  class DictionaryInfo < MeCabStruct
416
+ # System dictionary.
417
+ SYS_DIC = 0
418
+ # User dictionary.
419
+ USR_DIC = 1
420
+ # Unknown dictionary.
421
+ UNK_DIC = 2
356
422
 
357
423
  layout :filename, :string,
358
424
  :charset, :string,
@@ -381,12 +447,13 @@ module Natto
381
447
  # Overrides <tt>Object#to_s</tt>.
382
448
  #
383
449
  # - encoded object id
450
+ # - dictionary type
384
451
  # - full-path dictionary filename
385
452
  # - dictionary charset
386
453
  #
387
- # @return [String] encoded object id, dictionary filename, and charset
454
+ # @return [String] encoded object id, type, dictionary filename, and charset
388
455
  def to_s
389
- %(#{super.chop} filename="#{self.filename}", charset="#{self.charset}">)
456
+ %(#{super.chop} type="#{self.type}", filename="#{self.filename}", charset="#{self.charset}">)
390
457
  end
391
458
 
392
459
  # Overrides <tt>Object#inspect</tt>.
@@ -396,6 +463,24 @@ module Natto
396
463
  def inspect
397
464
  self.to_s
398
465
  end
466
+
467
+ # Returns <tt>true</tt> if this is a system dictionary.
468
+ # @return [Boolean]
469
+ def is_sysdic?
470
+ self.type == SYS_DIC
471
+ end
472
+
473
+ # Returns <tt>true</tt> if this is a user dictionary.
474
+ # @return [Boolean]
475
+ def is_usrdic?
476
+ self.type == USR_DIC
477
+ end
478
+
479
+ # Returns <tt>true</tt> if this is an unknown dictionary type.
480
+ # @return [Boolean]
481
+ def is_unkdic?
482
+ self.type == UNK_DIC
483
+ end
399
484
  end
400
485
 
401
486
  # <tt>MeCabNode</tt> is a wrapper for the structure holding
@@ -430,61 +515,57 @@ module Natto
430
515
  # - :cost
431
516
  #
432
517
  # <h2>Usage</h2>
433
- # An instance of <tt>MeCabNode</tt> is yielded to a block
434
- # used with <tt>MeCab#parse</tt>. Each resulting node is
435
- # yielded to the block passed in, where the above-mentioned
436
- # node attributes may be accessed.
518
+ # An instance of <tt>MeCabNode</tt> is yielded to the block
519
+ # used with <tt>MeCab#parse</tt>, where the above-mentioned
520
+ # node attributes may be accessed by name.
437
521
  #
438
522
  # nm = Natto::MeCab.new
439
523
  #
440
- # nm.parse('めかぶの使い方がわからなくて困ってました。') do |n|
441
- # puts "#{n.surface}¥t#{n.cost}"
524
+ # nm.parse('卓球なんて死ぬまでの暇つぶしだよ。') do |n|
525
+ # puts "#{n.surface}\t#{n.cost}" if n.is_nor?
442
526
  # end
443
- #
444
- # め 7961
445
- # かぶ 19303
446
- # の 25995
447
- # 使い方 29182
448
- # が 28327
449
- # わから 33625
450
- # なく 34256
451
- # て 36454
452
- # 困っ 43797
453
- # て 42178
454
- # まし 46708
455
- # た 46111
456
- # 。 42677
457
- # 41141
458
- # => nil
527
+ # 卓球 2874
528
+ # なんて 4398
529
+ # 死ぬ 9261
530
+ # まで 9386
531
+ # の 10007
532
+ # 暇つぶし 13324
533
+ # だ 15346
534
+ # よ 14396
535
+ # 。 10194
459
536
  #
460
537
  # It is also possible to use the <tt>Symbol</tt> for the
461
538
  # <tt>mecab</tt> node member to index into the
462
539
  # <tt>FFI::Struct</tt> layout associative array like so:
463
540
  #
464
- # nm.parse('納豆に乗っけて頂きます!') {|n| puts n[:feature] }
465
- #
466
- # 名詞,一般,*,*,*,*,納豆,ナットウ,ナットー
467
- # 助詞,格助詞,一般,*,*,*,に,ニ,ニ
468
- # 動詞,自立,*,*,一段,連用形,乗っける,ノッケ,ノッケ
469
- # 助詞,接続助詞,*,*,*,*,て,テ,テ
470
- # 動詞,非自立,*,*,五段・カ行イ音便,連用形,頂く,イタダキ,イタダキ
471
- # 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
472
- # 記号,一般,*,*,*,*,!,!,!
541
+ # nm.parse('あいつ笑うと結構可愛い顔してんよ。') {|n| puts n[:feature] }
542
+ # 名詞,代名詞,一般,*,*,*,あいつ,アイツ,アイツ
543
+ # 動詞,自立,*,*,五段・ワ行促音便,基本形,笑う,ワラウ,ワラウ
544
+ # 助詞,接続助詞,*,*,*,*,と,ト,ト
545
+ # 副詞,一般,*,*,*,*,結構,ケッコウ,ケッコー
546
+ # 形容詞,自立,*,*,形容詞・イ段,基本形,可愛い,カワイイ,カワイイ
547
+ # 名詞,一般,*,*,*,*,顔,カオ,カオ
548
+ # 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
549
+ # 動詞,非自立,*,*,一段,体言接続特殊,てる,テン,テン
550
+ # 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
551
+ # 記号,句点,*,*,*,*,。,。,。
473
552
  # BOS/EOS,*,*,*,*,*,*,*,*
474
- # => nil
475
553
  #
476
554
  class MeCabNode < MeCabStruct
555
+ include Natto::Utils
556
+
477
557
  attr_accessor :surface, :feature
558
+ attr_reader :pointer
478
559
 
479
- # Normal <tt>mecab</tt> node.
560
+ # Normal <tt>mecab</tt> node defined in the dictionary.
480
561
  NOR_NODE = 0
481
- # Unknown <tt>mecab</tt> node.
562
+ # Unknown <tt>mecab</tt> node not defined in the dictionary.
482
563
  UNK_NODE = 1
483
- # Beginning-of-string <tt>mecab</tt> node.
564
+ # Virtual node representing the beginning of the sentence.
484
565
  BOS_NODE = 2
485
- # End-of-string <tt>mecab</tt> node.
566
+ # Virtual node representing the end of the sentence.
486
567
  EOS_NODE = 3
487
- # End-of-NBest <tt>mecab</tt> node list.
568
+ # Virtual node representing the end of an N-Best <tt>mecab</tt> node list.
488
569
  EON_NODE = 4
489
570
 
490
571
  layout :prev, :pointer,
@@ -525,39 +606,30 @@ module Natto
525
606
  end
526
607
 
527
608
  # Initializes this node instance.
528
- # Sets the <ttMeCab</tt> feature value for this node.
609
+ # Sets the <tt>MeCab</tt> feature value for this node.
529
610
  #
530
611
  # @param [FFI::Pointer]
531
612
  def initialize(ptr)
532
613
  super(ptr)
614
+ @pointer = ptr
533
615
 
534
616
  if self[:feature]
535
- @feature = self[:feature]
536
- @feature.force_encoding(Encoding.default_external) if @feature.respond_to?(:encoding) && @feature.encoding!=Encoding.default_external
617
+ @feature = self.class.force_enc(self[:feature])
537
618
  end
538
619
  end
539
620
 
540
- # Sets the morpheme surface value for this node.
541
- #
542
- # @param [String]
543
- def surface=(str)
544
- if str && self[:length] > 0
545
- @surface = str
546
- @surface.force_encoding(Encoding.default_external) if @surface.respond_to?(:encoding) && @surface.encoding!=Encoding.default_external
547
- end
548
- end
549
-
550
621
  # Returns human-readable details for the <tt>mecab</tt> node.
551
622
  # Overrides <tt>Object#to_s</tt>.
552
623
  #
553
624
  # - encoded object id
554
- # - stat
625
+ # - underlying FFI pointer to MeCab Node
626
+ # - stat (node type: NOR, UNK, BOS/EOS, EON)
555
627
  # - surface
556
628
  # - feature
557
629
  #
558
- # @return [String] encoded object id, stat, surface, and feature
630
+ # @return [String] encoded object id, underlying FFI pointer, stat, surface, and feature
559
631
  def to_s
560
- %(#{super.chop} stat=#{self[:stat]}, surface="#{self.surface}", feature="#{self.feature}">)
632
+ %(#{super.chop} @pointer=#{@pointer}, stat=#{self[:stat]}, @surface="#{self.surface}", @feature="#{self.feature}">)
561
633
  end
562
634
 
563
635
  # Overrides <tt>Object#inspect</tt>.
@@ -567,5 +639,35 @@ module Natto
567
639
  def inspect
568
640
  self.to_s
569
641
  end
642
+
643
+ # Returns <tt>true</tt> if this is a normal <tt>mecab</tt> node found in the dictionary.
644
+ # @return [Boolean]
645
+ def is_nor?
646
+ self.stat == NOR_NODE
647
+ end
648
+
649
+ # Returns <tt>true</tt> if this is an unknown <tt>mecab</tt> node not found in the dictionary.
650
+ # @return [Boolean]
651
+ def is_unk?
652
+ self.stat == UNK_NODE
653
+ end
654
+
655
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the beginning of the sentence.
656
+ # @return [Boolean]
657
+ def is_bos?
658
+ self.stat == BOS_NODE
659
+ end
660
+
661
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the sentence.
662
+ # @return [Boolean]
663
+ def is_eos?
664
+ self.stat == EOS_NODE
665
+ end
666
+
667
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the node list.
668
+ # @return [Boolean]
669
+ def is_eon?
670
+ self.stat == EON_NODE
671
+ end
570
672
  end
571
673
  end
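The new `@parse_tonodes` lambdas rebuild each node's surface by shifting `length` bytes off the input string and packing them back into a string. Here is a minimal standalone sketch of that byte-slicing idea, assuming UTF-8 input with no whitespace between tokens. `slice_surfaces` and the hard-coded byte lengths are hypothetical. In natto the lengths come from each MeCab node, and re-encoding goes through a `force_enc` helper that this diff does not show.

```ruby
# coding: utf-8
# Hypothetical helper: cuts str into consecutive pieces of the given
# byte lengths, mirroring the shift/pack steps in the diff.
def slice_surfaces(str, byte_lengths)
  bytes = str.bytes.to_a
  byte_lengths.map do |len|
    # pack('C*') yields a binary string, so tag it with the source
    # encoding again before returning it.
    bytes.shift(len).pack('C*').force_encoding(str.encoding)
  end
end

# Each of these characters occupies 3 bytes in UTF-8, so the three
# tokens are 9, 3 and 3 bytes long.
surfaces = slice_surfaces('ピンチの時', [9, 3, 3])
puts surfaces.inspect
```

Slicing by bytes rather than characters matters because the node lengths are byte counts, so character-based indexing would cut multibyte text in the wrong places.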