natto 0.9.3 → 0.9.4

data/CHANGELOG CHANGED
@@ -1,7 +1,18 @@
1
1
  ## CHANGELOG
2
2
 
3
- - __2012/02/xx__: 0.9.3 release.
4
- - Updating structures, tests following the latest release of MeCab 0.993
3
+ - __2012/02/26__: 0.9.4 release.
4
+ - Exposing the underlying FFI pointer as @tagger in Natto::MeCab
5
+ - Exposing the underlying FFI pointer as @pointer in Natto::MeCabNode
6
+ - Fixed node formatting behavior for both regular and N-Best node parsing
7
+ - Adding is_nor?, is_unk?, is_bos?, is_eos?, is_eon? to Natto::MeCabNode
8
+ - Adding is_sysdic?, is_usrdic?, is_unkdic? to Natto::DictionaryInfo
9
+ - Adding readnodes to allow for method-chaining on list of parsed nodes
10
+ - Adding readlines to allow for method-chaining on list of string output
11
+ - Updating README to include description of DevKit install for RubyInstaller
12
+ - Refactored the codebase, introducing more tests
13
+
14
+ - __2012/02/14__: 0.9.3 release.
15
+ - Updating structures, tests following the latest release of MeCab 0.993
5
16
  - Ending support for 0.98 MeCab.
6
17
  - Support for MeCab 0.993 in Cygwin is still not confirmed.
7
18
 
data/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
  A Tasty Ruby Binding with MeCab
3
3
 
4
4
  ## What is natto?
5
- natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.sourceforge.net/), the part-of-speech and morphological analyzer for the Japanese language.
5
+ natto combines the [Ruby programming language](http://www.ruby-lang.org/) with [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech and morphological analyzer for the Japanese language.
6
6
 
7
7
  natto is a gem bridging Ruby and MeCab using FFI (foreign function interface). No compilation is necessary, as natto is _not_ a C extension. natto will run on CRuby (mri/yarv) and JRuby (jvm) equally well. natto will also run on Windows, Unix/Linux, and Mac.
8
8
 
@@ -15,8 +15,22 @@ natto requires the following:
15
15
  - [ffi _0.6.3 or greater_](http://rubygems.org/gems/ffi)
16
16
  - Ruby _1.8.7 or greater_
17
17
 
18
- ## Installation
18
+ ## Installation on *NIX/Mac/Cygwin
19
19
  Install natto with the following gem command:
20
+
21
+ gem install natto
22
+
23
+ This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which is what natto uses to bind to the <tt>mecab</tt> library.
24
+
25
+ ## Installation on Windows
26
+ If you are using CRuby on Windows, you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), an MSYS/MinGW-based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including <tt>ffi</tt>.
27
+
28
+ 1. Download the latest release for RubyInstaller for Windows platforms and the corresponding DevKit from the [RubyInstaller for Windows downloads page](http://rubyinstaller.org/downloads/).
29
+ 2. After installing RubyInstaller for Windows, double-click on the DevKit-tdm installer <tt>.exe</tt>, and expand the contents to an appropriate location, for example <tt>C:\devkit</tt>.
30
+ 3. Open a command window under <tt>C:\devkit</tt>, and execute: <tt>ruby dk.rb init</tt>. This will locate all known ruby installations, and add them to <tt>C:\devkit\config.yml</tt>.
31
+ 4. Next, execute: <tt>ruby dk.rb install</tt>, which will add the DevKit to all of the installed rubies listed in your <tt>C:\devkit\config.yml</tt>.
32
+ 5. Now you should be able to install and build the <tt>ffi</tt> rubygem correctly on your Windows-installed ruby, so you can install <tt>natto</tt> with:
33
+
20
34
  gem install natto
21
35
 
22
36
  ## Configuration
@@ -24,12 +38,19 @@ Install natto with the following gem command:
24
38
  - In case of <tt>LoadError</tt>, please set the <tt>MECAB_PATH</tt> environment variable to the exact name/path to your <tt>mecab</tt> library.
25
39
 
26
40
  e.g., for bash on UNIX/Linux
41
+
27
42
  export MECAB_PATH=/usr/local/lib/libmecab.so
43
+
28
44
  e.g., on Windows
45
+
29
46
  set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
47
+
30
48
  e.g., for Cygwin
49
+
31
50
  export MECAB_PATH=cygmecab-1
51
+
32
52
  e.g., from within a Ruby program
53
+
33
54
  ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
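A hedged sketch of setting `MECAB_PATH` from Ruby before natto is loaded. The helper name `default_mecab_path` and the platform matching are illustrative assumptions and not part of natto. The paths are the examples listed above and may differ on your system.

```ruby
require 'rbconfig'

# Hypothetical helper (not part of natto): maps a host OS string to the
# example library locations from the README. Adjust for your installation.
def default_mecab_path(host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /cygwin/i      then 'cygmecab-1'
  when /mswin|mingw/i then 'C:\Program Files\MeCab\bin\libmecab.dll'
  else                     '/usr/local/lib/libmecab.so'
  end
end

# The value must be a quoted String; set it before `require 'natto'`.
ENV['MECAB_PATH'] ||= default_mecab_path
puts ENV['MECAB_PATH']
```

An explicitly exported `MECAB_PATH` still wins here, because `||=` only fills in the value when the variable is unset.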
34
55
 
35
56
  ## Usage
@@ -38,15 +59,16 @@ e.g., from within a Ruby program
38
59
 
39
60
  nm = Natto::MeCab.new
40
61
  => #<Natto::MeCab:0x28d30748
41
- @ptr=#<FFI::Pointer address=0x28a97d50>, \
62
+ @tagger=#<FFI::Pointer address=0x28a97d50>, \
42
63
  @options={}, \
43
- @dicts=[#<Natto::DictionaryInfo:0x28d3061c
44
- filename="/usr/local/lib/mecab/dic/ipadic/sys.dic",
45
- charset="utf8">],
46
- @version="0.992">
64
+ @dicts=[#<Natto::DictionaryInfo:0x28d3061c \
65
+ type="0", \
66
+ filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
67
+ charset="utf8">], \
68
+ @version="0.993">
47
69
 
48
70
  puts nm.version
49
- => "0.992"
71
+ => "0.993"
50
72
 
51
73
  sysdic = nm.dicts.first
52
74
 
@@ -55,22 +77,25 @@ e.g., from within a Ruby program
55
77
 
56
78
  puts sysdic.charset
57
79
  => "utf8"
58
-
59
- nm.parse('暑い日にはもってこいの一品ですね。') do |n|
80
+
81
+ nm.parse('ピンチの時には必ずヒーローが現れる。') do |n|
60
82
  puts "#{n.surface}\t#{n.feature}"
61
83
  end
62
- 暑い 形容詞,自立,*,*,形容詞・アウオ段,基本形,暑い,アツイ,アツイ
63
- 日 名詞,非自立,副詞可能,*,*,*,日,ヒ,ヒ
64
- に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
65
- は 助詞,係助詞,*,*,*,*,は,ハ,ワ
66
- もってこい 名詞,一般,*,*,*,*,もってこい,モッテコイ,モッテコイ
67
- の 助詞,連体化,*,*,*,*,の,ノ,ノ
68
- 一品 名詞,一般,*,*,*,*,一品,イッピン,イッピン
69
- です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
70
- ね 助詞,終助詞,*,*,*,*,ね,ネ,ネ
71
- 。 記号,句点,*,*,*,*,。,。,。
72
- BOS/EOS,*,*,*,*,*,*,*,*
73
- => nil
84
+ ピンチ 名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
85
+ の 助詞,連体化,*,*,*,*,の,ノ,ノ
86
+ 時 名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
87
+ に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
88
+ は 助詞,係助詞,*,*,*,*,は,ハ,ワ
89
+ 必ず 副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
90
+ ヒーロー 名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
91
+ が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
92
+ 現れる 動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
93
+ 。 記号,句点,*,*,*,*,。,。,。
94
+ BOS/EOS,*,*,*,*,*,*,*,*
95
+
96
+
97
+ ## Learn more
98
+ - You can read more about natto on the [project Wiki](https://bitbucket.org/buruzaemon/natto/wiki/Home).
74
99
 
75
100
  ## Contributing to natto
76
101
  - Use [mercurial](http://mercurial.selenic.com/) and [check out the latest code at bitbucket](https://bitbucket.org/buruzaemon/natto/src/) to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
data/lib/natto.rb CHANGED
@@ -1,13 +1,14 @@
1
1
  # coding: utf-8
2
2
  require 'rubygems' if RUBY_VERSION.to_f < 1.9
3
3
  require 'natto/binding'
4
+ require 'natto/utils'
4
5
 
5
6
  module Natto
6
7
  require 'ffi'
7
8
  require 'optparse'
8
9
 
9
- # <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> parser.
10
- # Options to the <tt>mecab</tt> parser are passed in as a string
10
+ # <tt>MeCab</tt> is a wrapper class for the <tt>mecab</tt> tagger.
11
+ # Options to the <tt>mecab</tt> tagger are passed in as a string
11
12
  # (MeCab command-line style) or as a Ruby-style hash at
12
13
  # initialization.
13
14
  #
@@ -16,32 +17,42 @@ module Natto
16
17
  # require 'rubygems' if RUBY_VERSION.to_f < 1.9
17
18
  # require 'natto'
18
19
  #
19
- # nm = Natto::MeCab.new('-O chasen2')
20
+ # nm = Natto::MeCab.new('-Ochasen')
20
21
  # => #<Natto::MeCab:0x28d3bdc8 \
21
- # @ptr=#<FFI::Pointer address=0x28afb980>, \
22
- # @options={:output_format_type=>"chasen2"}, \
23
- # @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
22
+ # @tagger=#<FFI::Pointer address=0x28afb980>, \
23
+ # @options={:output_format_type=>"chasen"}, \
24
+ # @dicts=[#<Natto::DictionaryInfo:0x289a1f14 \
25
+ # type="0", \
24
26
  # filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
25
27
  # charset="utf8">], \
26
28
  # @version="0.993">
27
29
  #
28
- # nm.parse('ネバネバの組み合わせ美味しいです。') do |n|
30
+ # nm.parse('凡人にしか見えねえ風景ってのがあるんだよ。') do |n|
29
31
  # puts "#{n.surface}\t#{n.feature}"
30
- # end
31
- #
32
- # ネバネバ 名詞,サ変接続,*,*,*,*,ネバネバ,ネバネバ,ネバネバ
33
- # の 助詞,連体化,*,*,*,*,の,ノ,ノ
34
- # 組み合わせ 名詞,一般,*,*,*,*,組み合わせ,クミアワセ,クミアワセ
35
- # 美味しい 形容詞,自立,*,*,形容詞・イ段,基本形,美味しい,オイシイ,オイシイ
36
- # です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
37
- # 。 記号,句点,*,*,*,*,。,。,。
32
+ # end
33
+ # 凡人 名詞,一般,*,*,*,*,凡人,ボンジン,ボンジン
34
+ # に 助詞,格助詞,一般,*,*,*,に,ニ,ニ
35
+ # しか 助詞,係助詞,*,*,*,*,しか,シカ,シカ
36
+ # 見え 動詞,自立,*,*,一段,未然形,見える,ミエ,ミエ
37
+ # ねえ 助動詞,*,*,*,特殊・ナイ,音便基本形,ない,ネエ,ネー
38
+ # 風景 名詞,一般,*,*,*,*,風景,フウケイ,フーケイ
39
+ # って 助詞,格助詞,連語,*,*,*,って,ッテ,ッテ
40
+ # の 名詞,非自立,一般,*,*,*,の,ノ,ノ
41
+ # が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
42
+ # ある 動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
43
+ # ん 名詞,非自立,一般,*,*,*,ん,ン,ン
44
+ # だ 助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
45
+ # よ 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
46
+ # 。 記号,句点,*,*,*,*,。,。,。
47
+ # BOS/EOS,*,*,*,*,*,*,*,*
38
48
  #
39
49
  class MeCab
40
50
  include Natto::Binding
51
+ include Natto::Utils
41
52
 
42
- attr_reader :options, :dicts, :version
53
+ attr_reader :tagger, :options, :dicts, :version
43
54
 
44
- # Mapping of mecab short-style configuration options to the <tt>mecab</tt> parser.
55
+ # Mapping of mecab short-style configuration options to the <tt>mecab</tt> tagger.
45
56
  # See the <tt>mecab</tt> help for more details.
46
57
  SUPPORTED_OPTS = { '-r' => :rcfile,
47
58
  '-d' => :dicdir,
@@ -69,7 +80,7 @@ module Natto
69
80
  # - :rcfile -- resource file
70
81
  # - :dicdir -- system dicdir
71
82
  # - :userdic -- user dictionary
72
- # - :lattice_level -- lattice information level (integer, default 0)
83
+ # - :lattice_level -- lattice information level (DEPRECATED)
73
84
  # - :output_format_type -- output format type (wakati, chasen, yomi, etc.)
74
85
  # - :all_morphs -- output all morphs (default false)
75
86
  # - :nbest -- output N best results (integer, default 1), requires lattice level >= 1
@@ -90,24 +101,30 @@ module Natto
90
101
  # e.g.<br/>
91
102
  #
92
103
  # nm = Natto::MeCab.new(:node_format=>'%m¥t%f[7]¥n')
93
- # => #<Natto::MeCab:0x28d2ae10 @ptr=#<FFI::Pointer address=0x28a97980>, \
94
- # @options={:node_format=>"%m¥t%f[7]¥n"}, \
95
- # @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
104
+ # => #<Natto::MeCab:0x28d2ae10
105
+ # @tagger=#<FFI::Pointer address=0x28a97980>, \
106
+ # @options={:node_format=>"%m¥t%f[7]¥n"}, \
107
+ # @dicts=[#<Natto::DictionaryInfo:0x28d2a85c \
108
+ # type="0", \
96
109
  # filename="/usr/local/lib/mecab/dic/ipadic/sys.dic", \
97
110
  # charset="utf8">], \
98
111
  # @version="0.993">
99
- #
100
- # puts nm.parse('簡単で美味しくて良いですよね。')
101
- # 簡単 カンタン
102
- # で デ
103
- # 美味しくて オイシクテ
104
- # 良い ヨイ
105
- # です デス
106
- # よ ヨ
107
- # ね ネ
108
- #
112
+ #
113
+ # puts nm.parse('才能とは求める人間に与えられるものではない。')
114
+ # 才能 サイノウ
115
+ # と ト
116
+ # は ハ
117
+ # 求める モトメル
118
+ # 人間 ニンゲン
119
+ # に ニ
120
+ # 与え アタエ
121
+ # られる ラレル
122
+ # もの モノ
123
+ # で デ
124
+ # は ハ
125
+ # ない ナイ
126
+ # 。 。
109
127
  # EOS
110
- # => nil
111
128
  #
112
129
  # @param [Hash or String]
113
130
  # @raise [MeCabError] if <tt>mecab</tt> cannot be initialized with the given <tt>options</tt>
@@ -118,48 +135,87 @@ module Natto
118
135
  @dicts = []
119
136
 
120
137
  opt_str = self.class.build_options_str(@options)
121
- @ptr = self.mecab_new2(opt_str)
122
- raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @ptr.address == 0x0
138
+ @tagger = self.mecab_new2(opt_str)
139
+ raise MeCabError.new("Could not initialize MeCab with options: '#{opt_str}'") if @tagger.address == 0x0
123
140
 
124
- # set mecab parsing options
125
- self.mecab_set_theta(@ptr, @options[:theta]) if @options[:theta]
126
- self.mecab_set_lattice_level(@ptr, @options[:lattice_level]) if @options[:lattice_level]
127
- self.mecab_set_all_morphs(@ptr, 1) if @options[:all_morphs]
141
+ self.mecab_set_theta(@tagger, @options[:theta]) if @options[:theta]
142
+ self.mecab_set_lattice_level(@tagger, @options[:lattice_level]) if @options[:lattice_level]
143
+ self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
128
144
 
129
145
  # Set mecab parsing implementations for N-best and regular parsing,
130
146
  # for both parsing as string and yielding a node object
131
147
  # N-Best parsing implementations
132
148
  if @options[:nbest] && @options[:nbest] > 1
133
149
  # nbest parsing require lattice level >= 1
134
- self.mecab_set_lattice_level(@ptr, (@options[:lattice_level] || 1))
135
- @parse_tostr = lambda { |str|
136
- self.mecab_nbest_init(@ptr, str)
137
- return self.mecab_nbest_sparse_tostr(@ptr, @options[:nbest], str) ||
138
- raise(MeCabError.new(self.mecab_strerror(@ptr)))
139
- }
140
- @parse_tonode = lambda { |str|
141
- self.mecab_nbest_init(@ptr, str)
142
- return self.mecab_nbest_next_tonode(@ptr)
143
- }
150
+ self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
151
+ @parse_tostr = lambda do |str|
152
+ return self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], str) ||
153
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
154
+ end
155
+ @parse_tonodes = lambda do |str|
156
+ nodes = []
157
+ if @options[:nbest] && @options[:nbest] > 1
158
+ self.mecab_nbest_init(@tagger, str)
159
+ n = self.mecab_nbest_next_tonode(@tagger)
160
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
161
+ nlen = @options[:nbest]
162
+ nlen.times do
163
+ s = str.bytes.to_a
164
+ while n && n.address != 0x0
165
+ mn = Natto::MeCabNode.new(n)
166
+ if mn.is_nor?
167
+ slen, sarr = mn.length, []
168
+ slen.times { sarr << s.shift }
169
+ surf = sarr.pack('C*')
170
+ mn.surface = self.class.force_enc(surf)
171
+ if @options[:output_format_type] || @options[:node_format]
172
+ mn.feature = self.class.force_enc(self.mecab_format_node(@tagger, n))
173
+ end
174
+ end
175
+ nodes << mn if !mn.is_bos?
176
+ n = mn.next
177
+ end
178
+ n = self.mecab_nbest_next_tonode(@tagger)
179
+ end
180
+ end
181
+ return nodes
182
+ end
144
183
  else
145
184
  # default parsing implementations
146
- @parse_tostr = lambda { |str|
147
- return self.mecab_sparse_tostr(@ptr, str) || raise(MeCabError.new(self.mecab_strerror(@ptr)))
148
- }
149
- @parse_tonode = lambda { |str| return self.mecab_sparse_tonode(@ptr, str) }
185
+ @parse_tostr = lambda do |str|
186
+ return self.mecab_sparse_tostr(@tagger, str) ||
187
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
188
+ end
189
+ @parse_tonodes = lambda do |str|
190
+ nodes = []
191
+ n = self.mecab_sparse_tonode(@tagger, str)
192
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
193
+ mn = Natto::MeCabNode.new(n)
194
+ n = mn.next if mn.next.address!=0x0 && mn.is_bos?
195
+ s = str.bytes.to_a
196
+ while n && n.address!=0x0
197
+ mn = Natto::MeCabNode.new(n)
198
+ if mn.is_nor?
199
+ slen, sarr = mn.length, []
200
+ slen.times { sarr << s.shift }
201
+ surf = sarr.pack('C*')
202
+ mn.surface = self.class.force_enc(surf)
203
+ end
204
+ nodes << mn
205
+ n = mn.next
206
+ end
207
+ return nodes
208
+ end
150
209
  end
151
210
 
152
- # set ref to dictionaries
153
- @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@ptr))
211
+ @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
154
212
  while @dicts.last.next.address != 0x0
155
213
  @dicts << Natto::DictionaryInfo.new(@dicts.last.next)
156
214
  end
157
215
 
158
- # set ref to mecab version string
159
216
  @version = self.mecab_version
160
217
 
161
- # set Proc for freeing mecab pointer
162
- ObjectSpace.define_finalizer(self, self.class.create_free_proc(@ptr))
218
+ ObjectSpace.define_finalizer(self, self.class.create_free_proc(@tagger))
163
219
  end
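The node-walking lambdas in this hunk rebuild each node's surface by taking `length` bytes off the front of the input string. A minimal, natto-free sketch of that idea follows. `slice_surfaces` and the hard-coded byte lengths are hypothetical stand-ins for what MeCab would report, and the sketch assumes UTF-8 input.

```ruby
# coding: utf-8

# Hypothetical helper: splits +str+ into surfaces given the byte length
# of each morpheme, mirroring the shift/pack loop in the diff above.
def slice_surfaces(str, byte_lengths)
  bytes = str.bytes.to_a
  byte_lengths.map do |len|
    chunk = bytes.shift(len)                       # take the next len bytes
    chunk.pack('C*').force_encoding(str.encoding)  # re-tag the raw bytes
  end
end

# Byte lengths are assumed here (3 bytes per character in UTF-8).
surfaces = slice_surfaces('納豆に', [6, 3])
puts surfaces.inspect
```

Working on bytes rather than characters matters because MeCab reports node lengths in bytes, so character-based slicing would misalign on multibyte text.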
164
220
 
165
221
  # Parses the given string <tt>str</tt>. If a block is passed to this method,
@@ -167,47 +223,48 @@ module Natto
167
223
  #
168
224
  # @param [String] str
169
225
  # @return parsing result from <tt>mecab</tt>
170
- # @raise [MeCabError] if the <tt>mecab</tt> parser cannot parse the given string <tt>str</tt>
226
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
171
227
  # @see MeCabNode
172
228
  def parse(str)
173
229
  if block_given?
174
- m_node_ptr = @parse_tonode.call(str)
175
- head = Natto::MeCabNode.new(m_node_ptr)
176
- if head && head[:next].address != 0x0
177
- node = Natto::MeCabNode.new(head[:next])
178
- i = 0
179
- while node.nil? == false
180
- if node.length > 0
181
- node.surface = str.bytes.to_a()[i, node.length].pack('C*')
182
- end
183
- yield node
184
- if node[:next].address != 0x0
185
- i += node.length
186
- node = Natto::MeCabNode.new(node[:next])
187
- else
188
- break
189
- end
190
- end
191
- end
230
+ nodes = @parse_tonodes.call(str)
231
+ nodes.each {|n| yield n }
192
232
  else
193
- result = @parse_tostr.call(str)
194
- result.force_encoding(Encoding.default_external) if result.respond_to?(:encoding) && result.encoding!=Encoding.default_external
195
- result
233
+ self.class.force_enc(@parse_tostr.call(str))
196
234
  end
197
235
  end
198
236
 
199
- # Returns human-readable details for the wrapped <tt>mecab</tt> parser.
237
+ # Parses the given string <tt>str</tt>, and returns
238
+ # a list of <tt>mecab</tt> nodes.
239
+ # @param [String] str
240
+ # @return [Array] of parsed <tt>mecab</tt> nodes.
241
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
242
+ # @see MeCabNode
243
+ def readnodes(str)
244
+ @parse_tonodes.call(str)
245
+ end
246
+
247
+ # Parses the given string <tt>str</tt>, and returns
248
+ # a list of <tt>mecab</tt> result strings.
249
+ # @param [String] str
250
+ # @return [Array] of parsed <tt>mecab</tt> result strings.
251
+ # @raise [MeCabError] if the <tt>mecab</tt> tagger cannot parse the given string <tt>str</tt>
252
+ def readlines(str)
253
+ self.class.force_enc(@parse_tostr.call(str)).lines.to_a
254
+ end
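`force_enc` comes from `natto/utils`, which this diff does not show. Judging from the inline code it replaces in `parse`, it presumably behaves like the sketch below. The module name `EncSketch` is hypothetical and the real implementation may differ.

```ruby
# Hypothetical stand-in for the helper in natto/utils (not shown in this diff).
module EncSketch
  # Re-tags +str+ with Encoding.default_external on Rubies that support
  # encodings; returns +str+ unchanged otherwise (e.g. Ruby 1.8).
  def self.force_enc(str)
    if str.respond_to?(:encoding) && str.encoding != Encoding.default_external
      str.force_encoding(Encoding.default_external)
    end
    str
  end
end

# Simulated raw tagger output, tagged as binary the way FFI strings often are.
raw = "a\tx\nb\ty\nEOS\n".dup.force_encoding(Encoding::BINARY)
lines = EncSketch.force_enc(raw).lines.to_a   # same shape as readlines
puts lines.length
```

Each element keeps its trailing newline, so a caller chaining on the result may want `map { |l| l.chomp }`.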
255
+
256
+ # Returns human-readable details for the wrapped <tt>mecab</tt> tagger.
200
257
  # Overrides <tt>Object#to_s</tt>.
201
258
  #
202
259
  # - encoded object id
203
- # - FFI pointer to <tt>mecab</tt> object
260
+ # - underlying FFI pointer to the <tt>mecab</tt> tagger
204
261
  # - options hash
205
262
  # - list of dictionaries
206
263
  # - MeCab version
207
264
  #
208
- # @return [String] encoded object id, FFI pointer, options hash, list of dictionaries, and MeCab version
265
+ # @return [String] encoded object id, underlying FFI pointer, options hash, list of dictionaries, and MeCab version
209
266
  def to_s
210
- %(#{super.chop} @ptr=#{@ptr.to_s}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
267
+ %(#{super.chop} @tagger=#{@tagger}, @options=#{@options.inspect}, @dicts=#{@dicts.to_s}, @version="#{@version.to_s}">)
211
268
  end
212
269
 
213
270
  # Overrides <tt>Object#inspect</tt>.
@@ -284,10 +341,10 @@ module Natto
284
341
  end
285
342
 
286
343
  # Returns a string-representation of the options to
287
- # be passed in the construction of <tt>mecab</tt>.
344
+ # be passed in the construction of the <tt>mecab</tt> tagger.
288
345
  #
289
346
  # @param [Hash] options
290
- # @return [String] representation of the options to the <tt>mecab</tt> parser
347
+ # @return [String] representation of the options to the <tt>mecab</tt> tagger
291
348
  def self.build_options_str(options={})
292
349
  opt = []
293
350
  SUPPORTED_OPTS.values.each do |k|
@@ -352,7 +409,16 @@ module Natto
352
409
  #
353
410
  # puts sysdic.charset
354
411
  # => "utf8"
412
+ #
413
+ # puts sysdic.is_sysdic?
414
+ # => true
355
415
  class DictionaryInfo < MeCabStruct
416
+ # System dictionary.
417
+ SYS_DIC = 0
418
+ # User dictionary.
419
+ USR_DIC = 1
420
+ # Unknown dictionary.
421
+ UNK_DIC = 2
356
422
 
357
423
  layout :filename, :string,
358
424
  :charset, :string,
@@ -381,12 +447,13 @@ module Natto
381
447
  # Overrides <tt>Object#to_s</tt>.
382
448
  #
383
449
  # - encoded object id
450
+ # - dictionary type
384
451
  # - full-path dictionary filename
385
452
  # - dictionary charset
386
453
  #
387
- # @return [String] encoded object id, dictionary filename, and charset
454
+ # @return [String] encoded object id, type, dictionary filename, and charset
388
455
  def to_s
389
- %(#{super.chop} filename="#{self.filename}", charset="#{self.charset}">)
456
+ %(#{super.chop} type="#{self.type}", filename="#{self.filename}", charset="#{self.charset}">)
390
457
  end
391
458
 
392
459
  # Overrides <tt>Object#inspect</tt>.
@@ -396,6 +463,24 @@ module Natto
396
463
  def inspect
397
464
  self.to_s
398
465
  end
466
+
467
+ # Returns <tt>true</tt> if this is a system dictionary.
468
+ # @return [Boolean]
469
+ def is_sysdic?
470
+ self.type == SYS_DIC
471
+ end
472
+
473
+ # Returns <tt>true</tt> if this is a user dictionary.
474
+ # @return [Boolean]
475
+ def is_usrdic?
476
+ self.type == USR_DIC
477
+ end
478
+
479
+ # Returns <tt>true</tt> if this is an unknown dictionary type.
480
+ # @return [Boolean]
481
+ def is_unkdic?
482
+ self.type == UNK_DIC
483
+ end
399
484
  end
400
485
 
401
486
  # <tt>MeCabNode</tt> is a wrapper for the structure holding
@@ -430,61 +515,57 @@ module Natto
430
515
  # - :cost
431
516
  #
432
517
  # <h2>Usage</h2>
433
- # An instance of <tt>MeCabNode</tt> is yielded to a block
434
- # used with <tt>MeCab#parse</tt>. Each resulting node is
435
- # yielded to the block passed in, where the above-mentioned
436
- # node attributes may be accessed.
518
+ # An instance of <tt>MeCabNode</tt> is yielded to the block
519
+ # used with <tt>MeCab#parse</tt>, where the above-mentioned
520
+ # node attributes may be accessed by name.
437
521
  #
438
522
  # nm = Natto::MeCab.new
439
523
  #
440
- # nm.parse('めかぶの使い方がわからなくて困ってました。') do |n|
441
- # puts "#{n.surface}¥t#{n.cost}"
524
+ # nm.parse('卓球なんて死ぬまでの暇つぶしだよ。') do |n|
525
+ # puts "#{n.surface}\t#{n.cost}" if n.is_nor?
442
526
  # end
443
- #
444
- # め 7961
445
- # かぶ 19303
446
- # の 25995
447
- # 使い方 29182
448
- # が 28327
449
- # わから 33625
450
- # なく 34256
451
- # て 36454
452
- # 困っ 43797
453
- # て 42178
454
- # まし 46708
455
- # た 46111
456
- # 。 42677
457
- # 41141
458
- # => nil
527
+ # 卓球 2874
528
+ # なんて 4398
529
+ # 死ぬ 9261
530
+ # まで 9386
531
+ # の 10007
532
+ # 暇つぶし 13324
533
+ # だ 15346
534
+ # よ 14396
535
+ # 。 10194
459
536
  #
460
537
  # It is also possible to use the <tt>Symbol</tt> for the
461
538
  # <tt>mecab</tt> node member to index into the
462
539
  # <tt>FFI::Struct</tt> layout associative array like so:
463
540
  #
464
- # nm.parse('納豆に乗っけて頂きます!') {|n| puts n[:feature] }
465
- #
466
- # 名詞,一般,*,*,*,*,納豆,ナットウ,ナットー
467
- # 助詞,格助詞,一般,*,*,*,に,ニ,ニ
468
- # 動詞,自立,*,*,一段,連用形,乗っける,ノッケ,ノッケ
469
- # 助詞,接続助詞,*,*,*,*,て,テ,テ
470
- # 動詞,非自立,*,*,五段・カ行イ音便,連用形,頂く,イタダキ,イタダキ
471
- # 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
472
- # 記号,一般,*,*,*,*,!,!,!
541
+ # nm.parse('あいつ笑うと結構可愛い顔してんよ。') {|n| puts n[:feature] }
542
+ # 名詞,代名詞,一般,*,*,*,あいつ,アイツ,アイツ
543
+ # 動詞,自立,*,*,五段・ワ行促音便,基本形,笑う,ワラウ,ワラウ
544
+ # 助詞,接続助詞,*,*,*,*,と,ト,ト
545
+ # 副詞,一般,*,*,*,*,結構,ケッコウ,ケッコー
546
+ # 形容詞,自立,*,*,形容詞・イ段,基本形,可愛い,カワイイ,カワイイ
547
+ # 名詞,一般,*,*,*,*,顔,カオ,カオ
548
+ # 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
549
+ # 動詞,非自立,*,*,一段,体言接続特殊,てる,テン,テン
550
+ # 助詞,終助詞,*,*,*,*,よ,ヨ,ヨ
551
+ # 記号,句点,*,*,*,*,。,。,。
473
552
  # BOS/EOS,*,*,*,*,*,*,*,*
474
- # => nil
475
553
  #
476
554
  class MeCabNode < MeCabStruct
555
+ include Natto::Utils
556
+
477
557
  attr_accessor :surface, :feature
558
+ attr_reader :pointer
478
559
 
479
- # Normal <tt>mecab</tt> node.
560
+ # Normal <tt>mecab</tt> node defined in the dictionary.
480
561
  NOR_NODE = 0
481
- # Unknown <tt>mecab</tt> node.
562
+ # Unknown <tt>mecab</tt> node not defined in the dictionary.
482
563
  UNK_NODE = 1
483
- # Beginning-of-string <tt>mecab</tt> node.
564
+ # Virtual node representing the beginning of the sentence.
484
565
  BOS_NODE = 2
485
- # End-of-string <tt>mecab</tt> node.
566
+ # Virtual node representing the end of the sentence.
486
567
  EOS_NODE = 3
487
- # End-of-NBest <tt>mecab</tt> node list.
568
+ # Virtual node representing the end of an N-Best <tt>mecab</tt> node list.
488
569
  EON_NODE = 4
489
570
 
490
571
  layout :prev, :pointer,
@@ -525,39 +606,30 @@ module Natto
525
606
  end
526
607
 
527
608
  # Initializes this node instance.
528
- # Sets the <ttMeCab</tt> feature value for this node.
609
+ # Sets the <tt>MeCab</tt> feature value for this node.
529
610
  #
530
611
  # @param [FFI::Pointer]
531
612
  def initialize(ptr)
532
613
  super(ptr)
614
+ @pointer = ptr
533
615
 
534
616
  if self[:feature]
535
- @feature = self[:feature]
536
- @feature.force_encoding(Encoding.default_external) if @feature.respond_to?(:encoding) && @feature.encoding!=Encoding.default_external
617
+ @feature = self.class.force_enc(self[:feature])
537
618
  end
538
619
  end
539
620
 
540
- # Sets the morpheme surface value for this node.
541
- #
542
- # @param [String]
543
- def surface=(str)
544
- if str && self[:length] > 0
545
- @surface = str
546
- @surface.force_encoding(Encoding.default_external) if @surface.respond_to?(:encoding) && @surface.encoding!=Encoding.default_external
547
- end
548
- end
549
-
550
621
  # Returns human-readable details for the <tt>mecab</tt> node.
551
622
  # Overrides <tt>Object#to_s</tt>.
552
623
  #
553
624
  # - encoded object id
554
- # - stat
625
+ # - underlying FFI pointer to MeCab Node
626
+ # - stat (node type: NOR, UNK, BOS/EOS, EON)
555
627
  # - surface
556
628
  # - feature
557
629
  #
558
- # @return [String] encoded object id, stat, surface, and feature
630
+ # @return [String] encoded object id, underlying FFI pointer, stat, surface, and feature
559
631
  def to_s
560
- %(#{super.chop} stat=#{self[:stat]}, surface="#{self.surface}", feature="#{self.feature}">)
632
+ %(#{super.chop} @pointer=#{@pointer}, stat=#{self[:stat]}, @surface="#{self.surface}", @feature="#{self.feature}">)
561
633
  end
562
634
 
563
635
  # Overrides <tt>Object#inspect</tt>.
@@ -567,5 +639,35 @@ module Natto
567
639
  def inspect
568
640
  self.to_s
569
641
  end
642
+
643
+ # Returns <tt>true</tt> if this is a normal <tt>mecab</tt> node found in the dictionary.
644
+ # @return [Boolean]
645
+ def is_nor?
646
+ self.stat == NOR_NODE
647
+ end
648
+
649
+ # Returns <tt>true</tt> if this is an unknown <tt>mecab</tt> node not found in the dictionary.
650
+ # @return [Boolean]
651
+ def is_unk?
652
+ self.stat == UNK_NODE
653
+ end
654
+
655
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the beginning of the sentence.
656
+ # @return [Boolean]
657
+ def is_bos?
658
+ self.stat == BOS_NODE
659
+ end
660
+
661
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the sentence.
662
+ # @return [Boolean]
663
+ def is_eos?
664
+ self.stat == EOS_NODE
665
+ end
666
+
667
+ # Returns <tt>true</tt> if this is a virtual <tt>mecab</tt> node representing the end of the node list.
668
+ # @return [Boolean]
669
+ def is_eon?
670
+ self.stat == EON_NODE
671
+ end
570
672
  end
571
673
  end
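With the status predicates and `readnodes`, a node list can be filtered using ordinary Enumerable calls. The sketch below uses a hypothetical `SketchNode` struct in place of `Natto::MeCabNode` so that it runs without MeCab. The stat values match the constants in the diff, while the node data is illustrative.

```ruby
# coding: utf-8

# Stat values as defined on Natto::MeCabNode in the diff above.
module NodeStat
  NOR, UNK, BOS, EOS, EON = 0, 1, 2, 3, 4
end

# Hypothetical stand-in for Natto::MeCabNode; only stat and surface are modelled.
SketchNode = Struct.new(:stat, :surface) do
  def is_nor?; stat == NodeStat::NOR; end
  def is_eos?; stat == NodeStat::EOS; end
end

# Roughly what readnodes might return for a short input (illustrative data).
nodes = [
  SketchNode.new(NodeStat::NOR, '納豆'),
  SketchNode.new(NodeStat::NOR, 'に'),
  SketchNode.new(NodeStat::EOS, nil)
]

# Keep only dictionary-backed nodes, then collect their surfaces.
surfaces = nodes.select { |n| n.is_nor? }.map { |n| n.surface }
puts surfaces.join(' ')
```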