natto 0.9.8 → 0.9.9

data/CHANGELOG CHANGED
@@ -1,5 +1,11 @@
1
1
  ## CHANGELOG
2
2
 
3
+ - __2015/03/31__: 0.9.9 release.
4
+ - Issue 21/34: Implemented boundary constraint parsing.
5
+ - Issue 26: Removing deprecated methods parse_as_nodes, parse_as_strings, readnodes and readlines.
6
+ - Issue 32: Fixed partial parsing.
7
+ - Updated documentation for new MeCab home at http://taku910.github.io/mecab/.
8
+
3
9
  - __2015/02/10__: 0.9.8 release.
4
10
  - Migrated natto code home from Bitbucket to GitHub.
5
11
  - Improved documentation following said migration.
@@ -7,16 +13,10 @@
7
13
  - Updating LICENSE for year 2015.
8
14
 
9
15
  - __2014/12/20__: 0.9.7 release.
10
- - Issue 14: [adding automatic discovery for mecab library; no need to
11
- explicitly set
12
- MECAB_PATH!](https://bitbucket.org/buruzaemon/natto/issue/14/automatic-discovery-of-libmecab-path-and)
13
- - Issue 15: [refactored node-parsing to use Enumerator instead of
14
- materializing every node and stuffing into
15
- array](https://bitbucket.org/buruzaemon/natto/issue/15/use-enumerator-when-parsing-mecab-nodes)
16
- - Issue 17: [adding filepath to MeCab and
17
- DictionaryInfo](https://bitbucket.org/buruzaemon/natto/issue/17/use-filerealpath-value-for-all-file-paths)
18
- - Issue 18: [bug-fix for node-formatting during default node
19
- parse](https://bitbucket.org/buruzaemon/natto/issue/18/no-node-formatting-when-using-default-node)
16
+ - Issue 14: adding automatic discovery for mecab library; no need to explicitly set MECAB_PATH
17
+ - Issue 15: refactored node-parsing to use Enumerator instead of materializing every node and stuffing into array
18
+ - Issue 17: adding filepath to MeCab and DictionaryInfo
19
+ - Issue 18: bug-fix for node-formatting during default node parse
20
20
  - Deprecating parse_as_nodes and parse_as_strings; please use parse instead!
21
21
  - CAUTION: parse_as_nodes, parse_as_strings, readnodes and readlines will be removed in the following release!
22
22
  - Enhancements to to_s methods for both MeCab and DictionaryInfo
@@ -36,8 +36,8 @@
36
36
  - Removing automatic library load for Cygwin platform (does not compile)
37
37
 
38
38
  - __2012/09/16__: 0.9.5 release.
39
- - Fixed [Issue 9: trimされていない文字列のparse](https://bitbucket.org/buruzaemon/natto/issue/9/trim-parse)
40
- - Fixed [Issue 10: BUG Segmentation Fault](https://bitbucket.org/buruzaemon/natto/issue/10/bug-segmentation-fault)
39
+ - Fixed Issue 9: trimされていない文字列のparse
40
+ - Fixed Issue 10: BUG Segmentation Fault
41
41
  - Adding parse_as_nodes to allow for method-chaining on list of parsed nodes
42
42
  - Adding parse_as_strings to allow for method-chaining on list of string output
43
43
  - Deprecating both readnodes and readlines (badly named methods, see parse_as_nodes and parse_as_strings, respectively)
data/README.md CHANGED
@@ -1,16 +1,16 @@
1
- # natto
1
+ # natto [![Gem Version](https://badge.fury.io/rb/natto.svg)](http://badge.fury.io/rb/natto) [![Build Status](https://travis-ci.org/buruzaemon/natto.svg?branch=master)](https://travis-ci.org/buruzaemon/natto)
2
2
  A Tasty Ruby Binding with MeCab
3
3
 
4
4
  ## What is natto?
5
5
  A gem leveraging FFI (foreign function interface), natto combines the
6
6
  [Ruby programming language](http://www.ruby-lang.org/) with
7
- [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech
7
+ [MeCab](http://taku910.github.io/mecab/), the part-of-speech
8
8
  and morphological analyzer for the Japanese language.
9
9
 
10
- - No compiler is necessary, as natto is _not_ a C extension.
11
- - It will run on CRuby (mri/yarv) and JRuby (jvm) equally well.
12
- - It will work with MeCab installations on Windows, Unix/Linux or OS X.
13
10
  - natto provides a naturally Ruby-esque interface to MeCab.
11
+ - It runs on both CRuby (mri/yarv) and JRuby (jvm).
12
+ - It works with MeCab installations on Windows, Unix/Linux and OS X.
13
+ - No compiler is necessary, as natto is _not_ a C extension.
14
14
 
15
15
  You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
16
16
 
@@ -18,8 +18,8 @@ You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
18
18
  ## Requirements
19
19
  natto requires the following:
20
20
 
21
- - [MeCab _0.996_](http://code.google.com/p/mecab/downloads/list)
22
- - A system dictionary, like [mecab-ipadic](https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz) or [mecab-jumandic](https://mecab.googlecode.com/files/mecab-jumandic-5.1-20070304.tar.gz)
21
+ - [MeCab _0.996_](http://taku910.github.io/mecab/#download)
22
+ - A system dictionary, like mecab-ipadic (recommended), mecab-jumandic or unidic, all available on the [MeCab downloads page](http://taku910.github.io/mecab/#download)
23
23
  - `libmecab-devel` if you are on Linux, since natto uses `mecab-config`
24
24
  - Ruby _1.9 or greater_
25
25
  - [ffi _1.9.0 or greater_](http://rubygems.org/gems/ffi)
@@ -29,7 +29,7 @@ Install natto with the following gem command:
29
29
 
30
30
  gem install natto
31
31
 
32
- This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the `mecab` library.
32
+ This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the MeCab library.
33
33
 
34
34
  ## Installation on Windows
35
35
  However, if you are using a CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), a MSYS/MinGW based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including ffi.
@@ -45,40 +45,38 @@ However, if you are using a CRuby on Windows, then you will first need to instal
45
45
  6. If you are on a 64-bit Windows and you use a 64-bit Ruby or JRuby, then you might want to [build a 64-bit version of libmecab.dll](https://github.com/buruzaemon/natto/wiki/64-Bit-Windows).
46
46
 
47
47
 
48
- ## Configuration
49
- - ***No explicit configuration should be necessary, as natto will try to locate the `mecab` library based upon its runtime environment.***
50
- - On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
51
- - On OS X and \*nix, it will query `mecab-config --libs`
52
- - ***But if natto cannot find the `mecab` library, `LoadError` will be raised.***
53
- - Please set the `MECAB_PATH` environment variable to the exact name/path to your `mecab` library.
54
- - e.g., for OS X
48
+ ## Automatic Configuration
49
+ No explicit configuration should be necessary, as natto will try to locate the MeCab library based upon its runtime environment.
50
+ - On OS X and \*nix, it will query `mecab-config --libs`
51
+ - On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
52
+
53
+ ## Explicit configuration via `MECAB_PATH` and `MECAB_CHARSET`
54
+ If natto cannot find the MeCab library, `LoadError` will be raised. Please set the `MECAB_PATH` environment variable to the exact name/path to your MeCab library.
55
55
 
56
- export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
56
+ - e.g., for OS X
57
57
 
58
- - e.g., for bash on UNIX/Linux
58
+ export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
59
59
 
60
- export MECAB_PATH=/usr/local/lib/libmecab.so
60
+ - e.g., for bash on UNIX/Linux
61
61
 
62
- - e.g., on Windows
62
+ export MECAB_PATH=/usr/local/lib/libmecab.so
63
63
 
64
- set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
64
+ - e.g., on Windows
65
65
 
66
- - e.g., from within a Ruby program
66
+ set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
67
67
 
68
- ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
68
+ - e.g., from within a Ruby program
69
+
70
+ ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
69
71
 
70
72
  ## Usage
71
73
 
74
+ Here's a very quick guide to using natto.
75
+
76
+ Instantiate a reference to the MeCab library, and display some details:
72
77
 
73
- # Quick Start
74
- # -----------
75
- #
76
- # No explicit configuration should be necessary!
77
- #
78
78
  require 'natto'
79
79
 
80
- # first, create an instance of Natto::MeCab
81
- #
82
80
  nm = Natto::MeCab.new
83
81
  => #<Natto::MeCab:0x28d30748
84
82
  @tagger=#<FFI::Pointer address=0x28a97d50>, \
@@ -90,32 +88,27 @@ However, if you are using a CRuby on Windows, then you will first need to instal
90
88
  type=0>] \
91
89
  @version=0.996>
92
90
 
93
- # display MeCab version
94
- #
95
91
  puts nm.version
96
92
  => 0.996
97
93
 
98
- # display full pathname to MeCab library
99
- #
94
+ ----
95
+
96
+ Display details about the system dictionary used by MeCab:
97
+
100
98
  puts nm.libpath
101
99
  => /usr/local/lib/libmecab.so
102
100
 
103
- # reference to MeCab system dictionary
104
- #
105
101
  sysdic = nm.dicts.first
106
-
107
- # display full pathname to system dictionary file
108
- #
109
102
  puts sysdic.filepath
110
103
  => /usr/local/lib/mecab/dic/ipadic/sys.dic
111
104
 
112
- # what charset (encoding) is the system dictionary?
113
- #
114
105
  puts sysdic.charset
115
106
  => utf8
116
-
117
- # parse text and send output to stdout
118
- #
107
+
108
+ ----
109
+
110
+ Parse Japanese text and send the MeCab result as a single string to stdout:
111
+
119
112
  puts nm.parse('俺の名前は星野豊だ!!そこんとこヨロシク!')
120
113
  俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
121
114
  の 助詞,連体化,*,*,*,*,の,ノ,ノ
@@ -133,12 +126,22 @@ However, if you are using a CRuby on Windows, then you will first need to instal
133
126
  ! 記号,一般,*,*,*,*,!,!,!
134
127
  EOS
135
128
 
136
- # parse more text and use a block to:
137
- # - iterate over the resulting MeCabNode instances
138
- # - output morpheme surface and part-of-speech ID
139
- #
140
- # * ignore any end-of-sentence nodes
141
- #
129
+ ----
130
+
131
+ If a block is passed to `parse`, you can iterate over the list of resulting `MeCabNode`
132
+ instances to access more detailed information about each morpheme.
133
+
134
+ In this example, the following attributes and methods for `MeCabNode` are used:
135
+
136
+ - `surface` - the morpheme surface
137
+ - `posid` - node part-of-speech ID (dictionary-dependent)
138
+ - `is_eos?` - is this `MeCabNode` an end-of-sentence node?
139
+
140
+ This iterates over the morpheme nodes in the given text,
141
+ and outputs a formatted, tab-delimited line with the
142
+ morpheme surface and part-of-speech ID, ignoring any end-of-sentence
143
+ nodes:
144
+
142
145
  nm.parse('世界チャンプ目指してんだなこれがっ!!夢なの、俺のっ!!') do |n|
143
146
  puts "#{n.surface}\tpart-of-speech id: #{n.posid}" if !n.is_eos?
144
147
  end
@@ -160,37 +163,46 @@ However, if you are using a CRuby on Windows, then you will first need to instal
160
163
  のっ part-of-speech id: 31
161
164
  !! part-of-speech id: 36
162
165
 
163
- # for more complex parsing, such as that for natural
164
- # language processing tasks, it is far more efficient
165
- # to iterate over MeCab nodes using an Enumerator
166
- #
167
- # this example uses the -F node-format option to customize
168
- # the resulting MeCabNode feature attribute to extract:
169
- # - %m ... surface
170
- # - %f[0] ... part-of-speech
171
- # - %f[7] ... reading
172
- #
173
- nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
166
+ ----
167
+
168
+ For more complex parsing, such as that for natural language
169
+ processing tasks, it is far more efficient to use `enum_parse` to
170
+ obtain an [`Enumerator`](http://ruby-doc.org/core-2.2.0/Enumerator.html)
171
+ to iterate over the resulting `MeCabNode` instances. An `Enumerator`
172
+ yields each `MeCabNode` instance on demand, without first materializing all
173
+ instances in memory at once.
174
+
175
+ This example uses the `-F` node-format option to customize
176
+ the resulting `MeCabNode` feature attribute to extract:
177
+
178
+ - `%m` - morpheme surface
179
+ - `%f[0]` - node part-of-speech
180
+ - `%f[7]` - reading
181
+
182
+ Note that we can advance the `Enumerator` with `next`, look ahead with `peek`, rewind it
183
+ back to the beginning, and then iterate over it.
174
184
 
185
+ nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
186
+
175
187
  enum = nm.enum_parse('この星の一等賞になりたいの卓球で俺は、そんだけ!')
176
188
  => #<Enumerator: #<Enumerator::Generator:0x00000002ff3898>:each>
177
-
189
+
178
190
  enum.next
179
191
  => #<Natto::MeCabNode:0x000000032eed68 \
180
192
  @pointer=#<FFI::Pointer address=0x000000005ffb48>, \
181
193
  stat=0, \
182
194
  @surface="この", \
183
195
  @feature="この 連体詞 コノ">
184
-
196
+
185
197
  enum.peek
186
198
  => #<Natto::MeCabNode:0x00000002fe2110a \
187
199
  @pointer=#<FFI::Pointer address=0x000000005ffdb8>, \
188
200
  stat=0, \
189
201
  @surface="星", \
190
202
  @feature="星 名詞 ホシ">
191
-
203
+
192
204
  enum.rewind
193
-
205
+
194
206
  # again, ignore any end-of-sentence nodes
195
207
  enum.each { |n| puts n.feature if !n.is_eos? }
196
208
  この 連体詞 コノ
@@ -211,7 +223,52 @@ However, if you are using a CRuby on Windows, then you will first need to instal
211
223
  だけ 助詞 ダケ
212
224
  ! 記号 !
213
225
 
214
-
226
+ ----
227
+
228
+ [Partial parsing](http://taku910.github.io/mecab/partial.html) allows you to
229
+ pass hints to MeCab on how to tokenize morphemes when parsing. With boundary
230
+ constraint parsing, you can specify either
231
+ a [Regexp](http://ruby-doc.org/core-2.2.1/Regexp.html) or
232
+ [String](http://ruby-doc.org/core-2.2.1/String.html) to tell MeCab where the
233
+ boundaries of a morpheme should be. Use the new `boundary_constraints` keyword.
234
+ For hints on tokenization, please see
235
+ [String#scan](http://ruby-doc.org/core-2.2.1/String.html#method-i-scan).
236
+
237
+ This example uses the `-F` node-format option to customize
238
+ the resulting `MeCabNode` feature attribute to extract:
239
+
240
+ - `%m` - morpheme surface
241
+ - `%f[0]` - node part-of-speech
242
+ - `%s` - node `stat` status value, 1 is `unknown`
243
+
244
+ Note that any morpheme captured by a boundary constraint will have a node `stat` value of 1 (unknown).
245
+
246
+ nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
247
+
248
+ text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
249
+ pattern = /ヒーロー見参/
250
+
251
+ nm.enum_parse(text, boundary_constraints: pattern).each do |n|
252
+ puts n.feature if !(n.is_bos? || n.is_eos?)
253
+ end
254
+
255
+ # desired morpheme boundary specified with Regexp /ヒーロー見参/
256
+ 心, 名詞, 0
257
+ の, 助詞, 0
258
+ 中, 名詞, 0
259
+ で, 助詞, 0
260
+ 3, 名詞, 1
261
+ 回, 名詞, 0
262
+ 唱え, 動詞, 0
263
+ 、, 記号, 0
264
+ ヒーロー見参, 名詞, 1
265
+ !, 記号, 0
266
+ ヒーロー見参, 名詞, 1
267
+ !, 記号, 0
268
+ ヒーロー見参, 名詞, 1
269
+ !, 記号, 0
270
+
271
+
215
272
 
216
273
  ## Learn more
217
274
  - You can read more about natto on the [project Wiki](https://github.com/buruzaemon/natto/wiki).
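The boundary constraint parsing described in the README changes above depends on splitting the input text into segments that match the given Regexp or String and segments that do not. The following is a minimal sketch of that idea using `String#scan`. The helper name `split_on_constraints` and the `[segment, constrained?]` pair layout are assumptions made for illustration, and this is not necessarily how natto itself tokenizes.

```ruby
# Hypothetical helper, not part of natto's public API.
# Splits text into [segment, constrained?] pairs, where constrained? is true
# for segments matching the pattern (each one a desired single morpheme).
def split_on_constraints(text, pattern)
  # A String constraint is treated as a literal, so escape it first.
  pattern = Regexp.new(Regexp.escape(pattern)) if pattern.is_a?(String)

  tokens = []
  pos = 0
  text.scan(pattern) do
    m = Regexp.last_match
    next if m[0].empty?                      # skip zero-width matches
    before = text[pos...m.begin(0)]
    tokens << [before, false] unless before.empty?
    tokens << [m[0], true]
    pos = m.end(0)
  end

  rest = text[pos..-1]
  tokens << [rest, false] unless rest.empty?
  tokens
end

tokens = split_on_constraints('唱え、ヒーロー見参!', /ヒーロー見参/)
p tokens
# => [["唱え、", false], ["ヒーロー見参", true], ["!", false]]
```

Joining the first element of every pair reproduces the original text, so no characters are lost by the split.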
data/lib/natto/binding.rb CHANGED
@@ -66,12 +66,7 @@ module Natto
66
66
 
67
67
  ffi_lib find_library
68
68
 
69
- # new interface
70
- attach_function :mecab_model_new2, [:string], :pointer
71
- attach_function :mecab_model_destroy, [:pointer], :void
72
- attach_function :mecab_model_dictionary_info, [:pointer], :pointer
73
-
74
- # old interface
69
+ # C interface
75
70
  attach_function :mecab_new2, [:string], :pointer
76
71
  attach_function :mecab_version, [], :string
77
72
  attach_function :mecab_strerror, [:pointer],:string
@@ -88,6 +83,26 @@ module Natto
88
83
  attach_function :mecab_format_node, [:pointer, :pointer], :string
89
84
  attach_function :mecab_dictionary_info, [:pointer], :pointer
90
85
 
86
+ attach_function :mecab_lattice_new, [], :pointer
87
+ attach_function :mecab_lattice_destroy, [:pointer], :void
88
+ attach_function :mecab_lattice_clear, [:pointer], :void
89
+ attach_function :mecab_lattice_is_available, [:pointer], :int
90
+ attach_function :mecab_lattice_get_bos_node, [:pointer], :pointer
91
+ attach_function :mecab_lattice_set_sentence, [:pointer, :string], :void
92
+ attach_function :mecab_lattice_get_size, [:pointer], :int
93
+ attach_function :mecab_lattice_set_z, [:pointer, :float], :void
94
+ attach_function :mecab_lattice_set_theta, [:pointer, :float], :void
95
+ attach_function :mecab_lattice_next, [:pointer], :int
96
+ attach_function :mecab_lattice_get_request_type, [:pointer], :int
97
+ attach_function :mecab_lattice_add_request_type, [:pointer, :int], :void
98
+ attach_function :mecab_lattice_set_request_type, [:pointer, :int], :void
99
+ attach_function :mecab_lattice_tostr, [:pointer], :string
100
+ attach_function :mecab_lattice_nbest_tostr, [:pointer, :int], :string
101
+ attach_function :mecab_lattice_get_boundary_constraint, [:pointer, :int], :int
102
+ attach_function :mecab_lattice_set_boundary_constraint, [:pointer, :int, :int], :void
103
+ attach_function :mecab_parse_lattice, [:pointer, :pointer], :int
104
+ attach_function :mecab_lattice_strerror, [:pointer], :string
105
+
91
106
  # @private
92
107
  module ClassMethods
93
108
 
@@ -95,18 +110,6 @@ module Natto
95
110
  Natto::Binding.find_library
96
111
  end
97
112
 
98
- def mecab_model_new2(options_str)
99
- Natto::Binding.mecab_model_new2(options_str)
100
- end
101
-
102
- def mecab_model_destroy(m_ptr)
103
- Natto::Binding.mecab_model_destroy(m_ptr)
104
- end
105
-
106
- def mecab_model_dictionary_info(m_ptr)
107
- Natto::Binding.mecab_model_dictionary_info(m_ptr)
108
- end
109
-
110
113
  # ----------------------------------------
111
114
  def mecab_new2(options_str)
112
115
  Natto::Binding.mecab_new2(options_str)
@@ -116,60 +119,136 @@ module Natto
116
119
  Natto::Binding.mecab_version
117
120
  end
118
121
 
119
- def mecab_strerror(m_ptr)
120
- Natto::Binding.mecab_strerror(m_ptr)
122
+ def mecab_strerror(tptr)
123
+ Natto::Binding.mecab_strerror(tptr)
121
124
  end
122
125
 
123
- def mecab_destroy(m_ptr)
124
- Natto::Binding.mecab_destroy(m_ptr)
126
+ def mecab_destroy(tptr)
127
+ Natto::Binding.mecab_destroy(tptr)
125
128
  end
126
129
 
127
- def mecab_set_partial(m_ptr, ll)
128
- Natto::Binding.mecab_set_partial(m_ptr, ll)
130
+ def mecab_set_partial(tptr, ll)
131
+ Natto::Binding.mecab_set_partial(tptr, ll)
129
132
  end
130
133
 
131
- def mecab_set_theta(m_ptr, t)
132
- Natto::Binding.mecab_set_theta(m_ptr, t)
134
+ def mecab_set_theta(tptr, t)
135
+ Natto::Binding.mecab_set_theta(tptr, t)
133
136
  end
134
137
 
135
- def mecab_set_lattice_level(m_ptr, ll)
136
- Natto::Binding.mecab_set_lattice_level(m_ptr, ll)
138
+ def mecab_set_lattice_level(tptr, ll)
139
+ Natto::Binding.mecab_set_lattice_level(tptr, ll)
137
140
  end
138
141
 
139
- def mecab_set_all_morphs(m_ptr, am)
140
- Natto::Binding.mecab_set_all_morphs(m_ptr, am)
142
+ def mecab_set_all_morphs(tptr, am)
143
+ Natto::Binding.mecab_set_all_morphs(tptr, am)
141
144
  end
142
145
 
143
- def mecab_sparse_tostr(m_ptr, str)
144
- Natto::Binding.mecab_sparse_tostr(m_ptr, str)
146
+ def mecab_sparse_tostr(tptr, str)
147
+ Natto::Binding.mecab_sparse_tostr(tptr, str)
145
148
  end
146
149
 
147
- def mecab_sparse_tonode(m_ptr, str)
148
- Natto::Binding.mecab_sparse_tonode(m_ptr, str)
150
+ def mecab_sparse_tonode(tptr, str)
151
+ Natto::Binding.mecab_sparse_tonode(tptr, str)
149
152
  end
150
153
 
151
- def mecab_nbest_next_tonode(m_ptr)
152
- Natto::Binding.mecab_nbest_next_tonode(m_ptr)
154
+ def mecab_nbest_next_tonode(tptr)
155
+ Natto::Binding.mecab_nbest_next_tonode(tptr)
153
156
  end
154
157
 
155
- def mecab_nbest_init(m_ptr, str)
156
- Natto::Binding.mecab_nbest_init(m_ptr, str)
158
+ def mecab_nbest_init(tptr, str)
159
+ Natto::Binding.mecab_nbest_init(tptr, str)
157
160
  end
158
161
 
159
- def mecab_nbest_sparse_tostr(m_ptr, n, str)
160
- Natto::Binding.mecab_nbest_sparse_tostr(m_ptr, n, str)
162
+ def mecab_nbest_sparse_tostr(tptr, n, str)
163
+ Natto::Binding.mecab_nbest_sparse_tostr(tptr, n, str)
161
164
  end
162
165
 
163
- def mecab_nbest_next_tonode(m_ptr)
164
- Natto::Binding.mecab_nbest_next_tonode(m_ptr)
166
+ def mecab_nbest_next_tonode(tptr)
167
+ Natto::Binding.mecab_nbest_next_tonode(tptr)
165
168
  end
166
169
 
167
- def mecab_format_node(m_ptr, n_ptr)
168
- Natto::Binding.mecab_format_node(m_ptr, n_ptr)
170
+ def mecab_format_node(tptr, nptr)
171
+ Natto::Binding.mecab_format_node(tptr, nptr)
172
+ end
173
+
174
+ def mecab_dictionary_info(tptr)
175
+ Natto::Binding.mecab_dictionary_info(tptr)
169
176
  end
170
177
 
171
- def mecab_dictionary_info(m_ptr)
172
- Natto::Binding.mecab_dictionary_info(m_ptr)
178
+ def mecab_lattice_new()
179
+ Natto::Binding.mecab_lattice_new()
180
+ end
181
+
182
+ def mecab_lattice_destroy(lptr)
183
+ Natto::Binding.mecab_lattice_destroy(lptr)
184
+ end
185
+
186
+ def mecab_lattice_clear(lptr)
187
+ Natto::Binding.mecab_lattice_clear(lptr)
188
+ end
189
+
190
+ def mecab_lattice_is_available(lptr)
191
+ Natto::Binding.mecab_lattice_is_available(lptr)
192
+ end
193
+
194
+ def mecab_lattice_get_bos_node(lptr)
195
+ Natto::Binding.mecab_lattice_get_bos_node(lptr)
196
+ end
197
+
198
+ def mecab_lattice_set_sentence(lptr, str)
199
+ Natto::Binding.mecab_lattice_set_sentence(lptr, str)
200
+ end
201
+
202
+ def mecab_lattice_get_size(lptr)
203
+ Natto::Binding.mecab_lattice_get_size(lptr)
204
+ end
205
+
206
+ def mecab_lattice_set_z(lptr, z)
207
+ Natto::Binding.mecab_lattice_set_z(lptr, z)
208
+ end
209
+
210
+ def mecab_lattice_set_theta(lptr, t)
211
+ Natto::Binding.mecab_lattice_set_theta(lptr, t)
212
+ end
213
+
214
+ def mecab_lattice_next(lptr)
215
+ Natto::Binding.mecab_lattice_next(lptr)
216
+ end
217
+
218
+ def mecab_lattice_get_request_type(lptr)
219
+ Natto::Binding.mecab_lattice_get_request_type(lptr)
220
+ end
221
+
222
+ def mecab_lattice_add_request_type(lptr, rtype)
223
+ Natto::Binding.mecab_lattice_add_request_type(lptr, rtype)
224
+ end
225
+
226
+ def mecab_lattice_set_request_type(lptr, rtype)
227
+ Natto::Binding.mecab_lattice_set_request_type(lptr, rtype)
228
+ end
229
+
230
+ def mecab_lattice_tostr(lptr)
231
+ Natto::Binding.mecab_lattice_tostr(lptr)
232
+ end
233
+
234
+ def mecab_lattice_nbest_tostr(lptr, n)
235
+ Natto::Binding.mecab_lattice_nbest_tostr(lptr, n)
236
+ end
237
+
238
+ def mecab_lattice_get_boundary_constraint(lptr, pos)
239
+ Natto::Binding.mecab_lattice_get_boundary_constraint(lptr, pos)
240
+ end
241
+
242
+ def mecab_lattice_set_boundary_constraint(lptr, pos, btype)
243
+ Natto::Binding.mecab_lattice_set_boundary_constraint(lptr, pos, btype)
244
+ end
245
+
246
+ def mecab_parse_lattice(tptr, lptr)
247
+ Natto::Binding.mecab_parse_lattice(tptr, lptr)
248
+ end
249
+
250
+ def mecab_lattice_strerror(lptr)
251
+ Natto::Binding.mecab_lattice_strerror(lptr)
173
252
  end
174
253
  end
175
254
  end
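The `mecab_lattice_set_boundary_constraint(lptr, pos, btype)` binding above takes a position and a boundary type. As a rough sketch of how those arguments could be derived, the snippet below turns a list of `[segment, constrained?]` pairs into one boundary mark per byte. It assumes the constraint values 0, 1 and 2 that natto.rb defines in this release, and it assumes positions are byte offsets, as suggested by the `token.first.bytes.count` call in natto.rb. The helper name `boundary_marks` is hypothetical.

```ruby
# Values as defined in natto.rb for this release.
ANY_BOUNDARY   = 0
TOKEN_BOUNDARY = 1
INSIDE_TOKEN   = 2

# Hypothetical helper: returns an array with one boundary mark per byte.
# The first byte of every segment starts a token; the remaining bytes are
# marked INSIDE_TOKEN for constrained segments (keep as one morpheme) and
# ANY_BOUNDARY otherwise (MeCab is free to split there).
def boundary_marks(tokens)
  marks = []
  tokens.each do |segment, constrained|
    nbytes = segment.bytesize
    next if nbytes.zero?
    marks << TOKEN_BOUNDARY
    fill = constrained ? INSIDE_TOKEN : ANY_BOUNDARY
    (nbytes - 1).times { marks << fill }
  end
  marks
end

# "ab" is 2 bytes; "星" is 3 bytes in UTF-8.
p boundary_marks([['ab', false], ['星', true]])
# => [1, 0, 1, 2, 2]
```

Each index and mark in the resulting array would then correspond to one `(pos, btype)` pair passed to the lattice binding.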
data/lib/natto/natto.rb CHANGED
@@ -4,8 +4,8 @@ require 'natto/option_parse'
4
4
  require 'natto/struct'
5
5
 
6
6
  module Natto
7
- # `MeCab` is a wrapper class for the `mecab` tagger.
8
- # Options to the `mecab` tagger are passed in as a string
7
+ # `MeCab` is a wrapper class for the MeCab `Tagger`.
8
+ # Options to the MeCab `Tagger` are passed in as a string
9
9
  # (MeCab command-line style) or as a Ruby-style hash at
10
10
  # initialization.
11
11
  #
@@ -45,6 +45,7 @@ module Natto
45
45
  # 。 記号,句点,*,*,*,*,。,。,。
46
46
  # EOS
47
47
  #
48
+ #
48
49
  # # pass a block to iterate over each MeCabNode instance
49
50
  # #
50
51
  # nm.parse(text) do |n|
@@ -101,10 +102,52 @@ module Natto
101
102
  # 13: 。,。,記号,7
102
103
  #
103
104
  #
105
+ # # Boundary constraint parsing with output formatting.
106
+ # # %m ... morpheme surface
107
+ # # %f[0] ... ChaSen feature value at index 0:
108
+ # # part-of-speech
109
+ # # %s ... MeCab node status value (1 unknown)
110
+ # #
111
+ # nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
112
+ #
113
+ # enum = nm.enum_parse(text, boundary_constraints: /見えねえ風景/)
114
+ # => #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>
115
+ #
116
+ # # output the feature attribute of each MeCabNode
117
+ # # ignoring any beginning- or end-of-sentence nodes
118
+ # #
119
+ # enum.each do |n|
120
+ # puts n.feature if !(n.is_bos? or n.is_eos?)
121
+ # end
122
+ # 凡人, 名詞, 0
123
+ # に, 助詞, 0
124
+ # しか, 助詞, 0
125
+ # 見えねえ風景, 名詞, 1
126
+ # って, 助詞, 0
127
+ # の, 名詞, 0
128
+ # が, 助詞, 0
129
+ # ある, 動詞, 0
130
+ # ん, 名詞, 0
131
+ # だ, 助動詞, 0
132
+ # よ, 助詞, 0
133
+ # 。, 記号, 0
134
+ #
104
135
  class MeCab
105
136
  include Natto::Binding
106
137
  include Natto::OptionParse
107
138
 
139
+ MECAB_LATTICE_ONE_BEST = 1
140
+ MECAB_LATTICE_NBEST = 2
141
+ MECAB_LATTICE_PARTIAL = 4
142
+ MECAB_LATTICE_MARGINAL_PROB = 8
143
+ MECAB_LATTICE_ALTERNATIVE = 16
144
+ MECAB_LATTICE_ALL_MORPHS = 32
145
+ MECAB_LATTICE_ALLOCATE_SENTENCE = 64
146
+
147
+ MECAB_ANY_BOUNDARY = 0
148
+ MECAB_TOKEN_BOUNDARY = 1
149
+ MECAB_INSIDE_TOKEN = 2
150
+
108
151
  # @return [FFI::Pointer] pointer to MeCab tagger.
109
152
  attr_reader :tagger
110
153
  # @return [String] absolute filepath to MeCab library.
@@ -113,10 +156,10 @@ module Natto
113
156
  attr_reader :options
114
157
  # @return [Array] listing of all dictionaries referenced.
115
158
  attr_reader :dicts
116
- # @return [String] `MeCab` versions.
159
+ # @return [String] `MeCab` version.
117
160
  attr_reader :version
118
161
 
119
- # Initializes the wrapped `mecab` instance with the
162
+ # Initializes the wrapped `Tagger` instance with the
120
163
  # given `options`.
121
164
  #
122
165
  # Options supported are:
@@ -174,7 +217,7 @@ module Natto
174
217
  # 。 。
175
218
  # EOS
176
219
  #
177
- # @param [Hash, String] options MeCab options for tagger
220
+ # @param options [Hash, String] the MeCab options for tagger
178
221
  # @raise [MeCabError] if `mecab` cannot be initialized with the given `options`
179
222
  def initialize(options={})
180
223
  @options = self.class.parse_mecab_options(options)
@@ -191,31 +234,159 @@ module Natto
191
234
  self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
192
235
  self.mecab_set_partial(@tagger, 1) if @options[:partial]
193
236
 
194
- # Set mecab parsing implementations for N-best and regular parsing,
195
- # for both parsing as string and yielding a node object
196
- if @options[:nbest] && @options[:nbest] > 1
197
- # N-Best parsing implementations
198
- self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
199
-
200
- @parse_tostr = lambda do |text|
237
+ # Define a lambda for each major parsing type: @parse_tostr, @parse_tonodes,
238
+ # and their boundary constraint counterparts @bcparse_tostr and @bcparse_tonodes;
239
+ # each parsing type supports both normal and N-best
240
+ # options
241
+ @parse_tostr = ->(text) {
242
+ if @options[:nbest] && @options[:nbest] > 1
243
+ #self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
201
244
  retval = self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], text) ||
202
245
  raise(MeCabError.new(self.mecab_strerror(@tagger)))
203
- retval.force_encoding(Encoding.default_external)
204
- end
246
+ else
247
+ retval = self.mecab_sparse_tostr(@tagger, text) ||
248
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
249
+ end
205
250
 
206
- @parse_tonodes = lambda do |text|
207
- Enumerator.new do |y|
208
- self.mecab_nbest_init(@tagger, text)
209
- n = self.mecab_nbest_next_tonode(@tagger)
210
- raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
251
+ retval.force_encoding(Encoding.default_external)
252
+ }
211
253
 
254
+ @parse_tonodes = ->(text) {
255
+ Enumerator.new do |y|
256
+ if @options[:nbest] && @options[:nbest] > 1
212
257
  nlen = @options[:nbest]
213
- nlen.times do |i|
214
- s = text.bytes.to_a
215
- while n && n.address != 0x0
216
- mn = Natto::MeCabNode.new(n)
217
- # ignore BOS nodes, since mecab does so
218
- if !mn.is_bos?
258
+ #self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
259
+ self.mecab_nbest_init(@tagger, text)
260
+ nptr = self.mecab_nbest_next_tonode(@tagger)
261
+ else
262
+ nlen = 1
263
+ nptr = self.mecab_sparse_tonode(@tagger, text)
264
+ end
265
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if nptr.nil? || nptr.address==0x0
266
+
267
+ nlen.times do
268
+ s = text.bytes.to_a
269
+ while nptr && nptr.address != 0x0
270
+ mn = Natto::MeCabNode.new(nptr)
271
+ # ignore BOS nodes, since mecab does so
272
+ if !mn.is_bos?
273
+ s = s.drop_while {|e| (e==0xa || e==0x20)}
274
+ if !s.empty?
275
+ sarr = []
276
+ mn.length.times { sarr << s.shift }
277
+ surf = sarr.pack('C*')
278
+ mn.surface = surf.force_encoding(Encoding.default_external)
279
+ end
280
+ if @options[:output_format_type] || @options[:node_format]
281
+ mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
282
+ end
283
+ y.yield mn
284
+ end
285
+ nptr = mn.next
286
+ end
287
+ if nlen > 1
288
+ nptr = self.mecab_nbest_next_tonode(@tagger)
289
+ end
290
+ end
291
+ end
292
+ }
293
+
294
+ @bcparse_tostr = ->(text, boundary_constraints=/./) {
295
+ begin
296
+ lattice = self.mecab_lattice_new()
297
+ raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
298
+
299
+ if @options[:nbest] && @options[:nbest] > 1
300
+ n = @options[:nbest]
301
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
302
+ else
303
+ n = 1
304
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
305
+ end
306
+ if @options[:theta]
307
+ self.mecab_lattice_set_theta(lattice, @options[:theta])
308
+ end
309
+
310
+ tokens = tokenize(text, boundary_constraints)
311
+ text = tokens.map {|t| t.first}.join
312
+ self.mecab_lattice_set_sentence(lattice, text)
313
+
314
+ bpos = 0
315
+ tokens.each do |token|
316
+ c = token.first.bytes.count
317
+
318
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
319
+ bpos += 1
320
+
321
+ mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
322
+ (c-1).times do
323
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
324
+ bpos += 1
325
+ end
326
+ end
327
+
328
+ self.mecab_parse_lattice(@tagger, lattice)
329
+
330
+ if n > 1
331
+ retval = self.mecab_lattice_nbest_tostr(lattice, n)
332
+ else
333
+ retval = self.mecab_lattice_tostr(lattice)
334
+ end
335
+ retval.force_encoding(Encoding.default_external)
336
+ rescue
337
+ raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
338
+ ensure
339
+ if lattice.address != 0x0
340
+ self.mecab_lattice_destroy(lattice)
341
+ end
342
+ end
343
+ }
344
+
345
+ @bcparse_tonodes = ->(text, boundary_constraints=/./) {
346
+ Enumerator.new do |y|
347
+ begin
348
+ lattice = self.mecab_lattice_new()
349
+ raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
350
+
351
+ if @options[:nbest] && @options[:nbest] > 1
352
+ n = @options[:nbest]
353
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
354
+ else
355
+ n = 1
356
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
357
+ end
358
+ if @options[:theta]
359
+ self.mecab_lattice_set_theta(lattice, @options[:theta])
360
+ end
361
+
362
+ tokens = tokenize(text, boundary_constraints)
363
+ text = tokens.map {|t| t.first}.join
364
+ self.mecab_lattice_set_sentence(lattice, text)
365
+
366
+ bpos = 0
367
+ tokens.each do |token|
368
+ c = token.first.bytes.count
369
+
370
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
371
+ bpos += 1
372
+
373
+ mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
374
+ (c-1).times do
375
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
376
+ bpos += 1
377
+ end
378
+ end
379
+
380
+ self.mecab_parse_lattice(@tagger, lattice)
381
+
382
+ n.times do
383
+ check = self.mecab_lattice_next(lattice)
384
+ if check
385
+ nptr = self.mecab_lattice_get_bos_node(lattice)
386
+
387
+ s = text.bytes.to_a
388
+ while nptr && nptr.address!=0x0
389
+ mn = Natto::MeCabNode.new(nptr)
219
390
  s = s.drop_while {|e| (e==0xa || e==0x20)}
220
391
  if !s.empty?
221
392
  sarr = []
@@ -224,50 +395,22 @@ module Natto
224
395
  mn.surface = surf.force_encoding(Encoding.default_external)
225
396
  end
226
397
  if @options[:output_format_type] || @options[:node_format]
227
- mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
398
+ mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
228
399
  end
229
400
  y.yield mn
401
+ nptr = mn.next
230
402
  end
231
- n = mn.next
232
403
  end
233
- n = self.mecab_nbest_next_tonode(@tagger)
234
404
  end
235
- end
236
- end
237
- else
238
- # default parsing implementations
239
- @parse_tostr = lambda do |text|
240
- retval = self.mecab_sparse_tostr(@tagger, text) ||
241
- raise(MeCabError.new(self.mecab_strerror(@tagger)))
242
- retval.force_encoding(Encoding.default_external)
243
- end
244
-
245
- @parse_tonodes = lambda do |text|
246
- Enumerator.new do |y|
247
- n = self.mecab_sparse_tonode(@tagger, text)
248
- raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
249
-
250
- mn = Natto::MeCabNode.new(n)
251
- n = mn.next if mn.next.address!=0x0
252
- s = text.bytes.to_a
253
- while n && n.address!=0x0
254
- mn = Natto::MeCabNode.new(n)
255
- s = s.drop_while {|e| (e==0xa || e==0x20)}
256
- if !s.empty?
257
- sarr = []
258
- mn.length.times { sarr << s.shift }
259
- surf = sarr.pack('C*')
260
- mn.surface = surf.force_encoding(Encoding.default_external)
261
- end
262
- if @options[:output_format_type] || @options[:node_format]
263
- mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
264
- end
265
- y.yield mn
266
- n = mn.next
405
+ rescue
406
+ raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
407
+ ensure
408
+ if lattice.address != 0x0
409
+ self.mecab_lattice_destroy(lattice)
267
410
  end
268
411
  end
269
412
  end
270
- end
413
+ }
271
414
 
272
415
  @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
273
416
  while @dicts.last.next.address != 0x0
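The constraint-marking loop in `@bcparse_tonodes` above walks the tokens byte by byte: the first byte of each token is marked as a token boundary, and the remaining bytes are marked as inside-token (for pattern matches) or any-boundary (for unmatched stretches). A minimal sketch of that bookkeeping, without calling MeCab, might look like the following. `boundary_marks` and the symbols are hypothetical stand-ins for the calls to `mecab_lattice_set_boundary_constraint` and the `MECAB_*` constants.

```ruby
# Hypothetical helper, not part of natto: returns one mark per byte of the
# joined token text, mirroring the loop in @bcparse_tonodes.
def boundary_marks(tokens)
  marks = []
  tokens.each do |surface, matched|
    # The first byte of every token opens a new boundary.
    marks << :token_boundary
    # Remaining bytes: kept inside the token if it matched the pattern,
    # otherwise left for MeCab to segment freely.
    inner = matched ? :inside_token : :any_boundary
    (surface.bytes.count - 1).times { marks << inner }
  end
  marks
end

p boundary_marks([["ab", false], ["cde", true]])
# => [:token_boundary, :any_boundary, :token_boundary, :inside_token, :inside_token]
```

Positions are byte offsets rather than character offsets, so a multi-byte token contributes one mark per byte of its encoded form.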
@@ -283,17 +426,35 @@ module Natto
283
426
  # If a block is passed to this method, then node parsing will be used
284
427
  # and each node yielded to the given block.
285
428
  #
286
- # @param [String] text
429
+ # Boundary constraint parsing is available via passing in the
430
+ # `boundary_constraints` key in the `options` hash. Boundary constraints
431
+ # parsing provides hints to MeCab on where the morpheme boundaries in the
432
+ # given `text` are located. `boundary_constraints` value may be either a
433
+ # `Regexp` or `String`; please see
434
+ # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
435
+ # The boundary constraint parsed output will be returned as a single
436
+ # string, unless a block is passed to this method for node parsing.
437
+ #
438
+ # @param text [String] the Japanese text to parse
439
+ # @param options [Hash] only the `boundary_constraints` key is available
287
440
  # @return [String] parsing result from `mecab`
288
441
  # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
289
442
  # @raise [ArgumentError] if the given string `text` argument is `nil`
290
443
  # @see MeCabNode
291
- def parse(text)
444
+ def parse(text, options={})
292
445
  raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
293
- if block_given?
294
- @parse_tonodes.call(text).each {|n| yield n }
446
+ if options[:boundary_constraints]
447
+ if block_given?
448
+ @bcparse_tonodes.call(text, options[:boundary_constraints]).each {|n| yield n }
449
+ else
450
+ @bcparse_tostr.call(text, options[:boundary_constraints])
451
+ end
295
452
  else
296
- @parse_tostr.call(text)
453
+ if block_given?
454
+ @parse_tonodes.call(text).each {|n| yield n }
455
+ else
456
+ @parse_tostr.call(text)
457
+ end
297
458
  end
298
459
  end
299
460
 
@@ -301,73 +462,34 @@ module Natto
301
462
  # {http://www.ruby-doc.org/core-2.1.5/Enumerator.html Enumerator} that may be
302
463
  # used to iterate over the resulting {MeCabNode} objects. This is more
303
464
  # efficient than parsing to a simple string, since each node's
304
- # information will not be materialized all at once as with it is with
465
+ # information will not be materialized all at once as it is with
305
466
  # string output.
306
467
  #
307
468
  # MeCab nodes contain much more detailed information about
308
469
  # the morpheme. Node-formatting may also be used to customize
309
470
  # the resulting node's `feature` attribute.
310
471
  #
311
- # @param [String] text
472
+ # Boundary constraint parsing is available via passing in the
473
+ # `boundary_constraints` key in the `options` hash. Boundary constraints
474
+ # parsing provides hints to MeCab on where the morpheme boundaries in the
475
+ # given `text` are located. `boundary_constraints` value may be either a
476
+ # `Regexp` or `String`; please see
477
+ # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
478
+ #
479
+ # @param text [String] the Japanese text to parse
480
+ # @param options [Hash] only the `boundary_constraints` key is available
312
481
  # @return [Enumerator] of MeCabNode instances
313
482
  # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
314
483
  # @raise [ArgumentError] if the given string `text` argument is `nil`
315
484
  # @see MeCabNode
316
- # @see http://www.ruby-doc.org/core-2.1.5/Enumerator.html
317
- def enum_parse(text)
485
+ # @see http://ruby-doc.org/core-2.2.1/Enumerator.html
486
+ def enum_parse(text, options={})
318
487
  raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
319
- @parse_tonodes.call(text)
320
- end
321
-
322
- # @deprecated
323
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
324
- # Parses the given string `str`, and returns
325
- # a list of `mecab` nodes.
326
- # @param [String] str
327
- # @return [Array] of parsed `mecab` nodes.
328
- # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
329
- # @raise [ArgumentError] if the given string `str` argument is `nil`
330
- # @see MeCabNode
331
- def parse_as_nodes(str)
332
- $stderr.puts 'DEPRECATED: use enum_parse instead'
333
- $stderr.puts ' This method will be removed in the next release!'
334
- raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
335
- @parse_tonodes.call(str)
336
- end
337
-
338
- # @deprecated
339
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
340
- # Parses the given string `str`, and returns
341
- # a list of `mecab` result strings.
342
- # @param [String] str
343
- # @return [Array] of parsed `mecab` result strings.
344
- # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
345
- # @raise [ArgumentError] if the given string `str` argument is `nil`
346
- def parse_as_strings(str)
347
- $stderr.puts 'DEPRECATED: use enum_parse instead'
348
- $stderr.puts ' This method will be removed in the next release!'
349
- raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
350
- @parse_tostr.call(str).lines.to_a
351
- end
352
-
353
- # @deprecated
354
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
355
- # @param [String] str
356
- # @return [Array] of parsed `mecab` nodes.
357
- def readnodes(str)
358
- $stderr.puts 'DEPRECATED: use enum_parse instead'
359
- $stderr.puts ' This method will be removed in the next release!'
360
- parse_as_nodes(str)
361
- end
362
-
363
- # @deprecated
364
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
365
- # @param [String] str
366
- # @return [Array] of parsed `mecab` result strings.
367
- def readlines(str)
368
- $stderr.puts 'DEPRECATED: use enum_parse instead'
369
- $stderr.puts ' This method will be removed in the next release!'
370
- parse_as_strings(str)
488
+ if options[:boundary_constraints]
489
+ @bcparse_tonodes.call(text, options[:boundary_constraints])
490
+ else
491
+ @parse_tonodes.call(text)
492
+ end
371
493
  end
372
494
 
373
495
  # Returns human-readable details for the wrapped `mecab` tagger.
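The `parse` and `enum_parse` changes in the hunks above choose between four lambdas based on two things: whether `options[:boundary_constraints]` is set, and (for `parse`) whether a block was given. A condensed sketch of that dispatch follows. `DispatchSketch` is a hypothetical stand-in for `Natto::MeCab`, and its lambdas only label the path taken instead of calling MeCab.

```ruby
# Hypothetical stand-in for Natto::MeCab; no MeCab calls are made here.
class DispatchSketch
  def initialize
    @parse_tostr     = ->(text) { "str:#{text}" }
    @parse_tonodes   = ->(text) { ["node:#{text}"].each }
    @bcparse_tostr   = ->(text, bc) { "bc-str:#{text}" }
    @bcparse_tonodes = ->(text, bc) { ["bc-node:#{text}"].each }
  end

  # String output by default; node-by-node output when a block is given.
  def parse(text, options = {})
    raise ArgumentError, 'Text to parse cannot be nil' if text.nil?
    bc = options[:boundary_constraints]
    if block_given?
      nodes = bc ? @bcparse_tonodes.call(text, bc) : @parse_tonodes.call(text)
      nodes.each { |n| yield n }
    else
      bc ? @bcparse_tostr.call(text, bc) : @parse_tostr.call(text)
    end
  end

  # Always returns an Enumerator over nodes.
  def enum_parse(text, options = {})
    raise ArgumentError, 'Text to parse cannot be nil' if text.nil?
    bc = options[:boundary_constraints]
    bc ? @bcparse_tonodes.call(text, bc) : @parse_tonodes.call(text)
  end
end

sketch = DispatchSketch.new
puts sketch.parse("abc")                              # plain string path
puts sketch.parse("abc", boundary_constraints: /b/)   # constrained string path
sketch.parse("abc", boundary_constraints: /b/) { |n| puts n }
```

The branching is flattened compared with the nested `if` in the diff, but the four outcomes are intended to be the same.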
@@ -402,17 +524,41 @@ module Natto
402
524
  end
403
525
 
404
526
  # Returns a `Proc` that will properly free resources
405
- # when this `MeCab` instance is garbage collected.
527
+ # when this `Tagger` instance is garbage collected.
406
528
  # The `Proc` returned is registered to be invoked
407
- # after the `MeCab` instance owning `ptr`
529
+ # after the `Tagger` instance owning `tptr`
408
530
  # has been destroyed.
409
531
  #
410
- # @param [FFI::Pointer] ptr
532
+ # @param tptr [FFI::Pointer] pointer to `Tagger`
411
533
  # @return [Proc] to release `mecab` resources properly
412
- def self.create_free_proc(ptr)
534
+ def self.create_free_proc(tptr)
413
535
  Proc.new do
414
- self.mecab_destroy(ptr)
536
+ self.mecab_destroy(tptr)
537
+ end
538
+ end
539
+
540
+ private
541
+
542
+ # @private
543
+ def tokenize(text, pattern)
544
+ matches = text.scan(pattern)
545
+
546
+ acc =[]
547
+ tmp = text
548
+ matches.each_with_index do |m,i|
549
+ bef, mat, aft = tmp.partition(m)
550
+ unless bef.empty?
551
+ acc << [bef.strip, false]
552
+ end
553
+ unless mat.empty?
554
+ acc << [mat.strip, true]
555
+ end
556
+ if i==matches.size-1 and !aft.empty?
557
+ acc << [aft.strip, false]
558
+ end
559
+ tmp = aft
415
560
  end
561
+ acc
416
562
  end
417
563
  end
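The private `tokenize` method added above splits the text into `[surface, matched]` pairs by repeatedly partitioning the remaining text on each `String#scan` match. The standalone copy below, named `tokenize_sketch` to make clear it is an illustration rather than the natto method itself, can be run without MeCab to see what the boundary-constraint lambdas receive.

```ruby
# Standalone illustration of the tokenize logic shown in the diff.
def tokenize_sketch(text, pattern)
  matches = text.scan(pattern)

  acc = []
  tmp = text
  matches.each_with_index do |m, i|
    # Split the remaining text around the next match.
    bef, mat, aft = tmp.partition(m)
    acc << [bef.strip, false] unless bef.empty?
    acc << [mat.strip, true] unless mat.empty?
    # After the last match, keep whatever text is left over.
    acc << [aft.strip, false] if i == matches.size - 1 && !aft.empty?
    tmp = aft
  end
  acc
end

p tokenize_sketch("abc def ghi", /def/)
# => [["abc", false], ["def", true], ["ghi", false]]
```

Three behaviours are worth noting when reading the sketch. Surrounding whitespace is stripped from each token, so the text handed to MeCab is the tokens joined without spaces. A pattern that matches nothing yields an empty array. A `Regexp` with capture groups makes `scan` return arrays instead of strings, which `partition` does not accept, so patterns without groups are the safe choice here.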
418
564
 
@@ -97,14 +97,14 @@ module Natto
97
97
  # Returns a string-representation of the options to
98
98
  # be passed in the construction of the `mecab` tagger.
99
99
  #
100
- # @param [Hash] options
100
+ # @param options [Hash] options for MeCab
101
101
  # @return [String] representation of the options to the `mecab` tagger
102
102
  def build_options_str(options={})
103
103
  opt = []
104
104
  SUPPORTED_OPTS.values.each do |k|
105
105
  if options.has_key? k
106
106
  key = k.to_s.gsub('_', '-')
107
- if %w( all-morphs allocate-sentence ).include? key
107
+ if %w( all-morphs allocate-sentence partial ).include? key
108
108
  opt << "--#{key}" if options[k]==true
109
109
  else
110
110
  opt << "--#{key}=#{options[k]}"
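The one-line change in this hunk adds `partial` to the list of boolean flags, so `partial: true` is emitted as a bare `--partial` rather than `--partial=true`. The sketch below reproduces that branching in isolation. `SKETCH_OPTS` is a hypothetical subset standing in for `SUPPORTED_OPTS.values` (the full list is not shown in this diff), and joining the pieces with a space is an assumption about the part of the method outside the hunk.

```ruby
# Hypothetical subset of option keys; the real SUPPORTED_OPTS is not shown here.
SKETCH_OPTS = [:all_morphs, :partial, :nbest, :output_format_type]
# Flags that take no value, as listed in the updated line of the diff.
BOOLEAN_FLAGS = %w( all-morphs allocate-sentence partial )

def build_options_str_sketch(options = {})
  opt = []
  SKETCH_OPTS.each do |k|
    next unless options.has_key?(k)
    key = k.to_s.gsub('_', '-')
    if BOOLEAN_FLAGS.include?(key)
      # Boolean flags appear only when explicitly set to true.
      opt << "--#{key}" if options[k] == true
    else
      opt << "--#{key}=#{options[k]}"
    end
  end
  opt.join(' ') # assumption: pieces are space-separated
end

puts build_options_str_sketch(partial: true, nbest: 2)
# => --partial --nbest=2
```

A boolean flag set to `false` is simply omitted from the resulting string.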
data/lib/natto/struct.rb CHANGED
@@ -6,12 +6,12 @@ module Natto
6
6
  require 'ffi'
7
7
 
8
8
  # `MeCabStruct` is a general base class for `FFI::Struct` objects in
9
- # the `Natto` module. Please refer to
10
- # [`mecab.h`](http://code.google.com/p/mecab/source/browse/trunk/mecab/src/mecab.h)
9
+ # the `Natto` module. Please refer to `mecab.h` in the source code
10
+ # distribution.
11
11
  class MeCabStruct < FFI::Struct
12
12
  # Provides accessor methods for the members of the `mecab` struct.
13
13
  #
14
- # @param [String] attr_name
14
+ # @param attr_name [String] attribute name
15
15
  # @return member values for the `mecab` struct
16
16
  # @raise [NoMethodError] if `attr_name` is not a member of this `mecab` struct
17
17
  def method_missing(attr_name)
@@ -94,7 +94,7 @@ module Natto
94
94
  # Initializes this dictionary info instance.
95
95
  # Sets the `DictionaryInfo` filepath value.
96
96
  #
97
- # @param [FFI::Pointer] ptr pointer to MeCab dictionary
97
+ # @param ptr [FFI::Pointer] pointer to MeCab dictionary
98
98
  def initialize(ptr)
99
99
  super(ptr)
100
100
 
@@ -246,10 +246,10 @@ module Natto
246
246
  # Initializes this node instance.
247
247
  # Sets the `MeCab` feature value for this node.
248
248
  #
249
- # @param [FFI::Pointer] ptr pointer to MeCab node
250
- def initialize(ptr)
251
- super(ptr)
252
- @pointer = ptr
249
+ # @param nptr [FFI::Pointer] pointer to MeCab node
250
+ def initialize(nptr)
251
+ super(nptr)
252
+ @pointer = nptr
253
253
 
254
254
  if self[:feature]
255
255
  @feature = self[:feature].force_encoding(Encoding.default_external)
@@ -311,10 +311,7 @@ module Natto
311
311
  def is_eon?
312
312
  self.stat == EON_NODE
313
313
  end
314
-
315
-
316
314
  end
317
-
318
315
  end
319
316
 
320
317
  # Copyright (c) 2015, Brooke M. Fujita.
data/lib/natto/version.rb CHANGED
@@ -1,16 +1,16 @@
1
1
  # coding: utf-8
2
2
 
3
3
  # `Natto` is the namespace for objects that provide
4
- # a binding to the `mecab` tagger and related resources.
4
+ # a binding to MeCab and an API for the `Tagger`,
5
+ # `Node` and `Lattice` objects.
5
6
  #
6
- # `Natto::MeCab` is a wrapper class for the `mecab`
7
- # tagger.
7
+ # `Natto::MeCab` is a wrapper class for the MeCab `Tagger`.
8
8
  #
9
9
  # `Natto::MeCabStruct` is a base class for a `mecab`
10
10
  # struct.
11
11
  #
12
12
  # `Natto::MeCabNode` is a wrapper for the struct representing
13
- # a `mecab`-parsed node.
13
+ # a MeCab `Node`.
14
14
  #
15
15
  # `Natto::DictionaryInfo` is a wrapper for the struct
16
16
  # representing a `Natto::MeCab` instance's related
@@ -27,7 +27,7 @@
27
27
  # `Natto`.
28
28
  module Natto
29
29
  # Version string for this Rubygem.
30
- VERSION = "0.9.8"
30
+ VERSION = "0.9.9"
31
31
  end
32
32
 
33
33
  # Copyright (c) 2015, Brooke M. Fujita.
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: natto
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.9.8
4
+ version: 0.9.9
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2015-02-10 00:00:00.000000000 Z
12
+ date: 2015-03-30 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: ffi
@@ -27,9 +27,41 @@ dependencies:
27
27
  - - ! '>='
28
28
  - !ruby/object:Gem::Version
29
29
  version: 1.9.0
30
- description: ! 'No compiler is necessary, as natto is not a C extension. It will run
31
- on CRuby (mri/yarv) and JRuby (jvm) equally well. It will also run on Windows, Unix/Linux,
32
- and OS X. natto provides a naturally Ruby-esque interface to MeCab.
30
+ - !ruby/object:Gem::Dependency
31
+ name: rake
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: minitest
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description: ! 'natto provides a naturally Ruby-esque interface to MeCab. It runs
63
+ on both CRuby (mri/yarv) and JRuby (jvm). It works with MeCab installations on Windows,
64
+ Unix/Linux, and OS X. No compiler is necessary, as natto is not a C extension.
33
65
 
34
66
  '
35
67
  email: buruzaemon@gmail.com