natto 0.9.8 → 0.9.9

data/CHANGELOG CHANGED
@@ -1,5 +1,11 @@
  ## CHANGELOG
 
+ - __2015/03/31__: 0.9.9 release.
+ - Issue 21/34: Implemented boundary constraint parsing.
+ - Issue 26: Removing deprecated methods parse_as_nodes, parse_as_strings, readnodes and readlines.
+ - Issue 32: Fixed partial parsing.
+ - Updated documentation for new MeCab home at http://taku910.github.io/mecab/.
+
  - __2015/02/10__: 0.9.8 release.
  - Migrated natto code home from Bitbucket to GitHub.
  - Improved documentation following said migration.
@@ -7,16 +13,10 @@
  - Updating LICENSE for year 2015.
 
  - __2014/12/20__: 0.9.7 release.
- - Issue 14: [adding automatic discovery for mecab library; no need to
- explicitly set
- MECAB_PATH!](https://bitbucket.org/buruzaemon/natto/issue/14/automatic-discovery-of-libmecab-path-and)
- - Issue 15: [refactored node-parsing to use Enumerator instead of
- materializing every node and stuffing into
- array](https://bitbucket.org/buruzaemon/natto/issue/15/use-enumerator-when-parsing-mecab-nodes)
- - Issue 17: [adding filepath to MeCab and
- DictionaryInfo](https://bitbucket.org/buruzaemon/natto/issue/17/use-filerealpath-value-for-all-file-paths)
- - Issue 18: [bug-fix for node-formatting during default node
- parse](https://bitbucket.org/buruzaemon/natto/issue/18/no-node-formatting-when-using-default-node)
+ - Issue 14: adding automatic discovery for mecab library; no need to explicitly set MECAB_PATH
+ - Issue 15: refactored node-parsing to use Enumerator instead of materializing every node and stuffing into array
+ - Issue 17: adding filepath to MeCab and DictionaryInfo
+ - Issue 18: bug-fix for node-formatting during default node parse
  - Deprecating parse_as_nodes and parse_as_strings; please use parse instead!
  - CAUTION: parse_as_nodes, parse_as_strings, readnodes and readlines will be removed in the following release!
  - Enhancements to to_s methods for both MeCab and DictionaryInfo
@@ -36,8 +36,8 @@
  - Removing automatic library load for Cygwin platform (does not compile)
 
  - __2012/09/16__: 0.9.5 release.
- - Fixed [Issue 9: trimされていない文字列のparse](https://bitbucket.org/buruzaemon/natto/issue/9/trim-parse)
- - Fixed [Issue 10: BUG Segmentation Fault](https://bitbucket.org/buruzaemon/natto/issue/10/bug-segmentation-fault)
+ - Fixed Issue 9: trimされていない文字列のparse
+ - Fixed Issue 10: BUG Segmentation Fault
  - Adding parse_as_nodes to allow for method-chaining on list of parsed nodes
  - Adding parse_as_strings to allow for method-chaining on list of string output
  - Deprecating both readnodes and readlines (badly named methods, see parse_as_nodes and parse_as_strings, respectively)
data/README.md CHANGED
@@ -1,16 +1,16 @@
- # natto
+ # natto [![Gem Version](https://badge.fury.io/rb/natto.svg)](http://badge.fury.io/rb/natto) [![Build Status](https://travis-ci.org/buruzaemon/natto.svg?branch=master)](https://travis-ci.org/buruzaemon/natto)
  A Tasty Ruby Binding with MeCab
 
  ## What is natto?
  A gem leveraging FFI (foreign function interface), natto combines the
  [Ruby programming language](http://www.ruby-lang.org/) with
- [MeCab](http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html), the part-of-speech
+ [MeCab](http://taku910.github.io/mecab/), the part-of-speech
  and morphological analyzer for the Japanese language.
 
- - No compiler is necessary, as natto is _not_ a C extension.
- - It will run on CRuby (mri/yarv) and JRuby (jvm) equally well.
- - It will work with MeCab installations on Windows, Unix/Linux or OS X.
  - natto provides a naturally Ruby-esque interface to MeCab.
+ - It runs on both CRuby (mri/yarv) and JRuby (jvm).
+ - It works with MeCab installations on Windows, Unix/Linux and OS X.
+ - No compiler is necessary, as natto is _not_ a C extension.
 
  You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
 
@@ -18,8 +18,8 @@ You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
  ## Requirements
  natto requires the following:
 
- - [MeCab _0.996_](http://code.google.com/p/mecab/downloads/list)
- - A system dictionary, like [mecab-ipadic](https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz) or [mecab-jumandic](https://mecab.googlecode.com/files/mecab-jumandic-5.1-20070304.tar.gz)
+ - [MeCab _0.996_](http://taku910.github.io/mecab/#download)
+ - A system dictionary, like mecab-ipadic (recommended), mecab-jumandic or unidic, all available on the [MeCab downloads page](http://taku910.github.io/mecab/#download)
  - `libmecab-devel` if you are on Linux, since natto uses `mecab-config`
  - Ruby _1.9 or greater_
  - [ffi _1.9.0 or greater_](http://rubygems.org/gems/ffi)
@@ -29,7 +29,7 @@ Install natto with the following gem command:
 
  gem install natto
 
- This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the `mecab` library.
+ This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the MeCab library.
 
  ## Installation on Windows
  However, if you are using a CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), a MSYS/MinGW based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including ffi.
@@ -45,40 +45,38 @@ However, if you are using a CRuby on Windows, then you will first need to instal
  6. If you are on a 64-bit Windows and you use a 64-bit Ruby or JRuby, then you might want to [build a 64-bit version of libmecab.dll](https://github.com/buruzaemon/natto/wiki/64-Bit-Windows).
 
 
- ## Configuration
- - ***No explicit configuration should be necessary, as natto will try to locate the `mecab` library based upon its runtime environment.***
- - On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
- - On OS X and \*nix, it will query `mecab-config --libs`
- - ***But if natto cannot find the `mecab` library, `LoadError` will be raised.***
- - Please set the `MECAB_PATH` environment variable to the exact name/path to your `mecab` library.
- - e.g., for OS X
+ ## Automatic Configuration
+ No explicit configuration should be necessary, as natto will try to locate the MeCab library based upon its runtime environment.
+ - On OS X and \*nix, it will query `mecab-config --libs`
+ - On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
+
+ ## Explicit configuration via `MECAB_PATH` and `MECAB_CHARSET`
+ If natto cannot find the MeCab library, `LoadError` will be raised. Please set the `MECAB_PATH` environment variable to the exact name/path to your MeCab library.
 
- export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
+ - e.g., for OS X
 
- - e.g., for bash on UNIX/Linux
+ export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
 
- export MECAB_PATH=/usr/local/lib/libmecab.so
+ - e.g., for bash on UNIX/Linux
 
- - e.g., on Windows
+ export MECAB_PATH=/usr/local/lib/libmecab.so
 
- set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
+ - e.g., on Windows
 
- - e.g., from within a Ruby program
+ set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
 
- ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
+ - e.g., from within a Ruby program
+
+ ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
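A note on the in-program variant: `binding.rb` calls `ffi_lib find_library` in the module body, so the library path is presumably resolved once, when natto is first loaded. If that reading is right, the `ENV['MECAB_PATH']` assignment has to run before `require 'natto'`. The sketch below only illustrates that load-time behaviour with a stand-in module; `FakeBinding` and `FAKE_MECAB_PATH` are hypothetical names chosen so the example runs without MeCab installed, and they are not part of natto.

```ruby
# Stand-in for a binding module that resolves its library path at load time.
# FakeBinding and FAKE_MECAB_PATH are hypothetical; they are not part of natto.
ENV['FAKE_MECAB_PATH'] = '/usr/local/lib/libmecab.so'

module FakeBinding
  # Evaluated once, while the module body is being loaded.
  LIBPATH = ENV['FAKE_MECAB_PATH']
end

# Changing the variable afterwards does not affect the already-loaded module.
ENV['FAKE_MECAB_PATH'] = '/opt/mecab/lib/libmecab.so'

puts FakeBinding::LIBPATH
```

The same ordering concern would apply to any setting that is read while a file is being required.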
 
  ## Usage
 
+ Here's a very quick guide to using natto.
+
+ Instantiate a reference to the MeCab library, and display some details:
 
- # Quick Start
- # -----------
- #
- # No explicit configuration should be necessary!
- #
  require 'natto'
 
- # first, create an instance of Natto::MeCab
- #
  nm = Natto::MeCab.new
  => #<Natto::MeCab:0x28d30748
  @tagger=#<FFI::Pointer address=0x28a97d50>, \
@@ -90,32 +88,27 @@ However, if you are using a CRuby on Windows, then you will first need to instal
  type=0>] \
  @version=0.996>
 
- # display MeCab version
- #
  puts nm.version
  => 0.996
 
- # display full pathname to MeCab library
- #
+ ----
+
+ Display details about the system dictionary used by MeCab:
+
  puts nm.libpath
  => /usr/local/lib/libmecab.so
 
- # reference to MeCab system dictionary
- #
  sysdic = nm.dicts.first
-
- # display full pathname to system dictionary file
- #
  puts sysdic.filepath
  => /usr/local/lib/mecab/dic/ipadic/sys.dic
 
- # what charset (encoding) is the system dictionary?
- #
  puts sysdic.charset
  => utf8
-
- # parse text and send output to stdout
- #
+
+ ----
+
+ Parse Japanese text and send the MeCab result as a single string to stdout:
+
  puts nm.parse('俺の名前は星野豊だ!!そこんとこヨロシク!')
  俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
  の 助詞,連体化,*,*,*,*,の,ノ,ノ
@@ -133,12 +126,22 @@ However, if you are using a CRuby on Windows, then you will first need to instal
  ! 記号,一般,*,*,*,*,!,!,!
  EOS
 
- # parse more text and use a block to:
- # - iterate over the resulting MeCabNode instances
- # - output morpheme surface and part-of-speech ID
- #
- # * ignore any end-of-sentence nodes
- #
+ ----
+
+ If a block is passed to `parse`, you can iterate over the list of resulting `MeCabNode`
+ instances to access more detailed information about each morpheme.
+
+ In this example, the following attributes and methods for `MeCabNode` are used:
+
+ - `surface` - the morpheme surface
+ - `posid` - node part-of-speech ID (dictionary-dependent)
+ - `is_eos?` - is this `MeCabNode` an end-of-sentence node?
+
+ This iterates over the morpheme nodes in the given text,
+ and outputs a formatted, tab-delimited line with the
+ morpheme surface and part-of-speech ID, ignoring any end-of-sentence
+ nodes:
+
  nm.parse('世界チャンプ目指してんだなこれがっ!!夢なの、俺のっ!!') do |n|
  puts "#{n.surface}\tpart-of-speech id: #{n.posid}" if !n.is_eos?
  end
@@ -160,37 +163,46 @@ However, if you are using a CRuby on Windows, then you will first need to instal
  のっ part-of-speech id: 31
  !! part-of-speech id: 36
 
- # for more complex parsing, such as that for natural
- # language processing tasks, it is far more efficient
- # to iterate over MeCab nodes using an Enumerator
- #
- # this example uses the -F node-format option to customize
- # the resulting MeCabNode feature attribute to extract:
- # - %m ... surface
- # - %f[0] ... part-of-speech
- # - %f[7] ... reading
- #
- nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
+ ----
+
+ For more complex parsing, such as that for natural language
+ processing tasks, it is far more efficient to use `enum_parse` to
+ obtain an [`Enumerator`](http://ruby-doc.org/core-2.2.0/Enumerator.html)
+ to iterate over the resulting `MeCabNode` instances. An `Enumerator`
+ yields each `MeCabNode` instance without first materializing all
+ instances at once, thus being more efficient.
+
+ This example uses the `-F` node-format option to customize
+ the resulting `MeCabNode` feature attribute to extract:
+
+ - `%m` - morpheme surface
+ - `%f[0]` - node part-of-speech
+ - `%f[7]` - reading
+
+ Note that we can move the `Enumerator` both forwards and backwards, rewind it
+ back to the beginning, and then iterate over it.
 
+ nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
+
  enum = nm.enum_parse('この星の一等賞になりたいの卓球で俺は、そんだけ!')
  => #<Enumerator: #<Enumerator::Generator:0x00000002ff3898>:each>
-
+
  enum.next
  => #<Natto::MeCabNode:0x000000032eed68 \
  @pointer=#<FFI::Pointer address=0x000000005ffb48>, \
  stat=0, \
  @surface="この", \
  @feature="この 連体詞 コノ">
-
+
  enum.peek
  => #<Natto::MeCabNode:0x00000002fe2110a \
  @pointer=#<FFI::Pointer address=0x000000005ffdb8>, \
  stat=0, \
  @surface="星", \
  @feature="星 名詞 ホシ">
-
+
  enum.rewind
-
+
  # again, ignore any end-of-sentence nodes
  enum.each { |n| puts n.feature if !n.is_eos? }
  この 連体詞 コノ
@@ -211,7 +223,52 @@ However, if you are using a CRuby on Windows, then you will first need to instal
  だけ 助詞 ダケ
  ! 記号 !
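The `next`, `peek` and `rewind` calls used in the README example are plain Ruby `Enumerator` behaviour and can be tried without MeCab. The following is a minimal sketch with hard-coded surfaces standing in for `MeCabNode` instances; the `produced` array is an illustrative addition that records when each element is actually generated, to show that nothing is materialized ahead of time.

```ruby
# A stand-in generator; the surfaces are hard-coded instead of coming from MeCab.
produced = []
enum = Enumerator.new do |y|
  %w[この 星 の].each do |surface|
    produced << surface   # record the moment each element is generated
    y << surface
  end
end

first = enum.next                     # advance to the first element
produced_after_first = produced.dup   # only one element has been generated so far
upcoming = enum.peek                  # look at the next element without consuming it
second = enum.next                    # now consume it
enum.rewind                           # reset to the beginning
again = enum.next

puts first, upcoming, second, again
puts produced_after_first.inspect
```

Because elements are produced on demand, a caller that stops early never pays for the nodes it did not read.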
 
-
+ ----
+
+ [Partial parsing](http://taku910.github.io/mecab/partial.html) allows you to
+ pass hints to MeCab on how to tokenize morphemes when parsing. With boundary
+ constraint parsing, you can specify either
+ a [Regexp](http://ruby-doc.org/core-2.2.1/Regexp.html) or
+ [String](http://ruby-doc.org/core-2.2.1/String.html) to tell MeCab where the
+ boundaries of a morpheme should be. Use the new `boundary_constraints` keyword.
+ For hints on tokenization, please see
+ [String#scan](http://ruby-doc.org/core-2.2.1/String.html#method-i-scan)
+
+ This example uses the `-F` node-format option to customize
+ the resulting `MeCabNode` feature attribute to extract:
+
+ - `%m` - morpheme surface
+ - `%f[0]` - node part-of-speech
+ - `%s` - node `stat` status value, 1 is `unknown`
+
+ Note that any such morphemes captured will have node `stat` status of unknown.
+
+ nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
+
+ text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
+ pattern = /ヒーロー見参/
+
+ nm.enum_parse(text, boundary_constraints: pattern).each do |n|
+ puts n.feature if !(n.is_bos? || n.is_eos?)
+ end
+
+ # desired morpheme boundary specified with Regexp /ヒーロー見参/
+ 心, 名詞, 0
+ の, 助詞, 0
+ 中, 名詞, 0
+ で, 助詞, 0
+ 3, 名詞, 1
+ 回, 名詞, 0
+ 唱え, 動詞, 0
+ 、, 記号, 0
+ ヒーロー見参, 名詞, 1
+ !, 記号, 0
+ ヒーロー見参, 名詞, 1
+ !, 記号, 0
+ ヒーロー見参, 名詞, 1
+ !, 記号, 0
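To see what a boundary constraint pattern selects before handing it to MeCab, the text can be split in plain Ruby. The lambdas added to `natto.rb` in this release treat each token as a pair whose first element is the segment text and whose last element says whether the constraint matched it. The helper below, `split_on_constraint`, is a hypothetical stand-in written for this note; it is not natto's internal `tokenize` and may differ from it in edge cases.

```ruby
# Hypothetical helper (not natto's internal tokenizer): split text into
# [segment, matched?] pairs, where matched? is true for segments captured
# by the boundary constraint pattern (a Regexp or a String).
def split_on_constraint(text, pattern)
  tokens = []
  rest = text
  until rest.empty?
    before, match, after = rest.partition(pattern)
    tokens << [before, false] unless before.empty?
    if match.empty?
      # No match (or a zero-length one): the remainder is unconstrained.
      tokens << [after, false] unless after.empty?
      break
    end
    tokens << [match, true]
    rest = after
  end
  tokens
end

text = '心の中で3回唱え、ヒーロー見参!ヒーロー見参!'
tokens = split_on_constraint(text, /ヒーロー見参/)
tokens.each { |segment, matched| puts "#{segment}\t#{matched}" }
```

Joining the first elements of the pairs gives back the original text, which is the property the lattice code relies on when it rebuilds the sentence from the tokens.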
+
+
 
  ## Learn more
  - You can read more about natto on the [project Wiki](https://github.com/buruzaemon/natto/wiki).
data/lib/natto/binding.rb CHANGED
@@ -66,12 +66,7 @@ module Natto
 
  ffi_lib find_library
 
- # new interface
- attach_function :mecab_model_new2, [:string], :pointer
- attach_function :mecab_model_destroy, [:pointer], :void
- attach_function :mecab_model_dictionary_info, [:pointer], :pointer
-
- # old interface
+ # C interface
  attach_function :mecab_new2, [:string], :pointer
  attach_function :mecab_version, [], :string
  attach_function :mecab_strerror, [:pointer],:string
@@ -88,6 +83,26 @@ module Natto
  attach_function :mecab_format_node, [:pointer, :pointer], :string
  attach_function :mecab_dictionary_info, [:pointer], :pointer
 
+ attach_function :mecab_lattice_new, [], :pointer
+ attach_function :mecab_lattice_destroy, [:pointer], :void
+ attach_function :mecab_lattice_clear, [:pointer], :void
+ attach_function :mecab_lattice_is_available, [:pointer], :int
+ attach_function :mecab_lattice_get_bos_node, [:pointer], :pointer
+ attach_function :mecab_lattice_set_sentence, [:pointer, :string], :void
+ attach_function :mecab_lattice_get_size, [:pointer], :int
+ attach_function :mecab_lattice_set_z, [:pointer, :float], :void
+ attach_function :mecab_lattice_set_theta, [:pointer, :float], :void
+ attach_function :mecab_lattice_next, [:pointer], :int
+ attach_function :mecab_lattice_get_request_type, [:pointer], :int
+ attach_function :mecab_lattice_add_request_type, [:pointer, :int], :void
+ attach_function :mecab_lattice_set_request_type, [:pointer, :int], :void
+ attach_function :mecab_lattice_tostr, [:pointer], :string
+ attach_function :mecab_lattice_nbest_tostr, [:pointer, :int], :string
+ attach_function :mecab_lattice_get_boundary_constraint, [:pointer, :int], :int
+ attach_function :mecab_lattice_set_boundary_constraint, [:pointer, :int, :int], :void
+ attach_function :mecab_parse_lattice, [:pointer, :pointer], :int
+ attach_function :mecab_lattice_strerror, [:pointer], :string
+
  # @private
  module ClassMethods
 
@@ -95,18 +110,6 @@ module Natto
  Natto::Binding.find_library
  end
 
- def mecab_model_new2(options_str)
- Natto::Binding.mecab_model_new2(options_str)
- end
-
- def mecab_model_destroy(m_ptr)
- Natto::Binding.mecab_model_destroy(m_ptr)
- end
-
- def mecab_model_dictionary_info(m_ptr)
- Natto::Binding.mecab_model_dictionary_info(m_ptr)
- end
-
  # ----------------------------------------
  def mecab_new2(options_str)
  Natto::Binding.mecab_new2(options_str)
@@ -116,60 +119,136 @@ module Natto
  Natto::Binding.mecab_version
  end
 
- def mecab_strerror(m_ptr)
- Natto::Binding.mecab_strerror(m_ptr)
+ def mecab_strerror(tptr)
+ Natto::Binding.mecab_strerror(tptr)
  end
 
- def mecab_destroy(m_ptr)
- Natto::Binding.mecab_destroy(m_ptr)
+ def mecab_destroy(tptr)
+ Natto::Binding.mecab_destroy(tptr)
  end
 
- def mecab_set_partial(m_ptr, ll)
- Natto::Binding.mecab_set_partial(m_ptr, ll)
+ def mecab_set_partial(tptr, ll)
+ Natto::Binding.mecab_set_partial(tptr, ll)
  end
 
- def mecab_set_theta(m_ptr, t)
- Natto::Binding.mecab_set_theta(m_ptr, t)
+ def mecab_set_theta(tptr, t)
+ Natto::Binding.mecab_set_theta(tptr, t)
  end
 
- def mecab_set_lattice_level(m_ptr, ll)
- Natto::Binding.mecab_set_lattice_level(m_ptr, ll)
+ def mecab_set_lattice_level(tptr, ll)
+ Natto::Binding.mecab_set_lattice_level(tptr, ll)
  end
 
- def mecab_set_all_morphs(m_ptr, am)
- Natto::Binding.mecab_set_all_morphs(m_ptr, am)
+ def mecab_set_all_morphs(tptr, am)
+ Natto::Binding.mecab_set_all_morphs(tptr, am)
  end
 
- def mecab_sparse_tostr(m_ptr, str)
- Natto::Binding.mecab_sparse_tostr(m_ptr, str)
+ def mecab_sparse_tostr(tptr, str)
+ Natto::Binding.mecab_sparse_tostr(tptr, str)
  end
 
- def mecab_sparse_tonode(m_ptr, str)
- Natto::Binding.mecab_sparse_tonode(m_ptr, str)
+ def mecab_sparse_tonode(tptr, str)
+ Natto::Binding.mecab_sparse_tonode(tptr, str)
  end
 
- def mecab_nbest_next_tonode(m_ptr)
- Natto::Binding.mecab_nbest_next_tonode(m_ptr)
+ def mecab_nbest_next_tonode(tptr)
+ Natto::Binding.mecab_nbest_next_tonode(tptr)
  end
 
- def mecab_nbest_init(m_ptr, str)
- Natto::Binding.mecab_nbest_init(m_ptr, str)
+ def mecab_nbest_init(tptr, str)
+ Natto::Binding.mecab_nbest_init(tptr, str)
  end
 
- def mecab_nbest_sparse_tostr(m_ptr, n, str)
- Natto::Binding.mecab_nbest_sparse_tostr(m_ptr, n, str)
+ def mecab_nbest_sparse_tostr(tptr, n, str)
+ Natto::Binding.mecab_nbest_sparse_tostr(tptr, n, str)
  end
 
- def mecab_nbest_next_tonode(m_ptr)
- Natto::Binding.mecab_nbest_next_tonode(m_ptr)
+ def mecab_nbest_next_tonode(tptr)
+ Natto::Binding.mecab_nbest_next_tonode(tptr)
  end
 
- def mecab_format_node(m_ptr, n_ptr)
- Natto::Binding.mecab_format_node(m_ptr, n_ptr)
+ def mecab_format_node(tptr, nptr)
+ Natto::Binding.mecab_format_node(tptr, nptr)
+ end
+
+ def mecab_dictionary_info(tptr)
+ Natto::Binding.mecab_dictionary_info(tptr)
  end
 
- def mecab_dictionary_info(m_ptr)
- Natto::Binding.mecab_dictionary_info(m_ptr)
+ def mecab_lattice_new()
+ Natto::Binding.mecab_lattice_new()
+ end
+
+ def mecab_lattice_destroy(lptr)
+ Natto::Binding.mecab_lattice_destroy(lptr)
+ end
+
+ def mecab_lattice_clear(lptr)
+ Natto::Binding.mecab_lattice_clear(lptr)
+ end
+
+ def mecab_lattice_is_available(lptr)
+ Natto::Binding.mecab_lattice_is_available(lptr)
+ end
+
+ def mecab_lattice_get_bos_node(lptr)
+ Natto::Binding.mecab_lattice_get_bos_node(lptr)
+ end
+
+ def mecab_lattice_set_sentence(lptr, str)
+ Natto::Binding.mecab_lattice_set_sentence(lptr, str)
+ end
+
+ def mecab_lattice_get_size(lptr)
+ Natto::Binding.mecab_lattice_get_size(lptr)
+ end
+
+ def mecab_lattice_set_z(lptr, z)
+ Natto::Binding.mecab_lattice_set_z(lptr, z)
+ end
+
+ def mecab_lattice_set_theta(lptr, t)
+ Natto::Binding.mecab_lattice_set_theta(lptr, t)
+ end
+
+ def mecab_lattice_next(lptr)
+ Natto::Binding.mecab_lattice_next(lptr)
+ end
+
+ def mecab_lattice_get_request_type(lptr)
+ Natto::Binding.mecab_lattice_get_request_type(lptr)
+ end
+
+ def mecab_lattice_add_request_type(lptr, rtype)
+ Natto::Binding.mecab_lattice_add_request_type(lptr, rtype)
+ end
+
+ def mecab_lattice_set_request_type(lptr, rtype)
+ Natto::Binding.mecab_lattice_set_request_type(lptr, rtype)
+ end
+
+ def mecab_lattice_tostr(lptr)
+ Natto::Binding.mecab_lattice_tostr(lptr)
+ end
+
+ def mecab_lattice_nbest_tostr(lptr, n)
+ Natto::Binding.mecab_lattice_nbest_tostr(lptr, n)
+ end
+
+ def mecab_lattice_get_boundary_constraint(lptr, pos)
+ Natto::Binding.mecab_lattice_get_boundary_constraint(lptr, pos)
+ end
+
+ def mecab_lattice_set_boundary_constraint(lptr, pos, btype)
+ Natto::Binding.mecab_lattice_set_boundary_constraint(lptr, pos, btype)
+ end
+
+ def mecab_parse_lattice(tptr, lptr)
+ Natto::Binding.mecab_parse_lattice(tptr, lptr)
+ end
+
+ def mecab_lattice_strerror(lptr)
+ Natto::Binding.mecab_lattice_strerror(lptr)
  end
  end
  end
data/lib/natto/natto.rb CHANGED
@@ -4,8 +4,8 @@ require 'natto/option_parse'
  require 'natto/struct'
 
  module Natto
- # `MeCab` is a wrapper class for the `mecab` tagger.
- # Options to the `mecab` tagger are passed in as a string
+ # `MeCab` is a wrapper class for the MeCab `Tagger`.
+ # Options to the MeCab `Tagger` are passed in as a string
  # (MeCab command-line style) or as a Ruby-style hash at
  # initialization.
  #
@@ -45,6 +45,7 @@ module Natto
  # 。 記号,句点,*,*,*,*,。,。,。
  # EOS
  #
+ #
  # # pass a block to iterate over each MeCabNode instance
  # #
  # nm.parse(text) do |n|
@@ -101,10 +102,52 @@ module Natto
  # 13: 。,。,記号,7
  #
  #
+ # # Boundary constraint parsing with output formatting.
+ # # %m ... morpheme surface
+ # # %F, ... comma-delimited ChaSen feature values
+ # # part-of-speech (index 0)
+ # # %2 ... MeCab node status value (1 unknown)
+ # #
+ # nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
+ #
+ # enum = nm.enum_parse(text, boundary_constraint: /見えねえ風景/)
+ # => #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>
+ #
+ # # output the feature attribute of each MeCabNode
+ # # ignoring any beginning- or end-of-sentence nodes
+ # #
+ # enum.each do |n|
+ # puts n.feature if !(n.is_bos? or n.is_eos?)
+ # end
+ # 凡人, 名詞, 0
+ # に, 助詞, 0
+ # しか, 助詞, 0
+ # 見えねえ風景, 名詞, 1
+ # って, 助詞, 0
+ # の, 名詞, 0
+ # が, 助詞, 0
+ # ある, 動詞, 0
+ # ん, 名詞, 0
+ # だ, 助動詞, 0
+ # よ, 助詞, 0
+ # 。, 記号, 0
+ #
  class MeCab
  include Natto::Binding
  include Natto::OptionParse
 
+ MECAB_LATTICE_ONE_BEST = 1
+ MECAB_LATTICE_NBEST = 2
+ MECAB_LATTICE_PARTIAL = 4
+ MECAB_LATTICE_MARGINAL_PROB = 8
+ MECAB_LATTICE_ALTERNATIVE = 16
+ MECAB_LATTICE_ALL_MORPHS = 32
+ MECAB_LATTICE_ALLOCATE_SENTENCE = 64
+
+ MECAB_ANY_BOUNDARY = 0
+ MECAB_TOKEN_BOUNDARY = 1
+ MECAB_INSIDE_TOKEN = 2
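The `MECAB_LATTICE_*` request-type values are distinct powers of two, which suggests they are meant to be combined as bit flags; the binding also exposes `mecab_lattice_add_request_type` alongside `mecab_lattice_set_request_type`. Assuming that reading is right, the sketch below shows the plain integer arithmetic involved. It does not call MeCab, and the combination chosen is only an illustration, not something this release does.

```ruby
# Values copied from the constants added in this release.
MECAB_LATTICE_ONE_BEST = 1
MECAB_LATTICE_NBEST    = 2
MECAB_LATTICE_PARTIAL  = 4

# Combine two request types into one integer with bitwise OR.
request = MECAB_LATTICE_NBEST | MECAB_LATTICE_PARTIAL

# Check individual flags with bitwise AND.
nbest_requested    = (request & MECAB_LATTICE_NBEST) != 0
one_best_requested = (request & MECAB_LATTICE_ONE_BEST) != 0

puts request             # 6
puts nbest_requested     # true
puts one_best_requested  # false
```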
+
  # @return [FFI:Pointer] pointer to MeCab tagger.
  attr_reader :tagger
  # @return [String] absolute filepath to MeCab library.
@@ -113,10 +156,10 @@ module Natto
  attr_reader :options
  # @return [Array] listing of all of dictionaries referenced.
  attr_reader :dicts
- # @return [String] `MeCab` versions.
+ # @return [String] `MeCab` version.
  attr_reader :version
 
- # Initializes the wrapped `mecab` instance with the
+ # Initializes the wrapped `Tagger` instance with the
  # given `options`.
  #
  # Options supported are:
@@ -174,7 +217,7 @@ module Natto
  # 。 。
  # EOS
  #
- # @param [Hash, String] options MeCab options for tagger
+ # @param options [Hash, String] the MeCab options for tagger
  # @raise [MeCabError] if `mecab` cannot be initialized with the given `options`
  def initialize(options={})
  @options = self.class.parse_mecab_options(options)
@@ -191,31 +234,159 @@ module Natto
  self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
  self.mecab_set_partial(@tagger, 1) if @options[:partial]
 
- # Set mecab parsing implementations for N-best and regular parsing,
- # for both parsing as string and yielding a node object
- if @options[:nbest] && @options[:nbest] > 1
- # N-Best parsing implementations
- self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
-
- @parse_tostr = lambda do |text|
+ # Define lambda for each major parsing type: _tostr, _tonode,
+ # boundary constraint _tostr, boundary constraint _node;
+ # and each parsing type will support both normal and N-best
+ # options
+ @parse_tostr = ->(text) {
+ if @options[:nbest] && @options[:nbest] > 1
+ #self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
  retval = self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], text) ||
  raise(MeCabError.new(self.mecab_strerror(@tagger)))
- retval.force_encoding(Encoding.default_external)
- end
+ else
+ retval = self.mecab_sparse_tostr(@tagger, text) ||
+ raise(MeCabError.new(self.mecab_strerror(@tagger)))
+ end
 
- @parse_tonodes = lambda do |text|
- Enumerator.new do |y|
- self.mecab_nbest_init(@tagger, text)
- n = self.mecab_nbest_next_tonode(@tagger)
- raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
+ retval.force_encoding(Encoding.default_external)
+ }
 
+ @parse_tonodes = ->(text) {
+ Enumerator.new do |y|
+ if @options[:nbest] && @options[:nbest] > 1
  nlen = @options[:nbest]
- nlen.times do |i|
- s = text.bytes.to_a
- while n && n.address != 0x0
- mn = Natto::MeCabNode.new(n)
- # ignore BOS nodes, since mecab does so
- if !mn.is_bos?
+ #self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
+ self.mecab_nbest_init(@tagger, text)
+ nptr = self.mecab_nbest_next_tonode(@tagger)
+ else
+ nlen = 1
+ nptr = self.mecab_sparse_tonode(@tagger, text)
+ end
+ raise(MeCabError.new(self.mecab_strerror(@tagger))) if nptr.nil? || nptr.address==0x0
+
+ nlen.times do
+ s = text.bytes.to_a
+ while nptr && nptr.address != 0x0
+ mn = Natto::MeCabNode.new(nptr)
+ # ignore BOS nodes, since mecab does so
+ if !mn.is_bos?
+ s = s.drop_while {|e| (e==0xa || e==0x20)}
+ if !s.empty?
+ sarr = []
+ mn.length.times { sarr << s.shift }
+ surf = sarr.pack('C*')
+ mn.surface = surf.force_encoding(Encoding.default_external)
+ end
+ if @options[:output_format_type] || @options[:node_format]
+ mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
+ end
+ y.yield mn
+ end
+ nptr = mn.next
+ end
+ if nlen > 1
+ nptr = self.mecab_nbest_next_tonode(@tagger)
+ end
+ end
+ end
+ }
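The surface-rebuilding step inside `@parse_tonodes` (drop leading newline and space bytes, take `mn.length` bytes, pack them, then tag the encoding) can be tried on its own. In the sketch below the per-node byte lengths are hard-coded assumptions standing in for `MeCabNode#length`, and the encoding is fixed to UTF-8 rather than `Encoding.default_external` so the result does not depend on the environment.

```ruby
# Hypothetical inputs: in natto the byte lengths come from each MeCabNode;
# here they are hard-coded so the sketch runs without MeCab.
text = "俺 の\n名前"
node_lengths = [3, 3, 6]

bytes = text.bytes.to_a
surfaces = node_lengths.map do |len|
  # Skip newline (0x0a) and space (0x20) bytes that precede the node.
  bytes = bytes.drop_while { |b| b == 0x0a || b == 0x20 }
  # Take this node's bytes and turn them back into a string.
  bytes.shift(len).pack('C*').force_encoding(Encoding::UTF_8)
end

puts surfaces.inspect
```

Working on bytes rather than characters matters here because node lengths are given in bytes, and a multi-byte character would otherwise be cut in the wrong place.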
+
+ @bcparse_tostr = ->(text, boundary_constraints=/./) {
+ begin
+ lattice = self.mecab_lattice_new()
+ raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
+
+ if @options[:nbest] && @options[:nbest] > 1
+ n = @options[:nbest]
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
+ else
+ n = 1
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
+ end
+ if @options[:theta]
+ self.mecab_lattice_set_theta(lattice, @options[:theta])
+ end
+
+ tokens = tokenize(text, boundary_constraints)
+ text = tokens.map {|t| t.first}.join
+ self.mecab_lattice_set_sentence(lattice, text)
+
+ bpos = 0
+ tokens.each do |token|
+ c = token.first.bytes.count
+
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
+ bpos += 1
+
+ mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
+ (c-1).times do
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
+ bpos += 1
+ end
+ end
+
+ self.mecab_parse_lattice(@tagger, lattice)
+
+ if n > 1
+ retval = self.mecab_lattice_nbest_tostr(lattice, n)
+ else
+ retval = self.mecab_lattice_tostr(lattice)
+ end
+ retval.force_encoding(Encoding.default_external)
+ rescue
+ raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
+ ensure
+ if lattice.address != 0x0
+ self.mecab_lattice_destroy(lattice)
+ end
+ end
+ }
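The marking loop in `@bcparse_tostr` assigns one boundary value per byte of the sentence: the first byte of every token gets `MECAB_TOKEN_BOUNDARY`, and the remaining bytes get `MECAB_INSIDE_TOKEN` when the token came from the constraint, or `MECAB_ANY_BOUNDARY` otherwise. The sketch below reproduces only that arithmetic. `boundary_marks` is a hypothetical helper that returns the marks as an array instead of calling `mecab_lattice_set_boundary_constraint`, and the token pairs are made up for illustration.

```ruby
# Values copied from the constants added in this release.
MECAB_ANY_BOUNDARY   = 0
MECAB_TOKEN_BOUNDARY = 1
MECAB_INSIDE_TOKEN   = 2

# Hypothetical helper: one mark per byte, in sentence order.
# tokens is a list of [segment, constrained?] pairs.
def boundary_marks(tokens)
  tokens.flat_map do |segment, constrained|
    inner = constrained ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
    [MECAB_TOKEN_BOUNDARY] + [inner] * (segment.bytesize - 1)
  end
end

marks = boundary_marks([['a', false], ['ヒー', true], ['bc', false]])
puts marks.inspect
```

A constrained segment therefore ends up as a single 1 followed by 2s, which is what stops MeCab from splitting it, while unconstrained bytes are left open with 0.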
+
+ @bcparse_tonodes = ->(text, boundary_constraints=/./) {
+ Enumerator.new do |y|
+ begin
+ lattice = self.mecab_lattice_new()
+ raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
+
+ if @options[:nbest] && @options[:nbest] > 1
+ n = @options[:nbest]
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
+ else
+ n = 1
+ self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
+ end
+ if @options[:theta]
+ self.mecab_lattice_set_theta(lattice, @options[:theta])
+ end
+
+ tokens = tokenize(text, boundary_constraints)
+ text = tokens.map {|t| t.first}.join
+ self.mecab_lattice_set_sentence(lattice, text)
+
+ bpos = 0
+ tokens.each do |token|
+ c = token.first.bytes.count
+
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
+ bpos += 1
+
+ mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
+ (c-1).times do
+ self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
+ bpos += 1
+ end
+ end
+
+ self.mecab_parse_lattice(@tagger, lattice)
+
+ n.times do
+ check = self.mecab_lattice_next(lattice)
+ if check
+ nptr = self.mecab_lattice_get_bos_node(lattice)
+
+ s = text.bytes.to_a
+ while nptr && nptr.address!=0x0
+ mn = Natto::MeCabNode.new(nptr)
  s = s.drop_while {|e| (e==0xa || e==0x20)}
  if !s.empty?
  sarr = []
@@ -224,50 +395,22 @@ module Natto
224
395
  mn.surface = surf.force_encoding(Encoding.default_external)
225
396
  end
226
397
  if @options[:output_format_type] || @options[:node_format]
227
- mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
398
+ mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
228
399
  end
229
400
  y.yield mn
401
+ nptr = mn.next
230
402
  end
231
- n = mn.next
232
403
  end
233
- n = self.mecab_nbest_next_tonode(@tagger)
234
404
  end
235
- end
236
- end
237
- else
238
- # default parsing implementations
239
- @parse_tostr = lambda do |text|
240
- retval = self.mecab_sparse_tostr(@tagger, text) ||
241
- raise(MeCabError.new(self.mecab_strerror(@tagger)))
242
- retval.force_encoding(Encoding.default_external)
243
- end
244
-
245
- @parse_tonodes = lambda do |text|
246
- Enumerator.new do |y|
247
- n = self.mecab_sparse_tonode(@tagger, text)
248
- raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
249
-
250
- mn = Natto::MeCabNode.new(n)
251
- n = mn.next if mn.next.address!=0x0
252
- s = text.bytes.to_a
253
- while n && n.address!=0x0
254
- mn = Natto::MeCabNode.new(n)
255
- s = s.drop_while {|e| (e==0xa || e==0x20)}
256
- if !s.empty?
257
- sarr = []
258
- mn.length.times { sarr << s.shift }
259
- surf = sarr.pack('C*')
260
- mn.surface = surf.force_encoding(Encoding.default_external)
261
- end
262
- if @options[:output_format_type] || @options[:node_format]
263
- mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
264
- end
265
- y.yield mn
266
- n = mn.next
405
+ rescue
406
+ raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
407
+ ensure
408
+ if lattice.address != 0x0
409
+ self.mecab_lattice_destroy(lattice)
267
410
  end
268
411
  end
269
412
  end
270
- end
413
+ }
271
414
 
272
415
  @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
273
416
  while @dicts.last.next.address != 0x0
@@ -283,17 +426,35 @@ module Natto
283
426
  # If a block is passed to this method, then node parsing will be used
284
427
  # and each node yielded to the given block.
285
428
  #
286
- # @param [String] text
429
+ # Boundary constraint parsing is available via passing in the
430
+ # `boundary_constraints` key in the `options` hash. Boundary constraints
431
+ # parsing provides hints to MeCab on where the morpheme boundaries in the
432
+ # given `text` are located. `boundary_constraints` value may be either a
433
+ # `Regexp` or `String`; please see
434
+ # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
435
+ # The boundary constraint parsed output will be returned as a single
436
+ # string, unless a block is passed to this method for node parsing.
437
+ #
438
+ # @param text [String] the Japanese text to parse
439
+ # @param options [Hash] only the `boundary_constraints` key is available
287
440
  # @return [String] parsing result from `mecab`
288
441
  # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
289
442
  # @raise [ArgumentError] if the given string `text` argument is `nil`
290
443
  # @see MeCabNode
291
- def parse(text)
444
+ def parse(text, options={})
292
445
  raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
293
- if block_given?
294
- @parse_tonodes.call(text).each {|n| yield n }
446
+ if options[:boundary_constraints]
447
+ if block_given?
448
+ @bcparse_tonodes.call(text, options[:boundary_constraints]).each {|n| yield n }
449
+ else
450
+ @bcparse_tostr.call(text, options[:boundary_constraints])
451
+ end
295
452
  else
296
- @parse_tostr.call(text)
453
+ if block_given?
454
+ @parse_tonodes.call(text).each {|n| yield n }
455
+ else
456
+ @parse_tostr.call(text)
457
+ end
297
458
  end
298
459
  end
299
460
 
@@ -301,73 +462,34 @@ module Natto
301
462
  # {http://www.ruby-doc.org/core-2.1.5/Enumerator.html Enumerator} that may be
302
463
  # used to iterate over the resulting {MeCabNode} objects. This is more
303
464
  # efficient than parsing to a simple string, since each node's
304
- # information will not be materialized all at once as with it is with
465
+ # information will not be materialized all at once as it is with
305
466
  # string output.
306
467
  #
307
468
  # MeCab nodes contain much more detailed information about
308
469
  # the morpheme. Node-formatting may also be used to customize
309
470
  # the resulting node's `feature` attribute.
310
471
  #
311
- # @param [String] text
472
+ # Boundary constraint parsing is available via passing in the
473
+ # `boundary_constraints` key in the `options` hash. Boundary constraints
474
+ # parsing provides hints to MeCab on where the morpheme boundaries in the
475
+ # given `text` are located. `boundary_constraints` value may be either a
476
+ # `Regexp` or `String`; please see
477
+ # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
478
+ #
479
+ # @param text [String] the Japanese text to parse
480
+ # @param options [Hash] only the `boundary_constraints` key is available
312
481
  # @return [Enumerator] of MeCabNode instances
313
482
  # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
314
483
  # @raise [ArgumentError] if the given string `text` argument is `nil`
315
484
  # @see MeCabNode
316
- # @see http://www.ruby-doc.org/core-2.1.5/Enumerator.html
317
- def enum_parse(text)
485
+ # @see http://ruby-doc.org/core-2.2.1/Enumerator.html
486
+ def enum_parse(text, options={})
318
487
  raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
319
- @parse_tonodes.call(text)
320
- end
321
-
322
- # @deprecated
323
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
324
- # Parses the given string `str`, and returns
325
- # a list of `mecab` nodes.
326
- # @param [String] str
327
- # @return [Array] of parsed `mecab` nodes.
328
- # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
329
- # @raise [ArgumentError] if the given string `str` argument is `nil`
330
- # @see MeCabNode
331
- def parse_as_nodes(str)
332
- $stderr.puts 'DEPRECATED: use enum_parse instead'
333
- $stderr.puts ' This method will be removed in the next release!'
334
- raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
335
- @parse_tonodes.call(str)
336
- end
337
-
338
- # @deprecated
339
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
340
- # Parses the given string `str`, and returns
341
- # a list of `mecab` result strings.
342
- # @param [String] str
343
- # @return [Array] of parsed `mecab` result strings.
344
- # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
345
- # @raise [ArgumentError] if the given string `str` argument is `nil`
346
- def parse_as_strings(str)
347
- $stderr.puts 'DEPRECATED: use enum_parse instead'
348
- $stderr.puts ' This method will be removed in the next release!'
349
- raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
350
- @parse_tostr.call(str).lines.to_a
351
- end
352
-
353
- # @deprecated
354
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
355
- # @param [String] str
356
- # @return [Array] of parsed `mecab` nodes.
357
- def readnodes(str)
358
- $stderr.puts 'DEPRECATED: use enum_parse instead'
359
- $stderr.puts ' This method will be removed in the next release!'
360
- parse_as_nodes(str)
361
- end
362
-
363
- # @deprecated
364
- # DEPRECATED: use enum_parse instead, this convenience method is useless.
365
- # @param [String] str
366
- # @return [Array] of parsed `mecab` result strings.
367
- def readlines(str)
368
- $stderr.puts 'DEPRECATED: use enum_parse instead'
369
- $stderr.puts ' This method will be removed in the next release!'
370
- parse_as_strings(str)
488
+ if options[:boundary_constraints]
489
+ @bcparse_tonodes.call(text, options[:boundary_constraints])
490
+ else
491
+ @parse_tonodes.call(text)
492
+ end
371
493
  end
372
494
 
373
495
  # Returns human-readable details for the wrapped `mecab` tagger.
@@ -402,17 +524,41 @@ module Natto
402
524
  end
403
525
 
404
526
  # Returns a `Proc` that will properly free resources
405
- # when this `MeCab` instance is garbage collected.
527
+ # when this `Tagger` instance is garbage collected.
406
528
  # The `Proc` returned is registered to be invoked
407
- # after the `MeCab` instance owning `ptr`
529
+ # after the `Tagger` instance owning `tptr`
408
530
  # has been destroyed.
409
531
  #
410
- # @param [FFI::Pointer] ptr
532
+ # @param tptr [FFI::Pointer] pointer to `Tagger`
411
533
  # @return [Proc] to release `mecab` resources properly
412
- def self.create_free_proc(ptr)
534
+ def self.create_free_proc(tptr)
413
535
  Proc.new do
414
- self.mecab_destroy(ptr)
536
+ self.mecab_destroy(tptr)
537
+ end
538
+ end
539
+
540
+ private
541
+
542
+ # @private
543
+ def tokenize(text, pattern)
544
+ matches = text.scan(pattern)
545
+
546
+ acc = []
547
+ tmp = text
548
+ matches.each_with_index do |m,i|
549
+ bef, mat, aft = tmp.partition(m)
550
+ unless bef.empty?
551
+ acc << [bef.strip, false]
552
+ end
553
+ unless mat.empty?
554
+ acc << [mat.strip, true]
555
+ end
556
+ if i==matches.size-1 and !aft.empty?
557
+ acc << [aft.strip, false]
558
+ end
559
+ tmp = aft
415
560
  end
561
+ acc
416
562
  end
417
563
  end
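Note on the boundary constraint hunks above: the sketch below restates, outside of natto, how the new private `tokenize` helper splits text and how the `@bcparse_*` lambdas then derive one mark per byte. It is an illustration under assumptions, not the gem's code. `boundary_marks` is a hypothetical helper name, and the symbols `:token_boundary`, `:inside_token` and `:any_boundary` are stand-ins for the `MECAB_TOKEN_BOUNDARY`, `MECAB_INSIDE_TOKEN` and `MECAB_ANY_BOUNDARY` constants, whose values are not shown in this diff. No MeCab installation is needed to run it.

```ruby
# Standalone restatement of the tokenize helper from the diff:
# splits text into [segment, matched?] pairs using String#scan and
# String#partition.
def tokenize(text, pattern)
  matches = text.scan(pattern)
  acc = []
  tmp = text
  matches.each_with_index do |m, i|
    bef, mat, aft = tmp.partition(m)
    acc << [bef.strip, false] unless bef.empty?
    acc << [mat.strip, true] unless mat.empty?
    # Only after the last match is the trailing remainder kept.
    acc << [aft.strip, false] if i == matches.size - 1 && !aft.empty?
    tmp = aft
  end
  acc
end

# Hypothetical helper mirroring the per-byte loop in the lambdas:
# the first byte of every token is a token boundary; the remaining
# bytes are "inside token" for matched segments and "any boundary"
# otherwise. Symbols stand in for the MECAB_* constants.
def boundary_marks(tokens)
  marks = []
  tokens.each do |surface, matched|
    marks << :token_boundary
    mark = matched ? :inside_token : :any_boundary
    (surface.bytes.count - 1).times { marks << mark }
  end
  marks
end

tokens = tokenize('abc123def', /\d+/)
# => [["abc", false], ["123", true], ["def", false]]
marks = boundary_marks(tokens)
```

Two properties of this sketch are worth checking against real use. Positions are counted in bytes, not characters, so multi-byte text gets several marks per character. And when the pattern matches nothing, `tokenize` returns an empty array, so the text rebuilt from the tokens is empty. In natto itself the feature is reached through `parse(text, boundary_constraints: pattern)` or `enum_parse` with the same option, as documented in the hunks above.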
418
564
 
@@ -97,14 +97,14 @@ module Natto
97
97
  # Returns a string-representation of the options to
98
98
  # be passed in the construction of the `mecab` tagger.
99
99
  #
100
- # @param [Hash] options
100
+ # @param options [Hash] options for MeCab
101
101
  # @return [String] representation of the options to the `mecab` tagger
102
102
  def build_options_str(options={})
103
103
  opt = []
104
104
  SUPPORTED_OPTS.values.each do |k|
105
105
  if options.has_key? k
106
106
  key = k.to_s.gsub('_', '-')
107
- if %w( all-morphs allocate-sentence ).include? key
107
+ if %w( all-morphs allocate-sentence partial ).include? key
108
108
  opt << "--#{key}" if options[k]==true
109
109
  else
110
110
  opt << "--#{key}=#{options[k]}"
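Note on the hunk above: adding `partial` to the flag list means it is emitted as a bare `--partial` switch rather than as `--partial=true`. The sketch below illustrates that branch in isolation. It makes several assumptions: `EXAMPLE_KEYS` is a hypothetical stand-in for `SUPPORTED_OPTS.values` (the real list is not shown in this diff), `example_options_str` is a made-up name, and joining with single spaces is a guess, since the rest of `build_options_str` lies outside the hunk.

```ruby
# Hypothetical stand-in for SUPPORTED_OPTS.values; not taken from the gem.
EXAMPLE_KEYS = [:all_morphs, :allocate_sentence, :partial, :nbest, :theta]

# Options emitted as bare switches, as listed in the hunk above.
BOOLEAN_FLAGS = %w( all-morphs allocate-sentence partial )

def example_options_str(options = {})
  opt = []
  EXAMPLE_KEYS.each do |k|
    next unless options.has_key?(k)
    key = k.to_s.gsub('_', '-')
    if BOOLEAN_FLAGS.include?(key)
      # Boolean flags appear only when explicitly set to true.
      opt << "--#{key}" if options[k] == true
    else
      opt << "--#{key}=#{options[k]}"
    end
  end
  opt.join(' ') # assumption: the real method may join differently
end

example_options_str(partial: true, nbest: 2)
# => "--partial --nbest=2"
```

A flag set to `false` produces no output at all, which is what keeps a value such as `partial: false` from reaching the tagger's option string.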
data/lib/natto/struct.rb CHANGED
@@ -6,12 +6,12 @@ module Natto
6
6
  require 'ffi'
7
7
 
8
8
  # `MeCabStruct` is a general base class for `FFI::Struct` objects in
9
- # the `Natto` module. Please refer to
10
- # [`mecab.h`](http://code.google.com/p/mecab/source/browse/trunk/mecab/src/mecab.h)
9
+ # the `Natto` module. Please refer to `mecab.h` in the source code
10
+ # distribution.
11
11
  class MeCabStruct < FFI::Struct
12
12
  # Provides accessor methods for the members of the `mecab` struct.
13
13
  #
14
- # @param [String] attr_name
14
+ # @param attr_name [String] attribute name
15
15
  # @return member values for the `mecab` struct
16
16
  # @raise [NoMethodError] if `attr_name` is not a member of this `mecab` struct
17
17
  def method_missing(attr_name)
@@ -94,7 +94,7 @@ module Natto
94
94
  # Initializes this dictionary info instance.
95
95
  # Sets the `DictionaryInfo` filepath value.
96
96
  #
97
- # @param [FFI::Pointer] ptr pointer to MeCab dictionary
97
+ # @param ptr [FFI::Pointer] pointer to MeCab dictionary
98
98
  def initialize(ptr)
99
99
  super(ptr)
100
100
 
@@ -246,10 +246,10 @@ module Natto
246
246
  # Initializes this node instance.
247
247
  # Sets the `MeCab` feature value for this node.
248
248
  #
249
- # @param [FFI::Pointer] ptr pointer to MeCab node
250
- def initialize(ptr)
251
- super(ptr)
252
- @pointer = ptr
249
+ # @param nptr [FFI::Pointer] pointer to MeCab node
250
+ def initialize(nptr)
251
+ super(nptr)
252
+ @pointer = nptr
253
253
 
254
254
  if self[:feature]
255
255
  @feature = self[:feature].force_encoding(Encoding.default_external)
@@ -311,10 +311,7 @@ module Natto
311
311
  def is_eon?
312
312
  self.stat == EON_NODE
313
313
  end
314
-
315
-
316
314
  end
317
-
318
315
  end
319
316
 
320
317
  # Copyright (c) 2015, Brooke M. Fujita.
data/lib/natto/version.rb CHANGED
@@ -1,16 +1,16 @@
1
1
  # coding: utf-8
2
2
 
3
3
  # `Natto` is the namespace for objects that provide
4
- # a binding to the `mecab` tagger and related resources.
4
+ # a binding to MeCab and an API for the `Tagger`,
5
+ # `Node` and `Lattice` objects.
5
6
  #
6
- # `Natto::MeCab` is a wrapper class for the `mecab`
7
- # tagger.
7
+ # `Natto::MeCab` is a wrapper class for the MeCab `Tagger`.
8
8
  #
9
9
  # `Natto::MeCabStruct` is a base class for a `mecab`
10
10
  # struct.
11
11
  #
12
12
  # `Natto::MeCabNode` is a wrapper for the struct representing
13
- # a `mecab`-parsed node.
13
+ # a MeCab `Node`.
14
14
  #
15
15
  # `Natto::DictionaryInfo` is a wrapper for the struct
16
16
  # representing a `Natto::MeCab` instance's related
@@ -27,7 +27,7 @@
27
27
  # `Natto`.
28
28
  module Natto
29
29
  # Version string for this Rubygem.
30
- VERSION = "0.9.8"
30
+ VERSION = "0.9.9"
31
31
  end
32
32
 
33
33
  # Copyright (c) 2015, Brooke M. Fujita.
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: natto
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.9.8
4
+ version: 0.9.9
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2015-02-10 00:00:00.000000000 Z
12
+ date: 2015-03-30 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: ffi
@@ -27,9 +27,41 @@ dependencies:
27
27
  - - ! '>='
28
28
  - !ruby/object:Gem::Version
29
29
  version: 1.9.0
30
- description: ! 'No compiler is necessary, as natto is not a C extension. It will run
31
- on CRuby (mri/yarv) and JRuby (jvm) equally well. It will also run on Windows, Unix/Linux,
32
- and OS X. natto provides a naturally Ruby-esque interface to MeCab.
30
+ - !ruby/object:Gem::Dependency
31
+ name: rake
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: minitest
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ description: ! 'natto provides a naturally Ruby-esque interface to MeCab. It runs
63
+ on both CRuby (mri/yarv) and JRuby (jvm). It works with MeCab installations on Windows,
64
+ Unix/Linux, and OS X. No compiler is necessary, as natto is not a C extension.
33
65
 
34
66
  '
35
67
  email: buruzaemon@gmail.com