natto 0.9.8 → 0.9.9
- data/CHANGELOG +12 -12
- data/README.md +122 -65
- data/lib/natto/binding.rb +125 -46
- data/lib/natto/natto.rb +273 -127
- data/lib/natto/option_parse.rb +2 -2
- data/lib/natto/struct.rb +8 -11
- data/lib/natto/version.rb +5 -5
- metadata +37 -5
data/CHANGELOG
CHANGED
@@ -1,5 +1,11 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
+
- __2015/03/31__: 0.9.9 release.
|
4
|
+
- Issue 21/34: Implemented boundary constraint parsing.
|
5
|
+
- Issue 26: Removing deprecated methods parse_as_nodes, parse_as_strings, readnodes and readlines.
|
6
|
+
- Issue 32: Fixed partial parsing.
|
7
|
+
- Updated documentation for new MeCab home at http://taku910.github.io/mecab/.
|
8
|
+
|
3
9
|
- __2015/02/10__: 0.9.8 release.
|
4
10
|
- Migrated natto code home from Bitbucket to GitHub.
|
5
11
|
- Improved documentation following said migration.
|
@@ -7,16 +13,10 @@
|
|
7
13
|
- Updating LICENSE for year 2015.
|
8
14
|
|
9
15
|
- __2014/12/20__: 0.9.7 release.
|
10
|
-
- Issue 14:
|
11
|
-
|
12
|
-
|
13
|
-
- Issue
|
14
|
-
materializing every node and stuffing into
|
15
|
-
array](https://bitbucket.org/buruzaemon/natto/issue/15/use-enumerator-when-parsing-mecab-nodes)
|
16
|
-
- Issue 17: [adding filepath to MeCab and
|
17
|
-
DictionaryInfo](https://bitbucket.org/buruzaemon/natto/issue/17/use-filerealpath-value-for-all-file-paths)
|
18
|
-
- Issue 18: [bug-fix for node-formatting during default node
|
19
|
-
parse](https://bitbucket.org/buruzaemon/natto/issue/18/no-node-formatting-when-using-default-node)
|
16
|
+
- Issue 14: adding automatic discovery for mecab library; no need to explicitly set MECAB_PATH
|
17
|
+
- Issue 15: refactored node-parsing to use Enumerator instead of materializing every node and stuffing into array
|
18
|
+
- Issue 17: adding filepath to MeCab and DictionaryInfo
|
19
|
+
- Issue 18: bug-fix for node-formatting during default node parse
|
20
20
|
- Deprecating parse_as_nodes and parse_as_strings; please use parse instead!
|
21
21
|
- CAUTION: parse_as_nodes, parse_as_strings, readnodes and readlines will be removed in the following release!
|
22
22
|
- Enhancements to to_s methods for both MeCab and DictionaryInfo
|
@@ -36,8 +36,8 @@
|
|
36
36
|
- Removing automatic library load for Cygwin platform (does not compile)
|
37
37
|
|
38
38
|
- __2012/09/16__: 0.9.5 release.
|
39
|
-
- Fixed
|
40
|
-
- Fixed
|
39
|
+
- Fixed Issue 9: trimされていない文字列のparse
|
40
|
+
- Fixed Issue 10: BUG Segmentation Fault
|
41
41
|
- Adding parse_as_nodes to allow for method-chaining on list of parsed nodes
|
42
42
|
- Adding parse_as_strings to allow for method-chaining on list of string output
|
43
43
|
- Deprecating both readnodes and readlines (badly named methods, see parse_as_nodes and parse_as_strings, respectively)
|
data/README.md
CHANGED
@@ -1,16 +1,16 @@
|
|
1
|
-
# natto
|
1
|
+
# natto [](http://badge.fury.io/rb/natto) [](https://travis-ci.org/buruzaemon/natto)
|
2
2
|
A Tasty Ruby Binding with MeCab
|
3
3
|
|
4
4
|
## What is natto?
|
5
5
|
A gem leveraging FFI (foreign function interface), natto combines the
|
6
6
|
[Ruby programming language](http://www.ruby-lang.org/) with
|
7
|
-
[MeCab](http://
|
7
|
+
[MeCab](http://taku910.github.io/mecab/), the part-of-speech
|
8
8
|
and morphological analyzer for the Japanese language.
|
9
9
|
|
10
|
-
- No compiler is necessary, as natto is _not_ a C extension.
|
11
|
-
- It will run on CRuby (mri/yarv) and JRuby (jvm) equally well.
|
12
|
-
- It will work with MeCab installations on Windows, Unix/Linux or OS X.
|
13
10
|
- natto provides a naturally Ruby-esque interface to MeCab.
|
11
|
+
- It runs on both CRuby (mri/yarv) and JRuby (jvm).
|
12
|
+
- It works with MeCab installations on Windows, Unix/Linux and OS X.
|
13
|
+
- No compiler is necessary, as natto is _not_ a C extension.
|
14
14
|
|
15
15
|
You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
|
16
16
|
|
@@ -18,8 +18,8 @@ You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
|
|
18
18
|
## Requirements
|
19
19
|
natto requires the following:
|
20
20
|
|
21
|
-
- [MeCab _0.996_](http://
|
22
|
-
- A system dictionary, like
|
21
|
+
- [MeCab _0.996_](http://taku910.github.io/mecab/#download)
|
22
|
+
- A system dictionary, like mecab-ipadic (recommended), mecab-jumandic or unidic, all available on the [MeCab downloads page](http://taku910.github.io/mecab/#download)
|
23
23
|
- `libmecab-devel` if you are on Linux, since natto uses `mecab-config`
|
24
24
|
- Ruby _1.9 or greater_
|
25
25
|
- [ffi _1.9.0 or greater_](http://rubygems.org/gems/ffi)
|
@@ -29,7 +29,7 @@ Install natto with the following gem command:
|
|
29
29
|
|
30
30
|
gem install natto
|
31
31
|
|
32
|
-
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the
|
32
|
+
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the MeCab library.
|
33
33
|
|
34
34
|
## Installation on Windows
|
35
35
|
However, if you are using a CRuby on Windows, then you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), a MSYS/MinGW based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including ffi.
|
@@ -45,40 +45,38 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
45
45
|
6. If you are on a 64-bit Windows and you use a 64-bit Ruby or JRuby, then you might want to [build a 64-bit version of libmecab.dll](https://github.com/buruzaemon/natto/wiki/64-Bit-Windows).
|
46
46
|
|
47
47
|
|
48
|
-
## Configuration
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
48
|
+
## Automatic Configuration
|
49
|
+
No explicit configuration should be necessary, as natto will try to locate the MeCab library based upon its runtime environment.
|
50
|
+
- On OS X and \*nix, it will query `mecab-config --libs`
|
51
|
+
- On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
|
52
|
+
|
53
|
+
## Explicit configuration via `MECAB_PATH` and `MECAB_CHARSET`
|
54
|
+
If natto cannot find the MeCab library, `LoadError` will be raised. Please set the `MECAB_PATH` environment variable to the exact name/path to your MeCab library.
|
55
55
|
|
56
|
-
|
56
|
+
- e.g., for OS X
|
57
57
|
|
58
|
-
|
58
|
+
export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
|
59
59
|
|
60
|
-
|
60
|
+
- e.g., for bash on UNIX/Linux
|
61
61
|
|
62
|
-
|
62
|
+
export MECAB_PATH=/usr/local/lib/libmecab.so
|
63
63
|
|
64
|
-
|
64
|
+
- e.g., on Windows
|
65
65
|
|
66
|
-
|
66
|
+
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
|
67
67
|
|
68
|
-
|
68
|
+
- e.g., from within a Ruby program
|
69
|
+
|
70
|
+
ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
|
69
71
|
|
70
72
|
## Usage
|
71
73
|
|
74
|
+
Here's a very quick guide to using natto.
|
75
|
+
|
76
|
+
Instantiate a reference to the MeCab library, and display some details:
|
72
77
|
|
73
|
-
# Quick Start
|
74
|
-
# -----------
|
75
|
-
#
|
76
|
-
# No explicit configuration should be necessary!
|
77
|
-
#
|
78
78
|
require 'natto'
|
79
79
|
|
80
|
-
# first, create an instance of Natto::MeCab
|
81
|
-
#
|
82
80
|
nm = Natto::MeCab.new
|
83
81
|
=> #<Natto::MeCab:0x28d30748
|
84
82
|
@tagger=#<FFI::Pointer address=0x28a97d50>, \
|
@@ -90,32 +88,27 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
90
88
|
type=0>] \
|
91
89
|
@version=0.996>
|
92
90
|
|
93
|
-
# display MeCab version
|
94
|
-
#
|
95
91
|
puts nm.version
|
96
92
|
=> 0.996
|
97
93
|
|
98
|
-
|
99
|
-
|
94
|
+
----
|
95
|
+
|
96
|
+
Display details about the system dictionary used by MeCab:
|
97
|
+
|
100
98
|
puts nm.libpath
|
101
99
|
=> /usr/local/lib/libmecab.so
|
102
100
|
|
103
|
-
# reference to MeCab system dictionary
|
104
|
-
#
|
105
101
|
sysdic = nm.dicts.first
|
106
|
-
|
107
|
-
# display full pathname to system dictionary file
|
108
|
-
#
|
109
102
|
puts sysdic.filepath
|
110
103
|
=> /usr/local/lib/mecab/dic/ipadic/sys.dic
|
111
104
|
|
112
|
-
# what charset (encoding) is the system dictionary?
|
113
|
-
#
|
114
105
|
puts sysdic.charset
|
115
106
|
=> utf8
|
116
|
-
|
117
|
-
|
118
|
-
|
107
|
+
|
108
|
+
----
|
109
|
+
|
110
|
+
Parse Japanese text and send the MeCab result as a single string to stdout:
|
111
|
+
|
119
112
|
puts nm.parse('俺の名前は星野豊だ!!そこんとこヨロシク!')
|
120
113
|
俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
|
121
114
|
の 助詞,連体化,*,*,*,*,の,ノ,ノ
|
@@ -133,12 +126,22 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
133
126
|
! 記号,一般,*,*,*,*,!,!,!
|
134
127
|
EOS
|
135
128
|
|
136
|
-
|
137
|
-
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
129
|
+
----
|
130
|
+
|
131
|
+
If a block is passed to `parse`, you can iterate over the list of resulting `MeCabNode`
|
132
|
+
instances to access more detailed information about each morpheme.
|
133
|
+
|
134
|
+
In this example, the following attributes and methods for `MeCabNode` are used:
|
135
|
+
|
136
|
+
- `surface` - the morpheme surface
|
137
|
+
- `posid` - node part-of-speech ID (dictionary-dependent)
|
138
|
+
- `is_eos?` - is this `MeCabNode` an end-of-sentence node?
|
139
|
+
|
140
|
+
This iterates over the morpheme nodes in the given text,
|
141
|
+
and outputs a formatted, tab-delimited line with the
|
142
|
+
morpheme surface and part-of-speech ID, ignoring any end-of-sentence
|
143
|
+
nodes:
|
144
|
+
|
142
145
|
nm.parse('世界チャンプ目指してんだなこれがっ!!夢なの、俺のっ!!') do |n|
|
143
146
|
puts "#{n.surface}\tpart-of-speech id: #{n.posid}" if !n.is_eos?
|
144
147
|
end
|
@@ -160,37 +163,46 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
160
163
|
のっ part-of-speech id: 31
|
161
164
|
!! part-of-speech id: 36
|
162
165
|
|
163
|
-
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
168
|
-
|
169
|
-
|
170
|
-
|
171
|
-
|
172
|
-
|
173
|
-
|
166
|
+
----
|
167
|
+
|
168
|
+
For more complex parsing, such as that for natural language
|
169
|
+
processing tasks, it is far more efficient to use `enum_parse` to
|
170
|
+
obtain an [`Enumerator`](http://ruby-doc.org/core-2.2.0/Enumerator.html)
|
171
|
+
to iterate over the resulting `MeCabNode` instances. An `Enumerator`
|
172
|
+
yields each `MeCabNode` instance without first materializing all
|
173
|
+
instances at once, thus being more efficient.
|
174
|
+
|
175
|
+
This example uses the `-F` node-format option to customize
|
176
|
+
the resulting `MeCabNode` feature attribute to extract:
|
177
|
+
|
178
|
+
- `%m` - morpheme surface
|
179
|
+
- `%f[0]` - node part-of-speech
|
180
|
+
- `%f[7]` - reading
|
181
|
+
|
182
|
+
Note that we can move the `Enumerator` both forwards and backwards, rewind it
|
183
|
+
back to the beginning, and then iterate over it.
|
174
184
|
|
185
|
+
nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
|
186
|
+
|
175
187
|
enum = nm.enum_parse('この星の一等賞になりたいの卓球で俺は、そんだけ!')
|
176
188
|
=> #<Enumerator: #<Enumerator::Generator:0x00000002ff3898>:each>
|
177
|
-
|
189
|
+
|
178
190
|
enum.next
|
179
191
|
=> #<Natto::MeCabNode:0x000000032eed68 \
|
180
192
|
@pointer=#<FFI::Pointer address=0x000000005ffb48>, \
|
181
193
|
stat=0, \
|
182
194
|
@surface="この", \
|
183
195
|
@feature="この 連体詞 コノ">
|
184
|
-
|
196
|
+
|
185
197
|
enum.peek
|
186
198
|
=> #<Natto::MeCabNode:0x00000002fe2110a \
|
187
199
|
@pointer=#<FFI::Pointer address=0x000000005ffdb8>, \
|
188
200
|
stat=0, \
|
189
201
|
@surface="星", \
|
190
202
|
@feature="星 名詞 ホシ">
|
191
|
-
|
203
|
+
|
192
204
|
enum.rewind
|
193
|
-
|
205
|
+
|
194
206
|
# again, ignore any end-of-sentence nodes
|
195
207
|
enum.each { |n| puts n.feature if !n.is_eos? }
|
196
208
|
この 連体詞 コノ
|
@@ -211,7 +223,52 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
211
223
|
だけ 助詞 ダケ
|
212
224
|
! 記号 !
|
213
225
|
|
214
|
-
|
226
|
+
----
|
227
|
+
|
228
|
+
[Partial parsing](http://taku910.github.io/mecab/partial.html) allows you to
|
229
|
+
pass hints to MeCab on how to tokenize morphemes when parsing. With boundary
|
230
|
+
constraint parsing, you can specify either
|
231
|
+
a [Regexp](http://ruby-doc.org/core-2.2.1/Regexp.html) or
|
232
|
+
[String](http://ruby-doc.org/core-2.2.1/String.html) to tell MeCab where the
|
233
|
+
boundaries of a morpheme should be. Use the new `boundary_constraints` keyword.
|
234
|
+
For hints on tokenization, please see
|
235
|
+
[String#scan](http://ruby-doc.org/core-2.2.1/String.html#method-i-scan)
|
236
|
+
|
237
|
+
This example uses the `-F` node-format option to customize
|
238
|
+
the resulting `MeCabNode` feature attribute to extract:
|
239
|
+
|
240
|
+
- `%m` - morpheme surface
|
241
|
+
- `%f[0]` - node part-of-speech
|
242
|
+
- `%s` - node `stat` status value, 1 is `unknown`
|
243
|
+
|
244
|
+
Note that any such morphemes captured will have node `stat` status of unknown.
|
245
|
+
|
246
|
+
nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
|
247
|
+
|
248
|
+
text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
|
249
|
+
pattern = /ヒーロー見参/
|
250
|
+
|
251
|
+
nm.enum_parse(text, boundary_constraints: pattern).each do |n|
|
252
|
+
puts n.feature if !(n.is_bos? || n.is_eos?)
|
253
|
+
end
|
254
|
+
|
255
|
+
# desired morpheme boundary specified with Regexp /ヒーロー見参/
|
256
|
+
心, 名詞, 0
|
257
|
+
の, 助詞, 0
|
258
|
+
中, 名詞, 0
|
259
|
+
で, 助詞, 0
|
260
|
+
3, 名詞, 1
|
261
|
+
回, 名詞, 0
|
262
|
+
唱え, 動詞, 0
|
263
|
+
、, 記号, 0
|
264
|
+
ヒーロー見参, 名詞, 1
|
265
|
+
!, 記号, 0
|
266
|
+
ヒーロー見参, 名詞, 1
|
267
|
+
!, 記号, 0
|
268
|
+
ヒーロー見参, 名詞, 1
|
269
|
+
!, 記号, 0
|
270
|
+
|
271
|
+
|
215
272
|
|
216
273
|
## Learn more
|
217
274
|
- You can read more about natto on the [project Wiki](https://github.com/buruzaemon/natto/wiki).
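The README hunk above says that an `Enumerator` yields each `MeCabNode` without first materializing all instances, and that it can be moved forward, peeked at, and rewound. The following is a minimal sketch of that behaviour in plain Ruby, not natto's own code: the surfaces array is a hypothetical stand-in for parsed nodes, and the `produced` array exists only to show when each item is actually generated.

```ruby
# Hypothetical stand-in for parsed morphemes; a real run would come from MeCab.
surfaces = %w[この 星 の 一等賞]
produced = [] # records which items the generator has actually emitted

enum = Enumerator.new do |y|
  surfaces.each do |surface|
    produced << surface # side effect marks the moment of generation
    y.yield surface
  end
end

first = enum.next                    # runs the generator up to its first yield
produced_after_next = produced.dup   # snapshot: only one item exists so far
second = enum.peek                   # looks at the following item without consuming it
enum.rewind                          # resets external iteration to the start
all = enum.to_a                      # a fresh pass from the beginning yields everything
```

Only the items requested so far are generated, which is why the README recommends `enum_parse` for larger inputs.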
|
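Boundary constraint parsing, as described in the README hunk above, tells MeCab where morpheme boundaries should fall. The sketch below illustrates how per-byte marks could be derived from a pattern, following the marking loop in the `natto.rb` hunk of this diff (first byte of each token is a token boundary; remaining bytes are "inside token" for matched spans and "any boundary" otherwise). It is an illustration under assumptions: `split_on_pattern` and `boundary_marks` are hypothetical names, and `split_on_pattern` merely stands in for the gem's `tokenize` helper, whose body does not appear in this diff. The constant values are the ones added to `natto.rb`.

```ruby
MECAB_ANY_BOUNDARY   = 0
MECAB_TOKEN_BOUNDARY = 1
MECAB_INSIDE_TOKEN   = 2

# Hypothetical helper: split text into [substring, matched?] pairs.
def split_on_pattern(text, pattern)
  tokens = []
  rest = text
  until rest.empty?
    before, match, after = rest.partition(pattern)
    if match.empty? # no match (or a zero-width one): keep the remainder as-is
      tokens << [rest, false]
      break
    end
    tokens << [before, false] unless before.empty?
    tokens << [match, true]
    rest = after
  end
  tokens
end

# Hypothetical helper: one mark per byte of the text, in order.
def boundary_marks(tokens)
  tokens.flat_map do |substring, matched|
    inner = matched ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
    [MECAB_TOKEN_BOUNDARY] + [inner] * (substring.bytesize - 1)
  end
end

tokens = split_on_pattern('ab見参', /見参/)
marks  = boundary_marks(tokens)
# tokens → [["ab", false], ["見参", true]]
# marks  → [1, 0, 1, 2, 2, 2, 2, 2]  ("見参" is 6 bytes in UTF-8)
```

In the gem itself each mark would be passed to `mecab_lattice_set_boundary_constraint` at the corresponding byte position; the sketch stops at computing the marks so that it runs without MeCab installed.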
data/lib/natto/binding.rb
CHANGED
@@ -66,12 +66,7 @@ module Natto
|
|
66
66
|
|
67
67
|
ffi_lib find_library
|
68
68
|
|
69
|
-
#
|
70
|
-
attach_function :mecab_model_new2, [:string], :pointer
|
71
|
-
attach_function :mecab_model_destroy, [:pointer], :void
|
72
|
-
attach_function :mecab_model_dictionary_info, [:pointer], :pointer
|
73
|
-
|
74
|
-
# old interface
|
69
|
+
# C interface
|
75
70
|
attach_function :mecab_new2, [:string], :pointer
|
76
71
|
attach_function :mecab_version, [], :string
|
77
72
|
attach_function :mecab_strerror, [:pointer],:string
|
@@ -88,6 +83,26 @@ module Natto
|
|
88
83
|
attach_function :mecab_format_node, [:pointer, :pointer], :string
|
89
84
|
attach_function :mecab_dictionary_info, [:pointer], :pointer
|
90
85
|
|
86
|
+
attach_function :mecab_lattice_new, [], :pointer
|
87
|
+
attach_function :mecab_lattice_destroy, [:pointer], :void
|
88
|
+
attach_function :mecab_lattice_clear, [:pointer], :void
|
89
|
+
attach_function :mecab_lattice_is_available, [:pointer], :int
|
90
|
+
attach_function :mecab_lattice_get_bos_node, [:pointer], :pointer
|
91
|
+
attach_function :mecab_lattice_set_sentence, [:pointer, :string], :void
|
92
|
+
attach_function :mecab_lattice_get_size, [:pointer], :int
|
93
|
+
attach_function :mecab_lattice_set_z, [:pointer, :float], :void
|
94
|
+
attach_function :mecab_lattice_set_theta, [:pointer, :float], :void
|
95
|
+
attach_function :mecab_lattice_next, [:pointer], :int
|
96
|
+
attach_function :mecab_lattice_get_request_type, [:pointer], :int
|
97
|
+
attach_function :mecab_lattice_add_request_type, [:pointer, :int], :void
|
98
|
+
attach_function :mecab_lattice_set_request_type, [:pointer, :int], :void
|
99
|
+
attach_function :mecab_lattice_tostr, [:pointer], :string
|
100
|
+
attach_function :mecab_lattice_nbest_tostr, [:pointer, :int], :string
|
101
|
+
attach_function :mecab_lattice_get_boundary_constraint, [:pointer, :int], :int
|
102
|
+
attach_function :mecab_lattice_set_boundary_constraint, [:pointer, :int, :int], :void
|
103
|
+
attach_function :mecab_parse_lattice, [:pointer, :pointer], :int
|
104
|
+
attach_function :mecab_lattice_strerror, [:pointer], :string
|
105
|
+
|
91
106
|
# @private
|
92
107
|
module ClassMethods
|
93
108
|
|
@@ -95,18 +110,6 @@ module Natto
|
|
95
110
|
Natto::Binding.find_library
|
96
111
|
end
|
97
112
|
|
98
|
-
def mecab_model_new2(options_str)
|
99
|
-
Natto::Binding.mecab_model_new2(options_str)
|
100
|
-
end
|
101
|
-
|
102
|
-
def mecab_model_destroy(m_ptr)
|
103
|
-
Natto::Binding.mecab_model_destroy(m_ptr)
|
104
|
-
end
|
105
|
-
|
106
|
-
def mecab_model_dictionary_info(m_ptr)
|
107
|
-
Natto::Binding.mecab_model_dictionary_info(m_ptr)
|
108
|
-
end
|
109
|
-
|
110
113
|
# ----------------------------------------
|
111
114
|
def mecab_new2(options_str)
|
112
115
|
Natto::Binding.mecab_new2(options_str)
|
@@ -116,60 +119,136 @@ module Natto
|
|
116
119
|
Natto::Binding.mecab_version
|
117
120
|
end
|
118
121
|
|
119
|
-
def mecab_strerror(
|
120
|
-
Natto::Binding.mecab_strerror(
|
122
|
+
def mecab_strerror(tptr)
|
123
|
+
Natto::Binding.mecab_strerror(tptr)
|
121
124
|
end
|
122
125
|
|
123
|
-
def mecab_destroy(
|
124
|
-
Natto::Binding.mecab_destroy(
|
126
|
+
def mecab_destroy(tptr)
|
127
|
+
Natto::Binding.mecab_destroy(tptr)
|
125
128
|
end
|
126
129
|
|
127
|
-
def mecab_set_partial(
|
128
|
-
Natto::Binding.mecab_set_partial(
|
130
|
+
def mecab_set_partial(tptr, ll)
|
131
|
+
Natto::Binding.mecab_set_partial(tptr, ll)
|
129
132
|
end
|
130
133
|
|
131
|
-
def mecab_set_theta(
|
132
|
-
Natto::Binding.mecab_set_theta(
|
134
|
+
def mecab_set_theta(tptr, t)
|
135
|
+
Natto::Binding.mecab_set_theta(tptr, t)
|
133
136
|
end
|
134
137
|
|
135
|
-
def mecab_set_lattice_level(
|
136
|
-
Natto::Binding.mecab_set_lattice_level(
|
138
|
+
def mecab_set_lattice_level(tptr, ll)
|
139
|
+
Natto::Binding.mecab_set_lattice_level(tptr, ll)
|
137
140
|
end
|
138
141
|
|
139
|
-
def mecab_set_all_morphs(
|
140
|
-
Natto::Binding.mecab_set_all_morphs(
|
142
|
+
def mecab_set_all_morphs(tptr, am)
|
143
|
+
Natto::Binding.mecab_set_all_morphs(tptr, am)
|
141
144
|
end
|
142
145
|
|
143
|
-
def mecab_sparse_tostr(
|
144
|
-
Natto::Binding.mecab_sparse_tostr(
|
146
|
+
def mecab_sparse_tostr(tptr, str)
|
147
|
+
Natto::Binding.mecab_sparse_tostr(tptr, str)
|
145
148
|
end
|
146
149
|
|
147
|
-
def mecab_sparse_tonode(
|
148
|
-
Natto::Binding.mecab_sparse_tonode(
|
150
|
+
def mecab_sparse_tonode(tptr, str)
|
151
|
+
Natto::Binding.mecab_sparse_tonode(tptr, str)
|
149
152
|
end
|
150
153
|
|
151
|
-
def mecab_nbest_next_tonode(
|
152
|
-
Natto::Binding.mecab_nbest_next_tonode(
|
154
|
+
def mecab_nbest_next_tonode(tptr)
|
155
|
+
Natto::Binding.mecab_nbest_next_tonode(tptr)
|
153
156
|
end
|
154
157
|
|
155
|
-
def mecab_nbest_init(
|
156
|
-
Natto::Binding.mecab_nbest_init(
|
158
|
+
def mecab_nbest_init(tptr, str)
|
159
|
+
Natto::Binding.mecab_nbest_init(tptr, str)
|
157
160
|
end
|
158
161
|
|
159
|
-
def mecab_nbest_sparse_tostr(
|
160
|
-
Natto::Binding.mecab_nbest_sparse_tostr(
|
162
|
+
def mecab_nbest_sparse_tostr(tptr, n, str)
|
163
|
+
Natto::Binding.mecab_nbest_sparse_tostr(tptr, n, str)
|
161
164
|
end
|
162
165
|
|
163
|
-
def mecab_nbest_next_tonode(
|
164
|
-
Natto::Binding.mecab_nbest_next_tonode(
|
166
|
+
def mecab_nbest_next_tonode(tptr)
|
167
|
+
Natto::Binding.mecab_nbest_next_tonode(tptr)
|
165
168
|
end
|
166
169
|
|
167
|
-
def mecab_format_node(
|
168
|
-
Natto::Binding.mecab_format_node(
|
170
|
+
def mecab_format_node(tptr, nptr)
|
171
|
+
Natto::Binding.mecab_format_node(tptr, nptr)
|
172
|
+
end
|
173
|
+
|
174
|
+
def mecab_dictionary_info(tptr)
|
175
|
+
Natto::Binding.mecab_dictionary_info(tptr)
|
169
176
|
end
|
170
177
|
|
171
|
-
def
|
172
|
-
Natto::Binding.
|
178
|
+
def mecab_lattice_new()
|
179
|
+
Natto::Binding.mecab_lattice_new()
|
180
|
+
end
|
181
|
+
|
182
|
+
def mecab_lattice_destroy(lptr)
|
183
|
+
Natto::Binding.mecab_lattice_destroy(lptr)
|
184
|
+
end
|
185
|
+
|
186
|
+
def mecab_lattice_clear(lptr)
|
187
|
+
Natto::Binding.mecab_lattice_clear(lptr)
|
188
|
+
end
|
189
|
+
|
190
|
+
def mecab_lattice_is_available(lptr)
|
191
|
+
Natto::Binding.mecab_lattice_is_available(lptr)
|
192
|
+
end
|
193
|
+
|
194
|
+
def mecab_lattice_get_bos_node(lptr)
|
195
|
+
Natto::Binding.mecab_lattice_get_bos_node(lptr)
|
196
|
+
end
|
197
|
+
|
198
|
+
def mecab_lattice_set_sentence(lptr, str)
|
199
|
+
Natto::Binding.mecab_lattice_set_sentence(lptr, str)
|
200
|
+
end
|
201
|
+
|
202
|
+
def mecab_lattice_get_size(lptr)
|
203
|
+
Natto::Binding.mecab_lattice_get_size(lptr)
|
204
|
+
end
|
205
|
+
|
206
|
+
def mecab_lattice_set_z(lptr, z)
|
207
|
+
Natto::Binding.mecab_lattice_set_z(lptr, z)
|
208
|
+
end
|
209
|
+
|
210
|
+
def mecab_lattice_set_theta(lptr, t)
|
211
|
+
Natto::Binding.mecab_lattice_set_theta(lptr, t)
|
212
|
+
end
|
213
|
+
|
214
|
+
def mecab_lattice_next(lptr)
|
215
|
+
Natto::Binding.mecab_lattice_next(lptr)
|
216
|
+
end
|
217
|
+
|
218
|
+
def mecab_lattice_get_request_type(lptr)
|
219
|
+
Natto::Binding.mecab_lattice_get_request_type(lptr)
|
220
|
+
end
|
221
|
+
|
222
|
+
def mecab_lattice_add_request_type(lptr, rtype)
|
223
|
+
Natto::Binding.mecab_lattice_add_request_type(lptr, rtype)
|
224
|
+
end
|
225
|
+
|
226
|
+
def mecab_lattice_set_request_type(lptr, rtype)
|
227
|
+
Natto::Binding.mecab_lattice_set_request_type(lptr, rtype)
|
228
|
+
end
|
229
|
+
|
230
|
+
def mecab_lattice_tostr(lptr)
|
231
|
+
Natto::Binding.mecab_lattice_tostr(lptr)
|
232
|
+
end
|
233
|
+
|
234
|
+
def mecab_lattice_nbest_tostr(lptr, n)
|
235
|
+
Natto::Binding.mecab_lattice_nbest_tostr(lptr, n)
|
236
|
+
end
|
237
|
+
|
238
|
+
def mecab_lattice_get_boundary_constraint(lptr, pos)
|
239
|
+
Natto::Binding.mecab_lattice_get_boundary_constraint(lptr, pos)
|
240
|
+
end
|
241
|
+
|
242
|
+
def mecab_lattice_set_boundary_constraint(lptr, pos, btype)
|
243
|
+
Natto::Binding.mecab_lattice_set_boundary_constraint(lptr, pos, btype)
|
244
|
+
end
|
245
|
+
|
246
|
+
def mecab_parse_lattice(tptr, lptr)
|
247
|
+
Natto::Binding.mecab_parse_lattice(tptr, lptr)
|
248
|
+
end
|
249
|
+
|
250
|
+
def mecab_lattice_strerror(lptr)
|
251
|
+
Natto::Binding.mecab_lattice_strerror(lptr)
|
173
252
|
end
|
174
253
|
end
|
175
254
|
end
|
data/lib/natto/natto.rb
CHANGED
@@ -4,8 +4,8 @@ require 'natto/option_parse'
|
|
4
4
|
require 'natto/struct'
|
5
5
|
|
6
6
|
module Natto
|
7
|
-
# `MeCab` is a wrapper class for the `
|
8
|
-
# Options to the `
|
7
|
+
# `MeCab` is a wrapper class for the MeCab `Tagger`.
|
8
|
+
# Options to the MeCab `Tagger` are passed in as a string
|
9
9
|
# (MeCab command-line style) or as a Ruby-style hash at
|
10
10
|
# initialization.
|
11
11
|
#
|
@@ -45,6 +45,7 @@ module Natto
|
|
45
45
|
# 。 記号,句点,*,*,*,*,。,。,。
|
46
46
|
# EOS
|
47
47
|
#
|
48
|
+
#
|
48
49
|
# # pass a block to iterate over each MeCabNode instance
|
49
50
|
# #
|
50
51
|
# nm.parse(text) do |n|
|
@@ -101,10 +102,52 @@ module Natto
|
|
101
102
|
# 13: 。,。,記号,7
|
102
103
|
#
|
103
104
|
#
|
105
|
+
# # Boundary constraint parsing with output formatting.
|
106
|
+
# # %m ... morpheme surface
|
107
|
+
# # %F, ... comma-delimited ChaSen feature values
|
108
|
+
# # part-of-speech (index 0)
|
109
|
+
# # %2 ... MeCab node status value (1 unknown)
|
110
|
+
# #
|
111
|
+
# nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
|
112
|
+
#
|
113
|
+
# enum = nm.enum_parse(text, boundary_constraint: /見えねえ風景/)
|
114
|
+
# => #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>
|
115
|
+
#
|
116
|
+
# # output the feature attribute of each MeCabNode
|
117
|
+
# # ignoring any beginning- or end-of-sentence nodes
|
118
|
+
# #
|
119
|
+
# enum.each do |n|
|
120
|
+
# puts n.feature if !(n.is_bos? or n.is_eos?)
|
121
|
+
# end
|
122
|
+
# 凡人, 名詞, 0
|
123
|
+
# に, 助詞, 0
|
124
|
+
# しか, 助詞, 0
|
125
|
+
# 見えねえ風景, 名詞, 1
|
126
|
+
# って, 助詞, 0
|
127
|
+
# の, 名詞, 0
|
128
|
+
# が, 助詞, 0
|
129
|
+
# ある, 動詞, 0
|
130
|
+
# ん, 名詞, 0
|
131
|
+
# だ, 助動詞, 0
|
132
|
+
# よ, 助詞, 0
|
133
|
+
# 。, 記号, 0
|
134
|
+
#
|
104
135
|
class MeCab
|
105
136
|
include Natto::Binding
|
106
137
|
include Natto::OptionParse
|
107
138
|
|
139
|
+
MECAB_LATTICE_ONE_BEST = 1
|
140
|
+
MECAB_LATTICE_NBEST = 2
|
141
|
+
MECAB_LATTICE_PARTIAL = 4
|
142
|
+
MECAB_LATTICE_MARGINAL_PROB = 8
|
143
|
+
MECAB_LATTICE_ALTERNATIVE = 16
|
144
|
+
MECAB_LATTICE_ALL_MORPHS = 32
|
145
|
+
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
|
146
|
+
|
147
|
+
MECAB_ANY_BOUNDARY = 0
|
148
|
+
MECAB_TOKEN_BOUNDARY = 1
|
149
|
+
MECAB_INSIDE_TOKEN = 2
|
150
|
+
|
108
151
|
# @return [FFI:Pointer] pointer to MeCab tagger.
|
109
152
|
attr_reader :tagger
|
110
153
|
# @return [String] absolute filepath to MeCab library.
|
@@ -113,10 +156,10 @@ module Natto
|
|
113
156
|
attr_reader :options
|
114
157
|
# @return [Array] listing of all of dictionaries referenced.
|
115
158
|
attr_reader :dicts
|
116
|
-
# @return [String] `MeCab`
|
159
|
+
# @return [String] `MeCab` version.
|
117
160
|
attr_reader :version
|
118
161
|
|
119
|
-
# Initializes the wrapped `
|
162
|
+
# Initializes the wrapped `Tagger` instance with the
|
120
163
|
# given `options`.
|
121
164
|
#
|
122
165
|
# Options supported are:
|
@@ -174,7 +217,7 @@ module Natto
|
|
174
217
|
# 。 。
|
175
218
|
# EOS
|
176
219
|
#
|
177
|
-
# @param [Hash, String]
|
220
|
+
# @param options [Hash, String] the MeCab options for tagger
|
178
221
|
# @raise [MeCabError] if `mecab` cannot be initialized with the given `options`
|
179
222
|
def initialize(options={})
|
180
223
|
@options = self.class.parse_mecab_options(options)
|
@@ -191,31 +234,159 @@ module Natto
|
|
191
234
|
self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
|
192
235
|
self.mecab_set_partial(@tagger, 1) if @options[:partial]
|
193
236
|
|
194
|
-
#
|
195
|
-
#
|
196
|
-
|
197
|
-
|
198
|
-
|
199
|
-
|
200
|
-
|
237
|
+
# Define lambda for each major parsing type: _tostr, _tonode,
|
238
|
+
# boundary constraint _tostr, boundary constraint _node;
|
239
|
+
# and each parsing type will support both normal and N-best
|
240
|
+
# options
|
241
|
+
@parse_tostr = ->(text) {
|
242
|
+
if @options[:nbest] && @options[:nbest] > 1
|
243
|
+
#self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
201
244
|
retval = self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], text) ||
|
202
245
|
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
203
|
-
|
204
|
-
|
246
|
+
else
|
247
|
+
retval = self.mecab_sparse_tostr(@tagger, text) ||
|
248
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
249
|
+
end
|
205
250
|
|
206
|
-
|
207
|
-
|
208
|
-
self.mecab_nbest_init(@tagger, text)
|
209
|
-
n = self.mecab_nbest_next_tonode(@tagger)
|
210
|
-
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
251
|
+
retval.force_encoding(Encoding.default_external)
|
252
|
+
}
|
211
253
|
|
254
|
+
@parse_tonodes = ->(text) {
|
255
|
+
Enumerator.new do |y|
|
256
|
+
if @options[:nbest] && @options[:nbest] > 1
|
212
257
|
nlen = @options[:nbest]
|
213
|
-
|
214
|
-
|
215
|
-
|
216
|
-
|
217
|
-
|
218
|
-
|
258
|
+
#self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
259
|
+
self.mecab_nbest_init(@tagger, text)
|
260
|
+
nptr = self.mecab_nbest_next_tonode(@tagger)
|
261
|
+
else
|
262
|
+
nlen = 1
|
263
|
+
nptr = self.mecab_sparse_tonode(@tagger, text)
|
264
|
+
end
|
265
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if nptr.nil? || nptr.address==0x0
|
266
|
+
|
267
|
+
nlen.times do
|
268
|
+
s = text.bytes.to_a
|
269
|
+
while nptr && nptr.address != 0x0
|
270
|
+
mn = Natto::MeCabNode.new(nptr)
|
271
|
+
# ignore BOS nodes, since mecab does so
|
272
|
+
if !mn.is_bos?
|
273
|
+
s = s.drop_while {|e| (e==0xa || e==0x20)}
|
274
|
+
if !s.empty?
|
275
|
+
sarr = []
|
276
|
+
mn.length.times { sarr << s.shift }
|
277
|
+
surf = sarr.pack('C*')
|
278
|
+
mn.surface = surf.force_encoding(Encoding.default_external)
|
279
|
+
end
|
280
|
+
if @options[:output_format_type] || @options[:node_format]
|
281
|
+
mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
|
282
|
+
end
|
283
|
+
y.yield mn
|
284
|
+
end
|
285
|
+
nptr = mn.next
|
286
|
+
end
|
287
|
+
if nlen > 1
|
288
|
+
nptr = self.mecab_nbest_next_tonode(@tagger)
|
289
|
+
end
|
290
|
+
end
|
291
|
+
end
|
292
|
+
}
|
293
|
+
|
294
|
+
@bcparse_tostr = ->(text, boundary_constraints=/./) {
|
295
|
+
begin
|
296
|
+
lattice = self.mecab_lattice_new()
|
297
|
+
raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
|
298
|
+
|
299
|
+
if @options[:nbest] && @options[:nbest] > 1
|
300
|
+
n = @options[:nbest]
|
301
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
|
302
|
+
else
|
303
|
+
n = 1
|
304
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
|
305
|
+
end
|
306
|
+
if @options[:theta]
|
307
|
+
self.mecab_lattice_set_theta(lattice, @options[:theta])
|
308
|
+
end
|
309
|
+
|
310
|
+
tokens = tokenize(text, boundary_constraints)
|
311
|
+
text = tokens.map {|t| t.first}.join
|
312
|
+
self.mecab_lattice_set_sentence(lattice, text)
|
313
|
+
|
314
|
+
bpos = 0
|
315
|
+
tokens.each do |token|
|
316
|
+
c = token.first.bytes.count
|
317
|
+
|
318
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
|
319
|
+
bpos += 1
|
320
|
+
|
321
|
+
mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
|
322
|
+
(c-1).times do
|
323
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
|
324
|
+
bpos += 1
|
325
|
+
end
|
326
|
+
end
|
327
|
+
|
328
|
+
self.mecab_parse_lattice(@tagger, lattice)
|
329
|
+
|
330
|
+
if n > 1
|
331
|
+
retval = self.mecab_lattice_nbest_tostr(lattice, n)
|
332
|
+
else
|
333
|
+
retval = self.mecab_lattice_tostr(lattice)
|
334
|
+
end
|
335
|
+
retval.force_encoding(Encoding.default_external)
|
336
|
+
rescue
|
337
|
+
raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
|
338
|
+
ensure
|
339
|
+
if lattice.address != 0x0
|
340
|
+
self.mecab_lattice_destroy(lattice)
|
341
|
+
end
|
342
|
+
end
|
343
|
+
}
|
344
|
+
|
345
|
+
@bcparse_tonodes = ->(text, boundary_constraints=/./) {
|
346
|
+
Enumerator.new do |y|
|
347
|
+
begin
|
348
|
+
lattice = self.mecab_lattice_new()
|
349
|
+
raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
|
350
|
+
|
351
|
+
if @options[:nbest] && @options[:nbest] > 1
|
352
|
+
n = @options[:nbest]
|
353
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
|
354
|
+
else
|
355
|
+
n = 1
|
356
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
|
357
|
+
end
|
358
|
+
if @options[:theta]
|
359
|
+
self.mecab_lattice_set_theta(lattice, @options[:theta])
|
360
|
+
end
|
361
|
+
|
362
|
+
tokens = tokenize(text, boundary_constraints)
|
363
|
+
text = tokens.map {|t| t.first}.join
|
364
|
+
self.mecab_lattice_set_sentence(lattice, text)
|
365
|
+
|
366
|
+
bpos = 0
|
367
|
+
tokens.each do |token|
|
368
|
+
c = token.first.bytes.count
|
369
|
+
|
370
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
|
371
|
+
bpos += 1
|
372
|
+
|
373
|
+
mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
|
374
|
+
(c-1).times do
|
375
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
|
376
|
+
bpos += 1
|
377
|
+
end
|
378
|
+
end
|
379
|
+
|
380
|
+
self.mecab_parse_lattice(@tagger, lattice)
|
381
|
+
|
382
|
+
n.times do
|
383
|
+
check = self.mecab_lattice_next(lattice)
|
384
|
+
if check
|
385
|
+
nptr = self.mecab_lattice_get_bos_node(lattice)
|
386
|
+
|
387
|
+
s = text.bytes.to_a
|
388
|
+
while nptr && nptr.address!=0x0
|
389
|
+
mn = Natto::MeCabNode.new(nptr)
|
219
390
|
s = s.drop_while {|e| (e==0xa || e==0x20)}
|
220
391
|
if !s.empty?
|
221
392
|
sarr = []
|
@@ -224,50 +395,22 @@ module Natto
|
|
224
395
|
mn.surface = surf.force_encoding(Encoding.default_external)
|
225
396
|
end
|
226
397
|
if @options[:output_format_type] || @options[:node_format]
|
227
|
-
mn.feature = self.mecab_format_node(@tagger,
|
398
|
+
mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
|
228
399
|
end
|
229
400
|
y.yield mn
|
401
|
+
nptr = mn.next
|
230
402
|
end
|
231
|
-
n = mn.next
|
232
403
|
end
|
233
|
-
n = self.mecab_nbest_next_tonode(@tagger)
|
234
404
|
end
|
235
|
-
|
236
|
-
|
237
|
-
|
238
|
-
|
239
|
-
|
240
|
-
retval = self.mecab_sparse_tostr(@tagger, text) ||
|
241
|
-
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
242
|
-
retval.force_encoding(Encoding.default_external)
|
243
|
-
end
|
244
|
-
|
245
|
-
@parse_tonodes = lambda do |text|
|
246
|
-
Enumerator.new do |y|
|
247
|
-
n = self.mecab_sparse_tonode(@tagger, text)
|
248
|
-
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
249
|
-
|
250
|
-
mn = Natto::MeCabNode.new(n)
|
251
|
-
n = mn.next if mn.next.address!=0x0
|
252
|
-
s = text.bytes.to_a
|
253
|
-
while n && n.address!=0x0
|
254
|
-
mn = Natto::MeCabNode.new(n)
|
255
|
-
s = s.drop_while {|e| (e==0xa || e==0x20)}
|
256
|
-
if !s.empty?
|
257
|
-
sarr = []
|
258
|
-
mn.length.times { sarr << s.shift }
|
259
|
-
surf = sarr.pack('C*')
|
260
|
-
mn.surface = surf.force_encoding(Encoding.default_external)
|
261
|
-
end
|
262
|
-
if @options[:output_format_type] || @options[:node_format]
|
263
|
-
mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
|
264
|
-
end
|
265
|
-
y.yield mn
|
266
|
-
n = mn.next
|
405
|
+
rescue
|
406
|
+
raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
|
407
|
+
ensure
|
408
|
+
if lattice.address != 0x0
|
409
|
+
self.mecab_lattice_destroy(lattice)
|
267
410
|
end
|
268
411
|
end
|
269
412
|
end
|
270
|
-
|
413
|
+
}
|
271
414
|
|
272
415
|
@dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
|
273
416
|
while @dicts.last.next.address != 0x0
|
@@ -283,17 +426,35 @@ module Natto
|
|
283
426
|
# If a block is passed to this method, then node parsing will be used
|
284
427
|
# and each node yielded to the given block.
|
285
428
|
#
|
286
|
-
#
|
429
|
+
# Boundary constraint parsing is available via passing in the
|
430
|
+
# `boundary_constraints` key in the `options` hash. Boundary constraints
|
431
|
+
# parsing provides hints to MeCab on where the morpheme boundaries in the
|
432
|
+
# given `text` are located. `boundary_constraints` value may be either a
|
433
|
+
# `Regexp` or `String`; please see
|
434
|
+
# [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
|
435
|
+
# The boundary constraint parsed output will be returned as a single
|
436
|
+
# string, unless a block is passed to this method for node parsing.
|
437
|
+
#
|
438
|
+
# @param text [String] the Japanese text to parse
|
439
|
+
# @param options [Hash] only the `boundary_constraints` key is available
|
287
440
|
# @return [String] parsing result from `mecab`
|
288
441
|
# @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
|
289
442
|
# @raise [ArgumentError] if the given string `text` argument is `nil`
|
290
443
|
# @see MeCabNode
|
291
|
-
def parse(text)
|
444
|
+
def parse(text, options={})
|
292
445
|
raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
|
293
|
-
if
|
294
|
-
|
446
|
+
if options[:boundary_constraints]
|
447
|
+
if block_given?
|
448
|
+
@bcparse_tonodes.call(text, options[:boundary_constraints]).each {|n| yield n }
|
449
|
+
else
|
450
|
+
@bcparse_tostr.call(text, options[:boundary_constraints])
|
451
|
+
end
|
295
452
|
else
|
296
|
-
|
453
|
+
if block_given?
|
454
|
+
@parse_tonodes.call(text).each {|n| yield n }
|
455
|
+
else
|
456
|
+
@parse_tostr.call(text)
|
457
|
+
end
|
297
458
|
end
|
298
459
|
end
|
299
460
|
|
@@ -301,73 +462,34 @@ module Natto
|
|
301
462
|
# {http://www.ruby-doc.org/core-2.1.5/Enumerator.html Enumerator} that may be
|
302
463
|
# used to iterate over the resulting {MeCabNode} objects. This is more
|
303
464
|
# efficient than parsing to a simple string, since each node's
|
304
|
-
# information will not be materialized all at once as
|
465
|
+
# information will not be materialized all at once as it is with
|
305
466
|
# string output.
|
306
467
|
#
|
307
468
|
# MeCab nodes contain much more detailed information about
|
308
469
|
# the morpheme. Node-formatting may also be used to customize
|
309
470
|
# the resulting node's `feature` attribute.
|
310
471
|
#
|
311
|
-
#
|
472
|
+
# Boundary constraint parsing is available via passing in the
|
473
|
+
# `boundary_constraints` key in the `options` hash. Boundary constraints
|
474
|
+
# parsing provides hints to MeCab on where the morpheme boundaries in the
|
475
|
+
# given `text` are located. `boundary_constraints` value may be either a
|
476
|
+
# `Regexp` or `String`; please see
|
477
|
+
# [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
|
478
|
+
#
|
479
|
+
# @param text [String] the Japanese text to parse
|
480
|
+
# @param options [Hash] only the `boundary_constraints` key is available
|
312
481
|
# @return [Enumerator] of MeCabNode instances
|
313
482
|
# @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
|
314
483
|
# @raise [ArgumentError] if the given string `text` argument is `nil`
|
315
484
|
# @see MeCabNode
|
316
|
-
# @see http://
|
317
|
-
def enum_parse(text)
|
485
|
+
# @see http://ruby-doc.org/core-2.2.1/Enumerator.html
|
486
|
+
def enum_parse(text, options={})
|
318
487
|
raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
|
319
|
-
|
320
|
-
|
321
|
-
|
322
|
-
|
323
|
-
|
324
|
-
# Parses the given string `str`, and returns
|
325
|
-
# a list of `mecab` nodes.
|
326
|
-
# @param [String] str
|
327
|
-
# @return [Array] of parsed `mecab` nodes.
|
328
|
-
# @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
|
329
|
-
# @raise [ArgumentError] if the given string `str` argument is `nil`
|
330
|
-
# @see MeCabNode
|
331
|
-
def parse_as_nodes(str)
|
332
|
-
$stderr.puts 'DEPRECATED: use enum_parse instead'
|
333
|
-
$stderr.puts ' This method will be removed in the next release!'
|
334
|
-
raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
|
335
|
-
@parse_tonodes.call(str)
|
336
|
-
end
|
337
|
-
|
338
|
-
# @deprecated
|
339
|
-
# DEPRECATED: use enum_parse instead, this convenience method is useless.
|
340
|
-
# Parses the given string `str`, and returns
|
341
|
-
# a list of `mecab` result strings.
|
342
|
-
# @param [String] str
|
343
|
-
# @return [Array] of parsed `mecab` result strings.
|
344
|
-
# @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
|
345
|
-
# @raise [ArgumentError] if the given string `str` argument is `nil`
|
346
|
-
def parse_as_strings(str)
|
347
|
-
$stderr.puts 'DEPRECATED: use enum_parse instead'
|
348
|
-
$stderr.puts ' This method will be removed in the next release!'
|
349
|
-
raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
|
350
|
-
@parse_tostr.call(str).lines.to_a
|
351
|
-
end
|
352
|
-
|
353
|
-
# @deprecated
|
354
|
-
# DEPRECATED: use enum_parse instead, this convenience method is useless.
|
355
|
-
# @param [String] str
|
356
|
-
# @return [Array] of parsed `mecab` nodes.
|
357
|
-
def readnodes(str)
|
358
|
-
$stderr.puts 'DEPRECATED: use enum_parse instead'
|
359
|
-
$stderr.puts ' This method will be removed in the next release!'
|
360
|
-
parse_as_nodes(str)
|
361
|
-
end
|
362
|
-
|
363
|
-
# @deprecated
|
364
|
-
# DEPRECATED: use enum_parse instead, this convenience method is useless.
|
365
|
-
# @param [String] str
|
366
|
-
# @return [Array] of parsed `mecab` result strings.
|
367
|
-
def readlines(str)
|
368
|
-
$stderr.puts 'DEPRECATED: use enum_parse instead'
|
369
|
-
$stderr.puts ' This method will be removed in the next release!'
|
370
|
-
parse_as_strings(str)
|
488
|
+
if options[:boundary_constraints]
|
489
|
+
@bcparse_tonodes.call(text, options[:boundary_constraints])
|
490
|
+
else
|
491
|
+
@parse_tonodes.call(text)
|
492
|
+
end
|
371
493
|
end
|
372
494
|
|
373
495
|
# Returns human-readable details for the wrapped `mecab` tagger.
|
@@ -402,17 +524,41 @@ module Natto
|
|
402
524
|
end
|
403
525
|
|
404
526
|
# Returns a `Proc` that will properly free resources
|
405
|
-
# when this `
|
527
|
+
# when this `Tagger` instance is garbage collected.
|
406
528
|
# The `Proc` returned is registered to be invoked
|
407
|
-
# after the `
|
529
|
+
# after the `Tagger` instance owning `tptr`
|
408
530
|
# has been destroyed.
|
409
531
|
#
|
410
|
-
# @param [FFI::Pointer]
|
532
|
+
# @param tptr [FFI::Pointer] pointer to `Tagger`
|
411
533
|
# @return [Proc] to release `mecab` resources properly
|
412
|
-
def self.create_free_proc(
|
534
|
+
def self.create_free_proc(tptr)
|
413
535
|
Proc.new do
|
414
|
-
self.mecab_destroy(
|
536
|
+
self.mecab_destroy(tptr)
|
537
|
+
end
|
538
|
+
end
|
539
|
+
|
540
|
+
private
|
541
|
+
|
542
|
+
# @private
|
543
|
+
def tokenize(text, pattern)
|
544
|
+
matches = text.scan(pattern)
|
545
|
+
|
546
|
+
acc =[]
|
547
|
+
tmp = text
|
548
|
+
matches.each_with_index do |m,i|
|
549
|
+
bef, mat, aft = tmp.partition(m)
|
550
|
+
unless bef.empty?
|
551
|
+
acc << [bef.strip, false]
|
552
|
+
end
|
553
|
+
unless mat.empty?
|
554
|
+
acc << [mat.strip, true]
|
555
|
+
end
|
556
|
+
if i==matches.size-1 and !aft.empty?
|
557
|
+
acc << [aft.strip, false]
|
558
|
+
end
|
559
|
+
tmp = aft
|
415
560
|
end
|
561
|
+
acc
|
416
562
|
end
|
417
563
|
end
|
418
564
|
|
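The boundary constraint logic added in this file can be tried without a MeCab installation. The sketch below restates the private `tokenize` helper from the diff as a standalone method and mirrors the per-byte marking loop of `@bcparse_tonodes`. The method name `boundary_marks` and the symbols `:token_boundary`, `:inside_token` and `:any_boundary` are assumptions made for illustration; they stand in for `MECAB_TOKEN_BOUNDARY`, `MECAB_INSIDE_TOKEN` and `MECAB_ANY_BOUNDARY` and are not part of natto.

```ruby
# Standalone restatement of the tokenize helper shown in the diff.
# Returns [fragment, matched?] pairs covering the text in order.
def tokenize(text, pattern)
  matches = text.scan(pattern)
  acc = []
  tmp = text
  matches.each_with_index do |m, i|
    bef, mat, aft = tmp.partition(m)
    acc << [bef.strip, false] unless bef.empty?
    acc << [mat.strip, true]  unless mat.empty?
    # Only the last match contributes its trailing remainder.
    acc << [aft.strip, false] if i == matches.size - 1 && !aft.empty?
    tmp = aft
  end
  acc
end

# Hypothetical helper: produces one mark per byte, as the diff's loop does
# with mecab_lattice_set_boundary_constraint. Symbols are illustrative
# stand-ins for the MeCab constants.
def boundary_marks(tokens)
  tokens.flat_map do |surface, matched|
    inner = matched ? :inside_token : :any_boundary
    # The first byte of every fragment starts a token; remaining bytes
    # are either forced inside that token or left for MeCab to decide.
    [:token_boundary] + [inner] * [surface.bytes.count - 1, 0].max
  end
end

tokens = tokenize("abc123def", /\d+/)
marks  = boundary_marks(tokens)
p tokens
p marks
```

Fragments that match the pattern are kept as a single morpheme, while the text around them is segmented by MeCab as usual. In natto itself this path is reached by passing the `boundary_constraints` key in the options hash to `parse` or `enum_parse`.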
data/lib/natto/option_parse.rb
CHANGED
@@ -97,14 +97,14 @@ module Natto
|
|
97
97
|
# Returns a string-representation of the options to
|
98
98
|
# be passed in the construction of the `mecab` tagger.
|
99
99
|
#
|
100
|
-
# @param [Hash] options
|
100
|
+
# @param options [Hash] options for MeCab
|
101
101
|
# @return [String] representation of the options to the `mecab` tagger
|
102
102
|
def build_options_str(options={})
|
103
103
|
opt = []
|
104
104
|
SUPPORTED_OPTS.values.each do |k|
|
105
105
|
if options.has_key? k
|
106
106
|
key = k.to_s.gsub('_', '-')
|
107
|
-
if %w( all-morphs allocate-sentence ).include? key
|
107
|
+
if %w( all-morphs allocate-sentence partial ).include? key
|
108
108
|
opt << "--#{key}" if options[k]==true
|
109
109
|
else
|
110
110
|
opt << "--#{key}=#{options[k]}"
|
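The effect of adding `partial` to the boolean flag list can be sketched as follows. This is a minimal approximation, not the gem's code: the `SUPPORTED_KEYS` list is a hypothetical subset (the real `SUPPORTED_OPTS` table is not shown in this diff), and the final `join(' ')` is an assumption about how the option string is assembled.

```ruby
# Flags that take no value, as listed in the 0.9.9 version of the diff.
BOOLEAN_FLAGS = %w( all-morphs allocate-sentence partial ).freeze

# Hypothetical subset of supported option keys, for illustration only.
SUPPORTED_KEYS = [:partial, :all_morphs, :nbest].freeze

def build_options_str(options = {})
  opt = []
  SUPPORTED_KEYS.each do |k|
    next unless options.has_key?(k)
    key = k.to_s.gsub('_', '-')
    if BOOLEAN_FLAGS.include?(key)
      # Boolean flags are emitted bare, and only when set to true.
      opt << "--#{key}" if options[k] == true
    else
      opt << "--#{key}=#{options[k]}"
    end
  end
  # Assumption: the pieces are joined with single spaces.
  opt.join(' ')
end

puts build_options_str({ partial: true, nbest: 2 })
```

With the 0.9.8 list, `partial` would have fallen through to the `else` branch and been rendered as `--partial=true`, which is consistent with the partial parsing fix noted in the changelog.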
data/lib/natto/struct.rb
CHANGED
@@ -6,12 +6,12 @@ module Natto
|
|
6
6
|
require 'ffi'
|
7
7
|
|
8
8
|
# `MeCabStruct` is a general base class for `FFI::Struct` objects in
|
9
|
-
# the `Natto` module. Please refer to
|
10
|
-
#
|
9
|
+
# the `Natto` module. Please refer to `mecab.h` in the source code
|
10
|
+
# distribution.
|
11
11
|
class MeCabStruct < FFI::Struct
|
12
12
|
# Provides accessor methods for the members of the `mecab` struct.
|
13
13
|
#
|
14
|
-
# @param [String]
|
14
|
+
# @param attr_name [String] attribute name
|
15
15
|
# @return member values for the `mecab` struct
|
16
16
|
# @raise [NoMethodError] if `attr_name` is not a member of this `mecab` struct
|
17
17
|
def method_missing(attr_name)
|
@@ -94,7 +94,7 @@ module Natto
|
|
94
94
|
# Initializes this dictionary info instance.
|
95
95
|
# Sets the `DictionaryInfo` filepath value.
|
96
96
|
#
|
97
|
-
# @param [FFI::Pointer]
|
97
|
+
# @param ptr [FFI::Pointer] pointer to MeCab dictionary
|
98
98
|
def initialize(ptr)
|
99
99
|
super(ptr)
|
100
100
|
|
@@ -246,10 +246,10 @@ module Natto
|
|
246
246
|
# Initializes this node instance.
|
247
247
|
# Sets the `MeCab` feature value for this node.
|
248
248
|
#
|
249
|
-
# @param [FFI::Pointer]
|
250
|
-
def initialize(
|
251
|
-
super(
|
252
|
-
@pointer =
|
249
|
+
# @param nptr [FFI::Pointer] pointer to MeCab node
|
250
|
+
def initialize(nptr)
|
251
|
+
super(nptr)
|
252
|
+
@pointer = nptr
|
253
253
|
|
254
254
|
if self[:feature]
|
255
255
|
@feature = self[:feature].force_encoding(Encoding.default_external)
|
@@ -311,10 +311,7 @@ module Natto
|
|
311
311
|
def is_eon?
|
312
312
|
self.stat == EON_NODE
|
313
313
|
end
|
314
|
-
|
315
|
-
|
316
314
|
end
|
317
|
-
|
318
315
|
end
|
319
316
|
|
320
317
|
# Copyright (c) 2015, Brooke M. Fujita.
|
data/lib/natto/version.rb
CHANGED
@@ -1,16 +1,16 @@
|
|
1
1
|
# coding: utf-8
|
2
2
|
|
3
3
|
# `Natto` is the namespace for objects that provide
|
4
|
-
# a binding to
|
4
|
+
# a binding to MeCab and an API for the `Tagger`,
|
5
|
+
# `Node` and `Lattice` objects.
|
5
6
|
#
|
6
|
-
# `Natto::MeCab` is a wrapper class for the `
|
7
|
-
# tagger.
|
7
|
+
# `Natto::MeCab` is a wrapper class for the MeCab `Tagger`.
|
8
8
|
#
|
9
9
|
# `Natto::MeCabStruct` is a base class for a `mecab`
|
10
10
|
# struct.
|
11
11
|
#
|
12
12
|
# `Natto::MeCabNode` is a wrapper for the struct representing
|
13
|
-
# a `
|
13
|
+
# a MeCab `Node`.
|
14
14
|
#
|
15
15
|
# `Natto::DictionaryInfo` is a wrapper for the struct
|
16
16
|
# representing a `Natto::MeCab` instance's related
|
@@ -27,7 +27,7 @@
|
|
27
27
|
# `Natto`.
|
28
28
|
module Natto
|
29
29
|
# Version string for this Rubygem.
|
30
|
-
VERSION = "0.9.
|
30
|
+
VERSION = "0.9.9"
|
31
31
|
end
|
32
32
|
|
33
33
|
# Copyright (c) 2015, Brooke M. Fujita.
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: natto
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.9.
|
4
|
+
version: 0.9.9
|
5
5
|
prerelease:
|
6
6
|
platform: ruby
|
7
7
|
authors:
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2015-
|
12
|
+
date: 2015-03-30 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: ffi
|
@@ -27,9 +27,41 @@ dependencies:
|
|
27
27
|
- - ! '>='
|
28
28
|
- !ruby/object:Gem::Version
|
29
29
|
version: 1.9.0
|
30
|
-
|
31
|
-
|
32
|
-
|
30
|
+
- !ruby/object:Gem::Dependency
|
31
|
+
name: rake
|
32
|
+
requirement: !ruby/object:Gem::Requirement
|
33
|
+
none: false
|
34
|
+
requirements:
|
35
|
+
- - ! '>='
|
36
|
+
- !ruby/object:Gem::Version
|
37
|
+
version: '0'
|
38
|
+
type: :development
|
39
|
+
prerelease: false
|
40
|
+
version_requirements: !ruby/object:Gem::Requirement
|
41
|
+
none: false
|
42
|
+
requirements:
|
43
|
+
- - ! '>='
|
44
|
+
- !ruby/object:Gem::Version
|
45
|
+
version: '0'
|
46
|
+
- !ruby/object:Gem::Dependency
|
47
|
+
name: minitest
|
48
|
+
requirement: !ruby/object:Gem::Requirement
|
49
|
+
none: false
|
50
|
+
requirements:
|
51
|
+
- - ! '>='
|
52
|
+
- !ruby/object:Gem::Version
|
53
|
+
version: '0'
|
54
|
+
type: :development
|
55
|
+
prerelease: false
|
56
|
+
version_requirements: !ruby/object:Gem::Requirement
|
57
|
+
none: false
|
58
|
+
requirements:
|
59
|
+
- - ! '>='
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '0'
|
62
|
+
description: ! 'natto provides a naturally Ruby-esque interface to MeCab. It runs
|
63
|
+
on both CRuby (mri/yarv) and JRuby (jvm). It works with MeCab installations on Windows,
|
64
|
+
Unix/Linux, and OS X. No compiler is necessary, as natto is not a C extension.
|
33
65
|
|
34
66
|
'
|
35
67
|
email: buruzaemon@gmail.com
|