natto 0.9.8 → 0.9.9
- data/CHANGELOG +12 -12
- data/README.md +122 -65
- data/lib/natto/binding.rb +125 -46
- data/lib/natto/natto.rb +273 -127
- data/lib/natto/option_parse.rb +2 -2
- data/lib/natto/struct.rb +8 -11
- data/lib/natto/version.rb +5 -5
- metadata +37 -5
data/CHANGELOG
CHANGED
@@ -1,5 +1,11 @@
|
|
1
1
|
## CHANGELOG
|
2
2
|
|
3
|
+
- __2015/03/31__: 0.9.9 release.
|
4
|
+
- Issue 21/34: Implemented boundary constraint parsing.
|
5
|
+
- Issue 26: Removing deprecated methods parse_as_nodes, parse_as_strings, readnodes and readlines.
|
6
|
+
- Issue 32: Fixed partial parsing.
|
7
|
+
- Updated documentation for new MeCab home at http://taku910.github.io/mecab/.
|
8
|
+
|
3
9
|
- __2015/02/10__: 0.9.8 release.
|
4
10
|
- Migrated natto code home from Bitbucket to GitHub.
|
5
11
|
- Improved documentation following said migration.
|
@@ -7,16 +13,10 @@
|
|
7
13
|
- Updating LICENSE for year 2015.
|
8
14
|
|
9
15
|
- __2014/12/20__: 0.9.7 release.
|
10
|
-
- Issue 14:
|
11
|
-
|
12
|
-
|
13
|
-
- Issue
|
14
|
-
materializing every node and stuffing into
|
15
|
-
array](https://bitbucket.org/buruzaemon/natto/issue/15/use-enumerator-when-parsing-mecab-nodes)
|
16
|
-
- Issue 17: [adding filepath to MeCab and
|
17
|
-
DictionaryInfo](https://bitbucket.org/buruzaemon/natto/issue/17/use-filerealpath-value-for-all-file-paths)
|
18
|
-
- Issue 18: [bug-fix for node-formatting during default node
|
19
|
-
parse](https://bitbucket.org/buruzaemon/natto/issue/18/no-node-formatting-when-using-default-node)
|
16
|
+
- Issue 14: adding automatic discovery for mecab library; no need to explicitly set MECAB_PATH
|
17
|
+
- Issue 15: refactored node-parsing to use Enumerator instead of materializing every node and stuffing into array
|
18
|
+
- Issue 17: adding filepath to MeCab and DictionaryInfo
|
19
|
+
- Issue 18: bug-fix for node-formatting during default node parse
|
20
20
|
- Deprecating parse_as_nodes and parse_as_strings; please use parse instead!
|
21
21
|
- CAUTION: parse_as_nodes, parse_as_strings, readnodes and readlines will be removed in the following release!
|
22
22
|
- Enhancements to to_s methods for both MeCab and DictionaryInfo
|
@@ -36,8 +36,8 @@
|
|
36
36
|
- Removing automatic library load for Cygwin platform (does not compile)
|
37
37
|
|
38
38
|
- __2012/09/16__: 0.9.5 release.
|
39
|
-
- Fixed
|
40
|
-
- Fixed
|
39
|
+
- Fixed Issue 9: parsing of untrimmed strings (trimされていない文字列のparse)
|
40
|
+
- Fixed Issue 10: BUG Segmentation Fault
|
41
41
|
- Adding parse_as_nodes to allow for method-chaining on list of parsed nodes
|
42
42
|
- Adding parse_as_strings to allow for method-chaining on list of string output
|
43
43
|
- Deprecating both readnodes and readlines (badly named methods, see parse_as_nodes and parse_as_strings, respectively)
|
data/README.md
CHANGED
@@ -1,16 +1,16 @@
|
|
1
|
-
# natto
|
1
|
+
# natto [![Gem Version](https://badge.fury.io/rb/natto.svg)](http://badge.fury.io/rb/natto) [![Build Status](https://travis-ci.org/buruzaemon/natto.svg?branch=master)](https://travis-ci.org/buruzaemon/natto)
|
2
2
|
A Tasty Ruby Binding with MeCab
|
3
3
|
|
4
4
|
## What is natto?
|
5
5
|
A gem leveraging FFI (foreign function interface), natto combines the
|
6
6
|
[Ruby programming language](http://www.ruby-lang.org/) with
|
7
|
-
[MeCab](http://
|
7
|
+
[MeCab](http://taku910.github.io/mecab/), the part-of-speech
|
8
8
|
and morphological analyzer for the Japanese language.
|
9
9
|
|
10
|
-
- No compiler is necessary, as natto is _not_ a C extension.
|
11
|
-
- It will run on CRuby (mri/yarv) and JRuby (jvm) equally well.
|
12
|
-
- It will work with MeCab installations on Windows, Unix/Linux or OS X.
|
13
10
|
- natto provides a naturally Ruby-esque interface to MeCab.
|
11
|
+
- It runs on both CRuby (mri/yarv) and JRuby (jvm).
|
12
|
+
- It works with MeCab installations on Windows, Unix/Linux and OS X.
|
13
|
+
- No compiler is necessary, as natto is _not_ a C extension.
|
14
14
|
|
15
15
|
You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
|
16
16
|
|
@@ -18,8 +18,8 @@ You can learn more about [natto at GitHub](https://github.com/buruzaemon/natto).
|
|
18
18
|
## Requirements
|
19
19
|
natto requires the following:
|
20
20
|
|
21
|
-
- [MeCab _0.996_](http://
|
22
|
-
- A system dictionary, like
|
21
|
+
- [MeCab _0.996_](http://taku910.github.io/mecab/#download)
|
22
|
+
- A system dictionary, like mecab-ipadic (recommended), mecab-jumandic or unidic, all available on the [MeCab downloads page](http://taku910.github.io/mecab/#download)
|
23
23
|
- `libmecab-devel` if you are on Linux, since natto uses `mecab-config`
|
24
24
|
- Ruby _1.9 or greater_
|
25
25
|
- [ffi _1.9.0 or greater_](http://rubygems.org/gems/ffi)
|
@@ -29,7 +29,7 @@ Install natto with the following gem command:
|
|
29
29
|
|
30
30
|
gem install natto
|
31
31
|
|
32
|
-
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the
|
32
|
+
This will automatically install the [ffi](http://rubygems.org/gems/ffi) rubygem, which natto uses to bind to the MeCab library.
|
33
33
|
|
34
34
|
## Installation on Windows
|
35
35
|
However, if you are using CRuby on Windows, you will first need to install the [RubyInstaller Development Kit (DevKit)](https://github.com/oneclick/rubyinstaller/wiki/Development-Kit), an MSYS/MinGW-based toolkit that enables your Windows Ruby installation to build many of the native C/C++ extensions available, including ffi.
|
@@ -45,40 +45,38 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
45
45
|
6. If you are on a 64-bit Windows and you use a 64-bit Ruby or JRuby, then you might want to [build a 64-bit version of libmecab.dll](https://github.com/buruzaemon/natto/wiki/64-Bit-Windows).
|
46
46
|
|
47
47
|
|
48
|
-
## Configuration
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
48
|
+
## Automatic Configuration
|
49
|
+
No explicit configuration should be necessary, as natto will try to locate the MeCab library based upon its runtime environment.
|
50
|
+
- On OS X and \*nix, it will query `mecab-config --libs`
|
51
|
+
- On Windows, it will query the Windows Registry to determine where `libmecab.dll` is installed
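The \*nix lookup described above can be pictured with a small sketch. This is not natto's actual implementation: the helper names `libdir_from_mecab_config` and `candidate_libpath` are hypothetical, and the sample string only assumes that `mecab-config --libs` prints linker flags of the usual `-L<dir> -l<name>` form.

```ruby
require 'rbconfig'

# Hypothetical helper: pull the library directory out of linker flags such as
# "-L/usr/local/lib -lmecab -lstdc++". Returns nil when no -L flag is present.
def libdir_from_mecab_config(output)
  flag = output.split.find { |token| token.start_with?('-L') }
  flag && flag.sub(/\A-L/, '')
end

# Hypothetical helper: build a candidate library path for the given platform.
# Assumes .dylib on OS X (darwin) and .so elsewhere.
def candidate_libpath(libdir, host_os = RbConfig::CONFIG['host_os'])
  ext = host_os =~ /darwin/ ? 'dylib' : 'so'
  File.join(libdir, "libmecab.#{ext}")
end

sample = '-L/usr/local/lib -lmecab -lstdc++'
puts candidate_libpath(libdir_from_mecab_config(sample), 'linux-gnu')
# prints /usr/local/lib/libmecab.so
```

In a real lookup the `sample` string would come from running `mecab-config --libs`, which is why `libmecab-devel` is listed as a requirement on Linux.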
|
52
|
+
|
53
|
+
## Explicit configuration via `MECAB_PATH` and `MECAB_CHARSET`
|
54
|
+
If natto cannot find the MeCab library, it raises `LoadError`. In that case, set the `MECAB_PATH` environment variable to the exact name or path of your MeCab library.
|
55
55
|
|
56
|
-
|
56
|
+
- e.g., for OS X
|
57
57
|
|
58
|
-
|
58
|
+
export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
|
59
59
|
|
60
|
-
|
60
|
+
- e.g., for bash on UNIX/Linux
|
61
61
|
|
62
|
-
|
62
|
+
export MECAB_PATH=/usr/local/lib/libmecab.so
|
63
63
|
|
64
|
-
|
64
|
+
- e.g., on Windows
|
65
65
|
|
66
|
-
|
66
|
+
set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
|
67
67
|
|
68
|
-
|
68
|
+
- e.g., from within a Ruby program
|
69
|
+
|
70
|
+
ENV['MECAB_PATH']='/usr/local/lib/libmecab.so'
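How an explicit `MECAB_PATH` might interact with automatic discovery can be sketched as follows. The function `resolve_mecab_path` is hypothetical, and the precedence shown (explicit setting first, discovered path second) is an assumption made for illustration rather than something this README states.

```ruby
# Hypothetical resolver, not part of natto.
# env        - a Hash-like object such as ENV
# discovered - a path found by automatic discovery, or nil if none was found
def resolve_mecab_path(env, discovered = nil)
  explicit = env['MECAB_PATH']
  # Assumption: an explicit, non-empty MECAB_PATH takes precedence.
  return explicit unless explicit.nil? || explicit.empty?
  return discovered if discovered
  # Neither source produced a path, so fail the way the README describes.
  raise LoadError, 'MeCab library not found; please set MECAB_PATH'
end

puts resolve_mecab_path({ 'MECAB_PATH' => '/usr/local/lib/libmecab.so' })
# prints /usr/local/lib/libmecab.so
```

Passing `ENV` as the first argument would read the real environment; a plain Hash is used here only to keep the example deterministic.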
|
69
71
|
|
70
72
|
## Usage
|
71
73
|
|
74
|
+
Here's a very quick guide to using natto.
|
75
|
+
|
76
|
+
Instantiate a reference to the MeCab library, and display some details:
|
72
77
|
|
73
|
-
# Quick Start
|
74
|
-
# -----------
|
75
|
-
#
|
76
|
-
# No explicit configuration should be necessary!
|
77
|
-
#
|
78
78
|
require 'natto'
|
79
79
|
|
80
|
-
# first, create an instance of Natto::MeCab
|
81
|
-
#
|
82
80
|
nm = Natto::MeCab.new
|
83
81
|
=> #<Natto::MeCab:0x28d30748
|
84
82
|
@tagger=#<FFI::Pointer address=0x28a97d50>, \
|
@@ -90,32 +88,27 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
90
88
|
type=0>] \
|
91
89
|
@version=0.996>
|
92
90
|
|
93
|
-
# display MeCab version
|
94
|
-
#
|
95
91
|
puts nm.version
|
96
92
|
=> 0.996
|
97
93
|
|
98
|
-
|
99
|
-
|
94
|
+
----
|
95
|
+
|
96
|
+
Display details about the system dictionary used by MeCab:
|
97
|
+
|
100
98
|
puts nm.libpath
|
101
99
|
=> /usr/local/lib/libmecab.so
|
102
100
|
|
103
|
-
# reference to MeCab system dictionary
|
104
|
-
#
|
105
101
|
sysdic = nm.dicts.first
|
106
|
-
|
107
|
-
# display full pathname to system dictionary file
|
108
|
-
#
|
109
102
|
puts sysdic.filepath
|
110
103
|
=> /usr/local/lib/mecab/dic/ipadic/sys.dic
|
111
104
|
|
112
|
-
# what charset (encoding) is the system dictionary?
|
113
|
-
#
|
114
105
|
puts sysdic.charset
|
115
106
|
=> utf8
|
116
|
-
|
117
|
-
|
118
|
-
|
107
|
+
|
108
|
+
----
|
109
|
+
|
110
|
+
Parse Japanese text and send the MeCab result as a single string to stdout:
|
111
|
+
|
119
112
|
puts nm.parse('俺の名前は星野豊だ!!そこんとこヨロシク!')
|
120
113
|
俺 名詞,代名詞,一般,*,*,*,俺,オレ,オレ
|
121
114
|
の 助詞,連体化,*,*,*,*,の,ノ,ノ
|
@@ -133,12 +126,22 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
133
126
|
! 記号,一般,*,*,*,*,!,!,!
|
134
127
|
EOS
|
135
128
|
|
136
|
-
|
137
|
-
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
129
|
+
----
|
130
|
+
|
131
|
+
If a block is passed to `parse`, you can iterate over the list of resulting `MeCabNode`
|
132
|
+
instances to access more detailed information about each morpheme.
|
133
|
+
|
134
|
+
In this example, the following attributes and methods for `MeCabNode` are used:
|
135
|
+
|
136
|
+
- `surface` - the morpheme surface
|
137
|
+
- `posid` - node part-of-speech ID (dictionary-dependent)
|
138
|
+
- `is_eos?` - is this `MeCabNode` an end-of-sentence node?
|
139
|
+
|
140
|
+
This iterates over the morpheme nodes in the given text,
|
141
|
+
and outputs a formatted, tab-delimited line with the
|
142
|
+
morpheme surface and part-of-speech ID, ignoring any end-of-sentence
|
143
|
+
nodes:
|
144
|
+
|
142
145
|
nm.parse('世界チャンプ目指してんだなこれがっ!!夢なの、俺のっ!!') do |n|
|
143
146
|
puts "#{n.surface}\tpart-of-speech id: #{n.posid}" if !n.is_eos?
|
144
147
|
end
|
@@ -160,37 +163,46 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
160
163
|
のっ part-of-speech id: 31
|
161
164
|
!! part-of-speech id: 36
|
162
165
|
|
163
|
-
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
168
|
-
|
169
|
-
|
170
|
-
|
171
|
-
|
172
|
-
|
173
|
-
|
166
|
+
----
|
167
|
+
|
168
|
+
For more complex parsing, such as that for natural language
|
169
|
+
processing tasks, it is far more efficient to use `enum_parse` to
|
170
|
+
obtain an [`Enumerator`](http://ruby-doc.org/core-2.2.0/Enumerator.html)
|
171
|
+
to iterate over the resulting `MeCabNode` instances. An `Enumerator`
|
172
|
+
yields each `MeCabNode` instance without first materializing all
|
173
|
+
instances at once, thus being more efficient.
|
174
|
+
|
175
|
+
This example uses the `-F` node-format option to customize
|
176
|
+
the resulting `MeCabNode` feature attribute to extract:
|
177
|
+
|
178
|
+
- `%m` - morpheme surface
|
179
|
+
- `%f[0]` - node part-of-speech
|
180
|
+
- `%f[7]` - reading
|
181
|
+
|
182
|
+
Note that we can move the `Enumerator` both forwards and backwards, rewind it
|
183
|
+
back to the beginning, and then iterate over it.
|
174
184
|
|
185
|
+
nm = Natto::MeCab.new('-F%m\t%f[0]\t%f[7]')
|
186
|
+
|
175
187
|
enum = nm.enum_parse('この星の一等賞になりたいの卓球で俺は、そんだけ!')
|
176
188
|
=> #<Enumerator: #<Enumerator::Generator:0x00000002ff3898>:each>
|
177
|
-
|
189
|
+
|
178
190
|
enum.next
|
179
191
|
=> #<Natto::MeCabNode:0x000000032eed68 \
|
180
192
|
@pointer=#<FFI::Pointer address=0x000000005ffb48>, \
|
181
193
|
stat=0, \
|
182
194
|
@surface="この", \
|
183
195
|
@feature="この 連体詞 コノ">
|
184
|
-
|
196
|
+
|
185
197
|
enum.peek
|
186
198
|
=> #<Natto::MeCabNode:0x00000002fe2110a \
|
187
199
|
@pointer=#<FFI::Pointer address=0x000000005ffdb8>, \
|
188
200
|
stat=0, \
|
189
201
|
@surface="星", \
|
190
202
|
@feature="星 名詞 ホシ">
|
191
|
-
|
203
|
+
|
192
204
|
enum.rewind
|
193
|
-
|
205
|
+
|
194
206
|
# again, ignore any end-of-sentence nodes
|
195
207
|
enum.each { |n| puts n.feature if !n.is_eos? }
|
196
208
|
この 連体詞 コノ
|
@@ -211,7 +223,52 @@ However, if you are using a CRuby on Windows, then you will first need to instal
|
|
211
223
|
だけ 助詞 ダケ
|
212
224
|
! 記号 !
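The efficiency claim made for `enum_parse` rests on how a Ruby `Enumerator` generates values on demand. The following is a minimal sketch that uses plain strings in place of `MeCabNode` instances; it does not call natto, and the `generated` counter exists only to show when each item is produced.

```ruby
generated = 0

# A stand-in for a node stream: the block runs only as far as it is asked to.
enum = Enumerator.new do |y|
  %w[この 星 の].each do |surface|
    generated += 1   # count items as they are actually produced
    y << surface
  end
end

first = enum.next          # produces the first item only
after_next = generated     # 1
ahead = enum.peek          # produces the second item, without consuming it
after_peek = generated     # 2

enum.rewind                # back to the beginning
all = enum.to_a            # iterates over everything from the start

puts first                 # この
puts ahead                 # 星
puts all.join(' ')         # この 星 の
```

Because items are produced one at a time, stopping early (for example after finding the first noun) avoids creating the remaining items at all.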
|
213
225
|
|
214
|
-
|
226
|
+
----
|
227
|
+
|
228
|
+
[Partial parsing](http://taku910.github.io/mecab/partial.html) allows you to
|
229
|
+
pass hints to MeCab on how to tokenize morphemes when parsing. With boundary
|
230
|
+
constraint parsing, you can specify either
|
231
|
+
a [Regexp](http://ruby-doc.org/core-2.2.1/Regexp.html) or
|
232
|
+
[String](http://ruby-doc.org/core-2.2.1/String.html) to tell MeCab where the
|
233
|
+
boundaries of a morpheme should be. Use the new `boundary_constraints` keyword.
|
234
|
+
For hints on tokenization, please see
|
235
|
+
[String#scan](http://ruby-doc.org/core-2.2.1/String.html#method-i-scan).
|
236
|
+
|
237
|
+
This example uses the `-F` node-format option to customize
|
238
|
+
the resulting `MeCabNode` feature attribute to extract:
|
239
|
+
|
240
|
+
- `%m` - morpheme surface
|
241
|
+
- `%f[0]` - node part-of-speech
|
242
|
+
- `%s` - node `stat` status value, 1 is `unknown`
|
243
|
+
|
244
|
+
Note that any morpheme captured by a boundary constraint will have a node `stat` value of 1 (unknown).
|
245
|
+
|
246
|
+
nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
|
247
|
+
|
248
|
+
text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
|
249
|
+
pattern = /ヒーロー見参/
|
250
|
+
|
251
|
+
nm.enum_parse(text, boundary_constraints: pattern).each do |n|
|
252
|
+
puts n.feature if !(n.is_bos? || n.is_eos?)
|
253
|
+
end
|
254
|
+
|
255
|
+
# desired morpheme boundary specified with Regexp /ヒーロー見参/
|
256
|
+
心, 名詞, 0
|
257
|
+
の, 助詞, 0
|
258
|
+
中, 名詞, 0
|
259
|
+
で, 助詞, 0
|
260
|
+
3, 名詞, 1
|
261
|
+
回, 名詞, 0
|
262
|
+
唱え, 動詞, 0
|
263
|
+
、, 記号, 0
|
264
|
+
ヒーロー見参, 名詞, 1
|
265
|
+
!, 記号, 0
|
266
|
+
ヒーロー見参, 名詞, 1
|
267
|
+
!, 記号, 0
|
268
|
+
ヒーロー見参, 名詞, 1
|
269
|
+
!, 記号, 0
|
270
|
+
|
271
|
+
|
215
272
|
|
216
273
|
## Learn more
|
217
274
|
- You can read more about natto on the [project Wiki](https://github.com/buruzaemon/natto/wiki).
|
data/lib/natto/binding.rb
CHANGED
@@ -66,12 +66,7 @@ module Natto
|
|
66
66
|
|
67
67
|
ffi_lib find_library
|
68
68
|
|
69
|
-
#
|
70
|
-
attach_function :mecab_model_new2, [:string], :pointer
|
71
|
-
attach_function :mecab_model_destroy, [:pointer], :void
|
72
|
-
attach_function :mecab_model_dictionary_info, [:pointer], :pointer
|
73
|
-
|
74
|
-
# old interface
|
69
|
+
# C interface
|
75
70
|
attach_function :mecab_new2, [:string], :pointer
|
76
71
|
attach_function :mecab_version, [], :string
|
77
72
|
attach_function :mecab_strerror, [:pointer],:string
|
@@ -88,6 +83,26 @@ module Natto
|
|
88
83
|
attach_function :mecab_format_node, [:pointer, :pointer], :string
|
89
84
|
attach_function :mecab_dictionary_info, [:pointer], :pointer
|
90
85
|
|
86
|
+
attach_function :mecab_lattice_new, [], :pointer
|
87
|
+
attach_function :mecab_lattice_destroy, [:pointer], :void
|
88
|
+
attach_function :mecab_lattice_clear, [:pointer], :void
|
89
|
+
attach_function :mecab_lattice_is_available, [:pointer], :int
|
90
|
+
attach_function :mecab_lattice_get_bos_node, [:pointer], :pointer
|
91
|
+
attach_function :mecab_lattice_set_sentence, [:pointer, :string], :void
|
92
|
+
attach_function :mecab_lattice_get_size, [:pointer], :int
|
93
|
+
attach_function :mecab_lattice_set_z, [:pointer, :float], :void
|
94
|
+
attach_function :mecab_lattice_set_theta, [:pointer, :float], :void
|
95
|
+
attach_function :mecab_lattice_next, [:pointer], :int
|
96
|
+
attach_function :mecab_lattice_get_request_type, [:pointer], :int
|
97
|
+
attach_function :mecab_lattice_add_request_type, [:pointer, :int], :void
|
98
|
+
attach_function :mecab_lattice_set_request_type, [:pointer, :int], :void
|
99
|
+
attach_function :mecab_lattice_tostr, [:pointer], :string
|
100
|
+
attach_function :mecab_lattice_nbest_tostr, [:pointer, :int], :string
|
101
|
+
attach_function :mecab_lattice_get_boundary_constraint, [:pointer, :int], :int
|
102
|
+
attach_function :mecab_lattice_set_boundary_constraint, [:pointer, :int, :int], :void
|
103
|
+
attach_function :mecab_parse_lattice, [:pointer, :pointer], :int
|
104
|
+
attach_function :mecab_lattice_strerror, [:pointer], :string
|
105
|
+
|
91
106
|
# @private
|
92
107
|
module ClassMethods
|
93
108
|
|
@@ -95,18 +110,6 @@ module Natto
|
|
95
110
|
Natto::Binding.find_library
|
96
111
|
end
|
97
112
|
|
98
|
-
def mecab_model_new2(options_str)
|
99
|
-
Natto::Binding.mecab_model_new2(options_str)
|
100
|
-
end
|
101
|
-
|
102
|
-
def mecab_model_destroy(m_ptr)
|
103
|
-
Natto::Binding.mecab_model_destroy(m_ptr)
|
104
|
-
end
|
105
|
-
|
106
|
-
def mecab_model_dictionary_info(m_ptr)
|
107
|
-
Natto::Binding.mecab_model_dictionary_info(m_ptr)
|
108
|
-
end
|
109
|
-
|
110
113
|
# ----------------------------------------
|
111
114
|
def mecab_new2(options_str)
|
112
115
|
Natto::Binding.mecab_new2(options_str)
|
@@ -116,60 +119,136 @@ module Natto
|
|
116
119
|
Natto::Binding.mecab_version
|
117
120
|
end
|
118
121
|
|
119
|
-
def mecab_strerror(
|
120
|
-
Natto::Binding.mecab_strerror(
|
122
|
+
def mecab_strerror(tptr)
|
123
|
+
Natto::Binding.mecab_strerror(tptr)
|
121
124
|
end
|
122
125
|
|
123
|
-
def mecab_destroy(
|
124
|
-
Natto::Binding.mecab_destroy(
|
126
|
+
def mecab_destroy(tptr)
|
127
|
+
Natto::Binding.mecab_destroy(tptr)
|
125
128
|
end
|
126
129
|
|
127
|
-
def mecab_set_partial(
|
128
|
-
Natto::Binding.mecab_set_partial(
|
130
|
+
def mecab_set_partial(tptr, ll)
|
131
|
+
Natto::Binding.mecab_set_partial(tptr, ll)
|
129
132
|
end
|
130
133
|
|
131
|
-
def mecab_set_theta(
|
132
|
-
Natto::Binding.mecab_set_theta(
|
134
|
+
def mecab_set_theta(tptr, t)
|
135
|
+
Natto::Binding.mecab_set_theta(tptr, t)
|
133
136
|
end
|
134
137
|
|
135
|
-
def mecab_set_lattice_level(
|
136
|
-
Natto::Binding.mecab_set_lattice_level(
|
138
|
+
def mecab_set_lattice_level(tptr, ll)
|
139
|
+
Natto::Binding.mecab_set_lattice_level(tptr, ll)
|
137
140
|
end
|
138
141
|
|
139
|
-
def mecab_set_all_morphs(
|
140
|
-
Natto::Binding.mecab_set_all_morphs(
|
142
|
+
def mecab_set_all_morphs(tptr, am)
|
143
|
+
Natto::Binding.mecab_set_all_morphs(tptr, am)
|
141
144
|
end
|
142
145
|
|
143
|
-
def mecab_sparse_tostr(
|
144
|
-
Natto::Binding.mecab_sparse_tostr(
|
146
|
+
def mecab_sparse_tostr(tptr, str)
|
147
|
+
Natto::Binding.mecab_sparse_tostr(tptr, str)
|
145
148
|
end
|
146
149
|
|
147
|
-
def mecab_sparse_tonode(
|
148
|
-
Natto::Binding.mecab_sparse_tonode(
|
150
|
+
def mecab_sparse_tonode(tptr, str)
|
151
|
+
Natto::Binding.mecab_sparse_tonode(tptr, str)
|
149
152
|
end
|
150
153
|
|
151
|
-
def mecab_nbest_next_tonode(
|
152
|
-
Natto::Binding.mecab_nbest_next_tonode(
|
154
|
+
def mecab_nbest_next_tonode(tptr)
|
155
|
+
Natto::Binding.mecab_nbest_next_tonode(tptr)
|
153
156
|
end
|
154
157
|
|
155
|
-
def mecab_nbest_init(
|
156
|
-
Natto::Binding.mecab_nbest_init(
|
158
|
+
def mecab_nbest_init(tptr, str)
|
159
|
+
Natto::Binding.mecab_nbest_init(tptr, str)
|
157
160
|
end
|
158
161
|
|
159
|
-
def mecab_nbest_sparse_tostr(
|
160
|
-
Natto::Binding.mecab_nbest_sparse_tostr(
|
162
|
+
def mecab_nbest_sparse_tostr(tptr, n, str)
|
163
|
+
Natto::Binding.mecab_nbest_sparse_tostr(tptr, n, str)
|
161
164
|
end
|
162
165
|
|
163
|
-
def mecab_nbest_next_tonode(
|
164
|
-
Natto::Binding.mecab_nbest_next_tonode(
|
166
|
+
def mecab_nbest_next_tonode(tptr)
|
167
|
+
Natto::Binding.mecab_nbest_next_tonode(tptr)
|
165
168
|
end
|
166
169
|
|
167
|
-
def mecab_format_node(
|
168
|
-
Natto::Binding.mecab_format_node(
|
170
|
+
def mecab_format_node(tptr, nptr)
|
171
|
+
Natto::Binding.mecab_format_node(tptr, nptr)
|
172
|
+
end
|
173
|
+
|
174
|
+
def mecab_dictionary_info(tptr)
|
175
|
+
Natto::Binding.mecab_dictionary_info(tptr)
|
169
176
|
end
|
170
177
|
|
171
|
-
def
|
172
|
-
Natto::Binding.
|
178
|
+
def mecab_lattice_new()
|
179
|
+
Natto::Binding.mecab_lattice_new()
|
180
|
+
end
|
181
|
+
|
182
|
+
def mecab_lattice_destroy(lptr)
|
183
|
+
Natto::Binding.mecab_lattice_destroy(lptr)
|
184
|
+
end
|
185
|
+
|
186
|
+
def mecab_lattice_clear(lptr)
|
187
|
+
Natto::Binding.mecab_lattice_clear(lptr)
|
188
|
+
end
|
189
|
+
|
190
|
+
def mecab_lattice_is_available(lptr)
|
191
|
+
Natto::Binding.mecab_lattice_is_available(lptr)
|
192
|
+
end
|
193
|
+
|
194
|
+
def mecab_lattice_get_bos_node(lptr)
|
195
|
+
Natto::Binding.mecab_lattice_get_bos_node(lptr)
|
196
|
+
end
|
197
|
+
|
198
|
+
def mecab_lattice_set_sentence(lptr, str)
|
199
|
+
Natto::Binding.mecab_lattice_set_sentence(lptr, str)
|
200
|
+
end
|
201
|
+
|
202
|
+
def mecab_lattice_get_size(lptr)
|
203
|
+
Natto::Binding.mecab_lattice_get_size(lptr)
|
204
|
+
end
|
205
|
+
|
206
|
+
def mecab_lattice_set_z(lptr, z)
|
207
|
+
Natto::Binding.mecab_lattice_set_z(lptr, z)
|
208
|
+
end
|
209
|
+
|
210
|
+
def mecab_lattice_set_theta(lptr, t)
|
211
|
+
Natto::Binding.mecab_lattice_set_theta(lptr, t)
|
212
|
+
end
|
213
|
+
|
214
|
+
def mecab_lattice_next(lptr)
|
215
|
+
Natto::Binding.mecab_lattice_next(lptr)
|
216
|
+
end
|
217
|
+
|
218
|
+
def mecab_lattice_get_request_type(lptr)
|
219
|
+
Natto::Binding.mecab_lattice_get_request_type(lptr)
|
220
|
+
end
|
221
|
+
|
222
|
+
def mecab_lattice_add_request_type(lptr, rtype)
|
223
|
+
Natto::Binding.mecab_lattice_add_request_type(lptr, rtype)
|
224
|
+
end
|
225
|
+
|
226
|
+
def mecab_lattice_set_request_type(lptr, rtype)
|
227
|
+
Natto::Binding.mecab_lattice_set_request_type(lptr, rtype)
|
228
|
+
end
|
229
|
+
|
230
|
+
def mecab_lattice_tostr(lptr)
|
231
|
+
Natto::Binding.mecab_lattice_tostr(lptr)
|
232
|
+
end
|
233
|
+
|
234
|
+
def mecab_lattice_nbest_tostr(lptr, n)
|
235
|
+
Natto::Binding.mecab_lattice_nbest_tostr(lptr, n)
|
236
|
+
end
|
237
|
+
|
238
|
+
def mecab_lattice_get_boundary_constraint(lptr, pos)
|
239
|
+
Natto::Binding.mecab_lattice_get_boundary_constraint(lptr, pos)
|
240
|
+
end
|
241
|
+
|
242
|
+
def mecab_lattice_set_boundary_constraint(lptr, pos, btype)
|
243
|
+
Natto::Binding.mecab_lattice_set_boundary_constraint(lptr, pos, btype)
|
244
|
+
end
|
245
|
+
|
246
|
+
def mecab_parse_lattice(tptr, lptr)
|
247
|
+
Natto::Binding.mecab_parse_lattice(tptr, lptr)
|
248
|
+
end
|
249
|
+
|
250
|
+
def mecab_lattice_strerror(lptr)
|
251
|
+
Natto::Binding.mecab_lattice_strerror(lptr)
|
173
252
|
end
|
174
253
|
end
|
175
254
|
end
|
data/lib/natto/natto.rb
CHANGED
@@ -4,8 +4,8 @@ require 'natto/option_parse'
|
|
4
4
|
require 'natto/struct'
|
5
5
|
|
6
6
|
module Natto
|
7
|
-
# `MeCab` is a wrapper class for the `
|
8
|
-
# Options to the `
|
7
|
+
# `MeCab` is a wrapper class for the MeCab `Tagger`.
|
8
|
+
# Options to the MeCab `Tagger` are passed in as a string
|
9
9
|
# (MeCab command-line style) or as a Ruby-style hash at
|
10
10
|
# initialization.
|
11
11
|
#
|
@@ -45,6 +45,7 @@ module Natto
|
|
45
45
|
# 。 記号,句点,*,*,*,*,。,。,。
|
46
46
|
# EOS
|
47
47
|
#
|
48
|
+
#
|
48
49
|
# # pass a block to iterate over each MeCabNode instance
|
49
50
|
# #
|
50
51
|
# nm.parse(text) do |n|
|
@@ -101,10 +102,52 @@ module Natto
|
|
101
102
|
# 13: 。,。,記号,7
|
102
103
|
#
|
103
104
|
#
|
105
|
+
# # Boundary constraint parsing with output formatting.
|
106
|
+
# # %m ... morpheme surface
|
107
|
+
# # %F, ... comma-delimited ChaSen feature values
|
108
|
+
# # part-of-speech (index 0)
|
109
|
+
# # %s ... MeCab node status value (1 unknown)
|
110
|
+
# #
|
111
|
+
# nm = Natto::MeCab.new('-F%m,\s%f[0],\s%s')
|
112
|
+
#
|
113
|
+
# enum = nm.enum_parse(text, boundary_constraints: /見えねえ風景/)
|
114
|
+
# => #<Enumerator: #<Enumerator::Generator:0x00000801d7aa38>:each>
|
115
|
+
#
|
116
|
+
# # output the feature attribute of each MeCabNode
|
117
|
+
# # ignoring any beginning- or end-of-sentence nodes
|
118
|
+
# #
|
119
|
+
# enum.each do |n|
|
120
|
+
# puts n.feature if !(n.is_bos? or n.is_eos?)
|
121
|
+
# end
|
122
|
+
# 凡人, 名詞, 0
|
123
|
+
# に, 助詞, 0
|
124
|
+
# しか, 助詞, 0
|
125
|
+
# 見えねえ風景, 名詞, 1
|
126
|
+
# って, 助詞, 0
|
127
|
+
# の, 名詞, 0
|
128
|
+
# が, 助詞, 0
|
129
|
+
# ある, 動詞, 0
|
130
|
+
# ん, 名詞, 0
|
131
|
+
# だ, 助動詞, 0
|
132
|
+
# よ, 助詞, 0
|
133
|
+
# 。, 記号, 0
|
134
|
+
#
|
104
135
|
class MeCab
|
105
136
|
include Natto::Binding
|
106
137
|
include Natto::OptionParse
|
107
138
|
|
139
|
+
MECAB_LATTICE_ONE_BEST = 1
|
140
|
+
MECAB_LATTICE_NBEST = 2
|
141
|
+
MECAB_LATTICE_PARTIAL = 4
|
142
|
+
MECAB_LATTICE_MARGINAL_PROB = 8
|
143
|
+
MECAB_LATTICE_ALTERNATIVE = 16
|
144
|
+
MECAB_LATTICE_ALL_MORPHS = 32
|
145
|
+
MECAB_LATTICE_ALLOCATE_SENTENCE = 64
|
146
|
+
|
147
|
+
MECAB_ANY_BOUNDARY = 0
|
148
|
+
MECAB_TOKEN_BOUNDARY = 1
|
149
|
+
MECAB_INSIDE_TOKEN = 2
|
150
|
+
|
108
151
|
# @return [FFI::Pointer] pointer to MeCab tagger.
|
109
152
|
attr_reader :tagger
|
110
153
|
# @return [String] absolute filepath to MeCab library.
|
@@ -113,10 +156,10 @@ module Natto
|
|
113
156
|
attr_reader :options
|
114
157
|
# @return [Array] listing of all dictionaries referenced.
|
115
158
|
attr_reader :dicts
|
116
|
-
# @return [String] `MeCab`
|
159
|
+
# @return [String] `MeCab` version.
|
117
160
|
attr_reader :version
|
118
161
|
|
119
|
-
# Initializes the wrapped `
|
162
|
+
# Initializes the wrapped `Tagger` instance with the
|
120
163
|
# given `options`.
|
121
164
|
#
|
122
165
|
# Options supported are:
|
@@ -174,7 +217,7 @@ module Natto
|
|
174
217
|
# 。 。
|
175
218
|
# EOS
|
176
219
|
#
|
177
|
-
# @param [Hash, String]
|
220
|
+
# @param options [Hash, String] the MeCab options for tagger
|
178
221
|
# @raise [MeCabError] if `mecab` cannot be initialized with the given `options`
|
179
222
|
def initialize(options={})
|
180
223
|
@options = self.class.parse_mecab_options(options)
|
@@ -191,31 +234,159 @@ module Natto
|
|
191
234
|
self.mecab_set_all_morphs(@tagger, 1) if @options[:all_morphs]
|
192
235
|
self.mecab_set_partial(@tagger, 1) if @options[:partial]
|
193
236
|
|
194
|
-
#
|
195
|
-
#
|
196
|
-
|
197
|
-
|
198
|
-
|
199
|
-
|
200
|
-
|
237
|
+
# Define lambda for each major parsing type: _tostr, _tonode,
|
238
|
+
# boundary constraint _tostr, boundary constraint _tonode;
|
239
|
+
# and each parsing type will support both normal and N-best
|
240
|
+
# options
|
241
|
+
@parse_tostr = ->(text) {
|
242
|
+
if @options[:nbest] && @options[:nbest] > 1
|
243
|
+
#self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
201
244
|
retval = self.mecab_nbest_sparse_tostr(@tagger, @options[:nbest], text) ||
|
202
245
|
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
203
|
-
|
204
|
-
|
246
|
+
else
|
247
|
+
retval = self.mecab_sparse_tostr(@tagger, text) ||
|
248
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger)))
|
249
|
+
end
|
205
250
|
|
206
|
-
|
207
|
-
|
208
|
-
self.mecab_nbest_init(@tagger, text)
|
209
|
-
n = self.mecab_nbest_next_tonode(@tagger)
|
210
|
-
raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
|
251
|
+
retval.force_encoding(Encoding.default_external)
|
252
|
+
}
|
211
253
|
|
254
|
+
@parse_tonodes = ->(text) {
|
255
|
+
Enumerator.new do |y|
|
256
|
+
if @options[:nbest] && @options[:nbest] > 1
|
212
257
|
nlen = @options[:nbest]
|
213
|
-
|
214
|
-
|
215
|
-
|
216
|
-
|
217
|
-
|
218
|
-
|
258
|
+
#self.mecab_set_lattice_level(@tagger, (@options[:lattice_level] || 1))
|
259
|
+
self.mecab_nbest_init(@tagger, text)
|
260
|
+
nptr = self.mecab_nbest_next_tonode(@tagger)
|
261
|
+
else
|
262
|
+
nlen = 1
|
263
|
+
nptr = self.mecab_sparse_tonode(@tagger, text)
|
264
|
+
end
|
265
|
+
raise(MeCabError.new(self.mecab_strerror(@tagger))) if nptr.nil? || nptr.address==0x0
|
266
|
+
|
267
|
+
nlen.times do
|
268
|
+
s = text.bytes.to_a
|
269
|
+
while nptr && nptr.address != 0x0
|
270
|
+
mn = Natto::MeCabNode.new(nptr)
|
271
|
+
# ignore BOS nodes, since mecab does so
|
272
|
+
if !mn.is_bos?
|
273
|
+
s = s.drop_while {|e| (e==0xa || e==0x20)}
|
274
|
+
if !s.empty?
|
275
|
+
sarr = []
|
276
|
+
mn.length.times { sarr << s.shift }
|
277
|
+
surf = sarr.pack('C*')
|
278
|
+
mn.surface = surf.force_encoding(Encoding.default_external)
|
279
|
+
end
|
280
|
+
if @options[:output_format_type] || @options[:node_format]
|
281
|
+
mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
|
282
|
+
end
|
283
|
+
y.yield mn
|
284
|
+
end
|
285
|
+
nptr = mn.next
|
286
|
+
end
|
287
|
+
if nlen > 1
|
288
|
+
nptr = self.mecab_nbest_next_tonode(@tagger)
|
289
|
+
end
|
290
|
+
end
|
291
|
+
end
|
292
|
+
}
|
293
|
+
|
294
|
+
@bcparse_tostr = ->(text, boundary_constraints=/./) {
|
295
|
+
begin
|
296
|
+
lattice = self.mecab_lattice_new()
|
297
|
+
raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
|
298
|
+
|
299
|
+
if @options[:nbest] && @options[:nbest] > 1
|
300
|
+
n = @options[:nbest]
|
301
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
|
302
|
+
else
|
303
|
+
n = 1
|
304
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
|
305
|
+
end
|
306
|
+
if @options[:theta]
|
307
|
+
self.mecab_lattice_set_theta(lattice, @options[:theta])
|
308
|
+
end
|
309
|
+
|
310
|
+
tokens = tokenize(text, boundary_constraints)
|
311
|
+
text = tokens.map {|t| t.first}.join
|
312
|
+
self.mecab_lattice_set_sentence(lattice, text)
|
313
|
+
|
314
|
+
bpos = 0
|
315
|
+
tokens.each do |token|
|
316
|
+
c = token.first.bytes.count
|
317
|
+
|
318
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
|
319
|
+
bpos += 1
|
320
|
+
|
321
|
+
mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
|
322
|
+
(c-1).times do
|
323
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
|
324
|
+
bpos += 1
|
325
|
+
end
|
326
|
+
end
|
327
|
+
|
328
|
+
self.mecab_parse_lattice(@tagger, lattice)
|
329
|
+
|
330
|
+
if n > 1
|
331
|
+
retval = self.mecab_lattice_nbest_tostr(lattice, n)
|
332
|
+
else
|
333
|
+
retval = self.mecab_lattice_tostr(lattice)
|
334
|
+
end
|
335
|
+
retval.force_encoding(Encoding.default_external)
|
336
|
+
rescue
|
337
|
+
raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
|
338
|
+
ensure
|
339
|
+
if lattice.address != 0x0
|
340
|
+
self.mecab_lattice_destroy(lattice)
|
341
|
+
end
|
342
|
+
end
|
343
|
+
}
|
344
|
+
|
345
|
+
@bcparse_tonodes = ->(text, boundary_constraints=/./) {
|
346
|
+
Enumerator.new do |y|
|
347
|
+
begin
|
348
|
+
lattice = self.mecab_lattice_new()
|
349
|
+
raise MeCabError.new("Could not create Lattice") if lattice.address == 0x0
|
350
|
+
|
351
|
+
if @options[:nbest] && @options[:nbest] > 1
|
352
|
+
n = @options[:nbest]
|
353
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_NBEST)
|
354
|
+
else
|
355
|
+
n = 1
|
356
|
+
self.mecab_lattice_set_request_type(lattice, MECAB_LATTICE_ONE_BEST)
|
357
|
+
end
|
358
|
+
if @options[:theta]
|
359
|
+
self.mecab_lattice_set_theta(lattice, @options[:theta])
|
360
|
+
end
|
361
|
+
|
362
|
+
tokens = tokenize(text, boundary_constraints)
|
363
|
+
text = tokens.map {|t| t.first}.join
|
364
|
+
self.mecab_lattice_set_sentence(lattice, text)
|
365
|
+
|
366
|
+
bpos = 0
|
367
|
+
tokens.each do |token|
|
368
|
+
c = token.first.bytes.count
|
369
|
+
|
370
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, MECAB_TOKEN_BOUNDARY)
|
371
|
+
bpos += 1
|
372
|
+
|
373
|
+
mark = token.last ? MECAB_INSIDE_TOKEN : MECAB_ANY_BOUNDARY
|
374
|
+
(c-1).times do
|
375
|
+
self.mecab_lattice_set_boundary_constraint(lattice, bpos, mark)
|
376
|
+
bpos += 1
|
377
|
+
end
|
378
|
+
end
|
379
|
+
|
380
|
+
self.mecab_parse_lattice(@tagger, lattice)
|
381
|
+
|
382
|
+
n.times do
|
383
|
+
check = self.mecab_lattice_next(lattice)
|
384
|
+
if check
|
385
|
+
nptr = self.mecab_lattice_get_bos_node(lattice)
|
386
|
+
|
387
|
+
s = text.bytes.to_a
|
388
|
+
while nptr && nptr.address!=0x0
|
389
|
+
mn = Natto::MeCabNode.new(nptr)
|
219
390
|
s = s.drop_while {|e| (e==0xa || e==0x20)}
|
220
391
|
if !s.empty?
|
221
392
|
sarr = []
|
@@ -224,50 +395,22 @@ module Natto
                     mn.surface = surf.force_encoding(Encoding.default_external)
                   end
                   if @options[:output_format_type] || @options[:node_format]
-                    mn.feature = self.mecab_format_node(@tagger,
+                    mn.feature = self.mecab_format_node(@tagger, nptr).force_encoding(Encoding.default_external)
                   end
                   y.yield mn
+                  nptr = mn.next
                 end
-                n = mn.next
               end
-              n = self.mecab_nbest_next_tonode(@tagger)
             end
-
-
-
-
-
-          retval = self.mecab_sparse_tostr(@tagger, text) ||
-            raise(MeCabError.new(self.mecab_strerror(@tagger)))
-          retval.force_encoding(Encoding.default_external)
-        end
-
-        @parse_tonodes = lambda do |text|
-          Enumerator.new do |y|
-            n = self.mecab_sparse_tonode(@tagger, text)
-            raise(MeCabError.new(self.mecab_strerror(@tagger))) if n.nil? || n.address==0x0
-
-            mn = Natto::MeCabNode.new(n)
-            n = mn.next if mn.next.address!=0x0
-            s = text.bytes.to_a
-            while n && n.address!=0x0
-              mn = Natto::MeCabNode.new(n)
-              s = s.drop_while {|e| (e==0xa || e==0x20)}
-              if !s.empty?
-                sarr = []
-                mn.length.times { sarr << s.shift }
-                surf = sarr.pack('C*')
-                mn.surface = surf.force_encoding(Encoding.default_external)
-              end
-              if @options[:output_format_type] || @options[:node_format]
-                mn.feature = self.mecab_format_node(@tagger, n).force_encoding(Encoding.default_external)
-              end
-              y.yield mn
-              n = mn.next
+          rescue
+            raise(MeCabError.new(self.mecab_lattice_strerror(lattice)))
+          ensure
+            if lattice.address != 0x0
+              self.mecab_lattice_destroy(lattice)
             end
           end
         end
-
+      }
 
       @dicts << Natto::DictionaryInfo.new(Natto::Binding.mecab_dictionary_info(@tagger))
       while @dicts.last.next.address != 0x0
@@ -283,17 +426,35 @@ module Natto
     # If a block is passed to this method, then node parsing will be used
     # and each node yielded to the given block.
     #
-    #
+    # Boundary constraint parsing is available via passing in the
+    # `boundary_constraints` key in the `options` hash. Boundary constraints
+    # parsing provides hints to MeCab on where the morpheme boundaries in the
+    # given `text` are located. `boundary_constraints` value may be either a
+    # `Regexp` or `String`; please see
+    # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan.
+    # The boundary constraint parsed output will be returned as a single
+    # string, unless a block is passed to this method for node parsing.
+    #
+    # @param text [String] the Japanese text to parse
+    # @param options [Hash] only the `boundary_constraints` key is available
     # @return [String] parsing result from `mecab`
     # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
     # @raise [ArgumentError] if the given string `text` argument is `nil`
     # @see MeCabNode
-    def parse(text)
+    def parse(text, options={})
       raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
-      if
-
+      if options[:boundary_constraints]
+        if block_given?
+          @bcparse_tonodes.call(text, options[:boundary_constraints]).each {|n| yield n }
+        else
+          @bcparse_tostr.call(text, options[:boundary_constraints])
+        end
       else
-
+        if block_given?
+          @parse_tonodes.call(text).each {|n| yield n }
+        else
+          @parse_tostr.call(text)
+        end
       end
     end
 
@@ -301,73 +462,34 @@ module Natto
     # {http://www.ruby-doc.org/core-2.1.5/Enumerator.html Enumerator} that may be
     # used to iterate over the resulting {MeCabNode} objects. This is more
     # efficient than parsing to a simple string, since each node's
-    # information will not be materialized all at once as
+    # information will not be materialized all at once as it is with
     # string output.
     #
     # MeCab nodes contain much more detailed information about
     # the morpheme. Node-formatting may also be used to customize
     # the resulting node's `feature` attribute.
     #
-    #
+    # Boundary constraint parsing is available via passing in the
+    # `boundary_constraints` key in the `options` hash. Boundary constraints
+    # parsing provides hints to MeCab on where the morpheme boundaries in the
+    # given `text` are located. `boundary_constraints` value may be either a
+    # `Regexp` or `String`; please see
+    # [String#scan](http://ruby-doc.org/core-2.2.0/String.html#method-i-scan String#scan).
+    #
+    # @param text [String] the Japanese text to parse
+    # @param options [Hash] only the `boundary_constraints` key is available
     # @return [Enumerator] of MeCabNode instances
     # @raise [MeCabError] if the `mecab` tagger cannot parse the given `text`
     # @raise [ArgumentError] if the given string `text` argument is `nil`
     # @see MeCabNode
-    # @see http://
-    def enum_parse(text)
+    # @see http://ruby-doc.org/core-2.2.1/Enumerator.html
+    def enum_parse(text, options={})
       raise ArgumentError.new 'Text to parse cannot be nil' if text.nil?
-
-
-
-
-
-    # Parses the given string `str`, and returns
-    # a list of `mecab` nodes.
-    # @param [String] str
-    # @return [Array] of parsed `mecab` nodes.
-    # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
-    # @raise [ArgumentError] if the given string `str` argument is `nil`
-    # @see MeCabNode
-    def parse_as_nodes(str)
-      $stderr.puts 'DEPRECATED: use enum_parse instead'
-      $stderr.puts ' This method will be removed in the next release!'
-      raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
-      @parse_tonodes.call(str)
-    end
-
-    # @deprecated
-    # DEPRECATED: use enum_parse instead, this convenience method is useless.
-    # Parses the given string `str`, and returns
-    # a list of `mecab` result strings.
-    # @param [String] str
-    # @return [Array] of parsed `mecab` result strings.
-    # @raise [MeCabError] if the `mecab` tagger cannot parse the given string `str`
-    # @raise [ArgumentError] if the given string `str` argument is `nil`
-    def parse_as_strings(str)
-      $stderr.puts 'DEPRECATED: use enum_parse instead'
-      $stderr.puts ' This method will be removed in the next release!'
-      raise ArgumentError.new 'String to parse cannot be nil' if str.nil?
-      @parse_tostr.call(str).lines.to_a
-    end
-
-    # @deprecated
-    # DEPRECATED: use enum_parse instead, this convenience method is useless.
-    # @param [String] str
-    # @return [Array] of parsed `mecab` nodes.
-    def readnodes(str)
-      $stderr.puts 'DEPRECATED: use enum_parse instead'
-      $stderr.puts ' This method will be removed in the next release!'
-      parse_as_nodes(str)
-    end
-
-    # @deprecated
-    # DEPRECATED: use enum_parse instead, this convenience method is useless.
-    # @param [String] str
-    # @return [Array] of parsed `mecab` result strings.
-    def readlines(str)
-      $stderr.puts 'DEPRECATED: use enum_parse instead'
-      $stderr.puts ' This method will be removed in the next release!'
-      parse_as_strings(str)
+      if options[:boundary_constraints]
+        @bcparse_tonodes.call(text, options[:boundary_constraints])
+      else
+        @parse_tonodes.call(text)
+      end
     end
 
     # Returns human-readable details for the wrapped `mecab` tagger.
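Review note: the `enum_parse` comment in this hunk says an `Enumerator` is more efficient because node information is not materialized all at once. A minimal sketch of that idea, runnable without MeCab, might look like the following. `FakeNode` and `each_node` are hypothetical stand-ins for `Natto::MeCabNode` and the lattice walk; they are not part of natto.

```ruby
# Hypothetical stand-in for a MeCab node: a surface string plus a link to the next node.
FakeNode = Struct.new(:surface, :next)

# Walk a linked list of nodes lazily. The `visited` array records which nodes
# were actually touched, so laziness can be observed from the outside.
def each_node(head, visited)
  Enumerator.new do |y|
    node = head
    while node
      visited << node.surface
      y.yield node
      node = node.next
    end
  end
end

c = FakeNode.new("c", nil)
b = FakeNode.new("b", c)
a = FakeNode.new("a", b)

visited = []
first_two = each_node(a, visited).first(2).map(&:surface)
# Only the two requested nodes were walked; "c" was never visited.
```

The same shape appears in `@bcparse_tonodes`: the consumer drives the walk, so stopping early avoids wrapping the remaining nodes.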
@@ -402,17 +524,41 @@ module Natto
     end
 
     # Returns a `Proc` that will properly free resources
-    # when this `
+    # when this `Tagger` instance is garbage collected.
     # The `Proc` returned is registered to be invoked
-    # after the `
+    # after the `Tagger` instance owning `tptr`
     # has been destroyed.
     #
-    # @param [FFI::Pointer]
+    # @param tptr [FFI::Pointer] pointer to `Tagger`
     # @return [Proc] to release `mecab` resources properly
-    def self.create_free_proc(
+    def self.create_free_proc(tptr)
       Proc.new do
-        self.mecab_destroy(
+        self.mecab_destroy(tptr)
+      end
+    end
+
+    private
+
+    # @private
+    def tokenize(text, pattern)
+      matches = text.scan(pattern)
+
+      acc =[]
+      tmp = text
+      matches.each_with_index do |m,i|
+        bef, mat, aft = tmp.partition(m)
+        unless bef.empty?
+          acc << [bef.strip, false]
+        end
+        unless mat.empty?
+          acc << [mat.strip, true]
+        end
+        if i==matches.size-1 and !aft.empty?
+          acc << [aft.strip, false]
+        end
+        tmp = aft
       end
+      acc
     end
   end
 
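Review note: to see what `tokenize` and the `bpos` loop in `@bcparse_tonodes` produce together, here is a standalone sketch that needs no MeCab installation. The `tokenize` body mirrors the added lines in this diff. `boundary_marks` is a hypothetical helper, and the symbols `:token_boundary`, `:inside_token` and `:any_boundary` are stand-ins for `MECAB_TOKEN_BOUNDARY`, `MECAB_INSIDE_TOKEN` and `MECAB_ANY_BOUNDARY`. It assumes the pattern has no capture groups, since `String#scan` would then return arrays rather than strings.

```ruby
# Split text into [chunk, matched?] pairs, as in the private helper added in this diff.
def tokenize(text, pattern)
  matches = text.scan(pattern)
  acc = []
  tmp = text
  matches.each_with_index do |m, i|
    bef, mat, aft = tmp.partition(m)
    acc << [bef.strip, false] unless bef.empty?
    acc << [mat.strip, true] unless mat.empty?
    acc << [aft.strip, false] if i == matches.size - 1 && !aft.empty?
    tmp = aft
  end
  acc
end

# Hypothetical helper: one mark per byte of the rejoined text, in byte order.
# The first byte of every chunk is a token boundary; the remaining bytes are
# pinned inside the token for matched chunks and left free otherwise.
def boundary_marks(tokens)
  marks = []
  tokens.each do |chunk, matched|
    marks << :token_boundary
    inner = matched ? :inside_token : :any_boundary
    (chunk.bytes.count - 1).times { marks << inner }
  end
  marks
end

tokens = tokenize("東京タワーに行く", /東京タワー/)
# tokens → [["東京タワー", true], ["に行く", false]]
marks = boundary_marks(tokens)
# 15 bytes for the matched chunk plus 9 bytes for the rest gives 24 marks.
```

Two behaviors fall out of the code as written: positions are counted in bytes, not characters, and a pattern with no matches yields an empty token list, so the rejoined text is empty.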
data/lib/natto/option_parse.rb
CHANGED
@@ -97,14 +97,14 @@ module Natto
     # Returns a string-representation of the options to
     # be passed in the construction of the `mecab` tagger.
     #
-    # @param [Hash] options
+    # @param options[Hash] options for MeCab
     # @return [String] representation of the options to the `mecab` tagger
     def build_options_str(options={})
       opt = []
       SUPPORTED_OPTS.values.each do |k|
         if options.has_key? k
           key = k.to_s.gsub('_', '-')
-          if %w( all-morphs allocate-sentence ).include? key
+          if %w( all-morphs allocate-sentence partial ).include? key
             opt << "--#{key}" if options[k]==true
           else
             opt << "--#{key}=#{options[k]}"
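Review note: the one-word change in this hunk makes `partial` a bare switch (`--partial`) instead of a `--partial=true` style argument, which matches the changelog entry about fixing partial parsing. A self-contained sketch of the resulting behavior follows. `SKETCH_OPTS` is a hypothetical subset of the real `SUPPORTED_OPTS` table, which this hunk does not show, and joining with single spaces is an assumption because the end of the method is cut off here.

```ruby
# Hypothetical subset of option keys; the real SUPPORTED_OPTS is not visible in this hunk.
SKETCH_OPTS = [:all_morphs, :allocate_sentence, :partial, :nbest, :theta]

# Options emitted as bare switches, taken from the %w list in the added line.
FLAG_OPTS = %w(all-morphs allocate-sentence partial)

def build_options_str(options = {})
  opt = []
  SKETCH_OPTS.each do |k|
    next unless options.key?(k)
    key = k.to_s.gsub('_', '-')
    if FLAG_OPTS.include?(key)
      # Switches appear only when explicitly set to true.
      opt << "--#{key}" if options[k] == true
    else
      opt << "--#{key}=#{options[k]}"
    end
  end
  opt.join(' ') # assumption: arguments are separated by single spaces
end

build_options_str(partial: true, nbest: 3) # → "--partial --nbest=3"
```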
data/lib/natto/struct.rb
CHANGED
@@ -6,12 +6,12 @@ module Natto
   require 'ffi'
 
   # `MeCabStruct` is a general base class for `FFI::Struct` objects in
-  # the `Natto` module. Please refer to
-  #
+  # the `Natto` module. Please refer to `mecab.h` in the source code
+  # distribution.
   class MeCabStruct < FFI::Struct
     # Provides accessor methods for the members of the `mecab` struct.
     #
-    # @param [String]
+    # @param attr_name [String] attribute name
     # @return member values for the `mecab` struct
     # @raise [NoMethodError] if `attr_name` is not a member of this `mecab` struct
     def method_missing(attr_name)
@@ -94,7 +94,7 @@ module Natto
     # Initializes this dictionary info instance.
     # Sets the `DictionaryInfo` filepath value.
     #
-    # @param [FFI::Pointer]
+    # @param ptr [FFI::Pointer] pointer to MeCab dictionary
     def initialize(ptr)
       super(ptr)
 
@@ -246,10 +246,10 @@ module Natto
     # Initializes this node instance.
     # Sets the `MeCab` feature value for this node.
     #
-    # @param [FFI::Pointer]
-    def initialize(
-      super(
-      @pointer =
+    # @param nptr [FFI::Pointer] pointer to MeCab node
+    def initialize(nptr)
+      super(nptr)
+      @pointer = nptr
 
       if self[:feature]
         @feature = self[:feature].force_encoding(Encoding.default_external)
@@ -311,10 +311,7 @@ module Natto
     def is_eon?
       self.stat == EON_NODE
     end
-
-
   end
-
 end
 
 # Copyright (c) 2015, Brooke M. Fujita.
data/lib/natto/version.rb
CHANGED
@@ -1,16 +1,16 @@
 # coding: utf-8
 
 # `Natto` is the namespace for objects that provide
-# a binding to
+# a binding to MeCab and an API for the `Tagger`,
+# `Node` and `Lattice` objects.
 #
-# `Natto::MeCab` is a wrapper class for the `
-# tagger.
+# `Natto::MeCab` is a wrapper class for the MeCab `Tagger`.
 #
 # `Natto::MeCabStruct` is a base class for a `mecab`
 # struct.
 #
 # `Natto::MeCabNode` is a wrapper for the struct representing
-# a `
+# a MeCab `Node`.
 #
 # `Natto::DictionaryInfo` is a wrapper for the struct
 # representing a `Natto::MeCab` instance's related
@@ -27,7 +27,7 @@
 # `Natto`.
 module Natto
   # Version string for this Rubygem.
-  VERSION = "0.9.
+  VERSION = "0.9.9"
 end
 
 # Copyright (c) 2015, Brooke M. Fujita.
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: natto
 version: !ruby/object:Gem::Version
-  version: 0.9.
+  version: 0.9.9
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-
+date: 2015-03-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ffi
@@ -27,9 +27,41 @@ dependencies:
     - - ! '>='
       - !ruby/object:Gem::Version
         version: 1.9.0
-
-
-
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: ! 'natto provides a naturally Ruby-esque interface to MeCab. It runs
+  on both CRuby (mri/yarv) and JRuby (jvm). It works with MeCab installations on Windows,
+  Unix/Linux, and OS X. No compiler is necessary, as natto is not a C extension.
 
   '
 email: buruzaemon@gmail.com
|