myaso 0.4.0

Sign up to get free protection for your applications and to get access to all the features.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 37940cb5479932d889a3e0cf92fe74d4d706ed46
4
+ data.tar.gz: e468549cebbddf8e6cad46d72b90f375de35d41b
5
+ SHA512:
6
+ metadata.gz: 82f29168750fe0b45192866c8d8f82e7eaba533d5fd8e6450f5b8ea9e106098a3255922357f354d0c7813793c2f06b02c51c1d67d6af74b8cb2c13800a90fefc
7
+ data.tar.gz: de8f484dc371381ef9e9ad1d9d558f7b21d1bcc2aeb73b96e4ad60b538995c14fd34266880ee4345ce8fbd4f041a1eb7e124adc8f2e622680d2e8743fe922f62
@@ -0,0 +1,25 @@
1
+ *.so
2
+ *.zip
3
+ *swp
4
+ *.~*
5
+ *.gem
6
+ *.rbc
7
+ .rbx
8
+ .bundle
9
+ .config
10
+ .yardoc
11
+ Gemfile.lock
12
+ InstalledFiles
13
+ _yardoc
14
+ coverage
15
+ doc/
16
+ lib/bundler/man
17
+ pkg
18
+ rdoc
19
+ spec/reports
20
+ test/tmp
21
+ test/version_tmp
22
+ tmp
23
+ .DS_Store
24
+ .ruby-version
25
+ .ruby-gemset
@@ -0,0 +1,10 @@
1
+ sudo: false
2
+ language: ruby
3
+ bundler_args: --without development
4
+ rvm:
5
+ - ruby
6
+ - jruby
7
+ install:
8
+ - wget 'https://github.com/yandex/tomita-parser/releases/download/v1.0/libmystem_c_binding.so.linux_x64.zip'
9
+ - unzip 'libmystem_c_binding.so.linux_x64.zip'
10
+ - bundle
data/Gemfile ADDED
@@ -0,0 +1,14 @@
1
+ # encoding: utf-8
2
+
3
+ source 'https://rubygems.org'
4
+
5
+ gemspec
6
+
7
+ group :development do
8
+ gem 'rdoc'
9
+ gem 'ruby-prof', :platforms => :mri
10
+ end
11
+
12
+ group :test do
13
+ gem 'rake'
14
+ end
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2010-2019 Dmitry Ustalov
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,213 @@
1
+ # Myaso
2
+
3
+ Myaso [ˈmʲæ.sə] is a morphological analysis and synthesis library, written in Ruby.
4
+
5
+ [![Gem Version][badge_fury_badge]][badge_fury_link] [![Build Status][travis_ci_badge]][travis_ci_link] [![Code Climate][code_climate_badge]][code_climage_link]
6
+
7
+ ![Myaso](myaso.jpg)
8
+
9
+ [badge_fury_badge]: https://badge.fury.io/rb/myaso.svg
10
+ [badge_fury_link]: https://badge.fury.io/rb/myaso
11
+ [travis_ci_badge]: https://travis-ci.org/dustalov/myaso.svg
12
+ [travis_ci_link]: https://travis-ci.org/dustalov/myaso
13
+ [code_climate_badge]: https://codeclimate.com/github/dustalov/myaso/badges/gpa.svg
14
+ [code_climage_link]: https://codeclimate.com/github/dustalov/myaso
15
+
16
+ ## Installation
17
+
18
+ Add this line to your application's Gemfile:
19
+
20
+ ```ruby
21
+ gem 'myaso'
22
+ ```
23
+
24
+ And then execute:
25
+
26
+ $ bundle
27
+
28
+ Or install it:
29
+
30
+ $ gem install myaso
31
+
32
+ ## Usage
33
+
34
+ At the moment, Myaso has pretty fast part of speech (POS) tagger built on hidden Markov models (HMMs). The tagging operation requires statistical model to be trained.
35
+
36
+ Myaso supports trained models in the TnT format. One could be obtained at the Serge Sharoff et al. resource called [Russian statistical taggers and parsers](http://corpus.leeds.ac.uk/mocky/).
37
+
38
+ ### Analysis
39
+
40
+ Since Yandex has released the [Mystem](https://tech.yandex.ru/mystem/) analyzer in the form of shared library, it makes it possible to use the analyzer through the foreign function interface.
41
+
42
+ Firstly, it is necessary to read and agree with the [mystem EULA](https://yandex.ru/legal/mystem/). Secondly, [download](https://github.com/yandex/tomita-parser/releases/tag/v1.0) and install the shared library for your operating system. Finally, use Myaso and enjoy the benefits.
43
+
44
+ #### Analysis API
45
+
46
+ Myaso uses mystem library to process Russian words. That is quite simple.
47
+
48
+ ```ruby
49
+ pp Myaso::Mystem.analyze('котёночка')
50
+ =begin
51
+ [#<struct Myaso::Mystem::Lemma
52
+ lemma="котеночек",
53
+ form="котёночка",
54
+ quality=:dictionary,
55
+ msd=#<Myasorubka::MSD::Russian msd="Ncmsay">,
56
+ stem_grammemes=[136, 192, 201],
57
+ flex_grammemes=[168, 174, 166],
58
+ flex_length=6,
59
+ rule_id=1525>]
60
+ =end
61
+ ```
62
+
63
+ Myaso works fine even in case the given word is either ambiguous or does not appear in the mystem's dictionary.
64
+
65
+ ```ruby
66
+ pp Myaso::Mystem.analyze('аудисты')
67
+ =begin
68
+ [#<struct Myaso::Mystem::Lemma
69
+ lemma="аудист",
70
+ form="аудисты",
71
+ quality=:bastard,
72
+ msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
73
+ stem_grammemes=[136, 192, 201],
74
+ flex_grammemes=[165, 175],
75
+ flex_length=1,
76
+ rule_id=25>,
77
+ #<struct Myaso::Mystem::Lemma
78
+ lemma="аудистый",
79
+ form="аудисты",
80
+ quality=:bastard,
81
+ msd=#<Myasorubka::MSD::Russian msd="A---p-s">,
82
+ stem_grammemes=[128],
83
+ flex_grammemes=[175, 183],
84
+ flex_length=1,
85
+ rule_id=65>]
86
+ =end
87
+ ```
88
+
89
+ ### Synthesis
90
+
91
+ Given the analyzed word, it is possible to retrieve all the possible forms. Having this information, one may use it to inflect a word. This is implemeneted using the abovementioned mystem shared library.
92
+
93
+ #### Synthesis API
94
+
95
+ In general form, all the possible word forms can be extracted with the specified word and its inflection rule.
96
+
97
+ ```ruby
98
+ pp Myaso::Mystem.forms('человеком', 3890)
99
+ =begin
100
+ [#<struct Myaso::Mystem::Form
101
+ form="людей",
102
+ msd=#<Myasorubka::MSD::Russian msd="Ncmpay">,
103
+ stem_grammemes=[136, 192, 201],
104
+ flex_grammemes=[168, 175, 166]>,
105
+ ...
106
+ #<struct Myaso::Mystem::Form
107
+ form="человеку",
108
+ msd=#<Myasorubka::MSD::Russian msd="Ncmsdy">,
109
+ stem_grammemes=[136, 192, 201],
110
+ flex_grammemes=[167, 174]>]
111
+ =end
112
+ ```
113
+
114
+ There exists a convenient way of doing this, which requires a previously lemmatized word.
115
+
116
+ ```ruby
117
+ lemmas = Myaso::Mystem.analyze('кот') # => [#<Myaso::Mystem::Lemma lemma="кот" msd="Ncmsny">]
118
+ pp lemmas[0].forms
119
+ =begin
120
+ [#<struct Myaso::Mystem::Form
121
+ form="кот",
122
+ msd=#<Myasorubka::MSD::Russian msd="Ncmsny">,
123
+ stem_grammemes=[136, 192, 201],
124
+ flex_grammemes=[165, 174]>,
125
+ ...
126
+ #<struct Myaso::Mystem::Form
127
+ form="коты",
128
+ msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
129
+ stem_grammemes=[136, 192, 201],
130
+ flex_grammemes=[165, 175]>]
131
+ =end
132
+ ```
133
+
134
+ Moreover, Myaso makes it possible to find exact matches of grammemes, but you have to be careful because computational linguistics is a hard field.
135
+
136
+ ```ruby
137
+ lemmas = Myaso::Mystem.analyze('человек') # => [#<Myaso::Mystem::Lemma lemma="человек" msd="Ncmpay">]
138
+ pp lemmas[0].inflect(:number => :plural, :case => :dative)
139
+ =begin
140
+ [#<struct Myaso::Mystem::Form
141
+ form="людям",
142
+ msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
143
+ stem_grammemes=[136, 192, 201],
144
+ flex_grammemes=[167, 175]>,
145
+ #<struct Myaso::Mystem::Form
146
+ form="человекам",
147
+ msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
148
+ stem_grammemes=[136, 192, 201],
149
+ flex_grammemes=[167, 175]>]
150
+ =end
151
+ ```
152
+
153
+ ### Tagging
154
+
155
+ Myaso performs POS tagging using its own implementation of the Viterbi algorithm on HMMs. The output has the following format: `token<TAB>tag`.
156
+
157
+ Please remember that tagger command line interface accepts only tokenized texts — one token per line. For instance, the [Greeb](https://github.com/dustalov/greeb) tokenizer can help you. Do not be afraid to use another text tokenization or segmentation tool if necessary.
158
+
159
+ ```
160
+ % echo 'Как поспал, проголодался наверное?' | greeb | myaso -n snyat-msd.123 -l snyat-msd.lex tagger
161
+ Как P-----r
162
+ поспал Vmis-sma
163
+ , ,
164
+ проголодался Vmis-sma
165
+ наверное R
166
+ ? SENT
167
+ ```
168
+
169
+ Unfortunately, current implementation of the tagger has two significant drawbacks:
170
+
171
+ 1. The tagger handles unknown words not so good. Sorry.
172
+ 2. Tagging is fast inself, but requires pretty slow training procedure running only once.
173
+
174
+ #### Tagging API
175
+
176
+ It is possible to embed the POS tagging feature in your own application using API.
177
+
178
+ ```ruby
179
+ model = Myaso::Tagger::TnT.new('model.123', 'model.lex')
180
+ tagger = Myaso::Tagger.new(model)
181
+ pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
182
+ =begin
183
+ ["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
184
+ =end
185
+ ```
186
+
187
+ It is possible to significantly speed up the initialization process by expicit setting of the interpolations vector. For instance, the TnT model from http://corpus.leeds.ac.uk/mocky/ has the following (approximated) linear interpolation coefficients: *k1 = 0.14*, *k2 = 0.30*, *k3 = 0.56*. In the example these values are provided precisely.
188
+
189
+ ```ruby
190
+ interpolations = [0.14095796503456284, 0.3032174211273352, 0.555824613838102]
191
+ model = Myaso::Tagger::TnT.new('model.123', 'model.lex', interpolations)
192
+ tagger = Myaso::Tagger.new(model)
193
+ pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
194
+ =begin
195
+ ["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
196
+ =end
197
+ ```
198
+
199
+ ## Acknowledgement
200
+
201
+ This work is partially supported by the Ural Branch of the Russian Academy of Sciences, grant no. РЦП-12-П10.
202
+
203
+ ## Contributing
204
+
205
+ 1. Fork it;
206
+ 2. Create your feature branch (`git checkout -b my-new-feature`);
207
+ 3. Commit your changes (`git commit -am 'Added some feature'`);
208
+ 4. Push to the branch (`git push origin my-new-feature`);
209
+ 5. Create new Pull Request.
210
+
211
+ ## Copyright
212
+
213
+ Copyright (c) 2010-2019 Dmitry Ustalov. See LICENSE for details.
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env rake
2
+ # encoding: utf-8
3
+
4
+ require 'rubygems/package_task'
5
+ require 'bundler/gem_tasks'
6
+ require 'rake/testtask'
7
+ require 'rdoc/task'
8
+
9
+ task :default => :test
10
+
11
+ Rake::TestTask.new do |test|
12
+ test.pattern = 'spec/**/*_spec.rb'
13
+ test.verbose = true
14
+ end
15
+
16
+ RDoc::Task.new do |rdoc|
17
+ rdoc.rdoc_dir = 'doc/rdoc'
18
+ rdoc.main = 'README.md'
19
+ rdoc.markup = 'markdown'
20
+ rdoc.rdoc_files.include('README.md', 'CHANGES.md', 'LICENSE.txt', 'lib/**/*.rb')
21
+ end
@@ -0,0 +1,73 @@
1
+ #!/usr/bin/env ruby
2
+ # encoding: utf-8
3
+
4
+ require 'ostruct'
5
+ require 'optparse'
6
+
7
+ if File.exists? File.expand_path('../../.git', __FILE__)
8
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
9
+ end
10
+
11
+ require 'myaso'
12
+
13
+ options = OpenStruct.new
14
+
15
+ optparse = OptionParser.new do |opts|
16
+ opts.banner = 'Usage: %s [options] command' % $PROGRAM_NAME
17
+
18
+ opts.separator ''
19
+ opts.separator 'Commands:'
20
+ opts.separator ' tagger: run the HMM tagger'
21
+ opts.separator ' console: start an IRB session'
22
+ opts.separator ''
23
+ opts.separator 'Options:'
24
+
25
+ opts.on('-n', '--ngrams ngrams', 'Path to ngrams file for tagger') do |n|
26
+ options.ngrams = n
27
+ end
28
+
29
+ opts.on('-l', '--lexicon lexicon', 'Path to lexicon file for tagger') do |l|
30
+ options.lexicon = l
31
+ end
32
+
33
+ opts.on '-e', '--eval [code]', 'Evaluate the given line of code' do |e|
34
+ options.eval = e
35
+ end
36
+
37
+ opts.on_tail '-h', '--help', 'Just display this help' do
38
+ puts opts
39
+ exit
40
+ end
41
+
42
+ opts.on_tail '-v', '--version', 'Just print the version infomation' do
43
+ puts 'Myaso v%s' % Myaso::VERSION
44
+ puts 'Copyright (c) 2010-2013 Dmitry Ustalov'
45
+ exit
46
+ end
47
+ end
48
+
49
+ optparse.parse!
50
+
51
+ eval(options.eval, binding, __FILE__, __LINE__) if options.eval
52
+
53
+ case ARGV.first
54
+ when 'tagger' then
55
+ sentence = STDIN.readlines.map(&:chomp)
56
+
57
+ STDERR.puts 'Training the tagger, this procedure is not so fast.'
58
+ model = Myaso::Tagger::TnT.new(options.ngrams, options.lexicon)
59
+ tagger = Myaso::Tagger.new(model)
60
+ tags = tagger.annotate(sentence)
61
+
62
+ sentence.zip(tags).each do |word, tag|
63
+ puts "%s\t%s" % [word, tag]
64
+ end
65
+ when 'console' then
66
+ ARGV.clear
67
+ include Myaso
68
+ require 'irb'
69
+ IRB.start
70
+ else
71
+ puts optparse
72
+ exit 1
73
+ end
@@ -0,0 +1,35 @@
1
+ # encoding: utf-8
2
+
3
+ require 'forwardable'
4
+ require 'ffi'
5
+
6
+ require 'myasorubka'
7
+ require 'myasorubka/msd/russian'
8
+ require 'myasorubka/mystem'
9
+
10
+ require 'myaso/version'
11
+ require 'myaso/pi_table'
12
+ require 'myaso/ngrams'
13
+ require 'myaso/lexicon'
14
+ require 'myaso/tagger'
15
+ require 'myaso/tagger/model'
16
+ require 'myaso/tagger/tnt'
17
+ require 'myaso/mystem'
18
+ require 'myaso/mystem/library'
19
+
20
+ # The UnknownWord exception is raised when Tagger considers an unknown
21
+ # word.
22
+ #
23
+ class Myaso::UnknownWord < RuntimeError
24
+ attr_reader :word
25
+
26
+ # @private
27
+ def initialize(word)
28
+ @word = word
29
+ end
30
+
31
+ # @private
32
+ def to_s
33
+ 'unknown word "%s"' % word
34
+ end
35
+ end
@@ -0,0 +1,70 @@
1
+ # encoding: utf-8
2
+
3
+ # A pretty useful representation of a lexicon in the following form:
4
+ # `word_prefix -> word -> tags`.
5
+ #
6
+ class Myaso::Lexicon
7
+ extend Forwardable
8
+ include Enumerable
9
+
10
+ attr_reader :table
11
+ def_delegator :@table, :each, :each
12
+
13
+ # An instance of a n-gram storage is initialized by zero counts.
14
+ #
15
+ def initialize
16
+ @table = Hash.new do |h, k|
17
+ h[k] = Hash.new { |h_local, k_local| h_local[k_local] = Hash.new(0) }
18
+ end
19
+ end
20
+
21
+ # Obtain the count of the specified word and tag.
22
+ #
23
+ def [] word, tag = nil
24
+ return 0 unless table.include? prefix(word)
25
+ return 0 unless table[prefix(word)].include? word
26
+ table[prefix(word)][word][tag]
27
+ end
28
+
29
+ # Assign the count to the specified word and tag.
30
+ #
31
+ def []= word, tag = nil, count
32
+ @tags = nil
33
+ table[prefix(word)][word][tag] = count
34
+ end
35
+
36
+ # Retrieve global tags or tags of the given word.
37
+ #
38
+ def tags(word = nil)
39
+ return lazy_aggregated_tags unless word
40
+ table[prefix(word)][word].keys.compact
41
+ end
42
+
43
+ # Two lexicons are equal iff they tables are equal.
44
+ #
45
+ def == other
46
+ self.table == other.table
47
+ end
48
+
49
+ protected
50
+ # Perform lazy initialization of global tags.
51
+ #
52
+ def lazy_aggregated_tags
53
+ @tags ||= table.inject(Hash.new(0)) do |hash, (_, wts)|
54
+ wts.each do |word, tags|
55
+ tags.each do |tag, count|
56
+ next unless tag
57
+ hash[tag] += count
58
+ end
59
+ end
60
+
61
+ hash
62
+ end
63
+ end
64
+
65
+ # Extract the word prefix of three characters.
66
+ #
67
+ def prefix(word)
68
+ word[0..2]
69
+ end
70
+ end