RubyGems - myaso - Versions diffs - 0.4.0 - Mend

myaso 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

checksums.yaml +7 -0
data/.gitignore +25 -0
data/.travis.yml +10 -0
data/Gemfile +14 -0
data/LICENSE.txt +22 -0
data/README.md +213 -0
data/Rakefile +21 -0
data/bin/myaso +73 -0
data/lib/myaso.rb +35 -0
data/lib/myaso/lexicon.rb +70 -0
data/lib/myaso/mystem.rb +187 -0
data/lib/myaso/mystem/library.rb +59 -0
data/lib/myaso/ngrams.rb +67 -0
data/lib/myaso/pi_table.rb +36 -0
data/lib/myaso/tagger.rb +94 -0
data/lib/myaso/tagger/model.rb +68 -0
data/lib/myaso/tagger/tnt.rb +183 -0
data/lib/myaso/version.rb +9 -0
data/myaso.gemspec +26 -0
data/myaso.jpg +0 -0
data/spec/bin_spec.rb +48 -0
data/spec/data/test.123 +77 -0
data/spec/data/test.lex +10 -0
data/spec/fixtures/interpolations.yml +4 -0
data/spec/fixtures/lexicon.yml +32 -0
data/spec/fixtures/ngrams.yml +106 -0
data/spec/lexicon_spec.rb +84 -0
data/spec/mystem_spec.rb +81 -0
data/spec/ngrams_spec.rb +97 -0
data/spec/pi_table_spec.rb +53 -0
data/spec/spec_helper.rb +12 -0
data/spec/support/fixtures.rb +34 -0
data/spec/support/invoker.rb +29 -0
data/spec/tagger_spec.rb +27 -0
data/spec/tagger_tnt_spec.rb +73 -0
metadata +137 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 37940cb5479932d889a3e0cf92fe74d4d706ed46
+  data.tar.gz: e468549cebbddf8e6cad46d72b90f375de35d41b
+SHA512:
+  metadata.gz: 82f29168750fe0b45192866c8d8f82e7eaba533d5fd8e6450f5b8ea9e106098a3255922357f354d0c7813793c2f06b02c51c1d67d6af74b8cb2c13800a90fefc
+  data.tar.gz: de8f484dc371381ef9e9ad1d9d558f7b21d1bcc2aeb73b96e4ad60b538995c14fd34266880ee4345ce8fbd4f041a1eb7e124adc8f2e622680d2e8743fe922f62
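
The digests above cover the two archives packed inside the `.gem` file. As a minimal sketch of how such a digest could be compared, assuming the archive bytes are already at hand, the hypothetical helper `checksum_matches?` below uses only Ruby's standard `digest` library; the sample data is illustrative and is not taken from the package.

```ruby
require 'digest'

# Hypothetical helper: compare the SHA-512 digest of some bytes
# against an expected hexadecimal string, ignoring letter case.
def checksum_matches?(bytes, expected_hex)
  Digest::SHA512.hexdigest(bytes) == expected_hex.downcase
end

# Illustrative data only; in practice the bytes would be read from
# the extracted data.tar.gz and the digest taken from checksums.yaml.
sample = 'myaso'
expected = Digest::SHA512.hexdigest(sample)

puts checksum_matches?(sample, expected)        # prints "true"
puts checksum_matches?(sample + '!', expected)  # prints "false"
```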

data/.gitignore ADDED

@@ -0,0 +1,25 @@
+*.so
+*.zip
+*swp
+*.~*
+*.gem
+*.rbc
+.rbx
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+.DS_Store
+.ruby-version
+.ruby-gemset

data/.travis.yml ADDED

@@ -0,0 +1,10 @@
+sudo: false
+language: ruby
+bundler_args: --without development
+rvm:
+- ruby
+- jruby
+install:
+- wget 'https://github.com/yandex/tomita-parser/releases/download/v1.0/libmystem_c_binding.so.linux_x64.zip'
+- unzip 'libmystem_c_binding.so.linux_x64.zip'
+- bundle

data/Gemfile ADDED

@@ -0,0 +1,14 @@
+# encoding: utf-8
+source 'https://rubygems.org'
+gemspec
+group :development do
+  gem 'rdoc'
+  gem 'ruby-prof', :platforms => :mri
+end
+group :test do
+  gem 'rake'
+end

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2010-2019 Dmitry Ustalov
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,213 @@
+# Myaso
+Myaso [ˈmʲæ.sə] is a morphological analysis and synthesis library, written in Ruby.
+[![Gem Version][badge_fury_badge]][badge_fury_link] [![Build Status][travis_ci_badge]][travis_ci_link] [![Code Climate][code_climate_badge]][code_climage_link]
+![Myaso](myaso.jpg)
+[badge_fury_badge]: https://badge.fury.io/rb/myaso.svg
+[badge_fury_link]: https://badge.fury.io/rb/myaso
+[travis_ci_badge]: https://travis-ci.org/dustalov/myaso.svg
+[travis_ci_link]: https://travis-ci.org/dustalov/myaso
+[code_climate_badge]: https://codeclimate.com/github/dustalov/myaso/badges/gpa.svg
+[code_climage_link]: https://codeclimate.com/github/dustalov/myaso
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'myaso'
+```
+And then execute:
+    $ bundle
+Or install it:
+    $ gem install myaso
+## Usage
+At the moment, Myaso has a pretty fast part-of-speech (POS) tagger built on hidden Markov models (HMMs). The tagging operation requires a trained statistical model.
+Myaso supports trained models in the TnT format. One can be obtained from the resource by Serge Sharoff et al. called [Russian statistical taggers and parsers](http://corpus.leeds.ac.uk/mocky/).
+### Analysis
+Since Yandex has released the [Mystem](https://tech.yandex.ru/mystem/) analyzer as a shared library, the analyzer can be used through the foreign function interface.
+First, it is necessary to read and agree to the [mystem EULA](https://yandex.ru/legal/mystem/). Second, [download](https://github.com/yandex/tomita-parser/releases/tag/v1.0) and install the shared library for your operating system. Finally, use Myaso and enjoy the benefits.
+#### Analysis API
+Myaso uses the mystem library to process Russian words. That is quite simple.
+```ruby
+pp Myaso::Mystem.analyze('котёночка')
+=begin
+[#<struct Myaso::Mystem::Lemma
+  lemma="котеночек",
+  form="котёночка",
+  quality=:dictionary,
+  msd=#<Myasorubka::MSD::Russian msd="Ncmsay">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[168, 174, 166],
+  flex_length=6,
+  rule_id=1525>]
+=end
+```
+Myaso works fine even when the given word is ambiguous or does not appear in mystem's dictionary.
+```ruby
+pp Myaso::Mystem.analyze('аудисты')
+=begin
+[#<struct Myaso::Mystem::Lemma
+  lemma="аудист",
+  form="аудисты",
+  quality=:bastard,
+  msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[165, 175],
+  flex_length=1,
+  rule_id=25>,
+ #<struct Myaso::Mystem::Lemma
+  lemma="аудистый",
+  form="аудисты",
+  quality=:bastard,
+  msd=#<Myasorubka::MSD::Russian msd="A---p-s">,
+  stem_grammemes=[128],
+  flex_grammemes=[175, 183],
+  flex_length=1,
+  rule_id=65>]
+=end
+```
+### Synthesis
+Given the analyzed word, it is possible to retrieve all of its possible forms. This information can then be used to inflect the word. This is implemented using the above-mentioned mystem shared library.
+#### Synthesis API
+In the general case, all the possible word forms can be extracted from the specified word and its inflection rule.
+```ruby
+pp Myaso::Mystem.forms('человеком', 3890)
+=begin
+[#<struct Myaso::Mystem::Form
+  form="людей",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmpay">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[168, 175, 166]>,
+ ...
+ #<struct Myaso::Mystem::Form
+  form="человеку",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmsdy">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[167, 174]>]
+=end
+```
+There exists a convenient way of doing this, which requires a previously lemmatized word.
+```ruby
+lemmas = Myaso::Mystem.analyze('кот') # => [#<Myaso::Mystem::Lemma lemma="кот" msd="Ncmsny">]
+pp lemmas[0].forms
+=begin
+[#<struct Myaso::Mystem::Form
+  form="кот",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmsny">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[165, 174]>,
+ ...
+ #<struct Myaso::Mystem::Form
+  form="коты",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmpny">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[165, 175]>]
+=end
+```
+Moreover, Myaso makes it possible to find exact matches of grammemes, but you have to be careful because computational linguistics is a hard field.
+```ruby
+lemmas = Myaso::Mystem.analyze('человек') # => [#<Myaso::Mystem::Lemma lemma="человек" msd="Ncmpay">]
+pp lemmas[0].inflect(:number => :plural, :case => :dative)
+=begin
+[#<struct Myaso::Mystem::Form
+  form="людям",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[167, 175]>,
+ #<struct Myaso::Mystem::Form
+  form="человекам",
+  msd=#<Myasorubka::MSD::Russian msd="Ncmpdy">,
+  stem_grammemes=[136, 192, 201],
+  flex_grammemes=[167, 175]>]
+=end
+```
+### Tagging
+Myaso performs POS tagging using its own implementation of the Viterbi algorithm on HMMs. The output has the following format: `token<TAB>tag`.
+Please remember that the tagger command line interface accepts only tokenized text, one token per line. For instance, the [Greeb](https://github.com/dustalov/greeb) tokenizer can help you. Do not be afraid to use another text tokenization or segmentation tool if necessary.
+```
+% echo 'Как поспал, проголодался наверное?' | greeb | myaso -n snyat-msd.123 -l snyat-msd.lex tagger
+Как	P-----r
+поспал	Vmis-sma
+,	,
+проголодался	Vmis-sma
+наверное	R
+?	SENT
+```
+Unfortunately, the current implementation of the tagger has two significant drawbacks:
+1. The tagger does not handle unknown words well. Sorry.
+2. Tagging is fast in itself, but requires a pretty slow training procedure that runs only once.
+#### Tagging API
+It is possible to embed the POS tagging feature in your own application using the API.
+```ruby
+model = Myaso::Tagger::TnT.new('model.123', 'model.lex')
+tagger = Myaso::Tagger.new(model)
+pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
+=begin
+["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
+=end
+```
+It is possible to significantly speed up the initialization process by explicitly setting the interpolations vector. For instance, the TnT model from http://corpus.leeds.ac.uk/mocky/ has the following (approximate) linear interpolation coefficients: *k1 = 0.14*, *k2 = 0.30*, *k3 = 0.56*. The example below provides these values at full precision.
+```ruby
+interpolations = [0.14095796503456284, 0.3032174211273352, 0.555824613838102]
+model = Myaso::Tagger::TnT.new('model.123', 'model.lex', interpolations)
+tagger = Myaso::Tagger.new(model)
+pp tagger.annotate(%w(Как поспал , проголодался наверное ?))
+=begin
+["P-----r", "Vmis-sma", ",", "Vmis-sma", "R", "SENT"]
+=end
+```
+## Acknowledgement
+This work is partially supported by the Ural Branch of the Russian Academy of Sciences, grant no. РЦП-12-П10.
+## Contributing
+1. Fork it;
+2. Create your feature branch (`git checkout -b my-new-feature`);
+3. Commit your changes (`git commit -am 'Added some feature'`);
+4. Push to the branch (`git push origin my-new-feature`);
+5. Create new Pull Request.
+## Copyright
+Copyright (c) 2010-2019 Dmitry Ustalov. See LICENSE for details.
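
The README mentions the linear interpolation coefficients *k1*, *k2* and *k3* without showing how they are applied. The following is a minimal sketch, not Myaso's actual code: it assumes the usual TnT convention in which *k1* weights the unigram estimate, *k2* the bigram estimate and *k3* the trigram estimate. The helper `interpolate` and the probability values are hypothetical.

```ruby
# Hypothetical helper: combine unigram, bigram and trigram estimates
# with the linear interpolation coefficients k1, k2 and k3.
# Assumption: k1 belongs to the unigram term, as in the TnT convention.
def interpolate(unigram, bigram, trigram, interpolations)
  k1, k2, k3 = interpolations
  k1 * unigram + k2 * bigram + k3 * trigram
end

# Coefficients quoted in the README; they sum to approximately 1.
interpolations = [0.14095796503456284, 0.3032174211273352, 0.555824613838102]

# Made-up probability estimates, for illustration only.
probability = interpolate(0.05, 0.20, 0.40, interpolations)
puts probability
```

Because the coefficients sum to roughly one, the result stays a valid probability whenever the three estimates are probabilities themselves.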

data/Rakefile ADDED

@@ -0,0 +1,21 @@
+#!/usr/bin/env rake
+# encoding: utf-8
+require 'rubygems/package_task'
+require 'bundler/gem_tasks'
+require 'rake/testtask'
+require 'rdoc/task'
+task :default => :test
+Rake::TestTask.new do |test|
+  test.pattern = 'spec/**/*_spec.rb'
+  test.verbose = true
+end
+RDoc::Task.new do |rdoc|
+  rdoc.rdoc_dir = 'doc/rdoc'
+  rdoc.main = 'README.md'
+  rdoc.markup = 'markdown'
+  rdoc.rdoc_files.include('README.md', 'CHANGES.md', 'LICENSE.txt', 'lib/**/*.rb')
+end

data/bin/myaso ADDED

@@ -0,0 +1,73 @@
+#!/usr/bin/env ruby
+# encoding: utf-8
+require 'ostruct'
+require 'optparse'
+if File.exist? File.expand_path('../../.git', __FILE__)
+  $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+end
+require 'myaso'
+options = OpenStruct.new
+optparse = OptionParser.new do |opts|
+  opts.banner = 'Usage: %s [options] command' % $PROGRAM_NAME
+  opts.separator ''
+  opts.separator 'Commands:'
+  opts.separator '    tagger: run the HMM tagger'
+  opts.separator '    console: start an IRB session'
+  opts.separator ''
+  opts.separator 'Options:'
+  opts.on('-n', '--ngrams ngrams', 'Path to ngrams file for tagger') do |n|
+    options.ngrams = n
+  end
+  opts.on('-l', '--lexicon lexicon', 'Path to lexicon file for tagger') do |l|
+    options.lexicon = l
+  end
+  opts.on '-e', '--eval [code]', 'Evaluate the given line of code' do |e|
+    options.eval = e
+  end
+  opts.on_tail '-h', '--help', 'Just display this help' do
+    puts opts
+    exit
+  end
+  opts.on_tail '-v', '--version', 'Just print the version information' do
+    puts 'Myaso v%s' % Myaso::VERSION
+    puts 'Copyright (c) 2010-2013 Dmitry Ustalov'
+    exit
+  end
+end
+optparse.parse!
+eval(options.eval, binding, __FILE__, __LINE__) if options.eval
+case ARGV.first
+when 'tagger' then
+  sentence = STDIN.readlines.map(&:chomp)
+  STDERR.puts 'Training the tagger, this procedure is not so fast.'
+  model = Myaso::Tagger::TnT.new(options.ngrams, options.lexicon)
+  tagger = Myaso::Tagger.new(model)
+  tags = tagger.annotate(sentence)
+  sentence.zip(tags).each do |word, tag|
+    puts "%s\t%s" % [word, tag]
+  end
+when 'console' then
+  ARGV.clear
+  include Myaso
+  require 'irb'
+  IRB.start
+else
+  puts optparse
+  exit 1
+end
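
The `tagger` command above reads one token per line and prints `token<TAB>tag` pairs. The flow can be sketched without a trained model as follows; the input is simulated with `StringIO` and the tags are stand-in values copied from the README example, whereas the script obtains them from `Myaso::Tagger#annotate`.

```ruby
require 'stringio'

# Simulated standard input: one token per line, as the tagger expects.
input = StringIO.new("Как\nпоспал\n,\n")
sentence = input.readlines.map(&:chomp)

# Stand-in tags; the real script asks the tagger for them.
tags = %w(P-----r Vmis-sma ,)

# Pair every token with its tag and join the pair with a TAB.
lines = sentence.zip(tags).map { |word, tag| "%s\t%s" % [word, tag] }
puts lines
```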

data/lib/myaso.rb ADDED

@@ -0,0 +1,35 @@
+# encoding: utf-8
+require 'forwardable'
+require 'ffi'
+require 'myasorubka'
+require 'myasorubka/msd/russian'
+require 'myasorubka/mystem'
+require 'myaso/version'
+require 'myaso/pi_table'
+require 'myaso/ngrams'
+require 'myaso/lexicon'
+require 'myaso/tagger'
+require 'myaso/tagger/model'
+require 'myaso/tagger/tnt'
+require 'myaso/mystem'
+require 'myaso/mystem/library'
+# The UnknownWord exception is raised when Tagger encounters an unknown
+# word.
+#
+class Myaso::UnknownWord < RuntimeError
+  attr_reader :word
+  # @private
+  def initialize(word)
+    @word = word
+  end
+  # @private
+  def to_s
+    'unknown word "%s"' % word
+  end
+end
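
A caller might handle this exception as sketched below. To keep the snippet runnable without the gem, the class is re-declared inside a stand-in `Sketch` module, which is hypothetical; with the gem loaded one would rescue `Myaso::UnknownWord` instead.

```ruby
# Stand-in namespace so that the sketch runs without the gem installed.
module Sketch
  class UnknownWord < RuntimeError
    attr_reader :word

    def initialize(word)
      @word = word
    end

    def to_s
      'unknown word "%s"' % word
    end
  end
end

message = nil
word = nil

begin
  raise Sketch::UnknownWord.new('аудисты')
rescue Sketch::UnknownWord => e
  # Exception#message delegates to the overridden #to_s.
  message = e.message
  word = e.word
end

puts message  # prints: unknown word "аудисты"
```

Exposing the offending word through a reader lets the caller decide what to do with it, for example falling back to a default tag, without parsing the message text.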

data/lib/myaso/lexicon.rb ADDED

@@ -0,0 +1,70 @@
+# encoding: utf-8
+# A pretty useful representation of a lexicon in the following form:
+# `word_prefix -> word -> tags`.
+#
+class Myaso::Lexicon
+  extend Forwardable
+  include Enumerable
+  attr_reader :table
+  def_delegator :@table, :each, :each
+  # An instance of a lexicon is initialized with zero counts.
+  #
+  def initialize
+    @table = Hash.new do |h, k|
+      h[k] = Hash.new { |h_local, k_local| h_local[k_local] = Hash.new(0) }
+    end
+  end
+  # Obtain the count of the specified word and tag.
+  #
+  def [] word, tag = nil
+    return 0 unless table.include? prefix(word)
+    return 0 unless table[prefix(word)].include? word
+    table[prefix(word)][word][tag]
+  end
+  # Assign the count to the specified word and tag.
+  #
+  def []= word, tag = nil, count
+    @tags = nil
+    table[prefix(word)][word][tag] = count
+  end
+  # Retrieve global tags or tags of the given word.
+  #
+  def tags(word = nil)
+    return lazy_aggregated_tags unless word
+    table[prefix(word)][word].keys.compact
+  end
+  # Two lexicons are equal iff their tables are equal.
+  #
+  def == other
+    self.table == other.table
+  end
+  protected
+  # Perform lazy initialization of global tags.
+  #
+  def lazy_aggregated_tags
+    @tags ||= table.inject(Hash.new(0)) do |hash, (_, wts)|
+      wts.each do |word, tags|
+        tags.each do |tag, count|
+          next unless tag
+          hash[tag] += count
+        end
+      end
+      hash
+    end
+  end
+  # Extract the word prefix of three characters.
+  #
+  def prefix(word)
+    word[0..2]
+  end
+end
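
The `word_prefix -> word -> tags` layout used by the lexicon can be explored in isolation. The sketch below rebuilds the same nested default-hash structure as the constructor above, without loading the gem; the counts are hypothetical.

```ruby
# Same nested default-hash layout as in Myaso::Lexicon#initialize:
# prefix -> word -> tag -> count.
table = Hash.new do |h, k|
  h[k] = Hash.new { |h_local, k_local| h_local[k_local] = Hash.new(0) }
end

# Hypothetical counts, keyed by the first three characters of a word.
table['кот']['кот']['Ncmsny'] = 3
table['кот']['коты']['Ncmpny'] = 2

p table['кот'].keys              # ["кот", "коты"]
p table['кот']['кот']['Ncmsay']  # 0, reading an unseen tag stores nothing

# The two outer levels do store an entry on lookup.
table['соб']['собака']
p table.key?('соб')              # true
```

The last lines suggest why `[]` checks `include?` before descending: a plain lookup through the two outer hashes would create empty entries for every word that is queried.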