RubyGems - engtagger - Versions diffs - 0.2.0 → 0.3.0 - Mend

engtagger 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 47684458b8965c2c1f52d0d29ca2cad340b0ed3f
-  data.tar.gz: aa6dbd0473409e9b6b2987a42c126a6b10dce096
+SHA256:
+  metadata.gz: b3f1fc1d4e6d89d2920a0774342478d951bacd4558ff8c4054da719730ed0b9c
+  data.tar.gz: 2c9061d018dd63d699ad18713edf0f8ba74720632574e2ed2b530965c501abc5
 SHA512:
-  metadata.gz: cddd67eab940146a2426032714aedd8e5195192ead3133ade5f76c594c5f0667f0747bba8b019f63df91dd2eb0610da19288a9e06a2323eb7bde91fec025b028
-  data.tar.gz: a76ca3422b9a3a1a813263b6e4ab5e69ca74ac998afcc259be55a6441b6b46b21ac9de6eb241327956a6fda1c2d3dbd033ba1678df86506f153089a3ef99d46d
+  metadata.gz: 475e5093d071bee1fac32a98713dd3eadc51262fc61cd090fe54fc98aad68d9d0c544aae0c10374aa38ac17676f0db0dbabc18a34f393747c1b9a51ff4d687ad
+  data.tar.gz: 4bfc9068df3ce8cf4688c0475600c326302c4df5ed1bb13848eb64c200ffc9e2fba61edb9f8cd64d1c6cb47015384cc3020bec707ddfc74e941874c310cbed83
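
The hunk above replaces the SHA1 entries with SHA256 and keeps SHA512. As a minimal sketch of how such digests could be recomputed locally for comparison, the following uses Ruby's standard `digest` library. The helper name `checksums_for` and the file path in the usage comment are hypothetical and are not part of engtagger or RubyGems.

```ruby
require 'digest'

# Hypothetical helper (not part of engtagger or RubyGems): compute the two
# digest kinds that checksums.yaml records in 0.3.0 for a single file.
def checksums_for(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Usage (assumes data.tar.gz has already been extracted from the .gem archive):
#   checksums_for('data.tar.gz').each { |algo, hex| puts "#{algo}: #{hex}" }
```

The resulting hex strings can then be compared by eye, or with `==`, against the values listed in the hunk.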

data/.gitignore CHANGED

@@ -15,3 +15,4 @@ spec/reports
 test/tmp
 test/version_tmp
 tmp
+/.idea

data/.yardopts ADDED

@@ -0,0 +1,5 @@
+--protected
+--no-private
+--hide-void-return
+--markup markdown
+--readme README.md

data/Gemfile CHANGED

@@ -1,4 +1,3 @@
 source 'https://rubygems.org'
-# Specify your gem's dependencies in engtagger.gemspec
-gemspec
+gem 'lru_redux'

data/README.md CHANGED

@@ -4,13 +4,13 @@ English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
 ### Description
-A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
-tagger that assigns POS tags to English text based on a lookup dictionary and
-a set of probability values. The tagger assigns appropriate tags based on
-conditional probabilities--it examines the preceding tag to determine the
-appropriate tag for the current word. Unknown words are classified according to
-word morphology or can be set to be treated as nouns or other parts of speech.
-The tagger also extracts as many nouns and noun phrases as it can, using a set
+A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
+tagger that assigns POS tags to English text based on a lookup dictionary and
+a set of probability values. The tagger assigns appropriate tags based on
+conditional probabilities--it examines the preceding tag to determine the
+appropriate tag for the current word. Unknown words are classified according to
+word morphology or can be set to be treated as nouns or other parts of speech.
+The tagger also extracts as many nouns and noun phrases as it can, using a set
 of regular expressions.
 ### Features
@@ -21,7 +21,6 @@ of regular expressions.
 ### Synopsis:
-    require 'rubygems'
     require 'engtagger'
     # Create a parser object
@@ -34,20 +33,20 @@ of regular expressions.
     tagged = tgr.add_tags(text)
     #=> "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> <jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"
     # Get a list of all nouns and noun phrases with occurrence counts
     word_list = tgr.get_words(text)
     #=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
     # Get a readable version of the tagged text
     readable = tgr.get_readable(text)
     #=> "Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP"
     # Get all nouns from a tagged output
     nouns = tgr.get_nouns(tagged)
     #=> {"cat"=>1, "Alice"=>1}
     # Get all proper nouns
@@ -73,13 +72,13 @@ of regular expressions.
 ### Tag Set
-The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.
+The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.
     CC      Conjunction, coordinating               and, or
     CD      Adjective, cardinal number              3, fifteen
     DET     Determiner                              this, each, some
     EX      Pronoun, existential there              there
-    FW      Foreign words
+    FW      Foreign words
     IN      Preposition / Conjunction               for, of, although, that
     JJ      Adjective                               happy, bad
     JJR     Adjective, comparative                  happier, worse
@@ -111,7 +110,7 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
     WP      Pronoun, question                       who, whoever
     WPS     Determiner, possessive & question       whose
     WRB     Adverb, question                        when, how, however
     PP      Punctuation, sentence ender             ., !, ?
     PPC     Punctuation, comma                      ,
     PPD     Punctuation, dollar sign                $
@@ -121,29 +120,24 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
     LRB     Punctuation, left bracket               (, {, [
     RRB     Punctuation, right bracket              ), }, ]
-### Requirements
-* [Hpricot](http://code.whytheluckystiff.net/hpricot/) (optional)
 ### Install
-    (sudo) gem install engtagger
+    gem install engtagger
 ### Author
-of this Ruby library
+of this Ruby library
-* Yoichiro Hasebe (yohasebe [at] gmail.com)
+* Yoichiro Hasebe (yohasebe [at] gmail.com)
 ### Contributors
-* Carlos Ramirez III
-* Phil London
+Many thanks to the collaborators listed in the right column of this GitHub page.
 ### Acknowledgement
 This Ruby library is a direct port of Lingua::EN::Tagger available at CPAN.
-The credit for the crucial part of its algorithm/design therefore goes to
+The credit for the crucial part of its algorithm/design therefore goes to
 Aaron Coburn, the author of the original Perl version.
 ### License

data/engtagger.gemspec CHANGED

@@ -4,14 +4,14 @@ require File.expand_path('../lib/engtagger/version', __FILE__)
 Gem::Specification.new do |gem|
   gem.authors       = ["Yoichiro Hasebe"]
   gem.email         = ["yohasebe@gmail.com"]
-  gem.summary         = %q{A probability based, corpus-trained English POS tagger}
-  gem.description     = %q{A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values.}
-  gem.homepage        = "http://github.com/yohasebe/engtagger"
+  gem.summary         = %q{A probability based, corpus-trained English POS tagger}
+  gem.description     = %q{A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values.}
+  gem.homepage        = "http://github.com/yohasebe/engtagger"
   gem.files         = `git ls-files`.split($\)
   gem.executables   = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
   gem.test_files    = gem.files.grep(%r{^(test|spec|features)/})
   gem.name          = "engtagger"
   gem.require_paths = ["lib"]
-  gem.version       = EngTagger::VERSION
+  gem.version       = EngTagger::VERSION
 end

data/lib/engtagger/porter.rb CHANGED

@@ -12,7 +12,7 @@ module Stemmable
     'ousness'=>'ous', 'aliti'=>'al',
     'iviti'=>'ive', 'biliti'=>'ble', 'logi'=>'log'
   }
   STEP_3_LIST = {
     'icate'=>'ic', 'ative'=>'', 'alize'=>'al', 'iciti'=>'ic',
     'ical'=>'ic', 'ful'=>'', 'ness'=>''
@@ -48,7 +48,7 @@ module Stemmable
                       ance     |
                       ence     |
                       er       |
-                      ic       |
+                      ic       |
                       able     |
                       ible     |
                       ant      |
@@ -88,30 +88,30 @@ module Stemmable
   #
   # Send comments to raypereda@hotmail.com
   #
   def stem_porter
     # make a copy of the given object and convert it to a string.
     w = self.dup.to_str
     return w if w.length < 3
     # now map initial y to Y so that the patterns never treat it as vowel
     w[0] = 'Y' if w[0] == ?y
     # Step 1a
     if w =~ /(ss|i)es$/
       w = $` + $1
-    elsif w =~ /([^s])s$/
+    elsif w =~ /([^s])s$/
       w = $` + $1
     end
     # Step 1b
     if w =~ /eed$/
-      w.chop! if $` =~ MGR0
+      w.chop! if $` =~ MGR0
     elsif w =~ /(ed|ing)$/
       stem = $`
-      if stem =~ VOWEL_IN_STEM
+      if stem =~ VOWEL_IN_STEM
         w = stem
 	case w
         when /(at|bl|iz)$/             then w << "e"
@@ -121,9 +121,9 @@ module Stemmable
       end
     end
-    if w =~ /y$/
+    if w =~ /y$/
       stem = $`
-      w = stem + "i" if stem =~ VOWEL_IN_STEM
+      w = stem + "i" if stem =~ VOWEL_IN_STEM
     end
     # Step 2
@@ -159,7 +159,7 @@ module Stemmable
     end
     #  Step 5
-    if w =~ /e$/
+    if w =~ /e$/
       stem = $`
       if (stem =~ MGR1) ||
           (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
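
Step 1a in the hunk above strips plural suffixes with two regular expressions. The `-`/`+` pairs there show no visible difference, so the logic itself appears unchanged. A minimal standalone sketch of just that step, lifted out of `stem_porter` for illustration, might look like this. The method name `step_1a` is hypothetical; the diff only shows the step inline in the full stemmer.

```ruby
# Hypothetical extraction of Step 1a from stem_porter, for illustration only.
# $` holds the text before the match and $1 the first capture group.
def step_1a(word)
  if word =~ /(ss|i)es$/
    # "sses" -> "ss", "ies" -> "i"
    $` + $1
  elsif word =~ /([^s])s$/
    # drop a single trailing "s" unless it follows another "s"
    $` + $1
  else
    word
  end
end

puts step_1a('caresses') # caress
puts step_1a('ponies')   # poni
puts step_1a('cats')     # cat
puts step_1a('caress')   # caress
```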

data/lib/engtagger/pos_tags.hash CHANGED

Binary file

data/lib/engtagger/pos_words.hash CHANGED

Binary file

data/lib/engtagger/version.rb CHANGED

@@ -1,3 +1,3 @@
-module EngTagger
-  VERSION = "0.2.0"
+class EngTagger
+  VERSION = "0.3.0"
 end
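
The switch from `module EngTagger` to `class EngTagger` matters because Ruby refuses to reopen an existing class with the `module` keyword, and vice versa. The sketch below shows that behavior with a hypothetical constant, `DemoTagger`. It assumes, without this diff confirming it, that the main library file defines `EngTagger` as a class.

```ruby
# DemoTagger is a hypothetical stand-in for a constant that a library
# has already defined as a class.
class DemoTagger
end

# Reopening it with `class` works and simply adds the constant.
class DemoTagger
  VERSION = '0.3.0'
end

# Reopening it with `module` raises TypeError at runtime.
error = begin
  eval('module DemoTagger; end')
  nil
rescue TypeError => e
  e
end

puts DemoTagger::VERSION
puts error.class
```

Under that assumption, declaring the version constant inside `class EngTagger` lets the version file and the main file be loaded in either order without a conflict.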