engtagger 0.2.2 → 0.3.0
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/.yardopts +5 -0
- data/Gemfile +1 -2
- data/README.md +19 -26
- data/engtagger.gemspec +4 -4
- data/lib/engtagger/porter.rb +12 -12
- data/lib/engtagger/version.rb +2 -2
- data/lib/engtagger.rb +164 -70
- data/test/test_engtagger.rb +246 -233
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: b3f1fc1d4e6d89d2920a0774342478d951bacd4558ff8c4054da719730ed0b9c
|
4
|
+
data.tar.gz: 2c9061d018dd63d699ad18713edf0f8ba74720632574e2ed2b530965c501abc5
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 475e5093d071bee1fac32a98713dd3eadc51262fc61cd090fe54fc98aad68d9d0c544aae0c10374aa38ac17676f0db0dbabc18a34f393747c1b9a51ff4d687ad
|
7
|
+
data.tar.gz: 4bfc9068df3ce8cf4688c0475600c326302c4df5ed1bb13848eb64c200ffc9e2fba61edb9f8cd64d1c6cb47015384cc3020bec707ddfc74e941874c310cbed83
|
data/.gitignore
CHANGED
data/.yardopts
ADDED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -4,13 +4,13 @@ English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
|
|
4
4
|
|
5
5
|
### Description
|
6
6
|
|
7
|
-
A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
|
8
|
-
tagger that assigns POS tags to English text based on a lookup dictionary and
|
9
|
-
a set of probability values. The tagger assigns appropriate tags based on
|
10
|
-
conditional probabilities--it examines the preceding tag to determine the
|
11
|
-
appropriate tag for the current word. Unknown words are classified according to
|
12
|
-
word morphology or can be set to be treated as nouns or other parts of speech.
|
13
|
-
The tagger also extracts as many nouns and noun phrases as it can, using a set
|
7
|
+
A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
|
8
|
+
tagger that assigns POS tags to English text based on a lookup dictionary and
|
9
|
+
a set of probability values. The tagger assigns appropriate tags based on
|
10
|
+
conditional probabilities--it examines the preceding tag to determine the
|
11
|
+
appropriate tag for the current word. Unknown words are classified according to
|
12
|
+
word morphology or can be set to be treated as nouns or other parts of speech.
|
13
|
+
The tagger also extracts as many nouns and noun phrases as it can, using a set
|
14
14
|
of regular expressions.
|
15
15
|
|
16
16
|
### Features
|
@@ -21,7 +21,6 @@ of regular expressions.
|
|
21
21
|
|
22
22
|
### Synopsis:
|
23
23
|
|
24
|
-
require 'rubygems'
|
25
24
|
require 'engtagger'
|
26
25
|
|
27
26
|
# Create a parser object
|
@@ -34,20 +33,20 @@ of regular expressions.
|
|
34
33
|
tagged = tgr.add_tags(text)
|
35
34
|
|
36
35
|
#=> "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> <jj>big</jj> <jj>fat</jj><nn>cat</nn> <pp>.</pp>"
|
37
|
-
|
36
|
+
|
38
37
|
# Get a list of all nouns and noun phrases with occurrence counts
|
39
38
|
word_list = tgr.get_words(text)
|
40
39
|
|
41
40
|
#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
|
42
|
-
|
41
|
+
|
43
42
|
# Get a readable version of the tagged text
|
44
43
|
readable = tgr.get_readable(text)
|
45
|
-
|
44
|
+
|
46
45
|
#=> "Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP"
|
47
46
|
|
48
47
|
# Get all nouns from a tagged output
|
49
48
|
nouns = tgr.get_nouns(tagged)
|
50
|
-
|
49
|
+
|
51
50
|
#=> {"cat"=>1, "Alice"=>1}
|
52
51
|
|
53
52
|
# Get all proper nouns
|
@@ -73,13 +72,13 @@ of regular expressions.
|
|
73
72
|
|
74
73
|
### Tag Set
|
75
74
|
|
76
|
-
The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.
|
75
|
+
The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.
|
77
76
|
|
78
77
|
CC Conjunction, coordinating and, or
|
79
78
|
CD Adjective, cardinal number 3, fifteen
|
80
79
|
DET Determiner this, each, some
|
81
80
|
EX Pronoun, existential there there
|
82
|
-
FW Foreign words
|
81
|
+
FW Foreign words
|
83
82
|
IN Preposition / Conjunction for, of, although, that
|
84
83
|
JJ Adjective happy, bad
|
85
84
|
JJR Adjective, comparative happier, worse
|
@@ -111,7 +110,7 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
|
|
111
110
|
WP Pronoun, question who, whoever
|
112
111
|
WPS Determiner, possessive & question whose
|
113
112
|
WRB Adverb, question when, how, however
|
114
|
-
|
113
|
+
|
115
114
|
PP Punctuation, sentence ender ., !, ?
|
116
115
|
PPC Punctuation, comma ,
|
117
116
|
PPD Punctuation, dollar sign $
|
@@ -121,30 +120,24 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
|
|
121
120
|
LRB Punctuation, left bracket (, {, [
|
122
121
|
RRB Punctuation, right bracket ), }, ]
|
123
122
|
|
124
|
-
### Requirements
|
125
|
-
|
126
|
-
* [Hpricot](http://code.whytheluckystiff.net/hpricot/) (optional)
|
127
|
-
|
128
123
|
### Install
|
129
124
|
|
130
|
-
|
125
|
+
gem install engtagger
|
131
126
|
|
132
127
|
### Author
|
133
128
|
|
134
|
-
of this Ruby library
|
129
|
+
of this Ruby library
|
135
130
|
|
136
|
-
* Yoichiro Hasebe (yohasebe [at] gmail.com)
|
131
|
+
* Yoichiro Hasebe (yohasebe [at] gmail.com)
|
137
132
|
|
138
133
|
### Contributors
|
139
134
|
|
140
|
-
|
141
|
-
* Phil London
|
142
|
-
* Bazay (Baron Bloomer)
|
135
|
+
Many thanks to the collaborators listed in the right column of this GitHub page.
|
143
136
|
|
144
137
|
### Acknowledgement
|
145
138
|
|
146
139
|
This Ruby library is a direct port of Lingua::EN::Tagger available at CPAN.
|
147
|
-
The credit for the crucial part of its algorithm/design therefore goes to
|
140
|
+
The credit for the crucial part of its algorithm/design therefore goes to
|
148
141
|
Aaron Coburn, the author of the original Perl version.
|
149
142
|
|
150
143
|
### License
|
data/engtagger.gemspec
CHANGED
@@ -4,14 +4,14 @@ require File.expand_path('../lib/engtagger/version', __FILE__)
|
|
4
4
|
Gem::Specification.new do |gem|
|
5
5
|
gem.authors = ["Yoichiro Hasebe"]
|
6
6
|
gem.email = ["yohasebe@gmail.com"]
|
7
|
-
gem.summary = %q{A probability based, corpus-trained English POS tagger}
|
8
|
-
gem.description = %q{A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values.}
|
9
|
-
gem.homepage = "http://github.com/yohasebe/engtagger"
|
7
|
+
gem.summary = %q{A probability based, corpus-trained English POS tagger}
|
8
|
+
gem.description = %q{A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values.}
|
9
|
+
gem.homepage = "http://github.com/yohasebe/engtagger"
|
10
10
|
|
11
11
|
gem.files = `git ls-files`.split($\)
|
12
12
|
gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
|
13
13
|
gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
|
14
14
|
gem.name = "engtagger"
|
15
15
|
gem.require_paths = ["lib"]
|
16
|
-
gem.version = EngTagger::VERSION
|
16
|
+
gem.version = EngTagger::VERSION
|
17
17
|
end
|
data/lib/engtagger/porter.rb
CHANGED
@@ -12,7 +12,7 @@ module Stemmable
|
|
12
12
|
'ousness'=>'ous', 'aliti'=>'al',
|
13
13
|
'iviti'=>'ive', 'biliti'=>'ble', 'logi'=>'log'
|
14
14
|
}
|
15
|
-
|
15
|
+
|
16
16
|
STEP_3_LIST = {
|
17
17
|
'icate'=>'ic', 'ative'=>'', 'alize'=>'al', 'iciti'=>'ic',
|
18
18
|
'ical'=>'ic', 'ful'=>'', 'ness'=>''
|
@@ -48,7 +48,7 @@ module Stemmable
|
|
48
48
|
ance |
|
49
49
|
ence |
|
50
50
|
er |
|
51
|
-
ic |
|
51
|
+
ic |
|
52
52
|
able |
|
53
53
|
ible |
|
54
54
|
ant |
|
@@ -88,30 +88,30 @@ module Stemmable
|
|
88
88
|
#
|
89
89
|
# Send comments to raypereda@hotmail.com
|
90
90
|
#
|
91
|
-
|
91
|
+
|
92
92
|
def stem_porter
|
93
93
|
|
94
94
|
# make a copy of the given object and convert it to a string.
|
95
95
|
w = self.dup.to_str
|
96
|
-
|
96
|
+
|
97
97
|
return w if w.length < 3
|
98
|
-
|
98
|
+
|
99
99
|
# now map initial y to Y so that the patterns never treat it as vowel
|
100
100
|
w[0] = 'Y' if w[0] == ?y
|
101
|
-
|
101
|
+
|
102
102
|
# Step 1a
|
103
103
|
if w =~ /(ss|i)es$/
|
104
104
|
w = $` + $1
|
105
|
-
elsif w =~ /([^s])s$/
|
105
|
+
elsif w =~ /([^s])s$/
|
106
106
|
w = $` + $1
|
107
107
|
end
|
108
108
|
|
109
109
|
# Step 1b
|
110
110
|
if w =~ /eed$/
|
111
|
-
w.chop! if $` =~ MGR0
|
111
|
+
w.chop! if $` =~ MGR0
|
112
112
|
elsif w =~ /(ed|ing)$/
|
113
113
|
stem = $`
|
114
|
-
if stem =~ VOWEL_IN_STEM
|
114
|
+
if stem =~ VOWEL_IN_STEM
|
115
115
|
w = stem
|
116
116
|
case w
|
117
117
|
when /(at|bl|iz)$/ then w << "e"
|
@@ -121,9 +121,9 @@ module Stemmable
|
|
121
121
|
end
|
122
122
|
end
|
123
123
|
|
124
|
-
if w =~ /y$/
|
124
|
+
if w =~ /y$/
|
125
125
|
stem = $`
|
126
|
-
w = stem + "i" if stem =~ VOWEL_IN_STEM
|
126
|
+
w = stem + "i" if stem =~ VOWEL_IN_STEM
|
127
127
|
end
|
128
128
|
|
129
129
|
# Step 2
|
@@ -159,7 +159,7 @@ module Stemmable
|
|
159
159
|
end
|
160
160
|
|
161
161
|
# Step 5
|
162
|
-
if w =~ /e$/
|
162
|
+
if w =~ /e$/
|
163
163
|
stem = $`
|
164
164
|
if (stem =~ MGR1) ||
|
165
165
|
(stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
|
data/lib/engtagger/version.rb
CHANGED
@@ -1,3 +1,3 @@
|
|
1
|
-
|
2
|
-
VERSION = "0.
|
1
|
+
class EngTagger
|
2
|
+
VERSION = "0.3.0"
|
3
3
|
end
|
data/lib/engtagger.rb
CHANGED
@@ -3,30 +3,17 @@
|
|
3
3
|
|
4
4
|
$LOAD_PATH << File.join(File.dirname(__FILE__), 'engtagger')
|
5
5
|
require 'rubygems'
|
6
|
-
require 'kconv'
|
7
6
|
require 'porter'
|
7
|
+
require 'lru_redux'
|
8
8
|
|
9
|
-
|
10
|
-
|
11
|
-
require 'hpricot'
|
12
|
-
rescue LoadError
|
13
|
-
$no_hpricot = true
|
14
|
-
end
|
15
|
-
|
16
|
-
# File paths
|
17
|
-
$lexpath = File.join(File.dirname(__FILE__), 'engtagger')
|
18
|
-
$word_path = File.join($lexpath, "pos_words.hash")
|
19
|
-
$tag_path = File.join($lexpath, "pos_tags.hash")
|
20
|
-
|
21
|
-
# for memoization (code snipet from http://eigenclass.org/hiki/bounded-space-memoization)
|
22
|
-
class Module
|
23
|
-
def memoize(method)
|
9
|
+
module BoundedSpaceMemoizable
|
10
|
+
def memoize(method, max_cache_size=100000)
|
24
11
|
# alias_method is faster than define_method + old.bind(self).call
|
25
12
|
alias_method "__memoized__#{method}", method
|
26
13
|
module_eval <<-EOF
|
27
|
-
def #{method}(*a
|
28
|
-
#
|
29
|
-
|
14
|
+
def #{method}(*a)
|
15
|
+
@__memoized_#{method}_cache ||= LruRedux::Cache.new(#{max_cache_size})
|
16
|
+
@__memoized_#{method}_cache[a] ||= __memoized__#{method}(*a)
|
30
17
|
end
|
31
18
|
EOF
|
32
19
|
end
|
@@ -34,17 +21,29 @@ end
|
|
34
21
|
|
35
22
|
# English part-of-speech tagger class
|
36
23
|
class EngTagger
|
24
|
+
extend BoundedSpaceMemoizable
|
25
|
+
|
26
|
+
# File paths
|
27
|
+
DEFAULT_LEXPATH = File.join(File.dirname(__FILE__), 'engtagger')
|
28
|
+
DEFAULT_WORDPATH = File.join(DEFAULT_LEXPATH, "pos_words.hash")
|
29
|
+
DEFAULT_TAGPATH = File.join(DEFAULT_LEXPATH, "pos_tags.hash")
|
37
30
|
|
38
31
|
#################
|
39
32
|
# Class methods #
|
40
33
|
#################
|
41
34
|
|
42
|
-
# Return a class variable that holds probability data
|
35
|
+
# Return a class variable that holds probability data.
|
36
|
+
#
|
37
|
+
# @return [Hash] the probability data
|
38
|
+
#
|
43
39
|
def self.hmm
|
44
40
|
return @@hmm
|
45
41
|
end
|
46
42
|
|
47
|
-
# Return a class variable that holds lexical data
|
43
|
+
# Return a class variable that holds lexical data.
|
44
|
+
#
|
45
|
+
# @return [Hash] the lexicon
|
46
|
+
#
|
48
47
|
def self.lexicon
|
49
48
|
return @@lexicon
|
50
49
|
end
|
@@ -88,7 +87,12 @@ class EngTagger
|
|
88
87
|
IN = get_ext('in')
|
89
88
|
|
90
89
|
# Convert a Treebank-style, abbreviated tag into verbose definitions
|
90
|
+
#
|
91
|
+
# @param tag [#to_s] the tag in question
|
92
|
+
# @return [String] the definition, if available
|
93
|
+
#
|
91
94
|
def self.explain_tag(tag)
|
95
|
+
tag = tag.to_s.downcase
|
92
96
|
if TAGS[tag]
|
93
97
|
return TAGS[tag]
|
94
98
|
else
|
@@ -143,7 +147,7 @@ class EngTagger
|
|
143
147
|
"PPS", "Punctuation, colon, semicolon, elipsis",
|
144
148
|
"LRB", "Punctuation, left bracket",
|
145
149
|
"RRB", "Punctuation, right bracket"
|
146
|
-
|
150
|
+
]
|
147
151
|
tags = tags.collect{|t| t.downcase.gsub(/[\.\,\'\-\s]+/, '_')}
|
148
152
|
tags = tags.collect{|t| t.gsub(/\&/, "and").gsub(/\//, "or")}
|
149
153
|
TAGS = Hash[*tags]
|
@@ -196,12 +200,12 @@ class EngTagger
|
|
196
200
|
@conf[:tag_lex] = 'tags.yml'
|
197
201
|
@conf[:word_lex] = 'words.yml'
|
198
202
|
@conf[:unknown_lex] = 'unknown.yml'
|
199
|
-
@conf[:word_path] =
|
200
|
-
@conf[:tag_path] =
|
203
|
+
@conf[:word_path] = DEFAULT_WORDPATH
|
204
|
+
@conf[:tag_path] = DEFAULT_TAGPATH
|
201
205
|
@conf[:debug] = false
|
202
206
|
# assuming that we start analyzing from the beginninga new sentence...
|
203
207
|
@conf[:current_tag] = 'pp'
|
204
|
-
@conf.merge!(params)
|
208
|
+
@conf.merge!(params) if params
|
205
209
|
unless File.exist?(@conf[:word_path]) and File.exist?(@conf[:tag_path])
|
206
210
|
print "Couldn't locate POS lexicon, creating a new one" if @conf[:debug]
|
207
211
|
@@hmm = Hash.new
|
@@ -221,6 +225,33 @@ class EngTagger
|
|
221
225
|
# Public methods #
|
222
226
|
##################
|
223
227
|
|
228
|
+
# Return an array of pairs of the form `["word", :tag]`.
|
229
|
+
#
|
230
|
+
# @param text [String] the input text
|
231
|
+
# @return [Array] the tagged words
|
232
|
+
#
|
233
|
+
def tag_pairs(text)
|
234
|
+
return [] unless valid_text(text)
|
235
|
+
|
236
|
+
out = clean_text(text).map do |word|
|
237
|
+
cleaned_word = clean_word word
|
238
|
+
tag = assign_tag(@conf[:current_tag], cleaned_word)
|
239
|
+
@conf[:current_tag] = tag = (tag and !tag.empty?) ? tag : 'nn'
|
240
|
+
[word, tag.to_sym]
|
241
|
+
end
|
242
|
+
|
243
|
+
# reset the tagger state
|
244
|
+
reset
|
245
|
+
|
246
|
+
out
|
247
|
+
end
|
248
|
+
|
249
|
+
# Examine the string provided and return it fully tagged in XML style.
|
250
|
+
#
|
251
|
+
# @param text [String] the input text
|
252
|
+
# @param verbose [false, true] whether to use verbose tags
|
253
|
+
# @return [String] the marked-up string
|
254
|
+
#
|
224
255
|
# Examine the string provided and return it fully tagged in XML style
|
225
256
|
def add_tags(text, verbose = false)
|
226
257
|
return nil unless valid_text(text)
|
@@ -260,10 +291,10 @@ class EngTagger
|
|
260
291
|
def get_readable(text, verbose = false)
|
261
292
|
return nil unless valid_text(text)
|
262
293
|
tagged = add_tags(text, verbose)
|
263
|
-
tagged = tagged.gsub(/<\w+>([^<]+)<\/(\w+)>/o) do
|
294
|
+
tagged = tagged.gsub(/<\w+>([^<]+|[<\w>]+)<\/(\w+)>/o) do
|
295
|
+
#!!!# tagged = tagged.gsub(/<\w+>([^<]+)<\/(\w+)>/o) do
|
264
296
|
$1 + '/' + $2.upcase
|
265
297
|
end
|
266
|
-
return tagged
|
267
298
|
end
|
268
299
|
|
269
300
|
# Return an array of sentences (without POS tags) from a text.
|
@@ -319,90 +350,151 @@ class EngTagger
|
|
319
350
|
|
320
351
|
# Given a POS-tagged text, this method returns all nouns and their
|
321
352
|
# occurrence frequencies.
|
353
|
+
#
|
354
|
+
# @param tagged [String] the tagged text
|
355
|
+
# @return [Hash] the hash of matches
|
356
|
+
#
|
322
357
|
def get_nouns(tagged)
|
323
358
|
return nil unless valid_text(tagged)
|
324
359
|
tags = [NN]
|
325
360
|
build_matches_hash(build_trimmed(tagged, tags))
|
326
361
|
end
|
327
362
|
|
328
|
-
# Returns all types of verbs and does not descriminate between the
|
329
|
-
#
|
363
|
+
# Returns all types of verbs and does not descriminate between the
|
364
|
+
# various kinds. Combines all other verb methods listed in this
|
365
|
+
# class.
|
366
|
+
#
|
367
|
+
# @param tagged [String] the tagged text
|
368
|
+
# @return [Hash] the hash of matches
|
369
|
+
#
|
330
370
|
def get_verbs(tagged)
|
331
371
|
return nil unless valid_text(tagged)
|
332
372
|
tags = [VB, VBD, VBG, PART, VBP, VBZ]
|
333
373
|
build_matches_hash(build_trimmed(tagged, tags))
|
334
374
|
end
|
335
375
|
|
376
|
+
#
|
377
|
+
# @param tagged [String] the tagged text
|
378
|
+
# @return [Hash] the hash of matches
|
379
|
+
#
|
380
|
+
|
336
381
|
def get_infinitive_verbs(tagged)
|
337
382
|
return nil unless valid_text(tagged)
|
338
383
|
tags = [VB]
|
339
384
|
build_matches_hash(build_trimmed(tagged, tags))
|
340
385
|
end
|
341
386
|
|
387
|
+
#
|
388
|
+
# @param tagged [String] the tagged text
|
389
|
+
# @return [Hash] the hash of matches
|
390
|
+
#
|
342
391
|
def get_past_tense_verbs(tagged)
|
343
392
|
return nil unless valid_text(tagged)
|
344
393
|
tags = [VBD]
|
345
394
|
build_matches_hash(build_trimmed(tagged, tags))
|
346
395
|
end
|
347
396
|
|
397
|
+
#
|
398
|
+
# @param tagged [String] the tagged text
|
399
|
+
# @return [Hash] the hash of matches
|
400
|
+
#
|
348
401
|
def get_gerund_verbs(tagged)
|
349
402
|
return nil unless valid_text(tagged)
|
350
403
|
tags = [VBG]
|
351
404
|
build_matches_hash(build_trimmed(tagged, tags))
|
352
405
|
end
|
353
406
|
|
407
|
+
#
|
408
|
+
# @param tagged [String] the tagged text
|
409
|
+
# @return [Hash] the hash of matches
|
410
|
+
#
|
354
411
|
def get_passive_verbs(tagged)
|
355
412
|
return nil unless valid_text(tagged)
|
356
413
|
tags = [PART]
|
357
414
|
build_matches_hash(build_trimmed(tagged, tags))
|
358
415
|
end
|
359
416
|
|
417
|
+
#
|
418
|
+
# @param tagged [String] the tagged text
|
419
|
+
# @return [Hash] the hash of matches
|
420
|
+
#
|
360
421
|
def get_base_present_verbs(tagged)
|
361
422
|
return nil unless valid_text(tagged)
|
362
423
|
tags = [VBP]
|
363
424
|
build_matches_hash(build_trimmed(tagged, tags))
|
364
425
|
end
|
365
426
|
|
427
|
+
#
|
428
|
+
# @param tagged [String] the tagged text
|
429
|
+
# @return [Hash] the hash of matches
|
430
|
+
#
|
366
431
|
def get_present_verbs(tagged)
|
367
432
|
return nil unless valid_text(tagged)
|
368
433
|
tags = [VBZ]
|
369
434
|
build_matches_hash(build_trimmed(tagged, tags))
|
370
435
|
end
|
371
436
|
|
437
|
+
#
|
438
|
+
# @param tagged [String] the tagged text
|
439
|
+
# @return [Hash] the hash of matches
|
440
|
+
#
|
372
441
|
def get_adjectives(tagged)
|
373
442
|
return nil unless valid_text(tagged)
|
374
443
|
tags = [JJ]
|
375
444
|
build_matches_hash(build_trimmed(tagged, tags))
|
376
445
|
end
|
377
446
|
|
447
|
+
#
|
448
|
+
# @param tagged [String] the tagged text
|
449
|
+
# @return [Hash] the hash of matches
|
450
|
+
#
|
378
451
|
def get_comparative_adjectives(tagged)
|
379
452
|
return nil unless valid_text(tagged)
|
380
453
|
tags = [JJR]
|
381
454
|
build_matches_hash(build_trimmed(tagged, tags))
|
382
455
|
end
|
383
456
|
|
457
|
+
#
|
458
|
+
# @param tagged [String] the tagged text
|
459
|
+
# @return [Hash] the hash of matches
|
460
|
+
#
|
384
461
|
def get_superlative_adjectives(tagged)
|
385
462
|
return nil unless valid_text(tagged)
|
386
463
|
tags = [JJS]
|
387
464
|
build_matches_hash(build_trimmed(tagged, tags))
|
388
465
|
end
|
389
466
|
|
467
|
+
#
|
468
|
+
# @param tagged [String] the tagged text
|
469
|
+
# @return [Hash] the hash of matches
|
470
|
+
#
|
390
471
|
def get_adverbs(tagged)
|
391
472
|
return nil unless valid_text(tagged)
|
392
473
|
tags = [RB, RBR, RBS, RP]
|
393
474
|
build_matches_hash(build_trimmed(tagged, tags))
|
394
475
|
end
|
395
476
|
|
477
|
+
#
|
478
|
+
# @param tagged [String] the tagged text
|
479
|
+
# @return [Hash] the hash of matches
|
480
|
+
#
|
396
481
|
def get_interrogatives(tagged)
|
397
482
|
return nil unless valid_text(tagged)
|
398
483
|
tags = [WRB, WDT, WP, WPS]
|
399
484
|
build_matches_hash(build_trimmed(tagged, tags))
|
400
485
|
end
|
401
|
-
|
486
|
+
|
487
|
+
# To be consistent with documentation's naming of 'interrogative'
|
488
|
+
# parts of speech as 'question'
|
402
489
|
alias_method :get_question_parts, :get_interrogatives
|
403
490
|
|
404
|
-
# Returns all types of conjunctions and does not discriminate
|
405
|
-
# E.g. coordinating, subordinating,
|
491
|
+
# Returns all types of conjunctions and does not discriminate
|
492
|
+
# between the various kinds. E.g. coordinating, subordinating,
|
493
|
+
# correlative...
|
494
|
+
#
|
495
|
+
# @param tagged [String] the tagged text
|
496
|
+
# @return [Hash] the hash of matches
|
497
|
+
#
|
406
498
|
def get_conjunctions(tagged)
|
407
499
|
return nil unless valid_text(tagged)
|
408
500
|
tags = [CC, IN]
|
@@ -410,7 +502,11 @@ class EngTagger
|
|
410
502
|
end
|
411
503
|
|
412
504
|
# Given a POS-tagged text, this method returns only the maximal noun phrases.
|
413
|
-
# May be called directly, but is also used by get_noun_phrases
|
505
|
+
# May be called directly, but is also used by `get_noun_phrases`.
|
506
|
+
#
|
507
|
+
# @param tagged [String] the tagged text
|
508
|
+
# @return [Hash] the hash of matches
|
509
|
+
#
|
414
510
|
def get_max_noun_phrases(tagged)
|
415
511
|
return nil unless valid_text(tagged)
|
416
512
|
tags = [@@mnp]
|
@@ -424,11 +520,15 @@ class EngTagger
|
|
424
520
|
end
|
425
521
|
|
426
522
|
# Similar to get_words, but requires a POS-tagged text as an argument.
|
523
|
+
#
|
524
|
+
# @param tagged [String] the tagged text
|
525
|
+
# @return [Hash] the hash of matches
|
526
|
+
#
|
427
527
|
def get_noun_phrases(tagged)
|
428
528
|
return nil unless valid_text(tagged)
|
429
529
|
found = Hash.new(0)
|
430
530
|
phrase_ext = /(?:#{PREP}|#{DET}|#{NUM})+/xo
|
431
|
-
|
531
|
+
scanned = tagged.scan(@@mnp)
|
432
532
|
# Find MNPs in the text, one sentence at a time
|
433
533
|
# Record and split if the phrase is extended by a (?:PREP|DET|NUM)
|
434
534
|
mn_phrases = []
|
@@ -437,9 +537,9 @@ class EngTagger
|
|
437
537
|
mn_phrases += m.split(phrase_ext)
|
438
538
|
end
|
439
539
|
mn_phrases.each do |mnp|
|
440
|
-
|
441
|
-
|
442
|
-
|
540
|
+
# Split the phrase into an array of words, and create a loop for each word,
|
541
|
+
# shortening the phrase by removing the word in the first position.
|
542
|
+
# Record the phrase and any single nouns that are found
|
443
543
|
words = mnp.split
|
444
544
|
words.length.times do |i|
|
445
545
|
found[words.join(' ')] += 1 if words.length > 1
|
@@ -484,7 +584,7 @@ class EngTagger
|
|
484
584
|
# Private methods #
|
485
585
|
###################
|
486
586
|
|
487
|
-
|
587
|
+
private
|
488
588
|
|
489
589
|
def build_trimmed(tagged, tags)
|
490
590
|
tags.map { |tag| tagged.scan(tag) }.flatten.map do |n|
|
@@ -554,17 +654,10 @@ class EngTagger
|
|
554
654
|
end
|
555
655
|
end
|
556
656
|
|
557
|
-
# Strip the provided text
|
558
|
-
# in preparation for tagging
|
657
|
+
# Strip the provided text and separate off any punctuation in preparation for tagging
|
559
658
|
def clean_text(text)
|
560
659
|
return false unless valid_text(text)
|
561
|
-
|
562
|
-
unless $no_hpricot
|
563
|
-
# Strip out any markup and convert entities to their proper form
|
564
|
-
cleaned_text = Hpricot(text).inner_text
|
565
|
-
else
|
566
|
-
cleaned_text = text
|
567
|
-
end
|
660
|
+
cleaned_text = text.encode('utf-8')
|
568
661
|
tokenized = []
|
569
662
|
# Tokenize the text (splitting on punctuation as you go)
|
570
663
|
cleaned_text.split(/\s+/).each do |line|
|
@@ -599,7 +692,8 @@ class EngTagger
|
|
599
692
|
end
|
600
693
|
words = Array.new
|
601
694
|
tokenized.each_with_index do |t, i|
|
602
|
-
if tokenized[i + 1] and tokenized [i + 1] =~ /[A-Z\W]/ and
|
695
|
+
if tokenized[i + 1] and tokenized [i + 1] =~ /[A-Z\W]/ and
|
696
|
+
tokenized[i] =~ /\A(.+)\.\z/
|
603
697
|
w = $1
|
604
698
|
# Don't separate the period off words that
|
605
699
|
# meet any of the following conditions:
|
@@ -607,7 +701,8 @@ class EngTagger
|
|
607
701
|
# 1. It is defined in one of the lists above
|
608
702
|
# 2. It is only one letter long: Alfred E. Sloan
|
609
703
|
# 3. It has a repeating letter-dot: U.S.A. or J.C. Penney
|
610
|
-
unless abbr[w.downcase] or
|
704
|
+
unless abbr[w.downcase] or
|
705
|
+
[/\A[a-z]\z/i, /[a-z](?:\.[a-z])+\z/i].any? { |r| r.match? w }
|
611
706
|
words << w
|
612
707
|
words << '.'
|
613
708
|
next
|
@@ -641,7 +736,7 @@ class EngTagger
|
|
641
736
|
# Handle all other punctuation
|
642
737
|
text = text.gsub(/--+/o, " - ") # Convert and separate dashes
|
643
738
|
text = text.gsub(/,(?!\d)/o, " , ") # Shift commas off everything but numbers
|
644
|
-
text = text.gsub(/:/o, " :") # Shift semicolons off
|
739
|
+
text = text.gsub(/:/o, " : ") # Shift semicolons off
|
645
740
|
text = text.gsub(/(\.\.\.+)/o){" " + $1 + " "} # Shift ellipses off
|
646
741
|
text = text.gsub(/([\(\[\{\}\]\)])/o){" " + $1 + " "} # Shift off brackets
|
647
742
|
text = text.gsub(/([\!\?#\$%;~|])/o){" " + $1 + " "} # Shift off other ``standard'' punctuation
|
@@ -718,8 +813,7 @@ class EngTagger
|
|
718
813
|
def classify_unknown_word(word)
|
719
814
|
if /[\(\{\[]/ =~ word # Left brackets
|
720
815
|
classified = "*LRB*"
|
721
|
-
elsif
|
722
|
-
/[\)\}\]]/ =~ word # Right brackets
|
816
|
+
elsif /[\)\}\]]/ =~ word # Right brackets
|
723
817
|
classified = "*RRB*"
|
724
818
|
elsif /-?(?:\d+(?:\.\d*)?|\.\d+)\z/ =~ word # Floating point number
|
725
819
|
classified = "*NUM*"
|
@@ -763,28 +857,28 @@ class EngTagger
|
|
763
857
|
# from a POS-tagged text.
|
764
858
|
def get_max_noun_regex
|
765
859
|
regex = /
|
766
|
-
|
767
|
-
|
768
|
-
|
769
|
-
|
770
|
-
|
771
|
-
|
772
|
-
|
773
|
-
|
774
|
-
|
775
|
-
|
776
|
-
|
777
|
-
|
778
|
-
|
779
|
-
|
860
|
+
# optional number, gerund - adjective -participle
|
861
|
+
(?:#{NUM})?(?:#{GER}|#{ADJ}|#{PART})*
|
862
|
+
# Followed by one or more nouns
|
863
|
+
(?:#{NN})+
|
864
|
+
(?:
|
865
|
+
# Optional preposition, determinant, cardinal
|
866
|
+
(?:#{PREP})*(?:#{DET})?(?:#{NUM})?
|
867
|
+
# Optional gerund-adjective -participle
|
868
|
+
(?:#{GER}|#{ADJ}|#{PART})*
|
869
|
+
# one or more nouns
|
870
|
+
(?:#{NN})+
|
871
|
+
)*
|
872
|
+
/xo #/
|
873
|
+
return regex
|
780
874
|
end
|
781
875
|
|
782
876
|
# Load the 2-grams into a hash from YAML data: This is a naive (but fast)
|
783
877
|
# YAML data parser. It will load a YAML document with a collection of key:
|
784
878
|
# value entries ( {pos tag}: {probability} ) mapped onto single keys ( {tag} ).
|
785
879
|
# Each map is expected to be on a single line; i.e., det: { jj: 0.2, nn: 0.5, vb: 0.0002 }
|
786
|
-
def load_tags(lexicon)
|
787
|
-
path = File.join(
|
880
|
+
def load_tags(lexicon, lexpath = DEFAULT_LEXPATH)
|
881
|
+
path = File.join(lexpath, lexicon)
|
788
882
|
fh = File.open(path, 'r')
|
789
883
|
while line = fh.gets
|
790
884
|
/\A"?([^{"]+)"?: \{ (.*) \}/ =~ line
|
@@ -806,8 +900,8 @@ class EngTagger
|
|
806
900
|
# YAML data parser. It will load a YAML document with a collection of key:
|
807
901
|
# value entries ( {pos tag}: {count} ) mapped onto single keys ( {a word} ).
|
808
902
|
# Each map is expected to be on a single line; i.e., key: { jj: 103, nn: 34, vb: 1 }
|
809
|
-
def load_words(lexicon)
|
810
|
-
path = File.join(
|
903
|
+
def load_words(lexicon, lexpath = DEFAULT_LEXPATH)
|
904
|
+
path = File.join(lexpath, lexicon)
|
811
905
|
fh = File.open(path, 'r')
|
812
906
|
while line = fh.gets
|
813
907
|
/\A"?([^{"]+)"?: \{ (.*) \}/ =~ line
|
data/test/test_engtagger.rb
CHANGED
@@ -1,233 +1,246 @@
|
|
1
|
-
|
2
|
-
|
3
|
-
|
4
|
-
|
5
|
-
|
6
|
-
|
7
|
-
|
8
|
-
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
@@
|
13
|
-
Lisa Raines
|
14
|
-
EOD
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
tagpath
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
def test_add_tags
|
40
|
-
assert_instance_of(String, @tagger.add_tags(@@untagged))
|
41
|
-
end
|
42
|
-
|
43
|
-
def test_assign_tag
|
44
|
-
models = []; tests = []
|
45
|
-
models += [@tagger.conf[:unknown_word_tag], "sym"]
|
46
|
-
tests += [["pp","-unknown-"], ["pp", "-sym-"]]
|
47
|
-
models.length.times do |i|
|
48
|
-
assert_equal(models[i],@tagger.assign_tag(*tests[i]))
|
49
|
-
end
|
50
|
-
tests = []
|
51
|
-
tests += [["vb","water"], ["nn", "runs"]]
|
52
|
-
models.length.times do |i|
|
53
|
-
result = @tagger.assign_tag(*tests[i])
|
54
|
-
assert(EngTagger.hmm.keys.index(result))
|
55
|
-
end
|
56
|
-
end
|
57
|
-
|
58
|
-
def
|
59
|
-
|
60
|
-
|
61
|
-
assert_equal(
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
end
|
68
|
-
|
69
|
-
def
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
end
|
85
|
-
|
86
|
-
def
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
result = @tagger.
|
97
|
-
|
98
|
-
end
|
99
|
-
|
100
|
-
def
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
expected_result =
|
120
|
-
result
|
121
|
-
|
122
|
-
|
123
|
-
|
124
|
-
|
125
|
-
|
126
|
-
|
127
|
-
|
128
|
-
|
129
|
-
|
130
|
-
|
131
|
-
|
132
|
-
|
133
|
-
|
134
|
-
|
135
|
-
|
136
|
-
|
137
|
-
|
138
|
-
|
139
|
-
|
140
|
-
|
141
|
-
|
142
|
-
|
143
|
-
|
144
|
-
|
145
|
-
|
146
|
-
|
147
|
-
|
148
|
-
|
149
|
-
|
150
|
-
|
151
|
-
|
152
|
-
|
153
|
-
|
154
|
-
|
155
|
-
@tagger.
|
156
|
-
|
157
|
-
@tagger.
|
158
|
-
|
159
|
-
|
160
|
-
|
161
|
-
|
162
|
-
|
163
|
-
def
|
164
|
-
|
165
|
-
|
166
|
-
|
167
|
-
|
168
|
-
|
169
|
-
|
170
|
-
|
171
|
-
|
172
|
-
|
173
|
-
|
174
|
-
|
175
|
-
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
models
|
185
|
-
models << ["
|
186
|
-
|
187
|
-
|
188
|
-
|
189
|
-
|
190
|
-
|
191
|
-
|
192
|
-
|
193
|
-
models
|
194
|
-
models << ["
|
195
|
-
|
196
|
-
models << ["
|
197
|
-
|
198
|
-
models
|
199
|
-
|
200
|
-
|
201
|
-
|
202
|
-
|
203
|
-
|
204
|
-
|
205
|
-
|
206
|
-
|
207
|
-
|
208
|
-
|
209
|
-
|
210
|
-
|
211
|
-
|
212
|
-
|
213
|
-
|
214
|
-
|
215
|
-
|
216
|
-
end
|
217
|
-
|
218
|
-
def
|
219
|
-
|
220
|
-
|
221
|
-
|
222
|
-
|
223
|
-
|
224
|
-
|
225
|
-
|
226
|
-
|
227
|
-
|
228
|
-
|
229
|
-
|
230
|
-
|
231
|
-
end
|
232
|
-
|
233
|
-
|
1
|
+
$ENGTAGGER_LIB = File.join(File.dirname(__FILE__), '..', 'lib')
|
2
|
+
$LOAD_PATH << $ENGTAGGER_LIB
|
3
|
+
require 'test/unit' unless defined? $ZENTEST and $ZENTEST
|
4
|
+
require 'engtagger'
|
5
|
+
|
6
|
+
class TestEngTagger < Test::Unit::TestCase

  # Sample paragraph used as raw (untagged) input throughout the tests.
  @@untagged =<<EOD
Lisa Raines, a lawyer and director of government relations for the Industrial Biotechnical Association, contends that a judge well-versed in patent law and the concerns of research-based industries would have ruled otherwise. And Judge Newman, a former patent lawyer, wrote in her dissent when the court denied a motion for a rehearing of the case by the full court, "The panel's judicial legislation has affected an important high-technological industry, without regard to the consequences for research and innovation or the public interest." Says Ms. Raines, "[The judgement] confirms our concern that the absence of patent lawyers on the court could prove troublesome."
EOD

  # The first sentence of @@untagged, pre-tagged with POS markup.
  @@tagged =<<EOD
<nnp>Lisa</nnp> <nnp>Raines</nnp> <ppc>,</ppc> <det>a</det> <nn>lawyer</nn> <cc>and</cc> <nn>director</nn> <in>of</in> <nn>government</nn> <nns>relations</nns> <in>for</in> <det>the</det> <nnp>Industrial</nnp> <nnp>Biotechnical</nnp> <nnp>Association</nnp> <ppc>,</ppc> <vbz>contends</vbz> <in>that</in> <det>a</det> <nn>judge</nn> <jj>well-versed</jj> <in>in</in> <nn>patent</nn> <nn>law</nn> <cc>and</cc> <det>the</det> <nns>concerns</nns> <in>of</in> <jj>research-based</jj> <nns>industries</nns> <md>would</md> <vb>have</vb> <vbn>ruled</vbn> <rb>otherwise</rb> <pp>.</pp>
EOD

  # Testing class methods

  # Build a tagger and install the lexicon/ngram data files on first run.
  def setup
    @tagger = EngTagger.new
    tagpath = File.join($ENGTAGGER_LIB, @tagger.conf[:tag_path])
    wordpath = File.join($ENGTAGGER_LIB, @tagger.conf[:word_path])
    unless File.exist?(tagpath) && File.exist?(wordpath)
      @tagger.install
    end
  end

  # FIXME(review): named "text_..." instead of "test_...", so Test::Unit never
  # runs this method; the regex literal also contains a stray "}" in the
  # closing tag. Renaming it would activate an unverified assertion against
  # EngTagger.get_ext, so the dormant name is kept — confirm get_ext's
  # contract (string vs. Regexp return) before enabling.
  def text_get_ext
    model = '<cd>[^<]+</cd}>\s*'
    assert_equal(model, EngTagger.get_ext(model, "cd"))
  end

  def test_explain_tag
    assert_equal("noun", EngTagger.explain_tag("nn"))
    assert_equal("verb_infinitive", EngTagger.explain_tag("vb"))
  end

  # Testing public instance methods

  def test_add_tags
    assert_instance_of(String, @tagger.add_tags(@@untagged))
  end

  def test_assign_tag
    # Special tokens must map to their configured tags.
    models = []; tests = []
    models += [@tagger.conf[:unknown_word_tag], "sym"]
    tests += [["pp", "-unknown-"], ["pp", "-sym-"]]
    models.length.times do |i|
      assert_equal(models[i], @tagger.assign_tag(*tests[i]))
    end
    # Ordinary words must get some tag present in the HMM table.
    tests = []
    tests += [["vb", "water"], ["nn", "runs"]]
    tests.each do |test|
      result = @tagger.assign_tag(*test)
      assert(EngTagger.hmm.keys.index(result))
    end
  end

  def test_clean_text
    test = "I am 100.0% sure that Dr. Watson is too naive. I'm sorry."
    model = ["I", "am", "100.0", "%", "sure", "that", "Dr.", "Watson", "is", "too", "naive", ".", "I", "'m", "sorry", "."]
    assert_equal(model, @tagger.send(:clean_text, test))
  end

  def test_get_noun_phrases
    result = @tagger.get_noun_phrases(@@tagged)
    assert_instance_of(Hash, result)
  end

  def test_get_nouns
    result = @tagger.get_nouns(@@tagged)
    assert_instance_of(Hash, result)
  end

  def test_get_verbs
    expected_result = { "have" => 1, "ruled" => 1, "contends" => 1 }
    result = @tagger.get_verbs(@@tagged)
    assert_equal(expected_result, result)
  end

  def test_get_adverbs
    expected_result = { "otherwise" => 1 }
    result = @tagger.get_adverbs(@@tagged)
    assert_equal(expected_result, result)
  end

  def test_get_interrogatives
    tagged = "<wdt>Which</wdt> <ppc>,</ppc> <wdt>whatever</wdt> <ppc>,</ppc> <wp>who</wp> <ppc>,</ppc> <wp>whoever</wp> <ppc>,</ppc> <wrb>when</wrb> <cc>and</cc> <wrb>how</wrb> <vbp>are</vbp> <det>all</det> <nns>examples</nns> <in>of</in> <nns>interrogatives</nns>"
    expected_result = {"when"=>1, "how"=>1, "Which"=>1, "whatever"=>1, "who"=>1, "whoever"=>1}
    result = @tagger.get_interrogatives(tagged)
    assert_equal(expected_result, result)
  end

  # get_question_parts is expected to behave identically to get_interrogatives.
  def test_get_question_parts
    tagged = "<wdt>Which</wdt> <ppc>,</ppc> <wdt>whatever</wdt> <ppc>,</ppc> <wp>who</wp> <ppc>,</ppc> <wp>whoever</wp> <ppc>,</ppc> <wrb>when</wrb> <cc>and</cc> <wrb>how</wrb> <vbp>are</vbp> <det>all</det> <nns>examples</nns> <in>of</in> <nns>interrogatives</nns>"
    expected_result = {"when"=>1, "how"=>1, "Which"=>1, "whatever"=>1, "who"=>1, "whoever"=>1}
    result = @tagger.get_question_parts(tagged)
    assert_equal(expected_result, result)
  end

  def test_get_conjunctions
    expected_result = { "and" => 2, "of" => 2, "for" => 1, "that" => 1, "in" => 1 }
    result = @tagger.get_conjunctions(@@tagged)
    assert_equal(expected_result, result)
  end

  def test_get_proper_nouns
    test = "<nnp>BBC</nnp> <vbz>means</vbz> <nnp>British Broadcasting Corporation</nnp> <pp>.</pp>"
    result = @tagger.get_proper_nouns(test)
    assert_instance_of(Hash, result)
  end

  def test_get_readable
    test = "I woke up to the sound of pouring rain."
    result = @tagger.get_readable(test)
    # BUG FIX: the original used assert(String, result), which only asserts
    # the truthiness of the String class itself and could never fail.
    assert_instance_of(String, result)

    test = "I woke up to the sound of pouring rain."
    result = @tagger.get_readable(test)
    expected_result = "I/PRP woke/VBD up/RB to/TO the/DET sound/NN of/IN pouring/VBG rain/NN ./PP"
    assert_equal(expected_result, result)
    test = "I woke up with a <bad> word."
    result = @tagger.get_readable(test)
    expected_result = "I/PRP woke/VBD up/RB with/IN a/DET <bad>/NNP word/NN ./PP"
    assert_equal(expected_result, result)
  end

  def test_get_sentences
    result = @tagger.get_sentences(@@untagged)
    assert_equal(4, result.length)
  end

  def test_get_words
    # Exercise both single-word and multi-word noun-phrase extraction paths.
    @tagger.conf[:longest_noun_phrase] = 1
    result1 = @tagger.get_words(@@tagged)
    @tagger.conf[:longest_noun_phrase] = 10
    result2 = @tagger.get_words(@@tagged)
    assert_instance_of(Hash, result1)
    assert_instance_of(Hash, result2)
  end

  # Testing private instance methods

  def test_reset
    @tagger.conf[:current_tag] = 'nn'
    @tagger.send(:reset)
    assert_equal('pp', @tagger.conf[:current_tag])
  end

  def test_classify_unknown_word
    assert_equal("*LRB*", @tagger.send(:classify_unknown_word, "{"))
    assert_equal("*NUM*", @tagger.send(:classify_unknown_word, "123.4567"))
    assert_equal("*ORD*", @tagger.send(:classify_unknown_word, "40th"))
    assert_equal("-abr-", @tagger.send(:classify_unknown_word, "GT-R"))
    assert_equal("-hyp-adj-", @tagger.send(:classify_unknown_word, "extremely-high"))
    assert_equal("-sym-", @tagger.send(:classify_unknown_word, "&&"))
    assert_equal("-ing-", @tagger.send(:classify_unknown_word, "wikiing"))
    assert_equal("-unknown-", @tagger.send(:classify_unknown_word, "asefasdf"))
  end

  def test_clean_word
    models = []; tests = []
    models += ["*NUM*"]
    models += ["Plays"]
    models += ["pleadingly"]
    tests += ["1973.0820", "Plays", "Pleadingly"]
    models.length.times do |i|
      assert_equal(models[i], @tagger.send(:clean_word, tests[i]))
    end
  end

  def test_get_max_noun_phrases
    result = @tagger.send(:get_max_noun_phrases, @@tagged)
    assert_instance_of(Hash, result)
  end

  def test_get_max_noun_regex
    assert_instance_of(Regexp, @tagger.send(:get_max_noun_regex))
  end

  def test_split_punct
    models = []; texts = []
    models << ["`", "test"]; texts << "`test"
    models << ["``", "test"]; texts << "\"test"
    models << ["`", "test"]; texts << "'test"
    models << ["''"]; texts << '"'
    models << ["test", "'"]; texts << "test' "
    models << ["-", "test", "-"]; texts << "---test-----"
    models << ["test", ",", "test"]; texts << "test,test"
    models << ["123,456"]; texts << "123,456"
    models << ["test", ":", "test"]; texts << "test:test"
    models << ["123", ":", "456"]; texts << "123:456"
    models << ["test1", "...", "test2"]; texts << "test1...test2"
    models << ["{", "ab", "[", "(", "c", ")", "[", "d", "]", "]", "}"]; texts << "{ab[(c)[d]]}"
    models << ["test", "#", "test"]; texts << "test#test"
    models << ["I", "'d", "like"]; texts << "I'd like"
    models << ["is", "n't", "so"]; texts << "isn't so"
    models << ["we", "'re", "all"]; texts << "we're all"

    texts.each_with_index do |text, index|
      assert_equal(models[index], @tagger.send(:split_punct, text))
    end
  end

  def test_split_sentences
    models = []; tests = []
    models << ["He", "is", "a", "u.s.", "army", "officer", "."]
    tests << ["He", "is", "a", "u.s.", "army", "officer."]
    models << ["He", "is", "Mr.", "Johnson", ".", "He", "'s", "my", "friend", "."]
    tests << ["He", "is", "Mr.", "Johnson.", "He", "'s", "my", "friend."]
    models.length.times do |i|
      assert_equal(models[i], @tagger.send(:split_sentences, tests[i]))
    end
  end

  def test_stem
    word = "gets"
    old = @tagger.conf[:stem]
    @tagger.conf[:stem] = true
    assert_equal("get", @tagger.stem(word))
    # the following should not work since we memoize stem method
    # @tagger.conf[:stem] = false
    # assert_equal("gets", @tagger.stem(word))
    @tagger.conf[:stem] = old
  end

  def test_strip_tags
    assert_instance_of(String, @tagger.send(:strip_tags, @@tagged))
  end

  def test_valid_text
    text = nil
    assert(!@tagger.send(:valid_text, text))
    text = "this is test text"
    assert(@tagger.send(:valid_text, text))
    text = ""
    assert(!@tagger.send(:valid_text, text))
  end

  def test_override_default_params
    @tagger = EngTagger.new(:longest_noun_phrase => 3)
    assert_equal 3, @tagger.conf[:longest_noun_phrase]
  end
end
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: engtagger
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.3.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Yoichiro Hasebe
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2022-
|
11
|
+
date: 2022-06-21 00:00:00.000000000 Z
|
12
12
|
dependencies: []
|
13
13
|
description: A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
|
14
14
|
tagger that assigns POS tags to English text based on a lookup dictionary and a
|
@@ -20,6 +20,7 @@ extensions: []
|
|
20
20
|
extra_rdoc_files: []
|
21
21
|
files:
|
22
22
|
- ".gitignore"
|
23
|
+
- ".yardopts"
|
23
24
|
- Gemfile
|
24
25
|
- LICENSE
|
25
26
|
- README.md
|