open-nlp 0.1.0 → 0.1.1

Files changed (5)

data/README.md CHANGED

@@ -1,29 +1,34 @@
 [![Build Status](https://secure.travis-ci.org/louismullie/open-nlp.png)](http://travis-ci.org/louismullie/open-nlp)
-**About**
+###About
-This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
+This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP).
-This gem only provides a thin wrapper over the OpenNLP API. If you are looking for a Ruby natural language processing framework, have a look at [Treat](https://github.com/louismullie/treat).
+###Installing
-**Installing**
+__Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.__
-_Note: If you are running on MRI, this gem will use the Ruby-Java Bridge (Rjb), which currently does not support Java 7. Therefore, if you have installed Java 7, you should set your JAVA_HOME to point to your old Java 6 install before installing Rjb; for example, `export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"`.
-First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all english language models](louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
+First, install the gem: `gem install open-nlp`. Then, individually download the appropriate models from the [open-nlp website](http://opennlp.sourceforge.net/models-1.5/) or just get [all English language models](louismullie.com/treat/open-nlp-english.zip) in one package (80 MB).
 Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).
-**Configuration**
+Alternatively, from a terminal window, `cd` to the gem's folder and run:
+```
+wget http://www.louismullie.com/treat/open-nlp-english.zip
+unzip -o open-nlp-english.zip -d bin/
+```
+###Configuring
-After installing and requiring the gem (`require 'open-nlp'`), you may want to set some optional configuration options. Here are some examples:
+After installing and requiring the gem (`require 'open-nlp'`), you may want to set some of the following configuration options.
 ```ruby
-# Set an alternative path to look for the JAR files
+# Set an alternative path to look for the JAR files.
 # Default is gem's bin folder.
 OpenNLP.jar_path = '/path_to_jars/'
-# Set an alternative path to look for the model files
+# Set an alternative path to look for the model files.
 # Default is gem's bin folder.
 OpenNLP.model_path = '/path_to_models/'
@@ -34,76 +39,131 @@ OpenNLP.jvm_args = ['-option1', '-option2']
 # Redirect VM output to log.txt
 OpenNLP.log_file = 'log.txt'
-# WARNING: Not implemented yet.
+```
+###Examples
-# Use the model files for a different language than English.
-# OpenNLP.use(:french) # or :german
-#
-# Change a specific model file.
-# OpenNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
+**Simple tokenizer**
+```ruby
+OpenNLP.load
+sent = "The death of the poet was kept from his poems."
+tokenizer = OpenNLP::SimpleTokenizer.new
+tokens = tokenizer.tokenize(sent).to_a
+# => %w[The death of the poet was kept from his poems .]
 ```
-**Using the gem**
+**Maximum entropy tokenizer, chunker and POS tagger**
 ```ruby
-text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
-       'Berlin to discuss a new $25 billion austerity package.' +
-       'Sarkozy looked pleased, but Merkel was dismayed.'
+OpenNLP.load
+chunker   = OpenNLP::ChunkerME.new
+tokenizer = OpenNLP::TokenizerME.new
+tagger    = OpenNLP::POSTaggerME.new
+sent   = "The death of the poet was kept from his poems."
+tokens = tokenizer.tokenize(sent).to_a
+# => %w[The death of the poet was kept from his poems .]
+tags   = tagger.tag(tokens).to_a
+# => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]
+chunks = chunker.chunk(tokens, tags).to_a
+# => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
+```
+**Abstract Bottom-Up Parser**
+```ruby
+OpenNLP.load
+sent      = "The death of the poet was kept from his poems."
+parser = OpenNLP::Parser.new
+parse = parser.parse(sent)
+parse.get_text.should eql sent
+parse.get_span.get_start.should eql 0
+parse.get_span.get_end.should eql 46
+parse.get_child_count.should eql 1
+child = parse.get_children[0]
+child.text # => "The death of the poet was kept from his poems."
+child.get_child_count # => 3
+child.get_head_index #=> 5
+child.get_type # => "S"
+```
+**Maximum Entropy Name Finder***
+```ruby
+OpenNLP.load
+text = File.read('./spec/sample.txt').gsub!("\n", "")
 tokenizer   = OpenNLP::TokenizerME.new
 segmenter   = OpenNLP::SentenceDetectorME.new
-tagger      = OpenNLP::POSTaggerME.new
 ner_models  = ['person', 'time', 'money']
 ner_finders = ner_models.map do |model|
- OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
+  OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
 end
 sentences = segmenter.sent_detect(text)
-all_entities = []
+named_entities = []
 sentences.each do |sentence|
- tokens = tokenizer.tokenize(sentence)
- tags   = tagger.tag(tokens)
- # Get a list of all tokens.
- puts tokens.to_a.inspect
- # Get the sentence's text.
- puts sentence.to_s.inspect
- # Get the sentence's tags.
- puts tags.to_a.inspect
- # Run three NER models and find entities.
- ner_models.each_with_index do |model,i|
-   finder = ner_finders[i]
-   name_spans = finder.find(tokens)
-   name_spans.each do |name_span|
-     start = name_span.get_start
-     stop  = name_span.get_end-1
-     slice = tokens[start..stop].to_a
-     all_entities << [slice, model]
-   end
- end
+  tokens = tokenizer.tokenize(sentence)
+  ner_models.each_with_index do |model,i|
+    finder = ner_finders[i]
+    name_spans = finder.find(tokens)
+    name_spans.each do |name_span|
+      start = name_span.get_start
+      stop  = name_span.get_end-1
+      slice = tokens[start..stop].to_a
+      named_entities << [slice, model]
+    end
+  end
 end
+```
+**Loading specific models**
+Just pass the name of the model file to the constructor. The gem will search for the file in the `OpenNLP.model_path` folder.
+```ruby
+OpenNLP.load
-# Show all named entities.
-puts all_entities.inspect
+tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
+tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
+name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
+# etc.
 ```
 **Loading specific classes**
-You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:
+You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:
 ```ruby
 # Default base class is opennlp.tools.
 OpenNLP.load_class('SomeClassName')
+# => OpenNLP::SomeClassName
 # Here, we specify another base class.
-OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
+OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
+# => OpenNLP::SomeOtherClass
 ```
 **Contributing**
-Feel free to fork the project and send me a pull request!
+Fork the project and send me a pull request! Config updates for other languages are welcome.
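
Note on the name finder example in the README hunk above: it slices tokens with `tokens[start..stop]` where `stop = name_span.get_end-1`, which suggests the span's end index is exclusive. The sketch below illustrates that slicing in plain Ruby with no JVM. `Span` here is a hypothetical `Struct` stand-in, not the gem's `OpenNLP::Span`, and the tokens and span offsets are made up for illustration.

```ruby
# Hypothetical stand-in for the span objects returned by a name finder.
# Only get_start and get_end are modelled; this is not OpenNLP::Span.
Span = Struct.new(:get_start, :get_end)

tokens = %w[Angela Merkel met Nicolas Sarkozy in Berlin .]

# Assumed spans for illustration: tokens 0..1 and tokens 3..4.
name_spans = [Span.new(0, 2), Span.new(3, 5)]

named_entities = name_spans.map do |span|
  # With an exclusive end index, a three-dot range selects the same
  # tokens as tokens[start..(get_end - 1)] in the README example.
  tokens[span.get_start...span.get_end]
end

p named_entities
# => [["Angela", "Merkel"], ["Nicolas", "Sarkozy"]]
```

Using the exclusive range avoids the manual `-1` adjustment, though both forms select the same tokens.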

data/lib/open-nlp.rb CHANGED

@@ -1,31 +1,35 @@
 module OpenNLP
   # Library version.
-  VERSION = '0.1.0'
+  VERSION = '0.1.1'
   # Require Java bindings.
   require 'open-nlp/bindings'
-  OpenNLP::Bindings.bind
   # Require Ruby wrappers.
   require 'open-nlp/classes'
+  # Setup the JVM and load the default JARs.
+  def self.load
+    OpenNLP::Bindings.bind
+  end
   # Load a Java class into the OpenNLP
   # namespace (e.g. OpenNLP::Loaded).
-  def load_class(*args)
+  def self.load_class(*args)
     OpenNLP::Bindings.load_class(*args)
   end
   # Forwards the handling of missing
   # constants to the Bindings class.
-  def const_missing(const)
+  def self.const_missing(const)
     OpenNLP::Bindings.const_get(const)
   end
   # Forward the handling of missing
   # methods to the Bindings class.
-  def method_missing(sym, *args, &block)
+  def self.method_missing(sym, *args, &block)
     OpenNLP::Bindings.send(sym, *args, &block)
   end
-end
+end
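
Note on the `def` to `def self.` changes in this file: a method defined with plain `def` inside a module is an instance method, so it cannot be called as `OpenNLP.load_class(...)`. Defining it on `self` makes it callable on the module itself. The sketch below shows the difference in isolation. `Demo` and `BACKEND` are hypothetical names, with a plain `Hash` standing in for `OpenNLP::Bindings`, and `respond_to_missing?` is an addition that does not appear in the diff.

```ruby
module Demo
  # A plain Hash standing in for the bindings object (hypothetical).
  BACKEND = { jar_path: '/path_to_jars/' }

  # Instance method: only available on objects that include Demo.
  def instance_style
    :instance
  end

  # Singleton method: callable directly as Demo.module_style.
  def self.module_style
    :module
  end

  # Forward unknown module-level calls to the backing object,
  # and fall through to the normal NoMethodError otherwise.
  def self.method_missing(sym, *args, &block)
    return BACKEND.fetch(sym) if BACKEND.key?(sym)
    super
  end

  # Keep respond_to? consistent with method_missing (not in the diff).
  def self.respond_to_missing?(sym, include_private = false)
    BACKEND.key?(sym) || super
  end
end

p Demo.module_style                  # => :module
p Demo.jar_path                      # => "/path_to_jars/"
p Demo.respond_to?(:instance_style)  # => false
```

The same reasoning applies to `const_missing` and `method_missing`: they only intercept `OpenNLP::Foo` and `OpenNLP.foo` when defined on the module itself.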

data/lib/open-nlp/bindings.rb CHANGED

@@ -10,10 +10,6 @@ module OpenNLP::Bindings
   require 'bind-it'
   extend BindIt::Binding
-  # The path in which to look for JAR files, with
-  # a trailing slash (default is gem's bin folder).
-  self.jar_path = File.dirname(__FILE__) + '/../../bin/'
   # Load the JVM with a minimum heap size of 512MB,
   # and a maximum heap size of 1024MB.
   self.jvm_args = ['-Xms512M', '-Xmx1024M']
@@ -34,6 +30,7 @@ module OpenNLP::Bindings
   # Default classes.
   self.default_classes = [
+    # OpenNLP classes.
     ['AbstractBottomUpParser', 'opennlp.tools.parser'],
     ['DocumentCategorizerME', 'opennlp.tools.doccat'],
     ['ChunkerME', 'opennlp.tools.chunker'],
@@ -46,7 +43,12 @@ module OpenNLP::Bindings
     ['SentenceDetectorME', 'opennlp.tools.sentdetect'],
     ['SimpleTokenizer', 'opennlp.tools.tokenize'],
     ['Span', 'opennlp.tools.util'],
-    ['TokenizerME', 'opennlp.tools.tokenize']
+    ['TokenizerME', 'opennlp.tools.tokenize'],
+    # Generic Java classes.
+    ['FileInputStream', 'java.io'],
+    ['String', 'java.lang'],
+    ['ArrayList', 'java.util']
   ]
   # Add in Rjb workarounds.
@@ -54,14 +56,6 @@ module OpenNLP::Bindings
     self.default_jars << 'utils.jar'
     self.default_classes << ['Utils', '']
   end
-  # Make the bindings.
-  self.bind
-  # Load utility classes.
-  self.load_class('FileInputStream', 'java.io')
-  self.load_class('String', 'java.lang')
-  self.load_class('ArrayList', 'java.util')
   # ############################ #
   #   OpenNLP bindings proper    #
@@ -78,12 +72,20 @@ module OpenNLP::Bindings
     attr_accessor :language
   end
+  def self.default_path
+    File.dirname(__FILE__) + '/../../bin/'
+  end
   # The loaded models.
   self.models = {}
   # The names of loaded models.
   self.model_files = {}
+  # The path in which to look for JAR files, with
+  # a trailing slash (default is gem's bin folder).
+  self.jar_path = self.default_path
   # The path to the main folder containing the folders
   # with the individual models inside. By default, this
   # is the same as the JAR path.
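
Note on `default_path`: concatenating `'/../../bin/'` onto `File.dirname(__FILE__)` yields a working but un-normalized path that still contains the `..` segments. The sketch below compares that with a normalized form built with `File.expand_path`. The `file` value is a hypothetical install location used in place of `__FILE__`, and the expected results assume a POSIX-style filesystem.

```ruby
# Hypothetical location of bindings.rb, used here instead of __FILE__.
file = '/gems/open-nlp-0.1.1/lib/open-nlp/bindings.rb'

# Form used in the diff: string concatenation keeps the '..' segments.
raw = File.dirname(file) + '/../../bin/'

# Normalized alternative: resolve '..' relative to the file's directory,
# then restore the trailing slash that jar_path is documented to carry.
normalized = File.expand_path('../../bin', File.dirname(file)) + '/'

puts raw         # /gems/open-nlp-0.1.1/lib/open-nlp/../../bin/
puts normalized  # /gems/open-nlp-0.1.1/bin/
```

Both strings refer to the same directory. The normalized form is mainly easier to read in logs and error messages.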

data/spec/english_spec.rb CHANGED

@@ -2,10 +2,51 @@
 require_relative 'spec_helper'
 describe OpenNLP do
+  context "when an unreachable jar_path or model_path is provided" do
+    it "raises an exception when trying to load" do
+      OpenNLP.jar_path = '/unreachable/'
+      OpenNLP::Bindings.jar_path.should eql '/unreachable/'
+      OpenNLP.model_path = '/unreachable/'
+      OpenNLP::Bindings.model_path.should eql '/unreachable/'
+      expect { OpenNLP.load }.to raise_exception
+      OpenNLP.jar_path = OpenNLP.model_path = OpenNLP.default_path
+      expect { OpenNLP.load }.not_to raise_exception
+    end
+  end
+  context "when a constructor is provided with a specific model to load" do
+    it "loads that model, looking for the supplied file relative to OpenNLP.model_path " do
+      OpenNLP.load
+      tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
+      tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
+      sent = "The death of the poet was kept from his poems."
+      tokens = tokenizer.tokenize(sent)
+      tags = tagger.tag(tokens)
+      OpenNLP.models[:pos_tagger].get_pos_model.to_s
+      .index('opennlp.perceptron.PerceptronModel').should_not be_nil
+      tags.should eql ["DT", "NN", "IN", "DT", "NN", "VBD", "VBN", "IN", "PRP$", "NNS", "."]
+    end
+  end
+  context "when a class is loaded through the #load_class method" do
+    it "loads the class and allows to access it through the global namespace" do
+      OpenNLP.load_class('ChunkSample', 'opennlp.tools.chunker')
+      expect { OpenNLP::ChunkSample }.not_to raise_exception
+    end
+  end
   context "the maximum entropy chunker is run after tokenization and POS tagging" do
     it "should find the accurate chunks" do
+      OpenNLP.load
       chunker   = OpenNLP::ChunkerME.new
       tokenizer = OpenNLP::TokenizerME.new
       tagger    = OpenNLP::POSTaggerME.new
@@ -25,7 +66,9 @@ describe OpenNLP do
   context "the maximum entropy parser is run after tokenization" do
     it "parses the text accurately" do
+      OpenNLP.load
       sent      = "The death of the poet was kept from his poems."
       parser = OpenNLP::Parser.new
       parse = parser.parse(sent)
@@ -51,10 +94,14 @@ describe OpenNLP do
   context "the SimpleTokenizer is run" do
     it "tokenizes the text accurately" do
+      OpenNLP.load
       sent = "The death of the poet was kept from his poems."
       tokenizer = OpenNLP::SimpleTokenizer.new
       tokens = tokenizer.tokenize(sent).to_a
       tokens.should eql %w[The death of the poet was kept from his poems .]
     end
   end
@@ -63,6 +110,8 @@ describe OpenNLP do
     it "should accurately detect tokens, sentences and named entities" do
+      OpenNLP.load
       text = File.read('./spec/sample.txt').gsub!("\n", "")
       tokenizer   = OpenNLP::TokenizerME.new
@@ -96,6 +145,7 @@ describe OpenNLP do
         all_tokens << tokens.to_a
         all_sentences << sentence
         all_tags << tags.to_a
       end
       all_tokens.should eql [["To", "describe", "2009", "as", "a", "stellar", "year", "for", "Petrofac", "(", "LON:PFC)", "would", "be", "a", "huge", "understatement", "."], ["The", "group", "finished", "the", "year", "with", "an", "order", "backlog", "twice", "the", "size", "than", "it", "had", "at", "the", "outset", "."], ["The", "group", "has", "since", "been", "awarded", "a", "US", "600", "million", "contract", "and", "spun", "off", "its", "North", "Sea", "assets", "."], ["The", "group", "’s", "recently", "released", "full", "year", "results", "show", "a", "jump", "in", "revenues", ",", "pre-tax", "profits", "and", "order", "backlog", "."], ["Whilst", "group", "revenue", "rose", "by", "10", "%", "from", "$", "3.3", "billion", "to", "$", "3.7", "billion", ",", "pre-tax", "profits", "rose", "by", "25", "%", "from", "$", "358", "million", "to", "$", "448", "million", ".All", "the", "more", "impressive", ",", "the", "group", "’s", "order", "backlog", "doubled", "to", "over", "$", "8", "billion", "paying", "no", "attention", "to", "the", "15", "%", "cut", "in", "capital", "expenditure", "witnessed", "across", "the", "oil", "and", "gas", "industry", "as", "whole", "in", "2009", ".Focussing", "in", "on", "which", "the", "underlying", "performances", "of", "the", "individual", "segments", ",", "the", "group", "cash", "cow", ",", "its", "Engineering", "and", "Construction", "division", ",", "saw", "operating", "profit", "rise", "33", "%", "over", "the", "year", "to", "$", "322", "million", ",", "thanks", "to", "US$", "6.3", "billion", "worth", "of", "new", "contract", "wins", "during", "the", "year", "which", "included", "a", "$", "100", "million", "contract", "with", "Turkmengaz", ",", "the", "Turkmenistan", "national", "energy", "company", "."], ["The", "division", "has", "picked", "up", "in", "2010", "where", "it", "left", "off", "in", "2009", "and", "has", "been", "awarded", "a", "contract", "worth", "more", "than", "US600", "million", "for", "a", "gas", 
"sweetening", "facilities", "project", "by", "Qatar", "Petroleum.Elsewhere", "the", "group", "’s", "Offshore", "Engineering", "&", "Operations", "division", "may", "have", "seen", "a", "pullback", "in", "revenue", "and", "earnings", "vis-a-vis", "2008", ",", "but", "it", "did", "secure", "a", "£75", "million", "contract", "with", "Apache", "to", "provideengineering", "and", "construction", "services", "for", "the", "Forties", "field", "in", "the", "UK", "North", "Sea", "."], ["And", "to", "underscore", "the", "fact", "that", "there", "is", "life", "beyond", "NOC’s", "for", "Petrofac", "(", "LON:PFC)", "the", "division", "was", "awarded", "a", "£100", "million", "5-year", "contract", "by", "BP", "(", "LON:BP.", ")", "to", "deliver", "integrated", "maintenance", "management", "support", "services", "for", "all", "of", "BP", "'s", "UK", "offshore", "assets", "and", "onshore", "Dimlington", "plant", "."], ["The", "laggard", "of", "the", "group", "was", "the", "Engineering", ",", "Training", "Services", "and", "Production", "Solutions", "division", "."], ["The", "business", "suffered", "as", "the", "oil", "price", "tailed", "off", "and", "the", "economic", "outlook", "deteriorated", "forcing", "a", "number", "ofmajor", "customers", "to", "postpone", "early", "stage", "engineering", "studies", "or", "re-phased", "work", "upon", "which", "the", "division", "depends", "."], ["Although", "the", "fall", "in", "activity", "was", "notable", ",", "the", "division’s", "operational", "performance", "in", "service", "operator", "role", "for", "production", "of", "Dubai", "'s", "offshore", "oil", "&", "gas", "proved", "a", "highlight.Energy", "Developments", "meanwhile", "saw", "the", "start", "of", "oil", "production", "from", "the", "West", "Don", "field", "during", "the", "first", "half", "of", "the", "year", "less", "than", "a", "year", "from", "Field", "Development", "Programme", "approval", "."], ["In", "addition", "output", "from", "Don", "Southwest", "field", "began", "in", 
"June", "."], ["Despite", "considerably", "lower", "oil", "prices", "in", "2009", "compared", "to", "the", "prior", "year", ",", "Energy", "Developments", "'", "revenue", "reached", "almost", "US$", "250", "million", "(", "significantly", "higher", "than", "the", "US$", "153", "million", "of", "2008", ")", "due", "not", "only", "to", "the", "‘Don", "fields", "effect", "’", "but", "also", "a", "full", "year", "'s", "contribution", "from", "the", "Chergui", "gas", "plant", ",", "which", "began", "exports", "in", "August", "2008.In", "order", "to", "maximize", "the", "earnings", "potential", "of", "the", "division’s", "North", "Sea", "assets", ",", "including", "the", "Don", "assets", ",", "the", "group", "has", "demerged", "them", "providing", "its", "shareholders", "with", "shares", "in", "a", "newly", "listed", "independent", "exploration", "and", "production", "company", "called", "EnQuest", "(", "LON:ENQ", ")", "."], ["EnQuest", "is", "a", "product", "of", "the", "Petrofac’s", "North", "Sea", "Assets", "with", "those", "off", "of", "Swedish", "explorer", "Lundin", "with", "both", "companies", "divesting", "for", "different", "reasons", "."], ["Upon", "listing", "(", "April", "6th", ")", ",", "Petrofac", "(", "LON:PFC)", "shareholders", "owned", "around", "45", "%", "of", "the", "new", "EnQuest", "entity", "with", "Lundin", "shareholders", "owning", "approximately", "55", "%", "."], ["It", "is", "important", "to", "note", "that", "post", "demerger", "the", "Energy", "Developments", "business", "unit", "is", "still", "a", "key", "constituent", "of", "Petrofac", "'s", "business", "portfolio", ",", "and", "will", "continue", "to", "hold", "significant", "assets", "Tunisia", ",", "Malaysia", ",", "Algeria", "and", "Kyrgyz", "Republic", "-", "sandwiched", "between", "Kazakhstan", "and", "China", "."]]

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: open-nlp
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.1.1
   prerelease:
 platform: ruby
 authors: