stanford-core-nlp 0.1.5 → 0.1.7
- data/README.markdown +60 -53
- data/lib/stanford-core-nlp.rb +26 -7
- data/lib/stanford-core-nlp/config.rb +14 -1
- data/lib/stanford-core-nlp/jar_loader.rb +1 -1
- metadata +7 -8
- data/bin/INFO +0 -1
data/README.markdown
CHANGED
@@ -1,6 +1,6 @@
  **About**

- This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set natural language processing tools that
+ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](http://nlp.stanford.edu/software/corenlp.shtml), a set of natural language processing tools that provides tokenization, part-of-speech tagging, lemmatization, and parsing for five languages (English, French, German, Arabic and Chinese), as well as named entity recognition and coreference resolution for English.

  **Installing**

@@ -12,51 +12,60 @@ This gem provides high-level Ruby bindings to the [Stanford Core NLP package](ht

  After installing and requiring the gem (`require 'stanford-core-nlp'`), you may want to set some configuration options (this, however, is not necessary). Here are some examples:

+ ```ruby
+ # Set an alternative path to look for the JAR files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.jar_path = '/path_to_jars/'

+ # Set an alternative path to look for the model files
+ # Default is gem's bin folder.
+ StanfordCoreNLP.model_path = '/path_to_models/'

+ # Pass some alternative arguments to the Java VM.
+ # Default is ['-Xms512M', '-Xmx1024M'] (be prepared
+ # to take a coffee break).
+ StanfordCoreNLP.jvm_args = ['-option1', '-option2']

+ # Redirect VM output to log.txt
+ StanfordCoreNLP.log_file = 'log.txt'
+
+ # Use the model files for a different language than English.
+ StanfordCoreNLP.use(:french)
+
+ # Change a specific model file.
+ StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
+ ```

- # Change a specific model file.
- StanfordCoreNLP.set_model('pos.model', 'english-left3words-distsim.tagger')
-
  **Using the gem**

+ ```ruby
+ text = 'Angela Merkel met Nicolas Sarkozy on January 25th in ' +
+        'Berlin to discuss a new austerity package. Sarkozy ' +
+        'looked pleased, but Merkel was dismayed.'
+
+ pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos, :lemma, :parse, :ner, :dcoref)
+ text = StanfordCoreNLP::Text.new(text)
+ pipeline.annotate(text)
+
+ text.get(:sentences).each do |sentence|
+   sentence.get(:tokens).each do |token|
+     # Default annotations for all tokens
+     puts token.get(:value).to_s
+     puts token.get(:original_text).to_s
+     puts token.get(:character_offset_begin).to_s
+     puts token.get(:character_offset_end).to_s
+     # POS returned by the tagger
+     puts token.get(:part_of_speech).to_s
+     # Lemma (base form of the token)
+     puts token.get(:lemma).to_s
+     # Named entity tag
+     puts token.get(:named_entity_tag).to_s
+     # Coreference
+     puts token.get(:coref_cluster_id).to_s
+     # Also of interest: coref, coref_chain, coref_cluster, coref_dest, coref_graph.
+   end
+ end
+ ```

  > Note: You need to load the StanfordCoreNLP pipeline before using the StanfordCoreNLP::Text class.

@@ -66,19 +75,17 @@ A good reference for names of annotations are the Stanford Javadocs for [CoreAnn

  You may also want to load your own classes from the Stanford NLP to do more specific tasks. The gem provides an API to do this:

- The models included with the gem for the NER system are missing two files: "edu/stanford/nlp/models/dcoref/countries" and "edu/stanford/nlp/models/dcoref/statesandprovinces", which I couldn't find anywhere. I will be grateful if somebody could add/e-mail me these files.
+ ```ruby
+ # Default base class is edu.stanford.nlp.pipeline.
+ StanfordCoreNLP.load_class('PTBTokenizerAnnotator')
+ puts StanfordCoreNLP::PTBTokenizerAnnotator.inspect
+ # => #<Rjb::Edu_stanford_nlp_pipeline_PTBTokenizerAnnotator>
+
+ # Here, we specify another base class.
+ StanfordCoreNLP.load_class('MaxentTagger', 'edu.stanford.nlp.tagger')
+ puts StanfordCoreNLP::MaxentTagger.inspect
+ # => <Rjb::Edu_stanford_nlp_tagger_maxent_MaxentTagger:0x007f88491e2020>
+ ```

  **Contributing**

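Taken together, the configuration and usage examples added to the README fit into a single workflow roughly as follows. This is a minimal sketch against the 0.1.7 API shown above; the install paths, log file name, sample sentence, and annotator list are made up for illustration, and the French model files must already be present under `model_path`:

```ruby
require 'stanford-core-nlp'

# Hypothetical install locations; both default to the gem's bin/ folder.
StanfordCoreNLP.jar_path   = '/opt/stanford/jars/'
StanfordCoreNLP.model_path = '/opt/stanford/models/'
StanfordCoreNLP.log_file   = 'corenlp.log'

# Select the bundled French models before building the pipeline,
# since model files are resolved when the pipeline is loaded.
StanfordCoreNLP.use(:french)
pipeline = StanfordCoreNLP.load(:tokenize, :ssplit, :pos)

text = StanfordCoreNLP::Text.new('Angela Merkel a rencontré Nicolas Sarkozy à Berlin.')
pipeline.annotate(text)

text.get(:sentences).each do |sentence|
  sentence.get(:tokens).each do |token|
    puts "#{token.get(:value)}\t#{token.get(:part_of_speech)}"
  end
end
```
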
data/lib/stanford-core-nlp.rb
CHANGED
@@ -1,6 +1,7 @@
  module StanfordCoreNLP

-   VERSION = '0.1.5'
+   VERSION = '0.1.7'
+
    require 'stanford-core-nlp/jar_loader'
    require 'stanford-core-nlp/java_wrapper'
    require 'stanford-core-nlp/config'
@@ -30,6 +31,10 @@
    # retrieve annotations using static classes as names.
    # This works around one of the lacunae of Rjb.
    attr_accessor :jar_path
+   # The path to the main folder containing the folders
+   # with the individual models inside. By default, this
+   # is the same as the JAR path.
+   attr_accessor :model_path
    # The flags for starting the JVM machine. The parser
    # and named entity recognizer are very memory consuming.
    attr_accessor :jvm_args
@@ -41,13 +46,14 @@

    # The default JAR path is the gem's bin folder.
    self.jar_path = File.dirname(__FILE__) + '/../bin/'
+   # The default model path is the same as the JAR path.
+   self.model_path = self.jar_path
    # Load the JVM with a minimum heap size of 512MB and a
    # maximum heap size of 1024MB.
    self.jvm_args = ['-Xms512M', '-Xmx1024M']
    # Turn logging off by default.
    self.log_file = nil
-
-
+
    # Use models for a given language. Language can be
    # supplied as full-length, or ISO-639 2 or 3 letter
    # code (e.g. :english, :eng or :en will work).
@@ -117,14 +123,16 @@
    # specified JVM flags and StanfordCoreNLP
    # properties.
    def self.load(*annotators)
+     JarLoader.log(self.log_file)
      self.init unless @@initialized
      # Prepend the JAR path to the model files.
      properties = {}
      self.model_files.each do |k,v|
-       f = self.jar_path + v
+       f = self.model_path + v
        unless File.readable?(f)
          raise "Model file #{f} could not be found. " +
-         "You may need to download this file manually
+         "You may need to download this file manually "+
+         " and/or set paths properly."
        else
          properties[k] = f
        end
@@ -133,12 +141,23 @@
      annotators.map { |x| x.to_s }.join(', ')
      CoreNLP.new(get_properties(properties))
    end
-
+
+   # Once it loads a specific annotator model once,
+   # the program always loads the same models when
+   # you make new pipelines and request the annotator
+   # again, ignoring the changes in models.
+   #
+   # This function kills the JVM and reloads everything
+   # if you need to create a new pipeline with different
+   # models for the same annotators.
+   #def self.reload
+   #  raise 'Not implemented.'
+   #end
+
    # Load the jars.
    def self.load_jars
      JarLoader.jvm_args = self.jvm_args
      JarLoader.jar_path = self.jar_path
-     JarLoader.log(self.log_file) if self.log_file
      JarLoader.load('joda-time.jar')
      JarLoader.load('xom.jar')
      JarLoader.load('stanford-corenlp.jar')
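The `self.load` changes above move model-file resolution from `jar_path` to the new `model_path` and check that each file is readable before the pipeline is constructed. The step can be pictured in isolation with this standalone sketch (not the gem's code; the hash keys, relative paths, and directory are illustrative):

```ruby
# Mimics the resolution loop in StanfordCoreNLP.load: join each model
# file to model_path (no longer jar_path) and require it to be readable
# before passing it to the Java pipeline as a property.
model_path  = '/opt/stanford/models/'            # illustrative
model_files = {                                  # illustrative entries
  'pos.model'    => 'taggers/english-left3words-distsim.tagger',
  'parser.model' => 'grammar/englishPCFG.ser.gz'
}

properties = {}
model_files.each do |key, relative|
  file = model_path + relative
  if File.readable?(file)
    properties[key] = file
  else
    raise "Model file #{file} could not be found. " \
          "You may need to download this file manually and/or set paths properly."
  end
end

puts properties.inspect
```
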
data/lib/stanford-core-nlp/config.rb
CHANGED
@@ -20,9 +20,18 @@
      :ner => 'classifiers/',
      :dcoref => 'dcoref/'
    }
-
+
+   # Tag sets used by Stanford for each language.
+   TagSets = {
+     :english => :penn,
+     :german => :negra,
+     :chinese => :penn_chinese,
+     :french => :simple
+   }
+
    # Default models for all languages.
    Models = {
+
      :pos => {
        :english => 'english-left3words-distsim.tagger',
        :german => 'german-fast.tagger',
@@ -31,6 +40,7 @@
        :chinese => 'chinese.tagger',
        :xinhua => nil
      },
+
      :parser => {
        :english => 'englishPCFG.ser.gz',
        :german => 'germanPCFG.ser.gz',
@@ -39,6 +49,7 @@
        :chinese => 'chinesePCFG.ser.gz',
        :xinhua => 'xinhuaPCFG.ser.gz'
      },
+
      :ner => {
        :english => {
          '3class' => 'all.3class.distsim.crf.ser.gz',
@@ -51,6 +62,7 @@
        :chinese => {},
        :xinhua => {}
      },
+
      :dcoref => {
        :english => {
          'demonym' => 'demonyms.txt',
@@ -72,6 +84,7 @@
        :chinese => {},
        :xinhua => {}
      }
+
      # Models to add.

      #"truecase.model" - path towards the true-casing model; default: StanfordCoreNLPModels/truecase/noUN.ser.gz
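The new `TagSets` constant records which part-of-speech tag set each supported language's models emit. Assuming it is defined directly under the `StanfordCoreNLP` module, as `Models` appears to be, it could be consulted with a hypothetical helper like this:

```ruby
require 'stanford-core-nlp'

# Hypothetical helper (not part of the gem): report the tag set used by
# the POS models of a given language, via the TagSets map added in 0.1.7.
def tag_set_for(language)
  StanfordCoreNLP::TagSets[language]
end

puts tag_set_for(:german)   # => :negra
puts tag_set_for(:french)   # => :simple
```
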
metadata
CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: stanford-core-nlp
  version: !ruby/object:Gem::Version
-   version: 0.1.5
+   version: 0.1.7
  prerelease:
  platform: ruby
  authors:
@@ -9,11 +9,11 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-
+ date: 2012-02-22 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rjb
-   requirement: &
+   requirement: &70107443631860 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,11 @@ dependencies:
        version: '0'
    type: :runtime
    prerelease: false
-   version_requirements: *
+   version_requirements: *70107443631860
  description: ! " High-level Ruby bindings to the Stanford CoreNLP package, a set natural
-   language processing \ntools
+   language processing \ntools that provides tokenization, part-of-speech tagging and
+   parsing for several languages, as well as named entity \nrecognition and coreference
+   resolution for English. "
  email:
  - louis.mullie@gmail.com
  executables: []
@@ -36,7 +37,6 @@ files:
  - lib/stanford-core-nlp/java_wrapper.rb
  - lib/stanford-core-nlp.rb
  - bin/bridge.jar
- - bin/INFO
  - README.markdown
  - LICENSE
  homepage: https://github.com/louismullie/stanford-core-nlp
@@ -64,4 +64,3 @@ signing_key:
  specification_version: 3
  summary: Ruby bindings to the Stanford Core NLP tools.
  test_files: []
- has_rdoc:
data/bin/INFO
DELETED
@@ -1 +0,0 @@
- This is where you should put the JAR files and the folders with the model files.