RubyGems - stanford-core-nlp - Versions diffs - 0.1 - Mend

stanford-core-nlp 0.1

Files changed (33)

data/LICENSE +18 -0
data/README +3 -0
data/bin/bridge.jar +0 -0
data/bin/classifiers/all.3class.distsim.crf.ser.gz +0 -0
data/bin/classifiers/all.3class.distsim.prop +52 -0
data/bin/classifiers/conll.4class.distsim.crf.ser.gz +0 -0
data/bin/classifiers/conll.4class.distsim.prop +58 -0
data/bin/classifiers/muc.7class.distsim.crf.ser.gz +0 -0
data/bin/classifiers/muc.7class.distsim.prop +50 -0
data/bin/dcoref/animate.unigrams.txt +35302 -0
data/bin/dcoref/demonyms.txt +250 -0
data/bin/dcoref/female.unigrams.txt +5467 -0
data/bin/dcoref/inanimate.unigrams.txt +80533 -0
data/bin/dcoref/male.unigrams.txt +42445 -0
data/bin/dcoref/namegender.combine.txt +14607 -0
data/bin/dcoref/neutral.unigrams.txt +30896 -0
data/bin/dcoref/plural.unigrams.txt +9618 -0
data/bin/dcoref/singular.unigrams.txt +69190 -0
data/bin/dcoref/state-abbreviations.txt +50 -0
data/bin/dcoref/unknown.txt +0 -0
data/bin/grammar/englishFactored.ser.gz +0 -0
data/bin/grammar/englishPCFG.ser.gz +0 -0
data/bin/joda-time.jar +0 -0
data/bin/stanford-corenlp.jar +0 -0
data/bin/taggers/README-Models.txt +102 -0
data/bin/taggers/english-bidirectional-distsim.tagger +0 -0
data/bin/taggers/english-bidirectional-distsim.tagger.props +33 -0
data/bin/taggers/english-left3words-distsim.tagger +0 -0
data/bin/taggers/english-left3words-distsim.tagger.props +33 -0
data/bin/xom.jar +0 -0
data/lib/stanford-core-nlp.rb +106 -0
data/lib/stanford-core-nlp/jar-loader.rb +61 -0
metadata +90 -0

data/bin/dcoref/state-abbreviations.txt ADDED

@@ -0,0 +1,50 @@
+Alabama	Ala.	AL
+Alaska	Alaska	AK
+Arizona	Ariz.	AZ
+Arkansas	Ark.	AR
+California	Calif.	CA
+Colorado	Colo.	CO
+Connecticut	Conn.	CT
+Delaware	Del.	DE
+Florida	Fla.	FL
+Georgia	Ga.	GA
+Hawaii	Hawaii	HI
+Idaho	Idaho	ID
+Illinois	Ill.	IL
+Indiana	Ind.	IN
+Iowa	Iowa	IA
+Kansas	Kans.	KS
+Kentucky	Ky.	KY
+Louisiana	La.	LA
+Maine	Maine	ME
+Maryland	Md.	MD
+Massachusetts	Mass.	MA
+Michigan	Mich.	MI
+Minnesota	Minn.	MN
+Mississippi	Miss.	MS
+Missouri	Mo.	MO
+Montana	Mont.	MT
+Nebraska	Nebr.	NE
+Nevada	Nev.	NV
+New Hampshire	N.H.	NH
+New Jersey	N.J.	NJ
+New Mexico	N.M.	NM
+New York	N.Y.	NY
+North Carolina	N.C.	NC
+North Dakota	N.D.	ND
+Ohio	Ohio	OH
+Oklahoma	Okla.	OK
+Oregon	Ore.	OR
+Pennsylvania	Pa.	PA
+Rhode Island	R.I.	RI
+South Carolina	S.C.	SC
+South Dakota	S.D.	SD
+Tennessee	Tenn.	TN
+Texas	Tex.	TX
+Utah	Utah	UT
+Vermont	Vt.	VT
+Virginia	Va.	VA
+Washington	Wash.	WA
+West Virginia	W.Va.	WV
+Wisconsin	Wis.	WI
+Wyoming	Wyo.	WY
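
The file above holds one state per line in three columns: full name, traditional abbreviation, and postal code. As a minimal sketch, assuming the columns are tab-separated as they appear in this diff, the table could be read into a Ruby hash as follows. The helper name `parse_state_abbreviations` is hypothetical and is not part of the gem; how CoreNLP itself consumes the file is not shown in this diff.

```ruby
# Hypothetical helper (not part of the gem): turn lines of the form
# "Name<TAB>Abbreviation<TAB>PostalCode" into a lookup table keyed by name.
def parse_state_abbreviations(text)
  text.each_line.with_object({}) do |line, table|
    name, abbreviation, postal = line.chomp.split("\t")
    next if name.nil? || name.empty? # skip blank lines
    table[name] = { abbreviation: abbreviation, postal: postal }
  end
end

# Two rows copied from the file above.
sample = "Alabama\tAla.\tAL\nNew York\tN.Y.\tNY\n"
states = parse_state_abbreviations(sample)
puts states['New York'][:postal]
```

Keying on the full name keeps multi-word entries such as "New York" intact, since only tabs are treated as separators.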

data/bin/dcoref/unknown.txt ADDED

File without changes

data/bin/grammar/englishFactored.ser.gz ADDED

Binary file

data/bin/grammar/englishPCFG.ser.gz ADDED

Binary file

data/bin/joda-time.jar ADDED

Binary file

data/bin/stanford-corenlp.jar ADDED

Binary file

data/bin/taggers/README-Models.txt ADDED

@@ -0,0 +1,102 @@
+Stanford POS Tagger, v. 3.1.0 - 2011-12-16
+Copyright (c) 2002-2011 The Board of Trustees of
+The Leland Stanford Junior University. All Rights Reserved.
+This document contains (some) information about the models included in
+this release and that may be downloaded from the POS tagger website at
+http://nlp.stanford.edu/software/tagger.shtml .  If you have downloaded
+the full tagger, all of the models mentioned in this document are in the
+downloaded package in the same directory as this readme.  Otherwise,
+included in the download are two
+English taggers, and the other taggers may be downloaded from the
+website.  All taggers are accompanied by the props files used to create
+them; please examine these files for more detailed information about the
+creation of the taggers.
+For English, the bidirectional taggers are slightly more accurate, but
+tag much more slowly; choose the appropriate tagger based on your
+speed/performance needs.
+English taggers
+---------------------------
+bidirectional-distsim-wsj-0-18.tagger
+Trained on WSJ sections 0-18 using a bidirectional architecture and
+including word shape and distributional similarity features.
+Penn Treebank tagset.
+Performance:
+97.28% correct on WSJ 19-21
+(90.46% correct on unknown words)
+left3words-wsj-0-18.tagger
+Trained on WSJ sections 0-18 using the left3words architecture and
+includes word shape features.  Penn tagset.
+Performance:
+96.97% correct on WSJ 19-21
+(88.85% correct on unknown words)
+left3words-distsim-wsj-0-18.tagger
+Trained on WSJ sections 0-18 using the left3words architecture and
+includes word shape and distributional similarity features. Penn tagset.
+Performance:
+97.01% correct on WSJ 19-21
+(89.81% correct on unknown words)
+Chinese tagger
+---------------------------
+chinese.tagger
+Trained on a combination of Chinese Treebank texts from Chinese and Hong
+Kong sources.
+LDC Chinese Treebank POS tag set.
+Performance:
+94.13% on a combination of Chinese and Hong Kong texts
+(78.92% on unknown words)
+Arabic tagger
+---------------------------
+arabic-accurate.tagger
+Trained on the *entire* ATB p1-3.
+When trained on the train part of the ATB p1-3 split done for the 2005
+JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
+the following performance:
+Performance:
+96.50% on dev portion according to Diab split
+(80.59% on unknown words)
+arabic-fast.tagger
+4x speed improvement over "accurate".
+Performance:
+96.34% on dev portion according to Diab split
+(80.28% on unknown words)
+French tagger
+---------------------------
+french.tagger
+Trained on the French treebank.
+German tagger
+---------------------------
+german-hgc.tagger
+Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
+The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
+German text corpora with part-of-speech labels, which was jointly
+developed by the Institut für maschinelle Sprachverarbeitung of the
+University of Stuttgart and the Seminar für Sprachwissenschaft of the
+University of Tübingen. See:
+http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
+This model uses features from the distributional similarity clusters
+built over the HGC.
+Performance:
+96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
+(90.33% on unknown words)
+german-dewac.tagger
+This model uses features from the distributional similarity clusters
+built from the deWac web corpus.
+german-fast.tagger
+Lacks distributional similarity features, but is several times faster
+than the other alternatives.
+Performance:
+96.61% overall / 86.72% unknown.

data/bin/taggers/english-bidirectional-distsim.tagger ADDED

Binary file

data/bin/taggers/english-bidirectional-distsim.tagger.props ADDED

@@ -0,0 +1,33 @@
+## tagger training invoked at Thu Dec 15 01:17:19 PST 2011 with arguments:
+                   model = english-bidirectional-distsim.tagger
+                    arch = bidirectional5words,naacl2003unknowns,allwordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1)
+               trainFile = /u/nlp/data/pos-tagger/english/train-wsj-0-18;/u/nlp/data/pos-tagger/english/train-extra-english
+         closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+                   debug = false
+             debugPrefix =
+            tagSeparator = _
+                encoding = UTF-8
+              iterations = 100
+                    lang = english
+    learnClosedClassTags = false
+        minFeatureThresh = 2
+           openClassTags =
+rareWordMinFeatureThresh = 5
+          rareWordThresh = 5
+                  search = owlqn
+                    sgml = false
+            sigmaSquared = 0.5
+                   regL1 = 0.75
+               tagInside =
+                tokenize = true
+        tokenizerFactory =
+        tokenizerOptions =
+                 verbose = false
+          verboseResults = true
+    veryCommonWordThresh = 250
+                xmlInput =
+              outputFile =
+            outputFormat = slashTags
+     outputFormatOptions =

data/bin/taggers/english-left3words-distsim.tagger ADDED

Binary file

data/bin/taggers/english-left3words-distsim.tagger.props ADDED

@@ -0,0 +1,33 @@
+## tagger training invoked at Thu Dec 15 01:17:21 PST 2011 with arguments:
+                   model = english-left3words-distsim.tagger
+                    arch = left3words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1)
+               trainFile = /u/nlp/data/pos-tagger/english/train-wsj-0-18;/u/nlp/data/pos-tagger/english/train-extra-english
+         closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+                   debug = false
+             debugPrefix =
+            tagSeparator = _
+                encoding = UTF-8
+              iterations = 100
+                    lang = english
+    learnClosedClassTags = false
+        minFeatureThresh = 2
+           openClassTags =
+rareWordMinFeatureThresh = 10
+          rareWordThresh = 5
+                  search = owlqn
+                    sgml = false
+            sigmaSquared = 0.0
+                   regL1 = 0.75
+               tagInside =
+                tokenize = true
+        tokenizerFactory =
+        tokenizerOptions =
+                 verbose = false
+          verboseResults = true
+    veryCommonWordThresh = 250
+                xmlInput =
+              outputFile =
+            outputFormat = slashTags
+     outputFormatOptions =
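
The two `.props` files above share most settings and differ in only a few keys (for example `arch`, `rareWordMinFeatureThresh`, and `sigmaSquared`). As a rough sketch, assuming each file is a list of `key = value` lines with `#` comments, the differing keys could be found like this. `parse_props` and `changed_keys` are hypothetical helpers for illustration, not part of the gem or of the Stanford tagger.

```ruby
# Hypothetical helper: parse "key = value" lines into a hash of strings.
# Comment lines (starting with '#') and blank lines are skipped;
# keys with nothing after '=' map to an empty string.
def parse_props(text)
  text.each_line.with_object({}) do |line, props|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    key, value = line.split('=', 2)
    next if value.nil? # not a key/value line
    props[key.strip] = value.strip
  end
end

# Hypothetical helper: list the keys whose values differ between two hashes.
def changed_keys(a, b)
  (a.keys | b.keys).select { |key| a[key] != b[key] }
end

# A few settings copied from the two files above.
bidirectional = parse_props(<<~PROPS)
  ## tagger training arguments
  rareWordMinFeatureThresh = 5
  sigmaSquared = 0.5
  regL1 = 0.75
  closedClassTags =
PROPS

left3words = parse_props(<<~PROPS)
  ## tagger training arguments
  rareWordMinFeatureThresh = 10
  sigmaSquared = 0.0
  regL1 = 0.75
  closedClassTags =
PROPS

puts changed_keys(bidirectional, left3words).inspect
```

Values are kept as strings on purpose, so the sketch makes no assumption about how the tagger interprets each setting.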

data/bin/xom.jar ADDED

Binary file

data/lib/stanford-core-nlp.rb ADDED

@@ -0,0 +1,106 @@
+module StanfordCoreNLP
+  VERSION = '0.1'
+  require 'stanford-core-nlp/jar-loader.rb'
+  class << self
+    # The path in which to look for the Stanford JAR files.
+    # This is passed to JarLoader.
+    attr_accessor :jar_path
+    # The flags for starting the JVM.
+    # The parser and named entity recognizer are very memory-intensive.
+    attr_accessor :jvm_flags
+  end
+  self.jar_path = File.dirname(__FILE__) + '/../bin/'
+  self.jvm_flags = ['-Xms512M', '-Xmx1024M']
+  # Return the default properties (English models with
+  # tokenizer and sentence splitter).
+  def self.default_properties
+    {
+      'annotators' => 'tokenize, ssplit',
+      'pos.model' => self.jar_path + 'taggers/english-left3words-distsim.tagger',
+      'ner.model.3class' => self.jar_path + 'classifiers/all.3class.distsim.crf.ser.gz',
+      'ner.model.7class' => self.jar_path + 'classifiers/muc.7class.distsim.crf.ser.gz',
+      'ner.model.MISCclass' => self.jar_path + 'classifiers/conll.4class.distsim.crf.ser.gz',
+      'parser.model' => self.jar_path + 'grammar/englishPCFG.ser.gz',
+      'dcoref.demonym' => self.jar_path + 'dcoref/demonyms.txt',
+      'dcoref.animate' => self.jar_path + 'dcoref/animate.unigrams.txt',
+      'dcoref.female' => self.jar_path + 'dcoref/female.unigrams.txt',
+      'dcoref.inanimate' => self.jar_path + 'dcoref/inanimate.unigrams.txt',
+      'dcoref.male' => self.jar_path + 'dcoref/male.unigrams.txt',
+      'dcoref.neutral' => self.jar_path + 'dcoref/neutral.unigrams.txt',
+      'dcoref.plural' => self.jar_path + 'dcoref/plural.unigrams.txt',
+      'dcoref.singular' => self.jar_path + 'dcoref/singular.unigrams.txt',
+      'dcoref.states' => self.jar_path + 'dcoref/state-abbreviations.txt',
+      'dcoref.countries' => self.jar_path + 'dcoref/unknown.txt',     # Fix - can somebody provide this file?
+      'dcoref.states.provinces' => self.jar_path + 'dcoref/unknown.txt',   # Fix - can somebody provide this file?
+      'dcoref.extra.gender' => self.jar_path + 'dcoref/namegender.combine.txt'
+    }
+  end
+  # Load a StanfordCoreNLP pipeline using the module-level jvm_flags and the
+  # given properties (hash of property => value), merged over the defaults.
+  def self.load(properties)
+    self.load_jars(jvm_flags, self.jar_path)
+    self.create_classes
+    properties = default_properties.merge(properties)
+    CoreNLP.new(get_properties(properties))
+  end
+  # Load the jars.
+  def self.load_jars(jvm_flags, jar_path)
+    JarLoader.jvm_flags = jvm_flags
+    JarLoader.jar_path = jar_path
+    JarLoader.load('joda-time.jar')
+    JarLoader.load('xom.jar')
+    JarLoader.load('stanford-corenlp.jar')
+    JarLoader.load('bridge.jar')
+  end
+  # Create the Ruby classes for core classes.
+  def self.create_classes
+    const_set(:CoreNLP, Rjb::import('edu.stanford.nlp.pipeline.StanfordCoreNLP'))
+    const_set(:Annotation, Rjb::import('edu.stanford.nlp.pipeline.Annotation'))
+    const_set(:Text, Annotation) # A more intuitive alias.
+    const_set(:Properties, Rjb::import('java.util.Properties'))
+    const_set(:AnnotationBridge, Rjb::import('AnnotationBridge'))
+  end
+  # Create a java.util.Properties object from a hash.
+  def self.get_properties(properties)
+    props = Properties.new
+    properties.each do |property, value|
+      props.set_property(property, value)
+    end
+    props
+  end
+  Rjb::Rjb_JavaProxy.class_eval do
+    # Get an annotation using the annotation bridge.
+    def get(annotation)
+      base_class = (annotation.to_s.split('_')[0] == 'coref') ?
+      'edu.stanford.nlp.dcoref.CorefCoreAnnotations$' :
+      'edu.stanford.nlp.ling.CoreAnnotations$'
+      anno_class = annotation.to_s.gsub(/^[a-z]|_[a-z]/) { |a| a.upcase }.gsub('_', '')
+      url = "#{base_class}#{anno_class}Annotation"
+      AnnotationBridge.getAnnotation(self, url)
+    end
+    # Shorthand for to_string defined by Java classes.
+    def to_s; to_string; end
+    # Provide Ruby-style iterators to wrap Java iterators.
+    def each
+      if !java_methods.include?('iterator()')
+        raise 'This object cannot be iterated.'
+      else
+        i = self.iterator
+        while i.has_next; yield i.next;end
+      end
+    end
+  end
+end
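
The `get` method added above turns a Ruby-style annotation name into the fully qualified name of a Java annotation class before handing it to the bridge. The string transformation can be tried on its own, without a JVM. The sketch below copies that logic into a standalone function; the name `annotation_class_name` is hypothetical and does not exist in the gem.

```ruby
# Standalone copy of the name-building logic in `get`, for illustration only.
# Names starting with "coref_" are looked up in CorefCoreAnnotations;
# everything else is looked up in CoreAnnotations.
def annotation_class_name(annotation)
  name = annotation.to_s
  base_class = if name.split('_')[0] == 'coref'
                 'edu.stanford.nlp.dcoref.CorefCoreAnnotations$'
               else
                 'edu.stanford.nlp.ling.CoreAnnotations$'
               end
  # snake_case -> CamelCase: upcase the first letter and each letter
  # following an underscore, then drop the underscores.
  camel = name.gsub(/^[a-z]|_[a-z]/) { |match| match.upcase }.gsub('_', '')
  "#{base_class}#{camel}Annotation"
end

puts annotation_class_name(:sentences)
puts annotation_class_name(:part_of_speech)
puts annotation_class_name(:coref_chain)
```

Because the class name is built purely from the symbol, a misspelled annotation name only fails once the bridge tries to resolve the class on the Java side.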

data/lib/stanford-core-nlp/jar-loader.rb ADDED

@@ -0,0 +1,61 @@
+module StanfordCoreNLP
+  class JarLoader
+    require 'rjb'
+    # Configuration options.
+    class << self
+      # An array of flags to pass to the JVM.
+      attr_accessor :jvm_flags
+      attr_accessor :jar_path
+      attr_accessor :log_file
+    end
+    # An array of string flags to supply to the JVM, e.g. ['-Xms512M', '-Xmx1024M']
+    self.jvm_flags = []
+    # The path in which to look for Jars.
+    self.jar_path = ''
+    # The name of the file to log to.
+    # If set before the JVM is loaded, rjb_initialize calls redirect_to_log.
+    self.log_file = nil
+    # Load Rjb and create Java VM.
+    def self.rjb_initialize
+      return if ::Rjb::loaded?
+      ::Rjb::load(nil, self.jvm_flags)
+      redirect_to_log if self.log_file
+    end
+    # Redirect the output of the JVM to self.log_file.
+    def self.redirect_to_log
+      const_set(:System, Rjb::import('java.lang.System'))
+      const_set(:PrintStream, Rjb::import('java.io.PrintStream'))
+      const_set(:File2, Rjb::import('java.io.File'))
+      ps = PrintStream.new(File2.new(self.log_file))
+      ps.write(::Time.now.strftime("[%m/%d/%Y at %I:%M%p]\n\n"))
+      System.setOut(ps)
+      System.setErr(ps)
+    end
+    # Load a jar.
+    def self.load(jar)
+      self.rjb_initialize
+      jar = self.jar_path + jar
+      if !::File.readable?(jar)
+        raise "Could not find JAR file (looking in #{jar})."
+      end
+      ::Rjb::add_jar(jar)
+    end
+    # Silence output and log to file.
+    def self.log(file = 'log.txt')
+      self.log_file = file
+    end
+    # Whether the output is logged or not.
+    def self.log?; !self.log_file.nil?; end
+  end
+end
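
`JarLoader.load` above builds the jar location by string concatenation (`jar_path + jar`), which only works when `jar_path` ends with a path separator, and then checks readability before adding the jar. A minimal sketch of the same check using `File.join`, which does not depend on a trailing slash, is shown below. `resolve_jar` is a hypothetical helper and not part of the gem; it only resolves and validates the path and does not touch Rjb.

```ruby
require 'tmpdir'

# Hypothetical helper: resolve a jar name against a directory and fail early
# if the file is missing or unreadable.
def resolve_jar(jar_path, jar)
  full_path = File.join(jar_path, jar)
  unless File.readable?(full_path)
    raise ArgumentError, "Could not find JAR file (looking in #{full_path})."
  end
  full_path
end

# Demonstrate both outcomes in a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'example.jar'), '')
  puts resolve_jar(dir, 'example.jar')
  begin
    resolve_jar(dir, 'missing.jar')
  rescue ArgumentError => e
    puts e.message
  end
end
```

Raising before the jar is handed to the JVM keeps the error message on the Ruby side, where the offending path is easy to report.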

metadata ADDED

@@ -0,0 +1,90 @@
+--- !ruby/object:Gem::Specification
+name: stanford-core-nlp
+version: !ruby/object:Gem::Version
+  version: '0.1'
+  prerelease:
+platform: ruby
+authors:
+- Louis Mullie
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-01-28 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rjb
+  requirement: &70234870930100 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: *70234870930100
+description: ! ' High-level Ruby bindings to the Stanford CoreNLP package, a set of natural
+  language processing tools for English, including tokenization, part-of-speech tagging,
+  lemmatization, named entity recognition, parsing, and coreference resolution. '
+email:
+- louis.mullie@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- lib/stanford-core-nlp/jar-loader.rb
+- lib/stanford-core-nlp.rb
+- bin/bridge.jar
+- bin/classifiers/all.3class.distsim.crf.ser.gz
+- bin/classifiers/all.3class.distsim.prop
+- bin/classifiers/conll.4class.distsim.crf.ser.gz
+- bin/classifiers/conll.4class.distsim.prop
+- bin/classifiers/muc.7class.distsim.crf.ser.gz
+- bin/classifiers/muc.7class.distsim.prop
+- bin/dcoref/animate.unigrams.txt
+- bin/dcoref/demonyms.txt
+- bin/dcoref/female.unigrams.txt
+- bin/dcoref/inanimate.unigrams.txt
+- bin/dcoref/male.unigrams.txt
+- bin/dcoref/namegender.combine.txt
+- bin/dcoref/neutral.unigrams.txt
+- bin/dcoref/plural.unigrams.txt
+- bin/dcoref/singular.unigrams.txt
+- bin/dcoref/state-abbreviations.txt
+- bin/dcoref/unknown.txt
+- bin/grammar/englishFactored.ser.gz
+- bin/grammar/englishPCFG.ser.gz
+- bin/joda-time.jar
+- bin/stanford-corenlp.jar
+- bin/taggers/english-bidirectional-distsim.tagger
+- bin/taggers/english-bidirectional-distsim.tagger.props
+- bin/taggers/english-left3words-distsim.tagger
+- bin/taggers/english-left3words-distsim.tagger.props
+- bin/taggers/README-Models.txt
+- bin/xom.jar
+- README
+- LICENSE
+homepage: https://github.com/louismullie/stanford-core-nlp
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.15
+signing_key:
+specification_version: 3
+summary: Ruby bindings to the Stanford CoreNLP tools.
+test_files: []