stanfordparser 1.2.0 → 2.0.0 (RubyGems)

Files changed (6)

data/README +77 -21
data/examples/stanford-sentence-parser.rb +3 -3
data/lib/java_object.rb +129 -0
data/lib/stanfordparser.rb +312 -132
data/test/test_stanfordparser.rb +122 -12
metadata +3 -2

data/README CHANGED

@@ -1,35 +1,40 @@
-= Stanford Natural Language Parser
+= Stanford Natural Language Parser Wrapper
 This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
-The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic.  This module provides a thin wrapper around the Java code to make it accessible from Ruby.
+The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic.  This module provides a thin wrapper around the Java code to make it accessible from Ruby along with pure Ruby objects that enable standoff parsing.
-= Installation and Configuration
-To run this module you must install the following additional software
-* The {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml]
-* The {Ruby Java Bridge}[http://rjb.rubyforge.org/] gem.
+= Installation and Configuration
-Note that the Stanford Parser is not a Ruby application and is therefore not a Ruby gem and must be manually installed.
+In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
 This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory.  This is the directory that contains the <tt>stanford-parser.jar</tt> file.  When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
-These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>.  This file is in the Ruby YAML[http://www.ruby-doc.org/core/classes/YAML.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
+These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>.  This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
 	root: /usr/local/stanford-parser/other/location
 	jvmargs: -Xmx100m -verbose
-=Usage
-Use the StanfordParser::LexicalizedParser class to parse sentences.
+=Tokenization and Parsing
-	irb(main):001:0> require 'stanfordparser'
+Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
+	>> require "stanfordparser"
 	=> true
-	irb(main):002:0> parser = StanfordParser::LexicalizedParser.new
+	>> preproc = StanfordParser::DocumentPreprocessor.new
+	=> <DocumentPreprocessor>
+	>> puts preproc.getSentencesFromString("This is a sentence.  So is this.")
+	This is a sentence .
+	So is this .
+Use the StanfordParser::LexicalizedParser class to parse sentences.
+	>> parser = StanfordParser::LexicalizedParser.new
 	Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
 	=> edu.stanford.nlp.parser.lexparser.LexicalizedParser
-	irb(main):003:0> puts parser.apply("This is a sentence.")
+	>> puts parser.apply("This is a sentence.")
 	(ROOT
 	  (S [24.917]
 	    (NP [6.139] (DT [2.300] This))
@@ -37,26 +42,77 @@ Use the StanfordParser::LexicalizedParser class to parse sentences.
 	      (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
 	    (. [0.002] .)))
-Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into words or sentences.
+For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
-	irb(main):004:0> preproc = StanfordParser::DocumentPreprocessor.new
-	irb(main):008:0> puts preproc.getSentencesFromString("This is a sentence.  So is this.")
-	This is a sentence .
-	So is this .
-For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
+=Standoff Tokenization and Parsing
+This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
+Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
+	>> preproc = StanfordParser::StandoffDocumentPreprocessor.new
+	=> <StandoffDocumentPreprocessor>
+	>> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
+	=> [This is a sentence., So is this.]
+The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
+ 	>> puts s
+	This [0,4]
+	is [5,7]
+	a [8,9]
+	sentence [10,18]
+	. [18,19]
+	So [21,23]
+	is [24,26]
+	this [27,31]
+	. [31,32]
+	>> "This is a sentence.  So is this."[27..31]
+	=> "this."
+This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
+Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
+	>> t = StanfordParser::StandoffParsedText.new("This is a sentence.  So is this.")
+	Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
+	=> <StanfordParser::StandoffParsedText, 2 sentences>
+	>> puts t.first
+	(ROOT
+	  (S
+	    (NP (DT This [0,4]))
+	    (VP (VBZ is [5,7])
+	      (NP (DT a [8,9]) (NN sentence [10,18])))
+	    (. . [18,19])))
+Standoff parse trees can reproduce the text from which they were generated verbatim.
+	>> t.first.to_original_string
+	=> "This is a sentence.  "
+They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
+	>> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
+	=> "[This] is [a sentence].  "
+The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
+See the documentation of the individual classes in this module for more details.
+Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects.  This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
 = History
 1.0.0:: Initial release
 1.1.0:: Make module initialization function private.  Add example code.
 1.2.0:: Read Java VM arguments from the configuration file.  Add Word class.
+2.0.0:: Add support for standoff parsing.  Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details.  Rjb::JavaObjectWrapper supports static members.  Minor changes to stanford-sentence-parser script.
 = Copyright
-Copyright 2007, William Patrick McNeill
+Copyright 2007-2008, William Patrick McNeill
 This program is distributed under the GNU General Public License.
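
The standoff offsets described in the README can be illustrated without the Java parser. What follows is a minimal pure-Ruby sketch, not the gem's implementation: `SketchToken`, `sketch_tokenize`, and `sketch_original_string` are hypothetical names, and the regular-expression tokenizer is an assumption standing in for the Java tokenizer. It shows that the end offset is exclusive, that tokens plus their trailing spacing reproduce the text verbatim, and that plain Ruby structs survive a Marshal round trip.

```ruby
# Hypothetical stand-in for StanfordParser::StandoffToken, reduced to the
# fields this sketch needs.
SketchToken = Struct.new(:current, :after, :begin_position, :end_position)

# Split text into word and punctuation tokens, recording character offsets
# and the whitespace that follows each token.  Illustrative only: the gem
# delegates tokenization to the Java parser, and this sketch ignores any
# whitespace that precedes the first token.
def sketch_tokenize(text)
  tokens = []
  text.scan(/(\w+|[^\w\s])(\s*)/) do |word, spacing|
    start = Regexp.last_match.begin(1)
    tokens << SketchToken.new(word, spacing, start, start + word.length)
  end
  tokens
end

# Rebuild the source text from the tokens alone.
def sketch_original_string(tokens)
  tokens.map { |t| t.current + t.after }.join
end

text = "This is a sentence.  So is this."
tokens = sketch_tokenize(text)

# The end position is exclusive, so a three-dot range selects the token.
this_token = tokens[-2]
puts text[this_token.begin_position...this_token.end_position]

# Tokens and spacing are enough to recover the text exactly.
puts sketch_original_string(tokens) == text

# Pure Ruby objects can be serialized with Marshal.
restored = Marshal.load(Marshal.dump(tokens))
puts restored == tokens
```

The two-dot range in the README example, `[27..31]`, includes index 31 and therefore also returns the trailing period.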

data/examples/stanford-sentence-parser.rb CHANGED

@@ -2,7 +2,7 @@
 #--
-# Copyright 2007 William Patrick McNeill
+# Copyright 2007-2008 William Patrick McNeill
 #
 # This file is part of the Stanford Parser Ruby Wrapper.
 #
@@ -34,7 +34,7 @@
 #    See the Java Stanford Parser documentation for details
 #
 # sentence::
-#    A sentence to parse.  This must be quoted.
+#    A sentence to parse.  This must appear after all the options and be quoted.
 require "stanfordparser"
@@ -42,5 +42,5 @@ require "stanfordparser"
 # The last argument is the sentence.  The rest of the command line is passed
 # along to the parser object.
 sentence = ARGV.pop
-parser = StanfordParser::LexicalizedParser.new("$(ROOT)/englishPCFG.ser.gz", ARGV)
+parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
 puts parser.apply(sentence)
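
The argument convention in this script (sentence last, parser options before it) can be sketched in isolation. The helper name `split_sentence_args` is hypothetical and the option strings are placeholders; the sketch only mirrors the `ARGV.pop` pattern and does not call the parser.

```ruby
# Hypothetical helper mirroring the script's ARGV handling: the final
# argument is the sentence and everything before it is passed on as options.
def split_sentence_args(argv)
  args = argv.dup      # leave the caller's array untouched
  sentence = args.pop  # nil when no arguments were given
  [sentence, args]
end

# Placeholder option strings; they stand in for whatever the parser accepts.
sentence, options = split_sentence_args(["-someOption", "value", "This is a sentence."])
puts sentence
puts options.inspect
```

Because the sentence is taken with `pop`, an unquoted sentence would lose all but its last word to the options list, which is why the usage text asks for quoting.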

data/lib/java_object.rb ADDED

@@ -0,0 +1,129 @@
+# Copyright 2007-2008 William Patrick McNeill
+#
+# This file is part of the Stanford Parser Ruby Wrapper.
+#
+# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
+# and/or modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation; either version 2 of the License,
+# or (at your option) any later version.
+#
+# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
+# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+# Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# the Stanford Parser Ruby Wrapper; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+# Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
+# add a generic Java object wrapper class.
+module Rjb
+  #--
+  # The documentation for this class appears next to its extension inside the
+  # StanfordParser module in stanfordparser.rb.  This should be changed if Rjb
+  # is ever moved into its own gem.  See the documentation in stanfordparser.rb
+  # for more details.
+  #++
+  class JavaObjectWrapper
+    include Enumerable
+    # The underlying Java object.
+    attr_reader :java_object
+    # Initialize with a Java object <em>obj</em>.  If <em>obj</em> is a
+    # String, treat it as a Java class name and instantiate it.  Otherwise,
+    # treat <em>obj</em> as an instance of a Java object.
+    def initialize(obj, *args)
+      @java_object = obj.class == String ?
+      Rjb::import(obj).send(:new, *args) : obj
+    end
+    # Enumerate all the items in the object using its iterator.  If the object
+    # has no iterator, this function yields nothing.
+    def each
+      if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
+        i = @java_object.iterator
+        while i.hasNext
+          yield wrap_java_object(i.next)
+        end
+      end
+    end # each
+    # Reflect unhandled method calls to the underlying Java object and wrap
+    # the return value in the appropriate Ruby object.
+    def method_missing(m, *args)
+      begin
+        wrap_java_object(@java_object.send(m, *args))
+      rescue RuntimeError => e
+        # The instance method failed.  See if this is a static method.
+        if not e.message.match(/^Fail: unknown method name/).nil?
+          getClass.send(m, *args)
+        end
+      end
+    end
+    # Convert a value returned by a call to the underlying Java object to the
+    # appropriate Ruby object.
+    #
+    # If the value is a JavaObjectWrapper, convert it using a protected
+    # function with the name wrap_ followed by the underlying object's
+    # classname with the Java path delimiters converted to underscores. For
+    # example, a <tt>java.util.ArrayList</tt> would be converted by a function
+    # called wrap_java_util_ArrayList.
+    #
+    # If the value lacks the appropriate converter function, wrap it in a
+    # generic JavaObjectWrapper.
+    #
+    # If the value is not a JavaObjectWrapper, return it unchanged.
+    #
+    # This function is called recursively for every element in an Array.
+    def wrap_java_object(object)
+      if object.kind_of?(Array)
+        object.collect {|item| wrap_java_object(item)}
+      elsif object.respond_to?(:_classname)
+        # Ruby-Java Bridge Java objects all have a _classname member which
+        # tells the name of their Java class.  Convert this to the
+        # corresponding wrapper function name.
+        wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
+        respond_to?(wrapper_name) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
+      else
+        object
+      end
+    end
+    # Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
+    def wrap_java_util_ArrayList(object)
+      array_list = []
+      object.size.times do
+        |i| array_list << wrap_java_object(object.get(i))
+      end
+      array_list
+    end
+    # Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
+    def wrap_java_util_HashSet(object)
+      set = Set.new
+      i = object.iterator
+      while i.hasNext
+        set << wrap_java_object(i.next)
+      end
+      set
+    end
+    # Show the classname of the underlying Java object.
+    def inspect
+      "<#{@java_object._classname}>"
+    end
+    # Use the underlying Java object's stringification.
+    def to_s
+      toString
+    end
+    protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
+  end # JavaObjectWrapper
+end # Rjb
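
The name-based converter lookup in `wrap_java_object` can be tried without Rjb or a JVM. The sketch below is an illustration under stated assumptions: `FakeJavaObject` and `SketchWrapper` are hypothetical stand-ins, and the only property borrowed from Rjb proxies is that they answer `_classname`. One caveat about Ruby itself rather than this gem: since Ruby 2.0, `respond_to?` returns false for protected methods unless its second argument is true, so the sketch passes `true`.

```ruby
# Hypothetical stand-in for an Rjb proxy object.  It only needs to report a
# Java class name; :items plays the role of the Java collection's contents.
FakeJavaObject = Struct.new(:_classname, :items)

class SketchWrapper
  attr_reader :java_object

  def initialize(obj)
    @java_object = obj
  end

  # Convert a returned value: recurse into arrays, dispatch proxy objects to
  # a converter named after their Java class, and pass everything else through.
  def wrap(object)
    if object.is_a?(Array)
      object.map { |item| wrap(item) }
    elsif object.respond_to?(:_classname)
      name = ("wrap_" + object._classname.gsub(".", "_")).to_sym
      # The second argument makes respond_to? see protected converters.
      respond_to?(name, true) ? send(name, object) : SketchWrapper.new(object)
    else
      object
    end
  end

  protected

  # Converter found by name for java.util.ArrayList objects.
  def wrap_java_util_ArrayList(object)
    object.items.map { |item| wrap(item) }
  end
end

wrapper = SketchWrapper.new(nil)
list = FakeJavaObject.new("java.util.ArrayList",
                          [1, FakeJavaObject.new("java.lang.Object", nil)])
result = wrapper.wrap(list)
puts result.first
puts result.last.class
```

Adding support for another Java class then amounts to defining one more `wrap_...` method, which is the extension point the 2.0.0 history entry refers to.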

data/lib/stanfordparser.rb CHANGED

@@ -1,4 +1,4 @@
-# Copyright 2007 William Patrick McNeill
+# Copyright 2007-2008 William Patrick McNeill
 #
 # This file is part of the Stanford Parser Ruby Wrapper.
 #
@@ -19,121 +19,27 @@
 require "pathname"
 require "rjb"
-require "set"
+require "singleton"
+begin
+  require "treebank"
+  gem "treebank", ">= 3.0.0"
+rescue LoadError
+  require "treebank"
+end
 require "yaml"
-# Extenions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
-# adds a generic Java object wrapper class.
-module Rjb
-  # A generic wrapper for a Java object loaded via the Ruby Java Bridge.  The
-  # wrapper class handles intialization and stringification, and passes other
-  # method calls down to the underlying Java object.  Objects returned by the
-  # underlying Java object are converted to the appropriate Ruby object.
-  #
-  # This object is enumerable, yielding items in the order defined by the Java
-  # object's iterator.
-  class JavaObjectWrapper
-    include Enumerable
-    # The underlying Java object.
-    attr_reader :java_object
-    # Initialize with a Java object <em>obj</em>.  If <em>obj</em> is a
-    # String, assume it is a Java class name and instantiate it.  Otherwise,
-    # treat <em>obj</em> as an instance of a Java object.
-    def initialize(obj, *args)
-      @java_object = obj.class == String ?
-      Rjb::import(obj).send(:new, *args) : obj
-    end
-    # Enumerate all the items in the object using its iterator.  If the object
-    # has no iterator, this function yields nothing.
-    def each
-      if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
-        i = @java_object.iterator
-        while i.hasNext
-          yield wrap_java_object(i.next)
-        end
-      end
-    end # each
-    # Reflect unhandled method calls to the underlying Java object.
-    def method_missing(m, *args)
-      wrap_java_object(@java_object.send(m, *args))
-    end
-    # Convert a value returned by a call to the underlying Java object to the
-    # appropriate Ruby object as follows:
-    # * RJB objects are placed inside a generic JavaObjectWrapper wrapper.
-    # * <tt>java.util.ArrayList</tt> objects are converted to Ruby Arrays.
-    # * <tt>java.util.HashSet</tt> objects are converted to Ruby Sets
-    # * Other objects are left unchanged.
-    #
-    # This function is applied recursively to items in collection objects such
-    # as set and arrays.
-    def wrap_java_object(object)
-      if object.kind_of?(Array)
-        object.collect {|item| wrap_java_object(item)}
-      # Ruby-Java Bridge Java objects all have a _classname member which tells
-      # the name of their Java class.
-      elsif object.respond_to?(:_classname)
-        case object._classname
-        when /java\.util\.ArrayList/
-          # Convert java.util.ArrayList objects to Ruby arrays.
-          array_list = []
-          object.size.times do
-            |i| array_list << wrap_java_object(object.get(i))
-          end
-          array_list
-        when /java\.util\.HashSet/
-          # Convert java.util.HashSet objects to Ruby sets.
-          set = Set.new
-          i = object.iterator
-          while i.hasNext
-            set << wrap_java_object(i.next)
-          end
-          set
-        else
-          # Passs other RJB objects off to a handler.
-          wrap_rjb_object(object)
-        end # case
-      else
-        # Return non-RJB objects unchanged.
-        object
-      end # if
-    end # wrap_java_object
-    # By default, all RJB classes other than <tt>java.util.ArrayList</tt> and
-    # <tt>java.util.HashSet</tt> go in a generic wrapper.  Derived classes may
-    # change this behavior.
-    def wrap_rjb_object(object)
-      JavaObjectWrapper.new(object)
-    end
-    # Show the classname of the underlying Java object.
-    def inspect
-      "<#{@java_object._classname}>"
-    end
-    # Use the underlying Java object's stringification.
-    def to_s
-      toString
-    end
-    protected :wrap_java_object, :wrap_rjb_object
-  end # JavaObjectWrapper
-end # Rjb
+require "java_object.rb"
 # Wrapper for the {Stanford Natural Language
 # Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
 module StanfordParser
-  VERSION = "1.2.0"
+  VERSION = "2.0.0"
+  # The default sentence segmenter and tokenizer.  This is an English-language
+  # tokenizer with support for Penn Treebank markup.
+  EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
   # Path to an English PCFG model that comes with the Stanford Parser.  The
   # location is relative to the parser root directory.  This is a valid value
   # for the <em>grammar</em> parameter of the LexicalizedParser constructor.
@@ -170,32 +76,53 @@ module StanfordParser
   # The root directory of the Stanford parser installation.
   ROOT = initialize_on_load
+  #--
+  # The documentation below is for the original Rjb::JavaObjectWrapper object.
+  # It is reproduced here because rdoc only takes the last document block
+  # defined.  If Rjb is moved into its own gem, this documentation should go
+  # with it, and the following should be written as documentation for this
+  # class:
+  #
   # Extension of the generic Ruby-Java Bridge wrapper object for the
   # StanfordParser module.
-  class JavaObjectWrapper < Rjb::JavaObjectWrapper
-    # Wrap a return value with a specialized wrapper class in the
-    # StanfordParser module in the appropriate class.
-    def wrap_rjb_object(object)
-      case object._classname
-      when /^edu\.stanford\.nlp\.trees\.
-        (Tree|LabeledScoredTreeLeaf|
-        LabeledScoredTreeNode|
-        SimpleTree|TreeGraphNode)$/x
-        # Tree objects go inside a Tree wrapper.
-        Tree.new(object)
-      else
-        super(object)
-      end # case
-    end # wrap_rjb_object
-  end # JavaObjectWrapper
+  #++
+  # A generic wrapper for a Java object loaded via the {Ruby-Java
+  # Bridge}[http://rjb.rubyforge.org/].  The wrapper class handles
+  # initialization and stringification, and passes other method calls down to
+  # the underlying Java object.  Objects returned by the underlying Java
+  # object are converted to the appropriate Ruby object.
+  #
+  # Other modules may extend the list of Java objects that are converted by
+  # adding their own converter functions.  See wrap_java_object for details.
+  #
+  # This object is enumerable, yielding items in the order defined by the
+  # underlying Java object's iterator.
+  class Rjb::JavaObjectWrapper
+    # FeatureLabel objects go inside a FeatureLabel wrapper.
+    def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
+      StanfordParser::FeatureLabel.new(object)
+    end
+    # Tree objects go inside a Tree wrapper.  Various tree types are aliased
+    # to this function.
+    def wrap_edu_stanford_nlp_trees_Tree(object)
+      Tree.new(object)
+    end
+    alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_SimpleTree            :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_TreeGraphNode         :wrap_edu_stanford_nlp_trees_Tree
+    protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
+  end # Rjb::JavaObjectWrapper
   # Lexicalized probabilistic parser.
   #
   # This is a wrapper for the
   # <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
-  class LexicalizedParser < JavaObjectWrapper
+  class LexicalizedParser < Rjb::JavaObjectWrapper
     # The grammar used by the parser
     attr_reader :grammar
@@ -220,10 +147,17 @@ module StanfordParser
   end # LexicalizedParser
+  # A singleton instance of the default Stanford Natural Language parser.  A
+  # singleton is used because the parser can take a few seconds to load.
+  class DefaultParser < StanfordParser::LexicalizedParser
+    include Singleton
+  end
   # This is a wrapper for
   # <tt>edu.stanford.nlp.trees.Tree</tt> objects.  It customizes
   # stringification.
-  class Tree < JavaObjectWrapper
+  class Tree < Rjb::JavaObjectWrapper
     def initialize(obj = "edu.stanford.nlp.trees.Tree")
       super(obj)
     end
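
The singleton subclass pattern used by `DefaultParser` can be sketched with the standard library alone. The class names and the grammar file name below are hypothetical, and the load counter stands in for the multi-second model load; nothing here touches the real parser.

```ruby
require "singleton"

# Hypothetical expensive-to-construct class.  The counter records how many
# times the costly initialization ran.
class SketchSlowParser
  class << self
    attr_accessor :load_count
  end
  self.load_count = 0

  attr_reader :grammar

  def initialize(grammar = "hypothetical-model.ser.gz")
    @grammar = grammar
    SketchSlowParser.load_count += 1
  end
end

# Subclass that shares a single lazily created instance.  Singleton calls
# the constructor with no arguments, so the parent's defaults apply.
class SketchDefaultParser < SketchSlowParser
  include Singleton
end

first = SketchDefaultParser.instance
second = SketchDefaultParser.instance
puts first.equal?(second)
puts SketchSlowParser.load_count
```

Keeping the singleton in a subclass leaves the parent class free to be instantiated normally with other grammars.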
@@ -245,16 +179,16 @@ module StanfordParser
   # This is a wrapper for
   # <tt>edu.stanford.nlp.ling.Word</tt> objects.  It customizes
   # stringification and adds an equivalence operator.
-  class Word < JavaObjectWrapper
+  class Word < Rjb::JavaObjectWrapper
     def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
       super(obj, *args)
     end
     # See the word values.
     def inspect
       to_s
     end
     # Equivalence is defined relative to the word value.
     def ==(other)
       word == other
@@ -262,11 +196,34 @@ module StanfordParser
   end # Word
+  # This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
+  # It customizes stringification.
+  class FeatureLabel < Rjb::JavaObjectWrapper
+    def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
+      super
+    end
+    # Stringify with just the token and its begin and end position.
+    def to_s
+      # BUGBUG The position values come back as java.lang.Integer though I
+      # would expect Rjb to convert them to Ruby integers.
+      begin_position = get(self.BEGIN_POSITION_KEY)
+      end_position = get(self.END_POSITION_KEY)
+      "#{current} [#{begin_position},#{end_position}]"
+    end
+    # More verbose stringification with all the fields and their values.
+    def inspect
+      toString
+    end
+  end
   # Tokenizes documents into words and sentences.
   #
   # This is a wrapper for the
   # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
-  class DocumentPreprocessor < JavaObjectWrapper
+  class DocumentPreprocessor < Rjb::JavaObjectWrapper
     def initialize(suppressEscaping = false)
       super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
     end
@@ -276,6 +233,229 @@ module StanfordParser
       s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
       _invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
     end
+    def inspect
+      "<#{self.class.to_s.split('::').last}>"
+    end
+    def to_s
+      inspect
+    end
   end # DocumentPreprocessor
+  StandoffToken = Struct.new(:current, :word, :before, :after,
+                             :begin_position, :end_position)
+  # A text token that contains raw and normalized token identity (e.g. "(" and
+  # "-LRB-"), an offset span, and the characters immediately preceding and
+  # following the token.  Given a list of these objects it is possible to
+  # recreate the text from which they came verbatim.
+  class StandoffToken
+    def to_s
+      "#{current} [#{begin_position},#{end_position}]"
+    end
+  end
+  # A preprocessor that segments text into sentences and tokens that contain
+  # character offset and token context information that can be used for
+  # standoff annotation.
+  class StandoffDocumentPreprocessor < DocumentPreprocessor
+    def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
+      # PTBTokenizer.factory is a static function, so use RJB to call it
+      # directly instead of going through a JavaObjectWrapper.  We do it this
+      # way because the Stanford parser Java code does not provide a
+      # constructor that allows you to set the second parameter,
+      # invertible, to true, and we need this to write character offset
+      # information into the tokens.
+      ptb_tokenizer_class = Rjb::import(tokenizer)
+      # See the documentation for
+      # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
+      # description of these parameters.
+      ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
+      super(ptb_tokenizer_factory)
+    end
+    # Returns a list of sentences in a string.  This wraps the returned
+    # sentences in a StandoffSentence object.
+    def getSentencesFromString(s)
+      super(s).map!{|s| StandoffSentence.new(s)}
+    end
+  end
+  # A sentence is an array of StandoffToken objects.
+  class StandoffSentence < Array
+    # Construct an array of StandoffToken objects from a Java list sentence
+    # object returned by the preprocessor.
+    def initialize(stanford_parser_sentence)
+      # Convert FeatureLabel wrappers to StandoffToken objects.
+      s = stanford_parser_sentence.to_a.collect do |fs|
+        current = fs.current
+        word = fs.word
+        before = fs.before
+        after = fs.after
+        # The to_s.to_i is necessary because the get function returns
+        # java.lang.Integer objects instead of Ruby integers.
+        begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
+        end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
+        StandoffToken.new(current, word, before, after,
+                          begin_position, end_position)
+      end
+      super(s)
+    end
+    # Return the original string verbatim.
+    def to_s
+      self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
+    end
+    # Return the original string verbatim.
+    def inspect
+      to_s
+    end
+  end
+  # Standoff syntactic annotation of natural language text which may contain
+  # multiple sentences.
+  #
+  # This is an Array of StandoffNode objects, one for each sentence in the
+  # text.
+  class StandoffParsedText < Array
+    # Parse the text and create the standoff annotation.
+    #
+    # The default parser is a singleton instance of the English language
+    # Stanford Natural Language parser.  There may be a delay of a few
+    # seconds for it to load the first time it is created.
+    def initialize(text, nodetype = StandoffNode,
+                   tokenizer = EN_PENN_TREEBANK_TOKENIZER,
+                   parser = DefaultParser.instance)
+      preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
+      # Segment the text into sentences.  Parse each sentence, writing
+      # standoff annotation information into the terminal nodes.
+      preprocessor.getSentencesFromString(text).map do |sentence|
+        parse = parser.apply(sentence.to_s)
+        push(nodetype.new(parse, sentence))
+      end
+    end
+    # Print class name and number of sentences.
+    def inspect
+      "<#{self.class.name}, #{length} sentences>"
+    end
+    # Print parses.
+    def to_s
+      flatten.join(" ")
+    end
+  end
+  # Standoff syntactic tree annotation of text.  Terminal nodes are labeled
+  # with the appropriate StandoffToken objects.  Standoff parses can reproduce
+  # the original string from which they were generated verbatim, optionally
+  # with brackets around the yields of specified non-terminal nodes.
+  class StandoffNode < Treebank::Node
+    # Create the standoff tree from a tree returned by the Stanford parser.
+    # For non-terminal nodes, the <em>tokens</em> argument will be a
+    # StandoffSentence containing the StandoffToken objects representing all
+    # the tokens beneath and after this node.  For terminal nodes, the
+    # <em>tokens</em> argument will be a StandoffToken.
+    def initialize(stanford_parser_node, tokens)
+      # Annotate this node with a non-terminal label or a StandoffToken as
+      # appropriate.
+      super(tokens.instance_of?(StandoffSentence) ?
+            stanford_parser_node.value : tokens)
+      # Enumerate the children depth-first.  Tokens are removed from the list
+      # left-to-right as terminal nodes are added to the tree.
+      stanford_parser_node.children.each do |child|
+        subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
+        attach_child!(subtree)
+      end
+    end
+    # Return the original text string dominated by this node.
+    def to_original_string
+      leaves.inject("") do |s, leaf|
+        s += leaf.label.current + leaf.label.after
+      end
+    end
+    # Print the original string with brackets around word spans dominated by
+    # the specified constituents.
+    #
+    # The constituents to bracket are specified by passing a list of node
+    # coordinates, which are arrays of integers of the form returned by the
+    # tree enumerators of Treebank::Node objects.
+    #
+    # _coords_:: the coordinates of the nodes around which to place brackets
+    # _open_:: the open bracket symbol
+    # _close_:: the close bracket symbol
+    def to_bracketed_string(coords, open = "[", close = "]")
+      # Get a list of all the leaf nodes and their coordinates.
+      items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
+      # Enumerate over all the matching constituents inserting open and close
+      # brackets around their yields in the items list.
+      coords.each do |matching|
+        # Insert using a simple state machine with three states: :start,
+        # :open, and :close.
+        state = :start
+        # Enumerate over the items list looking for nodes that are the
+        # children of the matching constituent.
+        items.each_with_index do |item, index|
+          # Skip inserted bracket characters.
+          next if item.is_a? String
+          # Handle terminal node items with the state machine.
+          node, terminal_coordinate = item
+          if state == :start
+            next if not in_yield?(matching, terminal_coordinate)
+            items.insert(index, open)
+            state = :open
+          else # state == :open
+            next if in_yield?(matching, terminal_coordinate)
+            items.insert(index, close)
+            state = :close
+            break
+          end
+        end # items.each_with_index
+        # Handle the case where a matching constituent is flush with the end
+        # of the sentence.
+        items << close if state == :open
+      end # each
+      # Replace terminal nodes with their string representations.  Insert
+      # spacing characters in the list.
+      items.each_with_index do |item, index|
+        next if item.is_a? String
+        text = item.first.label.current
+        spacing = item.first.label.after
+        # Replace the terminal node with its text.
+        items[index] = text
+        # Insert the spacing that comes after this text before the first
+        # non-close bracket character.
+        close_pos = find_index(items[index+1..-1]) {|item| not item == close}
+        items.insert(index + close_pos + 1, spacing)
+      end
+      items.join
+    end # to_bracketed_string
+    # Find the index of the first item in _list_ for which _block_ is true.
+    # Return 0 if no items are found.
+    def find_index(list, &block)
+      list.each_with_index do |item, index|
+        return index if block.call(item)
+      end
+      0
+    end
+    # Is the node at _terminal_ in the yield of the node at _node_?
+    def in_yield?(node, terminal)
+      # If node A's coordinates match the prefix of node B's coordinates, node
+      # B is in the yield of node A.
+      terminal.first(node.length) == node
+    end
+    private :in_yield?, :find_index
+  end # StandoffNode
 end # StanfordParser
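
The bracketing logic added to StandoffNode above relies on a coordinate prefix test (in_yield?) and a three-state scan over the leaf list. The following is a minimal standalone sketch of that idea, not the gem's code: the LEAVES constant, the bracket method, and the coordinate values are hypothetical, leaves are plain [text, coordinate] pairs instead of Treebank nodes, and the spacing handling is left out.

```ruby
# Hypothetical leaves: [text, coordinate] pairs. The coordinates are made up
# for illustration and are not output from the Stanford parser.
LEAVES = [
  ["He",   [0, 0, 0, 0]],
  ["(",    [0, 0, 1, 0, 0]],
  ["John", [0, 0, 1, 1, 0]],
  [")",    [0, 0, 1, 2, 0]],
  ["is",   [0, 1, 0, 0]],
  ["tall", [0, 1, 1, 0, 0]],
  [".",    [0, 2, 0]]
]

# A terminal is in the yield of a node when the node's coordinates are a
# prefix of the terminal's coordinates.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Return the leaf texts with bracket strings inserted around the yield of
# each constituent named in coords.
def bracket(leaves, coords, open = "[", close = "]")
  items = leaves.dup
  coords.each do |matching|
    state = :start
    items.each_with_index do |item, index|
      next if item.is_a? String            # skip brackets inserted earlier
      _text, coordinate = item
      if state == :start
        next unless in_yield?(matching, coordinate)
        items.insert(index, open)          # first leaf of the constituent
        state = :open
      else
        next if in_yield?(matching, coordinate)
        items.insert(index, close)         # first leaf past the constituent
        state = :close
        break
      end
    end
    # The constituent ran to the end of the sentence.
    items << close if state == :open
  end
  items.map { |item| item.is_a?(String) ? item : item.first }
end

puts bracket(LEAVES, [[0, 0], [0, 0, 1, 1], [0, 1, 1]], "<", ">").join(" ")
```

The sketch was only checked with the outermost constituent listed first, which is the order the package's own test uses.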

data/test/test_stanfordparser.rb CHANGED

@@ -2,7 +2,7 @@
 #--
-# Copyright 2007 William Patrick McNeill
+# Copyright 2007-2008 William Patrick McNeill
 #
 # This file is part of the Stanford Parser Ruby Wrapper.
 #
@@ -30,20 +30,13 @@ require "singleton"
 require "stanfordparser"
-# Make the Lexicalized Parser a singleton for the tests because it takes
-# several seconds to load.
-class StanfordParser::LexicalizedParser
-  include Singleton
-end
 class LexicalizedParserTestCase < Test::Unit::TestCase
   def test_root_path
     assert_equal StanfordParser::ROOT.class, Pathname
   end
   def setup
-    @parser = StanfordParser::LexicalizedParser.instance
+    @parser = StanfordParser::DefaultParser.instance
     @tree = @parser.apply("This is a sentence.")
   end
@@ -53,6 +46,8 @@ class LexicalizedParserTestCase < Test::Unit::TestCase
   end
   def test_localTrees
+    # The following call exercises the conversion from java.util.HashSet
+    # objects to Ruby sets.
     l = @tree.localTrees
     assert_equal l.size, 5
     assert_equal Set.new(l.collect {|t| "#{t.label}"}),
@@ -68,7 +63,7 @@ end # LexicalizedParserTestCase
 class TreeTestCase < Test::Unit::TestCase
   def setup
-    @parser = StanfordParser::LexicalizedParser.instance
+    @parser = StanfordParser::DefaultParser.instance
     @tree = @parser.apply("This is a sentence.")
   end
@@ -85,12 +80,30 @@ class TreeTestCase < Test::Unit::TestCase
 end # TreeTestCase
+class FeatureLabelTestCase < Test::Unit::TestCase
+  def test_feature_label
+    f = StanfordParser::FeatureLabel.new
+    assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
+    f.put(f.BEGIN_POSITION_KEY, 3)
+    assert_equal "END_POS", f.END_POSITION_KEY
+    f.put(f.END_POSITION_KEY, 7)
+    assert_equal "current", f.CURRENT_KEY
+    f.put(f.CURRENT_KEY, "word")
+    assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
+    assert_equal "word [3,7]", f.to_s
+  end
+end
 class DocumentPreprocessorTestCase < Test::Unit::TestCase
   def setup
     @preproc = StanfordParser::DocumentPreprocessor.new
+    @standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
   end
   def test_get_sentences_from_string
+    # The following call exercises the conversion from java.util.ArrayList
+    # objects to Ruby arrays.
     s = @preproc.getSentencesFromString("This is a sentence.  So is this.")
     assert_equal "#{s[0]}", "This is a sentence ."
     assert_equal "#{s[1]}", "So is this ."
@@ -100,15 +113,112 @@ class DocumentPreprocessorTestCase < Test::Unit::TestCase
     # StanfordParser::DocumentPreprocessor is not an enumerable object.
     assert_equal @preproc.map, []
   end
+  # Segment and tokenize text containing two sentences.
+  def test_standoff_document_preprocessor
+    sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
+    # Recognize two sentences.
+    assert_equal 2, sentences.length
+    assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
+    assert_equal "He (John) is tall.", sentences.first.to_s
+    assert_equal 7, sentences.first.length
+    assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
+    assert_equal "So is she.", sentences.last.to_s
+    assert_equal 4, sentences.last.length
+    assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
+    # Get the correct token information for the first sentence.
+    assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
+    assert_equal [0,2],        [sentences[0][0].begin_position(), sentences[0][0].end_position()]
+    assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
+    assert_equal [3,4],          [sentences[0][1].begin_position(), sentences[0][1].end_position()]
+    assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
+    assert_equal [4,8],            [sentences[0][2].begin_position(), sentences[0][2].end_position()]
+    assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
+    assert_equal [8,9],          [sentences[0][3].begin_position(), sentences[0][3].end_position()]
+    assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
+    assert_equal [10,12],      [sentences[0][4].begin_position(), sentences[0][4].end_position()]
+    assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
+    assert_equal [13,17],          [sentences[0][5].begin_position(), sentences[0][5].end_position()]
+    assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
+    assert_equal [17,18],    [sentences[0][6].begin_position(), sentences[0][6].end_position()]
+    # Get the correct token information for the second sentence.
+    assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
+    assert_equal [20,22],      [sentences[1][0].begin_position(), sentences[1][0].end_position()]
+    assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
+    assert_equal [23,25],      [sentences[1][1].begin_position(), sentences[1][1].end_position()]
+    assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
+    assert_equal [26,29],        [sentences[1][2].begin_position(), sentences[1][2].end_position()]
+    assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
+    assert_equal [29,30],    [sentences[1][3].begin_position(), sentences[1][3].end_position()]
+  end
+  def test_stringification
+    assert_equal "<DocumentPreprocessor>", @preproc.inspect
+    assert_equal "<DocumentPreprocessor>", @preproc.to_s
+    assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
+    assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
+  end
 end # DocumentPreprocessorTestCase
+class StandoffParsedTextTestCase < Test::Unit::TestCase
+  def setup
+    @text = "He (John) is tall.  So is she."
+  end
+  def test_parse_text_default_nodetype
+    parsed_text = StanfordParser::StandoffParsedText.new(@text)
+    verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
+  end
+  # Verify correct parsing with variable node types for text containing two sentences.
+  def verify_parsed_text(parsed_text, nodetype)
+    # Verify that there are two sentences.
+    assert_equal 2, parsed_text.length
+    assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
+    # Verify the tokens in the leaf node of the first sentence.
+    leaves = parsed_text[0].leaves.collect {|node| node.label}
+    assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
+    assert_equal [0,2],        [leaves[0].begin_position(), leaves[0].end_position()]
+    assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
+    assert_equal [3,4],          [leaves[1].begin_position(), leaves[1].end_position()]
+    assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
+    assert_equal [4,8],            [leaves[2].begin_position(), leaves[2].end_position()]
+    assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
+    assert_equal [8,9],          [leaves[3].begin_position(), leaves[3].end_position()]
+    assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
+    assert_equal [10,12],      [leaves[4].begin_position(), leaves[4].end_position()]
+    assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
+    assert_equal [13,17],          [leaves[5].begin_position(), leaves[5].end_position()]
+    assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
+    assert_equal [17,18],    [leaves[6].begin_position(), leaves[6].end_position()]
+    # Verify the tokens in the leaf node of the second sentence.
+    leaves = parsed_text[1].leaves.collect {|node| node.label}
+    assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
+    assert_equal [20,22],      [leaves[0].begin_position(), leaves[0].end_position()]
+    assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
+    assert_equal [23,25],      [leaves[1].begin_position(), leaves[1].end_position()]
+    assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
+    assert_equal [26,29],        [leaves[2].begin_position(), leaves[2].end_position()]
+    assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
+    assert_equal [29,30],    [leaves[3].begin_position(), leaves[3].end_position()]
+    # Verify that the original string is recoverable.
+    assert_equal "He (John) is tall.  ", parsed_text[0].to_original_string
+    assert_equal "So is she."          , parsed_text[1].to_original_string
+    # Draw < and > brackets around 3 constituents.
+    b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
+    assert_equal "<He (<John>)> is <tall>.  ", b
+  end
+end
 class MiscPreprocessorTestCase < Test::Unit::TestCase
   def test_model_location
     assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
   end
   def test_word
     assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") ==  "dog"
   end
-end # MiscPreprocessorTestCase
+end # MiscPreprocessorTestCase
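
The standoff tests above assert character offsets for each token. As a rough illustration of what those offsets mean, the sketch below slices the test string by the same [begin, end) pairs and rebuilds the text from the tokens plus the gaps between them. TEXT, SPANS, token_texts, and token_gaps are hypothetical names, and treating a token's trailing spacing as the gap up to the next token is an assumption, not something the diff states about the gem's internals.

```ruby
TEXT = "He (John) is tall.  So is she."

# [begin_position, end_position] pairs copied from the assertions in
# test_standoff_document_preprocessor.
SPANS = [[0, 2], [3, 4], [4, 8], [8, 9], [10, 12], [13, 17], [17, 18],
         [20, 22], [23, 25], [26, 29], [29, 30]]

# The surface text of each token is the half-open slice of the source text.
def token_texts(text, spans)
  spans.map { |b, e| text[b...e] }
end

# Assumed model of trailing spacing: everything between the end of a token
# and the start of the next one (or the end of the text).
def token_gaps(text, spans)
  spans.each_with_index.map do |(_b, e), i|
    nxt = spans[i + 1]
    text[e...(nxt ? nxt.first : text.length)]
  end
end

tokens = token_texts(TEXT, SPANS)
gaps   = token_gaps(TEXT, SPANS)
puts tokens.inspect
puts tokens.zip(gaps).flatten.join == TEXT
```

Under that assumption, the first seven tokens and their gaps join to "He (John) is tall.  ", which is the string the test expects from to_original_string for the first sentence.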

metadata CHANGED

@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
 specification_version: 1
 name: stanfordparser
 version: !ruby/object:Gem::Version
-  version: 1.2.0
-date: 2007-12-18 00:00:00 -08:00
+  version: 2.0.0
+date: 2008-06-13 00:00:00 -07:00
 summary: Ruby wrapper for the Stanford Natural Language Parser
 require_paths:
 - lib
@@ -30,6 +30,7 @@ authors:
 - W.P. McNeill
 files:
 - test/test_stanfordparser.rb
+- lib/java_object.rb
 - lib/stanfordparser.rb
 - examples/stanford-sentence-parser.rb
 - README