RubyGems - stanfordparser-infochimps - Versions diffs - 2.2.1.s - Mend

stanfordparser-infochimps 2.2.1.s

Files changed (14)

data/.document +5 -0
data/Gemfile +6 -0
data/LICENSE +18 -0
data/README.orig.rdoc +123 -0
data/README.rdoc +37 -0
data/Rakefile +60 -0
data/TESTS_STATUS.rdoc +2 -0
data/VERSION.yml +5 -0
data/examples/stanford-sentence-parser.rb +46 -0
data/lib/stanfordparser.rb +453 -0
data/lib/stanfordparser/java_object.rb +129 -0
data/stanfordparser.gemspec +69 -0
data/test/test_stanfordparser.rb +224 -0
metadata +207 -0

data/.document ADDED

@@ -0,0 +1,5 @@
+README.rdoc
+lib/**/*.rb
+bin/*
+features/**/*.feature
+LICENSE

data/Gemfile ADDED

@@ -0,0 +1,6 @@
+source :gemcutter
+gem "rjb"
+gem "rake"
+gem "jeweler"
+gem "treebank", ">= 3.0.0"
+gem "rspec", ">= 1.2.9"

data/LICENSE ADDED

@@ -0,0 +1,18 @@
+Copyright 2007-2008 William Patrick McNeill
+Copyright (c) 2010 John Wilkinson
+This file is part of the Stanford Parser Ruby Wrapper.
+The Stanford Parser Ruby Wrapper is free software; you can redistribute it
+and/or modify it under the terms of the GNU General Public License as
+published by the Free Software Foundation; either version 2 of the License,
+or (at your option) any later version.
+The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
+useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+Public License for more details.
+You should have received a copy of the GNU General Public License along with
+editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
+St, Fifth Floor, Boston, MA 02110-1301 USA

data/README.orig.rdoc ADDED

@@ -0,0 +1,123 @@
+= Stanford Natural Language Parser Wrapper
+This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
+The Stanford Natural Language Parser is a Java implementation of a probabilistic context-free grammar (PCFG) and dependency parser for English, German, Chinese, and Arabic.  This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.
+= Installation and Configuration
+In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
+This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory on UNIX platforms and in the <tt>C:\stanford-parser\current</tt> directory on Windows platforms.  This is the directory that contains the <tt>stanford-parser.jar</tt> file.  When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
+These defaults can be overridden by creating the configuration file <tt>/etc/ruby-stanford-parser.yaml</tt> on UNIX platforms and <tt>C:\stanford-parser\ruby-stanford-parser.yaml</tt> on Windows platforms.  This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
+	root: /usr/local/stanford-parser/other/location
+	jvmargs: -Xmx100m -verbose
+= Tokenization and Parsing
+Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
+	>> require "stanfordparser"
+	=> true
+	>> preproc = StanfordParser::DocumentPreprocessor.new
+	=> <DocumentPreprocessor>
+	>> puts preproc.getSentencesFromString("This is a sentence.  So is this.")
+	This is a sentence .
+	So is this .
+Use the StanfordParser::LexicalizedParser class to parse sentences.
+	>> parser = StanfordParser::LexicalizedParser.new
+	Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
+	=> edu.stanford.nlp.parser.lexparser.LexicalizedParser
+	>> puts parser.apply("This is a sentence.")
+	(ROOT
+	  (S [24.917]
+	    (NP [6.139] (DT [2.300] This))
+	    (VP [17.636] (VBZ [0.144] is)
+	      (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
+	    (. [0.002] .)))
+For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
+= Standoff Tokenization and Parsing
+This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
+Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
+	>> preproc = StanfordParser::StandoffDocumentPreprocessor.new
+	=> <StandoffDocumentPreprocessor>
+	>> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
+	=> [This is a sentence., So is this.]
+The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
+	>> puts s
+	This [0,4]
+	is [5,7]
+	a [8,9]
+	sentence [10,18]
+	. [18,19]
+	So [21,23]
+	is [24,26]
+	this [27,31]
+	. [31,32]
+	>> "This is a sentence.  So is this."[27...31]
+	=> "this"
+This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
+Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
+	>> t = StanfordParser::StandoffParsedText.new("This is a sentence.  So is this.")
+	Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
+	=> <StanfordParser::StandoffParsedText, 2 sentences>
+	>> puts t.first
+	(ROOT
+	  (S
+	    (NP (DT This [0,4]))
+	    (VP (VBZ is [5,7])
+	      (NP (DT a [8,9]) (NN sentence [10,18])))
+	    (. . [18,19])))
+Standoff parse trees can reproduce the text from which they were generated verbatim.
+	>> t.first.to_original_string
+	=> "This is a sentence.  "
+They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
+	>> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
+	=> "[This] is [a sentence].  "
+The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
+See the documentation of the individual classes in this module for more details.
+Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects.  This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
+= History
+1.0.0:: Initial release
+1.1.0:: Make module initialization function private.  Add example code.
+1.2.0:: Read Java VM arguments from the configuration file.  Add Word class.
+2.0.0:: Add support for standoff parsing.  Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details.  Rjb::JavaObjectWrapper supports static members.  Minor changes to stanford-sentence-parser script.
+2.1.0:: Different default paths for Windows machines; Minor changes to StandoffToken definition
+2.2.0:: Add parent information to StandoffNode
+= Copyright
+Copyright 2007-2008, William Patrick McNeill
+This program is distributed under the GNU General Public License.
+= Author
+W.P. McNeill mailto:billmcn@gmail.com
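
The standoff offsets described in README.orig.rdoc can be explored without a JVM. The following is a minimal sketch, not the gem's tokenizer: `naive_standoff_tokens` is a hypothetical helper that uses a simple regular expression in place of the Stanford PTBTokenizer, and the struct only mirrors the field names of StanfordParser::StandoffToken. It illustrates that the offsets are end-exclusive and that each token's text plus its trailing spacing reproduces the input verbatim.

```ruby
# Mirrors the fields of StanfordParser::StandoffToken (illustration only).
StandoffToken = Struct.new(:current, :word, :before, :after,
                           :begin_position, :end_position)

# Hypothetical stand-in for the Java tokenizer: words and single punctuation
# marks become tokens, and trailing whitespace is recorded in +after+.
def naive_standoff_tokens(text)
  tokens = []
  text.scan(/(\w+|[^\w\s])(\s*)/) do
    match = Regexp.last_match
    before = tokens.empty? ? "" : tokens.last.after
    tokens << StandoffToken.new(match[1], match[1], before, match[2],
                                match.begin(1), match.end(1))
  end
  tokens
end

text = "This is a sentence.  So is this."
tokens = naive_standoff_tokens(text)

# Offsets are end-exclusive, so slice with a three-dot range.
word = tokens[-2]
puts text[word.begin_position...word.end_position]  # prints "this"

# Token text plus trailing spacing rebuilds the original string.
rebuilt = tokens.inject("") { |s, t| s + t.current + t.after }
puts rebuilt == text  # prints "true"
```

Because the tokens are plain Ruby structs, they can also be serialized with Marshal, which is the motivation the README gives for the pure Ruby standoff classes.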

data/README.rdoc ADDED

@@ -0,0 +1,37 @@
+= stanfordparser
+This is an upload/extension of Bill McNeill's stanfordparser rubyforge gem. Check it out at its homepage (which seems to be partially in French)
+http://rubyforge.org/projects/stanfordparser/
+or its rdocs
+http://stanfordparser.rubyforge.org/
+I've been having issues trying to use this gem so I decided to upload it to github and try to organize it to be a little more gem-friendly, especially using jeweler.
+AFAIK there aren't other copies of this on github, please correct me if I'm mistaken. The only similar one I can see is http://github.com/tiendung/ruby-nlp which has much less code and I can only assume to be something else.
+It seems like using version 1.6.1 of the Java Stanford Parser package is your best bet for compatibility.
+See README.orig.rdoc for Bill's readme, which includes dependencies, installation, and usage.
+== Branches
+* master - Jeweler and Bundler integrated along with slight reorganization of files to be more gem-standard. This is the branch you should use if you want to source the gem straight from github. I will leave this branch alone for the most part unless I find/come up with stable and useful additions. All changes will be backwards compatible.
+* stock - Almost untouched from Bill's version, except for the README. Use this branch if that's what you're looking for.
+* fixing_tests - The tests are currently broken, and this branch is trying to address that. Once the tests are fixed it will be merged back into master. Help appreciated! I'll maintain a TESTS_STATUS.rdoc to keep track of progress.
+* experimental - I'll be putting in some code as examples and testing out some ideas. Do not use this branch as a gem. You are very encouraged, however, to fork it and add some code/make my code better. I'll try to integrate all the pull requests I get, if not in that branch into another.
+== Note on Patches/Pull Requests
+* Fork the project.
+* Make your feature addition or bug fix.
+* Add tests for it. I would prefer RSpec, but Test::Unit is acceptable as well since there are some of those from the original author.
+* Commit.
+* Send me a pull request. Bonus points for topic branches.
+== Copyright
+Copyright (c) 2010 John Wilkinson. See LICENSE for details.
+Copyright 2007-2008, William Patrick McNeill. See README.orig for details.

data/Rakefile ADDED

@@ -0,0 +1,60 @@
+require 'rubygems'
+require 'rake'
+$LOAD_PATH.unshift('lib')
+begin
+  require 'jeweler'
+  Jeweler::Tasks.new do |gem|
+    gem.name = "stanfordparser-infochimps"
+    gem.summary = "GitHub upload/extension of Bill McNeill's stanfordparser rubygem"
+    gem.description = "Ruby wrapper of the Stanford Parser, an NLP parser built in Java."
+    gem.email = "jcwilk@gmail.com"
+    gem.homepage = "http://github.com/jcwilk/stanfordparser"
+    gem.authors = ["John Wilkinson","Bill McNeill"]
+    gem.add_dependency "rjb", ">= 1.2.5"
+    gem.add_dependency "treebank", ">= 3.0.0"
+    gem.add_development_dependency "rspec", ">= 1.2.9"
+  end
+  Jeweler::GemcutterTasks.new
+rescue LoadError
+  puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+end
+require 'rake/testtask'
+Rake::TestTask.new(:test) do |test|
+  test.test_files = FileList.new('test/**/test_*.rb') do |list|
+    list.exclude 'test/test_helper.rb'
+  end
+  test.libs << 'test'
+  test.verbose = true
+end
+# require 'spec/rake/spectask'
+# Spec::Rake::SpecTask.new(:spec) do |spec|
+#   spec.libs << 'lib' << 'spec'
+#   spec.spec_files = FileList['spec/**/*_spec.rb']
+# end
+#
+# Spec::Rake::SpecTask.new(:rcov) do |spec|
+#   spec.libs << 'lib' << 'spec'
+#   spec.pattern = 'spec/**/*_spec.rb'
+#   spec.rcov = true
+# end
+#
+# task :test => :check_dependencies
+#
+# task :spec => :check_dependencies
+#
+# task :default => :test
+require 'rake/rdoctask'
+Rake::RDocTask.new do |rdoc|
+  version = File.exist?('VERSION') ? File.read('VERSION') : ""
+  rdoc.rdoc_dir = 'rdoc'
+  rdoc.title = "stanfordparser #{version}"
+  rdoc.rdoc_files.include('README*')
+  rdoc.rdoc_files.include('lib/**/*.rb')
+end

data/TESTS_STATUS.rdoc ADDED

@@ -0,0 +1,2 @@
+= Status of Tests
+* Please see the fixing_tests branch for efforts towards this goal.

data/VERSION.yml ADDED

@@ -0,0 +1,5 @@
+---
+:major: 2
+:minor: 2
+:patch: 1
+:build: s
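
For orientation, the four fields in VERSION.yml correspond to the version in the page title, 2.2.1.s. The sketch below shows one way to join them; it assumes the fields are simply concatenated with dots, and `version_string` is a hypothetical helper rather than part of the gem or of Jeweler.

```ruby
require "yaml"

# Hypothetical helper: join the VERSION.yml fields into a dotted version string.
# The keys are YAML symbols, so Symbol must be permitted when using safe_load.
def version_string(yaml_text)
  fields = YAML.safe_load(yaml_text, permitted_classes: [Symbol])
  fields.values_at(:major, :minor, :patch, :build).compact.join(".")
end

version_yaml = <<~EOS
  ---
  :major: 2
  :minor: 2
  :patch: 1
  :build: s
EOS

puts version_string(version_yaml)  # prints "2.2.1.s"
```

Fields that are absent are skipped by `compact`, so a file without a `:build:` entry yields a plain three-part version.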

data/examples/stanford-sentence-parser.rb ADDED

@@ -0,0 +1,46 @@
+#!/usr/bin/env ruby
+#--
+# Copyright 2007-2008 William Patrick McNeill
+#
+# This file is part of the Stanford Parser Ruby Wrapper.
+#
+# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
+# and/or modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation; either version 2 of the License,
+# or (at your option) any later version.
+#
+# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
+# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
+# Public License for more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
+# St, Fifth Floor, Boston, MA 02110-1301 USA
+#
+#++
+# == Synopsis
+#
+# Parse a sentence passed in on the command line.
+#
+# == Usage
+#
+# stanford-sentence-parser.rb [options] sentence
+#
+# options::
+#    See the Java Stanford Parser documentation for details
+#
+# sentence::
+#    A sentence to parse.  This must appear after all the options and be quoted.
+require 'rubygems'
+require "stanfordparser"
+# The last argument is the sentence.  The rest of the command line is passed
+# along to the parser object.
+sentence = ARGV.pop
+parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
+puts parser.apply(sentence)
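
The script above relies on ARGV.pop to separate the sentence from the parser options, which mutates ARGV in place. A small sketch of the same split that leaves its input untouched follows; `split_arguments` is a hypothetical helper, and the option strings are placeholders used only for illustration.

```ruby
# Hypothetical helper: the last argument is the sentence, everything before it
# is passed along as parser options.  The input array is not modified.
def split_arguments(argv)
  options = argv.dup
  sentence = options.pop
  [options, sentence]
end

# Placeholder option strings; see the Java Stanford Parser documentation for
# the options it actually accepts.
argv = ["-maxLength", "40", "This is a sentence."]
options, sentence = split_arguments(argv)

puts options.inspect  # prints ["-maxLength", "40"]
puts sentence         # prints "This is a sentence."
```

With an empty argument list the sentence is nil, so a caller would probably want to check for that before constructing the parser.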

data/lib/stanfordparser.rb ADDED

@@ -0,0 +1,453 @@
+require 'rubygems'
+require "pathname"
+require "rjb"
+require "singleton"
+begin
+  gem "treebank", ">= 3.0.0"
+  require "treebank"
+rescue LoadError
+  require "treebank"
+end
+require "yaml"
+# Wrapper for the {Stanford Natural Language
+# Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
+module StanfordParser
+  require "stanfordparser/java_object"
+  VERSION = "2.2.1"
+  # The default sentence segmenter and tokenizer.  This is an English-language
+  # tokenizer with support for Penn Treebank markup.
+  EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
+  # Path to an English PCFG model that comes with the Stanford Parser.  The
+  # location is relative to the parser root directory.  This is a valid value
+  # for the <em>grammar</em> parameter of the LexicalizedParser constructor.
+  ENGLISH_PCFG_MODEL = "$(ROOT)/englishPCFG.ser.gz"
+  # This function is executed once when the module is loaded.  It initializes
+  # the Java virtual machine in which the Stanford parser will run.  By
+  # default, it adds the parser installation root to the Java classpath and
+  # launches the VM with the arguments <tt>-server -Xmx150m</tt>.  Different
+  # values may be specified with the <tt>ruby-stanford-parser.yaml</tt>
+  # configuration file.
+  #
+  # This function determines which operating system we are running on and sets
+  # default pathnames accordingly:
+  #
+  # UNIX:: /usr/local/stanford-parser/current, /etc/ruby-stanford-parser.yaml
+  # Windows:: C:\stanford-parser\current,
+  #           C:\stanford-parser\ruby-stanford-parser.yaml
+  #
+  # This function returns the path of the parser installation root.
+  def StanfordParser.initialize_on_load
+    if RUBY_PLATFORM =~ /(win|w)32$/
+      root = Pathname.new("C:\\stanford-parser\\current")
+      config = Pathname.new("C:\\stanford-parser\\ruby-stanford-parser.yaml")
+    else
+      root = Pathname.new("/usr/local/stanford-parser/current")
+      config = Pathname.new("/etc/ruby-stanford-parser.yaml")
+    end
+    jvmargs = ["-server", "-Xmx150m"]
+    if config.file?
+      configuration = open(config) {|f| YAML.load(f)}
+      if configuration.key?("root") and not configuration["root"].nil?
+        root = Pathname.new(configuration["root"])
+      end
+      if configuration.key?("jvmargs") and not configuration["jvmargs"].nil?
+        jvmargs = configuration["jvmargs"].split
+      end
+    end
+    Rjb::load(classpath = (root + "stanford-parser.jar").to_s, jvmargs)
+    root
+  end
+  private_class_method :initialize_on_load
+  # The root directory of the Stanford parser installation.
+  ROOT = initialize_on_load
+  #--
+  # The documentation below is for the original Rjb::JavaObjectWrapper object.
+  # It is reproduced here because rdoc only takes the last document block
+  # defined.  If Rjb is moved into its own gem, this documentation should go
+  # with it, and the following should be written as documentation for this
+  # class:
+  #
+  # Extension of the generic Ruby-Java Bridge wrapper object for the
+  # StanfordParser module.
+  #++
+  # A generic wrapper for a Java object loaded via the {Ruby-Java
+  # Bridge}[http://rjb.rubyforge.org/].  The wrapper class handles
+  # initialization and stringification, and passes other method calls down to
+  # the underlying Java object.  Objects returned by the underlying Java
+  # object are converted to the appropriate Ruby object.
+  #
+  # Other modules may extend the list of Java objects that are converted by
+  # adding their own converter functions.  See wrap_java_object for details.
+  #
+  # This object is enumerable, yielding items in the order defined by the
+  # underlying Java object's iterator.
+  class Rjb::JavaObjectWrapper
+    # FeatureLabel objects go inside a FeatureLabel wrapper.
+    def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
+      StanfordParser::FeatureLabel.new(object)
+    end
+    # Tree objects go inside a Tree wrapper.  Various tree types are aliased
+    # to this function.
+    def wrap_edu_stanford_nlp_trees_Tree(object)
+      Tree.new(object)
+    end
+    alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_SimpleTree            :wrap_edu_stanford_nlp_trees_Tree
+    alias :wrap_edu_stanford_nlp_trees_TreeGraphNode         :wrap_edu_stanford_nlp_trees_Tree
+    protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
+  end # Rjb::JavaObjectWrapper
+  # Lexicalized probabilistic parser.
+  #
+  # This is a wrapper for the
+  # <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
+  class LexicalizedParser < Rjb::JavaObjectWrapper
+    # The grammar used by the parser
+    attr_reader :grammar
+    # Create the parser given a grammar and options.  The <em>grammar</em>
+    # argument is a path to a grammar file.  This path may contain the string
+    # <tt>$(ROOT)</tt>, which will be replaced with the root directory of the
+    # Stanford Parser. By default, an English PCFG grammar is loaded.
+    #
+    # The <em>options</em> argument is a list of string arguments as they
+    # would appear on a command line.  See the documentation of
+    # <tt>edu.stanford.nlp.parser.lexparser.Options.setOptions</tt> for more
+    # details.
+    def initialize(grammar = ENGLISH_PCFG_MODEL, options = [])
+      @grammar = Pathname.new(grammar.gsub(/\$\(ROOT\)/, ROOT.to_s))
+      super("edu.stanford.nlp.parser.lexparser.LexicalizedParser", @grammar.to_s)
+      @java_object.setOptionFlags(options)
+    end
+    def to_s
+      "LexicalizedParser(#{grammar.basename})"
+    end
+  end # LexicalizedParser
+  # A singleton instance of the default Stanford Natural Language parser.  A
+  # singleton is used because the parser can take a few seconds to load.
+  class DefaultParser < StanfordParser::LexicalizedParser
+    include Singleton
+  end
+  # This is a wrapper for
+  # <tt>edu.stanford.nlp.trees.Tree</tt> objects.  It customizes
+  # stringification.
+  class Tree < Rjb::JavaObjectWrapper
+    def initialize(obj = "edu.stanford.nlp.trees.Tree")
+      super(obj)
+    end
+    # Return the label along with the score if there is one.
+    def inspect
+      s = "#{label}" + (score.nan? ? "" : " [#{sprintf '%.2f', score}]")
+      "(#{s})"
+    end
+    # The Penn treebank representation.  This prints with indenting instead of
+    # putting everything on one line.
+    def to_s
+      "#{pennString}"
+    end
+  end # Tree
+  # This is a wrapper for
+  # <tt>edu.stanford.nlp.ling.Word</tt> objects.  It customizes
+  # stringification and adds an equivalence operator.
+  class Word < Rjb::JavaObjectWrapper
+    def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
+      super(obj, *args)
+    end
+    # See the word values.
+    def inspect
+      to_s
+    end
+    # Equivalence is defined relative to the word value.
+    def ==(other)
+      word == other
+    end
+  end # Word
+  # This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
+  # It customizes stringification.
+  class FeatureLabel < Rjb::JavaObjectWrapper
+    def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
+      super
+    end
+    # Stringify with just the token and its begin and end position.
+    def to_s
+      # BUGBUG The position values come back as java.lang.Integer though I
+      # would expect Rjb to convert them to Ruby integers.
+      begin_position = get(self.BEGIN_POSITION_KEY)
+      end_position = get(self.END_POSITION_KEY)
+      "#{current} [#{begin_position},#{end_position}]"
+    end
+    # More verbose stringification with all the fields and their values.
+    def inspect
+      toString
+    end
+  end
+  # Tokenizes documents into words and sentences.
+  #
+  # This is a wrapper for the
+  # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
+  class DocumentPreprocessor < Rjb::JavaObjectWrapper
+    def initialize(suppressEscaping = false)
+      super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
+    end
+    # Returns a list of sentences in a string.
+    def getSentencesFromString(s)
+      s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
+      _invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
+    end
+    def inspect
+      "<#{self.class.to_s.split('::').last}>"
+    end
+    def to_s
+      inspect
+    end
+  end # DocumentPreprocessor
+  # A text token that contains raw and normalized token identity (e.g. "(" and
+  # "-LRB-"), an offset span, and the characters immediately preceding and
+  # following the token.  Given a list of these objects it is possible to
+  # recreate the text from which they came verbatim.
+  class StandoffToken < Struct.new(:current, :word, :before, :after,
+                                   :begin_position, :end_position)
+    def to_s
+      "#{current} [#{begin_position},#{end_position}]"
+    end
+  end
+  # A preprocessor that segments text into sentences and tokens that contain
+  # character offset and token context information that can be used for
+  # standoff annotation.
+  class StandoffDocumentPreprocessor < DocumentPreprocessor
+    def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
+      # PTBTokenizer.factory is a static function, so use RJB to call it
+      # directly instead of going through a JavaObjectWrapper.  We do it this
+      # way because the Stanford parser Java code does not provide a
+      # constructor that allows you to set the second parameter,
+      # invertible, to true, and we need this to write character offset
+      # information into the tokens.
+      ptb_tokenizer_class = Rjb::import(tokenizer)
+      # See the documentation for
+      # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
+      # description of these parameters.
+      ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
+      super(ptb_tokenizer_factory)
+    end
+    # Returns a list of sentences in a string.  This wraps the returned
+    # sentences in a StandoffSentence object.
+    def getSentencesFromString(s)
+      super(s).map!{|sentence| StandoffSentence.new(sentence)}
+    end
+  end
+  # A sentence is an array of StandoffToken objects.
+  class StandoffSentence < Array
+    # Construct an array of StandoffToken objects from a Java list sentence
+    # object returned by the preprocessor.
+    def initialize(stanford_parser_sentence)
+      # Convert FeatureLabel wrappers to StandoffToken objects.
+      s = stanford_parser_sentence.to_a.collect do |fs|
+        current = fs.current
+        word = fs.word
+        before = fs.before
+        after = fs.after
+        # The to_s.to_i is necessary because the get function returns
+        # java.lang.Integer objects instead of Ruby integers.
+        begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
+        end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
+        StandoffToken.new(current, word, before, after,
+                          begin_position, end_position)
+      end
+      super(s)
+    end
+    # Return the original string verbatim.
+    def to_s
+      self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
+    end
+    # Return the original string verbatim.
+    def inspect
+      to_s
+    end
+  end
+  # Standoff syntactic annotation of natural language text which may contain
+  # multiple sentences.
+  #
+  # This is an Array of StandoffNode objects, one for each sentence in the
+  # text.
+  class StandoffParsedText < Array
+    # Parse the text and create the standoff annotation.
+    #
+    # The default parser is a singleton instance of the English language
+    # Stanford Natural Language parser.  There may be a delay of a few
+    # seconds for it to load the first time it is created.
+    def initialize(text, nodetype = StandoffNode,
+                   tokenizer = EN_PENN_TREEBANK_TOKENIZER,
+                   parser = DefaultParser.instance)
+      preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
+      # Segment the text into sentences.  Parse each sentence, writing
+      # standoff annotation information into the terminal nodes.
+      preprocessor.getSentencesFromString(text).each do |sentence|
+        parse = parser.apply(sentence.to_s)
+        push(nodetype.new(parse, sentence))
+      end
+    end
+    # Print class name and number of sentences.
+    def inspect
+      "<#{self.class.name}, #{length} sentences>"
+    end
+    # Print parses.
+    def to_s
+      flatten.join(" ")
+    end
+  end
+  # Standoff syntactic tree annotation of text.  Terminal nodes are labeled
+  # with the appropriate StandoffToken objects.  Standoff parses can reproduce
+  # the original string from which they were generated verbatim, optionally
+  # with brackets around the yields of specified non-terminal nodes.
+  class StandoffNode < Treebank::ParentedNode
+    # Create the standoff tree from a tree returned by the Stanford parser.
+    # For non-terminal nodes, the <em>tokens</em> argument will be a
+    # StandoffSentence containing the StandoffToken objects representing all
+    # the tokens beneath and after this node.  For terminal nodes, the
+    # <em>tokens</em> argument will be a StandoffToken.
+    def initialize(stanford_parser_node, tokens)
+      # Annotate this node with a non-terminal label or a StandoffToken as
+      # appropriate.
+      super(tokens.instance_of?(StandoffSentence) ?
+            stanford_parser_node.value : tokens)
+      # Enumerate the children depth-first.  Tokens are removed from the list
+      # left-to-right as terminal nodes are added to the tree.
+      stanford_parser_node.children.each do |child|
+        subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
+        attach_child!(subtree)
+      end
+    end
+    # Return the original text string dominated by this node.
+    def to_original_string
+      leaves.inject("") do |s, leaf|
+        s += leaf.label.current + leaf.label.after
+      end
+    end
+    # Print the original string with brackets around word spans dominated by
+    # the specified constituents.
+    #
+    # The constituents to bracket are specified by passing a list of node
+    # coordinates, which are arrays of integers of the form returned by the
+    # tree enumerators of Treebank::Node objects.
+    #
+    # _coords_:: the coordinates of the nodes around which to place brackets
+    # _open_:: the open bracket symbol
+    # _close_:: the close bracket symbol
+    def to_bracketed_string(coords, open = "[", close = "]")
+      # Get a list of all the leaf nodes and their coordinates.
+      items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
+      # Enumerate over all the matching constituents inserting open and close
+      # brackets around their yields in the items list.
+      coords.each do |matching|
+        # Insert using a simple state machine with three states: :start,
+        # :open, and :close.
+        state = :start
+        # Enumerate over the items list looking for nodes that are the
+        # children of the matching constituent.
+        items.each_with_index do |item, index|
+          # Skip inserted bracket characters.
+          next if item.is_a? String
+          # Handle terminal node items with the state machine.
+          node, terminal_coordinate = item
+          if state == :start
+            next if not in_yield?(matching, terminal_coordinate)
+            items.insert(index, open)
+            state = :open
+          else # state == :open
+            next if in_yield?(matching, terminal_coordinate)
+            items.insert(index, close)
+            state = :close
+            break
+          end
+        end # items.each_with_index
+        # Handle the case where a matching constituent is flush with the end
+        # of the sentence.
+        items << close if state == :open
+      end # each
+      # Replace terminal nodes with their string representations.  Insert
+      # spacing characters in the list.
+      items.each_with_index do |item, index|
+        next if item.is_a? String
+        text = item.first.label.current
+        spacing = item.first.label.after
+        # Replace the terminal node with its text.
+        items[index] = text
+        # Insert the spacing that comes after this text before the first
+        # non-close bracket character.
+        close_pos = find_index(items[index+1..-1]) {|entry| not entry == close}
+        items.insert(index + close_pos + 1, spacing)
+      end
+      items.join
+    end # to_bracketed_string
+    # Find the index of the first item in _list_ for which _block_ is true.
+    # Return 0 if no items are found.
+    def find_index(list, &block)
+      list.each_with_index do |item, index|
+        return index if block.call(item)
+      end
+      0
+    end
+    # Is the node at _terminal_ in the yield of the node at _node_?
+    def in_yield?(node, terminal)
+      # If node A's coordinates match the prefix of node B's coordinates, node
+      # B is in the yield of node A.
+      terminal.first(node.length) == node
+    end
+    private :in_yield?, :find_index
+  end # StandoffNode
+end # StanfordParser
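
The configuration handling in initialize_on_load can be exercised without Rjb or a JVM. The sketch below is not the module's code: `parser_settings` is a hypothetical helper that takes the YAML text directly instead of reading a file, uses only the UNIX defaults, and returns the values that would otherwise be handed to Rjb::load.

```ruby
require "pathname"
require "yaml"

# UNIX defaults as documented in the module; the Windows paths are omitted here.
DEFAULT_ROOT = "/usr/local/stanford-parser/current"
DEFAULT_JVMARGS = ["-server", "-Xmx150m"]

# Hypothetical helper: merge the optional "root" and "jvmargs" settings from
# YAML text with the defaults.  A nil or empty text means no configuration file.
def parser_settings(yaml_text)
  configuration = yaml_text.to_s.empty? ? {} : (YAML.safe_load(yaml_text) || {})
  root = Pathname.new(configuration["root"] || DEFAULT_ROOT)
  jvmargs = configuration["jvmargs"] ? configuration["jvmargs"].split : DEFAULT_JVMARGS
  { root: root, jvmargs: jvmargs, classpath: (root + "stanford-parser.jar").to_s }
end

settings = parser_settings("root: /usr/local/stanford-parser/other/location\njvmargs: -Xmx100m -verbose\n")
puts settings[:classpath]        # prints "/usr/local/stanford-parser/other/location/stanford-parser.jar"
puts settings[:jvmargs].inspect  # prints ["-Xmx100m", "-verbose"]
```

Keeping the merge separate from the call that starts the JVM makes the default handling easy to check in a unit test, which may help with the broken test suite mentioned in README.rdoc.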