stanfordparser 1.2.0 → 2.0.0

data/README CHANGED
@@ -1,35 +1,40 @@
1
- = Stanford Natural Language Parser
1
+ = Stanford Natural Language Parser Wrapper
2
2
 
3
3
  This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
4
4
 
5
- The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby.
5
+ The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby along with pure Ruby objects that enable standoff parsing.
6
6
 
7
- = Installation and Configuration
8
-
9
- To run this module you must install the following additional software
10
7
 
11
- * The {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml]
12
- * The {Ruby Java Bridge}[http://rjb.rubyforge.org/] gem.
8
+ = Installation and Configuration
13
9
 
14
- Note that the Stanford Parser is not a Ruby application and is therefore not a Ruby gem and must be manually installed.
10
+ In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
15
11
 
16
12
  This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
17
13
 
18
- These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://www.ruby-doc.org/core/classes/YAML.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
14
+ These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
19
15
 
20
16
  root: /usr/local/stanford-parser/other/location
21
17
  jvmargs: -Xmx100m -verbose
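The override behavior described above can be sketched in plain Ruby. This is only an illustration, not the module's actual loader: the helper name `load_parser_config` and the `DEFAULTS` hash are hypothetical, and the YAML is read from a string rather than from <tt>/etc/ruby_stanford_parser.yaml</tt> so that the snippet runs anywhere.

```ruby
require "yaml"

# Defaults as stated in the README text above.
DEFAULTS = {
  "root"    => "/usr/local/stanford-parser/current",
  "jvmargs" => "-server -Xmx150m"
}.freeze

# Hypothetical helper: merge values from a YAML document over the defaults.
# An empty document parses to a falsy value, so fall back to an empty Hash.
def load_parser_config(yaml_text)
  overrides = YAML.load(yaml_text) || {}
  DEFAULTS.merge(overrides)
end

config = load_parser_config(<<~YAML)
  root: /usr/local/stanford-parser/other/location
  jvmargs: -Xmx100m -verbose
YAML

puts config["root"]
puts config["jvmargs"].split.inspect
```

Any key left out of the file keeps its default, which matches the README's statement that the file "may contain" either value.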
22
18
 
23
- =Usage
24
19
 
25
- Use the StanfordParser::LexicalizedParser class to parse sentences.
20
+ =Tokenization and Parsing
26
21
 
27
- irb(main):001:0> require 'stanfordparser'
22
+ Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
23
+
24
+ >> require "stanfordparser"
28
25
  => true
29
- irb(main):002:0> parser = StanfordParser::LexicalizedParser.new
26
+ >> preproc = StanfordParser::DocumentPreprocessor.new
27
+ => <DocumentPreprocessor>
28
+ >> puts preproc.getSentencesFromString("This is a sentence. So is this.")
29
+ This is a sentence .
30
+ So is this .
31
+
32
+ Use the StanfordParser::LexicalizedParser class to parse sentences.
33
+
34
+ >> parser = StanfordParser::LexicalizedParser.new
30
35
  Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
31
36
  => edu.stanford.nlp.parser.lexparser.LexicalizedParser
32
- irb(main):003:0> puts parser.apply("This is a sentence.")
37
+ >> puts parser.apply("This is a sentence.")
33
38
  (ROOT
34
39
  (S [24.917]
35
40
  (NP [6.139] (DT [2.300] This))
@@ -37,26 +42,77 @@ Use the StanfordParser::LexicalizedParser class to parse sentences.
37
42
  (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
38
43
  (. [0.002] .)))
39
44
 
40
- Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into words or sentences.
45
+ For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
41
46
 
42
- irb(main):004:0> preproc = StanfordParser::DocumentPreprocessor.new
43
- irb(main):008:0> puts preproc.getSentencesFromString("This is a sentence. So is this.")
44
- This is a sentence .
45
- So is this .
46
47
 
47
- For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
48
+ =Standoff Tokenization and Parsing
49
+
50
+ This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
51
+
52
+ Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
53
+
54
+ >> preproc = StanfordParser::StandoffDocumentPreprocessor.new
55
+ => <StandoffDocumentPreprocessor>
56
+ >> s = preproc.getSentencesFromString("This is a sentence. So is this.")
57
+ => [This is a sentence., So is this.]
58
+
59
+ The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
60
+
61
+ >> puts s
62
+ This [0,4]
63
+ is [5,7]
64
+ a [8,9]
65
+ sentence [10,18]
66
+ . [18,19]
67
+ So [21,23]
68
+ is [24,26]
69
+ this [27,31]
70
+ . [31,32]
71
+ >> "This is a sentence. So is this."[27..31]
72
+ => "this."
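The [begin,end] pairs printed above read most naturally as half-open spans: the end value is the index one past the token's last character, which is why the inclusive range <tt>27..31</tt> also picks up the final period. The sketch below shows the difference. It assumes the input string has two spaces after the first period, since the printed offsets only add up that way and the whitespace appears to be collapsed in this rendering.

```ruby
# Assumption: two spaces follow the first period, as the offsets imply.
text = "This is a sentence.  So is this."

# [begin, end] pairs as printed for a few tokens in the README example.
spans = { "This" => [0, 4], "So" => [21, 23], "this" => [27, 31] }

spans.each do |token, (first, last)|
  # An exclusive range (three dots) recovers exactly the token text.
  puts "#{token.inspect} -> #{text[first...last].inspect}"
end

# The inclusive range used in the README also includes the period.
puts text[27..31].inspect
```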
73
+
74
+ This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
75
+
76
+ Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
77
+
78
+ >> t = StanfordParser::StandoffParsedText.new("This is a sentence. So is this.")
79
+ Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
80
+ => <StanfordParser::StandoffParsedText, 2 sentences>
81
+ >> puts t.first
82
+ (ROOT
83
+ (S
84
+ (NP (DT This [0,4]))
85
+ (VP (VBZ is [5,7])
86
+ (NP (DT a [8,9]) (NN sentence [10,18])))
87
+ (. . [18,19])))
88
+
89
+ Standoff parse trees can reproduce the text from which they were generated verbatim.
90
+
91
+ >> t.first.to_original_string
92
+ => "This is a sentence. "
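Verbatim reproduction works because every token records the spacing that followed it. A minimal sketch of the idea follows, using a simplified stand-in for StanfordParser::StandoffToken; the <tt>Token</tt> struct and the <tt>after</tt> values are illustrative assumptions, not output from the parser.

```ruby
# Simplified stand-in for StanfordParser::StandoffToken: only the fields
# needed to rebuild the text and check the offsets are kept.
Token = Struct.new(:current, :after, :begin_position, :end_position)

# Illustrative values; the trailing spacing after "." is assumed.
tokens = [
  Token.new("This", " ", 0, 4),
  Token.new("is", " ", 5, 7),
  Token.new("a", " ", 8, 9),
  Token.new("sentence", "", 10, 18),
  Token.new(".", "  ", 18, 19)
]

# Concatenate each token with the spacing that came after it.
original = tokens.inject("") { |s, t| s + t.current + t.after }
puts original.inspect

# Each half-open offset span should point back at its own token.
consistent = tokens.all? do |t|
  original[t.begin_position...t.end_position] == t.current
end
puts consistent
```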
93
+
94
+ They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
95
+
96
+ >> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
97
+ => "[This] is [a sentence]. "
98
+
99
+ The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
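In the source, a terminal counts as part of a constituent's yield when the constituent's coordinates are a prefix of the terminal's coordinates. The sketch below illustrates that prefix test on its own. The leaf coordinates are assumed for illustration; the Treebank gem documentation remains the authority on the coordinate format.

```ruby
# True when the node at +terminal+ lies beneath the node at +node+,
# that is, when +node+ is a prefix of +terminal+.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Hypothetical leaf coordinates for "This is a sentence."
leaves = {
  "This"     => [0, 0, 0, 0],
  "is"       => [0, 1, 0, 0],
  "a"        => [0, 1, 1, 0, 0],
  "sentence" => [0, 1, 1, 1, 0],
  "."        => [0, 2, 0]
}

# Collect the words dominated by the constituent at [0, 1, 1].
bracketed = leaves.select { |_, coord| in_yield?([0, 1, 1], coord) }.keys
puts bracketed.inspect
```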
100
+
101
+ See the documentation of the individual classes in this module for more details.
48
102
 
103
+ Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
49
104
 
50
105
  = History
51
106
 
52
107
  1.0.0:: Initial release
53
108
  1.1.0:: Make module initialization function private. Add example code.
54
109
  1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
110
+ 2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
55
111
 
56
112
 
57
113
  = Copyright
58
114
 
59
- Copyright 2007, William Patrick McNeill
115
+ Copyright 2007-2008, William Patrick McNeill
60
116
 
61
117
  This program is distributed under the GNU General Public License.
62
118
 
@@ -2,7 +2,7 @@
2
2
 
3
3
  #--
4
4
 
5
- # Copyright 2007 William Patrick McNeill
5
+ # Copyright 2007-2008 William Patrick McNeill
6
6
  #
7
7
  # This file is part of the Stanford Parser Ruby Wrapper.
8
8
  #
@@ -34,7 +34,7 @@
34
34
  # See the Java Stanford Parser documentation for details
35
35
  #
36
36
  # sentence::
37
- # A sentence to parse. This must be quoted.
37
+ # A sentence to parse. This must appear after all the options and be quoted.
38
38
 
39
39
 
40
40
  require "stanfordparser"
@@ -42,5 +42,5 @@ require "stanfordparser"
42
42
  # The last argument is the sentence. The rest of the command line is passed
43
43
  # along to the parser object.
44
44
  sentence = ARGV.pop
45
- parser = StanfordParser::LexicalizedParser.new("$(ROOT)/englishPCFG.ser.gz", ARGV)
45
+ parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
46
46
  puts parser.apply(sentence)
@@ -0,0 +1,129 @@
1
+ # Copyright 2007-2008 William Patrick McNeill
2
+ #
3
+ # This file is part of the Stanford Parser Ruby Wrapper.
4
+ #
5
+ # The Stanford Parser Ruby Wrapper is free software; you can redistribute it
6
+ # and/or modify it under the terms of the GNU General Public License as
7
+ # published by the Free Software Foundation; either version 2 of the License,
8
+ # or (at your option) any later version.
9
+ #
10
+ # The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
11
+ # useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
13
+ # Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU General Public License along with
16
+ # editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
17
+ # St, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ # Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
20
+ # add a generic Java object wrapper class.
21
+ module Rjb
22
+
23
+ #--
24
+ # The documentation for this class appears next to its extension inside the
25
+ # StanfordParser module in stanfordparser.rb. This should be changed if Rjb
26
+ # is ever moved into its own gem. See the documentation in stanfordparser.rb
27
+ # for more details.
28
+ #++
29
+ class JavaObjectWrapper
30
+ include Enumerable
31
+
32
+ # The underlying Java object.
33
+ attr_reader :java_object
34
+
35
+ # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
36
+ # String, treat it as a Java class name and instantiate it. Otherwise,
37
+ # treat <em>obj</em> as an instance of a Java object.
38
+ def initialize(obj, *args)
39
+ @java_object = obj.class == String ?
40
+ Rjb::import(obj).send(:new, *args) : obj
41
+ end
42
+
43
+ # Enumerate all the items in the object using its iterator. If the object
44
+ # has no iterator, this function yields nothing.
45
+ def each
46
+ if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
47
+ i = @java_object.iterator
48
+ while i.hasNext
49
+ yield wrap_java_object(i.next)
50
+ end
51
+ end
52
+ end # each
53
+
54
+ # Reflect unhandled method calls to the underlying Java object and wrap
55
+ # the return value in the appropriate Ruby object.
56
+ def method_missing(m, *args)
57
+ begin
58
+ wrap_java_object(@java_object.send(m, *args))
59
+ rescue RuntimeError => e
60
+ # The instance method failed. See if this is a static method.
61
+ if not e.message.match(/^Fail: unknown method name/).nil?
62
+ getClass.send(m, *args)
63
+ end
64
+ end
65
+ end
66
+
67
+ # Convert a value returned by a call to the underlying Java object to the
68
+ # appropriate Ruby object.
69
+ #
70
+ # If the value is a JavaObjectWrapper, convert it using a protected
71
+ # function with the name wrap_ followed by the underlying object's
72
+ # classname with the Java path delimiters converted to underscores. For
73
+ # example, a <tt>java.util.ArrayList</tt> would be converted by a function
74
+ # called wrap_java_util_ArrayList.
75
+ #
76
+ # If the value lacks the appropriate converter function, wrap it in a
77
+ # generic JavaObjectWrapper.
78
+ #
79
+ # If the value is not a JavaObjectWrapper, return it unchanged.
80
+ #
81
+ # This function is called recursively for every element in an Array.
82
+ def wrap_java_object(object)
83
+ if object.kind_of?(Array)
84
+ object.collect {|item| wrap_java_object(item)}
85
+ elsif object.respond_to?(:_classname)
86
+ # Ruby-Java Bridge Java objects all have a _classname member which
87
+ # tells the name of their Java class. Convert this to the
88
+ # corresponding wrapper function name.
89
+ wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
90
+ respond_to?(wrapper_name) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
91
+ else
92
+ object
93
+ end
94
+ end
95
+
96
+ # Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
97
+ def wrap_java_util_ArrayList(object)
98
+ array_list = []
99
+ object.size.times do
100
+ |i| array_list << wrap_java_object(object.get(i))
101
+ end
102
+ array_list
103
+ end
104
+
105
+ # Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
106
+ def wrap_java_util_HashSet(object)
107
+ set = Set.new
108
+ i = object.iterator
109
+ while i.hasNext
110
+ set << wrap_java_object(i.next)
111
+ end
112
+ set
113
+ end
114
+
115
+ # Show the classname of the underlying Java object.
116
+ def inspect
117
+ "<#{@java_object._classname}>"
118
+ end
119
+
120
+ # Use the underlying Java object's stringification.
121
+ def to_s
122
+ toString
123
+ end
124
+
125
+ protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
126
+
127
+ end # JavaObjectWrapper
128
+
129
+ end # Rjb
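The converter lookup in wrap_java_object is a name-based dispatch: the Java class name is turned into a method name and that method is called if it exists. The sketch below shows the convention without Rjb or a JVM. <tt>FakeJavaObject</tt> and <tt>SketchWrapper</tt> are hypothetical stand-ins, and the generic fallback returns a tagged array instead of a JavaObjectWrapper.

```ruby
# Stand-in for an Rjb proxy object; real Rjb objects expose _classname.
FakeJavaObject = Struct.new(:_classname, :items)

class SketchWrapper
  # Map a Java class name to a converter method name and call it if defined.
  def wrap(object)
    return object unless object.respond_to?(:_classname)
    name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
    # Passing true makes respond_to? consider protected methods as well.
    respond_to?(name, true) ? send(name, object) : [:generic, object._classname]
  end

  protected

  # Converter found by name for java.util.ArrayList.
  def wrap_java_util_ArrayList(object)
    object.items.to_a
  end
end

wrapper = SketchWrapper.new
puts wrapper.wrap(FakeJavaObject.new("java.util.ArrayList", [1, 2])).inspect
puts wrapper.wrap(FakeJavaObject.new("java.util.Date", nil)).inspect
puts wrapper.wrap(42).inspect
```

One design point to be aware of: since Ruby 2.0, <tt>respond_to?</tt> returns false for protected methods unless its second argument is true, so the sketch passes <tt>true</tt>. Code that keeps its converters protected and calls <tt>respond_to?</tt> with a single argument would fall through to the generic wrapper on those Ruby versions.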
@@ -1,4 +1,4 @@
1
- # Copyright 2007 William Patrick McNeill
1
+ # Copyright 2007-2008 William Patrick McNeill
2
2
  #
3
3
  # This file is part of the Stanford Parser Ruby Wrapper.
4
4
  #
@@ -19,121 +19,27 @@
19
19
 
20
20
  require "pathname"
21
21
  require "rjb"
22
- require "set"
22
+ require "singleton"
23
+ begin
24
+ require "treebank"
25
+ gem "treebank", ">= 3.0.0"
26
+ rescue LoadError
27
+ require "treebank"
28
+ end
23
29
  require "yaml"
24
30
 
25
- # Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
26
- # adds a generic Java object wrapper class.
27
- module Rjb
28
-
29
- # A generic wrapper for a Java object loaded via the Ruby Java Bridge. The
30
- # wrapper class handles initialization and stringification, and passes other
31
- # method calls down to the underlying Java object. Objects returned by the
32
- # underlying Java object are converted to the appropriate Ruby object.
33
- #
34
- # This object is enumerable, yielding items in the order defined by the Java
35
- # object's iterator.
36
- class JavaObjectWrapper
37
- include Enumerable
38
-
39
- # The underlying Java object.
40
- attr_reader :java_object
41
-
42
- # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
43
- # String, assume it is a Java class name and instantiate it. Otherwise,
44
- # treat <em>obj</em> as an instance of a Java object.
45
- def initialize(obj, *args)
46
- @java_object = obj.class == String ?
47
- Rjb::import(obj).send(:new, *args) : obj
48
- end
49
-
50
- # Enumerate all the items in the object using its iterator. If the object
51
- # has no iterator, this function yields nothing.
52
- def each
53
- if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
54
- i = @java_object.iterator
55
- while i.hasNext
56
- yield wrap_java_object(i.next)
57
- end
58
- end
59
- end # each
60
-
61
- # Reflect unhandled method calls to the underlying Java object.
62
- def method_missing(m, *args)
63
- wrap_java_object(@java_object.send(m, *args))
64
- end
65
-
66
- # Convert a value returned by a call to the underlying Java object to the
67
- # appropriate Ruby object as follows:
68
- # * RJB objects are placed inside a generic JavaObjectWrapper wrapper.
69
- # * <tt>java.util.ArrayList</tt> objects are converted to Ruby Arrays.
70
- # * <tt>java.util.HashSet</tt> objects are converted to Ruby Sets
71
- # * Other objects are left unchanged.
72
- #
73
- # This function is applied recursively to items in collection objects such
74
- # as set and arrays.
75
- def wrap_java_object(object)
76
- if object.kind_of?(Array)
77
- object.collect {|item| wrap_java_object(item)}
78
- # Ruby-Java Bridge Java objects all have a _classname member which tells
79
- # the name of their Java class.
80
- elsif object.respond_to?(:_classname)
81
- case object._classname
82
- when /java\.util\.ArrayList/
83
- # Convert java.util.ArrayList objects to Ruby arrays.
84
- array_list = []
85
- object.size.times do
86
- |i| array_list << wrap_java_object(object.get(i))
87
- end
88
- array_list
89
- when /java\.util\.HashSet/
90
- # Convert java.util.HashSet objects to Ruby sets.
91
- set = Set.new
92
- i = object.iterator
93
- while i.hasNext
94
- set << wrap_java_object(i.next)
95
- end
96
- set
97
- else
98
- # Pass other RJB objects off to a handler.
99
- wrap_rjb_object(object)
100
- end # case
101
- else
102
- # Return non-RJB objects unchanged.
103
- object
104
- end # if
105
- end # wrap_java_object
106
-
107
- # By default, all RJB classes other than <tt>java.util.ArrayList</tt> and
108
- # <tt>java.util.HashSet</tt> go in a generic wrapper. Derived classes may
109
- # change this behavior.
110
- def wrap_rjb_object(object)
111
- JavaObjectWrapper.new(object)
112
- end
113
-
114
- # Show the classname of the underlying Java object.
115
- def inspect
116
- "<#{@java_object._classname}>"
117
- end
118
-
119
- # Use the underlying Java object's stringification.
120
- def to_s
121
- toString
122
- end
123
-
124
- protected :wrap_java_object, :wrap_rjb_object
125
-
126
- end # JavaObjectWrapper
127
-
128
- end # Rjb
129
-
31
+ require "java_object.rb"
130
32
 
131
33
  # Wrapper for the {Stanford Natural Language
132
34
  # Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
133
35
  module StanfordParser
134
36
 
135
- VERSION = "1.2.0"
136
-
37
+ VERSION = "2.0.0"
38
+
39
+ # The default sentence segmenter and tokenizer. This is an English-language
40
+ # tokenizer with support for Penn Treebank markup.
41
+ EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
42
+
137
43
  # Path to an English PCFG model that comes with the Stanford Parser. The
138
44
  # location is relative to the parser root directory. This is a valid value
139
45
  # for the <em>grammar</em> parameter of the LexicalizedParser constructor.
@@ -170,32 +76,53 @@ module StanfordParser
170
76
  # The root directory of the Stanford parser installation.
171
77
  ROOT = initialize_on_load
172
78
 
173
-
79
+ #--
80
+ # The documentation below is for the original Rjb::JavaObjectWrapper object.
81
+ # It is reproduced here because rdoc only takes the last document block
82
+ # defined. If Rjb is moved into its own gem, this documentation should go
83
+ # with it, and the following should be written as documentation for this
84
+ # class:
85
+ #
174
86
  # Extension of the generic Ruby-Java Bridge wrapper object for the
175
87
  # StanfordParser module.
176
- class JavaObjectWrapper < Rjb::JavaObjectWrapper
177
- # Wrap a return value with a specialized wrapper class in the
178
- # StanfordParser module in the appropriate class.
179
- def wrap_rjb_object(object)
180
- case object._classname
181
- when /^edu\.stanford\.nlp\.trees\.
182
- (Tree|LabeledScoredTreeLeaf|
183
- LabeledScoredTreeNode|
184
- SimpleTree|TreeGraphNode)$/x
185
- # Tree objects go inside a Tree wrapper.
186
- Tree.new(object)
187
- else
188
- super(object)
189
- end # case
190
- end # wrap_rjb_object
191
- end # JavaObjectWrapper
88
+ #++
89
+ # A generic wrapper for a Java object loaded via the {Ruby-Java
90
+ # Bridge}[http://rjb.rubyforge.org/]. The wrapper class handles
91
+ # initialization and stringification, and passes other method calls down to
92
+ # the underlying Java object. Objects returned by the underlying Java
93
+ # object are converted to the appropriate Ruby object.
94
+ #
95
+ # Other modules may extend the list of Java objects that are converted by
96
+ # adding their own converter functions. See wrap_java_object for details.
97
+ #
98
+ # This object is enumerable, yielding items in the order defined by the
99
+ # underlying Java object's iterator.
100
+ class Rjb::JavaObjectWrapper
101
+ # FeatureLabel objects go inside a FeatureLabel wrapper.
102
+ def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
103
+ StanfordParser::FeatureLabel.new(object)
104
+ end
105
+
106
+ # Tree objects go inside a Tree wrapper. Various tree types are aliased
107
+ # to this function.
108
+ def wrap_edu_stanford_nlp_trees_Tree(object)
109
+ Tree.new(object)
110
+ end
111
+
112
+ alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
113
+ alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
114
+ alias :wrap_edu_stanford_nlp_trees_SimpleTree :wrap_edu_stanford_nlp_trees_Tree
115
+ alias :wrap_edu_stanford_nlp_trees_TreeGraphNode :wrap_edu_stanford_nlp_trees_Tree
116
+
117
+ protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
118
+ end # Rjb::JavaObjectWrapper
192
119
 
193
120
 
194
121
 # Lexicalized probabilistic parser.
195
122
  #
196
123
 # This is a wrapper for the
197
124
  # <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
198
- class LexicalizedParser < JavaObjectWrapper
125
+ class LexicalizedParser < Rjb::JavaObjectWrapper
199
126
  # The grammar used by the parser
200
127
  attr_reader :grammar
201
128
 
@@ -220,10 +147,17 @@ module StanfordParser
220
147
  end # LexicalizedParser
221
148
 
222
149
 
150
+ # A singleton instance of the default Stanford Natural Language parser. A
151
+ # singleton is used because the parser can take a few seconds to load.
152
+ class DefaultParser < StanfordParser::LexicalizedParser
153
+ include Singleton
154
+ end
155
+
156
+
223
157
  # This is a wrapper for
224
158
  # <tt>edu.stanford.nlp.trees.Tree</tt> objects. It customizes
225
159
  # stringification.
226
- class Tree < JavaObjectWrapper
160
+ class Tree < Rjb::JavaObjectWrapper
227
161
  def initialize(obj = "edu.stanford.nlp.trees.Tree")
228
162
  super(obj)
229
163
  end
@@ -245,16 +179,16 @@ module StanfordParser
245
179
  # This is a wrapper for
246
180
  # <tt>edu.stanford.nlp.ling.Word</tt> objects. It customizes
247
181
  # stringification and adds an equivalence operator.
248
- class Word < JavaObjectWrapper
182
+ class Word < Rjb::JavaObjectWrapper
249
183
  def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
250
184
  super(obj, *args)
251
185
  end
252
-
186
+
253
187
  # See the word values.
254
188
  def inspect
255
189
  to_s
256
190
  end
257
-
191
+
258
192
  # Equivalence is defined relative to the word value.
259
193
  def ==(other)
260
194
  word == other
@@ -262,11 +196,34 @@ module StanfordParser
262
196
  end # Word
263
197
 
264
198
 
199
+ # This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
200
+ # It customizes stringification.
201
+ class FeatureLabel < Rjb::JavaObjectWrapper
202
+ def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
203
+ super
204
+ end
205
+
206
+ # Stringify with just the token and its begin and end position.
207
+ def to_s
208
+ # BUGBUG The position values come back as java.lang.Integer though I
209
+ # would expect Rjb to convert them to Ruby integers.
210
+ begin_position = get(self.BEGIN_POSITION_KEY)
211
+ end_position = get(self.END_POSITION_KEY)
212
+ "#{current} [#{begin_position},#{end_position}]"
213
+ end
214
+
215
+ # More verbose stringification with all the fields and their values.
216
+ def inspect
217
+ toString
218
+ end
219
+ end
220
+
221
+
265
222
  # Tokenizes documents into words and sentences.
266
223
  #
267
224
  # This is a wrapper for the
268
225
  # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
269
- class DocumentPreprocessor < JavaObjectWrapper
226
+ class DocumentPreprocessor < Rjb::JavaObjectWrapper
270
227
  def initialize(suppressEscaping = false)
271
228
  super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
272
229
  end
@@ -276,6 +233,229 @@ module StanfordParser
276
233
  s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
277
234
  _invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
278
235
  end
236
+
237
+ def inspect
238
+ "<#{self.class.to_s.split('::').last}>"
239
+ end
240
+
241
+ def to_s
242
+ inspect
243
+ end
279
244
  end # DocumentPreprocessor
280
245
 
246
+ StandoffToken = Struct.new(:current, :word, :before, :after,
247
+ :begin_position, :end_position)
248
+
249
+ # A text token that contains raw and normalized token identity (e.g. "(" and
250
+ # "-LRB-"), an offset span, and the characters immediately preceding and
251
+ # following the token. Given a list of these objects it is possible to
252
+ # recreate the text from which they came verbatim.
253
+ class StandoffToken
254
+ def to_s
255
+ "#{current} [#{begin_position},#{end_position}]"
256
+ end
257
+ end
258
+
259
+
260
+ # A preprocessor that segments text into sentences and tokens that contain
261
+ # character offset and token context information that can be used for
262
+ # standoff annotation.
263
+ class StandoffDocumentPreprocessor < DocumentPreprocessor
264
+ def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
265
+ # PTBTokenizer.factory is a static function, so use RJB to call it
266
+ # directly instead of going through a JavaObjectWrapper. We do it this
267
+ # way because the Stanford parser Java code does not provide a
268
+ # constructor that allows you to specify the second parameter,
269
+ # invertible, to true, and we need this to write character offset
270
+ # information into the tokens.
271
+ ptb_tokenizer_class = Rjb::import(tokenizer)
272
+ # See the documentation for
273
+ # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
274
+ # description of these parameters.
275
+ ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
276
+ super(ptb_tokenizer_factory)
277
+ end
278
+
279
+ # Returns a list of sentences in a string. This wraps the returned
280
+ # sentences in a StandoffSentence object.
281
+ def getSentencesFromString(s)
282
+ super(s).map!{|s| StandoffSentence.new(s)}
283
+ end
284
+ end
285
+
286
+
287
+ # A sentence is an array of StandoffToken objects.
288
+ class StandoffSentence < Array
289
+ # Construct an array of StandoffToken objects from a Java list sentence
290
+ # object returned by the preprocessor.
291
+ def initialize(stanford_parser_sentence)
292
+ # Convert FeatureLabel wrappers to StandoffToken objects.
293
+ s = stanford_parser_sentence.to_a.collect do |fs|
294
+ current = fs.current
295
+ word = fs.word
296
+ before = fs.before
297
+ after = fs.after
298
+ # The to_s.to_i is necessary because the get function returns
299
+ # java.lang.Integer objects instead of Ruby integers.
300
+ begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
301
+ end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
302
+ StandoffToken.new(current, word, before, after,
303
+ begin_position, end_position)
304
+ end
305
+ super(s)
306
+ end
307
+
308
+ # Return the original string verbatim.
309
+ def to_s
310
+ self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
311
+ end
312
+
313
+ # Return the original string verbatim.
314
+ def inspect
315
+ to_s
316
+ end
317
+ end
318
+
319
+
320
+ # Standoff syntactic annotation of natural language text which may contain
321
+ # multiple sentences.
322
+ #
323
+ # This is an Array of StandoffNode objects, one for each sentence in the
324
+ # text.
325
+ class StandoffParsedText < Array
326
+ # Parse the text and create the standoff annotation.
327
+ #
328
+ # The default parser is a singleton instance of the English language
329
+ # Stanford Natural Language parser. There may be a delay of a few
330
+ # seconds for it to load the first time it is created.
331
+ def initialize(text, nodetype = StandoffNode,
332
+ tokenizer = EN_PENN_TREEBANK_TOKENIZER,
333
+ parser = DefaultParser.instance)
334
+ preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
335
+ # Segment the text into sentences. Parse each sentence, writing
336
+ # standoff annotation information into the terminal nodes.
337
+ preprocessor.getSentencesFromString(text).map do |sentence|
338
+ parse = parser.apply(sentence.to_s)
339
+ push(nodetype.new(parse, sentence))
340
+ end
341
+ end
342
+
343
+ # Print class name and number of sentences.
344
+ def inspect
345
+ "<#{self.class.name}, #{length} sentences>"
346
+ end
347
+
348
+ # Print parses.
349
+ def to_s
350
+ flatten.join(" ")
351
+ end
352
+ end
353
+
354
+
355
+ # Standoff syntactic tree annotation of text. Terminal nodes are labeled
356
+ # with the appropriate StandoffToken objects. Standoff parses can reproduce
357
+ # the original string from which they were generated verbatim, optionally
358
+ # with brackets around the yields of specified non-terminal nodes.
359
+ class StandoffNode < Treebank::Node
360
+ # Create the standoff tree from a tree returned by the Stanford parser.
361
+ # For non-terminal nodes, the <em>tokens</em> argument will be a
362
+ # StandoffSentence containing the StandoffToken objects representing all
363
+ # the tokens beneath and after this node. For terminal nodes, the
364
+ # <em>tokens</em> argument will be a StandoffToken.
365
+ def initialize(stanford_parser_node, tokens)
366
+ # Annotate this node with a non-terminal label or a StandoffToken as
367
+ # appropriate.
368
+ super(tokens.instance_of?(StandoffSentence) ?
369
+ stanford_parser_node.value : tokens)
370
+ # Enumerate the children depth-first. Tokens are removed from the list
371
+ # left-to-right as terminal nodes are added to the tree.
372
+ stanford_parser_node.children.each do |child|
373
+ subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
374
+ attach_child!(subtree)
375
+ end
376
+ end
377
+
378
+ # Return the original text string dominated by this node.
379
+ def to_original_string
380
+ leaves.inject("") do |s, leaf|
381
+ s += leaf.label.current + leaf.label.after
382
+ end
383
+ end
384
+
385
+ # Print the original string with brackets around word spans dominated by
386
+ # the specified constituents.
387
+ #
388
+ # The constituents to bracket are specified by passing a list of node
389
+ # coordinates, which are arrays of integers of the form returned by the
390
+ # tree enumerators of Treebank::Node objects.
391
+ #
392
+ # _coords_:: the coordinates of the nodes around which to place brackets
393
+ # _open_:: the open bracket symbol
394
+ # _close_:: the close bracket symbol
395
+ def to_bracketed_string(coords, open = "[", close = "]")
396
+ # Get a list of all the leaf nodes and their coordinates.
397
+ items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
398
+ # Enumerate over all the matching constituents inserting open and close
399
+ # brackets around their yields in the items list.
400
+ coords.each do |matching|
401
+ # Insert using a simple state machine with three states: :start,
402
+ # :open, and :close.
403
+ state = :start
404
+ # Enumerate over the items list looking for nodes that are the
405
+ # children of the matching constituent.
406
+ items.each_with_index do |item, index|
407
+ # Skip inserted bracket characters.
408
+ next if item.is_a? String
409
+ # Handle terminal node items with the state machine.
410
+ node, terminal_coordinate = item
411
+ if state == :start
412
+ next if not in_yield?(matching, terminal_coordinate)
413
+ items.insert(index, open)
414
+ state = :open
415
+ else # state == :open
416
+ next if in_yield?(matching, terminal_coordinate)
417
+ items.insert(index, close)
418
+ state = :close
419
+ break
420
+ end
421
+ end # items.each_with_index
422
+ # Handle the case where a matching constituent is flush with the end
423
+ # of the sentence.
424
+ items << close if state == :open
425
+ end # each
426
+ # Replace terminal nodes with their string representations. Insert
427
+ # spacing characters in the list.
428
+ items.each_with_index do |item, index|
429
+ next if item.is_a? String
430
+ text = item.first.label.current
431
+ spacing = item.first.label.after
432
+ # Replace the terminal node with its text.
433
+ items[index] = text
434
+ # Insert the spacing that comes after this text before the first
435
+ # non-close bracket character.
436
+ close_pos = find_index(items[index+1..-1]) {|item| not item == close}
437
+ items.insert(index + close_pos + 1, spacing)
438
+ end
439
+ items.join
440
+ end # to_bracketed_string
441
+
442
+ # Find the index of the first item in _list_ for which _block_ is true.
443
+ # Return 0 if no items are found.
444
+ def find_index(list, &block)
445
+ list.each_with_index do |item, index|
446
+ return index if block.call(item)
447
+ end
448
+ 0
449
+ end
450
+
451
+ # Is the node at _terminal_ in the yield of the node at _node_?
452
+ def in_yield?(node, terminal)
453
+ # If node A's coordinates match the prefix of node B's coordinates, node
454
+ # B is in the yield of node A.
455
+ terminal.first(node.length) == node
456
+ end
457
+
458
+ private :in_yield?, :find_index
459
+ end # StandoffNode
460
+
281
461
  end # StanfordParser
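The prefix test behind in_yield? can be tried in isolation. The following is a minimal sketch, not the gem's code: it copies the one-line body into a hypothetical top-level method and uses made-up coordinates, assuming only that node coordinates are arrays of integers as the comments above describe.

```ruby
# Hypothetical standalone copy of the prefix test used by in_yield?.
# A terminal lies in the yield of a node when the node's coordinates
# are a prefix of the terminal's coordinates.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

constituent = [0, 0]                            # made-up coordinates

puts in_yield?(constituent, [0, 0, 1, 1, 0])    # prints true
puts in_yield?(constituent, [0, 1, 1, 0])       # prints false
```

A terminal coordinate shorter than the node coordinate can never match, because Array#first then returns fewer elements than the node has.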
@@ -2,7 +2,7 @@
2
2
 
3
3
  #--
4
4
 
5
- # Copyright 2007 William Patrick McNeill
5
+ # Copyright 2007-2008 William Patrick McNeill
6
6
  #
7
7
  # This file is part of the Stanford Parser Ruby Wrapper.
8
8
  #
@@ -30,20 +30,13 @@ require "singleton"
30
30
  require "stanfordparser"
31
31
 
32
32
 
33
- # Make the Lexicalized Parser a singleton for the tests because it takes
34
- # several seconds to load.
35
- class StanfordParser::LexicalizedParser
36
- include Singleton
37
- end
38
-
39
-
40
33
  class LexicalizedParserTestCase < Test::Unit::TestCase
41
34
  def test_root_path
42
35
  assert_equal StanfordParser::ROOT.class, Pathname
43
36
  end
44
37
 
45
38
  def setup
46
- @parser = StanfordParser::LexicalizedParser.instance
39
+ @parser = StanfordParser::DefaultParser.instance
47
40
  @tree = @parser.apply("This is a sentence.")
48
41
  end
49
42
 
@@ -53,6 +46,8 @@ class LexicalizedParserTestCase < Test::Unit::TestCase
53
46
  end
54
47
 
55
48
  def test_localTrees
49
+ # The following call exercises the conversion from java.util.HashSet
50
+ # objects to Ruby sets.
56
51
  l = @tree.localTrees
57
52
  assert_equal l.size, 5
58
53
  assert_equal Set.new(l.collect {|t| "#{t.label}"}),
@@ -68,7 +63,7 @@ end # LexicalizedParserTestCase
68
63
 
69
64
  class TreeTestCase < Test::Unit::TestCase
70
65
  def setup
71
- @parser = StanfordParser::LexicalizedParser.instance
66
+ @parser = StanfordParser::DefaultParser.instance
72
67
  @tree = @parser.apply("This is a sentence.")
73
68
  end
74
69
 
@@ -85,12 +80,30 @@ class TreeTestCase < Test::Unit::TestCase
85
80
  end # TreeTestCase
86
81
 
87
82
 
83
+ class FeatureLabelTestCase < Test::Unit::TestCase
84
+ def test_feature_label
85
+ f = StanfordParser::FeatureLabel.new
86
+ assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
87
+ f.put(f.BEGIN_POSITION_KEY, 3)
88
+ assert_equal "END_POS", f.END_POSITION_KEY
89
+ f.put(f.END_POSITION_KEY, 7)
90
+ assert_equal "current", f.CURRENT_KEY
91
+ f.put(f.CURRENT_KEY, "word")
92
+ assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
93
+ assert_equal "word [3,7]", f.to_s
94
+ end
95
+ end
96
+
97
+
88
98
  class DocumentPreprocessorTestCase < Test::Unit::TestCase
89
99
  def setup
90
100
  @preproc = StanfordParser::DocumentPreprocessor.new
101
+ @standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
91
102
  end
92
103
 
93
104
  def test_get_sentences_from_string
105
+ # The following call exercises the conversion from java.util.ArrayList
106
+ # objects to Ruby arrays.
94
107
  s = @preproc.getSentencesFromString("This is a sentence. So is this.")
95
108
  assert_equal "#{s[0]}", "This is a sentence ."
96
109
  assert_equal "#{s[1]}", "So is this ."
@@ -100,15 +113,112 @@ class DocumentPreprocessorTestCase < Test::Unit::TestCase
100
113
  # StanfordParser::DocumentPreprocessor is not an enumerable object.
101
114
  assert_equal @preproc.map, []
102
115
  end
116
+
117
+ # Segment and tokenize text containing two sentences.
118
+ def test_standoff_document_preprocessor
119
+ sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
120
+ # Recognize two sentences.
121
+ assert_equal 2, sentences.length
122
+ assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
123
+ assert_equal "He (John) is tall.", sentences.first.to_s
124
+ assert_equal 7, sentences.first.length
125
+ assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
126
+ assert_equal "So is she.", sentences.last.to_s
127
+ assert_equal 4, sentences.last.length
128
+ assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
129
+ # Get the correct token information for the first sentence.
130
+ assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
131
+ assert_equal [0,2], [sentences[0][0].begin_position(), sentences[0][0].end_position()]
132
+ assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
133
+ assert_equal [3,4], [sentences[0][1].begin_position(), sentences[0][1].end_position()]
134
+ assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
135
+ assert_equal [4,8], [sentences[0][2].begin_position(), sentences[0][2].end_position()]
136
+ assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
137
+ assert_equal [8,9], [sentences[0][3].begin_position(), sentences[0][3].end_position()]
138
+ assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
139
+ assert_equal [10,12], [sentences[0][4].begin_position(), sentences[0][4].end_position()]
140
+ assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
141
+ assert_equal [13,17], [sentences[0][5].begin_position(), sentences[0][5].end_position()]
142
+ assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
143
+ assert_equal [17,18], [sentences[0][6].begin_position(), sentences[0][6].end_position()]
144
+ # Get the correct token information for the second sentence.
145
+ assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
146
+ assert_equal [20,22], [sentences[1][0].begin_position(), sentences[1][0].end_position()]
147
+ assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
148
+ assert_equal [23,25], [sentences[1][1].begin_position(), sentences[1][1].end_position()]
149
+ assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
150
+ assert_equal [26,29], [sentences[1][2].begin_position(), sentences[1][2].end_position()]
151
+ assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
152
+ assert_equal [29,30], [sentences[1][3].begin_position(), sentences[1][3].end_position()]
153
+ end
154
+
155
+ def test_stringification
156
+ assert_equal "<DocumentPreprocessor>", @preproc.inspect
157
+ assert_equal "<DocumentPreprocessor>", @preproc.to_s
158
+ assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
159
+ assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
160
+ end
161
+
103
162
  end # DocumentPreprocessorTestCase
104
163
 
105
164
 
165
+ class StandoffParsedTextTestCase < Test::Unit::TestCase
166
+ def setup
167
+ @text = "He (John) is tall.  So is she."
168
+ end
169
+
170
+ def test_parse_text_default_nodetype
171
+ parsed_text = StanfordParser::StandoffParsedText.new(@text)
172
+ verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
173
+ end
174
+
175
+ # Verify correct parsing with variable node types for text containing two sentences.
176
+ def verify_parsed_text(parsed_text, nodetype)
177
+ # Verify that there are two sentences.
178
+ assert_equal 2, parsed_text.length
179
+ assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
180
+ # Verify the tokens in the leaf node of the first sentence.
181
+ leaves = parsed_text[0].leaves.collect {|node| node.label}
182
+ assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
183
+ assert_equal [0,2], [leaves[0].begin_position(), leaves[0].end_position()]
184
+ assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
185
+ assert_equal [3,4], [leaves[1].begin_position(), leaves[1].end_position()]
186
+ assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
187
+ assert_equal [4,8], [leaves[2].begin_position(), leaves[2].end_position()]
188
+ assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
189
+ assert_equal [8,9], [leaves[3].begin_position(), leaves[3].end_position()]
190
+ assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
191
+ assert_equal [10,12], [leaves[4].begin_position(), leaves[4].end_position()]
192
+ assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
193
+ assert_equal [13,17], [leaves[5].begin_position(), leaves[5].end_position()]
194
+ assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
195
+ assert_equal [17,18], [leaves[6].begin_position(), leaves[6].end_position()]
196
+ # Verify the tokens in the leaf node of the second sentence.
197
+ leaves = parsed_text[1].leaves.collect {|node| node.label}
198
+ assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
199
+ assert_equal [20,22], [leaves[0].begin_position(), leaves[0].end_position()]
200
+ assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
201
+ assert_equal [23,25], [leaves[1].begin_position(), leaves[1].end_position()]
202
+ assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
203
+ assert_equal [26,29], [leaves[2].begin_position(), leaves[2].end_position()]
204
+ assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
205
+ assert_equal [29,30], [leaves[3].begin_position(), leaves[3].end_position()]
206
+ # Verify that the original string is recoverable.
207
+ assert_equal "He (John) is tall.  ", parsed_text[0].to_original_string
208
+ assert_equal "So is she." , parsed_text[1].to_original_string
209
+ # Draw < and > brackets around 3 constituents.
210
+ b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
211
+ assert_equal "<He (<John>)> is <tall>.  ", b
212
+ end
213
+ end
214
+
215
+
106
216
  class MiscPreprocessorTestCase < Test::Unit::TestCase
107
217
  def test_model_location
108
218
  assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
109
219
  end
110
-
220
+
111
221
  def test_word
112
222
  assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") == "dog"
113
223
  end
114
- end # MiscPreprocessorTestCase
224
+ end # MiscPreprocessorTestCase
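The offset assertions in the standoff tests rest on one invariant: slicing the original text at a token's begin and end positions yields the token's surface form. The sketch below illustrates that with a hypothetical Token struct standing in for StanfordParser::StandoffToken and a hypothetical spans_match? helper. The offsets are the ones asserted for the first sentence, and the input is assumed to have two spaces after the first period, which the asserted positions 17-18 and 20-22 imply.

```ruby
# Hypothetical stand-in for a standoff token: surface text plus
# character offsets into the original string.
Token = Struct.new(:current, :begin_position, :end_position)

# True when every token's offsets select exactly its surface text.
def spans_match?(text, tokens)
  tokens.all? { |t| text[t.begin_position...t.end_position] == t.current }
end

text = "He (John) is tall.  So is she."   # two spaces after the first period
tokens = [
  Token.new("He",   0,  2),
  Token.new("(",    3,  4),
  Token.new("John", 4,  8),
  Token.new(")",    8,  9),
  Token.new("is",   10, 12),
  Token.new("tall", 13, 17),
  Token.new(".",    17, 18)
]

puts spans_match?(text, tokens)   # prints true
```

The offsets are half-open ranges, so the exclusive range operator (...) is the matching slice.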
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
3
3
  specification_version: 1
4
4
  name: stanfordparser
5
5
  version: !ruby/object:Gem::Version
6
- version: 1.2.0
7
- date: 2007-12-18 00:00:00 -08:00
6
+ version: 2.0.0
7
+ date: 2008-06-13 00:00:00 -07:00
8
8
  summary: Ruby wrapper for the Stanford Natural Language Parser
9
9
  require_paths:
10
10
  - lib
@@ -30,6 +30,7 @@ authors:
30
30
  - W.P. McNeill
31
31
  files:
32
32
  - test/test_stanfordparser.rb
33
+ - lib/java_object.rb
33
34
  - lib/stanfordparser.rb
34
35
  - examples/stanford-sentence-parser.rb
35
36
  - README