stanfordparser 1.2.0 → 2.0.0

data/README CHANGED
@@ -1,35 +1,40 @@
1
- = Stanford Natural Language Parser
1
+ = Stanford Natural Language Parser Wrapper
2
2
 
3
3
  This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
4
4
 
5
- The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby.
5
+ The Stanford Natural Language Parser is a Java implementation of PCFG (probabilistic context-free grammar) and dependency parsers for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.
6
6
 
7
- = Installation and Configuration
8
-
9
- To run this module you must install the following additional software
10
7
 
11
- * The {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml]
12
- * The {Ruby Java Bridge}[http://rjb.rubyforge.org/] gem.
8
+ = Installation and Configuration
13
9
 
14
- Note that the Stanford Parser is not a Ruby application and is therefore not a Ruby gem and must be manually installed.
10
+ In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
15
11
 
16
12
  This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
17
13
 
18
- These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://www.ruby-doc.org/core/classes/YAML.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
14
+ These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
19
15
 
20
16
  root: /usr/local/stanford-parser/other/location
21
17
  jvmargs: -Xmx100m -verbose
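The override behaviour described above can be sketched as follows. This is a minimal illustration and not the module's actual loader: the helper name <tt>load_parser_config</tt> and the <tt>DEFAULTS</tt> constant are hypothetical, and the default values are the ones quoted in this README.

```ruby
require "yaml"

# Defaults quoted in the README; the constant name is hypothetical.
DEFAULTS = {
  "root"    => "/usr/local/stanford-parser/current",
  "jvmargs" => "-server -Xmx150m"
}.freeze

# Hypothetical helper: overlay the values found in a YAML document on the
# defaults. Keys missing from the YAML keep their default values, and an
# empty document leaves the defaults untouched.
def load_parser_config(yaml_text)
  overrides = YAML.load(yaml_text) || {}
  DEFAULTS.merge(overrides)
end

config = load_parser_config("root: /usr/local/stanford-parser/other/location\n")
puts config["root"]     # the overridden directory
puts config["jvmargs"]  # still the default, because the document did not set it
```

Merging over a hash of defaults means a configuration file only needs to list the values it changes.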
22
18
 
23
- =Usage
24
19
 
25
- Use the StanfordParser::LexicalizedParser class to parse sentences.
20
+ =Tokenization and Parsing
26
21
 
27
- irb(main):001:0> require 'stanfordparser'
22
+ Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
23
+
24
+ >> require "stanfordparser"
28
25
  => true
29
- irb(main):002:0> parser = StanfordParser::LexicalizedParser.new
26
+ >> preproc = StanfordParser::DocumentPreprocessor.new
27
+ => <DocumentPreprocessor>
28
+ >> puts preproc.getSentencesFromString("This is a sentence. So is this.")
29
+ This is a sentence .
30
+ So is this .
31
+
32
+ Use the StanfordParser::LexicalizedParser class to parse sentences.
33
+
34
+ >> parser = StanfordParser::LexicalizedParser.new
30
35
  Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
31
36
  => edu.stanford.nlp.parser.lexparser.LexicalizedParser
32
- irb(main):003:0> puts parser.apply("This is a sentence.")
37
+ >> puts parser.apply("This is a sentence.")
33
38
  (ROOT
34
39
  (S [24.917]
35
40
  (NP [6.139] (DT [2.300] This))
@@ -37,26 +42,77 @@ Use the StanfordParser::LexicalizedParser class to parse sentences.
37
42
  (NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
38
43
  (. [0.002] .)))
39
44
 
40
- Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into words or sentences.
45
+ For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
41
46
 
42
- irb(main):004:0> preproc = StanfordParser::DocumentPreprocessor.new
43
- irb(main):008:0> puts preproc.getSentencesFromString("This is a sentence. So is this.")
44
- This is a sentence .
45
- So is this .
46
47
 
47
- For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
48
+ =Standoff Tokenization and Parsing
49
+
50
+ This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
51
+
52
+ Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
53
+
54
+ >> preproc = StanfordParser::StandoffDocumentPreprocessor.new
55
+ => <StandoffDocumentPreprocessor>
56
+ >> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
57
+ => [This is a sentence., So is this.]
58
+
59
+ The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
60
+
61
+ >> puts s
62
+ This [0,4]
63
+ is [5,7]
64
+ a [8,9]
65
+ sentence [10,18]
66
+ . [18,19]
67
+ So [21,23]
68
+ is [24,26]
69
+ this [27,31]
70
+ . [31,32]
71
+ >> "This is a sentence.  So is this."[27..31]
72
+ => "this."
73
+
74
+ This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
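The printed offsets behave like half-open intervals: the second number is one past the last character of the token. The sketch below checks this with plain Ruby strings. It assumes the example text has two spaces after the first period, which is what the printed offsets imply; the <tt>spans</tt> table is copied by hand from the output above and is not produced by the parser here.

```ruby
# The example text, assuming two spaces after the first period.
text = "This is a sentence.  So is this."

# Offsets copied from the standoff output: [begin, end) pairs.
spans = {
  "This"     => [0, 4],
  "sentence" => [10, 18],
  "So"       => [21, 23],
  "this"     => [27, 31]
}

spans.each do |token, (first, last)|
  # A three-dot range excludes the end offset, matching the [begin, end) form.
  puts "#{token.inspect} == #{text[first...last].inspect}"
end
```

A two-dot range such as <tt>[27..31]</tt> includes the character at the end offset, which is why it also picks up the following period.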
75
+
76
+ Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
77
+
78
+ >> t = StanfordParser::StandoffParsedText.new("This is a sentence. So is this.")
79
+ Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
80
+ => <StanfordParser::StandoffParsedText, 2 sentences>
81
+ >> puts t.first
82
+ (ROOT
83
+ (S
84
+ (NP (DT This [0,4]))
85
+ (VP (VBZ is [5,7])
86
+ (NP (DT a [8,9]) (NN sentence [10,18])))
87
+ (. . [18,19])))
88
+
89
+ Standoff parse trees can reproduce the text from which they were generated verbatim.
90
+
91
+ >> t.first.to_original_string
92
+ => "This is a sentence. "
93
+
94
+ They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
95
+
96
+ >> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
97
+ => "[This] is [a sentence]. "
98
+
99
+ The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
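To make the coordinates concrete, here is a simplified sketch that is not the gem's implementation. It assumes that a coordinate lists the child index taken at each step down from the root, which is consistent with the example above, where <tt>[0,0,0]</tt> brackets "This" and <tt>[0,1,1]</tt> brackets "a sentence". The leaf coordinates, the <tt>LEAVES</tt> table, and the <tt>bracket</tt> helper are illustrative assumptions.

```ruby
# Leaves of the first example parse as [coordinate, text, trailing spacing].
# The coordinates are assumed for illustration, not real parser output.
LEAVES = [
  [[0, 0, 0, 0],    "This",     " "],
  [[0, 1, 0, 0],    "is",       " "],
  [[0, 1, 1, 0, 0], "a",        " "],
  [[0, 1, 1, 1, 0], "sentence", ""],
  [[0, 2, 0],       ".",        " "]
]

# A leaf is in the yield of a node when the node's coordinate is a prefix of
# the leaf's coordinate.
def in_yield?(node, leaf)
  leaf.first(node.length) == node
end

# Put brackets around the leaves dominated by each target node. A bracket
# opens at the first leaf of a target's yield and closes at its last leaf.
def bracket(leaves, targets, open = "[", close = "]")
  pieces = leaves.each_with_index.map do |(coord, text, after), i|
    before_coord = i.zero? ? nil : leaves[i - 1][0]
    after_coord  = leaves[i + 1] ? leaves[i + 1][0] : nil
    opens = targets.count do |t|
      in_yield?(t, coord) && !(before_coord && in_yield?(t, before_coord))
    end
    closes = targets.count do |t|
      in_yield?(t, coord) && !(after_coord && in_yield?(t, after_coord))
    end
    (open * opens) + text + (close * closes) + after
  end
  pieces.join
end

puts bracket(LEAVES, [[0, 0, 0], [0, 1, 1]])
```

Because the yield of a subtree is a contiguous run of leaves, comparing each leaf with its neighbours is enough to find where a bracket starts and ends.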
100
+
101
+ See the documentation of the individual classes in this module for more details.
48
102
 
103
+ Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
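The serialization point can be illustrated without the parser. The sketch below assumes only the field list of the StandoffToken struct defined in this module; the token values are made up, and no Java object is involved.

```ruby
# Same fields as the StandoffToken struct in this module.
StandoffToken = Struct.new(:current, :word, :before, :after,
                           :begin_position, :end_position)

# Illustrative values, not real tokenizer output.
token = StandoffToken.new("sentence", "sentence", " ", "", 10, 18)

# Marshal handles plain Ruby objects such as Struct instances.
bytes = Marshal.dump(token)
copy  = Marshal.load(bytes)

puts copy == token       # field-by-field equality
puts copy.equal?(token)  # whether both names refer to the same object
```

The round trip yields an equal but distinct object, which is what makes it possible to cache parsed text on disk and reload it later.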
49
104
 
50
105
  = History
51
106
 
52
107
  1.0.0:: Initial release
53
108
  1.1.0:: Make module initialization function private. Add example code.
54
109
  1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
110
+ 2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
55
111
 
56
112
 
57
113
  = Copyright
58
114
 
59
- Copyright 2007, William Patrick McNeill
115
+ Copyright 2007-2008, William Patrick McNeill
60
116
 
61
117
  This program is distributed under the GNU General Public License.
62
118
 
@@ -2,7 +2,7 @@
2
2
 
3
3
  #--
4
4
 
5
- # Copyright 2007 William Patrick McNeill
5
+ # Copyright 2007-2008 William Patrick McNeill
6
6
  #
7
7
  # This file is part of the Stanford Parser Ruby Wrapper.
8
8
  #
@@ -34,7 +34,7 @@
34
34
  # See the Java Stanford Parser documentation for details
35
35
  #
36
36
  # sentence::
37
- # A sentence to parse. This must be quoted.
37
+ # A sentence to parse. This must appear after all the options and be quoted.
38
38
 
39
39
 
40
40
  require "stanfordparser"
@@ -42,5 +42,5 @@ require "stanfordparser"
42
42
  # The last argument is the sentence. The rest of the command line is passed
43
43
  # along to the parser object.
44
44
  sentence = ARGV.pop
45
- parser = StanfordParser::LexicalizedParser.new("$(ROOT)/englishPCFG.ser.gz", ARGV)
45
+ parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
46
46
  puts parser.apply(sentence)
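The argument convention used by this script can be sketched in isolation. The helper <tt>split_command_line</tt> is hypothetical and the option name is a placeholder rather than a documented parser flag; the sketch only shows why the sentence has to come last.

```ruby
# Hypothetical helper mirroring the script's convention: the final argument
# is the sentence and everything before it is passed on to the parser.
def split_command_line(argv)
  args = argv.dup          # leave the caller's array untouched
  sentence = args.pop      # the last argument is the sentence
  [args, sentence]
end

# "-someOption" is a placeholder, not a documented parser flag.
options, sentence = split_command_line(["-someOption", "value", "This is a sentence."])
puts options.inspect
puts sentence
```

If the sentence were not quoted, the shell would split it into several arguments and only its last word would be treated as the sentence.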
@@ -0,0 +1,129 @@
1
+ # Copyright 2007-2008 William Patrick McNeill
2
+ #
3
+ # This file is part of the Stanford Parser Ruby Wrapper.
4
+ #
5
+ # The Stanford Parser Ruby Wrapper is free software; you can redistribute it
6
+ # and/or modify it under the terms of the GNU General Public License as
7
+ # published by the Free Software Foundation; either version 2 of the License,
8
+ # or (at your option) any later version.
9
+ #
10
+ # The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
11
+ # useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
12
+ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
13
+ # Public License for more details.
14
+ #
15
+ # You should have received a copy of the GNU General Public License along with
16
+ # the Stanford Parser Ruby Wrapper; if not, write to the Free Software Foundation, Inc., 51 Franklin
17
+ # St, Fifth Floor, Boston, MA 02110-1301 USA
18
+
19
+ # Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
20
+ # add a generic Java object wrapper class.
21
+ module Rjb
22
+
23
+ #--
24
+ # The documentation for this class appears next to its extension inside the
25
+ # StanfordParser module in stanfordparser.rb. This should be changed if Rjb
26
+ # is ever moved into its own gem. See the documentation in stanfordparser.rb
27
+ # for more details.
28
+ #++
29
+ class JavaObjectWrapper
30
+ include Enumerable
31
+
32
+ # The underlying Java object.
33
+ attr_reader :java_object
34
+
35
+ # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
36
+ # String, treat it as a Java class name and instantiate it. Otherwise,
37
+ # treat <em>obj</em> as an instance of a Java object.
38
+ def initialize(obj, *args)
39
+ @java_object = obj.class == String ?
40
+ Rjb::import(obj).send(:new, *args) : obj
41
+ end
42
+
43
+ # Enumerate all the items in the object using its iterator. If the object
44
+ # has no iterator, this function yields nothing.
45
+ def each
46
+ if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
47
+ i = @java_object.iterator
48
+ while i.hasNext
49
+ yield wrap_java_object(i.next)
50
+ end
51
+ end
52
+ end # each
53
+
54
+ # Reflect unhandled method calls to the underlying Java object and wrap
55
+ # the return value in the appropriate Ruby object.
56
+ def method_missing(m, *args)
57
+ begin
58
+ wrap_java_object(@java_object.send(m, *args))
59
+ rescue RuntimeError => e
60
+ # The instance method failed. See if this is a static method.
61
+ if not e.message.match(/^Fail: unknown method name/).nil?
62
+ getClass.send(m, *args)
63
+ end
64
+ end
65
+ end
66
+
67
+ # Convert a value returned by a call to the underlying Java object to the
68
+ # appropriate Ruby object.
69
+ #
70
+ # If the value is a JavaObjectWrapper, convert it using a protected
71
+ # function with the name wrap_ followed by the underlying object's
72
+ # classname with the Java path delimiters converted to underscores. For
73
+ # example, a <tt>java.util.ArrayList</tt> would be converted by a function
74
+ # called wrap_java_util_ArrayList.
75
+ #
76
+ # If the value lacks the appropriate converter function, wrap it in a
77
+ # generic JavaObjectWrapper.
78
+ #
79
+ # If the value is not a JavaObjectWrapper, return it unchanged.
80
+ #
81
+ # This function is called recursively for every element in an Array.
82
+ def wrap_java_object(object)
83
+ if object.kind_of?(Array)
84
+ object.collect {|item| wrap_java_object(item)}
85
+ elsif object.respond_to?(:_classname)
86
+ # Ruby-Java Bridge Java objects all have a _classname member which
87
+ # tells the name of their Java class. Convert this to the
88
+ # corresponding wrapper function name.
89
+ wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
90
+ respond_to?(wrapper_name) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
91
+ else
92
+ object
93
+ end
94
+ end
95
+
96
+ # Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
97
+ def wrap_java_util_ArrayList(object)
98
+ array_list = []
99
+ object.size.times do
100
+ |i| array_list << wrap_java_object(object.get(i))
101
+ end
102
+ array_list
103
+ end
104
+
105
+ # Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
106
+ def wrap_java_util_HashSet(object)
107
+ set = Set.new
108
+ i = object.iterator
109
+ while i.hasNext
110
+ set << wrap_java_object(i.next)
111
+ end
112
+ set
113
+ end
114
+
115
+ # Show the classname of the underlying Java object.
116
+ def inspect
117
+ "<#{@java_object._classname}>"
118
+ end
119
+
120
+ # Use the underlying Java object's stringification.
121
+ def to_s
122
+ toString
123
+ end
124
+
125
+ protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
126
+
127
+ end # JavaObjectWrapper
128
+
129
+ end # Rjb
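The name-based dispatch in wrap_java_object can be tried without a JVM. The sketch below is an illustration under stated assumptions, not the module's code: <tt>FakeJavaObject</tt> and <tt>SketchWrapper</tt> are hypothetical stand-ins, and only the <tt>_classname</tt> member of a real Rjb proxy is imitated. It passes <tt>true</tt> as the second argument to <tt>respond_to?</tt>, because on Ruby 2.0 and later <tt>respond_to?</tt> does not report protected methods otherwise.

```ruby
require "set"

# Stand-in for an Rjb proxy: only the _classname member matters here.
FakeJavaObject = Struct.new(:_classname, :items)

class SketchWrapper
  # Convert a value to a Ruby object, choosing a converter by class name.
  def wrap(object)
    if object.is_a?(Array)
      object.map { |item| wrap(item) }
    elsif object.respond_to?(:_classname)
      # "java.util.ArrayList" becomes :wrap_java_util_ArrayList
      name = ("wrap_" + object._classname.gsub(".", "_")).to_sym
      # include_all = true so that protected converters are found
      respond_to?(name, true) ? send(name, object) : [:generic, object._classname]
    else
      object
    end
  end

  protected

  def wrap_java_util_ArrayList(object)
    object.items.map { |item| wrap(item) }
  end

  def wrap_java_util_HashSet(object)
    Set.new(object.items.map { |item| wrap(item) })
  end
end

wrapper = SketchWrapper.new
puts wrapper.wrap(FakeJavaObject.new("java.util.ArrayList", [1, 2])).inspect
puts wrapper.wrap(FakeJavaObject.new("java.lang.Object", [])).inspect
```

With this scheme a new Java type is supported by defining one more method whose name follows the convention, with no change to the dispatch code.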
@@ -1,4 +1,4 @@
1
- # Copyright 2007 William Patrick McNeill
1
+ # Copyright 2007-2008 William Patrick McNeill
2
2
  #
3
3
  # This file is part of the Stanford Parser Ruby Wrapper.
4
4
  #
@@ -19,121 +19,27 @@
19
19
 
20
20
  require "pathname"
21
21
  require "rjb"
22
- require "set"
22
+ require "singleton"
23
+ begin
24
+ require "treebank"
25
+ gem "treebank", ">= 3.0.0"
26
+ rescue LoadError
27
+ require "treebank"
28
+ end
23
29
  require "yaml"
24
30
 
25
- # Extenions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
26
- # adds a generic Java object wrapper class.
27
- module Rjb
28
-
29
- # A generic wrapper for a Java object loaded via the Ruby Java Bridge. The
30
- # wrapper class handles intialization and stringification, and passes other
31
- # method calls down to the underlying Java object. Objects returned by the
32
- # underlying Java object are converted to the appropriate Ruby object.
33
- #
34
- # This object is enumerable, yielding items in the order defined by the Java
35
- # object's iterator.
36
- class JavaObjectWrapper
37
- include Enumerable
38
-
39
- # The underlying Java object.
40
- attr_reader :java_object
41
-
42
- # Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
43
- # String, assume it is a Java class name and instantiate it. Otherwise,
44
- # treat <em>obj</em> as an instance of a Java object.
45
- def initialize(obj, *args)
46
- @java_object = obj.class == String ?
47
- Rjb::import(obj).send(:new, *args) : obj
48
- end
49
-
50
- # Enumerate all the items in the object using its iterator. If the object
51
- # has no iterator, this function yields nothing.
52
- def each
53
- if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
54
- i = @java_object.iterator
55
- while i.hasNext
56
- yield wrap_java_object(i.next)
57
- end
58
- end
59
- end # each
60
-
61
- # Reflect unhandled method calls to the underlying Java object.
62
- def method_missing(m, *args)
63
- wrap_java_object(@java_object.send(m, *args))
64
- end
65
-
66
- # Convert a value returned by a call to the underlying Java object to the
67
- # appropriate Ruby object as follows:
68
- # * RJB objects are placed inside a generic JavaObjectWrapper wrapper.
69
- # * <tt>java.util.ArrayList</tt> objects are converted to Ruby Arrays.
70
- # * <tt>java.util.HashSet</tt> objects are converted to Ruby Sets
71
- # * Other objects are left unchanged.
72
- #
73
- # This function is applied recursively to items in collection objects such
74
- # as set and arrays.
75
- def wrap_java_object(object)
76
- if object.kind_of?(Array)
77
- object.collect {|item| wrap_java_object(item)}
78
- # Ruby-Java Bridge Java objects all have a _classname member which tells
79
- # the name of their Java class.
80
- elsif object.respond_to?(:_classname)
81
- case object._classname
82
- when /java\.util\.ArrayList/
83
- # Convert java.util.ArrayList objects to Ruby arrays.
84
- array_list = []
85
- object.size.times do
86
- |i| array_list << wrap_java_object(object.get(i))
87
- end
88
- array_list
89
- when /java\.util\.HashSet/
90
- # Convert java.util.HashSet objects to Ruby sets.
91
- set = Set.new
92
- i = object.iterator
93
- while i.hasNext
94
- set << wrap_java_object(i.next)
95
- end
96
- set
97
- else
98
- # Passs other RJB objects off to a handler.
99
- wrap_rjb_object(object)
100
- end # case
101
- else
102
- # Return non-RJB objects unchanged.
103
- object
104
- end # if
105
- end # wrap_java_object
106
-
107
- # By default, all RJB classes other than <tt>java.util.ArrayList</tt> and
108
- # <tt>java.util.HashSet</tt> go in a generic wrapper. Derived classes may
109
- # change this behavior.
110
- def wrap_rjb_object(object)
111
- JavaObjectWrapper.new(object)
112
- end
113
-
114
- # Show the classname of the underlying Java object.
115
- def inspect
116
- "<#{@java_object._classname}>"
117
- end
118
-
119
- # Use the underlying Java object's stringification.
120
- def to_s
121
- toString
122
- end
123
-
124
- protected :wrap_java_object, :wrap_rjb_object
125
-
126
- end # JavaObjectWrapper
127
-
128
- end # Rjb
129
-
31
+ require "java_object.rb"
130
32
 
131
33
  # Wrapper for the {Stanford Natural Language
132
34
  # Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
133
35
  module StanfordParser
134
36
 
135
- VERSION = "1.2.0"
136
-
37
+ VERSION = "2.0.0"
38
+
39
+ # The default sentence segmenter and tokenizer. This is an English-language
40
+ # tokenizer with support for Penn Treebank markup.
41
+ EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
42
+
137
43
  # Path to an English PCFG model that comes with the Stanford Parser. The
138
44
  # location is relative to the parser root directory. This is a valid value
139
45
  # for the <em>grammar</em> parameter of the LexicalizedParser constructor.
@@ -170,32 +76,53 @@ module StanfordParser
170
76
  # The root directory of the Stanford parser installation.
171
77
  ROOT = initialize_on_load
172
78
 
173
-
79
+ #--
80
+ # The documentation below is for the original Rjb::JavaObjectWrapper object.
81
+ # It is reproduced here because rdoc only takes the last document block
82
+ # defined. If Rjb is moved into its own gem, this documentation should go
83
+ # with it, and the following should be written as documentation for this
84
+ # class:
85
+ #
174
86
  # Extension of the generic Ruby-Java Bridge wrapper object for the
175
87
  # StanfordParser module.
176
- class JavaObjectWrapper < Rjb::JavaObjectWrapper
177
- # Wrap a return value with a specialized wrapper class in the
178
- # StanfordParser module in the appropriate class.
179
- def wrap_rjb_object(object)
180
- case object._classname
181
- when /^edu\.stanford\.nlp\.trees\.
182
- (Tree|LabeledScoredTreeLeaf|
183
- LabeledScoredTreeNode|
184
- SimpleTree|TreeGraphNode)$/x
185
- # Tree objects go inside a Tree wrapper.
186
- Tree.new(object)
187
- else
188
- super(object)
189
- end # case
190
- end # wrap_rjb_object
191
- end # JavaObjectWrapper
88
+ #++
89
+ # A generic wrapper for a Java object loaded via the {Ruby-Java
90
+ # Bridge}[http://rjb.rubyforge.org/]. The wrapper class handles
91
+ # initialization and stringification, and passes other method calls down to
92
+ # the underlying Java object. Objects returned by the underlying Java
93
+ # object are converted to the appropriate Ruby object.
94
+ #
95
+ # Other modules may extend the list of Java objects that are converted by
96
+ # adding their own converter functions. See wrap_java_object for details.
97
+ #
98
+ # This object is enumerable, yielding items in the order defined by the
99
+ # underlying Java object's iterator.
100
+ class Rjb::JavaObjectWrapper
101
+ # FeatureLabel objects go inside a FeatureLabel wrapper.
102
+ def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
103
+ StanfordParser::FeatureLabel.new(object)
104
+ end
105
+
106
+ # Tree objects go inside a Tree wrapper. Various tree types are aliased
107
+ # to this function.
108
+ def wrap_edu_stanford_nlp_trees_Tree(object)
109
+ Tree.new(object)
110
+ end
111
+
112
+ alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
113
+ alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
114
+ alias :wrap_edu_stanford_nlp_trees_SimpleTree :wrap_edu_stanford_nlp_trees_Tree
115
+ alias :wrap_edu_stanford_nlp_trees_TreeGraphNode :wrap_edu_stanford_nlp_trees_Tree
116
+
117
+ protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
118
+ end # Rjb::JavaObjectWrapper
192
119
 
193
120
 
194
121
  # Lexicalized probabilistic parser.
195
122
  #
196
123
  # This is a wrapper for the
197
124
  # <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
198
- class LexicalizedParser < JavaObjectWrapper
125
+ class LexicalizedParser < Rjb::JavaObjectWrapper
199
126
  # The grammar used by the parser
200
127
  attr_reader :grammar
201
128
 
@@ -220,10 +147,17 @@ module StanfordParser
220
147
  end # LexicalizedParser
221
148
 
222
149
 
150
+ # A singleton instance of the default Stanford Natural Language parser. A
151
+ # singleton is used because the parser can take a few seconds to load.
152
+ class DefaultParser < StanfordParser::LexicalizedParser
153
+ include Singleton
154
+ end
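The effect of the Singleton mixin can be seen in a small sketch that needs no parser. <tt>SlowResource</tt> and its <tt>load_count</tt> counter are hypothetical; the counter stands in for the few seconds the real parser spends loading its grammar.

```ruby
require "singleton"

# Hypothetical stand-in for an object that is slow to construct.
class SlowResource
  include Singleton

  class << self
    attr_accessor :load_count
  end
  self.load_count = 0

  def initialize
    # A real parser would load its grammar file here.
    self.class.load_count += 1
  end
end

first  = SlowResource.instance
second = SlowResource.instance
puts first.equal?(second)     # whether both names refer to one object
puts SlowResource.load_count  # how many times the constructor ran
```

Every caller that asks for <tt>instance</tt> shares the same object, so the loading cost is paid once per process.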
155
+
156
+
223
157
  # This is a wrapper for
224
158
  # <tt>edu.stanford.nlp.trees.Tree</tt> objects. It customizes
225
159
  # stringification.
226
- class Tree < JavaObjectWrapper
160
+ class Tree < Rjb::JavaObjectWrapper
227
161
  def initialize(obj = "edu.stanford.nlp.trees.Tree")
228
162
  super(obj)
229
163
  end
@@ -245,16 +179,16 @@ module StanfordParser
245
179
  # This is a wrapper for
246
180
  # <tt>edu.stanford.nlp.ling.Word</tt> objects. It customizes
247
181
  # stringification and adds an equivalence operator.
248
- class Word < JavaObjectWrapper
182
+ class Word < Rjb::JavaObjectWrapper
249
183
  def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
250
184
  super(obj, *args)
251
185
  end
252
-
186
+
253
187
  # See the word values.
254
188
  def inspect
255
189
  to_s
256
190
  end
257
-
191
+
258
192
  # Equivalence is defined relative to the word value.
259
193
  def ==(other)
260
194
  word == other
@@ -262,11 +196,34 @@ module StanfordParser
262
196
  end # Word
263
197
 
264
198
 
199
+ # This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
200
+ # It customizes stringification.
201
+ class FeatureLabel < Rjb::JavaObjectWrapper
202
+ def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
203
+ super
204
+ end
205
+
206
+ # Stringify with just the token and its begin and end position.
207
+ def to_s
208
+ # BUGBUG The position values come back as java.lang.Integer though I
209
+ # would expect Rjb to convert them to Ruby integers.
210
+ begin_position = get(self.BEGIN_POSITION_KEY)
211
+ end_position = get(self.END_POSITION_KEY)
212
+ "#{current} [#{begin_position},#{end_position}]"
213
+ end
214
+
215
+ # More verbose stringification with all the fields and their values.
216
+ def inspect
217
+ toString
218
+ end
219
+ end
220
+
221
+
265
222
  # Tokenizes documents into words and sentences.
266
223
  #
267
224
  # This is a wrapper for the
268
225
  # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
269
- class DocumentPreprocessor < JavaObjectWrapper
226
+ class DocumentPreprocessor < Rjb::JavaObjectWrapper
270
227
  def initialize(suppressEscaping = false)
271
228
  super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
272
229
  end
@@ -276,6 +233,229 @@ module StanfordParser
276
233
  s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
277
234
  _invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
278
235
  end
236
+
237
+ def inspect
238
+ "<#{self.class.to_s.split('::').last}>"
239
+ end
240
+
241
+ def to_s
242
+ inspect
243
+ end
279
244
  end # DocumentPreprocessor
280
245
 
246
+ StandoffToken = Struct.new(:current, :word, :before, :after,
247
+ :begin_position, :end_position)
248
+
249
+ # A text token that contains raw and normalized token identity (e.g. "(" and
250
+ # "-LRB-"), an offset span, and the characters immediately preceding and
251
+ # following the token. Given a list of these objects it is possible to
252
+ # recreate the text from which they came verbatim.
253
+ class StandoffToken
254
+ def to_s
255
+ "#{current} [#{begin_position},#{end_position}]"
256
+ end
257
+ end
258
+
259
+
260
+ # A preprocessor that segments text into sentences and tokens that contain
261
+ # character offset and token context information that can be used for
262
+ # standoff annotation.
263
+ class StandoffDocumentPreprocessor < DocumentPreprocessor
264
+ def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
265
+ # PTBTokenizer.factory is a static function, so use RJB to call it
266
+ # directly instead of going through a JavaObjectWrapper. We do it this
267
+ # way because the Stanford parser Java code does not provide a
268
+ # constructor that allows you to set the second parameter,
269
+ # invertible, to true, and we need this to write character offset
270
+ # information into the tokens.
271
+ ptb_tokenizer_class = Rjb::import(tokenizer)
272
+ # See the documentation for
273
+ # <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
274
+ # description of these parameters.
275
+ ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
276
+ super(ptb_tokenizer_factory)
277
+ end
278
+
279
+ # Returns a list of sentences in a string. This wraps the returned
280
+ # sentences in a StandoffSentence object.
281
+ def getSentencesFromString(s)
282
+ super(s).map!{|s| StandoffSentence.new(s)}
283
+ end
284
+ end
285
+
286
+
287
+ # A sentence is an array of StandoffToken objects.
288
+ class StandoffSentence < Array
289
+ # Construct an array of StandoffToken objects from a Java list sentence
290
+ # object returned by the preprocessor.
291
+ def initialize(stanford_parser_sentence)
292
+ # Convert FeatureLabel wrappers to StandoffToken objects.
293
+ s = stanford_parser_sentence.to_a.collect do |fs|
294
+ current = fs.current
295
+ word = fs.word
296
+ before = fs.before
297
+ after = fs.after
298
+ # The to_s.to_i is necessary because the get function returns
299
+ # java.lang.Integer objects instead of Ruby integers.
300
+ begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
301
+ end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
302
+ StandoffToken.new(current, word, before, after,
303
+ begin_position, end_position)
304
+ end
305
+ super(s)
306
+ end
307
+
308
+ # Return the original string verbatim.
309
+ def to_s
310
+ self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
311
+ end
312
+
313
+ # Return the original string verbatim.
314
+ def inspect
315
+ to_s
316
+ end
317
+ end
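Verbatim reconstruction relies on each token keeping its raw text and the spacing that follows it. The sketch below assumes only the StandoffToken field list; the tokens are built by hand for the text "Yes (really)." and are not real tokenizer output. Unlike the method above, the hypothetical <tt>original_text</tt> helper returns an empty string for an empty token list.

```ruby
StandoffToken = Struct.new(:current, :word, :before, :after,
                           :begin_position, :end_position)

# Hand-built tokens for the text "Yes (really)."; the values are illustrative.
TOKENS = [
  StandoffToken.new("Yes",    "Yes",    "",  " ", 0,  3),
  StandoffToken.new("(",      "-LRB-",  " ", "",  4,  5),
  StandoffToken.new("really", "really", "",  "",  5,  11),
  StandoffToken.new(")",      "-RRB-",  "",  "",  11, 12),
  StandoffToken.new(".",      ".",      "",  "",  12, 13)
]

# Join every token's raw text with the spacing that follows it, leaving off
# the spacing after the final token.
def original_text(tokens)
  return "" if tokens.empty?
  tokens[0..-2].map { |t| t.current + t.after }.join + tokens.last.current
end

puts original_text(TOKENS)         # raw text
puts TOKENS.map(&:word).join(" ")  # normalized tokens
```

The <tt>current</tt> field carries the raw characters while <tt>word</tt> carries the normalized form, so the same token list serves both the parser and the reconstruction.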
318
+
319
+
320
+ # Standoff syntactic annotation of natural language text which may contain
321
+ # multiple sentences.
322
+ #
323
+ # This is an Array of StandoffNode objects, one for each sentence in the
324
+ # text.
325
+ class StandoffParsedText < Array
326
+ # Parse the text and create the standoff annotation.
327
+ #
328
+ # The default parser is a singleton instance of the English language
329
+ # Stanford Natural Language parser. There may be a delay of a few
330
+ # seconds for it to load the first time it is created.
331
+ def initialize(text, nodetype = StandoffNode,
332
+ tokenizer = EN_PENN_TREEBANK_TOKENIZER,
333
+ parser = DefaultParser.instance)
334
+ preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
335
+ # Segment the text into sentences. Parse each sentence, writing
336
+ # standoff annotation information into the terminal nodes.
337
+ preprocessor.getSentencesFromString(text).map do |sentence|
338
+ parse = parser.apply(sentence.to_s)
339
+ push(nodetype.new(parse, sentence))
340
+ end
341
+ end
342
+
343
+ # Print class name and number of sentences.
344
+ def inspect
345
+ "<#{self.class.name}, #{length} sentences>"
346
+ end
347
+
348
+ # Print parses.
349
+ def to_s
350
+ flatten.join(" ")
351
+ end
352
+ end
353
+
354
+
355
+ # Standoff syntactic tree annotation of text. Terminal nodes are labeled
356
+ # with the appropriate StandoffToken objects. Standoff parses can reproduce
357
+ # the original string from which they were generated verbatim, optionally
358
+ # with brackets around the yields of specified non-terminal nodes.
359
+ class StandoffNode < Treebank::Node
360
+ # Create the standoff tree from a tree returned by the Stanford parser.
361
+ # For non-terminal nodes, the <em>tokens</em> argument will be a
362
+ # StandoffSentence containing the StandoffToken objects representing all
363
+ # the tokens beneath and after this node. For terminal nodes, the
364
+ # <em>tokens</em> argument will be a StandoffToken.
365
+ def initialize(stanford_parser_node, tokens)
366
+ # Annotate this node with a non-terminal label or a StandoffToken as
367
+ # appropriate.
368
+ super(tokens.instance_of?(StandoffSentence) ?
369
+ stanford_parser_node.value : tokens)
370
+ # Enumerate the children depth-first. Tokens are removed from the list
371
+ # left-to-right as terminal nodes are added to the tree.
372
+ stanford_parser_node.children.each do |child|
373
+ subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
374
+ attach_child!(subtree)
375
+ end
376
+ end
377
+
378
+ # Return the original text string dominated by this node.
379
+ def to_original_string
380
+ leaves.inject("") do |s, leaf|
381
+ s += leaf.label.current + leaf.label.after
382
+ end
383
+ end
384
+
385
+ # Print the original string with brackets around word spans dominated by
386
+ # the specified constituents.
387
+ #
388
+ # The constituents to bracket are specified by passing a list of node
389
+ # coordinates, which are arrays of integers of the form returned by the
390
+ # tree enumerators of Treebank::Node objects.
391
+ #
392
+ # _coords_:: the coordinates of the nodes around which to place brackets
393
+ # _open_:: the open bracket symbol
394
+ # _close_:: the close bracket symbol
395
+ def to_bracketed_string(coords, open = "[", close = "]")
396
+ # Get a list of all the leaf nodes and their coordinates.
397
+ items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
398
+ # Enumerate over all the matching constituents inserting open and close
399
+ # brackets around their yields in the items list.
400
+ coords.each do |matching|
401
+ # Insert using a simple state machine with three states: :start,
402
+ # :open, and :close.
403
+ state = :start
404
+ # Enumerate over the items list looking for nodes that are the
405
+ # children of the matching constituent.
406
+ items.each_with_index do |item, index|
407
+ # Skip inserted bracket characters.
408
+ next if item.is_a? String
409
+ # Handle terminal node items with the state machine.
410
+ node, terminal_coordinate = item
411
+ if state == :start
412
+ next if not in_yield?(matching, terminal_coordinate)
413
+ items.insert(index, open)
414
+ state = :open
415
+ else # state == :open
416
+ next if in_yield?(matching, terminal_coordinate)
417
+ items.insert(index, close)
418
+ state = :close
419
+ break
420
+ end
421
+ end # items.each_with_index
422
+ # Handle the case where a matching constituent is flush with the end
423
+ # of the sentence.
424
+ items << close if state == :open
425
+ end # each
426
+ # Replace terminal nodes with their string representations. Insert
427
+ # spacing characters in the list.
428
+ items.each_with_index do |item, index|
429
+ next if item.is_a? String
430
+ text = item.first.label.current
431
+ spacing = item.first.label.after
432
+ # Replace the terminal node with its text.
433
+ items[index] = text
434
+ # Insert the spacing that comes after this text before the first
435
+ # non-close bracket character.
436
+ close_pos = find_index(items[index+1..-1]) {|item| not item == close}
437
+ items.insert(index + close_pos + 1, spacing)
438
+ end
439
+ items.join
440
+ end # to_bracketed_string
441
+
442
+ # Find the index of the first item in _list_ for which _block_ is true.
443
+ # Return 0 if no items are found.
444
+ def find_index(list, &block)
445
+ list.each_with_index do |item, index|
446
+ return index if block.call(item)
447
+ end
448
+ 0
449
+ end
450
+
451
+ # Is the node at _terminal_ in the yield of the node at _node_?
452
+ def in_yield?(node, terminal)
453
+ # If node A's coordinates match the prefix of node B's coordinates, node
454
+ # B is in the yield of node A.
455
+ terminal.first(node.length) == node
456
+ end
457
+
458
+ private :in_yield?, :find_index
459
+ end # StandoffNode
460
+
281
461
  end # StanfordParser
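The coordinate logic behind in_yield? and to_bracketed_string can be tried without the Java parser. The sketch below is only an illustration. The Leaf struct, the bracket helper, and the leaf coordinates are hypothetical and not part of the gem. It also uses a simpler rule than the gem's state machine: an open bracket goes before the first leaf in a constituent's yield, and a close bracket goes after the last one.

```ruby
# Hypothetical stand-in for a leaf node: its text, the whitespace that
# follows it, and its tree coordinate (a path of child indexes).
Leaf = Struct.new(:text, :after, :coord)

# Same prefix test as the gem's private in_yield? method: a leaf is in
# the yield of a node when the node's coordinate is a prefix of the leaf's.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Simplified bracketing (an assumption, not the gem's algorithm): open
# before the first leaf of each constituent, close after the last one.
def bracket(leaves, coords, open = "[", close = "]")
  out = String.new
  leaves.each_with_index do |leaf, i|
    prev_leaf = i.zero? ? nil : leaves[i - 1]
    next_leaf = leaves[i + 1]
    coords.each do |c|
      next unless in_yield?(c, leaf.coord)
      out << open if prev_leaf.nil? || !in_yield?(c, prev_leaf.coord)
    end
    out << leaf.text
    coords.each do |c|
      next unless in_yield?(c, leaf.coord)
      out << close if next_leaf.nil? || !in_yield?(c, next_leaf.coord)
    end
    out << leaf.after
  end
  out
end

# Leaf coordinates are invented for illustration; a real parse may differ.
leaves = [
  Leaf.new("He",   " ", [0, 0, 0, 0]),
  Leaf.new("(",    "",  [0, 0, 1, 0, 0]),
  Leaf.new("John", "",  [0, 0, 1, 1, 0]),
  Leaf.new(")",    " ", [0, 0, 1, 2, 0]),
  Leaf.new("is",   " ", [0, 1, 0, 0]),
  Leaf.new("tall", "",  [0, 1, 1, 0, 0]),
  Leaf.new(".",    " ", [0, 2, 0])
]
coords = [[0, 0], [0, 0, 1, 1], [0, 1, 1]]

puts in_yield?([0, 0], [0, 0, 1, 1])
puts in_yield?([0, 0], [0, 1, 1])
puts bracket(leaves, coords, "<", ">")
```

As in the gem, the spacing that follows a token is written after any close brackets, so brackets hug the words they enclose.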
@@ -2,7 +2,7 @@
2
2
 
3
3
  #--
4
4
 
5
- # Copyright 2007 William Patrick McNeill
5
+ # Copyright 2007-2008 William Patrick McNeill
6
6
  #
7
7
  # This file is part of the Stanford Parser Ruby Wrapper.
8
8
  #
@@ -30,20 +30,13 @@ require "singleton"
30
30
  require "stanfordparser"
31
31
 
32
32
 
33
- # Make the Lexicalized Parser a singleton for the tests because it takes
34
- # several seconds to load.
35
- class StanfordParser::LexicalizedParser
36
- include Singleton
37
- end
38
-
39
-
40
33
  class LexicalizedParserTestCase < Test::Unit::TestCase
41
34
  def test_root_path
42
35
  assert_equal StanfordParser::ROOT.class, Pathname
43
36
  end
44
37
 
45
38
  def setup
46
- @parser = StanfordParser::LexicalizedParser.instance
39
+ @parser = StanfordParser::DefaultParser.instance
47
40
  @tree = @parser.apply("This is a sentence.")
48
41
  end
49
42
 
@@ -53,6 +46,8 @@ class LexicalizedParserTestCase < Test::Unit::TestCase
53
46
  end
54
47
 
55
48
  def test_localTrees
49
+ # The following call exercises the conversion from java.util.HashSet
50
+ # objects to Ruby sets.
56
51
  l = @tree.localTrees
57
52
  assert_equal l.size, 5
58
53
  assert_equal Set.new(l.collect {|t| "#{t.label}"}),
@@ -68,7 +63,7 @@ end # LexicalizedParserTestCase
68
63
 
69
64
  class TreeTestCase < Test::Unit::TestCase
70
65
  def setup
71
- @parser = StanfordParser::LexicalizedParser.instance
66
+ @parser = StanfordParser::DefaultParser.instance
72
67
  @tree = @parser.apply("This is a sentence.")
73
68
  end
74
69
 
@@ -85,12 +80,30 @@ class TreeTestCase < Test::Unit::TestCase
85
80
  end # TreeTestCase
86
81
 
87
82
 
83
+ class FeatureLabelTestCase < Test::Unit::TestCase
84
+ def test_feature_label
85
+ f = StanfordParser::FeatureLabel.new
86
+ assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
87
+ f.put(f.BEGIN_POSITION_KEY, 3)
88
+ assert_equal "END_POS", f.END_POSITION_KEY
89
+ f.put(f.END_POSITION_KEY, 7)
90
+ assert_equal "current", f.CURRENT_KEY
91
+ f.put(f.CURRENT_KEY, "word")
92
+ assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
93
+ assert_equal "word [3,7]", f.to_s
94
+ end
95
+ end
96
+
97
+
88
98
  class DocumentPreprocessorTestCase < Test::Unit::TestCase
89
99
  def setup
90
100
  @preproc = StanfordParser::DocumentPreprocessor.new
101
+ @standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
91
102
  end
92
103
 
93
104
  def test_get_sentences_from_string
105
+ # The following call exercises the conversion from java.util.ArrayList
106
+ # objects to Ruby arrays.
94
107
  s = @preproc.getSentencesFromString("This is a sentence. So is this.")
95
108
  assert_equal "#{s[0]}", "This is a sentence ."
96
109
  assert_equal "#{s[1]}", "So is this ."
@@ -100,15 +113,112 @@ class DocumentPreprocessorTestCase < Test::Unit::TestCase
100
113
  # StanfordParser::DocumentPreprocessor is not an enumerable object.
101
114
  assert_equal @preproc.map, []
102
115
  end
116
+
117
+ # Segment and tokenize text containing two sentences.
118
+ def test_standoff_document_preprocessor
119
+ sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
120
+ # Recognize two sentences.
121
+ assert_equal 2, sentences.length
122
+ assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
123
+ assert_equal "He (John) is tall.", sentences.first.to_s
124
+ assert_equal 7, sentences.first.length
125
+ assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
126
+ assert_equal "So is she.", sentences.last.to_s
127
+ assert_equal 4, sentences.last.length
128
+ assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
129
+ # Get the correct token information for the first sentence.
130
+ assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
131
+ assert_equal [0,2], [sentences[0][0].begin_position(), sentences[0][0].end_position()]
132
+ assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
133
+ assert_equal [3,4], [sentences[0][1].begin_position(), sentences[0][1].end_position()]
134
+ assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
135
+ assert_equal [4,8], [sentences[0][2].begin_position(), sentences[0][2].end_position()]
136
+ assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
137
+ assert_equal [8,9], [sentences[0][3].begin_position(), sentences[0][3].end_position()]
138
+ assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
139
+ assert_equal [10,12], [sentences[0][4].begin_position(), sentences[0][4].end_position()]
140
+ assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
141
+ assert_equal [13,17], [sentences[0][5].begin_position(), sentences[0][5].end_position()]
142
+ assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
143
+ assert_equal [17,18], [sentences[0][6].begin_position(), sentences[0][6].end_position()]
144
+ # Get the correct token information for the second sentence.
145
+ assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
146
+ assert_equal [20,22], [sentences[1][0].begin_position(), sentences[1][0].end_position()]
147
+ assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
148
+ assert_equal [23,25], [sentences[1][1].begin_position(), sentences[1][1].end_position()]
149
+ assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
150
+ assert_equal [26,29], [sentences[1][2].begin_position(), sentences[1][2].end_position()]
151
+ assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
152
+ assert_equal [29,30], [sentences[1][3].begin_position(), sentences[1][3].end_position()]
153
+ end
154
+
155
+ def test_stringification
156
+ assert_equal "<DocumentPreprocessor>", @preproc.inspect
157
+ assert_equal "<DocumentPreprocessor>", @preproc.to_s
158
+ assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
159
+ assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
160
+ end
161
+
103
162
  end # DocumentPreprocessorTestCase
104
163
 
105
164
 
165
+ class StandoffParsedTextTestCase < Test::Unit::TestCase
166
+ def setup
167
+ @text = "He (John) is tall.  So is she."
168
+ end
169
+
170
+ def test_parse_text_default_nodetype
171
+ parsed_text = StanfordParser::StandoffParsedText.new(@text)
172
+ verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
173
+ end
174
+
175
+ # Verify correct parsing with variable node types for text containing two sentences.
176
+ def verify_parsed_text(parsed_text, nodetype)
177
+ # Verify that there are two sentences.
178
+ assert_equal 2, parsed_text.length
179
+ assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
180
+ # Verify the tokens in the leaf node of the first sentence.
181
+ leaves = parsed_text[0].leaves.collect {|node| node.label}
182
+ assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
183
+ assert_equal [0,2], [leaves[0].begin_position(), leaves[0].end_position()]
184
+ assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
185
+ assert_equal [3,4], [leaves[1].begin_position(), leaves[1].end_position()]
186
+ assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
187
+ assert_equal [4,8], [leaves[2].begin_position(), leaves[2].end_position()]
188
+ assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
189
+ assert_equal [8,9], [leaves[3].begin_position(), leaves[3].end_position()]
190
+ assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
191
+ assert_equal [10,12], [leaves[4].begin_position(), leaves[4].end_position()]
192
+ assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
193
+ assert_equal [13,17], [leaves[5].begin_position(), leaves[5].end_position()]
194
+ assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
195
+ assert_equal [17,18], [leaves[6].begin_position(), leaves[6].end_position()]
196
+ # Verify the tokens in the leaf node of the second sentence.
197
+ leaves = parsed_text[1].leaves.collect {|node| node.label}
198
+ assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
199
+ assert_equal [20,22], [leaves[0].begin_position(), leaves[0].end_position()]
200
+ assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
201
+ assert_equal [23,25], [leaves[1].begin_position(), leaves[1].end_position()]
202
+ assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
203
+ assert_equal [26,29], [leaves[2].begin_position(), leaves[2].end_position()]
204
+ assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
205
+ assert_equal [29,30], [leaves[3].begin_position(), leaves[3].end_position()]
206
+ # Verify that the original string is recoverable.
207
+ assert_equal "He (John) is tall. ", parsed_text[0].to_original_string
208
+ assert_equal "So is she." , parsed_text[1].to_original_string
209
+ # Draw < and > brackets around 3 constituents.
210
+ b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
211
+ assert_equal "<He (<John>)> is <tall>. ", b
212
+ end
213
+ end
214
+
215
+
106
216
  class MiscPreprocessorTestCase < Test::Unit::TestCase
107
217
  def test_model_location
108
218
  assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
109
219
  end
110
-
220
+
111
221
  def test_word
112
222
  assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") == "dog"
113
223
  end
114
- end # MiscPreprocessorTestCase
224
+ end # MiscPreprocessorTestCase
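The standoff tests above check that each token carries character offsets into the original text. The following sketch shows what those offsets mean, using the values asserted in the tests. The Token struct is a hypothetical stand-in for StanfordParser::StandoffToken. The two spaces between the sentences are an assumption inferred from the asserted offsets, since "So" begins at 20 rather than 19. Deriving the trailing whitespace from the gap between tokens is also an assumption, because the gem reads it from the token's after value.

```ruby
# Hypothetical stand-in for a standoff token: the surface text plus
# begin and end character offsets into the original string.
Token = Struct.new(:current, :begin_position, :end_position)

# Two spaces after the first sentence are assumed from the test offsets.
text = "He (John) is tall.  So is she."

# Offsets copied from the assertions in the test case.
tokens = [
  Token.new("He",    0,  2), Token.new("(",   3,  4),
  Token.new("John",  4,  8), Token.new(")",   8,  9),
  Token.new("is",   10, 12), Token.new("tall", 13, 17),
  Token.new(".",    17, 18), Token.new("So",  20, 22),
  Token.new("is",   23, 25), Token.new("she", 26, 29),
  Token.new(".",    29, 30)
]

# Each offset pair is a half-open range that slices the token's text.
spans_match = tokens.all? do |t|
  text[t.begin_position...t.end_position] == t.current
end

# Rebuild the text from each token plus the gap up to the next token.
rebuilt = tokens.each_with_index.map do |t, i|
  following = tokens[i + 1]
  gap_end = following ? following.begin_position : text.length
  t.current + text[t.end_position...gap_end]
end.join

puts spans_match
puts rebuilt == text
```

Because the offsets index the untouched input, the original string stays recoverable even when the parser normalizes a token, as it does when it reports "(" as -LRB-.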
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
3
3
  specification_version: 1
4
4
  name: stanfordparser
5
5
  version: !ruby/object:Gem::Version
6
- version: 1.2.0
7
- date: 2007-12-18 00:00:00 -08:00
6
+ version: 2.0.0
7
+ date: 2008-06-13 00:00:00 -07:00
8
8
  summary: Ruby wrapper for the Stanford Natural Language Parser
9
9
  require_paths:
10
10
  - lib
@@ -30,6 +30,7 @@ authors:
30
30
  - W.P. McNeill
31
31
  files:
32
32
  - test/test_stanfordparser.rb
33
+ - lib/java_object.rb
33
34
  - lib/stanfordparser.rb
34
35
  - examples/stanford-sentence-parser.rb
35
36
  - README