stanfordparser 1.2.0 → 2.0.0
- data/README +77 -21
- data/examples/stanford-sentence-parser.rb +3 -3
- data/lib/java_object.rb +129 -0
- data/lib/stanfordparser.rb +312 -132
- data/test/test_stanfordparser.rb +122 -12
- metadata +3 -2
data/README
CHANGED
@@ -1,35 +1,40 @@
|
|
1
|
-
= Stanford Natural Language Parser
|
1
|
+
= Stanford Natural Language Parser Wrapper
|
2
2
|
|
3
3
|
This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
|
4
4
|
|
5
|
-
The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby.
|
5
|
+
The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby along with pure Ruby objects that enable standoff parsing.
|
6
6
|
|
7
|
-
= Installation and Configuration
|
8
|
-
|
9
|
-
To run this module you must install the following additional software
|
10
7
|
|
11
|
-
|
12
|
-
* The {Ruby Java Bridge}[http://rjb.rubyforge.org/] gem.
|
8
|
+
= Installation and Configuration
|
13
9
|
|
14
|
-
|
10
|
+
In addition to the Ruby gems it requires, to run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
|
15
11
|
|
16
12
|
This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
|
17
13
|
|
18
|
-
These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://
|
14
|
+
These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
|
19
15
|
|
20
16
|
root: /usr/local/stanford-parser/other/location
|
21
17
|
jvmargs: -Xmx100m -verbose
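The override behaviour described in the README can be sketched in a few lines of Ruby. This is a minimal sketch, assuming the defaults live in a hash and the parsed file is merged over them; `DEFAULTS` and `load_parser_config` are hypothetical names used only for illustration and are not taken from the gem.

```ruby
require "yaml"

# Hypothetical defaults mirroring the values the README describes.
DEFAULTS = {
  "root"    => "/usr/local/stanford-parser/current",
  "jvmargs" => "-server -Xmx150m"
}.freeze

# Merge settings parsed from YAML text over the defaults.
# Keys absent from the YAML keep their default values.
def load_parser_config(yaml_text)
  overrides = YAML.safe_load(yaml_text) || {}
  DEFAULTS.merge(overrides)
end

config = load_parser_config(<<~YAML)
  root: /usr/local/stanford-parser/other/location
  jvmargs: -Xmx100m -verbose
YAML

puts config["root"]
puts config["jvmargs"].split.inspect
```

Splitting `jvmargs` on whitespace gives one string per JVM flag, which is one plausible way to hand the arguments to a VM launcher.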
|
22
18
|
|
23
|
-
=Usage
|
24
19
|
|
25
|
-
|
20
|
+
=Tokenization and Parsing
|
26
21
|
|
27
|
-
|
22
|
+
Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
|
23
|
+
|
24
|
+
>> require "stanfordparser"
|
28
25
|
=> true
|
29
|
-
|
26
|
+
>> preproc = StanfordParser::DocumentPreprocessor.new
|
27
|
+
=> <DocumentPreprocessor>
|
28
|
+
>> puts preproc.getSentencesFromString("This is a sentence. So is this.")
|
29
|
+
This is a sentence .
|
30
|
+
So is this .
|
31
|
+
|
32
|
+
Use the StanfordParser::LexicalizedParser class to parse sentences.
|
33
|
+
|
34
|
+
>> parser = StanfordParser::LexicalizedParser.new
|
30
35
|
Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
|
31
36
|
=> edu.stanford.nlp.parser.lexparser.LexicalizedParser
|
32
|
-
|
37
|
+
>> puts parser.apply("This is a sentence.")
|
33
38
|
(ROOT
|
34
39
|
(S [24.917]
|
35
40
|
(NP [6.139] (DT [2.300] This))
|
@@ -37,26 +42,77 @@ Use the StanfordParser::LexicalizedParser class to parse sentences.
|
|
37
42
|
(NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
|
38
43
|
(. [0.002] .)))
|
39
44
|
|
40
|
-
|
45
|
+
For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
|
41
46
|
|
42
|
-
irb(main):004:0> preproc = StanfordParser::DocumentPreprocessor.new
|
43
|
-
irb(main):008:0> puts preproc.getSentencesFromString("This is a sentence. So is this.")
|
44
|
-
This is a sentence .
|
45
|
-
So is this .
|
46
47
|
|
47
|
-
|
48
|
+
=Standoff Tokenization and Parsing
|
49
|
+
|
50
|
+
This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
|
51
|
+
|
52
|
+
Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
|
53
|
+
|
54
|
+
>> preproc = StanfordParser::StandoffDocumentPreprocessor.new
|
55
|
+
=> <StandoffDocumentPreprocessor>
|
56
|
+
>> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
|
57
|
+
=> [This is a sentence., So is this.]
|
58
|
+
|
59
|
+
The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
|
60
|
+
|
61
|
+
>> puts s
|
62
|
+
This [0,4]
|
63
|
+
is [5,7]
|
64
|
+
a [8,9]
|
65
|
+
sentence [10,18]
|
66
|
+
. [18,19]
|
67
|
+
So [21,23]
|
68
|
+
is [24,26]
|
69
|
+
this [27,31]
|
70
|
+
. [31,32]
|
71
|
+
>> "This is a sentence.  So is this."[27..31]
|
72
|
+
=> "this."
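The offsets printed above behave like half-open spans: the end position is one past the last character of the token. The sketch below checks that reading against three of the listed tokens. It assumes the input string has two spaces after the first period, which is what the listed offsets imply (`.` ends at 19 and `So` starts at 21); the `spans` hash is a hand-copied illustration, not output from the gem.

```ruby
text = "This is a sentence.  So is this."

# Offsets copied from the README output above, read as [begin, end) pairs.
spans = { "So" => [21, 23], "this" => [27, 31], "." => [31, 32] }

spans.each do |token, (first, last)|
  # The exclusive range first...last recovers exactly the token text.
  puts "#{token.inspect} -> #{text[first...last].inspect}"
end
```

Under that reading, the inclusive range `[27..31]` in the README example picks up one extra character, which is why it returns the token together with the final period.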
|
73
|
+
|
74
|
+
This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
|
75
|
+
|
76
|
+
Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
|
77
|
+
|
78
|
+
>> t = StanfordParser::StandoffParsedText.new("This is a sentence.  So is this.")
|
79
|
+
Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
|
80
|
+
=> <StanfordParser::StandoffParsedText, 2 sentences>
|
81
|
+
>> puts t.first
|
82
|
+
(ROOT
|
83
|
+
(S
|
84
|
+
(NP (DT This [0,4]))
|
85
|
+
(VP (VBZ is [5,7])
|
86
|
+
(NP (DT a [8,9]) (NN sentence [10,18])))
|
87
|
+
(. . [18,19])))
|
88
|
+
|
89
|
+
Standoff parse trees can reproduce the text from which they were generated verbatim.
|
90
|
+
|
91
|
+
>> t.first.to_original_string
|
92
|
+
=> "This is a sentence.  "
|
93
|
+
|
94
|
+
They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
|
95
|
+
|
96
|
+
>> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
|
97
|
+
=> "[This] is [a sentence].  "
|
98
|
+
|
99
|
+
The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
|
100
|
+
|
101
|
+
See the documentation of the individual classes in this module for more details.
|
48
102
|
|
103
|
+
Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
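The serialization point can be illustrated without the parser at all. The following is a minimal sketch using a stand-in Struct; `SketchToken` and its fields are assumptions made for the example and only loosely follow the token fields named in the README.

```ruby
# A stand-in for a pure Ruby token, defined here only for illustration.
SketchToken = Struct.new(:current, :begin_position, :end_position)

tokens = [SketchToken.new("This", 0, 4), SketchToken.new("is", 5, 7)]

# Marshal.dump works because the tokens hold only Ruby strings and integers.
bytes    = Marshal.dump(tokens)
restored = Marshal.load(bytes)

puts restored == tokens      # Struct equality compares member values
puts restored.first.current
```

An object that holds a reference to a live Java proxy has no such plain Ruby representation, which is the limitation the README paragraph describes.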
|
49
104
|
|
50
105
|
= History
|
51
106
|
|
52
107
|
1.0.0:: Initial release
|
53
108
|
1.1.0:: Make module initialization function private. Add example code.
|
54
109
|
1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
|
110
|
+
2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
|
55
111
|
|
56
112
|
|
57
113
|
= Copyright
|
58
114
|
|
59
|
-
Copyright 2007, William Patrick McNeill
|
115
|
+
Copyright 2007-2008, William Patrick McNeill
|
60
116
|
|
61
117
|
This program is distributed under the GNU General Public License.
|
62
118
|
|
@@ -2,7 +2,7 @@
|
|
2
2
|
|
3
3
|
#--
|
4
4
|
|
5
|
-
# Copyright 2007 William Patrick McNeill
|
5
|
+
# Copyright 2007-2008 William Patrick McNeill
|
6
6
|
#
|
7
7
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
8
8
|
#
|
@@ -34,7 +34,7 @@
|
|
34
34
|
# See the Java Stanford Parser documentation for details
|
35
35
|
#
|
36
36
|
# sentence::
|
37
|
-
# A sentence to parse. This must be quoted.
|
37
|
+
# A sentence to parse. This must appear after all the options and be quoted.
|
38
38
|
|
39
39
|
|
40
40
|
require "stanfordparser"
|
@@ -42,5 +42,5 @@ require "stanfordparser"
|
|
42
42
|
# The last argument is the sentence. The rest of the command line is passed
|
43
43
|
# along to the parser object.
|
44
44
|
sentence = ARGV.pop
|
45
|
-
parser = StanfordParser::LexicalizedParser.new(
|
45
|
+
parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
|
46
46
|
puts parser.apply(sentence)
|
data/lib/java_object.rb
ADDED
@@ -0,0 +1,129 @@
|
|
1
|
+
# Copyright 2007-2008 William Patrick McNeill
|
2
|
+
#
|
3
|
+
# This file is part of the Stanford Parser Ruby Wrapper.
|
4
|
+
#
|
5
|
+
# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
|
6
|
+
# and/or modify it under the terms of the GNU General Public License as
|
7
|
+
# published by the Free Software Foundation; either version 2 of the License,
|
8
|
+
# or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
|
11
|
+
# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
|
13
|
+
# Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU General Public License along with
|
16
|
+
# editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
|
17
|
+
# St, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
# Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
|
20
|
+
# add a generic Java object wrapper class.
|
21
|
+
module Rjb
|
22
|
+
|
23
|
+
#--
|
24
|
+
# The documentation for this class appears next to its extension inside the
|
25
|
+
# StanfordParser module in stanfordparser.rb. This should be changed if Rjb
|
26
|
+
# is ever moved into its own gem. See the documentation in stanfordparser.rb
|
27
|
+
# for more details.
|
28
|
+
#++
|
29
|
+
class JavaObjectWrapper
|
30
|
+
include Enumerable
|
31
|
+
|
32
|
+
# The underlying Java object.
|
33
|
+
attr_reader :java_object
|
34
|
+
|
35
|
+
# Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
|
36
|
+
# String, treat it as a Java class name and instantiate it. Otherwise,
|
37
|
+
# treat <em>obj</em> as an instance of a Java object.
|
38
|
+
def initialize(obj, *args)
|
39
|
+
@java_object = obj.class == String ?
|
40
|
+
Rjb::import(obj).send(:new, *args) : obj
|
41
|
+
end
|
42
|
+
|
43
|
+
# Enumerate all the items in the object using its iterator. If the object
|
44
|
+
# has no iterator, this function yields nothing.
|
45
|
+
def each
|
46
|
+
if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
|
47
|
+
i = @java_object.iterator
|
48
|
+
while i.hasNext
|
49
|
+
yield wrap_java_object(i.next)
|
50
|
+
end
|
51
|
+
end
|
52
|
+
end # each
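The `each` method above is what makes the wrapper enumerable: once `each` walks the Java-style iterator, `Enumerable` supplies `map`, `select`, `to_a` and the rest. The sketch below reproduces that pattern with pure Ruby stand-ins. `FakeIterator`, `FakeCollection` and `IterableWrapper` are hypothetical classes written for this example; no Rjb objects are involved.

```ruby
# Stand-in for a Java iterator, exposing hasNext/next like java.util.Iterator.
class FakeIterator
  def initialize(items)
    @items = items.dup
  end

  def hasNext
    !@items.empty?
  end

  def next
    @items.shift
  end
end

# Stand-in for a Java collection with an iterator method.
class FakeCollection
  def initialize(items)
    @items = items
  end

  def iterator
    FakeIterator.new(@items)
  end
end

# Minimal wrapper: defining each in terms of the iterator is all
# Enumerable needs in order to supply map, select, to_a and so on.
class IterableWrapper
  include Enumerable

  def initialize(collection)
    @collection = collection
  end

  def each
    return enum_for(:each) unless block_given?
    i = @collection.iterator
    yield i.next while i.hasNext
  end
end

words = IterableWrapper.new(FakeCollection.new(%w[This is a sentence]))
puts words.map(&:upcase).inspect
puts words.to_a.length
```

Each call to `each` asks the collection for a fresh iterator, so the wrapper can be enumerated more than once.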
|
53
|
+
|
54
|
+
# Reflect unhandled method calls to the underlying Java object and wrap
|
55
|
+
# the return value in the appropriate Ruby object.
|
56
|
+
def method_missing(m, *args)
|
57
|
+
begin
|
58
|
+
wrap_java_object(@java_object.send(m, *args))
|
59
|
+
rescue RuntimeError => e
|
60
|
+
# The instance method failed. See if this is a static method.
|
61
|
+
if not e.message.match(/^Fail: unknown method name/).nil?
|
62
|
+
getClass.send(m, *args)
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
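For comparison, here is a generic Ruby delegation pattern built on `method_missing`. It is a sketch of the general technique, not of Rjb's behaviour: `DelegatingWrapper` is a hypothetical class, and it decides whether to forward a call with `respond_to?` rather than by matching an error message as the code above does. Note that the version in the diff returns `nil` when the rescued error message does not match the pattern, whereas this sketch lets `NoMethodError` propagate.

```ruby
# Minimal delegating wrapper. Unknown methods are forwarded to the
# wrapped object; anything it cannot handle raises NoMethodError as usual.
class DelegatingWrapper
  def initialize(target)
    @target = target
  end

  def method_missing(name, *args, &block)
    if @target.respond_to?(name)
      @target.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name) || super
  end
end

wrapped = DelegatingWrapper.new("sentence")
puts wrapped.upcase
puts wrapped.respond_to?(:center)
```

Defining `respond_to_missing?` alongside `method_missing` keeps introspection honest, which matters when other code checks `respond_to?` before calling.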
|
66
|
+
|
67
|
+
# Convert a value returned by a call to the underlying Java object to the
|
68
|
+
# appropriate Ruby object.
|
69
|
+
#
|
70
|
+
# If the value is a JavaObjectWrapper, convert it using a protected
|
71
|
+
# function with the name wrap_ followed by the underlying object's
|
72
|
+
# classname with the Java path delimiters converted to underscores. For
|
73
|
+
# example, a <tt>java.util.ArrayList</tt> would be converted by a function
|
74
|
+
# called wrap_java_util_ArrayList.
|
75
|
+
#
|
76
|
+
# If the value lacks the appropriate converter function, wrap it in a
|
77
|
+
# generic JavaObjectWrapper.
|
78
|
+
#
|
79
|
+
# If the value is not a JavaObjectWrapper, return it unchanged.
|
80
|
+
#
|
81
|
+
# This function is called recursively for every element in an Array.
|
82
|
+
def wrap_java_object(object)
|
83
|
+
if object.kind_of?(Array)
|
84
|
+
object.collect {|item| wrap_java_object(item)}
|
85
|
+
elsif object.respond_to?(:_classname)
|
86
|
+
# Ruby-Java Bridge Java objects all have a _classname member which
|
87
|
+
# tells the name of their Java class. Convert this to the
|
88
|
+
# corresponding wrapper function name.
|
89
|
+
wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
|
90
|
+
respond_to?(wrapper_name) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
|
91
|
+
else
|
92
|
+
object
|
93
|
+
end
|
94
|
+
end
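The name-based dispatch in `wrap_java_object` can be tried out with pure Ruby stand-ins. In this sketch, `FakeJavaObject` and `Converter` are hypothetical names, and the generic fallback is represented by a tagged array instead of a wrapper object. One detail is an assumption about the runtime rather than about the gem: on Ruby 2.0 and later, `respond_to?` ignores protected methods unless its second argument is true, so the sketch passes `true` when looking up the converter.

```ruby
# Stand-in for an Rjb proxy: only the _classname reader matters here.
FakeJavaObject = Struct.new(:_classname, :payload)

class Converter
  # Build the converter name from the Java class name and call it if defined.
  def convert(object)
    return object.map { |item| convert(item) } if object.is_a?(Array)
    return object unless object.respond_to?(:_classname)

    name = "wrap_#{object._classname.tr('.', '_')}".to_sym
    # Passing true makes respond_to? see protected and private methods.
    respond_to?(name, true) ? send(name, object) : [:generic, object.payload]
  end

  protected

  # Converter for objects whose class name is java.util.ArrayList.
  def wrap_java_util_ArrayList(object)
    object.payload.to_a
  end
end

c = Converter.new
puts c.convert(FakeJavaObject.new("java.util.ArrayList", 1..3)).inspect
puts c.convert(FakeJavaObject.new("java.lang.Object", :x)).inspect
puts c.convert("plain").inspect
```

Adding support for another class then only requires defining one more `wrap_...` method, which is the extension point the comment above describes.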
|
95
|
+
|
96
|
+
# Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
|
97
|
+
def wrap_java_util_ArrayList(object)
|
98
|
+
array_list = []
|
99
|
+
object.size.times do
|
100
|
+
|i| array_list << wrap_java_object(object.get(i))
|
101
|
+
end
|
102
|
+
array_list
|
103
|
+
end
|
104
|
+
|
105
|
+
# Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
|
106
|
+
def wrap_java_util_HashSet(object)
|
107
|
+
set = Set.new
|
108
|
+
i = object.iterator
|
109
|
+
while i.hasNext
|
110
|
+
set << wrap_java_object(i.next)
|
111
|
+
end
|
112
|
+
set
|
113
|
+
end
|
114
|
+
|
115
|
+
# Show the classname of the underlying Java object.
|
116
|
+
def inspect
|
117
|
+
"<#{@java_object._classname}>"
|
118
|
+
end
|
119
|
+
|
120
|
+
# Use the underlying Java object's stringification.
|
121
|
+
def to_s
|
122
|
+
toString
|
123
|
+
end
|
124
|
+
|
125
|
+
protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
|
126
|
+
|
127
|
+
end # JavaObjectWrapper
|
128
|
+
|
129
|
+
end # Rjb
|
data/lib/stanfordparser.rb
CHANGED
@@ -1,4 +1,4 @@
|
|
1
|
-
# Copyright 2007 William Patrick McNeill
|
1
|
+
# Copyright 2007-2008 William Patrick McNeill
|
2
2
|
#
|
3
3
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
4
4
|
#
|
@@ -19,121 +19,27 @@
|
|
19
19
|
|
20
20
|
require "pathname"
|
21
21
|
require "rjb"
|
22
|
-
require "
|
22
|
+
require "singleton"
|
23
|
+
begin
|
24
|
+
require "treebank"
|
25
|
+
gem "treebank", ">= 3.0.0"
|
26
|
+
rescue LoadError
|
27
|
+
require "treebank"
|
28
|
+
end
|
23
29
|
require "yaml"
|
24
30
|
|
25
|
-
|
26
|
-
# adds a generic Java object wrapper class.
|
27
|
-
module Rjb
|
28
|
-
|
29
|
-
# A generic wrapper for a Java object loaded via the Ruby Java Bridge. The
|
30
|
-
# wrapper class handles initialization and stringification, and passes other
|
31
|
-
# method calls down to the underlying Java object. Objects returned by the
|
32
|
-
# underlying Java object are converted to the appropriate Ruby object.
|
33
|
-
#
|
34
|
-
# This object is enumerable, yielding items in the order defined by the Java
|
35
|
-
# object's iterator.
|
36
|
-
class JavaObjectWrapper
|
37
|
-
include Enumerable
|
38
|
-
|
39
|
-
# The underlying Java object.
|
40
|
-
attr_reader :java_object
|
41
|
-
|
42
|
-
# Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
|
43
|
-
# String, assume it is a Java class name and instantiate it. Otherwise,
|
44
|
-
# treat <em>obj</em> as an instance of a Java object.
|
45
|
-
def initialize(obj, *args)
|
46
|
-
@java_object = obj.class == String ?
|
47
|
-
Rjb::import(obj).send(:new, *args) : obj
|
48
|
-
end
|
49
|
-
|
50
|
-
# Enumerate all the items in the object using its iterator. If the object
|
51
|
-
# has no iterator, this function yields nothing.
|
52
|
-
def each
|
53
|
-
if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
|
54
|
-
i = @java_object.iterator
|
55
|
-
while i.hasNext
|
56
|
-
yield wrap_java_object(i.next)
|
57
|
-
end
|
58
|
-
end
|
59
|
-
end # each
|
60
|
-
|
61
|
-
# Reflect unhandled method calls to the underlying Java object.
|
62
|
-
def method_missing(m, *args)
|
63
|
-
wrap_java_object(@java_object.send(m, *args))
|
64
|
-
end
|
65
|
-
|
66
|
-
# Convert a value returned by a call to the underlying Java object to the
|
67
|
-
# appropriate Ruby object as follows:
|
68
|
-
# * RJB objects are placed inside a generic JavaObjectWrapper wrapper.
|
69
|
-
# * <tt>java.util.ArrayList</tt> objects are converted to Ruby Arrays.
|
70
|
-
# * <tt>java.util.HashSet</tt> objects are converted to Ruby Sets
|
71
|
-
# * Other objects are left unchanged.
|
72
|
-
#
|
73
|
-
# This function is applied recursively to items in collection objects such
|
74
|
-
# as set and arrays.
|
75
|
-
def wrap_java_object(object)
|
76
|
-
if object.kind_of?(Array)
|
77
|
-
object.collect {|item| wrap_java_object(item)}
|
78
|
-
# Ruby-Java Bridge Java objects all have a _classname member which tells
|
79
|
-
# the name of their Java class.
|
80
|
-
elsif object.respond_to?(:_classname)
|
81
|
-
case object._classname
|
82
|
-
when /java\.util\.ArrayList/
|
83
|
-
# Convert java.util.ArrayList objects to Ruby arrays.
|
84
|
-
array_list = []
|
85
|
-
object.size.times do
|
86
|
-
|i| array_list << wrap_java_object(object.get(i))
|
87
|
-
end
|
88
|
-
array_list
|
89
|
-
when /java\.util\.HashSet/
|
90
|
-
# Convert java.util.HashSet objects to Ruby sets.
|
91
|
-
set = Set.new
|
92
|
-
i = object.iterator
|
93
|
-
while i.hasNext
|
94
|
-
set << wrap_java_object(i.next)
|
95
|
-
end
|
96
|
-
set
|
97
|
-
else
|
98
|
-
# Pass other RJB objects off to a handler.
|
99
|
-
wrap_rjb_object(object)
|
100
|
-
end # case
|
101
|
-
else
|
102
|
-
# Return non-RJB objects unchanged.
|
103
|
-
object
|
104
|
-
end # if
|
105
|
-
end # wrap_java_object
|
106
|
-
|
107
|
-
# By default, all RJB classes other than <tt>java.util.ArrayList</tt> and
|
108
|
-
# <tt>java.util.HashSet</tt> go in a generic wrapper. Derived classes may
|
109
|
-
# change this behavior.
|
110
|
-
def wrap_rjb_object(object)
|
111
|
-
JavaObjectWrapper.new(object)
|
112
|
-
end
|
113
|
-
|
114
|
-
# Show the classname of the underlying Java object.
|
115
|
-
def inspect
|
116
|
-
"<#{@java_object._classname}>"
|
117
|
-
end
|
118
|
-
|
119
|
-
# Use the underlying Java object's stringification.
|
120
|
-
def to_s
|
121
|
-
toString
|
122
|
-
end
|
123
|
-
|
124
|
-
protected :wrap_java_object, :wrap_rjb_object
|
125
|
-
|
126
|
-
end # JavaObjectWrapper
|
127
|
-
|
128
|
-
end # Rjb
|
129
|
-
|
31
|
+
require "java_object.rb"
|
130
32
|
|
131
33
|
# Wrapper for the {Stanford Natural Language
|
132
34
|
# Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
|
133
35
|
module StanfordParser
|
134
36
|
|
135
|
-
VERSION = "
|
136
|
-
|
37
|
+
VERSION = "2.0.0"
|
38
|
+
|
39
|
+
# The default sentence segmenter and tokenizer. This is an English-language
|
40
|
+
# tokenizer with support for Penn Treebank markup.
|
41
|
+
EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
|
42
|
+
|
137
43
|
# Path to an English PCFG model that comes with the Stanford Parser. The
|
138
44
|
# location is relative to the parser root directory. This is a valid value
|
139
45
|
# for the <em>grammar</em> parameter of the LexicalizedParser constructor.
|
@@ -170,32 +76,53 @@ module StanfordParser
|
|
170
76
|
# The root directory of the Stanford parser installation.
|
171
77
|
ROOT = initialize_on_load
|
172
78
|
|
173
|
-
|
79
|
+
#--
|
80
|
+
# The documentation below is for the original Rjb::JavaObjectWrapper object.
|
81
|
+
# It is reproduced here because rdoc only takes the last document block
|
82
|
+
# defined. If Rjb is moved into its own gem, this documentation should go
|
83
|
+
# with it, and the following should be written as documentation for this
|
84
|
+
# class:
|
85
|
+
#
|
174
86
|
# Extension of the generic Ruby-Java Bridge wrapper object for the
|
175
87
|
# StanfordParser module.
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
|
185
|
-
|
186
|
-
|
187
|
-
|
188
|
-
|
189
|
-
|
190
|
-
|
191
|
-
|
88
|
+
#++
|
89
|
+
# A generic wrapper for a Java object loaded via the {Ruby-Java
|
90
|
+
# Bridge}[http://rjb.rubyforge.org/]. The wrapper class handles
|
91
|
+
# initialization and stringification, and passes other method calls down to
|
92
|
+
# the underlying Java object. Objects returned by the underlying Java
|
93
|
+
# object are converted to the appropriate Ruby object.
|
94
|
+
#
|
95
|
+
# Other modules may extend the list of Java objects that are converted by
|
96
|
+
# adding their own converter functions. See wrap_java_object for details.
|
97
|
+
#
|
98
|
+
# This object is enumerable, yielding items in the order defined by the
|
99
|
+
# underlying Java object's iterator.
|
100
|
+
class Rjb::JavaObjectWrapper
|
101
|
+
# FeatureLabel objects go inside a FeatureLabel wrapper.
|
102
|
+
def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
|
103
|
+
StanfordParser::FeatureLabel.new(object)
|
104
|
+
end
|
105
|
+
|
106
|
+
# Tree objects go inside a Tree wrapper. Various tree types are aliased
|
107
|
+
# to this function.
|
108
|
+
def wrap_edu_stanford_nlp_trees_Tree(object)
|
109
|
+
Tree.new(object)
|
110
|
+
end
|
111
|
+
|
112
|
+
alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
|
113
|
+
alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
|
114
|
+
alias :wrap_edu_stanford_nlp_trees_SimpleTree :wrap_edu_stanford_nlp_trees_Tree
|
115
|
+
alias :wrap_edu_stanford_nlp_trees_TreeGraphNode :wrap_edu_stanford_nlp_trees_Tree
|
116
|
+
|
117
|
+
protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
|
118
|
+
end # Rjb::JavaObjectWrapper
|
192
119
|
|
193
120
|
|
194
121
|
# Lexicalized probabilistic parser.
|
195
122
|
#
|
196
123
|
# This is a wrapper for the
|
197
124
|
# <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
|
198
|
-
class LexicalizedParser < JavaObjectWrapper
|
125
|
+
class LexicalizedParser < Rjb::JavaObjectWrapper
|
199
126
|
# The grammar used by the parser
|
200
127
|
attr_reader :grammar
|
201
128
|
|
@@ -220,10 +147,17 @@ module StanfordParser
|
|
220
147
|
end # LexicalizedParser
|
221
148
|
|
222
149
|
|
150
|
+
# A singleton instance of the default Stanford Natural Language parser. A
|
151
|
+
# singleton is used because the parser can take a few seconds to load.
|
152
|
+
class DefaultParser < StanfordParser::LexicalizedParser
|
153
|
+
include Singleton
|
154
|
+
end
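The effect of `include Singleton` can be shown with a small stand-in. This is a sketch under the assumption that construction is the expensive step; `SlowParser` and `LOAD_LOG` are hypothetical names, and the appended symbol stands in for the multi-second model load.

```ruby
require "singleton"

# Records how many times the expensive constructor actually runs.
LOAD_LOG = []

# Stand-in for a parser whose construction is expensive.
class SlowParser
  include Singleton

  def initialize
    # Pretend this is the multi-second model load.
    LOAD_LOG << :loaded
  end
end

a = SlowParser.instance
b = SlowParser.instance
puts a.equal?(b)
puts LOAD_LOG.length
```

`Singleton` also makes `new` private, so callers have to go through `instance` and cannot trigger a second load by accident.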
|
155
|
+
|
156
|
+
|
223
157
|
# This is a wrapper for
|
224
158
|
# <tt>edu.stanford.nlp.trees.Tree</tt> objects. It customizes
|
225
159
|
# stringification.
|
226
|
-
class Tree < JavaObjectWrapper
|
160
|
+
class Tree < Rjb::JavaObjectWrapper
|
227
161
|
def initialize(obj = "edu.stanford.nlp.trees.Tree")
|
228
162
|
super(obj)
|
229
163
|
end
|
@@ -245,16 +179,16 @@ module StanfordParser
|
|
245
179
|
# This is a wrapper for
|
246
180
|
# <tt>edu.stanford.nlp.ling.Word</tt> objects. It customizes
|
247
181
|
# stringification and adds an equivalence operator.
|
248
|
-
class Word < JavaObjectWrapper
|
182
|
+
class Word < Rjb::JavaObjectWrapper
|
249
183
|
def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
|
250
184
|
super(obj, *args)
|
251
185
|
end
|
252
|
-
|
186
|
+
|
253
187
|
# See the word values.
|
254
188
|
def inspect
|
255
189
|
to_s
|
256
190
|
end
|
257
|
-
|
191
|
+
|
258
192
|
# Equivalence is defined relative to the word value.
|
259
193
|
def ==(other)
|
260
194
|
word == other
|
@@ -262,11 +196,34 @@ module StanfordParser
|
|
262
196
|
end # Word
|
263
197
|
|
264
198
|
|
199
|
+
# This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
|
200
|
+
# It customizes stringification.
|
201
|
+
class FeatureLabel < Rjb::JavaObjectWrapper
|
202
|
+
def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
|
203
|
+
super
|
204
|
+
end
|
205
|
+
|
206
|
+
# Stringify with just the token and its begin and end position.
|
207
|
+
def to_s
|
208
|
+
# BUGBUG The position values come back as java.lang.Integer though I
|
209
|
+
# would expect Rjb to convert them to Ruby integers.
|
210
|
+
begin_position = get(self.BEGIN_POSITION_KEY)
|
211
|
+
end_position = get(self.END_POSITION_KEY)
|
212
|
+
"#{current} [#{begin_position},#{end_position}]"
|
213
|
+
end
|
214
|
+
|
215
|
+
# More verbose stringification with all the fields and their values.
|
216
|
+
def inspect
|
217
|
+
toString
|
218
|
+
end
|
219
|
+
end
|
220
|
+
|
221
|
+
|
265
222
|
# Tokenizes documents into words and sentences.
|
266
223
|
#
|
267
224
|
# This is a wrapper for the
|
268
225
|
# <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
|
269
|
-
class DocumentPreprocessor < JavaObjectWrapper
|
226
|
+
class DocumentPreprocessor < Rjb::JavaObjectWrapper
|
270
227
|
def initialize(suppressEscaping = false)
|
271
228
|
super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
|
272
229
|
end
|
@@ -276,6 +233,229 @@ module StanfordParser
|
|
276
233
|
s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
|
277
234
|
_invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
|
278
235
|
end
|
236
|
+
|
237
|
+
def inspect
|
238
|
+
"<#{self.class.to_s.split('::').last}>"
|
239
|
+
end
|
240
|
+
|
241
|
+
def to_s
|
242
|
+
inspect
|
243
|
+
end
|
279
244
|
end # DocumentPreprocessor
|
280
245
|
|
246
|
+
StandoffToken = Struct.new(:current, :word, :before, :after,
|
247
|
+
:begin_position, :end_position)
|
248
|
+
|
249
|
+
# A text token that contains raw and normalized token identity (e.g. "(" and
|
250
|
+
# "-LRB-"), an offset span, and the characters immediately preceding and
|
251
|
+
# following the token. Given a list of these objects it is possible to
|
252
|
+
# recreate the text from which they came verbatim.
|
253
|
+
class StandoffToken
|
254
|
+
def to_s
|
255
|
+
"#{current} [#{begin_position},#{end_position}]"
|
256
|
+
end
|
257
|
+
end
|
258
|
+
|
259
|
+
|
260
|
+
# A preprocessor that segments text into sentences and tokens that contain
|
261
|
+
# character offset and token context information that can be used for
|
262
|
+
# standoff annotation.
|
263
|
+
class StandoffDocumentPreprocessor < DocumentPreprocessor
|
264
|
+
def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
|
265
|
+
# PTBTokenizer.factory is a static function, so use RJB to call it
|
266
|
+
# directly instead of going through a JavaObjectWrapper. We do it this
|
267
|
+
# way because the Stanford parser Java code does not provide a
|
268
|
+
# constructor that allows you to set the second parameter,
|
269
|
+
# invertible, to true, and we need this to write character offset
|
270
|
+
# information into the tokens.
|
271
|
+
ptb_tokenizer_class = Rjb::import(tokenizer)
|
272
|
+
# See the documentation for
|
273
|
+
# <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
|
274
|
+
# description of these parameters.
|
275
|
+
ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
|
276
|
+
super(ptb_tokenizer_factory)
|
277
|
+
end
|
278
|
+
|
279
|
+
# Returns a list of sentences in a string. This wraps the returned
|
280
|
+
# sentences in a StandoffSentence object.
|
281
|
+
def getSentencesFromString(s)
|
282
|
+
super(s).map!{|s| StandoffSentence.new(s)}
|
283
|
+
end
|
284
|
+
end
|
285
|
+
|
286
|
+
|
287
|
+
# A sentence is an array of StandoffToken objects.
|
288
|
+
class StandoffSentence < Array
|
289
|
+
# Construct an array of StandoffToken objects from a Java list sentence
|
290
|
+
# object returned by the preprocessor.
|
291
|
+
def initialize(stanford_parser_sentence)
|
292
|
+
# Convert FeatureLabel wrappers to StandoffToken objects.
|
293
|
+
s = stanford_parser_sentence.to_a.collect do |fs|
|
294
|
+
current = fs.current
|
295
|
+
word = fs.word
|
296
|
+
before = fs.before
|
297
|
+
after = fs.after
|
298
|
+
# The to_s.to_i is necessary because the get function returns
|
299
|
+
# java.lang.Integer objects instead of Ruby integers.
|
300
|
+
begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
|
301
|
+
end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
|
302
|
+
StandoffToken.new(current, word, before, after,
|
303
|
+
begin_position, end_position)
|
304
|
+
end
|
305
|
+
super(s)
|
306
|
+
end
|
307
|
+
|
308
|
+
# Return the original string verbatim.
|
309
|
+
def to_s
|
310
|
+
self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
|
311
|
+
end
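The reconstruction idea behind this `to_s` is that each token remembers the whitespace that followed it. The sketch below shows the same idea with a stand-in Struct; `SpacedToken` is a hypothetical name, and the token values are written by hand. Unlike the method above, it uses `map` and `join`, which is an alternative formulation that also copes with an empty token list.

```ruby
# Stand-in token: the text itself plus the characters that followed it.
SpacedToken = Struct.new(:current, :after)

tokens = [
  SpacedToken.new("So", " "),
  SpacedToken.new("is", " "),
  SpacedToken.new("this", ""),
  SpacedToken.new(".", "")
]

# Join each token with the whitespace that followed it in the source text.
rebuilt = tokens.map { |t| t.current + t.after }.join
puts rebuilt.inspect
```

Because no spacing is guessed, punctuation attaches to the preceding word exactly as it did in the input.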
|
312
|
+
|
313
|
+
# Return the original string verbatim.
|
314
|
+
def inspect
|
315
|
+
to_s
|
316
|
+
end
|
317
|
+
end
|
318
|
+
|
319
|
+
|
320
|
+
# Standoff syntactic annotation of natural language text which may contain
|
321
|
+
# multiple sentences.
|
322
|
+
#
|
323
|
+
# This is an Array of StandoffNode objects, one for each sentence in the
|
324
|
+
# text.
|
325
|
+
class StandoffParsedText < Array
|
326
|
+
# Parse the text and create the standoff annotation.
|
327
|
+
#
|
328
|
+
# The default parser is a singleton instance of the English language
|
329
|
+
# Stanford Natural Language parser. There may be a delay of a few
|
330
|
+
# seconds for it to load the first time it is created.
|
331
|
+
def initialize(text, nodetype = StandoffNode,
|
332
|
+
tokenizer = EN_PENN_TREEBANK_TOKENIZER,
|
333
|
+
parser = DefaultParser.instance)
|
334
|
+
preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
|
335
|
+
# Segment the text into sentences. Parse each sentence, writing
|
336
|
+
# standoff annotation information into the terminal nodes.
|
337
|
+
preprocessor.getSentencesFromString(text).map do |sentence|
|
338
|
+
parse = parser.apply(sentence.to_s)
|
339
|
+
push(nodetype.new(parse, sentence))
|
340
|
+
end
|
341
|
+
end
|
342
|
+
|
343
|
+
# Print class name and number of sentences.
|
344
|
+
def inspect
|
345
|
+
"<#{self.class.name}, #{length} sentences>"
|
346
|
+
end
|
347
|
+
|
348
|
+
# Print parses.
|
349
|
+
def to_s
|
350
|
+
flatten.join(" ")
|
351
|
+
end
|
352
|
+
end
|
353
|
+
|
354
|
+
|
355
|
+
# Standoff syntactic tree annotation of text. Terminal nodes are labeled
|
356
|
+
# with the appropriate StandoffToken objects. Standoff parses can reproduce
|
357
|
+
# the original string from which they were generated verbatim, optionally
|
358
|
+
# with brackets around the yields of specified non-terminal nodes.
|
359
|
+
class StandoffNode < Treebank::Node
|
360
|
+
# Create the standoff tree from a tree returned by the Stanford parser.
|
361
|
+
# For non-terminal nodes, the <em>tokens</em> argument will be a
|
362
|
+
# StandoffSentence containing the StandoffToken objects representing all
|
363
|
+
# the tokens beneath and after this node. For terminal nodes, the
|
364
|
+
# <em>tokens</em> argument will be a StandoffToken.
|
365
|
+
def initialize(stanford_parser_node, tokens)
|
366
|
+
# Annotate this node with a non-terminal label or a StandoffToken as
|
367
|
+
# appropriate.
|
368
|
+
super(tokens.instance_of?(StandoffSentence) ?
|
369
|
+
stanford_parser_node.value : tokens)
|
370
|
+
# Enumerate the children depth-first. Tokens are removed from the list
|
371
|
+
# left-to-right as terminal nodes are added to the tree.
|
372
|
+
stanford_parser_node.children.each do |child|
|
373
|
+
subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
|
374
|
+
attach_child!(subtree)
|
375
|
+
end
|
376
|
+
end
|
377
|
+
|
378
|
+
# Return the original text string dominated by this node.
|
379
|
+
def to_original_string
|
380
|
+
leaves.inject("") do |s, leaf|
|
381
|
+
s += leaf.label.current + leaf.label.after
|
382
|
+
end
|
383
|
+
end
|
384
|
+
|
385
|
+
# Print the original string with brackets around word spans dominated by
|
386
|
+
# the specified constituents.
|
387
|
+
#
|
388
|
+
# The constituents to bracket are specified by passing a list of node
|
389
|
+
# coordinates, which are arrays of integers of the form returned by the
|
390
|
+
# tree enumerators of Treebank::Node objects.
|
391
|
+
#
|
392
|
+
# _coords_:: the coordinates of the nodes around which to place brackets
|
393
|
+
# _open_:: the open bracket symbol
|
394
|
+
# _close_:: the close bracket symbol
|
395
|
+
def to_bracketed_string(coords, open = "[", close = "]")
|
396
|
+
# Get a list of all the leaf nodes and their coordinates.
|
397
|
+
items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
|
398
|
+
# Enumerate over all the matching constituents inserting open and close
|
399
|
+
# brackets around their yields in the items list.
|
400
|
+
coords.each do |matching|
|
401
|
+
# Insert using a simple state machine with three states: :start,
|
402
|
+
# :open, and :close.
|
403
|
+
state = :start
|
404
|
+
# Enumerate over the items list looking for nodes that are the
|
405
|
+
# children of the matching constituent.
|
406
|
+
items.each_with_index do |item, index|
|
407
|
+
# Skip inserted bracket characters.
|
408
|
+
next if item.is_a? String
|
409
|
+
# Handle terminal node items with the state machine.
|
410
|
+
node, terminal_coordinate = item
|
411
|
+
if state == :start
|
412
|
+
next if not in_yield?(matching, terminal_coordinate)
|
413
|
+
items.insert(index, open)
|
414
|
+
state = :open
|
415
|
+
else # state == :open
|
416
|
+
next if in_yield?(matching, terminal_coordinate)
|
417
|
+
items.insert(index, close)
|
418
|
+
state = :close
|
419
|
+
break
|
420
|
+
end
|
421
|
+
end # items.each_with_index
|
422
|
+
# Handle the case where a matching constituent is flush with the end
|
423
|
+
# of the sentence.
|
424
|
+
items << close if state == :open
|
425
|
+
end # each
|
426
|
+
# Replace terminal nodes with their string representations. Insert
|
427
|
+
# spacing characters in the list.
|
428
|
+
items.each_with_index do |item, index|
|
429
|
+
next if item.is_a? String
|
430
|
+
text = item.first.label.current
|
431
|
+
spacing = item.first.label.after
|
432
|
+
# Replace the terminal node with its text.
|
433
|
+
items[index] = text
|
434
|
+
# Insert the spacing that comes after this text before the first
|
435
|
+
# non-close bracket character.
|
436
|
+
close_pos = find_index(items[index+1..-1]) {|item| not item == close}
|
437
|
+
items.insert(index + close_pos + 1, spacing)
|
438
|
+
end
|
439
|
+
items.join
|
440
|
+
end # to_bracketed_string
|
441
|
+
|
442
|
+
# Find the index of the first item in _list_ for which _block_ is true.
|
443
|
+
# Return 0 if no items are found.
|
444
|
+
def find_index(list, &block)
|
445
|
+
list.each_with_index do |item, index|
|
446
|
+
return index if block.call(item)
|
447
|
+
end
|
448
|
+
0
|
449
|
+
end
|
450
|
+
|
451
|
+
# Is the node at _terminal_ in the yield of the node at _node_?
|
452
|
+
def in_yield?(node, terminal)
|
453
|
+
# If node A's coordinates match the prefix of node B's coordinates, node
|
454
|
+
# B is in the yield of node A.
|
455
|
+
terminal.first(node.length) == node
|
456
|
+
end
|
457
|
+
|
458
|
+
private :in_yield?, :find_index
|
459
|
+
end # StandoffNode
|
460
|
+
|
281
461
|
end # StanfordParser
|
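The two ideas StandoffNode relies on, rebuilding text from each token's surface form plus trailing whitespace, and testing yield membership by coordinate prefix, can be sketched in plain Ruby without the parser or the Treebank gem. This is a minimal illustration only: the `Token` struct, the `original_string` helper, and the sample coordinates are hypothetical stand-ins, not part of the gem.

```ruby
# Hypothetical stand-in for a standoff token: the token text as it appeared
# in the source, plus the whitespace that followed it.
Token = Struct.new(:current, :after)

# A terminal lies in a constituent's yield when the constituent's coordinate
# is a prefix of the terminal's coordinate (the same test as in_yield?).
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Concatenate each token with its trailing whitespace to recover the text,
# mirroring what to_original_string does over leaf labels.
def original_string(tokens)
  tokens.map { |t| t.current + t.after }.join
end

tokens = [
  Token.new("He", " "), Token.new("(", ""), Token.new("John", ""),
  Token.new(")", " "), Token.new("is", " "), Token.new("tall", ""),
  Token.new(".", "")
]

puts original_string(tokens)
puts in_yield?([0, 0], [0, 0, 1, 1, 0])  # coordinate [0, 0] is a prefix
puts in_yield?([0, 1, 1], [0, 0, 0, 0])  # coordinate [0, 1, 1] is not
```

Because each token carries its own trailing whitespace, brackets can be placed around any span without disturbing the spacing of the source text.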
data/test/test_stanfordparser.rb
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
|
3
3
|
#--
|
4
4
|
|
5
|
-
# Copyright 2007 William Patrick McNeill
|
5
|
+
# Copyright 2007-2008 William Patrick McNeill
|
6
6
|
#
|
7
7
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
8
8
|
#
|
@@ -30,20 +30,13 @@ require "singleton"
|
|
30
30
|
require "stanfordparser"
|
31
31
|
|
32
32
|
|
33
|
-
# Make the Lexicalized Parser a singleton for the tests because it takes
|
34
|
-
# several seconds to load.
|
35
|
-
class StanfordParser::LexicalizedParser
|
36
|
-
include Singleton
|
37
|
-
end
|
38
|
-
|
39
|
-
|
40
33
|
class LexicalizedParserTestCase < Test::Unit::TestCase
|
41
34
|
def test_root_path
|
42
35
|
assert_equal StanfordParser::ROOT.class, Pathname
|
43
36
|
end
|
44
37
|
|
45
38
|
def setup
|
46
|
-
@parser = StanfordParser::
|
39
|
+
@parser = StanfordParser::DefaultParser.instance
|
47
40
|
@tree = @parser.apply("This is a sentence.")
|
48
41
|
end
|
49
42
|
|
@@ -53,6 +46,8 @@ class LexicalizedParserTestCase < Test::Unit::TestCase
|
|
53
46
|
end
|
54
47
|
|
55
48
|
def test_localTrees
|
49
|
+
# The following call exercises the conversion from java.util.HashSet
|
50
|
+
# objects to Ruby sets.
|
56
51
|
l = @tree.localTrees
|
57
52
|
assert_equal l.size, 5
|
58
53
|
assert_equal Set.new(l.collect {|t| "#{t.label}"}),
|
@@ -68,7 +63,7 @@ end # LexicalizedParserTestCase
|
|
68
63
|
|
69
64
|
class TreeTestCase < Test::Unit::TestCase
|
70
65
|
def setup
|
71
|
-
@parser = StanfordParser::
|
66
|
+
@parser = StanfordParser::DefaultParser.instance
|
72
67
|
@tree = @parser.apply("This is a sentence.")
|
73
68
|
end
|
74
69
|
|
@@ -85,12 +80,30 @@ class TreeTestCase < Test::Unit::TestCase
|
|
85
80
|
end # TreeTestCase
|
86
81
|
|
87
82
|
|
83
|
+
class FeatureLabelTestCase < Test::Unit::TestCase
|
84
|
+
def test_feature_label
|
85
|
+
f = StanfordParser::FeatureLabel.new
|
86
|
+
assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
|
87
|
+
f.put(f.BEGIN_POSITION_KEY, 3)
|
88
|
+
assert_equal "END_POS", f.END_POSITION_KEY
|
89
|
+
f.put(f.END_POSITION_KEY, 7)
|
90
|
+
assert_equal "current", f.CURRENT_KEY
|
91
|
+
f.put(f.CURRENT_KEY, "word")
|
92
|
+
assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
|
93
|
+
assert_equal "word [3,7]", f.to_s
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
|
88
98
|
class DocumentPreprocessorTestCase < Test::Unit::TestCase
|
89
99
|
def setup
|
90
100
|
@preproc = StanfordParser::DocumentPreprocessor.new
|
101
|
+
@standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
|
91
102
|
end
|
92
103
|
|
93
104
|
def test_get_sentences_from_string
|
105
|
+
# The following call exercises the conversion from java.util.ArrayList
|
106
|
+
# objects to Ruby arrays.
|
94
107
|
s = @preproc.getSentencesFromString("This is a sentence. So is this.")
|
95
108
|
assert_equal "#{s[0]}", "This is a sentence ."
|
96
109
|
assert_equal "#{s[1]}", "So is this ."
|
@@ -100,15 +113,112 @@ class DocumentPreprocessorTestCase < Test::Unit::TestCase
|
|
100
113
|
# StanfordParser::DocumentPreprocessor is not an enumerable object.
|
101
114
|
assert_equal @preproc.map, []
|
102
115
|
end
|
116
|
+
|
117
|
+
# Segment and tokenize text containing two sentences.
|
118
|
+
def test_standoff_document_preprocessor
|
119
|
+
sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
|
120
|
+
# Recognize two sentences.
|
121
|
+
assert_equal 2, sentences.length
|
122
|
+
assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
|
123
|
+
assert_equal "He (John) is tall.", sentences.first.to_s
|
124
|
+
assert_equal 7, sentences.first.length
|
125
|
+
assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
|
126
|
+
assert_equal "So is she.", sentences.last.to_s
|
127
|
+
assert_equal 4, sentences.last.length
|
128
|
+
assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
|
129
|
+
# Get the correct token information for the first sentence.
|
130
|
+
assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
|
131
|
+
assert_equal [0,2], [sentences[0][0].begin_position(), sentences[0][0].end_position()]
|
132
|
+
assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
|
133
|
+
assert_equal [3,4], [sentences[0][1].begin_position(), sentences[0][1].end_position()]
|
134
|
+
assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
|
135
|
+
assert_equal [4,8], [sentences[0][2].begin_position(), sentences[0][2].end_position()]
|
136
|
+
assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
|
137
|
+
assert_equal [8,9], [sentences[0][3].begin_position(), sentences[0][3].end_position()]
|
138
|
+
assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
|
139
|
+
assert_equal [10,12], [sentences[0][4].begin_position(), sentences[0][4].end_position()]
|
140
|
+
assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
|
141
|
+
assert_equal [13,17], [sentences[0][5].begin_position(), sentences[0][5].end_position()]
|
142
|
+
assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
|
143
|
+
assert_equal [17,18], [sentences[0][6].begin_position(), sentences[0][6].end_position()]
|
144
|
+
# Get the correct token information for the second sentence.
|
145
|
+
assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
|
146
|
+
assert_equal [20,22], [sentences[1][0].begin_position(), sentences[1][0].end_position()]
|
147
|
+
assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
|
148
|
+
assert_equal [23,25], [sentences[1][1].begin_position(), sentences[1][1].end_position()]
|
149
|
+
assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
|
150
|
+
assert_equal [26,29], [sentences[1][2].begin_position(), sentences[1][2].end_position()]
|
151
|
+
assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
|
152
|
+
assert_equal [29,30], [sentences[1][3].begin_position(), sentences[1][3].end_position()]
|
153
|
+
end
|
154
|
+
|
155
|
+
def test_stringification
|
156
|
+
assert_equal "<DocumentPreprocessor>", @preproc.inspect
|
157
|
+
assert_equal "<DocumentPreprocessor>", @preproc.to_s
|
158
|
+
assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
|
159
|
+
assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
|
160
|
+
end
|
161
|
+
|
103
162
|
end # DocumentPreprocessorTestCase
|
104
163
|
|
105
164
|
|
165
|
+
class StandoffParsedTextTestCase < Test::Unit::TestCase
|
166
|
+
def setup
|
167
|
+
@text = "He (John) is tall.  So is she."
|
168
|
+
end
|
169
|
+
|
170
|
+
def test_parse_text_default_nodetype
|
171
|
+
parsed_text = StanfordParser::StandoffParsedText.new(@text)
|
172
|
+
verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
|
173
|
+
end
|
174
|
+
|
175
|
+
# Verify correct parsing with variable node types for text containing two sentences.
|
176
|
+
def verify_parsed_text(parsed_text, nodetype)
|
177
|
+
# Verify that there are two sentences.
|
178
|
+
assert_equal 2, parsed_text.length
|
179
|
+
assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
|
180
|
+
# Verify the tokens in the leaf node of the first sentence.
|
181
|
+
leaves = parsed_text[0].leaves.collect {|node| node.label}
|
182
|
+
assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
|
183
|
+
assert_equal [0,2], [leaves[0].begin_position(), leaves[0].end_position()]
|
184
|
+
assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
|
185
|
+
assert_equal [3,4], [leaves[1].begin_position(), leaves[1].end_position()]
|
186
|
+
assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
|
187
|
+
assert_equal [4,8], [leaves[2].begin_position(), leaves[2].end_position()]
|
188
|
+
assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
|
189
|
+
assert_equal [8,9], [leaves[3].begin_position(), leaves[3].end_position()]
|
190
|
+
assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
|
191
|
+
assert_equal [10,12], [leaves[4].begin_position(), leaves[4].end_position()]
|
192
|
+
assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
|
193
|
+
assert_equal [13,17], [leaves[5].begin_position(), leaves[5].end_position()]
|
194
|
+
assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
|
195
|
+
assert_equal [17,18], [leaves[6].begin_position(), leaves[6].end_position()]
|
196
|
+
# Verify the tokens in the leaf node of the second sentence.
|
197
|
+
leaves = parsed_text[1].leaves.collect {|node| node.label}
|
198
|
+
assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
|
199
|
+
assert_equal [20,22], [leaves[0].begin_position(), leaves[0].end_position()]
|
200
|
+
assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
|
201
|
+
assert_equal [23,25], [leaves[1].begin_position(), leaves[1].end_position()]
|
202
|
+
assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
|
203
|
+
assert_equal [26,29], [leaves[2].begin_position(), leaves[2].end_position()]
|
204
|
+
assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
|
205
|
+
assert_equal [29,30], [leaves[3].begin_position(), leaves[3].end_position()]
|
206
|
+
# Verify that the original string is recoverable.
|
207
|
+
assert_equal "He (John) is tall.  ", parsed_text[0].to_original_string
|
208
|
+
assert_equal "So is she." , parsed_text[1].to_original_string
|
209
|
+
# Draw < and > brackets around 3 constituents.
|
210
|
+
b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
|
211
|
+
assert_equal "<He (<John>)> is <tall>.  ", b
|
212
|
+
end
|
213
|
+
end
|
214
|
+
|
215
|
+
|
106
216
|
class MiscPreprocessorTestCase < Test::Unit::TestCase
|
107
217
|
def test_model_location
|
108
218
|
assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
|
109
219
|
end
|
110
|
-
|
220
|
+
|
111
221
|
def test_word
|
112
222
|
assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") == "dog"
|
113
223
|
end
|
114
|
-
end # MiscPreprocessorTestCase
|
224
|
+
end # MiscPreprocessorTestCase
|
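The begin and end positions asserted in the standoff tests are half-open character offsets into the source text, so each pair should slice out exactly the token's surface form. The sketch below checks that in plain Ruby, with no parser involved. It assumes the sample text has two spaces after the first period, which is what the asserted offsets imply ("So" begins at 20, and the final period ends at 30); the `TEXT` and `SPANS` constants are illustrative names, not part of the gem.

```ruby
# Sample text, assumed to contain two spaces after "tall." so that the
# offsets asserted in the tests line up.
TEXT = "He (John) is tall.  So is she."

# [surface form, begin position, end position], copied from the assertions.
SPANS = [
  ["He", 0, 2], ["(", 3, 4], ["John", 4, 8], [")", 8, 9],
  ["is", 10, 12], ["tall", 13, 17], [".", 17, 18],
  ["So", 20, 22], ["is", 23, 25], ["she", 26, 29], [".", 29, 30]
]

# Slice the text with an exclusive range and compare it to the surface form.
SPANS.each do |surface, first, last|
  status = TEXT[first...last] == surface ? "ok" : "MISMATCH"
  puts format("%-4s [%2d,%2d] %s", surface, first, last, status)
end
```

The surface form (`current`) is what the offsets recover, so the parentheses come back as "(" and ")" rather than the normalized `-LRB-` and `-RRB-` values that `word` returns.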
metadata
CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
|
|
3
3
|
specification_version: 1
|
4
4
|
name: stanfordparser
|
5
5
|
version: !ruby/object:Gem::Version
|
6
|
-
version:
|
7
|
-
date:
|
6
|
+
version: 2.0.0
|
7
|
+
date: 2008-06-13 00:00:00 -07:00
|
8
8
|
summary: Ruby wrapper for the Stanford Natural Language Parser
|
9
9
|
require_paths:
|
10
10
|
- lib
|
@@ -30,6 +30,7 @@ authors:
|
|
30
30
|
- W.P. McNeill
|
31
31
|
files:
|
32
32
|
- test/test_stanfordparser.rb
|
33
|
+
- lib/java_object.rb
|
33
34
|
- lib/stanfordparser.rb
|
34
35
|
- examples/stanford-sentence-parser.rb
|
35
36
|
- README
|