stanfordparser 1.2.0 → 2.0.0
- data/README +77 -21
- data/examples/stanford-sentence-parser.rb +3 -3
- data/lib/java_object.rb +129 -0
- data/lib/stanfordparser.rb +312 -132
- data/test/test_stanfordparser.rb +122 -12
- metadata +3 -2
data/README
CHANGED
@@ -1,35 +1,40 @@
|
|
1
|
-
= Stanford Natural Language Parser
|
1
|
+
= Stanford Natural Language Parser Wrapper
|
2
2
|
|
3
3
|
This module is a wrapper for the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
|
4
4
|
|
5
|
-
The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby.
|
5
|
+
The Stanford Natural Language Parser is a Java implementation of a probabilistic PCFG and dependency parser for English, German, Chinese, and Arabic. This module provides a thin wrapper around the Java code to make it accessible from Ruby, along with pure Ruby objects that enable standoff parsing.
|
6
6
|
|
7
|
-
= Installation and Configuration
|
8
|
-
|
9
|
-
To run this module you must install the following additional software
|
10
7
|
|
11
|
-
|
12
|
-
* The {Ruby Java Bridge}[http://rjb.rubyforge.org/] gem.
|
8
|
+
= Installation and Configuration
|
13
9
|
|
14
|
-
|
10
|
+
To run this module you must manually install the {Stanford Natural Language Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml], in addition to the Ruby gems the module requires.
|
15
11
|
|
16
12
|
This module expects the parser to be installed in the <tt>/usr/local/stanford-parser/current</tt> directory. This is the directory that contains the <tt>stanford-parser.jar</tt> file. When the module is loaded, it adds this directory to the Java classpath and launches the Java VM with the arguments <tt>-server -Xmx150m</tt>.
|
17
13
|
|
18
|
-
These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://
|
14
|
+
These defaults can be overridden by creating a configuration file in <tt>/etc/ruby_stanford_parser.yaml</tt>. This file is in the Ruby YAML[http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/index.html] format, and may contain two values: <tt>root</tt> and <tt>jvmargs</tt>. For example, the file might look like the following:
|
19
15
|
|
20
16
|
root: /usr/local/stanford-parser/other/location
|
21
17
|
jvmargs: -Xmx100m -verbose
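As a rough illustration of how these two settings could be read, the sketch below overlays the values from a YAML file on the defaults named above. The helper name <tt>load_parser_config</tt> and the <tt>DEFAULTS</tt> constant are assumptions made for this example; the module's own loader is not shown here and may work differently.

```ruby
require "yaml"

# Defaults described in the README text above.
DEFAULTS = {
  "root"    => "/usr/local/stanford-parser/current",
  "jvmargs" => "-server -Xmx150m"
}.freeze

# Hypothetical helper: parse YAML text and overlay it on the defaults.
# Keys missing from the file keep their default values.
def load_parser_config(yaml_text)
  overrides = yaml_text.to_s.strip.empty? ? {} : (YAML.load(yaml_text) || {})
  DEFAULTS.merge(overrides)
end

config = load_parser_config(<<~CONFIG)
  root: /usr/local/stanford-parser/other/location
  jvmargs: -Xmx100m -verbose
CONFIG

puts config["root"]
# The JVM arguments arrive as one string; split them before passing them on.
puts config["jvmargs"].split.inspect
```

Merging onto a frozen defaults hash means a file that sets only <tt>root</tt> still ends up with the default <tt>jvmargs</tt>.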
|
22
18
|
|
23
|
-
=Usage
|
24
19
|
|
25
|
-
|
20
|
+
= Tokenization and Parsing
|
26
21
|
|
27
|
-
|
22
|
+
Use the StanfordParser::DocumentPreprocessor class to tokenize text and files into sentences and words.
|
23
|
+
|
24
|
+
>> require "stanfordparser"
|
28
25
|
=> true
|
29
|
-
|
26
|
+
>> preproc = StanfordParser::DocumentPreprocessor.new
|
27
|
+
=> <DocumentPreprocessor>
|
28
|
+
>> puts preproc.getSentencesFromString("This is a sentence. So is this.")
|
29
|
+
This is a sentence .
|
30
|
+
So is this .
|
31
|
+
|
32
|
+
Use the StanfordParser::LexicalizedParser class to parse sentences.
|
33
|
+
|
34
|
+
>> parser = StanfordParser::LexicalizedParser.new
|
30
35
|
Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [5.5 sec].
|
31
36
|
=> edu.stanford.nlp.parser.lexparser.LexicalizedParser
|
32
|
-
|
37
|
+
>> puts parser.apply("This is a sentence.")
|
33
38
|
(ROOT
|
34
39
|
(S [24.917]
|
35
40
|
(NP [6.139] (DT [2.300] This))
|
@@ -37,26 +42,77 @@ Use the StanfordParser::LexicalizedParser class to parse sentences.
|
|
37
42
|
(NP [12.299] (DT [1.419] a) (NN [8.897] sentence)))
|
38
43
|
(. [0.002] .)))
|
39
44
|
|
40
|
-
|
45
|
+
For complete details about the use of these classes, see the documentation on the Stanford Natural Language Parser website.
|
41
46
|
|
42
|
-
irb(main):004:0> preproc = StanfordParser::DocumentPreprocessor.new
|
43
|
-
irb(main):008:0> puts preproc.getSentencesFromString("This is a sentence. So is this.")
|
44
|
-
This is a sentence .
|
45
|
-
So is this .
|
46
47
|
|
47
|
-
|
48
|
+
= Standoff Tokenization and Parsing
|
49
|
+
|
50
|
+
This module also contains support for standoff tokenization and parsing, in which the terminal nodes of parse trees contain information about the text that was used to generate them.
|
51
|
+
|
52
|
+
Use the StanfordParser::StandoffDocumentPreprocessor class to tokenize text and files into sentences and words.
|
53
|
+
|
54
|
+
>> preproc = StanfordParser::StandoffDocumentPreprocessor.new
|
55
|
+
=> <StandoffDocumentPreprocessor>
|
56
|
+
>> s = preproc.getSentencesFromString("This is a sentence.  So is this.")
|
57
|
+
=> [This is a sentence., So is this.]
|
58
|
+
|
59
|
+
The standoff preprocessor returns StanfordParser::StandoffToken objects, which contain character offsets into the original text along with information about spacing characters that came before and after the token.
|
60
|
+
|
61
|
+
>> puts s
|
62
|
+
This [0,4]
|
63
|
+
is [5,7]
|
64
|
+
a [8,9]
|
65
|
+
sentence [10,18]
|
66
|
+
. [18,19]
|
67
|
+
So [21,23]
|
68
|
+
is [24,26]
|
69
|
+
this [27,31]
|
70
|
+
. [31,32]
|
71
|
+
>> "This is a sentence.  So is this."[27..31]
|
72
|
+
=> "this."
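The offsets can be explored without the Java parser. The following is a minimal sketch that assumes a toy regular-expression tokenizer (<tt>toy_standoff_tokens</tt>, a hypothetical name) in place of the Stanford tokenizer; only the field layout of <tt>StandoffToken</tt> is taken from this module. The input uses two spaces after the first sentence, which is what the offsets listed above imply.

```ruby
# Same field layout as the StandoffToken struct defined in stanfordparser.rb.
StandoffToken = Struct.new(:current, :word, :before, :after,
                           :begin_position, :end_position)

# Hypothetical stand-in for the Java tokenizer: runs of word characters and
# single punctuation marks become tokens, and the whitespace around each
# token is recorded in its before/after fields.
def toy_standoff_tokens(text)
  tokens = []
  text.scan(/\w+|[^\w\s]/) do |match|
    start = Regexp.last_match.begin(0)
    stop = Regexp.last_match.end(0)
    before = text[0...start][/\s*\z/]
    after = text[stop..-1][/\A\s*/]
    tokens << StandoffToken.new(match, match, before, after, start, stop)
  end
  tokens
end

text = "This is a sentence.  So is this."
tokens = toy_standoff_tokens(text)
last_word = tokens[-2]

puts "#{last_word.current} [#{last_word.begin_position},#{last_word.end_position}]"
# The end offset is exclusive, so a three-dot range selects exactly the token.
puts text[last_word.begin_position...last_word.end_position]
```

Because the end position is exclusive, the inclusive two-dot range used in the session above also picks up the character that follows the token.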
|
73
|
+
|
74
|
+
This is the same information contained in the <tt>edu.stanford.nlp.ling.FeatureLabel</tt> class in the Stanford Parser Java implementation.
|
75
|
+
|
76
|
+
Similarly, use the StanfordParser::StandoffParsedText object to parse a block of text into StanfordParser::StandoffNode parse trees whose terminal nodes are StanfordParser::StandoffToken objects.
|
77
|
+
|
78
|
+
>> t = StanfordParser::StandoffParsedText.new("This is a sentence.  So is this.")
|
79
|
+
Loading parser from serialized file /usr/local/stanford-parser/current/englishPCFG.ser.gz ... done [4.9 sec].
|
80
|
+
=> <StanfordParser::StandoffParsedText, 2 sentences>
|
81
|
+
>> puts t.first
|
82
|
+
(ROOT
|
83
|
+
(S
|
84
|
+
(NP (DT This [0,4]))
|
85
|
+
(VP (VBZ is [5,7])
|
86
|
+
(NP (DT a [8,9]) (NN sentence [10,18])))
|
87
|
+
(. . [18,19])))
|
88
|
+
|
89
|
+
Standoff parse trees can reproduce the text from which they were generated verbatim.
|
90
|
+
|
91
|
+
>> t.first.to_original_string
|
92
|
+
=> "This is a sentence.  "
|
93
|
+
|
94
|
+
They can also reproduce the original text with brackets inserted around the yields of specified parse nodes.
|
95
|
+
|
96
|
+
>> t.first.to_bracketed_string([[0,0,0], [0,1,1]])
|
97
|
+
=> "[This] is [a sentence].  "
|
98
|
+
|
99
|
+
The format of the coordinates used to specify individual nodes is described in the documentation for the Ruby Treebank[http://rubyforge.org/projects/treebank/] gem.
|
100
|
+
|
101
|
+
See the documentation of the individual classes in this module for more details.
|
48
102
|
|
103
|
+
Unlike their parents StanfordParser::DocumentPreprocessor and StanfordParser::LexicalizedParser, which produce Ruby wrappers around Java objects, StanfordParser::StandoffDocumentPreprocessor and StanfordParser::StandoffParsedText produce pure Ruby objects. This is to facilitate serialization of these objects using tools like the Marshal module, which cannot serialize Java objects.
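The point about serialization can be tried with plain structs. This sketch builds a few tokens by hand (the values are illustrative and not produced by the parser), passes them through Marshal, and rebuilds the text from each token's <tt>current</tt> and <tt>after</tt> fields.

```ruby
# Same field layout as the StandoffToken struct defined in stanfordparser.rb.
StandoffToken = Struct.new(:current, :word, :before, :after,
                           :begin_position, :end_position)

# Hand-built tokens for the text "So is this." (illustrative values only).
tokens = [
  StandoffToken.new("So", "So", "", " ", 0, 2),
  StandoffToken.new("is", "is", " ", " ", 3, 5),
  StandoffToken.new("this", "this", " ", "", 6, 10),
  StandoffToken.new(".", ".", "", "", 10, 11)
]

# Pure Ruby structs survive a Marshal round trip unchanged.
restored = Marshal.load(Marshal.dump(tokens))

# Each token's text plus its trailing whitespace rebuilds the original string.
original = restored.inject("") { |s, t| s + t.current + t.after }
puts original
```

A wrapper around a live Java object has no comparable byte representation on the Ruby side, which is the motivation the paragraph above gives for converting to pure Ruby objects first.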
|
49
104
|
|
50
105
|
= History
|
51
106
|
|
52
107
|
1.0.0:: Initial release
|
53
108
|
1.1.0:: Make module initialization function private. Add example code.
|
54
109
|
1.2.0:: Read Java VM arguments from the configuration file. Add Word class.
|
110
|
+
2.0.0:: Add support for standoff parsing. Change the way Rjb::JavaObjectWrapper wraps returned values: see wrap_java_object for details. Rjb::JavaObjectWrapper supports static members. Minor changes to stanford-sentence-parser script.
|
55
111
|
|
56
112
|
|
57
113
|
= Copyright
|
58
114
|
|
59
|
-
Copyright 2007, William Patrick McNeill
|
115
|
+
Copyright 2007-2008, William Patrick McNeill
|
60
116
|
|
61
117
|
This program is distributed under the GNU General Public License.
|
62
118
|
|
data/examples/stanford-sentence-parser.rb
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
|
3
3
|
#--
|
4
4
|
|
5
|
-
# Copyright 2007 William Patrick McNeill
|
5
|
+
# Copyright 2007-2008 William Patrick McNeill
|
6
6
|
#
|
7
7
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
8
8
|
#
|
@@ -34,7 +34,7 @@
|
|
34
34
|
# See the Java Stanford Parser documentation for details
|
35
35
|
#
|
36
36
|
# sentence::
|
37
|
-
# A sentence to parse. This must be quoted.
|
37
|
+
# A sentence to parse. This must appear after all the options and be quoted.
|
38
38
|
|
39
39
|
|
40
40
|
require "stanfordparser"
|
@@ -42,5 +42,5 @@ require "stanfordparser"
|
|
42
42
|
# The last argument is the sentence. The rest of the command line is passed
|
43
43
|
# along to the parser object.
|
44
44
|
sentence = ARGV.pop
|
45
|
-
parser = StanfordParser::LexicalizedParser.new(
|
45
|
+
parser = StanfordParser::LexicalizedParser.new(StanfordParser::ENGLISH_PCFG_MODEL, ARGV)
|
46
46
|
puts parser.apply(sentence)
|
data/lib/java_object.rb
ADDED
@@ -0,0 +1,129 @@
|
|
1
|
+
# Copyright 2007-2008 William Patrick McNeill
|
2
|
+
#
|
3
|
+
# This file is part of the Stanford Parser Ruby Wrapper.
|
4
|
+
#
|
5
|
+
# The Stanford Parser Ruby Wrapper is free software; you can redistribute it
|
6
|
+
# and/or modify it under the terms of the GNU General Public License as
|
7
|
+
# published by the Free Software Foundation; either version 2 of the License,
|
8
|
+
# or (at your option) any later version.
|
9
|
+
#
|
10
|
+
# The Stanford Parser Ruby Wrapper is distributed in the hope that it will be
|
11
|
+
# useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
|
12
|
+
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
|
13
|
+
# Public License for more details.
|
14
|
+
#
|
15
|
+
# You should have received a copy of the GNU General Public License along with
|
16
|
+
# editalign; if not, write to the Free Software Foundation, Inc., 51 Franklin
|
17
|
+
# St, Fifth Floor, Boston, MA 02110-1301 USA
|
18
|
+
|
19
|
+
# Extensions to the {Ruby-Java Bridge}[http://rjb.rubyforge.org/] module that
|
20
|
+
# add a generic Java object wrapper class.
|
21
|
+
module Rjb
|
22
|
+
|
23
|
+
#--
|
24
|
+
# The documentation for this class appears next to its extension inside the
|
25
|
+
# StanfordParser module in stanfordparser.rb. This should be changed if Rjb
|
26
|
+
# is ever moved into its own gem. See the documentation in stanfordparser.rb
|
27
|
+
# for more details.
|
28
|
+
#++
|
29
|
+
class JavaObjectWrapper
|
30
|
+
include Enumerable
|
31
|
+
|
32
|
+
# The underlying Java object.
|
33
|
+
attr_reader :java_object
|
34
|
+
|
35
|
+
# Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
|
36
|
+
# String, treat it as a Java class name and instantiate it. Otherwise,
|
37
|
+
# treat <em>obj</em> as an instance of a Java object.
|
38
|
+
def initialize(obj, *args)
|
39
|
+
@java_object = obj.class == String ?
|
40
|
+
Rjb::import(obj).send(:new, *args) : obj
|
41
|
+
end
|
42
|
+
|
43
|
+
# Enumerate all the items in the object using its iterator. If the object
|
44
|
+
# has no iterator, this function yields nothing.
|
45
|
+
def each
|
46
|
+
if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
|
47
|
+
i = @java_object.iterator
|
48
|
+
while i.hasNext
|
49
|
+
yield wrap_java_object(i.next)
|
50
|
+
end
|
51
|
+
end
|
52
|
+
end # each
|
53
|
+
|
54
|
+
# Reflect unhandled method calls to the underlying Java object and wrap
|
55
|
+
# the return value in the appropriate Ruby object.
|
56
|
+
def method_missing(m, *args)
|
57
|
+
begin
|
58
|
+
wrap_java_object(@java_object.send(m, *args))
|
59
|
+
rescue RuntimeError => e
|
60
|
+
# The instance method failed. See if this is a static method.
|
61
|
+
if not e.message.match(/^Fail: unknown method name/).nil?
|
62
|
+
getClass.send(m, *args)
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
66
|
+
|
67
|
+
# Convert a value returned by a call to the underlying Java object to the
|
68
|
+
# appropriate Ruby object.
|
69
|
+
#
|
70
|
+
# If the value is a JavaObjectWrapper, convert it using a protected
|
71
|
+
# function with the name wrap_ followed by the underlying object's
|
72
|
+
# classname with the Java path delimiters converted to underscores. For
|
73
|
+
# example, a <tt>java.util.ArrayList</tt> would be converted by a function
|
74
|
+
# called wrap_java_util_ArrayList.
|
75
|
+
#
|
76
|
+
# If the value lacks the appropriate converter function, wrap it in a
|
77
|
+
# generic JavaObjectWrapper.
|
78
|
+
#
|
79
|
+
# If the value is not a JavaObjectWrapper, return it unchanged.
|
80
|
+
#
|
81
|
+
# This function is called recursively for every element in an Array.
|
82
|
+
def wrap_java_object(object)
|
83
|
+
if object.kind_of?(Array)
|
84
|
+
object.collect {|item| wrap_java_object(item)}
|
85
|
+
elsif object.respond_to?(:_classname)
|
86
|
+
# Ruby-Java Bridge Java objects all have a _classname member which
|
87
|
+
# tells the name of their Java class. Convert this to the
|
88
|
+
# corresponding wrapper function name.
|
89
|
+
wrapper_name = ("wrap_" + object._classname.gsub(/\./, "_")).to_sym
|
90
|
+
respond_to?(wrapper_name) ? send(wrapper_name, object) : JavaObjectWrapper.new(object)
|
91
|
+
else
|
92
|
+
object
|
93
|
+
end
|
94
|
+
end
|
95
|
+
|
96
|
+
# Convert <tt>java.util.ArrayList</tt> objects to Ruby Array objects.
|
97
|
+
def wrap_java_util_ArrayList(object)
|
98
|
+
array_list = []
|
99
|
+
object.size.times do
|
100
|
+
|i| array_list << wrap_java_object(object.get(i))
|
101
|
+
end
|
102
|
+
array_list
|
103
|
+
end
|
104
|
+
|
105
|
+
# Convert <tt>java.util.HashSet</tt> objects to Ruby Set objects.
|
106
|
+
def wrap_java_util_HashSet(object)
|
107
|
+
set = Set.new
|
108
|
+
i = object.iterator
|
109
|
+
while i.hasNext
|
110
|
+
set << wrap_java_object(i.next)
|
111
|
+
end
|
112
|
+
set
|
113
|
+
end
|
114
|
+
|
115
|
+
# Show the classname of the underlying Java object.
|
116
|
+
def inspect
|
117
|
+
"<#{@java_object._classname}>"
|
118
|
+
end
|
119
|
+
|
120
|
+
# Use the underlying Java object's stringification.
|
121
|
+
def to_s
|
122
|
+
toString
|
123
|
+
end
|
124
|
+
|
125
|
+
protected :wrap_java_object, :wrap_java_util_ArrayList, :wrap_java_util_HashSet
|
126
|
+
|
127
|
+
end # JavaObjectWrapper
|
128
|
+
|
129
|
+
end # Rjb
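The naming convention used by wrap_java_object can be tried without a Java VM. The sketch below is a simplified model under stated assumptions: <tt>FakeJavaObject</tt> and <tt>ToyWrapper</tt> are hypothetical names, the fake object only reports a class name, and the fallback returns a marker array instead of a generic wrapper.

```ruby
# Hypothetical stand-in for an Rjb proxy: it only reports a Java class name
# and carries some Ruby data as its payload.
FakeJavaObject = Struct.new(:_classname, :payload)

class ToyWrapper
  # Choose a converter method by name, derived from the Java class name with
  # the dots replaced by underscores. Arrays are converted element by element
  # and anything without a _classname is returned unchanged.
  def wrap(object)
    if object.is_a?(Array)
      object.map { |item| wrap(item) }
    elsif object.respond_to?(:_classname)
      name = ("wrap_" + object._classname.gsub(".", "_")).to_sym
      # The second argument makes respond_to? see protected methods too.
      respond_to?(name, true) ? send(name, object) : [:generic, object._classname]
    else
      object
    end
  end

  protected

  # Converter for objects whose class name is java.util.ArrayList.
  def wrap_java_util_ArrayList(object)
    object.payload.map { |item| wrap(item) }
  end
end

wrapper = ToyWrapper.new
list = FakeJavaObject.new("java.util.ArrayList", [1, "two"])
other = FakeJavaObject.new("java.util.TreeMap", nil)

p wrapper.wrap(list)
p wrapper.wrap(other)
p wrapper.wrap(42)
```

On Ruby 2.0 and later, <tt>respond_to?</tt> ignores protected methods unless its second argument is true, so the sketch passes true; the file above was written for an earlier Ruby.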
|
data/lib/stanfordparser.rb
CHANGED
@@ -1,4 +1,4 @@
|
|
1
|
-
# Copyright 2007 William Patrick McNeill
|
1
|
+
# Copyright 2007-2008 William Patrick McNeill
|
2
2
|
#
|
3
3
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
4
4
|
#
|
@@ -19,121 +19,27 @@
|
|
19
19
|
|
20
20
|
require "pathname"
|
21
21
|
require "rjb"
|
22
|
-
require "
|
22
|
+
require "singleton"
|
23
|
+
begin
|
24
|
+
require "treebank"
|
25
|
+
gem "treebank", ">= 3.0.0"
|
26
|
+
rescue LoadError
|
27
|
+
require "treebank"
|
28
|
+
end
|
23
29
|
require "yaml"
|
24
30
|
|
25
|
-
|
26
|
-
# adds a generic Java object wrapper class.
|
27
|
-
module Rjb
|
28
|
-
|
29
|
-
# A generic wrapper for a Java object loaded via the Ruby Java Bridge. The
|
30
|
-
# wrapper class handles initialization and stringification, and passes other
|
31
|
-
# method calls down to the underlying Java object. Objects returned by the
|
32
|
-
# underlying Java object are converted to the appropriate Ruby object.
|
33
|
-
#
|
34
|
-
# This object is enumerable, yielding items in the order defined by the Java
|
35
|
-
# object's iterator.
|
36
|
-
class JavaObjectWrapper
|
37
|
-
include Enumerable
|
38
|
-
|
39
|
-
# The underlying Java object.
|
40
|
-
attr_reader :java_object
|
41
|
-
|
42
|
-
# Initialize with a Java object <em>obj</em>. If <em>obj</em> is a
|
43
|
-
# String, assume it is a Java class name and instantiate it. Otherwise,
|
44
|
-
# treat <em>obj</em> as an instance of a Java object.
|
45
|
-
def initialize(obj, *args)
|
46
|
-
@java_object = obj.class == String ?
|
47
|
-
Rjb::import(obj).send(:new, *args) : obj
|
48
|
-
end
|
49
|
-
|
50
|
-
# Enumerate all the items in the object using its iterator. If the object
|
51
|
-
# has no iterator, this function yields nothing.
|
52
|
-
def each
|
53
|
-
if @java_object.getClass.getMethods.any? {|m| m.getName == "iterator"}
|
54
|
-
i = @java_object.iterator
|
55
|
-
while i.hasNext
|
56
|
-
yield wrap_java_object(i.next)
|
57
|
-
end
|
58
|
-
end
|
59
|
-
end # each
|
60
|
-
|
61
|
-
# Reflect unhandled method calls to the underlying Java object.
|
62
|
-
def method_missing(m, *args)
|
63
|
-
wrap_java_object(@java_object.send(m, *args))
|
64
|
-
end
|
65
|
-
|
66
|
-
# Convert a value returned by a call to the underlying Java object to the
|
67
|
-
# appropriate Ruby object as follows:
|
68
|
-
# * RJB objects are placed inside a generic JavaObjectWrapper wrapper.
|
69
|
-
# * <tt>java.util.ArrayList</tt> objects are converted to Ruby Arrays.
|
70
|
-
# * <tt>java.util.HashSet</tt> objects are converted to Ruby Sets
|
71
|
-
# * Other objects are left unchanged.
|
72
|
-
#
|
73
|
-
# This function is applied recursively to items in collection objects such
|
74
|
-
# as set and arrays.
|
75
|
-
def wrap_java_object(object)
|
76
|
-
if object.kind_of?(Array)
|
77
|
-
object.collect {|item| wrap_java_object(item)}
|
78
|
-
# Ruby-Java Bridge Java objects all have a _classname member which tells
|
79
|
-
# the name of their Java class.
|
80
|
-
elsif object.respond_to?(:_classname)
|
81
|
-
case object._classname
|
82
|
-
when /java\.util\.ArrayList/
|
83
|
-
# Convert java.util.ArrayList objects to Ruby arrays.
|
84
|
-
array_list = []
|
85
|
-
object.size.times do
|
86
|
-
|i| array_list << wrap_java_object(object.get(i))
|
87
|
-
end
|
88
|
-
array_list
|
89
|
-
when /java\.util\.HashSet/
|
90
|
-
# Convert java.util.HashSet objects to Ruby sets.
|
91
|
-
set = Set.new
|
92
|
-
i = object.iterator
|
93
|
-
while i.hasNext
|
94
|
-
set << wrap_java_object(i.next)
|
95
|
-
end
|
96
|
-
set
|
97
|
-
else
|
98
|
-
# Pass other RJB objects off to a handler.
|
99
|
-
wrap_rjb_object(object)
|
100
|
-
end # case
|
101
|
-
else
|
102
|
-
# Return non-RJB objects unchanged.
|
103
|
-
object
|
104
|
-
end # if
|
105
|
-
end # wrap_java_object
|
106
|
-
|
107
|
-
# By default, all RJB classes other than <tt>java.util.ArrayList</tt> and
|
108
|
-
# <tt>java.util.HashSet</tt> go in a generic wrapper. Derived classes may
|
109
|
-
# change this behavior.
|
110
|
-
def wrap_rjb_object(object)
|
111
|
-
JavaObjectWrapper.new(object)
|
112
|
-
end
|
113
|
-
|
114
|
-
# Show the classname of the underlying Java object.
|
115
|
-
def inspect
|
116
|
-
"<#{@java_object._classname}>"
|
117
|
-
end
|
118
|
-
|
119
|
-
# Use the underlying Java object's stringification.
|
120
|
-
def to_s
|
121
|
-
toString
|
122
|
-
end
|
123
|
-
|
124
|
-
protected :wrap_java_object, :wrap_rjb_object
|
125
|
-
|
126
|
-
end # JavaObjectWrapper
|
127
|
-
|
128
|
-
end # Rjb
|
129
|
-
|
31
|
+
require "java_object.rb"
|
130
32
|
|
131
33
|
# Wrapper for the {Stanford Natural Language
|
132
34
|
# Parser}[http://nlp.stanford.edu/downloads/lex-parser.shtml].
|
133
35
|
module StanfordParser
|
134
36
|
|
135
|
-
VERSION = "
|
136
|
-
|
37
|
+
VERSION = "2.0.0"
|
38
|
+
|
39
|
+
# The default sentence segmenter and tokenizer. This is an English-language
|
40
|
+
# tokenizer with support for Penn Treebank markup.
|
41
|
+
EN_PENN_TREEBANK_TOKENIZER = "edu.stanford.nlp.process.PTBTokenizer"
|
42
|
+
|
137
43
|
# Path to an English PCFG model that comes with the Stanford Parser. The
|
138
44
|
# location is relative to the parser root directory. This is a valid value
|
139
45
|
# for the <em>grammar</em> parameter of the LexicalizedParser constructor.
|
@@ -170,32 +76,53 @@ module StanfordParser
|
|
170
76
|
# The root directory of the Stanford parser installation.
|
171
77
|
ROOT = initialize_on_load
|
172
78
|
|
173
|
-
|
79
|
+
#--
|
80
|
+
# The documentation below is for the original Rjb::JavaObjectWrapper object.
|
81
|
+
# It is reproduced here because rdoc only takes the last document block
|
82
|
+
# defined. If Rjb is moved into its own gem, this documentation should go
|
83
|
+
# with it, and the following should be written as documentation for this
|
84
|
+
# class:
|
85
|
+
#
|
174
86
|
# Extension of the generic Ruby-Java Bridge wrapper object for the
|
175
87
|
# StanfordParser module.
|
176
|
-
|
177
|
-
|
178
|
-
|
179
|
-
|
180
|
-
|
181
|
-
|
182
|
-
|
183
|
-
|
184
|
-
|
185
|
-
|
186
|
-
|
187
|
-
|
188
|
-
|
189
|
-
|
190
|
-
|
191
|
-
|
88
|
+
#++
|
89
|
+
# A generic wrapper for a Java object loaded via the {Ruby-Java
|
90
|
+
# Bridge}[http://rjb.rubyforge.org/]. The wrapper class handles
|
91
|
+
# initialization and stringification, and passes other method calls down to
|
92
|
+
# the underlying Java object. Objects returned by the underlying Java
|
93
|
+
# object are converted to the appropriate Ruby object.
|
94
|
+
#
|
95
|
+
# Other modules may extend the list of Java objects that are converted by
|
96
|
+
# adding their own converter functions. See wrap_java_object for details.
|
97
|
+
#
|
98
|
+
# This object is enumerable, yielding items in the order defined by the
|
99
|
+
# underlying Java object's iterator.
|
100
|
+
class Rjb::JavaObjectWrapper
|
101
|
+
# FeatureLabel objects go inside a FeatureLabel wrapper.
|
102
|
+
def wrap_edu_stanford_nlp_ling_FeatureLabel(object)
|
103
|
+
StanfordParser::FeatureLabel.new(object)
|
104
|
+
end
|
105
|
+
|
106
|
+
# Tree objects go inside a Tree wrapper. Various tree types are aliased
|
107
|
+
# to this function.
|
108
|
+
def wrap_edu_stanford_nlp_trees_Tree(object)
|
109
|
+
Tree.new(object)
|
110
|
+
end
|
111
|
+
|
112
|
+
alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeLeaf :wrap_edu_stanford_nlp_trees_Tree
|
113
|
+
alias :wrap_edu_stanford_nlp_trees_LabeledScoredTreeNode :wrap_edu_stanford_nlp_trees_Tree
|
114
|
+
alias :wrap_edu_stanford_nlp_trees_SimpleTree :wrap_edu_stanford_nlp_trees_Tree
|
115
|
+
alias :wrap_edu_stanford_nlp_trees_TreeGraphNode :wrap_edu_stanford_nlp_trees_Tree
|
116
|
+
|
117
|
+
protected :wrap_edu_stanford_nlp_trees_Tree, :wrap_edu_stanford_nlp_ling_FeatureLabel
|
118
|
+
end # Rjb::JavaObjectWrapper
|
192
119
|
|
193
120
|
|
194
121
|
# Lexicalized probabilistic parser.
|
195
122
|
#
|
196
123
|
# This is a wrapper for the
|
197
124
|
# <tt>edu.stanford.nlp.parser.lexparser.LexicalizedParser</tt> object.
|
198
|
-
class LexicalizedParser < JavaObjectWrapper
|
125
|
+
class LexicalizedParser < Rjb::JavaObjectWrapper
|
199
126
|
# The grammar used by the parser
|
200
127
|
attr_reader :grammar
|
201
128
|
|
@@ -220,10 +147,17 @@ module StanfordParser
|
|
220
147
|
end # LexicalizedParser
|
221
148
|
|
222
149
|
|
150
|
+
# A singleton instance of the default Stanford Natural Language parser. A
|
151
|
+
# singleton is used because the parser can take a few seconds to load.
|
152
|
+
class DefaultParser < StanfordParser::LexicalizedParser
|
153
|
+
include Singleton
|
154
|
+
end
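The effect of including Singleton can be seen with a stand-in class. This is a minimal sketch, assuming a hypothetical <tt>ToyParser</tt> whose constructor merely counts how often it runs; it does not load a grammar or call the Java parser.

```ruby
require "singleton"

# Hypothetical stand-in for a parser that is slow to construct.
class ToyParser
  include Singleton

  # Class-level counter of how many times the constructor has run.
  @loads = 0
  class << self
    attr_accessor :loads
  end

  def initialize
    # Stands in for loading a serialized grammar from disk.
    self.class.loads += 1
  end

  def apply(sentence)
    "(ROOT #{sentence})"
  end
end

first = ToyParser.instance
second = ToyParser.instance

puts first.equal?(second)
puts ToyParser.loads
puts first.apply("This is a sentence.")
```

Every call to <tt>instance</tt> returns the same object, so the expensive setup in the constructor happens once per process.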
|
155
|
+
|
156
|
+
|
223
157
|
# This is a wrapper for
|
224
158
|
# <tt>edu.stanford.nlp.trees.Tree</tt> objects. It customizes
|
225
159
|
# stringification.
|
226
|
-
class Tree < JavaObjectWrapper
|
160
|
+
class Tree < Rjb::JavaObjectWrapper
|
227
161
|
def initialize(obj = "edu.stanford.nlp.trees.Tree")
|
228
162
|
super(obj)
|
229
163
|
end
|
@@ -245,16 +179,16 @@ module StanfordParser
|
|
245
179
|
# This is a wrapper for
|
246
180
|
# <tt>edu.stanford.nlp.ling.Word</tt> objects. It customizes
|
247
181
|
# stringification and adds an equivalence operator.
|
248
|
-
class Word < JavaObjectWrapper
|
182
|
+
class Word < Rjb::JavaObjectWrapper
|
249
183
|
def initialize(obj = "edu.stanford.nlp.ling.Word", *args)
|
250
184
|
super(obj, *args)
|
251
185
|
end
|
252
|
-
|
186
|
+
|
253
187
|
# See the word values.
|
254
188
|
def inspect
|
255
189
|
to_s
|
256
190
|
end
|
257
|
-
|
191
|
+
|
258
192
|
# Equivalence is defined relative to the word value.
|
259
193
|
def ==(other)
|
260
194
|
word == other
|
@@ -262,11 +196,34 @@ module StanfordParser
|
|
262
196
|
end # Word
|
263
197
|
|
264
198
|
|
199
|
+
# This is a wrapper for <tt>edu.stanford.nlp.ling.FeatureLabel</tt> objects.
|
200
|
+
# It customizes stringification.
|
201
|
+
class FeatureLabel < Rjb::JavaObjectWrapper
|
202
|
+
def initialize(obj = "edu.stanford.nlp.ling.FeatureLabel")
|
203
|
+
super
|
204
|
+
end
|
205
|
+
|
206
|
+
# Stringify with just the token and its begin and end position.
|
207
|
+
def to_s
|
208
|
+
# BUGBUG The position values come back as java.lang.Integer though I
|
209
|
+
# would expect Rjb to convert them to Ruby integers.
|
210
|
+
begin_position = get(self.BEGIN_POSITION_KEY)
|
211
|
+
end_position = get(self.END_POSITION_KEY)
|
212
|
+
"#{current} [#{begin_position},#{end_position}]"
|
213
|
+
end
|
214
|
+
|
215
|
+
# More verbose stringification with all the fields and their values.
|
216
|
+
def inspect
|
217
|
+
toString
|
218
|
+
end
|
219
|
+
end
|
220
|
+
|
221
|
+
|
265
222
|
# Tokenizes documents into words and sentences.
|
266
223
|
#
|
267
224
|
# This is a wrapper for the
|
268
225
|
# <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> object.
|
269
|
-
class DocumentPreprocessor < JavaObjectWrapper
|
226
|
+
class DocumentPreprocessor < Rjb::JavaObjectWrapper
|
270
227
|
def initialize(suppressEscaping = false)
|
271
228
|
super("edu.stanford.nlp.process.DocumentPreprocessor", suppressEscaping)
|
272
229
|
end
|
@@ -276,6 +233,229 @@ module StanfordParser
|
|
276
233
|
s = Rjb::JavaObjectWrapper.new("java.io.StringReader", s)
|
277
234
|
_invoke(:getSentencesFromText, "Ljava.io.Reader;", s.java_object)
|
278
235
|
end
|
236
|
+
|
237
|
+
def inspect
|
238
|
+
"<#{self.class.to_s.split('::').last}>"
|
239
|
+
end
|
240
|
+
|
241
|
+
def to_s
|
242
|
+
inspect
|
243
|
+
end
|
279
244
|
end # DocumentPreprocessor
|
280
245
|
|
246
|
+
StandoffToken = Struct.new(:current, :word, :before, :after,
|
247
|
+
:begin_position, :end_position)
|
248
|
+
|
249
|
+
# A text token that contains raw and normalized token identity (e.g. "(" and
|
250
|
+
# "-LRB-"), an offset span, and the characters immediately preceding and
|
251
|
+
# following the token. Given a list of these objects it is possible to
|
252
|
+
# recreate the text from which they came verbatim.
|
253
|
+
class StandoffToken
|
254
|
+
def to_s
|
255
|
+
"#{current} [#{begin_position},#{end_position}]"
|
256
|
+
end
|
257
|
+
end
|
258
|
+
|
259
|
+
|
260
|
+
# A preprocessor that segments text into sentences and tokens that contain
|
261
|
+
# character offset and token context information that can be used for
|
262
|
+
# standoff annotation.
|
263
|
+
class StandoffDocumentPreprocessor < DocumentPreprocessor
|
264
|
+
def initialize(tokenizer = EN_PENN_TREEBANK_TOKENIZER)
|
265
|
+
# PTBTokenizer.factory is a static function, so use RJB to call it
|
266
|
+
# directly instead of going through a JavaObjectWrapper. We do it this
|
267
|
+
# way because the Stanford parser Java code does not provide a
|
268
|
+
# constructor that allows you to set the second parameter,
|
269
|
+
# invertible, to true, and we need this to write character offset
|
270
|
+
# information into the tokens.
|
271
|
+
ptb_tokenizer_class = Rjb::import(tokenizer)
|
272
|
+
# See the documentation for
|
273
|
+
# <tt>edu.stanford.nlp.process.DocumentPreprocessor</tt> for a
|
274
|
+
# description of these parameters.
|
275
|
+
ptb_tokenizer_factory = ptb_tokenizer_class.factory(false, true, false)
|
276
|
+
super(ptb_tokenizer_factory)
|
277
|
+
end
|
278
|
+
|
279
|
+
# Returns a list of sentences in a string. This wraps the returned
|
280
|
+
# sentences in a StandoffSentence object.
|
281
|
+
def getSentencesFromString(s)
|
282
|
+
super(s).map!{|s| StandoffSentence.new(s)}
|
283
|
+
end
|
284
|
+
end
|
285
|
+
|
286
|
+
|
287
|
+
# A sentence is an array of StandoffToken objects.
|
288
|
+
class StandoffSentence < Array
|
289
|
+
# Construct an array of StandoffToken objects from a Java list sentence
|
290
|
+
# object returned by the preprocessor.
|
291
|
+
def initialize(stanford_parser_sentence)
|
292
|
+
# Convert FeatureLabel wrappers to StandoffToken objects.
|
293
|
+
s = stanford_parser_sentence.to_a.collect do |fs|
|
294
|
+
current = fs.current
|
295
|
+
word = fs.word
|
296
|
+
before = fs.before
|
297
|
+
after = fs.after
|
298
|
+
# The to_s.to_i is necessary because the get function returns
|
299
|
+
# java.lang.Integer objects instead of Ruby integers.
|
300
|
+
begin_position = fs.get(fs.BEGIN_POSITION_KEY).to_s.to_i
|
301
|
+
end_position = fs.get(fs.END_POSITION_KEY).to_s.to_i
|
302
|
+
StandoffToken.new(current, word, before, after,
|
303
|
+
begin_position, end_position)
|
304
|
+
end
|
305
|
+
super(s)
|
306
|
+
end
|
307
|
+
|
308
|
+
# Return the original string verbatim.
|
309
|
+
def to_s
|
310
|
+
self[0..-2].inject(""){|s, word| s + word.current + word.after} + last.current
|
311
|
+
end
|
312
|
+
|
313
|
+
# Return the original string verbatim.
|
314
|
+
def inspect
|
315
|
+
to_s
|
316
|
+
end
|
317
|
+
end
|
318
|
+
|
319
|
+
|
320
|
+
# Standoff syntactic annotation of natural language text which may contain
|
321
|
+
# multiple sentences.
|
322
|
+
#
|
323
|
+
# This is an Array of StandoffNode objects, one for each sentence in the
|
324
|
+
# text.
|
325
|
+
class StandoffParsedText < Array
|
326
|
+
# Parse the text and create the standoff annotation.
|
327
|
+
#
|
328
|
+
# The default parser is a singleton instance of the English language
|
329
|
+
# Stanford Natural Language parser. There may be a delay of a few
|
330
|
+
# seconds for it to load the first time it is created.
|
331
|
+
def initialize(text, nodetype = StandoffNode,
|
332
|
+
tokenizer = EN_PENN_TREEBANK_TOKENIZER,
|
333
|
+
parser = DefaultParser.instance)
|
334
|
+
preprocessor = StandoffDocumentPreprocessor.new(tokenizer)
|
335
|
+
# Segment the text into sentences. Parse each sentence, writing
|
336
|
+
# standoff annotation information into the terminal nodes.
|
337
|
+
preprocessor.getSentencesFromString(text).map do |sentence|
|
338
|
+
parse = parser.apply(sentence.to_s)
|
339
|
+
push(nodetype.new(parse, sentence))
|
340
|
+
end
|
341
|
+
end
|
342
|
+
|
343
|
+
# Print class name and number of sentences.
|
344
|
+
def inspect
|
345
|
+
"<#{self.class.name}, #{length} sentences>"
|
346
|
+
end
|
347
|
+
|
348
|
+
# Print parses.
|
349
|
+
def to_s
|
350
|
+
flatten.join(" ")
|
351
|
+
end
|
352
|
+
end
|
353
|
+
|
354
|
+
|
355
|
+
# Standoff syntactic tree annotation of text. Terminal nodes are labeled
|
356
|
+
# with the appropriate StandoffToken objects. Standoff parses can reproduce
|
357
|
+
# the original string from which they were generated verbatim, optionally
|
358
|
+
# with brackets around the yields of specified non-terminal nodes.
|
359
|
+
class StandoffNode < Treebank::Node
|
360
|
+
# Create the standoff tree from a tree returned by the Stanford parser.
|
361
|
+
# For non-terminal nodes, the <em>tokens</em> argument will be a
|
362
|
+
# StandoffSentence containing the StandoffToken objects representing all
|
363
|
+
# the tokens beneath and after this node. For terminal nodes, the
|
364
|
+
# <em>tokens</em> argument will be a StandoffToken.
|
365
|
+
def initialize(stanford_parser_node, tokens)
|
366
|
+
# Annotate this node with a non-terminal label or a StandoffToken as
|
367
|
+
# appropriate.
|
368
|
+
super(tokens.instance_of?(StandoffSentence) ?
|
369
|
+
stanford_parser_node.value : tokens)
|
370
|
+
# Enumerate the children depth-first. Tokens are removed from the list
|
371
|
+
# left-to-right as terminal nodes are added to the tree.
|
372
|
+
stanford_parser_node.children.each do |child|
|
373
|
+
subtree = self.class.new(child, child.leaf? ? tokens.shift : tokens)
|
374
|
+
attach_child!(subtree)
|
375
|
+
end
|
376
|
+
end
|
377
|
+
|
378
|
+
# Return the original text string dominated by this node.
|
379
|
+
def to_original_string
|
380
|
+
leaves.inject("") do |s, leaf|
|
381
|
+
s += leaf.label.current + leaf.label.after
|
382
|
+
end
|
383
|
+
end
|
384
|
+
|
385
|
+
# Print the original string with brackets around word spans dominated by
|
386
|
+
# the specified constituents.
|
387
|
+
#
|
388
|
+
# The constituents to bracket are specified by passing a list of node
|
389
|
+
# coordinates, which are arrays of integers of the form returned by the
|
390
|
+
# tree enumerators of Treebank::Node objects.
|
391
|
+
#
|
392
|
+
# _coords_:: the coordinates of the nodes around which to place brackets
|
393
|
+
# _open_:: the open bracket symbol
|
394
|
+
# _close_:: the close bracket symbol
|
395
|
+
def to_bracketed_string(coords, open = "[", close = "]")
|
396
|
+
# Get a list of all the leaf nodes and their coordinates.
|
397
|
+
items = depth_first_enumerator(true).find_all {|n| n.first.leaf?}
|
398
|
+
# Enumerate over all the matching constituents inserting open and close
|
399
|
+
# brackets around their yields in the items list.
|
400
|
+
coords.each do |matching|
|
401
|
+
# Insert using a simple state machine with three states: :start,
|
402
|
+
# :open, and :close.
|
403
|
+
state = :start
|
404
|
+
# Enumerate over the items list looking for nodes that are the
|
405
|
+
# children of the matching constituent.
|
406
|
+
items.each_with_index do |item, index|
|
407
|
+
# Skip inserted bracket characters.
|
408
|
+
next if item.is_a? String
|
409
|
+
# Handle terminal node items with the state machine.
|
410
|
+
node, terminal_coordinate = item
|
411
|
+
if state == :start
|
412
|
+
next if not in_yield?(matching, terminal_coordinate)
|
413
|
+
items.insert(index, open)
|
414
|
+
state = :open
|
415
|
+
else # state == :open
|
416
|
+
next if in_yield?(matching, terminal_coordinate)
|
417
|
+
items.insert(index, close)
|
418
|
+
state = :close
|
419
|
+
break
|
420
|
+
end
|
421
|
+
end # items.each_with_index
|
422
|
+
# Handle the case where a matching constituent is flush with the end
|
423
|
+
# of the sentence.
|
424
|
+
items << close if state == :open
|
425
|
+
end # each
|
426
|
+
# Replace terminal nodes with their string representations. Insert
|
427
|
+
# spacing characters in the list.
|
428
|
+
items.each_with_index do |item, index|
|
429
|
+
next if item.is_a? String
|
430
|
+
text = item.first.label.current
|
431
|
+
spacing = item.first.label.after
|
432
|
+
# Replace the terminal node with its text.
|
433
|
+
items[index] = text
|
434
|
+
# Insert the spacing that comes after this text before the first
|
435
|
+
# non-close bracket character.
|
436
|
+
close_pos = find_index(items[index+1..-1]) {|item| not item == close}
|
437
|
+
items.insert(index + close_pos + 1, spacing)
|
438
|
+
end
|
439
|
+
items.join
|
440
|
+
end # to_bracketed_string
|
441
|
+
|
442
|
+
# Find the index of the first item in _list_ for which _block_ is true.
|
443
|
+
# Return 0 if no items are found.
|
444
|
+
def find_index(list, &block)
|
445
|
+
list.each_with_index do |item, index|
|
446
|
+
return index if block.call(item)
|
447
|
+
end
|
448
|
+
0
|
449
|
+
end
|
450
|
+
|
451
|
+
# Is the node at _terminal_ in the yield of the node at _node_?
|
452
|
+
def in_yield?(node, terminal)
|
453
|
+
# If node A's coordinates match the prefix of node B's coordinates, node
|
454
|
+
# B is in the yield of node A.
|
455
|
+
terminal.first(node.length) == node
|
456
|
+
end
|
457
|
+
|
458
|
+
private :in_yield?, :find_index
|
459
|
+
end # StandoffNode
|
460
|
+
|
281
461
|
end # StanfordParser
|
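The coordinate-prefix test behind <tt>in_yield?</tt> and the bracket placement it drives can be tried without the parser or a Java VM. What follows is a minimal sketch under stated assumptions, not the gem's implementation: <tt>LeafToken</tt>, <tt>bracket</tt>, and the leaf coordinates are invented for illustration, and only one constituent is bracketed at a time. The only piece taken from the diff is the prefix comparison itself.

```ruby
# Hypothetical stand-in for a StandoffToken: the token text plus the
# whitespace that followed it in the original string.
LeafToken = Struct.new(:current, :after)

# Same prefix comparison as in the diff: a terminal is in the yield of a
# node when the node's coordinates are a prefix of the terminal's.
def in_yield?(node, terminal)
  terminal.first(node.length) == node
end

# Simplified single-constituent bracketing (an assumption, not the gem's
# state machine): open before the first covered leaf, close after the last.
def bracket(leaves, coord, open = "[", close = "]")
  inside = leaves.map { |c, _| in_yield?(coord, c) }
  out = String.new
  leaves.each_with_index do |(_, token), i|
    out << open if inside[i] && (i == 0 || !inside[i - 1])
    out << token.current
    out << close if inside[i] && (i == leaves.length - 1 || !inside[i + 1])
    out << token.after
  end
  out
end

# Invented leaf coordinates for a small tree; real coordinates come from
# the Treebank::Node enumerators.
leaves = [
  [[0, 0, 0],    LeafToken.new("He", " ")],
  [[0, 1, 0],    LeafToken.new("is", " ")],
  [[0, 1, 1, 0], LeafToken.new("tall", "")],
  [[0, 2],       LeafToken.new(".", "")]
]

bracketed = bracket(leaves, [0, 1])
puts bracketed
```

As in the diff, the spacing that follows a token is emitted after any close bracket, so the bracket hugs the last word of the constituent.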
data/test/test_stanfordparser.rb
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
|
3
3
|
#--
|
4
4
|
|
5
|
-
# Copyright 2007 William Patrick McNeill
|
5
|
+
# Copyright 2007-2008 William Patrick McNeill
|
6
6
|
#
|
7
7
|
# This file is part of the Stanford Parser Ruby Wrapper.
|
8
8
|
#
|
@@ -30,20 +30,13 @@ require "singleton"
|
|
30
30
|
require "stanfordparser"
|
31
31
|
|
32
32
|
|
33
|
-
# Make the Lexicalized Parser a singleton for the tests because it takes
|
34
|
-
# several seconds to load.
|
35
|
-
class StanfordParser::LexicalizedParser
|
36
|
-
include Singleton
|
37
|
-
end
|
38
|
-
|
39
|
-
|
40
33
|
class LexicalizedParserTestCase < Test::Unit::TestCase
|
41
34
|
def test_root_path
|
42
35
|
assert_equal StanfordParser::ROOT.class, Pathname
|
43
36
|
end
|
44
37
|
|
45
38
|
def setup
|
46
|
-
@parser = StanfordParser::
|
39
|
+
@parser = StanfordParser::DefaultParser.instance
|
47
40
|
@tree = @parser.apply("This is a sentence.")
|
48
41
|
end
|
49
42
|
|
@@ -53,6 +46,8 @@ class LexicalizedParserTestCase < Test::Unit::TestCase
|
|
53
46
|
end
|
54
47
|
|
55
48
|
def test_localTrees
|
49
|
+
# The following call exercises the conversion from java.util.HashSet
|
50
|
+
# objects to Ruby sets.
|
56
51
|
l = @tree.localTrees
|
57
52
|
assert_equal l.size, 5
|
58
53
|
assert_equal Set.new(l.collect {|t| "#{t.label}"}),
|
@@ -68,7 +63,7 @@ end # LexicalizedParserTestCase
|
|
68
63
|
|
69
64
|
class TreeTestCase < Test::Unit::TestCase
|
70
65
|
def setup
|
71
|
-
@parser = StanfordParser::
|
66
|
+
@parser = StanfordParser::DefaultParser.instance
|
72
67
|
@tree = @parser.apply("This is a sentence.")
|
73
68
|
end
|
74
69
|
|
@@ -85,12 +80,30 @@ class TreeTestCase < Test::Unit::TestCase
|
|
85
80
|
end # TreeTestCase
|
86
81
|
|
87
82
|
|
83
|
+
class FeatureLabelTestCase < Test::Unit::TestCase
|
84
|
+
def test_feature_label
|
85
|
+
f = StanfordParser::FeatureLabel.new
|
86
|
+
assert_equal "BEGIN_POS", f.BEGIN_POSITION_KEY
|
87
|
+
f.put(f.BEGIN_POSITION_KEY, 3)
|
88
|
+
assert_equal "END_POS", f.END_POSITION_KEY
|
89
|
+
f.put(f.END_POSITION_KEY, 7)
|
90
|
+
assert_equal "current", f.CURRENT_KEY
|
91
|
+
f.put(f.CURRENT_KEY, "word")
|
92
|
+
assert_equal "{BEGIN_POS=3, END_POS=7, current=word}", f.inspect
|
93
|
+
assert_equal "word [3,7]", f.to_s
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
|
88
98
|
class DocumentPreprocessorTestCase < Test::Unit::TestCase
|
89
99
|
def setup
|
90
100
|
@preproc = StanfordParser::DocumentPreprocessor.new
|
101
|
+
@standoff_preproc = StanfordParser::StandoffDocumentPreprocessor.new
|
91
102
|
end
|
92
103
|
|
93
104
|
def test_get_sentences_from_string
|
105
|
+
# The following call exercises the conversion from java.util.ArrayList
|
106
|
+
# objects to Ruby arrays.
|
94
107
|
s = @preproc.getSentencesFromString("This is a sentence. So is this.")
|
95
108
|
assert_equal "#{s[0]}", "This is a sentence ."
|
96
109
|
assert_equal "#{s[1]}", "So is this ."
|
@@ -100,15 +113,112 @@ class DocumentPreprocessorTestCase < Test::Unit::TestCase
|
|
100
113
|
# StanfordParser::DocumentPreprocessor is not an enumerable object.
|
101
114
|
assert_equal @preproc.map, []
|
102
115
|
end
|
116
|
+
|
117
|
+
# Segment and tokenize text containing two sentences.
|
118
|
+
def test_standoff_document_preprocessor
|
119
|
+
sentences = @standoff_preproc.getSentencesFromString("He (John) is tall.  So is she.")
|
120
|
+
# Recognize two sentences.
|
121
|
+
assert_equal 2, sentences.length
|
122
|
+
assert sentences.all? {|sentence| sentence.instance_of? StanfordParser::StandoffSentence}
|
123
|
+
assert_equal "He (John) is tall.", sentences.first.to_s
|
124
|
+
assert_equal 7, sentences.first.length
|
125
|
+
assert sentences[0].all? {|token| token.instance_of? StanfordParser::StandoffToken}
|
126
|
+
assert_equal "So is she.", sentences.last.to_s
|
127
|
+
assert_equal 4, sentences.last.length
|
128
|
+
assert sentences[1].all? {|token| token.instance_of? StanfordParser::StandoffToken}
|
129
|
+
# Get the correct token information for the first sentence.
|
130
|
+
assert_equal ["He", "He"], [sentences[0][0].current(), sentences[0][0].word()]
|
131
|
+
assert_equal [0,2], [sentences[0][0].begin_position(), sentences[0][0].end_position()]
|
132
|
+
assert_equal ["(", "-LRB-"], [sentences[0][1].current(), sentences[0][1].word()]
|
133
|
+
assert_equal [3,4], [sentences[0][1].begin_position(), sentences[0][1].end_position()]
|
134
|
+
assert_equal ["John", "John"], [sentences[0][2].current(), sentences[0][2].word()]
|
135
|
+
assert_equal [4,8], [sentences[0][2].begin_position(), sentences[0][2].end_position()]
|
136
|
+
assert_equal [")", "-RRB-"], [sentences[0][3].current(), sentences[0][3].word()]
|
137
|
+
assert_equal [8,9], [sentences[0][3].begin_position(), sentences[0][3].end_position()]
|
138
|
+
assert_equal ["is", "is"], [sentences[0][4].current(), sentences[0][4].word()]
|
139
|
+
assert_equal [10,12], [sentences[0][4].begin_position(), sentences[0][4].end_position()]
|
140
|
+
assert_equal ["tall", "tall"], [sentences[0][5].current(), sentences[0][5].word()]
|
141
|
+
assert_equal [13,17], [sentences[0][5].begin_position(), sentences[0][5].end_position()]
|
142
|
+
assert_equal [".", "."], [sentences[0][6].current(), sentences[0][6].word()]
|
143
|
+
assert_equal [17,18], [sentences[0][6].begin_position(), sentences[0][6].end_position()]
|
144
|
+
# Get the correct token information for the second sentence.
|
145
|
+
assert_equal ["So", "So"], [sentences[1][0].current(), sentences[1][0].word()]
|
146
|
+
assert_equal [20,22], [sentences[1][0].begin_position(), sentences[1][0].end_position()]
|
147
|
+
assert_equal ["is", "is"], [sentences[1][1].current(), sentences[1][1].word()]
|
148
|
+
assert_equal [23,25], [sentences[1][1].begin_position(), sentences[1][1].end_position()]
|
149
|
+
assert_equal ["she", "she"], [sentences[1][2].current(), sentences[1][2].word()]
|
150
|
+
assert_equal [26,29], [sentences[1][2].begin_position(), sentences[1][2].end_position()]
|
151
|
+
assert_equal [".", "."], [sentences[1][3].current(), sentences[1][3].word()]
|
152
|
+
assert_equal [29,30], [sentences[1][3].begin_position(), sentences[1][3].end_position()]
|
153
|
+
end
|
154
|
+
|
155
|
+
def test_stringification
|
156
|
+
assert_equal "<DocumentPreprocessor>", @preproc.inspect
|
157
|
+
assert_equal "<DocumentPreprocessor>", @preproc.to_s
|
158
|
+
assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.inspect
|
159
|
+
assert_equal "<StandoffDocumentPreprocessor>", @standoff_preproc.to_s
|
160
|
+
end
|
161
|
+
|
103
162
|
end # DocumentPreprocessorTestCase
|
104
163
|
|
105
164
|
|
165
|
+
class StandoffParsedTextTestCase < Test::Unit::TestCase
|
166
|
+
def setup
|
167
|
+
@text = "He (John) is tall.  So is she."
|
168
|
+
end
|
169
|
+
|
170
|
+
def test_parse_text_default_nodetype
|
171
|
+
parsed_text = StanfordParser::StandoffParsedText.new(@text)
|
172
|
+
verify_parsed_text(parsed_text, StanfordParser::StandoffNode)
|
173
|
+
end
|
174
|
+
|
175
|
+
# Verify correct parsing with variable node types for text containing two sentences.
|
176
|
+
def verify_parsed_text(parsed_text, nodetype)
|
177
|
+
# Verify that there are two sentences.
|
178
|
+
assert_equal 2, parsed_text.length
|
179
|
+
assert parsed_text.all? {|sentence| sentence.instance_of? nodetype}
|
180
|
+
# Verify the tokens in the leaf node of the first sentence.
|
181
|
+
leaves = parsed_text[0].leaves.collect {|node| node.label}
|
182
|
+
assert_equal ["He", "He"], [leaves[0].current(), leaves[0].word()]
|
183
|
+
assert_equal [0,2], [leaves[0].begin_position(), leaves[0].end_position()]
|
184
|
+
assert_equal ["(", "-LRB-"], [leaves[1].current(), leaves[1].word()]
|
185
|
+
assert_equal [3,4], [leaves[1].begin_position(), leaves[1].end_position()]
|
186
|
+
assert_equal ["John", "John"], [leaves[2].current(), leaves[2].word()]
|
187
|
+
assert_equal [4,8], [leaves[2].begin_position(), leaves[2].end_position()]
|
188
|
+
assert_equal [")", "-RRB-"], [leaves[3].current(), leaves[3].word()]
|
189
|
+
assert_equal [8,9], [leaves[3].begin_position(), leaves[3].end_position()]
|
190
|
+
assert_equal ["is", "is"], [leaves[4].current(), leaves[4].word()]
|
191
|
+
assert_equal [10,12], [leaves[4].begin_position(), leaves[4].end_position()]
|
192
|
+
assert_equal ["tall", "tall"], [leaves[5].current(), leaves[5].word()]
|
193
|
+
assert_equal [13,17], [leaves[5].begin_position(), leaves[5].end_position()]
|
194
|
+
assert_equal [".", "."], [leaves[6].current(), leaves[6].word()]
|
195
|
+
assert_equal [17,18], [leaves[6].begin_position(), leaves[6].end_position()]
|
196
|
+
# Verify the tokens in the leaf node of the second sentence.
|
197
|
+
leaves = parsed_text[1].leaves.collect {|node| node.label}
|
198
|
+
assert_equal ["So", "So"], [leaves[0].current(), leaves[0].word()]
|
199
|
+
assert_equal [20,22], [leaves[0].begin_position(), leaves[0].end_position()]
|
200
|
+
assert_equal ["is", "is"], [leaves[1].current(), leaves[1].word()]
|
201
|
+
assert_equal [23,25], [leaves[1].begin_position(), leaves[1].end_position()]
|
202
|
+
assert_equal ["she", "she"], [leaves[2].current(), leaves[2].word()]
|
203
|
+
assert_equal [26,29], [leaves[2].begin_position(), leaves[2].end_position()]
|
204
|
+
assert_equal [".", "."], [leaves[3].current(), leaves[3].word()]
|
205
|
+
assert_equal [29,30], [leaves[3].begin_position(), leaves[3].end_position()]
|
206
|
+
# Verify that the original string is recoverable.
|
207
|
+
assert_equal "He (John) is tall.  ", parsed_text[0].to_original_string
|
208
|
+
assert_equal "So is she." , parsed_text[1].to_original_string
|
209
|
+
# Draw < and > brackets around 3 constituents.
|
210
|
+
b = parsed_text[0].to_bracketed_string([[0,0], [0,0,1,1], [0,1,1]], "<", ">")
|
211
|
+
assert_equal "<He (<John>)> is <tall>.  ", b
|
212
|
+
end
|
213
|
+
end
|
214
|
+
|
215
|
+
|
106
216
|
class MiscPreprocessorTestCase < Test::Unit::TestCase
|
107
217
|
def test_model_location
|
108
218
|
assert_equal "$(ROOT)/englishPCFG.ser.gz", StanfordParser::ENGLISH_PCFG_MODEL
|
109
219
|
end
|
110
|
-
|
220
|
+
|
111
221
|
def test_word
|
112
222
|
assert StanfordParser::Word.new("edu.stanford.nlp.ling.Word", "dog") == "dog"
|
113
223
|
end
|
114
|
-
end # MiscPreprocessorTestCase
|
224
|
+
end # MiscPreprocessorTestCase
|
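The standoff assertions in these tests rest on two properties: each token's begin and end positions index into the original text, and concatenating each token's text with its trailing whitespace rebuilds that text. The sketch below illustrates both in plain Ruby. <tt>StandoffTok</tt> is a hypothetical stand-in for StanfordParser::StandoffToken; the offsets are the ones asserted for the first sentence above, and the <tt>after</tt> values are assumptions inferred from those offsets.

```ruby
# Hypothetical stand-in for StanfordParser::StandoffToken.
StandoffTok = Struct.new(:current, :after, :begin_position, :end_position)

text = "He (John) is tall."

# Offsets copied from the first-sentence assertions; the after values are
# assumed from the gaps between consecutive tokens.
tokens = [
  StandoffTok.new("He",   " ", 0,  2),
  StandoffTok.new("(",    "",  3,  4),
  StandoffTok.new("John", "",  4,  8),
  StandoffTok.new(")",    " ", 8,  9),
  StandoffTok.new("is",   " ", 10, 12),
  StandoffTok.new("tall", "",  13, 17),
  StandoffTok.new(".",    "",  17, 18)
]

# Each [begin, end) span should select exactly the token's original text.
spans_ok = tokens.all? do |t|
  text[t.begin_position...t.end_position] == t.current
end

# Concatenating text and trailing whitespace reproduces the input, which is
# the idea behind to_original_string.
rebuilt = tokens.map { |t| t.current + t.after }.join

puts spans_ok
puts rebuilt
```

The tests compare <tt>current</tt> with <tt>word</tt> separately because the parser's normalized form can differ from the source text, as with "(" and "-LRB-"; the offsets always refer to the unnormalized original.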
metadata
CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
|
|
3
3
|
specification_version: 1
|
4
4
|
name: stanfordparser
|
5
5
|
version: !ruby/object:Gem::Version
|
6
|
-
version:
|
7
|
-
date:
|
6
|
+
version: 2.0.0
|
7
|
+
date: 2008-06-13 00:00:00 -07:00
|
8
8
|
summary: Ruby wrapper for the Stanford Natural Language Parser
|
9
9
|
require_paths:
|
10
10
|
- lib
|
@@ -30,6 +30,7 @@ authors:
|
|
30
30
|
- W.P. McNeill
|
31
31
|
files:
|
32
32
|
- test/test_stanfordparser.rb
|
33
|
+
- lib/java_object.rb
|
33
34
|
- lib/stanfordparser.rb
|
34
35
|
- examples/stanford-sentence-parser.rb
|
35
36
|
- README
|