shalmaneser-prep 1.2.0.rc4
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/.yardopts +10 -0
- data/CHANGELOG.md +4 -0
- data/LICENSE.md +4 -0
- data/README.md +93 -0
- data/lib/frprep/Ampersand.rb +39 -0
- data/lib/frprep/CollinsInterface.rb +1165 -0
- data/lib/frprep/Counter.rb +18 -0
- data/lib/frprep/FNCorpusXML.rb +643 -0
- data/lib/frprep/FNDatabase.rb +144 -0
- data/lib/frprep/FrameXML.rb +513 -0
- data/lib/frprep/Graph.rb +345 -0
- data/lib/frprep/MiniparInterface.rb +1388 -0
- data/lib/frprep/RegXML.rb +269 -0
- data/lib/frprep/STXmlTerminalOrder.rb +194 -0
- data/lib/frprep/SleepyInterface.rb +384 -0
- data/lib/frprep/TntInterface.rb +44 -0
- data/lib/frprep/TreetaggerInterface.rb +327 -0
- data/lib/frprep/do_parses.rb +143 -0
- data/lib/frprep/frprep.rb +693 -0
- data/lib/frprep/interfaces/berkeley_interface.rb +372 -0
- data/lib/frprep/interfaces/stanford_interface.rb +353 -0
- data/lib/frprep/interpreters/berkeley_interpreter.rb +22 -0
- data/lib/frprep/interpreters/stanford_interpreter.rb +22 -0
- data/lib/frprep/one_parsed_file.rb +28 -0
- data/lib/frprep/opt_parser.rb +94 -0
- data/lib/frprep/ruby_class_extensions.rb +310 -0
- data/test/frprep/test_opt_parser.rb +94 -0
- data/test/functional/functional_test_helper.rb +58 -0
- data/test/functional/test_fred.rb +47 -0
- data/test/functional/test_frprep.rb +99 -0
- data/test/functional/test_rosy.rb +40 -0
- metadata +85 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 68cc029ef97c02ec1fec2035bcc251ccbf3cb411
|
4
|
+
data.tar.gz: 440d150bd6e9bf78edaa78dc4f33deecdf98ae40
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: cf59a0a887d1bae7116e1c704360a47e72c0af1d9fc0daf6e4ed4979b7264f764947ab780e2adfff4a67e560ab9200745c294d3b0c2b9e8f56238780943939b2
|
7
|
+
data.tar.gz: 851a8322a0dcee5e7dfd49cedf639fece4cf9538ad78165432f86a80bc720f806f69d8eadd10e39b0ff7ad9cd56ca590ae9dba947fa2e55942d090726a2fcc98
|
data/.yardopts
ADDED
data/CHANGELOG.md
ADDED
data/LICENSE.md
ADDED
data/README.md
ADDED
@@ -0,0 +1,93 @@
|
|
1
|
+
# [SHALMANESER - a SHALlow seMANtic parSER](http://www.coli.uni-saarland.de/projects/salsa/shal/)
|
2
|
+
|
3
|
+
[RubyGems](http://rubygems.org/gems/shalmaneser) |
|
4
|
+
[Shalmanesers Project Page](http://bu.chsta.be/projects/shalmaneser/) |
|
5
|
+
[Source Code](https://github.com/arbox/shalmaneser) |
|
6
|
+
[Bug Tracker](https://github.com/arbox/shalmaneser/issues)
|
7
|
+
|
8
|
+
|
9
|
+
[![Gem Version](https://img.shields.io/gem/v/shalmaneser.svg")](https://rubygems.org/gems/shalmaneser)
|
10
|
+
[![Gem Version](https://img.shields.io/gem/v/frprep.svg")](https://rubygems.org/gems/frprep)
|
11
|
+
[![Gem Version](https://img.shields.io/gem/v/fred.svg")](https://rubygems.org/gems/fred)
|
12
|
+
[![Gem Version](https://img.shields.io/gem/v/rosy.svg")](https://rubygems.org/gems/rosy)
|
13
|
+
|
14
|
+
|
15
|
+
[![License GPL 2](http://img.shields.io/badge/License-GPL%202-green.svg)](http://www.gnu.org/licenses/gpl-2.0.txt)
|
16
|
+
[![Build Status](https://img.shields.io/travis/arbox/shalmaneser.svg?branch=1.2")](https://travis-ci.org/arbox/shalmaneser)
|
17
|
+
[![Code Climate](https://img.shields.io/codeclimate/github/arbox/shalmaneser.svg")](https://codeclimate.com/github/arbox/shalmaneser)
|
18
|
+
[![Dependency Status](https://img.shields.io/gemnasium/arbox/shalmaneser.svg")](https://gemnasium.com/arbox/shalmaneser)
|
19
|
+
|
20
|
+
## Description
|
21
|
+
|
22
|
+
Please be careful, the whole thing is under construction! For now Shalmaneser it not intended to run on Windows systems since it heavily uses system calls for external invocations.
|
23
|
+
Current versions of Shalmaneser have been tested on Linux only (other *NIX testers are welcome!).
|
24
|
+
|
25
|
+
Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. This technique is often called SRL (Semantic Role Labelling). The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaption, Shalmaneser should be usable for other paradigms (e.g., PropBank roles) as well. Shalmaneser caters both for end users, and for researchers.
|
26
|
+
|
27
|
+
For end users, we provide a simple end user mode which can simply apply the pre-trained classifiers
|
28
|
+
for [English](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (FrameNet 1.3 annotation / Collins parser)
|
29
|
+
and [German](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (SALSA 1.0 annotation / Sleepy parser).
|
30
|
+
|
31
|
+
We'll try to provide newer pretrained models for English, German, and possibly other languages as soon as possible.
|
32
|
+
|
33
|
+
For researchers interested in investigating shallow semantic parsing, our system is extensively configurable and extendable.
|
34
|
+
|
35
|
+
## Origin
|
36
|
+
|
37
|
+
The original version of Shalmaneser was written by Sebastian Padó, Katrin Erk and others during their work in the SALSA Project.
|
38
|
+
|
39
|
+
You can find original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
|
40
|
+
|
41
|
+
## Publications on Shalmaneser
|
42
|
+
|
43
|
+
- K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
|
44
|
+
- TODO: add other works
|
45
|
+
|
46
|
+
## Documentation
|
47
|
+
|
48
|
+
The project documentation can be found in our [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md) folder.
|
49
|
+
|
50
|
+
## Development
|
51
|
+
|
52
|
+
We are working now on two branches:
|
53
|
+
|
54
|
+
- ``dev`` - our development branch incorporating actual changes, for now pointing to ``1.2``;
|
55
|
+
|
56
|
+
- ``1.2`` - intermediate target;
|
57
|
+
|
58
|
+
- ``2.0`` - final target.
|
59
|
+
|
60
|
+
## Installation
|
61
|
+
|
62
|
+
See the installation instructions in the [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md#installation) folder.
|
63
|
+
|
64
|
+
### Tokenizers
|
65
|
+
|
66
|
+
- [Ucto](http://ilk.uvt.nl/ucto/)
|
67
|
+
|
68
|
+
### POS Taggers
|
69
|
+
|
70
|
+
- [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
|
71
|
+
|
72
|
+
### Lemmatizers
|
73
|
+
|
74
|
+
- [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
|
75
|
+
|
76
|
+
### Parsers
|
77
|
+
|
78
|
+
- [BerkeleyParser](https://code.google.com/p/berkeleyparser/downloads/list)
|
79
|
+
- [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
|
80
|
+
- [Collins Parser](http://www.cs.columbia.edu/~mcollins/code.html)
|
81
|
+
|
82
|
+
### Machine Learning Systems
|
83
|
+
|
84
|
+
- [OpenNLP MaxEnt](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/)
|
85
|
+
- [Mallet](http://mallet.cs.umass.edu/index.php)
|
86
|
+
|
87
|
+
## License
|
88
|
+
|
89
|
+
See the `LICENSE` file.
|
90
|
+
|
91
|
+
## Contributing
|
92
|
+
|
93
|
+
See the `CONTRIBUTING` file.
|
@@ -0,0 +1,39 @@
|
|
1
|
+
# @note AB: This whole thing should be obsolete on Ruby 1.9
|
2
|
+
# @note #unpack seems to work on 1.8 and 1.9 equally
|
3
|
+
require 'common/ISO-8859-1'
|
4
|
+
|
5
|
+
####################3
|
6
|
+
# Reformatting to and from
|
7
|
+
# a hex format for special characters
|
8
|
+
|
9
|
+
module Ampersand
|
10
|
+
def Ampersand.hex_to_iso(str)
|
11
|
+
return str.gsub(/&.+?;/) { |umlaut|
|
12
|
+
if umlaut =~ /&#x(.+);/
|
13
|
+
bla = $1.hex
|
14
|
+
bla.chr
|
15
|
+
else
|
16
|
+
umlaut
|
17
|
+
end
|
18
|
+
}
|
19
|
+
end
|
20
|
+
|
21
|
+
def Ampersand.iso_to_hex(str)
|
22
|
+
return utf8_to_hex(UtfIso.from_iso_8859_1(str))
|
23
|
+
end
|
24
|
+
|
25
|
+
def Ampersand.utf8_to_hex(str)
|
26
|
+
arr=str.unpack('U*')
|
27
|
+
outstr = ""
|
28
|
+
arr.each { |num|
|
29
|
+
if num < 0x80
|
30
|
+
outstr << num.chr
|
31
|
+
else
|
32
|
+
outstr.concat sprintf("&\#x%04x;", num)
|
33
|
+
end
|
34
|
+
}
|
35
|
+
return outstr
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
|
@@ -0,0 +1,1165 @@
|
|
1
|
+
####
|
2
|
+
# sp 15 04 05
|
3
|
+
#
|
4
|
+
# modified ke 30 10 05: adapted to fit into SynInterface
|
5
|
+
#
|
6
|
+
# represents a file containing Collins parses
|
7
|
+
#
|
8
|
+
# underlying data structure for individual sentences: SalsaTigerSentence
|
9
|
+
|
10
|
+
|
11
|
+
require 'tempfile'
|
12
|
+
require 'common/TabFormat'
|
13
|
+
require 'common/SalsaTigerRegXML'
|
14
|
+
require 'common/SalsaTigerXMLHelper'
|
15
|
+
require 'frprep/Counter'
|
16
|
+
|
17
|
+
require 'common/AbstractSynInterface'
|
18
|
+
|
19
|
+
################################################
|
20
|
+
# Interface class
|
21
|
+
class CollinsInterface < SynInterfaceSTXML
|
22
|
+
CollinsInterface.announce_me()
|
23
|
+
|
24
|
+
###
|
25
|
+
def CollinsInterface.system()
|
26
|
+
return "collins"
|
27
|
+
end
|
28
|
+
|
29
|
+
###
|
30
|
+
def CollinsInterface.service()
|
31
|
+
return "parser"
|
32
|
+
end
|
33
|
+
|
34
|
+
###
|
35
|
+
# initialize to set values for all subsequent processing
|
36
|
+
def initialize(program_path, # string: path to system
|
37
|
+
insuffix, # string: suffix of tab files
|
38
|
+
outsuffix, # string: suffix for parsed files
|
39
|
+
stsuffix, # string: suffix for Salsa/TIGER XML files
|
40
|
+
var_hash = {}) # optional arguments in a hash
|
41
|
+
|
42
|
+
super(program_path, insuffix, outsuffix, stsuffix, var_hash)
|
43
|
+
# I am not expecting any parameters, but I need
|
44
|
+
# the program path to end in a /.
|
45
|
+
unless @program_path =~ /\/$/
|
46
|
+
@program_path = @program_path + "/"
|
47
|
+
end
|
48
|
+
|
49
|
+
# new: evaluate var hash
|
50
|
+
@pos_suffix = var_hash["pos_suffix"]
|
51
|
+
@lemma_suffix = var_hash["lemma_suffix"]
|
52
|
+
@tab_dir = var_hash["tab_dir"]
|
53
|
+
end
|
54
|
+
|
55
|
+
|
56
|
+
###
|
57
|
+
# parse a bunch of TabFormat files (*.<insuffix>) with Collins model 3
|
58
|
+
# required: POS tags must be present
|
59
|
+
# produced: in outputdir, files *.<outsuffix>
|
60
|
+
# I assume that the files in inputdir are smaller than
|
61
|
+
# the maximum number of sentences
|
62
|
+
# Collins can parse in one go (i.e. that they are split) and I don't have to care
|
63
|
+
def process_dir(in_dir, # string: name of input directory
|
64
|
+
out_dir) # string: name of output directory
|
65
|
+
print "parsing ", in_dir, " and writing to ", out_dir, "\n"
|
66
|
+
|
67
|
+
unless @pos_suffix
|
68
|
+
raise "Collins interface: need suffix for POS files"
|
69
|
+
end
|
70
|
+
|
71
|
+
collins_prog = "gunzip -c #{@program_path}models/model3/events.gz | nice #{@program_path}code/parser"
|
72
|
+
collins_params = " #{@program_path}models/model3/grammar 10000 1 1 1 1"
|
73
|
+
|
74
|
+
Dir[in_dir+ "*" + @insuffix].each { |inputfilename|
|
75
|
+
|
76
|
+
STDERR.puts "*** Parsing #{inputfilename} with Collins"
|
77
|
+
|
78
|
+
corpusfilename = File.basename(inputfilename, @insuffix)
|
79
|
+
parsefilename = out_dir+corpusfilename+ @outsuffix
|
80
|
+
tempfile = Tempfile.new(corpusfilename)
|
81
|
+
|
82
|
+
# we need to have part of speech tags (but no lemmas at this point)
|
83
|
+
# included automatically by FNTabFormatFile initialize from *.pos
|
84
|
+
tabfile = FNTabFormatFile.new(inputfilename,@pos_suffix)
|
85
|
+
|
86
|
+
CollinsInterface.produce_collins_input(tabfile,tempfile)
|
87
|
+
tempfile.close
|
88
|
+
print collins_prog+" "+tempfile.path+" "+ collins_params+" > "+parsefilename
|
89
|
+
Kernel.system(collins_prog+" "+tempfile.path+" "+
|
90
|
+
collins_params+" > "+parsefilename)
|
91
|
+
tempfile.close(true)
|
92
|
+
}
|
93
|
+
end
|
94
|
+
|
95
|
+
###
|
96
|
+
# for a given parsed file:
|
97
|
+
# yield each sentence as a pair
|
98
|
+
# [SalsaTigerSentence object, FNTabFormatSentence object]
|
99
|
+
# of the sentence in SalsaTigerXML and the matching tab format sentence
|
100
|
+
#
|
101
|
+
# If a parse has failed, returns
|
102
|
+
# [failed_sentence (flat SalsaTigerSentence), FNTabFormatSentence]
|
103
|
+
# to allow more detailed accounting for failed parses
|
104
|
+
def each_sentence(parsefilename)
|
105
|
+
|
106
|
+
# sanity checks
|
107
|
+
unless @tab_dir
|
108
|
+
raise "Need to set tab directory on initialization"
|
109
|
+
end
|
110
|
+
|
111
|
+
# get matching tab file for this parser output file
|
112
|
+
parserfile = File.new(parsefilename)
|
113
|
+
tabfilename = @tab_dir+File.basename(parsefilename, @outsuffix)+ @insuffix
|
114
|
+
|
115
|
+
corpusfile = FNTabFormatFile.new(tabfilename, @pos_suffix, @lemma_suffix)
|
116
|
+
|
117
|
+
corpusfile.each_sentence {|tab_sent| # iterate over corpus sentences
|
118
|
+
|
119
|
+
my_sent_id = tab_sent.get_sent_id()
|
120
|
+
|
121
|
+
while true # find next matching line in parse file
|
122
|
+
line = parserfile.gets
|
123
|
+
# search for the next "relevant" file or end of the file
|
124
|
+
if line.nil? or line=~/^\(TOP/
|
125
|
+
break
|
126
|
+
end
|
127
|
+
end
|
128
|
+
STDERR.puts line
|
129
|
+
# while we search a parse, the parse file is over...
|
130
|
+
if line.nil?
|
131
|
+
raise "Error: premature end of parser file!"
|
132
|
+
end
|
133
|
+
|
134
|
+
line.chomp!
|
135
|
+
|
136
|
+
# it now holds that line =~ ^(TOP
|
137
|
+
|
138
|
+
case line
|
139
|
+
when /^\(TOP~/ # successful parse
|
140
|
+
|
141
|
+
st_sent = SalsaTigerSentence.empty_sentence(my_sent_id.to_s)
|
142
|
+
|
143
|
+
build_salsatiger(line,st_sent)
|
144
|
+
|
145
|
+
yield [st_sent, tab_sent, CollinsInterface.standard_mapping(st_sent, tab_sent)]
|
146
|
+
|
147
|
+
else
|
148
|
+
# failed parse: create a "failed" parse object
|
149
|
+
# with one nonterminal node and all the terminals
|
150
|
+
|
151
|
+
sent = CollinsInterface.failed_sentence(tab_sent,my_sent_id)
|
152
|
+
yield [sent, tab_sent, CollinsInterface.standard_mapping(sent, tab_sent)]
|
153
|
+
|
154
|
+
end
|
155
|
+
}
|
156
|
+
# after the end of the corpusfile, check if there are any parses left
|
157
|
+
while true
|
158
|
+
line = parserfile.gets
|
159
|
+
if line.nil? # if there are none, everything is fine
|
160
|
+
break
|
161
|
+
elsif line =~ /^\(TOP/ # if there are, raise an exception
|
162
|
+
raise "Error: premature end of corpus file!"
|
163
|
+
end
|
164
|
+
end
|
165
|
+
end
|
166
|
+
|
167
|
+
###
|
168
|
+
# write Salsa/TIGER XML output to file
|
169
|
+
def to_stxml_file(infilename, # string: name of parse file
|
170
|
+
outfilename) # string: name of output stxml file
|
171
|
+
|
172
|
+
outfile = File.new(outfilename, "w")
|
173
|
+
outfile.puts SalsaTigerXMLHelper.get_header()
|
174
|
+
each_sentence(infilename) { |st_sent, tabsent|
|
175
|
+
outfile.puts st_sent.get()
|
176
|
+
}
|
177
|
+
outfile.puts SalsaTigerXMLHelper.get_footer()
|
178
|
+
outfile.close()
|
179
|
+
end
|
180
|
+
|
181
|
+
|
182
|
+
########################
|
183
|
+
private
|
184
|
+
|
185
|
+
# Build a SalsaTigerSentence corresponding to the Collins parse in argument string.
|
186
|
+
#
|
187
|
+
# Special features: removes unary nodes and traces
|
188
|
+
def build_salsatiger(string,st_sent)
|
189
|
+
|
190
|
+
nt_c = Counter.new(500)
|
191
|
+
t_c = Counter.new(0)
|
192
|
+
|
193
|
+
position = 0
|
194
|
+
stack = Array.new
|
195
|
+
|
196
|
+
while position < string.length
|
197
|
+
if string[position,1] == "(" # push nonterminal
|
198
|
+
nextspace = string.index(" ",position)
|
199
|
+
nonterminal = string[position+1..nextspace-1]
|
200
|
+
stack.push nonterminal
|
201
|
+
position = nextspace+1
|
202
|
+
elsif string[position,1] == ")" # reduce stack
|
203
|
+
tempstack = Array.new
|
204
|
+
while true
|
205
|
+
# get all Nodes from the stack and put them on a tempstack,
|
206
|
+
# until you find a String, which is a not-yet existing nonterminal
|
207
|
+
object = stack.pop
|
208
|
+
if object.kind_of? SynNode
|
209
|
+
tempstack.push(object) # terminal or subtree
|
210
|
+
else # string (nonterminal label)
|
211
|
+
if tempstack.length == 1 # skip unary nodes: do nothing and write tempstack back to stack
|
212
|
+
stack += tempstack
|
213
|
+
break
|
214
|
+
# puts "Unary node #{object}"
|
215
|
+
end
|
216
|
+
nt_a = object.split("~")
|
217
|
+
unless nt_a.length == 4
|
218
|
+
# something went wrong. maybe it's about character encoding
|
219
|
+
if nt_a.length() > 4
|
220
|
+
# yes, assume it's about character encoding
|
221
|
+
nt_a = [nt_a[0], nt_a[1..-3].join("~"), nt_a[-2], nt_a[-1]]
|
222
|
+
else
|
223
|
+
# whoa, _less_ pieces than expected: problem.
|
224
|
+
$stderr.puts "Collins parse tree translation nonrecoverable error:"
|
225
|
+
$stderr.puts "Unexpectedly too few components in nonterminal " + nt_a.join("~")
|
226
|
+
raise StandardError.new("nonrecoverable error")
|
227
|
+
end
|
228
|
+
end
|
229
|
+
|
230
|
+
# construct a new nonterminal
|
231
|
+
node = st_sent.add_syn("nt",
|
232
|
+
SalsaTigerXMLHelper.escape(nt_a[0].strip), # cat
|
233
|
+
nil, # word (doesn't matter)
|
234
|
+
nil, # pos (doesn't matter)
|
235
|
+
nt_c.next.to_s)
|
236
|
+
node.set_attribute("head",SalsaTigerXMLHelper.escape(nt_a[1].strip))
|
237
|
+
tempstack.reverse.each {|child|
|
238
|
+
node.add_child(child,nil)
|
239
|
+
child.set_parent(node,nil)
|
240
|
+
}
|
241
|
+
stack.push(node)
|
242
|
+
break # while
|
243
|
+
end
|
244
|
+
end
|
245
|
+
position = position+2 # == nextspace+1
|
246
|
+
else # terminal
|
247
|
+
nextspace = string.index(" ",position)
|
248
|
+
terminal = string[position..nextspace].strip
|
249
|
+
t_a = terminal.split("/")
|
250
|
+
unless t_a.length == 2
|
251
|
+
raise "[collins] Cannot split terminal #{terminal} into word and POS!"
|
252
|
+
end
|
253
|
+
|
254
|
+
word = t_a[0]
|
255
|
+
pos = t_a[1]
|
256
|
+
|
257
|
+
unless pos =~ /TRACE/
|
258
|
+
# construct a new terminal
|
259
|
+
node = st_sent.add_syn("t",
|
260
|
+
nil,
|
261
|
+
SalsaTigerXMLHelper.escape(CollinsInterface.unescape(word)), # word
|
262
|
+
SalsaTigerXMLHelper.escape(pos), # pos
|
263
|
+
t_c.next.to_s)
|
264
|
+
stack.push(node)
|
265
|
+
end
|
266
|
+
position = nextspace+1
|
267
|
+
end
|
268
|
+
end
|
269
|
+
|
270
|
+
# at the very end, we need to have exactly one syntactic root
|
271
|
+
|
272
|
+
if stack.length != 1
|
273
|
+
raise "[collins] Error: Sentence has #{stack.length} roots"
|
274
|
+
end
|
275
|
+
end
|
276
|
+
|
277
|
+
|
278
|
+
####
|
279
|
+
# extract the Collins parser input format from a TabFormat object
|
280
|
+
# that includes part-of-speech (pos)
|
281
|
+
#
|
282
|
+
def CollinsInterface.produce_collins_input(corpusfile,tempfile)
|
283
|
+
corpusfile.each_sentence {|s|
|
284
|
+
words = Array.new
|
285
|
+
s.each_line_parsed {|line_obj|
|
286
|
+
word = line_obj.get("word")
|
287
|
+
tag = line_obj.get("pos")
|
288
|
+
if tag.nil?
|
289
|
+
raise "Error: FNTabFormat object not tagged!"
|
290
|
+
end
|
291
|
+
word_tag_pair = CollinsInterface.escape(word,tag)
|
292
|
+
if word_tag_pair =~ /\)/
|
293
|
+
puts word_tag_pair
|
294
|
+
puts s.to_s
|
295
|
+
end
|
296
|
+
words << word_tag_pair
|
297
|
+
}
|
298
|
+
tempfile.puts words.length.to_s+" "+words.join(" ")
|
299
|
+
}
|
300
|
+
end
|
301
|
+
|
302
|
+
####
|
303
|
+
def CollinsInterface.escape(word,pos) # returns array word+" "+lemma
|
304
|
+
case word
|
305
|
+
|
306
|
+
# replace opening or closing brackets
|
307
|
+
# word representation is {L,R}R{B,S,C} (bracket, square, curly)
|
308
|
+
# POS for opening brackets is LRB, closing brackets RRB
|
309
|
+
|
310
|
+
when "("
|
311
|
+
return "LRB -LRB-"
|
312
|
+
when "["
|
313
|
+
return "LRS -LRB-"
|
314
|
+
when "{"
|
315
|
+
return "LRC -LRB-"
|
316
|
+
|
317
|
+
when ")"
|
318
|
+
return "RRB -RRB-"
|
319
|
+
when "]"
|
320
|
+
return "RRS -RRB-"
|
321
|
+
when "}"
|
322
|
+
return "RRC -RRB-"
|
323
|
+
|
324
|
+
# catch those brackets or slashes inside words
|
325
|
+
else
|
326
|
+
word.gsub!(/\(/,"LRB")
|
327
|
+
word.gsub!(/\)/,"RRB")
|
328
|
+
word.gsub!(/\[/,"LRS")
|
329
|
+
word.gsub!(/\]/,"RRS")
|
330
|
+
word.gsub!(/\{/,"LRC")
|
331
|
+
word.gsub!(/\}/,"RRC")
|
332
|
+
word.gsub!(/\//,"&Slash;")
|
333
|
+
return word+" "+pos
|
334
|
+
end
|
335
|
+
end
|
336
|
+
|
337
|
+
####
|
338
|
+
# replace replacements with original values
|
339
|
+
def CollinsInterface.unescape(word)
|
340
|
+
return word.gsub(/LRB/,"(").gsub(/RRB/,")").gsub(/LRS/,"[").gsub(/RRS/,"]").gsub(/LRC/,"{").gsub(/RRC/,"}").gsub(/&Slash;/,"/")
|
341
|
+
end
|
342
|
+
end
|
343
|
+
|
344
|
+
################################################
|
345
|
+
# Interpreter class
|
346
|
+
class CollinsTntInterpreter < SynInterpreter
|
347
|
+
CollinsTntInterpreter.announce_me()
|
348
|
+
|
349
|
+
###
|
350
|
+
# names of the systems interpreted by this class:
|
351
|
+
# returns a hash service(string) -> system name (string),
|
352
|
+
# e.g.
|
353
|
+
# { "parser" => "collins", "lemmatizer" => "treetagger" }
|
354
|
+
def CollinsTntInterpreter.systems()
|
355
|
+
return {
|
356
|
+
"pos_tagger" => "treetagger",
|
357
|
+
"parser" => "collins"
|
358
|
+
}
|
359
|
+
end
|
360
|
+
|
361
|
+
###
|
362
|
+
# names of additional systems that may be interpreted by this class
|
363
|
+
# returns a hash service(string) -> system name(string)
|
364
|
+
# same as names()
|
365
|
+
def CollinsTntInterpreter.optional_systems()
|
366
|
+
return {
|
367
|
+
"lemmatizer" => "treetagger"
|
368
|
+
}
|
369
|
+
end
|
370
|
+
|
371
|
+
###
|
372
|
+
# generalize over POS tags.
|
373
|
+
#
|
374
|
+
# returns one of:
|
375
|
+
#
|
376
|
+
# adj: adjective (phrase)
|
377
|
+
# adv: adverb (phrase)
|
378
|
+
# card: numbers, quantity phrases
|
379
|
+
# con: conjunction
|
380
|
+
# det: determiner, including possessive/demonstrative pronouns etc.
|
381
|
+
# for: foreign material
|
382
|
+
# noun: noun (phrase), including personal pronouns, proper names, expletives
|
383
|
+
# part: particles, truncated words (German compound parts)
|
384
|
+
# prep: preposition (phrase)
|
385
|
+
# pun: punctuation, brackets, etc.
|
386
|
+
# sent: sentence
|
387
|
+
# top: top node of a sentence
|
388
|
+
# verb: verb (phrase)
|
389
|
+
# nil: something went wrong
|
390
|
+
#
|
391
|
+
# returns: string, or nil
|
392
|
+
def CollinsTntInterpreter.category(node) # SynNode
|
393
|
+
pt = CollinsTntInterpreter.simplified_pt(node)
|
394
|
+
if pt.nil?
|
395
|
+
# phrase type could not be determined
|
396
|
+
return nil
|
397
|
+
end
|
398
|
+
|
399
|
+
pt.to_s.strip() =~ /^([^-]*)/
|
400
|
+
case $1
|
401
|
+
when /^JJ/ ,/(WH)?ADJP/, /^PDT/ then return "adj"
|
402
|
+
when /^RB/, /(WH)?ADVP/, /^UH/ then return "adv"
|
403
|
+
when /^CD/, /^QP/ then return "card"
|
404
|
+
when /^CC/, /^WRB/, /^CONJP/ then return "con"
|
405
|
+
when /^DT/, /^POS/ then return "det"
|
406
|
+
when /^FW/, /^SYM/ then return "for"
|
407
|
+
when /^N/, "WHAD", "WDT", /^PRP/ , /^WHNP/, /^EX/, /^WP/ then return "noun"
|
408
|
+
when /^IN/ , /^TO/, /(WH)?PP/, "RP", /^PR(T|N)/ then return "prep"
|
409
|
+
when /^PUNC/, /LRB/, /RRB/, /[,'".:;!?\(\)]/ then return "pun"
|
410
|
+
when /^S(s|bar|BAR|G|Q|BARQ|INV)?$/, /^UCP/, /^FRAG/, /^X/, /^INTJ/ then return "sent"
|
411
|
+
when /^TOP/ then return "top"
|
412
|
+
when /^TRACE/ then return "trace"
|
413
|
+
when /^V/ , /^MD/ then return "verb"
|
414
|
+
else
|
415
|
+
# $stderr.puts "WARNING: Unknown category/POS "+c.to_s + " (English data)"
|
416
|
+
return nil
|
417
|
+
end
|
418
|
+
end
|
419
|
+
|
420
|
+
|
421
|
+
###
|
422
|
+
# is relative pronoun?
|
423
|
+
#
|
424
|
+
def CollinsTntInterpreter.relative_pronoun?(node) # SynNode
|
425
|
+
pt = CollinsTntInterpreter.simplified_pt(node)
|
426
|
+
if pt.nil?
|
427
|
+
# phrase type could not be determined
|
428
|
+
return nil
|
429
|
+
end
|
430
|
+
|
431
|
+
pt.to_s.strip() =~ /^([^-]*)/
|
432
|
+
case $1
|
433
|
+
when /^WDT/, /^WHAD/, /^WHNP/, /^WP/
|
434
|
+
return true
|
435
|
+
else
|
436
|
+
return false
|
437
|
+
end
|
438
|
+
end
|
439
|
+
|
440
|
+
###
|
441
|
+
# lemma_backoff:
|
442
|
+
#
|
443
|
+
# if we have lemma information, return that,
|
444
|
+
# and failing that, return the word
|
445
|
+
#
|
446
|
+
# returns: string, or nil
|
447
|
+
def CollinsTntInterpreter.lemma_backoff(node)
|
448
|
+
lemma = super(node)
|
449
|
+
# lemmatizer has returned more than one possible lemma form:
|
450
|
+
# just accept the first
|
451
|
+
if lemma =~ /^([^|]+)|/
|
452
|
+
return $1
|
453
|
+
else
|
454
|
+
return lemma
|
455
|
+
end
|
456
|
+
end
|
457
|
+
|
458
|
+
|
459
|
+
###
|
460
|
+
# simplified phrase type:
|
461
|
+
# like phrase type, but may simplify
|
462
|
+
# the constituent label
|
463
|
+
#
|
464
|
+
# returns: string
|
465
|
+
def CollinsTntInterpreter.simplified_pt(node)
|
466
|
+
CollinsTntInterpreter.pt(node) =~ /^(\w+)(-\w)*/
|
467
|
+
return $1
|
468
|
+
end
|
469
|
+
|
470
|
+
###
|
471
|
+
# verb_with_particle:
|
472
|
+
#
|
473
|
+
# given a node and a nodelist,
|
474
|
+
# if the node represents a verb:
|
475
|
+
# see if the verb has a particle among the nodes in nodelist
|
476
|
+
# if so, return it
|
477
|
+
#
|
478
|
+
# returns: SynNode object if successful, else nil
|
479
|
+
def CollinsTntInterpreter.particle_of_verb(node,
|
480
|
+
node_list)
|
481
|
+
|
482
|
+
# must be verb
|
483
|
+
unless CollinsTntInterpreter.category(node) == "verb"
|
484
|
+
return nil
|
485
|
+
end
|
486
|
+
|
487
|
+
# must have parent
|
488
|
+
unless node.parent
|
489
|
+
return nil
|
490
|
+
end
|
491
|
+
|
492
|
+
# look for sisters of the verb node that have the particle category
|
493
|
+
particles = node.parent.children.select { |sister|
|
494
|
+
CollinsTntInterpreter.category(sister) == "part"
|
495
|
+
}.map { |n| n.children}.flatten.select { |niece|
|
496
|
+
# now look for children of those nodes that are particles and are in the nodelist
|
497
|
+
nodelist.include? niece and
|
498
|
+
CollinsTntInterpreter.category(niece) == "part"
|
499
|
+
}
|
500
|
+
|
501
|
+
if particles.length == 0
|
502
|
+
return nil
|
503
|
+
else
|
504
|
+
return particles.first
|
505
|
+
end
|
506
|
+
end
|
507
|
+
|
508
|
+
###
|
509
|
+
# auxiliary?
|
510
|
+
#
|
511
|
+
# returns true if the given node is an auxiliary
|
512
|
+
# else false
|
513
|
+
def CollinsTntInterpreter.auxiliary?(node)
|
514
|
+
|
515
|
+
# look for
|
516
|
+
# ---VP---
|
517
|
+
# | |
|
518
|
+
# the given node VP-A
|
519
|
+
# |
|
520
|
+
# verb node
|
521
|
+
# verb?
|
522
|
+
unless CollinsTntInterpreter.category(node) == "verb"
|
523
|
+
return false
|
524
|
+
end
|
525
|
+
|
526
|
+
unless (parent = node.parent) and
|
527
|
+
parent.category() == "VP"
|
528
|
+
return false
|
529
|
+
end
|
530
|
+
unless (vpa_node = parent.children.detect { |other_child| other_child.category() == "VP-A" })
|
531
|
+
return false
|
532
|
+
end
|
533
|
+
unless vpa_node.children.detect { |other_node| CollinsTntInterpreter.category(other_node) == "verb" }
|
534
|
+
return false
|
535
|
+
end
|
536
|
+
|
537
|
+
return true
|
538
|
+
|
539
|
+
end
|
540
|
+
|
541
|
+
###
|
542
|
+
# modal?
|
543
|
+
#
|
544
|
+
# returns true if the given node is a modal verb,
|
545
|
+
# else false
|
546
|
+
def CollinsTntInterpreter.modal?(node)
|
547
|
+
if node.part_of_speech() =~ /^MD/
|
548
|
+
return true
|
549
|
+
else
|
550
|
+
return false
|
551
|
+
end
|
552
|
+
end
|
553
|
+
|
554
|
+
###
|
555
|
+
# voice
|
556
|
+
#
|
557
|
+
# given a constituent, return
|
558
|
+
# - "active"/"passive" if it is a verb
|
559
|
+
# - nil, else
|
560
|
+
def CollinsTntInterpreter.voice(node) # SynNode
|
561
|
+
|
562
|
+
tobe = ["be","am","is","are","was","were"]
|
563
|
+
|
564
|
+
unless CollinsTntInterpreter.category(node) == "verb"
|
565
|
+
return nil
|
566
|
+
end
|
567
|
+
|
568
|
+
# if we have a gerund, a present tense, or an infitive
|
569
|
+
# then we are sure that we have an active form
|
570
|
+
case CollinsTntInterpreter.pt(node)
|
571
|
+
when "VBG","VBP", "VBZ", "VB"
|
572
|
+
return "active"
|
573
|
+
end
|
574
|
+
|
575
|
+
|
576
|
+
# There is an ambiguity for many word forms between VBN (past participle - passive)
|
577
|
+
# and VBD (past tense - active)
|
578
|
+
|
579
|
+
# so for these, we only say something if we can exclude one possibility,
|
580
|
+
# this is the case
|
581
|
+
# (a) when there is a c-commanding "to be" somewhere. -> passive
|
582
|
+
# (b) when there is no "to be", but a "to have" somewhere. -> active
|
583
|
+
|
584
|
+
# collect lemmas of c-commanding verbs.
|
585
|
+
|
586
|
+
parent = node.parent
|
587
|
+
if parent.nil?
|
588
|
+
return nil
|
589
|
+
end
|
590
|
+
gp = parent.parent
|
591
|
+
if gp.nil?
|
592
|
+
return nil
|
593
|
+
end
|
594
|
+
|
595
|
+
# other_verbs = Array.new
|
596
|
+
#
|
597
|
+
# current_node = node
|
598
|
+
# while current_node = current_node.parent
|
599
|
+
# pt = CollinsTntInterpreter.category(current_node)
|
600
|
+
# unless ["verb","sentence"].include? pt
|
601
|
+
# break
|
602
|
+
# end
|
603
|
+
# current_node.children.each {|child|
|
604
|
+
# if CollinsTntInterpreter.category(child) == "verb"
|
605
|
+
# other_verbs << CollinsTntInterpreter.lemma_backoff(nephew)
|
606
|
+
# end
|
607
|
+
# }
|
608
|
+
# end
|
609
|
+
#
|
610
|
+
# unless (tobe & other_verbs).empty?
|
611
|
+
# puts "passive "+node.id
|
612
|
+
# return "passive"
|
613
|
+
# end
|
614
|
+
# unless (tohave & other_verbs).empty?
|
615
|
+
# return "active"
|
616
|
+
# end
|
617
|
+
|
618
|
+
if CollinsTntInterpreter.category(gp) == "verb" or CollinsTntInterpreter.category(gp) == "sent"
|
619
|
+
|
620
|
+
current_node = node
|
621
|
+
|
622
|
+
while current_node = current_node.parent
|
623
|
+
pt = CollinsTntInterpreter.category(current_node)
|
624
|
+
unless ["verb","sent"].include? pt
|
625
|
+
break
|
626
|
+
end
|
627
|
+
if current_node.children.detect {|nephew| tobe.include? CollinsTntInterpreter.lemma_backoff(nephew)}
|
628
|
+
return "passive"
|
629
|
+
end
|
630
|
+
end
|
631
|
+
# if no "to be" has been found...
|
632
|
+
return "active"
|
633
|
+
end
|
634
|
+
|
635
|
+
# case 2: The grandfather is something else (e.g. a noun phrase)
|
636
|
+
# here, simple past forms are often mis-tagged as passives
|
637
|
+
#
|
638
|
+
|
639
|
+
# if we were cautious, we would return "dontknow" here;
|
640
|
+
# however, these cases are so rare that it is unlikely that
|
641
|
+
# assignments would be more reliable; so we rely on the
|
642
|
+
# POS tag anyway.
|
643
|
+
|
644
|
+
|
645
|
+
case CollinsTntInterpreter.pt(node)
|
646
|
+
when "VBN","VBD"
|
647
|
+
return "passive"
|
648
|
+
# this must be some kind of error...
|
649
|
+
else
|
650
|
+
return nil
|
651
|
+
end
|
652
|
+
end
|
653
|
+
|
654
|
+
###
|
655
|
+
# gfs
|
656
|
+
#
|
657
|
+
# grammatical functions of a constituent:
|
658
|
+
#
|
659
|
+
# returns: a list of pairs [relation(string), node(SynNode)]
|
660
|
+
# where <node> stands in the relation <relation> to the parameter
|
661
|
+
# that the method was called with
|
662
|
+
def CollinsTntInterpreter.gfs(anchor_node, # SynNode
|
663
|
+
sent) # SalsaTigerSentence
|
664
|
+
|
665
|
+
return sent.syn_nodes.map { |gf_node|
|
666
|
+
|
667
|
+
case CollinsTntInterpreter.category(anchor_node)
|
668
|
+
when "adj"
|
669
|
+
rel = CollinsTntInterpreter.gf_adj(anchor_node, gf_node)
|
670
|
+
when "verb"
|
671
|
+
rel = CollinsTntInterpreter.gf_verb(anchor_node, gf_node)
|
672
|
+
when "noun"
|
673
|
+
rel = CollinsTntInterpreter.gf_noun(anchor_node, gf_node)
|
674
|
+
end
|
675
|
+
|
676
|
+
if rel
|
677
|
+
[rel, gf_node]
|
678
|
+
else
|
679
|
+
nil
|
680
|
+
end
|
681
|
+
}.compact()
|
682
|
+
end
|
683
|
+
|
684
|
+
###
|
685
|
+
# informative_content_node
|
686
|
+
#
|
687
|
+
# for most constituents: nil
|
688
|
+
# for a PP, the NP
|
689
|
+
# for an SBAR, the VP
|
690
|
+
# for a VP, the embedded VP
|
691
|
+
def CollinsTntInterpreter.informative_content_node(node)
|
692
|
+
this_pt = CollinsTntInterpreter.simplified_pt(node)
|
693
|
+
|
694
|
+
unless ["SBAR", "VP", "PP"].include? this_pt
|
695
|
+
return nil
|
696
|
+
end
|
697
|
+
|
698
|
+
nh = CollinsTntInterpreter.head_terminal(node)
|
699
|
+
unless nh
|
700
|
+
return nil
|
701
|
+
end
|
702
|
+
headlemma = CollinsTntInterpreter.lemma_backoff(nh)
|
703
|
+
|
704
|
+
nonhead_children = node.children().reject { |n|
|
705
|
+
nnh = CollinsTntInterpreter.head_terminal(n)
|
706
|
+
not(nnh) or
|
707
|
+
CollinsTntInterpreter.lemma_backoff(nnh) == headlemma
|
708
|
+
}
|
709
|
+
if nonhead_children.length() == 1
|
710
|
+
return nonhead_children.first
|
711
|
+
end
|
712
|
+
|
713
|
+
# more than one child:
|
714
|
+
# for SBAR and VP take child with head POS starting in VB,
|
715
|
+
# for PP child with head POS starting in NN
|
716
|
+
case this_pt
|
717
|
+
when "SBAR", "VP"
|
718
|
+
icont_child = nonhead_children.detect { |n|
|
719
|
+
h = CollinsTntInterpreter.head_terminal(n)
|
720
|
+
h and h.part_of_speech() =~ /^VB/
|
721
|
+
}
|
722
|
+
when "PP"
|
723
|
+
icont_child = nonhead_children.detect { |n|
|
724
|
+
h = CollinsTntInterpreter.head_terminal(n)
|
725
|
+
h and h.part_of_speech() =~ /^NN/
|
726
|
+
}
|
727
|
+
else
|
728
|
+
raise "Shouldn't be here"
|
729
|
+
end
|
730
|
+
|
731
|
+
if icont_child
|
732
|
+
return icont_child
|
733
|
+
else
|
734
|
+
return nonhead_children.first
|
735
|
+
end
|
736
|
+
end
|
737
|
+
|
738
|
+
|
739
|
+
|
740
|
+
|
741
|
+
########
|
742
|
+
# prune?
|
743
|
+
# given a target node t and another node n of the syntactic structure,
|
744
|
+
# decide whether n is likely to instantiate a semantic role
|
745
|
+
# of t. If not, recommend n for pruning.
|
746
|
+
#
|
747
|
+
# This method implements a slight variant of Xue and Palmer (EMNLP 2004).
|
748
|
+
# Pruning according to Xue & Palmer, EMNLP 2004:
|
749
|
+
# "Step 1: Designate the predicate as the current node and
|
750
|
+
# collect its sisters (constituents attached at the same level
|
751
|
+
# as the predicate) unless its sisters are coordinated with the
|
752
|
+
# predicate. If a sister is a PP, also collect its immediate
|
753
|
+
# children.
|
754
|
+
# Step 2: Reset the current node to its parent and repeat Step 1
|
755
|
+
# till it reaches the top level node.
|
756
|
+
#
|
757
|
+
# Modifications made here:
|
758
|
+
# - paths of length 0 accepted in any case
|
759
|
+
#
|
760
|
+
# returns: false to recommend n for pruning, else true
|
761
|
+
def CollinsTntInterpreter.prune?(node, # SynNode
|
762
|
+
paths_to_target, # hash: node ID -> Path object: paths from target to node
|
763
|
+
terminal_index) # hash: terminal node -> word index in sentence
|
764
|
+
|
765
|
+
path_to_target = paths_to_target[node.id()]
|
766
|
+
|
767
|
+
if not path_to_target
|
768
|
+
# no path from target to node: suggest for pruning
|
769
|
+
|
770
|
+
return 0
|
771
|
+
|
772
|
+
elsif path_to_target.length == 0
|
773
|
+
# target may be its own role: definite accept
|
774
|
+
|
775
|
+
return 1
|
776
|
+
|
777
|
+
else
|
778
|
+
# consider path from target to node.
|
779
|
+
# (1) If the path to the current node includes at least one Up
|
780
|
+
# and exactly one Down, keep.
|
781
|
+
# (2) Else, if the path includes at least one Up and exactly two Down,
|
782
|
+
# and the current node's parent is a PP, keep
|
783
|
+
# (3) else discard
|
784
|
+
|
785
|
+
# count number of up and down steps in path to target
|
786
|
+
num_up = 0
|
787
|
+
num_down = 0
|
788
|
+
path_to_target.each_step { |direction, edgelabel, nodelabel, endnode|
|
789
|
+
case direction
|
790
|
+
when /U/
|
791
|
+
num_up += 1
|
792
|
+
when /D/
|
793
|
+
num_down += 1
|
794
|
+
end
|
795
|
+
}
|
796
|
+
|
797
|
+
# coordination sister between node and target?
|
798
|
+
conj_sister_between = CollinsTntInterpreter.conj_sister_between?(node, paths_to_target,
|
799
|
+
terminal_index)
|
800
|
+
|
801
|
+
|
802
|
+
if conj_sister_between
|
803
|
+
# coordination between me and the target -- drop
|
804
|
+
return 0
|
805
|
+
|
806
|
+
elsif num_up >= 1 and num_down == 1
|
807
|
+
# case (1)
|
808
|
+
return 1
|
809
|
+
|
810
|
+
elsif num_up >= 1 and num_down == 2 and
|
811
|
+
(p = node.parent()) and CollinsTntInterpreter.category(p) == "prep"
|
812
|
+
|
813
|
+
# case (2)
|
814
|
+
return 1
|
815
|
+
|
816
|
+
else
|
817
|
+
# case (3)
|
818
|
+
return 0
|
819
|
+
end
|
820
|
+
end
|
821
|
+
end
|
822
|
+
|
823
|
+
|
824
|
+
###
|
825
|
+
private
|
826
|
+
|
827
|
+
|
828
|
+
###
|
829
|
+
# given an anchor node and another node that may be some
|
830
|
+
# grammatical function of the anchor node:
|
831
|
+
# return the grammatical function (string) if found,
|
832
|
+
# else nil.
|
833
|
+
#
|
834
|
+
# here: anchor node is verb.
|
835
|
+
def CollinsTntInterpreter.gf_verb(anchor_node, # SynNode
|
836
|
+
gf_node) # SynNode
|
837
|
+
|
838
|
+
# first classification: according to constituent type
|
839
|
+
cat = CollinsTntInterpreter.category(gf_node)
|
840
|
+
if cat.nil?
|
841
|
+
return nil
|
842
|
+
end
|
843
|
+
|
844
|
+
# second classification: according to path
|
845
|
+
path = CollinsTntInterpreter.path_between(anchor_node, gf_node)
|
846
|
+
if path.nil?
|
847
|
+
# no path between anchor node and gf node
|
848
|
+
return nil
|
849
|
+
end
|
850
|
+
|
851
|
+
path.set_cutoff_last_pt_on_printing(true)
|
852
|
+
path_string = path.print(true,false,true)
|
853
|
+
|
854
|
+
case path_string
|
855
|
+
when "U VP D ", "U SG D "
|
856
|
+
categ2 = "inside"
|
857
|
+
when /^U (VP U )*S(BAR)? D $/
|
858
|
+
categ2 = "external"
|
859
|
+
when /^U (VP U )*VP D ADVP D $/
|
860
|
+
categ2 = "external"
|
861
|
+
else
|
862
|
+
categ2 = ""
|
863
|
+
end
|
864
|
+
|
865
|
+
# now evaluate based on both
|
866
|
+
case cat+ "+" + categ2
|
867
|
+
when "noun+inside"
|
868
|
+
# direct object
|
869
|
+
return "OA"
|
870
|
+
|
871
|
+
when "noun+external"
|
872
|
+
unless CollinsTntInterpreter.relative_position(gf_node, anchor_node) == "LEFT"
|
873
|
+
return nil
|
874
|
+
end
|
875
|
+
|
876
|
+
if CollinsTntInterpreter.voice(anchor_node) == "passive"
|
877
|
+
return "OA"
|
878
|
+
else
|
879
|
+
return "SB"
|
880
|
+
end
|
881
|
+
|
882
|
+
when "prep+inside"
|
883
|
+
if CollinsTntInterpreter.voice(anchor_node) == "passive" and
|
884
|
+
CollinsTntInterpreter.preposition(gf_node) == "by"
|
885
|
+
return "SB"
|
886
|
+
else
|
887
|
+
return "MO-" + CollinsTntInterpreter.preposition(gf_node).to_s
|
888
|
+
end
|
889
|
+
|
890
|
+
when "sent+inside"
|
891
|
+
return "OC"
|
892
|
+
|
893
|
+
when "sent+external"
|
894
|
+
return "OC"
|
895
|
+
|
896
|
+
else
|
897
|
+
return nil
|
898
|
+
end
|
899
|
+
end
|
900
|
+
|
901
|
+
###
|
902
|
+
# given an anchor node and another node that may be some
|
903
|
+
# grammatical function of the anchor node:
|
904
|
+
# return the grammatical function (string) if found,
|
905
|
+
# else nil.
|
906
|
+
#
|
907
|
+
# here: anchor node is noun.
|
908
|
+
def CollinsTntInterpreter.gf_noun(anchor_node, # SynNode
|
909
|
+
gf_node) # SynNode
|
910
|
+
|
911
|
+
# first classification: according to constituent type
|
912
|
+
cat = CollinsTntInterpreter.category(gf_node)
|
913
|
+
if cat.nil?
|
914
|
+
return nil
|
915
|
+
end
|
916
|
+
|
917
|
+
# second classification: according to path
|
918
|
+
path = CollinsTntInterpreter.path_between(anchor_node, gf_node)
|
919
|
+
if path.nil?
|
920
|
+
# no path between anchor node and gf node
|
921
|
+
return nil
|
922
|
+
end
|
923
|
+
|
924
|
+
path.set_cutoff_last_pt_on_printing(true)
|
925
|
+
path_string = path.print(true,false,true)
|
926
|
+
|
927
|
+
case path_string
|
928
|
+
when "U NPB D "
|
929
|
+
categ2 = "np-neighbor"
|
930
|
+
when "U NPB U NP D "
|
931
|
+
categ2 = "np-parent"
|
932
|
+
when "U NP D "
|
933
|
+
categ2 = "np-a"
|
934
|
+
when /^U NPB (U NP )?(U NP )?U S(BAR)? D( VP D)? $/
|
935
|
+
categ2 = "beyond-s"
|
936
|
+
when /^U NP(B)? (U NP )?U VP D $/
|
937
|
+
categ2 = "beyond-vp"
|
938
|
+
when /^U NPB (U NP )?(U NP)?U PP U VP(-A)? D $/
|
939
|
+
categ2 = "beyond-pp-vp"
|
940
|
+
else
|
941
|
+
categ2 = ""
|
942
|
+
end
|
943
|
+
|
944
|
+
# now evaluate based on both
|
945
|
+
case cat + "+" + categ2
|
946
|
+
when "noun+np-neighbor"
|
947
|
+
return "AG"
|
948
|
+
|
949
|
+
when "sent+np-parent"
|
950
|
+
return "OC"
|
951
|
+
|
952
|
+
when "prep+np-parent", "prep+np-a"
|
953
|
+
return "MO-" + CollinsTntInterpreter.preposition(gf_node).to_s
|
954
|
+
# relation of anchor noun to governing verb not covered by "gfs" method
|
955
|
+
# when "verb+beyond-s"
|
956
|
+
# return "SB-of"
|
957
|
+
|
958
|
+
# when "verb+beyond-vp"
|
959
|
+
# return "OA-of"
|
960
|
+
|
961
|
+
# when "verb+beyond-pp-vp"
|
962
|
+
# return "MO-of"
|
963
|
+
else
|
964
|
+
return nil
|
965
|
+
end
|
966
|
+
end
|
967
|
+
|
968
|
+
|
969
|
+
###
|
970
|
+
# given an anchor node and another node that may be some
|
971
|
+
# grammatical function of the anchor node:
|
972
|
+
# return the grammatical function (string) if found,
|
973
|
+
# else nil.
|
974
|
+
#
|
975
|
+
# here: anchor node is adjective.
|
976
|
+
def CollinsTntInterpreter.gf_adj(anchor_node, # SynNode
|
977
|
+
gf_node) # SynNode
|
978
|
+
|
979
|
+
# first classification: according to constituent type
|
980
|
+
cat = CollinsTntInterpreter.category(gf_node)
|
981
|
+
if cat.nil?
|
982
|
+
return nil
|
983
|
+
end
|
984
|
+
|
985
|
+
# second classification: according to path
|
986
|
+
path = CollinsTntInterpreter.path_between(anchor_node, gf_node)
|
987
|
+
if path.nil?
|
988
|
+
# no path between anchor node and gf node
|
989
|
+
return nil
|
990
|
+
end
|
991
|
+
|
992
|
+
path.set_cutoff_last_pt_on_printing(true)
|
993
|
+
path_string = path.print(true,false,true)
|
994
|
+
|
995
|
+
case path_string
|
996
|
+
when /^(U ADJP )?U NPB D $/
|
997
|
+
categ2 = "nnpath"
|
998
|
+
when "U ADJP D "
|
999
|
+
categ2 = "adjp-neighbor"
|
1000
|
+
when /^(U ADJP )?U (VP U )?S(BAR)? D $/
|
1001
|
+
categ2 = "s"
|
1002
|
+
when /^U (ADJP U )?VP D $/
|
1003
|
+
categ2 = "vp"
|
1004
|
+
else
|
1005
|
+
categ2 = ""
|
1006
|
+
end
|
1007
|
+
|
1008
|
+
# now evaluate based on both
|
1009
|
+
case cat + "+" + categ2
|
1010
|
+
when "noun+nnpath"
|
1011
|
+
return "HD"
|
1012
|
+
when "verb+adjp-neighbor"
|
1013
|
+
return "OC"
|
1014
|
+
when "prep+vp", "prep+adjp-neighbor"
|
1015
|
+
return "MO-" + CollinsTntInterpreter.preposition(gf_node).to_s
|
1016
|
+
else
|
1017
|
+
return nil
|
1018
|
+
end
|
1019
|
+
end
|
1020
|
+
|
1021
|
+
####
|
1022
|
+
# auxiliary of prune?:
|
1023
|
+
#
|
1024
|
+
# given a node and a hash mapping node IDs to paths to target:
|
1025
|
+
# Does that node have a sister that is a coordination and that
|
1026
|
+
# is between it and the target?
|
1027
|
+
#
|
1028
|
+
def CollinsTntInterpreter.conj_sister_between?(node, # SynNode
|
1029
|
+
paths_to_target, # Hash: node ID -> Path obj: path from node to target
|
1030
|
+
ti) # hash: terminal node -> word index in sentence
|
1031
|
+
|
1032
|
+
# does node have sisters that represent coordination?
|
1033
|
+
unless (p = node.parent())
|
1034
|
+
return false
|
1035
|
+
end
|
1036
|
+
|
1037
|
+
unless (conj_sisters = p.children.select { |sib|
|
1038
|
+
sib != node and CollinsTntInterpreter.category(sib) == "con"
|
1039
|
+
} ) and
|
1040
|
+
not (conj_sisters.empty?)
|
1041
|
+
return false
|
1042
|
+
end
|
1043
|
+
|
1044
|
+
# represent each coordination sister, and the node itself,
|
1045
|
+
# as a triple [node, leftmost terminal index(node), rightmost terminal index(node)
|
1046
|
+
conj_sisters = conj_sisters.map { |n|
|
1047
|
+
[n, CollinsTntInterpreter.lti(n, ti), CollinsTntInterpreter.rti(n, ti)]
|
1048
|
+
}
|
1049
|
+
|
1050
|
+
this_triple = [node, CollinsTntInterpreter.lti(node, ti), CollinsTntInterpreter.rti(node, ti)]
|
1051
|
+
|
1052
|
+
# sisters closer to the target than node:
|
1053
|
+
# also map to triples
|
1054
|
+
sisters_closer_to_target = p.children.select { |sib|
|
1055
|
+
sib != node and
|
1056
|
+
not(conj_sisters.include? sib) and
|
1057
|
+
paths_to_target[sib.id()] and
|
1058
|
+
paths_to_target[sib.id()].length() < paths_to_target[node.id()].length
|
1059
|
+
}.map { |n|
|
1060
|
+
[n, CollinsTntInterpreter.lti(n, ti), CollinsTntInterpreter.rti(n, ti)]
|
1061
|
+
}
|
1062
|
+
|
1063
|
+
if sisters_closer_to_target.empty?
|
1064
|
+
return false
|
1065
|
+
end
|
1066
|
+
|
1067
|
+
# is there any coordination sister that is inbetween this node
|
1068
|
+
# and some sister that is closer to the target?
|
1069
|
+
# if so, return true
|
1070
|
+
conj_sisters.each { |conj_triple|
|
1071
|
+
if leftof(conj_triple, this_triple) and
|
1072
|
+
sisters_closer_to_target.detect { |s| CollinsTntInterpreter.leftof(s, conj_triple) }
|
1073
|
+
|
1074
|
+
return true
|
1075
|
+
|
1076
|
+
elsif rightof(conj_triple, this_triple) and
|
1077
|
+
sisters_closer_to_target.detect { |s| CollinsTntInterpreter.rightof(s, conj_triple) }
|
1078
|
+
|
1079
|
+
return true
|
1080
|
+
end
|
1081
|
+
}
|
1082
|
+
|
1083
|
+
# else return false
|
1084
|
+
return false
|
1085
|
+
end
|
1086
|
+
|
1087
|
+
###
|
1088
|
+
# lti, rti: terminal index of the leftmost/rightmost terminal of
|
1089
|
+
# a given node (SynNode)
|
1090
|
+
#
|
1091
|
+
# auxiliary of conj_sister_between?
|
1092
|
+
def CollinsTntInterpreter.lti(node, # SynNode
|
1093
|
+
terminal_index) # hash: terminal node -> word index in sentence
|
1094
|
+
lt = CollinsTntInterpreter.leftmost_terminal(node)
|
1095
|
+
unless lt
|
1096
|
+
return nil
|
1097
|
+
end
|
1098
|
+
|
1099
|
+
return terminal_index[lt]
|
1100
|
+
end
|
1101
|
+
|
1102
|
+
def CollinsTntInterpreter.rti(node, # SynNode
|
1103
|
+
terminal_index) # hash: terminal node -> word index in sentence
|
1104
|
+
rt = CollinsTntInterpreter.rightmost_terminal(node)
|
1105
|
+
unless rt
|
1106
|
+
return nil
|
1107
|
+
end
|
1108
|
+
|
1109
|
+
return terminal_index[rt]
|
1110
|
+
end
|
1111
|
+
|
1112
|
+
###
|
1113
|
+
# leftof, rightof: given 2 triples
|
1114
|
+
# [node(SynNode), index of leftmost terminal(integer/nil), index of rightmost terminal(integer/nil),
|
1115
|
+
#
|
1116
|
+
# auxiliaries of conj_sister_between?
|
1117
|
+
#
|
1118
|
+
# return true if both leftmost and rightmost terminal indices of the first triple are
|
1119
|
+
# smaller than (for leftof) / bigger than (for rightof) the
|
1120
|
+
# corresponding indices of the second triple
|
1121
|
+
#
|
1122
|
+
# return false if some index is nil
|
1123
|
+
def CollinsTntInterpreter.leftof(triple1,
|
1124
|
+
triple2)
|
1125
|
+
dummy, lm1, rm1 = triple1
|
1126
|
+
dummy, lm2, rm2 = triple2
|
1127
|
+
|
1128
|
+
if lm1.nil? or rm1.nil? or lm2.nil? or rm2.nil?
|
1129
|
+
return false
|
1130
|
+
elsif lm1 < lm2 and rm1 < rm2
|
1131
|
+
return true
|
1132
|
+
else
|
1133
|
+
return false
|
1134
|
+
end
|
1135
|
+
end
|
1136
|
+
|
1137
|
+
def CollinsTntInterpreter.rightof(triple1,
|
1138
|
+
triple2)
|
1139
|
+
dummy, lm1, rm1 = triple1
|
1140
|
+
dummy, lm2, rm2 = triple2
|
1141
|
+
|
1142
|
+
if lm1.nil? or rm1.nil? or lm2.nil? or rm2.nil?
|
1143
|
+
return false
|
1144
|
+
elsif lm1 > lm2 and rm1 > rm2
|
1145
|
+
return true
|
1146
|
+
else
|
1147
|
+
return false
|
1148
|
+
end
|
1149
|
+
end
|
1150
|
+
end
|
1151
|
+
|
1152
|
+
|
1153
|
+
# use TreeTagger as replacement for TnT; re-use everything, but use treetagger as POS tagger
|
1154
|
+
|
1155
|
+
class CollinsTreeTaggerInterpreter < CollinsTntInterpreter
|
1156
|
+
CollinsTreeTaggerInterpreter.announce_me()
|
1157
|
+
|
1158
|
+
def CollinsTreeTaggerInterpreter.systems()
|
1159
|
+
return {
|
1160
|
+
"pos_tagger" => "treetagger",
|
1161
|
+
"parser" => "collins"
|
1162
|
+
}
|
1163
|
+
end
|
1164
|
+
end
|
1165
|
+
|