shalmaneser-fred 1.2.0.rc4

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: e1795de4d92cea5dee25e6840fc1080161aa1d6e
+   data.tar.gz: 8933ad415fc12fef76184e68e28757b2c6f79ec5
+ SHA512:
+   metadata.gz: 7efd1551dc7e902b2fed0dd717f9eb0b9ac7aa2c010ab2bf91472934f612c066b254175f7feac7d885f8953a8979872203c8f0d6eb040253949aea0090b98eb6
+   data.tar.gz: 4b46a404e0400483233cb196b3f2a41759db2c98936062a86a097bd404a7759884b4046b5f62f6223ad56d7c62599204238fdcf8e4852f2df59f091faf776822
@@ -0,0 +1,10 @@
+ --private
+ --protected
+ --title 'SHALMANESER'
+ lib/**/*.rb
+ bin/**/*
+ doc/**/*.md
+ -
+ CHANGELOG.md
+ LICENSE.md
+ doc/index.md
@@ -0,0 +1,4 @@
+ # Versions
+
+ ## Version 1.2.0-rc1
+
@@ -0,0 +1,4 @@
+ # LICENSE
+
+ This software is written in Ruby and is released under the [GNU General Public License](http://www.gnu.org/licenses/gpl-2.0.html) (GPL v2); the documentation is released under the [GNU Free Documentation License](http://www.gnu.org/licenses/old-licenses/fdl-1.2.html) (FDL v1.2).
+
@@ -0,0 +1,93 @@
+ # [SHALMANESER - a SHALlow seMANtic parSER](http://www.coli.uni-saarland.de/projects/salsa/shal/)
+
+ [RubyGems](http://rubygems.org/gems/shalmaneser) |
+ [Shalmaneser's Project Page](http://bu.chsta.be/projects/shalmaneser/) |
+ [Source Code](https://github.com/arbox/shalmaneser) |
+ [Bug Tracker](https://github.com/arbox/shalmaneser/issues)
+
+
+ [![Gem Version](https://img.shields.io/gem/v/shalmaneser.svg)](https://rubygems.org/gems/shalmaneser)
+ [![Gem Version](https://img.shields.io/gem/v/frprep.svg)](https://rubygems.org/gems/frprep)
+ [![Gem Version](https://img.shields.io/gem/v/fred.svg)](https://rubygems.org/gems/fred)
+ [![Gem Version](https://img.shields.io/gem/v/rosy.svg)](https://rubygems.org/gems/rosy)
+
+
+ [![License GPL 2](http://img.shields.io/badge/License-GPL%202-green.svg)](http://www.gnu.org/licenses/gpl-2.0.txt)
+ [![Build Status](https://img.shields.io/travis/arbox/shalmaneser.svg?branch=1.2)](https://travis-ci.org/arbox/shalmaneser)
+ [![Code Climate](https://img.shields.io/codeclimate/github/arbox/shalmaneser.svg)](https://codeclimate.com/github/arbox/shalmaneser)
+ [![Dependency Status](https://img.shields.io/gemnasium/arbox/shalmaneser.svg)](https://gemnasium.com/arbox/shalmaneser)
+
+ ## Description
+
+ Please be careful, the whole thing is under construction! For now Shalmaneser is not intended to run on Windows systems, since it relies heavily on system calls for external invocations.
+ Current versions of Shalmaneser have been tested on Linux only (testers on other *NIX systems are welcome!).
+
+ Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. This technique is often called SRL (Semantic Role Labelling). The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaptation, Shalmaneser should be usable for other paradigms (e.g. PropBank roles) as well. Shalmaneser caters both to end users and to researchers.
+
+ For end users, we provide an end user mode which simply applies the pre-trained classifiers
+ for [English](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (FrameNet 1.3 annotation / Collins parser)
+ and [German](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (SALSA 1.0 annotation / Sleepy parser).
+
+ We'll try to provide newer pre-trained models for English, German, and possibly other languages as soon as possible.
+
+ For researchers interested in investigating shallow semantic parsing, our system is extensively configurable and extensible.
+
+ ## Origin
+
+ The original version of Shalmaneser was written by Sebastian Padó, Katrin Erk and others during their work in the SALSA Project.
+
+ You can find the original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
+
+ ## Publications on Shalmaneser
+
+ - K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
+ - TODO: add other works
+
+ ## Documentation
+
+ The project documentation can be found in our [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md) folder.
+
+ ## Development
+
+ We are currently working on the following branches:
+
+ - ``dev`` - our development branch incorporating actual changes, for now pointing to ``1.2``;
+
+ - ``1.2`` - intermediate target;
+
+ - ``2.0`` - final target.
+
+ ## Installation
+
+ See the installation instructions in the [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md#installation) folder.
+
+ ### Tokenizers
+
+ - [Ucto](http://ilk.uvt.nl/ucto/)
+
+ ### POS Taggers
+
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
+
+ ### Lemmatizers
+
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
+
+ ### Parsers
+
+ - [BerkeleyParser](https://code.google.com/p/berkeleyparser/downloads/list)
+ - [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
+ - [Collins Parser](http://www.cs.columbia.edu/~mcollins/code.html)
+
+ ### Machine Learning Systems
+
+ - [OpenNLP MaxEnt](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/)
+ - [Mallet](http://mallet.cs.umass.edu/index.php)
+
+ ## License
+
+ See the `LICENSE` file.
+
+ ## Contributing
+
+ See the `CONTRIBUTING` file.
@@ -0,0 +1,16 @@
+ #!/usr/bin/env ruby
+ # -*- encoding: utf-8 -*-
+
+ # @author Andrei Beliankou, 2011-11-13
+ # @author Katrin Erk, April 05
+ #
+ # Frame disambiguation system:
+ # frame assignment as word sense disambiguation
+
+ require 'fred/opt_parser'
+ require 'fred/fred'
+
+ options = Fred::OptParser.parse(ARGV)
+
+ fred = Fred::Fred.new(options)
+ fred.assign
@@ -0,0 +1,150 @@
+ # Baseline
+ # Katrin Erk April 05
+ #
+ # baseline for WSD:
+ # always assign most frequent sense
+ # The baseline doesn't do binary classifiers.
+
+ require "fred/FredConventions"
+ require "fred/FredSplitPkg"
+ require "fred/FredFeatures"
+ require "fred/FredDetermineTargets"
+
+ class Baseline
+   ###
+   # new
+   #
+   # get splitlog dir (if any) along with everything else,
+   # because we are only evaluating the training data
+   # at test time
+   #
+   def initialize(exp,            # FredConfigData object
+                  split_id = nil) # string: split ID
+     @exp = exp
+     @split_id = split_id
+
+     # for each lemma: remember prevalent sense
+     @lemma_to_sense = Hash.new()
+
+     if @split_id
+       split_obj = FredSplitPkg.new(@exp)
+     end
+
+     lemma_done = Hash.new()
+
+     # iterate through lemmas
+     @target_obj = Targets.new(@exp, nil, "r")
+     unless @target_obj.targets_okay
+       # error during initialization
+       $stderr.puts "Error: Could not read list of known targets, bailing out."
+       exit 1
+     end
+
+     @target_obj.get_lemmas().each { |lemmapos|
+       if @split_id
+         # read training split of answer keys
+         answer_obj = AnswerKeyAccess.new(@exp, "train", lemmapos, "r", @split_id, "train")
+       else
+         # read full answer key file of training data
+         answer_obj = AnswerKeyAccess.new(@exp, "train", lemmapos, "r")
+       end
+
+       count_senses = Hash.new(0)
+
+       answer_obj.each { |lemma, pos, ids, sid, senses_all, senses_this|
+         # senses_this may include more than one sense for multi-label assignment
+         senses_this.each { |sense|
+           count_senses[sense] += 1
+         }
+       }
+
+       # remember the sense with the highest count for this lemma
+       @lemma_to_sense[lemmapos] = count_senses.keys().max { |a, b|
+         count_senses[a] <=> count_senses[b]
+       }
+     }
+
+     @lemma = nil
+   end
+
+   ###
+   def train(infilename)
+     # no training here
+   end
+
+   ###
+   def write(classifier_file)
+     # no classifiers to write
+   end
+
+   def exists?(classifier_file)
+     return true
+   end
+
+   def read(classifier_file)
+     values = deconstruct_fred_classifier_filename(File.basename(classifier_file))
+     @lemma = values["lemma"]
+     if @lemma
+       return true
+     else
+       $stderr.puts "Warning: couldn't determine lemma name in #{classifier_file}, skipping"
+       return false
+     end
+   end
+
+   def read_resultfile(filename)
+     retv = Array.new()
+     begin
+       f = File.new(filename)
+     rescue
+       raise "Could not read baseline result file #{filename}"
+     end
+
+     f.each { |line|
+       retv << [[ line.chomp(), 1.0 ]]
+     }
+
+     return retv
+   end
+
+   def apply(infilename, outfilename)
+     # open input and output files
+     begin
+       out_f = File.new(outfilename, "w")
+     rescue
+       $stderr.puts "Error: cannot write to classification output file #{outfilename}."
+       exit 1
+     end
+     begin
+       f = File.new(infilename)
+     rescue
+       $stderr.puts "Error: cannot read feature file #{infilename}."
+       exit 1
+     end
+
+     # deconstruct input filename to determine the lemma
+     unless @lemma
+       # something went wrong in read()
+       return false
+     end
+
+     # do we have a sense for this?
+     unless (sense = @lemma_to_sense[@lemma])
+       # nope: assign "NONE" (or whatever the null label is here)
+       sense = @exp.get("negsense")
+       unless sense
+         sense = "NONE"
+       end
+     end
+
+     f.each { |line|
+       out_f.puts sense
+     }
+     out_f.close()
+     f.close()
+
+     return true
+   end
+ end
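The class above implements the most-frequent-sense (MFS) baseline on top of Fred's answer-key machinery: count how often each sense occurs for a lemma in the training answer keys, then always predict the winner. A minimal self-contained sketch of the same technique (the training pairs here are invented for illustration, not part of the gem):

```ruby
# Most-frequent-sense baseline in miniature: tally senses per lemma,
# then always predict the majority sense for that lemma.
training = [
  ["bank", "FINANCIAL"], ["bank", "RIVERSIDE"], ["bank", "FINANCIAL"],
  ["bass", "FISH"], ["bass", "MUSIC"], ["bass", "MUSIC"]
]

counts = Hash.new { |h, lemma| h[lemma] = Hash.new(0) }
training.each { |lemma, sense| counts[lemma][sense] += 1 }

mfs = counts.map { |lemma, senses|
  [lemma, senses.max_by { |_sense, n| n }.first]
}.to_h

puts mfs["bank"] # => "FINANCIAL"
puts mfs["bass"] # => "MUSIC"
```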
@@ -0,0 +1,31 @@
+ class FileZipped
+
+   def FileZipped.new(filename,
+                      mode = "r")
+
+     # backslash-escape characters in the filename that
+     # would make the shell hiccup on the command line
+     filename = filename.gsub(/([();:!?'`])/, 'XXSLASHXX\1')
+     filename = filename.gsub(/XXSLASHXX/, "\\")
+
+     begin
+       case mode
+       when "r"
+         unless File.exist? filename
+           raise "catchme"
+         end
+         return IO.popen("gunzip -c #{filename}")
+       when "w"
+         return IO.popen("gzip > #{filename}", "w")
+       when "a"
+         return IO.popen("gzip >> #{filename}", "w")
+       else
+         $stderr.puts "FileZipped error: only modes r, w, a are implemented. I got: #{mode}."
+         exit 1
+       end
+     rescue
+       raise "Error opening file #{filename}."
+     end
+   end
+
+ end
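FileZipped shells out to gzip and gunzip, which is why the filename has to be shell-escaped first. For comparison, a sketch of the same read/write functionality built on Ruby's standard Zlib library, which avoids the shell round-trip entirely (not how the gem does it; just an illustration):

```ruby
require 'zlib'

# Write a gzipped file without spawning a shell process.
Zlib::GzipWriter.open('example.txt.gz') do |gz|
  gz.puts 'first line'
  gz.puts 'second line'
end

# Read it back line by line.
Zlib::GzipReader.open('example.txt.gz') do |gz|
  gz.each_line { |line| puts line }
end
```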
@@ -0,0 +1,877 @@
+ require "tempfile"
+ require 'fileutils'
+
+ require "common/RegXML"
+ require "common/SynInterfaces"
+ require "common/TabFormat"
+ require "common/SalsaTigerRegXML"
+ require "common/SalsaTigerXMLHelper"
+ require "common/RosyConventions"
+
+ require 'fred/md5'
+ require "fred/fred_config_data"
+ require "fred/FredConventions"
+ require "fred/FredDetermineTargets"
+
+ require 'db/db_interface'
+ require 'db/sql_query'
+
+ ########################################
+ # Context Provider classes:
+ # read in text, collecting context windows of a given size
+ # around target words, yield contexts as soon as they are complete
+ #
+ # Target words are determined by delegating to either TargetsFromFrames or AllTargets
+ #
+ class AbstractContextProvider
+
+   include WordLemmaPosNe
+
+   ################
+   def initialize(window_size,        # int: size of the context window (one-sided)
+                  exp,                # experiment file object
+                  interpreter_class,  # SynInterpreter class
+                  target_obj,         # AbstractTargetDeterminer object
+                  dataset)            # "train", "test"
+
+     @window_size = window_size
+     @exp = exp
+     @interpreter_class = interpreter_class
+     @target_obj = target_obj
+     @dataset = dataset
+
+     # make arrays:
+     # context words
+     @context = Array.new(2 * @window_size + 1, nil)
+     # nil for non-targets, all information on the target for targets
+     @is_target = Array.new(2 * @window_size + 1, nil)
+     # sentence object
+     @sentence = Array.new(2 * @window_size + 1, nil)
+   end
+
+   ###################
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     raise "overwrite me"
+   end
+
+   ####################
+   protected
+
+   ############################
+   # shift a sentence through the @context window,
+   # yield when at target
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window_for_sent(sent) # SalsaTigerSentence object or TabSentence object
+     if sent.kind_of? SalsaTigerSentence
+       each_window_for_stsent(sent) { |result| yield result }
+     elsif sent.kind_of? TabFormatSentence
+       each_window_for_tabsent(sent) { |result| yield result }
+     else
+       $stderr.puts "Error: got #{sent.class()}, expected SalsaTigerSentence or TabFormatSentence."
+       exit 1
+     end
+   end
+
+   ###
+   # sent is a SalsaTigerSentence object:
+   # there may be targets
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window_for_stsent(sent)
+     # determine targets first.
+     # original targets:
+     # hash: target_IDs -> list of senses
+     # where target_IDs is a pair [list of terminal IDs, main terminal ID]
+     #
+     # where a sense is represented as a hash:
+     # "sense": sense, a string
+     # "obj": FrameNode object
+     # "all_targets": list of node IDs, may comprise more than a single node
+     # "lex": lemma, or multiword expression in canonical form
+     # "sid": sentence ID
+     original_targets = @target_obj.determine_targets(sent)
+
+     # reencode, make hashes:
+     # main target ID -> list of senses,
+     # main target ID -> all target IDs
+     maintarget_to_senses = Hash.new()
+     main_to_all_targets = Hash.new()
+     original_targets.each_key { |alltargets, maintarget|
+       main_to_all_targets[maintarget] = alltargets
+       maintarget_to_senses[maintarget] = original_targets[[alltargets, maintarget]]
+     }
+
+     # then shift each terminal into the context window
+     # and check whether there is a target at the center position
+     sent_terminals_nopunct(sent).each { |term_obj|
+       # add the new word to the end of the context array
+       @context.push(word_lemma_pos_ne(term_obj, @interpreter_class))
+
+       if maintarget_to_senses.has_key? term_obj.id()
+         @is_target.push([ term_obj.id(),
+                           main_to_all_targets[term_obj.id()],
+                           maintarget_to_senses[term_obj.id()]
+                         ])
+       else
+         @is_target.push(nil)
+       end
+
+       @sentence.push(sent)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # - a context, an array of tuples [word, lemma, pos, ne]
+         #   string/nil*string/nil*string/nil*string/nil
+         # - ID of main target: string
+         # - target_IDs: array:string, list of IDs of target words
+         # - senses: array:string, the senses for the target
+         # - sent: SalsaTigerSentence object
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     }
+   end
+
+   ###
+   # sent is a TabFormatSentence object.
+   # shift word/lemma/pos/ne tuples through the context window.
+   # Whenever this brings a target (from another sentence, necessarily)
+   # to the center of the context window, yield it.
+   def each_window_for_tabsent(sent)
+     sent.each_line_parsed() { |line_obj|
+       # push onto the context array:
+       # [word, lemma, pos, ne], all lowercase
+       @context.push([ line_obj.get("word").downcase(),
+                       line_obj.get("lemma").downcase(),
+                       line_obj.get("pos").downcase(),
+                       nil ])
+       @is_target.push(nil)
+       @sentence.push(nil)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # context window, main target ID, all target IDs,
+         # senses (as FrameNode objects), sentence as XML
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     }
+   end
+
+   ############################
+   # each remaining target:
+   # call this to empty the context window after everything has been shifted in
+   def each_remaining_target()
+     while @context.detect { |entry| not(entry.nil?) }
+       # push nil onto the context array
+       @context.push(nil)
+       @is_target.push(nil)
+       @sentence.push(nil)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # context window, main target ID, all target IDs,
+         # senses (as FrameNode objects), sentence as XML
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     end
+   end
+
+   ############################
+   # helper: remove punctuation
+   def sent_terminals_nopunct(sent)
+     return sent.terminals_sorted.reject { |node|
+       @interpreter_class.category(node) == "pun"
+     }
+   end
+ end
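The window logic above is a fixed-size ring buffer: each incoming token is pushed on the right, the oldest entry is dropped on the left, and the slot at index `@window_size` is inspected after every shift; trailing nils flush the last targets out. A stripped-down sketch of the same idea on plain token arrays (a hypothetical helper, independent of the Salsa/Tiger machinery):

```ruby
# Slide a (2 * window + 1)-slot buffer over a token stream and report
# each token together with its left and right context.
def each_center_with_context(tokens, window)
  buffer = Array.new(2 * window + 1, nil)
  # pad with nils so the final tokens also reach the center slot
  (tokens + [nil] * window).each do |token|
    buffer.push(token)
    buffer.shift
    center = buffer[window]
    next unless center
    yield center, buffer[0...window].compact, buffer[window + 1..-1].compact
  end
end

each_center_with_context(%w[the cat sat on the mat], 2) do |center, left, right|
  puts "#{center}: [#{left.join(' ')}] _ [#{right.join(' ')}]"
end
```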
+
+ ####################################
+ # ContextProvider:
+ # subclass of AbstractContextProvider
+ # that assumes that the input text is a contiguous text
+ # and computes the context accordingly.
+ class ContextProvider < AbstractContextProvider
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     # iterate through files in the directory.
+     # Try sorting filenames numerically, since this is
+     # what frprep mostly does with filenames
+     Dir[dir + "*.xml"].sort { |a, b|
+       File.basename(a, ".xml").to_i() <=> File.basename(b, ".xml").to_i()
+     }.each { |filename|
+       # progress bar
+       if @exp.get("verbose")
+         $stderr.puts "Featurizing #{File.basename(filename)}"
+       end
+       f = FilePartsParser.new(filename)
+       each_window_for_file(f) { |result|
+         yield result
+       }
+     }
+     # and empty the context array
+     each_remaining_target() { |result| yield result }
+   end
+
+   ##################################
+   protected
+
+   ######################
+   # each_window_for_file: iterator
+   # same as each_window, but only for a single file
+   # (to be called from each_window())
+   def each_window_for_file(fpp) # FilePartsParser object: Salsa/Tiger XML data
+     fpp.scan_s() { |sent_string|
+       sent = SalsaTigerSentence.new(sent_string)
+       each_window_for_sent(sent) { |result| yield result }
+     }
+   end
+ end
+
+ ####################################
+ # SingleSentContextProvider:
+ # subclass of AbstractContextProvider
+ # that assumes that each sentence of the input text
+ # stands on its own
+ class SingleSentContextProvider < AbstractContextProvider
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     # iterate through files in the directory.
+     # Try sorting filenames numerically, since this is
+     # what frprep mostly does with filenames
+     Dir[dir + "*.xml"].sort { |a, b|
+       File.basename(a, ".xml").to_i() <=> File.basename(b, ".xml").to_i()
+     }.each { |filename|
+       # progress bar
+       if @exp.get("verbose")
+         $stderr.puts "Featurizing #{File.basename(filename)}"
+       end
+       f = FilePartsParser.new(filename)
+       each_window_for_file(f) { |result|
+         yield result
+       }
+     }
+   end
+
+   ##################################
+   protected
+
+   ######################
+   # each_window_for_file: iterator
+   # same as each_window, but only for a single file
+   # (to be called from each_window())
+   def each_window_for_file(fpp) # FilePartsParser object: Salsa/Tiger XML data
+     fpp.scan_s() { |sent_string|
+       sent = SalsaTigerSentence.new(sent_string)
+       each_window_for_sent(sent) { |result|
+         yield result
+       }
+     }
+     # no need to clear the context: we're doing this after each sentence
+   end
+
+   ###
+   # each_window_for_sent: empty the context after each sentence
+   def each_window_for_sent(sent)
+     if sent.kind_of? SalsaTigerSentence
+       each_window_for_stsent(sent) { |result| yield result }
+     elsif sent.kind_of? TabFormatSentence
+       each_window_for_tabsent(sent) { |result| yield result }
+     else
+       $stderr.puts "Error: got #{sent.class()}, expected SalsaTigerSentence or TabFormatSentence."
+       exit 1
+     end
+
+     # clear the context
+     each_remaining_target() { |result| yield result }
+   end
+ end
+
+
+ ####################################
+ # NoncontiguousContextProvider:
+ # subclass of AbstractContextProvider
+ #
+ # This class assumes that the input text consists of single sentences
+ # drawn from a larger corpus.
+ # It first constructs an index to the sentences of the input text,
+ # then reads the larger corpus.
+ class NoncontiguousContextProvider < AbstractContextProvider
+
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data and construct an index to the sentences.
+   #
+   # Then iterate through the larger corpus,
+   # yielding contexts.
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+
+     # @todo AB: Move this chunk to OptionParser.
+     # sanity check: do we know where the larger corpus is?
+     unless @exp.get("larger_corpus_dir")
+       $stderr.puts "Error: 'noncontiguous_input' has been set in the experiment file"
+       $stderr.puts "but no location for the larger corpus has been given."
+       $stderr.puts "Please set 'larger_corpus_dir' in the experiment file"
+       $stderr.puts "to indicate the larger corpus from which the input corpus sentences are drawn."
+       exit 1
+     end
+
+     ##
+     # remember all sentences from the main corpus
+     temptable_obj, sentkeys = make_index(dir)
+
+     ##
+     # make a frprep experiment file
+     # for lemmatization and POS-tagging of the larger corpus files
+     tf_exp_frprep = Tempfile.new("fred_bow_context")
+     frprep_in, frprep_out, frprep_dir = write_frprep_experiment_file(tf_exp_frprep)
+
+     ##
+     # Iterate through the files of the larger corpus,
+     # check for each sentence whether it is also in the input corpus,
+     # and yield it if it is.
+     # The larger corpus may contain subdirectories.
+     initialize_match_check()
+
+     each_infile(@exp.get("larger_corpus_dir")) { |filename|
+       $stderr.puts "Larger corpus: reading #{filename}"
+
+       # remove previous data from the temp directories
+       remove_files(frprep_in)
+       remove_files(frprep_out)
+       remove_files(frprep_dir)
+
+       # link the input file to the input directory for frprep
+       File.symlink(filename, frprep_in + "infile")
+
+       # call frprep
+       # AB: Bad hack, find a way to invoke FrPrep directly.
+       # We will need an FrPrep instance and an options object.
+       base_dir_path = File.expand_path(File.dirname(__FILE__) + '/../..')
+
+       # @todo AB: Remove this after debugging.
+       FileUtils.cp(tf_exp_frprep.path, '/tmp/frprep.exp')
+
+       retv = system("ruby -rubygems -I #{base_dir_path}/lib #{base_dir_path}/bin/frprep -e #{tf_exp_frprep.path}")
+
+       unless retv
+         $stderr.puts "Error analyzing #{filename}. Exiting."
+         exit 1
+       end
+
+       # read the resulting Tab format file, one sentence at a time:
+       # - check to see if the checksum of the sentence is in sentkeys
+       #   (which means it is an input sentence).
+       #   If it is, retrieve the sentence and determine targets
+       # - shift the sentence through the context window
+       # - whenever a target word comes to be in the center of the context window,
+       #   yield.
+       $stderr.puts "Computing context features from frprep output."
+       Dir[frprep_out + "*.tab"].each { |tabfilename|
+         tabfile = FNTabFormatFile.new(tabfilename, ".pos", ".lemma")
+         tabfile.each_sentence() { |tabsent|
+           # get as a Salsa/Tiger XML sentence, or a TabSentence
+           sent = get_stxml_sent(tabsent, sentkeys, temptable_obj)
+
+           # shift the sentence through the context window
+           each_window_for_sent(sent) { |result|
+             yield result
+           }
+         } # each tab sent
+       } # each tab file
+     } # each infile from the larger corpus
+
+     # empty the context array
+     each_remaining_target() { |result| yield result }
+     each_unmatched(sentkeys, temptable_obj) { |result| yield result }
+
+     # remove temporary data
+     temptable_obj.drop_temp_table()
+
+     # @todo AB: TODO Rewrite this passage using pure Ruby.
+     %x{rm -rf #{frprep_in}}
+     %x{rm -rf #{frprep_out}}
+     %x{rm -rf #{frprep_dir}}
+   end
+
+   ##################################
+   private
+
+   ###
+   # for each sentence of each file in the given directory:
+   # remember the sentence in a temporary DB,
+   # indexed by a hash key computed from the plaintext sentence.
+   #
+   # return:
+   # - DBTempTable object containing the temporary DB
+   # - hash table containing all hash keys
+   def make_index(dir)
+
+     # AB: Why these limits? Use constants!
+     space_for_sentstring = 30000
+     space_for_hashkey = 500
+
+     $stderr.puts "Indexing input corpus:"
+
+     # start a temporary table
+     temptable_obj = get_db_interface(@exp).make_temp_table(
+       [["hashkey", "varchar(#{space_for_hashkey})"],
+        ["sent", "varchar(#{space_for_sentstring})"]],
+       ["hashkey"],
+       "autoinc_index")
+
+     # and a hash table for the keys
+     retv_keys = Hash.new()
+
+     # iterate through files in the directory,
+     # make an index for each sentence, and store
+     # the sentence under that index
+     Dir[dir + "*.xml"].each { |filename|
+       $stderr.puts "\t#{filename}"
+       f = FilePartsParser.new(filename)
+       f.scan_s() { |sent_string|
+
+         xml_obj = RegXML.new(sent_string)
+
+         # make a hash key from the words of the sentence
+         graph = xml_obj.children_and_text().detect { |c| c.name() == "graph" }
+         unless graph
+           next
+         end
+         terminals = graph.children_and_text().detect { |c| c.name() == "terminals" }
+         unless terminals
+           next
+         end
+         # in making a hash key, use special characters
+         # rather than their escaped &..; form
+         # $stderr.puts "HIER calling checksum for noncontig"
+         hashkey = checksum(terminals.children_and_text().select { |c|
+                              c.name() == "t"
+                            }.map { |t|
+                              SalsaTigerXMLHelper.unescape(t.attributes()["word"].to_s())
+                            })
+         # HIER
+         # $stderr.puts "HIER " + terminals.children_and_text().select { |c| c.name() == "t"
+         # }.map { |t| t.attributes()["word"].to_s() }.join(" ")
+
+         # sanity check: if the sentence is longer than
+         # the space currently allotted to sentence strings,
+         # we won't be able to recover it.
+         if SQLQuery.stringify_value(hashkey).length() > space_for_hashkey
+           $stderr.puts "Warning: sentence checksum too long, cannot store it."
+           $stderr.print "Max length: #{space_for_hashkey}. "
+           $stderr.puts "Required: #{SQLQuery.stringify_value(hashkey).length()}."
+           $stderr.puts "Skipping."
+           next
+         end
+
+         if SQLQuery.stringify_value(sent_string).length() > space_for_sentstring
+           $stderr.puts "Warning: sentence too long, cannot store it."
+           $stderr.print "Max length: #{space_for_sentstring}. "
+           $stderr.puts "Required: #{SQLQuery.stringify_value(sent_string).length()}."
+           $stderr.puts "Skipping."
+           next
+         end
+
+         # store
+         temptable_obj.query_noretv(SQLQuery.insert(temptable_obj.table_name,
+                                                    [["hashkey", hashkey],
+                                                     ["sent", sent_string]]))
+         retv_keys[hashkey] = true
+       }
+     }
+     $stderr.puts "Indexing finished."
+
+     return [ temptable_obj, retv_keys ]
+   end
+
+   ######
+   # compute a checksum from the given sentence,
+   # and return it as a string
+   def checksum(words) # array: string
+     string = ""
+
+     # HIER removed sort() after downcase
+     words.map { |w| w.to_s.downcase }.each { |w|
+       string << w.gsub(/[^a-z]/, "")
+     }
+     return MD5.new(string).hexdigest
+   end
+
+   #####
+   # yield each file of the given directory
+   # or one of its subdirectories
+   def each_infile(indir)
+     unless indir =~ /\/$/
+       indir = indir + "/"
+     end
+
+     Dir[indir + "*"].each { |filename|
+       if File.file?(filename)
+         yield filename
+       end
+     }
+
+     # enter recursion
+     Dir[indir + "**"].each { |subdir|
+       # same directory we had before? don't redo
+       if indir == subdir
+         next
+       end
+
+       begin
+         unless File.stat(subdir).directory?
+           next
+         end
+       rescue
+         # no access, I assume
+         next
+       end
+
+       each_infile(subdir) { |inf|
+         yield inf
+       }
+     }
+   end
+
+   ###
+   # remove files: remove all files and subdirectories in the given directory
+   def remove_files(indir)
+     Dir[indir + "*"].each { |filename|
+       if File.file?(filename) or File.symlink?(filename)
+         retv = File.delete(filename)
+       end
+     }
+
+     # enter recursion
+     Dir[indir + "**"].each { |subdir|
+       # same directory we had before? don't redo
+       if indir == subdir
+         next
+       end
+
+       begin
+         unless File.stat(subdir).directory?
+           next
+         end
+       rescue
+         # no access, I assume
+         next
+       end
+
+       # subdir must end in slash
+       unless subdir =~ /\/$/
+         subdir = subdir + "/"
+       end
+       # and enter recursion
+       remove_files(subdir)
+       FileUtils.rm_f(subdir)
+     }
+   end
+
+   def write_frprep_experiment_file(tf_exp_frprep) # Tempfile object
+
+     # make a unique experiment ID
+     experiment_id = "larger_corpus"
+     # input and output directories for frprep
+     frprep_in = fred_dirname(@exp, "temp", "in", "new")
+     frprep_out = fred_dirname(@exp, "temp", "out", "new")
+     frprep_dir = fred_dirname(@exp, "temp", "frprep", "new")
+
+     # write the file:
+
+     # experiment ID and directories
+     tf_exp_frprep.puts "prep_experiment_ID = #{experiment_id}"
+     tf_exp_frprep.puts "directory_input = #{frprep_in}"
+     tf_exp_frprep.puts "directory_preprocessed = #{frprep_out}"
+     tf_exp_frprep.puts "frprep_directory = #{frprep_dir}"
+
+     # output format: tab
+     tf_exp_frprep.puts "tabformat_output = true"
+
+     # corpus description: language, format, encoding
+     if @exp.get("language")
+       tf_exp_frprep.puts "language = #{@exp.get("language")}"
+     end
+     if @exp.get("larger_corpus_format")
+       tf_exp_frprep.puts "format = #{@exp.get("larger_corpus_format")}"
+     elsif @exp.get("format")
+       $stderr.puts "Warning: 'larger_corpus_format' not set in experiment file,"
+       $stderr.puts "using 'format' setting of frprep experiment file instead."
+       tf_exp_frprep.puts "format = #{@exp.get("format")}"
+     else
+       $stderr.puts "Warning: 'larger_corpus_format' not set in experiment file,"
+       $stderr.puts "relying on default setting."
+     end
+     if @exp.get("larger_corpus_encoding")
+       tf_exp_frprep.puts "encoding = #{@exp.get("larger_corpus_encoding")}"
+     elsif @exp.get("encoding")
+       $stderr.puts "Warning: 'larger_corpus_encoding' not set in experiment file,"
+       $stderr.puts "using 'encoding' setting of frprep experiment file instead."
+       tf_exp_frprep.puts "encoding = #{@exp.get("encoding")}"
+     else
+       $stderr.puts "Warning: 'larger_corpus_encoding' not set in experiment file,"
+       $stderr.puts "relying on default setting."
+     end
+
+     # processing: lemmatization, POS tagging, no parsing
+     tf_exp_frprep.puts "do_lemmatize = true"
+     tf_exp_frprep.puts "do_postag = true"
+     tf_exp_frprep.puts "do_parse = false"
+
+     # lemmatizer and POS tagger settings:
+     # take them verbatim from the frprep file
+     begin
+       f = File.new(@exp.get("preproc_descr_file_" + @dataset))
+     rescue
+       $stderr.puts "Error: could not read frprep experiment file #{@exp.get("preproc_descr_file_" + @dataset)}"
+       exit 1
+     end
+     f.each { |line|
+       if line =~ /pos_tagger\s*=/ or
+          line =~ /pos_tagger_path\s*=/ or
+          line =~ /lemmatizer\s*=/ or
+          line =~ /lemmatizer_path\s*=/
+         tf_exp_frprep.puts line
+       end
+     }
+     # finalize the frprep experiment file
+     tf_exp_frprep.close()
+
+     return [frprep_in, frprep_out, frprep_dir]
+   end
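For reference, the experiment file that this method writes is a plain list of `key = value` lines. With default settings it would look roughly like the following (directory values abbreviated; the `language`, `format` and `encoding` lines appear only when the corresponding settings exist, and the tagger/lemmatizer lines are copied verbatim from the frprep experiment file):

```
prep_experiment_ID = larger_corpus
directory_input = <frprep_in>
directory_preprocessed = <frprep_out>
frprep_directory = <frprep_dir>
tabformat_output = true
do_lemmatize = true
do_postag = true
do_parse = false
```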
+
+   ####
+   # get SalsaTigerXML sentence and targets:
+   #
+   # given a Tab format sentence:
+   # - check whether it is in the table of input sentences.
+   #   if so, retrieve it.
+   # - otherwise, fashion a makeshift SalsaTigerSentence object
+   #   from the words, lemmas and POS
+   def get_stxml_sent(tabsent,
+                      sentkeys,
+                      temptable_obj)
+
+     # SalsaTigerSentence object
+     sent = nil
+
+     # make checksum
+     words = Array.new()
+     words2 = Array.new()
+     tabsent.each_line_parsed { |line_obj|
+       words << SalsaTigerXMLHelper.unescape(line_obj.get("word"))
+       words2 << line_obj.get("word")
+     }
+     # $stderr.puts "HIER calling checksum from larger corpus"
+     hashkey_this_sentence = checksum(words)
+
+     # HIER
+     # $stderr.puts "HIER2 " + words.join(" ")
+     # $stderr.puts "HIER3 " + words2.join(" ")
+
+     if sentkeys[hashkey_this_sentence]
+       # sentence from the input corpus.
+
+       # register
+       register_matched(hashkey_this_sentence)
+
+       # select "sent" columns from the temp table
+       # where "hashkey" == sent_checksum;
+       # returns a DBResult object
+       query_result = temptable_obj.query(
+         SQLQuery.select([ SelectTableAndColumns.new(temptable_obj, ["sent"]) ],
+                         [ ValueRestriction.new("hashkey", hashkey_this_sentence) ]))
+       query_result.each { |row|
+         sent_string = SQLQuery.unstringify_value(row.first().to_s())
+         begin
+           sent = SalsaTigerSentence.new(sent_string)
+         rescue
+           $stderr.puts "Error reading Salsa/Tiger XML sentence."
+           $stderr.puts
+           $stderr.puts "SQL-stored sentence was:"
+           $stderr.puts row.first().to_s()
+           $stderr.puts
+           $stderr.puts "==================="
+           $stderr.puts "With restored quotes:"
+           $stderr.puts sent_string
+           exit 1
+         end
+         break
+       }
+       unless sent
+         $stderr.puts "Warning: could not retrieve input corpus sentence: " + words.join(" ")
+       end
+     end
+
+     if sent
+       return sent
+     else
+       return tabsent
+     end
+   end
+
+   ###
+   # Keep track of which sentences from the smaller, noncontiguous corpus
+   # have been matched in the larger corpus
+   def initialize_match_check()
+     @index_matched = Hash.new()
+   end
+
+   ###
+   # Record a sentence from the smaller, noncontiguous corpus
+   # as matched in the larger corpus
+   def register_matched(hash_key)
+     @index_matched[hash_key] = true
+   end
+
+   ###
+   # Call this method after all sentences from the larger corpus
+   # have been checked against the smaller corpus.
+   # This method prints a warning message for each sentence from the smaller corpus
+   # that has not been matched,
+   # and yields it in the same format as each_window(),
+   # such that the unmatched sentences can still be processed,
+   # but without a larger context.
+   def each_unmatched(all_keys,
+                      temptable_obj)
+
+     num_unmatched = 0
+
+     all_keys.each_key { |hash_key|
+       unless @index_matched[hash_key]
+         # unmatched sentence:
+         num_unmatched += 1
+
+         # retrieve
+         query_result = temptable_obj.query(
+           SQLQuery.select([ SelectTableAndColumns.new(temptable_obj, ["sent"]) ],
+                           [ ValueRestriction.new("hashkey", hash_key) ]))
+
+         # report and yield
+         query_result.each { |row|
+           sent_string = SQLQuery.unstringify_value(row.first().to_s())
+           begin
+             # report on the unmatched sentence
+             sent = SalsaTigerSentence.new(sent_string)
+             $stderr.puts "Unmatched sentence from noncontiguous input:\n" +
+                          sent.id().to_s() + " " + sent.to_s()
+
+             # push the sentence through the context window,
+             # filling it up with "nil",
+             # and yield when we reach the target at the center position.
+             each_window_for_stsent(sent) { |result| yield result }
+             each_remaining_target() { |result| yield result }
+           rescue
+             # Couldn't turn it into a SalsaTigerSentence object:
+             # just report, don't yield
+             $stderr.puts "Unmatched sentence from noncontiguous input (raw):\n" +
+                          sent_string
+             $stderr.puts "ERROR: cannot process this sentence, skipping."
+           end
+         }
+       end
+     }
+
+     $stderr.puts "Unmatched sentences: #{num_unmatched} all in all."
+   end
+ end
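The whole matching scheme between the small input corpus and the larger corpus rests on the word-based checksum above: lowercase the words, strip everything but letters, concatenate, and take the MD5 digest, so that minor punctuation and casing differences between the two corpora do not break the match. A self-contained sketch of that idea using the standard library's Digest::MD5 (illustrative only; the gem routes this through its bundled fred/md5 wrapper and a temporary SQL table):

```ruby
require 'digest/md5'

# Normalize a tokenized sentence to a checksum the way the indexing does:
# case and non-letter characters are ignored.
def sentence_key(words)
  Digest::MD5.hexdigest(words.map { |w| w.to_s.downcase.gsub(/[^a-z]/, "") }.join)
end

# Index the sentences of the small corpus...
index = {}
small_corpus = [%w[The cat sat.], %w[A dog barked!]]
small_corpus.each { |sent| index[sentence_key(sent)] = sent }

# ...then recognize them while streaming the larger corpus.
larger_corpus = [%w[something else entirely], %w[the cat sat]]
larger_corpus.each do |sent|
  puts "matched: #{sent.join(' ')}" if index[sentence_key(sent)]
end
```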