shalmaneser-fred 1.2.0.rc4

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: e1795de4d92cea5dee25e6840fc1080161aa1d6e
+   data.tar.gz: 8933ad415fc12fef76184e68e28757b2c6f79ec5
+ SHA512:
+   metadata.gz: 7efd1551dc7e902b2fed0dd717f9eb0b9ac7aa2c010ab2bf91472934f612c066b254175f7feac7d885f8953a8979872203c8f0d6eb040253949aea0090b98eb6
+   data.tar.gz: 4b46a404e0400483233cb196b3f2a41759db2c98936062a86a097bd404a7759884b4046b5f62f6223ad56d7c62599204238fdcf8e4852f2df59f091faf776822
@@ -0,0 +1,10 @@
+ --private
+ --protected
+ --title 'SHALMANESER'
+ lib/**/*.rb
+ bin/**/*
+ doc/**/*.md
+ -
+ CHANGELOG.md
+ LICENSE.md
+ doc/index.md
@@ -0,0 +1,4 @@
+ # Versions
+
+ ## Version 1.2.0-rc1
+
@@ -0,0 +1,4 @@
+ # LICENSE
+
+ This software is written in Ruby and is released under the [GNU General Public License](http://www.gnu.org/licenses/gpl-2.0.html) (GPL v2); the documentation is released under the [GNU Free Documentation License](http://www.gnu.org/licenses/old-licenses/fdl-1.2.html) (FDL v1.2).
+
@@ -0,0 +1,93 @@
+ # [SHALMANESER - a SHALlow seMANtic parSER](http://www.coli.uni-saarland.de/projects/salsa/shal/)
+
+ [RubyGems](http://rubygems.org/gems/shalmaneser) |
+ [Shalmaneser's Project Page](http://bu.chsta.be/projects/shalmaneser/) |
+ [Source Code](https://github.com/arbox/shalmaneser) |
+ [Bug Tracker](https://github.com/arbox/shalmaneser/issues)
+
+
+ [![Gem Version](https://img.shields.io/gem/v/shalmaneser.svg)](https://rubygems.org/gems/shalmaneser)
+ [![Gem Version](https://img.shields.io/gem/v/frprep.svg)](https://rubygems.org/gems/frprep)
+ [![Gem Version](https://img.shields.io/gem/v/fred.svg)](https://rubygems.org/gems/fred)
+ [![Gem Version](https://img.shields.io/gem/v/rosy.svg)](https://rubygems.org/gems/rosy)
+
+
+ [![License GPL 2](http://img.shields.io/badge/License-GPL%202-green.svg)](http://www.gnu.org/licenses/gpl-2.0.txt)
+ [![Build Status](https://img.shields.io/travis/arbox/shalmaneser.svg?branch=1.2)](https://travis-ci.org/arbox/shalmaneser)
+ [![Code Climate](https://img.shields.io/codeclimate/github/arbox/shalmaneser.svg)](https://codeclimate.com/github/arbox/shalmaneser)
+ [![Dependency Status](https://img.shields.io/gemnasium/arbox/shalmaneser.svg)](https://gemnasium.com/arbox/shalmaneser)
+
+ ## Description
+
+ Please be careful, the whole thing is under construction! For now Shalmaneser is not intended to run on Windows systems, since it relies heavily on system calls for external invocations.
+ Current versions of Shalmaneser have been tested on Linux only (testers on other *NIX systems are welcome!).
+
+ Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. This technique is often called SRL (Semantic Role Labelling). The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaptation, Shalmaneser should be usable for other paradigms (e.g. PropBank roles) as well. Shalmaneser caters both to end users and to researchers.
+
+ For end users, we provide an end user mode which simply applies the pre-trained classifiers
+ for [English](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (FrameNet 1.3 annotation / Collins parser)
+ and [German](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (SALSA 1.0 annotation / Sleepy parser).
+
+ We'll try to provide newer pre-trained models for English, German, and possibly other languages as soon as possible.
+
+ For researchers interested in investigating shallow semantic parsing, our system is extensively configurable and extensible.
+
+ ## Origin
+
+ The original version of Shalmaneser was written by Sebastian Padó, Katrin Erk and others during their work in the SALSA Project.
+
+ You can find the original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
+
+ ## Publications on Shalmaneser
+
+ - K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
+ - TODO: add other works
+
+ ## Documentation
+
+ The project documentation can be found in our [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md) folder.
+
+ ## Development
+
+ We are currently working on the following branches:
+
+ - ``dev`` - our development branch incorporating actual changes, for now pointing to ``1.2``;
+
+ - ``1.2`` - intermediate target;
+
+ - ``2.0`` - final target.
+
+ ## Installation
+
+ See the installation instructions in the [doc](https://github.com/arbox/shalmaneser/blob/1.2/doc/index.md#installation) folder.
+
+ ### Tokenizers
+
+ - [Ucto](http://ilk.uvt.nl/ucto/)
+
+ ### POS Taggers
+
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
+
+ ### Lemmatizers
+
+ - [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
+
+ ### Parsers
+
+ - [BerkeleyParser](https://code.google.com/p/berkeleyparser/downloads/list)
+ - [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
+ - [Collins Parser](http://www.cs.columbia.edu/~mcollins/code.html)
+
+ ### Machine Learning Systems
+
+ - [OpenNLP MaxEnt](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/)
+ - [Mallet](http://mallet.cs.umass.edu/index.php)
+
+ ## License
+
+ See the `LICENSE` file.
+
+ ## Contributing
+
+ See the `CONTRIBUTING` file.
@@ -0,0 +1,16 @@
+ #!/usr/bin/env ruby
+ # -*- encoding: utf-8 -*-
+
+ # @author Andrei Beliankou, 2011-11-13
+ # @author Katrin Erk, April 05
+ #
+ # Frame disambiguation system:
+ # frame assignment as word sense disambiguation
+
+ require 'fred/opt_parser'
+ require 'fred/fred'
+
+ options = Fred::OptParser.parse(ARGV)
+
+ fred = Fred::Fred.new(options)
+ fred.assign
@@ -0,0 +1,150 @@
+ # Baseline
+ # Katrin Erk April 05
+ #
+ # baseline for WSD:
+ # always assign most frequent sense
+ # The baseline doesn't do binary classifiers.
+
+ require "fred/FredConventions"
+ require "fred/FredSplitPkg"
+ require "fred/FredFeatures"
+ require "fred/FredDetermineTargets"
+
+ class Baseline
+   ###
+   # new
+   #
+   # get splitlog dir (if any) along with everything else,
+   # because we are only evaluating the training data
+   # at test time
+   #
+   def initialize(exp,            # FredConfigData object
+                  split_id = nil) # string: split ID
+     @exp = exp
+     @split_id = split_id
+
+     # for each lemma: remember prevalent sense
+     @lemma_to_sense = Hash.new()
+
+     if @split_id
+       split_obj = FredSplitPkg.new(@exp)
+     end
+
+     lemma_done = Hash.new()
+
+     # iterate through lemmas
+     @target_obj = Targets.new(@exp, nil, "r")
+     unless @target_obj.targets_okay
+       # error during initialization
+       $stderr.puts "Error: Could not read list of known targets, bailing out."
+       exit 1
+     end
+
+     @target_obj.get_lemmas().each { |lemmapos|
+       if @split_id
+         # read training split of answer keys
+         answer_obj = AnswerKeyAccess.new(@exp, "train", lemmapos, "r", @split_id, "train")
+       else
+         # read full answer key file of training data
+         answer_obj = AnswerKeyAccess.new(@exp, "train", lemmapos, "r")
+       end
+
+       count_senses = Hash.new(0)
+
+       answer_obj.each { |lemma, pos, ids, sid, senses_all, senses_this|
+         # senses_this may include more than one sense for multi-label assignment
+         senses_this.each { |sense|
+           count_senses[sense] += 1
+         }
+       }
+
+       # remember the sense with the highest count for this lemma
+       @lemma_to_sense[lemmapos] = count_senses.keys().max { |a, b|
+         count_senses[a] <=> count_senses[b]
+       }
+     }
+
+     @lemma = nil
+   end
+
+   ###
+   def train(infilename)
+     # no training here
+   end
+
+   ###
+   def write(classifier_file)
+     # no classifiers to write
+   end
+
+   def exists?(classifier_file)
+     return true
+   end
+
+   def read(classifier_file)
+     values = deconstruct_fred_classifier_filename(File.basename(classifier_file))
+     @lemma = values["lemma"]
+     if @lemma
+       return true
+     else
+       $stderr.puts "Warning: couldn't determine lemma name in #{classifier_file}, skipping"
+       return false
+     end
+   end
+
+   def read_resultfile(filename)
+     retv = Array.new()
+     begin
+       f = File.new(filename)
+     rescue
+       raise "Could not read baseline result file #{filename}"
+     end
+
+     f.each { |line|
+       retv << [[ line.chomp(), 1.0 ]]
+     }
+
+     return retv
+   end
+
+   def apply(infilename, outfilename)
+     # open input and output files
+     begin
+       out_f = File.new(outfilename, "w")
+     rescue
+       $stderr.puts "Error: cannot write to classification output file #{outfilename}."
+       exit 1
+     end
+     begin
+       f = File.new(infilename)
+     rescue
+       $stderr.puts "Error: cannot read feature file #{infilename}."
+       exit 1
+     end
+
+     # deconstruct input filename to determine the lemma
+     unless @lemma
+       # something went wrong in read()
+       return false
+     end
+
+     # do we have a sense for this?
+     unless (sense = @lemma_to_sense[@lemma])
+       # nope: assign "NONE" (or whatever the null label is here)
+       sense = @exp.get("negsense")
+       unless sense
+         sense = "NONE"
+       end
+     end
+
+     f.each { |line|
+       out_f.puts sense
+     }
+     out_f.close()
+     f.close()
+
+     return true
+   end
+ end
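The class above implements the most-frequent-sense (MFS) baseline on top of Fred's answer-key machinery: count how often each sense occurs for a lemma in the training answer keys, then always predict the winner. A minimal self-contained sketch of the same technique (the training pairs here are invented for illustration, not part of the gem):

```ruby
# Most-frequent-sense baseline in miniature: tally senses per lemma,
# then always predict the majority sense for that lemma.
training = [
  ["bank", "FINANCIAL"], ["bank", "RIVERSIDE"], ["bank", "FINANCIAL"],
  ["bass", "FISH"], ["bass", "MUSIC"], ["bass", "MUSIC"]
]

counts = Hash.new { |h, lemma| h[lemma] = Hash.new(0) }
training.each { |lemma, sense| counts[lemma][sense] += 1 }

mfs = counts.map { |lemma, senses|
  [lemma, senses.max_by { |_sense, n| n }.first]
}.to_h

puts mfs["bank"] # => "FINANCIAL"
puts mfs["bass"] # => "MUSIC"
```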
@@ -0,0 +1,31 @@
+ class FileZipped
+
+   def FileZipped.new(filename,
+                      mode = "r")
+
+     # backslash-escape characters in the filename that
+     # would make the shell hiccup on the command line
+     filename = filename.gsub(/([();:!?'`])/, 'XXSLASHXX\1')
+     filename = filename.gsub(/XXSLASHXX/, "\\")
+
+     begin
+       case mode
+       when "r"
+         unless File.exist? filename
+           raise "catchme"
+         end
+         return IO.popen("gunzip -c #{filename}")
+       when "w"
+         return IO.popen("gzip > #{filename}", "w")
+       when "a"
+         return IO.popen("gzip >> #{filename}", "w")
+       else
+         $stderr.puts "FileZipped error: only modes r, w, a are implemented. I got: #{mode}."
+         exit 1
+       end
+     rescue
+       raise "Error opening file #{filename}."
+     end
+   end
+
+ end
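FileZipped shells out to gzip and gunzip, which is why the filename has to be shell-escaped first. For comparison, a sketch of the same read/write functionality built on Ruby's standard Zlib library, which avoids the shell round-trip entirely (not how the gem does it; just an illustration):

```ruby
require 'zlib'

# Write a gzipped file without spawning a shell process.
Zlib::GzipWriter.open('example.txt.gz') do |gz|
  gz.puts 'first line'
  gz.puts 'second line'
end

# Read it back line by line.
Zlib::GzipReader.open('example.txt.gz') do |gz|
  gz.each_line { |line| puts line }
end
```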
@@ -0,0 +1,877 @@
+ require "tempfile"
+ require 'fileutils'
+
+ require "common/RegXML"
+ require "common/SynInterfaces"
+ require "common/TabFormat"
+ require "common/SalsaTigerRegXML"
+ require "common/SalsaTigerXMLHelper"
+ require "common/RosyConventions"
+
+ require 'fred/md5'
+ require "fred/fred_config_data"
+ require "fred/FredConventions"
+ require "fred/FredDetermineTargets"
+
+ require 'db/db_interface'
+ require 'db/sql_query'
+
+ ########################################
+ # Context Provider classes:
+ # read in text, collecting context windows of a given size
+ # around target words, yield contexts as soon as they are complete
+ #
+ # Target words are determined by delegating to either TargetsFromFrames or AllTargets
+ #
+ class AbstractContextProvider
+
+   include WordLemmaPosNe
+
+   ################
+   def initialize(window_size,        # int: size of the context window (one-sided)
+                  exp,                # experiment file object
+                  interpreter_class,  # SynInterpreter class
+                  target_obj,         # AbstractTargetDeterminer object
+                  dataset)            # "train", "test"
+
+     @window_size = window_size
+     @exp = exp
+     @interpreter_class = interpreter_class
+     @target_obj = target_obj
+     @dataset = dataset
+
+     # make arrays:
+     # context words
+     @context = Array.new(2 * @window_size + 1, nil)
+     # nil for non-targets, all information on the target for targets
+     @is_target = Array.new(2 * @window_size + 1, nil)
+     # sentence object
+     @sentence = Array.new(2 * @window_size + 1, nil)
+   end
+
+   ###################
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     raise "overwrite me"
+   end
+
+   ####################
+   protected
+
+   ############################
+   # shift a sentence through the @context window,
+   # yield when at target
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window_for_sent(sent) # SalsaTigerSentence object or TabSentence object
+     if sent.kind_of? SalsaTigerSentence
+       each_window_for_stsent(sent) { |result| yield result }
+     elsif sent.kind_of? TabFormatSentence
+       each_window_for_tabsent(sent) { |result| yield result }
+     else
+       $stderr.puts "Error: got #{sent.class()}, expected SalsaTigerSentence or TabFormatSentence."
+       exit 1
+     end
+   end
+
+   ###
+   # sent is a SalsaTigerSentence object:
+   # there may be targets
+   #
+   # yields tuples of:
+   # - a context, an array of tuples [word, lemma, pos, ne]
+   #   string/nil*string/nil*string/nil*string/nil
+   # - ID of main target: string
+   # - target_IDs: array:string, list of IDs of target words
+   # - senses: array:string, the senses for the target
+   # - sent: SalsaTigerSentence object
+   def each_window_for_stsent(sent)
+     # determine targets first.
+     # original targets:
+     # hash: target_IDs -> list of senses
+     # where target_IDs is a pair [list of terminal IDs, main terminal ID]
+     #
+     # where a sense is represented as a hash:
+     # "sense": sense, a string
+     # "obj": FrameNode object
+     # "all_targets": list of node IDs, may comprise more than a single node
+     # "lex": lemma, or multiword expression in canonical form
+     # "sid": sentence ID
+     original_targets = @target_obj.determine_targets(sent)
+
+     # reencode, make hashes:
+     # main target ID -> list of senses,
+     # main target ID -> all target IDs
+     maintarget_to_senses = Hash.new()
+     main_to_all_targets = Hash.new()
+     original_targets.each_key { |alltargets, maintarget|
+       main_to_all_targets[maintarget] = alltargets
+       maintarget_to_senses[maintarget] = original_targets[[alltargets, maintarget]]
+     }
+
+     # then shift each terminal into the context window
+     # and check whether there is a target at the center position
+     sent_terminals_nopunct(sent).each { |term_obj|
+       # add the new word to the end of the context array
+       @context.push(word_lemma_pos_ne(term_obj, @interpreter_class))
+
+       if maintarget_to_senses.has_key? term_obj.id()
+         @is_target.push([ term_obj.id(),
+                           main_to_all_targets[term_obj.id()],
+                           maintarget_to_senses[term_obj.id()]
+                         ])
+       else
+         @is_target.push(nil)
+       end
+
+       @sentence.push(sent)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # - a context, an array of tuples [word, lemma, pos, ne]
+         #   string/nil*string/nil*string/nil*string/nil
+         # - ID of main target: string
+         # - target_IDs: array:string, list of IDs of target words
+         # - senses: array:string, the senses for the target
+         # - sent: SalsaTigerSentence object
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     }
+   end
+
+   ###
+   # sent is a TabFormatSentence object.
+   # shift word/lemma/pos/ne tuples through the context window.
+   # Whenever this brings a target (from another sentence, necessarily)
+   # to the center of the context window, yield it.
+   def each_window_for_tabsent(sent)
+     sent.each_line_parsed() { |line_obj|
+       # push onto the context array:
+       # [word, lemma, pos, ne], all lowercase
+       @context.push([ line_obj.get("word").downcase(),
+                       line_obj.get("lemma").downcase(),
+                       line_obj.get("pos").downcase(),
+                       nil ])
+       @is_target.push(nil)
+       @sentence.push(nil)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # context window, main target ID, all target IDs,
+         # senses (as FrameNode objects), sentence as XML
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     }
+   end
+
+   ############################
+   # each remaining target:
+   # call this to empty the context window after everything has been shifted in
+   def each_remaining_target()
+     while @context.detect { |entry| not(entry.nil?) }
+       # push nil onto the context array
+       @context.push(nil)
+       @is_target.push(nil)
+       @sentence.push(nil)
+
+       # remove the first word from the context array
+       @context.shift()
+       @is_target.shift()
+       @sentence.shift()
+
+       # check for a target at the center
+       if @is_target[@window_size]
+         # yes, we have a target at the center position.
+         # yield it:
+         # context window, main target ID, all target IDs,
+         # senses (as FrameNode objects), sentence as XML
+         main_target_id, all_target_ids, senses = @is_target[@window_size]
+         yield [ @context,
+                 main_target_id, all_target_ids,
+                 senses,
+                 @sentence[@window_size]
+               ]
+       end
+     end
+   end
+
+   ############################
+   # helper: remove punctuation
+   def sent_terminals_nopunct(sent)
+     return sent.terminals_sorted.reject { |node|
+       @interpreter_class.category(node) == "pun"
+     }
+   end
+ end
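The window logic above is a fixed-size ring buffer: each incoming token is pushed on the right, the oldest entry is dropped on the left, and the slot at index `@window_size` is inspected after every shift; trailing nils flush the last targets out. A stripped-down sketch of the same idea on plain token arrays (a hypothetical helper, independent of the Salsa/Tiger machinery):

```ruby
# Slide a (2 * window + 1)-slot buffer over a token stream and report
# each token together with its left and right context.
def each_center_with_context(tokens, window)
  buffer = Array.new(2 * window + 1, nil)
  # pad with nils so the final tokens also reach the center slot
  (tokens + [nil] * window).each do |token|
    buffer.push(token)
    buffer.shift
    center = buffer[window]
    next unless center
    yield center, buffer[0...window].compact, buffer[window + 1..-1].compact
  end
end

each_center_with_context(%w[the cat sat on the mat], 2) do |center, left, right|
  puts "#{center}: [#{left.join(' ')}] _ [#{right.join(' ')}]"
end
```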
+
+ ####################################
+ # ContextProvider:
+ # subclass of AbstractContextProvider
+ # that assumes that the input text is a contiguous text
+ # and computes the context accordingly.
+ class ContextProvider < AbstractContextProvider
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     # iterate through files in the directory.
+     # Try sorting filenames numerically, since this is
+     # what frprep mostly does with filenames
+     Dir[dir + "*.xml"].sort { |a, b|
+       File.basename(a, ".xml").to_i() <=> File.basename(b, ".xml").to_i()
+     }.each { |filename|
+       # progress bar
+       if @exp.get("verbose")
+         $stderr.puts "Featurizing #{File.basename(filename)}"
+       end
+       f = FilePartsParser.new(filename)
+       each_window_for_file(f) { |result|
+         yield result
+       }
+     }
+     # and empty the context array
+     each_remaining_target() { |result| yield result }
+   end
+
+   ##################################
+   protected
+
+   ######################
+   # each_window_for_file: iterator
+   # same as each_window, but only for a single file
+   # (to be called from each_window())
+   def each_window_for_file(fpp) # FilePartsParser object: Salsa/Tiger XML data
+     fpp.scan_s() { |sent_string|
+       sent = SalsaTigerSentence.new(sent_string)
+       each_window_for_sent(sent) { |result| yield result }
+     }
+   end
+ end
+
+ ####################################
+ # SingleSentContextProvider:
+ # subclass of AbstractContextProvider
+ # that assumes that each sentence of the input text
+ # stands on its own
+ class SingleSentContextProvider < AbstractContextProvider
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data,
+   # yielding each target word as soon as its context window is filled
+   # (or the last file is at an end)
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+     # iterate through files in the directory.
+     # Try sorting filenames numerically, since this is
+     # what frprep mostly does with filenames
+     Dir[dir + "*.xml"].sort { |a, b|
+       File.basename(a, ".xml").to_i() <=> File.basename(b, ".xml").to_i()
+     }.each { |filename|
+       # progress bar
+       if @exp.get("verbose")
+         $stderr.puts "Featurizing #{File.basename(filename)}"
+       end
+       f = FilePartsParser.new(filename)
+       each_window_for_file(f) { |result|
+         yield result
+       }
+     }
+   end
+
+   ##################################
+   protected
+
+   ######################
+   # each_window_for_file: iterator
+   # same as each_window, but only for a single file
+   # (to be called from each_window())
+   def each_window_for_file(fpp) # FilePartsParser object: Salsa/Tiger XML data
+     fpp.scan_s() { |sent_string|
+       sent = SalsaTigerSentence.new(sent_string)
+       each_window_for_sent(sent) { |result|
+         yield result
+       }
+     }
+     # no need to clear the context: we're doing this after each sentence
+   end
+
+   ###
+   # each_window_for_sent: empty the context after each sentence
+   def each_window_for_sent(sent)
+     if sent.kind_of? SalsaTigerSentence
+       each_window_for_stsent(sent) { |result| yield result }
+     elsif sent.kind_of? TabFormatSentence
+       each_window_for_tabsent(sent) { |result| yield result }
+     else
+       $stderr.puts "Error: got #{sent.class()}, expected SalsaTigerSentence or TabFormatSentence."
+       exit 1
+     end
+
+     # clear the context
+     each_remaining_target() { |result| yield result }
+   end
+ end
+
+
+ ####################################
+ # NoncontiguousContextProvider:
+ # subclass of AbstractContextProvider
+ #
+ # This class assumes that the input text consists of single sentences
+ # drawn from a larger corpus.
+ # It first constructs an index to the sentences of the input text,
+ # then reads the larger corpus.
+ class NoncontiguousContextProvider < AbstractContextProvider
+
+   ###
+   # each_window: iterator
+   #
+   # given a directory with Salsa/Tiger XML data,
+   # iterate through the data and construct an index to the sentences.
+   #
+   # Then iterate through the larger corpus,
+   # yielding contexts.
+   def each_window(dir) # string: directory containing Salsa/Tiger XML data
+
+     # @todo AB: Move this chunk to OptionParser.
+     # sanity check: do we know where the larger corpus is?
+     unless @exp.get("larger_corpus_dir")
+       $stderr.puts "Error: 'noncontiguous_input' has been set in the experiment file"
+       $stderr.puts "but no location for the larger corpus has been given."
+       $stderr.puts "Please set 'larger_corpus_dir' in the experiment file"
+       $stderr.puts "to indicate the larger corpus from which the input corpus sentences are drawn."
+       exit 1
+     end
+
+     ##
+     # remember all sentences from the main corpus
+     temptable_obj, sentkeys = make_index(dir)
+
+     ##
+     # make a frprep experiment file
+     # for lemmatization and POS-tagging of the larger corpus files
+     tf_exp_frprep = Tempfile.new("fred_bow_context")
+     frprep_in, frprep_out, frprep_dir = write_frprep_experiment_file(tf_exp_frprep)
+
+     ##
+     # Iterate through the files of the larger corpus,
+     # check for each sentence whether it is also in the input corpus,
+     # and yield it if it is.
+     # The larger corpus may contain subdirectories.
+     initialize_match_check()
+
+     each_infile(@exp.get("larger_corpus_dir")) { |filename|
+       $stderr.puts "Larger corpus: reading #{filename}"
+
+       # remove previous data from the temp directories
+       remove_files(frprep_in)
+       remove_files(frprep_out)
+       remove_files(frprep_dir)
+
+       # link the input file to the input directory for frprep
+       File.symlink(filename, frprep_in + "infile")
+
+       # call frprep
+       # AB: Bad hack, find a way to invoke FrPrep directly.
+       # We will need an FrPrep instance and an options object.
+       base_dir_path = File.expand_path(File.dirname(__FILE__) + '/../..')
+
+       # @todo AB: Remove this after debugging.
+       FileUtils.cp(tf_exp_frprep.path, '/tmp/frprep.exp')
+
+       retv = system("ruby -rubygems -I #{base_dir_path}/lib #{base_dir_path}/bin/frprep -e #{tf_exp_frprep.path}")
+
+       unless retv
+         $stderr.puts "Error analyzing #{filename}. Exiting."
+         exit 1
+       end
+
+       # read the resulting Tab format file, one sentence at a time:
+       # - check to see if the checksum of the sentence is in sentkeys
+       #   (which means it is an input sentence).
+       #   If it is, retrieve the sentence and determine targets
+       # - shift the sentence through the context window
+       # - whenever a target word comes to be in the center of the context window,
+       #   yield.
+       $stderr.puts "Computing context features from frprep output."
+       Dir[frprep_out + "*.tab"].each { |tabfilename|
+         tabfile = FNTabFormatFile.new(tabfilename, ".pos", ".lemma")
+         tabfile.each_sentence() { |tabsent|
+           # get as a Salsa/Tiger XML sentence, or a TabSentence
+           sent = get_stxml_sent(tabsent, sentkeys, temptable_obj)
+
+           # shift the sentence through the context window
+           each_window_for_sent(sent) { |result|
+             yield result
+           }
+         } # each tab sent
+       } # each tab file
+     } # each infile from the larger corpus
+
+     # empty the context array
+     each_remaining_target() { |result| yield result }
+     each_unmatched(sentkeys, temptable_obj) { |result| yield result }
+
+     # remove temporary data
+     temptable_obj.drop_temp_table()
+
+     # @todo AB: TODO Rewrite this passage using pure Ruby.
+     %x{rm -rf #{frprep_in}}
+     %x{rm -rf #{frprep_out}}
+     %x{rm -rf #{frprep_dir}}
+   end
+
+   ##################################
+   private
+
+   ###
+   # for each sentence of each file in the given directory:
+   # remember the sentence in a temporary DB,
+   # indexed by a hash key computed from the plaintext sentence.
+   #
+   # return:
+   # - DBTempTable object containing the temporary DB
+   # - hash table containing all hash keys
+   def make_index(dir)
+
+     # AB: Why these limits? Use constants!
+     space_for_sentstring = 30000
+     space_for_hashkey = 500
+
+     $stderr.puts "Indexing input corpus:"
+
+     # start a temporary table
+     temptable_obj = get_db_interface(@exp).make_temp_table(
+       [["hashkey", "varchar(#{space_for_hashkey})"],
+        ["sent", "varchar(#{space_for_sentstring})"]],
+       ["hashkey"],
+       "autoinc_index")
+
+     # and a hash table for the keys
+     retv_keys = Hash.new()
+
+     # iterate through files in the directory,
+     # make an index for each sentence, and store
+     # the sentence under that index
+     Dir[dir + "*.xml"].each { |filename|
+       $stderr.puts "\t#{filename}"
+       f = FilePartsParser.new(filename)
+       f.scan_s() { |sent_string|
+
+         xml_obj = RegXML.new(sent_string)
+
+         # make a hash key from the words of the sentence
+         graph = xml_obj.children_and_text().detect { |c| c.name() == "graph" }
+         unless graph
+           next
+         end
+         terminals = graph.children_and_text().detect { |c| c.name() == "terminals" }
+         unless terminals
+           next
+         end
+         # in making a hash key, use special characters
+         # rather than their escaped &..; form
+         # $stderr.puts "HIER calling checksum for noncontig"
+         hashkey = checksum(terminals.children_and_text().select { |c|
+                              c.name() == "t"
+                            }.map { |t|
+                              SalsaTigerXMLHelper.unescape(t.attributes()["word"].to_s())
+                            })
+         # HIER
+         # $stderr.puts "HIER " + terminals.children_and_text().select { |c| c.name() == "t"
+         # }.map { |t| t.attributes()["word"].to_s() }.join(" ")
+
+         # sanity check: if the sentence is longer than
+         # the space currently allotted to sentence strings,
+         # we won't be able to recover it.
+         if SQLQuery.stringify_value(hashkey).length() > space_for_hashkey
+           $stderr.puts "Warning: sentence checksum too long, cannot store it."
+           $stderr.print "Max length: #{space_for_hashkey}. "
+           $stderr.puts "Required: #{SQLQuery.stringify_value(hashkey).length()}."
+           $stderr.puts "Skipping."
+           next
+         end
+
+         if SQLQuery.stringify_value(sent_string).length() > space_for_sentstring
+           $stderr.puts "Warning: sentence too long, cannot store it."
+           $stderr.print "Max length: #{space_for_sentstring}. "
+           $stderr.puts "Required: #{SQLQuery.stringify_value(sent_string).length()}."
+           $stderr.puts "Skipping."
+           next
+         end
+
+         # store
+         temptable_obj.query_noretv(SQLQuery.insert(temptable_obj.table_name,
+                                                    [["hashkey", hashkey],
+                                                     ["sent", sent_string]]))
+         retv_keys[hashkey] = true
+       }
+     }
+     $stderr.puts "Indexing finished."
+
+     return [ temptable_obj, retv_keys ]
+   end
+
+   ######
+   # compute a checksum from the given sentence,
+   # and return it as a string
+   def checksum(words) # array: string
+     string = ""
+
+     # HIER removed sort() after downcase
+     words.map { |w| w.to_s.downcase }.each { |w|
+       string << w.gsub(/[^a-z]/, "")
+     }
+     return MD5.new(string).hexdigest
+   end
+
+   #####
+   # yield each file of the given directory
+   # or one of its subdirectories
+   def each_infile(indir)
+     unless indir =~ /\/$/
+       indir = indir + "/"
+     end
+
+     Dir[indir + "*"].each { |filename|
+       if File.file?(filename)
+         yield filename
+       end
+     }
+
+     # enter recursion
+     Dir[indir + "**"].each { |subdir|
+       # same directory we had before? don't redo
+       if indir == subdir
+         next
+       end
+
+       begin
+         unless File.stat(subdir).directory?
+           next
+         end
+       rescue
+         # no access, I assume
+         next
+       end
+
+       each_infile(subdir) { |inf|
+         yield inf
+       }
+     }
+   end
+
+   ###
+   # remove files: remove all files and subdirectories in the given directory
+   def remove_files(indir)
+     Dir[indir + "*"].each { |filename|
+       if File.file?(filename) or File.symlink?(filename)
+         retv = File.delete(filename)
+       end
+     }
+
+     # enter recursion
+     Dir[indir + "**"].each { |subdir|
+       # same directory we had before? don't redo
+       if indir == subdir
+         next
+       end
+
+       begin
+         unless File.stat(subdir).directory?
+           next
+         end
+       rescue
+         # no access, I assume
+         next
+       end
+
+       # subdir must end in slash
+       unless subdir =~ /\/$/
+         subdir = subdir + "/"
+       end
+       # and enter recursion
+       remove_files(subdir)
+       FileUtils.rm_f(subdir)
+     }
+   end
+
+   def write_frprep_experiment_file(tf_exp_frprep) # Tempfile object
+
+     # make a unique experiment ID
+     experiment_id = "larger_corpus"
+     # input and output directories for frprep
+     frprep_in = fred_dirname(@exp, "temp", "in", "new")
+     frprep_out = fred_dirname(@exp, "temp", "out", "new")
+     frprep_dir = fred_dirname(@exp, "temp", "frprep", "new")
+
+     # write the file:
+
+     # experiment ID and directories
+     tf_exp_frprep.puts "prep_experiment_ID = #{experiment_id}"
+     tf_exp_frprep.puts "directory_input = #{frprep_in}"
+     tf_exp_frprep.puts "directory_preprocessed = #{frprep_out}"
+     tf_exp_frprep.puts "frprep_directory = #{frprep_dir}"
+
+     # output format: tab
+     tf_exp_frprep.puts "tabformat_output = true"
+
+     # corpus description: language, format, encoding
+     if @exp.get("language")
+       tf_exp_frprep.puts "language = #{@exp.get("language")}"
+     end
+     if @exp.get("larger_corpus_format")
+       tf_exp_frprep.puts "format = #{@exp.get("larger_corpus_format")}"
+     elsif @exp.get("format")
+       $stderr.puts "Warning: 'larger_corpus_format' not set in experiment file,"
+       $stderr.puts "using 'format' setting of frprep experiment file instead."
+       tf_exp_frprep.puts "format = #{@exp.get("format")}"
+     else
+       $stderr.puts "Warning: 'larger_corpus_format' not set in experiment file,"
+       $stderr.puts "relying on default setting."
+     end
+     if @exp.get("larger_corpus_encoding")
+       tf_exp_frprep.puts "encoding = #{@exp.get("larger_corpus_encoding")}"
+     elsif @exp.get("encoding")
+       $stderr.puts "Warning: 'larger_corpus_encoding' not set in experiment file,"
+       $stderr.puts "using 'encoding' setting of frprep experiment file instead."
+       tf_exp_frprep.puts "encoding = #{@exp.get("encoding")}"
+     else
+       $stderr.puts "Warning: 'larger_corpus_encoding' not set in experiment file,"
+       $stderr.puts "relying on default setting."
+     end
+
+     # processing: lemmatization, POS tagging, no parsing
+     tf_exp_frprep.puts "do_lemmatize = true"
+     tf_exp_frprep.puts "do_postag = true"
+     tf_exp_frprep.puts "do_parse = false"
+
+     # lemmatizer and POS tagger settings:
+     # take them verbatim from the frprep file
+     begin
+       f = File.new(@exp.get("preproc_descr_file_" + @dataset))
+     rescue
+       $stderr.puts "Error: could not read frprep experiment file #{@exp.get("preproc_descr_file_" + @dataset)}"
+       exit 1
+     end
+     f.each { |line|
+       if line =~ /pos_tagger\s*=/ or
+          line =~ /pos_tagger_path\s*=/ or
+          line =~ /lemmatizer\s*=/ or
+          line =~ /lemmatizer_path\s*=/
+         tf_exp_frprep.puts line
+       end
+     }
+     # finalize the frprep experiment file
+     tf_exp_frprep.close()
+
+     return [frprep_in, frprep_out, frprep_dir]
+   end
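For reference, the experiment file that this method writes is a plain list of `key = value` lines. With default settings it would look roughly like the following (directory values abbreviated; the `language`, `format` and `encoding` lines appear only when the corresponding settings exist, and the tagger/lemmatizer lines are copied verbatim from the frprep experiment file):

```
prep_experiment_ID = larger_corpus
directory_input = <frprep_in>
directory_preprocessed = <frprep_out>
frprep_directory = <frprep_dir>
tabformat_output = true
do_lemmatize = true
do_postag = true
do_parse = false
```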
+
+   ####
+   # get SalsaTigerXML sentence and targets:
+   #
+   # given a Tab format sentence:
+   # - check whether it is in the table of input sentences.
+   #   if so, retrieve it.
+   # - otherwise, fashion a makeshift SalsaTigerSentence object
+   #   from the words, lemmas and POS
+   def get_stxml_sent(tabsent,
+                      sentkeys,
+                      temptable_obj)
+
+     # SalsaTigerSentence object
+     sent = nil
+
+     # make checksum
+     words = Array.new()
+     words2 = Array.new()
+     tabsent.each_line_parsed { |line_obj|
+       words << SalsaTigerXMLHelper.unescape(line_obj.get("word"))
+       words2 << line_obj.get("word")
+     }
+     # $stderr.puts "HIER calling checksum from larger corpus"
+     hashkey_this_sentence = checksum(words)
+
+     # HIER
+     # $stderr.puts "HIER2 " + words.join(" ")
+     # $stderr.puts "HIER3 " + words2.join(" ")
+
+     if sentkeys[hashkey_this_sentence]
+       # sentence from the input corpus.
+
+       # register
+       register_matched(hashkey_this_sentence)
+
+       # select "sent" columns from the temp table
+       # where "hashkey" == sent_checksum;
+       # returns a DBResult object
+       query_result = temptable_obj.query(
+         SQLQuery.select([ SelectTableAndColumns.new(temptable_obj, ["sent"]) ],
+                         [ ValueRestriction.new("hashkey", hashkey_this_sentence) ]))
+       query_result.each { |row|
+         sent_string = SQLQuery.unstringify_value(row.first().to_s())
+         begin
+           sent = SalsaTigerSentence.new(sent_string)
+         rescue
+           $stderr.puts "Error reading Salsa/Tiger XML sentence."
+           $stderr.puts
+           $stderr.puts "SQL-stored sentence was:"
+           $stderr.puts row.first().to_s()
+           $stderr.puts
+           $stderr.puts "==================="
+           $stderr.puts "With restored quotes:"
+           $stderr.puts sent_string
+           exit 1
+         end
+         break
+       }
+       unless sent
+         $stderr.puts "Warning: could not retrieve input corpus sentence: " + words.join(" ")
+       end
+     end
+
+     if sent
+       return sent
+     else
+       return tabsent
+     end
+   end
+
+   ###
+   # Keep track of which sentences from the smaller, noncontiguous corpus
+   # have been matched in the larger corpus
+   def initialize_match_check()
+     @index_matched = Hash.new()
+   end
+
+   ###
+   # Record a sentence from the smaller, noncontiguous corpus
+   # as matched in the larger corpus
+   def register_matched(hash_key)
+     @index_matched[hash_key] = true
+   end
+
+   ###
+   # Call this method after all sentences from the larger corpus
+   # have been checked against the smaller corpus.
+   # This method prints a warning message for each sentence from the smaller corpus
+   # that has not been matched,
+   # and yields it in the same format as each_window(),
+   # such that the unmatched sentences can still be processed,
+   # but without a larger context.
+   def each_unmatched(all_keys,
+                      temptable_obj)
+
+     num_unmatched = 0
+
+     all_keys.each_key { |hash_key|
+       unless @index_matched[hash_key]
+         # unmatched sentence:
+         num_unmatched += 1
+
+         # retrieve
+         query_result = temptable_obj.query(
+           SQLQuery.select([ SelectTableAndColumns.new(temptable_obj, ["sent"]) ],
+                           [ ValueRestriction.new("hashkey", hash_key) ]))
+
+         # report and yield
+         query_result.each { |row|
+           sent_string = SQLQuery.unstringify_value(row.first().to_s())
+           begin
+             # report on the unmatched sentence
+             sent = SalsaTigerSentence.new(sent_string)
+             $stderr.puts "Unmatched sentence from noncontiguous input:\n" +
+                          sent.id().to_s() + " " + sent.to_s()
+
+             # push the sentence through the context window,
+             # filling it up with "nil",
+             # and yield when we reach the target at the center position.
+             each_window_for_stsent(sent) { |result| yield result }
+             each_remaining_target() { |result| yield result }
+           rescue
+             # Couldn't turn it into a SalsaTigerSentence object:
+             # just report, don't yield
+             $stderr.puts "Unmatched sentence from noncontiguous input (raw):\n" +
+                          sent_string
+             $stderr.puts "ERROR: cannot process this sentence, skipping."
+           end
+         }
+       end
+     }
+
+     $stderr.puts "Unmatched sentences: #{num_unmatched} all in all."
+   end
+ end
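The whole matching scheme between the small input corpus and the larger corpus rests on the word-based checksum above: lowercase the words, strip everything but letters, concatenate, and take the MD5 digest, so that minor punctuation and casing differences between the two corpora do not break the match. A self-contained sketch of that idea using the standard library's Digest::MD5 (illustrative only; the gem routes this through its bundled fred/md5 wrapper and a temporary SQL table):

```ruby
require 'digest/md5'

# Normalize a tokenized sentence to a checksum the way the indexing does:
# case and non-letter characters are ignored.
def sentence_key(words)
  Digest::MD5.hexdigest(words.map { |w| w.to_s.downcase.gsub(/[^a-z]/, "") }.join)
end

# Index the sentences of the small corpus...
index = {}
small_corpus = [%w[The cat sat.], %w[A dog barked!]]
small_corpus.each { |sent| index[sentence_key(sent)] = sent }

# ...then recognize them while streaming the larger corpus.
larger_corpus = [%w[something else entirely], %w[the cat sat]]
larger_corpus.each do |sent|
  puts "matched: #{sent.join(' ')}" if index[sentence_key(sent)]
end
```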