shalmaneser 0.0.1.alpha → 1.2.0.rc1

Files changed (76)

checksums.yaml +7 -0
data/.yardopts +2 -2
data/CHANGELOG.md +4 -0
data/LICENSE.md +4 -0
data/README.md +49 -0
data/bin/fred +18 -0
data/bin/frprep +34 -0
data/bin/rosy +17 -0
data/lib/common/AbstractSynInterface.rb +35 -33
data/lib/common/Mallet.rb +236 -0
data/lib/common/Maxent.rb +26 -12
data/lib/common/Parser.rb +5 -5
data/lib/common/SynInterfaces.rb +13 -6
data/lib/common/TabFormat.rb +7 -6
data/lib/common/Tiger.rb +4 -4
data/lib/common/Timbl.rb +144 -0
data/lib/common/{FrprepHelper.rb → frprep_helper.rb} +14 -8
data/lib/common/headz.rb +1 -1
data/lib/common/ruby_class_extensions.rb +3 -3
data/lib/fred/FredBOWContext.rb +14 -2
data/lib/fred/FredDetermineTargets.rb +4 -9
data/lib/fred/FredEval.rb +1 -1
data/lib/fred/FredFeatureExtractors.rb +4 -3
data/lib/fred/FredFeaturize.rb +1 -1
data/lib/frprep/CollinsInterface.rb +6 -6
data/lib/frprep/MiniparInterface.rb +5 -5
data/lib/frprep/SleepyInterface.rb +7 -7
data/lib/frprep/TntInterface.rb +1 -1
data/lib/frprep/TreetaggerInterface.rb +29 -5
data/lib/frprep/do_parses.rb +1 -0
data/lib/frprep/frprep.rb +36 -32
data/lib/{common/BerkeleyInterface.rb → frprep/interfaces/berkeley_interface.rb} +69 -95
data/lib/frprep/interfaces/stanford_interface.rb +353 -0
data/lib/frprep/interpreters/berkeley_interpreter.rb +22 -0
data/lib/frprep/interpreters/stanford_interpreter.rb +22 -0
data/lib/frprep/opt_parser.rb +2 -2
data/lib/rosy/AbstractFeatureAndExternal.rb +5 -3
data/lib/rosy/RosyIterator.rb +11 -10
data/lib/rosy/rosy.rb +1 -0
data/lib/shalmaneser/version.rb +1 -1
data/test/functional/sample_experiment_files/fred_test.salsa.erb +1 -1
data/test/functional/sample_experiment_files/fred_train.salsa.erb +1 -1
data/test/functional/sample_experiment_files/prp_test.salsa.erb +2 -2
data/test/functional/sample_experiment_files/prp_test.salsa.fred.standalone.erb +2 -2
data/test/functional/sample_experiment_files/prp_test.salsa.rosy.standalone.erb +2 -2
data/test/functional/sample_experiment_files/prp_train.salsa.erb +2 -2
data/test/functional/sample_experiment_files/prp_train.salsa.fred.standalone.erb +2 -2
data/test/functional/sample_experiment_files/prp_train.salsa.rosy.standalone.erb +2 -2
data/test/functional/sample_experiment_files/rosy_test.salsa.erb +1 -1
data/test/functional/sample_experiment_files/rosy_train.salsa.erb +7 -7
data/test/functional/test_frprep.rb +3 -3
data/test/functional/test_rosy.rb +20 -0
metadata +215 -224
data/CHANGELOG.rdoc +0 -0
data/LICENSE.rdoc +0 -0
data/README.rdoc +0 -0
data/lib/common/CollinsInterface.rb +0 -1165
data/lib/common/MiniparInterface.rb +0 -1388
data/lib/common/SleepyInterface.rb +0 -384
data/lib/common/TntInterface.rb +0 -44
data/lib/common/TreetaggerInterface.rb +0 -303
data/lib/frprep/AbstractSynInterface.rb +0 -1227
data/lib/frprep/BerkeleyInterface.rb +0 -375
data/lib/frprep/ConfigData.rb +0 -694
data/lib/frprep/FixSynSemMapping.rb +0 -196
data/lib/frprep/FrPrepConfigData.rb +0 -66
data/lib/frprep/FrprepHelper.rb +0 -1324
data/lib/frprep/ISO-8859-1.rb +0 -24
data/lib/frprep/Parser.rb +0 -213
data/lib/frprep/SalsaTigerRegXML.rb +0 -2347
data/lib/frprep/SalsaTigerXMLHelper.rb +0 -99
data/lib/frprep/SynInterfaces.rb +0 -275
data/lib/frprep/TabFormat.rb +0 -720
data/lib/frprep/Tiger.rb +0 -1448
data/lib/frprep/Tree.rb +0 -61
data/lib/frprep/headz.rb +0 -338

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 83f5f0ca7cc27a632cb46deef7c093df649c61e1
+  data.tar.gz: dbc9a29186421206de7bf9b0138f05f89228fad6
+SHA512:
+  metadata.gz: 8a87f1e74b16082cba8d2ab49eb33289e8db23f5bdf3cdd4f294901c8119c8bff1239ec870032871d6d2cf69efbaba500058a47827df92be707aba3ab36ab30a
+  data.tar.gz: be1f6b6f3e4aa0b20f26437f30c579faf68f03f7c474cb78e28cb1263ef4ab9397ab4d52fbdffa4ac7ceb50a2d3f44cb4200303a7f14b2bdd0cb06fbfae68f0f

data/.yardopts CHANGED

@@ -4,5 +4,5 @@
 lib/**/*
 bin/**/*
 -
-CHANGELOG.rdoc
-LICENSE.rdoc
+CHANGELOG.md
+LICENSE.md

data/CHANGELOG.md ADDED

@@ -0,0 +1,4 @@
+# Versions
+## Version 1.2.0-rc1

data/LICENSE.md ADDED

@@ -0,0 +1,4 @@
+# LICENSE
+This software is written in Ruby and is released under the [GNU Public License](http://www.gnu.org/licenses/gpl-2.0.html) (GPL v2), and the documentation under the [Free Document License](http://www.gnu.org/licenses/old-licenses/fdl-1.2.html) (FDL v1.2).

data/README.md ADDED

@@ -0,0 +1,49 @@
+# [SHALMANESER - a SHALlow seMANtic parSER](http://www.coli.uni-saarland.de/projects/salsa/shal/)
+[RubyGems](http://rubygems.org/gems/shalmaneser) | [RTT Project Page](http://bu.chsta.be/projects/shalmaneser/) |
+[Source Code](https://github.com/arbox/shalmaneser) | [Bug Tracker](https://github.com/arbox/shalmaneser/issues)
+[<img src="https://badge.fury.io/rb/shalmaneser.png" alt="Gem Version" />](http://badge.fury.io/rb/shalmaneser)
+[<img src="https://travis-ci.org/arbox/shalmaneser.png" alt="Build Status" />](https://travis-ci.org/arbox/shalmaneser)
+[<img src="https://codeclimate.com/github/arbox/shalmaneser.png" alt="Code Climate" />](https://codeclimate.com/github/arbox/shalmaneser)
+[<img alt="Bitdeli Badge" src="https://d2weczhvl823v0.cloudfront.net/arbox/shalmaneser/trend.png" />](https://bitdeli.com/free)
+## Description
+Please be careful, the whole thing is under construction!
+Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaption, Shalmaneser should be usable for other paradigms (e.g., PropBank roles) as well. Shalmaneser caters both for end users, and for researchers.
+For end users, we provide a simple end user mode which can simply apply the pre-trained classifiers for English (FrameNet annotation / Collins parser) and German (SALSA Frame annotation / Sleepy parser). For researchers interested in investigating shallow semantic parsing, our system is extensively configurable and extendable.
+## Origin
+You can find original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
+## Literature
+K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
+## Documentation
+The project documentation can be found in our [doc](doc/index.md) folder.
+## Development
+We are working now on two branches:
+- ``dev`` - our development branch incorporating actual changes, for now pointing to ``1.2``;
+- ``1.2`` - intermediate target;
+- ``2.0`` - final target.
+## Installation
+See the installation instructions in the [doc](doc/index.md#installation) folder.
+### Machine Learning Systems
+- http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/

data/bin/fred ADDED

@@ -0,0 +1,18 @@
+#!/usr/bin/env ruby
+# -*- encoding: utf-8 -*-
+# AB, 2011-11-13
+# fred
+# Katrin Erk, April 05
+#
+# Frame disambiguation system:
+# frame assignment as word sense disambiguation
+require 'fred/opt_parser'
+require 'fred/fred'
+options = Fred::OptParser.parse(ARGV)
+fred = Fred::Fred.new(options)
+fred.assign

data/bin/frprep ADDED

@@ -0,0 +1,34 @@
+#!/usr/bin/env ruby
+# -*- encoding: utf-8 -*-
+# AB, 2010-11-25
+# frprep
+# Katrin Erk July 05
+#
+# Preprocessing for Fred and Rosy:
+# accept input as plain text,
+# FrameNet XML, Salsa-tabular format,
+# or SalsaTigerXML,
+# lemmatize, POS-tag and parse
+# (if asked to do so)
+# and in any case produce output in
+# SalsaTigerXML.
+#
+# Extensions to SalsaTigerXML introduced by frprep:
+#
+# - "lemma": lemma. Attribute of terminals.
+# - "head":  head word (not lemma!) of constituent.Attribute of nonterminals.
+# - "fn_gf": FrameNet grammatical function label, attached to the maximal
+#   constituents covering the terminals labeled with that label
+require 'frprep/frprep'
+require 'frprep/opt_parser'
+options = FrPrep::OptParser.parse(ARGV)
+preprocessor = FrPrep::FrPrep.new(options)
+preprocessor.transform

data/bin/rosy ADDED

@@ -0,0 +1,17 @@
+#!/usr/bin/env ruby
+# -*- encoding: utf-8 -*-
+# AB: 2011-11-14
+# rosy.rb
+# KE, SP April 05
+#
+# Main file of the Rosy role assignment system.
+require 'rosy/opt_parser'
+require 'rosy/rosy'
+options = Rosy::OptParser.parse(ARGV)
+rosy = Rosy::Rosy.new(options)
+rosy.assign

data/lib/common/AbstractSynInterface.rb CHANGED

@@ -25,10 +25,10 @@
 require "tempfile"
-require "common/ruby_class_extensions"
+require 'common/ruby_class_extensions'
-require "common/ISO-8859-1"
-require "common/Parser"
+require 'common/ISO-8859-1'
+require 'common/Parser'
 require "common/SalsaTigerRegXML"
 require "common/TabFormat"
@@ -42,14 +42,14 @@ class SynInterface
   ###
   # returns a string: the name of the system
   # e.g. "Collins" or "TNT"
-  def SynInterface.system()
+  def self.system
     raise "Overwrite me"
   end
   ###
   # returns a string: the service offered
   # one of "lemmatizer", "parser", "pos tagger"
-  def SynInterface.service()
+  def self.service
     raise "Overwrite me"
   end
@@ -73,10 +73,10 @@ class SynInterface
   def process_dir(in_dir,        # string: name of input directory
 		  out_dir)       # string: name of output directory
-    Dir[in_dir+"*#{@insuffix}"].each {|infilename|
-      outfilename = out_dir + File.basename(infilename, @insuffix) + @outsuffix
-      process_file(infilename,outfilename)
-    }
+    Dir["#{in_dir}*#{@insuffix}"].each do |infilename|
+      outfilename = "#{out_dir}#{File.basename(infilename, @insuffix)}#{@outsuffix}"
+      process_file(infilename, outfilename)
+    end
   end
   ###
@@ -91,13 +91,13 @@ class SynInterface
   ######
   protected
-  def SynInterface.announce_me()
+  def self.announce_me
     if defined?(SynInterfaces)
       # yup, we have a class to which we can announce ourselves
-      SynInterfaces.add_interface(eval(self.name()))
+      SynInterfaces.add_interface(eval(self.name))
     else
       # no interface collector class
-      $stderr.puts "Interface #{self.name()} not announced: no SynInterfaces."
+      STDERR.puts "Interface #{self.name} not announced: no SynInterfaces."
     end
   end
 end
@@ -124,14 +124,13 @@ class SynInterfaceSTXML < SynInterface
   def to_stxml_dir(in_dir,   # string: name of dir with parse files
 		   out_dir)  # string: name of output dir
-    Dir[in_dir+"*#{@outsuffix}"].each { |parsefilename|
-      stxmlfilename = out_dir + File.basename(parsefilename, @outsuffix) + @stsuffix
+    Dir["#{in_dir}*#{@outsuffix}"].each do |parsefilename|
+      stxmlfilename = "#{out_dir}#{File.basename(parsefilename, @outsuffix)}#{@stsuffix}"
       to_stxml_file(parsefilename, stxmlfilename)
-    }
+    end
   end
-  def to_stxml_file(infilename,
-		    outfilename)
+  def to_stxml_file(infilename, outfilename)
     raise "Overwrite me"
   end
@@ -142,22 +141,25 @@ class SynInterfaceSTXML < SynInterface
   # SalsaTigerSentence nodes returned by each_sentence():
   # map the n-th word of the tab sentence to the n-th terminal of
   # the SalsaTigerSentence
-  def SynInterfaceSTXML.standard_mapping(sent, tabsent)
-    retv = Hash.new
+  def self.standard_mapping(sent, tabsent)
+    retv = {}
     if sent.nil?
-	return nil
-    end
-    terminals = sent.terminals_sorted()
-    if tabsent
-      tabsent.each_line_parsed { |l|
-        if (t = terminals[l.get("lineno")])
-          retv[l.get("lineno")] = [t]
-        else
-          retv[l.get("lineno")] = []
+	retv = nil
+    else
+      terminals = sent.terminals_sorted
+      if tabsent
+        tabsent.each_line_parsed do |l|
+          if (t = terminals[l.get("lineno")])
+            retv[l.get("lineno")] = [t]
+          else
+            retv[l.get("lineno")] = []
+          end
         end
-      }
+      end
     end
-    return retv
+    retv
   end
@@ -185,13 +187,13 @@ class SynInterfaceSTXML < SynInterface
     # write Salsa/Tiger XML to tempfile
     tf = Tempfile.new("SynInterface")
-    tf.close()
+    tf.close
     to_stxml_file(infilename, tf.path)
-    tf.flush()
+    tf.flush
     # get matching tab file, read
     tab_reader = get_tab_reader(infilename)
-    tab_sentences = Array.new
+    tab_sentences = []
     tab_reader.each_sentence { |s| tab_sentences << s }
     # read Salsa/Tiger sentences and yield them
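
Note on the `process_dir` / `to_stxml_dir` hunks above: both the glob pattern and the output filename are built by plain string interpolation, so `in_dir` and `out_dir` appear to be expected to end in a path separator. The following is a minimal sketch of that derivation only. The helper names (`glob_pattern`, `derive_outfilename`, `derive_outfilename_joined`) and the sample paths and suffixes are hypothetical and not part of the gem.

```ruby
# Hypothetical helpers that mirror the string interpolation used in
# process_dir; they are illustrations, not methods of SynInterface.

# Pattern handed to Dir[]: directory name, wildcard, input suffix.
def glob_pattern(in_dir, insuffix)
  "#{in_dir}*#{insuffix}"
end

# Output name: strip the input suffix from the basename, then append
# the output suffix and prefix the output directory.
def derive_outfilename(infilename, out_dir, insuffix, outsuffix)
  "#{out_dir}#{File.basename(infilename, insuffix)}#{outsuffix}"
end

# Alternative (an assumption, not what the diff does): File.join
# tolerates a missing or duplicated separator in the directory name.
def derive_outfilename_joined(infilename, out_dir, insuffix, outsuffix)
  File.join(out_dir, File.basename(infilename, insuffix) + outsuffix)
end

puts glob_pattern("in/", ".tab")
puts derive_outfilename("in/sent1.tab", "out/", ".tab", ".parsed")
puts derive_outfilename_joined("in/sent1.tab", "out", ".tab", ".parsed")
```

With a directory name that lacks the trailing slash, the interpolated form silently produces a sibling name (`"in*.tab"` rather than `"in/*.tab"`), which is why callers would need to normalise directory names before calling these methods.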

data/lib/common/Mallet.rb ADDED

@@ -0,0 +1,236 @@
+# wrapper script for the Mallet toolkit Maxent classifier
+# Problem with Winnow: cannot be serialised (written to file). Support dropped.
+# sp 27 10 04
+require "tempfile"
+require "ftools"
+class Mallet
+  ###
+  def initialize(program_path,parameters)
+    if parameters.empty?
+      puts "Error: Mallet needs two paths (first the location of mallet itself and then the location of the interface, usually program/tools/mallet)."
+      puts "I got only the program path."
+      Kernel.exit
+    end
+    @malletpath = program_path
+    @interface_path = parameters.first
+    unless @malletpath =~ /\/$/
+      @malletpath = @malletpath + "/"
+    end
+    @learner = "MaxEnt,gaussianPriorVariance=1.0"
+    # classpath for mallet
+    @cp = "#{ENV["CLASSPATH"]}:#{@malletpath}class:#{@malletpath}lib/bsh.jar"
+  end
+  ###
+  def train(infilename,classifier_location)
+    csvfile = Tempfile.new(File.basename(infilename)+".csvtrain")
+    infile = File.new(infilename)
+    c45_to_csv(infile,csvfile) # training data in csv format
+    infile.close
+    csvfile.close
+    @mallet_train_vectors = infilename+".trainvectors" # training data in mallet format
+    if classifier_location
+      @classifier_mallet_path = classifier_location
+    else
+      @classifier_mallet_path = infilename+".classifier"
+    end
+    command1 = [@malletpath+"bin/csv2vectors ",
+		    " --input ",csvfile.path,
+		    " --output ",@mallet_train_vectors].join("")
+    command2 = ["cd #{@interface_path}; ",
+                "java -cp #{@cp} -Xmx1000m Train ",
+                " --train ",@mallet_train_vectors,
+                " --out ",@classifier_mallet_path,
+                " --trainer ",@learner].join("")
+#    STDERR.puts "[train 1] "+command1
+    successfully_run(command1) # encode
+#    STDERR.puts "[train 2] "+command2
+    successfully_run(command2) # train
+    csvfile.close(true)
+  end
+  def write(classifier_file)
+    if @classifier_mallet_path
+      %x{cp #{@classifier_mallet_path} #{classifier_file}.classifier} # store classifier
+   #    File.chmod(0664,classifier_file+".classifier")
+    end
+    if @mallet_train_vectors
+      %x{cp #{@mallet_train_vectors} #{classifier_file}.trainvectors} # store train vectors to recreate pipe for testing data
+#      File.chmod(0664,classifier_file+".trainvectors")
+    end
+  end
+  ###
+  def exists?(classifier_file)
+    return (FileTest.exists?(classifier_file+".trainvectors") and
+              FileTest.exists?(classifier_file+".classifier"))
+  end
+  ###
+  # return true iff reading the classifier has had success
+  def read(classifier_file)
+    @mallet_train_vectors = classifier_file+".trainvectors" # training data in mallet format
+    @classifier_mallet_path = classifier_file+".classifier"
+    unless FileTest.exists?(@mallet_train_vectors)
+      $stderr.puts "No classifier file "+@mallet_train_vectors
+      return false
+    end
+    unless FileTest.exists?(@classifier_mallet_path)
+      $stderr.puts "No classifier file "+@classifier_mallet_path
+      return false
+    end
+    return true
+  end
+  ###
+  def apply(infilename,outfilename)
+    unless @classifier_mallet_path and @mallet_train_vectors
+      return false
+    end
+    #    STDERR.puts "Testing on "+infilename
+    csvfile = Tempfile.new(File.basename(infilename)+".csvtest")
+    infile = File.new(infilename)
+    c45_to_csv(infile,csvfile) # test data in csv format
+    infile.close
+    csvfile.close
+    test_mallet_path = infilename+".test.vectors" # test data in mallet format
+    # $stderr.puts "test file in " + infilename
+    # $stderr.puts "using training vectors from " + @mallet_train_vectors
+    # copy train vectors to temp file.
+    # reason: mallet in std edition reads _and writes_ this file
+    # if rosy is interrupted, corrupted (ie incomplete) train vector files
+    # result
+    tempfile = Tempfile.new("mallet")
+    tempfilename = tempfile.path
+    unless File.copy(@mallet_train_vectors,tempfilename)
+      return false
+    end
+    command1 = [@malletpath+"bin/csv2vectors", # encode testing data
+                " --input ",csvfile.path,
+                " --output ",test_mallet_path,
+                " --use-pipe-from ",tempfilename].join("")
+#    $stderr.puts "Mallet encode: " + command1
+    unless successfully_run(command1) # encode
+      return false
+    end
+    File.safe_unlink(tempfilename)
+    # some error in encoding?
+    unless FileTest.exists?(test_mallet_path)
+      return false
+    end
+    command2 = ["cd #{@interface_path}; ",
+                "java -cp #{@cp} -Xmx1000m Classify ",
+                @classifier_mallet_path," ",
+                test_mallet_path," ",
+                "> ",outfilename].join("")
+    # classify
+#    $stderr.puts "Mallet classify: " + command2
+    unless    successfully_run(command2)
+      return false
+    end
+    # some error in classification
+    unless FileTest.exists?(outfilename)
+      return false
+    end
+     # no errors = success
+    csvfile.close(true)
+    return true
+  end
+  #####
+  # format of Mallet result file:
+  # <best label> <confidence> \t <secondbest_label> <confidence>....
+  def read_resultfile(filename)
+    begin
+      f = File.new(filename)
+    rescue
+      $stderr.puts "Mallet error: cannot read Mallet result file #{filename}."
+      return nil
+    end
+    retv = Array.new()
+    f.each { |line|
+      line_results = Array.new()
+      pieces = line.split()
+      while not(pieces.empty?)
+        label = pieces.shift()
+        begin
+          confidence = pieces.shift().to_f()
+        rescue
+          $stderr.puts "Error reading mallet output: invalid line: #{line}"
+          confidence = 0
+        end
+        line_results << [label, confidence]
+      end
+      retv << line_results
+    }
+    return retv
+  end
+  ###################################
+  private
+  ###
+  # mallet needs "comma separated values"-file
+  # input: features separated by comma
+  # output:
+  # line_number classlabel features_joined_by_spaces
+  def c45_to_csv(inpipe,outpipe)
+    idx = 0
+    while (line = inpipe.gets)
+      line.chomp!
+      idx += 1
+      la = line.split(",")
+      label = la.pop
+      if label[-1,1] == "."
+	label.chop!
+      end
+      outpipe.puts [idx,label].join(" ")+" "+la.join(" ")
+    end
+  end
+  ###
+  def successfully_run(command)
+    retv = Kernel.system(command)
+    unless retv
+      $stderr.puts "Error running classifier. Continuing."
+      $stderr.puts "Offending command: "+command
+ #     exit 1
+    end
+    return retv
+  end
+end
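
Note on the two text formats handled in `Mallet.rb`: `c45_to_csv` turns each comma-separated input line (class label last, optionally followed by a `.`) into `line_number classlabel features_joined_by_spaces`, and `read_resultfile` reads whitespace-separated `label confidence` pairs from each output line. The sketch below restates both transformations as standalone functions. The names `c45_line_to_mallet` and `parse_result_line` and the sample labels and features are made up for illustration and are not part of the gem.

```ruby
# Hypothetical single-line version of the conversion in c45_to_csv:
# features are comma-separated, the class label is the last field and
# may carry a trailing ".".
def c45_line_to_mallet(line, idx)
  fields = line.chomp.split(",")
  label = fields.pop.chomp(".")   # drop one trailing "." if present
  "#{idx} #{label} #{fields.join(' ')}"
end

# Hypothetical single-line version of the parsing in read_resultfile:
# tokens alternate between label and confidence. A missing confidence
# becomes 0.0, because nil.to_f returns 0.0 rather than raising.
def parse_result_line(line)
  line.split.each_slice(2).map { |label, confidence| [label, confidence.to_f] }
end

puts c45_line_to_mallet("he,PRP,subj,Agent.\n", 1)
p parse_result_line("Motion 0.8\tStatement 0.2\n")
```

Because `nil.to_f` does not raise, a dangling label at the end of a line yields a confidence of `0.0` without any error, in this sketch as in the `begin`/`rescue` around `pieces.shift().to_f()` in the diff, so the `rescue` branch there is unlikely ever to run.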