shalmaneser-frappe 1.2.rc5
Sign up to get free protection for your applications and to get access to all the features.
- checksums.yaml +7 -0
- data/.yardopts +10 -0
- data/CHANGELOG.md +4 -0
- data/LICENSE.md +4 -0
- data/README.md +122 -0
- data/lib/frappe/Ampersand.rb +41 -0
- data/lib/frappe/file_parser.rb +126 -0
- data/lib/frappe/fix_syn_sem_mapping.rb +196 -0
- data/lib/frappe/frappe.rb +217 -0
- data/lib/frappe/frappe_flat_syntax.rb +89 -0
- data/lib/frappe/frappe_read_stxml.rb +48 -0
- data/lib/frappe/interfaces/berkeley_interface.rb +380 -0
- data/lib/frappe/interfaces/collins_interface.rb +340 -0
- data/lib/frappe/interfaces/counter.rb +19 -0
- data/lib/frappe/interfaces/stanford_interface.rb +353 -0
- data/lib/frappe/interfaces/treetagger_interface.rb +74 -0
- data/lib/frappe/interfaces/treetagger_module.rb +111 -0
- data/lib/frappe/interfaces/treetagger_pos_interface.rb +80 -0
- data/lib/frappe/interpreters/berkeley_interpreter.rb +27 -0
- data/lib/frappe/interpreters/collins_tnt_interpreter.rb +807 -0
- data/lib/frappe/interpreters/collins_treetagger_interpreter.rb +16 -0
- data/lib/frappe/interpreters/empty_interpreter.rb +26 -0
- data/lib/frappe/interpreters/headz.rb +265 -0
- data/lib/frappe/interpreters/headz_helpers.rb +54 -0
- data/lib/frappe/interpreters/stanford_interpreter.rb +28 -0
- data/lib/frappe/interpreters/syn_interpreter.rb +727 -0
- data/lib/frappe/interpreters/tiger_interpreter.rb +1846 -0
- data/lib/frappe/interpreters/treetagger_interpreter.rb +89 -0
- data/lib/frappe/one_parsed_file.rb +31 -0
- data/lib/frappe/opt_parser.rb +92 -0
- data/lib/frappe/path.rb +199 -0
- data/lib/frappe/plain_converter.rb +59 -0
- data/lib/frappe/salsa_tab_converter.rb +154 -0
- data/lib/frappe/salsa_tab_with_pos_converter.rb +531 -0
- data/lib/frappe/stxml_converter.rb +666 -0
- data/lib/frappe/syn_interface.rb +76 -0
- data/lib/frappe/syn_interface_stxml.rb +173 -0
- data/lib/frappe/syn_interface_tab.rb +39 -0
- data/lib/frappe/utf_iso.rb +27 -0
- data/lib/shalmaneser/frappe.rb +1 -0
- metadata +130 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 0b7ec35085dc7311add3094d750959ecc910a154
|
4
|
+
data.tar.gz: d30a7dde0b8954ca89d1000723178477aea27cde
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 0fdc58e5ef35e89a639db4ec2074545f6acaa65f17871547cabc07a4efbdc1dc5f010d44d10288c9346378a53a760edd62ef065241f75f642aedb270d400b8c8
|
7
|
+
data.tar.gz: 93dc75206689d12f72e9bdecaf1e089b548af7d87503a4ae1f5e5b4d291cf78447b2759e8ea9d3d2c037af2ece4b1446ff048dd3c9f8389920f605ae8e44a99e
|
data/.yardopts
ADDED
data/CHANGELOG.md
ADDED
data/LICENSE.md
ADDED
data/README.md
ADDED
@@ -0,0 +1,122 @@
|
|
1
|
+
# SHALMANESER
|
2
|
+
|
3
|
+
[RubyGems](http://rubygems.org/gems/shalmaneser) |
|
4
|
+
[Shalmaneser's Project Page](http://bu.chsta.be/projects/shalmaneser/) |
|
5
|
+
[Source Code](https://github.com/arbox/shalmaneser) |
|
6
|
+
[Bug Tracker](https://github.com/arbox/shalmaneser/issues)
|
7
|
+
|
8
|
+
|
9
|
+
[![Gem Version](https://img.shields.io/gem/v/shalmaneser.svg)](https://rubygems.org/gems/shalmaneser)
|
10
|
+
[![Gem Version](https://img.shields.io/gem/v/frprep.svg)](https://rubygems.org/gems/shalmaneser-prep)
|
11
|
+
[![Gem Version](https://img.shields.io/gem/v/fred.svg)](https://rubygems.org/gems/shalmaneser-fred)
|
12
|
+
[![Gem Version](https://img.shields.io/gem/v/rosy.svg)](https://rubygems.org/gems/shalmaneser-rosy)
|
13
|
+
|
14
|
+
|
15
|
+
[![License GPL 2](http://img.shields.io/badge/License-GPL%202-green.svg)](http://www.gnu.org/licenses/gpl-2.0.txt)
|
16
|
+
[![Build Status](https://img.shields.io/travis/arbox/shalmaneser.svg?branch=1.2)](https://travis-ci.org/arbox/shalmaneser)
|
17
|
+
[![Code Climate](https://img.shields.io/codeclimate/github/arbox/shalmaneser.svg)](https://codeclimate.com/github/arbox/shalmaneser)
|
18
|
+
[![Dependency Status](https://img.shields.io/gemnasium/arbox/shalmaneser.svg)](https://gemnasium.com/arbox/shalmaneser)
|
19
|
+
|
20
|
+
[SHALMANESER](http://www.coli.uni-saarland.de/projects/salsa/shal/) is a SHALlow seMANtic parSER.
|
21
|
+
|
22
|
+
The name Shalmaneser is borrowed from John Brunner. He describes in his novel
"Stand on Zanzibar" an all-knowing supercomputer baptized Shalmaneser.
|
23
|
+
"Stand on Zanzibar" an all knowing supercomputer baptized Shalmaneser.
|
24
|
+
|
25
|
+
Shalmaneser also has other origins like the king [Shalmaneser III](https://en.wikipedia.org/wiki/Shalmaneser_III).
|
26
|
+
|
27
|
+
> "SCANALYZER is the one single, the ONLY study of the news in depth
|
28
|
+
> that’s processed by General Technics’ famed computer Shalmaneser,
|
29
|
+
> who sees all, hears all, knows all save only that which YOU, Mr. and Mrs.
|
30
|
+
> Everywhere, wish to keep to yourselves." <br/>
|
31
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
32
|
+
|
33
|
+
> But Shalmaneser is a Micryogenic® computer bathed in liquid helium and it’s cold in his vault. <br/>
|
34
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
35
|
+
|
36
|
+
> “Of course not. Shalmaneser’s main task is to achieve the impossible again, a routine undertaking here at GT.” <br/>
|
37
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
38
|
+
|
39
|
+
> “They programmed Shalmaneser with the formula for this stiffener, see, and…” <br/>
|
40
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
41
|
+
|
42
|
+
> What am I going to do now? <br/>
|
43
|
+
> “All right, Shalmaneser!” <br/>
|
44
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
45
|
+
|
46
|
+
> Shalmaneser is a Micryogenic® computer bathed in liquid helium and there’s no sign of Teresa. <br/>
|
47
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
48
|
+
|
49
|
+
> Bathed in his currents of liquid helium, self-contained, immobile, vastly well informed by every mechanical sense: Shalmaneser. <br/>
|
50
|
+
> John Brunner (1968) "Stand on Zanzibar"
|
51
|
+
|
52
|
+
## Description
|
53
|
+
|
54
|
+
Please be careful, the whole thing is under construction! For now Shalmaneser is not intended to run on Windows systems since it heavily uses system calls for external invocations.
|
55
|
+
Current versions of Shalmaneser have been tested on Linux only (other *NIX testers are welcome!).
|
56
|
+
|
57
|
+
Shalmaneser is a supervised learning toolbox for shallow semantic parsing, i.e. the automatic assignment of semantic classes and roles to text. This technique is often called [SRL](https://en.wikipedia.org/wiki/Semantic_role_labeling) (Semantic Role Labelling). The system was developed for Frame Semantics; thus we use Frame Semantics terminology and call the classes frames and the roles frame elements. However, the architecture is reasonably general, and with a certain amount of adaption, Shalmaneser should be usable for other paradigms (e.g., PropBank roles) as well. Shalmaneser caters both for end users, and for researchers.
|
58
|
+
|
59
|
+
For end users, we provide a simple end user mode which can simply apply the pre-trained classifiers
|
60
|
+
for [English](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (FrameNet 1.3 annotation / Collins parser)
|
61
|
+
and [German](http://www.coli.uni-saarland.de/projects/salsa/shal/index.php?nav=download) (SALSA 1.0 annotation / Sleepy parser).
|
62
|
+
|
63
|
+
We'll try to provide newer pretrained models for English, German, and possibly other languages as soon as possible.
|
64
|
+
|
65
|
+
For researchers interested in investigating shallow semantic parsing, our system is extensively configurable and extendable.
|
66
|
+
|
67
|
+
## Origin
|
68
|
+
|
69
|
+
The original version of Shalmaneser was written by Sebastian Padó, Katrin Erk, Alexander Koller, Ines Rehbein, Aljoscha Burchardt and others during their work in the SALSA Project.
|
70
|
+
|
71
|
+
You can find original versions of Shalmaneser up to ``1.1`` on the [SALSA](http://www.coli.uni-saarland.de/projects/salsa/shal/) project page.
|
72
|
+
|
73
|
+
## Publications on Shalmaneser
|
74
|
+
|
75
|
+
- K. Erk and S. Padó: Shalmaneser - a flexible toolbox for semantic role assignment. Proceedings of LREC 2006, Genoa, Italy. [Click here for details](http://www.nlpado.de/~sebastian/pub/papers/lrec06_erk.pdf).
|
76
|
+
|
77
|
+
- TODO: add other works
|
78
|
+
|
79
|
+
## Documentation
|
80
|
+
|
81
|
+
The project documentation can be found in our [doc](https://github.com/arbox/shalmaneser/blob/master/doc/index.md) folder.
|
82
|
+
|
83
|
+
## Development
|
84
|
+
|
85
|
+
We are working now only on the `master` branch. For different intermediate versions see corresponding tags.
|
86
|
+
|
87
|
+
## Installation
|
88
|
+
|
89
|
+
See the installation instructions in the [doc](https://github.com/arbox/shalmaneser/blob/master/doc/index.md#installation) folder.
|
90
|
+
|
91
|
+
### Tokenizers
|
92
|
+
|
93
|
+
- [Ucto](http://ilk.uvt.nl/ucto/)
|
94
|
+
|
95
|
+
### POS Taggers
|
96
|
+
|
97
|
+
- [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
|
98
|
+
|
99
|
+
### Lemmatizers
|
100
|
+
|
101
|
+
- [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/)
|
102
|
+
|
103
|
+
### Parsers
|
104
|
+
|
105
|
+
- [BerkeleyParser](https://github.com/slavpetrov/berkeleyparser)
|
106
|
+
- [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml)
|
107
|
+
- [Collins Parser](http://www.cs.columbia.edu/~mcollins/code.html)
|
108
|
+
|
109
|
+
### Machine Learning Systems
|
110
|
+
|
111
|
+
- [OpenNLP MaxEnt](http://sourceforge.net/projects/maxent/files/Maxent/2.4.0/)
|
112
|
+
- [Mallet](http://mallet.cs.umass.edu/index.php)
|
113
|
+
|
114
|
+
## License
|
115
|
+
|
116
|
+
Shalmaneser is released under the `GPL v. 2.0` license as of the initial authors.
|
117
|
+
|
118
|
+
For a local copy of the full license text see the [LICENSE](LICENSE.md) file.
|
119
|
+
|
120
|
+
## Contributing
|
121
|
+
|
122
|
+
Feel free to contact me via Github. Open an issue if you see problems or need help.
|
@@ -0,0 +1,41 @@
|
|
1
|
+
# @note AB: This whole thing should be obsolete on Ruby 1.9
|
2
|
+
# @note #unpack seems to work on 1.8 and 1.9 equally
|
3
|
+
require_relative 'utf_iso'
|
4
|
+
|
5
|
+
####################3
|
6
|
+
# Reformatting to and from
|
7
|
+
# a hex format for special characters
|
8
|
+
# Conversion between plain characters and a hexadecimal
# entity notation (&#x....;) for non-ASCII characters.
module Shalmaneser
  module Frappe
    module Ampersand
      # Decode hexadecimal character entities (&#x..;) in +str+
      # into the characters they represent.
      # Any other &...; entity is left untouched.
      def self.hex_to_iso(str)
        str.gsub(/&.+?;/) do |entity|
          hex_match = entity.match(/&#x(.+);/)
          hex_match ? hex_match[1].hex.chr : entity
        end
      end

      # Encode an ISO-8859-1 string into the hex-entity notation,
      # going through UTF-8 first.
      def self.iso_to_hex(str)
        utf8_to_hex(UtfIso.from_iso_8859_1(str))
      end

      # Encode every non-ASCII codepoint of a UTF-8 string as a
      # four-digit hexadecimal entity; ASCII characters pass through.
      def self.utf8_to_hex(str)
        result = ""
        str.unpack('U*').each do |codepoint|
          result << if codepoint < 0x80
                      codepoint.chr
                    else
                      format("&\#x%04x;", codepoint)
                    end
        end
        result
      end
    end
  end
end
|
@@ -0,0 +1,126 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
2
|
+
|
3
|
+
require_relative 'one_parsed_file'
|
4
|
+
require_relative 'frappe_read_stxml'
|
5
|
+
require_relative 'frappe_flat_syntax'
|
6
|
+
require 'external_systems'
|
7
|
+
require 'logger'
|
8
|
+
|
9
|
+
module Shalmaneser
  module Frappe
    ##############################
    # Class for managing parses.
    #
    # Given either a directory with tab format files or
    # a directory with SalsaTigerXML files (or both) and
    # a directory for putting parse files:
    # - parse, unless no parsing set in the experiment file
    # - for each parsed file: yield one OneParsedFile object
    class FileParser
      # @param [FrappeConfigData] exp
      # @param [Hash<String, String>] file_suffixes Hash: file type(string) -> suffix(string)
      # @param [String] parse_dir string: name of directory to put parses
      # @param [Hash] dirs further directories
      def initialize(exp, file_suffixes, parse_dir, dirs = {})
        @exp = exp
        @file_suffixes = file_suffixes
        @parse_dir = parse_dir
        @tab_dir = dirs["tab_dir"]
        @stxml_dir = dirs["stxml_dir"]
        # Directory with pre-parsed data, if any was configured.
        @parsed_files = @exp.get("directory_parserout")
      end

      ###
      # Yields one OneParsedFile object per (pseudo-)parsed file.
      def each_parsed_file
        pos_sfx = @exp.get("do_postag") ? @file_suffixes["pos"] : nil
        lemma_sfx = @exp.get("do_lemmatize") ? @file_suffixes["lemma"] : nil

        unless @exp.get("do_parse")
          # No parsing requested: provide a pseudo-parse tree instead.
          if @stxml_dir
            # Use existing SalsaTigerXML files.
            Dir[@stxml_dir + "*.xml"].each do |stxml_file|
              core = File.basename(stxml_file, ".xml")
              # Pass the matching tab file along when we know the tab directory.
              tab_file = @tab_dir ? @tab_dir + core + @file_suffixes["tab"] : nil
              reader = FrappeReadStxml.new(stxml_file, tab_file,
                                           pos_sfx, lemma_sfx)
              yield OneParsedFile.new(core, stxml_file, reader)
            end
          else
            # Construct SalsaTigerXML from the tab files.
            Dir[@tab_dir + "*" + @file_suffixes["tab"]].each do |tab_file|
              reader = FrappeFlatSyntax.new(tab_file, pos_sfx, lemma_sfx)
              core = File.basename(tab_file, @file_suffixes["tab"])
              yield OneParsedFile.new(core, tab_file, reader)
            end
          end
          return
        end

        # Parsing requested: obtain the parser interface class.
        sys_class = ExternalSystems.get_interface("parser", @exp.get("parser"))

        # This suffix is used as extension for parsed files.
        parse_suffix = ".#{sys_class.name.split('::').last}"

        sys = sys_class.new(@exp.get("parser_path"),
                            @file_suffixes["tab"],
                            parse_suffix,
                            @file_suffixes["stxml"],
                            "pos_suffix" => pos_sfx,
                            "lemma_suffix" => lemma_sfx,
                            "tab_dir" => @tab_dir)

        if @parsed_files
          # Reuse pre-computed parses.
          LOGGER.info "#{PROGRAM_NAME}: Using pre-computed parses in #{@parsed_files}.\n"\
                      "#{PROGRAM_NAME} Postprocessing SalsaTigerXML data."

          Dir[@parsed_files + "*"].each do |parse_file|
            # Skip anything that is not a regular file.
            next unless File.stat(parse_file).ftype == "file"

            # Core filename: basename without the last extension.
            core = File.basename(parse_file, ".*")

            # Use iterator to read each parsed file.
            yield OneParsedFile.new(core, parse_file, sys)
          end
        else
          # Run the parser now.
          LOGGER.info "#{PROGRAM_NAME}: Syntactic analysis with #{sys.class.name.split('::').last}."

          raise "Cannot parse without tab files" unless @tab_dir

          # @note AB: NOTE This is the position where a parser is invoked.
          sys.process_dir(@tab_dir, @parse_dir)

          LOGGER.info "#{PROGRAM_NAME}: Postprocessing SalsaTigerXML data."

          Dir[@parse_dir + "*" + parse_suffix].each do |parse_file|
            core = File.basename(parse_file, parse_suffix)

            # Use iterator to read each parsed file.
            yield OneParsedFile.new(core, parse_file, sys)
          end
        end
      end
    end
  end
end
|
@@ -0,0 +1,196 @@
|
|
1
|
+
###
|
2
|
+
# FixSynSemMapping:
|
3
|
+
# Given a SalsaTigerRegXML sentence with semantic role annotation,
|
4
|
+
# simplify the mapping of semantic roles to syntactic constituents
|
5
|
+
#
|
6
|
+
# The following is lifted from the LREC06 paper on Shalmaneser:
|
7
|
+
# During preprocessing, the span of semantic roles in the training corpora is
|
8
|
+
# projected onto the output of the syntactic parser by assigning each
|
9
|
+
# role to the set of maximal constituents covering its word span.
|
10
|
+
# If the word span of a role does not coincide
|
11
|
+
# with parse tree constituents, e.g. due to misparses,
|
12
|
+
# the role is ``spread out'' across several constituents. This leads to
|
13
|
+
# idiosyncratic paths between predicate and semantic role in the parse
|
14
|
+
# tree.
|
15
|
+
#
|
16
|
+
# [The following span standardization algorithm is used to make the
|
17
|
+
# syntax-semantics mapping more uniform:]
|
18
|
+
# Given a role r that has been assigned, let N be the set of
|
19
|
+
# terminal nodes of the syntactic structure that are covered by r.
|
20
|
+
#
|
21
|
+
# Iteratively compute the maximal projection of N in the syntactic
|
22
|
+
# structure:
|
23
|
+
# 1) If n is a node such that all of n's children are in N,
|
24
|
+
# then remove n's children from N and add n instead.
|
25
|
+
# 2) If n is a node with 3 or more children, and all of n's
|
26
|
+
# children except one are in N, then remove n's children from N
|
27
|
+
# and add n instead.
|
28
|
+
# 3) If n is an NP with 2 children, and one of them, another NP,
|
29
|
+
# is in N, and the other, a relative clause, is not, then remove
|
30
|
+
# n's children from N and add n instead.
|
31
|
+
#
|
32
|
+
# If none of the rules is applicable to N anymore, assign r to the
|
33
|
+
# nodes in N.
|
34
|
+
#
|
35
|
+
# Rule 1 implements normal maximal projection. Rule 2 ``repairs'' parser
|
36
|
+
# errors where all children of a node but one have been assigned the
|
37
|
+
# same role. Rule 3 addresses a problem of the FrameNet data, where
|
38
|
+
# relative clauses have been omitted from roles assigned to NPs.
|
39
|
+
|
40
|
+
# KE Feb 08: rule 3 currently out of commission!
|
41
|
+
|
42
|
+
# require "SalsaTigerRegXML"
|
43
|
+
|
44
|
+
module FixSynSemMapping
  ##
  # Simplify the mapping of semantic roles (FEs) to syntactic
  # constituents in the given sentence, in place.
  #
  # Relevant settings in the experiment file:
  #
  # fe_syn_repair:
  # If there is a node that would be a max. constituent for the
  # words covered by the given FE, except that it has one child
  # whose words are not in the FE, use the node as max constituent anyway.
  # This is to repair cases where the parser has made an attachment choice
  # that differs from the one in the gold annotation.
  #
  # fe_rel_repair:
  # If there is an NP such that all of its children except one have been
  # assigned the same FE, and that missing child is a relative clause
  # depending on one of the other children, then take the complete NP as
  # that FE.
  #
  # Returns early (doing nothing) when neither repair setting is on
  # or when no sentence is given.
  def FixSynSemMapping.fixit(sent, # SalsaTigerSentence object
                             exp, # experiment file object
                             interpreter_class) # SynInterpreter class


    unless exp.get("fe_syn_repair") or exp.get("fe_rel_repair")
      return
    end

    if sent.nil?
      return
    end

    # "repair" FEs:
    sent.each_frame { |frame|

      frame.each_child { |fe_or_target|

        # repair only if the FE currently
        # points to more than one syn node
        if fe_or_target.children.length < 2
          next
        end

        if exp.get("fe_rel_repair")
          # Relative-clause repair: if the last syn node of the FE is a
          # relative pronoun/adverb (WDT, WP, WP$, WRB by simplified POS),
          # collapse the FE onto that single node.
          lastfe = fe_or_target.children.last
          if lastfe and interpreter_class.simplified_pt(lastfe) =~ /^(WDT)|(WP\$?)|(WRB)/

            # remove syn nodes that the FE points to
            old_fe_syn = fe_or_target.children
            old_fe_syn.each { |child|
              fe_or_target.remove_child(child)
            }

            # set it to point only to the last previous node, the relative pronoun
            fe_or_target.add_child(lastfe)
          end
        end

        if exp.get("fe_syn_repair")
          # Maximal-projection repair: recompute the maximal constituents
          # covering the FE's terminal span.
          # remove syn nodes that the FE points to
          old_fe_syn = fe_or_target.children
          old_fe_syn.each { |child|
            fe_or_target.remove_child(child)
          }

          # and recompute from the terminal (yield) nodes of the old span;
          # the truthy "fe_syn_repair" value also switches on the
          # one-child-missing heuristic inside max_constituents.
          new_fe_syn = interpreter_class.max_constituents(old_fe_syn.map { |t|
                                                            t.yield_nodes
                                                          }.flatten.uniq,
                                                          sent,
                                                          exp.get("fe_syn_repair"))

          # make the FE point to the new nodes
          new_fe_syn.each { |syn_node|
            fe_or_target.add_child(syn_node)
          }
        end
      } # each FE
    } # each frame
  end # def fixit
end # module
|
124
|
+
|
125
|
+
|
126
|
+
#########3
|
127
|
+
# old code
|
128
|
+
|
129
|
+
# if exp.get("fe_rel_repair")
|
130
|
+
# # repair relative clauses:
|
131
|
+
# # then make a procedure to pass on to max constituents
|
132
|
+
# # that will recognize the relevant cases
|
133
|
+
|
134
|
+
# accept_anyway_proc = Proc.new { |node, children_in, children_out|
|
135
|
+
|
136
|
+
# # node: SynNode
|
137
|
+
# # children_in, children_out: array:SynNode. children_in are the children
|
138
|
+
# # that are already covered by the FE, children_out the ones that aren't
|
139
|
+
|
140
|
+
# # if node is an NP,
|
141
|
+
# # and only one of its children is out,
|
142
|
+
# # and one node in children_in is an NP, and the missing child is an SBAR
|
143
|
+
# # with a child that is a relative pronoun, then consider the child in children_out as covered
|
144
|
+
# if interpreter_class.category(node) == "noun" and
|
145
|
+
# children_out.length() == 1 and
|
146
|
+
# children_in.select { |n| interpreter_class.category(n) == "noun" } and
|
147
|
+
# interpreter_class.category(children_out.first) == "sent" and
|
148
|
+
# (ch = children_out.first.children) and
|
149
|
+
# ch.select { |n| interpreter_class.relative_pronoun?(n) }
|
150
|
+
# true
|
151
|
+
# else
|
152
|
+
# false
|
153
|
+
# end
|
154
|
+
# }
|
155
|
+
|
156
|
+
# else
|
157
|
+
# accept_anyway_proc = nil
|
158
|
+
# end
|
159
|
+
|
160
|
+
|
161
|
+
# # "repair" FEs:
|
162
|
+
# sent.each_frame { |frame|
|
163
|
+
|
164
|
+
# frame.each_child { |fe_or_target|
|
165
|
+
|
166
|
+
# # repair only if the FE currently
|
167
|
+
# # points to more than one syn node, or
|
168
|
+
# # if it is a noun with a non-covered sentence sister
|
169
|
+
# if fe_or_target.children.length() > 1 or
|
170
|
+
# (exp.get("fe_rel_repair") and (curr_marked = fe_or_target.children.first()) and
|
171
|
+
# interpreter_class.category(curr_marked) == "noun" and
|
172
|
+
# (p = curr_marked.parent) and
|
173
|
+
# p.children.select { |n| n != curr_marked and interpreter_class.category(n) == "sent" } )
|
174
|
+
|
175
|
+
# # remember nodes covered by the FE
|
176
|
+
# old_fe_syn = fe_or_target.children()
|
177
|
+
|
178
|
+
# # remove syn nodes that the FE points to
|
179
|
+
# old_fe_syn.each { |child|
|
180
|
+
# fe_or_target.remove_child(child)
|
181
|
+
# }
|
182
|
+
|
183
|
+
# # and recompute
|
184
|
+
# new_fe_syn = interpreter_class.max_constituents(old_fe_syn.map { |t| t.yield_nodes}.flatten.uniq,
|
185
|
+
# sent,
|
186
|
+
# exp.get("fe_syn_repair"),
|
187
|
+
# accept_anyway_proc)
|
188
|
+
|
189
|
+
# # make the FE point to the new nodes
|
190
|
+
# new_fe_syn.each { |syn_node|
|
191
|
+
# fe_or_target.add_child(syn_node)
|
192
|
+
# }
|
193
|
+
|
194
|
+
# end # if FE points to more than one syn node
|
195
|
+
# } # each FE
|
196
|
+
# } # each frame
|