opener-pos-tagger-base 2.0.0

Files changed (34)

checksums.yaml +7 -0
data/README.md +110 -0
data/bin/pos-tagger-base +21 -0
data/core/mapping.postag.stss.to.opener.csv +52 -0
data/core/mapping.postag.wotan.to.opener.csv +13 -0
data/core/opennlp/bin/opennlp +35 -0
data/core/opennlp/bin/opennlp.bat +35 -0
data/core/opennlp/lib/jwnl-1.3.3.jar +0 -0
data/core/opennlp/lib/opennlp-maxent-3.0.2-incubating.jar +0 -0
data/core/opennlp/lib/opennlp-tools-1.5.2-incubating.jar +0 -0
data/core/opennlp/lib/opennlp-uima-1.5.2-incubating.jar +0 -0
data/core/opennlp/models/de-pos-maxent.bin +0 -0
data/core/opennlp/models/de-pos-perceptron.bin +0 -0
data/core/opennlp/models/nl-pos-maxent.bin +0 -0
data/core/opennlp/models/nl-pos-perceptron.bin +0 -0
data/core/pos-tagger_open-nlp.py +160 -0
data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/PKG-INFO +10 -0
data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/SOURCES.txt +7 -0
data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/dependency_links.txt +1 -0
data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/installed-files.txt +11 -0
data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/top_level.txt +1 -0
data/core/site-packages/pre_build/VUKafParserPy/KafDataObjectsMod.py +165 -0
data/core/site-packages/pre_build/VUKafParserPy/KafDataObjectsMod.pyc +0 -0
data/core/site-packages/pre_build/VUKafParserPy/KafParserMod.py +439 -0
data/core/site-packages/pre_build/VUKafParserPy/KafParserMod.pyc +0 -0
data/core/site-packages/pre_build/VUKafParserPy/__init__.py +7 -0
data/core/site-packages/pre_build/VUKafParserPy/__init__.pyc +0 -0
data/core/token_matcher.py +80 -0
data/ext/hack/support.rb +38 -0
data/lib/opener/pos_taggers/base.rb +90 -0
data/lib/opener/pos_taggers/base/version.rb +7 -0
data/opener-pos-tagger-base.gemspec +29 -0
data/pre_build_requirements.txt +1 -0
metadata +132 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: e1d01b280c3f2369e20c811fa11a42150b41cc16
+  data.tar.gz: 7639fb3ce4fb64641047659339b500157940087c
+SHA512:
+  metadata.gz: 31dd9808cc4b3ce95de10c8e95c456af963a2b93cc1bbef60e9917716b9de830ce3b749c28b63017ab7ce90393172cf499563400fbb62873a39eb0be2d0e2f1a
+  data.tar.gz: 7b9ab3549277fc1c93b09b60eae00f9b56894dee3565550f9eee7d82209316d770083df32085885282da18a3c12475b1e4c897f9c8fd928f2c144841bea88e3d
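
The checksums above can be compared against the `metadata.gz` and `data.tar.gz` members of an unpacked `.gem` archive. The following is a minimal sketch in Python 3, not part of the gem; the helper names `stream_digest` and `file_matches` are hypothetical.

```python
import hashlib


def stream_digest(stream, algorithm="sha512", chunk_size=65536):
    """Return the hex digest of a binary file-like object, read in chunks."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps reading until read() returns b"" (end of file).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_matches(path, expected_hex, algorithm="sha512"):
    """Compare a file's digest with a value copied from checksums.yaml."""
    with open(path, "rb") as handle:
        return stream_digest(handle, algorithm) == expected_hex.strip().lower()
```

For example, `file_matches("data.tar.gz", "7639fb3ce4fb64641047659339b500157940087c", "sha1")` would check the SHA1 entry listed above, assuming the file has been extracted into the current directory.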

data/README.md ADDED

@@ -0,0 +1,110 @@
+[![Build Status](https://drone.io/github.com/opener-project/pos-tagger-base/status.png)](https://drone.io/github.com/opener-project/pos-tagger-base/latest)
+# Base POS Tagger
+This repository contains the source code (both Ruby and Python) for the base
+POS tagger. Currently this tagger supports the following languages:
+* Dutch
+* German
+## Requirements
+* Python 2.7.0 or newer
+* Ruby 1.9.2 or newer
+* pip
+* libxml2
+## Installation
+Using Bundler:
+    gem 'opener-pos-tagger-base',
+      :git    => 'git@github.com:opener-project/pos-tagger-base.git',
+      :branch => 'master'
+Using `specific_install`:
+    gem install specific_install
+    gem specific_install opener-pos-tagger-base \
+        -l https://github.com/opener-project/pos-tagger-base.git
+Using regular RubyGems (once the Gem is available):
+    gem install opener-pos-tagger-base
+## Usage
+Tagging a KAF file:
+    cat some_input_file.kaf | pos-tagger-base
+## Contributing
+First make sure all the required dependencies are installed:
+    bundle install
+Then download the required Python code:
+    bundle exec rake compile
+Once this is done, continue reading the sections below to get a better
+understanding of the repository structure.
+## Structure
+This repository comes in two parts: a collection of Python source files and
+Ruby source code. The Python code can be found in `core/`, the Ruby code can be
+found in the other directories (e.g. `lib/`).
+Required Python packages are installed locally into `core/site-packages/X`,
+where X is one of the following two:
+* `pre_build`: contains packages that are installed before building the Gem;
+  these packages are shipped with the Gem.
+* `pre_install`: contains packages that are installed into this directory upon
+  installing the Gem. This directory should exclusively be used for compiled
+  Python packages such as lxml.
+There are also two requirements files for pip:
+* `pre_build_requirements.txt`: installs the requirements for the `pre_build`
+  directory.
+* `pre_install_requirements.txt`: installs the requirements for the
+  `pre_install` directory.
+To easily install all the required dependencies (needed for running the tests,
+for example), run the following:
+    bundle exec rake compile
+This will take care of verifying the requirements and downloading and
+installing the Python packages.
+## Testing
+To run the tests (which are powered by Cucumber), simply run the following:
+    bundle exec rake
+This will take care of verifying the requirements, installing the Python code
+and running the tests.
+For more information on the available Rake tasks run the following:
+    bundle exec rake -T
+## POS Details
+### POS-tags models
+* [Dutch-maxent](http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin)
+* [Dutch-perceptron](http://opennlp.sourceforge.net/models-1.5/nl-pos-perceptron.bin)
+* [German-maxent](http://opennlp.sourceforge.net/models-1.5/de-pos-maxent.bin)
+* [German-perceptron](http://opennlp.sourceforge.net/models-1.5/de-pos-perceptron.bin)
+### POS-tags sets
+* Dutch: trained on CoNLL-X Alpino data, Wotan tagset
+* German: trained on TIGER corpus, STTS tagset

data/bin/pos-tagger-base ADDED

@@ -0,0 +1,21 @@
+#!/usr/bin/env ruby
+require_relative '../lib/opener/pos_taggers/base'
+# STDIN.tty? returns `false` if data is being piped into the current process.
+if STDIN.tty?
+  input = nil
+else
+  input = STDIN.read
+end
+kernel                  = Opener::POSTaggers::Base.new(:args => ARGV)
+stdout, stderr, process = kernel.run(input)
+if process.success?
+  puts stdout
+  STDERR.puts(stderr) unless stderr.empty?
+else
+  abort stderr
+end
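
The executable above reads piped input, runs the tagger, and either prints the result or aborts with the error output. The same control flow can be sketched in Python 3 as follows; `run_command` and `main` are hypothetical names, and the command to run is left to the caller rather than taken from the gem.

```python
import subprocess
import sys


def run_command(cmd, input_text=None):
    """Run `cmd`, feed `input_text` on stdin, return (stdout, stderr, returncode)."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    data = input_text.encode("utf-8") if input_text is not None else None
    # communicate() writes the input, drains both pipes and waits for exit.
    out, err = proc.communicate(data)
    return out.decode("utf-8"), err.decode("utf-8"), proc.returncode


def main(cmd):
    # isatty() is False when data is being piped into the current process.
    input_text = None if sys.stdin.isatty() else sys.stdin.read()
    out, err, code = run_command(cmd, input_text)
    if code == 0:
        sys.stdout.write(out)
        if err:
            sys.stderr.write(err)
    else:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit(err)
```

Returning the exit status alongside both streams lets the caller decide whether error output is a warning (successful run) or a failure.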

data/core/mapping.postag.stss.to.opener.csv ADDED

@@ -0,0 +1,52 @@
+ADJA	G	("Attributives Adjektiv"),
+ADJD	G	("Adverbiales oder prädikatives Adjektiv"),
+ADV	A	("Adverb"),
+APPR	P	("Präposition; Zirkumposition links"),
+APPRART	P	("Präposition mit Artikel"),
+APPO	P	("Postposition"),
+APZR	P	("Zirkumposition rechts"),
+ART	D	("Bestimmter oder unbestimmter Artikel"),
+CARD	O	("Kardinalzahl"),
+FM	O	("Fremdsprachliches Material"),
+ITJ	O	("Interjektion"),
+KOUI	C	("unterordnende Konjunktion mit zu und Infinitiv"),
+KOUS	C	("unterordnende Konjunktion mit Satz"),
+KON	C	("nebenordnende Konjunktion"),
+KOKOM	C	("Vergleichskonjunktion"),
+NN	N	("normales Nomen"),
+NE	R	("Eigennamen"),
+PDS	Q	("substituierendes Demonstrativpronomen"),
+PDAT	Q	("attribuierendes Demonstrativpronomen"),
+PIS	Q	("substituierendes Indefinitpronomen"),
+PIAT	Q	("attribuierendes Indefinitpronomen ohne Determiner"),
+PIDAT	Q	("attribuierendes Indefinitpronomen mit Determiner"),
+PPER	Q	("irreflexives Personalpronomen"),
+PPOSS	Q	("substituierendes Possessivpronomen"),
+PPOSAT	Q	("attribuierendes Possessivpronomen"),
+PRELS	Q	("substituierendes Relativpronomen"),
+PRELAT	Q	("attribuierendes Relativpronomen"),
+PRF	Q	("reflexives Personalpronomen"),
+PWS	Q	("substituierendes Interrogativpronomen"),
+PWAT	Q	("attribuierendes Interrogativpronomen"),
+PWAV	Q	("adverbiales Interrogativ- oder Relativpronomen"),
+PAV	Q	("Pronominaladverb"),
+PTKZU	O	("zu vor Infinitiv"),
+PTKNEG	O	("Negationspartikel"),
+PTKVZ	V	("abgetrennter Verbzusatz"),
+PTKANT	O	("Antwortpartikel"),
+PTKA	O	("Partikel bei Adjektiv oder Adverb"),
+TRUNC	N	("Kompositions-Erstglied"),
+VVFIN	V	("finites Verb, voll"),
+VVIMP	V	("Imperativ, voll"),
+VVINF	V	("Infinitiv"),
+VVIZU	V	("Infinitiv mit zu"),
+VVPP	V	("Partizip Perfekt"),
+VAFIN	V	("finites Verb, aux"),
+VAIMP	V	("Imperativ, aux"),
+VAINF	V	("Infinitiv, aux"),
+VAPP	V	("Partizip Perfekt"),
+VMFIN	V	("finites Verb, modal"),
+VMINF	V	("Infinitiv, modal"),
+VMPP	V	("Partizip Perfekt, modal"),
+XY	O	("Nichtwort, Sonderzeichen"),
+UNDEFINED	O	("Nicht definiert, zb. Satzzeichen");
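
Each row of this mapping file has three tab-separated fields: the source tag, the OpeNER tag, and a description. A minimal Python 3 sketch of loading such rows into a lookup table is shown below. The function names are hypothetical, and unlike the tagger script this sketch strips whitespace around each field, which is an assumption made here so that stray spaces next to a tag do not break lookups.

```python
def load_tag_mapping(lines):
    """Build {source_tag: opener_tag} from tab-separated mapping rows."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            # Skip blank or malformed rows instead of failing.
            continue
        source_tag = fields[0].strip()
        opener_tag = fields[1].strip()
        mapping[source_tag] = opener_tag
    return mapping


def map_pos_tag(mapping, tag, default="O"):
    """Look up a tag, falling back to 'O' (other) for unknown tags."""
    return mapping.get(tag, default)


rows = ["ADJA\tG\t(\"Attributives Adjektiv\"),", "NN\tN\t(\"normales Nomen\"),"]
table = load_tag_mapping(rows)
print(map_pos_tag(table, "NN"), map_pos_tag(table, "UNKNOWN"))
```

Passing an open file object as `lines` works as well, since iterating over a text file yields one row at a time.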

data/core/mapping.postag.wotan.to.opener.csv ADDED

@@ -0,0 +1,13 @@
+Adj	G	Adjective
+Adv	A	Adverb
+Art	D	Article determiner
+Conj	C	Conjunction
+Int	O	Interjection
+N	N	Noun
+Num	O	Numeral
+Misc	O	Miscellaneous
+Prep	P	Preposition
+Pron	Q	Pronoun
+Punc	O	Punctuation
+V	V	Verb

data/core/opennlp/bin/opennlp ADDED

@@ -0,0 +1,35 @@
+#!/bin/sh
+#   Licensed to the Apache Software Foundation (ASF) under one
+#   or more contributor license agreements.  See the NOTICE file
+#   distributed with this work for additional information
+#   regarding copyright ownership.  The ASF licenses this file
+#   to you under the Apache License, Version 2.0 (the
+#   "License"); you may not use this file except in compliance
+#   with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#   Unless required by applicable law or agreed to in writing,
+#   software distributed under the License is distributed on an
+#   #  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#   KIND, either express or implied.  See the License for the
+#   specific language governing permissions and limitations
+#   under the License.
+# Note:  Do not output anything in this script file, any output
+#        may be inadvertently placed in any output files if
+#        output redirection is used.
+if [ -z "$JAVACMD" ] ; then
+  if [ -n "$JAVA_HOME"  ] ; then
+    JAVACMD="$JAVA_HOME/bin/java"
+  else
+    JAVACMD="`which java`"
+  fi
+fi
+# Might fail if $0 is a link
+OPENNLP_HOME=`dirname "$0"`/..
+"$JAVACMD" -Xmx1024m -jar "$OPENNLP_HOME"/lib/opennlp-tools-*.jar "$@"
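
The launcher resolves the Java executable in a fixed order: `JAVACMD`, then `JAVA_HOME/bin/java`, then whatever `java` is on the `PATH`. A minimal Python 3 sketch of that lookup follows; `resolve_java` and `build_command` are hypothetical helpers, and the jar path is passed in by the caller instead of being globbed.

```python
import os
import shutil


def resolve_java(env=None):
    """Pick the Java executable: JAVACMD, then JAVA_HOME/bin/java, then PATH."""
    env = os.environ if env is None else env
    if env.get("JAVACMD"):
        return env["JAVACMD"]
    if env.get("JAVA_HOME"):
        return os.path.join(env["JAVA_HOME"], "bin", "java")
    # shutil.which returns None when no `java` is found on the PATH.
    return shutil.which("java")


def build_command(java, jar_path, args, max_heap="1024m"):
    """Assemble the argument list; a list avoids shell word splitting."""
    return [java, "-Xmx" + max_heap, "-jar", jar_path] + list(args)
```

Building the command as a list keeps arguments such as model paths intact even when they contain spaces.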

data/core/opennlp/bin/opennlp.bat ADDED

@@ -0,0 +1,35 @@
+@ECHO off
+REM #   Licensed to the Apache Software Foundation (ASF) under one
+REM #   or more contributor license agreements.  See the NOTICE file
+REM #   distributed with this work for additional information
+REM #   regarding copyright ownership.  The ASF licenses this file
+REM #   to you under the Apache License, Version 2.0 (the
+REM #   "License"); you may not use this file except in compliance
+REM #   with the License.  You may obtain a copy of the License at
+REM #
+REM #    http://www.apache.org/licenses/LICENSE-2.0
+REM #
+REM #   Unless required by applicable law or agreed to in writing,
+REM #   software distributed under the License is distributed on an
+REM #   #  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+REM #   KIND, either express or implied.  See the License for the
+REM #   specific language governing permissions and limitations
+REM #   under the License.
+REM # Note:  Do not output anything in this script file, any output
+REM #        may be inadvertently placed in any output files if
+REM #        output redirection is used.
+IF "%JAVA_CMD%" == "" (
+	IF "%JAVA_HOME%" == "" (
+		SET JAVA_CMD=java
+	) ELSE (
+		SET JAVA_CMD=%JAVA_HOME%\bin\java
+	)
+)
+REM #  Should work with Windows XP and greater.  If not, specify the path to where it is installed.
+IF "%OPENNLP_HOME%" == "" SET OPENNLP_HOME=%~sp0..
+%JAVA_CMD% -Xmx4096m -jar %OPENNLP_HOME%\lib\opennlp-tools-*.jar %*

data/core/opennlp/lib/jwnl-1.3.3.jar ADDED

Binary file

data/core/opennlp/lib/opennlp-maxent-3.0.2-incubating.jar ADDED

Binary file

data/core/opennlp/lib/opennlp-tools-1.5.2-incubating.jar ADDED

Binary file

data/core/opennlp/lib/opennlp-uima-1.5.2-incubating.jar ADDED

Binary file

data/core/opennlp/models/de-pos-maxent.bin ADDED

Binary file

data/core/opennlp/models/de-pos-perceptron.bin ADDED

Binary file

data/core/opennlp/models/nl-pos-maxent.bin ADDED

Binary file

data/core/opennlp/models/nl-pos-perceptron.bin ADDED

Binary file

data/core/pos-tagger_open-nlp.py ADDED

@@ -0,0 +1,160 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# Ruben Izquierdo
+# Vrije University of Amsterdam
+import os
+import sys
+import operator
+import time
+import getopt
+import string
+import subprocess
+os.environ["LC_CTYPE"] = 'en_US.UTF-8'
+this_folder    = os.path.dirname(os.path.realpath(__file__))
+opennlp_folder = os.path.join(this_folder, 'opennlp')
+model_folder   = os.path.join(opennlp_folder, 'models')
+# This updates the load path to ensure that the local site-packages directory
+# can be used to load packages (e.g. a locally installed copy of lxml).
+sys.path.append(os.path.join(this_folder, 'site-packages/pre_build'))
+sys.path.append(os.path.join(this_folder, 'site-packages/pre_install'))
+# Config for Dutch
+pos_model_nl            = 'nl-pos-maxent.bin'
+mapping_pos_filename_nl = 'mapping.postag.wotan.to.opener.csv'
+# Config for German
+pos_model_de            = 'de-pos-maxent.bin'
+mapping_pos_filename_de = 'mapping.postag.stss.to.opener.csv'
+mapping_postag_to_kaf = None
+mapping_pos_filename  = ""
+__version__           = '2-May-2013'
+from lxml.etree import ElementTree as ET, Element as EL, PI
+from VUKafParserPy.KafParserMod import KafParser
+from token_matcher import token_matcher
+def map_pos_tag(pos):
+  global mapping_postag_to_kaf
+  if mapping_postag_to_kaf is None:
+    mapping_postag_to_kaf = {}
+    file_mapping = os.path.join(this_folder,mapping_pos_filename)
+    fic = open(file_mapping,'r')
+    for line in fic:
+      fields = line.strip().split('\t')
+      if len(fields)==3:
+        wotan_pos = fields[0]
+        kaf_pos = fields[1]
+        mapping_postag_to_kaf[wotan_pos] = kaf_pos
+    fic.close()
+  opener_pos = mapping_postag_to_kaf.get(pos,'O')
+  return opener_pos
+if __name__=='__main__':
+  if sys.stdin.isatty():
+      print>>sys.stderr,'Input stream required.'
+      print>>sys.stderr,'Example usage: cat myUTF8file.kaf |',sys.argv[0]
+      sys.exit(-1)
+  time_stamp = True
+  try:
+    opts, args = getopt.getopt(sys.argv[1:],"l:",["no-time"])
+    for opt, arg in opts:
+      if opt == "--no-time":
+        time_stamp = False
+  except getopt.GetoptError:
+    pass
+  input_kaf = KafParser(sys.stdin)
+  my_lang = input_kaf.getLanguage()
+  if my_lang == 'nl':
+    pos_model= pos_model_nl
+    mapping_pos_filename= mapping_pos_filename_nl
+  elif my_lang =='de':
+    pos_model = pos_model_de
+    mapping_pos_filename = mapping_pos_filename_de
+  else:
+    print>>sys.stderr,'The language of the input KAF is "'+my_lang+'"; only Dutch (nl) and German (de) are supported'
+    sys.exit(-1)
+  ## Create the input text for OpenNLP: group the tokens per sentence
+  reference_tokens = []
+  sentences = []
+  prev_sent='-200'
+  aux = []
+  for word, sent_id, w_id in input_kaf.getTokens():
+    if sent_id != prev_sent:
+      if len(aux) != 0:
+        sentences.append(aux)
+        aux = []
+    aux.append((word,w_id))
+    prev_sent = sent_id
+  if len(aux)!=0:
+    sentences.append(aux)
+  for sentence in sentences:
+    text = ' '.join(t for t,_ in sentence).encode('utf-8')
+    cmd = [os.path.join(opennlp_folder,'bin/opennlp'), 'POSTagger',os.path.join(model_folder,pos_model)]
+    try:
+      proc = subprocess.Popen(cmd,stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+      # communicate() writes the input, drains both pipes and waits for the
+      # process to exit, avoiding a deadlock if the stderr pipe fills up.
+      out, _ = proc.communicate(text)
+      text_with_pos = out.strip().decode('utf-8')  ## variable is unicode
+    except Exception as e:
+       print>>sys.stderr,str(e)
+       sys.exit(-1)
+    data = {}
+    new_tokens = []
+    for n, token in enumerate(text_with_pos.split(' ')):
+      position = token.rfind('_')
+      lemma = token[:position]
+      pos = token[position+1:]
+      my_id='t_'+str(n)
+      data[my_id] = (lemma,pos)
+      new_tokens.append((lemma,my_id))
+    mapping_tokens = {}
+    token_matcher(sentence,new_tokens,mapping_tokens)
+    for token_new,id_new in new_tokens:
+      lemma,pos = data[id_new]
+      opener_pos = map_pos_tag(pos)
+      if opener_pos in ['N','R','G','V','A','O']:
+        type_term = 'open'
+      else:
+        type_term = 'close'
+      ele_term = EL('term',attrib={'tid':id_new,
+                                   'type':type_term,
+                                   'pos':opener_pos,
+                                   'morphofeat':pos,
+                                   'lemma':lemma})
+      ref_tokens = mapping_tokens[id_new]
+      ele_span = EL('span')
+      for ref_token in ref_tokens:
+        eleTarget = EL('target',attrib={'id':ref_token})
+        ele_span.append(eleTarget)
+      ele_term.append(ele_span)
+      input_kaf.addElementToLayer('terms', ele_term)
+  input_kaf.addLinguisticProcessor('Open nlp pos tagger','1.0', 'term', time_stamp)
+  input_kaf.saveToFile(sys.stdout)
+  sys.exit(0)
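
Two steps of the script above lend themselves to a compact illustration: grouping consecutive tokens by sentence identifier, and splitting the tagger's `word_TAG` output on the last underscore. The following Python 3 sketch is not the script's own code; `group_by_sentence` and `split_tagged` are hypothetical names, and giving untagged tokens an empty tag is a choice made here, not the script's behaviour.

```python
from itertools import groupby


def group_by_sentence(tokens):
    """Group consecutive (word, sent_id, w_id) tuples into per-sentence lists."""
    return [
        [(word, w_id) for word, _, w_id in group]
        for _, group in groupby(tokens, key=lambda token: token[1])
    ]


def split_tagged(tagged_line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in tagged_line.split():
        # rpartition splits on the LAST underscore, so words that themselves
        # contain underscores keep them.
        word, separator, tag = token.rpartition("_")
        if not separator:
            # No underscore at all: keep the token and leave the tag empty.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


tokens = [("Dit", "1", "w1"), ("werkt", "1", "w2"), ("Goed", "2", "w3")]
print(group_by_sentence(tokens))
print(split_tagged("Dit_Pron werkt_V ._Punc"))
```

`groupby` only merges adjacent items with the same key, which matches the script's comparison of each sentence identifier with the previous one.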