opener-pos-tagger-base 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. checksums.yaml +7 -0
  2. data/README.md +110 -0
  3. data/bin/pos-tagger-base +21 -0
  4. data/core/mapping.postag.stss.to.opener.csv +52 -0
  5. data/core/mapping.postag.wotan.to.opener.csv +13 -0
  6. data/core/opennlp/bin/opennlp +35 -0
  7. data/core/opennlp/bin/opennlp.bat +35 -0
  8. data/core/opennlp/lib/jwnl-1.3.3.jar +0 -0
  9. data/core/opennlp/lib/opennlp-maxent-3.0.2-incubating.jar +0 -0
  10. data/core/opennlp/lib/opennlp-tools-1.5.2-incubating.jar +0 -0
  11. data/core/opennlp/lib/opennlp-uima-1.5.2-incubating.jar +0 -0
  12. data/core/opennlp/models/de-pos-maxent.bin +0 -0
  13. data/core/opennlp/models/de-pos-perceptron.bin +0 -0
  14. data/core/opennlp/models/nl-pos-maxent.bin +0 -0
  15. data/core/opennlp/models/nl-pos-perceptron.bin +0 -0
  16. data/core/pos-tagger_open-nlp.py +160 -0
  17. data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/PKG-INFO +10 -0
  18. data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/SOURCES.txt +7 -0
  19. data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/dependency_links.txt +1 -0
  20. data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/installed-files.txt +11 -0
  21. data/core/site-packages/pre_build/VUKafParserPy-1.0-py2.7.egg-info/top_level.txt +1 -0
  22. data/core/site-packages/pre_build/VUKafParserPy/KafDataObjectsMod.py +165 -0
  23. data/core/site-packages/pre_build/VUKafParserPy/KafDataObjectsMod.pyc +0 -0
  24. data/core/site-packages/pre_build/VUKafParserPy/KafParserMod.py +439 -0
  25. data/core/site-packages/pre_build/VUKafParserPy/KafParserMod.pyc +0 -0
  26. data/core/site-packages/pre_build/VUKafParserPy/__init__.py +7 -0
  27. data/core/site-packages/pre_build/VUKafParserPy/__init__.pyc +0 -0
  28. data/core/token_matcher.py +80 -0
  29. data/ext/hack/support.rb +38 -0
  30. data/lib/opener/pos_taggers/base.rb +90 -0
  31. data/lib/opener/pos_taggers/base/version.rb +7 -0
  32. data/opener-pos-tagger-base.gemspec +29 -0
  33. data/pre_build_requirements.txt +1 -0
  34. metadata +132 -0
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: e1d01b280c3f2369e20c811fa11a42150b41cc16
+   data.tar.gz: 7639fb3ce4fb64641047659339b500157940087c
+ SHA512:
+   metadata.gz: 31dd9808cc4b3ce95de10c8e95c456af963a2b93cc1bbef60e9917716b9de830ce3b749c28b63017ab7ce90393172cf499563400fbb62873a39eb0be2d0e2f1a
+   data.tar.gz: 7b9ab3549277fc1c93b09b60eae00f9b56894dee3565550f9eee7d82209316d770083df32085885282da18a3c12475b1e4c897f9c8fd928f2c144841bea88e3d
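These checksums pair each archive inside the gem (metadata.gz and data.tar.gz) with a SHA1 and a SHA512 digest. A minimal sketch of how such digests can be recomputed for comparison, assuming the two archives have already been extracted from the downloaded .gem file (the extraction step is not shown):

```python
import hashlib

def file_digests(path):
    """Return (sha1, sha512) hex digests of the file at `path`."""
    sha1, sha512 = hashlib.sha1(), hashlib.sha512()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(8192), b''):
            sha1.update(chunk)
            sha512.update(chunk)
    return sha1.hexdigest(), sha512.hexdigest()

# Compare the printed values against the entries in checksums.yaml.
for name in ('metadata.gz', 'data.tar.gz'):
    print(name, *file_digests(name))
```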
@@ -0,0 +1,110 @@
+ [![Build Status](https://drone.io/github.com/opener-project/pos-tagger-base/status.png)](https://drone.io/github.com/opener-project/pos-tagger-base/latest)
+
+ # Base POS Tagger
+
+ This repository contains the source code (both Ruby and Python) for the base
+ POS tagger. Currently this tagger supports the following languages:
+
+ * Dutch
+ * German
+
+ ## Requirements
+
+ * Python 2.7.0 or newer
+ * Ruby 1.9.2 or newer
+ * pip
+ * libxml2
+
+ ## Installation
+
+ Using Bundler:
+
+     gem 'opener-pos-tagger-base',
+       :git    => 'git@github.com:opener-project/pos-tagger-base.git',
+       :branch => 'master'
+
+ Using `specific_install`:
+
+     gem install specific_install
+     gem specific_install opener-pos-tagger-base \
+       -l https://github.com/opener-project/pos-tagger-base.git
+
+ Using regular RubyGems (once the Gem is available):
+
+     gem install opener-pos-tagger-base
+
+ ## Usage
+
+ Tagging a KAF file:
+
+     cat some_input_file.kaf | pos-tagger-base
+
+ ## Contributing
+
+ First make sure all the required dependencies are installed:
+
+     bundle install
+
+ Then download the required Python code:
+
+     bundle exec rake compile
+
+ Once this is done, continue reading the sections below to get a better
+ understanding of the repository structure.
+
+ ## Structure
+
+ This repository comes in two parts: a collection of Python source files and
+ Ruby source code. The Python code can be found in `core/`, while the Ruby code
+ can be found in the other directories (e.g. `lib/`).
+
+ Required Python packages are installed locally into `core/site-packages/X`,
+ where X is one of the following two:
+
+ * `pre_build`: contains packages that are installed before building the Gem;
+   these packages are shipped with the Gem
+ * `pre_install`: contains packages that are installed into this directory upon
+   installing the Gem. This directory should exclusively be used for compiled
+   Python packages such as lxml.
+
+ There are also two requirements files for pip:
+
+ * `pre_build_requirements.txt`: installs the requirements for the `pre_build`
+   directory.
+ * `pre_install_requirements.txt`: installs the requirements for the
+   `pre_install` directory.
+
+ To easily install all the required dependencies (needed, for example, to run
+ the tests), run the following:
+
+     bundle exec rake compile
+
+ This will take care of verifying the requirements and downloading and
+ installing the Python packages.
+
+ ## Testing
+
+ To run the tests (which are powered by Cucumber), simply run the following:
+
+     bundle exec rake
+
+ This will take care of verifying the requirements, installing the Python code
+ and running the tests.
+
+ For more information on the available Rake tasks, run the following:
+
+     bundle exec rake -T
+
+ ## POS Details
+
+ ### POS tagging models
+
+ * [Dutch-maxent](http://opennlp.sourceforge.net/models-1.5/nl-pos-maxent.bin)
+ * [Dutch-perceptron](http://opennlp.sourceforge.net/models-1.5/nl-pos-perceptron.bin)
+ * [German-maxent](http://opennlp.sourceforge.net/models-1.5/de-pos-maxent.bin)
+ * [German-perceptron](http://opennlp.sourceforge.net/models-1.5/de-pos-perceptron.bin)
+
+ ### POS tag sets
+
+ * Dutch: trained on CoNLL-X Alpino data, Wotan tagset
+ * German: trained on the TIGER corpus, STTS tagset
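As the Usage section of the README above notes, the `pos-tagger-base` executable reads a KAF document on standard input and writes the tagged document to standard output. A minimal sketch of driving it from another program, assuming the gem's executable is on `PATH` (the wrapper function is hypothetical, and the sketch itself uses Python 3 even though the gem's bundled code targets Python 2.7):

```python
import subprocess

def tag_kaf(kaf_text):
    """Pipe a KAF document through pos-tagger-base and return the tagged KAF."""
    result = subprocess.run(
        ['pos-tagger-base'],
        input=kaf_text.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode('utf-8'))
    return result.stdout.decode('utf-8')

with open('some_input_file.kaf') as handle:
    print(tag_kaf(handle.read()))
```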
@@ -0,0 +1,21 @@
+ #!/usr/bin/env ruby
+
+ require_relative '../lib/opener/pos_taggers/base'
+
+ # STDIN.tty? returns `false` if data is being piped into the current process.
+ if STDIN.tty?
+   input = nil
+ else
+   input = STDIN.read
+ end
+
+ kernel = Opener::POSTaggers::Base.new(:args => ARGV)
+ stdout, stderr, process = kernel.run(input)
+
+ if process.success?
+   puts stdout
+
+   STDERR.puts(stderr) unless stderr.empty?
+ else
+   abort stderr
+ end
@@ -0,0 +1,52 @@
+ ADJA G ("Attributives Adjektiv"),
+ ADJD G ("Adverbiales oder prädikatives Adjektiv"),
+ ADV A ("Adverb"),
+ APPR P ("Präposition; Zirkumposition links"),
+ APPRART P ("Präposition mit Artikel"),
+ APPO P ("Postposition"),
+ APZR P ("Zirkumposition rechts"),
+ ART D ("Bestimmer oder unbestimmer Artikel"),
+ CARD O ("Kardinalzahl"),
+ FM O ("Fremdsprachichles Material"),
+ ITJ O ("Interjektion"),
+ KOUI C ("unterordnende Konjunktion mit zu und Infinitiv"),
+ KOUS C ("unterordnende Konjunktion mit Satz"),
+ KON C ("nebenordnende Konjunktion"),
+ KOKOM C ("Vergleichskonjunktion"),
+ NN N ("normales Nomen"),
+ NE R ("Eigennamen"),
+ PDS Q ("substituierendes Demonstrativpronomen"),
+ PDAT Q ("attribuierendes Demonstrativpronomen"),
+ PIS Q ("substituierendes Indefinitpronomen"),
+ PIAT Q ("attribuierendes Indefinitpronomen ohne Determiner"),
+ PIDAT Q ("attribuierendes Indefinitpronomen mit Determiner"),
+ PPER Q ("irreflexives Personalpronomen"),
+ PPOSS Q ("substituierendes Possessivpronomen"),
+ PPOSAT Q ("attribuierendes Possessivpronomen"),
+ PRELS Q ("substituierendes Relativpronomen"),
+ PRELAT Q ("attribuierendes Relativpronomen"),
+ PRF Q ("reflexives Personalpronomen"),
+ PWS Q ("substituierendes Interrogativpronomen"),
+ PWAT Q ("attribuierendes Interrogativpronomen"),
+ PWAV Q ("adverbiales Interrogativ- oder Relativpronomen"),
+ PAV Q ("Pronominaladverb"),
+ PTKZU O ("zu vor Infinitiv"),
+ PTKNEG O ("Negationspartike"),
+ PTKVZ V ("abgetrennter Verbzusatz"),
+ PTKANT O ("Antwortpartikel"),
+ PTKA O ("Partikel bei Adjektiv oder Adverb"),
+ TRUNC N ("Kompositions-Erstglied"),
+ VVFIN V ("finites Verb, voll"),
+ VVIMP V ("Imperativ, voll"),
+ VVINF V ("Infinitiv"),
+ VVIZU V ("Infinitiv mit zu"),
+ VVPP V ("Partizip Perfekt"),
+ VAFIN V ("finites Verb, aux"),
+ VAIMP V ("Imperativ, aux"),
+ VAINF V ("Infinitiv, aux"),
+ VAPP V ("Partizip Perfekt"),
+ VMFIN V ("finites Verb, modal"),
+ VMINF V ("Infinitiv, modal"),
+ VMPP V ("Partizip Perfekt, modal"),
+ XY O ("Nichtwort, Sonderzeichen"),
+ UNDEFINED O ("Nicht definiert, zb. Satzzeichen");
@@ -0,0 +1,13 @@
+ Adj G Adjective
+ Adv A Adverb
+ Art D Article determiner
+ Conj C Conjunction
+ Int O Interjection
+ N N Noun
+ Num O Numeral
+ Misc O Miscelaneous
+ Prep P Preposition
+ Pron Q Pronoun
+ Punc O Punctuation
+ V V Verb
+
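Both mapping files are three-column, tab-separated tables: the tagger's native tag (Wotan for Dutch, the German tagset above), the single-letter OpenER/KAF part-of-speech code, and a description. The core script loads the relevant file into a dictionary and falls back to 'O' for unknown tags; a minimal standalone sketch of that lookup, mirroring `map_pos_tag` in `core/pos-tagger_open-nlp.py`:

```python
def load_pos_mapping(path):
    """Parse a three-column, tab-separated mapping file into {native_tag: kaf_tag}."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.strip().split('\t')
            if len(fields) == 3:
                native_tag, kaf_tag, _description = fields
                mapping[native_tag] = kaf_tag
    return mapping

wotan_to_kaf = load_pos_mapping('mapping.postag.wotan.to.opener.csv')
print(wotan_to_kaf.get('Adj', 'O'))  # => 'G'
print(wotan_to_kaf.get('XYZ', 'O'))  # unknown tags default to 'O'
```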
@@ -0,0 +1,35 @@
+ #!/bin/sh
+
+ # Licensed to the Apache Software Foundation (ASF) under one
+ # or more contributor license agreements. See the NOTICE file
+ # distributed with this work for additional information
+ # regarding copyright ownership. The ASF licenses this file
+ # to you under the Apache License, Version 2.0 (the
+ # "License"); you may not use this file except in compliance
+ # with the License. You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing,
+ # software distributed under the License is distributed on an
+ # # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ # KIND, either express or implied. See the License for the
+ # specific language governing permissions and limitations
+ # under the License.
+
+ # Note: Do not output anything in this script file, any output
+ # may be inadvertantly placed in any output files if
+ # output redirection is used.
+
+ if [ -z "$JAVACMD" ] ; then
+   if [ -n "$JAVA_HOME" ] ; then
+     JAVACMD="$JAVA_HOME/bin/java"
+   else
+     JAVACMD="`which java`"
+   fi
+ fi
+
+ # Might fail if $0 is a link
+ OPENNLP_HOME=`dirname "$0"`/..
+
+ $JAVACMD -Xmx1024m -jar $OPENNLP_HOME/lib/opennlp-tools-*.jar $@
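This launcher only locates a Java runtime and forwards its arguments to the bundled `opennlp-tools` jar with `-Xmx1024m`. The core tagger drives it through a pipe, passing the `POSTagger` tool name and a model file and feeding one whitespace-tokenised sentence on stdin; a minimal sketch of that call, mirroring the subprocess invocation in `core/pos-tagger_open-nlp.py` (paths assumed relative to `core/`):

```python
import os
import subprocess

OPENNLP_BIN = os.path.join('opennlp', 'bin', 'opennlp')
MODEL = os.path.join('opennlp', 'models', 'nl-pos-maxent.bin')

def tag_sentence(sentence):
    """Run OpenNLP's POSTagger over one whitespace-tokenised sentence."""
    proc = subprocess.Popen(
        [OPENNLP_BIN, 'POSTagger', MODEL],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _err = proc.communicate(sentence.encode('utf-8'))
    # The tool emits "token_TAG token_TAG ..." for each input line.
    return out.strip().decode('utf-8')

print(tag_sentence('Dit is een test .'))
```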
@@ -0,0 +1,35 @@
+ @ECHO off
+
+ REM # Licensed to the Apache Software Foundation (ASF) under one
+ REM # or more contributor license agreements. See the NOTICE file
+ REM # distributed with this work for additional information
+ REM # regarding copyright ownership. The ASF licenses this file
+ REM # to you under the Apache License, Version 2.0 (the
+ REM # "License"); you may not use this file except in compliance
+ REM # with the License. You may obtain a copy of the License at
+ REM #
+ REM # http://www.apache.org/licenses/LICENSE-2.0
+ REM #
+ REM # Unless required by applicable law or agreed to in writing,
+ REM # software distributed under the License is distributed on an
+ REM # # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ REM # KIND, either express or implied. See the License for the
+ REM # specific language governing permissions and limitations
+ REM # under the License.
+
+ REM # Note: Do not output anything in this script file, any output
+ REM # may be inadvertantly placed in any output files if
+ REM # output redirection is used.
+
+ IF "%JAVA_CMD%" == "" (
+   IF "%JAVA_HOME%" == "" (
+     SET JAVA_CMD=java
+   ) ELSE (
+     SET JAVA_CMD=%JAVA_HOME%\bin\java
+   )
+ )
+
+ REM # Should work with Windows XP and greater. If not, specify the path to where it is installed.
+ IF "%OPENNLP_HOME%" == "" SET OPENNLP_HOME=%~sp0..
+
+ %JAVA_CMD% -Xmx4096m -jar %OPENNLP_HOME%\lib\opennlp-tools-*.jar %*
@@ -0,0 +1,160 @@
+ #!/usr/bin/env python
+ #-*- coding: utf-8 *-*
+ # Ruben Izquierdo
+ # Vrije University of Amsterdam
+
+ import os
+ import sys
+ import operator
+ import time
+ import getopt
+ import string
+ import subprocess
+
+ os.environ["LC_CTYPE"] = 'en_US.UTF-8'
+
+ this_folder = os.path.dirname(os.path.realpath(__file__))
+ opennlp_folder = os.path.join(this_folder, 'opennlp')
+ model_folder = os.path.join(opennlp_folder, 'models')
+
+ # This updates the load path to ensure that the local site-packages directory
+ # can be used to load packages (e.g. a locally installed copy of lxml).
+ sys.path.append(os.path.join(this_folder, 'site-packages/pre_build'))
+ sys.path.append(os.path.join(this_folder, 'site-packages/pre_install'))
+
+ # Config for Dutch
+ pos_model_nl = 'nl-pos-maxent.bin'
+ mapping_pos_filename_nl = 'mapping.postag.wotan.to.opener.csv'
+
+ # Config for German
+ pos_model_de = 'de-pos-maxent.bin'
+ mapping_pos_filename_de = 'mapping.postag.stss.to.opener.csv'
+
+ mapping_postag_to_kaf = None
+ mapping_pos_filename = ""
+ __version__ = '2-May-2013'
+
+ from lxml.etree import ElementTree as ET, Element as EL, PI
+ from VUKafParserPy.KafParserMod import KafParser
+ from token_matcher import token_matcher
+
+ def map_pos_tag(pos):
+     global mapping_postag_to_kaf
+     if mapping_postag_to_kaf is None:
+         mapping_postag_to_kaf = {}
+         file_mapping = os.path.join(this_folder,mapping_pos_filename)
+         fic = open(file_mapping,'r')
+         for line in fic:
+             fields = line.strip().split('\t')
+             if len(fields)==3:
+                 wotan_pos = fields[0]
+                 kaf_pos = fields[1]
+                 mapping_postag_to_kaf[wotan_pos] = kaf_pos
+         fic.close()
+     opener_pos = mapping_postag_to_kaf.get(pos,'O')
+     return opener_pos
+
+
+ if __name__=='__main__':
+
+     if sys.stdin.isatty():
+         print>>sys.stderr,'Input stream required.'
+         print>>sys.stderr,'Example usage: cat myUTF8file.kaf |',sys.argv[0]
+         sys.exit(-1)
+
+     time_stamp = True
+     try:
+         opts, args = getopt.getopt(sys.argv[1:],"l:",["no-time"])
+         for opt, arg in opts:
+             if opt == "--no-time":
+                 time_stamp = False
+     except getopt.GetoptError:
+         pass
+
+
+     input_kaf = KafParser(sys.stdin)
+     my_lang = input_kaf.getLanguage()
+
+     if my_lang == 'nl':
+         pos_model= pos_model_nl
+         mapping_pos_filename= mapping_pos_filename_nl
+     elif my_lang =='de':
+         pos_model = pos_model_de
+         mapping_pos_filename = mapping_pos_filename_de
+     else:
+         print>>sys.stdout,'The language of the input KAF is "'+my_lang+'" and only can be Dutch (nl) or German (de)'
+         sys.exit(-1)
+
+
+
+
+
+     ## Create the input text for
+     reference_tokens = []
+     sentences = []
+     prev_sent='-200'
+     aux = []
+     for word, sent_id, w_id in input_kaf.getTokens():
+         if sent_id != prev_sent:
+             if len(aux) != 0:
+                 sentences.append(aux)
+                 aux = []
+         aux.append((word,w_id))
+
+         prev_sent = sent_id
+     if len(aux)!=0:
+         sentences.append(aux)
+
+     for sentence in sentences:
+         text = ' '.join(t for t,_ in sentence).encode('utf-8')
+         cmd = [os.path.join(opennlp_folder,'bin/opennlp'), 'POSTagger',os.path.join(model_folder,pos_model)]
+         try:
+             proc = subprocess.Popen(cmd,stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+             proc.stdin.write(text)
+             proc.stdin.close()
+             text_with_pos = proc.stdout.read().strip().decode('utf-8') ## variable is unicode
+             proc.terminate()
+
+         except Exception as e:
+             print>>sys.stderr,str(e)
+             sys.exit(-1)
+
+         data = {}
+         new_tokens = []
+         for n, token in enumerate(text_with_pos.split(' ')):
+             position = token.rfind('_')
+             lemma = token[:position]
+             pos = token[position+1:]
+             my_id='t_'+str(n)
+             data[my_id] = (lemma,pos)
+             new_tokens.append((lemma,my_id))
+
+         mapping_tokens = {}
+         token_matcher(sentence,new_tokens,mapping_tokens)
+         for token_new,id_new in new_tokens:
+             lemma,pos = data[id_new]
+             opener_pos = map_pos_tag(pos)
+             if opener_pos in ['N','R','G','V','A','O']:
+                 type_term = 'open'
+             else:
+                 type_term = 'close'
+             ele_term = EL('term',attrib={'tid':id_new,
+                                          'type':type_term,
+                                          'pos':opener_pos,
+                                          'morphofeat':pos,
+                                          'lemma':lemma})
+             ref_tokens = mapping_tokens[id_new]
+             ele_span = EL('span')
+             for ref_token in ref_tokens:
+                 eleTarget = EL('target',attrib={'id':ref_token})
+                 ele_span.append(eleTarget)
+             ele_term.append(ele_span)
+
+             input_kaf.addElementToLayer('terms', ele_term)
+
+     input_kaf.addLinguisticProcessor('Open nlp pos tagger','1.0', 'term', time_stamp)
+     input_kaf.saveToFile(sys.stdout)
+     sys.exit(0)
+
+
+
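For reference, each iteration of the term loop above appends a KAF `<term>` element whose `<span>` points back at the original word-form tokens. A minimal sketch of that structure built with lxml (the attribute values are illustrative, not taken from a real document):

```python
from lxml.etree import Element, SubElement, tostring

term = Element('term', attrib={
    'tid': 't_0',        # id assigned per tagged token
    'type': 'open',      # 'open' for N/R/G/V/A/O, 'close' otherwise
    'pos': 'N',          # mapped OpenER/KAF tag
    'morphofeat': 'NN',  # original tag emitted by OpenNLP
    'lemma': 'test',
})
span = SubElement(term, 'span')
SubElement(span, 'target', attrib={'id': 'w_1'})  # reference to a token in the text layer
print(tostring(term, pretty_print=True).decode())
```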