tokenizer 0.1.1 → 0.1.2
- checksums.yaml +7 -0
- data/README.rdoc +31 -14
- data/bin/tokenize +1 -1
- data/lib/tokenizer/tokenizer.rb +39 -17
- data/lib/tokenizer/version.rb +1 -1
- data/test/{test_by_tokenizer_dev.rb → development_tests/test_by_tokenizer_dev.rb} +0 -0
- data/test/{test_de_tokenizer_dev.rb → development_tests/test_de_tokenizer_dev.rb} +45 -45
- data/test/{test_en_tokenizer_dev.rb → development_tests/test_en_tokenizer_dev.rb} +2 -0
- data/test/{test_fr_tokenizer_dev.rb → development_tests/test_fr_tokenizer_dev.rb} +0 -0
- data/test/{test_it_tokenizer_dev.rb → development_tests/test_it_tokenizer_dev.rb} +0 -0
- data/test/development_tests/test_parameters.rb +26 -0
- data/test/{test_ru_tokenizer_dev.rb → development_tests/test_ru_tokenizer_dev.rb} +0 -0
- data/test/{test_de_tokenizer.rb → regression_tests/test_de_tokenizer.rb} +5 -5
- metadata +52 -98
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: abb8db238956f4cbc491e75ec47f7994e572df1d
+  data.tar.gz: 4d69f5ae6fe9c6411b45946098700034c01152fe
+SHA512:
+  metadata.gz: fcf399eb94f200fa1a682dc64193fcb3bef391e2db7eece38f1603181d139368a2036e14058b59c26b75f4643517f10c5f4508a27830561ed197bfa941fa4ad3
+  data.tar.gz: 12f59222b26ec7987f971679b8d7be32fc86468b0741afed1b8607eb3049902082d93c614ce83358e6777a61fa4326c4bd11d5ebc9f12d751eb4fb11d678a8b7
data/README.rdoc
CHANGED
@@ -1,18 +1,29 @@
 = Tokenizer
 
-
-
-
-
+{RubyGems}[http://rubygems.org/gems/tokenizer] |
+{Homepage}[http://bu.chsta.be/projects/tokenizer] |
+{Source Code}[https://github.com/arbox/tokenizer] |
+{Bug Tracker}[https://github.com/arbox/tokenizer/issues]
+
+{<img src="https://img.shields.io/gem/v/tokenizer.svg" alt="Gem Version" />}[https://rubygems.org/gems/tokenizer]
+{<img src="https://img.shields.io/travis/arbox/tokenizer.svg" alt="Build Status" />}[https://travis-ci.org/arbox/tokenizer]
+{<img src="https://img.shields.io/codeclimate/github/arbox/tokenizer.svg" alt="Code Climate" />}[https://codeclimate.com/github/arbox/tokenizer]
+{<img src="https://img.shields.io/gemnasium/arbox/tokenizer.svg" alt="Dependency Status" />}[https://gemnasium.com/arbox/tokenizer]
 
 == DESCRIPTION
-A simple multilingual tokenizer -- a linguistic tool intended
-
-
-
-
+A simple multilingual tokenizer -- a linguistic tool intended to split a text
+into tokens for NLP tasks. This tool provides a CLI and a library for
+linguistic tokenization which is an anavoidable step for many HLT (Human
+Language Technology) tasks in the preprocessing phase for further syntactic,
+semantic and other higher level processing goals.
+
+Tokenization task involves Sentence Segmentation, Word Segmentation and Boundary
+Disambiguation for the both tasks.
+
+Use it for tokenization of German, English and Dutch texts.
 
-
+=== Implemented Algorithms
+to be ...
 
 == INSTALLATION
 +Tokenizer+ is provided as a .gem package. Simply install it via
@@ -26,7 +37,6 @@ If you want to do a system wide installation, do this as root
 
 Alternatively use your Gemfile for dependency management.
 
-
 == SYNOPSIS
 
 You can use +Tokenizer+ in two ways.
@@ -39,12 +49,19 @@ You can use +Tokenizer+ in two ways.
   $ de_tokenizer.tokenize('Ich gehe in die Schule!')
   $ => ["Ich", "gehe", "in", "die", "Schule", "!"]
 
+* Customizable PRE and POST list
+  $ require 'tokenizer'
+  $ de_tokenizer = Tokenizer::Tokenizer.new(:de, { POST: Tokenizer::Tokenizer::POST + ['|'] })
+  $ de_tokenizer.tokenize('Ich gehe|in die Schule!')
+  $ => ["Ich", "gehe", "|in", "die", "Schule", "!"]
+
 See documentation in the Tokenizer::Tokenizer class for details
 on particular methods.
 
 == SUPPORT
 
-If you have question, bug reports or any suggestions, please drop me an email :)
+If you have question, bug reports or any suggestions, please drop me an email :)
+Any help is deeply appreciated!
 
 == CHANGELOG
 For details on future plan and working progress see CHANGELOG.rdoc.
@@ -59,5 +76,5 @@ Please contact me with your suggestions, bug reports and feature requests.
 
 +Tokenizer+ is a copyrighted software by Andrei Beliankou, 2011-
 
-You may use, redistribute and change it under the terms
-
+You may use, redistribute and change it under the terms provided
+in the LICENSE.rdoc file.
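Taken together, the updated SYNOPSIS amounts to the following usage. This is an illustrative sketch based on the README text above (assuming the released 0.1.2 gem is installed), not an excerpt from the diff itself:

    # Minimal library usage per the README synopsis; assumes `gem install tokenizer`.
    require 'tokenizer'

    de_tokenizer = Tokenizer::Tokenizer.new(:de)
    p de_tokenizer.tokenize('Ich gehe in die Schule!')
    # README-documented result: ["Ich", "gehe", "in", "die", "Schule", "!"]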
data/bin/tokenize
CHANGED
data/lib/tokenizer/tokenizer.rb
CHANGED
@@ -1,42 +1,64 @@
 # -*- coding: utf-8 -*-
-
 # :title: A simple Tokenizer for NLP Tasks.
 # :main: README.rdoc
 
 # A namespace for all project related stuff.
 module Tokenizer
-
   class Tokenizer
     FS = Regexp.new('[[:blank:]]+')
-    # PRE = '[{(\\`"‚„†‡‹‘’“”•–—›'
-    # POST = %w| ] } ' ` " ) , ; : \ ! \ ? \ % ‚ „ … † ‡ ‰ ‹ ‘ ’ “ ” • – — › |
-    POST = %w{! ? , : ; . )}
-    PRE = %w{(}
 
-    def initialize(lang = :de)
+    # spanish marks
+    SIMPLE_PRE = []
+    PAIR_PRE = ['(', '{', '[', '<']
+    SIMPLE_POST = ['!', '?', ',', ':', ';', '.']
+    PAIR_POST = [')', '}', ']', '>']
+    PRE_N_POST = ['"', "'"]
+
+    PRE = SIMPLE_PRE + PAIR_PRE
+    POST = SIMPLE_POST + PAIR_POST
+
+    def initialize(lang = :de, options = {})
       @lang = lang
+      @options = {
+        pre: SIMPLE_PRE + PAIR_PRE,
+        post: SIMPLE_POST + PAIR_POST,
+        pre_n_post: PRE_N_POST
+      }.merge(options)
     end
 
     def tokenize(str)
-      tokens = []
-      token = ''
       output = ''
-      fields = str.chomp.split(FS)
+
+      fields = str.chomp.split(FS)
+
+      return [''] if fields.empty?
+
       fields.each do |field|
-        field.each_char do |ch|
-          if POST.include?(ch)
-            output << "\n#{ch}"
-          elsif PRE.include?(ch)
+        field.each_char.with_index do |ch, idx|
+          case
+          when @options[:pre].include?(ch)
             output << "#{ch}\n"
+          when @options[:post].include?(ch)
+            output << "\n#{ch}"
+            if ['?', '!', '.'].include?(ch)
+              output << "\n"
+            end
+          when @options[:pre_n_post].include?(ch)
+            if idx == 0
+              output << "#{ch}\n"
+            elsif idx != 0
+              output << "\n#{ch}"
+            end
           else
            output << ch
          end
        end
+
        output << "\n"
      end
-      output.split("\n")
-    end
 
+      # @TODO: Rework the format of the string!
+      output.chomp('').split("\n", -1)
+    end
   end # class
-
 end # module
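For readers skimming the hunk above: the new tokenize splits the input on blanks and then walks each field character by character, classifying every character against the configurable pre, post and pre_n_post lists. The standalone sketch below mirrors that logic for illustration only; it is a simplified rewrite, not the gem's source, and it leaves out the trailing-empty-string handling done by chomp('') and split("\n", -1) in the real method:

    # Illustrative re-implementation of the classification logic above (toy sketch).
    PRE        = ['(', '{', '[', '<']             # opening pairs: newline after
    POST       = ['!', '?', ',', ':', ';', '.',   # closing punctuation: newline before;
                  ')', '}', ']', '>']             # sentence enders also get one after
    PRE_N_POST = ['"', "'"]                       # ambiguous quotes: position decides

    def toy_tokenize(str)
      output = ''
      str.chomp.split(/[[:blank:]]+/).each do |field|
        field.each_char.with_index do |ch, idx|
          if PRE.include?(ch)
            output << "#{ch}\n"
          elsif POST.include?(ch)
            output << "\n#{ch}"
            output << "\n" if ['?', '!', '.'].include?(ch)
          elsif PRE_N_POST.include?(ch)
            output << (idx.zero? ? "#{ch}\n" : "\n#{ch}")
          else
            output << ch
          end
        end
        output << "\n"
      end
      output.split("\n")
    end

    p toy_tokenize('Ich gehe in die Schule!')
    # => ["Ich", "gehe", "in", "die", "Schule", "!"]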
data/lib/tokenizer/version.rb
CHANGED
File without changes
data/test/development_tests/test_de_tokenizer_dev.rb
CHANGED
@@ -19,262 +19,262 @@ class TestTokenizerDev < Test::Unit::TestCase
     etalon = %w{Die deutschen Umlaute und Sonderzeichen , wie in Mäuse , Scheiß und Tütchen , sind blöd !}
     compare(etalon, input)
   end
-
+
   def test_tokenization_003
     input = "Abkürzungen, wie z.B. usw. und d.h. können zu Problemem führen."
     etalon = %w{Abkürzungen , wie z.B. usw. und d.h. können zu Problemem führen .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_004
     input = "Es gibt mehr als 1.023.345 Menschen in Deutschland, die keine Tausenderpunkte verstehen."
     etalon = %w{Es gibt mehr als 1.023.345 Menschen in Deutschland , die keine Tausenderpunkte verstehen .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_005
     input = "Cocktails, wie Apfel-Martini, Rum-Kirsche-Cola und andere, bereiten nicht nur Menschen Probleme."
     etalon = %w{ Cocktails , wie Apfel-Martini , Rum-Kirsche-Cola und andere , bereiten nicht nur Menschen Probleme . }
     compare(etalon, input)
   end
-
+
   def test_tokenization_006
     input = 'Es gibt viele verschiedene Zeichen, die noch in Texten vorkommen können wie - zum Beispiel - diese hier "text" oder (text).'
     etalon = %w{Es gibt viele verschiedene Zeichen , die noch in Texten vorkommen können wie - zum Beispiel - diese hier " text " oder ( text ) .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_007
     input = "Abkürzungen sind immer ein Problem, da auch Leerzeichen dazwischen stehen können, wie z. B. hier."
     etalon = ["Abkürzungen", "sind", "immer", "ein", "Problem", ",", "da", "auch", "Leerzeichen", "dazwischen", "stehen", "können", ",", "wie", "z. B.", "hier", "."]
     compare(etalon, input)
   end
-
+
   def test_tokenization_008
     input = "Außerdem kann es nach Abkürzungen und Satzenden auch mit Großschreibung weiter gehen, bei z.B. Aufzählungen."
     etalon = %w{Außerdem kann es nach Abkürzungen und Satzenden auch mit Großschreibung weiter gehen , bei z.B. Aufzählungen .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_009
     input = "Ein weiteres Problem sind solche Getrennt- und Zusammenschreibungen."
     etalon = %w{Ein weiteres Problem sind solche Getrenntschreibungen und Zusammenschreibungen .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_010
     input = "In manchen Texten gibt es auch Worttrennung am Zeilen- ende."
     etalon = %w{In manchen Texten gibt es auch Worttrennung am Zeilenende .}
     compare(etalon, input)
   end
-
+
   def test_tokenization_011 #Ellipsis
     input = "Der Satz endet in einer Ellips..."
     etalon = %w{ Der Satz endet in einer Ellips... } #die elliptischen Punkte sollten nicht vom Wort getrennt werden
     compare(etalon, input)
   end
-
+
   def test_tokenization_012 #Fehlende Leerzeichen
     input = "Der Satz endet.Das Leerzeichen fehlt."
     etalon = %w{ Der Satz endet . Das Leerzeichen fehlt . } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
     compare(etalon, input)
   end
-
+
   def test_tokenization_013 #Bindestriche
     input = "Das Bindeglied - manisch-depressives Verhalten, binden-verbinden"
     etalon = %w{ Das Bindeglied - manisch-depressives Verhalten , binden - verbinden}
     compare(etalon, input)
   end
-
+
   def test_tokenization_014 #Abkuerzungen
     input = "Der Satz enthielt z.B. Fehler"
     etalon = %w{ Der Satz enthielt z.B. Fehler } #/\.\s(?=[A-Z])/ wuerde hinter Punkt den Satz beenden
     compare(etalon, input)
   end
-
+
   def test_tokenization_015 #Fehlende Grossbuchstaben
     input = "Der Satz endet. der Satz beginnt"
     etalon = %w{ Der Satz endet . der Satz beginnt } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
     compare(etalon, input)
   end
-
+
   def test_tokenization_016 #Franzoesisch
     input = "L'art de l'univers, c'est un art"
     etalon = %w{ L' art de l' univers , c'est un art } #Kontrovers!
     compare(etalon, input)
   end
-
+
   def test_tokenization_017 #James Bond
     input = "Bond,... James Bond."
     etalon = %w{ Bond , ... James Bond . } #Kontrovers!
     compare(etalon, input)
   end
-
+
   def test_tokenization_018 #Inches
     input = "The square had four 9\" sides"
     etalon = %w{ The square had four 9" sides }
     compare(etalon, input)
   end
-
+
   def test_tokenization_019 #Abkuerzung zugleich Lexikon-Eintrag
     input = "In fig. 3, a fig can be seen. Fig. no. 4 shows no fig."
     etalon = %w{ In fig. 3 , a fig can be seen . Fig. no. 4 shows no fig . } #fig sowohl als Abkuerzung als auch als Wort
     compare(etalon, input)
   end
-
+
   def test_tokenization_020 #Leerzeichen-getrennte Zusammengehörigkeiten
     input = "They booked the flight New York-Los Angeles"
     etalon = ["They", "booked", "the", "flight", "New York", "-", "Los Angeles"] #oder mit Bindestrich verbunden
     compare(etalon, input)
   end
-
+
   def test_tokenization_021 #Ordinale
     input = "Der 1. Platz ging an den Sieger"
     etalon = %w{ Der 1. Platz ging an den Sieger }
     compare(etalon, input)
   end
-
+
   def test_tokenization_022 #Klitika
     input = "Er war's, stimmt's?"
     etalon = %w{ Er war es , stimmt es ? } #Kontrovers! Benoetigt komplexere Analyse
     compare(etalon, input)
   end
-
+
   def test_tokenization_023 #Datums- und Zeitangaben
     input = "Es passierte am 13. Januar 2011 um 12:13 Uhr"
     etalon = [ "Es", "passierte", "am", "13. Januar 2011", "um", "12:13 Uhr"]
     compare(etalon, input)
   end
-
+
   def test_tokenization_024 #Eingebettete Saetze
     input = "\"This is all?\" George asked."
     etalon = %w{ This is all ? George asked . } #kann zu ungrammatischen Saetzen fuehren
     compare(etalon, input)
   end
-
+
   def test_tokenization_025 #Eingebettete Saetze 2
     input = "\"Das ist alles?\" fragte sie."
     etalon = %w{ Das ist alles ? fragte sie . } #ungrammatischer Satz "fragte sie."
     compare(etalon, input)
   end
-
-
+
+
   def test_tokenization_026
     input = "Die deutschen Umlaute und Sonderzeichen, wie in Mäuse, Scheiß und Tütchen, sind blöd!"
     etalon = %w{ Die deutschen Umlaute und Sonderzeichen , wie in Mäuse , Scheiß und Tütchen , sind blöd ! }
     compare(etalon, input)
   end
-
+
   def test_tokenization_027
     input = "Abkürzungen, wie z.B. usw. und d.h. können zu Problemem führen."
     etalon = %w{ Abkürzungen , wie z.B. usw. und d.h. können zu Problemem führen . }
     compare(etalon, input)
   end
-
+
   def test_tokenization_028
     input = "Es gibt mehr als 1.023.345 Menschen in Deutschland, die keine Tausenderpunkte verstehen."
     etalon = %w{ Es gibt mehr als 1.023.345 Menschen in Deutschland , die keine Tausenderpunkte verstehen . }
     compare(etalon, input)
   end
-
+
   def test_tokenization_029
     input = "Cocktails, wie Apfel-Martini, Rum-Kirsche-Cola und andere, bereiten nicht nur Menschen Probleme."
     etalon = %w{ Cocktails , wie Apfel-Martini , Rum-Kirsche-Cola und andere , bereiten nicht nur Menschen Probleme . }
     compare(etalon, input)
   end
-
+
   def test_tokenization_030 #Ellipsis
     input = "Der Satz endet in einer Ellips..."
     etalon = %w{ Der Satz endet in einer Ellips... } #die elliptischen Punkte sollten nicht vom Wort getrennt werden
     compare(etalon, input)
   end
-
+
   def test_tokenization_031 #Fehlende Leerzeichen
     input = "Der Satz endet.Das Leerzeichen fehlt."
     etalon = %w{ Der Satz endet . Das Leerzeichen fehlt . } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
     compare(etalon, input)
   end
-
+
   def test_tokenization_032 #Bindestriche
     input = "Das Bindeglied - manisch-depressives Verhalten, binden-verbinden"
     etalon = %w{ Das Bindeglied - manisch-depressives Verhalten , binden - verbinden}
     compare(etalon, input)
   end
-
+
   def test_tokenization_033 #Abkuerzungen
     input = "Der Satz enthielt z.B. Fehler"
     etalon = %w{ Der Satz enthielt z.B. Fehler } #/\.\s(?=[A-Z])/ wuerde hinter Punkt den Satz beenden
     compare(etalon, input)
   end
-
+
   def test_tokenization_034 #Fehlende Grossbuchstaben
     input = "Der Satz endet. der Satz beginnt"
     etalon = %w{ Der Satz endet . der Satz beginnt } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
     compare(etalon, input)
   end
-
+
   def test_tokenization_035 #Franzoesisch
     input = "L'art de l'univers, c'est un art"
     etalon = %w{ L' art de l' univers , c'est un art } #Kontrovers!
     compare(etalon, input)
   end
-
+
   def test_tokenization_036 #James Bond
     input = "Bond,... James Bond."
     etalon = %w{ Bond , ... James Bond . } #Kontrovers!
     compare(etalon, input)
   end
-
+
   def test_tokenization_037 #Inches
     input = "The square had four 9\" sides"
     etalon = %w{ The square had four 9" sides }
     compare(etalon, input)
   end
-
+
   def test_tokenization_039 #Abkuerzung zugleich Lexikon-Eintrag
     input = "In fig. 3, a fig can be seen. Fig. no. 4 shows no fig."
     etalon = %w{ In fig. 3 , a fig can be seen . Fig. no. 4 shows no fig . } #fig sowohl als Abkuerzung als auch als Wort
     compare(etalon, input)
   end
-
+
   def test_tokenization_040 #Leerzeichen-getrennte Zusammengehörigkeiten
     input = "They booked the flight New York-Los Angeles"
     etalon = ["They", "booked", "the", "flight", "New York", "-", "Los Angeles"] #oder mit Bindestrich verbunden
     compare(etalon, input)
   end
-
+
   def test_tokenization_041 #Ordinale
     input = "Der 1. Platz ging an den Sieger"
     etalon = %w{ Der 1. Platz ging an den Sieger }
     compare(etalon, input)
   end
-
+
   def test_tokenization_042 #Klitika
     input = "Er war's, stimmt's?"
     etalon = %w{ Er war es , stimmt es ? } #Kontrovers! Benoetigt komplexere Analyse
     compare(etalon, input)
   end
-
+
   #Datums- und Zeitangaben
-  def test_tokenization_043
+  def test_tokenization_043
     input = "Es passierte am 13. Januar 2011 um 12:13 Uhr"
     etalon = ["Es", "passierte", "am", "13. Januar 2011", "um", "12:13 Uhr"]
     compare(etalon, input)
   end
-
+
   #Eingebettete Sätze
   def test_tokenization_044
     input = '"This is all?" George asked.'
     etalon = %w{ This is all ? George asked . } #kann zu ungrammatischen Saetzen fuehren
     compare(etalon, input)
   end
-
+
   def test_tokenization_046 #Eingebettete Saetze 2
     input = '"Das ist alles?" fragte sie.'
     etalon = %w{Das ist alles ? fragte sie .} #ungrammatischer Satz "fragte sie."
     compare(etalon, input)
   end
-
+
   private
   def compare(exp_result, input)
     act_result = @de_tokenizer.tokenize(input)
File without changes
File without changes
data/test/development_tests/test_parameters.rb
ADDED
@@ -0,0 +1,26 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+
+class TestTokenizerDev < Test::Unit::TestCase
+
+  def setup
+    @en_tokenizer = Tokenizer::Tokenizer.new(:en, {PRE: [], POST: ['|']})
+  end
+
+  def test_tokenization_001
+    result = @en_tokenizer.tokenize('testing| new')
+    assert_equal(['testing', '|', 'new', ''], result)
+  end
+
+  def test_tokenization_002
+    result = @en_tokenizer.tokenize('testing, new')
+    assert_equal(['testing,', 'new', ''], result)
+  end
+
+  private
+  def compare(exp_result, input)
+    act_result = @de_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end
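The new development test exercises the options hash introduced in Tokenizer::Tokenizer#initialize by swapping in custom PRE and POST lists. Outside the test harness the same idea reads as follows; this is a sketch, and the expected arrays are simply the ones asserted by the test above, not re-verified here:

    # Sketch mirroring test_parameters.rb; assumes the 0.1.2 gem is installed.
    require 'tokenizer'

    # Custom PRE/POST lists passed as options, exactly as in the test's setup method.
    en_tokenizer = Tokenizer::Tokenizer.new(:en, { PRE: [], POST: ['|'] })

    en_tokenizer.tokenize('testing| new')  # the test expects ['testing', '|', 'new', '']
    en_tokenizer.tokenize('testing, new')  # the test expects ['testing,', 'new', '']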
File without changes
data/test/regression_tests/test_de_tokenizer.rb
CHANGED
@@ -1,14 +1,14 @@
-require 'test/unit'
+require 'minitest/autorun'
 require 'tokenizer'
 
-class TestTokenizer < Test::Unit::TestCase
+class TestTokenizer < Minitest::Test
 
   def setup
     @de_tokenizer = Tokenizer::Tokenizer.new(:de)
   end
 
   def test_constants
-    assert(Tokenizer::VERSION.is_a?(String) && !
+    assert(Tokenizer::VERSION.is_a?(String) && !Tokenizer::VERSION.empty?)
   end
 
   def test_output_type
@@ -17,8 +17,8 @@ class TestTokenizer < Test::Unit::TestCase
   end
 
   def test_tokenization_001
-    input = '
-    etalon = %w
+    input = 'Ich ging in die Schule!'
+    etalon = %w(Ich ging in die Schule !)
     output = @de_tokenizer.tokenize(input)
     assert_equal(etalon, output)
   end
metadata
CHANGED
@@ -1,121 +1,75 @@
---- !ruby/object:Gem::Specification
+--- !ruby/object:Gem::Specification
 name: tokenizer
-version: !ruby/object:Gem::Version
-
-  version: 0.1.1
+version: !ruby/object:Gem::Version
+  version: 0.1.2
 platform: ruby
-authors:
+authors:
 - Andrei Beliankou
 autorequire:
 bindir: bin
 cert_chain: []
-
-
-
-
-
-
-
-
-
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: 3.9.1
-  type: :development
-  version_requirements: *id001
-- !ruby/object:Gem::Dependency
-  name: rake
-  prerelease: false
-  requirement: &id002 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - "="
-      - !ruby/object:Gem::Version
-        version: 0.8.7
-  type: :development
-  version_requirements: *id002
-- !ruby/object:Gem::Dependency
-  name: yard
-  prerelease: false
-  requirement: &id003 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: "0"
-  type: :development
-  version_requirements: *id003
-- !ruby/object:Gem::Dependency
-  name: bundler
-  prerelease: false
-  requirement: &id004 !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: "0"
-  type: :development
-  version_requirements: *id004
-description: A simple multilingual tokenizer for NLP tasks. This tool provides a CLI and a library for linguistic tokenization which is an anavoidable step for many HLT (human language technology) tasks in the preprocessing phase for further syntactic, semantic and other higher level processing goals. Use it for tokenization of German, English and French texts.
-email: a.belenkow@uni-trier.de
-executables:
+date: 2015-09-03 00:00:00.000000000 Z
+dependencies: []
+description: A simple multilingual tokenizer for NLP tasks. This tool provides a CLI
+  and a library for linguistic tokenization which is an anavoidable step for many
+  HLT (human language technology) tasks in the preprocessing phase for further syntactic,
+  semantic and other higher level processing goals. Use it for tokenization of German,
+  English and French texts.
+email: arbox@yandex.ru
+executables:
 - tokenize
 extensions: []
-
-
+extra_rdoc_files:
+- CHANGELOG.rdoc
+- LICENSE.rdoc
 - README.rdoc
+files:
+- ".yardopts"
 - CHANGELOG.rdoc
 - LICENSE.rdoc
-
+- README.rdoc
+- bin/tokenize
 - lib/tokenizer.rb
 - lib/tokenizer/tokenizer.rb
 - lib/tokenizer/version.rb
--
--
--
-- .
-- test/
-- test/
-- test/
-- test/
-
-- test/test_fr_tokenizer_dev.rb
-- test/test_ru_tokenizer_dev.rb
-- bin/tokenize
-has_rdoc: true
-homepage: http://www.uni-trier.de/index.php?id=34451
+- test/development_tests/test_by_tokenizer_dev.rb
+- test/development_tests/test_de_tokenizer_dev.rb
+- test/development_tests/test_en_tokenizer_dev.rb
+- test/development_tests/test_fr_tokenizer_dev.rb
+- test/development_tests/test_it_tokenizer_dev.rb
+- test/development_tests/test_parameters.rb
+- test/development_tests/test_ru_tokenizer_dev.rb
+- test/regression_tests/test_de_tokenizer.rb
+homepage: https://github.com/arbox/tokenizer
 licenses: []
-
+metadata: {}
 post_install_message:
 rdoc_options: []
-
-require_paths:
+require_paths:
 - lib
-required_ruby_version: !ruby/object:Gem::Requirement
-
-  requirements:
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
   - - ">="
-    - !ruby/object:Gem::Version
-      version: 1.
-required_rubygems_version: !ruby/object:Gem::Requirement
-
-  requirements:
+    - !ruby/object:Gem::Version
+      version: 1.9.3
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
   - - ">="
-    - !ruby/object:Gem::Version
-      version:
+    - !ruby/object:Gem::Version
+      version: '0'
 requirements: []
-
 rubyforge_project: tokenizer
-rubygems_version:
+rubygems_version: 2.4.5
 signing_key:
-specification_version:
-summary: Tokenizer is a
-test_files:
-- test/
-- test/test_de_tokenizer_dev.rb
-- test/
-- test/
-- test/test_it_tokenizer_dev.rb
-- test/
-- test/test_ru_tokenizer_dev.rb
+specification_version: 4
+summary: Tokenizer is a tool intended to split a text into tokens.
+test_files:
+- test/development_tests/test_by_tokenizer_dev.rb
+- test/development_tests/test_de_tokenizer_dev.rb
+- test/development_tests/test_en_tokenizer_dev.rb
+- test/development_tests/test_fr_tokenizer_dev.rb
+- test/development_tests/test_it_tokenizer_dev.rb
+- test/development_tests/test_parameters.rb
+- test/development_tests/test_ru_tokenizer_dev.rb
+- test/regression_tests/test_de_tokenizer.rb
+has_rdoc: