RubyGems - tokenizer - Versions diffs - 0.1.0 → 0.1.1 - Mend

tokenizer 0.1.0 → 0.1.1

Files changed (16)

data/.yardopts +8 -0
data/CHANGELOG.rdoc +20 -0
data/LICENSE.rdoc +2 -1
data/README.rdoc +39 -12
data/lib/tokenizer.rb +2 -2
data/lib/tokenizer/version.rb +1 -1
data/test/test_by_tokenizer_dev.rb +19 -0
data/test/{test_tokenizer.rb → test_de_tokenizer.rb} +0 -2
data/test/test_de_tokenizer_dev.rb +283 -0
data/test/test_en_tokenizer_dev.rb +19 -0
data/test/test_fr_tokenizer_dev.rb +19 -0
data/test/test_it_tokenizer_dev.rb +19 -0
data/test/test_ru_tokenizer_dev.rb +19 -0
metadata +45 -12
data/HISTORY.rdoc +0 -0
data/TODO.rdoc +0 -16

data/.yardopts ADDED

@@ -0,0 +1,8 @@
+--private
+--protected
+--title 'A simple tokenizer for NLP tasks.'
+-
+CHANGELOG.rdoc
+LICENSE.rdoc
+bin/*

data/CHANGELOG.rdoc ADDED

@@ -0,0 +1,20 @@
+== COMPLETED
+=== 0.1.1
+* Documentation and the whole project tree updated. No functional improvements.
+* Corrected typos.
+=== 0.1.0
+* Notion of binary tokenizer and a library for embedded tokenization.
+* Separation of punctuation marks.
+=== 0.0.1
+* Simple tokenization is desired.
+== PLANNED
+=== 0.2.0
+=== 0.3.0
+=== 0.4.0
+=== 0.5.0
+=== 0.6.0
+=== 0.7.0
+=== 0.8.0
+=== 0.9.0
+=== 1.0.0

data/LICENSE.rdoc CHANGED

@@ -1,4 +1,5 @@
-Copyright (c) 2011- Andrei Beliankou, University of Trier, Germany
+Copyright (c) 2011- Andrei Beliankou, Sven Naumann
+University of Trier, Germany
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

data/README.rdoc CHANGED

@@ -2,35 +2,62 @@
 * {RubyGems}[http://rubygems.org/gems/tokenizer]
 * {Developers Homepage}[http://www.uni-trier.de/index.php?id=24140]
+* {Source Code}[https://github.com/arbox/tokenizer]
+* {Bug Tracker}[https://github.com/arbox/tokenizer/issues]
 == DESCRIPTION
-This _Tokenizer_ is a linguistic tool intended to split a text into tokens.
+A simple multilingual tokenizer -- a linguistic tool intended
+to split a text into tokens for NLP tasks. This tool provides a CLI and
+a library for linguistic tokenization which is an unavoidable step for many HLT
+(human language technology) tasks in the preprocessing phase for further
+syntactic, semantic and other higher level processing goals.
+Use it for tokenization of German, English and French texts.
 == INSTALLATION
++Tokenizer+ is provided as a .gem package. Simply install it via
+{RubyGems}[http://rubygems.org/gems/tokenizer].
+To install +tokenizer+ issue the following command:
+  $ gem install tokenizer
+If you want to do a system wide installation, do this as root
+(possibly using +sudo+).
+Alternatively use your Gemfile for dependency management.
-_Tokenizer_ is provided as a .gem package. Simply install it via RubyGems.
 == SYNOPSIS
-You can use _Tokenizer_ in two ways.
-  - As a command line tool:
-      $ echo 'Hi, ich gehe in die Schule!. | tokenize
+You can use +Tokenizer+ in two ways.
+* As a command line tool:
+    $ echo 'Hi, ich gehe in die Schule!' | tokenize
-  - As a library for embedded tokenization:
-      $ require 'tokenizer'
-      $ de_tokenizer = Tokenizer::Tokenizer.new
-      $ de_tokenizer.tokenize('Ich gehe in die Schule!")
-      $ => ["Ich", "gehe", "in", "die", "Schule", "!"]
+* As a library for embedded tokenization:
+    $ require 'tokenizer'
+    $ de_tokenizer = Tokenizer::Tokenizer.new
+    $ de_tokenizer.tokenize('Ich gehe in die Schule!')
+    $ => ["Ich", "gehe", "in", "die", "Schule", "!"]
-See documentation in the Tokenizer::Tokenizer class for details on particular methods.
+See documentation in the Tokenizer::Tokenizer class for details
+on particular methods.
 == SUPPORT
 If you have questions, bug reports or any suggestions, please drop me an email :) Any help is deeply appreciated!
+== CHANGELOG
+For details on future plans and work in progress see CHANGELOG.rdoc.
+== CAUTION
+This library is <b>work in progress</b>! Though the interface is mostly complete,
+you might encounter some features that are not yet implemented.
+Please contact me with your suggestions, bug reports and feature requests.
 == LICENSE
-_Tokenizer_ is a copyrighted software by Andrei Beliankou, 2011-
++Tokenizer+ is a copyrighted software by Andrei Beliankou, 2011-
 You may use, redistribute and change it under the terms
 provided in the LICENSE.rdoc file.
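
The SYNOPSIS shows only the call and its result. As a rough illustration of what "separation of punctuation marks" means for that input, here is a minimal stand-in that does not depend on the gem. The method name `naive_tokenize` and its regular expression are assumptions made for this sketch; they are not taken from the gem's source, and the real tokenizer has to handle many cases (abbreviations, numbers, hyphens) that this one ignores.

```ruby
# Hypothetical stand-in, NOT the gem's implementation: collect runs of
# letters/digits as word tokens and every punctuation character as its
# own token. Whitespace is skipped because it matches neither branch.
def naive_tokenize(text)
  text.scan(/[[:alpha:][:digit:]]+|[[:punct:]]/)
end

p naive_tokenize('Ich gehe in die Schule!')
# prints ["Ich", "gehe", "in", "die", "Schule", "!"]
```

The output matches the README example only because that sentence is simple; an input such as "z.B." would be broken into four tokens by this sketch.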

data/lib/tokenizer.rb CHANGED

@@ -1,2 +1,2 @@
-require_relative 'tokenizer/tokenizer'
-require_relative 'tokenizer/version'
+require 'tokenizer/tokenizer'
+require 'tokenizer/version'
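
The hunk above replaces `require_relative` with `require`. The diff does not state the motivation, but `require_relative` is not available in Ruby 1.8.7, and plain `require` works as long as the gem's `lib/` directory is on `$LOAD_PATH`. The sketch below shows that lookup mechanism in isolation; the file name `demo_feature.rb`, the constant `DEMO_LOADED`, and the helper method are made up for the example.

```ruby
require 'tmpdir'

# Sketch: `require` resolves a feature name against $LOAD_PATH, so the
# directory holding the file has to be on the load path first.
def load_twice_via_load_path
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, 'demo_feature.rb'), "DEMO_LOADED = true\n")
    $LOAD_PATH.unshift(dir) # comparable to a gem's lib/ directory being added
    begin
      # The first call loads the file, the second finds it already loaded.
      [require('demo_feature'), require('demo_feature')]
    ensure
      $LOAD_PATH.delete(dir)
    end
  end
end

FIRST_LOAD, SECOND_LOAD = load_twice_via_load_path
p FIRST_LOAD  # true
p SECOND_LOAD # false
```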

data/lib/tokenizer/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Tokenizer
-  VERSION = '0.1.0'
+  VERSION = '0.1.1'
 end

data/test/test_by_tokenizer_dev.rb ADDED

@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @by_tokenizer = Tokenizer::Tokenizer.new(:by)
+  end
+  def test_tokenization_001
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @by_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end

data/test/{test_tokenizer.rb → test_de_tokenizer.rb} RENAMED

@@ -5,8 +5,6 @@ class TestTokenizer < Test::Unit::TestCase
   def setup
     @de_tokenizer = Tokenizer::Tokenizer.new(:de)
-    @en_tokenizer = Tokenizer::Tokenizer.new(:en)
-    @fr_tokenizer = Tokenizer::Tokenizer.new(:fr)
   end
   def test_constants

data/test/test_de_tokenizer_dev.rb ADDED

@@ -0,0 +1,283 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @de_tokenizer = Tokenizer::Tokenizer.new(:de)
+  end
+  def test_tokenization_001
+    input = 'ich ging? du, und ich nicht (konnte nicht)? Warum?!!'
+    etalon = %w{ ich ging ? du , und ich nicht ( konnte nicht ) ? Warum ? ! !}
+    compare(etalon, input)
+  end
+  def test_tokenization_002
+    input = "Die deutschen Umlaute und Sonderzeichen, wie in Mäuse, Scheiß und Tütchen, sind blöd!"
+    etalon = %w{Die deutschen Umlaute und Sonderzeichen , wie in Mäuse , Scheiß und Tütchen , sind blöd !}
+    compare(etalon, input)
+  end
+  def test_tokenization_003
+    input = "Abkürzungen, wie z.B. usw. und d.h. können zu Problemem führen."
+    etalon = %w{Abkürzungen , wie z.B. usw. und d.h. können zu Problemem führen .}
+    compare(etalon, input)
+  end
+  def test_tokenization_004
+    input = "Es gibt mehr als 1.023.345 Menschen in Deutschland, die keine Tausenderpunkte verstehen."
+    etalon = %w{Es gibt mehr als 1.023.345 Menschen in Deutschland , die keine Tausenderpunkte verstehen .}
+    compare(etalon, input)
+  end
+  def test_tokenization_005
+    input = "Cocktails, wie Apfel-Martini, Rum-Kirsche-Cola und andere, bereiten nicht nur Menschen Probleme."
+    etalon = %w{ Cocktails , wie Apfel-Martini , Rum-Kirsche-Cola und andere , bereiten nicht nur Menschen Probleme . }
+    compare(etalon, input)
+  end
+  def test_tokenization_006
+    input = 'Es gibt viele verschiedene Zeichen, die noch in Texten vorkommen können wie - zum Beispiel - diese hier "text" oder (text).'
+    etalon = %w{Es gibt viele verschiedene Zeichen , die noch in Texten vorkommen können wie - zum Beispiel - diese hier " text " oder ( text ) .}
+    compare(etalon, input)
+  end
+  def test_tokenization_007
+    input = "Abkürzungen sind immer ein Problem, da auch Leerzeichen dazwischen stehen können, wie z. B. hier."
+    etalon = ["Abkürzungen", "sind", "immer", "ein", "Problem", ",", "da", "auch", "Leerzeichen", "dazwischen", "stehen", "können", ",", "wie", "z. B.", "hier", "."]
+    compare(etalon, input)
+  end
+  def test_tokenization_008
+    input = "Außerdem kann es nach Abkürzungen und Satzenden auch mit Großschreibung weiter gehen, bei z.B. Aufzählungen."
+    etalon = %w{Außerdem kann es nach Abkürzungen und Satzenden auch mit Großschreibung weiter gehen , bei z.B. Aufzählungen .}
+    compare(etalon, input)
+  end
+  def test_tokenization_009
+    input = "Ein weiteres Problem sind solche Getrennt- und Zusammenschreibungen."
+    etalon = %w{Ein weiteres Problem sind solche Getrenntschreibungen und Zusammenschreibungen .}
+    compare(etalon, input)
+  end
+  def test_tokenization_010
+    input = "In manchen Texten gibt es auch Worttrennung am Zeilen- ende."
+    etalon = %w{In manchen Texten gibt es auch Worttrennung am Zeilenende .}
+    compare(etalon, input)
+  end
+  def test_tokenization_011 #Ellipsis
+    input = "Der Satz endet in einer Ellips..."
+    etalon = %w{ Der Satz endet in einer Ellips... } #die elliptischen Punkte sollten nicht vom Wort getrennt werden
+    compare(etalon, input)
+  end
+  def test_tokenization_012 #Fehlende Leerzeichen
+    input = "Der Satz endet.Das Leerzeichen fehlt."
+    etalon = %w{ Der Satz endet . Das Leerzeichen fehlt . } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
+    compare(etalon, input)
+  end
+  def test_tokenization_013 #Bindestriche
+    input = "Das Bindeglied - manisch-depressives Verhalten, binden-verbinden"
+    etalon = %w{ Das Bindeglied - manisch-depressives Verhalten , binden - verbinden}
+    compare(etalon, input)
+  end
+  def test_tokenization_014 #Abkuerzungen
+    input = "Der Satz enthielt z.B. Fehler"
+    etalon = %w{ Der Satz enthielt z.B. Fehler } #/\.\s(?=[A-Z])/ wuerde hinter Punkt den Satz beenden
+    compare(etalon, input)
+  end
+  def test_tokenization_015 #Fehlende Grossbuchstaben
+    input = "Der Satz endet. der Satz beginnt"
+    etalon = %w{ Der Satz endet . der Satz beginnt } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
+    compare(etalon, input)
+  end
+  def test_tokenization_016 #Franzoesisch
+    input = "L'art de l'univers, c'est un art"
+    etalon = %w{ L' art de l' univers , c'est un art } #Kontrovers!
+    compare(etalon, input)
+  end
+  def test_tokenization_017 #James Bond
+    input = "Bond,... James Bond."
+    etalon = %w{ Bond , ... James Bond . } #Kontrovers!
+    compare(etalon, input)
+  end
+  def test_tokenization_018 #Inches
+    input = "The square had four 9\" sides"
+    etalon = %w{ The square had four 9" sides }
+    compare(etalon, input)
+  end
+  def test_tokenization_019 #Abkuerzung zugleich Lexikon-Eintrag
+    input = "In fig. 3, a fig can be seen. Fig. no. 4 shows no fig."
+    etalon = %w{ In fig. 3 , a fig can be seen . Fig. no. 4 shows no fig . } #fig sowohl als Abkuerzung als auch als Wort
+    compare(etalon, input)
+  end
+  def test_tokenization_020 #Leerzeichen-getrennte Zusammengehörigkeiten
+    input = "They booked the flight New York-Los Angeles"
+    etalon = ["They", "booked", "the", "flight", "New York", "-", "Los Angeles"] #oder mit Bindestrich verbunden
+    compare(etalon, input)
+  end
+  def test_tokenization_021 #Ordinale
+    input = "Der 1. Platz ging an den Sieger"
+    etalon = %w{ Der 1. Platz ging an den Sieger }
+    compare(etalon, input)
+  end
+  def test_tokenization_022 #Klitika
+    input = "Er war's, stimmt's?"
+    etalon = %w{ Er war es , stimmt es ? } #Kontrovers! Benoetigt komplexere Analyse
+    compare(etalon, input)
+  end
+  def test_tokenization_023 #Datums- und Zeitangaben
+    input = "Es passierte am 13. Januar 2011 um 12:13 Uhr"
+    etalon = [ "Es", "passierte", "am", "13. Januar 2011", "um", "12:13 Uhr"]
+    compare(etalon, input)
+  end
+  def test_tokenization_024 #Eingebettete Saetze
+    input = "\"This is all?\" George asked."
+    etalon = %w{ This is all ? George asked . } #kann zu ungrammatischen Saetzen fuehren
+    compare(etalon, input)
+  end
+  def test_tokenization_025 #Eingebettete Saetze 2
+    input = "\"Das ist alles?\" fragte sie."
+    etalon = %w{ Das ist alles ? fragte sie . } #ungrammatischer Satz "fragte sie."
+    compare(etalon, input)
+  end
+  def test_tokenization_026
+    input = "Die deutschen Umlaute und Sonderzeichen, wie in Mäuse, Scheiß und Tütchen, sind blöd!"
+    etalon = %w{ Die deutschen Umlaute und Sonderzeichen , wie in Mäuse , Scheiß und Tütchen , sind blöd ! }
+    compare(etalon, input)
+  end
+  def test_tokenization_027
+    input = "Abkürzungen, wie z.B. usw. und d.h. können zu Problemem führen."
+    etalon = %w{ Abkürzungen , wie z.B. usw. und d.h. können zu Problemem führen . }
+    compare(etalon, input)
+  end
+  def test_tokenization_028
+    input = "Es gibt mehr als 1.023.345 Menschen in Deutschland, die keine Tausenderpunkte verstehen."
+    etalon = %w{ Es gibt mehr als 1.023.345 Menschen in Deutschland , die keine Tausenderpunkte verstehen . }
+    compare(etalon, input)
+  end
+  def test_tokenization_029
+    input = "Cocktails, wie Apfel-Martini, Rum-Kirsche-Cola und andere, bereiten nicht nur Menschen Probleme."
+    etalon = %w{ Cocktails , wie Apfel-Martini , Rum-Kirsche-Cola und andere , bereiten nicht nur Menschen Probleme . }
+    compare(etalon, input)
+  end
+  def test_tokenization_030 #Ellipsis
+    input = "Der Satz endet in einer Ellips..."
+    etalon = %w{ Der Satz endet in einer Ellips... } #die elliptischen Punkte sollten nicht vom Wort getrennt werden
+    compare(etalon, input)
+  end
+  def test_tokenization_031 #Fehlende Leerzeichen
+    input = "Der Satz endet.Das Leerzeichen fehlt."
+    etalon = %w{ Der Satz endet . Das Leerzeichen fehlt . } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
+    compare(etalon, input)
+  end
+  def test_tokenization_032 #Bindestriche
+    input = "Das Bindeglied - manisch-depressives Verhalten, binden-verbinden"
+    etalon = %w{ Das Bindeglied - manisch-depressives Verhalten , binden - verbinden}
+    compare(etalon, input)
+  end
+  def test_tokenization_033 #Abkuerzungen
+    input = "Der Satz enthielt z.B. Fehler"
+    etalon = %w{ Der Satz enthielt z.B. Fehler } #/\.\s(?=[A-Z])/ wuerde hinter Punkt den Satz beenden
+    compare(etalon, input)
+  end
+  def test_tokenization_034 #Fehlende Grossbuchstaben
+    input = "Der Satz endet. der Satz beginnt"
+    etalon = %w{ Der Satz endet . der Satz beginnt } #/\.\s(?=[A-Z])/ wuerde die Saetze nicht trennen
+    compare(etalon, input)
+  end
+  def test_tokenization_035 #Franzoesisch
+    input = "L'art de l'univers, c'est un art"
+    etalon = %w{ L' art de l' univers , c'est un art } #Kontrovers!
+    compare(etalon, input)
+  end
+  def test_tokenization_036 #James Bond
+    input = "Bond,... James Bond."
+    etalon = %w{ Bond , ... James Bond . } #Kontrovers!
+    compare(etalon, input)
+  end
+  def test_tokenization_037 #Inches
+    input = "The square had four 9\" sides"
+    etalon = %w{ The square had four 9" sides }
+    compare(etalon, input)
+  end
+  def test_tokenization_039 #Abkuerzung zugleich Lexikon-Eintrag
+    input = "In fig. 3, a fig can be seen. Fig. no. 4 shows no fig."
+    etalon = %w{ In fig. 3 , a fig can be seen . Fig. no. 4 shows no fig . } #fig sowohl als Abkuerzung als auch als Wort
+    compare(etalon, input)
+  end
+  def test_tokenization_040 #Leerzeichen-getrennte Zusammengehörigkeiten
+    input = "They booked the flight New York-Los Angeles"
+    etalon = ["They", "booked", "the", "flight", "New York", "-", "Los Angeles"] #oder mit Bindestrich verbunden
+    compare(etalon, input)
+  end
+  def test_tokenization_041 #Ordinale
+    input = "Der 1. Platz ging an den Sieger"
+    etalon = %w{ Der 1. Platz ging an den Sieger }
+    compare(etalon, input)
+  end
+  def test_tokenization_042 #Klitika
+    input = "Er war's, stimmt's?"
+    etalon = %w{ Er war es , stimmt es ? } #Kontrovers! Benoetigt komplexere Analyse
+    compare(etalon, input)
+  end
+  #Datums- und Zeitangaben
+  def test_tokenization_043
+    input = "Es passierte am 13. Januar 2011 um 12:13 Uhr"
+    etalon = ["Es", "passierte", "am", "13. Januar 2011", "um", "12:13 Uhr"]
+    compare(etalon, input)
+  end
+  #Eingebettete Sätze
+  def test_tokenization_044
+    input = '"This is all?" George asked.'
+    etalon = %w{ This is all ? George asked . } #kann zu ungrammatischen Saetzen fuehren
+    compare(etalon, input)
+  end
+  def test_tokenization_046 #Eingebettete Saetze 2
+    input = '"Das ist alles?" fragte sie.'
+    etalon = %w{Das ist alles ? fragte sie .} #ungrammatischer Satz "fragte sie."
+    compare(etalon, input)
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @de_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end
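
Several comments in the test file above refer to the pattern `/\.\s(?=[A-Z])/` and note where it would go wrong as a sentence boundary. The following sketch runs that pattern through `String#split` on three of the test inputs so the failure modes are visible. The constant and method names are chosen for this example only.

```ruby
# The pattern quoted in the test comments: a period, one whitespace
# character, then a lookahead for an ASCII capital letter.
BOUNDARY = /\.\s(?=[A-Z])/

def naive_split(text)
  text.split(BOUNDARY)
end

# Missing space after the period: no match, so nothing is split.
p naive_split('Der Satz endet.Das Leerzeichen fehlt.')
# prints ["Der Satz endet.Das Leerzeichen fehlt."]

# Lowercase letter after the period: the lookahead fails, nothing is split.
p naive_split('Der Satz endet. der Satz beginnt')
# prints ["Der Satz endet. der Satz beginnt"]

# Abbreviation followed by a capitalized noun: a false boundary is found.
p naive_split('Der Satz enthielt z.B. Fehler')
# prints ["Der Satz enthielt z.B", "Fehler"]
```

The first two cases under-split and the third over-splits, which is why the tests treat abbreviations, missing spaces, and missing capitals as separate cases.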

data/test/test_en_tokenizer_dev.rb ADDED

@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @en_tokenizer = Tokenizer::Tokenizer.new(:en)
+  end
+  def test_tokenization_001
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @en_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end

data/test/test_fr_tokenizer_dev.rb ADDED

@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @fr_tokenizer = Tokenizer::Tokenizer.new(:fr)
+  end
+  def test_tokenization_001
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @fr_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end

data/test/test_it_tokenizer_dev.rb ADDED

@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @it_tokenizer = Tokenizer::Tokenizer.new(:it)
+  end
+  def test_tokenization_001
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @it_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end

data/test/test_ru_tokenizer_dev.rb ADDED

@@ -0,0 +1,19 @@
+# -*- coding: utf-8 -*-
+require 'test/unit'
+require 'tokenizer'
+class TestTokenizerDev < Test::Unit::TestCase
+  def setup
+    @ru_tokenizer = Tokenizer::Tokenizer.new(:ru)
+  end
+  def test_tokenization_001
+  end
+  private
+  def compare(exp_result, input)
+    act_result = @ru_tokenizer.tokenize(input)
+    assert_equal(exp_result, act_result)
+  end
+end

metadata CHANGED

@@ -2,7 +2,7 @@
 name: tokenizer
 version: !ruby/object:Gem::Version
   prerelease:
-  version: 0.1.0
+  version: 0.1.1
 platform: ruby
 authors:
 - Andrei Beliankou
@@ -10,7 +10,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-05-19 00:00:00 +02:00
+date: 2011-08-25 00:00:00 +02:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -21,21 +21,43 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: "0"
+        version: 3.9.1
   type: :development
   version_requirements: *id001
 - !ruby/object:Gem::Dependency
   name: rake
   prerelease: false
   requirement: &id002 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - "="
+      - !ruby/object:Gem::Version
+        version: 0.8.7
+  type: :development
+  version_requirements: *id002
+- !ruby/object:Gem::Dependency
+  name: yard
+  prerelease: false
+  requirement: &id003 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
         version: "0"
   type: :development
-  version_requirements: *id002
-description: A simple tokenizer for NLP tasks.
+  version_requirements: *id003
+- !ruby/object:Gem::Dependency
+  name: bundler
+  prerelease: false
+  requirement: &id004 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: "0"
+  type: :development
+  version_requirements: *id004
+description: A simple multilingual tokenizer for NLP tasks. This tool provides a CLI and a library for linguistic tokenization which is an unavoidable step for many HLT (human language technology) tasks in the preprocessing phase for further syntactic, semantic and other higher level processing goals. Use it for tokenization of German, English and French texts.
 email: a.belenkow@uni-trier.de
 executables:
 - tokenize
@@ -43,18 +65,23 @@ extensions: []
 extra_rdoc_files:
 - README.rdoc
-- TODO.rdoc
+- CHANGELOG.rdoc
 - LICENSE.rdoc
-- HISTORY.rdoc
 files:
 - lib/tokenizer.rb
 - lib/tokenizer/tokenizer.rb
 - lib/tokenizer/version.rb
 - README.rdoc
-- TODO.rdoc
 - LICENSE.rdoc
-- HISTORY.rdoc
-- test/test_tokenizer.rb
+- CHANGELOG.rdoc
+- .yardopts
+- test/test_en_tokenizer_dev.rb
+- test/test_de_tokenizer_dev.rb
+- test/test_de_tokenizer.rb
+- test/test_by_tokenizer_dev.rb
+- test/test_it_tokenizer_dev.rb
+- test/test_fr_tokenizer_dev.rb
+- test/test_ru_tokenizer_dev.rb
 - bin/tokenize
 has_rdoc: true
 homepage: http://www.uni-trier.de/index.php?id=34451
@@ -70,7 +97,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: "1.9"
+      version: 1.8.7
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
@@ -85,4 +112,10 @@ signing_key:
 specification_version: 3
 summary: Tokenizer is a linguistic tool intended to split a text into tokens.
 test_files:
-- test/test_tokenizer.rb
+- test/test_en_tokenizer_dev.rb
+- test/test_de_tokenizer_dev.rb
+- test/test_de_tokenizer.rb
+- test/test_by_tokenizer_dev.rb
+- test/test_it_tokenizer_dev.rb
+- test/test_fr_tokenizer_dev.rb
+- test/test_ru_tokenizer_dev.rb
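
The metadata hunks use three kinds of version constraint: an exact pin (`= 0.8.7` for rake), an open constraint (`>= 0` for yard and bundler), and a lower bound (`>= 1.8.7` for the required Ruby version). The sketch below checks a few versions against those constraints with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The candidate versions being tested are arbitrary examples, not values from the diff.

```ruby
require 'rubygems'

# Constraints copied from the metadata above.
exact_rake = Gem::Requirement.new('= 0.8.7')
any_yard   = Gem::Requirement.new('>= 0')
min_ruby   = Gem::Requirement.new('>= 1.8.7')

# Candidate versions are illustrative only.
CHECKS = {
  rake_pinned:   exact_rake.satisfied_by?(Gem::Version.new('0.8.7')),
  rake_newer:    exact_rake.satisfied_by?(Gem::Version.new('0.9.0')),
  yard_any:      any_yard.satisfied_by?(Gem::Version.new('0.7.2')),
  ruby_1_9_2:    min_ruby.satisfied_by?(Gem::Version.new('1.9.2')),
  ruby_1_8_6:    min_ruby.satisfied_by?(Gem::Version.new('1.8.6')),
  release_order: Gem::Version.new('0.1.1') > Gem::Version.new('0.1.0')
}

CHECKS.each { |name, ok| puts "#{name}: #{ok}" }
```

The exact pin rejects any other rake release, while lowering the Ruby bound from 1.9 to 1.8.7 widens the set of interpreters that satisfy the gem's requirement.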

data/HISTORY.rdoc DELETED

File without changes

data/TODO.rdoc DELETED

@@ -1,16 +0,0 @@
-= Milestones for the project _Tokenizer_
-== 0.0.1
-  - simple tokenization is desired
-== 0.1.0
-  - notion of binary tokenizer and a library for embedded tokenization;
-  - separation of punctuaction marks.
-== 0.2.0
-== 0.3.0
-== 0.4.0
-== 0.5.0
-== 0.6.0
-== 0.7.0
-== 0.8.0
-== 0.9.0
-== 1.0.0