RubyGems - greeb - Versions diffs - 0.0.2 → 0.1.0.rc1 - Mend

greeb 0.0.2 → 0.1.0.rc1

Files changed (23)

data/.gitignore CHANGED

@@ -27,6 +27,7 @@ nbproject
 ## BUNDLER
 .bundle
+Gemfile.lock
 ## PROJECT::GENERAL
 coverage

data/.travis.yml ADDED

@@ -0,0 +1,7 @@
+branches:
+  only:
+    - develop
+    - master
+rvm:
+  - 1.9.3
+  - rbx-19mode

data/.yardopts ADDED

@@ -0,0 +1,6 @@
+--protected
+--no-private
+-m markdown
+-
+README.md
+LICENSE

data/Gemfile CHANGED

@@ -1,3 +1,5 @@
+# encoding: utf-8
 source 'http://rubygems.org'
 gemspec

data/LICENSE ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2010-2012 Dmitry A. Ustalov
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,140 @@
+Greeb
+=====
+Greeb is a simple yet awesome text tokenizer that is based on regular
+expressions.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'greeb'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install greeb
+## Usage
+Greeb can help you to solve simple text processing problems:
+```ruby
+pp Greeb::Tokenizer.new('Hello!').tokens
+=begin
+#<SortedSet: {#<struct Greeb::Entity from=0, to=5, type=:letter>,
+ #<struct Greeb::Entity from=5, to=6, type=:punct>}>
+=end
+```
+Greeb can also process much more complex texts:
+```ruby
+text =<<-EOF
+Hello! I am 18! My favourite number is 133.7...
+What about you?
+EOF
+pp Greeb::Tokenizer.new(text).tokens
+=begin
+#<SortedSet: {#<struct Greeb::Entity from=0, to=5, type=:letter>,
+ #<struct Greeb::Entity from=5, to=6, type=:punct>,
+ #<struct Greeb::Entity from=6, to=7, type=:separ>,
+ #<struct Greeb::Entity from=7, to=8, type=:letter>,
+ #<struct Greeb::Entity from=8, to=9, type=:separ>,
+ #<struct Greeb::Entity from=9, to=11, type=:letter>,
+ #<struct Greeb::Entity from=11, to=12, type=:separ>,
+ #<struct Greeb::Entity from=12, to=14, type=:integer>,
+ #<struct Greeb::Entity from=14, to=15, type=:punct>,
+ #<struct Greeb::Entity from=15, to=16, type=:separ>,
+ #<struct Greeb::Entity from=16, to=18, type=:letter>,
+ #<struct Greeb::Entity from=18, to=19, type=:separ>,
+ #<struct Greeb::Entity from=19, to=28, type=:letter>,
+ #<struct Greeb::Entity from=28, to=29, type=:separ>,
+ #<struct Greeb::Entity from=29, to=35, type=:letter>,
+ #<struct Greeb::Entity from=35, to=36, type=:separ>,
+ #<struct Greeb::Entity from=36, to=38, type=:letter>,
+ #<struct Greeb::Entity from=38, to=39, type=:separ>,
+ #<struct Greeb::Entity from=39, to=44, type=:float>,
+ #<struct Greeb::Entity from=44, to=47, type=:punct>,
+ #<struct Greeb::Entity from=47, to=49, type=:break>,
+ #<struct Greeb::Entity from=49, to=53, type=:letter>,
+ #<struct Greeb::Entity from=53, to=54, type=:separ>,
+ #<struct Greeb::Entity from=54, to=59, type=:letter>,
+ #<struct Greeb::Entity from=59, to=60, type=:separ>,
+ #<struct Greeb::Entity from=60, to=63, type=:letter>,
+ #<struct Greeb::Entity from=63, to=64, type=:punct>,
+ #<struct Greeb::Entity from=64, to=65, type=:break>}>
+=end
+```
+It can also be used to solve text segmentation problems
+such as sentence detection:
+```ruby
+text = 'Hello! How are you?'
+pp Greeb::Segmentator.new(Greeb::Tokenizer.new(text)).sentences
+=begin
+#<SortedSet: {#<struct Greeb::Entity from=0, to=6, type=:sentence>,
+ #<struct Greeb::Entity from=7, to=19, type=:sentence>}>
+=end
+```
+It is possible to extract tokens that were processed by the text
+segmentator:
+```ruby
+text = 'Hello! How are you?'
+segmentator = Greeb::Segmentator.new(Greeb::Tokenizer.new(text))
+sentences = segmentator.sentences
+pp segmentator.extract(*sentences)
+=begin
+{#<struct Greeb::Entity from=0, to=6, type=:sentence>=>
+  [#<struct Greeb::Entity from=0, to=5, type=:letter>,
+   #<struct Greeb::Entity from=5, to=6, type=:punct>],
+ #<struct Greeb::Entity from=7, to=19, type=:sentence>=>
+  [#<struct Greeb::Entity from=7, to=10, type=:letter>,
+   #<struct Greeb::Entity from=10, to=11, type=:separ>,
+   #<struct Greeb::Entity from=11, to=14, type=:letter>,
+   #<struct Greeb::Entity from=14, to=15, type=:separ>,
+   #<struct Greeb::Entity from=15, to=18, type=:letter>,
+   #<struct Greeb::Entity from=18, to=19, type=:punct>]}
+=end
+```
+## Tokens
+Greeb operates on entities, tuples of `<from, to, type>`, where
+`from` is the beginning of the entity, `to` is the end of the entity,
+and `type` is the type of the entity.
+There are several entity types: `:letter`, `:float`, `:integer`,
+`:separ`, `:punct` (for punctuation), `:spunct` (for in-sentence
+punctuation), and `:break`.
+## Contributing
+1. Fork it;
+2. Create your feature branch (`git checkout -b my-new-feature`);
+3. Commit your changes (`git commit -am 'Added some feature'`);
+4. Push to the branch (`git push origin my-new-feature`);
+5. Create new Pull Request.
+I highly recommend using git flow to make the development process more
+systematic and awesome.
+## Build Status [<img src="https://secure.travis-ci.org/eveel/greeb.png"/>](http://travis-ci.org/eveel/greeb)
+## Dependency Status [<img src="https://gemnasium.com/eveel/greeb.png?travis"/>](https://gemnasium.com/eveel/greeb)
+## Copyright
+Copyright (c) 2010-2012 [Dmitry A. Ustalov]. See LICENSE for details.
+[Dmitry A. Ustalov]: http://eveel.ru

data/Rakefile CHANGED

@@ -1,12 +1,12 @@
+#!/usr/bin/env rake
 # encoding: utf-8
-require 'bundler'
-Bundler::GemHelper.install_tasks
+require 'bundler/gem_tasks'
-require 'rspec/core/rake_task'
-desc 'Run all examples'
-RSpec::Core::RakeTask.new(:spec) do |t|
-  t.rspec_opts = %w[--color]
-end
+task :default => :test
-task :default => :spec
+require 'rake/testtask'
+Rake::TestTask.new do |test|
+  test.pattern = 'spec/**/*_spec.rb'
+  test.verbose = true
+end

data/greeb.gemspec CHANGED

@@ -1,25 +1,27 @@
 # encoding: utf-8
-$:.push File.expand_path('../lib', __FILE__)
-require 'greeb'
+require File.expand_path('../lib/greeb/version', __FILE__)
 Gem::Specification.new do |s|
   s.name        = 'greeb'
   s.version     = Greeb::VERSION
   s.platform    = Gem::Platform::RUBY
-  s.authors     = [ 'Dmitry A. Ustalov' ]
-  s.email       = [ 'dmitry@eveel.ru' ]
+  s.authors     = ['Dmitry A. Ustalov']
+  s.email       = ['dmitry@eveel.ru']
   s.homepage    = 'https://github.com/eveel/greeb'
-  s.summary     = 'Greeb is a Graphematical Analyzer.'
-  s.description = 'Greeb is awesome Graphematical Analyzer, ' \
+  s.summary     = 'Greeb is a simple regexp-based tokenizer.'
+  s.description = 'Greeb is a simple yet awesome regexp-based tokenizer, ' \
                   'written in Ruby.'
   s.rubyforge_project = 'greeb'
-  s.add_dependency 'rspec', '~> 2.4.0'
+  s.add_development_dependency 'rake'
+  s.add_development_dependency 'minitest', '>= 2.11'
+  s.add_development_dependency 'simplecov'
+  s.add_development_dependency 'yard'
   s.files         = `git ls-files`.split("\n")
   s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
   s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
-  s.require_paths = [ 'lib' ]
+  s.require_paths = ['lib']
 end
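
The gemspec change above moves the test tooling from a runtime dependency to development dependencies, and the new version string carries an `rc` suffix. A minimal sketch of how RubyGems classifies both, using a throwaway specification (the `example` name, summary, and author are placeholders for illustration, not the greeb gemspec):

```ruby
require 'rubygems'

# Placeholder specification for illustration only; not greeb's gemspec.
spec = Gem::Specification.new do |s|
  s.name    = 'example'
  s.version = '0.1.0.rc1'
  s.summary = 'Placeholder summary.'
  s.authors = ['Nobody']
  s.add_development_dependency 'minitest', '>= 2.11'
end

# Development dependencies are kept apart from runtime dependencies.
runtime_names     = spec.runtime_dependencies.map(&:name)
development_names = spec.development_dependencies.map(&:name)

# A version segment containing letters marks the release as a prerelease.
prerelease = spec.version.prerelease?
```

With only development dependencies declared, the runtime list stays empty, which is presumably the intent of replacing `add_dependency 'rspec'` in this release.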

data/lib/greeb.rb CHANGED

@@ -1,11 +1,25 @@
 # encoding: utf-8
-# Greeb is awesome Graphematical Analyzer.
-#
-module Greeb
-  # Version of the Greeb.
-  #
-  VERSION = "0.0.2"
+require 'greeb/version'
-  require 'greeb/parser'
+# Greeb operates on entities, tuples of `<from, to, type>`, where
+# `from` is the beginning of the entity, `to` is the end of the entity,
+# and `type` is the type of the entity.
+#
+# There are several entity types: `:letter`, `:float`, `:integer`,
+# `:separ` for separators, `:punct` for punctuation characters,
+# `:spunct` for in-sentence punctuation characters, and
+# `:break` for line endings.
+#
+class Greeb::Entity < Struct.new(:from, :to, :type)
+  def <=> other
+    if (comparison = self.from <=> other.from) == 0
+      self.to <=> other.to
+    else
+      comparison
+    end
+  end
 end
+require 'greeb/tokenizer'
+require 'greeb/segmentator'
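
The `<=>` method added above orders entities by `from` and falls back to `to` on ties, which is what lets them live in a sorted collection. A minimal standalone sketch of the same ordering, assuming a hypothetical `Span` struct in place of `Greeb::Entity` and a plain sorted Array in place of `SortedSet`:

```ruby
# Hypothetical stand-in for Greeb::Entity; not the gem's own class.
Span = Struct.new(:from, :to, :type) do
  # Order by start offset first, then by end offset.
  def <=>(other)
    [from, to] <=> [other.from, other.to]
  end
end

spans = [
  Span.new(5, 6, :punct),
  Span.new(0, 6, :sentence),
  Span.new(0, 5, :letter)
]

# Array#sort relies on <=>, so no block is needed here.
sorted_types = spans.sort.map(&:type)
```

The sketch deliberately does not include `Comparable`: doing so would replace the struct's member-wise `==` with one based on `<=>`, making entities that differ only in `type` compare as equal.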

data/lib/greeb/segmentator.rb ADDED

@@ -0,0 +1,95 @@
+# encoding: utf-8
+# It is possible to perform simple sentence detection that is based
+# on Greeb's tokenization.
+#
+class Greeb::Segmentator
+  # A sentence does not start with a separator character, a line break
+  # character, or a punctuation character.
+  #
+  SENTENCE_DOESNT_START = [:separ, :break, :punct, :spunct]
+  attr_reader :tokens
+  # Create a new instance of {Greeb::Segmentator}.
+  #
+  # @param tokenizer_or_tokens [Greeb::Tokenizer,Set] an instance of
+  # Greeb::Tokenizer or set of its results.
+  #
+  def initialize tokenizer_or_tokens
+    @tokens = if tokenizer_or_tokens.is_a? Greeb::Tokenizer
+      tokenizer_or_tokens.tokens
+    else
+      tokenizer_or_tokens
+    end
+  end
+  # Sentences memoization method.
+  #
+  # @return [Set<Greeb::Entity>] a set of sentences.
+  #
+  def sentences
+    detect_sentences! unless @sentences
+    @sentences
+  end
+  # Extract tokens from the set of sentences.
+  #
+  # @param sentences [Array<Greeb::Entity>] a list of sentences.
+  #
+  # @return [Hash<Greeb::Entity, Array<Greeb::Entity>>] a hash with
+  # sentences as keys and tokens arrays as values.
+  #
+  def extract *sentences
+    Hash[
+      sentences.map do |s|
+        [s, tokens.select { |t| t.from >= s.from and t.to <= s.to }]
+      end
+    ]
+  end
+  protected
+    # Implementation of the sentence detection method. This method
+    # changes the `@sentences` ivar.
+    #
+    # @return [nil] nothing.
+    #
+    def detect_sentences!
+      @sentences = SortedSet.new
+      rest = tokens.inject(new_sentence) do |sentence, token|
+        if !sentence.from and SENTENCE_DOESNT_START.include?(token.type)
+          next sentence
+        end
+        sentence.from = token.from unless sentence.from
+        next sentence if sentence.to and sentence.to > token.to
+        if :punct == token.type
+          sentence.to = tokens.
+            select { |t| t.from >= token.from }.
+            inject(token) { |r, t| break r if t.type != token.type; t }.
+            to
+          @sentences << sentence
+          sentence = new_sentence
+        elsif :separ != token.type
+          sentence.to = token.to
+        end
+        sentence
+      end
+      nil.tap { @sentences << rest if rest.from and rest.to }
+    end
+  private
+    # Create a new instance of {Greeb::Entity} with `:sentence` type.
+    #
+    # @return [Greeb::Entity] a new entity instance.
+    #
+    def new_sentence
+      Greeb::Entity.new(nil, nil, :sentence)
+    end
+end
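
The detection logic above can be hard to follow through the nested `inject` calls. The following is a simplified sketch of the same idea, not the gem's implementation: it walks hand-built tokens once, skips tokens that cannot start a sentence, closes a sentence after a run of `:punct` tokens, and never lets a trailing separator extend the sentence. The `Token` struct and `sentence_spans` method are hypothetical names.

```ruby
# Hypothetical stand-in for Greeb::Entity.
Token = Struct.new(:from, :to, :type)

# Token types that cannot open a sentence (mirrors SENTENCE_DOESNT_START).
NON_STARTERS = [:separ, :break, :punct, :spunct].freeze

# Returns [from, to] pairs, one per detected sentence.
def sentence_spans(tokens)
  spans = []
  from = to = nil
  closing = false

  tokens.each do |token|
    # A run of terminal punctuation has ended: emit the sentence.
    if closing && token.type != :punct
      spans << [from, to]
      from = to = nil
      closing = false
    end

    next if from.nil? && NON_STARTERS.include?(token.type)

    from ||= token.from
    closing = true if token.type == :punct
    to = token.to unless token.type == :separ
  end

  # Keep an unterminated trailing sentence, as the specs expect.
  spans << [from, to] if from && to
  spans
end

# Tokens for the text "Hi! Ok." written out by hand.
tokens = [
  Token.new(0, 2, :letter), Token.new(2, 3, :punct),
  Token.new(3, 4, :separ),
  Token.new(4, 6, :letter), Token.new(6, 7, :punct)
]
spans = sentence_spans(tokens)
```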

data/lib/greeb/tokenizer.rb ADDED

@@ -0,0 +1,112 @@
+# encoding: utf-8
+require 'strscan'
+require 'set'
+# Greeb's tokenization facilities. Use 'em with love.
+#
+class Greeb::Tokenizer
+  # English and Russian letters.
+  #
+  LETTERS = /[A-Za-zА-Яа-яЁё]+/u
+  # Floating point values.
+  #
+  FLOATS = /(\d+)[.,](\d+)/u
+  # Integer values.
+  #
+  INTEGERS = /\d+/u
+  # In-subsentence separator (i.e.: "*" or "=").
+  #
+  SEPARATORS = /[*=_\/\\ ]+/u
+  # Punctuation character (i.e.: "." or "!").
+  #
+  PUNCTUATIONS = /(\.|\!|\?)+/u
+  # In-sentence punctuation character (i.e.: "," or "-").
+  #
+  SENTENCE_PUNCTUATIONS = /(\,|\[|\]|\(|\)|\-|:|;)+/u
+  # Line breaks.
+  #
+  BREAKS = /\n+/u
+  attr_reader :text, :scanner
+  protected :scanner
+  # Create a new instance of {Greeb::Tokenizer}.
+  #
+  # @param text [String] text to be tokenized.
+  #
+  def initialize(text)
+    @text = text
+  end
+  # Tokens memoization method.
+  #
+  # @return [Set<Greeb::Entity>] a set of tokens.
+  #
+  def tokens
+    tokenize! unless @tokens
+    @tokens
+  end
+  protected
+    # Perform the tokenization process. This method modifies
+    # `@scanner` and `@tokens` instance variables.
+    #
+    # @return [nil] nothing unless exception is raised.
+    #
+    def tokenize!
+      @scanner = StringScanner.new(text)
+      @tokens = SortedSet.new
+      while !scanner.eos?
+        parse! LETTERS, :letter or
+        parse! FLOATS, :float or
+        parse! INTEGERS, :integer or
+        split_parse! SENTENCE_PUNCTUATIONS, :spunct or
+        split_parse! PUNCTUATIONS, :punct or
+        split_parse! SEPARATORS, :separ or
+        split_parse! BREAKS, :break or
+        raise @tokens.inspect
+      end
+    ensure
+      scanner.terminate
+    end
+    # Try to parse one small piece of text that is covered by pattern
+    # of necessary type.
+    #
+    # @param pattern [Regexp] a regular expression to extract the token.
+    # @param type [Symbol] a symbol that represents the necessary token
+    # type.
+    #
+    # @return [Set<Greeb::Entity>] the modified set of extracted tokens.
+    #
+    def parse! pattern, type
+      return false unless token = scanner.scan(pattern)
+      @tokens << Greeb::Entity.new(scanner.pos - token.length, scanner.pos, type)
+    end
+    # Try to parse one small piece of text that is covered by pattern
+    # of necessary type. This method performs grouping of the same
+    # characters.
+    #
+    # @param pattern [Regexp] a regular expression to extract the token.
+    # @param type [Symbol] a symbol that represents the necessary token
+    # type.
+    #
+    # @return [Integer, false] the offset after the last extracted token, or `false` if nothing matched.
+    #
+    def split_parse! pattern, type
+      return false unless token = scanner.scan(pattern)
+      position = scanner.pos - token.length
+      token.scan(/((.|\n)\2*)/).map(&:first).inject(position) do |before, s|
+        @tokens << Greeb::Entity.new(before, before + s.length, type)
+        before + s.length
+      end
+    end
+end
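
The tokenizer above tries each pattern in priority order at the current scanner position and records the span of whichever matches first. A trimmed-down sketch of that loop follows; the `PATTERNS` table and `simple_tokens` method are hypothetical, cover only a few ASCII token types, and do not split runs of repeated characters. It uses `StringScanner#charpos` because `StringScanner#pos` counts bytes, which matters once the input contains multibyte characters such as Cyrillic letters.

```ruby
require 'strscan'

# Hypothetical, reduced pattern table; order matters (floats before integers).
PATTERNS = [
  [:letter,  /[A-Za-z]+/],
  [:float,   /\d+[.,]\d+/],
  [:integer, /\d+/],
  [:punct,   /[.!?]/],
  [:separ,   / +/]
].freeze

# Returns [from, to, type] triples with character offsets.
def simple_tokens(text)
  scanner = StringScanner.new(text)
  tokens = []

  until scanner.eos?
    start = scanner.charpos
    # StringScanner#scan only matches at the current position.
    match = PATTERNS.find { |_type, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{start}" unless match

    tokens << [start, scanner.charpos, match.first]
  end

  tokens
end

result = simple_tokens('I am 18!')
```

Raising on unrecognized input mirrors the `raise` at the end of the `or` chain in `tokenize!`, so unsupported characters fail loudly instead of being dropped.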

data/lib/greeb/version.rb ADDED

@@ -0,0 +1,9 @@
+# encoding: utf-8
+# Greeb is a simple regexp-based tokenizer.
+#
+module Greeb
+  # Version of Greeb.
+  #
+  VERSION = '0.1.0.rc1'
+end

data/spec/segmentator_spec.rb ADDED

@@ -0,0 +1,112 @@
+# encoding: utf-8
+require File.expand_path('../spec_helper', __FILE__)
+module Greeb
+  describe Segmentator do
+    describe 'initialization' do
+      before { @tokenizer = Tokenizer.new('Vodka') }
+      subject { Segmentator.new(@tokenizer) }
+      it 'can be initialized with a Tokenizer' do
+        subject.tokens.must_be_kind_of SortedSet
+      end
+      it 'can be initialized with a set of tokens' do
+        subject = Segmentator.new(@tokenizer.tokens)
+        subject.tokens.must_be_kind_of SortedSet
+      end
+      it 'should have the @tokens ivar' do
+        subject.instance_variable_get(:@tokens).wont_be_nil
+      end
+    end
+    describe 'a simple sentence' do
+      before { @tokenizer = Tokenizer.new('Hello, I am JC Denton.') }
+      subject { Segmentator.new(@tokenizer).sentences }
+      it 'should be segmented' do
+        subject.must_equal(
+          SortedSet.new([Entity.new(0, 22, :sentence)])
+        )
+      end
+    end
+    describe 'a simple sentence without punctuation' do
+      before { @tokenizer = Tokenizer.new('Hello, I am JC Denton') }
+      subject { Segmentator.new(@tokenizer).sentences }
+      it 'should be segmented' do
+        subject.must_equal(
+          SortedSet.new([Entity.new(0, 21, :sentence)])
+        )
+      end
+    end
+    describe 'a simple sentence with leading and trailing whitespace' do
+      before { @tokenizer = Tokenizer.new('      Hello, I am JC Denton  ') }
+      subject { Segmentator.new(@tokenizer).sentences }
+      it 'should be segmented' do
+        subject.must_equal(
+          SortedSet.new([Entity.new(6, 27, :sentence)])
+        )
+      end
+    end
+    describe 'two simple sentences' do
+      before { @tokenizer = Tokenizer.new('Hello! I am JC Denton.') }
+      subject { Segmentator.new(@tokenizer).sentences }
+      it 'should be segmented' do
+        subject.must_equal(
+          SortedSet.new([Entity.new(0, 6,  :sentence),
+                         Entity.new(7, 22, :sentence)])
+        )
+      end
+    end
+    describe 'one wrong character and one simple sentence' do
+      before { @tokenizer = Tokenizer.new('! I am JC Denton.') }
+      subject { Segmentator.new(@tokenizer).sentences }
+      it 'should be segmented' do
+        subject.must_equal(
+          SortedSet.new([Entity.new(2, 17, :sentence)])
+        )
+      end
+    end
+    describe 'token extractor' do
+      before { @tokenizer = Tokenizer.new('Hello! I am JC Denton.') }
+      subject { Segmentator.new(@tokenizer) }
+      it 'should be extracted' do
+        subject.extract(*subject.sentences).must_equal({
+          Entity.new(0,  6, :sentence) => [
+            Entity.new(0, 5, :letter),
+            Entity.new(5, 6, :punct)
+          ],
+          Entity.new(7, 22, :sentence) => [
+            Entity.new(7,  8,  :letter),
+            Entity.new(8,  9,  :separ),
+            Entity.new(9,  11, :letter),
+            Entity.new(11, 12, :separ),
+            Entity.new(12, 14, :letter),
+            Entity.new(14, 15, :separ),
+            Entity.new(15, 21, :letter),
+            Entity.new(21, 22, :punct)
+          ]
+        })
+      end
+    end
+  end
+end
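
The `token extractor` spec above relies on a simple containment rule: a token belongs to a sentence when its span lies entirely within the sentence's span. A standalone sketch of that rule, assuming a hypothetical `Item` struct and `tokens_by_sentence` method rather than the gem's classes:

```ruby
# Hypothetical stand-in for Greeb::Entity.
Item = Struct.new(:from, :to, :type)

# Maps each sentence to the tokens fully contained in it.
def tokens_by_sentence(sentences, tokens)
  sentences.to_h do |sentence|
    inside = tokens.select do |token|
      token.from >= sentence.from && token.to <= sentence.to
    end
    [sentence, inside]
  end
end

# Entities for the text "Hi! Ok." written out by hand.
sentences = [Item.new(0, 3, :sentence), Item.new(4, 7, :sentence)]
tokens = [
  Item.new(0, 2, :letter), Item.new(2, 3, :punct),
  Item.new(3, 4, :separ),
  Item.new(4, 6, :letter), Item.new(6, 7, :punct)
]
grouped = tokens_by_sentence(sentences, tokens)
```

The separator between the two sentences falls inside neither span, so it appears in neither group, which matches the expected hash in the spec.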

data/spec/spec_helper.rb CHANGED

@@ -1,14 +1,20 @@
 # encoding: utf-8
-require File.expand_path('../../lib/greeb', __FILE__)
+require 'rubygems'
-RSpec.configure do |c|
-  c.mock_with :rspec
+$:.unshift File.expand_path('../../lib', __FILE__)
+if RUBY_VERSION.start_with?('1.8')
+  gem 'minitest'
 end
-RSpec::Matchers.define :be_parsed_as do |expected|
-  match do |actual|
-    tree = Greeb::Parser.new(actual).parse
-    tree == expected
+require 'minitest/autorun'
+unless 'true' == ENV['TRAVIS']
+  require 'simplecov'
+  SimpleCov.start do
+    add_filter '/spec/'
   end
 end
+require 'greeb'
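
`RUBY_VERSION` holds a full version string such as `1.8.7`, so interpreter checks tend to be more robust as version comparisons than as exact string matches. A small sketch using `Gem::Version` from RubyGems; the `ruby_older_than?` helper is a hypothetical name, not part of this gem:

```ruby
require 'rubygems'

# Hypothetical helper: true when `current` precedes `version`.
def ruby_older_than?(version, current = RUBY_VERSION)
  Gem::Version.new(current) < Gem::Version.new(version)
end

legacy = ruby_older_than?('1.9', '1.8.7')
modern = ruby_older_than?('1.9', '1.9.3')
```

Comparing through `Gem::Version` also avoids the pitfalls of plain string comparison, where `'10.0' < '9.0'` would hold.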

data/spec/tokenizer_spec.rb ADDED

@@ -0,0 +1,91 @@
+# encoding: utf-8
+require File.expand_path('../spec_helper', __FILE__)
+module Greeb
+  describe Tokenizer do
+    describe 'initialization' do
+      subject { Tokenizer.new('vodka') }
+      it 'should be initialized with a text' do
+        subject.text.must_equal 'vodka'
+      end
+      it 'should have the @text ivar' do
+        subject.instance_variable_get(:@text).must_equal 'vodka'
+      end
+      it 'should not have the @tokens ivar' do
+        subject.instance_variable_get(:@tokens).must_be_nil
+      end
+    end
+    describe 'after tokenization' do
+      subject { Tokenizer.new('vodka').tap(&:tokens) }
+      it 'should have the @tokens ivar' do
+        subject.instance_variable_get(:@tokens).wont_be_nil
+      end
+      it 'should have the @scanner ivar' do
+        subject.instance_variable_get(:@scanner).wont_be_nil
+      end
+      it 'should have the tokens set' do
+        subject.tokens.must_be_kind_of SortedSet
+      end
+    end
+    describe 'tokenization facilities' do
+      it 'can handle words' do
+        Tokenizer.new('hello').tokens.must_equal(
+          SortedSet.new([Entity.new(0, 5, :letter)])
+        )
+      end
+      it 'can handle floats' do
+        Tokenizer.new('14.88').tokens.must_equal(
+          SortedSet.new([Entity.new(0, 5, :float)])
+        )
+      end
+      it 'can handle integers' do
+        Tokenizer.new('1337').tokens.must_equal(
+          SortedSet.new([Entity.new(0, 4, :integer)])
+        )
+      end
+      it 'can handle words and integers' do
+        Tokenizer.new('Hello, I am 18').tokens.must_equal(
+          SortedSet.new([Entity.new(0,  5,  :letter),
+                         Entity.new(5,  6,  :spunct),
+                         Entity.new(6,  7,  :separ),
+                         Entity.new(7,  8,  :letter),
+                         Entity.new(8,  9,  :separ),
+                         Entity.new(9,  11, :letter),
+                         Entity.new(11, 12, :separ),
+                         Entity.new(12, 14, :integer)])
+        )
+      end
+      it 'can handle multi-line paragraphs' do
+        Tokenizer.new("Brateeshka..!\n\nPrines!").tokens.must_equal(
+          SortedSet.new([Entity.new(0,  10, :letter),
+                         Entity.new(10, 12, :punct),
+                         Entity.new(12, 13, :punct),
+                         Entity.new(13, 15, :break),
+                         Entity.new(15, 21, :letter),
+                         Entity.new(21, 22, :punct)])
+        )
+      end
+      it 'can handle separated integers' do
+        Tokenizer.new('228/359').tokens.must_equal(
+          SortedSet.new([Entity.new(0, 3, :integer),
+                         Entity.new(3, 4, :separ),
+                         Entity.new(4, 7, :integer)])
+        )
+      end
+    end
+  end
+end
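
The multi-line spec above expects `..!` to yield two `:punct` entities, `(10, 12)` and `(12, 13)`, because `split_parse!` groups runs of identical characters with a backreference. A standalone sketch of just that grouping step, with a hypothetical `split_runs` helper that returns offset pairs instead of building entities:

```ruby
# Hypothetical helper: splits a string into runs of identical characters
# and returns [from, to] offsets, starting at `offset`.
def split_runs(string, offset = 0)
  # Group 2 captures one character; \2* extends the run while it repeats.
  runs = string.scan(/((.|\n)\2*)/).map(&:first)

  runs.map do |run|
    span = [offset, offset + run.length]
    offset += run.length
    span
  end
end

punct_spans = split_runs('..!', 10)
```

The `(.|\n)` alternative is needed because `.` does not match a newline by default, and the same helper is applied to line breaks in the tokenizer.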

metadata CHANGED

@@ -1,29 +1,81 @@
 --- !ruby/object:Gem::Specification
 name: greeb
 version: !ruby/object:Gem::Version
-  version: 0.0.2
-  prerelease:
+  version: 0.1.0.rc1
+  prerelease: 6
 platform: ruby
 authors:
 - Dmitry A. Ustalov
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-02-20 00:00:00.000000000 +05:00
-default_executable:
+date: 2012-07-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: rspec
-  requirement: &81165430 !ruby/object:Gem::Requirement
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
     none: false
     requirements:
-    - - ~>
+    - - ! '>='
       - !ruby/object:Gem::Version
-        version: 2.4.0
-  type: :runtime
+        version: '0'
+  type: :development
   prerelease: false
-  version_requirements: *81165430
-description: Greeb is awesome Graphematical Analyzer, written in Ruby.
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '2.11'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '2.11'
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Greeb is a simple yet awesome regexp-based tokenizer, written in Ruby.
 email:
 - dmitry@eveel.ru
 executables: []
@@ -31,19 +83,20 @@ extensions: []
 extra_rdoc_files: []
 files:
 - .gitignore
+- .travis.yml
+- .yardopts
 - Gemfile
-- Gemfile.lock
-- README
+- LICENSE
+- README.md
 - Rakefile
-- greeb-test.rb
 - greeb.gemspec
-- lib/enumerable.rb
 - lib/greeb.rb
-- lib/greeb/parser.rb
-- lib/meta_array.rb
-- spec/parser_spec.rb
+- lib/greeb/segmentator.rb
+- lib/greeb/tokenizer.rb
+- lib/greeb/version.rb
+- spec/segmentator_spec.rb
 - spec/spec_helper.rb
-has_rdoc: true
+- spec/tokenizer_spec.rb
 homepage: https://github.com/eveel/greeb
 licenses: []
 post_install_message:
@@ -56,18 +109,23 @@ required_ruby_version: !ruby/object:Gem::Requirement
   - - ! '>='
     - !ruby/object:Gem::Version
       version: '0'
+      segments:
+      - 0
+      hash: -4603914053803130942
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
-  - - ! '>='
+  - - ! '>'
     - !ruby/object:Gem::Version
-      version: '0'
+      version: 1.3.1
 requirements: []
 rubyforge_project: greeb
-rubygems_version: 1.5.2
+rubygems_version: 1.8.24
 signing_key:
 specification_version: 3
-summary: Greeb is a Graphematical Analyzer.
+summary: Greeb is a simple regexp-based tokenizer.
 test_files:
-- spec/parser_spec.rb
+- spec/segmentator_spec.rb
 - spec/spec_helper.rb
+- spec/tokenizer_spec.rb
+has_rdoc:

data/Gemfile.lock DELETED

@@ -1,24 +0,0 @@
-PATH
-  remote: .
-  specs:
-    greeb (0.0.2)
-      rspec (~> 2.4.0)
-GEM
-  remote: http://rubygems.org/
-  specs:
-    diff-lcs (1.1.2)
-    rspec (2.4.0)
-      rspec-core (~> 2.4.0)
-      rspec-expectations (~> 2.4.0)
-      rspec-mocks (~> 2.4.0)
-    rspec-core (2.4.0)
-    rspec-expectations (2.4.0)
-      diff-lcs (~> 1.1.2)
-    rspec-mocks (2.4.0)
-PLATFORMS
-  ruby
-DEPENDENCIES
-  greeb!

data/README DELETED

File without changes

data/greeb-test.rb DELETED

@@ -1,141 +0,0 @@
-#!/usr/bin/env ruby
-# encoding: utf-8
-require 'rubygems'
-require 'graphviz'
-$:.unshift('./lib')
-require 'greeb'
-origin = <<-END
- - Сынок, чего это     от тебя   зигами пахнет,
-опять на Манежную площадь ходил?
- - Нет мама, я в метро ехал, там  назиговано было!!
-Четырнадцать, восемьдесять восемь: 14/88.
-Вот так блять
-END
-origin.chomp!
-def identify(token)
-  case token
-    when Greeb::RU_LEX then 'RU_LEX'
-    when Greeb::EN_LEX then 'EN_LEX'
-    when Greeb::EOL then 'EOL'
-    when Greeb::SEP then 'SEP'
-    when Greeb::PUN then 'PUN'
-    when Greeb::SPUN then 'SPUN'
-    when Greeb::DIG then 'DIG'
-    when Greeb::DIL then 'DIL'
-  else
-    '?!'
-  end
-end
-greeb = Greeb::Parser.new(origin)
-text = greeb.tree
-g = GraphViz.new('graphematics', 'type' => 'graph')
-g.node[:color]    = '#ddaa66'
-g.node[:style]    = 'filled'
-g.node[:shape]    = 'box'
-g.node[:penwidth] = '1'
-g.node[:fontname] = 'PT Sans'
-g.node[:fontsize] = '8'
-g.node[:fillcolor]= '#ffeecc'
-g.node[:fontcolor]= '#775500'
-g.node[:margin]   = '0.0'
-g.edge[:color]    = '#999999'
-g.edge[:weight]   = '1'
-g.edge[:fontname] = 'PT Sans'
-g.edge[:fontcolor]= '#444444'
-g.edge[:fontsize] = '6'
-g.edge[:dir]      = 'forward'
-g.edge[:arrowsize]= '0.5'
-bid = 'begin'
-g.add_node(bid).tap do |node|
-  node.label = "Начало\nтекста"
-  node.shape = 'ellipse'
-  node.style = ''
-end
-eid = 'end'
-g.add_node(eid).tap do |node|
-  node.label = "Конец\nтекста"
-  node.shape = 'ellipse'
-  node.style = ''
-end
-tree = text.map_with_index do |paragraph, i|
-  pid = "p#{i}"
-  sentences = paragraph.map_with_index do |sentence, j|
-    sid = "#{pid}s#{j}"
-    subsentences = sentence.map_with_index do |subsentence, k|
-      ssid = "#{sid}ss#{k}"
-      tokens = subsentence.map_with_index do |token, l|
-        next if ' ' == token
-        [ "#{ssid}t#{l}", token, l ]
-      end
-      tokens.delete(nil)
-      [ ssid, tokens, k ]
-    end
-    [ sid, subsentences, j ]
-  end
-  [ pid, sentences, i ]
-end
-tree.each do |pid, paragraph, i|
-  g.add_node(pid).tap do |node|
-    node.label = "Абзац\n№#{i + 1}"
-    node.shape = 'ellipse'
-  end
-  g.add_edge(bid, pid)
-  paragraph.each do |sid, sentence, j|
-    g.add_node(sid).tap do |node|
-      node.label = "Предложение\n№#{j + 1}"
-      node.shape = 'ellipse'
-    end
-    g.add_edge(pid, sid)
-    sentence.each do |ssid, subsentence, k|
-      g.add_node(ssid).tap do |node|
-        node.label = "Подпредложение\n№#{k + 1}"
-        node.shape = 'ellipse'
-      end
-      g.add_edge(sid, ssid)
-      subsentence.each do |tid, token, l|
-        g.add_node(tid).label = token
-        g.add_edge(ssid, tid).label = identify(token)
-        g.add_edge(tid, eid)
-      end
-      subsentence.each_cons(2) do |(tid1, token1, l1),
-                                   (tid2, token2, l2)|
-        g.add_edge(tid1, tid2).tap do |edge|
-          edge.weight = 0.25
-          edge.style = 'dashed'
-        end
-      end
-    end
-    sentence.each_cons(2) do |(ssid1, subsentence1, k1),
-                              (ssid2, subsentence2, k2)|
-      tid1, token1, l1 = subsentence1.last
-      tid2, token2, l2 = subsentence2.first
-      g.add_edge(tid1, tid2).tap do |edge|
-        edge.weight = 0.5
-        edge.style = 'dashed'
-      end
-    end
-  end
-end
-g.output(:output => 'png', :file => 'graph.png')

data/lib/enumerable.rb DELETED

@@ -1,10 +0,0 @@
-# encoding: utf-8
-# Enumerable module additions.
-#
-module Enumerable
-  def collect_with_index(i = -1) # :nodoc:
-    collect { |e| yield(e, i += 1) }
-  end
-  alias map_with_index collect_with_index
-end

data/lib/greeb/parser.rb DELETED

@@ -1,176 +0,0 @@
-# encoding: utf-8
-require 'meta_array'
-require 'enumerable'
-# Graphematical Parser of the Greeb.
-# Use it with love.
-#
-class Greeb::Parser
-  # Russian lexeme (i.e.: "хуй").
-  #
-  RUSSIAN_LEXEME = /^[А-Яа-яЁё]+$/u
-  # English lexeme (i.e.: "foo").
-  #
-  ENGLISH_LEXEME = /^[A-Za-z]+$/u
-  # End of Line sequence (i.e.: "\n").
-  #
-  END_OF_LINE = /^\n+$/u
-  # In-subsentence seprator (i.e.: "*" or "\").
-  #
-  SEPARATOR = /^[*=_\/\\ ]$/u
-  # Punctuation character (i.e.: "." or "!").
-  #
-  PUNCTUATION = /^(\.|\!|\?)$/u
-  # In-sentence punctuation character (i.e.: "," or "-").
-  #
-  SENTENCE_PUNCTUATION = /^(\,|\[|\]|\(|\)|\-|:|;)$/u
-  # Digit (i.e.: "1337").
-  #
-  DIGIT = /^[0-9]+$/u
-  # Digit-Letter complex (i.e.: "0xDEADBEEF").
-  #
-  DIGIT_LETTER = /^[А-Яа-яA-Za-z0-9Ёё]+$/u
-  # Empty string (i.e.: "").
-  #
-  EMPTY = ''
-  attr_accessor :text
-  private :text=
-  # Create a new instance of Greeb::Parser.
-  #
-  # ==== Parameters
-  # text<String>:: Source text.
-  #
-  def initialize(text)
-    self.text = text
-  end
-  # Perform the text parsing.
-  #
-  # ==== Returns
-  # Array:: Tree of Graphematical Analysis of text.
-  #
-  def parse
-    return @tree if @tree
-    # parse tree
-    tree = MetaArray.new
-    # paragraph, sentence, subsentence
-    p_id, s_id, ss_id = 0, 0, 0
-    # current token
-    token = ''
-    # run FSM
-    text.each_char do |c|
-      case c
-        when END_OF_LINE then begin
-          case token
-            when EMPTY then token << c
-            when END_OF_LINE then begin
-              token = ''
-              p_id += 1
-              s_id = 0
-              ss_id = 0
-            end
-          else
-            tree[p_id][s_id][ss_id] << token
-            token = c
-          end
-        end
-        when SEPARATOR then begin
-          case token
-            when EMPTY
-          else
-            tree[p_id][s_id][ss_id] << token
-            while tree[p_id][s_id][ss_id].last == c
-              tree[p_id][s_id][ss_id].pop
-            end
-            tree[p_id][s_id][ss_id] << c
-            token = ''
-          end
-        end
-        when PUNCTUATION then begin
-          case token
-            when EMPTY
-          else
-            tree[p_id][s_id][ss_id] << token
-            tree[p_id][s_id][ss_id] << c
-            token = ''
-            s_id += 1
-            ss_id = 0
-          end
-        end
-        when SENTENCE_PUNCTUATION then begin
-          case token
-            when EMPTY
-          else
-            tree[p_id][s_id][ss_id] << token
-            tree[p_id][s_id][ss_id] << c
-            token = ''
-            ss_id += 1
-          end
-        end
-        when RUSSIAN_LEXEME then begin
-          case token
-            when END_OF_LINE then begin
-              tree[p_id][s_id][ss_id] << ' '
-              token = c
-            end
-          else
-            token << c
-          end
-        end
-        when ENGLISH_LEXEME then begin
-          case token
-            when END_OF_LINE then begin
-              tree[p_id][s_id][ss_id] << ' '
-              token = c
-            end
-          else
-            token << c
-          end
-        end
-        when DIGIT then begin
-          case token
-            when END_OF_LINE then begin
-              tree[p_id][s_id][ss_id] << ' '
-              token = c
-            end
-          else
-            token << c
-          end
-        end
-        when DIGIT_LETTER then begin
-          case token
-            when END_OF_LINE then begin
-              tree[p_id][s_id][ss_id] << token
-              token = c
-            end
-          else
-            token << c
-          end
-        end
-      end
-    end
-    unless token.empty?
-      tree[p_id][s_id][ss_id] << token
-    end
-    tree.delete(nil)
-    @tree = tree.to_a
-  end
-end

data/lib/meta_array.rb DELETED

@@ -1,14 +0,0 @@
-# encoding: utf-8
-# MetaArray is an Array, which creates subarrays
-# on non-existent elements.
-#
-class MetaArray < Array
-  def [] id
-    super(id) or begin
-      self.class.new.tap do |element|
-        self[id] = element
-      end
-    end
-  end
-end

data/spec/parser_spec.rb DELETED

@@ -1,63 +0,0 @@
-# encoding: utf-8
-require File.expand_path('../spec_helper.rb', __FILE__)
-describe Greeb::Parser do
-  it 'should parse very simple strings' do
-    'буба сука дебил'.should be_parsed_as([
-      [
-        [ [ 'буба', ' ', 'сука', ' ', 'дебил' ] ]
-      ]
-    ])
-  end
-  it 'should parse one sentence with subsentences' do
-    'буба, сука, дебил'.should be_parsed_as([
-      [
-        [
-          [ 'буба', ',' ],
-          [ 'сука', ',' ],
-          [ 'дебил' ]
-        ]
-      ]
-    ])
-  end
-  it 'should parse two simple paragraphs' do
-    "буба сука дебил\n\nточно!".should be_parsed_as([
-      [
-        [ [ 'буба', ' ', 'сука', ' ', 'дебил' ] ]
-      ],
-      [
-        [ [ 'точно', '!' ] ]
-      ]
-    ])
-  end
-  it 'should parse two sentences in paragraph' do
-    "буба молодец? буба умница.".should be_parsed_as([
-      [
-        [ [ 'буба', ' ', 'молодец', '?' ] ],
-        [ [ 'буба', ' ', 'умница', '.' ] ]
-      ]
-    ])
-  end
-  it 'should parse sentences with floating point values' do
-    'буба не считает Пи равной 3.14'.should be_parsed_as([
-      [
-        [ [ 'буба', ' ', 'не', ' ', 'считает', ' ',
-            'Пи', ' ', 'равной', ' ', '3.14' ] ]
-      ]
-    ])
-  end
-  it 'should parse sentences with floating "dot" values' do
-    'буба не считает Пи равной 3,14'.should be_parsed_as([
-      [
-        [ [ 'буба', ' ', 'не', ' ', 'считает', ' ',
-            'Пи', ' ', 'равной', ' ', '3,14' ] ]
-      ]
-    ])
-  end
-end