tokkens 0.1.0
- checksums.yaml +7 -0
- data/.gitignore +14 -0
- data/.travis.yml +2 -0
- data/CHANGELOG.md +5 -0
- data/Gemfile +2 -0
- data/LICENSE.md +21 -0
- data/README.md +201 -0
- data/Rakefile +6 -0
- data/examples/classify.rb +61 -0
- data/lib/tokkens.rb +3 -0
- data/lib/tokkens/tokenizer.rb +57 -0
- data/lib/tokkens/tokens.rb +141 -0
- data/lib/tokkens/version.rb +3 -0
- data/spec/spec_helper.rb +8 -0
- data/spec/tokenizer_spec.rb +40 -0
- data/spec/tokens_spec.rb +133 -0
- data/tokkens.gemspec +29 -0
- metadata +108 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: de30d6649f845763e36fd45ed9770090021a44b2
  data.tar.gz: 34c389c2c767df2f9661b8c3a7a341da9294c20c
SHA512:
  metadata.gz: cd819fc173e3fbc889fc99c269f44033b74a514a9dd173485617f9f78d9455adc17cdb3c5b1ccb00730786bb46460c5254769cb9975bd2d5b8aeb70c6238a145
  data.tar.gz: 4ab7c1a5a804c33b4fdd64bd798f2e2ce0837eb0ad9ed2da781573193c212e9c0a8f1a7e175db5d0492e0099f756ec9afb7b402814bc2af4dd3d80b9d0b3526e
data/.gitignore
ADDED
data/.travis.yml
ADDED
data/CHANGELOG.md
ADDED
data/Gemfile
ADDED
data/LICENSE.md
ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2017 wvengen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,201 @@
# Tokkens

[![Build Status](https://travis-ci.org/q-m/tokkens-ruby.svg?branch=master)](https://travis-ci.org/q-m/tokkens-ruby)
[![Documentation](https://img.shields.io/badge/yard-docs-blue.svg)](http://www.rubydoc.info/github/q-m/tokkens-ruby/master)

`Tokkens` makes it easy to apply a [vector space model](https://en.wikipedia.org/wiki/Vector_space_model)
to text documents, targeted towards use with machine learning. It provides a mapping
between numbers and tokens (strings).

Read more about [installation](#installation), [usage](#usage) or skip to an [example](#example).

## Installation

Add this line to your application's Gemfile:

```ruby
gem 'tokkens'
```

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install tokkens

Note that you'll need [Ruby](http://ruby-lang.org/) 2+.

## Usage

### Tokens

#### `get` and `find`

`Tokens` is a store for mapping strings (tokens) to numbers. Each string gets
its own unique number. First create a new instance.

```ruby
require 'tokkens'
@tokens = Tokkens::Tokens.new
```

Then `get` a number for some tokens. You'll notice that each distinct token
gets its own number.

```ruby
puts @tokens.get('foo')
# => 1
puts @tokens.get('bar')
# => 2
puts @tokens.get('foo')
# => 1
```

The reverse operation is `find` (code is optimized for `get`).

```ruby
puts @tokens.find(2)
# => "bar"
```

The `prefix` option can be used to add a prefix to the token.

```ruby
puts @tokens.get('blup', prefix: 'DESC:')
# => 3
puts @tokens.find(3)
# => "DESC:blup"
puts @tokens.find(3, prefix: 'DESC:')
# => "blup"
```

#### `load` and `save`

To persist tokens across runs, one can load and save the list of tokens. At the
moment, this is a plain text file with one line per token, containing its number,
occurrence count and the token itself.

```ruby
@tokens.save('foo.tokens')
# ---- some time later
@tokens = Tokkens::Tokens.new
@tokens.load('foo.tokens')
```
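
For a concrete picture of the format (a hypothetical store where `foo` was seen
twice and `bar` once, matching what `save` writes), `foo.tokens` would contain:

```
1 2 foo
2 1 bar
```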

#### `limit!`

One common operation is reducing the number of words, to retain only those that are
most relevant. This is called feature selection or
[dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction).
You can select by maximum size with `max_size` (the most frequently occurring words are kept).

```ruby
@tokens = Tokkens::Tokens.new
@tokens.get('foo')
@tokens.get('bar')
@tokens.get('baz')
@tokens.indexes
# => [1, 2, 3]
@tokens.limit!(max_size: 2)
@tokens.indexes
# => [1, 2]
```

Or you can reduce by minimum occurrence with `min_occurence`.

```ruby
@tokens.get('zab')
# => 4
@tokens.get('bar')
# => 2
@tokens.indexes
# => [1, 2, 4]
@tokens.limit!(min_occurence: 2)
@tokens.indexes
# => [2]
```

Note that this limits only the token store; if you reference the removed tokens
elsewhere, you may still need to remove them there as well.
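
As a minimal sketch of that caveat, a number obtained earlier no longer resolves
after the call:

```ruby
@tokens = Tokkens::Tokens.new
i = @tokens.get('rare')           # => 1, seen once
@tokens.get('common')             # => 2
@tokens.get('common')
@tokens.limit!(min_occurence: 2)  # drops 'rare'
@tokens.find(i)                   # => nil, the stored number now dangles
```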

#### `freeze!` and `thaw!`

`Tokens` may be used to train a model from a training dataset, and then later to
predict based on that model. In this case, new tokens need to be added during the
training stage, but it doesn't make sense to generate new tokens during prediction.

By default, `Tokens` makes a new token when an unrecognized string is passed to `get`.
But once it has been frozen with `freeze!` (check with `frozen?`), `get` returns `nil`
for new tokens instead. If for some reason you'd like to add new tokens again, use `thaw!`.

```ruby
@tokens.freeze!
@tokens.get('hithere')
# => 4
@tokens.get('blahblah')
# => nil
@tokens.thaw!
@tokens.get('blahblah')
# => 5
```

Note that after `load`ing, the tokens are frozen; use `thaw!` if you need to add new ones.

### Tokenizer

When processing sentences or other text bodies, `Tokenizer` provides a way to map
these to arrays of numbers (using `Tokens`).

```ruby
@tokenizer = Tokkens::Tokenizer.new
@tokenizer.get('hi from example')
# => [1, 2, 3]
@tokenizer.tokens.find(3)
# => "example"
```

The `prefix` keyword argument also works here.

```ruby
@tokenizer.get('from example', prefix: 'X:')
# => [4, 5]
@tokenizer.tokens.find(5)
# => "X:example"
```

One can specify a minimum token length (default 2) and stop words for tokenizing.

```ruby
@tokenizer = Tokkens::Tokenizer.new(min_length: 3, stop_words: %w(and the))
@tokenizer.get('the cat and a bird').map {|i| @tokenizer.tokens.find(i)}
# => ["cat", "bird"]
```
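
A `Tokenizer` can also wrap an existing `Tokens` store (the first argument of
`Tokenizer#initialize`), for instance to share one numbering between several
tokenizers or to use a non-default offset; a small sketch:

```ruby
@tokens = Tokkens::Tokens.new(offset: 10)
@tokenizer = Tokkens::Tokenizer.new(@tokens)
@tokenizer.get('hello world')
# => [10, 11]
```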

### Example

A basic text classification example using [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/)
can be found in [examples/classify.rb](examples/classify.rb). Run it as follows:

```
$ gem install liblinear-ruby
$ ruby examples/classify.rb
How many students are in for the exams today? -> students exams -> school
The forest has large trees, while the field has its flowers. -> trees field flowers -> nature
Can we park our cars inside that building to go shopping? -> cars building shopping -> city
```

The classifier was trained using three training sentences for each class.
The output shows a prediction for three test sentences. Each test sentence is
printed, followed by its tokens, followed by the predicted class.

## [MIT license](LICENSE.md)

## Contributing

1. Fork it ( https://github.com/[my-github-username]/tokkens/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Make sure the tests are green (`rspec`)
5. Push to the branch (`git push origin my-new-feature`)
6. Create a new Pull Request
data/Rakefile
ADDED
data/examples/classify.rb
ADDED
@@ -0,0 +1,61 @@
# Document classification example using tokkens and linear SVM
require 'tokkens'   # `rake install` or `gem install tokkens`
require 'liblinear' # `gem install liblinear-ruby`

# define the training data
TRAINING_DATA = [
  ['school', 'The teacher writes a formula on the blackboard, while students are studying for their exams.'],
  ['school', 'Students play soccer during the break after class, while a teacher watches over them.'],
  ['school', 'All the students are studying hard for the final exams.'],
  ['nature', 'The fox is running around the trees, while flowers bloom in the field.'],
  ['nature', 'Where are the rabbits hiding today? Their holes below the trees are empty.'],
  ['nature', 'The dark sky is bringing rain. The fox hides, rabbits find their holes, but the flowers surrender.'],
  ['city', 'Cars are passing by swiftly, until the traffic lights become red.'],
  ['city', 'Look at the high building, with so many windows. Who would live there?'],
  ['city', 'The shopping centre building is over there, you will find everything you need to buy.'],
]

# after training, these test sentences will receive a predicted classification
TEST_DATA = [
  'How many students are in for the exams today?',
  'The forest has large trees, while the field has its flowers.',
  'Can we park our cars inside that building to go shopping?',
]

# stop words don't carry meaning, so we'd better ignore them
STOP_WORDS = %w(
  the a on to at so today all many some
  are is will would their you them their our everyone everything who there
  while during over for below by with after in around until where
)

def preprocess(s)
  s.downcase.gsub(/[^a-z\s]/, '')
end

@labels = Tokkens::Tokens.new
@tokenizer = Tokkens::Tokenizer.new(stop_words: STOP_WORDS)

# train
training_labels = []
training_samples = []
TRAINING_DATA.each do |(label, sentence)|
  training_labels << @labels.get(label)
  tokens = @tokenizer.get(preprocess(sentence)).uniq
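  # binary bag-of-words features: each token number maps to weight 1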
  training_samples << Hash[tokens.zip([1] * tokens.length)]
end
#@tokenizer.tokens.limit!(min_occurence: 2) # limit number of tokens - doesn't affect training though!
@model = Liblinear.train({}, training_labels, training_samples)

# predict
@tokenizer.tokens.freeze!
TEST_DATA.each do |sentence|
  tokens = @tokenizer.get(preprocess(sentence))
  label_number = Liblinear.predict(@model, Hash[tokens.zip([1] * tokens.length)])
  puts "#{sentence} -> #{tokens.map{|i| @tokenizer.tokens.find(i)}.join(' ')} -> #{@labels.find(label_number)}"
end

# you might want to persist data for prediction at a later time
#@model.save('test.model')
#@labels.save('test.labels')
#@tokenizer.tokens.save('test.tokens')
data/lib/tokkens.rb
ADDED
data/lib/tokkens/tokenizer.rb
ADDED
@@ -0,0 +1,57 @@
require_relative 'tokens'

module Tokkens
  # Converts a string to a list of token numbers.
  #
  # Useful for computing with text, like machine learning.
  # Before using the tokenizer, you're expected to have pre-processed
  # the text depending on the application. For example: converting to lowercase,
  # removing non-word characters, transliterating accented characters.
  #
  # This class then splits the string into tokens by whitespace, and
  # removes tokens not passing the selection criteria.
  #
  class Tokenizer

    # default minimum token length
    MIN_LENGTH = 2

    # no default stop words to ignore
    STOP_WORDS = []

    # @!attribute [r] tokens
    #   @return [Tokens] object to use for obtaining tokens
    # @!attribute [r] stop_words
    #   @return [Array<String>] stop words to ignore
    # @!attribute [r] min_length
    #   @return [Fixnum] minimum length for tokens
    attr_reader :tokens, :stop_words, :min_length

    # Create a new tokenizer.
    #
    # @param tokens [Tokens] object to use for obtaining token numbers
    # @param min_length [Fixnum] minimum length for tokens
    # @param stop_words [Array<String>] stop words to ignore
    def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
      @tokens = tokens || Tokens.new
      @stop_words = stop_words
      @min_length = min_length
    end

    # @return [Array<Fixnum>] array of token numbers
    def get(s, **kwargs)
      return [] if !s || s.strip == ''
      tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
    end

    private

    def tokenize(s)
      s.split.select(&method(:include?))
    end

    def include?(s)
      s.length >= @min_length && !@stop_words.include?(s)
    end
  end
end
data/lib/tokkens/tokens.rb
ADDED
@@ -0,0 +1,141 @@
module Tokkens
  # Converts a string token to a uniquely identifying sequential number.
  #
  # Useful for working with a {https://en.wikipedia.org/wiki/Vector_space_model vector space model}
  # for text.
  class Tokens

    # @!attribute [r] offset
    #   @return [Fixnum] Number of the first token.
    attr_accessor :offset

    def initialize(offset: 1)
      # liblinear can't use offset 0, libsvm doesn't mind starting at one
      @tokens = {}
      @offset = offset
      @next_number = offset
      @frozen = false
    end

    # Stop assigning new numbers to tokens.
    # @see #frozen?
    # @see #thaw!
    def freeze!
      @frozen = true
    end

    # Allow new tokens to be created.
    # @see #freeze!
    # @see #frozen?
    def thaw!
      @frozen = false
    end

    # @return [Boolean] Whether the tokens are frozen or not.
    # @see #freeze!
    # @see #thaw!
    def frozen?
      @frozen
    end

    # Limit the number of tokens.
    #
    # @param max_size [Fixnum] Maximum number of tokens to retain
    # @param min_occurence [Fixnum] Keep only tokens seen at least this many times
    # @return [Fixnum] Number of tokens left
    def limit!(max_size: nil, min_occurence: nil)
      # @todo raise if frozen
      if min_occurence
        @tokens.delete_if {|name, data| data[1] < min_occurence }
      end
      if max_size
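        # keep the max_size most frequent tokens: sort by descending occurrence count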
        @tokens = Hash[@tokens.to_a.sort_by {|a| -a[1][1] }[0..(max_size-1)]]
      end
      @tokens.length
    end

    # Return a number for a new or existing token.
    #
    # When the token was seen before, the same number is returned. If the token
    # is first seen and this class isn't {#frozen?}, a new number is returned;
    # else +nil+ is returned.
    #
    # @param s [String] token to return number for
    # @option kwargs [String] :prefix optional string to prepend to the token
    # @return [Fixnum, NilClass] number for given token
    def get(s, **kwargs)
      return if !s || s.strip == ''
      @frozen ? retrieve(s, **kwargs) : upsert(s, **kwargs)
    end

    # Return a token by number.
    #
    # This class is optimized for retrieving by token, not by number.
    #
    # @param i [Fixnum] number to return token for
    # @param prefix [String] optional string to remove from beginning of token
    # @return [String, NilClass] given token, or +nil+ when not found
    def find(i, prefix: nil)
      @tokens.each do |s, data|
        if data[0] == i
          return (prefix && s.start_with?(prefix)) ? s[prefix.length..-1] : s
        end
      end
      nil
    end

    # Return indexes for all of the current tokens.
    #
    # @return [Array<Fixnum>] All current token numbers.
    # @see #limit!
    def indexes
      @tokens.values.map(&:first)
    end

    # Load tokens from file.
    #
    # The tokens are frozen by default.
    # All previously existing tokens are removed.
    #
    # @param filename [String] Filename
    def load(filename)
      File.open(filename) do |f|
        @tokens = {}
        f.each_line do |line|
          id, count, name = line.rstrip.split(/\s+/, 3)
          @tokens[name.strip] = [id.to_i, count.to_i]
        end
      end
      # safer: don't silently create new tokens after loading
      freeze!
    end

    # Save tokens to file.
    #
    # @param filename [String] Filename
    def save(filename)
      File.open(filename, 'w') do |f|
        @tokens.each do |token, (index, count)|
          f.puts "#{index} #{count} #{token}"
        end
      end
    end

    private

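    # look up the number of an existing token; nil when unknown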
    def retrieve(s, prefix: '')
      data = @tokens[prefix + s]
      data[0] if data
    end

    # return token number, update next_number; always returns a number
    def upsert(s, prefix: '')
      unless data = @tokens[prefix + s]
        @tokens[prefix + s] = data = [@next_number, 0]
        @next_number += 1
      end
      data[1] += 1
      data[0]
    end
  end
end
data/lib/tokkens/version.rb
ADDED
data/spec/spec_helper.rb
ADDED
data/spec/tokenizer_spec.rb
ADDED
@@ -0,0 +1,40 @@
require_relative 'spec_helper'

describe Tokenizer do
  let(:tokenizer) { described_class.new }
  let(:offset) { 1 } # default token offset

  describe '#get' do
    it 'does tokenization' do
      expect(tokenizer.get('foo bar')).to eq [offset, offset + 1]
    end

    it 'ignores too short tokens' do
      t = described_class.new(min_length: 2)
      expect(t.get('x')).to eq []
    end

    it 'ignores stop words' do
      t = described_class.new(stop_words: ['xyz'])
      expect(t.get('xyz foo')).to eq [offset]
    end

    it 'does not return nil tokens' do
      tokenizer.tokens.get('foo')
      tokenizer.tokens.freeze!
      expect(tokenizer.get('foo bar')).to eq [offset]
    end
  end

  describe '#tokens' do
    it 'returns a tokens object by default' do
      expect(tokenizer.tokens).to be_a Tokens
    end

    it 'can be overridden' do
      tokens = Tokens.new
      t = described_class.new(tokens)
      expect(t.tokens).to be tokens
    end
  end
end
data/spec/tokens_spec.rb
ADDED
@@ -0,0 +1,133 @@
require_relative 'spec_helper'
require 'tempfile'

describe Tokens do
  let(:tokens) { described_class.new }
  let(:offset) { 1 } # default offset

  describe '#get' do
    it 'can get new tokens' do
      expect(tokens.get('bar')).to eq offset
      expect(tokens.get('foo')).to eq (offset + 1)
    end

    it 'can get an existing token' do
      tokens.get('bar')
      expect(tokens.get('bar')).to eq offset
    end

    it 'can include a prefix' do
      tokens.get('bar', prefix: 'XyZ$')
      expect(tokens.get('XyZ$bar')).to eq offset
    end

    it 'can get an existing token when frozen' do
      tokens.get('blup')
      tokens.freeze!
      expect(tokens.get('blup')).to eq offset
    end

    it 'cannot get a new token when frozen' do
      tokens.get('blup')
      tokens.freeze!
      expect(tokens.get('blabla')).to be_nil
    end
  end

  describe '#find' do
    it 'can find an existing token' do
      tokens.get('blup')
      i = tokens.get('blah')
      expect(tokens.find(i)).to eq 'blah'
    end

    it 'returns nil for a non-existing token' do
      tokens.get('blup')
      expect(tokens.find(offset + 1)).to eq nil
    end

    it 'removes the prefix' do
      i = tokens.get('blup', prefix: 'FOO$')
      expect(tokens.find(i, prefix: 'FOO$')).to eq 'blup'
    end
  end

  describe '#indexes' do
    it 'is empty without tokens' do
      expect(tokens.indexes).to eq []
    end

    it 'returns the expected indexes' do
      tokens.get('foo')
      tokens.get('blup')
      expect(tokens.indexes).to eq [offset, offset + 1]
    end
  end

  describe '#offset' do
    it 'has a default' do
      expect(described_class.new.offset).to eq offset
    end

    it 'can override the default' do
      expect(described_class.new(offset: 5).offset).to eq 5
    end

    it 'affects the first number' do
      tokens = described_class.new(offset: 12)
      expect(tokens.get('hi')).to eq 12
    end
  end

  describe '#frozen?' do
    it 'is not frozen by default' do
      expect(tokens.frozen?).to be false
    end

    it 'can be frozen' do
      tokens.freeze!
      expect(tokens.frozen?).to be true
    end

    it 'can be thawed' do
      tokens.freeze!
      tokens.thaw!
      expect(tokens.frozen?).to be false
    end
  end

  describe '#limit!' do
    it 'limits to most frequent tokens by max_size' do
      tokens.get('foo')
      tokens.get('blup')
      tokens.get('blup')
      tokens.limit!(max_size: 1)
      expect(tokens.indexes).to eq [offset + 1]
    end

    it 'limits by min_occurence' do
      tokens.get('foo')
      tokens.get('blup')
      tokens.get('foo')
      tokens.limit!(min_occurence: 2)
      expect(tokens.indexes).to eq [offset]
    end
  end

  describe '#load' do
    let(:file) { Tempfile.new('tokens') }
    after { file.unlink }

    it 'saves and loads tokens' do
      tokens.get('foo')
      tokens.get('bar')
      tokens.save(file.path)
      expect(File.exists?(file.path)).to be true
      expect(File.zero?(file.path)).to be false

      ntokens = described_class.new
      ntokens.load(file.path)
      expect(ntokens.get('bar')).to eq (offset + 1)
    end
  end
end
data/tokkens.gemspec
ADDED
@@ -0,0 +1,29 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'tokkens/version'

Gem::Specification.new do |spec|
  spec.name          = "tokkens"
  spec.version       = Tokkens::VERSION
  spec.authors       = ["wvengen"]
  spec.email         = ["dev-rails@willem.engen.nl"]
  spec.summary       = %q{Simple text to numbers tokenizer}
  spec.homepage      = "https://github.com/q-m/ruby-tokkens"
  spec.license       = "MIT"
  spec.description   = <<-EOD
    Tokkens makes it easy to apply a vector space model to text documents,
    targeted towards use with machine learning. It provides a mapping between
    numbers and tokens (strings)
  EOD

  spec.files         = `git ls-files -z`.split("\x0")
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.required_ruby_version = '>= 2.0'
  spec.add_development_dependency "bundler", "~> 1.7"
  spec.add_development_dependency "rake", "~> 10.0"
  spec.add_development_dependency "rspec", "~> 3.5.0"
end
metadata
ADDED
@@ -0,0 +1,108 @@
--- !ruby/object:Gem::Specification
name: tokkens
version: !ruby/object:Gem::Version
  version: 0.1.0
platform: ruby
authors:
- wvengen
autorequire:
bindir: bin
cert_chain: []
date: 2017-02-08 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.7'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '1.7'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: '10.0'
- !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 3.5.0
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 3.5.0
description: |2
  Tokkens makes it easy to apply a vector space model to text documents,
  targeted towards use with machine learning. It provides a mapping between
  numbers and tokens (strings)
email:
- dev-rails@willem.engen.nl
executables: []
extensions: []
extra_rdoc_files: []
files:
- ".gitignore"
- ".travis.yml"
- CHANGELOG.md
- Gemfile
- LICENSE.md
- README.md
- Rakefile
- examples/classify.rb
- lib/tokkens.rb
- lib/tokkens/tokenizer.rb
- lib/tokkens/tokens.rb
- lib/tokkens/version.rb
- spec/spec_helper.rb
- spec/tokenizer_spec.rb
- spec/tokens_spec.rb
- tokkens.gemspec
homepage: https://github.com/q-m/ruby-tokkens
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '2.0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.4.3
signing_key:
specification_version: 4
summary: Simple text to numbers tokenizer
test_files:
- spec/spec_helper.rb
- spec/tokenizer_spec.rb
- spec/tokens_spec.rb