tokkens 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: de30d6649f845763e36fd45ed9770090021a44b2
4
+ data.tar.gz: 34c389c2c767df2f9661b8c3a7a341da9294c20c
5
+ SHA512:
6
+ metadata.gz: cd819fc173e3fbc889fc99c269f44033b74a514a9dd173485617f9f78d9455adc17cdb3c5b1ccb00730786bb46460c5254769cb9975bd2d5b8aeb70c6238a145
7
+ data.tar.gz: 4ab7c1a5a804c33b4fdd64bd798f2e2ce0837eb0ad9ed2da781573193c212e9c0a8f1a7e175db5d0492e0099f756ec9afb7b402814bc2af4dd3d80b9d0b3526e
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ mkmf.log
data/.travis.yml ADDED
@@ -0,0 +1,2 @@
1
+ language: ruby
2
+ rvm: 2.0
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
1
+ # Change Log
2
+
3
+ ## [0.1.0](https://github.com/q-m/tokkens-ruby/tree/v0.1.0) (2017-02-08)
4
+
5
+ - Initial release
data/Gemfile ADDED
@@ -0,0 +1,2 @@
1
+ source 'https://rubygems.org'
2
+ gemspec
data/LICENSE.md ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2017 wvengen
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,201 @@
1
+ # Tokkens
2
+
3
+ [![Build Status](https://travis-ci.org/q-m/tokkens-ruby.svg?branch=master)](https://travis-ci.org/q-m/tokkens-ruby)
4
+ [![Documentation](https://img.shields.io/badge/yard-docs-blue.svg)](http://www.rubydoc.info/github/q-m/tokkens-ruby/master)
5
+
6
+ `Tokkens` makes it easy to apply a [vector space model](https://en.wikipedia.org/wiki/Vector_space_model)
7
+ to text documents, targeted towards machine learning. It provides a mapping
8
+ between numbers and tokens (strings).
9
+
10
+ Read more about [installation](#installation), [usage](#usage) or skip to an [example](#example).
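
To make the idea concrete, here is a minimal plain-Ruby sketch of what such a token mapping does (this is the concept only, not the gem's API): each distinct token gets its own number, and a document becomes a sparse vector keyed by those numbers.

```ruby
# Concept sketch (plain Ruby, no gem required): assign each distinct
# token a unique number, then count occurrences per number.
numbers = Hash.new { |h, token| h[token] = h.size + 1 }
doc = %w(foo bar foo)
vector = doc.each_with_object(Hash.new(0)) { |t, v| v[numbers[t]] += 1 }
numbers # => {"foo"=>1, "bar"=>2}
vector  # => {1=>2, 2=>1}
```

`Tokkens` provides this kind of mapping, with persistence, freezing and pruning on top.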
11
+
12
+ ## Installation
13
+
14
+ Add this line to your application's Gemfile:
15
+
16
+ ```ruby
17
+ gem 'tokkens'
18
+ ```
19
+
20
+ And then execute:
21
+
22
+ $ bundle
23
+
24
+ Or install it yourself as:
25
+
26
+ $ gem install tokkens
27
+
28
+ Note that you'll need [Ruby](http://ruby-lang.org/) 2+.
29
+
30
+ ## Usage
31
+
32
+ ### Tokens
33
+
34
+ #### `get` and `find`
35
+
36
+ `Tokens` is a store for mapping strings (tokens) to numbers. Each string gets
37
+ its own unique number. First create an instance.
38
+
39
+ ```ruby
40
+ require 'tokkens'
41
+ @tokens = Tokkens::Tokens.new
42
+ ```
43
+
44
+ Then `get` a number for some tokens. You'll notice that each distinct token
45
+ gets its own number.
46
+
47
+ ```ruby
48
+ puts @tokens.get('foo')
49
+ # => 1
50
+ puts @tokens.get('bar')
51
+ # => 2
52
+ puts @tokens.get('foo')
53
+ # => 1
54
+ ```
55
+
56
+ The reverse operation is `find` (code is optimized for `get`).
57
+
58
+ ```ruby
59
+ puts @tokens.find(2)
60
+ # => "bar"
61
+ ```
62
+
63
+ The `prefix` option can be used to add a prefix to the token.
64
+
65
+ ```ruby
66
+ puts @tokens.get('blup', prefix: 'DESC:')
67
+ # => 3
68
+ puts @tokens.find(3)
69
+ # => "DESC:blup"
70
+ puts @tokens.find(3, prefix: 'DESC:')
71
+ # => "blup"
72
+ ```
73
+
74
+ #### `load` and `save`
75
+
76
+ To persist tokens across runs, one can load and save the list of tokens. At the
77
+ moment, this is a plain text file, with one line per token containing its number, occurrence count and the token itself.
78
+
79
+ ```ruby
80
+ @tokens.save('foo.tokens')
81
+ # ---- some time later
82
+ @tokens = Tokkens::Tokens.new
83
+ @tokens.load('foo.tokens')
84
+ ```
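
Given the format above (number, occurrence count, token per line), the saved file looks something like this; the exact numbers and counts depend on your data:

```
1 2 foo
2 1 DESC:blup
```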
85
+
86
+ #### `limit!`
87
+
88
+ One common operation is reducing the number of words, to retain only those that are
89
+ most relevant. This is called feature selection or
90
+ [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction).
91
+ You can select by maximum `max_size` (the most occurring words are kept).
92
+
93
+ ```ruby
94
+ @tokens = Tokkens::Tokens.new
95
+ @tokens.get('foo')
96
+ @tokens.get('bar')
97
+ @tokens.get('baz')
98
+ @tokens.indexes
99
+ # => [1, 2, 3]
100
+ @tokens.limit!(max_size: 2)
101
+ @tokens.indexes
102
+ # => [1, 2]
103
+ ```
104
+
105
+ Or you can reduce by minimum `min_occurence`.
106
+
107
+ ```ruby
108
+ @tokens.get('zab')
109
+ # => 4
110
+ @tokens.get('bar')
111
+ # => 2
112
+ @tokens.indexes
113
+ # => [1, 2, 4]
114
+ @tokens.limit!(min_occurence: 2)
115
+ @tokens.indexes
116
+ # => [2]
117
+ ```
118
+
119
+ Note that this limits only the token store; if you reference the removed tokens
120
+ elsewhere, you may still need to remove them there as well.
121
+
122
+ #### `freeze!` and `thaw!`
123
+
124
+ `Tokens` may be used to train a model on a training dataset, and then to make
125
+ predictions with that model. In this case, new tokens need to be added during the
126
+ training stage, but it doesn't make sense to generate new tokens during prediction.
127
+
128
+ By default, `Tokens` makes new tokens when an unrecognized token is passed to `get`.
129
+ But when it has been `frozen?` by `freeze!`, new tokens will return `nil` instead.
130
+ If for some reason, you'd like to add new tokens again, use `thaw!`.
131
+
132
+ ```ruby
133
+ @tokens.freeze!
134
+ @tokens.get('hithere')
135
+ # => 4
136
+ @tokens.get('blahblah')
137
+ # => nil
138
+ @tokens.thaw!
139
+ @tokens.get('blahblah')
140
+ # => 5
141
+ ```
142
+
143
+ Note that after `load`ing, the tokens are frozen; use `thaw!` if you need to add new tokens.
144
+
145
+ ### Tokenizer
146
+
147
+ When processing sentences or other text bodies, `Tokenizer` provides a way to map
148
+ these to an array of numbers (using `Tokens`).
149
+
150
+ ```ruby
151
+ @tokenizer = Tokkens::Tokenizer.new
152
+ @tokenizer.get('hi from example')
153
+ # => [1, 2, 3]
154
+ @tokenizer.tokens.find(3)
155
+ # => "example"
156
+ ```
157
+
158
+ The `prefix` keyword argument also works here.
159
+
160
+ ```ruby
161
+ @tokenizer.get('from example', prefix: 'X:')
162
+ # => [4, 5]
163
+ @tokenizer.tokens.find(5)
164
+ # => "X:example"
165
+ ```
166
+
167
+ One can specify a minimum length (default 2) and stop words for tokenizing.
168
+
169
+ ```ruby
170
+ @tokenizer = Tokkens::Tokenizer.new(min_length: 3, stop_words: %w(and the))
171
+ @tokenizer.get('the cat and a bird').map {|i| @tokenizer.tokens.find(i)}
172
+ # => ["cat", "bird"]
173
+ ```
174
+
175
+ ### Example
176
+
177
+ A basic text classification example using [liblinear](https://www.csie.ntu.edu.tw/~cjlin/liblinear/)
178
+ can be found in [examples/classify.rb](examples/classify.rb). Run it as follows:
179
+
180
+ ```
181
+ $ gem install liblinear-ruby
182
+ $ ruby examples/classify.rb
183
+ How many students are in for the exams today? -> students exams -> school
184
+ The forest has large trees, while the field has its flowers. -> trees field flowers -> nature
185
+ Can we park our cars inside that building to go shopping? -> cars building shopping -> city
186
+ ```
187
+
188
+ The classifier was trained using three training sentences for each class.
189
+ The output shows a prediction for three test sentences. Each test sentence is
190
+ printed, followed by the tokens, followed by the predicted class.
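
Under the hood, the example feeds liblinear binary bag-of-words features: a Hash mapping each present token number to 1. A self-contained sketch of that feature construction (the token numbers here are made up for illustration):

```ruby
# Build a binary bag-of-words feature Hash from token numbers,
# in the {token_number => 1, ...} shape the example passes to Liblinear.train.
token_numbers = [3, 5, 3]  # hypothetical output of @tokenizer.get
uniq = token_numbers.uniq
features = Hash[uniq.zip([1] * uniq.length)]
features # => {3=>1, 5=>1}
```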
191
+
192
+ ## [MIT license](LICENSE.md)
193
+
194
+ ## Contributing
195
+
196
+ 1. Fork it ( https://github.com/[my-github-username]/tokkens/fork )
197
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
198
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
199
+ 4. Make sure the tests are green (`rspec`)
200
+ 5. Push to the branch (`git push origin my-new-feature`)
201
+ 6. Create a new Pull Request
data/Rakefile ADDED
@@ -0,0 +1,6 @@
1
+ require 'bundler/gem_tasks'
2
+ require 'rspec/core/rake_task'
3
+
4
+ RSpec::Core::RakeTask.new('rspec')
5
+
6
+ task :default => :rspec
@@ -0,0 +1,61 @@
1
+ # Document classification example using tokkens and linear SVM
2
+ require 'tokkens' # `rake install` or `gem install tokkens`
3
+ require 'liblinear' # `gem install liblinear-ruby`
4
+
5
+ # define the training data
6
+ TRAINING_DATA = [
7
+ ['school', 'The teacher writes a formula on the blackboard, while students are studying for their exams.'],
8
+ ['school', 'Students play soccer during the break after class, while a teacher watches over them.'],
9
+ ['school', 'All the students are studying hard for the final exams.'],
10
+ ['nature', 'The fox is running around the trees, while flowers bloom in the field.'],
11
+ ['nature', 'Where are the rabbits hiding today? Their holes below the trees are empty.'],
12
+ ['nature', 'The dark sky is bringing rain. The fox hides, rabbits find their holes, but the flowers surrender.'],
13
+ ['city', 'Cars are passing by swiftly, until the traffic lights become red.'],
14
+ ['city', 'Look at the high building, with so many windows. Who would live there?'],
15
+ ['city', 'The shopping centre building is over there, you will find everything you need to buy.'],
16
+ ]
17
+
18
+ # after training, these test sentences will receive a predicted classification
19
+ TEST_DATA = [
20
+ 'How many students are in for the exams today?',
21
+ 'The forest has large trees, while the field has its flowers.',
22
+ 'Can we park our cars inside that building to go shopping?',
23
+ ]
24
+
25
+ # stop words carry little meaning, so we ignore them
26
+ STOP_WORDS = %w(
27
+ the a on to at so today all many some
28
+ are is will would their you them their our everyone everything who there
29
+ while during over for below by with after in around until where
30
+ )
31
+
32
+ def preprocess(s)
33
+ s.downcase.gsub(/[^a-z\s]/, '')
34
+ end
35
+
36
+ @labels = Tokkens::Tokens.new
37
+ @tokenizer = Tokkens::Tokenizer.new(stop_words: STOP_WORDS)
38
+
39
+ # train
40
+ training_labels = []
41
+ training_samples = []
42
+ TRAINING_DATA.each do |(label, sentence)|
43
+ training_labels << @labels.get(label)
44
+ tokens = @tokenizer.get(preprocess(sentence)).uniq
45
+ training_samples << Hash[tokens.zip([1] * tokens.length)]
46
+ end
47
+ #@tokenizer.tokens.limit!(min_occurence: 2) # limit number of tokens - doesn't affect training though!
48
+ @model = Liblinear.train({}, training_labels, training_samples)
49
+
50
+ # predict
51
+ @tokenizer.tokens.freeze!
52
+ TEST_DATA.each do |sentence|
53
+ tokens = @tokenizer.get(preprocess(sentence))
54
+ label_number = Liblinear.predict(@model, Hash[tokens.zip([1] * tokens.length)])
55
+ puts "#{sentence} -> #{tokens.map{|i| @tokenizer.tokens.find(i)}.join(' ')} -> #{@labels.find(label_number)}"
56
+ end
57
+
58
+ # you might want to persist data for prediction at a later time
59
+ #@model.save('test.model')
60
+ #@labels.save('test.labels')
61
+ #@tokenizer.tokens.save('test.tokens')
data/lib/tokkens.rb ADDED
@@ -0,0 +1,3 @@
1
+ require 'tokkens/version'
2
+ require 'tokkens/tokens'
3
+ require 'tokkens/tokenizer'
@@ -0,0 +1,57 @@
1
+ require_relative 'tokens'
2
+
3
+ module Tokkens
4
+ # Converts a string to a list of token numbers.
5
+ #
6
+ # Useful for computing with text, like machine learning.
7
+ # Before using the tokenizer, you're expected to have pre-processed
8
+ # the text depending on the application. For example, converting to lowercase,
9
+ # removing non-word characters, transliterating accented characters.
10
+ #
11
+ # This class then splits the string into tokens by whitespace, and
12
+ # removes tokens not passing the selection criteria.
13
+ #
14
+ class Tokenizer
15
+
16
+ # default minimum token length
17
+ MIN_LENGTH = 2
18
+
19
+ # no default stop words to ignore
20
+ STOP_WORDS = []
21
+
22
+ # @!attribute [r] tokens
23
+ # @return [Tokens] object to use for obtaining tokens
24
+ # @!attribute [r] stop_words
25
+ # @return [Array<String>] stop words to ignore
26
+ # @!attribute [r] min_length
27
+ # @return [Fixnum] Minimum length for tokens
28
+ attr_reader :tokens, :stop_words, :min_length
29
+
30
+ # Create a new tokenizer
31
+ #
32
+ # @param tokens [Tokens] object to use for obtaining token numbers
33
+ # @param min_length [Fixnum] minimum length for tokens
34
+ # @param stop_words [Array<String>] stop words to ignore
35
+ def initialize(tokens = nil, min_length: MIN_LENGTH, stop_words: STOP_WORDS)
36
+ @tokens = tokens || Tokens.new
37
+ @stop_words = stop_words
38
+ @min_length = min_length
39
+ end
40
+
41
+ # @return [Array<Fixnum>] array of token numbers
42
+ def get(s, **kwargs)
43
+ return [] if !s || s.strip == ''
44
+ tokenize(s).map {|token| @tokens.get(token, **kwargs) }.compact
45
+ end
46
+
47
+ private
48
+
49
+ def tokenize(s)
50
+ s.split.select(&method(:include?))
51
+ end
52
+
53
+ def include?(s)
54
+ s.length >= @min_length && !@stop_words.include?(s)
55
+ end
56
+ end
57
+ end
@@ -0,0 +1,141 @@
1
+ module Tokkens
2
+ # Converts a string token to a uniquely identifying sequential number.
3
+ #
4
+ # Useful for working with a {https://en.wikipedia.org/wiki/Vector_space_model vector space model}
5
+ # for text.
6
+ class Tokens
7
+
8
+ # @!attribute [r] offset
9
+ # @return [Fixnum] Number of first token.
10
+ attr_accessor :offset
11
+
12
+ def initialize(offset: 1)
13
+ # liblinear can't use offset 0; libsvm doesn't mind starting at one
14
+ @tokens = {}
15
+ @offset = offset
16
+ @next_number = offset
17
+ @frozen = false
18
+ end
19
+
20
+ # Stop assigning new numbers to tokens.
21
+ # @see #frozen?
22
+ # @see #thaw!
23
+ def freeze!
24
+ @frozen = true
25
+ end
26
+
27
+ # Allow new tokens to be created.
28
+ # @see #freeze!
29
+ # @see #frozen?
30
+ def thaw!
31
+ @frozen = false
32
+ end
33
+
34
+ # @return [Boolean] Whether the tokens are frozen or not.
35
+ # @see #freeze!
36
+ # @see #thaw!
37
+ def frozen?
38
+ @frozen
39
+ end
40
+
41
+ # Limit the number of tokens.
42
+ #
43
+ # @param max_size [Fixnum] Maximum number of tokens to retain
44
+ # @param min_occurence [Fixnum] Keep only tokens seen at least this many times
45
+ # @return [Fixnum] Number of tokens left
46
+ def limit!(max_size: nil, min_occurence: nil)
47
+ # @todo raise if frozen
48
+ if min_occurence
49
+ @tokens.delete_if {|name, data| data[1] < min_occurence }
50
+ end
51
+ if max_size
52
+ @tokens = Hash[@tokens.to_a.sort_by {|a| -a[1][1] }[0..(max_size-1)]]
53
+ end
54
+ @tokens.length
55
+ end
56
+
57
+ # Return a number for a new or existing token.
58
+ #
59
+ # When the token was seen before, the same number is returned. If the token
60
+ # is first seen and this class isn't {#frozen?}, a new number is returned;
61
+ # else +nil+ is returned.
62
+ #
63
+ # @param s [String] token to return number for
64
+ # @option kwargs [String] :prefix optional string to prepend to the token
65
+ # @return [Fixnum, NilClass] number for given token
66
+ def get(s, **kwargs)
67
+ return if !s || s.strip == ''
68
+ @frozen ? retrieve(s, **kwargs) : upsert(s, **kwargs)
69
+ end
70
+
71
+ # Return an token by number.
72
+ #
73
+ # This class is optimized for retrieving by token, not by number.
74
+ #
75
+ # @param i [String] number to return token for
76
+ # @param prefix [String] optional string to remove from beginning of token
77
+ # @return [String, NilClass] given token, or +nil+ when not found
78
+ def find(i, prefix: nil)
79
+ @tokens.each do |s, data|
80
+ if data[0] == i
81
+ return (prefix && s.start_with?(prefix)) ? s[prefix.length..-1] : s
82
+ end
83
+ end
84
+ nil
85
+ end
86
+
87
+ # Return indexes for all of the current tokens.
88
+ #
89
+ # @return [Array<Fixnum>] All current token numbers.
90
+ # @see #limit!
91
+ def indexes
92
+ @tokens.values.map(&:first)
93
+ end
94
+
95
+ # Load tokens from file.
96
+ #
97
+ # The tokens are frozen by default.
98
+ # All previously existing tokens are removed.
99
+ #
100
+ # @param filename [String] Filename
101
+ def load(filename)
102
+ File.open(filename) do |f|
103
+ @tokens = {}
104
+ f.each_line do |line|
105
+ id, count, name = line.rstrip.split(/\s+/, 3)
106
+ @tokens[name.strip] = [id.to_i, count]
107
+ end
108
+ end
109
+ # safer
110
+ freeze!
111
+ end
112
+
113
+ # Save tokens to file.
114
+ #
115
+ # @param filename [String] Filename
116
+ def save(filename)
117
+ File.open(filename, 'w') do |f|
118
+ @tokens.each do |token, (index, count)|
119
+ f.puts "#{index} #{count} #{token}"
120
+ end
121
+ end
122
+ end
123
+
124
+ private
125
+
126
+ def retrieve(s, prefix: '')
127
+ data = @tokens[prefix + s]
128
+ data[0] if data
129
+ end
130
+
131
+ # return token number, update next_number; always returns a number
132
+ def upsert(s, prefix: '')
133
+ unless data = @tokens[prefix + s]
134
+ @tokens[prefix + s] = data = [@next_number, 0]
135
+ @next_number += 1
136
+ end
137
+ data[1] += 1
138
+ data[0]
139
+ end
140
+ end
141
+ end
@@ -0,0 +1,3 @@
1
+ module Tokkens
2
+ VERSION = "0.1.0"
3
+ end
@@ -0,0 +1,8 @@
1
+ require 'bundler/setup'
2
+ Bundler.setup
3
+
4
+ require 'tokkens'
5
+ include Tokkens
6
+
7
+ RSpec.configure do |config|
8
+ end
@@ -0,0 +1,40 @@
1
+ require_relative 'spec_helper'
2
+
3
+ describe Tokenizer do
4
+ let(:tokenizer) { described_class.new }
5
+ let(:offset) { 1 } # default token offset
6
+
7
+ describe '#get' do
8
+ it 'does tokenization' do
9
+ expect(tokenizer.get('foo bar')).to eq ([offset, offset + 1])
10
+ end
11
+
12
+ it 'ignores too short tokens' do
13
+ t = described_class.new(min_length: 2)
14
+ expect(t.get('x')).to eq []
15
+ end
16
+
17
+ it 'ignores stop words' do
18
+ t = described_class.new(stop_words: ['xyz'])
19
+ expect(t.get('xyz foo')).to eq [offset]
20
+ end
21
+
22
+ it 'does not return nil tokens' do
23
+ tokenizer.tokens.get('foo')
24
+ tokenizer.tokens.freeze!
25
+ expect(tokenizer.get('foo bar')).to eq [offset]
26
+ end
27
+ end
28
+
29
+ describe '#tokens' do
30
+ it 'returns a tokens object by default' do
31
+ expect(tokenizer.tokens).to be_a Tokens
32
+ end
33
+
34
+ it 'can be overridden' do
35
+ tokens = Tokens.new
36
+ t = described_class.new(tokens)
37
+ expect(t.tokens).to be tokens
38
+ end
39
+ end
40
+ end
@@ -0,0 +1,133 @@
1
+ require_relative 'spec_helper'
2
+ require 'tempfile'
3
+
4
+ describe Tokens do
5
+ let(:tokens) { described_class.new }
6
+ let(:offset) { 1 } # default offset
7
+
8
+ describe '#get' do
9
+ it 'can create new tokens' do
10
+ expect(tokens.get('bar')).to eq offset
11
+ expect(tokens.get('foo')).to eq (offset + 1)
12
+ end
13
+
14
+ it 'can get an existing token' do
15
+ tokens.get('bar')
16
+ expect(tokens.get('bar')).to eq offset
17
+ end
18
+
19
+ it 'can include a prefix' do
20
+ tokens.get('bar', prefix: 'XyZ$')
21
+ expect(tokens.get('XyZ$bar')).to eq offset
22
+ end
23
+
24
+ it 'can get an existing token when frozen' do
25
+ tokens.get('blup')
26
+ tokens.freeze!
27
+ expect(tokens.get('blup')).to eq offset
28
+ end
29
+
30
+ it 'cannot get a new token when frozen' do
31
+ tokens.get('blup')
32
+ tokens.freeze!
33
+ expect(tokens.get('blabla')).to be_nil
34
+ end
35
+ end
36
+
37
+ describe '#find' do
38
+ it 'can find an existing token' do
39
+ tokens.get('blup')
40
+ i = tokens.get('blah')
41
+ expect(tokens.find(i)).to eq 'blah'
42
+ end
43
+
44
+ it 'returns nil for a non-existing token' do
45
+ tokens.get('blup')
46
+ expect(tokens.find(offset + 1)).to eq nil
47
+ end
48
+
49
+ it 'removes the prefix' do
50
+ i = tokens.get('blup', prefix: 'FOO$')
51
+ expect(tokens.find(i, prefix: 'FOO$')).to eq 'blup'
52
+ end
53
+ end
54
+
55
+ describe '#indexes' do
56
+ it 'is empty without tokens' do
57
+ expect(tokens.indexes).to eq []
58
+ end
59
+
60
+ it 'returns the expected indexes' do
61
+ tokens.get('foo')
62
+ tokens.get('blup')
63
+ expect(tokens.indexes).to eq [offset, offset + 1]
64
+ end
65
+ end
66
+
67
+ describe '#offset' do
68
+ it 'has a default' do
69
+ expect(described_class.new.offset).to eq offset
70
+ end
71
+
72
+ it 'can override the default' do
73
+ expect(described_class.new(offset: 5).offset).to eq 5
74
+ end
75
+
76
+ it 'affects the first number' do
77
+ tokens = described_class.new(offset: 12)
78
+ expect(tokens.get('hi')).to eq 12
79
+ end
80
+ end
81
+
82
+ describe '#frozen?' do
83
+ it 'is not frozen by default' do
84
+ expect(tokens.frozen?).to be false
85
+ end
86
+
87
+ it 'can be frozen' do
88
+ tokens.freeze!
89
+ expect(tokens.frozen?).to be true
90
+ end
91
+
92
+ it 'can be thawed' do
93
+ tokens.freeze!
94
+ tokens.thaw!
95
+ expect(tokens.frozen?).to be false
96
+ end
97
+ end
98
+
99
+ describe '#limit!' do
100
+ it 'limits to most frequent tokens by max_size' do
101
+ tokens.get('foo')
102
+ tokens.get('blup')
103
+ tokens.get('blup')
104
+ tokens.limit!(max_size: 1)
105
+ expect(tokens.indexes).to eq [offset + 1]
106
+ end
107
+
108
+ it 'limits by min_occurence' do
109
+ tokens.get('foo')
110
+ tokens.get('blup')
111
+ tokens.get('foo')
112
+ tokens.limit!(min_occurence: 2)
113
+ expect(tokens.indexes).to eq [offset]
114
+ end
115
+ end
116
+
117
+ describe '#load' do
118
+ let(:file) { Tempfile.new('tokens') }
119
+ after { file.unlink }
120
+
121
+ it 'saves and loads tokens' do
122
+ tokens.get('foo')
123
+ tokens.get('bar')
124
+ tokens.save(file.path)
125
+ expect(File.exist?(file.path)).to be true
126
+ expect(File.zero?(file.path)).to be false
127
+
128
+ ntokens = described_class.new
129
+ ntokens.load(file.path)
130
+ expect(ntokens.get('bar')).to eq (offset + 1)
131
+ end
132
+ end
133
+ end
data/tokkens.gemspec ADDED
@@ -0,0 +1,29 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'tokkens/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "tokkens"
8
+ spec.version = Tokkens::VERSION
9
+ spec.authors = ["wvengen"]
10
+ spec.email = ["dev-rails@willem.engen.nl"]
11
+ spec.summary = %q{Simple text to numbers tokenizer}
12
+ spec.homepage = "https://github.com/q-m/tokkens-ruby"
13
+ spec.license = "MIT"
14
+ spec.description = <<-EOD
15
+ Tokkens makes it easy to apply a vector space model to text documents,
16
+ targeted towards machine learning. It provides a mapping between
17
+ numbers and tokens (strings)
18
+ EOD
19
+
20
+ spec.files = `git ls-files -z`.split("\x0")
21
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
22
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
23
+ spec.require_paths = ["lib"]
24
+
25
+ spec.required_ruby_version = '>= 2.0'
26
+ spec.add_development_dependency "bundler", "~> 1.7"
27
+ spec.add_development_dependency "rake", "~> 10.0"
28
+ spec.add_development_dependency "rspec", "~> 3.5.0"
29
+ end
metadata ADDED
@@ -0,0 +1,108 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tokkens
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - wvengen
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2017-02-08 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.7'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.7'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: 3.5.0
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: 3.5.0
55
+ description: |2
56
+ Tokkens makes it easy to apply a vector space model to text documents,
57
+ targeted towards machine learning. It provides a mapping between
58
+ numbers and tokens (strings)
59
+ email:
60
+ - dev-rails@willem.engen.nl
61
+ executables: []
62
+ extensions: []
63
+ extra_rdoc_files: []
64
+ files:
65
+ - ".gitignore"
66
+ - ".travis.yml"
67
+ - CHANGELOG.md
68
+ - Gemfile
69
+ - LICENSE.md
70
+ - README.md
71
+ - Rakefile
72
+ - examples/classify.rb
73
+ - lib/tokkens.rb
74
+ - lib/tokkens/tokenizer.rb
75
+ - lib/tokkens/tokens.rb
76
+ - lib/tokkens/version.rb
77
+ - spec/spec_helper.rb
78
+ - spec/tokenizer_spec.rb
79
+ - spec/tokens_spec.rb
80
+ - tokkens.gemspec
81
+ homepage: https://github.com/q-m/tokkens-ruby
82
+ licenses:
83
+ - MIT
84
+ metadata: {}
85
+ post_install_message:
86
+ rdoc_options: []
87
+ require_paths:
88
+ - lib
89
+ required_ruby_version: !ruby/object:Gem::Requirement
90
+ requirements:
91
+ - - ">="
92
+ - !ruby/object:Gem::Version
93
+ version: '2.0'
94
+ required_rubygems_version: !ruby/object:Gem::Requirement
95
+ requirements:
96
+ - - ">="
97
+ - !ruby/object:Gem::Version
98
+ version: '0'
99
+ requirements: []
100
+ rubyforge_project:
101
+ rubygems_version: 2.4.3
102
+ signing_key:
103
+ specification_version: 4
104
+ summary: Simple text to numbers tokenizer
105
+ test_files:
106
+ - spec/spec_helper.rb
107
+ - spec/tokenizer_spec.rb
108
+ - spec/tokens_spec.rb