buftok 0.1 → 0.2.0

data/CONTRIBUTING.md ADDED
@@ -0,0 +1,49 @@
+ ## Contributing
+ In the spirit of [free software][free-sw], **everyone** is encouraged to help
+ improve this project. Here are some ways *you* can contribute:
+
+ [free-sw]: http://www.fsf.org/licensing/essays/free-sw.html
+
+ * Use alpha, beta, and pre-release versions.
+ * Report bugs.
+ * Suggest new features.
+ * Write or edit documentation.
+ * Write specifications.
+ * Write code (**no patch is too small**: fix typos, add comments, clean up
+ inconsistent whitespace).
+ * Refactor code.
+ * Fix [issues][].
+ * Review patches.
+
+ [issues]: https://github.com/sferik/buftok/issues
+
+ ## Submitting an Issue
+ We use the [GitHub issue tracker][issues] to track bugs and features. Before
+ submitting a bug report or feature request, check to make sure it hasn't
+ already been submitted. When submitting a bug report, please include a [Gist][]
+ that includes a stack trace and any details that may be necessary to reproduce
+ the bug, including your gem version, Ruby version, and operating system.
+ Ideally, a bug report should include a pull request with failing specs.
+
+ [gist]: https://gist.github.com/
+
+ ## Submitting a Pull Request
+ 1. [Fork the repository.][fork]
+ 2. [Create a topic branch.][branch]
+ 3. Add specs for your unimplemented feature or bug fix.
+ 4. Run `bundle exec rake spec`. If your specs pass, return to step 3.
+ 5. Implement your feature or bug fix.
+ 6. Run `bundle exec rake spec`. If your specs fail, return to step 5.
+ 7. Run `open coverage/index.html`. If your changes are not completely covered
+ by your tests, return to step 3.
+ 8. Run `RUBYOPT=W2 bundle exec rake spec 2>&1 | grep buftok`. If your changes
+ produce any warnings, return to step 5.
+ 9. Add documentation for your feature or bug fix.
+ 10. Run `bundle exec rake yard`. If your changes are not 100% documented, go
+ back to step 9.
+ 11. Commit and push your changes.
+ 12. [Submit a pull request.][pr]
+
+ [fork]: http://help.github.com/fork-a-repo/
+ [branch]: http://learn.github.com/p/branching.html
+ [pr]: http://help.github.com/send-pull-requests/
data/Gemfile ADDED
@@ -0,0 +1,6 @@
+ source 'https://rubygems.org'
+
+ gem 'rake'
+ gem 'rdoc'
+
+ gemspec
data/LICENSE.md ADDED
@@ -0,0 +1,56 @@
+ Ruby is copyrighted free software by Yukihiro Matsumoto <matz@netlab.jp>.
+ You can redistribute it and/or modify it under either the terms of the
+ 2-clause BSDL (see the file BSDL), or the conditions below:
+
+ 1. You may make and give away verbatim copies of the source form of the
+ software without restriction, provided that you duplicate all of the
+ original copyright notices and associated disclaimers.
+
+ 2. You may modify your copy of the software in any way, provided that
+ you do at least ONE of the following:
+
+ a) place your modifications in the Public Domain or otherwise
+ make them Freely Available, such as by posting said
+ modifications to Usenet or an equivalent medium, or by allowing
+ the author to include your modifications in the software.
+
+ b) use the modified software only within your corporation or
+ organization.
+
+ c) give non-standard binaries non-standard names, with
+ instructions on where to get the original software distribution.
+
+ d) make other distribution arrangements with the author.
+
+ 3. You may distribute the software in object code or binary form,
+ provided that you do at least ONE of the following:
+
+ a) distribute the binaries and library files of the software,
+ together with instructions (in the manual page or equivalent)
+ on where to get the original distribution.
+
+ b) accompany the distribution with the machine-readable source of
+ the software.
+
+ c) give non-standard binaries non-standard names, with
+ instructions on where to get the original software distribution.
+
+ d) make other distribution arrangements with the author.
+
+ 4. You may modify and include the part of the software into any other
+ software (possibly commercial). But some files in the distribution
+ are not written by the author, so that they are not under these terms.
+
+ For the list of those files and their copying conditions, see the
+ file LEGAL.
+
+ 5. The scripts and library files supplied as input to or produced as
+ output from the software do not automatically fall under the
+ copyright of the software, but belong to whomever generated them,
+ and may be sold commercially, and may be aggregated with this
+ software.
+
+ 6. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
+ IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE.
data/README.md ADDED
@@ -0,0 +1,48 @@
+ # BufferedTokenizer
+
+ [![Gem Version](https://badge.fury.io/rb/buftok.png)][gem]
+ [![Build Status](https://travis-ci.org/sferik/buftok.png?branch=master)][travis]
+ [![Dependency Status](https://gemnasium.com/sferik/buftok.png?travis)][gemnasium]
+ [![Code Climate](https://codeclimate.com/github/sferik/buftok.png)][codeclimate]
+
+ [gem]: https://rubygems.org/gems/buftok
+ [travis]: https://travis-ci.org/sferik/buftok
+ [gemnasium]: https://gemnasium.com/sferik/buftok
+ [codeclimate]: https://codeclimate.com/github/sferik/buftok
+
+ ###### Statefully split input data by a specifiable token
+
+ BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by
+ default. It allows input to be spoon-fed from some outside source which
+ receives arbitrary length datagrams which may-or-may-not contain the token by
+ which entities are delimited. In this respect it's ideally paired with
+ something like [EventMachine][].
+
+ [EventMachine]: http://rubyeventmachine.com/
+
+ ## Supported Ruby Versions
+ This library aims to support and is [tested against][travis] the following Ruby
+ implementations:
+
+ * Ruby 1.8.7
+ * Ruby 1.9.2
+ * Ruby 1.9.3
+ * Ruby 2.0.0
+
+ If something doesn't work on one of these interpreters, it's a bug.
+
+ This library may inadvertently work (or seem to work) on other Ruby
+ implementations, however support will only be provided for the versions listed
+ above.
+
+ If you would like this library to support another Ruby version, you may
+ volunteer to be a maintainer. Being a maintainer entails making sure all tests
+ run and pass on that implementation. When something breaks on your
+ implementation, you will be responsible for providing patches in a timely
+ fashion. If critical issues for a particular implementation exist at the time
+ of a major release, support for that Ruby version may be dropped.
+
+ ## Copyright
+ Copyright (c) 2006-2013 Tony Arcieri, Martin Emde, Erik Michaels-Ober.
+ Distributed under the [Ruby license][license].
+ [license]: http://www.ruby-lang.org/en/LICENSE.txt
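
Reviewer note: the README describes the behaviour but shows no usage. The sketch below feeds three arbitrary chunks to a line-based tokenizer and collects the complete lines. The class body is condensed from the added lines of lib/buftok.rb in this diff and inlined only so the snippet runs without the gem installed; the chunk contents and the variable names `chunks`, `lines` and `remainder` are illustrative assumptions.

```ruby
# Condensed copy of BufferedTokenizer as it reads after this diff (0.2.0).
class BufferedTokenizer
  def initialize(delimiter = $/)
    @delimiter = delimiter
    @input = []
    @tail = ''
    @trim = @delimiter.length - 1
  end

  def extract(data)
    if @trim > 0
      tail_end = @tail.slice!(-@trim, @trim)
      data = tail_end + data if tail_end
    end

    @input << @tail
    entities = data.split(@delimiter, -1)
    @tail = entities.shift

    unless entities.empty?
      @input << @tail
      entities.unshift @input.join
      @input.clear
      @tail = entities.pop
    end

    entities
  end

  def flush
    @input << @tail
    buffer = @input.join
    @input.clear
    @tail = ""
    buffer
  end
end

# Chunks as they might arrive from a socket: boundaries ignore line endings.
chunks = ["first li", "ne\nsecond line\nthi", "rd"]

tokenizer = BufferedTokenizer.new          # line-based by default
lines = chunks.flat_map { |chunk| tokenizer.extract(chunk) }
remainder = tokenizer.flush                # whatever never saw a delimiter

p lines      # complete lines only
p remainder  # the unterminated tail
```

Each `extract` call returns only the entities completed by that chunk, so a caller can process them immediately and call `flush` once the source closes.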
data/Rakefile CHANGED
@@ -1,31 +1,66 @@
- require 'rake'
- require 'rake/rdoctask'
- require 'rake/gempackagetask'
- require 'spec/rake/spectask'
+ require 'bundler'
+ require 'rdoc/task'
+ require 'rake/testtask'
 
- Spec::Rake::SpecTask.new(:spec) do |task|
-   task.spec_files = FileList['**/*_spec.rb']
- end
+ task :default => :test
+
+ Bundler::GemHelper.install_tasks
 
- Rake::RDocTask.new(:rdoc) do |task|
-   task.rdoc_dir = 'doc'
-   task.title = 'BufferedTokenizer'
-   task.rdoc_files.include('lib/**/*.rb')
+ RDoc::Task.new do |task|
+   task.rdoc_dir = 'doc'
+   task.title = 'BufferedTokenizer'
+   task.rdoc_files.include('lib/**/*.rb')
  end
 
- spec = Gem::Specification.new do |s|
-   s.name = %q{buftok}
-   s.version = "0.1"
-   s.date = %q{2006-12-18}
-   s.summary = %q{BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs}
-   s.email = %q{tony@clickcaster.com}
-   s.homepage = %q{http://buftok.rubyforge.org}
-   s.rubyforge_project = %q{buftok}
-   s.has_rdoc = true
-   s.authors = ["Tony Arcieri","Martin Emde"]
-   s.files = ["Rakefile", "lib", "lib/buftok.rb"]
+ Rake::TestTask.new :test do |t|
+   t.libs << 'lib'
+   t.test_files = FileList['test/**/*.rb']
  end
 
- Rake::GemPackageTask.new(spec) do |pkg|
-   pkg.need_tar = true
+ desc "Benchmark the current implementation"
+ task :bench do
+   require 'benchmark'
+   require File.expand_path('lib/buftok', File.dirname(__FILE__))
+
+   n = 50000
+   delimiter = "\n\n"
+
+   frequency1 = 1000
+   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency1} strings..."
+   data1 = (0...n).map do |i|
+     (((i % frequency1 == 1) ? "\n" : "") +
+       ("s" * i) +
+       ((i % frequency1 == 0) ? "\n" : "")).freeze
+   end
+
+   frequency2 = 10
+   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency2} strings..."
+   data2 = (0...n).map do |i|
+     (((i % frequency2 == 1) ? "\n" : "") +
+       ("s" * i) +
+       ((i % frequency2 == 0) ? "\n" : "")).freeze
+   end
+
+   Benchmark.bmbm do |x|
+     x.report("1 char, freq: #{frequency1}") do
+       bt1 = BufferedTokenizer.new
+       n.times { |i| bt1.extract(data1[i]) }
+     end
+
+     x.report("2 char, freq: #{frequency1}") do
+       bt2 = BufferedTokenizer.new(delimiter)
+       n.times { |i| bt2.extract(data1[i]) }
+     end
+
+     x.report("1 char, freq: #{frequency2}") do
+       bt3 = BufferedTokenizer.new
+       n.times { |i| bt3.extract(data2[i]) }
+     end
+
+     x.report("2 char, freq: #{frequency2}") do
+       bt4 = BufferedTokenizer.new(delimiter)
+       n.times { |i| bt4.extract(data2[i]) }
+     end
+
+   end
  end
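
Reviewer note: reading the data generation in the `:bench` task, the two newlines that form the `"\n\n"` delimiter appear to always land in adjacent strings, so the "2 char" reports exercise the case where a delimiter straddles two `extract` calls. The sketch below repeats the same construction with much smaller, purely illustrative values (`n = 4`, `frequency = 2`); the names `straddling` and `joined_count` are not part of the Rakefile.

```ruby
# Same expression as in the :bench task, shrunk so the result is readable.
n = 4
frequency = 2

data = (0...n).map do |i|
  (((i % frequency == 1) ? "\n" : "") +
    ("s" * i) +
    ((i % frequency == 0) ? "\n" : "")).freeze
end

p data

# No single chunk contains the two-character delimiter ...
straddling = data.none? { |chunk| chunk.include?("\n\n") }

# ... it only shows up once neighbouring chunks are put back together.
joined_count = data.join.scan("\n\n").length

puts "delimiter only across chunk boundaries: #{straddling}"
puts "occurrences after joining: #{joined_count}"
```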
data/buftok.gemspec ADDED
@@ -0,0 +1,17 @@
+ Gem::Specification.new do |spec|
+   spec.add_development_dependency 'bundler', '~> 1.0'
+   spec.authors = ["Tony Arcieri", "Martin Emde", "Erik Michaels-Ober"]
+   spec.description = %q{BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs}
+   spec.email = "sferik@gmail.com"
+   spec.files = %w(CONTRIBUTING.md Gemfile LICENSE.md README.md Rakefile buftok.gemspec)
+   spec.files += Dir.glob("lib/**/*.rb")
+   spec.files += Dir.glob("test/**/*.rb")
+   spec.test_files = spec.files.grep(%r{^test/})
+   spec.homepage = "https://github.com/sferik/buftok"
+   spec.licenses = ['MIT']
+   spec.name = "buftok"
+   spec.require_paths = ["lib"]
+   spec.required_rubygems_version = '>= 1.3.5'
+   spec.summary = spec.description
+   spec.version = "0.2.0"
+ end
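
Reviewer note: the gemspec introduces two version constraints, `'~> 1.0'` for the bundler development dependency and `'>= 1.3.5'` for RubyGems itself. A small sketch using RubyGems' own `Gem::Requirement` shows which versions each one admits; the probe versions are arbitrary examples, apart from 1.8.23, which is the `rubygems_version` recorded in the metadata below.

```ruby
require 'rubygems'

# '~> 1.0' is the pessimistic operator: at least 1.0, but below 2.0.
bundler_dep = Gem::Requirement.new('~> 1.0')
bundler_ok       = bundler_dep.satisfied_by?(Gem::Version.new('1.3.5'))
bundler_too_new  = bundler_dep.satisfied_by?(Gem::Version.new('2.0.0'))

# '>= 1.3.5' is a plain lower bound with no upper limit.
rubygems_req = Gem::Requirement.new('>= 1.3.5')
rubygems_ok      = rubygems_req.satisfied_by?(Gem::Version.new('1.8.23'))
rubygems_too_old = rubygems_req.satisfied_by?(Gem::Version.new('1.3.4'))

p [bundler_ok, bundler_too_new, rubygems_ok, rubygems_too_old]
```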
data/lib/buftok.rb CHANGED
@@ -1,26 +1,22 @@
- # BufferedTokenizer - Statefully split input data by a specifiable token
- # (C)2006 Tony Arcieri, Martin Emde
- # Distributed under the Ruby license (http://www.ruby-lang.org/en/LICENSE.txt)
-
  # BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
  # by default. It allows input to be spoon-fed from some outside source which
  # receives arbitrary length datagrams which may-or-may-not contain the token
  # by which entities are delimited. In this respect it's ideally paired with
- # something like EventMachine (http://rubyforge.org/projects/eventmachine)
+ # something like EventMachine (http://rubyeventmachine.com/).
  class BufferedTokenizer
-   # New BufferedTokenizers will operate on lines delimited by "\n" by default
-   # or allow you to specify any delimiter token you so choose, which will then
-   # be used by String#split to tokenize the input data
-   def initialize(delimiter = "\n")
-     # Store the specified delimiter
+   # New BufferedTokenizers will operate on lines delimited by a delimiter,
+   # which is by default the global input delimiter $/ ("\n").
+   #
+   # The input buffer is stored as an array. This is by far the most efficient
+   # approach given language constraints (in C a linked list would be a more
+   # appropriate data structure). Segments of input data are stored in a list
+   # which is only joined when a token is reached, substantially reducing the
+   # number of objects required for the operation.
+   def initialize(delimiter = $/)
      @delimiter = delimiter
-
-     # The input buffer is stored as an array. This is by far the most efficient
-     # approach given language constraints (in C a linked list would be a more
-     # appropriate data structure). Segments of input data are stored in a list
-     # which is only joined when a token is reached, substantially reducing the
-     # number of objects required for the operation.
      @input = []
+     @tail = ''
+     @trim = @delimiter.length - 1
    end
 
    # Extract takes an arbitrary string of input data and returns an array of
@@ -28,49 +24,36 @@ class BufferedTokenizer
    # makes for easy processing of datagrams using a pattern like:
    #
    # tokenizer.extract(data).map { |entity| Decode(entity) }.each do ...
+   #
+   # Using -1 makes split to return "" if the token is at the end of
+   # the string, meaning the last element is the start of the next chunk.
    def extract(data)
-     # Extract token-delimited entities from the input string with the split command.
-     # There's a bit of craftiness here with the -1 parameter. Normally split would
-     # behave no differently regardless of if the token lies at the very end of the
-     # input buffer or not (i.e. a literal edge case) Specifying -1 forces split to
-     # return "" in this case, meaning that the last entry in the list represents a
-     # new segment of data where the token has not been encountered
-     entities = data.split @delimiter, -1
+     if @trim > 0
+       tail_end = @tail.slice!(-@trim, @trim) # returns nil if string is too short
+       data = tail_end + data if tail_end
+     end
 
-     # Move the first entry in the resulting array into the input buffer. It represents
-     # the last segment of a token-delimited entity unless it's the only entry in the list.
-     @input << entities.shift
+     @input << @tail
+     entities = data.split(@delimiter, -1)
+     @tail = entities.shift
 
-     # If the resulting array from the split is empty, the token was not encountered
-     # (not even at the end of the buffer). Since we've encountered no token-delimited
-     # entities this go-around, return an empty array.
-     return [] if entities.empty?
+     unless entities.empty?
+       @input << @tail
+       entities.unshift @input.join
+       @input.clear
+       @tail = entities.pop
+     end
 
-     # At this point, we've hit a token, or potentially multiple tokens. Now we can bring
-     # together all the data we've buffered from earlier calls without hitting a token,
-     # and add it to our list of discovered entities.
-     entities.unshift @input.join
-
-     # Now that we've hit a token, joined the input buffer and added it to the entities
-     # list, we can go ahead and clear the input buffer. All of the segments that were
-     # stored before the join can now be garbage collected.
-     @input.clear
-
-     # The last entity in the list is not token delimited, however, thanks to the -1
-     # passed to split. It represents the beginning of a new list of as-yet-untokenized
-     # data, so we add it to the start of the list.
-     @input << entities.pop
-
-     # Now we're left with the list of extracted token-delimited entities we wanted
-     # in the first place. Hooray!
      entities
    end
-
+
    # Flush the contents of the input buffer, i.e. return the input buffer even though
    # a token has not yet been encountered
    def flush
+     @input << @tail
      buffer = @input.join
      @input.clear
+     @tail = "" # @tail.clear is slightly faster, but not supported on 1.8.7
      buffer
    end
  end
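
Reviewer note: the new `extract` relies on two String behaviours, the `-1` limit of `split` and the `nil` return of `slice!` on a too-short string. The sketch below isolates both with plain strings, outside the class. The local names (`trim`, `tail`, `pieces` and so on) mirror the instance variables in the diff but are otherwise illustrative.

```ruby
delimiter = '<>'
trim = delimiter.length - 1   # how many trailing characters could be half a delimiter

# A limit of -1 keeps the trailing empty string, so a chunk ending exactly on
# the delimiter still yields an (empty) start for the next entity.
with_limit    = "foo<>".split(delimiter, -1)
without_limit = "foo<>".split(delimiter)

# The previous chunk ended with half a delimiter: move its last `trim`
# characters onto the front of the next chunk before splitting.
tail = String.new('foo<')
tail_end = tail.slice!(-trim, trim)   # removes and returns the last character
data = tail_end + '>bar<'
pieces = data.split(delimiter, -1)

# When the tail is shorter than `trim`, slice! returns nil and nothing moves.
short = String.new
nothing = short.slice!(-trim, trim)

p with_limit, without_limit, tail, data, pieces, nothing
```

After the split, the leading empty piece is what gets appended to the shortened tail (`"foo"`), which is how the entity ending in a straddled delimiter is completed.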
data/test/test_buftok.rb ADDED
@@ -0,0 +1,27 @@
+ require 'test/unit'
+ require 'buftok'
+
+ class TestBuftok < Test::Unit::TestCase
+   def test_buftok
+     tokenizer = BufferedTokenizer.new
+     assert_equal %w[foo], tokenizer.extract("foo\nbar".freeze)
+     assert_equal %w[barbaz qux], tokenizer.extract("baz\nqux\nquu".freeze)
+     assert_equal 'quu', tokenizer.flush
+     assert_equal '', tokenizer.flush
+   end
+
+   def test_delimiter
+     tokenizer = BufferedTokenizer.new('<>')
+     assert_equal ['', "foo\n"], tokenizer.extract("<>foo\n<>".freeze)
+     assert_equal %w[bar], tokenizer.extract('bar<>baz'.freeze)
+     assert_equal 'baz', tokenizer.flush
+   end
+
+   def test_split_delimiter
+     tokenizer = BufferedTokenizer.new('<>'.freeze)
+     assert_equal [], tokenizer.extract('foo<'.freeze)
+     assert_equal %w[foo], tokenizer.extract('>bar<'.freeze)
+     assert_equal %w[bar<baz qux], tokenizer.extract('baz<>qux<>'.freeze)
+     assert_equal '', tokenizer.flush
+   end
+ end
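
Reviewer note: `test_split_delimiter` looks like the regression test for the behaviour change in lib/buftok.rb. To illustrate why, the sketch below condenses the removed 0.1 lines into a stand-in class and replays the first two calls of that test. The class name `BufferedTokenizer01` is hypothetical and chosen only to avoid clashing with the real class.

```ruby
# Condensed from the removed (0.1) lines of lib/buftok.rb; comments stripped.
class BufferedTokenizer01
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @input = []
  end

  def extract(data)
    entities = data.split @delimiter, -1
    @input << entities.shift
    return [] if entities.empty?
    entities.unshift @input.join
    @input.clear
    @input << entities.pop
    entities
  end

  def flush
    buffer = @input.join
    @input.clear
    buffer
  end
end

old = BufferedTokenizer01.new('<>')
first  = old.extract('foo<')    # no delimiter seen yet
second = old.extract('>bar<')   # the test expects ["foo"] from this call
rest   = old.flush              # everything is still sitting in the buffer

p first, second, rest
```

Because 0.1 splits each chunk in isolation, a delimiter cut in two by a chunk boundary is never recognised; the `@tail` and `@trim` bookkeeping added in 0.2.0 is what lets the second call return `["foo"]`.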
metadata CHANGED
@@ -1,49 +1,75 @@
- --- !ruby/object:Gem::Specification
- rubygems_version: 0.9.0
- specification_version: 1
+ --- !ruby/object:Gem::Specification
  name: buftok
- version: !ruby/object:Gem::Version
-   version: "0.1"
- date: 2006-12-18 00:00:00 -07:00
- summary: BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs
- require_paths:
- - lib
- email: tony@clickcaster.com
- homepage: http://buftok.rubyforge.org
- rubyforge_project: buftok
- description:
- autorequire:
- default_executable:
- bindir: bin
- has_rdoc: true
- required_ruby_version: !ruby/object:Gem::Version::Requirement
-   requirements:
-   - - ">"
-     - !ruby/object:Gem::Version
-       version: 0.0.0
-   version:
+ version: !ruby/object:Gem::Version
+   version: 0.2.0
+   prerelease:
  platform: ruby
- signing_key:
- cert_chain:
- post_install_message:
- authors:
+ authors:
  - Tony Arcieri
  - Martin Emde
- files:
+ - Erik Michaels-Ober
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2013-11-22 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.0'
+ description: BufferedTokenizer extracts token delimited entities from a sequence of
+   arbitrary inputs
+ email: sferik@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CONTRIBUTING.md
+ - Gemfile
+ - LICENSE.md
+ - README.md
  - Rakefile
- - lib
+ - buftok.gemspec
  - lib/buftok.rb
- test_files: []
-
+ - test/test_buftok.rb
+ homepage: https://github.com/sferik/buftok
+ licenses:
+ - MIT
+ post_install_message:
  rdoc_options: []
-
- extra_rdoc_files: []
-
- executables: []
-
- extensions: []
-
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: 1.3.5
  requirements: []
-
- dependencies: []
-
+ rubyforge_project:
+ rubygems_version: 1.8.23
+ signing_key:
+ specification_version: 3
+ summary: BufferedTokenizer extracts token delimited entities from a sequence of arbitrary
+   inputs
+ test_files:
+ - test/test_buftok.rb
+ has_rdoc: