buftok 0.1 → 0.2.0

data/CONTRIBUTING.md ADDED
@@ -0,0 +1,49 @@
+ ## Contributing
+ In the spirit of [free software][free-sw], **everyone** is encouraged to help
+ improve this project. Here are some ways *you* can contribute:
+
+ [free-sw]: http://www.fsf.org/licensing/essays/free-sw.html
+
+ * Use alpha, beta, and pre-release versions.
+ * Report bugs.
+ * Suggest new features.
+ * Write or edit documentation.
+ * Write specifications.
+ * Write code (**no patch is too small**: fix typos, add comments, clean up
+   inconsistent whitespace).
+ * Refactor code.
+ * Fix [issues][].
+ * Review patches.
+
+ [issues]: https://github.com/sferik/buftok/issues
+
+ ## Submitting an Issue
+ We use the [GitHub issue tracker][issues] to track bugs and features. Before
+ submitting a bug report or feature request, check to make sure it hasn't
+ already been submitted. When submitting a bug report, please include a [Gist][]
+ that includes a stack trace and any details that may be necessary to reproduce
+ the bug, including your gem version, Ruby version, and operating system.
+ Ideally, a bug report should include a pull request with failing specs.
+
+ [gist]: https://gist.github.com/
+
+ ## Submitting a Pull Request
+ 1. [Fork the repository.][fork]
+ 2. [Create a topic branch.][branch]
+ 3. Add specs for your unimplemented feature or bug fix.
+ 4. Run `bundle exec rake spec`. If your specs pass, return to step 3.
+ 5. Implement your feature or bug fix.
+ 6. Run `bundle exec rake spec`. If your specs fail, return to step 5.
+ 7. Run `open coverage/index.html`. If your changes are not completely covered
+    by your tests, return to step 3.
+ 8. Run `RUBYOPT=W2 bundle exec rake spec 2>&1 | grep buftok`. If your changes
+    produce any warnings, return to step 5.
+ 9. Add documentation for your feature or bug fix.
+ 10. Run `bundle exec rake yard`. If your changes are not 100% documented, go
+     back to step 9.
+ 11. Commit and push your changes.
+ 12. [Submit a pull request.][pr]
+
+ [fork]: http://help.github.com/fork-a-repo/
+ [branch]: http://learn.github.com/p/branching.html
+ [pr]: http://help.github.com/send-pull-requests/
data/Gemfile ADDED
@@ -0,0 +1,6 @@
+ source 'https://rubygems.org'
+
+ gem 'rake'
+ gem 'rdoc'
+
+ gemspec
data/LICENSE.md ADDED
@@ -0,0 +1,56 @@
+ Ruby is copyrighted free software by Yukihiro Matsumoto <matz@netlab.jp>.
+ You can redistribute it and/or modify it under either the terms of the
+ 2-clause BSDL (see the file BSDL), or the conditions below:
+
+   1. You may make and give away verbatim copies of the source form of the
+      software without restriction, provided that you duplicate all of the
+      original copyright notices and associated disclaimers.
+
+   2. You may modify your copy of the software in any way, provided that
+      you do at least ONE of the following:
+
+        a) place your modifications in the Public Domain or otherwise
+           make them Freely Available, such as by posting said
+           modifications to Usenet or an equivalent medium, or by allowing
+           the author to include your modifications in the software.
+
+        b) use the modified software only within your corporation or
+           organization.
+
+        c) give non-standard binaries non-standard names, with
+           instructions on where to get the original software distribution.
+
+        d) make other distribution arrangements with the author.
+
+   3. You may distribute the software in object code or binary form,
+      provided that you do at least ONE of the following:
+
+        a) distribute the binaries and library files of the software,
+           together with instructions (in the manual page or equivalent)
+           on where to get the original distribution.
+
+        b) accompany the distribution with the machine-readable source of
+           the software.
+
+        c) give non-standard binaries non-standard names, with
+           instructions on where to get the original software distribution.
+
+        d) make other distribution arrangements with the author.
+
+   4. You may modify and include the part of the software into any other
+      software (possibly commercial). But some files in the distribution
+      are not written by the author, so that they are not under these terms.
+
+      For the list of those files and their copying conditions, see the
+      file LEGAL.
+
+   5. The scripts and library files supplied as input to or produced as
+      output from the software do not automatically fall under the
+      copyright of the software, but belong to whomever generated them,
+      and may be sold commercially, and may be aggregated with this
+      software.
+
+   6. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
+      IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
+      WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+      PURPOSE.
data/README.md ADDED
@@ -0,0 +1,48 @@
+ # BufferedTokenizer
+
+ [![Gem Version](https://badge.fury.io/rb/buftok.png)][gem]
+ [![Build Status](https://travis-ci.org/sferik/buftok.png?branch=master)][travis]
+ [![Dependency Status](https://gemnasium.com/sferik/buftok.png?travis)][gemnasium]
+ [![Code Climate](https://codeclimate.com/github/sferik/buftok.png)][codeclimate]
+
+ [gem]: https://rubygems.org/gems/buftok
+ [travis]: https://travis-ci.org/sferik/buftok
+ [gemnasium]: https://gemnasium.com/sferik/buftok
+ [codeclimate]: https://codeclimate.com/github/sferik/buftok
+
+ ###### Statefully split input data by a specifiable token
+
+ BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by
+ default. It allows input to be spoon-fed from some outside source which
+ receives arbitrary length datagrams which may-or-may-not contain the token by
+ which entities are delimited. In this respect it's ideally paired with
+ something like [EventMachine][].
+
+ [EventMachine]: http://rubyeventmachine.com/
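
To make "statefully split" concrete, here is a minimal sketch of the idea using a plain string buffer. It is not the gem's implementation, and `NaiveLineBuffer` is a hypothetical name used only for illustration. Chunks arrive in arbitrary sizes, complete entities are returned as soon as their delimiter shows up, and any unfinished remainder waits for the next chunk.

```ruby
# Hypothetical stand-in that illustrates stateful splitting; it is NOT the
# BufferedTokenizer class from this gem.
class NaiveLineBuffer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ''.dup # unfinished data carried over between calls
  end

  # Append a chunk and return every entity completed so far.
  def extract(chunk)
    @buffer << chunk
    # The -1 limit keeps a trailing "" when the buffer ends with the
    # delimiter, so the last element is always the unfinished remainder.
    *entities, rest = @buffer.split(@delimiter, -1)
    @buffer = rest || ''.dup
    entities
  end

  # Return whatever is still buffered and reset the state.
  def flush
    rest = @buffer
    @buffer = ''.dup
    rest
  end
end

buffer = NaiveLineBuffer.new
first  = buffer.extract("foo\nba") # "foo" is complete, "ba" is kept
second = buffer.extract("r\nbaz")  # "bar" is completed by this chunk
rest   = buffer.flush              # "baz" never saw a delimiter
```

Re-splitting the whole buffer on every call is simple but wasteful when many chunks arrive before a delimiter does.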
+
+ ## Supported Ruby Versions
+ This library aims to support and is [tested against][travis] the following Ruby
+ implementations:
+
+ * Ruby 1.8.7
+ * Ruby 1.9.2
+ * Ruby 1.9.3
+ * Ruby 2.0.0
+
+ If something doesn't work on one of these interpreters, it's a bug.
+
+ This library may inadvertently work (or seem to work) on other Ruby
+ implementations; however, support will only be provided for the versions listed
+ above.
+
+ If you would like this library to support another Ruby version, you may
+ volunteer to be a maintainer. Being a maintainer entails making sure all tests
+ run and pass on that implementation. When something breaks on your
+ implementation, you will be responsible for providing patches in a timely
+ fashion. If critical issues for a particular implementation exist at the time
+ of a major release, support for that Ruby version may be dropped.
+
+ ## Copyright
+ Copyright (c) 2006-2013 Tony Arcieri, Martin Emde, Erik Michaels-Ober.
+ Distributed under the [Ruby license][license].
+ [license]: http://www.ruby-lang.org/en/LICENSE.txt
data/Rakefile CHANGED
@@ -1,31 +1,66 @@
- require 'rake'
- require 'rake/rdoctask'
- require 'rake/gempackagetask'
- require 'spec/rake/spectask'
+ require 'bundler'
+ require 'rdoc/task'
+ require 'rake/testtask'

- Spec::Rake::SpecTask.new(:spec) do |task|
-   task.spec_files = FileList['**/*_spec.rb']
- end
+ task :default => :test
+
+ Bundler::GemHelper.install_tasks

- Rake::RDocTask.new(:rdoc) do |task|
-   task.rdoc_dir = 'doc'
-   task.title = 'BufferedTokenizer'
-   task.rdoc_files.include('lib/**/*.rb')
+ RDoc::Task.new do |task|
+   task.rdoc_dir = 'doc'
+   task.title = 'BufferedTokenizer'
+   task.rdoc_files.include('lib/**/*.rb')
  end

- spec = Gem::Specification.new do |s|
-   s.name = %q{buftok}
-   s.version = "0.1"
-   s.date = %q{2006-12-18}
-   s.summary = %q{BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs}
-   s.email = %q{tony@clickcaster.com}
-   s.homepage = %q{http://buftok.rubyforge.org}
-   s.rubyforge_project = %q{buftok}
-   s.has_rdoc = true
-   s.authors = ["Tony Arcieri","Martin Emde"]
-   s.files = ["Rakefile", "lib", "lib/buftok.rb"]
+ Rake::TestTask.new :test do |t|
+   t.libs << 'lib'
+   t.test_files = FileList['test/**/*.rb']
  end

- Rake::GemPackageTask.new(spec) do |pkg|
-   pkg.need_tar = true
+ desc "Benchmark the current implementation"
+ task :bench do
+   require 'benchmark'
+   require File.expand_path('lib/buftok', File.dirname(__FILE__))
+
+   n = 50000
+   delimiter = "\n\n"
+
+   frequency1 = 1000
+   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency1} strings..."
+   data1 = (0...n).map do |i|
+     (((i % frequency1 == 1) ? "\n" : "") +
+       ("s" * i) +
+       ((i % frequency1 == 0) ? "\n" : "")).freeze
+   end
+
+   frequency2 = 10
+   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency2} strings..."
+   data2 = (0...n).map do |i|
+     (((i % frequency2 == 1) ? "\n" : "") +
+       ("s" * i) +
+       ((i % frequency2 == 0) ? "\n" : "")).freeze
+   end
+
+   Benchmark.bmbm do |x|
+     x.report("1 char, freq: #{frequency1}") do
+       bt1 = BufferedTokenizer.new
+       n.times { |i| bt1.extract(data1[i]) }
+     end
+
+     x.report("2 char, freq: #{frequency1}") do
+       bt2 = BufferedTokenizer.new(delimiter)
+       n.times { |i| bt2.extract(data1[i]) }
+     end
+
+     x.report("1 char, freq: #{frequency2}") do
+       bt3 = BufferedTokenizer.new
+       n.times { |i| bt3.extract(data2[i]) }
+     end
+
+     x.report("2 char, freq: #{frequency2}") do
+       bt4 = BufferedTokenizer.new(delimiter)
+       n.times { |i| bt4.extract(data2[i]) }
+     end
+
+   end
  end
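
The data generation in the `bench` task is easy to misread: the two-character delimiter `"\n\n"` never appears inside a single string. A string ends with `"\n"` when `i % frequency == 0` and the following string begins with `"\n"`, so every delimiter straddles two consecutive `extract` calls. A scaled-down sketch of the same generator shows the pattern (`n = 5` and `frequency = 3` are illustrative values, not the ones used in the Rakefile):

```ruby
# Scaled-down copy of the generator used by the bench task; the small
# n and frequency values are assumptions chosen to keep the output readable.
n = 5
frequency = 3
delimiter = "\n\n"

data = (0...n).map do |i|
  (((i % frequency == 1) ? "\n" : "") +
    ("s" * i) +
    ((i % frequency == 0) ? "\n" : "")).freeze
end

# No single chunk contains the full delimiter...
split_across_chunks = data.none? { |chunk| chunk.include?(delimiter) }
# ...but the concatenated stream does.
occurrences = data.join.scan(delimiter).size

p data                # => ["\n", "\ns", "ss", "sss\n", "\nssss"]
p split_across_chunks # => true
p occurrences         # => 2
```

The "2 char" reports therefore exercise the case where a delimiter is cut in half by a chunk boundary, while the "1 char" reports see each `"\n"` as a complete delimiter on its own.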
data/buftok.gemspec ADDED
@@ -0,0 +1,17 @@
+ Gem::Specification.new do |spec|
+   spec.add_development_dependency 'bundler', '~> 1.0'
+   spec.authors = ["Tony Arcieri", "Martin Emde", "Erik Michaels-Ober"]
+   spec.description = %q{BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs}
+   spec.email = "sferik@gmail.com"
+   spec.files = %w(CONTRIBUTING.md Gemfile LICENSE.md README.md Rakefile buftok.gemspec)
+   spec.files += Dir.glob("lib/**/*.rb")
+   spec.files += Dir.glob("test/**/*.rb")
+   spec.test_files = spec.files.grep(%r{^test/})
+   spec.homepage = "https://github.com/sferik/buftok"
+   spec.licenses = ['MIT']
+   spec.name = "buftok"
+   spec.require_paths = ["lib"]
+   spec.required_rubygems_version = '>= 1.3.5'
+   spec.summary = spec.description
+   spec.version = "0.2.0"
+ end
data/lib/buftok.rb CHANGED
@@ -1,26 +1,22 @@
- # BufferedTokenizer - Statefully split input data by a specifiable token
- # (C)2006 Tony Arcieri, Martin Emde
- # Distributed under the Ruby license (http://www.ruby-lang.org/en/LICENSE.txt)
-
  # BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
  # by default. It allows input to be spoon-fed from some outside source which
  # receives arbitrary length datagrams which may-or-may-not contain the token
  # by which entities are delimited. In this respect it's ideally paired with
- # something like EventMachine (http://rubyforge.org/projects/eventmachine)
+ # something like EventMachine (http://rubyeventmachine.com/).
  class BufferedTokenizer
-   # New BufferedTokenizers will operate on lines delimited by "\n" by default
-   # or allow you to specify any delimiter token you so choose, which will then
-   # be used by String#split to tokenize the input data
-   def initialize(delimiter = "\n")
-     # Store the specified delimiter
+   # New BufferedTokenizers will operate on lines delimited by a delimiter,
+   # which is by default the global input delimiter $/ ("\n").
+   #
+   # The input buffer is stored as an array. This is by far the most efficient
+   # approach given language constraints (in C a linked list would be a more
+   # appropriate data structure). Segments of input data are stored in a list
+   # which is only joined when a token is reached, substantially reducing the
+   # number of objects required for the operation.
+   def initialize(delimiter = $/)
      @delimiter = delimiter
-
-     # The input buffer is stored as an array. This is by far the most efficient
-     # approach given language constraints (in C a linked list would be a more
-     # appropriate data structure). Segments of input data are stored in a list
-     # which is only joined when a token is reached, substantially reducing the
-     # number of objects required for the operation.
      @input = []
+     @tail = ''
+     @trim = @delimiter.length - 1
    end

    # Extract takes an arbitrary string of input data and returns an array of
@@ -28,49 +24,36 @@ class BufferedTokenizer
    # makes for easy processing of datagrams using a pattern like:
    #
    #   tokenizer.extract(data).map { |entity| Decode(entity) }.each do ...
+   #
+   # Using -1 makes split return "" if the token is at the end of
+   # the string, meaning the last element is the start of the next chunk.
    def extract(data)
-     # Extract token-delimited entities from the input string with the split command.
-     # There's a bit of craftiness here with the -1 parameter. Normally split would
-     # behave no differently regardless of if the token lies at the very end of the
-     # input buffer or not (i.e. a literal edge case) Specifying -1 forces split to
-     # return "" in this case, meaning that the last entry in the list represents a
-     # new segment of data where the token has not been encountered
-     entities = data.split @delimiter, -1
+     if @trim > 0
+       tail_end = @tail.slice!(-@trim, @trim) # returns nil if string is too short
+       data = tail_end + data if tail_end
+     end

-     # Move the first entry in the resulting array into the input buffer. It represents
-     # the last segment of a token-delimited entity unless it's the only entry in the list.
-     @input << entities.shift
+     @input << @tail
+     entities = data.split(@delimiter, -1)
+     @tail = entities.shift

-     # If the resulting array from the split is empty, the token was not encountered
-     # (not even at the end of the buffer). Since we've encountered no token-delimited
-     # entities this go-around, return an empty array.
-     return [] if entities.empty?
+     unless entities.empty?
+       @input << @tail
+       entities.unshift @input.join
+       @input.clear
+       @tail = entities.pop
+     end

-     # At this point, we've hit a token, or potentially multiple tokens. Now we can bring
-     # together all the data we've buffered from earlier calls without hitting a token,
-     # and add it to our list of discovered entities.
-     entities.unshift @input.join
-
-     # Now that we've hit a token, joined the input buffer and added it to the entities
-     # list, we can go ahead and clear the input buffer. All of the segments that were
-     # stored before the join can now be garbage collected.
-     @input.clear
-
-     # The last entity in the list is not token delimited, however, thanks to the -1
-     # passed to split. It represents the beginning of a new list of as-yet-untokenized
-     # data, so we add it to the start of the list.
-     @input << entities.pop
-
-     # Now we're left with the list of extracted token-delimited entities we wanted
-     # in the first place. Hooray!
      entities
    end
-
+
    # Flush the contents of the input buffer, i.e. return the input buffer even though
    # a token has not yet been encountered
    def flush
+     @input << @tail
      buffer = @input.join
      @input.clear
+     @tail = "" # @tail.clear is slightly faster, but not supported on 1.8.7
      buffer
    end
  end
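
The new `@tail` and `@trim` variables exist to handle a multi-character delimiter that is cut in half by a chunk boundary. The sketch below restates the 0.2.0 logic from the diff so the behavior can be traced on its own. `TailTrimTokenizer` is a hypothetical name (the gem's class is `BufferedTokenizer`), and the `|| ''.dup` fallback is an assumption added here because `"".split` returns an empty array.

```ruby
# Illustrative re-statement of the 0.2.0 logic shown in the diff.
# TailTrimTokenizer is a hypothetical name used only for this sketch.
class TailTrimTokenizer
  def initialize(delimiter = $/)
    @delimiter = delimiter
    @input = []                   # segments seen since the last delimiter
    @tail = ''.dup                # most recent, still unterminated segment
    @trim = @delimiter.length - 1 # characters that could be a partial delimiter
  end

  def extract(data)
    if @trim > 0
      # Move the last @trim characters of the tail in front of the new data so
      # a delimiter cut in half by the chunk boundary becomes visible to split.
      tail_end = @tail.slice!(-@trim, @trim) # nil if the tail is too short
      data = tail_end + data if tail_end
    end

    @input << @tail
    entities = data.split(@delimiter, -1)
    # Assumption added for this sketch: "".split returns [], so fall back to
    # an empty string instead of storing nil as the tail.
    @tail = entities.shift || ''.dup

    unless entities.empty?
      @input << @tail
      entities.unshift @input.join # everything buffered up to the first token
      @input.clear
      @tail = entities.pop         # start of the next, unfinished entity
    end

    entities
  end

  def flush
    @input << @tail
    buffer = @input.join
    @input.clear
    @tail = ''.dup
    buffer
  end
end

tokenizer = TailTrimTokenizer.new('<>')
a = tokenizer.extract('foo<')       # "<" might start a delimiter: nothing yet
b = tokenizer.extract('>bar<')      # "<" + ">" completes it: "foo" is emitted
c = tokenizer.extract('baz<>qux<>') # the lone "<" turned out to be plain data
d = tokenizer.flush                 # nothing left after the final delimiter
```

Only `delimiter.length - 1` characters are ever carried over, because that is the longest delimiter prefix that can sit at the end of a chunk without the full delimiter having been seen.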
data/test/test_buftok.rb ADDED
@@ -0,0 +1,27 @@
+ require 'test/unit'
+ require 'buftok'
+
+ class TestBuftok < Test::Unit::TestCase
+   def test_buftok
+     tokenizer = BufferedTokenizer.new
+     assert_equal %w[foo], tokenizer.extract("foo\nbar".freeze)
+     assert_equal %w[barbaz qux], tokenizer.extract("baz\nqux\nquu".freeze)
+     assert_equal 'quu', tokenizer.flush
+     assert_equal '', tokenizer.flush
+   end
+
+   def test_delimiter
+     tokenizer = BufferedTokenizer.new('<>')
+     assert_equal ['', "foo\n"], tokenizer.extract("<>foo\n<>".freeze)
+     assert_equal %w[bar], tokenizer.extract('bar<>baz'.freeze)
+     assert_equal 'baz', tokenizer.flush
+   end
+
+   def test_split_delimiter
+     tokenizer = BufferedTokenizer.new('<>'.freeze)
+     assert_equal [], tokenizer.extract('foo<'.freeze)
+     assert_equal %w[foo], tokenizer.extract('>bar<'.freeze)
+     assert_equal %w[bar<baz qux], tokenizer.extract('baz<>qux<>'.freeze)
+     assert_equal '', tokenizer.flush
+   end
+ end
metadata CHANGED
@@ -1,49 +1,75 @@
- --- !ruby/object:Gem::Specification
- rubygems_version: 0.9.0
- specification_version: 1
+ --- !ruby/object:Gem::Specification
  name: buftok
- version: !ruby/object:Gem::Version
-   version: "0.1"
- date: 2006-12-18 00:00:00 -07:00
- summary: BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs
- require_paths:
- - lib
- email: tony@clickcaster.com
- homepage: http://buftok.rubyforge.org
- rubyforge_project: buftok
- description:
- autorequire:
- default_executable:
- bindir: bin
- has_rdoc: true
- required_ruby_version: !ruby/object:Gem::Version::Requirement
-   requirements:
-   - - ">"
-     - !ruby/object:Gem::Version
-       version: 0.0.0
-   version:
+ version: !ruby/object:Gem::Version
+   version: 0.2.0
+   prerelease:
  platform: ruby
- signing_key:
- cert_chain:
- post_install_message:
- authors:
+ authors:
  - Tony Arcieri
  - Martin Emde
- files:
+ - Erik Michaels-Ober
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2013-11-22 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: '1.0'
+ description: BufferedTokenizer extracts token delimited entities from a sequence of
+   arbitrary inputs
+ email: sferik@gmail.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CONTRIBUTING.md
+ - Gemfile
+ - LICENSE.md
+ - README.md
  - Rakefile
- - lib
+ - buftok.gemspec
  - lib/buftok.rb
- test_files: []
-
+ - test/test_buftok.rb
+ homepage: https://github.com/sferik/buftok
+ licenses:
+ - MIT
+ post_install_message:
  rdoc_options: []
-
- extra_rdoc_files: []
-
- executables: []
-
- extensions: []
-
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ! '>='
+     - !ruby/object:Gem::Version
+       version: 1.3.5
  requirements: []
-
- dependencies: []
-
+ rubyforge_project:
+ rubygems_version: 1.8.23
+ signing_key:
+ specification_version: 3
+ summary: BufferedTokenizer extracts token delimited entities from a sequence of arbitrary
+   inputs
+ test_files:
+ - test/test_buftok.rb
+ has_rdoc: