buftok 0.2.0 → 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: eb2a12017fae47f920bd9a3694845079e8a4a76c5fcffad9abcc18716e2c50a7
+   data.tar.gz: af65ae0012ce581da376624888e208338face85e18bc27f13c3d246400a6d464
+ SHA512:
+   metadata.gz: 2a0f62b94c72c92b2fa31a22b2fd52952c25fea5e9b4ab2f4770bba3e87d1e9372539d95565c94e13cd4f9358a7e006787b0c73ce5318dd1d0a8da8076ed8b2b
+   data.tar.gz: '08266806dca8fb19015c54447c9156c008b5cb979fbb8eb097dbf161759d264f4ec07e659d7e123dcebf7243738f87dc6fac8b7278b025892eb77fd2a8839606'
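As context for the new `checksums.yaml`: it records SHA256 and SHA512 digests of the two archives packed inside the `.gem` file, which RubyGems compares against freshly computed digests at install time. A minimal sketch of how such digests are produced, using only Ruby's stdlib `digest` (the string is a stand-in, not the real archive contents):

```ruby
require "digest"

# Stand-in for the bytes of metadata.gz or data.tar.gz inside the .gem.
contents = "stand-in archive contents"

sha256 = Digest::SHA256.hexdigest(contents) # 64 hex characters
sha512 = Digest::SHA512.hexdigest(contents) # 128 hex characters

puts "SHA256: #{sha256}"
puts "SHA512: #{sha512}"
```

Comparing a recomputed digest against the recorded value detects any alteration of the archive after release.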
data/CHANGELOG.md ADDED
@@ -0,0 +1,86 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [Unreleased]
+
+ ## [1.0.0] - 2026-03-20
+
+ ### Added
+
+ - RBS type signatures in `sig/buftok.rbs` with Steep for strict type checking
+ - RuboCop, Standard, rubocop-minitest, rubocop-performance, and rubocop-rake for linting
+ - Mutant for mutation testing with 100% coverage
+ - GitHub Actions workflows for linting, type checking, and mutation testing
+ - `.github/FUNDING.yml` for GitHub Sponsors
+ - Gemspec metadata (`allowed_push_host`, `changelog_uri`, `documentation_uri`,
+   `funding_uri`, `homepage_uri`, `rubygems_mfa_required`, `source_code_uri`,
+   `bug_tracker_uri`)
+ - `CHANGELOG.md`
+
+ ### Changed
+
+ - Require Ruby >= 3.2
+ - Require RubyGems >= 3.0
+ - Test against Ruby 3.2, 3.3, 3.4, and 4.0 (drop EOL 2.6, 2.7, 3.0)
+ - Update `actions/checkout` to v6 and `ruby/setup-ruby` to v1
+ - Replace test-unit with Minitest 6
+ - Replace `inject` with `sum` in `size` method
+ - Use `@tail.clear` instead of `String.new` in `flush` (drop Ruby 1.8.7 workaround)
+ - Move development dependencies from gemspec to Gemfile
+ - Bump rake from `~> 10.0` to `>= 13`
+ - Extract `rejoin_split_delimiter` and `consolidate_input` private methods
+ - Update copyright years to 2006-2026
+ - Rename Erik Michaels-Ober to Erik Berlin
+
+ ### Fixed
+
+ - Typo in test comment ("Desipte" -> "Despite")
+
+ ## [0.3.0] - 2021-03-25
+
+ ### Added
+
+ - `Buftok` constant as an alias for `BufferedTokenizer`
+ - `BufferedTokenizer#size` method to determine internal buffer size
+ - GitHub Actions CI workflow
+ - Support for `frozen_string_literal`
+
+ ### Changed
+
+ - Replace Ruby license with MIT license
+ - Modernize gemspec
+ - Remove Travis CI in favor of GitHub Actions
+ - Update supported Ruby versions to 2.6, 2.7, 3.0
+
+ ## [0.2.0] - 2013-11-22
+
+ ### Added
+
+ - Tests
+ - Benchmark rake task
+ - Support for multi-character delimiters split across chunks
+ - Section on supported Ruby versions in README
+
+ ### Changed
+
+ - Use global input delimiter `$/` as default instead of hard-coded `"\n"`
+ - Unified handling of single/multi-character delimiters
+
+ ## [0.1.0] - 2013-11-20
+
+ ### Added
+
+ - Initial release of BufferedTokenizer
+ - Line-based tokenization with configurable delimiter
+ - `extract` method for incremental tokenization
+ - `flush` method to retrieve remaining buffer contents
+
+ [Unreleased]: https://github.com/sferik/buftok/compare/v1.0.0...HEAD
+ [1.0.0]: https://github.com/sferik/buftok/compare/v0.3.0...v1.0.0
+ [0.3.0]: https://github.com/sferik/buftok/compare/v0.2.0...v0.3.0
+ [0.2.0]: https://github.com/sferik/buftok/compare/v0.1...v0.2.0
+ [0.1.0]: https://github.com/sferik/buftok/releases/tag/v0.1
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2006-2026 Tony Arcieri, Martin Emde, Erik Berlin
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md CHANGED
@@ -1,39 +1,116 @@
  # BufferedTokenizer

- [![Gem Version](https://badge.fury.io/rb/buftok.png)][gem]
- [![Build Status](https://travis-ci.org/sferik/buftok.png?branch=master)][travis]
- [![Dependency Status](https://gemnasium.com/sferik/buftok.png?travis)][gemnasium]
- [![Code Climate](https://codeclimate.com/github/sferik/buftok.png)][codeclimate]
+ [![Gem Version](http://img.shields.io/gem/v/buftok.svg)][gem]
+ [![Test](https://github.com/sferik/buftok/actions/workflows/test.yml/badge.svg)][test]
+ [![Lint](https://github.com/sferik/buftok/actions/workflows/lint.yml/badge.svg)][lint]
+ [![Type Check](https://github.com/sferik/buftok/actions/workflows/typecheck.yml/badge.svg)][typecheck]
+ [![Mutation Testing](https://github.com/sferik/buftok/actions/workflows/mutant.yml/badge.svg)][mutant]
+ [![Documentation Coverage](https://github.com/sferik/buftok/actions/workflows/yardstick.yml/badge.svg)][yardstick]

  [gem]: https://rubygems.org/gems/buftok
- [travis]: https://travis-ci.org/sferik/buftok
- [gemnasium]: https://gemnasium.com/sferik/buftok
- [codeclimate]: https://codeclimate.com/github/sferik/buftok
+ [test]: https://github.com/sferik/buftok/actions/workflows/test.yml
+ [lint]: https://github.com/sferik/buftok/actions/workflows/lint.yml
+ [typecheck]: https://github.com/sferik/buftok/actions/workflows/typecheck.yml
+ [mutant]: https://github.com/sferik/buftok/actions/workflows/mutant.yml
+ [yardstick]: https://github.com/sferik/buftok/actions/workflows/yardstick.yml

  ###### Statefully split input data by a specifiable token

  BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by
- default. It allows input to be spoon-fed from some outside source which
+ default. It allows input to be spoon-fed from some outside source which
  receives arbitrary length datagrams which may-or-may-not contain the token by
- which entities are delimited. In this respect it's ideally paired with
- something like [EventMachine][].
+ which entities are delimited. It's useful any time you need to extract
+ delimited messages from a stream of chunked data.

- [EventMachine]: http://rubyeventmachine.com/
+ ## Examples
+
+ ### TCP Server
+
+ Process newline-delimited commands from a TCP client:
+
+ ```ruby
+ require "socket"
+ require "buftok"
+
+ server = TCPServer.new(4000)
+
+ loop do
+   client = server.accept
+   tokenizer = BufferedTokenizer.new("\n")
+
+   while (data = client.readpartial(4096))
+     tokenizer.extract(data).each do |line|
+       puts "Received: #{line}"
+     end
+   end
+ rescue EOFError
+   client.close
+ end
+ ```
+
+ ### Streaming IO
+
+ Read a large file in chunks without loading it all into memory:
+
+ ```ruby
+ require "buftok"
+
+ tokenizer = BufferedTokenizer.new("\n")
+
+ File.open("large_log_file.txt") do |file|
+   while (chunk = file.read(8192))
+     tokenizer.extract(chunk).each do |line|
+       process_log_line(line)
+     end
+   end
+ end
+
+ # Don't forget to flush any remaining data
+ remaining = tokenizer.flush
+ process_log_line(remaining) unless remaining.empty?
+ ```
+
+ > [!IMPORTANT]
+ > Always call `flush` when you're done reading from the stream to process any
+ > remaining data that didn't end with a delimiter.
+
+ ### Custom Delimiters
+
+ Parse a stream using a multi-character delimiter:
+
+ ```ruby
+ require "buftok"
+
+ tokenizer = BufferedTokenizer.new("\r\n\r\n")
+
+ chunks = ["HTTP/1.1 200 OK\r\n", "Content-Type: text/plain\r\n\r\n", "Hello"]
+
+ chunks.each do |chunk|
+   tokenizer.extract(chunk).each do |headers|
+     puts "Headers: #{headers}"
+   end
+ end
+
+ puts "Body so far: #{tokenizer.flush}"
+ ```
+
+ > [!TIP]
+ > Multi-character delimiters that get split across chunks are handled
+ > automatically — no special handling is needed on your end.

  ## Supported Ruby Versions
- This library aims to support and is [tested against][travis] the following Ruby
+ This library aims to support and is [tested against][test] the following Ruby
  implementations:

- * Ruby 1.8.7
- * Ruby 1.9.2
- * Ruby 1.9.3
- * Ruby 2.0.0
+ * Ruby 3.2
+ * Ruby 3.3
+ * Ruby 3.4
+ * Ruby 4.0

  If something doesn't work on one of these interpreters, it's a bug.

- This library may inadvertently work (or seem to work) on other Ruby
- implementations, however support will only be provided for the versions listed
- above.
+ This code will likely still work on older Ruby versions but support will not be
+ provided for end-of-life versions.

  If you would like this library to support another Ruby version, you may
  volunteer to be a maintainer. Being a maintainer entails making sure all tests
@@ -43,6 +120,7 @@ fashion. If critical issues for a particular implementation exist at the time
  of a major release, support for that Ruby version may be dropped.

  ## Copyright
- Copyright (c) 2006-2013 Tony Arcieri, Martin Emde, Erik Michaels-Ober.
- Distributed under the [Ruby license][license].
- [license]: http://www.ruby-lang.org/en/LICENSE.txt
+ Copyright (c) 2006-2026 Tony Arcieri, Martin Emde, Erik Berlin.
+ Distributed under the [MIT license][license].
+
+ [license]: https://opensource.org/licenses/MIT
data/buftok.gemspec CHANGED
@@ -1,17 +1,28 @@
+ # frozen_string_literal: true
+
  Gem::Specification.new do |spec|
-   spec.add_development_dependency 'bundler', '~> 1.0'
-   spec.authors = ["Tony Arcieri", "Martin Emde", "Erik Michaels-Ober"]
-   spec.description = %q{BufferedTokenizer extracts token delimited entities from a sequence of arbitrary inputs}
-   spec.email = "sferik@gmail.com"
-   spec.files = %w(CONTRIBUTING.md Gemfile LICENSE.md README.md Rakefile buftok.gemspec)
-   spec.files += Dir.glob("lib/**/*.rb")
-   spec.files += Dir.glob("test/**/*.rb")
-   spec.test_files = spec.files.grep(%r{^test/})
-   spec.homepage = "https://github.com/sferik/buftok"
-   spec.licenses = ['MIT']
-   spec.name = "buftok"
+   spec.version = "1.0.0"
+
+   spec.authors = ["Tony Arcieri", "Martin Emde", "Erik Berlin"]
+   spec.summary = "BufferedTokenizer extracts token delimited entities from a sequence of string inputs"
+   spec.description = spec.summary
+   spec.email = ["sferik@gmail.com", "martin.emde@gmail.com"]
+   spec.files = %w[CHANGELOG.md CONTRIBUTING.md LICENSE.txt README.md buftok.gemspec] + Dir["lib/**/*.rb"]
+   spec.homepage = "https://github.com/sferik/buftok"
+   spec.licenses = ["MIT"]
+   spec.name = "buftok"
    spec.require_paths = ["lib"]
-   spec.required_rubygems_version = '>= 1.3.5'
-   spec.summary = spec.description
-   spec.version = "0.2.0"
+   spec.required_ruby_version = ">= 3.2"
+   spec.required_rubygems_version = ">= 3.0"
+
+   spec.metadata = {
+     "allowed_push_host" => "https://rubygems.org",
+     "bug_tracker_uri" => "#{spec.homepage}/issues",
+     "changelog_uri" => "#{spec.homepage}/blob/master/CHANGELOG.md",
+     "documentation_uri" => "https://rubydoc.info/gems/buftok/",
+     "funding_uri" => "https://github.com/sponsors/sferik/",
+     "homepage_uri" => spec.homepage,
+     "rubygems_mfa_required" => "true",
+     "source_code_uri" => spec.homepage
+   }
  end
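The `spec.metadata` hash added in the new gemspec is plain String-to-String data that RubyGems stores in the package and exposes at runtime. A small sketch using only the stdlib `Gem::Specification` class (the values mirror the gemspec above; the in-memory spec is illustrative, not the released one):

```ruby
require "rubygems"

# Build an in-memory spec carrying the same metadata keys as buftok 1.0.0.
spec = Gem::Specification.new do |s|
  s.name = "buftok"
  s.version = "1.0.0"
  s.summary = "BufferedTokenizer extracts token delimited entities"
  s.metadata = {
    "rubygems_mfa_required" => "true",
    "source_code_uri" => "https://github.com/sferik/buftok"
  }
end

# Tools (and rubygems.org itself) read these keys back like any hash.
puts spec.metadata["source_code_uri"]
```

Note that metadata keys and values must both be Strings, which is why `rubygems_mfa_required` is the string `"true"` rather than a boolean.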
data/lib/buftok.rb CHANGED
@@ -1,59 +1,169 @@
+ # frozen_string_literal: true
+
+ # Statefully split input data by a specifiable token
+ #
  # BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
- # by default. It allows input to be spoon-fed from some outside source which
+ # by default. It allows input to be spoon-fed from some outside source which
  # receives arbitrary length datagrams which may-or-may-not contain the token
- # by which entities are delimited. In this respect it's ideally paired with
- # something like EventMachine (http://rubyeventmachine.com/).
+ # by which entities are delimited.
+ #
+ # @example
+ #   tokenizer = BufferedTokenizer.new("\n")
+ #   tokenizer.extract("foo\nbar") #=> ["foo"]
+ #   tokenizer.extract("baz\n") #=> ["barbaz"]
+ #   tokenizer.flush #=> ""
  class BufferedTokenizer
-   # New BufferedTokenizers will operate on lines delimited by a delimiter,
-   # which is by default the global input delimiter $/ ("\n").
+   # Limit passed to String#split to preserve trailing empty fields
+   SPLIT_LIMIT = -1
+
+   # Return the delimiter overlap length
+   #
+   # The number of characters at the end of a chunk that may contain a
+   # partial delimiter, equal to delimiter.length - 1.
+   #
+   # @example
+   #   BufferedTokenizer.new("<>").overlap #=> 1
+   #
+   # @return [Integer] delimiter.length - 1
+   #
+   # @api public
+   attr_reader :overlap
+
+   # Create a new BufferedTokenizer
+   #
+   # Operates on lines delimited by a delimiter, which is by default "\n".
    #
-   # The input buffer is stored as an array. This is by far the most efficient
+   # The input buffer is stored as an array. This is by far the most efficient
    # approach given language constraints (in C a linked list would be a more
-   # appropriate data structure). Segments of input data are stored in a list
+   # appropriate data structure). Segments of input data are stored in a list
    # which is only joined when a token is reached, substantially reducing the
    # number of objects required for the operation.
-   def initialize(delimiter = $/)
+   #
+   # @example
+   #   tokenizer = BufferedTokenizer.new("<>")
+   #
+   # @param delimiter [String] the token delimiter (default: "\n")
+   #
+   # @return [BufferedTokenizer]
+   #
+   # @api public
+   def initialize(delimiter = "\n")
      @delimiter = delimiter
      @input = []
-     @tail = ''
-     @trim = @delimiter.length - 1
+     @tail = +""
+     @overlap = @delimiter.length - 1
    end

+   # Return the byte size of the internal buffer
+   #
+   # Size is not cached and is determined every time this method is called
+   # in order to optimize throughput for extract.
+   #
+   # @example
+   #   tokenizer = BufferedTokenizer.new
+   #   tokenizer.extract("foo")
+   #   tokenizer.size #=> 3
+   #
+   # @return [Integer]
+   #
+   # @api public
+   def size
+     @tail.length + @input.sum(&:length)
+   end
+
+   # Extract tokenized entities from the input data
+   #
    # Extract takes an arbitrary string of input data and returns an array of
-   # tokenized entities, provided there were any available to extract. This
+   # tokenized entities, provided there were any available to extract. This
    # makes for easy processing of datagrams using a pattern like:
    #
-   #   tokenizer.extract(data).map { |entity| Decode(entity) }.each do ...
+   #   tokenizer.extract(data).map { |entity| Decode(entity) }.each { ... }
    #
-   # Using -1 makes split to return "" if the token is at the end of
+   # Using -1 makes split return "" if the token is at the end of
    # the string, meaning the last element is the start of the next chunk.
+   #
+   # @example
+   #   tokenizer = BufferedTokenizer.new
+   #   tokenizer.extract("foo\nbar") #=> ["foo"]
+   #
+   # @param data [String] a chunk of input data
+   #
+   # @return [Array<String>] complete tokens extracted from the input
+   #
+   # @api public
    def extract(data)
-     if @trim > 0
-       tail_end = @tail.slice!(-@trim, @trim) # returns nil if string is too short
-       data = tail_end + data if tail_end
-     end
+     data = rejoin_split_delimiter(data)

      @input << @tail
-     entities = data.split(@delimiter, -1)
-     @tail = entities.shift
-
-     unless entities.empty?
-       @input << @tail
-       entities.unshift @input.join
-       @input.clear
-       @tail = entities.pop
-     end
+     entities = data.split(@delimiter, SPLIT_LIMIT)
+     @tail = entities.shift # : String
+
+     consolidate_input(entities) if entities.length.positive?

      entities
    end

-   # Flush the contents of the input buffer, i.e. return the input buffer even though
-   # a token has not yet been encountered
+   # Flush the contents of the input buffer
+   #
+   # Return the contents of the input buffer even though a token has not
+   # yet been encountered, then reset the buffer.
+   #
+   # @example
+   #   tokenizer = BufferedTokenizer.new
+   #   tokenizer.extract("foo\nbar")
+   #   tokenizer.flush #=> "bar"
+   #
+   # @return [String] the buffered input
+   #
+   # @api public
    def flush
      @input << @tail
      buffer = @input.join
      @input.clear
-     @tail = "" # @tail.clear is slightly faster, but not supported on 1.8.7
+     @tail = +""
      buffer
    end
+
+   private
+
+   # Rejoin a delimiter that was split across two chunks
+   #
+   # When the delimiter is longer than one character, it may be split across
+   # two successive chunks. Transfer the trailing overlap from @tail back onto
+   # the front of the incoming data so that split can find the full delimiter.
+   #
+   # @param data [String] incoming data
+   #
+   # @return [String] data with any split delimiter prefix restored
+   #
+   # @api private
+   def rejoin_split_delimiter(data)
+     if @overlap.positive?
+       tail_end = @tail[-@overlap..]
+       @tail.slice!(-@overlap, @overlap)
+       tail_end ? tail_end + data : data
+     else
+       data
+     end
+   end
+
+   # Consolidate the input buffer into the first entity
+   #
+   # Once at least one delimiter has been found, join the accumulated input
+   # buffer with the first entity and move the trailing partial into @tail.
+   #
+   # @param entities [Array<String>] split entities
+   #
+   # @return [void]
+   #
+   # @api private
+   def consolidate_input(entities)
+     @input << @tail
+     entities.unshift @input.join
+     @input.clear
+     @tail = entities.pop # : String
+   end
  end
+
+ # Alias for {BufferedTokenizer}, matching the gem name
+ Buftok = BufferedTokenizer
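To see the refactored 1.0.0 implementation run end-to-end, here is a condensed, self-contained copy of the class as it appears in the new `lib/buftok.rb` above (YARD comments trimmed), followed by the documented examples:

```ruby
# frozen_string_literal: true

# Condensed from the new lib/buftok.rb in this diff (doc comments trimmed).
class BufferedTokenizer
  SPLIT_LIMIT = -1 # preserve trailing empty fields in String#split

  attr_reader :overlap

  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @input = []
    @tail = +""
    @overlap = @delimiter.length - 1
  end

  def size
    @tail.length + @input.sum(&:length)
  end

  def extract(data)
    data = rejoin_split_delimiter(data)
    @input << @tail
    entities = data.split(@delimiter, SPLIT_LIMIT)
    @tail = entities.shift
    consolidate_input(entities) if entities.length.positive?
    entities
  end

  def flush
    @input << @tail
    buffer = @input.join
    @input.clear
    @tail = +""
    buffer
  end

  private

  # Re-attach a possible partial delimiter left over from the previous chunk.
  def rejoin_split_delimiter(data)
    return data unless @overlap.positive?

    tail_end = @tail[-@overlap..]
    @tail.slice!(-@overlap, @overlap)
    tail_end ? tail_end + data : data
  end

  # Join the buffered segments into the first complete entity.
  def consolidate_input(entities)
    @input << @tail
    entities.unshift @input.join
    @input.clear
    @tail = entities.pop
  end
end

Buftok = BufferedTokenizer

tokenizer = Buftok.new
p tokenizer.extract("foo\nbar") #=> ["foo"]
p tokenizer.size                #=> 3
p tokenizer.extract("baz\n")    #=> ["barbaz"]
p tokenizer.flush               #=> ""
```

The `rejoin_split_delimiter` call is what makes multi-character delimiters work across chunk boundaries: the last `overlap` characters of the buffered tail are moved back onto the front of the incoming data before splitting.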
metadata CHANGED
@@ -1,75 +1,59 @@
  --- !ruby/object:Gem::Specification
  name: buftok
  version: !ruby/object:Gem::Version
-   version: 0.2.0
- prerelease:
+   version: 1.0.0
  platform: ruby
  authors:
  - Tony Arcieri
  - Martin Emde
- - Erik Michaels-Ober
- autorequire:
+ - Erik Berlin
  bindir: bin
  cert_chain: []
- date: 2013-11-22 00:00:00.000000000 Z
- dependencies:
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ~>
-       - !ruby/object:Gem::Version
-         version: '1.0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ~>
-       - !ruby/object:Gem::Version
-         version: '1.0'
+ date: 1980-01-02 00:00:00.000000000 Z
+ dependencies: []
  description: BufferedTokenizer extracts token delimited entities from a sequence of
-   arbitrary inputs
- email: sferik@gmail.com
+   string inputs
+ email:
+ - sferik@gmail.com
+ - martin.emde@gmail.com
  executables: []
  extensions: []
  extra_rdoc_files: []
  files:
+ - CHANGELOG.md
  - CONTRIBUTING.md
- - Gemfile
- - LICENSE.md
+ - LICENSE.txt
  - README.md
- - Rakefile
  - buftok.gemspec
  - lib/buftok.rb
- - test/test_buftok.rb
  homepage: https://github.com/sferik/buftok
  licenses:
  - MIT
- post_install_message:
+ metadata:
+   allowed_push_host: https://rubygems.org
+   bug_tracker_uri: https://github.com/sferik/buftok/issues
+   changelog_uri: https://github.com/sferik/buftok/blob/master/CHANGELOG.md
+   documentation_uri: https://rubydoc.info/gems/buftok/
+   funding_uri: https://github.com/sponsors/sferik/
+   homepage_uri: https://github.com/sferik/buftok
+   rubygems_mfa_required: 'true'
+   source_code_uri: https://github.com/sferik/buftok
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
-       version: '0'
+       version: '3.2'
  required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
-       version: 1.3.5
+       version: '3.0'
  requirements: []
- rubyforge_project:
- rubygems_version: 1.8.23
- signing_key:
- specification_version: 3
- summary: BufferedTokenizer extracts token delimited entities from a sequence of arbitrary
+ rubygems_version: 4.0.6
+ specification_version: 4
+ summary: BufferedTokenizer extracts token delimited entities from a sequence of string
    inputs
- test_files:
- - test/test_buftok.rb
- has_rdoc:
+ test_files: []
data/Gemfile DELETED
@@ -1,6 +0,0 @@
- source 'https://rubygems.org'
-
- gem 'rake'
- gem 'rdoc'
-
- gemspec
data/LICENSE.md DELETED
@@ -1,56 +0,0 @@
- Ruby is copyrighted free software by Yukihiro Matsumoto <matz@netlab.jp>.
- You can redistribute it and/or modify it under either the terms of the
- 2-clause BSDL (see the file BSDL), or the conditions below:
-
- 1. You may make and give away verbatim copies of the source form of the
-    software without restriction, provided that you duplicate all of the
-    original copyright notices and associated disclaimers.
-
- 2. You may modify your copy of the software in any way, provided that
-    you do at least ONE of the following:
-
-    a) place your modifications in the Public Domain or otherwise
-       make them Freely Available, such as by posting said
-       modifications to Usenet or an equivalent medium, or by allowing
-       the author to include your modifications in the software.
-
-    b) use the modified software only within your corporation or
-       organization.
-
-    c) give non-standard binaries non-standard names, with
-       instructions on where to get the original software distribution.
-
-    d) make other distribution arrangements with the author.
-
- 3. You may distribute the software in object code or binary form,
-    provided that you do at least ONE of the following:
-
-    a) distribute the binaries and library files of the software,
-       together with instructions (in the manual page or equivalent)
-       on where to get the original distribution.
-
-    b) accompany the distribution with the machine-readable source of
-       the software.
-
-    c) give non-standard binaries non-standard names, with
-       instructions on where to get the original software distribution.
-
-    d) make other distribution arrangements with the author.
-
- 4. You may modify and include the part of the software into any other
-    software (possibly commercial). But some files in the distribution
-    are not written by the author, so that they are not under these terms.
-
-    For the list of those files and their copying conditions, see the
-    file LEGAL.
-
- 5. The scripts and library files supplied as input to or produced as
-    output from the software do not automatically fall under the
-    copyright of the software, but belong to whomever generated them,
-    and may be sold commercially, and may be aggregated with this
-    software.
-
- 6. THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
-    IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
-    WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
-    PURPOSE.
data/Rakefile DELETED
@@ -1,66 +0,0 @@
- require 'bundler'
- require 'rdoc/task'
- require 'rake/testtask'
-
- task :default => :test
-
- Bundler::GemHelper.install_tasks
-
- RDoc::Task.new do |task|
-   task.rdoc_dir = 'doc'
-   task.title = 'BufferedTokenizer'
-   task.rdoc_files.include('lib/**/*.rb')
- end
-
- Rake::TestTask.new :test do |t|
-   t.libs << 'lib'
-   t.test_files = FileList['test/**/*.rb']
- end
-
- desc "Benchmark the current implementation"
- task :bench do
-   require 'benchmark'
-   require File.expand_path('lib/buftok', File.dirname(__FILE__))
-
-   n = 50000
-   delimiter = "\n\n"
-
-   frequency1 = 1000
-   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency1} strings..."
-   data1 = (0...n).map do |i|
-     (((i % frequency1 == 1) ? "\n" : "") +
-      ("s" * i) +
-      ((i % frequency1 == 0) ? "\n" : "")).freeze
-   end
-
-   frequency2 = 10
-   puts "generating #{n} strings, with #{delimiter.inspect} every #{frequency2} strings..."
-   data2 = (0...n).map do |i|
-     (((i % frequency2 == 1) ? "\n" : "") +
-      ("s" * i) +
-      ((i % frequency2 == 0) ? "\n" : "")).freeze
-   end
-
-   Benchmark.bmbm do |x|
-     x.report("1 char, freq: #{frequency1}") do
-       bt1 = BufferedTokenizer.new
-       n.times { |i| bt1.extract(data1[i]) }
-     end
-
-     x.report("2 char, freq: #{frequency1}") do
-       bt2 = BufferedTokenizer.new(delimiter)
-       n.times { |i| bt2.extract(data1[i]) }
-     end
-
-     x.report("1 char, freq: #{frequency2}") do
-       bt3 = BufferedTokenizer.new
-       n.times { |i| bt3.extract(data2[i]) }
-     end
-
-     x.report("2 char, freq: #{frequency2}") do
-       bt4 = BufferedTokenizer.new(delimiter)
-       n.times { |i| bt4.extract(data2[i]) }
-     end
-   end
- end
data/test/test_buftok.rb DELETED
@@ -1,27 +0,0 @@
- require 'test/unit'
- require 'buftok'
-
- class TestBuftok < Test::Unit::TestCase
-   def test_buftok
-     tokenizer = BufferedTokenizer.new
-     assert_equal %w[foo], tokenizer.extract("foo\nbar".freeze)
-     assert_equal %w[barbaz qux], tokenizer.extract("baz\nqux\nquu".freeze)
-     assert_equal 'quu', tokenizer.flush
-     assert_equal '', tokenizer.flush
-   end
-
-   def test_delimiter
-     tokenizer = BufferedTokenizer.new('<>')
-     assert_equal ['', "foo\n"], tokenizer.extract("<>foo\n<>".freeze)
-     assert_equal %w[bar], tokenizer.extract('bar<>baz'.freeze)
-     assert_equal 'baz', tokenizer.flush
-   end
-
-   def test_split_delimiter
-     tokenizer = BufferedTokenizer.new('<>'.freeze)
-     assert_equal [], tokenizer.extract('foo<'.freeze)
-     assert_equal %w[foo], tokenizer.extract('>bar<'.freeze)
-     assert_equal %w[bar<baz qux], tokenizer.extract('baz<>qux<>'.freeze)
-     assert_equal '', tokenizer.flush
-   end
- end