buftok 0.3.0 → 1.0.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +100 -0
- data/LICENSE.txt +1 -1
- data/README.md +96 -14
- data/buftok.gemspec +23 -14
- data/lib/buftok.rb +128 -31
- metadata +17 -67
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ccb8527aed71cad86076d553ad95bb5034ff41e0cce2138e58c70c92525a0f94
+  data.tar.gz: 8cb51801dc3975bf6719278edb7237482483ff75d8d6a51a2726e5306453595f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9813d806b60d57ad1ba3729acaa48e8f25a94a84d96b08217911638d3866b4deb14f0fdca0fc0b5ea5563557367bc3165d77f3e707a26a858549a360f0fe17a2
+  data.tar.gz: 74df77109d7b1733a643d70ab310f086983e8913cd926c15ec79d0221263bace433ec3863f826bef41022b729a2b4b52bbcef694aacaf86ab6333a5a163b46df
data/CHANGELOG.md
ADDED
@@ -0,0 +1,100 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [1.0.1] - 2026-03-20
+
+### Changed
+
+- Improve gem push workflow security and reliability
+  - Add top-level `permissions: contents: read` and scope `contents: write` to the job
+  - Restrict workflow to `sferik/buftok` repository
+  - Add `rubygems.org` deployment environment
+  - Pin `rubygems/configure-rubygems-credentials` to v1.0.0
+  - Sign gem with Sigstore before pushing
+  - Push gem with `--attestation` flag
+  - Simplify release steps and remove manual git config
+
+## [1.0.0] - 2026-03-20
+
+### Added
+
+- RBS type signatures in `sig/buftok.rbs` with Steep for strict type checking
+- RuboCop, Standard, rubocop-minitest, rubocop-performance, and rubocop-rake for linting
+- Mutant for mutation testing with 100% coverage
+- GitHub Actions workflows for linting, type checking, and mutation testing
+- `.github/FUNDING.yml` for GitHub Sponsors
+- Gemspec metadata (`allowed_push_host`, `changelog_uri`, `documentation_uri`,
+  `funding_uri`, `homepage_uri`, `rubygems_mfa_required`, `source_code_uri`,
+  `bug_tracker_uri`)
+- `CHANGELOG.md`
+
+### Changed
+
+- Require Ruby >= 3.2
+- Require RubyGems >= 3.0
+- Test against Ruby 3.2, 3.3, 3.4, and 4.0 (drop EOL 2.6, 2.7, 3.0)
+- Update `actions/checkout` to v6 and `ruby/setup-ruby` to v1
+- Replace test-unit with Minitest 6
+- Replace `inject` with `sum` in `size` method
+- Use `@tail.clear` instead of `String.new` in `flush` (drop Ruby 1.8.7 workaround)
+- Move development dependencies from gemspec to Gemfile
+- Bump rake from `~> 10.0` to `>= 13`
+- Extract `rejoin_split_delimiter` and `consolidate_input` private methods
+- Update copyright years to 2006-2026
+- Rename Erik Michaels-Ober to Erik Berlin
+
+### Fixed
+
+- Typo in test comment ("Desipte" -> "Despite")
+
+## [0.3.0] - 2021-03-25
+
+### Added
+
+- `Buftok` constant as an alias for `BufferedTokenizer`
+- `BufferedTokenizer#size` method to determine internal buffer size
+- GitHub Actions CI workflow
+- Support for `frozen_string_literal`
+
+### Changed
+
+- Replace Ruby license with MIT license
+- Modernize gemspec
+- Remove Travis CI in favor of GitHub Actions
+- Update supported Ruby versions to 2.6, 2.7, 3.0
+
+## [0.2.0] - 2013-11-22
+
+### Added
+
+- Tests
+- Benchmark rake task
+- Support for multi-character delimiters split across chunks
+- Section on supported Ruby versions in README
+
+### Changed
+
+- Use global input delimiter `$/` as default instead of hard-coded `"\n"`
+- Unified handling of single/multi-character delimiters
+
+## [0.1.0] - 2013-11-20
+
+### Added
+
+- Initial release of BufferedTokenizer
+- Line-based tokenization with configurable delimiter
+- `extract` method for incremental tokenization
+- `flush` method to retrieve remaining buffer contents
+
+[Unreleased]: https://github.com/sferik/buftok/compare/v1.0.1...HEAD
+[1.0.1]: https://github.com/sferik/buftok/compare/v1.0.0...v1.0.1
+[1.0.0]: https://github.com/sferik/buftok/compare/v0.3.0...v1.0.0
+[0.3.0]: https://github.com/sferik/buftok/compare/v0.2.0...v0.3.0
+[0.2.0]: https://github.com/sferik/buftok/compare/v0.1...v0.2.0
+[0.1.0]: https://github.com/sferik/buftok/releases/tag/v0.1
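The 1.0.0 changelog entry mentions replacing `inject` with `sum` in the `size` method. The sketch below illustrates what that change amounts to. It assumes the earlier version folded segment lengths with a block passed to `inject`; the old line is cut off in this diff, so that form is a guess, and the variable names are illustrative only.

```ruby
# Hypothetical buffer state: segments still waiting for a delimiter, plus the tail.
input = ["foo", "ba", "r"]
tail = "baz"

# Assumed pre-1.0.0 style: fold the segment lengths with inject.
size_with_inject = tail.length + input.inject(0) { |total, segment| total + segment.length }

# 1.0.0 style, as it appears in lib/buftok.rb: sum with a method reference.
size_with_sum = tail.length + input.sum(&:length)

puts size_with_inject == size_with_sum
```

Both forms return the same count; `sum(&:length)` also returns 0 for an empty array without needing an explicit initial value.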
data/LICENSE.txt
CHANGED
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c)
+Copyright (c) 2006-2026 Tony Arcieri, Martin Emde, Erik Berlin
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
data/README.md
CHANGED
@@ -1,34 +1,116 @@
 # BufferedTokenizer
 
 [][gem]
-[][test]
+[][lint]
+[][typecheck]
+[][mutant]
+[][yardstick]
 
 [gem]: https://rubygems.org/gems/buftok
-[
+[test]: https://github.com/sferik/buftok/actions/workflows/test.yml
+[lint]: https://github.com/sferik/buftok/actions/workflows/lint.yml
+[typecheck]: https://github.com/sferik/buftok/actions/workflows/typecheck.yml
+[mutant]: https://github.com/sferik/buftok/actions/workflows/mutant.yml
+[yardstick]: https://github.com/sferik/buftok/actions/workflows/yardstick.yml
 
 ###### Statefully split input data by a specifiable token
 
 BufferedTokenizer takes a delimiter upon instantiation, or acts line-based by
-default.
+default. It allows input to be spoon-fed from some outside source which
 receives arbitrary length datagrams which may-or-may-not contain the token by
-which entities are delimited.
-
+which entities are delimited. It's useful any time you need to extract
+delimited messages from a stream of chunked data.
 
-
+## Examples
+
+### TCP Server
+
+Process newline-delimited commands from a TCP client:
+
+```ruby
+require "socket"
+require "buftok"
+
+server = TCPServer.new(4000)
+
+loop do
+  client = server.accept
+  tokenizer = BufferedTokenizer.new("\n")
+
+  while (data = client.readpartial(4096))
+    tokenizer.extract(data).each do |line|
+      puts "Received: #{line}"
+    end
+  end
+rescue EOFError
+  client.close
+end
+```
+
+### Streaming IO
+
+Read a large file in chunks without loading it all into memory:
+
+```ruby
+require "buftok"
+
+tokenizer = BufferedTokenizer.new("\n")
+
+File.open("large_log_file.txt") do |file|
+  while (chunk = file.read(8192))
+    tokenizer.extract(chunk).each do |line|
+      process_log_line(line)
+    end
+  end
+end
+
+# Don't forget to flush any remaining data
+remaining = tokenizer.flush
+process_log_line(remaining) unless remaining.empty?
+```
+
+> [!IMPORTANT]
+> Always call `flush` when you're done reading from the stream to process any
+> remaining data that didn't end with a delimiter.
+
+### Custom Delimiters
+
+Parse a stream using a multi-character delimiter:
+
+```ruby
+require "buftok"
+
+tokenizer = BufferedTokenizer.new("\r\n\r\n")
+
+chunks = ["HTTP/1.1 200 OK\r\n", "Content-Type: text/plain\r\n\r\n", "Hello"]
+
+chunks.each do |chunk|
+  tokenizer.extract(chunk).each do |headers|
+    puts "Headers: #{headers}"
+  end
+end
+
+puts "Body so far: #{tokenizer.flush}"
+```
+
+> [!TIP]
+> Multi-character delimiters that get split across chunks are handled
+> automatically — no special handling is needed on your end.
 
 ## Supported Ruby Versions
-This library aims to support and is [tested against][
+This library aims to support and is [tested against][test] the following Ruby
 implementations:
 
-* Ruby 2
-* Ruby
-* Ruby 3.
+* Ruby 3.2
+* Ruby 3.3
+* Ruby 3.4
+* Ruby 4.0
 
 If something doesn't work on one of these interpreters, it's a bug.
 
-This code will likely still work on older versions
-
-end-of-life ruby versions.
+This code will likely still work on older Ruby versions but support will not be
+provided for end-of-life versions.
 
 If you would like this library to support another Ruby version, you may
 volunteer to be a maintainer. Being a maintainer entails making sure all tests
@@ -38,7 +120,7 @@ fashion. If critical issues for a particular implementation exist at the time
 of a major release, support for that Ruby version may be dropped.
 
 ## Copyright
-Copyright (c) 2006-
+Copyright (c) 2006-2026 Tony Arcieri, Martin Emde, Erik Berlin.
 Distributed under the [MIT license][license].
 
 [license]: https://opensource.org/licenses/MIT
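The README tip says that multi-character delimiters split across chunks are handled automatically. The sketch below shows that behaviour without installing the gem. `SketchTokenizer` is a hypothetical stand-in condensed from the added lines of `lib/buftok.rb` in this diff, not the published class, and the chunk boundaries are chosen here so that `"\r\n\r\n"` is cut in the middle.

```ruby
# Hypothetical stand-in for BufferedTokenizer, condensed from the 1.0.1 source
# shown in this diff. The class name is an assumption made for illustration.
class SketchTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @input = []
    @tail = +""
    @overlap = delimiter.length - 1
  end

  # Returns the complete tokens found so far; keeps any partial token buffered.
  def extract(data)
    data = rejoin_split_delimiter(data)
    @input << @tail
    entities = data.split(@delimiter, -1)
    @tail = entities.shift
    consolidate_input(entities) unless entities.empty?
    entities
  end

  # Returns whatever is buffered and resets the buffer.
  def flush
    @input << @tail
    buffer = @input.join
    @input.clear
    @tail = +""
    buffer
  end

  private

  # Move the last (delimiter.length - 1) characters of the tail onto the front
  # of the new data so a delimiter cut in half can be matched by split.
  def rejoin_split_delimiter(data)
    return data unless @overlap.positive?

    tail_end = @tail[-@overlap..]
    @tail.slice!(-@overlap, @overlap)
    tail_end ? tail_end + data : data
  end

  # Join the buffered segments into the first token; keep the last piece as tail.
  def consolidate_input(entities)
    @input << @tail
    entities.unshift(@input.join)
    @input.clear
    @tail = entities.pop
  end
end

tokenizer = SketchTokenizer.new("\r\n\r\n")
first = tokenizer.extract("HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r")
second = tokenizer.extract("\nHello")
rest = tokenizer.flush

p first   # no complete token yet
p second  # the header block, once the delimiter is completed
p rest    # the unfinished body
```

The first call returns an empty array because only three of the four delimiter characters have arrived; the second call completes the delimiter and yields the header block, and `flush` returns the remaining `"Hello"`.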
data/buftok.gemspec
CHANGED
@@ -1,19 +1,28 @@
+# frozen_string_literal: true
+
 Gem::Specification.new do |spec|
-  spec.version
+  spec.version = "1.0.1"
 
-  spec.authors
-  spec.summary
-  spec.description
-  spec.email
-  spec.files
-  spec.homepage
-  spec.licenses
-  spec.name
+  spec.authors = ["Tony Arcieri", "Martin Emde", "Erik Berlin"]
+  spec.summary = "BufferedTokenizer extracts token delimited entities from a sequence of string inputs"
+  spec.description = spec.summary
+  spec.email = ["sferik@gmail.com", "martin.emde@gmail.com"]
+  spec.files = %w[CHANGELOG.md CONTRIBUTING.md LICENSE.txt README.md buftok.gemspec] + Dir["lib/**/*.rb"]
+  spec.homepage = "https://github.com/sferik/buftok"
+  spec.licenses = ["MIT"]
+  spec.name = "buftok"
   spec.require_paths = ["lib"]
-  spec.
+  spec.required_ruby_version = ">= 3.2"
+  spec.required_rubygems_version = ">= 3.0"
 
-  spec.
-
-
-
+  spec.metadata = {
+    "allowed_push_host" => "https://rubygems.org",
+    "bug_tracker_uri" => "#{spec.homepage}/issues",
+    "changelog_uri" => "#{spec.homepage}/blob/master/CHANGELOG.md",
+    "documentation_uri" => "https://rubydoc.info/gems/buftok/",
+    "funding_uri" => "https://github.com/sponsors/sferik/",
+    "homepage_uri" => spec.homepage,
+    "rubygems_mfa_required" => "true",
+    "source_code_uri" => spec.homepage
+  }
 end
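The gemspec now declares `required_ruby_version = ">= 3.2"`. As a rough illustration of how such a constraint is evaluated, the sketch below checks a few version strings with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The list of versions is arbitrary and chosen only for the example.

```ruby
# RubyGems is normally loaded already; the require is kept for clarity.
require "rubygems"

ruby_requirement = Gem::Requirement.new(">= 3.2")

# Arbitrary sample versions, used only to show which side of the constraint they fall on.
%w[3.0.7 3.1.6 3.2.0 3.4.1 4.0.0].each do |version|
  allowed = ruby_requirement.satisfied_by?(Gem::Version.new(version))
  puts "Ruby #{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

Versions below 3.2 fall outside the constraint, which matches the changelog note about dropping the end-of-life 2.6, 2.7, and 3.0 series.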
data/lib/buftok.rb
CHANGED
@@ -1,72 +1,169 @@
 # frozen_string_literal: true
+
+# Statefully split input data by a specifiable token
 #
 # BufferedTokenizer takes a delimiter upon instantiation, or acts line-based
-# by default.
+# by default. It allows input to be spoon-fed from some outside source which
 # receives arbitrary length datagrams which may-or-may-not contain the token
-# by which entities are delimited.
-#
+# by which entities are delimited.
+#
+# @example
+#   tokenizer = BufferedTokenizer.new("\n")
+#   tokenizer.extract("foo\nbar") #=> ["foo"]
+#   tokenizer.extract("baz\n") #=> ["barbaz"]
+#   tokenizer.flush #=> ""
 class BufferedTokenizer
-  #
-
+  # Limit passed to String#split to preserve trailing empty fields
+  SPLIT_LIMIT = -1
+
+  # Return the delimiter overlap length
+  #
+  # The number of characters at the end of a chunk that may contain a
+  # partial delimiter, equal to delimiter.length - 1.
+  #
+  # @example
+  #   BufferedTokenizer.new("<>").overlap #=> 1
+  #
+  # @return [Integer] delimiter.length - 1
   #
-  #
+  # @api public
+  attr_reader :overlap
+
+  # Create a new BufferedTokenizer
+  #
+  # Operates on lines delimited by a delimiter, which is by default "\n".
+  #
+  # The input buffer is stored as an array. This is by far the most efficient
   # approach given language constraints (in C a linked list would be a more
-  # appropriate data structure).
+  # appropriate data structure). Segments of input data are stored in a list
   # which is only joined when a token is reached, substantially reducing the
   # number of objects required for the operation.
-
+  #
+  # @example
+  #   tokenizer = BufferedTokenizer.new("<>")
+  #
+  # @param delimiter [String] the token delimiter (default: "\n")
+  #
+  # @return [BufferedTokenizer]
+  #
+  # @api public
+  def initialize(delimiter = "\n")
     @delimiter = delimiter
     @input = []
-    @tail =
-    @
+    @tail = +""
+    @overlap = @delimiter.length - 1
   end
 
-  #
+  # Return the byte size of the internal buffer
   #
   # Size is not cached and is determined every time this method is called
   # in order to optimize throughput for extract.
+  #
+  # @example
+  #   tokenizer = BufferedTokenizer.new
+  #   tokenizer.extract("foo")
+  #   tokenizer.size #=> 3
+  #
+  # @return [Integer]
+  #
+  # @api public
   def size
-    @tail.length + @input.
+    @tail.length + @input.sum(&:length)
   end
 
+  # Extract tokenized entities from the input data
+  #
   # Extract takes an arbitrary string of input data and returns an array of
-  # tokenized entities, provided there were any available to extract.
+  # tokenized entities, provided there were any available to extract. This
   # makes for easy processing of datagrams using a pattern like:
   #
-  #   tokenizer.extract(data).map { |entity| Decode(entity) }.each
+  #   tokenizer.extract(data).map { |entity| Decode(entity) }.each { ... }
   #
-  # Using -1 makes split
+  # Using -1 makes split return "" if the token is at the end of
   # the string, meaning the last element is the start of the next chunk.
+  #
+  # @example
+  #   tokenizer = BufferedTokenizer.new
+  #   tokenizer.extract("foo\nbar") #=> ["foo"]
+  #
+  # @param data [String] a chunk of input data
+  #
+  # @return [Array<String>] complete tokens extracted from the input
+  #
+  # @api public
   def extract(data)
-
-      tail_end = @tail.slice!(-@trim, @trim) # returns nil if string is too short
-      data = tail_end + data if tail_end
-    end
+    data = rejoin_split_delimiter(data)
 
     @input << @tail
-    entities = data.split(@delimiter,
-    @tail = entities.shift
+    entities = data.split(@delimiter, SPLIT_LIMIT)
+    @tail = entities.shift # : String
 
-
-      @input << @tail
-      entities.unshift @input.join
-      @input.clear
-      @tail = entities.pop
-    end
+    consolidate_input(entities) if entities.length.positive?
 
     entities
   end
 
-  # Flush the contents of the input buffer
-  #
+  # Flush the contents of the input buffer
+  #
+  # Return the contents of the input buffer even though a token has not
+  # yet been encountered, then reset the buffer.
+  #
+  # @example
+  #   tokenizer = BufferedTokenizer.new
+  #   tokenizer.extract("foo\nbar")
+  #   tokenizer.flush #=> "bar"
+  #
+  # @return [String] the buffered input
+  #
+  # @api public
   def flush
     @input << @tail
     buffer = @input.join
     @input.clear
-    @tail =
+    @tail = +""
     buffer
   end
+
+  private
+
+  # Rejoin a delimiter that was split across two chunks
+  #
+  # When the delimiter is longer than one character, it may be split across
+  # two successive chunks. Transfer the trailing overlap from @tail back onto
+  # the front of the incoming data so that split can find the full delimiter.
+  #
+  # @param data [String] incoming data
+  #
+  # @return [String] data with any split delimiter prefix restored
+  #
+  # @api private
+  def rejoin_split_delimiter(data)
+    if @overlap.positive?
+      tail_end = @tail[-@overlap..]
+      @tail.slice!(-@overlap, @overlap)
+      tail_end ? tail_end + data : data
+    else
+      data
+    end
+  end
+
+  # Consolidate the input buffer into the first entity
+  #
+  # Once at least one delimiter has been found, join the accumulated input
+  # buffer with the first entity and move the trailing partial into @tail.
+  #
+  # @param entities [Array<String>] split entities
+  #
+  # @return [void]
+  #
+  # @api private
+  def consolidate_input(entities)
+    @input << @tail
+    entities.unshift @input.join
+    @input.clear
+    @tail = entities.pop # : String
+  end
 end
 
-#
+# Alias for {BufferedTokenizer}, matching the gem name
 Buftok = BufferedTokenizer
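The `SPLIT_LIMIT = -1` constant relies on a detail of `String#split`: with the default limit, trailing empty fields are dropped, while a negative limit keeps them. The sketch below shows why that matters for a tokenizer. `split_chunk` is a hypothetical helper written for this illustration and is not part of the gem.

```ruby
# Hypothetical helper: split a chunk into its complete tokens and the
# unfinished remainder that belongs to the next chunk.
def split_chunk(chunk, delimiter = "\n")
  *complete, remainder = chunk.split(delimiter, -1)
  [complete, remainder]
end

p "foo\nbar\n".split("\n")      # default limit drops the trailing empty field
p "foo\nbar\n".split("\n", -1)  # a limit of -1 keeps it

p split_chunk("foo\nbar\n")     # chunk ends on a delimiter: remainder is ""
p split_chunk("foo\nba")        # chunk ends mid-token: remainder is "ba"
```

With the limit set to -1, the last element of the split result is always the start of the next token, whether or not the chunk ended on a delimiter, so the caller can treat it uniformly as the new tail.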
metadata
CHANGED
@@ -1,73 +1,16 @@
 --- !ruby/object:Gem::Specification
 name: buftok
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 1.0.1
 platform: ruby
 authors:
 - Tony Arcieri
 - Martin Emde
-- Erik
-autorequire:
+- Erik Berlin
 bindir: bin
 cert_chain: []
-date:
-dependencies:
-- !ruby/object:Gem::Dependency
-  name: bundler
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '1.17'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '1.17'
-- !ruby/object:Gem::Dependency
-  name: rake
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '10.0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '10.0'
-- !ruby/object:Gem::Dependency
-  name: rdoc
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: test-unit
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
+date: 1980-01-02 00:00:00.000000000 Z
+dependencies: []
 description: BufferedTokenizer extracts token delimited entities from a sequence of
   string inputs
 email:
@@ -77,6 +20,7 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- CHANGELOG.md
 - CONTRIBUTING.md
 - LICENSE.txt
 - README.md
@@ -85,8 +29,15 @@ files:
 homepage: https://github.com/sferik/buftok
 licenses:
 - MIT
-metadata:
-
+metadata:
+  allowed_push_host: https://rubygems.org
+  bug_tracker_uri: https://github.com/sferik/buftok/issues
+  changelog_uri: https://github.com/sferik/buftok/blob/master/CHANGELOG.md
+  documentation_uri: https://rubydoc.info/gems/buftok/
+  funding_uri: https://github.com/sponsors/sferik/
+  homepage_uri: https://github.com/sferik/buftok
+  rubygems_mfa_required: 'true'
+  source_code_uri: https://github.com/sferik/buftok
 rdoc_options: []
 require_paths:
 - lib
@@ -94,15 +45,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '
+      version: '3.2'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: '3.0'
 requirements: []
-rubygems_version:
-signing_key:
+rubygems_version: 4.0.8
 specification_version: 4
 summary: BufferedTokenizer extracts token delimited entities from a sequence of string
   inputs