time_bucket_stream 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: fdbf0759fd07286c1bef8909f5fa4a5452e70f378f76e565e8418e85470bf761
  data.tar.gz: 31a86a4b722a165f96b4aab46fcb1a024b3ac70fd84c426a8e57fa6381eec34d
SHA512:
  metadata.gz: '08c76e6b4d7a60557bea1baad88ca956335b2f501872a0c77df5015b55a493e62ae747194f979ba40468c355b18250b6d41ba66aa805b3980996c33a01df5062'
  data.tar.gz: 165f875b2f305fbf528dc37671231b645f6c353d1c981c0226b42c1ecc3c795f28b05356fa44ed83584064729fb24fcca6bd2d4e5efc07b154ac5fcbfc0d0a15
data/.standard.yml ADDED
@@ -0,0 +1,3 @@
# For available configuration options, see:
# https://github.com/standardrb/standard
ruby_version: 3.1
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2026 Lim Yu Kwang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,333 @@
# TimeBucketStream

TimeBucketStream writes machine-readable events into time-bucketed JSONL files.

It is built for fast appends and safe batch processing:

- writers append JSONL to per-process bucket files
- old completed files can be atomically claimed for processing
- claimed files can be deleted after successful processing or retried after a crash
- bad files are automatically quarantined with metadata instead of being retried forever

The stream stays generic. It does not know about metrics, jobs, queues, or your
database.

## Installation

Until the gem is published to RubyGems, install it from GitHub:

```ruby
gem "time_bucket_stream", github: "aaron-lim/time_bucket_stream", branch: "main"
```

After a RubyGems release:

```bash
bundle add time_bucket_stream
```

## Usage

Use `TimeBucketStream.new` when you want a ready-to-process pending queue:

```ruby
stream = TimeBucketStream.new(path: "/tmp/my_app_events/sync")

stream.append(account_id: 123, status: "success")
```

The path is the stream directory. TimeBucketStream creates its own folders
inside it.

Use a different path for each independent stream you want to process
separately:

```ruby
sync_stream = TimeBucketStream.new(path: "/tmp/my_app_events/sync")
mail_stream = TimeBucketStream.new(path: "/tmp/my_app_events/mail")
```

Those streams do not share log files, processing files, locks, or quarantine
files.

Process completed old files with `drain`:

```ruby
stream.drain do |payload|
  process(payload)
end
```

On success, `drain` deletes the claimed files. If your block raises,
`drain` releases the batch and re-raises the error, so a later processor can
retry the same file.

Use `read` when you want manual batch control:

```ruby
batch = stream.read

batch.each do |payload|
  process(payload)
end

batch.delete
```

`batch` is `Enumerable`. It yields payloads:

```ruby
batch.map { |payload| payload.fetch("account_id") }
```

The file line stores only the row id inside that file:

```json
{"id":1,"payload":{"account_id":123,"status":"success"}}
```

If processing cannot finish, release the entries instead of deleting them:

```ruby
batch.release
```

That leaves the file in `processing/` so a later processor can claim it again.

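The `delete`/`release` pair supports a simple all-or-nothing guard: delete only after every payload succeeds, otherwise release so another processor can retry. A runnable sketch of that pattern, using a hypothetical `FakeBatch` stand-in (only `each`, `delete`, and `release` from the batch API below are assumed to exist; `drain_batch` is illustrative, not part of the gem):

```ruby
# A hypothetical stand-in with the same delete/release surface as a real
# batch; in your app, `batch` would come from `stream.read`.
class FakeBatch
  include Enumerable
  attr_reader :outcome

  def initialize(payloads)
    @payloads = payloads
  end

  def each(&block)
    @payloads.each(&block)
  end

  def delete
    @outcome = :deleted
  end

  def release
    @outcome = :released
  end
end

# Delete only after every payload succeeds; otherwise release so a later
# processor can claim the same file again, then re-raise.
def drain_batch(batch)
  batch.each { |payload| raise "boom" if payload["status"] == "bad" }
  batch.delete
rescue
  batch.release
  raise
end

good = FakeBatch.new([{"status" => "ok"}])
drain_batch(good)
good.outcome # => :deleted

bad = FakeBatch.new([{"status" => "bad"}])
begin
  drain_batch(bad)
rescue RuntimeError
end
bad.outcome # => :released
```

This is the same behavior `drain` gives you automatically; use the manual form when you need to decide per batch.
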
The public API stays intentionally small:

- `TimeBucketStream.new(path:, ...)`
- `append(payload)`
- `drain { |payload| ... }`
- `read`
- `close`

The batch API is also small:

- `each`
- `empty?`
- `size`
- `delete`
- `release`

## File Layout

For `path: "/tmp/my_app_events/sync"`, files live under:

```text
/tmp/my_app_events/sync/
  logs/
    202605061530-host-12345-abcd1234.jsonl
  processing/
    202605061529-host-12345-deadbeef.jsonl
  quarantine/
    202605061520-host-12345-badbad00.q-20260506153100123456-6789-cafe1234.jsonl
    202605061520-host-12345-badbad00.q-20260506153100123456-6789-cafe1234.jsonl.meta.json
  claim_locks/
    202605061529-host-12345-deadbeef.jsonl.lock
```

`logs/` contains files that writers may still own. A valid log filename is:

```text
YYYYMMDDHHMM-host-pid-randomhex.jsonl
```

`processing/` contains files that a processor has claimed. The filename stays
the same; only the directory changes.

`quarantine/` contains files that should not be retried automatically. Each
quarantined file gets a `.meta.json` sidecar with the reason, original path,
quarantine time, process id, and diagnostic metadata. Old quarantine files are
cleaned up automatically according to `quarantine_retention`.

## Why Claiming Is Safe

Each writer process creates one file per UTC minute.

The claimer only touches files whose filename bucket is old enough according to
`claim_grace`. With the default grace of 10 seconds, a `10:15` file becomes
claimable at `10:16:10`, not at `10:16:00`.

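The grace arithmetic can be checked from a filename alone. A sketch, assuming the `YYYYMMDDHHMM-host-pid-randomhex.jsonl` layout shown above (`earliest_claim_at` is an illustrative helper, not part of the gem):

```ruby
require "time"

# Matches YYYYMMDDHHMM-host-pid-randomhex.jsonl and captures the bucket.
BUCKET_FILE = /\A(\d{12})-[^-]+-\d+-\h+\.jsonl\z/

# A bucket covers one UTC minute; it becomes claimable once that minute
# has ended AND `claim_grace` extra seconds have passed.
def earliest_claim_at(filename, claim_grace: 10)
  bucket = filename[BUCKET_FILE, 1] or raise ArgumentError, "not a bucket file"
  minute_start = Time.utc(
    bucket[0, 4].to_i, bucket[4, 2].to_i, bucket[6, 2].to_i,
    bucket[8, 2].to_i, bucket[10, 2].to_i
  )
  minute_start + 60 + claim_grace
end

t = earliest_claim_at("202605061015-host-12345-abcd1234.jsonl")
t.utc.strftime("%H:%M:%S") # => "10:16:10"
```
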
152
+ Before claiming a file, it checks that:
153
+
154
+ - the filename is valid
155
+ - the file ends with a newline
156
+ - the per-file claim lock can be acquired
157
+
158
+ Then it moves the file with `File.rename`:
159
+
160
+ ```text
161
+ logs/<bucket>-<host>-<pid>-<random>.jsonl
162
+ -> processing/<bucket>-<host>-<pid>-<random>.jsonl
163
+ ```
164
+
165
+ That rename is atomic when both directories are on the same filesystem. If two
166
+ processors race, only one gets the claim lock and only one can move the file.
167
+
168
+ If a processor crashes after claiming, the operating system releases the claim
169
+ lock. A later processor can reclaim the file from `processing/`.
170
+
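The claim step is ordinary POSIX machinery, so its shape can be demonstrated with nothing but the stdlib. A runnable sketch of the lock-then-rename idea under the directory layout above (this is not the gem's internal code; `try_claim` is illustrative, and a real processor would hold the lock while it works instead of dropping it right after the rename):

```ruby
require "fileutils"
require "tmpdir"

# Move logs/<name> to processing/<name>, guarded by a per-file flock.
# Returns the new path, or nil if another process already holds the lock.
def try_claim(root, name)
  lock_path = File.join(root, "claim_locks", "#{name}.lock")
  lock = File.open(lock_path, File::RDWR | File::CREAT)
  return nil unless lock.flock(File::LOCK_EX | File::LOCK_NB)

  src = File.join(root, "logs", name)
  dst = File.join(root, "processing", name)
  File.rename(src, dst) # atomic when both dirs share a filesystem
  dst
ensure
  # Demo only: release immediately. flock is also released by the OS
  # if the process crashes, which is what makes reclaiming safe.
  lock&.flock(File::LOCK_UN)
  lock&.close
end

Dir.mktmpdir do |root|
  %w[logs processing claim_locks].each { |d| FileUtils.mkdir_p(File.join(root, d)) }
  name = "202605061529-host-12345-deadbeef.jsonl"
  File.write(File.join(root, "logs", name), "{\"id\":1,\"payload\":{}}\n")

  try_claim(root, name)
  puts File.exist?(File.join(root, "processing", name)) # prints "true"
end
```
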
## Quarantine

Quarantine is automatic when using streams created by `TimeBucketStream.new`.

`read` moves these files to `quarantine/` and does not return entries from
them:

- completed empty files
- completed files with malformed JSONL or invalid entry shape
- partial trailing-line files older than `stale_partial_after`

`stale_partial_after` defaults to 600 seconds.

Recent partial files stay in `logs/` because a writer may still be alive. Locked
processing files stay untouched because another processor may still be active.

By default, one malformed entry quarantines the whole claimed file. If your app
prefers to skip bad entries and keep processing the good ones, use:

```ruby
stream = TimeBucketStream.new(
  path: "/tmp/my_app_events/sync",
  malformed_entry: :skip
)
```

With `malformed_entry: :skip`, malformed JSONL entries and entries with an
invalid shape are skipped. If every entry in a claimed file is malformed, the
file is deleted during `read` so it will not be retried forever.

Quarantine retention defaults to 7 days. Retention cleanup runs during `read`.
It deletes expired quarantined JSONL files with their metadata sidecars, and it
also removes expired orphan metadata files. Pass `quarantine_retention: nil` if
you want to keep quarantine files until your own cleanup removes them.

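Because the sidecars are plain JSON, a quick quarantine report needs only the stdlib. A sketch, assuming sidecar keys such as `quarantine_name` and `reason` (the hand-built sample below is illustrative; the gem writes richer metadata):

```ruby
require "fileutils"
require "json"
require "tmpdir"

# Summarize why files were quarantined by reading the .meta.json sidecars.
def quarantine_report(quarantine_dir)
  Dir.glob(File.join(quarantine_dir, "*.meta.json")).sort.map do |meta_path|
    meta = JSON.parse(File.read(meta_path))
    { "file" => meta["quarantine_name"], "reason" => meta["reason"] }
  end
end

Dir.mktmpdir do |root|
  q = File.join(root, "quarantine")
  FileUtils.mkdir_p(q)
  # A hand-built sidecar for illustration only.
  sample = { "quarantine_name" => "bad.jsonl", "reason" => "malformed_jsonl" }
  File.write(File.join(q, "bad.jsonl.meta.json"), JSON.generate(sample))

  p quarantine_report(q) # prints [{"file"=>"bad.jsonl", "reason"=>"malformed_jsonl"}]
end
```
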
For multi-host processing on a shared directory, the filesystem must provide
correct `rename(2)` and `flock(2)` behavior. For the highest-confidence setup,
write streams to host-local storage and process each host's stream locally, or
use a shared filesystem whose locking semantics you have tested.

## Options

The constructor accepts these options:

| Option | Default | Effect |
| --- | --- | --- |
| `path:` | required | Stream directory. TimeBucketStream writes `logs/`, `processing/`, `quarantine/`, and `claim_locks/` inside this directory. |
| `sync:` | `:flush` | Controls how strongly each append is flushed to disk. See sync modes below. |
| `claim_grace:` | `10` | Seconds to wait after a UTC minute boundary before claiming that minute's completed files. A `10:15` file becomes claimable at `10:16:10` by default. |
| `stale_partial_after:` | `600` | Seconds to wait before quarantining an old file that does not end with a newline. Recent partial files are left alone because a writer may still be alive. |
| `malformed_entry:` | `:quarantine` | Controls malformed JSONL entries or invalid entry shapes. Use `:quarantine` to quarantine the whole file, or `:skip` to skip malformed entries and keep valid entries. |
| `quarantine_retention:` | `604800` | Seconds to keep quarantine files before automatic cleanup. Use `nil` to disable quarantine cleanup. |
| `codec:` | `TimeBucketStream::Codecs::Json.new` | Object used to dump and load one JSONL entry. Use this when your app wants a faster JSON library such as Oj. |

Example with the common options:

```ruby
stream = TimeBucketStream.new(
  path: "/tmp/my_app_events/sync",
  sync: :flush,
  claim_grace: 10,
  stale_partial_after: 600,
  malformed_entry: :quarantine,
  quarantine_retention: 7 * 24 * 60 * 60,
  codec: TimeBucketStream::Codecs::Json.new
)
```

### Sync Modes

```ruby
sync: :none   # fastest; OS buffers decide when bytes hit disk
sync: :flush  # default; flush after every append
sync: :fsync  # strongest; fsync after every append
```

Use `:flush` unless you have a measured reason to choose differently.

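In plain Ruby `IO` terms, the three modes correspond roughly to doing nothing, calling `IO#flush`, or calling `IO#flush` plus `IO#fsync` after each write. A sketch of that mapping (`write_with_sync` is an illustrative helper, not the gem's internals):

```ruby
require "tempfile"

# Append one line with the chosen durability level.
def write_with_sync(io, line, mode)
  io.write("#{line}\n")
  case mode
  when :none
    # bytes may sit in Ruby's buffer and the OS page cache
  when :flush
    io.flush # push Ruby's buffer to the OS
  when :fsync
    io.flush
    io.fsync # ask the OS to push the bytes to stable storage
  end
end

Tempfile.create("demo") do |io|
  write_with_sync(io, '{"id":1}', :flush)
  io.rewind
  io.read # => "{\"id\":1}\n"
end
```
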
### Codecs

The default codec uses Ruby's stdlib `JSON`.

Oj is optional. TimeBucketStream does not depend on it unless your app chooses
to install it:

```ruby
gem "oj"
```

Then configure the stream:

```ruby
stream = TimeBucketStream.new(
  path: "/tmp/my_app_events/sync",
  codec: TimeBucketStream::Codecs::Oj.new
)
```

The Oj codec uses compat mode so it behaves like the stdlib JSON gem for normal
hashes, arrays, strings, numbers, booleans, and nil.

Custom codecs can be used too. A codec only needs `dump` and `load`:

```ruby
class MyCodec
  def dump(value)
    JSON.generate(value)
  end

  def load(value)
    JSON.parse(value)
  end
end
```

`dump` must return one line. If it returns a string containing a newline,
`append` raises, because one stream entry must fit on one JSONL line.

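The newline rule is easy to sanity-check: JSON itself escapes embedded newlines, so ordinary payloads always dump to a single line. A runnable sketch using the `MyCodec` shape from above:

```ruby
require "json"

class MyCodec
  def dump(value)
    JSON.generate(value)
  end

  def load(value)
    JSON.parse(value)
  end
end

codec = MyCodec.new

# A newline inside a string value is escaped to \n in the JSON text,
# so the dumped entry is still exactly one line.
line = codec.dump({"note" => "first\nsecond"})
line.include?("\n") # => false
codec.load(line)    # round-trips to the original hash
```
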
## Stress Testing

The normal test suite is fast and deterministic. The stress test is separate:

```bash
bundle exec rake stress
```

It starts many writer processes, intentionally crashes some processor processes
after they claim files, then drains the stream again and verifies:

- every written payload was processed
- no payload was processed twice
- no unexpected payload appeared
- no `logs/*.jsonl` or `processing/*.jsonl` files were left behind

The JSON report also includes write, crash-recovery, drain, and total
throughput numbers.

Useful knobs:

```bash
WRITERS=24 EVENTS_PER_WRITER=5000 PROCESSORS=12 CRASHERS=4 bundle exec rake stress
STREAM_PATH=/tmp/tbs-stress KEEP_PATH=1 bundle exec rake stress
SYNC=fsync bundle exec rake stress
CODEC=oj bundle exec rake stress
PAYLOAD_BYTES=4096 bundle exec rake stress
```

Use `STREAM_PATH` when you want to test a specific filesystem or mounted disk.
Use `PAYLOAD_BYTES` when you want to compare larger event rows.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/aaron-lim/time_bucket_stream.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,20 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "minitest/test_task"

Minitest::TestTask.create

require "standard/rake"

namespace :stress do
  desc "Run the file-stream crash-recovery stress test"
  task :file_stream do
    ruby "test/stress/file_stream_stress.rb"
  end
end

desc "Run all stress tests"
task stress: "stress:file_stream"

task default: %i[test standard]
@@ -0,0 +1,56 @@
# frozen_string_literal: true

class TimeBucketStream
  class Batch
    include Enumerable

    attr_reader :entries

    def initialize(entries:, on_delete:, on_release:)
      @entries = entries.map { |id, payload| [id, payload].freeze }.freeze
      @on_delete = on_delete
      @on_release = on_release
      @finished = false
    end

    def each
      return enum_for(:each) unless block_given?

      entries.each { |_id, payload| yield payload }
    end

    def size
      entries.size
    end
    alias_method :length, :size

    def empty?
      entries.empty?
    end

    def delete
      finish_with(@on_delete)
    end

    def release
      finish_with(@on_release)
    end

    def finished?
      @finished
    end

    private

    def ids
      entries.map(&:first)
    end

    # Runs the callback at most once; later delete/release calls are no-ops.
    def finish_with(callback)
      return false if finished?

      callback.call(ids)
      @finished = true
    end
  end
end
@@ -0,0 +1,119 @@
# frozen_string_literal: true

require "fileutils"
require "json"
require "securerandom"
require "time"

class TimeBucketStream
  class Claim
    attr_reader :name, :path

    def initialize(name:, path:, lock_file:)
      @name = name
      @path = path
      @lock_file = lock_file
    end

    def read_lines
      File.readlines(path, chomp: true)
    rescue SystemCallError, IOError
      []
    end

    def delete
      FileUtils.rm_f(path)
    ensure
      release
    end

    def quarantine(paths:, reason:, metadata: {}, clock: Time)
      quarantine_name = quarantine_name_for(current_time(clock))
      quarantine_path = paths.quarantine_log_for(quarantine_name)
      metadata_path = paths.quarantine_metadata_for(quarantine_name)

      FileUtils.mkdir_p(paths.quarantine)
      File.rename(path, quarantine_path)
      write_metadata(metadata_path, quarantine_metadata(
        quarantine_name: quarantine_name,
        quarantine_path: quarantine_path,
        reason: reason,
        metadata: metadata,
        clock: clock
      ))
      quarantine_path
    ensure
      release
    end

    # Unlocks and closes the per-file claim lock; safe to call more than once.
    def release
      unlock_file(@lock_file)
      close_file(@lock_file)
      @lock_file = nil
    end

    private

    def quarantine_name_for(time)
      base_name = name.delete_suffix(".jsonl")
      suffix = [
        "q",
        time.utc.strftime("%Y%m%d%H%M%S%6N"),
        Process.pid,
        SecureRandom.hex(4)
      ].join("-")

      "#{base_name}.#{suffix}.jsonl"
    end

    def quarantine_metadata(quarantine_name:, quarantine_path:, reason:, metadata:, clock:)
      {
        "reason" => normalized_reason(reason),
        "original_name" => name,
        "original_path" => path,
        "quarantine_name" => quarantine_name,
        "quarantine_path" => quarantine_path,
        "quarantined_at" => current_time(clock).utc.iso8601(6),
        "pid" => Process.pid,
        "metadata" => json_safe(metadata)
      }
    end

    def normalized_reason(reason)
      value = reason.to_s
      value.empty? ? "unspecified" : value
    end

    # Writes the sidecar atomically via a temp file plus rename.
    def write_metadata(path, payload)
      tmp_path = "#{path}.#{Process.pid}.#{SecureRandom.hex(4)}.tmp"

      File.write(tmp_path, JSON.pretty_generate(payload))
      File.rename(tmp_path, path)
    rescue JSON::GeneratorError, SystemCallError, IOError
      FileUtils.rm_f(tmp_path)
      nil
    end

    def json_safe(value)
      JSON.parse(JSON.generate(value))
    rescue JSON::GeneratorError, JSON::ParserError, TypeError
      value.inspect
    end

    def current_time(clock)
      clock.respond_to?(:call) ? clock.call : clock.now
    end

    def unlock_file(file)
      file&.flock(File::LOCK_UN)
    rescue SystemCallError, IOError
      nil
    end

    def close_file(file)
      file&.close
    rescue SystemCallError, IOError
      nil
    end
  end
end