fibrio 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 6495443322eddb9e8b298408f94ad285687ba9ce368e1a5e1f82e841c4dbc4d0
4
+ data.tar.gz: 2e7520560cc402249a4f70d278618c9189d0d6beaaedb16b289029f0e17626f0
5
+ SHA512:
6
+ metadata.gz: 1614a5878aff9fea07ec3750dc236cf48068ba5a31b9942c262c49ddd393aeca7b336eb6450c02014d9f052463f519bd995508b0a991969de6b54f2d20cbe5fa
7
+ data.tar.gz: 99ad7f64627bc6213c8990b794593ddeacb25de895e9fc9078bca65da98c103f17cc1e0551f12187907084479e8f078666d0deeb04ee8e35b1918138c2d5a1da
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
1
+ # Changelog
2
+
3
+ ## 0.1.0 - 2026-05-22
4
+
5
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Yudai Takada
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,100 @@
1
+ # Fibrio
2
+
3
+ `fibrio` is a small Ruby gem for reading large JSON array, NDJSON, and CSV inputs one record at a time. It keeps Fiber usage inside `Fibrio::Stream`, so callers can use the normal `Enumerable` API.
4
+
5
+ ## Installation
6
+
7
+ ```ruby
8
+ gem "fibrio"
9
+ ```
10
+
11
+ Then require it:
12
+
13
+ ```ruby
14
+ require "fibrio"
15
+ ```
16
+
17
+ ## Usage
18
+
19
+ ```ruby
20
+ Fibrio.open("data.json", format: :json) do |stream|
21
+ stream.each do |record|
22
+ process(record)
23
+ end
24
+ end
25
+ ```
26
+
27
+ `each` returns an `Enumerator` when no block is given, so lazy chains work as expected:
28
+
29
+ ```ruby
30
+ stream = Fibrio.open("data.json", format: :json)
31
+ top10 = stream.each.lazy.select { |record| record["active"] }.first(10)
32
+ stream.close
33
+ ```
34
+
35
+ CSV with no header row returns arrays:
36
+
37
+ ```ruby
38
+ Fibrio.open("data.csv", format: :csv, headers: false) do |stream|
39
+ stream.each { |row| p row }
40
+ end
41
+ ```
42
+
43
+ A string that does not name an existing file is treated as the data itself:
44
+
45
+ ```ruby
46
+ Fibrio.open("[1,2,3]", format: :json) do |stream|
47
+ stream.each { |number| p number }
48
+ end
49
+ ```
50
+
51
+ Top-level JSON objects can stream an array nested at a known path:
52
+
53
+ ```ruby
54
+ Fibrio.open('{"payload":{"records":[{"id":1},{"id":2}]}}', format: :json, path: %w[payload records]) do |stream|
55
+ stream.each { |record| p record["id"] }
56
+ end
57
+ ```
58
+
59
+ NDJSON uses one JSON value per non-empty line:
60
+
61
+ ```ruby
62
+ Fibrio.open(%({"id":1}\n{"id":2}\n), format: :ndjson) do |stream|
63
+ stream.each { |record| p record["id"] }
64
+ end
65
+ ```
66
+
67
+ ## Supported Formats
68
+
69
+ - JSON: top-level arrays, or object-contained arrays selected with `path:`.
70
+ - NDJSON: blank lines are skipped. Each non-empty line is parsed with Ruby's standard `json` library.
71
+ - CSV: with `headers: true` (the default), the first row supplies the keys and each later row is yielded as a hash. `headers: false` yields arrays. Quoted fields may contain newlines.
72
+
73
+ ## Memory Benchmark
74
+
75
+ From a source checkout, run the benchmark with:
76
+
77
+ ```sh
78
+ ruby benchmark/memory.rb 250000
79
+ ```
80
+
81
+ The benchmark generates temporary files, reads them in a child process, and polls peak RSS from the parent process.
82
+ Rows labeled Fibrio iterate through records without retaining them; the other rows read eagerly and keep the whole parsed collection in memory.
83
+ Peak RSS includes the Ruby VM baseline, so absolute numbers vary by Ruby version and platform.
84
+
85
+ Example result on Ruby 4.0.0 arm64-darwin24 with 250,000 records:
86
+
87
+ | Format | Reader | Input MiB | Records | Seconds | Peak RSS MiB |
88
+ | --- | --- | ---: | ---: | ---: | ---: |
89
+ | JSON | Fibrio | 20.07 | 250,000 | 14.710 | 39.4 |
90
+ | JSON | JSON.parse(File.read) | 20.07 | 250,000 | 0.069 | 105.4 |
91
+ | NDJSON | Fibrio | 20.07 | 250,000 | 0.220 | 25.6 |
92
+ | NDJSON | File.readlines + JSON.parse | 20.07 | 250,000 | 0.182 | 127.6 |
93
+ | CSV | Fibrio | 9.10 | 250,000 | 2.640 | 33.3 |
94
+ | CSV | CSV.read(headers: true) | 9.10 | 250,000 | 0.826 | 192.8 |
95
+
96
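+ The tradeoff is intentional: Fibrio prioritizes bounded memory use for large inputs over loading everything as fast as possible. In the run above it peaked at roughly 17% to 37% of the eager readers' RSS but took longer for every format, most noticeably for JSON (14.710 s versus 0.069 s).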
97
+
98
+ ## Known Limitations
99
+
100
+ - Each individual record must fit in memory.
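For comparison with the README's `path:` example, here is a minimal sketch of the eager equivalent using only Ruby's standard `json` library. It is not Fibrio code: it parses the whole document first and then walks to the nested array with `Hash#dig`, which is the full in-memory load that the streaming reader is meant to avoid. The variable names are illustrative.

```ruby
require "json"

# Eager counterpart of `path: %w[payload records]`:
# parse the entire document, then dig down to the nested array.
document = '{"payload":{"records":[{"id":1},{"id":2}]}}'
records = JSON.parse(document).dig("payload", "records")

# Each element is an ordinary Hash, like the records the README example prints.
ids = records.map { |record| record["id"] }
p ids
```

The eager form is simpler, but its peak memory grows with the size of the document rather than with the size of one record.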
data/lib/fibrio/errors.rb ADDED
@@ -0,0 +1,17 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ class Error < StandardError; end
5
+ class UnknownFormatError < Error; end
6
+
7
+ class ParseError < Error
8
+ attr_reader :position
9
+
10
+ # @param message [String]
11
+ # @param position [Integer, nil] byte offset in the input
12
+ def initialize(message, position: nil)
13
+ @position = position
14
+ super(position ? "#{message} (at byte #{position})" : message)
15
+ end
16
+ end
17
+ end
data/lib/fibrio/parsers/base.rb ADDED
@@ -0,0 +1,36 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ module Parsers
5
+ class Base
6
+ WHITESPACE = [' ', "\t", "\n", "\r"].freeze
7
+
8
+ # @param source [Source]
9
+ # @param options [Hash]
10
+ def initialize(source, **options)
11
+ @source = source
12
+ @options = options
13
+ end
14
+
15
+ # @yield [record]
16
+ def run
17
+ raise NotImplementedError
18
+ end
19
+
20
+ # @return [void]
21
+ def close
22
+ @source.close
23
+ end
24
+
25
+ private
26
+
27
+ def skip_whitespace
28
+ @source.next_char while WHITESPACE.include?(@source.peek_char)
29
+ end
30
+
31
+ def parse_error(message)
32
+ raise ParseError.new(message, position: @source.position)
33
+ end
34
+ end
35
+ end
36
+ end
data/lib/fibrio/parsers/csv.rb ADDED
@@ -0,0 +1,151 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ module Parsers
5
+ class CSV < Base
6
+ # @yield [record]
7
+ def run(&emit)
8
+ headers = nil
9
+ headers_enabled = @options.fetch(:headers, true)
10
+ line_number = 0
11
+
12
+ while (record = next_csv_record(line_number))
13
+ line, start_line_number, line_number = record
14
+ fields = split_csv_line(line, line_number: start_line_number)
15
+
16
+ if headers_enabled && headers.nil?
17
+ headers = fields
18
+ next
19
+ end
20
+
21
+ validate_field_count(headers, fields, line_number) if headers
22
+ emit.call(headers ? headers.zip(fields).to_h : fields)
23
+ end
24
+ end
25
+
26
+ private
27
+
28
+ def next_csv_record(previous_line_number)
29
+ line = @source.next_line
30
+ return unless line
31
+
32
+ start_line_number = previous_line_number + 1
33
+ current_line_number = start_line_number
34
+ record = +line
35
+
36
+ until complete_csv_record?(record)
37
+ line = @source.next_line
38
+ raise_unclosed_quote(start_line_number) unless line
39
+
40
+ current_line_number += 1
41
+ record << "\n" << line
42
+ end
43
+
44
+ [record, start_line_number, current_line_number]
45
+ end
46
+
47
+ def complete_csv_record?(record)
48
+ in_quotes = false
49
+ at_field_start = true
50
+ index = 0
51
+
52
+ while index < record.length
53
+ char = record[index]
54
+
55
+ if in_quotes
56
+ in_quotes, index = advance_quoted_record_state(record, index)
57
+ else
58
+ in_quotes, at_field_start = advance_unquoted_record_state(char, at_field_start)
59
+ index += 1
60
+ end
61
+ end
62
+
63
+ !in_quotes
64
+ end
65
+
66
+ def advance_quoted_record_state(record, index)
67
+ return [true, index + 1] unless record[index] == '"'
68
+ return [true, index + 2] if record[index + 1] == '"'
69
+
70
+ [false, index + 1]
71
+ end
72
+
73
+ def advance_unquoted_record_state(char, at_field_start)
74
+ case char
75
+ when ','
76
+ [false, true]
77
+ when '"'
78
+ [at_field_start, false]
79
+ else
80
+ [false, false]
81
+ end
82
+ end
83
+
84
+ def split_csv_line(line, line_number:)
85
+ fields = []
86
+ field = +''
87
+ in_quotes = false
88
+ index = 0
89
+
90
+ while index < line.length
91
+ line[index]
92
+
93
+ if in_quotes
94
+ index = consume_quoted_char(line, index, field)
95
+ in_quotes = false if index.is_a?(Array)
96
+ index = index.first if index.is_a?(Array)
97
+ else
98
+ in_quotes, index, field = consume_unquoted_char(line, index, field, fields)
99
+ end
100
+ end
101
+
102
+ raise_unclosed_quote(line_number) if in_quotes
103
+
104
+ fields << field
105
+ end
106
+
107
+ def consume_quoted_char(line, index, field)
108
+ char = line[index]
109
+
110
+ if char == '"'
111
+ return [index + 1] unless line[index + 1] == '"'
112
+
113
+ field << '"'
114
+ return index + 2
115
+ end
116
+
117
+ field << char
118
+ index + 1
119
+ end
120
+
121
+ def consume_unquoted_char(line, index, field, fields)
122
+ char = line[index]
123
+
124
+ case char
125
+ when ','
126
+ fields << field
127
+ [false, index + 1, +'']
128
+ when '"'
129
+ parse_error('unexpected quote in CSV line') unless field.empty?
130
+
131
+ [true, index + 1, field]
132
+ else
133
+ field << char
134
+ [false, index + 1, field]
135
+ end
136
+ end
137
+
138
+ def validate_field_count(headers, fields, line_number)
139
+ return if headers.length == fields.length
140
+
141
+ raise ParseError,
142
+ "CSV field count mismatch on line #{line_number}: expected #{headers.length}, got #{fields.length}"
143
+ end
144
+
145
+ def raise_unclosed_quote(line_number)
146
+ raise ParseError,
147
+ "unterminated quoted CSV field starting on line #{line_number}"
148
+ end
149
+ end
150
+ end
151
+ end
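The CSV parser above keeps appending physical lines to a record until every opened quote has been closed, which is how quoted newlines are supported. The following standalone sketch illustrates that idea in simplified form. `balanced_quotes?` and `join_records` are hypothetical helpers, not part of the gem, and unlike `complete_csv_record?` this version toggles on every quote character instead of only opening a quoted field at the start of a field.

```ruby
# Hypothetical helper: true when every opened quote in `text` has been closed.
# A doubled quote ("") inside a quoted field counts as an escaped quote.
def balanced_quotes?(text)
  in_quotes = false
  index = 0
  while index < text.length
    if text[index] == '"'
      if in_quotes && text[index + 1] == '"'
        index += 1 # skip the second half of an escaped quote
      else
        in_quotes = !in_quotes
      end
    end
    index += 1
  end
  !in_quotes
end

# Hypothetical helper: merge physical lines into logical CSV records.
def join_records(lines)
  records = []
  pending = nil
  lines.each do |line|
    pending = pending ? "#{pending}\n#{line}" : line.dup
    next unless balanced_quotes?(pending)

    records << pending
    pending = nil
  end
  raise ArgumentError, "unterminated quoted field" if pending

  records
end

lines = ['id,note', '1,"first line', 'second line"', '2,"say ""hi"""']
records = join_records(lines)
records.each { |record| p record }
```

Four physical lines become three logical records here, because the second and third lines belong to one quoted field.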
data/lib/fibrio/parsers/json.rb ADDED
@@ -0,0 +1,445 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ module Parsers
5
+ class JSON < Base
6
+ HEX_DIGITS = /\A[0-9a-fA-F]\z/
7
+ NUMBER_START = /\A[-0-9]\z/
8
+
9
+ # @yield [record]
10
+ def run(&)
11
+ path = normalized_path
12
+ return run_path(path, &) if path
13
+
14
+ parse_root_array(&)
15
+ ensure_end_of_input
16
+ end
17
+
18
+ private
19
+
20
+ def parse_root_array(&)
21
+ skip_whitespace
22
+ first_char = @source.next_char
23
+ unless first_char == '['
24
+ parse_error('JSON input must be a top-level array, or pass path: for an object-contained array')
25
+ end
26
+
27
+ skip_whitespace
28
+ if @source.peek_char == ']'
29
+ @source.next_char
30
+ return
31
+ end
32
+
33
+ parse_top_level_records(&)
34
+ end
35
+
36
+ def run_path(path, &)
37
+ if path.empty?
38
+ parse_root_array(&)
39
+ ensure_end_of_input
40
+ return
41
+ end
42
+
43
+ skip_whitespace
44
+ found = parse_object_path(path, &)
45
+ parse_error("JSON path #{path.inspect} was not found") unless found
46
+
47
+ ensure_end_of_input
48
+ end
49
+
50
+ def parse_top_level_records(&emit)
51
+ loop do
52
+ emit.call(parse_value)
53
+ skip_whitespace
54
+
55
+ case @source.next_char
56
+ when ','
57
+ skip_whitespace
58
+ parse_error("expected JSON value after ','") if @source.peek_char == ']'
59
+ when ']'
60
+ break
61
+ else
62
+ parse_error("expected ',' or ']'")
63
+ end
64
+ end
65
+ end
66
+
67
+ def parse_object_path(path, &emit)
68
+ expect_char('{')
69
+ found = false
70
+ skip_whitespace
71
+ return false if consume_if('}')
72
+
73
+ loop do
74
+ key = parse_object_key
75
+ skip_whitespace
76
+ expect_char(':')
77
+ found = parse_matching_path_value(path, key, found, &emit)
78
+ skip_whitespace
79
+
80
+ case @source.next_char
81
+ when ','
82
+ skip_whitespace
83
+ parse_error("expected object key after ','") if @source.peek_char == '}'
84
+ when '}'
85
+ return found
86
+ else
87
+ parse_error("expected ',' or '}'")
88
+ end
89
+ end
90
+ end
91
+
92
+ def parse_matching_path_value(path, key, found, &)
93
+ if !found && key == path.first
94
+ parse_path_value(path.drop(1), &)
95
+ else
96
+ skip_value
97
+ found
98
+ end
99
+ end
100
+
101
+ def parse_path_value(remaining_path, &)
102
+ if remaining_path.empty?
103
+ parse_streaming_array(&)
104
+ return true
105
+ end
106
+
107
+ parse_error('expected object while following JSON path') unless @source.peek_char == '{'
108
+
109
+ parse_object_path(remaining_path, &)
110
+ end
111
+
112
+ def parse_streaming_array(&)
113
+ expect_char('[')
114
+ skip_whitespace
115
+ return if consume_if(']')
116
+
117
+ parse_top_level_records(&)
118
+ end
119
+
120
+ def parse_value
121
+ skip_whitespace
122
+
123
+ case (char = @source.peek_char)
124
+ when '{'
125
+ parse_object
126
+ when '['
127
+ parse_inner_array
128
+ when '"'
129
+ parse_string
130
+ when 't', 'f'
131
+ parse_boolean
132
+ when 'n'
133
+ parse_null
134
+ when NUMBER_START
135
+ parse_number
136
+ when nil
137
+ parse_error('unexpected end of input')
138
+ else
139
+ parse_error("unexpected character #{char.inspect}")
140
+ end
141
+ end
142
+
143
+ def parse_object
144
+ expect_char('{')
145
+ object = {}
146
+ skip_whitespace
147
+ return object if consume_if('}')
148
+
149
+ loop do
150
+ key = parse_object_key
151
+ skip_whitespace
152
+ expect_char(':')
153
+ object[key] = parse_value
154
+ skip_whitespace
155
+
156
+ case @source.next_char
157
+ when ','
158
+ skip_whitespace
159
+ parse_error("expected object key after ','") if @source.peek_char == '}'
160
+ when '}'
161
+ return object
162
+ else
163
+ parse_error("expected ',' or '}'")
164
+ end
165
+ end
166
+ end
167
+
168
+ def parse_object_key
169
+ parse_error('expected object key string') unless @source.peek_char == '"'
170
+
171
+ parse_string
172
+ end
173
+
174
+ def parse_inner_array
175
+ expect_char('[')
176
+ values = []
177
+ skip_whitespace
178
+ return values if consume_if(']')
179
+
180
+ loop do
181
+ values << parse_value
182
+ skip_whitespace
183
+
184
+ case @source.next_char
185
+ when ','
186
+ skip_whitespace
187
+ parse_error("expected array value after ','") if @source.peek_char == ']'
188
+ when ']'
189
+ return values
190
+ else
191
+ parse_error("expected ',' or ']'")
192
+ end
193
+ end
194
+ end
195
+
196
+ def parse_string
197
+ expect_char('"')
198
+ result = +''
199
+
200
+ loop do
201
+ char = @source.next_char
202
+ parse_error('unterminated string') unless char
203
+
204
+ case char
205
+ when '"'
206
+ return result
207
+ when '\\'
208
+ result << parse_escape
209
+ else
210
+ parse_error('unescaped control character in string') if char.ord < 0x20
211
+
212
+ result << char
213
+ end
214
+ end
215
+ end
216
+
217
+ def parse_escape
218
+ case (char = @source.next_char)
219
+ when '"', '\\', '/'
220
+ char
221
+ when 'b'
222
+ "\b"
223
+ when 'f'
224
+ "\f"
225
+ when 'n'
226
+ "\n"
227
+ when 'r'
228
+ "\r"
229
+ when 't'
230
+ "\t"
231
+ when 'u'
232
+ parse_unicode_escape
233
+ else
234
+ parse_error('invalid escape sequence')
235
+ end
236
+ end
237
+
238
+ def parse_unicode_escape
239
+ codepoint = read_hex_codepoint
240
+ return parse_surrogate_pair(codepoint) if high_surrogate?(codepoint)
241
+
242
+ parse_error('unexpected low surrogate escape') if low_surrogate?(codepoint)
243
+
244
+ codepoint.chr(Encoding::UTF_8)
245
+ end
246
+
247
+ def parse_surrogate_pair(high)
248
+ parse_error('expected low surrogate escape') unless @source.next_char == '\\' && @source.next_char == 'u'
249
+
250
+ low = read_hex_codepoint
251
+ parse_error('expected low surrogate escape') unless low_surrogate?(low)
252
+
253
+ (((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000).chr(Encoding::UTF_8)
254
+ end
255
+
256
+ def read_hex_codepoint
257
+ digits = +''
258
+ 4.times do
259
+ char = @source.next_char
260
+ parse_error('invalid unicode escape') unless char&.match?(HEX_DIGITS)
261
+
262
+ digits << char
263
+ end
264
+ digits.to_i(16)
265
+ end
266
+
267
+ def parse_number
268
+ number = +''
269
+ number << consume_if('-').to_s
270
+ parse_integer_part(number)
271
+ parse_fraction_part(number)
272
+ parse_exponent_part(number)
273
+
274
+ number.include?('.') || number.match?(/[eE]/) ? Float(number) : Integer(number, 10)
275
+ rescue ArgumentError
276
+ parse_error('invalid number')
277
+ end
278
+
279
+ def parse_integer_part(number)
280
+ char = @source.peek_char
281
+ parse_error('invalid number') unless digit?(char)
282
+
283
+ if char == '0'
284
+ number << @source.next_char
285
+ parse_error('invalid leading zero in number') if digit?(@source.peek_char)
286
+ return
287
+ end
288
+
289
+ number << @source.next_char while digit?(@source.peek_char)
290
+ end
291
+
292
+ def parse_fraction_part(number)
293
+ return unless @source.peek_char == '.'
294
+
295
+ number << @source.next_char
296
+ parse_error('invalid fractional number') unless digit?(@source.peek_char)
297
+
298
+ number << @source.next_char while digit?(@source.peek_char)
299
+ end
300
+
301
+ def parse_exponent_part(number)
302
+ return unless %w[e E].include?(@source.peek_char)
303
+
304
+ number << @source.next_char
305
+ number << @source.next_char if %w[+ -].include?(@source.peek_char)
306
+ parse_error('invalid exponent') unless digit?(@source.peek_char)
307
+
308
+ number << @source.next_char while digit?(@source.peek_char)
309
+ end
310
+
311
+ def parse_boolean
312
+ return consume_expected_literal('true', true) if @source.peek_char == 't'
313
+ return consume_expected_literal('false', false) if @source.peek_char == 'f'
314
+
315
+ parse_error('invalid boolean')
316
+ end
317
+
318
+ def parse_null
319
+ consume_expected_literal('null', nil)
320
+ end
321
+
322
+ def skip_value
323
+ skip_whitespace
324
+
325
+ case (char = @source.peek_char)
326
+ when '{'
327
+ skip_object
328
+ when '['
329
+ skip_array
330
+ when '"'
331
+ parse_string
332
+ when 't', 'f'
333
+ parse_boolean
334
+ when 'n'
335
+ parse_null
336
+ when NUMBER_START
337
+ parse_number
338
+ when nil
339
+ parse_error('unexpected end of input')
340
+ else
341
+ parse_error("unexpected character #{char.inspect}")
342
+ end
343
+
344
+ nil
345
+ end
346
+
347
+ def skip_object
348
+ expect_char('{')
349
+ skip_whitespace
350
+ return if consume_if('}')
351
+
352
+ loop do
353
+ parse_object_key
354
+ skip_whitespace
355
+ expect_char(':')
356
+ skip_value
357
+ skip_whitespace
358
+
359
+ case @source.next_char
360
+ when ','
361
+ skip_whitespace
362
+ parse_error("expected object key after ','") if @source.peek_char == '}'
363
+ when '}'
364
+ return
365
+ else
366
+ parse_error("expected ',' or '}'")
367
+ end
368
+ end
369
+ end
370
+
371
+ def skip_array
372
+ expect_char('[')
373
+ skip_whitespace
374
+ return if consume_if(']')
375
+
376
+ loop do
377
+ skip_value
378
+ skip_whitespace
379
+
380
+ case @source.next_char
381
+ when ','
382
+ skip_whitespace
383
+ parse_error("expected array value after ','") if @source.peek_char == ']'
384
+ when ']'
385
+ return
386
+ else
387
+ parse_error("expected ',' or ']'")
388
+ end
389
+ end
390
+ end
391
+
392
+ def consume_expected_literal(literal, value)
393
+ literal.each_char do |char|
394
+ parse_error('invalid literal') unless @source.next_char == char
395
+ end
396
+
397
+ value
398
+ end
399
+
400
+ def expect_char(expected)
401
+ actual = @source.next_char
402
+ parse_error("expected #{expected.inspect}") unless actual == expected
403
+
404
+ actual
405
+ end
406
+
407
+ def consume_if(expected)
408
+ return unless @source.peek_char == expected
409
+
410
+ @source.next_char
411
+ end
412
+
413
+ def ensure_end_of_input
414
+ skip_whitespace
415
+ parse_error('unexpected trailing input') unless @source.eof?
416
+ end
417
+
418
+ def normalized_path
419
+ return nil unless @options.key?(:path)
420
+
421
+ raw_path = @options[:path]
422
+ return nil if raw_path.nil?
423
+ return [raw_path.to_s] if raw_path.is_a?(String) || raw_path.is_a?(Symbol)
424
+
425
+ unless raw_path.respond_to?(:map)
426
+ raise ArgumentError, 'JSON path must be a string, symbol, or array of strings/symbols'
427
+ end
428
+
429
+ raw_path.map(&:to_s)
430
+ end
431
+
432
+ def digit?(char)
433
+ char&.between?('0', '9')
434
+ end
435
+
436
+ def high_surrogate?(codepoint)
437
+ codepoint.between?(0xD800, 0xDBFF)
438
+ end
439
+
440
+ def low_surrogate?(codepoint)
441
+ codepoint.between?(0xDC00, 0xDFFF)
442
+ end
443
+ end
444
+ end
445
+ end
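The arithmetic in `parse_surrogate_pair` above is easier to follow with one worked case. This sketch repeats the same formula in a standalone function (`combine_surrogates` is an illustrative name, not part of the gem) and applies it to the JSON escape pair `\uD83D\uDE00`.

```ruby
# Combine a UTF-16 surrogate pair, as written with JSON \uXXXX escapes,
# into a single Unicode code point. The ranges and formula match the
# high_surrogate?/low_surrogate? checks and the expression in the diff above.
def combine_surrogates(high, low)
  raise ArgumentError, "not a high surrogate" unless high.between?(0xD800, 0xDBFF)
  raise ArgumentError, "not a low surrogate" unless low.between?(0xDC00, 0xDFFF)

  ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
end

# (0xD83D - 0xD800) * 0x400 = 0xF400
# (0xDE00 - 0xDC00)         = 0x0200
# 0xF400 + 0x0200 + 0x10000 = 0x1F600
codepoint = combine_surrogates(0xD83D, 0xDE00)
emoji = codepoint.chr(Encoding::UTF_8)
puts format("U+%X", codepoint)
```

A lone low surrogate has no valid code point of its own, which is consistent with the parser rejecting it with `unexpected low surrogate escape`.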
data/lib/fibrio/parsers/ndjson.rb ADDED
@@ -0,0 +1,25 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'json'
4
+
5
+ module Fibrio
6
+ module Parsers
7
+ class NDJSON < Base
8
+ # @yield [record]
9
+ def run(&emit)
10
+ line_number = 0
11
+
12
+ while (line = @source.next_line)
13
+ line_number += 1
14
+ next if line.strip.empty?
15
+
16
+ begin
17
+ emit.call(::JSON.parse(line))
18
+ rescue ::JSON::ParserError => e
19
+ raise ParseError.new("invalid JSON on line #{line_number}: #{e.message}", position: @source.position)
20
+ end
21
+ end
22
+ end
23
+ end
24
+ end
25
+ end
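The NDJSON parser above amounts to a loop of "read a line, skip it if blank, parse it with the standard library". Here is a minimal sketch of that loop using only `json` and `stringio`. `each_ndjson_record` is a hypothetical method name, and it raises `ArgumentError` instead of the gem's `ParseError` so that the block stays self-contained.

```ruby
require "json"
require "stringio"

# Yield one parsed JSON value per non-empty line of `io`.
# Returns an Enumerator when no block is given.
def each_ndjson_record(io)
  return enum_for(:each_ndjson_record, io) unless block_given?

  io.each_line.with_index(1) do |line, line_number|
    next if line.strip.empty? # blank lines are skipped, as in the parser above

    begin
      yield JSON.parse(line)
    rescue JSON::ParserError => e
      raise ArgumentError, "invalid JSON on line #{line_number}: #{e.message}"
    end
  end
end

input = StringIO.new(%({"id":1}\n\n{"id":2}\n))
ids = each_ndjson_record(input).map { |record| record["id"] }
p ids
```

Only one line is held and parsed at a time, so memory use depends on the longest line rather than on the size of the whole input.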
data/lib/fibrio/source.rb ADDED
@@ -0,0 +1,171 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'stringio'
4
+
5
+ module Fibrio
6
+ class Source
7
+ CHUNK_SIZE = 64 * 1024
8
+
9
+ attr_reader :position, :bytes_read, :next_char_calls
10
+
11
+ # @param input [String, IO, StringIO]
12
+ # @param chunk_size [Integer]
13
+ # @return [Source]
14
+ def self.build(input, chunk_size: CHUNK_SIZE)
15
+ case input
16
+ when IO, StringIO
17
+ new(input, owns_io: true, chunk_size: chunk_size)
18
+ when String
19
+ build_from_string(input, chunk_size: chunk_size)
20
+ else
21
+ raise ArgumentError, "unsupported input: #{input.class}"
22
+ end
23
+ end
24
+
25
+ def self.build_from_string(input, chunk_size:)
26
+ if File.exist?(input)
27
+ io = File.open(input, 'rb')
28
+ new(io, owns_io: true, chunk_size: chunk_size)
29
+ else
30
+ new(StringIO.new(input.b), owns_io: true, chunk_size: chunk_size)
31
+ end
32
+ end
33
+ private_class_method :build_from_string
34
+
35
+ # @param io [IO, StringIO]
36
+ # @param owns_io [Boolean]
37
+ # @param chunk_size [Integer]
38
+ def initialize(io, owns_io:, chunk_size: CHUNK_SIZE)
39
+ @io = io
40
+ @owns_io = owns_io
41
+ @chunk_size = chunk_size
42
+ @buffer = +''.b
43
+ @pos = 0
44
+ @eof = false
45
+ @position = 0
46
+ @bytes_read = 0
47
+ @next_char_calls = 0
48
+ end
49
+
50
+ # @return [String, nil] next UTF-8 character without consuming it
51
+ def peek_char
52
+ ensure_buffer(1)
53
+ first_byte = @buffer.getbyte(@pos)
54
+ return nil unless first_byte
55
+
56
+ char_length = utf8_char_length(first_byte)
57
+ ensure_buffer(char_length)
58
+ char_bytes = @buffer.byteslice(@pos, char_length)
59
+ source_error('input ended in the middle of a UTF-8 character') if char_bytes.bytesize < char_length
60
+
61
+ decode_utf8(char_bytes)
62
+ end
63
+
64
+ # @return [String, nil] next UTF-8 character and advances the source
65
+ def next_char
66
+ char = peek_char
67
+ return nil unless char
68
+
69
+ @pos += char.bytesize
70
+ @position += char.bytesize
71
+ @next_char_calls += 1
72
+ compact_buffer
73
+ char
74
+ end
75
+
76
+ # @return [String, nil] next line without trailing newline
77
+ def next_line
78
+ loop do
79
+ newline_index = @buffer.index("\n".b, @pos)
80
+ return consume_line(newline_index) if newline_index
81
+ return consume_remaining_line if @eof
82
+
83
+ ensure_buffer(@buffer.bytesize - @pos + 1)
84
+ end
85
+ end
86
+
87
+ # @return [Boolean]
88
+ def eof?
89
+ ensure_buffer(1)
90
+ @buffer.getbyte(@pos).nil?
91
+ end
92
+
93
+ # @return [void]
94
+ def close
95
+ @io.close if @owns_io && @io.respond_to?(:closed?) && !@io.closed?
96
+ end
97
+
98
+ private
99
+
100
+ def ensure_buffer(minimum_remaining)
101
+ compact_buffer
102
+
103
+ while !@eof && remaining_bytes < minimum_remaining
104
+ chunk = @io.read(@chunk_size)
105
+ if chunk.nil? || chunk.empty?
106
+ @eof = true
107
+ else
108
+ binary_chunk = chunk.b
109
+ @bytes_read += binary_chunk.bytesize
110
+ @buffer << binary_chunk
111
+ end
112
+ end
113
+ end
114
+
115
+ def remaining_bytes
116
+ @buffer.bytesize - @pos
117
+ end
118
+
119
+ def compact_buffer
120
+ return unless @pos.positive?
121
+ return if @pos <= @chunk_size && @pos < @buffer.bytesize
122
+
123
+ @buffer = @buffer.byteslice(@pos, remaining_bytes) || +''.b
124
+ @pos = 0
125
+ end
126
+
127
+ def consume_line(newline_index)
128
+ line_bytes = @buffer.byteslice(@pos, newline_index - @pos) || +''.b
129
+ bytes_consumed = newline_index - @pos + 1
130
+ @pos = newline_index + 1
131
+ @position += bytes_consumed
132
+ compact_buffer
133
+ decode_line(line_bytes)
134
+ end
135
+
136
+ def consume_remaining_line
137
+ return nil if @pos >= @buffer.bytesize
138
+
139
+ line_bytes = @buffer.byteslice(@pos, remaining_bytes) || +''.b
140
+ @position += line_bytes.bytesize
141
+ @pos = @buffer.bytesize
142
+ compact_buffer
143
+ decode_line(line_bytes)
144
+ end
145
+
146
+ def decode_line(line_bytes)
147
+ line_bytes = line_bytes.byteslice(0, line_bytes.bytesize - 1) if line_bytes.end_with?("\r".b)
148
+ decode_utf8(line_bytes)
149
+ end
150
+
151
+ def decode_utf8(bytes)
152
+ value = bytes.dup.force_encoding(Encoding::UTF_8)
153
+ return value if value.valid_encoding?
154
+
155
+ source_error('invalid UTF-8 input')
156
+ end
157
+
158
+ def utf8_char_length(first_byte)
159
+ return 1 if first_byte < 0x80
160
+ return 2 if (first_byte & 0b1110_0000) == 0b1100_0000
161
+ return 3 if (first_byte & 0b1111_0000) == 0b1110_0000
162
+ return 4 if (first_byte & 0b1111_1000) == 0b1111_0000
163
+
164
+ source_error('invalid UTF-8 input')
165
+ end
166
+
167
+ def source_error(message)
168
+ raise ParseError.new(message, position: @position)
169
+ end
170
+ end
171
+ end
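`peek_char` above first reads the lead byte to learn how long the character is, then makes sure that many bytes are buffered before decoding. The sketch below shows why this matters when a multi-byte character straddles a chunk boundary. `utf8_sequence_length` reuses the same bit tests as `utf8_char_length`; `valid_utf8?` is a hypothetical helper.

```ruby
# Length in bytes of a UTF-8 sequence, judged from its lead byte.
def utf8_sequence_length(first_byte)
  return 1 if first_byte < 0x80
  return 2 if (first_byte & 0b1110_0000) == 0b1100_0000
  return 3 if (first_byte & 0b1111_0000) == 0b1110_0000
  return 4 if (first_byte & 0b1111_1000) == 0b1111_0000

  raise ArgumentError, format("0x%02X cannot start a UTF-8 sequence", first_byte)
end

# Hypothetical helper: does this byte string decode as valid UTF-8?
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

e_acute = "\u00E9".b # two bytes: 0xC3 0xA9
needed = utf8_sequence_length(e_acute.getbyte(0))

# Decoding only the first byte (as if the chunk ended there) is invalid;
# decoding after buffering `needed` bytes succeeds.
partial_ok = valid_utf8?(e_acute.byteslice(0, 1))
whole_ok = valid_utf8?(e_acute.byteslice(0, needed))

lengths = ["a", "\u00E9", "\u3042", "\u{1F600}"].map do |char|
  utf8_sequence_length(char.getbyte(0))
end
p lengths, partial_ok, whole_ok
```

A continuation byte such as `0x80` matches none of the four patterns, so it is rejected as the start of a character.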
data/lib/fibrio/stream.rb ADDED
@@ -0,0 +1,38 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ class Stream
5
+ include Enumerable
6
+
7
+ # @param parser [Parsers::Base]
8
+ def initialize(parser)
9
+ @parser = parser
10
+ @fiber = Fiber.new do
11
+ @parser.run { |record| Fiber.yield(record) }
12
+ nil
13
+ end
14
+ @done = false
15
+ end
16
+
17
+ # @yield [record]
18
+ # @return [Enumerator, nil]
19
+ def each
20
+ return enum_for(:each) unless block_given?
21
+
22
+ until @done
23
+ record = @fiber.resume
24
+ if record.nil? && !@fiber.alive?
25
+ @done = true
26
+ else
27
+ yield record
28
+ end
29
+ end
30
+ end
31
+
32
+ # @return [void]
33
+ def close
34
+ @done = true
35
+ @parser.close
36
+ end
37
+ end
38
+ end
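The `Stream` class above turns a push-style parser (one that calls a block for each record) into a pull-style `Enumerable` by running the parser inside a Fiber. The following self-contained sketch reproduces that pattern with a stand-in class and a toy producer, assuming Ruby 3.1 or later as the gemspec requires. `FiberStream` and `producer` are illustrative names, not part of the gem.

```ruby
# Stand-in for the Stream class: wraps a push-style producer in a Fiber
# so consumers can pull records through the Enumerable API.
class FiberStream
  include Enumerable

  # producer: an object responding to #call that passes records to its block
  def initialize(producer)
    @fiber = Fiber.new do
      producer.call { |record| Fiber.yield(record) }
      nil
    end
    @done = false
  end

  def each
    return enum_for(:each) unless block_given?

    until @done
      record = @fiber.resume
      # A nil from a finished fiber marks the end; a nil from a live fiber
      # is a genuine record and is passed on.
      if record.nil? && !@fiber.alive?
        @done = true
      else
        yield record
      end
    end
  end
end

produced = []
producer = lambda do |&emit|
  1.upto(1_000) do |n|
    produced << n # track how far the producer actually ran
    emit.call(n)
  end
end

stream = FiberStream.new(producer)
evens = stream.each.lazy.select(&:even?).first(3)
p evens, produced.length
```

Because the producer is suspended at each `Fiber.yield`, it only advances as far as the consumer asks: taking three even numbers runs it through 6 of the 1,000 values.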
data/lib/fibrio/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Fibrio
4
+ VERSION = '0.1.0'
5
+ end
data/lib/fibrio.rb ADDED
@@ -0,0 +1,44 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'fibrio/version'
4
+ require_relative 'fibrio/errors'
5
+ require_relative 'fibrio/source'
6
+ require_relative 'fibrio/parsers/base'
7
+ require_relative 'fibrio/parsers/json'
8
+ require_relative 'fibrio/parsers/ndjson'
9
+ require_relative 'fibrio/parsers/csv'
10
+ require_relative 'fibrio/stream'
11
+
12
+ module Fibrio
13
+ FORMAT_PARSERS = {
14
+ json: Parsers::JSON,
15
+ ndjson: Parsers::NDJSON,
16
+ csv: Parsers::CSV
17
+ }.freeze
18
+
19
+ # @param input [String, IO, StringIO] file path, raw data string, or readable IO
20
+ # @param format [Symbol]
21
+ # @param options [Hash]
22
+ # @return [Stream]
23
+ def self.open(input, format:, **options)
24
+ source = Source.build(input)
25
+ parser = parser_for(format).new(source, **options)
26
+ stream = Stream.new(parser)
27
+
28
+ return stream unless block_given?
29
+
30
+ begin
31
+ yield stream
32
+ ensure
33
+ stream.close
34
+ end
35
+ end
36
+
37
+ # @param format [Symbol]
38
+ # @return [Class]
39
+ def self.parser_for(format)
40
+ FORMAT_PARSERS.fetch(format.to_sym)
41
+ rescue KeyError
42
+ raise UnknownFormatError, "unknown format: #{format.inspect}"
43
+ end
44
+ end
metadata ADDED
@@ -0,0 +1,82 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: fibrio
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Yudai Takada
8
+ bindir: bin
9
+ cert_chain: []
10
+ date: 1980-01-02 00:00:00.000000000 Z
11
+ dependencies:
12
+ - !ruby/object:Gem::Dependency
13
+ name: rake
14
+ requirement: !ruby/object:Gem::Requirement
15
+ requirements:
16
+ - - "~>"
17
+ - !ruby/object:Gem::Version
18
+ version: '13.0'
19
+ type: :development
20
+ prerelease: false
21
+ version_requirements: !ruby/object:Gem::Requirement
22
+ requirements:
23
+ - - "~>"
24
+ - !ruby/object:Gem::Version
25
+ version: '13.0'
26
+ - !ruby/object:Gem::Dependency
27
+ name: rspec
28
+ requirement: !ruby/object:Gem::Requirement
29
+ requirements:
30
+ - - "~>"
31
+ - !ruby/object:Gem::Version
32
+ version: '3.13'
33
+ type: :development
34
+ prerelease: false
35
+ version_requirements: !ruby/object:Gem::Requirement
36
+ requirements:
37
+ - - "~>"
38
+ - !ruby/object:Gem::Version
39
+ version: '3.13'
40
+ description: Fibrio parses large JSON array, NDJSON, and CSV inputs record by record
41
+ without loading the full source into memory.
42
+ email:
43
+ - t.yudai92@gmail.com
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - CHANGELOG.md
49
+ - LICENSE.txt
50
+ - README.md
51
+ - lib/fibrio.rb
52
+ - lib/fibrio/errors.rb
53
+ - lib/fibrio/parsers/base.rb
54
+ - lib/fibrio/parsers/csv.rb
55
+ - lib/fibrio/parsers/json.rb
56
+ - lib/fibrio/parsers/ndjson.rb
57
+ - lib/fibrio/source.rb
58
+ - lib/fibrio/stream.rb
59
+ - lib/fibrio/version.rb
60
+ homepage: https://github.com/ydah/fibrio
61
+ licenses:
62
+ - MIT
63
+ metadata:
64
+ rubygems_mfa_required: 'true'
65
+ rdoc_options: []
66
+ require_paths:
67
+ - lib
68
+ required_ruby_version: !ruby/object:Gem::Requirement
69
+ requirements:
70
+ - - ">="
71
+ - !ruby/object:Gem::Version
72
+ version: '3.1'
73
+ required_rubygems_version: !ruby/object:Gem::Requirement
74
+ requirements:
75
+ - - ">="
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
78
+ requirements: []
79
+ rubygems_version: 4.0.6
80
+ specification_version: 4
81
+ summary: Fiber-backed streaming parsers for JSON arrays, NDJSON, and CSV.
82
+ test_files: []