RubyGems - fibrio - Versions diffs - 0.1.0 - Mend

fibrio 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

checksums.yaml +7 -0
data/CHANGELOG.md +5 -0
data/LICENSE.txt +21 -0
data/README.md +100 -0
data/lib/fibrio/errors.rb +17 -0
data/lib/fibrio/parsers/base.rb +36 -0
data/lib/fibrio/parsers/csv.rb +151 -0
data/lib/fibrio/parsers/json.rb +445 -0
data/lib/fibrio/parsers/ndjson.rb +25 -0
data/lib/fibrio/source.rb +171 -0
data/lib/fibrio/stream.rb +38 -0
data/lib/fibrio/version.rb +5 -0
data/lib/fibrio.rb +44 -0
metadata +82 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 6495443322eddb9e8b298408f94ad285687ba9ce368e1a5e1f82e841c4dbc4d0
+  data.tar.gz: 2e7520560cc402249a4f70d278618c9189d0d6beaaedb16b289029f0e17626f0
+SHA512:
+  metadata.gz: 1614a5878aff9fea07ec3750dc236cf48068ba5a31b9942c262c49ddd393aeca7b336eb6450c02014d9f052463f519bd995508b0a991969de6b54f2d20cbe5fa
+  data.tar.gz: 99ad7f64627bc6213c8990b794593ddeacb25de895e9fc9078bca65da98c103f17cc1e0551f12187907084479e8f078666d0deeb04ee8e35b1918138c2d5a1da

data/CHANGELOG.md ADDED

@@ -0,0 +1,5 @@
+# Changelog
+## 0.1.0 - 2026-05-22
+- Initial release

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Yudai Takada
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,100 @@
+# Fibrio
+`fibrio` is a small Ruby gem for reading large JSON array, NDJSON, and CSV inputs one record at a time. It keeps Fiber usage inside `Fibrio::Stream`, so callers can use the normal `Enumerable` API.
+## Installation
+```ruby
+gem "fibrio"
+```
+Then require it:
+```ruby
+require "fibrio"
+```
+## Usage
+```ruby
+Fibrio.open("data.json", format: :json) do |stream|
+  stream.each do |record|
+    process(record)
+  end
+end
+```
+`each` returns an `Enumerator` when no block is given, so lazy chains work as expected:
+```ruby
+stream = Fibrio.open("data.json", format: :json)
+top10 = stream.each.lazy.select { |record| record["active"] }.first(10)
+stream.close
+```
+CSV with no header row returns arrays:
+```ruby
+Fibrio.open("data.csv", format: :csv, headers: false) do |stream|
+  stream.each { |row| p row }
+end
+```
+String input is accepted as data when it is not an existing file path:
+```ruby
+Fibrio.open("[1,2,3]", format: :json) do |stream|
+  stream.each { |number| p number }
+end
+```
+Top-level JSON objects can stream an array nested at a known path:
+```ruby
+Fibrio.open('{"payload":{"records":[{"id":1},{"id":2}]}}', format: :json, path: %w[payload records]) do |stream|
+  stream.each { |record| p record["id"] }
+end
+```
+NDJSON uses one JSON value per non-empty line:
+```ruby
+Fibrio.open(%({"id":1}\n{"id":2}\n), format: :ndjson) do |stream|
+  stream.each { |record| p record["id"] }
+end
+```
+## Supported Formats
+- JSON: top-level arrays, or object-contained arrays selected with `path:`.
+- NDJSON: blank lines are skipped. Each non-empty line is parsed with Ruby's standard `json` library.
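+- CSV: with `headers: true` (the default), the first row is used as the header and each following row is yielded as a hash. With `headers: false`, every row is yielded as an array. Quoted newlines are supported.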
+## Memory Benchmark
+From a source checkout, run the benchmark with:
+```sh
+ruby benchmark/memory.rb 250000
+```
+The benchmark generates temporary files, reads them in a child process, and polls peak RSS from the parent process.
+In the table below, rows whose reader is Fibrio iterate through records without retaining them; the other rows read eagerly and keep the whole parsed collection in memory.
+Peak RSS includes the Ruby VM baseline, so absolute numbers vary by Ruby version and platform.
+Example result on Ruby 4.0.0 arm64-darwin24 with 250,000 records:
+| Format | Reader | Input MiB | Records | Seconds | Peak RSS MiB |
+| --- | --- | ---: | ---: | ---: | ---: |
+| JSON | Fibrio | 20.07 | 250,000 | 14.710 | 39.4 |
+| JSON | JSON.parse(File.read) | 20.07 | 250,000 | 0.069 | 105.4 |
+| NDJSON | Fibrio | 20.07 | 250,000 | 0.220 | 25.6 |
+| NDJSON | File.readlines + JSON.parse | 20.07 | 250,000 | 0.182 | 127.6 |
+| CSV | Fibrio | 9.10 | 250,000 | 2.640 | 33.3 |
+| CSV | CSV.read(headers: true) | 9.10 | 250,000 | 0.826 | 192.8 |
+The tradeoff is intentional: Fibrio prioritizes bounded memory use for large inputs over loading everything as fast as possible.
+## Known Limitations
+- Each individual record must fit in memory.

data/lib/fibrio/errors.rb ADDED

@@ -0,0 +1,17 @@
+# frozen_string_literal: true
+module Fibrio
+  class Error < StandardError; end
+  class UnknownFormatError < Error; end
+  class ParseError < Error
+    attr_reader :position
+    # @param message [String]
+    # @param position [Integer, nil] byte offset in the input
+    def initialize(message, position: nil)
+      @position = position
+      super(position ? "#{message} (at byte #{position})" : message)
+    end
+  end
+end
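
The message formatting in `ParseError` can be tried in isolation. The sketch below copies the class from the hunk above into a hypothetical `FibrioSketch` namespace (an assumption made so the snippet runs without the gem installed) and shows how the optional `position:` keyword changes the message.

```ruby
# Standalone copy of the error hierarchy, for illustration only.
# The FibrioSketch namespace is hypothetical and not part of the gem.
module FibrioSketch
  class Error < StandardError; end

  class ParseError < Error
    attr_reader :position

    def initialize(message, position: nil)
      @position = position
      # Append the byte offset only when one was supplied.
      super(position ? "#{message} (at byte #{position})" : message)
    end
  end
end

with_position = FibrioSketch::ParseError.new('unexpected trailing input', position: 42)
without_position = FibrioSketch::ParseError.new('unexpected trailing input')

puts with_position.message
puts without_position.message
```

Because `position:` defaults to `nil`, a plain `raise ParseError, "message"` (the form used in the CSV parser) produces an error whose `position` is `nil` and whose message has no byte suffix.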

data/lib/fibrio/parsers/base.rb ADDED

@@ -0,0 +1,36 @@
+# frozen_string_literal: true
+module Fibrio
+  module Parsers
+    class Base
+      WHITESPACE = [' ', "\t", "\n", "\r"].freeze
+      # @param source [Source]
+      # @param options [Hash]
+      def initialize(source, **options)
+        @source = source
+        @options = options
+      end
+      # @yield [record]
+      def run
+        raise NotImplementedError
+      end
+      # @return [void]
+      def close
+        @source.close
+      end
+      private
+      def skip_whitespace
+        @source.next_char while WHITESPACE.include?(@source.peek_char)
+      end
+      def parse_error(message)
+        raise ParseError.new(message, position: @source.position)
+      end
+    end
+  end
+end

data/lib/fibrio/parsers/csv.rb ADDED

@@ -0,0 +1,151 @@
+# frozen_string_literal: true
+module Fibrio
+  module Parsers
+    class CSV < Base
+      # @yield [record]
+      def run(&emit)
+        headers = nil
+        headers_enabled = @options.fetch(:headers, true)
+        line_number = 0
+        while (record = next_csv_record(line_number))
+          line, start_line_number, line_number = record
+          fields = split_csv_line(line, line_number: start_line_number)
+          if headers_enabled && headers.nil?
+            headers = fields
+            next
+          end
+          validate_field_count(headers, fields, line_number) if headers
+          emit.call(headers ? headers.zip(fields).to_h : fields)
+        end
+      end
+      private
+      def next_csv_record(previous_line_number)
+        line = @source.next_line
+        return unless line
+        start_line_number = previous_line_number + 1
+        current_line_number = start_line_number
+        record = +line
+        until complete_csv_record?(record)
+          line = @source.next_line
+          raise_unclosed_quote(start_line_number) unless line
+          current_line_number += 1
+          record << "\n" << line
+        end
+        [record, start_line_number, current_line_number]
+      end
+      def complete_csv_record?(record)
+        in_quotes = false
+        at_field_start = true
+        index = 0
+        while index < record.length
+          char = record[index]
+          if in_quotes
+            in_quotes, index = advance_quoted_record_state(record, index)
+          else
+            in_quotes, at_field_start = advance_unquoted_record_state(char, at_field_start)
+            index += 1
+          end
+        end
+        !in_quotes
+      end
+      def advance_quoted_record_state(record, index)
+        return [true, index + 1] unless record[index] == '"'
+        return [true, index + 2] if record[index + 1] == '"'
+        [false, index + 1]
+      end
+      def advance_unquoted_record_state(char, at_field_start)
+        case char
+        when ','
+          [false, true]
+        when '"'
+          [at_field_start, false]
+        else
+          [false, false]
+        end
+      end
+      def split_csv_line(line, line_number:)
+        fields = []
+        field = +''
+        in_quotes = false
+        index = 0
+        while index < line.length
+          if in_quotes
+            index = consume_quoted_char(line, index, field)
+            in_quotes = false if index.is_a?(Array)
+            index = index.first if index.is_a?(Array)
+          else
+            in_quotes, index, field = consume_unquoted_char(line, index, field, fields)
+          end
+        end
+        raise_unclosed_quote(line_number) if in_quotes
+        fields << field
+      end
+      def consume_quoted_char(line, index, field)
+        char = line[index]
+        if char == '"'
+          return [index + 1] unless line[index + 1] == '"'
+          field << '"'
+          return index + 2
+        end
+        field << char
+        index + 1
+      end
+      def consume_unquoted_char(line, index, field, fields)
+        char = line[index]
+        case char
+        when ','
+          fields << field
+          [false, index + 1, +'']
+        when '"'
+          parse_error('unexpected quote in CSV line') unless field.empty?
+          [true, index + 1, field]
+        else
+          field << char
+          [false, index + 1, field]
+        end
+      end
+      def validate_field_count(headers, fields, line_number)
+        return if headers.length == fields.length
+        raise ParseError,
+              "CSV field count mismatch on line #{line_number}: expected #{headers.length}, got #{fields.length}"
+      end
+      def raise_unclosed_quote(line_number)
+        raise ParseError,
+              "unterminated quoted CSV field starting on line #{line_number}"
+      end
+    end
+  end
+end
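
The multi-line record handling above rests on one idea: keep appending physical lines until the quotes balance. The following is a simplified sketch of that idea, not the gem's code. `balanced_quotes?` is a hypothetical helper, and unlike `complete_csv_record?` it toggles on every quote rather than only on quotes at the start of a field.

```ruby
# Simplified, standalone quote-balance check. `balanced_quotes?` is a
# hypothetical helper and is looser than the gem's field-start rule.
def balanced_quotes?(record)
  in_quotes = false
  index = 0
  while index < record.length
    if record[index] == '"'
      if in_quotes && record[index + 1] == '"'
        index += 1 # doubled quote inside a quoted field: skip the pair
      else
        in_quotes = !in_quotes
      end
    end
    index += 1
  end
  !in_quotes
end

# Join physical lines into logical records until the quotes balance.
lines = ['id,note', '1,"first', 'second"', '2,plain']
records = []
pending = nil
lines.each do |line|
  pending = pending ? "#{pending}\n#{line}" : line
  next unless balanced_quotes?(pending)

  records << pending
  pending = nil
end

p records
```

If `pending` is still set after the last line, the input ended inside a quoted field, which is the case the parser reports as an unterminated quoted CSV field.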

data/lib/fibrio/parsers/json.rb ADDED

@@ -0,0 +1,445 @@
+# frozen_string_literal: true
+module Fibrio
+  module Parsers
+    class JSON < Base
+      HEX_DIGITS = /\A[0-9a-fA-F]\z/
+      NUMBER_START = /\A[-0-9]\z/
+      # @yield [record]
+      def run(&)
+        path = normalized_path
+        return run_path(path, &) if path
+        parse_root_array(&)
+        ensure_end_of_input
+      end
+      private
+      def parse_root_array(&)
+        skip_whitespace
+        first_char = @source.next_char
+        unless first_char == '['
+          parse_error('JSON input must be a top-level array, or pass path: for an object-contained array')
+        end
+        skip_whitespace
+        if @source.peek_char == ']'
+          @source.next_char
+          return
+        end
+        parse_top_level_records(&)
+      end
+      def run_path(path, &)
+        if path.empty?
+          parse_root_array(&)
+          ensure_end_of_input
+          return
+        end
+        skip_whitespace
+        found = parse_object_path(path, &)
+        parse_error("JSON path #{path.inspect} was not found") unless found
+        ensure_end_of_input
+      end
+      def parse_top_level_records(&emit)
+        loop do
+          emit.call(parse_value)
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected JSON value after ','") if @source.peek_char == ']'
+          when ']'
+            break
+          else
+            parse_error("expected ',' or ']'")
+          end
+        end
+      end
+      def parse_object_path(path, &emit)
+        expect_char('{')
+        found = false
+        skip_whitespace
+        return false if consume_if('}')
+        loop do
+          key = parse_object_key
+          skip_whitespace
+          expect_char(':')
+          found = parse_matching_path_value(path, key, found, &emit)
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected object key after ','") if @source.peek_char == '}'
+          when '}'
+            return found
+          else
+            parse_error("expected ',' or '}'")
+          end
+        end
+      end
+      def parse_matching_path_value(path, key, found, &)
+        if !found && key == path.first
+          parse_path_value(path.drop(1), &)
+        else
+          skip_value
+          found
+        end
+      end
+      def parse_path_value(remaining_path, &)
+        if remaining_path.empty?
+          parse_streaming_array(&)
+          return true
+        end
+        parse_error('expected object while following JSON path') unless @source.peek_char == '{'
+        parse_object_path(remaining_path, &)
+      end
+      def parse_streaming_array(&)
+        expect_char('[')
+        skip_whitespace
+        return if consume_if(']')
+        parse_top_level_records(&)
+      end
+      def parse_value
+        skip_whitespace
+        case (char = @source.peek_char)
+        when '{'
+          parse_object
+        when '['
+          parse_inner_array
+        when '"'
+          parse_string
+        when 't', 'f'
+          parse_boolean
+        when 'n'
+          parse_null
+        when NUMBER_START
+          parse_number
+        when nil
+          parse_error('unexpected end of input')
+        else
+          parse_error("unexpected character #{char.inspect}")
+        end
+      end
+      def parse_object
+        expect_char('{')
+        object = {}
+        skip_whitespace
+        return object if consume_if('}')
+        loop do
+          key = parse_object_key
+          skip_whitespace
+          expect_char(':')
+          object[key] = parse_value
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected object key after ','") if @source.peek_char == '}'
+          when '}'
+            return object
+          else
+            parse_error("expected ',' or '}'")
+          end
+        end
+      end
+      def parse_object_key
+        parse_error('expected object key string') unless @source.peek_char == '"'
+        parse_string
+      end
+      def parse_inner_array
+        expect_char('[')
+        values = []
+        skip_whitespace
+        return values if consume_if(']')
+        loop do
+          values << parse_value
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected array value after ','") if @source.peek_char == ']'
+          when ']'
+            return values
+          else
+            parse_error("expected ',' or ']'")
+          end
+        end
+      end
+      def parse_string
+        expect_char('"')
+        result = +''
+        loop do
+          char = @source.next_char
+          parse_error('unterminated string') unless char
+          case char
+          when '"'
+            return result
+          when '\\'
+            result << parse_escape
+          else
+            parse_error('unescaped control character in string') if char.ord < 0x20
+            result << char
+          end
+        end
+      end
+      def parse_escape
+        case (char = @source.next_char)
+        when '"', '\\', '/'
+          char
+        when 'b'
+          "\b"
+        when 'f'
+          "\f"
+        when 'n'
+          "\n"
+        when 'r'
+          "\r"
+        when 't'
+          "\t"
+        when 'u'
+          parse_unicode_escape
+        else
+          parse_error('invalid escape sequence')
+        end
+      end
+      def parse_unicode_escape
+        codepoint = read_hex_codepoint
+        return parse_surrogate_pair(codepoint) if high_surrogate?(codepoint)
+        parse_error('unexpected low surrogate escape') if low_surrogate?(codepoint)
+        codepoint.chr(Encoding::UTF_8)
+      end
+      def parse_surrogate_pair(high)
+        parse_error('expected low surrogate escape') unless @source.next_char == '\\' && @source.next_char == 'u'
+        low = read_hex_codepoint
+        parse_error('expected low surrogate escape') unless low_surrogate?(low)
+        (((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000).chr(Encoding::UTF_8)
+      end
+      def read_hex_codepoint
+        digits = +''
+        4.times do
+          char = @source.next_char
+          parse_error('invalid unicode escape') unless char&.match?(HEX_DIGITS)
+          digits << char
+        end
+        digits.to_i(16)
+      end
+      def parse_number
+        number = +''
+        number << consume_if('-').to_s
+        parse_integer_part(number)
+        parse_fraction_part(number)
+        parse_exponent_part(number)
+        number.include?('.') || number.match?(/[eE]/) ? Float(number) : Integer(number, 10)
+      rescue ArgumentError
+        parse_error('invalid number')
+      end
+      def parse_integer_part(number)
+        char = @source.peek_char
+        parse_error('invalid number') unless digit?(char)
+        if char == '0'
+          number << @source.next_char
+          parse_error('invalid leading zero in number') if digit?(@source.peek_char)
+          return
+        end
+        number << @source.next_char while digit?(@source.peek_char)
+      end
+      def parse_fraction_part(number)
+        return unless @source.peek_char == '.'
+        number << @source.next_char
+        parse_error('invalid fractional number') unless digit?(@source.peek_char)
+        number << @source.next_char while digit?(@source.peek_char)
+      end
+      def parse_exponent_part(number)
+        return unless %w[e E].include?(@source.peek_char)
+        number << @source.next_char
+        number << @source.next_char if %w[+ -].include?(@source.peek_char)
+        parse_error('invalid exponent') unless digit?(@source.peek_char)
+        number << @source.next_char while digit?(@source.peek_char)
+      end
+      def parse_boolean
+        return consume_expected_literal('true', true) if @source.peek_char == 't'
+        return consume_expected_literal('false', false) if @source.peek_char == 'f'
+        parse_error('invalid boolean')
+      end
+      def parse_null
+        consume_expected_literal('null', nil)
+      end
+      def skip_value
+        skip_whitespace
+        case (char = @source.peek_char)
+        when '{'
+          skip_object
+        when '['
+          skip_array
+        when '"'
+          parse_string
+        when 't', 'f'
+          parse_boolean
+        when 'n'
+          parse_null
+        when NUMBER_START
+          parse_number
+        when nil
+          parse_error('unexpected end of input')
+        else
+          parse_error("unexpected character #{char.inspect}")
+        end
+        nil
+      end
+      def skip_object
+        expect_char('{')
+        skip_whitespace
+        return if consume_if('}')
+        loop do
+          parse_object_key
+          skip_whitespace
+          expect_char(':')
+          skip_value
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected object key after ','") if @source.peek_char == '}'
+          when '}'
+            return
+          else
+            parse_error("expected ',' or '}'")
+          end
+        end
+      end
+      def skip_array
+        expect_char('[')
+        skip_whitespace
+        return if consume_if(']')
+        loop do
+          skip_value
+          skip_whitespace
+          case @source.next_char
+          when ','
+            skip_whitespace
+            parse_error("expected array value after ','") if @source.peek_char == ']'
+          when ']'
+            return
+          else
+            parse_error("expected ',' or ']'")
+          end
+        end
+      end
+      def consume_expected_literal(literal, value)
+        literal.each_char do |char|
+          parse_error('invalid literal') unless @source.next_char == char
+        end
+        value
+      end
+      def expect_char(expected)
+        actual = @source.next_char
+        parse_error("expected #{expected.inspect}") unless actual == expected
+        actual
+      end
+      def consume_if(expected)
+        return unless @source.peek_char == expected
+        @source.next_char
+      end
+      def ensure_end_of_input
+        skip_whitespace
+        parse_error('unexpected trailing input') unless @source.eof?
+      end
+      def normalized_path
+        return nil unless @options.key?(:path)
+        raw_path = @options[:path]
+        return nil if raw_path.nil?
+        return [raw_path.to_s] if raw_path.is_a?(String) || raw_path.is_a?(Symbol)
+        unless raw_path.respond_to?(:map)
+          raise ArgumentError, 'JSON path must be a string, symbol, or array of strings/symbols'
+        end
+        raw_path.map(&:to_s)
+      end
+      def digit?(char)
+        char&.between?('0', '9')
+      end
+      def high_surrogate?(codepoint)
+        codepoint.between?(0xD800, 0xDBFF)
+      end
+      def low_surrogate?(codepoint)
+        codepoint.between?(0xDC00, 0xDFFF)
+      end
+    end
+  end
+end
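
The arithmetic in `parse_surrogate_pair` is the standard UTF-16 surrogate decoding. As a worked example, the sketch below applies the same formula to the escape pair `\ud83d\ude00`; `combine_surrogates` is a hypothetical name used only for this illustration.

```ruby
# Hypothetical helper mirroring the formula used in parse_surrogate_pair.
def combine_surrogates(high, low)
  # (high - 0xD800) supplies the upper 10 bits, (low - 0xDC00) the lower 10,
  # and 0x10000 shifts the result above the Basic Multilingual Plane.
  codepoint = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000
  codepoint.chr(Encoding::UTF_8)
end

char = combine_surrogates(0xD83D, 0xDE00)
puts format('U+%X', char.ord)
```

Step by step: `0xD83D - 0xD800 = 0x3D`, `0x3D * 0x400 = 0xF400`, `0xDE00 - 0xDC00 = 0x200`, and `0xF400 + 0x200 + 0x10000 = 0x1F600`.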

data/lib/fibrio/parsers/ndjson.rb ADDED

@@ -0,0 +1,25 @@
+# frozen_string_literal: true
+require 'json'
+module Fibrio
+  module Parsers
+    class NDJSON < Base
+      # @yield [record]
+      def run(&emit)
+        line_number = 0
+        while (line = @source.next_line)
+          line_number += 1
+          next if line.strip.empty?
+          begin
+            emit.call(::JSON.parse(line))
+          rescue ::JSON::ParserError => e
+            raise ParseError.new("invalid JSON on line #{line_number}: #{e.message}", position: @source.position)
+          end
+        end
+      end
+    end
+  end
+end
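
The NDJSON loop is small enough to reproduce without the gem. The sketch below is an assumption-labeled stand-in: `each_ndjson_record` is a hypothetical helper that reads from any IO with `each_line` instead of `Source#next_line`, and it raises `ArgumentError` where the gem raises its own `ParseError`.

```ruby
require 'json'
require 'stringio'

# Hypothetical standalone helper: one JSON value per non-empty line.
def each_ndjson_record(io)
  io.each_line.with_index(1) do |line, line_number|
    next if line.strip.empty? # blank lines are skipped, as in the parser

    begin
      yield JSON.parse(line)
    rescue JSON::ParserError => e
      raise ArgumentError, "invalid JSON on line #{line_number}: #{e.message}"
    end
  end
end

input = StringIO.new(%({"id":1}\n\n{"id":2}\n))
ids = []
each_ndjson_record(input) { |record| ids << record['id'] }
p ids
```

Line numbers count physical lines, so a blank line still advances the counter that appears in error messages.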

data/lib/fibrio/source.rb ADDED

@@ -0,0 +1,171 @@
+# frozen_string_literal: true
+require 'stringio'
+module Fibrio
+  class Source
+    CHUNK_SIZE = 64 * 1024
+    attr_reader :position, :bytes_read, :next_char_calls
+    # @param input [String, IO, StringIO]
+    # @param chunk_size [Integer]
+    # @return [Source]
+    def self.build(input, chunk_size: CHUNK_SIZE)
+      case input
+      when IO, StringIO
+        new(input, owns_io: true, chunk_size: chunk_size)
+      when String
+        build_from_string(input, chunk_size: chunk_size)
+      else
+        raise ArgumentError, "unsupported input: #{input.class}"
+      end
+    end
+    def self.build_from_string(input, chunk_size:)
+      if File.exist?(input)
+        io = File.open(input, 'rb')
+        new(io, owns_io: true, chunk_size: chunk_size)
+      else
+        new(StringIO.new(input.b), owns_io: true, chunk_size: chunk_size)
+      end
+    end
+    private_class_method :build_from_string
+    # @param io [IO, StringIO]
+    # @param owns_io [Boolean]
+    # @param chunk_size [Integer]
+    def initialize(io, owns_io:, chunk_size: CHUNK_SIZE)
+      @io = io
+      @owns_io = owns_io
+      @chunk_size = chunk_size
+      @buffer = +''.b
+      @pos = 0
+      @eof = false
+      @position = 0
+      @bytes_read = 0
+      @next_char_calls = 0
+    end
+    # @return [String, nil] next UTF-8 character without consuming it
+    def peek_char
+      ensure_buffer(1)
+      first_byte = @buffer.getbyte(@pos)
+      return nil unless first_byte
+      char_length = utf8_char_length(first_byte)
+      ensure_buffer(char_length)
+      char_bytes = @buffer.byteslice(@pos, char_length)
+      source_error('input ended in the middle of a UTF-8 character') if char_bytes.bytesize < char_length
+      decode_utf8(char_bytes)
+    end
+    # @return [String, nil] next UTF-8 character and advances the source
+    def next_char
+      char = peek_char
+      return nil unless char
+      @pos += char.bytesize
+      @position += char.bytesize
+      @next_char_calls += 1
+      compact_buffer
+      char
+    end
+    # @return [String, nil] next line without trailing newline
+    def next_line
+      loop do
+        newline_index = @buffer.index("\n".b, @pos)
+        return consume_line(newline_index) if newline_index
+        return consume_remaining_line if @eof
+        ensure_buffer(@buffer.bytesize - @pos + 1)
+      end
+    end
+    # @return [Boolean]
+    def eof?
+      ensure_buffer(1)
+      @buffer.getbyte(@pos).nil?
+    end
+    # @return [void]
+    def close
+      @io.close if @owns_io && @io.respond_to?(:closed?) && !@io.closed?
+    end
+    private
+    def ensure_buffer(minimum_remaining)
+      compact_buffer
+      while !@eof && remaining_bytes < minimum_remaining
+        chunk = @io.read(@chunk_size)
+        if chunk.nil? || chunk.empty?
+          @eof = true
+        else
+          binary_chunk = chunk.b
+          @bytes_read += binary_chunk.bytesize
+          @buffer << binary_chunk
+        end
+      end
+    end
+    def remaining_bytes
+      @buffer.bytesize - @pos
+    end
+    def compact_buffer
+      return unless @pos.positive?
+      return if @pos <= @chunk_size && @pos < @buffer.bytesize
+      @buffer = @buffer.byteslice(@pos, remaining_bytes) || +''.b
+      @pos = 0
+    end
+    def consume_line(newline_index)
+      line_bytes = @buffer.byteslice(@pos, newline_index - @pos) || +''.b
+      bytes_consumed = newline_index - @pos + 1
+      @pos = newline_index + 1
+      @position += bytes_consumed
+      compact_buffer
+      decode_line(line_bytes)
+    end
+    def consume_remaining_line
+      return nil if @pos >= @buffer.bytesize
+      line_bytes = @buffer.byteslice(@pos, remaining_bytes) || +''.b
+      @position += line_bytes.bytesize
+      @pos = @buffer.bytesize
+      compact_buffer
+      decode_line(line_bytes)
+    end
+    def decode_line(line_bytes)
+      line_bytes = line_bytes.byteslice(0, line_bytes.bytesize - 1) if line_bytes.end_with?("\r".b)
+      decode_utf8(line_bytes)
+    end
+    def decode_utf8(bytes)
+      value = bytes.dup.force_encoding(Encoding::UTF_8)
+      return value if value.valid_encoding?
+      source_error('invalid UTF-8 input')
+    end
+    def utf8_char_length(first_byte)
+      return 1 if first_byte < 0x80
+      return 2 if (first_byte & 0b1110_0000) == 0b1100_0000
+      return 3 if (first_byte & 0b1111_0000) == 0b1110_0000
+      return 4 if (first_byte & 0b1111_1000) == 0b1111_0000
+      source_error('invalid UTF-8 input')
+    end
+    def source_error(message)
+      raise ParseError.new(message, position: @position)
+    end
+  end
+end
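
`peek_char` depends on classifying the lead byte of a UTF-8 sequence. The sketch below restates that classification as a standalone function; `utf8_sequence_length` is a hypothetical name, and it raises `ArgumentError` where the gem raises `ParseError`.

```ruby
# Hypothetical standalone version of the lead-byte classification.
def utf8_sequence_length(first_byte)
  return 1 if first_byte < 0x80                          # 0xxxxxxx
  return 2 if (first_byte & 0b1110_0000) == 0b1100_0000  # 110xxxxx
  return 3 if (first_byte & 0b1111_0000) == 0b1110_0000  # 1110xxxx
  return 4 if (first_byte & 0b1111_1000) == 0b1111_0000  # 11110xxx

  raise ArgumentError, format('invalid UTF-8 lead byte 0x%02X', first_byte)
end

samples = ['a', "\u00E9", "\u20AC", "\u{1F600}"]
lengths = samples.map { |char| utf8_sequence_length(char.getbyte(0)) }
p lengths
```

A continuation byte such as `0x80` matches none of the four patterns, so it is rejected when it appears where a character should start.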

data/lib/fibrio/stream.rb ADDED

@@ -0,0 +1,38 @@
+# frozen_string_literal: true
+module Fibrio
+  class Stream
+    include Enumerable
+    # @param parser [Parsers::Base]
+    def initialize(parser)
+      @parser = parser
+      @fiber = Fiber.new do
+        @parser.run { |record| Fiber.yield(record) }
+        nil
+      end
+      @done = false
+    end
+    # @yield [record]
+    # @return [Enumerator, nil]
+    def each
+      return enum_for(:each) unless block_given?
+      until @done
+        record = @fiber.resume
+        if record.nil? && !@fiber.alive?
+          @done = true
+        else
+          yield record
+        end
+      end
+    end
+    # @return [void]
+    def close
+      @done = true
+      @parser.close
+    end
+  end
+end
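
`Stream` turns a push-style parser (one that calls a block per record) into a pull-style iterator by running it inside a Fiber. A minimal sketch of that inversion, assuming Ruby 3.1 or later as the gemspec requires; `producer` is a hypothetical stand-in for a parser's `run` method.

```ruby
# `producer` pushes values into whatever block it is given, the way a
# parser's `run` does. It is hypothetical and not part of the gem.
producer = lambda do |&emit|
  [1, 2, 3].each { |value| emit.call(value) }
end

fiber = Fiber.new do
  # Each emitted value suspends the fiber and is handed to the caller.
  producer.call { |value| Fiber.yield(value) }
  nil # returned by the final resume once the producer is exhausted
end

pulled = []
loop do
  value = fiber.resume
  # A nil from a finished fiber marks the end; a nil from a live fiber
  # would be a genuine record (for example a JSON null).
  break if value.nil? && !fiber.alive?

  pulled << value
end

p pulled
```

Checking `alive?` alongside `nil` is what lets a stream carry `nil` records without mistaking them for the end of input.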

data/lib/fibrio/version.rb ADDED

@@ -0,0 +1,5 @@
+# frozen_string_literal: true
+module Fibrio
+  VERSION = '0.1.0'
+end

data/lib/fibrio.rb ADDED

@@ -0,0 +1,44 @@
+# frozen_string_literal: true
+require_relative 'fibrio/version'
+require_relative 'fibrio/errors'
+require_relative 'fibrio/source'
+require_relative 'fibrio/parsers/base'
+require_relative 'fibrio/parsers/json'
+require_relative 'fibrio/parsers/ndjson'
+require_relative 'fibrio/parsers/csv'
+require_relative 'fibrio/stream'
+module Fibrio
+  FORMAT_PARSERS = {
+    json: Parsers::JSON,
+    ndjson: Parsers::NDJSON,
+    csv: Parsers::CSV
+  }.freeze
+  # @param input [String, IO, StringIO] a file path, raw data, or an open IO
+  # @param format [Symbol]
+  # @param options [Hash]
+  # @return [Stream]
+  def self.open(input, format:, **options)
+    source = Source.build(input)
+    parser = parser_for(format).new(source, **options)
+    stream = Stream.new(parser)
+    return stream unless block_given?
+    begin
+      yield stream
+    ensure
+      stream.close
+    end
+  end
+  # @param format [Symbol]
+  # @return [Class]
+  def self.parser_for(format)
+    FORMAT_PARSERS.fetch(format.to_sym)
+  rescue KeyError
+    raise UnknownFormatError, "unknown format: #{format.inspect}"
+  end
+end

metadata ADDED

@@ -0,0 +1,82 @@
+--- !ruby/object:Gem::Specification
+name: fibrio
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Yudai Takada
+bindir: bin
+cert_chain: []
+date: 1980-01-02 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.13'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.13'
+description: Fibrio parses large JSON array, NDJSON, and CSV inputs record by record
+  without loading the full source into memory.
+email:
+- t.yudai92@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- LICENSE.txt
+- README.md
+- lib/fibrio.rb
+- lib/fibrio/errors.rb
+- lib/fibrio/parsers/base.rb
+- lib/fibrio/parsers/csv.rb
+- lib/fibrio/parsers/json.rb
+- lib/fibrio/parsers/ndjson.rb
+- lib/fibrio/source.rb
+- lib/fibrio/stream.rb
+- lib/fibrio/version.rb
+homepage: https://github.com/ydah/fibrio
+licenses:
+- MIT
+metadata:
+  rubygems_mfa_required: 'true'
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.1'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 4.0.6
+specification_version: 4
+summary: Fiber-backed streaming parsers for JSON arrays, NDJSON, and CSV.
+test_files: []