json_completer 1.0.0 → 1.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +38 -21
- data/lib/json_completer/completion_engine.rb +241 -0
- data/lib/json_completer/parser_engine.rb +386 -0
- data/lib/json_completer/scanners.rb +448 -0
- data/lib/json_completer.rb +36 -688
- metadata +5 -2
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5d460af0d48e2cecf87411ba30d2a6aeac00fe38208d0222bf8b7218e373a2cc
+  data.tar.gz: 448401c51bc04e0a38fae036d3a64d94e5090846c39f69d081a8032b0b58e80a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 256f6ba460ef729a9babe9f9355f0d888c3c7f3dc64e3b4c85c2ae69ad6cb6c3a4b107aced36fe55b04e06302034b672f63c94c6db03b074a0462223eee8d5d1
+  data.tar.gz: 101f08a619d56129398b751815077897e3f557d581b76d075e41f57098e93ec92aa14d9f7edb1c88dbb8179f26d3aaf0fde3d81cfa986d2c912aa70b77294074
data/README.md CHANGED

@@ -1,6 +1,6 @@
 # JsonCompleter
 
-A Ruby gem
+A Ruby gem for incremental parsing of partial and incomplete JSON streams. It is built for streaming output from LLM providers such as OpenAI and Anthropic, and processes each new chunk in O(n) time by maintaining parser state between calls. Use `.parse` for parsed Ruby values and `.complete` when you specifically need completed JSON text.
 
 ## Installation
 
@@ -26,56 +26,73 @@ gem install json_completer
 
 ### Basic Usage
 
-
+Use `.parse` when you want the current parsed Ruby value directly from a partial stream:
 
 ```ruby
 require 'json_completer'
 
-#
-JsonCompleter.
-# =>
+# Parse partial JSON into Ruby objects
+JsonCompleter.parse('{"name": "John", "age":')
+# => {"name" => "John", "age" => nil}
 
 # Handle incomplete strings
-JsonCompleter.
-# =>
+JsonCompleter.parse('{"message": "Hello wo')
+# => {"message" => "Hello wo"}
 
-#
-JsonCompleter.
-# =>
+# Close unclosed structures
+JsonCompleter.parse('[1, 2, {"key": "value"')
+# => [1, 2, {"key" => "value"}]
 ```
 
 ### Incremental Processing
 
-For streaming scenarios where JSON arrives in chunks. Each call processes only new data (O(n) complexity) by maintaining parsing state
+For streaming scenarios where JSON arrives in chunks, each call processes only new data (O(n) complexity) by maintaining parsing state:
 
 ```ruby
 completer = JsonCompleter.new
 
 # Process first chunk
-result1 = completer.
-# =>
+result1 = completer.parse('{"users": [{"name": "')
+# => {"users" => [{"name" => ""}]}
 
 # Process additional data
-result2 = completer.
-# =>
+result2 = completer.parse('{"users": [{"name": "Alice"}')
+# => {"users" => [{"name" => "Alice"}]}
+
+# Final parsed value
+result3 = completer.parse('{"users": [{"name": "Alice"}, {"name": "Bob"}]}')
+# => {"users" => [{"name" => "Alice"}, {"name" => "Bob"}]}
+```
+
+Stateful `JsonCompleter` instances assume append-only input. If earlier bytes change, create a new instance; truncating the input to a shorter prefix resets state automatically.
+
+### String Output with `.complete`
 
-
-
-
+Use `.complete` when you specifically need completed JSON text instead of parsed Ruby objects:
+
+```ruby
+JsonCompleter.complete('{"name": "John", "age":')
+# => '{"name": "John", "age": null}'
+
+JsonCompleter.complete('[1, 2, {"key": "value"')
+# => '[1, 2, {"key": "value"}]'
 ```
 
+Reach for `.complete` when another layer expects JSON text and you want `json_completer` to materialize the current partial state as valid JSON; otherwise prefer `.parse`.
+
 #### Performance Characteristics
 
 - **Zero reprocessing**: Maintains parsing state to avoid reparsing previously processed data
 - **Linear complexity**: Each chunk is processed in O(n) time, where n is the size of the new data, not the total size
 - **Memory efficient**: Uses token-based accumulation with minimal state overhead
+- **Byte-oriented string scanning**: Walks JSON input as bytes and copies contiguous non-escape string content in slices, reducing per-character overhead on long streamed strings
 - **Context preservation**: Tracks nested structures without full document analysis
 
 ### Common Use Cases
 
-- **
-- **
-- **
+- **LLM streaming output**: Parse partial JSON emitted token-by-token from providers such as OpenAI and Anthropic
+- **Incremental structured output parsing**: Keep a live Ruby object up to date while more JSON arrives
+- **JSON text completion**: Produce valid JSON text snapshots for downstream consumers that require a string
 
 ## Contributing
 
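The byte-oriented scanning described in the performance notes above can be illustrated with a small, gem-independent sketch (the `count_structural_bytes` helper below is hypothetical, not part of json_completer): comparing `String#getbyte` results against integer ASCII codes avoids allocating a transient 1-character string per position, which is the same trick the gem applies to JSON syntax characters.

```ruby
# Hypothetical illustration (not gem code): tally JSON structural characters
# by comparing integer byte values instead of allocating per-character strings.
def count_structural_bytes(json)
  counts = Hash.new(0)
  i = 0
  while i < json.bytesize
    case json.getbyte(i)            # returns an Integer, no String allocation
    when 123 then counts['{'] += 1  # {
    when 125 then counts['}'] += 1  # }
    when 91  then counts['['] += 1  # [
    when 93  then counts[']'] += 1  # ]
    end
    i += 1
  end
  counts
end

count_structural_bytes('[1, {"a": [2]}')
# counts two "[", and one each of "{", "]", "}"
```

This naive tally would also count brackets inside string literals; the real engine avoids that by handing string content to a dedicated string scanner before resuming the byte dispatch.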
data/lib/json_completer/completion_engine.rb ADDED

@@ -0,0 +1,241 @@
+# frozen_string_literal: true
+
+class JsonCompleter
+  module CompletionEngine
+    def complete(partial_json)
+      input = partial_json
+      # Same byte-oriented trick as parse: compare ASCII JSON syntax as integers and avoid
+      # allocating transient 1-character strings in the streaming loop.
+      input_length = input.bytesize
+
+      if @state.nil? || @state.input_length > input_length
+        @state = ParsingState.new
+      end
+
+      return input if input.empty?
+      return input if valid_json_primitive_or_document?(input)
+
+      if @state.input_length == input_length && !@state.output_tokens.empty?
+        return finalize_completion(@state.output_tokens.dup, @state.context_stack.dup, @state.incomplete_string_token)
+      end
+
+      output_tokens = @state.output_tokens.dup
+      context_stack = @state.context_stack.dup
+      index = @state.last_index
+      incomplete_string_token = @state.incomplete_string_token
+
+      if incomplete_string_token && output_tokens.last&.start_with?('"') && output_tokens.last.end_with?('"')
+        output_tokens.pop
+      end
+
+      while index < input_length
+        if incomplete_string_token && index == @state.last_index
+          index, status = Scanners.scan_string(input, index, incomplete_string_token)
+
+          break unless %i[terminated invalid_unicode].include?(status)
+
+          output_tokens << incomplete_string_token.buffer.string
+          incomplete_string_token = nil
+
+          next
+        end
+
+        byte = input.getbyte(index)
+        last_significant_char_in_output = get_last_significant_char(output_tokens)
+
+        # ASCII byte values: 9/10/13/32 = whitespace, 34 = ", 44 = ,, 45 = -, 58 = :,
+        # 91/93 = [ ], 102/110/116 = f/n/t, 123/125 = { }.
+        case byte
+        when 9, 10, 13, 32
+          output_tokens << input.byteslice(index, 1)
+          index += 1
+        when 34
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          string_token = Scanners::CompletionStringToken.new
+          index, status = Scanners.scan_string(input, index + 1, string_token)
+
+          if %i[terminated invalid_unicode].include?(status)
+            output_tokens << string_token.buffer.string
+          else
+            incomplete_string_token = string_token
+          end
+        when 44
+          remove_trailing_comma(output_tokens)
+          output_tokens << ','
+          index += 1
+        when 45, 48..57
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          num_str, consumed = Scanners.scan_number_literal(input, index)
+          output_tokens << num_str
+          index += consumed
+        when 58
+          remove_trailing_comma(output_tokens) if last_significant_char_in_output == ','
+          output_tokens << ':'
+          index += 1
+        when 91
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+          output_tokens << '['
+          context_stack << '['
+          index += 1
+        when 93
+          output_tokens << ']'
+          context_stack.pop if !context_stack.empty? && context_stack.last == '['
+          index += 1
+        when 102
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['f'])
+          output_tokens << keyword_val
+          index += consumed
+        when 110
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['n'])
+          output_tokens << keyword_val
+          index += consumed
+        when 116
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['t'])
+          output_tokens << keyword_val
+          index += consumed
+        when 123
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+          output_tokens << '{'
+          context_stack << '{'
+          index += 1
+        when 125
+          remove_trailing_comma(output_tokens)
+          output_tokens << '}'
+          context_stack.pop if !context_stack.empty? && context_stack.last == '{'
+          index += 1
+        else
+          index += 1
+        end
+      end
+
+      @state = ParsingState.new(
+        output_tokens: output_tokens,
+        context_stack: context_stack,
+        last_index: index,
+        input_length: input_length,
+        incomplete_string_token: incomplete_string_token
+      )
+
+      finalize_completion(output_tokens.dup, context_stack.dup, incomplete_string_token)
+    end
+
+    private
+
+    def finalize_completion(output_tokens, context_stack, incomplete_string_token = nil)
+      output_tokens << incomplete_string_token.finalized_incomplete_value if incomplete_string_token
+
+      last_sig_char_final = get_last_significant_char(output_tokens)
+
+      unless context_stack.empty?
+        current_ctx = context_stack.last
+        if current_ctx == '{'
+          if last_sig_char_final == '"'
+            prev_sig_char = get_previous_significant_char(output_tokens)
+            output_tokens << ':' << 'null' if ['{', ','].include?(prev_sig_char)
+          elsif last_sig_char_final == ':'
+            output_tokens << 'null'
+          end
+        elsif current_ctx == '['
+          output_tokens << 'null' if last_sig_char_final == ','
+        end
+      end
+
+      until context_stack.empty?
+        opener = context_stack.pop
+        remove_trailing_comma(output_tokens)
+        output_tokens << (opener == '{' ? '}' : ']')
+      end
+
+      reassembled_json = output_tokens.join
+      return 'null' if reassembled_json.match?(/\A\s*[,:]\s*\z/)
+
+      reassembled_json
+    end
+
+    def get_last_significant_char(output_tokens)
+      (output_tokens.length - 1).downto(0) do |index|
+        stripped_token = output_tokens[index].strip
+        return stripped_token[-1] unless stripped_token.empty?
+      end
+
+      nil
+    end
+
+    def get_previous_significant_char(output_tokens)
+      significant_chars = []
+
+      (output_tokens.length - 1).downto(0) do |index|
+        stripped_token = output_tokens[index].strip
+        next if stripped_token.empty?
+
+        significant_chars << stripped_token[-1]
+        return significant_chars[1] if significant_chars.length >= 2
+      end
+
+      nil
+    end
+
+    def ensure_comma_before_new_item(output_tokens, context_stack, last_sig_char)
+      return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+      return if STRUCTURE_CHARS.include?(last_sig_char)
+      return unless context_stack.last == '[' || (context_stack.last == '{' && last_sig_char != ':')
+
+      output_tokens << ','
+    end
+
+    def ensure_colon_if_value_expected(output_tokens, context_stack, last_sig_char)
+      return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+      return unless context_stack.last == '{' && last_sig_char == '"'
+
+      output_tokens << ':'
+    end
+
+    def remove_trailing_comma(output_tokens)
+      last_token_idx = -1
+
+      (output_tokens.length - 1).downto(0) do |index|
+        next if output_tokens[index].strip.empty?
+
+        last_token_idx = index
+        break
+      end
+
+      return unless last_token_idx != -1 && output_tokens[last_token_idx].strip == ','
+
+      output_tokens.slice!(last_token_idx)
+
+      while last_token_idx.positive? && output_tokens[last_token_idx - 1].strip.empty?
+        output_tokens.slice!(last_token_idx - 1)
+        last_token_idx -= 1
+      end
+    end
+
+    def valid_json_primitive_or_document?(str)
+      return true if VALID_PRIMITIVES.include?(str)
+
+      if str.match?(/\A-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\z/) &&
+         !str.end_with?('.') && !str.match?(/[eE][+-]?$/)
+        return true
+      end
+
+      str.match?(/\A"(?:[^"\\]|\\.)*"\z/)
+    end
+  end
+
+  include CompletionEngine
+end
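The finalize step in the engine above (pad a dangling `:` with `null`, drop a trailing comma, then close every open structure in reverse order) can be distilled into a stand-alone miniature. The `tiny_complete` function below is hypothetical, not gem code, and deliberately ignores string contents, escapes, and incomplete string tokens, which the real engine handles via `Scanners.scan_string` and its persistent parsing state:

```ruby
# Hypothetical miniature of the completion idea: track a stack of openers,
# pad a dangling ":" or strip a trailing ",", then close openers in reverse.
# Unlike the real engine, it does not skip over string literals, so brackets
# inside strings would confuse it.
def tiny_complete(partial)
  stack = []
  partial.each_char do |ch|
    stack << ch if ch == '{' || ch == '['
    stack.pop if (ch == '}' && stack.last == '{') || (ch == ']' && stack.last == '[')
  end
  # A trailing ":" means a value is expected; a trailing "," is dropped.
  out = partial.sub(/[,:]\s*\z/) { |m| m.start_with?(':') ? ': null' : '' }
  out + stack.reverse.map { |c| c == '{' ? '}' : ']' }.join
end

tiny_complete('{"name": "John", "age":')
# => '{"name": "John", "age": null}'
tiny_complete('[1, 2, {"key": "value"')
# => '[1, 2, {"key": "value"}]'
```

Under these simplifying assumptions the miniature reproduces the `.complete` examples from the README; the real engine additionally inserts missing commas and colons mid-stream and finalizes incomplete strings.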