json_completer 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2fe03f14437a3cfd88193b2cfc7a7f156e116632b1937a42ea6c4a1aefffa7c2
- data.tar.gz: b20c23c7843a3ff8f18f0110ed824b54d1306eb440309186efc00c71317d33ba
+ metadata.gz: 5d460af0d48e2cecf87411ba30d2a6aeac00fe38208d0222bf8b7218e373a2cc
+ data.tar.gz: 448401c51bc04e0a38fae036d3a64d94e5090846c39f69d081a8032b0b58e80a
  SHA512:
- metadata.gz: 9815201cb51addf45defae03cb710502ed93091208ec13c404865fdbbd58be2b20773334e3528beef4d82bd93cb3de81d2de22e5ecad996aebdded6e3a138b87
- data.tar.gz: 261db1237466e85281eb969d90b7f6d90555c72955df728d001122db6f0d0cfd1546e399d3f7185d869d0e689108d49cb63347fffd5f5484dfe24c311e22d193
+ metadata.gz: 256f6ba460ef729a9babe9f9355f0d888c3c7f3dc64e3b4c85c2ae69ad6cb6c3a4b107aced36fe55b04e06302034b672f63c94c6db03b074a0462223eee8d5d1
+ data.tar.gz: 101f08a619d56129398b751815077897e3f557d581b76d075e41f57098e93ec92aa14d9f7edb1c88dbb8179f26d3aaf0fde3d81cfa986d2c912aa70b77294074
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # JsonCompleter

- A Ruby gem that converts partial JSON strings into valid JSON with high-performance incremental parsing. Efficiently processes streaming JSON with O(n) complexity for new data by maintaining parsing state between chunks. Handles truncated primitives, missing values, and unclosed structures without reprocessing previously parsed data.
+ A Ruby gem for incremental parsing of partial and incomplete JSON streams. It is built for streaming output from LLM providers such as OpenAI and Anthropic, and processes each new chunk in O(n) time by maintaining parser state between calls. Use `.parse` for parsed Ruby values and `.complete` when you need completed JSON text.

  ## Installation

@@ -26,56 +26,73 @@ gem install json_completer

  ### Basic Usage

- Complete partial JSON strings in one call:
+ Use `.parse` when you want the current parsed Ruby value directly from a partial stream:

  ```ruby
  require 'json_completer'

- # Complete truncated JSON
- JsonCompleter.complete('{"name": "John", "age":')
- # => '{"name": "John", "age": null}'
+ # Parse partial JSON into Ruby objects
+ JsonCompleter.parse('{"name": "John", "age":')
+ # => {"name" => "John", "age" => nil}

  # Handle incomplete strings
- JsonCompleter.complete('{"message": "Hello wo')
- # => '{"message": "Hello wo"}'
+ JsonCompleter.parse('{"message": "Hello wo')
+ # => {"message" => "Hello wo"}

- # Fix unclosed structures
- JsonCompleter.complete('[1, 2, {"key": "value"')
- # => '[1, 2, {"key": "value"}]'
+ # Close unclosed structures
+ JsonCompleter.parse('[1, 2, {"key": "value"')
+ # => [1, 2, {"key" => "value"}]
  ```

  ### Incremental Processing

- For streaming scenarios where JSON arrives in chunks. Each call processes only new data (O(n) complexity) by maintaining parsing state, making it highly efficient for large streaming responses:
+ For streaming scenarios where JSON arrives in chunks, each call processes only new data (O(n) complexity) by maintaining parsing state:

  ```ruby
  completer = JsonCompleter.new

  # Process first chunk
- result1 = completer.complete('{"users": [{"name": "')
- # => '{"users": [{"name": ""}]}'
+ result1 = completer.parse('{"users": [{"name": "')
+ # => {"users" => [{"name" => ""}]}

  # Process additional data
- result2 = completer.complete('{"users": [{"name": "Alice"}')
- # => '{"users": [{"name": "Alice"}]}'
+ result2 = completer.parse('{"users": [{"name": "Alice"}')
+ # => {"users" => [{"name" => "Alice"}]}
+
+ # Final parsed value
+ result3 = completer.parse('{"users": [{"name": "Alice"}, {"name": "Bob"}]}')
+ # => {"users" => [{"name" => "Alice"}, {"name" => "Bob"}]}
+ ```
+
+ Stateful `JsonCompleter` instances assume append-only input. If earlier bytes change, create a new instance; if the input is truncated to a shorter prefix, state resets automatically.
+
+ ### String Output with `.complete`

- # Final complete JSON
- result3 = completer.complete('{"users": [{"name": "Alice"}, {"name": "Bob"}]}')
- # => '{"users": [{"name": "Alice"}, {"name": "Bob"}]}'
+ Use `.complete` when you need completed JSON text instead of parsed Ruby objects:
+
+ ```ruby
+ JsonCompleter.complete('{"name": "John", "age":')
+ # => '{"name": "John", "age": null}'
+
+ JsonCompleter.complete('[1, 2, {"key": "value"')
+ # => '[1, 2, {"key": "value"}]'
  ```

+ Reach for `.complete` when another layer expects JSON text and you want `json_completer` to materialize the current partial state as valid JSON.
+
  #### Performance Characteristics

  - **Zero reprocessing**: Maintains parsing state to avoid reparsing previously processed data
  - **Linear complexity**: Each chunk processed in O(n) time where n = new data size, not total size
  - **Memory efficient**: Uses token-based accumulation with minimal state overhead
+ - **Byte-oriented string scanning**: Walks JSON input as bytes and copies contiguous non-escape string content in slices to reduce per-character overhead on long streamed strings
  - **Context preservation**: Tracks nested structures without full document analysis

  ### Common Use Cases

- - **High-performance streaming JSON**: Process large JSON responses efficiently as data arrives over network connections
- - **Truncated API responses**: Complete JSON that was cut off due to size limits
- - **Log parsing**: Handle incomplete JSON entries in log files
+ - **LLM streaming output**: Parse partial JSON emitted token-by-token from providers such as OpenAI and Anthropic
+ - **Incremental structured output parsing**: Keep a live Ruby object up to date while more JSON arrives
+ - **JSON text completion**: Produce valid JSON text snapshots for downstream consumers that require a string

  ## Contributing

@@ -0,0 +1,241 @@
+ # frozen_string_literal: true
+
+ class JsonCompleter
+   module CompletionEngine
+     def complete(partial_json)
+       input = partial_json
+       # Same byte-oriented trick as parse: compare ASCII JSON syntax as integers and avoid
+       # allocating transient 1-character strings in the streaming loop.
+       input_length = input.bytesize
+
+       if @state.nil? || @state.input_length > input_length
+         @state = ParsingState.new
+       end
+
+       return input if input.empty?
+       return input if valid_json_primitive_or_document?(input)
+
+       if @state.input_length == input_length && !@state.output_tokens.empty?
+         return finalize_completion(@state.output_tokens.dup, @state.context_stack.dup, @state.incomplete_string_token)
+       end
+
+       output_tokens = @state.output_tokens.dup
+       context_stack = @state.context_stack.dup
+       index = @state.last_index
+       incomplete_string_token = @state.incomplete_string_token
+
+       if incomplete_string_token && output_tokens.last&.start_with?('"') && output_tokens.last.end_with?('"')
+         output_tokens.pop
+       end
+
+       while index < input_length
+         if incomplete_string_token && index == @state.last_index
+           index, status = Scanners.scan_string(input, index, incomplete_string_token)
+
+           break unless %i[terminated invalid_unicode].include?(status)
+
+           output_tokens << incomplete_string_token.buffer.string
+           incomplete_string_token = nil
+
+           next
+         end
+
+         byte = input.getbyte(index)
+         last_significant_char_in_output = get_last_significant_char(output_tokens)
+
+         # ASCII byte values: 9/10/13/32 = whitespace, 34 = ", 44 = ,, 45 = -, 58 = :,
+         # 91/93 = [] , 102/110/116 = f/n/t, 123/125 = {}.
+         case byte
+         when 9, 10, 13, 32
+           output_tokens << input.byteslice(index, 1)
+           index += 1
+         when 34
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+           string_token = Scanners::CompletionStringToken.new
+           index, status = Scanners.scan_string(input, index + 1, string_token)
+
+           if %i[terminated invalid_unicode].include?(status)
+             output_tokens << string_token.buffer.string
+           else
+             incomplete_string_token = string_token
+           end
+         when 44
+           remove_trailing_comma(output_tokens)
+           output_tokens << ','
+           index += 1
+         when 45, 48..57
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+           num_str, consumed = Scanners.scan_number_literal(input, index)
+           output_tokens << num_str
+           index += consumed
+         when 58
+           remove_trailing_comma(output_tokens) if last_significant_char_in_output == ','
+           output_tokens << ':'
+           index += 1
+         when 91
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+           output_tokens << '['
+           context_stack << '['
+           index += 1
+         when 93
+           output_tokens << ']'
+           context_stack.pop if !context_stack.empty? && context_stack.last == '['
+           index += 1
+         when 102
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+           keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['f'])
+           output_tokens << keyword_val
+           index += consumed
+         when 110
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+           keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['n'])
+           output_tokens << keyword_val
+           index += consumed
+         when 116
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+           keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['t'])
+           output_tokens << keyword_val
+           index += consumed
+         when 123
+           ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+           ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+           output_tokens << '{'
+           context_stack << '{'
+           index += 1
+         when 125
+           remove_trailing_comma(output_tokens)
+           output_tokens << '}'
+           context_stack.pop if !context_stack.empty? && context_stack.last == '{'
+           index += 1
+         else
+           index += 1
+         end
+       end
+
+       @state = ParsingState.new(
+         output_tokens: output_tokens,
+         context_stack: context_stack,
+         last_index: index,
+         input_length: input_length,
+         incomplete_string_token: incomplete_string_token
+       )
+
+       finalize_completion(output_tokens.dup, context_stack.dup, incomplete_string_token)
+     end
+
+     private
+
+     def finalize_completion(output_tokens, context_stack, incomplete_string_token = nil)
+       output_tokens << incomplete_string_token.finalized_incomplete_value if incomplete_string_token
+
+       last_sig_char_final = get_last_significant_char(output_tokens)
+
+       unless context_stack.empty?
+         current_ctx = context_stack.last
+         if current_ctx == '{'
+           if last_sig_char_final == '"'
+             prev_sig_char = get_previous_significant_char(output_tokens)
+             output_tokens << ':' << 'null' if ['{', ','].include?(prev_sig_char)
+           elsif last_sig_char_final == ':'
+             output_tokens << 'null'
+           end
+         elsif current_ctx == '['
+           output_tokens << 'null' if last_sig_char_final == ','
+         end
+       end
+
+       until context_stack.empty?
+         opener = context_stack.pop
+         remove_trailing_comma(output_tokens)
+         output_tokens << (opener == '{' ? '}' : ']')
+       end
+
+       reassembled_json = output_tokens.join
+       return 'null' if reassembled_json.match?(/\A\s*[,:]\s*\z/)
+
+       reassembled_json
+     end
+
+     def get_last_significant_char(output_tokens)
+       (output_tokens.length - 1).downto(0) do |index|
+         stripped_token = output_tokens[index].strip
+         return stripped_token[-1] unless stripped_token.empty?
+       end
+
+       nil
+     end
+
+     def get_previous_significant_char(output_tokens)
+       significant_chars = []
+
+       (output_tokens.length - 1).downto(0) do |index|
+         stripped_token = output_tokens[index].strip
+         next if stripped_token.empty?
+
+         significant_chars << stripped_token[-1]
+         return significant_chars[1] if significant_chars.length >= 2
+       end
+
+       nil
+     end
+
+     def ensure_comma_before_new_item(output_tokens, context_stack, last_sig_char)
+       return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+       return if STRUCTURE_CHARS.include?(last_sig_char)
+       return unless context_stack.last == '[' || (context_stack.last == '{' && last_sig_char != ':')
+
+       output_tokens << ','
+     end
+
+     def ensure_colon_if_value_expected(output_tokens, context_stack, last_sig_char)
+       return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+       return unless context_stack.last == '{' && last_sig_char == '"'
+
+       output_tokens << ':'
+     end
+
+     def remove_trailing_comma(output_tokens)
+       last_token_idx = -1
+
+       (output_tokens.length - 1).downto(0) do |index|
+         next if output_tokens[index].strip.empty?
+
+         last_token_idx = index
+         break
+       end
+
+       return unless last_token_idx != -1 && output_tokens[last_token_idx].strip == ','
+
+       output_tokens.slice!(last_token_idx)
+
+       while last_token_idx.positive? && output_tokens[last_token_idx - 1].strip.empty?
+         output_tokens.slice!(last_token_idx - 1)
+         last_token_idx -= 1
+       end
+     end
+
+     def valid_json_primitive_or_document?(str)
+       return true if VALID_PRIMITIVES.include?(str)
+
+       if str.match?(/\A-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\z/) &&
+          !str.end_with?('.') && !str.match?(/[eE][+-]?$/)
+         return true
+       end
+
+       str.match?(/\A"(?:[^"\\]|\\.)*"\z/)
+     end
+   end
+
+   include CompletionEngine
+ end
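The core closing strategy in `finalize_completion` above can be illustrated with a standalone sketch. This is a simplified illustration, not the gem's API: `close_structures` is a hypothetical helper that only tracks open strings and brackets, without the comma/colon repair and `null` insertion the real engine performs.

```ruby
# Simplified sketch of the structure-closing idea: scan the input once,
# track open brackets/braces on a stack while skipping string contents,
# then append a closing quote and the matching closers in reverse order.
def close_structures(partial)
  stack = []
  in_string = false
  escaped = false

  partial.each_char do |ch|
    if in_string
      if escaped
        escaped = false
      elsif ch == '\\'
        escaped = true
      elsif ch == '"'
        in_string = false
      end
      next
    end

    case ch
    when '"' then in_string = true
    when '{', '[' then stack << ch
    when '}', ']' then stack.pop
    end
  end

  completed = partial.dup
  completed << '"' if in_string                  # terminate an open string
  stack.reverse_each { |opener| completed << (opener == '{' ? '}' : ']') }
  completed
end

close_structures('[1, 2, {"key": "value"')
# => '[1, 2, {"key": "value"}]'
```

The engine's byte dispatch (`input.getbyte` with integer `case` branches) is an optimization over this character-based loop, but the stack discipline is the same.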