json_completer 1.0.0 → 1.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +38 -21
- data/lib/json_completer/completion_engine.rb +241 -0
- data/lib/json_completer/parser_engine.rb +386 -0
- data/lib/json_completer/scanners.rb +448 -0
- data/lib/json_completer.rb +36 -688
- metadata +5 -2
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5d460af0d48e2cecf87411ba30d2a6aeac00fe38208d0222bf8b7218e373a2cc
+  data.tar.gz: 448401c51bc04e0a38fae036d3a64d94e5090846c39f69d081a8032b0b58e80a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 256f6ba460ef729a9babe9f9355f0d888c3c7f3dc64e3b4c85c2ae69ad6cb6c3a4b107aced36fe55b04e06302034b672f63c94c6db03b074a0462223eee8d5d1
+  data.tar.gz: 101f08a619d56129398b751815077897e3f557d581b76d075e41f57098e93ec92aa14d9f7edb1c88dbb8179f26d3aaf0fde3d81cfa986d2c912aa70b77294074
data/README.md CHANGED

@@ -1,6 +1,6 @@
 # JsonCompleter
 
-A Ruby gem
+A Ruby gem for incremental parsing of partial and incomplete JSON streams. It is built for streaming output from LLM providers such as OpenAI and Anthropic, and processes each new chunk in O(n) time by maintaining parser state between calls. Use `.parse` for parsed Ruby values and `.complete` when you specifically need completed JSON text.
 
 ## Installation
 
@@ -26,56 +26,73 @@ gem install json_completer
 
 ### Basic Usage
 
-
+Use `.parse` when you want the current parsed Ruby value directly from a partial stream:
 
 ```ruby
 require 'json_completer'
 
-#
-JsonCompleter.
-# =>
+# Parse partial JSON into Ruby objects
+JsonCompleter.parse('{"name": "John", "age":')
+# => {"name" => "John", "age" => nil}
 
 # Handle incomplete strings
-JsonCompleter.
-# =>
+JsonCompleter.parse('{"message": "Hello wo')
+# => {"message" => "Hello wo"}
 
-#
-JsonCompleter.
-# =>
+# Close unclosed structures
+JsonCompleter.parse('[1, 2, {"key": "value"')
+# => [1, 2, {"key" => "value"}]
 ```
 
 ### Incremental Processing
 
-For streaming scenarios where JSON arrives in chunks. Each call processes only new data (O(n) complexity) by maintaining parsing state
+For streaming scenarios where JSON arrives in chunks, each call processes only new data (O(n) complexity) by maintaining parsing state:
 
 ```ruby
 completer = JsonCompleter.new
 
 # Process first chunk
-result1 = completer.
-# =>
+result1 = completer.parse('{"users": [{"name": "')
+# => {"users" => [{"name" => ""}]}
 
 # Process additional data
-result2 = completer.
-# =>
+result2 = completer.parse('{"users": [{"name": "Alice"}')
+# => {"users" => [{"name" => "Alice"}]}
+
+# Final parsed value
+result3 = completer.parse('{"users": [{"name": "Alice"}, {"name": "Bob"}]}')
+# => {"users" => [{"name" => "Alice"}, {"name" => "Bob"}]}
+```
+
+Stateful `JsonCompleter` instances assume append-only input. If earlier bytes change, create a new instance; truncating the input to a shorter prefix resets state automatically.
+
+### String Output with `.complete`
 
-
-
-
+Use `.complete` when you specifically need completed JSON text instead of parsed Ruby objects:
+
+```ruby
+JsonCompleter.complete('{"name": "John", "age":')
+# => '{"name": "John", "age": null}'
+
+JsonCompleter.complete('[1, 2, {"key": "value"')
+# => '[1, 2, {"key": "value"}]'
 ```
 
+Reach for `.complete` when another layer expects JSON text and you want `json_completer` to materialize the current partial state as valid JSON; otherwise prefer `.parse`.
+
 #### Performance Characteristics
 
 - **Zero reprocessing**: Maintains parsing state to avoid reparsing previously processed data
 - **Linear complexity**: Each chunk is processed in O(n) time, where n is the size of the new data, not the total size
 - **Memory efficient**: Uses token-based accumulation with minimal state overhead
+- **Byte-oriented string scanning**: Walks JSON input as bytes and copies contiguous non-escape string content in slices, reducing per-character overhead on long streamed strings
 - **Context preservation**: Tracks nested structures without full document analysis
 
 ### Common Use Cases
 
-- **
-- **
-- **
+- **LLM streaming output**: Parse partial JSON emitted token-by-token from providers such as OpenAI and Anthropic
+- **Incremental structured output parsing**: Keep a live Ruby object up to date while more JSON arrives
+- **JSON text completion**: Produce valid JSON text snapshots for downstream consumers that require a string
 
 ## Contributing
 
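The byte-oriented scanning described in the performance notes above can be illustrated with a small, gem-independent sketch (the `count_structural_bytes` helper below is hypothetical, not part of json_completer): comparing `String#getbyte` results against integer ASCII codes avoids allocating a transient 1-character string per position, which is the same trick the gem applies to JSON syntax characters.

```ruby
# Hypothetical illustration (not gem code): tally JSON structural characters
# by comparing integer byte values instead of allocating per-character strings.
def count_structural_bytes(json)
  counts = Hash.new(0)
  i = 0
  while i < json.bytesize
    case json.getbyte(i)            # returns an Integer, no String allocation
    when 123 then counts['{'] += 1  # {
    when 125 then counts['}'] += 1  # }
    when 91  then counts['['] += 1  # [
    when 93  then counts[']'] += 1  # ]
    end
    i += 1
  end
  counts
end

count_structural_bytes('[1, {"a": [2]}')
# counts two "[", and one each of "{", "]", "}"
```

This naive tally would also count brackets inside string literals; the real engine avoids that by handing string content to a dedicated string scanner before resuming the byte dispatch.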
data/lib/json_completer/completion_engine.rb ADDED

@@ -0,0 +1,241 @@
+# frozen_string_literal: true
+
+class JsonCompleter
+  module CompletionEngine
+    def complete(partial_json)
+      input = partial_json
+      # Same byte-oriented trick as parse: compare ASCII JSON syntax as integers and avoid
+      # allocating transient 1-character strings in the streaming loop.
+      input_length = input.bytesize
+
+      if @state.nil? || @state.input_length > input_length
+        @state = ParsingState.new
+      end
+
+      return input if input.empty?
+      return input if valid_json_primitive_or_document?(input)
+
+      if @state.input_length == input_length && !@state.output_tokens.empty?
+        return finalize_completion(@state.output_tokens.dup, @state.context_stack.dup, @state.incomplete_string_token)
+      end
+
+      output_tokens = @state.output_tokens.dup
+      context_stack = @state.context_stack.dup
+      index = @state.last_index
+      incomplete_string_token = @state.incomplete_string_token
+
+      if incomplete_string_token && output_tokens.last&.start_with?('"') && output_tokens.last.end_with?('"')
+        output_tokens.pop
+      end
+
+      while index < input_length
+        if incomplete_string_token && index == @state.last_index
+          index, status = Scanners.scan_string(input, index, incomplete_string_token)
+
+          break unless %i[terminated invalid_unicode].include?(status)
+
+          output_tokens << incomplete_string_token.buffer.string
+          incomplete_string_token = nil
+
+          next
+        end
+
+        byte = input.getbyte(index)
+        last_significant_char_in_output = get_last_significant_char(output_tokens)
+
+        # ASCII byte values: 9/10/13/32 = whitespace, 34 = ", 44 = ,, 45 = -, 58 = :,
+        # 91/93 = [ ], 102/110/116 = f/n/t, 123/125 = { }.
+        case byte
+        when 9, 10, 13, 32
+          output_tokens << input.byteslice(index, 1)
+          index += 1
+        when 34
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          string_token = Scanners::CompletionStringToken.new
+          index, status = Scanners.scan_string(input, index + 1, string_token)
+
+          if %i[terminated invalid_unicode].include?(status)
+            output_tokens << string_token.buffer.string
+          else
+            incomplete_string_token = string_token
+          end
+        when 44
+          remove_trailing_comma(output_tokens)
+          output_tokens << ','
+          index += 1
+        when 45, 48..57
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          num_str, consumed = Scanners.scan_number_literal(input, index)
+          output_tokens << num_str
+          index += consumed
+        when 58
+          remove_trailing_comma(output_tokens) if last_significant_char_in_output == ','
+          output_tokens << ':'
+          index += 1
+        when 91
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+          output_tokens << '['
+          context_stack << '['
+          index += 1
+        when 93
+          output_tokens << ']'
+          context_stack.pop if !context_stack.empty? && context_stack.last == '['
+          index += 1
+        when 102
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['f'])
+          output_tokens << keyword_val
+          index += consumed
+        when 110
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['n'])
+          output_tokens << keyword_val
+          index += consumed
+        when 116
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+
+          keyword_val, consumed = Scanners.scan_keyword_literal(input, index, KEYWORD_MAP['t'])
+          output_tokens << keyword_val
+          index += consumed
+        when 123
+          ensure_comma_before_new_item(output_tokens, context_stack, last_significant_char_in_output)
+          ensure_colon_if_value_expected(output_tokens, context_stack, last_significant_char_in_output)
+          output_tokens << '{'
+          context_stack << '{'
+          index += 1
+        when 125
+          remove_trailing_comma(output_tokens)
+          output_tokens << '}'
+          context_stack.pop if !context_stack.empty? && context_stack.last == '{'
+          index += 1
+        else
+          index += 1
+        end
+      end
+
+      @state = ParsingState.new(
+        output_tokens: output_tokens,
+        context_stack: context_stack,
+        last_index: index,
+        input_length: input_length,
+        incomplete_string_token: incomplete_string_token
+      )
+
+      finalize_completion(output_tokens.dup, context_stack.dup, incomplete_string_token)
+    end
+
+    private
+
+    def finalize_completion(output_tokens, context_stack, incomplete_string_token = nil)
+      output_tokens << incomplete_string_token.finalized_incomplete_value if incomplete_string_token
+
+      last_sig_char_final = get_last_significant_char(output_tokens)
+
+      unless context_stack.empty?
+        current_ctx = context_stack.last
+        if current_ctx == '{'
+          if last_sig_char_final == '"'
+            prev_sig_char = get_previous_significant_char(output_tokens)
+            output_tokens << ':' << 'null' if ['{', ','].include?(prev_sig_char)
+          elsif last_sig_char_final == ':'
+            output_tokens << 'null'
+          end
+        elsif current_ctx == '['
+          output_tokens << 'null' if last_sig_char_final == ','
+        end
+      end
+
+      until context_stack.empty?
+        opener = context_stack.pop
+        remove_trailing_comma(output_tokens)
+        output_tokens << (opener == '{' ? '}' : ']')
+      end
+
+      reassembled_json = output_tokens.join
+      return 'null' if reassembled_json.match?(/\A\s*[,:]\s*\z/)
+
+      reassembled_json
+    end
+
+    def get_last_significant_char(output_tokens)
+      (output_tokens.length - 1).downto(0) do |index|
+        stripped_token = output_tokens[index].strip
+        return stripped_token[-1] unless stripped_token.empty?
+      end
+
+      nil
+    end
+
+    def get_previous_significant_char(output_tokens)
+      significant_chars = []
+
+      (output_tokens.length - 1).downto(0) do |index|
+        stripped_token = output_tokens[index].strip
+        next if stripped_token.empty?
+
+        significant_chars << stripped_token[-1]
+        return significant_chars[1] if significant_chars.length >= 2
+      end
+
+      nil
+    end
+
+    def ensure_comma_before_new_item(output_tokens, context_stack, last_sig_char)
+      return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+      return if STRUCTURE_CHARS.include?(last_sig_char)
+      return unless context_stack.last == '[' || (context_stack.last == '{' && last_sig_char != ':')
+
+      output_tokens << ','
+    end
+
+    def ensure_colon_if_value_expected(output_tokens, context_stack, last_sig_char)
+      return if output_tokens.empty? || context_stack.empty? || last_sig_char.nil?
+      return unless context_stack.last == '{' && last_sig_char == '"'
+
+      output_tokens << ':'
+    end
+
+    def remove_trailing_comma(output_tokens)
+      last_token_idx = -1
+
+      (output_tokens.length - 1).downto(0) do |index|
+        next if output_tokens[index].strip.empty?
+
+        last_token_idx = index
+        break
+      end
+
+      return unless last_token_idx != -1 && output_tokens[last_token_idx].strip == ','
+
+      output_tokens.slice!(last_token_idx)
+
+      while last_token_idx.positive? && output_tokens[last_token_idx - 1].strip.empty?
+        output_tokens.slice!(last_token_idx - 1)
+        last_token_idx -= 1
+      end
+    end
+
+    def valid_json_primitive_or_document?(str)
+      return true if VALID_PRIMITIVES.include?(str)
+
+      if str.match?(/\A-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\z/) &&
+         !str.end_with?('.') && !str.match?(/[eE][+-]?$/)
+        return true
+      end
+
+      str.match?(/\A"(?:[^"\\]|\\.)*"\z/)
+    end
+  end
+
+  include CompletionEngine
+end
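The finalize step in the engine above (pad a dangling `:` with `null`, drop a trailing comma, then close every open structure in reverse order) can be distilled into a stand-alone miniature. The `tiny_complete` function below is hypothetical, not gem code, and deliberately ignores string contents, escapes, and incomplete string tokens, which the real engine handles via `Scanners.scan_string` and its persistent parsing state:

```ruby
# Hypothetical miniature of the completion idea: track a stack of openers,
# pad a dangling ":" or strip a trailing ",", then close openers in reverse.
# Unlike the real engine, it does not skip over string literals, so brackets
# inside strings would confuse it.
def tiny_complete(partial)
  stack = []
  partial.each_char do |ch|
    stack << ch if ch == '{' || ch == '['
    stack.pop if (ch == '}' && stack.last == '{') || (ch == ']' && stack.last == '[')
  end
  # A trailing ":" means a value is expected; a trailing "," is dropped.
  out = partial.sub(/[,:]\s*\z/) { |m| m.start_with?(':') ? ': null' : '' }
  out + stack.reverse.map { |c| c == '{' ? '}' : ']' }.join
end

tiny_complete('{"name": "John", "age":')
# => '{"name": "John", "age": null}'
tiny_complete('[1, 2, {"key": "value"')
# => '[1, 2, {"key": "value"}]'
```

Under these simplifying assumptions the miniature reproduces the `.complete` examples from the README; the real engine additionally inserts missing commas and colons mid-stream and finalizes incomplete strings.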