smarter_json 0.8.0 → 0.9.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '06668bbb40626009794f8e5387fb13ae1a31346a07200d2825fde4872904bd68'
- data.tar.gz: 0db0d42bfd2e85a1af4b897990a290e25e4fbe0afd9b4e25e99d9520667de2c3
+ metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
+ data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
  SHA512:
- metadata.gz: 766cd9c5865d7218f79db57ec538bb1d34355c93012e38820544a41bbedbbca94a4a8fce05982f5f2a0e48c5670a6d3f4336eb1a22361fc29b34855d913fc564
- data.tar.gz: f57396ef1a2ac4e48e06dad94122b8159a3b17c80f6d679e1bb59f85874578ea444592a828c3eb324c8d84ff57c255df0bc99af2aeb5603261aa19d26abc88c2
+ metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
+ data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29
data/CHANGELOG.md CHANGED
@@ -3,6 +3,19 @@
3
3
 
4
4
  > 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
5
5
 
6
+ ## 0.9.2 (2026-06-03)
7
+ - **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
8
+
9
+ ## 0.9.1 (2026-06-03 unreleased)
10
+ - **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
11
+ - **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned — the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
12
+ - Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
13
+ - Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
14
+
15
+ ## 0.9.0 (2026-06-03 unreleased)
16
+ - performance improvements
17
+ - code cleanup
18
+
6
19
  ## 0.8.0 (2026-06-03)
7
20
  - **Robustness** against LLM-generated / wrapped JSON:
8
21
  - strips markdown code fences (```json / ```)
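The "reactive" recovery described in the 0.9.1 entry above (parse first, extract a wrapped payload only when that parse fails) can be illustrated with a small standalone sketch. This is not the gem's implementation: it uses the stdlib `json` parser as a stand-in for SmarterJSON's parser, handles only a markdown code fence, and the names `reactive_parse` and `FENCE_RE` are hypothetical.

```ruby
require "json"

# Hypothetical pattern: captures the body of the first markdown code fence.
FENCE_RE = /```(?:json)?\s*\n(.*?)\n```/m

# Parse first. Only when the parse raises do we look for a fenced payload,
# so clean input never pays for the wrapper scan, even if one of its string
# values happens to contain ``` itself.
def reactive_parse(input)
  JSON.parse(input)
rescue JSON::ParserError
  fenced = input[FENCE_RE, 1]
  raise if fenced.nil? # no wrapper found: re-raise the original parse error

  JSON.parse(fenced)
end

# Clean JSON whose string value contains a fence marker: single parse, no scan.
reactive_parse('{"note":"use ```json fences"}')
# => {"note"=>"use ```json fences"}

# Wrapped payload: the first parse fails, then the fence body is recovered.
reactive_parse("Here is the JSON:\n\n```json\n{\"a\":1}\n```\n")
# => {"a"=>1}
```

The design point is the ordering: the cost of the recovery scan is paid only on the failure path, which is what keeps markdown-bearing but otherwise valid documents on the fast path.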
data/README.md CHANGED
@@ -16,7 +16,7 @@ Three things set it apart:
16
16
 
17
17
  1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
18
18
 
19
- 2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** Pass a block to iterate the recovered documents one at a time.
19
+ 2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
20
20
 
21
21
  3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
22
22
 
@@ -46,6 +46,12 @@ gem install smarter_json
46
46
 
47
47
  The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
48
48
 
49
+ ## API stability and thread safety
50
+
51
+ The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
52
+
53
+ Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
54
+
49
55
  ## Documentation
50
56
 
51
57
  * [Introduction](docs/_introduction.md)
@@ -68,14 +74,44 @@ SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2
68
74
  SmarterJSON.process('{"id":1}') # => {"id"=>1} (one document → the value itself)
69
75
  SmarterJSON.process("") # => nil (zero documents)
70
76
 
71
- # Iterate one recovered document at a time with a block
77
+ # For input larger than memory, stream one document at a time with a block
72
78
  # (process and process_file both forward the block):
73
79
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
74
80
 
75
81
  # Wrapper noise is stripped automatically:
76
- SmarterJSON.process("Here is the JSON:\n\n```json\n{\"a\":1}\n```\n") # => {"a"=>1}
77
- SmarterJSON.process("<json>{\"a\":1}</json>") # => {"a"=>1}
78
- SmarterJSON.process("first:\n{\"a\":1}\nsecond:\n{\"b\":2}") # => [{"a"=>1}, {"b"=>2}]
82
+ SmarterJSON.process(<<~TEXT)
83
+ Here is the JSON:
84
+
85
+ ```json
86
+ {
87
+ "a": 1
88
+ }
89
+ ```
90
+ TEXT
91
+ # => {"a"=>1}
92
+
93
+ SmarterJSON.process(<<~TEXT)
94
+ Here is the result:
95
+
96
+ {
97
+ "a": 1
98
+ }
99
+
100
+ Hope this helps.
101
+ TEXT
102
+ # => {"a"=>1}
103
+
104
+ SmarterJSON.process("<json>{\"a\":1}</json>")
105
+ # => {"a"=>1}
106
+
107
+ SmarterJSON.process(<<~TEXT)
108
+ first attempt:
109
+ {"a":1}
110
+
111
+ corrected payload:
112
+ {"b":2}
113
+ TEXT
114
+ # => [{"a"=>1}, {"b"=>2}]
79
115
  ```
80
116
 
81
117
  ### Options
@@ -112,7 +148,7 @@ Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpu
112
148
 
113
149
  **Two notes on fair comparison:**
114
150
 
115
- - **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the block form yields each recovered document instead of returning the collected Array.
151
+ - **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
116
152
  - **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
117
153
 
118
154
  ## Encoding
@@ -21,7 +21,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
21
21
 
22
22
  * **One reader, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the reader to match your input; it adapts to whatever you give it.
23
23
 
24
- * **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** Pass a block to iterate the recovered documents one at a time. See [The Basic Read API](./basic_read_api.md).
24
+ * **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
25
25
 
26
26
  * **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser. Floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
27
27
 
@@ -24,6 +24,39 @@ SmarterJSON.process("host: localhost\nport: 5432") # => {"host"=>"localhost"
24
24
 
25
25
  `process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths. When the input wraps the payload in obvious markdown / prose / tags, `process` strips that wrapper first and then parses the recovered payload(s).
26
26
 
27
+ ```ruby
28
+ SmarterJSON.process(<<~TEXT)
29
+ Here is the JSON:
30
+
31
+ ```json
32
+ {
33
+ "a": 1
34
+ }
35
+ ```
36
+ TEXT
37
+ # => {"a"=>1}
38
+
39
+ SmarterJSON.process(<<~TEXT)
40
+ Here is the result:
41
+
42
+ {
43
+ "a": 1
44
+ }
45
+
46
+ Hope this helps.
47
+ TEXT
48
+ # => {"a"=>1}
49
+
50
+ SmarterJSON.process(<<~TEXT)
51
+ first attempt:
52
+ {"a":1}
53
+
54
+ corrected payload:
55
+ {"b":2}
56
+ TEXT
57
+ # => [{"a"=>1}, {"b"=>2}]
58
+ ```
59
+
27
60
  ```ruby
28
61
  SmarterJSON.process(io) # an open IO (File, StringIO, socket, …) — reads it and parses
29
62
  SmarterJSON.process(some_string) # JSON content
@@ -49,9 +82,9 @@ SmarterJSON.process_file("config.json5") # read the file, then parse — sam
49
82
 
50
83
  `process_file` opens the file, reads it with the labeled [`encoding:`](./options.md) (default `"UTF-8"`, no transcoding pass), and parses it.
51
84
 
52
- ## Streaming with a block
85
+ ## Streaming with a block (bounded memory)
53
86
 
54
- Pass a block to have each recovered top-level document yielded one at a time; the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
87
+ For input larger than memory, pass a block. Each recovered top-level document is yielded as it is framed, and the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
55
88
 
56
89
  ```ruby
57
90
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
data/docs/examples.md CHANGED
@@ -65,9 +65,9 @@ SmarterJSON.process('{"id":1}') # => {"id"=>1}
65
65
  SmarterJSON.process("") # => nil
66
66
  ```
67
67
 
68
- ### Example 5: Iterate Documents with a Block
68
+ ### Example 5: Streaming a Large File with a Block
69
69
 
70
- Pass a block to receive each recovered document one at a time:
70
+ For input larger than memory, pass a block. Each recovered document is yielded one at a time:
71
71
 
72
72
  ```ruby
73
73
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
@@ -116,6 +116,8 @@ A `#`/`//` only starts a comment when preceded by whitespace, so `http://example
116
116
 
117
117
  ### Example 10: Wrapper Noise Around a Payload
118
118
 
119
+ #### Fenced payload
120
+
119
121
  ```ruby
120
122
  SmarterJSON.process(<<~TEXT)
121
123
  Here is the JSON:
@@ -127,15 +129,38 @@ SmarterJSON.process(<<~TEXT)
127
129
  ```
128
130
  TEXT
129
131
  # => {"a"=>1}
132
+ ```
133
+
134
+ #### Prose before / after the payload
135
+
136
+ ```ruby
137
+ SmarterJSON.process(<<~TEXT)
138
+ Here is the result:
139
+
140
+ {
141
+ "a": 1
142
+ }
130
143
 
144
+ Hope this helps.
145
+ TEXT
146
+ # => {"a"=>1}
147
+ ```
148
+
149
+ #### Wrapper tags
150
+
151
+ ```ruby
131
152
  SmarterJSON.process("<json>{\"a\":1}</json>")
132
153
  # => {"a"=>1}
154
+ ```
133
155
 
156
+ #### Multiple recovered payloads from one noisy blob
157
+
158
+ ```ruby
134
159
  SmarterJSON.process(<<~TEXT)
135
- first:
160
+ first attempt:
136
161
  {"a":1}
137
162
 
138
- second:
163
+ corrected payload:
139
164
  {"b":2}
140
165
  TEXT
141
166
  # => [{"a"=>1}, {"b"=>2}]
@@ -41,6 +41,18 @@ static VALUE fj_sym_duplicate_key;
41
41
  static ID fj_bigdecimal_id; /* cached BigDecimal() method id (set in Init) */
42
42
  static ID fj_to_sym_id; /* cached :to_sym (symbolize_keys) */
43
43
  static ID fj_key_p_id; /* cached :key? (non-default duplicate_key modes) */
44
+ static ID fj_force_encoding_id;
45
+ static ID fj_valid_encoding_p_id;
46
+ static ID fj_encoding_id;
47
+ static ID fj_name_id;
48
+ static VALUE fj_sym_encoding;
49
+ static VALUE fj_sym_symbolize_keys;
50
+ static VALUE fj_sym_first_wins;
51
+ static VALUE fj_sym_raise;
52
+ static VALUE fj_sym_bigdecimal_load;
53
+ static VALUE fj_sym_float;
54
+ static VALUE fj_sym_bigdecimal;
55
+ static VALUE fj_sym_on_warning;
44
56
 
45
57
  /* Per-parse direct-mapped key cache: key bytes -> the interned (frozen,
46
58
  * globally-rooted) String, so repeated keys skip the global fstring lookup.
@@ -373,11 +385,17 @@ static void fj_consume_keyword(fj_state *st, const char *word) {
373
385
  fj_advance(st, n);
374
386
  }
375
387
 
376
- /* Copy a byte range into a fresh String, dropping underscores. */
388
+ /* Copy a byte range into a fresh String, dropping underscores. Copies whole
389
+ * underscore-free runs in bulk, rather than one byte at a time. */
377
390
  static VALUE fj_strip_underscores(const char *p, long n) {
378
391
  VALUE s = rb_str_buf_new(n);
379
- long i;
380
- for (i = 0; i < n; i++) if (p[i] != '_') rb_str_buf_cat(s, p + i, 1);
392
+ long i = 0;
393
+ while (i < n) {
394
+ long start = i;
395
+ while (i < n && p[i] != '_') i++;
396
+ if (i > start) rb_str_buf_cat(s, p + start, i - start);
397
+ if (i < n) i++; /* skip '_' */
398
+ }
381
399
  return s;
382
400
  }
383
401
 
@@ -1379,14 +1397,14 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
1379
1397
 
1380
1398
  Check_Type(input, T_STRING);
1381
1399
 
1382
- enc_opt = rb_hash_aref(opts, ID2SYM(rb_intern("encoding")));
1400
+ enc_opt = rb_hash_aref(opts, fj_sym_encoding);
1383
1401
  if (!NIL_P(enc_opt)) {
1384
- input = rb_funcall(rb_str_dup(input), rb_intern("force_encoding"), 1, enc_opt);
1402
+ input = rb_funcall(rb_str_dup(input), fj_force_encoding_id, 1, enc_opt);
1385
1403
  }
1386
- if (!RTEST(rb_funcall(input, rb_intern("valid_encoding?"), 0))) {
1387
- VALUE name = rb_funcall(rb_funcall(input, rb_intern("encoding"), 0), rb_intern("name"), 0);
1404
+ if (!RTEST(rb_funcall(input, fj_valid_encoding_p_id, 0))) {
1405
+ VALUE name = rb_funcall(rb_funcall(input, fj_encoding_id, 0), fj_name_id, 0);
1388
1406
  VALUE msg = rb_sprintf("invalid byte sequence for %" PRIsVALUE, name);
1389
- rb_exc_raise(rb_funcall(cEncodingError, rb_intern("new"), 3, msg, Qnil, Qnil));
1407
+ rb_exc_raise(rb_funcall(cEncodingError, fj_new_id, 3, msg, Qnil, Qnil));
1390
1408
  }
1391
1409
 
1392
1410
  st.buf = RSTRING_PTR(input);
@@ -1402,19 +1420,19 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
1402
1420
  st.kcache = NULL;
1403
1421
  #endif
1404
1422
 
1405
- st.symbolize_keys = RTEST(rb_hash_aref(opts, ID2SYM(rb_intern("symbolize_keys"))));
1406
- dk = rb_hash_aref(opts, ID2SYM(rb_intern("duplicate_key")));
1407
- st.dup_first_wins = (dk == ID2SYM(rb_intern("first_wins")));
1408
- st.dup_raise = (dk == ID2SYM(rb_intern("raise")));
1423
+ st.symbolize_keys = RTEST(rb_hash_aref(opts, fj_sym_symbolize_keys));
1424
+ dk = rb_hash_aref(opts, fj_sym_duplicate_key);
1425
+ st.dup_first_wins = (dk == fj_sym_first_wins);
1426
+ st.dup_raise = (dk == fj_sym_raise);
1409
1427
 
1410
1428
  {
1411
- VALUE bd = rb_hash_aref(opts, ID2SYM(rb_intern("bigdecimal_load")));
1412
- if (bd == ID2SYM(rb_intern("float"))) st.bigdecimal_load = 0;
1413
- else if (bd == ID2SYM(rb_intern("bigdecimal"))) st.bigdecimal_load = 2;
1429
+ VALUE bd = rb_hash_aref(opts, fj_sym_bigdecimal_load);
1430
+ if (bd == fj_sym_float) st.bigdecimal_load = 0;
1431
+ else if (bd == fj_sym_bigdecimal) st.bigdecimal_load = 2;
1414
1432
  else st.bigdecimal_load = 1; /* :auto (default), including nil */
1415
1433
  }
1416
1434
 
1417
- st.on_warning = rb_hash_aref(opts, ID2SYM(rb_intern("on_warning"))); /* Qnil when absent */
1435
+ st.on_warning = rb_hash_aref(opts, fj_sym_on_warning); /* Qnil when absent */
1418
1436
 
1419
1437
  if (st.len >= 3 && (unsigned char)st.buf[0] == 0xEF &&
1420
1438
  (unsigned char)st.buf[1] == 0xBB && (unsigned char)st.buf[2] == 0xBF) {
@@ -1465,8 +1483,20 @@ void Init_smarter_json(void) {
1465
1483
  fj_key_p_id = rb_intern("key?");
1466
1484
  fj_new_id = rb_intern("new");
1467
1485
  fj_call_id = rb_intern("call");
1486
+ fj_force_encoding_id = rb_intern("force_encoding");
1487
+ fj_valid_encoding_p_id = rb_intern("valid_encoding?");
1488
+ fj_encoding_id = rb_intern("encoding");
1489
+ fj_name_id = rb_intern("name");
1468
1490
  fj_sym_empty_slot = ID2SYM(rb_intern("empty_slot"));
1469
1491
  fj_sym_empty_value = ID2SYM(rb_intern("empty_value"));
1470
1492
  fj_sym_duplicate_key = ID2SYM(rb_intern("duplicate_key"));
1493
+ fj_sym_encoding = ID2SYM(rb_intern("encoding"));
1494
+ fj_sym_symbolize_keys = ID2SYM(rb_intern("symbolize_keys"));
1495
+ fj_sym_first_wins = ID2SYM(rb_intern("first_wins"));
1496
+ fj_sym_raise = ID2SYM(rb_intern("raise"));
1497
+ fj_sym_bigdecimal_load = ID2SYM(rb_intern("bigdecimal_load"));
1498
+ fj_sym_float = ID2SYM(rb_intern("float"));
1499
+ fj_sym_bigdecimal = ID2SYM(rb_intern("bigdecimal"));
1500
+ fj_sym_on_warning = ID2SYM(rb_intern("on_warning"));
1471
1501
  rb_define_module_function(mSmarterJSON, "parse_c", fj_parse_c, 2);
1472
1502
  }
@@ -63,22 +63,300 @@ module SmarterJSON
63
63
  end
64
64
  end
65
65
 
66
- # Stream documents from an IO, one line (= one document) at a time, yielding
67
- # each bounded memory. Newline-delimited (NDJSON / JSONL); a single document
68
- # spanning multiple lines is not supported by the streaming path.
66
+ # Stream documents from an IO incrementally, yielding each recovered top-level
67
+ # document without slurping the whole input into memory first.
69
68
  def stream_io(io, options, &block)
70
- Recovery.process_string(io.read, options, &block)
69
+ Framer.each_document(io) { |doc| Recovery.process_string(doc, options, &block) }
70
+ nil
71
71
  end
72
72
 
73
73
  private_class_method :process_content, :stream_io
74
74
 
75
+ # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
76
+ # scanners so none of them spell out raw hex. Included where needed.
77
+ module Bytes
78
+ LBRACE = 0x7B
79
+ RBRACE = 0x7D
80
+ LBRACKET = 0x5B
81
+ RBRACKET = 0x5D
82
+ COLON = 0x3A
83
+ COMMA = 0x2C
84
+ DQUOTE = 0x22
85
+ SQUOTE = 0x27
86
+ BACKSLASH = 0x5C
87
+ SLASH = 0x2F
88
+ STAR = 0x2A
89
+ HASH = 0x23
90
+ MINUS = 0x2D
91
+ PLUS = 0x2B
92
+ DOT = 0x2E
93
+ ZERO = 0x30
94
+ NINE = 0x39
95
+ LOWER_E = 0x65
96
+ UPPER_E = 0x45
97
+ LOWER_T = 0x74
98
+ LOWER_F = 0x66
99
+ LOWER_N = 0x6E
100
+ LOWER_U = 0x75
101
+ LOWER_X = 0x78
102
+ UPPER_X = 0x58
103
+ UPPER_I = 0x49
104
+ UPPER_N = 0x4E
105
+ UPPER_T = 0x54
106
+ UPPER_F = 0x46
107
+ UNDERSCORE = 0x5F
108
+ DOLLAR = 0x24
109
+ SPACE = 0x20
110
+ TAB = 0x09
111
+ LF = 0x0A
112
+ CR = 0x0D
113
+ end
114
+
115
+ module Framer
116
+ include Bytes
117
+
118
+ CHUNK_SIZE = 16 * 1024
119
+
120
+ module_function
121
+
122
+ def each_document(io, &block)
123
+ buffer = +""
124
+ scan = 0
125
+ doc_start = nil
126
+ stack = []
127
+ mode = nil
128
+
129
+ while (chunk = read_chunk(io))
130
+ buffer << chunk
131
+ loop do
132
+ emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
133
+ break unless emitted
134
+
135
+ yield emitted
136
+ end
137
+ end
138
+
139
+ yield buffer unless separators_only?(buffer)
140
+ end
141
+
142
+ def read_chunk(io)
143
+ if io.respond_to?(:readpartial)
144
+ io.readpartial(CHUNK_SIZE)
145
+ else
146
+ io.read(CHUNK_SIZE)
147
+ end
148
+ rescue EOFError
149
+ nil
150
+ end
151
+
152
+ def scan_buffer(buffer, scan, doc_start, stack, mode)
153
+ while scan < buffer.bytesize
154
+ b = buffer.getbyte(scan)
155
+ # A multi-byte marker (// /* ''' */) whose lead byte is here but whose
156
+ # remaining bytes have not arrived yet must not be guessed at — advancing
157
+ # past the lead byte would misread the brace/quote that follows it once the
158
+ # next chunk lands. Stop and let each_document append more input, then resume
159
+ # from this same position. At true EOF the leftover is parsed whole instead.
160
+ break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)
161
+
162
+ if mode == :double
163
+ if b == BACKSLASH
164
+ scan += 2
165
+ elsif b == DQUOTE
166
+ mode = nil
167
+ scan += 1
168
+ else
169
+ scan += 1
170
+ end
171
+ elsif mode == :single
172
+ if b == BACKSLASH
173
+ scan += 2
174
+ elsif b == SQUOTE
175
+ mode = nil
176
+ scan += 1
177
+ else
178
+ scan += 1
179
+ end
180
+ elsif mode == :triple
181
+ if buffer.byteslice(scan, 3) == "'''"
182
+ mode = nil
183
+ scan += 3
184
+ else
185
+ scan += 1
186
+ end
187
+ elsif mode == :line_comment
188
+ if [LF, CR].include?(b)
189
+ mode = nil
190
+ else
191
+ scan += 1
192
+ next
193
+ end
194
+ elsif mode == :block_comment
195
+ if buffer.byteslice(scan, 2) == '*/'
196
+ mode = nil
197
+ scan += 2
198
+ else
199
+ scan += 1
200
+ end
201
+ elsif doc_start.nil?
202
+ if whitespace_byte?(b)
203
+ scan += 1
204
+ elsif line_comment_start?(buffer, scan)
205
+ mode = :line_comment
206
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
207
+ elsif block_comment_start?(buffer, scan)
208
+ mode = :block_comment
209
+ scan += 2
210
+ elsif [LBRACE, LBRACKET].include?(b)
211
+ doc_start = scan
212
+ stack << b
213
+ scan += 1
214
+ else
215
+ scan = buffer.bytesize
216
+ end
217
+ else
218
+ if mode.nil? && line_comment_start?(buffer, scan)
219
+ mode = :line_comment
220
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
221
+ elsif mode.nil? && block_comment_start?(buffer, scan)
222
+ mode = :block_comment
223
+ scan += 2
224
+ elsif b == DQUOTE
225
+ mode = :double
226
+ scan += 1
227
+ elsif buffer.byteslice(scan, 3) == "'''"
228
+ mode = :triple
229
+ scan += 3
230
+ elsif b == SQUOTE
231
+ mode = :single
232
+ scan += 1
233
+ elsif [LBRACE, LBRACKET].include?(b)
234
+ stack << b
235
+ scan += 1
236
+ elsif b == RBRACE
237
+ stack.pop if stack.last == LBRACE
238
+ scan += 1
239
+ if stack.empty?
240
+ doc = buffer.byteslice(doc_start, scan - doc_start)
241
+ buffer = buffer.byteslice(scan..-1) || +""
242
+ return [doc, buffer, 0, nil, [], nil]
243
+ end
244
+ elsif b == RBRACKET
245
+ stack.pop if stack.last == LBRACKET
246
+ scan += 1
247
+ if stack.empty?
248
+ doc = buffer.byteslice(doc_start, scan - doc_start)
249
+ buffer = buffer.byteslice(scan..-1) || +""
250
+ return [doc, buffer, 0, nil, [], nil]
251
+ end
252
+ else
253
+ scan += 1
254
+ end
255
+ end
256
+ end
257
+
258
+ [nil, buffer, scan, doc_start, stack, mode]
259
+ end
260
+
261
+ # True when `b` is the lead byte of a multi-byte marker but the rest of that
262
+ # marker has not been read into the buffer yet, so we cannot decide what it is.
263
+ # `//` and `/*` need 2 bytes; `'''` (and a closing `'''`) needs 3; a closing
264
+ # `*/` needs 2. Backslash escapes and single-byte delimiters never need this.
265
+ def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
266
+ avail = buffer.bytesize - scan
267
+ case mode
268
+ when :block_comment
269
+ b == STAR && avail < 2
270
+ when :triple
271
+ b == SQUOTE && avail < 3
272
+ when nil
273
+ if doc_start.nil?
274
+ b == SLASH && avail < 2
275
+ else
276
+ (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
277
+ end
278
+ else
279
+ false
280
+ end
281
+ end
282
+
283
+ def separators_only?(buffer)
284
+ scan = 0
285
+ mode = nil
286
+ while scan < buffer.bytesize
287
+ b = buffer.getbyte(scan)
288
+ if mode == :line_comment
289
+ if [LF, CR].include?(b)
290
+ mode = nil
291
+ else
292
+ scan += 1
293
+ next
294
+ end
295
+ elsif mode == :block_comment
296
+ if buffer.byteslice(scan, 2) == '*/'
297
+ mode = nil
298
+ scan += 2
299
+ else
300
+ scan += 1
301
+ end
302
+ elsif whitespace_byte?(b)
303
+ scan += 1
304
+ elsif line_comment_start?(buffer, scan)
305
+ mode = :line_comment
306
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
307
+ elsif block_comment_start?(buffer, scan)
308
+ mode = :block_comment
309
+ scan += 2
310
+ else
311
+ return false
312
+ end
313
+ end
314
+ true
315
+ end
316
+
317
+ def whitespace_byte?(b)
318
+ b == SPACE || (b && b >= TAB && b <= CR)
319
+ end
320
+
321
+ def line_comment_start?(buffer, scan)
322
+ b = buffer.getbyte(scan)
323
+ return preceded_by_ws_or_start?(buffer, scan) if b == HASH
324
+
325
+ b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
326
+ end
327
+
328
+ def block_comment_start?(buffer, scan)
329
+ buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
330
+ end
331
+
332
+ def preceded_by_ws_or_start?(buffer, scan)
333
+ return true if scan.zero?
334
+
335
+ prev = buffer.getbyte(scan - 1)
336
+ whitespace_byte?(prev)
337
+ end
338
+ end
339
+
75
340
  module Recovery
341
+ include Bytes
342
+
76
343
  module_function
77
344
 
78
345
  def process_string(input, options, &block)
79
346
  return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
80
347
 
81
- if wrapper_hint?(input)
348
+ # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
349
+ # the parse actually fails (the rescue below). Every wrapper shape — code fences,
350
+ # <json>/BEGIN_JSON tags, prose around the payload — makes the parse raise, so the
351
+ # rescue catches it. Crucially this keeps clean input on the single-parse fast path
352
+ # even when its string values legitimately contain ``` or <json> (real-world data
353
+ # like GitHub event payloads is full of markdown), instead of dragging hundreds of
354
+ # MB through the pure-Ruby candidate scan.
355
+ #
356
+ # The one exception is a bare leading label like "JSON: {...}", which parses
357
+ # successfully but WRONGLY (as an implicit-root object keyed by the label), so it
358
+ # must be intercepted before parsing.
359
+ if leading_label?(input)
82
360
  payloads = extract_payloads(input, options)
83
361
  return replay_payloads(payloads, options, &block) unless payloads.empty?
84
362
  end
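The split-marker deferral in `Framer.scan_buffer` in this hunk can be exercised in isolation with a much smaller framer. The sketch below is an illustration under simplifying assumptions, not the gem's code: it only understands `{...}` nesting, double-quoted strings and `/* ... */` comments, it drops the "preceded by whitespace" rule for comments, and it ignores any leftover bytes at EOF. `FramerSketch` and `split_marker?` are hypothetical names.

```ruby
require "stringio"

module FramerSketch
  LBRACE = 0x7B
  RBRACE = 0x7D
  DQUOTE = 0x22
  BACKSLASH = 0x5C
  SLASH = 0x2F
  STAR = 0x2A

  module_function

  # Yields each complete top-level {...} document read from io in chunks.
  def each_document(io, chunk_size)
    buffer = +""
    scan = 0
    depth = 0
    doc_start = nil
    mode = nil # nil, :string or :comment

    while (chunk = io.read(chunk_size))
      buffer << chunk
      while scan < buffer.bytesize
        b = buffer.getbyte(scan)
        # Lead byte of a two-byte marker with its partner still unread:
        # stop here and resume from the same position after the next read.
        break if split_marker?(buffer, scan, b, mode)

        if mode == :string
          if b == BACKSLASH
            scan += 2 # skip the escaped byte (it may arrive in the next chunk)
          else
            mode = nil if b == DQUOTE
            scan += 1
          end
        elsif mode == :comment
          if buffer.byteslice(scan, 2) == "*/"
            mode = nil
            scan += 2
          else
            scan += 1
          end
        elsif buffer.byteslice(scan, 2) == "/*"
          mode = :comment
          scan += 2
        elsif b == DQUOTE
          mode = :string
          scan += 1
        elsif b == LBRACE
          doc_start ||= scan
          depth += 1
          scan += 1
        elsif b == RBRACE && depth.positive?
          depth -= 1
          scan += 1
          if depth.zero?
            yield buffer.byteslice(doc_start, scan - doc_start)
            buffer = buffer.byteslice(scan..-1) || +""
            scan = 0
            doc_start = nil
          end
        else
          scan += 1
        end
      end
    end
  end

  def split_marker?(buffer, scan, b, mode)
    return false unless buffer.bytesize - scan < 2

    (mode.nil? && b == SLASH) || (mode == :comment && b == STAR)
  end
end

# With 8-byte reads the "/" of "/*" is the last byte of the first chunk.
input = '{"a":1 /* } */ }{"b":2}'
docs = []
FramerSketch.each_document(StringIO.new(input), 8) { |doc| docs << doc }
docs # => ["{\"a\":1 /* } */ }", "{\"b\":2}"]
```

Without the `break`, the lone `/` would be skipped as an ordinary byte, the following `*` would not open a comment, and the `}` inside the comment would close the first document early.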
@@ -93,10 +371,14 @@ module SmarterJSON
93
371
  raise
94
372
  end
95
373
 
96
- def wrapper_hint?(input)
97
- return false unless input.valid_encoding?
98
-
99
- input.match?(/```|<json\b|BEGIN_JSON\b/i) || input.match?(/\A[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
374
+ # Whether the input opens with a bare "JSON:" / "Final answer:" label (which would
375
+ # otherwise parse, wrongly, as an implicit-root object keyed by the label). We use
376
+ # String#start_with? with a Regexp rather than match?(/\A.../): start_with? checks
377
+ # only the beginning, whereas a \A-anchored match? still retries at every byte
378
+ # position and so scans the WHOLE input (≈0.3s on a 200 MB document) on every parse.
379
+ # (Caller has already established the input is valid_encoding?.)
380
+ def leading_label?(input)
381
+ input.start_with?(/[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
100
382
  end
101
383
 
102
384
  def replay_payloads(payloads, options, &block)
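The `leading_label?` check in this hunk can be tried on its own. The sketch below copies the pattern from the diff and compares `String#start_with?` with a Regexp (available since Ruby 2.5) against the `\A`-anchored `match?` it replaced. `LABEL_RE` and `ANCHORED_RE` are names chosen here for illustration; the snippet shows only that the two forms agree on the result, and makes no timing claim of its own.

```ruby
# Pattern taken from the diff; the constant names are illustrative only.
LABEL_RE = /[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i
ANCHORED_RE = /\A#{LABEL_RE}/

# start_with? only attempts a match at the beginning of the string.
def leading_label?(input)
  input.start_with?(LABEL_RE)
end

leading_label?("JSON: {\"a\":1}")          # => true
leading_label?("  final answer : {}")      # => true
leading_label?("{\"note\":\"JSON: x\"}")   # => false

# Both forms give the same answer; per the changelog, they differ in cost.
["JSON: {}", "{\"a\":1}", "", "Final answer:\n[1]"].each do |sample|
  raise "mismatch" unless leading_label?(sample) == sample.match?(ANCHORED_RE)
end
```

To check the changelog's performance claim on a particular Ruby version, timing both calls on a large string with `Benchmark.realtime` from the stdlib is one option.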
@@ -146,16 +428,31 @@ module SmarterJSON
146
428
  last = ranges.last
147
429
  prefix = input.byteslice(0, first.begin)
148
430
  suffix = input.byteslice(last.end, input.bytesize - last.end)
431
+ # Look for fence / wrapper markers only in the text we actually strip (outside
432
+ # every recovered payload), so a ``` or <json> sitting inside a payload's own
433
+ # string value does not trigger a "stripped a wrapper" warning.
434
+ outside = non_payload_text(input, ranges)
149
435
  {
150
436
  prefix: substantive_text?(prefix),
151
437
  suffix: substantive_text?(suffix),
152
- fence: input.match?(/```/),
153
- wrapper: input.match?(/<json\b|BEGIN_JSON\b/i),
438
+ fence: outside.include?("```"),
439
+ wrapper: outside.match?(/<json\b|BEGIN_JSON\b/i),
154
440
  first_pos: line_col_for(input, first.begin),
155
441
  last_pos: line_col_for(input, last.begin)
156
442
  }
157
443
  end
158
444
 
445
+ def non_payload_text(input, ranges)
446
+ out = +""
447
+ pos = 0
448
+ ranges.each do |range|
449
+ out << input.byteslice(pos, range.begin - pos) if range.begin > pos
450
+ pos = range.end
451
+ end
452
+ out << input.byteslice(pos, input.bytesize - pos) if pos < input.bytesize
453
+ out
454
+ end
455
+
159
456
  def line_col_for(input, offset)
160
457
  line = 1
161
458
  col = 1
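A quick way to see what `non_payload_text` changes for the `fence:` / `wrapper:` flags is to run it on a blob whose payload carries a fence marker inside a string value. The method body below is copied from the hunk above; the sample input and the way the payload range is located (`String#index` on an ASCII-only string, so character and byte offsets coincide) are assumptions made for this illustration.

```ruby
# Copied from the diff: concatenates every byte range outside the payloads.
def non_payload_text(input, ranges)
  out = +""
  pos = 0
  ranges.each do |range|
    out << input.byteslice(pos, range.begin - pos) if range.begin > pos
    pos = range.end
  end
  out << input.byteslice(pos, input.bytesize - pos) if pos < input.bytesize
  out
end

# Illustrative input: prose around a payload whose string value contains ```.
input = %(Result:\n{"md":"```ruby"}\nDone.)
payload = input.index("{")...(input.index("}") + 1) # end-exclusive byte range

outside = non_payload_text(input, [payload])
outside                   # => "Result:\n\nDone."
outside.include?("```")   # => false (new check: no fence was stripped)
input.include?("```")     # => true  (old check would have reported a fence)
```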
@@ -164,15 +461,15 @@ module SmarterJSON
164
461
  b = input.getbyte(i)
165
462
  break if b.nil?
166
463
 
167
- if b == 0x0A
464
+ if b == LF
168
465
  line += 1
169
466
  col = 1
170
467
  i += 1
171
- elsif b == 0x0D
468
+ elsif b == CR
172
469
  line += 1
173
470
  col = 1
174
471
  i += 1
175
- i += 1 if i < offset && input.getbyte(i) == 0x0A
472
+ i += 1 if i < offset && input.getbyte(i) == LF
176
473
  else
177
474
  col += 1
178
475
  i += 1
@@ -203,19 +500,19 @@ module SmarterJSON
203
500
  while i < input.bytesize
204
501
  b = input.getbyte(i)
205
502
  if mode == :double
206
- if b == 0x5C
503
+ if b == BACKSLASH
207
504
  i += 2
208
505
  next
209
- elsif b == 0x22
506
+ elsif b == DQUOTE
210
507
  mode = nil
211
508
  end
212
509
  i += 1
213
510
  next
214
511
  elsif mode == :single
215
- if b == 0x5C
512
+ if b == BACKSLASH
216
513
  i += 2
217
514
  next
218
- elsif b == 0x27
515
+ elsif b == SQUOTE
219
516
  mode = nil
220
517
  end
221
518
  i += 1
@@ -229,7 +526,7 @@ module SmarterJSON
229
526
  end
230
527
  next
231
528
  elsif mode == :line_comment
232
- if [0x0A, 0x0D].include?(b)
529
+ if [LF, CR].include?(b)
233
530
  mode = nil
234
531
  else
235
532
  i += 1
@@ -252,11 +549,11 @@ module SmarterJSON
252
549
  mode = :block_comment
253
550
  i += 2
254
551
  next
255
- elsif b == 0x23
552
+ elsif b == HASH
256
553
  mode = :line_comment
257
554
  i += 1
258
555
  next
259
- elsif b == 0x22
556
+ elsif b == DQUOTE
260
557
  mode = :double
261
558
  i += 1
262
559
  next
@@ -264,21 +561,21 @@ module SmarterJSON
264
561
  mode = :triple
265
562
  i += 3
266
563
  next
267
- elsif b == 0x27
564
+ elsif b == SQUOTE
268
565
  mode = :single
269
566
  i += 1
270
567
  next
271
- elsif [0x7B, 0x5B].include?(b)
568
+ elsif [LBRACE, LBRACKET].include?(b)
272
569
  start_pos = i if stack.empty?
273
570
  stack << b
274
- elsif b == 0x7D
275
- stack.pop if stack.last == 0x7B
571
+ elsif b == RBRACE
572
+ stack.pop if stack.last == LBRACE
276
573
  if stack.empty? && start_pos
277
574
  ranges << (start_pos...(i + 1))
278
575
  start_pos = nil
279
576
  end
280
- elsif b == 0x5D
281
- stack.pop if stack.last == 0x5B
577
+ elsif b == RBRACKET
578
+ stack.pop if stack.last == LBRACKET
282
579
  if stack.empty? && start_pos
283
580
  ranges << (start_pos...(i + 1))
284
581
  start_pos = nil
@@ -304,41 +601,7 @@ module SmarterJSON
304
601
  # Python literals (True/False/None) and undefined, underscores in
305
602
  # numeric literals, and encoding validation (SmarterJSON::EncodingError).
306
603
  class Parser
307
- LBRACE = 0x7B
308
- RBRACE = 0x7D
309
- LBRACKET = 0x5B
310
- RBRACKET = 0x5D
311
- COLON = 0x3A
312
- COMMA = 0x2C
313
- DQUOTE = 0x22
314
- SQUOTE = 0x27
315
- BACKSLASH = 0x5C
316
- SLASH = 0x2F
317
- STAR = 0x2A
318
- HASH = 0x23
319
- MINUS = 0x2D
320
- PLUS = 0x2B
321
- DOT = 0x2E
322
- ZERO = 0x30
323
- NINE = 0x39
324
- LOWER_E = 0x65
325
- UPPER_E = 0x45
326
- LOWER_T = 0x74
327
- LOWER_F = 0x66
328
- LOWER_N = 0x6E
329
- LOWER_U = 0x75
330
- LOWER_X = 0x78
331
- UPPER_X = 0x58
332
- UPPER_I = 0x49
333
- UPPER_N = 0x4E
334
- UPPER_T = 0x54
335
- UPPER_F = 0x46
336
- UNDERSCORE = 0x5F
337
- DOLLAR = 0x24
338
- SPACE = 0x20
339
- TAB = 0x09
340
- LF = 0x0A
341
- CR = 0x0D
604
+ include Bytes
342
605
 
343
606
  NOT_NUMERIC = Object.new
344
607
  HEX_RE = /\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module SmarterJSON
- VERSION = "0.8.0"
+ VERSION = "0.9.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: smarter_json
  version: !ruby/object:Gem::Version
- version: 0.8.0
+ version: 0.9.2
  platform: ruby
  authors:
  - Tilo Sloboda