smarter_json 0.7.0 → 0.9.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 00a4ada7970ab645761573bd3b9704effda072064102ead760124716f0896289
4
- data.tar.gz: a7232f3d8d06d3b7d770645b28a247774c83b8ac50603e4b7611006e852480d4
3
+ metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
4
+ data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
5
5
  SHA512:
6
- metadata.gz: a4b4b8737f10ebf09408014419a9d5cf2203b706766ea5acc358ad595c5cc34104b3f658f2ca6cf6b130124c514d89d035dea2a0ea16e619ffaec7da16da4a23
7
- data.tar.gz: a317f74e3399d6b95327fb10a7b65d2ca0f07062d9097303ab32aefb6721a46a78784addd0a937c0c8f433c4acb4e7758af3986fa741bd0d5b816141ee101fd4
6
+ metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
7
+ data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29
data/CHANGELOG.md CHANGED
@@ -3,7 +3,30 @@
3
3
 
4
4
  > 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
5
5
 
6
- ## 0.7.0 (2026-06-02)
6
+ ## 0.9.2 (2026-06-03)
7
+ - **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
8
+
9
+ ## 0.9.1 (2026-06-03 unreleased)
10
+ - **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
11
+ - **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned — the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
12
+ - Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
13
+ - Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
14
+
15
+ ## 0.9.0 (2026-06-03 unreleased)
16
+ - performance improvements
17
+ - code cleanup
18
+
19
+ ## 0.8.0 (2026-06-03)
20
+ - **Robustness** against LLM-generated / wrapped JSON:
21
+ - strips markdown code fences (```json / ```)
22
+ - ignores obvious prefix / suffix prose around a payload
23
+ - unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`
24
+ - preserves multiple recovered payloads as an `Array`
25
+ - supports pretty-printed multi-line document framing on IO / block input
26
+ - **Warnings** now cover wrapper recovery too (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
27
+ - **No truncation recovery**: truncated / unterminated input still raises `SmarterJSON::ParseError`
28
+
29
+ ## 0.7.0 (2026-06-03)
7
30
  - **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` tuple return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as the parser applies a lenient fix; `process` / `process_file` now always return the bare value (nil / value / Array) on every path. Unlike the tuple, this also fires on the streaming block form. The default (no handler) records nothing and costs nothing.
8
31
 
9
32
  ## 0.6.0 (2026-06-02)
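
Note on the 0.9.2 entry: the change swaps a `\A`-anchored `match?` for `String#start_with?` with a Regexp. The sketch below only shows that the two forms agree on the answer. The label pattern is a hypothetical stand-in, since the changelog elides the real one, and the performance difference is the changelog's claim, not something this snippet measures.

```ruby
# Hypothetical label pattern: a bare word, a colon, then an opening brace or
# bracket. The gem's actual regex is not shown in the changelog.
LABEL = /[A-Za-z_]+:\s*[{\[]/

# Old style described in the changelog: an explicit \A anchor with match?.
def leading_label_anchored?(text)
  text.match?(/\A[A-Za-z_]+:\s*[{\[]/)
end

# New style: start_with? accepts a Regexp (Ruby 2.5+) and only succeeds when
# the match begins at the first character of the string.
def leading_label_prefix?(text)
  text.start_with?(LABEL)
end

leading_label_prefix?('JSON: {"a":1}')   # => true
leading_label_prefix?('{"JSON: {": 1}')  # => false (label is not at the start)
```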
data/README.md CHANGED
@@ -16,13 +16,14 @@ Three things set it apart:
16
16
 
17
17
  1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
18
18
 
19
- 2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
19
+ 2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
20
20
 
21
21
  3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
22
22
 
23
23
  ## What it accepts, beyond strict JSON
24
24
 
25
25
  - `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` parses as a string, not a truncated value)
26
+ - Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
26
27
  - Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
27
28
  - Implicit root object — a config file that starts with `key: value`, no outer `{}`
28
29
  - `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
@@ -45,6 +46,12 @@ gem install smarter_json
45
46
 
46
47
  The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
47
48
 
49
+ ## API stability and thread safety
50
+
51
+ The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
52
+
53
+ Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
54
+
48
55
  ## Documentation
49
56
 
50
57
  * [Introduction](docs/_introduction.md)
@@ -70,6 +77,41 @@ SmarterJSON.process("") # => nil (zero
70
77
  # For input larger than memory, stream one document at a time with a block
71
78
  # (process and process_file both forward the block):
72
79
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
80
+
81
+ # Wrapper noise is stripped automatically:
82
+ SmarterJSON.process(<<~TEXT)
83
+ Here is the JSON:
84
+
85
+ ```json
86
+ {
87
+ "a": 1
88
+ }
89
+ ```
90
+ TEXT
91
+ # => {"a"=>1}
92
+
93
+ SmarterJSON.process(<<~TEXT)
94
+ Here is the result:
95
+
96
+ {
97
+ "a": 1
98
+ }
99
+
100
+ Hope this helps.
101
+ TEXT
102
+ # => {"a"=>1}
103
+
104
+ SmarterJSON.process("<json>{\"a\":1}</json>")
105
+ # => {"a"=>1}
106
+
107
+ SmarterJSON.process(<<~TEXT)
108
+ first attempt:
109
+ {"a":1}
110
+
111
+ corrected payload:
112
+ {"b":2}
113
+ TEXT
114
+ # => [{"a"=>1}, {"b"=>2}]
73
115
  ```
74
116
 
75
117
  ### Options
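
Note on the wrapper examples in this hunk: the 0.9.1 changelog entry describes recovery as reactive, meaning the input is parsed first and wrapper extraction runs only if that parse fails. A minimal sketch of that ordering, using the stdlib `json` parser as a stand-in and a hypothetical `parse_reactively` helper (neither is SmarterJSON's implementation, and only code fences are handled here):

```ruby
require "json"

# Three backticks, built at runtime so this listing never contains a literal fence.
TICKS = "`" * 3

# Hypothetical fence pattern: an optional "json" tag, then the fenced body.
FENCE = /#{TICKS}(?:json)?\s*\n(.*?)\n?#{TICKS}/m

# Parse first; look for a fenced payload only when the plain parse fails.
def parse_reactively(text)
  JSON.parse(text)
rescue JSON::ParserError
  fenced = text[FENCE, 1]
  raise unless fenced # no fence found: re-raise the original parse error
  JSON.parse(fenced)
end

wrapped = "Here is the JSON:\n\n#{TICKS}json\n{\n  \"a\": 1\n}\n#{TICKS}\n"
parse_reactively(wrapped) # => {"a"=>1}
```

With this ordering, valid JSON whose string values merely contain fence markers never reaches the recovery branch, which is the case the changelog says regressed in 0.8.0.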
@@ -85,7 +127,7 @@ SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event
85
127
 
86
128
  ### Warnings (`on_warning`)
87
129
 
88
- When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
130
+ When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
89
131
 
90
132
  ```ruby
91
133
  # Collect them all:
@@ -106,7 +148,7 @@ Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpu
106
148
 
107
149
  **Two notes on fair comparison:**
108
150
 
109
- - **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once — use it for input larger than RAM.
151
+ - **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
110
152
  - **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
111
153
 
112
154
  ## Encoding
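
Note on the return rule repeated in the README hunks above (zero documents return `nil`, one returns its value, several return an `Array`): the sketch below restates that rule with the stdlib `json` parser over newline-delimited input. `collect_documents` is a hypothetical helper for illustration, not part of SmarterJSON, and it assumes one document per line.

```ruby
require "json"

# Hypothetical helper: parse each non-blank line, then collapse the result
# according to the nil / value / Array rule described in the README.
def collect_documents(ndjson)
  docs = ndjson.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
  case docs.size
  when 0 then nil        # empty input
  when 1 then docs.first # a single document returns its bare value
  else docs              # two or more documents return an Array
  end
end

collect_documents("")                     # => nil
collect_documents(%({"id":1}))            # => {"id"=>1}
collect_documents(%({"id":1}\n{"id":2}))  # => [{"id"=>1}, {"id"=>2}]
```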
@@ -11,7 +11,7 @@
11
11
 
12
12
  # SmarterJSON Introduction
13
13
 
14
- `smarter_json` is a fast, lenient JSON parser and writer for Ruby. It reads strict JSON, JSON5, HJSON-style config, newline-delimited JSON (NDJSON / JSONL), and the messy JSON-ish input humans actually paste — and in benchmarks it matches or beats Oj on nearly every file. It is opinionated: it optimizes for getting your data out, not for policing the JSON spec. Where other parsers stop at the first deviation, SmarterJSON keeps going.
14
+ `smarter_json` is a fast, lenient JSON parser and writer for Ruby. It reads strict JSON, JSON5, HJSON-style config, newline-delimited JSON (NDJSON / JSONL), markdown-wrapped / chatty blobs around a JSON payload, and the messy JSON-ish input humans actually paste — and in benchmarks it matches or beats Oj on nearly every file. It is opinionated: it optimizes for getting your data out, not for policing the JSON spec. Where other parsers stop at the first deviation, SmarterJSON keeps going.
15
15
 
16
16
  ## Why another JSON library?
17
17
 
@@ -21,7 +21,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
21
21
 
22
22
  * **One reader, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the reader to match your input; it adapts to whatever you give it.
23
23
 
24
- * **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
24
+ * **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
25
25
 
26
26
  * **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser. Floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
27
27
 
@@ -22,7 +22,40 @@ SmarterJSON.process('{"a": 1, "b": [2, 3]}') # => {"a"=>1, "b"=>[2, 3]}
22
22
  SmarterJSON.process("host: localhost\nport: 5432") # => {"host"=>"localhost", "port"=>5432} (no braces needed)
23
23
  ```
24
24
 
25
- `process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths.
25
+ `process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths. When the input wraps the payload in obvious markdown / prose / tags, `process` strips that wrapper first and then parses the recovered payload(s).
26
+
27
+ ```ruby
28
+ SmarterJSON.process(<<~TEXT)
29
+ Here is the JSON:
30
+
31
+ ```json
32
+ {
33
+ "a": 1
34
+ }
35
+ ```
36
+ TEXT
37
+ # => {"a"=>1}
38
+
39
+ SmarterJSON.process(<<~TEXT)
40
+ Here is the result:
41
+
42
+ {
43
+ "a": 1
44
+ }
45
+
46
+ Hope this helps.
47
+ TEXT
48
+ # => {"a"=>1}
49
+
50
+ SmarterJSON.process(<<~TEXT)
51
+ first attempt:
52
+ {"a":1}
53
+
54
+ corrected payload:
55
+ {"b":2}
56
+ TEXT
57
+ # => [{"a"=>1}, {"b"=>2}]
58
+ ```
26
59
 
27
60
  ```ruby
28
61
  SmarterJSON.process(io) # an open IO (File, StringIO, socket, …) — reads it and parses
@@ -39,7 +72,7 @@ SmarterJSON.process('{"id":1}') # => {"id"=>1} (one
39
72
  SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2}, {"id"=>3}] (two or more → an Array)
40
73
  ```
41
74
 
42
- Documents are separated by whitespace, newlines, or simple concatenation — **not** by commas (a comma between top-level documents would be read as an implicit root array, which is not supported). Only SmarterJSON reads this via plain `process`: Oj and the stdlib `json` library raise without a block.
75
+ Documents are separated by whitespace, newlines, or simple concatenation — **not** by commas (a comma between top-level documents would be read as an implicit root array, which is not supported). If wrapper noise is stripped and several payloads are recovered, they are returned by the same rule: one payload → its value, several → an `Array`. Only SmarterJSON reads this via plain `process`: Oj and the stdlib `json` library raise without a block.
43
76
 
44
77
  ## `SmarterJSON.process_file` — read a file by path
45
78
 
@@ -51,17 +84,14 @@ SmarterJSON.process_file("config.json5") # read the file, then parse — sam
51
84
 
52
85
  ## Streaming with a block (bounded memory)
53
86
 
54
- For input larger than memory, pass a block. Each top-level document is yielded as it is read, and the method returns `nil` (it never collects the documents into an Array). Both `process` and `process_file` forward the block.
87
+ For input larger than memory, pass a block. Each recovered top-level document is yielded as it is framed, and the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
55
88
 
56
89
  ```ruby
57
- # Stream straight from disk, one document at a time — the whole file is never loaded:
58
90
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
59
-
60
- # Same for an IO:
61
91
  SmarterJSON.process(io) { |doc| handle(doc) }
62
92
  ```
63
93
 
64
- The streaming path reads the input as newline-delimited documents (NDJSON / JSONL), one document per line. A single document that spans multiple lines is not supported by the streaming path; read it without a block instead.
94
+ The streaming path now frames whole top-level documents, not just one line at a time. That means NDJSON / JSONL still work, but pretty-printed multi-line objects and arrays work too, as do mixed `\n` / `\r\n` / `\r` line endings and comment-only separators between documents.
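
Note on document framing: the idea of yielding whole top-level documents from chunked reads can be sketched as below. This is a deliberately simplified, hypothetical framer, not the gem's `Framer`. It tracks only bracket depth and double-quoted strings, assumes ASCII input, and ignores comments, single-quoted and triple-quoted strings, and bare scalar documents.

```ruby
require "stringio"

# Hypothetical simplified framer: read fixed-size chunks and yield each
# top-level object or array as soon as its closing bracket arrives.
def each_framed_document(io, chunk_size: 8)
  depth = 0
  in_string = false
  escaped = false
  doc = +""
  while (chunk = io.read(chunk_size)) # read(n) returns nil at end of input
    chunk.each_char do |ch|
      doc << ch
      if in_string
        if escaped
          escaped = false             # the escaped character is consumed as-is
        elsif ch == "\\"
          escaped = true
        elsif ch == '"'
          in_string = false
        end
      elsif ch == '"'
        in_string = true
      elsif ch == "{" || ch == "["
        depth += 1
      elsif ch == "}" || ch == "]"
        depth -= 1
        if depth.zero?                # a complete top-level document
          yield doc.strip
          doc = +""
        end
      end
    end
  end
end

io = StringIO.new("{\n  \"a\": \"}\"\n}\n{\"b\": [1, 2]}\n")
each_framed_document(io) { |text| puts text }
```

Because the state variables live outside the chunk loop, a document that straddles a chunk boundary is still framed correctly, and a brace inside a string does not end the document early.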
65
95
 
66
96
  ## The C extension and the pure-Ruby fallback
67
97
 
@@ -79,7 +109,7 @@ warns.map(&:type) # => [:empty_slot]
79
109
  warns.first.to_s # => "extra comma, collapsed an empty slot at line 1, col 4"
80
110
  ```
81
111
 
82
- Each warning is a `SmarterJSON::Warning` with `type`, `message`, `line`, and `col`. The types are `:empty_slot` (a collapsed empty comma slot), `:empty_value` (a key with no value, read as `null`), and `:duplicate_key` (a repeated key that was dropped). Clean input never invokes the handler. It fires on every path — including the streaming block form — and works the same on the C and pure-Ruby paths. See [Configuration Options](./options.md).
112
+ Each warning is a `SmarterJSON::Warning` with `type`, `message`, `line`, and `col`. The types are `:empty_slot` (a collapsed empty comma slot), `:empty_value` (a key with no value, read as `null`), `:duplicate_key` (a repeated key that was dropped), plus wrapper-recovery warnings such as `:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, and `:wrapper_tag_stripped`. Clean input never invokes the handler. It fires on every path — including the streaming block form — and works the same on the C and pure-Ruby paths. See [Configuration Options](./options.md).
83
113
 
84
114
  ---------------
85
115
 
data/docs/examples.md CHANGED
@@ -24,9 +24,10 @@
24
24
  7. [Duplicate Keys](#example-7-duplicate-keys)
25
25
  8. [High-Precision Numbers: BigDecimal vs Float](#example-8-high-precision-numbers-bigdecimal-vs-float)
26
26
  9. [Lenient Input: Comments, Trailing Commas, Unquoted Keys](#example-9-lenient-input-comments-trailing-commas-unquoted-keys)
27
- 10. [Write JSON](#example-10-write-json)
28
- 11. [Write NDJSON](#example-11-write-ndjson)
29
- 12. [Round-Trip Read and Write](#example-12-round-trip-read-and-write)
27
+ 10. [Wrapper Noise Around a Payload](#example-10-wrapper-noise-around-a-payload)
28
+ 11. [Write JSON](#example-11-write-json)
29
+ 12. [Write NDJSON](#example-12-write-ndjson)
30
+ 13. [Round-Trip Read and Write](#example-13-round-trip-read-and-write)
30
31
 
31
32
  ---
32
33
 
@@ -66,7 +67,7 @@ SmarterJSON.process("") # => nil
66
67
 
67
68
  ### Example 5: Streaming a Large File with a Block
68
69
 
69
- For input larger than memory, pass a block. Each document is yielded as it is read; the whole file is never loaded:
70
+ For input larger than memory, pass a block. Each recovered document is yielded one at a time:
70
71
 
71
72
  ```ruby
72
73
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
@@ -113,14 +114,66 @@ JSON
113
114
 
114
115
  A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.
115
116
 
116
- ### Example 10: Write JSON
117
+ ### Example 10: Wrapper Noise Around a Payload
118
+
119
+ #### Fenced payload
120
+
121
+ ```ruby
122
+ SmarterJSON.process(<<~TEXT)
123
+ Here is the JSON:
124
+
125
+ ```json
126
+ {
127
+ "a": 1
128
+ }
129
+ ```
130
+ TEXT
131
+ # => {"a"=>1}
132
+ ```
133
+
134
+ #### Prose before / after the payload
135
+
136
+ ```ruby
137
+ SmarterJSON.process(<<~TEXT)
138
+ Here is the result:
139
+
140
+ {
141
+ "a": 1
142
+ }
143
+
144
+ Hope this helps.
145
+ TEXT
146
+ # => {"a"=>1}
147
+ ```
148
+
149
+ #### Wrapper tags
150
+
151
+ ```ruby
152
+ SmarterJSON.process("<json>{\"a\":1}</json>")
153
+ # => {"a"=>1}
154
+ ```
155
+
156
+ #### Multiple recovered payloads from one noisy blob
157
+
158
+ ```ruby
159
+ SmarterJSON.process(<<~TEXT)
160
+ first attempt:
161
+ {"a":1}
162
+
163
+ corrected payload:
164
+ {"b":2}
165
+ TEXT
166
+ # => [{"a"=>1}, {"b"=>2}]
167
+ ```
168
+
169
+ ### Example 11: Write JSON
117
170
 
118
171
  ```ruby
119
172
  SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
120
173
  SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
121
174
  ```
122
175
 
123
- ### Example 11: Write NDJSON
176
+ ### Example 12: Write NDJSON
124
177
 
125
178
  An Array writes one element per line:
126
179
 
@@ -128,7 +181,7 @@ An Array writes one element per line:
128
181
  SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
129
182
  ```
130
183
 
131
- ### Example 12: Round-Trip Read and Write
184
+ ### Example 13: Round-Trip Read and Write
132
185
 
133
186
  ```ruby
134
187
  obj = { "a" => 1, "b" => [2, "three", nil, true] }
data/docs/options.md CHANGED
@@ -43,7 +43,7 @@ warns.map(&:type) # => [:empty_slot]
43
43
  warns.first.to_s # => "extra comma, collapsed an empty slot at line 1, col 4"
44
44
  ```
45
45
 
46
- The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`), `:empty_value` (a key with no value, read as `null`, e.g. `{a:}`), and `:duplicate_key` (a repeated key that was dropped). Clean input never invokes the handler. Warnings work on both the C and pure-Ruby paths, so `acceleration:` doesn't change them.
46
+ The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`), `:empty_value` (a key with no value, read as `null`, e.g. `{a:}`), and `:duplicate_key` (a repeated key that was dropped), plus wrapper-recovery warnings such as `:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, and `:wrapper_tag_stripped`. Clean input never invokes the handler. Warnings work on both the C and pure-Ruby paths, so `acceleration:` doesn't change them.
47
47
 
48
48
  ### A note on `:encoding`
49
49
 
@@ -41,6 +41,18 @@ static VALUE fj_sym_duplicate_key;
41
41
  static ID fj_bigdecimal_id; /* cached BigDecimal() method id (set in Init) */
42
42
  static ID fj_to_sym_id; /* cached :to_sym (symbolize_keys) */
43
43
  static ID fj_key_p_id; /* cached :key? (non-default duplicate_key modes) */
44
+ static ID fj_force_encoding_id;
45
+ static ID fj_valid_encoding_p_id;
46
+ static ID fj_encoding_id;
47
+ static ID fj_name_id;
48
+ static VALUE fj_sym_encoding;
49
+ static VALUE fj_sym_symbolize_keys;
50
+ static VALUE fj_sym_first_wins;
51
+ static VALUE fj_sym_raise;
52
+ static VALUE fj_sym_bigdecimal_load;
53
+ static VALUE fj_sym_float;
54
+ static VALUE fj_sym_bigdecimal;
55
+ static VALUE fj_sym_on_warning;
44
56
 
45
57
  /* Per-parse direct-mapped key cache: key bytes -> the interned (frozen,
46
58
  * globally-rooted) String, so repeated keys skip the global fstring lookup.
@@ -373,11 +385,17 @@ static void fj_consume_keyword(fj_state *st, const char *word) {
373
385
  fj_advance(st, n);
374
386
  }
375
387
 
376
- /* Copy a byte range into a fresh String, dropping underscores. */
388
+ /* Copy a byte range into a fresh String, dropping underscores. Copies whole
389
+ * underscore-free runs in bulk, rather than one byte at a time. */
377
390
  static VALUE fj_strip_underscores(const char *p, long n) {
378
391
  VALUE s = rb_str_buf_new(n);
379
- long i;
380
- for (i = 0; i < n; i++) if (p[i] != '_') rb_str_buf_cat(s, p + i, 1);
392
+ long i = 0;
393
+ while (i < n) {
394
+ long start = i;
395
+ while (i < n && p[i] != '_') i++;
396
+ if (i > start) rb_str_buf_cat(s, p + start, i - start);
397
+ if (i < n) i++; /* skip '_' */
398
+ }
381
399
  return s;
382
400
  }
383
401
 
@@ -1379,14 +1397,14 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
1379
1397
 
1380
1398
  Check_Type(input, T_STRING);
1381
1399
 
1382
- enc_opt = rb_hash_aref(opts, ID2SYM(rb_intern("encoding")));
1400
+ enc_opt = rb_hash_aref(opts, fj_sym_encoding);
1383
1401
  if (!NIL_P(enc_opt)) {
1384
- input = rb_funcall(rb_str_dup(input), rb_intern("force_encoding"), 1, enc_opt);
1402
+ input = rb_funcall(rb_str_dup(input), fj_force_encoding_id, 1, enc_opt);
1385
1403
  }
1386
- if (!RTEST(rb_funcall(input, rb_intern("valid_encoding?"), 0))) {
1387
- VALUE name = rb_funcall(rb_funcall(input, rb_intern("encoding"), 0), rb_intern("name"), 0);
1404
+ if (!RTEST(rb_funcall(input, fj_valid_encoding_p_id, 0))) {
1405
+ VALUE name = rb_funcall(rb_funcall(input, fj_encoding_id, 0), fj_name_id, 0);
1388
1406
  VALUE msg = rb_sprintf("invalid byte sequence for %" PRIsVALUE, name);
1389
- rb_exc_raise(rb_funcall(cEncodingError, rb_intern("new"), 3, msg, Qnil, Qnil));
1407
+ rb_exc_raise(rb_funcall(cEncodingError, fj_new_id, 3, msg, Qnil, Qnil));
1390
1408
  }
1391
1409
 
1392
1410
  st.buf = RSTRING_PTR(input);
@@ -1402,19 +1420,19 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
1402
1420
  st.kcache = NULL;
1403
1421
  #endif
1404
1422
 
1405
- st.symbolize_keys = RTEST(rb_hash_aref(opts, ID2SYM(rb_intern("symbolize_keys"))));
1406
- dk = rb_hash_aref(opts, ID2SYM(rb_intern("duplicate_key")));
1407
- st.dup_first_wins = (dk == ID2SYM(rb_intern("first_wins")));
1408
- st.dup_raise = (dk == ID2SYM(rb_intern("raise")));
1423
+ st.symbolize_keys = RTEST(rb_hash_aref(opts, fj_sym_symbolize_keys));
1424
+ dk = rb_hash_aref(opts, fj_sym_duplicate_key);
1425
+ st.dup_first_wins = (dk == fj_sym_first_wins);
1426
+ st.dup_raise = (dk == fj_sym_raise);
1409
1427
 
1410
1428
  {
1411
- VALUE bd = rb_hash_aref(opts, ID2SYM(rb_intern("bigdecimal_load")));
1412
- if (bd == ID2SYM(rb_intern("float"))) st.bigdecimal_load = 0;
1413
- else if (bd == ID2SYM(rb_intern("bigdecimal"))) st.bigdecimal_load = 2;
1429
+ VALUE bd = rb_hash_aref(opts, fj_sym_bigdecimal_load);
1430
+ if (bd == fj_sym_float) st.bigdecimal_load = 0;
1431
+ else if (bd == fj_sym_bigdecimal) st.bigdecimal_load = 2;
1414
1432
  else st.bigdecimal_load = 1; /* :auto (default), including nil */
1415
1433
  }
1416
1434
 
1417
- st.on_warning = rb_hash_aref(opts, ID2SYM(rb_intern("on_warning"))); /* Qnil when absent */
1435
+ st.on_warning = rb_hash_aref(opts, fj_sym_on_warning); /* Qnil when absent */
1418
1436
 
1419
1437
  if (st.len >= 3 && (unsigned char)st.buf[0] == 0xEF &&
1420
1438
  (unsigned char)st.buf[1] == 0xBB && (unsigned char)st.buf[2] == 0xBF) {
@@ -1465,8 +1483,20 @@ void Init_smarter_json(void) {
1465
1483
  fj_key_p_id = rb_intern("key?");
1466
1484
  fj_new_id = rb_intern("new");
1467
1485
  fj_call_id = rb_intern("call");
1486
+ fj_force_encoding_id = rb_intern("force_encoding");
1487
+ fj_valid_encoding_p_id = rb_intern("valid_encoding?");
1488
+ fj_encoding_id = rb_intern("encoding");
1489
+ fj_name_id = rb_intern("name");
1468
1490
  fj_sym_empty_slot = ID2SYM(rb_intern("empty_slot"));
1469
1491
  fj_sym_empty_value = ID2SYM(rb_intern("empty_value"));
1470
1492
  fj_sym_duplicate_key = ID2SYM(rb_intern("duplicate_key"));
1493
+ fj_sym_encoding = ID2SYM(rb_intern("encoding"));
1494
+ fj_sym_symbolize_keys = ID2SYM(rb_intern("symbolize_keys"));
1495
+ fj_sym_first_wins = ID2SYM(rb_intern("first_wins"));
1496
+ fj_sym_raise = ID2SYM(rb_intern("raise"));
1497
+ fj_sym_bigdecimal_load = ID2SYM(rb_intern("bigdecimal_load"));
1498
+ fj_sym_float = ID2SYM(rb_intern("float"));
1499
+ fj_sym_bigdecimal = ID2SYM(rb_intern("bigdecimal"));
1500
+ fj_sym_on_warning = ID2SYM(rb_intern("on_warning"));
1471
1501
  rb_define_module_function(mSmarterJSON, "parse_c", fj_parse_c, 2);
1472
1502
  }
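
Note on the `fj_strip_underscores` change in this file: the new loop copies each underscore-free run in one append instead of appending byte by byte. The same control flow, transliterated to Ruby purely as an illustration (the gem does this in C; `strip_underscores` here is a hypothetical name):

```ruby
UNDERSCORE = 0x5F # byte value of '_'

# Copy a numeric literal into a fresh String, dropping underscores.
# Each underscore-free run is appended with a single call.
def strip_underscores(text)
  out = +""
  n = text.bytesize
  i = 0
  while i < n
    start = i
    i += 1 while i < n && text.getbyte(i) != UNDERSCORE # scan to the end of the run
    out << text.byteslice(start, i - start) if i > start # bulk-copy the run
    i += 1 if i < n                                      # skip the '_'
  end
  out
end

strip_underscores("1_000_000") # => "1000000"
```

In idiomatic Ruby one would write `text.delete("_")`; the explicit loop is kept only to mirror the structure of the C hunk.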
@@ -22,9 +22,9 @@ module SmarterJSON
22
22
  # stream as newline-delimited documents (NDJSON / JSONL), one per line.
23
23
  def process(input, options = {}, &block)
24
24
  if input.is_a?(String)
25
- process_content(input, options, &block)
25
+ Recovery.process_string(input, options, &block)
26
26
  elsif input.respond_to?(:read)
27
- block ? stream_io(input, options, &block) : process_content(input.read, options)
27
+ block ? stream_io(input, options, &block) : process(input.read, options)
28
28
  else
29
29
  raise ArgumentError, "SmarterJSON.process expects a String or an IO, got #{input.class}"
30
30
  end
@@ -43,7 +43,7 @@ module SmarterJSON
43
43
  if block
44
44
  File.open(path, "r:#{encoding}") { |io| stream_io(io, options, &block) }
45
45
  else
46
- process_content(File.read(path, encoding: encoding), options)
46
+ process(File.read(path, encoding: encoding), options)
47
47
  end
48
48
  end
49
49
 
@@ -63,29 +63,18 @@ module SmarterJSON
63
63
  end
64
64
  end
65
65
 
66
- # Stream documents from an IO, one line (= one document) at a time, yielding
67
- # each bounded memory. Newline-delimited (NDJSON / JSONL); a single document
68
- # spanning multiple lines is not supported by the streaming path.
66
+ # Stream documents from an IO incrementally, yielding each recovered top-level
67
+ # document without slurping the whole input into memory first.
69
68
  def stream_io(io, options, &block)
70
- io.each_line("\n") { |line| process_content(line, options, &block) }
69
+ Framer.each_document(io) { |doc| Recovery.process_string(doc, options, &block) }
71
70
  nil
72
71
  end
73
72
 
74
73
  private_class_method :process_content, :stream_io
75
74
 
76
- # Hand-rolled FSM single-pass parser.
77
- # Layer 1: strict JSON (RFC 8259).
78
- # Layer 2: JSON5 additions — line/block comments, trailing comma,
79
- # unquoted ECMAScript identifier keys, single-quoted strings,
80
- # hex numbers, leading/trailing decimal points, Infinity/NaN,
81
- # explicit + sign, \-line-continuation inside strings.
82
- # Layer 3: HJSON-inspired additions — #/comment-marker rule, triple-quoted
83
- # strings, quoteless single-line strings, implicit root object,
84
- # newline-as-separator, broader unquoted keys, recognized-literals-win.
85
- # Layer 4: smarter_json additions — UTF-8 BOM skip, smart/curly quotes,
86
- # Python literals (True/False/None) and undefined, underscores in
87
- # numeric literals, and encoding validation (SmarterJSON::EncodingError).
88
- class Parser
75
+ # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
76
+ # scanners so none of them spell out raw hex. Included where needed.
77
+ module Bytes
89
78
  LBRACE = 0x7B
90
79
  RBRACE = 0x7D
91
80
  LBRACKET = 0x5B
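Review note on the hunk above: `stream_io` no longer assumes one document per line and instead hands the IO to `Framer.each_document`. The sketch below is a much-simplified illustration of depth-based framing, not the gem's `Framer`. The method name `split_top_level_documents` is made up for this note. It assumes the whole input is already in memory and only understands double-quoted strings, so it ignores comments, single and triple quotes, and chunk boundaries.

```ruby
require "json"

# Simplified illustration only (hypothetical helper, not part of smarter_json):
# split concatenated documents by tracking {} / [] nesting depth, skipping
# over double-quoted strings so braces inside values do not count.
def split_top_level_documents(text)
  docs = []
  depth = 0
  start = nil
  in_string = false
  i = 0
  while i < text.bytesize
    b = text.getbyte(i)
    if in_string
      if b == 0x5C                      # backslash: skip the escaped byte
        i += 1
      elsif b == 0x22                   # closing double quote
        in_string = false
      end
    elsif b == 0x22
      in_string = true if depth.positive?
    elsif [0x7B, 0x5B].include?(b)      # { or [
      start = i if depth.zero?
      depth += 1
    elsif [0x7D, 0x5D].include?(b) && depth.positive? # } or ]
      depth -= 1
      if depth.zero?
        docs << text.byteslice(start, i - start + 1)
        start = nil
      end
    end
    i += 1
  end
  docs
end

input = %({"a": 1}\n{\n  "b": "}",\n  "c": [1, 2]\n}\n[3]\n)
docs = split_top_level_documents(input)
parsed = docs.map { |doc| JSON.parse(doc) }
```

Here the second document spans several lines and contains a `}` inside a string value, which a line-based `each_line` split would have cut apart. The depth-based scan returns it as one document.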
@@ -121,6 +110,498 @@ module SmarterJSON
121
110
  TAB = 0x09
122
111
  LF = 0x0A
123
112
  CR = 0x0D
113
+ end
114
+
115
+ module Framer
116
+ include Bytes
117
+
118
+ CHUNK_SIZE = 16 * 1024
119
+
120
+ module_function
121
+
122
+ def each_document(io, &block)
123
+ buffer = +""
124
+ scan = 0
125
+ doc_start = nil
126
+ stack = []
127
+ mode = nil
128
+
129
+ while (chunk = read_chunk(io))
130
+ buffer << chunk
131
+ loop do
132
+ emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
133
+ break unless emitted
134
+
135
+ yield emitted
136
+ end
137
+ end
138
+
139
+ yield buffer unless separators_only?(buffer)
140
+ end
141
+
142
+ def read_chunk(io)
143
+ if io.respond_to?(:readpartial)
144
+ io.readpartial(CHUNK_SIZE)
145
+ else
146
+ io.read(CHUNK_SIZE)
147
+ end
148
+ rescue EOFError
149
+ nil
150
+ end
151
+
152
+ def scan_buffer(buffer, scan, doc_start, stack, mode)
153
+ while scan < buffer.bytesize
154
+ b = buffer.getbyte(scan)
155
+ # A multi-byte marker (// /* ''' */) whose lead byte is here but whose
156
+ # remaining bytes have not arrived yet must not be guessed at — advancing
157
+ # past the lead byte would misread the brace/quote that follows it once the
158
+ # next chunk lands. Stop and let each_document append more input, then resume
159
+ # from this same position. At true EOF the leftover is parsed whole instead.
160
+ break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)
161
+
162
+ if mode == :double
163
+ if b == BACKSLASH
164
+ scan += 2
165
+ elsif b == DQUOTE
166
+ mode = nil
167
+ scan += 1
168
+ else
169
+ scan += 1
170
+ end
171
+ elsif mode == :single
172
+ if b == BACKSLASH
173
+ scan += 2
174
+ elsif b == SQUOTE
175
+ mode = nil
176
+ scan += 1
177
+ else
178
+ scan += 1
179
+ end
180
+ elsif mode == :triple
181
+ if buffer.byteslice(scan, 3) == "'''"
182
+ mode = nil
183
+ scan += 3
184
+ else
185
+ scan += 1
186
+ end
187
+ elsif mode == :line_comment
188
+ if [LF, CR].include?(b)
189
+ mode = nil
190
+ else
191
+ scan += 1
192
+ next
193
+ end
194
+ elsif mode == :block_comment
195
+ if buffer.byteslice(scan, 2) == '*/'
196
+ mode = nil
197
+ scan += 2
198
+ else
199
+ scan += 1
200
+ end
201
+ elsif doc_start.nil?
202
+ if whitespace_byte?(b)
203
+ scan += 1
204
+ elsif line_comment_start?(buffer, scan)
205
+ mode = :line_comment
206
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
207
+ elsif block_comment_start?(buffer, scan)
208
+ mode = :block_comment
209
+ scan += 2
210
+ elsif [LBRACE, LBRACKET].include?(b)
211
+ doc_start = scan
212
+ stack << b
213
+ scan += 1
214
+ else
215
+ scan = buffer.bytesize
216
+ end
217
+ else
218
+ if mode.nil? && line_comment_start?(buffer, scan)
219
+ mode = :line_comment
220
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
221
+ elsif mode.nil? && block_comment_start?(buffer, scan)
222
+ mode = :block_comment
223
+ scan += 2
224
+ elsif b == DQUOTE
225
+ mode = :double
226
+ scan += 1
227
+ elsif buffer.byteslice(scan, 3) == "'''"
228
+ mode = :triple
229
+ scan += 3
230
+ elsif b == SQUOTE
231
+ mode = :single
232
+ scan += 1
233
+ elsif [LBRACE, LBRACKET].include?(b)
234
+ stack << b
235
+ scan += 1
236
+ elsif b == RBRACE
237
+ stack.pop if stack.last == LBRACE
238
+ scan += 1
239
+ if stack.empty?
240
+ doc = buffer.byteslice(doc_start, scan - doc_start)
241
+ buffer = buffer.byteslice(scan..-1) || +""
242
+ return [doc, buffer, 0, nil, [], nil]
243
+ end
244
+ elsif b == RBRACKET
245
+ stack.pop if stack.last == LBRACKET
246
+ scan += 1
247
+ if stack.empty?
248
+ doc = buffer.byteslice(doc_start, scan - doc_start)
249
+ buffer = buffer.byteslice(scan..-1) || +""
250
+ return [doc, buffer, 0, nil, [], nil]
251
+ end
252
+ else
253
+ scan += 1
254
+ end
255
+ end
256
+ end
257
+
258
+ [nil, buffer, scan, doc_start, stack, mode]
259
+ end
260
+
261
+ # True when `b` is the lead byte of a multi-byte marker but the rest of that
262
+ # marker has not been read into the buffer yet, so we cannot decide what it is.
263
+ # `//` and `/*` need 2 bytes; `'''` (and a closing `'''`) needs 3; a closing
264
+ # `*/` needs 2. Backslash escapes and single-byte delimiters never need this.
265
+ def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
266
+ avail = buffer.bytesize - scan
267
+ case mode
268
+ when :block_comment
269
+ b == STAR && avail < 2
270
+ when :triple
271
+ b == SQUOTE && avail < 3
272
+ when nil
273
+ if doc_start.nil?
274
+ b == SLASH && avail < 2
275
+ else
276
+ (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
277
+ end
278
+ else
279
+ false
280
+ end
281
+ end
282
+
283
+ def separators_only?(buffer)
284
+ scan = 0
285
+ mode = nil
286
+ while scan < buffer.bytesize
287
+ b = buffer.getbyte(scan)
288
+ if mode == :line_comment
289
+ if [LF, CR].include?(b)
290
+ mode = nil
291
+ else
292
+ scan += 1
293
+ next
294
+ end
295
+ elsif mode == :block_comment
296
+ if buffer.byteslice(scan, 2) == '*/'
297
+ mode = nil
298
+ scan += 2
299
+ else
300
+ scan += 1
301
+ end
302
+ elsif whitespace_byte?(b)
303
+ scan += 1
304
+ elsif line_comment_start?(buffer, scan)
305
+ mode = :line_comment
306
+ scan += buffer.getbyte(scan) == HASH ? 1 : 2
307
+ elsif block_comment_start?(buffer, scan)
308
+ mode = :block_comment
309
+ scan += 2
310
+ else
311
+ return false
312
+ end
313
+ end
314
+ true
315
+ end
316
+
317
+ def whitespace_byte?(b)
318
+ b == SPACE || (b && b >= TAB && b <= CR)
319
+ end
320
+
321
+ def line_comment_start?(buffer, scan)
322
+ b = buffer.getbyte(scan)
323
+ return preceded_by_ws_or_start?(buffer, scan) if b == HASH
324
+
325
+ b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
326
+ end
327
+
328
+ def block_comment_start?(buffer, scan)
329
+ buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
330
+ end
331
+
332
+ def preceded_by_ws_or_start?(buffer, scan)
333
+ return true if scan.zero?
334
+
335
+ prev = buffer.getbyte(scan - 1)
336
+ whitespace_byte?(prev)
337
+ end
338
+ end
339
+
340
+ module Recovery
341
+ include Bytes
342
+
343
+ module_function
344
+
345
+ def process_string(input, options, &block)
346
+ return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
347
+
348
+ # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
349
+ # the parse actually fails (the rescue below). Every wrapper shape — code fences,
350
+ # <json>/BEGIN_JSON tags, prose around the payload — makes the parse raise, so the
351
+ # rescue catches it. Crucially this keeps clean input on the single-parse fast path
352
+ # even when its string values legitimately contain ``` or <json> (real-world data
353
+ # like GitHub event payloads is full of markdown), instead of dragging hundreds of
354
+ # MB through the pure-Ruby candidate scan.
355
+ #
356
+ # The one exception is a bare leading label like "JSON: {...}", which parses
357
+ # successfully but WRONGLY (as an implicit-root object keyed by the label), so it
358
+ # must be intercepted before parsing.
359
+ if leading_label?(input)
360
+ payloads = extract_payloads(input, options)
361
+ return replay_payloads(payloads, options, &block) unless payloads.empty?
362
+ end
363
+
364
+ SmarterJSON.send(:process_content, input, options, &block)
365
+ rescue ParseError => e
366
+ raise if e.is_a?(EncodingError)
367
+
368
+ payloads = extract_payloads(input, options)
369
+ return replay_payloads(payloads, options, &block) unless payloads.empty?
370
+
371
+ raise
372
+ end
373
+
374
+ # Whether the input opens with a bare "JSON:" / "Final answer:" label (which would
375
+ # otherwise parse, wrongly, as an implicit-root object keyed by the label). We use
376
+ # String#start_with? with a Regexp rather than match?(/\A.../): start_with? checks
377
+ # only the beginning, whereas a \A-anchored match? still retries at every byte
378
+ # position and so scans the WHOLE input (≈0.3s on a 200 MB document) on every parse.
379
+ # (Caller has already established the input is valid_encoding?.)
380
+ def leading_label?(input)
381
+ input.start_with?(/[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
382
+ end
383
+
384
+ def replay_payloads(payloads, options, &block)
385
+ handler = options[:on_warning]
386
+ emit_wrapper_warnings(payloads, handler)
387
+
388
+ results = payloads.map do |payload|
389
+ SmarterJSON.send(:process_content, payload[:slice], options)
390
+ end
391
+
392
+ return results.each(&block).then { nil } if block_given?
393
+ return nil if results.empty?
394
+ return results.first if results.length == 1
395
+
396
+ results
397
+ end
398
+
399
+ def emit_wrapper_warnings(payloads, handler)
400
+ return unless handler
401
+
402
+ meta = payloads.first[:meta]
403
+ warn(handler, :prefix_text_ignored, "ignored non-JSON text before the payload", *meta[:first_pos]) if meta[:prefix]
404
+ warn(handler, :code_fence_stripped, "stripped markdown code fences around the payload", *meta[:first_pos]) if meta[:fence]
405
+ warn(handler, :wrapper_tag_stripped, "stripped wrapper tags around the payload", *meta[:first_pos]) if meta[:wrapper]
406
+ warn(handler, :suffix_text_ignored, "ignored non-JSON text after the payload", *meta[:last_pos]) if meta[:suffix]
407
+ end
408
+
409
+ def extract_payloads(input, options)
410
+ payloads = candidate_ranges(input).filter_map do |range|
411
+ slice = input.byteslice(range.begin, range.end - range.begin)
412
+ begin
413
+ SmarterJSON.send(:process_content, slice, options.merge(on_warning: nil))
414
+ { slice: slice, range: range }
415
+ rescue ParseError
416
+ nil
417
+ end
418
+ end
419
+ meta = wrapper_meta(input, payloads.map { |p| p[:range] })
420
+ payloads.each { |payload| payload[:meta] = meta }
421
+ payloads
422
+ end
423
+
424
+ def wrapper_meta(input, ranges)
425
+ return { prefix: false, suffix: false, fence: false, wrapper: false } if ranges.empty?
426
+
427
+ first = ranges.first
428
+ last = ranges.last
429
+ prefix = input.byteslice(0, first.begin)
430
+ suffix = input.byteslice(last.end, input.bytesize - last.end)
431
+ # Look for fence / wrapper markers only in the text we actually strip (outside
432
+ # every recovered payload), so a ``` or <json> sitting inside a payload's own
433
+ # string value does not trigger a "stripped a wrapper" warning.
434
+ outside = non_payload_text(input, ranges)
435
+ {
436
+ prefix: substantive_text?(prefix),
437
+ suffix: substantive_text?(suffix),
438
+ fence: outside.include?("```"),
439
+ wrapper: outside.match?(/<json\b|BEGIN_JSON\b/i),
440
+ first_pos: line_col_for(input, first.begin),
441
+ last_pos: line_col_for(input, last.begin)
442
+ }
443
+ end
444
+
445
+ def non_payload_text(input, ranges)
446
+ out = +""
447
+ pos = 0
448
+ ranges.each do |range|
449
+ out << input.byteslice(pos, range.begin - pos) if range.begin > pos
450
+ pos = range.end
451
+ end
452
+ out << input.byteslice(pos, input.bytesize - pos) if pos < input.bytesize
453
+ out
454
+ end
455
+
456
+ def line_col_for(input, offset)
457
+ line = 1
458
+ col = 1
459
+ i = 0
460
+ while i < offset
461
+ b = input.getbyte(i)
462
+ break if b.nil?
463
+
464
+ if b == LF
465
+ line += 1
466
+ col = 1
467
+ i += 1
468
+ elsif b == CR
469
+ line += 1
470
+ col = 1
471
+ i += 1
472
+ i += 1 if i < offset && input.getbyte(i) == LF
473
+ else
474
+ col += 1
475
+ i += 1
476
+ end
477
+ end
478
+ [line, col]
479
+ end
480
+
481
+ def substantive_text?(text)
482
+ return false if text.nil? || text.empty?
483
+
484
+ stripped = text.dup
485
+ stripped.gsub!(%r{/\*.*?\*/}m, "")
486
+ stripped.gsub!(/^\s*(?:#|\/\/).*$/, "")
487
+ !stripped.strip.empty? && !stripped.strip.match?(/\A(?:```[a-zA-Z0-9_-]*)?\z/) && !stripped.strip.match?(/\A(?:<\/?json>|BEGIN_JSON|END_JSON)\z/i)
488
+ end
489
+
490
+ def warn(handler, type, message, line, col)
491
+ handler.call(Warning.new(type, message, line, col))
492
+ end
493
+
494
+ def candidate_ranges(input)
495
+ ranges = []
496
+ stack = []
497
+ start_pos = nil
498
+ i = 0
499
+ mode = nil
500
+ while i < input.bytesize
501
+ b = input.getbyte(i)
502
+ if mode == :double
503
+ if b == BACKSLASH
504
+ i += 2
505
+ next
506
+ elsif b == DQUOTE
507
+ mode = nil
508
+ end
509
+ i += 1
510
+ next
511
+ elsif mode == :single
512
+ if b == BACKSLASH
513
+ i += 2
514
+ next
515
+ elsif b == SQUOTE
516
+ mode = nil
517
+ end
518
+ i += 1
519
+ next
520
+ elsif mode == :triple
521
+ if input.byteslice(i, 3) == "'''"
522
+ mode = nil
523
+ i += 3
524
+ else
525
+ i += 1
526
+ end
527
+ next
528
+ elsif mode == :line_comment
529
+ if [LF, CR].include?(b)
530
+ mode = nil
531
+ else
532
+ i += 1
533
+ next
534
+ end
535
+ elsif mode == :block_comment
536
+ if input.byteslice(i, 2) == "*/"
537
+ mode = nil
538
+ i += 2
539
+ else
540
+ i += 1
541
+ end
542
+ next
543
+ else
544
+ if input.byteslice(i, 2) == "//"
545
+ mode = :line_comment
546
+ i += 2
547
+ next
548
+ elsif input.byteslice(i, 2) == "/*"
549
+ mode = :block_comment
550
+ i += 2
551
+ next
552
+ elsif b == HASH
553
+ mode = :line_comment
554
+ i += 1
555
+ next
556
+ elsif b == DQUOTE
557
+ mode = :double
558
+ i += 1
559
+ next
560
+ elsif input.byteslice(i, 3) == "'''"
561
+ mode = :triple
562
+ i += 3
563
+ next
564
+ elsif b == SQUOTE
565
+ mode = :single
566
+ i += 1
567
+ next
568
+ elsif [LBRACE, LBRACKET].include?(b)
569
+ start_pos = i if stack.empty?
570
+ stack << b
571
+ elsif b == RBRACE
572
+ stack.pop if stack.last == LBRACE
573
+ if stack.empty? && start_pos
574
+ ranges << (start_pos...(i + 1))
575
+ start_pos = nil
576
+ end
577
+ elsif b == RBRACKET
578
+ stack.pop if stack.last == LBRACKET
579
+ if stack.empty? && start_pos
580
+ ranges << (start_pos...(i + 1))
581
+ start_pos = nil
582
+ end
583
+ end
584
+ end
585
+ i += 1
586
+ end
587
+ ranges
588
+ end
589
+ end
590
+
591
+ # Hand-rolled FSM single-pass parser.
592
+ # Layer 1: strict JSON (RFC 8259).
593
+ # Layer 2: JSON5 additions — line/block comments, trailing comma,
594
+ # unquoted ECMAScript identifier keys, single-quoted strings,
595
+ # hex numbers, leading/trailing decimal points, Infinity/NaN,
596
+ # explicit + sign, \-line-continuation inside strings.
597
+ # Layer 3: HJSON-inspired additions — #/comment-marker rule, triple-quoted
598
+ # strings, quoteless single-line strings, implicit root object,
599
+ # newline-as-separator, broader unquoted keys, recognized-literals-win.
600
+ # Layer 4: smarter_json additions — UTF-8 BOM skip, smart/curly quotes,
601
+ # Python literals (True/False/None) and undefined, underscores in
602
+ # numeric literals, and encoding validation (SmarterJSON::EncodingError).
603
+ class Parser
604
+ include Bytes
124
605
 
125
606
  NOT_NUMERIC = Object.new
126
607
  HEX_RE = /\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze
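Review note on the hunk above: `Recovery.leading_label?` relies on `String#start_with?` accepting a Regexp (available since Ruby 2.5), which only matches at the start of the string. The sketch below copies the pattern from the diff into a standalone script to show which inputs it flags. The constant name `LABEL` and the sample strings are made up for this note, and nothing here measures the performance difference described in the source comment.

```ruby
# Standalone copy of the pattern used by leading_label? in the diff.
# LABEL is a hypothetical name chosen for this sketch.
LABEL = /[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i

def leading_label?(input)
  # start_with? with a Regexp only succeeds when the match begins at offset 0.
  input.start_with?(LABEL)
end

labelled = 'JSON: {"a": 1}'
spaced   = "  final answer : [1, 2]"
clean    = '{"note": "JSON: inside a value"}'

results = [labelled, spaced, clean].map { |s| leading_label?(s) }
# results is [true, true, false]: only inputs that open with the label are
# intercepted, and a label-like substring inside a value is left alone.
```

Only the first two inputs would be routed through payload extraction before parsing. The third stays on the normal parse path even though its value contains `JSON:`.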
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterJSON
4
- VERSION = "0.7.0"
4
+ VERSION = "0.9.2"
5
5
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_json
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.0
4
+ version: 0.9.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda