smarter_json 0.7.0 → 0.9.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -1
- data/README.md +45 -3
- data/docs/_introduction.md +2 -2
- data/docs/basic_read_api.md +38 -8
- data/docs/examples.md +60 -7
- data/docs/options.md +1 -1
- data/ext/smarter_json/smarter_json.c +46 -16
- data/lib/smarter_json/parser.rb +501 -20
- data/lib/smarter_json/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
|
|
4
|
+
data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
|
|
7
|
+
data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29
|
data/CHANGELOG.md
CHANGED
|
@@ -3,7 +3,30 @@
|
|
|
3
3
|
|
|
4
4
|
> 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
|
|
5
5
|
|
|
6
|
-
## 0.
|
|
6
|
+
## 0.9.2 (2026-06-03)
|
|
7
|
+
- **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
|
|
8
|
+
|
|
9
|
+
## 0.9.1 (2026-06-03 unreleased)
|
|
10
|
+
- **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
|
|
11
|
+
- **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned — the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
|
|
12
|
+
- Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
|
|
13
|
+
- Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
|
|
14
|
+
|
|
15
|
+
## 0.9.0 (2026-06-03 unreleased)
|
|
16
|
+
- performance improvements
|
|
17
|
+
- code cleanup
|
|
18
|
+
|
|
19
|
+
## 0.8.0 (2026-06-03)
|
|
20
|
+
- **Robustness** against LLM-generated / wrapped JSON:
|
|
21
|
+
- strips markdown code fences (```json / ```)
|
|
22
|
+
- ignores obvious prefix / suffix prose around a payload
|
|
23
|
+
- unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`
|
|
24
|
+
- preserves multiple recovered payloads as an `Array`
|
|
25
|
+
- supports pretty-printed multi-line document framing on IO / block input
|
|
26
|
+
- **Warnings** now cover wrapper recovery too (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
|
|
27
|
+
- **No truncation recovery**: truncated / unterminated input still raises `SmarterJSON::ParseError`
|
|
28
|
+
|
|
29
|
+
## 0.7.0 (2026-06-03)
|
|
7
30
|
- **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` tuple return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as the parser applies a lenient fix; `process` / `process_file` now always return the bare value (nil / value / Array) on every path. Unlike the tuple, this also fires on the streaming block form. The default (no handler) records nothing and costs nothing.
|
|
8
31
|
|
|
9
32
|
## 0.6.0 (2026-06-02)
|
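The 0.9.1 and 0.9.2 entries above describe two ideas: parse first and run wrapper recovery only when that parse fails, and test for a leading label with `String#start_with?` so that only the beginning of the input is inspected. The following is a minimal sketch of both ideas, using the stdlib `json` parser as a stand-in. The names `reactive_parse`, `LEADING_LABEL`, and `FENCED_BODY` and both patterns are hypothetical, chosen for illustration; this is not SmarterJSON's actual code.

```ruby
require "json"

# Hypothetical patterns, for illustration only.
LEADING_LABEL = /[A-Za-z_]+:\s*(?=[{\[])/                 # e.g. "JSON: " directly before { or [
FENCED_BODY   = /```(?:json)?[ \t]*\n(.*?)\n[ \t]*```/m   # body of the first markdown fence

def reactive_parse(text)
  # Up-front check: start_with? attempts the match at offset 0 only.
  body = text.start_with?(LEADING_LABEL) ? text.sub(LEADING_LABEL, "") : text
  begin
    # Fast path: input that parses as-is never reaches the recovery code,
    # even if a string value happens to contain ``` somewhere.
    JSON.parse(body)
  rescue JSON::ParserError
    # Recovery runs only after a real parse failure.
    fenced = body[FENCED_BODY, 1]
    raise if fenced.nil? # nothing to recover (e.g. truncated input): re-raise
    JSON.parse(fenced)
  end
end
```

With this shape, a payload such as `{"note": "use ``` fences"}` is parsed once and returned directly, while a fenced reply takes the slower recovery path only because the first parse failed.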
data/README.md
CHANGED
|
@@ -16,13 +16,14 @@ Three things set it apart:
|
|
|
16
16
|
|
|
17
17
|
1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
|
|
18
18
|
|
|
19
|
-
2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
|
|
19
|
+
2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
|
|
20
20
|
|
|
21
21
|
3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
|
|
22
22
|
|
|
23
23
|
## What it accepts, beyond strict JSON
|
|
24
24
|
|
|
25
25
|
- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` parses as a string, not a truncated value)
|
|
26
|
+
- Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
|
|
26
27
|
- Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
|
|
27
28
|
- Implicit root object — a config file that starts with `key: value`, no outer `{}`
|
|
28
29
|
- `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
|
|
@@ -45,6 +46,12 @@ gem install smarter_json
|
|
|
45
46
|
|
|
46
47
|
The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
|
|
47
48
|
|
|
49
|
+
## API stability and thread safety
|
|
50
|
+
|
|
51
|
+
The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
|
|
52
|
+
|
|
53
|
+
Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
|
|
54
|
+
|
|
48
55
|
## Documentation
|
|
49
56
|
|
|
50
57
|
* [Introduction](docs/_introduction.md)
|
|
@@ -70,6 +77,41 @@ SmarterJSON.process("") # => nil (zero
|
|
|
70
77
|
# For input larger than memory, stream one document at a time with a block
|
|
71
78
|
# (process and process_file both forward the block):
|
|
72
79
|
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
80
|
+
|
|
81
|
+
# Wrapper noise is stripped automatically:
|
|
82
|
+
SmarterJSON.process(<<~TEXT)
|
|
83
|
+
Here is the JSON:
|
|
84
|
+
|
|
85
|
+
```json
|
|
86
|
+
{
|
|
87
|
+
"a": 1
|
|
88
|
+
}
|
|
89
|
+
```
|
|
90
|
+
TEXT
|
|
91
|
+
# => {"a"=>1}
|
|
92
|
+
|
|
93
|
+
SmarterJSON.process(<<~TEXT)
|
|
94
|
+
Here is the result:
|
|
95
|
+
|
|
96
|
+
{
|
|
97
|
+
"a": 1
|
|
98
|
+
}
|
|
99
|
+
|
|
100
|
+
Hope this helps.
|
|
101
|
+
TEXT
|
|
102
|
+
# => {"a"=>1}
|
|
103
|
+
|
|
104
|
+
SmarterJSON.process("<json>{\"a\":1}</json>")
|
|
105
|
+
# => {"a"=>1}
|
|
106
|
+
|
|
107
|
+
SmarterJSON.process(<<~TEXT)
|
|
108
|
+
first attempt:
|
|
109
|
+
{"a":1}
|
|
110
|
+
|
|
111
|
+
corrected payload:
|
|
112
|
+
{"b":2}
|
|
113
|
+
TEXT
|
|
114
|
+
# => [{"a"=>1}, {"b"=>2}]
|
|
73
115
|
```
|
|
74
116
|
|
|
75
117
|
### Options
|
|
@@ -85,7 +127,7 @@ SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event
|
|
|
85
127
|
|
|
86
128
|
### Warnings (`on_warning`)
|
|
87
129
|
|
|
88
|
-
When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
|
|
130
|
+
When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
|
|
89
131
|
|
|
90
132
|
```ruby
|
|
91
133
|
# Collect them all:
|
|
@@ -106,7 +148,7 @@ Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpu
|
|
|
106
148
|
|
|
107
149
|
**Two notes on fair comparison:**
|
|
108
150
|
|
|
109
|
-
- **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once
|
|
151
|
+
- **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
|
|
110
152
|
- **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
|
|
111
153
|
|
|
112
154
|
## Encoding
|
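The README changes above restate the return rule for plain `process`: zero documents give `nil`, one document gives its value, and several give an `Array`. As a rough illustration of that rule only, here is a sketch built on the stdlib `json` parser and one-document-per-line input. The helpers `collapse_documents` and `parse_ndjson` are hypothetical names and are not part of SmarterJSON.

```ruby
require "json"

# Apply the "nil / value / Array" rule to a list of already-parsed documents.
def collapse_documents(documents)
  case documents.length
  when 0 then nil
  when 1 then documents.first
  else documents
  end
end

# Parse one JSON document per non-blank line, then collapse the result.
def parse_ndjson(text)
  documents = text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
  collapse_documents(documents)
end
```

A caller that always wants a list can wrap the result with `Array(...)` when the documents are Hashes of interest, but should check for `nil` first, since an empty input and a single `null` document are otherwise indistinguishable.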
data/docs/_introduction.md
CHANGED
|
@@ -11,7 +11,7 @@
|
|
|
11
11
|
|
|
12
12
|
# SmarterJSON Introduction
|
|
13
13
|
|
|
14
|
-
`smarter_json` is a fast, lenient JSON parser and writer for Ruby. It reads strict JSON, JSON5, HJSON-style config, newline-delimited JSON (NDJSON / JSONL), and the messy JSON-ish input humans actually paste — and in benchmarks it matches or beats Oj on nearly every file. It is opinionated: it optimizes for getting your data out, not for policing the JSON spec. Where other parsers stop at the first deviation, SmarterJSON keeps going.
|
|
14
|
+
`smarter_json` is a fast, lenient JSON parser and writer for Ruby. It reads strict JSON, JSON5, HJSON-style config, newline-delimited JSON (NDJSON / JSONL), markdown-wrapped / chatty blobs around a JSON payload, and the messy JSON-ish input humans actually paste — and in benchmarks it matches or beats Oj on nearly every file. It is opinionated: it optimizes for getting your data out, not for policing the JSON spec. Where other parsers stop at the first deviation, SmarterJSON keeps going.
|
|
15
15
|
|
|
16
16
|
## Why another JSON library?
|
|
17
17
|
|
|
@@ -21,7 +21,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
|
|
|
21
21
|
|
|
22
22
|
* **One reader, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the reader to match your input; it adapts to whatever you give it.
|
|
23
23
|
|
|
24
|
-
* **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
|
|
24
|
+
* **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
|
|
25
25
|
|
|
26
26
|
* **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser. Floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
|
|
27
27
|
|
data/docs/basic_read_api.md
CHANGED
|
@@ -22,7 +22,40 @@ SmarterJSON.process('{"a": 1, "b": [2, 3]}') # => {"a"=>1, "b"=>[2, 3]}
|
|
|
22
22
|
SmarterJSON.process("host: localhost\nport: 5432") # => {"host"=>"localhost", "port"=>5432} (no braces needed)
|
|
23
23
|
```
|
|
24
24
|
|
|
25
|
-
`process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths.
|
|
25
|
+
`process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths. When the input wraps the payload in obvious markdown / prose / tags, `process` strips that wrapper first and then parses the recovered payload(s).
|
|
26
|
+
|
|
27
|
+
```ruby
|
|
28
|
+
SmarterJSON.process(<<~TEXT)
|
|
29
|
+
Here is the JSON:
|
|
30
|
+
|
|
31
|
+
```json
|
|
32
|
+
{
|
|
33
|
+
"a": 1
|
|
34
|
+
}
|
|
35
|
+
```
|
|
36
|
+
TEXT
|
|
37
|
+
# => {"a"=>1}
|
|
38
|
+
|
|
39
|
+
SmarterJSON.process(<<~TEXT)
|
|
40
|
+
Here is the result:
|
|
41
|
+
|
|
42
|
+
{
|
|
43
|
+
"a": 1
|
|
44
|
+
}
|
|
45
|
+
|
|
46
|
+
Hope this helps.
|
|
47
|
+
TEXT
|
|
48
|
+
# => {"a"=>1}
|
|
49
|
+
|
|
50
|
+
SmarterJSON.process(<<~TEXT)
|
|
51
|
+
first attempt:
|
|
52
|
+
{"a":1}
|
|
53
|
+
|
|
54
|
+
corrected payload:
|
|
55
|
+
{"b":2}
|
|
56
|
+
TEXT
|
|
57
|
+
# => [{"a"=>1}, {"b"=>2}]
|
|
58
|
+
```
|
|
26
59
|
|
|
27
60
|
```ruby
|
|
28
61
|
SmarterJSON.process(io) # an open IO (File, StringIO, socket, …) — reads it and parses
|
|
@@ -39,7 +72,7 @@ SmarterJSON.process('{"id":1}') # => {"id"=>1} (one
|
|
|
39
72
|
SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2}, {"id"=>3}] (two or more → an Array)
|
|
40
73
|
```
|
|
41
74
|
|
|
42
|
-
Documents are separated by whitespace, newlines, or simple concatenation — **not** by commas (a comma between top-level documents would be read as an implicit root array, which is not supported). Only SmarterJSON reads this via plain `process`: Oj and the stdlib `json` library raise without a block.
|
|
75
|
+
Documents are separated by whitespace, newlines, or simple concatenation — **not** by commas (a comma between top-level documents would be read as an implicit root array, which is not supported). If wrapper noise is stripped and several payloads are recovered, they are returned by the same rule: one payload → its value, several → an `Array`. Only SmarterJSON reads this via plain `process`: Oj and the stdlib `json` library raise without a block.
|
|
43
76
|
|
|
44
77
|
## `SmarterJSON.process_file` — read a file by path
|
|
45
78
|
|
|
@@ -51,17 +84,14 @@ SmarterJSON.process_file("config.json5") # read the file, then parse — sam
|
|
|
51
84
|
|
|
52
85
|
## Streaming with a block (bounded memory)
|
|
53
86
|
|
|
54
|
-
For input larger than memory, pass a block. Each top-level document is yielded as it is
|
|
87
|
+
For input larger than memory, pass a block. Each recovered top-level document is yielded as it is framed, and the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
|
|
55
88
|
|
|
56
89
|
```ruby
|
|
57
|
-
# Stream straight from disk, one document at a time — the whole file is never loaded:
|
|
58
90
|
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
59
|
-
|
|
60
|
-
# Same for an IO:
|
|
61
91
|
SmarterJSON.process(io) { |doc| handle(doc) }
|
|
62
92
|
```
|
|
63
93
|
|
|
64
|
-
The streaming path
|
|
94
|
+
The streaming path now frames whole top-level documents, not just one line at a time. That means NDJSON / JSONL still work, but pretty-printed multi-line objects and arrays work too, as do mixed `\n` / `\r\n` / `\r` line endings and comment-only separators between documents.
|
|
65
95
|
|
|
66
96
|
## The C extension and the pure-Ruby fallback
|
|
67
97
|
|
|
@@ -79,7 +109,7 @@ warns.map(&:type) # => [:empty_slot]
|
|
|
79
109
|
warns.first.to_s # => "extra comma, collapsed an empty slot at line 1, col 4"
|
|
80
110
|
```
|
|
81
111
|
|
|
82
|
-
Each warning is a `SmarterJSON::Warning` with `type`, `message`, `line`, and `col`. The types are `:empty_slot` (a collapsed empty comma slot), `:empty_value` (a key with no value, read as `null`),
|
|
112
|
+
Each warning is a `SmarterJSON::Warning` with `type`, `message`, `line`, and `col`. The types are `:empty_slot` (a collapsed empty comma slot), `:empty_value` (a key with no value, read as `null`), `:duplicate_key` (a repeated key that was dropped), plus wrapper-recovery warnings such as `:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, and `:wrapper_tag_stripped`. Clean input never invokes the handler. It fires on every path — including the streaming block form — and works the same on the C and pure-Ruby paths. See [Configuration Options](./options.md).
|
|
83
113
|
|
|
84
114
|
---------------
|
|
85
115
|
|
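The streaming section above says the block form now frames whole top-level documents rather than single lines. One way to picture that is a scanner that tracks bracket depth and string state across read chunks and emits a document when the depth returns to zero. The sketch below makes that concrete under simplifying assumptions: strict JSON objects and arrays only, no comments, no single-quoted or triple-quoted strings. The method name `each_framed_document` and its `chunk_size:` parameter are hypothetical; the gem's own `Framer` handles more cases.

```ruby
require "json"
require "stringio"

# Yield each complete top-level {...} or [...] document read from io.
def each_framed_document(io, chunk_size: 16 * 1024)
  buffer = String.new # binary buffer; IO#read(n) returns binary strings
  pos = 0             # next byte to scan; persists across chunks
  depth = 0           # current bracket nesting depth
  doc_start = nil     # offset where the current document began
  in_string = false
  escaped = false

  while (chunk = io.read(chunk_size))
    buffer << chunk
    while pos < buffer.bytesize
      ch = buffer[pos]
      if in_string
        if escaped
          escaped = false
        elsif ch == "\\"
          escaped = true
        elsif ch == '"'
          in_string = false
        end
      elsif ch == '"'
        in_string = true
      elsif ch == "{" || ch == "["
        doc_start = pos if depth.zero?
        depth += 1
      elsif (ch == "}" || ch == "]") && depth.positive?
        depth -= 1
        if depth.zero?
          yield buffer[doc_start..pos].force_encoding(Encoding::UTF_8)
          buffer.slice!(0..pos) # drop the emitted document and anything before it
          pos = -1              # restart the scan at the head of the shrunken buffer
          doc_start = nil
        end
      end
      pos += 1
    end
  end

  # Hand any non-blank leftover to the caller so truncated input still
  # reaches the parser (and fails there) instead of vanishing silently.
  rest = buffer.strip
  yield rest.force_encoding(Encoding::UTF_8) unless rest.empty?
end
```

Because the scan position and the string/escape flags live outside the chunk loop, a brace inside a string that straddles a chunk boundary does not end a document early.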
data/docs/examples.md
CHANGED
|
@@ -24,9 +24,10 @@
|
|
|
24
24
|
7. [Duplicate Keys](#example-7-duplicate-keys)
|
|
25
25
|
8. [High-Precision Numbers: BigDecimal vs Float](#example-8-high-precision-numbers-bigdecimal-vs-float)
|
|
26
26
|
9. [Lenient Input: Comments, Trailing Commas, Unquoted Keys](#example-9-lenient-input-comments-trailing-commas-unquoted-keys)
|
|
27
|
-
10. [
|
|
28
|
-
11. [Write
|
|
29
|
-
12. [
|
|
27
|
+
10. [Wrapper Noise Around a Payload](#example-10-wrapper-noise-around-a-payload)
|
|
28
|
+
11. [Write JSON](#example-11-write-json)
|
|
29
|
+
12. [Write NDJSON](#example-12-write-ndjson)
|
|
30
|
+
13. [Round-Trip Read and Write](#example-13-round-trip-read-and-write)
|
|
30
31
|
|
|
31
32
|
---
|
|
32
33
|
|
|
@@ -66,7 +67,7 @@ SmarterJSON.process("") # => nil
|
|
|
66
67
|
|
|
67
68
|
### Example 5: Streaming a Large File with a Block
|
|
68
69
|
|
|
69
|
-
For input larger than memory, pass a block. Each document is yielded
|
|
70
|
+
For input larger than memory, pass a block. Each recovered document is yielded one at a time:
|
|
70
71
|
|
|
71
72
|
```ruby
|
|
72
73
|
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
@@ -113,14 +114,66 @@ JSON
|
|
|
113
114
|
|
|
114
115
|
A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.
|
|
115
116
|
|
|
116
|
-
### Example 10:
|
|
117
|
+
### Example 10: Wrapper Noise Around a Payload
|
|
118
|
+
|
|
119
|
+
#### Fenced payload
|
|
120
|
+
|
|
121
|
+
```ruby
|
|
122
|
+
SmarterJSON.process(<<~TEXT)
|
|
123
|
+
Here is the JSON:
|
|
124
|
+
|
|
125
|
+
```json
|
|
126
|
+
{
|
|
127
|
+
"a": 1
|
|
128
|
+
}
|
|
129
|
+
```
|
|
130
|
+
TEXT
|
|
131
|
+
# => {"a"=>1}
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
#### Prose before / after the payload
|
|
135
|
+
|
|
136
|
+
```ruby
|
|
137
|
+
SmarterJSON.process(<<~TEXT)
|
|
138
|
+
Here is the result:
|
|
139
|
+
|
|
140
|
+
{
|
|
141
|
+
"a": 1
|
|
142
|
+
}
|
|
143
|
+
|
|
144
|
+
Hope this helps.
|
|
145
|
+
TEXT
|
|
146
|
+
# => {"a"=>1}
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
#### Wrapper tags
|
|
150
|
+
|
|
151
|
+
```ruby
|
|
152
|
+
SmarterJSON.process("<json>{\"a\":1}</json>")
|
|
153
|
+
# => {"a"=>1}
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
#### Multiple recovered payloads from one noisy blob
|
|
157
|
+
|
|
158
|
+
```ruby
|
|
159
|
+
SmarterJSON.process(<<~TEXT)
|
|
160
|
+
first attempt:
|
|
161
|
+
{"a":1}
|
|
162
|
+
|
|
163
|
+
corrected payload:
|
|
164
|
+
{"b":2}
|
|
165
|
+
TEXT
|
|
166
|
+
# => [{"a"=>1}, {"b"=>2}]
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
### Example 11: Write JSON
|
|
117
170
|
|
|
118
171
|
```ruby
|
|
119
172
|
SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
|
|
120
173
|
SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
|
|
121
174
|
```
|
|
122
175
|
|
|
123
|
-
### Example
|
|
176
|
+
### Example 12: Write NDJSON
|
|
124
177
|
|
|
125
178
|
An Array writes one element per line:
|
|
126
179
|
|
|
@@ -128,7 +181,7 @@ An Array writes one element per line:
|
|
|
128
181
|
SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
|
|
129
182
|
```
|
|
130
183
|
|
|
131
|
-
### Example
|
|
184
|
+
### Example 13: Round-Trip Read and Write
|
|
132
185
|
|
|
133
186
|
```ruby
|
|
134
187
|
obj = { "a" => 1, "b" => [2, "three", nil, true] }
|
data/docs/options.md
CHANGED
|
@@ -43,7 +43,7 @@ warns.map(&:type) # => [:empty_slot]
|
|
|
43
43
|
warns.first.to_s # => "extra comma, collapsed an empty slot at line 1, col 4"
|
|
44
44
|
```
|
|
45
45
|
|
|
46
|
-
The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`), `:empty_value` (a key with no value, read as `null`, e.g. `{a:}`), and `:duplicate_key` (a repeated key that was dropped)
|
|
46
|
+
The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`), `:empty_value` (a key with no value, read as `null`, e.g. `{a:}`), and `:duplicate_key` (a repeated key that was dropped), plus wrapper-recovery warnings such as `:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, and `:wrapper_tag_stripped`. Clean input never invokes the handler. Warnings work on both the C and pure-Ruby paths, so `acceleration:` doesn't change them.
|
|
47
47
|
|
|
48
48
|
### A note on `:encoding`
|
|
49
49
|
|
|
data/ext/smarter_json/smarter_json.c
CHANGED
|
@@ -41,6 +41,18 @@ static VALUE fj_sym_duplicate_key;
|
|
|
41
41
|
static ID fj_bigdecimal_id; /* cached BigDecimal() method id (set in Init) */
|
|
42
42
|
static ID fj_to_sym_id; /* cached :to_sym (symbolize_keys) */
|
|
43
43
|
static ID fj_key_p_id; /* cached :key? (non-default duplicate_key modes) */
|
|
44
|
+
static ID fj_force_encoding_id;
|
|
45
|
+
static ID fj_valid_encoding_p_id;
|
|
46
|
+
static ID fj_encoding_id;
|
|
47
|
+
static ID fj_name_id;
|
|
48
|
+
static VALUE fj_sym_encoding;
|
|
49
|
+
static VALUE fj_sym_symbolize_keys;
|
|
50
|
+
static VALUE fj_sym_first_wins;
|
|
51
|
+
static VALUE fj_sym_raise;
|
|
52
|
+
static VALUE fj_sym_bigdecimal_load;
|
|
53
|
+
static VALUE fj_sym_float;
|
|
54
|
+
static VALUE fj_sym_bigdecimal;
|
|
55
|
+
static VALUE fj_sym_on_warning;
|
|
44
56
|
|
|
45
57
|
/* Per-parse direct-mapped key cache: key bytes -> the interned (frozen,
|
|
46
58
|
* globally-rooted) String, so repeated keys skip the global fstring lookup.
|
|
@@ -373,11 +385,17 @@ static void fj_consume_keyword(fj_state *st, const char *word) {
|
|
|
373
385
|
fj_advance(st, n);
|
|
374
386
|
}
|
|
375
387
|
|
|
376
|
-
/* Copy a byte range into a fresh String, dropping underscores.
|
|
388
|
+
/* Copy a byte range into a fresh String, dropping underscores. Copies whole
|
|
389
|
+
* underscore-free runs in bulk, rather than one byte at a time. */
|
|
377
390
|
static VALUE fj_strip_underscores(const char *p, long n) {
|
|
378
391
|
VALUE s = rb_str_buf_new(n);
|
|
379
|
-
long i;
|
|
380
|
-
|
|
392
|
+
long i = 0;
|
|
393
|
+
while (i < n) {
|
|
394
|
+
long start = i;
|
|
395
|
+
while (i < n && p[i] != '_') i++;
|
|
396
|
+
if (i > start) rb_str_buf_cat(s, p + start, i - start);
|
|
397
|
+
if (i < n) i++; /* skip '_' */
|
|
398
|
+
}
|
|
381
399
|
return s;
|
|
382
400
|
}
|
|
383
401
|
|
|
@@ -1379,14 +1397,14 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
|
|
|
1379
1397
|
|
|
1380
1398
|
Check_Type(input, T_STRING);
|
|
1381
1399
|
|
|
1382
|
-
enc_opt = rb_hash_aref(opts,
|
|
1400
|
+
enc_opt = rb_hash_aref(opts, fj_sym_encoding);
|
|
1383
1401
|
if (!NIL_P(enc_opt)) {
|
|
1384
|
-
input = rb_funcall(rb_str_dup(input),
|
|
1402
|
+
input = rb_funcall(rb_str_dup(input), fj_force_encoding_id, 1, enc_opt);
|
|
1385
1403
|
}
|
|
1386
|
-
if (!RTEST(rb_funcall(input,
|
|
1387
|
-
VALUE name = rb_funcall(rb_funcall(input,
|
|
1404
|
+
if (!RTEST(rb_funcall(input, fj_valid_encoding_p_id, 0))) {
|
|
1405
|
+
VALUE name = rb_funcall(rb_funcall(input, fj_encoding_id, 0), fj_name_id, 0);
|
|
1388
1406
|
VALUE msg = rb_sprintf("invalid byte sequence for %" PRIsVALUE, name);
|
|
1389
|
-
rb_exc_raise(rb_funcall(cEncodingError,
|
|
1407
|
+
rb_exc_raise(rb_funcall(cEncodingError, fj_new_id, 3, msg, Qnil, Qnil));
|
|
1390
1408
|
}
|
|
1391
1409
|
|
|
1392
1410
|
st.buf = RSTRING_PTR(input);
|
|
@@ -1402,19 +1420,19 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
|
|
|
1402
1420
|
st.kcache = NULL;
|
|
1403
1421
|
#endif
|
|
1404
1422
|
|
|
1405
|
-
st.symbolize_keys = RTEST(rb_hash_aref(opts,
|
|
1406
|
-
dk = rb_hash_aref(opts,
|
|
1407
|
-
st.dup_first_wins = (dk ==
|
|
1408
|
-
st.dup_raise = (dk ==
|
|
1423
|
+
st.symbolize_keys = RTEST(rb_hash_aref(opts, fj_sym_symbolize_keys));
|
|
1424
|
+
dk = rb_hash_aref(opts, fj_sym_duplicate_key);
|
|
1425
|
+
st.dup_first_wins = (dk == fj_sym_first_wins);
|
|
1426
|
+
st.dup_raise = (dk == fj_sym_raise);
|
|
1409
1427
|
|
|
1410
1428
|
{
|
|
1411
|
-
VALUE bd = rb_hash_aref(opts,
|
|
1412
|
-
if (bd ==
|
|
1413
|
-
else if (bd ==
|
|
1429
|
+
VALUE bd = rb_hash_aref(opts, fj_sym_bigdecimal_load);
|
|
1430
|
+
if (bd == fj_sym_float) st.bigdecimal_load = 0;
|
|
1431
|
+
else if (bd == fj_sym_bigdecimal) st.bigdecimal_load = 2;
|
|
1414
1432
|
else st.bigdecimal_load = 1; /* :auto (default), including nil */
|
|
1415
1433
|
}
|
|
1416
1434
|
|
|
1417
|
-
st.on_warning = rb_hash_aref(opts,
|
|
1435
|
+
st.on_warning = rb_hash_aref(opts, fj_sym_on_warning); /* Qnil when absent */
|
|
1418
1436
|
|
|
1419
1437
|
if (st.len >= 3 && (unsigned char)st.buf[0] == 0xEF &&
|
|
1420
1438
|
(unsigned char)st.buf[1] == 0xBB && (unsigned char)st.buf[2] == 0xBF) {
|
|
@@ -1465,8 +1483,20 @@ void Init_smarter_json(void) {
|
|
|
1465
1483
|
fj_key_p_id = rb_intern("key?");
|
|
1466
1484
|
fj_new_id = rb_intern("new");
|
|
1467
1485
|
fj_call_id = rb_intern("call");
|
|
1486
|
+
fj_force_encoding_id = rb_intern("force_encoding");
|
|
1487
|
+
fj_valid_encoding_p_id = rb_intern("valid_encoding?");
|
|
1488
|
+
fj_encoding_id = rb_intern("encoding");
|
|
1489
|
+
fj_name_id = rb_intern("name");
|
|
1468
1490
|
fj_sym_empty_slot = ID2SYM(rb_intern("empty_slot"));
|
|
1469
1491
|
fj_sym_empty_value = ID2SYM(rb_intern("empty_value"));
|
|
1470
1492
|
fj_sym_duplicate_key = ID2SYM(rb_intern("duplicate_key"));
|
|
1493
|
+
fj_sym_encoding = ID2SYM(rb_intern("encoding"));
|
|
1494
|
+
fj_sym_symbolize_keys = ID2SYM(rb_intern("symbolize_keys"));
|
|
1495
|
+
fj_sym_first_wins = ID2SYM(rb_intern("first_wins"));
|
|
1496
|
+
fj_sym_raise = ID2SYM(rb_intern("raise"));
|
|
1497
|
+
fj_sym_bigdecimal_load = ID2SYM(rb_intern("bigdecimal_load"));
|
|
1498
|
+
fj_sym_float = ID2SYM(rb_intern("float"));
|
|
1499
|
+
fj_sym_bigdecimal = ID2SYM(rb_intern("bigdecimal"));
|
|
1500
|
+
fj_sym_on_warning = ID2SYM(rb_intern("on_warning"));
|
|
1471
1501
|
rb_define_module_function(mSmarterJSON, "parse_c", fj_parse_c, 2);
|
|
1472
1502
|
}
|
data/lib/smarter_json/parser.rb
CHANGED
|
@@ -22,9 +22,9 @@ module SmarterJSON
|
|
|
22
22
|
# stream as newline-delimited documents (NDJSON / JSONL), one per line.
|
|
23
23
|
def process(input, options = {}, &block)
|
|
24
24
|
if input.is_a?(String)
|
|
25
|
-
|
|
25
|
+
Recovery.process_string(input, options, &block)
|
|
26
26
|
elsif input.respond_to?(:read)
|
|
27
|
-
block ? stream_io(input, options, &block) :
|
|
27
|
+
block ? stream_io(input, options, &block) : process(input.read, options)
|
|
28
28
|
else
|
|
29
29
|
raise ArgumentError, "SmarterJSON.process expects a String or an IO, got #{input.class}"
|
|
30
30
|
end
|
|
@@ -43,7 +43,7 @@ module SmarterJSON
|
|
|
43
43
|
if block
|
|
44
44
|
File.open(path, "r:#{encoding}") { |io| stream_io(io, options, &block) }
|
|
45
45
|
else
|
|
46
|
-
|
|
46
|
+
process(File.read(path, encoding: encoding), options)
|
|
47
47
|
end
|
|
48
48
|
end
|
|
49
49
|
|
|
@@ -63,29 +63,18 @@ module SmarterJSON
|
|
|
63
63
|
end
|
|
64
64
|
end
|
|
65
65
|
|
|
66
|
-
# Stream documents from an IO,
|
|
67
|
-
#
|
|
68
|
-
# spanning multiple lines is not supported by the streaming path.
|
|
66
|
+
# Stream documents from an IO incrementally, yielding each recovered top-level
|
|
67
|
+
# document without slurping the whole input into memory first.
|
|
69
68
|
def stream_io(io, options, &block)
|
|
70
|
-
|
|
69
|
+
Framer.each_document(io) { |doc| Recovery.process_string(doc, options, &block) }
|
|
71
70
|
nil
|
|
72
71
|
end
|
|
73
72
|
|
|
74
73
|
private_class_method :process_content, :stream_io
|
|
75
74
|
|
|
76
|
-
#
|
|
77
|
-
#
|
|
78
|
-
|
|
79
|
-
# unquoted ECMAScript identifier keys, single-quoted strings,
|
|
80
|
-
# hex numbers, leading/trailing decimal points, Infinity/NaN,
|
|
81
|
-
# explicit + sign, \-line-continuation inside strings.
|
|
82
|
-
# Layer 3: HJSON-inspired additions — #/comment-marker rule, triple-quoted
|
|
83
|
-
# strings, quoteless single-line strings, implicit root object,
|
|
84
|
-
# newline-as-separator, broader unquoted keys, recognized-literals-win.
|
|
85
|
-
# Layer 4: smarter_json additions — UTF-8 BOM skip, smart/curly quotes,
|
|
86
|
-
# Python literals (True/False/None) and undefined, underscores in
|
|
87
|
-
# numeric literals, and encoding validation (SmarterJSON::EncodingError).
|
|
88
|
-
class Parser
|
|
75
|
+
# Named byte values, shared by the Parser FSM and the Framer / Recovery byte
|
|
76
|
+
# scanners so none of them spell out raw hex. Included where needed.
|
|
77
|
+
module Bytes
|
|
89
78
|
LBRACE = 0x7B
|
|
90
79
|
RBRACE = 0x7D
|
|
91
80
|
LBRACKET = 0x5B
|
|
@@ -121,6 +110,498 @@ module SmarterJSON
|
|
|
121
110
|
TAB = 0x09
|
|
122
111
|
LF = 0x0A
|
|
123
112
|
CR = 0x0D
|
|
113
|
+
end
|
|
114
|
+
|
|
115
|
+
module Framer
|
|
116
|
+
include Bytes
|
|
117
|
+
|
|
118
|
+
CHUNK_SIZE = 16 * 1024
|
|
119
|
+
|
|
120
|
+
module_function
|
|
121
|
+
|
|
122
|
+
def each_document(io, &block)
|
|
123
|
+
buffer = +""
|
|
124
|
+
scan = 0
|
|
125
|
+
doc_start = nil
|
|
126
|
+
stack = []
|
|
127
|
+
mode = nil
|
|
128
|
+
|
|
129
|
+
while (chunk = read_chunk(io))
|
|
130
|
+
buffer << chunk
|
|
131
|
+
loop do
|
|
132
|
+
emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
|
|
133
|
+
break unless emitted
|
|
134
|
+
|
|
135
|
+
yield emitted
|
|
136
|
+
end
|
|
137
|
+
end
|
|
138
|
+
|
|
139
|
+
yield buffer unless separators_only?(buffer)
|
|
140
|
+
end
|
|
141
|
+
|
|
142
|
+
def read_chunk(io)
|
|
143
|
+
if io.respond_to?(:readpartial)
|
|
144
|
+
io.readpartial(CHUNK_SIZE)
|
|
145
|
+
else
|
|
146
|
+
io.read(CHUNK_SIZE)
|
|
147
|
+
end
|
|
148
|
+
rescue EOFError
|
|
149
|
+
nil
|
|
150
|
+
end
|
|
151
|
+
|
|
152
|
+
def scan_buffer(buffer, scan, doc_start, stack, mode)
|
|
153
|
+
while scan < buffer.bytesize
|
|
154
|
+
b = buffer.getbyte(scan)
|
|
155
|
+
# A multi-byte marker (// /* ''' */) whose lead byte is here but whose
|
|
156
|
+
# remaining bytes have not arrived yet must not be guessed at — advancing
|
|
157
|
+
# past the lead byte would misread the brace/quote that follows it once the
|
|
158
|
+
# next chunk lands. Stop and let each_document append more input, then resume
|
|
159
|
+
# from this same position. At true EOF the leftover is parsed whole instead.
|
|
160
|
+
break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)
|
|
161
|
+
|
|
162
|
+
if mode == :double
|
|
163
|
+
if b == BACKSLASH
|
|
164
|
+
scan += 2
|
|
165
|
+
elsif b == DQUOTE
|
|
166
|
+
mode = nil
|
|
167
|
+
scan += 1
|
|
168
|
+
else
|
|
169
|
+
scan += 1
|
|
170
|
+
end
|
|
171
|
+
elsif mode == :single
|
|
172
|
+
if b == BACKSLASH
|
|
173
|
+
scan += 2
|
|
174
|
+
elsif b == SQUOTE
|
|
175
|
+
mode = nil
|
|
176
|
+
scan += 1
|
|
177
|
+
else
|
|
178
|
+
scan += 1
|
|
179
|
+
end
|
|
180
|
+
elsif mode == :triple
|
|
181
|
+
if buffer.byteslice(scan, 3) == "'''"
|
|
182
|
+
mode = nil
|
|
183
|
+
scan += 3
|
|
184
|
+
else
|
|
185
|
+
scan += 1
|
|
186
|
+
end
|
|
187
|
+
elsif mode == :line_comment
|
|
188
|
+
if [LF, CR].include?(b)
|
|
189
|
+
mode = nil
|
|
190
|
+
else
|
|
191
|
+
scan += 1
|
|
192
|
+
next
|
|
193
|
+
end
|
|
194
|
+
elsif mode == :block_comment
|
|
195
|
+
if buffer.byteslice(scan, 2) == '*/'
|
|
196
|
+
mode = nil
|
|
197
|
+
scan += 2
|
|
198
|
+
else
|
|
199
|
+
scan += 1
|
|
200
|
+
end
|
|
201
|
+
elsif doc_start.nil?
|
|
202
|
+
if whitespace_byte?(b)
|
|
203
|
+
scan += 1
|
|
204
|
+
elsif line_comment_start?(buffer, scan)
|
|
205
|
+
mode = :line_comment
|
|
206
|
+
scan += buffer.getbyte(scan) == HASH ? 1 : 2
|
|
207
|
+
elsif block_comment_start?(buffer, scan)
|
|
208
|
+
mode = :block_comment
|
|
209
|
+
scan += 2
|
|
210
|
+
elsif [LBRACE, LBRACKET].include?(b)
|
|
211
|
+
doc_start = scan
|
|
212
|
+
stack << b
|
|
213
|
+
scan += 1
|
|
214
|
+
else
|
|
215
|
+
scan = buffer.bytesize
|
|
216
|
+
end
|
|
217
|
+
else
|
|
218
|
+
if mode.nil? && line_comment_start?(buffer, scan)
|
|
219
|
+
mode = :line_comment
|
|
220
|
+
scan += buffer.getbyte(scan) == HASH ? 1 : 2
|
|
221
|
+
elsif mode.nil? && block_comment_start?(buffer, scan)
|
|
222
|
+
mode = :block_comment
|
|
223
|
+
scan += 2
|
|
224
|
+
elsif b == DQUOTE
|
|
225
|
+
mode = :double
|
|
226
|
+
scan += 1
|
|
227
|
+
elsif buffer.byteslice(scan, 3) == "'''"
|
|
228
|
+
mode = :triple
|
|
229
|
+
scan += 3
|
|
230
|
+
elsif b == SQUOTE
|
|
231
|
+
mode = :single
|
|
232
|
+
scan += 1
|
|
233
|
+
elsif [LBRACE, LBRACKET].include?(b)
|
|
234
|
+
stack << b
|
|
235
|
+
scan += 1
|
|
236
|
+
elsif b == RBRACE
|
|
237
|
+
stack.pop if stack.last == LBRACE
|
|
238
|
+
scan += 1
|
|
239
|
+
if stack.empty?
|
|
240
|
+
doc = buffer.byteslice(doc_start, scan - doc_start)
|
|
241
|
+
buffer = buffer.byteslice(scan..-1) || +""
|
|
242
|
+
return [doc, buffer, 0, nil, [], nil]
|
|
243
|
+
end
|
|
244
|
+
elsif b == RBRACKET
|
|
245
|
+
stack.pop if stack.last == LBRACKET
|
|
246
|
+
scan += 1
|
|
247
|
+
if stack.empty?
|
|
248
|
+
doc = buffer.byteslice(doc_start, scan - doc_start)
|
|
249
|
+
buffer = buffer.byteslice(scan..-1) || +""
|
|
250
|
+
return [doc, buffer, 0, nil, [], nil]
|
|
251
|
+
end
|
|
252
|
+
else
|
|
253
|
+
scan += 1
|
|
254
|
+
end
|
|
255
|
+
end
|
|
256
|
+
end
|
|
257
|
+
|
|
258
|
+
[nil, buffer, scan, doc_start, stack, mode]
|
|
259
|
+
end
|
|
260
|
+
|
|
261
|
+
# True when `b` is the lead byte of a multi-byte marker but the rest of that
|
|
262
|
+
# marker has not been read into the buffer yet, so we cannot decide what it is.
|
|
263
|
+
# `//` and `/*` need 2 bytes; `'''` (and a closing `'''`) needs 3; a closing
|
|
264
|
+
# `*/` needs 2. Backslash escapes and single-byte delimiters never need this.
|
|
265
|
+
def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
|
|
266
|
+
avail = buffer.bytesize - scan
|
|
267
|
+
case mode
|
|
268
|
+
when :block_comment
|
|
269
|
+
b == STAR && avail < 2
|
|
270
|
+
when :triple
|
|
271
|
+
b == SQUOTE && avail < 3
|
|
272
|
+
when nil
|
|
273
|
+
if doc_start.nil?
|
|
274
|
+
b == SLASH && avail < 2
|
|
275
|
+
else
|
|
276
|
+
(b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
|
|
277
|
+
end
|
|
278
|
+
else
|
|
279
|
+
false
|
|
280
|
+
end
|
|
281
|
+
end
|
|
282
|
+
|
|
283
|
+
def separators_only?(buffer)
|
|
284
|
+
scan = 0
|
|
285
|
+
mode = nil
|
|
286
|
+
while scan < buffer.bytesize
|
|
287
|
+
b = buffer.getbyte(scan)
|
|
288
|
+
if mode == :line_comment
|
|
289
|
+
if [LF, CR].include?(b)
|
|
290
|
+
mode = nil
|
|
291
|
+
else
|
|
292
|
+
scan += 1
|
|
293
|
+
next
|
|
294
|
+
end
|
|
295
|
+
elsif mode == :block_comment
|
|
296
|
+
if buffer.byteslice(scan, 2) == '*/'
|
|
297
|
+
mode = nil
|
|
298
|
+
scan += 2
|
|
299
|
+
else
|
|
300
|
+
scan += 1
|
|
301
|
+
end
|
|
302
|
+
elsif whitespace_byte?(b)
|
|
303
|
+
scan += 1
|
|
304
|
+
elsif line_comment_start?(buffer, scan)
|
|
305
|
+
mode = :line_comment
|
|
306
|
+
scan += buffer.getbyte(scan) == HASH ? 1 : 2
|
|
307
|
+
elsif block_comment_start?(buffer, scan)
|
|
308
|
+
mode = :block_comment
|
|
309
|
+
scan += 2
|
|
310
|
+
else
|
|
311
|
+
return false
|
|
312
|
+
end
|
|
313
|
+
end
|
|
314
|
+
true
|
|
315
|
+
end
|
|
316
|
+
|
|
317
|
+
def whitespace_byte?(b)
|
|
318
|
+
b == SPACE || (b && b >= TAB && b <= CR)
|
|
319
|
+
end
|
|
320
|
+
|
|
321
|
+
def line_comment_start?(buffer, scan)
|
|
322
|
+
b = buffer.getbyte(scan)
|
|
323
|
+
return preceded_by_ws_or_start?(buffer, scan) if b == HASH
|
|
324
|
+
|
|
325
|
+
b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
|
|
326
|
+
end
|
|
327
|
+
|
|
328
|
+
def block_comment_start?(buffer, scan)
|
|
329
|
+
buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
|
|
330
|
+
end
|
|
331
|
+
|
|
332
|
+
def preceded_by_ws_or_start?(buffer, scan)
|
|
333
|
+
return true if scan.zero?
|
|
334
|
+
|
|
335
|
+
prev = buffer.getbyte(scan - 1)
|
|
336
|
+
whitespace_byte?(prev)
|
|
337
|
+
end
|
|
338
|
+
end
|
|
339
|
+
|
|
340
|
+
module Recovery
|
|
341
|
+
include Bytes
|
|
342
|
+
|
|
343
|
+
module_function
|
|
344
|
+
|
|
345
|
+
def process_string(input, options, &block)
|
|
346
|
+
return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
|
|
347
|
+
|
|
348
|
+
# Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
|
|
349
|
+
# the parse actually fails (the rescue below). Every wrapper shape — code fences,
|
|
350
|
+
# <json>/BEGIN_JSON tags, prose around the payload — makes the parse raise, so the
|
|
351
|
+
# rescue catches it. Crucially this keeps clean input on the single-parse fast path
|
|
352
|
+
# even when its string values legitimately contain ``` or <json> (real-world data
|
|
353
|
+
# like GitHub event payloads is full of markdown), instead of dragging hundreds of
|
|
354
|
+
# MB through the pure-Ruby candidate scan.
|
|
355
|
+
#
|
|
356
|
+
# The one exception is a bare leading label like "JSON: {...}", which parses
|
|
357
|
+
# successfully but WRONGLY (as an implicit-root object keyed by the label), so it
|
|
358
|
+
# must be intercepted before parsing.
|
|
359
|
+
if leading_label?(input)
|
|
360
|
+
payloads = extract_payloads(input, options)
|
|
361
|
+
return replay_payloads(payloads, options, &block) unless payloads.empty?
|
|
362
|
+
end
|
|
363
|
+
|
|
364
|
+
SmarterJSON.send(:process_content, input, options, &block)
|
|
365
|
+
rescue ParseError => e
|
|
366
|
+
raise if e.is_a?(EncodingError)
|
|
367
|
+
|
|
368
|
+
payloads = extract_payloads(input, options)
|
|
369
|
+
return replay_payloads(payloads, options, &block) unless payloads.empty?
|
|
370
|
+
|
|
371
|
+
raise
|
|
372
|
+
end
|
|
373
|
+
|
|
374
|
+
# Whether the input opens with a bare "JSON:" / "Final answer:" label (which would
|
|
375
|
+
# otherwise parse, wrongly, as an implicit-root object keyed by the label). We use
|
|
376
|
+
# String#start_with? with a Regexp rather than match?(/\A.../): start_with? checks
|
|
377
|
+
# only the beginning, whereas a \A-anchored match? still retries at every byte
|
|
378
|
+
# position and so scans the WHOLE input (≈0.3s on a 200 MB document) on every parse.
|
|
379
|
+
# (Caller has already established the input is valid_encoding?.)
|
|
380
|
+
def leading_label?(input)
|
|
381
|
+
input.start_with?(/[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
|
|
382
|
+
end
|
|
383
|
+
|
|
384
|
+
def replay_payloads(payloads, options, &block)
|
|
385
|
+
handler = options[:on_warning]
|
|
386
|
+
emit_wrapper_warnings(payloads, handler)
|
|
387
|
+
|
|
388
|
+
results = payloads.map do |payload|
|
|
389
|
+
SmarterJSON.send(:process_content, payload[:slice], options)
|
|
390
|
+
end
|
|
391
|
+
|
|
392
|
+
return results.each(&block).then { nil } if block_given?
|
|
393
|
+
return nil if results.empty?
|
|
394
|
+
return results.first if results.length == 1
|
|
395
|
+
|
|
396
|
+
results
|
|
397
|
+
end
|
|
398
|
+
|
|
399
|
+
def emit_wrapper_warnings(payloads, handler)
|
|
400
|
+
return unless handler
|
|
401
|
+
|
|
402
|
+
meta = payloads.first[:meta]
|
|
403
|
+
warn(handler, :prefix_text_ignored, "ignored non-JSON text before the payload", *meta[:first_pos]) if meta[:prefix]
|
|
404
|
+
warn(handler, :code_fence_stripped, "stripped markdown code fences around the payload", *meta[:first_pos]) if meta[:fence]
|
|
405
|
+
warn(handler, :wrapper_tag_stripped, "stripped wrapper tags around the payload", *meta[:first_pos]) if meta[:wrapper]
|
|
406
|
+
warn(handler, :suffix_text_ignored, "ignored non-JSON text after the payload", *meta[:last_pos]) if meta[:suffix]
|
|
407
|
+
end
|
|
408
|
+
|
|
409
|
+
def extract_payloads(input, options)
|
|
410
|
+
payloads = candidate_ranges(input).filter_map do |range|
|
|
411
|
+
slice = input.byteslice(range.begin, range.end - range.begin)
|
|
412
|
+
begin
|
|
413
|
+
SmarterJSON.send(:process_content, slice, options.merge(on_warning: nil))
|
|
414
|
+
{ slice: slice, range: range }
|
|
415
|
+
rescue ParseError
|
|
416
|
+
nil
|
|
417
|
+
end
|
|
418
|
+
end
|
|
419
|
+
meta = wrapper_meta(input, payloads.map { |p| p[:range] })
|
|
420
|
+
payloads.each { |payload| payload[:meta] = meta }
|
|
421
|
+
payloads
|
|
422
|
+
end
|
|
423
|
+
|
|
424
|
+
def wrapper_meta(input, ranges)
|
|
425
|
+
return { prefix: false, suffix: false, fence: false, wrapper: false } if ranges.empty?
|
|
426
|
+
|
|
427
|
+
first = ranges.first
|
|
428
|
+
last = ranges.last
|
|
429
|
+
prefix = input.byteslice(0, first.begin)
|
|
430
|
+
suffix = input.byteslice(last.end, input.bytesize - last.end)
|
|
431
|
+
# Look for fence / wrapper markers only in the text we actually strip (outside
|
|
432
|
+
# every recovered payload), so a ``` or <json> sitting inside a payload's own
|
|
433
|
+
# string value does not trigger a "stripped a wrapper" warning.
|
|
434
|
+
outside = non_payload_text(input, ranges)
|
|
435
|
+
{
|
|
436
|
+
prefix: substantive_text?(prefix),
|
|
437
|
+
suffix: substantive_text?(suffix),
|
|
438
|
+
fence: outside.include?("```"),
|
|
439
|
+
wrapper: outside.match?(/<json\b|BEGIN_JSON\b/i),
|
|
440
|
+
first_pos: line_col_for(input, first.begin),
|
|
441
|
+
last_pos: line_col_for(input, last.begin)
|
|
442
|
+
}
|
|
443
|
+
end
|
|
444
|
+
|
|
445
|
+
def non_payload_text(input, ranges)
|
|
446
|
+
out = +""
|
|
447
|
+
pos = 0
|
|
448
|
+
ranges.each do |range|
|
|
449
|
+
out << input.byteslice(pos, range.begin - pos) if range.begin > pos
|
|
450
|
+
pos = range.end
|
|
451
|
+
end
|
|
452
|
+
out << input.byteslice(pos, input.bytesize - pos) if pos < input.bytesize
|
|
453
|
+
out
|
|
454
|
+
end
|
|
455
|
+
|
|
456
|
+
def line_col_for(input, offset)
|
|
457
|
+
line = 1
|
|
458
|
+
col = 1
|
|
459
|
+
i = 0
|
|
460
|
+
while i < offset
|
|
461
|
+
b = input.getbyte(i)
|
|
462
|
+
break if b.nil?
|
|
463
|
+
|
|
464
|
+
if b == LF
|
|
465
|
+
line += 1
|
|
466
|
+
col = 1
|
|
467
|
+
i += 1
|
|
468
|
+
elsif b == CR
|
|
469
|
+
line += 1
|
|
470
|
+
col = 1
|
|
471
|
+
i += 1
|
|
472
|
+
i += 1 if i < offset && input.getbyte(i) == LF
|
|
473
|
+
else
|
|
474
|
+
col += 1
|
|
475
|
+
i += 1
|
|
476
|
+
end
|
|
477
|
+
end
|
|
478
|
+
[line, col]
|
|
479
|
+
end
|
|
480
|
+
|
|
481
|
+
def substantive_text?(text)
|
|
482
|
+
return false if text.nil? || text.empty?
|
|
483
|
+
|
|
484
|
+
stripped = text.dup
|
|
485
|
+
stripped.gsub!(%r{/\*.*?\*/}m, "")
|
|
486
|
+
stripped.gsub!(/^\s*(?:#|\/\/).*$/, "")
|
|
487
|
+
!stripped.strip.empty? && !stripped.strip.match?(/\A(?:```[a-zA-Z0-9_-]*)?\z/) && !stripped.strip.match?(/\A(?:<\/?json>|BEGIN_JSON|END_JSON)\z/i)
|
|
488
|
+
end
|
|
489
|
+
|
|
490
|
+
def warn(handler, type, message, line, col)
|
|
491
|
+
handler.call(Warning.new(type, message, line, col))
|
|
492
|
+
end
|
|
493
|
+
|
|
494
|
+
def candidate_ranges(input)
|
|
495
|
+
ranges = []
|
|
496
|
+
stack = []
|
|
497
|
+
start_pos = nil
|
|
498
|
+
i = 0
|
|
499
|
+
mode = nil
|
|
500
|
+
while i < input.bytesize
|
|
501
|
+
b = input.getbyte(i)
|
|
502
|
+
if mode == :double
|
|
503
|
+
if b == BACKSLASH
|
|
504
|
+
i += 2
|
|
505
|
+
next
|
|
506
|
+
elsif b == DQUOTE
|
|
507
|
+
mode = nil
|
|
508
|
+
end
|
|
509
|
+
i += 1
|
|
510
|
+
next
|
|
511
|
+
elsif mode == :single
|
|
512
|
+
if b == BACKSLASH
|
|
513
|
+
i += 2
|
|
514
|
+
next
|
|
515
|
+
elsif b == SQUOTE
|
|
516
|
+
mode = nil
|
|
517
|
+
end
|
|
518
|
+
i += 1
|
|
519
|
+
next
|
|
520
|
+
elsif mode == :triple
|
|
521
|
+
if input.byteslice(i, 3) == "'''"
|
|
522
|
+
mode = nil
|
|
523
|
+
i += 3
|
|
524
|
+
else
|
|
525
|
+
i += 1
|
|
526
|
+
end
|
|
527
|
+
next
|
|
528
|
+
elsif mode == :line_comment
|
|
529
|
+
if [LF, CR].include?(b)
|
|
530
|
+
mode = nil
|
|
531
|
+
else
|
|
532
|
+
i += 1
|
|
533
|
+
next
|
|
534
|
+
end
|
|
535
|
+
elsif mode == :block_comment
|
|
536
|
+
if input.byteslice(i, 2) == "*/"
|
|
537
|
+
mode = nil
|
|
538
|
+
i += 2
|
|
539
|
+
else
|
|
540
|
+
i += 1
|
|
541
|
+
end
|
|
542
|
+
next
|
|
543
|
+
else
|
|
544
|
+
if input.byteslice(i, 2) == "//"
|
|
545
|
+
mode = :line_comment
|
|
546
|
+
i += 2
|
|
547
|
+
next
|
|
548
|
+
elsif input.byteslice(i, 2) == "/*"
|
|
549
|
+
mode = :block_comment
|
|
550
|
+
i += 2
|
|
551
|
+
next
|
|
552
|
+
elsif b == HASH
|
|
553
|
+
mode = :line_comment
|
|
554
|
+
i += 1
|
|
555
|
+
next
|
|
556
|
+
elsif b == DQUOTE
|
|
557
|
+
mode = :double
|
|
558
|
+
i += 1
|
|
559
|
+
next
|
|
560
|
+
elsif input.byteslice(i, 3) == "'''"
|
|
561
|
+
mode = :triple
|
|
562
|
+
i += 3
|
|
563
|
+
next
|
|
564
|
+
elsif b == SQUOTE
|
|
565
|
+
mode = :single
|
|
566
|
+
i += 1
|
|
567
|
+
next
|
|
568
|
+
elsif [LBRACE, LBRACKET].include?(b)
|
|
569
|
+
start_pos = i if stack.empty?
|
|
570
|
+
stack << b
|
|
571
|
+
elsif b == RBRACE
|
|
572
|
+
stack.pop if stack.last == LBRACE
|
|
573
|
+
if stack.empty? && start_pos
|
|
574
|
+
ranges << (start_pos...(i + 1))
|
|
575
|
+
start_pos = nil
|
|
576
|
+
end
|
|
577
|
+
elsif b == RBRACKET
|
|
578
|
+
stack.pop if stack.last == LBRACKET
|
|
579
|
+
if stack.empty? && start_pos
|
|
580
|
+
ranges << (start_pos...(i + 1))
|
|
581
|
+
start_pos = nil
|
|
582
|
+
end
|
|
583
|
+
end
|
|
584
|
+
end
|
|
585
|
+
i += 1
|
|
586
|
+
end
|
|
587
|
+
ranges
|
|
588
|
+
end
|
|
589
|
+
end
|
|
590
|
+
|
|
591
|
+
# Hand-rolled FSM single-pass parser.
|
|
592
|
+
# Layer 1: strict JSON (RFC 8259).
|
|
593
|
+
# Layer 2: JSON5 additions — line/block comments, trailing comma,
|
|
594
|
+
# unquoted ECMAScript identifier keys, single-quoted strings,
|
|
595
|
+
# hex numbers, leading/trailing decimal points, Infinity/NaN,
|
|
596
|
+
# explicit + sign, \-line-continuation inside strings.
|
|
597
|
+
# Layer 3: HJSON-inspired additions — #/comment-marker rule, triple-quoted
|
|
598
|
+
# strings, quoteless single-line strings, implicit root object,
|
|
599
|
+
# newline-as-separator, broader unquoted keys, recognized-literals-win.
|
|
600
|
+
# Layer 4: smarter_json additions — UTF-8 BOM skip, smart/curly quotes,
|
|
601
|
+
# Python literals (True/False/None) and undefined, underscores in
|
|
602
|
+
# numeric literals, and encoding validation (SmarterJSON::EncodingError).
|
|
603
|
+
class Parser
|
|
604
|
+
include Bytes
|
|
124
605
|
|
|
125
606
|
NOT_NUMERIC = Object.new
|
|
126
607
|
HEX_RE = /\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze
|
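Note on the `Framer.each_document` addition in `data/lib/smarter_json/parser.rb` above: the sketch below shows, in a much-reduced form, how top-level documents can be split out of an IO chunk by chunk, by carrying bracket depth and string state across reads. It is not the gem's code. The method name `each_top_level_document` and the `chunk_size:` keyword are hypothetical, and the sketch assumes only double-quoted strings. It leaves out the comment handling, single-quoted and triple-quoted strings, split-marker deferral, and end-of-input leftover handling that the real `Framer` covers.

```ruby
require "stringio"

# Hypothetical, simplified framer (not the gem's implementation).
# Yields each top-level {...} or [...] document found in `io`, reading
# `chunk_size` bytes at a time and keeping scanner state between reads.
def each_top_level_document(io, chunk_size: 16 * 1024)
  buffer    = +""
  scan      = 0      # next byte offset to inspect in buffer
  doc_start = nil    # byte offset where the current document began
  depth     = 0      # open {...} / [...] nesting level
  in_string = false  # inside a double-quoted string?

  while (chunk = io.read(chunk_size))
    buffer << chunk
    while scan < buffer.bytesize
      b = buffer.getbyte(scan)
      if in_string
        if b == 0x5C # backslash: skip the escaped byte, even if it arrives in a later chunk
          scan += 2
          next
        end
        in_string = false if b == 0x22 # closing double quote
      elsif b == 0x22
        in_string = true if doc_start # quotes outside a document are ignored
      elsif b == 0x7B || b == 0x5B # { or [
        doc_start ||= scan
        depth += 1
      elsif (b == 0x7D || b == 0x5D) && doc_start # } or ]
        depth -= 1
        if depth.zero?
          yield buffer.byteslice(doc_start, scan + 1 - doc_start)
          # Drop the emitted document and restart scanning at the new buffer start.
          buffer = buffer.byteslice((scan + 1)..-1) || +""
          scan = 0
          doc_start = nil
          next
        end
      end
      scan += 1
    end
  end
  nil
end

# Usage: a tiny chunk size forces chunk boundaries to fall inside documents.
io = StringIO.new('{"a":"}"} [1,[2]]')
each_top_level_document(io, chunk_size: 4) { |doc| puts doc }
```

The small `chunk_size` in the usage lines is deliberate: it makes reads end in the middle of a string or a nested array, which is the case the persistent `scan`, `depth`, and `in_string` state is there to handle. Unlike the diff's `scan_buffer`, this sketch does not check that a closing bracket matches the kind that was opened.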
data/lib/smarter_json/version.rb
CHANGED