smarter_json 0.9.2 → 0.9.9
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/CHANGELOG.md +77 -54
- data/README.md +215 -72
- data/docs/_introduction.md +6 -12
- data/docs/basic_read_api.md +29 -19
- data/docs/basic_write_api.md +2 -2
- data/docs/examples.md +32 -23
- data/docs/options.md +14 -14
- data/ext/smarter_json/smarter_json.c +223 -89
- data/ext/smarter_json/vendor/LICENSE-fast_float-MIT +27 -0
- data/ext/smarter_json/vendor/eisel_lemire.h +117 -0
- data/ext/smarter_json/vendor/eisel_lemire.md +29 -0
- data/ext/smarter_json/vendor/eisel_lemire_powers.h +663 -0
- data/lib/smarter_json/backports.rb +28 -0
- data/lib/smarter_json/options.rb +52 -0
- data/lib/smarter_json/parser.rb +400 -139
- data/lib/smarter_json/version.rb +1 -1
- data/lib/smarter_json.rb +3 -1
- metadata +9 -5
- data/ext/smarter_json/vendor/ryu.h +0 -819
- data/ext/smarter_json/vendor/ryu.md +0 -22
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: fbc93b4afea26fc4d30c241ceb7823451cb7b777454441a9871f066b719f8c07
|
|
4
|
+
data.tar.gz: a9a5801cab6a604e4d166d6d86aa48f41f5e1bb74fb58f39f283a0477ac78db4
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: c8749bd358973f0d284966a6c5fb42a4e71954c8884c337c5859f5f3e566b302d1a4c1dbc9961ed5ae46aaf5a9e019935fa46916bb858877a2ef34ecc99575c9
|
|
7
|
+
data.tar.gz: 2f2efe9c89ae08bcf807e061ca734d13056c6fafc590d072d7870a705cf25e3d4932fff567c1b09320f5cbd6b42e02255180fded05c84b08f1032abb0594b8bf
|
data/.gitignore
CHANGED
data/CHANGELOG.md
CHANGED
|
@@ -3,109 +3,132 @@
|
|
|
3
3
|
|
|
4
4
|
> 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
|
|
5
5
|
|
|
6
|
+
> ⚠️ **Interface change (since 0.9.7):**
|
|
7
|
+
>
|
|
8
|
+
> `SmarterJSON.process` / `SmarterJSON.process_file` now **always return an `Array`** of documents:
|
|
9
|
+
> - `[]` for no doc
|
|
10
|
+
> - `[doc]` for one doc
|
|
11
|
+
> - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs).
|
|
12
|
+
|
|
13
|
+
Going forward this will be the supported interface.
|
|
14
|
+
|
|
15
|
+
> ⚠️ We discourage the use of `process(input).first` / `[0]` because it silently drops potential additional documents.
|
|
16
|
+
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
|
|
17
|
+
|
|
18
|
+
## 0.9.9 (2026-06-07)
|
|
19
|
+
- Much faster pure-Ruby parsing (the path used without the C extension) — roughly 3× on string-heavy data, ~2× on number-heavy, ~1.7× on object-heavy (on a YJIT-enabled Ruby). Parsed values are unchanged.
|
|
20
|
+
|
|
21
|
+
## 0.9.8 (2026-06-06 unreleased)
|
|
22
|
+
- Faster parsing of string-heavy arrays. Parsed values are unchanged.
|
|
23
|
+
|
|
24
|
+
## 0.9.7 (2026-06-05 unreleased)
|
|
25
|
+
- **Breaking: `process` / `process_file` now always return an `Array` of documents** — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several. (Previously polymorphic: `nil` / the value / an `Array`.) The document count is now unambiguous, and any result can be iterated uniformly.
|
|
26
|
+
- **New `SmarterJSON.process_one(input)`** — the single-document accessor for the common case: returns the one document's value (or `nil`), and *warns* (never raises) if the input held more than one. Takes a String or an IO; for an IO it is bounded-memory (parses just the first document). Reaching for `.first` / `[0]` on a `process` result silently drops extra documents — use `process_one` instead.
|
|
27
|
+
- The **block form now returns the document count** (was `nil`): `n = SmarterJSON.process(io) { |doc| ... }`.
|
|
28
|
+
- **The top level is stricter, which keeps the LLM-wrapper recovery working:** a top-level value must be a recognized JSON value (number / `true` / `false` / `null` / quoted string / object / array) or an implicit-root object (`host: localhost`). A bare top-level run — `localhost`, `1 2 3`, the typo `flase` — now raises `ParseError` instead of becoming a quoteless string. A space is never a document separator (`1 2 3` raises rather than splitting into three). In-container quoteless strings (`[red green blue]`, `host: localhost`) are unchanged.
|
|
29
|
+
|
|
30
|
+
## 0.9.6 (2026-06-04 unreleased)
|
|
31
|
+
- Faster `decimal_precision: :float` parsing of full-precision decimal numbers (around 17–18 significant digits — e.g. coordinate data and scientific output). Parsed values are unchanged: still correctly rounded, bit-for-bit identical to `JSON.parse`.
|
|
32
|
+
|
|
33
|
+
## 0.9.5 (2026-06-04 unreleased)
|
|
34
|
+
- Faster `decimal_precision: :float` parsing of very high-precision decimal numbers (more than ~17 significant digits). Parsed values are unchanged.
|
|
35
|
+
- Faster parsing of object-heavy and compact documents — less per-element overhead in the C parser. No behavior change.
|
|
36
|
+
|
|
37
|
+
## 0.9.4 (2026-06-04 unreleased)
|
|
38
|
+
- Internal performance experiments. No user-facing changes.
|
|
39
|
+
|
|
40
|
+
## 0.9.3 (2026-06-03)
|
|
41
|
+
- Renamed the `bigdecimal_load:` option to `decimal_precision:` (same values: `:auto`, `:float`, `:bigdecimal`).
|
|
42
|
+
- Invalid option *values* now raise `ArgumentError` with a clear message instead of being silently ignored. Unknown option keys are still ignored.
|
|
43
|
+
- Faster parsing of pretty-printed (indented) input.
|
|
44
|
+
- Removed the `duplicate_key: :raise` option — it conflicted with SmarterJSON's lenient design. `duplicate_key:` now accepts `:last_wins` (default) and `:first_wins`; repeated keys are still reported through `on_warning`.
|
|
45
|
+
|
|
6
46
|
## 0.9.2 (2026-06-03)
|
|
7
|
-
-
|
|
47
|
+
- Fixed a performance regression that slowed parsing of large documents.
|
|
8
48
|
|
|
9
49
|
## 0.9.1 (2026-06-03 unreleased)
|
|
10
|
-
-
|
|
11
|
-
-
|
|
12
|
-
- Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
|
|
13
|
-
- Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
|
|
50
|
+
- Fixed a major performance regression on real-world data that contained markdown fences or `<json>` markers inside ordinary string values.
|
|
51
|
+
- Streaming: a document is no longer cut off early when a comment / quote marker falls across a read-chunk boundary.
|
|
14
52
|
|
|
15
53
|
## 0.9.0 (2026-06-03 unreleased)
|
|
16
|
-
-
|
|
17
|
-
- code cleanup
|
|
54
|
+
- Performance improvements and code cleanup.
|
|
18
55
|
|
|
19
56
|
## 0.8.0 (2026-06-03)
|
|
20
57
|
- **Robustness** against LLM-generated / wrapped JSON:
|
|
21
58
|
- strips markdown code fences (```json / ```)
|
|
22
|
-
- ignores
|
|
59
|
+
- ignores leading / trailing prose around a JSON payload
|
|
23
60
|
- unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`
|
|
24
|
-
-
|
|
25
|
-
-
|
|
26
|
-
-
|
|
27
|
-
|
|
61
|
+
- returns multiple recovered payloads as an `Array`
|
|
62
|
+
- parses pretty-printed multi-line documents from IO / block input
|
|
63
|
+
- reports each recovery through `on_warning` (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
|
|
64
|
+
- Truncated / unterminated input still raises `SmarterJSON::ParseError` — SmarterJSON does not guess at missing data.
|
|
28
65
|
|
|
29
66
|
## 0.7.0 (2026-06-03)
|
|
30
|
-
- **Breaking:** replaced the `warnings:` option (and its `[result, warnings]`
|
|
67
|
+
- **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as a lenient fix is applied; `process` / `process_file` now always return just the value, including on the streaming block form. The default (no handler) records nothing and costs nothing.
|
|
31
68
|
|
|
32
69
|
## 0.6.0 (2026-06-02)
|
|
33
|
-
- Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`)
|
|
34
|
-
- A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}
|
|
35
|
-
- New opt-in `warnings:` option
|
|
36
|
-
- Fixed a pure-Ruby bug where a mantissa-less exponent token (e.g. `-e695881`) was read as `0.0`; it is now a quoteless string, matching the C path.
|
|
37
|
-
- Fixed a pure-Ruby bug where a `\u` escape whose next bytes split a multibyte character leaked `ArgumentError`; it now raises `SmarterJSON::ParseError`.
|
|
38
|
-
- Added a property/fuzz test suite that checks C/Ruby parity and round-tripping on generated, mutated, and random input.
|
|
70
|
+
- Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`). No null is inserted for an empty slot.
|
|
71
|
+
- A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}`.
|
|
72
|
+
- New opt-in `warnings:` option recording the lenient fixes applied — `:empty_slot`, `:empty_value`, `:duplicate_key`. (Superseded by `on_warning:` in 0.7.0.)
|
|
39
73
|
|
|
40
74
|
## 0.5.2 (2026-06-01) yanked
|
|
41
|
-
- `generate`
|
|
42
|
-
- `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII
|
|
43
|
-
- `generate` adds opt-in `coerce:` —
|
|
75
|
+
- `generate` supports pretty-printing via the `indent:` option (spaces per nesting level; default compact). Combining `indent:` with `format: :ndjson` raises `ArgumentError`.
|
|
76
|
+
- `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII), and `script_safe:` (escape `</` and U+2028/U+2029 for safe embedding in an HTML `<script>` tag).
|
|
77
|
+
- `generate` adds opt-in `coerce:` — convert an otherwise-unsupported value (e.g. `Time`, `Date`, app objects) via its own `as_json` / `to_json`; strict-by-default still raises `GenerateError`.
|
|
44
78
|
|
|
45
79
|
## 0.5.1 (2026-06-01) yanked
|
|
46
|
-
- Unified the error classes under a single `SmarterJSON::Error` base: `ParseError` and `
|
|
47
|
-
- Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS).
|
|
48
|
-
- Fixed the C extension build on Ruby 2.6 (declare `rb_hash_bulk_insert`, which 2.6 exports but does not declare in its headers); set the minimum Ruby to 2.6.
|
|
80
|
+
- Unified the error classes under a single `SmarterJSON::Error` base: `ParseError`, `EncodingError`, and the new `GenerateError` all inherit from it, so `rescue SmarterJSON::Error` catches everything the gem raises.
|
|
81
|
+
- Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS); minimum Ruby is now 2.6.
|
|
49
82
|
|
|
50
83
|
## 0.5.0 (2026-05-31 unreleased)
|
|
51
|
-
-
|
|
52
|
-
-
|
|
84
|
+
- Added JSON generation, including NDJSON.
|
|
85
|
+
- Added test coverage.
|
|
53
86
|
|
|
54
87
|
## 0.4.0 (2026-05-31 unreleased)
|
|
55
|
-
-
|
|
88
|
+
- Renamed the gem `flex_json` → `smarter_json`.
|
|
56
89
|
|
|
57
90
|
## 0.3.10 (2026-05-31 unreleased)
|
|
58
|
-
-
|
|
59
|
-
|
|
91
|
+
- Changed the interface to `.process` and `.process_file`.
|
|
60
92
|
|
|
61
93
|
## 0.3.9 (2026-05-31 unreleased)
|
|
62
|
-
- `
|
|
63
|
-
-
|
|
64
|
-
- The block form (`parse(input) { |doc| … }`) is kept as the bounded-memory streaming path. `parse_file(path) { |doc| … }` now forwards the block too, so files stream the same way (previously the block was silently ignored). Bracketless comma lists (`1, 2, 3`) still raise — commas don't separate top-level documents (implicit-root array remains unsupported).
|
|
65
|
-
- The block form allows individual processing of each line in NDJSON files.
|
|
66
|
-
- Supersedes the earlier "raise on trailing content, match Oj" behavior.
|
|
94
|
+
- `process` with no block now handles any input automatically: 0 documents (empty / whitespace / comment-only) → `nil`, 1 document → the value itself, 2+ documents (NDJSON / JSONL / concatenated) → an `Array`. It no longer raises on trailing content.
|
|
95
|
+
- The block form (`process(input) { |doc| … }`) streams documents with bounded memory; `process_file` forwards the block too, so each line of an NDJSON file can be processed individually.
|
|
67
96
|
|
|
68
97
|
## 0.3.8 (2026-05-30 unreleased)
|
|
69
|
-
-
|
|
70
|
-
- Quoteless-token boundary scan now uses a 256-byte class table: ordinary bytes are classified in one table lookup, and the lookahead byte is read only at a `#`/`/` instead of on every byte. Speeds up quoteless / config-style input (the lenient case the JSON benchmarks don't exercise).
|
|
98
|
+
- Performance improvements (quoteless / config-style input).
|
|
71
99
|
|
|
72
100
|
## 0.3.7 (2026-05-30 unreleased)
|
|
73
|
-
-
|
|
74
|
-
- Added branch hints (`__builtin_expect`) and prefetch to the hot string-scan loop. Sped up string-heavy files (string_array, github_events, twitter all 12–16% faster).
|
|
101
|
+
- Performance improvements (string-heavy input).
|
|
75
102
|
|
|
76
103
|
## 0.3.6 (2026-05-30 unreleased)
|
|
77
|
-
-
|
|
104
|
+
- Performance improvements (numbers inside objects / arrays).
|
|
78
105
|
|
|
79
106
|
## 0.3.5 (2026-05-30 unreleased)
|
|
80
|
-
-
|
|
81
|
-
- Added `fj_try_decimal` for the quoteless path: validates and extracts the number in one scan, replacing the old three scans (validate + significant-digit count + mantissa extraction); skips the significant-digit scan when the number has ≤16 digits.
|
|
82
|
-
- Both number paths now build values through the shared `fj_int_from_parts` / `fj_float_from_parts` helpers so they can't drift; removed the now-dead `fj_validate_decimal` / `fj_int_value` / `fj_decimal_value`.
|
|
107
|
+
- Performance improvements (number parsing).
|
|
83
108
|
|
|
84
109
|
## 0.3.4 (2026-05-30 unreleased)
|
|
85
|
-
-
|
|
86
|
-
- Build objects and arrays from a C value stack with a pre-sized hash + bulk insert (and size-based duplicate detection), instead of inserting one member/element at a time.
|
|
87
|
-
- Added a per-parse key cache so repeated object keys are interned once instead of every occurrence.
|
|
110
|
+
- Performance improvements (object-heavy input).
|
|
88
111
|
|
|
89
112
|
## 0.3.3 (2026-05-30 unreleased)
|
|
90
|
-
-
|
|
113
|
+
- Faster, correctly-rounded float parsing.
|
|
91
114
|
|
|
92
115
|
## 0.3.3 (2026-05-29 unreleased)
|
|
93
|
-
-
|
|
116
|
+
- Performance fixes.
|
|
94
117
|
|
|
95
118
|
## 0.3.2 (2026-05-29 unreleased)
|
|
96
|
-
-
|
|
119
|
+
- Performance fixes.
|
|
97
120
|
|
|
98
121
|
## 0.3.1 (2026-05-29 unreleased)
|
|
99
|
-
-
|
|
122
|
+
- Performance fixes.
|
|
100
123
|
|
|
101
124
|
## 0.3.0 (2026-05-29 unreleased)
|
|
102
|
-
-
|
|
125
|
+
- Iterative parser.
|
|
103
126
|
|
|
104
127
|
## 0.2.0 (2026-05-29 unreleased)
|
|
105
|
-
-
|
|
128
|
+
- Recursive parser.
|
|
106
129
|
|
|
107
130
|
## 0.1.1 (2026-05-29 unreleased)
|
|
108
|
-
- MVP complete
|
|
131
|
+
- MVP complete.
|
|
109
132
|
|
|
110
133
|
## 0.1.0 (2026-05-28 unreleased)
|
|
111
|
-
- Initial Ruby version
|
|
134
|
+
- Initial Ruby version.
|
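The return contract introduced in the 0.9.7 changelog entry (always an `Array` from `process`, a single value plus a warning from `process_one`) can be illustrated with a small sketch. This is not SmarterJSON's code: it uses only the stdlib `json` library, handles newline-delimited input only, and the helper names `documents` and `one_document` are hypothetical stand-ins chosen for this example.

```ruby
require "json"

# Hypothetical stand-in for the `process` contract: always an Array,
# with one element per non-blank line of newline-delimited JSON.
def documents(text)
  text.each_line.map(&:strip).reject(&:empty?).map { |line| JSON.parse(line) }
end

# Hypothetical stand-in for the `process_one` contract: the single value
# (or nil), with a warning on stderr instead of an exception when extras exist.
def one_document(text)
  docs = documents(text)
  warn "expected 1 document, found #{docs.size}" if docs.size > 1
  docs.first
end

ndjson = %({"id":1}\n{"id":2}\n)

documents(ndjson)        # an Array holding both documents
documents("")            # an empty Array, never nil
documents(ndjson).first  # first document only; the second is dropped with no signal
one_document(ndjson)     # same value, but the extra document is reported via warn
```

The point of the uniform `Array` is that callers can iterate any result the same way, while the single-document accessor makes an unexpected extra document visible rather than silent.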
data/README.md
CHANGED
|
@@ -2,27 +2,62 @@
|
|
|
2
2
|
|
|
3
3
|
 [](https://codecov.io/gh/tilo/smarter_json) <!-- [](https://rubygems.org/gems/smarter_json) --> [](https://rubygems.org/gems/smarter_json) [](https://www.ruby-toolbox.com/projects/smarter_json)
|
|
4
4
|
|
|
5
|
-
A lenient, fast JSON
|
|
5
|
+
A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write, and in benchmarks it matches or beats Oj on every file. SmarterJSON is opinionated: we want your JSON processing to succeed. Traditional JSON parsers are strict and stop at the first deviation; SmarterJSON keeps going, optimizing for getting your data out rather than policing the JSON spec.
|
|
6
6
|
|
|
7
|
-
> **SmarterJSON: one
|
|
7
|
+
> **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.**
|
|
8
8
|
|
|
9
9
|
## Why SmarterJSON?
|
|
10
10
|
|
|
11
|
-
|
|
11
|
+
**Are you tired of seeing errors like these?**
|
|
12
|
+
|
|
13
|
+
```
|
|
14
|
+
ERROR running JSON.parse (stdlib) on deeply_nested.json: JSON::NestingError: nesting of 101 is too deep
|
|
15
|
+
|
|
16
|
+
ERROR running Oj.load (default) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
|
|
17
|
+
|
|
18
|
+
ERROR running Oj.load (strict, float) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
|
|
19
|
+
|
|
20
|
+
ERROR running Oj.load (compat) on config.json5: EncodingError: unexpected character (after [0]) at line 5, column 6 [parse.c:931] in '// JSON5 config sample — leni…
|
|
21
|
+
|
|
22
|
+
ERROR running JSON.parse (stdlib) on config.json5: JSON::ParserError: expected object key, got 'id:' at line 4 column 5
|
|
23
|
+
|
|
24
|
+
ERROR running Yajl::Parser (yajl-ruby) on config.json5: Yajl::ParseError: lexical error: invalid char in json text. this. */ [ // record 0 { id: 0, name: 'alpha-0', mask: 0 (…
|
|
25
|
+
|
|
26
|
+
ERROR running Oj.load (default) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
|
|
27
|
+
|
|
28
|
+
ERROR running Oj.load (strict, float) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
|
|
29
|
+
|
|
30
|
+
ERROR running Oj.load (compat) on github_events_100k.ndjson: EncodingError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870] in '{"id":"…
|
|
31
|
+
|
|
32
|
+
ERROR running JSON.parse (stdlib) on github_events_100k.ndjson: JSON::ParserError: unexpected token at end of stream '{"id":"34816047161","type":"Dele' at line 1 column 1
|
|
33
|
+
|
|
34
|
+
ERROR running Yajl::Parser (yajl-ruby) on github_events_100k.ndjson: Yajl::ParseError: Found multiple JSON objects in the stream but no block or the on_parse_complete callback was
|
|
35
|
+
assigne…
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
**Do you have no control over the input quality?**
|
|
39
|
+
|
|
40
|
+
Traditional JSON parsers reject anything that isn't perfectly strict JSON. That means your code breaks on malformed data.
|
|
41
|
+
|
|
42
|
+
SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.**
|
|
43
|
+
Give it strict JSON, NDJSON, JSON5, an HJSON-style config file, LLM-generated JSON, or a copy-pasted blob with comments and trailing commas — it just extracts the data from it.
|
|
44
|
+
When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma).
|
|
45
|
+
|
|
46
|
+
A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
|
|
12
47
|
|
|
13
48
|
> For an ingestion tool, "reject the whole document because of one stray comma" is the worst outcome: you throw away the 99% that's fine to avoid maybe-mishandling a gap that carries no data anyway.
|
|
14
49
|
|
|
15
50
|
Three things set it apart:
|
|
16
51
|
|
|
17
|
-
1. **One
|
|
52
|
+
1. **One tool, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure it to match your input; it adapts to whatever you give it.
|
|
18
53
|
|
|
19
|
-
2. **It
|
|
54
|
+
2. **It extracts every document from multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: it always returns an `Array` of the documents found (`[]` / `[doc]` / `[d1, d2, …]`). For the common single-document case, `SmarterJSON.process_one` returns the one value directly (and warns, never raises, if there was more than one). The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
|
|
20
55
|
|
|
21
|
-
3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere)
|
|
56
|
+
3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) matches or beats Oj on every file we benchmark, and is competitive with the stdlib `json` C parser — among the fastest general-purpose JSON processors in Ruby.
|
|
22
57
|
|
|
23
58
|
## What it accepts, beyond strict JSON
|
|
24
59
|
|
|
25
|
-
- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com`
|
|
60
|
+
- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
|
|
26
61
|
- Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
|
|
27
62
|
- Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
|
|
28
63
|
- Implicit root object — a config file that starts with `key: value`, no outer `{}`
|
|
@@ -31,7 +66,19 @@ Three things set it apart:
|
|
|
31
66
|
- Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
|
|
32
67
|
- Duplicate keys (last value wins by default; configurable)
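The comment rule in the first bullet (a `#` or `//` starts a comment only when preceded by whitespace) can be sketched with a single regular expression. This is a simplified illustration, not the gem's scanner: the helper name `strip_line_comment` is hypothetical, it works on one line at a time, it ignores quoted strings and `/* … */` comments, and treating the start of the line as a valid comment position is an assumption.

```ruby
# A comment marker counts only at the start of the line or right after whitespace.
COMMENT_START = %r{(?:\A|\s)(?:#|//).*\z}

# Hypothetical helper: remove a trailing line comment, leaving the value intact.
def strip_line_comment(line)
  line.chomp.sub(COMMENT_START, "").rstrip
end

strip_line_comment("url: http://x.com")          # unchanged: "//" follows ":" with no whitespace
strip_line_comment("port: 8080 // default")      # "port: 8080"
strip_line_comment("host: localhost # dev box")  # "host: localhost"
```

Requiring whitespace before the marker is what lets URLs and similar values survive as quoteless strings without any escaping.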
|
|
33
68
|
|
|
34
|
-
It raises only on genuinely
|
|
69
|
+
It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
|
|
70
|
+
|
|
71
|
+
### Format references
|
|
72
|
+
|
|
73
|
+
The lenient grammar is a superset of these human-JSON specs — listed once, here:
|
|
74
|
+
|
|
75
|
+
* [JSON5](https://json5.org/)
|
|
76
|
+
* [HJSON](https://hjson.github.io/)
|
|
77
|
+
* [JWCC / HuJSON](https://github.com/tailscale/hujson)
|
|
78
|
+
* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
|
|
79
|
+
* [JSONH](https://github.com/jsonh-org/Jsonh)
|
|
80
|
+
* [JSONC (VS Code)](https://jsonc.org/)
|
|
81
|
+
* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
|
|
35
82
|
|
|
36
83
|
## Installation
|
|
37
84
|
|
|
@@ -44,13 +91,48 @@ gem "smarter_json"
|
|
|
44
91
|
gem install smarter_json
|
|
45
92
|
```
|
|
46
93
|
|
|
47
|
-
The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby
|
|
94
|
+
The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby implementation runs instead and produces identical results.
|
|
95
|
+
|
|
96
|
+
## Usage
|
|
97
|
+
|
|
98
|
+
Pass a String of JSON content or an IO; you get back the extracted data. The same call handles strict JSON, JSON5, and HJSON-style config — there are no modes or flags.
|
|
99
|
+
|
|
100
|
+
```ruby
|
|
101
|
+
require "smarter_json"
|
|
102
|
+
|
|
103
|
+
SmarterJSON.process('{"a": 1, "b": [2, 3]}') # => [{"a"=>1, "b"=>[2, 3]}] (always an Array of documents)
|
|
104
|
+
SmarterJSON.process_one('{"a": 1, "b": [2, 3]}') # => {"a"=>1, "b"=>[2, 3]} (the one document's value)
|
|
105
|
+
SmarterJSON.process_file("config.json5") # read a file, then process
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
**Prefer `process`.** It always returns an `Array`, so the document count is explicit and you never silently drop one. Reach for `process_one` when you want just the single document's value — it *warns* (never raises) if the input turns out to hold more than one, so an unexpected extra document is surfaced, not dropped.
|
|
109
|
+
|
|
110
|
+
## Usage in APIs
|
|
111
|
+
|
|
112
|
+
At an API boundary the JSON comes from someone you don't control — a client POSTing a request body to *your* service, or an upstream service answering a call *you* made — and it isn't always clean: a stray trailing comma, a `NaN`, a payload wrapped in prose, or a quiet change to the format. A strict parser turns any of those into an exception (a request you reject, or a failed call chain). SmarterJSON extracts the data that's there instead, so one formatting quirk doesn't sink the whole request:
|
|
113
|
+
|
|
114
|
+
```ruby
|
|
115
|
+
# Inbound — JSON a caller sent to your endpoint:
|
|
116
|
+
data = SmarterJSON.process(request.body)
|
|
117
|
+
|
|
118
|
+
# Outbound — JSON from a service you called:
|
|
119
|
+
data = SmarterJSON.process(response.body)
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
What that buys you:
|
|
123
|
+
|
|
124
|
+
* fewer "random production crashes" from messy JSON on either side of the wire
|
|
125
|
+
* resilience when a caller or a provider changes its output
|
|
126
|
+
* the option to log and recover, instead of rejecting the request outright
|
|
127
|
+
* consistent handling of edge-case payloads
|
|
128
|
+
|
|
129
|
+
See [Examples](#examples) below for multi-document input, streaming, and recovering JSON from LLM / markdown noise.
|
|
48
130
|
|
|
49
|
-
##
|
|
131
|
+
## Stable interface & thread safety
|
|
50
132
|
|
|
51
|
-
The public
|
|
133
|
+
The public interface is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
|
|
52
134
|
|
|
53
|
-
Concurrent calls are safe. The
|
|
135
|
+
Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls.
|
|
54
136
|
|
|
55
137
|
## Documentation
|
|
56
138
|
|
|
@@ -60,106 +142,167 @@ Concurrent calls are safe. The parser/generator keep per-call state local, and t
|
|
|
60
142
|
* [Configuration Options](docs/options.md)
|
|
61
143
|
* [Examples](docs/examples.md)
|
|
62
144
|
|
|
63
|
-
|
|
145
|
+
### Warnings (`on_warning`)
|
|
146
|
+
|
|
147
|
+
When SmarterJSON quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
|
|
64
148
|
|
|
65
149
|
```ruby
|
|
66
|
-
|
|
150
|
+
# Collect them all:
|
|
151
|
+
warns = []
|
|
152
|
+
data = SmarterJSON.process(input, on_warning: ->(w) { warns << w })
|
|
67
153
|
|
|
68
|
-
|
|
69
|
-
SmarterJSON.process(
|
|
70
|
-
|
|
154
|
+
# Or route them — log, count, raise:
|
|
155
|
+
SmarterJSON.process(input, on_warning: ->(w) { Rails.logger.warn(w) })
|
|
156
|
+
```
|
|
71
157
|
|
|
72
|
-
# Multiple documents (NDJSON / JSONL / concatenated) — no block, no special method:
|
|
73
|
-
SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2}, {"id"=>3}]
|
|
74
|
-
SmarterJSON.process('{"id":1}') # => {"id"=>1} (one document → the value itself)
|
|
75
|
-
SmarterJSON.process("") # => nil (zero documents)
|
|
76
158
|
|
|
77
|
-
|
|
78
|
-
# (process and process_file both forward the block):
|
|
79
|
-
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
159
|
+
## Performance
|
|
80
160
|
|
|
81
|
-
|
|
82
|
-
SmarterJSON.process(<<~TEXT)
|
|
83
|
-
Here is the JSON:
|
|
161
|
+
SmarterJSON is a C extension (with a pure-Ruby fallback that runs everywhere). Before the speed table, the part that isn't a "× faster" — **things the other parsers can't do at all:**
|
|
84
162
|
|
|
85
|
-
|
|
86
|
-
|
|
87
|
-
|
|
88
|
-
|
|
89
|
-
```
|
|
90
|
-
TEXT
|
|
91
|
-
# => {"a"=>1}
|
|
163
|
+
- **stdlib `json` can't parse deeply nested data.** It caps nesting at 100 levels and raises; SmarterJSON has no depth limit (iterative parser, bounded only by memory).
|
|
164
|
+
- **None of the others read NDJSON / JSONL / concatenated input in a single call.** Oj, `json`, and Yajl each raise on the second document. Only SmarterJSON's `process` returns every document as an `Array`.
|
|
165
|
+
- **None of the others parse JSON5, HJSON-style config, or LLM-wrapped output.** Comments, trailing commas, unquoted keys, quoteless values, `'single quotes'`, markdown code fences, prose wrappers — all raise in Oj / `json` / Yajl; SmarterJSON parses them.
|
|
166
|
+
- **`json` and Yajl produce `Float` only — lossy on high-precision numbers.** On coordinate / scientific data (>16 significant digits) they silently round to `Float`, so they aren't a like-for-like comparison there. SmarterJSON (and Oj) keep full precision as `BigDecimal` by default.
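The precision point in the last bullet can be checked with the stdlib alone. The sketch below parses a made-up 19-digit literal with `JSON.parse` in its default configuration and compares the resulting `Float` with the exact decimal value. It says nothing about SmarterJSON's own output, which is not exercised here.

```ruby
require "json"

literal = "0.1234567890123456789" # 19 significant digits (sample value for illustration)
value = JSON.parse("[#{literal}]").first

exact = Rational(literal) # the decimal literal as an exact fraction

puts value.class          # Float
puts value.to_r == exact  # false: the Float is the nearest binary approximation
puts value                # prints fewer digits than the literal carried
```

A `Float` holds about 15 to 17 significant decimal digits, so anything beyond that is rounded away at parse time unless the parser is asked for a decimal type.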
|
|
92
167
|
|
|
93
|
-
SmarterJSON.
|
|
94
|
-
Here is the result:
|
|
168
|
+
Where a like-for-like comparison exists, here is SmarterJSON's C path against each parser. **Apple M4, Ruby 3.4.7, p10 of 40 runs.** Each cell is **SmarterJSON vs that parser** — "faster" means SmarterJSON wins. Ratios shift with hardware; run `rake report` in `json_benchmarks/` to reproduce.
|
|
95
169
|
|
|
96
|
-
|
|
97
|
-
|
|
98
|
-
|
|
170
|
+
| File | vs Oj/strict | vs `json` | vs Yajl |
|
|
171
|
+
| ----------------------------- | --------------- | ---------------------------- | --------------- |
|
|
172
|
+
| big_decimals <sup>≠</sup> | **1.8× faster** | **1.1× faster** | **1.3× faster** |
|
|
173
|
+
| canada <sup>≠</sup> | **8× faster** | 1.1× slower | **2.2× faster** |
|
|
174
|
+
| citm_catalog | **1.6× faster** | 1.2× slower | **4.8× faster** |
|
|
175
|
+
| citylots <sup>≠</sup> | **3.6× faster** | **2.0× faster** | **2.3× faster** |
|
|
176
|
+
| config.jsonc | **1.1× faster** | 1.5× slower | **3.7× faster** |
|
|
177
|
+
| deeply_nested | **1.4× faster** | **can't parse** <sup>‡</sup> | **5.1× faster** |
|
|
178
|
+
| github_events | **1.2× faster** | ≈ tied | **3.1× faster** |
|
|
179
|
+
| string_array | ≈ tied | ≈ tied | **1.6× faster** |
|
|
180
|
+
| twitter | **1.4× faster** | 1.3× slower | **3.5× faster** |
|
|
181
|
+
| usgs_earthquakes <sup>≠</sup> | **1.3× faster** | 1.5× slower | **3.6× faster** |
|
|
182
|
+
| weather_berlin | **1.9× faster** | 1.1× slower | **3.5× faster** |
|
|
99
183
|
|
|
100
|
-
|
|
101
|
-
|
|
102
|
-
# => {"a"=>1}
|
|
184
|
+
<sup>≠</sup> High-precision file. The row uses `decimal_precision: :float` (Float, like-for-like) for `canada` / `citylots` / `big_decimals` / `usgs`. SmarterJSON's **default** `:auto` keeps these decimals as `BigDecimal` (no precision loss, like Oj's default) — intrinsically slower than `Float`, so default-vs-`Float` would be apples-to-oranges. Against Oj's matching `BigDecimal` default, SmarterJSON is faster there too.
|
|
185
|
+
<sup>‡</sup> Not a measurement gap — `json` raises by default: it errors on multi-document / NDJSON input without a block, and caps nesting at 100 levels. SmarterJSON has neither limit.
|
|
103
186
|
|
|
104
|
-
SmarterJSON
|
|
105
|
-
# => {"a"=>1}
|
|
187
|
+
In short: **matches or beats Oj/strict on every file** — `string_array` is the one wash (within ~10%, and hardware-dependent: SmarterJSON edges ahead on an M1, Oj edges ahead on an M4) — **far faster than Yajl everywhere, and level-to-ahead of stdlib `json` on a like-for-like basis**, while parsing input `json` and Oj reject outright. Floats are decoded with the **Eisel-Lemire** algorithm (fast_float), correctly rounded and **bit-for-bit identical to `JSON.parse`** — fast *and* exact, even at full double precision.
|
|
106
188
|
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
|
|
189
|
+
**Two notes on fair comparison:**
|
|
190
|
+
|
|
191
|
+
- **NDJSON / multi-document:** only SmarterJSON reads it via plain `process` — Oj, `json`, and Yajl raise without a block. `process` collects every document into an `Array`; the block form streams one document at a time in bounded memory (use it for input larger than RAM).
|
|
192
|
+
- **High-precision decimals (the <sup>≠</sup> files):** by default these load as `BigDecimal` (full precision, like Oj's default), intrinsically slower than `Float`. Pass `decimal_precision: :float` for a like-for-like `Float` comparison — where SmarterJSON **beats stdlib `json`** (e.g. `citylots` ~2×) — at 3–6× the speed of the `:auto` default on coordinate/scientific data, when you don't need `BigDecimal` precision.
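
The precision trade-off behind the `Float` / `BigDecimal` choice can be seen with plain Ruby, independent of any JSON parser. The sketch below uses a made-up literal and only `Float()` and `BigDecimal()`; it does not call SmarterJSON and says nothing about its internals.

```ruby
require "bigdecimal"

# A made-up decimal with more digits than a double can hold.
literal = "0.1234567890123456789012345"

as_float   = Float(literal)       # rounded to the nearest IEEE 754 double
as_decimal = BigDecimal(literal)  # keeps every digit of the literal

puts as_float.to_s
puts as_decimal.to_s("F")

# Round-tripping back to text shows which representation lost digits.
puts as_decimal.to_s("F") == literal
puts as_float.to_s == literal
```

When the trailing digits matter (money, coordinates that are compared exactly), the `BigDecimal` default is the safe choice; when they do not, `Float` is the cheaper one.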
|
|
110
193
|
|
|
111
|
-
corrected payload:
|
|
112
|
-
{"b":2}
|
|
113
|
-
TEXT
|
|
114
|
-
# => [{"a"=>1}, {"b"=>2}]
|
|
115
|
-
```
|
|
116
194
|
|
|
117
195
|
### Options
|
|
118
196
|
|
|
119
197
|
| option | default | meaning |
|
|
120
198
|
|-------------------|--------------|-------------------------------------------------------------------------|
|
|
121
199
|
| `symbolize_keys` | `false` | return object keys as Symbols instead of Strings |
|
|
122
|
-
| `duplicate_key` | `:last_wins` | `:last_wins` / `:first_wins`
|
|
123
|
-
| `
|
|
200
|
+
| `duplicate_key` | `:last_wins` | `:last_wins` / `:first_wins` for a key repeated in one object (every repeat is also reported via `on_warning`) |
|
|
201
|
+
| `decimal_precision` | `:auto` | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
|
|
124
202
|
| `acceleration` | `true` | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
|
|
125
203
|
| `encoding` | `"UTF-8"` | labels the input's encoding (no transcoding pass; see below) |
|
|
126
204
|
| `on_warning` | `nil` | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
|
|
127
205
|
|
|
128
|
-
|
|
206
|
+
## Examples
|
|
207
|
+
|
|
208
|
+
### Lenient, config-style input
|
|
129
209
|
|
|
130
|
-
|
|
210
|
+
No outer braces needed — a file or string that starts with `key: value` is read as an implicit root object (HJSON-style):
|
|
131
211
|
|
|
132
212
|
```ruby
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
213
|
+
SmarterJSON.process_one("host: localhost\nport: 5432")
|
|
214
|
+
# => {"host"=>"localhost", "port"=>5432}
|
|
215
|
+
```
|
|
136
216
|
|
|
137
|
-
|
|
138
|
-
|
|
217
|
+
### Multiple documents (NDJSON / JSONL / concatenated)
|
|
218
|
+
|
|
219
|
+
`process` always returns an **`Array` of the documents** it found — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several — with **no block and no special method**. The document count is unambiguous, and any result iterates uniformly:
|
|
220
|
+
|
|
221
|
+
```ruby
|
|
222
|
+
SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2}, {"id"=>3}]
|
|
223
|
+
SmarterJSON.process('{"id":1}') # => [{"id"=>1}] (one document, still an Array)
|
|
224
|
+
SmarterJSON.process("") # => [] (zero documents)
|
|
139
225
|
```
|
|
140
226
|
|
|
141
|
-
|
|
227
|
+
For the common single-document case, **`process_one`** returns the one value directly — and *warns* (never raises) if the input held more than one, so you never silently drop a document:
|
|
228
|
+
|
|
229
|
+
```ruby
|
|
230
|
+
SmarterJSON.process_one('{"id":1}') # => {"id"=>1}
|
|
231
|
+
SmarterJSON.process_one("") # => nil
|
|
232
|
+
```
|
|
142
233
|
|
|
143
|
-
|
|
234
|
+
> **Type-checking the result?** Use `result.is_a?(Array)`, not `result.class == Array` — it's the idiomatic Ruby test, and it stays correct if a future release returns a specialized `Array` subclass.
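
The difference between the two checks shows up as soon as an `Array` subclass is involved. In this sketch `DocumentList` is a hypothetical class invented for illustration; it is not part of SmarterJSON.

```ruby
# DocumentList is a hypothetical Array subclass, used only to show
# how the two type checks diverge.
class DocumentList < Array; end

result = DocumentList.new([{ "id" => 1 }])

puts result.is_a?(Array)       # subclass instances still pass is_a?
puts result.class == Array     # exact class comparison does not
puts result.map { |doc| doc["id"] }.inspect
```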
|
|
144
235
|
|
|
145
|
-
|
|
146
|
-
- **vs stdlib `json` (C):** competitive with the fastest Ruby JSON parser — it ties `json` on big_decimals and string_array, and trails by ~1.1–1.7× on the rest. (`canada.json` is the outlier, far behind — that's the `BigDecimal` default, see below.)
|
|
147
|
-
- **Numbers:** floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
|
|
236
|
+
A **top-level** value must be recognized JSON — a number, `true` / `false` / `null`, a quoted string, an object, an array — or an implicit-root object (`host: localhost`). A bare top-level run such as `localhost` or `1 2 3` raises `ParseError`. Quoteless string values *inside* objects and arrays (`{host: localhost}`, `[red green blue]`) are unchanged.
|
|
148
237
|
|
|
149
|
-
|
|
238
|
+
### Streaming large input with a block
|
|
239
|
+
|
|
240
|
+
For input larger than memory, pass a block: each document is yielded as it is read and the method returns the **document count** instead of building an `Array`. Both `process` and `process_file` forward the block:
|
|
241
|
+
|
|
242
|
+
```ruby
|
|
243
|
+
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
244
|
+
```
|
|
245
|
+
|
|
246
|
+
### Recovering JSON from LLM / markdown noise
|
|
247
|
+
|
|
248
|
+
When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.)
|
|
249
|
+
|
|
250
|
+
A fenced code block, as an LLM often returns:
|
|
251
|
+
|
|
252
|
+
````ruby
|
|
253
|
+
SmarterJSON.process_one(<<~TEXT)
|
|
254
|
+
Here is the JSON:
|
|
150
255
|
|
|
151
|
-
|
|
152
|
-
|
|
256
|
+
```json
|
|
257
|
+
{ "a": 1 }
|
|
258
|
+
```
|
|
259
|
+
TEXT
|
|
260
|
+
# => {"a"=>1}
|
|
261
|
+
````
|
|
262
|
+
|
|
263
|
+
Explanatory prose before and/or after the payload is ignored:
|
|
264
|
+
|
|
265
|
+
```ruby
|
|
266
|
+
SmarterJSON.process_one(<<~TEXT)
|
|
267
|
+
Here is the result:
|
|
268
|
+
|
|
269
|
+
{ "a": 1 }
|
|
270
|
+
|
|
271
|
+
Hope this helps.
|
|
272
|
+
TEXT
|
|
273
|
+
# => {"a"=>1}
|
|
274
|
+
```
|
|
275
|
+
|
|
276
|
+
`<json>...</json>` / `BEGIN_JSON ... END_JSON` wrapper tags are unwrapped:
|
|
277
|
+
|
|
278
|
+
```ruby
|
|
279
|
+
SmarterJSON.process_one('<json>{"a":1}</json>')
|
|
280
|
+
# => {"a"=>1}
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
When one blob contains several recovered payloads, they come back as an `Array` (the same rule as multi-document input):
|
|
284
|
+
|
|
285
|
+
```ruby
|
|
286
|
+
SmarterJSON.process(<<~TEXT)
|
|
287
|
+
first attempt:
|
|
288
|
+
{"a":1}
|
|
289
|
+
|
|
290
|
+
corrected payload:
|
|
291
|
+
{"b":2}
|
|
292
|
+
TEXT
|
|
293
|
+
# => [{"a"=>1}, {"b"=>2}]
|
|
294
|
+
```
|
|
153
295
|
|
|
154
296
|
## Encoding
|
|
155
297
|
|
|
156
|
-
`encoding:` (default `"UTF-8"`) labels what the input is — it does **not** trigger a transcoding pass.
|
|
298
|
+
`encoding:` (default `"UTF-8"`) labels what the input is — it does **not** trigger a transcoding pass. SmarterJSON works on the bytes in their native encoding and emits string values with the same encoding tag, the same way `smarter_csv` handles encodings. Bytes that are invalid for the claimed encoding raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
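
The distinction between labelling and transcoding is the same one Ruby's `String` API draws between `force_encoding` and `encode`. The following sketch uses only core Ruby on a made-up four-byte string; it illustrates the concept and is not a description of how SmarterJSON implements `encoding:`.

```ruby
# Four raw bytes; 0xE9 is "é" in ISO-8859-1 but not valid UTF-8 on its own.
raw = "caf\xE9".b

# Labelling: the bytes stay the same, only the encoding tag changes.
labelled_latin1 = raw.dup.force_encoding("ISO-8859-1")
labelled_utf8   = raw.dup.force_encoding("UTF-8")

# Transcoding: the bytes themselves are rewritten for the target encoding.
transcoded = labelled_latin1.encode("UTF-8")

puts labelled_latin1.valid_encoding?
puts labelled_utf8.valid_encoding?
puts labelled_latin1.bytes == raw.bytes
puts transcoded.bytesize
```

A wrong label leaves the string with bytes that are invalid for the claimed encoding, which is the situation the `EncodingError` described above reports.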
|
|
157
299
|
|
|
158
300
|
## Nesting & untrusted input
|
|
159
301
|
|
|
160
|
-
Both the C extension and the pure-Ruby
|
|
302
|
+
Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue.
|
|
303
|
+
|
|
304
|
+
The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you process untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream.
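
One way to size-limit upstream is a cheap pre-check before the input reaches any parser. The sketch below is an assumption-laden illustration: `within_limits?`, `MAX_BYTES`, and `MAX_DEPTH` are hypothetical names and values, and the scanner only understands double-quoted strings, so for lenient input (quoteless values, comments) its depth count is approximate.

```ruby
# Hypothetical limits; pick values that fit your own workload.
MAX_BYTES = 10 * 1024 * 1024
MAX_DEPTH = 512

# Returns true when the text is small enough and its bracket nesting,
# counted outside double-quoted strings, never exceeds max_depth.
def within_limits?(text, max_bytes: MAX_BYTES, max_depth: MAX_DEPTH)
  return false if text.bytesize > max_bytes

  depth = 0
  in_string = false
  escaped = false

  text.each_char do |ch|
    if in_string
      if escaped
        escaped = false          # character after a backslash is literal
      elsif ch == "\\"
        escaped = true
      elsif ch == '"'
        in_string = false
      end
    elsif ch == '"'
      in_string = true
    elsif ch == "{" || ch == "["
      depth += 1
      return false if depth > max_depth
    elsif ch == "}" || ch == "]"
      depth -= 1 if depth > 0
    end
  end

  true
end

puts within_limits?('{"a":[1,2,{"b":3}]}')
puts within_limits?("[" * 600)
```

The check is a single linear pass with constant extra memory, so it can run on untrusted input before the document is handed to the parser.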
|
|
161
305
|
|
|
162
|
-
The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you parse untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream of the parser.
|
|
163
306
|
|
|
164
307
|
## Development
|
|
165
308
|
|