smarter_json 0.9.2 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
- data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
+ metadata.gz: 84a73a6cf0785c67eb2dfaf87dc663b860d5afed6bf0816f861b6430d1f55475
+ data.tar.gz: 953aebf65ab855450a7d3b41c826c169242dd19a190f089da3f52df2da0b0a44
  SHA512:
- metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
- data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29
+ metadata.gz: 8fe6e07fd99f1557a716fc37370dfc3c5cdb34e054d07fa812983d6cefea1e74108bb0e708fdb42f47a31ff8435aa0d6b8c80deeebab09a221bbd9992691be28
+ data.tar.gz: ec166df8863b136abc38df844bba2286b542de03f24d410b6e4a5914bfcf2165337add25c5731b92540328c5496ad4ec658243e21e4830687de9f44dcaaf4d54
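The `checksums.yaml` entries above map each digest algorithm to the hex digest of an archive member. A minimal sketch of how such a record could be checked with Ruby's standard `digest` library follows. The helper name `checksum_mismatches` and the placeholder byte strings are assumptions for illustration; they are not part of RubyGems or of this gem, and the real archive members are not reproduced here.

```ruby
require "digest/sha2"

# Algorithm names as they appear in checksums.yaml, mapped to stdlib digest classes.
DIGEST_CLASSES = {
  "SHA256" => Digest::SHA256,
  "SHA512" => Digest::SHA512
}.freeze

# Hypothetical helper: returns [algorithm, file_name] pairs whose computed
# digest differs from the recorded one. An empty Array means all digests match.
def checksum_mismatches(recorded, contents)
  recorded.each_with_object([]) do |(algorithm, files), mismatches|
    digest_class = DIGEST_CLASSES.fetch(algorithm)
    files.each do |name, expected_hex|
      actual_hex = digest_class.hexdigest(contents.fetch(name))
      mismatches << [algorithm, name] unless actual_hex == expected_hex
    end
  end
end

# Placeholder data standing in for the bytes of an archive member.
contents = { "metadata.gz" => "demo metadata bytes" }
recorded = {
  "SHA256" => { "metadata.gz" => Digest::SHA256.hexdigest("demo metadata bytes") },
  "SHA512" => { "metadata.gz" => Digest::SHA512.hexdigest("demo metadata bytes") }
}

checksum_mismatches(recorded, contents)                          # => []
checksum_mismatches(recorded, { "metadata.gz" => "tampered" })   # => [["SHA256", "metadata.gz"], ["SHA512", "metadata.gz"]]
```

Both digests change between the two versions because the packaged files themselves changed; a mismatch against the recorded values for a given version would instead suggest a corrupted or altered download.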
data/.gitignore CHANGED
@@ -44,3 +44,5 @@ overage/

  .claude/
  CLAUDE.md
+ INTERNAL_DEV_LOG.md
+ research
data/CHANGELOG.md CHANGED
@@ -1,111 +1,145 @@
1
1
 
2
2
  # SmarterJSON Change Log
3
3
 
4
- > 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
4
+ > ⚠️ **New Interface (since 0.9.7):**
5
+ >
6
+ > SmarterJSON **always returns an `Array`** of documents.
7
+ >
8
+ > `SmarterJSON.process` / `SmarterJSON.process_file` return:
9
+ >
10
+ > - `[]` for no doc
11
+ > - `[doc]` for one doc
12
+ > - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
13
+
14
+ > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops any additional documents.
15
+ > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
16
+
17
+ ## 1.0.0 (2026-06-08)
18
+
19
+ RSpec tests: 1,034
20
+
21
+ - **The public interface is now stable** — `process`, `process_one`, `process_file`, `generate`, and the documented options; semantic versioning from here on.
22
+ - Unknown or wrongly-typed options now raise `ArgumentError` instead of being silently ignored, so a typo (e.g. `symbolize_names:` instead of `symbolize_keys:`) is caught immediately.
23
+ - Input tagged `ASCII-8BIT` whose bytes are valid UTF-8 (e.g. a `Net::HTTP` `response.body`) is now read as UTF-8, so its string values compare equal to UTF-8 literals; ASCII-8BIT input that is not valid UTF-8 raises `SmarterJSON::EncodingError` (pass an explicit `encoding:` for legacy encodings).
24
+ - Object keys may now use smart/curly quotes too (e.g. JSON pasted from a word processor), not just string values.
25
+ - `SmarterJSON.generate` accepts `allow_nan: true` to emit `NaN` / `Infinity` / `-Infinity` (JSON5-style) instead of raising, so non-finite numbers round-trip; the default still raises.
26
+ - A numeric literal that overflows `Float` range (e.g. `1e400`) now reports a `:number_overflow` warning via `on_warning` instead of silently becoming `Infinity`.
27
+ - `SmarterJSON.generate` is now iterative (like the parser), so serializing a deeply nested structure no longer risks `SystemStackError` — reading and writing are both depth-safe.
28
+
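The ASCII-8BIT entry in the 1.0.0 list above rests on plain Ruby string semantics, which the following sketch illustrates using core `String` methods only. The helper `utf8_view` is hypothetical and is not SmarterJSON's implementation; it only shows why a binary-tagged body fails to compare equal to a UTF-8 literal until it is re-tagged.

```ruby
# Hypothetical helper (not part of SmarterJSON): return a UTF-8 tagged copy of
# the given bytes when they form valid UTF-8, or nil when they do not.
def utf8_view(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : nil
end

literal = "Z\u00FCrich"  # UTF-8 string (a \u escape always yields UTF-8)
body    = literal.b       # same bytes, tagged ASCII-8BIT, like a binary HTTP body

body.encoding == Encoding::ASCII_8BIT  # => true
body.bytes == literal.bytes            # => true  (identical bytes)
body == literal                        # => false (different encodings, non-ASCII content)
utf8_view(body) == literal             # => true  (re-tagged copy compares equal)
utf8_view("\xFF".b)                    # => nil   (0xFF is never valid UTF-8)
```

The `nil` branch corresponds to the case where the changelog says an error is raised and an explicit `encoding:` should be passed instead.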
29
+ ## 0.9.9 (2026-06-07)
30
+ - Much faster pure-Ruby parsing (the path used without the C extension) — roughly 3× on string-heavy data, ~2× on number-heavy, ~1.7× on object-heavy (on a YJIT-enabled Ruby). Parsed values are unchanged.
31
+
32
+ ## 0.9.8 (2026-06-06 unreleased)
33
+ - Faster parsing of string-heavy arrays. Parsed values are unchanged.
34
+
35
+ ## 0.9.7 (2026-06-05 unreleased)
36
+ - **Breaking: `process` / `process_file` now always return an `Array` of documents** — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several. (Previously polymorphic: `nil` / the value / an `Array`.) The document count is now unambiguous, and any result can be iterated uniformly.
37
+ - **New `SmarterJSON.process_one(input)`** — the single-document accessor for the common case: returns the one document's value (or `nil`), and *warns* (never raises) if the input held more than one. Takes a String or an IO; for an IO it is bounded-memory (parses just the first document). Reaching for `.first` / `[0]` on a `process` result silently drops extra documents — use `process_one` instead.
38
+ - The **block form now returns the document count** (was `nil`): `n = SmarterJSON.process(io) { |doc| ... }`.
39
+ - **The top level is stricter, which keeps the LLM-wrapper recovery working:** a top-level value must be a recognized JSON value (number / `true` / `false` / `null` / quoted string / object / array) or an implicit-root object (`host: localhost`). A bare top-level run — `localhost`, `1 2 3`, the typo `flase` — now raises `ParseError` instead of becoming a quoteless string. A space is never a document separator (`1 2 3` raises rather than splitting into three). In-container quoteless strings (`[red green blue]`, `host: localhost`) are unchanged.
40
+
41
+ ## 0.9.6 (2026-06-04 unreleased)
42
+ - Faster `decimal_precision: :float` parsing of full-precision decimal numbers (around 17–18 significant digits — e.g. coordinate data and scientific output). Parsed values are unchanged: still correctly rounded, bit-for-bit identical to `JSON.parse`.
43
+
44
+ ## 0.9.5 (2026-06-04 unreleased)
45
+ - Faster `decimal_precision: :float` parsing of very high-precision decimal numbers (more than ~17 significant digits). Parsed values are unchanged.
46
+ - Faster parsing of object-heavy and compact documents — less per-element overhead in the C parser. No behavior change.
47
+
48
+ ## 0.9.4 (2026-06-04 unreleased)
49
+ - Internal performance experiments. No user-facing changes.
50
+
51
+ ## 0.9.3 (2026-06-03)
52
+ - Renamed the `bigdecimal_load:` option to `decimal_precision:` (same values: `:auto`, `:float`, `:bigdecimal`).
53
+ - Invalid option *values* now raise `ArgumentError` with a clear message instead of being silently ignored. Unknown option keys are still ignored.
54
+ - Faster parsing of pretty-printed (indented) input.
55
+ - Removed the `duplicate_key: :raise` option — it conflicted with SmarterJSON's lenient design. `duplicate_key:` now accepts `:last_wins` (default) and `:first_wins`; repeated keys are still reported through `on_warning`.
5
56
 
6
57
  ## 0.9.2 (2026-06-03)
7
- - **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
58
+ - Fixed a performance regression that slowed parsing of large documents.
8
59
 
9
60
  ## 0.9.1 (2026-06-03 unreleased)
10
- - **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
11
- - **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned; the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
12
- - Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
13
- - Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
61
+ - Fixed a major performance regression on real-world data that contained markdown fences or `<json>` markers inside ordinary string values.
62
+ - Streaming: a document is no longer cut off early when a comment / quote marker falls across a read-chunk boundary.
14
63
 
15
64
  ## 0.9.0 (2026-06-03 unreleased)
16
- - performance improvements
17
- - code cleanup
65
+ - Performance improvements and code cleanup.
18
66
 
19
67
  ## 0.8.0 (2026-06-03)
20
68
  - **Robustness** against LLM-generated / wrapped JSON:
21
69
  - strips markdown code fences (```json / ```)
22
- - ignores obvious prefix / suffix prose around a payload
70
+ - ignores leading / trailing prose around a JSON payload
23
71
  - unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`
24
- - preserves multiple recovered payloads as an `Array`
25
- - supports pretty-printed multi-line document framing on IO / block input
26
- - **Warnings** now cover wrapper recovery too (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
27
- - **No truncation recovery**: truncated / unterminated input still raises `SmarterJSON::ParseError`
72
+ - returns multiple recovered payloads as an `Array`
73
+ - parses pretty-printed multi-line documents from IO / block input
74
+ - reports each recovery through `on_warning` (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
75
+ - Truncated / unterminated input still raises `SmarterJSON::ParseError` — SmarterJSON does not guess at missing data.
28
76
 
29
77
  ## 0.7.0 (2026-06-03)
30
- - **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` tuple return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as the parser applies a lenient fix; `process` / `process_file` now always return the bare value (nil / value / Array) on every path. Unlike the tuple, this also fires on the streaming block form. The default (no handler) records nothing and costs nothing.
78
+ - **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as a lenient fix is applied; `process` / `process_file` now always return just the value, including on the streaming block form. The default (no handler) records nothing and costs nothing.
31
79
 
32
80
  ## 0.6.0 (2026-06-02)
33
- - Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`), on both the C and Ruby paths. No null is inserted for an empty slot.
34
- - A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}` (both paths).
35
- - New opt-in `warnings:` option. With `warnings: true`, `process` / `process_file` return `[result, warnings]`, where `warnings` is an Array of `SmarterJSON::Warning` (`type`, `message`, `line`, `col`) recording the lenient fixes applied — `:empty_slot`, `:empty_value`, `:duplicate_key`. Default off; works on both paths.
36
- - Fixed a pure-Ruby bug where a mantissa-less exponent token (e.g. `-e695881`) was read as `0.0`; it is now a quoteless string, matching the C path.
37
- - Fixed a pure-Ruby bug where a `\u` escape whose next bytes split a multibyte character leaked `ArgumentError`; it now raises `SmarterJSON::ParseError`.
38
- - Added a property/fuzz test suite that checks C/Ruby parity and round-tripping on generated, mutated, and random input.
81
+ - Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`). No null is inserted for an empty slot.
82
+ - A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}`.
83
+ - New opt-in `warnings:` option recording the lenient fixes applied — `:empty_slot`, `:empty_value`, `:duplicate_key`. (Superseded by `on_warning:` in 0.7.0.)
39
84
 
40
85
  ## 0.5.2 (2026-06-01) yanked
41
- - `generate` now supports pretty-printing via the `indent:` option (spaces per nesting level; default `0` = compact). Empty objects/arrays stay inline; `indent:` combined with `format: :ndjson` raises `ArgumentError`.
42
- - `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII as `\uXXXX`, astral chars as surrogate pairs), and `script_safe:` (escape `</` and U+2028/U+2029 for safe embedding in an HTML `<script>` tag).
43
- - `generate` adds opt-in `coerce:` — when `true`, a value that isn't natively supported (e.g. `Time`, `Date`, app objects) is converted via its own `as_json` (result re-emitted) or `to_json` (spliced); strict-by-default still raises `GenerateError`.
86
+ - `generate` supports pretty-printing via the `indent:` option (spaces per nesting level; default compact). Combining `indent:` with `format: :ndjson` raises `ArgumentError`.
87
+ - `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII), and `script_safe:` (escape `</` and U+2028/U+2029 for safe embedding in an HTML `<script>` tag).
88
+ - `generate` adds opt-in `coerce:` — convert an otherwise-unsupported value (e.g. `Time`, `Date`, app objects) via its own `as_json` / `to_json`; strict-by-default still raises `GenerateError`.
44
89
 
45
90
  ## 0.5.1 (2026-06-01) yanked
46
- - Unified the error classes under a single `SmarterJSON::Error` base: `ParseError` and `EncodingError` now inherit from it, and `generate` raises a new `GenerateError`. `rescue SmarterJSON::Error` now catches everything the gem raises.
47
- - Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS).
48
- - Fixed the C extension build on Ruby 2.6 (declare `rb_hash_bulk_insert`, which 2.6 exports but does not declare in its headers); set the minimum Ruby to 2.6.
91
+ - Unified the error classes under a single `SmarterJSON::Error` base: `ParseError`, `EncodingError`, and the new `GenerateError` all inherit from it, so `rescue SmarterJSON::Error` catches everything the gem raises.
92
+ - Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS); minimum Ruby is now 2.6.
49
93
 
50
94
  ## 0.5.0 (2026-05-31 unreleased)
51
- - add JSON generation, incl. NDJSON generation
52
- - add test coverage
95
+ - Added JSON generation, including NDJSON.
96
+ - Added test coverage.
53
97
 
54
98
  ## 0.4.0 (2026-05-31 unreleased)
55
- - rename `flex_json` -> `smarter_json`
99
+ - Renamed the gem `flex_json` → `smarter_json`.
56
100
 
57
101
  ## 0.3.10 (2026-05-31 unreleased)
58
- - change interface to use `.process` and `.process_file`
59
-
102
+ - Changed the interface to `.process` and `.process_file`.
60
103
 
61
104
  ## 0.3.9 (2026-05-31 unreleased)
62
- - `parse` (no block) now handles any input automatically: 0 documents (empty / whitespace / comment-only) → `nil`, 1 document → the value itself, 2+ documents (NDJSON / JSONL / concatenated / whitespace-separated) → an Array of the values. It no longer raises on trailing content.
63
- - Detection is free (the same trailing-content check that used to raise) and the single-document path allocates no Array, so single-value parsing is unchanged in speed.
64
- - The block form (`parse(input) { |doc| … }`) is kept as the bounded-memory streaming path. `parse_file(path) { |doc| … }` now forwards the block too, so files stream the same way (previously the block was silently ignored). Bracketless comma lists (`1, 2, 3`) still raise — commas don't separate top-level documents (implicit-root array remains unsupported).
65
- - The block form allows individual processing of each line in NDJSON files.
66
- - Supersedes the earlier "raise on trailing content, match Oj" behavior.
105
+ - `process` with no block now handles any input automatically: 0 documents (empty / whitespace / comment-only) → `nil`, 1 document → the value itself, 2+ documents (NDJSON / JSONL / concatenated) → an `Array`. It no longer raises on trailing content.
106
+ - The block form (`process(input) { |doc| }`) streams documents with bounded memory; `process_file` forwards the block too, so each line of an NDJSON file can be processed individually.
67
107
 
68
108
  ## 0.3.8 (2026-05-30 unreleased)
69
- - Reordered single-character checks so the more common byte is tested first (`-` before `+`).
70
- - Quoteless-token boundary scan now uses a 256-byte class table: ordinary bytes are classified in one table lookup, and the lookahead byte is read only at a `#`/`/` instead of on every byte. Speeds up quoteless / config-style input (the lenient case the JSON benchmarks don't exercise).
109
+ - Performance improvements (quoteless / config-style input).
71
110
 
72
111
  ## 0.3.7 (2026-05-30 unreleased)
73
- - Escaped-string literal runs are bulk-copied with the NEON scanner instead of one byte at a time.
74
- - Added branch hints (`__builtin_expect`) and prefetch to the hot string-scan loop. Sped up string-heavy files (string_array, github_events, twitter all 12–16% faster).
112
+ - Performance improvements (string-heavy input).
75
113
 
76
114
  ## 0.3.6 (2026-05-30 unreleased)
77
- - Fast path for plain numbers inside objects/arrays (`fj_try_member_number`): one scan straight from the cursor, committing when the number meets a delimiter and falling back to the quoteless scanner otherwise. Skips the quoteless boundary scan + classify dispatch for the common case. Broad gains on number-in-container files (weather, canada, usgs, big_decimals).
115
+ - Performance improvements (numbers inside objects / arrays).
78
116
 
79
117
  ## 0.3.5 (2026-05-30 unreleased)
80
- - Rewrote `fj_parse_number` (top-level numbers) as a single pass: finds the token end and accumulates the mantissa/exponent at once, using the string's NUL terminator as a scan sentinel (no per-byte bounds check) and a digit loop that skips the underscore check until an underscore actually appears.
81
- - Added `fj_try_decimal` for the quoteless path: validates and extracts the number in one scan, replacing the old three scans (validate + significant-digit count + mantissa extraction); skips the significant-digit scan when the number has ≤16 digits.
82
- - Both number paths now build values through the shared `fj_int_from_parts` / `fj_float_from_parts` helpers so they can't drift; removed the now-dead `fj_validate_decimal` / `fj_int_value` / `fj_decimal_value`.
118
+ - Performance improvements (number parsing).
83
119
 
84
120
  ## 0.3.4 (2026-05-30 unreleased)
85
- - Dropped a per-member Ruby method call (`key?`) that fired for every object member under the default duplicate-key mode — pure waste on object-heavy files (twitter, github_events, citm).
86
- - Build objects and arrays from a C value stack with a pre-sized hash + bulk insert (and size-based duplicate detection), instead of inserting one member/element at a time.
87
- - Added a per-parse key cache so repeated object keys are interned once instead of every occurrence.
121
+ - Performance improvements (object-heavy input).
88
122
 
89
123
  ## 0.3.3 (2026-05-30 unreleased)
90
- - Vendored Ryū (Ulf Adams, Apache-2.0) for correctly-rounded string→double conversion: the mantissa is accumulated in one pass and converted with no `strtod`. Large win on float-heavy files (canada, big_decimals).
124
+ - Faster, correctly-rounded float parsing.
91
125
 
92
126
  ## 0.3.3 (2026-05-29 unreleased)
93
- - performance fixes
127
+ - Performance fixes.
94
128
 
95
129
  ## 0.3.2 (2026-05-29 unreleased)
96
- - performance fixes
130
+ - Performance fixes.
97
131
 
98
132
  ## 0.3.1 (2026-05-29 unreleased)
99
- - performance fixes
133
+ - Performance fixes.
100
134
 
101
135
  ## 0.3.0 (2026-05-29 unreleased)
102
- - iterative parser
136
+ - Iterative parser.
103
137
 
104
138
  ## 0.2.0 (2026-05-29 unreleased)
105
- - recursive parser
139
+ - Recursive parser.
106
140
 
107
141
  ## 0.1.1 (2026-05-29 unreleased)
108
- - MVP complete
142
+ - MVP complete.
109
143
 
110
144
  ## 0.1.0 (2026-05-28 unreleased)
111
- - Initial Ruby version
145
+ - Initial Ruby version.
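For the `:number_overflow` entry in the 1.0.0 notes of this changelog, the underlying hazard can be reproduced with core Ruby alone: converting an out-of-range decimal literal yields `Infinity` without any error. The sketch below uses a hypothetical helper, `overflowed_literal?`, to show one way such a case could be detected; it is an illustration under that assumption, not SmarterJSON's actual check.

```ruby
# Hypothetical helper: true when a decimal literal exceeds Float range once
# converted. String#to_f does not recognise the words "Infinity" or "NaN"
# (it returns 0.0 for them), so an infinite result here can only come from overflow.
def overflowed_literal?(text)
  !text.to_f.infinite?.nil?   # Float#infinite? returns nil, 1, or -1
end

"1e400".to_f                   # => Infinity (silently, no exception)
overflowed_literal?("1e400")   # => true
overflowed_literal?("-1e400")  # => true
overflowed_literal?("1e308")   # => false (still within Float range)
overflowed_literal?("3.14")    # => false
```

Reporting the overflow through a warning callback, as the changelog describes, keeps the parse going while still making the loss of magnitude visible to the caller.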
data/README.md CHANGED
@@ -2,36 +2,83 @@
2
2
 
3
3
  ![Gem Version](https://img.shields.io/gem/v/smarter_json) [![codecov](https://codecov.io/gh/tilo/smarter_json/branch/main/graph/badge.svg)](https://codecov.io/gh/tilo/smarter_json) <!-- [![Downloads](https://img.shields.io/gem/dt/smarter_json)](https://rubygems.org/gems/smarter_json) --> [![RubyGems](https://img.shields.io/badge/RubyGems-smarter__json-brightgreen?logo=rubygems&logoColor=white)](https://rubygems.org/gems/smarter_json) [![Ruby Toolbox](https://img.shields.io/badge/Ruby%20Toolbox-smarter__json-brightgreen)](https://www.ruby-toolbox.com/projects/smarter_json)
4
4
 
5
- A lenient, fast JSON parser for Ruby. It parses strict JSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write — and in benchmarks it matches or beats Oj on nearly every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Other parsers are strict - they stop at the first deviation - SmarterJSON keeps going - it optimizes for getting your data out, not for policing the JSON spec.
5
+ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write, and in benchmarks it matches or beats Oj on every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Traditional JSON parsers are strict and stop at the first deviation; SmarterJSON keeps going, optimizing for getting your data out, not for policing the JSON spec.
6
6
 
7
- > **SmarterJSON: one parser, no modes — want strict? Please use the stdlib `json` gem.**
7
+ > **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.**
8
8
 
9
9
  ## Why SmarterJSON?
10
10
 
11
- Most JSON parsers reject anything that isn't perfectly strict JSON. SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.** Give it strict JSON, JSON5, an HJSON-style config file, newline-delimited JSON, or a copy-pasted blob with comments and trailing commas — it just parses it. When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma). A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
11
+ **Are you tired of seeing errors like these?**
12
+
13
+ ```
14
+ ERROR running JSON.parse (stdlib) on deeply_nested.json: JSON::NestingError: nesting of 101 is too deep
15
+
16
+ ERROR running Oj.load (default) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
17
+
18
+ ERROR running Oj.load (strict, float) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
19
+
20
+ ERROR running Oj.load (compat) on config.json5: EncodingError: unexpected character (after [0]) at line 5, column 6 [parse.c:931] in '// JSON5 config sample — leni…
21
+
22
+ ERROR running JSON.parse (stdlib) on config.json5: JSON::ParserError: expected object key, got 'id:' at line 4 column 5
23
+
24
+ ERROR running Yajl::Parser (yajl-ruby) on config.json5: Yajl::ParseError: lexical error: invalid char in json text. this. */ [ // record 0 { id: 0, name: 'alpha-0', mask: 0 (…
25
+
26
+ ERROR running Oj.load (default) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
27
+
28
+ ERROR running Oj.load (strict, float) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
29
+
30
+ ERROR running Oj.load (compat) on github_events_100k.ndjson: EncodingError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870] in '{"id":"…
31
+
32
+ ERROR running JSON.parse (stdlib) on github_events_100k.ndjson: JSON::ParserError: unexpected token at end of stream '{"id":"34816047161","type":"Dele' at line 1 column 1
33
+
34
+ ERROR running Yajl::Parser (yajl-ruby) on github_events_100k.ndjson: Yajl::ParseError: Found multiple JSON objects in the stream but no block or the on_parse_complete callback was
35
+ assigne…
36
+ ```
37
+
38
+ **Do you have no control over the input quality?**
39
+
40
+ Traditional JSON parsers reject anything that isn't perfectly strict JSON. That means your code breaks on malformed data.
41
+
42
+ SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.**
43
+ Give it strict JSON, NDJSON, JSON5, an HJSON-style config file, LLM-generated JSON, or a copy-pasted blob with comments and trailing commas — it just extracts the data from it.
44
+ When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma).
45
+
46
+ A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
12
47
 
13
48
  > For an ingestion tool, "reject the whole document because of one stray comma" is the worst outcome: you throw away the 99% that's fine to avoid maybe-mishandling a gap that carries no data anyway.
14
49
 
15
50
  Three things set it apart:
16
51
 
17
- 1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
52
+ 1. **One tool, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure it to match your input; it adapts to whatever you give it.
18
53
 
19
- 2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
54
+ 2. **It extracts every document from multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: it always returns an `Array` of the documents found (`[]` / `[doc]` / `[d1, d2, …]`). For the common single-document case, `SmarterJSON.process_one` returns the one value directly (and warns, never raises, if there was more than one). The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
20
55
 
21
- 3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
56
+ 3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) matches or beats Oj on every file we benchmark, and is competitive with the stdlib `json` C parser — among the fastest general-purpose JSON processors in Ruby.
22
57
 
23
58
  ## What it accepts, beyond strict JSON
24
59
 
25
- - `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` parses as a string, not a truncated value)
60
+ - `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
26
61
  - Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
27
62
  - Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
28
63
  - Implicit root object — a config file that starts with `key: value`, no outer `{}`
29
64
  - `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
30
- - UTF-8 BOM, smart/curly quotes, Python literals (`True` / `False` / `None`), JavaScript `undefined`
65
+ - UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`
31
66
  - Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
32
67
  - Duplicate keys (last value wins by default; configurable)
33
68
 
34
- It raises only on genuinely unparseable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
69
+ It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
70
+
71
+ ### Format references
72
+
73
+ The lenient grammar is a superset of these human-JSON specs — listed once, here:
74
+
75
+ * [JSON5](https://json5.org/)
76
+ * [HJSON](https://hjson.github.io/)
77
+ * [JWCC / HuJSON](https://github.com/tailscale/hujson)
78
+ * [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
79
+ * [JSONH](https://github.com/jsonh-org/Jsonh)
80
+ * [JSONC (VS Code)](https://jsonc.org/)
81
+ * [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464)
35
82
 
36
83
  ## Installation
37
84
 
@@ -44,13 +91,48 @@ gem "smarter_json"
44
91
  gem install smarter_json
45
92
  ```
46
93
 
47
- The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
94
+ The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby implementation runs instead and produces identical results.
95
+
96
+ ## Usage
97
+
98
+ Pass a String of JSON content or an IO; you get back the extracted data. The same call handles strict JSON, JSON5, and HJSON-style config — there are no modes or flags.
99
+
100
+ ```ruby
101
+ require "smarter_json"
102
+
103
+ SmarterJSON.process('{"a": 1, "b": [2, 3]}') # => [{"a"=>1, "b"=>[2, 3]}] (always an Array of documents)
104
+ SmarterJSON.process_one('{"a": 1, "b": [2, 3]}') # => {"a"=>1, "b"=>[2, 3]} (the one document's value)
105
+ SmarterJSON.process_file("config.json5") # read a file, then process
106
+ ```
107
+
108
+ **Prefer `process`.** It always returns an `Array`, so the document count is explicit and you never silently drop one. Reach for `process_one` when you want just the single document's value — it *warns* (never raises) if the input turns out to hold more than one, so an unexpected extra document is surfaced, not dropped.
109
+
110
+ ## Usage in APIs
111
+
112
+ At an API boundary the JSON comes from someone you don't control — a client POSTing a request body to *your* service, or an upstream service answering a call *you* made — and it isn't always clean: a stray trailing comma, a `NaN`, a payload wrapped in prose, or a quiet change to the format. A strict parser turns any of those into an exception (a request you reject, or a failed call chain). SmarterJSON extracts the data that's there instead, so one formatting quirk doesn't sink the whole request:
113
+
114
+ ```ruby
115
+ # Inbound — JSON a caller sent to your endpoint:
116
+ data = SmarterJSON.process(request.body)
117
+
118
+ # Outbound — JSON from a service you called:
119
+ data = SmarterJSON.process(response.body)
120
+ ```
121
+
122
+ What that buys you:
48
123
 
49
- ## API stability and thread safety
124
+ * fewer "random production crashes" from messy JSON on either side of the wire
125
+ * resilience when a caller or a provider changes its output
126
+ * the option to log and recover, instead of rejecting the request outright
127
+ * consistent handling of edge-case payloads
50
128
 
51
- The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
129
+ See [Examples](#examples) below for multi-document input, streaming, and recovering JSON from LLM / markdown noise.
52
130
 
53
- Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
131
+ ## Stable interface & thread safety
132
+
133
+ The public interface is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
134
+
135
+ Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls.
54
136
 
55
137
  ## Documentation
56
138
 
@@ -60,50 +142,147 @@ Concurrent calls are safe. The parser/generator keep per-call state local, and t
60
142
  * [Configuration Options](docs/options.md)
61
143
  * [Examples](docs/examples.md)
62
144
 
63
- ## Usage
145
+ ### Warnings (`on_warning`)
146
+
147
+ When SmarterJSON quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
64
148
 
65
149
  ```ruby
66
- require "smarter_json"
150
+ # Collect them all:
151
+ warns = []
152
+ data = SmarterJSON.process(input, on_warning: ->(w) { warns << w })
153
+
154
+ # Or route them — log, count, raise:
155
+ SmarterJSON.process(input, on_warning: ->(w) { Rails.logger.warn(w) })
156
+ ```
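
One way to act on these warnings is to tally them by `type`. The sketch below exercises such a handler without loading the gem: `FakeWarning` is a hypothetical stand-in that models only the four fields named above (`type`, `message`, `line`, `col`), the message strings are placeholders, and the handler is called by hand. With the gem loaded, the same lambda would presumably be passed as `on_warning:`.

```ruby
# Hypothetical stand-in for SmarterJSON::Warning so the sketch runs without the gem.
# Only the four documented fields are modelled; the messages below are placeholders.
FakeWarning = Struct.new(:type, :message, :line, :col)

# Handler: count warnings per type and remember where each type first appeared.
counts = Hash.new(0)
first_seen = {}
handler = lambda do |w|
  counts[w.type] += 1
  first_seen[w.type] ||= [w.line, w.col]
end

# Simulated invocations. With the gem this would instead be:
#   SmarterJSON.process(input, on_warning: handler)
handler.call(FakeWarning.new(:duplicate_key, "placeholder message", 3, 5))
handler.call(FakeWarning.new(:empty_slot, "placeholder message", 4, 9))
handler.call(FakeWarning.new(:duplicate_key, "placeholder message", 7, 5))

counts[:duplicate_key]      # 2
first_seen[:duplicate_key]  # [3, 5]
```

Keeping the handler a plain lambda means the same object can be reused across calls, for example to accumulate counts over a whole batch of inputs.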
157
+
158
+
159
+ ## Performance
160
+
161
+ SmarterJSON is a C extension (with a pure-Ruby fallback that runs everywhere). Before the speed table, consider what a "× faster" ratio can't capture: **things the other parsers can't do at all.**
162
+
163
+ - **stdlib `json` can't parse deeply nested data by default.** It caps nesting at 100 levels (its `max_nesting` default) and raises; SmarterJSON has no depth limit (iterative parser, bounded only by memory).
164
+ - **None of the others read NDJSON / JSONL / concatenated input in a single call.** Oj, `json`, and Yajl each raise on the second document. Only SmarterJSON's `process` returns every document as an `Array`.
165
+ - **None of the others parse JSON5, HJSON-style config, or LLM-wrapped output.** Comments, trailing commas, unquoted keys, quoteless values, `'single quotes'`, markdown code fences, prose wrappers — all raise in Oj / `json` / Yajl; SmarterJSON parses them.
166
+ - **`json` and Yajl produce `Float` only — lossy on high-precision numbers.** On coordinate / scientific data (>16 significant digits) they silently round to `Float`, so they aren't a like-for-like comparison there. SmarterJSON (and Oj) keep full precision as `BigDecimal` by default.
167
+
168
+ Where a like-for-like comparison exists, here is SmarterJSON's C path against each parser. **Apple M4, Ruby 3.4.7, p10 of 40 runs (2026-06-07); the same picture holds on an Apple M1 Max.** Each cell is **SmarterJSON vs that parser** — "faster" means SmarterJSON wins. Ratios shift with hardware; run `rake report` in `json_benchmarks/` to reproduce.
169
+
170
+ | File | vs Oj/strict | vs `json` | vs Yajl |
171
+ | ----------------------------- | --------------- | ---------------------------- | --------------- |
172
+ | big_decimals <sup>≠</sup> | **1.7× faster** | ≈ tied | **1.2× faster** |
173
+ | canada <sup>≠</sup> | **7× faster** | ≈ tied | **2.1× faster** |
174
+ | citm_catalog | **1.3× faster** | 1.2× slower | **3.2× faster** |
175
+ | citylots <sup>≠</sup> | **3.7× faster** | **2.0× faster** | **2.3× faster** |
176
+ | config.jsonc | **1.1× faster** | 1.2× slower | **3.6× faster** |
177
+ | deeply_nested | **1.2× faster** | **can't parse** <sup>‡</sup> | **4.1× faster** |
178
+ | github_events | ≈ tied | 1.1× slower | **2.7× faster** |
179
+ | string_array | ≈ tied | ≈ tied | **1.6× faster** |
180
+ | twitter | **1.3× faster** | 1.2× slower | **3.2× faster** |
181
+ | usgs_earthquakes <sup>≠</sup> | **1.4× faster** | 1.1× slower | **3.4× faster** |
182
+ | weather_berlin | **1.8× faster** | **1.1× faster** | **3.2× faster** |
183
+
184
+ <sup>≠</sup> High-precision file. The row uses `decimal_precision: :float` (Float, like-for-like) for `canada` / `citylots` / `big_decimals` / `usgs`. SmarterJSON's **default** `:auto` keeps these decimals as `BigDecimal` (no precision loss, like Oj's default) — intrinsically slower than `Float`, so default-vs-`Float` would be apples-to-oranges. Against Oj's matching `BigDecimal` default, SmarterJSON is faster there too.
185
+ <sup>‡</sup> Not a measurement gap — `json` raises by default: it errors on multi-document / NDJSON input without a block, and caps nesting at 100 levels. SmarterJSON has neither limit.
186
+
187
+ In short: **SmarterJSON's C path matches or beats Oj/strict on every file**. This is apples-to-apples: for the high-precision <sup>≠</sup> files it means `decimal_precision: :float`, where Oj/strict also produces `Float`, and with `:float`, float-heavy data like `canada` is **~7× faster**. It is **faster than Yajl on every file** (1.2× to 4.1×), and **roughly level with stdlib `json`**: `json` edges ahead on a few files (`citm_catalog`, `twitter`, `config.jsonc`, `github_events`, `usgs_earthquakes`, all within ~1.25×), SmarterJSON leads on `citylots` and `weather_berlin`, and `json` **can't parse `deeply_nested` at all**. Floats are decoded with the **Eisel-Lemire** algorithm (fast_float), correctly rounded and **bit-for-bit identical to `JSON.parse`**: fast *and* exact, even at full double precision.
188
+
189
+ **Two notes on fair comparison:**
190
+
191
+ - **NDJSON / multi-document:** only SmarterJSON reads it via plain `process` — Oj, `json`, and Yajl raise without a block. `process` collects every document into an `Array`; the block form streams one document at a time in bounded memory (use it for input larger than RAM).
192
+ - **High-precision decimals (the <sup>≠</sup> files):** by default these load as `BigDecimal` (full precision, like Oj's default), which is intrinsically slower than `Float`. When you don't need `BigDecimal` precision, pass `decimal_precision: :float` for a like-for-like `Float` comparison: it runs at 3–6× the speed of the `:auto` default on coordinate/scientific data, and there SmarterJSON ties or beats stdlib `json` on most of these files (e.g. `citylots` ~2×).
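
To make the precision trade-off concrete, here is what a `Float`-producing parse does to a number with more digits than a double can hold. The sketch uses stdlib `json` with its default options only; the input literal is an arbitrary example.

```ruby
require "json"

text = "[3.14159265358979323846]" # 21 significant digits

value = JSON.parse(text).first

value.class # Float
value.to_s  # "3.141592653589793" (digits beyond double precision are gone)
```

This is the loss the `:auto` default avoids by keeping such values as `BigDecimal`, at the speed cost described above.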
193
+
194
+
195
+ ### Options
67
196
 
68
- SmarterJSON.process('{"a": 1, "b": [2, 3]}') # => {"a"=>1, "b"=>[2, 3]}
69
- SmarterJSON.process("host: localhost\nport: 5432") # => {"host"=>"localhost", "port"=>5432} (no braces needed)
70
- SmarterJSON.process_file("config.json5") # read a file, then parse
197
+ | option | default | meaning |
198
+ |-------------------|--------------|-------------------------------------------------------------------------|
199
+ | `symbolize_keys` | `false` | return object keys as Symbols instead of Strings |
200
+ | `duplicate_key` | `:last_wins` | `:last_wins` / `:first_wins` for a key repeated in one object (every repeat is also reported via `on_warning`) |
201
+ | `decimal_precision` | `:auto` | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
202
+ | `acceleration` | `true` | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
203
+ | `encoding` | `nil` | labels the input's encoding; `nil` keeps the input's own (no transcoding pass; see below) |
204
+ | `on_warning` | `nil` | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`, `:number_overflow`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
205
+
206
+ ## Examples
207
+
208
+ ### Lenient, config-style input
209
+
210
+ No outer braces needed — a file or string that starts with `key: value` is read as an implicit root object (HJSON-style):
211
+
212
+ ```ruby
213
+ SmarterJSON.process_one("host: localhost\nport: 5432")
214
+ # => {"host"=>"localhost", "port"=>5432}
215
+ ```
216
+
217
+ ### Multiple documents (NDJSON / JSONL / concatenated)
218
+
219
+ `process` always returns an **`Array` of the documents** it found — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several — with **no block and no special method**. The document count is unambiguous, and any result iterates uniformly:
71
220
 
72
- # Multiple documents (NDJSON / JSONL / concatenated) — no block, no special method:
221
+ ```ruby
73
222
  SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3})) # => [{"id"=>1}, {"id"=>2}, {"id"=>3}]
74
- SmarterJSON.process('{"id":1}') # => {"id"=>1} (one document the value itself)
75
- SmarterJSON.process("") # => nil (zero documents)
223
+ SmarterJSON.process('{"id":1}') # => [{"id"=>1}] (one document, still an Array)
224
+ SmarterJSON.process("") # => [] (zero documents)
225
+ ```
226
+
227
+ For the common single-document case, **`process_one`** returns the one value directly — and *warns* (never raises) if the input held more than one, so you never silently drop a document:
228
+
229
+ ```ruby
230
+ SmarterJSON.process_one('{"id":1}') # => {"id"=>1}
231
+ SmarterJSON.process_one("") # => nil
232
+ ```
233
+
234
+ > **Type-checking the result?** Use `result.is_a?(Array)`, not `result.class == Array` — it's the idiomatic Ruby test, and it stays correct if a future release returns a specialized `Array` subclass.
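
The difference between the two checks shows up as soon as a subclass is involved. In this sketch `DocumentList` is a hypothetical subclass invented for illustration; nothing in the gem is known to define it.

```ruby
# Hypothetical specialized result type, standing in for a possible future Array subclass.
class DocumentList < Array; end

result = DocumentList.new([{ "id" => 1 }])

by_is_a  = result.is_a?(Array)    # true: subclasses still count as Arrays
by_class = result.class == Array  # false: the exact class no longer matches
```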
235
+
236
+ A **top-level** value must be recognized JSON (a number, `true` / `false` / `null`, a quoted string, an object, or an array) or an implicit-root object (`host: localhost`). A bare top-level run such as `localhost` or `1 2 3` raises `ParseError`. Quoteless string values *inside* objects and arrays (`{host: localhost}`, `[red green blue]`) are still accepted.
76
237
 
77
- # For input larger than memory, stream one document at a time with a block
78
- # (process and process_file both forward the block):
238
+ ### Streaming large input with a block
239
+
240
+ For input larger than memory, pass a block: each document is yielded as it is read and the method returns the **document count** instead of building an `Array`. Both `process` and `process_file` forward the block:
241
+
242
+ ```ruby
79
243
  SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
244
+ ```
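
For contrast, the same bounded-memory idea can be sketched with the standard library alone. This is not how SmarterJSON implements streaming; it is a plain line-by-line reader that only handles strictly line-delimited, strict JSON (no concatenated documents, no lenient syntax). `each_ndjson_document` is a helper invented here.

```ruby
require "json"
require "stringio"

# Yield each line-delimited JSON document and return how many were read.
# Only one line is held in memory at a time.
def each_ndjson_document(io)
  count = 0
  io.each_line do |line|
    next if line.strip.empty? # tolerate blank lines between documents
    yield JSON.parse(line)
    count += 1
  end
  count
end

io = StringIO.new(%({"id":1}\n{"id":2}\n\n{"id":3}\n))
ids = []
count = each_ndjson_document(io) { |doc| ids << doc["id"] }
```

Like the block form described above, the helper returns a document count rather than an `Array`, so the caller decides what to retain.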
80
245
 
81
- # Wrapper noise is stripped automatically:
82
- SmarterJSON.process(<<~TEXT)
246
+ ### Recovering JSON from LLM / markdown noise
247
+
248
+ When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.)
249
+
250
+ A fenced code block, as an LLM often returns:
251
+
252
+ ````ruby
253
+ SmarterJSON.process_one(<<~TEXT)
83
254
  Here is the JSON:
84
255
 
85
256
  ```json
86
- {
87
- "a": 1
88
- }
257
+ { "a": 1 }
89
258
  ```
90
259
  TEXT
91
260
  # => {"a"=>1}
261
+ ````
92
262
 
93
- SmarterJSON.process(<<~TEXT)
263
+ Explanatory prose before and/or after the payload is ignored:
264
+
265
+ ```ruby
266
+ SmarterJSON.process_one(<<~TEXT)
94
267
  Here is the result:
95
268
 
96
- {
97
- "a": 1
98
- }
269
+ { "a": 1 }
99
270
 
100
271
  Hope this helps.
101
272
  TEXT
102
273
  # => {"a"=>1}
274
+ ```
103
275
 
104
- SmarterJSON.process("<json>{\"a\":1}</json>")
276
+ `<json>...</json>` / `BEGIN_JSON ... END_JSON` wrapper tags are unwrapped:
277
+
278
+ ```ruby
279
+ SmarterJSON.process_one('<json>{"a":1}</json>')
105
280
  # => {"a"=>1}
281
+ ```
282
+
283
+ When one blob contains several recovered payloads, they come back as an `Array` (the same rule as multi-document input):
106
284
 
285
+ ```ruby
107
286
  SmarterJSON.process(<<~TEXT)
108
287
  first attempt:
109
288
  {"a":1}
@@ -114,52 +293,16 @@ TEXT
114
293
  # => [{"a"=>1}, {"b"=>2}]
115
294
  ```
116
295
 
117
- ### Options
118
-
119
- | option | default | meaning |
120
- |-------------------|--------------|-------------------------------------------------------------------------|
121
- | `symbolize_keys` | `false` | return object keys as Symbols instead of Strings |
122
- | `duplicate_key` | `:last_wins` | `:last_wins` / `:first_wins` / `:raise` for repeated keys in one object |
123
- | `bigdecimal_load` | `:auto` | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
124
- | `acceleration` | `true` | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
125
- | `encoding` | `"UTF-8"` | labels the input's encoding (no transcoding pass; see below) |
126
- | `on_warning` | `nil` | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
127
-
128
- ### Warnings (`on_warning`)
129
-
130
- When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
131
-
132
- ```ruby
133
- # Collect them all:
134
- warns = []
135
- data = SmarterJSON.process(input, on_warning: ->(w) { warns << w })
136
-
137
- # Or route them — log, count, raise:
138
- SmarterJSON.process(input, on_warning: ->(w) { Rails.logger.warn(w) })
139
- ```
140
-
141
- ## Performance
142
-
143
- Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpus (canada, citm_catalog, twitter, github_events, …). The apples-to-apples comparisons are **SmarterJSON/C** vs **Oj/strict** vs **stdlib `json`**, all producing `Float` (run `rake report` in `json_benchmarks/` for the full table — numbers vary run to run).
144
-
145
- - **vs Oj/strict** (the `JSON.parse`-equivalent mode, both producing `Float`): SmarterJSON/C is faster on nearly every file — typically **1.1–1.6×** (e.g. big_decimals ~1.6×, deeply-nested ~1.4×, citm / twitter / usgs ~1.3×, github / citylots / weather ~1.1–1.2×). The one exception is **string_array**, where Oj/strict's SIMD string scan is ~1.7× faster — that's the current frontier.
146
- - **vs stdlib `json` (C):** competitive with the fastest Ruby JSON parser — it ties `json` on big_decimals and string_array, and trails by ~1.1–1.7× on the rest. (`canada.json` is the outlier, far behind — that's the `BigDecimal` default, see below.)
147
- - **Numbers:** floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
148
-
149
- **Two notes on fair comparison:**
150
-
151
- - **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
152
- - **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
153
-
154
296
  ## Encoding
155
297
 
156
- `encoding:` (default `"UTF-8"`) labels what the input is — it does **not** trigger a transcoding pass. The parser works on the bytes in their native encoding and emits string values with the same encoding tag, the same way `smarter_csv` handles encodings. Bytes that are invalid for the claimed encoding raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
298
+ `encoding:` (default `nil`) labels what the input is — it does **not** transcode. With `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that same tag, the way `smarter_csv` does — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) whose bytes are valid UTF-8 is treated as UTF-8. That is exactly how `Net::HTTP` and many HTTP libraries hand you a `response.body` (correct UTF-8 bytes, BINARY tag); without this, string values would come back tagged `ASCII-8BIT` and compare unequal to UTF-8 literals. If such `ASCII-8BIT` input is *not* valid UTF-8, it raises `SmarterJSON::EncodingError` rather than guess a legacy encoding — pass an explicit `encoding:` (e.g. `"ISO-8859-1"`) for that. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
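
The `ASCII-8BIT` situation described above can be reproduced with core Ruby. The sketch below does not call SmarterJSON; it only shows why a BINARY-tagged body compares unequal to UTF-8 literals, and how relabeling (without transcoding) plus a validity check separates the two cases. The sample strings are made up.

```ruby
# A body as many HTTP clients deliver it: valid UTF-8 bytes, BINARY tag.
body = "{\"city\": \"Z\u00FCrich\"}".b
body_encoding = body.encoding # Encoding::ASCII_8BIT (alias Encoding::BINARY)

# Same bytes, different tags: the strings compare unequal.
binary_equals_utf8 = ("Z\u00FCrich".b == "Z\u00FCrich") # false

# Relabel (no transcoding) and check validity, mirroring the rule described above.
relabeled = body.dup.force_encoding(Encoding::UTF_8)
relabeled_valid = relabeled.valid_encoding? # true

# Latin-1 bytes are not valid UTF-8, so relabeling alone cannot rescue them.
latin1 = "caf\xE9".b
latin1_valid_as_utf8 = latin1.dup.force_encoding(Encoding::UTF_8).valid_encoding? # false
```

The second case is the one where an explicit `encoding:` is needed, since the bytes alone do not say which legacy encoding they came from.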
157
299
 
158
300
  ## Nesting & untrusted input
159
301
 
160
- Both the C extension and the pure-Ruby parser are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) parses without issue.
302
+ Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue. **`generate` is iterative too**, so serializing a deeply nested Ruby structure can't overflow the stack either — reading *and* writing are both depth-safe.
303
+
304
+ The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you process untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream.
161
305
 
162
- The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you parse untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream of the parser.
163
306
 
164
307
  ## Development
165
308