RubyGems - smarter_json - Versions diffs - 0.9.2 → 0.9.9 - Mend

smarter_json 0.9.2 → 0.9.9


Files changed (22)

checksums.yaml +4 -4
data/.gitignore +1 -0
data/CHANGELOG.md +77 -54
data/README.md +215 -72
data/docs/_introduction.md +6 -12
data/docs/basic_read_api.md +29 -19
data/docs/basic_write_api.md +2 -2
data/docs/examples.md +32 -23
data/docs/options.md +14 -14
data/ext/smarter_json/smarter_json.c +223 -89
data/ext/smarter_json/vendor/LICENSE-fast_float-MIT +27 -0
data/ext/smarter_json/vendor/eisel_lemire.h +117 -0
data/ext/smarter_json/vendor/eisel_lemire.md +29 -0
data/ext/smarter_json/vendor/eisel_lemire_powers.h +663 -0
data/lib/smarter_json/backports.rb +28 -0
data/lib/smarter_json/options.rb +52 -0
data/lib/smarter_json/parser.rb +400 -139
data/lib/smarter_json/version.rb +1 -1
data/lib/smarter_json.rb +3 -1
metadata +9 -5
data/ext/smarter_json/vendor/ryu.h +0 -819
data/ext/smarter_json/vendor/ryu.md +0 -22

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
-  data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
+  metadata.gz: fbc93b4afea26fc4d30c241ceb7823451cb7b777454441a9871f066b719f8c07
+  data.tar.gz: a9a5801cab6a604e4d166d6d86aa48f41f5e1bb74fb58f39f283a0477ac78db4
 SHA512:
-  metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
-  data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29
+  metadata.gz: c8749bd358973f0d284966a6c5fb42a4e71954c8884c337c5859f5f3e566b302d1a4c1dbc9961ed5ae46aaf5a9e019935fa46916bb858877a2ef34ecc99575c9
+  data.tar.gz: 2f2efe9c89ae08bcf807e061ca734d13056c6fafc590d072d7870a705cf25e3d4932fff567c1b09320f5cbd6b42e02255180fded05c84b08f1032abb0594b8bf
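
The digests above are what a consumer would compare against when checking the archives of a downloaded gem. A minimal sketch of that comparison follows, using only Ruby's standard `digest` library. The helper name `checksum_matches?`, the stand-in bytes, and the Hash mirroring the layout of `checksums.yaml` are all assumptions made for illustration; none of them comes from the gem or from RubyGems itself.

```ruby
require "digest"

# Hypothetical helper (not part of RubyGems): true when both the SHA256 and
# the SHA512 digest of `bytes` equal the values recorded for `name` in a
# Hash shaped like the parsed checksums.yaml shown above.
def checksum_matches?(bytes, checksums, name)
  Digest::SHA256.hexdigest(bytes) == checksums.dig("SHA256", name) &&
    Digest::SHA512.hexdigest(bytes) == checksums.dig("SHA512", name)
end

# Stand-in for the real archive bytes, which are not part of this page.
sample_bytes = "stand-in for the bytes of data.tar.gz"

# Same two-level layout as checksums.yaml: algorithm, then file name.
checksums = {
  "SHA256" => { "data.tar.gz" => Digest::SHA256.hexdigest(sample_bytes) },
  "SHA512" => { "data.tar.gz" => Digest::SHA512.hexdigest(sample_bytes) }
}

puts checksum_matches?(sample_bytes, checksums, "data.tar.gz")        # true
puts checksum_matches?(sample_bytes + "x", checksums, "data.tar.gz")  # false
```

With a real gem, the Hash would presumably come from loading `checksums.yaml` and the bytes from the unpacked `data.tar.gz`; both steps are left out here because the archives themselves are not shown.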

data/.gitignore CHANGED

@@ -44,3 +44,4 @@ overage/
 .claude/
 CLAUDE.md
+INTERNAL_DEV_LOG.md

data/CHANGELOG.md CHANGED

@@ -3,109 +3,132 @@
 > 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
+> ⚠️ **Interface change (since 0.9.7):**
+>
+> `SmarterJSON.process` / `SmarterJSON.process_file` now **always return an `Array`** of documents:
+>  — `[]` for no doc
+>  - `[doc]` for one doc
+>  - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs).
+Going forward this will be the supported interface.
+> ⚠️ We discourage the use of `process(input).first` / `[0]` because it silently drops potential additional documents
+>    Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
+## 0.9.9 (2026-06-07)
+- Much faster pure-Ruby parsing (the path used without the C extension) — roughly 3× on string-heavy data, ~2× on number-heavy, ~1.7× on object-heavy (on a YJIT-enabled Ruby). Parsed values are unchanged.
+## 0.9.8 (2026-06-06 unreleased)
+- Faster parsing of string-heavy arrays — Parsed values are unchanged.
+## 0.9.7 (2026-06-05 unreleased)
+- **Breaking: `process` / `process_file` now always return an `Array` of documents** — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several. (Previously polymorphic: `nil` / the value / an `Array`.) The document count is now unambiguous, and any result can be iterated uniformly.
+- **New `SmarterJSON.process_one(input)`** — the single-document accessor for the common case: returns the one document's value (or `nil`), and *warns* (never raises) if the input held more than one. Takes a String or an IO; for an IO it is bounded-memory (parses just the first document). Reaching for `.first` / `[0]` on a `process` result silently drops extra documents — use `process_one` instead.
+- The **block form now returns the document count** (was `nil`): `n = SmarterJSON.process(io) { |doc| ... }`.
+- **The top level is stricter, which keeps the LLM-wrapper recovery working:** a top-level value must be a recognized JSON value (number / `true` / `false` / `null` / quoted string / object / array) or an implicit-root object (`host: localhost`). A bare top-level run — `localhost`, `1 2 3`, the typo `flase` — now raises `ParseError` instead of becoming a quoteless string. A space is never a document separator (`1 2 3` raises rather than splitting into three). In-container quoteless strings (`[red green blue]`, `host: localhost`) are unchanged.
+## 0.9.6 (2026-06-04 unreleased)
+- Faster `decimal_precision: :float` parsing of full-precision decimal numbers (around 17–18 significant digits — e.g. coordinate data and scientific output). Parsed values are unchanged: still correctly rounded, bit-for-bit identical to `JSON.parse`.
+## 0.9.5 (2026-06-04 unreleased)
+- Faster `decimal_precision: :float` parsing of very high-precision decimal numbers (more than ~17 significant digits). Parsed values are unchanged.
+- Faster parsing of object-heavy and compact documents — less per-element overhead in the C parser. No behavior change.
+## 0.9.4 (2026-06-04 unreleased)
+- Internal performance experiments. No user-facing changes.
+## 0.9.3 (2026-06-03)
+- Renamed the `bigdecimal_load:` option to `decimal_precision:` (same values: `:auto`, `:float`, `:bigdecimal`).
+- Invalid option *values* now raise `ArgumentError` with a clear message instead of being silently ignored. Unknown option keys are still ignored.
+- Faster parsing of pretty-printed (indented) input.
+- Removed the `duplicate_key: :raise` option — it conflicted with SmarterJSON's lenient design. `duplicate_key:` now accepts `:last_wins` (default) and `:first_wins`; repeated keys are still reported through `on_warning`.
 ## 0.9.2 (2026-06-03)
-- **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
+- Fixed a performance regression that slowed parsing of large documents.
 ## 0.9.1 (2026-06-03 unreleased)
-- **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
-- **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned — the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
-- Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
-- Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
+- Fixed a major performance regression on real-world data that contained markdown fences or `<json>` markers inside ordinary string values.
+- Streaming: a document is no longer cut off early when a comment / quote marker falls across a read-chunk boundary.
 ## 0.9.0 (2026-06-03 unreleased)
-- performance improvements
-- code cleanup
+- Performance improvements and code cleanup.
 ## 0.8.0 (2026-06-03)
 - **Robustness** against LLM-generated / wrapped JSON:
   - strips markdown code fences (```json / ```)
-  - ignores obvious prefix / suffix prose around a payload
+  - ignores leading / trailing prose around a JSON payload
   - unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`
-  - preserves multiple recovered payloads as an `Array`
-  - supports pretty-printed multi-line document framing on IO / block input
-  - **Warnings** now cover wrapper recovery too (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
-  - **No truncation recovery**: truncated / unterminated input still raises `SmarterJSON::ParseError`
+  - returns multiple recovered payloads as an `Array`
+  - parses pretty-printed multi-line documents from IO / block input
+  - reports each recovery through `on_warning` (`:code_fence_stripped`, `:prefix_text_ignored`, `:suffix_text_ignored`, `:wrapper_tag_stripped`)
+- Truncated / unterminated input still raises `SmarterJSON::ParseError` — SmarterJSON does not guess at missing data.
 ## 0.7.0 (2026-06-03)
-- **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` tuple return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as the parser applies a lenient fix; `process` / `process_file` now always return the bare value (nil / value / Array) on every path. Unlike the tuple, this also fires on the streaming block form. The default (no handler) records nothing and costs nothing.
+- **Breaking:** replaced the `warnings:` option (and its `[result, warnings]` return) with an `on_warning:` callable. Pass `on_warning: ->(w) { ... }` to be handed each `SmarterJSON::Warning` as a lenient fix is applied; `process` / `process_file` now always return just the value, including on the streaming block form. The default (no handler) records nothing and costs nothing.
 ## 0.6.0 (2026-06-02)
-- Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`), on both the C and Ruby paths. No null is inserted for an empty slot.
-- A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}` (both paths).
-- New opt-in `warnings:` option. With `warnings: true`, `process` / `process_file` return `[result, warnings]`, where `warnings` is an Array of `SmarterJSON::Warning` (`type`, `message`, `line`, `col`) recording the lenient fixes applied — `:empty_slot`, `:empty_value`, `:duplicate_key`. Default off; works on both paths.
-- Fixed a pure-Ruby bug where a mantissa-less exponent token (e.g. `-e695881`) was read as `0.0`; it is now a quoteless string, matching the C path.
-- Fixed a pure-Ruby bug where a `\u` escape whose next bytes split a multibyte character leaked `ArgumentError`; it now raises `SmarterJSON::ParseError`.
-- Added a property/fuzz test suite that checks C/Ruby parity and round-tripping on generated, mutated, and random input.
+- Lenient comma handling: empty slots around / between commas are collapsed (`[1,,2]` → `[1,2]`, `[,1,]` → `[1]`, `{a:1,,b:2}` → `{a:1,b:2}`). No null is inserted for an empty slot.
+- A key with a colon but no value reads as null: `{a:}` → `{"a"=>nil}`.
+- New opt-in `warnings:` option recording the lenient fixes applied — `:empty_slot`, `:empty_value`, `:duplicate_key`. (Superseded by `on_warning:` in 0.7.0.)
 ## 0.5.2 (2026-06-01) yanked
-- `generate` now supports pretty-printing via the `indent:` option (spaces per nesting level; default `0` = compact). Empty objects/arrays stay inline; `indent:` combined with `format: :ndjson` raises `ArgumentError`.
-- `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII as `\uXXXX`, astral chars as surrogate pairs), and `script_safe:` (escape `</` and U+2028/U+2029 for safe embedding in an HTML `<script>` tag).
-- `generate` adds opt-in `coerce:` — when `true`, a value that isn't natively supported (e.g. `Time`, `Date`, app objects) is converted via its own `as_json` (result re-emitted) or `to_json` (spliced); strict-by-default still raises `GenerateError`.
+- `generate` supports pretty-printing via the `indent:` option (spaces per nesting level; default compact). Combining `indent:` with `format: :ndjson` raises `ArgumentError`.
+- `generate` adds `sort_keys:` (emit object keys in sorted order), `ascii_only:` (escape non-ASCII), and `script_safe:` (escape `</` and U+2028/U+2029 for safe embedding in an HTML `<script>` tag).
+- `generate` adds opt-in `coerce:` — convert an otherwise-unsupported value (e.g. `Time`, `Date`, app objects) via its own `as_json` / `to_json`; strict-by-default still raises `GenerateError`.
 ## 0.5.1 (2026-06-01) yanked
-- Unified the error classes under a single `SmarterJSON::Error` base: `ParseError` and `EncodingError` now inherit from it, and `generate` raises a new `GenerateError`. `rescue SmarterJSON::Error` now catches everything the gem raises.
-- Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS).
-- Fixed the C extension build on Ruby 2.6 (declare `rb_hash_bulk_insert`, which 2.6 exports but does not declare in its headers); set the minimum Ruby to 2.6.
+- Unified the error classes under a single `SmarterJSON::Error` base: `ParseError`, `EncodingError`, and the new `GenerateError` all inherit from it, so `rescue SmarterJSON::Error` catches everything the gem raises.
+- Added a CI test matrix (Ruby 2.6–4.0 + head, on Ubuntu and macOS); minimum Ruby is now 2.6.
 ## 0.5.0 (2026-05-31 unreleased)
-- add JSON generation, incl. NDJSON generation
-- add test coverage
+- Added JSON generation, including NDJSON.
+- Added test coverage.
 ## 0.4.0 (2026-05-31 unreleased)
-- rename `flex_json` -> `smarter_json`
+- Renamed the gem `flex_json` → `smarter_json`.
 ## 0.3.10 (2026-05-31 unreleased)
-- change interface to use `.process` and `.process_file`
+- Changed the interface to `.process` and `.process_file`.
 ## 0.3.9 (2026-05-31 unreleased)
-- `parse` (no block) now handles any input automatically: 0 documents (empty / whitespace / comment-only) → `nil`, 1 document → the value itself, 2+ documents (NDJSON / JSONL / concatenated / whitespace-separated) → an Array of the values. It no longer raises on trailing content.
-- Detection is free (the same trailing-content check that used to raise) and the single-document path allocates no Array, so single-value parsing is unchanged in speed.
-- The block form (`parse(input) { |doc| … }`) is kept as the bounded-memory streaming path. `parse_file(path) { |doc| … }` now forwards the block too, so files stream the same way (previously the block was silently ignored). Bracketless comma lists (`1, 2, 3`) still raise — commas don't separate top-level documents (implicit-root array remains unsupported).
-- The block form allows individual processing of each line in NDJSON files.
-- Supersedes the earlier "raise on trailing content, match Oj" behavior.
+- `process` with no block now handles any input automatically: 0 documents (empty / whitespace / comment-only) → `nil`, 1 document → the value itself, 2+ documents (NDJSON / JSONL / concatenated) → an `Array`. It no longer raises on trailing content.
+- The block form (`process(input) { |doc| … }`) streams documents with bounded memory; `process_file` forwards the block too, so each line of an NDJSON file can be processed individually.
 ## 0.3.8 (2026-05-30 unreleased)
-- Reordered single-character checks so the more common byte is tested first (`-` before `+`).
-- Quoteless-token boundary scan now uses a 256-byte class table: ordinary bytes are classified in one table lookup, and the lookahead byte is read only at a `#`/`/` instead of on every byte. Speeds up quoteless / config-style input (the lenient case the JSON benchmarks don't exercise).
+- Performance improvements (quoteless / config-style input).
 ## 0.3.7 (2026-05-30 unreleased)
-- Escaped-string literal runs are bulk-copied with the NEON scanner instead of one byte at a time.
-- Added branch hints (`__builtin_expect`) and prefetch to the hot string-scan loop. Sped up string-heavy files (string_array, github_events, twitter all 12–16% faster).
+- Performance improvements (string-heavy input).
 ## 0.3.6 (2026-05-30 unreleased)
-- Fast path for plain numbers inside objects/arrays (`fj_try_member_number`): one scan straight from the cursor, committing when the number meets a delimiter and falling back to the quoteless scanner otherwise. Skips the quoteless boundary scan + classify dispatch for the common case. Broad gains on number-in-container files (weather, canada, usgs, big_decimals).
+- Performance improvements (numbers inside objects / arrays).
 ## 0.3.5 (2026-05-30 unreleased)
-- Rewrote `fj_parse_number` (top-level numbers) as a single pass: finds the token end and accumulates the mantissa/exponent at once, using the string's NUL terminator as a scan sentinel (no per-byte bounds check) and a digit loop that skips the underscore check until an underscore actually appears.
-- Added `fj_try_decimal` for the quoteless path: validates and extracts the number in one scan, replacing the old three scans (validate + significant-digit count + mantissa extraction); skips the significant-digit scan when the number has ≤16 digits.
-- Both number paths now build values through the shared `fj_int_from_parts` / `fj_float_from_parts` helpers so they can't drift; removed the now-dead `fj_validate_decimal` / `fj_int_value` / `fj_decimal_value`.
+- Performance improvements (number parsing).
 ## 0.3.4 (2026-05-30 unreleased)
-- Dropped a per-member Ruby method call (`key?`) that fired for every object member under the default duplicate-key mode — pure waste on object-heavy files (twitter, github_events, citm).
-- Build objects and arrays from a C value stack with a pre-sized hash + bulk insert (and size-based duplicate detection), instead of inserting one member/element at a time.
-- Added a per-parse key cache so repeated object keys are interned once instead of every occurrence.
+- Performance improvements (object-heavy input).
 ## 0.3.3 (2026-05-30 unreleased)
-- Vendored Ryū (Ulf Adams, Apache-2.0) for correctly-rounded string→double conversion: the mantissa is accumulated in one pass and converted with no `strtod`. Large win on float-heavy files (canada, big_decimals).
+- Faster, correctly-rounded float parsing.
 ## 0.3.3 (2026-05-29 unreleased)
-- performance fixes
+- Performance fixes.
 ## 0.3.2 (2026-05-29 unreleased)
-- performance fixes
+- Performance fixes.
 ## 0.3.1 (2026-05-29 unreleased)
-- performance fixes
+- Performance fixes.
 ## 0.3.0 (2026-05-29 unreleased)
-- iterative parser
+- Iterative parser.
 ## 0.2.0 (2026-05-29 unreleased)
-- recursive parser
+- Recursive parser.
 ## 0.1.1 (2026-05-29 unreleased)
-- MVP complete
+- MVP complete.
 ## 0.1.0 (2026-05-28 unreleased)
-- Initial Ruby version
+- Initial Ruby version.
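
The 0.9.7 entry in this changelog says the document count is "now unambiguous" once `process` always returns an `Array`. The sketch below illustrates why the earlier polymorphic shape (`nil` / the value / an `Array`, as described in the 0.3.9 entry) could not tell two cases apart. It does not call the gem: `polymorphic_result` and `uniform_result` are hypothetical stand-ins that take an already-parsed list of documents and only model the two return shapes.

```ruby
# Hypothetical stand-in for the pre-0.9.7 return shape; not the gem's code.
# 0 documents -> nil, 1 document -> the value itself, 2+ -> an Array.
def polymorphic_result(docs)
  case docs.length
  when 0 then nil
  when 1 then docs.first
  else docs
  end
end

# Hypothetical stand-in for the 0.9.7+ shape: always the Array of documents.
def uniform_result(docs)
  docs
end

one_array_doc   = [[1, 2]] # one document whose value is the array [1, 2]
two_number_docs = [1, 2]   # two documents: 1 and 2

# The old shape yields the same value for both inputs, so the caller
# cannot recover how many documents there were.
p polymorphic_result(one_array_doc)    # [1, 2]
p polymorphic_result(two_number_docs)  # [1, 2]

# The uniform shape keeps the count visible.
p uniform_result(one_array_doc).length    # 1
p uniform_result(two_number_docs).length  # 2
```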

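The 0.9.2 entry removed in the changelog diff above describes replacing a `\A`-anchored `match?` with `String#start_with?` taking a regex. A minimal sketch of the two forms follows, to show that they answer the same question ("does the input begin with a label?"). The patterns are assumptions chosen for illustration, since the gem's actual regex does not appear in this diff, and the sketch makes no claim about the timings quoted in that entry.

```ruby
# Hypothetical patterns for a leading label such as `JSON: {`.
# The gem's real regex is not shown in this diff.
ANCHORED_LABEL   = /\AJSON:\s*\{/
UNANCHORED_LABEL = /JSON:\s*\{/

# Form described as the old one: match? with an explicit \A anchor.
def leading_label_via_match?(input)
  input.match?(ANCHORED_LABEL)
end

# Form described as the new one: start_with? only matches at the beginning
# of the string, so the pattern needs no anchor (regex argument: Ruby 2.5+).
def leading_label_via_start_with?(input)
  input.start_with?(UNANCHORED_LABEL)
end

labelled = 'JSON: {"a": 1}'       # label at the very start
embedded = '{"note": "JSON: {"}'  # same text, but inside a string value

p leading_label_via_match?(labelled)       # true
p leading_label_via_start_with?(labelled)  # true
p leading_label_via_match?(embedded)       # false
p leading_label_via_start_with?(embedded)  # false
```
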
data/README.md CHANGED

@@ -2,27 +2,62 @@
 ![Gem Version](https://img.shields.io/gem/v/smarter_json) [![codecov](https://codecov.io/gh/tilo/smarter_json/branch/main/graph/badge.svg)](https://codecov.io/gh/tilo/smarter_json) <!-- [![Downloads](https://img.shields.io/gem/dt/smarter_json)](https://rubygems.org/gems/smarter_json) --> [![RubyGems](https://img.shields.io/badge/RubyGems-smarter__json-brightgreen?logo=rubygems&logoColor=white)](https://rubygems.org/gems/smarter_json) [![Ruby Toolbox](https://img.shields.io/badge/Ruby%20Toolbox-smarter__json-brightgreen)](https://www.ruby-toolbox.com/projects/smarter_json)
-A lenient, fast JSON parser for Ruby. It parses strict JSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write — and in benchmarks it matches or beats Oj on nearly every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Other parsers are strict - they stop at the first deviation - SmarterJSON keeps going - it optimizes for getting your data out, not for policing the JSON spec.
+A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write — and in benchmarks it matches or beats Oj on every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Traditional JSON parsers are strict - they stop at the first deviation - SmarterJSON keeps going - it optimizes for getting your data out, not for policing the JSON spec.
-> **SmarterJSON: one parser, no modes — want strict? Please use the stdlib `json` gem.**
+> **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.**
 ## Why SmarterJSON?
-Most JSON parsers reject anything that isn't perfectly strict JSON. SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.** Give it strict JSON, JSON5, an HJSON-style config file, newline-delimited JSON, or a copy-pasted blob with comments and trailing commas — it just parses it. When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma). A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
+**Are you tired of seeing errors like these?**
+```
+    ERROR running JSON.parse (stdlib) on deeply_nested.json: JSON::NestingError: nesting of 101 is too deep
+    ERROR running Oj.load (default) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
+    ERROR running Oj.load (strict, float) on config.json5: Oj::ParseError: unexpected character (after [0]) at line 5, column 6 [parse.c:931]
+    ERROR running Oj.load (compat) on config.json5: EncodingError: unexpected character (after [0]) at line 5, column 6 [parse.c:931] in '// JSON5 config sample — leni…
+    ERROR running JSON.parse (stdlib) on config.json5: JSON::ParserError: expected object key, got 'id:' at line 4 column 5
+    ERROR running Yajl::Parser (yajl-ruby) on config.json5: Yajl::ParseError: lexical error: invalid char in json text. this. */ [ // record 0 { id: 0, name: 'alpha-0', mask: 0 (…
+    ERROR running Oj.load (default) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
+    ERROR running Oj.load (strict, float) on github_events_100k.ndjson: Oj::ParseError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870]
+    ERROR running Oj.load (compat) on github_events_100k.ndjson: EncodingError: unexpected characters after the JSON document (after ) at line 2, column 1 [parse.c:870] in '{"id":"…
+    ERROR running JSON.parse (stdlib) on github_events_100k.ndjson: JSON::ParserError: unexpected token at end of stream '{"id":"34816047161","type":"Dele' at line 1 column 1
+    ERROR running Yajl::Parser (yajl-ruby) on github_events_100k.ndjson: Yajl::ParseError: Found multiple JSON objects in the stream but no block or the on_parse_complete callback was
+  assigne…
+```
+**Do you have no control of the input quality?**
+Traditional JSON parsers reject anything that isn't perfectly strict JSON. That means your code breaks on malformed data.
+SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.**
+Give it strict JSON, NDJSON, JSON5, an HJSON-style config file, LLM-generated JSON, or a copy-pasted blob with comments and trailing commas — it just extracts the data from it.
+When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma).
+A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
 > For an ingestion tool, "reject the whole document because of one stray comma" is the worst outcome: you throw away the 99% that's fine to avoid maybe-mishandling a gap that carries no data anyway.
 Three things set it apart:
-1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
+1. **One tool, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure it to match your input; it adapts to whatever you give it.
-2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
+2. **It extracts every document from multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: it always returns an `Array` of the documents found (`[]` / `[doc]` / `[d1, d2, …]`). For the common single-document case, `SmarterJSON.process_one` returns the one value directly (and warns, never raises, if there was more than one). The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
-3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
+3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) matches or beats Oj on every file we benchmark, and is competitive with the stdlib `json` C parser — among the fastest general-purpose JSON processors in Ruby.
 ## What it accepts, beyond strict JSON
-- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` parses as a string, not a truncated value)
+- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
 - Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
 - Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
 - Implicit root object — a config file that starts with `key: value`, no outer `{}`
@@ -31,7 +66,19 @@ Three things set it apart:
 - Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
 - Duplicate keys (last value wins by default; configurable)
-It raises only on genuinely unparseable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
+It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
+### Format references
+The lenient grammar is a superset of these human-JSON specs — listed once, here:
+* [JSON5](https://json5.org/)
+* [HJSON](https://hjson.github.io/)
+* [JWCC / HuJSON](https://github.com/tailscale/hujson)
+* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
+* [JSONH](https://github.com/jsonh-org/Jsonh)
+* [JSONC (VS Code)](https://jsonc.org/)
+* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
 ## Installation
@@ -44,13 +91,48 @@ gem "smarter_json"
 gem install smarter_json
 ```
-The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
+The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby implementation runs instead and produces identical results.
+## Usage
+Pass a String of JSON content or an IO; you get back the extracted data. The same call handles strict JSON, JSON5, and HJSON-style config — there are no modes or flags.
+```ruby
+require "smarter_json"
+SmarterJSON.process('{"a": 1, "b": [2, 3]}')       # => [{"a"=>1, "b"=>[2, 3]}]   (always an Array of documents)
+SmarterJSON.process_one('{"a": 1, "b": [2, 3]}')   # => {"a"=>1, "b"=>[2, 3]}     (the one document's value)
+SmarterJSON.process_file("config.json5")            # read a file, then process
+```
+**Prefer `process`.** It always returns an `Array`, so the document count is explicit and you never silently drop one. Reach for `process_one` when you want just the single document's value — it *warns* (never raises) if the input turns out to hold more than one, so an unexpected extra document is surfaced, not dropped.
+## Usage in APIs
+At an API boundary the JSON comes from someone you don't control — a client POSTing a request body to *your* service, or an upstream service answering a call *you* made — and it isn't always clean: a stray trailing comma, a `NaN`, a payload wrapped in prose, or a quiet change to the format. A strict parser turns any of those into an exception (a request you reject, or a failed call chain). SmarterJSON extracts the data that's there instead, so one formatting quirk doesn't sink the whole request:
+```ruby
+# Inbound — JSON a caller sent to your endpoint:
+data = SmarterJSON.process(request.body)
+# Outbound — JSON from a service you called:
+data = SmarterJSON.process(response.body)
+```
+What that buys you:
+* fewer "random production crashes" from messy JSON on either side of the wire
+* resilience when a caller or a provider changes its output
+* the option to log and recover, instead of rejecting the request outright
+* consistent handling of edge-case payloads
+See [Examples](#examples) below for multi-document input, streaming, and recovering JSON from LLM / markdown noise.
-## API stability and thread safety
+## Stable interface & thread safety
-The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
+The public interface is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
-Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
+Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls.
 ## Documentation
@@ -60,106 +142,167 @@ Concurrent calls are safe. The parser/generator keep per-call state local, and t
   * [Configuration Options](docs/options.md)
   * [Examples](docs/examples.md)
-## Usage
+### Warnings (`on_warning`)
+When SmarterJSON quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
 ```ruby
-require "smarter_json"
+# Collect them all:
+warns = []
+data  = SmarterJSON.process(input, on_warning: ->(w) { warns << w })
-SmarterJSON.process('{"a": 1, "b": [2, 3]}')          # => {"a"=>1, "b"=>[2, 3]}
-SmarterJSON.process("host: localhost\nport: 5432")     # => {"host"=>"localhost", "port"=>5432}  (no braces needed)
-SmarterJSON.process_file("config.json5")               # read a file, then parse
+# Or route them — log, count, raise:
+SmarterJSON.process(input, on_warning: ->(w) { Rails.logger.warn(w) })
+```
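As an illustration of the "log, count, raise" routing mentioned above, here is a minimal sketch of a handler that tallies warnings by type. It runs without the gem: `FakeWarning` is a hypothetical stand-in that carries only the four fields named in the README (`type`, `message`, `line`, `col`), and the message texts are made up for the example.

```ruby
# Hypothetical stand-in for SmarterJSON::Warning, limited to the four
# fields the README names. The real class may carry more.
FakeWarning = Struct.new(:type, :message, :line, :col)

counts = Hash.new(0)   # tally per warning type
log    = []            # formatted lines, e.g. for a logger

handler = lambda do |w|
  counts[w.type] += 1
  log << "json warning #{w.type} at #{w.line}:#{w.col}: #{w.message}"
end

# In real use the handler would be passed as:
#   SmarterJSON.process(input, on_warning: handler)
# Here it is driven by hand with invented warnings.
[
  FakeWarning.new(:empty_slot,    "example message", 1, 5),
  FakeWarning.new(:duplicate_key, "example message", 2, 3),
  FakeWarning.new(:empty_slot,    "example message", 4, 9)
].each { |w| handler.call(w) }

puts counts.inspect
puts log
```

Because the handler is an ordinary callable, the same lambda can be reused across calls, or swapped for one that raises on a specific `type` when a given fix should be treated as fatal.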
-# Multiple documents (NDJSON / JSONL / concatenated) — no block, no special method:
-SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3}))   # => [{"id"=>1}, {"id"=>2}, {"id"=>3}]
-SmarterJSON.process('{"id":1}')                         # => {"id"=>1}   (one document → the value itself)
-SmarterJSON.process("")                                 # => nil          (zero documents)
-# For input larger than memory, stream one document at a time with a block
-# (process and process_file both forward the block):
-SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
+## Performance
-# Wrapper noise is stripped automatically:
-SmarterJSON.process(<<~TEXT)
-  Here is the JSON:
+SmarterJSON is a C extension (with a pure-Ruby fallback that runs everywhere). Before the speed table, the part that isn't a "× faster" — **things the other parsers can't do at all:**
-  ```json
-  {
-    "a": 1
-  }
-  ```
-TEXT
-# => {"a"=>1}
+- **stdlib `json` can't parse deeply nested data by default.** It caps nesting at 100 levels (its `max_nesting` option) and raises beyond that; SmarterJSON has no depth limit (iterative parser, bounded only by memory).
+- **None of the others read NDJSON / JSONL / concatenated input in a single call.** Oj, `json`, and Yajl each raise on the second document. Only SmarterJSON's `process` returns every document as an `Array`.
+- **None of the others parse JSON5, HJSON-style config, or LLM-wrapped output.** Comments, trailing commas, unquoted keys, quoteless values, `'single quotes'`, markdown code fences, prose wrappers — all raise in Oj / `json` / Yajl; SmarterJSON parses them.
+- **`json` and Yajl produce `Float` only — lossy on high-precision numbers.** On coordinate / scientific data (>16 significant digits) they silently round to `Float`, so they aren't a like-for-like comparison there. SmarterJSON (and Oj) keep full precision as `BigDecimal` by default.
-SmarterJSON.process(<<~TEXT)
-  Here is the result:
+Where a like-for-like comparison exists, here is SmarterJSON's C path against each parser. **Apple M4, Ruby 3.4.7, p10 of 40 runs.** Each cell is **SmarterJSON vs that parser** — "faster" means SmarterJSON wins. Ratios shift with hardware; run `rake report` in `json_benchmarks/` to reproduce.
-  {
-    "a": 1
-  }
+| File                          | vs Oj/strict    | vs `json`                    | vs Yajl         |
+| ----------------------------- | --------------- | ---------------------------- | --------------- |
+| big_decimals <sup>≠</sup>     | **1.8× faster** | **1.1× faster**              | **1.3× faster** |
+| canada <sup>≠</sup>           | **8× faster**   | 1.1× slower                  | **2.2× faster** |
+| citm_catalog                  | **1.6× faster** | 1.2× slower                  | **4.8× faster** |
+| citylots <sup>≠</sup>         | **3.6× faster** | **2.0× faster**              | **2.3× faster** |
+| config.jsonc                  | **1.1× faster** | 1.5× slower                  | **3.7× faster** |
+| deeply_nested                 | **1.4× faster** | **can't parse** <sup>‡</sup> | **5.1× faster** |
+| github_events                 | **1.2× faster** | ≈ tied                       | **3.1× faster** |
+| string_array                  | ≈ tied          | ≈ tied                       | **1.6× faster** |
+| twitter                       | **1.4× faster** | 1.3× slower                  | **3.5× faster** |
+| usgs_earthquakes <sup>≠</sup> | **1.3× faster** | 1.5× slower                  | **3.6× faster** |
+| weather_berlin                | **1.9× faster** | 1.1× slower                  | **3.5× faster** |
-  Hope this helps.
-TEXT
-# => {"a"=>1}
+<sup>≠</sup> High-precision file. The row uses `decimal_precision: :float` (Float, like-for-like) for `canada` / `citylots` / `big_decimals` / `usgs_earthquakes`. SmarterJSON's **default** `:auto` keeps these decimals as `BigDecimal` (no precision loss, like Oj's default), which is intrinsically slower than `Float`, so default-vs-`Float` would be apples-to-oranges. Against Oj's matching `BigDecimal` default, SmarterJSON is faster there too.
+<sup>‡</sup> Not a measurement gap — `json` raises by default: it errors on multi-document / NDJSON input without a block, and caps nesting at 100 levels. SmarterJSON has neither limit.
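For readers who want a feel for what "p10 of 40 runs" means before running `rake report`, here is a rough sketch using only the standard library. It is not the project's benchmark harness: the nearest-rank percentile definition, the generated payload, and the use of stdlib `JSON.parse` as the parser under test are all assumptions made for illustration.

```ruby
require "json"
require "benchmark"

RUNS = 40

# Nearest-rank percentile (an assumed definition; the project's report
# may compute it differently).
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil   # 1-based rank
  sorted[[rank, 1].max - 1]
end

# Illustrative payload, not one of the corpus files.
payload = JSON.generate(
  "items" => Array.new(1_000) { |i| { "id" => i, "name" => "item-#{i}" } }
)

# Time the same parse RUNS times, then take the 10th percentile so that
# slow outliers (GC pauses, scheduler noise) do not dominate the figure.
timings = Array.new(RUNS) { Benchmark.realtime { JSON.parse(payload) } }
p10 = percentile(timings, 10)

puts format("p10 of %d runs: %.6f s", RUNS, p10)
```

A low percentile such as p10 reports a near-best-case time, which tends to be more stable across runs than the mean.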
-SmarterJSON.process("<json>{\"a\":1}</json>")
-# => {"a"=>1}
+In short: **matches or beats Oj/strict on every file** (`string_array` is the one wash: within ~10% and hardware-dependent, with SmarterJSON edging ahead on an M1 and Oj edging ahead on an M4), **far faster than Yajl everywhere, and within roughly 1.5× of stdlib `json` in either direction on a like-for-like basis** (ahead on `citylots` and `big_decimals`, behind on `config.jsonc` and `usgs_earthquakes`), while also parsing input that `json` and Oj reject outright. Floats are decoded with the **Eisel-Lemire** algorithm (fast_float), correctly rounded and **bit-for-bit identical to `JSON.parse`**: fast *and* exact, even at full double precision.
-SmarterJSON.process(<<~TEXT)
-  first attempt:
-  {"a":1}
+**Two notes on fair comparison:**
+- **NDJSON / multi-document:** only SmarterJSON reads it via plain `process` — Oj, `json`, and Yajl raise without a block. `process` collects every document into an `Array`; the block form streams one document at a time in bounded memory (use it for input larger than RAM).
+- **High-precision decimals (the <sup>≠</sup> files):** by default these load as `BigDecimal` (full precision, like Oj's default), which is intrinsically slower than `Float`. Pass `decimal_precision: :float` when you don't need `BigDecimal` precision: it runs at 3–6× the speed of the `:auto` default on coordinate/scientific data and gives a like-for-like `Float` comparison, in which SmarterJSON **beats stdlib `json`** on some files (e.g. `citylots`, ~2×).
-  corrected payload:
-  {"b":2}
-TEXT
-# => [{"a"=>1}, {"b"=>2}]
-```
 ### Options
 | option            | default      | meaning                                                                 |
 |-------------------|--------------|-------------------------------------------------------------------------|
 | `symbolize_keys`  | `false`      | return object keys as Symbols instead of Strings                        |
-| `duplicate_key`   | `:last_wins` | `:last_wins` / `:first_wins` / `:raise` for repeated keys in one object |
-| `bigdecimal_load` | `:auto`      | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
+| `duplicate_key`   | `:last_wins` | `:last_wins` / `:first_wins` for a key repeated in one object (every repeat is also reported via `on_warning`) |
+| `decimal_precision` | `:auto`      | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
 | `acceleration`    | `true`       | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
 | `encoding`        | `"UTF-8"`    | labels the input's encoding (no transcoding pass; see below)            |
 | `on_warning`      | `nil`        | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
-### Warnings (`on_warning`)
+## Examples
+### Lenient, config-style input
-When the parser quietly fixes something lenient — collapses an empty comma slot, reads a key with no value as `null`, drops a duplicate key, strips code fences, ignores wrapper prose, unwraps wrapper tags — it can tell you, without changing what `process` returns. Pass a callable as `on_warning:`; it is invoked once per fix with a `SmarterJSON::Warning` (`type`, `message`, `line`, `col`). It fires on every path, including the streaming block form. With no handler (the default) nothing is recorded and there is zero overhead.
+No outer braces needed — a file or string that starts with `key: value` is read as an implicit root object (HJSON-style):
 ```ruby
-# Collect them all:
-warns = []
-data  = SmarterJSON.process(input, on_warning: ->(w) { warns << w })
+SmarterJSON.process_one("host: localhost\nport: 5432")
+# => {"host"=>"localhost", "port"=>5432}
+```
-# Or route them — log, count, raise:
-SmarterJSON.process(input, on_warning: ->(w) { Rails.logger.warn(w) })
+### Multiple documents (NDJSON / JSONL / concatenated)
+`process` always returns an **`Array` of the documents** it found — `[]` for none, `[doc]` for one, `[d1, d2, …]` for several — with **no block and no special method**. The document count is unambiguous, and any result iterates uniformly:
+```ruby
+SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3}))   # => [{"id"=>1}, {"id"=>2}, {"id"=>3}]
+SmarterJSON.process('{"id":1}')                         # => [{"id"=>1}]   (one document, still an Array)
+SmarterJSON.process("")                                 # => []            (zero documents)
 ```
-## Performance
+For the common single-document case, **`process_one`** returns the one value directly — and *warns* (never raises) if the input held more than one, so you never silently drop a document:
+```ruby
+SmarterJSON.process_one('{"id":1}')   # => {"id"=>1}
+SmarterJSON.process_one("")           # => nil
+```
-Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpus (canada, citm_catalog, twitter, github_events, …). The apples-to-apples comparisons are **SmarterJSON/C** vs **Oj/strict** vs **stdlib `json`**, all producing `Float` (run `rake report` in `json_benchmarks/` for the full table — numbers vary run to run).
+> **Type-checking the result?** Use `result.is_a?(Array)`, not `result.class == Array` — it's the idiomatic Ruby test, and it stays correct if a future release returns a specialized `Array` subclass.
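The difference between the two checks is plain Ruby and can be seen without the gem. In this sketch `DocumentList` is a hypothetical `Array` subclass standing in for whatever a future release might return. Nothing in the README says such a class exists today.

```ruby
# Hypothetical specialized Array subclass, for illustration only.
class DocumentList < Array; end

result = DocumentList.new([{ "id" => 1 }])

subclass_aware = result.is_a?(Array)     # true: matches Array and its subclasses
exact_match    = result.class == Array   # false: the class is DocumentList

puts "is_a?(Array):   #{subclass_aware}"
puts "class == Array: #{exact_match}"
puts "iterates like an Array: #{result.map { |doc| doc["id"] }.inspect}"
```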
-- **vs Oj/strict** (the `JSON.parse`-equivalent mode, both producing `Float`): SmarterJSON/C is faster on nearly every file — typically **1.1–1.6×** (e.g. big_decimals ~1.6×, deeply-nested ~1.4×, citm / twitter / usgs ~1.3×, github / citylots / weather ~1.1–1.2×). The one exception is **string_array**, where Oj/strict's SIMD string scan is ~1.7× faster — that's the current frontier.
-- **vs stdlib `json` (C):** competitive with the fastest Ruby JSON parser — it ties `json` on big_decimals and string_array, and trails by ~1.1–1.7× on the rest. (`canada.json` is the outlier, far behind — that's the `BigDecimal` default, see below.)
-- **Numbers:** floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.
+A **top-level** value must be recognized JSON — a number, `true` / `false` / `null`, a quoted string, an object, an array — or an implicit-root object (`host: localhost`). A bare top-level run such as `localhost` or `1 2 3` raises `ParseError`. Quoteless string values *inside* objects and arrays (`{host: localhost}`, `[red green blue]`) are unchanged.
-**Two notes on fair comparison:**
+### Streaming large input with a block
+For input larger than memory, pass a block: each document is yielded as it is read and the method returns the **document count** instead of building an `Array`. Both `process` and `process_file` forward the block:
+```ruby
+SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
+```
+### Recovering JSON from LLM / markdown noise
+When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.)
+A fenced code block, as an LLM often returns:
+````ruby
+SmarterJSON.process_one(<<~TEXT)
+  Here is the JSON:
-- **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
-- **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
+  ```json
+  { "a": 1 }
+  ```
+TEXT
+# => {"a"=>1}
+````
+Explanatory prose before and/or after the payload is ignored:
+```ruby
+SmarterJSON.process_one(<<~TEXT)
+  Here is the result:
+  { "a": 1 }
+  Hope this helps.
+TEXT
+# => {"a"=>1}
+```
+`<json>...</json>` / `BEGIN_JSON ... END_JSON` wrapper tags are unwrapped:
+```ruby
+SmarterJSON.process_one('<json>{"a":1}</json>')
+# => {"a"=>1}
+```
+When one blob contains several recovered payloads, they come back as an `Array` (the same rule as multi-document input):
+```ruby
+SmarterJSON.process(<<~TEXT)
+  first attempt:
+  {"a":1}
+  corrected payload:
+  {"b":2}
+TEXT
+# => [{"a"=>1}, {"b"=>2}]
+```
 ## Encoding
-`encoding:` (default `"UTF-8"`) labels what the input is — it does **not** trigger a transcoding pass. The parser works on the bytes in their native encoding and emits string values with the same encoding tag, the same way `smarter_csv` handles encodings. Bytes that are invalid for the claimed encoding raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
+`encoding:` (default `"UTF-8"`) labels what the input is — it does **not** trigger a transcoding pass. SmarterJSON works on the bytes in their native encoding and emits string values with the same encoding tag, the same way `smarter_csv` handles encodings. Bytes that are invalid for the claimed encoding raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
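The distinction between labelling and transcoding can be shown with core `String` methods. This sketch does not call SmarterJSON at all. It only illustrates, under that assumption, what "same bytes, different encoding tag" means and why invalid bytes for a claimed encoding are detectable.

```ruby
# "café" as ISO-8859-1 bytes: the final 0xE9 is "é" in Latin-1.
raw = "caf\xE9".b

# Labelling: only the encoding tag changes, the bytes stay as they are.
labelled = raw.dup.force_encoding("ISO-8859-1")
puts "labelled bytes:   #{labelled.bytes.inspect}"

# Transcoding: a separate pass that rewrites the bytes for the target encoding.
transcoded = labelled.encode("UTF-8")
puts "transcoded bytes: #{transcoded.bytes.inspect}"

# A wrong label is detectable: 0xE9 on its own is not valid UTF-8.
mislabelled = raw.dup.force_encoding("UTF-8")
puts "valid as UTF-8?   #{mislabelled.valid_encoding?}"
```

A check along the lines of `valid_encoding?` is the kind of condition under which the README says `SmarterJSON::EncodingError` is raised. How the gem performs that check internally is not stated here.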
 ## Nesting & untrusted input
-Both the C extension and the pure-Ruby parser are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) parses without issue.
+Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue.
+The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially nested untrusted input can exhaust RAM rather than crash the process. If you process untrusted input and want a hard cap, an opt-in guard is planned; for now, enforce a size limit upstream.
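One way to size-limit upstream is a small pre-check run before the input reaches the parser. The sketch below is a hypothetical guard, not part of SmarterJSON: the method names and both limits are assumptions, and the scanner only understands strict JSON double-quoted strings, so it does not cover the lenient syntax (single quotes, comments). It also shows the explicit-stack idea on a small scale, since depth is tracked in an `Array` rather than through recursion.

```ruby
# Assumed limits for illustration; pick values that fit your service.
MAX_BYTES = 1_000_000
MAX_DEPTH = 512

# Measures bracket nesting with an explicit Array as the stack, so deep
# input never grows the Ruby call stack. Brackets inside double-quoted
# strings are ignored.
def max_nesting_depth(text)
  stack     = []
  deepest   = 0
  in_string = false
  escaped   = false

  text.each_char do |ch|
    if in_string
      if escaped
        escaped = false          # character after a backslash is literal
      elsif ch == "\\"
        escaped = true
      elsif ch == '"'
        in_string = false
      end
    elsif ch == '"'
      in_string = true
    elsif ch == "[" || ch == "{"
      stack.push(ch)
      deepest = stack.size if stack.size > deepest
    elsif ch == "]" || ch == "}"
      stack.pop
    end
  end

  deepest
end

# Hypothetical guard to call before handing input to the parser.
def within_limits?(text)
  text.bytesize <= MAX_BYTES && max_nesting_depth(text) <= MAX_DEPTH
end

puts max_nesting_depth('{"a": [1, {"b": "]]]"}]}')
puts within_limits?("[" * 1_000 + "]" * 1_000)
```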
-The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you parse untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream of the parser.
 ## Development