smarter_json 1.0.0 → 1.1.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +19 -9
- data/README.md +59 -4
- data/docs/basic_read_api.md +22 -0
- data/docs/examples.md +22 -0
- data/ext/smarter_json/smarter_json.c +69 -22
- data/lib/smarter_json/parser.rb +35 -0
- data/lib/smarter_json/version.rb +1 -1
- metadata +5 -5
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 480227c64ed99ba271fe95c6e7c79be6871f6bbf5c7f8f03d9bf118bb6bd7051
|
|
4
|
+
data.tar.gz: 966190c8f2e316e3664e381bf738831259b0e469a783f2fa3a25edf5f198d41b
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: d53d1a84aa5ce83a2243a9adcdcddc284933dc8c357e08c8aaaec8e5e005164dfdec37d70c41bce0b2a3abf22dc7902e062adbbe3ffaa947b2232836bd01a86c
|
|
7
|
+
data.tar.gz: ae44f83437fe6c75227822a234658abb2157caf0378c8b998bab25aa40b4f1286da9a85a482be04b1be4ab17b142779140d0892ada81ae7d4342a0b7f706e9b7
|
data/CHANGELOG.md
CHANGED
|
@@ -1,22 +1,32 @@
|
|
|
1
1
|
|
|
2
2
|
# SmarterJSON Change Log
|
|
3
3
|
|
|
4
|
-
> ⚠️ **
|
|
5
|
-
>
|
|
6
|
-
> SmarterJSON **always returns an `Array`** of documents.
|
|
4
|
+
> ⚠️ SmarterJSON **always returns an `Array`** of documents.
|
|
7
5
|
>
|
|
8
|
-
> `SmarterJSON.process` / `SmarterJSON.process_file`
|
|
9
|
-
>
|
|
10
|
-
>
|
|
11
|
-
>
|
|
12
|
-
>
|
|
6
|
+
> `SmarterJSON.process` / `SmarterJSON.process_file`
|
|
7
|
+
> both return:
|
|
8
|
+
> - `[]` for no doc
|
|
9
|
+
> - `[doc]` for one doc
|
|
10
|
+
> - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
|
|
13
11
|
|
|
14
12
|
> ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because they silently drop any additional documents.
|
|
15
13
|
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
|
|
16
14
|
|
|
15
|
+
## 1.1.1 (2026-06-11)
|
|
16
|
+
|
|
17
|
+
RSpec tests: 1,070 → 1,097
|
|
18
|
+
|
|
19
|
+
- The C extension now emits the same `on_warning` warnings as the pure-Ruby parser. `empty_value` and `duplicate_key` warnings name the offending key (and `duplicate_key` names the resolution strategy), and the warning text, line, and column are now identical whether or not the C extension is loaded.
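As a rough illustration of the duplicate-key behaviour described in this entry, the sketch below builds a Hash from interleaved key/value slots and reports each duplicate together with the resolution strategy. It is plain Ruby, not the gem's code: `build_object`, its keyword arguments, and the string-only warning are assumptions made for the example (the real `on_warning` handler is handed a warning object carrying a type, message, line, and column). The message format follows the one visible in the C extension diff.

```ruby
# Hypothetical sketch, not the gem's implementation.
# `pairs` is a flat list of interleaved key, value slots.
def build_object(pairs, strategy: :last_wins, on_warning: nil)
  hash = {}
  pairs.each_slice(2) do |key, value|
    if hash.key?(key)
      # Name the offending key and the strategy that resolves the clash.
      on_warning&.call("duplicate key #{key.inspect} \u2014 #{strategy}")
      next if strategy == :first_wins # keep the earlier value
    end
    hash[key] = value # :last_wins overwrites the earlier value
  end
  hash
end

warnings = []
pairs = ["a", 1, "b", 2, "a", 3]

last  = build_object(pairs, on_warning: ->(msg) { warnings << msg })
first = build_object(pairs, strategy: :first_wins)

puts last.inspect
puts first.inspect
puts warnings.inspect
```

With no handler the sketch does no extra work beyond the `key?` check, which loosely mirrors the "silent unless asked" behaviour the README describes.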
|
|
20
|
+
|
|
21
|
+
## 1.1.0 (2026-06-09)
|
|
22
|
+
|
|
23
|
+
RSpec tests: 1,038 → 1,070
|
|
24
|
+
|
|
25
|
+
- New `SmarterJSON.foreach(source)` — the streaming, composable sibling of `process_file`. `source` is a file path or an IO (a socket, `StringIO`, open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, never loading the whole file, so a large NDJSON / JSONL stream can be filtered or transformed with `.select` / `.map` / `.lazy` / `.first`; with a block it streams and returns the document count, like `process_file`.
|
|
26
|
+
|
|
17
27
|
## 1.0.0 (2026-06-08)
|
|
18
28
|
|
|
19
|
-
RSpec tests: 1,
|
|
29
|
+
RSpec tests: 1,038
|
|
20
30
|
|
|
21
31
|
- **The public interface is now stable** — `process`, `process_one`, `process_file`, `generate`, and the documented options; semantic versioning from here on.
|
|
22
32
|
- Unknown or wrongly-typed options now raise `ArgumentError` instead of being silently ignored, so a typo (e.g. `symbolize_names:` instead of `symbolize_keys:`) is caught immediately.
|
data/README.md
CHANGED
|
@@ -2,12 +2,25 @@
|
|
|
2
2
|
|
|
3
3
|
 [](https://codecov.io/gh/tilo/smarter_json) <!-- [](https://rubygems.org/gems/smarter_json) --> [](https://rubygems.org/gems/smarter_json) [](https://www.ruby-toolbox.com/projects/smarter_json)
|
|
4
4
|
|
|
5
|
-
A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write — and in benchmarks it matches or beats Oj on every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Traditional JSON parsers are strict - they stop at the first deviation - SmarterJSON keeps going - it optimizes for getting your data out, not for policing the JSON spec.
|
|
5
|
+
A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL, JSON5, HJSON-style config, and the messy JSON-ish input humans actually write — and in benchmarks it matches or beats Oj on every file. SmarterJSON is opinionated: we want your JSON processing to be successful. Traditional JSON parsers are strict - they stop at the first deviation - SmarterJSON keeps going - it optimizes for getting your data out, not for policing the JSON spec.
|
|
6
6
|
|
|
7
7
|
> **SmarterJSON: one tool, no modes — want strict? Please use the stdlib `json` gem.**
|
|
8
8
|
|
|
9
|
+
## Features at a glance
|
|
10
|
+
|
|
11
|
+
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
|
|
12
|
+
- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
|
|
13
|
+
- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
|
|
14
|
+
- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
|
|
15
|
+
- **Writes JSON too** — `generate` with pretty-printing, NDJSON, `sort_keys`, `ascii_only`, `script_safe`, `allow_nan`, and `coerce` (via `as_json`); iterative, so deeply nested data is depth-safe.
|
|
16
|
+
- **Keeps number precision** — `BigDecimal` by default (Oj-compatible), or `:float` / `:auto`.
|
|
17
|
+
- **Transparent leniency** — pass an optional `on_warning` callback to be handed every lenient fix (an empty slot collapsed, a duplicate key dropped, a code fence stripped, …); with no handler the parser stays silent and adds zero overhead.
|
|
18
|
+
- **Fast, and runs everywhere** — a C extension that matches or beats Oj, with a pure-Ruby fallback for platforms that can't build it. Stable, semantically versioned, thread-safe, Ruby 2.6+.
|
|
19
|
+
|
|
9
20
|
## Why SmarterJSON?
|
|
10
21
|
|
|
22
|
+
> 📖 **The thinking behind it:** [*Strict by Accident: Your JSON parser isn't broken, it's answering the wrong question*](https://dev.to/tilo_sloboda/strict-by-accident-your-json-parser-isnt-broken-its-answering-the-wrong-question-54f0) — why a data pipeline wants a lenient, recovery-first parser rather than a spec-policing one.
|
|
23
|
+
|
|
11
24
|
**Are you tired of seeing errors like these?**
|
|
12
25
|
|
|
13
26
|
```
|
|
@@ -40,7 +53,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSON5,
|
|
|
40
53
|
Traditional JSON parsers reject anything that isn't perfectly strict JSON. That means your code breaks on malformed data.
|
|
41
54
|
|
|
42
55
|
SmarterJSON is built on the opposite principle: **you shouldn't have to care what flavor of JSON you were handed** and **you shouldn't lose the whole document because of formatting errors.**
|
|
43
|
-
Give it strict JSON, NDJSON, JSON5, an HJSON-style config file, LLM-generated JSON, or a copy-pasted blob with comments and trailing commas — it just extracts the data from it.
|
|
56
|
+
Give it strict JSON, NDJSON, JSONL, JSON5, an HJSON-style config file, LLM-generated JSON, or a copy-pasted blob with comments and trailing commas — it just extracts the data from it.
|
|
44
57
|
When it is lenient, `smarter_json` isn't dropping data that exists — it's just not raising an eyebrow at a suspicious gap (like an extra comma).
|
|
45
58
|
|
|
46
59
|
A strict parser would refuse the whole document and recover nothing; `smarter_json` returns everything except the formatting error.
|
|
@@ -73,13 +86,15 @@ It raises only on genuinely unreadable input (unterminated string, mismatched br
|
|
|
73
86
|
The lenient grammar is a superset of these human-JSON specs — listed once, here:
|
|
74
87
|
|
|
75
88
|
* [JSON5](https://json5.org/)
|
|
76
|
-
* [HJSON](https://hjson.github.io/)
|
|
89
|
+
* [HJSON](https://hjson.github.io/) <sup>†</sup>
|
|
77
90
|
* [JWCC / HuJSON](https://github.com/tailscale/hujson)
|
|
78
91
|
* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
|
|
79
92
|
* [JSONH](https://github.com/jsonh-org/Jsonh)
|
|
80
93
|
* [JSONC (VS Code)](https://jsonc.org/)
|
|
81
94
|
* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
|
|
82
95
|
|
|
96
|
+
<sup>†</sup> A deliberate subset. SmarterJSON's quoteless (unquoted) string values are single-line — it does **not** parse HJSON's unquoted multi-line strings; use a quoted or triple-quoted (`'''…'''`) string for multiline. This is by design: SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others — and an unquoted string that may span newlines collides with newline-as-a-document-separator (NDJSON, implicit-root config), so it is left out.
|
|
97
|
+
|
|
83
98
|
## Installation
|
|
84
99
|
|
|
85
100
|
```ruby
|
|
@@ -130,7 +145,7 @@ See [Examples](#examples) below for multi-document input, streaming, and recover
|
|
|
130
145
|
|
|
131
146
|
## Stable interface & thread safety
|
|
132
147
|
|
|
133
|
-
The public interface is
|
|
148
|
+
The public interface consists of `SmarterJSON.process`, `SmarterJSON.process_one`, `SmarterJSON.process_file`, `SmarterJSON.foreach`, `SmarterJSON.generate`, and the documented options in this README/docs; together they are the supported surface. `SmarterJSON.process` and `SmarterJSON.process_file` always return an `Array` of documents; `process_one` returns the single document's value (or `nil`) and emits a warning if there is more than one document.
|
|
134
149
|
|
|
135
150
|
Concurrent calls are safe. The processor and generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable state across calls.
|
|
136
151
|
|
|
@@ -243,6 +258,46 @@ For input larger than memory, pass a block: each document is yielded as it is re
|
|
|
243
258
|
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
244
259
|
```
|
|
245
260
|
|
|
261
|
+
**Try it on a file you already have.** SmarterJSON reads **NDJSON / JSONL natively** — and Claude Code stores every session as a JSONL transcript (`~/.claude/projects/<project>/<session-id>.jsonl`, one JSON document per line). Walk yours, one record at a time:
|
|
262
|
+
|
|
263
|
+
```ruby
|
|
264
|
+
require "awesome_print" # optional — readable nested output
|
|
265
|
+
|
|
266
|
+
SmarterJSON.process_file("#{Dir.home}/.claude/projects/<project>/<session-id>.jsonl") do |entry|
|
|
267
|
+
ap entry # each line is a full document — a message, a tool call, a result, …
|
|
268
|
+
puts "-" * 80
|
|
269
|
+
end
|
|
270
|
+
```
|
|
271
|
+
|
|
272
|
+
### Filtering and rewriting a large file (`foreach`)
|
|
273
|
+
|
|
274
|
+
`SmarterJSON.foreach(source)` is the composable sibling of `process_file`. `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). With no block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so you can chain `.select` / `.map` and friends. Add `.lazy` to keep the whole chain bounded in memory, even when the filtered set is large:
|
|
275
|
+
|
|
276
|
+
```ruby
|
|
277
|
+
# Keep only the user/assistant turns of a transcript — one document in memory at a time
|
|
278
|
+
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
|
|
279
|
+
.lazy
|
|
280
|
+
.select { |doc| %w[user assistant].include?(doc[:type]) }
|
|
281
|
+
.each { |doc| puts doc[:text] }
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
Because it streams both ends, you can **filter a big file down and rewrite it** without ever loading the whole thing:
|
|
285
|
+
|
|
286
|
+
```ruby
|
|
287
|
+
File.open("filtered.jsonl", "w") do |out|
|
|
288
|
+
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
|
|
289
|
+
.lazy
|
|
290
|
+
.select { |doc| %w[user assistant].include?(doc[:type]) }
|
|
291
|
+
.each { |doc| out.puts SmarterJSON.generate(doc) }
|
|
292
|
+
end
|
|
293
|
+
```
|
|
294
|
+
|
|
295
|
+
Pass an IO instead of a path to stream straight from a socket or an HTTP response body — anything `IO`-like works (an IO is single-pass, read once):
|
|
296
|
+
|
|
297
|
+
```ruby
|
|
298
|
+
SmarterJSON.foreach(response_io).each { |event| handle(event) }
|
|
299
|
+
```
|
|
300
|
+
|
|
246
301
|
### Recovering JSON from LLM / markdown noise
|
|
247
302
|
|
|
248
303
|
When the payload is wrapped in markdown fences, surrounding prose, or tags, `process` (or `process_one` for a single payload) strips the wrapper and reads what's inside. (Clean JSON never pays for this — recovery only runs when a straight read fails.)
|
data/docs/basic_read_api.md
CHANGED
|
@@ -103,6 +103,28 @@ SmarterJSON.process(io) { |doc| handle(doc) }
|
|
|
103
103
|
|
|
104
104
|
The streaming path now frames whole top-level documents, not just one line at a time. That means NDJSON / JSONL still work, but pretty-printed multi-line objects and arrays work too, as do mixed `\n` / `\r\n` / `\r` line endings and comment-only separators between documents.
|
|
105
105
|
|
|
106
|
+
## `SmarterJSON.foreach` — stream a file or IO, composably
|
|
107
|
+
|
|
108
|
+
`foreach` is the composable sibling of `process_file`. Its argument is a **file path or any IO** (a socket, a `StringIO`, an open `File`); a String is always a path, never content.
|
|
109
|
+
|
|
110
|
+
With a block it behaves exactly like the block form above — streams each document, returns the **document count**. Without a block it returns a plain `Enumerator` (like `CSV.foreach` — **not** an `Enumerator::Lazy`), so `.map` / `.select` return Arrays the usual way, and you can chain over the stream:
|
|
111
|
+
|
|
112
|
+
```ruby
|
|
113
|
+
SmarterJSON.foreach("events.ndjson").each { |event| EventJob.perform_async(event) } # like the block form
|
|
114
|
+
SmarterJSON.foreach("events.ndjson").select { |e| e["level"] == "error" } # => an Array of the matches
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
It reads one document at a time, so `foreach(path).first(3)` only reads ~3 documents off disk, and `.next` pulls them one by one. `.map` / `.select` read the source lazily but still build an Array of their *result*; to keep a whole pipeline bounded end to end (a large filtered set off a fat file), add `.lazy` at the call site:
|
|
118
|
+
|
|
119
|
+
```ruby
|
|
120
|
+
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
|
|
121
|
+
.lazy
|
|
122
|
+
.select { |doc| %w[user assistant].include?(doc[:type]) }
|
|
123
|
+
.each { |doc| puts doc[:text] }
|
|
124
|
+
```
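To make the difference between the plain `Enumerator` and a `.lazy` chain concrete, here is a small stand-alone sketch. It uses a hand-built `Enumerator` in place of `SmarterJSON.foreach`; the `docs` enumerator and the `pulled` counter are illustrative assumptions, not part of the gem, so the snippet runs with Ruby alone.

```ruby
pulled = 0

# Stand-in for a document stream: yields 1,000 small Hashes, counting each pull.
docs = Enumerator.new do |y|
  1.upto(1_000) do |i|
    pulled += 1
    y << { "id" => i, "level" => (i % 10).zero? ? "error" : "info" }
  end
end

# Plain Enumerator: select walks the whole stream and returns an Array.
eager = docs.select { |d| d["level"] == "error" }
pulled_eager = pulled

# Lazy chain: first(2) stops pulling once two matches have been found.
pulled = 0
lazy_first = docs.lazy.select { |d| d["level"] == "error" }.first(2)
pulled_lazy = pulled

puts "eager: #{eager.class}, #{eager.size} matches, #{pulled_eager} documents read"
puts "lazy:  ids #{lazy_first.map { |d| d['id'] }.inspect}, #{pulled_lazy} documents read"
```

The eager form reads every document and holds all matches at once; the lazy form reads only as far as it needs to, which is the property that keeps a long pipeline bounded.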
|
|
125
|
+
|
|
126
|
+
Options are validated eagerly — a bad option key or value raises immediately, before any iteration. An **IO source is single-pass** (an IO can only be read once), so iterating the returned Enumerator a second time over the same IO yields nothing; a path-backed `foreach` re-opens the file and is re-iterable.
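The three behaviours in this paragraph (eager option checks, a single-pass IO, and the `enum_for` pattern behind the block-less form) can be sketched with the standard library alone. `NdjsonSketch` below is a hypothetical stand-in written for this illustration, not SmarterJSON itself: it accepts only strict one-document-per-line NDJSON and a single made-up allow-list of options.

```ruby
require "json"
require "stringio"

# Hypothetical stand-in for illustration only; this is NOT SmarterJSON.foreach.
module NdjsonSketch
  ALLOWED_OPTIONS = %i[symbolize_keys].freeze

  def self.foreach(source, options = {}, &block)
    # Validate eagerly, before any Enumerator is handed back.
    unknown = options.keys - ALLOWED_OPTIONS
    raise ArgumentError, "unknown option(s): #{unknown.inspect}" unless unknown.empty?
    return enum_for(:foreach, source, options) unless block

    if source.respond_to?(:read)      # an IO: read from its current position
      each_document(source, options, &block)
    else                              # a path: re-opened on every iteration
      File.open(source, "r") { |file| each_document(file, options, &block) }
    end
  end

  # Yields one parsed document per non-blank line and returns the count.
  def self.each_document(io, options)
    count = 0
    io.each_line do |line|
      next if line.strip.empty?
      yield JSON.parse(line, symbolize_names: options.fetch(:symbolize_keys, false))
      count += 1
    end
    count
  end
end

io = StringIO.new(%({"id": 1}\n{"id": 2}\n))
enum = NdjsonSketch.foreach(io)
first_pass  = enum.to_a   # consumes the IO
second_pass = enum.to_a   # the IO is already at EOF

begin
  NdjsonSketch.foreach(io, symbolise_keys: true) # typo in the option name
rescue ArgumentError => e
  eager_error = e.message
end

puts first_pass.inspect
puts second_pass.inspect
puts eager_error
```

Because the path branch opens the file inside the method, each new iteration of a path-backed enumerator starts from the top of the file, whereas the IO branch simply continues from wherever the IO was left.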
|
|
127
|
+
|
|
106
128
|
## The C extension and the pure-Ruby fallback
|
|
107
129
|
|
|
108
130
|
By default (`acceleration: true`) the C extension is used when it is compiled and loadable (`SmarterJSON::HAS_ACCELERATION` is then `true`); otherwise the pure-Ruby implementation runs and produces identical results. Pass `acceleration: false` to force the pure-Ruby path. See [Configuration Options](./options.md).
|
data/docs/examples.md
CHANGED
|
@@ -83,6 +83,28 @@ For input larger than memory, pass a block. Each recovered document is yielded o
|
|
|
83
83
|
SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
|
|
84
84
|
```
|
|
85
85
|
|
|
86
|
+
**A JSONL file you already have:** Claude Code stores each session as a JSONL transcript — `~/.claude/projects/<project>/<session-id>.jsonl`, one JSON document per line (a message, a tool call, a result, …). It reads the same way, one record at a time:
|
|
87
|
+
|
|
88
|
+
```ruby
|
|
89
|
+
require "awesome_print" # optional — for readable nested output
|
|
90
|
+
|
|
91
|
+
SmarterJSON.process_file("#{Dir.home}/.claude/projects/<project>/<session-id>.jsonl") do |entry|
|
|
92
|
+
ap entry # each line is a full document
|
|
93
|
+
puts "-" * 80
|
|
94
|
+
end
|
|
95
|
+
```
|
|
96
|
+
|
|
97
|
+
**Filter and rewrite as a stream — `SmarterJSON.foreach`:** `foreach(source)` is the composable sibling of `process_file`; `source` is a file path or any IO (a socket, a `StringIO`, an open `File`). Without a block it returns a plain `Enumerator` (like `CSV.foreach`) that reads one document at a time, so it chains with `.select` / `.map`; add `.lazy` to keep the whole pipeline bounded in memory. This filters a transcript down to its user/assistant turns and writes a smaller file, never loading all of it:
|
|
98
|
+
|
|
99
|
+
```ruby
|
|
100
|
+
File.open("filtered.jsonl", "w") do |out|
|
|
101
|
+
SmarterJSON.foreach("session.jsonl", symbolize_keys: true)
|
|
102
|
+
.lazy
|
|
103
|
+
.select { |doc| %w[user assistant].include?(doc[:type]) }
|
|
104
|
+
.each { |doc| out.puts SmarterJSON.generate(doc) }
|
|
105
|
+
end
|
|
106
|
+
```
|
|
107
|
+
|
|
86
108
|
### Example 6: Symbolize Keys
|
|
87
109
|
|
|
88
110
|
```ruby
|
data/ext/smarter_json/smarter_json.c
CHANGED
|
@@ -81,9 +81,9 @@ typedef struct {
|
|
|
81
81
|
* an error) by scanning from the start of the buffer. CR, LF, and CRLF each
|
|
82
82
|
* count as one newline; col is bytes since the last line start (1-based).
|
|
83
83
|
* Keeping this off the hot path is the point — fj_advance never touches it. */
|
|
84
|
-
static void
|
|
84
|
+
static void fj_line_col_to(fj_state *st, long stop, long *line, long *col) {
|
|
85
85
|
long l = 1, c = 1, i;
|
|
86
|
-
long limit = (
|
|
86
|
+
long limit = (stop < st->len) ? stop : st->len;
|
|
87
87
|
for (i = 0; i < limit; i++) {
|
|
88
88
|
unsigned char b = (unsigned char)st->buf[i];
|
|
89
89
|
if (b == 0x0A) { l++; c = 1; }
|
|
@@ -93,6 +93,9 @@ static void fj_line_col(fj_state *st, long *line, long *col) {
|
|
|
93
93
|
*line = l;
|
|
94
94
|
*col = c;
|
|
95
95
|
}
|
|
96
|
+
static void fj_line_col(fj_state *st, long *line, long *col) {
|
|
97
|
+
fj_line_col_to(st, st->pos, line, col);
|
|
98
|
+
}
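For readers who do not read C, the line/column rule stated in the comment above (CR, LF, and CRLF each count as one newline; the column is the 1-based byte count since the last line start) can be sketched in Ruby. This is an illustrative port under that stated rule only: `line_col_at` is a hypothetical name, and the CR/CRLF handling is an assumption, since the diff shows only the LF branch of the C loop.

```ruby
# Hypothetical helper, not part of smarter_json.
# Returns [line, col] (both 1-based) for the byte offset `stop` in `buf`.
def line_col_at(buf, stop)
  bytes = buf.bytes
  limit = [stop, bytes.length].min # never scan past the end of the buffer
  line = 1
  col = 1
  i = 0
  while i < limit
    b = bytes[i]
    if b == 0x0A                   # LF
      line += 1
      col = 1
    elsif b == 0x0D                # CR, alone or as the start of CRLF
      line += 1
      col = 1
      i += 1 if i + 1 < limit && bytes[i + 1] == 0x0A # swallow the LF of a CRLF pair
    else
      col += 1                     # any other byte advances the column
    end
    i += 1
  end
  [line, col]
end

p line_col_at("a\nbc", 3)   # offset of "c"
p line_col_at("a\r\nb", 3)  # offset of "b" after a CRLF
p line_col_at("x\ry", 2)    # offset of "y" after a lone CR
```

Scanning from the start of the buffer on demand, as the C comment notes, keeps this cost off the parsing hot path; it is paid only when a warning or error actually needs a position.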
|
|
96
99
|
|
|
97
100
|
/* Report a non-fatal lenient fix to the on_warning callable — a no-op (and builds no
|
|
98
101
|
* Warning) when no handler was given. The internal Qnil guard is the safety net; the
|
|
@@ -106,6 +109,25 @@ static void fj_warn(fj_state *st, VALUE type_sym, const char *msg) {
|
|
|
106
109
|
rb_utf8_str_new_cstr(msg), LONG2NUM(line), LONG2NUM(col)));
|
|
107
110
|
}
|
|
108
111
|
|
|
112
|
+
/* Like fj_warn but the message is a prebuilt Ruby String (rb_enc_sprintf, for messages
|
|
113
|
+
* that interpolate the offending key). Same Qnil guard and st->pos line/col. */
|
|
114
|
+
static void fj_warn_str(fj_state *st, VALUE type_sym, VALUE msg) {
|
|
115
|
+
long line, col;
|
|
116
|
+
if (st->on_warning == Qnil) return;
|
|
117
|
+
fj_line_col(st, &line, &col);
|
|
118
|
+
rb_funcall(st->on_warning, fj_call_id, 1,
|
|
119
|
+
rb_funcall(cWarning, fj_new_id, 4, type_sym, msg, LONG2NUM(line), LONG2NUM(col)));
|
|
120
|
+
}
|
|
121
|
+
|
|
122
|
+
/* Like fj_warn_str but at an explicit (line,col) — for a warning detected away from the
|
|
123
|
+
* cursor. Duplicate keys are found while building the closed object, but attributed to
|
|
124
|
+
* the object's closing position so the column matches the pure-Ruby parser. */
|
|
125
|
+
static void fj_warn_str_at(fj_state *st, VALUE type_sym, VALUE msg, long line, long col) {
|
|
126
|
+
if (st->on_warning == Qnil) return;
|
|
127
|
+
rb_funcall(st->on_warning, fj_call_id, 1,
|
|
128
|
+
rb_funcall(cWarning, fj_new_id, 4, type_sym, msg, LONG2NUM(line), LONG2NUM(col)));
|
|
129
|
+
}
|
|
130
|
+
|
|
109
131
|
/* 1-based column of the current byte position (bytes since the last line start).
|
|
110
132
|
* Used for triple-quoted indentation stripping (smarter_json.md §2.3). */
|
|
111
133
|
static long fj_column(fj_state *st) {
|
|
@@ -1290,7 +1312,7 @@ void rb_hash_bulk_insert(long, const VALUE *, VALUE);
|
|
|
1290
1312
|
/* Build a Hash from `count` interleaved key,value slots. Fast path (String keys,
|
|
1291
1313
|
* default :last_wins): pre-size + bulk insert. symbolize_keys / :first_wins use a
|
|
1292
1314
|
* per-member loop into the same pre-sized hash. */
|
|
1293
|
-
static VALUE fj_build_object(fj_state *st, const VALUE *pairs, long count) {
|
|
1315
|
+
static VALUE fj_build_object(fj_state *st, const VALUE *pairs, const long *positions, long count) {
|
|
1294
1316
|
long entries = count / 2, i;
|
|
1295
1317
|
VALUE hash = rb_hash_new_capa(entries);
|
|
1296
1318
|
|
|
@@ -1305,7 +1327,16 @@ static VALUE fj_build_object(fj_state *st, const VALUE *pairs, long count) {
|
|
|
1305
1327
|
VALUE k = st->symbolize_keys ? rb_funcall(pairs[i], fj_to_sym_id, 0) : pairs[i];
|
|
1306
1328
|
if (st->dup_first_wins || st->on_warning != Qnil) {
|
|
1307
1329
|
if (RTEST(rb_funcall(hash, fj_key_p_id, 1, k))) {
|
|
1308
|
-
|
|
1330
|
+
if (st->on_warning != Qnil) {
|
|
1331
|
+
long wl, wc;
|
|
1332
|
+
fj_line_col_to(st, positions[i + 1], &wl, &wc); /* the duplicate member's value position — matches the Ruby parser */
|
|
1333
|
+
fj_warn_str_at(st, fj_sym_duplicate_key,
|
|
1334
|
+
rb_enc_sprintf(rb_utf8_encoding(),
|
|
1335
|
+
"duplicate key %"PRIsVALUE" \xe2\x80\x94 %s",
|
|
1336
|
+
rb_inspect(k),
|
|
1337
|
+
st->dup_first_wins ? "first_wins" : "last_wins"),
|
|
1338
|
+
wl, wc);
|
|
1339
|
+
}
|
|
1309
1340
|
if (st->dup_first_wins) continue;
|
|
1310
1341
|
}
|
|
1311
1342
|
}
|
|
@@ -1323,6 +1354,7 @@ typedef struct { long mark; int is_obj; } fj_frame;
|
|
|
1323
1354
|
|
|
1324
1355
|
typedef struct {
|
|
1325
1356
|
VALUE *vptr; long vhead; long vcapa; /* pending values (GC-marked) */
|
|
1357
|
+
long *pptr; /* byte offset just past each pushed value (mirrors vptr/vcapa); used only to attribute a duplicate-key warning to the duplicate member's position */
|
|
1326
1358
|
fj_frame *fptr; long fhead; long fcapa; /* open-container frames (no VALUEs) */
|
|
1327
1359
|
} fj_pstack;
|
|
1328
1360
|
|
|
@@ -1334,12 +1366,15 @@ static void fj_pstack_mark(void *p) {
|
|
|
1334
1366
|
static void fj_pstack_free(void *p) {
|
|
1335
1367
|
fj_pstack *ps = (fj_pstack *)p;
|
|
1336
1368
|
if (ps->vptr != NULL) xfree(ps->vptr);
|
|
1369
|
+
if (ps->pptr != NULL) xfree(ps->pptr);
|
|
1337
1370
|
if (ps->fptr != NULL) xfree(ps->fptr);
|
|
1338
1371
|
xfree(ps);
|
|
1339
1372
|
}
|
|
1340
1373
|
static size_t fj_pstack_memsize(const void *p) {
|
|
1341
1374
|
const fj_pstack *ps = (const fj_pstack *)p;
|
|
1342
|
-
return sizeof(fj_pstack) + (size_t)ps->vcapa * sizeof(VALUE)
|
|
1375
|
+
return sizeof(fj_pstack) + (size_t)ps->vcapa * sizeof(VALUE)
|
|
1376
|
+
+ (ps->pptr ? (size_t)ps->vcapa * sizeof(long) : 0)
|
|
1377
|
+
+ (size_t)ps->fcapa * sizeof(fj_frame);
|
|
1343
1378
|
}
|
|
1344
1379
|
static const rb_data_type_t fj_pstack_type = {
|
|
1345
1380
|
"smarter_json/pstack",
|
|
@@ -1347,8 +1382,15 @@ static const rb_data_type_t fj_pstack_type = {
|
|
|
1347
1382
|
0, 0, RUBY_TYPED_FREE_IMMEDIATELY,
|
|
1348
1383
|
};
|
|
1349
1384
|
|
|
1350
|
-
static inline void fj_vpush(fj_pstack *ps, VALUE v) {
|
|
1351
|
-
if (ps->vhead >= ps->vcapa) {
|
|
1385
|
+
static inline void fj_vpush(fj_state *st, fj_pstack *ps, VALUE v) {
|
|
1386
|
+
if (ps->vhead >= ps->vcapa) {
|
|
1387
|
+
ps->vcapa *= 2;
|
|
1388
|
+
REALLOC_N(ps->vptr, VALUE, ps->vcapa);
|
|
1389
|
+
if (ps->pptr) REALLOC_N(ps->pptr, long, ps->vcapa);
|
|
1390
|
+
}
|
|
1391
|
+
/* pptr is allocated only when on_warning is set; the fast path (no handler) does no
|
|
1392
|
+
* extra store. The offset is just past this value — the duplicate-key warning position. */
|
|
1393
|
+
if (ps->pptr) ps->pptr[ps->vhead] = st->pos;
|
|
1352
1394
|
ps->vptr[ps->vhead++] = v;
|
|
1353
1395
|
}
|
|
1354
1396
|
static inline void fj_fpush(fj_pstack *ps, long mark, int is_obj) {
|
|
@@ -1368,6 +1410,7 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1368
1410
|
int vss = 0; /* warnings: has a value landed in the current container since the last separator? */
|
|
1369
1411
|
|
|
1370
1412
|
ps->vptr = ALLOC_N(VALUE, 64); ps->vhead = 0; ps->vcapa = 64;
|
|
1413
|
+
ps->pptr = (st->on_warning != Qnil) ? ALLOC_N(long, 64) : NULL; /* only the warning path needs per-value positions */
|
|
1371
1414
|
ps->fptr = ALLOC_N(fj_frame, 16); ps->fhead = 0; ps->fcapa = 16;
|
|
1372
1415
|
|
|
1373
1416
|
if (implicit_root) fj_fpush(ps, 0, 1);
|
|
@@ -1398,7 +1441,7 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1398
1441
|
b = fj_byte(st);
|
|
1399
1442
|
if (FJ_UNLIKELY(fj_needs_ws_skip(b))) { fj_skip_ws_comments(st); b = fj_byte(st); }
|
|
1400
1443
|
if (b == ',') { /* collapsing separator: skip empty member */
|
|
1401
|
-
if (st->on_warning != Qnil && !vss) fj_warn(st, fj_sym_empty_slot, "extra comma
|
|
1444
|
+
if (st->on_warning != Qnil && !vss) fj_warn(st, fj_sym_empty_slot, "extra comma \xe2\x80\x94 collapsed an empty slot");
|
|
1402
1445
|
vss = 0;
|
|
1403
1446
|
fj_advance(st, 1);
|
|
1404
1447
|
continue;
|
|
@@ -1406,17 +1449,17 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1406
1449
|
if (b == '}') {
|
|
1407
1450
|
VALUE hash;
|
|
1408
1451
|
fj_advance(st, 1);
|
|
1409
|
-
hash = fj_build_object(st, &ps->vptr[mark], ps->vhead - mark);
|
|
1452
|
+
hash = fj_build_object(st, &ps->vptr[mark], ps->pptr ? &ps->pptr[mark] : NULL, ps->vhead - mark);
|
|
1410
1453
|
ps->vhead = mark;
|
|
1411
1454
|
ps->fhead--;
|
|
1412
1455
|
if (ps->fhead == 0) { result = hash; break; }
|
|
1413
|
-
fj_vpush(ps, hash);
|
|
1456
|
+
fj_vpush(st, ps, hash);
|
|
1414
1457
|
vss = 1;
|
|
1415
1458
|
continue;
|
|
1416
1459
|
}
|
|
1417
1460
|
if (b == -1) {
|
|
1418
1461
|
if (implicit_root && ps->fhead == 1) {
|
|
1419
|
-
result = fj_build_object(st, &ps->vptr[mark], ps->vhead - mark);
|
|
1462
|
+
result = fj_build_object(st, &ps->vptr[mark], ps->pptr ? &ps->pptr[mark] : NULL, ps->vhead - mark);
|
|
1420
1463
|
break;
|
|
1421
1464
|
}
|
|
1422
1465
|
fj_error(st, "unterminated object");
|
|
@@ -1430,28 +1473,32 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1430
1473
|
b = fj_byte(st);
|
|
1431
1474
|
if (FJ_UNLIKELY(fj_needs_ws_skip(b))) { fj_skip_ws_comments(st); b = fj_byte(st); }
|
|
1432
1475
|
if (b == '{' || b == '[') {
|
|
1433
|
-
fj_vpush(ps, key);
|
|
1476
|
+
fj_vpush(st, ps, key);
|
|
1434
1477
|
fj_advance(st, 1);
|
|
1435
1478
|
fj_fpush(ps, ps->vhead, (b == '{'));
|
|
1436
1479
|
vss = 0;
|
|
1437
1480
|
continue;
|
|
1438
1481
|
}
|
|
1439
1482
|
if (b == '}' || b == ',') { /* key with a colon but no value -> null */
|
|
1440
|
-
fj_vpush(ps, key);
|
|
1441
|
-
fj_vpush(ps, Qnil);
|
|
1442
|
-
|
|
1483
|
+
fj_vpush(st, ps, key);
|
|
1484
|
+
fj_vpush(st, ps, Qnil);
|
|
1485
|
+
if (st->on_warning != Qnil)
|
|
1486
|
+
fj_warn_str(st, fj_sym_empty_value,
|
|
1487
|
+
rb_enc_sprintf(rb_utf8_encoding(),
|
|
1488
|
+
"key %"PRIsVALUE" had no value \xe2\x80\x94 used null",
|
|
1489
|
+
rb_inspect(key)));
|
|
1443
1490
|
vss = 1;
|
|
1444
1491
|
continue;
|
|
1445
1492
|
}
|
|
1446
1493
|
if (b == -1) fj_error(st, "unexpected end of input");
|
|
1447
|
-
fj_vpush(ps, key);
|
|
1448
|
-
fj_vpush(ps, fj_parse_member_value(st));
|
|
1494
|
+
fj_vpush(st, ps, key);
|
|
1495
|
+
fj_vpush(st, ps, fj_parse_member_value(st));
|
|
1449
1496
|
vss = 1;
|
|
1450
1497
|
} else { /* array */
|
|
1451
1498
|
b = fj_byte(st);
|
|
1452
1499
|
if (FJ_UNLIKELY(fj_needs_ws_skip(b))) { fj_skip_ws_comments(st); b = fj_byte(st); }
|
|
1453
1500
|
if (b == ',') { /* collapsing separator: skip empty slot */
|
|
1454
|
-
if (st->on_warning != Qnil && !vss) fj_warn(st, fj_sym_empty_slot, "extra comma
|
|
1501
|
+
if (st->on_warning != Qnil && !vss) fj_warn(st, fj_sym_empty_slot, "extra comma \xe2\x80\x94 collapsed an empty slot");
|
|
1455
1502
|
vss = 0;
|
|
1456
1503
|
fj_advance(st, 1);
|
|
1457
1504
|
continue;
|
|
@@ -1463,7 +1510,7 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1463
1510
|
ps->vhead = mark;
|
|
1464
1511
|
ps->fhead--;
|
|
1465
1512
|
if (ps->fhead == 0) { result = ary; break; }
|
|
1466
|
-
fj_vpush(ps, ary);
|
|
1513
|
+
fj_vpush(st, ps, ary);
|
|
1467
1514
|
vss = 1;
|
|
1468
1515
|
continue;
|
|
1469
1516
|
}
|
|
@@ -1481,10 +1528,10 @@ static VALUE fj_parse_iter(fj_state *st, int implicit_root) {
|
|
|
1481
1528
|
smart-quote, literals) falls through to the full dispatch below. */
|
|
1482
1529
|
if (b == '-' || b == '+' || b == '.' || (b >= '0' && b <= '9')) {
|
|
1483
1530
|
VALUE num;
|
|
1484
|
-
if (fj_try_member_number(st, &num)) { fj_vpush(ps, num); vss = 1; continue; }
|
|
1531
|
+
if (fj_try_member_number(st, &num)) { fj_vpush(st, ps, num); vss = 1; continue; }
|
|
1485
1532
|
}
|
|
1486
|
-
if (b == '"') { fj_vpush(ps, fj_parse_string(st, '"')); vss = 1; continue; }
|
|
1487
|
-
fj_vpush(ps, fj_parse_member_value(st));
|
|
1533
|
+
if (b == '"') { fj_vpush(st, ps, fj_parse_string(st, '"')); vss = 1; continue; }
|
|
1534
|
+
fj_vpush(st, ps, fj_parse_member_value(st));
|
|
1488
1535
|
vss = 1;
|
|
1489
1536
|
}
|
|
1490
1537
|
}
|
data/lib/smarter_json/parser.rb
CHANGED
|
@@ -57,6 +57,41 @@ module SmarterJSON
|
|
|
57
57
|
end
|
|
58
58
|
end
|
|
59
59
|
|
|
60
|
+
# SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of
|
|
61
|
+
# process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a
|
|
62
|
+
# plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way
|
|
63
|
+
# and return an Array.
|
|
64
|
+
#
|
|
65
|
+
# `source` is a file path (opened and streamed from disk, like process_file) OR an
|
|
66
|
+
# IO — a socket, a StringIO, an open File — streamed directly from its current
|
|
67
|
+
# position. A String is always a path, never content. An IO source is single-pass:
|
|
68
|
+
# it can only be read once, so iterating the returned Enumerator a second time over
|
|
69
|
+
# the same IO yields nothing.
|
|
70
|
+
#
|
|
71
|
+
# Without a block: returns an Enumerator over each top-level document, reading one
|
|
72
|
+
# document at a time via readpartial — it never slurps the whole file the way
|
|
73
|
+
# process_file(path) does. So foreach(path).first(3) reads only ~3 documents off
|
|
74
|
+
# disk, and foreach(src).each { … } / .next stream in bounded memory. .map / .select
|
|
75
|
+
# read the source one document at a time but still build an Array of their result;
|
|
76
|
+
# for a chain that stays bounded end to end (a large filtered set off a fat file)
|
|
77
|
+
# opt into .lazy at the call site: foreach(src).lazy.select { … }.each { … }.
|
|
78
|
+
#
|
|
79
|
+
# With a block: streams each document and returns the document count — identical
|
|
80
|
+
# to process_file(path) { |doc| … } (or process(io) { |doc| … } for an IO).
|
|
81
|
+
#
|
|
82
|
+
# Options are validated eagerly (before the Enumerator is returned), so a bad
|
|
83
|
+
# option key or value fails fast rather than on first iteration.
|
|
84
|
+
def foreach(source, options = {}, &block)
|
|
85
|
+
options = Options.process_options(options)
|
|
86
|
+
return enum_for(:foreach, source, options) unless block
|
|
87
|
+
|
|
88
|
+
if source.respond_to?(:read) # an IO (socket, StringIO, open File) — stream it directly
|
|
89
|
+
stream_io(source, options, &block)
|
|
90
|
+
else # a path — open the file and stream from disk
|
|
91
|
+
process_file(source, options, &block)
|
|
92
|
+
end
|
|
93
|
+
end
|
|
94
|
+
|
|
60
95
|
# SmarterJSON.process_one(input, options = {}) — the single-document accessor.
|
|
61
96
|
#
|
|
62
97
|
# Returns the first document's value (or nil when the input holds no documents).
|
data/lib/smarter_json/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_json
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.
|
|
4
|
+
version: 1.1.1
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
bindir: exe
|
|
9
9
|
cert_chain: []
|
|
10
|
-
date: 2026-06-
|
|
10
|
+
date: 2026-06-11 00:00:00.000000000 Z
|
|
11
11
|
dependencies:
|
|
12
12
|
- !ruby/object:Gem::Dependency
|
|
13
13
|
name: bigdecimal
|
|
@@ -24,7 +24,7 @@ dependencies:
|
|
|
24
24
|
- !ruby/object:Gem::Version
|
|
25
25
|
version: '0'
|
|
26
26
|
description: 'A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON,
|
|
27
|
-
JSON5, HJSON-style config, and the messy JSON-ish input humans and LLMs actually
|
|
27
|
+
JSONL, JSON5, HJSON-style config, and the messy JSON-ish input humans and LLMs actually
|
|
28
28
|
write — comments, trailing commas, single / unquoted / smart quotes, Python and
|
|
29
29
|
JS keywords, a UTF-8 BOM, and more all parse to the same Ruby objects, with no modes
|
|
30
30
|
or flags to set. Where a traditional parser stops at the first deviation and throws
|
|
@@ -92,6 +92,6 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
92
92
|
requirements: []
|
|
93
93
|
rubygems_version: 3.6.9
|
|
94
94
|
specification_version: 4
|
|
95
|
-
summary: A lenient, fast JSON processor for Ruby — reads strict JSON, NDJSON,
|
|
96
|
-
HJSON, and the messy JSON humans and LLMs actually write.
|
|
95
|
+
summary: A lenient, fast JSON processor for Ruby — reads strict JSON, NDJSON, JSONL,
|
|
96
|
+
JSON5, HJSON, and the messy JSON humans and LLMs actually write.
|
|
97
97
|
test_files: []
|