smarter_json 1.1.1 → 1.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -4
- data/CONTRIBUTORS.md +6 -0
- data/README.md +12 -4
- data/docs/_introduction.md +1 -1
- data/docs/examples.md +20 -4
- data/ext/smarter_json/smarter_json.c +87 -12
- data/lib/smarter_json/parser.rb +61 -10
- data/lib/smarter_json/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: bf6191c05bd9049082a1362f4bfa5ab3240c4690bc59c1675c716c0901d5c6cb
|
|
4
|
+
data.tar.gz: 56f1d5418b20d8bad23694f7831dfb335ab4737dd34e9184b9c20113c41fd3aa
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 64c3511d1f21662b703ee1a02876ba8f05401ebcd48b1025e7290a50c49a5fc1d74623c94477b6fc7006ea782028698e298010b3e827f5ed0315fa1f7e88f595
|
|
7
|
+
data.tar.gz: c2801263204013c23954d7f4489f7f4f38f74802c42da5cf2ba2e5965decfc6c00dd304a7ff16f9f96741e4103fd499f36efa8a8ba5ec7abfdb40dded1c996a9
|
data/CHANGELOG.md
CHANGED
|
@@ -5,12 +5,30 @@
|
|
|
5
5
|
>
|
|
6
6
|
> `SmarterJSON.process` / `SmarterJSON.process_file`
|
|
7
7
|
> both return:
|
|
8
|
-
>
|
|
9
|
-
>
|
|
10
|
-
>
|
|
8
|
+
>
|
|
9
|
+
> - `[]` for no doc
|
|
10
|
+
> - `[doc]` for one doc
|
|
11
|
+
> - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
|
|
11
12
|
|
|
12
13
|
> ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
|
|
13
|
-
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
|
|
14
|
+
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
|
|
15
|
+
|
|
16
|
+
## 1.2.0 (2026-06-16)
|
|
17
|
+
|
|
18
|
+
RSpec tests: 1,097 → 1,165
|
|
19
|
+
|
|
20
|
+
- A leading-zero token now reads as a number when it carries a sign, a decimal point, or an exponent (`+007` → `7`, `-000023.5` → `-23.5`, `00.0` → `0.0`, `007e2` → `700.0`) — previously these were kept as strings. A bare leading-zero integer (`000001`, `02`) still reads as a string, so IDs, zip codes, and account numbers keep their zeros.
|
|
21
|
+
- `Null` and `NULL` are now read as `nil` (joining `null` / `None` / `undefined`), for SQL / R / PHP / YAML / DB-derived input — in every position the existing spellings work. Quoted (`"NULL"`) or embedded (`NULL Island`) forms stay strings.
|
|
22
|
+
- String escapes now cover the full JSON5 / ECMAScript set: `\xHH` hex escapes (`"\x41"` → `"A"`), `\v` (vertical tab), `\0` (null), and an unrecognized escape now yields the character itself (`"\q"` → `"q"`) instead of raising. A malformed `\x` and an octal-style `\0` followed by a digit still raise.
|
|
23
|
+
- A `U+FEFF` (BOM / zero-width no-break space) is now skipped as whitespace anywhere between tokens — matching JSON5 / ECMAScript — not only as a leading byte-order mark, so a stray BOM mid-stream (e.g. from concatenated files) no longer corrupts the adjacent value into a string. Inside a quoted string it stays content.
|
|
24
|
+
|
|
25
|
+
## 1.1.2 (2026-06-12)
|
|
26
|
+
|
|
27
|
+
RSpec tests: 1,097
|
|
28
|
+
|
|
29
|
+
### Bug Fix
|
|
30
|
+
|
|
31
|
+
- The C extension now correctly supports Ruby's GC heap compaction (`GC.compact` / auto-compaction) — its cached exception/warning classes are declared to the GC. Thanks [Jean Boussier](https://github.com/byroot) for PR [#7](https://github.com/tilo/smarter_json/pull/7).
|
|
14
32
|
|
|
15
33
|
## 1.1.1 (2026-06-11)
|
|
16
34
|
|
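Reviewer note on the 1.2.0 leading-zero rule: the changelog entry above can be read as a three-way decision on each number-shaped token. The sketch below is a minimal illustration of that rule, not the gem's implementation. `classify_token` and `DECIMAL_SHAPE` are hypothetical names, the pattern is copied from the new `DEC_RE` in `parser.rb`, and hex, `NaN`, and `Infinity` handling are deliberately left out.

```ruby
# Hypothetical helper (not part of smarter_json): applies the leading-zero
# rule from the 1.2.0 changelog to a single quoteless token.
DECIMAL_SHAPE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/

def classify_token(token)
  # Anything that is not decimal-shaped stays a string.
  return token unless token.match?(DECIMAL_SHAPE)

  body = token.delete("_") # underscores are only digit separators

  # A decimal point or exponent signals numeric intent, leading zeros or not.
  return body.to_f if body.match?(/[.eE]/)

  # A bare leading-zero integer (no sign, dot, or exponent) is treated as an ID.
  return token if body.start_with?("0") && body.length > 1

  # Plain or signed integers; a sign also signals numeric intent.
  body.to_i
end

p classify_token("+007")      # 7
p classify_token("-000023.5") # -23.5
p classify_token("007e2")     # 700.0
p classify_token("02139")     # "02139"
```

Checking the first byte after the sign has been kept in `body` is what lets `+007` and `-007` through while `007` is held back, which mirrors the `c0 == ZERO` test in the pure-Ruby path.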
data/CONTRIBUTORS.md
ADDED
data/README.md
CHANGED
|
@@ -8,7 +8,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,
|
|
|
8
8
|
|
|
9
9
|
## Features at a glance
|
|
10
10
|
|
|
11
|
-
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python
|
|
11
|
+
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python / JavaScript / SQL literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
|
|
12
12
|
- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
|
|
13
13
|
- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
|
|
14
14
|
- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
|
|
@@ -73,9 +73,11 @@ Three things set it apart:
|
|
|
73
73
|
- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
|
|
74
74
|
- Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
|
|
75
75
|
- Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
|
|
76
|
+
- Full JSON5 / ECMAScript string escapes — `\uXXXX` (with surrogate pairs), `\xHH` (`"\x41"` → `"A"`), `\v`, `\0`, line continuation; an unrecognized escape yields the character itself (`"\q"` → `"q"`)
|
|
76
77
|
- Implicit root object — a config file that starts with `key: value`, no outer `{}`
|
|
77
78
|
- `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
|
|
78
|
-
-
|
|
79
|
+
- Leading-zero numbers (which strict JSON rejects): a token with a sign, decimal point, or exponent reads as a number (`-007.5` → `-7.5`, `007e2` → `700.0`), but a bare leading-zero integer is kept as a string (`007`, `02`) so IDs, zip codes, and account numbers don't lose their zeros
|
|
80
|
+
- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`, case-variant null (`Null` / `NULL`, as SQL / R / PHP / YAML emit it)
|
|
79
81
|
- Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
|
|
80
82
|
- Duplicate keys (last value wins by default; configurable)
|
|
81
83
|
|
|
@@ -89,11 +91,15 @@ The lenient grammar is a superset of these human-JSON specs — listed once, her
|
|
|
89
91
|
* [HJSON](https://hjson.github.io/) <sup>†</sup>
|
|
90
92
|
* [JWCC / HuJSON](https://github.com/tailscale/hujson)
|
|
91
93
|
* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
|
|
92
|
-
* [JSONH](https://github.com/jsonh-org/Jsonh)
|
|
94
|
+
* [JSONH](https://github.com/jsonh-org/Jsonh) <sup>‡</sup>
|
|
93
95
|
* [JSONC (VS Code)](https://jsonc.org/)
|
|
94
96
|
* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
|
|
95
97
|
|
|
96
|
-
|
|
98
|
+
HJSON and JSONH are deliberate subsets. SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others.
|
|
99
|
+
|
|
100
|
+
<sup>†</sup>From **HJSON** we leave out unquoted *multi-line* strings — its quoteless string values are single-line (use a quoted or triple-quoted `'''…'''` string for multiline), because a newline-spanning unquoted string collides with newline-as-a-document-separator (NDJSON, implicit-root config).
|
|
101
|
+
|
|
102
|
+
<sup>‡</sup>From **JSONH** we take the mainstream features (quoteless keys / values, optional commas between newline-separated members, comments, hex numbers) but **not** the idiosyncratic extensions: binary (`0b`) / octal (`0o`) number literals, verbatim strings (`@"…"`), nestable block comments (`/=* *=/`), or its `\e` / `\a` escapes — the last conflict with the JSON5 / ECMAScript rule that an unrecognized escape is the character itself (`"\e"` → `"e"`). Tip: you can use quoteless strings instead of verbatim strings. Want binary or octal literals? Open an issue.
|
|
97
103
|
|
|
98
104
|
## Installation
|
|
99
105
|
|
|
@@ -359,6 +365,8 @@ Both the C extension and the pure-Ruby engine are **iterative, not recursive**
|
|
|
359
365
|
The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you process untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream.
|
|
360
366
|
|
|
361
367
|
|
|
368
|
+
# [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
|
|
369
|
+
|
|
362
370
|
## Development
|
|
363
371
|
|
|
364
372
|
After checking out the repo, run `bin/setup` to install dependencies, then `rake compile` to build the C extension and `rake spec` to run the tests. The test suite runs every example against **both** the C and pure-Ruby paths, so the two stay behavior-identical.
|
data/docs/_introduction.md
CHANGED
|
@@ -29,7 +29,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
|
|
|
29
29
|
|
|
30
30
|
## What it accepts, beyond strict JSON
|
|
31
31
|
|
|
32
|
-
Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, Python (`True` / `False` / `None`)
|
|
32
|
+
Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, full JSON5 / ECMAScript string escapes (`\xHH`, `\v`, `\0`, line continuation, and an unknown escape yields the character itself), an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, leading-zero numbers (a signed / decimal / exponent token like `-007.5` is a number, a bare `007` is kept as a string so IDs keep their zeros), Python (`True` / `False` / `None`), JavaScript (`undefined`), and SQL / R / PHP / YAML (`Null` / `NULL`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
|
|
33
33
|
|
|
34
34
|
It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
|
|
35
35
|
|
data/docs/examples.md
CHANGED
|
@@ -145,7 +145,23 @@ JSON
|
|
|
145
145
|
|
|
146
146
|
A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.
|
|
147
147
|
|
|
148
|
-
### Example 10:
|
|
148
|
+
### Example 10: Leading-Zero IDs and SQL `NULL`
|
|
149
|
+
|
|
150
|
+
```ruby
|
|
151
|
+
SmarterJSON.process_one(<<~JSON)
|
|
152
|
+
{
|
|
153
|
+
user_id: 007, # bare leading zero -> kept as a string
|
|
154
|
+
zip: 02139, # ditto: zip codes keep their leading zero
|
|
155
|
+
balance: -007.50, # a sign / decimal point / exponent makes it a number
|
|
156
|
+
deleted_at: NULL # SQL / R / YAML null spelling -> nil
|
|
157
|
+
}
|
|
158
|
+
JSON
|
|
159
|
+
# => {"user_id"=>"007", "zip"=>"02139", "balance"=>-7.5, "deleted_at"=>nil}
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
A bare leading-zero integer is kept as a string so identifiers, zip codes, and account numbers don't lose their zeros; a sign, decimal point, or exponent marks numeric intent (`-007.50` → `-7.5`). `Null` and `NULL` join `null` / `None` / `undefined` as spellings of `nil`; a quoted `"NULL"` stays a string.
|
|
163
|
+
|
|
164
|
+
### Example 11: Wrapper Noise Around a Payload
|
|
149
165
|
|
|
150
166
|
#### Fenced payload
|
|
151
167
|
|
|
@@ -197,14 +213,14 @@ TEXT
|
|
|
197
213
|
# => [{"a"=>1}, {"b"=>2}]
|
|
198
214
|
```
|
|
199
215
|
|
|
200
|
-
### Example
|
|
216
|
+
### Example 12: Write JSON
|
|
201
217
|
|
|
202
218
|
```ruby
|
|
203
219
|
SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
|
|
204
220
|
SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
|
|
205
221
|
```
|
|
206
222
|
|
|
207
|
-
### Example
|
|
223
|
+
### Example 13: Write NDJSON
|
|
208
224
|
|
|
209
225
|
An Array writes one element per line:
|
|
210
226
|
|
|
@@ -212,7 +228,7 @@ An Array writes one element per line:
|
|
|
212
228
|
SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
|
|
213
229
|
```
|
|
214
230
|
|
|
215
|
-
### Example
|
|
231
|
+
### Example 14: Round-Trip Read and Write
|
|
216
232
|
|
|
217
233
|
```ruby
|
|
218
234
|
obj = { "a" => 1, "b" => [2, "three", nil, true] }
|
data/ext/smarter_json/smarter_json.c
CHANGED
|
@@ -169,13 +169,14 @@ static void fj_advance(fj_state *st, long n) {
|
|
|
169
169
|
static int fj_is_ws(int b) { return b == 0x20 || (b >= 0x09 && b <= 0x0D); }
|
|
170
170
|
|
|
171
171
|
/* Length (1..3) of the Unicode whitespace char starting at p (n bytes
|
|
172
|
-
* available), or 0. Matches Ruby's [[:space:]]
|
|
173
|
-
*
|
|
172
|
+
* available), or 0. Matches Ruby's [[:space:]], plus U+FEFF (BOM) — JSON5 / ES5 count
|
|
173
|
+
* the BOM as whitespace though Unicode White_Space does not; see smarter_json.md §4.7.
|
|
174
|
+
* Reject-gate: only C2/E1/E2/E3/EF can begin one of these chars. */
|
|
174
175
|
static long fj_mbws(const char *p, long n) {
|
|
175
176
|
int b0, b1, b2;
|
|
176
177
|
if (n < 1) return 0;
|
|
177
178
|
b0 = (unsigned char)p[0];
|
|
178
|
-
if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3)) return 0;
|
|
179
|
+
if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF) return 0;
|
|
179
180
|
if (n < 2) return 0;
|
|
180
181
|
b1 = (unsigned char)p[1];
|
|
181
182
|
if (b0 == 0xC2) return (b1 == 0xA0 || b1 == 0x85) ? 2 : 0;
|
|
@@ -188,6 +189,7 @@ static long fj_mbws(const char *p, long n) {
|
|
|
188
189
|
return 0;
|
|
189
190
|
}
|
|
190
191
|
if (b0 == 0xE3) return (b1 == 0x80 && b2 == 0x80) ? 3 : 0;
|
|
192
|
+
if (b0 == 0xEF) return (b1 == 0xBB && b2 == 0xBF) ? 3 : 0; /* U+FEFF (JSON5 / ES5 BOM ws) */
|
|
191
193
|
return 0;
|
|
192
194
|
}
|
|
193
195
|
|
|
@@ -398,8 +400,24 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
|
|
|
398
400
|
case 'n': rb_str_buf_cat(buf, "\n", 1); fj_advance(st, 1); break;
|
|
399
401
|
case 'r': rb_str_buf_cat(buf, "\r", 1); fj_advance(st, 1); break;
|
|
400
402
|
case 't': rb_str_buf_cat(buf, "\t", 1); fj_advance(st, 1); break;
|
|
403
|
+
case 'v': rb_str_buf_cat(buf, "\v", 1); fj_advance(st, 1); break; /* JSON5 / ES5 */
|
|
401
404
|
case 0x0A: fj_advance(st, 1); break; /* \<LF>: line continuation */
|
|
402
405
|
case 0x0D: fj_advance(st, 1); if (fj_byte(st) == 0x0A) fj_advance(st, 1); break;
|
|
406
|
+
case '0': /* JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden */
|
|
407
|
+
fj_advance(st, 1);
|
|
408
|
+
{ int nx = fj_byte(st); if (nx >= '0' && nx <= '9') fj_error(st, "invalid \\0 escape (octal not allowed)"); }
|
|
409
|
+
rb_str_buf_cat(buf, "\0", 1);
|
|
410
|
+
break;
|
|
411
|
+
case 'x': { /* JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8) */
|
|
412
|
+
int h1, h2;
|
|
413
|
+
fj_advance(st, 1);
|
|
414
|
+
h1 = fj_hex_val(fj_byte(st));
|
|
415
|
+
h2 = fj_hex_val(fj_byte_at(st, 1));
|
|
416
|
+
if (h1 < 0 || h2 < 0) fj_error(st, "invalid \\x escape");
|
|
417
|
+
fj_advance(st, 2);
|
|
418
|
+
fj_append_utf8(buf, (unsigned long)((h1 << 4) | h2));
|
|
419
|
+
break;
|
|
420
|
+
}
|
|
403
421
|
case 'u': {
|
|
404
422
|
unsigned long cp;
|
|
405
423
|
fj_advance(st, 1);
|
|
@@ -418,7 +436,12 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
|
|
|
418
436
|
break;
|
|
419
437
|
}
|
|
420
438
|
default:
|
|
421
|
-
|
|
439
|
+
/* ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
|
|
440
|
+
* Emit the escaped byte; a multibyte UTF-8 char's continuation bytes follow as
|
|
441
|
+
* literal content (next loop iterations), reconstructing the whole character. */
|
|
442
|
+
rb_str_buf_cat(buf, st->buf + st->pos, 1);
|
|
443
|
+
fj_advance(st, 1);
|
|
444
|
+
break;
|
|
422
445
|
}
|
|
423
446
|
} else {
|
|
424
447
|
/* Literal run between escapes: NEON-scan to the next quote/backslash and
|
|
@@ -641,16 +664,33 @@ static FJ_ALWAYS_INLINE VALUE fj_float_from_parts(fj_state *st, uint64_t m10, in
|
|
|
641
664
|
* per-byte '_' test, dropping to a slow step only when an underscore appears. */
|
|
642
665
|
static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
|
|
643
666
|
long i = 0;
|
|
644
|
-
int is_float = 0, neg = 0, has_digit = 0, overflow = 0;
|
|
667
|
+
int is_float = 0, neg = 0, has_digit = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
|
|
645
668
|
uint64_t m10 = 0;
|
|
646
669
|
int m10digits = 0, frac = 0;
|
|
647
670
|
int64_t e10 = 0;
|
|
648
671
|
|
|
649
|
-
if (i < n && (p[i] == '-' || p[i] == '+')) { neg = (p[i] == '-'); i++; }
|
|
672
|
+
if (i < n && (p[i] == '-' || p[i] == '+')) { has_sign = 1; neg = (p[i] == '-'); i++; }
|
|
650
673
|
|
|
651
|
-
/* Integer part: a single '0', or [1-9] then digits/underscores.
|
|
674
|
+
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
|
|
675
|
+
* by more digits (a leading-zero token) is consumed too but flagged: a BARE leading-zero
|
|
676
|
+
* integer (no sign / dot / exponent) is rejected below and kept as a string, so zip /
|
|
677
|
+
* account / check numbers preserve their zeros. */
|
|
652
678
|
if (i < n && p[i] == '0') {
|
|
653
679
|
has_digit = 1; m10digits = 1; i++;
|
|
680
|
+
/* Underscore-separated too (like the [1-9] branch below), so 0_5.0 / 0_0.5 behave
|
|
681
|
+
* exactly like 05.0 / 00.5 on both paths. */
|
|
682
|
+
if (i < n && ((p[i] >= '0' && p[i] <= '9') || p[i] == '_')) {
|
|
683
|
+
for (;;) {
|
|
684
|
+
while (i < n && p[i] >= '0' && p[i] <= '9') {
|
|
685
|
+
had_leading_zero = 1;
|
|
686
|
+
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(p[i] - '0'); m10digits++; }
|
|
687
|
+
else overflow = 1;
|
|
688
|
+
i++;
|
|
689
|
+
}
|
|
690
|
+
if (i < n && p[i] == '_') { i++; continue; }
|
|
691
|
+
break;
|
|
692
|
+
}
|
|
693
|
+
}
|
|
654
694
|
} else if (i < n && p[i] >= '1' && p[i] <= '9') {
|
|
655
695
|
has_digit = 1;
|
|
656
696
|
for (;;) {
|
|
@@ -699,6 +739,8 @@ static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
|
|
|
699
739
|
|
|
700
740
|
if (i != n) return 0; /* token not fully consumed -> not a number (string) */
|
|
701
741
|
if (!has_digit) return 0; /* e.g. "." or "+" -> not a number (string) */
|
|
742
|
+
/* A BARE leading-zero integer (no sign / dot / exponent) is an ID, not a number. */
|
|
743
|
+
if (had_leading_zero && !has_sign && !is_float) return 0;
|
|
702
744
|
|
|
703
745
|
if (!is_float) {
|
|
704
746
|
*out = fj_int_from_parts(m10, m10digits, neg, overflow, p, n);
|
|
@@ -730,13 +772,13 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
730
772
|
const char *p = buf + st->pos; /* buf[len] == '\0' (RSTRING_PTR) is the scan sentinel */
|
|
731
773
|
const char *np = p; /* token start, includes a leading sign */
|
|
732
774
|
long nlen;
|
|
733
|
-
int is_float = 0, neg = 0, overflow = 0;
|
|
775
|
+
int is_float = 0, neg = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
|
|
734
776
|
uint64_t m10 = 0; /* mantissa: integer + fraction digits */
|
|
735
777
|
int m10digits = 0; /* mantissa digit chars (caps the Eisel-Lemire fast path at 18) */
|
|
736
778
|
int frac = 0; /* fraction digit chars: e10 -= frac */
|
|
737
779
|
int64_t e10 = 0;
|
|
738
780
|
|
|
739
|
-
if (*p == '-' || *p == '+') { neg = (*p == '-'); p++; }
|
|
781
|
+
if (*p == '-' || *p == '+') { has_sign = 1; neg = (*p == '-'); p++; }
|
|
740
782
|
|
|
741
783
|
/* Cold branches (rare, not perf-critical): sync the cursor, reuse scalar helpers. */
|
|
742
784
|
if (*p == 'I') { st->pos = p - buf; fj_consume_keyword(st, "Infinity"); return rb_float_new(neg ? -INFINITY : INFINITY); }
|
|
@@ -755,10 +797,27 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
755
797
|
return rb_str_to_inum(hx, 16, 0);
|
|
756
798
|
}
|
|
757
799
|
|
|
758
|
-
/* Integer part: a single '0', or [1-9] then digits/underscores.
|
|
800
|
+
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
|
|
801
|
+
* by more digits is consumed but flagged; a BARE leading-zero integer (no sign / dot /
|
|
802
|
+
* exponent) is rejected after the scan — it is an ID, not a number, and has no bare
|
|
803
|
+
* top-level quoteless-string form, so it raises (matching `000001`). */
|
|
759
804
|
if (*p == '0') {
|
|
760
805
|
m10digits = 1; /* one leading zero, counted as a single mantissa digit */
|
|
761
806
|
p++;
|
|
807
|
+
/* Underscore-separated too (like the [1-9] branch below), so the underscore is just a
|
|
808
|
+
* separator (0_0.5 behaves like 00.5). */
|
|
809
|
+
if ((*p >= '0' && *p <= '9') || *p == '_') {
|
|
810
|
+
for (;;) {
|
|
811
|
+
while (*p >= '0' && *p <= '9') {
|
|
812
|
+
had_leading_zero = 1;
|
|
813
|
+
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(*p - '0'); m10digits++; }
|
|
814
|
+
else overflow = 1;
|
|
815
|
+
p++;
|
|
816
|
+
}
|
|
817
|
+
if (*p == '_') { p++; continue; }
|
|
818
|
+
break;
|
|
819
|
+
}
|
|
820
|
+
}
|
|
762
821
|
} else if (*p >= '1' && *p <= '9') {
|
|
763
822
|
for (;;) {
|
|
764
823
|
while (*p >= '0' && *p <= '9') {
|
|
@@ -811,6 +870,12 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
811
870
|
st->pos = p - buf;
|
|
812
871
|
nlen = p - np;
|
|
813
872
|
|
|
873
|
+
/* A BARE leading-zero integer is an ID, not a number; at this top-level / strict
|
|
874
|
+
* position there is no quoteless-string form, so it raises. */
|
|
875
|
+
if (had_leading_zero && !has_sign && !is_float) {
|
|
876
|
+
fj_error(st, "invalid number with a leading zero");
|
|
877
|
+
}
|
|
878
|
+
|
|
814
879
|
if (!is_float) {
|
|
815
880
|
return fj_int_from_parts(m10, m10digits, neg, overflow, np, nlen);
|
|
816
881
|
}
|
|
@@ -979,7 +1044,8 @@ static VALUE fj_classify_quoteless(fj_state *st, const char *p0, long n0) {
|
|
|
979
1044
|
|
|
980
1045
|
if (fj_tok_eq(p, n, "true") || fj_tok_eq(p, n, "True")) return Qtrue;
|
|
981
1046
|
if (fj_tok_eq(p, n, "false") || fj_tok_eq(p, n, "False")) return Qfalse;
|
|
982
|
-
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "
|
|
1047
|
+
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "Null") || fj_tok_eq(p, n, "NULL") ||
|
|
1048
|
+
fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
|
|
983
1049
|
if (fj_tok_eq(p, n, "NaN")) return rb_float_new(NAN);
|
|
984
1050
|
if (fj_tok_eq(p, n, "Infinity")) return rb_float_new(INFINITY);
|
|
985
1051
|
|
|
@@ -1273,8 +1339,10 @@ static VALUE fj_parse_value(fj_state *st) {
|
|
|
1273
1339
|
case 'T': return fj_parse_literal(st, "True", Qtrue);
|
|
1274
1340
|
case 'F': return fj_parse_literal(st, "False", Qfalse);
|
|
1275
1341
|
case 'u': return fj_parse_literal(st, "undefined", Qnil);
|
|
1276
|
-
case 'N': /* NaN (number)
|
|
1342
|
+
case 'N': /* NaN (number); None / Null / NULL (null) */
|
|
1277
1343
|
if (fj_byte_at(st, 1) == 'a') return fj_parse_number(st);
|
|
1344
|
+
if (fj_byte_at(st, 1) == 'u') return fj_parse_literal(st, "Null", Qnil);
|
|
1345
|
+
if (fj_byte_at(st, 1) == 'U') return fj_parse_literal(st, "NULL", Qnil);
|
|
1278
1346
|
return fj_parse_literal(st, "None", Qnil);
|
|
1279
1347
|
default:
|
|
1280
1348
|
if (b == '-' || b == '+' || b == '.' || b == 'I' || (b >= '0' && b <= '9')) {
|
|
@@ -1676,9 +1744,16 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
|
|
|
1676
1744
|
|
|
1677
1745
|
void Init_smarter_json(void) {
|
|
1678
1746
|
mSmarterJSON = rb_define_module("SmarterJSON");
|
|
1747
|
+
|
|
1748
|
+
rb_global_variable(&cParseError);
|
|
1679
1749
|
cParseError = rb_const_get(mSmarterJSON, rb_intern("ParseError"));
|
|
1750
|
+
|
|
1751
|
+
rb_global_variable(&cEncodingError);
|
|
1680
1752
|
cEncodingError = rb_const_get(mSmarterJSON, rb_intern("EncodingError"));
|
|
1753
|
+
|
|
1754
|
+
rb_global_variable(&cWarning);
|
|
1681
1755
|
cWarning = rb_const_get(mSmarterJSON, rb_intern("Warning"));
|
|
1756
|
+
|
|
1682
1757
|
fj_bigdecimal_id = rb_intern("BigDecimal");
|
|
1683
1758
|
fj_to_sym_id = rb_intern("to_sym");
|
|
1684
1759
|
fj_key_p_id = rb_intern("key?");
|
data/lib/smarter_json/parser.rb
CHANGED
|
@@ -739,7 +739,7 @@ module SmarterJSON
|
|
|
739
739
|
# Mantissa must carry at least one digit (int part, or a leading-dot fraction), so a
|
|
740
740
|
# bare exponent like "-e695881" is NOT a number — it falls through to a quoteless
|
|
741
741
|
# string, matching the C path. Trailing exponent stays optional.
|
|
742
|
-
DEC_RE = /\A[-+]?(?:
|
|
742
|
+
DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
|
|
743
743
|
# A decimal BigDecimal() would reject as-is: a leading dot (".5") or a dot not
|
|
744
744
|
# followed by a digit ("5.", "5.e3"). Matches iff normalize_for_bigdecimal
|
|
745
745
|
# would change the string — so when it doesn't match, we skip normalization.
|
|
@@ -756,7 +756,9 @@ module SmarterJSON
|
|
|
756
756
|
# (',' '}' ']' '{' '[') OR any whitespace ([[:space:]] covers ASCII + Unicode space,
|
|
757
757
|
# incl. LF/CR which also terminate). Stopping at a terminator/EOF means the run had no
|
|
758
758
|
# interior whitespace, so there's nothing to trim and no comment marker can apply.
|
|
759
|
-
|
|
759
|
+
#
|
|
760
|
+
# U+FEFF is JSON5/ES5 whitespace but not in [[:space:]], so we need to add it:
|
|
761
|
+
QL_BREAK = /[,{}\[\]]|[[:space:]]|#{[0xFEFF].pack("U")}/.freeze
|
|
760
762
|
|
|
761
763
|
# The defaults live centrally in SmarterJSON::Options (lib/smarter_json/options.rb).
|
|
762
764
|
DEFAULT_OPTIONS = Options::DEFAULT_OPTIONS
|
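Reviewer note on `QL_BREAK`: the comment in this hunk says U+FEFF has to be added explicitly because `[[:space:]]` does not cover it. A minimal sketch of the effect on a quoteless run follows, assuming the pattern exactly as it appears in the added line. `quoteless_run` is a hypothetical helper for illustration only; the real scanner works on byte positions inside the parser.

```ruby
# U+FEFF encoded as UTF-8 (bytes EF BB BF).
BOM = [0xFEFF].pack("U")

# Pattern as shown in the added line of the hunk above.
QL_BREAK = /[,{}\[\]]|[[:space:]]|#{BOM}/.freeze

# Hypothetical helper: returns the quoteless run at the start of `input`,
# i.e. everything before the first structural character, whitespace, or BOM.
def quoteless_run(input)
  stop = input.index(QL_BREAK)
  stop ? input[0, stop] : input
end

p quoteless_run("NULL#{BOM}42") # "NULL"
p quoteless_run("abc,def")      # "abc"
```

With the BOM alternative in place, a stray U+FEFF between two values ends the first token instead of being absorbed into it, which is the mid-stream case the changelog describes.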
|
@@ -1103,7 +1105,7 @@ module SmarterJSON
|
|
|
1103
1105
|
# Only meaningful for bytes >= 0x80.
|
|
1104
1106
|
def multibyte_ws_len(pos)
|
|
1105
1107
|
b0 = @input.getbyte(pos)
|
|
1106
|
-
return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) # reject-gate
|
|
1108
|
+
return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF # reject-gate (EF -> U+FEFF)
|
|
1107
1109
|
|
|
1108
1110
|
b1 = @input.getbyte(pos + 1)
|
|
1109
1111
|
return 0 if b1.nil?
|
|
@@ -1123,6 +1125,8 @@ module SmarterJSON
|
|
|
1123
1125
|
end
|
|
1124
1126
|
when 0xE3
|
|
1125
1127
|
return 3 if b1 == 0x80 && b2 == 0x80 # U+3000
|
|
1128
|
+
when 0xEF
|
|
1129
|
+
return 3 if b1 == 0xBB && b2 == 0xBF # U+FEFF (JSON5 / ES5 BOM ws)
|
|
1126
1130
|
end
|
|
1127
1131
|
0
|
|
1128
1132
|
end
|
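Reviewer note on the `0xEF` branch: both engines now treat the three-byte sequence `EF BB BF` as whitespace between tokens. The sketch below isolates only that branch plus ASCII whitespace, as a rough illustration under simplifying assumptions. `bom_len` and `skip_ws` are hypothetical names, and the other multibyte whitespace characters handled by `multibyte_ws_len` are omitted.

```ruby
# Hypothetical helper: length (3) of a UTF-8 encoded U+FEFF at byte offset
# `pos`, or 0 when the bytes there are anything else.
def bom_len(input, pos)
  return 0 unless input.getbyte(pos) == 0xEF # reject-gate on the lead byte

  input.getbyte(pos + 1) == 0xBB && input.getbyte(pos + 2) == 0xBF ? 3 : 0
end

# Hypothetical helper: advances past ASCII whitespace and any U+FEFF,
# returning the byte offset of the first byte that is neither.
def skip_ws(input, pos)
  loop do
    b = input.getbyte(pos)
    if b == 0x20 || (b && b >= 0x09 && b <= 0x0D)
      pos += 1
    elsif (n = bom_len(input, pos)) > 0
      pos += n
    else
      return pos
    end
  end
end

p skip_ws("\uFEFF \uFEFF1", 0) # 7
```

`String#getbyte` returns `nil` past the end of the input, so a truncated `EF BB` at the end of a buffer simply fails the comparison and is not skipped.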
|
@@ -1210,10 +1214,11 @@ module SmarterJSON
|
|
|
1210
1214
|
|
|
1211
1215
|
# Disambiguate NaN (number) from None (Python null) at a strict position.
|
|
1212
1216
|
def parse_upper_n
|
|
1213
|
-
|
|
1214
|
-
|
|
1215
|
-
|
|
1216
|
-
|
|
1217
|
+
case byte_at(1)
|
|
1218
|
+
when 0x61 then parse_number # 'a' -> NaN
|
|
1219
|
+
when 0x75 then parse_literal_keyword("Null", nil) # 'u' -> Null
|
|
1220
|
+
when 0x55 then parse_literal_keyword("NULL", nil) # 'U' -> NULL
|
|
1221
|
+
else parse_literal_keyword("None", nil)
|
|
1217
1222
|
end
|
|
1218
1223
|
end
|
|
1219
1224
|
|
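Reviewer note on `parse_upper_n`: the new dispatch picks the expected keyword from the byte that follows `N` and only then verifies the whole keyword. A self-contained sketch of that idea follows. `upper_n_literal` is a hypothetical stand-in, it returns the keyword alongside the value purely for illustration, and it assumes nothing about how the gem's `parse_literal_keyword` validates what follows the keyword.

```ruby
# Hypothetical stand-in for the dispatch on the byte after 'N'.
# Returns [keyword, value]; raises when the input does not spell the keyword.
def upper_n_literal(input, pos = 0)
  keyword, value =
    case input.getbyte(pos + 1)
    when 0x61 then ["NaN", Float::NAN] # 'a' -> NaN
    when 0x75 then ["Null", nil]       # 'u' -> Null
    when 0x55 then ["NULL", nil]       # 'U' -> NULL
    else           ["None", nil]       # anything else must spell None
    end

  unless input.byteslice(pos, keyword.bytesize) == keyword
    raise ArgumentError, "expected #{keyword} at byte #{pos}"
  end

  [keyword, value]
end

p upper_n_literal("NULL") # ["NULL", nil]
p upper_n_literal("Null") # ["Null", nil]
```

Because the fallback branch still demands the full `None` spelling, a token such as `Nope` is rejected rather than silently read as `nil`.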
|
@@ -1378,7 +1383,7 @@ module SmarterJSON
|
|
|
1378
1383
|
case str
|
|
1379
1384
|
when "true", "True" then return true
|
|
1380
1385
|
when "false", "False" then return false
|
|
1381
|
-
when "null", "None"
|
|
1386
|
+
when "null", "Null", "NULL", "None" then return nil
|
|
1382
1387
|
when "undefined" then return nil
|
|
1383
1388
|
when "NaN" then return Float::NAN
|
|
1384
1389
|
when "Infinity", "+Infinity" then return Float::INFINITY
|
|
@@ -1405,7 +1410,15 @@ module SmarterJSON
|
|
|
1405
1410
|
# number tokens that is a real per-value allocation. Underscores are rare, so only
|
|
1406
1411
|
# pay it when the token actually contains one (measured +27% on long-token decimals).
|
|
1407
1412
|
body = str.include?("_") ? str.delete("_") : str
|
|
1408
|
-
body.match?(/[.eE]/)
|
|
1413
|
+
return decimal_value(body) if body.match?(/[.eE]/)
|
|
1414
|
+
|
|
1415
|
+
# A BARE leading-zero integer (no sign / dot / exponent) is an ID — a zip code,
|
|
1416
|
+
# account number, phone number — not a number; keep it a string so the zeros survive.
|
|
1417
|
+
# A sign (+007 / -007) signals numeric intent (IDs never carry a sign), so those parse.
|
|
1418
|
+
c0 = body.getbyte(0)
|
|
1419
|
+
return NOT_NUMERIC if c0 == ZERO && body.bytesize > 1
|
|
1420
|
+
|
|
1421
|
+
body.to_i
|
|
1409
1422
|
end
|
|
1410
1423
|
|
|
1411
1424
|
# True when the token starts with [+-]?0[xX] — the only shape HEX_RE can match.
|
|
@@ -1614,6 +1627,12 @@ module SmarterJSON
|
|
|
1614
1627
|
when 0x6E then buf << "\n".b
|
|
1615
1628
|
when 0x72 then buf << "\r".b
|
|
1616
1629
|
when 0x74 then buf << "\t".b
|
|
1630
|
+
when 0x76 then buf << "\v".b # JSON5 / ES5 vertical tab
|
|
1631
|
+
when ZERO # JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden
|
|
1632
|
+
nxt = @input.getbyte(i + 1)
|
|
1633
|
+
raise error("invalid \\0 escape (octal not allowed)") if nxt && nxt >= ZERO && nxt <= NINE
|
|
1634
|
+
|
|
1635
|
+
buf << "\x00".b
|
|
1617
1636
|
when LF
|
|
1618
1637
|
# JSON5 line continuation: \<LF> emits nothing
|
|
1619
1638
|
when CR
|
|
@@ -1623,8 +1642,19 @@ module SmarterJSON
|
|
|
1623
1642
|
buf << [cp].pack("U").b
|
|
1624
1643
|
i += consumed
|
|
1625
1644
|
next
|
|
1645
|
+
when LOWER_X # JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8)
|
|
1646
|
+
hex = @input.byteslice(i + 1, 2)
|
|
1647
|
+
raise error("invalid \\x escape") unless hex && hex.bytesize == 2 && hex.b.match?(/\A\h{2}\z/)
|
|
1648
|
+
|
|
1649
|
+
buf << [hex.to_i(16)].pack("U").b
|
|
1650
|
+
i += 3
|
|
1651
|
+
next
|
|
1626
1652
|
else
|
|
1627
|
-
|
|
1653
|
+
# ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
|
|
1654
|
+
# Emit the escaped byte; a multibyte char's continuation bytes follow as literals.
|
|
1655
|
+
raise error("unterminated string escape") if esc.nil?
|
|
1656
|
+
|
|
1657
|
+
buf << esc
|
|
1628
1658
|
end
|
|
1629
1659
|
i += 1
|
|
1630
1660
|
end
|
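Reviewer note on the new escape handling: the hunks above add `\v`, `\0`, and `\xHH`, and turn an unrecognized escape into the character itself. The sketch below restates those rules over characters rather than bytes, as a minimal illustration only. `decode_escapes` and `SIMPLE_ESCAPES` are hypothetical names, `\b` and `\f` are included as standard JSON escapes even though this hunk does not show them, and `\uXXXX` and line continuation are left out.

```ruby
# Escapes that map to one fixed character (hypothetical table).
SIMPLE_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "v" => "\v", "b" => "\b", "f" => "\f"
}.freeze

# Hypothetical helper: decodes the body of a quoted string (quotes removed).
def decode_escapes(body)
  out = +""
  i = 0
  while i < body.length
    ch = body[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end

    esc = body[i + 1]
    raise ArgumentError, "unterminated string escape" if esc.nil?

    case esc
    when "x" # \xHH -> code point U+00HH, emitted as UTF-8
      hex = body[i + 2, 2]
      raise ArgumentError, "invalid \\x escape" unless hex && hex.match?(/\A\h{2}\z/)

      out << [hex.to_i(16)].pack("U")
      i += 4
    when "0" # \0 -> NUL; a following digit would be octal, which is rejected
      if body[i + 2]&.match?(/\A[0-9]\z/)
        raise ArgumentError, "invalid \\0 escape (octal not allowed)"
      end

      out << "\0"
      i += 2
    else # known simple escape, otherwise the escaped character itself
      out << SIMPLE_ESCAPES.fetch(esc, esc)
      i += 2
    end
  end
  out
end

p decode_escapes('\x41') # "A"
p decode_escapes('\q')   # "q"
```

The fallback branch is also what makes the README footnote about JSONH hold: `\e` is not in the table, so it decodes to a plain `e`.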
|
@@ -1663,10 +1693,13 @@ module SmarterJSON
|
|
|
1663
1693
|
|
|
1664
1694
|
def parse_number
|
|
1665
1695
|
negative = false
|
|
1696
|
+
signed = false
|
|
1666
1697
|
if byte == MINUS
|
|
1667
1698
|
negative = true
|
|
1699
|
+
signed = true
|
|
1668
1700
|
advance(1)
|
|
1669
1701
|
elsif byte == PLUS
|
|
1702
|
+
signed = true
|
|
1670
1703
|
advance(1)
|
|
1671
1704
|
end
|
|
1672
1705
|
|
|
@@ -1680,6 +1713,7 @@ module SmarterJSON
|
|
|
1680
1713
|
end
|
|
1681
1714
|
|
|
1682
1715
|
int_start = @pos
|
|
1716
|
+
had_leading_zero = false
|
|
1683
1717
|
|
|
1684
1718
|
if byte == ZERO
|
|
1685
1719
|
advance(1)
|
|
@@ -1692,6 +1726,16 @@ module SmarterJSON
|
|
|
1692
1726
|
value = @input.byteslice(hex_start, @pos - hex_start).delete("_").to_i(16)
|
|
1693
1727
|
return negative ? -value : value
|
|
1694
1728
|
end
|
|
1729
|
+
# A run of further digits after the single leading '0' (007, 00023, or the
|
|
1730
|
+
# underscore-separated 0_0) — consume it and flag the leading zero; the reject check
|
|
1731
|
+
# below turns a bare leading-zero integer into an error. The underscore is only a
|
|
1732
|
+
# separator, so 0_0.5 behaves like 00.5.
|
|
1733
|
+
if (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1734
|
+
while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1735
|
+
had_leading_zero = true if b >= ZERO && b <= NINE
|
|
1736
|
+
advance(1)
|
|
1737
|
+
end
|
|
1738
|
+
end
|
|
1695
1739
|
elsif byte && byte >= 0x31 && byte <= NINE
|
|
1696
1740
|
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1697
1741
|
elsif byte == DOT
|
|
@@ -1717,6 +1761,13 @@ module SmarterJSON
|
|
|
1717
1761
|
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1718
1762
|
end
|
|
1719
1763
|
|
|
1764
|
+
# A BARE leading-zero integer is an ID, not a number; at this top-level / strict
|
|
1765
|
+
# position there is no quoteless-string form, so it raises (a sign or a dot/exponent
|
|
1766
|
+
# signals numeric intent and is allowed: +007 -> 7, -000023.5 -> -23.5, 007e2 -> 700.0).
|
|
1767
|
+
if had_leading_zero && !signed && !is_float
|
|
1768
|
+
raise error("invalid number with a leading zero")
|
|
1769
|
+
end
|
|
1770
|
+
|
|
1720
1771
|
slice = @input.byteslice(int_start, @pos - int_start).delete("_")
|
|
1721
1772
|
value = is_float ? decimal_value(slice) : slice.to_i
|
|
1722
1773
|
negative ? -value : value
|
data/lib/smarter_json/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_json
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.
|
|
4
|
+
version: 1.2.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
bindir: exe
|
|
9
9
|
cert_chain: []
|
|
10
|
-
date: 2026-06-
|
|
10
|
+
date: 2026-06-16 00:00:00.000000000 Z
|
|
11
11
|
dependencies:
|
|
12
12
|
- !ruby/object:Gem::Dependency
|
|
13
13
|
name: bigdecimal
|
|
@@ -44,6 +44,7 @@ extra_rdoc_files: []
|
|
|
44
44
|
files:
|
|
45
45
|
- ".gitignore"
|
|
46
46
|
- CHANGELOG.md
|
|
47
|
+
- CONTRIBUTORS.md
|
|
47
48
|
- LICENSE.txt
|
|
48
49
|
- README.md
|
|
49
50
|
- Rakefile
|