smarter_json 1.1.2 → 1.2.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +22 -4
- data/README.md +10 -4
- data/docs/_introduction.md +1 -1
- data/docs/examples.md +20 -4
- data/ext/smarter_json/smarter_json.c +80 -12
- data/lib/smarter_json/parser.rb +70 -10
- data/lib/smarter_json/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: f66007af9616269be4ccc22a5e73d784bc9991dbb5c4245e5cef15293cbc7ab8
|
|
4
|
+
data.tar.gz: d95cecb5d4258d44ed104c8386281e60855e196e09ed5c05e47e5ebe70b4e100
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 4924b4a88250124b30a4d97b99d9c84ea644b922c4a0dea4fb4dfb185638b7fb43ccce2b638026d1b52d085fe7e5fceb94168edb190133684de0d94ab90f0399
|
|
7
|
+
data.tar.gz: 6c8493f8b1808deab9a7f039c9394c12d51076c4b670ad04df4c0b8b32d1deadfa1549d79cb2eb3200a3c2291cd652b255c061e4b210acc2426a3673a0cc33c3
|
data/CHANGELOG.md
CHANGED
|
@@ -5,17 +5,35 @@
|
|
|
5
5
|
>
|
|
6
6
|
> `SmarterJSON.process` / `SmarterJSON.process_file`
|
|
7
7
|
> both return:
|
|
8
|
-
>
|
|
9
|
-
>
|
|
10
|
-
>
|
|
8
|
+
>
|
|
9
|
+
> - `[]` for no doc
|
|
10
|
+
> - `[doc]` for one doc
|
|
11
|
+
> - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
|
|
11
12
|
|
|
12
13
|
> ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
|
|
13
|
-
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
|
|
14
|
+
> Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
|
|
15
|
+
|
|
16
|
+
## 1.2.1 (2026-06-17)
|
|
17
|
+
|
|
18
|
+
RSpec tests: 1,165
|
|
19
|
+
|
|
20
|
+
- Performance improvements
|
|
21
|
+
|
|
22
|
+
## 1.2.0 (2026-06-16)
|
|
23
|
+
|
|
24
|
+
RSpec tests: 1,097 → 1,165
|
|
25
|
+
|
|
26
|
+
- A leading-zero token now reads as a number when it carries a sign, a decimal point, or an exponent (`+007` → `7`, `-000023.5` → `-23.5`, `00.0` → `0.0`, `007e2` → `700.0`) — previously these were kept as strings. A bare leading-zero integer (`000001`, `02`) still reads as a string, so IDs, zip codes, and account numbers keep their zeros.
|
|
27
|
+
- `Null` and `NULL` are now read as `nil` (joining `null` / `None` / `undefined`), for SQL / R / PHP / YAML / DB-derived input — in every position the existing spellings work. Quoted (`"NULL"`) or embedded (`NULL Island`) forms stay strings.
|
|
28
|
+
- String escapes now cover the full JSON5 / ECMAScript set: `\xHH` hex escapes (`"\x41"` → `"A"`), `\v` (vertical tab), `\0` (null), and an unrecognized escape now yields the character itself (`"\q"` → `"q"`) instead of raising. A malformed `\x` and an octal-style `\0` followed by a digit still raise.
|
|
29
|
+
- A `U+FEFF` (BOM / zero-width no-break space) is now skipped as whitespace anywhere between tokens — matching JSON5 / ECMAScript — not only as a leading byte-order mark, so a stray BOM mid-stream (e.g. from concatenated files) no longer corrupts the adjacent value into a string. Inside a quoted string it stays content.
|
|
14
30
|
|
|
15
31
|
## 1.1.2 (2026-06-12)
|
|
16
32
|
|
|
17
33
|
RSpec tests: 1,097
|
|
18
34
|
|
|
35
|
+
### Bug Fix
|
|
36
|
+
|
|
19
37
|
- The C extension now correctly supports Ruby's GC heap compaction (`GC.compact` / auto-compaction) — its cached exception/warning classes are declared to the GC. Thanks [Jean Boussier](https://github.com/byroot) for PR [#7](https://github.com/tilo/smarter_json/pull/7).
|
|
20
38
|
|
|
21
39
|
## 1.1.1 (2026-06-11)
|
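The leading-zero rule from the 1.2.0 entry can be sketched in plain Ruby. This is a minimal illustration and not the gem's parser: `classify_token` and `NUMERIC_TOKEN` are hypothetical names, and the regex is a simplification that ignores underscores, hex, and leading-dot forms. It only tries to reproduce the behaviour the changelog states.

```ruby
# Illustrative sketch only: classify_token and NUMERIC_TOKEN are hypothetical
# names, not smarter_json API. Underscores, hex, and ".5" forms are ignored.
NUMERIC_TOKEN = /\A[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?\z/

def classify_token(token)
  # Not numeric-looking at all: keep it as a string.
  return token unless token.match?(NUMERIC_TOKEN)
  # A decimal point or an exponent marks numeric intent: read as a Float.
  return token.to_f if token.match?(/[.eE]/)
  # A bare leading-zero integer (no sign) stays a string so the zeros survive.
  return token if token.start_with?("0") && token.length > 1

  # Plain or signed integer ("+007" starts with "+", so it reaches this line).
  token.to_i
end

classify_token("+007")      # => 7
classify_token("-000023.5") # => -23.5
classify_token("007e2")     # => 700.0
classify_token("000001")    # => "000001"
```

The sign check falls out of `start_with?("0")`: a signed token begins with `+` or `-`, so it is never treated as a bare leading-zero ID.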
data/README.md
CHANGED
|
@@ -8,7 +8,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,
|
|
|
8
8
|
|
|
9
9
|
## Features at a glance
|
|
10
10
|
|
|
11
|
-
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python
|
|
11
|
+
- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python / JavaScript / SQL literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
|
|
12
12
|
- **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
|
|
13
13
|
- **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
|
|
14
14
|
- **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
|
|
@@ -73,9 +73,11 @@ Three things set it apart:
|
|
|
73
73
|
- `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
|
|
74
74
|
- Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
|
|
75
75
|
- Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
|
|
76
|
+
- Full JSON5 / ECMAScript string escapes — `\uXXXX` (with surrogate pairs), `\xHH` (`"\x41"` → `"A"`), `\v`, `\0`, line continuation; an unrecognized escape yields the character itself (`"\q"` → `"q"`)
|
|
76
77
|
- Implicit root object — a config file that starts with `key: value`, no outer `{}`
|
|
77
78
|
- `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
|
|
78
|
-
-
|
|
79
|
+
- Leading-zero numbers (which strict JSON rejects): a token with a sign, decimal point, or exponent reads as a number (`-007.5` → `-7.5`, `007e2` → `700.0`), but a bare leading-zero integer is kept as a string (`007`, `02`) so IDs, zip codes, and account numbers don't lose their zeros
|
|
80
|
+
- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`, case-variant null (`Null` / `NULL`, as SQL / R / PHP / YAML emit it)
|
|
79
81
|
- Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
|
|
80
82
|
- Duplicate keys (last value wins by default; configurable)
|
|
81
83
|
|
|
@@ -89,11 +91,15 @@ The lenient grammar is a superset of these human-JSON specs — listed once, her
|
|
|
89
91
|
* [HJSON](https://hjson.github.io/) <sup>†</sup>
|
|
90
92
|
* [JWCC / HuJSON](https://github.com/tailscale/hujson)
|
|
91
93
|
* [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
|
|
92
|
-
* [JSONH](https://github.com/jsonh-org/Jsonh)
|
|
94
|
+
* [JSONH](https://github.com/jsonh-org/Jsonh) <sup>‡</sup>
|
|
93
95
|
* [JSONC (VS Code)](https://jsonc.org/)
|
|
94
96
|
* [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
|
|
95
97
|
|
|
96
|
-
|
|
98
|
+
HJSON and JSONH are deliberate subsets. SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others.
|
|
99
|
+
|
|
100
|
+
<sup>†</sup>From **HJSON** we leave out unquoted *multi-line* strings — its quoteless string values are single-line (use a quoted or triple-quoted `'''…'''` string for multiline), because a newline-spanning unquoted string collides with newline-as-a-document-separator (NDJSON, implicit-root config).
|
|
101
|
+
|
|
102
|
+
<sup>‡</sup>From **JSONH** we take the mainstream features (quoteless keys / values, optional commas between newline-separated members, comments, hex numbers) but **not** the idiosyncratic extensions: binary (`0b`) / octal (`0o`) number literals, verbatim strings (`@"…"`), nestable block comments (`/=* *=/`), or its `\e` / `\a` escapes — the last conflict with the JSON5 / ECMAScript rule that an unrecognized escape is the character itself (`"\e"` → `"e"`). Tip: you can use quoteless strings instead of verbatim strings. Want binary or octal literals? Open an issue.
|
|
97
103
|
|
|
98
104
|
## Installation
|
|
99
105
|
|
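The string-escape rules listed in the README hunk above can be sketched as a small standalone decoder. This is an assumption-laden illustration, not the gem's code: `decode_escapes` and `SIMPLE_ESCAPES` are hypothetical names, the input is the body of a string without its quotes, and `\uXXXX` and line continuation are deliberately left out.

```ruby
# Illustrative sketch only: decode_escapes and SIMPLE_ESCAPES are hypothetical
# names, not smarter_json API. \uXXXX and line continuation are not handled.
SIMPLE_ESCAPES = { "n" => "\n", "r" => "\r", "t" => "\t", "v" => "\v",
                   "b" => "\b", "f" => "\f" }.freeze

def decode_escapes(body)
  out = +""
  i = 0
  while i < body.length
    ch = body[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end

    esc = body[i + 1]
    raise ArgumentError, "unterminated string escape" if esc.nil?

    case esc
    when "0"
      # \0 is NUL; a following digit would be an octal escape, which is rejected.
      raise ArgumentError, "invalid \\0 escape (octal not allowed)" if body[i + 2]&.match?(/\d/)
      out << "\0"
      i += 2
    when "x"
      # \xHH: exactly two hex digits, emitted as code point U+00HH in UTF-8.
      hex = body[i + 2, 2]
      raise ArgumentError, "invalid \\x escape" unless hex&.match?(/\A\h{2}\z/)
      out << [hex.to_i(16)].pack("U")
      i += 4
    when "u"
      raise NotImplementedError, "\\uXXXX is outside this sketch"
    else
      # Known one-letter escape, else the character itself ("\q" -> "q").
      out << SIMPLE_ESCAPES.fetch(esc, esc)
      i += 2
    end
  end
  out
end

decode_escapes('\x41') # => "A"
decode_escapes('\q')   # => "q"
```

Single-quoted Ruby literals are used in the calls so the backslashes reach the decoder untouched.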
data/docs/_introduction.md
CHANGED
|
@@ -29,7 +29,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
|
|
|
29
29
|
|
|
30
30
|
## What it accepts, beyond strict JSON
|
|
31
31
|
|
|
32
|
-
Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, Python (`True` / `False` / `None`)
|
|
32
|
+
Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, full JSON5 / ECMAScript string escapes (`\xHH`, `\v`, `\0`, line continuation, and an unknown escape yields the character itself), an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, leading-zero numbers (a signed / decimal / exponent token like `-007.5` is a number, a bare `007` is kept as a string so IDs keep their zeros), Python (`True` / `False` / `None`), JavaScript (`undefined`), and SQL / R / PHP / YAML (`Null` / `NULL`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
|
|
33
33
|
|
|
34
34
|
It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
|
|
35
35
|
|
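The null spellings named above are matched as whole, unquoted tokens. A minimal sketch of that check follows, assuming the token has already been isolated by a tokenizer; `NIL_SPELLINGS` and `nil_spelling?` are hypothetical names for illustration, not part of the gem's API.

```ruby
# Illustrative sketch only: NIL_SPELLINGS and nil_spelling? are hypothetical
# names. The list mirrors the spellings named in the diff.
NIL_SPELLINGS = %w[null Null NULL None undefined].freeze

# True only when the whole token is one of the null spellings, so a quoted
# "NULL" or an embedded form such as `NULL Island` is not treated as nil.
def nil_spelling?(token)
  NIL_SPELLINGS.include?(token)
end

nil_spelling?("NULL")        # => true
nil_spelling?("NULL Island") # => false
nil_spelling?('"NULL"')      # => false
```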
data/docs/examples.md
CHANGED
|
@@ -145,7 +145,23 @@ JSON
|
|
|
145
145
|
|
|
146
146
|
A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.
|
|
147
147
|
|
|
148
|
-
### Example 10:
|
|
148
|
+
### Example 10: Leading-Zero IDs and SQL `NULL`
|
|
149
|
+
|
|
150
|
+
```ruby
|
|
151
|
+
SmarterJSON.process_one(<<~JSON)
|
|
152
|
+
{
|
|
153
|
+
user_id: 007, # bare leading zero -> kept as a string
|
|
154
|
+
zip: 02139, # ditto: zip codes keep their leading zero
|
|
155
|
+
balance: -007.50, # a sign / decimal point / exponent makes it a number
|
|
156
|
+
deleted_at: NULL # SQL / R / YAML null spelling -> nil
|
|
157
|
+
}
|
|
158
|
+
JSON
|
|
159
|
+
# => {"user_id"=>"007", "zip"=>"02139", "balance"=>-7.5, "deleted_at"=>nil}
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
A bare leading-zero integer is kept as a string so identifiers, zip codes, and account numbers don't lose their zeros; a sign, decimal point, or exponent marks numeric intent (`-007.50` → `-7.5`). `Null` and `NULL` join `null` / `None` / `undefined` as spellings of `nil`; a quoted `"NULL"` stays a string.
|
|
163
|
+
|
|
164
|
+
### Example 11: Wrapper Noise Around a Payload
|
|
149
165
|
|
|
150
166
|
#### Fenced payload
|
|
151
167
|
|
|
@@ -197,14 +213,14 @@ TEXT
|
|
|
197
213
|
# => [{"a"=>1}, {"b"=>2}]
|
|
198
214
|
```
|
|
199
215
|
|
|
200
|
-
### Example
|
|
216
|
+
### Example 12: Write JSON
|
|
201
217
|
|
|
202
218
|
```ruby
|
|
203
219
|
SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
|
|
204
220
|
SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
|
|
205
221
|
```
|
|
206
222
|
|
|
207
|
-
### Example
|
|
223
|
+
### Example 13: Write NDJSON
|
|
208
224
|
|
|
209
225
|
An Array writes one element per line:
|
|
210
226
|
|
|
@@ -212,7 +228,7 @@ An Array writes one element per line:
|
|
|
212
228
|
SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
|
|
213
229
|
```
|
|
214
230
|
|
|
215
|
-
### Example
|
|
231
|
+
### Example 14: Round-Trip Read and Write
|
|
216
232
|
|
|
217
233
|
```ruby
|
|
218
234
|
obj = { "a" => 1, "b" => [2, "three", nil, true] }
|
data/ext/smarter_json/smarter_json.c
CHANGED
|
@@ -169,13 +169,14 @@ static void fj_advance(fj_state *st, long n) {
|
|
|
169
169
|
static int fj_is_ws(int b) { return b == 0x20 || (b >= 0x09 && b <= 0x0D); }
|
|
170
170
|
|
|
171
171
|
/* Length (1..3) of the Unicode whitespace char starting at p (n bytes
|
|
172
|
-
* available), or 0. Matches Ruby's [[:space:]]
|
|
173
|
-
*
|
|
172
|
+
* available), or 0. Matches Ruby's [[:space:]], plus U+FEFF (BOM) — JSON5 / ES5 count
|
|
173
|
+
* the BOM as whitespace though Unicode White_Space does not; see smarter_json.md §4.7.
|
|
174
|
+
* Reject-gate: only C2/E1/E2/E3/EF can begin one of these chars. */
|
|
174
175
|
static long fj_mbws(const char *p, long n) {
|
|
175
176
|
int b0, b1, b2;
|
|
176
177
|
if (n < 1) return 0;
|
|
177
178
|
b0 = (unsigned char)p[0];
|
|
178
|
-
if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3)) return 0;
|
|
179
|
+
if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF) return 0;
|
|
179
180
|
if (n < 2) return 0;
|
|
180
181
|
b1 = (unsigned char)p[1];
|
|
181
182
|
if (b0 == 0xC2) return (b1 == 0xA0 || b1 == 0x85) ? 2 : 0;
|
|
@@ -188,6 +189,7 @@ static long fj_mbws(const char *p, long n) {
|
|
|
188
189
|
return 0;
|
|
189
190
|
}
|
|
190
191
|
if (b0 == 0xE3) return (b1 == 0x80 && b2 == 0x80) ? 3 : 0;
|
|
192
|
+
if (b0 == 0xEF) return (b1 == 0xBB && b2 == 0xBF) ? 3 : 0; /* U+FEFF (JSON5 / ES5 BOM ws) */
|
|
191
193
|
return 0;
|
|
192
194
|
}
|
|
193
195
|
|
|
@@ -398,8 +400,24 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
|
|
|
398
400
|
case 'n': rb_str_buf_cat(buf, "\n", 1); fj_advance(st, 1); break;
|
|
399
401
|
case 'r': rb_str_buf_cat(buf, "\r", 1); fj_advance(st, 1); break;
|
|
400
402
|
case 't': rb_str_buf_cat(buf, "\t", 1); fj_advance(st, 1); break;
|
|
403
|
+
case 'v': rb_str_buf_cat(buf, "\v", 1); fj_advance(st, 1); break; /* JSON5 / ES5 */
|
|
401
404
|
case 0x0A: fj_advance(st, 1); break; /* \<LF>: line continuation */
|
|
402
405
|
case 0x0D: fj_advance(st, 1); if (fj_byte(st) == 0x0A) fj_advance(st, 1); break;
|
|
406
|
+
case '0': /* JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden */
|
|
407
|
+
fj_advance(st, 1);
|
|
408
|
+
{ int nx = fj_byte(st); if (nx >= '0' && nx <= '9') fj_error(st, "invalid \\0 escape (octal not allowed)"); }
|
|
409
|
+
rb_str_buf_cat(buf, "\0", 1);
|
|
410
|
+
break;
|
|
411
|
+
case 'x': { /* JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8) */
|
|
412
|
+
int h1, h2;
|
|
413
|
+
fj_advance(st, 1);
|
|
414
|
+
h1 = fj_hex_val(fj_byte(st));
|
|
415
|
+
h2 = fj_hex_val(fj_byte_at(st, 1));
|
|
416
|
+
if (h1 < 0 || h2 < 0) fj_error(st, "invalid \\x escape");
|
|
417
|
+
fj_advance(st, 2);
|
|
418
|
+
fj_append_utf8(buf, (unsigned long)((h1 << 4) | h2));
|
|
419
|
+
break;
|
|
420
|
+
}
|
|
403
421
|
case 'u': {
|
|
404
422
|
unsigned long cp;
|
|
405
423
|
fj_advance(st, 1);
|
|
@@ -418,7 +436,12 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
|
|
|
418
436
|
break;
|
|
419
437
|
}
|
|
420
438
|
default:
|
|
421
|
-
|
|
439
|
+
/* ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
|
|
440
|
+
* Emit the escaped byte; a multibyte UTF-8 char's continuation bytes follow as
|
|
441
|
+
* literal content (next loop iterations), reconstructing the whole character. */
|
|
442
|
+
rb_str_buf_cat(buf, st->buf + st->pos, 1);
|
|
443
|
+
fj_advance(st, 1);
|
|
444
|
+
break;
|
|
422
445
|
}
|
|
423
446
|
} else {
|
|
424
447
|
/* Literal run between escapes: NEON-scan to the next quote/backslash and
|
|
@@ -641,16 +664,33 @@ static FJ_ALWAYS_INLINE VALUE fj_float_from_parts(fj_state *st, uint64_t m10, in
|
|
|
641
664
|
* per-byte '_' test, dropping to a slow step only when an underscore appears. */
|
|
642
665
|
static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
|
|
643
666
|
long i = 0;
|
|
644
|
-
int is_float = 0, neg = 0, has_digit = 0, overflow = 0;
|
|
667
|
+
int is_float = 0, neg = 0, has_digit = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
|
|
645
668
|
uint64_t m10 = 0;
|
|
646
669
|
int m10digits = 0, frac = 0;
|
|
647
670
|
int64_t e10 = 0;
|
|
648
671
|
|
|
649
|
-
if (i < n && (p[i] == '-' || p[i] == '+')) { neg = (p[i] == '-'); i++; }
|
|
672
|
+
if (i < n && (p[i] == '-' || p[i] == '+')) { has_sign = 1; neg = (p[i] == '-'); i++; }
|
|
650
673
|
|
|
651
|
-
/* Integer part: a single '0', or [1-9] then digits/underscores.
|
|
674
|
+
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
|
|
675
|
+
* by more digits (a leading-zero token) is consumed too but flagged: a BARE leading-zero
|
|
676
|
+
* integer (no sign / dot / exponent) is rejected below and kept as a string, so zip /
|
|
677
|
+
* account / check numbers preserve their zeros. */
|
|
652
678
|
if (i < n && p[i] == '0') {
|
|
653
679
|
has_digit = 1; m10digits = 1; i++;
|
|
680
|
+
/* Underscore-separated too (like the [1-9] branch below), so 0_5.0 / 0_0.5 behave
|
|
681
|
+
* exactly like 05.0 / 00.5 on both paths. */
|
|
682
|
+
if (i < n && ((p[i] >= '0' && p[i] <= '9') || p[i] == '_')) {
|
|
683
|
+
for (;;) {
|
|
684
|
+
while (i < n && p[i] >= '0' && p[i] <= '9') {
|
|
685
|
+
had_leading_zero = 1;
|
|
686
|
+
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(p[i] - '0'); m10digits++; }
|
|
687
|
+
else overflow = 1;
|
|
688
|
+
i++;
|
|
689
|
+
}
|
|
690
|
+
if (i < n && p[i] == '_') { i++; continue; }
|
|
691
|
+
break;
|
|
692
|
+
}
|
|
693
|
+
}
|
|
654
694
|
} else if (i < n && p[i] >= '1' && p[i] <= '9') {
|
|
655
695
|
has_digit = 1;
|
|
656
696
|
for (;;) {
|
|
@@ -699,6 +739,8 @@ static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
|
|
|
699
739
|
|
|
700
740
|
if (i != n) return 0; /* token not fully consumed -> not a number (string) */
|
|
701
741
|
if (!has_digit) return 0; /* e.g. "." or "+" -> not a number (string) */
|
|
742
|
+
/* A BARE leading-zero integer (no sign / dot / exponent) is an ID, not a number. */
|
|
743
|
+
if (had_leading_zero && !has_sign && !is_float) return 0;
|
|
702
744
|
|
|
703
745
|
if (!is_float) {
|
|
704
746
|
*out = fj_int_from_parts(m10, m10digits, neg, overflow, p, n);
|
|
@@ -730,13 +772,13 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
730
772
|
const char *p = buf + st->pos; /* buf[len] == '\0' (RSTRING_PTR) is the scan sentinel */
|
|
731
773
|
const char *np = p; /* token start, includes a leading sign */
|
|
732
774
|
long nlen;
|
|
733
|
-
int is_float = 0, neg = 0, overflow = 0;
|
|
775
|
+
int is_float = 0, neg = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
|
|
734
776
|
uint64_t m10 = 0; /* mantissa: integer + fraction digits */
|
|
735
777
|
int m10digits = 0; /* mantissa digit chars (caps the Eisel-Lemire fast path at 18) */
|
|
736
778
|
int frac = 0; /* fraction digit chars: e10 -= frac */
|
|
737
779
|
int64_t e10 = 0;
|
|
738
780
|
|
|
739
|
-
if (*p == '-' || *p == '+') { neg = (*p == '-'); p++; }
|
|
781
|
+
if (*p == '-' || *p == '+') { has_sign = 1; neg = (*p == '-'); p++; }
|
|
740
782
|
|
|
741
783
|
/* Cold branches (rare, not perf-critical): sync the cursor, reuse scalar helpers. */
|
|
742
784
|
if (*p == 'I') { st->pos = p - buf; fj_consume_keyword(st, "Infinity"); return rb_float_new(neg ? -INFINITY : INFINITY); }
|
|
@@ -755,10 +797,27 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
755
797
|
return rb_str_to_inum(hx, 16, 0);
|
|
756
798
|
}
|
|
757
799
|
|
|
758
|
-
/* Integer part: a single '0', or [1-9] then digits/underscores.
|
|
800
|
+
/* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
|
|
801
|
+
* by more digits is consumed but flagged; a BARE leading-zero integer (no sign / dot /
|
|
802
|
+
* exponent) is rejected after the scan — it is an ID, not a number, and has no bare
|
|
803
|
+
* top-level quoteless-string form, so it raises (matching `000001`). */
|
|
759
804
|
if (*p == '0') {
|
|
760
805
|
m10digits = 1; /* one leading zero, counted as a single mantissa digit */
|
|
761
806
|
p++;
|
|
807
|
+
/* Underscore-separated too (like the [1-9] branch below), so the underscore is just a
|
|
808
|
+
* separator (0_0.5 behaves like 00.5). */
|
|
809
|
+
if ((*p >= '0' && *p <= '9') || *p == '_') {
|
|
810
|
+
for (;;) {
|
|
811
|
+
while (*p >= '0' && *p <= '9') {
|
|
812
|
+
had_leading_zero = 1;
|
|
813
|
+
if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(*p - '0'); m10digits++; }
|
|
814
|
+
else overflow = 1;
|
|
815
|
+
p++;
|
|
816
|
+
}
|
|
817
|
+
if (*p == '_') { p++; continue; }
|
|
818
|
+
break;
|
|
819
|
+
}
|
|
820
|
+
}
|
|
762
821
|
} else if (*p >= '1' && *p <= '9') {
|
|
763
822
|
for (;;) {
|
|
764
823
|
while (*p >= '0' && *p <= '9') {
|
|
@@ -811,6 +870,12 @@ static VALUE fj_parse_number(fj_state *st) {
|
|
|
811
870
|
st->pos = p - buf;
|
|
812
871
|
nlen = p - np;
|
|
813
872
|
|
|
873
|
+
/* A BARE leading-zero integer is an ID, not a number; at this top-level / strict
|
|
874
|
+
* position there is no quoteless-string form, so it raises. */
|
|
875
|
+
if (had_leading_zero && !has_sign && !is_float) {
|
|
876
|
+
fj_error(st, "invalid number with a leading zero");
|
|
877
|
+
}
|
|
878
|
+
|
|
814
879
|
if (!is_float) {
|
|
815
880
|
return fj_int_from_parts(m10, m10digits, neg, overflow, np, nlen);
|
|
816
881
|
}
|
|
@@ -979,7 +1044,8 @@ static VALUE fj_classify_quoteless(fj_state *st, const char *p0, long n0) {
|
|
|
979
1044
|
|
|
980
1045
|
if (fj_tok_eq(p, n, "true") || fj_tok_eq(p, n, "True")) return Qtrue;
|
|
981
1046
|
if (fj_tok_eq(p, n, "false") || fj_tok_eq(p, n, "False")) return Qfalse;
|
|
982
|
-
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "
|
|
1047
|
+
if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "Null") || fj_tok_eq(p, n, "NULL") ||
|
|
1048
|
+
fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
|
|
983
1049
|
if (fj_tok_eq(p, n, "NaN")) return rb_float_new(NAN);
|
|
984
1050
|
if (fj_tok_eq(p, n, "Infinity")) return rb_float_new(INFINITY);
|
|
985
1051
|
|
|
@@ -1273,8 +1339,10 @@ static VALUE fj_parse_value(fj_state *st) {
|
|
|
1273
1339
|
case 'T': return fj_parse_literal(st, "True", Qtrue);
|
|
1274
1340
|
case 'F': return fj_parse_literal(st, "False", Qfalse);
|
|
1275
1341
|
case 'u': return fj_parse_literal(st, "undefined", Qnil);
|
|
1276
|
-
case 'N': /* NaN (number)
|
|
1342
|
+
case 'N': /* NaN (number); None / Null / NULL (null) */
|
|
1277
1343
|
if (fj_byte_at(st, 1) == 'a') return fj_parse_number(st);
|
|
1344
|
+
if (fj_byte_at(st, 1) == 'u') return fj_parse_literal(st, "Null", Qnil);
|
|
1345
|
+
if (fj_byte_at(st, 1) == 'U') return fj_parse_literal(st, "NULL", Qnil);
|
|
1278
1346
|
return fj_parse_literal(st, "None", Qnil);
|
|
1279
1347
|
default:
|
|
1280
1348
|
if (b == '-' || b == '+' || b == '.' || b == 'I' || (b >= '0' && b <= '9')) {
|
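The whitespace change in the C hunks above (U+FEFF accepted between tokens) can be mirrored in a few lines of Ruby. This is a sketch under stated assumptions: `multibyte_ws_len` here is a standalone function that takes the input explicitly, `skip_whitespace` is a hypothetical helper, and only the branches visible in the diff (C2, E3, EF) are covered. The E1 / E2 cases are elided in the hunk and are therefore omitted here as well.

```ruby
# Illustrative sketch only. Covers the lead bytes shown in the diff:
# C2 (U+00A0, U+0085), E3 (U+3000), EF (U+FEFF). E1 / E2 are omitted.
def multibyte_ws_len(bytes, pos)
  b0 = bytes.getbyte(pos)
  b1 = bytes.getbyte(pos + 1)
  return 0 if b0.nil? || b1.nil?

  if b0 == 0xC2
    return (b1 == 0xA0 || b1 == 0x85) ? 2 : 0
  end

  b2 = bytes.getbyte(pos + 2)
  return 0 if b2.nil?

  case b0
  when 0xE3 then (b1 == 0x80 && b2 == 0x80) ? 3 : 0 # U+3000
  when 0xEF then (b1 == 0xBB && b2 == 0xBF) ? 3 : 0 # U+FEFF
  else 0
  end
end

# Hypothetical helper: advance past ASCII and multibyte whitespace.
def skip_whitespace(bytes, pos)
  loop do
    b = bytes.getbyte(pos)
    return pos if b.nil?

    if b == 0x20 || (b >= 0x09 && b <= 0x0D)
      pos += 1
    elsif b >= 0x80 && (len = multibyte_ws_len(bytes, pos)) > 0
      pos += len
    else
      return pos
    end
  end
end

input = "[1,\uFEFF2]"
skip_whitespace(input, 3) # => 6, the byte offset of "2"
```

With the BOM counted as whitespace, the value after a stray mid-stream U+FEFF starts at a clean token boundary instead of being glued to the three BOM bytes.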
data/lib/smarter_json/parser.rb
CHANGED
|
@@ -739,7 +739,7 @@ module SmarterJSON
|
|
|
739
739
|
# Mantissa must carry at least one digit (int part, or a leading-dot fraction), so a
|
|
740
740
|
# bare exponent like "-e695881" is NOT a number — it falls through to a quoteless
|
|
741
741
|
# string, matching the C path. Trailing exponent stays optional.
|
|
742
|
-
DEC_RE = /\A[-+]?(?:
|
|
742
|
+
DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
|
|
743
743
|
# A decimal BigDecimal() would reject as-is: a leading dot (".5") or a dot not
|
|
744
744
|
# followed by a digit ("5.", "5.e3"). Matches iff normalize_for_bigdecimal
|
|
745
745
|
# would change the string — so when it doesn't match, we skip normalization.
|
|
@@ -756,6 +756,10 @@ module SmarterJSON
|
|
|
756
756
|
# (',' '}' ']' '{' '[') OR any whitespace ([[:space:]] covers ASCII + Unicode space,
|
|
757
757
|
# incl. LF/CR which also terminate). Stopping at a terminator/EOF means the run had no
|
|
758
758
|
# interior whitespace, so there's nothing to trim and no comment marker can apply.
|
|
759
|
+
#
|
|
760
|
+
# U+FEFF is JSON5/ES5 whitespace but NOT in [[:space:]]. It is deliberately kept OUT of
|
|
761
|
+
# this regex: a multibyte alternative defeats byteindex's fast byte-search (~3.3x slower
|
|
762
|
+
# on number-dense input). A trailing U+FEFF is trimmed cheaply in the fast path below.
|
|
759
763
|
QL_BREAK = /[,{}\[\]]|[[:space:]]/.freeze
|
|
760
764
|
|
|
761
765
|
# The defaults live centrally in SmarterJSON::Options (lib/smarter_json/options.rb).
|
|
@@ -1103,7 +1107,7 @@ module SmarterJSON
|
|
|
1103
1107
|
# Only meaningful for bytes >= 0x80.
|
|
1104
1108
|
def multibyte_ws_len(pos)
|
|
1105
1109
|
b0 = @input.getbyte(pos)
|
|
1106
|
-
return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) # reject-gate
|
|
1110
|
+
return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF # reject-gate (EF -> U+FEFF)
|
|
1107
1111
|
|
|
1108
1112
|
b1 = @input.getbyte(pos + 1)
|
|
1109
1113
|
return 0 if b1.nil?
|
|
@@ -1123,6 +1127,8 @@ module SmarterJSON
|
|
|
1123
1127
|
end
|
|
1124
1128
|
when 0xE3
|
|
1125
1129
|
return 3 if b1 == 0x80 && b2 == 0x80 # U+3000
|
|
1130
|
+
when 0xEF
|
|
1131
|
+
return 3 if b1 == 0xBB && b2 == 0xBF # U+FEFF (JSON5 / ES5 BOM ws)
|
|
1126
1132
|
end
|
|
1127
1133
|
0
|
|
1128
1134
|
end
|
|
@@ -1210,10 +1216,11 @@ module SmarterJSON
|
|
|
1210
1216
|
|
|
1211
1217
|
# Disambiguate NaN (number) from None (Python null) at a strict position.
|
|
1212
1218
|
def parse_upper_n
|
|
1213
|
-
|
|
1214
|
-
|
|
1215
|
-
|
|
1216
|
-
|
|
1219
|
+
case byte_at(1)
|
|
1220
|
+
when 0x61 then parse_number # 'a' -> NaN
|
|
1221
|
+
when 0x75 then parse_literal_keyword("Null", nil) # 'u' -> Null
|
|
1222
|
+
when 0x55 then parse_literal_keyword("NULL", nil) # 'U' -> NULL
|
|
1223
|
+
else parse_literal_keyword("None", nil)
|
|
1217
1224
|
end
|
|
1218
1225
|
end
|
|
1219
1226
|
|
|
@@ -1345,7 +1352,14 @@ module SmarterJSON
|
|
|
1345
1352
|
b = hit < @bytesize ? input.getbyte(hit) : nil
|
|
1346
1353
|
if b.nil? || b == COMMA || b == RBRACE || b == RBRACKET || b == LBRACE || b == LBRACKET || b == LF || b == CR
|
|
1347
1354
|
@pos = hit
|
|
1348
|
-
|
|
1355
|
+
# A trailing U+FEFF (EF BB BF) is JSON5/ES5 whitespace but not in QL_BREAK, so
|
|
1356
|
+
# byteindex scanned past it into the run — trim it (and a run of them). On the
|
|
1357
|
+
# common path the last byte is a digit/letter, so the first compare fails at once.
|
|
1358
|
+
fin = hit
|
|
1359
|
+
while fin - 3 >= pos && input.getbyte(fin - 1) == 0xBF && input.getbyte(fin - 2) == 0xBB && input.getbyte(fin - 3) == 0xEF
|
|
1360
|
+
fin -= 3
|
|
1361
|
+
end
|
|
1362
|
+
return fin
|
|
1349
1363
|
end
|
|
1350
1364
|
end
|
|
1351
1365
|
|
|
@@ -1378,7 +1392,7 @@ module SmarterJSON
|
|
|
1378
1392
|
case str
|
|
1379
1393
|
when "true", "True" then return true
|
|
1380
1394
|
when "false", "False" then return false
|
|
1381
|
-
when "null", "None"
|
|
1395
|
+
when "null", "Null", "NULL", "None" then return nil
|
|
1382
1396
|
when "undefined" then return nil
|
|
1383
1397
|
when "NaN" then return Float::NAN
|
|
1384
1398
|
when "Infinity", "+Infinity" then return Float::INFINITY
|
|
@@ -1405,7 +1419,15 @@ module SmarterJSON
|
|
|
1405
1419
|
# number tokens that is a real per-value allocation. Underscores are rare, so only
|
|
1406
1420
|
# pay it when the token actually contains one (measured +27% on long-token decimals).
|
|
1407
1421
|
body = str.include?("_") ? str.delete("_") : str
|
|
1408
|
-
body.match?(/[.eE]/)
|
|
1422
|
+
return decimal_value(body) if body.match?(/[.eE]/)
|
|
1423
|
+
|
|
1424
|
+
# A BARE leading-zero integer (no sign / dot / exponent) is an ID — a zip code,
|
|
1425
|
+
# account number, phone number — not a number; keep it a string so the zeros survive.
|
|
1426
|
+
# A sign (+007 / -007) signals numeric intent (IDs never carry a sign), so those parse.
|
|
1427
|
+
c0 = body.getbyte(0)
|
|
1428
|
+
return NOT_NUMERIC if c0 == ZERO && body.bytesize > 1
|
|
1429
|
+
|
|
1430
|
+
body.to_i
|
|
1409
1431
|
end
|
|
1410
1432
|
|
|
1411
1433
|
# True when the token starts with [+-]?0[xX] — the only shape HEX_RE can match.
|
|
@@ -1614,6 +1636,12 @@ module SmarterJSON
|
|
|
1614
1636
|
when 0x6E then buf << "\n".b
|
|
1615
1637
|
when 0x72 then buf << "\r".b
|
|
1616
1638
|
when 0x74 then buf << "\t".b
|
|
1639
|
+
when 0x76 then buf << "\v".b # JSON5 / ES5 vertical tab
|
|
1640
|
+
when ZERO # JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden
|
|
1641
|
+
nxt = @input.getbyte(i + 1)
|
|
1642
|
+
raise error("invalid \\0 escape (octal not allowed)") if nxt && nxt >= ZERO && nxt <= NINE
|
|
1643
|
+
|
|
1644
|
+
buf << "\x00".b
|
|
1617
1645
|
when LF
|
|
1618
1646
|
# JSON5 line continuation: \<LF> emits nothing
|
|
1619
1647
|
when CR
|
|
@@ -1623,8 +1651,19 @@ module SmarterJSON
|
|
|
1623
1651
|
buf << [cp].pack("U").b
|
|
1624
1652
|
i += consumed
|
|
1625
1653
|
next
|
|
1654
|
+
when LOWER_X # JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8)
|
|
1655
|
+
hex = @input.byteslice(i + 1, 2)
|
|
1656
|
+
raise error("invalid \\x escape") unless hex && hex.bytesize == 2 && hex.b.match?(/\A\h{2}\z/)
|
|
1657
|
+
|
|
1658
|
+
buf << [hex.to_i(16)].pack("U").b
|
|
1659
|
+
i += 3
|
|
1660
|
+
next
|
|
1626
1661
|
else
|
|
1627
|
-
|
|
1662
|
+
# ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
|
|
1663
|
+
# Emit the escaped byte; a multibyte char's continuation bytes follow as literals.
|
|
1664
|
+
raise error("unterminated string escape") if esc.nil?
|
|
1665
|
+
|
|
1666
|
+
buf << esc
|
|
1628
1667
|
end
|
|
1629
1668
|
i += 1
|
|
1630
1669
|
end
|
|
@@ -1663,10 +1702,13 @@ module SmarterJSON
|
|
|
1663
1702
|
|
|
1664
1703
|
def parse_number
|
|
1665
1704
|
negative = false
|
|
1705
|
+
signed = false
|
|
1666
1706
|
if byte == MINUS
|
|
1667
1707
|
negative = true
|
|
1708
|
+
signed = true
|
|
1668
1709
|
advance(1)
|
|
1669
1710
|
elsif byte == PLUS
|
|
1711
|
+
signed = true
|
|
1670
1712
|
advance(1)
|
|
1671
1713
|
end
|
|
1672
1714
|
|
|
@@ -1680,6 +1722,7 @@ module SmarterJSON
|
|
|
1680
1722
|
end
|
|
1681
1723
|
|
|
1682
1724
|
int_start = @pos
|
|
1725
|
+
had_leading_zero = false
|
|
1683
1726
|
|
|
1684
1727
|
if byte == ZERO
|
|
1685
1728
|
advance(1)
|
|
@@ -1692,6 +1735,16 @@ module SmarterJSON
|
|
|
1692
1735
|
value = @input.byteslice(hex_start, @pos - hex_start).delete("_").to_i(16)
|
|
1693
1736
|
return negative ? -value : value
|
|
1694
1737
|
end
|
|
1738
|
+
# A run of further digits after the single leading '0' (007, 00023, or the
|
|
1739
|
+
# underscore-separated 0_0) — consume it and flag the leading zero; the reject check
|
|
1740
|
+
# below turns a bare leading-zero integer into an error. The underscore is only a
|
|
1741
|
+
# separator, so 0_0.5 behaves like 00.5.
|
|
1742
|
+
if (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1743
|
+
while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1744
|
+
had_leading_zero = true if b >= ZERO && b <= NINE
|
|
1745
|
+
advance(1)
|
|
1746
|
+
end
|
|
1747
|
+
end
|
|
1695
1748
|
elsif byte && byte >= 0x31 && byte <= NINE
|
|
1696
1749
|
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1697
1750
|
elsif byte == DOT
|
|
@@ -1717,6 +1770,13 @@ module SmarterJSON
|
|
|
1717
1770
|
advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
|
|
1718
1771
|
end
|
|
1719
1772
|
|
|
1773
|
+
# A BARE leading-zero integer is an ID, not a number; at this top-level / strict
|
|
1774
|
+
# position there is no quoteless-string form, so it raises (a sign or a dot/exponent
|
|
1775
|
+
# signals numeric intent and is allowed: +007 -> 7, -000023.5 -> -23.5, 007e2 -> 700.0).
|
|
1776
|
+
if had_leading_zero && !signed && !is_float
|
|
1777
|
+
raise error("invalid number with a leading zero")
|
|
1778
|
+
end
|
|
1779
|
+
|
|
1720
1780
|
slice = @input.byteslice(int_start, @pos - int_start).delete("_")
|
|
1721
1781
|
value = is_float ? decimal_value(slice) : slice.to_i
|
|
1722
1782
|
negative ? -value : value
|
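The trailing-U+FEFF trim added to the quoteless fast path in the parser.rb hunks above can be exercised in isolation. A minimal sketch, assuming the scan has already stopped at a terminator: `trim_trailing_bom` and its parameters are hypothetical names, and the loop body follows the one shown in the diff.

```ruby
# Illustrative sketch only: trim_trailing_bom is a hypothetical name.
# start is the first byte of the quoteless run, stop the byte offset of the
# terminator the scan stopped at. Returns the end of the run with any
# trailing U+FEFF sequences (EF BB BF) removed.
def trim_trailing_bom(input, start, stop)
  fin = stop
  while fin - 3 >= start &&
        input.getbyte(fin - 1) == 0xBF &&
        input.getbyte(fin - 2) == 0xBB &&
        input.getbyte(fin - 3) == 0xEF
    fin -= 3
  end
  fin
end

input = "42\uFEFF\uFEFF,"
stop  = input.bytesize - 1          # byte offset of the comma
fin   = trim_trailing_bom(input, 0, stop)
input.byteslice(0, fin)             # => "42"
```

On the common path the last byte before the terminator is a digit or letter, so the first comparison fails and the loop exits without doing any work.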
data/lib/smarter_json/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,13 +1,13 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: smarter_json
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 1.1
|
|
4
|
+
version: 1.2.1
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Tilo Sloboda
|
|
8
8
|
bindir: exe
|
|
9
9
|
cert_chain: []
|
|
10
|
-
date: 2026-06-
|
|
10
|
+
date: 2026-06-17 00:00:00.000000000 Z
|
|
11
11
|
dependencies:
|
|
12
12
|
- !ruby/object:Gem::Dependency
|
|
13
13
|
name: bigdecimal
|