smarter_json 1.1.1 → 1.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 480227c64ed99ba271fe95c6e7c79be6871f6bbf5c7f8f03d9bf118bb6bd7051
-   data.tar.gz: 966190c8f2e316e3664e381bf738831259b0e469a783f2fa3a25edf5f198d41b
+   metadata.gz: bf6191c05bd9049082a1362f4bfa5ab3240c4690bc59c1675c716c0901d5c6cb
+   data.tar.gz: 56f1d5418b20d8bad23694f7831dfb335ab4737dd34e9184b9c20113c41fd3aa
  SHA512:
-   metadata.gz: d53d1a84aa5ce83a2243a9adcdcddc284933dc8c357e08c8aaaec8e5e005164dfdec37d70c41bce0b2a3abf22dc7902e062adbbe3ffaa947b2232836bd01a86c
-   data.tar.gz: ae44f83437fe6c75227822a234658abb2157caf0378c8b998bab25aa40b4f1286da9a85a482be04b1be4ab17b142779140d0892ada81ae7d4342a0b7f706e9b7
+   metadata.gz: 64c3511d1f21662b703ee1a02876ba8f05401ebcd48b1025e7290a50c49a5fc1d74623c94477b6fc7006ea782028698e298010b3e827f5ed0315fa1f7e88f595
+   data.tar.gz: c2801263204013c23954d7f4489f7f4f38f74802c42da5cf2ba2e5965decfc6c00dd304a7ff16f9f96741e4103fd499f36efa8a8ba5ec7abfdb40dded1c996a9
data/CHANGELOG.md CHANGED
@@ -5,12 +5,30 @@
  >
  > `SmarterJSON.process` / `SmarterJSON.process_file`
  > both return:
- > `[]` for no doc
- > - `[doc]` for one doc
- > - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
+ >
+ > `[]` for no doc
+ > - `[doc]` for one doc
+ > - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)

  > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
- > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
+ > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
+
+ ## 1.2.0 (2026-06-16)
+
+ RSpec tests: 1,097 → 1,165
+
+ - A leading-zero token now reads as a number when it carries a sign, a decimal point, or an exponent (`+007` → `7`, `-000023.5` → `-23.5`, `00.0` → `0.0`, `007e2` → `700.0`) — previously these were kept as strings. A bare leading-zero integer (`000001`, `02`) still reads as a string, so IDs, zip codes, and account numbers keep their zeros.
+ - `Null` and `NULL` are now read as `nil` (joining `null` / `None` / `undefined`), for SQL / R / PHP / YAML / DB-derived input — in every position the existing spellings work. Quoted (`"NULL"`) or embedded (`NULL Island`) forms stay strings.
+ - String escapes now cover the full JSON5 / ECMAScript set: `\xHH` hex escapes (`"\x41"` → `"A"`), `\v` (vertical tab), `\0` (null), and an unrecognized escape now yields the character itself (`"\q"` → `"q"`) instead of raising. A malformed `\x` and an octal-style `\0` followed by a digit still raise.
+ - A `U+FEFF` (BOM / zero-width no-break space) is now skipped as whitespace anywhere between tokens — matching JSON5 / ECMAScript — not only as a leading byte-order mark, so a stray BOM mid-stream (e.g. from concatenated files) no longer corrupts the adjacent value into a string. Inside a quoted string it stays content.
+
+ ## 1.1.2 (2026-06-12)
+
+ RSpec tests: 1,097
+
+ ### Bug Fix
+
+ - The C extension now correctly supports Ruby's GC heap compaction (`GC.compact` / auto-compaction) — its cached exception/warning classes are declared to the GC. Thanks [Jean Boussier](https://github.com/byroot) for PR [#7](https://github.com/tilo/smarter_json/pull/7).

  ## 1.1.1 (2026-06-11)

data/CONTRIBUTORS.md ADDED
@@ -0,0 +1,6 @@
+ # A Big Thank You to all Contributors!!
+
+
+ A Big Thank you to everyone who filed issues, sent comments, and who contributed with pull requests:
+
+ * [Jean Boussier](https://github.com/byroot)
data/README.md CHANGED
@@ -8,7 +8,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,

  ## Features at a glance

- - **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
+ - **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python / JavaScript / SQL literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
  - **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
  - **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
  - **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
@@ -73,9 +73,11 @@ Three things set it apart:
  - `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
  - Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
  - Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
+ - Full JSON5 / ECMAScript string escapes — `\uXXXX` (with surrogate pairs), `\xHH` (`"\x41"` → `"A"`), `\v`, `\0`, line continuation; an unrecognized escape yields the character itself (`"\q"` → `"q"`)
  - Implicit root object — a config file that starts with `key: value`, no outer `{}`
  - `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
- - UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`
+ - Leading-zero numbers (which strict JSON rejects): a token with a sign, decimal point, or exponent reads as a number (`-007.5` `-7.5`, `007e2` `700.0`), but a bare leading-zero integer is kept as a string (`007`, `02`) so IDs, zip codes, and account numbers don't lose their zeros
+ - UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`, case-variant null (`Null` / `NULL`, as SQL / R / PHP / YAML emit it)
  - Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
  - Duplicate keys (last value wins by default; configurable)

@@ -89,11 +91,15 @@ The lenient grammar is a superset of these human-JSON specs — listed once, her
  * [HJSON](https://hjson.github.io/) <sup>†</sup>
  * [JWCC / HuJSON](https://github.com/tailscale/hujson)
  * [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
- * [JSONH](https://github.com/jsonh-org/Jsonh)
+ * [JSONH](https://github.com/jsonh-org/Jsonh) <sup>‡</sup>
  * [JSONC (VS Code)](https://jsonc.org/)
  * [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).

- <sup>†</sup> A deliberate subset. SmarterJSON's quoteless (unquoted) string values are single-line — it does **not** parse HJSON's unquoted multi-line strings; use a quoted or triple-quoted (`'''…'''`) string for multiline. This is by design: SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others — and an unquoted string that may span newlines collides with newline-as-a-document-separator (NDJSON, implicit-root config), so it is left out.
+ HJSON and JSONH are deliberate subsets. SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others.
+
+ <sup>†</sup>From **HJSON** we leave out unquoted *multi-line* strings — its quoteless string values are single-line (use a quoted or triple-quoted `'''…'''` string for multiline), because a newline-spanning unquoted string collides with newline-as-a-document-separator (NDJSON, implicit-root config).
+
+ <sup>‡</sup>From **JSONH** we take the mainstream features (quoteless keys / values, optional commas between newline-separated members, comments, hex numbers) but **not** the idiosyncratic extensions: binary (`0b`) / octal (`0o`) number literals, verbatim strings (`@"…"`), nestable block comments (`/=* *=/`), or its `\e` / `\a` escapes — the last conflict with the JSON5 / ECMAScript rule that an unrecognized escape is the character itself (`"\e"` → `"e"`). Tip: you can use quoteless strings instead of verbatim strings. Want binary or octal literals? Open an issue.

  ## Installation

@@ -359,6 +365,8 @@ Both the C extension and the pure-Ruby engine are **iterative, not recursive**
  The trade-off: there is currently **no fixed nesting or input-size limit**, so extremely large or adversarially-nested untrusted input is bounded by memory (it can exhaust RAM), not by a crash. If you process untrusted input and want a hard cap, that's a planned opt-in guard — for now, size-limit upstream.


+ # [A Special Thanks to all Contributors!](CONTRIBUTORS.md) 🎉🎉🎉
+
  ## Development

  After checking out the repo, run `bin/setup` to install dependencies, then `rake compile` to build the C extension and `rake spec` to run the tests. The test suite runs every example against **both** the C and pure-Ruby paths, so the two stay behavior-identical.
@@ -29,7 +29,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak

  ## What it accepts, beyond strict JSON

- Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, Python (`True` / `False` / `None`) and JavaScript (`undefined`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
+ Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, full JSON5 / ECMAScript string escapes (`\xHH`, `\v`, `\0`, line continuation, and an unknown escape yields the character itself), an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, leading-zero numbers (a signed / decimal / exponent token like `-007.5` is a number, a bare `007` is kept as a string so IDs keep their zeros), Python (`True` / `False` / `None`), JavaScript (`undefined`), and SQL / R / PHP / YAML (`Null` / `NULL`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.

  It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.

data/docs/examples.md CHANGED
@@ -145,7 +145,23 @@ JSON

  A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.

- ### Example 10: Wrapper Noise Around a Payload
+ ### Example 10: Leading-Zero IDs and SQL `NULL`
+
+ ```ruby
+ SmarterJSON.process_one(<<~JSON)
+   {
+     user_id: 007, # bare leading zero -> kept as a string
+     zip: 02139, # ditto: zip codes keep their leading zero
+     balance: -007.50, # a sign / decimal point / exponent makes it a number
+     deleted_at: NULL # SQL / R / YAML null spelling -> nil
+   }
+ JSON
+ # => {"user_id"=>"007", "zip"=>"02139", "balance"=>-7.5, "deleted_at"=>nil}
+ ```
+
+ A bare leading-zero integer is kept as a string so identifiers, zip codes, and account numbers don't lose their zeros; a sign, decimal point, or exponent marks numeric intent (`-007.50` → `-7.5`). `Null` and `NULL` join `null` / `None` / `undefined` as spellings of `nil`; a quoted `"NULL"` stays a string.
+
+ ### Example 11: Wrapper Noise Around a Payload

  #### Fenced payload

@@ -197,14 +213,14 @@ TEXT
  # => [{"a"=>1}, {"b"=>2}]
  ```

- ### Example 11: Write JSON
+ ### Example 12: Write JSON

  ```ruby
  SmarterJSON.generate({ "a" => 1, "b" => [2, 3] }) # => '{"a":1,"b":[2,3]}'
  SmarterJSON.generate([1, 2, 3]) # => '[1,2,3]'
  ```

- ### Example 12: Write NDJSON
+ ### Example 13: Write NDJSON

  An Array writes one element per line:

@@ -212,7 +228,7 @@ An Array writes one element per line:
  SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson) # => "{\"id\":1}\n{\"id\":2}\n"
  ```

- ### Example 13: Round-Trip Read and Write
+ ### Example 14: Round-Trip Read and Write

  ```ruby
  obj = { "a" => 1, "b" => [2, "three", nil, true] }
@@ -169,13 +169,14 @@ static void fj_advance(fj_state *st, long n) {
  static int fj_is_ws(int b) { return b == 0x20 || (b >= 0x09 && b <= 0x0D); }

  /* Length (1..3) of the Unicode whitespace char starting at p (n bytes
-  * available), or 0. Matches Ruby's [[:space:]]; see smarter_json.md §4.7.
-  * Reject-gate: only C2/E1/E2/E3 can begin a whitespace char. */
+  * available), or 0. Matches Ruby's [[:space:]], plus U+FEFF (BOM) — JSON5 / ES5 count
+  * the BOM as whitespace though Unicode White_Space does not; see smarter_json.md §4.7.
+  * Reject-gate: only C2/E1/E2/E3/EF can begin one of these chars. */
  static long fj_mbws(const char *p, long n) {
    int b0, b1, b2;
    if (n < 1) return 0;
    b0 = (unsigned char)p[0];
-   if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3)) return 0;
+   if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF) return 0;
    if (n < 2) return 0;
    b1 = (unsigned char)p[1];
    if (b0 == 0xC2) return (b1 == 0xA0 || b1 == 0x85) ? 2 : 0;
@@ -188,6 +189,7 @@ static long fj_mbws(const char *p, long n) {
      return 0;
    }
    if (b0 == 0xE3) return (b1 == 0x80 && b2 == 0x80) ? 3 : 0;
+   if (b0 == 0xEF) return (b1 == 0xBB && b2 == 0xBF) ? 3 : 0; /* U+FEFF (JSON5 / ES5 BOM ws) */
    return 0;
  }

@@ -398,8 +400,24 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
  case 'n': rb_str_buf_cat(buf, "\n", 1); fj_advance(st, 1); break;
  case 'r': rb_str_buf_cat(buf, "\r", 1); fj_advance(st, 1); break;
  case 't': rb_str_buf_cat(buf, "\t", 1); fj_advance(st, 1); break;
+ case 'v': rb_str_buf_cat(buf, "\v", 1); fj_advance(st, 1); break; /* JSON5 / ES5 */
  case 0x0A: fj_advance(st, 1); break; /* \<LF>: line continuation */
  case 0x0D: fj_advance(st, 1); if (fj_byte(st) == 0x0A) fj_advance(st, 1); break;
+ case '0': /* JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden */
+   fj_advance(st, 1);
+   { int nx = fj_byte(st); if (nx >= '0' && nx <= '9') fj_error(st, "invalid \\0 escape (octal not allowed)"); }
+   rb_str_buf_cat(buf, "\0", 1);
+   break;
+ case 'x': { /* JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8) */
+   int h1, h2;
+   fj_advance(st, 1);
+   h1 = fj_hex_val(fj_byte(st));
+   h2 = fj_hex_val(fj_byte_at(st, 1));
+   if (h1 < 0 || h2 < 0) fj_error(st, "invalid \\x escape");
+   fj_advance(st, 2);
+   fj_append_utf8(buf, (unsigned long)((h1 << 4) | h2));
+   break;
+ }
  case 'u': {
    unsigned long cp;
    fj_advance(st, 1);
@@ -418,7 +436,12 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
      break;
    }
    default:
-     fj_error(st, "invalid escape");
+     /* ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
+      * Emit the escaped byte; a multibyte UTF-8 char's continuation bytes follow as
+      * literal content (next loop iterations), reconstructing the whole character. */
+     rb_str_buf_cat(buf, st->buf + st->pos, 1);
+     fj_advance(st, 1);
+     break;
    }
  } else {
    /* Literal run between escapes: NEON-scan to the next quote/backslash and
@@ -641,16 +664,33 @@ static FJ_ALWAYS_INLINE VALUE fj_float_from_parts(fj_state *st, uint64_t m10, in
   * per-byte '_' test, dropping to a slow step only when an underscore appears. */
  static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
    long i = 0;
-   int is_float = 0, neg = 0, has_digit = 0, overflow = 0;
+   int is_float = 0, neg = 0, has_digit = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
    uint64_t m10 = 0;
    int m10digits = 0, frac = 0;
    int64_t e10 = 0;

-   if (i < n && (p[i] == '-' || p[i] == '+')) { neg = (p[i] == '-'); i++; }
+   if (i < n && (p[i] == '-' || p[i] == '+')) { has_sign = 1; neg = (p[i] == '-'); i++; }

-   /* Integer part: a single '0', or [1-9] then digits/underscores. */
+   /* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
+    * by more digits (a leading-zero token) is consumed too but flagged: a BARE leading-zero
+    * integer (no sign / dot / exponent) is rejected below and kept as a string, so zip /
+    * account / check numbers preserve their zeros. */
    if (i < n && p[i] == '0') {
      has_digit = 1; m10digits = 1; i++;
+     /* Underscore-separated too (like the [1-9] branch below), so 0_5.0 / 0_0.5 behave
+      * exactly like 05.0 / 00.5 on both paths. */
+     if (i < n && ((p[i] >= '0' && p[i] <= '9') || p[i] == '_')) {
+       for (;;) {
+         while (i < n && p[i] >= '0' && p[i] <= '9') {
+           had_leading_zero = 1;
+           if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(p[i] - '0'); m10digits++; }
+           else overflow = 1;
+           i++;
+         }
+         if (i < n && p[i] == '_') { i++; continue; }
+         break;
+       }
+     }
    } else if (i < n && p[i] >= '1' && p[i] <= '9') {
      has_digit = 1;
      for (;;) {
@@ -699,6 +739,8 @@ static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {

  if (i != n) return 0; /* token not fully consumed -> not a number (string) */
  if (!has_digit) return 0; /* e.g. "." or "+" -> not a number (string) */
+ /* A BARE leading-zero integer (no sign / dot / exponent) is an ID, not a number. */
+ if (had_leading_zero && !has_sign && !is_float) return 0;

  if (!is_float) {
    *out = fj_int_from_parts(m10, m10digits, neg, overflow, p, n);
@@ -730,13 +772,13 @@ static VALUE fj_parse_number(fj_state *st) {
  const char *p = buf + st->pos; /* buf[len] == '\0' (RSTRING_PTR) is the scan sentinel */
  const char *np = p; /* token start, includes a leading sign */
  long nlen;
- int is_float = 0, neg = 0, overflow = 0;
+ int is_float = 0, neg = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
  uint64_t m10 = 0; /* mantissa: integer + fraction digits */
  int m10digits = 0; /* mantissa digit chars (caps the Eisel-Lemire fast path at 18) */
  int frac = 0; /* fraction digit chars: e10 -= frac */
  int64_t e10 = 0;

- if (*p == '-' || *p == '+') { neg = (*p == '-'); p++; }
+ if (*p == '-' || *p == '+') { has_sign = 1; neg = (*p == '-'); p++; }

  /* Cold branches (rare, not perf-critical): sync the cursor, reuse scalar helpers. */
  if (*p == 'I') { st->pos = p - buf; fj_consume_keyword(st, "Infinity"); return rb_float_new(neg ? -INFINITY : INFINITY); }
@@ -755,10 +797,27 @@ static VALUE fj_parse_number(fj_state *st) {
    return rb_str_to_inum(hx, 16, 0);
  }

- /* Integer part: a single '0', or [1-9] then digits/underscores. */
+ /* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
+  * by more digits is consumed but flagged; a BARE leading-zero integer (no sign / dot /
+  * exponent) is rejected after the scan — it is an ID, not a number, and has no bare
+  * top-level quoteless-string form, so it raises (matching `000001`). */
  if (*p == '0') {
    m10digits = 1; /* one leading zero, counted as a single mantissa digit */
    p++;
+   /* Underscore-separated too (like the [1-9] branch below), so the underscore is just a
+    * separator (0_0.5 behaves like 00.5). */
+   if ((*p >= '0' && *p <= '9') || *p == '_') {
+     for (;;) {
+       while (*p >= '0' && *p <= '9') {
+         had_leading_zero = 1;
+         if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(*p - '0'); m10digits++; }
+         else overflow = 1;
+         p++;
+       }
+       if (*p == '_') { p++; continue; }
+       break;
+     }
+   }
  } else if (*p >= '1' && *p <= '9') {
    for (;;) {
      while (*p >= '0' && *p <= '9') {
@@ -811,6 +870,12 @@ static VALUE fj_parse_number(fj_state *st) {
  st->pos = p - buf;
  nlen = p - np;

+ /* A BARE leading-zero integer is an ID, not a number; at this top-level / strict
+  * position there is no quoteless-string form, so it raises. */
+ if (had_leading_zero && !has_sign && !is_float) {
+   fj_error(st, "invalid number with a leading zero");
+ }
+
  if (!is_float) {
    return fj_int_from_parts(m10, m10digits, neg, overflow, np, nlen);
  }
@@ -979,7 +1044,8 @@ static VALUE fj_classify_quoteless(fj_state *st, const char *p0, long n0) {

  if (fj_tok_eq(p, n, "true") || fj_tok_eq(p, n, "True")) return Qtrue;
  if (fj_tok_eq(p, n, "false") || fj_tok_eq(p, n, "False")) return Qfalse;
- if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
+ if (fj_tok_eq(p, n, "null") || fj_tok_eq(p, n, "Null") || fj_tok_eq(p, n, "NULL") ||
+     fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
  if (fj_tok_eq(p, n, "NaN")) return rb_float_new(NAN);
  if (fj_tok_eq(p, n, "Infinity")) return rb_float_new(INFINITY);

@@ -1273,8 +1339,10 @@ static VALUE fj_parse_value(fj_state *st) {
  case 'T': return fj_parse_literal(st, "True", Qtrue);
  case 'F': return fj_parse_literal(st, "False", Qfalse);
  case 'u': return fj_parse_literal(st, "undefined", Qnil);
- case 'N': /* NaN (number) vs None (Python null) */
+ case 'N': /* NaN (number); None / Null / NULL (null) */
    if (fj_byte_at(st, 1) == 'a') return fj_parse_number(st);
+   if (fj_byte_at(st, 1) == 'u') return fj_parse_literal(st, "Null", Qnil);
+   if (fj_byte_at(st, 1) == 'U') return fj_parse_literal(st, "NULL", Qnil);
    return fj_parse_literal(st, "None", Qnil);
  default:
    if (b == '-' || b == '+' || b == '.' || b == 'I' || (b >= '0' && b <= '9')) {
@@ -1676,9 +1744,16 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {

  void Init_smarter_json(void) {
    mSmarterJSON = rb_define_module("SmarterJSON");
+
+   rb_global_variable(&cParseError);
    cParseError = rb_const_get(mSmarterJSON, rb_intern("ParseError"));
+
+   rb_global_variable(&cEncodingError);
    cEncodingError = rb_const_get(mSmarterJSON, rb_intern("EncodingError"));
+
+   rb_global_variable(&cWarning);
    cWarning = rb_const_get(mSmarterJSON, rb_intern("Warning"));
+
    fj_bigdecimal_id = rb_intern("BigDecimal");
    fj_to_sym_id = rb_intern("to_sym");
    fj_key_p_id = rb_intern("key?");
@@ -739,7 +739,7 @@ module SmarterJSON
  # Mantissa must carry at least one digit (int part, or a leading-dot fraction), so a
  # bare exponent like "-e695881" is NOT a number — it falls through to a quoteless
  # string, matching the C path. Trailing exponent stays optional.
- DEC_RE = /\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
+ DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
  # A decimal BigDecimal() would reject as-is: a leading dot (".5") or a dot not
  # followed by a digit ("5.", "5.e3"). Matches iff normalize_for_bigdecimal
  # would change the string — so when it doesn't match, we skip normalization.
@@ -756,7 +756,9 @@ module SmarterJSON
  # (',' '}' ']' '{' '[') OR any whitespace ([[:space:]] covers ASCII + Unicode space,
  # incl. LF/CR which also terminate). Stopping at a terminator/EOF means the run had no
  # interior whitespace, so there's nothing to trim and no comment marker can apply.
- QL_BREAK = /[,{}\[\]]|[[:space:]]/.freeze
+ #
+ # U+FEFF is JSON5/ES5 whitespace but not in [[:space:]], so we need to add it:
+ QL_BREAK = /[,{}\[\]]|[[:space:]]|#{[0xFEFF].pack("U")}/.freeze

  # The defaults live centrally in SmarterJSON::Options (lib/smarter_json/options.rb).
  DEFAULT_OPTIONS = Options::DEFAULT_OPTIONS
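
The extra alternative in `QL_BREAK` matters because Ruby's `[[:space:]]` does not match U+FEFF. A minimal standalone check, in plain Ruby with no gem required; `QL_BREAK_SKETCH` is a hypothetical local copy of the pattern above, not the gem's own constant:

```ruby
# Hypothetical local copy of the break pattern from the hunk above.
BOM = [0xFEFF].pack("U") # "\uFEFF" as a UTF-8 string (bytes EF BB BF)
QL_BREAK_SKETCH = /[,{}\[\]]|[[:space:]]|#{BOM}/

# U+FEFF is a format character, not Unicode White_Space, so [[:space:]] skips it.
bom_is_space = BOM.match?(/[[:space:]]/)

# With the explicit alternative, a quoteless run now stops at a stray BOM.
bom_is_break = QL_BREAK_SKETCH.match?(BOM)
break_index  = "abc#{BOM}def" =~ QL_BREAK_SKETCH # character index of the BOM
```

Here `bom_is_space` is `false`, `bom_is_break` is `true`, and `break_index` is `3`, which is where a quoteless token such as `abc` would end.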
@@ -1103,7 +1105,7 @@ module SmarterJSON
  # Only meaningful for bytes >= 0x80.
  def multibyte_ws_len(pos)
    b0 = @input.getbyte(pos)
-   return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) # reject-gate
+   return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF # reject-gate (EF -> U+FEFF)

    b1 = @input.getbyte(pos + 1)
    return 0 if b1.nil?
@@ -1123,6 +1125,8 @@ module SmarterJSON
      end
    when 0xE3
      return 3 if b1 == 0x80 && b2 == 0x80 # U+3000
+   when 0xEF
+     return 3 if b1 == 0xBB && b2 == 0xBF # U+FEFF (JSON5 / ES5 BOM ws)
    end
    0
  end
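
To illustrate what skipping U+FEFF between tokens means at the byte level, here is a simplified sketch. `skip_ws_and_bom` is a hypothetical helper written for this note; it handles only ASCII whitespace and the three-byte BOM sequence, and leaves out the other Unicode whitespace characters that the real `multibyte_ws_len` covers:

```ruby
# Returns the byte offset of the first non-whitespace byte at or after pos.
# ASCII whitespace (0x20, 0x09..0x0D) and U+FEFF (EF BB BF) are skipped.
def skip_ws_and_bom(input, pos = 0)
  loop do
    b0 = input.getbyte(pos)
    return pos if b0.nil? # end of input

    if b0 == 0x20 || (b0 >= 0x09 && b0 <= 0x0D)
      pos += 1
    elsif b0 == 0xEF && input.getbyte(pos + 1) == 0xBB && input.getbyte(pos + 2) == 0xBF
      pos += 3 # a BOM between tokens counts as whitespace
    else
      return pos
    end
  end
end

# Two concatenated documents with a stray BOM in front of the second one.
stream = "{\"a\":1}\n\uFEFF{\"b\":2}"
second_doc_at = skip_ws_and_bom(stream, 7) # start right after the first document
```

The first document occupies bytes 0 to 6, so scanning from offset 7 passes the newline and the BOM and lands on offset 11, the opening brace of the second document.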
@@ -1210,10 +1214,11 @@ module SmarterJSON

  # Disambiguate NaN (number) from None (Python null) at a strict position.
  def parse_upper_n
-   if byte_at(1) == 0x61 # 'a' → NaN
-     parse_number
-   else
-     parse_literal_keyword("None", nil)
+   case byte_at(1)
+   when 0x61 then parse_number # 'a' -> NaN
+   when 0x75 then parse_literal_keyword("Null", nil) # 'u' -> Null
+   when 0x55 then parse_literal_keyword("NULL", nil) # 'U' -> NULL
+   else parse_literal_keyword("None", nil)
    end
  end

@@ -1378,7 +1383,7 @@ module SmarterJSON
  case str
  when "true", "True" then return true
  when "false", "False" then return false
- when "null", "None" then return nil
+ when "null", "Null", "NULL", "None" then return nil
  when "undefined" then return nil
  when "NaN" then return Float::NAN
  when "Infinity", "+Infinity" then return Float::INFINITY
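
The keyword match above is exact and case-sensitive: only the listed spellings become `nil`. A tiny sketch of that lookup, using a hypothetical `classify_null` helper and constant that are not part of the gem:

```ruby
# Spellings treated as nil after this change (taken from the `when` branches above).
NIL_SPELLINGS = %w[null Null NULL None undefined].freeze

# Returns nil for a recognized null spelling, otherwise the token unchanged.
def classify_null(token)
  NIL_SPELLINGS.include?(token) ? nil : token
end

classify_null("NULL")        # => nil
classify_null("NULL Island") # => "NULL Island" (embedded form stays a string)
classify_null("nULL")        # => "nULL" (not one of the listed spellings)
```

In this sketch a non-matching token is simply returned as-is; in the real code it continues through the remaining classification steps.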
@@ -1405,7 +1410,15 @@ module SmarterJSON
  # number tokens that is a real per-value allocation. Underscores are rare, so only
  # pay it when the token actually contains one (measured +27% on long-token decimals).
  body = str.include?("_") ? str.delete("_") : str
- body.match?(/[.eE]/) ? decimal_value(body) : body.to_i
+ return decimal_value(body) if body.match?(/[.eE]/)
+
+ # A BARE leading-zero integer (no sign / dot / exponent) is an ID — a zip code,
+ # account number, phone number — not a number; keep it a string so the zeros survive.
+ # A sign (+007 / -007) signals numeric intent (IDs never carry a sign), so those parse.
+ c0 = body.getbyte(0)
+ return NOT_NUMERIC if c0 == ZERO && body.bytesize > 1
+
+ body.to_i
  end

  # True when the token starts with [+-]?0[xX] — the only shape HEX_RE can match.
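
The bare leading-zero rule from the hunk above can be sketched in isolation. This is a simplified stand-in rather than the gem's code: `classify_token` is a hypothetical helper, `String#to_f` replaces the gem's `decimal_value`, and returning the original token stands in for `NOT_NUMERIC`. The regular expression is the new `DEC_RE` from this diff.

```ruby
# New DEC_RE from this diff: the integer part may now start with any digit run.
DEC_RE = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/

# Returns a Numeric for number-like tokens, or the token itself when it should
# stay a string (not decimal-shaped, or a bare leading-zero integer).
def classify_token(token)
  return token unless DEC_RE.match?(token)

  body = token.delete("_")
  return body.to_f if body.match?(/[.eE]/) # dot or exponent: numeric intent

  # No sign, no dot, no exponent, but a leading zero: treat as an ID.
  return token if body.start_with?("0") && body.bytesize > 1

  body.to_i # a signed token starts with + or -, so it reaches this line
end

classify_token("+007")      # => 7
classify_token("-000023.5") # => -23.5
classify_token("007e2")     # => 700.0
classify_token("02139")     # => "02139"
```

A lone `0` still parses as the integer zero, because the ID rule only applies when more characters follow the leading zero.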
@@ -1614,6 +1627,12 @@ module SmarterJSON
  when 0x6E then buf << "\n".b
  when 0x72 then buf << "\r".b
  when 0x74 then buf << "\t".b
+ when 0x76 then buf << "\v".b # JSON5 / ES5 vertical tab
+ when ZERO # JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden
+   nxt = @input.getbyte(i + 1)
+   raise error("invalid \\0 escape (octal not allowed)") if nxt && nxt >= ZERO && nxt <= NINE
+
+   buf << "\x00".b
  when LF
    # JSON5 line continuation: \<LF> emits nothing
  when CR
@@ -1623,8 +1642,19 @@ module SmarterJSON
      buf << [cp].pack("U").b
      i += consumed
      next
+   when LOWER_X # JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8)
+     hex = @input.byteslice(i + 1, 2)
+     raise error("invalid \\x escape") unless hex && hex.bytesize == 2 && hex.b.match?(/\A\h{2}\z/)
+
+     buf << [hex.to_i(16)].pack("U").b
+     i += 3
+     next
    else
-     raise error("invalid escape \\#{esc&.chr || "?"}")
+     # ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
+     # Emit the escaped byte; a multibyte char's continuation bytes follow as literals.
+     raise error("unterminated string escape") if esc.nil?
+
+     buf << esc
    end
    i += 1
  end
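
The escape rules touched in the two hunks above can be sketched as a small character-level decoder. This is an illustration under stated assumptions, not the gem's implementation: `decode_escapes` and `SIMPLE_ESCAPES` are hypothetical names, errors are raised as `ArgumentError` rather than the gem's own error class, and `\uXXXX` and line continuations are left out to keep it short.

```ruby
SIMPLE_ESCAPES = { "n" => "\n", "r" => "\r", "t" => "\t", "v" => "\v" }.freeze

# Decodes backslash escapes in the body of a string literal (quotes already removed).
def decode_escapes(body)
  out = +""
  i = 0
  while i < body.length
    ch = body[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end

    esc = body[i + 1]
    raise ArgumentError, "unterminated string escape" if esc.nil?

    if SIMPLE_ESCAPES.key?(esc)
      out << SIMPLE_ESCAPES[esc]
      i += 2
    elsif esc == "0"
      # \0 is NUL, but \0 followed by a digit would be an octal escape: rejected.
      raise ArgumentError, "invalid \\0 escape (octal not allowed)" if body[i + 2]&.match?(/\d/)
      out << "\0"
      i += 2
    elsif esc == "x"
      hex = body[i + 2, 2]
      raise ArgumentError, "invalid \\x escape" unless hex&.match?(/\A\h{2}\z/)
      out << [hex.to_i(16)].pack("U") # code point U+00HH, emitted as UTF-8
      i += 4
    else
      out << esc # unrecognized escape: the character itself
      i += 2
    end
  end
  out
end

decode_escapes('\x41') # => "A"
decode_escapes('\q')   # => "q"
```

Single-quoted Ruby literals are used in the calls so that the backslash reaches the decoder unchanged.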
@@ -1663,10 +1693,13 @@ module SmarterJSON

  def parse_number
    negative = false
+   signed = false
    if byte == MINUS
      negative = true
+     signed = true
      advance(1)
    elsif byte == PLUS
+     signed = true
      advance(1)
    end

@@ -1680,6 +1713,7 @@ module SmarterJSON
  end

  int_start = @pos
+ had_leading_zero = false

  if byte == ZERO
    advance(1)
@@ -1692,6 +1726,16 @@ module SmarterJSON
      value = @input.byteslice(hex_start, @pos - hex_start).delete("_").to_i(16)
      return negative ? -value : value
    end
+   # A run of further digits after the single leading '0' (007, 00023, or the
+   # underscore-separated 0_0) — consume it and flag the leading zero; the reject check
+   # below turns a bare leading-zero integer into an error. The underscore is only a
+   # separator, so 0_0.5 behaves like 00.5.
+   if (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
+     while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
+       had_leading_zero = true if b >= ZERO && b <= NINE
+       advance(1)
+     end
+   end
  elsif byte && byte >= 0x31 && byte <= NINE
    advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
  elsif byte == DOT
@@ -1717,6 +1761,13 @@ module SmarterJSON
    advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
  end

+ # A BARE leading-zero integer is an ID, not a number; at this top-level / strict
+ # position there is no quoteless-string form, so it raises (a sign or a dot/exponent
+ # signals numeric intent and is allowed: +007 -> 7, -000023.5 -> -23.5, 007e2 -> 700.0).
+ if had_leading_zero && !signed && !is_float
+   raise error("invalid number with a leading zero")
+ end
+
  slice = @input.byteslice(int_start, @pos - int_start).delete("_")
  value = is_float ? decimal_value(slice) : slice.to_i
  negative ? -value : value
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module SmarterJSON
-   VERSION = "1.1.1"
+   VERSION = "1.2.0"
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: smarter_json
  version: !ruby/object:Gem::Version
-   version: 1.1.1
+   version: 1.2.0
  platform: ruby
  authors:
  - Tilo Sloboda
  bindir: exe
  cert_chain: []
- date: 2026-06-11 00:00:00.000000000 Z
+ date: 2026-06-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bigdecimal
@@ -44,6 +44,7 @@ extra_rdoc_files: []
  files:
  - ".gitignore"
  - CHANGELOG.md
+ - CONTRIBUTORS.md
  - LICENSE.txt
  - README.md
  - Rakefile