RubyGems - smarter_json - Versions diffs - 1.1.2 → 1.2.1 - Mend

smarter_json 1.1.2 → 1.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +22 -4
data/README.md +10 -4
data/docs/_introduction.md +1 -1
data/docs/examples.md +20 -4
data/ext/smarter_json/smarter_json.c +80 -12
data/lib/smarter_json/parser.rb +70 -10
data/lib/smarter_json/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 616af8a85697d036162f613a7e0b370103cde9e2d8cd46496b417288785839d7
-  data.tar.gz: 6449bcecb8c631f357e3885f931407c8d52bcdc35acb4dcb2776d3047951986e
+  metadata.gz: f66007af9616269be4ccc22a5e73d784bc9991dbb5c4245e5cef15293cbc7ab8
+  data.tar.gz: d95cecb5d4258d44ed104c8386281e60855e196e09ed5c05e47e5ebe70b4e100
 SHA512:
-  metadata.gz: 6866b5eeeb2932d6c083da8c78daf1715eda5cc8c633f961947229698ae2bd79b25f14dd39d43f47dc3056984755e75892488c6c51cd5c1dea4efff90e114255
-  data.tar.gz: 2d911cad4653363fa8de29e84cefc306604190df6e11e1ef2b7807ede5e52006592d6cd5f4f17e58b9d2b36be1ed6b6145c5deda541f8bc56c59f9c5e64afdf8
+  metadata.gz: 4924b4a88250124b30a4d97b99d9c84ea644b922c4a0dea4fb4dfb185638b7fb43ccce2b638026d1b52d085fe7e5fceb94168edb190133684de0d94ab90f0399
+  data.tar.gz: 6c8493f8b1808deab9a7f039c9394c12d51076c4b670ad04df4c0b8b32d1deadfa1549d79cb2eb3200a3c2291cd652b255c061e4b210acc2426a3673a0cc33c3

data/CHANGELOG.md CHANGED

@@ -5,17 +5,35 @@
 >
 > `SmarterJSON.process` / `SmarterJSON.process_file`
 > both return:
->   — `[]` for no doc
->   - `[doc]` for one doc
->   - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
+>
+> — `[]` for no doc
+> - `[doc]` for one doc
+> - `[d1, d2, …]` for several docs (NDJSON / JSONL / concatenated docs)
 > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
->    Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads.
+>    Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
+## 1.2.1 (2026-06-17)
+RSpec tests: 1,165
+- Performance improvements
+## 1.2.0 (2026-06-16)
+RSpec tests: 1,097 → 1,165
+- A leading-zero token now reads as a number when it carries a sign, a decimal point, or an exponent (`+007` → `7`, `-000023.5` → `-23.5`, `00.0` → `0.0`, `007e2` → `700.0`) — previously these were kept as strings. A bare leading-zero integer (`000001`, `02`) still reads as a string, so IDs, zip codes, and account numbers keep their zeros.
+- `Null` and `NULL` are now read as `nil` (joining `null` / `None` / `undefined`), for SQL / R / PHP / YAML / DB-derived input — in every position the existing spellings work. Quoted (`"NULL"`) or embedded (`NULL Island`) forms stay strings.
+- String escapes now cover the full JSON5 / ECMAScript set: `\xHH` hex escapes (`"\x41"` → `"A"`), `\v` (vertical tab), `\0` (null), and an unrecognized escape now yields the character itself (`"\q"` → `"q"`) instead of raising. A malformed `\x` and an octal-style `\0` followed by a digit still raise.
+- A `U+FEFF` (BOM / zero-width no-break space) is now skipped as whitespace anywhere between tokens — matching JSON5 / ECMAScript — not only as a leading byte-order mark, so a stray BOM mid-stream (e.g. from concatenated files) no longer corrupts the adjacent value into a string. Inside a quoted string it stays content.
 ## 1.1.2 (2026-06-12)
 RSpec tests: 1,097
+### Bug Fix
 - The C extension now correctly supports Ruby's GC heap compaction (`GC.compact` / auto-compaction) — its cached exception/warning classes are declared to the GC. Thanks [Jean Boussier](https://github.com/byroot) for PR [#7](https://github.com/tilo/smarter_json/pull/7).
 ## 1.1.1 (2026-06-11)
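The leading-zero rule from the 1.2.0 entry can be sketched in plain Ruby. This is an illustration only, not the gem's code: `classify_token` and `DECIMAL` are hypothetical names, the regex is copied from the new `DEC_RE` in the parser.rb diff, and hex, `NaN`, and `Infinity` handling are left out.

```ruby
# Hypothetical helper, not part of smarter_json: classifies a quoteless token
# following the leading-zero rule described in the 1.2.0 changelog entry.
DECIMAL = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/

def classify_token(token)
  # Anything that is not decimal-shaped stays a string.
  return token unless DECIMAL.match?(token)

  body = token.delete("_")
  # A decimal point or an exponent signals numeric intent.
  return body.to_f if body.match?(/[.eE]/)
  # A bare leading-zero integer (no sign) is treated as an ID and kept as text.
  return token if body.start_with?("0") && body.length > 1

  body.to_i
end

p classify_token("+007")      # signed, so numeric
p classify_token("-000023.5") # decimal point, so numeric
p classify_token("007e2")     # exponent, so numeric
p classify_token("000001")    # bare leading zero, stays a string
```

The sign check falls out of `start_with?("0")`: a signed token begins with `+` or `-`, so it never hits the string branch.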

data/README.md CHANGED

@@ -8,7 +8,7 @@ A lenient, fast JSON processor for Ruby. It extracts strict JSON, NDJSON, JSONL,
 ## Features at a glance
-- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python & JavaScript literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
+- **Reads the whole human-JSON superset, no modes or flags** — strict JSON, NDJSON, JSONL, JSON5, HJSON, JSONC, plus comments, trailing commas, unquoted / single / triple / smart quotes, an implicit root object, `NaN` / `Infinity` / hex / underscores, Python / JavaScript / SQL literals, a UTF-8 BOM, mixed line endings, and any Ruby encoding (see [What it accepts](#what-it-accepts-beyond-strict-json) for the full list).
 - **Every document from multi-document input, in one call** — `process` returns an `Array` of all of them; `process_one` returns the single value and warns if there was more than one (never raises; routed to `on_warning`, else `Rails.logger`, else `Kernel.warn`).
 - **Streaming in bounded memory** — pass a block, or use `foreach(path_or_io)` for a composable `Enumerator` you can `.select` / `.map` / `.lazy` over.
 - **Recovers JSON from LLM / markdown noise** — strips markdown code fences, surrounding prose, and `<json>` tags, and pulls every payload out of one messy blob.
@@ -73,9 +73,11 @@ Three things set it apart:
 - `//`, `/* … */`, and `#` comments (a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` is read as a string, not a truncated value)
 - Markdown-wrapped / chatty blobs around the payload: strips ```` ```json ```` / ```` ``` ```` fences, ignores obvious prose before/after the payload, unwraps `<json>...</json>` and `BEGIN_JSON ... END_JSON`, and preserves multiple recovered payloads as an Array
 - Trailing commas; unquoted keys (`{host: localhost}`); single-quoted, triple-quoted (`'''…'''`), and quoteless string values
+- Full JSON5 / ECMAScript string escapes — `\uXXXX` (with surrogate pairs), `\xHH` (`"\x41"` → `"A"`), `\v`, `\0`, line continuation; an unrecognized escape yields the character itself (`"\q"` → `"q"`)
 - Implicit root object — a config file that starts with `key: value`, no outer `{}`
 - `NaN`, `Infinity`, hex (`0xFF`), leading `+` / `.`, underscores in numbers (`1_000_000`)
-- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`
+- Leading-zero numbers (which strict JSON rejects): a token with a sign, decimal point, or exponent reads as a number (`-007.5` → `-7.5`, `007e2` → `700.0`), but a bare leading-zero integer is kept as a string (`007`, `02`) so IDs, zip codes, and account numbers don't lose their zeros
+- UTF-8 BOM, smart/curly quotes (in keys and values), Python literals (`True` / `False` / `None`), JavaScript `undefined`, case-variant null (`Null` / `NULL`, as SQL / R / PHP / YAML emit it)
 - Mixed CR / LF / CRLF line endings, and any Ruby-supported input encoding (via `encoding:`)
 - Duplicate keys (last value wins by default; configurable)
@@ -89,11 +91,15 @@ The lenient grammar is a superset of these human-JSON specs — listed once, her
 * [HJSON](https://hjson.github.io/) <sup>†</sup>
 * [JWCC / HuJSON](https://github.com/tailscale/hujson)
 * [Nigel Tao](https://nigeltao.github.io/blog/2021/json-with-commas-comments.html)
-* [JSONH](https://github.com/jsonh-org/Jsonh)
+* [JSONH](https://github.com/jsonh-org/Jsonh) <sup>‡</sup>
 * [JSONC (VS Code)](https://jsonc.org/)
 * [NDJSON / JSON Text Sequences (RFC 7464)](https://datatracker.ietf.org/doc/html/rfc7464).
-<sup>†</sup> A deliberate subset. SmarterJSON's quoteless (unquoted) string values are single-line — it does **not** parse HJSON's unquoted multi-line strings; use a quoted or triple-quoted (`'''…'''`) string for multiline. This is by design: SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others — and an unquoted string that may span newlines collides with newline-as-a-document-separator (NDJSON, implicit-root config), so it is left out.
+HJSON and JSONH are deliberate subsets. SmarterJSON is one deterministic, no-modes superset of the JSON-family dialects (JSON5 / HJSON / JSONC / …), so it adopts a feature only where it does not conflict with the others.
+<sup>†</sup>From **HJSON** we leave out unquoted *multi-line* strings — its quoteless string values are single-line (use a quoted or triple-quoted `'''…'''` string for multiline), because a newline-spanning unquoted string collides with newline-as-a-document-separator (NDJSON, implicit-root config).
+<sup>‡</sup>From **JSONH** we take the mainstream features (quoteless keys / values, optional commas between newline-separated members, comments, hex numbers) but **not** the idiosyncratic extensions: binary (`0b`) / octal (`0o`) number literals, verbatim strings (`@"…"`), nestable block comments (`/=* *=/`), or its `\e` / `\a` escapes — the last conflict with the JSON5 / ECMAScript rule that an unrecognized escape is the character itself (`"\e"` → `"e"`). Tip: you can use quoteless strings instead of verbatim strings. Want binary or octal literals? Open an issue.
 ## Installation
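The escape rules listed in the README diff above can be sketched as a standalone decoder. This is a rough illustration under stated assumptions, not the gem's implementation: `decode_escapes`, `ESCAPE`, and `SIMPLE` are hypothetical names, the input is the raw text between the quotes, and `\uXXXX` (with surrogate pairs) is deliberately left out.

```ruby
# Hypothetical sketch, not smarter_json's code: decodes backslash escapes in
# the raw content of a quoted string, following the rules named in the diff.
SIMPLE = { "n" => "\n", "r" => "\r", "t" => "\t", "v" => "\v", "b" => "\b", "f" => "\f" }.freeze
ESCAPE = /\\(x\h{2}|0\d?|\r\n|.)/m

def decode_escapes(raw)
  raw.gsub(ESCAPE) do
    esc = Regexp.last_match(1)
    case esc
    when /\Ax\h{2}\z/       then [esc[1, 2].hex].pack("U") # \xHH -> U+00HH as UTF-8
    when "x"                then raise ArgumentError, "invalid \\x escape"
    when "0"                then "\0"                      # \0 -> NUL
    when /\A0\d\z/          then raise ArgumentError, "invalid \\0 escape (octal not allowed)"
    when "\n", "\r", "\r\n" then ""                        # line continuation emits nothing
    when "u"                then raise NotImplementedError, "\\uXXXX is out of scope for this sketch"
    else SIMPLE.fetch(esc, esc)                            # unknown escape -> the character itself
    end
  end
end

p decode_escapes('\x41') # the \xHH form
p decode_escapes('\q')   # an unrecognized escape
```

The final `else` branch is the behavioral change in this release: an unknown escape falls through to the character itself instead of raising.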

data/docs/_introduction.md CHANGED

@@ -29,7 +29,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
 ## What it accepts, beyond strict JSON
-Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, Python (`True` / `False` / `None`) and JavaScript (`undefined`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
+Comments (`//`, `/* … */`, `#` — a `#`/`//` only starts a comment when preceded by whitespace, so `url: http://x.com` reads as a string, not a truncated value), markdown-wrapped / chatty blobs around the payload, trailing commas, unquoted / single- / triple-quoted / quoteless strings, full JSON5 / ECMAScript string escapes (`\xHH`, `\v`, `\0`, line continuation, and an unknown escape yields the character itself), an implicit root object (`key: value`, no braces), `NaN` / `Infinity` / hex / underscored numbers, leading-zero numbers (a signed / decimal / exponent token like `-007.5` is a number, a bare `007` is kept as a string so IDs keep their zeros), Python (`True` / `False` / `None`), JavaScript (`undefined`), and SQL / R / PHP / YAML (`Null` / `NULL`) literals, smart quotes, a UTF-8 BOM, mixed CR / LF / CRLF line endings, any Ruby-supported input encoding (via `encoding:`), and duplicate keys. The full list — with the human-JSON spec references it's drawn from — is kept in one place: [**What it accepts, beyond strict JSON**](../README.md#what-it-accepts-beyond-strict-json) in the README.
 It raises only on genuinely unreadable input (unterminated string, mismatched bracket), with line and column in the message — never on valid-but-lenient input.
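The literal spellings named in the paragraph above amount to an exact-match lookup on the whole unquoted token. A minimal sketch, assuming hypothetical names `LITERALS` and `classify_literal` that do not exist in the gem:

```ruby
# Hypothetical lookup, not smarter_json's code: maps the literal spellings
# listed in the docs to Ruby values. Matching is on the whole token, so an
# embedded form such as "NULL Island" is left as a string.
LITERALS = {
  "true" => true, "True" => true,
  "false" => false, "False" => false,
  "null" => nil, "Null" => nil, "NULL" => nil,
  "None" => nil, "undefined" => nil
}.freeze

def classify_literal(token)
  LITERALS.key?(token) ? [:literal, LITERALS[token]] : [:string, token]
end

p classify_literal("NULL")
p classify_literal("NULL Island")
```

The pair return value is needed because `nil` is itself a valid result and cannot double as the "no match" signal.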

data/docs/examples.md CHANGED

@@ -145,7 +145,23 @@ JSON
 A `#`/`//` only starts a comment when preceded by whitespace, so `http://example.com` stays a string rather than being truncated.
-### Example 10: Wrapper Noise Around a Payload
+### Example 10: Leading-Zero IDs and SQL `NULL`
+```ruby
+SmarterJSON.process_one(<<~JSON)
+  {
+    user_id:    007,      # bare leading zero -> kept as a string
+    zip:        02139,    # ditto: zip codes keep their leading zero
+    balance:    -007.50,  # a sign / decimal point / exponent makes it a number
+    deleted_at: NULL      # SQL / R / YAML null spelling -> nil
+  }
+JSON
+# => {"user_id"=>"007", "zip"=>"02139", "balance"=>-7.5, "deleted_at"=>nil}
+```
+A bare leading-zero integer is kept as a string so identifiers, zip codes, and account numbers don't lose their zeros; a sign, decimal point, or exponent marks numeric intent (`-007.50` → `-7.5`). `Null` and `NULL` join `null` / `None` / `undefined` as spellings of `nil`; a quoted `"NULL"` stays a string.
+### Example 11: Wrapper Noise Around a Payload
 #### Fenced payload
@@ -197,14 +213,14 @@ TEXT
 # => [{"a"=>1}, {"b"=>2}]
 ```
-### Example 11: Write JSON
+### Example 12: Write JSON
 ```ruby
 SmarterJSON.generate({ "a" => 1, "b" => [2, 3] })   # => '{"a":1,"b":[2,3]}'
 SmarterJSON.generate([1, 2, 3])                       # => '[1,2,3]'
 ```
-### Example 12: Write NDJSON
+### Example 13: Write NDJSON
 An Array writes one element per line:
@@ -212,7 +228,7 @@ An Array writes one element per line:
 SmarterJSON.generate([{ "id" => 1 }, { "id" => 2 }], format: :ndjson)   # => "{\"id\":1}\n{\"id\":2}\n"
 ```
-### Example 13: Round-Trip Read and Write
+### Example 14: Round-Trip Read and Write
 ```ruby
 obj = { "a" => 1, "b" => [2, "three", nil, true] }

data/ext/smarter_json/smarter_json.c CHANGED

@@ -169,13 +169,14 @@ static void fj_advance(fj_state *st, long n) {
 static int fj_is_ws(int b) { return b == 0x20 || (b >= 0x09 && b <= 0x0D); }
 /* Length (1..3) of the Unicode whitespace char starting at p (n bytes
- * available), or 0. Matches Ruby's [[:space:]]; see smarter_json.md §4.7.
- * Reject-gate: only C2/E1/E2/E3 can begin a whitespace char. */
+ * available), or 0. Matches Ruby's [[:space:]], plus U+FEFF (BOM) — JSON5 / ES5 count
+ * the BOM as whitespace though Unicode White_Space does not; see smarter_json.md §4.7.
+ * Reject-gate: only C2/E1/E2/E3/EF can begin one of these chars. */
 static long fj_mbws(const char *p, long n) {
   int b0, b1, b2;
   if (n < 1) return 0;
   b0 = (unsigned char)p[0];
-  if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3)) return 0;
+  if (b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF) return 0;
   if (n < 2) return 0;
   b1 = (unsigned char)p[1];
   if (b0 == 0xC2) return (b1 == 0xA0 || b1 == 0x85) ? 2 : 0;
@@ -188,6 +189,7 @@ static long fj_mbws(const char *p, long n) {
     return 0;
   }
   if (b0 == 0xE3) return (b1 == 0x80 && b2 == 0x80) ? 3 : 0;
+  if (b0 == 0xEF) return (b1 == 0xBB && b2 == 0xBF) ? 3 : 0; /* U+FEFF (JSON5 / ES5 BOM ws) */
   return 0;
 }
@@ -398,8 +400,24 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
         case 'n':  rb_str_buf_cat(buf, "\n", 1); fj_advance(st, 1); break;
         case 'r':  rb_str_buf_cat(buf, "\r", 1); fj_advance(st, 1); break;
         case 't':  rb_str_buf_cat(buf, "\t", 1); fj_advance(st, 1); break;
+        case 'v':  rb_str_buf_cat(buf, "\v", 1); fj_advance(st, 1); break; /* JSON5 / ES5 */
         case 0x0A: fj_advance(st, 1); break; /* \<LF>: line continuation */
         case 0x0D: fj_advance(st, 1); if (fj_byte(st) == 0x0A) fj_advance(st, 1); break;
+        case '0': /* JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden */
+          fj_advance(st, 1);
+          { int nx = fj_byte(st); if (nx >= '0' && nx <= '9') fj_error(st, "invalid \\0 escape (octal not allowed)"); }
+          rb_str_buf_cat(buf, "\0", 1);
+          break;
+        case 'x': { /* JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8) */
+          int h1, h2;
+          fj_advance(st, 1);
+          h1 = fj_hex_val(fj_byte(st));
+          h2 = fj_hex_val(fj_byte_at(st, 1));
+          if (h1 < 0 || h2 < 0) fj_error(st, "invalid \\x escape");
+          fj_advance(st, 2);
+          fj_append_utf8(buf, (unsigned long)((h1 << 4) | h2));
+          break;
+        }
         case 'u': {
           unsigned long cp;
           fj_advance(st, 1);
@@ -418,7 +436,12 @@ static VALUE fj_parse_string(fj_state *st, int quote) {
           break;
         }
         default:
-          fj_error(st, "invalid escape");
+          /* ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
+           * Emit the escaped byte; a multibyte UTF-8 char's continuation bytes follow as
+           * literal content (next loop iterations), reconstructing the whole character. */
+          rb_str_buf_cat(buf, st->buf + st->pos, 1);
+          fj_advance(st, 1);
+          break;
       }
     } else {
       /* Literal run between escapes: NEON-scan to the next quote/backslash and
@@ -641,16 +664,33 @@ static FJ_ALWAYS_INLINE VALUE fj_float_from_parts(fj_state *st, uint64_t m10, in
  * per-byte '_' test, dropping to a slow step only when an underscore appears. */
 static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
   long i = 0;
-  int  is_float = 0, neg = 0, has_digit = 0, overflow = 0;
+  int  is_float = 0, neg = 0, has_digit = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
   uint64_t m10 = 0;
   int  m10digits = 0, frac = 0;
   int64_t e10 = 0;
-  if (i < n && (p[i] == '-' || p[i] == '+')) { neg = (p[i] == '-'); i++; }
+  if (i < n && (p[i] == '-' || p[i] == '+')) { has_sign = 1; neg = (p[i] == '-'); i++; }
-  /* Integer part: a single '0', or [1-9] then digits/underscores. */
+  /* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
+   * by more digits (a leading-zero token) is consumed too but flagged: a BARE leading-zero
+   * integer (no sign / dot / exponent) is rejected below and kept as a string, so zip /
+   * account / check numbers preserve their zeros. */
   if (i < n && p[i] == '0') {
     has_digit = 1; m10digits = 1; i++;
+    /* Underscore-separated too (like the [1-9] branch below), so 0_5.0 / 0_0.5 behave
+     * exactly like 05.0 / 00.5 on both paths. */
+    if (i < n && ((p[i] >= '0' && p[i] <= '9') || p[i] == '_')) {
+      for (;;) {
+        while (i < n && p[i] >= '0' && p[i] <= '9') {
+          had_leading_zero = 1;
+          if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(p[i] - '0'); m10digits++; }
+          else overflow = 1;
+          i++;
+        }
+        if (i < n && p[i] == '_') { i++; continue; }
+        break;
+      }
+    }
   } else if (i < n && p[i] >= '1' && p[i] <= '9') {
     has_digit = 1;
     for (;;) {
@@ -699,6 +739,8 @@ static int fj_try_decimal(fj_state *st, const char *p, long n, VALUE *out) {
   if (i != n)     return 0;  /* token not fully consumed -> not a number (string) */
   if (!has_digit) return 0;  /* e.g. "." or "+" -> not a number (string) */
+  /* A BARE leading-zero integer (no sign / dot / exponent) is an ID, not a number. */
+  if (had_leading_zero && !has_sign && !is_float) return 0;
   if (!is_float) {
     *out = fj_int_from_parts(m10, m10digits, neg, overflow, p, n);
@@ -730,13 +772,13 @@ static VALUE fj_parse_number(fj_state *st) {
   const char *p   = buf + st->pos;  /* buf[len] == '\0' (RSTRING_PTR) is the scan sentinel */
   const char *np  = p;              /* token start, includes a leading sign */
   long   nlen;
-  int    is_float = 0, neg = 0, overflow = 0;
+  int    is_float = 0, neg = 0, overflow = 0, has_sign = 0, had_leading_zero = 0;
   uint64_t m10 = 0;                 /* mantissa: integer + fraction digits */
   int    m10digits = 0;             /* mantissa digit chars (caps the Eisel-Lemire fast path at 18) */
   int    frac = 0;                  /* fraction digit chars: e10 -= frac */
   int64_t e10 = 0;
-  if (*p == '-' || *p == '+') { neg = (*p == '-'); p++; }
+  if (*p == '-' || *p == '+') { has_sign = 1; neg = (*p == '-'); p++; }
   /* Cold branches (rare, not perf-critical): sync the cursor, reuse scalar helpers. */
   if (*p == 'I') { st->pos = p - buf; fj_consume_keyword(st, "Infinity"); return rb_float_new(neg ? -INFINITY : INFINITY); }
@@ -755,10 +797,27 @@ static VALUE fj_parse_number(fj_state *st) {
     return rb_str_to_inum(hx, 16, 0);
   }
-  /* Integer part: a single '0', or [1-9] then digits/underscores. */
+  /* Integer part: a single '0', or [1-9] then digits/underscores. A leading '0' followed
+   * by more digits is consumed but flagged; a BARE leading-zero integer (no sign / dot /
+   * exponent) is rejected after the scan — it is an ID, not a number, and has no bare
+   * top-level quoteless-string form, so it raises (matching `000001`). */
   if (*p == '0') {
     m10digits = 1;  /* one leading zero, counted as a single mantissa digit */
     p++;
+    /* Underscore-separated too (like the [1-9] branch below), so the underscore is just a
+     * separator (0_0.5 behaves like 00.5). */
+    if ((*p >= '0' && *p <= '9') || *p == '_') {
+      for (;;) {
+        while (*p >= '0' && *p <= '9') {
+          had_leading_zero = 1;
+          if (m10digits < 18) { m10 = m10 * 10 + (uint64_t)(*p - '0'); m10digits++; }
+          else overflow = 1;
+          p++;
+        }
+        if (*p == '_') { p++; continue; }
+        break;
+      }
+    }
   } else if (*p >= '1' && *p <= '9') {
     for (;;) {
       while (*p >= '0' && *p <= '9') {
@@ -811,6 +870,12 @@ static VALUE fj_parse_number(fj_state *st) {
   st->pos = p - buf;
   nlen = p - np;
+  /* A BARE leading-zero integer is an ID, not a number; at this top-level / strict
+   * position there is no quoteless-string form, so it raises. */
+  if (had_leading_zero && !has_sign && !is_float) {
+    fj_error(st, "invalid number with a leading zero");
+  }
   if (!is_float) {
     return fj_int_from_parts(m10, m10digits, neg, overflow, np, nlen);
   }
@@ -979,7 +1044,8 @@ static VALUE fj_classify_quoteless(fj_state *st, const char *p0, long n0) {
   if (fj_tok_eq(p, n, "true")  || fj_tok_eq(p, n, "True"))  return Qtrue;
   if (fj_tok_eq(p, n, "false") || fj_tok_eq(p, n, "False")) return Qfalse;
-  if (fj_tok_eq(p, n, "null")  || fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
+  if (fj_tok_eq(p, n, "null")  || fj_tok_eq(p, n, "Null") || fj_tok_eq(p, n, "NULL") ||
+      fj_tok_eq(p, n, "None") || fj_tok_eq(p, n, "undefined")) return Qnil;
   if (fj_tok_eq(p, n, "NaN")) return rb_float_new(NAN);
   if (fj_tok_eq(p, n, "Infinity")) return rb_float_new(INFINITY);
@@ -1273,8 +1339,10 @@ static VALUE fj_parse_value(fj_state *st) {
     case 'T':  return fj_parse_literal(st, "True", Qtrue);
     case 'F':  return fj_parse_literal(st, "False", Qfalse);
     case 'u':  return fj_parse_literal(st, "undefined", Qnil);
-    case 'N':  /* NaN (number) vs None (Python null) */
+    case 'N':  /* NaN (number); None / Null / NULL (null) */
       if (fj_byte_at(st, 1) == 'a') return fj_parse_number(st);
+      if (fj_byte_at(st, 1) == 'u') return fj_parse_literal(st, "Null", Qnil);
+      if (fj_byte_at(st, 1) == 'U') return fj_parse_literal(st, "NULL", Qnil);
       return fj_parse_literal(st, "None", Qnil);
     default:
       if (b == '-' || b == '+' || b == '.' || b == 'I' || (b >= '0' && b <= '9')) {
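The `fj_mbws` change above can be illustrated with a Ruby sketch of the same byte checks. It covers only the branches visible in this diff (`C2`, `E3`, and the new `EF`); the `E1` / `E2` branches sit in an elided hunk and are not reconstructed here. `skip_ws` is a hypothetical helper added to show a mid-stream BOM being skipped.

```ruby
# Sketch of the multibyte whitespace check, limited to the branches shown in
# the diff. Returns the byte length of the whitespace char at pos, or 0.
def multibyte_ws_len(str, pos)
  b0, b1, b2 = str.getbyte(pos), str.getbyte(pos + 1), str.getbyte(pos + 2)
  case b0
  when 0xC2 then [0xA0, 0x85].include?(b1) ? 2 : 0 # U+00A0, U+0085
  when 0xE3 then b1 == 0x80 && b2 == 0x80 ? 3 : 0  # U+3000
  when 0xEF then b1 == 0xBB && b2 == 0xBF ? 3 : 0  # U+FEFF (new in 1.2.0)
  else 0
  end
end

# Hypothetical helper: advances past ASCII whitespace and the chars above.
def skip_ws(str, pos)
  loop do
    b = str.getbyte(pos)
    if b == 0x20 || (b && b >= 0x09 && b <= 0x0D)
      pos += 1
    elsif (n = multibyte_ws_len(str, pos)) > 0
      pos += n
    else
      return pos
    end
  end
end

p skip_ws(" \uFEFF 42", 0) # byte offset of the first non-whitespace byte
```

Reading past the end of the string is safe because `getbyte` returns `nil` there, which fails every comparison and ends the loop.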

data/lib/smarter_json/parser.rb CHANGED

@@ -739,7 +739,7 @@ module SmarterJSON
     # Mantissa must carry at least one digit (int part, or a leading-dot fraction), so a
     # bare exponent like "-e695881" is NOT a number — it falls through to a quoteless
     # string, matching the C path. Trailing exponent stays optional.
-    DEC_RE      = /\A[-+]?(?:(?:0|[1-9][0-9_]*)(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
+    DEC_RE      = /\A[-+]?(?:[0-9][0-9_]*(?:\.[0-9_]*)?|\.[0-9_]+)(?:[eE][-+]?[0-9_]+)?\z/.freeze
     # A decimal BigDecimal() would reject as-is: a leading dot (".5") or a dot not
     # followed by a digit ("5.", "5.e3"). Matches iff normalize_for_bigdecimal
     # would change the string — so when it doesn't match, we skip normalization.
@@ -756,6 +756,10 @@ module SmarterJSON
     # (',' '}' ']' '{' '[') OR any whitespace ([[:space:]] covers ASCII + Unicode space,
     # incl. LF/CR which also terminate). Stopping at a terminator/EOF means the run had no
     # interior whitespace, so there's nothing to trim and no comment marker can apply.
+    #
+    # U+FEFF is JSON5/ES5 whitespace but NOT in [[:space:]]. It is deliberately kept OUT of
+    # this regex: a multibyte alternative defeats byteindex's fast byte-search (~3.3x slower
+    # on number-dense input). A trailing U+FEFF is trimmed cheaply in the fast path below.
     QL_BREAK = /[,{}\[\]]|[[:space:]]/.freeze
     # The defaults live centrally in SmarterJSON::Options (lib/smarter_json/options.rb).
@@ -1103,7 +1107,7 @@ module SmarterJSON
     # Only meaningful for bytes >= 0x80.
     def multibyte_ws_len(pos)
       b0 = @input.getbyte(pos)
-      return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) # reject-gate
+      return 0 if b0 != 0xC2 && (b0 < 0xE1 || b0 > 0xE3) && b0 != 0xEF # reject-gate (EF -> U+FEFF)
       b1 = @input.getbyte(pos + 1)
       return 0 if b1.nil?
@@ -1123,6 +1127,8 @@ module SmarterJSON
         end
       when 0xE3
         return 3 if b1 == 0x80 && b2 == 0x80                 # U+3000
+      when 0xEF
+        return 3 if b1 == 0xBB && b2 == 0xBF                 # U+FEFF (JSON5 / ES5 BOM ws)
       end
       0
     end
@@ -1210,10 +1216,11 @@ module SmarterJSON
     # Disambiguate NaN (number) from None (Python null) at a strict position.
     def parse_upper_n
-      if byte_at(1) == 0x61 # 'a' → NaN
-        parse_number
-      else
-        parse_literal_keyword("None", nil)
+      case byte_at(1)
+      when 0x61 then parse_number                       # 'a' -> NaN
+      when 0x75 then parse_literal_keyword("Null", nil) # 'u' -> Null
+      when 0x55 then parse_literal_keyword("NULL", nil) # 'U' -> NULL
+      else parse_literal_keyword("None", nil)
       end
     end
@@ -1345,7 +1352,14 @@ module SmarterJSON
         b = hit < @bytesize ? input.getbyte(hit) : nil
         if b.nil? || b == COMMA || b == RBRACE || b == RBRACKET || b == LBRACE || b == LBRACKET || b == LF || b == CR
           @pos = hit
-          return hit
+          # A trailing U+FEFF (EF BB BF) is JSON5/ES5 whitespace but not in QL_BREAK, so
+          # byteindex scanned past it into the run — trim it (and a run of them). On the
+          # common path the last byte is a digit/letter, so the first compare fails at once.
+          fin = hit
+          while fin - 3 >= pos && input.getbyte(fin - 1) == 0xBF && input.getbyte(fin - 2) == 0xBB && input.getbyte(fin - 3) == 0xEF
+            fin -= 3
+          end
+          return fin
         end
       end
@@ -1378,7 +1392,7 @@ module SmarterJSON
       case str
       when "true", "True"          then return true
       when "false", "False"        then return false
-      when "null", "None"          then return nil
+      when "null", "Null", "NULL", "None" then return nil
       when "undefined"             then return nil
       when "NaN"                   then return Float::NAN
       when "Infinity", "+Infinity" then return Float::INFINITY
@@ -1405,7 +1419,15 @@ module SmarterJSON
       # number tokens that is a real per-value allocation. Underscores are rare, so only
       # pay it when the token actually contains one (measured +27% on long-token decimals).
       body = str.include?("_") ? str.delete("_") : str
-      body.match?(/[.eE]/) ? decimal_value(body) : body.to_i
+      return decimal_value(body) if body.match?(/[.eE]/)
+      # A BARE leading-zero integer (no sign / dot / exponent) is an ID — a zip code,
+      # account number, phone number — not a number; keep it a string so the zeros survive.
+      # A sign (+007 / -007) signals numeric intent (IDs never carry a sign), so those parse.
+      c0 = body.getbyte(0)
+      return NOT_NUMERIC if c0 == ZERO && body.bytesize > 1
+      body.to_i
     end
     # True when the token starts with [+-]?0[xX] — the only shape HEX_RE can match.
@@ -1614,6 +1636,12 @@ module SmarterJSON
         when 0x6E      then buf << "\n".b
         when 0x72      then buf << "\r".b
         when 0x74      then buf << "\t".b
+        when 0x76      then buf << "\v".b # JSON5 / ES5 vertical tab
+        when ZERO # JSON5 / ES5 \0 -> NUL; a following digit would be octal -> forbidden
+          nxt = @input.getbyte(i + 1)
+          raise error("invalid \\0 escape (octal not allowed)") if nxt && nxt >= ZERO && nxt <= NINE
+          buf << "\x00".b
         when LF
           # JSON5 line continuation: \<LF> emits nothing
         when CR
@@ -1623,8 +1651,19 @@ module SmarterJSON
           buf << [cp].pack("U").b
           i += consumed
           next
+        when LOWER_X # JSON5 / ES5 \xHH -> code point U+00HH (emitted as UTF-8)
+          hex = @input.byteslice(i + 1, 2)
+          raise error("invalid \\x escape") unless hex && hex.bytesize == 2 && hex.b.match?(/\A\h{2}\z/)
+          buf << [hex.to_i(16)].pack("U").b
+          i += 3
+          next
         else
-          raise error("invalid escape \\#{esc&.chr || "?"}")
+          # ES5 NonEscapeCharacter: an unrecognized escape yields the character itself.
+          # Emit the escaped byte; a multibyte char's continuation bytes follow as literals.
+          raise error("unterminated string escape") if esc.nil?
+          buf << esc
         end
         i += 1
       end
@@ -1663,10 +1702,13 @@ module SmarterJSON
     def parse_number
       negative = false
+      signed = false
       if byte == MINUS
         negative = true
+        signed = true
         advance(1)
       elsif byte == PLUS
+        signed = true
         advance(1)
       end
@@ -1680,6 +1722,7 @@ module SmarterJSON
       end
       int_start = @pos
+      had_leading_zero = false
       if byte == ZERO
         advance(1)
@@ -1692,6 +1735,16 @@ module SmarterJSON
           value = @input.byteslice(hex_start, @pos - hex_start).delete("_").to_i(16)
           return negative ? -value : value
         end
+        # A run of further digits after the single leading '0' (007, 00023, or the
+        # underscore-separated 0_0) — consume it and flag the leading zero; the reject check
+        # below turns a bare leading-zero integer into an error. The underscore is only a
+        # separator, so 0_0.5 behaves like 00.5.
+        if (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
+          while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
+            had_leading_zero = true if b >= ZERO && b <= NINE
+            advance(1)
+          end
+        end
       elsif byte && byte >= 0x31 && byte <= NINE
         advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
       elsif byte == DOT
@@ -1717,6 +1770,13 @@ module SmarterJSON
         advance(1) while (b = byte) && ((b >= ZERO && b <= NINE) || b == UNDERSCORE)
       end
+      # A BARE leading-zero integer is an ID, not a number; at this top-level / strict
+      # position there is no quoteless-string form, so it raises (a sign or a dot/exponent
+      # signals numeric intent and is allowed: +007 -> 7, -000023.5 -> -23.5, 007e2 -> 700.0).
+      if had_leading_zero && !signed && !is_float
+        raise error("invalid number with a leading zero")
+      end
       slice = @input.byteslice(int_start, @pos - int_start).delete("_")
       value = is_float ? decimal_value(slice) : slice.to_i
       negative ? -value : value
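The trailing-U+FEFF trim added to the quoteless fast path above can be tried in isolation. In this sketch `trim_trailing_bom` is a hypothetical wrapper around the same loop; `pos` is the byte offset where the run starts and `hit` is the offset of the terminator that ended it.

```ruby
# Sketch of the trim loop from the fast path: steps back over any run of
# EF BB BF sequences that sit directly before the terminator at `hit`.
def trim_trailing_bom(input, pos, hit)
  fin = hit
  while fin - 3 >= pos &&
        input.getbyte(fin - 1) == 0xBF &&
        input.getbyte(fin - 2) == 0xBB &&
        input.getbyte(fin - 3) == 0xEF
    fin -= 3
  end
  fin
end

input = "42\uFEFF\uFEFF,"
hit   = input.b.index(",")            # byte offset of the terminator
fin   = trim_trailing_bom(input, 0, hit)
p input.byteslice(0, fin)             # the run with the trailing BOMs removed
```

On ordinary input the last byte of the run is a digit or letter, so the first comparison fails and the loop body never runs, which is the cost argument made in the diff's comment.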

data/lib/smarter_json/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterJSON
-  VERSION = "1.1.2"
+  VERSION = "1.2.1"
 end

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: smarter_json
 version: !ruby/object:Gem::Version
-  version: 1.1.2
+  version: 1.2.1
 platform: ruby
 authors:
 - Tilo Sloboda
 bindir: exe
 cert_chain: []
-date: 2026-06-12 00:00:00.000000000 Z
+date: 2026-06-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bigdecimal