RubyGems - smarter_json - Versions diffs - 0.8.0 → 0.9.2 - Mend

smarter_json 0.8.0 → 0.9.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -0
data/README.md +42 -6
data/docs/_introduction.md +1 -1
data/docs/basic_read_api.md +35 -2
data/docs/examples.md +29 -4
data/ext/smarter_json/smarter_json.c +46 -16
data/lib/smarter_json/parser.rb +325 -62
data/lib/smarter_json/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '06668bbb40626009794f8e5387fb13ae1a31346a07200d2825fde4872904bd68'
-  data.tar.gz: 0db0d42bfd2e85a1af4b897990a290e25e4fbe0afd9b4e25e99d9520667de2c3
+  metadata.gz: 2256f81fe3b29e83a42dcf948db896a03cdac568bbc799cb3b63b9516a76592d
+  data.tar.gz: c13d572f3cb417fdffc16423a38e180018f121adbc31cbbc2490bc39576bf7b5
 SHA512:
-  metadata.gz: 766cd9c5865d7218f79db57ec538bb1d34355c93012e38820544a41bbedbbca94a4a8fce05982f5f2a0e48c5670a6d3f4336eb1a22361fc29b34855d913fc564
-  data.tar.gz: f57396ef1a2ac4e48e06dad94122b8159a3b17c80f6d679e1bb59f85874578ea444592a828c3eb324c8d84ff57c255df0bc99af2aeb5603261aa19d26abc88c2
+  metadata.gz: 8ccaf09a845726e751740a870f62e008fac272bf314f6de88bba069663fa1fb9ba890d469bb1010ee55def74eb291f2953251372a2adcebae8d641a0609ff541
+  data.tar.gz: ce622275f2c90fc5044a0c0a9c2c8efcd601326c54f1869cb62241e3f78a784d9d237c9ff9f4c5b44e734562b22bcc744c0a14577641336ec5adeb24e1926c29

data/CHANGELOG.md CHANGED

@@ -3,6 +3,19 @@
 > 🚧 Getting ready for the 1.0.0 release - sorry for the interface changes - thank you for your patience! 🚧
+## 0.9.2 (2026-06-03)
+- **Fix a residual performance regression affecting every large document.** The "leading label" check (for `JSON: {…}`, which parses successfully but wrongly as an implicit-root object) now uses `String#start_with?(/…/)` instead of `match?(/\A…/)`. A `\A`-anchored `match?` is **not** anchor-optimized — it retries at every byte position and so scanned the entire input (~0.3 s on a 200 MB document) on every parse, which had quietly taxed every large file since the wrapper was introduced (deeply_nested.json and big_decimals.json sat well below their 0.6.0 throughput even after 0.9.1). `start_with?` inspects only the beginning, restoring — and slightly exceeding — 0.6.0 throughput across the board.
+## 0.9.1 (2026-06-03 unreleased)
+- **Fix a major performance regression on real-world data** (introduced with the 0.8.0 wrapper recovery). Wrapper recovery is now **reactive**: input is parsed first, and the markdown-fence / `<json>` / prose extraction runs only when that parse actually fails. Before, any input that merely *contained* ` ``` ` or `<json>` anywhere — including inside ordinary JSON string values, as GitHub-event payloads and other markdown-bearing data routinely do — was dragged through a full pure-Ruby recovery scan plus a double parse on every call (~30–45× slower on those files). A bare leading label like `JSON: {…}`, which parses successfully but wrongly, is still caught up front before parsing.
+- **Streaming framer**: a multi-byte marker (`//`, `/*`, `'''`, `*/`) whose bytes straddle a read-chunk boundary is no longer mis-scanned — the framer waits for the rest of the marker before deciding, so a brace inside such a comment/string can no longer end a document early.
+- Wrapper warnings (`code_fence_stripped` / `wrapper_tag_stripped`) now fire only when the marker is actually in the stripped text, not when it sits inside a recovered payload's own string value.
+- Shared `SmarterJSON::Bytes` constants for the parser and the framer / recovery scanners (no raw hex byte literals).
+## 0.9.0 (2026-06-03 unreleased)
+- performance improvements
+- code cleanup
 ## 0.8.0 (2026-06-03)
 - **Robustness** against LLM-generated / wrapped JSON:
   - strips markdown code fences (```json / ```)

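Note on the 0.9.2 entry: the change swaps a `\A`-anchored `match?` for `String#start_with?` with a Regexp. The sketch below only checks that the two forms agree on what counts as a leading label; it does not measure the performance difference the changelog describes. The pattern is copied from the `parser.rb` hunk in this diff, while the method names and sample inputs are made up for illustration.

```ruby
# frozen_string_literal: true

# Pattern copied from the leading_label? method shown in this diff.
LABEL = /[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i

# 0.9.2 form: only the beginning of the string is inspected.
def leading_label_start_with?(input)
  input.start_with?(LABEL)
end

# Pre-0.9.2 form: the same pattern behind an explicit \A anchor.
def leading_label_match?(input)
  input.match?(/\A#{LABEL.source}/i)
end

# Hypothetical sample inputs (not taken from the gem's test suite).
SAMPLES = [
  'JSON: {"a":1}',
  "  final answer : [1,2]",
  '{"note":"JSON: inside a value"}',
  '{"a":1}'
].freeze

SAMPLES.each do |s|
  puts format("%-36s start_with?=%-5s match?=%s",
              s.inspect, leading_label_start_with?(s), leading_label_match?(s))
end
```

Both predicates return the same answer for every sample, so the swap should be behavior-preserving for inputs like these.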
data/README.md CHANGED

@@ -16,7 +16,7 @@ Three things set it apart:
 1. **One parser, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the parser to match your input; it adapts to whatever you give it.
-2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** Pass a block to iterate the recovered documents one at a time.
+2. **It parses multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: one document returns its value, several documents return an `Array`, empty input returns `nil`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON parses multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time.
 3. **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser — the fastest general-purpose Ruby JSON parser.
@@ -46,6 +46,12 @@ gem install smarter_json
 The C extension is built on install and used automatically. On platforms where it can't build, the pure-Ruby parser runs instead and produces identical results.
+## API stability and thread safety
+The public API is now considered stable: `SmarterJSON.process`, `SmarterJSON.process_file`, `SmarterJSON.generate`, and the documented options in this README/docs are the supported surface.
+Concurrent calls are safe. The parser/generator keep per-call state local, and the C extension only caches Ruby IDs / constants at load time; it does not share mutable parse state across calls.
 ## Documentation
   * [Introduction](docs/_introduction.md)
@@ -68,14 +74,44 @@ SmarterJSON.process(%({"id":1}\n{"id":2}\n{"id":3}))   # => [{"id"=>1}, {"id"=>2
 SmarterJSON.process('{"id":1}')                         # => {"id"=>1}   (one document → the value itself)
 SmarterJSON.process("")                                 # => nil          (zero documents)
-# Iterate one recovered document at a time with a block
+# For input larger than memory, stream one document at a time with a block
 # (process and process_file both forward the block):
 SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
 # Wrapper noise is stripped automatically:
-SmarterJSON.process("Here is the JSON:\n\n```json\n{\"a\":1}\n```\n")  # => {"a"=>1}
-SmarterJSON.process("<json>{\"a\":1}</json>")                          # => {"a"=>1}
-SmarterJSON.process("first:\n{\"a\":1}\nsecond:\n{\"b\":2}")      # => [{"a"=>1}, {"b"=>2}]
+SmarterJSON.process(<<~TEXT)
+  Here is the JSON:
+  ```json
+  {
+    "a": 1
+  }
+  ```
+TEXT
+# => {"a"=>1}
+SmarterJSON.process(<<~TEXT)
+  Here is the result:
+  {
+    "a": 1
+  }
+  Hope this helps.
+TEXT
+# => {"a"=>1}
+SmarterJSON.process("<json>{\"a\":1}</json>")
+# => {"a"=>1}
+SmarterJSON.process(<<~TEXT)
+  first attempt:
+  {"a":1}
+  corrected payload:
+  {"b":2}
+TEXT
+# => [{"a"=>1}, {"b"=>2}]
 ```
 ### Options
@@ -112,7 +148,7 @@ Benchmarks: p10 of 40 runs, Apple M1 Max, Ruby 3.4.7, on the standard JSON corpu
 **Two notes on fair comparison:**
-- **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the block form yields each recovered document instead of returning the collected Array.
+- **NDJSON:** on multi-document files, **only SmarterJSON parses the input via plain `process`** — Oj and `json` raise without a block, so their cells are `N/A`. That `N/A` reflects real default behavior, not a measurement gap. Plain `process` collects every document into an Array at ~270 MB/s; the streaming block form runs faster (~440 MB/s) because it doesn't hold all documents in memory at once.
 - **High-precision decimals (e.g. `canada.json`):** SmarterJSON's default `:auto` mode preserves high-precision numbers as `BigDecimal` (matching Oj's default), which is intrinsically slower than `Float`. Against `Float`-producing parsers it looks slower on such files; pass `bigdecimal_load: :float` to compare like-for-like (it then runs much faster). Against the equivalent `BigDecimal`-producing Oj mode, SmarterJSON is faster.
 ## Encoding

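Note on the README hunk: the return rule it documents (zero documents give `nil`, one gives the value, several give an `Array`, and the block form yields each document and returns `nil`) can be sketched with the stdlib parser. `process_lines` is a hypothetical stand-in, not part of smarter_json, and it only handles one document per line.

```ruby
# frozen_string_literal: true

require "json"

# Hypothetical helper: applies the documented return rule to
# newline-delimited input, parsing each non-empty line with stdlib JSON.
def process_lines(text)
  docs = text.each_line.lazy
             .map(&:strip)
             .reject(&:empty?)
             .map { |line| JSON.parse(line) }

  if block_given?
    # Block form: hand over one document at a time, collect nothing.
    docs.each { |doc| yield doc }
    return nil
  end

  all = docs.to_a
  case all.length
  when 0 then nil        # zero documents
  when 1 then all.first  # one document: the value itself
  else all               # several documents: an Array
  end
end

p process_lines("")
p process_lines(%({"id":1}))
p process_lines(%({"id":1}\n{"id":2}))
process_lines(%({"id":1}\n{"id":2})) { |doc| puts doc["id"] }
```

Because the lazy enumerator parses a line only when it is requested, the block form never holds more than one parsed document at a time. This is a rough analogue of the bounded-memory behavior the README attributes to the block form.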
data/docs/_introduction.md CHANGED

@@ -21,7 +21,7 @@ Most JSON parsers reject anything that isn't perfectly strict JSON, and they mak
 * **One reader, no modes, no flags.** There is no `dialect:` option and no "strict mode" — `SmarterJSON.process(input)` accepts the whole superset, and strict JSON is simply the narrowest case. You don't configure the reader to match your input; it adapts to whatever you give it.
-* **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** Pass a block to iterate the recovered documents one at a time. See [The Basic Read API](./basic_read_api.md).
+* **It reads multi-document input automatically — a distinguishing feature.** `SmarterJSON.process` handles NDJSON / JSONL / concatenated JSON with **no block and no special method**: zero documents returns `nil`, one document returns its value, two or more return an `Array`. The same rule applies when wrapper noise is stripped and several payloads are recovered from one blob. **Only SmarterJSON reads multi-document input via plain `process` — Oj and the stdlib `json` library raise without a block.** For input larger than memory, pass a block to stream one document at a time. See [The Basic Read API](./basic_read_api.md).
 * **It's fast.** A C extension (with a pure-Ruby fallback that runs everywhere) puts it ahead of Oj on nearly every file we benchmark, and competitive with the stdlib `json` C parser. Floats are parsed with Ryū (correctly rounded, single-pass), so number-heavy data is fast and bit-exact.

data/docs/basic_read_api.md CHANGED

@@ -24,6 +24,39 @@ SmarterJSON.process("host: localhost\nport: 5432")     # => {"host"=>"localhost"
 `process` is polymorphic: its first argument is **either a String of JSON content or an IO to read from**. A String is always treated as content, never as a filename — use `process_file` for paths. When the input wraps the payload in obvious markdown / prose / tags, `process` strips that wrapper first and then parses the recovered payload(s).
+```ruby
+SmarterJSON.process(<<~TEXT)
+  Here is the JSON:
+  ```json
+  {
+    "a": 1
+  }
+  ```
+TEXT
+# => {"a"=>1}
+SmarterJSON.process(<<~TEXT)
+  Here is the result:
+  {
+    "a": 1
+  }
+  Hope this helps.
+TEXT
+# => {"a"=>1}
+SmarterJSON.process(<<~TEXT)
+  first attempt:
+  {"a":1}
+  corrected payload:
+  {"b":2}
+TEXT
+# => [{"a"=>1}, {"b"=>2}]
+```
 ```ruby
 SmarterJSON.process(io)         # an open IO (File, StringIO, socket, …) — reads it and parses
 SmarterJSON.process(some_string) # JSON content
@@ -49,9 +82,9 @@ SmarterJSON.process_file("config.json5")     # read the file, then parse — sam
 `process_file` opens the file, reads it with the labeled [`encoding:`](./options.md) (default `"UTF-8"`, no transcoding pass), and parses it.
-## Streaming with a block
+## Streaming with a block (bounded memory)
-Pass a block to have each recovered top-level document yielded one at a time; the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
+For input larger than memory, pass a block. Each recovered top-level document is yielded as it is framed, and the method returns `nil` instead of collecting the documents into an Array. Both `process` and `process_file` forward the block.
 ```ruby
 SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }

data/docs/examples.md CHANGED

@@ -65,9 +65,9 @@ SmarterJSON.process('{"id":1}')                          # => {"id"=>1}
 SmarterJSON.process("")                                  # => nil
 ```
-### Example 5: Iterate Documents with a Block
+### Example 5: Streaming a Large File with a Block
-Pass a block to receive each recovered document one at a time:
+For input larger than memory, pass a block. Each recovered document is yielded one at a time:
 ```ruby
 SmarterJSON.process_file("events.ndjson") { |event| EventJob.perform_async(event) }
@@ -116,6 +116,8 @@ A `#`/`//` only starts a comment when preceded by whitespace, so `http://example
 ### Example 10: Wrapper Noise Around a Payload
+#### Fenced payload
 ```ruby
 SmarterJSON.process(<<~TEXT)
   Here is the JSON:
@@ -127,15 +129,38 @@ SmarterJSON.process(<<~TEXT)
   ```
 TEXT
 # => {"a"=>1}
+```
+#### Prose before / after the payload
+```ruby
+SmarterJSON.process(<<~TEXT)
+  Here is the result:
+  {
+    "a": 1
+  }
+  Hope this helps.
+TEXT
+# => {"a"=>1}
+```
+#### Wrapper tags
+```ruby
 SmarterJSON.process("<json>{\"a\":1}</json>")
 # => {"a"=>1}
+```
+#### Multiple recovered payloads from one noisy blob
+```ruby
 SmarterJSON.process(<<~TEXT)
-  first:
+  first attempt:
   {"a":1}
-  second:
+  corrected payload:
   {"b":2}
 TEXT
 # => [{"a"=>1}, {"b"=>2}]

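Note on the wrapper examples: the 0.9.1 changelog entry describes recovery as reactive, meaning the input is parsed first and wrapper extraction runs only if that parse fails. A minimal sketch of that control flow follows, assuming stdlib `JSON` as the parser and a simplified fence pattern. `parse_reactively` and `FENCE` are hypothetical names, and the gem's real extraction also covers tags, prose, and multiple payloads.

```ruby
# frozen_string_literal: true

require "json"

# Simplified, assumed pattern for a fenced block; not the gem's scanner.
FENCE = /```(?:json)?[ \t]*\r?\n(.*?)```/m

# Parse first; look for a fenced payload only when the plain parse raises.
def parse_reactively(text)
  JSON.parse(text)
rescue JSON::ParserError
  body = text[FENCE, 1]
  raise if body.nil? # nothing to recover: re-raise the original error
  JSON.parse(body)
end

# Clean JSON whose string value merely contains a fence marker stays on
# the single-parse path and never reaches the rescue branch.
p parse_reactively('{"md":"```json"}')

# Wrapped input fails the first parse, then the fenced body is parsed.
p parse_reactively("Here is the JSON:\n```json\n{\"a\": 1}\n```\n")
```

The design point is that the cost of scanning for wrappers is paid only by input that actually fails to parse.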
data/ext/smarter_json/smarter_json.c CHANGED

@@ -41,6 +41,18 @@ static VALUE fj_sym_duplicate_key;
 static ID    fj_bigdecimal_id; /* cached BigDecimal() method id (set in Init) */
 static ID    fj_to_sym_id;     /* cached :to_sym (symbolize_keys) */
 static ID    fj_key_p_id;      /* cached :key? (non-default duplicate_key modes) */
+static ID    fj_force_encoding_id;
+static ID    fj_valid_encoding_p_id;
+static ID    fj_encoding_id;
+static ID    fj_name_id;
+static VALUE fj_sym_encoding;
+static VALUE fj_sym_symbolize_keys;
+static VALUE fj_sym_first_wins;
+static VALUE fj_sym_raise;
+static VALUE fj_sym_bigdecimal_load;
+static VALUE fj_sym_float;
+static VALUE fj_sym_bigdecimal;
+static VALUE fj_sym_on_warning;
 /* Per-parse direct-mapped key cache: key bytes -> the interned (frozen,
  * globally-rooted) String, so repeated keys skip the global fstring lookup.
@@ -373,11 +385,17 @@ static void fj_consume_keyword(fj_state *st, const char *word) {
   fj_advance(st, n);
 }
-/* Copy a byte range into a fresh String, dropping underscores. */
+/* Copy a byte range into a fresh String, dropping underscores. Copies whole
+ * underscore-free runs in bulk, rather than one byte at a time. */
 static VALUE fj_strip_underscores(const char *p, long n) {
   VALUE s = rb_str_buf_new(n);
-  long i;
-  for (i = 0; i < n; i++) if (p[i] != '_') rb_str_buf_cat(s, p + i, 1);
+  long i = 0;
+  while (i < n) {
+    long start = i;
+    while (i < n && p[i] != '_') i++;
+    if (i > start) rb_str_buf_cat(s, p + start, i - start);
+    if (i < n) i++; /* skip '_' */
+  }
   return s;
 }
@@ -1379,14 +1397,14 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
   Check_Type(input, T_STRING);
-  enc_opt = rb_hash_aref(opts, ID2SYM(rb_intern("encoding")));
+  enc_opt = rb_hash_aref(opts, fj_sym_encoding);
   if (!NIL_P(enc_opt)) {
-    input = rb_funcall(rb_str_dup(input), rb_intern("force_encoding"), 1, enc_opt);
+    input = rb_funcall(rb_str_dup(input), fj_force_encoding_id, 1, enc_opt);
   }
-  if (!RTEST(rb_funcall(input, rb_intern("valid_encoding?"), 0))) {
-    VALUE name = rb_funcall(rb_funcall(input, rb_intern("encoding"), 0), rb_intern("name"), 0);
+  if (!RTEST(rb_funcall(input, fj_valid_encoding_p_id, 0))) {
+    VALUE name = rb_funcall(rb_funcall(input, fj_encoding_id, 0), fj_name_id, 0);
     VALUE msg = rb_sprintf("invalid byte sequence for %" PRIsVALUE, name);
-    rb_exc_raise(rb_funcall(cEncodingError, rb_intern("new"), 3, msg, Qnil, Qnil));
+    rb_exc_raise(rb_funcall(cEncodingError, fj_new_id, 3, msg, Qnil, Qnil));
   }
   st.buf = RSTRING_PTR(input);
@@ -1402,19 +1420,19 @@ static VALUE fj_parse_c(VALUE self, VALUE input, VALUE opts) {
   st.kcache = NULL;
 #endif
-  st.symbolize_keys = RTEST(rb_hash_aref(opts, ID2SYM(rb_intern("symbolize_keys"))));
-  dk = rb_hash_aref(opts, ID2SYM(rb_intern("duplicate_key")));
-  st.dup_first_wins = (dk == ID2SYM(rb_intern("first_wins")));
-  st.dup_raise = (dk == ID2SYM(rb_intern("raise")));
+  st.symbolize_keys = RTEST(rb_hash_aref(opts, fj_sym_symbolize_keys));
+  dk = rb_hash_aref(opts, fj_sym_duplicate_key);
+  st.dup_first_wins = (dk == fj_sym_first_wins);
+  st.dup_raise = (dk == fj_sym_raise);
   {
-    VALUE bd = rb_hash_aref(opts, ID2SYM(rb_intern("bigdecimal_load")));
-    if (bd == ID2SYM(rb_intern("float"))) st.bigdecimal_load = 0;
-    else if (bd == ID2SYM(rb_intern("bigdecimal"))) st.bigdecimal_load = 2;
+    VALUE bd = rb_hash_aref(opts, fj_sym_bigdecimal_load);
+    if (bd == fj_sym_float) st.bigdecimal_load = 0;
+    else if (bd == fj_sym_bigdecimal) st.bigdecimal_load = 2;
     else st.bigdecimal_load = 1; /* :auto (default), including nil */
   }
-  st.on_warning = rb_hash_aref(opts, ID2SYM(rb_intern("on_warning"))); /* Qnil when absent */
+  st.on_warning = rb_hash_aref(opts, fj_sym_on_warning); /* Qnil when absent */
   if (st.len >= 3 && (unsigned char)st.buf[0] == 0xEF &&
       (unsigned char)st.buf[1] == 0xBB && (unsigned char)st.buf[2] == 0xBF) {
@@ -1465,8 +1483,20 @@ void Init_smarter_json(void) {
   fj_key_p_id = rb_intern("key?");
   fj_new_id = rb_intern("new");
   fj_call_id = rb_intern("call");
+  fj_force_encoding_id = rb_intern("force_encoding");
+  fj_valid_encoding_p_id = rb_intern("valid_encoding?");
+  fj_encoding_id = rb_intern("encoding");
+  fj_name_id = rb_intern("name");
   fj_sym_empty_slot = ID2SYM(rb_intern("empty_slot"));
   fj_sym_empty_value = ID2SYM(rb_intern("empty_value"));
   fj_sym_duplicate_key = ID2SYM(rb_intern("duplicate_key"));
+  fj_sym_encoding = ID2SYM(rb_intern("encoding"));
+  fj_sym_symbolize_keys = ID2SYM(rb_intern("symbolize_keys"));
+  fj_sym_first_wins = ID2SYM(rb_intern("first_wins"));
+  fj_sym_raise = ID2SYM(rb_intern("raise"));
+  fj_sym_bigdecimal_load = ID2SYM(rb_intern("bigdecimal_load"));
+  fj_sym_float = ID2SYM(rb_intern("float"));
+  fj_sym_bigdecimal = ID2SYM(rb_intern("bigdecimal"));
+  fj_sym_on_warning = ID2SYM(rb_intern("on_warning"));
   rb_define_module_function(mSmarterJSON, "parse_c", fj_parse_c, 2);
 }

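Note on `fj_strip_underscores`: the new C loop appends whole underscore-free runs instead of one byte per call. The same loop shape is transliterated to Ruby below purely as an illustration. `strip_underscores` is a hypothetical name, and in real Ruby code `String#delete("_")` would be the idiomatic choice.

```ruby
# frozen_string_literal: true

UNDERSCORE = 0x5F

# Copy a byte range into a fresh String, dropping underscores, by appending
# each underscore-free run in one operation (mirrors the C loop in the diff).
def strip_underscores(bytes)
  out = +""
  n = bytes.bytesize
  i = 0
  while i < n
    start = i
    i += 1 while i < n && bytes.getbyte(i) != UNDERSCORE
    out << bytes.byteslice(start, i - start) if i > start
    i += 1 if i < n # skip the '_' that ended the run
  end
  out
end

p strip_underscores("1_000_000")
p strip_underscores("0x_FF_FF")
```

For a literal such as `1_000_000` this performs three appends rather than seven.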
data/lib/smarter_json/parser.rb CHANGED

@@ -63,22 +63,300 @@ module SmarterJSON
     end
   end
-  # Stream documents from an IO, one line (= one document) at a time, yielding
-  # each — bounded memory. Newline-delimited (NDJSON / JSONL); a single document
-  # spanning multiple lines is not supported by the streaming path.
+  # Stream documents from an IO incrementally, yielding each recovered top-level
+  # document without slurping the whole input into memory first.
   def stream_io(io, options, &block)
-    Recovery.process_string(io.read, options, &block)
+    Framer.each_document(io) { |doc| Recovery.process_string(doc, options, &block) }
+    nil
   end
   private_class_method :process_content, :stream_io
+  # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
+  # scanners so none of them spell out raw hex. Included where needed.
+  module Bytes
+    LBRACE     = 0x7B
+    RBRACE     = 0x7D
+    LBRACKET   = 0x5B
+    RBRACKET   = 0x5D
+    COLON      = 0x3A
+    COMMA      = 0x2C
+    DQUOTE     = 0x22
+    SQUOTE     = 0x27
+    BACKSLASH  = 0x5C
+    SLASH      = 0x2F
+    STAR       = 0x2A
+    HASH       = 0x23
+    MINUS      = 0x2D
+    PLUS       = 0x2B
+    DOT        = 0x2E
+    ZERO       = 0x30
+    NINE       = 0x39
+    LOWER_E    = 0x65
+    UPPER_E    = 0x45
+    LOWER_T    = 0x74
+    LOWER_F    = 0x66
+    LOWER_N    = 0x6E
+    LOWER_U    = 0x75
+    LOWER_X    = 0x78
+    UPPER_X    = 0x58
+    UPPER_I    = 0x49
+    UPPER_N    = 0x4E
+    UPPER_T    = 0x54
+    UPPER_F    = 0x46
+    UNDERSCORE = 0x5F
+    DOLLAR     = 0x24
+    SPACE      = 0x20
+    TAB        = 0x09
+    LF         = 0x0A
+    CR         = 0x0D
+  end
+  module Framer
+    include Bytes
+    CHUNK_SIZE = 16 * 1024
+    module_function
+    def each_document(io, &block)
+      buffer = +""
+      scan = 0
+      doc_start = nil
+      stack = []
+      mode = nil
+      while (chunk = read_chunk(io))
+        buffer << chunk
+        loop do
+          emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
+          break unless emitted
+          yield emitted
+        end
+      end
+      yield buffer unless separators_only?(buffer)
+    end
+    def read_chunk(io)
+      if io.respond_to?(:readpartial)
+        io.readpartial(CHUNK_SIZE)
+      else
+        io.read(CHUNK_SIZE)
+      end
+    rescue EOFError
+      nil
+    end
+    def scan_buffer(buffer, scan, doc_start, stack, mode)
+      while scan < buffer.bytesize
+        b = buffer.getbyte(scan)
+        # A multi-byte marker (// /* ''' */) whose lead byte is here but whose
+        # remaining bytes have not arrived yet must not be guessed at — advancing
+        # past the lead byte would misread the brace/quote that follows it once the
+        # next chunk lands. Stop and let each_document append more input, then resume
+        # from this same position. At true EOF the leftover is parsed whole instead.
+        break if defer_for_split_marker?(buffer, scan, b, mode, doc_start)
+        if mode == :double
+          if b == BACKSLASH
+            scan += 2
+          elsif b == DQUOTE
+            mode = nil
+            scan += 1
+          else
+            scan += 1
+          end
+        elsif mode == :single
+          if b == BACKSLASH
+            scan += 2
+          elsif b == SQUOTE
+            mode = nil
+            scan += 1
+          else
+            scan += 1
+          end
+        elsif mode == :triple
+          if buffer.byteslice(scan, 3) == "'''"
+            mode = nil
+            scan += 3
+          else
+            scan += 1
+          end
+        elsif mode == :line_comment
+          if [LF, CR].include?(b)
+            mode = nil
+          else
+            scan += 1
+            next
+          end
+        elsif mode == :block_comment
+          if buffer.byteslice(scan, 2) == '*/'
+            mode = nil
+            scan += 2
+          else
+            scan += 1
+          end
+        elsif doc_start.nil?
+          if whitespace_byte?(b)
+            scan += 1
+          elsif line_comment_start?(buffer, scan)
+            mode = :line_comment
+            scan += buffer.getbyte(scan) == HASH ? 1 : 2
+          elsif block_comment_start?(buffer, scan)
+            mode = :block_comment
+            scan += 2
+          elsif [LBRACE, LBRACKET].include?(b)
+            doc_start = scan
+            stack << b
+            scan += 1
+          else
+            scan = buffer.bytesize
+          end
+        else
+          if mode.nil? && line_comment_start?(buffer, scan)
+            mode = :line_comment
+            scan += buffer.getbyte(scan) == HASH ? 1 : 2
+          elsif mode.nil? && block_comment_start?(buffer, scan)
+            mode = :block_comment
+            scan += 2
+          elsif b == DQUOTE
+            mode = :double
+            scan += 1
+          elsif buffer.byteslice(scan, 3) == "'''"
+            mode = :triple
+            scan += 3
+          elsif b == SQUOTE
+            mode = :single
+            scan += 1
+          elsif [LBRACE, LBRACKET].include?(b)
+            stack << b
+            scan += 1
+          elsif b == RBRACE
+            stack.pop if stack.last == LBRACE
+            scan += 1
+            if stack.empty?
+              doc = buffer.byteslice(doc_start, scan - doc_start)
+              buffer = buffer.byteslice(scan..-1) || +""
+              return [doc, buffer, 0, nil, [], nil]
+            end
+          elsif b == RBRACKET
+            stack.pop if stack.last == LBRACKET
+            scan += 1
+            if stack.empty?
+              doc = buffer.byteslice(doc_start, scan - doc_start)
+              buffer = buffer.byteslice(scan..-1) || +""
+              return [doc, buffer, 0, nil, [], nil]
+            end
+          else
+            scan += 1
+          end
+        end
+      end
+      [nil, buffer, scan, doc_start, stack, mode]
+    end
+    # True when `b` is the lead byte of a multi-byte marker but the rest of that
+    # marker has not been read into the buffer yet, so we cannot decide what it is.
+    # `//` and `/*` need 2 bytes; `'''` (and a closing `'''`) needs 3; a closing
+    # `*/` needs 2. Backslash escapes and single-byte delimiters never need this.
+    def defer_for_split_marker?(buffer, scan, b, mode, doc_start)
+      avail = buffer.bytesize - scan
+      case mode
+      when :block_comment
+        b == STAR && avail < 2
+      when :triple
+        b == SQUOTE && avail < 3
+      when nil
+        if doc_start.nil?
+          b == SLASH && avail < 2
+        else
+          (b == SLASH && avail < 2) || (b == SQUOTE && avail < 3)
+        end
+      else
+        false
+      end
+    end
+    def separators_only?(buffer)
+      scan = 0
+      mode = nil
+      while scan < buffer.bytesize
+        b = buffer.getbyte(scan)
+        if mode == :line_comment
+          if [LF, CR].include?(b)
+            mode = nil
+          else
+            scan += 1
+            next
+          end
+        elsif mode == :block_comment
+          if buffer.byteslice(scan, 2) == '*/'
+            mode = nil
+            scan += 2
+          else
+            scan += 1
+          end
+        elsif whitespace_byte?(b)
+          scan += 1
+        elsif line_comment_start?(buffer, scan)
+          mode = :line_comment
+          scan += buffer.getbyte(scan) == HASH ? 1 : 2
+        elsif block_comment_start?(buffer, scan)
+          mode = :block_comment
+          scan += 2
+        else
+          return false
+        end
+      end
+      true
+    end
+    def whitespace_byte?(b)
+      b == SPACE || (b && b >= TAB && b <= CR)
+    end
+    def line_comment_start?(buffer, scan)
+      b = buffer.getbyte(scan)
+      return preceded_by_ws_or_start?(buffer, scan) if b == HASH
+      b == SLASH && buffer.getbyte(scan + 1) == SLASH && preceded_by_ws_or_start?(buffer, scan)
+    end
+    def block_comment_start?(buffer, scan)
+      buffer.getbyte(scan) == SLASH && buffer.getbyte(scan + 1) == STAR && preceded_by_ws_or_start?(buffer, scan)
+    end
+    def preceded_by_ws_or_start?(buffer, scan)
+      return true if scan.zero?
+      prev = buffer.getbyte(scan - 1)
+      whitespace_byte?(prev)
+    end
+  end
   module Recovery
+    include Bytes
     module_function
     def process_string(input, options, &block)
       return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
-      if wrapper_hint?(input)
+      # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
+      # the parse actually fails (the rescue below). Every wrapper shape — code fences,
+      # <json>/BEGIN_JSON tags, prose around the payload — makes the parse raise, so the
+      # rescue catches it. Crucially this keeps clean input on the single-parse fast path
+      # even when its string values legitimately contain ``` or <json> (real-world data
+      # like GitHub event payloads is full of markdown), instead of dragging hundreds of
+      # MB through the pure-Ruby candidate scan.
+      #
+      # The one exception is a bare leading label like "JSON: {...}", which parses
+      # successfully but WRONGLY (as an implicit-root object keyed by the label), so it
+      # must be intercepted before parsing.
+      if leading_label?(input)
         payloads = extract_payloads(input, options)
         return replay_payloads(payloads, options, &block) unless payloads.empty?
       end
@@ -93,10 +371,14 @@ module SmarterJSON
       raise
     end
-    def wrapper_hint?(input)
-      return false unless input.valid_encoding?
-      input.match?(/```|<json\b|BEGIN_JSON\b/i) || input.match?(/\A[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
+    # Whether the input opens with a bare "JSON:" / "Final answer:" label (which would
+    # otherwise parse, wrongly, as an implicit-root object keyed by the label). We use
+    # String#start_with? with a Regexp rather than match?(/\A.../): start_with? checks
+    # only the beginning, whereas a \A-anchored match? still retries at every byte
+    # position and so scans the WHOLE input (≈0.3s on a 200 MB document) on every parse.
+    # (Caller has already established the input is valid_encoding?.)
+    def leading_label?(input)
+      input.start_with?(/[[:space:]]*(?:JSON|Final answer)[[:space:]]*:/i)
     end
     def replay_payloads(payloads, options, &block)
@@ -146,16 +428,31 @@ module SmarterJSON
       last = ranges.last
       prefix = input.byteslice(0, first.begin)
       suffix = input.byteslice(last.end, input.bytesize - last.end)
+      # Look for fence / wrapper markers only in the text we actually strip (outside
+      # every recovered payload), so a ``` or <json> sitting inside a payload's own
+      # string value does not trigger a "stripped a wrapper" warning.
+      outside = non_payload_text(input, ranges)
       {
         prefix: substantive_text?(prefix),
         suffix: substantive_text?(suffix),
-        fence: input.match?(/```/),
-        wrapper: input.match?(/<json\b|BEGIN_JSON\b/i),
+        fence: outside.include?("```"),
+        wrapper: outside.match?(/<json\b|BEGIN_JSON\b/i),
         first_pos: line_col_for(input, first.begin),
         last_pos: line_col_for(input, last.begin)
       }
     end
+    def non_payload_text(input, ranges)
+      out = +""
+      pos = 0
+      ranges.each do |range|
+        out << input.byteslice(pos, range.begin - pos) if range.begin > pos
+        pos = range.end
+      end
+      out << input.byteslice(pos, input.bytesize - pos) if pos < input.bytesize
+      out
+    end
     def line_col_for(input, offset)
       line = 1
       col = 1
@@ -164,15 +461,15 @@ module SmarterJSON
         b = input.getbyte(i)
         break if b.nil?
-        if b == 0x0A
+        if b == LF
           line += 1
           col = 1
           i += 1
-        elsif b == 0x0D
+        elsif b == CR
           line += 1
           col = 1
           i += 1
-          i += 1 if i < offset && input.getbyte(i) == 0x0A
+          i += 1 if i < offset && input.getbyte(i) == LF
         else
           col += 1
           i += 1
@@ -203,19 +500,19 @@ module SmarterJSON
       while i < input.bytesize
         b = input.getbyte(i)
         if mode == :double
-          if b == 0x5C
+          if b == BACKSLASH
             i += 2
             next
-          elsif b == 0x22
+          elsif b == DQUOTE
             mode = nil
           end
           i += 1
           next
         elsif mode == :single
-          if b == 0x5C
+          if b == BACKSLASH
             i += 2
             next
-          elsif b == 0x27
+          elsif b == SQUOTE
             mode = nil
           end
           i += 1
@@ -229,7 +526,7 @@ module SmarterJSON
           end
           next
         elsif mode == :line_comment
-          if [0x0A, 0x0D].include?(b)
+          if [LF, CR].include?(b)
             mode = nil
           else
             i += 1
@@ -252,11 +549,11 @@ module SmarterJSON
             mode = :block_comment
             i += 2
             next
-          elsif b == 0x23
+          elsif b == HASH
             mode = :line_comment
             i += 1
             next
-          elsif b == 0x22
+          elsif b == DQUOTE
             mode = :double
             i += 1
             next
@@ -264,21 +561,21 @@ module SmarterJSON
             mode = :triple
             i += 3
             next
-          elsif b == 0x27
+          elsif b == SQUOTE
             mode = :single
             i += 1
             next
-          elsif [0x7B, 0x5B].include?(b)
+          elsif [LBRACE, LBRACKET].include?(b)
             start_pos = i if stack.empty?
             stack << b
-          elsif b == 0x7D
-            stack.pop if stack.last == 0x7B
+          elsif b == RBRACE
+            stack.pop if stack.last == LBRACE
             if stack.empty? && start_pos
               ranges << (start_pos...(i + 1))
               start_pos = nil
             end
-          elsif b == 0x5D
-            stack.pop if stack.last == 0x5B
+          elsif b == RBRACKET
+            stack.pop if stack.last == LBRACKET
             if stack.empty? && start_pos
               ranges << (start_pos...(i + 1))
               start_pos = nil
@@ -304,41 +601,7 @@ module SmarterJSON
   #          Python literals (True/False/None) and undefined, underscores in
   #          numeric literals, and encoding validation (SmarterJSON::EncodingError).
   class Parser
-    LBRACE     = 0x7B
-    RBRACE     = 0x7D
-    LBRACKET   = 0x5B
-    RBRACKET   = 0x5D
-    COLON      = 0x3A
-    COMMA      = 0x2C
-    DQUOTE     = 0x22
-    SQUOTE     = 0x27
-    BACKSLASH  = 0x5C
-    SLASH      = 0x2F
-    STAR       = 0x2A
-    HASH       = 0x23
-    MINUS      = 0x2D
-    PLUS       = 0x2B
-    DOT        = 0x2E
-    ZERO       = 0x30
-    NINE       = 0x39
-    LOWER_E    = 0x65
-    UPPER_E    = 0x45
-    LOWER_T    = 0x74
-    LOWER_F    = 0x66
-    LOWER_N    = 0x6E
-    LOWER_U    = 0x75
-    LOWER_X    = 0x78
-    UPPER_X    = 0x58
-    UPPER_I    = 0x49
-    UPPER_N    = 0x4E
-    UPPER_T    = 0x54
-    UPPER_F    = 0x46
-    UNDERSCORE = 0x5F
-    DOLLAR     = 0x24
-    SPACE      = 0x20
-    TAB        = 0x09
-    LF         = 0x0A
-    CR         = 0x0D
+    include Bytes
     NOT_NUMERIC = Object.new
     HEX_RE      = /\A[-+]?0[xX][0-9a-fA-F_]+\z/.freeze

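Note on the Framer: `defer_for_split_marker?` exists because a two-byte marker such as `//` can be cut in half by a chunk boundary. The sketch below reproduces that situation with a deliberately tiny chunk size and a reduced version of the check that covers only the `/` lead byte. `read_chunks` and `split_marker_pending?` are hypothetical helpers, not the gem's methods.

```ruby
# frozen_string_literal: true

require "stringio"

SLASH = 0x2F
CHUNK_SIZE = 9 # tiny on purpose so the "//" marker straddles the boundary

# Read the IO in fixed-size pieces; readpartial raises EOFError at the end.
def read_chunks(io)
  chunks = []
  begin
    loop { chunks << io.readpartial(CHUNK_SIZE) }
  rescue EOFError
    # end of input
  end
  chunks
end

# Reduced form of the deferral test: a '/' lead byte with fewer than two
# bytes available cannot yet be classified as "//" or "/*".
def split_marker_pending?(buffer, scan)
  buffer.getbyte(scan) == SLASH && buffer.bytesize - scan < 2
end

chunks = read_chunks(StringIO.new('{"a":1} // note'))
p chunks

buffer = +""
buffer << chunks[0]
p split_marker_pending?(buffer, 8) # lead '/' seen, second byte not yet read

buffer << chunks[1]
p split_marker_pending?(buffer, 8) # marker is now complete
p buffer.byteslice(8, 2)
```

Once the second chunk is appended, scanning can resume from the same position and see the full marker, which is the behavior the comment in `scan_buffer` describes.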
data/lib/smarter_json/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterJSON
-  VERSION = "0.8.0"
+  VERSION = "0.9.2"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: smarter_json
 version: !ruby/object:Gem::Version
-  version: 0.8.0
+  version: 0.9.2
 platform: ruby
 authors:
 - Tilo Sloboda