RubyGems - smarter_json - Versions diffs - 1.2.2 → 1.2.3 - Mend

smarter_json 1.2.2 → 1.2.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a31a7485e53042c06cb28c92f23e24cb55067fd312704c0c9ac0776868f20293
-  data.tar.gz: 531187f8f7ea38f573785f09fa6e217f63a4393fd537d80c92f00973aeb2b375
+  metadata.gz: edd2cbc389a44f0714898f2a2942032632ca5f13a83d93acd708988866974893
+  data.tar.gz: 5cfa8719265797fd0fcd59f0060a5dd255aeba67ff229c78ab5696ad86f8b8e3
 SHA512:
-  metadata.gz: f88266d06416277355f770645d694be6e81a8faef22a86806cd1e4620e773dee50aeeeeb09cfbc0de47354972250c2654e599c5c03ba0926ca129d9d6a638b0c
-  data.tar.gz: c8bdd4a34d4617c897c815f2526c0f5555459b35bad47106a1db2b71dedac5d7a2605d0a5f92676735ae4895208fcdf1272f05dce09026aa696b84f143693ee5
+  metadata.gz: b0306cb7eab1db78053e8620f119b7debc54d836c38369b328bc0e0b9e8798d23eb4bbdb82281d036f62cb8cd296a3c8d7667f41a49d7e643a955830e08caf99
+  data.tar.gz: 0a5f052e115141014bcf2f0e031d43fb40521dd01da0f1ed5c6d4edeea785ead09f8ebcdfb76b580b4b8157391087ff2a79c5cd548ae7d8a0be96d0527a9f8fb

data/CHANGELOG.md CHANGED

@@ -13,9 +13,20 @@
 > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
 >    Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
-## 1.2.2 (unreleased)
+## 1.2.3 (2026-06-28)
-RSpec tests: 1,167
+RSpec tests: 1,167 → 1,268
+Fixing some encoding corner cases:
+  - **UTF-16 / UTF-32** and **Shift_JIS** (and other CJK double-byte encodings such as Big5 / GBK / GB18030) previously raised or mis-parsed; they now parse, with string values tagged in the input's encoding.
+  - Applies to String, file (`process_file`), and IO / streaming (`foreach`) input — including a file the caller opened with transcoding (e.g. `File.open(path, "r:UTF-8:UTF-16LE")`), where the output is the encoding the bytes arrive in.
+  - Streaming a **Latin-1 / Windows-1252** (or other single-byte) file or IO now preserves that encoding too, instead of mislabelling or raising.
+  - Streaming a UTF-16 / UTF-32 / Shift_JIS (or transcoding) source via `foreach` / `process_file` is now **bounded-memory** — it frames and parses one document at a time instead of reading the whole input into memory.
+  - New `:replace_char` option (default `"?"`): when a `\uXXXX` escape decodes to a character the input's encoding can't represent (e.g. an emoji inside a Shift_JIS document), that character is replaced rather than raising. `replace_char: ""` drops it.
+## 1.2.2 (2026-06-19)
+RSpec tests: 1,165 → 1,167
 - The Eisel-Lemire fast path for `decimal_precision: :float` now covers decimals with **up to 19 significant digits** (was 18). 19 digits is the most that fits exactly in a `uint64` (max 19-digit ≈ 1.0e19 < `UINT64_MAX` ≈ 1.8e19), so these no longer fall back to the slower `strtod`. Still correctly rounded, bit-for-bit identical to the stdlib — verified across 18/19-digit round-to-even tie shapes.
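
The `:replace_char` entry in the 1.2.3 notes can be approximated with plain Ruby. The following is a minimal sketch that uses only the stdlib `String#encode`, with the same `invalid:` / `undef:` / `replace:` options that `reencode_scalar` passes later in this diff. The helper names `retag` and `strict_encode_raises?` are hypothetical and are not part of smarter_json.

```ruby
# Sketch only: mimics the re-tagging step with String#encode (stdlib).
# `retag` is a hypothetical helper name, not smarter_json API.
def retag(value, enc, replace_char = "?")
  value.encode(enc, invalid: :replace, undef: :replace, replace: replace_char)
end

# Without replacement options, an unrepresentable character raises.
def strict_encode_raises?(value, enc)
  value.encode(enc)
  false
rescue Encoding::UndefinedConversionError
  true
end

emoji_text = "ok \u{1F600} done" # the emoji has no Shift_JIS representation

replaced = retag(emoji_text, Encoding::Shift_JIS)     # emoji becomes "?"
dropped  = retag(emoji_text, Encoding::Shift_JIS, "") # emoji is dropped

puts replaced.encoding
puts replaced.encode("UTF-8")
puts dropped.encode("UTF-8")
puts strict_encode_raises?(emoji_text, Encoding::Shift_JIS)
```

Only the single unrepresentable character is affected; the rest of the string is converted normally and the result carries the target encoding tag.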

data/README.md CHANGED

@@ -222,6 +222,7 @@ In short: **SmarterJSON's C path matches or beats Oj/strict on every file** (app
 | `decimal_precision` | `:auto`      | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
 | `acceleration`    | `true`       | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
 | `encoding`        | `nil`        | labels the input's encoding; `nil` keeps the input's own (no transcoding pass; see below) |
+| `replace_char`    | `"?"`        | replacement for a char a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS); `""` drops it |
 | `on_warning`      | `nil`        | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`, `:number_overflow`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
 ## Examples
@@ -358,6 +359,8 @@ TEXT
 `encoding:` (default `nil`) labels what the input is — it does **not** transcode. With `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that same tag, the way `smarter_csv` does — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) whose bytes are valid UTF-8 is treated as UTF-8. That is exactly how `Net::HTTP` and many HTTP libraries hand you a `response.body` (correct UTF-8 bytes, BINARY tag); without this, string values would come back tagged `ASCII-8BIT` and compare unequal to UTF-8 literals. If such `ASCII-8BIT` input is *not* valid UTF-8, it raises `SmarterJSON::EncodingError` rather than guess a legacy encoding — pass an explicit `encoding:` (e.g. `"ISO-8859-1"`) for that. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
+UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings parse too: SmarterJSON works on a UTF-8 copy internally and re-tags the result back into the input's own encoding, so values come back in the encoding the bytes arrived in (a UTF-16 / UTF-32 BOM is consumed on the way in). The one edge case — a `\uXXXX` escape that decodes to a character that encoding can't represent (e.g. an emoji inside a Shift_JIS document) — is replaced by `replace_char` (default `"?"`, or `""` to drop it) rather than raising.
 ## Nesting & untrusted input
 Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue. **`generate` is iterative too**, so serializing a deeply nested Ruby structure can't overflow the stack either — reading *and* writing are both depth-safe.
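
The `ASCII-8BIT` smart default described in this README hunk can be illustrated without the gem. This is a sketch under stated assumptions: `label_binary_as_utf8` is a hypothetical helper, and it raises `ArgumentError` where the README says the gem raises `SmarterJSON::EncodingError`.

```ruby
# Sketch only: `label_binary_as_utf8` is a hypothetical helper, not smarter_json API.
# BINARY-tagged bytes that are valid UTF-8 are relabelled as UTF-8; anything else raises.
def label_binary_as_utf8(input)
  return input unless input.encoding == Encoding::ASCII_8BIT

  candidate = input.dup.force_encoding(Encoding::UTF_8) # relabel only, bytes are unchanged
  raise ArgumentError, "binary input is not valid UTF-8" unless candidate.valid_encoding?

  candidate
end

utf8_literal = "{\"city\":\"Z\u00FCrich\"}"
body = utf8_literal.b # same bytes, BINARY tag, as many HTTP clients return them

puts body == utf8_literal                       # the tags differ, so the comparison fails
puts label_binary_as_utf8(body) == utf8_literal # equal once relabelled
```

No transcoding pass happens here: `force_encoding` changes only the tag, which is why the check with `valid_encoding?` is needed before trusting the new label.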

data/docs/options.md CHANGED

@@ -22,6 +22,7 @@ These options are passed to [`SmarterJSON.process`](./basic_read_api.md), `Smart
 | `:duplicate_key`  | `:last_wins` | How to handle a key that repeats within one object: `:last_wins` or `:first_wins`. (Every repeat is also reported through `:on_warning` — see below.)                          |
 | `:encoding`       | `nil`        | Labels the input's encoding (e.g. `"UTF-8"`). It does **not** trigger a transcoding pass — see below.                  |
 | `:on_warning`     | `nil`        | A callable invoked once per lenient fix applied, passed a `SmarterJSON::Warning`. Never changes the return value. See below. |
+| `:replace_char`   | `"?"`        | Replacement for a character a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS). `""` drops it. See below. |
 | `:symbolize_keys` | `false`      | Return object keys as Symbols instead of Strings.                                                                      |
 ```ruby
@@ -47,7 +48,11 @@ The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`
 ### A note on `:encoding`
-`:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically; UTF-16 / UTF-32 input is out of scope.
+`:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically. UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings are now supported as well: the document parses and string values come back tagged in the input's own encoding (a UTF-16 / UTF-32 BOM is consumed on the way in). The one wrinkle — a `\uXXXX` escape that decodes to a character the input's encoding can't represent — is handled by `:replace_char` (above).
+### A note on `:replace_char`
+For an input in an encoding that can't be byte-scanned directly (UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings), SmarterJSON parses a UTF-8 copy and re-tags the result back into the input's encoding, so you get values in the encoding the bytes arrived in. A `\uXXXX` escape can decode to a character that encoding can't represent — e.g. an emoji inside a Shift_JIS document. Rather than raise, that single character is replaced by `:replace_char` (default `"?"`). Set `replace_char: ""` to drop it, or pass any string your target encoding can hold (e.g. the geta mark `"〓"` for Shift_JIS). It applies only on this transcode-and-re-tag path; for plain UTF-8 / single-byte input it never comes into play.
 ### A note on `:decimal_precision`
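
The `:replace_char` note in this hunk says that Shift_JIS and similar encodings cannot be byte-scanned directly. A small stdlib-only sketch shows why: the katakana character U+30BD encodes to two bytes in Shift_JIS, and the second one is `0x5C`, the ASCII backslash. The helper names below are hypothetical.

```ruby
# Sketch only: demonstrates an ASCII-range trail byte in Shift_JIS.
def encoded_bytes(text, enc)
  text.encode(enc).bytes
end

# True when the encoded form contains the byte value of an ASCII backslash,
# which a byte-level JSON scanner would read as the start of a string escape.
def contains_backslash_byte?(text, enc)
  encoded_bytes(text, enc).include?(0x5C)
end

so = "\u30BD" # katakana "so"

puts encoded_bytes(so, Encoding::Shift_JIS).map { |b| format("0x%02X", b) }.join(" ")
puts contains_backslash_byte?(so, Encoding::Shift_JIS) # trail byte collides with "\"
puts contains_backslash_byte?(so, Encoding::EUC_JP)    # both bytes stay above 0x7F
puts contains_backslash_byte?(so, Encoding::UTF_8)     # multibyte UTF-8 never reuses ASCII bytes
```

This is the reason such input is parsed through a UTF-8 copy and re-tagged afterwards, and it is only on that path that `:replace_char` can come into play.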

data/lib/smarter_json/options.rb CHANGED

@@ -12,6 +12,7 @@ module SmarterJSON
       duplicate_key: :last_wins, # :last_wins | :first_wins  (repeats are also reported via on_warning)
       decimal_precision: :auto,  # :auto | :float | :bigdecimal  (Oj-compatible decimal handling)
       on_warning: nil,           # a callable invoked once per non-fatal lenient fix (a SmarterJSON::Warning)
+      replace_char: "?",         # replacement for a char not representable in the input's encoding (undef: :replace); "" drops it
     }.freeze
     module_function
@@ -56,6 +57,9 @@ module SmarterJSON
       unless encoding.nil? || encoding.is_a?(String)
         errors << "encoding must be nil or a String (got #{encoding.class})"
       end
+      unless options[:replace_char].is_a?(String)
+        errors << "replace_char must be a String (got #{options[:replace_char].class})"
+      end
       raise ArgumentError, "SmarterJSON: invalid options — #{errors.join('; ')}" if errors.any?

data/lib/smarter_json/parser.rb CHANGED

@@ -41,22 +41,42 @@ module SmarterJSON
   # SmarterJSON.process_file(path, options = {}) — open a file and process it.
   #
-  # The :encoding option labels the file's encoding (default "UTF-8"); it does NOT
-  # trigger a transcoding pass — the parser works on the bytes in their native
-  # encoding and emits string values with the same encoding tag. With a block,
-  # streams document-by-document straight from disk in bounded memory (never
-  # loading the whole file); the documents are read as newline-delimited
-  # (NDJSON / JSONL), one per line.
+  # The :encoding option labels the file's encoding (default "UTF-8").
+  #
+  # The user can send any encoding to SmarterJSON - we make zero assumptions about encoding.
+  # We also do not "normalize" the input to a different encoding on our own (this is not Python).
+  #
+  # We parse the bytes in whatever encoding they arrive in and emit string values
+  # with that same encoding tag.
+  #
+  # The caller is free to transcode the input themselves (e.g. open the file with a "r:ext:int" mode);
+  # however the bytes arrive, we parse them and preserve their encoding. With a block,
+  # streams document-by-document straight from disk in bounded memory (never loading the whole file);
+  # the documents are read as newline-delimited (NDJSON / JSONL), one per line.
+  #
   def process_file(path, options = {}, &block)
     options = Options.process_options(options)
     encoding = options[:encoding] || "UTF-8"
+    mode = file_read_mode(encoding)
     if block
-      File.open(path, "r:#{encoding}") { |io| stream_io(io, options, &block) }
+      File.open(path, mode) { |io| stream_io(io, options, &block) }
     else
-      process(File.read(path, encoding: encoding), options)
+      process(File.read(path, mode: mode), options)
     end
   end
+  # Read mode for process_file. Binary mode is required for ASCII-incompatible encodings
+  # (UTF-16 / UTF-32) — text mode refuses them ("ASCII incompatible encoding needs binmode").
+  # ASCII-compatible encodings keep TEXT mode, so newline translation (e.g. \r\n on Windows)
+  # is unchanged — binmode only applies where text mode is impossible anyway.
+  def file_read_mode(encoding)
+    incompatible = encoding.to_s.split(":").any? do |name|
+      enc = Encoding.find(name) rescue nil
+      enc && !enc.ascii_compatible?
+    end
+    incompatible ? "rb:#{encoding}" : "r:#{encoding}"
+  end
   # SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of
   # process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a
   # plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way
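
The comment on `file_read_mode` in the hunk above states that text mode refuses ASCII-incompatible encodings. The following stdlib-only sketch reproduces that behaviour with a temporary UTF-16LE file; the helper names are hypothetical and nothing here calls smarter_json.

```ruby
require "tempfile"

# Sketch only: writes `json` as raw UTF-16LE bytes and yields the file path.
def with_utf16le_file(json)
  Tempfile.create("encoding_sketch") do |file|
    file.binmode
    file.write(json.encode(Encoding::UTF_16LE).b)
    file.flush
    yield file.path
  end
end

# Text mode with an ASCII-incompatible encoding raises ArgumentError
# ("ASCII incompatible encoding needs binmode").
def text_mode_rejected?(path, encoding_name)
  File.open(path, "r:#{encoding_name}") { |io| io.read }
  false
rescue ArgumentError
  true
end

# Binary mode plus an explicit encoding reads the bytes and tags them.
def read_binary_tagged(path, encoding_name)
  File.open(path, "rb:#{encoding_name}") { |io| io.read }
end

with_utf16le_file("[1,2]") do |path|
  puts text_mode_rejected?(path, "UTF-16LE")
  content = read_binary_tagged(path, "UTF-16LE")
  puts content.encoding
  puts content.encode("UTF-8")
end
```

ASCII-compatible encodings such as UTF-8 or ISO-8859-1 open fine in text mode, which is why the diff keeps `"r:..."` for them and switches to `"rb:..."` only where it is required.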
@@ -163,11 +183,121 @@ module SmarterJSON
     raise EncodingError, "input is tagged ASCII-8BIT and is not valid UTF-8 — pass encoding: to declare its encoding"
   end
+  # Legacy CJK double-byte encodings whose trail bytes can fall in the ASCII range, so a
+  # 0x5C trail byte looks like a string escape, a 0x7B like a brace, etc. — i.e. they are
+  # ascii_compatible? yet still NOT safe to byte-scan for JSON structure. (EUC-* and
+  # single-byte encodings keep their non-ASCII bytes above 0x7F, so they ARE safe.)
+  UNSCANNABLE_ASCII_COMPATIBLE = %w[
+    Shift_JIS Windows-31J MacJapanese SHIFT_JISX0213 SJIS-DoCoMo SJIS-KDDI SJIS-SoftBank
+    Big5 Big5-HKSCS Big5-UAO CP950 GBK GB18030 GB12345
+  ].each_with_object({}) do |name, h|
+    h[Encoding.find(name)] = true
+  rescue ArgumentError
+    # encoding not built into this Ruby — skip it
+  end.freeze
+  # True when an Encoding cannot be scanned directly for JSON structure — the non
+  # ASCII-compatible ones (UTF-16 / UTF-32, where structure is in code units) and the CJK
+  # double-byte ones above. For these we parse a UTF-8 copy and emit the values back in the
+  # original encoding. (Over-including a safe encoding only costs a transcode round-trip; the
+  # result is still correct.)
+  def unscannable_enc?(enc)
+    return true unless enc.ascii_compatible?
+    UNSCANNABLE_ASCII_COMPATIBLE.key?(enc)
+  end
+  # The encoding the bytes arrived in when they must be parsed via a UTF-8 copy (see
+  # unscannable_enc?); nil when the bytes are directly byte-scannable.
+  def unscannable_encoding(input)
+    enc = input.encoding
+    unscannable_enc?(enc) ? enc : nil
+  end
+  # Generic UTF-16 / UTF-32 prepend a byte-order mark to EVERY string when you encode TO them.
+  # Map the generic encoding to the concrete endianness (read from the input's own BOM) so the
+  # re-tagged values are BOM-free and usable. Concrete and non-Unicode encodings pass through.
+  def concrete_unicode_encoding(input, enc)
+    return enc unless enc == Encoding::UTF_16 || enc == Encoding::UTF_32
+    head = input.byteslice(0, 4).to_s.b
+    if enc == Encoding::UTF_16
+      head.start_with?("\xFF\xFE".b) ? Encoding::UTF_16LE : Encoding::UTF_16BE
+    else
+      head.start_with?("\xFF\xFE\x00\x00".b) ? Encoding::UTF_32LE : Encoding::UTF_32BE
+    end
+  end
+  # Transcode the input to a UTF-8 working copy for scanning. Invalid bytes raise the gem's
+  # own SmarterJSON::EncodingError (not a bare Ruby Encoding error), matching the rest.
+  def to_utf8_copy(input)
+    input.encode(Encoding::UTF_8)
+  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
+    raise EncodingError, "invalid byte sequence for #{input.encoding.name}"
+  end
+  # Re-tag one scalar into `enc`. A character not representable in `enc` (e.g. an emoji from a
+  # `\u` escape inside a Shift_JIS document) is replaced by `replace` (the :replace_char option,
+  # default "?") — uniform encoding, never raises. (`invalid:` can't trigger here: the value
+  # came from a valid UTF-8 parse.)
+  def reencode_scalar(obj, enc, replace)
+    return obj unless obj.is_a?(String)
+    obj.encode(enc, invalid: :replace, undef: :replace, replace: replace)
+  end
+  # Re-tag a parsed value's strings (Hash keys/values, Array elements, nested) into `enc`,
+  # so we emit values in the encoding the bytes arrived in after parsing a UTF-8 copy.
+  # ITERATIVE (an explicit work stack, not recursion) so a deeply nested document is
+  # depth-safe — like the parser itself — and can't raise SystemStackError.
+  def deep_encode(root, enc, replace)
+    return reencode_scalar(root, enc, replace) unless root.is_a?(Array) || root.is_a?(Hash)
+    out = root.is_a?(Array) ? [] : {}
+    stack = [[root, out]]
+    until stack.empty?
+      src, dst = stack.pop
+      if src.is_a?(Array)
+        src.each do |v|
+          child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
+          dst << (child || reencode_scalar(v, enc, replace))
+          stack.push([v, child]) if child
+        end
+      else
+        src.each do |k, v|
+          key = reencode_scalar(k, enc, replace)
+          child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
+          dst[key] = child || reencode_scalar(v, enc, replace)
+          stack.push([v, child]) if child
+        end
+      end
+    end
+    out
+  end
   # Stream documents from an IO incrementally, yielding each recovered top-level
   # document without slurping the whole input into memory first.
   def stream_io(io, options, &block)
+    ext = io.respond_to?(:external_encoding) ? io.external_encoding : nil
+    int = io.respond_to?(:internal_encoding) ? io.internal_encoding : nil
+    out_enc = int || ext # the encoding the caller expects in the output
+    source  = ext        # the encoding readpartial's raw bytes are actually in
+    # The Framer reads via readpartial, which returns ASCII-8BIT — it drops the IO's encoding
+    # and ignores transcoding. When the byte-scanner can't frame the raw bytes directly — they
+    # are in an unscannable encoding (UTF-16 / UTF-32 / Shift_JIS / ...), or the IO transcodes
+    # (the wanted output encoding differs from the on-wire bytes) — transcode each chunk to a
+    # UTF-8 view, frame documents there one at a time, and emit each in `out_enc`. Bounded
+    # memory: only one document is buffered, never the whole stream.
+    if source && out_enc && (unscannable_enc?(source) || out_enc != source)
+      return stream_transcoded(io, source, out_enc, options, &block)
+    end
     count = 0
     Framer.each_document(io) do |doc|
+      # readpartial dropped the IO's encoding tag; restore it so a Latin-1 / Windows-1252 /
+      # etc. stream is parsed and emitted in its own encoding, not mislabelled.
+      doc = doc.dup.force_encoding(out_enc) if out_enc && doc.encoding != out_enc
       # Recovery.process_string yields each value and returns how many it yielded;
       # blank / comment-only framed segments yield none, so count tracks actual
       # documents (== values yielded), not raw framed segments.
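
The comment on `concrete_unicode_encoding` in the hunk above notes that the generic UTF-16 / UTF-32 encodings prepend a byte-order mark to every string encoded to them. A short stdlib-only sketch makes that visible; `hex_bytes` is a hypothetical helper used only for display.

```ruby
# Sketch only: compares generic and concrete Unicode encodings byte by byte.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

puts hex_bytes("a".encode(Encoding::UTF_16))   # generic: BOM comes first
puts hex_bytes("a".encode(Encoding::UTF_16LE)) # concrete: no BOM
puts hex_bytes("a".encode(Encoding::UTF_32))   # generic: BOM comes first
puts hex_bytes("a".encode(Encoding::UTF_32LE)) # concrete: no BOM
```

If every re-tagged key and value carried its own BOM, the strings would not compare equal to BOM-free literals in the same encoding, which is the problem the mapping to a concrete endianness avoids.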
@@ -176,6 +306,25 @@ module SmarterJSON
     count
   end
+  # Bounded-memory streaming for an unscannable or transcoding IO (see stream_io). Each chunk
+  # is transcoded to a UTF-8 view and framed there one document at a time; each framed document
+  # is parsed and emitted in `out_enc` — the same parse-then-re-tag path as the whole-buffer
+  # case, but per document, so peak memory is bounded by one document, not the whole stream.
+  def stream_transcoded(io, source, out_enc, options, &block)
+    first   = Framer.read_chunk(io)
+    out_enc = concrete_unicode_encoding(first.to_s, out_enc) # generic UTF-16/32 -> concrete via BOM
+    # No converter when the raw bytes are already UTF-8 (e.g. a UTF-8 -> UTF-16 transcoding IO):
+    # the bytes need no transcoding to be byte-scanned, only the OUTPUT is re-tagged (deep_encode).
+    conv    = [Encoding::UTF_8, Encoding::US_ASCII].include?(source) ? nil : Encoding::Converter.new(source, Encoding::UTF_8)
+    opts    = options.merge(encoding: nil)
+    replace = options[:replace_char]
+    count   = 0
+    Framer.each_document_transcoded(io, conv, first) do |utf8_doc|
+      count += Recovery.process_string(utf8_doc, opts) { |v| block.call(deep_encode(v, out_enc, replace)) }
+    end
+    count
+  end
   # process_one's "more than one document" notice — routed to on_warning if the caller
   # gave one, else Rails.logger when Rails is loaded, else Kernel#warn. Never silent,
   # never raised.
@@ -192,7 +341,10 @@ module SmarterJSON
     end
   end
-  private_class_method :process_content, :stream_io, :warn_extra_documents
+  private_class_method :process_content, :stream_io, :stream_transcoded, :warn_extra_documents,
+                       :file_read_mode, :normalize_default_encoding, :unscannable_enc?,
+                       :unscannable_encoding, :concrete_unicode_encoding, :to_utf8_copy,
+                       :reencode_scalar, :deep_encode
   # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
   # scanners so none of them spell out raw hex. Included where needed.
@@ -261,6 +413,59 @@ module SmarterJSON
       yield buffer unless separators_only?(buffer)
     end
+    # Like each_document, but the IO's raw bytes are in `conv`'s source encoding (UTF-16 /
+    # UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so
+    # the byte-level splitter works. `first_chunk` is the already-read first raw chunk (the
+    # caller sniffs a BOM from it). Memory stays bounded by one document, like each_document.
+    def each_document_transcoded(io, conv, first_chunk)
+      buffer = +""
+      scan = 0
+      doc_start = nil
+      stack = []
+      mode = nil
+      raw = first_chunk
+      while raw
+        chunk = transcode_chunk(conv, raw)
+        unless chunk.empty?
+          buffer << chunk
+          loop do
+            emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
+            break unless emitted
+            yield emitted
+          end
+        end
+        raw = read_chunk(io)
+      end
+      finish_transcode(conv) # truncated / invalid trailing bytes -> SmarterJSON::EncodingError
+      yield buffer unless separators_only?(buffer)
+    end
+    # Push one raw chunk through the converter, returning the UTF-8 produced so far. An
+    # incomplete trailing multibyte sequence is held inside the converter until the next chunk;
+    # invalid bytes raise SmarterJSON::EncodingError (matching the whole-buffer to_utf8_copy).
+    def transcode_chunk(conv, raw)
+      return raw.dup.force_encoding(Encoding::UTF_8) if conv.nil? # raw bytes are already UTF-8
+      out = +""
+      status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
+      raise SmarterJSON::EncodingError, "invalid byte sequence in stream" if status == :invalid_byte_sequence
+      out
+    end
+    # Flush the converter at end of stream. A held incomplete multibyte sequence means the input
+    # was truncated mid-character — surface it the same way an invalid encoding is surfaced.
+    def finish_transcode(conv)
+      return if conv.nil?
+      status = conv.primitive_convert("".b, +"")
+      raise SmarterJSON::EncodingError, "invalid byte sequence in stream" unless status == :finished
+    end
     def read_chunk(io)
       if io.respond_to?(:readpartial)
         io.readpartial(CHUNK_SIZE)
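
The chunked transcoding in `transcode_chunk` and `finish_transcode` above relies on the stdlib `Encoding::Converter`. The sketch below isolates that technique: a chunk boundary falls in the middle of a UTF-16LE character, and the converter holds the incomplete bytes until the next chunk. `transcode_in_chunks` is a hypothetical helper, not smarter_json API.

```ruby
# Sketch only: chunk-by-chunk transcoding to UTF-8 with Encoding::Converter.
# Returns the UTF-8 piece produced for each chunk plus the final flush status.
def transcode_in_chunks(chunks, source_encoding)
  conv = Encoding::Converter.new(source_encoding, Encoding::UTF_8)
  pieces = chunks.map do |raw|
    out = +""
    # partial_input: true keeps an incomplete trailing character inside the
    # converter instead of reporting it as an error.
    conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
    out.force_encoding(Encoding::UTF_8)
  end
  # Flushing without partial_input shows whether bytes are still held.
  final_status = conv.primitive_convert("".b, +"")
  [pieces, final_status]
end

bytes = "h\u00E9llo".encode(Encoding::UTF_16LE).b # 10 bytes, 2 per character

# Split after 3 bytes: the second character is cut in half.
pieces, status = transcode_in_chunks([bytes.byteslice(0, 3), bytes.byteslice(3, 7)], Encoding::UTF_16LE)
puts pieces.inspect
puts status

# A stream that ends mid-character does not finish cleanly.
_, truncated_status = transcode_in_chunks([bytes.byteslice(0, 3)], Encoding::UTF_16LE)
puts truncated_status
```

A final status other than `:finished` is what the diff turns into `SmarterJSON::EncodingError`, so a truncated stream is reported the same way as invalid bytes.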
@@ -466,6 +671,22 @@ module SmarterJSON
     def process_string(input, options, &block)
       input = SmarterJSON.send(:normalize_default_encoding, input, options)
+      # UTF-16 / UTF-32 / Shift_JIS / ... cannot be byte-scanned for JSON structure. Parse
+      # a UTF-8 copy and emit each document's strings back in the encoding the bytes arrived
+      # in — the caller always gets values in the encoding they handed us, never UTF-8.
+      if (target_enc = SmarterJSON.send(:unscannable_encoding, input))
+        target_enc = SmarterJSON.send(:concrete_unicode_encoding, input, target_enc) # avoid per-string BOMs
+        opts = options.merge(encoding: nil) # the working copy is UTF-8; don't re-label it downstream
+        utf8 = SmarterJSON.send(:to_utf8_copy, input) # invalid bytes -> SmarterJSON::EncodingError
+        replace = options[:replace_char]
+        if block
+          return process_string(utf8, opts) { |doc| block.call(SmarterJSON.send(:deep_encode, doc, target_enc, replace)) }
+        end
+        return process_string(utf8, opts).map { |doc| SmarterJSON.send(:deep_encode, doc, target_enc, replace) }
+      end
       return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
       # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when

data/lib/smarter_json/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module SmarterJSON
-  VERSION = "1.2.2"
+  VERSION = "1.2.3"
 end

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: smarter_json
 version: !ruby/object:Gem::Version
-  version: 1.2.2
+  version: 1.2.3
 platform: ruby
 authors:
 - Tilo Sloboda
 bindir: exe
 cert_chain: []
-date: 2026-06-19 00:00:00.000000000 Z
+date: 2026-06-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bigdecimal