smarter_json 1.2.2 → 1.2.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: a31a7485e53042c06cb28c92f23e24cb55067fd312704c0c9ac0776868f20293
4
- data.tar.gz: 531187f8f7ea38f573785f09fa6e217f63a4393fd537d80c92f00973aeb2b375
3
+ metadata.gz: edd2cbc389a44f0714898f2a2942032632ca5f13a83d93acd708988866974893
4
+ data.tar.gz: 5cfa8719265797fd0fcd59f0060a5dd255aeba67ff229c78ab5696ad86f8b8e3
5
5
  SHA512:
6
- metadata.gz: f88266d06416277355f770645d694be6e81a8faef22a86806cd1e4620e773dee50aeeeeb09cfbc0de47354972250c2654e599c5c03ba0926ca129d9d6a638b0c
7
- data.tar.gz: c8bdd4a34d4617c897c815f2526c0f5555459b35bad47106a1db2b71dedac5d7a2605d0a5f92676735ae4895208fcdf1272f05dce09026aa696b84f143693ee5
6
+ metadata.gz: b0306cb7eab1db78053e8620f119b7debc54d836c38369b328bc0e0b9e8798d23eb4bbdb82281d036f62cb8cd296a3c8d7667f41a49d7e643a955830e08caf99
7
+ data.tar.gz: 0a5f052e115141014bcf2f0e031d43fb40521dd01da0f1ed5c6d4edeea785ead09f8ebcdfb76b580b4b8157391087ff2a79c5cd548ae7d8a0be96d0527a9f8fb
data/CHANGELOG.md CHANGED
@@ -13,9 +13,20 @@
13
13
  > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
14
14
  > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
15
15
 
16
- ## 1.2.2 (unreleased)
16
+ ## 1.2.3 (2026-06-28)
17
17
 
18
- RSpec tests: 1,167
18
+ RSpec tests: 1,167 → 1,268
19
+
20
+ Fixing some encoding corner cases:
21
+ - **UTF-16 / UTF-32** and **Shift_JIS** (and other CJK double-byte encodings such as Big5 / GBK / GB18030) previously raised or mis-parsed; they now parse, with string values tagged in the input's encoding.
22
+ - Applies to String, file (`process_file`), and IO / streaming (`foreach`) input — including a file the caller opened with transcoding (e.g. `File.open(path, "r:UTF-8:UTF-16LE")`), where the output is the encoding the bytes arrive in.
23
+ - Streaming a **Latin-1 / Windows-1252** (or other single-byte) file or IO now preserves that encoding too, instead of mislabelling or raising.
24
+ - Streaming a UTF-16 / UTF-32 / Shift_JIS (or transcoding) source via `foreach` / `process_file` is now **bounded-memory** — it frames and parses one document at a time instead of reading the whole input into memory.
25
+ - New `:replace_char` option (default `"?"`): when a `\uXXXX` escape decodes to a character the input's encoding can't represent (e.g. an emoji inside a Shift_JIS document), that character is replaced rather than raising. `replace_char: ""` drops it.
26
+
27
+ ## 1.2.2 (2026-06-19)
28
+
29
+ RSpec tests: 1,165 → 1,167
19
30
 
20
31
  - The Eisel-Lemire fast path for `decimal_precision: :float` now covers decimals with **up to 19 significant digits** (was 18). 19 digits is the most that fits exactly in a `uint64` (max 19-digit ≈ 1.0e19 < `UINT64_MAX` ≈ 1.8e19), so these no longer fall back to the slower `strtod`. Still correctly rounded, bit-for-bit identical to the stdlib — verified across 18/19-digit round-to-even tie shapes.
21
32
 
data/README.md CHANGED
@@ -222,6 +222,7 @@ In short: **SmarterJSON's C path matches or beats Oj/strict on every file** (app
222
222
  | `decimal_precision` | `:auto` | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
223
223
  | `acceleration` | `true` | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
224
224
  | `encoding` | `nil` | labels the input's encoding; `nil` keeps the input's own (no transcoding pass; see below) |
225
+ | `replace_char` | `"?"` | replacement for a char a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS); `""` drops it |
225
226
  | `on_warning` | `nil` | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`, `:number_overflow`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
226
227
 
227
228
  ## Examples
@@ -358,6 +359,8 @@ TEXT
358
359
 
359
360
  `encoding:` (default `nil`) labels what the input is — it does **not** transcode. With `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that same tag, the way `smarter_csv` does — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) whose bytes are valid UTF-8 is treated as UTF-8. That is exactly how `Net::HTTP` and many HTTP libraries hand you a `response.body` (correct UTF-8 bytes, BINARY tag); without this, string values would come back tagged `ASCII-8BIT` and compare unequal to UTF-8 literals. If such `ASCII-8BIT` input is *not* valid UTF-8, it raises `SmarterJSON::EncodingError` rather than guess a legacy encoding — pass an explicit `encoding:` (e.g. `"ISO-8859-1"`) for that. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
360
361
 
362
+ UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings parse too: SmarterJSON works on a UTF-8 copy internally and re-tags the result back into the input's own encoding, so values come back in the encoding the bytes arrived in (a UTF-16 / UTF-32 BOM is consumed on the way in). The one edge case — a `\uXXXX` escape that decodes to a character that encoding can't represent (e.g. an emoji inside a Shift_JIS document) — is replaced by `replace_char` (default `"?"`, or `""` to drop it) rather than raising.
363
+
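The parse-a-UTF-8-copy-then-re-tag approach described in the paragraph above can be illustrated with plain Ruby. This is only a sketch under stated assumptions: `parse_preserving_encoding` is a hypothetical helper, the stdlib `JSON.parse` stands in for the gem's own parser, and only a flat object is handled.

```ruby
require "json"

# Hypothetical helper (not part of SmarterJSON): parse a UTF-8 working copy,
# then hand keys and string values back in the encoding the input arrived in.
# JSON.parse is only a placeholder parser for this sketch.
def parse_preserving_encoding(input, replace_char: "?")
  original = input.encoding
  utf8     = input.encode(Encoding::UTF_8) # working copy the parser can scan
  parsed   = JSON.parse(utf8)              # this sketch expects a flat Hash

  parsed.to_h do |key, value|
    if value.is_a?(String)
      value = value.encode(original, undef: :replace, replace: replace_char)
    end
    [key.encode(original, undef: :replace, replace: replace_char), value]
  end
end

# A Shift_JIS document whose value is two kanji (U+6771 U+4EAC).
sjis_doc = "{\"city\":\"\u6771\u4EAC\"}".encode("Shift_JIS")
result   = parse_preserving_encoding(sjis_doc)
```

In this sketch both the key and the value come back tagged `Shift_JIS`, so they compare equal to other Shift_JIS strings the caller already holds.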
361
364
  ## Nesting & untrusted input
362
365
 
363
366
  Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue. **`generate` is iterative too**, so serializing a deeply nested Ruby structure can't overflow the stack either — reading *and* writing are both depth-safe.
data/docs/options.md CHANGED
@@ -22,6 +22,7 @@ These options are passed to [`SmarterJSON.process`](./basic_read_api.md), `Smart
22
22
  | `:duplicate_key` | `:last_wins` | How to handle a key that repeats within one object: `:last_wins` or `:first_wins`. (Every repeat is also reported through `:on_warning` — see below.) |
23
23
  | `:encoding` | `nil` | Labels the input's encoding (e.g. `"UTF-8"`). It does **not** trigger a transcoding pass — see below. |
24
24
  | `:on_warning` | `nil` | A callable invoked once per lenient fix applied, passed a `SmarterJSON::Warning`. Never changes the return value. See below. |
25
+ | `:replace_char` | `"?"` | Replacement for a character a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS). `""` drops it. See below. |
25
26
  | `:symbolize_keys` | `false` | Return object keys as Symbols instead of Strings. |
26
27
 
27
28
  ```ruby
@@ -47,7 +48,11 @@ The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`
47
48
 
48
49
  ### A note on `:encoding`
49
50
 
50
- `:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically; UTF-16 / UTF-32 input is out of scope.
51
+ `:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically. UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings are now supported as well: the document parses and string values come back tagged in the input's own encoding (a UTF-16 / UTF-32 BOM is consumed on the way in). The one wrinkle — a `\uXXXX` escape that decodes to a character the input's encoding can't represent — is handled by `:replace_char` (above).
52
+
53
+ ### A note on `:replace_char`
54
+
55
+ For an input in an encoding that can't be byte-scanned directly (UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings), SmarterJSON parses a UTF-8 copy and re-tags the result back into the input's encoding, so you get values in the encoding the bytes arrived in. A `\uXXXX` escape can decode to a character that encoding can't represent — e.g. an emoji inside a Shift_JIS document. Rather than raise, that single character is replaced by `:replace_char` (default `"?"`). Set `replace_char: ""` to drop it, or pass any string your target encoding can hold (e.g. the geta mark `"〓"` for Shift_JIS). It applies only on this transcode-and-re-tag path; for plain UTF-8 / single-byte input it never comes into play.
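The replacement behaviour described above matches what Ruby's own `String#encode` does with `undef: :replace`. The following sketch uses only core Ruby (no SmarterJSON calls) to show the three outcomes for an emoji that Shift_JIS cannot hold; the variable names are illustrative.

```ruby
# U+1F600 has no mapping in plain Shift_JIS.
emoji_text = "ok \u{1F600}"

# Without a replacement policy, Ruby raises.
raised =
  begin
    emoji_text.encode("Shift_JIS")
    false
  rescue Encoding::UndefinedConversionError
    true
  end

# With a replacement string, the single unrepresentable character is swapped.
replaced = emoji_text.encode("Shift_JIS", undef: :replace, replace: "?")

# An empty replacement string drops the character instead.
dropped = emoji_text.encode("Shift_JIS", undef: :replace, replace: "")
```

Here `replaced` holds the Shift_JIS bytes for `ok ?` and `dropped` holds those for `ok ` (with the trailing space), which corresponds to the `"?"` default and the `""` setting of `:replace_char`.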
51
56
 
52
57
  ### A note on `:decimal_precision`
53
58
 
@@ -12,6 +12,7 @@ module SmarterJSON
12
12
  duplicate_key: :last_wins, # :last_wins | :first_wins (repeats are also reported via on_warning)
13
13
  decimal_precision: :auto, # :auto | :float | :bigdecimal (Oj-compatible decimal handling)
14
14
  on_warning: nil, # a callable invoked once per non-fatal lenient fix (a SmarterJSON::Warning)
15
+ replace_char: "?", # replacement for a char not representable in the input's encoding (undef: :replace); "" drops it
15
16
  }.freeze
16
17
 
17
18
  module_function
@@ -56,6 +57,9 @@ module SmarterJSON
56
57
  unless encoding.nil? || encoding.is_a?(String)
57
58
  errors << "encoding must be nil or a String (got #{encoding.class})"
58
59
  end
60
+ unless options[:replace_char].is_a?(String)
61
+ errors << "replace_char must be a String (got #{options[:replace_char].class})"
62
+ end
59
63
 
60
64
  raise ArgumentError, "SmarterJSON: invalid options — #{errors.join('; ')}" if errors.any?
61
65
 
@@ -41,22 +41,42 @@ module SmarterJSON
41
41
 
42
42
  # SmarterJSON.process_file(path, options = {}) — open a file and process it.
43
43
  #
44
- # The :encoding option labels the file's encoding (default "UTF-8"); it does NOT
45
- # trigger a transcoding pass — the parser works on the bytes in their native
46
- # encoding and emits string values with the same encoding tag. With a block,
47
- # streams document-by-document straight from disk in bounded memory (never
48
- # loading the whole file); the documents are read as newline-delimited
49
- # (NDJSON / JSONL), one per line.
44
+ # The :encoding option labels the file's encoding (default "UTF-8").
45
+ #
46
+ # The user can send any encoding to SmarterJSON - we make zero assumptions about encoding.
47
+ # We also do not "normalize" the input to a different encoding on our own (this is not Python).
48
+ #
49
+ # We parse the bytes in whatever encoding they arrive in and emit string values
50
+ # with that same encoding tag.
51
+ #
52
+ # The caller is free to transcode the input themselves (e.g. open the file with a "r:ext:int" mode);
53
+ # however the bytes arrive, we parse them and preserve their encoding. With a block,
54
+ # streams document-by-document straight from disk in bounded memory (never loading the whole file);
55
+ # the documents are read as newline-delimited (NDJSON / JSONL), one per line.
56
+ #
50
57
  def process_file(path, options = {}, &block)
51
58
  options = Options.process_options(options)
52
59
  encoding = options[:encoding] || "UTF-8"
60
+ mode = file_read_mode(encoding)
53
61
  if block
54
- File.open(path, "r:#{encoding}") { |io| stream_io(io, options, &block) }
62
+ File.open(path, mode) { |io| stream_io(io, options, &block) }
55
63
  else
56
- process(File.read(path, encoding: encoding), options)
64
+ process(File.read(path, mode: mode), options)
57
65
  end
58
66
  end
59
67
 
68
+ # Read mode for process_file. Binary mode is required for ASCII-incompatible encodings
69
+ # (UTF-16 / UTF-32) — text mode refuses them ("ASCII incompatible encoding needs binmode").
70
+ # ASCII-compatible encodings keep TEXT mode, so newline translation (e.g. \r\n on Windows)
71
+ # is unchanged — binmode only applies where text mode is impossible anyway.
72
+ def file_read_mode(encoding)
73
+ incompatible = encoding.to_s.split(":").any? do |name|
74
+ enc = Encoding.find(name) rescue nil
75
+ enc && !enc.ascii_compatible?
76
+ end
77
+ incompatible ? "rb:#{encoding}" : "r:#{encoding}"
78
+ end
79
+
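The reason for the binary read mode can be reproduced with core Ruby alone. In this sketch, `read_mode_for` is a hypothetical stand-alone copy of the idea (the gem's method is private), and the temp file is only there to have UTF-16LE bytes on disk.

```ruby
require "tempfile"

# Hypothetical helper: choose "rb:" only when some encoding named in the
# mode string is not ASCII-compatible (for example UTF-16 or UTF-32).
def read_mode_for(encoding)
  incompatible = encoding.to_s.split(":").any? do |name|
    enc = begin
      Encoding.find(name)
    rescue ArgumentError
      nil # unknown encoding name: treat as compatible in this sketch
    end
    enc && !enc.ascii_compatible?
  end
  incompatible ? "rb:#{encoding}" : "r:#{encoding}"
end

text_mode_error = nil
binary_read     = nil

Tempfile.create("utf16-demo") do |file|
  file.binmode
  file.write("[1]".encode("UTF-16LE"))
  file.flush

  # Text mode refuses an ASCII-incompatible external encoding.
  begin
    File.open(file.path, "r:UTF-16LE", &:read)
  rescue ArgumentError => e
    text_mode_error = e.message
  end

  # Binary mode with the same encoding label reads the bytes as UTF-16LE.
  binary_read = File.open(file.path, read_mode_for("UTF-16LE"), &:read)
end
```

ASCII-compatible encodings such as UTF-8 keep the plain `r:` mode in this sketch, so platform newline handling is left alone for them.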
60
80
  # SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of
61
81
  # process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a
62
82
  # plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way
@@ -163,11 +183,121 @@ module SmarterJSON
163
183
  raise EncodingError, "input is tagged ASCII-8BIT and is not valid UTF-8 — pass encoding: to declare its encoding"
164
184
  end
165
185
 
186
+ # Legacy CJK double-byte encodings whose trail bytes can fall in the ASCII range, so a
187
+ # 0x5C trail byte looks like a string escape, a 0x7B like a brace, etc. — i.e. they are
188
+ # ascii_compatible? yet still NOT safe to byte-scan for JSON structure. (EUC-* and
189
+ # single-byte encodings keep their non-ASCII bytes above 0x7F, so they ARE safe.)
190
+ UNSCANNABLE_ASCII_COMPATIBLE = %w[
191
+ Shift_JIS Windows-31J MacJapanese SHIFT_JISX0213 SJIS-DoCoMo SJIS-KDDI SJIS-SoftBank
192
+ Big5 Big5-HKSCS Big5-UAO CP950 GBK GB18030 GB12345
193
+ ].each_with_object({}) do |name, h|
194
+ h[Encoding.find(name)] = true
195
+ rescue ArgumentError
196
+ # encoding not built into this Ruby — skip it
197
+ end.freeze
198
+
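The trail-byte problem mentioned in the comment above is easy to observe with core Ruby. The sketch below picks one well-known case, katakana U+30BD, and compares its bytes in Shift_JIS and EUC-JP; the constant and variable names are illustrative only.

```ruby
BACKSLASH = 0x5C # the byte a JSON scanner treats as a string escape

so = "\u30BD" # katakana "so"

# In Shift_JIS the second (trail) byte falls in the ASCII range.
sjis_bytes = so.encode("Shift_JIS").bytes

# In EUC-JP both bytes stay above 0x7F, so ASCII bytes are always real ASCII.
eucjp_bytes = so.encode("EUC-JP").bytes

sjis_has_fake_backslash = sjis_bytes.include?(BACKSLASH)
eucjp_has_ascii_range   = eucjp_bytes.any? { |byte| byte <= 0x7F }
```

A byte-level scanner that sees `0x5C` inside a Shift_JIS string would wrongly start an escape sequence, which is why such encodings are routed through a UTF-8 copy instead.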
199
+ # True when an Encoding cannot be scanned directly for JSON structure — the non
200
+ # ASCII-compatible ones (UTF-16 / UTF-32, where structure is in code units) and the CJK
201
+ # double-byte ones above. For these we parse a UTF-8 copy and emit the values back in the
202
+ # original encoding. (Over-including a safe encoding only costs a transcode round-trip; the
203
+ # result is still correct.)
204
+ def unscannable_enc?(enc)
205
+ return true unless enc.ascii_compatible?
206
+
207
+ UNSCANNABLE_ASCII_COMPATIBLE.key?(enc)
208
+ end
209
+
210
+ # The encoding the bytes arrived in when they must be parsed via a UTF-8 copy (see
211
+ # unscannable_enc?); nil when the bytes are directly byte-scannable.
212
+ def unscannable_encoding(input)
213
+ enc = input.encoding
214
+ unscannable_enc?(enc) ? enc : nil
215
+ end
216
+
217
+ # Generic UTF-16 / UTF-32 prepend a byte-order mark to EVERY string when you encode TO them.
218
+ # Map the generic encoding to the concrete endianness (read from the input's own BOM) so the
219
+ # re-tagged values are BOM-free and usable. Concrete and non-Unicode encodings pass through.
220
+ def concrete_unicode_encoding(input, enc)
221
+ return enc unless enc == Encoding::UTF_16 || enc == Encoding::UTF_32
222
+
223
+ head = input.byteslice(0, 4).to_s.b
224
+ if enc == Encoding::UTF_16
225
+ head.start_with?("\xFF\xFE".b) ? Encoding::UTF_16LE : Encoding::UTF_16BE
226
+ else
227
+ head.start_with?("\xFF\xFE\x00\x00".b) ? Encoding::UTF_32LE : Encoding::UTF_32BE
228
+ end
229
+ end
230
+
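The per-string BOM that motivates the mapping above can be seen directly in core Ruby. This sketch compares the generic and concrete encodings and includes a hypothetical `utf16_endianness` helper that sniffs a little-endian BOM in the same spirit.

```ruby
# Encoding TO generic UTF-16 prepends a byte-order mark to the result.
generic_bytes = "a".encode("UTF-16").bytes

# Encoding to a concrete endianness produces only the code units.
concrete_bytes = "a".encode("UTF-16LE").bytes

# Hypothetical helper: pick the concrete encoding from the leading bytes,
# falling back to big-endian when no little-endian BOM is present.
def utf16_endianness(input)
  head = input.byteslice(0, 2).to_s.b
  head.start_with?("\xFF\xFE".b) ? Encoding::UTF_16LE : Encoding::UTF_16BE
end

le_with_bom = "\uFEFFa".encode("UTF-16LE")
detected_le = utf16_endianness(le_with_bom)
detected_be = utf16_endianness("a".encode("UTF-16"))
```

Re-tagging every parsed value into the generic encoding would therefore add two extra bytes to each string, which the concrete encodings avoid.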
231
+ # Transcode the input to a UTF-8 working copy for scanning. Invalid bytes raise the gem's
232
+ # own SmarterJSON::EncodingError (not a bare Ruby Encoding error), matching the rest.
233
+ def to_utf8_copy(input)
234
+ input.encode(Encoding::UTF_8)
235
+ rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
236
+ raise EncodingError, "invalid byte sequence for #{input.encoding.name}"
237
+ end
238
+
239
+ # Re-tag one scalar into `enc`. A character not representable in `enc` (e.g. an emoji from a
240
+ # `\u` escape inside a Shift_JIS document) is replaced by `replace` (the :replace_char option,
241
+ # default "?") — uniform encoding, never raises. (`invalid:` can't trigger here: the value
242
+ # came from a valid UTF-8 parse.)
243
+ def reencode_scalar(obj, enc, replace)
244
+ return obj unless obj.is_a?(String)
245
+
246
+ obj.encode(enc, invalid: :replace, undef: :replace, replace: replace)
247
+ end
248
+
249
+ # Re-tag a parsed value's strings (Hash keys/values, Array elements, nested) into `enc`,
250
+ # so we emit values in the encoding the bytes arrived in after parsing a UTF-8 copy.
251
+ # ITERATIVE (an explicit work stack, not recursion) so a deeply nested document is
252
+ # depth-safe — like the parser itself — and can't raise SystemStackError.
253
+ def deep_encode(root, enc, replace)
254
+ return reencode_scalar(root, enc, replace) unless root.is_a?(Array) || root.is_a?(Hash)
255
+
256
+ out = root.is_a?(Array) ? [] : {}
257
+ stack = [[root, out]]
258
+ until stack.empty?
259
+ src, dst = stack.pop
260
+ if src.is_a?(Array)
261
+ src.each do |v|
262
+ child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
263
+ dst << (child || reencode_scalar(v, enc, replace))
264
+ stack.push([v, child]) if child
265
+ end
266
+ else
267
+ src.each do |k, v|
268
+ key = reencode_scalar(k, enc, replace)
269
+ child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
270
+ dst[key] = child || reencode_scalar(v, enc, replace)
271
+ stack.push([v, child]) if child
272
+ end
273
+ end
274
+ end
275
+ out
276
+ end
277
+
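To illustrate why the explicit work stack matters, here is a stand-alone sketch of the same idea applied to a very deep array. `retag_iteratively` is a hypothetical re-implementation for demonstration, not the gem's private method, and it hard-codes `"?"` as the replacement.

```ruby
# Hypothetical helper: re-tag every String in a nested Array/Hash structure
# using an explicit stack of [source, destination] pairs instead of recursion.
def retag_iteratively(root, enc)
  convert   = ->(v) { v.is_a?(String) ? v.encode(enc, undef: :replace, replace: "?") : v }
  container = ->(v) { v.is_a?(Array) ? [] : ({} if v.is_a?(Hash)) }

  out = container.call(root)
  return convert.call(root) if out.nil? # scalar root: nothing to walk

  stack = [[root, out]]
  until stack.empty?
    src, dst = stack.pop
    pairs = src.is_a?(Array) ? src.map { |v| [nil, v] } : src.to_a
    pairs.each do |key, value|
      child = container.call(value)      # fresh container for nested values
      item  = child || convert.call(value)
      if src.is_a?(Array)
        dst << item
      else
        dst[convert.call(key)] = item
      end
      stack.push([value, child]) if child
    end
  end
  out
end

# Build a 100,000-level nested array without recursion, then re-tag it.
deep = ["leaf"]
100_000.times { deep = [deep] }
result = retag_iteratively(deep, Encoding::UTF_16LE)

# Walk down iteratively to the innermost string.
node = result
node = node.first while node.is_a?(Array)
```

Because each child container is placed into its parent before being pushed onto the stack, element and key order is preserved even though the stack is processed last-in, first-out.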
166
278
  # Stream documents from an IO incrementally, yielding each recovered top-level
167
279
  # document without slurping the whole input into memory first.
168
280
  def stream_io(io, options, &block)
281
+ ext = io.respond_to?(:external_encoding) ? io.external_encoding : nil
282
+ int = io.respond_to?(:internal_encoding) ? io.internal_encoding : nil
283
+ out_enc = int || ext # the encoding the caller expects in the output
284
+ source = ext # the encoding readpartial's raw bytes are actually in
285
+
286
+ # The Framer reads via readpartial, which returns ASCII-8BIT — it drops the IO's encoding
287
+ # and ignores transcoding. When the byte-scanner can't frame the raw bytes directly — they
288
+ # are in an unscannable encoding (UTF-16 / UTF-32 / Shift_JIS / ...), or the IO transcodes
289
+ # (the wanted output encoding differs from the on-wire bytes) — transcode each chunk to a
290
+ # UTF-8 view, frame documents there one at a time, and emit each in `out_enc`. Bounded
291
+ # memory: only one document is buffered, never the whole stream.
292
+ if source && out_enc && (unscannable_enc?(source) || out_enc != source)
293
+ return stream_transcoded(io, source, out_enc, options, &block)
294
+ end
295
+
169
296
  count = 0
170
297
  Framer.each_document(io) do |doc|
298
+ # readpartial dropped the IO's encoding tag; restore it so a Latin-1 / Windows-1252 /
299
+ # etc. stream is parsed and emitted in its own encoding, not mislabelled.
300
+ doc = doc.dup.force_encoding(out_enc) if out_enc && doc.encoding != out_enc
171
301
  # Recovery.process_string yields each value and returns how many it yielded;
172
302
  # blank / comment-only framed segments yield none, so count tracks actual
173
303
  # documents (== values yielded), not raw framed segments.
@@ -176,6 +306,25 @@ module SmarterJSON
176
306
  count
177
307
  end
178
308
 
309
+ # Bounded-memory streaming for an unscannable or transcoding IO (see stream_io). Each chunk
310
+ # is transcoded to a UTF-8 view and framed there one document at a time; each framed document
311
+ # is parsed and emitted in `out_enc` — the same parse-then-re-tag path as the whole-buffer
312
+ # case, but per document, so peak memory is bounded by one document, not the whole stream.
313
+ def stream_transcoded(io, source, out_enc, options, &block)
314
+ first = Framer.read_chunk(io)
315
+ out_enc = concrete_unicode_encoding(first.to_s, out_enc) # generic UTF-16/32 -> concrete via BOM
316
+ # No converter when the raw bytes are already UTF-8 (e.g. a UTF-8 -> UTF-16 transcoding IO):
317
+ # the bytes need no transcoding to be byte-scanned, only the OUTPUT is re-tagged (deep_encode).
318
+ conv = [Encoding::UTF_8, Encoding::US_ASCII].include?(source) ? nil : Encoding::Converter.new(source, Encoding::UTF_8)
319
+ opts = options.merge(encoding: nil)
320
+ replace = options[:replace_char]
321
+ count = 0
322
+ Framer.each_document_transcoded(io, conv, first) do |utf8_doc|
323
+ count += Recovery.process_string(utf8_doc, opts) { |v| block.call(deep_encode(v, out_enc, replace)) }
324
+ end
325
+ count
326
+ end
327
+
179
328
  # process_one's "more than one document" notice — routed to on_warning if the caller
180
329
  # gave one, else Rails.logger when Rails is loaded, else Kernel#warn. Never silent,
181
330
  # never raised.
@@ -192,7 +341,10 @@ module SmarterJSON
192
341
  end
193
342
  end
194
343
 
195
- private_class_method :process_content, :stream_io, :warn_extra_documents
344
+ private_class_method :process_content, :stream_io, :stream_transcoded, :warn_extra_documents,
345
+ :file_read_mode, :normalize_default_encoding, :unscannable_enc?,
346
+ :unscannable_encoding, :concrete_unicode_encoding, :to_utf8_copy,
347
+ :reencode_scalar, :deep_encode
196
348
 
197
349
  # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
198
350
  # scanners so none of them spell out raw hex. Included where needed.
@@ -261,6 +413,59 @@ module SmarterJSON
261
413
  yield buffer unless separators_only?(buffer)
262
414
  end
263
415
 
416
+ # Like each_document, but the IO's raw bytes are in `conv`'s source encoding (UTF-16 /
417
+ # UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so
418
+ # the byte-level splitter works. `first_chunk` is the already-read first raw chunk (the
419
+ # caller sniffs a BOM from it). Memory stays bounded by one document, like each_document.
420
+ def each_document_transcoded(io, conv, first_chunk)
421
+ buffer = +""
422
+ scan = 0
423
+ doc_start = nil
424
+ stack = []
425
+ mode = nil
426
+
427
+ raw = first_chunk
428
+ while raw
429
+ chunk = transcode_chunk(conv, raw)
430
+ unless chunk.empty?
431
+ buffer << chunk
432
+ loop do
433
+ emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
434
+ break unless emitted
435
+
436
+ yield emitted
437
+ end
438
+ end
439
+ raw = read_chunk(io)
440
+ end
441
+
442
+ finish_transcode(conv) # truncated / invalid trailing bytes -> SmarterJSON::EncodingError
443
+
444
+ yield buffer unless separators_only?(buffer)
445
+ end
446
+
447
+ # Push one raw chunk through the converter, returning the UTF-8 produced so far. An
448
+ # incomplete trailing multibyte sequence is held inside the converter until the next chunk;
449
+ # invalid bytes raise SmarterJSON::EncodingError (matching the whole-buffer to_utf8_copy).
450
+ def transcode_chunk(conv, raw)
451
+ return raw.dup.force_encoding(Encoding::UTF_8) if conv.nil? # raw bytes are already UTF-8
452
+
453
+ out = +""
454
+ status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
455
+ raise SmarterJSON::EncodingError, "invalid byte sequence in stream" if status == :invalid_byte_sequence
456
+
457
+ out
458
+ end
459
+
460
+ # Flush the converter at end of stream. A held incomplete multibyte sequence means the input
461
+ # was truncated mid-character — surface it the same way an invalid encoding is surfaced.
462
+ def finish_transcode(conv)
463
+ return if conv.nil?
464
+
465
+ status = conv.primitive_convert("".b, +"")
466
+ raise SmarterJSON::EncodingError, "invalid byte sequence in stream" unless status == :finished
467
+ end
468
+
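The chunk-boundary behaviour these two helpers rely on comes from Ruby's `Encoding::Converter`. The sketch below, using only core Ruby, feeds one Shift_JIS character a byte at a time and then shows what a truncated stream looks like at flush time; the variable names are illustrative.

```ruby
# Feed the two bytes of a Shift_JIS character in separate "chunks".
conv = Encoding::Converter.new("Shift_JIS", "UTF-8")
out  = +""

# The lead byte alone is incomplete: the converter holds it and emits nothing.
first_status = conv.primitive_convert("\x83".b, out, nil, nil, partial_input: true)
after_first  = out.dup

# The trail byte completes the character, which is appended to `out`.
second_status = conv.primitive_convert("\x5C".b, out, nil, nil, partial_input: true)

# Flushing with no pending bytes finishes cleanly.
final_status = conv.primitive_convert("".b, +"")

# A stream that ends right after the lead byte is truncated mid-character.
truncated = Encoding::Converter.new("Shift_JIS", "UTF-8")
truncated.primitive_convert("\x83".b, +"", nil, nil, partial_input: true)
truncated_status = truncated.primitive_convert("".b, +"")
```

A final status other than `:finished` is the signal that the flush step above turns into `SmarterJSON::EncodingError`.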
264
469
  def read_chunk(io)
265
470
  if io.respond_to?(:readpartial)
266
471
  io.readpartial(CHUNK_SIZE)
@@ -466,6 +671,22 @@ module SmarterJSON
466
671
 
467
672
  def process_string(input, options, &block)
468
673
  input = SmarterJSON.send(:normalize_default_encoding, input, options)
674
+
675
+ # UTF-16 / UTF-32 / Shift_JIS / ... cannot be byte-scanned for JSON structure. Parse
676
+ # a UTF-8 copy and emit each document's strings back in the encoding the bytes arrived
677
+ # in — the caller always gets values in the encoding they handed us, never UTF-8.
678
+ if (target_enc = SmarterJSON.send(:unscannable_encoding, input))
679
+ target_enc = SmarterJSON.send(:concrete_unicode_encoding, input, target_enc) # avoid per-string BOMs
680
+ opts = options.merge(encoding: nil) # the working copy is UTF-8; don't re-label it downstream
681
+ utf8 = SmarterJSON.send(:to_utf8_copy, input) # invalid bytes -> SmarterJSON::EncodingError
682
+ replace = options[:replace_char]
683
+ if block
684
+ return process_string(utf8, opts) { |doc| block.call(SmarterJSON.send(:deep_encode, doc, target_enc, replace)) }
685
+ end
686
+
687
+ return process_string(utf8, opts).map { |doc| SmarterJSON.send(:deep_encode, doc, target_enc, replace) }
688
+ end
689
+
469
690
  return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
470
691
 
471
692
  # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module SmarterJSON
4
- VERSION = "1.2.2"
4
+ VERSION = "1.2.3"
5
5
  end
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: smarter_json
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.2
4
+ version: 1.2.3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tilo Sloboda
8
8
  bindir: exe
9
9
  cert_chain: []
10
- date: 2026-06-19 00:00:00.000000000 Z
10
+ date: 2026-06-28 00:00:00.000000000 Z
11
11
  dependencies:
12
12
  - !ruby/object:Gem::Dependency
13
13
  name: bigdecimal