smarter_json 1.2.2 → 1.2.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +13 -2
- data/README.md +3 -0
- data/docs/options.md +6 -1
- data/lib/smarter_json/options.rb +4 -0
- data/lib/smarter_json/parser.rb +230 -9
- data/lib/smarter_json/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: edd2cbc389a44f0714898f2a2942032632ca5f13a83d93acd708988866974893
+  data.tar.gz: 5cfa8719265797fd0fcd59f0060a5dd255aeba67ff229c78ab5696ad86f8b8e3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b0306cb7eab1db78053e8620f119b7debc54d836c38369b328bc0e0b9e8798d23eb4bbdb82281d036f62cb8cd296a3c8d7667f41a49d7e643a955830e08caf99
+  data.tar.gz: 0a5f052e115141014bcf2f0e031d43fb40521dd01da0f1ed5c6d4edeea785ead09f8ebcdfb76b580b4b8157391087ff2a79c5cd548ae7d8a0be96d0527a9f8fb
data/CHANGELOG.md
CHANGED
@@ -13,9 +13,20 @@
 > ⚠️ We discourage the use of `process(input).first` / `process(input)[0]` because it silently drops potential additional documents
 > Please use `process_one` if you are expecting only one JSON doc, e.g. in API payloads, because it emits on_warning if it finds multiple docs.
 
-## 1.2.
+## 1.2.3 (2026-06-28)
 
-RSpec tests: 1,167
+RSpec tests: 1,167 → 1,268
+
+Fixing some encoding corner cases:
+- **UTF-16 / UTF-32** and **Shift_JIS** (and other CJK double-byte encodings such as Big5 / GBK / GB18030) previously raised or mis-parsed; they now parse, with string values tagged in the input's encoding.
+- Applies to String, file (`process_file`), and IO / streaming (`foreach`) input — including a file the caller opened with transcoding (e.g. `File.open(path, "r:UTF-8:UTF-16LE")`), where the output is the encoding the bytes arrive in.
+- Streaming a **Latin-1 / Windows-1252** (or other single-byte) file or IO now preserves that encoding too, instead of mislabelling or raising.
+- Streaming a UTF-16 / UTF-32 / Shift_JIS (or transcoding) source via `foreach` / `process_file` is now **bounded-memory** — it frames and parses one document at a time instead of reading the whole input into memory.
+- New `:replace_char` option (default `"?"`): when a `\uXXXX` escape decodes to a character the input's encoding can't represent (e.g. an emoji inside a Shift_JIS document), that character is replaced rather than raising. `replace_char: ""` drops it.
+
+## 1.2.2 (2026-06-19)
+
+RSpec tests: 1,165 → 1,167
 
 - The Eisel-Lemire fast path for `decimal_precision: :float` now covers decimals with **up to 19 significant digits** (was 18). 19 digits is the most that fits exactly in a `uint64` (max 19-digit ≈ 1.0e19 < `UINT64_MAX` ≈ 1.8e19), so these no longer fall back to the slower `strtod`. Still correctly rounded, bit-for-bit identical to the stdlib — verified across 18/19-digit round-to-even tie shapes.
 
data/README.md
CHANGED
@@ -222,6 +222,7 @@ In short: **SmarterJSON's C path matches or beats Oj/strict on every file** (app
 | `decimal_precision` | `:auto` | `:auto` keeps high-precision decimals as `BigDecimal`; `:float` forces `Float`; `:bigdecimal` forces `BigDecimal` |
 | `acceleration` | `true` | `true` uses the C extension when compiled and loadable; `false` forces pure Ruby (identical results) |
 | `encoding` | `nil` | labels the input's encoding; `nil` keeps the input's own (no transcoding pass; see below) |
+| `replace_char` | `"?"` | replacement for a char a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS); `""` drops it |
 | `on_warning` | `nil` | a callable invoked once per lenient fix applied (`:empty_slot`, `:empty_value`, `:duplicate_key`, `:number_overflow`), passed a `SmarterJSON::Warning`; the return value is never changed. See below. |
 
 ## Examples
@@ -358,6 +359,8 @@ TEXT
 
 `encoding:` (default `nil`) labels what the input is — it does **not** transcode. With `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that same tag, the way `smarter_csv` does — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) whose bytes are valid UTF-8 is treated as UTF-8. That is exactly how `Net::HTTP` and many HTTP libraries hand you a `response.body` (correct UTF-8 bytes, BINARY tag); without this, string values would come back tagged `ASCII-8BIT` and compare unequal to UTF-8 literals. If such `ASCII-8BIT` input is *not* valid UTF-8, it raises `SmarterJSON::EncodingError` rather than guess a legacy encoding — pass an explicit `encoding:` (e.g. `"ISO-8859-1"`) for that. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`).
 
+UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings parse too: SmarterJSON works on a UTF-8 copy internally and re-tags the result back into the input's own encoding, so values come back in the encoding the bytes arrived in (a UTF-16 / UTF-32 BOM is consumed on the way in). The one edge case — a `\uXXXX` escape that decodes to a character that encoding can't represent (e.g. an emoji inside a Shift_JIS document) — is replaced by `replace_char` (default `"?"`, or `""` to drop it) rather than raising.
+
 ## Nesting & untrusted input
 
 Both the C extension and the pure-Ruby engine are **iterative, not recursive** — they track nesting on an explicit, heap-allocated stack rather than the call stack. So deeply nested input **cannot overflow the call stack or segfault**: nesting is bounded only by available memory, the same posture as Oj (which also ships no nesting limit; the stdlib `json` caps at 100). The `deeply_nested.json` benchmark (212 MB of nesting) is handled without issue. **`generate` is iterative too**, so serializing a deeply nested Ruby structure can't overflow the stack either — reading *and* writing are both depth-safe.
|
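Reviewer note: the `ASCII-8BIT` default described in the README text above rests on ordinary Ruby String behaviour, which can be checked without the gem. The sketch below is plain Ruby and is not SmarterJSON's implementation; the helper name `utf8_view_of_binary` is hypothetical and exists only for illustration.

```ruby
# Hypothetical helper (not part of SmarterJSON): re-tag BINARY bytes as UTF-8
# when they are valid UTF-8, otherwise signal that an encoding must be declared.
def utf8_view_of_binary(bytes)
  return bytes unless bytes.encoding == Encoding::ASCII_8BIT

  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : nil
end

body = "caf\u00E9".b                 # correct UTF-8 bytes carrying a BINARY tag
binary_equal = (body == "caf\u00E9") # BINARY vs UTF-8 with non-ASCII content
retagged = utf8_view_of_binary(body) # same bytes, now tagged UTF-8

latin1_bytes = "caf\xE9".b           # Latin-1 bytes: not valid UTF-8
rejected = utf8_view_of_binary(latin1_bytes)

puts "binary == utf8 literal? #{binary_equal}"
puts "retagged encoding: #{retagged.encoding}"
puts "latin-1 bytes accepted? #{!rejected.nil?}"
```

The comparison on the BINARY-tagged string is the reason the README calls this a smart default: the bytes are identical, but the differing tags make the strings unequal until the tag is corrected.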
data/docs/options.md
CHANGED
@@ -22,6 +22,7 @@ These options are passed to [`SmarterJSON.process`](./basic_read_api.md), `Smart
 | `:duplicate_key` | `:last_wins` | How to handle a key that repeats within one object: `:last_wins` or `:first_wins`. (Every repeat is also reported through `:on_warning` — see below.) |
 | `:encoding` | `nil` | Labels the input's encoding (e.g. `"UTF-8"`). It does **not** trigger a transcoding pass — see below. |
 | `:on_warning` | `nil` | A callable invoked once per lenient fix applied, passed a `SmarterJSON::Warning`. Never changes the return value. See below. |
+| `:replace_char` | `"?"` | Replacement for a character a `\uXXXX` escape decodes to that the input's encoding can't represent (e.g. an emoji inside Shift_JIS). `""` drops it. See below. |
 | `:symbolize_keys` | `false` | Return object keys as Symbols instead of Strings. |
 
 ```ruby
@@ -47,7 +48,11 @@ The warning types are `:empty_slot` (a collapsed empty comma slot, e.g. `[1,,2]`
 
 ### A note on `:encoding`
 
-`:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically
+`:encoding` labels what the input *is* — it does not transcode. With the default `nil`, SmarterJSON keeps the input's own encoding tag and emits string values with that tag, the same way `smarter_csv` handles encodings — **with one smart default:** input tagged `ASCII-8BIT` (BINARY) that is valid UTF-8 is treated as UTF-8. This is how `Net::HTTP` returns a `response.body`; without it, those string values would compare unequal to UTF-8 literals. `ASCII-8BIT` input that is *not* valid UTF-8 raises `SmarterJSON::EncodingError` — pass an explicit `:encoding` (e.g. `"ISO-8859-1"`) for genuinely-legacy bytes. Bytes invalid for an explicitly claimed encoding also raise `SmarterJSON::EncodingError` (a kind of `SmarterJSON::ParseError`). A UTF-8 BOM is handled automatically. UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings are now supported as well: the document parses and string values come back tagged in the input's own encoding (a UTF-16 / UTF-32 BOM is consumed on the way in). The one wrinkle — a `\uXXXX` escape that decodes to a character the input's encoding can't represent — is handled by `:replace_char` (above).
+
+### A note on `:replace_char`
+
+For an input in an encoding that can't be byte-scanned directly (UTF-16 / UTF-32, Shift_JIS, and other CJK double-byte encodings), SmarterJSON parses a UTF-8 copy and re-tags the result back into the input's encoding, so you get values in the encoding the bytes arrived in. A `\uXXXX` escape can decode to a character that encoding can't represent — e.g. an emoji inside a Shift_JIS document. Rather than raise, that single character is replaced by `:replace_char` (default `"?"`). Set `replace_char: ""` to drop it, or pass any string your target encoding can hold (e.g. the geta mark `"〓"` for Shift_JIS). It applies only on this transcode-and-re-tag path; for plain UTF-8 / single-byte input it never comes into play.
 
 ### A note on `:decimal_precision`
 
|
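Reviewer note: the replacement behaviour documented for `:replace_char` corresponds to Ruby's own `String#encode` with `undef: :replace`, which the parser diff further down also uses. The sketch below uses only core Ruby and does not call SmarterJSON; the variable names are illustrative.

```ruby
# "日本語 " followed by an emoji (U+1F600) that Shift_JIS cannot represent.
utf8 = "\u65E5\u672C\u8A9E \u{1F600}"

# Without a replacement policy the conversion raises.
raised = begin
  utf8.encode(Encoding::Shift_JIS)
  false
rescue Encoding::UndefinedConversionError
  true
end

# With undef: :replace the single unrepresentable character is substituted.
default_mark = utf8.encode(Encoding::Shift_JIS, undef: :replace, replace: "?")
dropped      = utf8.encode(Encoding::Shift_JIS, undef: :replace, replace: "")

puts "raises without replacement: #{raised}"
puts default_mark.encode(Encoding::UTF_8)
puts dropped.encode(Encoding::UTF_8).inspect
```

Both results stay tagged as Shift_JIS, which matches the documented goal of returning values in the encoding the bytes arrived in.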
data/lib/smarter_json/options.rb
CHANGED
@@ -12,6 +12,7 @@ module SmarterJSON
   duplicate_key: :last_wins, # :last_wins | :first_wins (repeats are also reported via on_warning)
   decimal_precision: :auto, # :auto | :float | :bigdecimal (Oj-compatible decimal handling)
   on_warning: nil, # a callable invoked once per non-fatal lenient fix (a SmarterJSON::Warning)
+  replace_char: "?", # replacement for a char not representable in the input's encoding (undef: :replace); "" drops it
 }.freeze
 
 module_function
@@ -56,6 +57,9 @@ module SmarterJSON
 unless encoding.nil? || encoding.is_a?(String)
   errors << "encoding must be nil or a String (got #{encoding.class})"
 end
+unless options[:replace_char].is_a?(String)
+  errors << "replace_char must be a String (got #{options[:replace_char].class})"
+end
 
 raise ArgumentError, "SmarterJSON: invalid options — #{errors.join('; ')}" if errors.any?
 
data/lib/smarter_json/parser.rb
CHANGED
@@ -41,22 +41,42 @@ module SmarterJSON
 
 # SmarterJSON.process_file(path, options = {}) — open a file and process it.
 #
-# The :encoding option labels the file's encoding (default "UTF-8")
-#
-#
-#
-#
-#
+# The :encoding option labels the file's encoding (default "UTF-8").
+#
+# The user can send any encoding to SmarterJSON - we make zero assumptions about encoding.
+# We also do not "normalize" the input to a different encoding on our own (this is not Python).
+#
+# We parse the bytes in whatever encoding they arrive in and emit string values
+# with that same encoding tag.
+#
+# The caller is free to transcode the input themselves (e.g. open the file with a "r:ext:int" mode);
+# however the bytes arrive, we parse them and preserve their encoding. With a block,
+# streams document-by-document straight from disk in bounded memory (never loading the whole file);
+# the documents are read as newline-delimited (NDJSON / JSONL), one per line.
+#
 def process_file(path, options = {}, &block)
   options = Options.process_options(options)
   encoding = options[:encoding] || "UTF-8"
+  mode = file_read_mode(encoding)
   if block
-    File.open(path,
+    File.open(path, mode) { |io| stream_io(io, options, &block) }
   else
-    process(File.read(path,
+    process(File.read(path, mode: mode), options)
   end
 end
 
+# Read mode for process_file. Binary mode is required for ASCII-incompatible encodings
+# (UTF-16 / UTF-32) — text mode refuses them ("ASCII incompatible encoding needs binmode").
+# ASCII-compatible encodings keep TEXT mode, so newline translation (e.g. \r\n on Windows)
+# is unchanged — binmode only applies where text mode is impossible anyway.
+def file_read_mode(encoding)
+  incompatible = encoding.to_s.split(":").any? do |name|
+    enc = Encoding.find(name) rescue nil
+    enc && !enc.ascii_compatible?
+  end
+  incompatible ? "rb:#{encoding}" : "r:#{encoding}"
+end
+
 # SmarterJSON.foreach(source, options = {}) — the streaming, composable sibling of
 # process_file, mirroring the stdlib convention (CSV.foreach / File.foreach): a
 # plain Enumerator (NOT Enumerator::Lazy), so .map / .select behave the normal way
|
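Reviewer note: the comment on `file_read_mode` says text mode refuses ASCII-incompatible encodings. That is core Ruby behaviour and can be reproduced with a temporary file. The sketch below is independent of SmarterJSON; the file contents and variable names are made up for the demonstration.

```ruby
require "tempfile"

text_mode_error = nil
content = nil

Tempfile.create("utf16le") do |f|
  f.binmode
  f.write('{"a":1}'.encode(Encoding::UTF_16LE))
  f.flush

  # Text mode with an ASCII-incompatible external encoding is rejected by Ruby.
  begin
    File.open(f.path, "r:UTF-16LE", &:read)
  rescue ArgumentError => e
    text_mode_error = e.message
  end

  # Binary mode with the same encoding label works and keeps the tag.
  content = File.open(f.path, "rb:UTF-16LE", &:read)
end

puts "text mode error: #{text_mode_error}"
puts "binary mode encoding: #{content.encoding}"
```

This is why the new helper switches to an `rb:` prefix only for encodings where text mode is impossible, leaving newline handling untouched for everything else.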
@@ -163,11 +183,121 @@ module SmarterJSON
   raise EncodingError, "input is tagged ASCII-8BIT and is not valid UTF-8 — pass encoding: to declare its encoding"
 end
 
+# Legacy CJK double-byte encodings whose trail bytes can fall in the ASCII range, so a
+# 0x5C trail byte looks like a string escape, a 0x7B like a brace, etc. — i.e. they are
+# ascii_compatible? yet still NOT safe to byte-scan for JSON structure. (EUC-* and
+# single-byte encodings keep their non-ASCII bytes above 0x7F, so they ARE safe.)
+UNSCANNABLE_ASCII_COMPATIBLE = %w[
+  Shift_JIS Windows-31J MacJapanese SHIFT_JISX0213 SJIS-DoCoMo SJIS-KDDI SJIS-SoftBank
+  Big5 Big5-HKSCS Big5-UAO CP950 GBK GB18030 GB12345
+].each_with_object({}) do |name, h|
+  h[Encoding.find(name)] = true
+rescue ArgumentError
+  # encoding not built into this Ruby — skip it
+end.freeze
+
+# True when an Encoding cannot be scanned directly for JSON structure — the non
+# ASCII-compatible ones (UTF-16 / UTF-32, where structure is in code units) and the CJK
+# double-byte ones above. For these we parse a UTF-8 copy and emit the values back in the
+# original encoding. (Over-including a safe encoding only costs a transcode round-trip; the
+# result is still correct.)
+def unscannable_enc?(enc)
+  return true unless enc.ascii_compatible?
+
+  UNSCANNABLE_ASCII_COMPATIBLE.key?(enc)
+end
+
+# The encoding the bytes arrived in when they must be parsed via a UTF-8 copy (see
+# unscannable_enc?); nil when the bytes are directly byte-scannable.
+def unscannable_encoding(input)
+  enc = input.encoding
+  unscannable_enc?(enc) ? enc : nil
+end
+
+# Generic UTF-16 / UTF-32 prepend a byte-order mark to EVERY string when you encode TO them.
+# Map the generic encoding to the concrete endianness (read from the input's own BOM) so the
+# re-tagged values are BOM-free and usable. Concrete and non-Unicode encodings pass through.
+def concrete_unicode_encoding(input, enc)
+  return enc unless enc == Encoding::UTF_16 || enc == Encoding::UTF_32
+
+  head = input.byteslice(0, 4).to_s.b
+  if enc == Encoding::UTF_16
+    head.start_with?("\xFF\xFE".b) ? Encoding::UTF_16LE : Encoding::UTF_16BE
+  else
+    head.start_with?("\xFF\xFE\x00\x00".b) ? Encoding::UTF_32LE : Encoding::UTF_32BE
+  end
+end
+
+# Transcode the input to a UTF-8 working copy for scanning. Invalid bytes raise the gem's
+# own SmarterJSON::EncodingError (not a bare Ruby Encoding error), matching the rest.
+def to_utf8_copy(input)
+  input.encode(Encoding::UTF_8)
+rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
+  raise EncodingError, "invalid byte sequence for #{input.encoding.name}"
+end
+
+# Re-tag one scalar into `enc`. A character not representable in `enc` (e.g. an emoji from a
+# `\u` escape inside a Shift_JIS document) is replaced by `replace` (the :replace_char option,
+# default "?") — uniform encoding, never raises. (`invalid:` can't trigger here: the value
+# came from a valid UTF-8 parse.)
+def reencode_scalar(obj, enc, replace)
+  return obj unless obj.is_a?(String)
+
+  obj.encode(enc, invalid: :replace, undef: :replace, replace: replace)
+end
+
+# Re-tag a parsed value's strings (Hash keys/values, Array elements, nested) into `enc`,
+# so we emit values in the encoding the bytes arrived in after parsing a UTF-8 copy.
+# ITERATIVE (an explicit work stack, not recursion) so a deeply nested document is
+# depth-safe — like the parser itself — and can't raise SystemStackError.
+def deep_encode(root, enc, replace)
+  return reencode_scalar(root, enc, replace) unless root.is_a?(Array) || root.is_a?(Hash)
+
+  out = root.is_a?(Array) ? [] : {}
+  stack = [[root, out]]
+  until stack.empty?
+    src, dst = stack.pop
+    if src.is_a?(Array)
+      src.each do |v|
+        child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
+        dst << (child || reencode_scalar(v, enc, replace))
+        stack.push([v, child]) if child
+      end
+    else
+      src.each do |k, v|
+        key = reencode_scalar(k, enc, replace)
+        child = (v.is_a?(Array) ? [] : {}) if v.is_a?(Array) || v.is_a?(Hash)
+        dst[key] = child || reencode_scalar(v, enc, replace)
+        stack.push([v, child]) if child
+      end
+    end
+  end
+  out
+end
+
 # Stream documents from an IO incrementally, yielding each recovered top-level
 # document without slurping the whole input into memory first.
 def stream_io(io, options, &block)
+  ext = io.respond_to?(:external_encoding) ? io.external_encoding : nil
+  int = io.respond_to?(:internal_encoding) ? io.internal_encoding : nil
+  out_enc = int || ext # the encoding the caller expects in the output
+  source = ext # the encoding readpartial's raw bytes are actually in
+
+  # The Framer reads via readpartial, which returns ASCII-8BIT — it drops the IO's encoding
+  # and ignores transcoding. When the byte-scanner can't frame the raw bytes directly — they
+  # are in an unscannable encoding (UTF-16 / UTF-32 / Shift_JIS / ...), or the IO transcodes
+  # (the wanted output encoding differs from the on-wire bytes) — transcode each chunk to a
+  # UTF-8 view, frame documents there one at a time, and emit each in `out_enc`. Bounded
+  # memory: only one document is buffered, never the whole stream.
+  if source && out_enc && (unscannable_enc?(source) || out_enc != source)
+    return stream_transcoded(io, source, out_enc, options, &block)
+  end
+
   count = 0
   Framer.each_document(io) do |doc|
+    # readpartial dropped the IO's encoding tag; restore it so a Latin-1 / Windows-1252 /
+    # etc. stream is parsed and emitted in its own encoding, not mislabelled.
+    doc = doc.dup.force_encoding(out_enc) if out_enc && doc.encoding != out_enc
     # Recovery.process_string yields each value and returns how many it yielded;
     # blank / comment-only framed segments yield none, so count tracks actual
     # documents (== values yielded), not raw framed segments.
|
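Reviewer note: the reason Shift_JIS sits in `UNSCANNABLE_ASCII_COMPATIBLE` while EUC-* does not can be seen from the bytes of a single character. The sketch below uses core Ruby only; katakana "ソ" (U+30BD) is chosen as a well-known case whose Shift_JIS trail byte equals the ASCII backslash.

```ruby
so = "\u30BD" # katakana SO

sjis_bytes  = so.encode(Encoding::Shift_JIS).bytes # trail byte lands in the ASCII range
eucjp_bytes = so.encode(Encoding::EUC_JP).bytes    # both bytes stay above 0x7F

backslash = "\\".ord

puts format("Shift_JIS: %s", sjis_bytes.map { |b| format("0x%02X", b) }.join(" "))
puts format("EUC-JP:    %s", eucjp_bytes.map { |b| format("0x%02X", b) }.join(" "))
puts "Shift_JIS trail byte equals backslash: #{sjis_bytes.last == backslash}"
puts "Shift_JIS ascii_compatible?: #{Encoding::Shift_JIS.ascii_compatible?}"
puts "UTF-16LE ascii_compatible?:  #{Encoding::UTF_16LE.ascii_compatible?}"
```

A byte scanner looking for `\` or `{` would misread that trail byte as JSON structure, even though Ruby reports Shift_JIS as ASCII-compatible. That is the gap the explicit list closes.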
@@ -176,6 +306,25 @@ module SmarterJSON
   count
 end
 
+# Bounded-memory streaming for an unscannable or transcoding IO (see stream_io). Each chunk
+# is transcoded to a UTF-8 view and framed there one document at a time; each framed document
+# is parsed and emitted in `out_enc` — the same parse-then-re-tag path as the whole-buffer
+# case, but per document, so peak memory is bounded by one document, not the whole stream.
+def stream_transcoded(io, source, out_enc, options, &block)
+  first = Framer.read_chunk(io)
+  out_enc = concrete_unicode_encoding(first.to_s, out_enc) # generic UTF-16/32 -> concrete via BOM
+  # No converter when the raw bytes are already UTF-8 (e.g. a UTF-8 -> UTF-16 transcoding IO):
+  # the bytes need no transcoding to be byte-scanned, only the OUTPUT is re-tagged (deep_encode).
+  conv = [Encoding::UTF_8, Encoding::US_ASCII].include?(source) ? nil : Encoding::Converter.new(source, Encoding::UTF_8)
+  opts = options.merge(encoding: nil)
+  replace = options[:replace_char]
+  count = 0
+  Framer.each_document_transcoded(io, conv, first) do |utf8_doc|
+    count += Recovery.process_string(utf8_doc, opts) { |v| block.call(deep_encode(v, out_enc, replace)) }
+  end
+  count
+end
+
 # process_one's "more than one document" notice — routed to on_warning if the caller
 # gave one, else Rails.logger when Rails is loaded, else Kernel#warn. Never silent,
 # never raised.
@@ -192,7 +341,10 @@ module SmarterJSON
   end
 end
 
-private_class_method :process_content, :stream_io, :warn_extra_documents
+private_class_method :process_content, :stream_io, :stream_transcoded, :warn_extra_documents,
+                     :file_read_mode, :normalize_default_encoding, :unscannable_enc?,
+                     :unscannable_encoding, :concrete_unicode_encoding, :to_utf8_copy,
+                     :reencode_scalar, :deep_encode
 
 # Named byte values, shared by the Parser FSM and the Framer / Recovery byte
 # scanners so none of them spell out raw hex. Included where needed.
@@ -261,6 +413,59 @@ module SmarterJSON
   yield buffer unless separators_only?(buffer)
 end
 
+# Like each_document, but the IO's raw bytes are in `conv`'s source encoding (UTF-16 /
+# UTF-32 / Shift_JIS / ...): each chunk is transcoded to a UTF-8 view and framed there, so
+# the byte-level splitter works. `first_chunk` is the already-read first raw chunk (the
+# caller sniffs a BOM from it). Memory stays bounded by one document, like each_document.
+def each_document_transcoded(io, conv, first_chunk)
+  buffer = +""
+  scan = 0
+  doc_start = nil
+  stack = []
+  mode = nil
+
+  raw = first_chunk
+  while raw
+    chunk = transcode_chunk(conv, raw)
+    unless chunk.empty?
+      buffer << chunk
+      loop do
+        emitted, buffer, scan, doc_start, stack, mode = scan_buffer(buffer, scan, doc_start, stack, mode)
+        break unless emitted
+
+        yield emitted
+      end
+    end
+    raw = read_chunk(io)
+  end
+
+  finish_transcode(conv) # truncated / invalid trailing bytes -> SmarterJSON::EncodingError
+
+  yield buffer unless separators_only?(buffer)
+end
+
+# Push one raw chunk through the converter, returning the UTF-8 produced so far. An
+# incomplete trailing multibyte sequence is held inside the converter until the next chunk;
+# invalid bytes raise SmarterJSON::EncodingError (matching the whole-buffer to_utf8_copy).
+def transcode_chunk(conv, raw)
+  return raw.dup.force_encoding(Encoding::UTF_8) if conv.nil? # raw bytes are already UTF-8
+
+  out = +""
+  status = conv.primitive_convert(raw.dup, out, nil, nil, partial_input: true)
+  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" if status == :invalid_byte_sequence
+
+  out
+end
+
+# Flush the converter at end of stream. A held incomplete multibyte sequence means the input
+# was truncated mid-character — surface it the same way an invalid encoding is surfaced.
+def finish_transcode(conv)
+  return if conv.nil?
+
+  status = conv.primitive_convert("".b, +"")
+  raise SmarterJSON::EncodingError, "invalid byte sequence in stream" unless status == :finished
+end
+
 def read_chunk(io)
   if io.respond_to?(:readpartial)
     io.readpartial(CHUNK_SIZE)
|
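Reviewer note: `transcode_chunk` and `finish_transcode` rely on `Encoding::Converter#primitive_convert` holding an incomplete multibyte sequence between chunks when `partial_input: true` is passed. The sketch below reproduces that with core Ruby only. The chunk boundary is chosen deliberately to split a UTF-16LE code unit; the variable names are illustrative and no SmarterJSON code is called.

```ruby
raw = "[\"h\u00E9\"]".encode(Encoding::UTF_16LE).b # 6 characters, 12 bytes

# Split at an odd offset so the first chunk ends in the middle of a code unit.
chunks = [raw.byteslice(0, 5), raw.byteslice(5, raw.bytesize - 5)]

conv = Encoding::Converter.new(Encoding::UTF_16LE, Encoding::UTF_8)
utf8 = +""
chunks.each do |chunk|
  out = +""
  conv.primitive_convert(chunk.dup, out, nil, nil, partial_input: true)
  utf8 << out.force_encoding(Encoding::UTF_8)
end
final_status = conv.primitive_convert("".b, +"") # flush at end of stream

# A stream that really is truncated mid-character does not finish cleanly.
truncated = Encoding::Converter.new(Encoding::UTF_16LE, Encoding::UTF_8)
truncated.primitive_convert(raw.byteslice(0, 1), +"", nil, nil, partial_input: true)
truncated_status = truncated.primitive_convert("".b, +"")

puts utf8
puts "final status: #{final_status}"
puts "truncated status: #{truncated_status}"
```

The final flush is what distinguishes a chunk boundary that merely split a character from input that ends mid-character, which is the case the diff turns into `SmarterJSON::EncodingError`.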
@@ -466,6 +671,22 @@ module SmarterJSON
 
 def process_string(input, options, &block)
   input = SmarterJSON.send(:normalize_default_encoding, input, options)
+
+  # UTF-16 / UTF-32 / Shift_JIS / ... cannot be byte-scanned for JSON structure. Parse
+  # a UTF-8 copy and emit each document's strings back in the encoding the bytes arrived
+  # in — the caller always gets values in the encoding they handed us, never UTF-8.
+  if (target_enc = SmarterJSON.send(:unscannable_encoding, input))
+    target_enc = SmarterJSON.send(:concrete_unicode_encoding, input, target_enc) # avoid per-string BOMs
+    opts = options.merge(encoding: nil) # the working copy is UTF-8; don't re-label it downstream
+    utf8 = SmarterJSON.send(:to_utf8_copy, input) # invalid bytes -> SmarterJSON::EncodingError
+    replace = options[:replace_char]
+    if block
+      return process_string(utf8, opts) { |doc| block.call(SmarterJSON.send(:deep_encode, doc, target_enc, replace)) }
+    end
+
+    return process_string(utf8, opts).map { |doc| SmarterJSON.send(:deep_encode, doc, target_enc, replace) }
+  end
+
   return SmarterJSON.send(:process_content, input, options, &block) unless input.valid_encoding?
 
   # Recovery is REACTIVE: parse first, and only fall back to wrapper extraction when
|
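Reviewer note: the "avoid per-string BOMs" comment in `process_string` refers to how Ruby encodes into the generic `UTF-16` encoding. The sketch below shows the difference with core Ruby. The helper `concrete_utf16` is a hypothetical, simplified stand-in for the BOM sniffing in the diff, not the gem's method.

```ruby
generic  = "a".encode(Encoding::UTF_16)   # generic UTF-16: a BOM is prepended
concrete = "a".encode(Encoding::UTF_16LE) # concrete endianness: no BOM

# Hypothetical helper: pick the concrete UTF-16 encoding from the leading bytes,
# defaulting to big-endian when no little-endian BOM is present.
def concrete_utf16(input)
  head = input.byteslice(0, 2).to_s.b
  head.start_with?("\xFF\xFE".b) ? Encoding::UTF_16LE : Encoding::UTF_16BE
end

le_input = "\xFF\xFEa\x00".b # little-endian BOM followed by "a"

puts "generic bytes:  #{generic.bytes.inspect}"
puts "concrete bytes: #{concrete.bytes.inspect}"
puts "sniffed from LE BOM: #{concrete_utf16(le_input)}"
puts "sniffed from generic output: #{concrete_utf16(generic)}"
```

Re-encoding every parsed string into the generic encoding would put a BOM in front of each value, so mapping to a concrete endianness once per input keeps the returned strings clean.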
data/lib/smarter_json/version.rb
CHANGED
metadata
CHANGED
@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: smarter_json
 version: !ruby/object:Gem::Version
-  version: 1.2.
+  version: 1.2.3
 platform: ruby
 authors:
 - Tilo Sloboda
 bindir: exe
 cert_chain: []
-date: 2026-06-
+date: 2026-06-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bigdecimal