json-repair 0.7.0 → 0.11.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: fdf2958528b936eba0faf6c724a20d10a3a3e0b95329bc1866740c99432f6fe3
4
- data.tar.gz: 33fa97d2c7689ea594723e5d0d412646cf57a2ebdbf79f3b62947d4928963f32
3
+ metadata.gz: 0c130bbea1b9299e31e5bfa8db873b09fd911715b7125fda6ee60a101353be5f
4
+ data.tar.gz: 80f6e2fe16669210505e45c99a443eaa3ce6b8b3c10e3967633885f91a9d057b
5
5
  SHA512:
6
- metadata.gz: 587323c57caac3e53af1da24cfeee47df18ff738516d8fd2862d37a547c6e864cf5653c8b43427ec58aa47be3f2e958c991208b36de098c5ffc9503f2c64a0a2
7
- data.tar.gz: f5380cfdca6a4f60833ab57fc8465715b546415fa4f6ed5781b6bb24653ce324d2b6d928f9a5ff87b4deead145391303c1e805bf0fbff238b9ef1c7024a1f4aa
6
+ metadata.gz: 4e2c05c45ebf1cf149705021faa611c22e6e2e0de48d15d2496da4e935291abe6bf185cde756263c6aad210f73ca2e54e4c965826b14bdbb0fa9ffa47bf1684f
7
+ data.tar.gz: 180b477cea0b27c813b664bfcda75eb28b59453d1fe4ce90781cf298c870aa8518411e689a37ddc30882aa5f376c351ff718600411d97df845f67fd87b593d92
data/.rubocop.yml CHANGED
@@ -1,10 +1,6 @@
1
1
  AllCops:
2
2
  TargetRubyVersion: 3.0
3
3
 
4
- Metrics/BlockLength:
5
- Exclude:
6
- - spec/**/*
7
-
8
4
  Style/Documentation:
9
5
  Enabled: false
10
6
 
@@ -33,6 +29,8 @@ Metrics/BlockLength:
33
29
  Exclude:
34
30
  - lib/json/repairer.rb
35
31
  - spec/**/*
32
+ # restore RuboCop's default exclusion, lost when Exclude is overridden
33
+ - '*.gemspec'
36
34
 
37
35
  Metrics/BlockNesting:
38
36
  Exclude:
data/CHANGELOG.md CHANGED
@@ -1,5 +1,74 @@
1
1
  # Changes
2
2
 
3
+ ### 2026-06-12 (0.11.0)
4
+
5
+ * Repair object string values with unescaped quotes around a colon
6
+ ("doubled colon"): `{"a": "b": "c"}` → `{"a":"b\": \"c"}` — the
7
+ value reads as `b": "c`, the unescaped-quotes interpretation. The
8
+ merge preserves the literal characters between the strings
9
+ (whitespace, original quote style) and repeats greedily
10
+ (`{"a": "b": "c": "d"}` → value `b": "c": "d`). Only the
11
+ string–colon–string shape is repaired: non-string shapes like
12
+ `{"a": "b": 1}` or `{"a": 1: 2}` still raise `JSONRepairError`
13
+ rather than silently dropping data (Python `json_repair` drops the
14
+ `: 1` there). Previously all of these raised "Object key expected".
15
+ Deliberate divergence from upstream
16
+ [jsonrepair](https://github.com/josdejong/jsonrepair) (raises as of
17
+ v3.14.0), matching
18
+ [Go json-repair](https://github.com/RealAlexandreAI/json-repair)
19
+ and Python
20
+ [`json_repair`](https://github.com/mangiucugna/json_repair) on the
21
+ canonical case.
22
+
23
+ ### 2026-06-11 (0.10.0)
24
+
25
+ * Repair Markdown list markers in front of top-level values:
26
+ `- {"a": 1}` → `{"a":1}`, and multi-line lists become arrays via the
27
+ existing newline-delimited JSON handling
28
+ (`"- {\"a\": 1}\n- {\"b\": 2}"` → `[{"a":1},{"b":2}]`). Bullet
29
+ markers `-`, `*`, `+` and ordered markers like `1.` / `2)` (up to
30
+ nine digits, the CommonMark limit) are recognized at the start of
31
+ the root value and of each newline-delimited line, only when
32
+ followed by same-line whitespace and a value — so `-5`, a trailing
33
+ `"- "`, and newline-delimited decimals like `"1.5\n2.5"` keep their
34
+ number readings, and nothing changes inside nested structures.
35
+ Previously these inputs raised `JSONRepairError`; two non-raising
36
+ behaviors change for the better: `"3\n- 5\n7"` now repairs to
37
+ `[3,5,7]` instead of the corrupt `[3,0,5,7]`, and a single-line
38
+ `* text` becomes `"text"` instead of `"* text"`. Deliberate
39
+ divergence from upstream
40
+ [jsonrepair](https://github.com/josdejong/jsonrepair) (no Markdown
41
+ list handling as of v3.14.0), and more precise than Python
42
+ [`json_repair`](https://github.com/mangiucugna/json_repair), which
43
+ collapses scalar list items to `""`.
44
+
45
+ ### 2026-06-11 (0.9.0)
46
+
47
+ * Repair numbers missing the digit before their decimal point:
48
+ `.5` → `0.5`, `-.5` → `-0.5`, and truncated forms like `.` → `0.0`.
49
+ Previously these leaked a raw stdlib `JSON::ParserError` out of
50
+ `JSON.repair` because the repairer emitted the leading-dot number
51
+ unchanged (invalid JSON) and the canonical-output re-parse choked on
52
+ it. This is a deliberate divergence from upstream
53
+ [jsonrepair](https://github.com/josdejong/jsonrepair) (which leaves
54
+ leading-dot numbers unrepaired as of v3.14.0), matching
55
+ [dirty-json](https://github.com/RyanMarcus/dirty-json) behavior.
56
+ * `JSON.repair` now guards its error contract: if the repairer ever
57
+ emits a string stdlib JSON cannot parse (a repairer bug), the stdlib
58
+ error is wrapped in `JSON::JSONRepairError` instead of leaking
59
+ `JSON::ParserError` to callers.
60
+
61
+ ### 2026-05-15 (0.8.0)
62
+
63
+ * `JSON.repair_file(path)` and `JSON.repair_io(io)` convenience
64
+ wrappers around `JSON.repair`. `repair_file` reads a path from disk
65
+ (accepts a `String` or `Pathname`); `repair_io` reads from any
66
+ object responding to `#read` (e.g. `File`, `StringIO`, `$stdin`)
67
+ without closing it. Both forward `return_objects:` and
68
+ `skip_json_loads:` through to `JSON.repair`. Mirrors Python's
69
+ [`json_repair`](https://github.com/mangiucugna/json_repair)
70
+ `load` / `from_file` helpers.
71
+
3
72
  ### 2026-05-12 (0.7.0)
4
73
 
5
74
  * `JSON.repair` now always returns canonical JSON via
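The escaped output quoted in the 0.11.0 entry can be hard to read at a glance. As a minimal sketch that uses only the stdlib `json` library (it does not call the gem), the following shows that the string `b": "c` serializes to exactly the documented output and parses back unchanged. The variable names are illustrative only.

```ruby
require 'json'

# The merged value described in the 0.11.0 entry: both quoted strings and the
# colon between them, read as one string that contains unescaped quotes.
merged_value = 'b": "c'

# Serializing with the stdlib escapes the inner double quotes.
serialized = JSON.generate({ 'a' => merged_value })
puts serialized # prints {"a":"b\": \"c"}

# Parsing the serialized form recovers the same string.
round_tripped = JSON.parse(serialized)
puts round_tripped['a'] == merged_value # prints true
```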
data/CLAUDE.md CHANGED
@@ -10,7 +10,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
10
10
  - `bundle exec rspec spec/json_spec.rb:42` — run a single example by line number; nearly all behavioral specs live in `spec/json_spec.rb`.
11
11
  - `bundle exec rubocop` — lint. Project-specific exclusions in `.rubocop.yml` deliberately disable several `Metrics/*` cops for `lib/json/repairer.rb` and `lib/json/repair/string_utils.rb` because the parser is long by design — don't try to "fix" it by chopping methods up.
12
12
  - `bin/console` — IRB with the gem preloaded.
13
- - `bundle exec rake install` / `bundle exec rake release` local install / publish to rubygems.org.
13
+ - `bundle exec rake bench` benchmark-ips regression baseline (`benchmark/run.rb`); run before/after perf-sensitive changes.
14
+ - `bundle exec rake install` / `bundle exec rake release` — local install / publish to rubygems.org (release prompts for a rubygems MFA OTP).
14
15
  - Type checking: `Steepfile` checks `lib/` against `sig/`. `bundle exec steep check` (typecheck) and `bundle exec rbs validate` (sig syntax) both run in CI and as part of the default rake task. `steep` and `rbs` are dev dependencies in the `Gemfile`.
15
16
 
16
17
  Ruby `>= 3.0.0` is required (per gemspec). CI runs against Ruby 3.3.1.
@@ -19,9 +20,15 @@ Ruby `>= 3.0.0` is required (per gemspec). CI runs against Ruby 3.3.1.
19
20
 
20
21
  This gem is a **Ruby port of the [josdejong/jsonrepair](https://github.com/josdejong/jsonrepair) TypeScript library**. The upstream version currently mirrored is tracked in `CHANGELOG.md` (presently v3.14.0). When syncing upstream changes, the goal is parity with the JS implementation, not idiomatic refactoring — keep method names, control flow, and repair heuristics aligned with the JS source so future syncs stay tractable.
21
22
 
23
+ A few repair heuristics deliberately go **beyond** upstream (leading-dot numbers like `.5`, Markdown list markers like `- {...}`). Each such site carries a "Divergence from upstream" comment in the code and a CHANGELOG note — keep that convention when adding more, so future upstream syncs can tell ported behavior from local extensions.
24
+
22
25
  ### Entry point
23
26
 
24
- `JSON.repair(str)` in `lib/json/repair.rb` is a thin wrapper that constructs `JSON::Repairer.new(str).repair`. `JSON::JSONRepairError` is the only error raised for unrecoverable inputs.
27
+ `JSON.repair(str)` in `lib/json/repair.rb` first tries stdlib `JSON.parse` (fast path; opt out with `skip_json_loads: true`), and falls back to `JSON::Repairer.new(str).repair` when that raises. Either way the result is re-serialized with `JSON.generate`, so **output is canonical** — whitespace collapsed, numbers normalized, duplicate keys last-write-wins — and both paths agree on it. A `REPAIR_REQUIRED_PATTERN` regex routes inputs containing comments or invalid escapes straight to the Repairer even though the bundled `json` gem would accept them. `return_objects: true` returns the parsed Ruby value instead of a string; `JSON.repair_file(path)` / `JSON.repair_io(io)` are convenience wrappers forwarding both options.
28
+
29
+ `JSON::JSONRepairError` is the only error raised for unrecoverable inputs; it exposes the failure `#position`. If the Repairer ever emits a string stdlib cannot parse (a Repairer bug), the `JSON::ParserError` is wrapped in `JSONRepairError` rather than leaked.
30
+
31
+ `exe/json-repair` is a CLI wrapper (`lib/json/repair/cli.rb`) reading stdin or a file, writing stdout, `--output FILE`, or `--overwrite`.
25
32
 
26
33
  ### The parser (`lib/json/repairer.rb`)
27
34
 
@@ -43,7 +50,7 @@ Two patterns recur and are worth knowing before editing:
43
50
  - **Backtracking via snapshots.** Methods like `parse_string` capture `i_before = @index` and `o_before = @output.length` before tentatively consuming input. If a later check (e.g. "the end quote turned out not to be a real end quote") fails, they restore both and re-invoke themselves with different flags (e.g. `stop_at_delimiter: true`, `stop_at_index: …`). Preserve this pattern when modifying string/number parsing.
44
51
  - **Repair-by-rewriting-tail.** Helpers like `insert_before_last_whitespace(@output, ',')` and `@output = strip_last_occurrence(@output, ',')` patch the already-emitted output to fix things like missing or trailing commas. These run *after* the malformed input has been partially emitted — they are the mechanism for "I now realize that earlier token needed a comma after it."
45
52
 
46
- `repair` (the public method) drives `parse_value` then handles top-level concerns: stripping Markdown fences (` ```json ... ``` `), converting newline-delimited JSON at the root into an array, dropping redundant trailing braces/brackets, and rejecting any non-whitespace trailing garbage.
53
+ `repair` (the public method) drives `parse_value` then handles top-level concerns: stripping Markdown fences (` ```json ... ``` `), skipping Markdown list markers like `- ` / `* ` / `1. ` before the root value and each newline-delimited line (`markdown_list_marker_length` / `skip_markdown_list_marker` — top-level only, never inside nested structures), converting newline-delimited JSON at the root into an array, dropping redundant trailing braces/brackets, and rejecting any non-whitespace trailing garbage.
47
54
 
48
55
  ### Shared helpers (`lib/json/repair/string_utils.rb`)
49
56
 
@@ -62,6 +69,13 @@ RBS signatures mirror the public surface of `JSON.repair`, `JSON::Repairer`, and
62
69
 
63
70
  ### Test layout
64
71
 
65
- - `spec/json_spec.rb` — the substantive behavioral suite (700+ examples covering every repair heuristic). New behavior — and every sync from upstream — belongs here.
72
+ - `spec/json_spec.rb` — the substantive behavioral suite (130+ examples, hundreds of assertions, covering every repair heuristic). New behavior — and every sync from upstream — belongs here.
73
+ - `spec/json/repair/cli_spec.rb` — the `exe/json-repair` CLI (argument handling, IO errors, exit codes).
74
+ - `spec/json/repair/string_utils_spec.rb` — direct unit coverage for a few `StringUtils` edge cases the behavioral suite can't reach.
66
75
  - `spec/json/repair_spec.rb` — sanity check on `JSON::Repair::VERSION` only.
67
- - `.rspec_status` is committed and tracks per-example pass/fail so `--only-failures` / `--next-failure` work across runs.
76
+ - SimpleCov enforces 100% line and branch coverage on the full run, so a filtered `rspec -e ...` run "fails" at the coverage gate even when all selected examples pass — ignore that exit code during TDD.
77
+ - `.rspec_status` is gitignored (local pass/fail tracking for `--only-failures` / `--next-failure`).
78
+
79
+ ## Local planning notes
80
+
81
+ `docs/` (the `TODO.md` backlog plus design specs and implementation plans under `docs/superpowers/`) is gitignored local planning material — read it for context, update it as work completes, but never commit anything under `docs/`.
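The entry-point flow described in the CLAUDE.md changes (stdlib fast path, repair fallback, re-serialization) can be sketched roughly as follows. This is not the gem's code: `canonical_json` and the block-based fallback are hypothetical stand-ins, and the toy "repair" only swaps single quotes for double quotes.

```ruby
require 'json'

# Hypothetical stand-in for the entry point: try the stdlib parser first,
# fall back to a caller-supplied repair step, then re-serialize the result.
def canonical_json(text)
  parsed = begin
    JSON.parse(text)
  rescue JSON::ParserError
    # Assumption of this sketch: the block plays the role of the Repairer.
    JSON.parse(yield(text))
  end
  JSON.generate(parsed)
end

# Valid input takes the fast path; whitespace and number spelling are normalized.
fast = canonical_json('{ "a" : [1, 2.50, 1e2] }') { |t| t }

# Invalid input (single-quoted strings) goes through the fallback block.
slow = canonical_json("{'a': 1}") { |t| t.tr("'", '"') }

puts fast # prints {"a":[1,2.5,100.0]}
puts slow # prints {"a":1}
```

Because both paths end in `JSON.generate`, the caller sees the same canonical form regardless of which path handled the input.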
data/README.md CHANGED
@@ -4,7 +4,7 @@ This is a Ruby gem designed to repair broken JSON strings. Inspired by and based
4
4
 
5
5
  ## Installation
6
6
 
7
- Add this gem to your application's Gemfield by executing:
7
+ Add this gem to your application's Gemfile by executing:
8
8
 
9
9
  ```bash
10
10
  $ bundle add json-repair
@@ -31,6 +31,18 @@ puts repaired_json # Outputs: {"name":"Alice","age":25}
31
31
 
32
32
  The `repair` method takes a string containing JSON data and returns a corrected version of this string, ensuring it is valid JSON.
33
33
 
34
+ Markdown markup in LLM output is handled too: fenced code blocks like `` ```json `` are stripped, and list markers (`-`, `*`, `+`, `1.`) in front of top-level values are removed — a multi-line list becomes an array:
35
+
36
+ ```ruby
37
+ JSON.repair("- {\"a\": 1}\n- {\"b\": 2}") # => '[{"a":1},{"b":2}]'
38
+ ```
39
+
40
+ Object values containing unescaped quotes around a colon are merged back into a single string value:
41
+
42
+ ```ruby
43
+ JSON.repair('{"a": "b": "c"}') # => '{"a":"b\": \"c"}' — the value reads as 'b": "c'
44
+ ```
45
+
34
46
  Pass `return_objects: true` to get the parsed Ruby value (Hash, Array, or scalar) instead of a string:
35
47
 
36
48
  ```ruby
@@ -53,6 +65,20 @@ If you need the parsed Ruby value instead of a string, pass `return_objects: tru
53
65
 
54
66
  `skip_json_loads: true` skips the stdlib `JSON.parse` attempt and routes the input straight through the repairer. The output is the same; the option is purely a performance knob for callers who know their input will need repair.
55
67
 
68
+ ### Reading from a file or IO
69
+
70
+ `JSON.repair_file(path)` reads a file from disk and repairs its contents. `JSON.repair_io(io)` does the same with any object that responds to `#read` (e.g. `File`, `StringIO`, `$stdin`). Both forward `return_objects:` and `skip_json_loads:` to `JSON.repair`.
71
+
72
+ ```ruby
73
+ JSON.repair_file('broken.json')
74
+ JSON.repair_file('broken.json', return_objects: true)
75
+
76
+ File.open('broken.json') { |io| JSON.repair_io(io) }
77
+ JSON.repair_io($stdin)
78
+ ```
79
+
80
+ `JSON.repair_io` does not close the IO — the caller manages its lifecycle.
81
+
56
82
  ## Command line
57
83
 
58
84
  The gem ships a `json-repair` executable. It reads from stdin or a file and writes to stdout, `--output FILE`, or back over the input file with `--overwrite`.
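The IO contract described in the README changes (read from anything that responds to `#read`, never close it) can be illustrated with a small stdlib-only sketch. `read_json_source` is a hypothetical helper that mirrors only the reading step, not the repair itself.

```ruby
require 'stringio'

# Hypothetical helper: accepts any object that responds to #read and leaves
# closing to the caller. The `|| ''` covers readers that return nil.
def read_json_source(io)
  io.read || ''
end

io = StringIO.new('{"a": 1}')
text = read_json_source(io)
still_open = !io.closed?

puts text       # prints {"a": 1}
puts still_open # prints true
io.close        # closing remains the caller's responsibility
```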
data/Rakefile CHANGED
@@ -19,6 +19,9 @@ task :steep do
19
19
  sh 'bundle exec steep check'
20
20
  end
21
21
 
22
+ desc 'Type-check: rbs validate + steep check'
23
+ task typecheck: %i[rbs steep]
24
+
22
25
  desc 'Run benchmark/run.rb (regression baseline for JSON.repair)'
23
26
  task :bench do
24
27
  ruby '-Ilib', 'benchmark/run.rb'
@@ -60,13 +60,14 @@ module JSON
60
60
 
61
61
  # Functions to check character chars
62
62
  def hex?(char)
63
- (char >= ZERO && char <= NINE) ||
64
- (char >= UPPERCASE_A && char <= UPPERCASE_F) ||
65
- (char >= LOWERCASE_A && char <= LOWERCASE_F)
63
+ !char.nil? &&
64
+ ((char >= ZERO && char <= NINE) ||
65
+ (char >= UPPERCASE_A && char <= UPPERCASE_F) ||
66
+ (char >= LOWERCASE_A && char <= LOWERCASE_F))
66
67
  end
67
68
 
68
69
  def digit?(char)
69
- char && char >= ZERO && char <= NINE
70
+ !char.nil? && char >= ZERO && char <= NINE
70
71
  end
71
72
 
72
73
  def valid_string_character?(char)
@@ -74,11 +75,11 @@ module JSON
74
75
  end
75
76
 
76
77
  def delimiter?(char)
77
- REGEX_DELIMITER.match?(char)
78
+ !char.nil? && REGEX_DELIMITER.match?(char)
78
79
  end
79
80
 
80
81
  def unquoted_string_delimiter?(char)
81
- REGEX_UNQUOTED_STRING_DELIMITER.match?(char)
82
+ !char.nil? && REGEX_UNQUOTED_STRING_DELIMITER.match?(char)
82
83
  end
83
84
 
84
85
  REGEX_FUNCTION_NAME_CHAR_START = /\A[a-zA-Z_$]\z/
@@ -93,19 +94,19 @@ module JSON
93
94
  end
94
95
 
95
96
  def start_of_value?(char)
96
- REGEX_START_OF_VALUE.match?(char) || (char && quote?(char))
97
+ !char.nil? && (REGEX_START_OF_VALUE.match?(char) || quote?(char))
97
98
  end
98
99
 
99
100
  def control_character?(char)
100
- [NEWLINE, RETURN, TAB, BACKSPACE, FORM_FEED].include?(char)
101
+ !char.nil? && [NEWLINE, RETURN, TAB, BACKSPACE, FORM_FEED].include?(char)
101
102
  end
102
103
 
103
104
  def whitespace?(char)
104
- [SPACE, NEWLINE, TAB, RETURN].include?(char)
105
+ !char.nil? && [SPACE, NEWLINE, TAB, RETURN].include?(char)
105
106
  end
106
107
 
107
108
  def whitespace_except_newline?(char)
108
- [SPACE, TAB, RETURN].include?(char)
109
+ !char.nil? && [SPACE, TAB, RETURN].include?(char)
109
110
  end
110
111
 
111
112
  def special_whitespace?(char)
@@ -122,6 +123,14 @@ module JSON
122
123
  (char >= EN_QUAD && char <= ZERO_WIDTH_SPACE)
123
124
  end
124
125
 
126
+ def same_line_whitespace?(char)
127
+ whitespace_except_newline?(char) || special_whitespace?(char)
128
+ end
129
+
130
+ def whitespace_or_special?(char)
131
+ whitespace?(char) || special_whitespace?(char)
132
+ end
133
+
125
134
  def quote?(char)
126
135
  double_quote_like?(char) || single_quote_like?(char)
127
136
  end
@@ -135,20 +144,25 @@ module JSON
135
144
  end
136
145
 
137
146
  def double_quote_like?(char)
138
- [DOUBLE_QUOTE, DOUBLE_QUOTE_LEFT, DOUBLE_QUOTE_RIGHT].include?(char)
147
+ !char.nil? && [DOUBLE_QUOTE, DOUBLE_QUOTE_LEFT, DOUBLE_QUOTE_RIGHT].include?(char)
139
148
  end
140
149
 
141
150
  def single_quote_like?(char)
142
- [QUOTE, QUOTE_LEFT, QUOTE_RIGHT, GRAVE_ACCENT, ACUTE_ACCENT].include?(char)
151
+ !char.nil? && [QUOTE, QUOTE_LEFT, QUOTE_RIGHT, GRAVE_ACCENT, ACUTE_ACCENT].include?(char)
143
152
  end
144
153
 
145
- # Strip last occurrence of text_to_strip from text
154
+ # Strip last occurrence of text_to_strip from text.
155
+ #
156
+ # `|| ''` on the slices below (and in `insert_before_last_whitespace` /
157
+ # `remove_at_index`) is for steep's nil-narrowing: `String#[range]` is
158
+ # typed `String?`, but every call site here keeps indices within
159
+ # `0..text.length`, so the slices never actually return `nil`.
146
160
  def strip_last_occurrence(text, text_to_strip, strip_remaining_text: false)
147
161
  index = text.rindex(text_to_strip)
148
162
  return text unless index
149
163
 
150
- remaining_text = strip_remaining_text ? '' : text[index + 1..]
151
- text[0...index] + remaining_text
164
+ remaining_text = strip_remaining_text ? '' : (text[index + 1..] || '')
165
+ (text[0...index] || '') + remaining_text
152
166
  end
153
167
 
154
168
  def insert_before_last_whitespace(text, text_to_insert)
@@ -158,7 +172,7 @@ module JSON
158
172
 
159
173
  index -= 1 while whitespace?(text[index - 1])
160
174
 
161
- text[0...index] + text_to_insert + text[index..]
175
+ (text[0...index] || '') + text_to_insert + (text[index..] || '')
162
176
  end
163
177
 
164
178
  # Parse keywords true, false, null
@@ -187,7 +201,7 @@ module JSON
187
201
  end
188
202
 
189
203
  def remove_at_index(text, start, count)
190
- text[0...start] + text[start + count..]
204
+ (text[0...start] || '') + (text[start + count..] || '')
191
205
  end
192
206
 
193
207
  def ends_with_comma_or_newline?(text)
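The nil guards and `|| ''` fallbacks in the hunks above rest on a few plain Ruby behaviors, which the following self-contained sketch demonstrates. The constants and `digit?` are redefined locally for illustration; the sketch does not load the gem.

```ruby
ZERO = '0'
NINE = '9'

# Same shape as the guarded predicate in the diff: nil (an index past the end
# of the input) is rejected before any comparison runs, and the result is a
# strict boolean. The earlier `char && ...` form returned nil for nil input.
def digit?(char)
  !char.nil? && char >= ZERO && char <= NINE
end

input = '42'
results = [digit?(input[0]), digit?(input[2]), digit?('x')]
p results # prints [true, false, false]

# Without a guard, comparing nil raises NoMethodError.
unguarded_error = begin
  nil >= ZERO
rescue NoMethodError => e
  e.class
end
p unguarded_error # prints NoMethodError

# Range slices: a start equal to the length yields "", one past it yields nil.
at_end = 'abc'[3..]
past_end = 'abc'[4..]
p at_end   # prints ""
p past_end # prints nil
```

The last two lines show why the `|| ''` fallbacks are only needed for the type checker as long as indices stay within `0..text.length`.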
@@ -2,6 +2,6 @@
2
2
 
3
3
  module JSON
4
4
  module Repair
5
- VERSION = '0.7.0'
5
+ VERSION = '0.11.0'
6
6
  end
7
7
  end
data/lib/json/repair.rb CHANGED
@@ -20,6 +20,22 @@ module JSON
20
20
  return_objects ? parsed : JSON.generate(parsed)
21
21
  end
22
22
 
23
+ # Inlined rather than calling `repair(...)` so the literal-bool overloads
24
+ # in sig/json/repair.rbs narrow correctly per caller — forwarding a
25
+ # `bool`-typed `return_objects` will not resolve against the literal-
26
+ # `true`/`false` overloads on `JSON.repair`.
27
+ def self.repair_io(io, return_objects: false, skip_json_loads: false)
28
+ json = io.read || ''
29
+ parsed = skip_json_loads ? repaired_parse(json) : tolerant_parse(json)
30
+ return_objects ? parsed : JSON.generate(parsed)
31
+ end
32
+
33
+ def self.repair_file(path, return_objects: false, skip_json_loads: false)
34
+ json = File.read(path.to_s)
35
+ parsed = skip_json_loads ? repaired_parse(json) : tolerant_parse(json)
36
+ return_objects ? parsed : JSON.generate(parsed)
37
+ end
38
+
23
39
  def self.tolerant_parse(json)
24
40
  JSON.parse(json)
25
41
  rescue JSON::ParserError
@@ -27,8 +43,14 @@ module JSON
27
43
  end
28
44
  private_class_method :tolerant_parse
29
45
 
46
+ # The rescue guards the JSONRepairError-only error contract: if the
47
+ # Repairer ever emits a string stdlib JSON cannot parse (a Repairer bug),
48
+ # wrap the stdlib error instead of leaking JSON::ParserError to callers.
30
49
  def self.repaired_parse(json)
31
- JSON.parse(Repairer.new(json).repair)
50
+ repaired = Repairer.new(json).repair
51
+ JSON.parse(repaired)
52
+ rescue JSON::ParserError => e
53
+ raise JSONRepairError, "Internal error: repaired output is not valid JSON (#{e.message})"
32
54
  end
33
55
  private_class_method :repaired_parse
34
56
  end
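The error-contract guard added to `repaired_parse` follows a common rescue-and-wrap pattern. A minimal sketch, assuming a hypothetical `RepairError` class in place of `JSON::JSONRepairError` and a hypothetical `parse_repaired` helper:

```ruby
require 'json'

# Hypothetical error class standing in for JSON::JSONRepairError.
class RepairError < StandardError; end

# Parse text produced by a repair step, translating stdlib parse failures
# into the single error type callers are told to expect.
def parse_repaired(repaired)
  JSON.parse(repaired)
rescue JSON::ParserError => e
  raise RepairError, "Internal error: repaired output is not valid JSON (#{e.message})"
end

puts parse_repaired('{"a":0.5}')['a'] # prints 0.5

begin
  parse_repaired('{"a":.5}') # a leading-dot number is not valid JSON
rescue RepairError => e
  puts e.class # prints RepairError
end
```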
data/lib/json/repairer.rb CHANGED
@@ -37,6 +37,12 @@ module JSON
37
37
  def repair
38
38
  parse_markdown_code_block(MARKDOWN_OPEN_BLOCKS)
39
39
 
40
+ # repair: skip a Markdown list marker before the root value
41
+ # (and any comments before it, which parse_value would otherwise
42
+ # only consume after the marker check has already failed)
43
+ parse_whitespace_and_skip_comments
44
+ skip_markdown_list_marker
45
+
40
46
  processed = parse_value
41
47
 
42
48
  throw_unexpected_end unless processed
@@ -46,7 +52,8 @@ module JSON
46
52
  processed_comma = parse_character(COMMA)
47
53
  parse_whitespace_and_skip_comments if processed_comma
48
54
 
49
- if start_of_value?(@json[@index]) && ends_with_comma_or_newline?(@output)
55
+ if (start_of_value?(@json[@index]) || markdown_list_marker_length) &&
56
+ ends_with_comma_or_newline?(@output)
50
57
  # start of a new value after end of the root level object: looks like
51
58
  # newline delimited JSON -> turn into a root level array
52
59
  unless processed_comma
@@ -170,6 +177,52 @@ module JSON
170
177
  false
171
178
  end
172
179
 
180
+ # Look ahead from @index for a Markdown list marker like "- ", "* ",
181
+ # "+ ", or "12. " that precedes a value. Returns the marker's length,
182
+ # or nil when there is no marker. Only consulted at the top level —
183
+ # the root value and each newline-delimited value — never inside
184
+ # nested structures. A marker must be followed by same-line
185
+ # whitespace and a value, so "-5", a trailing "- ", and "-\n{...}"
186
+ # keep their number readings. Ordered markers are capped at nine
187
+ # digits (the CommonMark limit) so long truncated decimals are not
188
+ # mistaken for markers. Divergence from upstream (no Markdown list
189
+ # handling as of v3.14.0): LLMs frequently emit JSON values as
190
+ # Markdown list items.
191
+ def markdown_list_marker_length
192
+ j = @index
193
+
194
+ if [MINUS, ASTERISK, PLUS].include?(@json[j])
195
+ j += 1
196
+ elsif digit?(@json[j])
197
+ j += 1 while digit?(@json[j]) && j - @index < 9
198
+ return nil unless [DOT, CLOSE_PARENTHESIS].include?(@json[j])
199
+
200
+ j += 1
201
+ else
202
+ return nil
203
+ end
204
+
205
+ marker_length = j - @index
206
+ return nil unless same_line_whitespace?(@json[j])
207
+
208
+ j += 1 while same_line_whitespace?(@json[j])
209
+ # a leading-dot number like ".5" is also a value here: parse_number
210
+ # repairs it to "0.5" even though start_of_value? does not match it
211
+ return nil unless start_of_value?(@json[j]) || @json[j] == DOT
212
+
213
+ marker_length
214
+ end
215
+
216
+ # Repair a value behind a Markdown list marker, like "- {"a":1}",
217
+ # by skipping the marker. See markdown_list_marker_length.
218
+ def skip_markdown_list_marker
219
+ length = markdown_list_marker_length
220
+ return false unless length
221
+
222
+ @index += length
223
+ true
224
+ end
225
+
173
226
  # Parse an object like '{"key": "value"}'
174
227
  def parse_object
175
228
  return false unless @json[@index] == OPENING_BRACE
@@ -241,6 +294,10 @@ module JSON
241
294
  end
242
295
  # :nocov:
243
296
  end
297
+
298
+ # repair: an object string value with unescaped quotes around a
299
+ # colon, like {"a": "b": "c"}
300
+ repair_doubled_colon if processed_value
244
301
  end
245
302
 
246
303
  if @json[@index] == CLOSING_BRACE
@@ -254,6 +311,59 @@ module JSON
254
311
  true
255
312
  end
256
313
 
314
+ # Repair an object value with unescaped quotes around a colon, like
315
+ # {"a": "b": "c"}, by merging it all into one string value: 'b": "c'
316
+ # (the unescaped-quotes reading of the input). Greedy: keeps merging
317
+ # while another `: "..."` follows. Only the string-colon-string
318
+ # shape is repaired; anything else falls through to the regular
319
+ # error paths. Divergence from upstream (which raises "Object key
320
+ # expected" as of v3.14.0), matching the Go and Python json-repair
321
+ # libraries on the canonical case.
322
+ def repair_doubled_colon
323
+ loop do
324
+ colon = @index
325
+ # :nocov: kept for symmetry with the start_quote scan below; unreachable
326
+ # because @index never rests on whitespace here. On first entry,
327
+ # parse_value ends with parse_whitespace_and_skip_comments. On greedy
328
+ # re-entry, every parse_string exit leaves @index off-whitespace: the
329
+ # EOF path (nil is not whitespace), the stop_at_index path (a
330
+ # prev_non_whitespace_index position), and the end-quote path (ends in
331
+ # parse_concatenated_string, whose leading whitespace skip consumes
332
+ # newlines too). If a future parse_value/parse_string change breaks
333
+ # that, this scan becomes live and the :nocov: will hide it.
334
+ colon += 1 while whitespace_or_special?(@json[colon])
335
+ # :nocov:
336
+ return unless @json[colon] == COLON
337
+
338
+ # scan past special whitespace too (unlike prev_non_whitespace_index):
339
+ # parse_whitespace treats NBSP and friends as whitespace, so this
340
+ # repair should as well. The value's last character (at worst the
341
+ # object's opening brace) always stops the scan before index 0.
342
+ end_quote = colon - 1
343
+ end_quote -= 1 while whitespace_or_special?(@json[end_quote])
344
+ return unless quote?(@json[end_quote])
345
+
346
+ start_quote = colon + 1
347
+ start_quote += 1 while whitespace_or_special?(@json[start_quote])
348
+ return unless quote?(@json[start_quote])
349
+
350
+ # repair: replace the end quote already emitted (plus any copied
351
+ # trailing whitespace) with the literal input span from that end
352
+ # quote through the next start quote, escaped as string content
353
+ @output = strip_last_occurrence(@output, '"', strip_remaining_text: true)
354
+ @json[end_quote..start_quote].each_char do |char|
355
+ @output << (char == DOUBLE_QUOTE ? '\"' : CONTROL_CHARACTERS.fetch(char, char))
356
+ end
357
+
358
+ # let parse_string consume the rest of the merged string, then
359
+ # drop the start quote it emits (already emitted escaped above)
360
+ @index = start_quote
361
+ start = @output.length
362
+ parse_string
363
+ @output = remove_at_index(@output, start, 1)
364
+ end
365
+ end
366
+
257
367
  def skip_character(char)
258
368
  if @json[@index] == char
259
369
  @index += 1
@@ -570,7 +680,9 @@ module JSON
570
680
  repair_number_ending_with_numeric_symbol(start)
571
681
  return true
572
682
  end
573
- unless digit?(@json[@index])
683
+ # also accept a dot so "-.5" continues into the fraction branch
684
+ # below (divergence from upstream, which leaves "-.5" unrepaired)
685
+ unless digit?(@json[@index]) || @json[@index] == DOT
574
686
  @index = start
575
687
  return false
576
688
  end
@@ -620,7 +732,7 @@ module JSON
620
732
  num = @json[start...@index]
621
733
  has_invalid_leading_zero = num.match?(/^0\d/)
622
734
 
623
- @output << (has_invalid_leading_zero ? "\"#{num}\"" : num)
735
+ @output << (has_invalid_leading_zero ? "\"#{num}\"" : repair_leading_dot_number(num))
624
736
  return true
625
737
  end
626
738
 
@@ -711,7 +823,18 @@ module JSON
711
823
  # repair numbers cut off at the end
712
824
  # this will only be called when we end after a '.', '-', or 'e' and does not
713
825
  # change the number more than it needs to make it valid JSON
714
- @output << "#{@json[start...@index]}0"
826
+ @output << repair_leading_dot_number("#{@json[start...@index]}0")
827
+ end
828
+
829
+ # Repair a number missing its digit before the decimal point, like ".5"
830
+ # or "-.5", into "0.5" / "-0.5". Divergence from upstream, which emits
831
+ # the invalid leading-dot number unchanged. The guard keeps the common
832
+ # case (a number that needs no repair) allocation-free; `sub` copies
833
+ # its receiver even when the pattern does not match.
834
+ def repair_leading_dot_number(num)
835
+ return num unless num.start_with?('.', '-.')
836
+
837
+ num.sub(/\A(?<sign>-?)\./, '\k<sign>0.')
715
838
  end
716
839
 
717
840
  # Parse and repair Newline Delimited JSON (NDJSON):
@@ -732,6 +855,10 @@ module JSON
732
855
  end
733
856
  end
734
857
 
858
+ # repair: skip a Markdown list marker before the next value
859
+ parse_whitespace_and_skip_comments
860
+ skip_markdown_list_marker
861
+
735
862
  processed_value = parse_value
736
863
  end
737
864
 
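The look-ahead in `markdown_list_marker_length` can be tried in isolation with a standalone sketch. It works on a plain string and index instead of the Repairer's `@json` / `@index`, and the predicates are simplified: `SAME_LINE_WHITESPACE` omits the special Unicode whitespace, and `START_OF_VALUE` is an assumed pattern, not the gem's `REGEX_START_OF_VALUE`.

```ruby
# Simplified, hypothetical stand-ins for the gem's character predicates.
SAME_LINE_WHITESPACE = [' ', "\t", "\r"].freeze
START_OF_VALUE = /\A[\[{\w"'-]\z/ # assumption: not the gem's exact pattern

def digit?(char)
  !char.nil? && char >= '0' && char <= '9'
end

def same_line_whitespace?(char)
  SAME_LINE_WHITESPACE.include?(char)
end

# Returns the length of a list marker starting at `index`, or nil when the
# text there is not a marker followed by same-line whitespace and a value.
def marker_length(text, index = 0)
  j = index
  if ['-', '*', '+'].include?(text[j])
    j += 1
  elsif digit?(text[j])
    # Ordered markers: at most nine digits, then "." or ")".
    j += 1 while digit?(text[j]) && j - index < 9
    return nil unless ['.', ')'].include?(text[j])

    j += 1
  else
    return nil
  end

  length = j - index
  return nil unless same_line_whitespace?(text[j])

  j += 1 while same_line_whitespace?(text[j])
  char = text[j]
  # A leading-dot number such as ".5" also counts as a value here.
  return nil unless char == '.' || (!char.nil? && START_OF_VALUE.match?(char))

  length
end

samples = ['- {"a": 1}', '12. [1]', '-5', '1.5', '- ', '* .5']
samples.each { |s| puts "#{s.inspect} => #{marker_length(s).inspect}" }
```

Under these assumptions the samples yield 1, 3, nil, nil, nil and 1 in that order: `-5` and `1.5` keep their number readings because no whitespace follows the would-be marker, and a trailing `- ` is rejected because no value follows it.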
@@ -1,9 +1,9 @@
1
1
  module JSON
2
2
  module Repair
3
3
  module StringUtils
4
- @output: untyped
4
+ @output: ::String
5
5
 
6
- @index: untyped
6
+ @index: ::Integer
7
7
 
8
8
  # Constants for character chars
9
9
  BACKSLASH: "\\"
@@ -24,17 +24,17 @@ module JSON
24
24
 
25
25
  CLOSE_PARENTHESIS: ")"
26
26
 
27
- SPACE: " "
27
+ SPACE: ::String
28
28
 
29
- NEWLINE: "\n"
29
+ NEWLINE: ::String
30
30
 
31
- TAB: "\t"
31
+ TAB: ::String
32
32
 
33
- RETURN: "\r"
33
+ RETURN: ::String
34
34
 
35
- BACKSPACE: "\b"
35
+ BACKSPACE: ::String
36
36
 
37
- FORM_FEED: "\f"
37
+ FORM_FEED: ::String
38
38
 
39
39
  DOUBLE_QUOTE: "\""
40
40
 
@@ -110,56 +110,64 @@ module JSON
 
  REGEX_FUNCTION_NAME_CHAR: ::Regexp
 
- # Functions to check character chars
- def hex?: (untyped char) -> untyped
+ # Functions to check character chars.
+ # `char` is `::String?` because every caller passes `@json[@index]`,
+ # which is `nil` past the end of input. The predicates either guard
+ # against `nil` explicitly or rely on `Array#include?` / `==` /
+ # `Regexp#match?` returning a safe value for `nil`.
+ def hex?: (::String? char) -> bool
 
- def digit?: (untyped char) -> untyped
+ def digit?: (::String? char) -> bool
 
- def valid_string_character?: (untyped char) -> untyped
+ def valid_string_character?: (::String char) -> bool
 
- def delimiter?: (untyped char) -> untyped
+ def delimiter?: (::String? char) -> bool
 
- def unquoted_string_delimiter?: (untyped char) -> untyped
+ def unquoted_string_delimiter?: (::String? char) -> bool
 
- def function_name_char_start?: (untyped char) -> untyped
+ def function_name_char_start?: (::String? char) -> bool
 
- def function_name_char?: (untyped char) -> untyped
+ def function_name_char?: (::String? char) -> bool
 
- def start_of_value?: (untyped char) -> untyped
+ def start_of_value?: (::String? char) -> bool
 
- def control_character?: (untyped char) -> untyped
+ def control_character?: (::String? char) -> bool
 
- def whitespace?: (untyped char) -> untyped
+ def whitespace?: (::String? char) -> bool
 
- def whitespace_except_newline?: (untyped char) -> untyped
+ def whitespace_except_newline?: (::String? char) -> bool
 
- def special_whitespace?: (untyped char) -> untyped
+ def special_whitespace?: (::String? char) -> bool
 
- def quote?: (untyped char) -> untyped
+ def same_line_whitespace?: (::String? char) -> bool
 
- def double_quote?: (untyped char) -> untyped
+ def whitespace_or_special?: (::String? char) -> bool
 
- def single_quote?: (untyped char) -> untyped
+ def quote?: (::String? char) -> bool
 
- def double_quote_like?: (untyped char) -> untyped
+ def double_quote?: (::String? char) -> bool
 
- def single_quote_like?: (untyped char) -> untyped
+ def single_quote?: (::String? char) -> bool
+
+ def double_quote_like?: (::String? char) -> bool
+
+ def single_quote_like?: (::String? char) -> bool
 
  # Strip last occurrence of text_to_strip from text
- def strip_last_occurrence: (untyped text, untyped text_to_strip, ?strip_remaining_text: bool) -> untyped
+ def strip_last_occurrence: (::String text, ::String text_to_strip, ?strip_remaining_text: bool) -> ::String
 
- def insert_before_last_whitespace: (untyped text, untyped text_to_insert) -> untyped
+ def insert_before_last_whitespace: (::String text, ::String text_to_insert) -> ::String
 
  # Parse keywords true, false, null
  # Repair Python keywords True, False, None
  # Repair Ruby keyword nil
- def parse_keywords: () -> untyped
+ def parse_keywords: () -> bool
 
- def parse_keyword: (untyped name, untyped value) -> (true | false)
+ def parse_keyword: (::String name, ::String value) -> bool
 
- def remove_at_index: (untyped text, untyped start, untyped count) -> untyped
+ def remove_at_index: (::String text, ::Integer start, ::Integer count) -> ::String
 
- def ends_with_comma_or_newline?: (untyped text) -> untyped
+ def ends_with_comma_or_newline?: (::String text) -> bool
  end
  end
  end
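The comment added above the predicates says they accept `::String?` because callers pass `@json[@index]`, which is `nil` past the end of input. The sketch below illustrates that idea only. The module name, constants, and method bodies are assumptions made for illustration, not the gem's actual implementation.

```ruby
# Hypothetical stand-in for the gem's string utilities. The constants and
# method bodies are assumptions; only the nil-handling idea comes from the diff.
module NilSafePredicates
  WHITESPACE_CHARS = [" ", "\n", "\t", "\r"].freeze
  DIGIT = /\A[0-9]\z/

  module_function

  # Array#include? returns false for nil, so no explicit guard is needed.
  def whitespace?(char)
    WHITESPACE_CHARS.include?(char)
  end

  # Regexp#match? returns false for nil instead of raising.
  def digit?(char)
    DIGIT.match?(char)
  end

  # Comparing nil with a String literal via == is simply false.
  def double_quote?(char)
    char == '"'
  end
end

json = '7 "'
# Index 3 is one past the last character, so String#[] returns nil there.
(0..3).each do |index|
  char = json[index]
  puts format("%-4s digit=%-5s whitespace=%-5s quote=%s",
              char.inspect,
              NilSafePredicates.digit?(char),
              NilSafePredicates.whitespace?(char),
              NilSafePredicates.double_quote?(char))
end
```

Because each predicate returns a plain boolean for `nil`, a parser loop written this way can test the current character without first checking whether the index is still inside the input.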
data/sig/json/repair.rbs CHANGED
@@ -1,4 +1,10 @@
  module JSON
+ # Recursive type for any `JSON.parse` result. Mirrors what stdlib's
+ # `JSON.parse` produces (and the JS upstream emits): scalars, arrays,
+ # and objects of the same. Used in place of `untyped` for the
+ # `return_objects: true` and internal `*_parse` paths.
+ type json_value = ::Hash[::String, json_value] | ::Array[json_value] | ::String | ::Integer | ::Float | bool | nil
+
  class JSONRepairError < StandardError
  attr_reader position: ::Integer?
 
@@ -9,13 +15,25 @@ module JSON
  VERSION: ::String
  end
 
+ interface _Readable
+ def read: () -> ::String?
+ end
+
  def self.repair: (::String json, return_objects: false, ?skip_json_loads: bool) -> ::String
- | (::String json, return_objects: true, ?skip_json_loads: bool) -> untyped
+ | (::String json, return_objects: true, ?skip_json_loads: bool) -> json_value
  | (::String json, ?skip_json_loads: bool) -> ::String
 
+ def self.repair_io: (_Readable io, return_objects: false, ?skip_json_loads: bool) -> ::String
+ | (_Readable io, return_objects: true, ?skip_json_loads: bool) -> json_value
+ | (_Readable io, ?skip_json_loads: bool) -> ::String
+
+ def self.repair_file: (::String | ::Pathname path, return_objects: false, ?skip_json_loads: bool) -> ::String
+ | (::String | ::Pathname path, return_objects: true, ?skip_json_loads: bool) -> json_value
+ | (::String | ::Pathname path, ?skip_json_loads: bool) -> ::String
+
  private
 
- def self.tolerant_parse: (::String json) -> untyped
+ def self.tolerant_parse: (::String json) -> json_value
 
- def self.repaired_parse: (::String json) -> untyped
+ def self.repaired_parse: (::String json) -> json_value
  end
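The signatures above introduce a recursive `json_value` alias and a duck-typed `_Readable` interface. As a rough illustration of what those two types describe at runtime, here is a minimal sketch using only the standard library. `json_value?` and `parse_readable` are hypothetical helpers, not part of the gem, and `parse_readable` parses strictly with stdlib `JSON` rather than repairing anything.

```ruby
require "json"
require "stringio"

# Runtime mirror of the recursive `json_value` alias: a Hash with String keys,
# an Array, or a scalar, checked recursively. Hypothetical helper.
def json_value?(value)
  case value
  when Hash then value.all? { |key, item| key.is_a?(String) && json_value?(item) }
  when Array then value.all? { |item| json_value?(item) }
  when String, Integer, Float, true, false, nil then true
  else false
  end
end

# Any object whose `read` returns a String or nil fits `_Readable`.
# This stand-in performs a strict stdlib parse and no repair (assumption).
def parse_readable(io)
  source = io.read
  return nil if source.nil? || source.strip.empty?

  JSON.parse(source)
end

result = parse_readable(StringIO.new('{"a": [1, 2.5, null, true]}'))
p result
p json_value?(result)     # parsed JSON satisfies the alias
p json_value?({ a: 1 })   # Symbol keys fall outside it
```

A `StringIO`, a `File`, or a socket would all satisfy the interface, which is presumably why the signature names a structural type instead of `::IO`.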
@@ -11,7 +11,7 @@ module JSON
  # `lib/json/repairer.rb`).
  @json: untyped
 
- @index: Integer
+ @index: ::Integer
 
  @output: ::String
 
@@ -31,25 +31,36 @@ module JSON
 
  private
 
- def parse_value: () -> untyped
+ def parse_value: () -> bool
 
- def parse_whitespace: (?skip_newline: bool) -> (true | false)
+ def parse_whitespace: (?skip_newline: bool) -> bool
 
- def parse_comment: () -> (true | false)
+ def parse_comment: () -> bool
 
  # Find and skip over a Markdown fenced code block
- def parse_markdown_code_block: (::Array[::String] blocks) -> (true | false)
+ def parse_markdown_code_block: (::Array[::String] blocks) -> bool
 
- def skip_markdown_code_block: (::Array[::String] blocks) -> (true | false)
+ def skip_markdown_code_block: (::Array[::String] blocks) -> bool
+
+ # Look ahead for a Markdown list marker like "- " or "12. " that
+ # precedes a value; returns the marker's length, or nil when there
+ # is no marker.
+ def markdown_list_marker_length: () -> ::Integer?
+
+ def skip_markdown_list_marker: () -> bool
 
  # Parse an object like '{"key": "value"}'
- def parse_object: () -> (false | true)
+ def parse_object: () -> bool
 
- def skip_character: (untyped char) -> (true | false)
+ # Repair an object value with unescaped quotes around a colon,
+ # like {"a": "b": "c"}, by merging it into one string value.
+ def repair_doubled_colon: () -> void
+
+ def skip_character: (::String char) -> bool
 
  # Skip ellipsis like "[1,2,3,...]" or "[1,2,3,...,9]" or "[...,7,8,9]"
  # or a similar construct in objects.
- def skip_ellipsis: () -> untyped
+ def skip_ellipsis: () -> void
 
  # Parse a string enclosed by double quotes "...". Can contain escaped quotes
  # Repair strings enclosed in single quotes or special quotes
@@ -62,51 +73,59 @@ module JSON
  # more conservative way, stopping the string at the first next delimiter
  # and fixing the string by inserting a quote there, or stopping at a
  # stop index detected in the first iteration.
- def parse_string: (?stop_at_delimiter: bool, ?stop_at_index: ::Integer) -> (untyped | true | false)
+ def parse_string: (?stop_at_delimiter: bool, ?stop_at_index: ::Integer) -> bool
 
  # Repair an unquoted string by adding quotes around it
  # Repair a MongoDB function call like NumberLong("2")
  # Repair a JSONP function call like callback({...});
- def parse_unquoted_string: (bool is_key) -> (false | true)
+ def parse_unquoted_string: (bool is_key) -> bool
 
  # Parse a regular expression literal like /foo/ or /foo\/bar/
- def parse_regex: () -> (false | true)
+ def parse_regex: () -> bool
 
- def parse_character: (untyped char) -> (true | false)
+ def parse_character: (::String char) -> bool
 
- def parse_whitespace_and_skip_comments: (?skip_newline: bool) -> untyped
+ def parse_whitespace_and_skip_comments: (?skip_newline: bool) -> bool
 
  # Parse a number like 2.4 or 2.4e6
- def parse_number: () -> (true | false)
+ def parse_number: () -> bool
 
- def at_end_of_number?: () -> untyped
+ def at_end_of_number?: () -> bool
 
  # Parse an array like '["item1", "item2", ...]'
- def parse_array: () -> (true | false)
+ def parse_array: () -> bool
 
- def prev_non_whitespace_index: (untyped start) -> untyped
+ def prev_non_whitespace_index: (::Integer start) -> ::Integer
 
  # Repair concatenated strings like "hello" + "world", change this into "helloworld"
- def parse_concatenated_string: () -> untyped
+ def parse_concatenated_string: () -> bool
+
+ def repair_number_ending_with_numeric_symbol: (::Integer start) -> void
 
- def repair_number_ending_with_numeric_symbol: (untyped start) -> untyped
+ # Repair a number missing its digit before the decimal point, like ".5"
+ # or "-.5", into "0.5" / "-0.5".
+ def repair_leading_dot_number: (::String num) -> ::String
 
  # Parse and repair Newline Delimited JSON (NDJSON):
  # multiple JSON objects separated by a newline character
- def parse_newline_delimited_json: () -> untyped
+ def parse_newline_delimited_json: () -> void
 
- def skip_escape_character: () -> untyped
+ def skip_escape_character: () -> bool
 
- def throw_invalid_character: (untyped char) -> untyped
+ # `bot` (bottom) because these always raise — steep needs this to
+ # treat their call sites as unreachable so methods like `repair`
+ # type-check (the trailing `throw_unexpected_character` must not
+ # contribute `void` to the method's union return type).
+ def throw_invalid_character: (::String char) -> bot
 
- def throw_unexpected_character: () -> untyped
+ def throw_unexpected_character: () -> bot
 
- def throw_unexpected_end: () -> untyped
+ def throw_unexpected_end: () -> bot
 
- def throw_object_key_expected: () -> untyped
+ def throw_object_key_expected: () -> bot
 
- def throw_colon_expected: () -> untyped
+ def throw_colon_expected: () -> bot
 
- def throw_invalid_unicode_character: () -> untyped
+ def throw_invalid_unicode_character: () -> bot
  end
  end
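Two of the newly declared helpers are described precisely enough in their comments to sketch: the leading-dot number repair (".5" to "0.5") and the Markdown list-marker look-ahead. The code below is a guess at the behaviour from those comments alone. In particular, the signature declares `markdown_list_marker_length` with no arguments (it presumably reads the parser's own state), so the `text` and `index` parameters here are assumptions, and the gem may recognise more marker forms than the two named in the comment.

```ruby
# Sketch of the leading-dot repair described in the signature comment:
# insert the missing "0" before the decimal point, keeping an optional minus.
def repair_leading_dot_number(num)
  num.sub(/\A(-?)\./) { "#{Regexp.last_match(1)}0." }
end

# Hypothetical look-ahead: length of a "- " or "12. " style marker starting at
# `index`, or nil when there is none. Parameters are assumptions; the real
# method takes no arguments.
def markdown_list_marker_length(text, index)
  rest = text[index..]
  return nil if rest.nil?

  marker = rest.match(/\A(?:-|\d+\.) /)
  marker && marker[0].length
end

puts repair_leading_dot_number(".5")
puts repair_leading_dot_number("-.5")
p markdown_list_marker_length('- {"a": 1}', 0)
p markdown_list_marker_length("12. [1]", 0)
p markdown_list_marker_length('{"a": 1}', 0)
```

Requiring a space after the marker keeps a negative number such as `-5` from being mistaken for a list item in this sketch.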
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: json-repair
  version: !ruby/object:Gem::Version
- version: 0.7.0
+ version: 0.11.0
  platform: ruby
  authors:
  - Aleksandr Zykov
@@ -9,7 +9,11 @@ bindir: exe
  cert_chain: []
  date: 1980-01-02 00:00:00.000000000 Z
  dependencies: []
- description: This is a simple gem that repairs broken JSON strings.
+ description: 'Repairs broken JSON: missing quotes and commas, unclosed brackets, trailing
+ commas, unquoted keys, single quotes, comments, Python constants, NDJSON, Markdown
+ code fences and list markers in LLM output, truncated documents, and more. A Ruby
+ port of the jsonrepair JavaScript library — useful whenever JSON from LLMs, APIs,
+ or logs does not strictly follow the standard.'
  email:
  - alexandrz@gmail.com
  executables:
@@ -44,6 +48,8 @@ metadata:
  homepage_uri: https://github.com/sashazykov/json-repair-rb
  source_code_uri: https://github.com/sashazykov/json-repair-rb
  changelog_uri: https://github.com/sashazykov/json-repair-rb/blob/main/CHANGELOG.md
+ documentation_uri: https://rubydoc.info/gems/json-repair
+ bug_tracker_uri: https://github.com/sashazykov/json-repair-rb/issues
  rdoc_options: []
  require_paths:
  - lib
@@ -60,5 +66,5 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements: []
  rubygems_version: 3.6.9
  specification_version: 4
- summary: Repairs broken JSON strings.
+ summary: Repair invalid or malformed JSON documents, including LLM output.
  test_files: []