multi_xml 0.8.1 → 0.9.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: c886733fad69eef555cb492092d63156cac53770594c3c98632a8b1a9449e8ba
-   data.tar.gz: 14987464ceb3c0512212aedb3416bcb31089ee44d205719def48018bde734284
+   metadata.gz: 35c36bec2ca00daa71d6eae5b1f388f5bdf3493d06ade5d6c8aeb54fb2f8762f
+   data.tar.gz: fe3a14804a79e8dd143ab381b5ef11abf917ce3a1653d6c780d99fa7d6b9dd44
  SHA512:
-   metadata.gz: 45e2c0b67a9e1c67d72b1132cd1ca429fb6466e67275f6a8df740b8a5c8eddbff3a4c51dff432998e0b4ee3fefb07f6ce9eb43f80ac192102d40e881c242cbb5
-   data.tar.gz: 40f700f6a29035d870d4714e7249a64e2b4e225a0fbd91970081a4e0a4ba6103dc86444a2b7902b7e67f2b4b8164277b365f3c1a4c210dd9b4ef0a735e51201d
+   metadata.gz: 5863c814199d20da08c88b6fc52b56186c2e4d0c2c9248d9678c04862dab09cb048473ab86c605a022add743e1b1be91682974cf21c192d8c3eeb98abcbc6011
+   data.tar.gz: 48fe9f604a468b59a01832187869b0bb118f5a8c63ccb2c8b51c27d2fe00b6bb7aedd3beeb384cc69ee5562b75e5196cb93ac37f79fd6fb68aba9ea82c692dbc
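The entries in checksums.yaml are hex digests of the gem's inner archives. A minimal sketch of how such digests could be recomputed with Ruby's `digest` standard library follows. The helper names `digests_for` and `checksum_matches?` are hypothetical, and the sketch works on a string of bytes rather than on the real `metadata.gz` or `data.tar.gz` files.

```ruby
require "digest"

# Hypothetical helper: returns the two digests that checksums.yaml records
# for one archive, given the archive's raw bytes.
def digests_for(bytes)
  {
    "SHA256" => Digest::SHA256.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: true when the recomputed digest equals the recorded one.
def checksum_matches?(bytes, algorithm, expected_hex)
  digests_for(bytes).fetch(algorithm) == expected_hex
end
```

In practice the bytes would come from something like `File.binread` on the extracted archive, and `expected_hex` from the matching checksums.yaml entry.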
data/.mutant.yml CHANGED
@@ -13,4 +13,9 @@ requires:

  matcher:
    subjects:
-     - MultiXml*
+     - MultiXML*
+   ignore:
+     # Platform-specific code path: only triggers on TruffleRuby. Mutant
+     # runs on MRI, so any mutation that touches the RUBY_ENGINE check
+     # produces the same observable behavior as the original.
+     - MultiXML::ParserResolution#skip_on_platform?
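The ignored subject guards an engine-specific branch. As a rough illustration of what a `RUBY_ENGINE` check of this kind can look like, the sketch below selects a parser ordering per engine, using the two orderings listed in this release's changelog. The constant and method names are hypothetical and are not the gem's internals.

```ruby
# Orderings as stated in the 0.9.0 changelog; names here are illustrative.
DEFAULT_ORDER = %i[ox libxml nokogiri oga rexml].freeze
TRUFFLERUBY_ORDER = %i[rexml libxml oga nokogiri].freeze

# Hypothetical helper: picks the ordering for the given engine string.
# Defaults to the engine the script is running on.
def preference_for(engine = RUBY_ENGINE)
  (engine == "truffleruby") ? TRUFFLERUBY_ORDER : DEFAULT_ORDER
end
```

Because the branch depends only on the engine string, a mutation run on MRI cannot observe the TruffleRuby path, which is consistent with the reason given in the ignore comment.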
data/CHANGELOG.md CHANGED
@@ -1,3 +1,29 @@
+ 0.9.0
+ -----
+ * Add `MultiXML.with_parser` for fiber-local scoped parser overrides, matching `MultiJSON.with_adapter`. The override lives in `Fiber[:multi_xml_parser]`, so concurrent fibers and threads each see their own parser without racing on a shared module variable; nested calls save and restore the previous value.
+ * Add `MultiXML.parse_options` / `MultiXML.parse_options=` for process-wide default options, matching `MultiJSON.parse_options`. Accepts a `Hash` or a callable (`Proc`/`lambda`); a callable receives the call-site hash as its sole positional argument so defaults can be computed per-call. Defaults merge between `DEFAULT_OPTIONS` and call-site overrides.
+ * Introduce `MultiXML::Parser` base module — built-in parsers declare their backend exception class via a `ParseError` constant, matching the `MultiJSON::Adapter` convention. Custom parsers can either extend `MultiXML::Parser` and define `ParseError` or keep defining a `.parse_error` method directly; both styles are accepted.
+ * Add `MultiXML::ParserLoadError`, raised when the parser spec is invalid, requiring the parser file raises `LoadError`, or the resolved parser doesn't satisfy the contract (must respond to `.parse` and define either a `ParseError` constant or a `.parse_error` method). Inherits from `ArgumentError` and carries the original exception's class name in its message, matching `MultiJSON::AdapterError`.
+ * Rename `MultiXml` constant to `MultiXML` (all caps), matching the style of `MultiJSON`. The old `MultiXml` constant continues to work but emits a one-time deprecation warning on first use and will be removed in v1.0.
+ * Add `MultiXML.load` as a deprecated alias for `MultiXML.parse`, matching the style of `MultiJSON.load` → `MultiJSON.parse`. Will be removed in v1.0.
+ * Rename the `:symbolize_keys` option to `:symbolize_names`, matching Ruby stdlib's `JSON.parse` and MultiJSON. The old option continues to work but emits a one-time deprecation warning; it will be removed in v1.0.
+ * [Add `:namespaces` option to `MultiXML.parse` for consistent namespace handling across parsers](https://github.com/sferik/multi_xml/issues/44) — two modes produce byte-identical output on every backend:
+   * `:strip` (default) — drop xmlns declarations and prefixes; keeps today's libxml/nokogiri output so most users see no change
+   * `:preserve` — keep source prefixes (e.g. `"atom:rel"`) and surface `xmlns` / `xmlns:*` declarations as attributes
+ * Fix REXML keeping attribute prefixes (`"gd:etag"`) while other backends stripped them ([#31](https://github.com/sferik/multi_xml/issues/31))
+ * Fix Ox prepending namespace prefixes to element names (`"aws:Item"`) when other backends didn't ([#30](https://github.com/sferik/multi_xml/issues/30))
+ * Handle namespaced attribute name collisions consistently across backends. When attributes with different prefixes strip to the same local name (e.g. `foo:id` and `bar:id` both becoming `id`), values are collected in an array in document order, with attribute values ahead of any colliding child elements. The libxml SAX parser falls back to its DOM backend in this case since the SAX callback drops attribute prefixes.
+ * Fix Ox mixed-content text aggregation in the SAX parser
+ * Raise `ArgumentError` on an unknown `:namespaces` mode
+ * `undasherize_keys` now runs only in `:strip` mode so prefixed keys aren't rewritten under `:preserve`
+ * Reorder `PARSER_PREFERENCE` so `oga` is tried before `rexml`, matching the throughput ranking in the bundled benchmark suite. Affects auto-detection only when neither `ox`, `libxml-ruby`, nor `nokogiri` is available; explicitly selecting a parser is unchanged.
+ * Use a TruffleRuby-specific `PARSER_PREFERENCE` ordering (`rexml`, `libxml`, `oga`, `nokogiri`) since TruffleRuby's JIT favors pure-Ruby parsers and penalizes FFI-bound ones. On other engines the default ordering is unchanged.
+ * Add a parser benchmark suite (`rake benchmark`) and document per-engine throughput rankings in the README. CI verifies that `PARSER_PREFERENCE` matches the benchmark ranking on MRI, JRuby, and TruffleRuby.
+ * Restore JRuby support (dropped in 0.8.0) and add TruffleRuby (native + JVM) to the CI matrix, matching the test coverage of MultiJSON. TruffleRuby is excluded from Windows runners since the `setup-ruby` action doesn't support it there.
+ * Add Ruby 4.0 to the CI matrix
+ * Support libxml-ruby 6.0.0 by switching from `require "libxml"` (removed in 6.0) to `require "libxml-ruby"`, which is present in both 5.x and 6.x
+ * Drop redundant `::Psych::SyntaxError` declaration from the RBS signature to fix a "Different superclasses are specified" type-checking error under rbs v4
+
  0.8.1
  -----
  * [Fix array unwrapping when elements contain nil](https://github.com/sferik/multi_xml/commit/09a875d832c45e2b567889398f45361ec9e36685)
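The first 0.9.0 entry describes a fiber-local override with save-and-restore on nesting. The sketch below shows that general pattern using fiber storage (`Fiber[]`, available from Ruby 3.2). It is not the gem's implementation: the module name, storage key, and default value are all hypothetical stand-ins.

```ruby
# Hypothetical stand-in module; only the scoping logic is illustrated.
module ScopedParserSketch
  KEY = :scoped_parser_sketch # assumed storage key, not the gem's key
  DEFAULT = :rexml            # assumed fallback for the sketch

  # Returns the override active in the current fiber, or the default.
  def self.current_parser
    Fiber[KEY] || DEFAULT
  end

  # Runs the block with `parser` active, then restores the previous value,
  # even if the block raises, so nested calls unwind correctly.
  def self.with_parser(parser)
    previous = Fiber[KEY]
    Fiber[KEY] = parser
    yield
  ensure
    Fiber[KEY] = previous
  end
end
```

Restoring in `ensure` rather than after the `yield` is what keeps an exception inside the block from leaking the override into later calls on the same fiber.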
data/Gemfile CHANGED
@@ -10,6 +10,7 @@ gem "minitest", ">= 6"
  gem "minitest-mock", ">= 5.27"
  gem "mutant-minitest", ">= 0.14.1"
  gem "rake", ">= 13.3.1"
+ gem "rbs", ">= 4.0.0", platforms: :ruby
  gem "rdoc", ">= 7.0.2"
  gem "rubocop", ">= 1.81.7"
  gem "rubocop-minitest", ">= 0.36"
@@ -18,7 +19,7 @@ gem "rubocop-rake", ">= 0.7.1"
  gem "simplecov", ">= 0.22"
  gem "standard", ">= 1.52"
  gem "standard-performance", ">= 1.9"
- gem "steep", ">= 1.10", platforms: :ruby
+ gem "steep", ">= 2", platforms: :ruby
  gem "yard", ">= 0.9.38"
  gem "yardstick", ">= 0.9.9"

data/README.md CHANGED
@@ -1,57 +1,175 @@
  # MultiXML

- A generic swappable back-end for XML parsing
+ [![Tests](https://github.com/sferik/multi_xml/actions/workflows/tests.yml/badge.svg)][tests]
+ [![Linter](https://github.com/sferik/multi_xml/actions/workflows/linter.yml/badge.svg)][linter]
+ [![Mutant](https://github.com/sferik/multi_xml/actions/workflows/mutant.yml/badge.svg)][mutant]
+ [![Typecheck](https://github.com/sferik/multi_xml/actions/workflows/typecheck.yml/badge.svg)][typecheck]
+ [![Docs](https://github.com/sferik/multi_xml/actions/workflows/docs.yml/badge.svg)][docs]
+ [![Gem Version](https://badge.fury.io/rb/multi_xml.svg)][gem]
+
+ Lots of Ruby libraries parse XML and everyone has their favorite XML parser.
+ Instead of choosing a single XML parser and forcing users of your library to
+ be stuck with it, you can use MultiXML instead, which will simply choose the
+ fastest available XML parser. Here's how to use it:

- ## Installation
- gem install multi_xml
+ ```ruby
+ require "multi_xml"

- ## Documentation
- [http://rdoc.info/gems/multi_xml][documentation]
+ MultiXML.parse("<tag>contents</tag>") #=> {"tag" => "contents"}
+ MultiXML.parse("<tag>contents</tag>", symbolize_names: true) #=> {tag: "contents"}
+ ```

- [documentation]: http://rdoc.info/gems/multi_xml
+ `MultiXML.parse` returns `{}` for empty and whitespace-only inputs instead of
+ raising, so a missing or blank payload is observable as an empty hash rather
+ than an exception. When parsing invalid XML, MultiXML will throw a
+ `MultiXML::ParseError`.

- ## Usage Examples
  ```ruby
- require 'multi_xml'
+ begin
+   MultiXML.parse("<open></close>")
+ rescue MultiXML::ParseError => exception
+   exception.xml #=> "<open></close>"
+   exception.cause #=> Nokogiri::XML::SyntaxError: ...
+ end
+ ```
+
+ ### Deprecated in 0.9.0
+
+ The module constant, the primary parse entry point, and the
+ symbolize-keys option were renamed to align MultiXML with MultiJSON
+ and Ruby stdlib `JSON.parse`. The old names still work in 0.x but
+ now emit a one-time deprecation warning; they will be removed in 1.0.

- MultiXml.parser = :ox
- MultiXml.parser = MultiXml::Parsers::Ox # Same as above
- MultiXml.parse('<tag>This is the contents</tag>') # Parsed using Ox
+ | Deprecated               | Use instead               |
+ | ------------------------ | ------------------------- |
+ | `MultiXml` (constant)    | `MultiXML` (all-caps)     |
+ | `MultiXML.load(xml)`     | `MultiXML.parse(xml)`     |
+ | `symbolize_keys:` option | `symbolize_names:` option |

- MultiXml.parser = :libxml
- MultiXml.parser = MultiXml::Parsers::Libxml # Same as above
- MultiXml.parse('<tag>This is the contents</tag>') # Parsed using LibXML
+ The `MultiXml` constant (CamelCase) continues to work as a thin
+ delegator; every method call, constant lookup, and rescue clause
+ routes through `MultiXML` transparently.

- MultiXml.parser = :nokogiri
- MultiXml.parser = MultiXml::Parsers::Nokogiri # Same as above
- MultiXml.parse('<tag>This is the contents</tag>') # Parsed using Nokogiri
+ `ParseError` instances expose `xml` and `cause` readers. `xml` contains the
+ input that caused the problem; `cause` contains the original exception raised
+ by the underlying parser.

- MultiXml.parser = :rexml
- MultiXml.parser = MultiXml::Parsers::Rexml # Same as above
- MultiXml.parse('<tag>This is the contents</tag>') # Parsed using REXML
+ ### Writing a custom parser

- MultiXml.parser = :oga
- MultiXml.parser = MultiXml::Parsers::Oga # Same as above
- MultiXml.parse('<tag>This is the contents</tag>') # Parsed using Oga
+ A custom parser is any class (or module) that responds to two class methods:
+
+ ```ruby
+ class MyParser
+   def self.parse(io, namespaces: :strip)
+     # parse the IO-like object into a Hash, raising ParseError on failure
+   end
+
+   def self.parse_error
+     MyParser::ParseError
+   end
+ end
+
+ MultiXML.parser = MyParser
  ```
- The `parser` setter takes either a symbol or a class (to allow for custom XML
- parsers) that responds to `.parse` at the class level.

- MultiXML tries to have intelligent defaulting. That is, if you have any of the
- supported parsers already loaded, it will use them before attempting to load
- a new one. When loading, libraries are ordered by speed: first Ox, then LibXML,
- then Nokogiri, and finally REXML.
+ `parse_error` is required: `MultiXML.parse` rescues `MyParser.parse_error`
+ to wrap parse failures in `MultiXML::ParseError`. The built-in parsers in
+ `lib/multi_xml/parsers/` are working examples.
+
+ MultiXML tries to have intelligent defaulting. If any supported library is
+ already loaded, MultiXML uses it before attempting to load others. When no
+ backend is preloaded, MultiXML walks its automatic preference list and uses the first
+ one that loads successfully:
+
+ 1. [`ox`][ox]
+ 2. [`libxml-ruby`][libxml-ruby]
+ 3. [`nokogiri`][nokogiri]
+ 4. [`oga`][oga]
+ 5. [`rexml`][rexml]
+
+ This is the library's built-in default selection order, not a guarantee that
+ the list is globally fastest for every workload. Real-world performance depends
+ on the document shape and the Ruby implementation, and the benchmark suite
+ below also measures SAX backends that are not part of automatic parser
+ detection. REXML is a Ruby default gem, so it's always available as a
+ last-resort fallback on any supported Ruby. If you have a workload where a
+ different backend is faster, set it explicitly with
+ `MultiXML.parser = :your_parser`.
+
+ ## Benchmarking Parsers
+
+ This repo includes a benchmark suite that compares every available built-in
+ backend across multiple XML shapes and sizes instead of relying on a single
+ synthetic document. The workloads cover:
+
+ - shallow and wide XML
+ - deeply nested XML
+ - record batches with repeated siblings
+ - attribute-dense elements
+ - mixed-content sections
+ - namespace-heavy feeds
+ - a large catalog-style document
+
+ Run the full benchmark with:
+
+ ```bash
+ bundle exec rake benchmark
+ ```
+
+ You can also run the script directly for shorter runs or Markdown-friendly
+ output:
+
+ ```bash
+ bundle exec ruby benchmark.rb --quick
+ bundle exec ruby benchmark.rb --format=markdown
+ ```
+
+ The output includes:
+
+ - a single best-overall parser based on the equal-weight geometric mean of
+   per-scenario relative throughput
+ - an overall ranking table for every parser
+ - a scenario matrix showing which parser won each workload
+ - an exclusions table when a parser crashes or produces mismatched output on a
+   valid workload
+
+ Allocation efficiency is reported as a secondary metric using allocated Ruby
+ objects per parse so ties on throughput are easier to interpret.
+
+ `PARSER_PREFERENCE` drives auto-detection (see "Configuration" above) and is
+ ordered fastest-first per the benchmark suite. CI re-runs the benchmark on
+ each supported runtime and fails if the observed ranking diverges from this
+ table:
+
+ | rank | CRuby/MRI  | JRuby      | TruffleRuby |
+ | ---- | ---------- | ---------- | ----------- |
+ | 1    | `ox`       | —          | —           |
+ | 2    | `libxml`   | —          | `rexml`     |
+ | 3    | `nokogiri` | `nokogiri` | `libxml`    |
+ | 4    | `oga`      | —          | `oga`       |
+ | 5    | `rexml`    | `rexml`    | `nokogiri`  |
+
+ A dash means the parser isn't usable on that runtime. `ox` has no JRuby
+ build and is filtered out of TruffleRuby auto-detection (its SAX callbacks
+ miscompile under the JIT after warmup); `libxml-ruby` has no JRuby build;
+ `oga` 3.x crashes on JRuby 10 (its precompiled Java backend was built
+ against an older JRuby API). TruffleRuby's JIT inverts the FFI-vs-pure-Ruby
+ tradeoff for the remaining backends, so `rexml` rises to the top and
+ `nokogiri` falls to last.

  ## Supported Ruby Versions
- This library aims to support and is tested against the following Ruby
+
+ This library aims to support and is [tested against](https://github.com/sferik/multi_xml/actions/workflows/tests.yml) the following Ruby
  implementations:

- * 3.2
- * 3.3
- * 3.4
- * 4.0
+ - Ruby 3.2
+ - Ruby 3.3
+ - Ruby 3.4
+ - Ruby 4.0
+ - [JRuby][jruby] 10.0 (targets Ruby 3.4 compatibility)
+ - [TruffleRuby][truffleruby] 33.0 (native and JVM)

- If something doesn't work on one of these versions, it's a bug.
+ If something doesn't work in one of these implementations, it's a bug.

  This library may inadvertently work (or seem to work) on other Ruby
  implementations, however support will only be provided for the versions listed
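The README's benchmarking section says the best-overall parser is chosen by the equal-weight geometric mean of per-scenario relative throughput. The sketch below is one way to compute such a score. It assumes "relative" means "divided by the best rate in that scenario", which the README does not spell out, and the scenario names, parser names, and numbers are illustrative, not measured results.

```ruby
# Illustrative numbers only (parses per second); not measured results.
SAMPLE_THROUGHPUT = {
  "wide" => {parser_a: 200.0, parser_b: 100.0},
  "nested" => {parser_a: 80.0, parser_b: 100.0}
}.freeze

# Assumed definition: each parser's rate divided by the scenario's best rate.
def relative_throughput(scenarios)
  scenarios.transform_values do |rates|
    best = rates.values.max
    rates.transform_values { |rate| rate / best }
  end
end

# Equal-weight geometric mean per parser, computed through logarithms.
def geometric_mean_scores(scenarios)
  relative = relative_throughput(scenarios)
  parsers = relative.values.flat_map(&:keys).uniq
  parsers.to_h do |parser|
    ratios = relative.values.map { |rates| rates.fetch(parser) }
    [parser, Math.exp(ratios.sum { |ratio| Math.log(ratio) } / ratios.size)]
  end
end
```

A geometric mean keeps one very fast scenario from dominating the score the way an arithmetic mean of raw throughput would, which is a plausible reason for the choice.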
@@ -64,12 +182,38 @@ implementation, you will be responsible for providing patches in a timely
  fashion. If critical issues for a particular implementation exist at the time
  of a major release, support for that Ruby version may be dropped.

- ## Inspiration
- MultiXML was inspired by [MultiJSON][].
+ ## Versioning

- [multijson]: https://github.com/intridea/multi_json/
+ This library aims to adhere to [Semantic Versioning 2.0.0][semver]. Violations
+ of this scheme should be reported as bugs. Specifically, if a minor or patch
+ version is released that breaks backward compatibility, that version should be
+ immediately yanked and/or a new version should be immediately released that
+ restores compatibility. Breaking changes to the public API will only be
+ introduced with new major versions. As a result of this policy, you can (and
+ should) specify a dependency on this gem using the [Pessimistic Version
+ Constraint][pvc] with two digits of precision. For example:
+
+ ```ruby
+ spec.add_dependency "multi_xml", "~> 0.9"
+ ```

  ## Copyright
- Copyright (c) 2010-2025 Erik Berlin. See [LICENSE][] for details.

+ Copyright (c) 2010-2026 Erik Berlin. See [LICENSE][license] for details.
+
+ [docs]: https://github.com/sferik/multi_xml/actions/workflows/docs.yml
+ [gem]: https://rubygems.org/gems/multi_xml
+ [jruby]: http://www.jruby.org/
+ [libxml-ruby]: https://github.com/xml4r/libxml-ruby
  [license]: LICENSE.md
+ [linter]: https://github.com/sferik/multi_xml/actions/workflows/linter.yml
+ [mutant]: https://github.com/sferik/multi_xml/actions/workflows/mutant.yml
+ [nokogiri]: https://nokogiri.org/
+ [oga]: https://gitlab.com/yorickpeterse/oga
+ [ox]: https://github.com/ohler55/ox
+ [pvc]: http://docs.rubygems.org/read/chapter/16#page74
+ [rexml]: https://github.com/ruby/rexml
+ [semver]: http://semver.org/
+ [tests]: https://github.com/sferik/multi_xml/actions/workflows/tests.yml
+ [truffleruby]: https://www.graalvm.org/ruby/
+ [typecheck]: https://github.com/sferik/multi_xml/actions/workflows/typecheck.yml
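The new Versioning section recommends the pessimistic constraint `"~> 0.9"`. As a quick check of what that constraint admits, the sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper name `allowed_by_constraint?` is hypothetical.

```ruby
require "rubygems"

# The constraint recommended in the README's Versioning section.
MULTI_XML_REQUIREMENT = Gem::Requirement.new("~> 0.9")

# Hypothetical helper: true when the given version string satisfies "~> 0.9",
# which is equivalent to ">= 0.9" and "< 1.0".
def allowed_by_constraint?(version)
  MULTI_XML_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end
```

With two digits of precision the constraint accepts later 0.x releases such as 0.10.2 but rejects both 0.8.1 and 1.0.0, which lines up with the changelog's plan to remove the deprecated names in 1.0.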
data/Rakefile CHANGED
@@ -1,4 +1,5 @@
  require "bundler/gem_tasks"
+ require "shellwords"

  # Override release task to skip gem push (handled by GitHub Actions with attestations)
  Rake::Task["release"].clear
@@ -45,6 +46,12 @@ end
  desc "Run linters"
  task lint: %i[rubocop standard]

+ desc "Benchmark available XML parsers across representative workloads"
+ task :benchmark do
+   args = ENV["BENCHMARK_ARGS"] ? Shellwords.split(ENV["BENCHMARK_ARGS"]) : []
+   sh Gem.ruby, "benchmark.rb", *args
+ end
+
  # Mutant uses fork() which is not available on Windows or JRuby
  desc "Run mutation testing"
  task :mutant do
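The new `benchmark` task splits `BENCHMARK_ARGS` with `Shellwords.split` before passing the pieces to `sh` as separate arguments. The sketch below isolates that step so the splitting behavior is easy to see. The helper name `benchmark_args` is hypothetical, and `--label` is a made-up flag used only to show quoting; `--quick` and `--format=markdown` are the flags shown in the README.

```ruby
require "shellwords"

# Mirrors the task's argument handling: nil means "no extra arguments".
def benchmark_args(raw)
  raw ? Shellwords.split(raw) : []
end

benchmark_args(nil)                         # => []
benchmark_args("--quick --format=markdown") # => ["--quick", "--format=markdown"]
benchmark_args('--label "two words"')       # => ["--label", "two words"]
```

Passing the split words to `sh` as a list, rather than interpolating the raw string into one command line, means the values reach the script as literal arguments without a second round of shell parsing.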
data/Steepfile CHANGED
@@ -18,5 +18,12 @@ target :lib do
    library "bigdecimal"
    library "stringio"

-   configure_code_diagnostics(D::Ruby.strict)
+   configure_code_diagnostics(D::Ruby.strict) do |hash|
+     # The fiber-local Fiber[] reader returns untyped — intentional
+     # for with_parser's previous_override save/restore.
+     hash[D::Ruby::FallbackAny] = :hint
+     # set_backtrace has three overloads and Steep can't pick one when
+     # given an `(Array[String] | nil)` union from `cause.backtrace`.
+     hash[D::Ruby::UnresolvedOverloading] = :hint
+   end
  end
@@ -0,0 +1,5 @@
+ #!/usr/bin/env ruby
+
+ require_relative "../benchmark"
+
+ exit MultiXMLBenchmark.run(ARGV)