moxml 0.1.14 → 0.1.16

Files changed (80)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +117 -66
  3. data/Gemfile +1 -0
  4. data/README.adoc +11 -9
  5. data/Rakefile +34 -1
  6. data/TODO.remaining/1-entity-reference-adapter-support.md +157 -0
  7. data/TODO.remaining/2-entity-restoration-model-driven.md +169 -0
  8. data/TODO.remaining/3-entity-reference-test-coverage.md +170 -0
  9. data/TODO.remaining/4-lenient-entities-mode.md +106 -0
  10. data/TODO.remaining/5-fixture-integrity.md +65 -0
  11. data/TODO.remaining/6-ox-element-ordering-bug.md +36 -0
  12. data/TODO.remaining/7-headed-ox-limitations.md +95 -0
  13. data/TODO.remaining/8-xpath-predicate-gaps.md +68 -0
  14. data/TODO.remaining/9-cleanup-hygiene.md +42 -0
  15. data/TODO.remaining/README.md +54 -0
  16. data/benchmarks/generate_report.rb +1 -1
  17. data/docs/_pages/configuration.adoc +22 -19
  18. data/docs/_tutorials/namespace-handling.adoc +5 -5
  19. data/lib/moxml/adapter/base.rb +22 -3
  20. data/lib/moxml/adapter/customized_libxml/declaration.rb +1 -1
  21. data/lib/moxml/adapter/customized_libxml/entity_reference.rb +23 -0
  22. data/lib/moxml/adapter/customized_libxml.rb +18 -0
  23. data/lib/moxml/adapter/customized_oga.rb +10 -0
  24. data/lib/moxml/adapter/customized_ox/entity_reference.rb +25 -0
  25. data/lib/moxml/adapter/customized_ox.rb +12 -0
  26. data/lib/moxml/adapter/customized_rexml/entity_reference.rb +19 -0
  27. data/lib/moxml/adapter/customized_rexml/formatter.rb +44 -20
  28. data/lib/moxml/adapter/customized_rexml.rb +11 -0
  29. data/lib/moxml/adapter/headed_ox.rb +37 -14
  30. data/lib/moxml/adapter/libxml.rb +233 -119
  31. data/lib/moxml/adapter/nokogiri.rb +22 -11
  32. data/lib/moxml/adapter/oga.rb +64 -25
  33. data/lib/moxml/adapter/ox.rb +198 -42
  34. data/lib/moxml/adapter/rexml.rb +64 -13
  35. data/lib/moxml/attribute.rb +3 -0
  36. data/lib/moxml/builder.rb +78 -24
  37. data/lib/moxml/config.rb +24 -7
  38. data/lib/moxml/declaration.rb +4 -2
  39. data/lib/moxml/document.rb +8 -1
  40. data/lib/moxml/document_builder.rb +44 -37
  41. data/lib/moxml/element.rb +18 -5
  42. data/lib/moxml/entity_registry.rb +51 -1
  43. data/lib/moxml/native_attachment.rb +65 -0
  44. data/lib/moxml/node.rb +39 -50
  45. data/lib/moxml/node_set.rb +43 -15
  46. data/lib/moxml/version.rb +1 -1
  47. data/lib/moxml/xml_utils.rb +1 -1
  48. data/lib/moxml/xpath/compiler.rb +4 -1
  49. data/lib/moxml.rb +1 -0
  50. data/scripts/format_xml.rb +16 -0
  51. data/scripts/pretty_format_xml.rb +14 -0
  52. data/spec/consistency/round_trip_spec.rb +3 -30
  53. data/spec/integration/all_adapters_spec.rb +1 -0
  54. data/spec/integration/headed_ox_integration_spec.rb +0 -2
  55. data/spec/integration/shared_examples/edge_cases.rb +7 -4
  56. data/spec/integration/shared_examples/integration_workflows.rb +3 -3
  57. data/spec/integration/shared_examples/node_wrappers/cdata_behavior.rb +1 -1
  58. data/spec/integration/shared_examples/node_wrappers/entity_reference_behavior.rb +224 -0
  59. data/spec/integration/shared_examples/node_wrappers/node_behavior.rb +1 -1
  60. data/spec/moxml/adapter/headed_ox_spec.rb +8 -8
  61. data/spec/moxml/adapter/oga_spec.rb +46 -0
  62. data/spec/moxml/adapter/shared_examples/adapter_contract.rb +1 -12
  63. data/spec/moxml/allocation_benchmark_spec.rb +96 -0
  64. data/spec/moxml/allocation_guard_spec.rb +282 -0
  65. data/spec/moxml/builder_spec.rb +256 -0
  66. data/spec/moxml/config_spec.rb +11 -11
  67. data/spec/moxml/doctype_spec.rb +41 -0
  68. data/spec/moxml/lazy_parse_spec.rb +115 -0
  69. data/spec/moxml/namespace_uri_validation_spec.rb +11 -3
  70. data/spec/moxml/node_cache_spec.rb +110 -0
  71. data/spec/moxml/node_set_cache_spec.rb +90 -0
  72. data/spec/moxml/xml_utils_spec.rb +32 -0
  73. data/spec/moxml/xpath/axes_spec.rb +1 -1
  74. data/spec/moxml/xpath/compiler_spec.rb +2 -2
  75. data/spec/moxml/xpath/functions/position_functions_spec.rb +5 -5
  76. data/spec/moxml/xpath/functions/special_functions_spec.rb +1 -1
  77. data/spec/performance/memory_usage_spec.rb +0 -4
  78. data/spec/support/allocation_helper.rb +165 -0
  79. data/spec/support/w3c_namespace_helpers.rb +2 -1
  80. metadata +29 -2
@@ -0,0 +1,169 @@
1
+ # TODO 2: Model-Driven Entity Restoration
2
+
3
+ ## Problem
4
+
5
+ The `restore_entities` feature in `DocumentBuilder` is hardcoded to only handle
6
+ the 5 standard XML entities (amp, lt, gt, quot, apos). It ignores the
7
+ EntityRegistry entirely — despite EntityRegistry knowing 2125+ entities from
8
+ the W3C HTML/MathML set. This means non-standard entities like `&nbsp;`,
9
+ `&copy;`, `&mdash;` are never restored, which is the core round-trip problem
10
+ that motivated the entire entity feature.
11
+
12
+ Additionally, the restoration logic lives in DocumentBuilder with hardcoded
13
+ knowledge that belongs in the model layer.
14
+
15
+ ## Current State (verified)
16
+
17
+ `lib/moxml/document_builder.rb:80-110` — `restore_entities_in_text`:
18
+ ```ruby
19
+ entity_chars = {
20
+ "<" => "lt", ">" => "gt", "&" => "amp",
21
+ '"' => "quot", "'" => "apos",
22
+ }
23
+ ```
24
+
25
+ This is a hardcoded lookup that duplicates knowledge already in EntityRegistry.
26
+ It only triggers for characters `<`, `>`, `&`, `"`, `'` — the regex guard
27
+ `/[<>&"']/` on line 73 prevents it from ever seeing characters like U+00A0
28
+ (non-breaking space, `&nbsp;`).
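
Illustrative sketch (not part of the gem): the guard's effect can be checked in isolation with plain Ruby. The constant `GUARD` and the helper `guard_matches?` are names made up for this example; only the character class itself comes from the text above.

```ruby
# Hypothetical constant; mirrors the character class quoted above.
GUARD = /[<>&"']/

# True when the text contains one of the five predefined-entity characters.
def guard_matches?(text)
  GUARD.match?(text)
end

puts guard_matches?("a & b")     # true
puts guard_matches?("a\u00A0b")  # false
```

Because U+00A0 is outside the character class, text containing only a non-breaking space never reaches the restoration code.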
29
+
30
+ **Critical**: Because only Nokogiri has `create_native_entity_reference`
31
+ (see TODO 1), `restore_entities` raises `NotImplementedError` on all other
32
+ adapters even for the 5 standard entities.
33
+
34
+ ## XML Entity Model
35
+
36
+ XML has a clear entity model:
37
+
38
+ 1. **5 predefined entities** (amp, lt, gt, quot, apos) — always available per
39
+ XML spec. These characters MUST be entity-encoded in certain contexts
40
+ (e.g., `<` and `&` in text content).
41
+
42
+ 2. **DTD-declared entities** — declared via `<!ENTITY name "value">` in the
43
+ document's DOCTYPE internal subset or external subset.
44
+
45
+ 3. **API-supplied entities** — registered by the user via
46
+ `EntityRegistry.register` or `entity_provider` callback.
47
+
48
+ 4. **Bundled detection set** — the W3C HTML/MathML entities bundled in
49
+ `data/w3c_entities.json`. These are not "declared" in any DTD but are
50
+ recognized by Moxml for restoration purposes.
51
+
52
+ The EntityRegistry already knows about categories 1, 3, and 4. Category 2
53
+ (DTD parsing) is future work.
54
+
55
+ ## Design: Model-Driven Restoration
56
+
57
+ EntityRegistry should be THE source of truth for "should this character become
58
+ an entity reference?" The restoration policy should be:
59
+
60
+ ```ruby
61
+ # In EntityRegistry (or a cooperating policy object)
62
+ STANDARD_CODEPOINTS = [0x26, 0x3C, 0x3E, 0x22, 0x27].freeze # amp, lt, gt, quot, apos
63
+
64
+ def should_restore?(codepoint, config:)
65
+ name = primary_name_for_codepoint(codepoint)
66
+ return false unless name
67
+
68
+ # 1. The 5 standard XML entities are ALWAYS restored.
69
+ # These are syntactically required — the XML wouldn't be well-formed
70
+ # without encoding them.
71
+ return true if STANDARD_CODEPOINTS.include?(codepoint)
72
+
73
+ # 2. Non-standard entities: only if restore_entities is enabled.
74
+ return false unless config.restore_entities
75
+
76
+ # 3. In the future, strict vs lenient mode will gate this further.
77
+ # Strict: only if declared in DTD (not yet implemented).
78
+ # Lenient: any known entity name.
79
+ true
80
+ end
81
+ ```
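
Illustrative sketch (not part of the gem): the policy above can be exercised without Moxml by standing in for the registry and config. `MiniRegistry` and `PolicyConfig` are hypothetical stand-ins; the real `EntityRegistry` and `Config` classes may differ.

```ruby
# Hypothetical stand-in for Moxml::Config; only the one flag is modeled.
PolicyConfig = Struct.new(:restore_entities)

# Hypothetical stand-in for EntityRegistry, backed by a codepoint => name hash.
class MiniRegistry
  STANDARD_CODEPOINTS = [0x26, 0x3C, 0x3E, 0x22, 0x27].freeze # amp, lt, gt, quot, apos

  def initialize(names_by_codepoint)
    @names = names_by_codepoint
  end

  def primary_name_for_codepoint(codepoint)
    @names[codepoint]
  end

  def should_restore?(codepoint, config:)
    # Unknown characters are never restored.
    return false unless primary_name_for_codepoint(codepoint)
    # The five predefined entities do not depend on the flag.
    return true if STANDARD_CODEPOINTS.include?(codepoint)

    config.restore_entities ? true : false
  end
end

registry = MiniRegistry.new({ 0x26 => "amp", 0xA0 => "nbsp" })
puts registry.should_restore?(0xA0, config: PolicyConfig.new(false)) # false
puts registry.should_restore?(0xA0, config: PolicyConfig.new(true))  # true
```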
82
+
83
+ ### Changes to DocumentBuilder
84
+
85
+ Replace the hardcoded hash with delegation to the registry:
86
+
87
+ ```ruby
88
+ def visit_text(node)
89
+ prepared = adapter.prepare_for_new_document(node, @current_doc.native)
90
+ content = adapter.text_content(node)
91
+
92
+ if should_restore_entities?(content)
93
+ restore_entities_in_text(content)
94
+ else
95
+ @node_stack.last&.add_child(Text.new(prepared, context))
96
+ end
97
+ end
98
+
99
+ private
100
+
101
+ def should_restore_entities?(content)
102
+ return false unless context.config.restore_entities
103
+ # Scan for any character that the registry knows about
104
+ content.to_s.chars.any? { |c| context.entity_registry.should_restore?(c.ord, config: context.config) }
105
+ end
106
+
107
+ def restore_entities_in_text(content)
108
+ parent = @node_stack.last
109
+ return unless parent
110
+
111
+ content.to_s.chars.each do |char|
112
+ codepoint = char.ord
113
+ name = context.entity_registry.primary_name_for_codepoint(codepoint)
114
+
115
+ if context.entity_registry.should_restore?(codepoint, config: context.config)
116
+ entity_node = adapter.create_entity_reference(name)
117
+ parent.add_child(EntityReference.new(entity_node, context))
118
+ else
119
+ text_node = adapter.create_text(char)
120
+ parent.add_child(Text.new(text_node, context))
121
+ end
122
+ end
123
+ end
124
+ ```
125
+
126
+ **Note**: This splits each text node into per-character nodes. For documents
127
+ with few entity references, this creates unnecessary overhead. A future
128
+ optimization should buffer consecutive non-entity characters into a single
129
+ text node.
130
+
131
+ ### Performance Optimization (deferred)
132
+
133
+ Instead of character-by-character processing:
134
+ 1. Scan the text for characters that have entity names in the registry
135
+ 2. Split only at those positions, keeping runs of plain characters together
136
+ 3. This reduces node count dramatically for typical documents
137
+
138
+ ```ruby
139
+ def restore_entities_in_text(content)
140
+ parent = @node_stack.last
141
+ return unless parent
142
+
143
+ buffer = +""
144
+ content.to_s.chars.each do |char|
145
+ codepoint = char.ord
146
+ name = context.entity_registry.primary_name_for_codepoint(codepoint)
147
+
148
+ if name && context.entity_registry.should_restore?(codepoint, config: context.config)
149
+ # Flush buffer before entity
150
+ if !buffer.empty?
151
+ parent.add_child(Text.new(adapter.create_text(buffer), context))
152
+ buffer.clear
153
+ end
154
+ parent.add_child(EntityReference.new(adapter.create_entity_reference(name), context))
155
+ else
156
+ buffer << char
157
+ end
158
+ end
159
+ # Flush remaining buffer
160
+ if !buffer.empty?
161
+ parent.add_child(Text.new(adapter.create_text(buffer), context))
162
+ end
163
+ end
164
+ ```
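
Illustrative sketch (not part of the gem): the buffering idea can be tried without an adapter by returning tagged segments instead of DOM nodes. `ENTITY_NAMES` and `split_on_entities` are hypothetical names, and the three-entry hash merely stands in for the registry lookup.

```ruby
# Hypothetical stand-in for the registry's codepoint => name lookup.
ENTITY_NAMES = { 0xA0 => "nbsp", 0xA9 => "copy", 0x2014 => "mdash" }.freeze

# Returns [:text, "..."] and [:entity, "name"] pairs, keeping runs of
# plain characters together in a single :text segment.
def split_on_entities(content, names = ENTITY_NAMES)
  segments = []
  buffer = +""
  content.each_char do |char|
    name = names[char.ord]
    if name
      # Flush the pending plain-text run before emitting the entity.
      segments << [:text, buffer.dup] unless buffer.empty?
      buffer.clear
      segments << [:entity, name]
    else
      buffer << char
    end
  end
  segments << [:text, buffer] unless buffer.empty?
  segments
end

p split_on_entities("a\u00A0b \u00A9")
# [[:text, "a"], [:entity, "nbsp"], [:text, "b "], [:entity, "copy"]]
```

A string of N characters with k entity characters yields at most 2k + 1 segments, rather than N per-character nodes.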
165
+
166
+ ## Files to Modify
167
+
168
+ - `lib/moxml/entity_registry.rb` — add `should_restore?` method
169
+ - `lib/moxml/document_builder.rb` — replace hardcoded entity_chars with registry-driven logic
@@ -0,0 +1,170 @@
1
+ # TODO 3: EntityReference Test Coverage
2
+
3
+ ## Problem
4
+
5
+ There are zero tests for EntityReference node behavior, zero tests for
6
+ entity round-trip preservation, and zero adapter-level tests for entity
7
+ reference creation or serialization. Only `EntityRegistry` has tests
8
+ (`spec/moxml/entity_registry_spec.rb`).
9
+
10
+ This means the entire EntityReference feature is untested — including the
11
+ `restore_entities` config, `create_entity_reference` factory, `visit_entity_reference`
12
+ in DocumentBuilder, and the `entity_reference` Builder DSL method.
13
+
14
+ ## Required Test Coverage
15
+
16
+ ### 1. EntityReference Node Tests
17
+
18
+ **File**: `spec/moxml/entity_reference_spec.rb`
19
+
20
+ ```ruby
21
+ RSpec.describe Moxml::EntityReference do
22
+ # Test per adapter (use shared examples)
23
+ %i[nokogiri].each do |adapter| # expand as adapters gain support
24
+ context "with #{adapter} adapter" do
25
+ let(:ctx) { Moxml.new(adapter) }
26
+
27
+ it "creates an entity reference node" do
28
+ doc = ctx.create_document
29
+ ref = doc.create_entity_reference("nbsp")
30
+ expect(ref).to be_a(Moxml::EntityReference)
31
+ expect(ref.name).to eq("nbsp")
32
+ end
33
+
34
+ it "has empty text content" do
35
+ doc = ctx.create_document
36
+ ref = doc.create_entity_reference("amp")
37
+ expect(ref.text).to eq("")
38
+ expect(ref.content).to eq("")
39
+ end
40
+
41
+ it "serializes to entity syntax" do
42
+ doc = ctx.create_document
43
+ ref = doc.create_entity_reference("mdash")
44
+ expect(ref.to_xml).to eq("&mdash;")
45
+ end
46
+
47
+ it "is recognized as entity_reference type" do
48
+ doc = ctx.create_document
49
+ ref = doc.create_entity_reference("copy")
50
+ expect(ref.entity_reference?).to be true
51
+ end
52
+
53
+ it "survives add_child and retrieval" do
54
+ doc = ctx.create_document
55
+ root = doc.create_element("p")
56
+ doc.root = root
57
+ ref = doc.create_entity_reference("nbsp")
58
+ root.add_child(ref)
59
+ expect(root.children.first).to be_a(Moxml::EntityReference)
60
+ expect(root.children.first.name).to eq("nbsp")
61
+ end
62
+
63
+ it "validates entity reference name" do
64
+ doc = ctx.create_document
65
+ expect {
66
+ doc.create_entity_reference("123invalid")
67
+ }.to raise_error(Moxml::ValidationError)
68
+ end
69
+ end
70
+ end
71
+ end
72
+ ```
73
+
74
+ ### 2. Builder DSL Tests
75
+
76
+ **File**: `spec/moxml/builder_spec.rb` (add to existing or create new section)
77
+
78
+ ```ruby
79
+ it "creates entity references via DSL" do
80
+ doc = Moxml::Builder.new(ctx).build do
81
+ element("p") { entity_reference("nbsp") }
82
+ end
83
+ expect(doc.root.children.first).to be_a(Moxml::EntityReference)
84
+ expect(doc.to_xml).to include("&nbsp;")
85
+ end
86
+ ```
87
+
88
+ ### 3. Restore Entities Integration Tests
89
+
90
+ **File**: `spec/moxml/adapter/entity_restoration_spec.rb` (shared examples)
91
+
92
+ ```ruby
93
+ RSpec.shared_examples "entity restoration" do |adapter_name|
94
+ context "with #{adapter_name}" do
95
+ let(:ctx) { Moxml.new(adapter_name, restore_entities: true) }
96
+
97
+ it "restores standard XML entities" do
98
+ doc = ctx.parse("<p>a &amp; b</p>")
99
+ output = doc.to_xml
100
+ expect(output).to include("&amp;")
101
+ end
102
+
103
+ it "restores non-standard entities from registry" do
104
+ # nbsp (U+00A0) is in the bundled W3C entity set
105
+ doc = ctx.parse("<p>\u00A0</p>")
106
+ output = doc.to_xml
107
+ expect(output).to include("&nbsp;")
108
+ end
109
+
110
+ it "preserves entity syntax through round-trip" do
111
+ doc = ctx.parse("<p>&nbsp;&copy;&mdash;</p>")
112
+ output = doc.to_xml
113
+ reparsed = ctx.parse(output)
114
+ # Text content should be identical after round-trip
115
+ expect(reparsed.root.text).to eq(doc.root.text)
116
+ end
117
+
118
+ it "does not restore entities when restore_entities is false" do
119
+ ctx_no_restore = Moxml.new(adapter_name, restore_entities: false)
120
+ doc = ctx_no_restore.parse("<p>a &amp; b</p>")
121
+ output = doc.to_xml
122
+ # Standard entities may still appear as &amp; due to XML escaping,
123
+ # but no EntityReference nodes should be created
124
+ expect(doc.root.children).not_to include(a_kind_of(Moxml::EntityReference))
125
+ end
126
+ end
127
+ end
128
+ ```
129
+
130
+ ### 4. Cross-Adapter Consistency Tests
131
+
132
+ **File**: `spec/consistency/entity_reference_consistency_spec.rb`
133
+
134
+ Verify that EntityReference behavior is consistent across all adapters that
135
+ support it:
136
+ - Same entity name produces same serialization
137
+ - Same text content after round-trip
138
+ - Children enumeration includes EntityReference nodes
139
+
140
+ ### 5. EntityRegistry.should_restore? Tests
141
+
142
+ **File**: Add to `spec/moxml/entity_registry_spec.rb`
143
+
144
+ ```ruby
145
+ describe "#should_restore?" do
146
+ it "always restores the 5 standard XML entities" do
147
+ registry = described_class.new
148
+ config = Moxml::Config.new(:nokogiri)
149
+ expect(registry.should_restore?(0x26, config: config)).to be true # amp
150
+ expect(registry.should_restore?(0x3C, config: config)).to be true # lt
151
+ end
152
+
153
+ it "restores non-standard entities only when restore_entities is true" do
154
+ registry = described_class.new
155
+ config_on = Moxml::Config.new(:nokogiri)
156
+ config_on.restore_entities = true
157
+ config_off = Moxml::Config.new(:nokogiri)
158
+ config_off.restore_entities = false
159
+
160
+ expect(registry.should_restore?(0xA0, config: config_on)).to be true # nbsp
161
+ expect(registry.should_restore?(0xA0, config: config_off)).to be false
162
+ end
163
+ end
164
+ ```
165
+
166
+ ## Dependencies
167
+
168
+ - TODO 1 must be partially complete (at least one adapter working) before
169
+ adapter-level tests can pass
170
+ - TODO 2 must be complete before non-standard entity restoration tests can pass
@@ -0,0 +1,106 @@
1
+ # TODO 4: Lenient Entities Mode
2
+
3
+ ## Problem
4
+
5
+ XML only defines 5 predefined entities (amp, lt, gt, quot, apos). Any other
6
+ entity must be declared in a DTD. However, real-world XML documents frequently
7
+ use HTML entities (`&nbsp;`, `&copy;`) without DTD declarations — particularly
8
+ office documents (OOXML/ODF) and legacy systems.
9
+
10
+ Currently Moxml has no way to configure whether undeclared entities should be
11
+ preserved. The `restore_entities` flag is a boolean that enables restoration
12
+ for all known entities from the registry. There is no distinction between
13
+ "only DTD-declared" (strict) and "any recognized" (lenient).
14
+
15
+ ## Design
16
+
17
+ ### Config Option
18
+
19
+ Add `entity_restoration_mode` to Config with two values:
20
+
21
+ - `:strict` (default) — Only restore entities that are declared in the DTD
22
+ internal subset. The 5 standard XML entities are always restored regardless
23
+ (they are implicitly declared per XML spec). DTD parsing is a prerequisite.
24
+
25
+ - `:lenient` — Restore any character that has a known entity name in the
26
+ EntityRegistry. This covers the bundled W3C HTML/MathML set (2125 entities)
27
+ plus any user-registered entities. No DTD required.
28
+
29
+ This replaces the boolean `restore_entities` which becomes a derived property:
30
+ - `restore_entities = true` + `entity_restoration_mode = :lenient` → restore all known
31
+ - `restore_entities = true` + `entity_restoration_mode = :strict` → restore only declared
32
+ - `restore_entities = false` → don't restore any
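
Illustrative sketch (not part of the gem): the three combinations listed above can be written as a small decision function. `restoration_policy` and the symbols it returns are hypothetical names chosen for this example.

```ruby
# Hypothetical helper: maps the two settings to the outcome listed above.
def restoration_policy(restore_entities:, entity_restoration_mode: :strict)
  return :none unless restore_entities

  case entity_restoration_mode
  when :lenient then :all_known
  when :strict  then :declared_only
  else raise ArgumentError, "unknown mode: #{entity_restoration_mode.inspect}"
  end
end

puts restoration_policy(restore_entities: true, entity_restoration_mode: :lenient) # all_known
puts restoration_policy(restore_entities: false)                                   # none
```

Raising on an unknown mode is one possible design choice for this sketch. Returning `false` for unknown modes would be an equally valid alternative.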
33
+
34
+ ### EntityRegistry Enhancement
35
+
36
+ ```ruby
37
+ class EntityRegistry
38
+ def should_restore?(codepoint, config:)
39
+ name = primary_name_for_codepoint(codepoint)
40
+ return false unless name
41
+
42
+ # Standard XML entities always restored (XML well-formedness requirement)
43
+ return true if standard_entity?(codepoint)
44
+
45
+ # Must have restoration enabled
46
+ return false unless config.restore_entities
47
+
48
+ case config.entity_restoration_mode
49
+ when :lenient
50
+ # Any known entity
51
+ true
52
+ when :strict
53
+ # Only if declared in DTD (future: check DTD declarations)
54
+ # For now, fall back to lenient behavior until DTD parsing is implemented
55
+ true
56
+ else
57
+ false
58
+ end
59
+ end
60
+
61
+ def standard_entity?(codepoint)
62
+ STANDARD_ENTITIES.value?(codepoint)
63
+ end
64
+ end
65
+ ```
66
+
67
+ ### User-Supplied Entities
68
+
69
+ Users can supply entities through three mechanisms:
70
+
71
+ 1. **EntityRegistry.register** — programmatic registration:
72
+ ```ruby
73
+ context = Moxml.new(:nokogiri)
74
+ context.entity_registry.register({ "myentity" => 0xABCD })
75
+ ```
76
+
77
+ 2. **entity_provider callback** — for custom/external entity sources:
78
+ ```ruby
79
+ Moxml.new(:nokogiri) do |c|
80
+ c.entity_load_mode = :custom
81
+ c.entity_provider = -> { { "myentity" => 0xABCD } }
82
+ end
83
+ ```
84
+
85
+ 3. **Bundled W3C set** — loaded by default in `:required` mode (2125 entities
86
+ from HTML/MathML/ISO sets). Controlled by `entity_load_mode` config.
87
+
88
+ None of these require DTD. They are model-level knowledge in the EntityRegistry.
89
+
90
+ ### DTD-Declared Entities (Future)
91
+
92
+ Strict mode's full value requires parsing DTD entity declarations from
93
+ `<!DOCTYPE ... [ <!ENTITY name "value"> ]>`. This is a separate feature
94
+ (external to this TODO). Until then, strict mode behaves like lenient mode.
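
Illustrative sketch (not part of the gem): to give a feel for what collecting DTD-declared entities might involve, the following scans an internal subset with regular expressions. It is not a real DTD parser. `ENTITY_DECL` and `declared_entities` are hypothetical names, and the sketch assumes simple general entities with quoted literal values, skipping parameter entities and `SYSTEM`/`PUBLIC` entities.

```ruby
# Matches <!ENTITY name "value"> or <!ENTITY name 'value'> only.
ENTITY_DECL = /<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+(?:"([^"]*)"|'([^']*)')\s*>/

# Returns { "name" => "literal value" } for the internal subset, or {}
# when the document has no DOCTYPE with an internal subset.
def declared_entities(xml)
  subset = xml[/<!DOCTYPE[^\[>]*\[(.*?)\]\s*>/m, 1]
  return {} unless subset

  subset.scan(ENTITY_DECL).to_h { |name, dq, sq| [name, dq || sq] }
end

doc = %(<!DOCTYPE doc [ <!ENTITY nbsp "&#160;"> <!ENTITY co 'ACME'> ]><doc/>)
p declared_entities(doc)
```

A production implementation would more likely rely on the underlying parser's DTD support, where the adapter exposes it.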
95
+
96
+ ## Files to Modify
97
+
98
+ - `lib/moxml/config.rb` — add `entity_restoration_mode` attribute
99
+ - `lib/moxml/entity_registry.rb` — add `should_restore?`, `standard_entity?`
100
+ - `lib/moxml/document_builder.rb` — use `should_restore?` from registry (ties into TODO 2)
101
+
102
+ ## Dependencies
103
+
104
+ - TODO 2 (model-driven restoration) should be done first so the policy is
105
+ centralized in EntityRegistry
106
+ - TODO 1 (adapter support) should be done first so entities can actually be created
@@ -0,0 +1,65 @@
1
+ # TODO 5: Fixture Integrity and CI Validation
2
+
3
+ ## Problem
4
+
5
+ The Metanorma bilingual presentation fixture was previously corrupted (error
6
+ messages appended after valid XML). It was truncated to fix the corruption,
7
+ but the truncated version has not been verified against the upstream source.
8
+
9
+ Additionally, there is no automated validation of XML fixtures in CI — a
10
+ corrupted fixture could be introduced and not caught until round-trip tests
11
+ fail with confusing errors.
12
+
13
+ ## Remaining Tasks
14
+
15
+ ### 1. Verify Bilingual Fixture Against Upstream
16
+
17
+ The file `spec/fixtures/round-trips/metanorma/bilingual.presentation.xml`
18
+ was truncated from 111,606 lines to fix corruption. Need to:
19
+
20
+ - Obtain a clean copy from the Metanorma project
21
+ - Compare with the current truncated version (21,211 lines — different from
22
+ the 55,802 lines mentioned in the original TODO, suggesting further changes)
23
+ - Confirm no data loss occurred in truncation
24
+
25
+ ### 2. Add CI Fixture Validation
26
+
27
+ Add a Rake task or RSpec test that validates all XML fixtures are well-formed
28
+ before running round-trip tests. This prevents silent corruption.
29
+
30
+ **Option A**: Rake task using `xmllint`:
31
+ ```ruby
32
+ # In Rakefile
33
+ namespace :spec do
34
+ task :validate_fixtures do
35
+ errors = []
36
+ Dir.glob("spec/fixtures/**/*.xml").each do |path|
37
+ output = `xmllint --noout "#{path}" 2>&1`
38
+ errors << "#{path}: #{output}" unless $?.success?
39
+ end
40
+ raise "Invalid fixtures:\n#{errors.join("\n")}" unless errors.empty?
41
+ end
42
+ end
43
+ task spec: ["spec:validate_fixtures"]
44
+ ```
45
+
46
+ **Option B**: RSpec test:
47
+ ```ruby
48
+ # spec/integration/fixture_validation_spec.rb
49
+ RSpec.describe "XML fixtures" do
50
+ Dir.glob("spec/fixtures/**/*.xml").each do |path|
51
+ it "#{path} is valid XML" do
52
+ ctx = Moxml.new(:nokogiri)
53
+ expect { ctx.parse(File.read(path)) }.not_to raise_error
54
+ end
55
+ end
56
+ end
57
+ ```
58
+
59
+ Option A is preferred — `xmllint` is stricter and catches issues that
60
+ lenient parsers might silently accept.
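
Illustrative sketch (not part of the gem): a variant of Option A that passes the path to `xmllint` as an argument array via `Open3.capture2e`, so paths containing quotes or spaces need no shell escaping. It assumes `xmllint` is on the `PATH`. `XMLLINT` and `invalid_fixtures` are hypothetical names, and the `runner:` parameter exists only so the logic can be exercised without the binary.

```ruby
require "open3"

# Default runner: invokes xmllint directly (no shell involved) and
# returns [success?, combined stdout/stderr output].
XMLLINT = lambda do |path|
  output, status = Open3.capture2e("xmllint", "--noout", path)
  [status.success?, output]
end

# Returns "path: message" strings for every fixture the runner rejects.
def invalid_fixtures(paths, runner: XMLLINT)
  paths.filter_map do |path|
    ok, output = runner.call(path)
    "#{path}: #{output.strip}" unless ok
  end
end
```

A Rake task could call `invalid_fixtures(Dir.glob("spec/fixtures/**/*.xml"))` and raise if the result is non-empty.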
61
+
62
+ ## Files to Create/Modify
63
+
64
+ - `Rakefile` — add `spec:validate_fixtures` task
65
+ - Verify/replace `spec/fixtures/round-trips/metanorma/bilingual.presentation.xml`
@@ -0,0 +1,36 @@
1
+ # TODO 6: Ox Adapter Element Ordering Bug
2
+
3
+ ## Problem
4
+
5
+ When round-tripping certain XML fixtures through the Ox adapter, child elements
6
+ are produced in a different order compared to Nokogiri, Oga, and REXML. This
7
+ causes cross-adapter consistency failures for `elements_with_attributes`
8
+ comparisons.
9
+
10
+ The semantic equivalence check (double round-trip) still passes, so the
11
+ document content is correct — only the ordering is wrong.
12
+
13
+ ## Current State
14
+
15
+ Suppressed in `spec/consistency/round_trip_spec.rb:332` via
16
+ `KNOWN_ELEMENT_ORDERING_ISSUES` set. Affected fixture/adapter combinations:
17
+
18
+ ```
19
+ niso-jats/element_citation.xml nokogiri <-> ox
20
+ niso-jats/element_citation.xml ox <-> oga
21
+ niso-jats/element_citation.xml rexml <-> ox
22
+ ```
23
+
24
+ ## Investigation Needed
25
+
26
+ - Determine whether Ox's DOM building reorders nodes or if the issue is in
27
+ Moxml's tree traversal during serialization.
28
+ - Check if Ox's `Ox::Element#nodes` preserves insertion order.
29
+ - Compare Ox's native serialization (`Ox.dump`) with Moxml's custom serializer
30
+ to narrow down where the reorder happens.
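
Illustrative sketch (not part of the gem): when comparing the child element names reported by two adapters, a small helper can pinpoint where the sequences first diverge and confirm that only the order differs. Both method names are hypothetical, and the inputs are assumed to be plain arrays of element names collected by the caller.

```ruby
# Returns the first index where the two name lists differ, or nil if identical.
def first_divergence(names_a, names_b)
  length = [names_a.length, names_b.length].max
  (0...length).find { |i| names_a[i] != names_b[i] }
end

# True when both lists hold the same names and differ only in order.
def reordered_only?(names_a, names_b)
  names_a != names_b && names_a.sort == names_b.sort
end

puts first_divergence(%w[name year source], %w[name source year]) # 1
puts reordered_only?(%w[name year source], %w[name source year])  # true
```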
31
+
32
+ ## Files
33
+
34
+ - `spec/consistency/round_trip_spec.rb` — suppression set
35
+ - `lib/moxml/adapter/ox.rb` — serialization path
36
+ - `lib/moxml/adapter/customized_ox/` — wrapper classes involved in tree walk
@@ -0,0 +1,95 @@
1
+ # TODO 7: HeadedOx Adapter Limitations (15 Skipped Tests)
2
+
3
+ ## Problem
4
+
5
+ HeadedOx (Ox + pure-Ruby XPath engine) has 15 skipped tests representing 7
6
+ distinct limitation areas. Some require upstream Ox gem enhancements; others
7
+ need investigation or Moxml-side fixes.
8
+
9
+ Full details in `docs/_pages/headed-ox-limitations.adoc`.
10
+
11
+ ## Limitation Areas
12
+
13
+ ### 7a. XPath `@*` Attribute Wildcard (3 tests)
14
+
15
+ The XPath parser does not support wildcard in the attribute axis.
16
+
17
+ **Tests:**
18
+ - `spec/moxml/xpath/compiler_spec.rb:156` — descendant-or-self wildcards
19
+ - `spec/moxml/xpath/compiler_spec.rb:192` — attribute axis wildcards
20
+ - `spec/moxml/xpath/axes_spec.rb:225` — attribute + predicate combinations
21
+
22
+ **Workaround:** Use `element.attributes.values` via Ruby enumeration.
23
+
24
+ ### 7b. Namespace Methods (4 tests)
25
+
26
+ Ox does not expose namespace information through its public API. The adapter
27
+ cannot implement `node.namespace`, `node.namespaces`, or namespace inheritance.
28
+
29
+ **Tests:**
30
+ - `spec/integration/shared_examples/edge_cases.rb:93` — default namespace changes
31
+ - `spec/integration/shared_examples/edge_cases.rb:119` — recursive namespace defs
32
+ - `spec/integration/shared_examples/edge_cases.rb:139` — namespace-prefixed attr access
33
+ - `spec/integration/shared_examples/integration_workflows.rb:83` — complex namespaces
34
+
35
+ **Requires:** Ox gem API enhancement (namespace accessors on `Ox::Element`).
36
+
37
+ ### 7c. Text Content from Nested XPath Results (4 tests)
38
+
39
+ Accessing text content from child elements of XPath result nodes returns empty
40
+ strings. Likely a node wrapping or text node handling issue in HeadedOx.
41
+
42
+ **Tests:**
43
+ - `spec/moxml/adapter/headed_ox_spec.rb:74` — string functions in predicates
44
+ - `spec/moxml/adapter/headed_ox_spec.rb:82` — position functions
45
+ - `spec/moxml/adapter/headed_ox_spec.rb:304` — last() function
46
+ - `spec/integration/shared_examples/node_wrappers/node_behavior.rb:113` — XPath text access
47
+
48
+ **Needs:** Investigation — check node wrapping and text node registration.
49
+
50
+ ### 7d. CDATA `]]>` Escaping (2 tests)
51
+
52
+ Ox serializes CDATA sections as-is without splitting on `]]>` sequences, which
53
+ violates the XML spec.
54
+
55
+ **Tests:**
56
+ - `spec/integration/shared_examples/edge_cases.rb:39`
57
+ - `spec/integration/shared_examples/node_wrappers/cdata_behavior.rb:44`
58
+
59
+ **Requires:** Ox gem enhancement or Moxml-side serialization override.
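
Illustrative sketch (not part of the gem): a Moxml-side override could use the usual technique of ending the section after `]]` and reopening it before `>`. The helper name `cdata_section` is hypothetical, and how it would be wired into the Ox adapter is left open.

```ruby
# Wraps text in CDATA, splitting every "]]>" across two sections so that
# no single section contains the terminator sequence.
def cdata_section(text)
  "<![CDATA[" + text.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

puts cdata_section("a]]>b")
# <![CDATA[a]]]]><![CDATA[>b]]>
```

A parser reads the two adjacent sections as `a]]` followed by `>b`, which concatenates back to the original `a]]>b`.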
60
+
61
+ ### 7e. Parent Node Setter (1 test)
62
+
63
+ Ox has no native method to change a node's parent after creation.
64
+
65
+ **Test:**
66
+ - `spec/integration/shared_examples/integration_workflows.rb:126`
67
+
68
+ **Requires:** Ox gem reparenting API or workaround via remove + re-add.
69
+
70
+ ### 7f. Namespace-Aware XPath with Predicates (1 test)
71
+
72
+ Queries like `//xmlns:item[@id="123"]` return empty results under HeadedOx.
73
+
74
+ **Test:**
75
+ - `spec/integration/shared_examples/integration_workflows.rb:63`
76
+
77
+ **Needs:** Investigation — check namespace resolution in predicate context.
78
+
79
+ ### 7g. Wildcard Element Counting (1 test)
80
+
81
+ `//*` returns a different count (6) vs Nokogiri (7+), likely due to Ox's DOM
82
+ structure.
83
+
84
+ **Test:**
85
+ - `spec/moxml/xpath/compiler_spec.rb:156`
86
+
87
+ **Impact:** Low — real-world queries typically use specific element names.
88
+
89
+ ## Files
90
+
91
+ - `docs/_pages/headed-ox-limitations.adoc` — full documentation
92
+ - `lib/moxml/adapter/headed_ox.rb`
93
+ - `lib/moxml/adapter/ox.rb`
94
+ - `lib/moxml/xpath/` — pure-Ruby XPath engine
95
+ - All spec files listed above