moxml 0.1.14 → 0.1.16
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +117 -66
- data/Gemfile +1 -0
- data/README.adoc +11 -9
- data/Rakefile +34 -1
- data/TODO.remaining/1-entity-reference-adapter-support.md +157 -0
- data/TODO.remaining/2-entity-restoration-model-driven.md +169 -0
- data/TODO.remaining/3-entity-reference-test-coverage.md +170 -0
- data/TODO.remaining/4-lenient-entities-mode.md +106 -0
- data/TODO.remaining/5-fixture-integrity.md +65 -0
- data/TODO.remaining/6-ox-element-ordering-bug.md +36 -0
- data/TODO.remaining/7-headed-ox-limitations.md +95 -0
- data/TODO.remaining/8-xpath-predicate-gaps.md +68 -0
- data/TODO.remaining/9-cleanup-hygiene.md +42 -0
- data/TODO.remaining/README.md +54 -0
- data/benchmarks/generate_report.rb +1 -1
- data/docs/_pages/configuration.adoc +22 -19
- data/docs/_tutorials/namespace-handling.adoc +5 -5
- data/lib/moxml/adapter/base.rb +22 -3
- data/lib/moxml/adapter/customized_libxml/declaration.rb +1 -1
- data/lib/moxml/adapter/customized_libxml/entity_reference.rb +23 -0
- data/lib/moxml/adapter/customized_libxml.rb +18 -0
- data/lib/moxml/adapter/customized_oga.rb +10 -0
- data/lib/moxml/adapter/customized_ox/entity_reference.rb +25 -0
- data/lib/moxml/adapter/customized_ox.rb +12 -0
- data/lib/moxml/adapter/customized_rexml/entity_reference.rb +19 -0
- data/lib/moxml/adapter/customized_rexml/formatter.rb +44 -20
- data/lib/moxml/adapter/customized_rexml.rb +11 -0
- data/lib/moxml/adapter/headed_ox.rb +37 -14
- data/lib/moxml/adapter/libxml.rb +233 -119
- data/lib/moxml/adapter/nokogiri.rb +22 -11
- data/lib/moxml/adapter/oga.rb +64 -25
- data/lib/moxml/adapter/ox.rb +198 -42
- data/lib/moxml/adapter/rexml.rb +64 -13
- data/lib/moxml/attribute.rb +3 -0
- data/lib/moxml/builder.rb +78 -24
- data/lib/moxml/config.rb +24 -7
- data/lib/moxml/declaration.rb +4 -2
- data/lib/moxml/document.rb +8 -1
- data/lib/moxml/document_builder.rb +44 -37
- data/lib/moxml/element.rb +18 -5
- data/lib/moxml/entity_registry.rb +51 -1
- data/lib/moxml/native_attachment.rb +65 -0
- data/lib/moxml/node.rb +39 -50
- data/lib/moxml/node_set.rb +43 -15
- data/lib/moxml/version.rb +1 -1
- data/lib/moxml/xml_utils.rb +1 -1
- data/lib/moxml/xpath/compiler.rb +4 -1
- data/lib/moxml.rb +1 -0
- data/scripts/format_xml.rb +16 -0
- data/scripts/pretty_format_xml.rb +14 -0
- data/spec/consistency/round_trip_spec.rb +3 -30
- data/spec/integration/all_adapters_spec.rb +1 -0
- data/spec/integration/headed_ox_integration_spec.rb +0 -2
- data/spec/integration/shared_examples/edge_cases.rb +7 -4
- data/spec/integration/shared_examples/integration_workflows.rb +3 -3
- data/spec/integration/shared_examples/node_wrappers/cdata_behavior.rb +1 -1
- data/spec/integration/shared_examples/node_wrappers/entity_reference_behavior.rb +224 -0
- data/spec/integration/shared_examples/node_wrappers/node_behavior.rb +1 -1
- data/spec/moxml/adapter/headed_ox_spec.rb +8 -8
- data/spec/moxml/adapter/oga_spec.rb +46 -0
- data/spec/moxml/adapter/shared_examples/adapter_contract.rb +1 -12
- data/spec/moxml/allocation_benchmark_spec.rb +96 -0
- data/spec/moxml/allocation_guard_spec.rb +282 -0
- data/spec/moxml/builder_spec.rb +256 -0
- data/spec/moxml/config_spec.rb +11 -11
- data/spec/moxml/doctype_spec.rb +41 -0
- data/spec/moxml/lazy_parse_spec.rb +115 -0
- data/spec/moxml/namespace_uri_validation_spec.rb +11 -3
- data/spec/moxml/node_cache_spec.rb +110 -0
- data/spec/moxml/node_set_cache_spec.rb +90 -0
- data/spec/moxml/xml_utils_spec.rb +32 -0
- data/spec/moxml/xpath/axes_spec.rb +1 -1
- data/spec/moxml/xpath/compiler_spec.rb +2 -2
- data/spec/moxml/xpath/functions/position_functions_spec.rb +5 -5
- data/spec/moxml/xpath/functions/special_functions_spec.rb +1 -1
- data/spec/performance/memory_usage_spec.rb +0 -4
- data/spec/support/allocation_helper.rb +165 -0
- data/spec/support/w3c_namespace_helpers.rb +2 -1
- metadata +29 -2
@@ -0,0 +1,169 @@
# TODO 2: Model-Driven Entity Restoration

## Problem

The `restore_entities` feature in `DocumentBuilder` is hardcoded to only handle
the 5 standard XML entities (amp, lt, gt, quot, apos). It ignores the
EntityRegistry entirely — despite EntityRegistry knowing 2125+ entities from
the W3C HTML/MathML set. This means non-standard entities like `&nbsp;`,
`&copy;`, `&mdash;` are never restored, which is the core round-trip problem
that motivated the entire entity feature.

Additionally, the restoration logic lives in DocumentBuilder with hardcoded
knowledge that belongs in the model layer.

## Current State (verified)

`lib/moxml/document_builder.rb:80-110` — `restore_entities_in_text`:
```ruby
entity_chars = {
  "<" => "lt", ">" => "gt", "&" => "amp",
  '"' => "quot", "'" => "apos",
}
```

This is a hardcoded lookup that duplicates knowledge already in EntityRegistry.
It only triggers for characters `<`, `>`, `&`, `"`, `'` — the regex guard
`/[<>&"']/` on line 73 prevents it from ever seeing characters like U+00A0
(non-breaking space, `&nbsp;`).

**Critical**: Because only Nokogiri has `create_native_entity_reference`
(see TODO 1), `restore_entities` raises `NotImplementedError` on all other
adapters even for the 5 standard entities.

## XML Entity Model

XML has a clear entity model:

1. **5 predefined entities** (amp, lt, gt, quot, apos) — always available per
   XML spec. These characters MUST be entity-encoded in certain contexts
   (e.g., `<` and `&` in text content).

2. **DTD-declared entities** — declared via `<!ENTITY name "value">` in the
   document's DOCTYPE internal subset or external subset.

3. **API-supplied entities** — registered by the user via
   `EntityRegistry.register` or `entity_provider` callback.

4. **Bundled detection set** — the W3C HTML/MathML entities bundled in
   `data/w3c_entities.json`. These are not "declared" in any DTD but are
   recognized by Moxml for restoration purposes.

The EntityRegistry already knows about categories 1, 3, and 4. Category 2
(DTD parsing) is future work.

## Design: Model-Driven Restoration

EntityRegistry should be THE source of truth for "should this character become
an entity reference?" The restoration policy should be:

```ruby
# In EntityRegistry (or a cooperating policy object)
STANDARD_CODEPOINTS = [0x26, 0x3C, 0x3E, 0x22, 0x27].freeze # amp, lt, gt, quot, apos

def should_restore?(codepoint, config:)
  name = primary_name_for_codepoint(codepoint)
  return false unless name

  # 1. The 5 standard XML entities are ALWAYS restored.
  #    These are syntactically required — the XML wouldn't be well-formed
  #    without encoding them.
  return true if STANDARD_CODEPOINTS.include?(codepoint)

  # 2. Non-standard entities: only if restore_entities is enabled.
  return false unless config.restore_entities

  # 3. In the future, strict vs lenient mode will gate this further.
  #    Strict: only if declared in DTD (not yet implemented).
  #    Lenient: any known entity name.
  true
end
```
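
The policy above can be tried in isolation. The following is a minimal sketch, not Moxml code: `TinyRegistry` and `SketchConfig` are hypothetical stand-ins, and the small codepoint table is sample data used in place of the real registry contents.

```ruby
# Hypothetical stand-ins for illustration; not Moxml's EntityRegistry or Config.
SketchConfig = Struct.new(:restore_entities)

class TinyRegistry
  STANDARD_CODEPOINTS = [0x26, 0x3C, 0x3E, 0x22, 0x27].freeze

  # Sample data only: codepoint => primary entity name.
  NAMES = {
    0x26 => "amp", 0x3C => "lt", 0x3E => "gt", 0x22 => "quot", 0x27 => "apos",
    0xA0 => "nbsp", 0xA9 => "copy", 0x2014 => "mdash"
  }.freeze

  def primary_name_for_codepoint(codepoint)
    NAMES[codepoint]
  end

  def should_restore?(codepoint, config:)
    # Characters without a known entity name are never restored.
    return false unless primary_name_for_codepoint(codepoint)
    # Standard XML entities bypass the config flag.
    return true if STANDARD_CODEPOINTS.include?(codepoint)

    config.restore_entities ? true : false
  end
end

registry = TinyRegistry.new
registry.should_restore?(0x26, config: SketchConfig.new(false)) # => true
registry.should_restore?(0xA0, config: SketchConfig.new(false)) # => false
registry.should_restore?(0xA0, config: SketchConfig.new(true))  # => true
registry.should_restore?(0x41, config: SketchConfig.new(true))  # => false
```

Keeping the decision in one predicate means the builder only has to ask the registry, which is the point of the model-driven design.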

### Changes to DocumentBuilder

Replace the hardcoded hash with delegation to the registry:

```ruby
def visit_text(node)
  prepared = adapter.prepare_for_new_document(node, @current_doc.native)
  content = adapter.text_content(node)

  if should_restore_entities?(content)
    restore_entities_in_text(content)
  else
    @node_stack.last&.add_child(Text.new(prepared, context))
  end
end

private

def should_restore_entities?(content)
  return false unless context.config.restore_entities
  # Scan for any character that the registry knows about
  content.to_s.chars.any? { |c| context.entity_registry.should_restore?(c.ord, config: context.config) }
end

def restore_entities_in_text(content)
  parent = @node_stack.last
  return unless parent

  content.to_s.chars.each do |char|
    codepoint = char.ord
    name = context.entity_registry.primary_name_for_codepoint(codepoint)

    if context.entity_registry.should_restore?(codepoint, config: context.config)
      entity_node = adapter.create_entity_reference(name)
      parent.add_child(EntityReference.new(entity_node, context))
    else
      text_node = adapter.create_text(char)
      parent.add_child(Text.new(text_node, context))
    end
  end
end
```

**Note**: This splits each text node into per-character nodes. For documents
with few entity references, this creates unnecessary overhead. A future
optimization should buffer consecutive non-entity characters into a single
text node.

### Performance Optimization (deferred)

Instead of character-by-character processing:
1. Scan the text for characters that have entity names in the registry
2. Split only at those positions, keeping runs of plain characters together
3. This reduces node count dramatically for typical documents

```ruby
def restore_entities_in_text(content)
  parent = @node_stack.last
  return unless parent

  buffer = +""
  content.to_s.chars.each do |char|
    codepoint = char.ord
    name = context.entity_registry.primary_name_for_codepoint(codepoint)

    if name && context.entity_registry.should_restore?(codepoint, config: context.config)
      # Flush buffer before entity
      unless buffer.empty?
        parent.add_child(Text.new(adapter.create_text(buffer), context))
        buffer.clear
      end
      parent.add_child(EntityReference.new(adapter.create_entity_reference(name), context))
    else
      buffer << char
    end
  end
  # Flush remaining buffer
  unless buffer.empty?
    parent.add_child(Text.new(adapter.create_text(buffer), context))
  end
end
```
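
The buffering idea can be checked without any adapter. This is a sketch under stated assumptions: `split_for_entities` is a hypothetical helper, and the `names` hash stands in for the registry lookup. It returns tokens instead of creating nodes so the node count is easy to inspect.

```ruby
# Hypothetical helper: splits text into [:text, run] and [:entity, name] tokens.
# `names` maps codepoints to entity names and stands in for the registry.
def split_for_entities(text, names)
  tokens = []
  buffer = +""
  text.each_char do |char|
    name = names[char.ord]
    if name
      # Flush the pending run of plain characters before the entity.
      tokens << [:text, buffer.dup] unless buffer.empty?
      buffer.clear
      tokens << [:entity, name]
    else
      buffer << char
    end
  end
  # Flush whatever plain text follows the last entity.
  tokens << [:text, buffer] unless buffer.empty?
  tokens
end

names = { 0xA0 => "nbsp", 0x2014 => "mdash" }
tokens = split_for_entities("a\u00A0b \u2014 c", names)
# tokens => [[:text, "a"], [:entity, "nbsp"], [:text, "b "], [:entity, "mdash"], [:text, " c"]]
```

The seven input characters become five tokens here; for ordinary prose with only occasional entities the saving over per-character nodes is much larger.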

## Files to Modify

- `lib/moxml/entity_registry.rb` — add `should_restore?` method
- `lib/moxml/document_builder.rb` — replace hardcoded entity_chars with registry-driven logic
@@ -0,0 +1,170 @@
# TODO 3: EntityReference Test Coverage

## Problem

There are zero tests for EntityReference node behavior, zero tests for
entity round-trip preservation, and zero adapter-level tests for entity
reference creation or serialization. Only `EntityRegistry` has tests
(`spec/moxml/entity_registry_spec.rb`).

This means the entire EntityReference feature is untested — including the
`restore_entities` config, `create_entity_reference` factory, `visit_entity_reference`
in DocumentBuilder, and the `entity_reference` Builder DSL method.

## Required Test Coverage

### 1. EntityReference Node Tests

**File**: `spec/moxml/entity_reference_spec.rb`

```ruby
RSpec.describe Moxml::EntityReference do
  # Test per adapter (use shared examples)
  %i[nokogiri].each do |adapter| # expand as adapters gain support
    context "with #{adapter} adapter" do
      let(:ctx) { Moxml.new(adapter) }

      it "creates an entity reference node" do
        doc = ctx.create_document
        ref = doc.create_entity_reference("nbsp")
        expect(ref).to be_a(Moxml::EntityReference)
        expect(ref.name).to eq("nbsp")
      end

      it "has empty text content" do
        doc = ctx.create_document
        ref = doc.create_entity_reference("amp")
        expect(ref.text).to eq("")
        expect(ref.content).to eq("")
      end

      it "serializes to entity syntax" do
        doc = ctx.create_document
        ref = doc.create_entity_reference("mdash")
        expect(ref.to_xml).to eq("&mdash;")
      end

      it "is recognized as entity_reference type" do
        doc = ctx.create_document
        ref = doc.create_entity_reference("copy")
        expect(ref.entity_reference?).to be true
      end

      it "survives add_child and retrieval" do
        doc = ctx.create_document
        root = doc.create_element("p")
        doc.root = root
        ref = doc.create_entity_reference("nbsp")
        root.add_child(ref)
        expect(root.children.first).to be_a(Moxml::EntityReference)
        expect(root.children.first.name).to eq("nbsp")
      end

      it "validates entity reference name" do
        doc = ctx.create_document
        expect {
          doc.create_entity_reference("123invalid")
        }.to raise_error(Moxml::ValidationError)
      end
    end
  end
end
```

### 2. Builder DSL Tests

**File**: `spec/moxml/builder_spec.rb` (add to existing or create new section)

```ruby
it "creates entity references via DSL" do
  doc = Moxml::Builder.new(ctx).build do
    element("p") { entity_reference("nbsp") }
  end
  expect(doc.root.children.first).to be_a(Moxml::EntityReference)
  expect(doc.to_xml).to include("&nbsp;")
end
```

### 3. Restore Entities Integration Tests

**File**: `spec/moxml/adapter/entity_restoration_spec.rb` (shared examples)

```ruby
RSpec.shared_examples "entity restoration" do |adapter_name|
  context "with #{adapter_name}" do
    let(:ctx) { Moxml.new(adapter_name, restore_entities: true) }

    it "restores standard XML entities" do
      doc = ctx.parse("<p>a &amp; b</p>")
      output = doc.to_xml
      expect(output).to include("&amp;")
    end

    it "restores non-standard entities from registry" do
      # nbsp (U+00A0) is in the bundled W3C entity set
      doc = ctx.parse("<p>\u00A0</p>")
      output = doc.to_xml
      expect(output).to include("&nbsp;")
    end

    it "preserves entity syntax through round-trip" do
      doc = ctx.parse("<p>&nbsp;&copy;&mdash;</p>")
      output = doc.to_xml
      reparsed = ctx.parse(output)
      # Text content should be identical after round-trip
      expect(reparsed.root.text).to eq(doc.root.text)
    end

    it "does not restore entities when restore_entities is false" do
      ctx_no_restore = Moxml.new(adapter_name, restore_entities: false)
      doc = ctx_no_restore.parse("<p>a &amp; b</p>")
      output = doc.to_xml
      # Standard entities may still appear as &amp; due to XML escaping,
      # but no EntityReference nodes should be created
      expect(doc.root.children).not_to include(a_kind_of(Moxml::EntityReference))
    end
  end
end
```

### 4. Cross-Adapter Consistency Tests

**File**: `spec/consistency/entity_reference_consistency_spec.rb`

Verify that EntityReference behavior is consistent across all adapters that
support it:
- Same entity name produces same serialization
- Same text content after round-trip
- Children enumeration includes EntityReference nodes

### 5. EntityRegistry.should_restore? Tests

**File**: Add to `spec/moxml/entity_registry_spec.rb`

```ruby
describe "#should_restore?" do
  it "always restores the 5 standard XML entities" do
    registry = described_class.new
    config = Moxml::Config.new(:nokogiri)
    expect(registry.should_restore?(0x26, config: config)).to be true # amp
    expect(registry.should_restore?(0x3C, config: config)).to be true # lt
  end

  it "restores non-standard entities only when restore_entities is true" do
    registry = described_class.new
    config_on = Moxml::Config.new(:nokogiri)
    config_on.restore_entities = true
    config_off = Moxml::Config.new(:nokogiri)
    config_off.restore_entities = false

    expect(registry.should_restore?(0xA0, config: config_on)).to be true # nbsp
    expect(registry.should_restore?(0xA0, config: config_off)).to be false
  end
end
```

## Dependencies

- TODO 1 must be partially complete (at least one adapter working) before
  adapter-level tests can pass
- TODO 2 must be complete before non-standard entity restoration tests can pass
@@ -0,0 +1,106 @@
# TODO 4: Lenient Entities Mode

## Problem

XML only defines 5 predefined entities (amp, lt, gt, quot, apos). Any other
entity must be declared in a DTD. However, real-world XML documents frequently
use HTML entities (`&nbsp;`, `&copy;`) without DTD declarations — particularly
office documents (OOXML/ODF) and legacy systems.

Currently Moxml has no way to configure whether undeclared entities should be
preserved. The `restore_entities` flag is a boolean that enables restoration
for all known entities from the registry. There is no distinction between
"only DTD-declared" (strict) and "any recognized" (lenient).

## Design

### Config Option

Add `entity_restoration_mode` to Config with two values:

- `:strict` (default) — Only restore entities that are declared in the DTD
  internal subset. The 5 standard XML entities are always restored regardless
  (they are implicitly declared per XML spec). DTD parsing is a prerequisite.

- `:lenient` — Restore any character that has a known entity name in the
  EntityRegistry. This covers the bundled W3C HTML/MathML set (2125 entities)
  plus any user-registered entities. No DTD required.

This replaces the boolean `restore_entities`, which becomes a derived property:
- `restore_entities = true` + `entity_restoration_mode = :lenient` → restore all known
- `restore_entities = true` + `entity_restoration_mode = :strict` → restore only declared
- `restore_entities = false` → don't restore any
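
The three combinations above can be expressed as one small resolver. This is a sketch only: the method name `effective_restoration_policy` and the result symbols `:none`, `:declared_only`, and `:all_known` are hypothetical and do not appear in Moxml.

```ruby
# Hypothetical resolver for the combinations listed above.
VALID_RESTORATION_MODES = %i[strict lenient].freeze

def effective_restoration_policy(restore_entities:, entity_restoration_mode: :strict)
  unless VALID_RESTORATION_MODES.include?(entity_restoration_mode)
    raise ArgumentError, "unknown entity_restoration_mode: #{entity_restoration_mode.inspect}"
  end
  # The boolean acts as the master switch; the mode only matters when it is on.
  return :none unless restore_entities

  entity_restoration_mode == :lenient ? :all_known : :declared_only
end

effective_restoration_policy(restore_entities: true, entity_restoration_mode: :lenient) # => :all_known
effective_restoration_policy(restore_entities: true, entity_restoration_mode: :strict)  # => :declared_only
effective_restoration_policy(restore_entities: false)                                   # => :none
```

Rejecting unknown modes early avoids a silent fall-through to "restore nothing" when a mode name is misspelled.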

### EntityRegistry Enhancement

```ruby
class EntityRegistry
  def should_restore?(codepoint, config:)
    name = primary_name_for_codepoint(codepoint)
    return false unless name

    # Standard XML entities always restored (XML well-formedness requirement)
    return true if standard_entity?(codepoint)

    # Must have restoration enabled
    return false unless config.restore_entities

    case config.entity_restoration_mode
    when :lenient
      # Any known entity
      true
    when :strict
      # Only if declared in DTD (future: check DTD declarations)
      # For now, fall back to lenient behavior until DTD parsing is implemented
      true
    else
      false
    end
  end

  def standard_entity?(codepoint)
    STANDARD_ENTITIES.value?(codepoint)
  end
end
```

### User-Supplied Entities

Users can supply entities through three mechanisms:

1. **EntityRegistry.register** — programmatic registration:
   ```ruby
   context = Moxml.new(:nokogiri)
   context.entity_registry.register({ "myentity" => 0xABCD })
   ```

2. **entity_provider callback** — for custom/external entity sources:
   ```ruby
   Moxml.new(:nokogiri) do |c|
     c.entity_load_mode = :custom
     c.entity_provider = -> { { "myentity" => 0xABCD } }
   end
   ```

3. **Bundled W3C set** — loaded by default in `:required` mode (2125 entities
   from HTML/MathML/ISO sets). Controlled by `entity_load_mode` config.

None of these require DTD. They are model-level knowledge in the EntityRegistry.

### DTD-Declared Entities (Future)

Strict mode's full value requires parsing DTD entity declarations from
`<!DOCTYPE ... [ <!ENTITY name "value"> ]>`. This is a separate feature
(external to this TODO). Until then, strict mode behaves like lenient mode.
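
To give a sense of what that separate feature would need to extract, here is a deliberately simplified sketch. It assumes only internal general entities with quoted literal values; parameter entities, `SYSTEM`/`PUBLIC` external entities, and values containing `]>` are out of scope. `declared_entities` is a hypothetical helper, not part of Moxml, and a real implementation would more likely rely on the underlying parser's DTD support.

```ruby
# Simplified sketch: matches <!ENTITY name "value"> or <!ENTITY name 'value'>.
ENTITY_DECL = /<!ENTITY\s+([A-Za-z_:][\w.:-]*)\s+(?:"([^"]*)"|'([^']*)')\s*>/

# Returns a hash of entity name => literal value declared in the internal subset.
def declared_entities(xml)
  # Capture the text between "[" and "]>" of the DOCTYPE declaration.
  subset = xml[/<!DOCTYPE[^\[>]*\[(.*?)\]\s*>/m, 1]
  return {} unless subset

  subset.scan(ENTITY_DECL).to_h { |name, double_quoted, single_quoted| [name, double_quoted || single_quoted] }
end

xml = <<~XML
  <!DOCTYPE doc [
    <!ENTITY nbsp "&#160;">
    <!ENTITY co 'ACME'>
  ]>
  <doc>&co;</doc>
XML

declared_entities(xml) # => {"nbsp"=>"&#160;", "co"=>"ACME"}
```

In strict mode, the names returned here would be the only non-standard entities eligible for restoration.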

## Files to Modify

- `lib/moxml/config.rb` — add `entity_restoration_mode` attribute
- `lib/moxml/entity_registry.rb` — add `should_restore?`, `standard_entity?`
- `lib/moxml/document_builder.rb` — use `should_restore?` from registry (ties into TODO 2)

## Dependencies

- TODO 2 (model-driven restoration) should be done first so the policy is
  centralized in EntityRegistry
- TODO 1 (adapter support) should be done first so entities can actually be created
@@ -0,0 +1,65 @@
# TODO 5: Fixture Integrity and CI Validation

## Problem

The Metanorma bilingual presentation fixture was previously corrupted (error
messages appended after valid XML). It was truncated to fix the corruption,
but the truncated version has not been verified against the upstream source.

Additionally, there is no automated validation of XML fixtures in CI — a
corrupted fixture could be introduced and not caught until round-trip tests
fail with confusing errors.

## Remaining Tasks

### 1. Verify Bilingual Fixture Against Upstream

The file `spec/fixtures/round-trips/metanorma/bilingual.presentation.xml`
was truncated from 111,606 lines to fix corruption. Need to:

- Obtain a clean copy from the Metanorma project
- Compare with the current truncated version (21,211 lines — different from
  the 55,802 lines mentioned in the original TODO, suggesting further changes)
- Confirm no data loss occurred in truncation

### 2. Add CI Fixture Validation

Add a Rake task or RSpec test that validates all XML fixtures are well-formed
before running round-trip tests. This prevents silent corruption.

**Option A**: Rake task using `xmllint`:
```ruby
# In Rakefile
namespace :spec do
  task :validate_fixtures do
    errors = []
    Dir.glob("spec/fixtures/**/*.xml").each do |path|
      output = `xmllint --noout "#{path}" 2>&1`
      errors << "#{path}: #{output}" unless $?.success?
    end
    raise "Invalid fixtures:\n#{errors.join("\n")}" unless errors.empty?
  end
end
task spec: ["spec:validate_fixtures"]
```

**Option B**: RSpec test:
```ruby
# spec/integration/fixture_validation_spec.rb
RSpec.describe "XML fixtures" do
  Dir.glob("spec/fixtures/**/*.xml").each do |path|
    it "#{path} is valid XML" do
      ctx = Moxml.new(:nokogiri)
      expect { ctx.parse(File.read(path)) }.not_to raise_error
    end
  end
end
```

Option A is preferred — `xmllint` is stricter and catches issues that
lenient parsers might silently accept.
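
Where `xmllint` is not installed (an assumption about some CI images), a pure-Ruby fallback could look like the sketch below. It uses REXML, which ships with Ruby as a bundled gem; the helper names are hypothetical. REXML may accept input that `xmllint` rejects, so this should be read as a weaker check than Option A, not a replacement for it.

```ruby
require "rexml/document"

# Returns nil when the XML parses, otherwise the first line of the parser error.
def well_formedness_error(xml)
  REXML::Document.new(xml)
  nil
rescue REXML::ParseException => e
  e.message.lines.first.to_s.strip
end

# Maps each fixture path that fails to parse to its error message.
def invalid_fixtures(glob)
  Dir.glob(glob).sort.each_with_object({}) do |path, errors|
    error = well_formedness_error(File.read(path))
    errors[path] = error if error
  end
end

# Usage inside a Rake task or spec helper:
#   failures = invalid_fixtures("spec/fixtures/**/*.xml")
#   raise "Invalid fixtures:\n#{failures.map { |p, e| "#{p}: #{e}" }.join("\n")}" unless failures.empty?
```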

## Files to Create/Modify

- `Rakefile` — add `spec:validate_fixtures` task
- Verify/replace `spec/fixtures/round-trips/metanorma/bilingual.presentation.xml`
@@ -0,0 +1,36 @@
# TODO 6: Ox Adapter Element Ordering Bug

## Problem

When round-tripping certain XML fixtures through the Ox adapter, child elements
are produced in a different order compared to Nokogiri, Oga, and REXML. This
causes cross-adapter consistency failures for `elements_with_attributes`
comparisons.

The semantic equivalence check (double round-trip) still passes, so the
document content is correct — only the ordering is wrong.

## Current State

Suppressed in `spec/consistency/round_trip_spec.rb:332` via
`KNOWN_ELEMENT_ORDERING_ISSUES` set. Affected fixture/adapter combinations:

```
niso-jats/element_citation.xml  nokogiri <-> ox
niso-jats/element_citation.xml  ox <-> oga
niso-jats/element_citation.xml  rexml <-> ox
```

## Investigation Needed

- Determine whether Ox's DOM building reorders nodes or if the issue is in
  Moxml's tree traversal during serialization.
- Check if Ox's `Ox::Element#nodes` preserves insertion order.
- Compare Ox's native serialization (`Ox.dump`) with Moxml's custom serializer
  to narrow down where the reorder happens.
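
While investigating, a small comparison helper can pinpoint where two adapters' outputs first disagree. The sketch below is hypothetical and adapter-agnostic: it takes two plain arrays of child element names (however they were collected from each adapter) and reports the first differing position, plus whether the two lists contain the same members overall.

```ruby
# Hypothetical helper: finds the first index where two ordered name lists differ.
def first_order_divergence(expected, actual)
  length = [expected.length, actual.length].max
  index = (0...length).find { |i| expected[i] != actual[i] }
  return nil unless index

  {
    index: index,
    expected: expected[index],
    actual: actual[index],
    # true means a pure reordering; false means elements were lost or added.
    same_members: expected.sort == actual.sort
  }
end

first_order_divergence(%w[name year source], %w[name source year])
# => {:index=>1, :expected=>"year", :actual=>"source", :same_members=>true}
```

A result with `same_members: true` points at ordering alone, which matches the observation that the double round-trip check still passes.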

## Files

- `spec/consistency/round_trip_spec.rb` — suppression set
- `lib/moxml/adapter/ox.rb` — serialization path
- `lib/moxml/adapter/customized_ox/` — wrapper classes involved in tree walk
@@ -0,0 +1,95 @@
# TODO 7: HeadedOx Adapter Limitations (15 Skipped Tests)

## Problem

HeadedOx (Ox + pure-Ruby XPath engine) has 15 skipped tests representing 7
distinct limitation areas. Some require upstream Ox gem enhancements; others
need investigation or Moxml-side fixes.

Full details in `docs/_pages/headed-ox-limitations.adoc`.

## Limitation Areas

### 7a. XPath `@*` Attribute Wildcard (3 tests)

The XPath parser does not support wildcard in the attribute axis.

**Tests:**
- `spec/moxml/xpath/compiler_spec.rb:156` — descendant-or-self wildcards
- `spec/moxml/xpath/compiler_spec.rb:192` — attribute axis wildcards
- `spec/moxml/xpath/axes_spec.rb:225` — attribute + predicate combinations

**Workaround:** Use `element.attributes.values` via Ruby enumeration.

### 7b. Namespace Methods (4 tests)

Ox does not expose namespace information through its public API. The adapter
cannot implement `node.namespace`, `node.namespaces`, or namespace inheritance.

**Tests:**
- `spec/integration/shared_examples/edge_cases.rb:93` — default namespace changes
- `spec/integration/shared_examples/edge_cases.rb:119` — recursive namespace defs
- `spec/integration/shared_examples/edge_cases.rb:139` — namespace-prefixed attr access
- `spec/integration/shared_examples/integration_workflows.rb:83` — complex namespaces

**Requires:** Ox gem API enhancement (namespace accessors on `Ox::Element`).

### 7c. Text Content from Nested XPath Results (4 tests)

Accessing text content from child elements of XPath result nodes returns empty
strings. Likely a node wrapping or text node handling issue in HeadedOx.

**Tests:**
- `spec/moxml/adapter/headed_ox_spec.rb:74` — string functions in predicates
- `spec/moxml/adapter/headed_ox_spec.rb:82` — position functions
- `spec/moxml/adapter/headed_ox_spec.rb:304` — last() function
- `spec/integration/shared_examples/node_wrappers/node_behavior.rb:113` — XPath text access

**Needs:** Investigation — check node wrapping and text node registration.

### 7d. CDATA `]]>` Escaping (2 tests)

Ox serializes CDATA sections as-is without splitting on `]]>` sequences, which
violates the XML spec.

**Tests:**
- `spec/integration/shared_examples/edge_cases.rb:39`
- `spec/integration/shared_examples/node_wrappers/cdata_behavior.rb:44`

**Requires:** Ox gem enhancement or Moxml-side serialization override.
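
For the Moxml-side override, the usual technique is to split the content so that no single CDATA section contains `]]>`. The sketch below shows that string transformation only; `cdata_sections` is a hypothetical helper name, and how it would be wired into the Ox adapter's serializer is left open.

```ruby
# Hypothetical helper: wraps text in CDATA, splitting at every "]]>" so that
# the terminator never appears inside a section.
def cdata_sections(text)
  # "]]>" becomes "]]" + end-of-section + start-of-section + ">".
  "<![CDATA[#{text.gsub(']]>', ']]]]><![CDATA[>')}]]>"
end

cdata_sections("plain")  # => "<![CDATA[plain]]>"
cdata_sections("a]]>b")  # => "<![CDATA[a]]]]><![CDATA[>b]]>"
```

A parser reads the second result as two adjacent sections, `a]]` and `>b`, whose concatenated content is the original `a]]>b`.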

### 7e. Parent Node Setter (1 test)

Ox has no native method to change a node's parent after creation.

**Test:**
- `spec/integration/shared_examples/integration_workflows.rb:126`

**Requires:** Ox gem reparenting API or workaround via remove + re-add.

### 7f. Namespace-Aware XPath with Predicates (1 test)

Queries like `//xmlns:item[@id="123"]` return empty results under HeadedOx.

**Test:**
- `spec/integration/shared_examples/integration_workflows.rb:63`

**Needs:** Investigation — check namespace resolution in predicate context.

### 7g. Wildcard Element Counting (1 test)

`//*` returns a different count (6) vs Nokogiri (7+), likely due to Ox's DOM
structure.

**Test:**
- `spec/moxml/xpath/compiler_spec.rb:156`

**Impact:** Low — real-world queries typically use specific element names.

## Files

- `docs/_pages/headed-ox-limitations.adoc` — full documentation
- `lib/moxml/adapter/headed_ox.rb`
- `lib/moxml/adapter/ox.rb`
- `lib/moxml/xpath/` — pure-Ruby XPath engine
- All spec files listed above