xml_node_stream 1.0.2 → 2.0.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +30 -0
- data/README.md +139 -0
- data/VERSION +1 -0
- data/lib/xml_node_stream/http_stream.rb +179 -0
- data/lib/xml_node_stream/node.rb +98 -47
- data/lib/xml_node_stream/parser/base.rb +49 -12
- data/lib/xml_node_stream/parser/libxml_parser.rb +36 -9
- data/lib/xml_node_stream/parser/nokogiri_parser.rb +42 -12
- data/lib/xml_node_stream/parser/rexml_parser.rb +35 -8
- data/lib/xml_node_stream/parser.rb +54 -29
- data/lib/xml_node_stream/selector.rb +144 -34
- data/lib/xml_node_stream.rb +18 -5
- data/xml_node_stream.gemspec +39 -0
- metadata +46 -88
- data/README.rdoc +0 -61
- data/Rakefile +0 -44
- data/spec/node_spec.rb +0 -140
- data/spec/parser_spec.rb +0 -148
- data/spec/selector_spec.rb +0 -73
- data/spec/spec_helper.rb +0 -2
- data/spec/test.xml +0 -57
- data/spec/xml_node_stream_spec.rb +0 -11
- /data/{MIT_LICENSE → MIT-LICENSE} +0 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: c48b75cdd974d95227eaccd1748c73d6e04d0793d6d9fe6e5fd0a2c9dfecfb32
+  data.tar.gz: 47c1effbbe39895d2cac07f545904a27760c0606406daf0b2abfaf5139f2f4ae
+SHA512:
+  metadata.gz: e50b6c273e787053f8020731c9e11830a19f3e047ce230e417fa927ae51cd274ca64b8921489a98831ba77aeef34d050fd309496c1018ee482fcf59abca671f7
+  data.tar.gz: d8e2040f6d4117b29054813eb7741bd15567d374736df80a1cbaeff25fe9692848b2de7e44f0bfec34234d9e6bb5229bfb120023c8a5f7b057a614a0b8dc9fc9
|
data/CHANGELOG.md
ADDED
@@ -0,0 +1,30 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## 2.0.0
+
+### Changed
+
+- Updated minimum Ruby version to 2.7.
+- Only supports passing in http and https URLs as URIs. The previous behavior of calling `Kernel#open` was removed as a potential security risk.
+
+## 1.0.2
+
+### Changed
+
+- Update to work with latest versions of Nokogiri and LibXML
+
+## 1.0.1
+
+### Fixed
+
+- Fixes to Rakefile so it loads without rspec
+
+## 1.0.0
+
+### Added
+
+- Initial release.
|
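The 2.0.0 entry suggests that remote sources are now expected to arrive as `URI` objects with an http or https scheme. A minimal sketch of that kind of check, assuming a hypothetical helper named `http_source?` that is not part of the gem:

```ruby
require "uri"

# Hypothetical helper, not part of xml_node_stream: true only for URI
# objects with an http or https scheme. URI::HTTPS is a subclass of
# URI::HTTP, so one class check covers both schemes.
def http_source?(source)
  source.is_a?(URI::HTTP)
end

remote = http_source?(URI("https://example.com/books.xml")) # true
string = http_source?("https://example.com/books.xml")      # false: a String, not a URI
```

Checking the class rather than pattern-matching the string keeps plain strings from being treated as remote sources.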
data/README.md
ADDED
@@ -0,0 +1,139 @@
+# XML Node Stream
+
+[](https://github.com/bdurand/xml_node_stream/actions/workflows/continuous_integration.yml)
+[](https://github.com/testdouble/standard)
+[](https://badge.fury.io/rb/xml_node_stream)
+
+This gem provides a very easy to use XML parser that provides the benefits of both stream parsing (i.e. SAX) and document parsing (i.e. DOM). In addition, it provides a unified parsing language for each of the major Ruby XML parsers (REXML, Nokogiri, and LibXML) so that your code doesn't have to be bound to a particular XML library.
+
+## Usage
+
+The primary purpose of this gem is to facilitate parsing large XML files (i.e. several megabytes in size). Often, reading these files into a document structure is not feasible because the whole document must be read into memory. Stream/SAX parsing solves this issue by reading in the file incrementally and providing callbacks for various events. This method can be quite painful to deal with for any sort of complex document structure.
+
+This gem attempts to solve both of these issues by combining the best features of both. Parsing is performed by a stream parser which constructs document style nodes and calls back to the application code with these nodes. When your application is done with a node, it can release it to free up memory and keep your heap from bloating.
+
+In order to keep the interface simple and universal, only XML elements and text nodes are supported. XML processing instructions and comments will be ignored.
+
+### Examples
+
+Suppose we have a file with every book in the world in it:
+
+```xml
+<books>
+  <book isbn="123456">
+    <title>Moby Dick</title>
+    <author>Herman Melville</author>
+    <categories>
+      <category>Fiction</category>
+      <category>Adventure</category>
+    </categories>
+  </book>
+  <book isbn="98765643">
+    <title>The Decline and Fall of the Roman Empire</title>
+    <author>Edward Gibbon</author>
+    <categories>
+      <category>History</category>
+      <category>Ancient</category>
+    </categories>
+  </book>
+  ...
+</books>
+```
+
+Reading the whole file into memory will cause problems as it bloats the heap with potentially gigabytes of data. This can be solved by using a streaming parser, but that code can be a pain to write and maintain.
+
+We can use `XmlNodeStream` to get the best of both worlds. The file is streamed into memory for processing and then released when we are done with it. But we get node data structures that can be used to interact with the document in a much simpler manner.
+
+```ruby
+XmlNodeStream.parse('/tmp/books.xml') do |node|
+  if node.path == '/books/book'
+    book = Book.new
+    book.isbn = node['isbn']
+    book.title = node.find('title').value
+    book.author = node.find('author/text()')
+    book.categories = node.select('categories/category/text()')
+    book.save
+    node.release!
+  end
+end
+```
+
+### Releasing Nodes
+
+In the above example, what prevents memory bloat when parsing a large document is the call to `node.release!`. This call will remove the node from the node tree. The general practice is to look for the higher level nodes you are interested in and then release them immediately. If there are nodes you don't care about at all, those should be released immediately as well.
+
+For example, if the XML document for the books also contained a large list of authors that we aren't using in our processing, we should still release the author nodes immediately to keep from bloating memory:
+
+```xml
+<library>
+  <authors>
+    <author id="1">
+      <name>Herman Melville</name>
+    </author>
+    <author id="2">
+      <name>Edward Gibbon</name>
+    </author>
+    ...
+  </authors>
+  <books>
+    <book isbn="123456">
+      ...
+    </book>
+    ...
+  </books>
+</library>
+```
+
+```ruby
+XmlNodeStream.parse('/tmp/books.xml') do |node|
+  if node.path == '/library/books/book'
+    process_book(node)
+    node.release!
+  elsif node.path == '/library/authors/author'
+    # we don't care about authors so release the nodes immediately
+    node.release!
+  end
+end
+```
+
+A sample 77 MB XML document parsed into Nokogiri consumes over 800 MB of memory. Parsing the same document with XmlNodeStream and releasing top level nodes as they're processed uses less than 1 MB.
+
+### XPath
+
+You can use a subset of the XPath language to navigate nodes. The only parts of XPath implemented are the paths themselves and the text() function. The text() function is useful for getting the value of a node directly from the find or select methods without having to do a nil check on the nodes. For instance, in the above example we can get the name of an author with `node.find('author/text()')` instead of `node.find('author')&.value` or checking if the node exists before accessing its value.
+
+The rest of the XPath language is not implemented since it is a programming language and there is really no need for it since we already have Ruby at our disposal, which is far more powerful than XPath. See the Selector class for details.
+
+## Performance
+
+The performance of XmlNodeStream depends on which underlying XML parser is used. Generally, the native extension based parsers (Nokogiri and LibXML) will perform much better without adding the overhead of XmlNodeStream. The pure Ruby REXML parser will perform much better with XmlNodeStream.
+
+The main benefit of XmlNodeStream is memory efficiency when parsing large documents. By releasing nodes as they are processed, memory usage can be kept low even for very large documents. This reduces memory bloat and keeps your application process size consistent regardless of the size of the XML documents being processed, which can be important in a long-running server process.
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem "xml_node_stream"
+```
+
+Then execute:
+```bash
+$ bundle
+```
+
+Or install it yourself as:
+```bash
+$ gem install xml_node_stream
+```
+
+## Contributing
+
+Open a pull request on [GitHub](https://github.com/bdurand/xml_node_stream).
+
+Please use the [standardrb](https://github.com/testdouble/standard) syntax and lint your code with `standardrb --fix` before submitting.
+
+## License
+
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
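The README's XPath subset (plain paths plus `text()`) is small enough to sketch. The version below is not the gem's `Selector`: `Elem` and `select_path` are hypothetical names invented for this note, and only relative child paths are handled.

```ruby
# Hypothetical sketch of a paths-plus-text() selector; not the gem's Selector.
Elem = Struct.new(:name, :value, :children)

def select_path(node, path)
  steps = path.split("/")
  want_text = steps.last == "text()"
  steps.pop if want_text

  # Walk one path step at a time, keeping every child whose name matches.
  matches = steps.reduce([node]) do |current, step|
    current.flat_map { |n| n.children.select { |c| c.name == step } }
  end
  want_text ? matches.map(&:value) : matches
end

book = Elem.new("book", nil, [
  Elem.new("title", "Moby Dick", []),
  Elem.new("categories", nil, [
    Elem.new("category", "Fiction", []),
    Elem.new("category", "Adventure", [])
  ])
])

titles = select_path(book, "title/text()")                   # ["Moby Dick"]
categories = select_path(book, "categories/category/text()") # ["Fiction", "Adventure"]
missing = select_path(book, "publisher/text()").first        # nil, no error raised
```

Because a `text()` query yields values rather than nodes, a path that matches nothing produces `nil` instead of requiring a nil check before calling `value`.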
data/VERSION
ADDED
@@ -0,0 +1 @@
+2.0.0
|
data/lib/xml_node_stream/http_stream.rb
ADDED
@@ -0,0 +1,179 @@
+# frozen_string_literal: true
+
+require "net/http"
+
+module XmlNodeStream
+  # IO-like wrapper for HTTP responses that allows streaming
+  class HttpStream
+    # Default timeout values in seconds
+    DEFAULT_OPEN_TIMEOUT = 10
+    DEFAULT_READ_TIMEOUT = 60
+
+    # Create a new HttpStream.
+    #
+    # @param uri [URI] the URI to stream from
+    # @param open_timeout [Integer] connection timeout in seconds (default 10)
+    # @param read_timeout [Integer] read timeout in seconds (default 60)
+    def initialize(uri, open_timeout: DEFAULT_OPEN_TIMEOUT, read_timeout: DEFAULT_READ_TIMEOUT)
+      @uri = uri
+      @http = Net::HTTP.new(uri.host, uri.port)
+      @http.use_ssl = (uri.scheme == "https")
+      @http.open_timeout = open_timeout
+      @http.read_timeout = read_timeout
+      @request = Net::HTTP::Get.new(uri.request_uri)
+      @buffer = +""
+      @eof = false
+      @response = nil
+      @body_reader = nil
+    end
+
+    # Read data from the stream.
+    #
+    # @param length [Integer, nil] the number of bytes to read, or nil to read all
+    # @param outbuf [String, nil] optional output buffer
+    # @return [String, nil] the data read, or nil if at EOF
+    def read(length = nil, outbuf = nil)
+      ensure_response_started
+
+      if length.nil?
+        # Read all remaining data
+        result = @buffer.dup
+        while (chunk = read_chunk)
+          result << chunk
+        end
+        @buffer = +""
+        if outbuf
+          outbuf.replace(result)
+          outbuf
+        else
+          result
+        end
+      else
+        # Read specific length
+        while @buffer.bytesize < length && !@eof
+          chunk = read_chunk
+          break if chunk.nil?
+          @buffer << chunk
+        end
+
+        if @buffer.bytesize >= length
+          result = @buffer.byteslice(0, length)
+          @buffer = @buffer.byteslice(length..-1) || +""
+        else
+          result = @buffer.dup
+          @buffer = +""
+        end
+
+        if result.empty? && @eof
+          nil
+        elsif outbuf
+          outbuf.replace(result)
+          outbuf
+        else
+          result
+        end
+      end
+    end
+
+    # Read a line from the stream.
+    #
+    # @param sep [String, nil] the line separator
+    # @param limit [Integer, nil] maximum number of bytes to read
+    # @return [String, nil] the line read, or nil if at EOF
+    def gets(sep = $/, limit = nil)
+      ensure_response_started
+
+      if sep.nil?
+        # Read all
+        return read
+      end
+
+      sep = sep.to_s
+
+      loop do
+        if (idx = @buffer.index(sep))
+          line = @buffer.slice!(0, idx + sep.length)
+          return line
+        end
+
+        break if @eof
+
+        chunk = read_chunk
+        if chunk.nil?
+          break
+        end
+        @buffer << chunk
+      end
+
+      return nil if @buffer.empty?
+
+      line = @buffer
+      @buffer = ""
+      line
+    end
+
+    alias_method :readline, :gets
+
+    # Check if at end of file.
+    #
+    # @return [Boolean] true if at EOF
+    def eof?
+      @eof && @buffer.empty?
+    end
+
+    # Close the stream.
+    #
+    # @return [void]
+    def close
+      @http.finish if @http&.started?
+    rescue
+      # Ignore errors during close to ensure cleanup completes
+      nil
+    end
+
+    # Check if the stream is closed.
+    #
+    # @return [Boolean] true if closed
+    def closed?
+      @http.nil? || !@http.started?
+    end
+
+    # Return self as the IO object for REXML compatibility.
+    #
+    # @return [HttpStream] self
+    def to_io
+      self
+    end
+
+    private
+
+    def ensure_response_started
+      return if @response
+
+      @http.start unless @http.started?
+      @response = @http.request(@request)
+      @body_reader = @response.read_body
+    end
+
+    def read_chunk
+      return nil if @eof
+
+      if @body_reader.is_a?(String)
+        # Entire body was read at once
+        if @body_reader.empty?
+          @eof = true
+          return nil
+        end
+        # Simulate chunking for consistency
+        chunk = @body_reader.byteslice(0, 8192) || +""
+        @body_reader = @body_reader.byteslice(8192..-1) || +""
+        @eof = true if @body_reader.empty?
+        chunk
+      else
+        # Should not happen with webmock but handling for real HTTP
+        @eof = true
+        nil
+      end
+    end
+  end
+end
|
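The fixed-length branch of `HttpStream#read` tops up an internal buffer until it holds `length` bytes, then slices from the front. The same technique can be sketched without any network access. `ChunkBuffer` and its array-of-chunks source are stand-ins invented for this illustration and are not part of the gem:

```ruby
# Illustrative stand-in for the buffering in HttpStream#read(length).
# ChunkBuffer is a hypothetical name; chunks come from an Array, not HTTP.
class ChunkBuffer
  def initialize(chunks)
    @chunks = chunks.dup
    @buffer = +""
  end

  # Returns up to `length` bytes, or nil once everything has been consumed.
  def read(length)
    # Top up the buffer until it can satisfy the request or the source is dry.
    @buffer << @chunks.shift while @buffer.bytesize < length && !@chunks.empty?

    result = @buffer.byteslice(0, length) || +""
    @buffer = @buffer.byteslice(length..-1) || +""
    result.empty? ? nil : result
  end
end

reader = ChunkBuffer.new(["<boo", "ks><bo", "ok/></books>"])
first = reader.read(7)   # spans the first two chunks
rest = reader.read(100)  # whatever is left
```

Slicing by bytes rather than characters matches what IO-style consumers expect from `read(length)`, at the cost of possibly splitting a multibyte character across two reads.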
data/lib/xml_node_stream/node.rb
CHANGED
@@ -1,130 +1,181 @@
+# frozen_string_literal: true
+
 module XmlNodeStream
   # Representation of an XML node.
   class Node
-
     attr_reader :name, :parent
     attr_accessor :value
-
-
+
+    # Create a new Node.
+    #
+    # @param name [String] the name of the node
+    # @param parent [Node, nil] the parent node
+    # @param attributes [Hash, nil] the node attributes
+    # @param value [String, nil] the node value
+    def initialize(name, parent = nil, attributes = nil, value = nil)
       @name = name
       @attributes = attributes
       @parent = parent
-      @parent
+      @parent&.add_child(self)
       @value = value
+      @path = nil
     end
-
+
     # Release a node by removing it from the tree structure so that the Ruby garbage collector can reclaim the memory.
     # This method should be called after you are done with a node. After it is called, the node will be removed from
     # its parent's children and will no longer be accessible.
+    #
+    # @return [void]
     def release!
-      @parent
+      @parent&.remove_child(self)
+      @path = nil
     end
-
+
     # Array of the child nodes of the node.
+    #
+    # @return [Array<Node>] the child nodes
     def children
       @children ||= []
     end
-
+
     # Array of all descendants of the node.
+    #
+    # @return [Array<Node>] all descendant nodes
     def descendants
       if children.empty?
-
+        children
       else
-
+        (children + children.collect { |child| child.descendants }).flatten
       end
     end
 
     # Array of all ancestors of the node.
+    #
+    # @return [Array<Node>] all ancestor nodes
     def ancestors
       if @parent
-
+        [@parent] + @parent.ancestors
       else
-
+        []
       end
     end
-
+
     # Get the attributes of the node as a hash.
+    #
+    # @return [Hash] the node attributes
     def attributes
       @attributes ||= {}
     end
-
+
     # Get the root element of the node tree.
+    #
+    # @return [Node] the root node
     def root
       @parent ? @parent.root : self
     end
-
+
     # Get the full XPath of the node.
+    #
+    # @return [String] the XPath of the node
     def path
-
-
-
-
-        @path = "/#{@name}"
-      end
+      @path ||= if @parent
+        "#{@parent.path}/#{@name}"
+      else
+        "/#{@name}"
       end
-      return @path
     end
-
+
     # Get the value of the node attribute with the given name.
-
-
+    #
+    # @param name [String] the attribute name
+    # @return [String, nil] the attribute value
+    def [](name)
+      @attributes[name] if @attributes
     end
-
+
     # Set the value of the node attribute with the given name.
-
+    #
+    # @param name [String] the attribute name
+    # @param val [String] the attribute value
+    # @return [String] the attribute value
+    def []=(name, val)
       attributes[name] = val
     end
-
+
     # Add a child node.
-
+    #
+    # @param node [Node] the child node to add
+    # @return [void]
+    def add_child(node)
       children << node
       node.instance_variable_set(:@parent, self)
     end
-
+
     # Remove a child node.
-
+    #
+    # @param node [Node] the child node to remove
+    # @return [Node, nil] the removed node or nil
+    def remove_child(node)
       if @children
         if @children.delete(node)
           node.instance_variable_set(:@parent, nil)
         end
       end
     end
-
+
     # Get the first child node.
+    #
+    # @return [Node, nil] the first child node or nil
     def first_child
-      @children
+      @children&.first
     end
-
+
     # Find the first node that matches the given XPath. See Selector for details.
-
+    #
+    # @param selector [String, Selector] the XPath selector
+    # @return [Node, nil] the first matching node or nil
+    def find(selector)
       select(selector).first
     end
-
+
     # Find all nodes that match the given XPath. See Selector for details.
-
+    #
+    # @param selector [String, Selector] the XPath selector
+    # @return [Array<Node>] all matching nodes
+    def select(selector)
       selector = selector.is_a?(Selector) ? selector : Selector.new(selector)
-
+      selector.find(self)
     end
-
+
     # Append CDATA to the node value.
-
+    #
+    # @param text [String] the CDATA text to append
+    # @return [void]
+    def append_cdata(text)
       append(text, false)
     end
-
+
     # Append text to the node value. If strip_whitespace is true, whitespace at the beginning and end
     # of the node value will be removed.
-
+    #
+    # @param text [String] the text to append
+    # @param strip_whitespace [Boolean] whether to strip whitespace
+    # @return [void]
+    def append(text, strip_whitespace = true)
       if text
-        @value ||=
+        @value ||= +""
         @last_strip_whitespace = strip_whitespace
-        text = text.lstrip if @value.length == 0
+        text = text.lstrip if @value.length == 0 && strip_whitespace
         @value << text if text.length > 0
       end
     end
-
+
     # Called after end tag to ensure that whitespace at the end of the string is properly stripped.
-
-
+    #
+    # @return [void]
+    # @api private
+    def finish!
+      @value.rstrip! if @value && @last_strip_whitespace
     end
   end
 end
|
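In 2.0.0, `Node#path` memoizes the XPath string by walking up through `@parent`, and `release!` clears that cache after detaching the node. A cut-down sketch of just those two behaviors follows. `MiniNode` is a hypothetical class written for this note and omits attributes, values, and selectors:

```ruby
# Hypothetical, cut-down stand-in for XmlNodeStream::Node.
class MiniNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    @path = nil
    parent&.children&.push(self)
  end

  # Build the path lazily and cache it, as the 2.0.0 Node#path does.
  def path
    @path ||= @parent ? "#{@parent.path}/#{@name}" : "/#{@name}"
  end

  # Detach from the parent so the subtree can be garbage collected,
  # and drop the cached path because it no longer applies.
  def release!
    @parent&.children&.delete(self)
    @parent = nil
    @path = nil
  end
end

books = MiniNode.new("books")
book = MiniNode.new("book", books)
before = book.path   # "/books/book"
book.release!
after = book.path    # "/book", since the node no longer has a parent
```

This is why the README examples compare `node.path` before calling `release!`: once a node is detached, its path is recomputed without the former ancestors.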
|
data/lib/xml_node_stream/parser/base.rb
CHANGED
@@ -1,38 +1,75 @@
+# frozen_string_literal: true
+
 module XmlNodeStream
   class Parser
     # This is the base parser syntax that normalizes the SAX callbacks by providing a common interface
     # so that the actual parser implementation doesn't matter.
     module Base
-
       attr_reader :root
-
-
+
+      # Initialize the parser.
+      #
+      # @yield [Node] each node as it is parsed
+      def initialize(&block)
         @nodes = []
         @parse_block = block
         @root = nil
       end
-
-
-
+
+      # Parse the input stream.
+      #
+      # @param io [IO] the input stream to parse
+      # @return [void]
+      # @raise [NotImplementedError] if the parser gem is not loaded
+      def parse_stream(io)
+        parser_name = self.class.name.split("::").last.sub("Parser", "").downcase
+        gem_name = case parser_name
+        when "nokogiri" then "nokogiri"
+        when "libxml" then "libxml-ruby"
+        when "rexml" then "rexml"
+        else "unknown"
+        end
+        raise NotImplementedError.new("Parser gem not loaded: #{gem_name}. Install it with: gem install #{gem_name.split(" ").first}")
       end
-
-
+
+      # Handle start element event.
+      #
+      # @param name [String] the element name
+      # @param attributes [Hash] the element attributes
+      # @return [void]
+      # @api private
+      def do_start_element(name, attributes)
         node = XmlNodeStream::Node.new(name, @nodes.last, attributes)
         @nodes.push(node)
       end
 
-
+      # Handle end element event.
+      #
+      # @param name [String] the element name
+      # @return [void]
+      # @api private
+      def do_end_element(name)
         node = @nodes.pop
         node.finish!
         @root = node if @nodes.empty?
-        @parse_block
+        @parse_block&.call(node)
       end
 
-
+      # Handle character data event.
+      #
+      # @param characters [String] the character data
+      # @return [void]
+      # @api private
+      def do_characters(characters)
         @nodes.last.append(characters) unless @nodes.empty?
       end
 
-
+      # Handle CDATA block event.
+      #
+      # @param characters [String] the CDATA content
+      # @return [void]
+      # @api private
+      def do_cdata_block(characters)
         @nodes.last.append_cdata(characters) unless @nodes.empty?
       end
     end
|
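`Base` turns flat SAX events into nodes with a stack: a start tag pushes a node, character data is appended to the node on top, and an end tag pops it and yields it to the block. A minimal sketch of that flow, driven by a hand-written event list rather than a real parser. `StackBuilder`, `SimpleNode`, and the event tuples are hypothetical and exist only for this illustration:

```ruby
# Hypothetical illustration of the push/pop flow in Parser::Base.
class StackBuilder
  SimpleNode = Struct.new(:name, :parent, :value)

  def initialize(&block)
    @stack = []
    @on_node = block
  end

  # A start tag opens a node whose parent is the current top of the stack.
  def start_element(name)
    @stack.push(SimpleNode.new(name, @stack.last, +""))
  end

  # Character data belongs to the innermost open element.
  def characters(text)
    @stack.last.value << text unless @stack.empty?
  end

  # Pop the finished node and hand it to the caller's block, if any.
  def end_element(_name)
    node = @stack.pop
    @on_node&.call(node)
  end
end

finished = []
builder = StackBuilder.new { |node| finished << node.name }
events = [[:start, "books"], [:start, "book"], [:text, "Moby Dick"],
  [:end, "book"], [:end, "books"]]
events.each do |type, arg|
  case type
  when :start then builder.start_element(arg)
  when :text then builder.characters(arg)
  when :end then builder.end_element(arg)
  end
end
# Inner elements finish first, so finished is ["book", "books"].
```

Yielding on the end tag means a node reaches the block only after all of its children and text have been collected.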