xml_node_stream 1.0.2 → 2.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c48b75cdd974d95227eaccd1748c73d6e04d0793d6d9fe6e5fd0a2c9dfecfb32
4
+ data.tar.gz: 47c1effbbe39895d2cac07f545904a27760c0606406daf0b2abfaf5139f2f4ae
5
+ SHA512:
6
+ metadata.gz: e50b6c273e787053f8020731c9e11830a19f3e047ce230e417fa927ae51cd274ca64b8921489a98831ba77aeef34d050fd309496c1018ee482fcf59abca671f7
7
+ data.tar.gz: d8e2040f6d4117b29054813eb7741bd15567d374736df80a1cbaeff25fe9692848b2de7e44f0bfec34234d9e6bb5229bfb120023c8a5f7b057a614a0b8dc9fc9
data/CHANGELOG.md ADDED
@@ -0,0 +1,30 @@
1
+ # Changelog
2
+ All notable changes to this project will be documented in this file.
3
+
4
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
5
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## 2.0.0
8
+
9
+ ### Changed
10
+
11
+ - Updated minimum Ruby version to 2.7.
12
+ - Only supports passing in http and https URLs as URIs. The previous behavior of calling `Kernel#open` was removed as a potential security risk.
13
+
14
+ ## 1.0.2
15
+
16
+ ### Changed
17
+
18
+ - Update to work with latest versions of Nokogiri and LibXML
19
+
20
+ ## 1.0.1
21
+
22
+ ### Fixed
23
+
24
+ - Fixes to Rakefile so it loads without rspec
25
+
26
+ ## 1.0.0
27
+
28
+ ### Added
29
+
30
+ - Initial release.
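
For callers upgrading across the 2.0.0 change above, one option is to check a source before handing it to the parser. The following is a minimal sketch, assuming a hypothetical helper named `http_source` that is not part of the gem and uses only Ruby's standard `uri` library:

```ruby
require "uri"

# Hypothetical helper (not part of xml_node_stream): returns a URI object
# only when the source is an http or https URL, and nil otherwise.
def http_source(source)
  uri = source.is_a?(URI::Generic) ? source : URI.parse(source.to_s)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) ? uri : nil
rescue URI::InvalidURIError
  # Strings that are not valid URIs (for example pipe commands) land here.
  nil
end

http_source("https://example.com/books.xml") # a URI::HTTPS instance
http_source("/tmp/books.xml")                # nil, a plain path
```

A `nil` result means the source is not an http or https URL; how such sources should be handled is left to the application.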
data/README.md ADDED
@@ -0,0 +1,139 @@
1
+ # XML Node Stream
2
+
3
+ [![Continuous Integration](https://github.com/bdurand/xml_node_stream/actions/workflows/continuous_integration.yml/badge.svg)](https://github.com/bdurand/xml_node_stream/actions/workflows/continuous_integration.yml)
4
+ [![Ruby Style Guide](https://img.shields.io/badge/code_style-standard-brightgreen.svg)](https://github.com/testdouble/standard)
5
+ [![Gem Version](https://badge.fury.io/rb/xml_node_stream.svg)](https://badge.fury.io/rb/xml_node_stream)
6
+
7
+ This gem provides a very easy to use XML parser that provides the benefits of both stream parsing (i.e. SAX) and document parsing (i.e. DOM). In addition, it provides a unified parsing language for each of the major Ruby XML parsers (REXML, Nokogiri, and LibXML) so that your code doesn't have to be bound to a particular XML library.
8
+
9
+ ## Usage
10
+
11
+ The primary purpose of this gem is to facilitate parsing large XML files (i.e. several megabytes in size). Often, reading these files into a document structure is not feasible because the whole document must be read into memory. Stream/SAX parsing solves this issue by reading in the file incrementally and providing callbacks for various events. This method can be quite painful to deal with for any sort of complex document structure.
12
+
13
+ This gem attempts to solve both of these issues by combining the best features of both. Parsing is performed by a stream parser that constructs document style nodes and calls back to the application code with these nodes. When your application is done with a node, it can release it to free up memory and keep your heap from bloating.
14
+
15
+ In order to keep the interface simple and universal, only XML elements and text nodes are supported. XML processing instructions and comments will be ignored.
16
+
17
+ ### Examples
18
+
19
+ Suppose we have a file with every book in the world in it:
20
+
21
+ ```xml
22
+ <books>
23
+ <book isbn="123456">
24
+ <title>Moby Dick</title>
25
+ <author>Herman Melville</author>
26
+ <categories>
27
+ <category>Fiction</category>
28
+ <category>Adventure</category>
29
+ </categories>
30
+ </book>
31
+ <book isbn="98765643">
32
+ <title>The Decline and Fall of the Roman Empire</title>
33
+ <author>Edward Gibbon</author>
34
+ <categories>
35
+ <category>History</category>
36
+ <category>Ancient</category>
37
+ </categories>
38
+ </book>
39
+ ...
40
+ </books>
41
+ ```
42
+
43
+ Reading the whole file into memory will cause problems as it bloats the heap with potentially gigabytes of data. This can be solved by using a streaming parser, but that code can be a pain to write and maintain.
44
+
45
+ We can use `XmlNodeStream` to get the best of both worlds. The file is streamed into memory for processing and released when we are done with it, yet we still get node data structures that make the document much simpler to work with.
46
+
47
+ ```ruby
48
+ XmlNodeStream.parse('/tmp/books.xml') do |node|
49
+ if node.path == '/books/book'
50
+ book = Book.new
51
+ book.isbn = node['isbn']
52
+ book.title = node.find('title').value
53
+ book.author = node.find('author/text()')
54
+ book.categories = node.select('categories/category/text()')
55
+ book.save
56
+ node.release!
57
+ end
58
+ end
59
+ ```
60
+
61
+ ### Releasing Nodes
62
+
63
+ In the above example, what prevents memory bloat when parsing a large document is the call to `node.release!`. This call removes the node from the node tree. The general practice is to look for the higher level nodes you are interested in, process them, and then release them immediately. If there are nodes you don't care about at all, those should be released immediately as well.
64
+
65
+ For example, if the XML document for the books also contained a large list of authors that we aren't using in our processing, we should still release the author nodes immediately to keep from bloating memory:
66
+
67
+ ```xml
68
+ <library>
69
+ <authors>
70
+ <author id="1">
71
+ <name>Herman Melville</name>
72
+ </author>
73
+ <author id="2">
74
+ <name>Edward Gibbon</name>
75
+ </author>
76
+ ...
77
+ </authors>
78
+ <books>
79
+ <book isbn="123456">
80
+ ...
81
+ </book>
82
+ ...
83
+ </books>
84
+ </library>
85
+ ```
86
+
87
+ ```ruby
88
+ XmlNodeStream.parse('/tmp/books.xml') do |node|
89
+ if node.path == '/library/books/book'
90
+ process_book(node)
91
+ node.release!
92
+ elsif node.path == '/library/authors/author'
93
+ # we don't care about authors so release the nodes immediately
94
+ node.release!
95
+ end
96
+ end
97
+ ```
98
+
99
+ A sample 77 MB XML document parsed into Nokogiri consumes over 800 MB of memory. Parsing the same document with XmlNodeStream and releasing top level nodes as they're processed uses less than 1 MB.
100
+
101
+ ### XPath
102
+
103
+ You can use a subset of the XPath language to navigate nodes. The only parts of XPath implemented are the paths themselves and the text() function. The text() function is useful for getting the value of a node directly from the find or select methods without having to do a nil check on the nodes. For instance, in the above example we can get the name of an author with `node.find('author/text()')` instead of `node.find('author')&.value` or checking if the node exists before accessing its value.
104
+
105
+ The rest of the XPath language is not implemented. XPath is a programming language in its own right, and there is little need for it here since Ruby, which is far more powerful, is already at our disposal. See the Selector class for details.
106
+
107
+ ## Performance
108
+
109
+ The performance of XmlNodeStream depends on which underlying XML parser is used. Generally, the native extension based parsers (Nokogiri and LibXML) will perform much better without the added overhead of XmlNodeStream. The pure Ruby REXML parser will perform much better with XmlNodeStream.
110
+
111
+ The main benefit of XmlNodeStream is memory efficiency when parsing large documents. By releasing nodes as they are processed, memory usage can be kept low even for very large documents. This reduces memory bloat and keeps your application process size consistent regardless of the size of the XML documents being processed which can be important in a long running server process.
112
+
113
+ ## Installation
114
+
115
+ Add this line to your application's Gemfile:
116
+
117
+ ```ruby
118
+ gem "xml_node_stream"
119
+ ```
120
+
121
+ Then execute:
122
+ ```bash
123
+ $ bundle
124
+ ```
125
+
126
+ Or install it yourself as:
127
+ ```bash
128
+ $ gem install xml_node_stream
129
+ ```
130
+
131
+ ## Contributing
132
+
133
+ Open a pull request on [GitHub](https://github.com/bdurand/xml_node_stream).
134
+
135
+ Please use the [standardrb](https://github.com/testdouble/standard) syntax and lint your code with `standardrb --fix` before submitting.
136
+
137
+ ## License
138
+
139
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 2.0.0
@@ -0,0 +1,179 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "net/http"
4
+
5
+ module XmlNodeStream
6
+ # IO-like wrapper for HTTP responses that allows streaming
7
+ class HttpStream
8
+ # Default timeout values in seconds
9
+ DEFAULT_OPEN_TIMEOUT = 10
10
+ DEFAULT_READ_TIMEOUT = 60
11
+
12
+ # Create a new HttpStream.
13
+ #
14
+ # @param uri [URI] the URI to stream from
15
+ # @param open_timeout [Integer] connection timeout in seconds (default 10)
16
+ # @param read_timeout [Integer] read timeout in seconds (default 60)
17
+ def initialize(uri, open_timeout: DEFAULT_OPEN_TIMEOUT, read_timeout: DEFAULT_READ_TIMEOUT)
18
+ @uri = uri
19
+ @http = Net::HTTP.new(uri.host, uri.port)
20
+ @http.use_ssl = (uri.scheme == "https")
21
+ @http.open_timeout = open_timeout
22
+ @http.read_timeout = read_timeout
23
+ @request = Net::HTTP::Get.new(uri.request_uri)
24
+ @buffer = +""
25
+ @eof = false
26
+ @response = nil
27
+ @body_reader = nil
28
+ end
29
+
30
+ # Read data from the stream.
31
+ #
32
+ # @param length [Integer, nil] the number of bytes to read, or nil to read all
33
+ # @param outbuf [String, nil] optional output buffer
34
+ # @return [String, nil] the data read, or nil if at EOF
35
+ def read(length = nil, outbuf = nil)
36
+ ensure_response_started
37
+
38
+ if length.nil?
39
+ # Read all remaining data
40
+ result = @buffer.dup
41
+ while (chunk = read_chunk)
42
+ result << chunk
43
+ end
44
+ @buffer = +""
45
+ if outbuf
46
+ outbuf.replace(result)
47
+ outbuf
48
+ else
49
+ result
50
+ end
51
+ else
52
+ # Read specific length
53
+ while @buffer.bytesize < length && !@eof
54
+ chunk = read_chunk
55
+ break if chunk.nil?
56
+ @buffer << chunk
57
+ end
58
+
59
+ if @buffer.bytesize >= length
60
+ result = @buffer.byteslice(0, length)
61
+ @buffer = @buffer.byteslice(length..-1) || +""
62
+ else
63
+ result = @buffer.dup
64
+ @buffer = +""
65
+ end
66
+
67
+ if result.empty? && @eof
68
+ nil
69
+ elsif outbuf
70
+ outbuf.replace(result)
71
+ outbuf
72
+ else
73
+ result
74
+ end
75
+ end
76
+ end
77
+
78
+ # Read a line from the stream.
79
+ #
80
+ # @param sep [String, nil] the line separator
81
+ # @param limit [Integer, nil] maximum number of bytes to read
82
+ # @return [String, nil] the line read, or nil if at EOF
83
+ def gets(sep = $/, limit = nil)
84
+ ensure_response_started
85
+
86
+ if sep.nil?
87
+ # Read all
88
+ return read
89
+ end
90
+
91
+ sep = sep.to_s
92
+
93
+ loop do
94
+ if (idx = @buffer.index(sep))
95
+ line = @buffer.slice!(0, idx + sep.length)
96
+ return line
97
+ end
98
+
99
+ break if @eof
100
+
101
+ chunk = read_chunk
102
+ if chunk.nil?
103
+ break
104
+ end
105
+ @buffer << chunk
106
+ end
107
+
108
+ return nil if @buffer.empty?
109
+
110
+ line = @buffer
111
+ @buffer = +""
112
+ line
113
+ end
114
+
115
+ alias_method :readline, :gets
116
+
117
+ # Check if at end of file.
118
+ #
119
+ # @return [Boolean] true if at EOF
120
+ def eof?
121
+ @eof && @buffer.empty?
122
+ end
123
+
124
+ # Close the stream.
125
+ #
126
+ # @return [void]
127
+ def close
128
+ @http.finish if @http&.started?
129
+ rescue
130
+ # Ignore errors during close to ensure cleanup completes
131
+ nil
132
+ end
133
+
134
+ # Check if the stream is closed.
135
+ #
136
+ # @return [Boolean] true if closed
137
+ def closed?
138
+ @http.nil? || !@http.started?
139
+ end
140
+
141
+ # Return self as the IO object for REXML compatibility.
142
+ #
143
+ # @return [HttpStream] self
144
+ def to_io
145
+ self
146
+ end
147
+
148
+ private
149
+
150
+ def ensure_response_started
151
+ return if @response
152
+
153
+ @http.start unless @http.started?
154
+ @response = @http.request(@request)
155
+ @body_reader = @response.read_body
156
+ end
157
+
158
+ def read_chunk
159
+ return nil if @eof
160
+
161
+ if @body_reader.is_a?(String)
162
+ # Entire body was read at once
163
+ if @body_reader.empty?
164
+ @eof = true
165
+ return nil
166
+ end
167
+ # Simulate chunking for consistency
168
+ chunk = @body_reader.byteslice(0, 8192) || +""
169
+ @body_reader = @body_reader.byteslice(8192..-1) || +""
170
+ @eof = true if @body_reader.empty?
171
+ chunk
172
+ else
173
+ # Should not happen with webmock but handling for real HTTP
174
+ @eof = true
175
+ nil
176
+ end
177
+ end
178
+ end
179
+ end
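
The buffering behind `read(length)` in the file above can be exercised without a network connection. The following is a minimal sketch, assuming a hypothetical `ChunkReader` class (not part of the gem) that pulls its chunks from an array instead of an HTTP response and keeps only the fixed-length branch of `read`:

```ruby
# Hypothetical stand-in for HttpStream: same buffer handling, but the
# chunks come from any enumerable instead of a Net::HTTP response.
class ChunkReader
  def initialize(chunks)
    @chunks = chunks.each # external enumerator over the chunks
    @buffer = +""         # unfrozen so it can be appended to
    @eof = false
  end

  # Return up to `length` bytes, or nil once all data has been consumed.
  def read(length)
    fill(length)
    return nil if @buffer.empty? && @eof

    result = @buffer.byteslice(0, length)
    # byteslice returns nil when the start offset is past the end.
    @buffer = @buffer.byteslice(length..-1) || +""
    result
  end

  private

  # Pull chunks until the buffer holds `length` bytes or the source is done.
  def fill(length)
    while @buffer.bytesize < length && !@eof
      begin
        @buffer << @chunks.next
      rescue StopIteration
        @eof = true
      end
    end
  end
end

reader = ChunkReader.new(["<a>", "hello", "</a>"])
first = reader.read(4)
rest = reader.read(100)
done = reader.read(1)
```

Reads are sized in bytes rather than characters, and a read after the data is exhausted returns `nil`, which is what callers of an IO-like object generally expect for a positive length.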
@@ -1,130 +1,181 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module XmlNodeStream
2
4
  # Representation of an XML node.
3
5
  class Node
4
-
5
6
  attr_reader :name, :parent
6
7
  attr_accessor :value
7
-
8
- def initialize (name, parent = nil, attributes = nil, value = nil)
8
+
9
+ # Create a new Node.
10
+ #
11
+ # @param name [String] the name of the node
12
+ # @param parent [Node, nil] the parent node
13
+ # @param attributes [Hash, nil] the node attributes
14
+ # @param value [String, nil] the node value
15
+ def initialize(name, parent = nil, attributes = nil, value = nil)
9
16
  @name = name
10
17
  @attributes = attributes
11
18
  @parent = parent
12
- @parent.add_child(self) if @parent
19
+ @parent&.add_child(self)
13
20
  @value = value
21
+ @path = nil
14
22
  end
15
-
23
+
16
24
  # Release a node by removing it from the tree structure so that the Ruby garbage collector can reclaim the memory.
17
25
  # This method should be called after you are done with a node. After it is called, the node will be removed from
18
26
  # its parent's children and will no longer be accessible.
27
+ #
28
+ # @return [void]
19
29
  def release!
20
- @parent.remove_child(self) if @parent
30
+ @parent&.remove_child(self)
31
+ @path = nil
21
32
  end
22
-
33
+
23
34
  # Array of the child nodes of the node.
35
+ #
36
+ # @return [Array<Node>] the child nodes
24
37
  def children
25
38
  @children ||= []
26
39
  end
27
-
40
+
28
41
  # Array of all descendants of the node.
42
+ #
43
+ # @return [Array<Node>] all descendant nodes
29
44
  def descendants
30
45
  if children.empty?
31
- return children
46
+ children
32
47
  else
33
- return (children + children.collect{|child| child.descendants}).flatten
48
+ (children + children.collect { |child| child.descendants }).flatten
34
49
  end
35
50
  end
36
51
 
37
52
  # Array of all ancestors of the node.
53
+ #
54
+ # @return [Array<Node>] all ancestor nodes
38
55
  def ancestors
39
56
  if @parent
40
- return [@parent] + @parent.ancestors
57
+ [@parent] + @parent.ancestors
41
58
  else
42
- return []
59
+ []
43
60
  end
44
61
  end
45
-
62
+
46
63
  # Get the attributes of the node as a hash.
64
+ #
65
+ # @return [Hash] the node attributes
47
66
  def attributes
48
67
  @attributes ||= {}
49
68
  end
50
-
69
+
51
70
  # Get the root element of the node tree.
71
+ #
72
+ # @return [Node] the root node
52
73
  def root
53
74
  @parent ? @parent.root : self
54
75
  end
55
-
76
+
56
77
  # Get the full XPath of the node.
78
+ #
79
+ # @return [String] the XPath of the node
57
80
  def path
58
- unless @path
59
- if @parent
60
- @path = "#{@parent.path}/#{@name}"
61
- else
62
- @path = "/#{@name}"
63
- end
81
+ @path ||= if @parent
82
+ "#{@parent.path}/#{@name}"
83
+ else
84
+ "/#{@name}"
64
85
  end
65
- return @path
66
86
  end
67
-
87
+
68
88
  # Get the value of the node attribute with the given name.
69
- def [] (name)
70
- return @attributes[name] if @attributes
89
+ #
90
+ # @param name [String] the attribute name
91
+ # @return [String, nil] the attribute value
92
+ def [](name)
93
+ @attributes[name] if @attributes
71
94
  end
72
-
95
+
73
96
  # Set the value of the node attribute with the given name.
74
- def []= (name, val)
97
+ #
98
+ # @param name [String] the attribute name
99
+ # @param val [String] the attribute value
100
+ # @return [String] the attribute value
101
+ def []=(name, val)
75
102
  attributes[name] = val
76
103
  end
77
-
104
+
78
105
  # Add a child node.
79
- def add_child (node)
106
+ #
107
+ # @param node [Node] the child node to add
108
+ # @return [void]
109
+ def add_child(node)
80
110
  children << node
81
111
  node.instance_variable_set(:@parent, self)
82
112
  end
83
-
113
+
84
114
  # Remove a child node.
85
- def remove_child (node)
115
+ #
116
+ # @param node [Node] the child node to remove
117
+ # @return [Node, nil] the removed node or nil
118
+ def remove_child(node)
86
119
  if @children
87
120
  if @children.delete(node)
88
121
  node.instance_variable_set(:@parent, nil)
89
122
  end
90
123
  end
91
124
  end
92
-
125
+
93
126
  # Get the first child node.
127
+ #
128
+ # @return [Node, nil] the first child node or nil
94
129
  def first_child
95
- @children.first if @children
130
+ @children&.first
96
131
  end
97
-
132
+
98
133
  # Find the first node that matches the given XPath. See Selector for details.
99
- def find (selector)
134
+ #
135
+ # @param selector [String, Selector] the XPath selector
136
+ # @return [Node, nil] the first matching node or nil
137
+ def find(selector)
100
138
  select(selector).first
101
139
  end
102
-
140
+
103
141
  # Find all nodes that match the given XPath. See Selector for details.
104
- def select (selector)
142
+ #
143
+ # @param selector [String, Selector] the XPath selector
144
+ # @return [Array<Node>] all matching nodes
145
+ def select(selector)
105
146
  selector = selector.is_a?(Selector) ? selector : Selector.new(selector)
106
- return selector.find(self)
147
+ selector.find(self)
107
148
  end
108
-
149
+
109
150
  # Append CDATA to the node value.
110
- def append_cdata (text)
151
+ #
152
+ # @param text [String] the CDATA text to append
153
+ # @return [void]
154
+ def append_cdata(text)
111
155
  append(text, false)
112
156
  end
113
-
157
+
114
158
  # Append text to the node value. If strip_whitespace is true, whitespace at the beginning and end
115
159
  # of the node value will be removed.
116
- def append (text, strip_whitespace = true)
160
+ #
161
+ # @param text [String] the text to append
162
+ # @param strip_whitespace [Boolean] whether to strip whitespace
163
+ # @return [void]
164
+ def append(text, strip_whitespace = true)
117
165
  if text
118
- @value ||= ''
166
+ @value ||= +""
119
167
  @last_strip_whitespace = strip_whitespace
120
- text = text.lstrip if @value.length == 0 and strip_whitespace
168
+ text = text.lstrip if @value.length == 0 && strip_whitespace
121
169
  @value << text if text.length > 0
122
170
  end
123
171
  end
124
-
172
+
125
173
  # Called after end tag to ensure that whitespace at the end of the string is properly stripped.
126
- def finish! #:nodoc
127
- @value.rstrip! if @value and @last_strip_whitespace
174
+ #
175
+ # @return [void]
176
+ # @api private
177
+ def finish!
178
+ @value.rstrip! if @value && @last_strip_whitespace
128
179
  end
129
180
  end
130
181
  end
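
One visible effect of clearing `@path` in `release!` is that a released node no longer has a parent, so its path is recomputed relative to itself. The following is a minimal sketch, assuming a simplified stand-in class named `SketchNode` (hypothetical, trimmed down from the diff above) rather than the gem's own `Node`:

```ruby
# Hypothetical, trimmed-down stand-in for XmlNodeStream::Node that keeps
# only the parent/child links, the memoized path, and release!.
class SketchNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @parent&.children&.push(self)
    @path = nil
  end

  def children
    @children ||= []
  end

  # Memoized: computed on first use and cached until release! clears it.
  def path
    @path ||= @parent ? "#{@parent.path}/#{@name}" : "/#{@name}"
  end

  # Detach from the parent and drop the cached path.
  def release!
    @parent&.children&.delete(self)
    @parent = nil
    @path = nil
  end
end

root = SketchNode.new("books")
book = SketchNode.new("book", root)
before = book.path # "/books/book"
book.release!
after = book.path  # "/book", since the node is now detached
```

Code that needs the full path of a node should therefore read it before calling `release!`, as the README example does.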
@@ -1,38 +1,75 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module XmlNodeStream
2
4
  class Parser
3
5
  # This is the base parser syntax that normalizes the SAX callbacks by providing a common interface
4
6
  # so that the actual parser implementation doesn't matter.
5
7
  module Base
6
-
7
8
  attr_reader :root
8
-
9
- def initialize (&block)
9
+
10
+ # Initialize the parser.
11
+ #
12
+ # @yield [Node] each node as it is parsed
13
+ def initialize(&block)
10
14
  @nodes = []
11
15
  @parse_block = block
12
16
  @root = nil
13
17
  end
14
-
15
- def parse_stream (io)
16
- raise NotImplementedError.new("could not load gem")
18
+
19
+ # Parse the input stream.
20
+ #
21
+ # @param io [IO] the input stream to parse
22
+ # @return [void]
23
+ # @raise [NotImplementedError] if the parser gem is not loaded
24
+ def parse_stream(io)
25
+ parser_name = self.class.name.split("::").last.sub("Parser", "").downcase
26
+ gem_name = case parser_name
27
+ when "nokogiri" then "nokogiri"
28
+ when "libxml" then "libxml-ruby"
29
+ when "rexml" then "rexml"
30
+ else "unknown"
31
+ end
32
+ raise NotImplementedError.new("Parser gem not loaded: #{gem_name}. Install it with: gem install #{gem_name.split(" ").first}")
17
33
  end
18
-
19
- def do_start_element (name, attributes)
34
+
35
+ # Handle start element event.
36
+ #
37
+ # @param name [String] the element name
38
+ # @param attributes [Hash] the element attributes
39
+ # @return [void]
40
+ # @api private
41
+ def do_start_element(name, attributes)
20
42
  node = XmlNodeStream::Node.new(name, @nodes.last, attributes)
21
43
  @nodes.push(node)
22
44
  end
23
45
 
24
- def do_end_element (name)
46
+ # Handle end element event.
47
+ #
48
+ # @param name [String] the element name
49
+ # @return [void]
50
+ # @api private
51
+ def do_end_element(name)
25
52
  node = @nodes.pop
26
53
  node.finish!
27
54
  @root = node if @nodes.empty?
28
- @parse_block.call(node) if @parse_block
55
+ @parse_block&.call(node)
29
56
  end
30
57
 
31
- def do_characters (characters)
58
+ # Handle character data event.
59
+ #
60
+ # @param characters [String] the character data
61
+ # @return [void]
62
+ # @api private
63
+ def do_characters(characters)
32
64
  @nodes.last.append(characters) unless @nodes.empty?
33
65
  end
34
66
 
35
- def do_cdata_block (characters)
67
+ # Handle CDATA block event.
68
+ #
69
+ # @param characters [String] the CDATA content
70
+ # @return [void]
71
+ # @api private
72
+ def do_cdata_block(characters)
36
73
  @nodes.last.append_cdata(characters) unless @nodes.empty?
37
74
  end
38
75
  end
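
The stack handling in `do_start_element`, `do_characters`, and `do_end_element` can be traced by feeding events in by hand. The following is a minimal sketch, assuming hypothetical stand-ins `MiniNode` and `MiniHandler` (neither is part of the gem) and leaving out attributes and whitespace stripping:

```ruby
# Hypothetical stand-in for Node: just a name, a parent, and a text value.
MiniNode = Struct.new(:name, :parent, :value) do
  def path
    parent ? "#{parent.path}/#{name}" : "/#{name}"
  end
end

# Hypothetical stand-in for Parser::Base: a stack of open elements plus a
# callback that fires each time an element is closed.
class MiniHandler
  def initialize(&block)
    @nodes = []
    @parse_block = block
  end

  def start_element(name)
    # The element on top of the stack becomes the parent of the new node.
    @nodes.push(MiniNode.new(name, @nodes.last, +""))
  end

  def characters(text)
    @nodes.last.value << text unless @nodes.empty?
  end

  def end_element
    node = @nodes.pop
    @parse_block&.call(node)
  end
end

seen = []
handler = MiniHandler.new { |node| seen << [node.path, node.value] }
handler.start_element("books")
handler.start_element("book")
handler.start_element("title")
handler.characters("Moby Dick")
handler.end_element # closes title
handler.end_element # closes book
handler.end_element # closes books
```

The callback fires innermost element first, so a handler that matches on `/books/book` sees the book only after its `title` child has been fully built.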