RubyGems - xml_node_stream - Versions diffs - 1.0.2 → 2.0.0 - Mend

xml_node_stream 1.0.2 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

checksums.yaml +7 -0
data/CHANGELOG.md +30 -0
data/README.md +139 -0
data/VERSION +1 -0
data/lib/xml_node_stream/http_stream.rb +179 -0
data/lib/xml_node_stream/node.rb +98 -47
data/lib/xml_node_stream/parser/base.rb +49 -12
data/lib/xml_node_stream/parser/libxml_parser.rb +36 -9
data/lib/xml_node_stream/parser/nokogiri_parser.rb +42 -12
data/lib/xml_node_stream/parser/rexml_parser.rb +35 -8
data/lib/xml_node_stream/parser.rb +54 -29
data/lib/xml_node_stream/selector.rb +144 -34
data/lib/xml_node_stream.rb +18 -5
data/xml_node_stream.gemspec +39 -0
metadata +46 -88
data/README.rdoc +0 -61
data/Rakefile +0 -44
data/spec/node_spec.rb +0 -140
data/spec/parser_spec.rb +0 -148
data/spec/selector_spec.rb +0 -73
data/spec/spec_helper.rb +0 -2
data/spec/test.xml +0 -57
data/spec/xml_node_stream_spec.rb +0 -11
/data/{MIT_LICENSE → MIT-LICENSE} +0 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: c48b75cdd974d95227eaccd1748c73d6e04d0793d6d9fe6e5fd0a2c9dfecfb32
+  data.tar.gz: 47c1effbbe39895d2cac07f545904a27760c0606406daf0b2abfaf5139f2f4ae
+SHA512:
+  metadata.gz: e50b6c273e787053f8020731c9e11830a19f3e047ce230e417fa927ae51cd274ca64b8921489a98831ba77aeef34d050fd309496c1018ee482fcf59abca671f7
+  data.tar.gz: d8e2040f6d4117b29054813eb7741bd15567d374736df80a1cbaeff25fe9692848b2de7e44f0bfec34234d9e6bb5229bfb120023c8a5f7b057a614a0b8dc9fc9

data/CHANGELOG.md ADDED

@@ -0,0 +1,30 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## 2.0.0
+### Changed
+- Updated minimum Ruby version to 2.7.
+- Only supports passing in http and https URLs as URIs. The previous behavior of calling `Kernel#open` was removed as a potential security risk.
+## 1.0.2
+### Changed
+- Update to work with latest versions of Nokogiri and LibXML
+## 1.0.1
+### Fixed
+- Fixes to Rakefile so it loads without rspec
+## 1.0.0
+### Added
+- Initial release.
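
The 2.0.0 entry above says that only http and https URLs passed as URIs are supported. The diff does not show how the gem performs that check, so the following is only a minimal sketch of one way to express such a guard with the standard `uri` library. The helper name `http_source?` is hypothetical and not part of the gem.

```ruby
require "uri"

# Hypothetical helper (not part of xml_node_stream): accept only http/https
# URI objects, never plain strings.
def http_source?(source)
  # URI::HTTPS is a subclass of URI::HTTP, so one check covers both schemes.
  source.is_a?(URI::HTTP)
end

http_source?(URI("https://example.com/books.xml")) # an https URI object
http_source?(URI("ftp://example.com/books.xml"))   # a URI, but not http/https
http_source?("https://example.com/books.xml")      # a String, not a URI
```

Requiring a URI object rather than a string avoids treating an arbitrary string as something to open, which is the risk the changelog attributes to `Kernel#open`.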

data/README.md ADDED

@@ -0,0 +1,139 @@
+# XML Node Stream
+[![Continuous Integration](https://github.com/bdurand/xml_node_stream/actions/workflows/continuous_integration.yml/badge.svg)](https://github.com/bdurand/xml_node_stream/actions/workflows/continuous_integration.yml)
+[![Ruby Style Guide](https://img.shields.io/badge/code_style-standard-brightgreen.svg)](https://github.com/testdouble/standard)
+[![Gem Version](https://badge.fury.io/rb/xml_node_stream.svg)](https://badge.fury.io/rb/xml_node_stream)
+This gem provides a very easy to use XML parser that provides the benefits of both stream parsing (i.e. SAX) and document parsing (i.e. DOM). In addition, it provides a unified parsing language for each of the major Ruby XML parsers (REXML, Nokogiri, and LibXML) so that your code doesn't have to be bound to a particular XML library.
+## Usage
+The primary purpose of this gem is to facilitate parsing large XML files (e.g. several megabytes in size). Often, reading these files into a document structure is not feasible because the whole document must be read into memory. Stream/SAX parsing solves this issue by reading in the file incrementally and providing callbacks for various events. However, this method can be quite painful to deal with for any sort of complex document structure.
+This gem attempts to solve both of these issues by combining the best features of both approaches. Parsing is performed by a stream parser which constructs document-style nodes and calls back to the application code with these nodes. When your application is done with a node, it can release it to free up memory and keep your heap from bloating.
+In order to keep the interface simple and universal, only XML elements and text nodes are supported. XML processing instructions and comments will be ignored.
+### Examples
+Suppose we have a file with every book in the world in it:
+```xml
+<books>
+  <book isbn="123456">
+    <title>Moby Dick</title>
+    <author>Herman Melville</author>
+    <categories>
+      <category>Fiction</category>
+      <category>Adventure</category>
+    </categories>
+  </book>
+  <book isbn="98765643">
+    <title>The Decline and Fall of the Roman Empire</title>
+    <author>Edward Gibbon</author>
+    <categories>
+      <category>History</category>
+      <category>Ancient</category>
+    </categories>
+  </book>
+  ...
+</books>
+```
+Reading the whole file into memory will cause problems as it bloats the heap with potentially gigabytes of data. This can be solved by using a streaming parser, but that code can be a pain to write and maintain.
+We can use `XmlNodeStream` to get the best of both worlds. The file is streamed into memory for processing and then released when we are done with it. But we get node data structures that can be used to interact with the document in a much simpler manner.
+```ruby
+XmlNodeStream.parse('/tmp/books.xml') do |node|
+  if node.path == '/books/book'
+    book = Book.new
+    book.isbn = node['isbn']
+    book.title = node.find('title').value
+    book.author = node.find('author/text()')
+    book.categories = node.select('categories/category/text()')
+    book.save
+    node.release!
+  end
+end
+```
+### Releasing Nodes
+In the above example, what prevents memory bloat when parsing a large document is the call to `node.release!`. This call will remove the node from the node tree. The general practice is to look for the higher level nodes you are interested in and then release them immediately. If there are nodes you don't care about at all, those should be released immediately as well.
+For example, if the XML document for the books also contained a large list of authors that we aren't using in our processing, we should still release the author nodes immediately to keep from bloating memory:
+```xml
+<library>
+  <authors>
+    <author id="1">
+      <name>Herman Melville</name>
+    </author>
+    <author id="2">
+      <name>Edward Gibbon</name>
+    </author>
+    ...
+  </authors>
+  <books>
+    <book isbn="123456">
+      ...
+    </book>
+    ...
+  </books>
+</library>
+```
+```ruby
+XmlNodeStream.parse('/tmp/books.xml') do |node|
+  if node.path == '/library/books/book'
+    process_book(node)
+    node.release!
+  elsif node.path == '/library/authors/author'
+    # we don't care about authors so release the nodes immediately
+    node.release!
+  end
+end
+```
+A sample 77 MB XML document parsed into Nokogiri consumes over 800 MB of memory. Parsing the same document with XmlNodeStream and releasing top level nodes as they're processed uses less than 1 MB.
+### XPath
+You can use a subset of the XPath language to navigate nodes. The only parts of XPath implemented are the paths themselves and the text() function. The text() function is useful for getting the value of a node directly from the find or select methods without having to do a nil check on the nodes. For instance, in the above example we can get the name of an author with `node.find('author/text()')` instead of `node.find('author')&.value` or checking if the node exists before accessing its value.
+The rest of the XPath language is not implemented since it is a programming language and there is really no need for it since we already have Ruby at our disposal which is far more powerful than XPath. See the Selector class for details.
+## Performance
+The performance of XmlNodeStream depends on which underlying XML parser is used. Generally, the native extension based parsers (Nokogiri and LibXML) will perform much better without the added overhead of XmlNodeStream. The pure Ruby REXML parser will perform much better with XmlNodeStream.
+The main benefit of XmlNodeStream is memory efficiency when parsing large documents. By releasing nodes as they are processed, memory usage can be kept low even for very large documents. This reduces memory bloat and keeps your application process size consistent regardless of the size of the XML documents being processed which can be important in a long running server process.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem "xml_node_stream"
+```
+Then execute:
+```bash
+$ bundle
+```
+Or install it yourself as:
+```bash
+$ gem install xml_node_stream
+```
+## Contributing
+Open a pull request on [GitHub](https://github.com/bdurand/xml_node_stream).
+Please use the [standardrb](https://github.com/testdouble/standard) syntax and lint your code with `standardrb --fix` before submitting.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
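
The README above describes a stream parser that builds document-style nodes, hands each one to the application, and lets the application release it. The sketch below illustrates that pattern without the gem or any XML library: the parser events are simulated as an array, and `SketchNode` and `stream_nodes` are hypothetical names invented for this illustration, not the gem's classes or API.

```ruby
# Hypothetical stand-in for a node; the gem's real class is XmlNodeStream::Node.
class SketchNode
  attr_reader :name, :children
  attr_accessor :parent, :text

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    @text = +""
    parent&.children&.push(self)
  end

  # Path from the root, built by walking up through the parents.
  def path
    parent ? "#{parent.path}/#{name}" : "/#{name}"
  end

  # Detach from the parent so the subtree can be garbage collected.
  def release!
    parent&.children&.delete(self)
    self.parent = nil
  end
end

# Feed SAX-style events through a stack of open nodes, yielding each node
# once its end tag has been seen.
def stream_nodes(events)
  stack = []
  events.each do |type, data|
    case type
    when :start then stack.push(SketchNode.new(data, stack.last))
    when :text then stack.last.text << data unless stack.empty?
    when :end then yield stack.pop
    end
  end
end

# Simulated events for a two-book document (assumption: a real parser
# would emit these while reading the file incrementally).
events = [
  [:start, "books"],
  [:start, "book"], [:start, "title"], [:text, "Moby Dick"], [:end, "title"], [:end, "book"],
  [:start, "book"], [:start, "title"], [:text, "The Decline and Fall of the Roman Empire"], [:end, "title"], [:end, "book"],
  [:end, "books"]
]

titles = []
remaining_children = nil
stream_nodes(events) do |node|
  case node.path
  when "/books/book"
    titles << node.children.first.text
    node.release! # the book subtree is no longer reachable from the root
  when "/books"
    remaining_children = node.children.size
  end
end
```

Because each book is released as soon as it has been processed, the root node never accumulates children, which is the property the README relies on to keep memory flat.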

data/VERSION ADDED

@@ -0,0 +1 @@
+2.0.0

data/lib/xml_node_stream/http_stream.rb ADDED

@@ -0,0 +1,179 @@
+# frozen_string_literal: true
+require "net/http"
+module XmlNodeStream
+  # IO-like wrapper for HTTP responses that allows streaming
+  class HttpStream
+    # Default timeout values in seconds
+    DEFAULT_OPEN_TIMEOUT = 10
+    DEFAULT_READ_TIMEOUT = 60
+    # Create a new HttpStream.
+    #
+    # @param uri [URI] the URI to stream from
+    # @param open_timeout [Integer] connection timeout in seconds (default 10)
+    # @param read_timeout [Integer] read timeout in seconds (default 60)
+    def initialize(uri, open_timeout: DEFAULT_OPEN_TIMEOUT, read_timeout: DEFAULT_READ_TIMEOUT)
+      @uri = uri
+      @http = Net::HTTP.new(uri.host, uri.port)
+      @http.use_ssl = (uri.scheme == "https")
+      @http.open_timeout = open_timeout
+      @http.read_timeout = read_timeout
+      @request = Net::HTTP::Get.new(uri.request_uri)
+      @buffer = +""
+      @eof = false
+      @response = nil
+      @body_reader = nil
+    end
+    # Read data from the stream.
+    #
+    # @param length [Integer, nil] the number of bytes to read, or nil to read all
+    # @param outbuf [String, nil] optional output buffer
+    # @return [String, nil] the data read, or nil if at EOF
+    def read(length = nil, outbuf = nil)
+      ensure_response_started
+      if length.nil?
+        # Read all remaining data
+        result = @buffer.dup
+        while (chunk = read_chunk)
+          result << chunk
+        end
+        @buffer = +""
+        if outbuf
+          outbuf.replace(result)
+          outbuf
+        else
+          result
+        end
+      else
+        # Read specific length
+        while @buffer.bytesize < length && !@eof
+          chunk = read_chunk
+          break if chunk.nil?
+          @buffer << chunk
+        end
+        if @buffer.bytesize >= length
+          result = @buffer.byteslice(0, length)
+          @buffer = @buffer.byteslice(length..-1) || +""
+        else
+          result = @buffer.dup
+          @buffer = +""
+        end
+        if result.empty? && @eof
+          nil
+        elsif outbuf
+          outbuf.replace(result)
+          outbuf
+        else
+          result
+        end
+      end
+    end
+    # Read a line from the stream.
+    #
+    # @param sep [String, nil] the line separator
+    # @param limit [Integer, nil] maximum number of bytes to read
+    # @return [String, nil] the line read, or nil if at EOF
+    def gets(sep = $/, limit = nil)
+      ensure_response_started
+      if sep.nil?
+        # Read all
+        return read
+      end
+      sep = sep.to_s
+      loop do
+        if (idx = @buffer.index(sep))
+          line = @buffer.slice!(0, idx + sep.length)
+          return line
+        end
+        break if @eof
+        chunk = read_chunk
+        if chunk.nil?
+          break
+        end
+        @buffer << chunk
+      end
+      return nil if @buffer.empty?
+      line = @buffer
+      @buffer = +""
+      line
+    end
+    alias_method :readline, :gets
+    # Check if at end of file.
+    #
+    # @return [Boolean] true if at EOF
+    def eof?
+      @eof && @buffer.empty?
+    end
+    # Close the stream.
+    #
+    # @return [void]
+    def close
+      @http.finish if @http&.started?
+    rescue
+      # Ignore errors during close to ensure cleanup completes
+      nil
+    end
+    # Check if the stream is closed.
+    #
+    # @return [Boolean] true if closed
+    def closed?
+      @http.nil? || !@http.started?
+    end
+    # Return self as the IO object for REXML compatibility.
+    #
+    # @return [HttpStream] self
+    def to_io
+      self
+    end
+    private
+    def ensure_response_started
+      return if @response
+      @http.start unless @http.started?
+      @response = @http.request(@request)
+      @body_reader = @response.read_body
+    end
+    def read_chunk
+      return nil if @eof
+      if @body_reader.is_a?(String)
+        # Entire body was read at once
+        if @body_reader.empty?
+          @eof = true
+          return nil
+        end
+        # Simulate chunking for consistency
+        chunk = @body_reader.byteslice(0, 8192) || +""
+        @body_reader = @body_reader.byteslice(8192..-1) || +""
+        @eof = true if @body_reader.empty?
+        chunk
+      else
+        # Should not happen with webmock but handling for real HTTP
+        @eof = true
+        nil
+      end
+    end
+  end
+end
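
One observation on the file above: `Net::HTTP#request` called without a block reads the whole response body before returning, so `read_chunk` appears to slice an in-memory string into 8192-byte pieces rather than read from the socket incrementally. The buffering logic in `read(length)` is independent of where the chunks come from. The sketch below isolates that logic over an arbitrary chunk source. `ChunkBuffer` is a hypothetical class written for illustration and is not part of the gem.

```ruby
# Hypothetical reader (not the gem's HttpStream): serves fixed-length reads
# from any enumerable source of string chunks.
class ChunkBuffer
  def initialize(chunks)
    @chunks = chunks.each # external enumerator over the chunk source
    @buffer = +""
    @eof = false
  end

  # Return up to +length+ bytes, or nil once the source is exhausted
  # and the buffer is empty.
  def read(length)
    fill(length)
    result = @buffer.byteslice(0, length)
    @buffer = @buffer.byteslice(length..-1) || +""
    (result.empty? && @eof) ? nil : result
  end

  private

  # Pull chunks until the buffer can satisfy the request or the source ends.
  def fill(length)
    while @buffer.bytesize < length && !@eof
      begin
        @buffer << @chunks.next
      rescue StopIteration
        @eof = true
      end
    end
  end
end

reader = ChunkBuffer.new(["<a>", "hello", "</a>"])
first = reader.read(4)  # spans the first two chunks
second = reader.read(4) # served from the buffered remainder
third = reader.read(10) # short read: fewer bytes remain than requested
fourth = reader.read(1) # source exhausted
```

The same class would work over a source that yields chunks lazily, which is what a truly streaming HTTP reader would need to provide.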

data/lib/xml_node_stream/node.rb CHANGED

@@ -1,130 +1,181 @@
+# frozen_string_literal: true
 module XmlNodeStream
   # Representation of an XML node.
   class Node
     attr_reader :name, :parent
     attr_accessor :value
-    def initialize (name, parent = nil, attributes = nil, value = nil)
+    # Create a new Node.
+    #
+    # @param name [String] the name of the node
+    # @param parent [Node, nil] the parent node
+    # @param attributes [Hash, nil] the node attributes
+    # @param value [String, nil] the node value
+    def initialize(name, parent = nil, attributes = nil, value = nil)
       @name = name
       @attributes = attributes
       @parent = parent
-      @parent.add_child(self) if @parent
+      @parent&.add_child(self)
       @value = value
+      @path = nil
     end
     # Release a node by removing it from the tree structure so that the Ruby garbage collector can reclaim the memory.
     # This method should be called after you are done with a node. After it is called, the node will be removed from
     # its parent's children and will no longer be accessible.
+    #
+    # @return [void]
     def release!
-      @parent.remove_child(self) if @parent
+      @parent&.remove_child(self)
+      @path = nil
     end
     # Array of the child nodes of the node.
+    #
+    # @return [Array<Node>] the child nodes
     def children
       @children ||= []
     end
     # Array of all descendants of the node.
+    #
+    # @return [Array<Node>] all descendant nodes
     def descendants
       if children.empty?
-        return children
+        children
       else
-        return (children + children.collect{|child| child.descendants}).flatten
+        (children + children.collect { |child| child.descendants }).flatten
       end
     end
     # Array of all ancestors of the node.
+    #
+    # @return [Array<Node>] all ancestor nodes
     def ancestors
       if @parent
-        return [@parent] + @parent.ancestors
+        [@parent] + @parent.ancestors
       else
-        return []
+        []
       end
     end
     # Get the attributes of the node as a hash.
+    #
+    # @return [Hash] the node attributes
     def attributes
       @attributes ||= {}
     end
     # Get the root element of the node tree.
+    #
+    # @return [Node] the root node
     def root
       @parent ? @parent.root : self
     end
     # Get the full XPath of the node.
+    #
+    # @return [String] the XPath of the node
     def path
-      unless @path
-        if @parent
-          @path = "#{@parent.path}/#{@name}"
-        else
-          @path = "/#{@name}"
-        end
+      @path ||= if @parent
+        "#{@parent.path}/#{@name}"
+      else
+        "/#{@name}"
       end
-      return @path
     end
     # Get the value of the node attribute with the given name.
-    def [] (name)
-      return @attributes[name] if @attributes
+    #
+    # @param name [String] the attribute name
+    # @return [String, nil] the attribute value
+    def [](name)
+      @attributes[name] if @attributes
     end
     # Set the value of the node attribute with the given name.
-    def []= (name, val)
+    #
+    # @param name [String] the attribute name
+    # @param val [String] the attribute value
+    # @return [String] the attribute value
+    def []=(name, val)
       attributes[name] = val
     end
     # Add a child node.
-    def add_child (node)
+    #
+    # @param node [Node] the child node to add
+    # @return [void]
+    def add_child(node)
       children << node
       node.instance_variable_set(:@parent, self)
     end
     # Remove a child node.
-    def remove_child (node)
+    #
+    # @param node [Node] the child node to remove
+    # @return [Node, nil] the removed node or nil
+    def remove_child(node)
       if @children
         if @children.delete(node)
           node.instance_variable_set(:@parent, nil)
         end
       end
     end
     # Get the first child node.
+    #
+    # @return [Node, nil] the first child node or nil
     def first_child
-      @children.first if @children
+      @children&.first
     end
     # Find the first node that matches the given XPath. See Selector for details.
-    def find (selector)
+    #
+    # @param selector [String, Selector] the XPath selector
+    # @return [Node, nil] the first matching node or nil
+    def find(selector)
       select(selector).first
     end
     # Find all nodes that match the given XPath. See Selector for details.
-    def select (selector)
+    #
+    # @param selector [String, Selector] the XPath selector
+    # @return [Array<Node>] all matching nodes
+    def select(selector)
       selector = selector.is_a?(Selector) ? selector : Selector.new(selector)
-      return selector.find(self)
+      selector.find(self)
     end
     # Append CDATA to the node value.
-    def append_cdata (text)
+    #
+    # @param text [String] the CDATA text to append
+    # @return [void]
+    def append_cdata(text)
       append(text, false)
     end
     # Append text to the node value. If strip_whitespace is true, whitespace at the beginning and end
     # of the node value will be removed.
-    def append (text, strip_whitespace = true)
+    #
+    # @param text [String] the text to append
+    # @param strip_whitespace [Boolean] whether to strip whitespace
+    # @return [void]
+    def append(text, strip_whitespace = true)
       if text
-        @value ||= ''
+        @value ||= +""
         @last_strip_whitespace = strip_whitespace
-        text = text.lstrip if @value.length == 0 and strip_whitespace
+        text = text.lstrip if @value.length == 0 && strip_whitespace
         @value << text if text.length > 0
       end
     end
     # Called after end tag to ensure that whitespace at the end of the string is properly stripped.
-    def finish! #:nodoc
-      @value.rstrip! if @value and @last_strip_whitespace
+    #
+    # @return [void]
+    # @api private
+    def finish!
+      @value.rstrip! if @value && @last_strip_whitespace
     end
   end
 end
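
The `append`, `append_cdata` and `finish!` methods above strip leading whitespace from the first text chunk and trailing whitespace at the end tag, unless the last chunk appended was CDATA. The following sketch reproduces just that logic in a small standalone class so the behavior can be tried in isolation. `TextAccumulator` is a hypothetical name; the method bodies mirror the 2.0.0 side of the diff.

```ruby
# Hypothetical accumulator mirroring the whitespace handling in Node#append
# and Node#finish! from the diff above.
class TextAccumulator
  attr_reader :value

  def append(text, strip_whitespace = true)
    return unless text
    @value ||= +""
    @last_strip_whitespace = strip_whitespace
    # Only the very first chunk has its leading whitespace removed.
    text = text.lstrip if @value.empty? && strip_whitespace
    @value << text unless text.empty?
  end

  # Trailing whitespace is removed only if the last chunk was plain text.
  def finish!
    @value.rstrip! if @value && @last_strip_whitespace
  end
end

# Character data often arrives split across several parser callbacks.
plain = TextAccumulator.new
["\n    Moby ", "Dick\n  "].each { |chunk| plain.append(chunk) }
plain.finish!

# CDATA is appended with strip_whitespace = false, so it is kept verbatim.
cdata = TextAccumulator.new
cdata.append("  raw  ", false)
cdata.finish!
```

Whitespace between chunks is preserved, so text split mid-sentence by the parser is rejoined without losing interior spaces.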

data/lib/xml_node_stream/parser/base.rb CHANGED

@@ -1,38 +1,75 @@
+# frozen_string_literal: true
 module XmlNodeStream
   class Parser
     # This is the base parser syntax that normalizes the SAX callbacks by providing a common interface
     # so that the actual parser implementation doesn't matter.
     module Base
       attr_reader :root
-      def initialize (&block)
+      # Initialize the parser.
+      #
+      # @yield [Node] each node as it is parsed
+      def initialize(&block)
         @nodes = []
         @parse_block = block
         @root = nil
       end
-      def parse_stream (io)
-        raise NotImplementedError.new("could not load gem")
+      # Parse the input stream.
+      #
+      # @param io [IO] the input stream to parse
+      # @return [void]
+      # @raise [NotImplementedError] if the parser gem is not loaded
+      def parse_stream(io)
+        parser_name = self.class.name.split("::").last.sub("Parser", "").downcase
+        gem_name = case parser_name
+        when "nokogiri" then "nokogiri"
+        when "libxml" then "libxml-ruby"
+        when "rexml" then "rexml"
+        else "unknown"
+        end
+        raise NotImplementedError.new("Parser gem not loaded: #{gem_name}. Install it with: gem install #{gem_name.split(" ").first}")
       end
-      def do_start_element (name, attributes)
+      # Handle start element event.
+      #
+      # @param name [String] the element name
+      # @param attributes [Hash] the element attributes
+      # @return [void]
+      # @api private
+      def do_start_element(name, attributes)
         node = XmlNodeStream::Node.new(name, @nodes.last, attributes)
         @nodes.push(node)
       end
-      def do_end_element (name)
+      # Handle end element event.
+      #
+      # @param name [String] the element name
+      # @return [void]
+      # @api private
+      def do_end_element(name)
         node = @nodes.pop
         node.finish!
         @root = node if @nodes.empty?
-        @parse_block.call(node) if @parse_block
+        @parse_block&.call(node)
       end
-      def do_characters (characters)
+      # Handle character data event.
+      #
+      # @param characters [String] the character data
+      # @return [void]
+      # @api private
+      def do_characters(characters)
         @nodes.last.append(characters) unless @nodes.empty?
       end
-      def do_cdata_block (characters)
+      # Handle CDATA block event.
+      #
+      # @param characters [String] the CDATA content
+      # @return [void]
+      # @api private
+      def do_cdata_block(characters)
         @nodes.last.append_cdata(characters) unless @nodes.empty?
       end
     end