xml_node_stream 1.0.0

data/MIT_LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2010 Brian Durand
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,61 @@
1
+ = XML Node Stream
2
+
3
+ This gem provides an easy-to-use XML parser that combines the benefits of both stream parsing (i.e. SAX) and document parsing (i.e. DOM). In addition, it provides a unified parsing interface for each of the major Ruby XML parsers (REXML, Nokogiri, and LibXML) so that your code doesn't have to be bound to a particular XML library.
4
+
5
+ == Stream Parsing
6
+
7
+ The primary purpose of this gem is to facilitate parsing large XML files (i.e. several megabytes in size). Often, reading these files into a document structure is not feasible because the whole document must be read into memory. Stream/SAX parsing solves this issue by reading the file incrementally and providing callbacks for various events. However, this method can be quite painful to deal with for any sort of complex document structure.
8
+
9
+ This gem attempts to solve both of these issues by combining the best features of both approaches. Parsing is performed by a stream parser which constructs document-style nodes and calls back to the application code with these nodes. When your application is done with a node, it can release it to free up memory and keep your heap from bloating.
10
+
11
+ In order to keep the interface simple and universal, only XML elements and text nodes are supported. XML processing instructions and comments will be ignored.
12
+
13
+ == Examples
14
+
15
+ Suppose we have a file with every book in the world in it:
16
+
17
+ <books>
18
+ <book isbn="123456">
19
+ <title>Moby Dick</title>
20
+ <author>Herman Melville</author>
21
+ <categories>
22
+ <category>Fiction</category>
23
+ <category>Adventure</category>
24
+ </categories>
25
+ </book>
26
+ <book isbn="98765643">
27
+ <title>The Decline and Fall of the Roman Empire</title>
28
+ <author>Edward Gibbon</author>
29
+ <categories>
30
+ <category>History</category>
31
+ <category>Ancient</category>
32
+ </categories>
33
+ </book>
34
+ ...
35
+ </books>
36
+
37
+ And we want to get them into our Books data model:
38
+
39
+ XmlNodeStream.parse('/tmp/books.xml') do |node|
40
+ if node.path == '/books/book'
41
+ book = Book.new
42
+ book.isbn = node['isbn']
43
+ book.title = node.find('title').value
44
+ book.author = node.find('author/text()')
45
+ book.categories = node.select('categories/category/text()')
46
+ book.save
47
+ node.release!
48
+ end
49
+ end
50
+
51
+ == Releasing Nodes
52
+
53
+ In the above example, what prevents memory bloat when parsing a large document is the call to node.release!. This call will remove the node from the node tree. The general practice is to look for the higher level nodes you are interested in and then release them immediately. If there are nodes you don't care about at all, those can be released immediately as well.
54
+
55
+ A sample 77 MB XML document parsed into Nokogiri consumes over 800 MB of memory. Parsing the same document with XmlNodeStream and releasing top-level nodes as they're processed uses less than 1 MB.
56
+
57
+ == XPath
58
+
59
+ You can use a subset of the XPath language to navigate nodes. The only parts of XPath implemented are the paths themselves and the text() function. The text() function is useful for getting the value of a node directly from the find or select methods without having to do a nil check on the nodes. For instance, in the above example we can get the name of an author with node.find('author/text()') instead of node.find('author').value if node.find('author').
60
+
61
+ The rest of the XPath language is not implemented. XPath is itself a programming language, and there is little need for it when Ruby, which is far more powerful, is already at our disposal. See the Selector class for details.
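The release pattern described in the README can be illustrated without the gem itself. The following is a minimal sketch, assuming a hypothetical stand-in class `SketchNode` and a hypothetical helper `process_books` (neither is part of xml_node_stream). It models only the parent/children bookkeeping: once a processed child is released, the parent no longer references it, so the garbage collector is free to reclaim it.

```ruby
# Hypothetical stand-in for XmlNodeStream::Node; only the tree bookkeeping is modeled.
class SketchNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end

  # Detach this node from its parent so nothing in the tree references it.
  def release!
    return unless @parent
    @parent.children.delete(self)
    @parent = nil
  end
end

# Hypothetical driver: builds `count` child nodes one at a time, as a stream
# parser would, and releases each one as soon as it has been "processed".
def process_books(count)
  root = SketchNode.new("books")
  processed = 0
  count.times do
    book = SketchNode.new("book", root)
    processed += 1   # application work (e.g. saving a record) would happen here
    book.release!    # drop the finished node from the tree
  end
  [processed, root.children.size]
end
```

In this sketch the root never holds more than one child at a time, which is the property that keeps memory flat when the same idea is applied to a large document.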
data/Rakefile ADDED
@@ -0,0 +1,42 @@
1
+ require 'rubygems'
2
+ require 'rake'
3
+ require 'rake/rdoctask'
4
+
5
+ desc 'Default: run unit tests.'
6
+ task :default => :test
7
+
8
+ begin
9
+ require 'spec/rake/spectask'
10
+ desc 'Test xml_node_stream.'
11
+ Spec::Rake::SpecTask.new(:test) do |t|
12
+ t.spec_files = 'spec/**/*_spec.rb'
13
+ end
14
+ rescue LoadError
15
+ task :test do
16
+ STDERR.puts "You must have rspec >= 1.2.9 to run the tests"
17
+ end
18
+ end
19
+
20
+ desc 'Generate documentation for xml_node_stream.'
21
+ Rake::RDocTask.new(:rdoc) do |rdoc|
22
+ rdoc.rdoc_dir = 'rdoc'
23
+ rdoc.options << '--title' << 'XML Node Stream' << '--line-numbers' << '--inline-source' << '--main' << 'README.rdoc'
24
+ rdoc.rdoc_files.include('README.rdoc')
25
+ rdoc.rdoc_files.include('lib/**/*.rb')
26
+ end
27
+
28
+ begin
29
+ require 'jeweler'
30
+ Jeweler::Tasks.new do |gem|
31
+ gem.name = "xml_node_stream"
32
+ gem.summary = %Q{Simple XML parser wrapper that provides the benefits of stream parsing with the ease of using document nodes.}
33
+ gem.email = "brian@embellishedvisions.com"
34
+ gem.homepage = "http://github.com/bdurand/xml_node_stream"
35
+ gem.authors = ["Brian Durand"]
36
+ gem.add_development_dependency('rspec', '>= 1.2.9')
37
+ gem.add_development_dependency('jeweler')
38
+ end
39
+
40
+ Jeweler::GemcutterTasks.new
41
+ rescue LoadError
42
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 1.0.0
data/init.rb ADDED
@@ -0,0 +1 @@
1
+ require "#{File.dirname(__FILE__)}/lib/xml_node_stream"
@@ -0,0 +1,10 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'xml_node_stream', 'node'))
2
+ require File.expand_path(File.join(File.dirname(__FILE__), 'xml_node_stream', 'parser'))
3
+ require File.expand_path(File.join(File.dirname(__FILE__), 'xml_node_stream', 'selector'))
4
+
5
+ module XmlNodeStream
6
+ # Helper method to parse XML. See Parser#parse for details.
7
+ def self.parse (io, &block)
8
+ Parser.parse(io, &block)
9
+ end
10
+ end
@@ -0,0 +1,130 @@
1
+ module XmlNodeStream
2
+ # Representation of an XML node.
3
+ class Node
4
+
5
+ attr_reader :name, :parent
6
+ attr_accessor :value
7
+
8
+ def initialize (name, parent = nil, attributes = nil, value = nil)
9
+ @name = name
10
+ @attributes = attributes
11
+ @parent = parent
12
+ @parent.add_child(self) if @parent
13
+ @value = value
14
+ end
15
+
16
+ # Release a node by removing it from the tree structure so that the Ruby garbage collector can reclaim the memory.
17
+ # This method should be called after you are done with a node. After it is called, the node will be removed from
18
+ # its parent's children and will no longer be accessible.
19
+ def release!
20
+ @parent.remove_child(self) if @parent
21
+ end
22
+
23
+ # Array of the child nodes of the node.
24
+ def children
25
+ @children ||= []
26
+ end
27
+
28
+ # Array of all descendants of the node.
29
+ def descendants
30
+ if children.empty?
31
+ return children
32
+ else
33
+ return (children + children.collect{|child| child.descendants}).flatten
34
+ end
35
+ end
36
+
37
+ # Array of all ancestors of the node.
38
+ def ancestors
39
+ if @parent
40
+ return [@parent] + @parent.ancestors
41
+ else
42
+ return []
43
+ end
44
+ end
45
+
46
+ # Get the attributes of the node as a hash.
47
+ def attributes
48
+ @attributes ||= {}
49
+ end
50
+
51
+ # Get the root element of the node tree.
52
+ def root
53
+ @parent ? @parent.root : self
54
+ end
55
+
56
+ # Get the full XPath of the node.
57
+ def path
58
+ unless @path
59
+ if @parent
60
+ @path = "#{@parent.path}/#{@name}"
61
+ else
62
+ @path = "/#{@name}"
63
+ end
64
+ end
65
+ return @path
66
+ end
67
+
68
+ # Get the value of the node attribute with the given name.
69
+ def [] (name)
70
+ return @attributes[name] if @attributes
71
+ end
72
+
73
+ # Set the value of the node attribute with the given name.
74
+ def []= (name, val)
75
+ attributes[name] = val
76
+ end
77
+
78
+ # Add a child node.
79
+ def add_child (node)
80
+ children << node
81
+ node.instance_variable_set(:@parent, self)
82
+ end
83
+
84
+ # Remove a child node.
85
+ def remove_child (node)
86
+ if @children
87
+ if @children.delete(node)
88
+ node.instance_variable_set(:@parent, nil)
89
+ end
90
+ end
91
+ end
92
+
93
+ # Get the first child node.
94
+ def first_child
95
+ @children.first if @children
96
+ end
97
+
98
+ # Find the first node that matches the given XPath. See Selector for details.
99
+ def find (selector)
100
+ select(selector).first
101
+ end
102
+
103
+ # Find all nodes that match the given XPath. See Selector for details.
104
+ def select (selector)
105
+ selector = selector.is_a?(Selector) ? selector : Selector.new(selector)
106
+ return selector.find(self)
107
+ end
108
+
109
+ # Append CDATA to the node value.
110
+ def append_cdata (text)
111
+ append(text, false)
112
+ end
113
+
114
+ # Append text to the node value. If strip_whitespace is true, whitespace at the beginning and end
115
+ # of the node value will be removed.
116
+ def append (text, strip_whitespace = true)
117
+ if text
118
+ @value ||= ''
119
+ @last_strip_whitespace = strip_whitespace
120
+ text = text.lstrip if @value.length == 0 and strip_whitespace
121
+ @value << text if text.length > 0
122
+ end
123
+ end
124
+
125
+ # Called after end tag to ensure that whitespace at the end of the string is properly stripped.
126
+ def finish! #:nodoc:
127
+ @value.rstrip! if @value and @last_strip_whitespace
128
+ end
129
+ end
130
+ end
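The interplay between append, append_cdata, and finish! is easier to follow in isolation. The sketch below copies only that text-accumulation logic into a hypothetical `TextBuffer` class with a hypothetical `accumulate` helper (both are illustrative names, not part of the gem): leading whitespace is stripped only while the value is still empty, and trailing whitespace is stripped at the end only if the last chunk appended was ordinary text rather than CDATA.

```ruby
# Hypothetical stand-in that reproduces only Node's text handling.
class TextBuffer
  attr_reader :value

  # Append ordinary text; leading whitespace is dropped while the value is empty.
  def append(text, strip_whitespace = true)
    return unless text
    @value ||= String.new
    @last_strip_whitespace = strip_whitespace
    text = text.lstrip if @value.empty? && strip_whitespace
    @value << text unless text.empty?
  end

  # CDATA is appended verbatim, whitespace included.
  def append_cdata(text)
    append(text, false)
  end

  # Trailing whitespace is removed only if the last chunk was ordinary text.
  def finish!
    @value.rstrip! if @value && @last_strip_whitespace
  end
end

# Hypothetical helper: feeds [kind, text] pairs into a buffer and returns the result.
def accumulate(chunks)
  buffer = TextBuffer.new
  chunks.each do |kind, text|
    kind == :cdata ? buffer.append_cdata(text) : buffer.append(text)
  end
  buffer.finish!
  buffer.value
end
```

For example, `accumulate([[:text, "  hello"], [:text, " world \n"]])` yields `"hello world"`, while `accumulate([[:cdata, "  raw "]])` keeps its surrounding spaces.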
@@ -0,0 +1,71 @@
1
+ require 'open-uri'
2
+ require 'rubygems'
3
+ require 'pathname'
4
+ require File.expand_path(File.join(File.dirname(__FILE__), 'parser', 'base'))
5
+
6
+ module XmlNodeStream
7
+ # The abstract parser class that wraps the actual parser implementation.
8
+ class Parser
9
+
10
+ SUPPORTED_PARSERS = [:nokogiri, :libxml, :rexml]
11
+
12
+ class << self
13
+ # Set the parser implementation. The parser argument should be one of :nokogiri, :libxml, or :rexml. If this method
14
+ # is not called, it will default to :rexml which is the slowest choice possible. If you set the parser to one of the
15
+ # other values, though, you'll need to make sure you have the nokogiri gem or libxml-ruby gem installed.
16
+ def parser_name= (parser)
17
+ parser_sym = parser.to_sym
18
+ raise ArgumentError.new("must be one of #{SUPPORTED_PARSERS.inspect}") unless SUPPORTED_PARSERS.include?(parser_sym)
19
+ @parser_name = parser_sym
20
+ end
21
+
22
+ # Get the name of the current parser.
23
+ def parser_name
24
+ @parser_name ||= :rexml
25
+ end
26
+
27
+ # Parse the document specified in io. This can be either a Stream, URI, Pathname, or String. If it is a String,
28
+ # it can be either an XML document, a file system path, or a URI. The parser will figure it out. If a block is given,
29
+ # it will be yielded to with each node as it is parsed.
30
+ def parse (io, &block)
31
+ close_stream = false
32
+ if io.is_a?(String)
33
+ if io.include?('<') and io.include?('>')
34
+ io = StringIO.new(io)
35
+ else
36
+ io = open(io)
37
+ end
38
+ close_stream = true
39
+ elsif io.is_a?(Pathname)
40
+ io = io.open
41
+ close_stream = true
42
+ elsif io.is_a?(URI)
43
+ io = io.open
44
+ close_stream = true
45
+ end
46
+
47
+ begin
48
+ parser = parser_class(parser_name).new(&block)
49
+ parser.parse_stream(io)
50
+ return parser.root
51
+ ensure
52
+ io.close if close_stream
53
+ end
54
+ end
55
+
56
+ protected
57
+
58
+ def parser_class (class_symbol)
59
+ @loaded_parsers ||= {}
60
+ klass = @loaded_parsers[class_symbol]
61
+ unless klass
62
+ require File.expand_path(File.join(File.dirname(__FILE__), 'parser', "#{class_symbol}_parser"))
63
+ class_name = "#{class_symbol.to_s.capitalize}Parser"
64
+ klass = const_get(class_name)
65
+ @loaded_parsers[class_symbol] = klass
66
+ end
67
+ return klass
68
+ end
69
+ end
70
+ end
71
+ end
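How Parser.parse decides what to do with its argument can be summarized as a small classification step. The following is a sketch only, using a hypothetical helper `input_kind` (not part of the gem) that reports which branch would be taken without opening anything. As in the code above, a String is treated as inline XML when it contains both `<` and `>`, and as something to open otherwise.

```ruby
require 'pathname'
require 'stringio'
require 'uri'

# Hypothetical helper: names the branch Parser.parse would take for an input.
def input_kind(io)
  case io
  when String
    # Same heuristic as the parser: angle brackets mean an inline document.
    io.include?('<') && io.include?('>') ? :inline_xml : :path_or_url
  when Pathname
    :pathname
  when URI
    :uri
  else
    # Anything else is assumed to already be a readable stream.
    :stream
  end
end
```

In the parser above, only the first three kinds set close_stream, so a stream passed in by the caller is left open for the caller to close.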
@@ -0,0 +1,40 @@
1
+ module XmlNodeStream
2
+ class Parser
3
+ # This is the base parser syntax that normalizes the SAX callbacks by providing a common interface
4
+ # so that the actual parser implementation doesn't matter.
5
+ module Base
6
+
7
+ attr_reader :root
8
+
9
+ def initialize (&block)
10
+ @nodes = []
11
+ @parse_block = block
12
+ @root = nil
13
+ end
14
+
15
+ def parse_stream (io)
16
+ raise NotImplementedError.new("could not load gem")
17
+ end
18
+
19
+ def do_start_element (name, attributes)
20
+ node = XmlNodeStream::Node.new(name, @nodes.last, attributes)
21
+ @nodes.push(node)
22
+ end
23
+
24
+ def do_end_element (name)
25
+ node = @nodes.pop
26
+ node.finish!
27
+ @root = node if @nodes.empty?
28
+ @parse_block.call(node) if @parse_block
29
+ end
30
+
31
+ def do_characters (characters)
32
+ @nodes.last.append(characters) unless @nodes.empty?
33
+ end
34
+
35
+ def do_cdata_block (characters)
36
+ @nodes.last.append_cdata(characters) unless @nodes.empty?
37
+ end
38
+ end
39
+ end
40
+ end
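The Base module keeps a stack of open elements: a start event pushes a node, and an end event pops it and hands it to the block. A consequence worth seeing directly is that inner elements are yielded before the elements that contain them. The sketch below uses a hypothetical `replay` helper (illustrative only, not part of the gem) that tracks element paths instead of full nodes.

```ruby
# Hypothetical event replayer: mirrors the push-on-start / pop-on-end stack
# used by Base, but records only the path of each element as it closes.
def replay(events)
  stack = []
  yielded = []
  events.each do |type, name|
    case type
    when :start
      stack.push(name)
    when :end
      yielded << "/" + stack.join("/") # the element being closed is on top
      stack.pop
    end
  end
  yielded
end

# Events for <library><authors><author/></authors></library>
SAMPLE_EVENTS = [
  [:start, "library"], [:start, "authors"], [:start, "author"],
  [:end, "author"], [:end, "authors"], [:end, "library"]
].freeze
```

Replaying `SAMPLE_EVENTS` gives the author path first and the library path last, which is why a block that releases nodes at the `/books/book` level sees each book fully built, with its children attached, before its parent closes.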
@@ -0,0 +1,44 @@
1
+ begin
2
+ require 'libxml'
3
+
4
+ module XmlNodeStream
5
+ class Parser
6
+ # Wrapper for the LibXML SAX parser.
7
+ class LibxmlParser
8
+ include LibXML::XML::SaxParser::Callbacks
9
+ include Base
10
+
11
+ def parse_stream (io)
12
+ parser = LibXML::XML::SaxParser.new
13
+ parser.callbacks = self
14
+ parser.io = io
15
+ parser.parse
16
+ end
17
+
18
+ def on_start_element (name, attributes)
19
+ do_start_element(name, attributes)
20
+ end
21
+
22
+ def on_end_element (name)
23
+ do_end_element(name)
24
+ end
25
+
26
+ def on_characters (characters)
27
+ do_characters(characters)
28
+ end
29
+
30
+ def on_cdata_block (characters)
31
+ do_cdata_block(characters)
32
+ end
33
+ end
34
+ end
35
+ end
36
+ rescue LoadError
37
+ module XmlNodeStream
38
+ class Parser
39
+ class LibxmlParser
40
+ include Base
41
+ end
42
+ end
43
+ end
44
+ end
@@ -0,0 +1,50 @@
1
+ begin
2
+ require 'nokogiri'
3
+
4
+ module XmlNodeStream
5
+ class Parser
6
+ # Wrapper for the Nokogiri SAX parser.
7
+ class NokogiriParser
8
+ include Base
9
+
10
+ def parse_stream (io)
11
+ listener = Listener.new(self)
12
+ parser = Nokogiri::XML::SAX::Parser.new(listener)
13
+ parser.parse(io)
14
+ end
15
+
16
+ class Listener < Nokogiri::XML::SAX::Document
17
+ def initialize (parser)
18
+ @parser = parser
19
+ end
20
+
21
+ def start_element (name, attributes = [])
22
+ attributes_hash = {}
23
+ (attributes.size / 2).times{|i| attributes_hash[attributes[i * 2]] = attributes[(i * 2) + 1]}
24
+ @parser.do_start_element(name, attributes_hash)
25
+ end
26
+
27
+ def end_element (name)
28
+ @parser.do_end_element(name)
29
+ end
30
+
31
+ def characters (characters)
32
+ @parser.do_characters(characters)
33
+ end
34
+
35
+ def cdata_block (characters)
36
+ @parser.do_cdata_block(characters)
37
+ end
38
+ end
39
+ end
40
+ end
41
+ end
42
+ rescue LoadError
43
+ module XmlNodeStream
44
+ class Parser
45
+ class NokogiriParser
46
+ include Base
47
+ end
48
+ end
49
+ end
50
+ end
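The index arithmetic in the listener's start_element pairs up a flat `[name, value, name, value, ...]` array into a hash. A more idiomatic way to express the same pairing is sketched below, assuming that flat array shape; `pair_attributes` is a hypothetical helper name, and whether a given Nokogiri version actually passes attributes in this shape is not something this sketch verifies.

```ruby
# Hypothetical helper: turn a flat [name, value, name, value, ...] array into a hash.
# A trailing name without a value is ignored, matching the integer division
# (attributes.size / 2) in the listener above.
def pair_attributes(flat)
  flat.each_slice(2).select { |pair| pair.size == 2 }.to_h
end
```

With this helper, `pair_attributes(["id", "1", "name", "History"])` returns `{"id" => "1", "name" => "History"}`.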
@@ -0,0 +1,43 @@
1
+ begin
2
+ require 'rexml/document'
3
+ require 'rexml/streamlistener'
4
+
5
+ module XmlNodeStream
6
+ class Parser
7
+ # Wrapper for the REXML SAX parser.
8
+ class RexmlParser
9
+ include REXML::StreamListener
10
+ include Base
11
+
12
+ def parse_stream (io)
13
+ parser = REXML::Parsers::StreamParser.new(io, self)
14
+ parser.parse
15
+ end
16
+
17
+ def tag_start (name, attributes)
18
+ do_start_element(name, attributes)
19
+ end
20
+
21
+ def tag_end (name)
22
+ do_end_element(name)
23
+ end
24
+
25
+ def text (content)
26
+ do_characters(content)
27
+ end
28
+
29
+ def cdata (content)
30
+ do_cdata_block(content)
31
+ end
32
+ end
33
+ end
34
+ end
35
+ rescue LoadError
36
+ module XmlNodeStream
37
+ class Parser
38
+ class RexmlParser
39
+ include Base
40
+ end
41
+ end
42
+ end
43
+ end
@@ -0,0 +1,72 @@
1
+ module XmlNodeStream
2
+ # Partial implementation of XPath selectors. Only abbreviated paths and the text() function are supported. The rest of XPath
3
+ # is unnecessary in the context of a Ruby application since XPath is also a programming language. If you really need an XPath
4
+ # function, chances are you can just do it in the Ruby code.
5
+ #
6
+ # Example selectors:
7
+ # * book - find all child book elements
8
+ # * book/author - find all author elements that are children of the book child elements
9
+ # * ../book - find all sibling book elements
10
+ # * */author - find all author elements that are children of any child elements
11
+ # * book//author - find all author elements that are descendants at any level of book child elements
12
+ # * .//author - find all author elements that are descendants of the current element
13
+ # * /library/books/book - find all book elements with the full path /library/books/book
14
+ # * author/text() - get the text values of all author child elements
15
+ class Selector
16
+ # Create a selector. Path should be an abbreviated XPath string.
17
+ def initialize (path)
18
+ @parts = []
19
+ path.gsub('//', '/%/').split('/').each do |part_path|
20
+ part_matchers = []
21
+ @parts << part_matchers
22
+ or_paths = part_path.split('|')
23
+ or_paths << "" if or_paths.empty?
24
+ or_paths.each do |matcher_path|
25
+ part_matchers << Matcher.new(matcher_path)
26
+ end
27
+ end
28
+ end
29
+
30
+ # Apply the selector to the current node. Note, if your path started with a /, it will be applied
31
+ # to the root node.
32
+ def find (node)
33
+ matched = [node]
34
+ @parts.each do |part_matchers|
35
+ context = matched
36
+ matched = []
37
+ part_matchers.each do |matcher|
38
+ matched.concat(matcher.select(context))
39
+ end
40
+ break if matched.empty?
41
+ end
42
+ return matched
43
+ end
44
+
45
+ # Match a partial path to a node.
46
+ class Matcher
47
+ def initialize (path)
48
+ case path
49
+ when 'text()'
50
+ @extractor = lambda{|node| node.value}
51
+ when '%'
52
+ @extractor = lambda{|node| node.descendants}
53
+ when '*'
54
+ @extractor = lambda{|node| node.children}
55
+ when '.'
56
+ @extractor = lambda{|node| node}
57
+ when '..'
58
+ @extractor = lambda{|node| node.parent ? node.parent : []}
59
+ when ''
60
+ @extractor = lambda{|node| root = Node.new(nil); root.children << node.root; root}
61
+ else
62
+ @extractor = lambda{|node| node.children.select{|child| child.name == path}}
63
+ end
64
+ end
65
+
66
+ # Select all nodes that match a partial path.
67
+ def select (context_nodes)
68
+ context_nodes.collect{|node| @extractor.call(node) if node.is_a?(Node)}.flatten
69
+ end
70
+ end
71
+ end
72
+ end
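The way Selector#initialize breaks a path into steps can be seen by running just the string handling on its own. The sketch below uses a hypothetical `selector_parts` helper (not part of the gem) that returns plain strings where the real class builds Matcher objects: `//` is rewritten to a `%` step meaning "descendants", a leading `/` produces an empty first step meaning "start from the root", and `|` separates alternatives within a step.

```ruby
# Hypothetical helper: reproduces the path splitting from Selector#initialize,
# returning the raw step strings instead of Matcher instances.
def selector_parts(path)
  path.gsub('//', '/%/').split('/').map do |part|
    alternatives = part.split('|')
    alternatives << "" if alternatives.empty? # empty step: path began with "/"
    alternatives
  end
end
```

For instance, `selector_parts(".//author")` gives `[["."], ["%"], ["author"]]`, and `selector_parts("/library/books/book")` gives `[[""], ["library"], ["books"], ["book"]]`.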
data/spec/node_spec.rb ADDED
@@ -0,0 +1,140 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'spec_helper'))
2
+
3
+ describe XmlNodeStream::Node do
4
+
5
+ it "should have a name" do
6
+ node = XmlNodeStream::Node.new("tag")
7
+ node.name.should == "tag"
8
+ end
9
+
10
+ it "should have attributes" do
11
+ node = XmlNodeStream::Node.new("tag")
12
+ node.attributes.should == {}
13
+ node["attr1"].should == nil
14
+ node = XmlNodeStream::Node.new("tag", nil, "attr1" => "val1", "attr2" => "val2")
15
+ node.attributes.should == {"attr1" => "val1", "attr2" => "val2"}
16
+ node["attr1"].should == "val1"
17
+ end
18
+
19
+ it "should have a value" do
20
+ node = XmlNodeStream::Node.new("tag")
21
+ node.value.should == nil
22
+ node = XmlNodeStream::Node.new("tag", nil, nil, "value")
23
+ node.value.should == "value"
24
+ end
25
+
26
+ it "should have a parent and children" do
27
+ parent = XmlNodeStream::Node.new("tag")
28
+ parent.parent.should == nil
29
+ parent.children.should == []
30
+ child_1 = XmlNodeStream::Node.new("child", parent)
31
+ child_2 = XmlNodeStream::Node.new("child")
32
+ parent.add_child(child_2)
33
+ parent.children.should == [child_1, child_2]
34
+ child_1.parent.should == parent
35
+ child_2.parent.should == parent
36
+ end
37
+
38
+ it "should be able to remove children" do
39
+ parent = XmlNodeStream::Node.new("tag")
40
+ child_1 = XmlNodeStream::Node.new("child", parent)
41
+ child_2 = XmlNodeStream::Node.new("child", parent)
42
+ parent.children.should == [child_1, child_2]
43
+ parent.remove_child(child_1)
44
+ parent.children.should == [child_2]
45
+ child_1.parent.should == nil
46
+ end
47
+
48
+ it "should release itself from its parent" do
49
+ parent = XmlNodeStream::Node.new("tag")
50
+ child_1 = XmlNodeStream::Node.new("child", parent)
51
+ child_2 = XmlNodeStream::Node.new("child", parent)
52
+ parent.children.should == [child_1, child_2]
53
+ child_1.release!
54
+ parent.children.should == [child_2]
55
+ child_1.parent.should == nil
56
+ end
57
+
58
+ it "should have ancestors" do
59
+ parent = XmlNodeStream::Node.new("tag")
60
+ child = XmlNodeStream::Node.new("child", parent)
61
+ grandchild = XmlNodeStream::Node.new("grandchild", child)
62
+ parent.ancestors.should == []
63
+ child.ancestors.should == [parent]
64
+ grandchild.ancestors.should == [child, parent]
65
+ end
66
+
67
+ it "should have descendants" do
68
+ parent = XmlNodeStream::Node.new("tag")
69
+ child_1 = XmlNodeStream::Node.new("child", parent)
70
+ child_2 = XmlNodeStream::Node.new("child", parent)
71
+ grandchild_1 = XmlNodeStream::Node.new("grandchild", child_1)
72
+ grandchild_2 = XmlNodeStream::Node.new("grandchild", child_1)
73
+ parent.descendants.should == [child_1, child_2, grandchild_1, grandchild_2]
74
+ child_1.descendants.should == [grandchild_1, grandchild_2]
75
+ grandchild_1.descendants.should == []
76
+ end
77
+
78
+ it "should have a root node" do
79
+ parent = XmlNodeStream::Node.new("tag")
80
+ child = XmlNodeStream::Node.new("child", parent)
81
+ grandchild = XmlNodeStream::Node.new("grandchild", child)
82
+ parent.root.should == parent
83
+ child.root.should == parent
84
+ grandchild.root.should == parent
85
+ end
86
+
87
+ it "should have a path" do
88
+ parent = XmlNodeStream::Node.new("tag")
89
+ child = XmlNodeStream::Node.new("child", parent)
90
+ grandchild = XmlNodeStream::Node.new("grandchild", child)
91
+ parent.path.should == "/tag"
92
+ child.path.should == "/tag/child"
93
+ grandchild.path.should == "/tag/child/grandchild"
94
+ end
95
+
96
+ it "should be able to select related nodes using a selector" do
97
+ parent = XmlNodeStream::Node.new("tag")
98
+ child_1 = XmlNodeStream::Node.new("child", parent)
99
+ child_2 = XmlNodeStream::Node.new("child", parent)
100
+ grandchild_1 = XmlNodeStream::Node.new("grandchild", child_1, nil, "val1")
101
+ grandchild_2 = XmlNodeStream::Node.new("grandchild", child_1, nil, "val2")
102
+ parent.select("nothing").should == []
103
+ parent.select("child").should == [child_1, child_2]
104
+ parent.select("child/grandchild").should == [grandchild_1, grandchild_2]
105
+ parent.select("child/grandchild/text()").should == ["val1", "val2"]
106
+ grandchild_1.select("../..").should == [parent]
107
+ end
108
+
109
+ it "should be able to find the first related node using a selector" do
110
+ parent = XmlNodeStream::Node.new("tag")
111
+ child_1 = XmlNodeStream::Node.new("child", parent)
112
+ child_2 = XmlNodeStream::Node.new("child", parent)
113
+ grandchild_1 = XmlNodeStream::Node.new("grandchild", child_1, nil, "val1")
114
+ grandchild_2 = XmlNodeStream::Node.new("grandchild", child_1, nil, "val2")
115
+ parent.find("nothing").should == nil
116
+ parent.find("child").should == child_1
117
+ parent.find("child/grandchild").should == grandchild_1
118
+ parent.find("child/grandchild/text()").should == "val1"
119
+ grandchild_1.find("../..").should == parent
120
+ end
121
+
122
+ it "should append text which strips whitespace from the start and end of the value" do
123
+ node = XmlNodeStream::Node.new("tag")
124
+ node.append(" ")
125
+ node.append(" \t\r\nhello ")
126
+ node.append(" there\n")
127
+ node.finish!
128
+ node.value.should == "hello there"
129
+ end
130
+
131
+ it "should append cdata which preserves all whitespace" do
132
+ node = XmlNodeStream::Node.new("tag")
133
+ node.append_cdata(" ")
134
+ node.append(" \t\r\nhello ")
135
+ node.append_cdata(" there\n")
136
+ node.finish!
137
+ node.value.should == " \t\r\nhello there\n"
138
+ end
139
+
140
+ end
@@ -0,0 +1,148 @@
1
+ require File.expand_path(File.join(File.dirname(__FILE__), 'spec_helper'))
2
+
3
+ describe XmlNodeStream::Parser do
4
+
5
+ before :each do
6
+ @text_xml_path = File.expand_path(File.join(File.dirname(__FILE__), 'test.xml'))
7
+ end
8
+
9
+ it "should parse a document in a string" do
10
+ validate_text_xml(XmlNodeStream::Parser.parse(File.read(@text_xml_path)))
11
+ end
12
+
13
+ it "should parse a document in a file path string" do
14
+ validate_text_xml(XmlNodeStream::Parser.parse(@text_xml_path))
15
+ end
16
+
17
+ it "should parse a document in a file path" do
18
+ validate_text_xml(XmlNodeStream::Parser.parse(Pathname.new(@text_xml_path)))
19
+ end
20
+
21
+ it "should parse a document in a url string" do
22
+ uri = URI.parse("http://test.host/test.xml")
23
+ URI.should_receive(:parse).with("http://test.host/test.xml").and_return(uri)
24
+ File.open(@text_xml_path) do |stream|
25
+ uri.should_receive(:open).and_return(stream)
26
+ validate_text_xml(XmlNodeStream::Parser.parse("http://test.host/test.xml"))
27
+ end
28
+ end
29
+
30
+ it "should parse a document in a URI" do
31
+ uri = URI.parse("http://test.host/test.xml")
32
+ stream = mock(:stream)
33
+ File.open(@text_xml_path) do |stream|
34
+ uri.should_receive(:open).and_return(stream)
35
+ validate_text_xml(XmlNodeStream::Parser.parse(uri))
36
+ end
37
+ end
38
+
39
+ it "should parse a document in a stream" do
40
+ io = StringIO.new(File.read(@text_xml_path))
41
+ io.should_not_receive(:close)
42
+ validate_text_xml(XmlNodeStream::Parser.parse(io))
43
+ end
44
+
45
+ it "should call a block with each element in a document" do
46
+ nodes = []
47
+ XmlNodeStream::Parser.parse(@text_xml_path) do |node|
48
+ nodes << node.path
49
+ end
50
+ nodes.should == %w(
51
+ /library/authors/author/name
52
+ /library/authors/author
53
+ /library/authors/author/name
54
+ /library/authors/author
55
+ /library/authors/author/name
56
+ /library/authors/author
57
+ /library/authors
58
+ /library/collection/section/book/title
59
+ /library/collection/section/book/author
60
+ /library/collection/section/book/abstract
61
+ /library/collection/section/book/volumes
62
+ /library/collection/section/book
63
+ /library/collection/section
64
+ /library/collection/section/book/title
65
+ /library/collection/section/book/author
66
+ /library/collection/section/book/abstract
67
+ /library/collection/section/book
68
+ /library/collection/section/book/title
69
+ /library/collection/section/book/author
70
+ /library/collection/section/book/abstract
71
+ /library/collection/section/book
72
+ /library/collection/section/book/title
73
+ /library/collection/section/book/alternate_title
74
+ /library/collection/section/book/author
75
+ /library/collection/section/book/abstract
76
+ /library/collection/section/book
77
+ /library/collection/section
78
+ /library/collection
79
+ /library
80
+ )
81
+ end
82
+
83
+ XmlNodeStream::Parser::SUPPORTED_PARSERS.each do |parser_name|
84
+ context "with #{parser_name}" do
85
+ before :all do
86
+ @save_parser_name = XmlNodeStream::Parser.parser_name
87
+ XmlNodeStream::Parser.parser_name = parser_name
88
+ end
89
+
90
+ after :all do
91
+ XmlNodeStream::Parser.parser_name = @save_parser_name
92
+ end
93
+
94
+ it "should parse a document" do
95
+ begin
96
+ validate_text_xml(XmlNodeStream::Parser.parse(@text_xml_path))
97
+ rescue NotImplementedError
98
+ pending("#{parser_name} is not installed for testing")
99
+ end
100
+ end
101
+ end
102
+ end
103
+
104
+ def validate_text_xml (root)
105
+ validate(root, :name => "library", :children => ["authors", "collection"])
106
+
107
+ validate(root.children[0], :name => "authors", :children => ["author"] * 3)
108
+ validate(root.children[0].children[0], :name => "author", :attributes => {"id" => "1"}, :children => ["name"])
109
+ validate(root.children[0].children[0].children[0], :name => "name", :value => "Edward Gibbon")
110
+ validate(root.children[0].children[1], :name => "author", :attributes => {"id" => "2"}, :children => ["name"])
111
+ validate(root.children[0].children[1].children[0], :name => "name", :value => "Herman Melville")
112
+ validate(root.children[0].children[2], :name => "author", :attributes => {"id" => "3"}, :children => ["name"])
113
+ validate(root.children[0].children[2].children[0], :name => "name", :value => "Jack London")
114
+
115
+ validate(root.children[1], :name => "collection", :children => ["section"] * 2)
116
+ history = root.children[1].children[0]
117
+ fiction = root.children[1].children[1]
118
+
119
+ validate(history, :name => "section", :attributes => {"id" => "100", "name" => "History"}, :children => ["book"])
120
+ validate(history.children[0], :name => "book", :attributes => {"id" => "1"}, :children => ["title", "author", "abstract", "volumes"])
121
+ validate(history.children[0].children[0], :name => "title", :value => "The Decline & Fall of the Roman Empire")
122
+ validate(history.children[0].children[1], :name => "author", :value => nil, :attributes => {"id" => "1"})
123
+ validate(history.children[0].children[2], :name => "abstract", :value => "History of the fall of Rome.")
+ validate(history.children[0].children[3], :name => "volumes", :value => "6")
+
+ validate(fiction, :name => "section", :attributes => {"id" => "200", "name" => "Fiction"}, :children => ["book"] * 3)
+ validate(fiction.children[0], :name => "book", :attributes => {"id" => "2"}, :children => ["title", "author", "abstract"])
+ validate(fiction.children[0].children[0], :name => "title", :value => "Call of the Wild")
+ validate(fiction.children[0].children[1], :name => "author", :value => nil, :attributes => {"id" => "3"})
+ validate(fiction.children[0].children[2], :name => "abstract", :value => "\n A dog goes to Alaska.\n ")
+ validate(fiction.children[1], :name => "book", :attributes => {"id" => "3"}, :children => ["title", "author", "abstract"])
+ validate(fiction.children[1].children[0], :name => "title", :value => "White Fang")
+ validate(fiction.children[1].children[1], :name => "author", :value => nil, :attributes => {"id" => "3"})
+ validate(fiction.children[1].children[2], :name => "abstract", :value => "Dogs, wolves, etc.")
+ validate(fiction.children[2], :name => "book", :attributes => {"id" => "4"}, :children => ["title", "alternate_title", "author", "abstract"])
+ validate(fiction.children[2].children[0], :name => "title", :value => "Moby Dick")
+ validate(fiction.children[2].children[1], :name => "alternate_title", :value => "The Whale")
+ validate(fiction.children[2].children[2], :name => "author", :value => nil, :attributes => {"id" => "2"})
+ validate(fiction.children[2].children[3], :name => "abstract", :value => "A mad captain seeks a mysterious white whale.")
+ end
+
+ def validate(node, options)
+ node.name.should == options[:name]
+ node.attributes.should == (options[:attributes] || {})
+ node.value.should == (options.include?(:value) ? options[:value] : "")
+ node.children.collect{|c| c.name}.should == (options[:children] || [])
+ end
+ end
@@ -0,0 +1,73 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'spec_helper'))
+
+ describe XmlNodeStream::Selector do
+
+ before :each do
+ @root = XmlNodeStream::Node.new("root")
+ @child_1 = XmlNodeStream::Node.new("child", @root)
+ @child_2 = XmlNodeStream::Node.new("child", @root)
+ @grandchild_1 = XmlNodeStream::Node.new("grandchild", @child_1, nil, "val1")
+ @grandchild_2 = XmlNodeStream::Node.new("grandchild", @child_1, nil, "val2")
+ @grandchild_3 = XmlNodeStream::Node.new("grandchild", @child_2, nil, "val3")
+ @grandchild_4 = XmlNodeStream::Node.new("grandchild", @child_2, nil, "val4")
+ @great_grandchild = XmlNodeStream::Node.new("grandchild", @grandchild_1, nil, "val1.a")
+ end
+
+ it "should find child nodes with a specified name" do
+ selector = XmlNodeStream::Selector.new("child")
+ selector.find(@root).should == [@child_1, @child_2]
+ selector = XmlNodeStream::Selector.new("./child")
+ selector.find(@root).should == [@child_1, @child_2]
+ selector = XmlNodeStream::Selector.new("nothing")
+ selector.find(@root).should == []
+ selector.find(@child_1).should == []
+ end
+
+ it "should find descendant nodes with a specified name" do
+ selector = XmlNodeStream::Selector.new(".//grandchild")
+ selector.find(@root).should == [@grandchild_1, @grandchild_2, @grandchild_3, @grandchild_4, @great_grandchild]
+ selector.find(@child_1).should == [@great_grandchild]
+ selector.find(@child_2).should == []
+ end
+
+ it "should find child nodes in a specified hierarchy" do
+ selector = XmlNodeStream::Selector.new("child/grandchild")
+ selector.find(@root).should == [@grandchild_1, @grandchild_2, @grandchild_3, @grandchild_4]
+ selector = XmlNodeStream::Selector.new("child/nothing")
+ selector.find(@root).should == []
+ selector.find(@child_1).should == []
+ end
+
+ it "should find a node itself" do
+ selector = XmlNodeStream::Selector.new(".")
+ selector.find(@child_1).should == [@child_1]
+ end
+
+ it "should find a parent node" do
+ selector = XmlNodeStream::Selector.new("..")
+ selector.find(@child_1).should == [@root]
+ selector.find(@root).should == []
+ end
+
+ it "should find a node's value" do
+ selector = XmlNodeStream::Selector.new("text()")
+ selector.find(@child_1).should == [nil]
+ selector.find(@grandchild_1).should == ["val1"]
+ selector = XmlNodeStream::Selector.new("child/grandchild/text()")
+ selector.find(@root).should == ["val1", "val2", "val3", "val4"]
+ end
+
+ it "should allow wildcards in the hierarchy" do
+ selector = XmlNodeStream::Selector.new("*/grandchild")
+ selector.find(@root).should == [@grandchild_1, @grandchild_2, @grandchild_3, @grandchild_4]
+ selector.find(@child_1).should == [@great_grandchild]
+ selector.find(@child_2).should == []
+ end
+
+ it "should find using full paths" do
+ selector = XmlNodeStream::Selector.new("/root/child")
+ selector.find(@root).should == [@child_1, @child_2]
+ selector.find(@grandchild_1).should == [@child_1, @child_2]
+ end
+
+ end
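The selector spec above treats a path as a sequence of steps, each applied to the matches of the previous step. The following is a minimal, self-contained sketch of that idea, not the gem's implementation: `SketchNode` and `select_path` are hypothetical names introduced only for illustration, and the sketch covers just relative paths built from names, `.`, `..`, `*`, and `text()`. Descendant (`.//`) and absolute (`/root/...`) paths are deliberately left out.

```ruby
# Hypothetical stand-in for a parsed node; this is NOT XmlNodeStream::Node.
class SketchNode
  attr_reader :name, :parent, :value, :children

  def initialize(name, parent = nil, value = nil)
    @name = name
    @parent = parent
    @value = value
    @children = []
    # Register with the parent so the tree can be walked downwards.
    parent.children << self if parent
  end
end

# Hypothetical helper: apply each slash-separated step to the current set
# of matches, starting from the context node.
def select_path(context, path)
  path.split("/").inject([context]) do |matches, step|
    case step
    when "."      then matches                                  # the node itself
    when ".."     then matches.map { |n| n.parent }.compact.uniq # parents, if any
    when "text()" then matches.map { |n| n.value }              # values, nil included
    when "*"      then matches.flat_map { |n| n.children }      # any child name
    else matches.flat_map { |n| n.children.select { |c| c.name == step } }
    end
  end
end

root    = SketchNode.new("root")
child_1 = SketchNode.new("child", root)
child_2 = SketchNode.new("child", root)
leaf_1  = SketchNode.new("grandchild", child_1, "val1")
leaf_2  = SketchNode.new("grandchild", child_2, "val2")

values = select_path(root, "child/grandchild/text()")
puts values.inspect
```

Folding over the steps keeps each step independent of the others, which is why a wildcard or `text()` can appear anywhere a name can in this sketch.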
@@ -0,0 +1,3 @@
+ require 'rubygems'
+ require 'spec'
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'lib', 'xml_node_stream'))
data/spec/test.xml ADDED
@@ -0,0 +1,57 @@
+ <?xml version="1.0"?>
+ <library>
+ <?library-info version="2.0" ignore="yes" ?>
+ <!-- Authors -->
+ <authors>
+ <author id="1">
+ <name>Edward Gibbon</name>
+ </author>
+ <author id="2">
+ <name>Herman Melville</name>
+ </author>
+ <author id="3">
+ <name>Jack London</name>
+ </author>
+ </authors>
+ <!-- Books -->
+ <collection>
+ <section id="100" name="History">
+ <book id="1">
+ <title>
+ The Decline &amp; Fall of the Roman Empire
+ </title>
+ <author id="1"/>
+ <abstract><![CDATA[History of the fall of Rome.]]></abstract>
+ <volumes>6</volumes>
+ </book>
+ </section>
+ <section id="200" name="Fiction">
+ <book id="2">
+ <title>
+ Call of the Wild
+ </title>
+ <author id="3"/>
+ <abstract><![CDATA[
+ A dog goes to Alaska.
+ ]]></abstract>
+ </book>
+ <book id="3">
+ <title>
+ White Fang
+ </title>
+ <author id="3"/>
+ <abstract><![CDATA[Dogs, wolves, etc.]]></abstract>
+ </book>
+ <book id="4">
+ <title>
+ Moby Dick
+ </title>
+ <alternate_title>
+ The Whale
+ </alternate_title>
+ <author id="2"/>
+ <abstract><![CDATA[A mad captain seeks a mysterious white whale.]]></abstract>
+ </book>
+ </section>
+ </collection>
+ </library>
@@ -0,0 +1,11 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), 'spec_helper'))
+
+ describe XmlNodeStream do
+
+ it "should parse a document using the Parser.parse method" do
+ block = lambda{}
+ XmlNodeStream::Parser.should_receive(:parse).with("<xml/>", &block)
+ XmlNodeStream.parse("<xml/>", &block)
+ end
+
+ end
@@ -0,0 +1,68 @@
+ # Generated by jeweler
+ # DO NOT EDIT THIS FILE DIRECTLY
+ # Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
+ # -*- encoding: utf-8 -*-
+
+ Gem::Specification.new do |s|
+ s.name = %q{xml_node_stream}
+ s.version = "1.0.0"
+
+ s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+ s.authors = ["Brian Durand"]
+ s.date = %q{2010-02-07}
+ s.email = %q{brian@embellishedvisions.com}
+ s.extra_rdoc_files = [
+ "README.rdoc"
+ ]
+ s.files = [
+ "MIT_LICENSE",
+ "README.rdoc",
+ "Rakefile",
+ "VERSION",
+ "init.rb",
+ "lib/xml_node_stream.rb",
+ "lib/xml_node_stream/node.rb",
+ "lib/xml_node_stream/parser.rb",
+ "lib/xml_node_stream/parser/base.rb",
+ "lib/xml_node_stream/parser/libxml_parser.rb",
+ "lib/xml_node_stream/parser/nokogiri_parser.rb",
+ "lib/xml_node_stream/parser/rexml_parser.rb",
+ "lib/xml_node_stream/selector.rb",
+ "spec/node_spec.rb",
+ "spec/parser_spec.rb",
+ "spec/selector_spec.rb",
+ "spec/spec_helper.rb",
+ "spec/test.xml",
+ "spec/xml_node_stream_spec.rb",
+ "xml_node_stream.gemspec"
+ ]
+ s.homepage = %q{http://github.com/bdurand/xml_node_stream}
+ s.rdoc_options = ["--charset=UTF-8"]
+ s.require_paths = ["lib"]
+ s.rubygems_version = %q{1.3.5}
+ s.summary = %q{Simple XML parser wrapper that provides the benefits of stream parsing with the ease of using document nodes.}
+ s.test_files = [
+ "spec/node_spec.rb",
+ "spec/parser_spec.rb",
+ "spec/selector_spec.rb",
+ "spec/spec_helper.rb",
+ "spec/xml_node_stream_spec.rb"
+ ]
+
+ if s.respond_to? :specification_version then
+ current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+ s.specification_version = 3
+
+ if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
+ s.add_development_dependency(%q<rspec>, [">= 1.2.9"])
+ s.add_development_dependency(%q<jeweler>, [">= 0"])
+ else
+ s.add_dependency(%q<rspec>, [">= 1.2.9"])
+ s.add_dependency(%q<jeweler>, [">= 0"])
+ end
+ else
+ s.add_dependency(%q<rspec>, [">= 1.2.9"])
+ s.add_dependency(%q<jeweler>, [">= 0"])
+ end
+ end
+
metadata ADDED
@@ -0,0 +1,97 @@
+ --- !ruby/object:Gem::Specification
+ name: xml_node_stream
+ version: !ruby/object:Gem::Version
+ version: 1.0.0
+ platform: ruby
+ authors:
+ - Brian Durand
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2010-02-07 00:00:00 -06:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+ name: rspec
+ type: :development
+ version_requirement:
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: 1.2.9
+ version:
+ - !ruby/object:Gem::Dependency
+ name: jeweler
+ type: :development
+ version_requirement:
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: "0"
+ version:
+ description:
+ email: brian@embellishedvisions.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files:
+ - README.rdoc
+ files:
+ - MIT_LICENSE
+ - README.rdoc
+ - Rakefile
+ - VERSION
+ - init.rb
+ - lib/xml_node_stream.rb
+ - lib/xml_node_stream/node.rb
+ - lib/xml_node_stream/parser.rb
+ - lib/xml_node_stream/parser/base.rb
+ - lib/xml_node_stream/parser/libxml_parser.rb
+ - lib/xml_node_stream/parser/nokogiri_parser.rb
+ - lib/xml_node_stream/parser/rexml_parser.rb
+ - lib/xml_node_stream/selector.rb
+ - spec/node_spec.rb
+ - spec/parser_spec.rb
+ - spec/selector_spec.rb
+ - spec/spec_helper.rb
+ - spec/test.xml
+ - spec/xml_node_stream_spec.rb
+ - xml_node_stream.gemspec
+ has_rdoc: true
+ homepage: http://github.com/bdurand/xml_node_stream
+ licenses: []
+
+ post_install_message:
+ rdoc_options:
+ - --charset=UTF-8
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: "0"
+ version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: "0"
+ version:
+ requirements: []
+
+ rubyforge_project:
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 3
+ summary: Simple XML parser wrapper that provides the benefits of stream parsing with the ease of using document nodes.
+ test_files:
+ - spec/node_spec.rb
+ - spec/parser_spec.rb
+ - spec/selector_spec.rb
+ - spec/spec_helper.rb
+ - spec/xml_node_stream_spec.rb