webtranslateit-hpricot 0.9.0

Files changed (55)
  1. checksums.yaml +7 -0
  2. data/.gitignore +15 -0
  3. data/CHANGELOG +122 -0
  4. data/COPYING +18 -0
  5. data/README.md +295 -0
  6. data/Rakefile +237 -0
  7. data/ext/fast_xs/FastXsService.java +1123 -0
  8. data/ext/fast_xs/extconf.rb +4 -0
  9. data/ext/fast_xs/fast_xs.c +210 -0
  10. data/ext/hpricot_scan/HpricotCss.java +850 -0
  11. data/ext/hpricot_scan/HpricotScanService.java +2085 -0
  12. data/ext/hpricot_scan/MANIFEST +0 -0
  13. data/ext/hpricot_scan/extconf.rb +9 -0
  14. data/ext/hpricot_scan/hpricot_common.rl +76 -0
  15. data/ext/hpricot_scan/hpricot_css.c +3511 -0
  16. data/ext/hpricot_scan/hpricot_css.java.rl +155 -0
  17. data/ext/hpricot_scan/hpricot_css.rl +120 -0
  18. data/ext/hpricot_scan/hpricot_scan.c +6848 -0
  19. data/ext/hpricot_scan/hpricot_scan.h +79 -0
  20. data/ext/hpricot_scan/hpricot_scan.java.rl +1173 -0
  21. data/ext/hpricot_scan/hpricot_scan.rl +911 -0
  22. data/extras/hpricot.png +0 -0
  23. data/hpricot.gemspec +18 -0
  24. data/lib/hpricot/blankslate.rb +63 -0
  25. data/lib/hpricot/builder.rb +217 -0
  26. data/lib/hpricot/elements.rb +514 -0
  27. data/lib/hpricot/htmlinfo.rb +691 -0
  28. data/lib/hpricot/inspect.rb +103 -0
  29. data/lib/hpricot/modules.rb +40 -0
  30. data/lib/hpricot/parse.rb +38 -0
  31. data/lib/hpricot/tag.rb +219 -0
  32. data/lib/hpricot/tags.rb +164 -0
  33. data/lib/hpricot/traverse.rb +839 -0
  34. data/lib/hpricot/xchar.rb +95 -0
  35. data/lib/hpricot.rb +26 -0
  36. data/setup.rb +1585 -0
  37. data/test/files/basic.xhtml +17 -0
  38. data/test/files/boingboing.html +2266 -0
  39. data/test/files/cy0.html +3653 -0
  40. data/test/files/immob.html +400 -0
  41. data/test/files/pace_application.html +1320 -0
  42. data/test/files/tenderlove.html +16 -0
  43. data/test/files/uswebgen.html +220 -0
  44. data/test/files/utf8.html +1054 -0
  45. data/test/files/week9.html +1723 -0
  46. data/test/files/why.xml +19 -0
  47. data/test/load_files.rb +7 -0
  48. data/test/nokogiri-bench.rb +64 -0
  49. data/test/test_alter.rb +96 -0
  50. data/test/test_builder.rb +37 -0
  51. data/test/test_parser.rb +496 -0
  52. data/test/test_paths.rb +25 -0
  53. data/test/test_preserved.rb +88 -0
  54. data/test/test_xml.rb +28 -0
  55. metadata +106 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: b2c7f0e599b62be02967d46819ec60c457e8e7c2207752ef328d069d3ca3d627
+   data.tar.gz: 75205569719178f6b699114f54a042d491b4a3fc16248ace4a5f171460a720cf
+ SHA512:
+   metadata.gz: f7a5c3f9770659390d82c477c9ec45968c560e8c878399708a4a23a93c7a61fef9b46eaea6c0a0fff12b62ffbc318e8b03fd34c8095cb3687445badbc02999a5
+   data.tar.gz: 85afd161d4358033e9b4e32c69c35af264f88599849fc0022444f41da2cc91dade9d9949e07bf9e36980e3b613d1b5d92f1cb17253660f57b44205d9d5d16164
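Reviewer note: the digests above cover the two archive members of the packaged gem, `metadata.gz` and `data.tar.gz`. A minimal sketch of how such a file could be checked by hand follows, using only Ruby's standard library. The helper name `verify_checksums` and the shape of its `members` argument are assumptions made for illustration and are not part of RubyGems; extracting the member bytes from the `.gem` archive is left out.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper (not a RubyGems API): compares the hex digests listed
# in a checksums.yaml document against the actual bytes of each member.
#
# checksums_yaml - the YAML text, laid out like the file above
# members        - Hash mapping a member name (e.g. "data.tar.gz") to its bytes
#
# Returns a Hash of [algorithm, member_name] => true/false.
def verify_checksums(checksums_yaml, members)
  expected = YAML.safe_load(checksums_yaml)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }

  results = {}
  expected.each do |algo_name, digests|
    # Raises KeyError for an algorithm this sketch does not know about.
    digest_class = algorithms.fetch(algo_name)
    digests.each do |member_name, hex|
      # Raises KeyError if the caller did not supply this member's bytes.
      bytes = members.fetch(member_name)
      results[[algo_name, member_name]] = (digest_class.hexdigest(bytes) == hex)
    end
  end
  results
end
```

Called with the YAML text and a hash such as `{ 'metadata.gz' => ..., 'data.tar.gz' => ... }`, the result maps every listed digest to `true` or `false`, so any entry that is not `true` indicates a mismatch.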
data/.gitignore ADDED
@@ -0,0 +1,15 @@
+ *.class
+ *.o
+ *.bundle
+ *.so
+ *.rbc
+ mkmf.log
+ conftest.dSYM
+ lib/*.jar
+ lib/hpricot_scan.rb
+ lib/fast_xs.rb
+ hpricot-*-java
+ hpricot-*-mswin32
+ pkg
+ .DS_Store
+ tmp
data/CHANGELOG ADDED
@@ -0,0 +1,122 @@
+ = 0.9.0
+ === 23 April 2024
+ * Fix issue compiling with clang 16.
+
+ = 0.8.6
+ === 17 January 2012
+ * Allow any tags to contain unknown tags (Steven Parkes)
+
+ = 0.8.5
+ === 29 November 2011
+ * Remove escaped quote (\') from matching (#55)
+ * Fix 'undefined method downcase for nil:NilClass' on JRuby (#58)
+ * Unescape hex numeric character references
+
+ = 0.8.4
+ === 28 February, 2011
+ * GH #21, #32, #33, #36: Fix for reported segfaults
+
+ = 0.8.3
+ === 3 November, 2010
+ * GH#8: Nil-check before downcasing attribute key
+ * GH#25: Proper ruby 1.9 encoding support
+ * GH#28. Use integers instead of ?? on 1.9, which is just a string.
+ * including noscript to ElementInclusions , so that hpricot wont fail
+ when trying to parse a meta tag inside head section when noscript is
+ present.
+ * latest changes from fast_xs mainline
+ * Fixes to get Hpricot running on Rubinius:
+ * Use free, not XFREE
+ * Remove RSTRUCT craziness, don't break Array#at
+
+ = 0.8.2
+ === 5 November, 2009
+ * Bring JRuby support up to speed, including Java-based hpricot_css support
+ * Change JRuby fast_xs to have same escaping behavior as C fast_xs
+ * fix for issue #2, downcasing of html attributes inside the parser.
+ * solve issue #3 with bogus etags being preserved in `to_s` rather than just `to_original_html`.
+ * fix error when attempting to reparent cleared node. (issue #5)
+ * Hpricot::Attributes proxy object for using `ele.attributes[k] = v` directly.
+ however, it is preferred to use the jquery-like `elements.attr(k, v)`.
+
+ = 0.8.1
+ === 3 April, 2009
+ * big problems on Ruby 1.8.6, use INT2FIX instead of INT2NUM. hashes were being cast to bignums.
+ * patch for 1.8.5 to define RARRAY_PTR. thanks, mike perham!
+ * inspecting empty document bug, courtesy of @TalLevAmi.
+
+ = 0.8
+ === 31st March, 2009
+ * Saving memory and speed by using RStruct-based elements in the C extension.
+ * Bug in tag parsing, causing runaway <script> and <style> tags in HTML.
+ * Problem compiling under Ruby 1.9, due to our_rb_hash_lookup function meant for Ruby 1.8.
+ * CData was missing inner_text method.
+
+ = 0.7
+ === 17th March, 2009
+ * Rewritten parser routine, much lighter on memory, quite a bit faster.
+ * Friendlier with Ruby 1.9.
+ * Fixes to nth-child and text() selectors.
+
+ = 0.6
+ === 15th June, 2007
+ * Hpricot for JRuby -- nice work Ola Bini!
+ * Inline Markaby for Hpricot documents.
+ * XML tags and attributes are no longer downcased like HTML is.
+ * new syntax for grabbing everything between two elements using a Range in the search method: (doc/("font".."font/br")) or in nodes_at like so: (doc/"font").nodes_at("*".."br"). Only works with either a pair of siblings or a set of a parent and a sibling.
+ * Ignore self-closing endings on tags (such as form) which are containers. Treat them like open parent tags. Reported by Jonathan Nichols on the hpricot list.
+ * Escaping of attributes, yanked from Jim Weirich and Sam Ruby's work in Builder.
+ * Element#raw_attributes gives unescaped data. Element#attributes gives escaped.
+ * Added: Elements#attr, Elements#remove_attr, Elements#remove_class.
+ * Added: Traverse#preceding, Traverse#following, Traverse#previous, Traverse#next.
+
+ = 0.5
+ === 31rd January, 2007
+
+ * support for a[text()="Click Me!"] and h3[text()*="space"] and the like.
+ * Hpricot.buffer_size accessor for increasing Hpricot's buffer if you're encountering huge ASP.NET viewstate attribs.
+ * some support for colons in tag names (not full namespace support yet.)
+ * Element.to_original_html will attempt to preserve the original HTML while merging your changes.
+ * Element.to_plain_text converts an element's contents to a simple text format.
+ * Element.inner_text removes all tags and returns text nodes concatenated into a single string.
+ * no @raw_string variable kept for comments, text, and cdata -- as it's redundant.
+ * xpath-style indices (//p/a[1]) but keep in mind that they aren't zero-based.
+ * node_position is the index among all sibling nodes, while position is the position among children of identical type.
+ * comment() and text() search criteria, like: //p/text(), which selects all text inside paragraph tags.
+ * every element has css_path and xpath methods which return respective absolute paths.
+ * more flexibility all around: in parsing attributes, tags, comments and cdata.
+
+ = 0.4
+ === 11th August, 2006
+
+ * The :fixup_tags option will try to sort out the hierarchy so elements end up with the right parents.
+ * Elements such as *script* and *style* (identified as having CDATA contents) receive a single text node as their children now. Previously, Hpricot was parsing out tags found in scripts.
+ * Better scanning of partially quoted attributes (found by Brent Beardsly on http://uswebgen.com/)
+ * Better scanning of unquoted attributes -- thanks to Aaron Patterson for the test cases!
+ * Some tags were being output in the empty tag style, although browsers hated that. FIXED!
+ * Added Elements#at for finding single elements.
+ * Added Elem::Trav#[] and Elem::Trav#[]= for reading and writing attributes.
+
+ = 0.3
+ === 7th July, 2006
+
+ * Fixed negative string size error on empty tokens. (news.bbc.co.uk)
+ * Allow the parser to accept just text nodes. (such as: <tt>Hpricot.parse('TEXT')</tt>)
+ * from JQuery to Hpricot::Elements: remove, empty, append, prepend, before, after, wrap, set,
+ html(...), to_html, to_s.
+ * on containers: to_html, replace_child, insert_before, insert_after, innerHTML=.
+ * Hpricot(...) is an alias for parse.
+ * open up all properties to setters, let people do as they may.
+ * use to_html for the full html of a node or set of elements.
+ * doctypes were messed.
+
+ = 0.2
+ === 4th July, 2006
+
+ * Rewrote the HTree parser to be simpler, more adequate for the common man. Will add encoding back in later.
+
+ = 0.1
+ === 3rd July, 2006
+
+ * For whatever reason, wrote this HTML parser in C.
+ I guess Ragel is addictive and I want to improve HTree.
data/COPYING ADDED
@@ -0,0 +1,18 @@
+ Copyright (c) 2006 why the lucky stiff
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to
+ deal in the Software without restriction, including without limitation the
+ rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ sell copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,295 @@
+ # Hpricot is over.
+
+ After years of lack of a proper maintainer for one of why's jewels, it has been
+ decided to finally close the book on hpricot. Most users have migrated to alternatives
+ and there is simply no time or energy to continue with the current codebase.
+
+ If you feel that you have the time and wish to take it over, I suggest you instead
+ think about making the hpricot-like API within nokogiri 100% compatible, that is a better
+ use of your time.
+
+ But if you still feel like "No damnit, I wanna work on hpricot itself still!" then fork
+ this repo and start work. Send @evanphx or @nicksieger a message if you feel like you
+ want to take over the gem name with new releases under the hpricot name.
+
+ Thanks to \_why for all the fun. We'll never forget it.
+
+ ## Now back to your original README content...
+
+
+ # Hpricot, Read Any HTML
+
+ Hpricot is a fast, flexible HTML parser written in C. It's designed to be very
+ accommodating (like Tanaka Akira's HTree) and to have a very helpful library
+ (like some JavaScript libs -- JQuery, Prototype -- give you.) The XPath and CSS
+ parser, in fact, is based on John Resig's JQuery.
+
+ Also, Hpricot can be handy for reading broken XML files, since many of the same
+ techniques can be used. If a quote is missing, Hpricot tries to figure it out.
+ If tags overlap, Hpricot works on sorting them out. You know, that sort of
+ thing.
+
+ *Please read this entire document* before making assumptions about how this
+ software works.
+
+ ## An Overview
+
+ Let's clear up what Hpricot is.
+
+ * Hpricot is *a standalone library*. It requires no other libraries. Just Ruby!
+ * While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
+ pays a small penalty in order to get that right. So that's slightly more important
+ to me than speed.
+ * *If you can see it in Firefox, then Hpricot should parse it.* That's
+ how it should be! Let me know the minute it's otherwise.
+ * Primarily, Hpricot is used for reading HTML and tries to sort out troubled
+ HTML by having some idea of what good HTML is. Some people still like to use
+ Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!
+
+ ## The Hpricot Kingdom
+
+ First, here are all the links you need to know:
+
+ * http://wiki.github.com/hpricot/hpricot is the Hpricot wiki and
+ http://github.com/hpricot/hpricot/issues is the bug tracker.
+ Go there for news and recipes and patches. It's the center of activity.
+ * http://github.com/hpricot/hpricot is the main Git
+ repository for Hpricot. You can get the latest code there.
+ * See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
+
+ If you have any trouble, don't hesitate to contact the author. As always, I'm
+ not going to say "Use at your own risk" because I don't want this library to be
+ risky. If you trip on something, I'll share the liability by repairing things
+ as quickly as I can. Your responsibility is to report the inadequacies.
+
+ ## Installing Hpricot
+
+ You may get the latest stable version from Rubyforge. Win32 binaries,
+ Java binaries (for JRuby), and source gems are available.
+
+     $ gem install hpricot
+
+ ## An Hpricot Showcase
+
+ We're going to run through a big pile of examples to get you jump-started.
+ Many of these examples are also found at
+ http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you
+ want to add some of your own.
+
+ ### Loading Hpricot Itself
+
+ You have probably got the gem, right? To load Hpricot:
+
+     require 'rubygems'
+     require 'hpricot'
+
+ If you've installed the plain source distribution, go ahead and just:
+
+     require 'hpricot'
+
+ ### Load an HTML Page
+
+ The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
+ contents into a document object.
+
+     doc = Hpricot("<p>A simple <b>test</b> string.</p>")
+
+ To load from a file, just get the stream open:
+
+     doc = open("index.html") { |f| Hpricot(f) }
+
+ To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:
+
+     require 'open-uri'
+     doc = open("http://qwantz.com/") { |f| Hpricot(f) }
+
+ Hpricot uses an internal buffer to parse the file, so the IO will stream
+ properly and large documents won't be loaded into memory all at once. However,
+ the parsed document object will be present in memory, in its entirety.
+
+ ### Search for Elements
+
+ Use <tt>Doc.search</tt>:
+
+     doc.search("//p[@class='posted']")
+     #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+
+ <tt>Doc.search</tt> can take an XPath or CSS expression. In the above example,
+ all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
+ attribute of <tt>"posted"</tt>.
+
+ A shortcut is to use the divisor:
+
+     (doc/"p.posted")
+     #=> #<Hpricot:Elements[{p ...}, {p ...}]>
+
+ ### Finding Just One Element
+
+ If you're looking for a single element, the <tt>at</tt> method will return the
+ first element matched by the expression. In this case, you'll get back the
+ element itself rather than the <tt>Hpricot::Elements</tt> array.
+
+     doc.at("body")['onload']
+
+ The above code will find the body tag and give you back the <tt>onload</tt>
+ attribute. This is the most common reason to use the element directly: when
+ reading and writing HTML attributes.
+
+ ### Fetching the Contents of an Element
+
+ Just as with browser scripting, the <tt>inner_html</tt> property can be used to
+ get the inner contents of an element.
+
+     (doc/"#elementID").inner_html
+     #=> "..contents.."
+
+ If your expression matches more than one element, you'll get back the contents
+ of ''all the matched elements''. So you may want to use <tt>first</tt> to be
+ sure you get back only one.
+
+     (doc/"#elementID").first.inner_html
+     #=> "..contents.."
+
+ ### Fetching the HTML for an Element
+
+ If you want the HTML for the whole element (not just the contents), use
+ <tt>to_html</tt>:
+
+     (doc/"#elementID").to_html
+     #=> "<div id='elementID'>...</div>"
+
+ ### Looping
+
+ All searches return a set of <tt>Hpricot::Elements</tt>. Go ahead and loop
+ through them like you would an array.
+
+     (doc/"p/a/img").each do |img|
+       puts img.attributes['class']
+     end
+
+ ### Continuing Searches
+
+ Searches can be continued from a collection of elements, in order to search deeper.
+
+     # find all paragraphs.
+     elements = doc.search("/html/body//p")
+     # continue the search by finding any images within those paragraphs.
+     (elements/"img")
+     #=> #<Hpricot::Elements[{img ...}, {img ...}]>
+
+ Searches can also be continued by searching within container elements.
+
+     # find all images within paragraphs.
+     doc.search("/html/body//p").each do |para|
+       puts "== Found a paragraph =="
+       pp para
+
+       imgs = para.search("img")
+       if imgs.any?
+         puts "== Found #{imgs.length} images inside =="
+       end
+     end
+
+ Of course, the most succinct ways to do the above are using CSS or XPath.
+
+     # the xpath version
+     (doc/"/html/body//p//img")
+     # the css version
+     (doc/"html > body > p img")
+     # ..or symbols work, too!
+     (doc/:html/:body/:p/:img)
+
+ ### Looping Edits
+
+ You may certainly edit objects from within your search loops. Then, when you
+ spit out the HTML, the altered elements will show.
+
+
+     (doc/"span.entryPermalink").each do |span|
+       span.attributes['class'] = 'newLinks'
+     end
+     puts doc
+
+ This changes all <tt>span.entryPermalink</tt> elements to
+ <tt>span.newLinks</tt>. Keep in mind that there are often more convenient ways
+ of doing this. Such as the <tt>set</tt> method:
+
+     (doc/"span.entryPermalink").set(:class => 'newLinks')
+
+ ### Figuring Out Paths
+
+ Every element can tell you its unique path (either XPath or CSS) to get to the
+ element from the root tag.
+
+ The <tt>css_path</tt> method:
+
+     doc.at("div > div:nth(1)").css_path
+     #=> "div > div:nth(1)"
+     doc.at("#header").css_path
+     #=> "#header"
+
+ Or, the <tt>xpath</tt> method:
+
+     doc.at("div > div:nth(1)").xpath
+     #=> "/div/div:eq(1)"
+     doc.at("#header").xpath
+     #=> "//div[@id='header']"
+
+ ## Hpricot Fixups
+
+ When loading HTML documents, you have a few settings that can make Hpricot more
+ or less intense about how it gets involved.
+
+ ## :fixup_tags
+
+ Really, there are so many ways to clean up HTML and your intentions may be to
+ keep the HTML as-is. So Hpricot's default behavior is to keep things flexible.
+ Making sure to open and close all the tags, but ignore any validation problems.
+
+ As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
+ to shift the document's tags to meet XHTML 1.0 Strict.
+
+     doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
+
+ This doesn't quite meet the XHTML 1.0 Strict standard, it just tries to follow
+ the rules a bit better. Like: say Hpricot finds a paragraph in a link, it's
+ going to move the paragraph below the link. Or up and out of other elements
+ where paragraphs don't belong.
+
+ If an unknown element is found, it is ignored. Again, <tt>:fixup_tags</tt>.
+
+ ## :xhtml_strict
+
+ So, let's go beyond just trying to fix the hierarchy. The
+ <tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
+ 1.0 Strict document. Even at the cost of removing elements that get in the way.
+
+     doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
+
+ What measures does <tt>:xhtml_strict</tt> take?
+
+ 1. Shift elements into their proper containers just like :fixup_tags.
+ 2. Remove unknown elements.
+ 3. Remove unknown attributes.
+ 4. Remove illegal content.
+ 5. Alter the doctype to XHTML 1.0 Strict.
+
+ ## Hpricot.XML()
+
+ The last option is the <tt>:xml</tt> option, which makes some slight variations
+ on the standard mode. The main difference is that :xml mode won't try to output
+ tags which are friendlier for browsers. For example, if an opening and closing
+ <tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
+
+ XML mode also doesn't downcase the tags and attributes for you. So pay attention
+ to case, friends.
+
+ The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
+
+     doc = open("http://redhanded.hobix.com/index.xml") do |f|
+       Hpricot.XML(f)
+     end
+
+ *Also, :fixup_tags is canceled out by the :xml option.* This is because
+ :fixup_tags makes assumptions based how HTML is structured. Specifically, how
+ tags are defined in the XHTML 1.0 DTD.