RubyGems - webtranslateit-hpricot - Versions diffs - 0.9.0 - Mend

webtranslateit-hpricot 0.9.0

Files changed (55)

checksums.yaml +7 -0
data/.gitignore +15 -0
data/CHANGELOG +122 -0
data/COPYING +18 -0
data/README.md +295 -0
data/Rakefile +237 -0
data/ext/fast_xs/FastXsService.java +1123 -0
data/ext/fast_xs/extconf.rb +4 -0
data/ext/fast_xs/fast_xs.c +210 -0
data/ext/hpricot_scan/HpricotCss.java +850 -0
data/ext/hpricot_scan/HpricotScanService.java +2085 -0
data/ext/hpricot_scan/MANIFEST +0 -0
data/ext/hpricot_scan/extconf.rb +9 -0
data/ext/hpricot_scan/hpricot_common.rl +76 -0
data/ext/hpricot_scan/hpricot_css.c +3511 -0
data/ext/hpricot_scan/hpricot_css.java.rl +155 -0
data/ext/hpricot_scan/hpricot_css.rl +120 -0
data/ext/hpricot_scan/hpricot_scan.c +6848 -0
data/ext/hpricot_scan/hpricot_scan.h +79 -0
data/ext/hpricot_scan/hpricot_scan.java.rl +1173 -0
data/ext/hpricot_scan/hpricot_scan.rl +911 -0
data/extras/hpricot.png +0 -0
data/hpricot.gemspec +18 -0
data/lib/hpricot/blankslate.rb +63 -0
data/lib/hpricot/builder.rb +217 -0
data/lib/hpricot/elements.rb +514 -0
data/lib/hpricot/htmlinfo.rb +691 -0
data/lib/hpricot/inspect.rb +103 -0
data/lib/hpricot/modules.rb +40 -0
data/lib/hpricot/parse.rb +38 -0
data/lib/hpricot/tag.rb +219 -0
data/lib/hpricot/tags.rb +164 -0
data/lib/hpricot/traverse.rb +839 -0
data/lib/hpricot/xchar.rb +95 -0
data/lib/hpricot.rb +26 -0
data/setup.rb +1585 -0
data/test/files/basic.xhtml +17 -0
data/test/files/boingboing.html +2266 -0
data/test/files/cy0.html +3653 -0
data/test/files/immob.html +400 -0
data/test/files/pace_application.html +1320 -0
data/test/files/tenderlove.html +16 -0
data/test/files/uswebgen.html +220 -0
data/test/files/utf8.html +1054 -0
data/test/files/week9.html +1723 -0
data/test/files/why.xml +19 -0
data/test/load_files.rb +7 -0
data/test/nokogiri-bench.rb +64 -0
data/test/test_alter.rb +96 -0
data/test/test_builder.rb +37 -0
data/test/test_parser.rb +496 -0
data/test/test_paths.rb +25 -0
data/test/test_preserved.rb +88 -0
data/test/test_xml.rb +28 -0
metadata +106 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: b2c7f0e599b62be02967d46819ec60c457e8e7c2207752ef328d069d3ca3d627
+  data.tar.gz: 75205569719178f6b699114f54a042d491b4a3fc16248ace4a5f171460a720cf
+SHA512:
+  metadata.gz: f7a5c3f9770659390d82c477c9ec45968c560e8c878399708a4a23a93c7a61fef9b46eaea6c0a0fff12b62ffbc318e8b03fd34c8095cb3687445badbc02999a5
+  data.tar.gz: 85afd161d4358033e9b4e32c69c35af264f88599849fc0022444f41da2cc91dade9d9949e07bf9e36980e3b613d1b5d92f1cb17253660f57b44205d9d5d16164
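
Reviewer note: a `.gem` file is a tar archive containing `metadata.gz`, `data.tar.gz`, and `checksums.yaml.gz`, so the digests above can be checked against the unpacked archive members. The following is a minimal sketch, assuming the archive has already been unpacked into a directory. `verify_checksums` is a hypothetical helper name used for illustration and is not part of RubyGems.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper (not part of RubyGems): compares the digests listed in
# the text of a checksums.yaml file against the files found in +dir+.
# Returns a hash such as { "SHA256/metadata.gz" => true, ... }.
def verify_checksums(checksums_yaml, dir)
  algorithms = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }
  results = {}
  YAML.safe_load(checksums_yaml).each do |algo, files|
    digest_class = algorithms.fetch(algo) # raises KeyError for an unlisted algorithm
    files.each do |name, expected|
      actual = digest_class.file(File.join(dir, name)).hexdigest
      results["#{algo}/#{name}"] = (actual == expected.to_s)
    end
  end
  results
end
```

Any entry that maps to `false` means the file on disk does not match the digest recorded at packaging time.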

data/.gitignore ADDED

@@ -0,0 +1,15 @@
+*.class
+*.o
+*.bundle
+*.so
+*.rbc
+mkmf.log
+conftest.dSYM
+lib/*.jar
+lib/hpricot_scan.rb
+lib/fast_xs.rb
+hpricot-*-java
+hpricot-*-mswin32
+pkg
+.DS_Store
+tmp

data/CHANGELOG ADDED

@@ -0,0 +1,122 @@
+= 0.9.0
+=== 23 April 2024
+* Fix issue compiling with clang 16.
+= 0.8.6
+=== 17 January 2012
+* Allow any tags to contain unknown tags (Steven Parkes)
+= 0.8.5
+=== 29 November 2011
+* Remove escaped quote (\') from matching (#55)
+* Fix 'undefined method downcase for nil:NilClass' on JRuby (#58)
+* Unescape hex numeric character references
+= 0.8.4
+=== 28 February, 2011
+* GH #21, #32, #33, #36: Fix for reported segfaults
+= 0.8.3
+=== 3 November, 2010
+* GH#8: Nil-check before downcasing attribute key
+* GH#25: Proper ruby 1.9 encoding support
+* GH#28. Use integers instead of ?? on 1.9, which is just a string.
+* Add noscript to ElementInclusions so that Hpricot won't fail
+  when trying to parse a meta tag inside the head section when noscript is
+  present.
+* latest changes from fast_xs mainline
+* Fixes to get Hpricot running on Rubinius:
+  * Use free, not XFREE
+  * Remove RSTRUCT craziness, don't break Array#at
+= 0.8.2
+=== 5 November, 2009
+* Bring JRuby support up to speed, including Java-based hpricot_css support
+* Change JRuby fast_xs to have same escaping behavior as C fast_xs
+* fix for issue #2, downcasing of html attributes inside the parser.
+* solve issue #3 with bogus etags being preserved in `to_s` rather than just `to_original_html`.
+* fix error when attempting to reparent cleared node. (issue #5)
+* Hpricot::Attributes proxy object for using `ele.attributes[k] = v` directly.
+  however, it is preferred to use the jquery-like `elements.attr(k, v)`.
+= 0.8.1
+=== 3 April, 2009
+* big problems on Ruby 1.8.6, use INT2FIX instead of INT2NUM. hashes were being cast to bignums.
+* patch for 1.8.5 to define RARRAY_PTR. thanks, mike perham!
+* inspecting empty document bug, courtesy of @TalLevAmi.
+= 0.8
+=== 31st March, 2009
+* Saving memory and speed by using RStruct-based elements in the C extension.
+* Bug in tag parsing, causing runaway <script> and <style> tags in HTML.
+* Problem compiling under Ruby 1.9, due to our_rb_hash_lookup function meant for Ruby 1.8.
+* CData was missing inner_text method.
+= 0.7
+=== 17th March, 2009
+* Rewritten parser routine, much lighter on memory, quite a bit faster.
+* Friendlier with Ruby 1.9.
+* Fixes to nth-child and text() selectors.
+= 0.6
+=== 15th June, 2007
+* Hpricot for JRuby -- nice work Ola Bini!
+* Inline Markaby for Hpricot documents.
+* XML tags and attributes are no longer downcased like HTML is.
+* new syntax for grabbing everything between two elements using a Range in the search method: (doc/("font".."font/br")) or in nodes_at like so: (doc/"font").nodes_at("*".."br"). Only works with either a pair of siblings or a set of a parent and a sibling.
+* Ignore self-closing endings on tags (such as form) which are containers. Treat them like open parent tags. Reported by Jonathan Nichols on the hpricot list.
+* Escaping of attributes, yanked from Jim Weirich and Sam Ruby's work in Builder.
+* Element#raw_attributes gives unescaped data.  Element#attributes gives escaped.
+* Added: Elements#attr, Elements#remove_attr, Elements#remove_class.
+* Added: Traverse#preceding, Traverse#following, Traverse#previous, Traverse#next.
+= 0.5
+=== 31st January, 2007
+* support for a[text()="Click Me!"] and h3[text()*="space"] and the like.
+* Hpricot.buffer_size accessor for increasing Hpricot's buffer if you're encountering huge ASP.NET viewstate attribs.
+* some support for colons in tag names (not full namespace support yet.)
+* Element.to_original_html will attempt to preserve the original HTML while merging your changes.
+* Element.to_plain_text converts an element's contents to a simple text format.
+* Element.inner_text removes all tags and returns text nodes concatenated into a single string.
+* no @raw_string variable kept for comments, text, and cdata -- as it's redundant.
+* xpath-style indices (//p/a[1]) but keep in mind that they aren't zero-based.
+* node_position is the index among all sibling nodes, while position is the position among children of identical type.
+* comment() and text() search criteria, like: //p/text(), which selects all text inside paragraph tags.
+* every element has css_path and xpath methods which return respective absolute paths.
+* more flexibility all around: in parsing attributes, tags, comments and cdata.
+= 0.4
+=== 11th August, 2006
+* The :fixup_tags option will try to sort out the hierarchy so elements end up with the right parents.
+* Elements such as *script* and *style* (identified as having CDATA contents) receive a single text node as their children now.  Previously, Hpricot was parsing out tags found in scripts.
+* Better scanning of partially quoted attributes (found by Brent Beardsly on http://uswebgen.com/)
+* Better scanning of unquoted attributes -- thanks to Aaron Patterson for the test cases!
+* Some tags were being output in the empty tag style, although browsers hated that.  FIXED!
+* Added Elements#at for finding single elements.
+* Added Elem::Trav#[] and Elem::Trav#[]= for reading and writing attributes.
+= 0.3
+=== 7th July, 2006
+* Fixed negative string size error on empty tokens. (news.bbc.co.uk)
+* Allow the parser to accept just text nodes. (such as: <tt>Hpricot.parse('TEXT')</tt>)
+* from JQuery to Hpricot::Elements: remove, empty, append, prepend, before, after, wrap, set,
+  html(...), to_html, to_s.
+* on containers: to_html, replace_child, insert_before, insert_after, innerHTML=.
+* Hpricot(...) is an alias for parse.
+* open up all properties to setters, let people do as they may.
+* use to_html for the full html of a node or set of elements.
+* doctypes were messed.
+= 0.2
+=== 4th July, 2006
+* Rewrote the HTree parser to be simpler, more adequate for the common man.  Will add encoding back in later.
+= 0.1
+=== 3rd July, 2006
+* For whatever reason, wrote this HTML parser in C.
+  I guess Ragel is addictive and I want to improve HTree.

data/COPYING ADDED

@@ -0,0 +1,18 @@
+Copyright (c) 2006 why the lucky stiff
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to
+deal in the Software without restriction, including without limitation the
+rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+sell copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,295 @@
+# Hpricot is over.
+After years of lack of a proper maintainer for one of why's jewels, it has been
+decided to finally close the book on hpricot. Most users have migrated to alternatives
+and there is simply no time or energy to continue with the current codebase.
+If you feel that you have the time and wish to take it over, I suggest you instead
+think about making the hpricot-like API within nokogiri 100% compatible, that is a better
+use of your time.
+But if you still feel like "No damnit, I wanna work on hpricot itself still!" then fork
+this repo and start work. Send @evanphx or @nicksieger a message if you feel like you
+want to take over the gem name with new releases under the hpricot name.
+Thanks to \_why for all the fun. We'll never forget it.
+## Now back to your original README content...
+# Hpricot, Read Any HTML
+Hpricot is a fast, flexible HTML parser written in C.  It's designed to be very
+accommodating (like Tanaka Akira's HTree) and to have a very helpful library
+(like some JavaScript libs -- JQuery, Prototype -- give you.)  The XPath and CSS
+parser, in fact, is based on John Resig's JQuery.
+Also, Hpricot can be handy for reading broken XML files, since many of the same
+techniques can be used.  If a quote is missing, Hpricot tries to figure it out.
+If tags overlap, Hpricot works on sorting them out.  You know, that sort of
+thing.
+*Please read this entire document* before making assumptions about how this
+software works.
+## An Overview
+Let's clear up what Hpricot is.
+* Hpricot is *a standalone library*.  It requires no other libraries.  Just Ruby!
+* While priding itself on speed, Hpricot *works hard to sort out bad HTML* and
+  pays a small penalty in order to get that right.  So that's slightly more important
+  to me than speed.
+* *If you can see it in Firefox, then Hpricot should parse it.*  That's
+  how it should be!  Let me know the minute it's otherwise.
+* Primarily, Hpricot is used for reading HTML and tries to sort out troubled
+  HTML by having some idea of what good HTML is.  Some people still like to use
+  Hpricot for XML reading, but *remember to use the Hpricot::XML() method* for that!
+## The Hpricot Kingdom
+First, here are all the links you need to know:
+* http://wiki.github.com/hpricot/hpricot is the Hpricot wiki and
+  http://github.com/hpricot/hpricot/issues is the bug tracker.
+  Go there for news and recipes and patches.  It's the center of activity.
+* http://github.com/hpricot/hpricot is the main Git
+  repository for Hpricot.  You can get the latest code there.
+* See COPYING for the terms of this software. (Spoiler: it's absolutely free.)
+If you have any trouble, don't hesitate to contact the author.  As always, I'm
+not going to say "Use at your own risk" because I don't want this library to be
+risky.  If you trip on something, I'll share the liability by repairing things
+as quickly as I can.  Your responsibility is to report the inadequacies.
+## Installing Hpricot
+You may get the latest stable version from Rubyforge. Win32 binaries,
+Java binaries (for JRuby), and source gems are available.
+    $ gem install hpricot
+## An Hpricot Showcase
+We're going to run through a big pile of examples to get you jump-started.
+Many of these examples are also found at
+http://wiki.github.com/hpricot/hpricot/hpricot-basics, in case you
+want to add some of your own.
+### Loading Hpricot Itself
+You have probably got the gem, right?  To load Hpricot:
+    require 'rubygems'
+    require 'hpricot'
+If you've installed the plain source distribution, go ahead and just:
+    require 'hpricot'
+### Load an HTML Page
+The <tt>Hpricot()</tt> method takes a string or any IO object and loads the
+contents into a document object.
+    doc = Hpricot("<p>A simple <b>test</b> string.</p>")
+To load from a file, just get the stream open:
+    doc = open("index.html") { |f| Hpricot(f) }
+To load from a web URL, use <tt>open-uri</tt>, which comes with Ruby:
+    require 'open-uri'
+    doc = open("http://qwantz.com/") { |f| Hpricot(f) }
+Hpricot uses an internal buffer to parse the file, so the IO will stream
+properly and large documents won't be loaded into memory all at once.  However,
+the parsed document object will be present in memory, in its entirety.
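
Reviewer note on the URL example above: since Ruby 3.0, `Kernel#open` no longer opens URLs even with `open-uri` loaded, and `URI.open` is the supported call. The following is a minimal sketch, assuming a hypothetical helper `open_source` (not part of Hpricot) that picks the right opener and yields an IO that could then be handed to `Hpricot()`.

```ruby
require 'open-uri'

# Hypothetical helper (not part of Hpricot): yields an IO for either an
# http(s) URL or a local file path and returns the block's value.
def open_source(location, &block)
  if location.match?(%r{\Ahttps?://}i)
    URI.open(location, &block) # open-uri; replaces Kernel#open for URLs on Ruby 3.0+
  else
    File.open(location, &block)
  end
end

# Usage with Hpricot (assumes the gem is installed and required):
#   doc = open_source("index.html")         { |f| Hpricot(f) }
#   doc = open_source("http://qwantz.com/") { |f| Hpricot(f) }
```

Keeping the dispatch in one place means the calling code stays the same whether the document comes from disk or from the network.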
+### Search for Elements
+Use <tt>Doc.search</tt>:
+    doc.search("//p[@class='posted']")
+    #=> #<Hpricot::Elements[{p ...}, {p ...}]>
+<tt>Doc.search</tt> can take an XPath or CSS expression.  In the above example,
+all paragraph <tt><p></tt> elements are grabbed which have a <tt>class</tt>
+attribute of <tt>"posted"</tt>.
+A shortcut is to use the divisor:
+    (doc/"p.posted")
+    #=> #<Hpricot::Elements[{p ...}, {p ...}]>
+### Finding Just One Element
+If you're looking for a single element, the <tt>at</tt> method will return the
+first element matched by the expression.  In this case, you'll get back the
+element itself rather than the <tt>Hpricot::Elements</tt> array.
+    doc.at("body")['onload']
+The above code will find the body tag and give you back the <tt>onload</tt>
+attribute.  This is the most common reason to use the element directly: when
+reading and writing HTML attributes.
+### Fetching the Contents of an Element
+Just as with browser scripting, the <tt>inner_html</tt> property can be used to
+get the inner contents of an element.
+    (doc/"#elementID").inner_html
+    #=> "..contents.."
+If your expression matches more than one element, you'll get back the contents
+of *all the matched elements*.  So you may want to use <tt>first</tt> to be
+sure you get back only one.
+    (doc/"#elementID").first.inner_html
+    #=> "..contents.."
+### Fetching the HTML for an Element
+If you want the HTML for the whole element (not just the contents), use
+<tt>to_html</tt>:
+    (doc/"#elementID").to_html
+    #=> "<div id='elementID'>...</div>"
+### Looping
+All searches return a set of <tt>Hpricot::Elements</tt>.  Go ahead and loop
+through them like you would an array.
+    (doc/"p/a/img").each do |img|
+      puts img.attributes['class']
+    end
+### Continuing Searches
+Searches can be continued from a collection of elements, in order to search deeper.
+    # find all paragraphs.
+    elements = doc.search("/html/body//p")
+    # continue the search by finding any images within those paragraphs.
+    (elements/"img")
+    #=> #<Hpricot::Elements[{img ...}, {img ...}]>
+Searches can also be continued by searching within container elements.
+    # find all images within paragraphs.
+    doc.search("/html/body//p").each do |para|
+      puts "== Found a paragraph =="
+      pp para
+      imgs = para.search("img")
+      if imgs.any?
+        puts "== Found #{imgs.length} images inside =="
+      end
+    end
+Of course, the most succinct ways to do the above are using CSS or XPath.
+    # the xpath version
+    (doc/"/html/body//p//img")
+    # the css version
+    (doc/"html > body > p img")
+    # ..or symbols work, too!
+    (doc/:html/:body/:p/:img)
+### Looping Edits
+You may certainly edit objects from within your search loops.  Then, when you
+spit out the HTML, the altered elements will show.
+    (doc/"span.entryPermalink").each do |span|
+      span.attributes['class'] = 'newLinks'
+    end
+    puts doc
+This changes all <tt>span.entryPermalink</tt> elements to
+<tt>span.newLinks</tt>.  Keep in mind that there are often more convenient ways
+of doing this.  Such as the <tt>set</tt> method:
+    (doc/"span.entryPermalink").set(:class => 'newLinks')
+### Figuring Out Paths
+Every element can tell you its unique path (either XPath or CSS) to get to the
+element from the root tag.
+The <tt>css_path</tt> method:
+    doc.at("div > div:nth(1)").css_path
+      #=> "div > div:nth(1)"
+    doc.at("#header").css_path
+      #=> "#header"
+Or, the <tt>xpath</tt> method:
+    doc.at("div > div:nth(1)").xpath
+      #=> "/div/div:eq(1)"
+    doc.at("#header").xpath
+      #=> "//div[@id='header']"
+## Hpricot Fixups
+When loading HTML documents, you have a few settings that can make Hpricot more
+or less intense about how it gets involved.
+## :fixup_tags
+Really, there are so many ways to clean up HTML and your intentions may be to
+keep the HTML as-is.  So Hpricot's default behavior is to keep things flexible:
+it makes sure to open and close all the tags, but ignores any validation problems.
+As of Hpricot 0.4, there's a new <tt>:fixup_tags</tt> option which will attempt
+to shift the document's tags to meet XHTML 1.0 Strict.
+    doc = open("index.html") { |f| Hpricot f, :fixup_tags => true }
+This doesn't quite meet the XHTML 1.0 Strict standard; it just tries to follow
+the rules a bit better.  Like: say Hpricot finds a paragraph in a link, it's
+going to move the paragraph below the link.  Or up and out of other elements
+where paragraphs don't belong.
+If an unknown element is found, it is ignored.  Again, <tt>:fixup_tags</tt>.
+## :xhtml_strict
+So, let's go beyond just trying to fix the hierarchy.  The
+<tt>:xhtml_strict</tt> option really tries to force the document to be an XHTML
+1.0 Strict document.  Even at the cost of removing elements that get in the way.
+    doc = open("index.html") { |f| Hpricot f, :xhtml_strict => true }
+What measures does <tt>:xhtml_strict</tt> take?
+ 1. Shift elements into their proper containers just like :fixup_tags.
+ 2. Remove unknown elements.
+ 3. Remove unknown attributes.
+ 4. Remove illegal content.
+ 5. Alter the doctype to XHTML 1.0 Strict.
+## Hpricot.XML()
+The last option is the <tt>:xml</tt> option, which makes some slight variations
+on the standard mode.  The main difference is that :xml mode won't try to output
+tags which are friendlier for browsers.  For example, if an opening and closing
+<tt>br</tt> tag is found, XML mode won't try to turn that into an empty element.
+XML mode also doesn't downcase the tags and attributes for you.  So pay attention
+to case, friends.
+The primary way to use Hpricot's XML mode is to call the Hpricot.XML method:
+    doc = open("http://redhanded.hobix.com/index.xml") do |f|
+      Hpricot.XML(f)
+    end
+*Also, :fixup_tags is canceled out by the :xml option.*  This is because
+:fixup_tags makes assumptions based on how HTML is structured.  Specifically, how
+tags are defined in the XHTML 1.0 DTD.