mikowitz-feed-normalizer 1.5.1

+++ History.txt
@@ -0,0 +1,52 @@
+ 1.5.1
+
+ * Fix a bug that was breaking the parsing process for certain feeds. [reported by: Patrick Minton]
+
+ 1.5.0
+
+ * Add support for new fields:
+   * Atom 0.3: issued is now available through entry.date_published.
+   * RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
+   * All: entry.last_updated; for RSS this is an alias of entry.date_published.
+ * Rewrite relative links in content. [joshpeek]
+ * Handle CDATA sections consistently across all formats. [sam.lown]
+ * Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
+ * Reparse Time classes. [reported by: sam.lown]
+
+ 1.4.0
+
+ * Support content:encoded. Accessible via Entry#content.
+ * Support categories. Accessible via Entry#categories.
+ * Introduces a new parsing feature, 'loose parsing'. Use :loose => true
+   when parsing if the resulting output should retain extra data, rather
+   than drop it in the interests of 'lowest common denominator' normalization.
+   Currently affects how categories work. See the documentation in
+   FeedNormalizer#parse for more details.
+
+ 1.3.2
+
+ * Add support for applicable Dublin Core elements (dc:date and dc:creator).
+ * Feeds can now be dumped to YAML.
+
+ 1.3.1
+
+ * Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
+ * Reduced the greediness of a regexp that was removing HTML comments.
+
+ 1.3.0
+
+ * Small changes to work with hpricot 0.5.
+
+ 1.2.0
+
+ * Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
+   for 'safe' display inside a web browser. Can be used as a standalone library,
+   or as part of the Feed object. See Feed.clean! for details about cleaning a
+   Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
+ * Added Feed-diffing. Differences between two feeds can be displayed using
+   Feed.diff. Works nicely with YAML for a readable diff.
+ * FeedNormalizer.parse now takes a hash for its arguments.
+ * Removed FN::Content.
+ * Now uses Hoe!
+
+++ License.txt
@@ -0,0 +1,27 @@
+ Copyright (c) 2006-2007, Andrew A. Smith
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification,
+ are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+ * Neither the name of the copyright owner nor the names of its contributors
+   may be used to endorse or promote products derived from this software
+   without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+++ Manifest.txt
@@ -0,0 +1,20 @@
+ History.txt
+ License.txt
+ Manifest.txt
+ Rakefile
+ README.txt
+ feed-normalizer.gemspec
+ lib/feed-normalizer.rb
+ lib/html-cleaner.rb
+ lib/parsers/rss.rb
+ lib/parsers/simple-rss.rb
+ lib/structures.rb
+ test/data/atom03.xml
+ test/data/atom10.xml
+ test/data/rdf10.xml
+ test/data/rss20.xml
+ test/data/rss20diff.xml
+ test/data/rss20diff_short.xml
+ test/test_all.rb
+ test/test_feednormalizer.rb
+ test/test_htmlcleaner.rb
+++ README.txt
@@ -0,0 +1,63 @@
+ == Feed Normalizer
+
+ An extensible Ruby wrapper for Atom and RSS parsers.
+
+ Feed Normalizer wraps various RSS and Atom parsers, and returns a single unified
+ object graph, regardless of the underlying feed format.
+
+ == Download
+
+ * gem install feed-normalizer
+ * http://rubyforge.org/projects/feed-normalizer
+ * svn co http://feed-normalizer.googlecode.com/svn/trunk
+
+ == Usage
+
+   require 'feed-normalizer'
+   require 'open-uri'
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
+
+   feed.title # => "International Herald Tribune"
+   feed.url # => "http://www.iht.com/pages/index.php"
+   feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "RSS::Parser"
+
+ Now read an Atom feed; the same class is returned, and the same terminology applies:
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
+
+   feed.title # => "AtomEnabled.org"
+   feed.url # => "http://www.atomenabled.org/atom.xml"
+   feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
+
+ The feed representation stays the same, even though a different parser was used.
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "SimpleRSS"
+
+ == Cleaning / Sanitizing
+
+   feed.title # => "My Feed > Your Feed"
+   feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
+   feed.clean!
+
+ All elements should now be either clean HTML or HTML-escaped strings.
+
+   feed.title # => "My Feed &gt; Your Feed"
+   feed.entries.first.content # => "<p>Hello</p>"
+
+ == Extending
+
+ Implement a parser wrapper by extending the FeedNormalizer::Parser class and
+ overriding its public methods. Note also the helper methods in the root Parser
+ object, which make it easier to map a particular parser's output onto the Feed
+ object.
+
+ See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for
+ examples, or the sketch below.
+
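+ The wrapper below is an illustrative sketch, not a shipped parser: the
+ my_parser library, the MyParser class, and the exact Feed field mapping are
+ assumptions made for the example.
+
+   require 'my_parser' # hypothetical underlying parser
+
+   class MyParserWrapper < FeedNormalizer::Parser
+
+     # The parser being wrapped.
+     def self.parser
+       MyParser
+     end
+
+     # Return a normalized Feed, or nil if parsing fails.
+     def self.parse(xml, loose)
+       parsed = MyParser.parse(xml) rescue nil
+       parsed ? package(parsed) : nil
+     end
+
+     # Lower numbers are tried first; use a high number to run last.
+     def self.priority
+       100
+     end
+
+     # Map the parser's output onto a normalized Feed object.
+     def self.package(parsed)
+       feed = FeedNormalizer::Feed.new(self)
+       feed.title = parsed.title
+       # ... map the remaining fields here ...
+       feed
+     end
+   end
+
+ Subclassing FeedNormalizer::Parser registers the wrapper automatically via the
+ Parser.inherited callback, so no further wiring is needed.
+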
+ == Authors
+ * Andrew A. Smith (andy@tinnedfruit.org)
+
+ This library is released under the terms of the BSD License (see the License.txt file for details).
+++ Rakefile
@@ -0,0 +1,25 @@
+ require 'hoe'
+
+ # Gem packaging, release, and test tasks, via Hoe.
+ Hoe.new("feed-normalizer", "1.5.1") do |s|
+   s.author = "Andrew A. Smith"
+   s.email = "andy@tinnedfruit.org"
+   s.url = "http://feed-normalizer.rubyforge.org/"
+   s.summary = "Extensible Ruby wrapper for Atom and RSS parsers"
+   s.description = s.paragraphs_of('README.txt', 1..2).join("\n\n")
+   s.changes = s.paragraphs_of('History.txt', 0..1).join("\n\n")
+   s.extra_deps << ["simple-rss", ">= 1.1"]
+   s.extra_deps << ["hpricot", ">= 0.6"]
+   s.need_zip = true
+   s.need_tar = false
+ end
+
+ # Optional coverage task; skipped quietly when rcov is not installed.
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new("rcov") do |t|
+     t.test_files = Dir['test/test_all.rb']
+   end
+ rescue LoadError
+   nil
+ end
+
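+ # Usage sketch (assumed invocations; Hoe supplies the standard tasks):
+ #   rake test     # run the test suite
+ #   rake package  # build the gem
+ #   rake rcov     # coverage report, if rcov is available
+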
+++ lib/feed-normalizer.rb
@@ -0,0 +1,149 @@
+ require 'structures'
+ require 'html-cleaner'
+
+ module FeedNormalizer
+
+   # The root parser object. Every parser must extend this object.
+   class Parser
+
+     # The underlying parser being wrapped.
+     def self.parser
+       nil
+     end
+
+     # Parses the given feed and returns a normalized representation.
+     # Returns nil if the feed could not be parsed.
+     def self.parse(feed, loose)
+       nil
+     end
+
+     # Returns a number indicating parser priority.
+     # The lower the number, the earlier the parser is tried, and vice-versa.
+     def self.priority
+       0
+     end
+
+     protected
+
+     # Some utility methods that can be used by subclasses.
+
+     # Sets a value, or appends to an existing value.
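+     #
+     # An illustrative mapping (the field names are assumptions):
+     #   map_functions!({:title => [:title, :dc_title]}, parsed_src, feed)
+     # copies the first non-empty of parsed_src.title / parsed_src[:dc_title]
+     # into feed.title, appending instead if feed.title responds to push.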
+     def self.map_functions!(mapping, src, dest)
+
+       mapping.each do |dest_function, src_functions|
+         src_functions = [src_functions].flatten # pack into array
+
+         src_functions.each do |src_function|
+           value = if src.respond_to?(src_function)
+             src.send(src_function)
+           elsif src.respond_to?(:has_key?)
+             src[src_function]
+           end
+
+           unless value.to_s.empty?
+             append_or_set!(value, dest, dest_function)
+             break
+           end
+         end
+
+       end
+     end
+
+     def self.append_or_set!(value, object, object_function)
+       if object.send(object_function).respond_to? :push
+         object.send(object_function).push(value)
+       else
+         object.send(:"#{object_function}=", value)
+       end
+     end
+
+     private
+
+     # Callback that ensures that every parser gets registered.
+     def self.inherited(subclass)
+       ParserRegistry.register(subclass)
+     end
+
+   end
+
+
+   # The parser registry keeps a list of the parsers that are currently available.
+   class ParserRegistry
+
+     @@parsers = []
+
+     def self.register(parser)
+       @@parsers << parser
+     end
+
+     # Returns a list of currently registered parsers, in order of priority.
+     def self.parsers
+       @@parsers.sort_by { |parser| parser.priority }
+     end
+
+   end
+
+
+   class FeedNormalizer
+
+     # Parses the given XML and attempts to return a normalized Feed object.
+     # Setting +force_parser+ to a suitable parser means that parser is used
+     # first; if +try_others+ is false, it is the only parser used. Otherwise,
+     # all parsers in the ParserRegistry are attempted, in order of priority.
+     #
+     # ===Available options
+     #
+     # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+     #   parser first. Takes a class, such as RubyRssParser or SimpleRssParser.
+     #
+     # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+     #   If +true+, other parsers will be used as described above. This option
+     #   is useful when combined with +force_parser+ to use a single parser only.
+     #
+     # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+     #
+     #   Specifies that parsing should be done loosely. This means that where
+     #   feed-normalizer would usually throw away data in order to keep the
+     #   resulting feed output the same regardless of the underlying parser,
+     #   the data is instead kept. This currently affects the following items:
+     #   * <em>Categories:</em> RSS allows for multiple categories per feed item.
+     #     * <em>Limitation:</em> SimpleRSS can only return the first category
+     #       for an item.
+     #     * <em>Result:</em> When loose is true, the extra categories are kept,
+     #       provided the parser is not SimpleRSS.
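+     #
+     # A usage sketch (the file name and parser choice are illustrative):
+     #
+     #   xml = File.read('feed.xml')
+     #   # Default: try all registered parsers in priority order.
+     #   feed = FeedNormalizer::FeedNormalizer.parse(xml)
+     #   # Force a single parser, keeping extra category data.
+     #   feed = FeedNormalizer::FeedNormalizer.parse(xml,
+     #     :force_parser => FeedNormalizer::SimpleRssParser,
+     #     :try_others => false, :loose => true)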
+     def self.parse(xml, opts = {})
+
+       # Get a string ASAP, as multiple read()'s will start returning nil.
+       xml = xml.respond_to?(:read) ? xml.read : xml.to_s
+
+       if opts[:force_parser]
+         result = opts[:force_parser].parse(xml, opts[:loose])
+
+         return result if result
+         return nil if opts[:try_others] == false
+       end
+
+       ParserRegistry.parsers.each do |parser|
+         result = parser.parse(xml, opts[:loose])
+         return result if result
+       end
+
+       # If we got here, no parsers worked.
+       return nil
+     end
+   end
+
+
+   parser_dir = File.dirname(__FILE__) + '/parsers'
+
+   # Load up the parsers.
+   Dir.open(parser_dir).each do |fn|
+     next unless fn =~ /[.]rb$/
+     require "parsers/#{fn}"
+   end
+
+ end
+
+++ lib/html-cleaner.rb
@@ -0,0 +1,190 @@
+ require 'rubygems'
+ require 'hpricot'
+ require 'cgi'
+
+ module FeedNormalizer
+
+   # Various methods for cleaning up HTML and preparing it for safe public
+   # consumption.
+   #
+   # Documents used for reference:
+   # - http://www.w3.org/TR/html4/index/attributes.html
+   # - http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
+   # - http://feedparser.org/docs/html-sanitization.html
+   # - http://code.whytheluckystiff.net/hpricot/wiki
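+   #
+   # A standalone usage sketch (input and output here are illustrative):
+   #
+   #   FeedNormalizer::HtmlCleaner.clean("<p onclick='x()'>Hi</p>")
+   #   # => "<p>Hi</p>" -- onclick is not in HTML_ATTRS, so it is dropped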
+   class HtmlCleaner
+
+     # Allowed HTML elements.
+     HTML_ELEMENTS = %w(
+       a abbr acronym address area b bdo big blockquote br button caption center
+       cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
+       h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
+       samp small span strike strong sub sup table tbody td tfoot th thead tr tt
+       u ul var
+     )
+
+     # Allowed attributes.
+     HTML_ATTRS = %w(
+       abbr accept accept-charset accesskey align alt axis border cellpadding
+       cellspacing char charoff charset checked cite class clear cols colspan
+       color compact coords datetime dir disabled for frame headers height href
+       hreflang hspace id ismap label lang longdesc maxlength media method
+       multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
+       scope selected shape size span src start summary tabindex target title
+       type usemap valign value vspace width
+     )
+
+     # Allowed attributes that can contain URIs; extra caution required.
+     # NOTE: This doesn't list *all* URI attrs, just the ones that are allowed.
+     HTML_URI_ATTRS = %w(
+       href src cite usemap longdesc
+     )
+
+     DODGY_URI_SCHEMES = %w(
+       javascript vbscript mocha livescript data
+     )
+
+     class << self
+
+       # Does this:
+       # - Unescape HTML
+       # - Parse HTML into tree
+       # - Find 'body' if present, and extract tree inside that tag, otherwise parse whole tree
+       # - For each tag:
+       #   - remove tag if not whitelisted
+       #   - escape HTML tag contents
+       #   - remove all attributes not on whitelist
+       #   - extra-scrub URI attrs; see dodgy_uri?
+       #
+       # Extra (i.e. unmatched) ending tags and comments are removed.
+       def clean(str)
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :fixup_tags => true)
+         doc = subtree(doc, :body)
+
+         # Get all the tag names in the document.
+         # Somewhere near hpricot 0.4.92, "*" started returning all elements,
+         # including text nodes, instead of just tagged elements.
+         tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq
+
+         # Remove tags that aren't whitelisted.
+         remove_tags!(doc, tags - HTML_ELEMENTS)
+         remaining_tags = tags & HTML_ELEMENTS
+
+         # Remove attributes that aren't on the whitelist, or are suspicious URLs.
+         (doc/remaining_tags.join(",")).each do |element|
+           element.raw_attributes.reject! do |attr,val|
+             !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
+           end
+
+           element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
+         end unless remaining_tags.empty?
+
+         doc.traverse_text {|t| t.set(add_entities(t.to_html))}
+
+         # Return the tree, without comments. Ugly way of removing comments,
+         # but can't see a way to do this in Hpricot yet.
+         doc.to_s.gsub(/<\!--.*?-->/mi, '')
+       end
+
+       # For all other feed elements:
+       # - Unescape HTML.
+       # - Parse HTML into tree (taking 'body' as root, if present).
+       # - Take the text out of each tag, and escape HTML.
+       # - Return all text concatenated.
+       def flatten(str)
+         str.gsub!("\n", " ")
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :xhtml_strict => true)
+         doc = subtree(doc, :body)
+
+         out = []
+         doc.traverse_text {|t| out << add_entities(t.to_html)}
+
+         return out.join
+       end
+
+       # Returns true if the given string contains a suspicious URL,
+       # i.e. a javascript link.
+       #
+       # This method rejects javascript, vbscript, livescript, mocha and data URLs.
+       # It *could* be refined to only deny dangerous data URLs, however.
+       def dodgy_uri?(uri)
+         uri = uri.to_s
+
+         # Special case for poorly-formed entities (missing ';').
+         # If these occur *anywhere* within the string, then throw it out.
+         return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)
+
+         # Try unescaping as both HTML and URI encodings, and then try
+         # each scheme regexp on each result.
+         [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
+           DODGY_URI_SCHEMES.each do |scheme|
+
+             regexp = "#{scheme}:".gsub(/./) do |char|
+               "([\000-\037\177\s]*)#{char}"
+             end
+
+             # The regexp looks something like
+             # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
+             return true if (unesc_uri =~ %r{\A#{regexp}}mi)
+           end
+         end
+
+         nil
+       end
+
+       # Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
+       def unescapeHTML(str, xml = true)
+         CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
+       end
+
+       # Adds entities where possible.
+       # Works like CGI.escapeHTML, but will not escape existing entities;
+       # i.e. &#123; will NOT become &amp;#123;
+       #
+       # This method could be improved by adding a whitelist of HTML entities.
+       def add_entities(str)
+         str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
+       end
+
+       private
+
+       # Returns everything below element, or just the doc if element is not present.
+       def subtree(doc, element)
+         doc.at("//#{element}/*") || doc
+       end
+
+       def remove_tags!(doc, tags)
+         (doc/tags.join(",")).remove unless tags.empty?
+       end
+
+     end
+   end
+ end
+
+
+ module Enumerable #:nodoc:
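+   # Builds a hash from the [key, value] pairs yielded by the block, e.g.
+   # (illustrative) %w(a b).build_hash {|s| [s, s.upcase]}  # => {"a"=>"A", "b"=>"B"}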
+   def build_hash
+     result = {}
+     self.each do |elt|
+       key, value = yield elt
+       result[key] = value
+     end
+     result
+   end
+ end
+
+ # http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/207625
+ # Subject: A simple Hpricot text setter
+ # From: Chris Gehlker <canyonrat mac.com>
+ # Date: Fri, 11 Aug 2006 03:19:13 +0900
+ class Hpricot::Text #:nodoc:
+   def set(string)
+     @content = string
+     self.raw_string = string
+   end
+ end