openlogic-feed-normalizer 1.5.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/.gemtest ADDED
File without changes
data/History.txt ADDED
@@ -0,0 +1,62 @@
+ 1.5.3
+
+ * Fix a stack overflow error that occurred when calling unimplemented
+   methods on Feeds or Entrys on Ruby 1.9.x. For example, calling flatten
+   on an array of Entrys causes to_ary to be sent to each of the Entrys,
+   which would overflow the stack.
+
+ 1.5.2
+ [unknown]
+
+ 1.5.1
+
+ * Fix a bug that was breaking the parsing process for certain feeds. [reported by: Patrick Minton]
+
+ 1.5.0
+
+ * Add support for new fields:
+   * Atom 0.3: issued is now available through entry.date_published.
+   * RSS: feed.skip_hours, feed.skip_days, feed.ttl [joshpeek]
+   * All: entry.last_updated, an alias for entry.date_published on RSS.
+ * Rewrite relative links in content [joshpeek]
+ * Handle CDATA sections consistently across all formats. [sam.lown]
+ * Prevent SimpleRSS from doing its own escaping. [reported by: paul.stadig, lionel.bouton]
+ * Reparse Time classes [reported by: sam.lown]
+
+ 1.4.0
+
+ * Support content:encoded. Accessible via Entry#content.
+ * Support categories. Accessible via Entry#categories.
+ * Introduces a new parsing feature, 'loose parsing'. Use :loose => true
+   when parsing if the resulting output should retain extra data, rather
+   than drop it in the interests of 'lowest common denominator' normalization.
+   Currently affects how categories work. See the documentation in
+   FeedNormalizer#parse for more details.
+
+ 1.3.2
+
+ * Add support for applicable Dublin Core elements. (dc:date and dc:creator)
+ * Feeds can now be dumped to YAML.
+
+ 1.3.1
+
+ * Small changes to work with hpricot 0.6. This release depends on hpricot 0.6.
+ * Reduced the greediness of a regexp that was removing html comments.
+
+ 1.3.0
+
+ * Small changes to work with hpricot 0.5.
+
+ 1.2.0
+
+ * Added HtmlCleaner - sanitizes HTML and removes 'bad' URIs to a level suitable
+   for 'safe' display inside a web browser. Can be used as a standalone library,
+   or as part of the Feed object. See Feed.clean! for details about cleaning a
+   Feed instance. Also see HtmlCleaner and its unit tests. Uses Hpricot.
+ * Added Feed-diffing. Differences between two feeds can be displayed using
+   Feed.diff. Works nicely with YAML for a readable diff.
+ * FeedNormalizer.parse now takes a hash for its arguments.
+ * Removed FN::Content.
+ * Now uses Hoe!
data/License.txt ADDED
@@ -0,0 +1,27 @@
+ Copyright (c) 2006-2007, Andrew A. Smith
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification,
+ are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+ * Neither the name of the copyright owner nor the names of its contributors
+   may be used to endorse or promote products derived from this software
+   without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
data/Manifest.txt ADDED
@@ -0,0 +1,18 @@
+ History.txt
+ License.txt
+ Manifest.txt
+ Rakefile
+ README.txt
+ lib/feed-normalizer.rb
+ lib/html-cleaner.rb
+ lib/parsers/rss.rb
+ lib/parsers/simple-rss.rb
+ lib/structures.rb
+ test/data/atom03.xml
+ test/data/atom10.xml
+ test/data/rdf10.xml
+ test/data/rss20.xml
+ test/data/rss20diff.xml
+ test/data/rss20diff_short.xml
+ test/test_feednormalizer.rb
+ test/test_htmlcleaner.rb
data/README.txt ADDED
@@ -0,0 +1,63 @@
+ == Feed Normalizer
+
+ An extensible Ruby wrapper for Atom and RSS parsers.
+
+ Feed normalizer wraps various RSS and Atom parsers, and returns a single unified
+ object graph, regardless of the underlying feed format.
+
+ == Download
+
+ * gem install openlogic-feed-normalizer
+ * http://rubyforge.org/projects/openlogic-feed-normalizer
+ * http://github.com/toddthomas/feed-normalizer
+
+ == Usage
+
+   require 'feed-normalizer'
+   require 'open-uri'
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.iht.com/rss/frontpage.xml')
+
+   feed.title # => "International Herald Tribune"
+   feed.url # => "http://www.iht.com/pages/index.php"
+   feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "RSS::Parser"
+
+ Now read an Atom feed; the same class is returned, and the same terminology applies:
+
+   feed = FeedNormalizer::FeedNormalizer.parse open('http://www.atomenabled.org/atom.xml')
+
+   feed.title # => "AtomEnabled.org"
+   feed.url # => "http://www.atomenabled.org/atom.xml"
+   feed.entries.first.url # => "http://www.atomenabled.org/2006/09/moving-toward-atom.php"
+
+ The feed representation stays the same, even though a different parser was used.
+
+   feed.class # => FeedNormalizer::Feed
+   feed.parser # => "SimpleRSS"
+
+ == Cleaning / Sanitizing
+
+   feed.title # => "My Feed > Your Feed"
+   feed.entries.first.content # => "<p x='y'>Hello</p><object></object></html>"
+   feed.clean!
+
+ All elements should now be either clean HTML, or HTML-escaped strings.
+
+   feed.title # => "My Feed &gt; Your Feed"
+   feed.entries.first.content # => "<p>Hello</p>"
+
+ == Extending
+
+ Implement a parser wrapper by extending the FeedNormalizer::Parser class and overriding
+ the public methods. Also note the helper methods in the root Parser object that make
+ mapping output from a particular parser to the Feed object easier.
+
+ See FeedNormalizer::RubyRssParser and FeedNormalizer::SimpleRssParser for examples,
+ and the sketch below.
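+
+ A minimal sketch of the shape such a wrapper takes (MyParser and its
+ SomeUnderlyingParser backend are illustrative, not part of the library):
+
+   class MyParser < FeedNormalizer::Parser
+     def self.parser
+       SomeUnderlyingParser # hypothetical backend this wrapper delegates to
+     end
+
+     def self.parse(xml, loose = false)
+       feed = FeedNormalizer::Feed.new(self)
+       # Map the backend's output onto feed here (the map_functions! helper
+       # in Parser exists for this); return nil if parsing fails.
+       feed
+     end
+
+     def self.priority
+       50 # lower numbers are tried first
+     end
+   end
+
+ Merely subclassing FeedNormalizer::Parser registers the wrapper, since
+ Parser.inherited adds every subclass to the ParserRegistry.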
+
+ == Authors
+ * Andrew A. Smith (andy@tinnedfruit.org)
+
+ This library is released under the terms of the BSD License (see the License.txt file for details).
data/Rakefile ADDED
@@ -0,0 +1,30 @@
+ require 'rubygems'
+ require 'hoe'
+
+ $: << "lib"
+ require 'feed-normalizer'
+
+ Hoe.spec("openlogic-feed-normalizer") do |s|
+   s.version = "1.5.3"
+   s.developer "Andrew A. Smith", "andy@tinnedfruit.org"
+   s.developer "Todd Thomas", "todd.thomas@openlogic.com"
+   s.url = "http://github.com/toddthomas/feed-normalizer"
+   s.summary = "Extensible Ruby wrapper for Atom and RSS parsers"
+   s.description = s.paragraphs_of('README.txt', 1..2).join("\n\n")
+   s.changes = s.paragraphs_of('History.txt', 0..1).join("\n\n")
+   s.extra_deps << ["simple-rss", ">= 1.1"]
+   s.extra_deps << ["hpricot", ">= 0.6"]
+   s.need_zip = true
+   s.need_tar = false
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new("rcov") do |t|
+     t.test_files = Dir['test/test_all.rb']
+   end
+ rescue LoadError
+   nil
+ end
data/lib/feed-normalizer.rb ADDED
@@ -0,0 +1,149 @@
+ require 'structures'
+ require 'html-cleaner'
+
+ module FeedNormalizer
+
+   # The root parser object. Every parser must extend this object.
+   class Parser
+
+     # Parser being used.
+     def self.parser
+       nil
+     end
+
+     # Parses the given feed, and returns a normalized representation.
+     # Returns nil if the feed could not be parsed.
+     def self.parse(feed, loose)
+       nil
+     end
+
+     # Returns a number to indicate parser priority.
+     # The lower the number, the more likely the parser will be used first,
+     # and vice-versa.
+     def self.priority
+       0
+     end
+
+     protected
+
+     # Some utility methods that can be used by subclasses.
+
+     # Sets a value, or appends to an existing value.
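+     #
+     # Illustrative call (src and dest stand for whatever the caller supplies):
+     #   map_functions!({:title => [:title, :dc_title]}, src, dest)
+     # copies the first non-empty of src.title / src[:dc_title] into
+     # dest.title, pushing instead if dest.title is already an array.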
+     def self.map_functions!(mapping, src, dest)
+
+       mapping.each do |dest_function, src_functions|
+         src_functions = [src_functions].flatten # pack into array
+
+         src_functions.each do |src_function|
+           value = if src.respond_to?(src_function)
+             src.send(src_function)
+           elsif src.respond_to?(:has_key?)
+             src[src_function]
+           end
+
+           unless value.to_s.empty?
+             append_or_set!(value, dest, dest_function)
+             break
+           end
+         end
+
+       end
+     end
+
+     def self.append_or_set!(value, object, object_function)
+       if object.send(object_function).respond_to? :push
+         object.send(object_function).push(value)
+       else
+         object.send(:"#{object_function}=", value)
+       end
+     end
+
+     private
+
+     # Callback that ensures that every parser gets registered.
+     def self.inherited(subclass)
+       ParserRegistry.register(subclass)
+     end
+
+   end
+
+
+   # The parser registry keeps a list of current parsers that are available.
+   class ParserRegistry
+
+     @@parsers = []
+
+     def self.register(parser)
+       @@parsers << parser
+     end
+
+     # Returns a list of currently registered parsers, in order of priority.
+     def self.parsers
+       @@parsers.sort_by { |parser| parser.priority }
+     end
+
+   end
+
+
+   class FeedNormalizer
+
+     # Parses the given xml and attempts to return a normalized Feed object.
+     # Setting +force_parser+ to a suitable parser will mean that parser is
+     # used first, and if +try_others+ is false, it is the only parser used;
+     # otherwise all parsers in the ParserRegistry are attempted, in
+     # order of priority.
+     #
+     # === Available options
+     #
+     # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+     #   parser first. Takes a class, such as RubyRssParser or SimpleRssParser.
+     #
+     # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+     #   If +true+, other parsers will be used as described above. The option
+     #   is useful when combined with +force_parser+ to only use a single parser.
+     #
+     # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+     #
+     #   Specifies that parsing should be done loosely. This means that when
+     #   feed-normalizer would usually throw away data in order to keep the
+     #   resulting feed output the same regardless of the underlying parser,
+     #   the data will instead be kept. This currently affects the following items:
+     #   * <em>Categories:</em> RSS allows for multiple categories per feed item.
+     #     * <em>Limitation:</em> SimpleRSS can only return the first category
+     #       for an item.
+     #     * <em>Result:</em> When loose is true, the extra categories are kept,
+     #       of course, only if the parser is not SimpleRSS.
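+     #
+     # Example (URL illustrative; RubyRssParser is one of the bundled wrappers):
+     #   xml = open('http://example.com/feed.xml')
+     #   feed = FeedNormalizer::FeedNormalizer.parse(xml,
+     #     :force_parser => FeedNormalizer::RubyRssParser, :try_others => false)
+     #   feed # => a FeedNormalizer::Feed, or nil if parsing failed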
+     def self.parse(xml, opts = {})
+
+       # Get a string ASAP, as multiple read()s will start returning nil.
+       xml = xml.respond_to?(:read) ? xml.read : xml.to_s
+
+       if opts[:force_parser]
+         result = opts[:force_parser].parse(xml, opts[:loose])
+
+         return result if result
+         return nil if opts[:try_others] == false
+       end
+
+       ParserRegistry.parsers.each do |parser|
+         result = parser.parse(xml, opts[:loose])
+         return result if result
+       end
+
+       # If we got here, no parsers worked.
+       return nil
+     end
+   end
+
+
+   parser_dir = File.dirname(__FILE__) + '/parsers'
+
+   # Load up the parsers.
+   Dir.open(parser_dir).each do |fn|
+     next unless fn =~ /[.]rb$/
+     require "parsers/#{fn}"
+   end
+
+ end
data/lib/html-cleaner.rb ADDED
@@ -0,0 +1,181 @@
+ require 'rubygems'
+ require 'hpricot'
+ require 'cgi'
+
+ module FeedNormalizer
+
+   # Various methods for cleaning up HTML and preparing it for safe public
+   # consumption.
+   #
+   # Documents used for reference:
+   # - http://www.w3.org/TR/html4/index/attributes.html
+   # - http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
+   # - http://feedparser.org/docs/html-sanitization.html
+   # - http://code.whytheluckystiff.net/hpricot/wiki
+   class HtmlCleaner
+
+     # Allowed HTML elements.
+     HTML_ELEMENTS = %w(
+       a abbr acronym address area b bdo big blockquote br button caption center
+       cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3
+       h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s
+       samp small span strike strong sub sup table tbody td tfoot th thead tr tt
+       u ul var
+     )
+
+     # Allowed attributes.
+     HTML_ATTRS = %w(
+       abbr accept accept-charset accesskey align alt axis border cellpadding
+       cellspacing char charoff charset checked cite class clear cols colspan
+       color compact coords datetime dir disabled for frame headers height href
+       hreflang hspace id ismap label lang longdesc maxlength media method
+       multiple name nohref noshade nowrap readonly rel rev rows rowspan rules
+       scope selected shape size span src start summary tabindex target title
+       type usemap valign value vspace width
+     )
+
+     # Allowed attributes that can contain URIs; extra caution required.
+     # NOTE: this doesn't list *all* URI attrs, just the ones that are allowed.
+     HTML_URI_ATTRS = %w(
+       href src cite usemap longdesc
+     )
+
+     DODGY_URI_SCHEMES = %w(
+       javascript vbscript mocha livescript data
+     )
+
+     class << self
+
+       # Does this:
+       # - Unescape HTML
+       # - Parse HTML into a tree
+       # - Find 'body' if present, and extract the tree inside that tag, otherwise parse the whole tree
+       # - For each tag:
+       #   - remove the tag if not whitelisted
+       #   - escape HTML tag contents
+       #   - remove all attributes not on the whitelist
+       #   - extra-scrub URI attrs; see dodgy_uri?
+       #
+       # Extra (i.e. unmatched) ending tags and comments are removed.
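+       #
+       # Illustrative example: script is not in HTML_ELEMENTS and onclick is
+       # not in HTML_ATTRS, so both are stripped:
+       #   HtmlCleaner.clean("<p onclick='x()'>Hi</p><script>x()</script>")
+       #   # => "<p>Hi</p>"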
+       def clean(str)
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :fixup_tags => true)
+         doc = subtree(doc, :body)
+
+         # Get all the tags in the document.
+         # Somewhere near hpricot 0.4.92, "*" started returning all elements,
+         # including text nodes, instead of just tagged elements.
+         tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq
+
+         # Remove tags that aren't whitelisted.
+         remove_tags!(doc, tags - HTML_ELEMENTS)
+         remaining_tags = tags & HTML_ELEMENTS
+
+         # Remove attributes that aren't on the whitelist, or are suspicious URLs.
+         (doc/remaining_tags.join(",")).each do |element|
+           next if element.raw_attributes.nil? || element.raw_attributes.empty?
+           element.raw_attributes.reject! do |attr,val|
+             !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
+           end
+
+           element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
+         end unless remaining_tags.empty?
+
+         doc.traverse_text do |t|
+           t.swap(add_entities(t.to_html))
+         end
+
+         # Return the tree, without comments. Ugly way of removing comments,
+         # but can't see a way to do this in Hpricot yet.
+         doc.to_s.gsub(/<\!--.*?-->/mi, '')
+       end
+
+       # For all other feed elements:
+       # - Unescape HTML.
+       # - Parse HTML into a tree (taking 'body' as root, if present).
+       # - Take the text out of each tag, and escape HTML.
+       # - Return all text concatenated.
+       def flatten(str)
+         str.gsub!("\n", " ")
+         str = unescapeHTML(str)
+
+         doc = Hpricot(str, :xhtml_strict => true)
+         doc = subtree(doc, :body)
+
+         out = []
+         doc.traverse_text {|t| out << add_entities(t.to_html)}
+
+         return out.join
+       end
+
+       # Returns true if the given string contains a suspicious URL,
+       # i.e. a javascript link.
+       #
+       # This method rejects javascript, vbscript, livescript, mocha and data URLs.
+       # It *could* be refined to only deny dangerous data URLs, however.
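+       #
+       # Illustrative examples:
+       #   HtmlCleaner.dodgy_uri?("javascript:alert(1)")        # => true
+       #   HtmlCleaner.dodgy_uri?("jav&#x0A;ascript:alert(1)")  # => true
+       #   HtmlCleaner.dodgy_uri?("http://example.com/")        # => nil (not dodgy)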
+       def dodgy_uri?(uri)
+         uri = uri.to_s
+
+         # Special case for poorly-formed entities (missing ';'):
+         # if these occur *anywhere* within the string, then throw it out.
+         return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)
+
+         # Unescape the URI as both HTML and URI encodings, then try
+         # each scheme regexp on each result.
+         [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
+           DODGY_URI_SCHEMES.each do |scheme|
+
+             regexp = "#{scheme}:".gsub(/./) do |char|
+               "([\000-\037\177\s]*)#{char}"
+             end
+
+             # regexp looks something like
+             # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
+             return true if (unesc_uri =~ %r{\A#{regexp}}mi)
+           end
+         end
+
+         nil
+       end
+
+       # Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
+       def unescapeHTML(str, xml = true)
+         CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
+       end
+
+       # Adds entities where possible.
+       # Works like CGI.escapeHTML, but will not escape existing entities;
+       # i.e. &#123; will NOT become &amp;#123;
+       #
+       # This method could be improved by adding a whitelist of HTML entities.
+       def add_entities(str)
+         str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
+       end
+
+       private
+
+       # Returns everything below element, or just the doc if the element is not present.
+       def subtree(doc, element)
+         doc.at("//#{element}/*") || doc
+       end
+
+       def remove_tags!(doc, tags)
+         (doc/tags.join(",")).remove unless tags.empty?
+       end
+
+     end
+   end
+ end
+
+
+ module Enumerable #:nodoc:
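+   # Builds a hash from an enumerable: the block maps each element to a
+   # [key, value] pair. Illustrative usage:
+   #   %w(a b).build_hash {|s| [s, s.upcase]} # => {"a"=>"A", "b"=>"B"}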
+   def build_hash
+     result = {}
+     self.each do |elt|
+       key, value = yield elt
+       result[key] = value
+     end
+     result
+   end
+ end