RubyGems - feed-normalizer - Versions diffs - 1.3.2 → 1.4.0 - Mend

feed-normalizer 1.3.2 → 1.4.0

Files changed (14)

data/History.txt CHANGED

@@ -1,3 +1,13 @@
+1.4.0
+ * Support content:encoded. Accessible via Entry#content.
+ * Support categories. Accessible via Entry#categories.
+ * Introduces a new parsing feature 'loose parsing'. Use :loose => true
+   when parsing if the required output should retain extra data, rather
+   than drop it in the interests of 'lowest common denomiator' normalization.
+   Currently affects how categories works. See the documentation in
+   FeedNormalizer#parse for more details.
 1.3.2
  * Add support for applicable dublin core elements. (dc:date and dc:creator)

data/License.txt CHANGED

@@ -1,4 +1,4 @@
-Copyright (c) 2006, Andrew A. Smith
+Copyright (c) 2006-2007, Andrew A. Smith
 All rights reserved.
 Redistribution and use in source and binary forms, with or without modification,

data/Manifest.txt CHANGED

@@ -2,7 +2,7 @@ History.txt
 License.txt
 Manifest.txt
 Rakefile
-Readme.txt
+README.txt
 lib/feed-normalizer.rb
 lib/html-cleaner.rb
 lib/parsers/rss.rb

data/{Readme.txt → README.txt} RENAMED

@@ -23,7 +23,7 @@ object graph, regardless of the underlying feed format.
     feed.entries.first.url # => "http://www.iht.com/articles/2006/10/03/frontpage/web.1003UN.php"
     feed.class # => FeedNormalizer::Feed
-    feed.parser # => RSS::Parser
+    feed.parser # => "RSS::Parser"
 Now read an Atom feed, and the same class is returned, and the same terminology applies:
@@ -36,7 +36,7 @@ Now read an Atom feed, and the same class is returned, and the same terminology
 The feed representation stays the same, even though a different parser was used.
     feed.class # => FeedNormalizer::Feed
-    feed.parser # => SimpleRSS
+    feed.parser # => "SimpleRSS"
 == Cleaning / Sanitizing

data/Rakefile CHANGED

@@ -1,6 +1,6 @@
 require 'hoe'
-Hoe.new("feed-normalizer", "1.3.2") do |s|
+Hoe.new("feed-normalizer", "1.4.0") do |s|
   s.author = "Andrew A. Smith"
   s.email = "andy@tinnedfruit.org"
   s.url = "http://feed-normalizer.rubyforge.org/"

data/lib/feed-normalizer.rb CHANGED

@@ -13,7 +13,7 @@ module FeedNormalizer
     # Parses the given feed, and returns a normalized representation.
     # Returns nil if the feed could not be parsed.
-    def self.parse(feed)
+    def self.parse(feed, loose)
       nil
     end
@@ -41,7 +41,10 @@ module FeedNormalizer
             src[src_function]
           end
-          append_or_set!(value, dest, dest_function) if value
+          unless value.to_s.empty?
+            append_or_set!(value, dest, dest_function)
+            break
+          end
         end
       end
@@ -85,24 +88,46 @@ module FeedNormalizer
   class FeedNormalizer
     # Parses the given xml and attempts to return a normalized Feed object.
-    # Setting forced parser to a suitable parser will mean that parser is
-    # used first, and if try_others is false, it is the only parser used,
-    # otherwise all parsers in the ParserRegistry are attempted next, in
+    # Setting +force_parser+ to a suitable parser will mean that parser is
+    # used first, and if +try_others+ is false, it is the only parser used,
+    # otherwise all parsers in the ParserRegistry are attempted, in
     # order of priority.
+    #
+    # ===Available options
+    #
+    # * <tt>:force_parser</tt> - instruct feed-normalizer to try the specified
+    #   parser first. Takes a class, such as RubyRssParser, or SimpleRssParser.
+    #
+    # * <tt>:try_others</tt> - +true+ or +false+, defaults to +true+.
+    #   If +true+, other parsers will be used as described above. The option
+    #   is useful if combined with +force_parser+ to only use a single parser.
+    #
+    # * <tt>:loose</tt> - +true+ or +false+, defaults to +false+.
+    #
+    #   Specifies parsing should be done loosely. This means that when
+    #   feed-normalizer would usually throw away data in order to meet
+    #   the requirement of keeping resulting feed outputs the same regardless
+    #   of the underlying parser, the data will instead be kept. This currently
+    #   affects the following items:
+    #   * <em>Categories:</em> RSS allows for multiple categories per feed item.
+    #     * <em>Limitation:</em> SimpleRSS can only return the first category
+    #       for an item.
+    #     * <em>Result:</em> When loose is true, the extra categories are kept,
+    #       of course, only if the parser is not SimpleRSS.
     def self.parse(xml, opts = {})
       # Get a string ASAP, as multiple read()'s will start returning nil..
       xml = xml.respond_to?(:read) ? xml.read : xml.to_s
       if opts[:force_parser]
-        result = opts[:force_parser].parse(xml)
+        result = opts[:force_parser].parse(xml, opts[:loose])
         return result if result
         return nil if opts[:try_others] == false
       end
       ParserRegistry.parsers.each do |parser|
-        result = parser.parse(xml)
+        result = parser.parse(xml, opts[:loose])
         return result if result
       end
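
Note: the parser selection described in the comment block above (a forced parser first, then the registry unless +try_others+ is false, with +loose+ passed through to every parser) can be sketched in isolation. This is a minimal illustration, not the gem's code: `parse_with_fallback`, `RejectingParser` and `AcceptingParser` are hypothetical names, the registry is a plain array, and the final `nil` return is an assumption because the hunk ends before the method does.

```ruby
require 'stringio'

# Hypothetical stand-in parsers. Both follow the 1.4.0 signature parse(xml, loose).
class RejectingParser
  def self.parse(_xml, _loose)
    nil # nil signals "could not parse"
  end
end

class AcceptingParser
  def self.parse(xml, loose)
    { :parser => name, :xml => xml, :loose => loose }
  end
end

# Illustrative helper, not part of feed-normalizer.
def parse_with_fallback(xml, registry, opts = {})
  # Read IO-like input once up front; a second read would not yield the data again.
  xml = xml.respond_to?(:read) ? xml.read : xml.to_s

  if opts[:force_parser]
    result = opts[:force_parser].parse(xml, opts[:loose])
    return result if result
    # Only an explicit false disables the fallback; a missing key does not.
    return nil if opts[:try_others] == false
  end

  registry.each do |parser|
    result = parser.parse(xml, opts[:loose])
    return result if result
  end

  nil # assumed: no parser accepted the input
end
```

Because `opts[:loose]` is read straight from the options hash, parsers receive `nil` rather than `false` when the caller omits the option; both are falsy, so the documented default still holds.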

data/lib/html-cleaner.rb CHANGED

@@ -165,7 +165,7 @@ module FeedNormalizer
 end
-module Enumerable
+module Enumerable #:nodoc:
   def build_hash
     result = {}
     self.each do |elt|
@@ -180,7 +180,7 @@ end
 #  Subject: A simple Hpricot text setter
 #  From: Chris Gehlker <canyonrat mac.com>
 #  Date: Fri, 11 Aug 2006 03:19:13 +0900
-class Hpricot::Text
+class Hpricot::Text #:nodoc:
   def set(string)
     @content = string
     self.raw_string = string

data/lib/parsers/rss.rb CHANGED

@@ -1,5 +1,10 @@
 require 'rss'
+# For some reason, this is only included in the RDF Item by default.
+class RSS::Rss::Channel::Item # :nodoc:
+  include RSS::ContentModel
+end
 module FeedNormalizer
   class RubyRssParser < Parser
@@ -7,7 +12,7 @@ module FeedNormalizer
       RSS::Parser
     end
-    def self.parse(xml)
+    def self.parse(xml, loose)
       begin
         rss = parser.parse(xml)
       rescue Exception => e
@@ -15,7 +20,7 @@ module FeedNormalizer
         return nil
       end
-      rss ? package(rss) : nil
+      rss ? package(rss, loose) : nil
     end
     # Fairly high priority; a fast and strict parser.
@@ -25,7 +30,7 @@ module FeedNormalizer
     protected
-    def self.package(rss)
+    def self.package(rss, loose)
       feed = Feed.new(self)
       # channel elements
@@ -52,7 +57,7 @@ module FeedNormalizer
         :date_published => [:pubDate, :dc_date],
         :urls => :link,
         :description => :description,
-        :content => :description,
+        :content => [:content_encoded, :description],
         :title => :title,
         :authors => [:author, :dc_creator]
       }
@@ -64,6 +69,9 @@ module FeedNormalizer
         # custom item elements
         feed_entry.id = rss_item.guid.content if rss_item.respond_to?(:guid) && rss_item.guid
         feed_entry.copyright = rss.copyright if rss_item.respond_to? :copyright
+        feed_entry.categories = loose ?
+                                  rss_item.categories.collect{|c|c.content} :
+                                  [rss_item.categories.first.content] rescue []
         feed.entries << feed_entry
       end
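
Note: the new categories assignment relies on a trailing `rescue []` to cover items with no categories, where `categories.first` is nil. The same strict versus loose behaviour can be written with an explicit guard. The sketch below is an assumption-laden illustration: `Category` is a hypothetical stand-in for the objects the RSS parser returns (only `content` is modelled), and `entry_categories` is not a method in the gem.

```ruby
# Hypothetical stand-in for a parsed <category> element.
Category = Struct.new(:content)

# Illustrative helper: strict mode keeps only the first category so output
# matches what a single-category parser can deliver; loose mode keeps all.
def entry_categories(categories, loose)
  return [] if categories.nil? || categories.empty?

  loose ? categories.map { |c| c.content } : [categories.first.content]
end
```

With the three items in test/data/rss20.xml (one, two and zero categories), this yields sizes of 1, 1, 0 in strict mode and 1, 2, 0 in loose mode, which matches the expectations in the new tests.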

data/lib/parsers/simple-rss.rb CHANGED

@@ -9,7 +9,7 @@ module FeedNormalizer
       SimpleRSS
     end
-    def self.parse(xml)
+    def self.parse(xml, loose)
       begin
         atomrss = parser.parse(xml)
       rescue Exception => e
@@ -53,9 +53,10 @@ module FeedNormalizer
         :date_published => [:pubDate, :published, :dc_date],
         :urls => :link,
         :description => [:description, :summary],
-        :content => [:content, :description],
+        :content => [:content, :content_encoded, :description],
         :title => :title,
-        :authors => [:author, :contributor, :dc_creator]
+        :authors => [:author, :contributor, :dc_creator],
+        :categories => :category
       }
       atomrss.entries.each do |atomrss_entry|
@@ -95,4 +96,3 @@ module FeedNormalizer
   end
 end
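
Note: mappings such as `:content => [:content, :content_encoded, :description]` list source fields in priority order, and the revised mapping loop in lib/feed-normalizer.rb stops at the first source whose value is not empty. A minimal sketch of that lookup follows, assuming the source item can be read like a hash; `first_present` is a hypothetical name and the real loop (only partly visible in the hunk) also handles appending to array-valued destinations.

```ruby
# Illustrative helper: return the first non-empty value among the listed
# source fields, or nil if none has content.
def first_present(src, src_functions)
  Array(src_functions).each do |fn|
    value = src[fn]
    # to_s.empty? treats both nil and "" as missing, as in the 1.4.0 change.
    return value unless value.to_s.empty?
  end
  nil
end
```

For an item carrying both `content:encoded` and `description`, the lookup returns the encoded content; if `content:encoded` is absent or blank it falls through to `description`, which is the 1.3.2 behaviour.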

data/lib/structures.rb CHANGED

@@ -121,7 +121,7 @@ module FeedNormalizer
     include Singular, ElementEquality, ElementCleaner
     HTML_ELEMENTS = [:content, :description, :title]
-    SIMPLE_ELEMENTS = [:date_published, :urls, :id, :authors, :copyright]
+    SIMPLE_ELEMENTS = [:date_published, :urls, :id, :authors, :copyright, :categories]
     BLENDED_ELEMENTS = []
     ELEMENTS = HTML_ELEMENTS + SIMPLE_ELEMENTS + BLENDED_ELEMENTS
@@ -131,6 +131,7 @@ module FeedNormalizer
     def initialize
       @urls = []
       @authors = []
+      @categories = []
     end
   end

data/test/data/rss20.xml CHANGED

@@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="ISO-8859-1" ?>
 <?xml-stylesheet title="XSL_formatting" type="text/xsl" href="/shared/bsp/xsl/rss/nolsol.xsl"?>
-<rss version="2.0">
+<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
   <channel>
     <title>BBC News | Technology | UK Edition</title>
     <link>http://news.bbc.co.uk/go/rss/-/1/hi/technology/default.stm</link>
@@ -20,6 +20,7 @@
     <item>
       <title>Concerns over security software</title>
       <description>BBC Click investigates free security software and finds out who will protect PCs when Microsoft launches Vista.</description>
+      <content:encoded><![CDATA[<p>test1</p>]]></content:encoded>
       <link>http://news.bbc.co.uk/go/rss/-/1/hi/programmes/click_online/5326654.stm</link>
       <guid isPermaLink="false">http://news.bbc.co.uk/1/hi/programmes/click_online/5326654.stm</guid>
       <pubDate>Sat, 09 Sep 2006 12:45:35 GMT</pubDate>
@@ -29,19 +30,21 @@
     <item>
       <title>Top prize for 'light' inventor</title>
       <description>A Japanese scientist who invented a sustainable form of light is awarded the Millennium Technology Prize.</description>
+      <content:encoded><![CDATA[<p>test2</p>]]></content:encoded>
       <link>http://news.bbc.co.uk/go/rss/-/1/hi/technology/5328446.stm</link>
       <guid isPermaLink="false">http://news.bbc.co.uk/1/hi/technology/5328446.stm</guid>
       <pubDate>Fri, 08 Sep 2006 16:18:08 GMT</pubDate>
       <category>Technology</category>
+      <category>Japan</category>
     </item>
     <item>
       <title>MP3 player court order overturned</title>
       <description>SanDisk puts its MP3 players back on display at a German electronics show after overturning a court injunction.</description>
+      <content:encoded><![CDATA[<p>test3</p>]]></content:encoded>
       <link>http://news.bbc.co.uk/go/rss/-/1/hi/technology/5326660.stm</link>
       <guid isPermaLink="false">http://news.bbc.co.uk/1/hi/technology/5326660.stm</guid>
       <pubDate>Fri, 08 Sep 2006 10:14:41 GMT</pubDate>
-      <category>Technology</category>
     </item>
   </channel>

data/test/data/rss20diff.xml CHANGED

@@ -41,7 +41,6 @@
       <link>http://news.bbc.co.uk/go/rss/-/1/hi/technology/5326660.stm</link>
       <guid isPermaLink="false">http://news.bbc.co.uk/1/hi/technology/5326660.stm</guid>
       <pubDate>Fri, 08 Sep 2006 10:14:41 GMT</pubDate>
-      <category>Technology</category>
     </item>
   </channel>

data/test/test_feednormalizer.rb CHANGED

@@ -68,7 +68,7 @@ class FeedNormalizerTest < Test::Unit::TestCase
     assert_equal ["http://news.bbc.co.uk/go/rss/-/1/hi/technology/default.stm"], feed.urls
     assert_equal "MP3 player court order overturned", feed.entries.last.title
     assert_equal "SanDisk puts its MP3 players back on display at a German electronics show after overturning a court injunction.", feed.entries.last.description
-    assert_equal "SanDisk puts its MP3 players back on display at a German electronics show after overturning a court injunction.", feed.entries.last.content
+    assert_match(/test\d/, feed.entries.last.content)
     assert_instance_of Time, feed.entries.last.date_published
   end
@@ -108,7 +108,7 @@ class FeedNormalizerTest < Test::Unit::TestCase
     no_diff = feed.diff(feed)
     assert diff.keys.all? {|key| [:title, :items].include?(key)}
-    assert_equal 2, diff[:items].size
+    assert_equal 3, diff[:items].size
     assert diff_short.keys.all? {|key| [:title, :items].include?(key)}
     assert_equal [3,2], diff_short[:items]
@@ -144,28 +144,64 @@ class FeedNormalizerTest < Test::Unit::TestCase
   end
   def test_dublin_core_date_ruby_rss
-    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => RubyRssParser)
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => RubyRssParser, :try_others => false)
     assert_equal 'RSS::Parser', feed.parser
     assert_instance_of Time, feed.entries.first.date_published
   end
   def test_dublin_core_date_simple_rss
-    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => SimpleRssParser)
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => SimpleRssParser, :try_others => false)
     assert_equal 'SimpleRSS', feed.parser
     assert_instance_of Time, feed.entries.first.date_published
   end
   def test_dublin_core_creator_ruby_rss
-    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => RubyRssParser)
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => RubyRssParser, :try_others => false)
     assert_equal 'RSS::Parser', feed.parser
     assert_equal 'Jeff Hecht', feed.entries.last.author
   end
   def test_dublin_core_creator_simple_rss
-    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => SimpleRssParser)
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rdf10], :force_parser => SimpleRssParser, :try_others => false)
     assert_equal 'SimpleRSS', feed.parser
     assert_equal 'Jeff Hecht', feed.entries.last.author
   end
+  def test_entry_categories_ruby_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => RubyRssParser, :try_others => false)
+    assert_equal [['Click'],['Technology'],[]], feed.items.collect {|i|i.categories}
+  end
+  def test_entry_categories_simple_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => SimpleRssParser, :try_others => false)
+    assert_equal [['Click'],['Technology'],[]], feed.items.collect {|i|i.categories}
+  end
+  def test_loose_categories_ruby_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => RubyRssParser, :try_others => false, :loose => true)
+    assert_equal [1,2,0], feed.entries.collect{|e|e.categories.size}
+  end
+  def test_loose_categories_simple_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => SimpleRssParser, :try_others => false, :loose => true)
+    assert_equal [1,1,0], feed.entries.collect{|e|e.categories.size}
+  end
+  def test_content_encoded_simple_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => SimpleRssParser, :try_others => false)
+    feed.entries.each_with_index do |e, i|
+      assert_match(/test#{i+1}/, e.content)
+    end
+  end
+  def test_content_encoded_ruby_rss
+    feed = FeedNormalizer::FeedNormalizer.parse(XML_FILES[:rss20], :force_parser => RubyRssParser, :try_others => false)
+    feed.entries.each_with_index do |e, i|
+      assert_match(/test#{i+1}/, e.content)
+    end
+  end
 end

metadata CHANGED

@@ -3,8 +3,8 @@ rubygems_version: 0.9.2
 specification_version: 1
 name: feed-normalizer
 version: !ruby/object:Gem::Version
-  version: 1.3.2
-date: 2007-07-02 00:00:00 -07:00
+  version: 1.4.0
+date: 2007-07-10 00:00:00 -07:00
 summary: Extensible Ruby wrapper for Atom and RSS parsers
 require_paths:
 - lib
@@ -33,7 +33,7 @@ files:
 - License.txt
 - Manifest.txt
 - Rakefile
-- Readme.txt
+- README.txt
 - lib/feed-normalizer.rb
 - lib/html-cleaner.rb
 - lib/parsers/rss.rb
@@ -57,7 +57,7 @@ extra_rdoc_files:
 - History.txt
 - License.txt
 - Manifest.txt
-- Readme.txt
+- README.txt
 executables: []
 extensions: []