syndication 0.4.0 → 0.5.0 (RubyGems)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (17)

data/CHANGES +6 -0
data/DEVELOPER +5 -0
data/IMPLEMENTATION +23 -1
data/README +31 -17
data/lib/syndication/atom.rb +2 -0
data/lib/syndication/common.rb +7 -2
data/lib/syndication/content.rb +4 -0
data/lib/syndication/dublincore.rb +20 -14
data/lib/syndication/podcast.rb +5 -0
data/lib/syndication/rss.rb +2 -0
data/lib/syndication/syndication.rb +4 -0
data/lib/syndication/tagsoup.rb +49 -0
data/rakefile +52 -0
data/test/atomtest.rb +4 -0
data/test/rsstest.rb +4 -0
data/test/tagsouptest.rb +87 -0
metadata +10 -2

data/CHANGES ADDED

@@ -0,0 +1,6 @@
+# == Changes in 0.5
+#
+# - Fixed problem with syndication/dublincore reported by Ura Takefumi
+#
+# - Added new TagSoup completely-non-validating parser, tests for same,
+#   and option to use it for parsing feeds

data/DEVELOPER ADDED

@@ -0,0 +1,5 @@
+# = Developer info for syndication project
+#
+# You only need to know this if actually hacking on the code via RubyForge.
+#
+# Release tags are of the format v_0_5 (for 0.5).

data/IMPLEMENTATION CHANGED

@@ -1,4 +1,26 @@
-# = Syndication 0.4
+# = Implementation notes
+# == Syndication 0.5
+#
+# For this release, I added a parser called TagSoup. The name is taken from
+# the jargon term used for HTML written without any regard to the rules of
+# HTML structure, i.e. HTML with many common authoring mistakes in.
+#
+# TagSoup is a very small and very dumb parser which implements the stream
+# API of REXML. The test code compares it against REXML for some simple
+# example XML and makes sure it calls the same callbacks in the same order
+# with the same parameters.
+#
+# Note that hacking together your own XML parser is, generally speaking, the
+# wrong thing to do. Using TagSoup as a general replacement for REXML is very
+# definitely the wrong thing to do. Please don't do it.
+#
+# A real XML parser does all kinds of things that TagSoup doesn't, like pay
+# attention to DTDs, handle quoted special characters in element attributes,
+# handle whitespace in a documented standard way, and so on. The fact that
+# TagSoup is defective in many areas is intentional. It's designed to be
+# used as a last resort, for parsing web syndication feeds which are invalid.
+#
+# == Syndication 0.4
 #
 # As discussed in the README, this is really my fourth attempt at writing
 # RSS parsing code. For the record, I thought I'd list the approaches I
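The remark about special characters can be made concrete with `CGI.unescapeHTML`, the standard-library call that `tagsoup.rb` in this release applies to text nodes (attribute values are stored as matched). The following is a minimal sketch, assuming a Ruby where `require 'cgi'` still provides that method; the sample strings are made up for illustration.

```ruby
require 'cgi'

# The predefined XML entities and numeric character references are decoded.
decoded = CGI.unescapeHTML('three&lt;four&#99;')

# Named entities outside that small set pass through unchanged; resolving
# them is the kind of work a DTD-aware XML parser would do.
untouched = CGI.unescapeHTML('&trade;')

puts decoded
puts untouched
```

This is only meant to show where the "dumb" parser stops, not to suggest a fix.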

data/README CHANGED

@@ -1,5 +1,4 @@
-#
-# = Syndication 0.4
+# = Syndication 0.5
 #
 # This module provides classes for parsing web syndication feeds in RSS and
 # Atom formats.
@@ -77,7 +76,7 @@
 #
 # - Less source code than the standard library rss module.
 #
-# - Faster than the standard library (at least, in my tests, see caveat below).
+# - Faster than the standard library (at least, in my tests).
 #
 # Other features:
 #
@@ -93,7 +92,8 @@
 #
 # - Simple to extend to support your own RSS extensions, uses reflection.
 #
-# - Uses REXML fast stream parsing API for speed.
+# - Uses REXML fast stream parsing API for speed, or built-in TagSoup parser
+#   for invalid feeds.
 #
 # - Non-validating, tries to be as forgiving as possible of structural errors.
 #
@@ -109,8 +109,6 @@
 #
 # - Different API, not a drop-in replacement.
 #
-# - No way to choose a different XML parser (yet).
-#
 # - Incomplete support for Atom 0.3 draft. (Anyone still using it?)
 #
 # - No support for base64 data in Atom feeds (yet).
@@ -150,11 +148,31 @@
 # For the record, I started work on my library long before simple-rss was
 # announced.
 #
-# = feedtools / feedreader
+# = feedtools
 #
 # http://rubyforge.org/projects/feedtools/
 #
-# I don't know much about this one.
+# This one solves most of the same problems as Syndication; however the two
+# were developed in parallel, in ignorance of each other.
+#
+# Feedtools builds in database caching and persistance, and HTTP fetching.
+# Personally, I don't think those belong in a feed parsing library--they
+# are easily implemented using other standard libraries if you want them.
+#
+# Pros:
+# - Lots of test cases.
+#
+# - Used by lots of Rails people.
+#
+# - Knows about many more namespaces.
+#
+# Cons:
+# - Skimpy documentation.
+#
+# - Uses HTree then XPath parsing, rather than a single stream parse.
+#
+# - Tries to unify RSS and Atom APIs, at the expense of Atom functionality.
+#   (Which could also be a pro, depending on your viewpoint.)
 #
 # == Design philosophy
 #
@@ -180,6 +198,9 @@
 #
 # - Get well-formed feeds parsing reliably, then worry about broken feeds.
 #
+# - Atom will hopefully be the future. Provide full support for RSS, but don't
+#   hold Atom back by trying to force it into an RSS data model.
+#
 # == Future plans
 #
 # Here are some possible improvements:
@@ -187,12 +208,6 @@
 # - RSS and Atom generation. Create objects, then call Syndication::FeedMaker
 #   to generate XML in various flavors.
 #
-# - More lenient parsing. The limiting factor right now appears to be REXML,
-#   which although a non-validating parser, does require fairly well-formed
-#   XML. (In particular, failure to match tags will cause errors.)  Perhaps
-#   the answer is to find or build a 'tag soup' parser that implements the
-#   REXML stream parsing API?
-#
 # - Faster date parsing. It turns out that when I asked for parsed dates in
 #   my test code, the profiler showed Date.parse chewing up 25% of the total
 #   CPU time used. A more specific date parser that didn't use heuristics
@@ -202,7 +217,6 @@
 #
 # == Feedback
 #
-# This is my first public release of this code, so there are doubtless things
-# I could have done better. Comments, suggestions, etc are welcome; e-mail
-# <meta@pobox.com>.
+# There are doubtless things I could have done better. Comments, suggestions,
+# etc are welcome; e-mail <meta@pobox.com>.
 #

data/lib/syndication/atom.rb CHANGED

@@ -3,6 +3,8 @@
 #
 # Copyright © mathew <meta@pobox.com> 2005.
 # Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/atom.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
 require 'uri'
 require 'rexml/parsers/streamparser'

data/lib/syndication/common.rb CHANGED

@@ -2,6 +2,8 @@
 #
 # Copyright © mathew <meta@pobox.com> 2005.
 # Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/common.rb,v 1.3 2005/10/17 15:05:21 meta Exp $
 require 'uri'
 require 'rexml/parsers/streamparser'
@@ -174,8 +176,11 @@ module Syndication
     # Parse the text provided. Returns a Syndication::Atom::Feed or
     # Syndication::RSS::Feed object, according to which concrete Parser
     # class is being used.
-    def parse(text)
-      REXML::Document.parse_stream(text, self)
+    # The second argument is optional and determines the parser engine to
+    # use. The default is REXML. To use TagSoup, pass in the value
+    # Syndication::TagSoup
+    def parse(text, classname = REXML::Document)
+      classname.parse_stream(text, self)
       return @parsetree
     end
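The new second argument works by duck typing: any class that responds to `parse_stream(text, listener)` can serve as the engine. Below is a small self-contained sketch of that pattern. `SketchParser`, `WholeTextEngine` and `WordEngine` are hypothetical stand-ins invented for this example and are not part of the gem.

```ruby
# Hypothetical engine: reports the whole input as a single text event.
class WholeTextEngine
  def self.parse_stream(text, listener)
    listener.text(text)
  end
end

# Hypothetical engine: reports each whitespace-separated word separately.
class WordEngine
  def self.parse_stream(text, listener)
    text.split.each { |word| listener.text(word) }
  end
end

# Stand-in for the library's parser: it acts as its own listener and hands
# itself to whichever engine class it is given.
class SketchParser
  def initialize
    @parsetree = []
  end

  # Listener callback invoked by the engine.
  def text(str)
    @parsetree << str
  end

  # Same shape as the patched method: the engine defaults to one class and
  # can be swapped per call.
  def parse(text, engine = WholeTextEngine)
    engine.parse_stream(text, self)
    @parsetree
  end
end

default_result = SketchParser.new.parse('one two')
swapped_result = SketchParser.new.parse('one two', WordEngine)
p default_result
p swapped_result
```

With the gem itself, the comment in the hunk suggests the equivalent call passes `Syndication::TagSoup` as the second argument, presumably after requiring `syndication/tagsoup`.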

data/lib/syndication/content.rb CHANGED

@@ -1,3 +1,7 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/content.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
 module Syndication

data/lib/syndication/dublincore.rb CHANGED

@@ -1,3 +1,7 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/dublincore.rb,v 1.3 2005/10/17 15:05:21 meta Exp $
 module Syndication
@@ -70,23 +74,25 @@ module Syndication
   end
   #:enddoc:
-  # Now we mix in the DublinCore elements to all the Syndication classes that
-  # can contain them. There's probably some clever way to do this via
-  # reflection, but there _is_ such a thing as being too clever.
-  class Item
-    include DublinCore
-  end
+  module RSS
+    # Now we mix in the DublinCore elements to all the Syndication classes that
+    # can contain them. There's probably some clever way to do this via
+    # reflection, but there _is_ such a thing as being too clever.
+    class Item
+      include DublinCore
+    end
-  class Channel
-    include DublinCore
-  end
+    class Channel
+      include DublinCore
+    end
-  class Image
-    include DublinCore
-  end
+    class Image
+      include DublinCore
+    end
-  class TextInput
-    include DublinCore
+    class TextInput
+      include DublinCore
+    end
   end
 end
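The hunk above moves the class reopenings inside `module RSS`. The diff does not describe the reported problem, but the Ruby behaviour at play is easy to show: `class Item` written one nesting level too high defines a new, unrelated class instead of reopening the nested one. The sketch below uses a hypothetical `Demo` namespace, not the gem's own modules.

```ruby
module Demo
  # Hypothetical mixin standing in for DublinCore.
  module Extras
    def extra
      'mixed in'
    end
  end

  module RSS
    class Item; end
  end

  # Wrong nesting: this creates Demo::Item and leaves Demo::RSS::Item alone.
  class Item
    include Extras
  end
end

before_fix = Demo::RSS::Item.include?(Demo::Extras)

module Demo
  module RSS
    # Correct nesting: this reopens the existing Demo::RSS::Item.
    class Item
      include Extras
    end
  end
end

after_fix = Demo::RSS::Item.include?(Demo::Extras)
p before_fix
p after_fix
```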

data/lib/syndication/podcast.rb CHANGED

@@ -1,3 +1,8 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/podcast.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
 module Syndication
   # Mixin for iTunes podcast RSS elements.

data/lib/syndication/rss.rb CHANGED

@@ -3,6 +3,8 @@
 #
 # Copyright © mathew <meta@pobox.com> 2005.
 # Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/rss.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
 require 'uri'
 require 'rexml/parsers/streamparser'

data/lib/syndication/syndication.rb CHANGED

@@ -1,3 +1,7 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/syndication.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
 require 'date'

data/lib/syndication/tagsoup.rb ADDED

@@ -0,0 +1,49 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/lib/syndication/tagsoup.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
+require 'cgi'
+module Syndication
+  # TagSoup is a tiny completely non-validating XML parser which implements the
+  # tag_start, tag_end and text methods of the REXML StreamListener interface.
+  #
+  # It's designed for permissive parsing of RSS and Atom feeds; using it for
+  # anything more complex (like HTML with CSS and JavaScript) is not advised.
+  class TagSoup
+    # Parse data String and send events to listener
+    def TagSoup.parse_stream(data, listener)
+      data.scan(/(<\/[^>]*>|<[^>]*>|[^<>]*)/m) do |match|
+        thing = match.first.strip
+        if thing[0,1] == '<'
+          # It's a tag_start or tag_end
+          (tag,rest) = thing.match(/<\/?([^>\s]+)([^>]*)/)[1,2]
+          if thing[1,1] == '/'
+            listener.tag_end(tag)
+          else
+            # Parse the attr=val pairs
+            pairs = Hash.new
+            rest.scan(/([\w:]+)=("([^"]*)"|'([^']*)')/) {|a,j,v1,v2|
+              if v1 == nil
+                v = v2
+              else
+                v = v1
+              end
+              if a
+                pairs[a] = v
+              end
+            }
+            listener.tag_start(tag, pairs)
+          end
+        else
+          # It's text
+          listener.text(CGI.unescapeHTML(thing))
+        end
+      end
+    end
+  end
+end
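The two regular expressions in `parse_stream` do most of the work, so it may help to see them in isolation. The sketch below reuses both patterns verbatim. The helper names `tokenize` and `parse_attrs` are hypothetical, and unlike the original, `tokenize` drops empty matches rather than forwarding them as text events.

```ruby
# Same token pattern as tagsoup.rb: an end tag, a start tag, or a run of text.
TOKEN = /(<\/[^>]*>|<[^>]*>|[^<>]*)/m
# Same attribute pattern: name="value" or name='value'.
ATTR = /([\w:]+)=("([^"]*)"|'([^']*)')/

# Hypothetical helper: split input into tag and text tokens, discarding the
# empty and whitespace-only matches that the text alternative can produce.
def tokenize(data)
  data.scan(TOKEN).flatten.map(&:strip).reject(&:empty?)
end

# Hypothetical helper: collect quoted attribute values into a Hash, taking
# whichever of the double-quoted or single-quoted captures matched.
def parse_attrs(rest)
  rest.scan(ATTR).each_with_object({}) do |(name, _quoted, dq, sq), pairs|
    pairs[name] = dq.nil? ? sq : dq
  end
end

tokens = tokenize(%q{<d arg1="alpha">two</d>})
attrs  = parse_attrs(%q{ arg1="alpha" arg2='beta' arg3=gamma})
p tokens
p attrs
```

The unquoted `arg3=gamma` is skipped because the attribute pattern requires a quoted value, which fits the parser's stated scope of "good enough for feeds".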

data/rakefile ADDED

@@ -0,0 +1,52 @@
+require 'rake/rdoctask'
+require 'rake/packagetask'
+require 'rake/gempackagetask'
+require 'rubygems'
+PKG_VERSION = "0.5.0"
+desc "Create HTML documentation from RDOC"
+Rake::RDocTask.new do |rd|
+  rd.main = "README"
+  rd.rdoc_files.include("README", "CHANGES", "IMPLEMENTATION", "DEVELOPER",
+                        "lib/**/*.rb", "test/**/*.rb", "examples/**/*.rb")
+end
+desc "Make tar distribution"
+Rake::PackageTask.new('syndication', PKG_VERSION) do |t|
+  t.need_tar_bz2 = true
+  t.package_files.include("README", "CHANGES", "IMPLEMENTATION", "DEVELOPER", "lib/**/*.rb", "test/**/*.rb", "examples/**/*.rb", "rakefile", "setup.rb")
+  t.package_dir = "pkg"
+end
+spec = Gem::Specification.new do |s|
+  s.name = "syndication"
+  s.version = PKG_VERSION
+  s.author = "mathew"
+  s.email = "meta@pobox.com"
+  s.homepage = "http://www.pobox.com/~meta/"
+  s.platform = Gem::Platform::RUBY
+  s.summary = "A web syndication parser for Atom and RSS with a uniform API"
+  candidates = Dir.glob("{bin,docs,lib,test,examples}/**/*")
+  candidates << "rakefile"
+  s.files = candidates.delete_if do |item|
+    item.include?("CVS") || item.include?("html")
+  end
+  s.require_path = "lib"
+  s.test_files = ["test/atomtest.rb", "test/rsstest.rb",
+                  "test/tagsouptest.rb"]
+  s.has_rdoc = true
+  s.extra_rdoc_files = ["README", "IMPLEMENTATION", "CHANGES", "DEVELOPER"]
+end
+desc "Make RubyGems gem distribution"
+Rake::GemPackageTask.new(spec) do |pkg|
+  pkg.need_zip = true
+  pkg.need_tar = true
+end
+task :default do
+  puts "This is a pure Ruby library, no compilation is required."
+  puts "Try rake --tasks"
+end

data/test/atomtest.rb CHANGED

@@ -1,3 +1,7 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/test/atomtest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $
 require 'syndication/atom'
 require 'test/unit'

data/test/rsstest.rb CHANGED

@@ -1,3 +1,7 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/test/rsstest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $
 require 'syndication/rss'
 require 'test/unit'

data/test/tagsouptest.rb ADDED

@@ -0,0 +1,87 @@
+# Copyright © mathew <meta@pobox.com> 2005.
+# Licensed under the same terms as Ruby.
+#
+# $Header: /var/cvs/syndication/syndication/test/tagsouptest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $
+require 'syndication/tagsoup'
+require 'test/unit'
+require 'rexml/document'
+require 'pp'
+module Syndication
+  # This class contains the unit tests for the Syndication module.
+  class Tests < Test::Unit::TestCase
+    def tag_start(x, pairs)
+      @events << "tag_start(#{x.strip})"
+      lst = nil
+      if pairs
+        for p in pairs
+          if lst
+            lst = lst + ","
+          else
+            lst = ""
+          end
+          lst << "#{p[0]}=#{p[1]}"
+        end
+        @events << "attrs(#{lst})"
+      end
+    end
+    def tag_end(x)
+      @events << "tag_end(#{x.strip})"
+    end
+    def text(x)
+      @events << "text(#{x.strip})"
+    end
+    # Minimal test
+    def test_tagsoup
+      xml = <<-EOF
+<a>
+<b>one
+<c></c></b>
+<d arg1="alpha">two</d>
+<e arg2='beta'>
+three&lt;four&#99;&trade;
+</e>
+</a>
+<feed xmlns="http://www.w3.org/2005/Atom">
+<title>One good turn usually gets most of the blanket.</title>
+<updated>2005-08-20T21:14:38Z</updated>
+<id>urn:uuid:035d3aa3022c1b1b2a17e37ae2dcc376</id>
+<entry>
+<title>Quidquid latine dictum sit, altum viditur.</title>
+<link href="http://example.com/05/08/20/2114.html"/>
+<id>urn:uuid:89d96d76a99426264f6f1f520c1b93c2</id>
+<updated>2005-08-20T21:14:38Z</updated>
+</entry>
+</feed>
+      EOF
+      @events = Array.new
+      Syndication::TagSoup.parse_stream(xml, self)
+      @tagsoup = @events
+      @events = Array.new
+      REXML::Document.parse_stream(xml, self)
+      @rexml = @events
+      puts "REXML\n-----"
+      pp @rexml
+      puts "\nTAGSOUP\n-------"
+      pp @tagsoup
+      errs = false
+      for tsevt in @tagsoup
+        rxevt = @rexml.shift
+        if rxevt
+          if tsevt.to_s != rxevt.to_s
+            errs = true
+            puts "TagSoup: [#{tsevt}]\nREXML: [#{rxevt}]"
+          end
+        end
+      end
+      assert(!errs, "TagSoup and REXML parse results didn't match")
+    end
+  end
+end
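The comparison loop in this test walks the TagSoup events and shifts REXML events alongside them, so a position where REXML has already run out is skipped, and leftover REXML events are never examined. A stricter comparison could be sketched as follows. `first_mismatch` is a hypothetical helper and the event strings are invented sample data, not output from either parser.

```ruby
# Hypothetical helper: index of the first position where two event lists
# differ, counting a missing event on either side as a difference.
# Returns nil when the lists are identical.
def first_mismatch(expected, actual)
  length = [expected.length, actual.length].max
  (0...length).find { |i| expected[i] != actual[i] }
end

reference = ['tag_start(a)', 'text(one)', 'tag_end(a)']
identical = ['tag_start(a)', 'text(one)', 'tag_end(a)']
truncated = ['tag_start(a)', 'text(one)']

same_result      = first_mismatch(reference, identical)
truncated_result = first_mismatch(reference, truncated)
p same_result
p truncated_result
```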

metadata CHANGED

@@ -3,8 +3,8 @@ rubygems_version: 0.8.11
 specification_version: 1
 name: syndication
 version: !ruby/object:Gem::Version
-  version: 0.4.0
-date: 2005-09-29 00:00:00 -05:00
+  version: 0.5.0
+date: 2005-10-17 00:00:00 -05:00
 summary: A web syndication parser for Atom and RSS with a uniform API
 require_paths:
   - lib
@@ -34,21 +34,29 @@ files:
   - lib/syndication/common.rb
   - lib/syndication/podcast.rb
   - lib/syndication/content.rb
+  - lib/syndication/tagsoup.rb
   - lib/syndication/rss.rb
   - lib/syndication/syndication.rb
   - lib/syndication/atom.rb
+  - test/tagsouptest.rb
   - test/rsstest.rb
   - test/atomtest.rb
   - examples/yahoo.rb
+  - rakefile
   - README
   - IMPLEMENTATION
+  - CHANGES
+  - DEVELOPER
 test_files:
   - test/atomtest.rb
   - test/rsstest.rb
+  - test/tagsouptest.rb
 rdoc_options: []
 extra_rdoc_files:
   - README
   - IMPLEMENTATION
+  - CHANGES
+  - DEVELOPER
 executables: []
 extensions: []
 requirements: []