syndication 0.4.0 → 0.5.0

data/CHANGES ADDED
@@ -0,0 +1,6 @@
+ # == Changes in 0.5
+ #
+ # - Fixed problem with syndication/dublincore reported by Ura Takefumi
+ #
+ # - Added new TagSoup completely-non-validating parser, tests for same,
+ # and option to use it for parsing feeds
data/DEVELOPER ADDED
@@ -0,0 +1,5 @@
+ # = Developer info for syndication project
+ #
+ # You only need to know this if actually hacking on the code via RubyForge.
+ #
+ # Release tags are of the format v_0_5 (for 0.5).
data/IMPLEMENTATION CHANGED
@@ -1,4 +1,26 @@
- # = Syndication 0.4
+ # = Implementation notes
+ # == Syndication 0.5
+ #
+ # For this release, I added a parser called TagSoup. The name is taken from
+ # the jargon term used for HTML written without any regard to the rules of
+ # HTML structure, i.e. HTML with many common authoring mistakes in.
+ #
+ # TagSoup is a very small and very dumb parser which implements the stream
+ # API of REXML. The test code compares it against REXML for some simple
+ # example XML and makes sure it calls the same callbacks in the same order
+ # with the same parameters.
+ #
+ # Note that hacking together your own XML parser is, generally speaking, the
+ # wrong thing to do. Using TagSoup as a general replacement for REXML is very
+ # definitely the wrong thing to do. Please don't do it.
+ #
+ # A real XML parser does all kinds of things that TagSoup doesn't, like pay
+ # attention to DTDs, handle quoted special characters in element attributes,
+ # handle whitespace in a documented standard way, and so on. The fact that
+ # TagSoup is defective in many areas is intentional. It's designed to be
+ # used as a last resort, for parsing web syndication feeds which are invalid.
+ #
+ # == Syndication 0.4
  #
  # As discussed in the README, this is really my fourth attempt at writing
  # RSS parsing code. For the record, I thought I'd list the approaches I
data/README CHANGED
@@ -1,5 +1,4 @@
- #
- # = Syndication 0.4
+ # = Syndication 0.5
  #
  # This module provides classes for parsing web syndication feeds in RSS and
  # Atom formats.
@@ -77,7 +76,7 @@
  #
  # - Less source code than the standard library rss module.
  #
- # - Faster than the standard library (at least, in my tests, see caveat below).
+ # - Faster than the standard library (at least, in my tests).
  #
  # Other features:
  #
@@ -93,7 +92,8 @@
  #
  # - Simple to extend to support your own RSS extensions, uses reflection.
  #
- # - Uses REXML fast stream parsing API for speed.
+ # - Uses REXML fast stream parsing API for speed, or built-in TagSoup parser
+ # for invalid feeds.
  #
  # - Non-validating, tries to be as forgiving as possible of structural errors.
  #
@@ -109,8 +109,6 @@
  #
  # - Different API, not a drop-in replacement.
  #
- # - No way to choose a different XML parser (yet).
- #
  # - Incomplete support for Atom 0.3 draft. (Anyone still using it?)
  #
  # - No support for base64 data in Atom feeds (yet).
@@ -150,11 +148,31 @@
  # For the record, I started work on my library long before simple-rss was
  # announced.
  #
- # = feedtools / feedreader
+ # = feedtools
  #
  # http://rubyforge.org/projects/feedtools/
  #
- # I don't know much about this one.
+ # This one solves most of the same problems as Syndication; however the two
+ # were developed in parallel, in ignorance of each other.
+ #
+ # Feedtools builds in database caching and persistance, and HTTP fetching.
+ # Personally, I don't think those belong in a feed parsing library--they
+ # are easily implemented using other standard libraries if you want them.
+ #
+ # Pros:
+ # - Lots of test cases.
+ #
+ # - Used by lots of Rails people.
+ #
+ # - Knows about many more namespaces.
+ #
+ # Cons:
+ # - Skimpy documentation.
+ #
+ # - Uses HTree then XPath parsing, rather than a single stream parse.
+ #
+ # - Tries to unify RSS and Atom APIs, at the expense of Atom functionality.
+ # (Which could also be a pro, depending on your viewpoint.)
  #
  # == Design philosophy
  #
@@ -180,6 +198,9 @@
  #
  # - Get well-formed feeds parsing reliably, then worry about broken feeds.
  #
+ # - Atom will hopefully be the future. Provide full support for RSS, but don't
+ # hold Atom back by trying to force it into an RSS data model.
+ #
  # == Future plans
  #
  # Here are some possible improvements:
@@ -187,12 +208,6 @@
  # - RSS and Atom generation. Create objects, then call Syndication::FeedMaker
  # to generate XML in various flavors.
  #
- # - More lenient parsing. The limiting factor right now appears to be REXML,
- # which although a non-validating parser, does require fairly well-formed
- # XML. (In particular, failure to match tags will cause errors.) Perhaps
- # the answer is to find or build a 'tag soup' parser that implements the
- # REXML stream parsing API?
- #
  # - Faster date parsing. It turns out that when I asked for parsed dates in
  # my test code, the profiler showed Date.parse chewing up 25% of the total
  # CPU time used. A more specific date parser that didn't use heuristics
@@ -202,7 +217,6 @@
  #
  # == Feedback
  #
- # This is my first public release of this code, so there are doubtless things
- # I could have done better. Comments, suggestions, etc are welcome; e-mail
- # <meta@pobox.com>.
+ # There are doubtless things I could have done better. Comments, suggestions,
+ # etc are welcome; e-mail <meta@pobox.com>.
  #
@@ -3,6 +3,8 @@
  #
  # Copyright © mathew <meta@pobox.com> 2005.
  # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/atom.rb,v 1.2 2005/10/17 15:05:21 meta Exp $

  require 'uri'
  require 'rexml/parsers/streamparser'
@@ -2,6 +2,8 @@
  #
  # Copyright © mathew <meta@pobox.com> 2005.
  # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/common.rb,v 1.3 2005/10/17 15:05:21 meta Exp $

  require 'uri'
  require 'rexml/parsers/streamparser'
@@ -174,8 +176,11 @@ module Syndication
      # Parse the text provided. Returns a Syndication::Atom::Feed or
      # Syndication::RSS::Feed object, according to which concrete Parser
      # class is being used.
-     def parse(text)
-       REXML::Document.parse_stream(text, self)
+     # The second argument is optional and determines the parser engine to
+     # use. The default is REXML. To use TagSoup, pass in the value
+     # Syndication::TagSoup
+     def parse(text, classname = REXML::Document)
+       classname.parse_stream(text, self)
        return @parsetree
      end

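The new second argument means that any object responding to parse_stream(text, listener) can serve as the engine. A minimal sketch of that duck-typed contract follows. EventLog and WholeTextEngine are hypothetical names invented for illustration and are not part of the syndication library, and the engine argument is made mandatory here only so the sketch does not depend on REXML.

```ruby
# Sketch only: EventLog and WholeTextEngine are hypothetical, not library code.

# A listener that records the stream callbacks it receives.
class EventLog
  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attrs)
    @events << [:tag_start, name, attrs]
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end

  def text(string)
    @events << [:text, string]
  end

  # Same shape as the parse method in the hunk above: the engine is any
  # object that responds to parse_stream(text, listener).
  def parse(text, engine)
    engine.parse_stream(text, self)
    @events
  end
end

# A toy engine that reports the whole input as the text of one <doc> element.
class WholeTextEngine
  def self.parse_stream(text, listener)
    listener.tag_start('doc', {})
    listener.text(text)
    listener.tag_end('doc')
  end
end

LOGGED = EventLog.new.parse('hello', WholeTextEngine)
```

Because the listener only ever sees callbacks, swapping the engine does not require any change to the code that builds the parse tree.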
@@ -1,3 +1,7 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/content.rb,v 1.2 2005/10/17 15:05:21 meta Exp $

  module Syndication

@@ -1,3 +1,7 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/dublincore.rb,v 1.3 2005/10/17 15:05:21 meta Exp $

  module Syndication

@@ -70,23 +74,25 @@ module Syndication
    end

    #:enddoc:
-   # Now we mix in the DublinCore elements to all the Syndication classes that
-   # can contain them. There's probably some clever way to do this via
-   # reflection, but there _is_ such a thing as being too clever.
-   class Item
-     include DublinCore
-   end
+   module RSS
+     # Now we mix in the DublinCore elements to all the Syndication classes that
+     # can contain them. There's probably some clever way to do this via
+     # reflection, but there _is_ such a thing as being too clever.
+     class Item
+       include DublinCore
+     end

-   class Channel
-     include DublinCore
-   end
+     class Channel
+       include DublinCore
+     end

-   class Image
-     include DublinCore
-   end
+     class Image
+       include DublinCore
+     end

-   class TextInput
-     include DublinCore
+     class TextInput
+       include DublinCore
+     end
    end

  end
@@ -1,3 +1,8 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/podcast.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
+
  module Syndication

    # Mixin for iTunes podcast RSS elements.
@@ -3,6 +3,8 @@
  #
  # Copyright © mathew <meta@pobox.com> 2005.
  # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/rss.rb,v 1.2 2005/10/17 15:05:21 meta Exp $

  require 'uri'
  require 'rexml/parsers/streamparser'
@@ -1,3 +1,7 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/syndication.rb,v 1.2 2005/10/17 15:05:21 meta Exp $

  require 'date'

@@ -0,0 +1,49 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/lib/syndication/tagsoup.rb,v 1.2 2005/10/17 15:05:21 meta Exp $
+
+ require 'cgi'
+
+ module Syndication
+
+   # TagSoup is a tiny completely non-validating XML parser which implements the
+   # tag_start, tag_end and text methods of the REXML StreamListener interface.
+   #
+   # It's designed for permissive parsing of RSS and Atom feeds; using it for
+   # anything more complex (like HTML with CSS and JavaScript) is not advised.
+   class TagSoup
+
+     # Parse data String and send events to listener
+     def TagSoup.parse_stream(data, listener)
+       data.scan(/(<\/[^>]*>|<[^>]*>|[^<>]*)/m) do |match|
+         thing = match.first.strip
+         if thing[0,1] == '<'
+           # It's a tag_start or tag_end
+           (tag,rest) = thing.match(/<\/?([^>\s]+)([^>]*)/)[1,2]
+           if thing[1,1] == '/'
+             listener.tag_end(tag)
+           else
+             # Parse the attr=val pairs
+             pairs = Hash.new
+             rest.scan(/([\w:]+)=("([^"]*)"|'([^']*)')/) {|a,j,v1,v2|
+               if v1 == nil
+                 v = v2
+               else
+                 v = v1
+               end
+               if a
+                 pairs[a] = v
+               end
+             }
+             listener.tag_start(tag, pairs)
+           end
+         else
+           # It's text
+           listener.text(CGI.unescapeHTML(thing))
+         end
+       end
+     end
+
+   end
+ end
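To see why this parser tolerates mismatched tags, it helps to look at what the scan pattern produces on its own. The sketch below reuses the tokenizing regular expression from parse_stream on a fragment whose b element is never closed. The names tokenize, TOKEN, BROKEN and TOKENS are hypothetical and exist only for this illustration; the sketch also drops empty tokens for readability, which the code in the diff does not do.

```ruby
# Sketch only: a hypothetical helper built around the scan pattern used by
# TagSoup.parse_stream, showing how the input is split into tokens.
TOKEN = /(<\/[^>]*>|<[^>]*>|[^<>]*)/m

# Split data into end tags, start tags and text runs, stripping whitespace
# and discarding empty tokens.
def tokenize(data)
  data.scan(TOKEN).map { |match| match.first.strip }.reject(&:empty?)
end

# The <b> element is opened but never closed.
BROKEN = "<item><title>Unclosed <b>bold</title></item>"
TOKENS = tokenize(BROKEN)

TOKENS.each { |token| puts token }
```

Each token is handled on its own, with no stack of open elements, so the missing end tag never raises an error; the listener simply never receives a tag_end for b.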
data/rakefile ADDED
@@ -0,0 +1,52 @@
+
+ require 'rake/rdoctask'
+ require 'rake/packagetask'
+ require 'rake/gempackagetask'
+ require 'rubygems'
+
+ PKG_VERSION = "0.5.0"
+
+ desc "Create HTML documentation from RDOC"
+ Rake::RDocTask.new do |rd|
+   rd.main = "README"
+   rd.rdoc_files.include("README", "CHANGES", "IMPLEMENTATION", "DEVELOPER",
+     "lib/**/*.rb", "test/**/*.rb", "examples/**/*.rb")
+ end
+
+ desc "Make tar distribution"
+ Rake::PackageTask.new('syndication', PKG_VERSION) do |t|
+   t.need_tar_bz2 = true
+   t.package_files.include("README", "CHANGES", "IMPLEMENTATION", "DEVELOPER", "lib/**/*.rb", "test/**/*.rb", "examples/**/*.rb", "rakefile", "setup.rb")
+   t.package_dir = "pkg"
+ end
+
+ spec = Gem::Specification.new do |s|
+   s.name = "syndication"
+   s.version = PKG_VERSION
+   s.author = "mathew"
+   s.email = "meta@pobox.com"
+   s.homepage = "http://www.pobox.com/~meta/"
+   s.platform = Gem::Platform::RUBY
+   s.summary = "A web syndication parser for Atom and RSS with a uniform API"
+   candidates = Dir.glob("{bin,docs,lib,test,examples}/**/*")
+   candidates << "rakefile"
+   s.files = candidates.delete_if do |item|
+     item.include?("CVS") || item.include?("html")
+   end
+   s.require_path = "lib"
+   s.test_files = ["test/atomtest.rb", "test/rsstest.rb",
+     "test/tagsouptest.rb"]
+   s.has_rdoc = true
+   s.extra_rdoc_files = ["README", "IMPLEMENTATION", "CHANGES", "DEVELOPER"]
+ end
+
+ desc "Make RubyGems gem distribution"
+ Rake::GemPackageTask.new(spec) do |pkg|
+   pkg.need_zip = true
+   pkg.need_tar = true
+ end
+
+ task :default do
+   puts "This is a pure Ruby library, no compilation is required."
+   puts "Try rake --tasks"
+ end
data/test/atomtest.rb CHANGED
@@ -1,3 +1,7 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/test/atomtest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $

  require 'syndication/atom'
  require 'test/unit'
data/test/rsstest.rb CHANGED
@@ -1,3 +1,7 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/test/rsstest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $

  require 'syndication/rss'
  require 'test/unit'
@@ -0,0 +1,87 @@
+ # Copyright © mathew <meta@pobox.com> 2005.
+ # Licensed under the same terms as Ruby.
+ #
+ # $Header: /var/cvs/syndication/syndication/test/tagsouptest.rb,v 1.2 2005/10/17 20:06:51 meta Exp $
+
+ require 'syndication/tagsoup'
+ require 'test/unit'
+ require 'rexml/document'
+ require 'pp'
+
+ module Syndication
+
+   # This class contains the unit tests for the Syndication module.
+   class Tests < Test::Unit::TestCase
+
+     def tag_start(x, pairs)
+       @events << "tag_start(#{x.strip})"
+       lst = nil
+       if pairs
+         for p in pairs
+           if lst
+             lst = lst + ","
+           else
+             lst = ""
+           end
+           lst << "#{p[0]}=#{p[1]}"
+         end
+         @events << "attrs(#{lst})"
+       end
+     end
+
+     def tag_end(x)
+       @events << "tag_end(#{x.strip})"
+     end
+
+     def text(x)
+       @events << "text(#{x.strip})"
+     end
+
+     # Minimal test
+     def test_tagsoup
+       xml = <<-EOF
+ <a>
+ <b>one
+ <c></c></b>
+ <d arg1="alpha">two</d>
+ <e arg2='beta'>
+ three&lt;four&#99;&trade;
+ </e>
+ </a>
+ <feed xmlns="http://www.w3.org/2005/Atom">
+ <title>One good turn usually gets most of the blanket.</title>
+ <updated>2005-08-20T21:14:38Z</updated>
+ <id>urn:uuid:035d3aa3022c1b1b2a17e37ae2dcc376</id>
+ <entry>
+ <title>Quidquid latine dictum sit, altum viditur.</title>
+ <link href="http://example.com/05/08/20/2114.html"/>
+ <id>urn:uuid:89d96d76a99426264f6f1f520c1b93c2</id>
+ <updated>2005-08-20T21:14:38Z</updated>
+ </entry>
+ </feed>
+       EOF
+       @events = Array.new
+       Syndication::TagSoup.parse_stream(xml, self)
+       @tagsoup = @events
+       @events = Array.new
+       REXML::Document.parse_stream(xml, self)
+       @rexml = @events
+       puts "REXML\n-----"
+       pp @rexml
+       puts "\nTAGSOUP\n-------"
+       pp @tagsoup
+       errs = false
+       for tsevt in @tagsoup
+         rxevt = @rexml.shift
+         if rxevt
+           if tsevt.to_s != rxevt.to_s
+             errs = true
+             puts "TagSoup: [#{tsevt}]\nREXML: [#{rxevt}]"
+           end
+         end
+       end
+       assert(!errs, "TagSoup and REXML parse results didn't match")
+     end
+
+   end
+ end
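The comparison loop in test_tagsoup walks only the TagSoup events and skips any position where the REXML list has run out, so a difference in the number of events would appear to go unreported. One possible way to make the comparison cover both lists is sketched below. The helper first_mismatch is hypothetical and is not part of the test suite in this release.

```ruby
# Sketch only: a hypothetical helper, not part of the syndication tests.
# Returns the index of the first position where the two event lists differ,
# treating a missing event in either list as a mismatch, or nil when the
# lists are identical.
def first_mismatch(expected, actual)
  longest = [expected.length, actual.length].max
  (0...longest).find { |i| expected[i] != actual[i] }
end

SAME    = first_mismatch(%w[tag_start(a) text(one)], %w[tag_start(a) text(one)])
SHORTER = first_mismatch(%w[tag_start(a)], %w[tag_start(a) tag_end(a)])
DIFFERS = first_mismatch(%w[tag_start(a) text(x)], %w[tag_start(a) text(y)])
```

With such a helper the final assertion could check that the result is nil, which would also fail when one parser emits extra trailing events.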
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.8.11
  specification_version: 1
  name: syndication
  version: !ruby/object:Gem::Version
-   version: 0.4.0
- date: 2005-09-29 00:00:00 -05:00
+   version: 0.5.0
+ date: 2005-10-17 00:00:00 -05:00
  summary: A web syndication parser for Atom and RSS with a uniform API
  require_paths:
  - lib
@@ -34,21 +34,29 @@ files:
  - lib/syndication/common.rb
  - lib/syndication/podcast.rb
  - lib/syndication/content.rb
+ - lib/syndication/tagsoup.rb
  - lib/syndication/rss.rb
  - lib/syndication/syndication.rb
  - lib/syndication/atom.rb
+ - test/tagsouptest.rb
  - test/rsstest.rb
  - test/atomtest.rb
  - examples/yahoo.rb
+ - rakefile
  - README
  - IMPLEMENTATION
+ - CHANGES
+ - DEVELOPER
  test_files:
  - test/atomtest.rb
  - test/rsstest.rb
+ - test/tagsouptest.rb
  rdoc_options: []
  extra_rdoc_files:
  - README
  - IMPLEMENTATION
+ - CHANGES
+ - DEVELOPER
  executables: []
  extensions: []
  requirements: []