techthumb-feedzirra 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/README.textile +165 -0
- data/Rakefile +14 -0
- data/lib/core_ext/date.rb +21 -0
- data/lib/feedzirra/atom.rb +14 -0
- data/lib/feedzirra/atom_entry.rb +13 -0
- data/lib/feedzirra/atom_feed_burner.rb +14 -0
- data/lib/feedzirra/atom_feed_burner_entry.rb +12 -0
- data/lib/feedzirra/feed.rb +157 -0
- data/lib/feedzirra/feed_entry_utilities.rb +15 -0
- data/lib/feedzirra/feed_utilities.rb +62 -0
- data/lib/feedzirra/rdf.rb +15 -0
- data/lib/feedzirra/rdf_entry.rb +12 -0
- data/lib/feedzirra/rss.rb +15 -0
- data/lib/feedzirra/rss_entry.rb +15 -0
- data/lib/feedzirra.rb +29 -0
- data/spec/feedzirra/atom_entry_spec.rb +33 -0
- data/spec/feedzirra/atom_feed_burner_entry_spec.rb +33 -0
- data/spec/feedzirra/atom_feed_burner_spec.rb +39 -0
- data/spec/feedzirra/atom_spec.rb +35 -0
- data/spec/feedzirra/feed_entry_utilities_spec.rb +17 -0
- data/spec/feedzirra/feed_spec.rb +241 -0
- data/spec/feedzirra/feed_utilities_spec.rb +149 -0
- data/spec/feedzirra/rdf_entry_spec.rb +33 -0
- data/spec/feedzirra/rdf_spec.rb +37 -0
- data/spec/feedzirra/rss_entry_spec.rb +33 -0
- data/spec/feedzirra/rss_spec.rb +40 -0
- data/spec/spec.opts +2 -0
- data/spec/spec_helper.rb +46 -0
- metadata +115 -0
data/README.textile
ADDED
@@ -0,0 +1,165 @@
h1. Feedzirra

"http://github.com/pauldix/feedzirra/tree/master":http://github.com/pauldix/feedzirra/tree/master

I'd like feedback on the API and any bugs encountered on feeds in the wild. I've set up a "google group here":http://groups.google.com/group/feedzirra.

h2. Summary

A feed fetching and parsing library that treats the internet like Godzilla treats Japan: it dominates and eats all.

h2. Description

Feedzirra is a feed library designed to fetch and update many feeds as quickly as possible. This includes using libcurl-multi through the "taf2-curb":http://github.com/taf2/curb/tree/master gem for faster HTTP GETs, and libxml through "nokogiri":http://github.com/tenderlove/nokogiri/tree/master and "sax-machine":http://github.com/pauldix/sax-machine/tree/master for faster parsing.

Once you have fetched feeds using Feedzirra, they can be updated using the feed objects. Feedzirra automatically inserts etag and last-modified information from the HTTP response headers to lower bandwidth usage, eliminate unnecessary parsing, and make things speedier in general.

Another feature present in Feedzirra is the ability to define callback functions that get called "on success" and "on failure" when fetching a feed. This makes it easy to do things like log errors or update data stores.

The fetching and parsing logic have been decoupled so that either can be used in isolation if you'd prefer not to use everything Feedzirra offers. However, the code examples below use helper methods in the Feed class that put everything together to make things as simple as possible.

The final feature of Feedzirra is the ability to define custom parsing classes. In truth, Feedzirra could be used to parse much more than feeds. Microformats, page scraping, and almost anything else are fair game.

h2. Installation

For now Feedzirra exists only on GitHub. It also has a few gem requirements that are only on GitHub. Before you start you need to have "libcurl":http://curl.haxx.se/ and "libxml":http://xmlsoft.org/ installed. If you're on Leopard you have both. Otherwise, you'll need to grab them. Once you've got those libraries, these are the gems that get used: nokogiri, pauldix-sax-machine, taf2-curb (note that this is a fork that lives on GitHub and not the RubyForge version of curb), and pauldix-feedzirra. The feedzirra gemspec has all the dependencies, so you should be able to get up and running with the standard GitHub gem install routine:
<pre>
gem sources -a http://gems.github.com # if you haven't already
gem install pauldix-feedzirra
</pre>
<b>NOTE:</b> Some people have been reporting a few issues related to installation. First, the RubyForge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on RubyForge. You have to get the "taf2-curb":http://github.com/taf2/curb/tree/master fork installed.

If you see this error when doing a require:
<pre>
/Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)
</pre>
It means that the taf2-curb gem didn't build correctly. To resolve this you can do a git clone git://github.com/taf2/curb.git, then run rake gem in the curb directory, then sudo gem install pkg/curb-0.2.4.0.gem. After that you should be good.

If you see something like this when trying to run it:
<pre>
NoMethodError: undefined method `on_success' for #<Curl::Easy:0x1182724>
  from ./lib/feedzirra/feed.rb:88:in `add_url_to_multi'
</pre>
This means that you are requiring curl-multi or the RubyForge version of curb somewhere. You can't use those; you need to get the taf2 version up and running.

Another problem could be if you are running MacPorts and have libcurl installed through there. You need to uninstall it for curb to work! The version in MacPorts is old and doesn't play nice with curb. If you're running Leopard, you can just uninstall it and you should be golden. If you're on an older version of OS X, you'll then need to "download curl":http://curl.haxx.se/download.html and build it from source. Then you'll have to install the taf2-curb gem again. You might have to perform the step above.

If you're still having issues, please let me know on the mailing list. Also, "Todd Fisher (taf2)":http://github.com/taf2 is working on fixing the gem install. Please send him a full error report.

h2. Usage

"A gist of the following code":http://gist.github.com/57285
<pre>
require 'feedzirra'

# fetching a single feed
feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")

# feed and entries accessors
feed.title # => "Paul Dix Explains Nothing"
feed.url # => "http://www.pauldix.net"
feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object

entry = feed.entries.first
entry.title # => "Ruby Http Client Library Performance"
entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
entry.author # => "Paul Dix"
entry.summary # => "..."
entry.content # => "..."
entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object

# updating a single feed
updated_feed = Feedzirra::Feed.update(feed)

# an updated feed has the following extra accessors
updated_feed.updated? # returns true if any of the feed attributes have been modified. will return false if only new entries
updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update

# fetching multiple feeds
feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)

# feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
# there will be a Fixnum of the http response code instead of a feed object

# updating multiple feeds. it expects a collection of feed objects
updated_feeds = Feedzirra::Feed.update(feeds.values)

# defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing",
          :on_success => lambda {|url, feed| puts feed.title },
          :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })
# if a collection was passed into fetch_and_parse, the handlers will be called for each one

# the behavior for the handlers when using Feedzirra::Feed.update is slightly different. The feed passed into on_success will be
# the updated feed with the standard updated accessors. on failure it will be the original feed object passed into update

# Defining custom parsers
# TODO: the functionality is here, just write some good examples that show how to do this
</pre>
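Since the custom-parser examples above are still a TODO, here is a minimal, dependency-free sketch of the dispatch protocol that `Feed.add_feed_class` and `Feed.determine_feed_parser_for_xml` rely on. `MyGreetingParser` and its `<greeting>` format are entirely hypothetical; a real custom parser would also `include SAXMachine` and `FeedUtilities`:

```ruby
# Hypothetical parser class: real Feedzirra parsers include SAXMachine,
# but the dispatch protocol only needs able_to_parse? and parse.
class MyGreetingParser
  def self.able_to_parse?(xml)
    !!(xml =~ /<greeting>/)
  end

  def self.parse(xml)
    xml[/<greeting>(.*?)<\/greeting>/, 1]
  end
end

# Feed.add_feed_class unshifts onto Feed.feed_classes, so custom
# parsers are checked before the built-in RSS/Atom classes.
feed_classes = [MyGreetingParser]

xml = "<greeting>hello from a custom format</greeting>"
# Feed.determine_feed_parser_for_xml only looks at the first 1000 bytes.
parser = feed_classes.detect { |klass| klass.able_to_parse?(xml.slice(0, 1000)) }
result = parser.parse(xml)
puts result # => "hello from a custom format"
```

The only contract a parser class has to satisfy is a class-level `able_to_parse?(xml)` predicate and a `parse(xml)` method, which is why the README claims microformats and page scraping are fair game.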

h2. Benchmarks

One of the goals of Feedzirra is speed. This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details the "benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb":http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb
<pre>
feedzirra         5.170000   1.290000   6.460000 ( 18.917796)
rfeedparser     104.260000  12.220000 116.480000 (244.799063)
feed-normalizer  66.250000   4.010000  70.260000 (191.589862)
</pre>
The result of that benchmark is a bit sketchy because of network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a "benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb":http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb on an atom feed.
<pre>
feedzirra        0.500000  0.030000  0.530000 ( 0.658744)
rfeedparser      8.400000  1.110000  9.510000 (11.839827)
feed-normalizer  5.980000  0.160000  6.140000 ( 7.576140)
</pre>
There's also a "benchmark that shows the results of using Feedzirra to perform updates on feeds":http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb you've already pulled in. I tested against 179 feeds. The first run is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.
<pre>
feedzirra fetch and parse  4.010000  0.710000  4.720000 (15.110101)
feedzirra update           0.660000  0.280000  0.940000 ( 5.152709)
</pre>

h2. Next Steps

This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother using the test suite for feedparser. I wanted to start fresh.

Here are some more specific TODOs:
* Make a feedzirra-rails gem to integrate feedzirra seamlessly with Rails and ActiveRecord.
* Add a function to sanitize content.
* Add support to automatically handle gzip and deflate encoding.
* Add support for authenticated feeds.
* Create a super sweet DSL for defining new parsers.
* Test against Ruby 1.9.1 and fix any bugs.
* I'm not keeping track of modified on entries. Should I add this?
* Should I be parsing stuff like tags or categories for entries?
* Clean up the fetching code inside feed.rb so it doesn't suck so hard.
* Make the feed_spec actually mock stuff out so it doesn't hit the net.
* Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?

h2. LICENSE

(The MIT License)

Copyright (c) 2009:

"Paul Dix":http://pauldix.net

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile
ADDED
@@ -0,0 +1,14 @@
require "spec"
require "spec/rake/spectask"
require 'lib/feedzirra.rb'

Spec::Rake::SpecTask.new do |t|
  t.spec_opts = ['--options', "\"#{File.dirname(__FILE__)}/spec/spec.opts\""]
  t.spec_files = FileList['spec/**/*_spec.rb']
end

task :install do
  rm_rf "*.gem"
  puts `gem build feedzirra.gemspec`
  puts `sudo gem install feedzirra-#{Feedzirra::VERSION}.gem`
end
data/lib/core_ext/date.rb
ADDED
@@ -0,0 +1,21 @@
# Date code pulled from:
# Ruby Cookbook by Lucas Carlson and Leonard Richardson
# Published by O'Reilly
# ISBN: 0-596-52369-6
class Date
  def feed_utils_to_gm_time
    feed_utils_to_time(new_offset, :gm)
  end

  def feed_utils_to_local_time
    feed_utils_to_time(new_offset(DateTime.now.offset - offset), :local)
  end

  private

  def feed_utils_to_time(dest, method)
    # Convert a fraction of a day to a number of microseconds
    usec = (dest.sec_fraction * 60 * 60 * 24 * (10**6)).to_i
    Time.send(method, dest.year, dest.month, dest.day, dest.hour, dest.min,
              dest.sec, usec)
  end
end
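The cookbook code above targets Ruby 1.8, where `DateTime#sec_fraction` returned a fraction of a day (hence the `* 60 * 60 * 24` factor). The conversion it performs — normalize a `DateTime` to GMT, then build a `Time` — can be sketched with only the stdlib; the timestamp below is a made-up sample:

```ruby
require 'date'
require 'time'

# Hypothetical sample: 17:00:19 at UTC+2 is 15:00:19 UTC.
dt  = DateTime.parse("2009-01-29T17:00:19+02:00")
utc = dt.new_offset(0) # what the no-argument new_offset call above does
t   = Time.gm(utc.year, utc.month, utc.day, utc.hour, utc.min, utc.sec)
puts t # => 2009-01-29 15:00:19 UTC
```

On modern Rubies `DateTime#to_time` covers much of this, but the explicit `Time.gm` construction mirrors what `feed_utils_to_time(dest, :gm)` does.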
data/lib/feedzirra/atom.rb
ADDED
@@ -0,0 +1,14 @@
module Feedzirra
  class Atom
    include SAXMachine
    include FeedUtilities
    element :title
    element :link, :as => :url, :value => :href, :with => {:type => "text/html"}
    element :link, :as => :feed_url, :value => :href, :with => {:type => "application/atom+xml"}
    elements :entry, :as => :entries, :class => AtomEntry

    def self.able_to_parse?(xml)
      xml =~ /(Atom)|(#{Regexp.escape("http://purl.org/atom")})/
    end
  end
end
data/lib/feedzirra/atom_entry.rb
ADDED
@@ -0,0 +1,13 @@
module Feedzirra
  class AtomEntry
    include SAXMachine
    include FeedEntryUtilities
    element :title
    element :link, :as => :url, :value => :href, :with => {:type => "text/html"}
    element :name, :as => :author
    element :content
    element :summary
    element :published
    element :created, :as => :published
  end
end
data/lib/feedzirra/atom_feed_burner.rb
ADDED
@@ -0,0 +1,14 @@
module Feedzirra
  class AtomFeedBurner
    include SAXMachine
    include FeedUtilities
    element :title
    element :link, :as => :url, :value => :href, :with => {:type => "text/html"}
    element :link, :as => :feed_url, :value => :href, :with => {:type => "application/atom+xml"}
    elements :entry, :as => :entries, :class => AtomFeedBurnerEntry

    def self.able_to_parse?(xml)
      (xml =~ /Atom/ && xml =~ /feedburner/) || false
    end
  end
end
data/lib/feedzirra/atom_feed_burner_entry.rb
ADDED
@@ -0,0 +1,12 @@
module Feedzirra
  class AtomFeedBurnerEntry
    include SAXMachine
    include FeedEntryUtilities
    element :title
    element :name, :as => :author
    element :"feedburner:origLink", :as => :url
    element :summary
    element :content
    element :published
  end
end
data/lib/feedzirra/feed.rb
ADDED
@@ -0,0 +1,157 @@
module Feedzirra
  class NoParserAvailable < StandardError; end

  class Feed
    USER_AGENT = "feedzirra http://github.com/pauldix/feedzirra/tree/master"

    def self.parse(xml)
      if parser = determine_feed_parser_for_xml(xml)
        parser.parse(xml)
      else
        raise NoParserAvailable.new("no valid parser for content.")
      end
    end

    def self.determine_feed_parser_for_xml(xml)
      start_of_doc = xml.slice(0, 1000)
      feed_classes.detect {|klass| klass.able_to_parse?(start_of_doc)}
    end

    def self.add_feed_class(klass)
      feed_classes.unshift klass
    end

    def self.feed_classes
      @feed_classes ||= [RSS, AtomFeedBurner, Atom]
    end

    # can take a single url or an array of urls
    # when passed a single url it returns the body of the response
    # when passed an array of urls it returns a hash with the urls as keys and bodies of responses as values
    def self.fetch_raw(urls, options = {})
      urls = [*urls]
      multi = Curl::Multi.new
      responses = {}
      urls.each do |url|
        easy = Curl::Easy.new(url) do |curl|
          curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
          curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options.has_key?(:if_modified_since)
          curl.headers["If-None-Match"] = options[:if_none_match] if options.has_key?(:if_none_match)
          curl.follow_location = true
          curl.on_success do |c|
            responses[url] = c.body_str
          end
          curl.on_failure do |c|
            responses[url] = c.response_code
          end
        end
        multi.add(easy)
      end

      multi.perform
      return responses.size == 1 ? responses.values.first : responses
    end

    def self.fetch_and_parse(urls, options = {})
      url_queue = [*urls]
      multi = Curl::Multi.new

      # I broke these down so I would only try to do 30 simultaneously because
      # I was getting weird errors when doing a lot. As one finishes it pops another off the queue.
      responses = {}
      url_queue.slice!(0, 30).each do |url|
        add_url_to_multi(multi, url, url_queue, responses, options)
      end

      multi.perform
      return responses.size == 1 ? responses.values.first : responses
    end

    def self.update(feeds, options = {})
      feed_queue = [*feeds]
      multi = Curl::Multi.new
      responses = {}
      feed_queue.slice!(0, 30).each do |feed|
        add_feed_to_multi(multi, feed, feed_queue, responses, options)
      end

      multi.perform
      return responses.size == 1 ? responses.values.first : responses.values
    end

    def self.add_url_to_multi(multi, url, url_queue, responses, options)
      easy = Curl::Easy.new(url) do |curl|
        curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
        curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options.has_key?(:if_modified_since)
        curl.headers["If-None-Match"] = options[:if_none_match] if options.has_key?(:if_none_match)
        curl.follow_location = true
        curl.on_success do |c|
          add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
          xml = c.body_str
          klass = determine_feed_parser_for_xml(xml)
          if klass
            feed = klass.parse(xml)
            feed.feed_url = c.last_effective_url
            feed.etag = etag_from_header(c.header_str)
            feed.last_modified = last_modified_from_header(c.header_str)
            responses[url] = feed
            options[:on_success].call(url, feed) if options.has_key?(:on_success)
          else
            puts "Error determining parser for #{url} - #{c.last_effective_url}"
          end
        end
        curl.on_failure do |c|
          add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
          responses[url] = c.response_code
          options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
        end
      end
      multi.add(easy)
    end

    def self.add_feed_to_multi(multi, feed, feed_queue, responses, options)
      easy = Curl::Easy.new(feed.feed_url) do |curl|
        curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
        curl.headers["If-Modified-Since"] = feed.last_modified.httpdate if feed.last_modified
        curl.headers["If-None-Match"] = feed.etag if feed.etag
        curl.follow_location = true
        curl.on_success do |c|
          add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
          updated_feed = Feed.parse(c.body_str)
          updated_feed.feed_url = c.last_effective_url
          updated_feed.etag = etag_from_header(c.header_str)
          updated_feed.last_modified = last_modified_from_header(c.header_str)
          feed.update_from_feed(updated_feed)
          responses[feed.feed_url] = feed
          options[:on_success].call(feed) if options.has_key?(:on_success)
        end
        curl.on_failure do |c|
          add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
          response_code = c.response_code
          if response_code == 304 # it's not modified. this isn't an error condition
            responses[feed.feed_url] = feed
            options[:on_success].call(feed) if options.has_key?(:on_success)
          else
            responses[feed.feed_url] = c.response_code
            options[:on_failure].call(feed, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
          end
        end
      end
      multi.add(easy)
    end

    def self.etag_from_header(header)
      header =~ /.*ETag:\s(.*)\r/
      $1
    end

    def self.last_modified_from_header(header)
      header =~ /.*Last-Modified:\s(.*)\r/
      Time.parse($1) if $1
    end
  end
end
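The two header helpers at the bottom of feed.rb are easy to exercise in isolation, since they only run a regex over curb's raw `header_str`. A minimal sketch with a hypothetical raw header string (note the regex keeps any quotes around the ETag value, which matches what the README shows for `feed.etag`):

```ruby
require 'time'

# The same regexes as etag_from_header / last_modified_from_header,
# run against a made-up raw HTTP header string.
header = "HTTP/1.1 200 OK\r\n" \
         "ETag: \"GunxqnEP4NeYhrqq9TyVKTuDnh0\"\r\n" \
         "Last-Modified: Sat, 31 Jan 2009 17:58:16 GMT\r\n\r\n"

header =~ /.*ETag:\s(.*)\r/
etag = $1 # captures everything after "ETag: " up to the \r

header =~ /.*Last-Modified:\s(.*)\r/
last_modified = Time.parse($1) if $1

puts etag
puts last_modified
```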
data/spec/../lib/feedzirra/feed_entry_utilities.rb
ADDED
@@ -0,0 +1,15 @@
module Feedzirra
  module FeedEntryUtilities
    attr_reader :published

    def parse_datetime(string)
      DateTime.parse(string).feed_utils_to_gm_time
    end

    def published=(val)
      @published = parse_datetime(val)
    end

    alias_method :last_modified, :published
  end
end
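The pattern above — a writer that parses the raw date string on assignment, plus `alias_method` so `last_modified` reads the same value — can be sketched with the stdlib alone. `EntryTimes` and `DemoEntry` are hypothetical stand-ins; the real module routes through `DateTime#feed_utils_to_gm_time`, but `Time.parse(...).utc` gives the same GMT normalization for well-formed dates:

```ruby
require 'time'

# Hypothetical stand-in for Feedzirra::FeedEntryUtilities.
module EntryTimes
  attr_reader :published

  # Parse on assignment, so @published is always a Time in UTC.
  def published=(val)
    @published = Time.parse(val).utc
  end

  # last_modified reads the same attribute, as in the real module.
  alias_method :last_modified, :published
end

class DemoEntry
  include EntryTimes
end

entry = DemoEntry.new
entry.published = "Thu, 29 Jan 2009 17:00:19 -0000"
puts entry.published      # 2009-01-29 17:00:19 UTC
puts entry.last_modified  # same value, via the alias
```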
data/lib/feedzirra/feed_utilities.rb
ADDED
@@ -0,0 +1,62 @@
module Feedzirra
  module FeedUtilities
    UPDATABLE_ATTRIBUTES = %w(title feed_url url last_modified)

    attr_writer :new_entries, :updated, :last_modified
    attr_accessor :etag

    def last_modified
      @last_modified ||= begin
        entry = entries.reject {|e| e.published.nil? }.sort_by {|entry| entry.published }.last
        entry ? entry.published : nil
      end
    end

    def updated?
      @updated
    end

    def new_entries
      @new_entries ||= []
    end

    def has_new_entries?
      new_entries.size > 0
    end

    def update_from_feed(feed)
      self.new_entries += find_new_entries_for(feed)
      self.entries.unshift(*self.new_entries)

      updated! if UPDATABLE_ATTRIBUTES.any? { |name| update_attribute(feed, name) }
    end

    def update_attribute(feed, name)
      old_value, new_value = send(name), feed.send(name)

      if old_value != new_value
        send("#{name}=", new_value)
      end
    end

    private

    def updated!
      @updated = true
    end

    def find_new_entries_for(feed)
      latest_entry = self.entries.first
      found_new_entries = []
      feed.entries.each do |entry|
        break if entry.url == latest_entry.url
        found_new_entries << entry
      end
      found_new_entries
    end

    def existing_entry?(test_entry)
      entries.any? { |entry| entry.url == test_entry.url }
    end
  end
end
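`find_new_entries_for` relies on feed entries being ordered newest-first: it walks the freshly fetched entries and stops at the first URL it already has. A stdlib-only sketch of the same walk (the example.com URLs are made up):

```ruby
require 'ostruct'

# Minimal re-creation of find_new_entries_for: entries are assumed
# newest-first, so stop at the first URL we have already seen.
def find_new_entries(existing_entries, fetched_entries)
  latest = existing_entries.first
  new_entries = []
  fetched_entries.each do |entry|
    break if entry.url == latest.url
    new_entries << entry
  end
  new_entries
end

existing = [OpenStruct.new(:url => "http://example.com/b"),
            OpenStruct.new(:url => "http://example.com/a")]
fetched  = [OpenStruct.new(:url => "http://example.com/c"),  # newly published
            OpenStruct.new(:url => "http://example.com/b"),
            OpenStruct.new(:url => "http://example.com/a")]

new_entries = find_new_entries(existing, fetched)
puts new_entries.map(&:url) # only http://example.com/c is new
```

This is also why `update_from_feed` can simply `unshift` the new entries onto the front of the existing list.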
data/lib/feedzirra/rdf.rb
ADDED
@@ -0,0 +1,15 @@
module Feedzirra
  class RDF
    include SAXMachine
    include FeedUtilities
    element :title
    element :link, :as => :url
    elements :item, :as => :entries, :class => RDFEntry

    attr_accessor :feed_url

    def self.able_to_parse?(xml)
      xml =~ /(rdf\:RDF)|(#{Regexp.escape("http://purl.org/rss/1.0")})|(rss version\=\"0\.9.?\")/ || false
    end
  end
end
data/lib/feedzirra/rdf_entry.rb
ADDED
@@ -0,0 +1,12 @@
module Feedzirra
  class RDFEntry
    include SAXMachine
    include FeedEntryUtilities
    element :title
    element :link, :as => :url
    element :"dc:creator", :as => :author
    element :"content:encoded", :as => :content
    element :description, :as => :summary
    element :"dc:date", :as => :published
  end
end
data/lib/feedzirra/rss.rb
ADDED
@@ -0,0 +1,15 @@
module Feedzirra
  class RSS
    include SAXMachine
    include FeedUtilities
    element :title
    element :link, :as => :url
    elements :item, :as => :entries, :class => RSSEntry

    attr_accessor :feed_url

    def self.able_to_parse?(xml)
      xml =~ /\<rss|rdf/
    end
  end
end
data/lib/feedzirra/rss_entry.rb
ADDED
@@ -0,0 +1,15 @@
module Feedzirra
  class RSSEntry
    include SAXMachine
    include FeedEntryUtilities
    element :title
    element :link, :as => :url

    element :"dc:creator", :as => :author
    element :"content:encoded", :as => :content
    element :description, :as => :summary

    element :pubDate, :as => :published
    element :"dc:date", :as => :published
  end
end
data/lib/feedzirra.rb
ADDED
@@ -0,0 +1,29 @@
$LOAD_PATH.unshift(File.dirname(__FILE__)) unless $LOAD_PATH.include?(File.dirname(__FILE__))

gem 'activesupport'

require 'curb'
require 'sax-machine'
require 'active_support/basic_object'
require 'active_support/core_ext/object'
require 'active_support/core_ext/time'

require 'core_ext/date'

require 'feedzirra/feed_utilities'
require 'feedzirra/feed_entry_utilities'
require 'feedzirra/feed'

require 'feedzirra/rss_entry'
require 'feedzirra/rdf_entry'
require 'feedzirra/atom_entry'
require 'feedzirra/atom_feed_burner_entry'

require 'feedzirra/rss'
require 'feedzirra/rdf'
require 'feedzirra/atom'
require 'feedzirra/atom_feed_burner'

module Feedzirra
  VERSION = "0.0.1"
end
data/spec/feedzirra/atom_entry_spec.rb
ADDED
@@ -0,0 +1,33 @@
require File.dirname(__FILE__) + '/../spec_helper'

describe Feedzirra::AtomEntry do
  before(:each) do
    # I don't really like doing it this way because these unit tests should only rely on AtomEntry,
    # but this is actually how it should work. You would never just pass entry xml straight to the AtomEntry.
    @entry = Feedzirra::Atom.parse(sample_atom_feed).entries.first
  end

  it "should parse the title" do
    @entry.title.should == "AWS Job: Architect & Designer Position in Turkey"
  end

  it "should parse the url" do
    @entry.url.should == "http://aws.typepad.com/aws/2009/01/aws-job-architect-designer-position-in-turkey.html"
  end

  it "should parse the author" do
    @entry.author.should == "AWS Editor"
  end

  it "should parse the content" do
    @entry.content.should == sample_atom_entry_content
  end

  it "should provide a summary" do
    @entry.summary.should == "Late last year an entrepreneur from Turkey visited me at Amazon HQ in Seattle. We talked about his plans to use AWS as part of his new social video portal startup. I won't spill any beans before he's ready to..."
  end

  it "should parse the published date" do
    @entry.published.to_s.should == "Fri Jan 16 18:21:00 UTC 2009"
  end
end
data/spec/feedzirra/atom_feed_burner_entry_spec.rb
ADDED
@@ -0,0 +1,33 @@
require File.dirname(__FILE__) + '/../spec_helper'

describe Feedzirra::AtomFeedBurnerEntry do
  before(:each) do
    # I don't really like doing it this way because these unit tests should only rely on AtomFeedBurnerEntry,
    # but this is actually how it should work. You would never just pass entry xml straight to the AtomFeedBurnerEntry.
    @entry = Feedzirra::AtomFeedBurner.parse(sample_feedburner_atom_feed).entries.first
  end

  it "should parse the title" do
    @entry.title.should == "Making a Ruby C library even faster"
  end

  it "should parse the url" do
    @entry.url.should == "http://www.pauldix.net/2009/01/making-a-ruby-c-library-even-faster.html"
  end

  it "should parse the author" do
    @entry.author.should == "Paul Dix"
  end

  it "should parse the content" do
    @entry.content.should == sample_feedburner_atom_entry_content
  end

  it "should provide a summary" do
    @entry.summary.should == "Last week I released the first version of a SAX based XML parsing library called SAX-Machine. It uses Nokogiri, which uses libxml, so it's pretty fast. However, I felt that it could be even faster. The only question was how..."
  end

  it "should parse the published date" do
    @entry.published.to_s.should == "Thu Jan 22 15:50:22 UTC 2009"
  end
end