eric-feedzirra 0.0.8

data/README.textile ADDED
@@ -0,0 +1,186 @@
1
+ h1. Feedzirra
2
+
3
+ "http://github.com/pauldix/feedzirra/tree/master":http://github.com/pauldix/feedzirra/tree/master
4
+
5
+ I'd like feedback on the api and any bugs encountered on feeds in the wild. I've set up a "google group here":http://groups.google.com/group/feedzirra.
6
+
7
+ h2. Summary
8
+
9
+ A feed fetching and parsing library that treats the internet like Godzilla treats Japan: it dominates and eats all.
10
+
11
+ h2. Description
12
+
13
+ Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the "taf2-curb":http://github.com/taf2/curb/tree/master gem for faster http gets, and libxml through "nokogiri":http://github.com/tenderlove/nokogiri/tree/master and "sax-machine":http://github.com/pauldix/sax-machine/tree/master for faster parsing.
14
+
15
+ Once you have fetched feeds using Feedzirra, they can be updated using the feed objects. Feedzirra automatically inserts etag and last-modified information from the http response headers to lower bandwidth usage, eliminate unnecessary parsing, and make things speedier in general.
16
+
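To make the etag and last-modified handling concrete, here is a minimal sketch using only the Ruby standard library. It is not Feedzirra's own code path (the library goes through curb); @conditional_get@ is a hypothetical helper, and the header values are illustrative. It pulls the two validators out of a raw response header and sends them back as @If-None-Match@ and @If-Modified-Since@ on the next request.

```ruby
require 'net/http'
require 'time'
require 'uri'

# Sample raw response headers; the values are illustrative only.
raw_header = "HTTP/1.1 200 OK\r\n" \
             "ETag: GunxqnEP4NeYhrqq9TyVKTuDnh0\r\n" \
             "Last-Modified: Sat, 31 Jan 2009 22:58:16 GMT\r\n\r\n"

# Pull the two validators out of the header text.
etag = raw_header[/ETag:\s(.*)\r/, 1]
last_modified_text = raw_header[/Last-Modified:\s(.*)\r/, 1]
last_modified = Time.parse(last_modified_text) if last_modified_text

# Hypothetical helper (not a Feedzirra method): builds a conditional GET.
# No network access happens here; only the request object is constructed.
def conditional_get(feed_url, etag, last_modified)
  request = Net::HTTP::Get.new(URI.parse(feed_url).request_uri)
  request['If-None-Match'] = etag if etag
  request['If-Modified-Since'] = last_modified.httpdate if last_modified
  request
end

request = conditional_get("http://feeds.feedburner.com/PaulDixExplainsNothing",
                          etag, last_modified)
```

A server that honors these headers can answer with 304 and an empty body, which is where the bandwidth and parsing savings come from.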
17
+ Another feature present in Feedzirra is the ability to create callback functions that get called "on success" and "on failure" when getting a feed. This makes it easy to do things like log errors or update data stores.
18
+
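The callback idea can be sketched without any network access. In the sketch below, @report_fetch@ is a hypothetical stand-in for the library's internals, and the argument lists of the two lambdas are simplified assumptions; the point is only that handlers are optional entries in an options hash and are called when present.

```ruby
# Hypothetical stand-in for the library's internals: picks the handler to call.
def report_fetch(url, response_code, body, options = {})
  if response_code == 200
    options[:on_success].call(url, body) if options[:on_success]
  else
    options[:on_failure].call(url, response_code, body) if options[:on_failure]
  end
end

log = []
handlers = {
  :on_success => lambda { |url, body| log << "fetched #{url} (#{body.size} bytes)" },
  :on_failure => lambda { |url, code, body| log << "failed #{url} with #{code}" }
}

report_fetch("http://example.com/good.xml", 200, "<rss/>", handlers)
report_fetch("http://example.com/missing.xml", 404, "", handlers)
report_fetch("http://example.com/quiet.xml", 500, "") # no handlers given: nothing is called
```

Because each handler is looked up before it is called, passing only one of the two (or neither) is fine.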
19
+ The fetching and parsing logic have been decoupled so that either of them can be used in isolation if you'd prefer not to use everything that Feedzirra offers. However, the code examples below use helper methods in the Feed class that put everything together to make things as simple as possible.
20
+
21
+ The final feature of Feedzirra is the ability to define custom parsing classes. In truth, Feedzirra could be used to parse much more than feeds. Microformats, page scraping, and almost anything else are fair game.
22
+
23
+ h2. Installation
24
+
25
+ For now Feedzirra exists only on github. It also has a few gem requirements that are only on github. Before you start you need to have "libcurl":http://curl.haxx.se/ and "libxml":http://xmlsoft.org/ installed. If you're on Leopard you have both. Otherwise, you'll need to grab them. Once you've got those libraries, these are the gems that get used: nokogiri, pauldix-sax-machine, taf2-curb (note that this is a fork that lives on github and not the Ruby Forge version of curb), and pauldix-feedzirra. The feedzirra gemspec has all the dependencies so you should be able to get up and running with the standard github gem install routine:
26
+ <pre>
27
+ gem sources -a http://gems.github.com # if you haven't already
28
+ gem install pauldix-feedzirra
29
+ </pre>
30
+ <b>NOTE:</b> Some people have been reporting a few issues related to installation. First, the Ruby Forge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on Ruby Forge. You have to get the "taf2-curb":http://github.com/taf2/curb/tree/master fork installed.
31
+
32
+ If you see this error when doing a require:
33
+ <pre>
34
+ /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)
35
+ </pre>
36
+ It means that the taf2-curb gem didn't build correctly. To resolve this you can do a git clone git://github.com/taf2/curb.git then run rake gem in the curb directory, then sudo gem install pkg/curb-0.2.4.0.gem. After that you should be good.
37
+
38
+ If you see something like this when trying to run it:
39
+ <pre>
40
+ NoMethodError: undefined method `on_success' for #<Curl::Easy:0x1182724>
41
+ from ./lib/feedzirra/feed.rb:88:in `add_url_to_multi'
42
+ </pre>
43
+ This means that you are requiring curl-multi or the Ruby Forge version of Curb somewhere. You can't use those and need to get the taf2 version up and running.
44
+
45
+ If you're on Debian or Ubuntu and getting errors while trying to install the taf2-curb gem, it could be because you don't have the latest version of libcurl installed. Do this to fix:
46
+ <pre>
47
+ sudo apt-get install libcurl4-gnutls-dev
48
+ </pre>
49
+
50
+ Another problem could be if you are running Mac Ports and you have libcurl installed through there. You need to uninstall it for curb to work! The version in Mac Ports is old and doesn't play nice with curb. If you're running Leopard, you can just uninstall and you should be golden. If you're on an older version of OS X, you'll then need to "download curl":http://curl.haxx.se/download.html and build from source. Then you'll have to install the taf2-curb gem again. You might have to perform the step above.
51
+
52
+ If you're still having issues, please let me know on the mailing list. Also, "Todd Fisher (taf2)":http://github.com/taf2 is working on fixing the gem install. Please send him a full error report.
53
+
54
+ h2. Usage
55
+
56
+ "A gist of the following code":http://gist.github.com/57285
57
+ <pre>
58
+ require 'feedzirra'
59
+
60
+ # fetching a single feed
61
+ feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")
62
+
63
+ # feed and entries accessors
64
+ feed.title # => "Paul Dix Explains Nothing"
65
+ feed.url # => "http://www.pauldix.net"
66
+ feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
67
+ feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
68
+ feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
69
+
70
+ entry = feed.entries.first
71
+ entry.title # => "Ruby Http Client Library Performance"
72
+ entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
73
+ entry.author # => "Paul Dix"
74
+ entry.summary # => "..."
75
+ entry.content # => "..."
76
+ entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
77
+ entry.categories # => ["...", "..."]
78
+
79
+ # sanitizing an entry's content
80
+ entry.title.sanitize # => returns the title with harmful stuff escaped
81
+ entry.author.sanitize # => returns the author with harmful stuff escaped
82
+ entry.content.sanitize # => returns the content with harmful stuff escaped
83
+ entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
84
+ entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
85
+ feed.sanitize_entries! # => sanitizes all entries in place
86
+
87
+ # updating a single feed
88
+ updated_feed = Feedzirra::Feed.update(feed)
89
+
90
+ # an updated feed has the following extra accessors
91
+ updated_feed.updated? # returns true if any of the feed attributes have been modified; returns false if only new entries were added
92
+ updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update
93
+
94
+ # fetching multiple feeds
95
+ feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
96
+ feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)
97
+
98
+ # feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
99
+ # there will be a Fixnum of the http response code instead of a feed object
100
+
101
+ # updating multiple feeds. it expects a collection of feed objects
102
+ updated_feeds = Feedzirra::Feed.update(feeds.values)
103
+
104
+ # defining custom behavior on failure or success. note that a return status of 304 (not updated) will call the on_success handler
105
+ feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing",
106
+ :on_success => lambda {|url, feed| puts feed.title },
107
+ :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })
108
+ # if a collection was passed into fetch_and_parse, the handlers will be called for each one
109
+
110
+ # the behavior for the handlers when using Feedzirra::Feed.update is slightly different. The feed passed into on_success will be
111
+ # the updated feed with the standard updated accessors. on failure it will be the original feed object passed into update
112
+
113
+ # You can add custom parsing to the feed entry classes. Say you want the wfw:commentRss field in an entry
114
+ Feedzirra::Feed.add_common_feed_entry_element("wfw:commentRss", :as => :comment_rss)
115
+ # The arguments are the same as the SAXMachine arguments for the element method. For more example usage look at the RSSEntry and
116
+ # AtomEntry classes. Now you can access those in an atom feed:
117
+ Feedzirra::Feed.parse(some_atom_xml).entries.first.comment_rss # => wfw:commentRss is now parsed!
118
+
119
+ # You can also access http basic auth feeds. Unfortunately, you can't get to these inside of a bulk get of a bunch of feeds.
120
+ # You'll have to do it on its own like so:
121
+ Feedzirra::Feed.fetch_and_parse(some_url, :http_authentication => ["myusername", "mypassword"])
122
+
123
+ # Defining custom parsers
124
+ # TODO: the functionality is here, just write some good examples that show how to do this
125
+ </pre>
126
+
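Since a bulk fetch returns a hash that mixes feed objects with integer response codes, callers usually need to separate the two. A minimal sketch, using a hypothetical @FakeFeed@ struct and example.com URLs in place of real parsed feeds:

```ruby
# Hypothetical stand-in for a parsed feed object.
FakeFeed = Struct.new(:title)

# Shaped like the hash a bulk fetch returns: URL => feed object or response code.
results = {
  "http://example.com/a.xml" => FakeFeed.new("Feed A"),
  "http://example.com/b.xml" => 404,
  "http://example.com/c.xml" => FakeFeed.new("Feed C")
}

# Split the failures (integer response codes) from the parsed feeds.
failed, parsed = results.partition { |url, value| value.is_a?(Integer) }

titles      = parsed.map { |url, feed| feed.title }
failed_urls = failed.map { |url, code| "#{url} (#{code})" }
```

Checking for @Integer@ rather than @Fixnum@ keeps the sketch working on newer Rubies, where @Fixnum@ no longer exists.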
127
+ h2. Benchmarks
128
+
129
+ One of the goals of Feedzirra is speed. This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details the "benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb":http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb
130
+ <pre>
131
+ feedzirra 5.170000 1.290000 6.460000 ( 18.917796)
132
+ rfeedparser 104.260000 12.220000 116.480000 (244.799063)
133
+ feed-normalizer 66.250000 4.010000 70.260000 (191.589862)
134
+ </pre>
135
+ The result of that benchmark is a bit sketchy because of the network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a "benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb":http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb on an atom feed.
136
+ <pre>
137
+ feedzirra 0.500000 0.030000 0.530000 ( 0.658744)
138
+ rfeedparser 8.400000 1.110000 9.510000 ( 11.839827)
139
+ feed-normalizer 5.980000 0.160000 6.140000 ( 7.576140)
140
+ </pre>
141
+ There's also a "benchmark that shows the results of using Feedzirra to perform updates on feeds":http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb you've already pulled in. I tested against 179 feeds. The first is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.
142
+ <pre>
143
+ feedzirra fetch and parse 4.010000 0.710000 4.720000 ( 15.110101)
144
+ feedzirra update 0.660000 0.280000 0.940000 ( 5.152709)
145
+ </pre>
146
+
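The four columns in the tables above look like the default output of Ruby's standard @Benchmark@ module: user CPU time, system CPU time, their total, and elapsed real time in parentheses. A minimal sketch of producing one such row, with @fake_parse@ as a hypothetical workload standing in for fetching and parsing feeds:

```ruby
require 'benchmark'

# Hypothetical workload standing in for fetching and parsing a feed.
def fake_parse
  (1..1000).map { |i| i.to_s }.join(",").split(",").size
end

# Benchmark.measure returns a Benchmark::Tms object with user, system,
# total, and real (wall clock) timings for the block.
timing = Benchmark.measure { 10.times { fake_parse } }

# Tms#to_s uses the same column layout as the tables above.
puts "fake parse  #{timing}"
```

For network-bound runs, the real column is the one to watch; CPU time alone hides the time spent waiting on servers.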
147
+ h2. Next Steps
148
+
149
+ This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother using the test suite for feedparser. I wanted to start fresh.
150
+
151
+ Here are some more specific TODOs.
152
+ * Make a feedzirra-rails gem to integrate feedzirra seamlessly with Rails and ActiveRecord.
153
+ * Add support for authenticated feeds.
154
+ * Create a super sweet DSL for defining new parsers.
155
+ * Test against Ruby 1.9.1 and fix any bugs.
156
+ * I'm not keeping track of modified on entries. Should I add this?
157
+ * Clean up the fetching code inside feed.rb so it doesn't suck so hard.
158
+ * Make the feed_spec actually mock stuff out so it doesn't hit the net.
159
+ * Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?
160
+
161
+ h2. LICENSE
162
+
163
+ (The MIT License)
164
+
165
+ Copyright (c) 2009:
166
+
167
+ "Paul Dix":http://pauldix.net
168
+
169
+ Permission is hereby granted, free of charge, to any person obtaining
170
+ a copy of this software and associated documentation files (the
171
+ 'Software'), to deal in the Software without restriction, including
172
+ without limitation the rights to use, copy, modify, merge, publish,
173
+ distribute, sublicense, and/or sell copies of the Software, and to
174
+ permit persons to whom the Software is furnished to do so, subject to
175
+ the following conditions:
176
+
177
+ The above copyright notice and this permission notice shall be
178
+ included in all copies or substantial portions of the Software.
179
+
180
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
181
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
182
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
183
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
184
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
185
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
186
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,53 @@
1
+ require "spec"
2
+ require "spec/rake/spectask"
3
+ require 'rake/rdoctask'
4
+ require 'lib/feedzirra.rb'
5
+
6
+ # Grab recently touched specs
7
+ def recent_specs(touched_since)
8
+ recent_specs = FileList['app/**/*'].map do |path|
9
+
10
+ if File.mtime(path) > touched_since
11
+ spec = File.join('spec', File.dirname(path).split("/")[1..-1].join('/'),
12
+ "#{File.basename(path, ".*")}_spec.rb")
13
+ spec if File.exists?(spec)
14
+ end
15
+ end.compact
16
+
17
+ recent_specs += FileList['spec/**/*_spec.rb'].select do |path|
18
+ File.mtime(path) > touched_since
19
+ end
20
+ recent_specs.uniq
21
+ end
22
+
23
+ # Tasks
24
+ Spec::Rake::SpecTask.new do |t|
25
+ t.spec_opts = ['--options', "\"#{File.dirname(__FILE__)}/spec/spec.opts\""]
26
+ t.spec_files = FileList['spec/**/*_spec.rb']
27
+ end
28
+
29
+ desc 'Run recent specs'
30
+ Spec::Rake::SpecTask.new("spec:recent") do |t|
31
+ t.spec_opts = ["--format","specdoc","--color"]
32
+ t.spec_files = recent_specs(Time.now - 600) # 10 min.
33
+ end
34
+
35
+ Spec::Rake::SpecTask.new('spec:rcov') do |t|
36
+ t.spec_opts = ['--options', "\"#{File.dirname(__FILE__)}/spec/spec.opts\""]
37
+ t.spec_files = FileList['spec/**/*_spec.rb']
38
+ t.rcov = true
39
+ t.rcov_opts = ['--exclude', 'spec,/usr/lib/ruby,/usr/local,/var/lib,/Library', '--text-report']
40
+ end
41
+
42
+ Rake::RDocTask.new do |rd|
43
+ rd.title = 'Feedzirra'
44
+ rd.rdoc_dir = 'rdoc'
45
+ rd.rdoc_files.include('README.textile', 'lib/feedzirra.rb', 'lib/feedzirra/**/*.rb')
46
+ rd.options = ["--quiet", "--opname", "index.html", "--line-numbers", "--inline-source", '--main', 'README.textile']
47
+ end
48
+
49
+ task :install do
50
+ rm_rf Dir["*.gem"] # rm_rf does not expand glob patterns itself
51
+ puts `gem build feedzirra.gemspec`
52
+ puts `sudo gem install feedzirra-#{Feedzirra::VERSION}.gem`
53
+ end
@@ -0,0 +1,21 @@
1
+ # Date code pulled from:
2
+ # Ruby Cookbook by Lucas Carlson and Leonard Richardson
3
+ # Published by O'Reilly
4
+ # ISBN: 0-596-52369-6
5
+ class Date
6
+ def feed_utils_to_gm_time
7
+ feed_utils_to_time(new_offset, :gm)
8
+ end
9
+
10
+ def feed_utils_to_local_time
11
+ feed_utils_to_time(new_offset(DateTime.now.offset-offset), :local)
12
+ end
13
+
14
+ private
15
+ def feed_utils_to_time(dest, method)
16
+ #Convert a fraction of a day to a number of microseconds
17
+ usec = (dest.sec_fraction * 60 * 60 * 24 * (10**6)).to_i
18
+ Time.send(method, dest.year, dest.month, dest.day, dest.hour, dest.min,
19
+ dest.sec, usec)
20
+ end
21
+ end
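A note on the conversion above: the microsecond arithmetic assumes @sec_fraction@ is expressed as a fraction of a day, which matches Ruby 1.8. As far as I know, Ruby 1.9 and later return a fraction of a second instead, so the multiplication would overstate the value there. On those versions the standard library's @DateTime#to_time@ may be a simpler route. The sketch below is an alternative under that assumption, not the gem's code:

```ruby
require 'date'

# A DateTime with the default UTC offset (values are illustrative).
published = DateTime.new(2009, 1, 29, 17, 0, 19)

# DateTime#to_time comes from the standard library; calling .utc
# normalizes the result to UTC.
as_time = published.to_time.utc
```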
@@ -0,0 +1,9 @@
1
+ class String
2
+ def sanitize!
3
+ self.replace(sanitize)
4
+ end
5
+
6
+ def sanitize
7
+ Dryopteris.sanitize(self)
8
+ end
9
+ end
data/lib/feedzirra.rb ADDED
@@ -0,0 +1,34 @@
1
+ $LOAD_PATH.unshift(File.dirname(__FILE__)) unless $LOAD_PATH.include?(File.dirname(__FILE__))
2
+
3
+ gem 'activesupport'
4
+
5
+ require 'zlib'
6
+ require 'curb'
7
+ require 'sax-machine'
8
+ require 'dryopteris'
9
+ require 'uri'
10
+ require 'active_support/basic_object'
11
+ require 'active_support/core_ext/object'
12
+ require 'active_support/core_ext/time'
13
+
14
+ require 'core_ext/date'
15
+ require 'core_ext/string'
16
+
17
+ require 'feedzirra/feed_utilities'
18
+ require 'feedzirra/feed_entry_utilities'
19
+ require 'feedzirra/feed'
20
+
21
+ require 'feedzirra/rss_entry'
22
+ require 'feedzirra/itunes_rss_owner'
23
+ require 'feedzirra/itunes_rss_item'
24
+ require 'feedzirra/atom_entry'
25
+ require 'feedzirra/atom_feed_burner_entry'
26
+
27
+ require 'feedzirra/rss'
28
+ require 'feedzirra/itunes_rss'
29
+ require 'feedzirra/atom'
30
+ require 'feedzirra/atom_feed_burner'
31
+
32
+ module Feedzirra
33
+ VERSION = "0.0.8"
34
+ end
@@ -0,0 +1,22 @@
1
+ module Feedzirra
2
+ # == Summary
3
+ # Parser for dealing with Atom feeds.
4
+ #
5
+ # == Attributes
6
+ # * title
7
+ # * feed_url
8
+ # * url
9
+ # * entries
10
+ class Atom
11
+ include SAXMachine
12
+ include FeedUtilities
13
+ element :title
14
+ element :link, :as => :url, :value => :href, :with => {:type => "text/html"}
15
+ element :link, :as => :feed_url, :value => :href, :with => {:type => "application/atom+xml"}
16
+ elements :entry, :as => :entries, :class => AtomEntry
17
+
18
+ def self.able_to_parse?(xml) #:nodoc:
19
+ xml =~ /(Atom)|(#{Regexp.escape("http://purl.org/atom")})/
20
+ end
21
+ end
22
+ end
@@ -0,0 +1,29 @@
1
+ module Feedzirra
2
+ # == Summary
3
+ # Parser for dealing with Atom feed entries.
4
+ #
5
+ # == Attributes
6
+ # * title
7
+ # * url
8
+ # * author
9
+ # * content
10
+ # * summary
11
+ # * published
12
+ # * categories
13
+ class AtomEntry
14
+ include SAXMachine
15
+ include FeedEntryUtilities
16
+ element :title
17
+ element :link, :as => :url, :value => :href, :with => {:type => "text/html", :rel => "alternate"}
18
+ element :name, :as => :author
19
+ element :content
20
+ element :summary
21
+ element :published
22
+ element :id
23
+ element :created, :as => :published
24
+ element :issued, :as => :published
25
+ element :updated
26
+ element :modified, :as => :updated
27
+ elements :category, :as => :categories, :value => :term
28
+ end
29
+ end
@@ -0,0 +1,22 @@
1
+ module Feedzirra
2
+ # == Summary
3
+ # Parser for dealing with Feedburner Atom feeds.
4
+ #
5
+ # == Attributes
6
+ # * title
7
+ # * feed_url
8
+ # * url
9
+ # * entries
10
+ class AtomFeedBurner
11
+ include SAXMachine
12
+ include FeedUtilities
13
+ element :title
14
+ element :link, :as => :url, :value => :href, :with => {:type => "text/html"}
15
+ element :link, :as => :feed_url, :value => :href, :with => {:type => "application/atom+xml"}
16
+ elements :entry, :as => :entries, :class => AtomFeedBurnerEntry
17
+
18
+ def self.able_to_parse?(xml) #:nodoc:
19
+ (xml =~ /Atom/ && xml =~ /feedburner/) || false
20
+ end
21
+ end
22
+ end
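The two @able_to_parse?@ checks overlap: any FeedBurner Atom feed also satisfies the plain Atom pattern, so the order in which parsers are tried matters. The sketch below reuses the two patterns on hypothetical stub classes (@AtomStub@, @AtomFeedBurnerStub@) and a hypothetical @parser_for@ helper to show first-match detection; it only illustrates the ordering concern.

```ruby
# Stub classes carrying only the detection patterns shown above.
class AtomStub
  def self.able_to_parse?(xml)
    !!(xml =~ /(Atom)|(#{Regexp.escape("http://purl.org/atom")})/)
  end
end

class AtomFeedBurnerStub
  def self.able_to_parse?(xml)
    (xml =~ /Atom/ && xml =~ /feedburner/) ? true : false
  end
end

# The more specific parser goes first so it wins when both patterns match.
FEED_CLASSES = [AtomFeedBurnerStub, AtomStub]

# Hypothetical helper: the first class that claims the document is used.
def parser_for(xml)
  start_of_doc = xml.slice(0, 1000)
  FEED_CLASSES.detect { |klass| klass.able_to_parse?(start_of_doc) }
end

plain  = '<feed xmlns="http://www.w3.org/2005/Atom"><title>x</title></feed>'
burner = '<feed xmlns="http://www.w3.org/2005/Atom" ' \
         'xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"></feed>'

plain_parser  = parser_for(plain)
burner_parser = parser_for(burner)
```

Reversing @FEED_CLASSES@ would send FeedBurner feeds to the plain Atom stub, since its pattern matches them too.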
@@ -0,0 +1,30 @@
1
+ module Feedzirra
2
+ # == Summary
3
+ # Parser for dealing with Feedburner Atom feed entries.
4
+ #
5
+ # == Attributes
6
+ # * title
7
+ # * url
8
+ # * author
9
+ # * content
10
+ # * summary
11
+ # * published
12
+ # * categories
13
+ class AtomFeedBurnerEntry
14
+ include SAXMachine
15
+ include FeedEntryUtilities
16
+ element :title
17
+ element :name, :as => :author
18
+ element :link, :as => :url, :value => :href, :with => {:type => "text/html", :rel => "alternate"}
19
+ element :"feedburner:origLink", :as => :url
20
+ element :summary
21
+ element :content
22
+ element :published
23
+ element :id
24
+ element :issued, :as => :published
25
+ element :created, :as => :published
26
+ element :updated
27
+ element :modified, :as => :updated
28
+ elements :category, :as => :categories, :value => :term
29
+ end
30
+ end
@@ -0,0 +1,321 @@
1
+ module Feedzirra
2
+ class NoParserAvailable < StandardError; end
3
+
4
+ class Feed
5
+ USER_AGENT = "feedzirra http://github.com/pauldix/feedzirra/tree/master"
6
+
7
+ # Takes a raw XML feed and attempts to parse it. If no parser is available a Feedzirra::NoParserAvailable exception is raised.
8
+ #
9
+ # === Parameters
10
+ # [xml<String>] The XML that you would like parsed.
11
+ # === Returns
12
+ # An instance of the determined feed type. By default a Feedzirra::ITunesRSS, Feedzirra::RSS, Feedzirra::AtomFeedBurner, or Feedzirra::Atom object.
13
+ # === Raises
14
+ # Feedzirra::NoParserAvailable : If no valid parser classes could be found for the feed.
15
+ def self.parse(xml)
16
+ if parser = determine_feed_parser_for_xml(xml)
17
+ parser.parse(xml)
18
+ else
19
+ raise NoParserAvailable.new("No valid parser for XML.")
20
+ end
21
+ end
22
+
23
+ # Determines the correct parser class to use for parsing the feed.
24
+ #
25
+ # === Parameters
26
+ # [xml<String>] The XML that you would like to determine the parser for.
27
+ # === Returns
28
+ # The class name of the parser that can handle the XML.
29
+ def self.determine_feed_parser_for_xml(xml)
30
+ start_of_doc = xml.slice(0, 1000)
31
+ feed_classes.detect {|klass| klass.able_to_parse?(start_of_doc)}
32
+ end
33
+
34
+ # Adds a new feed parsing class that will be used for parsing.
35
+ #
36
+ # === Parameters
37
+ # [klass<Constant>] The class/constant that you want to register.
38
+ # === Returns
39
+ # An updated array of feed parser class names.
40
+ def self.add_feed_class(klass)
41
+ feed_classes.unshift klass
42
+ end
43
+
44
+ # Provides a list of registered feed parsing classes.
45
+ #
46
+ # === Returns
47
+ # An array of class names.
48
+ def self.feed_classes
49
+ @feed_classes ||= [ITunesRSS, RSS, AtomFeedBurner, Atom]
50
+ end
51
+
52
+ # Makes all entry types look for the passed in element to parse. This is actually just a call to
53
+ # element (a SAXMachine call) in the class
54
+ #
55
+ # === Parameters
56
+ # [element_tag<String>]
57
+ # [options<Hash>] Valid keys are same as with SAXMachine
58
+ def self.add_common_feed_entry_element(element_tag, options = {})
59
+ # need to think of a better way to do this. will break for people who want this behavior
60
+ # across their added classes
61
+ [RSSEntry, AtomFeedBurnerEntry, AtomEntry].each do |klass|
62
+ klass.send(:element, element_tag, options)
63
+ end
64
+ end
65
+
66
+ # Makes all entry types look for the passed in elements to parse. This is actually just a call to
67
+ # elements (a SAXMachine call) in the class
68
+ #
69
+ # === Parameters
70
+ # [element_tag<String>]
71
+ # [options<Hash>] Valid keys are same as with SAXMachine
72
+ def self.add_common_feed_entry_elements(element_tag, options = {})
73
+ # need to think of a better way to do this. will break for people who want this behavior
74
+ # across their added classes
75
+ [RSSEntry, AtomFeedBurnerEntry, AtomEntry].each do |klass|
76
+ klass.send(:elements, element_tag, options)
77
+ end
78
+ end
79
+
80
+ # Fetches and returns the raw XML for each URL provided.
81
+ #
82
+ # === Parameters
83
+ # [urls<String> or <Array>] A single feed URL, or an array of feed URLs.
84
+ # [options<Hash>] Valid keys for this argument are as follows:
85
+ # :user_agent - String that overrides the default user agent.
86
+ # :if_modified_since - Time object representing when the feed was last updated.
87
+ # :if_none_match - String that's normally an etag for the request that was stored previously.
88
+ # :on_success - Block that gets executed after a successful request.
89
+ # :on_failure - Block that gets executed after a failed request.
90
+ # === Returns
91
+ # A String of XML if a single URL is passed.
92
+ #
93
+ # A Hash if multiple URLs are passed. The key will be the URL, and the value the XML.
94
+ def self.fetch_raw(urls, options = {})
95
+ url_queue = [*urls]
96
+ multi = Curl::Multi.new
97
+ responses = {}
98
+ url_queue.each do |url|
99
+ easy = Curl::Easy.new(url) do |curl|
100
+ curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
101
+ curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options[:if_modified_since]
102
+ curl.headers["If-None-Match"] = options[:if_none_match] if options[:if_none_match]
103
+ curl.headers["Accept-encoding"] = 'gzip, deflate'
104
+ curl.follow_location = true
105
+ curl.userpwd = options[:http_authentication].join(':') if options[:http_authentication]
106
+
107
+ curl.on_success do |c|
108
+ responses[url] = decode_content(c)
109
+ end
110
+ curl.on_failure do |c|
111
+ responses[url] = c.response_code
112
+ end
113
+ end
114
+ multi.add(easy)
115
+ end
116
+
117
+ multi.perform
118
+ return urls.is_a?(String) ? responses.values.first : responses
119
+ end
120
+
121
+ # Fetches and returns the parsed XML for each URL provided.
122
+ #
123
+ # === Parameters
124
+ # [urls<String> or <Array>] A single feed URL, or an array of feed URLs.
125
+ # [options<Hash>] Valid keys for this argument are as follows:
126
+ # * :user_agent - String that overrides the default user agent.
127
+ # * :if_modified_since - Time object representing when the feed was last updated.
128
+ # * :if_none_match - String, an etag for the request that was stored previously.
129
+ # * :on_success - Block that gets executed after a successful request.
130
+ # * :on_failure - Block that gets executed after a failed request.
131
+ # === Returns
132
+ # A Feed object if a single URL is passed.
133
+ #
134
+ # A Hash if multiple URLs are passed. The key will be the URL, and the value the Feed object.
135
+ def self.fetch_and_parse(urls, options = {})
136
+ url_queue = [*urls]
137
+ multi = Curl::Multi.new
138
+ responses = {}
139
+
140
+ # I broke these down so I would only try to do 30 simultaneously because
141
+ # I was getting weird errors when doing a lot. As one finishes it pops another off the queue.
142
+ url_queue.slice!(0, 30).each do |url|
143
+ add_url_to_multi(multi, url, url_queue, responses, options)
144
+ end
145
+
146
+ multi.perform
147
+ return urls.is_a?(String) ? responses.values.first : responses
148
+ end
149
+
150
+ # Decodes the XML document if it was compressed.
151
+ #
152
+ # === Parameters
153
+ # [curl_request<Curl::Easy>] The Curl::Easy response object from the request.
154
+ # === Returns
155
+ # A decoded string of XML.
156
+ def self.decode_content(c)
157
+ if c.header_str.match(/Content-Encoding: gzip/)
158
+ begin
159
+ gz = Zlib::GzipReader.new(StringIO.new(c.body_str))
160
+ xml = gz.read
161
+ gz.close
162
+ rescue Zlib::GzipFile::Error
163
+ # Maybe this is not gzipped?
164
+ xml = c.body_str
165
+ end
166
+ elsif c.header_str.match(/Content-Encoding: deflate/)
167
+ xml = Zlib::Inflate.inflate(c.body_str)
168
+ else
169
+ xml = c.body_str
170
+ end
171
+
172
+ xml
173
+ end
174
+
175
+ # Updates each feed for each Feed object provided.
176
+ #
177
+ # === Parameters
178
+ # [feeds<Feed> or <Array>] A single feed object, or an array of feed objects.
179
+ # [options<Hash>] Valid keys for this argument are as follows:
180
+ # * :user_agent - String that overrides the default user agent.
181
+ # * :on_success - Block that gets executed after a successful request.
182
+ # * :on_failure - Block that gets executed after a failed request.
183
+ # === Returns
184
+ # An updated Feed object if a single Feed is passed.
185
+ #
186
+ # An Array of the updated Feed objects if multiple Feeds are passed.
187
+ def self.update(feeds, options = {})
188
+ feed_queue = [*feeds]
189
+ multi = Curl::Multi.new
190
+ responses = {}
191
+
192
+ feed_queue.slice!(0, 30).each do |feed|
193
+ add_feed_to_multi(multi, feed, feed_queue, responses, options)
194
+ end
195
+
196
+ multi.perform
197
+ return responses.size == 1 ? responses.values.first : responses.values
198
+ end
199
+
200
+ # An abstraction for adding a feed by URL to the passed Curl::Multi stack.
201
+ #
202
+ # === Parameters
203
+ # [multi<Curl::Multi>] The Curl::Multi object that the request should be added to.
204
+ # [url<String>] The URL of the feed that you would like to be fetched.
205
+ # [url_queue<Array>] An array of URLs that are queued for request.
206
+ # [responses<Hash>] Existing responses that you want the response from the request added to.
207
+ # [feeds<String> or <Array>] A single feed object, or an array of feed objects.
208
+ # [options<Hash>] Valid keys for this argument are as follows:
209
+ # * :user_agent - String that overrides the default user agent.
210
+ # * :on_success - Block that gets executed after a successful request.
211
+ # * :on_failure - Block that gets executed after a failed request.
212
+ # === Returns
213
+ # The updated Curl::Multi object with the request details added to its stack.
214
+ def self.add_url_to_multi(multi, url, url_queue, responses, options)
215
+ easy = Curl::Easy.new(url) do |curl|
216
+ curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
217
+ curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options[:if_modified_since]
218
+ curl.headers["If-None-Match"] = options[:if_none_match] if options[:if_none_match]
219
+ curl.headers["Accept-encoding"] = 'gzip, deflate'
220
+ curl.follow_location = true
221
+ curl.userpwd = options[:http_authentication].join(':') if options[:http_authentication]
222
+
223
+ curl.on_success do |c|
224
+ add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
225
+ xml = decode_content(c)
226
+ klass = determine_feed_parser_for_xml(xml)
227
+
228
+ if klass
229
+ feed = klass.parse(xml)
230
+ feed.feed_url = c.last_effective_url
231
+ feed.etag = etag_from_header(c.header_str)
232
+ feed.last_modified = last_modified_from_header(c.header_str)
233
+ responses[url] = feed
234
+ options[:on_success].call(url, feed) if options[:on_success]
235
+ else
236
+ # puts "Error determining parser for #{url} - #{c.last_effective_url}"
237
+ # raise NoParserAvailable.new("no valid parser for content.") (this would unfortunately fail the whole 'multi', so it's not really usable)
238
+ options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options[:on_failure]
239
+ end
240
+ end
241
+
242
+ curl.on_failure do |c|
243
+ add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
244
+ responses[url] = c.response_code
245
+ options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options[:on_failure]
246
+ end
247
+ end
248
+ multi.add(easy)
249
+ end
250
+
251
+ # An abstraction for adding a feed by a Feed object to the passed Curl::Multi stack.
252
+ #
253
+ # === Parameters
254
+ # [multi<Curl::Multi>] The Curl::Multi object that the request should be added to.
255
+ # [feed<Feed>] A feed object that you would like to be fetched.
256
+ # [url_queue<Array>] An array of feed objects that are queued for request.
257
+ # [responses<Hash>] Existing responses that you want the response from the request added to.
258
+ # [feeds<String>] or <Array> A single feed object, or an array of feed objects.
259
+ # [options<Hash>] Valid keys for this argument are as follows:
260
+ # * :user_agent - String that overrides the default user agent.
261
+ # * :on_success - Block that gets executed after a successful request.
262
+ # * :on_failure - Block that gets executed after a failed request.
263
+ # === Returns
264
+ # The updated Curl::Multi object with the request details added to its stack.
265
+ def self.add_feed_to_multi(multi, feed, feed_queue, responses, options)
266
+ easy = Curl::Easy.new(feed.feed_url) do |curl|
267
+ curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
268
+ curl.headers["If-Modified-Since"] = feed.last_modified.httpdate if feed.last_modified
269
+ curl.headers["If-None-Match"] = feed.etag if feed.etag
270
+ curl.userpwd = options[:http_authentication].join(':') if options[:http_authentication]
271
+ curl.follow_location = true
272
+
273
+ curl.on_success do |c|
274
+ add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
275
+ updated_feed = Feed.parse(c.body_str)
276
+ updated_feed.feed_url = c.last_effective_url
277
+ updated_feed.etag = etag_from_header(c.header_str)
278
+ updated_feed.last_modified = last_modified_from_header(c.header_str)
279
+ feed.update_from_feed(updated_feed)
280
+ responses[feed.feed_url] = feed
281
+ options[:on_success].call(feed) if options[:on_success]
282
+ end
283
+
284
+ curl.on_failure do |c|
285
+ add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
286
+ response_code = c.response_code
287
+ if response_code == 304 # it's not modified. this isn't an error condition
288
+ responses[feed.feed_url] = feed
289
+ options[:on_success].call(feed) if options[:on_success]
290
+ else
291
+ responses[feed.feed_url] = c.response_code
292
+ options[:on_failure].call(feed, c.response_code, c.header_str, c.body_str) if options[:on_failure]
293
+ end
294
+ end
295
+ end
296
+ multi.add(easy)
297
+ end
298
+
299
+ # Determines the etag from the response headers.
300
+ #
301
+ # === Parameters
302
+ # [header<String>] Raw response header returned from the request
303
+ # === Returns
304
+ # A string of the etag or nil if it cannot be found in the headers.
305
+ def self.etag_from_header(header)
306
+ header =~ /.*ETag:\s(.*)\r/
307
+ $1
308
+ end
309
+
310
+ # Determines the last modified date from the response headers.
311
+ #
312
+ # === Parameters
313
+ # [header<String>] Raw response header returned from the request
314
+ # === Returns
315
+ # A Time object of the last modified date or nil if it cannot be found in the headers.
316
+ def self.last_modified_from_header(header)
317
+ header =~ /.*Last-Modified:\s(.*)\r/
318
+ Time.parse($1) if $1
319
+ end
320
+ end
321
+ end
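The content-decoding step in @decode_content@ can be exercised on its own, without curb. The sketch below is a standalone approximation: @decode_body@ is a hypothetical helper that takes the raw header and body strings directly instead of a @Curl::Easy@ object, and @Zlib.gzip@ (available on Ruby 2.4 and later) is used only to build test input.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical standalone variant of the decoding logic: takes plain strings.
def decode_body(header_str, body_str)
  if header_str =~ /Content-Encoding: gzip/
    begin
      gz = Zlib::GzipReader.new(StringIO.new(body_str))
      xml = gz.read
      gz.close
      xml
    rescue Zlib::GzipFile::Error
      # The header claimed gzip but the body is not gzipped; pass it through.
      body_str
    end
  elsif header_str =~ /Content-Encoding: deflate/
    Zlib::Inflate.inflate(body_str)
  else
    body_str
  end
end

xml = "<rss><channel><title>Example</title></channel></rss>"

from_gzip    = decode_body("Content-Encoding: gzip\r\n", Zlib.gzip(xml))
from_deflate = decode_body("Content-Encoding: deflate\r\n", Zlib::Deflate.deflate(xml))
from_plain   = decode_body("Content-Type: text/xml\r\n", xml)
mislabelled  = decode_body("Content-Encoding: gzip\r\n", xml)
```

The rescue branch covers servers that send a gzip header with an uncompressed body, which is the case the original comment ("Maybe this is not gzipped?") anticipates.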