jashmenn-feedzirra 0.1.3

Files changed (49)
  1. data/.rspec +1 -0
  2. data/README.rdoc +177 -0
  3. data/Rakefile +51 -0
  4. data/lib/feedzirra.rb +20 -0
  5. data/lib/feedzirra/core_ext.rb +3 -0
  6. data/lib/feedzirra/core_ext/date.rb +19 -0
  7. data/lib/feedzirra/core_ext/string.rb +9 -0
  8. data/lib/feedzirra/feed.rb +384 -0
  9. data/lib/feedzirra/feed_entry_utilities.rb +65 -0
  10. data/lib/feedzirra/feed_utilities.rb +61 -0
  11. data/lib/feedzirra/parser.rb +20 -0
  12. data/lib/feedzirra/parser/atom.rb +29 -0
  13. data/lib/feedzirra/parser/atom_entry.rb +30 -0
  14. data/lib/feedzirra/parser/atom_feed_burner.rb +21 -0
  15. data/lib/feedzirra/parser/atom_feed_burner_entry.rb +31 -0
  16. data/lib/feedzirra/parser/google_docs_atom.rb +28 -0
  17. data/lib/feedzirra/parser/google_docs_atom_entry.rb +29 -0
  18. data/lib/feedzirra/parser/itunes_rss.rb +50 -0
  19. data/lib/feedzirra/parser/itunes_rss_item.rb +32 -0
  20. data/lib/feedzirra/parser/itunes_rss_owner.rb +12 -0
  21. data/lib/feedzirra/parser/rss.rb +22 -0
  22. data/lib/feedzirra/parser/rss_entry.rb +34 -0
  23. data/lib/feedzirra/parser/rss_feed_burner.rb +22 -0
  24. data/lib/feedzirra/parser/rss_feed_burner_entry.rb +40 -0
  25. data/lib/feedzirra/version.rb +3 -0
  26. data/spec/benchmarks/feed_benchmarks.rb +98 -0
  27. data/spec/benchmarks/feedzirra_benchmarks.rb +40 -0
  28. data/spec/benchmarks/fetching_benchmarks.rb +28 -0
  29. data/spec/benchmarks/parsing_benchmark.rb +30 -0
  30. data/spec/benchmarks/updating_benchmarks.rb +33 -0
  31. data/spec/feedzirra/feed_entry_utilities_spec.rb +52 -0
  32. data/spec/feedzirra/feed_spec.rb +597 -0
  33. data/spec/feedzirra/feed_utilities_spec.rb +152 -0
  34. data/spec/feedzirra/parser/atom_entry_spec.rb +86 -0
  35. data/spec/feedzirra/parser/atom_feed_burner_entry_spec.rb +47 -0
  36. data/spec/feedzirra/parser/atom_feed_burner_spec.rb +47 -0
  37. data/spec/feedzirra/parser/atom_spec.rb +51 -0
  38. data/spec/feedzirra/parser/google_docs_atom_entry_spec.rb +22 -0
  39. data/spec/feedzirra/parser/google_docs_atom_spec.rb +31 -0
  40. data/spec/feedzirra/parser/itunes_rss_item_spec.rb +48 -0
  41. data/spec/feedzirra/parser/itunes_rss_owner_spec.rb +18 -0
  42. data/spec/feedzirra/parser/itunes_rss_spec.rb +54 -0
  43. data/spec/feedzirra/parser/rss_entry_spec.rb +85 -0
  44. data/spec/feedzirra/parser/rss_feed_burner_entry_spec.rb +85 -0
  45. data/spec/feedzirra/parser/rss_feed_burner_spec.rb +52 -0
  46. data/spec/feedzirra/parser/rss_spec.rb +49 -0
  47. data/spec/sample_feeds/run_against_sample.rb +20 -0
  48. data/spec/spec_helper.rb +78 -0
  49. metadata +228 -0
data/.rspec ADDED
@@ -0,0 +1 @@
+ --color
data/README.rdoc ADDED
@@ -0,0 +1,177 @@
+ == Feedzirra
+
+ I'd like feedback on the API and any bugs encountered on feeds in the wild. I've set up a {google group here}[http://groups.google.com/group/feedzirra].
+
+ === Description
+
+ Feedzirra is a feed library that is designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb[link:http://github.com/taf2/curb/tree/master] gem for faster HTTP gets, and libxml through nokogiri[link:http://github.com/tenderlove/nokogiri/tree/master] and sax-machine[link:http://github.com/pauldix/sax-machine/tree/master] for faster parsing.
+
+ Once you have fetched feeds using Feedzirra, they can be updated using the feed objects. Feedzirra automatically inserts etag and last-modified information from the http response headers to lower bandwidth usage, eliminate unnecessary parsing, and make things speedier in general.
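As a minimal sketch of that conditional-GET bookkeeping (plain Ruby, not Feedzirra's API — `conditional_get_headers` is a hypothetical helper, and the etag value is just illustrative):

```ruby
require 'time'

# Build the conditional-GET headers a client would send when it already
# holds an etag and a last-modified Time from a previous fetch.
# (hypothetical helper; a stand-in for what the library does internally)
def conditional_get_headers(stored_etag, stored_last_modified)
  headers = {}
  headers["If-None-Match"] = stored_etag if stored_etag
  headers["If-Modified-Since"] = stored_last_modified.httpdate if stored_last_modified
  headers
end

conditional_get_headers("GunxqnEP4NeYhrqq9TyVKTuDnh0", Time.utc(2009, 1, 31, 22, 58, 16))
# => {"If-None-Match"=>"GunxqnEP4NeYhrqq9TyVKTuDnh0",
#     "If-Modified-Since"=>"Sat, 31 Jan 2009 22:58:16 GMT"}
```

A server that still has the same representation of the feed can then answer 304 with no body at all, which is where the bandwidth and parsing savings come from.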
+
+ Another feature present in Feedzirra is the ability to create callback functions that get called "on success" and "on failure" when getting a feed. This makes it easy to do things like log errors or update data stores.
+
+ The fetching and parsing logic have been decoupled so that either can be used in isolation if you'd prefer not to use everything that Feedzirra offers. However, the code examples below use helper methods in the Feed class that put everything together to make things as simple as possible.
+
+ The final feature of Feedzirra is the ability to define custom parsing classes. In truth, Feedzirra could be used to parse much more than feeds. Microformats, page scraping, and almost anything else are fair game.
+
+ === Speedup date parsing
+
+ In MRI the date parsing code is written in Ruby and is optimized for readability over speed. To speed up this part you can install the {home_run}[https://github.com/jeremyevans/home_run] gem, which replaces it with an optimized C version.
+
+ === Usage
+
+ {A gist of the following code}[http://gist.github.com/57285]
+
+ require 'feedzirra'
+
+ # fetching a single feed
+ feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")
+
+ # feed and entries accessors
+ feed.title # => "Paul Dix Explains Nothing"
+ feed.url # => "http://www.pauldix.net"
+ feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
+ feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
+ feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object
+
+ entry = feed.entries.first
+ entry.title # => "Ruby Http Client Library Performance"
+ entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
+ entry.author # => "Paul Dix"
+ entry.summary # => "..."
+ entry.content # => "..."
+ entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
+ entry.categories # => ["...", "..."]
+
+ # sanitizing an entry's content
+ entry.title.sanitize # => returns the title with harmful stuff escaped
+ entry.author.sanitize # => returns the author with harmful stuff escaped
+ entry.content.sanitize # => returns the content with harmful stuff escaped
+ entry.content.sanitize! # => returns content with harmful stuff escaped and replaces the original (also exists for author and title)
+ entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the values to clean versions)
+ feed.sanitize_entries! # => sanitizes all entries in place
+
+ # updating a single feed
+ updated_feed = Feedzirra::Feed.update(feed)
+
+ # an updated feed has the following extra accessors
+ updated_feed.updated? # returns true if any of the feed attributes have been modified. returns false if there are only new entries
+ updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before the update
+
+ # fetching multiple feeds
+ feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
+ feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)
+
+ # feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error occurred,
+ # the value will be a Fixnum of the http response code instead of a feed object
+
+ # updating multiple feeds. it expects a collection of feed objects
+ updated_feeds = Feedzirra::Feed.update(feeds.values)
+
+ # defining custom behavior on failure or success. note that a return status of 304 (not modified) will call the on_success handler
+ feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing",
+   :on_success => lambda {|url, feed| puts feed.title },
+   :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })
+ # if a collection was passed into fetch_and_parse, the handlers will be called for each one
+
+ # the behavior of the handlers when using Feedzirra::Feed.update is slightly different. The feed passed into on_success will be
+ # the updated feed with the standard updated accessors. on failure it will be the original feed object passed into update
+
+ # fetching a feed via a proxy (optional)
+ feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing", {:proxy_url => '10.0.0.1', :proxy_port => 3084})
+
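Since a multi-URL fetch yields feed objects for successes and bare response codes for failures, separating the two can be sketched as follows (hypothetical URLs; `:parsed_feed` is a placeholder standing in for a real feed object):

```ruby
# A mixed response hash shaped like the one a multi-URL fetch returns:
# feed object on success, integer HTTP status code on failure.
responses = {
  "http://good.example.com/feed" => :parsed_feed, # placeholder for a feed object
  "http://bad.example.com/feed"  => 404
}

# Split failures (integer codes) from successfully parsed feeds.
failures, successes = responses.partition { |_url, value| value.is_a?(Integer) }
failures.each { |url, code| warn "#{url} failed with HTTP #{code}" }
```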
+ === Extending
+
+ ==== Adding a feed parsing class
+
+ # Adds a new feed parsing class; this class will be tried first
+ Feedzirra::Feed.add_feed_class MyFeedClass
+
+ ==== Adding attributes to all feed types / all entry types
+
+ # Add the generator attribute to all feed types
+ Feedzirra::Feed.add_common_feed_element('generator')
+ Feedzirra::Feed.fetch_and_parse("http://www.pauldix.net/atom.xml").generator # => 'TypePad'
+
+ # Add some GeoRss information
+ Feedzirra::Feed.add_common_feed_entry_element('geo:lat', :as => :lat)
+ Feedzirra::Feed.fetch_and_parse("http://www.earthpublisher.com/georss.php").entries.each do |e|
+   p "lat: #{e.lat}"
+ end
+
+ ==== Adding attributes to only one class
+
+ If you want to add attributes to only one class, you simply declare them in that class
+
+ # Add some GeoRss information
+ require 'lib/feedzirra/parser/rss_entry'
+
+ class Feedzirra::Parser::RSSEntry
+   element 'geo:lat', :as => :lat
+   element 'geo:long', :as => :long
+ end
+
+ # Fetch a feed containing GeoRss info and print it
+ Feedzirra::Feed.fetch_and_parse("http://www.earthpublisher.com/georss.php").entries.each do |e|
+   p "lat: #{e.lat}, long: #{e.long}"
+ end
+
+ === Benchmarks
+
+ One of the goals of Feedzirra is speed. This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details, the {benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb]
+
+ feedzirra 5.170000 1.290000 6.460000 ( 18.917796)
+ rfeedparser 104.260000 12.220000 116.480000 (244.799063)
+ feed-normalizer 66.250000 4.010000 70.260000 (191.589862)
+
+ The result of that benchmark is a bit sketchy because of network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a {benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb] on an atom feed.
+
+ feedzirra 0.500000 0.030000 0.530000 ( 0.658744)
+ rfeedparser 8.400000 1.110000 9.510000 ( 11.839827)
+ feed-normalizer 5.980000 0.160000 6.140000 ( 7.576140)
+
+ There's also a {benchmark that shows the results of using Feedzirra to perform updates on feeds}[http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb] you've already pulled in. I tested against 179 feeds. The first is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.
+
+ feedzirra fetch and parse 4.010000 0.710000 4.720000 ( 15.110101)
+ feedzirra update 0.660000 0.280000 0.940000 ( 5.152709)
+
+ === TODO
+
+ This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother using the test suite for feedparser. I wanted to start fresh.
+
+ Here are some more specific TODOs.
+ * Make a feedzirra-rails gem to integrate feedzirra seamlessly with Rails and ActiveRecord.
+ * Add support for authenticated feeds.
+ * Create a super sweet DSL for defining new parsers.
+ * Test against Ruby 1.9.1 and fix any bugs.
+ * I'm not keeping track of modified on entries. Should I add this?
+ * Clean up the fetching code inside feed.rb so it doesn't suck so hard.
+ * Make the feed_spec actually mock stuff out so it doesn't hit the net.
+ * Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?
+
+ === LICENSE
+
+ (The MIT License)
+
+ Copyright (c) 2009:
+
+ {Paul Dix}[http://pauldix.net]
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ 'Software'), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,51 @@
+ require 'bundler'
+ Bundler.setup
+
+ require 'rake'
+ require 'rdoc/task'
+ require 'rspec'
+ require 'rspec/core/rake_task'
+
+ $LOAD_PATH.unshift File.expand_path('../lib', __FILE__)
+ require 'feedzirra/version'
+
+ def recent_specs(touched_since)
+   recent_specs = FileList['app/**/*'].map do |path|
+
+     if File.mtime(path) > touched_since
+       spec = File.join('spec', File.dirname(path).split("/")[1..-1].join('/'),
+         "#{File.basename(path, ".*")}_spec.rb")
+       spec if File.exists?(spec)
+     end
+   end.compact
+
+   recent_specs += FileList['spec/**/*_spec.rb'].select do |path|
+     File.mtime(path) > touched_since
+   end
+   recent_specs.uniq
+ end
+
+ RSpec::Core::RakeTask.new do |t|
+   t.pattern = FileList['spec/**/*_spec.rb']
+ end
+
+ desc 'Run recent specs'
+ RSpec::Core::RakeTask.new("spec:recent") do |t|
+   t.pattern = recent_specs(Time.now - 600) # 10 min.
+ end
+
+ RSpec::Core::RakeTask.new('spec:rcov') do |t|
+   t.pattern = FileList['spec/**/*_spec.rb']
+   t.rcov = true
+   t.rcov_opts = ['--exclude', 'spec,/usr/lib/ruby,/usr/local,/var/lib,/Library', '--text-report']
+ end
+
+ RDoc::Task.new do |rd|
+   rd.title = 'Feedzirra'
+   rd.rdoc_dir = 'rdoc'
+   rd.rdoc_files.include('README.rdoc', 'lib/feedzirra.rb', 'lib/feedzirra/**/*.rb')
+   rd.options = ["--quiet", "--opname", "index.html", "--line-numbers", "--inline-source", '--main', 'README.rdoc']
+ end
+
+ desc "Run all the tests"
+ task :default => :spec
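The path-to-spec mapping inside `recent_specs` above can be isolated as a pure function (a sketch only; `spec_for` is not a method the Rakefile actually defines):

```ruby
# Map an application file to its conventional spec path, the way the
# spec:recent task does: app/models/feed.rb -> spec/models/feed_spec.rb
# (drop the leading "app/" directory, append "_spec.rb" to the basename).
def spec_for(path)
  File.join('spec',
            File.dirname(path).split('/')[1..-1].join('/'),
            "#{File.basename(path, '.*')}_spec.rb")
end

spec_for('app/models/feed.rb') # => "spec/models/feed_spec.rb"
```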
data/lib/feedzirra.rb ADDED
@@ -0,0 +1,20 @@
+ require 'zlib'
+ require 'curb'
+ require 'sax-machine'
+ require 'loofah'
+ require 'uri'
+
+ require 'active_support/deprecation'
+ require 'active_support/basic_object'
+ require 'active_support/core_ext/module'
+ require 'active_support/core_ext/object'
+ require 'active_support/time'
+
+ require 'feedzirra/core_ext'
+
+ module Feedzirra
+   autoload :FeedEntryUtilities, 'feedzirra/feed_entry_utilities'
+   autoload :FeedUtilities, 'feedzirra/feed_utilities'
+   autoload :Feed, 'feedzirra/feed'
+   autoload :Parser, 'feedzirra/parser'
+ end
data/lib/feedzirra/core_ext.rb ADDED
@@ -0,0 +1,3 @@
+ Dir["#{File.dirname(__FILE__)}/core_ext/*.rb"].sort.each do |path|
+   require "feedzirra/core_ext/#{File.basename(path, '.rb')}"
+ end
data/lib/feedzirra/core_ext/date.rb ADDED
@@ -0,0 +1,19 @@
+ # Date code pulled and adapted from:
+ # Ruby Cookbook by Lucas Carlson and Leonard Richardson
+ # Published by O'Reilly
+ # ISBN: 0-596-52369-6
+ class Date
+   def feed_utils_to_gm_time
+     feed_utils_to_time(new_offset, :gm)
+   end
+
+   def feed_utils_to_local_time
+     feed_utils_to_time(new_offset(DateTime.now.offset - offset), :local)
+   end
+
+   private
+   def feed_utils_to_time(dest, method)
+     Time.send(method, dest.year, dest.month, dest.day, dest.hour, dest.min,
+               dest.sec, dest.zone)
+   end
+ end
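What `feed_utils_to_gm_time` accomplishes — normalizing a DateTime that carries some offset to a UTC Time — can be demonstrated with just the stdlib (a sketch; the values here are illustrative, not from a real feed):

```ruby
require 'date'

# A published date parsed with a -05:00 offset...
dt = DateTime.new(2009, 1, 29, 12, 0, 19, '-05:00')

# ...shifted to offset zero and rebuilt as a UTC Time, which is what the
# feed_utils_to_gm_time / feed_utils_to_time pair above does.
utc = dt.new_offset(0)
t = Time.gm(utc.year, utc.month, utc.day, utc.hour, utc.min, utc.sec)
# 12:00:19 at -05:00 is 17:00:19 UTC
```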
data/lib/feedzirra/core_ext/string.rb ADDED
@@ -0,0 +1,9 @@
+ class String
+   def sanitize!
+     self.replace(sanitize)
+   end
+
+   def sanitize
+     Loofah.scrub_fragment(self, :prune).to_s
+   end
+ end
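The gem delegates the actual scrubbing to Loofah, which prunes unsafe markup. As a gem-free stand-in, the same sanitize/sanitize! pattern can be sketched with plain HTML escaping — note this escapes everything rather than pruning, so it only illustrates the copy-vs-in-place pairing, not Loofah's behavior:

```ruby
require 'cgi'

# Minimal stand-in: escape instead of Loofah's :prune scrubbing.
class String
  def sanitize
    CGI.escapeHTML(self)  # returns an escaped copy, receiver untouched
  end

  def sanitize!
    replace(sanitize)     # mutates the receiver in place
  end
end

s = +"<script>alert(1)</script>"
s.sanitize  # => "&lt;script&gt;alert(1)&lt;/script&gt;"
```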
data/lib/feedzirra/feed.rb ADDED
@@ -0,0 +1,384 @@
+ module Feedzirra
+   class NoParserAvailable < StandardError; end
+
+   class Feed
+     USER_AGENT = "feedzirra http://github.com/pauldix/feedzirra/tree/master"
+
+     # Takes a raw XML feed and attempts to parse it. If no parser is available a Feedzirra::NoParserAvailable exception is raised.
+     # You can pass a block to be called when there's an error during the parsing.
+     # === Parameters
+     # [xml<String>] The XML that you would like parsed.
+     # === Returns
+     # An instance of the determined feed type. By default a Feedzirra::Atom, Feedzirra::AtomFeedBurner, Feedzirra::RDF, or Feedzirra::RSS object.
+     # === Raises
+     # Feedzirra::NoParserAvailable : If no valid parser classes could be found for the feed.
+     def self.parse(xml, &block)
+       if parser = determine_feed_parser_for_xml(xml)
+         parser.parse(xml, block)
+       else
+         raise NoParserAvailable.new("No valid parser for XML.")
+       end
+     end
+
+     # Determines the correct parser class to use for parsing the feed.
+     #
+     # === Parameters
+     # [xml<String>] The XML for which you would like to determine the parser.
+     # === Returns
+     # The class name of the parser that can handle the XML.
+     def self.determine_feed_parser_for_xml(xml)
+       start_of_doc = xml.slice(0, 2000)
+       feed_classes.detect {|klass| klass.able_to_parse?(start_of_doc)}
+     end
+
+     # Adds a new feed parsing class that will be used for parsing.
+     #
+     # === Parameters
+     # [klass<Constant>] The class/constant that you want to register.
+     # === Returns
+     # An updated array of feed parser class names.
+     def self.add_feed_class(klass)
+       feed_classes.unshift klass
+     end
+
+     # Provides a list of registered feed parsing classes.
+     #
+     # === Returns
+     # An array of class names.
+     def self.feed_classes
+       @feed_classes ||= [Feedzirra::Parser::RSSFeedBurner, Feedzirra::Parser::RSS, Feedzirra::Parser::GoogleDocsAtom, Feedzirra::Parser::AtomFeedBurner, Feedzirra::Parser::Atom, Feedzirra::Parser::ITunesRSS]
+     end
+
+     # Makes all registered feed types look for the passed in element to parse.
+     # This is actually just a call to element (a SAXMachine call) in the class.
+     #
+     # === Parameters
+     # [element_tag<String>] The element tag
+     # [options<Hash>] Valid keys are the same as with SAXMachine
+     def self.add_common_feed_element(element_tag, options = {})
+       feed_classes.each do |k|
+         k.element element_tag, options
+       end
+     end
+
+     # Makes all registered feed types look for the passed in elements to parse.
+     # This is actually just a call to elements (a SAXMachine call) in the class.
+     #
+     # === Parameters
+     # [element_tag<String>] The element tag
+     # [options<Hash>] Valid keys are the same as with SAXMachine
+     def self.add_common_feed_elements(element_tag, options = {})
+       feed_classes.each do |k|
+         k.elements element_tag, options
+       end
+     end
+
+     # Makes all registered entry types look for the passed in element to parse.
+     # This is actually just a call to element (a SAXMachine call) in the class.
+     #
+     # === Parameters
+     # [element_tag<String>]
+     # [options<Hash>] Valid keys are the same as with SAXMachine
+     def self.add_common_feed_entry_element(element_tag, options = {})
+       call_on_each_feed_entry :element, element_tag, options
+     end
+
+     # Makes all registered entry types look for the passed in elements to parse.
+     # This is actually just a call to elements (a SAXMachine call) in the class.
+     #
+     # === Parameters
+     # [element_tag<String>]
+     # [options<Hash>] Valid keys are the same as with SAXMachine
+     def self.add_common_feed_entry_elements(element_tag, options = {})
+       call_on_each_feed_entry :elements, element_tag, options
+     end
+
+     # Call a method on all feed entry classes.
+     #
+     # === Parameters
+     # [method<Symbol>] The method name
+     # [parameters<Array>] The method parameters
+     def self.call_on_each_feed_entry(method, *parameters)
+       feed_classes.each do |k|
+         # iterate on the collections defined in the sax collection
+         k.sax_config.collection_elements.each_value do |vl|
+           # vl is a list of CollectionConfig mapped to an attribute name
+           # we'll look for the one set as 'entries' and add the new element
+           vl.find_all{|v| (v.accessor == 'entries') && (v.data_class.class == Class)}.each do |v|
+             v.data_class.send(method, *parameters)
+           end
+         end
+       end
+     end
+
+     # Set up curl from options.
+     # Possible parameters:
+     # * :user_agent - overrides the default user agent.
+     # * :compress - any value to enable compression
+     # * :http_authentication - array containing http authentication parameters
+     # * :proxy_url - proxy url
+     # * :proxy_port - proxy port
+     # * :max_redirects - max number of redirections
+     # * :timeout - timeout
+     def self.setup_easy(curl, options)
+       curl.headers["Accept-encoding"] = 'gzip, deflate' if options.has_key?(:compress)
+       curl.headers["User-Agent"] = (options[:user_agent] || USER_AGENT)
+
+       curl.userpwd = options[:http_authentication].join(':') if options.has_key?(:http_authentication)
+       curl.proxy_url = options[:proxy_url] if options.has_key?(:proxy_url)
+       curl.proxy_port = options[:proxy_port] if options.has_key?(:proxy_port)
+       curl.max_redirects = options[:max_redirects] if options[:max_redirects]
+       curl.timeout = options[:timeout] if options[:timeout]
+       curl.ssl_verify_host = options[:ssl_verify_host] if options.has_key?(:ssl_verify_host)
+
+       curl.follow_location = true
+     end
+
+     # Fetches and returns the raw XML for each URL provided.
+     #
+     # === Parameters
+     # [urls<String> or <Array>] A single feed URL, or an array of feed URLs.
+     # [options<Hash>] Valid keys for this argument are as follows:
+     # * :if_modified_since - Time object representing when the feed was last updated.
+     # * :if_none_match - String that's normally an etag for the request that was stored previously.
+     # * :on_success - Block that gets executed after a successful request.
+     # * :on_failure - Block that gets executed after a failed request.
+     # * all parameters defined in setup_easy
+     # === Returns
+     # A String of XML if a single URL is passed.
+     #
+     # A Hash if multiple URLs are passed. The key will be the URL, and the value the XML.
+     def self.fetch_raw(urls, options = {})
+       url_queue = [*urls]
+       multi = Curl::Multi.new
+       responses = {}
+       url_queue.each do |url|
+         easy = Curl::Easy.new(url) do |curl|
+           setup_easy curl, options
+
+           curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options.has_key?(:if_modified_since)
+           curl.headers["If-None-Match"] = options[:if_none_match] if options.has_key?(:if_none_match)
+
+           curl.on_success do |c|
+             responses[url] = decode_content(c)
+           end
+           curl.on_failure do |c, err|
+             responses[url] = c.response_code
+           end
+         end
+         multi.add(easy)
+       end
+
+       multi.perform
+       urls.is_a?(String) ? responses.values.first : responses
+     end
+
+     # Fetches and returns the parsed XML for each URL provided.
+     #
+     # === Parameters
+     # [urls<String> or <Array>] A single feed URL, or an array of feed URLs.
+     # [options<Hash>] Valid keys for this argument are as follows:
+     # * :user_agent - String that overrides the default user agent.
+     # * :if_modified_since - Time object representing when the feed was last updated.
+     # * :if_none_match - String, an etag for the request that was stored previously.
+     # * :on_success - Block that gets executed after a successful request.
+     # * :on_failure - Block that gets executed after a failed request.
+     # === Returns
+     # A Feed object if a single URL is passed.
+     #
+     # A Hash if multiple URLs are passed. The key will be the URL, and the value the Feed object.
+     def self.fetch_and_parse(urls, options = {})
+       url_queue = [*urls]
+       multi = Curl::Multi.new
+       responses = {}
+
+       # I broke these down so I would only try to do 30 simultaneously because
+       # I was getting weird errors when doing a lot. As one finishes it pops another off the queue.
+       url_queue.slice!(0, 30).each do |url|
+         add_url_to_multi(multi, url, url_queue, responses, options)
+       end
+
+       multi.perform
+       return urls.is_a?(String) ? responses.values.first : responses
+     end
+
+     # Decodes the XML document if it was compressed.
+     #
+     # === Parameters
+     # [curl_request<Curl::Easy>] The Curl::Easy response object from the request.
+     # === Returns
+     # A decoded string of XML.
+     def self.decode_content(c)
+       if c.header_str.match(/Content-Encoding: gzip/i)
+         begin
+           gz = Zlib::GzipReader.new(StringIO.new(c.body_str))
+           xml = gz.read
+           gz.close
+         rescue Zlib::GzipFile::Error
+           # Maybe this is not gzipped?
+           xml = c.body_str
+         end
+       elsif c.header_str.match(/Content-Encoding: deflate/i)
+         xml = Zlib::Inflate.inflate(c.body_str)
+       else
+         xml = c.body_str
+       end
+
+       xml
+     end
+
230
+ # Updates each feed for each Feed object provided.
231
+ #
232
+ # === Parameters
233
+ # [feeds<Feed> or <Array>] A single feed object, or an array of feed objects.
234
+ # [options<Hash>] Valid keys for this argument as as followed:
235
+ # * :on_success - Block that gets executed after a successful request.
236
+ # * :on_failure - Block that gets executed after a failed request.
237
+ # * all parameters defined in setup_easy
238
+ # === Returns
239
+ # A updated Feed object if a single URL is passed.
240
+ #
241
+ # A Hash if multiple Feeds are passed. The key will be the URL, and the value the updated Feed object.
242
+ def self.update(feeds, options = {})
243
+ feed_queue = [*feeds]
244
+ multi = Curl::Multi.new
245
+ responses = {}
246
+
247
+ feed_queue.slice!(0, 30).each do |feed|
248
+ add_feed_to_multi(multi, feed, feed_queue, responses, options)
249
+ end
250
+
251
+ multi.perform
252
+ responses.is_a?(Array)? responses.values : responses.values.first
253
+ end
254
+
255
+ # An abstraction for adding a feed by URL to the passed Curb::multi stack.
256
+ #
257
+ # === Parameters
258
+ # [multi<Curl::Multi>] The Curl::Multi object that the request should be added too.
259
+ # [url<String>] The URL of the feed that you would like to be fetched.
260
+ # [url_queue<Array>] An array of URLs that are queued for request.
261
+ # [responses<Hash>] Existing responses that you want the response from the request added to.
262
+ # [feeds<String> or <Array>] A single feed object, or an array of feed objects.
263
+ # [options<Hash>] Valid keys for this argument as as followed:
264
+ # * :on_success - Block that gets executed after a successful request.
265
+ # * :on_failure - Block that gets executed after a failed request.
266
+ # * all parameters defined in setup_easy
267
+ # === Returns
268
+ # The updated Curl::Multi object with the request details added to it's stack.
269
+ def self.add_url_to_multi(multi, url, url_queue, responses, options)
270
+ easy = Curl::Easy.new(url) do |curl|
271
+ setup_easy curl, options
272
+ curl.headers["If-Modified-Since"] = options[:if_modified_since].httpdate if options.has_key?(:if_modified_since)
273
+ curl.headers["If-None-Match"] = options[:if_none_match] if options.has_key?(:if_none_match)
274
+
275
+ curl.on_success do |c|
276
+ add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
277
+ xml = decode_content(c)
278
+ klass = determine_feed_parser_for_xml(xml)
279
+
280
+ if klass
281
+ begin
282
+ feed = klass.parse(xml, Proc.new{|message| warn "Error while parsing [#{url}] #{message}" })
283
+ feed.feed_url = c.last_effective_url
284
+ feed.etag = etag_from_header(c.header_str)
285
+ feed.last_modified = last_modified_from_header(c.header_str)
286
+ responses[url] = feed
287
+ options[:on_success].call(url, feed) if options.has_key?(:on_success)
288
+ rescue Exception => e
289
+ options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
290
+ end
291
+ else
292
+ # puts "Error determining parser for #{url} - #{c.last_effective_url}"
293
+ # raise NoParserAvailable.new("no valid parser for content.") (this would unfortunately fail the whole 'multi', so it's not really usable)
294
+ options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
295
+ end
296
+ end
297
+
298
+ curl.on_failure do |c, err|
299
+ add_url_to_multi(multi, url_queue.shift, url_queue, responses, options) unless url_queue.empty?
300
+ responses[url] = c.response_code
301
+ if c.response_code == 304 # it's not modified. this isn't an error condition
302
+ options[:on_success].call(url, nil) if options.has_key?(:on_success)
303
+ else
304
+ options[:on_failure].call(url, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
305
+ end
306
+ end
307
+ end
308
+ multi.add(easy)
309
+ end
310
+
311
+ # An abstraction for adding a feed by a Feed object to the passed Curb::multi stack.
312
+ #
313
+ # === Parameters
314
+ # [multi<Curl::Multi>] The Curl::Multi object that the request should be added too.
315
+ # [feed<Feed>] A feed object that you would like to be fetched.
316
+ # [url_queue<Array>] An array of feed objects that are queued for request.
317
+ # [responses<Hash>] Existing responses that you want the response from the request added to.
318
+ # [feeds<String>] or <Array> A single feed object, or an array of feed objects.
319
+ # [options<Hash>] Valid keys for this argument as as followed:
320
+ # * :on_success - Block that gets executed after a successful request.
321
+ # * :on_failure - Block that gets executed after a failed request.
322
+ # * all parameters defined in setup_easy
323
+ # === Returns
324
+ # The updated Curl::Multi object with the request details added to it's stack.
325
+ def self.add_feed_to_multi(multi, feed, feed_queue, responses, options)
326
+ easy = Curl::Easy.new(feed.feed_url) do |curl|
327
+ setup_easy curl, options
328
+ curl.headers["If-Modified-Since"] = feed.last_modified.httpdate if feed.last_modified
329
+ curl.headers["If-Modified-Since"] = options[:if_modified_since] if options[:if_modified_since] && (!feed.last_modified || (Time.parse(options[:if_modified_since].to_s) > feed.last_modified))
330
+ curl.headers["If-None-Match"] = feed.etag if feed.etag
331
+
332
+ curl.on_success do |c|
333
+ begin
334
+ add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
335
+ updated_feed = Feed.parse(c.body_str){ |message| warn "Error while parsing [#{feed.feed_url}] #{message}" }
336
+ updated_feed.feed_url = c.last_effective_url
337
+ updated_feed.etag = etag_from_header(c.header_str)
338
+ updated_feed.last_modified = last_modified_from_header(c.header_str)
339
+ feed.update_from_feed(updated_feed)
340
+ responses[feed.feed_url] = feed
341
+ options[:on_success].call(feed) if options.has_key?(:on_success)
342
+ rescue Exception => e
343
+ options[:on_failure].call(feed, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
344
+ end
345
+ end
346
+
347
+ curl.on_failure do |c, err|
348
+ add_feed_to_multi(multi, feed_queue.shift, feed_queue, responses, options) unless feed_queue.empty?
349
+ response_code = c.response_code
350
+ if response_code == 304 # it's not modified. this isn't an error condition
351
+ responses[feed.feed_url] = feed
352
+ options[:on_success].call(feed) if options.has_key?(:on_success)
353
+ else
354
+ responses[feed.url] = c.response_code
355
+ options[:on_failure].call(feed, c.response_code, c.header_str, c.body_str) if options.has_key?(:on_failure)
356
+ end
357
+ end
358
+ end
359
+ multi.add(easy)
360
+ end
361
+
362
+ # Determines the etag from the request headers.
363
+ #
364
+ # === Parameters
365
+ # [header<String>] Raw request header returned from the request
366
+ # === Returns
367
+ # A string of the etag or nil if it cannot be found in the headers.
368
+ def self.etag_from_header(header)
369
+ header =~ /.*ETag:\s(.*)\r/
370
+ $1
371
+ end
372
+
373
+ # Determines the last modified date from the request headers.
374
+ #
375
+ # === Parameters
376
+ # [header<String>] Raw request header returned from the request
377
+ # === Returns
378
+ # A Time object of the last modified date or nil if it cannot be found in the headers.
379
+ def self.last_modified_from_header(header)
380
+ header =~ /.*Last-Modified:\s(.*)\r/
381
+ Time.parse($1) if $1
382
+ end
383
+ end
384
+ end
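The two header helpers at the bottom can be checked against a canned response header using the same regexes in plain Ruby (the header string here is fabricated for illustration):

```ruby
require 'time'

# A canned raw response header, shaped like the header_str the helpers receive.
header = "HTTP/1.1 200 OK\r\n" \
         "ETag: \"abc123\"\r\n" \
         "Last-Modified: Sat, 31 Jan 2009 22:58:16 GMT\r\n\r\n"

# Same extraction as etag_from_header / last_modified_from_header:
# capture everything on the line after the header name, up to the \r.
etag = header[/ETag:\s(.*)\r/, 1]        # => "\"abc123\""
lm   = header[/Last-Modified:\s(.*)\r/, 1]
last_modified = Time.parse(lm) if lm     # a Time in GMT, or nil if absent
```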