feedtosis 0.0.3.6

data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Justin S. Leitgeb
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,135 @@
+ = Description
+
+ Feedtosis fetches RSS and Atom feeds through an easy-to-use interface. It
+ uses FeedNormalizer for parsing and Curb for fetching. It helps by
+ automatically using conditional HTTP GET requests and by reliably pointing
+ out which entries in any given feed are new.
+
+ Feedtosis is designed to handle the bookkeeping of feed fetching for you,
+ so that things like HTTP conditional GET are trivial to use. It has a
+ simple interface and remains a lightweight component that delegates to
+ FeedNormalizer for parsing feeds and to the fantastic taf2-curb library
+ for fetching them.
+
+ == Installation
+
+ Assuming that you've followed the directions on gems.github.com to allow
+ your computer to install gems from GitHub, the following command will
+ install the Feedtosis library:
+
+   sudo gem install jsl-feedtosis
+
+ == Usage
+
+ Feedtosis is easy to use. Just create a client object and invoke the
+ +fetch+ method:
+
+   require 'feedtosis'
+   client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster')
+   result = client.fetch
+
+ +result+ will be a Feedtosis::Result object which delegates methods to
+ both the FeedNormalizer::Feed object and the Curl::Easy object used to
+ fetch the feed. Useful methods on this object include +entries+,
+ +new_entries+ and +response_code+, among many others (essentially all of
+ the methods that FeedNormalizer::Feed and Curl::Easy objects respond to,
+ minus their setters, can be called directly).
+
+ Note that since Feedtosis uses HTTP conditional GET, it may not actually
+ have received a full XML response from the server suitable for parsing
+ into entries. In that case, methods such as +entries+ on the
+ Feedtosis::Result will return +nil+. Depending on your application logic,
+ you may want to inspect the methods delegated to the Curl::Easy object,
+ such as +response_code+, for more information on what happened.
+
+ Remember that a response code of 304 means "Not Modified". In this case,
+ expect +entries+ and +new_entries+ to be +nil+, since the resource wasn't
+ downloaded, in keeping with the logic of HTTP conditional GET.
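A fetch loop can branch on these cases. A minimal sketch (the feed URL is illustrative, and running it requires network access):

```ruby
require 'feedtosis'

client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster')
result = client.fetch

if result.response_code == 304
  # Not Modified: no body was downloaded, so entries/new_entries are nil
  puts 'feed unchanged since last fetch'
elsif result.entries
  result.new_entries.each { |entry| puts entry.title }
end
```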
+
+ On subsequent requests for a particular resource, Feedtosis updates
+ +new_entries+ to contain the feed entries that haven't been seen yet. In
+ most applications, your program will call the same batch of URLs multiple
+ times and process the elements in +new_entries+.
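Conceptually, the newness check behind +new_entries+ works like this (a simplified sketch using plain strings, not the gem's internal code):

```ruby
require 'digest/md5'

# Digests remembered from earlier fetches (illustrative values)
seen_digests = [Digest::MD5.hexdigest('old entry')]

entries = ['old entry', 'new entry']

# An entry is "new" when its digest wasn't seen in any previous fetch
new_entries = entries.reject do |e|
  seen_digests.include?(Digest::MD5.hexdigest(e))
end
# new_entries == ["new entry"]
```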
+
+ You will most likely want Feedtosis to remember details about the last
+ retrieval of a feed after the client is removed from memory. Feedtosis
+ uses Moneta, a unified interface to key-value storage systems, to remember
+ "summaries" of feeds that it has seen in the past. See the Customization
+ section below for details on how to configure this system.
+
+ == Customization
+
+ Feedtosis stores summaries of feeds in a key-value storage system. If no
+ options are given when creating a new Feedtosis::Client object, the
+ default is an in-memory store. The memory store is just a basic Ruby Hash,
+ so it won't keep track of feeds after a particular Client is removed from
+ memory. To configure a different backend, pass an options hash to the
+ Feedtosis client initialization:
+
+   url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml"
+   f = Feedtosis::Client.new(url, :backend => Moneta::Memcache.new(:server => 'localhost:1978'))
+   res = f.fetch
+
+ This example sets up a Memcache-protocol backend, which in this case
+ points to Tokyo Tyrant on port 1978.
+
+ Generally, Feedtosis supports all storage systems supported by Moneta:
+ pass any configured Moneta instance as the +backend+ option.
+
+ == Implementation
+
+ Feedtosis identifies new feed entries and figures out when conditional
+ GET can be used in retrieving resources. To accomplish this without
+ requiring the user to store information such as ETags and the date of the
+ last retrieved entry, Feedtosis stores a summary structure in the
+ configured key-value store (backed by Moneta). To make conditional GET
+ requests, Feedtosis records the Last-Modified date and the ETag of the
+ last request in the summary structure, which is stored under a namespaced
+ key consisting of the term 'feedtosis' (bet you won't have to worry
+ about name collisions on that one!) and the MD5 of the URL retrieved.
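Concretely, the storage key can be reproduced as follows (a sketch; the underscore separator matches how the client joins the namespace and the URL digest):

```ruby
require 'digest/md5'

url = 'http://feeds.feedburner.com/wooster'
key = ['feedtosis', Digest::MD5.hexdigest(url)].join('_')
# key looks like "feedtosis_<32 hex digits>"
```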
+
+ It can also be tricky to decide which feed entries are new, since many
+ feed sources don't include unique ids with their entries. Feedtosis
+ reliably keeps track of which entries in a feed are new by storing (in the
+ summary structure mentioned above) an MD5 signature of each entry,
+ computed from elements such as the published-at date, title and content.
+ This allows Feedtosis to cheaply determine (both in terms of computation
+ and storage) which feed entries should be presented to the user as "new".
+ Below is an example of a summary structure:
+
+   {
+     :etag => "4c8f-46ac09fbbe940",
+     :last_modified => "Mon, 25 May 2009 18:17:33 GMT",
+     :digests => [["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]]
+   }
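The per-entry digest in +:digests+ is built from the entry fields mentioned above. A simplified sketch with hypothetical field values:

```ruby
require 'digest/md5'

# Hypothetical entry fields (not a real FeedNormalizer entry)
title        = 'Firing up Feedtosis'
content      = '<p>Release announcement</p>'
published_at = 'Mon, 25 May 2009 18:17:33 GMT'

# Joining the fields and hashing yields a stable 32-character signature
digest = Digest::MD5.hexdigest([title, content, published_at].join)
```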
+
+ The data stored in the summary structure allows Feedtosis to be helpful
+ to the user without storing more data than necessary for efficient
+ operation.
+
+ The summary structure keeps an Array of Arrays of entry digests. The
+ reason is that some feeds, such as the Google blog search feeds, contain
+ slightly different but often-recurring results in their result sets.
+ Feedtosis therefore keeps complete sets of entry digests from previous
+ feed retrievals. The number of digest sets retained is configurable via
+ the :retained_digest_size option on Feedtosis client initialization.
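The digest-set rotation can be pictured with plain Arrays (a simplified sketch; a :retained_digest_size of 3 is illustrative):

```ruby
retained_digest_size = 3

# Digest sets from the three most recent fetches, newest first
digest_sets = [['aa'], ['bb'], ['cc']]

# A new fetch pushes its set onto the front, then the list is trimmed
# to the configured size, dropping the oldest set
new_set = ['dd']
digest_sets = digest_sets.unshift(new_set).first(retained_digest_size)
# digest_sets == [["dd"], ["aa"], ["bb"]]
```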
+
+ == HTML cleaning/sanitizing
+
+ Feedtosis doesn't do any feed sanitizing itself, as other libraries have
+ been built for this purpose. FeedNormalizer has methods for escaping
+ entries, but to strip HTML I suggest that you look at the Ruby gem
+ "sanitize".
+
+ == Credits
+
+ Thanks to Sander Hartlage (GitHub: Sander6) for useful feedback early in
+ the development of Feedtosis.
+
+ == Feedback
+
+ Please let me know if you have any problems with or questions about
+ Feedtosis.
+
+ = Author
+
+ Justin S. Leitgeb, mailto:justin@phq.org
data/Rakefile ADDED
@@ -0,0 +1,28 @@
+ require 'rubygems'
+ require 'spec'
+
+ require 'rake'
+ require 'spec/rake/spectask'
+ require 'rake/rdoctask'
+
+ require 'lib/feedtosis'
+
+ desc 'Run the specs for the library.'
+ Spec::Rake::SpecTask.new(:spec) do |t|
+   t.spec_opts = ["--format", "progress", "--colour"]
+   t.libs << 'lib'
+   t.verbose = true
+ end
+
+ desc "Run all the tests"
+ task :default => :spec
+
+ desc 'Generate documentation'
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = 'Feedtosis'
+   rdoc.options << '--line-numbers' << '--inline-source'
+   rdoc.rdoc_files.include('README.rdoc')
+   rdoc.rdoc_files.include('lib/feedtosis/**/*.rb')
+ end
+
data/feedtosis.gemspec ADDED
@@ -0,0 +1,48 @@
+ Gem::Specification.new do |s|
+   s.name = %q{feedtosis}
+   s.version = "0.0.3.6"
+
+   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+   s.authors = ["Justin Leitgeb"]
+   s.date = %q{2009-07-15}
+   s.description = %q{Feedtosis finds new information in feeds quickly using smart fetching and matching of previously read entries}
+   s.email = %q{justin@phq.org}
+
+   s.files = ["lib/extensions/core/hash.rb",
+     "lib/extensions/feed_normalizer/feed_instance_methods.rb",
+     "lib/feedtosis/result.rb",
+     "lib/feedtosis/client.rb", "lib/feedtosis.rb", "LICENSE",
+     "feedtosis.gemspec", "Rakefile", "README.rdoc",
+     "spec/extensions/feed_normalizer/feed_instance_methods_spec.rb",
+     "spec/fixtures/http_headers/wooster.txt",
+     "spec/fixtures/xml/older_wooster.xml", "spec/fixtures/xml/wooster.xml",
+     "spec/feedtosis/client_spec.rb",
+     "spec/feedtosis/result_spec.rb",
+     "spec/spec_helper.rb"]
+
+   s.has_rdoc = true
+   s.homepage = %q{http://github.com/jsl/feedtosis}
+   s.rdoc_options = ["--charset=UTF-8"]
+   s.require_paths = ["lib"]
+   s.rubygems_version = %q{1.3.1}
+   s.summary = %q{Retrieves feeds using conditional GET and marks entries that you haven't seen before}
+   s.test_files = ["spec/spec_helper.rb", "spec/feedtosis/client_spec.rb", "spec/feedtosis/result_spec.rb"]
+
+   s.extra_rdoc_files = ["README.rdoc"]
+
+   s.rdoc_options += [
+     '--title', 'Feedtosis',
+     '--main', 'README.rdoc',
+     '--line-numbers',
+     '--inline-source'
+   ]
+
+   %w[ taf2-curb jsl-moneta jsl-http_headers feed-normalizer ].each do |dep|
+     s.add_dependency(dep)
+   end
+
+   if s.respond_to? :specification_version then
+     current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+     s.specification_version = 2
+   end
+ end
data/lib/extensions/core/hash.rb ADDED
@@ -0,0 +1,7 @@
+ class Hash
+
+   # Merges +other_hash+ into self, with values already present in self
+   # taking precedence over those in +other_hash+.
+   def reverse_merge(other_hash)
+     other_hash.merge(self)
+   end
+
+ end
data/lib/extensions/feed_normalizer/feed_instance_methods.rb ADDED
@@ -0,0 +1,17 @@
+ # Extends FeedNormalizer::Feed with a method for detecting new items
+ # (aliased as new_entries for convenience).
+ module Feedtosis
+   module FeedInstanceMethods
+
+     # Returns only the entries that have not been seen before.
+     def new_items
+       self.entries.select do |e|
+         e.instance_variable_get(:@_seen) == false
+       end
+     end
+
+     alias :new_entries :new_items
+   end
+ end
+
+ FeedNormalizer::Feed.__send__(:include, Feedtosis::FeedInstanceMethods)
data/lib/feedtosis.rb ADDED
@@ -0,0 +1,17 @@
+ require 'rubygems'
+
+ require 'curb'
+ require 'http_headers'
+ require 'feed-normalizer'
+ require 'md5'
+ require 'uri'
+
+ lib_dirs = [ 'extensions', 'feedtosis' ].map do |d|
+   File.join(File.dirname(__FILE__), d)
+ end
+
+ lib_dirs.each do |d|
+   Dir[File.join(d, "**", "*.rb")].each do |file|
+     require file
+   end
+ end
data/lib/feedtosis/client.rb ADDED
@@ -0,0 +1,173 @@
+ module Feedtosis
+
+   # Feedtosis::Client is the primary interface to the feed reader. Call it
+   # with a url that was previously fetched while connected to the configured
+   # backend, and it will 1) only do a retrieval if deemed necessary based on
+   # the ETag and Last-Modified date of the previous response and 2) mark all
+   # entries retrieved as either new or not new. Entries retrieved are
+   # normalized using the feed-normalizer gem.
+   class Client
+     attr_reader :url, :options, :backend
+
+     DEFAULTS = {
+       :backend => Hash.new,
+
+       # The namespace will be prefixed to the key used for storage of the
+       # summary value. Based on your application needs, it may be useful to
+       # provide a custom prefix with initialization options.
+       :namespace => 'feedtosis',
+
+       # Some feed aggregators that we may be pulling from have entries that
+       # are present in one fetch and then disappear (Google blog search does
+       # this). For these cases, we can't rely on only the digests of the
+       # last fetch to guarantee "newness" of a feed that we may have
+       # previously consumed. We keep a number of previous sets of digests in
+       # order to make sure that we mark the correct entries as "new".
+       :retained_digest_size => 10
+     } unless defined?(DEFAULTS)
+
+     # Initializes a new Feedtosis client. It must be initialized with a
+     # valid URL as the first argument. The optional +options+ Hash may take
+     # the following arguments:
+     # * backend: a key-value store to be used for summary structures of feeds fetched. Moneta backends work well, but any object acting like a Hash is valid.
+     # * retained_digest_size: an Integer specifying the number of previous MD5 sets of entries to keep, used for new-entry detection
+     def initialize(url, options = { })
+       @url = url
+
+       raise ArgumentError, "Feedtosis::Client options must be in Hash form if provided" unless options.is_a?(Hash)
+       @options = options.reverse_merge(DEFAULTS)
+
+       @backend = @options[:backend]
+
+       unless @url.match(URI.regexp('http'))
+         raise ArgumentError, "Url #{@url} is not valid!"
+       end
+
+       unless @backend.respond_to?(:[]) && @backend.respond_to?(:[]=)
+         raise ArgumentError, "Backend needs to be a key-value store"
+       end
+     end
+
+     # Retrieves the latest entries from this feed. Returns a
+     # Feedtosis::Result object which delegates methods to the Curl::Easy
+     # object making the request and the FeedNormalizer::Feed object that may
+     # have been created from the HTTP response body.
+     def fetch
+       curl = build_curl_easy
+       curl.perform
+       feed = process_curl_response(curl)
+       Feedtosis::Result.new(curl, feed)
+     end
+
+     private
+
+     # Marks entries as either seen or not seen based on the unique signature
+     # of the entry, which is calculated by taking the MD5 of common
+     # attributes.
+     def mark_new_entries(response)
+       digests = summary_digests
+
+       # For each entry in the response object, mark @_seen as false if the
+       # digest of this entry doesn't exist in the cached summary.
+       response.entries.each do |e|
+         seen = digests.include?(digest_for(e))
+         e.instance_variable_set(:@_seen, seen)
+       end
+
+       response
+     end
+
+     # Returns an Array of entry digests for this feed. Since we keep a
+     # number of sets of digests, inject across these sets to accumulate
+     # unique identifiers.
+     def summary_digests
+       summary_for_feed[:digests].inject([]) do |r, e|
+         r |= e
+       end.uniq
+     end
+
+     # Processes the response by identifying which entries are new when the
+     # response code is 200. Otherwise returns +nil+; the Curl::Easy object
+     # remains available on the Result for the user to inspect.
+     def process_curl_response(curl)
+       if curl.response_code == 200
+         response = parser_for_xml(curl.body_str)
+         response = mark_new_entries(response)
+         store_summary_to_backend(response, curl)
+         response
+       end
+     end
+
+     # Sets options for the Curl::Easy object, including parameters for HTTP
+     # conditional GET.
+     def build_curl_easy
+       curl = new_curl_easy(@url)
+
+       # Many feeds have a 302 redirect to another URL. For more recent
+       # versions of Curl, we need to specify this explicitly.
+       curl.follow_location = true
+
+       set_header_options(curl)
+     end
+
+     def new_curl_easy(url)
+       Curl::Easy.new(url)
+     end
+
+     # Returns the summary hash for this feed from the backend store.
+     def summary_for_feed
+       @backend[key_for_cached] || { :digests => [ ] }
+     end
+
+     # Sets the conditional GET headers from the stored summary, if available.
+     def set_header_options(curl)
+       summary = summary_for_feed
+
+       curl.headers['If-None-Match'] = summary[:etag] unless summary[:etag].nil?
+       curl.headers['If-Modified-Since'] = summary[:last_modified] unless summary[:last_modified].nil?
+
+       curl
+     end
+
+     # Returns the key under which the summary structure is stored in the
+     # key-value system.
+     def key_for_cached
+       [ @options[:namespace], MD5.hexdigest(@url) ].join('_')
+     end
+
+     # Stores information about the retrieval, including ETag, Last-Modified,
+     # and MD5 digests of all entries, to the backend store. This enables
+     # conditional GET usage on subsequent requests and marking of entries as
+     # either new or seen.
+     def store_summary_to_backend(feed, curl)
+       headers = HttpHeaders.new(curl.header_str)
+
+       # Store info about the HTTP retrieval
+       summary = { }
+
+       summary.merge!(:etag => headers.etag) unless headers.etag.nil?
+       summary.merge!(:last_modified => headers.last_modified) unless headers.last_modified.nil?
+
+       # Store a digest for each feed entry so we can detect new entries on
+       # the next retrieval
+       new_digest_set = feed.entries.map do |e|
+         digest_for(e)
+       end
+
+       # Push the newest digest set onto the front and keep exactly
+       # :retained_digest_size sets
+       new_digest_set = summary_for_feed[:digests].unshift(new_digest_set)
+       new_digest_set = new_digest_set[0, @options[:retained_digest_size]]
+
+       summary.merge!( :digests => new_digest_set )
+       set_summary(summary)
+     end
+
+     def set_summary(summary)
+       @backend[key_for_cached] = summary
+     end
+
+     # Computes a unique signature for the FeedNormalizer::Entry object
+     # given. This signature is the MD5 of enough fields to have a
+     # reasonable probability of determining whether the entry is unique.
+     def digest_for(entry)
+       MD5.hexdigest( [ entry.title, entry.content, entry.date_published ].join )
+     end
+
+     def parser_for_xml(xml)
+       FeedNormalizer::FeedNormalizer.parse(xml)
+     end
+   end
+ end