feedtosis 0.0.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Justin S. Leitgeb
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,135 @@
+ = Description
+
+ Feedtosis fetches RSS and Atom feeds with an easy-to-use interface. It uses
+ FeedNormalizer for parsing and Curb for fetching. It helps by automatically
+ using conditional HTTP GET requests, as well as by reliably pointing out which
+ entries are new in any given feed.
+
+ Feedtosis is designed to handle the bookkeeping of feed fetching for you,
+ so that things like using HTTP conditional GET are trivial. It has a simple
+ interface, and remains a lightweight component that delegates to FeedNormalizer
+ for parsing feeds and to the fantastic taf2-curb library for fetching them.
+
+ == Installation
+
+ Assuming that you've followed the directions on gems.github.com to allow your
+ computer to install gems from GitHub, the following command will install the
+ Feedtosis library:
+
+   sudo gem install jsl-feedtosis
+
+ == Usage
+
+ Feedtosis is easy to use. Just create a client object and invoke the
+ +fetch+ method:
+
+   require 'feedtosis'
+   client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster')
+   result = client.fetch
+
+ +result+ will be a Feedtosis::Result object, which delegates methods to
+ the FeedNormalizer::Feed object as well as to the Curl::Easy object used to
+ fetch the feed. Useful methods on this object include +entries+,
+ +new_entries+ and +response_code+, among many others (essentially all of the
+ methods that FeedNormalizer::Feed and Curl::Easy objects respond to are
+ available and can be called directly, minus the setter methods for these
+ objects).
+
+ Note that since Feedtosis uses HTTP conditional GET, it may not actually
+ have received a full XML response from the server suitable for being parsed
+ into entries. In this case, methods such as +entries+ on the Feedtosis::Result
+ will return +nil+. Depending on your application logic, you may want to inspect
+ the methods that are delegated to the Curl::Easy object, such as
+ +response_code+, for more information on what happened in these cases.
+
+ Remember that a response code of 304 means "Not Modified". In this case, you
+ should expect +entries+ and +new_entries+ to be +nil+, since the resource
+ wasn't downloaded, per the logic of HTTP conditional GET.
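+
+ For example, a minimal sketch of handling both outcomes (where +process_entry+
+ is a hypothetical placeholder for your own handling code):
+
+   result = client.fetch
+
+   case result.response_code
+   when 200
+     (result.new_entries || []).each { |entry| process_entry(entry) }
+   when 304
+     # Not Modified: no body was downloaded, so there is nothing to parse
+   end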
+
+ On subsequent requests for a particular resource, Feedtosis will update
+ +new_entries+ to contain the feed entries that we haven't seen yet. In most
+ applications, your program will call the same batch of URLs multiple times,
+ processing the elements of +new_entries+ on each pass, as in the sketch below.
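+
+ A hypothetical polling loop over several feeds (the URLs and the +handle+
+ method are placeholders):
+
+   clients = %w[
+     http://feeds.feedburner.com/wooster
+     http://example.com/feed.xml
+   ].map { |url| Feedtosis::Client.new(url) }
+
+   loop do
+     clients.each do |client|
+       result = client.fetch
+       (result.new_entries || []).each { |entry| handle(entry) }
+     end
+     sleep 300   # wait five minutes between polling rounds
+   end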
+
+ You will most likely want Feedtosis to remember details about the
+ last retrieval of a feed even after the client is removed from memory.
+ Feedtosis uses Moneta, a unified interface to key-value storage systems, to
+ remember "summaries" of feeds that it has seen in the past. See the
+ Customization section below for more details on how to configure this system.
+
+ == Customization
+
+ Feedtosis stores summaries of feeds in a key-value storage system. If no
+ options are included when creating a new Feedtosis::Client object, the
+ default is to use a "memory" storage system. The memory backend is just a
+ basic Ruby Hash, so it won't keep track of feeds after a particular Client is
+ removed from memory. To configure a different backend, pass an options hash to
+ the Feedtosis client initialization:
+
+   url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml"
+   f = Feedtosis::Client.new(url, :backend => Moneta::Memcache.new(:server => 'localhost:1978'))
+   res = f.fetch
+
+ This example sets up a Moneta::Memcache backend, which in this case points to
+ Tokyo Tyrant on port 1978 (Tokyo Tyrant speaks the memcached protocol).
+
+ Generally, Feedtosis supports all storage systems supported by Moneta:
+ construct any Moneta store and pass it as the +backend+ option. In fact, any
+ object that acts like a Hash (responding to #[] and #[]=) is a valid backend.
+
+ == Implementation
+
+ Feedtosis helps to identify new feed entries and to figure out when
+ conditional GET can be used in retrieving resources. In order to accomplish
+ this without requiring the user to store information such as ETags and the
+ date of the last retrieved entry, Feedtosis stores a summary structure in
+ the configured key-value store (backed by Moneta). In order to do conditional
+ GET requests, Feedtosis stores the Last-Modified date, as well as the ETag
+ of the last request, in the summary structure, which is kept under a
+ namespaced key consisting of the term 'feedtosis' (bet you won't have to worry
+ about name collisions on that one!) and the MD5 of the URL retrieved.
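+
+ Concretely, the storage key is built as in Client#key_for_cached (shown later
+ in this diff):
+
+   [ 'feedtosis', MD5.hexdigest(url) ].join('_')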
+
+ It can also be a bit tricky to decipher which feed entries are new, since many
+ feed sources don't include unique IDs with their entries. Feedtosis reliably
+ keeps track of which entries in a feed are new by storing (in the summary hash
+ mentioned above) an MD5 signature of each entry in a feed. It takes elements
+ such as the published-at date, title and content and generates the MD5 of
+ these elements. This allows Feedtosis to cheaply compute (both in terms of
+ computation and storage) which feed entries should be presented to the user as
+ "new". Below is an example of a summary structure:
+
+   {
+     :etag => "4c8f-46ac09fbbe940",
+     :last_modified => "Mon, 25 May 2009 18:17:33 GMT",
+     :digests => [["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]]
+   }
+
+ The data stored by Feedtosis in the summary structure allows it to be
+ helpful to the user without storing lots of data that are unnecessary for
+ efficient functioning.
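+
+ The per-entry signature is computed along the lines of Client#digest_for
+ (also shown later in this diff):
+
+   MD5.hexdigest( [ entry.title, entry.content, entry.date_published ].join )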
+
+ The summary structure keeps an Array of Arrays containing digests of entries.
+ The reason for this is that some feeds, such as the Google blog search feeds,
+ contain slightly different but often-recurring results in the result set.
+ Feedtosis keeps complete sets of entry digests for previous feed retrievals.
+ The number of digest sets that will be kept is configurable by setting the
+ +retained_digest_size+ option on Feedtosis client initialization.
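+
+ For example, to retain the last 20 digest sets:
+
+   client = Feedtosis::Client.new(url, :retained_digest_size => 20)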
+
+ == HTML cleaning/sanitizing
+
+ Feedtosis doesn't do anything about feed sanitizing, as other libraries have
+ been built for this purpose. FeedNormalizer has methods for escaping entries,
+ but to strip HTML I suggest that you look at the Ruby gem "sanitize".
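+
+ A minimal sketch with the sanitize gem (assuming it is installed;
+ Sanitize.clean strips all HTML tags by default):
+
+   require 'sanitize'
+   plain_text = Sanitize.clean(entry.content)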
+
+ == Credits
+
+ Thanks to Sander Hartlage (GitHub: Sander6) for useful feedback early in the
+ development of Feedtosis.
+
+ == Feedback
+
+ Please let me know if you have any problems with or questions about
+ Feedtosis.
+
+ = Author
+
+ Justin S. Leitgeb, mailto:justin@phq.org
data/Rakefile ADDED
@@ -0,0 +1,28 @@
+ require 'rubygems'
+ require 'spec'
+
+ require 'rake'
+ require 'spec/rake/spectask'
+ require 'rake/rdoctask'
+
+ require 'lib/feedtosis'
+
+ desc 'Test the plugin.'
+ Spec::Rake::SpecTask.new(:spec) do |t|
+   t.spec_opts = ["--format", "progress", "--colour"]
+   t.libs << 'lib'
+   t.verbose = true
+ end
+
+ desc "Run all the tests"
+ task :default => :spec
+
+ desc 'Generate documentation'
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = 'Feedtosis'
+   rdoc.options << '--line-numbers' << '--inline-source'
+   rdoc.rdoc_files.include('README.rdoc')
+   rdoc.rdoc_files.include('lib/feedtosis/**/*.rb')
+ end
data/feedtosis.gemspec ADDED
@@ -0,0 +1,48 @@
+ Gem::Specification.new do |s|
+   s.name = %q{feedtosis}
+   s.version = "0.0.3.6"
+
+   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+   s.authors = ["Justin Leitgeb"]
+   s.date = %q{2009-07-15}
+   s.description = %q{Feedtosis finds new information in feeds quickly using smart fetching and matching of previously read entries}
+   s.email = %q{justin@phq.org}
+
+   s.files = ["lib/extensions/core/hash.rb",
+     "lib/extensions/feed_normalizer/feed_instance_methods.rb",
+     "lib/feedtosis/result.rb",
+     "lib/feedtosis/client.rb", "lib/feedtosis.rb", "LICENSE",
+     "feedtosis.gemspec", "Rakefile", "README.rdoc",
+     "spec/extensions/feed_normalizer/feed_instance_methods_spec.rb",
+     "spec/fixtures/http_headers/wooster.txt",
+     "spec/fixtures/xml/older_wooster.xml", "spec/fixtures/xml/wooster.xml",
+     "spec/feedtosis/client_spec.rb",
+     "spec/feedtosis/result_spec.rb",
+     "spec/spec_helper.rb"]
+
+   s.has_rdoc = true
+   s.homepage = %q{http://github.com/jsl/feedtosis}
+   s.rdoc_options = ["--charset=UTF-8"]
+   s.require_paths = ["lib"]
+   s.rubygems_version = %q{1.3.1}
+   s.summary = %q{Retrieves feeds using conditional GET and marks entries that you haven't seen before}
+   s.test_files = ["spec/spec_helper.rb", "spec/feedtosis/client_spec.rb", "spec/feedtosis/result_spec.rb"]
+
+   s.extra_rdoc_files = ["README.rdoc"]
+
+   s.rdoc_options += [
+     '--title', 'Feedtosis',
+     '--main', 'README.rdoc',
+     '--line-numbers',
+     '--inline-source'
+   ]
+
+   %w[ taf2-curb jsl-moneta jsl-http_headers feed-normalizer ].each do |dep|
+     s.add_dependency(dep)
+   end
+
+   if s.respond_to? :specification_version then
+     current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+     s.specification_version = 2
+   end
+ end
data/lib/extensions/core/hash.rb ADDED
@@ -0,0 +1,7 @@
+ class Hash
+
+   # Merges +other_hash+ into the receiver without overwriting existing keys:
+   # values already present in self win over those in other_hash.
+   def reverse_merge(other_hash)
+     other_hash.merge(self)
+   end
+
+ end
data/lib/extensions/feed_normalizer/feed_instance_methods.rb ADDED
@@ -0,0 +1,17 @@
+ # Extends FeedNormalizer::Feed with a method for detecting new items (aliased
+ # as new_entries for convenience).
+ module Feedtosis
+   module FeedInstanceMethods
+
+     # Returns only the entries that have not been seen before, i.e. those
+     # marked with @_seen == false by Feedtosis::Client.
+     def new_items
+       self.entries.select do |e|
+         e.instance_variable_get(:@_seen) == false
+       end
+     end
+
+     alias :new_entries :new_items
+   end
+ end
+
+ FeedNormalizer::Feed.__send__(:include, Feedtosis::FeedInstanceMethods)
data/lib/feedtosis.rb ADDED
@@ -0,0 +1,17 @@
+ require 'rubygems'
+
+ require 'curb'
+ require 'http_headers'
+ require 'feed-normalizer'
+ require 'md5'
+ require 'uri'
+
+ # Require all Ruby files under the extensions and feedtosis directories.
+ lib_dirs = [ 'extensions', 'feedtosis' ].map do |d|
+   File.join(File.dirname(__FILE__), d)
+ end
+
+ lib_dirs.each do |d|
+   Dir[File.join(d, "**", "*.rb")].each do |file|
+     require file
+   end
+ end
data/lib/feedtosis/client.rb ADDED
@@ -0,0 +1,173 @@
+ module Feedtosis
+
+   # Feedtosis::Client is the primary interface to the feed reader. Call it
+   # with a url that was previously fetched while connected to the configured
+   # backend, and it will 1) only do a retrieval if deemed necessary, based on
+   # the ETag and Last-Modified information of the last fetch, and 2) mark all
+   # entries retrieved as either new or not new. Entries retrieved are
+   # normalized using the feed-normalizer gem.
+   class Client
+     attr_reader :url, :options, :backend
+
+     DEFAULTS = {
+       :backend => Hash.new,
+
+       # The namespace will be prefixed to the key used for storage of the summary value. Based on your
+       # application needs, it may be useful to provide a custom prefix with initialization options.
+       :namespace => 'feedtosis',
+
+       # Some feed aggregators that we may be pulling from have entries that are present in one fetch and
+       # then disappear (Google blog search does this). For these cases, we can't rely on only the digests of
+       # the last fetch to guarantee "newness" of a feed that we may have previously consumed. We keep a
+       # number of previous sets of digests in order to make sure that we mark correct entries as "new".
+       :retained_digest_size => 10
+     } unless defined?(DEFAULTS)
+
+     # Initializes a new Feedtosis client. It must be initialized with a valid URL as the first argument.
+     # A following optional +options+ Hash may take the arguments:
+     # * backend: a key-value store to be used for summary structures of feeds fetched. Moneta backends work well, but any object acting like a Hash is valid.
+     # * retained_digest_size: an Integer specifying the number of previous MD5 sets of entries to keep, used for new feed detection
+     def initialize(url, options = { })
+       @url = url
+
+       raise ArgumentError, "Feedtosis::Client options must be in Hash form if provided" unless options.is_a?(Hash)
+       @options = options.reverse_merge(DEFAULTS)
+
+       @backend = @options[:backend]
+
+       unless @url.match(URI.regexp('http'))
+         raise ArgumentError, "Url #{@url} is not valid!"
+       end
+
+       unless @backend.respond_to?(:[]) && @backend.respond_to?(:[]=)
+         raise ArgumentError, "Backend needs to be a key-value store"
+       end
+     end
+
+     # Retrieves the latest entries from this feed. Returns a Feedtosis::Result
+     # object which delegates methods to the Curl::Easy object making the request
+     # and the FeedNormalizer::Feed object that may have been created from the
+     # HTTP response body.
+     def fetch
+       curl = build_curl_easy
+       curl.perform
+       feed = process_curl_response(curl)
+       Feedtosis::Result.new(curl, feed)
+     end
+
+     private
+
+     # Marks entries as either seen or not seen based on the unique signature of
+     # the entry, which is calculated by taking the MD5 of common attributes.
+     def mark_new_entries(response)
+       digests = summary_digests
+
+       # For each entry in the response object, mark @_seen as false if the
+       # digest of this entry doesn't exist in the cached set of digests.
+       response.entries.each do |e|
+         seen = digests.include?(digest_for(e))
+         e.instance_variable_set(:@_seen, seen)
+       end
+
+       response
+     end
+
+     # Returns an Array of unique entry digests for this feed. Since we keep a
+     # number of sets of digests, inject across these sets to accumulate unique
+     # identifiers.
+     def summary_digests
+       summary_for_feed[:digests].inject([]) do |r, e|
+         r |= e
+       end.uniq
+     end
+
+     # Processes the response by identifying which entries are new if the
+     # response code is 200. Otherwise, returns +nil+; the caller can inspect
+     # the Curl::Easy object wrapped in the Result for details.
+     def process_curl_response(curl)
+       if curl.response_code == 200
+         response = parser_for_xml(curl.body_str)
+         response = mark_new_entries(response)
+         store_summary_to_backend(response, curl)
+         response
+       end
+     end
+
+     # Sets options for the Curl::Easy object, including parameters for HTTP
+     # conditional GET.
+     def build_curl_easy
+       curl = new_curl_easy(@url)
+
+       # Many feeds have a 302 redirect to another URL. For more recent versions
+       # of Curl, we need to explicitly tell it to follow redirects.
+       curl.follow_location = true
+
+       set_header_options(curl)
+     end
+
+     def new_curl_easy(url)
+       Curl::Easy.new(url)
+     end
+
+     # Returns the summary hash for this feed from the backend store.
+     def summary_for_feed
+       @backend[key_for_cached] || { :digests => [ ] }
+     end
+
+     # Sets the conditional GET headers from the stored summary, if available.
+     # Returns the Curl::Easy object.
+     def set_header_options(curl)
+       summary = summary_for_feed
+
+       curl.headers['If-None-Match']     = summary[:etag]          unless summary[:etag].nil?
+       curl.headers['If-Modified-Since'] = summary[:last_modified] unless summary[:last_modified].nil?
+
+       curl
+     end
+
+     # Returns the key for the storage of the summary structure in the key-value system.
+     def key_for_cached
+       [ @options[:namespace], MD5.hexdigest(@url) ].join('_')
+     end
+
+     # Stores information about the retrieval, including ETag, Last-Modified,
+     # and MD5 digests of all entries, to the backend store. This enables
+     # conditional GET usage on subsequent requests and marking of entries as
+     # either new or seen.
+     def store_summary_to_backend(feed, curl)
+       headers = HttpHeaders.new(curl.header_str)
+
+       # Store info about the HTTP retrieval
+       summary = { }
+
+       summary.merge!(:etag => headers.etag) unless headers.etag.nil?
+       summary.merge!(:last_modified => headers.last_modified) unless headers.last_modified.nil?
+
+       # Store a digest for each feed entry so we can detect new entries on the
+       # next retrieval
+       new_digest_set = feed.entries.map do |e|
+         digest_for(e)
+       end
+
+       # Prepend the newest digest set, then trim to the configured number of
+       # retained sets (note the exclusive range, so exactly
+       # retained_digest_size sets are kept).
+       new_digest_set = summary_for_feed[:digests].unshift(new_digest_set)
+       new_digest_set = new_digest_set[0...@options[:retained_digest_size]]
+
+       summary.merge!( :digests => new_digest_set )
+       set_summary(summary)
+     end
+
+     def set_summary(summary)
+       @backend[key_for_cached] = summary
+     end
+
+     # Computes a unique signature for the FeedNormalizer::Entry object given.
+     # This signature will be the MD5 of enough fields to have a reasonable
+     # probability of determining if the entry is unique or not.
+     def digest_for(entry)
+       MD5.hexdigest( [ entry.title, entry.content, entry.date_published ].join )
+     end
+
+     def parser_for_xml(xml)
+       FeedNormalizer::FeedNormalizer.parse(xml)
+     end
+   end
+ end