feedtosis 0.0.3.6

data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Justin S. Leitgeb
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,135 @@
+ = Description
+
+ Feedtosis fetches RSS and Atom feeds through an easy-to-use interface. It
+ uses FeedNormalizer for parsing and Curb for fetching. It helps by
+ automatically using conditional HTTP GET requests and by reliably pointing
+ out which entries in any given feed are new.
+
+ Feedtosis is designed to handle the bookkeeping of feed fetching for you,
+ so that things like HTTP conditional GET are trivial to use. It has a
+ simple interface and remains a lightweight component that delegates to
+ FeedNormalizer for parsing feeds and to the fantastic taf2-curb library
+ for fetching them.
+
+ == Installation
+
+ Assuming that you've followed the directions on gems.github.com to allow
+ your computer to install gems from GitHub, the following command will
+ install the Feedtosis library:
+
+   sudo gem install jsl-feedtosis
+
+ == Usage
+
+ Feedtosis is easy to use. Just create a client object and invoke the
+ +fetch+ method:
+
+   require 'feedtosis'
+   client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster')
+   result = client.fetch
+
+ +result+ will be a Feedtosis::Result object which delegates methods to
+ both the FeedNormalizer::Feed object and the Curl::Easy object used to
+ fetch the feed. Useful methods on this object include +entries+,
+ +new_entries+ and +response_code+, among many others (essentially all of
+ the methods that FeedNormalizer::Feed and Curl::Easy objects respond to,
+ minus their setters, can be called directly).
+
+ Note that since Feedtosis uses HTTP conditional GET, it may not actually
+ have received a full XML response from the server suitable for parsing
+ into entries. In that case, methods such as +entries+ on the
+ Feedtosis::Result will return +nil+. Depending on your application logic,
+ you may want to inspect the methods delegated to the Curl::Easy object,
+ such as +response_code+, for more information on what happened.
+
+ Remember that a response code of 304 means "Not Modified". In this case,
+ expect +entries+ and +new_entries+ to be +nil+, since the resource wasn't
+ downloaded, in keeping with the logic of HTTP conditional GET.
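A fetch loop can branch on these cases. A minimal sketch (the feed URL is illustrative, and running it requires network access):

```ruby
require 'feedtosis'

client = Feedtosis::Client.new('http://feeds.feedburner.com/wooster')
result = client.fetch

if result.response_code == 304
  # Not Modified: no body was downloaded, so entries/new_entries are nil
  puts 'feed unchanged since last fetch'
elsif result.entries
  result.new_entries.each { |entry| puts entry.title }
end
```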
+
+ On subsequent requests for a particular resource, Feedtosis updates
+ +new_entries+ to contain the feed entries that haven't been seen yet. In
+ most applications, your program will call the same batch of URLs multiple
+ times and process the elements in +new_entries+.
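Conceptually, the newness check behind +new_entries+ works like this (a simplified sketch using plain strings, not the gem's internal code):

```ruby
require 'digest/md5'

# Digests remembered from earlier fetches (illustrative values)
seen_digests = [Digest::MD5.hexdigest('old entry')]

entries = ['old entry', 'new entry']

# An entry is "new" when its digest wasn't seen in any previous fetch
new_entries = entries.reject do |e|
  seen_digests.include?(Digest::MD5.hexdigest(e))
end
# new_entries == ["new entry"]
```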
+
+ You will most likely want Feedtosis to remember details about the last
+ retrieval of a feed after the client is removed from memory. Feedtosis
+ uses Moneta, a unified interface to key-value storage systems, to remember
+ "summaries" of feeds that it has seen in the past. See the Customization
+ section below for details on how to configure this system.
+
+ == Customization
+
+ Feedtosis stores summaries of feeds in a key-value storage system. If no
+ options are given when creating a new Feedtosis::Client object, the
+ default is an in-memory store. The memory store is just a basic Ruby Hash,
+ so it won't keep track of feeds after a particular Client is removed from
+ memory. To configure a different backend, pass an options hash to the
+ Feedtosis client initialization:
+
+   url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml"
+   f = Feedtosis::Client.new(url, :backend => Moneta::Memcache.new(:server => 'localhost:1978'))
+   res = f.fetch
+
+ This example sets up a Memcache-protocol backend, which in this case
+ points to Tokyo Tyrant on port 1978.
+
+ Generally, Feedtosis supports all storage systems supported by Moneta:
+ pass any configured Moneta instance as the +backend+ option.
+
+ == Implementation
+
+ Feedtosis identifies new feed entries and figures out when conditional
+ GET can be used in retrieving resources. To accomplish this without
+ requiring the user to store information such as ETags and the date of the
+ last retrieved entry, Feedtosis stores a summary structure in the
+ configured key-value store (backed by Moneta). To make conditional GET
+ requests, Feedtosis records the Last-Modified date and the ETag of the
+ last request in the summary structure, which is stored under a namespaced
+ key consisting of the term 'feedtosis' (bet you won't have to worry
+ about name collisions on that one!) and the MD5 of the URL retrieved.
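Concretely, the storage key can be reproduced as follows (a sketch; the underscore separator matches how the client joins the namespace and the URL digest):

```ruby
require 'digest/md5'

url = 'http://feeds.feedburner.com/wooster'
key = ['feedtosis', Digest::MD5.hexdigest(url)].join('_')
# key looks like "feedtosis_<32 hex digits>"
```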
+
+ It can also be tricky to decide which feed entries are new, since many
+ feed sources don't include unique ids with their entries. Feedtosis
+ reliably keeps track of which entries in a feed are new by storing (in the
+ summary structure mentioned above) an MD5 signature of each entry,
+ computed from elements such as the published-at date, title and content.
+ This allows Feedtosis to cheaply determine (both in terms of computation
+ and storage) which feed entries should be presented to the user as "new".
+ Below is an example of a summary structure:
+
+   {
+     :etag => "4c8f-46ac09fbbe940",
+     :last_modified => "Mon, 25 May 2009 18:17:33 GMT",
+     :digests => [["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]]
+   }
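The per-entry digest in +:digests+ is built from the entry fields mentioned above. A simplified sketch with hypothetical field values:

```ruby
require 'digest/md5'

# Hypothetical entry fields (not a real FeedNormalizer entry)
title        = 'Firing up Feedtosis'
content      = '<p>Release announcement</p>'
published_at = 'Mon, 25 May 2009 18:17:33 GMT'

# Joining the fields and hashing yields a stable 32-character signature
digest = Digest::MD5.hexdigest([title, content, published_at].join)
```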
+
+ The data stored in the summary structure allows Feedtosis to be helpful
+ to the user without storing more data than necessary for efficient
+ operation.
+
+ The summary structure keeps an Array of Arrays of entry digests. The
+ reason is that some feeds, such as the Google blog search feeds, contain
+ slightly different but often-recurring results in their result sets.
+ Feedtosis therefore keeps complete sets of entry digests from previous
+ feed retrievals. The number of digest sets retained is configurable via
+ the :retained_digest_size option on Feedtosis client initialization.
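The digest-set rotation can be pictured with plain Arrays (a simplified sketch; a :retained_digest_size of 3 is illustrative):

```ruby
retained_digest_size = 3

# Digest sets from the three most recent fetches, newest first
digest_sets = [['aa'], ['bb'], ['cc']]

# A new fetch pushes its set onto the front, then the list is trimmed
# to the configured size, dropping the oldest set
new_set = ['dd']
digest_sets = digest_sets.unshift(new_set).first(retained_digest_size)
# digest_sets == [["dd"], ["aa"], ["bb"]]
```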
+
+ == HTML cleaning/sanitizing
+
+ Feedtosis doesn't do any feed sanitizing itself, as other libraries have
+ been built for this purpose. FeedNormalizer has methods for escaping
+ entries, but to strip HTML I suggest that you look at the Ruby gem
+ "sanitize".
+
+ == Credits
+
+ Thanks to Sander Hartlage (GitHub: Sander6) for useful feedback early in
+ the development of Feedtosis.
+
+ == Feedback
+
+ Please let me know if you have any problems with or questions about
+ Feedtosis.
+
+ = Author
+
+ Justin S. Leitgeb, mailto:justin@phq.org
data/Rakefile ADDED
@@ -0,0 +1,28 @@
+ require 'rubygems'
+ require 'spec'
+
+ require 'rake'
+ require 'spec/rake/spectask'
+ require 'rake/rdoctask'
+
+ require 'lib/feedtosis'
+
+ desc 'Run the specs for the library.'
+ Spec::Rake::SpecTask.new(:spec) do |t|
+   t.spec_opts = ["--format", "progress", "--colour"]
+   t.libs << 'lib'
+   t.verbose = true
+ end
+
+ desc "Run all the tests"
+ task :default => :spec
+
+ desc 'Generate documentation'
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = 'Feedtosis'
+   rdoc.options << '--line-numbers' << '--inline-source'
+   rdoc.rdoc_files.include('README.rdoc')
+   rdoc.rdoc_files.include('lib/feedtosis/**/*.rb')
+ end
+
data/feedtosis.gemspec ADDED
@@ -0,0 +1,48 @@
+ Gem::Specification.new do |s|
+   s.name = %q{feedtosis}
+   s.version = "0.0.3.6"
+
+   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
+   s.authors = ["Justin Leitgeb"]
+   s.date = %q{2009-07-15}
+   s.description = %q{Feedtosis finds new information in feeds quickly using smart fetching and matching of previously read entries}
+   s.email = %q{justin@phq.org}
+
+   s.files = ["lib/extensions/core/hash.rb",
+     "lib/extensions/feed_normalizer/feed_instance_methods.rb",
+     "lib/feedtosis/result.rb",
+     "lib/feedtosis/client.rb", "lib/feedtosis.rb", "LICENSE",
+     "feedtosis.gemspec", "Rakefile", "README.rdoc",
+     "spec/extensions/feed_normalizer/feed_instance_methods_spec.rb",
+     "spec/fixtures/http_headers/wooster.txt",
+     "spec/fixtures/xml/older_wooster.xml", "spec/fixtures/xml/wooster.xml",
+     "spec/feedtosis/client_spec.rb",
+     "spec/feedtosis/result_spec.rb",
+     "spec/spec_helper.rb"]
+
+   s.has_rdoc = true
+   s.homepage = %q{http://github.com/jsl/feedtosis}
+   s.rdoc_options = ["--charset=UTF-8"]
+   s.require_paths = ["lib"]
+   s.rubygems_version = %q{1.3.1}
+   s.summary = %q{Retrieves feeds using conditional GET and marks entries that you haven't seen before}
+   s.test_files = ["spec/spec_helper.rb", "spec/feedtosis/client_spec.rb", "spec/feedtosis/result_spec.rb"]
+
+   s.extra_rdoc_files = ["README.rdoc"]
+
+   s.rdoc_options += [
+     '--title', 'Feedtosis',
+     '--main', 'README.rdoc',
+     '--line-numbers',
+     '--inline-source'
+   ]
+
+   %w[ taf2-curb jsl-moneta jsl-http_headers feed-normalizer ].each do |dep|
+     s.add_dependency(dep)
+   end
+
+   if s.respond_to? :specification_version then
+     current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
+     s.specification_version = 2
+   end
+ end
data/lib/extensions/core/hash.rb ADDED
@@ -0,0 +1,7 @@
+ class Hash
+
+   # Merges +other_hash+ into self, with values already present in self
+   # taking precedence over those in +other_hash+.
+   def reverse_merge(other_hash)
+     other_hash.merge(self)
+   end
+
+ end
data/lib/extensions/feed_normalizer/feed_instance_methods.rb ADDED
@@ -0,0 +1,17 @@
+ # Extends FeedNormalizer::Feed with a method for detecting new items
+ # (aliased as new_entries for convenience).
+ module Feedtosis
+   module FeedInstanceMethods
+
+     # Returns only the entries that have not been seen before.
+     def new_items
+       self.entries.select do |e|
+         e.instance_variable_get(:@_seen) == false
+       end
+     end
+
+     alias :new_entries :new_items
+   end
+ end
+
+ FeedNormalizer::Feed.__send__(:include, Feedtosis::FeedInstanceMethods)
data/lib/feedtosis.rb ADDED
@@ -0,0 +1,17 @@
+ require 'rubygems'
+
+ require 'curb'
+ require 'http_headers'
+ require 'feed-normalizer'
+ require 'md5'
+ require 'uri'
+
+ lib_dirs = [ 'extensions', 'feedtosis' ].map do |d|
+   File.join(File.dirname(__FILE__), d)
+ end
+
+ lib_dirs.each do |d|
+   Dir[File.join(d, "**", "*.rb")].each do |file|
+     require file
+   end
+ end
data/lib/feedtosis/client.rb ADDED
@@ -0,0 +1,173 @@
+ module Feedtosis
+
+   # Feedtosis::Client is the primary interface to the feed reader. Call it
+   # with a url that was previously fetched while connected to the configured
+   # backend, and it will 1) only do a retrieval if deemed necessary based on
+   # the ETag and Last-Modified date of the previous response and 2) mark all
+   # entries retrieved as either new or not new. Entries retrieved are
+   # normalized using the feed-normalizer gem.
+   class Client
+     attr_reader :url, :options, :backend
+
+     DEFAULTS = {
+       :backend => Hash.new,
+
+       # The namespace will be prefixed to the key used for storage of the
+       # summary value. Based on your application needs, it may be useful to
+       # provide a custom prefix with initialization options.
+       :namespace => 'feedtosis',
+
+       # Some feed aggregators that we may be pulling from have entries that
+       # are present in one fetch and then disappear (Google blog search does
+       # this). For these cases, we can't rely on only the digests of the
+       # last fetch to guarantee "newness" of a feed that we may have
+       # previously consumed. We keep a number of previous sets of digests in
+       # order to make sure that we mark the correct entries as "new".
+       :retained_digest_size => 10
+     } unless defined?(DEFAULTS)
+
+     # Initializes a new Feedtosis client. It must be initialized with a
+     # valid URL as the first argument. The optional +options+ Hash may take
+     # the following arguments:
+     # * backend: a key-value store to be used for summary structures of feeds fetched. Moneta backends work well, but any object acting like a Hash is valid.
+     # * retained_digest_size: an Integer specifying the number of previous MD5 sets of entries to keep, used for new-entry detection
+     def initialize(url, options = { })
+       @url = url
+
+       raise ArgumentError, "Feedtosis::Client options must be in Hash form if provided" unless options.is_a?(Hash)
+       @options = options.reverse_merge(DEFAULTS)
+
+       @backend = @options[:backend]
+
+       unless @url.match(URI.regexp('http'))
+         raise ArgumentError, "Url #{@url} is not valid!"
+       end
+
+       unless @backend.respond_to?(:[]) && @backend.respond_to?(:[]=)
+         raise ArgumentError, "Backend needs to be a key-value store"
+       end
+     end
+
+     # Retrieves the latest entries from this feed. Returns a
+     # Feedtosis::Result object which delegates methods to the Curl::Easy
+     # object making the request and the FeedNormalizer::Feed object that may
+     # have been created from the HTTP response body.
+     def fetch
+       curl = build_curl_easy
+       curl.perform
+       feed = process_curl_response(curl)
+       Feedtosis::Result.new(curl, feed)
+     end
+
+     private
+
+     # Marks entries as either seen or not seen based on the unique signature
+     # of the entry, which is calculated by taking the MD5 of common
+     # attributes.
+     def mark_new_entries(response)
+       digests = summary_digests
+
+       # For each entry in the response object, mark @_seen as false if the
+       # digest of this entry doesn't exist in the cached summary.
+       response.entries.each do |e|
+         seen = digests.include?(digest_for(e))
+         e.instance_variable_set(:@_seen, seen)
+       end
+
+       response
+     end
+
+     # Returns an Array of entry digests for this feed. Since we keep a
+     # number of sets of digests, inject across these sets to accumulate
+     # unique identifiers.
+     def summary_digests
+       summary_for_feed[:digests].inject([]) do |r, e|
+         r |= e
+       end.uniq
+     end
+
+     # Processes the response by identifying which entries are new when the
+     # response code is 200. Otherwise returns +nil+; the Curl::Easy object
+     # remains available on the Result for the user to inspect.
+     def process_curl_response(curl)
+       if curl.response_code == 200
+         response = parser_for_xml(curl.body_str)
+         response = mark_new_entries(response)
+         store_summary_to_backend(response, curl)
+         response
+       end
+     end
+
+     # Sets options for the Curl::Easy object, including parameters for HTTP
+     # conditional GET.
+     def build_curl_easy
+       curl = new_curl_easy(@url)
+
+       # Many feeds have a 302 redirect to another URL. For more recent
+       # versions of Curl, we need to specify this explicitly.
+       curl.follow_location = true
+
+       set_header_options(curl)
+     end
+
+     def new_curl_easy(url)
+       Curl::Easy.new(url)
+     end
+
+     # Returns the summary hash for this feed from the backend store.
+     def summary_for_feed
+       @backend[key_for_cached] || { :digests => [ ] }
+     end
+
+     # Sets the conditional GET headers from the stored summary, if available.
+     def set_header_options(curl)
+       summary = summary_for_feed
+
+       curl.headers['If-None-Match'] = summary[:etag] unless summary[:etag].nil?
+       curl.headers['If-Modified-Since'] = summary[:last_modified] unless summary[:last_modified].nil?
+
+       curl
+     end
+
+     # Returns the key under which the summary structure is stored in the
+     # key-value system.
+     def key_for_cached
+       [ @options[:namespace], MD5.hexdigest(@url) ].join('_')
+     end
+
+     # Stores information about the retrieval, including ETag, Last-Modified,
+     # and MD5 digests of all entries, to the backend store. This enables
+     # conditional GET usage on subsequent requests and marking of entries as
+     # either new or seen.
+     def store_summary_to_backend(feed, curl)
+       headers = HttpHeaders.new(curl.header_str)
+
+       # Store info about the HTTP retrieval
+       summary = { }
+
+       summary.merge!(:etag => headers.etag) unless headers.etag.nil?
+       summary.merge!(:last_modified => headers.last_modified) unless headers.last_modified.nil?
+
+       # Store a digest for each feed entry so we can detect new entries on
+       # the next retrieval
+       new_digest_set = feed.entries.map do |e|
+         digest_for(e)
+       end
+
+       # Push the newest digest set onto the front and keep exactly
+       # :retained_digest_size sets
+       new_digest_set = summary_for_feed[:digests].unshift(new_digest_set)
+       new_digest_set = new_digest_set[0, @options[:retained_digest_size]]
+
+       summary.merge!( :digests => new_digest_set )
+       set_summary(summary)
+     end
+
+     def set_summary(summary)
+       @backend[key_for_cached] = summary
+     end
+
+     # Computes a unique signature for the FeedNormalizer::Entry object
+     # given. This signature is the MD5 of enough fields to have a
+     # reasonable probability of determining whether the entry is unique.
+     def digest_for(entry)
+       MD5.hexdigest( [ entry.title, entry.content, entry.date_published ].join )
+     end
+
+     def parser_for_xml(xml)
+       FeedNormalizer::FeedNormalizer.parse(xml)
+     end
+   end
+ end