cobweb 0.0.67 → 0.0.68
- data/README.textile +59 -12
- data/lib/cobweb_version.rb +1 -1
- data/lib/crawl.rb +51 -0
- data/lib/crawl_job.rb +10 -9
- data/lib/stats.rb +8 -4
- data/spec/cobweb/cobweb_job_spec.rb +110 -78
- data/spec/cobweb/crawl_spec.rb +74 -0
- metadata +24 -22
data/README.textile CHANGED

@@ -1,5 +1,5 @@
 
-h1. Cobweb v0.0.67
+h1. Cobweb v0.0.68
 
 !https://secure.travis-ci.org/stewartmckee/cobweb.png?branch=master!
 
@@ -17,13 +17,9 @@ h3. Standalone
 
 CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.
 
-bq. crawler = CobwebCrawler.new(:cache => 600);
-
-bq. stats = crawler.crawl("http://www.pepsico.com")
-
 While the crawler is running, you can view statistics on http://localhost:4567
 
-h3. Data Returned
+h3. Data Returned For Each Page
 The data available in the returned hash are:
 
 * :url - url of the resource requested
@@ -44,14 +40,40 @@ h3. Data Returned
 
 The source for the links can be overridden, contact me for the syntax (don't have time to put it into this documentation, will as soon as i have time!)
 
+h3. Statistics
+
+Statistics are available during the crawl, you can create a Stats object passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. It is also returned by default from the CobwebCrawler.crawl standalone crawling method.
+
+The data available within statistics is as follows:
+
+* :average_length - average size of each object
+* :minimum_length - minimum length returned
+* :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
+* :maximum_length - maximum length of object received
+* :status_counts - hash with the status returned as the key and value as number of pages (eg: {"404" => 1, "200" => 1})
+* :mime_counts - hash containing the mime type as key and count of pages as value (eg: {"text/html" => 8, "image/jpeg" => 25})
+* :queue_counter - size of queue waiting to be processed for crawl
+* :page_count - number of html pages retrieved
+* :total_length - total size of data received
+* :current_status - current status of crawl
+* :asset_count - count of non-html objects received
+* :page_size - total size of pages received
+* :average_response_time - average response time of all objects
+* :crawl_counter - number of objects that have been crawled
+* :minimum_response_time - quickest response time of crawl
+* :maximum_response_time - longest response time of crawl
+* :asset_size - total size of all non-html objects received
+
 h2. Installation
 
 Install crawler as a gem
 
-
+bc. gem install cobweb
 
 h2. Usage
 
+h3. Cobweb
+
 h4. new(options)
 
 Creates a new crawler object based on a base_url
@@ -76,7 +98,7 @@ Creates a new crawler object based on a base_url
 ** :crawl_limit_by_page - sets the crawl counter to only use html page types when counting objects crawled
 ** :valid_mime_types - an array of mime types that takes wildcards (eg 'text/*') defaults to ['*/*']
 
-
+bc. crawler = Cobweb.new(:follow_redirects => false)
 
 h4. start(base_url)
 
@@ -86,7 +108,7 @@ Starts a crawl through resque. Requires the :processing_queue to be set to a va
 
 Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain is added to the internal_urls automatically.
 
-
+bc. crawler.start("http://www.google.com/")
 
 h4. get(url)
 
@@ -94,7 +116,7 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
-
+bc. crawler.get("http://www.google.com/")
 
 h4. head(url)
 
@@ -102,10 +124,35 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
-
+bc. crawler.head("http://www.google.com/")
+
+h3. CobwebCrawler
+
+CobwebCrawler is the standalone crawling class. If you don't want to use redis and just want to crawl the site within your ruby process, you can use this class.
+
+bc. crawler = CobwebCrawler.new(:cache => 600)
+statistics = crawler.crawl("http://www.pepsico.com")
+
+You can also run within a block and get access to each page as it is being crawled.
+
+bc. statistics = CobwebCrawler.new(:cache => 600).crawl("http://www.pepsico.com") do |page|
+  puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
+end
+puts "Finished Crawl in "
+
+
+
+h3. Crawl
+
+The crawl class is a helper class to assist in getting information about a crawl and to perform functions against the crawl
+
+bc. crawl = Crawl.new(options)
+
+* options - the hash of options passed into Cobweb.new (must include a :crawl_id)
+
 
 
-
+h2. Contributing/Testing
 
 Feel free to contribute small or large bits of code, just please make sure that there are rspec test for the features your submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
 
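The Statistics section added above describes the Stats interface without a complete example; a minimal sketch of the described usage (the empty :redis_options hash and the crawl_id variable are assumptions, standing in for your own Redis connection options and the id of a running crawl):

bc. stats = Stats.new(:redis_options => {}, :crawl_id => crawl_id)  # option keys as described in the README
summary = stats.get_statistics  # hash with the keys listed in the Statistics section
puts summary[:page_count]
puts summary[:current_status]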
data/lib/cobweb_version.rb CHANGED
data/lib/crawl.rb ADDED

@@ -0,0 +1,51 @@
+# The crawl class gives easy access to information about the crawl, and gives the ability to stop a crawl
+class Crawl
+
+  attr_accessor :id
+
+  BATCH_SIZE = 200
+  FINISHED = "Finished"
+  STARTING = "Starting"
+  CANCELLED = "Cancelled"
+
+  def initialize(data)
+    @data = data
+    @stats = Stats.new(data)
+  end
+
+  def destroy
+    queue_name = "cobweb_crawl_job"
+    # set status as cancelled now so that we don't enqueue any further pages
+    self.statistics.end_crawl(@data, true)
+
+    job_items = Resque.peek(queue_name, 0, BATCH_SIZE)
+    batch_count = 0
+    until job_items.empty?
+
+      job_items.each do |item|
+        if item["args"][0]["crawl_id"] == id
+          # remove this job from the queue
+          Resque.dequeue(CrawlJob, item["args"][0])
+        end
+      end
+
+      position = batch_count*BATCH_SIZE
+      batch_count += 1
+      job_items = Resque.peek(queue_name, position, BATCH_SIZE)
+    end
+
+  end
+
+  def statistics
+    @stats
+  end
+
+  def status
+    statistics.get_status
+  end
+
+  def id
+    @data[:crawl_id]
+  end
+
+end
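Taken together, the new class gives a small control surface over a running crawl; a minimal usage sketch (the crawl id value is illustrative):

bc. crawl = Crawl.new(:crawl_id => "my_crawl_id")  # hypothetical id of a crawl started elsewhere
puts crawl.status  # delegates to Stats#get_status
crawl.destroy      # marks the crawl Cancelled, then walks cobweb_crawl_job in BATCH_SIZE pages, dequeuing this crawl's jobs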
data/lib/crawl_job.rb CHANGED

@@ -14,6 +14,7 @@ class CrawlJob
     # change all hash keys to symbols
     content_request = HashUtil.deep_symbolize_keys(content_request)
     @content_request = content_request
+    @crawl = Crawl.new(content_request)
 
     content_request[:redis_options] = {} unless content_request.has_key? :redis_options
     content_request[:crawl_limit_by_page] = false unless content_request.has_key? :crawl_limit_by_page
@@ -27,8 +28,7 @@ class CrawlJob
     # check we haven't crawled this url before
     unless @redis.sismember "crawled", content_request[:url]
       # if there is no limit or we're still under it lets get the url
-      if within_crawl_limits?(content_request[:crawl_limit])
-        puts "cbpl: #{content_request[:url]}" if content_request[:crawl_limit_by_page]
+      if within_crawl_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
         content = Cobweb.new(content_request).get(content_request[:url], content_request)
         if content_request[:url] == @redis.get("original_base_url")
           @redis.set("crawled_base_url", content[:base_url])
@@ -55,7 +55,7 @@ class CrawlJob
 
         # set the base url if this is the first page
         set_base_url @redis, content, content_request
-
+
         @cobweb_links = CobwebLinks.new(content_request)
         if within_queue_limits?(content_request[:crawl_limit])
           internal_links = ContentLinkParser.new(content_request[:url], content[:body], content_request).all_links(:valid_schemes => [:http, :https])
@@ -69,7 +69,11 @@ class CrawlJob
           internal_links.reject! { |link| @redis.sismember("queued", link) }
 
           internal_links.each do |link|
-
+            puts link
+            puts "Not enqueuing due to cancelled crawl" if @crawl.status == Crawl::CANCELLED
+            if within_queue_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
+              enqueue_content(content_request, link)
+            end
           end
         end
 
@@ -92,7 +96,6 @@ class CrawlJob
     if content_request[:crawl_limit_by_page]
       if content[:mime_type].match("text/html")
         increment_crawl_counter
-        ap "clbp: #{crawl_counter}"
       end
     else
       increment_crawl_counter
@@ -112,8 +115,6 @@ class CrawlJob
     end
 
     decrement_queue_counter
-    puts content_request[:crawl_limit]
-    print_counters
     # if there's nothing left queued or the crawled limit has been reached
     if content_request[:crawl_limit].nil? || content_request[:crawl_limit] == 0
       if queue_counter + crawl_started_counter - crawl_counter == 0
@@ -125,10 +126,10 @@ class CrawlJob
 
   end
 
-  # Sets the crawl status to
+  # Sets the crawl status to Crawl::FINISHED and enqueues the crawl finished job
   def self.finished(content_request)
     # finished
-    if @
+    if @crawl.status != Crawl::FINISHED and @crawl.status != Crawl::CANCELLED
       ap "CRAWL FINISHED #{content_request[:url]}, #{counters}, #{@redis.get("original_base_url")}, #{@redis.get("crawled_base_url")}" if content_request[:debug]
       @stats.end_crawl(content_request)
 
data/lib/stats.rb CHANGED

@@ -16,13 +16,17 @@ class Stats
         @redis.hset "crawl_details", key, options[key].to_s
       end
     end
-    @redis.hset "statistics", "current_status",
+    @redis.hset "statistics", "current_status", Crawl::STARTING
   end
 
   # Removes the crawl from the running crawls and updates status
-  def end_crawl(options)
+  def end_crawl(options, cancelled=false)
     @full_redis.srem "cobweb_crawls", options[:crawl_id]
-
+    if cancelled
+      @redis.hset "statistics", "current_status", Crawl::CANCELLED
+    else
+      @redis.hset "statistics", "current_status", Crawl::FINISHED
+    end
     @redis.del "crawl_details"
   end
 
@@ -154,7 +158,7 @@ class Stats
 
   # Sets the current status of the crawl
   def update_status(status)
-
+    #@redis.hset("statistics", "current_status", status) unless status == Crawl::CANCELLED
   end
 
   # Returns the current status of the crawl
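The new cancelled flag on end_crawl is the hook Crawl#destroy relies on; a sketch of the two exit paths (crawl_id stands for the id used when the crawl was created):

bc. stats = Stats.new(:crawl_id => crawl_id)
stats.end_crawl({:crawl_id => crawl_id})        # normal finish: current_status becomes Crawl::FINISHED
stats.end_crawl({:crawl_id => crawl_id}, true)  # cancellation: current_status becomes Crawl::CANCELLED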
data/spec/cobweb/cobweb_job_spec.rb CHANGED

@@ -5,13 +5,13 @@ describe Cobweb, :local_only => true do
   before(:all) do
     #store all existing resque process ids so we don't kill them afterwards
    @existing_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
-
+
     # START WORKERS ONLY FOR CRAWL QUEUE SO WE CAN COUNT ENQUEUED PROCESS AND FINISH QUEUES
     puts "Starting Workers... Please Wait..."
     `mkdir log`
     io = IO.popen("nohup rake resque:workers PIDFILE=./tmp/pids/resque.pid COUNT=1 QUEUE=cobweb_crawl_job > log/output.log &")
     puts "Workers Started."
-
+
   end
 
   before(:each) do
@@ -19,56 +19,90 @@ describe Cobweb, :local_only => true do
     @base_page_count = 77
     clear_queues
   end
-
+
+  describe "when crawl is cancelled" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+    it "should not crawl anything if nothing has started" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 0
+    end
+
+    it "should not complete the crawl when cancelled" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      sleep 6
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should > 0
+      Resque.size("cobweb_process_job").should_not == @base_page_count
+    end
+
+  end
   describe "with no crawl limit" do
     before(:each) do
-
-
-
-
-
-
-
-
-    end
-
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+
     it "should crawl entire site" do
-
-
-
-
+      ap Resque.size("cobweb_process_job")
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      ap @stat.get_statistics
+      Resque.size("cobweb_process_job").should == @base_page_count
     end
     it "detect crawl finished once" do
-
-
-
-
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_finished_job").should == 1
+    end
+  end
+  describe "with limited mime_types" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :quiet => true,
+        :cache => nil,
+        :valid_mime_types => ["text/html"]
+      }
+      @cobweb = Cobweb.new @request
+    end
+
+    it "should only crawl html pages" do
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 8
+
+      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
+      mime_types.count.should == 8
+      mime_types.map{|m| m.should == "text/html"}
+      mime_types.select{|m| m=="text/html"}.count.should == 8
     end
+
   end
-  describe "with limited mime_types" do
-    before(:each) do
-      @request = {
-        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
-        :quiet => true,
-        :cache => nil,
-        :valid_mime_types => ["text/html"]
-      }
-      @cobweb = Cobweb.new @request
-    end
-
-    it "should only crawl html pages" do
-      crawl = @cobweb.start(@base_url)
-      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
-      wait_for_crawl_finished crawl[:crawl_id]
-      Resque.size("cobweb_process_job").should == 8
-
-      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
-      mime_types.count.should == 8
-      mime_types.map{|m| m.should == "text/html"}
-      mime_types.select{|m| m=="text/html"}.count.should == 8
-    end
-
-  end
   describe "with a crawl limit" do
     before(:each) do
       @request = {
@@ -77,31 +111,31 @@ describe Cobweb, :local_only => true do
       :cache => nil
     }
   end
-
+
   describe "limit to 1" do
     before(:each) do
       @request[:crawl_limit] = 1
       @cobweb = Cobweb.new @request
     end
-
+
     it "should not crawl the entire site" do
       crawl = @cobweb.start(@base_url)
      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == @base_page_count
-    end
+    end
     it "should only crawl 1 page" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should == 1
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
   end
 
   describe "for pages only" do
@@ -110,7 +144,7 @@ describe Cobweb, :local_only => true do
       @request[:crawl_limit] = 5
       @cobweb = Cobweb.new @request
     end
-
+
     it "should only use html pages towards the crawl limit" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -126,19 +160,19 @@ describe Cobweb, :local_only => true do
       @request[:crawl_limit] = 10
       @cobweb = Cobweb.new @request
     end
-
+
     it "should not crawl the entire site" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == @base_page_count
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
     it "should only crawl 10 objects" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -146,40 +180,40 @@ describe Cobweb, :local_only => true do
       Resque.size("cobweb_process_job").should == 10
     end
   end
-
+
   describe "limit to 100" do
     before(:each) do
       @request[:crawl_limit] = 100
       @cobweb = Cobweb.new @request
     end
-
+
     it "should crawl the entire sample site" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should == @base_page_count
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
     it "should not crawl 100 pages" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == 100
-    end
+    end
   end
 end
 
 after(:all) do
-
+
   @all_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
   command = "kill -9 #{(@all_processes - @existing_processes).join(" ")}"
   IO.popen(command)
-
+
   clear_queues
 end
 
@@ -189,25 +223,23 @@ def wait_for_crawl_finished(crawl_id, timeout=20)
   counter = 0
   start_time = Time.now
   while(running?(crawl_id) && Time.now < start_time + timeout) do
-
-
-
-
+    sleep 0.5
+  end
+  if Time.now > start_time + timeout
+    raise "End of crawl not detected"
+  end
 end
-end
-
-def running?(crawl_id)
-  @stat.get_status != "Crawl Finished"
-end
 
-def
-
-  Resque.remove_queue(queue)
+def running?(crawl_id)
+  @stat.get_status != Crawl::FINISHED and @stat.get_status != Crawl::CANCELLED
 end
-
-  Resque.size("cobweb_process_job").should == 0
-  Resque.size("cobweb_finished_job").should == 0
-  Resque.peek("cobweb_process_job", 0, 200).should be_empty
-end
 
+def clear_queues
+  Resque.queues.each do |queue|
+    Resque.remove_queue(queue)
+  end
 
+  Resque.size("cobweb_process_job").should == 0
+  Resque.size("cobweb_finished_job").should == 0
+  Resque.peek("cobweb_process_job", 0, 200).should be_empty
+end
data/spec/cobweb/crawl_spec.rb ADDED

@@ -0,0 +1,74 @@
+require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+describe Crawl do
+
+  # this spec tests the crawl object
+
+  describe "initialize" do
+    describe "without data" do
+      it "should raise an exception" do
+        lambda {Crawl.new}.should raise_exception
+      end
+    end
+
+    describe "with data" do
+      before(:each) do
+        data = {:crawl_id => "asdf"}
+        @crawl = Crawl.new(data)
+      end
+      it "should create a crawl object" do
+        @crawl.should be_an_instance_of Crawl
+      end
+      it "should return an id" do
+        @crawl.should respond_to "id"
+      end
+      it "should return a status" do
+        @crawl.should respond_to "status"
+      end
+
+      describe "the destroy method" do
+        before(:each) do
+          if Resque.size("cobweb_crawl_job") > 0
+            raise "cobweb_crawl_job is not empty, do not run specs until it is!"
+          end
+          105.times do |item_count|
+            2.times do |crawl_count|
+              item_data = {:crawl_id => "crawl_#{crawl_count}_id", :url => "http://crawl#{crawl_count}.com/page#{item_count}.html"}
+              Resque.enqueue(CrawlJob, item_data)
+            end
+          end
+        end
+        after(:each) do
+          Resque.remove_queue("cobweb_crawl_job")
+        end
+        it "should have a queue length of 210" do
+          Resque.size("cobweb_crawl_job").should == 210
+        end
+        describe "after called" do
+          before(:each) do
+            @crawl = Crawl.new({:crawl_id => "crawl_0_id"})
+            @crawl.destroy
+          end
+          it "should delete only the crawl specified" do
+            Resque.size("cobweb_crawl_job").should == 105
+          end
+          it "should not contain any crawl_0_id" do
+            Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+              item["crawl_id"].should_not == "crawl_0_id"
+            end
+          end
+          it "should only contain crawl_1_id" do
+            Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+              item["crawl_id"].should == "crawl_1_id"
+            end
+          end
+          it "should set status to 'Cancelled'" do
+            @crawl.status.should == "Cancelled"
+          end
+        end
+      end
+    end
+  end
+
+
+end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 0.0.67
+  version: 0.0.68
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-09-
+date: 2012-09-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: resque
-  requirement: &
+  requirement: &70324863540700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863540700
 - !ruby/object:Gem::Dependency
   name: redis
-  requirement: &
+  requirement: &70324863539560 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863539560
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &70324863538960 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
      version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863538960
 - !ruby/object:Gem::Dependency
   name: addressable
-  requirement: &
+  requirement: &70324863537700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863537700
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &70324863537120 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,10 +65,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863537120
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &
+  requirement: &70324863536500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863536500
 - !ruby/object:Gem::Dependency
   name: sinatra
-  requirement: &
+  requirement: &70324863535620 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863535620
 - !ruby/object:Gem::Dependency
   name: thin
-  requirement: &
+  requirement: &70324863534860 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,10 +98,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863534860
 - !ruby/object:Gem::Dependency
   name: haml
-  requirement: &
+  requirement: &70324863534000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -109,10 +109,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863534000
 - !ruby/object:Gem::Dependency
   name: namespaced_redis
-  requirement: &
+  requirement: &70324863533220 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -120,7 +120,7 @@ dependencies:
       version: 1.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863533220
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more perofmant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface
@@ -136,6 +136,7 @@ files:
 - spec/cobweb/cobweb_links_spec.rb
 - spec/cobweb/cobweb_spec.rb
 - spec/cobweb/content_link_parser_spec.rb
+- spec/cobweb/crawl_spec.rb
 - spec/cobweb/robots_spec.rb
 - spec/samples/robots.txt
 - spec/samples/sample_html_links.html
@@ -315,6 +316,7 @@ files:
 - lib/cobweb_process_job.rb
 - lib/cobweb_version.rb
 - lib/content_link_parser.rb
+- lib/crawl.rb
 - lib/crawl_job.rb
 - lib/encoding_safe_process_job.rb
 - lib/hash_util.rb