cobweb (RubyGems) 0.0.77 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

data/README.textile +9 -1
data/lib/cobweb.rb +5 -5
data/lib/cobweb_crawler.rb +30 -22
data/lib/cobweb_version.rb +1 -1
data/lib/server.rb +4 -4
data/lib/stats.rb +2 -2
data/spec/cobweb/cobweb_job_spec.rb +1 -1
data/views/home.haml +1 -1
data/views/statistics.haml +11 -9
metadata +24 -24

data/README.textile CHANGED

@@ -1,5 +1,5 @@
-h1. Cobweb v0.0.77
+h1. Cobweb v1.0.0
 "@cobweb_gem":https://twitter.com/cobweb_gem
@@ -100,6 +100,8 @@ Creates a new crawler object based on a base_url
     ** :user_agent                    - user agent string to match in robots.txt (not sent as user_agent of requests yet) (default: cobweb)
     ** :crawl_limit_by_page           - sets the crawl counter to only use html page types when counting objects crawled
     ** :valid_mime_types              - an array of mime types that takes wildcards (eg 'text/*') defaults to ['*/*']
+    ** :direct_call_process_job       - boolean that specifies whether objects should be passed directly to a processing method or should be put onto a queue
 bc. crawler = Cobweb.new(:follow_redirects => false)
@@ -129,6 +131,12 @@ Simple get that obey's the options supplied in new.
 bc. crawler.head("http://www.google.com/")
+h4. Processing Queue
+The :processing_queue option is used to specify the class that contains the resque perform method to pass the content onto.  This class should be defined in your application to perform any tasks you wish to the content.  There are two options however, for running this.  Firstly, the default settings will push the content crawled onto a resque queue for that class.  This allows you the flexibility of running in queues on seperate machines etc.  The main drawback to this is that all your content is stored in redis within the queue.  This can be memory intensive if you are crawling large sites, or have large content that is being crawled.  To get around this you can specify that the crawl_job calls the perform method on the processing queue class directly, thereby not using memory in redis for the content.  This is performed by using the :direct_call_process_job. If you set that option to 'true' then instead of the job being queued, it will be executed within the crawl_job queue.
 h3. CobwebCrawler
 CobwebCrawler is the standalone crawling class.  If you don't want to use redis and just want to crawl the site within your ruby process, you can use this class.
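
Note on the Processing Queue text above: the two delivery modes can be pictured with a small sketch. This is not the gem's code. The class name ContentProcessor, the deliver helper, and the in-memory QUEUE array are all hypothetical stand-ins, and passing the class itself as :processing_queue is an assumption made to keep the example self-contained.

```ruby
# Hypothetical processing class: ContentProcessor is an illustrative name,
# not something shipped with cobweb or resque.
class ContentProcessor
  PROCESSED = []

  # Same shape as a resque job: a class-level perform method.
  def self.perform(content)
    PROCESSED << content[:url]
  end
end

# Stand-in for a resque queue held in redis (assumption for illustration).
QUEUE = []

# Hypothetical helper showing the two delivery modes described in the README.
def deliver(content, options)
  klass = options[:processing_queue]
  if options[:direct_call_process_job]
    # Direct mode: perform runs inside the crawl job, nothing is stored.
    klass.perform(content)
  else
    # Default mode: the whole content hash waits in the queue for a worker.
    QUEUE << [klass, content]
  end
end

page = { :url => "http://example.com/", :body => "<html></html>" }
deliver(page, :processing_queue => ContentProcessor, :direct_call_process_job => true)
deliver(page, :processing_queue => ContentProcessor)
```

In the default mode the full page body is held until a worker consumes it, which is the memory cost the README describes. In direct mode the body is only referenced for the duration of the call.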

data/lib/cobweb.rb CHANGED

@@ -116,7 +116,7 @@ class Cobweb
     end
     content = {:base_url => url}
     # check if it has already been cached
     if redis.get(unique_id) and @options[:cache]
       puts "Cache hit for #{url}" unless @options[:quiet]
@@ -162,7 +162,7 @@ class Cobweb
           content[:url] = uri.to_s
           content[:redirect_through] = [] if content[:redirect_through].nil?
           content[:redirect_through].insert(0, url)
           content[:response_time] = Time.now.to_f - request_time
         else
           content[:response_time] = Time.now.to_f - request_time
@@ -231,7 +231,7 @@ class Cobweb
         content[:mime_type] = "error/dnslookup"
         content[:headers] = {}
         content[:links] = {}
       rescue Timeout::Error => e
         puts "ERROR Timeout::Error: #{e.message}"
@@ -247,8 +247,8 @@ class Cobweb
         content[:headers] = {}
         content[:links] = {}
       end
+      content
     end
-    content
   end
   # Performs a HTTP HEAD request to the specified url applying the options supplied
@@ -368,7 +368,7 @@ class Cobweb
         content[:mime_type] = "error/dnslookup"
         content[:headers] = {}
         content[:links] = {}
       rescue Timeout::Error => e
         puts "ERROR Timeout::Error: #{e.message}"

data/lib/cobweb_crawler.rb CHANGED

@@ -42,7 +42,7 @@ class CobwebCrawler
     @crawl_options = crawl_options
-    @redis.sadd("queued", base_url) unless @redis.sismember("crawled", base_url) || @redis.sismember("queued", base_url)
+    @redis.sadd("queued", base_url) unless base_url.nil? || @redis.sismember("crawled", base_url) || @redis.sismember("queued", base_url)
     crawl_counter = @redis.scard("crawled").to_i
     queue_counter = @redis.scard("queued").to_i
@@ -58,34 +58,42 @@ class CobwebCrawler
             begin
               @stats.update_status("Requesting #{url}...")
               content = @cobweb.get(url)
-              @stats.update_status("Processing #{url}...")
+              if content.nil?
+                queue_counter = queue_counter - 1 #@redis.scard("queued").to_i
+              else
+                @stats.update_status("Processing #{url}...")
-              @redis.sadd "crawled", url.to_s
-              @redis.incr "crawl-counter"
+                @redis.sadd "crawled", url.to_s
+                @redis.incr "crawl-counter"
-              internal_links = ContentLinkParser.new(url, content[:body]).all_links(:valid_schemes => [:http, :https])
+                internal_links = ContentLinkParser.new(url, content[:body]).all_links(:valid_schemes => [:http, :https])
-              # select the link if its internal (eliminate external before expensive lookups in queued and crawled)
-              cobweb_links = CobwebLinks.new(@options)
-              internal_links = internal_links.select{|link| cobweb_links.internal?(link)}
+                # select the link if its internal (eliminate external before expensive lookups in queued and crawled)
+                cobweb_links = CobwebLinks.new(@options)
+                internal_links = internal_links.select{|link| cobweb_links.internal?(link)}
-              # reject the link if we've crawled it or queued it
-              internal_links.reject!{|link| @redis.sismember("crawled", link)}
-              internal_links.reject!{|link| @redis.sismember("queued", link)}
+                # reject the link if we've crawled it or queued it
+                internal_links.reject!{|link| @redis.sismember("crawled", link)}
+                internal_links.reject!{|link| @redis.sismember("queued", link)}
+                internal_links.reject!{|link| link.nil? || link.empty?}
-              internal_links.each do |link|
-                puts "Added #{link.to_s} to queue" if @debug
-                @redis.sadd "queued", link
-                queue_counter += 1
-              end
+                internal_links.each do |link|
+                  puts "Added #{link.to_s} to queue" if @debug
+                  @redis.sadd "queued", link unless link.nil?
+                  children = @redis.hget("navigation", url)
+                  children = [] if children.nil?
+                  children << link
+                  @redis.hset "navigation", url, children
+                  queue_counter += 1
+                end
-              crawl_counter = crawl_counter + 1 #@redis.scard("crawled").to_i
-              queue_counter = queue_counter - 1 #@redis.scard("queued").to_i
+                crawl_counter = crawl_counter + 1 #@redis.scard("crawled").to_i
+                queue_counter = queue_counter - 1 #@redis.scard("queued").to_i
-              @stats.update_statistics(content, crawl_counter, queue_counter)
-              @stats.update_status("Completed #{url}.")
-              yield content, @stats.get_statistics if block_given?
+                @stats.update_statistics(content, crawl_counter, queue_counter)
+                @stats.update_status("Completed #{url}.")
+                yield content, @stats.get_statistics if block_given?
+              end
             rescue => e
               puts "!!!!!!!!!!!! ERROR !!!!!!!!!!!!!!!!"
               ap e
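
Note on the hunk above: the link filtering now drops nil and empty links in addition to links already crawled or queued. A rough, self-contained sketch of that filtering follows. Ruby Set objects stand in for the redis sets "crawled" and "queued", and the internal? helper is a hypothetical host comparison (the gem delegates this to CobwebLinks). The sketch applies the nil/empty guard first, which differs from the order in the hunk.

```ruby
require "set"
require "uri"

# Stand-ins for the redis sets "crawled" and "queued" (assumption).
crawled = Set.new(["http://example.com/"])
queued  = Set.new(["http://example.com/about"])

# Hypothetical internal check based on the host only.
def internal?(link, host)
  URI.parse(link).host == host
rescue URI::InvalidURIError
  false
end

# Returns only the links that are worth adding to the queue.
def links_to_queue(links, host, crawled, queued)
  # Guard first so the later steps never see nil or "".
  present = links.reject { |link| link.nil? || link.empty? }
  # Eliminate external links before the set lookups.
  internal = present.select { |link| internal?(link, host) }
  # Drop anything already crawled or already waiting in the queue.
  internal.reject { |link| crawled.include?(link) || queued.include?(link) }
end

found = ["http://example.com/contact", "http://other.org/", "", nil,
         "http://example.com/about", "http://example.com/"]
new_links = links_to_queue(found, "example.com", crawled, queued)
```

With this input only "http://example.com/contact" survives: the external, blank, nil, queued, and crawled entries are all filtered out.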

data/lib/cobweb_version.rb CHANGED

@@ -3,7 +3,7 @@ class CobwebVersion
   # Returns a string of the current version
   def self.version
-    "0.0.77"
+    "1.0.0"
   end
 end

data/lib/server.rb CHANGED

@@ -14,7 +14,7 @@ class Server < Sinatra::Base
     @colors = ["#00366f", "#006ba0", "#3F0BDB", "#396CB3"]
     @crawls = []
-    @full_redis.smembers("cobweb_crawls").each do |crawl_id|
+    @full_redis.smembers("cobweb_crawls").each do |crawl_id|
       version = cobweb_version(crawl_id)
       redis = Redis::Namespace.new("cobweb-#{version}-#{crawl_id}", :redis => Redis.new(redis_options))
       stats = HashUtil.deep_symbolize_keys({
@@ -69,8 +69,9 @@ class Server < Sinatra::Base
   def cobweb_version(crawl_id)
     redis = Redis.new(redis_options)
-    key = redis.keys("cobweb-*-#{crawl_id}-crawl_details").first
-    key =~ /cobweb-(.*?)-(.*?)-crawl_details/
+    key = redis.keys("cobweb-*-#{crawl_id}:queued").first
+    key =~ /cobweb-(.*?)-(.*?):queued/
     cobweb_version = $1
   end
@@ -82,7 +83,6 @@ class Server < Sinatra::Base
   def self.start(options={})
     @options = options
     @options[:redis_options] = {} unless @options.has_key? :redis_options
-    ap @options
     unless Server.running?
       if @options[:run_as_server]
         puts "Starting Sinatra for cobweb v#{Cobweb.version}"
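
Note on the cobweb_version hunk above: the version is now recovered from a key ending in ":queued" instead of "-crawl_details". The extraction can be tried in isolation with the same regular expression. The helper name version_from_key and the sample keys are hypothetical; the key layout is inferred from the "cobweb-#{version}-#{crawl_id}" namespace shown in the same file.

```ruby
# Hypothetical helper that isolates the regex used in the hunk.
def version_from_key(key)
  return nil if key.nil?
  match = /cobweb-(.*?)-(.*?):queued/.match(key)
  # The first capture is the version, the second the crawl id.
  match && match[1]
end

# Sample keys are made up for illustration.
current = version_from_key("cobweb-1.0.0-abc123:queued")
missing = version_from_key("cobweb-1.0.0-abc123-crawl_details")
```

Because the first group is lazy, it stops at the first hyphen after "cobweb-", so a crawl id that itself contains hyphens does not leak into the version.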

data/lib/stats.rb CHANGED

@@ -24,13 +24,13 @@ class Stats
   # Removes the crawl from the running crawls and updates status
   def end_crawl(options, cancelled=false)
-    @full_redis.srem "cobweb_crawls", options[:crawl_id]
+    #@full_redis.srem "cobweb_crawls", options[:crawl_id]
     if cancelled
       @redis.hset "statistics", "current_status", CobwebCrawlHelper::CANCELLED
     else
       @redis.hset "statistics", "current_status", CobwebCrawlHelper::FINISHED
     end
-    @redis.del "crawl_details"
+    #@redis.del "crawl_details"
   end
   def get_crawled

data/spec/cobweb/cobweb_job_spec.rb CHANGED

@@ -1,6 +1,6 @@
 require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
-describe Cobweb, :local_only => true do
+describe Cobweb, :local_only => true, :disabled => true do
   before(:all) do
     #store all existing resque process ids so we don't kill them afterwards

data/views/home.haml CHANGED

@@ -35,7 +35,7 @@
       .content
         - if @crawls.empty?
           No crawls running just now
-        - else
+        - else
           %table.all{:border => "0", :cellpadding => "0", :cellspacing => "0"}
             %thead
               %tr

data/views/statistics.haml CHANGED

@@ -136,11 +136,12 @@
               %th Count
           %tbody
-            - @crawl[:statistics][:status_counts].keys.each do |status|
-              - unless status.nil? || status == ""
-                %tr
-                  %td= status
-                  %td= @crawl[:statistics][:status_counts][status]
+            - if @crawl[:statistics] && @crawl[:statistics][:status_counts]
+              - @crawl[:statistics][:status_counts].keys.each do |status|
+                - unless status.nil? || status == ""
+                  %tr
+                    %td= status
+                    %td= @crawl[:statistics][:status_counts][status]
   .medium
     .box
@@ -156,10 +157,11 @@
               %th Count
           %tbody
-            - @crawl[:statistics][:mime_counts].keys.each do |mime_type|
-              %tr
-                %td= mime_type
-                %td= @crawl[:statistics][:mime_counts][mime_type]
+            - if @crawl[:statistics][:mime_counts]
+              - @crawl[:statistics][:mime_counts].keys.each do |mime_type|
+                %tr
+                  %td= mime_type
+                  %td= @crawl[:statistics][:mime_counts][mime_type]
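
Note on the statistics.haml hunks above: the first guard checks both @crawl[:statistics] and the :status_counts key, while the second checks only @crawl[:statistics][:mime_counts] and would still raise if :statistics itself were nil. One way to make both lookups uniform is a small helper, sketched below. The name stat_counts is hypothetical and not part of the gem's views.

```ruby
# Hypothetical helper: returns the requested counts hash, or an empty
# hash when the crawl, its statistics, or the key is missing.
def stat_counts(crawl, key)
  stats = crawl && crawl[:statistics]
  (stats && stats[key]) || {}
end

# Sample data made up for illustration.
empty_crawl = { :statistics => nil }
full_crawl  = { :statistics => { :mime_counts => { "text/html" => 12 } } }

no_counts   = stat_counts(empty_crawl, :mime_counts)
mime_counts = stat_counts(full_crawl, :mime_counts)
```

Iterating over the returned hash with each is then safe in both cases, since an empty hash simply produces no table rows.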

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 0.0.77
+  version: 1.0.0
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-10-24 00:00:00.000000000 Z
+date: 2012-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: resque
-  requirement: &70355025183680 !ruby/object:Gem::Requirement
+  requirement: &70111197840140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025183680
+  version_requirements: *70111197840140
 - !ruby/object:Gem::Dependency
   name: redis
-  requirement: &70355025182000 !ruby/object:Gem::Requirement
+  requirement: &70111197838420 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025182000
+  version_requirements: *70111197838420
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &70355025180660 !ruby/object:Gem::Requirement
+  requirement: &70111197837080 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025180660
+  version_requirements: *70111197837080
 - !ruby/object:Gem::Dependency
   name: addressable
-  requirement: &70355025179740 !ruby/object:Gem::Requirement
+  requirement: &70111197836140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025179740
+  version_requirements: *70111197836140
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &70355025179160 !ruby/object:Gem::Requirement
+  requirement: &70111197835560 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,10 +65,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025179160
+  version_requirements: *70111197835560
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &70355025178560 !ruby/object:Gem::Requirement
+  requirement: &70111197834960 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025178560
+  version_requirements: *70111197834960
 - !ruby/object:Gem::Dependency
   name: sinatra
-  requirement: &70355025177760 !ruby/object:Gem::Requirement
+  requirement: &70111197834160 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025177760
+  version_requirements: *70111197834160
 - !ruby/object:Gem::Dependency
   name: thin
-  requirement: &70355025177180 !ruby/object:Gem::Requirement
+  requirement: &70111197833580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,10 +98,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025177180
+  version_requirements: *70111197833580
 - !ruby/object:Gem::Dependency
   name: haml
-  requirement: &70355025176740 !ruby/object:Gem::Requirement
+  requirement: &70111197833140 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -109,10 +109,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025176740
+  version_requirements: *70111197833140
 - !ruby/object:Gem::Dependency
   name: namespaced_redis
-  requirement: &70355025175980 !ruby/object:Gem::Requirement
+  requirement: &70111197832400 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -120,10 +120,10 @@ dependencies:
         version: 1.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *70355025175980
+  version_requirements: *70111197832400
 - !ruby/object:Gem::Dependency
   name: json
-  requirement: &70355025175280 !ruby/object:Gem::Requirement
+  requirement: &70111197831720 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -131,7 +131,7 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70355025175280
+  version_requirements: *70111197831720
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more performant than multi-threaded crawlers.  It
   is also a standalone crawler that has a sophisticated statistics monitoring interface