cobweb 1.1.0 → 1.2.0
- checksums.yaml +5 -5
- data/README.textile +2 -1
- data/lib/cobweb.rb +0 -1
- data/lib/cobweb_crawler.rb +23 -22
- data/lib/cobweb_version.rb +1 -1
- metadata +19 -5
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 
-  data.tar.gz: 
+SHA256:
+  metadata.gz: 3b21efd1b03e9f4515045bdd69439fc8a7a4bf3f5aa896edebdbccd12d421d51
+  data.tar.gz: 4744738dfb7eaca5c5b6c1cabdebea8899e0f18cadb5532f1bb43870e98ff4e2
 SHA512:
-  metadata.gz: 
-  data.tar.gz: 
+  metadata.gz: 55b7f9878fcf6c3d97bd935fa59e61c54d745dd37cc75405e9af0e787d56e6543d79a2dedfca43912ed791fd59b51193b12c58722be9ce8c5d65428b6ea0af48
+  data.tar.gz: 8f767e79d76787cd84edbecd967e3136615a32723210a1abe149f8c126e4c8e1c6b3608f631f4ea7d3ad4fe66aa39702b18f73e28c0091ab02b2defdb28868d2
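
Both digest families change with the repack: the first block moves to SHA256 (the label newer RubyGems releases emit), and the SHA512 digests are regenerated for the new archive. As a minimal sketch of checking the published values with Ruby's standard Digest library, assuming the two artifacts from cobweb-1.2.0.gem have been unpacked locally (the paths are hypothetical):

require "digest"

# Hypothetical local paths to the files unpacked from cobweb-1.2.0.gem.
puts Digest::SHA256.file("metadata.gz").hexdigest   # compare against the SHA256 entry above
puts Digest::SHA512.file("data.tar.gz").hexdigest   # compare against the SHA512 entry above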
|
data/README.textile CHANGED

@@ -6,7 +6,6 @@ h1. Cobweb v1.1.0
 !https://gemnasium.com/stewartmckee/cobweb.png!
 !https://coveralls.io/repos/stewartmckee/cobweb/badge.png?branch=master(Coverage Status)!:https://coveralls.io/r/stewartmckee/cobweb
 
-
 h2. Intro
 
 CobWeb has three methods of running. Firstly it is a http client that allows get and head requests returning a hash of data relating to the requested resource. The second main function is to utilize this combined with the power of Resque to cluster the crawls allowing you crawl quickly. Lastly you can run the crawler with a block that uses each of the pages found in the crawl.

@@ -37,6 +36,7 @@ h3. Data Returned For Each Page
 
 * @:url@ - url of the resource requested
 * @:status_code@ - status code of the resource requested
+* @:response_time@ - response time of the resource requested
 * @:mime_type@ - content type of the resource
 * @:character_set@ - character set of content determined from content type
 * @:length@ - length of the content returned

@@ -99,6 +99,7 @@ Creates a new crawler object based on a base_url
 
 ** @:follow_redirects@ - transparently follows redirects and populates the :redirect_through key in the content hash (Default: true)
 ** @:redirect_limit@ - sets the limit to be used for concurrent redirects (Default: 10)
+** @:queue_system@ - sets the queue system :resque or :sidekiq (Default: :resque)
 ** @:processing_queue@ - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
 ** @:crawl_finished_queue@ - specifies the processing queue for statistics to be sent to after finishing crawling (Default: 'CobwebFinishedJob' when using resque, 'CrawlFinishedWorker' when using sidekiq)
 ** @:debug@ - enables debug output (Default: false)
data/lib/cobweb.rb CHANGED
data/lib/cobweb_crawler.rb CHANGED

@@ -4,13 +4,13 @@ require 'redis-namespace'
 
 # CobwebCrawler is a standalone crawler, it includes a built in statistics monitor using Sinatra.
 class CobwebCrawler
-
+
   # See README for more information on options available
   def initialize(options={})
     @options = options
-
+
     @statistic = {}
-
+
     @options[:redis_options] = {:host => "127.0.0.1"} unless @options.has_key? :redis_options
     if @options.has_key? :crawl_id
       @crawl_id = @options[:crawl_id]

@@ -18,7 +18,7 @@ class CobwebCrawler
       @crawl_id = Digest::MD5.hexdigest(DateTime.now.inspect.to_s)
       @options[:crawl_id] = @crawl_id
     end
-
+
     @redis = Redis::Namespace.new("cobweb-#{Cobweb.version}-#{@crawl_id}", :redis => RedisConnection.new(@options[:redis_options]))
     @options[:internal_urls] = [] if @options[:internal_urls].nil?
     @options[:internal_urls].map{|url| @redis.sadd("internal_urls", url)}

@@ -27,27 +27,28 @@ class CobwebCrawler
 
     @options[:crawl_linked_external] = false unless @options.has_key? :crawl_linked_external
 
-    @options[:treat_https_as_http] = true unless @options.has_key? :treat_https_as_http
+    @options[:treat_https_as_http] = true unless @options.has_key? :treat_https_as_http
     @debug = @options[:debug]
-
+
     @stats = Stats.new(@options.merge(:crawl_id => @crawl_id))
     if @options[:web_statistics]
+      require "server"
       Server.start(@options)
     end
-
+
     @cobweb = Cobweb.new(@options)
   end
-
+
   # Initiates a crawl starting at the base_url and applying the options supplied. Can also take a block that is executed and passed content hash and statistic hash'
   def crawl(base_url, crawl_options = {}, &block)
     @options[:base_url] = base_url unless @options.has_key? :base_url
     @options[:thread_count] = 1 unless @options.has_key? :thread_count
-
+
     @options[:internal_urls] << base_url if @options[:internal_urls].empty?
     @redis.sadd("internal_urls", base_url) if @options[:internal_urls].empty?
-
+
     @crawl_options = crawl_options
-
+
     @redis.sadd("queued", base_url) unless base_url.nil? || @redis.sismember("crawled", base_url) || @redis.sismember("queued", base_url)
     @crawl_counter = @redis.scard("crawled").to_i
     @queue_counter = @redis.scard("queued").to_i

@@ -55,7 +56,7 @@ class CobwebCrawler
     @threads = []
     begin
       @stats.start_crawl(@options)
-
+
       @threads << Thread.new do
         Thread.abort_on_exception = true
         spawn_thread(&block)

@@ -73,7 +74,7 @@ class CobwebCrawler
         end
         sleep 1
       end
-
+
     ensure
       @stats.end_crawl(@options)
     end

@@ -96,8 +97,8 @@ class CobwebCrawler
       @stats.update_status("Processing #{url}...")
 
       @redis.sadd "crawled", url.to_s
-      @redis.incr "crawl-counter"
-
+      @redis.incr "crawl-counter"
+
       document_links = ContentLinkParser.new(url, content[:body]).all_links(:valid_schemes => [:http, :https]).uniq
 
 

@@ -106,18 +107,18 @@ class CobwebCrawler
 
       internal_links = document_links.select{|link| cobweb_links.internal?(link) || (@options[:crawl_linked_external] && cobweb_links.internal?(url.to_s) && !cobweb_links.matches_external?(link))}
 
-      # if the site has the same content for http and https then normalize to http
-
+      # if the site has the same content for http and https then normalize to http
+
       if @options[:treat_https_as_http]
         internal_links.map!{|link| link.gsub(/^https/, "http")}
       end
-
+
 
       # reject the link if we've crawled it or queued it
       internal_links.reject!{|link| @redis.sismember("crawled", link)}
       internal_links.reject!{|link| @redis.sismember("queued", link)}
       internal_links.reject!{|link| link.nil? || link.empty?}
-
+
       internal_links.each do |link|
         puts "Added #{link.to_s} to queue" if @debug
         @redis.sadd "queued", link unless link.nil?

@@ -134,10 +135,10 @@ class CobwebCrawler
         @redis.sadd("inbound_links_#{Digest::MD5.hexdigest(target_uri.to_s)}", UriHelper.parse(url).to_s)
       end
     end
-
+
     @crawl_counter = @redis.scard("crawled").to_i
     @queue_counter = @redis.scard("queued").to_i
-
+
     @stats.update_statistics(content, @crawl_counter, @queue_counter)
     @stats.update_status("Completed #{url}.")
     yield content, @stats.get_statistics if block_given?

@@ -161,7 +162,7 @@ class CobwebCrawler
   def running_thread_count
     @threads.map{|t| t.status}.select{|status| status=="run" || status == "sleep"}.count
   end
-
+
 end
 
 # Monkey patch into String a starts_with method
data/lib/cobweb_version.rb CHANGED
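
The version-file hunk is not expanded in this view, but given the Cobweb.version call in the crawler above, the +1 -1 change presumably just bumps a returned string. A hypothetical sketch (the class and method layout are assumed, not taken from the diff):

class Cobweb
  # Hypothetical layout: the release bump changes this string, which also
  # feeds the Redis namespace "cobweb-#{Cobweb.version}-#{@crawl_id}".
  def self.version
    "1.2.0"
  end
end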
metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 1.1.0
+  version: 1.2.0
 platform: ruby
 authors:
 - Stewart McKee
 autorequire: 
 bindir: bin
 cert_chain: []
-date: 
+date: 2019-02-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake

@@ -44,14 +44,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.6.
+        version: 1.6.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.6.
+        version: 1.6.0
 - !ruby/object:Gem::Dependency
   name: addressable
   requirement: !ruby/object:Gem::Requirement

@@ -220,6 +220,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: bundle-audit
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more performant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface

@@ -616,7 +630,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     version: '0'
 requirements: []
 rubyforge_project: 
-rubygems_version: 2.
+rubygems_version: 2.7.7
 signing_key: 
 specification_version: 4
 summary: Cobweb is a web crawler that can use resque to cluster crawls to quickly