cobweb 1.0.21 → 1.0.22
- checksums.yaml +4 -4
- data/README.textile +21 -21
- data/bin/cobweb +56 -0
- data/lib/cobweb_version.rb +1 -1
- metadata +9 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 981b3f18bad361e4a8b50a3387008a10029887df
+  data.tar.gz: 0200c604402af9125b754f756a0740bae0df1e70
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bfd699c658f5ec55607c7055205cf60454fca789a921bda5284f9126299c6dd1edac74d9376e38dd506139c192aa991ea4caf6e4b48d613de9a282abdaccbe5e
+  data.tar.gz: 4b3693d8450ff8364a312691ca7a435dde636c9e4fa8dbb0cb8e93cc47b65cf0bf5f78b074cba6c8705ed7d76553c0adb0b81d50937bb303531d2d672058e0df
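The SHA1 and SHA512 pairs above are digests of the gem's metadata.gz and data.tar.gz archives. As an illustrative check only (the file names assume the gem has been fetched and unpacked locally, e.g. with gem fetch cobweb -v 1.0.22 followed by tar -xf cobweb-1.0.22.gem), the published SHA1 values could be compared in Ruby like this:

require 'digest'

# SHA1 values copied from checksums.yaml above; paths assume metadata.gz and
# data.tar.gz sit in the current directory after unpacking the .gem archive.
{
  'metadata.gz' => '981b3f18bad361e4a8b50a3387008a10029887df',
  'data.tar.gz' => '0200c604402af9125b754f756a0740bae0df1e70'
}.each do |file, expected|
  actual = Digest::SHA1.file(file).hexdigest
  puts "#{file}: #{actual == expected ? 'ok' : 'MISMATCH'}"
end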
data/README.textile
CHANGED
@@ -1,4 +1,4 @@
-h1. Cobweb v1.0.
+h1. Cobweb v1.0.22
 
 "@cobweb_gem":https://twitter.com/cobweb_gem
 !https://badge.fury.io/rb/cobweb.png!:http://badge.fury.io/rb/cobweb
@@ -6,18 +6,18 @@ h1. Cobweb v1.0.20
 !https://coveralls.io/repos/stewartmckee/cobweb/badge.png?branch=master(Coverage Status)!:https://coveralls.io/r/stewartmckee/cobweb
 
 
-h2. Intro
-
+h2. Intro
+
 CobWeb has three methods of running. Firstly it is a http client that allows get and head requests returning a hash of data relating to the requested resource. The second main function is to utilize this combined with the power of Resque to cluster the crawls allowing you crawl quickly. Lastly you can run the crawler with a block that uses each of the pages found in the crawl.
-
+
 I've created a sample app to help with setting up cobweb at http://github.com/stewartmckee/cobweb_sample
-
+
 h3. Resque
 
 When running on resque, passing in a Class and queue name it will enqueue all resources to this queue for processing, passing in the hash it has generated. You then implement the perform method to process the resource for your own application.
-
+
 h3. Standalone
-
+
 CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.
 
 While the crawler is running, you can view statistics on http://localhost:4567
@@ -32,8 +32,8 @@ h3. Command Line
 
 Run "cobweb --help" for more info
 
 h3. Data Returned For Each Page
-The data available in the returned hash are:
-
+The data available in the returned hash are:
+
 * :url - url of the resource requested
 * :status_code - status code of the resource requested
 * :mime_type - content type of the resource
@@ -49,15 +49,15 @@ h3. Data Returned For Each Page
 ** :related - url's from link tags
 ** :scripts - url's from script tags
 ** :styles - url's from within link tags with rel of stylesheet and from url() directives with stylesheets
-
+
 The source for the links can be overridden, contact me for the syntax (don't have time to put it into this documentation, will as soon as i have time!)
 
 h3. Statistics
 
 Statistics are available during the crawl, you can create a Stats object passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. It is also returned by default from the CobwebCrawler.crawl standalone crawling method.
-
+
 The data available within statistics is as follows:
-
+
 * :average_length - average size of each objet
 * :minimum_length - minimum length returned
 * :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
@@ -91,10 +91,10 @@ h4. new(options)
 Creates a new crawler object based on a base_url
 
 * options - Options are passed in as a hash,
-
+
 ** :follow_redirects - transparently follows redirects and populates the :redirect_through key in the content hash (Default: true)
-** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
-** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
+** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
+** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
 ** :crawl_finished_queue - specifies the processing queue for statistics to be sent to after finishing crawling (Default: 'CobwebFinishedJob' when using resque, 'CrawlFinishedWorker' when using sidekiq)
 ** :debug - enables debug output (Default: false)
 ** :quiet - hides default output (Default: false)
@@ -116,8 +116,8 @@ Creates a new crawler object based on a base_url
 ** :use_encoding_safe_process_job - Base64-encode the body when storing job in queue; set to true when you are expecting non-ASCII content (Default: false)
 ** :proxy_addr - hostname of a proxy to use for crawling (e. g., 'myproxy.example.net', default: nil)
 ** :proxy_port - port number of the proxy (default: nil)
-
-
+
+
 bc. crawler = Cobweb.new(:follow_redirects => false)
 
 h4. start(base_url)
@@ -125,7 +125,7 @@ h4. start(base_url)
 Starts a crawl through resque. Requires the :processing_queue to be set to a valid class for the resque job to work with the data retrieved.
 
 * base_url - the url to start the crawl from
-
+
 Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain is added to the internal_urls automatically.
 
 bc. crawler.start("http://www.google.com/")
@@ -156,7 +156,7 @@ h3. CobwebCrawler
 
 CobwebCrawler is the standalone crawling class. If you don't want to use resque or sidekiq and just want to crawl the site within your ruby process, you can use this class.
 
-bc. crawler = CobwebCrawler.new(:cache => 600)
+bc. crawler = CobwebCrawler.new(:cache => 600)
 statistics = crawler.crawl("http://www.pepsico.com")
 
 You can also run within a block and get access to each page as it is being crawled.
@@ -177,13 +177,13 @@ The CobwebCrawlHelper class is a helper class to assist in getting information a
 bc. crawl = CobwebCrawlHelper.new(options)
 
 * options - the hash of options passed into Cobweb.new (must include a :crawl_id)
-
+
 
 
 h2. Contributing/Testing
 
 Feel free to contribute small or large bits of code, just please make sure that there are rspec test for the features your submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
-
+
 Continuous integration testing is performed by the excellent Travis: http://travis-ci.org/#!/stewartmckee/cobweb
 
 h2. Todo
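The Resque section of the README above says the class passed as :processing_queue receives each crawled resource's hash through a perform method that you implement yourself. A minimal sketch of such a job follows; the class name, queue name and option values are illustrative only (not part of the gem), and the hash keys are shown as strings on the assumption that Resque round-trips job arguments through JSON.

class MyContentProcessJob
  @queue = :cobweb_process_job   # standard Resque queue declaration

  # content is the hash Cobweb builds for each resource
  # (:url, :status_code, :mime_type, :body, :links, ...)
  def self.perform(content)
    return unless content["mime_type"].to_s.include?("text/html")
    puts "#{content['url']} => #{content['status_code']}"
  end
end

# Hypothetical usage: point a crawl at the job class above.
crawler = Cobweb.new(:processing_queue => "MyContentProcessJob", :crawl_limit => 100)
crawler.start("http://www.example.com/")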
data/bin/cobweb
ADDED
@@ -0,0 +1,56 @@
+#!/usr/bin/env ruby
+
+lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
+$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)
+
+require 'cobweb'
+require 'csv'
+require 'slop'
+
+include CobwebDSL
+
+opts = Slop.parse(:help => true) do
+  banner 'Usage: cobweb <command> [options]'
+
+  command :report do
+    banner 'Usage: cobweb report [options]'
+
+    on 'output=', 'Path to output data to'
+    on 'script=', "Script to generate report"
+
+    on 'url=', 'URL to start crawl from'
+    on 'internal_urls=', 'Url patterns to include', :as => Array
+    on 'external_urls=', 'Url patterns to exclude', :as => Array
+    on 'seed_urls=', "Seed urls", :as => Array
+    on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+    on 'thread_count=', "Set the number of threads used", :as => Integer
+    on 'timeout=', "Sets the timeout for http requests", :as => Integer
+    on 'v', 'verbose', 'Display crawl information'
+    on 'd', 'debug', 'Display debug information'
+    on 'w', 'web_statistics', 'Start web stats server'
+
+    run do |opts, args|
+      ReportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?})
+    end
+  end
+
+  command :export do
+    banner 'Usage: cobweb export [options]'
+
+    on 'url=', 'URL to start crawl from'
+    on 'internal_urls=', 'Url patterns to include', :as => Array
+    on 'external_urls=', 'Url patterns to exclude', :as => Array
+    on 'seed_urls=', "Seed urls", :as => Array
+    on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+    on 'thread_count=', "Set the number of threads used", :as => Integer
+    on 'timeout=', "Sets the timeout for http requests", :as => Integer
+    on 'v', 'verbose', 'Display crawl information'
+    on 'd', 'debug', 'Display debug information'
+    on 'w', 'web_statistics', 'Start web stats server'
+
+    run do |opts, args|
+      ExportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?}, args[0])
+    end
+  end
+
+end
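Going by the Slop options defined in the new executable above, a report run might be invoked roughly as follows; the URL and file names are placeholders, and the exact behaviour depends on ReportCommand and ExportCommand, which are not part of this diff. Run "cobweb --help" for the authoritative usage.

cobweb report --url http://www.example.com --crawl_limit 100 --thread_count 4 --output report.csv --script my_report.rb --verbose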
data/lib/cobweb_version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.22
 platform: ruby
 authors:
 - Stewart McKee
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: redis
@@ -126,27 +126,29 @@ dependencies:
   name: slop
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.4'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.4'
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more performant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface
   to monitor the progress of the crawls.
 email: stewart@rockwellcottage.com
-executables:
+executables:
+- cobweb
 extensions: []
 extra_rdoc_files:
 - README.textile
 files:
 - README.textile
+- bin/cobweb
 - lib/cobweb.rb
 - lib/cobweb_crawl_helper.rb
 - lib/cobweb_crawler.rb