cobweb 1.0.21 → 1.0.22

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: f7c3816549392f4fa31701ae65bff51fbe22db89
- data.tar.gz: a8cec8a17ec20f31a85980f75cb790331a6be16d
+ metadata.gz: 981b3f18bad361e4a8b50a3387008a10029887df
+ data.tar.gz: 0200c604402af9125b754f756a0740bae0df1e70
  SHA512:
- metadata.gz: b2b172dd7f45efb8b5eccacad67b35683ee1f0867f8bfc423b5b8a91ed3b3cac22e5b2343a96d4a8bfdfe9426bf4461ced4385ea96e7bc850e80e3de4b0ce976
- data.tar.gz: 2bace4df48372e0253973e7600e8d48ad06ab12a948723a5850620bfcb31efffba2fdaf6f36ae5502c054d8457a1dec9a23675636ce9a4c4474cf5b3086f6697
+ metadata.gz: bfd699c658f5ec55607c7055205cf60454fca789a921bda5284f9126299c6dd1edac74d9376e38dd506139c192aa991ea4caf6e4b48d613de9a282abdaccbe5e
+ data.tar.gz: 4b3693d8450ff8364a312691ca7a435dde636c9e4fa8dbb0cb8e93cc47b65cf0bf5f78b074cba6c8705ed7d76553c0adb0b81d50937bb303531d2d672058e0df
README.textile CHANGED
@@ -1,4 +1,4 @@
- h1. Cobweb v1.0.20
+ h1. Cobweb v1.0.22

  "@cobweb_gem":https://twitter.com/cobweb_gem
  !https://badge.fury.io/rb/cobweb.png!:http://badge.fury.io/rb/cobweb
@@ -6,18 +6,18 @@ h1. Cobweb v1.0.20
  !https://coveralls.io/repos/stewartmckee/cobweb/badge.png?branch=master(Coverage Status)!:https://coveralls.io/r/stewartmckee/cobweb


- h2. Intro
-
+ h2. Intro
+
  Cobweb has three methods of running. Firstly, it is an HTTP client that allows get and head requests, returning a hash of data relating to the requested resource. The second main function is to combine this with the power of Resque to cluster the crawls, allowing you to crawl quickly. Lastly, you can run the crawler with a block that uses each of the pages found in the crawl.
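For the HTTP-client mode, a minimal sketch (the get call and the hash keys follow the gem's documented API; the URL and cache value are just placeholders):

bc. require 'cobweb'
crawler = Cobweb.new(:cache => 600, :follow_redirects => true)
content = crawler.get("http://www.pepsico.com")
puts content[:status_code]
puts content[:mime_type]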
-
+
  I've created a sample app to help with setting up cobweb at http://github.com/stewartmckee/cobweb_sample
-
+
  h3. Resque

  When running on resque, passing in a Class and queue name, it will enqueue all resources to this queue for processing, passing in the hash it has generated. You then implement the perform method to process the resource for your own application.
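As an illustration of that contract, a minimal processing class might look like the sketch below (the class and queue names are hypothetical, not part of the gem; hash keys arrive as strings after Resque's JSON round-trip):

bc. class MyContentProcessor
  @queue = :my_content_processor
  # Resque calls this with the content hash cobweb generated for each page.
  def self.perform(content)
    puts "#{content['url']} returned #{content['status_code']}"
  end
end

You would then point cobweb at it with something like :processing_queue => "MyContentProcessor" when creating the crawler.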
-
+
  h3. Standalone
-
+
  CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.

  While the crawler is running, you can view statistics on http://localhost:4567
@@ -32,8 +32,8 @@ h3. Command Line
  Run "cobweb --help" for more info

  h3. Data Returned For Each Page
- The data available in the returned hash are:
-
+ The data available in the returned hash are:
+
  * :url - url of the resource requested
  * :status_code - status code of the resource requested
  * :mime_type - content type of the resource
@@ -49,15 +49,15 @@ h3. Data Returned For Each Page
  ** :related - urls from link tags
  ** :scripts - urls from script tags
  ** :styles - urls from within link tags with rel of stylesheet and from url() directives within stylesheets
-
+
  The source for the links can be overridden; contact me for the syntax (I don't have time to put it into this documentation yet, but will as soon as I do!)

  h3. Statistics

  Statistics are available during the crawl; you can create a Stats object, passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. It is also returned by default from the CobwebCrawler.crawl standalone crawling method. A sketch is shown after this list.
-
+
  The data available within statistics is as follows:
-
+
  * :average_length - average size of each object
  * :minimum_length - minimum length returned
  * :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
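As a sketch of the Stats usage described above (this assumes Stats is available at the top level once cobweb is required; the redis connection details and crawl id are placeholders):

bc. require 'cobweb'
stats = Stats.new(:redis_options => {:host => "localhost", :port => 6379}, :crawl_id => "my-crawl-id")
current = stats.get_statistics
puts current[:average_length]
puts current[:queued_at]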
@@ -91,10 +91,10 @@ h4. new(options)
  Creates a new crawler object based on a base_url

  * options - Options are passed in as a hash,
-
+
  ** :follow_redirects - transparently follows redirects and populates the :redirect_through key in the content hash (Default: true)
- ** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
- ** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
+ ** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
+ ** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
  ** :crawl_finished_queue - specifies the processing queue for statistics to be sent to after finishing crawling (Default: 'CobwebFinishedJob' when using resque, 'CrawlFinishedWorker' when using sidekiq)
  ** :debug - enables debug output (Default: false)
  ** :quiet - hides default output (Default: false)
@@ -116,8 +116,8 @@ Creates a new crawler object based on a base_url
  ** :use_encoding_safe_process_job - Base64-encode the body when storing job in queue; set to true when you are expecting non-ASCII content (Default: false)
  ** :proxy_addr - hostname of a proxy to use for crawling (e. g., 'myproxy.example.net', default: nil)
  ** :proxy_port - port number of the proxy (default: nil)
-
-
+
+
  bc. crawler = Cobweb.new(:follow_redirects => false)
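A slightly fuller sketch combining several of the options listed above (the queue names are the resque defaults quoted in the list; the other values are illustrative):

bc. crawler = Cobweb.new(
  :follow_redirects => true,
  :redirect_limit => 5,
  :processing_queue => "CobwebProcessJob",
  :crawl_finished_queue => "CobwebFinishedJob",
  :quiet => true)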

  h4. start(base_url)
@@ -125,7 +125,7 @@ h4. start(base_url)
  Starts a crawl through resque. Requires the :processing_queue to be set to a valid class for the resque job to work with the data retrieved.

  * base_url - the url to start the crawl from
-
+
  Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain are added to the internal_urls automatically.

  bc. crawler.start("http://www.google.com/")
@@ -156,7 +156,7 @@ h3. CobwebCrawler

  CobwebCrawler is the standalone crawling class. If you don't want to use resque or sidekiq and just want to crawl the site within your ruby process, you can use this class.

- bc. crawler = CobwebCrawler.new(:cache => 600)
+ bc. crawler = CobwebCrawler.new(:cache => 600)
  statistics = crawler.crawl("http://www.pepsico.com")

  You can also run within a block and get access to each page as it is being crawled.
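A minimal sketch of that block form (this assumes the block receives the content hash for each page, with a statistics hash as a second argument; adjust to the version you are running):

bc. crawler = CobwebCrawler.new(:cache => 600)
statistics = crawler.crawl("http://www.pepsico.com") do |content, stats|
  puts "Crawled #{content[:url]} (#{content[:status_code]})"
end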
@@ -177,13 +177,13 @@ The CobwebCrawlHelper class is a helper class to assist in getting information a
  bc. crawl = CobwebCrawlHelper.new(options)

  * options - the hash of options passed into Cobweb.new (must include a :crawl_id)
-
+


  h2. Contributing/Testing

  Feel free to contribute small or large bits of code; just please make sure that there are rspec tests for the features you're submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
-
+
  Continuous integration testing is performed by the excellent Travis: http://travis-ci.org/#!/stewartmckee/cobweb

  h2. Todo
bin/cobweb ADDED
@@ -0,0 +1,56 @@
+ #!/usr/bin/env ruby
+
+ lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
+ $LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)
+
+ require 'cobweb'
+ require 'csv'
+ require 'slop'
+
+ include CobwebDSL
+
+ opts = Slop.parse(:help => true) do
+   banner 'Usage: cobweb <command> [options]'
+
+   command :report do
+     banner 'Usage: cobweb report [options]'
+
+     on 'output=', 'Path to output data to'
+     on 'script=', "Script to generate report"
+
+     on 'url=', 'URL to start crawl from'
+     on 'internal_urls=', 'Url patterns to include', :as => Array
+     on 'external_urls=', 'Url patterns to exclude', :as => Array
+     on 'seed_urls=', "Seed urls", :as => Array
+     on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+     on 'thread_count=', "Set the number of threads used", :as => Integer
+     on 'timeout=', "Sets the timeout for http requests", :as => Integer
+     on 'v', 'verbose', 'Display crawl information'
+     on 'd', 'debug', 'Display debug information'
+     on 'w', 'web_statistics', 'Start web stats server'
+
+     run do |opts, args|
+       ReportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?})
+     end
+   end
+
+   command :export do
+     banner 'Usage: cobweb export [options]'
+
+     on 'url=', 'URL to start crawl from'
+     on 'internal_urls=', 'Url patterns to include', :as => Array
+     on 'external_urls=', 'Url patterns to exclude', :as => Array
+     on 'seed_urls=', "Seed urls", :as => Array
+     on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+     on 'thread_count=', "Set the number of threads used", :as => Integer
+     on 'timeout=', "Sets the timeout for http requests", :as => Integer
+     on 'v', 'verbose', 'Display crawl information'
+     on 'd', 'debug', 'Display debug information'
+     on 'w', 'web_statistics', 'Start web stats server'
+
+     run do |opts, args|
+       ExportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?}, args[0])
+     end
+   end
+
+ end
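An illustrative invocation of the report command defined above (the flag names come straight from the Slop definitions; the URL, limit, script and output path are placeholders):

bc. cobweb report --url http://example.com --crawl_limit 100 --script my_report.rb --output report.csv --verbose

Run "cobweb --help" for the full option listing.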
lib/cobweb_version.rb CHANGED
@@ -3,7 +3,7 @@ class CobwebVersion

  # Returns a string of the current version
  def self.version
-   "1.0.21"
+   "1.0.22"
  end

  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: cobweb
  version: !ruby/object:Gem::Version
- version: 1.0.21
+ version: 1.0.22
  platform: ruby
  authors:
  - Stewart McKee
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-11-05 00:00:00.000000000 Z
+ date: 2015-01-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: redis
@@ -126,27 +126,29 @@ dependencies:
  name: slop
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '3.4'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ">="
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '3.4'
  description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
  crawl extremely large sites which is much more performant than multi-threaded crawlers. It
  is also a standalone crawler that has a sophisticated statistics monitoring interface
  to monitor the progress of the crawls.
  email: stewart@rockwellcottage.com
- executables: []
+ executables:
+ - cobweb
  extensions: []
  extra_rdoc_files:
  - README.textile
  files:
  - README.textile
+ - bin/cobweb
  - lib/cobweb.rb
  - lib/cobweb_crawl_helper.rb
  - lib/cobweb_crawler.rb