cobweb 1.0.21 → 1.0.22
- checksums.yaml +4 -4
- data/README.textile +21 -21
- data/bin/cobweb +56 -0
- data/lib/cobweb_version.rb +1 -1
- metadata +9 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 981b3f18bad361e4a8b50a3387008a10029887df
+  data.tar.gz: 0200c604402af9125b754f756a0740bae0df1e70
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bfd699c658f5ec55607c7055205cf60454fca789a921bda5284f9126299c6dd1edac74d9376e38dd506139c192aa991ea4caf6e4b48d613de9a282abdaccbe5e
+  data.tar.gz: 4b3693d8450ff8364a312691ca7a435dde636c9e4fa8dbb0cb8e93cc47b65cf0bf5f78b074cba6c8705ed7d76553c0adb0b81d50937bb303531d2d672058e0df
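The SHA1 and SHA512 pairs above are digests of the gem's metadata.gz and data.tar.gz archives. As an illustrative check only (the file names assume the gem has been fetched and unpacked locally, e.g. with gem fetch cobweb -v 1.0.22 followed by tar -xf cobweb-1.0.22.gem), the published SHA1 values could be compared in Ruby like this:

require 'digest'

# SHA1 values copied from checksums.yaml above; paths assume metadata.gz and
# data.tar.gz sit in the current directory after unpacking the .gem archive.
{
  'metadata.gz' => '981b3f18bad361e4a8b50a3387008a10029887df',
  'data.tar.gz' => '0200c604402af9125b754f756a0740bae0df1e70'
}.each do |file, expected|
  actual = Digest::SHA1.file(file).hexdigest
  puts "#{file}: #{actual == expected ? 'ok' : 'MISMATCH'}"
end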
data/README.textile
CHANGED
@@ -1,4 +1,4 @@
-h1. Cobweb v1.0.
+h1. Cobweb v1.0.22
 
 "@cobweb_gem":https://twitter.com/cobweb_gem
 !https://badge.fury.io/rb/cobweb.png!:http://badge.fury.io/rb/cobweb
@@ -6,18 +6,18 @@ h1. Cobweb v1.0.20
 !https://coveralls.io/repos/stewartmckee/cobweb/badge.png?branch=master(Coverage Status)!:https://coveralls.io/r/stewartmckee/cobweb
 
 
-h2. Intro
-
+h2. Intro
+
 CobWeb has three methods of running. Firstly it is a http client that allows get and head requests returning a hash of data relating to the requested resource. The second main function is to utilize this combined with the power of Resque to cluster the crawls allowing you crawl quickly. Lastly you can run the crawler with a block that uses each of the pages found in the crawl.
-
+
 I've created a sample app to help with setting up cobweb at http://github.com/stewartmckee/cobweb_sample
-
+
 h3. Resque
 
 When running on resque, passing in a Class and queue name it will enqueue all resources to this queue for processing, passing in the hash it has generated. You then implement the perform method to process the resource for your own application.
-
+
 h3. Standalone
-
+
 CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.
 
 While the crawler is running, you can view statistics on http://localhost:4567
@@ -32,8 +32,8 @@ h3. Command Line
 
 Run "cobweb --help" for more info
 
 h3. Data Returned For Each Page
-The data available in the returned hash are:
-
+The data available in the returned hash are:
+
 * :url - url of the resource requested
 * :status_code - status code of the resource requested
 * :mime_type - content type of the resource
@@ -49,15 +49,15 @@ h3. Data Returned For Each Page
 ** :related - url's from link tags
 ** :scripts - url's from script tags
 ** :styles - url's from within link tags with rel of stylesheet and from url() directives with stylesheets
-
+
 The source for the links can be overridden, contact me for the syntax (don't have time to put it into this documentation, will as soon as i have time!)
 
 h3. Statistics
 
 Statistics are available during the crawl, you can create a Stats object passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. It is also returned by default from the CobwebCrawler.crawl standalone crawling method.
-
+
 The data available within statistics is as follows:
-
+
 * :average_length - average size of each objet
 * :minimum_length - minimum length returned
 * :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
@@ -91,10 +91,10 @@ h4. new(options)
 Creates a new crawler object based on a base_url
 
 * options - Options are passed in as a hash,
-
+
 ** :follow_redirects - transparently follows redirects and populates the :redirect_through key in the content hash (Default: true)
-** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
-** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
+** :redirect_limit - sets the limit to be used for concurrent redirects (Default: 10)
+** :processing_queue - specifies the processing queue for content to be sent to (Default: 'CobwebProcessJob' when using resque, 'CrawlProcessWorker' when using sidekiq)
 ** :crawl_finished_queue - specifies the processing queue for statistics to be sent to after finishing crawling (Default: 'CobwebFinishedJob' when using resque, 'CrawlFinishedWorker' when using sidekiq)
 ** :debug - enables debug output (Default: false)
 ** :quiet - hides default output (Default: false)
@@ -116,8 +116,8 @@ Creates a new crawler object based on a base_url
 ** :use_encoding_safe_process_job - Base64-encode the body when storing job in queue; set to true when you are expecting non-ASCII content (Default: false)
 ** :proxy_addr - hostname of a proxy to use for crawling (e. g., 'myproxy.example.net', default: nil)
 ** :proxy_port - port number of the proxy (default: nil)
-
-
+
+
 bc. crawler = Cobweb.new(:follow_redirects => false)
 
 h4. start(base_url)
@@ -125,7 +125,7 @@ h4. start(base_url)
 Starts a crawl through resque. Requires the :processing_queue to be set to a valid class for the resque job to work with the data retrieved.
 
 * base_url - the url to start the crawl from
-
+
 Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain is added to the internal_urls automatically.
 
 bc. crawler.start("http://www.google.com/")
@@ -156,7 +156,7 @@ h3. CobwebCrawler
 
 CobwebCrawler is the standalone crawling class. If you don't want to use resque or sidekiq and just want to crawl the site within your ruby process, you can use this class.
 
-bc. crawler = CobwebCrawler.new(:cache => 600)
+bc. crawler = CobwebCrawler.new(:cache => 600)
 statistics = crawler.crawl("http://www.pepsico.com")
 
 You can also run within a block and get access to each page as it is being crawled.
@@ -177,13 +177,13 @@ The CobwebCrawlHelper class is a helper class to assist in getting information a
 bc. crawl = CobwebCrawlHelper.new(options)
 
 * options - the hash of options passed into Cobweb.new (must include a :crawl_id)
-
+
 
 
 h2. Contributing/Testing
 
 Feel free to contribute small or large bits of code, just please make sure that there are rspec test for the features your submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
-
+
 Continuous integration testing is performed by the excellent Travis: http://travis-ci.org/#!/stewartmckee/cobweb
 
 h2. Todo
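The Resque section of the README above says the class passed as :processing_queue receives each crawled resource's hash through a perform method that you implement yourself. A minimal sketch of such a job follows; the class name, queue name and option values are illustrative only (not part of the gem), and the hash keys are shown as strings on the assumption that Resque round-trips job arguments through JSON.

class MyContentProcessJob
  @queue = :cobweb_process_job   # standard Resque queue declaration

  # content is the hash Cobweb builds for each resource
  # (:url, :status_code, :mime_type, :body, :links, ...)
  def self.perform(content)
    return unless content["mime_type"].to_s.include?("text/html")
    puts "#{content['url']} => #{content['status_code']}"
  end
end

# Hypothetical usage: point a crawl at the job class above.
crawler = Cobweb.new(:processing_queue => "MyContentProcessJob", :crawl_limit => 100)
crawler.start("http://www.example.com/")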
data/bin/cobweb
ADDED
@@ -0,0 +1,56 @@
+#!/usr/bin/env ruby
+
+lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
+$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)
+
+require 'cobweb'
+require 'csv'
+require 'slop'
+
+include CobwebDSL
+
+opts = Slop.parse(:help => true) do
+  banner 'Usage: cobweb <command> [options]'
+
+  command :report do
+    banner 'Usage: cobweb report [options]'
+
+    on 'output=', 'Path to output data to'
+    on 'script=', "Script to generate report"
+
+    on 'url=', 'URL to start crawl from'
+    on 'internal_urls=', 'Url patterns to include', :as => Array
+    on 'external_urls=', 'Url patterns to exclude', :as => Array
+    on 'seed_urls=', "Seed urls", :as => Array
+    on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+    on 'thread_count=', "Set the number of threads used", :as => Integer
+    on 'timeout=', "Sets the timeout for http requests", :as => Integer
+    on 'v', 'verbose', 'Display crawl information'
+    on 'd', 'debug', 'Display debug information'
+    on 'w', 'web_statistics', 'Start web stats server'
+
+    run do |opts, args|
+      ReportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?})
+    end
+  end
+
+  command :export do
+    banner 'Usage: cobweb export [options]'
+
+    on 'url=', 'URL to start crawl from'
+    on 'internal_urls=', 'Url patterns to include', :as => Array
+    on 'external_urls=', 'Url patterns to exclude', :as => Array
+    on 'seed_urls=', "Seed urls", :as => Array
+    on 'crawl_limit=', 'Limit the crawl to a number of urls', :as => Integer
+    on 'thread_count=', "Set the number of threads used", :as => Integer
+    on 'timeout=', "Sets the timeout for http requests", :as => Integer
+    on 'v', 'verbose', 'Display crawl information'
+    on 'd', 'debug', 'Display debug information'
+    on 'w', 'web_statistics', 'Start web stats server'
+
+    run do |opts, args|
+      ExportCommand.start(opts.to_hash.delete_if{|k,v| v.nil?}, args[0])
+    end
+  end
+
+end
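Going by the Slop options defined in the new executable above, a report run might be invoked roughly as follows; the URL and file names are placeholders, and the exact behaviour depends on ReportCommand and ExportCommand, which are not part of this diff. Run "cobweb --help" for the authoritative usage.

cobweb report --url http://www.example.com --crawl_limit 100 --thread_count 4 --output report.csv --script my_report.rb --verbose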
data/lib/cobweb_version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.22
 platform: ruby
 authors:
 - Stewart McKee
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: redis
@@ -126,27 +126,29 @@ dependencies:
   name: slop
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.4'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.4'
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more performant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface
   to monitor the progress of the crawls.
 email: stewart@rockwellcottage.com
-executables:
+executables:
+- cobweb
 extensions: []
 extra_rdoc_files:
 - README.textile
 files:
 - README.textile
+- bin/cobweb
 - lib/cobweb.rb
 - lib/cobweb_crawl_helper.rb
 - lib/cobweb_crawler.rb