grell 1.6.11 → 2.0.0
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/CHANGELOG.md +10 -0
- data/Gemfile +4 -0
- data/README.md +91 -64
- data/grell.gemspec +5 -6
- data/lib/grell.rb +1 -0
- data/lib/grell/capybara_driver.rb +1 -1
- data/lib/grell/crawler.rb +16 -46
- data/lib/grell/crawler_manager.rb +77 -0
- data/lib/grell/page_collection.rb +9 -1
- data/lib/grell/version.rb +1 -1
- data/spec/lib/crawler_manager_spec.rb +113 -0
- data/spec/lib/crawler_spec.rb +29 -86
- metadata +14 -25
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bbd2a19c7858d2e755e7d51a038b84901835f32b
+  data.tar.gz: 7c65a731bfdacb7b65cf523c8ba128f1f3390f96
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fcf31c442f8d51cd4a9534270cc3afd7f5bac432a74a9ec2e429914145df710131e0a99aca61e4cd98d2e831a5cedb511c21dce9d1562881541fd1bfb4016014
+  data.tar.gz: c413503a4a8765fadbec10ed64467ba140b8739b6209ae4a19e585549ad689231b64b3d670e1bdceec560597051d93f8b90d696f66571cb1dad175f15af1b4a0
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,13 @@
+# 2.0.0
+* New configuration key `on_periodic_restart`.
+* CrawlerManager.cleanup_all_processes destroys all instances of PhantomJS on this machine.
+
+* Breaking changes
+  - Requires Ruby 2.1 or later.
+  - Crawler.start_crawling does not accept options anymore, all options are passed to Crawler.new.
+  - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
+  - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
+
 # 1.6.11
 * Ensure all links are loaded by waiting for Ajax requests to complete
 * Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
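Taken together, the 2.0.0 entries above imply roughly the following migration; a minimal sketch, assuming a placeholder `my_logger` object and illustrative whitelist/blacklist patterns:

```ruby
require 'grell'

# 2.0.0: all options go to Crawler.new; start_crawling only takes the URL and a block.
crawler = Grell::Crawler.new(
  logger: my_logger,                      # placeholder for your own logger, e.g. Rails.logger
  whitelist: [/\/articles\/.*/],          # was crawler.whitelist(...) in 1.x
  blacklist: ['/admin'],                  # was crawler.blacklist(...) in 1.x
  on_periodic_restart: { do: proc {}, each: 100 }
)

crawler.start_crawling('http://www.example.com') do |page|
  # restart and quit now live on the manager, not on the crawler itself
  crawler.manager.restart if page.status == 500 && page.retries == 0
end

crawler.manager.quit
```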
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -21,16 +21,15 @@ Or install it yourself as:
 
     $ gem install grell
 
-Grell uses PhantomJS, you will need to download and install it in your
+Grell uses PhantomJS as a browser, you will need to download and install it in your
 system. Check for instructions in http://phantomjs.org/
 Grell has been tested with PhantomJS v2.1.x
 
 ## Usage
 
-
 ### Crawling an entire site
 
-The main entry point of the library is Grell#start_crawling.
+The main entry point of the library is Grell::Crawler#start_crawling.
 Grell will yield to your code with each page it finds:
 
 ```ruby
@@ -55,85 +54,105 @@ This list is indexed by the complete url, including query parameters.
 
 ### Re-retrieving a page
 If you want Grell to revisit a page and return the data to you again,
-return the symbol :retry in your block
+return the symbol :retry in your block for the start_crawling method.
 For instance
 ```ruby
 require 'grell'
 crawler = Grell::Crawler.new
 crawler.start_crawling('http://www.google.com') do |current_page|
   if current_page.status == 500 && current_page.retries == 0
-    crawler.restart
+    crawler.manager.restart
     :retry
   end
 end
 ```
 
-###
-If you are doing a long crawling it is possible that phantomJS starts failing.
-To avoid that, you can restart it by calling "restart" on crawler.
-That will kill phantom and will restart it. Grell will keep the status of
-pages already visited and pages discovered and to be visited. And will keep crawling
-with the new phantomJS process instead of the old one.
+### Pages' id
 
-
+Each page has a unique id, accessed by the property `id`. Also each page stores the id of the page from which we found this page, accessed by the property `parent_id`.
+The page object generated by accessing the first URL passed to start_crawling (the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
+Using this information it is possible to construct a directed graph.
 
-Grell by default will follow all the links it finds going to the site
-your are crawling. It will never follow links linking outside your site.
-If you want to further limit the amount of links crawled, you can use
-whitelisting, blacklisting or manual filtering.
 
-
-By default, Grell will detect new URLs to visit by comparing the full URL
-with the URLs of the discovered and visited links. This functionality can
-be changed by passing a block of code to Grells `start_crawling` method.
-In the below example, the path of the URLs (instead of the full URL) will
-be compared.
+### Restart and quit
 
+Grell can be restarted. The current lists of visited and yet-to-visit pages are not modified when restarting,
+but the browser is destroyed and recreated; all cookies and local storage are lost. After restarting, crawling is resumed with a
+new browser.
+To destroy the crawler, call the `quit` method. This will free the memory taken in Ruby and destroy the PhantomJS process.
 ```ruby
 require 'grell'
-
 crawler = Grell::Crawler.new
+crawler.manager.restart # restarts the browser
+crawler.manager.quit # quits and destroys the crawler
+```
 
-
-  collection_page.path == page.path
-end
+### Options
 
-
-
-
-
+The `Grell::Crawler` class can be passed options to customize its behavior:
+- `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
+- `on_periodic_restart`: Sets periodic restarts of the crawler each certain number of visits. Default: 100 pages.
+- `whitelist`: Sets up a whitelist filter for URLs to be visited. Default: all URLs are whitelisted.
+- `blacklist`: Sets up a blacklist filter for URLs to be avoided. Default: no URL is blacklisted.
+- `add_match_block`: Block evaluated to consider if a given page should be part of the pages to be visited. Default: add unique URLs.
+- `evaluate_in_each_page`: Javascript block to be evaluated on each page visited. Default: Nothing evaluated.
+- `driver_options`: Driver options will be passed to the Capybara driver which connects to PhantomJS.
 
-
+Grell by default will follow all the links it finds in the site being crawled.
+It will never follow links linking outside your site.
+If you want to further limit the amount of links crawled, you can use
+whitelisting, blacklisting or manual filtering.
+Below are further details on these and other options.
 
-```ruby
-require 'grell'
 
-
-
-crawler.
-
+#### Automatically restarting PhantomJS
+If you are doing a long crawl it is possible that PhantomJS gets into an inconsistent state or starts leaking memory.
+The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
+`on_periodic_restart` configuration key as follows:
 
-
-
-string match is whitelisted) or an array with regexps and/or strings.
+```ruby
+require 'grell'
 
-
+crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
 
-
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
 
-crawler
-
-crawler.start_crawling('http://www.google.com')
-```
+This code will set up the crawler to be restarted every 200 pages crawled and to call `my_restart_procedure`
+between restarts. A restart destroys the cookies, so this custom block can for instance be used to log in again.
 
-Similar to whitelisting. But now Grell will follow every other link in
-this site which does not go to /games/...
 
-
-
-
-
+#### Whitelisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(whitelist: [/games\/.*/, '/fun'])
+crawler.start_crawling('http://www.google.com')
+```
+
+Grell here will only follow links to games and '/fun' and ignore all
+other links. You can provide a regexp, strings (any link containing the
+string is whitelisted) or an array with regexps and/or strings.
+
+#### Blacklisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(blacklist: /games\/.*/)
+crawler.start_crawling('http://www.google.com')
+```
+
+Similar to whitelisting, but now Grell will follow every other link in
+this site which does not go to /games/...
+
+If you use both a whitelist and a blacklist then both will apply; a link
+has to fulfill both conditions to survive. If you use neither, then
+all links on this site will be crawled. Think of these options as
+filters.
 
 #### Manual link filtering
 
@@ -144,14 +163,28 @@ links to visit. So you can modify in your block of code "page.links" to
 add and delete links to instruct Grell to add them to the list of links
 to visit next.
 
-
+#### Custom URL Comparison
+By default, Grell will detect new URLs to visit by comparing the full URL
+with the URLs of the discovered and visited links. This functionality can
+be changed by passing the `add_match_block` option to `Grell::Crawler.new`.
+In the below example, the path of the URLs (instead of the full URL) will
+be compared.
 
-
-
-
+```ruby
+require 'grell'
+
+add_match_block = Proc.new do |collection_page, page|
+  collection_page.path == page.path
+end
 
+crawler = Grell::Crawler.new(add_match_block: add_match_block)
 
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
+
+#### Evaluate script
 
 You can evaluate a JavaScript snippet in each page before extracting links by passing the snippet to the 'evaluate_in_each_page' option:
 
@@ -168,12 +201,6 @@ When there is an error in the page or an internal error in the crawler (Javascri
 - errorClass: The class of the error which broke this page.
 - errorMessage: A descriptive message with the information Grell could gather about the error.
 
-### Logging
-You can pass your logger to Grell. For example in a Rails app:
-```Ruby
-crawler = Grell::Crawler.new(logger: Rails.logger)
-```
-
 ## Tests
 
 Run the tests with
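The whitelist and blacklist options documented above can also be combined; a minimal sketch with illustrative URL patterns (a link is followed only if it passes both filters):

```ruby
require 'grell'

# Follow only /games/... links, except anything under /games/beta.
crawler = Grell::Crawler.new(whitelist: /\/games\/.*/, blacklist: '/games/beta')
crawler.start_crawling('http://www.example.com')
```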
data/grell.gemspec
CHANGED
@@ -19,19 +19,18 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
-  spec.required_ruby_version = '>= 1.
+  spec.required_ruby_version = '>= 2.1.8'
 
-  spec.add_dependency 'capybara', '~> 2.
-  spec.add_dependency 'poltergeist', '~> 1.
+  spec.add_dependency 'capybara', '~> 2.10'
+  spec.add_dependency 'poltergeist', '~> 1.11'
 
   spec.add_development_dependency 'bundler', '~> 1.6'
   spec.add_development_dependency 'byebug', '~> 4.0'
   spec.add_development_dependency 'kender', '~> 0.2'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'webmock', '~> 1.18'
-  spec.add_development_dependency 'rspec', '~> 3.
-  spec.add_development_dependency 'puffing-billy', '~> 0.
+  spec.add_development_dependency 'rspec', '~> 3.5'
+  spec.add_development_dependency 'puffing-billy', '~> 0.9'
   spec.add_development_dependency 'timecop', '~> 0.8'
-  spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
   spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
 end
data/lib/grell.rb
CHANGED
data/lib/grell/crawler.rb
CHANGED
@@ -1,54 +1,32 @@
-
 module Grell
-
   # This is the class that starts and controls the crawling
   class Crawler
-    attr_reader :collection
+    attr_reader :collection, :manager
 
     # Creates a crawler
-    #
-
-
-
-
-
-
-
-    @
-    @
-
-
-    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
-    def restart
-      Grell.logger.info "GRELL is restarting"
-      @driver.restart
-      Grell.logger.info "GRELL has restarted"
-    end
-
-    # Quits the poltergeist driver.
-    def quit
-      Grell.logger.info "GRELL is quitting the poltergeist driver"
-      @driver.quit
-    end
-
-    # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
-    def whitelist(list)
-      @whitelist_regexp = Regexp.union(list)
-    end
-
-    # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
-    def blacklist(list)
-      @blacklist_regexp = Regexp.union(list)
+    # evaluate_in_each_page: javascript block to evaluate in each page we crawl
+    # add_match_block: block to evaluate to consider if a page is part of the collection
+    # manager_options: options passed to the manager class
+    # whitelist: Setups a whitelist filter, allows a regexp, string or array of either to be matched.
+    # blacklist: Setups a blacklist filter, allows a regexp, string or array of either to be matched.
+    def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
+      @collection = nil
+      @manager = CrawlerManager.new(manager_options)
+      @evaluate_in_each_page = evaluate_in_each_page
+      @add_match_block = add_match_block
+      @whitelist_regexp = Regexp.union(whitelist)
+      @blacklist_regexp = Regexp.union(blacklist)
     end
 
     # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
-    def start_crawling(url,
+    def start_crawling(url, &block)
       Grell.logger.info "GRELL Started crawling"
-      @collection = PageCollection.new(
+      @collection = PageCollection.new(@add_match_block)
       @collection.create_page(url, nil)
 
       while !@collection.discovered_pages.empty?
         crawl(@collection.next_page, block)
+        @manager.check_periodic_restart(@collection)
       end
 
       Grell.logger.info "GRELL finished crawling"
@@ -93,14 +71,6 @@ module Grell
       links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
     end
 
-    # If options[:add_match_block] is not provided, url matching to determine if a
-    # new page should be added the page collection will default to this proc
-    def default_add_match
-      Proc.new do |collection_page, page|
-        collection_page.url.downcase == page.url.downcase
-      end
-    end
-
     # Store the resulting redirected URL along with the original URL
     def add_redirect_url(site)
       if site.url != site.current_url
data/lib/grell/crawler_manager.rb
ADDED
@@ -0,0 +1,77 @@
+module Grell
+  # Manages the state of the process crawling, does not care about individual pages but about logging,
+  # restarting and quiting the crawler correctly.
+  class CrawlerManager
+    # logger: logger to use for Grell's messages
+    # on_periodic_restart: if set, the driver will restart every :each visits (100 default) and execute the :do block
+    # driver_options: Any extra options for the Capybara driver
+    def initialize(logger: nil, on_periodic_restart: {}, driver: nil, driver_options: {})
+      Grell.logger = logger ? logger : Logger.new(STDOUT)
+      @periodic_restart_block = on_periodic_restart[:do]
+      @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
+      @driver = driver || CapybaraDriver.setup(driver_options)
+      if @periodic_restart_period <= 0
+        Grell.logger.warn "GRELL being misconfigured with a negative period to restart. Ignoring option."
+      end
+    end
+
+    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
+    def restart
+      Grell.logger.info "GRELL is restarting"
+      @driver.restart
+      Grell.logger.info "GRELL has restarted"
+    end
+
+    # Quits the poltergeist driver.
+    def quit
+      Grell.logger.info "GRELL is quitting the poltergeist driver"
+      @driver.quit
+    end
+
+    # PhantomJS seems to consume memory increasingly as it crawls, periodic restart allows to restart
+    # the driver, potentially calling a block.
+    def check_periodic_restart(collection)
+      return unless @periodic_restart_block
+      return unless @periodic_restart_period > 0
+      return unless (collection.visited_pages.size % @periodic_restart_period).zero?
+      restart
+      @periodic_restart_block.call
+    end
+
+    def cleanup_all_processes
+      pids = running_phantomjs_pids
+      return if pids.empty?
+      Grell.logger.warn "GRELL. Killing PhantomJS processes: #{pids.inspect}"
+      pids.each do |pid|
+        Grell.logger.warn "Sending KILL to PhantomJS process #{pid}"
+        kill_process(pid.to_i)
+      end
+    end
+
+    private
+
+    PAGES_TO_RESTART = 100 # Default number of pages before we restart the driver.
+    KILL_TIMEOUT = 2 # Number of seconds we wait till we kill the process.
+
+    def running_phantomjs_pids
+      list_phantomjs_processes_cmd = "ps -ef | grep -E 'bin/phantomjs' | grep -v grep"
+      `#{list_phantomjs_processes_cmd} | awk '{print $2;}'`.split("\n")
+    end
+
+    def kill_process(pid)
+      Process.kill('TERM', pid)
+      force_kill(pid)
+    rescue Errno::ESRCH, Errno::ECHILD
+      # successfully terminated
+    rescue => e
+      Grell.logger.exception e, "PhantomJS process could not be killed"
+    end
+
+    def force_kill(pid)
+      Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
+    rescue Timeout::Error
+      Process.kill('KILL', pid)
+      Process.wait(pid)
+    end
+  end
+end
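A minimal usage sketch for the new `cleanup_all_processes`, assuming a crawler built as in the README above (the URL is illustrative); it frees the driver and then kills any PhantomJS processes still running on the machine:

```ruby
require 'grell'

crawler = Grell::Crawler.new(logger: Logger.new(STDOUT))
crawler.start_crawling('http://www.example.com') { |page| puts page.url }

# Destroy the driver, then make sure no PhantomJS processes are left behind.
crawler.manager.quit
crawler.manager.cleanup_all_processes
```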
data/lib/grell/page_collection.rb
CHANGED
@@ -10,7 +10,7 @@ module Grell
     # to the collection or if it is already present will be passed to the initializer.
     def initialize(add_match_block)
       @collection = []
-      @add_match_block = add_match_block
+      @add_match_block = add_match_block || default_add_match
     end
 
     def create_page(url, parent_id)
@@ -50,5 +50,13 @@ module Grell
       end
     end
 
+    # If add_match_block is not provided, url matching to determine if a new page should be added
+    # to the page collection will default to this proc
+    def default_add_match
+      Proc.new do |collection_page, page|
+        collection_page.url.downcase == page.url.downcase
+      end
+    end
+
   end
 end
data/lib/grell/version.rb
CHANGED
data/spec/lib/crawler_manager_spec.rb
ADDED
@@ -0,0 +1,113 @@
+RSpec.describe Grell::CrawlerManager do
+  let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
+  let(:host) { 'http://www.example.com' }
+  let(:url) { 'http://www.example.com/test' }
+  let(:driver) { double(Grell::CapybaraDriver) }
+  let(:logger) { Logger.new(nil) }
+  let(:crawler_manager) do
+    described_class.new(logger: logger, driver: driver)
+  end
+
+  describe 'initialize' do
+    context 'provides a logger' do
+      let(:logger) { 33 }
+      it 'sets custom logger' do
+        crawler_manager
+        expect(Grell.logger).to eq(33)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+    context 'does not provides a logger' do
+      let(:logger) { nil }
+      it 'sets default logger' do
+        crawler_manager
+        expect(Grell.logger).to be_instance_of(Logger)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+  end
+
+  describe '#quit' do
+    let(:driver) { double }
+
+    it 'quits the poltergeist driver' do
+      expect(driver).to receive(:quit)
+      crawler_manager.quit
+    end
+  end
+
+  describe '#restart' do
+    let(:driver) { double }
+
+    it 'restarts the poltergeist driver' do
+      expect(driver).to receive(:restart)
+      expect(logger).to receive(:info).with("GRELL is restarting")
+      expect(logger).to receive(:info).with("GRELL has restarted")
+      crawler_manager.restart
+    end
+  end
+
+  describe '#check_periodic_restart' do
+    let(:collection) { double }
+    context 'Periodic restart not setup' do
+      it 'does not restart' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with default period' do
+      let(:do_something) { proc {} }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something }
+        )
+      end
+
+      it 'does not restart after visiting 99 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 99 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting 100 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with custom period' do
+      let(:do_something) { proc {} }
+      let(:period) { 50 }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something, each: period }
+        )
+      end
+
+      it 'does not restart after visiting a number different from custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period * 1.2 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+  end
+
+  describe '#cleanup_all_processes' do
+    let(:driver) { double }
+
+    it 'kills all phantomjs processes' do
+      allow(crawler_manager).to receive(:running_phantomjs_pids).and_return([10])
+      expect(crawler_manager).to receive(:kill_process).with(10)
+      crawler_manager.cleanup_all_processes
+    end
+  end
+end
data/spec/lib/crawler_spec.rb
CHANGED
@@ -5,7 +5,18 @@ RSpec.describe Grell::Crawler do
   let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
   let(:host) { 'http://www.example.com' }
   let(:url) { 'http://www.example.com/test' }
-  let(:
+  let(:add_match_block) { nil }
+  let(:blacklist) { /a^/ }
+  let(:whitelist) { /.*/ }
+  let(:crawler) do
+    Grell::Crawler.new(
+      logger: Logger.new(nil),
+      driver_options: { external_driver: true },
+      evaluate_in_each_page: script,
+      add_match_block: add_match_block,
+      blacklist: blacklist,
+      whitelist: whitelist)
+  end
   let(:script) { nil }
   let(:body) { 'body' }
   let(:custom_add_match) do
@@ -18,29 +29,6 @@ RSpec.describe Grell::Crawler do
     proxy.stub(url).and_return(body: body, code: 200)
   end
 
-  describe 'initialize' do
-    it 'can provide your own logger' do
-      Grell::Crawler.new(external_driver: true, logger: 33)
-      expect(Grell.logger).to eq(33)
-      Grell.logger = Logger.new(nil)
-    end
-
-    it 'provides a stdout logger if nothing provided' do
-      crawler
-      expect(Grell.logger).to be_instance_of(Logger)
-    end
-  end
-
-  describe '#quit' do
-    let(:driver) { double }
-    before { allow(Grell::CapybaraDriver).to receive(:setup).and_return(driver) }
-
-    it 'quits the poltergeist driver' do
-      expect(driver).to receive(:quit)
-      crawler.quit
-    end
-  end
-
   describe '#crawl' do
     before do
       crawler.instance_variable_set('@collection', Grell::PageCollection.new(custom_add_match))
@@ -127,15 +115,6 @@ RSpec.describe Grell::Crawler do
       expect(result[1].url).to eq(url_visited)
     end
 
-    it 'can use a custom url add matcher block' do
-      expect(crawler).to_not receive(:default_add_match)
-      crawler.start_crawling(url, add_match_block: custom_add_match)
-    end
-
-    it 'uses a default url add matched if not provided' do
-      expect(crawler).to receive(:default_add_match).and_return(custom_add_match)
-      crawler.start_crawling(url)
-    end
   end
 
   shared_examples_for 'visits all available pages' do
@@ -204,10 +183,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-      crawler.whitelist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
       let(:visited_pages_count) { 2 } # my own page + trusmis
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -217,10 +193,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-      crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
-      end
-
+      let(:whitelist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) { 2 }
      let(:visited_pages) do
        ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -230,10 +203,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-      crawler.whitelist(/\/trusmis\.html/)
-      end
-
+      let(:whitelist) { /\/trusmis\.html/ }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -243,10 +213,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-      crawler.whitelist([/\/trusmis\.html/])
-      end
-
+      let(:whitelist) { [/\/trusmis\.html/] }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -256,10 +223,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-      crawler.whitelist([])
-      end
-
+      let(:whitelist) { [] }
       let(:visited_pages_count) { 1 } # my own page only
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -269,10 +233,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'adding all links to the whitelist' do
-
-      crawler.whitelist(['/trusmis', '/help'])
-      end
-
+      let(:whitelist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -298,9 +259,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-      crawler.blacklist('/trusmis.html')
-      end
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -310,9 +269,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-      crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
-      end
+      let(:blacklist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -322,9 +279,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-      crawler.blacklist(/\/trusmis\.html/)
-      end
+      let(:blacklist) { /\/trusmis\.html/ }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -334,9 +289,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-      crawler.blacklist([/\/trusmis\.html/])
-      end
+      let(:blacklist) { [/\/trusmis\.html/] }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -346,9 +299,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-      crawler.blacklist([])
-      end
+      let(:blacklist) { [] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -357,10 +308,8 @@ RSpec.describe Grell::Crawler do
       it_behaves_like 'visits all available pages'
     end
 
-    context 'adding all links to the
-
-      crawler.blacklist(['/trusmis', '/help'])
-      end
+    context 'adding all links to the blacklist' do
+      let(:blacklist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -386,11 +335,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist the only whitelisted page' do
-
-
-      crawler.blacklist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -400,11 +346,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist none of the whitelisted pages' do
-
-
-      crawler.blacklist('/raistlin.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/raistlin.html' }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: grell
 version: !ruby/object:Gem::Version
-  version:
+  version: 2.0.0
 platform: ruby
 authors:
 - Jordi Polo Carres
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: capybara
@@ -16,28 +16,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
 - !ruby/object:Gem::Dependency
   name: poltergeist
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -114,28 +114,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
 - !ruby/object:Gem::Dependency
   name: puffing-billy
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
 - !ruby/object:Gem::Dependency
   name: timecop
   requirement: !ruby/object:Gem::Requirement
@@ -150,20 +150,6 @@ dependencies:
     - - "~>"
      - !ruby/object:Gem::Version
        version: '0.8'
-- !ruby/object:Gem::Dependency
-  name: capybara-webkit
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
 - !ruby/object:Gem::Dependency
   name: selenium-webdriver
   requirement: !ruby/object:Gem::Requirement
@@ -196,6 +182,7 @@ files:
 - lib/grell.rb
 - lib/grell/capybara_driver.rb
 - lib/grell/crawler.rb
+- lib/grell/crawler_manager.rb
 - lib/grell/grell_logger.rb
 - lib/grell/page.rb
 - lib/grell/page_collection.rb
@@ -203,6 +190,7 @@ files:
 - lib/grell/reader.rb
 - lib/grell/version.rb
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb
@@ -220,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
    - !ruby/object:Gem::Version
-      version: 1.
+      version: 2.1.8
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -234,6 +222,7 @@ specification_version: 4
 summary: Ruby web crawler
 test_files:
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb