grell 1.6.11 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/CHANGELOG.md +10 -0
- data/Gemfile +4 -0
- data/README.md +91 -64
- data/grell.gemspec +5 -6
- data/lib/grell.rb +1 -0
- data/lib/grell/capybara_driver.rb +1 -1
- data/lib/grell/crawler.rb +16 -46
- data/lib/grell/crawler_manager.rb +77 -0
- data/lib/grell/page_collection.rb +9 -1
- data/lib/grell/version.rb +1 -1
- data/spec/lib/crawler_manager_spec.rb +113 -0
- data/spec/lib/crawler_spec.rb +29 -86
- metadata +14 -25
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bbd2a19c7858d2e755e7d51a038b84901835f32b
+  data.tar.gz: 7c65a731bfdacb7b65cf523c8ba128f1f3390f96
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fcf31c442f8d51cd4a9534270cc3afd7f5bac432a74a9ec2e429914145df710131e0a99aca61e4cd98d2e831a5cedb511c21dce9d1562881541fd1bfb4016014
+  data.tar.gz: c413503a4a8765fadbec10ed64467ba140b8739b6209ae4a19e585549ad689231b64b3d670e1bdceec560597051d93f8b90d696f66571cb1dad175f15af1b4a0
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,13 @@
+# 2.0.0
+* New configuration key `on_periodic_restart`.
+* CrawlerManager.cleanup_all_processes method destroy all instances of phantomjs in this machine.
+
+* Breaking changes
+  - Requires Ruby 2.1 or later.
+  - Crawler.start_crawling does not accept options anymore, all options are passed to Crawler.new.
+  - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
+  - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
+
 # 1.6.11
 * Ensure all links are loaded by waiting for Ajax requests to complete
 * Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
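As an illustration of the breaking changes listed above, the sketch below shows a 2.0.0-style caller; it is not part of the released files, and the URL, filter values and restart block are placeholders assembled only from the README and code changes in this diff.

```ruby
require 'logger'
require 'grell'

# Hypothetical 2.0.0 usage: whitelist/blacklist and the periodic restart are
# constructor options now, and restart/quit live on the CrawlerManager
# exposed as `crawler.manager` (they were methods on Crawler in 1.x).
crawler = Grell::Crawler.new(
  logger: Logger.new(STDOUT),
  whitelist: [/games\/.*/, '/fun'],
  blacklist: /garbage/,
  on_periodic_restart: { do: proc { puts 'PhantomJS restarted' }, each: 200 }
)

crawler.start_crawling('http://www.example.com') do |page|
  puts "#{page.status} #{page.url}"
end

crawler.manager.quit # was crawler.quit in 1.x
```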
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -21,16 +21,15 @@ Or install it yourself as:
 
     $ gem install grell
 
-Grell uses PhantomJS, you will need to download and install it in your
+Grell uses PhantomJS as a browser, you will need to download and install it in your
 system. Check for instructions in http://phantomjs.org/
 Grell has been tested with PhantomJS v2.1.x
 
 ## Usage
 
-
 ### Crawling an entire site
 
-The main entry point of the library is Grell#start_crawling.
+The main entry point of the library is Grell::Crawler#start_crawling.
 Grell will yield to your code with each page it finds:
 
 ```ruby
@@ -55,85 +54,105 @@ This list is indexed by the complete url, including query parameters.
 
 ### Re-retrieving a page
 If you want Grell to revisit a page and return the data to you again,
-return the symbol :retry in your block
+return the symbol :retry in your block for the start_crawling method.
 For instance
 ```ruby
 require 'grell'
 crawler = Grell::Crawler.new
 crawler.start_crawling('http://www.google.com') do |current_page|
   if current_page.status == 500 && current_page.retries == 0
-    crawler.restart
+    crawler.manager.restart
     :retry
   end
 end
 ```
 
-###
-If you are doing a long crawling it is possible that phantomJS starts failing.
-To avoid that, you can restart it by calling "restart" on crawler.
-That will kill phantom and will restart it. Grell will keep the status of
-pages already visited and pages discovered and to be visited. And will keep crawling
-with the new phantomJS process instead of the old one.
+### Pages' id
 
-
+Each page has an unique id, accessed by the property `id`. Also each page stores the id of the page from which we found this page, accessed by the property `parent_id`.
+The page object generated by accessing the first URL passed to the start_crawling(the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
+Using this information it is possible to construct a directed graph.
 
-Grell by default will follow all the links it finds going to the site
-your are crawling. It will never follow links linking outside your site.
-If you want to further limit the amount of links crawled, you can use
-whitelisting, blacklisting or manual filtering.
 
-
-By default, Grell will detect new URLs to visit by comparing the full URL
-with the URLs of the discovered and visited links. This functionality can
-be changed by passing a block of code to Grells `start_crawling` method.
-In the below example, the path of the URLs (instead of the full URL) will
-be compared.
+### Restart and quit
 
+Grell can be restarted. The current list of visited and yet-to-visit pages list are not modified when restarting
+but the browser is destroyed and recreated, all cookies and local storage are lost. After restarting, crawling is resumed with a
+new browser.
+To destroy the crawler, call the `quit` method. This will free the memory taken in Ruby and destroys the PhantomJS process.
 ```ruby
 require 'grell'
-
 crawler = Grell::Crawler.new
+crawler.manager.restart # restarts the browser
+crawler.manager.quit # quits and destroys the crawler
+```
 
-
-  collection_page.path == page.path
-end
+### Options
 
-
-
-
-
+The `Grell:Crawler` class can be passed options to customize its behavior:
+- `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
+- `on_periodic_restart`: Sets periodic restarts of the crawler each certain number of visits. Default: 100 pages.
+- `whitelist`: Setups a whitelist filter for URLs to be visited. Default: all URLs are whitelisted.
+- `blacklist`: Setups a blacklist filter for URLs to be avoided. Default: no URL is blacklisted.
+- `add_match_block`: Block evaluated to consider if a given page should be part of the pages to be visited. Default: add unique URLs.
+- `evaluate_in_each_page`: Javascript block to be evaluated on each page visited. Default: Nothing evaluated.
+- `driver_options`: Driver options will be passed to the Capybara driver which connects to PhantomJS.
 
-
+Grell by default will follow all the links it finds in the site being crawled.
+It will never follow links linking outside your site.
+If you want to further limit the amount of links crawled, you can use
+whitelisting, blacklisting or manual filtering.
+Below further details on these and other options.
 
-```ruby
-require 'grell'
 
-
-
-crawler.
-
+#### Automatically restarting PhantomJS
+If you are doing a long crawling it is possible that phantomJS gets into an inconsistent state or it starts leaking memory.
+The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
+`on_periodic_restart` configuration key as follows:
 
-
-
-string match is whitelisted) or an array with regexps and/or strings.
+```ruby
+require 'grell'
 
-
+crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
 
-
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+endd
+```
 
-crawler
-
-crawler.start_crawling('http://www.google.com')
-```
+This code will setup the crawler to be restarted every 200 pages being crawled and to call `my_restart_procedure`
+between restarts. A restart will destroy the cookies so for instance this custom block can be used to relogin.
 
-Similar to whitelisting. But now Grell will follow every other link in
-this site which does not go to /games/...
 
-
-
-
-
+#### Whitelisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(whitelist: [/games\/.*/, '/fun'])
+crawler.start_crawling('http://www.google.com')
+```
+
+Grell here will only follow links to games and '/fun' and ignore all
+other links. You can provide a regexp, strings (if any part of the
+string match is whitelisted) or an array with regexps and/or strings.
+
+#### Blacklisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(blacklist: /games\/.*/)
+crawler.start_crawling('http://www.google.com')
+```
+
+Similar to whitelisting. But now Grell will follow every other link in
+this site which does not go to /games/...
+
+If you call both whitelist and blacklist then both will apply, a link
+has to fullfill both conditions to survive. If you do not call any, then
+all links on this site will be crawled. Think of these methods as
+filters.
 
 #### Manual link filtering
 
@@ -144,14 +163,28 @@ links to visit. So you can modify in your block of code "page.links" to
 add and delete links to instruct Grell to add them to the list of links
 to visit next.
 
-
+#### Custom URL Comparison
+By default, Grell will detect new URLs to visit by comparing the full URL
+with the URLs of the discovered and visited links. This functionality can
+be changed by passing a block of code to Grells `start_crawling` method.
+In the below example, the path of the URLs (instead of the full URL) will
+be compared.
 
-
-
-
+```ruby
+require 'grell'
+
+add_match_block = Proc.new do |collection_page, page|
+  collection_page.path == page.path
+end
 
+crawler = Grell::Crawler.new(add_match_block: add_match_block)
 
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
+
+#### Evaluate script
 
 You can evalute a JavaScript snippet in each page before extracting links by passing the snippet to the 'evaluate_in_each_page' option:
 
@@ -168,12 +201,6 @@ When there is an error in the page or an internal error in the crawler (Javascri
 - errorClass: The class of the error which broke this page.
 - errorMessage: A descriptive message with the information Grell could gather about the error.
 
-### Logging
-You can pass your logger to Grell. For example in a Rails app:
-```Ruby
-crawler = Grell::Crawler.new(logger: Rails.logger)
-```
-
 ## Tests
 
 Run the tests with
data/grell.gemspec
CHANGED
@@ -19,19 +19,18 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
-  spec.required_ruby_version = '>= 1.
+  spec.required_ruby_version = '>= 2.1.8'
 
-  spec.add_dependency 'capybara', '~> 2.
-  spec.add_dependency 'poltergeist', '~> 1.
+  spec.add_dependency 'capybara', '~> 2.10'
+  spec.add_dependency 'poltergeist', '~> 1.11'
 
   spec.add_development_dependency 'bundler', '~> 1.6'
   spec.add_development_dependency 'byebug', '~> 4.0'
   spec.add_development_dependency 'kender', '~> 0.2'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'webmock', '~> 1.18'
-  spec.add_development_dependency 'rspec', '~> 3.
-  spec.add_development_dependency 'puffing-billy', '~> 0.
+  spec.add_development_dependency 'rspec', '~> 3.5'
+  spec.add_development_dependency 'puffing-billy', '~> 0.9'
   spec.add_development_dependency 'timecop', '~> 0.8'
-  spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
   spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
 end
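For applications consuming the gem, the version bumps above translate roughly into the following Gemfile constraint; this is an illustrative sketch, not part of this diff, and assumes the app already satisfies the new Ruby requirement.

```ruby
# Illustrative Gemfile entry for an application picking up this release.
source 'https://rubygems.org'

gem 'grell', '~> 2.0' # grell 2.0.0 requires Ruby >= 2.1.8 and PhantomJS 2.1.x
```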
data/lib/grell.rb
CHANGED
data/lib/grell/crawler.rb
CHANGED
@@ -1,54 +1,32 @@
-
 module Grell
-
   # This is the class that starts and controls the crawling
   class Crawler
-    attr_reader :collection
+    attr_reader :collection, :manager
 
     # Creates a crawler
-    #
-
-
-
-
-
-
-
-      @
-      @
-
-
-    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
-    def restart
-      Grell.logger.info "GRELL is restarting"
-      @driver.restart
-      Grell.logger.info "GRELL has restarted"
-    end
-
-    # Quits the poltergeist driver.
-    def quit
-      Grell.logger.info "GRELL is quitting the poltergeist driver"
-      @driver.quit
-    end
-
-    # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
-    def whitelist(list)
-      @whitelist_regexp = Regexp.union(list)
-    end
-
-    # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
-    def blacklist(list)
-      @blacklist_regexp = Regexp.union(list)
+    # evaluate_in_each_page: javascript block to evaluate in each page we crawl
+    # add_match_block: block to evaluate to consider if a page is part of the collection
+    # manager_options: options passed to the manager class
+    # whitelist: Setups a whitelist filter, allows a regexp, string or array of either to be matched.
+    # blacklist: Setups a blacklist filter, allows a regexp, string or array of either to be matched.
+    def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
+      @collection = nil
+      @manager = CrawlerManager.new(manager_options)
+      @evaluate_in_each_page = evaluate_in_each_page
+      @add_match_block = add_match_block
+      @whitelist_regexp = Regexp.union(whitelist)
+      @blacklist_regexp = Regexp.union(blacklist)
     end
 
     # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
-    def start_crawling(url,
+    def start_crawling(url, &block)
       Grell.logger.info "GRELL Started crawling"
-      @collection = PageCollection.new(
+      @collection = PageCollection.new(@add_match_block)
       @collection.create_page(url, nil)
 
       while !@collection.discovered_pages.empty?
         crawl(@collection.next_page, block)
+        @manager.check_periodic_restart(@collection)
       end
 
       Grell.logger.info "GRELL finished crawling"
@@ -93,14 +71,6 @@ module Grell
       links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
     end
 
-    # If options[:add_match_block] is not provided, url matching to determine if a
-    # new page should be added the page collection will default to this proc
-    def default_add_match
-      Proc.new do |collection_page, page|
-        collection_page.url.downcase == page.url.downcase
-      end
-    end
-
     # Store the resulting redirected URL along with the original URL
     def add_redirect_url(site)
       if site.url != site.current_url
data/lib/grell/crawler_manager.rb
ADDED
@@ -0,0 +1,77 @@
+module Grell
+  # Manages the state of the process crawling, does not care about individual pages but about logging,
+  # restarting and quiting the crawler correctly.
+  class CrawlerManager
+    # logger: logger to use for Grell's messages
+    # on_periodic_restart: if set, the driver will restart every :each visits (100 default) and execute the :do block
+    # driver_options: Any extra options for the Capybara driver
+    def initialize(logger: nil, on_periodic_restart: {}, driver: nil, driver_options: {})
+      Grell.logger = logger ? logger : Logger.new(STDOUT)
+      @periodic_restart_block = on_periodic_restart[:do]
+      @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
+      @driver = driver || CapybaraDriver.setup(driver_options)
+      if @periodic_restart_period <= 0
+        Grell.logger.warn "GRELL being misconfigured with a negative period to restart. Ignoring option."
+      end
+    end
+
+    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
+    def restart
+      Grell.logger.info "GRELL is restarting"
+      @driver.restart
+      Grell.logger.info "GRELL has restarted"
+    end
+
+    # Quits the poltergeist driver.
+    def quit
+      Grell.logger.info "GRELL is quitting the poltergeist driver"
+      @driver.quit
+    end
+
+    # PhantomJS seems to consume memory increasingly as it crawls, periodic restart allows to restart
+    # the driver, potentially calling a block.
+    def check_periodic_restart(collection)
+      return unless @periodic_restart_block
+      return unless @periodic_restart_period > 0
+      return unless (collection.visited_pages.size % @periodic_restart_period).zero?
+      restart
+      @periodic_restart_block.call
+    end
+
+    def cleanup_all_processes
+      pids = running_phantomjs_pids
+      return if pids.empty?
+      Grell.logger.warn "GRELL. Killing PhantomJS processes: #{pids.inspect}"
+      pids.each do |pid|
+        Grell.logger.warn "Sending KILL to PhantomJS process #{pid}"
+        kill_process(pid.to_i)
+      end
+    end
+
+    private
+
+    PAGES_TO_RESTART = 100 # Default number of pages before we restart the driver.
+    KILL_TIMEOUT = 2 # Number of seconds we wait till we kill the process.
+
+    def running_phantomjs_pids
+      list_phantomjs_processes_cmd = "ps -ef | grep -E 'bin/phantomjs' | grep -v grep"
+      `#{list_phantomjs_processes_cmd} | awk '{print $2;}'`.split("\n")
+    end
+
+    def kill_process(pid)
+      Process.kill('TERM', pid)
+      force_kill(pid)
+    rescue Errno::ESRCH, Errno::ECHILD
+      # successfully terminated
+    rescue => e
+      Grell.logger.exception e, "PhantomJS process could not be killed"
+    end
+
+    def force_kill(pid)
+      Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
+    rescue Timeout::Error
+      Process.kill('KILL', pid)
+      Process.wait(pid)
+    end
+  end
+end
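A minimal standalone sketch of the class added above, assuming PhantomJS is installed locally; it is illustrative only and uses just the methods visible in this diff.

```ruby
require 'logger'
require 'grell'

# Instantiating the manager without a driver wires up the default logger and
# a fresh Capybara/Poltergeist driver via CapybaraDriver.setup.
manager = Grell::CrawlerManager.new(logger: Logger.new(STDOUT))

# Kill any phantomjs processes left over from previous crawls: each PID found
# via `ps` gets TERM, then KILL if it has not exited after KILL_TIMEOUT (2s).
manager.cleanup_all_processes

# restart/quit act on the underlying PhantomJS browser; check_periodic_restart
# only restarts when an on_periodic_restart :do block was configured and the
# visited-page count is a multiple of :each (PAGES_TO_RESTART = 100 by default).
manager.restart
manager.quit
```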
data/lib/grell/page_collection.rb
CHANGED
@@ -10,7 +10,7 @@ module Grell
     # to the collection or if it is already present will be passed to the initializer.
     def initialize(add_match_block)
       @collection = []
-      @add_match_block = add_match_block
+      @add_match_block = add_match_block || default_add_match
     end
 
     def create_page(url, parent_id)
@@ -50,5 +50,13 @@ module Grell
       end
     end
 
+    # If add_match_block is not provided, url matching to determine if a new page should be added
+    # to the page collection will default to this proc
+    def default_add_match
+      Proc.new do |collection_page, page|
+        collection_page.url.downcase == page.url.downcase
+      end
+    end
+
   end
 end
data/lib/grell/version.rb
CHANGED
data/spec/lib/crawler_manager_spec.rb
ADDED
@@ -0,0 +1,113 @@
+RSpec.describe Grell::CrawlerManager do
+  let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
+  let(:host) { 'http://www.example.com' }
+  let(:url) { 'http://www.example.com/test' }
+  let(:driver) { double(Grell::CapybaraDriver) }
+  let(:logger) { Logger.new(nil) }
+  let(:crawler_manager) do
+    described_class.new(logger: logger, driver: driver)
+  end
+
+  describe 'initialize' do
+    context 'provides a logger' do
+      let(:logger) { 33 }
+      it 'sets custom logger' do
+        crawler_manager
+        expect(Grell.logger).to eq(33)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+    context 'does not provides a logger' do
+      let(:logger) { nil }
+      it 'sets default logger' do
+        crawler_manager
+        expect(Grell.logger).to be_instance_of(Logger)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+  end
+
+  describe '#quit' do
+    let(:driver) { double }
+
+    it 'quits the poltergeist driver' do
+      expect(driver).to receive(:quit)
+      crawler_manager.quit
+    end
+  end
+
+  describe '#restart' do
+    let(:driver) { double }
+
+    it 'restarts the poltergeist driver' do
+      expect(driver).to receive(:restart)
+      expect(logger).to receive(:info).with("GRELL is restarting")
+      expect(logger).to receive(:info).with("GRELL has restarted")
+      crawler_manager.restart
+    end
+  end
+
+  describe '#check_periodic_restart' do
+    let(:collection) { double }
+    context 'Periodic restart not setup' do
+      it 'does not restart' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with default period' do
+      let(:do_something) { proc {} }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something }
+        )
+      end
+
+      it 'does not restart after visiting 99 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 99 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting 100 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with custom period' do
+      let(:do_something) { proc {} }
+      let(:period) { 50 }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something, each: period }
+        )
+      end
+
+      it 'does not restart after visiting a number different from custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period * 1.2 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+  end
+
+  describe '#cleanup_all_processes' do
+    let(:driver) { double }
+
+    it 'kills all phantomjs processes' do
+      allow(crawler_manager).to receive(:running_phantomjs_pids).and_return([10])
+      expect(crawler_manager).to receive(:kill_process).with(10)
+      crawler_manager.cleanup_all_processes
+    end
+  end
+end
data/spec/lib/crawler_spec.rb
CHANGED
@@ -5,7 +5,18 @@ RSpec.describe Grell::Crawler do
   let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
   let(:host) { 'http://www.example.com' }
   let(:url) { 'http://www.example.com/test' }
-  let(:
+  let(:add_match_block) { nil }
+  let(:blacklist) { /a^/ }
+  let(:whitelist) { /.*/ }
+  let(:crawler) do
+    Grell::Crawler.new(
+      logger: Logger.new(nil),
+      driver_options: { external_driver: true },
+      evaluate_in_each_page: script,
+      add_match_block: add_match_block,
+      blacklist: blacklist,
+      whitelist: whitelist)
+  end
   let(:script) { nil }
   let(:body) { 'body' }
   let(:custom_add_match) do
@@ -18,29 +29,6 @@ RSpec.describe Grell::Crawler do
     proxy.stub(url).and_return(body: body, code: 200)
   end
 
-  describe 'initialize' do
-    it 'can provide your own logger' do
-      Grell::Crawler.new(external_driver: true, logger: 33)
-      expect(Grell.logger).to eq(33)
-      Grell.logger = Logger.new(nil)
-    end
-
-    it 'provides a stdout logger if nothing provided' do
-      crawler
-      expect(Grell.logger).to be_instance_of(Logger)
-    end
-  end
-
-  describe '#quit' do
-    let(:driver) { double }
-    before { allow(Grell::CapybaraDriver).to receive(:setup).and_return(driver) }
-
-    it 'quits the poltergeist driver' do
-      expect(driver).to receive(:quit)
-      crawler.quit
-    end
-  end
-
   describe '#crawl' do
     before do
       crawler.instance_variable_set('@collection', Grell::PageCollection.new(custom_add_match))
@@ -127,15 +115,6 @@ RSpec.describe Grell::Crawler do
       expect(result[1].url).to eq(url_visited)
     end
 
-    it 'can use a custom url add matcher block' do
-      expect(crawler).to_not receive(:default_add_match)
-      crawler.start_crawling(url, add_match_block: custom_add_match)
-    end
-
-    it 'uses a default url add matched if not provided' do
-      expect(crawler).to receive(:default_add_match).and_return(custom_add_match)
-      crawler.start_crawling(url)
-    end
   end
 
   shared_examples_for 'visits all available pages' do
@@ -204,10 +183,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-        crawler.whitelist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
      let(:visited_pages_count) { 2 } # my own page + trusmis
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -217,10 +193,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-        crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
-      end
-
+      let(:whitelist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -230,10 +203,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-        crawler.whitelist(/\/trusmis\.html/)
-      end
-
+      let(:whitelist) { /\/trusmis\.html/ }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -243,10 +213,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-        crawler.whitelist([/\/trusmis\.html/])
-      end
-
+      let(:whitelist) { [/\/trusmis\.html/] }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -256,10 +223,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-        crawler.whitelist([])
-      end
-
+      let(:whitelist) { [] }
       let(:visited_pages_count) { 1 } # my own page only
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -269,10 +233,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'adding all links to the whitelist' do
-
-        crawler.whitelist(['/trusmis', '/help'])
-      end
-
+      let(:whitelist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -298,9 +259,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-        crawler.blacklist('/trusmis.html')
-      end
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -310,9 +269,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-        crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
-      end
+      let(:blacklist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -322,9 +279,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-        crawler.blacklist(/\/trusmis\.html/)
-      end
+      let(:blacklist) { /\/trusmis\.html/ }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -334,9 +289,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-        crawler.blacklist([/\/trusmis\.html/])
-      end
+      let(:blacklist) { [/\/trusmis\.html/] }
      let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -346,9 +299,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-        crawler.blacklist([])
-      end
+      let(:blacklist) { [] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -357,10 +308,8 @@ RSpec.describe Grell::Crawler do
       it_behaves_like 'visits all available pages'
     end
 
-    context 'adding all links to the
-
-        crawler.blacklist(['/trusmis', '/help'])
-      end
+    context 'adding all links to the blacklist' do
+      let(:blacklist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -386,11 +335,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist the only whitelisted page' do
-
-
-        crawler.blacklist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -400,11 +346,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist none of the whitelisted pages' do
-
-
-        crawler.blacklist('/raistlin.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/raistlin.html' }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: grell
 version: !ruby/object:Gem::Version
-  version: 1.6.11
+  version: 2.0.0
 platform: ruby
 authors:
 - Jordi Polo Carres
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: capybara
@@ -16,28 +16,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
 - !ruby/object:Gem::Dependency
   name: poltergeist
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -114,28 +114,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
 - !ruby/object:Gem::Dependency
   name: puffing-billy
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
 - !ruby/object:Gem::Dependency
   name: timecop
   requirement: !ruby/object:Gem::Requirement
@@ -150,20 +150,6 @@ dependencies:
     - - "~>"
      - !ruby/object:Gem::Version
        version: '0.8'
-- !ruby/object:Gem::Dependency
-  name: capybara-webkit
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
 - !ruby/object:Gem::Dependency
   name: selenium-webdriver
   requirement: !ruby/object:Gem::Requirement
@@ -196,6 +182,7 @@ files:
 - lib/grell.rb
 - lib/grell/capybara_driver.rb
 - lib/grell/crawler.rb
+- lib/grell/crawler_manager.rb
 - lib/grell/grell_logger.rb
 - lib/grell/page.rb
 - lib/grell/page_collection.rb
@@ -203,6 +190,7 @@ files:
 - lib/grell/reader.rb
 - lib/grell/version.rb
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb
@@ -220,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
    - !ruby/object:Gem::Version
-      version: 1.
+      version: 2.1.8
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -234,6 +222,7 @@ specification_version: 4
 summary: Ruby web crawler
 test_files:
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb