grell 1.6.10 → 2.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 866f1b7117624455b79791bacba549710eb7dc2b
-  data.tar.gz: 4a3646c053bb7b4884fa8b82ebb24809d2d37f97
+SHA256:
+  metadata.gz: c17856255ff1e871cc5e12cc2a9f0f4870156923ab924ea11db16b053a6742fb
+  data.tar.gz: d619076b40cbb4b057015a8bbcb8a07f555c282aa0ec971aa36b4e867fbfbd86
 SHA512:
-  metadata.gz: dfb07e7c0a6a7fb2fe53a40a3da248a24ba683972c696ab7064481448cb4067403dc7350d34cbba21ff4f240fb09b3571adbb36b38bfecd28aeee1ea1551638e
-  data.tar.gz: 7f91a206fd4cb264d73b05468dce66d858637c6d675b92088b800a85b67599201f957f879d618a0f4fc7a538de79c3211f82ffc3d73b063fec760da09209578a
+  metadata.gz: 28860f331fc02f6976bcfd8717bf8c33ca89984ae5d2ce9eede6abb31b5f06b44e2135468c6d75374dd649378cc3d719474979c2f27e67a5a7e5301fc561113f
+  data.tar.gz: 77f68dbdb006803c517de4e0b72a11ac9eba265781703f1b03f98af52b147cab7ba02429371038d426fed1073e1a5f3dcdc0a6838cbf93c64c0c4307f605eea6
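The release switches the per-file integrity digests from SHA1 to SHA256 (and regenerates the SHA512 entries). For reference, a minimal sketch of reproducing such digests with Ruby's standard library; the local file paths are assumptions based on the member names listed above:

```ruby
require 'digest'

# A .gem file is a tar archive containing metadata.gz and data.tar.gz;
# the checksums above are computed over those two members.
# Hypothetical paths: assume both files were extracted locally.
%w[metadata.gz data.tar.gz].each do |member|
  puts "#{member} SHA256: #{Digest::SHA256.file(member).hexdigest}"
  puts "#{member} SHA512: #{Digest::SHA512.file(member).hexdigest}"
end
```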
data/.travis.yml CHANGED
@@ -1,14 +1,28 @@
 language: ruby
 cache: bundler
-sudo: false
 
 rvm:
 - 2.2.4
 - 2.3.0
-script: bundle exec rspec
+- 2.4.2
 
 before_install:
 - mkdir travis-phantomjs
-- wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 -O $PWD/travis-phantomjs/phantomjs-2.1.1-linux-x86_64.tar.bz2
+- wget https://github.com/JordiPolo/phantomjs/blob/master/phantomjs-2.1.1-linux-x86_64.tar.bz2?raw=true
+  -O $PWD/travis-phantomjs/phantomjs-2.1.1-linux-x86_64.tar.bz2
 - tar -xvf $PWD/travis-phantomjs/phantomjs-2.1.1-linux-x86_64.tar.bz2 -C $PWD/travis-phantomjs
 - export PATH=$PWD/travis-phantomjs/phantomjs-2.1.1-linux-x86_64/bin:$PATH
+
+install:
+- bundle install --jobs=3 --retry=3
+
+script:
+- bundle exec rspec
+
+deploy:
+  provider: rubygems
+  api_key:
+    secure: czStDI0W6MWL70sDwu53oNNCc8vKtT61pgvii+ZWIC9A41C2p7BzmbtosXsnLk2ApxmpWvFIgtQE0XIH7jkM5mY05cHinXDphtOTkNLFVjck3ZOMkx/cc+QRFW8K4FHkrzFsC+/Xx4t2/Psh35LpzhfJd0XzKKoCstXUVgJsfGcAK3DMpjXHSUbwLXGDZ4lzmsk52OLf0oL+in2447TJfVOvGXtYmfh1PjXRwDxKB0dan7w5mVgajS52b6wUhVPTaMe/JgCbMuV7BaQ1Goq8u7V4aaxU+liPAhzHWfMB6tF4TEW8yu2tvGLdOA0+1jmM8E9Q5saPWtwKiHvBxN8CzRpkiNDzyFAf8ljrWT5yKX3aRQCyPp3NNyhoumWap36b+O/zwZ3HxoAe22Yg0rjz8z8NxMR/ELPvjPYjCiF5zY7fO9PAzmIynMRUrxDnFj+/JGHdzx0ZMo3fEXgHHSaHPNxIzEffVVQk4XLVnFHDjBLY4mVp4sbHbja5qnui20RkdM/H9Yi/fQyl1ODhk+LUPoh45ZneDZq7GPrl+WKK06oEjXIXLU+1iEuqnSqybbmJMTUJlUV+7EJdtq2DgfDB4KXwLm2LLOR/IX63AzEav4NIxx3hIXifSKa9rp6D7nMTzdQwF0FFzIj/Y3qLrAe1WWt0gx3Vxq67pSwOJthk5Fc=
+  on:
+    tags: true
+    rvm: 2.4.2
data/CHANGELOG.md CHANGED
@@ -1,3 +1,27 @@
+# 2.1.2
+* Change white/black lists to allow/deny lists
+
+# 2.1.1
+* Update phantomjs_options to use 'TLSv1.2'
+
+# 2.1.0
+* Delete `driver_options` configuration key as it was never used.
+* `cleanup_all_processes` is a class method, as intended.
+
+# 2.0.0
+* New configuration key `on_periodic_restart`.
+* The CrawlerManager.cleanup_all_processes method destroys all PhantomJS processes on the machine.
+
+* Breaking changes
+  - Requires Ruby 2.1 or later.
+  - Crawler.start_crawling no longer accepts options; all options are passed to Crawler.new.
+  - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
+  - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
+
+# 1.6.11
+* Ensure all links are loaded by waiting for Ajax requests to complete
+* Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
+
 # 1.6.10
 * Avoid following JS href links, add missing dependencies to fix Travis build
 
data/Gemfile CHANGED
@@ -1,3 +1,7 @@
 source 'https://rubygems.org'
 
+# Keep Ruby 2.1 from using Rack > 2.0, which is not compatible with it
+platform :ruby_21 do
+  gem 'rack', '~> 1.0'
+end
 gemspec
data/README.md CHANGED
@@ -21,16 +21,15 @@ Or install it yourself as:
 
     $ gem install grell
 
-Grell uses PhantomJS, you will need to download and install it in your
+Grell uses PhantomJS as a browser; you will need to download and install it in your
 system. Check for instructions in http://phantomjs.org/
 Grell has been tested with PhantomJS v2.1.x
 
 ## Usage
 
-
 ### Crawling an entire site
 
-The main entry point of the library is Grell#start_crawling.
+The main entry point of the library is Grell::Crawler#start_crawling.
 Grell will yield to your code with each page it finds:
 
 ```ruby
@@ -55,85 +54,104 @@ This list is indexed by the complete url, including query parameters.
 
 ### Re-retrieving a page
 If you want Grell to revisit a page and return the data to you again,
-return the symbol :retry in your block in the start_crawling method.
+return the symbol :retry in your block for the start_crawling method.
 For instance
 ```ruby
 require 'grell'
 crawler = Grell::Crawler.new
 crawler.start_crawling('http://www.google.com') do |current_page|
   if current_page.status == 500 && current_page.retries == 0
-    crawler.restart
+    crawler.manager.restart
     :retry
   end
 end
 ```
 
-### Restarting PhantomJS
-If you are doing a long crawling it is possible that phantomJS starts failing.
-To avoid that, you can restart it by calling "restart" on crawler.
-That will kill phantom and will restart it. Grell will keep the status of
-pages already visited and pages discovered and to be visited. And will keep crawling
-with the new phantomJS process instead of the old one.
+### Pages' id
 
-### Selecting links to follow
+Each page has a unique id, accessed by the property `id`. Each page also stores the id of the page from which it was found, accessed by the property `parent_id`.
+The page object generated by accessing the first URL passed to start_crawling (the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
+Using this information it is possible to construct a directed graph.
 
-Grell by default will follow all the links it finds going to the site
-your are crawling. It will never follow links linking outside your site.
-If you want to further limit the amount of links crawled, you can use
-whitelisting, blacklisting or manual filtering.
 
-#### Custom URL Comparison
-By default, Grell will detect new URLs to visit by comparing the full URL
-with the URLs of the discovered and visited links. This functionality can
-be changed by passing a block of code to Grells `start_crawling` method.
-In the below example, the path of the URLs (instead of the full URL) will
-be compared.
+### Restart and quit
 
+Grell can be restarted. The lists of visited and yet-to-visit pages are not modified when restarting,
+but the browser is destroyed and recreated; all cookies and local storage are lost. After restarting, crawling resumes with a
+new browser.
+To destroy the crawler, call the `quit` method. This frees the memory taken in Ruby and destroys the PhantomJS process.
 ```ruby
 require 'grell'
-
 crawler = Grell::Crawler.new
+crawler.manager.restart # restarts the browser
+crawler.manager.quit # quits and destroys the crawler
+```
 
-add_match_block = Proc.new do |collection_page, page|
-  collection_page.path == page.path
-end
+### Options
 
-crawler.start_crawling('http://www.google.com', add_match_block: add_match_block) do |current_page|
-  ...
-end
-```
+The `Grell::Crawler` class can be passed options to customize its behavior:
+- `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
+- `on_periodic_restart`: Sets periodic restarts of the crawler after a given number of visits. Default: 100 pages.
+- `allowlist`: Sets an allowlist filter for URLs to be visited. Default: all URLs are allowlisted.
+- `denylist`: Sets a denylist filter for URLs to be avoided. Default: no URL is denylisted.
+- `add_match_block`: Block evaluated to decide whether a given page should be added to the pages to be visited. Default: add unique URLs.
+- `evaluate_in_each_page`: JavaScript code to be evaluated on each page visited. Default: nothing is evaluated.
 
-#### Whitelisting
+Grell by default will follow all the links it finds in the site being crawled.
+It will never follow links linking outside your site.
+If you want to further limit the amount of links crawled, you can use
+allowlisting, denylisting or manual filtering.
+Further details on these and other options follow below.
 
-```ruby
-require 'grell'
 
-crawler = Grell::Crawler.new
-crawler.whitelist([/games\/.*/, '/fun'])
-crawler.start_crawling('http://www.google.com')
-```
+#### Automatically restarting PhantomJS
+If you are doing a long crawl, it is possible that PhantomJS gets into an inconsistent state or starts leaking memory.
+The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
+`on_periodic_restart` configuration key as follows:
 
-Grell here will only follow links to games and '/fun' and ignore all
-other links. You can provide a regexp, strings (if any part of the
-string match is whitelisted) or an array with regexps and/or strings.
+```ruby
+require 'grell'
 
-#### Blacklisting
+crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
 
-```ruby
-require 'grell'
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
 
-crawler = Grell::Crawler.new
-crawler.blacklist(/games\/.*/)
-crawler.start_crawling('http://www.google.com')
-```
+This code sets up the crawler to restart every 200 crawled pages and to call `my_restart_procedure`
+between restarts. A restart destroys the cookies, so this custom block can be used, for instance, to log in again.
+
+
+#### Allowlisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(allowlist: [/games\/.*/, '/fun'])
+crawler.start_crawling('http://www.google.com')
+```
+
+Grell here will only follow links to games and '/fun' and ignore all
+other links. You can provide a regexp, strings (if any part of the
+string matches, it is allowlisted) or an array of regexps and/or strings.
 
-Similar to whitelisting. But now Grell will follow every other link in
-this site which does not go to /games/...
+#### Denylisting
 
-If you call both whitelist and blacklist then both will apply, a link
-has to fullfill both conditions to survive. If you do not call any, then
-all links on this site will be crawled. Think of these methods as
-filters.
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(denylist: /games\/.*/)
+crawler.start_crawling('http://www.google.com')
+```
+
+Similar to allowlisting, but now Grell will follow every other link in
+this site which does not go to /games/...
+
+If you use both an allowlist and a denylist then both will apply; a link
+has to fulfill both conditions to survive. If you use neither, then
+all links on this site will be crawled. Think of these options as
+filters.
 
 #### Manual link filtering
 
@@ -144,12 +162,37 @@ links to visit. So you can modify in your block of code "page.links" to
 add and delete links to instruct Grell to add them to the list of links
 to visit next.
 
-### Pages' id
+#### Custom URL Comparison
+By default, Grell will detect new URLs to visit by comparing the full URL
+with the URLs of the discovered and visited links. This functionality can
+be changed by passing a block of code via the `add_match_block` option of `Grell::Crawler.new`.
+In the below example, the path of the URLs (instead of the full URL) will
+be compared.
 
-Each page has an unique id, accessed by the property 'id'. Also each page stores the id of the page from which we found this page, accessed by the property 'parent_id'.
-The page object generated by accessing the first URL passed to the start_crawling(the root) has a 'parent_id' equal to 'nil' and an 'id' equal to 0.
-Using this information it is possible to construct a directed graph.
+```ruby
+require 'grell'
+
+add_match_block = Proc.new do |collection_page, page|
+  collection_page.path == page.path
+end
+
+crawler = Grell::Crawler.new(add_match_block: add_match_block)
+
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
+
+#### Evaluate script
 
+You can evaluate a JavaScript snippet in each page before extracting links by passing the snippet to the `evaluate_in_each_page` option:
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(evaluate_in_each_page: "typeof jQuery !== 'undefined' && $('.dropdown').addClass('open');")
+
+```
 
 ### Errors
 When there is an error in the page or an internal error in the crawler (Javascript crashed the browser, etc). Grell will return with status 404 and the headers will have the following keys:
@@ -157,12 +200,6 @@ When there is an error in the page or an internal error in the crawler (Javascri
 - errorClass: The class of the error which broke this page.
 - errorMessage: A descriptive message with the information Grell could gather about the error.
 
-### Logging
-You can pass your logger to Grell. For example in a Rails app:
-```Ruby
-crawler = Grell::Crawler.new(logger: Rails.logger)
-```
-
 ## Tests
 
 Run the tests with
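Tying the new README sections together: the Pages' id section says each page exposes `id` and `parent_id`, which is enough to build the directed graph it mentions. A minimal sketch (the URL is a placeholder, and the adjacency-list representation is an editorial choice, not part of Grell's API):

```ruby
require 'grell'

# Collect parent_id => [child ids] while crawling, using only the
# documented `id` and `parent_id` page properties.
edges = Hash.new { |hash, key| hash[key] = [] }

crawler = Grell::Crawler.new
crawler.start_crawling('http://www.example.com') do |page|
  edges[page.parent_id] << page.id unless page.parent_id.nil?
end

# `edges` now describes a directed graph rooted at the page with id 0.
```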
data/grell.gemspec CHANGED
@@ -19,19 +19,18 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
-  spec.required_ruby_version = '>= 1.9.3'
+  spec.required_ruby_version = '>= 2.1.8'
 
-  spec.add_dependency 'capybara', '~> 2.7'
-  spec.add_dependency 'poltergeist', '~> 1.10'
+  spec.add_dependency 'capybara', '~> 2.10'
+  spec.add_dependency 'poltergeist', '~> 1.11'
 
-  spec.add_development_dependency 'bundler', '~> 1.6'
+  # spec.add_development_dependency 'bundler', '~> 1.6'
   spec.add_development_dependency 'byebug', '~> 4.0'
   spec.add_development_dependency 'kender', '~> 0.2'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'webmock', '~> 1.18'
-  spec.add_development_dependency 'rspec', '~> 3.0'
-  spec.add_development_dependency 'puffing-billy', '~> 0.5'
+  spec.add_development_dependency 'rspec', '~> 3.5'
+  spec.add_development_dependency 'puffing-billy', '~> 0.9'
   spec.add_development_dependency 'timecop', '~> 0.8'
-  spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
   spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
 end
data/lib/grell.rb CHANGED
@@ -3,6 +3,7 @@ require 'capybara/dsl'
 
 require 'grell/grell_logger'
 require 'grell/capybara_driver'
+require 'grell/crawler_manager'
 require 'grell/crawler'
 require 'grell/rawpage'
 require 'grell/page'
data/lib/grell/capybara_driver.rb CHANGED
@@ -1,16 +1,10 @@
-
 module Grell
-
-  #The driver for Capybara. It uses Portelgeist to control PhantomJS
+  # This class sets up the driver for Capybara. Used internally by the CrawlerManager.
+  # It uses Poltergeist to control PhantomJS.
   class CapybaraDriver
-    include Capybara::DSL
-
-    USER_AGENT = "Mozilla/5.0 (Grell Crawler)"
-
-    def self.setup(options)
-      new.setup_capybara unless options[:external_driver]
-    end
+    USER_AGENT = "Mozilla/5.0 (Grell Crawler)".freeze
 
+    # Returns a Poltergeist driver
     def setup_capybara
       @poltergeist_driver = nil
 
@@ -20,18 +14,17 @@ module Grell
       Grell.logger.info "GRELL Registering poltergeist driver with name '#{driver_name}'"
 
       Capybara.register_driver driver_name do |app|
-        @poltergeist_driver = Capybara::Poltergeist::Driver.new(app, {
+        @poltergeist_driver = Capybara::Poltergeist::Driver.new(app,
           js_errors: false,
           inspector: false,
           phantomjs_logger: FakePoltergeistLogger,
-          phantomjs_options: ['--debug=no', '--load-images=no', '--ignore-ssl-errors=yes', '--ssl-protocol=TLSv1']
-        })
+          phantomjs_options: ['--debug=no', '--load-images=no', '--ignore-ssl-errors=yes', '--ssl-protocol=TLSv1.2'])
       end
 
       Capybara.default_max_wait_time = 3
       Capybara.run_server = false
       Capybara.default_driver = driver_name
-      page.driver.headers = {
+      Capybara.current_session.driver.headers = { # The driver gets initialized when modified here
         "DNT" => 1,
         "User-Agent" => USER_AGENT
       }
@@ -41,14 +34,11 @@ module Grell
       @poltergeist_driver
     end
 
-    def quit
-      @poltergeist_driver.quit
-    end
-
+    # The Poltergeist driver needs a class with this signature; the JavaScript console.log output is sent here.
+    # We just discard that information.
     module FakePoltergeistLogger
       def self.puts(*)
       end
     end
   end
-
 end
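For orientation, this is roughly how the reworked class is driven: `setup_capybara` registers a uniquely named Poltergeist driver, makes it Capybara's default, and returns it. A hedged sketch; the CrawlerManager normally does this wiring internally, and calling it directly like this is not documented API:

```ruby
require 'grell'

# Register the Poltergeist driver and make it Capybara's default,
# then browse through the current session. Internal API, shown only
# to make the wiring above explicit; the URL is a placeholder.
driver = Grell::CapybaraDriver.new.setup_capybara
Capybara.current_session.visit('http://www.example.com')
puts Capybara.current_session.status_code
```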
data/lib/grell/crawler.rb CHANGED
@@ -1,53 +1,32 @@
-
 module Grell
-
   # This is the class that starts and controls the crawling
   class Crawler
-    attr_reader :collection
+    attr_reader :collection, :manager
 
     # Creates a crawler
-    # options allows :logger to point to an object with the same interface than Logger in the standard library
-    def initialize(options = {})
-      if options[:logger]
-        Grell.logger = options[:logger]
-      else
-        Grell.logger = Logger.new(STDOUT)
-      end
-
-      @driver = CapybaraDriver.setup(options)
-    end
-
-    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
-    def restart
-      Grell.logger.info "GRELL is restarting"
-      @driver.restart
-      Grell.logger.info "GRELL has restarted"
-    end
-
-    # Quits the poltergeist driver.
-    def quit
-      Grell.logger.info "GRELL is quitting the poltergeist driver"
-      @driver.quit
-    end
-
-    # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
-    def whitelist(list)
-      @whitelist_regexp = Regexp.union(list)
-    end
-
-    # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
-    def blacklist(list)
-      @blacklist_regexp = Regexp.union(list)
+    # evaluate_in_each_page: JavaScript to evaluate in each page we crawl
+    # add_match_block: block evaluated to decide whether a page is part of the collection
+    # manager_options: options passed to the manager class
+    # allowlist: Sets an allowlist filter; accepts a regexp, string or array of either to be matched.
+    # denylist: Sets a denylist filter; accepts a regexp, string or array of either to be matched.
+    def initialize(evaluate_in_each_page: nil, add_match_block: nil, allowlist: /.*/, denylist: /a^/, **manager_options)
+      @collection = nil
+      @manager = CrawlerManager.new(manager_options)
+      @evaluate_in_each_page = evaluate_in_each_page
+      @add_match_block = add_match_block
+      @allowlist_regexp = Regexp.union(allowlist)
+      @denylist_regexp = Regexp.union(denylist)
     end
 
     # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
-    def start_crawling(url, options = {}, &block)
+    def start_crawling(url, &block)
       Grell.logger.info "GRELL Started crawling"
-      @collection = PageCollection.new(options[:add_match_block] || default_add_match)
+      @collection = PageCollection.new(@add_match_block)
       @collection.create_page(url, nil)
 
       while !@collection.discovered_pages.empty?
         crawl(@collection.next_page, block)
+        @manager.check_periodic_restart(@collection)
       end
 
       Grell.logger.info "GRELL finished crawling"
@@ -55,16 +34,12 @@ module Grell
 
     def crawl(site, block)
       Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
-      site.navigate
-      filter!(site.links)
-      add_redirect_url(site)
+      crawl_site(site)
 
       if block # The user of this block can send us a :retry to retry accessing the page
         while crawl_block(block, site) == :retry
           Grell.logger.info "Retrying our visit to #{site.url}"
-          site.navigate
-          filter!(site.links)
-          add_redirect_url(site)
+          crawl_site(site)
         end
       end
 
@@ -75,6 +50,13 @@ module Grell
 
     private
 
+    def crawl_site(site)
+      site.navigate
+      site.rawpage.page.evaluate_script(@evaluate_in_each_page) if @evaluate_in_each_page
+      filter!(site.links)
+      add_redirect_url(site)
+    end
+
     # Treat any exceptions from the block as an unavailable page
     def crawl_block(block, site)
       block.call(site)
@@ -85,16 +67,8 @@ module Grell
     end
 
     def filter!(links)
-      links.select! { |link| link =~ @whitelist_regexp } if @whitelist_regexp
-      links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
-    end
-
-    # If options[:add_match_block] is not provided, url matching to determine if a
-    # new page should be added the page collection will default to this proc
-    def default_add_match
-      Proc.new do |collection_page, page|
-        collection_page.url.downcase == page.url.downcase
-      end
+      links.select! { |link| link =~ @allowlist_regexp } if @allowlist_regexp
+      links.delete_if { |link| link =~ @denylist_regexp } if @denylist_regexp
     end
 
     # Store the resulting redirected URL along with the original URL
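A closing note on the defaults in `Crawler#initialize` above: `Regexp.union` folds strings and regexps into a single pattern, `/.*/` matches every link (allow everything), and `/a^/` can never match (deny nothing), since no character can follow an `a` and still be at the start of a line. A standalone sketch of the same `filter!` logic, with sample data:

```ruby
# Same select!/delete_if logic as Crawler#filter!, using the defaults
# and sample links; the URLs are placeholders.
allowlist_regexp = Regexp.union([/games\/.*/, '/fun'])
denylist_regexp  = Regexp.union(/a^/) # the default: a regexp that never matches

links = ['http://site/games/chess', 'http://site/fun', 'http://site/about']
links.select!   { |link| link =~ allowlist_regexp }
links.delete_if { |link| link =~ denylist_regexp }
links # => ["http://site/games/chess", "http://site/fun"]
```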