grell 1.6.11 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 9a3668b60b187d3b4dab183bf6b386738813a1ec
- data.tar.gz: 32ddba97d8dca3abca1cffb4d44760d0679eea0b
+ metadata.gz: bbd2a19c7858d2e755e7d51a038b84901835f32b
+ data.tar.gz: 7c65a731bfdacb7b65cf523c8ba128f1f3390f96
  SHA512:
- metadata.gz: 6ba545ee45bbdc4a43b85045d6e09c0a1cfd7f0d7605dd5abee455434426b84cf11fe706a46a0128cc2b35ab04ff0f0d2106482a7548989a1aaa2bdb4928cf90
- data.tar.gz: 0d1e2eb5dbcd688d00a39e7ff39781b577647eb6db47ac2155709d81af7e004d435521a38067258026960982f97f60cbf991993366dc61b89ae8f2d08345c4d1
+ metadata.gz: fcf31c442f8d51cd4a9534270cc3afd7f5bac432a74a9ec2e429914145df710131e0a99aca61e4cd98d2e831a5cedb511c21dce9d1562881541fd1bfb4016014
+ data.tar.gz: c413503a4a8765fadbec10ed64467ba140b8739b6209ae4a19e585549ad689231b64b3d670e1bdceec560597051d93f8b90d696f66571cb1dad175f15af1b4a0
data/.travis.yml CHANGED
@@ -3,6 +3,7 @@ cache: bundler
  sudo: false
 
  rvm:
+ - 2.1.8
  - 2.2.4
  - 2.3.0
  script: bundle exec rspec
data/CHANGELOG.md CHANGED
@@ -1,3 +1,13 @@
+ # 2.0.0
+ * New configuration key `on_periodic_restart`.
+ * New `CrawlerManager.cleanup_all_processes` method destroys all PhantomJS processes running on this machine.
+
+ * Breaking changes
+ - Requires Ruby 2.1 or later.
+ - `Crawler.start_crawling` no longer accepts options; all options are now passed to `Crawler.new`.
+ - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
+ - Crawler takes whitelist and blacklist as configuration options instead of setting them through dedicated methods.
+
  # 1.6.11
  * Ensure all links are loaded by waiting for Ajax requests to complete
  * Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
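The breaking changes above amount to moving all configuration into `Crawler.new` and moving driver control onto the crawler's manager. A minimal migration sketch, assuming the 2.0.0 API shown in the README and code diffs below (the URL and filter values are illustrative only):

```ruby
require 'grell'

# 1.x style (no longer available in 2.0.0):
#   crawler = Grell::Crawler.new(logger: my_logger)
#   crawler.whitelist(['/fun'])
#   crawler.start_crawling('http://www.example.com', add_match_block: matcher)
#   crawler.restart

# 2.0.0 style: options go to Crawler.new; restart and quit live on the manager.
crawler = Grell::Crawler.new(whitelist: ['/fun'], blacklist: /games/)
crawler.start_crawling('http://www.example.com') do |page|
  puts page.url
end
crawler.manager.quit
```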
data/Gemfile CHANGED
@@ -1,3 +1,7 @@
  source 'https://rubygems.org'
 
+ # Keep Ruby 2.1 from pulling in Rack > 2.0, which is not compatible with it
+ platform :ruby_21 do
+   gem 'rack', '~> 1.0'
+ end
  gemspec
data/README.md CHANGED
@@ -21,16 +21,15 @@ Or install it yourself as:
 
  $ gem install grell
 
- Grell uses PhantomJS, you will need to download and install it in your
+ Grell uses PhantomJS as a browser; you will need to download and install it in your
  system. Check for instructions in http://phantomjs.org/
  Grell has been tested with PhantomJS v2.1.x
 
  ## Usage
 
-
  ### Crawling an entire site
 
- The main entry point of the library is Grell#start_crawling.
+ The main entry point of the library is Grell::Crawler#start_crawling.
  Grell will yield to your code with each page it finds:
 
  ```ruby
@@ -55,85 +54,105 @@ This list is indexed by the complete url, including query parameters.
 
  ### Re-retrieving a page
  If you want Grell to revisit a page and return the data to you again,
- return the symbol :retry in your block in the start_crawling method.
+ return the symbol :retry in your block for the start_crawling method.
  For instance
  ```ruby
  require 'grell'
  crawler = Grell::Crawler.new
  crawler.start_crawling('http://www.google.com') do |current_page|
  if current_page.status == 500 && current_page.retries == 0
- crawler.restart
+ crawler.manager.restart
  :retry
  end
  end
  ```
 
- ### Restarting PhantomJS
- If you are doing a long crawling it is possible that phantomJS starts failing.
- To avoid that, you can restart it by calling "restart" on crawler.
- That will kill phantom and will restart it. Grell will keep the status of
- pages already visited and pages discovered and to be visited. And will keep crawling
- with the new phantomJS process instead of the old one.
+ ### Pages' id
 
- ### Selecting links to follow
+ Each page has a unique id, accessed by the property `id`. Each page also stores the id of the page from which it was discovered, accessed by the property `parent_id`.
+ The page object generated by accessing the first URL passed to start_crawling (the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
+ Using this information it is possible to construct a directed graph.
 
- Grell by default will follow all the links it finds going to the site
- your are crawling. It will never follow links linking outside your site.
- If you want to further limit the amount of links crawled, you can use
- whitelisting, blacklisting or manual filtering.
 
- #### Custom URL Comparison
- By default, Grell will detect new URLs to visit by comparing the full URL
- with the URLs of the discovered and visited links. This functionality can
- be changed by passing a block of code to Grells `start_crawling` method.
- In the below example, the path of the URLs (instead of the full URL) will
- be compared.
+ ### Restart and quit
 
+ Grell can be restarted. The lists of visited and yet-to-visit pages are not modified when restarting,
+ but the browser is destroyed and recreated, so all cookies and local storage are lost. After restarting, crawling resumes with a
+ new browser.
+ To destroy the crawler, call the `quit` method. This will free the memory taken in Ruby and destroy the PhantomJS process.
  ```ruby
  require 'grell'
-
  crawler = Grell::Crawler.new
+ crawler.manager.restart # restarts the browser
+ crawler.manager.quit # quits and destroys the crawler
+ ```
 
- add_match_block = Proc.new do |collection_page, page|
- collection_page.path == page.path
- end
+ ### Options
 
- crawler.start_crawling('http://www.google.com', add_match_block: add_match_block) do |current_page|
- ...
- end
- ```
+ The `Grell::Crawler` class can be passed options to customize its behavior:
+ - `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
+ - `on_periodic_restart`: Restarts the crawler periodically, after a given number of visits. Default: every 100 pages.
+ - `whitelist`: Sets up a whitelist filter for URLs to be visited. Default: all URLs are whitelisted.
+ - `blacklist`: Sets up a blacklist filter for URLs to be avoided. Default: no URL is blacklisted.
+ - `add_match_block`: Block evaluated to decide whether a given page should be added to the pages to be visited. Default: add unique URLs only.
+ - `evaluate_in_each_page`: JavaScript snippet to be evaluated on each page visited. Default: nothing is evaluated.
+ - `driver_options`: Options passed to the Capybara driver which connects to PhantomJS.
 
- #### Whitelisting
+ Grell by default will follow all the links it finds in the site being crawled.
+ It will never follow links linking outside your site.
+ If you want to further limit the amount of links crawled, you can use
+ whitelisting, blacklisting or manual filtering.
+ Further details on these and other options follow below.
 
- ```ruby
- require 'grell'
 
- crawler = Grell::Crawler.new
- crawler.whitelist([/games\/.*/, '/fun'])
- crawler.start_crawling('http://www.google.com')
- ```
+ #### Automatically restarting PhantomJS
+ If you are doing a long crawl, it is possible that PhantomJS gets into an inconsistent state or starts leaking memory.
+ The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
+ `on_periodic_restart` configuration key as follows:
 
- Grell here will only follow links to games and '/fun' and ignore all
- other links. You can provide a regexp, strings (if any part of the
- string match is whitelisted) or an array with regexps and/or strings.
+ ```ruby
+ require 'grell'
 
- #### Blacklisting
+ crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
 
- ```ruby
- require 'grell'
+ crawler.start_crawling('http://www.google.com') do |current_page|
+ ...
+ end
+ ```
 
- crawler = Grell::Crawler.new
- crawler.blacklist(/games\/.*/)
- crawler.start_crawling('http://www.google.com')
- ```
+ This code sets up the crawler to restart every 200 pages crawled and to call `my_restart_procedure`
+ between restarts. A restart destroys the cookies, so this custom block can be used, for instance, to log in again.
 
- Similar to whitelisting. But now Grell will follow every other link in
- this site which does not go to /games/...
 
- If you call both whitelist and blacklist then both will apply, a link
- has to fullfill both conditions to survive. If you do not call any, then
- all links on this site will be crawled. Think of these methods as
- filters.
+ #### Whitelisting
+
+ ```ruby
+ require 'grell'
+
+ crawler = Grell::Crawler.new(whitelist: [/games\/.*/, '/fun'])
+ crawler.start_crawling('http://www.google.com')
+ ```
+
+ Grell here will only follow links to games and '/fun' and ignore all
+ other links. You can provide a regexp, strings (if any part of the
+ string match is whitelisted) or an array with regexps and/or strings.
+
+ #### Blacklisting
+
+ ```ruby
+ require 'grell'
+
+ crawler = Grell::Crawler.new(blacklist: /games\/.*/)
+ crawler.start_crawling('http://www.google.com')
+ ```
+
+ Similar to whitelisting, but now Grell will follow every other link in
+ this site which does not go to /games/...
+
+ If you set both a whitelist and a blacklist then both will apply: a link
+ has to fulfill both conditions to survive. If you do not set either, then
+ all links on this site will be crawled. Think of these options as
+ filters.
 
  #### Manual link filtering
 
@@ -144,14 +163,28 @@ links to visit. So you can modify in your block of code "page.links" to
  add and delete links to instruct Grell to add them to the list of links
  to visit next.
 
- ### Pages' id
+ #### Custom URL Comparison
+ By default, Grell will detect new URLs to visit by comparing the full URL
+ with the URLs of the discovered and visited links. This functionality can
+ be changed by passing a block of code through Grell's `add_match_block` option.
+ In the example below, the path of the URLs (instead of the full URL) will
+ be compared.
 
- Each page has an unique id, accessed by the property 'id'. Also each page stores the id of the page from which we found this page, accessed by the property 'parent_id'.
- The page object generated by accessing the first URL passed to the start_crawling(the root) has a 'parent_id' equal to 'nil' and an 'id' equal to 0.
- Using this information it is possible to construct a directed graph.
+ ```ruby
+ require 'grell'
+
+ add_match_block = Proc.new do |collection_page, page|
+ collection_page.path == page.path
+ end
 
+ crawler = Grell::Crawler.new(add_match_block: add_match_block)
 
- ### Evaluate script
+ crawler.start_crawling('http://www.google.com') do |current_page|
+ ...
+ end
+ ```
+
+ #### Evaluate script
 
  You can evaluate a JavaScript snippet in each page before extracting links by passing the snippet to the 'evaluate_in_each_page' option:
 
@@ -168,12 +201,6 @@ When there is an error in the page or an internal error in the crawler (Javascri
  - errorClass: The class of the error which broke this page.
  - errorMessage: A descriptive message with the information Grell could gather about the error.
 
- ### Logging
- You can pass your logger to Grell. For example in a Rails app:
- ```Ruby
- crawler = Grell::Crawler.new(logger: Rails.logger)
- ```
-
  ## Tests
 
  Run the tests with
data/grell.gemspec CHANGED
@@ -19,19 +19,18 @@ Gem::Specification.new do |spec|
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]
 
- spec.required_ruby_version = '>= 1.9.3'
+ spec.required_ruby_version = '>= 2.1.8'
 
- spec.add_dependency 'capybara', '~> 2.7'
- spec.add_dependency 'poltergeist', '~> 1.10'
+ spec.add_dependency 'capybara', '~> 2.10'
+ spec.add_dependency 'poltergeist', '~> 1.11'
 
  spec.add_development_dependency 'bundler', '~> 1.6'
  spec.add_development_dependency 'byebug', '~> 4.0'
  spec.add_development_dependency 'kender', '~> 0.2'
  spec.add_development_dependency 'rake', '~> 10.0'
  spec.add_development_dependency 'webmock', '~> 1.18'
- spec.add_development_dependency 'rspec', '~> 3.0'
- spec.add_development_dependency 'puffing-billy', '~> 0.5'
+ spec.add_development_dependency 'rspec', '~> 3.5'
+ spec.add_development_dependency 'puffing-billy', '~> 0.9'
  spec.add_development_dependency 'timecop', '~> 0.8'
- spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
  spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
  end
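Because the required Ruby version and the gem's major version both change here, applications upgrading may want to pin the dependency explicitly; a hypothetical Gemfile entry:

```ruby
# Hypothetical Gemfile entry for an application upgrading to this release.
gem 'grell', '~> 2.0'
```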
data/lib/grell.rb CHANGED
@@ -3,6 +3,7 @@ require 'capybara/dsl'
 
  require 'grell/grell_logger'
  require 'grell/capybara_driver'
+ require 'grell/crawler_manager'
  require 'grell/crawler'
  require 'grell/rawpage'
  require 'grell/page'
data/lib/grell/capybara_driver.rb CHANGED
@@ -5,7 +5,7 @@ module Grell
  class CapybaraDriver
  include Capybara::DSL
 
- USER_AGENT = "Mozilla/5.0 (Grell Crawler)"
+ USER_AGENT = "Mozilla/5.0 (Grell Crawler)".freeze
 
  def self.setup(options)
  new.setup_capybara unless options[:external_driver]
data/lib/grell/crawler.rb CHANGED
@@ -1,54 +1,32 @@
-
  module Grell
-
  # This is the class that starts and controls the crawling
  class Crawler
- attr_reader :collection
+ attr_reader :collection, :manager
 
  # Creates a crawler
- # options allows :logger to point to an object with the same interface than Logger in the standard library
- def initialize(options = {})
- if options[:logger]
- Grell.logger = options[:logger]
- else
- Grell.logger = Logger.new(STDOUT)
- end
-
- @driver = CapybaraDriver.setup(options)
- @evaluate_in_each_page = options[:evaluate_in_each_page]
- end
-
- # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
- def restart
- Grell.logger.info "GRELL is restarting"
- @driver.restart
- Grell.logger.info "GRELL has restarted"
- end
-
- # Quits the poltergeist driver.
- def quit
- Grell.logger.info "GRELL is quitting the poltergeist driver"
- @driver.quit
- end
-
- # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
- def whitelist(list)
- @whitelist_regexp = Regexp.union(list)
- end
-
- # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
- def blacklist(list)
- @blacklist_regexp = Regexp.union(list)
+ # evaluate_in_each_page: javascript block to evaluate in each page we crawl
+ # add_match_block: block to evaluate to consider if a page is part of the collection
+ # manager_options: options passed to the manager class
+ # whitelist: Setups a whitelist filter, allows a regexp, string or array of either to be matched.
+ # blacklist: Setups a blacklist filter, allows a regexp, string or array of either to be matched.
+ def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
+ @collection = nil
+ @manager = CrawlerManager.new(manager_options)
+ @evaluate_in_each_page = evaluate_in_each_page
+ @add_match_block = add_match_block
+ @whitelist_regexp = Regexp.union(whitelist)
+ @blacklist_regexp = Regexp.union(blacklist)
  end
 
  # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
- def start_crawling(url, options = {}, &block)
+ def start_crawling(url, &block)
  Grell.logger.info "GRELL Started crawling"
- @collection = PageCollection.new(options[:add_match_block] || default_add_match)
+ @collection = PageCollection.new(@add_match_block)
  @collection.create_page(url, nil)
 
  while !@collection.discovered_pages.empty?
  crawl(@collection.next_page, block)
+ @manager.check_periodic_restart(@collection)
  end
 
  Grell.logger.info "GRELL finished crawling"
@@ -93,14 +71,6 @@ module Grell
  links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
  end
 
- # If options[:add_match_block] is not provided, url matching to determine if a
- # new page should be added the page collection will default to this proc
- def default_add_match
- Proc.new do |collection_page, page|
- collection_page.url.downcase == page.url.downcase
- end
- end
-
  # Store the resulting redirected URL along with the original URL
  def add_redirect_url(site)
  if site.url != site.current_url
data/lib/grell/crawler_manager.rb ADDED
@@ -0,0 +1,77 @@
+ module Grell
+ # Manages the state of the process crawling, does not care about individual pages but about logging,
+ # restarting and quiting the crawler correctly.
+ class CrawlerManager
+ # logger: logger to use for Grell's messages
+ # on_periodic_restart: if set, the driver will restart every :each visits (100 default) and execute the :do block
+ # driver_options: Any extra options for the Capybara driver
+ def initialize(logger: nil, on_periodic_restart: {}, driver: nil, driver_options: {})
+ Grell.logger = logger ? logger : Logger.new(STDOUT)
+ @periodic_restart_block = on_periodic_restart[:do]
+ @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
+ @driver = driver || CapybaraDriver.setup(driver_options)
+ if @periodic_restart_period <= 0
+ Grell.logger.warn "GRELL being misconfigured with a negative period to restart. Ignoring option."
+ end
+ end
+
+ # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
+ def restart
+ Grell.logger.info "GRELL is restarting"
+ @driver.restart
+ Grell.logger.info "GRELL has restarted"
+ end
+
+ # Quits the poltergeist driver.
+ def quit
+ Grell.logger.info "GRELL is quitting the poltergeist driver"
+ @driver.quit
+ end
+
+ # PhantomJS seems to consume memory increasingly as it crawls, periodic restart allows to restart
+ # the driver, potentially calling a block.
+ def check_periodic_restart(collection)
+ return unless @periodic_restart_block
+ return unless @periodic_restart_period > 0
+ return unless (collection.visited_pages.size % @periodic_restart_period).zero?
+ restart
+ @periodic_restart_block.call
+ end
+
+ def cleanup_all_processes
+ pids = running_phantomjs_pids
+ return if pids.empty?
+ Grell.logger.warn "GRELL. Killing PhantomJS processes: #{pids.inspect}"
+ pids.each do |pid|
+ Grell.logger.warn "Sending KILL to PhantomJS process #{pid}"
+ kill_process(pid.to_i)
+ end
+ end
+
+ private
+
+ PAGES_TO_RESTART = 100 # Default number of pages before we restart the driver.
+ KILL_TIMEOUT = 2 # Number of seconds we wait till we kill the process.
+
+ def running_phantomjs_pids
+ list_phantomjs_processes_cmd = "ps -ef | grep -E 'bin/phantomjs' | grep -v grep"
+ `#{list_phantomjs_processes_cmd} | awk '{print $2;}'`.split("\n")
+ end
+
+ def kill_process(pid)
+ Process.kill('TERM', pid)
+ force_kill(pid)
+ rescue Errno::ESRCH, Errno::ECHILD
+ # successfully terminated
+ rescue => e
+ Grell.logger.exception e, "PhantomJS process could not be killed"
+ end
+
+ def force_kill(pid)
+ Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
+ rescue Timeout::Error
+ Process.kill('KILL', pid)
+ Process.wait(pid)
+ end
+ end
+ end
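The new class above is normally reached through `Crawler#manager`. A short usage sketch, assuming the methods defined in this file (the URL is illustrative only):

```ruby
require 'grell'

crawler = Grell::Crawler.new(logger: Logger.new(STDOUT))
crawler.start_crawling('http://www.example.com') { |page| puts page.url }

crawler.manager.quit                   # shuts down the PhantomJS driver
crawler.manager.cleanup_all_processes  # kills any PhantomJS processes left behind
```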
data/lib/grell/page_collection.rb CHANGED
@@ -10,7 +10,7 @@ module Grell
  # to the collection or if it is already present will be passed to the initializer.
  def initialize(add_match_block)
  @collection = []
- @add_match_block = add_match_block
+ @add_match_block = add_match_block || default_add_match
  end
 
  def create_page(url, parent_id)
@@ -50,5 +50,13 @@ module Grell
  end
  end
 
+ # If add_match_block is not provided, url matching to determine if a new page should be added
+ # to the page collection will default to this proc
+ def default_add_match
+ Proc.new do |collection_page, page|
+ collection_page.url.downcase == page.url.downcase
+ end
+ end
+
  end
  end
data/lib/grell/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Grell
- VERSION = "1.6.11".freeze
+ VERSION = "2.0.0".freeze
  end
data/spec/lib/crawler_manager_spec.rb ADDED
@@ -0,0 +1,113 @@
+ RSpec.describe Grell::CrawlerManager do
+ let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
+ let(:host) { 'http://www.example.com' }
+ let(:url) { 'http://www.example.com/test' }
+ let(:driver) { double(Grell::CapybaraDriver) }
+ let(:logger) { Logger.new(nil) }
+ let(:crawler_manager) do
+ described_class.new(logger: logger, driver: driver)
+ end
+
+ describe 'initialize' do
+ context 'provides a logger' do
+ let(:logger) { 33 }
+ it 'sets custom logger' do
+ crawler_manager
+ expect(Grell.logger).to eq(33)
+ Grell.logger = Logger.new(nil)
+ end
+ end
+ context 'does not provides a logger' do
+ let(:logger) { nil }
+ it 'sets default logger' do
+ crawler_manager
+ expect(Grell.logger).to be_instance_of(Logger)
+ Grell.logger = Logger.new(nil)
+ end
+ end
+ end
+
+ describe '#quit' do
+ let(:driver) { double }
+
+ it 'quits the poltergeist driver' do
+ expect(driver).to receive(:quit)
+ crawler_manager.quit
+ end
+ end
+
+ describe '#restart' do
+ let(:driver) { double }
+
+ it 'restarts the poltergeist driver' do
+ expect(driver).to receive(:restart)
+ expect(logger).to receive(:info).with("GRELL is restarting")
+ expect(logger).to receive(:info).with("GRELL has restarted")
+ crawler_manager.restart
+ end
+ end
+
+ describe '#check_periodic_restart' do
+ let(:collection) { double }
+ context 'Periodic restart not setup' do
+ it 'does not restart' do
+ allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+ expect(crawler_manager).not_to receive(:restart)
+ crawler_manager.check_periodic_restart(collection)
+ end
+ end
+ context 'Periodic restart setup with default period' do
+ let(:do_something) { proc {} }
+ let(:crawler_manager) do
+ Grell::CrawlerManager.new(
+ logger: logger,
+ driver: driver,
+ on_periodic_restart: { do: do_something }
+ )
+ end
+
+ it 'does not restart after visiting 99 pages' do
+ allow(collection).to receive_message_chain(:visited_pages, :size) { 99 }
+ expect(crawler_manager).not_to receive(:restart)
+ crawler_manager.check_periodic_restart(collection)
+ end
+ it 'restarts after visiting 100 pages' do
+ allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+ expect(crawler_manager).to receive(:restart)
+ crawler_manager.check_periodic_restart(collection)
+ end
+ end
+ context 'Periodic restart setup with custom period' do
+ let(:do_something) { proc {} }
+ let(:period) { 50 }
+ let(:crawler_manager) do
+ Grell::CrawlerManager.new(
+ logger: logger,
+ driver: driver,
+ on_periodic_restart: { do: do_something, each: period }
+ )
+ end
+
+ it 'does not restart after visiting a number different from custom period pages' do
+ allow(collection).to receive_message_chain(:visited_pages, :size) { period * 1.2 }
+ expect(crawler_manager).not_to receive(:restart)
+ crawler_manager.check_periodic_restart(collection)
+ end
+ it 'restarts after visiting custom period pages' do
+ allow(collection).to receive_message_chain(:visited_pages, :size) { period }
+ expect(crawler_manager).to receive(:restart)
+ crawler_manager.check_periodic_restart(collection)
+ end
+ end
+ end
+
+ describe '#cleanup_all_processes' do
+ let(:driver) { double }
+
+ it 'kills all phantomjs processes' do
+ allow(crawler_manager).to receive(:running_phantomjs_pids).and_return([10])
+ expect(crawler_manager).to receive(:kill_process).with(10)
+ crawler_manager.cleanup_all_processes
+ end
+ end
+ end
data/spec/lib/crawler_spec.rb CHANGED
@@ -5,7 +5,18 @@ RSpec.describe Grell::Crawler do
  let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
  let(:host) { 'http://www.example.com' }
  let(:url) { 'http://www.example.com/test' }
- let(:crawler) { Grell::Crawler.new(logger: Logger.new(nil), external_driver: true, evaluate_in_each_page: script) }
+ let(:add_match_block) { nil }
+ let(:blacklist) { /a^/ }
+ let(:whitelist) { /.*/ }
+ let(:crawler) do
+ Grell::Crawler.new(
+ logger: Logger.new(nil),
+ driver_options: { external_driver: true },
+ evaluate_in_each_page: script,
+ add_match_block: add_match_block,
+ blacklist: blacklist,
+ whitelist: whitelist)
+ end
  let(:script) { nil }
  let(:body) { 'body' }
  let(:custom_add_match) do
@@ -18,29 +29,6 @@ RSpec.describe Grell::Crawler do
  proxy.stub(url).and_return(body: body, code: 200)
  end
 
- describe 'initialize' do
- it 'can provide your own logger' do
- Grell::Crawler.new(external_driver: true, logger: 33)
- expect(Grell.logger).to eq(33)
- Grell.logger = Logger.new(nil)
- end
-
- it 'provides a stdout logger if nothing provided' do
- crawler
- expect(Grell.logger).to be_instance_of(Logger)
- end
- end
-
- describe '#quit' do
- let(:driver) { double }
- before { allow(Grell::CapybaraDriver).to receive(:setup).and_return(driver) }
-
- it 'quits the poltergeist driver' do
- expect(driver).to receive(:quit)
- crawler.quit
- end
- end
-
  describe '#crawl' do
  before do
  crawler.instance_variable_set('@collection', Grell::PageCollection.new(custom_add_match))
@@ -127,15 +115,6 @@ RSpec.describe Grell::Crawler do
  expect(result[1].url).to eq(url_visited)
  end
 
- it 'can use a custom url add matcher block' do
- expect(crawler).to_not receive(:default_add_match)
- crawler.start_crawling(url, add_match_block: custom_add_match)
- end
-
- it 'uses a default url add matched if not provided' do
- expect(crawler).to receive(:default_add_match).and_return(custom_add_match)
- crawler.start_crawling(url)
- end
  end
 
  shared_examples_for 'visits all available pages' do
@@ -204,10 +183,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using a single string' do
- before do
- crawler.whitelist('/trusmis.html')
- end
-
+ let(:whitelist) { '/trusmis.html' }
  let(:visited_pages_count) { 2 } # my own page + trusmis
  let(:visited_pages) do
  ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -217,10 +193,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an array of strings' do
- before do
- crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
- end
-
+ let(:whitelist) { ['/trusmis.html', '/nothere', 'another.html'] }
  let(:visited_pages_count) { 2 }
  let(:visited_pages) do
  ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -230,10 +203,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using a regexp' do
- before do
- crawler.whitelist(/\/trusmis\.html/)
- end
-
+ let(:whitelist) { /\/trusmis\.html/ }
  let(:visited_pages_count) { 2 }
  let(:visited_pages) do
  ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -243,10 +213,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an array of regexps' do
- before do
- crawler.whitelist([/\/trusmis\.html/])
- end
-
+ let(:whitelist) { [/\/trusmis\.html/] }
  let(:visited_pages_count) { 2 }
  let(:visited_pages) do
  ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -256,10 +223,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an empty array' do
- before do
- crawler.whitelist([])
- end
-
+ let(:whitelist) { [] }
  let(:visited_pages_count) { 1 } # my own page only
  let(:visited_pages) do
  ['http://www.example.com/test']
@@ -269,10 +233,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'adding all links to the whitelist' do
- before do
- crawler.whitelist(['/trusmis', '/help'])
- end
-
+ let(:whitelist) { ['/trusmis', '/help'] }
  let(:visited_pages_count) { 3 } # all links
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -298,9 +259,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using a single string' do
- before do
- crawler.blacklist('/trusmis.html')
- end
+ let(:blacklist) { '/trusmis.html' }
  let(:visited_pages_count) {2}
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/help.html']
@@ -310,9 +269,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an array of strings' do
- before do
- crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
- end
+ let(:blacklist) { ['/trusmis.html', '/nothere', 'another.html'] }
  let(:visited_pages_count) {2}
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/help.html']
@@ -322,9 +279,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using a regexp' do
- before do
- crawler.blacklist(/\/trusmis\.html/)
- end
+ let(:blacklist) { /\/trusmis\.html/ }
  let(:visited_pages_count) {2}
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/help.html']
@@ -334,9 +289,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an array of regexps' do
- before do
- crawler.blacklist([/\/trusmis\.html/])
- end
+ let(:blacklist) { [/\/trusmis\.html/] }
  let(:visited_pages_count) {2}
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/help.html']
@@ -346,9 +299,7 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'using an empty array' do
- before do
- crawler.blacklist([])
- end
+ let(:blacklist) { [] }
  let(:visited_pages_count) { 3 } # all links
  let(:visited_pages) do
  ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -357,10 +308,8 @@ RSpec.describe Grell::Crawler do
  it_behaves_like 'visits all available pages'
  end
 
- context 'adding all links to the whitelist' do
- before do
- crawler.blacklist(['/trusmis', '/help'])
- end
+ context 'adding all links to the blacklist' do
+ let(:blacklist) { ['/trusmis', '/help'] }
  let(:visited_pages_count) { 1 }
  let(:visited_pages) do
  ['http://www.example.com/test']
@@ -386,11 +335,8 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'we blacklist the only whitelisted page' do
- before do
- crawler.whitelist('/trusmis.html')
- crawler.blacklist('/trusmis.html')
- end
-
+ let(:whitelist) { '/trusmis.html' }
+ let(:blacklist) { '/trusmis.html' }
  let(:visited_pages_count) { 1 }
  let(:visited_pages) do
  ['http://www.example.com/test']
@@ -400,11 +346,8 @@ RSpec.describe Grell::Crawler do
  end
 
  context 'we blacklist none of the whitelisted pages' do
- before do
- crawler.whitelist('/trusmis.html')
- crawler.blacklist('/raistlin.html')
- end
-
+ let(:whitelist) { '/trusmis.html' }
+ let(:blacklist) { '/raistlin.html' }
  let(:visited_pages_count) { 2 }
  let(:visited_pages) do
  ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: grell
  version: !ruby/object:Gem::Version
- version: 1.6.11
+ version: 2.0.0
  platform: ruby
  authors:
  - Jordi Polo Carres
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-09-01 00:00:00.000000000 Z
+ date: 2016-11-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: capybara
@@ -16,28 +16,28 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.7'
+ version: '2.10'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.7'
+ version: '2.10'
  - !ruby/object:Gem::Dependency
  name: poltergeist
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.10'
+ version: '1.11'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.10'
+ version: '1.11'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
@@ -114,28 +114,28 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '3.0'
+ version: '3.5'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '3.0'
+ version: '3.5'
  - !ruby/object:Gem::Dependency
  name: puffing-billy
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.5'
+ version: '0.9'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.5'
+ version: '0.9'
  - !ruby/object:Gem::Dependency
  name: timecop
  requirement: !ruby/object:Gem::Requirement
@@ -150,20 +150,6 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '0.8'
- - !ruby/object:Gem::Dependency
- name: capybara-webkit
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.11.1
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - "~>"
- - !ruby/object:Gem::Version
- version: 1.11.1
  - !ruby/object:Gem::Dependency
  name: selenium-webdriver
  requirement: !ruby/object:Gem::Requirement
@@ -196,6 +182,7 @@ files:
  - lib/grell.rb
  - lib/grell/capybara_driver.rb
  - lib/grell/crawler.rb
+ - lib/grell/crawler_manager.rb
  - lib/grell/grell_logger.rb
  - lib/grell/page.rb
  - lib/grell/page_collection.rb
@@ -203,6 +190,7 @@ files:
  - lib/grell/reader.rb
  - lib/grell/version.rb
  - spec/lib/capybara_driver_spec.rb
+ - spec/lib/crawler_manager_spec.rb
  - spec/lib/crawler_spec.rb
  - spec/lib/page_collection_spec.rb
  - spec/lib/page_spec.rb
@@ -220,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: 1.9.3
+ version: 2.1.8
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
@@ -234,6 +222,7 @@ specification_version: 4
  summary: Ruby web crawler
  test_files:
  - spec/lib/capybara_driver_spec.rb
+ - spec/lib/crawler_manager_spec.rb
  - spec/lib/crawler_spec.rb
  - spec/lib/page_collection_spec.rb
  - spec/lib/page_spec.rb