grell 1.6.11 → 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/CHANGELOG.md +10 -0
- data/Gemfile +4 -0
- data/README.md +91 -64
- data/grell.gemspec +5 -6
- data/lib/grell.rb +1 -0
- data/lib/grell/capybara_driver.rb +1 -1
- data/lib/grell/crawler.rb +16 -46
- data/lib/grell/crawler_manager.rb +77 -0
- data/lib/grell/page_collection.rb +9 -1
- data/lib/grell/version.rb +1 -1
- data/spec/lib/crawler_manager_spec.rb +113 -0
- data/spec/lib/crawler_spec.rb +29 -86
- metadata +14 -25
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bbd2a19c7858d2e755e7d51a038b84901835f32b
+  data.tar.gz: 7c65a731bfdacb7b65cf523c8ba128f1f3390f96
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fcf31c442f8d51cd4a9534270cc3afd7f5bac432a74a9ec2e429914145df710131e0a99aca61e4cd98d2e831a5cedb511c21dce9d1562881541fd1bfb4016014
+  data.tar.gz: c413503a4a8765fadbec10ed64467ba140b8739b6209ae4a19e585549ad689231b64b3d670e1bdceec560597051d93f8b90d696f66571cb1dad175f15af1b4a0
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,13 @@
+# 2.0.0
+* New configuration key `on_periodic_restart`.
+* CrawlerManager.cleanup_all_processes method destroy all instances of phantomjs in this machine.
+
+* Breaking changes
+  - Requires Ruby 2.1 or later.
+  - Crawler.start_crawling does not accept options anymore, all options are passed to Crawler.new.
+  - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
+  - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
+
 # 1.6.11
 * Ensure all links are loaded by waiting for Ajax requests to complete
 * Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
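As an illustration of the breaking changes listed above, the sketch below shows a 2.0.0-style caller; it is not part of the released files, and the URL, filter values and restart block are placeholders assembled only from the README and code changes in this diff.

```ruby
require 'logger'
require 'grell'

# Hypothetical 2.0.0 usage: whitelist/blacklist and the periodic restart are
# constructor options now, and restart/quit live on the CrawlerManager
# exposed as `crawler.manager` (they were methods on Crawler in 1.x).
crawler = Grell::Crawler.new(
  logger: Logger.new(STDOUT),
  whitelist: [/games\/.*/, '/fun'],
  blacklist: /garbage/,
  on_periodic_restart: { do: proc { puts 'PhantomJS restarted' }, each: 200 }
)

crawler.start_crawling('http://www.example.com') do |page|
  puts "#{page.status} #{page.url}"
end

crawler.manager.quit # was crawler.quit in 1.x
```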
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -21,16 +21,15 @@ Or install it yourself as:
 
     $ gem install grell
 
-Grell uses PhantomJS, you will need to download and install it in your
+Grell uses PhantomJS as a browser, you will need to download and install it in your
 system. Check for instructions in http://phantomjs.org/
 Grell has been tested with PhantomJS v2.1.x
 
 ## Usage
 
-
 ### Crawling an entire site
 
-The main entry point of the library is Grell#start_crawling.
+The main entry point of the library is Grell::Crawler#start_crawling.
 Grell will yield to your code with each page it finds:
 
 ```ruby
@@ -55,85 +54,105 @@ This list is indexed by the complete url, including query parameters.
 
 ### Re-retrieving a page
 If you want Grell to revisit a page and return the data to you again,
-return the symbol :retry in your block
+return the symbol :retry in your block for the start_crawling method.
 For instance
 ```ruby
 require 'grell'
 crawler = Grell::Crawler.new
 crawler.start_crawling('http://www.google.com') do |current_page|
   if current_page.status == 500 && current_page.retries == 0
-    crawler.restart
+    crawler.manager.restart
     :retry
   end
 end
 ```
 
-###
-If you are doing a long crawling it is possible that phantomJS starts failing.
-To avoid that, you can restart it by calling "restart" on crawler.
-That will kill phantom and will restart it. Grell will keep the status of
-pages already visited and pages discovered and to be visited. And will keep crawling
-with the new phantomJS process instead of the old one.
+### Pages' id
 
-
+Each page has an unique id, accessed by the property `id`. Also each page stores the id of the page from which we found this page, accessed by the property `parent_id`.
+The page object generated by accessing the first URL passed to the start_crawling(the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
+Using this information it is possible to construct a directed graph.
 
-Grell by default will follow all the links it finds going to the site
-your are crawling. It will never follow links linking outside your site.
-If you want to further limit the amount of links crawled, you can use
-whitelisting, blacklisting or manual filtering.
 
-
-By default, Grell will detect new URLs to visit by comparing the full URL
-with the URLs of the discovered and visited links. This functionality can
-be changed by passing a block of code to Grells `start_crawling` method.
-In the below example, the path of the URLs (instead of the full URL) will
-be compared.
+### Restart and quit
 
+Grell can be restarted. The current list of visited and yet-to-visit pages list are not modified when restarting
+but the browser is destroyed and recreated, all cookies and local storage are lost. After restarting, crawling is resumed with a
+new browser.
+To destroy the crawler, call the `quit` method. This will free the memory taken in Ruby and destroys the PhantomJS process.
 ```ruby
 require 'grell'
-
 crawler = Grell::Crawler.new
+crawler.manager.restart # restarts the browser
+crawler.manager.quit # quits and destroys the crawler
+```
 
-
-  collection_page.path == page.path
-end
+### Options
 
-
-
-
-
+The `Grell:Crawler` class can be passed options to customize its behavior:
+- `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
+- `on_periodic_restart`: Sets periodic restarts of the crawler each certain number of visits. Default: 100 pages.
+- `whitelist`: Setups a whitelist filter for URLs to be visited. Default: all URLs are whitelisted.
+- `blacklist`: Setups a blacklist filter for URLs to be avoided. Default: no URL is blacklisted.
+- `add_match_block`: Block evaluated to consider if a given page should be part of the pages to be visited. Default: add unique URLs.
+- `evaluate_in_each_page`: Javascript block to be evaluated on each page visited. Default: Nothing evaluated.
+- `driver_options`: Driver options will be passed to the Capybara driver which connects to PhantomJS.
 
-
+Grell by default will follow all the links it finds in the site being crawled.
+It will never follow links linking outside your site.
+If you want to further limit the amount of links crawled, you can use
+whitelisting, blacklisting or manual filtering.
+Below further details on these and other options.
 
-```ruby
-require 'grell'
 
-
-
-crawler.
-
+#### Automatically restarting PhantomJS
+If you are doing a long crawling it is possible that phantomJS gets into an inconsistent state or it starts leaking memory.
+The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
+`on_periodic_restart` configuration key as follows:
 
-
-
-string match is whitelisted) or an array with regexps and/or strings.
+```ruby
+require 'grell'
 
-
+crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
 
-
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+endd
+```
 
-crawler
-
-crawler.start_crawling('http://www.google.com')
-```
+This code will setup the crawler to be restarted every 200 pages being crawled and to call `my_restart_procedure`
+between restarts. A restart will destroy the cookies so for instance this custom block can be used to relogin.
 
-Similar to whitelisting. But now Grell will follow every other link in
-this site which does not go to /games/...
 
-
-
-
-
+#### Whitelisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(whitelist: [/games\/.*/, '/fun'])
+crawler.start_crawling('http://www.google.com')
+```
+
+Grell here will only follow links to games and '/fun' and ignore all
+other links. You can provide a regexp, strings (if any part of the
+string match is whitelisted) or an array with regexps and/or strings.
+
+#### Blacklisting
+
+```ruby
+require 'grell'
+
+crawler = Grell::Crawler.new(blacklist: /games\/.*/)
+crawler.start_crawling('http://www.google.com')
+```
+
+Similar to whitelisting. But now Grell will follow every other link in
+this site which does not go to /games/...
+
+If you call both whitelist and blacklist then both will apply, a link
+has to fullfill both conditions to survive. If you do not call any, then
+all links on this site will be crawled. Think of these methods as
+filters.
 
 #### Manual link filtering
 
@@ -144,14 +163,28 @@ links to visit. So you can modify in your block of code "page.links" to
 add and delete links to instruct Grell to add them to the list of links
 to visit next.
 
-
+#### Custom URL Comparison
+By default, Grell will detect new URLs to visit by comparing the full URL
+with the URLs of the discovered and visited links. This functionality can
+be changed by passing a block of code to Grells `start_crawling` method.
+In the below example, the path of the URLs (instead of the full URL) will
+be compared.
 
-
-
-
+```ruby
+require 'grell'
+
+add_match_block = Proc.new do |collection_page, page|
+  collection_page.path == page.path
+end
 
+crawler = Grell::Crawler.new(add_match_block: add_match_block)
 
-
+crawler.start_crawling('http://www.google.com') do |current_page|
+  ...
+end
+```
+
+#### Evaluate script
 
 You can evalute a JavaScript snippet in each page before extracting links by passing the snippet to the 'evaluate_in_each_page' option:
 
@@ -168,12 +201,6 @@ When there is an error in the page or an internal error in the crawler (Javascri
 - errorClass: The class of the error which broke this page.
 - errorMessage: A descriptive message with the information Grell could gather about the error.
 
-### Logging
-You can pass your logger to Grell. For example in a Rails app:
-```Ruby
-crawler = Grell::Crawler.new(logger: Rails.logger)
-```
-
 ## Tests
 
 Run the tests with
data/grell.gemspec
CHANGED
@@ -19,19 +19,18 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
-  spec.required_ruby_version = '>= 1.
+  spec.required_ruby_version = '>= 2.1.8'
 
-  spec.add_dependency 'capybara', '~> 2.
-  spec.add_dependency 'poltergeist', '~> 1.
+  spec.add_dependency 'capybara', '~> 2.10'
+  spec.add_dependency 'poltergeist', '~> 1.11'
 
   spec.add_development_dependency 'bundler', '~> 1.6'
   spec.add_development_dependency 'byebug', '~> 4.0'
   spec.add_development_dependency 'kender', '~> 0.2'
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'webmock', '~> 1.18'
-  spec.add_development_dependency 'rspec', '~> 3.
-  spec.add_development_dependency 'puffing-billy', '~> 0.
+  spec.add_development_dependency 'rspec', '~> 3.5'
+  spec.add_development_dependency 'puffing-billy', '~> 0.9'
   spec.add_development_dependency 'timecop', '~> 0.8'
-  spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
   spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
 end
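For applications consuming the gem, the version bumps above translate roughly into the following Gemfile constraint; this is an illustrative sketch, not part of this diff, and assumes the app already satisfies the new Ruby requirement.

```ruby
# Illustrative Gemfile entry for an application picking up this release.
source 'https://rubygems.org'

gem 'grell', '~> 2.0' # grell 2.0.0 requires Ruby >= 2.1.8 and PhantomJS 2.1.x
```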
data/lib/grell.rb
CHANGED
data/lib/grell/crawler.rb
CHANGED
@@ -1,54 +1,32 @@
-
 module Grell
-
   # This is the class that starts and controls the crawling
   class Crawler
-    attr_reader :collection
+    attr_reader :collection, :manager
 
     # Creates a crawler
-    #
-
-
-
-
-
-
-
-      @
-      @
-
-
-    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
-    def restart
-      Grell.logger.info "GRELL is restarting"
-      @driver.restart
-      Grell.logger.info "GRELL has restarted"
-    end
-
-    # Quits the poltergeist driver.
-    def quit
-      Grell.logger.info "GRELL is quitting the poltergeist driver"
-      @driver.quit
-    end
-
-    # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
-    def whitelist(list)
-      @whitelist_regexp = Regexp.union(list)
-    end
-
-    # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
-    def blacklist(list)
-      @blacklist_regexp = Regexp.union(list)
+    # evaluate_in_each_page: javascript block to evaluate in each page we crawl
+    # add_match_block: block to evaluate to consider if a page is part of the collection
+    # manager_options: options passed to the manager class
+    # whitelist: Setups a whitelist filter, allows a regexp, string or array of either to be matched.
+    # blacklist: Setups a blacklist filter, allows a regexp, string or array of either to be matched.
+    def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
+      @collection = nil
+      @manager = CrawlerManager.new(manager_options)
+      @evaluate_in_each_page = evaluate_in_each_page
+      @add_match_block = add_match_block
+      @whitelist_regexp = Regexp.union(whitelist)
+      @blacklist_regexp = Regexp.union(blacklist)
     end
 
     # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
-    def start_crawling(url,
+    def start_crawling(url, &block)
       Grell.logger.info "GRELL Started crawling"
-      @collection = PageCollection.new(
+      @collection = PageCollection.new(@add_match_block)
       @collection.create_page(url, nil)
 
       while !@collection.discovered_pages.empty?
         crawl(@collection.next_page, block)
+        @manager.check_periodic_restart(@collection)
       end
 
       Grell.logger.info "GRELL finished crawling"
@@ -93,14 +71,6 @@ module Grell
       links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
     end
 
-    # If options[:add_match_block] is not provided, url matching to determine if a
-    # new page should be added the page collection will default to this proc
-    def default_add_match
-      Proc.new do |collection_page, page|
-        collection_page.url.downcase == page.url.downcase
-      end
-    end
-
     # Store the resulting redirected URL along with the original URL
     def add_redirect_url(site)
       if site.url != site.current_url
data/lib/grell/crawler_manager.rb
ADDED
@@ -0,0 +1,77 @@
+module Grell
+  # Manages the state of the process crawling, does not care about individual pages but about logging,
+  # restarting and quiting the crawler correctly.
+  class CrawlerManager
+    # logger: logger to use for Grell's messages
+    # on_periodic_restart: if set, the driver will restart every :each visits (100 default) and execute the :do block
+    # driver_options: Any extra options for the Capybara driver
+    def initialize(logger: nil, on_periodic_restart: {}, driver: nil, driver_options: {})
+      Grell.logger = logger ? logger : Logger.new(STDOUT)
+      @periodic_restart_block = on_periodic_restart[:do]
+      @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
+      @driver = driver || CapybaraDriver.setup(driver_options)
+      if @periodic_restart_period <= 0
+        Grell.logger.warn "GRELL being misconfigured with a negative period to restart. Ignoring option."
+      end
+    end
+
+    # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
+    def restart
+      Grell.logger.info "GRELL is restarting"
+      @driver.restart
+      Grell.logger.info "GRELL has restarted"
+    end
+
+    # Quits the poltergeist driver.
+    def quit
+      Grell.logger.info "GRELL is quitting the poltergeist driver"
+      @driver.quit
+    end
+
+    # PhantomJS seems to consume memory increasingly as it crawls, periodic restart allows to restart
+    # the driver, potentially calling a block.
+    def check_periodic_restart(collection)
+      return unless @periodic_restart_block
+      return unless @periodic_restart_period > 0
+      return unless (collection.visited_pages.size % @periodic_restart_period).zero?
+      restart
+      @periodic_restart_block.call
+    end
+
+    def cleanup_all_processes
+      pids = running_phantomjs_pids
+      return if pids.empty?
+      Grell.logger.warn "GRELL. Killing PhantomJS processes: #{pids.inspect}"
+      pids.each do |pid|
+        Grell.logger.warn "Sending KILL to PhantomJS process #{pid}"
+        kill_process(pid.to_i)
+      end
+    end
+
+    private
+
+    PAGES_TO_RESTART = 100 # Default number of pages before we restart the driver.
+    KILL_TIMEOUT = 2 # Number of seconds we wait till we kill the process.
+
+    def running_phantomjs_pids
+      list_phantomjs_processes_cmd = "ps -ef | grep -E 'bin/phantomjs' | grep -v grep"
+      `#{list_phantomjs_processes_cmd} | awk '{print $2;}'`.split("\n")
+    end
+
+    def kill_process(pid)
+      Process.kill('TERM', pid)
+      force_kill(pid)
+    rescue Errno::ESRCH, Errno::ECHILD
+      # successfully terminated
+    rescue => e
+      Grell.logger.exception e, "PhantomJS process could not be killed"
+    end
+
+    def force_kill(pid)
+      Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
+    rescue Timeout::Error
+      Process.kill('KILL', pid)
+      Process.wait(pid)
+    end
+  end
+end
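A minimal standalone sketch of the class added above, assuming PhantomJS is installed locally; it is illustrative only and uses just the methods visible in this diff.

```ruby
require 'logger'
require 'grell'

# Instantiating the manager without a driver wires up the default logger and
# a fresh Capybara/Poltergeist driver via CapybaraDriver.setup.
manager = Grell::CrawlerManager.new(logger: Logger.new(STDOUT))

# Kill any phantomjs processes left over from previous crawls: each PID found
# via `ps` gets TERM, then KILL if it has not exited after KILL_TIMEOUT (2s).
manager.cleanup_all_processes

# restart/quit act on the underlying PhantomJS browser; check_periodic_restart
# only restarts when an on_periodic_restart :do block was configured and the
# visited-page count is a multiple of :each (PAGES_TO_RESTART = 100 by default).
manager.restart
manager.quit
```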
data/lib/grell/page_collection.rb
CHANGED
@@ -10,7 +10,7 @@ module Grell
     # to the collection or if it is already present will be passed to the initializer.
     def initialize(add_match_block)
       @collection = []
-      @add_match_block = add_match_block
+      @add_match_block = add_match_block || default_add_match
     end
 
     def create_page(url, parent_id)
@@ -50,5 +50,13 @@ module Grell
       end
     end
 
+    # If add_match_block is not provided, url matching to determine if a new page should be added
+    # to the page collection will default to this proc
+    def default_add_match
+      Proc.new do |collection_page, page|
+        collection_page.url.downcase == page.url.downcase
+      end
+    end
+
   end
 end
data/lib/grell/version.rb
CHANGED
data/spec/lib/crawler_manager_spec.rb
ADDED
@@ -0,0 +1,113 @@
+RSpec.describe Grell::CrawlerManager do
+  let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
+  let(:host) { 'http://www.example.com' }
+  let(:url) { 'http://www.example.com/test' }
+  let(:driver) { double(Grell::CapybaraDriver) }
+  let(:logger) { Logger.new(nil) }
+  let(:crawler_manager) do
+    described_class.new(logger: logger, driver: driver)
+  end
+
+  describe 'initialize' do
+    context 'provides a logger' do
+      let(:logger) { 33 }
+      it 'sets custom logger' do
+        crawler_manager
+        expect(Grell.logger).to eq(33)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+    context 'does not provides a logger' do
+      let(:logger) { nil }
+      it 'sets default logger' do
+        crawler_manager
+        expect(Grell.logger).to be_instance_of(Logger)
+        Grell.logger = Logger.new(nil)
+      end
+    end
+  end
+
+  describe '#quit' do
+    let(:driver) { double }
+
+    it 'quits the poltergeist driver' do
+      expect(driver).to receive(:quit)
+      crawler_manager.quit
+    end
+  end
+
+  describe '#restart' do
+    let(:driver) { double }
+
+    it 'restarts the poltergeist driver' do
+      expect(driver).to receive(:restart)
+      expect(logger).to receive(:info).with("GRELL is restarting")
+      expect(logger).to receive(:info).with("GRELL has restarted")
+      crawler_manager.restart
+    end
+  end
+
+  describe '#check_periodic_restart' do
+    let(:collection) { double }
+    context 'Periodic restart not setup' do
+      it 'does not restart' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with default period' do
+      let(:do_something) { proc {} }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something }
+        )
+      end
+
+      it 'does not restart after visiting 99 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 99 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting 100 pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+    context 'Periodic restart setup with custom period' do
+      let(:do_something) { proc {} }
+      let(:period) { 50 }
+      let(:crawler_manager) do
+        Grell::CrawlerManager.new(
+          logger: logger,
+          driver: driver,
+          on_periodic_restart: { do: do_something, each: period }
+        )
+      end
+
+      it 'does not restart after visiting a number different from custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period * 1.2 }
+        expect(crawler_manager).not_to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+      it 'restarts after visiting custom period pages' do
+        allow(collection).to receive_message_chain(:visited_pages, :size) { period }
+        expect(crawler_manager).to receive(:restart)
+        crawler_manager.check_periodic_restart(collection)
+      end
+    end
+  end
+
+  describe '#cleanup_all_processes' do
+    let(:driver) { double }
+
+    it 'kills all phantomjs processes' do
+      allow(crawler_manager).to receive(:running_phantomjs_pids).and_return([10])
+      expect(crawler_manager).to receive(:kill_process).with(10)
+      crawler_manager.cleanup_all_processes
+    end
+  end
+end
data/spec/lib/crawler_spec.rb
CHANGED
@@ -5,7 +5,18 @@ RSpec.describe Grell::Crawler do
   let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
   let(:host) { 'http://www.example.com' }
   let(:url) { 'http://www.example.com/test' }
-  let(:
+  let(:add_match_block) { nil }
+  let(:blacklist) { /a^/ }
+  let(:whitelist) { /.*/ }
+  let(:crawler) do
+    Grell::Crawler.new(
+      logger: Logger.new(nil),
+      driver_options: { external_driver: true },
+      evaluate_in_each_page: script,
+      add_match_block: add_match_block,
+      blacklist: blacklist,
+      whitelist: whitelist)
+  end
   let(:script) { nil }
   let(:body) { 'body' }
   let(:custom_add_match) do
@@ -18,29 +29,6 @@ RSpec.describe Grell::Crawler do
     proxy.stub(url).and_return(body: body, code: 200)
   end
 
-  describe 'initialize' do
-    it 'can provide your own logger' do
-      Grell::Crawler.new(external_driver: true, logger: 33)
-      expect(Grell.logger).to eq(33)
-      Grell.logger = Logger.new(nil)
-    end
-
-    it 'provides a stdout logger if nothing provided' do
-      crawler
-      expect(Grell.logger).to be_instance_of(Logger)
-    end
-  end
-
-  describe '#quit' do
-    let(:driver) { double }
-    before { allow(Grell::CapybaraDriver).to receive(:setup).and_return(driver) }
-
-    it 'quits the poltergeist driver' do
-      expect(driver).to receive(:quit)
-      crawler.quit
-    end
-  end
-
   describe '#crawl' do
     before do
       crawler.instance_variable_set('@collection', Grell::PageCollection.new(custom_add_match))
@@ -127,15 +115,6 @@ RSpec.describe Grell::Crawler do
       expect(result[1].url).to eq(url_visited)
     end
 
-    it 'can use a custom url add matcher block' do
-      expect(crawler).to_not receive(:default_add_match)
-      crawler.start_crawling(url, add_match_block: custom_add_match)
-    end
-
-    it 'uses a default url add matched if not provided' do
-      expect(crawler).to receive(:default_add_match).and_return(custom_add_match)
-      crawler.start_crawling(url)
-    end
   end
 
   shared_examples_for 'visits all available pages' do
@@ -204,10 +183,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-        crawler.whitelist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
      let(:visited_pages_count) { 2 } # my own page + trusmis
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -217,10 +193,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-        crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
-      end
-
+      let(:whitelist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -230,10 +203,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-        crawler.whitelist(/\/trusmis\.html/)
-      end
-
+      let(:whitelist) { /\/trusmis\.html/ }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -243,10 +213,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-        crawler.whitelist([/\/trusmis\.html/])
-      end
-
+      let(:whitelist) { [/\/trusmis\.html/] }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
@@ -256,10 +223,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-        crawler.whitelist([])
-      end
-
+      let(:whitelist) { [] }
       let(:visited_pages_count) { 1 } # my own page only
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -269,10 +233,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'adding all links to the whitelist' do
-
-        crawler.whitelist(['/trusmis', '/help'])
-      end
-
+      let(:whitelist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -298,9 +259,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a single string' do
-
-        crawler.blacklist('/trusmis.html')
-      end
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -310,9 +269,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of strings' do
-
-        crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
-      end
+      let(:blacklist) { ['/trusmis.html', '/nothere', 'another.html'] }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -322,9 +279,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using a regexp' do
-
-        crawler.blacklist(/\/trusmis\.html/)
-      end
+      let(:blacklist) { /\/trusmis\.html/ }
       let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -334,9 +289,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an array of regexps' do
-
-        crawler.blacklist([/\/trusmis\.html/])
-      end
+      let(:blacklist) { [/\/trusmis\.html/] }
      let(:visited_pages_count) {2}
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/help.html']
@@ -346,9 +299,7 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'using an empty array' do
-
-        crawler.blacklist([])
-      end
+      let(:blacklist) { [] }
       let(:visited_pages_count) { 3 } # all links
       let(:visited_pages) do
         ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
@@ -357,10 +308,8 @@ RSpec.describe Grell::Crawler do
       it_behaves_like 'visits all available pages'
     end
 
-    context 'adding all links to the
-
-        crawler.blacklist(['/trusmis', '/help'])
-      end
+    context 'adding all links to the blacklist' do
+      let(:blacklist) { ['/trusmis', '/help'] }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -386,11 +335,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist the only whitelisted page' do
-
-
-        crawler.blacklist('/trusmis.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/trusmis.html' }
       let(:visited_pages_count) { 1 }
       let(:visited_pages) do
         ['http://www.example.com/test']
@@ -400,11 +346,8 @@ RSpec.describe Grell::Crawler do
     end
 
     context 'we blacklist none of the whitelisted pages' do
-
-
-        crawler.blacklist('/raistlin.html')
-      end
-
+      let(:whitelist) { '/trusmis.html' }
+      let(:blacklist) { '/raistlin.html' }
       let(:visited_pages_count) { 2 }
       let(:visited_pages) do
         ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: grell
 version: !ruby/object:Gem::Version
-  version: 1.6.11
+  version: 2.0.0
 platform: ruby
 authors:
 - Jordi Polo Carres
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-11-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: capybara
@@ -16,28 +16,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.10'
 - !ruby/object:Gem::Dependency
   name: poltergeist
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.11'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -114,28 +114,28 @@ dependencies:
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.5'
 - !ruby/object:Gem::Dependency
   name: puffing-billy
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
      - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.9'
 - !ruby/object:Gem::Dependency
   name: timecop
   requirement: !ruby/object:Gem::Requirement
@@ -150,20 +150,6 @@ dependencies:
     - - "~>"
      - !ruby/object:Gem::Version
        version: '0.8'
-- !ruby/object:Gem::Dependency
-  name: capybara-webkit
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: 1.11.1
 - !ruby/object:Gem::Dependency
   name: selenium-webdriver
   requirement: !ruby/object:Gem::Requirement
@@ -196,6 +182,7 @@ files:
 - lib/grell.rb
 - lib/grell/capybara_driver.rb
 - lib/grell/crawler.rb
+- lib/grell/crawler_manager.rb
 - lib/grell/grell_logger.rb
 - lib/grell/page.rb
 - lib/grell/page_collection.rb
@@ -203,6 +190,7 @@ files:
 - lib/grell/reader.rb
 - lib/grell/version.rb
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb
@@ -220,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
    - !ruby/object:Gem::Version
-      version: 1.
+      version: 2.1.8
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -234,6 +222,7 @@ specification_version: 4
 summary: Ruby web crawler
 test_files:
 - spec/lib/capybara_driver_spec.rb
+- spec/lib/crawler_manager_spec.rb
 - spec/lib/crawler_spec.rb
 - spec/lib/page_collection_spec.rb
 - spec/lib/page_spec.rb