grell 1.6.11 → 2.0.0
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/CHANGELOG.md +10 -0
- data/Gemfile +4 -0
- data/README.md +91 -64
- data/grell.gemspec +5 -6
- data/lib/grell.rb +1 -0
- data/lib/grell/capybara_driver.rb +1 -1
- data/lib/grell/crawler.rb +16 -46
- data/lib/grell/crawler_manager.rb +77 -0
- data/lib/grell/page_collection.rb +9 -1
- data/lib/grell/version.rb +1 -1
- data/spec/lib/crawler_manager_spec.rb +113 -0
- data/spec/lib/crawler_spec.rb +29 -86
- metadata +14 -25
    
        checksums.yaml
    CHANGED
    
    | @@ -1,7 +1,7 @@ | |
| 1 1 | 
             
            ---
         | 
| 2 2 | 
             
            SHA1:
         | 
| 3 | 
            -
              metadata.gz:  | 
| 4 | 
            -
              data.tar.gz:  | 
| 3 | 
            +
              metadata.gz: bbd2a19c7858d2e755e7d51a038b84901835f32b
         | 
| 4 | 
            +
              data.tar.gz: 7c65a731bfdacb7b65cf523c8ba128f1f3390f96
         | 
| 5 5 | 
             
            SHA512:
         | 
| 6 | 
            -
              metadata.gz:  | 
| 7 | 
            -
              data.tar.gz:  | 
| 6 | 
            +
              metadata.gz: fcf31c442f8d51cd4a9534270cc3afd7f5bac432a74a9ec2e429914145df710131e0a99aca61e4cd98d2e831a5cedb511c21dce9d1562881541fd1bfb4016014
         | 
| 7 | 
            +
              data.tar.gz: c413503a4a8765fadbec10ed64467ba140b8739b6209ae4a19e585549ad689231b64b3d670e1bdceec560597051d93f8b90d696f66571cb1dad175f15af1b4a0
         | 
    
        data/.travis.yml
    CHANGED
    
    
    
        data/CHANGELOG.md
    CHANGED
    
    | @@ -1,3 +1,13 @@ | |
| 1 | 
            +
            # 2.0.0
         | 
| 2 | 
            +
              * New configuration key `on_periodic_restart`.
         | 
| 3 | 
            +
              * CrawlerManager.cleanup_all_processes method destroys all instances of phantomjs on this machine.
         | 
| 4 | 
            +
             | 
| 5 | 
            +
              * Breaking changes
         | 
| 6 | 
            +
                - Requires Ruby 2.1 or later.
         | 
| 7 | 
            +
                - Crawler.start_crawling does not accept options anymore, all options are passed to Crawler.new.
         | 
| 8 | 
            +
                - Crawler's methods `restart` and `quit` have been moved to CrawlerManager.
         | 
| 9 | 
            +
                - Crawler gets whitelist and blacklist as configuration options instead of being set in specific methods.
         | 
| 10 | 
            +
             | 
| 1 11 | 
             
            # 1.6.11
         | 
| 2 12 | 
             
              * Ensure all links are loaded by waiting for Ajax requests to complete
         | 
| 3 13 | 
             
              * Add '@evaluate_in_each_page' option to evaluate before extracting links (e.g. $('.dropdown').addClass('open');)
         | 
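As a rough illustration of the new `on_periodic_restart` key, the sketch below mimics the "restart every N visited pages, then call a block" check in plain Ruby, without loading Grell or PhantomJS. `PeriodicRestart`, `check` and `relogins` are hypothetical names used only for this example; in the gem the equivalent logic lives in `CrawlerManager#check_periodic_restart`.

```ruby
# Hypothetical stand-in for the manager's periodic check; this is NOT Grell's class.
class PeriodicRestart
  attr_reader :restarts

  # each:  number of visited pages between restarts (assumed default of 100)
  # block: code to run after every restart, e.g. logging in again
  def initialize(each: 100, &block)
    @period = each
    @block = block
    @restarts = 0
  end

  # Call once per visited page with the running count of visited pages.
  def check(visited_count)
    return unless @block                            # nothing configured
    return unless @period > 0                       # ignore non-positive periods
    return unless (visited_count % @period).zero?   # only on every Nth page
    @restarts += 1                                  # the real manager restarts the driver here
    @block.call
  end
end

relogins = []
scheduler = PeriodicRestart.new(each: 200) { relogins << :relogin }
(1..450).each { |visited| scheduler.check(visited) }
```

With a period of 200 and 450 simulated visits, the block fires at pages 200 and 400 only.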
    
        data/Gemfile
    CHANGED
    
    
    
        data/README.md
    CHANGED
    
    | @@ -21,16 +21,15 @@ Or install it yourself as: | |
| 21 21 |  | 
| 22 22 | 
             
                $ gem install grell
         | 
| 23 23 |  | 
| 24 | 
            -
            Grell uses PhantomJS, you will need to download and install it in your
         | 
| 24 | 
            +
            Grell uses PhantomJS as a browser; you will need to download and install it on your
         | 
| 25 25 | 
             
            system. Check for instructions in http://phantomjs.org/
         | 
| 26 26 | 
             
            Grell has been tested with PhantomJS v2.1.x
         | 
| 27 27 |  | 
| 28 28 | 
             
            ## Usage
         | 
| 29 29 |  | 
| 30 | 
            -
             | 
| 31 30 | 
             
            ### Crawling an entire site
         | 
| 32 31 |  | 
| 33 | 
            -
            The main entry point of the library is Grell#start_crawling.
         | 
| 32 | 
            +
            The main entry point of the library is Grell::Crawler#start_crawling.
         | 
| 34 33 | 
             
            Grell will yield to your code with each page it finds:
         | 
| 35 34 |  | 
| 36 35 | 
             
            ```ruby
         | 
| @@ -55,85 +54,105 @@ This list is indexed by the complete url, including query parameters. | |
| 55 54 |  | 
| 56 55 | 
             
            ### Re-retrieving a page
         | 
| 57 56 | 
             
            If you want Grell to revisit a page and return the data to you again,
         | 
| 58 | 
            -
            return the symbol :retry in your block  | 
| 57 | 
            +
            return the symbol :retry in your block for the start_crawling method.
         | 
| 59 58 | 
             
            For instance
         | 
| 60 59 | 
             
            ```ruby
         | 
| 61 60 | 
             
            require 'grell'
         | 
| 62 61 | 
             
            crawler = Grell::Crawler.new
         | 
| 63 62 | 
             
            crawler.start_crawling('http://www.google.com') do |current_page|
         | 
| 64 63 | 
             
              if current_page.status == 500 && current_page.retries == 0
         | 
| 65 | 
            -
                crawler.restart
         | 
| 64 | 
            +
                crawler.manager.restart
         | 
| 66 65 | 
             
                :retry
         | 
| 67 66 | 
             
              end
         | 
| 68 67 | 
             
            end
         | 
| 69 68 | 
             
            ```
         | 
| 70 69 |  | 
| 71 | 
            -
            ###  | 
| 72 | 
            -
            If you are doing a long crawling it is possible that phantomJS starts failing.
         | 
| 73 | 
            -
            To avoid that, you can restart it by calling "restart" on crawler.
         | 
| 74 | 
            -
            That will kill phantom and will restart it. Grell will keep the status of
         | 
| 75 | 
            -
            pages already visited and pages discovered and to be visited. And will keep crawling
         | 
| 76 | 
            -
            with the new phantomJS process instead of the old one.
         | 
| 70 | 
            +
            ### Pages' id
         | 
| 77 71 |  | 
| 78 | 
            -
             | 
| 72 | 
            +
            Each page has a unique id, accessed by the property `id`. Each page also stores the id of the page from which it was found, accessed by the property `parent_id`.
         | 
| 73 | 
            +
            The page object generated by accessing the first URL passed to `start_crawling` (the root) has a `parent_id` equal to `nil` and an `id` equal to 0.
         | 
| 74 | 
            +
            Using this information it is possible to construct a directed graph.
         | 
| 79 75 |  | 
| 80 | 
            -
            Grell by default will follow all the links it finds going to the site
         | 
| 81 | 
            -
            your are crawling. It will never follow links linking outside your site.
         | 
| 82 | 
            -
            If you want to further limit the amount of links crawled, you can use
         | 
| 83 | 
            -
            whitelisting, blacklisting or manual filtering.
         | 
| 84 76 |  | 
| 85 | 
            -
             | 
| 86 | 
            -
            By default, Grell will detect new URLs to visit by comparing the full URL
         | 
| 87 | 
            -
            with the URLs of the discovered and visited links. This functionality can
         | 
| 88 | 
            -
            be changed by passing a block of code to Grells `start_crawling` method.
         | 
| 89 | 
            -
            In the below example, the path of the URLs (instead of the full URL) will
         | 
| 90 | 
            -
            be compared.
         | 
| 77 | 
            +
            ### Restart and quit
         | 
| 91 78 |  | 
| 79 | 
            +
            Grell can be restarted. The current lists of visited and yet-to-visit pages are not modified when restarting
         | 
| 80 | 
            +
            but the browser is destroyed and recreated, all cookies and local storage are lost. After restarting, crawling is resumed with a
         | 
| 81 | 
            +
            new browser.
         | 
| 82 | 
            +
            To destroy the crawler, call the `quit` method. This will free the memory taken in Ruby and destroy the PhantomJS process.
         | 
| 92 83 | 
             
            ```ruby
         | 
| 93 84 | 
             
            require 'grell'
         | 
| 94 | 
            -
             | 
| 95 85 | 
             
            crawler = Grell::Crawler.new
         | 
| 86 | 
            +
            crawler.manager.restart # restarts the browser
         | 
| 87 | 
            +
            crawler.manager.quit # quits and destroys the crawler
         | 
| 88 | 
            +
            ```
         | 
| 96 89 |  | 
| 97 | 
            -
             | 
| 98 | 
            -
              collection_page.path == page.path
         | 
| 99 | 
            -
            end
         | 
| 90 | 
            +
            ### Options
         | 
| 100 91 |  | 
| 101 | 
            -
             | 
| 102 | 
            -
             | 
| 103 | 
            -
             | 
| 104 | 
            -
             | 
| 92 | 
            +
            The `Grell::Crawler` class can be passed options to customize its behavior:
         | 
| 93 | 
            +
            - `logger`: Sets the logger object, for instance `Rails.logger`. Default: `Logger.new(STDOUT)`
         | 
| 94 | 
            +
            - `on_periodic_restart`: Sets periodic restarts of the crawler after a given number of visits. Default: 100 pages.
         | 
| 95 | 
            +
            - `whitelist`: Sets up a whitelist filter for URLs to be visited. Default: all URLs are whitelisted.
         | 
| 96 | 
            +
            - `blacklist`: Sets up a blacklist filter for URLs to be avoided. Default: no URL is blacklisted.
         | 
| 97 | 
            +
            - `add_match_block`: Block evaluated to consider if a given page should be part of the pages to be visited. Default: add unique URLs.
         | 
| 98 | 
            +
            - `evaluate_in_each_page`: Javascript block to be evaluated on each page visited. Default: Nothing evaluated.
         | 
| 99 | 
            +
            - `driver_options`: Driver options will be passed to the Capybara driver which connects to PhantomJS.
         | 
| 105 100 |  | 
| 106 | 
            -
             | 
| 101 | 
            +
            Grell by default will follow all the links it finds in the site being crawled.
         | 
| 102 | 
            +
            It will never follow links linking outside your site.
         | 
| 103 | 
            +
            If you want to further limit the amount of links crawled, you can use
         | 
| 104 | 
            +
            whitelisting, blacklisting or manual filtering.
         | 
| 105 | 
            +
            Further details on these and other options follow below.
         | 
| 107 106 |  | 
| 108 | 
            -
            ```ruby
         | 
| 109 | 
            -
            require 'grell'
         | 
| 110 107 |  | 
| 111 | 
            -
             | 
| 112 | 
            -
             | 
| 113 | 
            -
            crawler. | 
| 114 | 
            -
             | 
| 108 | 
            +
            #### Automatically restarting PhantomJS
         | 
| 109 | 
            +
            If you are doing a long crawl, PhantomJS may get into an inconsistent state or start leaking memory.
         | 
| 110 | 
            +
            The crawler can be restarted manually by calling `crawler.manager.restart` or automatically by using the
         | 
| 111 | 
            +
            `on_periodic_restart` configuration key as follows:
         | 
| 115 112 |  | 
| 116 | 
            -
             | 
| 117 | 
            -
             | 
| 118 | 
            -
            string match is whitelisted) or an array with regexps and/or strings.
         | 
| 113 | 
            +
             ```ruby
         | 
| 114 | 
            +
             require 'grell'
         | 
| 119 115 |  | 
| 120 | 
            -
             | 
| 116 | 
            +
             crawler = Grell::Crawler.new(on_periodic_restart: { do: my_restart_procedure, each: 200 })
         | 
| 121 117 |  | 
| 122 | 
            -
             | 
| 123 | 
            -
             | 
| 118 | 
            +
             crawler.start_crawling('http://www.google.com') do |current_page|
         | 
| 119 | 
            +
             ...
         | 
| 120 | 
            +
             end
         | 
| 121 | 
            +
             ```
         | 
| 124 122 |  | 
| 125 | 
            -
            crawler  | 
| 126 | 
            -
             | 
| 127 | 
            -
            crawler.start_crawling('http://www.google.com')
         | 
| 128 | 
            -
            ```
         | 
| 123 | 
            +
             This code will set up the crawler to be restarted every 200 pages crawled and to call `my_restart_procedure`
         | 
| 124 | 
            +
             between restarts. A restart destroys the cookies, so this custom block can be used, for instance, to log in again.
         | 
| 129 125 |  | 
| 130 | 
            -
            Similar to whitelisting. But now Grell will follow every other link in
         | 
| 131 | 
            -
            this site which does not go to /games/...
         | 
| 132 126 |  | 
| 133 | 
            -
             | 
| 134 | 
            -
             | 
| 135 | 
            -
             | 
| 136 | 
            -
             | 
| 127 | 
            +
             #### Whitelisting
         | 
| 128 | 
            +
             | 
| 129 | 
            +
             ```ruby
         | 
| 130 | 
            +
             require 'grell'
         | 
| 131 | 
            +
             | 
| 132 | 
            +
             crawler = Grell::Crawler.new(whitelist: [/games\/.*/, '/fun'])
         | 
| 133 | 
            +
             crawler.start_crawling('http://www.google.com')
         | 
| 134 | 
            +
             ```
         | 
| 135 | 
            +
             | 
| 136 | 
            +
             Grell here will only follow links to games and '/fun' and ignore all
         | 
| 137 | 
            +
             other links. You can provide a regexp, strings (if any part of the
         | 
| 138 | 
            +
             string match is whitelisted) or an array with regexps and/or strings.
         | 
| 139 | 
            +
             | 
| 140 | 
            +
             #### Blacklisting
         | 
| 141 | 
            +
             | 
| 142 | 
            +
             ```ruby
         | 
| 143 | 
            +
             require 'grell'
         | 
| 144 | 
            +
             | 
| 145 | 
            +
             crawler = Grell::Crawler.new(blacklist: /games\/.*/)
         | 
| 146 | 
            +
             crawler.start_crawling('http://www.google.com')
         | 
| 147 | 
            +
             ```
         | 
| 148 | 
            +
             | 
| 149 | 
            +
             Similar to whitelisting. But now Grell will follow every other link in
         | 
| 150 | 
            +
             this site which does not go to /games/...
         | 
| 151 | 
            +
             | 
| 152 | 
            +
             If you call both whitelist and blacklist then both will apply, a link
         | 
| 153 | 
            +
             has to fulfill both conditions to survive. If you do not call any, then
         | 
| 154 | 
            +
             all links on this site will be crawled. Think of these methods as
         | 
| 155 | 
            +
             filters.
         | 
| 137 156 |  | 
| 138 157 | 
             
            #### Manual link filtering
         | 
| 139 158 |  | 
| @@ -144,14 +163,28 @@ links to visit. So you can modify in your block of code "page.links" to | |
| 144 163 | 
             
            add and delete links to instruct Grell to add them to the list of links
         | 
| 145 164 | 
             
            to visit next.
         | 
| 146 165 |  | 
| 147 | 
            -
             | 
| 166 | 
            +
            #### Custom URL Comparison
         | 
| 167 | 
            +
            By default, Grell will detect new URLs to visit by comparing the full URL
         | 
| 168 | 
            +
            with the URLs of the discovered and visited links. This functionality can
         | 
| 169 | 
            +
            be changed by passing a block of code to `Grell::Crawler.new` through the `add_match_block` option.
         | 
| 170 | 
            +
            In the below example, the path of the URLs (instead of the full URL) will
         | 
| 171 | 
            +
            be compared.
         | 
| 148 172 |  | 
| 149 | 
            -
             | 
| 150 | 
            -
             | 
| 151 | 
            -
             | 
| 173 | 
            +
            ```ruby
         | 
| 174 | 
            +
            require 'grell'
         | 
| 175 | 
            +
             | 
| 176 | 
            +
            add_match_block = Proc.new do |collection_page, page|
         | 
| 177 | 
            +
              collection_page.path == page.path
         | 
| 178 | 
            +
            end
         | 
| 152 179 |  | 
| 180 | 
            +
            crawler = Grell::Crawler.new(add_match_block: add_match_block)
         | 
| 153 181 |  | 
| 154 | 
            -
             | 
| 182 | 
            +
            crawler.start_crawling('http://www.google.com') do |current_page|
         | 
| 183 | 
            +
            ...
         | 
| 184 | 
            +
            end
         | 
| 185 | 
            +
            ```
         | 
| 186 | 
            +
             | 
| 187 | 
            +
            #### Evaluate script
         | 
| 155 188 |  | 
| 156 189 | 
             
            You can evaluate a JavaScript snippet in each page before extracting links by passing the snippet to the 'evaluate_in_each_page' option:
         | 
| 157 190 |  | 
| @@ -168,12 +201,6 @@ When there is an error in the page or an internal error in the crawler (Javascri | |
| 168 201 | 
             
            - errorClass: The class of the error which broke this page.
         | 
| 169 202 | 
             
            - errorMessage: A descriptive message with the information Grell could gather about the error.
         | 
| 170 203 |  | 
| 171 | 
            -
            ### Logging
         | 
| 172 | 
            -
            You can pass your logger to Grell. For example in a Rails app:
         | 
| 173 | 
            -
            ```Ruby
         | 
| 174 | 
            -
            crawler = Grell::Crawler.new(logger: Rails.logger)
         | 
| 175 | 
            -
            ```
         | 
| 176 | 
            -
             | 
| 177 204 | 
             
            ## Tests
         | 
| 178 205 |  | 
| 179 206 | 
             
            Run the tests with
         | 
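The whitelist and blacklist semantics described in the README can be tried out in plain Ruby, without PhantomJS. The sketch below assumes the same approach the 2.0.0 constructor takes (each option is passed through `Regexp.union`, with `/.*/` and `/a^/` as the match-everything and match-nothing defaults). `filter_links` and the example.com URLs are hypothetical and not part of Grell.

```ruby
# Hypothetical helper, not part of Grell: keeps links that match the whitelist
# and drops links that match the blacklist. Each option may be a regexp,
# a string, or an array of either, because Regexp.union accepts all three.
def filter_links(links, whitelist: /.*/, blacklist: /a^/)
  whitelist_regexp = Regexp.union(whitelist)  # strings are escaped and matched as substrings
  blacklist_regexp = Regexp.union(blacklist)
  links.select { |link| link =~ whitelist_regexp }
       .reject { |link| link =~ blacklist_regexp }
end

links = [
  'http://example.com/games/chess',
  'http://example.com/fun',
  'http://example.com/about'
]

whitelisted = filter_links(links, whitelist: [/games\/.*/, '/fun'])
not_games   = filter_links(links, blacklist: /games\/.*/)
both        = filter_links(links, whitelist: [/games\/.*/, '/fun'], blacklist: 'chess')
```

When both options are given, a link has to pass the whitelist and also not match the blacklist, which is why `both` keeps only the `/fun` URL.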
    
        data/grell.gemspec
    CHANGED
    
    | @@ -19,19 +19,18 @@ Gem::Specification.new do |spec| | |
| 19 19 | 
             
              spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
         | 
| 20 20 | 
             
              spec.require_paths = ["lib"]
         | 
| 21 21 |  | 
| 22 | 
            -
              spec.required_ruby_version = '>= 1. | 
| 22 | 
            +
              spec.required_ruby_version = '>= 2.1.8'
         | 
| 23 23 |  | 
| 24 | 
            -
              spec.add_dependency 'capybara', '~> 2. | 
| 25 | 
            -
              spec.add_dependency 'poltergeist', '~> 1. | 
| 24 | 
            +
              spec.add_dependency 'capybara', '~> 2.10'
         | 
| 25 | 
            +
              spec.add_dependency 'poltergeist', '~> 1.11'
         | 
| 26 26 |  | 
| 27 27 | 
             
              spec.add_development_dependency 'bundler', '~> 1.6'
         | 
| 28 28 | 
             
              spec.add_development_dependency 'byebug', '~> 4.0'
         | 
| 29 29 | 
             
              spec.add_development_dependency 'kender', '~> 0.2'
         | 
| 30 30 | 
             
              spec.add_development_dependency 'rake', '~> 10.0'
         | 
| 31 31 | 
             
              spec.add_development_dependency 'webmock', '~> 1.18'
         | 
| 32 | 
            -
              spec.add_development_dependency 'rspec', '~> 3. | 
| 33 | 
            -
              spec.add_development_dependency 'puffing-billy', '~> 0. | 
| 32 | 
            +
              spec.add_development_dependency 'rspec', '~> 3.5'
         | 
| 33 | 
            +
              spec.add_development_dependency 'puffing-billy', '~> 0.9'
         | 
| 34 34 | 
             
              spec.add_development_dependency 'timecop', '~> 0.8'
         | 
| 35 | 
            -
              spec.add_development_dependency 'capybara-webkit', '~> 1.11.1'
         | 
| 36 35 | 
             
              spec.add_development_dependency 'selenium-webdriver', '~> 2.53.4'
         | 
| 37 36 | 
             
            end
         | 
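To see what the version constraints in this gemspec allow, the following sketch checks a few version numbers against them using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The helper `allows?` and the sample version numbers are chosen only for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies the constraint string.
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 2.10' means >= 2.10 and < 3.0
capybara_minor = allows?('~> 2.10', '2.18.0')   # inside the range
capybara_major = allows?('~> 2.10', '3.0.0')    # next major, outside the range

# '~> 2.53.4' means >= 2.53.4 and < 2.54
selenium_patch = allows?('~> 2.53.4', '2.53.9') # inside the range
selenium_minor = allows?('~> 2.53.4', '2.54.0') # next minor, outside the range

# required_ruby_version '>= 2.1.8'
old_ruby = allows?('>= 2.1.8', '2.1.7')
new_ruby = allows?('>= 2.1.8', '2.3.0')
```

The pessimistic operator `~>` fixes every segment except the last one given, so a two-segment constraint permits minor upgrades while a three-segment constraint permits only patch upgrades.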
    
        data/lib/grell.rb
    CHANGED
    
    
    
        data/lib/grell/crawler.rb
    CHANGED
    
    | @@ -1,54 +1,32 @@ | |
| 1 | 
            -
             | 
| 2 1 | 
             
            module Grell
         | 
| 3 | 
            -
             | 
| 4 2 | 
             
              # This is the class that starts and controls the crawling
         | 
| 5 3 | 
             
              class Crawler
         | 
| 6 | 
            -
                attr_reader :collection
         | 
| 4 | 
            +
                attr_reader :collection, :manager
         | 
| 7 5 |  | 
| 8 6 | 
             
                # Creates a crawler
         | 
| 9 | 
            -
                #  | 
| 10 | 
            -
                 | 
| 11 | 
            -
             | 
| 12 | 
            -
             | 
| 13 | 
            -
             | 
| 14 | 
            -
             | 
| 15 | 
            -
                   | 
| 16 | 
            -
             | 
| 17 | 
            -
                  @ | 
| 18 | 
            -
                  @ | 
| 19 | 
            -
             | 
| 20 | 
            -
             | 
| 21 | 
            -
                # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
         | 
| 22 | 
            -
                def restart
         | 
| 23 | 
            -
                  Grell.logger.info "GRELL is restarting"
         | 
| 24 | 
            -
                  @driver.restart
         | 
| 25 | 
            -
                  Grell.logger.info "GRELL has restarted"
         | 
| 26 | 
            -
                end
         | 
| 27 | 
            -
             | 
| 28 | 
            -
                # Quits the poltergeist driver.
         | 
| 29 | 
            -
                def quit
         | 
| 30 | 
            -
                  Grell.logger.info "GRELL is quitting the poltergeist driver"
         | 
| 31 | 
            -
                  @driver.quit
         | 
| 32 | 
            -
                end
         | 
| 33 | 
            -
             | 
| 34 | 
            -
                # Setups a whitelist filter, allows a regexp, string or array of either to be matched.
         | 
| 35 | 
            -
                def whitelist(list)
         | 
| 36 | 
            -
                  @whitelist_regexp = Regexp.union(list)
         | 
| 37 | 
            -
                end
         | 
| 38 | 
            -
             | 
| 39 | 
            -
                # Setups a blacklist filter, allows a regexp, string or array of either to be matched.
         | 
| 40 | 
            -
                def blacklist(list)
         | 
| 41 | 
            -
                  @blacklist_regexp = Regexp.union(list)
         | 
| 7 | 
            +
                # evaluate_in_each_page: javascript block to evaluate in each page we crawl
         | 
| 8 | 
            +
                # add_match_block: block to evaluate to consider if a page is part of the collection
         | 
| 9 | 
            +
                # manager_options: options passed to the manager class
         | 
| 10 | 
            +
                # whitelist: Setups a whitelist filter, allows a regexp, string or array of either to be matched.
         | 
| 11 | 
            +
                # blacklist: Setups a blacklist filter, allows a regexp, string or array of either to be matched.
         | 
| 12 | 
            +
                def initialize(evaluate_in_each_page: nil, add_match_block: nil, whitelist: /.*/, blacklist: /a^/, **manager_options)
         | 
| 13 | 
            +
                  @collection = nil
         | 
| 14 | 
            +
                  @manager = CrawlerManager.new(manager_options)
         | 
| 15 | 
            +
                  @evaluate_in_each_page = evaluate_in_each_page
         | 
| 16 | 
            +
                  @add_match_block = add_match_block
         | 
| 17 | 
            +
                  @whitelist_regexp = Regexp.union(whitelist)
         | 
| 18 | 
            +
                  @blacklist_regexp = Regexp.union(blacklist)
         | 
| 42 19 | 
             
                end
         | 
| 43 20 |  | 
| 44 21 | 
             
                # Main method, it starts crawling on the given URL and calls a block for each of the pages found.
         | 
| 45 | 
            -
                def start_crawling(url,  | 
| 22 | 
            +
                def start_crawling(url, &block)
         | 
| 46 23 | 
             
                  Grell.logger.info "GRELL Started crawling"
         | 
| 47 | 
            -
                  @collection = PageCollection.new( | 
| 24 | 
            +
                  @collection = PageCollection.new(@add_match_block)
         | 
| 48 25 | 
             
                  @collection.create_page(url, nil)
         | 
| 49 26 |  | 
| 50 27 | 
             
                  while !@collection.discovered_pages.empty?
         | 
| 51 28 | 
             
                    crawl(@collection.next_page, block)
         | 
| 29 | 
            +
                    @manager.check_periodic_restart(@collection)
         | 
| 52 30 | 
             
                  end
         | 
| 53 31 |  | 
| 54 32 | 
             
                  Grell.logger.info "GRELL finished crawling"
         | 
| @@ -93,14 +71,6 @@ module Grell | |
| 93 71 | 
             
                  links.delete_if { |link| link =~ @blacklist_regexp } if @blacklist_regexp
         | 
| 94 72 | 
             
                end
         | 
| 95 73 |  | 
| 96 | 
            -
                # If options[:add_match_block] is not provided, url matching to determine if a
         | 
| 97 | 
            -
                # new page should be added the page collection will default to this proc
         | 
| 98 | 
            -
                def default_add_match
         | 
| 99 | 
            -
                  Proc.new do |collection_page, page|
         | 
| 100 | 
            -
                    collection_page.url.downcase == page.url.downcase
         | 
| 101 | 
            -
                  end
         | 
| 102 | 
            -
                end
         | 
| 103 | 
            -
             | 
| 104 74 | 
             
                # Store the resulting redirected URL along with the original URL
         | 
| 105 75 | 
             
                def add_redirect_url(site)
         | 
| 106 76 | 
             
                  if site.url != site.current_url
         | 
| @@ -0,0 +1,77 @@ | |
| 1 | 
            +
            module Grell
         | 
| 2 | 
            +
              # Manages the state of the process crawling, does not care about individual pages but about logging,
         | 
| 3 | 
            +
              # restarting and quitting the crawler correctly.
         | 
| 4 | 
            +
              class CrawlerManager
         | 
| 5 | 
            +
                # logger: logger to use for Grell's messages
         | 
| 6 | 
            +
                # on_periodic_restart: if set, the driver will restart every :each visits (100 default) and execute the :do block
         | 
| 7 | 
            +
                # driver_options: Any extra options for the Capybara driver
         | 
| 8 | 
            +
                def initialize(logger: nil, on_periodic_restart: {}, driver: nil, driver_options: {})
         | 
| 9 | 
            +
                  Grell.logger = logger ? logger : Logger.new(STDOUT)
         | 
| 10 | 
            +
                  @periodic_restart_block = on_periodic_restart[:do]
         | 
| 11 | 
            +
                  @periodic_restart_period = on_periodic_restart[:each] || PAGES_TO_RESTART
         | 
| 12 | 
            +
                  @driver = driver || CapybaraDriver.setup(driver_options)
         | 
| 13 | 
            +
                  if @periodic_restart_period <= 0
         | 
| 14 | 
            +
                    Grell.logger.warn "GRELL being misconfigured with a negative period to restart. Ignoring option."
         | 
| 15 | 
            +
                  end
         | 
| 16 | 
            +
                end
         | 
| 17 | 
            +
             | 
| 18 | 
            +
                # Restarts the PhantomJS process without modifying the state of visited and discovered pages.
         | 
| 19 | 
            +
                def restart
         | 
| 20 | 
            +
                  Grell.logger.info "GRELL is restarting"
         | 
| 21 | 
            +
                  @driver.restart
         | 
| 22 | 
            +
                  Grell.logger.info "GRELL has restarted"
         | 
| 23 | 
            +
                end
         | 
| 24 | 
            +
             | 
| 25 | 
            +
                # Quits the poltergeist driver.
         | 
| 26 | 
            +
                def quit
         | 
| 27 | 
            +
                  Grell.logger.info "GRELL is quitting the poltergeist driver"
         | 
| 28 | 
            +
                  @driver.quit
         | 
| 29 | 
            +
                end
         | 
| 30 | 
            +
             | 
| 31 | 
            +
                # PhantomJS seems to consume memory increasingly as it crawls, periodic restart allows to restart
         | 
| 32 | 
            +
                # the driver, potentially calling a block.
         | 
| 33 | 
            +
                def check_periodic_restart(collection)
         | 
| 34 | 
            +
                  return unless @periodic_restart_block
         | 
| 35 | 
            +
                  return unless @periodic_restart_period > 0
         | 
| 36 | 
            +
                  return unless (collection.visited_pages.size % @periodic_restart_period).zero?
         | 
| 37 | 
            +
                  restart
         | 
| 38 | 
            +
                  @periodic_restart_block.call
         | 
| 39 | 
            +
                end
         | 
| 40 | 
            +
             | 
| 41 | 
            +
                def cleanup_all_processes
         | 
| 42 | 
            +
                  pids = running_phantomjs_pids
         | 
| 43 | 
            +
                  return if pids.empty?
         | 
| 44 | 
            +
                  Grell.logger.warn "GRELL. Killing PhantomJS processes: #{pids.inspect}"
         | 
| 45 | 
            +
                  pids.each do |pid|
         | 
| 46 | 
            +
                    Grell.logger.warn "Sending KILL to PhantomJS process #{pid}"
         | 
| 47 | 
            +
                    kill_process(pid.to_i)
         | 
| 48 | 
            +
                  end
         | 
| 49 | 
            +
                end
         | 
| 50 | 
            +
             | 
| 51 | 
            +
                private
         | 
| 52 | 
            +
             | 
| 53 | 
            +
                PAGES_TO_RESTART = 100  # Default number of pages before we restart the driver.
         | 
| 54 | 
            +
                KILL_TIMEOUT = 2        # Number of seconds we wait till we kill the process.
         | 
| 55 | 
            +
             | 
| 56 | 
            +
                def running_phantomjs_pids
         | 
| 57 | 
            +
                  list_phantomjs_processes_cmd = "ps -ef | grep -E 'bin/phantomjs' | grep -v grep"
         | 
| 58 | 
            +
                  `#{list_phantomjs_processes_cmd} | awk '{print $2;}'`.split("\n")
         | 
| 59 | 
            +
                end
         | 
| 60 | 
            +
             | 
| 61 | 
            +
                def kill_process(pid)
         | 
| 62 | 
            +
                  Process.kill('TERM', pid)
         | 
| 63 | 
            +
                  force_kill(pid)
         | 
| 64 | 
            +
                rescue Errno::ESRCH, Errno::ECHILD
         | 
| 65 | 
            +
                  # successfully terminated
         | 
| 66 | 
            +
                rescue => e
         | 
| 67 | 
            +
                  Grell.logger.exception e, "PhantomJS process could not be killed"
         | 
| 68 | 
            +
                end
         | 
| 69 | 
            +
             | 
| 70 | 
            +
                def force_kill(pid)
         | 
| 71 | 
            +
                  Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
         | 
| 72 | 
            +
                rescue Timeout::Error
         | 
| 73 | 
            +
                  Process.kill('KILL', pid)
         | 
| 74 | 
            +
                  Process.wait(pid)
         | 
| 75 | 
            +
                end
         | 
| 76 | 
            +
              end
         | 
| 77 | 
            +
            end
         | 
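The `kill_process` / `force_kill` pair above follows a common "TERM first, KILL after a grace period" pattern. The following is a minimal standalone sketch of that pattern, not the gem's code: the method name `terminate_child` and its return symbols are hypothetical, and it assumes a Unix-like system where the target is a child of the current process. `Process.wait` raises `Errno::ECHILD` for a pid that is not a child of the caller, so for PhantomJS processes found via `ps` the wait step in the diff may end in the `rescue Errno::ESRCH, Errno::ECHILD` branch rather than escalating to KILL.

```ruby
require 'timeout'

KILL_TIMEOUT = 2 # seconds to wait after TERM before escalating to KILL

# Hypothetical helper: asks a child process to stop, then forces it if needed.
# Returns :term if TERM was enough, :kill if KILL was required,
# and :gone if the process no longer exists (or is not our child).
def terminate_child(pid)
  Process.kill('TERM', pid)
  Timeout.timeout(KILL_TIMEOUT) { Process.wait(pid) }
  :term
rescue Timeout::Error
  # The child ignored TERM for KILL_TIMEOUT seconds, so force it.
  Process.kill('KILL', pid)
  Process.wait(pid)
  :kill
rescue Errno::ESRCH, Errno::ECHILD
  :gone
end

# Usage: spawn a long-running child and stop it.
pid = Process.spawn('sleep', '30')
puts terminate_child(pid)
```

The grace period gives the process a chance to release resources before it is killed outright; the value of 2 seconds mirrors `KILL_TIMEOUT` in the diff.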
    
        data/lib/grell/page_collection.rb
    CHANGED
    
    | @@ -10,7 +10,7 @@ module Grell | |
| 10 10 | 
             
                # to the collection or if it is already present will be passed to the initializer.
         | 
| 11 11 | 
             
                def initialize(add_match_block)
         | 
| 12 12 | 
             
                  @collection = []
         | 
| 13 | 
            -
                  @add_match_block = add_match_block
         | 
| 13 | 
            +
                  @add_match_block = add_match_block || default_add_match
         | 
| 14 14 | 
             
                end
         | 
| 15 15 |  | 
| 16 16 | 
             
                def create_page(url, parent_id)
         | 
| @@ -50,5 +50,13 @@ module Grell | |
| 50 50 | 
             
                  end
         | 
| 51 51 | 
             
                end
         | 
| 52 52 |  | 
| 53 | 
            +
                # If add_match_block is not provided, url matching to determine if a new page should be added
         | 
| 54 | 
            +
                # to the page collection will default to this proc
         | 
| 55 | 
            +
                def default_add_match
         | 
| 56 | 
            +
                  Proc.new do |collection_page, page|
         | 
| 57 | 
            +
                    collection_page.url.downcase == page.url.downcase
         | 
| 58 | 
            +
                  end
         | 
| 59 | 
            +
                end
         | 
| 60 | 
            +
             | 
| 53 61 | 
             
              end
         | 
| 54 62 | 
             
            end
         | 
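To illustrate what the new `default_add_match` proc compares, here is a small sketch using a stand-in `Page` struct. The struct and the `any?` lookup are assumptions made for the example; how `PageCollection` actually invokes the block is not shown in this hunk.

```ruby
# Stand-in for Grell::Page; only the url attribute matters here (assumption).
Page = Struct.new(:url)

# Same comparison as the default proc in the diff: case-insensitive URL equality.
default_add_match = proc do |collection_page, page|
  collection_page.url.downcase == page.url.downcase
end

collection = [Page.new('http://www.example.com/Test')]
candidate = Page.new('http://www.example.com/test')

# Hypothetical lookup: the candidate counts as present if any stored page matches.
already_present = collection.any? do |existing|
  default_add_match.call(existing, candidate)
end

puts already_present
```

Because the fallback is applied with `||` in the initializer, passing `nil` as `add_match_block` selects this default, while any custom proc replaces it entirely.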
    
        data/lib/grell/version.rb
    CHANGED
    
    
        data/spec/lib/crawler_manager_spec.rb
    ADDED
    
    | @@ -0,0 +1,113 @@ | |
| 1 | 
            +
            RSpec.describe Grell::CrawlerManager do
         | 
| 2 | 
            +
              let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
         | 
| 3 | 
            +
              let(:host) { 'http://www.example.com' }
         | 
| 4 | 
            +
              let(:url) { 'http://www.example.com/test' }
         | 
| 5 | 
            +
              let(:driver) { double(Grell::CapybaraDriver) }
         | 
| 6 | 
            +
              let(:logger) { Logger.new(nil) }
         | 
| 7 | 
            +
              let(:crawler_manager) do
         | 
| 8 | 
            +
                described_class.new(logger: logger, driver: driver)
         | 
| 9 | 
            +
              end
         | 
| 10 | 
            +
             | 
| 11 | 
            +
              describe 'initialize' do
         | 
| 12 | 
            +
                context 'provides a logger' do
         | 
| 13 | 
            +
                  let(:logger) { 33 }
         | 
| 14 | 
            +
                  it 'sets custom logger' do
         | 
| 15 | 
            +
                    crawler_manager
         | 
| 16 | 
            +
                    expect(Grell.logger).to eq(33)
         | 
| 17 | 
            +
                    Grell.logger = Logger.new(nil)
         | 
| 18 | 
            +
                  end
         | 
| 19 | 
            +
                end
         | 
| 20 | 
            +
                context 'does not provides a logger' do
         | 
| 21 | 
            +
                  let(:logger) { nil }
         | 
| 22 | 
            +
                  it 'sets default logger' do
         | 
| 23 | 
            +
                    crawler_manager
         | 
| 24 | 
            +
                    expect(Grell.logger).to be_instance_of(Logger)
         | 
| 25 | 
            +
                    Grell.logger = Logger.new(nil)
         | 
| 26 | 
            +
                  end
         | 
| 27 | 
            +
                end
         | 
| 28 | 
            +
              end
         | 
| 29 | 
            +
             | 
| 30 | 
            +
              describe '#quit' do
         | 
| 31 | 
            +
                let(:driver) { double }
         | 
| 32 | 
            +
             | 
| 33 | 
            +
                it 'quits the poltergeist driver' do
         | 
| 34 | 
            +
                  expect(driver).to receive(:quit)
         | 
| 35 | 
            +
                  crawler_manager.quit
         | 
| 36 | 
            +
                end
         | 
| 37 | 
            +
              end
         | 
| 38 | 
            +
             | 
| 39 | 
            +
              describe '#restart' do
         | 
| 40 | 
            +
                let(:driver) { double }
         | 
| 41 | 
            +
             | 
| 42 | 
            +
                it 'restarts the poltergeist driver' do
         | 
| 43 | 
            +
                  expect(driver).to receive(:restart)
         | 
| 44 | 
            +
                  expect(logger).to receive(:info).with("GRELL is restarting")
         | 
| 45 | 
            +
                  expect(logger).to receive(:info).with("GRELL has restarted")
         | 
| 46 | 
            +
                  crawler_manager.restart
         | 
| 47 | 
            +
                end
         | 
| 48 | 
            +
              end
         | 
| 49 | 
            +
             | 
| 50 | 
            +
              describe '#check_periodic_restart' do
         | 
| 51 | 
            +
                let(:collection) { double }
         | 
| 52 | 
            +
                context 'Periodic restart not setup' do
         | 
| 53 | 
            +
                  it 'does not restart' do
         | 
| 54 | 
            +
                    allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
         | 
| 55 | 
            +
                    expect(crawler_manager).not_to receive(:restart)
         | 
| 56 | 
            +
                    crawler_manager.check_periodic_restart(collection)
         | 
| 57 | 
            +
                  end
         | 
| 58 | 
            +
                end
         | 
| 59 | 
            +
                context 'Periodic restart setup with default period' do
         | 
| 60 | 
            +
                  let(:do_something) { proc {} }
         | 
| 61 | 
            +
                  let(:crawler_manager) do
         | 
| 62 | 
            +
                    Grell::CrawlerManager.new(
         | 
| 63 | 
            +
                      logger: logger,
         | 
| 64 | 
            +
                      driver: driver,
         | 
| 65 | 
            +
                      on_periodic_restart: { do: do_something }
         | 
| 66 | 
            +
                    )
         | 
| 67 | 
            +
                  end
         | 
| 68 | 
            +
             | 
| 69 | 
            +
                  it 'does not restart after visiting 99 pages' do
         | 
| 70 | 
            +
                    allow(collection).to receive_message_chain(:visited_pages, :size) { 99 }
         | 
| 71 | 
            +
                    expect(crawler_manager).not_to receive(:restart)
         | 
| 72 | 
            +
                    crawler_manager.check_periodic_restart(collection)
         | 
| 73 | 
            +
                  end
         | 
| 74 | 
            +
                  it 'restarts after visiting 100 pages' do
         | 
| 75 | 
            +
                    allow(collection).to receive_message_chain(:visited_pages, :size) { 100 }
         | 
| 76 | 
            +
                    expect(crawler_manager).to receive(:restart)
         | 
| 77 | 
            +
                    crawler_manager.check_periodic_restart(collection)
         | 
| 78 | 
            +
                  end
         | 
| 79 | 
            +
                end
         | 
| 80 | 
            +
                context 'Periodic restart setup with custom period' do
         | 
| 81 | 
            +
                  let(:do_something) { proc {} }
         | 
| 82 | 
            +
                  let(:period) { 50 }
         | 
| 83 | 
            +
                  let(:crawler_manager) do
         | 
| 84 | 
            +
                    Grell::CrawlerManager.new(
         | 
| 85 | 
            +
                      logger: logger,
         | 
| 86 | 
            +
                      driver: driver,
         | 
| 87 | 
            +
                      on_periodic_restart: { do: do_something, each: period }
         | 
| 88 | 
            +
                    )
         | 
| 89 | 
            +
                  end
         | 
| 90 | 
            +
             | 
| 91 | 
            +
                  it 'does not restart after visiting a number different from custom period pages' do
         | 
| 92 | 
            +
                    allow(collection).to receive_message_chain(:visited_pages, :size) { period * 1.2 }
         | 
| 93 | 
            +
                    expect(crawler_manager).not_to receive(:restart)
         | 
| 94 | 
            +
                    crawler_manager.check_periodic_restart(collection)
         | 
| 95 | 
            +
                  end
         | 
| 96 | 
            +
                  it 'restarts after visiting custom period pages' do
         | 
| 97 | 
            +
                    allow(collection).to receive_message_chain(:visited_pages, :size) { period }
         | 
| 98 | 
            +
                    expect(crawler_manager).to receive(:restart)
         | 
| 99 | 
            +
                    crawler_manager.check_periodic_restart(collection)
         | 
| 100 | 
            +
                  end
         | 
| 101 | 
            +
                end
         | 
| 102 | 
            +
              end
         | 
| 103 | 
            +
             | 
| 104 | 
            +
              describe '#cleanup_all_processes' do
         | 
| 105 | 
            +
                let(:driver) { double }
         | 
| 106 | 
            +
             | 
| 107 | 
            +
                it 'kills all phantomjs processes' do
         | 
| 108 | 
            +
                  allow(crawler_manager).to receive(:running_phantomjs_pids).and_return([10])
         | 
| 109 | 
            +
                  expect(crawler_manager).to receive(:kill_process).with(10)
         | 
| 110 | 
            +
                  crawler_manager.cleanup_all_processes
         | 
| 111 | 
            +
                end
         | 
| 112 | 
            +
              end
         | 
| 113 | 
            +
            end
         | 
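The `#check_periodic_restart` examples above can be summarized as a single decision: restart only when a block was supplied, the period is positive, and the number of visited pages is a multiple of the period. The sketch below restates that decision as a standalone function. The name `restart_due?` and the way the `on_periodic_restart` hash is read are assumptions, since the constructor is not part of this section; the default of 100 mirrors `PAGES_TO_RESTART`.

```ruby
PAGES_TO_RESTART = 100 # default period, as in the diff

# Hypothetical helper: decides whether a restart is due for a given page count.
# on_periodic_restart is a hash such as { do: some_proc, each: 50 }.
def restart_due?(visited_count, on_periodic_restart = {})
  block = on_periodic_restart[:do]
  period = on_periodic_restart[:each] || PAGES_TO_RESTART

  return false unless block        # periodic restart not set up
  return false unless period > 0   # a non-positive period disables it
  (visited_count % period).zero?   # due only on exact multiples of the period
end

callback = proc {}
puts restart_due?(99, do: callback)
puts restart_due?(100, do: callback)
puts restart_due?(50, do: callback, each: 50)
```

Note that with a modulo check a count of 0 is also a multiple of the period, so callers that want to avoid a restart before any page is visited would need an extra guard.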
    
        data/spec/lib/crawler_spec.rb
    CHANGED
    
    | @@ -5,7 +5,18 @@ RSpec.describe Grell::Crawler do | |
| 5 5 | 
             
              let(:page) { Grell::Page.new(url, page_id, parent_page_id) }
         | 
| 6 6 | 
             
              let(:host) { 'http://www.example.com' }
         | 
| 7 7 | 
             
              let(:url) { 'http://www.example.com/test' }
         | 
| 8 | 
            -
              let(: | 
| 8 | 
            +
              let(:add_match_block) { nil }
         | 
| 9 | 
            +
              let(:blacklist) { /a^/ }
         | 
| 10 | 
            +
              let(:whitelist) { /.*/ }
         | 
| 11 | 
            +
              let(:crawler) do
         | 
| 12 | 
            +
                Grell::Crawler.new(
         | 
| 13 | 
            +
                  logger: Logger.new(nil),
         | 
| 14 | 
            +
                  driver_options: { external_driver: true },
         | 
| 15 | 
            +
                  evaluate_in_each_page: script,
         | 
| 16 | 
            +
                  add_match_block: add_match_block,
         | 
| 17 | 
            +
                  blacklist: blacklist,
         | 
| 18 | 
            +
                  whitelist: whitelist)
         | 
| 19 | 
            +
              end
         | 
| 9 20 | 
             
              let(:script) { nil }
         | 
| 10 21 | 
             
              let(:body) { 'body' }
         | 
| 11 22 | 
             
              let(:custom_add_match) do
         | 
| @@ -18,29 +29,6 @@ RSpec.describe Grell::Crawler do | |
| 18 29 | 
             
                proxy.stub(url).and_return(body: body, code: 200)
         | 
| 19 30 | 
             
              end
         | 
| 20 31 |  | 
| 21 | 
            -
              describe 'initialize' do
         | 
| 22 | 
            -
                it 'can provide your own logger' do
         | 
| 23 | 
            -
                  Grell::Crawler.new(external_driver: true, logger: 33)
         | 
| 24 | 
            -
                  expect(Grell.logger).to eq(33)
         | 
| 25 | 
            -
                  Grell.logger = Logger.new(nil)
         | 
| 26 | 
            -
                end
         | 
| 27 | 
            -
             | 
| 28 | 
            -
                it 'provides a stdout logger if nothing provided' do
         | 
| 29 | 
            -
                  crawler
         | 
| 30 | 
            -
                  expect(Grell.logger).to be_instance_of(Logger)
         | 
| 31 | 
            -
                end
         | 
| 32 | 
            -
              end
         | 
| 33 | 
            -
             | 
| 34 | 
            -
              describe '#quit' do
         | 
| 35 | 
            -
                let(:driver) { double }
         | 
| 36 | 
            -
                before { allow(Grell::CapybaraDriver).to receive(:setup).and_return(driver) }
         | 
| 37 | 
            -
             | 
| 38 | 
            -
                it 'quits the poltergeist driver' do
         | 
| 39 | 
            -
                  expect(driver).to receive(:quit)
         | 
| 40 | 
            -
                  crawler.quit
         | 
| 41 | 
            -
                end
         | 
| 42 | 
            -
              end
         | 
| 43 | 
            -
             | 
| 44 32 | 
             
              describe '#crawl' do
         | 
| 45 33 | 
             
                before do
         | 
| 46 34 | 
             
                  crawler.instance_variable_set('@collection', Grell::PageCollection.new(custom_add_match))
         | 
| @@ -127,15 +115,6 @@ RSpec.describe Grell::Crawler do | |
| 127 115 | 
             
                  expect(result[1].url).to eq(url_visited)
         | 
| 128 116 | 
             
                end
         | 
| 129 117 |  | 
| 130 | 
            -
                it 'can use a custom url add matcher block' do
         | 
| 131 | 
            -
                  expect(crawler).to_not receive(:default_add_match)
         | 
| 132 | 
            -
                  crawler.start_crawling(url, add_match_block: custom_add_match)
         | 
| 133 | 
            -
                end
         | 
| 134 | 
            -
             | 
| 135 | 
            -
                it 'uses a default url add matched if not provided' do
         | 
| 136 | 
            -
                  expect(crawler).to receive(:default_add_match).and_return(custom_add_match)
         | 
| 137 | 
            -
                  crawler.start_crawling(url)
         | 
| 138 | 
            -
                end
         | 
| 139 118 | 
             
              end
         | 
| 140 119 |  | 
| 141 120 | 
             
              shared_examples_for 'visits all available pages' do
         | 
| @@ -204,10 +183,7 @@ RSpec.describe Grell::Crawler do | |
| 204 183 | 
             
                end
         | 
| 205 184 |  | 
| 206 185 | 
             
                context 'using a single string' do
         | 
| 207 | 
            -
                   | 
| 208 | 
            -
                    crawler.whitelist('/trusmis.html')
         | 
| 209 | 
            -
                  end
         | 
| 210 | 
            -
             | 
| 186 | 
            +
                  let(:whitelist) { '/trusmis.html' }
         | 
| 211 187 | 
             
                  let(:visited_pages_count) { 2 } # my own page + trusmis
         | 
| 212 188 | 
             
                  let(:visited_pages) do
         | 
| 213 189 | 
             
                    ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
         | 
| @@ -217,10 +193,7 @@ RSpec.describe Grell::Crawler do | |
| 217 193 | 
             
                end
         | 
| 218 194 |  | 
| 219 195 | 
             
                context 'using an array of strings' do
         | 
| 220 | 
            -
                   | 
| 221 | 
            -
                    crawler.whitelist(['/trusmis.html', '/nothere', 'another.html'])
         | 
| 222 | 
            -
                  end
         | 
| 223 | 
            -
             | 
| 196 | 
            +
                  let(:whitelist) { ['/trusmis.html', '/nothere', 'another.html'] }
         | 
| 224 197 | 
             
                  let(:visited_pages_count) { 2 }
         | 
| 225 198 | 
             
                  let(:visited_pages) do
         | 
| 226 199 | 
             
                    ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
         | 
| @@ -230,10 +203,7 @@ RSpec.describe Grell::Crawler do | |
| 230 203 | 
             
                end
         | 
| 231 204 |  | 
| 232 205 | 
             
                context 'using a regexp' do
         | 
| 233 | 
            -
                   | 
| 234 | 
            -
                    crawler.whitelist(/\/trusmis\.html/)
         | 
| 235 | 
            -
                  end
         | 
| 236 | 
            -
             | 
| 206 | 
            +
                  let(:whitelist) { /\/trusmis\.html/ }
         | 
| 237 207 | 
             
                  let(:visited_pages_count) { 2 }
         | 
| 238 208 | 
             
                  let(:visited_pages) do
         | 
| 239 209 | 
             
                    ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
         | 
| @@ -243,10 +213,7 @@ RSpec.describe Grell::Crawler do | |
| 243 213 | 
             
                end
         | 
| 244 214 |  | 
| 245 215 | 
             
                context 'using an array of regexps' do
         | 
| 246 | 
            -
                   | 
| 247 | 
            -
                    crawler.whitelist([/\/trusmis\.html/])
         | 
| 248 | 
            -
                  end
         | 
| 249 | 
            -
             | 
| 216 | 
            +
                  let(:whitelist) { [/\/trusmis\.html/] }
         | 
| 250 217 | 
             
                  let(:visited_pages_count) { 2 }
         | 
| 251 218 | 
             
                  let(:visited_pages) do
         | 
| 252 219 | 
             
                    ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
         | 
| @@ -256,10 +223,7 @@ RSpec.describe Grell::Crawler do | |
| 256 223 | 
             
                end
         | 
| 257 224 |  | 
| 258 225 | 
             
                context 'using an empty array' do
         | 
| 259 | 
            -
                   | 
| 260 | 
            -
                    crawler.whitelist([])
         | 
| 261 | 
            -
                  end
         | 
| 262 | 
            -
             | 
| 226 | 
            +
                  let(:whitelist) { [] }
         | 
| 263 227 | 
             
                  let(:visited_pages_count) { 1 } # my own page only
         | 
| 264 228 | 
             
                  let(:visited_pages) do
         | 
| 265 229 | 
             
                    ['http://www.example.com/test']
         | 
| @@ -269,10 +233,7 @@ RSpec.describe Grell::Crawler do | |
| 269 233 | 
             
                end
         | 
| 270 234 |  | 
| 271 235 | 
             
                context 'adding all links to the whitelist' do
         | 
| 272 | 
            -
                   | 
| 273 | 
            -
                    crawler.whitelist(['/trusmis', '/help'])
         | 
| 274 | 
            -
                  end
         | 
| 275 | 
            -
             | 
| 236 | 
            +
                  let(:whitelist) { ['/trusmis', '/help'] }
         | 
| 276 237 | 
             
                  let(:visited_pages_count) { 3 } # all links
         | 
| 277 238 | 
             
                  let(:visited_pages) do
         | 
| 278 239 | 
             
                    ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
         | 
| @@ -298,9 +259,7 @@ RSpec.describe Grell::Crawler do | |
| 298 259 | 
             
                end
         | 
| 299 260 |  | 
| 300 261 | 
             
                context 'using a single string' do
         | 
| 301 | 
            -
                   | 
| 302 | 
            -
                    crawler.blacklist('/trusmis.html')
         | 
| 303 | 
            -
                  end
         | 
| 262 | 
            +
                  let(:blacklist) { '/trusmis.html' }
         | 
| 304 263 | 
             
                  let(:visited_pages_count) {2}
         | 
| 305 264 | 
             
                  let(:visited_pages) do
         | 
| 306 265 | 
             
                    ['http://www.example.com/test','http://www.example.com/help.html']
         | 
| @@ -310,9 +269,7 @@ RSpec.describe Grell::Crawler do | |
| 310 269 | 
             
                end
         | 
| 311 270 |  | 
| 312 271 | 
             
                context 'using an array of strings' do
         | 
| 313 | 
            -
                   | 
| 314 | 
            -
                    crawler.blacklist(['/trusmis.html', '/nothere', 'another.html'])
         | 
| 315 | 
            -
                  end
         | 
| 272 | 
            +
                  let(:blacklist) { ['/trusmis.html', '/nothere', 'another.html'] }
         | 
| 316 273 | 
             
                  let(:visited_pages_count) {2}
         | 
| 317 274 | 
             
                  let(:visited_pages) do
         | 
| 318 275 | 
             
                    ['http://www.example.com/test','http://www.example.com/help.html']
         | 
| @@ -322,9 +279,7 @@ RSpec.describe Grell::Crawler do | |
| 322 279 | 
             
                end
         | 
| 323 280 |  | 
| 324 281 | 
             
                context 'using a regexp' do
         | 
| 325 | 
            -
                   | 
| 326 | 
            -
                    crawler.blacklist(/\/trusmis\.html/)
         | 
| 327 | 
            -
                  end
         | 
| 282 | 
            +
                  let(:blacklist) { /\/trusmis\.html/ }
         | 
| 328 283 | 
             
                  let(:visited_pages_count) {2}
         | 
| 329 284 | 
             
                  let(:visited_pages) do
         | 
| 330 285 | 
             
                    ['http://www.example.com/test','http://www.example.com/help.html']
         | 
| @@ -334,9 +289,7 @@ RSpec.describe Grell::Crawler do | |
| 334 289 | 
             
                end
         | 
| 335 290 |  | 
| 336 291 | 
             
                context 'using an array of regexps' do
         | 
| 337 | 
            -
                   | 
| 338 | 
            -
                    crawler.blacklist([/\/trusmis\.html/])
         | 
| 339 | 
            -
                  end
         | 
| 292 | 
            +
                  let(:blacklist) { [/\/trusmis\.html/] }
         | 
| 340 293 | 
             
                  let(:visited_pages_count) {2}
         | 
| 341 294 | 
             
                  let(:visited_pages) do
         | 
| 342 295 | 
             
                    ['http://www.example.com/test','http://www.example.com/help.html']
         | 
| @@ -346,9 +299,7 @@ RSpec.describe Grell::Crawler do | |
| 346 299 | 
             
                end
         | 
| 347 300 |  | 
| 348 301 | 
             
                context 'using an empty array' do
         | 
| 349 | 
            -
                   | 
| 350 | 
            -
                    crawler.blacklist([])
         | 
| 351 | 
            -
                  end
         | 
| 302 | 
            +
                  let(:blacklist) { [] }
         | 
| 352 303 | 
             
                  let(:visited_pages_count) { 3 } # all links
         | 
| 353 304 | 
             
                  let(:visited_pages) do
         | 
| 354 305 | 
             
                    ['http://www.example.com/test','http://www.example.com/trusmis.html', 'http://www.example.com/help.html']
         | 
| @@ -357,10 +308,8 @@ RSpec.describe Grell::Crawler do | |
| 357 308 | 
             
                  it_behaves_like 'visits all available pages'
         | 
| 358 309 | 
             
                end
         | 
| 359 310 |  | 
| 360 | 
            -
                context 'adding all links to the  | 
| 361 | 
            -
                   | 
| 362 | 
            -
                    crawler.blacklist(['/trusmis', '/help'])
         | 
| 363 | 
            -
                  end
         | 
| 311 | 
            +
                context 'adding all links to the blacklist' do
         | 
| 312 | 
            +
                  let(:blacklist) { ['/trusmis', '/help'] }
         | 
| 364 313 | 
             
                  let(:visited_pages_count) { 1 }
         | 
| 365 314 | 
             
                  let(:visited_pages) do
         | 
| 366 315 | 
             
                    ['http://www.example.com/test']
         | 
| @@ -386,11 +335,8 @@ RSpec.describe Grell::Crawler do | |
| 386 335 | 
             
                end
         | 
| 387 336 |  | 
| 388 337 | 
             
                context 'we blacklist the only whitelisted page' do
         | 
| 389 | 
            -
                   | 
| 390 | 
            -
             | 
| 391 | 
            -
                    crawler.blacklist('/trusmis.html')
         | 
| 392 | 
            -
                  end
         | 
| 393 | 
            -
             | 
| 338 | 
            +
                  let(:whitelist) { '/trusmis.html' }
         | 
| 339 | 
            +
                  let(:blacklist) { '/trusmis.html' }
         | 
| 394 340 | 
             
                  let(:visited_pages_count) { 1 }
         | 
| 395 341 | 
             
                  let(:visited_pages) do
         | 
| 396 342 | 
             
                    ['http://www.example.com/test']
         | 
| @@ -400,11 +346,8 @@ RSpec.describe Grell::Crawler do | |
| 400 346 | 
             
                end
         | 
| 401 347 |  | 
| 402 348 | 
             
                context 'we blacklist none of the whitelisted pages' do
         | 
| 403 | 
            -
                   | 
| 404 | 
            -
             | 
| 405 | 
            -
                    crawler.blacklist('/raistlin.html')
         | 
| 406 | 
            -
                  end
         | 
| 407 | 
            -
             | 
| 349 | 
            +
                  let(:whitelist) { '/trusmis.html' }
         | 
| 350 | 
            +
                  let(:blacklist) { '/raistlin.html' }
         | 
| 408 351 | 
             
                  let(:visited_pages_count) { 2 }
         | 
| 409 352 | 
             
                  let(:visited_pages) do
         | 
| 410 353 | 
             
                    ['http://www.example.com/test', 'http://www.example.com/trusmis.html']
         | 
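The whitelist and blacklist contexts above pass a string, an array of strings, a regexp, an array of regexps, or an empty array. One way to handle all of these shapes uniformly is to normalize them with `Regexp.union`. This is only a sketch under that assumption and is not taken from the gem's implementation; the name `allowed_links` is hypothetical, and the starting page, which the specs always count as visited, is outside its scope.

```ruby
# Hypothetical filter: keeps links that match the whitelist and not the blacklist.
# Defaults mirror the spec: /.*/ allows everything, /a^/ can never match.
def allowed_links(links, whitelist: /.*/, blacklist: /a^/)
  # Array() wraps a single string or regexp; Regexp.union escapes strings
  # and returns a never-matching pattern for an empty array.
  allow = Regexp.union(Array(whitelist))
  deny = Regexp.union(Array(blacklist))

  links.select { |link| link.match?(allow) && !link.match?(deny) }
end

links = ['/trusmis.html', '/help.html']
p allowed_links(links, whitelist: '/trusmis.html')
p allowed_links(links, blacklist: [/\/trusmis\.html/])
p allowed_links(links, whitelist: '/trusmis.html', blacklist: '/trusmis.html')
```

Under this normalization an empty whitelist admits nothing and an empty blacklist rejects nothing, which is consistent with the visited-page counts the specs expect for the empty-array contexts.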
    
        metadata
    CHANGED
    
    | @@ -1,14 +1,14 @@ | |
| 1 1 | 
             
            --- !ruby/object:Gem::Specification
         | 
| 2 2 | 
             
            name: grell
         | 
| 3 3 | 
             
            version: !ruby/object:Gem::Version
         | 
| 4 | 
            -
              version:  | 
| 4 | 
            +
              version: 2.0.0
         | 
| 5 5 | 
             
            platform: ruby
         | 
| 6 6 | 
             
            authors:
         | 
| 7 7 | 
             
            - Jordi Polo Carres
         | 
| 8 8 | 
             
            autorequire: 
         | 
| 9 9 | 
             
            bindir: bin
         | 
| 10 10 | 
             
            cert_chain: []
         | 
| 11 | 
            -
            date: 2016- | 
| 11 | 
            +
            date: 2016-11-16 00:00:00.000000000 Z
         | 
| 12 12 | 
             
            dependencies:
         | 
| 13 13 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 14 14 | 
             
              name: capybara
         | 
| @@ -16,28 +16,28 @@ dependencies: | |
| 16 16 | 
             
                requirements:
         | 
| 17 17 | 
             
                - - "~>"
         | 
| 18 18 | 
             
                  - !ruby/object:Gem::Version
         | 
| 19 | 
            -
                    version: '2. | 
| 19 | 
            +
                    version: '2.10'
         | 
| 20 20 | 
             
              type: :runtime
         | 
| 21 21 | 
             
              prerelease: false
         | 
| 22 22 | 
             
              version_requirements: !ruby/object:Gem::Requirement
         | 
| 23 23 | 
             
                requirements:
         | 
| 24 24 | 
             
                - - "~>"
         | 
| 25 25 | 
             
                  - !ruby/object:Gem::Version
         | 
| 26 | 
            -
                    version: '2. | 
| 26 | 
            +
                    version: '2.10'
         | 
| 27 27 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 28 28 | 
             
              name: poltergeist
         | 
| 29 29 | 
             
              requirement: !ruby/object:Gem::Requirement
         | 
| 30 30 | 
             
                requirements:
         | 
| 31 31 | 
             
                - - "~>"
         | 
| 32 32 | 
             
                  - !ruby/object:Gem::Version
         | 
| 33 | 
            -
                    version: '1. | 
| 33 | 
            +
                    version: '1.11'
         | 
| 34 34 | 
             
              type: :runtime
         | 
| 35 35 | 
             
              prerelease: false
         | 
| 36 36 | 
             
              version_requirements: !ruby/object:Gem::Requirement
         | 
| 37 37 | 
             
                requirements:
         | 
| 38 38 | 
             
                - - "~>"
         | 
| 39 39 | 
             
                  - !ruby/object:Gem::Version
         | 
| 40 | 
            -
                    version: '1. | 
| 40 | 
            +
                    version: '1.11'
         | 
| 41 41 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 42 42 | 
             
              name: bundler
         | 
| 43 43 | 
             
              requirement: !ruby/object:Gem::Requirement
         | 
| @@ -114,28 +114,28 @@ dependencies: | |
| 114 114 | 
             
                requirements:
         | 
| 115 115 | 
             
                - - "~>"
         | 
| 116 116 | 
             
                  - !ruby/object:Gem::Version
         | 
| 117 | 
            -
                    version: '3. | 
| 117 | 
            +
                    version: '3.5'
         | 
| 118 118 | 
             
              type: :development
         | 
| 119 119 | 
             
              prerelease: false
         | 
| 120 120 | 
             
              version_requirements: !ruby/object:Gem::Requirement
         | 
| 121 121 | 
             
                requirements:
         | 
| 122 122 | 
             
                - - "~>"
         | 
| 123 123 | 
             
                  - !ruby/object:Gem::Version
         | 
| 124 | 
            -
                    version: '3. | 
| 124 | 
            +
                    version: '3.5'
         | 
| 125 125 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 126 126 | 
             
              name: puffing-billy
         | 
| 127 127 | 
             
              requirement: !ruby/object:Gem::Requirement
         | 
| 128 128 | 
             
                requirements:
         | 
| 129 129 | 
             
                - - "~>"
         | 
| 130 130 | 
             
                  - !ruby/object:Gem::Version
         | 
| 131 | 
            -
                    version: '0. | 
| 131 | 
            +
                    version: '0.9'
         | 
| 132 132 | 
             
              type: :development
         | 
| 133 133 | 
             
              prerelease: false
         | 
| 134 134 | 
             
              version_requirements: !ruby/object:Gem::Requirement
         | 
| 135 135 | 
             
                requirements:
         | 
| 136 136 | 
             
                - - "~>"
         | 
| 137 137 | 
             
                  - !ruby/object:Gem::Version
         | 
| 138 | 
            -
                    version: '0. | 
| 138 | 
            +
                    version: '0.9'
         | 
| 139 139 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 140 140 | 
             
              name: timecop
         | 
| 141 141 | 
             
              requirement: !ruby/object:Gem::Requirement
         | 
| @@ -150,20 +150,6 @@ dependencies: | |
| 150 150 | 
             
                - - "~>"
         | 
| 151 151 | 
             
                  - !ruby/object:Gem::Version
         | 
| 152 152 | 
             
                    version: '0.8'
         | 
| 153 | 
            -
            - !ruby/object:Gem::Dependency
         | 
| 154 | 
            -
              name: capybara-webkit
         | 
| 155 | 
            -
              requirement: !ruby/object:Gem::Requirement
         | 
| 156 | 
            -
                requirements:
         | 
| 157 | 
            -
                - - "~>"
         | 
| 158 | 
            -
                  - !ruby/object:Gem::Version
         | 
| 159 | 
            -
                    version: 1.11.1
         | 
| 160 | 
            -
              type: :development
         | 
| 161 | 
            -
              prerelease: false
         | 
| 162 | 
            -
              version_requirements: !ruby/object:Gem::Requirement
         | 
| 163 | 
            -
                requirements:
         | 
| 164 | 
            -
                - - "~>"
         | 
| 165 | 
            -
                  - !ruby/object:Gem::Version
         | 
| 166 | 
            -
                    version: 1.11.1
         | 
| 167 153 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 168 154 | 
             
              name: selenium-webdriver
         | 
| 169 155 | 
             
              requirement: !ruby/object:Gem::Requirement
         | 
| @@ -196,6 +182,7 @@ files: | |
| 196 182 | 
             
            - lib/grell.rb
         | 
| 197 183 | 
             
            - lib/grell/capybara_driver.rb
         | 
| 198 184 | 
             
            - lib/grell/crawler.rb
         | 
| 185 | 
            +
            - lib/grell/crawler_manager.rb
         | 
| 199 186 | 
             
            - lib/grell/grell_logger.rb
         | 
| 200 187 | 
             
            - lib/grell/page.rb
         | 
| 201 188 | 
             
            - lib/grell/page_collection.rb
         | 
| @@ -203,6 +190,7 @@ files: | |
| 203 190 | 
             
            - lib/grell/reader.rb
         | 
| 204 191 | 
             
            - lib/grell/version.rb
         | 
| 205 192 | 
             
            - spec/lib/capybara_driver_spec.rb
         | 
| 193 | 
            +
            - spec/lib/crawler_manager_spec.rb
         | 
| 206 194 | 
             
            - spec/lib/crawler_spec.rb
         | 
| 207 195 | 
             
            - spec/lib/page_collection_spec.rb
         | 
| 208 196 | 
             
            - spec/lib/page_spec.rb
         | 
| @@ -220,7 +208,7 @@ required_ruby_version: !ruby/object:Gem::Requirement | |
| 220 208 | 
             
              requirements:
         | 
| 221 209 | 
             
              - - ">="
         | 
| 222 210 | 
             
                - !ruby/object:Gem::Version
         | 
| 223 | 
            -
                  version: 1. | 
| 211 | 
            +
                  version: 2.1.8
         | 
| 224 212 | 
             
            required_rubygems_version: !ruby/object:Gem::Requirement
         | 
| 225 213 | 
             
              requirements:
         | 
| 226 214 | 
             
              - - ">="
         | 
| @@ -234,6 +222,7 @@ specification_version: 4 | |
| 234 222 | 
             
            summary: Ruby web crawler
         | 
| 235 223 | 
             
            test_files:
         | 
| 236 224 | 
             
            - spec/lib/capybara_driver_spec.rb
         | 
| 225 | 
            +
            - spec/lib/crawler_manager_spec.rb
         | 
| 237 226 | 
             
            - spec/lib/crawler_spec.rb
         | 
| 238 227 | 
             
            - spec/lib/page_collection_spec.rb
         | 
| 239 228 | 
             
            - spec/lib/page_spec.rb
         |