RubyGems - wgit - Versions diffs - 0.0.17 → 0.0.18 - Mend

wgit 0.0.17 → 0.0.18

Files changed (20)

checksums.yaml +4 -4
data/CHANGELOG.md +61 -0
data/LICENSE.txt +21 -0
data/README.md +16 -7
data/TODO.txt +34 -0
data/lib/wgit.rb +3 -1
data/lib/wgit/assertable.rb +35 -29
data/lib/wgit/core_ext.rb +5 -3
data/lib/wgit/crawler.rb +96 -58
data/lib/wgit/database/connection_details.rb +4 -2
data/lib/wgit/database/database.rb +84 -46
data/lib/wgit/database/model.rb +12 -10
data/lib/wgit/document.rb +100 -72
data/lib/wgit/document_extensions.rb +11 -9
data/lib/wgit/indexer.rb +34 -24
data/lib/wgit/logger.rb +4 -2
data/lib/wgit/url.rb +94 -59
data/lib/wgit/utils.rb +13 -11
data/lib/wgit/version.rb +3 -1
metadata +41 -38

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a805551a72869241a425dc4d0f88ed6f740c75b95db6e3acf2564393b79708d9
-  data.tar.gz: 5425e8bb21c7822b5ac93afbe6c7d90777a649808702c3e6012c0ec19cbe1dfb
+  metadata.gz: 26e6a29fbf72b0ecbbc487c8aba9ec243a260b4761805c6c7923f2af82fa94f5
+  data.tar.gz: 9e15ad14991418fc3b4b2c0dafacac617b32197e825ad72887d91182c8ddf652
 SHA512:
-  metadata.gz: fd5dcc1b4e9706326810b3fdbdf1df285ec1a98788aac9521fbcd52ad4132c039ab2a2b2d2e574af115845d1968c0eb1bc8d487dbbec4ee9a3427597bb99b09f
-  data.tar.gz: 3e947536f694ea74460f919cab1ec8e42274eb6bd0f856ac900c6b2e4f31da22ddc920afbaaa3a4b80abe3d9729fdcf00964e794363f3d76e1f35dc33a05224a
+  metadata.gz: 4b17b8467abf13b186e88fb63fe8630163612bc685d7d521122fdc4c693e7d9229c59888afa1191189b3838317fa29e028c90757b880177c1e7a8f81a0a38047
+  data.tar.gz: 6fb7bb518ca3b9e520e1edbf25b4c265018686b5c61e134d623c38efa1bdf5073affb5205e47aee3a32a4502b56205080ded00a79bd6f138cf9178b019a2b32d

data/CHANGELOG.md ADDED

@@ -0,0 +1,61 @@
+# Wgit Change Log
+## v0.0.0 (TEMPLATE - DO NOT EDIT)
+### Added
+- ...
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+## v0.0.18
+### Added
+- `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
+### Changed/Removed
+- Updated the documentation by changing some `private` methods to `protected`. These methods are now documented (on rubydocs) as a result.
+### Fixed
+- ...
+---
+## v0.0.17
+### Added
+- Support for `<base>` element in `Wgit::Document`'s.
+- New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
+### Changed/Removed
+- Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
+- Breaking changes (potentially): `Wgit::Url`'s are now replaced with the redirected to Url during a crawl.
+- Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
+- Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
+- Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
+### Fixed
+- Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links including those used during `Wgit::Crawler#crawl_site`.
+- Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
+---
+## v0.0.16
+### Added
+- Added `Url.parse` class method as alias for `Url.new`.
+### Changed/Removed
+- Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
+### Fixed
+- Several URI related bugs in `Wgit::Url` affecting crawls.
+---
+## v0.0.15
+### Added
+- Support for IRI's (non ASCII based URL's).
+### Changed/Removed
+- Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.
+### Fixed
+- Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
+---
+## v0.0.14
+### Added
+- `Indexer#index_this_page` method.
+### Changed/Removed
+- Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
+### Fixed
+- Found and fixed a bug in `Document#new`.
+---

data/LICENSE.txt ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2019 Michael Telford
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

data/README.md CHANGED

@@ -1,5 +1,13 @@
 # Wgit
+[![Gem Version](https://badge.fury.io/rb/wgit.svg)](https://rubygems.org/gems/wgit)
+[![Gem Downloads](https://img.shields.io/gem/dt/wgit)](https://rubygems.org/gems/wgit)
+[![Build Status](https://travis-ci.org/michaeltelford/wgit.svg?branch=master)](https://travis-ci.org/michaeltelford/wgit)
+[![Yardoc Coverage](https://img.shields.io/badge/yard%20docs-100%25-brightgreen)](https://www.rubydoc.info/gems/wgit)
+[![Codacy Badge](https://api.codacy.com/project/badge/Grade/d5a0de62e78b460997cb8ce1127cea9e)](https://www.codacy.com/app/michaeltelford/wgit?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=michaeltelford/wgit&amp;utm_campaign=Badge_Grade)
+---
 Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an easy to use API for programmatic web scraping, indexing and searching.
 Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves and serialises their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it has uses in many different application types.
@@ -58,11 +66,12 @@ doc.stats # => {
 # doc responds to the following methods:
 Wgit::Document.instance_methods(false).sort # => [
 #   :==, :[], :author, :base, :base_url, :css, :date_crawled, :doc, :empty?,
-#   :external_links, :external_urls, :html, :internal_absolute_links,
-#   :internal_full_links, :internal_links, :keywords, :links,
-#   :relative_absolute_links, :relative_absolute_urls, :relative_full_links,
-#   :relative_full_urls, :relative_links, :relative_urls, :score, :search,
-#   :search!, :size, :stats, :text, :title, :to_h, :to_json, :url, :xpath
+#   :external_links, :external_urls, :find_in_html, :find_in_object, :html,
+#   :init_nokogiri, :internal_absolute_links, :internal_full_links,
+#   :internal_links, :keywords, :links, :relative_absolute_links,
+#   :relative_absolute_urls, :relative_full_links, :relative_full_urls,
+#   :relative_links, :relative_urls, :score, :search, :search!, :size, :stats,
+#   :text, :title, :to_h, :to_json, :url, :xpath
 # ]
 results = doc.search "corruption"
@@ -72,7 +81,7 @@ results.first # => "ial materials involving war, spying and corruption.
 ## Documentation
-To see what's possible with the Wgit gem see the [docs](https://www.rubydoc.info/gems/wgit) or the [Practical Examples](#Practical-Examples) section below.
+100% of Wgit's code is documented using [YARD](https://yardoc.org/), deployed to [Rubydocs](https://www.rubydoc.info/gems/wgit). This greatly benefits developers in using Wgit in their own programs. Another good source of information (as to how the library behaves) are the tests. Also, see the [Practical Examples](#Practical-Examples) section below for real working examples of Wgit in action.
 ## Practical Examples
@@ -347,6 +356,6 @@ For a full list of available Rake tasks, run `bundle exec rake help`. The most c
 After checking out the repo, run `bundle exec rake setup` to install the dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `bundle exec rake console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
-To generate code documentation run `bundle exec yard doc`. To browse the generated documentation run `bundle exec yard server -r`.
+To generate code documentation run `bundle exec yardoc`. To browse the generated documentation run `bundle exec yard server -r`.
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
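
The three new entries in the `Wgit::Document.instance_methods(false)` listing in the README diff above (`:find_in_html`, `:find_in_object`, `:init_nokogiri`) are consistent with the changelog note that some `private` methods became `protected`: Ruby's `Module#instance_methods` reports public and protected methods but not private ones. The sketch below illustrates that Ruby behaviour with an anonymous class; the method names `visible`, `helper` and `hidden` are made up for the example and are not part of Wgit.

```ruby
# Anonymous class used purely for illustration; none of these names are Wgit's.
klass = Class.new do
  def visible; end

  protected

  def helper; end

  private

  def hidden; end
end

# Public and protected methods are listed; private ones are not.
listed = klass.instance_methods(false).sort

# Private methods have to be asked for separately.
unlisted = klass.private_instance_methods(false)

puts listed.inspect
puts unlisted.inspect
```

Moving a method from `private` to `protected` therefore changes what reflection (and documentation tooling that follows the same visibility rules) shows, without making the method callable from outside the class hierarchy.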

data/TODO.txt ADDED

@@ -0,0 +1,34 @@
+Primary
+-------
+- Update Database#search & Document#search to have optional case sensitivity.
+- Have the ability to crawl sub sections of a site only e.g. https://www.honda.co.uk/motorcycles.html as the base url and crawl any links containing this as a prefix. For example, https://www.honda.co.uk/cars.html would not be crawled but https://www.honda.co.uk/motorcycles/africa-twin.html would be.
+- Create an executable based on the ./bin/console shipped as `wpry` or `wgit`.
+Secondary
+---------
+- Think about how we handle invalid Url's on crawled documents. Setup tests and implement logic for this scenario.
+- Check if Document::TEXT_ELEMENTS is expansive enough.
+Refactoring
+-----------
+- Plan to open up the required_ruby_version range, say from 2.5 upwards e.g. `~> 2.5`. Will need CI testing for the different versions of ruby as we move onto support newer versions.
+- Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
+- After the above refactor, move onto the rest of the code base.
+- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases? Also, do the Url#to_* make sense?
+- Replace method params with named parameters where applicable.
+- Possibly use refine instead of core-ext?
+- Think about potentially using DB._update's update_many func.
+Gem Publishing Checklist
+------------------------
+- Ensure a clean branch of master and create a 'release' branch.
+- Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
+- Increment the version number (in version.rb) and update the CHANGELOG.md.
+- Run 'bundle install' to update deps.
+- Run 'bundle exec rake compile' and ensure acceptable warnings.
+- Run 'bundle exec rake test' and ensure all tests are passing.
+- Run `bundle exec rake install` to build and install the gem locally, then test it manually from outside this repo.
+- Run `bundle exec yardoc` to update documentation - should be 100% coverage.
+- Commit, merge to master & push any changes made from the above steps.
+- Run `bundle exec rake RELEASE[origin]` to tag, build and push everything to github.com and rubygems.org.
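
The "crawl sub sections of a site" item in the Primary list above describes a prefix rule without an implementation. A minimal sketch of one possible reading follows, using plain Strings rather than `Wgit::Url`. The helper name `under_base?` and the choice to strip a trailing `.htm`/`.html` from the base before comparing are assumptions made for illustration; they are not taken from Wgit.

```ruby
# Hypothetical helper, not part of Wgit: true if link is the base page itself
# or sits beneath the base once its .htm/.html extension is dropped.
def under_base?(base, link)
  prefix = base.sub(/\.html?\z/, '') # Drop a trailing .htm/.html, if any.
  link == base || link.start_with?("#{prefix}/")
end

base = 'https://www.honda.co.uk/motorcycles.html'

cars_allowed  = under_base?(base, 'https://www.honda.co.uk/cars.html')
twin_allowed  = under_base?(base, 'https://www.honda.co.uk/motorcycles/africa-twin.html')

puts cars_allowed
puts twin_allowed
```

Requiring a `/` after the prefix keeps sibling pages such as a hypothetical `motorcycles-old.html` out of the crawl; whether that is the intended behaviour is left open by the TODO entry.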

data/lib/wgit.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require_relative 'wgit/version'
 require_relative 'wgit/logger'
 require_relative 'wgit/assertable'
@@ -10,4 +12,4 @@ require_relative 'wgit/database/connection_details'
 require_relative 'wgit/database/model'
 require_relative 'wgit/database/database'
 require_relative 'wgit/indexer'
-#require_relative 'wgit/core_ext' - Must be explicitly required.
+# require_relative 'wgit/core_ext' - Must be explicitly required.

data/lib/wgit/assertable.rb CHANGED

@@ -1,17 +1,18 @@
-module Wgit
+# frozen_string_literal: true
+module Wgit
   # Module containing assert methods including type checking which can be used
   # for asserting the integrity of method definitions etc.
   module Assertable
     # Default type fail message.
-    DEFAULT_TYPE_FAIL_MSG = "Expected: %s, Actual: %s".freeze
+    DEFAULT_TYPE_FAIL_MSG = 'Expected: %s, Actual: %s'
     # Wrong method message.
-    WRONG_METHOD_MSG = "arr must be Enumerable, use a different method".freeze
+    WRONG_METHOD_MSG = 'arr must be Enumerable, use a different method'
     # Default duck fail message.
-    DEFAULT_DUCK_FAIL_MSG = "%s doesn't respond_to? %s".freeze
+    DEFAULT_DUCK_FAIL_MSG = "%s doesn't respond_to? %s"
     # Default required keys message.
-    DEFAULT_REQUIRED_KEYS_MSG = "Some or all of the required keys are not present: %s".freeze
+    DEFAULT_REQUIRED_KEYS_MSG = 'Some or all of the required keys are not present: %s'
     # Tests if the obj is of a given type.
     #
     # @param obj [Object] The Object to test.
@@ -20,17 +21,18 @@ module Wgit
     # @param msg [String] The raised RuntimeError message, if provided.
     # @return [Object] The given obj on successful assertion.
     def assert_types(obj, type_or_types, msg = nil)
-      msg ||= DEFAULT_TYPE_FAIL_MSG % [type_or_types, obj.class]
-      if type_or_types.respond_to?(:any?)
-        match = type_or_types.any? { |type| obj.instance_of?(type) }
-      else
-        match = obj.instance_of?(type_or_types)
-      end
+      msg ||= format(DEFAULT_TYPE_FAIL_MSG, type_or_types, obj.class)
+      match = if type_or_types.respond_to?(:any?)
+                type_or_types.any? { |type| obj.instance_of?(type) }
+              else
+                obj.instance_of?(type_or_types)
+              end
       raise msg unless match
       obj
     end
-    # Each object within arr must match one of the types listed in
+    # Each object within arr must match one of the types listed in
     # type_or_types or an exception is raised using msg, if provided.
     #
     # @param arr [Enumerable#each] Enumerable of objects to type check.
@@ -39,12 +41,13 @@ module Wgit
     # @return [Object] The given arr on successful assertion.
     def assert_arr_types(arr, type_or_types, msg = nil)
       raise WRONG_METHOD_MSG unless arr.respond_to?(:each)
       arr.each do |obj|
         assert_types(obj, type_or_types, msg)
       end
     end
-    # The obj_or_objs must respond_to? all of the given methods or an
+    # The obj_or_objs must respond_to? all of the given methods or an
     # Exception is raised using msg, if provided.
     #
     # @param obj_or_objs [Object, Enumerable#each] The objects to duck check.
@@ -70,29 +73,32 @@ module Wgit
     # @param msg [String] The raised KeyError message, if provided.
     # @return [Hash] The given hash on successful assertion.
     def assert_required_keys(hash, keys, msg = nil)
-      msg ||= DEFAULT_REQUIRED_KEYS_MSG % [keys.join(', ')]
+      msg ||= format(DEFAULT_REQUIRED_KEYS_MSG, keys.join(', '))
       all_present = keys.all? { |key| hash.keys.include? key }
-      raise KeyError.new(msg) unless all_present
+      raise KeyError, msg unless all_present
       hash
     end
-  private
+    private
     # obj must respond_to? all methods or an exception is raised.
     def _assert_respond_to(obj, methods, msg = nil)
-      raise "methods must respond_to? :all?" unless methods.respond_to?(:all?)
-      msg ||= DEFAULT_DUCK_FAIL_MSG % ["#{obj.class} (#{obj})", methods]
+      raise 'methods must respond_to? :all?' unless methods.respond_to?(:all?)
+      msg ||= format(DEFAULT_DUCK_FAIL_MSG, "#{obj.class} (#{obj})", methods)
       match = methods.all? { |method| obj.respond_to?(method) }
       raise msg unless match
       obj
     end
-    alias :assert_type :assert_types
-    alias :type :assert_types
-    alias :types :assert_types
-    alias :assert_arr_type :assert_arr_types
-    alias :arr_type :assert_arr_types
-    alias :arr_types :assert_arr_types
-    alias :respond_to :assert_respond_to
+    alias assert_type assert_types
+    alias type assert_types
+    alias types assert_types
+    alias assert_arr_type assert_arr_types
+    alias arr_type assert_arr_types
+    alias arr_types assert_arr_types
+    alias respond_to assert_respond_to
   end
 end
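
The `assert_types` refactor in the diff above replaces `String#%` with `Kernel#format` and assigns the result of the `if` expression directly to `match`. The sketch below re-creates that method outside the `Wgit::Assertable` module so it can be run on its own; inlining the message template instead of using `DEFAULT_TYPE_FAIL_MSG` is a simplification for the example.

```ruby
# Stand-alone re-creation of the refactored assert_types from the diff.
# The message template is inlined here rather than read from a constant.
def assert_types(obj, type_or_types, msg = nil)
  msg ||= format('Expected: %s, Actual: %s', type_or_types, obj.class)
  match = if type_or_types.respond_to?(:any?)
            type_or_types.any? { |type| obj.instance_of?(type) }
          else
            obj.instance_of?(type_or_types)
          end
  raise msg unless match

  obj
end

# A single type or a list of types is accepted; the object is returned.
single = assert_types('a', String)
listed = assert_types(:a, [String, Symbol])

# A failed assertion raises a RuntimeError carrying the formatted message.
error_message = begin
  assert_types(1, [String, Symbol])
  nil
rescue RuntimeError => e
  e.message
end

puts error_message
```

Because the check uses `instance_of?` rather than `is_a?`, subclasses do not satisfy the assertion: `assert_types(1, Numeric)` raises even though `1` is a `Numeric`.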

data/lib/wgit/core_ext.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 # Script which extends Ruby's core functionality when parsed.
 # Needs to be required separately using `require 'wgit/core_ext'`.
@@ -15,7 +17,7 @@ end
 # Extend the standard Enumerable functionality.
 module Enumerable
-  # Converts each String instance into a Wgit::Url object and returns the new
+  # Converts each String instance into a Wgit::Url object and returns the new
   # Array.
   #
   # @return [Array<Wgit::Url>] The converted URL's.
@@ -24,8 +26,8 @@ module Enumerable
       process_url_element(element)
     end
   end
-  # Converts each String instance into a Wgit::Url object and returns the
+  # Converts each String instance into a Wgit::Url object and returns the
   # updated array. Modifies the receiver.
   #
   # @return [Array<Wgit::Url>] Self containing the converted URL's.

data/lib/wgit/crawler.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require_relative 'url'
 require_relative 'document'
 require_relative 'utils'
@@ -5,7 +7,6 @@ require_relative 'assertable'
 require 'net/http' # Requires 'uri'.
 module Wgit
   # The Crawler class provides a means of crawling web based Wgit::Url's, turning
   # their HTML into Wgit::Document instances.
   class Crawler
@@ -61,7 +62,7 @@ module Wgit
     def [](*urls)
       # If urls is nil then add_url (when called later) will set @urls = []
       # so we do nothing here.
-      if not urls.nil?
+      unless urls.nil?
         # Due to *urls you can end up with [[url1,url2,url3]] etc. where the
         # outer array is bogus so we use the inner one only.
         if  urls.is_a?(Enumerable) &&
@@ -97,11 +98,12 @@ module Wgit
     #   by Crawler#docs after this method returns.
     # @return [Wgit::Document] The last Document crawled.
     def crawl_urls(urls = @urls, &block)
-      raise "No urls to crawl" unless urls
+      raise 'No urls to crawl' unless urls
       @docs = []
       doc = nil
       Wgit::Utils.each(urls) { |url| doc = handle_crawl_block(url, &block) }
-      doc ? doc : @docs.last
+      doc || @docs.last
     end
     # Crawl the url returning the response Wgit::Document or nil if an error
@@ -121,12 +123,12 @@ module Wgit
     # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
     #   crawl was unsuccessful.
     def crawl_url(
-        url = @urls.first,
-        follow_external_redirects: true,
-        host: nil
-      )
+      url = @urls.first,
+      follow_external_redirects: true,
+      host: nil
+    )
       assert_type(url, Wgit::Url)
-      if !follow_external_redirects and host.nil?
+      if !follow_external_redirects && host.nil?
         raise 'host cannot be nil if follow_external_redirects is false'
       end
@@ -200,23 +202,24 @@ module Wgit
       externals.uniq
     end
-  private
-    # Add the document to the @docs array for later processing or let the block
-    # process it here and now.
-    def handle_crawl_block(url, &block)
-      if block_given?
-        crawl_url(url, &block)
-      else
-        @docs << crawl_url(url)
-        nil
-      end
-    end
+    protected
-    # The fetch method performs a HTTP GET to obtain the HTML document.
-    # Invalid urls or any HTTP response that doesn't return a HTML body will be
-    # ignored and nil will be returned. Otherwise, the HTML is returned.
-    # External redirects are followed by default but can be disabled.
+    # This method calls Wgit::Crawler#resolve to obtain the page HTML, handling
+    # any errors that arise and setting the @last_response. Errors or any
+    # HTTP response that doesn't return a HTML body will be ignored and nil
+    # will be returned; otherwise, the HTML String is returned.
+    #
+    # @param url [Wgit::Url] The URL to fetch the HTML for.
+    # @param follow_external_redirects [Boolean] Whether or not to follow
+    #   an external redirect. False will return nil for such a crawl. If false,
+    #   you must also provide a `host:` parameter.
+    # @param host [Wgit::Url, String] Specify the host by which
+    #   an absolute redirect is determined to be internal or not. Must be
+    #   absolute and contain a protocol prefix. For example, a `host:` of
+    #   'http://www.example.com' will only allow redirects for Urls with a
+    #   `to_host` value of 'www.example.com'.
+    # @return [String, nil] The crawled HTML or nil if the crawl was
+    #   unsuccessful.
     def fetch(url, follow_external_redirects: true, host: nil)
       response = resolve(
         url,
@@ -225,74 +228,109 @@ module Wgit
       )
       @last_response = response
       response.body.empty? ? nil : response.body
-    rescue Exception => ex
+    rescue StandardError => e
       Wgit.logger.debug(
-        "Wgit::Crawler#fetch('#{url}') exception: #{ex.message}"
+        "Wgit::Crawler#fetch('#{url}') exception: #{e.message}"
       )
       @last_response = nil
       nil
     end
-    # The resolve method performs a HTTP GET to obtain the HTML document.
-    # A certain amount of redirects will be followed by default before raising
-    # an exception. Redirects can be disabled by setting `redirect_limit: 0`.
-    # External redirects are followed by default but can be disabled.
-    # The Net::HTTPResponse will be returned.
+    # The resolve method performs a HTTP GET to obtain the HTML response. The
+    # Net::HTTPResponse will be returned or an error raised. Redirects can be
+    # disabled by setting `redirect_limit: 0`.
+    #
+    # @param url [Wgit::Url] The URL to fetch the HTML from.
+    # @param redirect_limit [Integer] The number of redirect hops to allow
+    #   before raising an error.
+    # @param follow_external_redirects [Boolean] Whether or not to follow
+    #   an external redirect. If false, you must also provide a `host:`
+    #   parameter.
+    # @param host [Wgit::Url, String] Specify the host by which
+    #   an absolute redirect is determined to be internal or not. Must be
+    #   absolute and contain a protocol prefix. For example, a `host:` of
+    #   'http://www.example.com' will only allow redirects for Urls with a
+    #   `to_host` value of 'www.example.com'.
+    # @raise [StandardError] If !url.respond_to? :to_uri or a redirect isn't
+    #   allowed.
+    # @return [Net::HTTPResponse] The HTTP response of the GET request.
     def resolve(
-        url,
-        redirect_limit: Wgit::Crawler.default_redirect_limit,
-        follow_external_redirects: true,
-        host: nil
-      )
+      url,
+      redirect_limit: Wgit::Crawler.default_redirect_limit,
+      follow_external_redirects: true,
+      host: nil
+    )
       raise 'url must respond to :to_uri' unless url.respond_to?(:to_uri)
       redirect_count = 0
+      response = nil
-      begin
+      loop do
         response = Net::HTTP.get_response(url.to_uri)
         location = Wgit::Url.new(response.fetch('location', ''))
+        break unless response.is_a?(Net::HTTPRedirection)
         yield(url, response, location) if block_given?
-        if not location.empty?
-          if  !follow_external_redirects and
+        unless location.empty?
+          if  !follow_external_redirects &&
               !location.is_relative?(host: host)
             raise "External redirect not allowed - Redirected to: \
 '#{location}', which is outside of host: '#{host}'"
           end
           raise 'Too many redirects' if redirect_count >= redirect_limit
           redirect_count += 1
           location = url.to_base.concat(location) if location.is_relative?
           url.replace(location)
         end
-      end while response.is_a?(Net::HTTPRedirection)
+      end
       response
     end
+    # Returns a doc's internal HTML page links in absolute form; used when
+    # crawling a site. Override this method in a subclass to change how a site
+    # is crawled; not what is extracted from each page (Document extensions
+    # should be used for this purpose instead).
+    #
+    # @param doc [Wgit::Document] The document from which to extract it's
+    #   internal page links.
+    # @return [Array<Wgit::Url>] The internal page links from doc.
+    def get_internal_links(doc)
+      doc.internal_full_links
+         .map(&:without_anchor) # Because anchors don't change page content.
+         .uniq
+         .reject do |link|
+        ext = link.to_extension
+        ext ? !%w[htm html].include?(ext) : false
+      end
+    end
+    private
+    # Add the document to the @docs array for later processing or let the block
+    # process it here and now.
+    def handle_crawl_block(url, &block)
+      if block_given?
+        crawl_url(url, &block)
+      else
+        @docs << crawl_url(url)
+        nil
+      end
+    end
     # Add the url to @urls ensuring it is cast to a Wgit::Url if necessary.
     def add_url(url)
       @urls = [] if @urls.nil?
       @urls << Wgit::Url.new(url)
     end
-    # Returns doc's internal HTML page links in absolute form for crawling.
-    # We remove anchors because they are client side and don't change the
-    # resulting page's HTML; unlike query strings for example, which do.
-    def get_internal_links(doc)
-      doc.internal_full_links.
-        map(&:without_anchor).
-        uniq.
-        reject do |link|
-          ext = link.to_extension
-          ext ? !['htm', 'html'].include?(ext) : false
-        end
-    end
-    alias :crawl :crawl_urls
-    alias :crawl_pages :crawl_urls
-    alias :crawl_page :crawl_url
-    alias :crawl_r :crawl_site
+    alias crawl crawl_urls
+    alias crawl_pages crawl_urls
+    alias crawl_page crawl_url
+    alias crawl_r crawl_site
   end
 end
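
In the `resolve` diff above, `begin ... end while` becomes `loop do` with an early `break`. As the lines appear in the diff, the `break` now precedes the `yield`, so the block seems to be invoked only for redirect responses and no longer for the final one. The sketch below mirrors that control flow without touching the network. The `follow_redirects` name and the Hash-based stand-ins for `Net::HTTPResponse` are hypothetical, and the external-redirect check from the real method is omitted.

```ruby
# Follows redirects through a lookup table instead of the network.
# pages maps a URL String to a Hash standing in for Net::HTTPResponse.
def follow_redirects(url, pages, redirect_limit: 5)
  redirect_count = 0
  response = nil

  loop do
    response = pages.fetch(url)
    break unless response[:redirect] # Final response: leave before yielding.

    yield(url, response) if block_given?
    raise 'Too many redirects' if redirect_count >= redirect_limit

    redirect_count += 1
    url = response[:location]
  end

  [url, response]
end

pages = {
  '/a' => { redirect: true, location: '/b' },
  '/b' => { redirect: true, location: '/c' },
  '/c' => { redirect: false, body: '<html></html>' }
}

# Record which URLs the block sees; only the redirecting ones are yielded.
yielded = []
final_url, final_response = follow_redirects('/a', pages) { |url, _| yielded << url }

puts final_url
puts yielded.inspect
```

With `redirect_limit: 0` the first redirect raises immediately, which matches the doc comment's statement that redirects can be disabled that way.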