RubyGems - wayfarer - Versions diffs - 0.4.6 → 0.4.7 - Mend

wayfarer 0.4.6 → 0.4.7

Files changed (175)

checksums.yaml +4 -4
data/.github/workflows/lint.yaml +25 -0
data/.github/workflows/release.yaml +29 -0
data/.github/workflows/tests.yaml +30 -0
data/.gitignore +4 -0
data/.rubocop.yml +5 -0
data/.vale.ini +5 -0
data/.yardopts +1 -3
data/Dockerfile +5 -4
data/Gemfile +3 -0
data/Gemfile.lock +107 -102
data/Rakefile +5 -56
data/bin/wayfarer +1 -1
data/docker-compose.yml +20 -9
data/docs/cookbook/consent_screen.md +2 -2
data/docs/cookbook/executing_javascript.md +3 -3
data/docs/cookbook/navigation.md +12 -12
data/docs/cookbook/querying_html.md +3 -3
data/docs/cookbook/screenshots.md +2 -2
data/docs/cookbook/user_agent.md +1 -1
data/docs/design.md +36 -0
data/docs/guides/callbacks.md +24 -126
data/docs/guides/configuration.md +8 -8
data/docs/guides/handlers.md +60 -0
data/docs/guides/index.md +1 -0
data/docs/guides/jobs/error_handling.md +40 -0
data/docs/guides/jobs.md +99 -31
data/docs/guides/navigation.md +1 -1
data/docs/guides/networking/capybara.md +13 -22
data/docs/guides/networking/custom_adapters.md +82 -41
data/docs/guides/networking/ferrum.md +4 -4
data/docs/guides/networking/http.md +9 -13
data/docs/guides/networking/selenium.md +10 -11
data/docs/guides/pages.md +76 -10
data/docs/guides/redis.md +10 -0
data/docs/guides/routing.md +74 -0
data/docs/guides/tasks.md +33 -9
data/docs/guides/tutorial.md +60 -0
data/docs/guides/user_agents.md +113 -0
data/docs/index.md +17 -40
data/docs/reference/cli.md +35 -25
data/docs/reference/configuration.md +36 -0
data/lib/wayfarer/base.rb +124 -46
data/lib/wayfarer/batch_completion.rb +56 -0
data/lib/wayfarer/callbacks.rb +22 -48
data/lib/wayfarer/cli/route_printer.rb +71 -57
data/lib/wayfarer/cli.rb +121 -0
data/lib/wayfarer/gc.rb +13 -6
data/lib/wayfarer/handler.rb +15 -7
data/lib/wayfarer/logging.rb +38 -0
data/lib/wayfarer/middleware/base.rb +2 -0
data/lib/wayfarer/middleware/batch_completion.rb +19 -0
data/lib/wayfarer/middleware/content_type.rb +54 -0
data/lib/wayfarer/middleware/controller.rb +19 -15
data/lib/wayfarer/middleware/dedup.rb +16 -13
data/lib/wayfarer/middleware/dispatch.rb +12 -4
data/lib/wayfarer/middleware/normalize.rb +12 -11
data/lib/wayfarer/middleware/redis.rb +15 -0
data/lib/wayfarer/middleware/router.rb +33 -35
data/lib/wayfarer/middleware/stage.rb +5 -5
data/lib/wayfarer/middleware/uri_parser.rb +30 -0
data/lib/wayfarer/middleware/user_agent.rb +49 -0
data/lib/wayfarer/networking/capybara.rb +1 -1
data/lib/wayfarer/networking/context.rb +2 -2
data/lib/wayfarer/networking/ferrum.rb +2 -2
data/lib/wayfarer/networking/follow.rb +12 -6
data/lib/wayfarer/networking/http.rb +1 -1
data/lib/wayfarer/networking/pool.rb +17 -12
data/lib/wayfarer/networking/selenium.rb +3 -3
data/lib/wayfarer/networking/strategy.rb +2 -2
data/lib/wayfarer/page.rb +36 -14
data/lib/wayfarer/parsing/xml.rb +6 -6
data/lib/wayfarer/parsing.rb +24 -0
data/lib/wayfarer/redis/barrier.rb +13 -21
data/lib/wayfarer/redis/counter.rb +19 -9
data/lib/wayfarer/redis/pool.rb +1 -1
data/lib/wayfarer/redis/resettable.rb +19 -0
data/lib/wayfarer/routing/dsl.rb +1 -0
data/lib/wayfarer/routing/matchers/path.rb +4 -2
data/lib/wayfarer/routing/root_route.rb +5 -1
data/lib/wayfarer/routing/route.rb +4 -14
data/lib/wayfarer/stringify.rb +22 -30
data/lib/wayfarer/task.rb +12 -18
data/lib/wayfarer.rb +28 -1
data/mkdocs.yml +52 -7
data/rake/docs.rake +26 -0
data/rake/lint.rake +105 -0
data/rake/release.rake +29 -0
data/rake/tests.rake +28 -0
data/requirements.txt +1 -1
data/spec/base_spec.rb +140 -160
data/spec/batch_completion_spec.rb +104 -0
data/spec/cli/job_spec.rb +19 -23
data/spec/cli/routing_spec.rb +101 -0
data/spec/cli/version_spec.rb +1 -1
data/spec/factories/task.rb +7 -1
data/spec/fixtures/dummy_job.rb +5 -3
data/spec/gc_spec.rb +8 -50
data/spec/handler_spec.rb +1 -1
data/spec/integration/callbacks_spec.rb +157 -45
data/spec/integration/content_type_spec.rb +145 -0
data/spec/integration/gc_spec.rb +44 -0
data/spec/integration/handler_spec.rb +66 -0
data/spec/integration/page_spec.rb +44 -29
data/spec/integration/params_spec.rb +33 -25
data/spec/integration/parsing_spec.rb +125 -0
data/spec/integration/routing_spec.rb +18 -0
data/spec/integration/stage_spec.rb +27 -20
data/spec/middleware/batch_completion_spec.rb +34 -0
data/spec/middleware/chain_spec.rb +8 -8
data/spec/middleware/content_type_spec.rb +86 -0
data/spec/middleware/controller_spec.rb +5 -5
data/spec/middleware/dedup_spec.rb +38 -55
data/spec/middleware/dispatch_spec.rb +23 -7
data/spec/middleware/normalize_spec.rb +44 -13
data/spec/middleware/router_spec.rb +29 -30
data/spec/middleware/stage_spec.rb +8 -8
data/spec/middleware/uri_parser_spec.rb +53 -0
data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
data/spec/networking/context_spec.rb +1 -1
data/spec/networking/follow_spec.rb +2 -2
data/spec/networking/pool_spec.rb +5 -5
data/spec/networking/strategy.rb +2 -2
data/spec/page_spec.rb +42 -20
data/spec/parsing/xml_spec.rb +11 -12
data/spec/redis/barrier_spec.rb +8 -48
data/spec/redis/counter_spec.rb +13 -1
data/spec/redis/pool_spec.rb +1 -1
data/spec/spec_helpers.rb +27 -16
data/spec/support/test_app.rb +8 -0
data/spec/task_spec.rb +3 -24
data/spec/wayfarer_spec.rb +1 -1
data/wayfarer.gemspec +4 -3
metadata +61 -51
data/.github/workflows/ci.yaml +0 -32
data/docs/guides/error_handling.md +0 -53
data/docs/guides/networking.md +0 -94
data/docs/guides/performance.md +0 -130
data/docs/guides/reliability.md +0 -41
data/docs/guides/routing/steering.md +0 -30
data/docs/reference/api/base.md +0 -48
data/docs/reference/configuration_keys.md +0 -43
data/docs/reference/environment_variables.md +0 -83
data/lib/wayfarer/cli/base.rb +0 -45
data/lib/wayfarer/cli/generate.rb +0 -17
data/lib/wayfarer/cli/job.rb +0 -56
data/lib/wayfarer/cli/route.rb +0 -29
data/lib/wayfarer/cli/runner.rb +0 -34
data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
data/lib/wayfarer/config/capybara.rb +0 -10
data/lib/wayfarer/config/ferrum.rb +0 -11
data/lib/wayfarer/config/networking.rb +0 -29
data/lib/wayfarer/config/redis.rb +0 -14
data/lib/wayfarer/config/root.rb +0 -11
data/lib/wayfarer/config/selenium.rb +0 -21
data/lib/wayfarer/config/strconv.rb +0 -45
data/lib/wayfarer/config/struct.rb +0 -72
data/lib/wayfarer/middleware/fetch.rb +0 -56
data/lib/wayfarer/redis/connection.rb +0 -13
data/lib/wayfarer/redis/version.rb +0 -19
data/lib/wayfarer/routing/router.rb +0 -28
data/spec/callbacks_spec.rb +0 -102
data/spec/cli/generate_spec.rb +0 -39
data/spec/config/capybara_spec.rb +0 -18
data/spec/config/ferrum_spec.rb +0 -24
data/spec/config/networking_spec.rb +0 -73
data/spec/config/redis_spec.rb +0 -32
data/spec/config/root_spec.rb +0 -31
data/spec/config/selenium_spec.rb +0 -56
data/spec/config/strconv_spec.rb +0 -58
data/spec/config/struct_spec.rb +0 -66
data/spec/integration/steering_spec.rb +0 -57
data/spec/redis/version_spec.rb +0 -13
data/spec/routing/router_spec.rb +0 -24

data/docs/guides/networking/capybara.md CHANGED

@@ -1,17 +1,14 @@
 # Capybara
-[Capybara](https://github.com/teamcapybara/capybara) is originally a test
-framework for web applications.
-When Capybara is in use, a remote browser process is available as a Capybara
-session:
+[Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
+applications which adds a nice API that also works well for web scraping.
 ```ruby
-Wayfarer.config.network.agent = :capybara
-# Wayfarer.config.capybara.driver = ...
+Wayfarer.config[:network][:agent] = :capybara
+# Wayfarer.config[:capybara][:driver] = ...
 class DummyJob < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Capybara::Session ...>
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
 end
 ```
+## Example: Automating Chrome with Cuprite and Ferrum
-## Configuring a driver
-1. Install the Capybara driver for the desired user agent.
-    For example, to automate Google Chrome with
-    [Ferrum](https://github.com/rubycdp/ferrum), install the
-    [Cuprite](https://github.com/rubycdp/cuprite) driver:
+1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
     === "RubyGems"
@@ -34,20 +26,19 @@ end
         gem install cuprite
         ```
-    === "Bundler"
+    === "Gemfile"
         ```ruby
         gem "cuprite" # Gemfile
         ```
-2. Configure Wayfarer to use the `:capybara` user agent and set the desired
-    driver:
+2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
     === "Runtime"
         ```ruby
-        Wayfarer.config.network.agent = :capybara
-        Wayfarer.config.capybara.driver = :cuprite
+        Wayfarer.config[:network][:agent] = :capybara
+        Wayfarer.config[:capybara][:driver] = :cuprite
         ```
     === "Environment variables"
@@ -57,7 +48,7 @@ end
         WAYFARER_CAPYBARA_DRIVER=cuprite
         ```
-3. Register the driver:
+3. Register the driver with Capybara:
     ```ruby
     require "capybara/cuprite"
@@ -66,6 +57,6 @@ end
     Capybara.register_driver(:cuprite) do |app|
       # Wayfarer's Ferrum or Selenium options can be passed along
-      Capybara::Cuprite::Driver.new(app, Wayfarer.config.ferrum.options)
+      Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
     end
     ```

data/docs/guides/networking/custom_adapters.md CHANGED

@@ -1,18 +1,66 @@
-# Custom agents
+# User agent API
-Wayfarer offers an interface for integrating third-party browsers and HTTP
-clients as user agents.
+Wayfarer retrieves web pages with user agents. There are two types of user
+agents: __stateful__ browsers which carry state and follow redirects implicitly,
+and __stateless__ HTTP clients, which handle redirects explicitly.
-There are two types of agents:
+Because spawning browser processes or instantiating HTTP clients is expensive,
+Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
+irrecoverable errors are individual user agents destroyed and recreated. For example,
+when a browser process crashes, it is replaced with a new one and checked back
+into the pool. The next job that checks out the user agent gets a fresh
+browser process.
-1. Stateful agents, i.e. browsers, which carry state and support navigation.
-   These follow HTTP redirects implicitly.
-2. Stateless agents, which deal with HTTP requests/responses only.
-   These handle HTTP redirects explicitly.
+## Base interface for custom user agents
-## Implementation
+You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
+module and defining callback methods. The stateful and stateless interfaces
+share the following instance methods:
-Both types can be implemented with callback methods:
+* `#create` (__required__): Called when a new instance (browser process or HTTP client) is
+  needed.
+* `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
+  processes should be quit, and HTTP clients should be freed.
+* `#renew_on` (optional): Returns a list of exception classes upon which the existing
+  instance gets destroyed and replaced with a newly created one.
+## Stateless interface
+The stateless interface indicates HTTP 3xx redirect responses explicitly. This lets
+Wayfarer handle redirects out of the box and enforce a configurable limit
+on the number of redirects to follow.
+In addition to the base interface, stateless user agents implement `#fetch`
+which fetches [pages](../pages) or indicates redirects:
+* `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
+  3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
+  deals with redirects on your behalf to avoid redirect loops. All other status
+  codes, including 4xx and 5xx, are considered successful and are indicated by calling
+  `success(url:, body:, status_code:, headers:)`.
+## Stateful interface
+In addition to the base interface, stateful user agents implement two additional
+methods:
+* `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
+  Stateful user agents follow redirects implicitly.
+* `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
+  into a [page](../pages).
+## Recreating user agents on error with `#renew_on`
+Agents can optionally implement `#renew_on` to get themselves recreated on
+certain errors.
+If `#fetch` or `#navigate` raises an exception and the exception class is listed
+in `#renew_on`, the instance is destroyed and recreated.
+* `#renew_on` (optional): A list of exception classes upon which the existing instance gets
+  destroyed and replaced with a newly created one.
+## Example implementations
 === "Stateful"
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
     class StatefulAgent
       include Wayfarer::Networking::Strategy
-      def renew_on # optional
-        [MyBrowser::IrrecoverableError]
-      end
+      # Required methods
       def create
         MyBrowser.new
       end
-      def destroy(browser) # optional
-        browser.quit
-      end
       def navigate(browser, url)
         browser.goto(url)
       end
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
                 status_code: browser.status_code,
                 headers: browser.headers)
       end
+      # Optional methods
+      def destroy(browser)
+        browser.quit
+      end
+      def renew_on
+        [MyBrowser::IrrecoverableError]
+      end
     end
     ```
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
     class StatelessAgent
       include Wayfarer::Networking::Strategy
-      def renew_on # optional
-        [MyClient::IrrecoverableError]
-      end
+      # Required methods
       def create
         MyClient.new
       end
-      def destroy(client) # optional
-        client.close
-      end
       def fetch(client, url)
         response = client.get(url)
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
                 status_code: response.status_code,
                 headers: response.headers)
       end
+      # Optional methods
+      def destroy(client)
+        client.close
+      end
+      def renew_on # optional
+        [MyClient::IrrecoverableError]
+      end
     end
     ```
-Register the strategy:
+Register and use the strategy:
 ```ruby
 Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
+Wayfarer.config[:network][:agent] = :my_agent
 ```
-Use the strategy:
-```ruby
-Wayfarer.config.network.agent = :my_agent
-```
-### Remarks
-#### Self-healing
-* A strategy's `#renew_on` method may return a list of exception classes upon
-  which the existing instance gets destroyed and replaced with a newly created
-  one.
-* Stateless clients must not raise exceptions when encountering certain HTTP
-  response codes (for example, 5xx).
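The renewal behaviour described in this guide can be sketched in plain Ruby. This is a minimal illustration, not Wayfarer's pool: `FlakyClient`, `SketchStrategy` and `SketchSlot` are hypothetical names, and re-raising the error after renewal is an assumption made for the sketch.

```ruby
# Hypothetical stand-ins; none of these classes are part of Wayfarer.
class FlakyClient
  class IrrecoverableError < StandardError; end

  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end
end

# Mirrors the callback names from the guide: #create, #destroy, #renew_on.
class SketchStrategy
  attr_reader :created

  def initialize
    @created = []
  end

  def create
    FlakyClient.new.tap { |client| @created << client }
  end

  def destroy(client)
    client.close
  end

  def renew_on
    [FlakyClient::IrrecoverableError]
  end
end

# Holds one instance and replaces it when a listed exception is raised.
class SketchSlot
  def initialize(strategy)
    @strategy = strategy
    @instance = strategy.create
  end

  def with_instance
    yield @instance
  rescue *@strategy.renew_on
    @strategy.destroy(@instance)
    @instance = @strategy.create
    raise # assumption: the error still reaches the caller
  end
end

strategy = SketchStrategy.new
slot = SketchSlot.new(strategy)

begin
  slot.with_instance { |_client| raise FlakyClient::IrrecoverableError }
rescue FlakyClient::IrrecoverableError
  # The slot now holds a fresh client for the next caller.
end
```

After the failing block, `strategy.created` holds two clients: the first one closed, the second one open and ready for reuse.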

data/docs/guides/networking/ferrum.md CHANGED

@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
 so:
 ```ruby
-Wayfarer.config.network.agent = :ferrum
+Wayfarer.config[:network][:agent] = :ferrum
 class DummyWorker < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Ferrum::Browser ...>
@@ -27,8 +27,8 @@ end
 === "Runtime"
     ```ruby
-    Wayfarer.config.network.agent = :ferrum
-    Wayfarer.config.ferrum.options = { headless: false, url: "http://chrome:3000" }
+    Wayfarer.config[:network][:agent] = :ferrum
+    Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
     ```
 === "Environment variables"

data/docs/guides/networking/http.md CHANGED

@@ -1,33 +1,29 @@
 # Plain HTTP
-Wayfarer can retrieve pages via plain HTTP requests, also alongside automated
-browsers.
+Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
+also alongside automated browsers.
-## Agent
+## Ad-hoc GET requests
-The HTTP agent is the default.
-## Ad-hoc requests
-When automating browsers, it can be useful to additionally retrieve the page
+When automating browsers, it can be useful to additionally retrieve another page
 over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route { to :index }
+  route.to :index
   def index
-    http.fetch(task.url) # => #<Wayfarer::Page ...>
+    http.fetch("https://example.com") # => #<Wayfarer::Page ...>
   end
 end
 ```
-By default, 3 redirects are followed, and this can be configured by passing the
-`follow` keyword:
+By default, 3 redirects are followed, and this number can be configured by
+passing the `follow` keyword:
 ```ruby
 http.fetch(url, follow: 5)
 ```
-If redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
+When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
 raised.
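The `follow` limit can be pictured as a counting loop around a stateless fetch. The sketch below is an illustration only: `fetch_following`, `SketchRedirectsExhaustedError` and the `responses` table are hypothetical and do not reflect how Wayfarer implements `http.fetch`.

```ruby
# Hypothetical sketch; not Wayfarer's implementation.
class SketchRedirectsExhaustedError < StandardError; end

# `responses` maps a URL to either [:redirect, next_url] or [:success, body].
def fetch_following(responses, url, follow: 3)
  redirects = 0
  loop do
    kind, value = responses.fetch(url)
    return value if kind == :success

    # One more redirect would exceed the limit, so give up.
    raise SketchRedirectsExhaustedError, "gave up at #{url}" if redirects >= follow

    redirects += 1
    url = value
  end
end

responses = {
  "http://a.test" => [:redirect, "http://b.test"],
  "http://b.test" => [:redirect, "http://c.test"],
  "http://c.test" => [:success, "<html>c</html>"]
}

# Two redirects fit within the default limit of three.
body = fetch_following(responses, "http://a.test")

# With follow: 1 the second redirect raises.
exhausted =
  begin
    fetch_following(responses, "http://a.test", follow: 1)
    false
  rescue SketchRedirectsExhaustedError
    true
  end
```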

data/docs/guides/networking/selenium.md CHANGED

@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
 so:
 ```ruby
-Wayfarer.config.network.agent = :selenium
+Wayfarer.config[:network][:agent] = :selenium
 class DummyWorker < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Selenium::WebDriver ...>
@@ -27,10 +27,10 @@ process.
     Pages retrieved with a Selenium WebDriver return fake values:
     ```ruby
-    Wayfarer.config.network.agent = :selenium
+    Wayfarer.config[:network][:agent] = :selenium
     class DummyJob < Wayfarer::Base
-      route { to :index }
+      route.to :index
       def index
         page.headers     # => always {}
@@ -39,19 +39,18 @@ process.
     end
     ```
-!!! note "Consider using [Ferrum](../ferrum) instead"
-    Ferrum provides superior stability and a richer feature set compared to
-    Selenium drivers. However Ferrum automates only Google Chrome. Unless a
-    different browser is required, consider using Ferrum instead of Selenium.
+!!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
+    Use Ferrum if you want to automate Google Chrome. It provides superior
+    stability and a richer feature set compared to Selenium drivers.
 ## Configuring Selenium
 === "Runtime"
     ```ruby
-    Wayfarer.config.network.agent = :selenium
-    Wayfarer.config.selenium.driver = :firefox
-    Wayfarer.config.selenium.options = { url: "http://firefox" }
+    Wayfarer.config[:network][:agent] = :selenium
+    Wayfarer.config[:selenium][:driver] = :firefox
+    Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
     ```
 === "Environment variables"

data/docs/guides/pages.md CHANGED

@@ -1,11 +1,14 @@
 # Pages
-Retrieved pages take the shape of `Wayfarer::Page` objects and are available
-to jobs:
+A page is the immutable state of the contents behind a URL at a point in time,
+retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
+response, or the state of a remotely controlled browser.
 ```ruby
-class DummyJob < Wayfarer::Worker
-  route { to :index }
+class DummyJob < ActiveJob::Base
+  include Wayfarer::Base
+  route.to :index
   def index
     page # => #<Wayfarer::Page ...>
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
     page.url         # => "https://example.com"
     page.body        # => "<html>..."
     page.status_code # => 200
-    page.headers     # => { "Content-Type" => ... }
+    page.headers     # => { "content-type" => ... }
+    page.mime_type   # => #<MIME::Type: text/html>
+    # The lazily parsed response body or `nil`, depending on the Content-Type
+    page.doc # => #<Nokogiri::HTML::Document ...>
-    # A MetaInspector object for accessing page meta data.
     # See: https://github.com/metainspector/metainspector
+    page.meta # => #<MetaInspector::Document ...>
     # Examples:
     page.meta.links.internal
     page.meta.images.favicon
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
 end
 ```
+!!! info "HTTP headers are downcased and case-sensitive"
+    HTTP headers are downcased, so you would access
+    `page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
+## Response body parsing
+Wayfarer parses the bodies of HTML, XML and JSON responses according to their
+MIME types:
+* `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
+* `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
+* `application/json` to `Hash`
 ## Live pages
-When automating browsers, it is possible the page changes significantly at
-runtime, for example due to JavaScript altering the DOM or URL.
+`#!ruby page` initially returns a snapshot of the browser state
+immediately after the user agent navigated to the URL. The browser state may
+change significantly after the page was retrieved, for example due to your own
+interaction, or client-side JavaScript altering the DOM or URL.
-To access a page reflecting the current browser state, pass the `live` keyword:
+To get a page that reflects the current browser state, set the `#!ruby :live`
+keyword:
 ```ruby
 class DummyJob < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     page # => #<Wayfarer::Page ...>
+    # Fill in forms, click buttons, etc.
     # Replaces the current Page object with a newer one,
     # taking into account the DOM as currently rendered by the browser.
     # Effectful only when automating browsers, no-op when using plain
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
   end
 end
 ```
+!!! attention "Stateless user agents ignore `#!ruby :live`"
+    The `#!ruby :live` option is ignored by stateless user agents, such as the
+    default `#!ruby :http` user agent. Instead, stateless user agents always
+    return the same page object.
+### Implementing a custom response body parser
+You can register an object that implements a `#parse` method for any MIME type:
+```ruby
+class MyJPEGParser
+  def parse(body)
+    # Read EXIF metadata here.
+    # Return value is accessible as `page.doc`
+  end
+end
+Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
+```
+!!! info "Handling responses without a Content-Type"
+    If a response has no `Content-Type` header, Wayfarer falls back to
+    `application/octet-stream`. A parser registered for
+    `application/octet-stream` will hence also handle all responses without
+    a Content-Type.
+## Accessing page metadata with MetaInspector
+You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
+document for accessing metadata of HTML pages. For example, to stage all links
+internal to the current hostname:
+```ruby
+def index
+  stage page.meta.links.internal
+end
+```
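To make the parser lookup and the `application/octet-stream` fallback concrete, here is a small self-contained sketch. `SketchParsing` and its parsers are hypothetical stand-ins that only mirror the documented behaviour (a `#parse` object per MIME type, a fallback when `Content-Type` is missing, `nil` when nothing is registered); they are not `Wayfarer::Parsing`.

```ruby
require "json"

# Hypothetical stand-in for a MIME-type-keyed parser registry.
module SketchParsing
  FALLBACK_MIME_TYPE = "application/octet-stream"

  class JSONParser
    def parse(body)
      JSON.parse(body)
    end
  end

  class BinaryParser
    def parse(body)
      { bytes: body.bytesize }
    end
  end

  def self.registry
    @registry ||= {
      "application/json" => JSONParser.new,
      FALLBACK_MIME_TYPE => BinaryParser.new
    }
  end

  # Drops parameters such as "; charset=utf-8" and downcases the media type.
  def self.mime_type_for(headers)
    content_type = headers["content-type"]
    return FALLBACK_MIME_TYPE if content_type.nil? || content_type.strip.empty?

    content_type.split(";").first.strip.downcase
  end

  # Returns the parsed body, or nil when no parser is registered.
  def self.parse(headers, body)
    registry[mime_type_for(headers)]&.parse(body)
  end
end

json = SketchParsing.parse({ "content-type" => "application/json; charset=utf-8" }, '{"id": 42}')
blob = SketchParsing.parse({}, "\x00\x01\x02") # no Content-Type: fallback parser
none = SketchParsing.parse({ "content-type" => "text/plain" }, "hello")
```

Registering another parser in this sketch is a plain hash assignment, for example `SketchParsing.registry["image/jpeg"] = some_parser`.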

data/docs/guides/redis.md ADDED

@@ -0,0 +1,10 @@
+# Redis
+Wayfarer uses Redis to keep track of:
+* URLs that were already processed within a batch
+* the number of jobs left in a batch
+## Garbage collection
+Wayfarer cleans up batch-related data

data/docs/guides/routing.md ADDED

@@ -0,0 +1,74 @@
+# Routing
+Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
+either instance methods denoted by symbols, or [handlers](/guides/handlers).
+A job's route declarations equate to a predicate tree.
+When a URL is routed, the predicate tree is searched depth-first. If a
+matching leaf predicate is found, the found path's action is dispatched,
+along with `params` collected from path parameters.
+The following routes:
+```ruby
+route.host "example.com", scheme: :https do
+  path "/contact", to: :contact
+  path "/users/:id", to: [UserHandler, :show]
+end
+```
+Equate to the following predicate tree:
+```mermaid
+flowchart LR
+  RootRoute-->Host["Host <code>example.com</code>"]
+  Host-->Scheme["Scheme <code>:https</code>"]
+  Scheme-->Path1["Path <code>/contact</code>"]
+  Scheme-->Path2["Path <code>/users/:id</code>"]
+  Path1-->TargetRoute1["Target <code>:contact</code>"]
+  Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
+```
+An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
+```mermaid
+flowchart LR
+  RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
+  Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
+  Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
+  Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
+  Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
+  Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
+  classDef active fill:#7CB342,stroke:#7CB342,color:#fff
+  classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
+  classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
+```
+You can also visualise an invocation of the predicate tree on the command line
+with `wayfarer tree`:
+```
+wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
+Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
+└──Host("example.com", match: true)
+   └──Scheme(:https, match: true)
+      ├──Path("/contact", match: false)
+      │  └──Target(match: true)
+      └──Path("/users/:id", match: true)
+         └──Target(match: true)
+            └──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
+```
+As you can see, `Target` nodes always match. This means that we could have also defined
+our routes as:
+```ruby
+route.host "example.com", scheme: :https do
+  to :contact do
+    path "/contact"
+  end
+  to [UserHandler, :show] do
+    path "/users/:id"
+  end
+end
+```
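The depth-first search over a predicate tree can be sketched in a few lines of plain Ruby. This is an illustration under stated assumptions, not Wayfarer's routing classes: `SketchNode`, `path_predicate` and `target` are hypothetical, predicates return a params hash on a match and `nil` otherwise, and the handler is written as the symbol `:UserHandler` so the example needs no extra class.

```ruby
require "uri"

# Hypothetical predicate tree node; leaves carry the action to dispatch.
SketchNode = Struct.new(:label, :predicate, :children, :action) do
  # Depth-first search: returns [action, params] for the first matching leaf.
  def match(uri, params = {})
    captured = predicate.call(uri)
    return nil unless captured

    params = params.merge(captured)
    return [action, params] if children.empty?

    children.each do |child|
      found = child.match(uri, params)
      return found if found
    end
    nil
  end
end

# Builds a predicate that compares path segments and collects ":name" params.
def path_predicate(pattern)
  pattern_segments = pattern.split("/")
  lambda do |uri|
    segments = uri.path.split("/")
    next nil unless segments.size == pattern_segments.size

    pattern_segments.zip(segments).each_with_object({}) do |(expected, actual), params|
      if expected.start_with?(":")
        params[expected.delete_prefix(":").to_sym] = actual
      elsif expected != actual
        break nil
      end
    end
  end
end

# Target nodes always match, as in the tree printed by the CLI.
target = ->(action) { SketchNode.new("Target", ->(_uri) { {} }, [], action) }

tree = SketchNode.new("Host", ->(uri) { uri.host == "example.com" ? {} : nil }, [
  SketchNode.new("Scheme", ->(uri) { uri.scheme == "https" ? {} : nil }, [
    SketchNode.new("Path /contact", path_predicate("/contact"), [target.call(:contact)]),
    SketchNode.new("Path /users/:id", path_predicate("/users/:id"),
                   [target.call(%i[UserHandler show])])
  ])
])

user_match = tree.match(URI("https://example.com/users/42"))
contact_match = tree.match(URI("https://example.com/contact"))
no_match = tree.match(URI("http://example.com/contact")) # wrong scheme
```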

data/docs/guides/tasks.md CHANGED

@@ -1,14 +1,38 @@
 # Tasks
-Tasks are the immutable units of work processed by [jobs](/guides/jobs). A task
-consists of:
+Tasks are the immutable units of work read from a message queue and processed by
+[jobs](/guides/jobs). A task consists of two strings:
-1. The __URL__ to process
-    * Within a batch, every URL gets processed at most once.
+* The __URL__ to process
+* The __batch__ the task belongs to
-2. The __batch__ the task belongs to
-    * Like URLs, batches are strings.
+A job processing a task commonly appends more tasks to the queue in turn.
-Tasks get appended to the end of a message queue, and consumed from the
-beginning. Because jobs can enqueue other tasks, jobs are both consumers
-and producers of tasks.
+!!! info "Task URLs are not normalized"
+    The URL returned by `task.url` is not normalized but verbatim
+    as it was staged or enqueued.
+## Task deduplication
+Wayfarer ensures that no URL gets processed twice within a batch. It achieves
+this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
+keyed by normalized URLs.
+### URL normalization
+Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
+and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
+URL normalization is used only for deduplication, and does not affect the URL
+returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
+enqueued. This allows you to follow the exact URLs you may have parsed from a
+response body.
+## Invalid URLs
+Tasks with invalid URLs (for example `ht%0atp://localhost/`, with a newline in
+the protocol) are discarded, since they can't be retrieved. No exception is
+raised, and the job is considered successfully processed: tasks are immutable,
+so an error handler could take no corrective action, and retries would not
+change the outcome.
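Deduplication by normalized URL, together with silently discarding invalid URLs, can be sketched as follows. The sketch makes several assumptions and is not Wayfarer's middleware: `SketchDedup` is a hypothetical class covering a single batch, an in-memory `Set` stands in for the Redis hash, and Ruby's standard `URI#normalize` stands in for Addressable and `normalize_url`, which normalize more aggressively.

```ruby
require "set"
require "uri"

# Hypothetical single-batch deduplicator.
class SketchDedup
  attr_reader :processed

  def initialize
    @seen = Set.new # stands in for the Redis hash keyed by normalized URLs
    @processed = [] # verbatim URLs that would be handed to the job
  end

  # Returns :processed, :duplicate or :invalid. Never raises for bad URLs.
  def call(url)
    key = normalize(url)
    return :invalid if key.nil?
    return :duplicate unless @seen.add?(key)

    @processed << url # the verbatim URL is kept, not the normalized key
    :processed
  end

  private

  # Returns a normalized HTTP(S) URL string, or nil for anything else.
  def normalize(url)
    uri = URI.parse(url).normalize
    uri.is_a?(URI::HTTP) ? uri.to_s : nil
  rescue URI::InvalidURIError
    nil
  end
end

dedup = SketchDedup.new
first = dedup.call("https://Example.com/a")
second = dedup.call("https://example.com/a") # same URL after normalization
invalid = dedup.call("ht%0atp://localhost/")
```

Only the first URL ends up in `dedup.processed`, spelled exactly as it was passed in.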