RubyGems - wayfarer - Versions diffs - 0.4.5 → 0.4.7 - Mend

wayfarer 0.4.5 → 0.4.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (175)

checksums.yaml +4 -4
data/.github/workflows/lint.yaml +25 -0
data/.github/workflows/release.yaml +29 -0
data/.github/workflows/tests.yaml +30 -0
data/.gitignore +4 -0
data/.rubocop.yml +5 -0
data/.vale.ini +5 -0
data/.yardopts +1 -3
data/Dockerfile +5 -4
data/Gemfile +3 -0
data/Gemfile.lock +107 -102
data/Rakefile +5 -56
data/bin/wayfarer +1 -1
data/docker-compose.yml +20 -9
data/docs/cookbook/consent_screen.md +2 -2
data/docs/cookbook/executing_javascript.md +3 -3
data/docs/cookbook/navigation.md +12 -12
data/docs/cookbook/querying_html.md +3 -3
data/docs/cookbook/screenshots.md +2 -2
data/docs/cookbook/user_agent.md +1 -1
data/docs/design.md +36 -0
data/docs/guides/callbacks.md +24 -126
data/docs/guides/configuration.md +8 -8
data/docs/guides/handlers.md +60 -0
data/docs/guides/index.md +1 -0
data/docs/guides/jobs/error_handling.md +40 -0
data/docs/guides/jobs.md +99 -31
data/docs/guides/navigation.md +1 -1
data/docs/guides/networking/capybara.md +13 -22
data/docs/guides/networking/custom_adapters.md +82 -41
data/docs/guides/networking/ferrum.md +4 -4
data/docs/guides/networking/http.md +9 -13
data/docs/guides/networking/selenium.md +10 -11
data/docs/guides/pages.md +76 -10
data/docs/guides/redis.md +10 -0
data/docs/guides/routing.md +74 -0
data/docs/guides/tasks.md +33 -9
data/docs/guides/tutorial.md +60 -0
data/docs/guides/user_agents.md +113 -0
data/docs/index.md +17 -40
data/docs/reference/cli.md +35 -25
data/docs/reference/configuration.md +36 -0
data/lib/wayfarer/base.rb +124 -46
data/lib/wayfarer/batch_completion.rb +56 -0
data/lib/wayfarer/callbacks.rb +22 -48
data/lib/wayfarer/cli/route_printer.rb +71 -57
data/lib/wayfarer/cli.rb +121 -0
data/lib/wayfarer/gc.rb +13 -6
data/lib/wayfarer/handler.rb +15 -7
data/lib/wayfarer/logging.rb +38 -0
data/lib/wayfarer/middleware/base.rb +2 -0
data/lib/wayfarer/middleware/batch_completion.rb +19 -0
data/lib/wayfarer/middleware/content_type.rb +54 -0
data/lib/wayfarer/middleware/controller.rb +19 -15
data/lib/wayfarer/middleware/dedup.rb +16 -13
data/lib/wayfarer/middleware/dispatch.rb +12 -4
data/lib/wayfarer/middleware/normalize.rb +12 -11
data/lib/wayfarer/middleware/redis.rb +15 -0
data/lib/wayfarer/middleware/router.rb +33 -35
data/lib/wayfarer/middleware/stage.rb +5 -5
data/lib/wayfarer/middleware/uri_parser.rb +30 -0
data/lib/wayfarer/middleware/user_agent.rb +49 -0
data/lib/wayfarer/networking/capybara.rb +1 -1
data/lib/wayfarer/networking/context.rb +2 -2
data/lib/wayfarer/networking/ferrum.rb +2 -2
data/lib/wayfarer/networking/follow.rb +12 -6
data/lib/wayfarer/networking/http.rb +1 -1
data/lib/wayfarer/networking/pool.rb +17 -12
data/lib/wayfarer/networking/selenium.rb +3 -3
data/lib/wayfarer/networking/strategy.rb +2 -2
data/lib/wayfarer/page.rb +36 -14
data/lib/wayfarer/parsing/xml.rb +6 -6
data/lib/wayfarer/parsing.rb +24 -0
data/lib/wayfarer/redis/barrier.rb +13 -21
data/lib/wayfarer/redis/counter.rb +19 -9
data/lib/wayfarer/redis/pool.rb +1 -1
data/lib/wayfarer/redis/resettable.rb +19 -0
data/lib/wayfarer/routing/dsl.rb +1 -0
data/lib/wayfarer/routing/matchers/path.rb +4 -2
data/lib/wayfarer/routing/root_route.rb +5 -1
data/lib/wayfarer/routing/route.rb +4 -14
data/lib/wayfarer/stringify.rb +22 -30
data/lib/wayfarer/task.rb +12 -18
data/lib/wayfarer.rb +29 -2
data/mkdocs.yml +52 -7
data/rake/docs.rake +26 -0
data/rake/lint.rake +105 -0
data/rake/release.rake +29 -0
data/rake/tests.rake +28 -0
data/requirements.txt +1 -1
data/spec/base_spec.rb +140 -160
data/spec/batch_completion_spec.rb +104 -0
data/spec/cli/job_spec.rb +19 -23
data/spec/cli/routing_spec.rb +101 -0
data/spec/cli/version_spec.rb +1 -1
data/spec/factories/task.rb +7 -1
data/spec/fixtures/dummy_job.rb +5 -3
data/spec/gc_spec.rb +8 -50
data/spec/handler_spec.rb +1 -1
data/spec/integration/callbacks_spec.rb +157 -45
data/spec/integration/content_type_spec.rb +145 -0
data/spec/integration/gc_spec.rb +44 -0
data/spec/integration/handler_spec.rb +66 -0
data/spec/integration/page_spec.rb +44 -29
data/spec/integration/params_spec.rb +33 -25
data/spec/integration/parsing_spec.rb +125 -0
data/spec/integration/routing_spec.rb +18 -0
data/spec/integration/stage_spec.rb +27 -20
data/spec/middleware/batch_completion_spec.rb +34 -0
data/spec/middleware/chain_spec.rb +8 -8
data/spec/middleware/content_type_spec.rb +86 -0
data/spec/middleware/controller_spec.rb +5 -5
data/spec/middleware/dedup_spec.rb +38 -55
data/spec/middleware/dispatch_spec.rb +23 -7
data/spec/middleware/normalize_spec.rb +44 -13
data/spec/middleware/router_spec.rb +29 -30
data/spec/middleware/stage_spec.rb +8 -8
data/spec/middleware/uri_parser_spec.rb +53 -0
data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
data/spec/networking/context_spec.rb +17 -0
data/spec/networking/follow_spec.rb +2 -2
data/spec/networking/pool_spec.rb +5 -5
data/spec/networking/strategy.rb +2 -2
data/spec/page_spec.rb +42 -20
data/spec/parsing/xml_spec.rb +11 -12
data/spec/redis/barrier_spec.rb +8 -48
data/spec/redis/counter_spec.rb +13 -1
data/spec/redis/pool_spec.rb +1 -1
data/spec/spec_helpers.rb +27 -16
data/spec/support/test_app.rb +8 -0
data/spec/task_spec.rb +3 -24
data/spec/wayfarer_spec.rb +1 -1
data/wayfarer.gemspec +4 -3
metadata +61 -51
data/.github/workflows/ci.yaml +0 -32
data/docs/guides/error_handling.md +0 -31
data/docs/guides/networking.md +0 -94
data/docs/guides/performance.md +0 -130
data/docs/guides/reliability.md +0 -41
data/docs/guides/routing/steering.md +0 -30
data/docs/reference/api/base.md +0 -48
data/docs/reference/configuration_keys.md +0 -42
data/docs/reference/environment_variables.md +0 -83
data/lib/wayfarer/cli/base.rb +0 -45
data/lib/wayfarer/cli/generate.rb +0 -17
data/lib/wayfarer/cli/job.rb +0 -56
data/lib/wayfarer/cli/route.rb +0 -29
data/lib/wayfarer/cli/runner.rb +0 -34
data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
data/lib/wayfarer/config/capybara.rb +0 -10
data/lib/wayfarer/config/ferrum.rb +0 -11
data/lib/wayfarer/config/networking.rb +0 -26
data/lib/wayfarer/config/redis.rb +0 -14
data/lib/wayfarer/config/root.rb +0 -11
data/lib/wayfarer/config/selenium.rb +0 -21
data/lib/wayfarer/config/strconv.rb +0 -45
data/lib/wayfarer/config/struct.rb +0 -72
data/lib/wayfarer/middleware/fetch.rb +0 -56
data/lib/wayfarer/redis/connection.rb +0 -13
data/lib/wayfarer/redis/version.rb +0 -19
data/lib/wayfarer/routing/router.rb +0 -28
data/spec/callbacks_spec.rb +0 -102
data/spec/cli/generate_spec.rb +0 -39
data/spec/config/capybara_spec.rb +0 -18
data/spec/config/ferrum_spec.rb +0 -24
data/spec/config/networking_spec.rb +0 -73
data/spec/config/redis_spec.rb +0 -32
data/spec/config/root_spec.rb +0 -31
data/spec/config/selenium_spec.rb +0 -56
data/spec/config/strconv_spec.rb +0 -58
data/spec/config/struct_spec.rb +0 -66
data/spec/integration/steering_spec.rb +0 -57
data/spec/redis/version_spec.rb +0 -13
data/spec/routing/router_spec.rb +0 -24

data/docs/guides/networking/capybara.md CHANGED

@@ -1,17 +1,14 @@
 # Capybara
-[Capybara](https://github.com/teamcapybara/capybara) is originally a test
-framework for web applications.
-When Capybara is in use, a remote browser process is available as a Capybara
-session:
+[Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
+applications which adds a nice API that also works well for web scraping.
 ```ruby
-Wayfarer.config.network.agent = :capybara
-# Wayfarer.config.capybara.driver = ...
+Wayfarer.config[:network][:agent] = :capybara
+# Wayfarer.config[:capybara][:driver] = ...
 class DummyJob < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Capybara::Session ...>
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
 end
 ```
+## Example: Automating Chrome with Cuprite and Ferrum
-## Configuring a driver
-1. Install the Capybara driver for the desired user agent.
-    For example, to automate Google Chrome with
-    [Ferrum](https://github.com/rubycdp/ferrum), install the
-    [Cuprite](https://github.com/rubycdp/cuprite) driver:
+1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
     === "RubyGems"
@@ -34,20 +26,19 @@ end
         gem install cuprite
         ```
-    === "Bundler"
+    === "Gemfile"
         ```ruby
         gem "cuprite" # Gemfile
         ```
-2. Configure Wayfarer to use the `:capybara` user agent and set the desired
-    driver:
+2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
     === "Runtime"
         ```ruby
-        Wayfarer.config.network.agent = :capybara
-        Wayfarer.config.capybara.driver = :cuprite
+        Wayfarer.config[:network][:agent] = :capybara
+        Wayfarer.config[:capybara][:driver] = :cuprite
         ```
     === "Environment variables"
@@ -57,7 +48,7 @@ end
         WAYFARER_CAPYBARA_DRIVER=cuprite
         ```
-3. Register the driver:
+3. Register the driver with Capybara:
     ```ruby
     require "capybara/cuprite"
@@ -66,6 +57,6 @@ end
     Capybara.register_driver(:cuprite) do |app|
       # Wayfarer's Ferrum or Selenium options can be passed along
-      Capybara::Cuprite::Driver.new(app, Wayfarer.config.ferrum.options)
+      Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
     end
     ```

data/docs/guides/networking/custom_adapters.md CHANGED

@@ -1,18 +1,66 @@
-# Custom agents
+# User agent API
-Wayfarer offers an interface for integrating third-party browsers and HTTP
-clients as user agents.
+Wayfarer retrieves web pages with user agents. There are two types of user
+agents: __stateful__ browsers which carry state and follow redirects implicitly,
+and __stateless__ HTTP clients, which handle redirects explicitly.
-There are two types of agents:
+Because spawning browser processes or instantiating HTTP clients is expensive,
+Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
+irrecoverable errors are individual user agents destroyed and recreated. For example,
+when a browser process crashes, it is replaced with a new one and checked back
+into the pool. The next job that checks out the user agent gets a fresh
+browser process.
-1. Stateful agents, i.e. browsers, which carry state and support navigation.
-   These follow HTTP redirects implicitly.
-2. Stateless agents, which deal with HTTP requests/responses only.
-   These handle HTTP redirects explicitly.
+## Base interface for custom user agents
-## Implementation
+You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
+module and defining callback methods. The stateful and stateless interfaces
+share the following instance methods:
-Both types can be implemented with callback methods:
+* `#create` (__required__): Called when a new instance (browser process or HTTP client) is
+  needed.
+* `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
+  processes should be quit, and HTTP clients should be freed.
+* `#renew_on` (optional): Returns a list of exception classes upon which the existing
+  instance gets destroyed and replaced with a newly created one.
+## Stateless interface
+The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how
+Wayfarer provides redirect handling out of the box, with a configurable limit
+on the number of redirects to follow.
+In addition to the base interface, stateless user agents implement `#fetch`
+which fetches [pages](../pages) or indicates redirects:
+* `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
+  3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
+  deals with redirects on your behalf to avoid redirect loops. All other status
+  codes, including 4xx and 5xx, are considered successful and are indicated by calling
+  `success(url:, body:, status_code:, headers:)`.
+## Stateful interface
+In addition to the base interface, stateful user agents implement two additional
+methods:
+* `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
+  Stateful user agents follow redirects implicitly.
+* `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
+  into a [page](../pages).
+## Recreating user agents on error with `#renew_on`
+Agents can optionally implement `#renew_on` to get themselves recreated on
+certain errors.
+If `#fetch` or `#navigate` raise an exception and the exception class is listed
+in `#renew_on`, the instance is destroyed and recreated.
+* `#renew_on` (optional): A list of exception classes upon which the existing instance gets
+  destroyed and replaced with a newly created one.
+## Example implementations
 === "Stateful"
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
     class StatefulAgent
       include Wayfarer::Networking::Strategy
-      def renew_on # optional
-        [MyBrowser::IrrecoverableError]
-      end
+      # Required methods
       def create
         MyBrowser.new
       end
-      def destroy(browser) # optional
-        browser.quit
-      end
       def navigate(browser, url)
         browser.goto(url)
       end
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
                 status_code: browser.status_code,
                 headers: browser.headers)
       end
+      # Optional methods
+      def destroy(browser)
+        browser.quit
+      end
+      def renew_on
+        [MyBrowser::IrrecoverableError]
+      end
     end
     ```
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
     class StatelessAgent
       include Wayfarer::Networking::Strategy
-      def renew_on # optional
-        [MyClient::IrrecoverableError]
-      end
+      # Required methods
       def create
         MyClient.new
       end
-      def destroy(client) # optional
-        client.close
-      end
       def fetch(client, url)
         response = client.get(url)
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
                 status_code: response.status_code,
                 headers: response.headers)
       end
+      # Optional methods
+      def destroy(client)
+        client.close
+      end
+      def renew_on # optional
+        [MyClient::IrrecoverableError]
+      end
     end
     ```
-Register the strategy:
+Register and use the strategy:
 ```ruby
 Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
+Wayfarer.config[:network][:agent] = :my_agent
 ```
-Use the strategy:
-```ruby
-Wayfarer.config.network.agent = :my_agent
-```
-### Remarks
-#### Self-healing
-* A strategy's `#renew_on` method may return a list of exception classes upon
-  which the existing instance gets destroyed and replaced with a newly created
-  one.
-* Stateless clients must not raise exceptions when encountering certain HTTP
-  response codes (for example, 5xx).
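
The `#create` / `#destroy` / `#renew_on` contract described in the hunks above can be tried out without Wayfarer. The following is a minimal, self-contained sketch of the idea only: `FlakyClient`, `FlakyAgent` and `SketchPool` are hypothetical names invented for illustration, and the single-slot pool is an assumption, not Wayfarer's actual `Wayfarer::Networking::Pool`.

```ruby
# Hypothetical stand-in for a browser or HTTP client (not a real library).
class FlakyClient
  class Crashed < StandardError; end

  def initialize
    @calls = 0
  end

  # Fails on the second call to simulate an irrecoverable error.
  def get(url)
    @calls += 1
    raise Crashed, "client crashed" if @calls == 2

    "body of #{url}"
  end
end

# Hypothetical agent that follows the create/destroy/renew_on contract.
class FlakyAgent
  attr_reader :created, :destroyed

  def initialize
    @created = 0
    @destroyed = 0
  end

  def create
    @created += 1
    FlakyClient.new
  end

  def destroy(_client)
    @destroyed += 1
  end

  def renew_on
    [FlakyClient::Crashed]
  end
end

# Hypothetical single-slot pool: reuses one instance and replaces it
# when the block raises an exception class listed in #renew_on.
class SketchPool
  def initialize(agent)
    @agent = agent
    @instance = agent.create
  end

  def with
    yield @instance
  rescue *@agent.renew_on
    @agent.destroy(@instance)
    @instance = @agent.create
    raise # the caller still sees the error; only the instance is replaced
  end
end
```

In this sketch the failing call still raises, but the next caller of `SketchPool#with` receives a freshly created instance, which mirrors the "destroyed and replaced" wording of the documentation.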

data/docs/guides/networking/ferrum.md CHANGED

@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
 so:
 ```ruby
-Wayfarer.config.network.agent = :ferrum
+Wayfarer.config[:network][:agent] = :ferrum
 class DummyWorker < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Ferrum::Browser ...>
@@ -27,8 +27,8 @@ end
 === "Runtime"
     ```ruby
-    Wayfarer.config.network.agent = :ferrum
-    Wayfarer.config.ferrum.options = { headless: false, url: "http://chrome:3000" }
+    Wayfarer.config[:network][:agent] = :ferrum
+    Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
     ```
 === "Environment variables"

data/docs/guides/networking/http.md CHANGED

@@ -1,33 +1,29 @@
 # Plain HTTP
-Wayfarer can retrieve pages via plain HTTP requests, also alongside automated
-browsers.
+Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
+also alongside automated browsers.
-## Agent
+## Ad-hoc GET requests
-The HTTP agent is the default.
-## Ad-hoc requests
-When automating browsers, it can be useful to additionally retrieve the page
+When automating browsers, it can be useful to additionally retrieve another page
 over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route { to :index }
+  route.to :index
   def index
-    http.fetch(task.url) # => #<Wayfarer::Page ...>
+    http.fetch("https://example.com") # => #<Wayfarer::Page ...>
   end
 end
 ```
-By default, 3 redirects are followed, and this can be configured by passing the
-`follow` keyword:
+By default, 3 redirects are followed, and this number can be configured by
+passing the `follow` keyword:
 ```ruby
 http.fetch(url, follow: 5)
 ```
-If redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
+When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
 raised.
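
The `follow` limit described above can be illustrated with a small redirect loop. This is a sketch under assumptions: `follow_redirects` and `RedirectsExhausted` are hypothetical names, the block stands in for a stateless `#fetch`, and the exact way Wayfarer counts redirects may differ.

```ruby
# Hypothetical error, standing in for the real redirect-exhaustion error.
RedirectsExhausted = Class.new(StandardError)

# Follows redirects until a non-redirect response arrives.
# The block must return [:redirect, url] or [:success, page].
def follow_redirects(url, follow: 3, &fetch_once)
  redirects = 0

  loop do
    kind, value = fetch_once.call(url)
    return value if kind == :success

    # Another redirect arrived although the budget is already used up.
    raise RedirectsExhausted, "gave up after #{follow} redirects" if redirects >= follow

    redirects += 1
    url = value
  end
end
```

Because the loop owns the counter, a redirect cycle terminates with an error instead of spinning forever, which is the reason the documentation gives for stateless agents reporting redirects rather than following them on their own.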

data/docs/guides/networking/selenium.md CHANGED

@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
 so:
 ```ruby
-Wayfarer.config.network.agent = :selenium
+Wayfarer.config[:network][:agent] = :selenium
 class DummyWorker < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     browser # => #<Selenium::WebDriver ...>
@@ -27,10 +27,10 @@ process.
     Pages retrieved with a Selenium WebDriver return fake values:
     ```ruby
-    Wayfarer.config.network.agent = :selenium
+    Wayfarer.config[:network][:agent] = :selenium
     class DummyJob < Wayfarer::Base
-      route { to :index }
+      route.to :index
       def index
         page.headers     # => always {}
@@ -39,19 +39,18 @@ process.
     end
     ```
-!!! note "Consider using [Ferrum](../ferrum) instead"
-    Ferrum provides superior stability and a richer feature set compared to
-    Selenium drivers. However Ferrum automates only Google Chrome. Unless a
-    different browser is required, consider using Ferrum instead of Selenium.
+!!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
+    Use Ferrum if you want to automate Google Chrome. It provides superior
+    stability and a richer feature set compared to Selenium drivers.
 ## Configuring Selenium
 === "Runtime"
     ```ruby
-    Wayfarer.config.network.agent = :selenium
-    Wayfarer.config.selenium.driver = :firefox
-    Wayfarer.config.selenium.options = { url: "http://firefox" }
+    Wayfarer.config[:network][:agent] = :selenium
+    Wayfarer.config[:selenium][:driver] = :firefox
+    Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
     ```
 === "Environment variables"

data/docs/guides/pages.md CHANGED

@@ -1,11 +1,14 @@
 # Pages
-Retrieved pages take the shape of `Wayfarer::Page` objects and are available
-to jobs:
+A page is the immutable state of the contents behind a URL at a point in time,
+retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
+response, or the state of a remotely controlled browser.
 ```ruby
-class DummyJob < Wayfarer::Worker
-  route { to :index }
+class DummyJob < ActiveJob::Base
+  include Wayfarer::Base
+  route.to :index
   def index
     page # => #<Wayfarer::Page ...>
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
     page.url         # => "https://example.com"
     page.body        # => "<html>..."
     page.status_code # => 200
-    page.headers     # => { "Content-Type" => ... }
+    page.headers     # => { "content-type" => ... }
+    page.mime_type   # => #<MIME::Type: text/html>
+    # The lazily parsed response body or `nil`, depending on the Content-Type
+    page.doc # => #<Nokogiri::HTML::Document ...>
-    # A MetaInspector object for accessing page meta data.
     # See: https://github.com/metainspector/metainspector
+    page.meta # => #<MetaInspector::Document ...>
     # Examples:
     page.meta.links.internal
     page.meta.images.favicon
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
 end
 ```
+!!! info "HTTP headers are downcased and case-sensitive"
+    HTTP headers are downcased, so you would access
+    `page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
+## Response body parsing
+Wayfarer parses the bodies of HTML, XML and JSON responses according to their
+MIME types:
+* `application/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
+* `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
+* `application/json` to `Hash`
 ## Live pages
-When automating browsers, it is possible the page changes significantly at
-runtime, for example due to JavaScript altering the DOM or URL.
+`#!ruby page` initially returns a snapshot of the browser state
+immediately after the user agent navigated to the URL. The browser state may
+change significantly after the page was retrieved, for example due to your own
+interaction, or client-side JavaScript altering the DOM or URL.
-To access a page reflecting the current browser state, pass the `live` keyword:
+To get a page that reflects the current browser state, set the `#!ruby :live`
+keyword:
 ```ruby
 class DummyJob < Wayfarer::Worker
-  route { to :index }
+  route.to :index
   def index
     page # => #<Wayfarer::Page ...>
+    # Fill in forms, click buttons, etc.
     # Replaces the current Page object with a newer one,
     # taking into account the DOM as currently rendered by the browser.
     # Effectful only when automating browsers, no-op when using plain
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
   end
 end
 ```
+!!! attention "Stateless user agents ignore `#!ruby :live`"
+    The `#!ruby :live` option is ignored by stateless user agents, such as the
+    default `#!ruby :http` user agent. Instead, stateless user agents always
+    return the same page object.
+### Implementing a custom response body parser
+You can register an object that implements a `#parse` method for any MIME type:
+```ruby
+class MyJPEGParser
+  def parse(body)
+    # Read EXIF metadata here.
+    # Return value is accessible as `page.doc`
+  end
+end
+Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
+```
+!!! info "Handling responses without a Content-Type"
+    If a response has no `Content-Type` header, Wayfarer falls back to
+    `application/octet-stream`. A parser registered for
+    `application/octet-stream` will hence also handle all responses without
+    a Content-Type.
+## Accessing page metadata with MetaInspector
+You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
+document for accessing metadata of HTML pages. For example, to stage all links
+internal to the current hostname:
+```ruby
+def index
+  stage page.meta.links.internal
+end
+```
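
The parser lookup described in this file (downcased headers, dispatch by MIME type, fallback to `application/octet-stream`) can be sketched without Wayfarer. `SketchParsing` and its parser classes are hypothetical names; the sketch assumes the registry is a plain hash keyed by MIME type string and is not the gem's actual `Wayfarer::Parsing` code.

```ruby
require "json"

module SketchParsing
  # Hypothetical parser for JSON bodies.
  class JSONParser
    def parse(body)
      JSON.parse(body)
    end
  end

  # Hypothetical catch-all parser for responses without a Content-Type.
  class BinaryParser
    def parse(body)
      { bytes: body.bytesize }
    end
  end

  # MIME type string => object responding to #parse.
  def self.registry
    @registry ||= {
      "application/json" => JSONParser.new,
      "application/octet-stream" => BinaryParser.new
    }
  end

  # Returns the parsed body, or nil when no parser is registered.
  # Header names are expected to be downcased already.
  def self.parse(headers, body)
    content_type = headers.fetch("content-type", "application/octet-stream")
    mime_type = content_type.split(";").first.strip.downcase
    parser = registry[mime_type]
    parser&.parse(body)
  end
end
```

In the sketch, `SketchParsing.parse({}, "abc")` falls through to the `application/octet-stream` parser, while an unregistered type such as `image/png` yields `nil`.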

data/docs/guides/redis.md ADDED

@@ -0,0 +1,10 @@
+# Redis
+Wayfarer uses Redis to keep track of:
+* URLs that were already processed within a batch
+* the number of jobs left in a batch
+## Garbage collection
+Wayfarer cleans up batch-related data

data/docs/guides/routing.md ADDED

@@ -0,0 +1,74 @@
+# Routing
+Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
+either instance methods denoted by symbols, or [handlers](/guides/handlers).
+A job's route declarations equate to a predicate tree.
+When a URL is routed, the predicate tree is searched depth-first. If a
+matching leaf predicate is found, the found path's action is dispatched,
+along with `params` collected from path parameters.
+The following routes:
+```ruby
+route.host "example.com", scheme: :https do
+  path "/contact", to: :contact
+  path "/users/:id", to: [UserHandler, :show]
+end
+```
+Equate to the following predicate tree:
+```mermaid
+flowchart LR
+  RootRoute-->Host["Host <code>example.com</code>"]
+  Host-->Scheme["Scheme <code>:https</code>"]
+  Scheme-->Path1["Path <code>/contact</code>"]
+  Scheme-->Path2["Path <code>/users/:id</code>"]
+  Path1-->TargetRoute1["Target <code>:contact</code>"]
+  Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
+```
+An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
+```mermaid
+flowchart LR
+  RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
+  Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
+  Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
+  Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
+  Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
+  Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
+  classDef active fill:#7CB342,stroke:#7CB342,color:#fff
+  classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
+  classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
+```
+You can also visualise an invocation of the predicate tree on the command line
+with `wayfarer tree`:
+```
+wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
+Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
+└──Host("example.com", match: true)
+   └──Scheme(:https, match: true)
+      ├──Path("/contact", match: false)
+      │  └──Target(match: true)
+      └──Path("/users/:id", match: true)
+         └──Target(match: true)
+            └──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
+```
+As you can see, `Target` nodes always match. This means that we could have also defined
+our routes as:
+```ruby
+route.host "example.com", scheme: :https do
+  to :contact do
+    path "/contact"
+  end
+  to [UserHandler, :show] do
+    path "/users/:id"
+  end
+end
+```
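
The depth-first search over a predicate tree described in this file can be sketched in a few lines. `SketchRouting` and its helper constructors are hypothetical; the sketch assumes each predicate returns a params hash on a match and `nil` otherwise, and it does not reproduce Wayfarer's routing DSL or its `Target` nodes.

```ruby
require "uri"

module SketchRouting
  # A tree node: a predicate, child nodes, and an action on leaves.
  Node = Struct.new(:predicate, :children, :action)

  def self.host(name, children)
    Node.new(->(uri) { uri.host == name ? {} : nil }, children, nil)
  end

  def self.scheme(name, children)
    Node.new(->(uri) { uri.scheme == name.to_s ? {} : nil }, children, nil)
  end

  # Leaf node: matches the whole path and collects ":name" segments as params.
  def self.path(pattern, action)
    segments = pattern.split("/")
    names = segments.select { |s| s.start_with?(":") }.map { |s| s[1..].to_sym }
    source = segments.map { |s| s.start_with?(":") ? "([^/]+)" : Regexp.escape(s) }.join("/")
    regex = Regexp.new("\\A#{source}\\z")

    predicate = lambda do |uri|
      match = regex.match(uri.path)
      match && names.zip(match.captures).to_h
    end
    Node.new(predicate, [], action)
  end

  # Depth-first search: returns [action, params] for the first matching leaf.
  def self.search(node, uri, params = {})
    matched = node.predicate.call(uri)
    return nil unless matched

    params = params.merge(matched)
    return [node.action, params] if node.children.empty?

    node.children.each do |child|
      found = search(child, uri, params)
      return found if found
    end
    nil
  end
end
```

A tree mirroring the routes above can be built with `SketchRouting.host("example.com", [SketchRouting.scheme(:https, [...])])`; searching it with `URI("https://example.com/users/42")` descends through the host and scheme nodes before trying each path leaf in declaration order.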

data/docs/guides/tasks.md CHANGED

@@ -1,14 +1,38 @@
 # Tasks
-Tasks are the immutable units of work processed by [jobs](/guides/jobs). A task
-consists of:
+Tasks are the immutable units of work read from a message queue and processed by
+[jobs](/guides/jobs). A task consists of two strings:
-1. The __URL__ to process
-    * Within a batch, every URL gets processed at most once.
+* The __URL__ to process
+* The __batch__ the task belongs to
-2. The __batch__ the task belongs to
-    * Like URLs, batches are strings.
+A job processing a task commonly appends more tasks to the queue in turn.
-Tasks get appended to the end of a message queue, and consumed from the
-beginning. Because jobs can enqueue other tasks, jobs are both consumers
-and producers of tasks.
+!!! info "Task URLs are not normalized"
+    The URL returned by `task.url` is not normalized but verbatim
+    as it was staged or enqueued.
+## Task deduplication
+Wayfarer ensures that no URL gets processed twice within a batch. It achieves
+this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
+keyed by normalized URLs.
+### URL normalization
+Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
+and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
+URL normalization is used only for deduplication, and does not affect the URL
+returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
+enqueued. This allows you to follow the exact URLs you may have parsed from a
+response body.
+## Invalid URLs
+Tasks with invalid URLs (for example `ht%0atp://localhost/`, a newline in the
+protocol) are discarded, since they can't get retrieved. No exception is raised,
+and the job is considered successfully processed, since there are no corrective
+actions an error handler could take as tasks are immutable, and retries would
+not change the outcome.
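
The interplay of deduplication, normalization and invalid URLs described above can be sketched with Ruby's standard library. This is an illustration under assumptions: `SketchDedup` is a hypothetical name, an in-memory `Set` replaces the Redis hash, and stdlib `URI#normalize` stands in for the Addressable and `normalize_url` gems, whose normalization rules are more thorough.

```ruby
require "set"
require "uri"

# Hypothetical in-memory stand-in for the per-batch deduplication store.
class SketchDedup
  def initialize
    @seen = Set.new
  end

  # Returns true on the first visit, false on a repeat,
  # and nil when the URL is invalid and should be discarded.
  def first_visit?(raw_url)
    key = normalize(raw_url)
    return nil if key.nil?

    !@seen.add?(key).nil?
  end

  private

  # Normalization is used for the lookup key only; the caller keeps
  # working with the verbatim URL.
  def normalize(raw_url)
    uri = URI.parse(raw_url)
    return nil unless uri.is_a?(URI::HTTP) && uri.host

    normalized = uri.normalize # downcases scheme and host, "" path becomes "/"
    normalized.fragment = nil
    normalized.to_s
  rescue URI::InvalidURIError
    nil
  end
end
```

With this sketch, `https://Example.com/a#top` and `https://example.com/a` map to the same key, so only the first of the two counts as a first visit, while `ht%0atp://localhost/` is reported as invalid without raising.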