RubyGems - wayfarer - Versions diffs - 0.4.7 → 0.4.8 - Mend

wayfarer 0.4.7 → 0.4.8


Files changed (183)

checksums.yaml +4 -4
data/.env +17 -0
data/.github/workflows/lint.yaml +8 -6
data/.github/workflows/release.yaml +4 -3
data/.github/workflows/tests.yaml +5 -14
data/.gitignore +2 -2
data/.rubocop.yml +31 -0
data/.vale.ini +6 -3
data/Dockerfile +3 -2
data/Gemfile +21 -0
data/Gemfile.lock +233 -128
data/Rakefile +7 -0
data/docker-compose.yml +13 -14
data/docs/guides/callbacks.md +3 -1
data/docs/guides/configuration.md +10 -35
data/docs/guides/development.md +67 -0
data/docs/guides/handlers.md +7 -7
data/docs/guides/jobs.md +54 -11
data/docs/guides/networking/custom_adapters.md +31 -10
data/docs/guides/pages.md +24 -22
data/docs/guides/routing.md +116 -34
data/docs/guides/tasks.md +30 -10
data/docs/guides/tutorial.md +23 -17
data/docs/guides/user_agents.md +11 -9
data/lib/wayfarer/base.rb +9 -8
data/lib/wayfarer/batch_completion.rb +18 -14
data/lib/wayfarer/callbacks.rb +14 -14
data/lib/wayfarer/cli/route_printer.rb +78 -96
data/lib/wayfarer/cli.rb +12 -30
data/lib/wayfarer/gc.rb +6 -1
data/lib/wayfarer/kv.rb +28 -0
data/lib/wayfarer/middleware/chain.rb +7 -1
data/lib/wayfarer/middleware/content_type.rb +20 -15
data/lib/wayfarer/middleware/dedup.rb +9 -3
data/lib/wayfarer/middleware/dispatch.rb +7 -2
data/lib/wayfarer/middleware/normalize.rb +4 -12
data/lib/wayfarer/middleware/router.rb +1 -1
data/lib/wayfarer/middleware/uri_parser.rb +4 -3
data/lib/wayfarer/networking/context.rb +12 -1
data/lib/wayfarer/networking/ferrum.rb +1 -4
data/lib/wayfarer/networking/follow.rb +2 -1
data/lib/wayfarer/networking/pool.rb +12 -7
data/lib/wayfarer/networking/selenium.rb +15 -7
data/lib/wayfarer/page.rb +0 -2
data/lib/wayfarer/parsing/xml.rb +1 -1
data/lib/wayfarer/parsing.rb +2 -5
data/lib/wayfarer/redis/barrier.rb +15 -2
data/lib/wayfarer/redis/counter.rb +1 -2
data/lib/wayfarer/routing/dsl.rb +166 -31
data/lib/wayfarer/routing/hash_stack.rb +33 -0
data/lib/wayfarer/routing/matchers/custom.rb +8 -5
data/lib/wayfarer/routing/matchers/{suffix.rb → empty_params.rb} +2 -6
data/lib/wayfarer/routing/matchers/host.rb +15 -9
data/lib/wayfarer/routing/matchers/path.rb +11 -33
data/lib/wayfarer/routing/matchers/query.rb +41 -17
data/lib/wayfarer/routing/matchers/result.rb +12 -0
data/lib/wayfarer/routing/matchers/scheme.rb +13 -5
data/lib/wayfarer/routing/matchers/url.rb +13 -5
data/lib/wayfarer/routing/path_consumer.rb +130 -0
data/lib/wayfarer/routing/path_finder.rb +151 -23
data/lib/wayfarer/routing/result.rb +1 -1
data/lib/wayfarer/routing/root_route.rb +14 -2
data/lib/wayfarer/routing/route.rb +71 -14
data/lib/wayfarer/routing/serializable.rb +28 -0
data/lib/wayfarer/routing/sub_route.rb +53 -0
data/lib/wayfarer/routing/target_route.rb +17 -1
data/lib/wayfarer/stringify.rb +1 -2
data/lib/wayfarer/task.rb +3 -5
data/lib/wayfarer/uri/normalization.rb +120 -0
data/lib/wayfarer.rb +50 -10
data/mise.toml +2 -0
data/mkdocs.yml +8 -17
data/rake/lint.rake +0 -96
data/rake/release.rake +5 -11
data/rake/tests.rake +8 -4
data/requirements.txt +1 -1
data/spec/factories/job.rb +8 -0
data/spec/factories/middleware.rb +2 -2
data/spec/factories/path_finder.rb +11 -0
data/spec/factories/redis.rb +19 -0
data/spec/factories/task.rb +39 -1
data/spec/spec_helpers.rb +50 -57
data/spec/support/active_job_helpers.rb +8 -0
data/spec/support/integration_helpers.rb +21 -0
data/spec/support/redis_helpers.rb +9 -0
data/spec/support/test_app.rb +64 -43
data/spec/{base_spec.rb → wayfarer/base_spec.rb} +32 -36
data/spec/wayfarer/batch_completion_spec.rb +142 -0
data/spec/wayfarer/cli/job_spec.rb +88 -0
data/spec/wayfarer/cli/routing_spec.rb +322 -0
data/spec/{cli → wayfarer/cli}/version_spec.rb +1 -1
data/spec/wayfarer/gc_spec.rb +29 -0
data/spec/{handler_spec.rb → wayfarer/handler_spec.rb} +1 -3
data/spec/{integration → wayfarer/integration}/callbacks_spec.rb +9 -6
data/spec/wayfarer/integration/content_type_spec.rb +37 -0
data/spec/wayfarer/integration/custom_routing_spec.rb +51 -0
data/spec/{integration → wayfarer/integration}/gc_spec.rb +9 -13
data/spec/{integration → wayfarer/integration}/handler_spec.rb +9 -10
data/spec/{integration → wayfarer/integration}/page_spec.rb +8 -6
data/spec/{integration → wayfarer/integration}/params_spec.rb +4 -4
data/spec/{integration → wayfarer/integration}/parsing_spec.rb +7 -33
data/spec/wayfarer/integration/retry_spec.rb +112 -0
data/spec/{integration → wayfarer/integration}/stage_spec.rb +5 -5
data/spec/{middleware → wayfarer/middleware}/batch_completion_spec.rb +4 -5
data/spec/{middleware → wayfarer/middleware}/chain_spec.rb +20 -15
data/spec/{middleware → wayfarer/middleware}/content_type_spec.rb +18 -21
data/spec/{middleware → wayfarer/middleware}/controller_spec.rb +22 -20
data/spec/wayfarer/middleware/dedup_spec.rb +66 -0
data/spec/wayfarer/middleware/normalize_spec.rb +32 -0
data/spec/{middleware → wayfarer/middleware}/router_spec.rb +18 -20
data/spec/{middleware → wayfarer/middleware}/stage_spec.rb +11 -10
data/spec/wayfarer/middleware/uri_parser_spec.rb +63 -0
data/spec/{middleware → wayfarer/middleware}/user_agent_spec.rb +34 -32
data/spec/wayfarer/networking/capybara_spec.rb +13 -0
data/spec/{networking → wayfarer/networking}/context_spec.rb +46 -38
data/spec/wayfarer/networking/ferrum_spec.rb +13 -0
data/spec/{networking → wayfarer/networking}/follow_spec.rb +9 -4
data/spec/wayfarer/networking/http_spec.rb +12 -0
data/spec/{networking → wayfarer/networking}/pool_spec.rb +11 -9
data/spec/wayfarer/networking/selenium_spec.rb +12 -0
data/spec/{networking → wayfarer/networking}/strategy.rb +33 -54
data/spec/{page_spec.rb → wayfarer/page_spec.rb} +3 -3
data/spec/{parsing → wayfarer/parsing}/json_spec.rb +1 -1
data/spec/{parsing/xml_spec.rb → wayfarer/parsing/xml_parse_spec.rb} +4 -3
data/spec/{redis → wayfarer/redis}/barrier_spec.rb +5 -4
data/spec/wayfarer/redis/counter_spec.rb +34 -0
data/spec/{redis → wayfarer/redis}/pool_spec.rb +3 -2
data/spec/{routing → wayfarer/routing}/dsl_spec.rb +12 -22
data/spec/wayfarer/routing/hash_stack_spec.rb +63 -0
data/spec/wayfarer/routing/integration_spec.rb +101 -0
data/spec/wayfarer/routing/matchers/custom_spec.rb +39 -0
data/spec/wayfarer/routing/matchers/host_spec.rb +56 -0
data/spec/wayfarer/routing/matchers/matcher.rb +17 -0
data/spec/wayfarer/routing/matchers/path_spec.rb +43 -0
data/spec/wayfarer/routing/matchers/query_spec.rb +123 -0
data/spec/wayfarer/routing/matchers/scheme_spec.rb +45 -0
data/spec/wayfarer/routing/matchers/url_spec.rb +33 -0
data/spec/wayfarer/routing/path_consumer_spec.rb +123 -0
data/spec/wayfarer/routing/path_finder_spec.rb +409 -0
data/spec/wayfarer/routing/root_route_spec.rb +51 -0
data/spec/wayfarer/routing/route_spec.rb +74 -0
data/spec/wayfarer/routing/sub_route_spec.rb +103 -0
data/spec/wayfarer/uri/normalization_spec.rb +98 -0
data/spec/wayfarer_spec.rb +2 -2
data/wayfarer.gemspec +17 -28
metadata +768 -246
data/.rbenv-gemsets +0 -1
data/.ruby-version +0 -1
data/RELEASING.md +0 -17
data/docs/cookbook/user_agent.md +0 -7
data/docs/design.md +0 -36
data/docs/guides/jobs/error_handling.md +0 -40
data/docs/reference/configuration.md +0 -36
data/spec/batch_completion_spec.rb +0 -104
data/spec/cli/job_spec.rb +0 -74
data/spec/cli/routing_spec.rb +0 -101
data/spec/fixtures/dummy_job.rb +0 -9
data/spec/gc_spec.rb +0 -17
data/spec/integration/content_type_spec.rb +0 -145
data/spec/integration/routing_spec.rb +0 -18
data/spec/middleware/dedup_spec.rb +0 -71
data/spec/middleware/dispatch_spec.rb +0 -59
data/spec/middleware/normalize_spec.rb +0 -60
data/spec/middleware/uri_parser_spec.rb +0 -53
data/spec/networking/capybara_spec.rb +0 -12
data/spec/networking/ferrum_spec.rb +0 -12
data/spec/networking/http_spec.rb +0 -12
data/spec/networking/selenium_spec.rb +0 -12
data/spec/redis/counter_spec.rb +0 -44
data/spec/routing/integration_spec.rb +0 -110
data/spec/routing/matchers/custom_spec.rb +0 -31
data/spec/routing/matchers/host_spec.rb +0 -49
data/spec/routing/matchers/path_spec.rb +0 -43
data/spec/routing/matchers/query_spec.rb +0 -137
data/spec/routing/matchers/scheme_spec.rb +0 -25
data/spec/routing/matchers/suffix_spec.rb +0 -41
data/spec/routing/matchers/uri_spec.rb +0 -27
data/spec/routing/path_finder_spec.rb +0 -33
data/spec/routing/root_route_spec.rb +0 -29
data/spec/routing/route_spec.rb +0 -43
data/docs/{reference → guides}/cli.md +0 -0
data/spec/{stringify_spec.rb → wayfarer/stringify_spec.rb} +2 -2
data/spec/{task_spec.rb → wayfarer/task_spec.rb} +0 -0

data/Rakefile CHANGED

@@ -5,7 +5,14 @@ require "open-uri"
 require "zip"
 require "bundler/gem_tasks"
+require "pry"
 Dir.glob("rake/*.rake").each { |file| load(file) }
 task default: :build
+task :console do
+  require_relative "lib/wayfarer"
+  Pry.start
+end

data/docker-compose.yml CHANGED

@@ -1,15 +1,14 @@
-version: "3"
 services:
   wayfarer:
     build: .
     tty: true
     volumes:
-      - "./:/opt/app"
+      - ./:/opt/app
     ports:
-      - "9876:9876"
-    environment:
-      - CI=true
+      - "${WAYFARER_PORT}:${WAYFARER_PORT}"
     hostname: test
+    environment:
+      CI: "${CI}"
     depends_on:
       - redis
       - chrome
@@ -17,26 +16,26 @@ services:
       - docs
   redis:
-    image: redis
+    image: ${REDIS_IMAGE}:${REDIS_VERSION}
   chrome:
-    image: browserless/chrome
+    image: ${CHROME_IMAGE}:${CHROME_VERSION}
     ports:
-      - "3000:3000"
+      - "${CHROME_PORT}:${CHROME_PORT}"
   firefox:
-    image: selenium/standalone-firefox:4.0.0-rc-2-prerelease-20210923
+    image: ${FIREFOX_IMAGE}:${FIREFOX_VERSION}
     ports:
-      - "4444:4444"
+      - "${FIREFOX_PORT}:${FIREFOX_PORT}"
     volumes:
-      - "/dev/shm:/dev/shm"
+      - /dev/shm:/dev/shm
   docs:
-    image: squidfunk/mkdocs-material:9.5.9
+    image: ${DOCS_IMAGE}:${DOCS_VERSION}
     volumes:
-      - "./:/docs"
+      - ./:/docs
     ports:
-      - "8000:8000"
+      - "${DOCS_PORT}:${DOCS_PORT}"
 networks:
   default:

data/docs/guides/callbacks.md CHANGED

@@ -1,7 +1,7 @@
 # Callbacks
 Wayfarer supports a number of callbacks in addition to
-[ActiveJob's](https://edgeguides.rubyonrails.org/active_job_basics.html#callbacks).
+[ActiveJob callbacks](https://edgeguides.rubyonrails.org/active_job_basics.html#callbacks).
 ## Available callbacks
@@ -20,6 +20,8 @@ to process in a batch. Wayfarer instruments job execution and in- or decrements
 an integer counter in Redis on certain events. When the counter reaches zero,
 the current job's `after_batch` callbacks run.
+!!! info "`after_batch` callbacks fire at most once per batch."
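
The counter behaviour described in this hunk can be sketched in plain Ruby. This is an in-memory stand-in under stated assumptions: `BatchCounter` and its method names are hypothetical, and Wayfarer itself keeps the counter in Redis rather than in process memory.

```ruby
# Sketch only: BatchCounter is a hypothetical name, not part of Wayfarer's API.
# It mimics a per-batch counter that triggers a callback when it reaches zero.
class BatchCounter
  def initialize(&after_batch)
    @count = 0
    @fired = false
    @mutex = Mutex.new
    @after_batch = after_batch
  end

  # Called when a task is appended to the batch.
  def increment
    @mutex.synchronize { @count += 1 }
  end

  # Called when a task finishes. Runs the callback the first time the
  # counter drops to zero, and never again afterwards.
  def decrement
    fire = @mutex.synchronize do
      @count -= 1
      next false if @count.positive? || @fired

      @fired = true
    end
    @after_batch.call if fire
  end
end
```

The `@fired` flag is what makes the callback run at most once, even if more tasks are counted in and out after the counter first reached zero.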
 ## Conditional callbacks
 You can make callbacks conditional with the `#!ruby :if` and `#!ruby :unless`

data/docs/guides/configuration.md CHANGED

@@ -1,39 +1,14 @@
-# Configuration
-Wayfarer can be configured in two ways:
-1. Using [environment variables](/reference/environment_variables)
-2. Using runtime configuration
-## Runtime configuration
-Wayfarer parses environment variables into a runtime configuration
-`Wayfarer::Config`. The configuration can then be altered or replaced via
-`Wayfarer.config`:
-```ruby
-# Which user agent to use to process tasks
-Wayfarer.config[:network][:agent] = :http # or :ferrum, :selenium
+---
+hide:
+  - toc
+---
-# How many user agents to instantiate
-Wayfarer.config[:network][:pool_size] = 3
-# How long an agent may be used while processing a task
-Wayfarer.config[:network][:pool_timeout] = 5000
-# Ferrum options
-Wayfarer.config[:ferrum][:options] = {}
-# Selenium driver to use
-Wayfarer.config[:selenium][:driver] = :chrome
-# Selenium HTTP client read timeout
-Wayfarer.config[:selenium][:client_timeout] = 10 # seconds
+# Configuration
-# Selenium options
-Wayfarer.config[:selenium][:options] = { url: "http://chrome" }
+You can configure Wayfarer by assigning to `Wayfarer.config` which defaults to:
-# HTTP request headers (Selenium is unsupported)
-Wayfarer.config[:network][:http_headers] = { "Field" => "Value" }
+```rb
+module Wayfarer
+--8<-- "lib/wayfarer.rb:48:96"
+end
 ```

data/docs/guides/development.md ADDED

@@ -0,0 +1,67 @@
+# Development
+## Release Procedure
+1. Ensure `Wayfarer::VERSION` was bumped appropriately.
+2. Ensure the version in wayfarer.gemspec matches.
+3. Open a release Pull Request from the develop branch into master
+4. Merge the Pull Request
+5. Publish RubyGem and git tag as follows:
+```
+git checkout master
+git pull origin master --rebase
+bundle exec rake build
+gem push build/wayfarer-*.gem
+bundle exec rake clean
+git tag <VERSION>
+git push origin <VERSION>
+```
+## Conventions and guidelines
+* In source code, `url` refers to strings and `uri` refers to `Addressable::URI`
+* Avoid writing bash at all costs. Use Ruby instead
+## Design decisions and architecture
+### Navigate the web along URL patterns
+URLs are less prone to change than served markup.
+One reason for this is that changes to a URL's path can have negative
+consequences for its page ranking in search engines. Websites naturally implement
+architectural URL patterns like REST or expose surrogate keys.
+### Follow URLs verbatim as they appear in responses
+Normalized URLs are useful for deduplication, but URLs should be followed
+as they appear in responses. Navigating to normalized versions of URLs makes
+crawlers stick out from other user agents.
+### Tasks are version-less and don't persist metadata
+Tasks serialize to their URL and batch. No other data gets written to
+the message queue. There is also no need for versioning persisted tasks, since
+there will never be more to a task than URL and batch. All task metadata
+is ephemeral.
+### Why depend on Redis
+There are two core features that depend on Redis. First, per-batch acyclicity is
+achieved by maintaining the set of processed URLs per batch in Redis.
+There's no option to follow links in a cyclic manner. Second, batch completion
+requires updating an integer value in Redis, and batch completion is a very
+useful feature, since most crawls should end eventually, and often you want to
+know when.
+### No configuration files
+Wayfarer can be configured through `Wayfarer.config` only, because `Wayfarer.config`
+may contain Ruby objects that don't de/serialize well, such as `Proc`s or `Set`s.
+### Features out of scope
+Wayfarer won't provide:
+* persistence or any sort of DOM data mapping abstractions
+* URL generation helpers

data/docs/guides/handlers.md CHANGED

@@ -1,16 +1,16 @@
 # Handlers
-[Jobs](/jobs) can route tasks to handlers to delegate processing without
-writes to the message queue. Unlike jobs, handlers don't inherit from
-`ActiveJob::Base` and therefore cannot be enqueued. Handlers have routes, too,
-but they don't retrieve pages and a handler's router can be bypassed.
+Handlers are like [jobs](/jobs) but they don't inherit from `ActiveJob::Base`
+which is why they can't affect the message queue directly themselves.
+Instead, jobs and handlers can route tasks to other handlers. Handlers
+themselves have routes, but they can be bypassed.
-## Supported features
+## Handler capabilities
-Handlers support a subset of features compared to `Wayfarer::Base`:
+Like jobs, handlers support:
 * URL routing
-* enqueueing tasks with `#!ruby stage(*urls)`
+* staging tasks with `#!ruby stage(*urls)`
 * jobs can access the `user_agent` that retrieved the `page`
 * ad-hoc HTTP requests with `#!ruby fetch(url)`
 * callbacks, but only a subset of job callbacks

data/docs/guides/jobs.md CHANGED

@@ -1,13 +1,15 @@
 # Jobs
 Jobs are [Active Job](https://edgeguides.rubyonrails.org/active_job_basics.html)s
-that use a DSL included from the `Wayfarer::Base` module to process [tasks](/guides/tasks)
-that they read from a message queue.
-Instead of implementing Active Job's `#perform` method yourself, you declare routes
-to instance methods, similiar to how web applications route incoming requests.
-Only URLs that match a [route](../routing) are requested or navigated to.
-The action method has access to the retrieved [page](../pages),
-the [user agent](../user-agents) that retrieved the page and the current task:
+that use a DSL to process [tasks](/guides/tasks) that they read from a message
+queue.
+Instead of implementing Active Job's `#perform` method yourself, you declare
+[routes](../routing) to instance methods, like web applications route incoming
+requests. Only URLs that match a route are retrieved and processed. All other
+URLs are considered successfully processed. The action has access to the
+retrieved [page](../pages), the [user agent](../user-agents) that retrieved the
+page and the current task:
 ```ruby
 class DummyJob < ActiveJob::Base
@@ -24,7 +26,7 @@ end
 ```
 You can start a crawl by appending a task to the message queue for the URL with
-`::crawl`. By default, a UUID is generated as the batch:
+`::crawl`. If you don't provide a batch, Wayfarer generates a UUID:
 ```ruby
 task = DummyJob.crawl("https://example.com")
@@ -51,10 +53,10 @@ You can also use Wayfarer's [CLI](../cli) to enqueue a task:
 wayfarer enqueue --batch my-batch DummyJob "https://example.com"
 ```
-## Navigating crawls
+## Following URLs
-Jobs navigate crawls by staging URLs with `#!ruby stage(urls)`. When you stage a URL, a normalized
-version of it is appended to an internal set. Once the action returns, all URLs
+Jobs navigate crawls by staging URLs with `stage(urls)`. When you stage a URL,
+it is appended verbatim to an internal set. Once the action returns, all URLs
 in the set are appended as tasks to the message queue.
 ```ruby
@@ -167,3 +169,44 @@ end
     Content-Types are compared regardless of their parameters. For example,
     `text/html; charset=UTF-8` is considered the same as `text/html`.
+## Handling errors
+!!! danger "Only ActiveJob error handling is supported"
+    Wayfarer exclusively supports ActiveJob's error handling. You cannot use
+    message queue-specific error handling, for example error handling with
+    `sidekiq_options` is unsupported. Otherwise batches get garbage-collected
+    too early as Wayfarer instruments ActiveJob.
+Wayfarer relies on ActiveJob's [error handling methods](https://guides.rubyonrails.org/active_job_basics.html#exceptions):
+* `retry_on` to retry jobs a number of times on certain errors:
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      retry_on MyError, attempts: 3 do |job, error|
+        # This block runs once all 3 attempts have failed
+        # (1 initial attempt + 2 retries)
+      end
+    end
+    ```
+* `discard_on` to throw away jobs on certain errors:
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      discard_on MyError do |job, error|
+        # This block runs once and buries the job
+      end
+    end
+    ```
+## Recreating user agents on certain errors
+You can configure a list of exception classes upon which user agents
+get recreated (see [User agent API]()):
+```ruby
+Wayfarer.config[:network][:renew_on] = [MyIrrecoverableError]
+```
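
How a pool might replace a user agent on the configured errors can be sketched as follows. This is an illustration under stated assumptions: `AgentPool`, its `with` method and the factory block are hypothetical names, and only the idea of a `renew_on` list of exception classes comes from the guide above.

```ruby
# Sketch only: AgentPool is hypothetical and not Wayfarer's pool implementation.
class AgentPool
  # size:     how many agents to create up front
  # renew_on: exception classes that cause the checked-out agent to be replaced
  # factory:  block that builds a fresh agent
  def initialize(size:, renew_on:, &factory)
    @renew_on = renew_on
    @factory = factory
    @agents = Queue.new
    size.times { @agents << factory.call }
  end

  # Checks out an agent, yields it, and always checks one back in.
  def with
    agent = @agents.pop
    yield agent
  rescue *@renew_on
    agent = @factory.call # swap the broken agent for a fresh one
    raise                 # re-raise so the job's error handling still runs
  ensure
    @agents << agent if agent
  end
end
```

Errors outside the `renew_on` list pass through untouched, so the same agent returns to the pool and its state carries over to the next job.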

data/docs/guides/networking/custom_adapters.md CHANGED

@@ -1,21 +1,42 @@
 # User agent API
 Wayfarer retrieves web pages with user agents. There are two types of user
-agents: __stateful__ browsers which carry state and follow redirects implicitly,
-and __stateless__ HTTP clients, which handle redirects explicitly.
+agents: __stateful__ browsers which carry state and follow redirects implicitly
+as they navigate to a URL, and __stateless__ HTTP clients, which handle
+redirects explicitly.
+|                   | Stateless adapters | Stateful adapters |
+|-------------------|--------------------|-------------------|
+| interactive       | no                 | yes               |
+| redirect handling | explicit           | implicit          |
 Because spawning browser processes or instantiating HTTP clients is expensive,
-Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
-irrecoverable errors are individual user agents destroyed and recreated. For example,
-when a browser process crashes, it is replaced with a new one and checked back
-into the pool. The next job that checks out the user agent gets a fresh
-browser process.
+Wayfarer keeps user agents in a pool and reuses them across jobs. This means
+that browser state carries over between jobs, as a job checks out a previous
+job's user agent. Only on certain irrecoverable errors are individual user agents
+destroyed and recreated. For example when a browser process crashes, it is
+replaced with a fresh browser process.
 ## Base interface for custom user agents
-You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
-module and defining callback methods. The interfaces for stateful and stateless
-share the following instance methods:
+You implement both stateful and stateless agents by including the
+`Wayfarer::Networking::Strategy` module and defining callback methods. The
+interfaces for stateful and stateless share the following base methods:
+```mermaid
+classDiagram
+class Square~Shape~{
+    int id
+    List~int~ position
+    setPoints(List~int~ points)
+    getPoints() List~int~
+}
+Square : -List~string~ messages
+Square : +setMessages(List~string~ messages)
+Square : +getMessages() List~string~
+Square : +getDistanceMatrix() List~List~int~~
+```
 * `#create` (__required__): Called when a new instance (browser process or HTTP client) is
   needed.

data/docs/guides/pages.md CHANGED

@@ -47,6 +47,30 @@ MIME types:
 * `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
 * `application/json` to `Hash`
+### Implementing a custom response body parser
+You can register an object that implements a `#parse` method for any MIME type:
+```ruby
+class MyJPEGParser
+  def parse(body)
+    # Read EXIF metadata here.
+    # Return value is accessible as `page.doc`
+  end
+end
+Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
+```
+!!! warning "`#parse` must be thread-safe!"
+!!! info "Handling responses without a Content-Type"
+    If a response has no `Content-Type` header, Wayfarer falls back to
+    `application/octet-stream`. A parser registered for
+    `application/octet-stream` will hence also handle all responses without
+    a Content-Type.
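
The lookup rules described here (parameters are ignored, a missing header falls back to `application/octet-stream`) can be sketched with a plain Hash. The names `REGISTRY`, `FALLBACK_TYPE`, `parser_for` and the two parser classes are assumptions made for illustration and are not taken from `Wayfarer::Parsing`.

```ruby
require "json"

# Sketch only: these names are hypothetical, not Wayfarer's parsing API.
class JSONParser
  def parse(body)
    JSON.parse(body)
  end
end

class RawParser
  def parse(body)
    body
  end
end

FALLBACK_TYPE = "application/octet-stream"

REGISTRY = {
  "application/json" => JSONParser.new,
  FALLBACK_TYPE => RawParser.new
}.freeze

# Returns the parser registered for a Content-Type header value, or nil.
# Parameters such as "; charset=UTF-8" are ignored, and a missing header
# is treated as application/octet-stream.
def parser_for(content_type, registry = REGISTRY)
  mime = content_type.to_s.split(";").first.to_s.strip.downcase
  mime = FALLBACK_TYPE if mime.empty?
  registry[mime]
end
```

Because parsers are shared objects in such a registry, keeping `#parse` free of mutable instance state is one simple way to satisfy the thread-safety warning above.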
 ## Live pages
 `#!ruby page` initially returns a snapshot of the browser state
@@ -83,28 +107,6 @@ end
     default `#!ruby :http` user agent. Instead, stateless user agents always
     return the same page object.
-### Implementing a custom response body parser
-You can register an object that implements a `#parse` method for any MIME type:
-```ruby
-class MyJPEGParser
-  def parse(body)
-    # Read EXIF metadata here.
-    # Return value is accessible as `page.doc`
-  end
-end
-Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
-```
-!!! info "Handling responses without a Content-Type"
-    If a response has no `Content-Type` header, Wayfarer falls back to
-    `application/octet-stream`. A parser registered for
-    `application/octet-stream` will hence also handle all responses without
-    a Content-Type.
 ## Accessing page metadata with MetaInspector
 You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)

data/docs/guides/routing.md CHANGED

@@ -1,18 +1,32 @@
 # Routing
-Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
-either instance methods denoted by symbols, or [handlers](/guides/handlers).
+Wayfarer equips jobs with a declarative routing DSL that maps URLs to actions.
+Actions are instance methods denoted by symbols, or [handlers](/guides/handlers).
+[Pages](/guides/pages) are only retrieved from URLs which map to an action.
+!!! info "Routed URLs are normalized"
+    By default, Wayfarer [applies some transformations to each URL](../tasks/#url-normalization) to bring it
+    into a canonical form. Routing happens based on this canonical form.
+    You can always access a task's raw string as it was enqueued with `task.url`.
 A job's route declarations equate to a predicate tree.
 When a URL is routed, the predicate tree is searched depth-first. If a
-matching leaf predicate is found, the found path's action is dispatched,
-along with `params` collected from path parameters.
+matching leaf predicate is found, the found path's action is dispatched.
+You can extract data from URL path segments and query parameters and
+access it through `params` in jobs or handlers.
 The following routes:
 ```ruby
 route.host "example.com", scheme: :https do
-  path "/contact", to: :contact
-  path "/users/:id", to: [UserHandler, :show]
+  path "contact", to: :contact
+  path "users/:id" do
+    to [UserHandler, :show]
+    path "gallery", to: [UserHandler, :photos]
+  end
 end
 ```
@@ -20,43 +34,111 @@ Equate to the following predicate tree:
 ```mermaid
 flowchart LR
-  RootRoute-->Host["Host <code>example.com</code>"]
+  Root-->Host["Host <code>example.com</code>"]
   Host-->Scheme["Scheme <code>:https</code>"]
-  Scheme-->Path1["Path <code>/contact</code>"]
-  Scheme-->Path2["Path <code>/users/:id<code>"]
-  Path1-->TargetRoute1["Target <code>:contact</code>"]
-  Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
+  %% first-level paths
+  Scheme-->PathContact["Path <code>contact</code>"]
+  Scheme-->PathUsersId["Path <code>users/:id</code>"]
+  %% their targets
+  PathContact-->TargetRouteContact["Target <code>:contact</code>"]
+  PathUsersId-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]
+  %% nested path under /users/:id
+  PathUsersId-->PathGallery["Path <code>'gallery'</code>"]
+  PathGallery-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]
 ```
-An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
+Traversing the tree depth-first for `https://example.com/users/42` stops at the
+route with the action `[UserHandler, :show]`:
 ```mermaid
 flowchart LR
-  RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
-  Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
-  Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
-  Scheme:::active-->Path2["Path <code>/users/:id<code>"]:::active
-  Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
-  Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
-  classDef active fill:#7CB342,stroke:#7CB342,color:#fff
-  classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
-  classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
-```
+  Root:::matching-->Host["Host <code>example.com</code>"]:::matching
+  Host:::matching-->Scheme["Scheme <code>:https</code>"]:::matching
-You can also visualise an invocation of the predicate tree on the command line
-with `wayfarer tree`
+  %% sibling paths from the scheme node
+  Scheme:::matching-->PathContact["Path <code>/contact</code>"]:::mismatching
+  Scheme:::matching-->PathUsersId["Path <code>/users/:id</code>"]:::matching
+  %% successful match for /users/:id
+  PathUsersId:::matching-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]:::matching
+  %% gallery branch is never visited for /users/42
+  PathContact-->TargetRouteContact["Target <code>:contact</code>"]:::unvisited
+  PathUsersId:::matching-->PathGallery["Path <code>/gallery</code>"]:::unvisited
+  PathGallery:::unvisited-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]:::unvisited
+  classDef matching     fill:#7CB342,stroke:#7CB342,color:#fff
+  classDef mismatching  fill:#FFCDD2,stroke:#F44336,color:#B71C1C
+  classDef unvisited    fill:#BDBDBD,stroke:#BDBDBD,color:#616161
 ```
-wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
-Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
-└──Host("example.com", match: true)
-   └──Scheme(:https, match: true)
-      ├──Path("/contact", match: false)
-      │  └──Target(match: true)
-      └──Path("/users/:id", match: true)
-         └──Target(match: true)
-            └──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
-```
+??? note "You can also visualise a job's routing tree with the [`route` CLI subcommand](/guides/cli)"
+    ```sh
+    wayfarer route DummyJob -r dummy_job.rb http://localhost:9000/users/42/gallery
+    ```
+    ```yaml
+    ---
+    routed: true
+    params:
+      id: '42'
+    action:
+      handler: Class
+      action: :photos
+    root_route:
+      match: true
+      params: {}
+      children:
+      - route:
+          host:
+            name: example.com
+          match: true
+          params: {}
+          children:
+          - route:
+              scheme:
+                scheme: :https
+              match: true
+              params: {}
+              children:
+              - route:
+                  path:
+                    pattern: "/contact"
+                  match: false
+                  params: {}
+                  children:
+                  - target_route:
+                      action:
+                      children: []
+              - route:
+                  path:
+                    pattern: "/users/:id"
+                  match: true
+                  params:
+                    id: '42'
+                  children:
+                  - target_route:
+                      action:
+                        handler: Class
+                        action: :show
+                      children: []
+                  - route:
+                      path:
+                        pattern: "/gallery"
+                      match: true
+                      params:
+                        id: '42'
+                      children:
+                      - target_route:
+                          action:
+                            handler: Class
+                            action: :photos
+                          children: []
+    ```
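
The depth-first search over a predicate tree can be sketched in a few lines of plain Ruby. This is a simplified model under stated assumptions: `Node`, `host`, `scheme`, `path`, `target` and `route` are hypothetical helpers, each path node matches the whole URL path, and nested path consumption (as in the `gallery` example) is left out.

```ruby
require "uri"

# Sketch only: these helpers are hypothetical, not Wayfarer's routing classes.
# A predicate returns a params Hash on match and nil on mismatch.
Node = Struct.new(:predicate, :children, :action)

def host(name, *children)
  Node.new(->(uri) { {} if uri.host == name }, children)
end

def scheme(name, *children)
  Node.new(->(uri) { {} if uri.scheme == name.to_s }, children)
end

def path(pattern, *children)
  parts = pattern.split("/").reject(&:empty?)
  predicate = lambda do |uri|
    segments = uri.path.split("/").reject(&:empty?)
    next nil unless segments.size == parts.size

    params = {}
    matched = parts.zip(segments).all? do |part, segment|
      next part == segment unless part.start_with?(":")

      params[part[1..].to_sym] = segment # ":id" captures the segment
      true
    end
    params if matched
  end
  Node.new(predicate, children)
end

# Target nodes are leaves that always match and carry the action.
def target(action)
  Node.new(->(_uri) { {} }, [], action)
end

# Depth-first search: returns [action, params] for the first matching
# target, or nil if no path through the tree matches.
def route(node, uri, params = {})
  matched = node.predicate.call(uri)
  return nil unless matched

  params = params.merge(matched)
  return [node.action, params] if node.action

  node.children.each do |child|
    found = route(child, uri, params)
    return found if found
  end
  nil
end

TREE = host("example.com",
            scheme(:https,
                   path("/contact", target(:contact)),
                   path("/users/:id", target(["UserHandler", :show]))))

result = route(TREE, URI("https://example.com/users/42"))
```

Here `result` pairs the action of the first matching target with the params collected along the path, and a URL that matches no branch yields `nil`.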
 As you can see, `Target` nodes always match. This means that we could have also defined
 our routes as:

data/docs/guides/tasks.md CHANGED

@@ -19,20 +19,40 @@ Wayfarer ensures that no URL gets processed twice within a batch. It achieves
 this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
 keyed by normalized URLs.
+Wayfarer computes a canonical URL representation that it uses for cache lookups.
 ### URL normalization
 Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
-and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
+and applies further normalizations. By default, all normalizations are applied
+and can be individually disabled.
+URL normalization is used only for deduplication, and does not affect the immutable
+`task.url`, which always returns the verbatim URL as enqueued.
+This allows you to follow the URLs exactly as parsed from response bodies.
+You can configure the global normalization behaviour by setting the following
+values on `Wayfarer.config.normalization`, which all default to `true`:
+ * `remove_www`: Remove `www.` prefix from hostnames?
+ * `remove_trailing_slash`: Remove a trailing path slash?
+ * `remove_fragment`: Remove the URL fragment?
+ * `order_query_parameters`: Order query parameters alphabetically?
+ * `remove_tracking_parameters`: Remove tracking parameters from the URL?
+When a job gets deduplicated, it succeeds and causes no retries.
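
What the five normalization switches might do to a URL can be sketched with Ruby's standard `uri` library. This is an approximation under stated assumptions: `normalize_url`, `DEFAULTS` and `TRACKING_PARAMS` are hypothetical names, the list of tracking parameters is invented for illustration, and Wayfarer's own implementation builds on Addressable rather than on `URI`.

```ruby
require "uri"

# Sketch only: not Wayfarer's normalization code.
DEFAULTS = {
  remove_www: true,
  remove_trailing_slash: true,
  remove_fragment: true,
  order_query_parameters: true,
  remove_tracking_parameters: true
}.freeze

# Assumed list; the guide does not say which parameters count as tracking.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign].freeze

# Returns a canonical string for the URL. Each step can be switched off
# by passing the matching option as false.
def normalize_url(url, **overrides)
  opts = DEFAULTS.merge(overrides)
  uri = URI.parse(url)

  uri.host = uri.host.delete_prefix("www.") if opts[:remove_www] && uri.host
  uri.path = uri.path.chomp("/") if opts[:remove_trailing_slash]
  uri.fragment = nil if opts[:remove_fragment]

  pairs = URI.decode_www_form(uri.query.to_s)
  pairs.reject! { |key, _| TRACKING_PARAMS.include?(key) } if opts[:remove_tracking_parameters]
  pairs.sort! if opts[:order_query_parameters]
  uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)

  uri.to_s
end
```

With all defaults on, `https://www.example.com/a/?b=2&utm_source=x&a=1#top` and `https://example.com/a?a=1&b=2` produce the same string, so both would count as the same URL for deduplication.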
+### Setting a custom key function
+You can customize how deduplication keys are computed. As a contrived example,
+to process only one job per hostname:
-URL normalization is used only for deduplication, and does not affect the URL
-returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
-enqueud. This allows you to follow the exact URLs you may have parsed from a
-response body.
+```ruby
+Wayfarer.config[:deduplication][:key] = ->(task) { task[:uri].hostname }
+```
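
The effect of a custom key function on per-batch deduplication can be sketched in memory. `Deduplicator` and `first_visit?` are hypothetical names, tasks are modelled as plain Hashes with `:uri` and `:batch` keys, and a `Set` stands in for the state Wayfarer keeps in Redis.

```ruby
require "set"
require "uri"

# Sketch only: Deduplicator is hypothetical and not thread-safe.
class Deduplicator
  # key: callable that maps a task to its deduplication key
  def initialize(key: ->(task) { task[:uri].to_s })
    @key = key
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns true the first time a key is seen within the task's batch,
  # false for every later task with the same key in that batch.
  def first_visit?(task)
    !@seen[task[:batch]].add?(@key.call(task)).nil?
  end
end

per_host = Deduplicator.new(key: ->(task) { task[:uri].hostname })
```

With the hostname key, two different paths on the same host count as duplicates within one batch, while the same URL in another batch is processed again.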
 ## Invalid URLs
-Tasks with invalid URLs (for example`ht%0atp://localhost/`, a newline in the
-protocol) are discarded, since they can't get retrieved. No exception is raised,
-and the job is considered successfully processed, since there are no corrective
-actions an error handler could take as tasks are immutable, and retries would
-not change the outcome.
+Tasks with invalid URLs are discarded (for example `ht%0atp://localhost/` which has a
+newline in its protocol), since there is no corrective action possible.
+No exception is raised, and the job is considered successfully processed without retries.