RubyGems - wayfarer - Versions diffs - 0.4.1 → 0.4.2 - Mend

wayfarer 0.4.1 → 0.4.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (54)

checksums.yaml +4 -4
data/Gemfile.lock +14 -10
data/docs/cookbook/batch_routing.md +22 -0
data/docs/cookbook/consent_screen.md +36 -0
data/docs/cookbook/executing_javascript.md +41 -0
data/docs/cookbook/querying_html.md +3 -3
data/docs/cookbook/screenshots.md +2 -2
data/docs/guides/browser_automation/capybara.md +6 -3
data/docs/guides/browser_automation/ferrum.md +3 -1
data/docs/guides/browser_automation/selenium.md +4 -2
data/docs/guides/callbacks.md +5 -5
data/docs/guides/debugging.md +17 -0
data/docs/guides/error_handling.md +22 -26
data/docs/guides/jobs.md +44 -18
data/docs/guides/navigation.md +73 -0
data/docs/guides/pages.md +4 -4
data/docs/guides/performance.md +108 -0
data/docs/guides/reliability.md +41 -0
data/docs/guides/routing/steering.md +30 -0
data/docs/guides/tasks.md +9 -33
data/docs/reference/api/base.md +13 -127
data/docs/reference/api/route.md +1 -1
data/docs/reference/cli.md +0 -78
data/docs/reference/configuration_keys.md +1 -1
data/lib/wayfarer/cli/job.rb +1 -3
data/lib/wayfarer/cli/route.rb +4 -2
data/lib/wayfarer/cli/templates/job.rb.tt +3 -1
data/lib/wayfarer/config/networking.rb +1 -1
data/lib/wayfarer/config/struct.rb +1 -1
data/lib/wayfarer/middleware/fetch.rb +15 -4
data/lib/wayfarer/middleware/router.rb +34 -2
data/lib/wayfarer/middleware/worker.rb +4 -24
data/lib/wayfarer/networking/pool.rb +9 -8
data/lib/wayfarer/page.rb +1 -1
data/lib/wayfarer/routing/matchers/custom.rb +2 -0
data/lib/wayfarer/routing/matchers/path.rb +1 -0
data/lib/wayfarer/routing/route.rb +6 -0
data/lib/wayfarer/routing/router.rb +27 -0
data/lib/wayfarer/stringify.rb +13 -7
data/lib/wayfarer.rb +3 -1
data/spec/callbacks_spec.rb +2 -2
data/spec/config/networking_spec.rb +2 -2
data/spec/factories/{queue/middleware.rb → middleware.rb} +3 -3
data/spec/factories/{queue/page.rb → page.rb} +3 -3
data/spec/factories/{queue/task.rb → task.rb} +0 -0
data/spec/fixtures/dummy_job.rb +1 -1
data/spec/middleware/chain_spec.rb +17 -17
data/spec/middleware/fetch_spec.rb +27 -11
data/spec/middleware/router_spec.rb +34 -7
data/spec/middleware/worker_spec.rb +3 -13
data/spec/routing/router_spec.rb +24 -0
data/wayfarer.gemspec +1 -1
metadata +16 -8
data/spec/factories/queue/chain.rb +0 -11

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a2bbcc550d6799e1e3588b832905866116105fdcceb0eeef5ec244622f15bb10
-  data.tar.gz: b8b324d89d162e578cde829f15a44bb08b9e93ad2a3bb5b6619e96275a7fa6cd
+  metadata.gz: 04baaa6967fc9de4970e4d3a14cb8bb2d7458c70bb6529189ef3823d7792aa18
+  data.tar.gz: '058de8aa89a46c88fb460a0d39e542c43e4b0a9f23faa9b672367fb6a9b12820'
 SHA512:
-  metadata.gz: 998c06776f7a7922aa2d36770dc7e4389c5814ac36a20062b20e7fe6986fb52e4fde9f538287510fccfca1689fb9f7f4e019ec23dcbeff777add1a99e24fba26
-  data.tar.gz: 239e3db3d5fffb8f81e74c655a648ce03febd346c25fb86b695bca8a8d328e6ff341d63d4becf63a2811defae3af5a1c4bf1c772e327b114df5909a06151a95b
+  metadata.gz: ba5feb1b4116f53a53166a999953b791aecc1356dbf4e3db5170f16f42703e708176a33a8a05553698a5cc6e011e4bc94521c163ff67e7d3d2dfd6c29e6a14f3
+  data.tar.gz: d0f0dddf9b091820b59476ecae9c048169fe867f5559c077ec306d74abc6540ea01d1723dd722cfeded64d206f67c9948eaef2e6a29b38b729243ee4aa046836
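Note: the digests in this hunk cover the two archives packed inside the `.gem` file. The following is a minimal sketch, not part of the package, of how such a digest could be checked with Ruby's standard `Digest` library. The helper name `checksum_matches?` is hypothetical, and a temporary file stands in for a real `data.tar.gz`.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: compares a file's SHA256 with an expected hex digest
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex
end

# Self-contained demonstration: a temporary file replaces a real gem archive
matches = nil
Tempfile.create("data") do |file|
  file.write("hello")
  file.flush
  matches = checksum_matches?(file.path, Digest::SHA256.hexdigest("hello"))
end
puts matches
```

To check a real archive, pass the path of the extracted `data.tar.gz` and the `SHA256` value listed in `checksums.yaml`.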

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    wayfarer (0.4.1)
+    wayfarer (0.4.2)
       activejob (~> 6.0)
       addressable (~> 2.8)
       capybara (~> 3.0)
@@ -59,16 +59,17 @@ GEM
       activesupport (>= 5.0.0)
     faker (1.9.6)
       i18n (>= 0.7)
-    faraday (1.8.0)
+    faraday (1.9.3)
       faraday-em_http (~> 1.0)
       faraday-em_synchrony (~> 1.0)
       faraday-excon (~> 1.1)
-      faraday-httpclient (~> 1.0.1)
+      faraday-httpclient (~> 1.0)
+      faraday-multipart (~> 1.0)
       faraday-net_http (~> 1.0)
-      faraday-net_http_persistent (~> 1.1)
+      faraday-net_http_persistent (~> 1.0)
       faraday-patron (~> 1.0)
       faraday-rack (~> 1.0)
-      multipart-post (>= 1.2, < 3)
+      faraday-retry (~> 1.0)
       ruby2_keywords (>= 0.0.4)
     faraday-cookie_jar (0.0.7)
       faraday (>= 0.8.0)
@@ -81,19 +82,22 @@ GEM
     faraday-http-cache (2.2.0)
       faraday (>= 0.8)
     faraday-httpclient (1.0.1)
+    faraday-multipart (1.0.3)
+      multipart-post (>= 1.2, < 3)
     faraday-net_http (1.0.1)
     faraday-net_http_persistent (1.2.0)
     faraday-patron (1.0.0)
     faraday-rack (1.0.0)
+    faraday-retry (1.0.3)
     faraday_middleware (1.2.0)
       faraday (~> 1.0)
-    fastimage (2.2.5)
+    fastimage (2.2.6)
     ferrum (0.11)
       addressable (~> 2.5)
       cliver (~> 0.3)
       concurrent-ruby (~> 1.1)
       websocket-driver (>= 0.6, < 0.8)
-    globalid (0.5.2)
+    globalid (1.0.0)
       activesupport (>= 5.0)
     http-cookie (1.0.4)
       domain_name (~> 0.5)
@@ -111,9 +115,9 @@ GEM
       nesty (~> 1.0)
       nokogiri (~> 1.11)
     method_source (1.0.0)
-    mime-types (3.3.1)
+    mime-types (3.4.1)
       mime-types-data (~> 3.2015)
-    mime-types-data (3.2021.0901)
+    mime-types-data (3.2022.0105)
     mini_mime (1.1.2)
     mini_portile2 (2.6.1)
     minitest (5.14.4)
@@ -182,7 +186,7 @@ GEM
       rack (~> 1.5)
       rack-protection (~> 1.4)
       tilt (>= 1.3, < 3)
-    thor (1.1.0)
+    thor (1.2.1)
     tilt (2.0.10)
     tzinfo (2.0.4)
       concurrent-ruby (~> 1.0)
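Note: this hunk loosens the `faraday-httpclient` constraint from `~> 1.0.1` to `~> 1.0`. The sketch below uses RubyGems' own `Gem::Requirement` to show what that change permits. The candidate version numbers are made up for illustration and are not taken from the lockfile.

```ruby
require "rubygems"

# Pessimistic constraints as they appear before and after the change
old_req = Gem::Requirement.new("~> 1.0.1") # >= 1.0.1, < 1.1
new_req = Gem::Requirement.new("~> 1.0")   # >= 1.0,   < 2.0

# Hypothetical candidate versions, chosen only for illustration
candidates = %w[1.0.0 1.0.1 1.0.9 1.1.0 2.0.0].map { |v| Gem::Version.new(v) }

old_allowed = candidates.select { |v| old_req.satisfied_by?(v) }.map(&:to_s)
new_allowed = candidates.select { |v| new_req.satisfied_by?(v) }.map(&:to_s)

puts "old: #{old_allowed.join(', ')}"
puts "new: #{new_allowed.join(', ')}"
```

The looser constraint accepts minor releases such as 1.1.0, which the older one rejected.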

data/docs/cookbook/batch_routing.md ADDED

@@ -0,0 +1,22 @@
+# Batch routing
+```ruby
+# Create a record in an external database and store the hostname
+record = Database::Row.create(hostname: "example.com")
+class DummyJob < Wayfarer::Base
+  route do |hostname|
+    host hostname, to: :index
+  end
+  steer do |task|
+    # Pass the external record's hostname to the router
+    [Database::Row.find(task.batch).hostname]
+  end
+  # ...
+end
+# Enqueue the task and use the database record's key as batch
+DummyJob.crawl_later("https://example.com", batch: record.id)
+```

data/docs/cookbook/consent_screen.md ADDED

@@ -0,0 +1,36 @@
+# Consent Screens
+Some websites have nag-screens that make visitors wait for a button to show up.
+Here is an example with Ferrum where the opt-in button is contained in an
+iframe, clicked, and makes the live page behind the screen accessible to
+`#index`:
+```ruby
+Wayfarer.config.network.agent = :ferrum
+class DummyJob < Wayfarer::Base
+  route { to :index, host: "example.com" }
+  before_action if: :consent_required? do
+    sleep(5) # If the consent form has a loading animation
+    consent_button&.click
+    sleep(5) # Wait for browser to get redirected behind nag-screen
+  end
+  def index
+    # Nag-screen passed
+    stage page(live: true).meta.links.internal
+  end
+  private
+  def consent_button
+    browser.frames.third.css("button#consent")&.first
+  end
+  def consent_required?
+    browser.css(".consent_screen").any?
+  end
+end
+```

data/docs/cookbook/executing_javascript.md ADDED

@@ -0,0 +1,41 @@
+# Executing JavaScript
+Executing JavaScript requires automating a browser.
+=== "Ferrum"
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      route { to :index }
+      def index
+        browser.evaluate("[window.scrollX, window.scrollY]")
+      end
+    end
+    ```
+=== "Selenium"
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      route { to :index }
+      def index
+        # Mind the explicit return
+        browser.execute_script("return [window.scrollX, window.scrollY]")
+      end
+    end
+    ```
+=== "Capybara"
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      route { to :index }
+      def index
+        # Capybara does not return value of JavaScript execution
+        browser.execute_script("console.log('Foobar')") # => nil
+      end
+    end
+    ```

data/docs/cookbook/querying_html.md CHANGED

@@ -6,7 +6,7 @@ See: [Nokogiri: Searching an HTML / XML Document](https://nokogiri.org/tutorials
     ```ruby
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         page.doc.css("html")
@@ -19,7 +19,7 @@ See: [Nokogiri: Searching an HTML / XML Document](https://nokogiri.org/tutorials
     ```ruby
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         browser.at_css("html")
@@ -32,7 +32,7 @@ See: [Nokogiri: Searching an HTML / XML Document](https://nokogiri.org/tutorials
     ```ruby
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         browser.find_elements(css: "html")

data/docs/cookbook/screenshots.md CHANGED

@@ -6,7 +6,7 @@ Taking screenshots requires automating a browser.
     ```ruby
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         browser.screenshot(path: "screenshot.png")
@@ -18,7 +18,7 @@ Taking screenshots requires automating a browser.
     ```ruby
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         browser.save_screenshot("screenshot.png")

data/docs/guides/browser_automation/capybara.md CHANGED

@@ -7,8 +7,11 @@ When Capybara is in use, a remote browser process is available as a Capybara
 session:
 ```ruby
-class DummyWorker < Wayfarer::Worker
-  route.to :index
+Wayfarer.config.network.agent = :capybara
+# Wayfarer.config.capybara.driver = ...
+class DummyJob < Wayfarer::Worker
+  route { to :index }
   def index
     browser # => #<Capybara::Session ...>
@@ -61,6 +64,6 @@ end
     Capybara.register_driver(:cuprite) do |app|
       # Wayfarer's Ferrum or Selenium options must be passed along manually
-      Capybara::Cuprite::Driver.new(app, Wayfare.config.ferrum.options)
+      Capybara::Cuprite::Driver.new(app, Wayfarer.config.ferrum.options)
     end
     ```

data/docs/guides/browser_automation/ferrum.md CHANGED

@@ -11,8 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
 so:
 ```ruby
+Wayfarer.config.network.agent = :ferrum
 class DummyWorker < Wayfarer::Worker
-  route.to :index
+  route { to :index }
   def index
     browser # => #<Ferrum::Browser ...>

data/docs/guides/browser_automation/selenium.md CHANGED

@@ -7,8 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
 so:
 ```ruby
+Wayfarer.config.network.agent = :selenium
 class DummyWorker < Wayfarer::Worker
-  route.to :index
+  route { to :index }
   def index
     browser # => #<Selenium::WebDriver ...>
@@ -28,7 +30,7 @@ process.
     Wayfarer.config.network.agent = :selenium
     class DummyJob < Wayfarer::Base
-      route.to :index
+      route { to :index }
       def index
         page.headers     # => always {}

data/docs/guides/callbacks.md CHANGED

@@ -52,16 +52,16 @@ end
 Internally, a batch counter is in-/decremented on certain events. Once the
 counter reaches zero, `after_batch` callbacks runs in declaration order.
-The counter is incremented when:
+The counter is incremented when within the batch:
-* A job is enqueued within the batch.
+* A job is enqueued.
 The counter is decremented when:
 * A job succeeds.
-* A job fails due to an unhandled exception.
-* A job fails due to a discarded exception.
-* A job fails and thereyby exhausts its maximum attempts.
+* A job errors due to an unhandled exception.
+* A job is discarded due to an exception.
+* A job errors and thereyby exhausts its maximum attempts.
 !!! attention "Batch callbacks can fail jobs"
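Note: the counter rules in this hunk can be pictured with a small in-memory model. This is a sketch under assumptions, not Wayfarer's implementation. The class `BatchCounter` and its method names are hypothetical, and a plain integer stands in for whatever storage the gem uses.

```ruby
# Hypothetical in-memory stand-in for the batch counter described above
class BatchCounter
  attr_reader :count

  def initialize(&after_batch)
    @count = 0
    @after_batch = after_batch
  end

  # A job is enqueued within the batch
  def enqueued
    @count += 1
  end

  # A job succeeded, errored unhandled, was discarded or exhausted its attempts
  def finished
    @count -= 1
    @after_batch.call if @count.zero?
  end
end

fired = 0
counter = BatchCounter.new { fired += 1 }

2.times { counter.enqueued }
counter.finished # first job succeeds; counter is 1, callback does not run
counter.finished # second job is discarded; counter reaches 0, callback runs

puts "after_batch ran #{fired} time(s)"
```

All four ways a job can finish decrement the counter alike, so the callback fires once the last job in the batch is done, whatever its outcome.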

data/docs/guides/debugging.md ADDED

@@ -0,0 +1,17 @@
+# Debugging
+[Wayfarer's CLI](/reference/cli/) has two sub-commands that come in handy when
+diagnosing problems in the development workflow.
+## Routing a URL from the shell
+## `wayfarer route`
+### `wayfarer route result JOB URL`
+:   Prints the result of invoking `JOB`'s router with `URL`.
+### `wayfarer route tree JOB URL`
+:   Visualises the routing tree result of invoking `JOB`'s router with `URL`.

data/docs/guides/error_handling.md CHANGED

@@ -1,35 +1,31 @@
 # Error handling
-Wayfarer relies on Active Job's error handling facilities, `retry_on` and
-`discard_on`:
+## Wayfarer never swallows exceptions
-* [Active Job Basics: Exceptions](https://guides.rubyonrails.org/active_job_basics.html#exceptions)
-* [ActiveJob::Exceptions](https://edgeapi.rubyonrails.org/classes/ActiveJob/Exceptions/ClassMethods.html)
+* Wayfarer never swallows exceptions.
+* Jobs with unhandled exceptions are not retried.
-## Retrying
+## Retrying and discarding
-```ruby
-class DummyJob < Wayfarer::Base
-  retry_on MyError, attempts: 3 do |job, error|
-    # All 3 attempts have failed (1 initial attempt + 2 retries)
-  end
-end
-```
+Wayfarer relies on [Active Job's two error handling facilities](https://guides.rubyonrails.org/active_job_basics.html#exceptions).
-## Discarding
+* `retry_on` to retry jobs a number of times on certain errors:
-```ruby
-class DummyJob < Wayfarer::Base
-  discard_on MyError do |job, error|
-    # The job will not get retried
-  end
-end
-```
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      retry_on MyError, attempts: 3 do |job, error|
+        # This block runs once all 3 attempts have failed
+        # (1 initial attempt + 2 retries)
+      end
+    end
+    ```
-## Job failures
+* `discard_on` to throw away jobs on certain errors:
-Jobs are not retried and their URLs locked within their batch if:
-* A discarded exception is raised.
-* An unhandled exception is raised.
-* A handled exception is raised, but retry attempts are exhausted.
+    ```ruby
+    class DummyJob < Wayfarer::Base
+      discard_on MyError do |job, error|
+        # This block runs once and buries the job
+      end
+    end
+    ```
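Note: the attempt counting in the `retry_on` example (1 initial attempt plus 2 retries) can be reproduced in plain Ruby. The helper `with_attempts` below is hypothetical and loops in-process. Active Job's real `retry_on` re-enqueues the job instead, so this only illustrates the counting.

```ruby
class MyError < StandardError; end

# Hypothetical helper mimicking the attempt counting of `retry_on`
def with_attempts(attempts:, on_exhausted:)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue MyError => error
    retry if tries < attempts
    # Reached only once every attempt has failed
    on_exhausted.call(error, tries)
  end
end

log = []
give_up = ->(_error, tries) { log << "gave up after #{tries}" }

with_attempts(attempts: 3, on_exhausted: give_up) do |try|
  log << "attempt #{try}"
  raise MyError, "boom"
end

puts log
```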

data/docs/guides/jobs.md CHANGED

@@ -1,16 +1,36 @@
 # Jobs
-Jobs are Ruby classes that look as follows:
+Jobs are Ruby classes that process [tasks](/guides/tasks) and look as follows:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route.to :index
+  route { to :index }
   def index
   end
 end
 ```
+Here is how to enqueue a task for a URL:
+```ruby
+DummyJob.crawl_later("https://example.com")
+```
+This is the same as calling the Active Job API directly and passing a task
+and a random batch:
+```ruby
+task = Wayfarer::Task.new("https://example.com", SecureRandom.uuid)
+DummyJob.perform_later(task)
+```
+A batch can be specified with `::crawl_later`, too:
+```ruby
+DummyJob.crawl_later("https://example.com", batch: "my-batch")
+```
 ## Current task
 Jobs consume [tasks](../tasks) from a message queue. The currently processed
@@ -18,58 +38,64 @@ task is accessible like so:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route.to :index
+  route { to :index }
   def index
     task.url   # => "https://example.com"
-    task.batch # => "55fe80d4-97ce-..."
+    task.batch # => "my-batch"
   end
 end
 ```
 ## Current page
-Once control is handed over to jobs, their task's URL has been retrieved into a
-[page](../pages) object:
+A task's URL contents get fetched into a [page](../pages) object if the task URL
+matched a route:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route.to :index
+  route { to :index, host: "example.com" }
   def index
-    page.url  # => "https://example.com"
-    page.body # => "<html>..."
+    page.url         # => "https://example.com"
+    page.body        # => "<html>..."
+    page.status_code # => 200
+    page.headers     # { "Content-Type" => ... }
   end
 end
 ```
 ## URL parameters
-TODO
+Jobs can extract data from URLs with their router:
 ```ruby
 class DummyJob < Wayfarer::Base
-  route.to :index
+  route do
+    path "/users/:id/profile"
+  end
   def index
-    page.url  # => "https://example.com"
-    page.body # => "<html>..."
+    params[:id] # => "42"
   end
 end
+DummyJob.crawl_later("https://example.com/users/42/profile")
 ```
-## Automated browser
+## User agent
-When automating browsers, the remote browser process that retrieved the URL is
-accessible like so:
+The HTTP client or automated browser that fetched the URL is available:
 ```ruby
+Wayfarer.config.network.agent = :ferrum # Chrome DevTools Protocol
 class DummyJob < Wayfarer::Base
-  route.to :index
+  route { to :index }
   def index
-    browser # => #<Ferrum::Browser ...> or #<Selenium::WebDriver ...>
+    browser.save_screenshot("capture.png")
   end
 end
 ```
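Note: the "URL parameters" section added in this hunk extracts `:id` from a path pattern. A minimal sketch of that idea in plain Ruby follows. The function `path_params` is hypothetical and says nothing about how Wayfarer's router is implemented.

```ruby
require "uri"

# Hypothetical matcher: turns "/users/:id/profile" into a regex with named
# captures and applies it to the path of the given URL
def path_params(pattern, url)
  source = pattern.split("/", -1).map do |segment|
    segment.start_with?(":") ? "(?<#{segment[1..]}>[^/]+)" : Regexp.escape(segment)
  end.join("/")

  match = Regexp.new("\\A#{source}\\z").match(URI(url).path)
  match && match.named_captures.transform_keys(&:to_sym)
end

p path_params("/users/:id/profile", "https://example.com/users/42/profile")
p path_params("/users/:id/profile", "https://example.com/posts/42")
```

A mismatching path returns `nil`, and captured values stay strings, as in the documented `params[:id] # => "42"`.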

data/docs/guides/navigation.md ADDED

@@ -0,0 +1,73 @@
+# Navigation
+Wayfarer has two mechanisms for navigating crawls:
+* Jobs have a router that decides if a task's URL gets fetched and processed.
+* Jobs can add URLs to a processing set with `#stage`.
+## Staging URLs
+Jobs can turn URLs into tasks within their own batch with `#stage`. Staging a
+URL does not enqueue it immediately. Instead, the URL is added to a processing
+set first.
+```ruby
+class DummyJob < Wayfarer::Base
+  route { to :index }
+  def index
+    stage page.meta.links.all
+  end
+end
+```
+Once the `index` action method returns, all URLs in `page.meta.links.all`
+are (1) normalized to a canonical form and (2) checked for inclusion in
+the batch's processed URL Redis set. All unprocessed URLs are enqueued as
+tasks within the same batch.
+`#stage` can be called arbitrarily often, with invalid URLs too, as they are
+filtered out behind the scenes:
+```ruby
+def index
+  stage "_bro:ken@url/" # => ["_bro:ken@url/"]
+end
+```
+See also: [Performance: Stage less URLs](/guides/performance)
+!!! attention "Failing action methods do not enqueue tasks"
+    If an action method fails as in:
+    ```ruby
+    def index
+      stage page.meta.links.all
+      fail "Error occured"
+    end
+    ```
+    None of the staged URLs are enqueued as tasks. Jobs that raise an exception
+    should get retried, or the exception should be handled.
+## Routing URLs
+In the following example, the task is written to the message queue, but the
+job's routes do not match the URL. When the task gets consumed, the URL does not
+get fetched and the action method not called.
+```ruby
+class DummyJob < Wayfarer::Base
+  route do
+    host "example.com", path: "/users/:user_id", to: :user
+  end
+  # ...
+end
+DummyJob.crawl_later("https://mismatching.host/users/42")
+```
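Note: the staging flow described in this file (normalize, drop invalid URLs, skip URLs the batch has already processed, enqueue the rest) can be modelled in memory. Everything below is an assumption made for illustration. The `Batch` class is hypothetical, a Ruby `Set` stands in for the Redis set, and the normalization is deliberately simplistic.

```ruby
require "set"
require "uri"

# Hypothetical in-memory model of staging within one batch
class Batch
  attr_reader :processed, :queue

  def initialize
    @processed = Set.new
    @queue = []
  end

  # Meant to run once the action method has returned without raising
  def enqueue_staged(staged)
    staged.each do |raw|
      url = normalize(raw)
      next unless url

      # Set#add? returns nil when the URL was already processed
      @queue << url if @processed.add?(url)
    end
  end

  private

  # Simplified normalization: require http(s) with a host, drop the fragment
  def normalize(raw)
    uri = URI.parse(raw)
    return nil unless uri.is_a?(URI::HTTP) && uri.host

    uri.fragment = nil
    uri.to_s
  rescue URI::InvalidURIError
    nil
  end
end

batch = Batch.new
batch.enqueue_staged(["https://example.com/a", "https://example.com/a#top", "_bro:ken@url/"])
p batch.queue
```

Both spellings of the first URL collapse to one task, and the broken URL is silently dropped, which mirrors the filtering the guide describes.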

data/docs/guides/pages.md CHANGED

@@ -1,11 +1,11 @@
 # Pages
-Retrieved pages are represented by `Wayfarer::Page` objects and are available
-within jobs like so:
+Retrieved pages take the shape of `Wayfarer::Page` objects and are available
+to jobs:
 ```ruby
 class DummyJob < Wayfarer::Worker
-  route.to :index
+  route { to :index }
   def index
     page # => #<Wayfarer::Page ...>
@@ -35,7 +35,7 @@ To access a page reflecting the current browser state, pass the `live` keyword:
 ```ruby
 class DummyJob < Wayfarer::Worker
-  route.to :index
+  route { to :index }
   def index
     page # => #<Wayfarer::Page ...>