wayfarer 0.4.5 → 0.4.7

Files changed (175)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/lint.yaml +25 -0
  3. data/.github/workflows/release.yaml +29 -0
  4. data/.github/workflows/tests.yaml +30 -0
  5. data/.gitignore +4 -0
  6. data/.rubocop.yml +5 -0
  7. data/.vale.ini +5 -0
  8. data/.yardopts +1 -3
  9. data/Dockerfile +5 -4
  10. data/Gemfile +3 -0
  11. data/Gemfile.lock +107 -102
  12. data/Rakefile +5 -56
  13. data/bin/wayfarer +1 -1
  14. data/docker-compose.yml +20 -9
  15. data/docs/cookbook/consent_screen.md +2 -2
  16. data/docs/cookbook/executing_javascript.md +3 -3
  17. data/docs/cookbook/navigation.md +12 -12
  18. data/docs/cookbook/querying_html.md +3 -3
  19. data/docs/cookbook/screenshots.md +2 -2
  20. data/docs/cookbook/user_agent.md +1 -1
  21. data/docs/design.md +36 -0
  22. data/docs/guides/callbacks.md +24 -126
  23. data/docs/guides/configuration.md +8 -8
  24. data/docs/guides/handlers.md +60 -0
  25. data/docs/guides/index.md +1 -0
  26. data/docs/guides/jobs/error_handling.md +40 -0
  27. data/docs/guides/jobs.md +99 -31
  28. data/docs/guides/navigation.md +1 -1
  29. data/docs/guides/networking/capybara.md +13 -22
  30. data/docs/guides/networking/custom_adapters.md +82 -41
  31. data/docs/guides/networking/ferrum.md +4 -4
  32. data/docs/guides/networking/http.md +9 -13
  33. data/docs/guides/networking/selenium.md +10 -11
  34. data/docs/guides/pages.md +76 -10
  35. data/docs/guides/redis.md +10 -0
  36. data/docs/guides/routing.md +74 -0
  37. data/docs/guides/tasks.md +33 -9
  38. data/docs/guides/tutorial.md +60 -0
  39. data/docs/guides/user_agents.md +113 -0
  40. data/docs/index.md +17 -40
  41. data/docs/reference/cli.md +35 -25
  42. data/docs/reference/configuration.md +36 -0
  43. data/lib/wayfarer/base.rb +124 -46
  44. data/lib/wayfarer/batch_completion.rb +56 -0
  45. data/lib/wayfarer/callbacks.rb +22 -48
  46. data/lib/wayfarer/cli/route_printer.rb +71 -57
  47. data/lib/wayfarer/cli.rb +121 -0
  48. data/lib/wayfarer/gc.rb +13 -6
  49. data/lib/wayfarer/handler.rb +15 -7
  50. data/lib/wayfarer/logging.rb +38 -0
  51. data/lib/wayfarer/middleware/base.rb +2 -0
  52. data/lib/wayfarer/middleware/batch_completion.rb +19 -0
  53. data/lib/wayfarer/middleware/content_type.rb +54 -0
  54. data/lib/wayfarer/middleware/controller.rb +19 -15
  55. data/lib/wayfarer/middleware/dedup.rb +16 -13
  56. data/lib/wayfarer/middleware/dispatch.rb +12 -4
  57. data/lib/wayfarer/middleware/normalize.rb +12 -11
  58. data/lib/wayfarer/middleware/redis.rb +15 -0
  59. data/lib/wayfarer/middleware/router.rb +33 -35
  60. data/lib/wayfarer/middleware/stage.rb +5 -5
  61. data/lib/wayfarer/middleware/uri_parser.rb +30 -0
  62. data/lib/wayfarer/middleware/user_agent.rb +49 -0
  63. data/lib/wayfarer/networking/capybara.rb +1 -1
  64. data/lib/wayfarer/networking/context.rb +2 -2
  65. data/lib/wayfarer/networking/ferrum.rb +2 -2
  66. data/lib/wayfarer/networking/follow.rb +12 -6
  67. data/lib/wayfarer/networking/http.rb +1 -1
  68. data/lib/wayfarer/networking/pool.rb +17 -12
  69. data/lib/wayfarer/networking/selenium.rb +3 -3
  70. data/lib/wayfarer/networking/strategy.rb +2 -2
  71. data/lib/wayfarer/page.rb +36 -14
  72. data/lib/wayfarer/parsing/xml.rb +6 -6
  73. data/lib/wayfarer/parsing.rb +24 -0
  74. data/lib/wayfarer/redis/barrier.rb +13 -21
  75. data/lib/wayfarer/redis/counter.rb +19 -9
  76. data/lib/wayfarer/redis/pool.rb +1 -1
  77. data/lib/wayfarer/redis/resettable.rb +19 -0
  78. data/lib/wayfarer/routing/dsl.rb +1 -0
  79. data/lib/wayfarer/routing/matchers/path.rb +4 -2
  80. data/lib/wayfarer/routing/root_route.rb +5 -1
  81. data/lib/wayfarer/routing/route.rb +4 -14
  82. data/lib/wayfarer/stringify.rb +22 -30
  83. data/lib/wayfarer/task.rb +12 -18
  84. data/lib/wayfarer.rb +29 -2
  85. data/mkdocs.yml +52 -7
  86. data/rake/docs.rake +26 -0
  87. data/rake/lint.rake +105 -0
  88. data/rake/release.rake +29 -0
  89. data/rake/tests.rake +28 -0
  90. data/requirements.txt +1 -1
  91. data/spec/base_spec.rb +140 -160
  92. data/spec/batch_completion_spec.rb +104 -0
  93. data/spec/cli/job_spec.rb +19 -23
  94. data/spec/cli/routing_spec.rb +101 -0
  95. data/spec/cli/version_spec.rb +1 -1
  96. data/spec/factories/task.rb +7 -1
  97. data/spec/fixtures/dummy_job.rb +5 -3
  98. data/spec/gc_spec.rb +8 -50
  99. data/spec/handler_spec.rb +1 -1
  100. data/spec/integration/callbacks_spec.rb +157 -45
  101. data/spec/integration/content_type_spec.rb +145 -0
  102. data/spec/integration/gc_spec.rb +44 -0
  103. data/spec/integration/handler_spec.rb +66 -0
  104. data/spec/integration/page_spec.rb +44 -29
  105. data/spec/integration/params_spec.rb +33 -25
  106. data/spec/integration/parsing_spec.rb +125 -0
  107. data/spec/integration/routing_spec.rb +18 -0
  108. data/spec/integration/stage_spec.rb +27 -20
  109. data/spec/middleware/batch_completion_spec.rb +34 -0
  110. data/spec/middleware/chain_spec.rb +8 -8
  111. data/spec/middleware/content_type_spec.rb +86 -0
  112. data/spec/middleware/controller_spec.rb +5 -5
  113. data/spec/middleware/dedup_spec.rb +38 -55
  114. data/spec/middleware/dispatch_spec.rb +23 -7
  115. data/spec/middleware/normalize_spec.rb +44 -13
  116. data/spec/middleware/router_spec.rb +29 -30
  117. data/spec/middleware/stage_spec.rb +8 -8
  118. data/spec/middleware/uri_parser_spec.rb +53 -0
  119. data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
  120. data/spec/networking/context_spec.rb +17 -0
  121. data/spec/networking/follow_spec.rb +2 -2
  122. data/spec/networking/pool_spec.rb +5 -5
  123. data/spec/networking/strategy.rb +2 -2
  124. data/spec/page_spec.rb +42 -20
  125. data/spec/parsing/xml_spec.rb +11 -12
  126. data/spec/redis/barrier_spec.rb +8 -48
  127. data/spec/redis/counter_spec.rb +13 -1
  128. data/spec/redis/pool_spec.rb +1 -1
  129. data/spec/spec_helpers.rb +27 -16
  130. data/spec/support/test_app.rb +8 -0
  131. data/spec/task_spec.rb +3 -24
  132. data/spec/wayfarer_spec.rb +1 -1
  133. data/wayfarer.gemspec +4 -3
  134. metadata +61 -51
  135. data/.github/workflows/ci.yaml +0 -32
  136. data/docs/guides/error_handling.md +0 -31
  137. data/docs/guides/networking.md +0 -94
  138. data/docs/guides/performance.md +0 -130
  139. data/docs/guides/reliability.md +0 -41
  140. data/docs/guides/routing/steering.md +0 -30
  141. data/docs/reference/api/base.md +0 -48
  142. data/docs/reference/configuration_keys.md +0 -42
  143. data/docs/reference/environment_variables.md +0 -83
  144. data/lib/wayfarer/cli/base.rb +0 -45
  145. data/lib/wayfarer/cli/generate.rb +0 -17
  146. data/lib/wayfarer/cli/job.rb +0 -56
  147. data/lib/wayfarer/cli/route.rb +0 -29
  148. data/lib/wayfarer/cli/runner.rb +0 -34
  149. data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
  150. data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
  151. data/lib/wayfarer/config/capybara.rb +0 -10
  152. data/lib/wayfarer/config/ferrum.rb +0 -11
  153. data/lib/wayfarer/config/networking.rb +0 -26
  154. data/lib/wayfarer/config/redis.rb +0 -14
  155. data/lib/wayfarer/config/root.rb +0 -11
  156. data/lib/wayfarer/config/selenium.rb +0 -21
  157. data/lib/wayfarer/config/strconv.rb +0 -45
  158. data/lib/wayfarer/config/struct.rb +0 -72
  159. data/lib/wayfarer/middleware/fetch.rb +0 -56
  160. data/lib/wayfarer/redis/connection.rb +0 -13
  161. data/lib/wayfarer/redis/version.rb +0 -19
  162. data/lib/wayfarer/routing/router.rb +0 -28
  163. data/spec/callbacks_spec.rb +0 -102
  164. data/spec/cli/generate_spec.rb +0 -39
  165. data/spec/config/capybara_spec.rb +0 -18
  166. data/spec/config/ferrum_spec.rb +0 -24
  167. data/spec/config/networking_spec.rb +0 -73
  168. data/spec/config/redis_spec.rb +0 -32
  169. data/spec/config/root_spec.rb +0 -31
  170. data/spec/config/selenium_spec.rb +0 -56
  171. data/spec/config/strconv_spec.rb +0 -58
  172. data/spec/config/struct_spec.rb +0 -66
  173. data/spec/integration/steering_spec.rb +0 -57
  174. data/spec/redis/version_spec.rb +0 -13
  175. data/spec/routing/router_spec.rb +0 -24
@@ -1,17 +1,14 @@
1
1
  # Capybara
2
2
 
3
- [Capybara](https://github.com/teamcapybara/capybara) is originally a test
4
- framework for web applications.
5
-
6
- When Capybara is in use, a remote browser process is available as a Capybara
7
- session:
3
+ [Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
4
+ applications which adds a nice API that also works well for web scraping.
8
5
 
9
6
  ```ruby
10
- Wayfarer.config.network.agent = :capybara
11
- # Wayfarer.config.capybara.driver = ...
7
+ Wayfarer.config[:network][:agent] = :capybara
8
+ # Wayfarer.config[:capybara][:driver] = ...
12
9
 
13
10
  class DummyJob < Wayfarer::Worker
14
- route { to :index }
11
+ route.to :index
15
12
 
16
13
  def index
17
14
  browser # => #<Capybara::Session ...>
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
19
16
  end
20
17
  ```
21
18
 
19
+ ## Example: Automating Chrome with Cuprite and Ferrum
22
20
 
23
- ## Configuring a driver
24
-
25
- 1. Install the Capybara driver for the desired user agent.
26
-
27
- For example, to automate Google Chrome with
28
- [Ferrum](https://github.com/rubycdp/ferrum), install the
29
- [Cuprite](https://github.com/rubycdp/cuprite) driver:
21
+ 1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
30
22
 
31
23
  === "RubyGems"
32
24
 
@@ -34,20 +26,19 @@ end
34
26
  gem install cuprite
35
27
  ```
36
28
 
37
- === "Bundler"
29
+ === "Gemfile"
38
30
 
39
31
  ```ruby
40
32
  gem "cuprite" # Gemfile
41
33
  ```
42
34
 
43
- 2. Configure Wayfarer to use the `:capybara` user agent and set the desired
44
- driver:
35
+ 2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
45
36
 
46
37
  === "Runtime"
47
38
 
48
39
  ```ruby
49
- Wayfarer.config.network.agent = :capybara
50
- Wayfarer.config.capybara.driver = :cuprite
40
+ Wayfarer.config[:network][:agent] = :capybara
41
+ Wayfarer.config[:capybara][:driver] = :cuprite
51
42
  ```
52
43
 
53
44
  === "Environment variables"
@@ -57,7 +48,7 @@ end
57
48
  WAYFARER_CAPYBARA_DRIVER=cuprite
58
49
  ```
59
50
 
60
- 3. Register the driver:
51
+ 3. Register the driver with Capybara:
61
52
 
62
53
  ```ruby
63
54
  require "capybara/cuprite"
@@ -66,6 +57,6 @@ end
66
57
 
67
58
  Capybara.register_driver(:cuprite) do |app|
68
59
  # Wayfarer's Ferrum or Selenium options can be passed along
69
- Capybara::Cuprite::Driver.new(app, Wayfarer.config.ferrum.options)
60
+ Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
70
61
  end
71
62
  ```
@@ -1,18 +1,66 @@
1
- # Custom agents
1
+ # User agent API
2
2
 
3
- Wayfarer offers an interface for integrating third-party browsers and HTTP
4
- clients as user agents.
3
+ Wayfarer retrieves web pages with user agents. There are two types of user
4
+ agents: __stateful__ browsers which carry state and follow redirects implicitly,
5
+ and __stateless__ HTTP clients, which handle redirects explicitly.
5
6
 
6
- There are two types of agents:
7
+ Because spawning browser processes or instantiating HTTP clients is expensive,
8
+ Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
9
+ irrecoverable errors are individual user agents destroyed and recreated. For example,
10
+ when a browser process crashes, it is replaced with a new one and checked back
11
+ into the pool. The next job that checks out the user agent gets a fresh
12
+ browser process.
7
13
 
8
- 1. Stateful agents, i.e. browsers, which carry state and support navigation.
9
- These follow HTTP redirects implicitly.
10
- 2. Stateless agents, which deal with HTTP requests/responses only.
11
- These handle HTTP redirects explicitly.
14
+ ## Base interface for custom user agents
12
15
 
13
- ## Implementation
16
+ You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
17
+ module and defining callback methods. The interfaces for stateful and stateless
18
+ share the following instance methods:
14
19
 
15
- Both types can be implemented with callback methods:
20
+ * `#create` (__required__): Called when a new instance (browser process or HTTP client) is
21
+ needed.
22
+ * `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
23
+ processes should be quit, and HTTP clients should be freed.
24
+ * `#renew_on` (optional): Returns a list of exception classes upon which the existing
25
+ instance gets destroyed and replaced with a newly created one.
26
+
27
+ ## Stateless interface
28
+
29
+ The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how
30
+ Wayfarer provides redirect handling out of the box, as there is a configurable limit
31
+ on the number of redirects to follow.
32
+
33
+ In addition to the base interface, stateless user agents implement `#fetch`
34
+ which fetches [pages](../pages) or indicates redirects:
35
+
36
+ * `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
37
+ 3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
38
+ deals with redirects on your behalf to avoid redirect loops. All other status
39
+ codes, including 4xx and 5xx, are considered successful and are indicated by calling
40
+ `success(url:, body:, status_code:, headers:)`.
41
+
42
+ ## Stateful interface
43
+
44
+ In addition to the base interface, stateful user agents implement two additional
45
+ methods:
46
+
47
+ * `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
48
+ Stateful user agents follow redirects implicitly.
49
+ * `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
50
+ into a [page](../pages).
51
+
52
+ ## Recreating user agents on error with `#renew_on`
53
+
54
+ Agents can optionally implement `#renew_on` to get themselves recreated on
55
+ certain errors.
56
+
57
+ If `#fetch` or `#navigate` raises an exception and the exception class is listed
58
+ in `#renew_on`, the instance is destroyed and recreated.
59
+
60
+ * `#renew_on` (optional): A list of exception classes upon which the existing instance gets
61
+ destroyed and replaced with a newly created one.
62
+
63
+ ## Example implementations
16
64
 
17
65
  === "Stateful"
18
66
 
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
20
68
  class StatefulAgent
21
69
  include Wayfarer::Networking::Strategy
22
70
 
23
- def renew_on # optional
24
- [MyBrowser::IrrecoverableError]
25
- end
71
+ # Required methods
26
72
 
27
73
  def create
28
74
  MyBrowser.new
29
75
  end
30
76
 
31
- def destroy(browser) # optional
32
- browser.quit
33
- end
34
-
35
77
  def navigate(browser, url)
36
78
  browser.goto(url)
37
79
  end
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
42
84
  status_code: browser.status_code,
43
85
  headers: browser.headers)
44
86
  end
87
+
88
+ # Optional methods
89
+
90
+ def destroy(browser)
91
+ browser.quit
92
+ end
93
+
94
+ def renew_on
95
+ [MyBrowser::IrrecoverableError]
96
+ end
45
97
  end
46
98
  ```
47
99
 
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
51
103
  class StatelessAgent
52
104
  include Wayfarer::Networking::Strategy
53
105
 
54
- def renew_on # optional
55
- [MyClient::IrrecoverableError]
56
- end
106
+ # Required methods
57
107
 
58
108
  def create
59
109
  MyClient.new
60
110
  end
61
111
 
62
- def destroy(client) # optional
63
- client.close
64
- end
65
-
66
112
  def fetch(client, url)
67
113
  response = client.get(url)
68
114
 
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
73
119
  status_code: response.status_code,
74
120
  headers: response.headers)
75
121
  end
122
+
123
+ # Optional methods
124
+
125
+ def destroy(client)
126
+ client.close
127
+ end
128
+
129
+ def renew_on # optional
130
+ [MyClient::IrrecoverableError]
131
+ end
76
132
  end
77
133
  ```
78
134
 
79
135
 
80
- Register the strategy:
136
+ Register and use the strategy:
81
137
 
82
138
  ```ruby
83
139
  Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
140
+ Wayfarer.config[:network][:agent] = :my_agent
84
141
  ```
85
-
86
- Use the strategy:
87
-
88
- ```ruby
89
- Wayfarer.config.network.agent = :my_agent
90
- ```
91
-
92
- ### Remarks
93
-
94
- #### Self-healing
95
-
96
- * A strategy's `#renew_on` method may return a list of exception classes upon
97
- which the existing instance gets destroyed and replaced with a newly created
98
- one.
99
- * Stateless clients must not raise exceptions when encountering certain HTTP
100
- response codes (for example, 5xx).
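The renewal behaviour described in the new `custom_adapters.md` text can be sketched as follows. This is a minimal illustration under stated assumptions, not Wayfarer's pool: `FlakyBrowser`, `SketchStrategy` and `SketchPool` are hypothetical names, the pool holds a single instance, and re-raising the error after renewal is an assumption made for the sketch.

```ruby
# Hypothetical stand-in for a browser; none of these names come from Wayfarer.
class FlakyBrowser
  class Crashed < StandardError; end

  attr_reader :quit_called

  def initialize
    @quit_called = false
  end

  def goto(url)
    raise Crashed, "browser process died" if url.include?("crash")

    "visited #{url}"
  end

  def quit
    @quit_called = true
  end
end

# Mirrors the callback names from the docs (create, destroy, renew_on,
# navigate) without including Wayfarer::Networking::Strategy.
class SketchStrategy
  def create
    FlakyBrowser.new
  end

  def destroy(browser)
    browser.quit
  end

  def renew_on
    [FlakyBrowser::Crashed]
  end

  def navigate(browser, url)
    browser.goto(url)
  end
end

# A one-slot "pool" (assumption: a real pool would hold several instances).
class SketchPool
  attr_reader :instance

  def initialize(strategy)
    @strategy = strategy
    @instance = strategy.create
  end

  def with_instance
    yield @instance
  rescue *@strategy.renew_on
    # A listed exception: destroy the broken instance, create a fresh one,
    # then re-raise so the caller still sees the failure.
    @strategy.destroy(@instance)
    @instance = @strategy.create
    raise
  end
end

strategy = SketchStrategy.new
pool = SketchPool.new(strategy)

begin
  pool.with_instance { |browser| strategy.navigate(browser, "https://example.com/crash") }
rescue FlakyBrowser::Crashed
  # The job fails, but the next checkout gets a fresh browser.
end
```

Exceptions that are not listed in `renew_on` pass through `with_instance` untouched, so the existing instance stays in the pool.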
@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
11
11
  so:
12
12
 
13
13
  ```ruby
14
- Wayfarer.config.network.agent = :ferrum
14
+ Wayfarer.config[:network][:agent] = :ferrum
15
15
 
16
16
  class DummyWorker < Wayfarer::Worker
17
- route { to :index }
17
+ route.to :index
18
18
 
19
19
  def index
20
20
  browser # => #<Ferrum::Browser ...>
@@ -27,8 +27,8 @@ end
27
27
  === "Runtime"
28
28
 
29
29
  ```ruby
30
- Wayfarer.config.network.agent = :ferrum
31
- Wayfarer.config.ferrum.options = { headless: false, url: "http://chrome:3000" }
30
+ Wayfarer.config[:network][:agent] = :ferrum
31
+ Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
32
32
  ```
33
33
 
34
34
  === "Environment variables"
@@ -1,33 +1,29 @@
1
1
  # Plain HTTP
2
2
 
3
- Wayfarer can retrieve pages via plain HTTP requests, also alongside automated
4
- browsers.
3
+ Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
4
+ also alongside automated browsers.
5
5
 
6
- ## Agent
6
+ ## Ad-hoc GET requests
7
7
 
8
- The HTTP agent is the default.
9
-
10
- ## Ad-hoc requests
11
-
12
- When automating browsers, it can be useful to additionally retrieve the page
8
+ When automating browsers, it can be useful to additionally retrieve another page
13
9
  over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
14
10
 
15
11
  ```ruby
16
12
  class DummyJob < Wayfarer::Base
17
- route { to :index }
13
+ route.to :index
18
14
 
19
15
  def index
20
- http.fetch(task.url) # => #<Wayfarer::Page ...>
16
+ http.fetch("https://example.com") # => #<Wayfarer::Page ...>
21
17
  end
22
18
  end
23
19
  ```
24
20
 
25
- By default, 3 redirects are followed, and this can be configured by passing the
26
- `follow` keyword:
21
+ By default, 3 redirects are followed, and this number can be configured by
22
+ passing the `follow` keyword:
27
23
 
28
24
  ```ruby
29
25
  http.fetch(url, follow: 5)
30
26
  ```
31
27
 
32
- If redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
28
+ When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
33
29
  raised.
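The `follow` limit described in `http.md` can be sketched like this. It is a hedged illustration, not Wayfarer's code: `fetch_following`, `Redirect`, `Success`, `RedirectsExhausted` and the fake client are hypothetical names, and the exact point at which the real library raises is an assumption.

```ruby
# Hypothetical names throughout; this is not Wayfarer's implementation.
class RedirectsExhausted < StandardError; end

Redirect = Struct.new(:url)
Success  = Struct.new(:url, :body, :status_code)

# `fetch_once` is any callable that returns a Redirect or a Success for a URL.
# At most `follow` redirects are followed before giving up.
def fetch_following(url, fetch_once, follow: 3)
  redirects_left = follow

  loop do
    result = fetch_once.call(url)
    return result if result.is_a?(Success)

    raise RedirectsExhausted, "gave up at #{result.url}" if redirects_left.zero?

    redirects_left -= 1
    url = result.url
  end
end

# A fake client: /a redirects to /b, /b redirects to /c, /c answers with 200.
HOPS = { "/a" => "/b", "/b" => "/c" }.freeze

fake_client = lambda do |url|
  HOPS.key?(url) ? Redirect.new(HOPS[url]) : Success.new(url, "ok", 200)
end

page = fetch_following("/a", fake_client)
puts page.url
```

With `follow: 1` the same call raises `RedirectsExhausted`, because reaching `/c` takes two redirects.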
@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
7
7
  so:
8
8
 
9
9
  ```ruby
10
- Wayfarer.config.network.agent = :selenium
10
+ Wayfarer.config[:network][:agent] = :selenium
11
11
 
12
12
  class DummyWorker < Wayfarer::Worker
13
- route { to :index }
13
+ route.to :index
14
14
 
15
15
  def index
16
16
  browser # => #<Selenium::WebDriver ...>
@@ -27,10 +27,10 @@ process.
27
27
  Pages retrieved with a Selenium WebDriver return fake values:
28
28
 
29
29
  ```ruby
30
- Wayfarer.config.network.agent = :selenium
30
+ Wayfarer.config[:network][:agent] = :selenium
31
31
 
32
32
  class DummyJob < Wayfarer::Base
33
- route { to :index }
33
+ route.to :index
34
34
 
35
35
  def index
36
36
  page.headers # => always {}
@@ -39,19 +39,18 @@ process.
39
39
  end
40
40
  ```
41
41
 
42
- !!! note "Consider using [Ferrum](../ferrum) instead"
43
- Ferrum provides superior stability and a richer feature set compared to
44
- Selenium drivers. However Ferrum automates only Google Chrome. Unless a
45
- different browser is required, consider using Ferrum instead of Selenium.
42
+ !!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
43
+ Use Ferrum if you want to automate Google Chrome. It provides superior
44
+ stability and a richer feature set compared to Selenium drivers.
46
45
 
47
46
  ## Configuring Selenium
48
47
 
49
48
  === "Runtime"
50
49
 
51
50
  ```ruby
52
- Wayfarer.config.network.agent = :selenium
53
- Wayfarer.config.selenium.driver = :firefox
54
- Wayfarer.config.selenium.options = { url: "http://firefox" }
51
+ Wayfarer.config[:network][:agent] = :selenium
52
+ Wayfarer.config[:selenium][:driver] = :firefox
53
+ Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
55
54
  ```
56
55
 
57
56
  === "Environment variables"
data/docs/guides/pages.md CHANGED
@@ -1,11 +1,14 @@
1
1
  # Pages
2
2
 
3
- Retrieved pages take the shape of `Wayfarer::Page` objects and are available
4
- to jobs:
3
+ A page is the immutable state of the contents behind a URL at a point in time,
4
+ retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
5
+ response, or the state of a remotely controlled browser.
5
6
 
6
7
  ```ruby
7
- class DummyJob < Wayfarer::Worker
8
- route { to :index }
8
+ class DummyJob < ActiveJob::Base
9
+ include Wayfarer::Base
10
+
11
+ route.to :index
9
12
 
10
13
  def index
11
14
  page # => #<Wayfarer::Page ...>
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
13
16
  page.url # => "https://example.com"
14
17
  page.body # => "<html>..."
15
18
  page.status_code # => 200
16
- page.headers # => { "Content-Type" => ... }
19
+ page.headers # => { "content-type" => ... }
20
+ page.mime_type # => #<MIME::Type: text/html>
21
+
22
+ # The lazily parsed response body or `nil`, depending on the Content-Type
23
+ page.doc # => #<Nokogiri::HTML::Document ...>
17
24
 
18
- # A MetaInspector object for accessing page meta data.
19
25
  # See: https://github.com/metainspector/metainspector
26
+ page.meta # => #<MetaInspector::Document ...>
20
27
  # Examples:
21
28
  page.meta.links.internal
22
29
  page.meta.images.favicon
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
26
33
  end
27
34
  ```
28
35
 
36
+ !!! info "HTTP headers are downcased and case-sensitive"
37
+
38
+ HTTP headers are downcased, so you would access
39
+ `page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
40
+
41
+ ## Response body parsing
42
+
43
+ Wayfarer parses the bodies of HTML, XML and JSON responses according to their
44
+ MIME types:
45
+
46
+ * `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
47
+ * `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
48
+ * `application/json` to `Hash`
49
+
29
50
  ## Live pages
30
51
 
31
- When automating browsers, it is possible the page changes significantly at
32
- runtime, for example due to JavaScript altering the DOM or URL.
52
+ `#!ruby page` initially returns a snapshot of the browser state
53
+ immediately after the user agent navigated to the URL. The browser state may
54
+ change significantly after the page was retrieved, for example due to your own
55
+ interaction, or client-side JavaScript altering the DOM or URL.
33
56
 
34
- To access a page reflecting the current browser state, pass the `live` keyword:
57
+ To get a page that reflects the current browser state, set the `#!ruby :live`
58
+ keyword:
35
59
 
36
60
  ```ruby
37
61
  class DummyJob < Wayfarer::Worker
38
- route { to :index }
62
+ route.to :index
39
63
 
40
64
  def index
41
65
  page # => #<Wayfarer::Page ...>
42
66
 
67
+ # Fill in forms, click buttons, etc.
68
+
43
69
  # Replaces the current Page object with a newer one,
44
70
  # taking into account the DOM as currently rendered by the browser.
45
71
  # Effectful only when automating browsers, no-op when using plain
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
50
76
  end
51
77
  end
52
78
  ```
79
+
80
+ !!! attention "Stateless user agents ignore `#!ruby :live`"
81
+
82
+ The `#!ruby :live` option is ignored by stateless user agents, such as the
83
+ default `#!ruby :http` user agent. Instead, stateless user agents always
84
+ return the same page object.
85
+
86
+ ### Implementing a custom response body parser
87
+
88
+ You can register an object that implements a `#parse` method for any MIME type:
89
+
90
+ ```ruby
91
+ class MyJPEGParser
92
+ def parse(body)
93
+ # Read EXIF metadata here.
94
+ # Return value is accessible as `page.doc`
95
+ end
96
+ end
97
+
98
+ Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
99
+ ```
100
+
101
+ !!! info "Handling responses without a Content-Type"
102
+
103
+ If a response has no `Content-Type` header, Wayfarer falls back to
104
+ `application/octet-stream`. A parser registered for
105
+ `application/octet-stream` will hence also handle all responses without
106
+ a Content-Type.
107
+
108
+ ## Accessing page metadata with MetaInspector
109
+
110
+ You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
111
+ document for accessing metadata of HTML pages. For example, to stage all links
112
+ internal to the current hostname:
113
+
114
+ ```ruby
115
+ def index
116
+ stage page.meta.links.internal
117
+ end
118
+ ```
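The MIME-type lookup and the `application/octet-stream` fallback from `pages.md` can be sketched with a plain Hash. This is an illustration only: `REGISTRY`, `parse_body` and both parser classes are hypothetical and merely imitate `Wayfarer::Parsing.registry`; returning `nil` for unregistered types follows the note that `page.doc` may be `nil`.

```ruby
require "json"

# Hypothetical parsers; each one only needs a #parse(body) method.
class JSONParser
  def parse(body)
    JSON.parse(body)
  end
end

class ByteCountParser
  def parse(body)
    body.bytesize
  end
end

FALLBACK_TYPE = "application/octet-stream"

# Hypothetical registry keyed by MIME type.
REGISTRY = {
  "application/json" => JSONParser.new,
  FALLBACK_TYPE => ByteCountParser.new
}

def parse_body(body, headers)
  # Header names are assumed to be downcased already.
  content_type = headers.fetch("content-type", FALLBACK_TYPE)
  # Drop parameters such as "; charset=utf-8".
  mime_type = content_type.split(";").first.to_s.strip.downcase
  parser = REGISTRY[mime_type]
  parser&.parse(body) # nil when nothing is registered for the MIME type
end

puts parse_body('{"a": 1}', { "content-type" => "application/json; charset=utf-8" }).inspect
puts parse_body("abc", {}).inspect
```

The second call has no `Content-Type` at all, so it is handled by whatever is registered for `application/octet-stream`.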
@@ -0,0 +1,10 @@
1
+ # Redis
2
+
3
+ Wayfarer uses Redis to keep track of:
4
+
5
+ * URLs that were already processed within a batch
6
+ * the number of jobs left in a batch
7
+
8
+ ## Garbage collection
9
+
10
+ Wayfarer cleans up batch-related data
@@ -0,0 +1,74 @@
1
+ # Routing
2
+
3
+ Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
4
+ either instance methods denoted by symbols, or [handlers](/guides/handlers).
5
+ A job's route declarations equate to a predicate tree.
6
+ When a URL is routed, the predicate tree is searched depth-first. If a
7
+ matching leaf predicate is found, the found path's action is dispatched,
8
+ along with `params` collected from path parameters.
9
+
10
+ The following routes:
11
+
12
+ ```ruby
13
+ route.host "example.com", scheme: :https do
14
+ path "/contact", to: :contact
15
+ path "/users/:id", to: [UserHandler, :show]
16
+ end
17
+ ```
18
+
19
+ Equate to the following predicate tree:
20
+
21
+ ```mermaid
22
+ flowchart LR
23
+ RootRoute-->Host["Host <code>example.com</code>"]
24
+ Host-->Scheme["Scheme <code>:https</code>"]
25
+ Scheme-->Path1["Path <code>/contact</code>"]
26
+ Scheme-->Path2["Path <code>/users/:id</code>"]
27
+ Path1-->TargetRoute1["Target <code>:contact</code>"]
28
+ Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
29
+ ```
30
+
31
+ An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
32
+
33
+ ```mermaid
34
+ flowchart LR
35
+ RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
36
+ Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
37
+ Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
38
+ Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
39
+ Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
40
+ Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
41
+ classDef active fill:#7CB342,stroke:#7CB342,color:#fff
42
+ classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
43
+ classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
44
+ ```
45
+
46
+ You can also visualise an invocation of the predicate tree on the command line
47
+ with `wayfarer tree`:
48
+
49
+ ```
50
+ wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
51
+ Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
52
+ └──Host("example.com", match: true)
53
+ └──Scheme(:https, match: true)
54
+ ├──Path("/contact", match: false)
55
+ │ └──Target(match: true)
56
+ └──Path("/users/:id", match: true)
57
+ └──Target(match: true)
58
+ └──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
59
+ ```
60
+
61
+ As you can see, `Target` nodes always match. This means that we could have also defined
62
+ our routes as:
63
+
64
+ ```ruby
65
+ route.host "example.com", scheme: :https do
66
+ to :contact do
67
+ path "/contact"
68
+ end
69
+
70
+ to [UserHandler, :show] do
71
+ path "/users/:id"
72
+ end
73
+ end
74
+ ```
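The depth-first search over the predicate tree described in `routing.md` can be sketched in plain Ruby. The node classes below are hypothetical and only mirror the names in the diagrams; they are not Wayfarer's routing classes, and the string `"UserHandler"` stands in for the handler class.

```ruby
require "uri"

# Hypothetical node classes for illustration only.
class Node
  attr_reader :children

  def initialize(children = [])
    @children = children
  end

  # Depth-first search. Returns [action, params] for the first path that
  # matches all the way down to a Target leaf, or nil if none does.
  def search(uri, params = {})
    matched = match(uri)
    return nil unless matched

    params = params.merge(matched)
    return [action, params] if is_a?(Target)

    children.each do |child|
      found = child.search(uri, params)
      return found if found
    end
    nil
  end

  # Subclasses return a Hash of collected params on match, nil otherwise.
  def match(_uri)
    {}
  end
end

class Root < Node; end

class Host < Node
  def initialize(host, children)
    super(children)
    @host = host
  end

  def match(uri)
    uri.host == @host ? {} : nil
  end
end

class Scheme < Node
  def initialize(scheme, children)
    super(children)
    @scheme = scheme.to_s
  end

  def match(uri)
    uri.scheme == @scheme ? {} : nil
  end
end

class Path < Node
  def initialize(pattern, children)
    super(children)
    @segments = pattern.split("/")
  end

  def match(uri)
    actual = uri.path.split("/")
    return nil unless actual.size == @segments.size

    @segments.zip(actual).each_with_object({}) do |(expected, got), params|
      if expected.start_with?(":")
        params[expected.delete_prefix(":").to_sym] = got
      elsif expected != got
        return nil
      end
    end
  end
end

# Target nodes always match; they only carry the action to dispatch.
class Target < Node
  attr_reader :action

  def initialize(action)
    super([])
    @action = action
  end
end

tree = Root.new([
  Host.new("example.com", [
    Scheme.new(:https, [
      Path.new("/contact", [Target.new(:contact)]),
      Path.new("/users/:id", [Target.new(["UserHandler", :show])])
    ])
  ])
])

action, params = tree.search(URI("https://example.com/users/42"))
puts action.inspect
puts params.inspect
```

Because the `/contact` branch fails at its `Path` node, the search never reaches that branch's `Target` and moves on to the sibling `/users/:id` branch.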
data/docs/guides/tasks.md CHANGED
@@ -1,14 +1,38 @@
1
1
  # Tasks
2
2
 
3
- Tasks are the immutable units of work processed by [jobs](/guides/jobs). A task
4
- consists of:
3
+ Tasks are the immutable units of work read from a message queue and processed by
4
+ [jobs](/guides/jobs). A task consists of two strings:
5
5
 
6
- 1. The __URL__ to process
7
- * Within a batch, every URL gets processed at most once.
6
+ * The __URL__ to process
7
+ * The __batch__ the task belongs to
8
8
 
9
- 2. The __batch__ the task belongs to
10
- * Like URLs, batches are strings.
9
+ A job processing a task commonly appends more tasks to the queue in turn.
11
10
 
12
- Tasks get appended to the end of a message queue, and consumed from the
13
- beginning. Because jobs can enqueue other tasks, jobs are both consumers
14
- and producers of tasks.
11
+ !!! info "Task URLs are not normalized"
12
+
13
+ The URL returned by `task.url` is not normalized but verbatim
14
+ as it was staged or enqueued.
15
+
16
+ ## Task deduplication
17
+
18
+ Wayfarer ensures that no URL gets processed twice within a batch. It achieves
19
+ this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
20
+ keyed by normalized URLs.
21
+
22
+ ### URL normalization
23
+
24
+ Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
25
+ and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
26
+
27
+ URL normalization is used only for deduplication, and does not affect the URL
28
+ returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
29
+ enqueued. This allows you to follow the exact URLs you may have parsed from a
30
+ response body.
31
+
32
+ ## Invalid URLs
33
+
34
+ Tasks with invalid URLs (for example `ht%0atp://localhost/`, with a newline in the
35
+ protocol) are discarded, since they can't get retrieved. No exception is raised,
36
+ and the job is considered successfully processed, since there are no corrective
37
+ actions an error handler could take as tasks are immutable, and retries would
38
+ not change the outcome.
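The deduplication and invalid-URL rules from `tasks.md` can be sketched with the standard library. This is a rough illustration under assumptions: `BatchDedup` is a hypothetical in-memory stand-in for the Redis hash, and its normalization is deliberately crude compared to Addressable and `normalize_url`.

```ruby
require "set"
require "uri"

# Hypothetical in-memory stand-in for the Redis-backed bookkeeping.
class BatchDedup
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns :process, :duplicate or :invalid for a task's batch and URL.
  # The verbatim URL is never modified; only the dedup key is normalized.
  def check(batch, url)
    key = normalize(url)
    return :invalid if key.nil?

    @seen[batch].add?(key) ? :process : :duplicate
  end

  private

  # Crude normalization (assumption): lowercase scheme and host, keep only
  # non-default ports, drop the fragment and a trailing slash.
  def normalize(url)
    uri = URI.parse(url)
    return nil unless uri.is_a?(URI::HTTP) && uri.host

    port = uri.port == uri.default_port ? "" : ":#{uri.port}"
    path = uri.path.chomp("/")
    query = uri.query ? "?#{uri.query}" : ""
    "#{uri.scheme.downcase}://#{uri.host.downcase}#{port}#{path}#{query}"
  rescue URI::InvalidURIError
    nil
  end
end

dedup = BatchDedup.new
puts dedup.check("batch-1", "https://Example.com/a/")
puts dedup.check("batch-1", "https://example.com/a#top")
puts dedup.check("batch-1", "ht%0atp://localhost/")
```

Invalid URLs are reported rather than raised here, which matches the description that such tasks are discarded without an exception.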