wayfarer 0.4.6 → 0.4.7

Files changed (175)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/lint.yaml +25 -0
  3. data/.github/workflows/release.yaml +29 -0
  4. data/.github/workflows/tests.yaml +30 -0
  5. data/.gitignore +4 -0
  6. data/.rubocop.yml +5 -0
  7. data/.vale.ini +5 -0
  8. data/.yardopts +1 -3
  9. data/Dockerfile +5 -4
  10. data/Gemfile +3 -0
  11. data/Gemfile.lock +107 -102
  12. data/Rakefile +5 -56
  13. data/bin/wayfarer +1 -1
  14. data/docker-compose.yml +20 -9
  15. data/docs/cookbook/consent_screen.md +2 -2
  16. data/docs/cookbook/executing_javascript.md +3 -3
  17. data/docs/cookbook/navigation.md +12 -12
  18. data/docs/cookbook/querying_html.md +3 -3
  19. data/docs/cookbook/screenshots.md +2 -2
  20. data/docs/cookbook/user_agent.md +1 -1
  21. data/docs/design.md +36 -0
  22. data/docs/guides/callbacks.md +24 -126
  23. data/docs/guides/configuration.md +8 -8
  24. data/docs/guides/handlers.md +60 -0
  25. data/docs/guides/index.md +1 -0
  26. data/docs/guides/jobs/error_handling.md +40 -0
  27. data/docs/guides/jobs.md +99 -31
  28. data/docs/guides/navigation.md +1 -1
  29. data/docs/guides/networking/capybara.md +13 -22
  30. data/docs/guides/networking/custom_adapters.md +82 -41
  31. data/docs/guides/networking/ferrum.md +4 -4
  32. data/docs/guides/networking/http.md +9 -13
  33. data/docs/guides/networking/selenium.md +10 -11
  34. data/docs/guides/pages.md +76 -10
  35. data/docs/guides/redis.md +10 -0
  36. data/docs/guides/routing.md +74 -0
  37. data/docs/guides/tasks.md +33 -9
  38. data/docs/guides/tutorial.md +60 -0
  39. data/docs/guides/user_agents.md +113 -0
  40. data/docs/index.md +17 -40
  41. data/docs/reference/cli.md +35 -25
  42. data/docs/reference/configuration.md +36 -0
  43. data/lib/wayfarer/base.rb +124 -46
  44. data/lib/wayfarer/batch_completion.rb +56 -0
  45. data/lib/wayfarer/callbacks.rb +22 -48
  46. data/lib/wayfarer/cli/route_printer.rb +71 -57
  47. data/lib/wayfarer/cli.rb +121 -0
  48. data/lib/wayfarer/gc.rb +13 -6
  49. data/lib/wayfarer/handler.rb +15 -7
  50. data/lib/wayfarer/logging.rb +38 -0
  51. data/lib/wayfarer/middleware/base.rb +2 -0
  52. data/lib/wayfarer/middleware/batch_completion.rb +19 -0
  53. data/lib/wayfarer/middleware/content_type.rb +54 -0
  54. data/lib/wayfarer/middleware/controller.rb +19 -15
  55. data/lib/wayfarer/middleware/dedup.rb +16 -13
  56. data/lib/wayfarer/middleware/dispatch.rb +12 -4
  57. data/lib/wayfarer/middleware/normalize.rb +12 -11
  58. data/lib/wayfarer/middleware/redis.rb +15 -0
  59. data/lib/wayfarer/middleware/router.rb +33 -35
  60. data/lib/wayfarer/middleware/stage.rb +5 -5
  61. data/lib/wayfarer/middleware/uri_parser.rb +30 -0
  62. data/lib/wayfarer/middleware/user_agent.rb +49 -0
  63. data/lib/wayfarer/networking/capybara.rb +1 -1
  64. data/lib/wayfarer/networking/context.rb +2 -2
  65. data/lib/wayfarer/networking/ferrum.rb +2 -2
  66. data/lib/wayfarer/networking/follow.rb +12 -6
  67. data/lib/wayfarer/networking/http.rb +1 -1
  68. data/lib/wayfarer/networking/pool.rb +17 -12
  69. data/lib/wayfarer/networking/selenium.rb +3 -3
  70. data/lib/wayfarer/networking/strategy.rb +2 -2
  71. data/lib/wayfarer/page.rb +36 -14
  72. data/lib/wayfarer/parsing/xml.rb +6 -6
  73. data/lib/wayfarer/parsing.rb +24 -0
  74. data/lib/wayfarer/redis/barrier.rb +13 -21
  75. data/lib/wayfarer/redis/counter.rb +19 -9
  76. data/lib/wayfarer/redis/pool.rb +1 -1
  77. data/lib/wayfarer/redis/resettable.rb +19 -0
  78. data/lib/wayfarer/routing/dsl.rb +1 -0
  79. data/lib/wayfarer/routing/matchers/path.rb +4 -2
  80. data/lib/wayfarer/routing/root_route.rb +5 -1
  81. data/lib/wayfarer/routing/route.rb +4 -14
  82. data/lib/wayfarer/stringify.rb +22 -30
  83. data/lib/wayfarer/task.rb +12 -18
  84. data/lib/wayfarer.rb +28 -1
  85. data/mkdocs.yml +52 -7
  86. data/rake/docs.rake +26 -0
  87. data/rake/lint.rake +105 -0
  88. data/rake/release.rake +29 -0
  89. data/rake/tests.rake +28 -0
  90. data/requirements.txt +1 -1
  91. data/spec/base_spec.rb +140 -160
  92. data/spec/batch_completion_spec.rb +104 -0
  93. data/spec/cli/job_spec.rb +19 -23
  94. data/spec/cli/routing_spec.rb +101 -0
  95. data/spec/cli/version_spec.rb +1 -1
  96. data/spec/factories/task.rb +7 -1
  97. data/spec/fixtures/dummy_job.rb +5 -3
  98. data/spec/gc_spec.rb +8 -50
  99. data/spec/handler_spec.rb +1 -1
  100. data/spec/integration/callbacks_spec.rb +157 -45
  101. data/spec/integration/content_type_spec.rb +145 -0
  102. data/spec/integration/gc_spec.rb +44 -0
  103. data/spec/integration/handler_spec.rb +66 -0
  104. data/spec/integration/page_spec.rb +44 -29
  105. data/spec/integration/params_spec.rb +33 -25
  106. data/spec/integration/parsing_spec.rb +125 -0
  107. data/spec/integration/routing_spec.rb +18 -0
  108. data/spec/integration/stage_spec.rb +27 -20
  109. data/spec/middleware/batch_completion_spec.rb +34 -0
  110. data/spec/middleware/chain_spec.rb +8 -8
  111. data/spec/middleware/content_type_spec.rb +86 -0
  112. data/spec/middleware/controller_spec.rb +5 -5
  113. data/spec/middleware/dedup_spec.rb +38 -55
  114. data/spec/middleware/dispatch_spec.rb +23 -7
  115. data/spec/middleware/normalize_spec.rb +44 -13
  116. data/spec/middleware/router_spec.rb +29 -30
  117. data/spec/middleware/stage_spec.rb +8 -8
  118. data/spec/middleware/uri_parser_spec.rb +53 -0
  119. data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
  120. data/spec/networking/context_spec.rb +1 -1
  121. data/spec/networking/follow_spec.rb +2 -2
  122. data/spec/networking/pool_spec.rb +5 -5
  123. data/spec/networking/strategy.rb +2 -2
  124. data/spec/page_spec.rb +42 -20
  125. data/spec/parsing/xml_spec.rb +11 -12
  126. data/spec/redis/barrier_spec.rb +8 -48
  127. data/spec/redis/counter_spec.rb +13 -1
  128. data/spec/redis/pool_spec.rb +1 -1
  129. data/spec/spec_helpers.rb +27 -16
  130. data/spec/support/test_app.rb +8 -0
  131. data/spec/task_spec.rb +3 -24
  132. data/spec/wayfarer_spec.rb +1 -1
  133. data/wayfarer.gemspec +4 -3
  134. metadata +61 -51
  135. data/.github/workflows/ci.yaml +0 -32
  136. data/docs/guides/error_handling.md +0 -53
  137. data/docs/guides/networking.md +0 -94
  138. data/docs/guides/performance.md +0 -130
  139. data/docs/guides/reliability.md +0 -41
  140. data/docs/guides/routing/steering.md +0 -30
  141. data/docs/reference/api/base.md +0 -48
  142. data/docs/reference/configuration_keys.md +0 -43
  143. data/docs/reference/environment_variables.md +0 -83
  144. data/lib/wayfarer/cli/base.rb +0 -45
  145. data/lib/wayfarer/cli/generate.rb +0 -17
  146. data/lib/wayfarer/cli/job.rb +0 -56
  147. data/lib/wayfarer/cli/route.rb +0 -29
  148. data/lib/wayfarer/cli/runner.rb +0 -34
  149. data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
  150. data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
  151. data/lib/wayfarer/config/capybara.rb +0 -10
  152. data/lib/wayfarer/config/ferrum.rb +0 -11
  153. data/lib/wayfarer/config/networking.rb +0 -29
  154. data/lib/wayfarer/config/redis.rb +0 -14
  155. data/lib/wayfarer/config/root.rb +0 -11
  156. data/lib/wayfarer/config/selenium.rb +0 -21
  157. data/lib/wayfarer/config/strconv.rb +0 -45
  158. data/lib/wayfarer/config/struct.rb +0 -72
  159. data/lib/wayfarer/middleware/fetch.rb +0 -56
  160. data/lib/wayfarer/redis/connection.rb +0 -13
  161. data/lib/wayfarer/redis/version.rb +0 -19
  162. data/lib/wayfarer/routing/router.rb +0 -28
  163. data/spec/callbacks_spec.rb +0 -102
  164. data/spec/cli/generate_spec.rb +0 -39
  165. data/spec/config/capybara_spec.rb +0 -18
  166. data/spec/config/ferrum_spec.rb +0 -24
  167. data/spec/config/networking_spec.rb +0 -73
  168. data/spec/config/redis_spec.rb +0 -32
  169. data/spec/config/root_spec.rb +0 -31
  170. data/spec/config/selenium_spec.rb +0 -56
  171. data/spec/config/strconv_spec.rb +0 -58
  172. data/spec/config/struct_spec.rb +0 -66
  173. data/spec/integration/steering_spec.rb +0 -57
  174. data/spec/redis/version_spec.rb +0 -13
  175. data/spec/routing/router_spec.rb +0 -24
@@ -1,17 +1,14 @@
1
1
  # Capybara
2
2
 
3
- [Capybara](https://github.com/teamcapybara/capybara) is originally a test
4
- framework for web applications.
5
-
6
- When Capybara is in use, a remote browser process is available as a Capybara
7
- session:
3
+ [Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
4
+ applications whose API also works well for web scraping.
8
5
 
9
6
  ```ruby
10
- Wayfarer.config.network.agent = :capybara
11
- # Wayfarer.config.capybara.driver = ...
7
+ Wayfarer.config[:network][:agent] = :capybara
8
+ # Wayfarer.config[:capybara][:driver] = ...
12
9
 
13
10
  class DummyJob < Wayfarer::Worker
14
- route { to :index }
11
+ route.to :index
15
12
 
16
13
  def index
17
14
  browser # => #<Capybara::Session ...>
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
19
16
  end
20
17
  ```
21
18
 
19
+ ## Example: Automating Chrome with Cuprite and Ferrum
22
20
 
23
- ## Configuring a driver
24
-
25
- 1. Install the Capybara driver for the desired user agent.
26
-
27
- For example, to automate Google Chrome with
28
- [Ferrum](https://github.com/rubycdp/ferrum), install the
29
- [Cuprite](https://github.com/rubycdp/cuprite) driver:
21
+ 1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
30
22
 
31
23
  === "RubyGems"
32
24
 
@@ -34,20 +26,19 @@ end
34
26
  gem install cuprite
35
27
  ```
36
28
 
37
- === "Bundler"
29
+ === "Gemfile"
38
30
 
39
31
  ```ruby
40
32
  gem "cuprite" # Gemfile
41
33
  ```
42
34
 
43
- 2. Configure Wayfarer to use the `:capybara` user agent and set the desired
44
- driver:
35
+ 2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
45
36
 
46
37
  === "Runtime"
47
38
 
48
39
  ```ruby
49
- Wayfarer.config.network.agent = :capybara
50
- Wayfarer.config.capybara.driver = :cuprite
40
+ Wayfarer.config[:network][:agent] = :capybara
41
+ Wayfarer.config[:capybara][:driver] = :cuprite
51
42
  ```
52
43
 
53
44
  === "Environment variables"
@@ -57,7 +48,7 @@ end
57
48
  WAYFARER_CAPYBARA_DRIVER=cuprite
58
49
  ```
59
50
 
60
- 3. Register the driver:
51
+ 3. Register the driver with Capybara:
61
52
 
62
53
  ```ruby
63
54
  require "capybara/cuprite"
@@ -66,6 +57,6 @@ end
66
57
 
67
58
  Capybara.register_driver(:cuprite) do |app|
68
59
  # Wayfarer's Ferrum or Selenium options can be passed along
69
- Capybara::Cuprite::Driver.new(app, Wayfarer.config.ferrum.options)
60
+ Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
70
61
  end
71
62
  ```
@@ -1,18 +1,66 @@
1
- # Custom agents
1
+ # User agent API
2
2
 
3
- Wayfarer offers an interface for integrating third-party browsers and HTTP
4
- clients as user agents.
3
+ Wayfarer retrieves web pages with user agents. There are two types of user
4
+ agents: __stateful__ browsers which carry state and follow redirects implicitly,
5
+ and __stateless__ HTTP clients, which handle redirects explicitly.
5
6
 
6
- There are two types of agents:
7
+ Because spawning browser processes or instantiating HTTP clients is expensive,
8
+ Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
9
+ irrecoverable errors are individual user agents destroyed and recreated. For example,
10
+ when a browser process crashes, it is replaced with a new one and checked back
11
+ into the pool. The next job that checks out the user agent gets a fresh
12
+ browser process.
7
13
 
8
- 1. Stateful agents, i.e. browsers, which carry state and support navigation.
9
- These follow HTTP redirects implicitly.
10
- 2. Stateless agents, which deal with HTTP requests/responses only.
11
- These handle HTTP redirects explicitly.
14
+ ## Base interface for custom user agents
12
15
 
13
- ## Implementation
16
+ You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
17
+ module and defining callback methods. The stateful and stateless interfaces
18
+ share the following instance methods:
14
19
 
15
- Both types can be implemented with callback methods:
20
+ * `#create` (__required__): Called when a new instance (browser process or HTTP client) is
21
+ needed.
22
+ * `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
23
+ processes should be quit, and HTTP clients should be freed.
24
+ * `#renew_on` (optional): Returns a list of exception classes upon which the existing
25
+ instance gets destroyed and replaced with a newly created one.
26
+
27
+ ## Stateless interface
28
+
29
+ The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how
30
+ Wayfarer provides redirect handling out of the box, as there is a configurable limit
31
+ on the number of redirects to follow.
32
+
33
+ In addition to the base interface, stateless user agents implement `#fetch`
34
+ which fetches [pages](../pages) or indicates redirects:
35
+
36
+ * `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
37
+ 3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
38
+ deals with redirects on your behalf to avoid redirect loops. All other status
39
+ codes, including 4xx and 5xx, are considered successful and are indicated by calling
40
+ `success(url:, body:, status_code:, headers:)`.
41
+
42
+ ## Stateful interface
43
+
44
+ In addition to the base interface, stateful user agents implement two additional
45
+ methods:
46
+
47
+ * `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
48
+ Stateful user agents follow redirects implicitly.
49
+ * `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
50
+ into a [page](../pages).
51
+
52
+ ## Recreating user agents on error with `#renew_on`
53
+
54
+ Agents can optionally implement `#renew_on` to get themselves recreated on
55
+ certain errors.
56
+
57
+ If `#fetch` or `#navigate` raises an exception and the exception class is listed
58
+ in `#renew_on`, the instance is destroyed and recreated.
59
+
60
+ * `#renew_on` (optional): A list of exception classes upon which the existing instance gets
61
+ destroyed and replaced with a newly created one.
62
+
63
+ ## Example implementations
16
64
 
17
65
  === "Stateful"
18
66
 
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
20
68
  class StatefulAgent
21
69
  include Wayfarer::Networking::Strategy
22
70
 
23
- def renew_on # optional
24
- [MyBrowser::IrrecoverableError]
25
- end
71
+ # Required methods
26
72
 
27
73
  def create
28
74
  MyBrowser.new
29
75
  end
30
76
 
31
- def destroy(browser) # optional
32
- browser.quit
33
- end
34
-
35
77
  def navigate(browser, url)
36
78
  browser.goto(url)
37
79
  end
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
42
84
  status_code: browser.status_code,
43
85
  headers: browser.headers)
44
86
  end
87
+
88
+ # Optional methods
89
+
90
+ def destroy(browser)
91
+ browser.quit
92
+ end
93
+
94
+ def renew_on
95
+ [MyBrowser::IrrecoverableError]
96
+ end
45
97
  end
46
98
  ```
47
99
 
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
51
103
  class StatelessAgent
52
104
  include Wayfarer::Networking::Strategy
53
105
 
54
- def renew_on # optional
55
- [MyClient::IrrecoverableError]
56
- end
106
+ # Required methods
57
107
 
58
108
  def create
59
109
  MyClient.new
60
110
  end
61
111
 
62
- def destroy(client) # optional
63
- client.close
64
- end
65
-
66
112
  def fetch(client, url)
67
113
  response = client.get(url)
68
114
 
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
73
119
  status_code: response.status_code,
74
120
  headers: response.headers)
75
121
  end
122
+
123
+ # Optional methods
124
+
125
+ def destroy(client)
126
+ client.close
127
+ end
128
+
129
+ def renew_on
130
+ [MyClient::IrrecoverableError]
131
+ end
76
132
  end
77
133
  ```
78
134
 
79
135
 
80
- Register the strategy:
136
+ Register and use the strategy:
81
137
 
82
138
  ```ruby
83
139
  Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
140
+ Wayfarer.config[:network][:agent] = :my_agent
84
141
  ```
85
-
86
- Use the strategy:
87
-
88
- ```ruby
89
- Wayfarer.config.network.agent = :my_agent
90
- ```
91
-
92
- ### Remarks
93
-
94
- #### Self-healing
95
-
96
- * A strategy's `#renew_on` method may return a list of exception classes upon
97
- which the existing instance gets destroyed and replaced with a newly created
98
- one.
99
- * Stateless clients must not raise exceptions when encountering certain HTTP
100
- response codes (for example, 5xx).
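The renewal behaviour described for `#renew_on` can be pictured with a toy, single-instance pool. This is only a sketch: `FlakyBrowser`, `FlakyStrategy` and `SingleInstancePool` are hypothetical names that are not part of Wayfarer, and re-raising the error after the instance has been replaced is an assumption rather than documented behaviour.

```ruby
# Hypothetical stand-in for a browser library; not part of Wayfarer.
class FlakyBrowser
  class Crashed < StandardError; end

  def initialize
    @quit = false
  end

  def goto(url)
    raise Crashed, "crashed on #{url}" if url.include?("crash")

    "visited #{url}"
  end

  def quit
    @quit = true
  end

  def quit?
    @quit
  end
end

# Mirrors the callback names from the guide: create, destroy, renew_on, navigate.
class FlakyStrategy
  def create
    FlakyBrowser.new
  end

  def destroy(browser)
    browser.quit
  end

  def renew_on
    [FlakyBrowser::Crashed]
  end

  def navigate(browser, url)
    browser.goto(url)
  end
end

# Toy pool (assumption): holds one instance and replaces it whenever an
# exception listed in #renew_on escapes the block.
class SingleInstancePool
  attr_reader :instance

  def initialize(strategy)
    @strategy = strategy
    @instance = strategy.create
  end

  def with_instance
    yield @instance
  rescue *@strategy.renew_on
    @strategy.destroy(@instance)
    @instance = @strategy.create
    raise # assumption: the caller still sees the original error
  end
end
```

In this sketch, exceptions that are not listed in `#renew_on` pass through without touching the pooled instance, so only errors the strategy declares irrecoverable cost a new browser process.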
@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
11
11
  so:
12
12
 
13
13
  ```ruby
14
- Wayfarer.config.network.agent = :ferrum
14
+ Wayfarer.config[:network][:agent] = :ferrum
15
15
 
16
16
  class DummyWorker < Wayfarer::Worker
17
- route { to :index }
17
+ route.to :index
18
18
 
19
19
  def index
20
20
  browser # => #<Ferrum::Browser ...>
@@ -27,8 +27,8 @@ end
27
27
  === "Runtime"
28
28
 
29
29
  ```ruby
30
- Wayfarer.config.network.agent = :ferrum
31
- Wayfarer.config.ferrum.options = { headless: false, url: "http://chrome:3000" }
30
+ Wayfarer.config[:network][:agent] = :ferrum
31
+ Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
32
32
  ```
33
33
 
34
34
  === "Environment variables"
@@ -1,33 +1,29 @@
1
1
  # Plain HTTP
2
2
 
3
- Wayfarer can retrieve pages via plain HTTP requests, also alongside automated
4
- browsers.
3
+ Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
4
+ also alongside automated browsers.
5
5
 
6
- ## Agent
6
+ ## Ad-hoc GET requests
7
7
 
8
- The HTTP agent is the default.
9
-
10
- ## Ad-hoc requests
11
-
12
- When automating browsers, it can be useful to additionally retrieve the page
8
+ When automating browsers, it can be useful to additionally retrieve another page
13
9
  over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
14
10
 
15
11
  ```ruby
16
12
  class DummyJob < Wayfarer::Base
17
- route { to :index }
13
+ route.to :index
18
14
 
19
15
  def index
20
- http.fetch(task.url) # => #<Wayfarer::Page ...>
16
+ http.fetch("https://example.com") # => #<Wayfarer::Page ...>
21
17
  end
22
18
  end
23
19
  ```
24
20
 
25
- By default, 3 redirects are followed, and this can be configured by passing the
26
- `follow` keyword:
21
+ By default, 3 redirects are followed, and this number can be configured by
22
+ passing the `follow` keyword:
27
23
 
28
24
  ```ruby
29
25
  http.fetch(url, follow: 5)
30
26
  ```
31
27
 
32
- If redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
28
+ When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
33
29
  raised.
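To make the redirect limit concrete, here is a minimal sketch of how a caller might follow redirects reported by a stateless `#fetch`. All names (`CannedClient`, `CannedStrategy`, `Redirect`, `Success`, `RedirectsExhausted`, `fetch_following`) are hypothetical and only imitate the behaviour described above; Wayfarer's own implementation may differ.

```ruby
# Hypothetical names throughout; this is not Wayfarer's implementation.
class RedirectsExhausted < StandardError; end

Redirect = Struct.new(:url)
Success  = Struct.new(:url, :body, :status_code, :headers, keyword_init: true)

# A canned "HTTP client" that maps URLs to [status, headers, body] triples.
class CannedClient
  def initialize(responses)
    @responses = responses
  end

  def get(url)
    @responses.fetch(url)
  end
end

# A stateless strategy reports 3xx responses instead of following them.
class CannedStrategy
  def fetch(client, url)
    status, headers, body = client.get(url)
    return Redirect.new(headers.fetch("location")) if (300..399).cover?(status)

    Success.new(url: url, body: body, status_code: status, headers: headers)
  end
end

# Follows at most `follow` redirects, then gives up.
def fetch_following(strategy, client, url, follow: 3)
  loop do
    result = strategy.fetch(client, url)
    return result if result.is_a?(Success)
    raise RedirectsExhausted, "too many redirects" if follow.zero?

    follow -= 1
    url = result.url
  end
end
```

Because the strategy only reports redirects, the limit lives in one place, and a redirect loop ends with an error instead of running forever.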
@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
7
7
  so:
8
8
 
9
9
  ```ruby
10
- Wayfarer.config.network.agent = :selenium
10
+ Wayfarer.config[:network][:agent] = :selenium
11
11
 
12
12
  class DummyWorker < Wayfarer::Worker
13
- route { to :index }
13
+ route.to :index
14
14
 
15
15
  def index
16
16
  browser # => #<Selenium::WebDriver ...>
@@ -27,10 +27,10 @@ process.
27
27
  Pages retrieved with a Selenium WebDriver return fake values:
28
28
 
29
29
  ```ruby
30
- Wayfarer.config.network.agent = :selenium
30
+ Wayfarer.config[:network][:agent] = :selenium
31
31
 
32
32
  class DummyJob < Wayfarer::Base
33
- route { to :index }
33
+ route.to :index
34
34
 
35
35
  def index
36
36
  page.headers # => always {}
@@ -39,19 +39,18 @@ process.
39
39
  end
40
40
  ```
41
41
 
42
- !!! note "Consider using [Ferrum](../ferrum) instead"
43
- Ferrum provides superior stability and a richer feature set compared to
44
- Selenium drivers. However Ferrum automates only Google Chrome. Unless a
45
- different browser is required, consider using Ferrum instead of Selenium.
42
+ !!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
43
+ Use Ferrum if you want to automate Google Chrome. It provides superior
44
+ stability and a richer feature set compared to Selenium drivers.
46
45
 
47
46
  ## Configuring Selenium
48
47
 
49
48
  === "Runtime"
50
49
 
51
50
  ```ruby
52
- Wayfarer.config.network.agent = :selenium
53
- Wayfarer.config.selenium.driver = :firefox
54
- Wayfarer.config.selenium.options = { url: "http://firefox" }
51
+ Wayfarer.config[:network][:agent] = :selenium
52
+ Wayfarer.config[:selenium][:driver] = :firefox
53
+ Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
55
54
  ```
56
55
 
57
56
  === "Environment variables"
data/docs/guides/pages.md CHANGED
@@ -1,11 +1,14 @@
1
1
  # Pages
2
2
 
3
- Retrieved pages take the shape of `Wayfarer::Page` objects and are available
4
- to jobs:
3
+ A page is the immutable state of the contents behind a URL at a point in time,
4
+ retrieved by a [user agent](user_agents.md). In other words, a page is an HTTP
5
+ response, or the state of a remotely controlled browser.
5
6
 
6
7
  ```ruby
7
- class DummyJob < Wayfarer::Worker
8
- route { to :index }
8
+ class DummyJob < ActiveJob::Base
9
+ include Wayfarer::Base
10
+
11
+ route.to :index
9
12
 
10
13
  def index
11
14
  page # => #<Wayfarer::Page ...>
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
13
16
  page.url # => "https://example.com"
14
17
  page.body # => "<html>..."
15
18
  page.status_code # => 200
16
- page.headers # => { "Content-Type" => ... }
19
+ page.headers # => { "content-type" => ... }
20
+ page.mime_type # => #<MIME::Type: text/html>
21
+
22
+ # The lazily parsed response body or `nil`, depending on the Content-Type
23
+ page.doc # => #<Nokogiri::HTML::Document ...>
17
24
 
18
- # A MetaInspector object for accessing page meta data.
19
25
  # See: https://github.com/metainspector/metainspector
26
+ page.meta # => #<MetaInspector::Document ...>
20
27
  # Examples:
21
28
  page.meta.links.internal
22
29
  page.meta.images.favicon
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
26
33
  end
27
34
  ```
28
35
 
36
+ !!! info "HTTP headers are downcased and case-sensitive"
37
+
38
+ HTTP headers are downcased, so you would access
39
+ `page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
40
+
41
+ ## Response body parsing
42
+
43
+ Wayfarer parses the bodies of HTML, XML and JSON responses according to their
44
+ MIME types:
45
+
46
+ * `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
47
+ * `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
48
+ * `application/json` to `Hash`
49
+
29
50
  ## Live pages
30
51
 
31
- When automating browsers, it is possible the page changes significantly at
32
- runtime, for example due to JavaScript altering the DOM or URL.
52
+ `#!ruby page` initially returns a snapshot of the browser state
53
+ immediately after the user agent navigated to the URL. The browser state may
54
+ change significantly after the page was retrieved, for example due to your own
55
+ interaction, or client-side JavaScript altering the DOM or URL.
33
56
 
34
- To access a page reflecting the current browser state, pass the `live` keyword:
57
+ To get a page that reflects the current browser state, set the `#!ruby :live`
58
+ keyword:
35
59
 
36
60
  ```ruby
37
61
  class DummyJob < Wayfarer::Worker
38
- route { to :index }
62
+ route.to :index
39
63
 
40
64
  def index
41
65
  page # => #<Wayfarer::Page ...>
42
66
 
67
+ # Fill in forms, click buttons, etc.
68
+
43
69
  # Replaces the current Page object with a newer one,
44
70
  # taking into account the DOM as currently rendered by the browser.
45
71
  # Effectful only when automating browsers, no-op when using plain
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
50
76
  end
51
77
  end
52
78
  ```
79
+
80
+ !!! attention "Stateless user agents ignore `#!ruby :live`"
81
+
82
+ The `#!ruby :live` option is ignored by stateless user agents, such as the
83
+ default `#!ruby :http` user agent. Instead, stateless user agents always
84
+ return the same page object.
85
+
86
+ ### Implementing a custom response body parser
87
+
88
+ You can register an object that implements a `#parse` method for any MIME type:
89
+
90
+ ```ruby
91
+ class MyJPEGParser
92
+ def parse(body)
93
+ # Read EXIF metadata here.
94
+ # Return value is accessible as `page.doc`
95
+ end
96
+ end
97
+
98
+ Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
99
+ ```
100
+
101
+ !!! info "Handling responses without a Content-Type"
102
+
103
+ If a response has no `Content-Type` header, Wayfarer falls back to
104
+ `application/octet-stream`. A parser registered for
105
+ `application/octet-stream` will hence also handle all responses without
106
+ a Content-Type.
107
+
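The parser lookup and the `application/octet-stream` fallback described above could work roughly as follows. This is a standalone sketch using only the standard library: `REGISTRY`, `parse_body`, `JSONParser` and `ByteCounter` are hypothetical names, and only the `#parse` duck type is taken from the guide.

```ruby
require "json"

# Hypothetical parsers; each only needs to respond to #parse(body).
class JSONParser
  def parse(body)
    JSON.parse(body)
  end
end

class ByteCounter
  def parse(body)
    { bytes: body.bytesize }
  end
end

# Hypothetical registry keyed by MIME type, standing in for a real one.
REGISTRY = {
  "application/json" => JSONParser.new,
  "application/octet-stream" => ByteCounter.new
}.freeze

# Picks a parser by MIME type. Header names are assumed to be downcased.
# Returns nil when no parser is registered for the MIME type.
def parse_body(headers, body)
  content_type = headers.fetch("content-type", "application/octet-stream")
  # Drop parameters such as "; charset=utf-8"
  mime_type = content_type.split(";", 2).first.to_s.strip.downcase
  parser = REGISTRY[mime_type]
  parser&.parse(body)
end
```

A response without a `Content-Type` header ends up with the parser registered for `application/octet-stream`, while an unregistered MIME type yields `nil`.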
108
+ ## Accessing page metadata with MetaInspector
109
+
110
+ You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
111
+ document for accessing metadata of HTML pages. For example, to stage all links
112
+ internal to the current hostname:
113
+
114
+ ```ruby
115
+ def index
116
+ stage page.meta.links.internal
117
+ end
118
+ ```
@@ -0,0 +1,10 @@
1
+ # Redis
2
+
3
+ Wayfarer uses Redis to keep track of:
4
+
5
+ * URLs that were already processed within a batch
6
+ * the number of jobs left in a batch
7
+
8
+ ## Garbage collection
9
+
10
+ Wayfarer cleans up batch-related data
@@ -0,0 +1,74 @@
1
+ # Routing
2
+
3
+ Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
4
+ either instance methods denoted by symbols, or [handlers](/guides/handlers).
5
+ A job's route declarations equate to a predicate tree.
6
+ When a URL is routed, the predicate tree is searched depth-first. If a
7
+ matching leaf predicate is found, the found path's action is dispatched,
8
+ along with `params` collected from path parameters.
9
+
10
+ The following routes:
11
+
12
+ ```ruby
13
+ route.host "example.com", scheme: :https do
14
+ path "/contact", to: :contact
15
+ path "/users/:id", to: [UserHandler, :show]
16
+ end
17
+ ```
18
+
19
+ Equate to the following predicate tree:
20
+
21
+ ```mermaid
22
+ flowchart LR
23
+ RootRoute-->Host["Host <code>example.com</code>"]
24
+ Host-->Scheme["Scheme <code>:https</code>"]
25
+ Scheme-->Path1["Path <code>/contact</code>"]
26
+ Scheme-->Path2["Path <code>/users/:id</code>"]
27
+ Path1-->TargetRoute1["Target <code>:contact</code>"]
28
+ Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
29
+ ```
30
+
31
+ An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
32
+
33
+ ```mermaid
34
+ flowchart LR
35
+ RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
36
+ Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
37
+ Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
38
+ Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
39
+ Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
40
+ Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
41
+ classDef active fill:#7CB342,stroke:#7CB342,color:#fff
42
+ classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
43
+ classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
44
+ ```
45
+
46
+ You can also visualise an invocation of the predicate tree on the command line
47
+ with `wayfarer tree`:
48
+
49
+ ```
50
+ wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
51
+ Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
52
+ └──Host("example.com", match: true)
53
+ └──Scheme(:https, match: true)
54
+ ├──Path("/contact", match: false)
55
+ │ └──Target(match: true)
56
+ └──Path("/users/:id", match: true)
57
+ └──Target(match: true)
58
+ └──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
59
+ ```
60
+
61
+ As you can see, `Target` nodes always match. This means that we could have also defined
62
+ our routes as:
63
+
64
+ ```ruby
65
+ route.host "example.com", scheme: :https do
66
+ to :contact do
67
+ path "/contact"
68
+ end
69
+
70
+ to [UserHandler, :show] do
71
+ path "/users/:id"
72
+ end
73
+ end
74
+ ```
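The depth-first search over a predicate tree can be sketched in plain Ruby. The classes below (`Node`, `Root`, `Host`, `Scheme`, `Path`, `Target`) and the `route` helper are hypothetical and far simpler than Wayfarer's routing; they only illustrate how a matching leaf yields an action together with the collected `params`.

```ruby
require "uri"

class UserHandler; end # placeholder handler class for the example

# Base node: a predicate plus child nodes, searched depth-first.
class Node
  def initialize(children = [])
    @children = children
  end

  def action
    nil
  end

  # Returns [action, params] for the first matching leaf, or nil.
  def search(uri, params = {})
    captured = match(uri)
    return nil unless captured

    merged = params.merge(captured)
    return [action, merged] if action

    @children.each do |child|
      found = child.search(uri, merged)
      return found if found
    end
    nil
  end
end

class Root < Node
  def match(_uri)
    {}
  end
end

class Host < Node
  def initialize(host, children)
    super(children)
    @host = host
  end

  def match(uri)
    uri.host == @host ? {} : nil
  end
end

class Scheme < Node
  def initialize(scheme, children)
    super(children)
    @scheme = scheme.to_s
  end

  def match(uri)
    uri.scheme == @scheme ? {} : nil
  end
end

class Path < Node
  def initialize(pattern, children)
    super(children)
    @segments = pattern.split("/")
  end

  # Segments starting with ":" capture a path parameter.
  def match(uri)
    actual = uri.path.split("/")
    return nil unless actual.size == @segments.size

    params = {}
    @segments.zip(actual).each do |expected, got|
      if expected.start_with?(":")
        params[expected[1..-1].to_sym] = got
      elsif expected != got
        return nil
      end
    end
    params
  end
end

# Leaf node: always matches and carries the action.
class Target < Node
  attr_reader :action

  def initialize(action)
    super([])
    @action = action
  end

  def match(_uri)
    {}
  end
end

TREE = Root.new([
  Host.new("example.com", [
    Scheme.new(:https, [
      Path.new("/contact", [Target.new(:contact)]),
      Path.new("/users/:id", [Target.new([UserHandler, :show])])
    ])
  ])
])

def route(url)
  TREE.search(URI.parse(url))
end
```

Under these assumptions, `route("https://example.com/users/42")` walks Host, Scheme and the second Path before reaching its Target, and a URL with a different scheme or host never gets past the corresponding node.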
data/docs/guides/tasks.md CHANGED
@@ -1,14 +1,38 @@
1
1
  # Tasks
2
2
 
3
- Tasks are the immutable units of work processed by [jobs](/guides/jobs). A task
4
- consists of:
3
+ Tasks are the immutable units of work read from a message queue and processed by
4
+ [jobs](/guides/jobs). A task consists of two strings:
5
5
 
6
- 1. The __URL__ to process
7
- * Within a batch, every URL gets processed at most once.
6
+ * The __URL__ to process
7
+ * The __batch__ the task belongs to
8
8
 
9
- 2. The __batch__ the task belongs to
10
- * Like URLs, batches are strings.
9
+ A job processing a task commonly appends more tasks to the queue in turn.
11
10
 
12
- Tasks get appended to the end of a message queue, and consumed from the
13
- beginning. Because jobs can enqueue other tasks, jobs are both consumers
14
- and producers of tasks.
11
+ !!! info "Task URLs are not normalized"
12
+
13
+ The URL returned by `task.url` is not normalized but verbatim
14
+ as it was staged or enqueued.
15
+
16
+ ## Task deduplication
17
+
18
+ Wayfarer ensures that no URL gets processed twice within a batch. It achieves
19
+ this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
20
+ keyed by normalized URLs.
21
+
22
+ ### URL normalization
23
+
24
+ Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
25
+ and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
26
+
27
+ URL normalization is used only for deduplication, and does not affect the URL
28
+ returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
29
+ enqueued. This allows you to follow the exact URLs you may have parsed from a
30
+ response body.
31
+
32
+ ## Invalid URLs
33
+
34
+ Tasks with invalid URLs (for example `ht%0atp://localhost/`, a newline in the
35
+ protocol) are discarded, since they can't be retrieved. No exception is raised,
36
+ and the job is considered successfully processed: tasks are immutable, so an
37
+ error handler could take no corrective action, and retries would not change
38
+ the outcome.
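The interplay of deduplication, normalization and invalid URLs can be approximated with the standard library alone. In this sketch an in-memory `Set` stands in for the Redis data structure, and `dedup_key` is a deliberately simplified replacement for real URL normalization; `Batch`, `admit` and `dedup_key` are hypothetical names.

```ruby
require "set"
require "uri"

# Simplified normalization (assumption): lowercases scheme and host, drops
# the fragment and default port, and treats an empty path as "/".
# Returns nil for anything that is not a parseable HTTP(S) URL.
def dedup_key(url)
  uri = URI.parse(url)
  scheme = uri.scheme.to_s.downcase
  return nil unless %w[http https].include?(scheme) && uri.host

  key = "#{scheme}://#{uri.host.downcase}"
  key += ":#{uri.port}" unless uri.port == uri.default_port
  key += uri.path.empty? ? "/" : uri.path
  key += "?#{uri.query}" if uri.query
  key
rescue URI::InvalidURIError
  nil
end

# In-memory stand-in for per-batch bookkeeping.
class Batch
  def initialize
    @seen = Set.new
  end

  # Returns the verbatim URL if it should be processed, nil otherwise.
  def admit(url)
    key = dedup_key(url)
    return nil if key.nil?            # invalid URL: discarded, no exception
    return nil unless @seen.add?(key) # already processed within this batch

    url # the task keeps the URL exactly as it was enqueued
  end
end
```

Only the lookup key is normalized here, so two spellings of the same address count as one task while the admitted task still carries the URL as it was written.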