wayfarer 0.4.6 → 0.4.8

Files changed (259)
  1. checksums.yaml +4 -4
  2. data/.env +17 -0
  3. data/.github/workflows/lint.yaml +27 -0
  4. data/.github/workflows/release.yaml +30 -0
  5. data/.github/workflows/tests.yaml +21 -0
  6. data/.gitignore +5 -1
  7. data/.rubocop.yml +36 -0
  8. data/.vale.ini +8 -0
  9. data/.yardopts +1 -3
  10. data/Dockerfile +6 -4
  11. data/Gemfile +24 -0
  12. data/Gemfile.lock +274 -164
  13. data/Rakefile +7 -51
  14. data/bin/wayfarer +1 -1
  15. data/docker-compose.yml +23 -13
  16. data/docs/cookbook/consent_screen.md +2 -2
  17. data/docs/cookbook/executing_javascript.md +3 -3
  18. data/docs/cookbook/navigation.md +12 -12
  19. data/docs/cookbook/querying_html.md +3 -3
  20. data/docs/cookbook/screenshots.md +2 -2
  21. data/docs/guides/callbacks.md +25 -125
  22. data/docs/guides/cli.md +71 -0
  23. data/docs/guides/configuration.md +10 -35
  24. data/docs/guides/development.md +67 -0
  25. data/docs/guides/handlers.md +60 -0
  26. data/docs/guides/index.md +1 -0
  27. data/docs/guides/jobs.md +142 -31
  28. data/docs/guides/navigation.md +1 -1
  29. data/docs/guides/networking/capybara.md +13 -22
  30. data/docs/guides/networking/custom_adapters.md +103 -41
  31. data/docs/guides/networking/ferrum.md +4 -4
  32. data/docs/guides/networking/http.md +9 -13
  33. data/docs/guides/networking/selenium.md +10 -11
  34. data/docs/guides/pages.md +78 -10
  35. data/docs/guides/redis.md +10 -0
  36. data/docs/guides/routing.md +156 -0
  37. data/docs/guides/tasks.md +53 -9
  38. data/docs/guides/tutorial.md +66 -0
  39. data/docs/guides/user_agents.md +115 -0
  40. data/docs/index.md +17 -40
  41. data/lib/wayfarer/base.rb +125 -46
  42. data/lib/wayfarer/batch_completion.rb +60 -0
  43. data/lib/wayfarer/callbacks.rb +22 -48
  44. data/lib/wayfarer/cli/route_printer.rb +85 -89
  45. data/lib/wayfarer/cli.rb +103 -0
  46. data/lib/wayfarer/gc.rb +18 -6
  47. data/lib/wayfarer/handler.rb +15 -7
  48. data/lib/wayfarer/kv.rb +28 -0
  49. data/lib/wayfarer/logging.rb +38 -0
  50. data/lib/wayfarer/middleware/base.rb +2 -0
  51. data/lib/wayfarer/middleware/batch_completion.rb +19 -0
  52. data/lib/wayfarer/middleware/chain.rb +7 -1
  53. data/lib/wayfarer/middleware/content_type.rb +59 -0
  54. data/lib/wayfarer/middleware/controller.rb +19 -15
  55. data/lib/wayfarer/middleware/dedup.rb +22 -13
  56. data/lib/wayfarer/middleware/dispatch.rb +17 -4
  57. data/lib/wayfarer/middleware/normalize.rb +7 -14
  58. data/lib/wayfarer/middleware/redis.rb +15 -0
  59. data/lib/wayfarer/middleware/router.rb +33 -35
  60. data/lib/wayfarer/middleware/stage.rb +5 -5
  61. data/lib/wayfarer/middleware/uri_parser.rb +31 -0
  62. data/lib/wayfarer/middleware/user_agent.rb +49 -0
  63. data/lib/wayfarer/networking/capybara.rb +1 -1
  64. data/lib/wayfarer/networking/context.rb +14 -3
  65. data/lib/wayfarer/networking/ferrum.rb +1 -4
  66. data/lib/wayfarer/networking/follow.rb +14 -7
  67. data/lib/wayfarer/networking/http.rb +1 -1
  68. data/lib/wayfarer/networking/pool.rb +23 -13
  69. data/lib/wayfarer/networking/selenium.rb +15 -7
  70. data/lib/wayfarer/networking/strategy.rb +2 -2
  71. data/lib/wayfarer/page.rb +34 -14
  72. data/lib/wayfarer/parsing/xml.rb +6 -6
  73. data/lib/wayfarer/parsing.rb +21 -0
  74. data/lib/wayfarer/redis/barrier.rb +26 -21
  75. data/lib/wayfarer/redis/counter.rb +18 -9
  76. data/lib/wayfarer/redis/pool.rb +1 -1
  77. data/lib/wayfarer/redis/resettable.rb +19 -0
  78. data/lib/wayfarer/routing/dsl.rb +166 -30
  79. data/lib/wayfarer/routing/hash_stack.rb +33 -0
  80. data/lib/wayfarer/routing/matchers/custom.rb +8 -5
  81. data/lib/wayfarer/routing/matchers/{suffix.rb → empty_params.rb} +2 -6
  82. data/lib/wayfarer/routing/matchers/host.rb +15 -9
  83. data/lib/wayfarer/routing/matchers/path.rb +11 -31
  84. data/lib/wayfarer/routing/matchers/query.rb +41 -17
  85. data/lib/wayfarer/routing/matchers/result.rb +12 -0
  86. data/lib/wayfarer/routing/matchers/scheme.rb +13 -5
  87. data/lib/wayfarer/routing/matchers/url.rb +13 -5
  88. data/lib/wayfarer/routing/path_consumer.rb +130 -0
  89. data/lib/wayfarer/routing/path_finder.rb +151 -23
  90. data/lib/wayfarer/routing/result.rb +1 -1
  91. data/lib/wayfarer/routing/root_route.rb +17 -1
  92. data/lib/wayfarer/routing/route.rb +66 -19
  93. data/lib/wayfarer/routing/serializable.rb +28 -0
  94. data/lib/wayfarer/routing/sub_route.rb +53 -0
  95. data/lib/wayfarer/routing/target_route.rb +17 -1
  96. data/lib/wayfarer/stringify.rb +21 -30
  97. data/lib/wayfarer/task.rb +9 -17
  98. data/lib/wayfarer/uri/normalization.rb +120 -0
  99. data/lib/wayfarer.rb +72 -5
  100. data/mise.toml +2 -0
  101. data/mkdocs.yml +44 -8
  102. data/rake/docs.rake +26 -0
  103. data/rake/lint.rake +9 -0
  104. data/rake/release.rake +23 -0
  105. data/rake/tests.rake +32 -0
  106. data/requirements.txt +1 -1
  107. data/spec/factories/job.rb +8 -0
  108. data/spec/factories/middleware.rb +2 -2
  109. data/spec/factories/path_finder.rb +11 -0
  110. data/spec/factories/redis.rb +19 -0
  111. data/spec/factories/task.rb +46 -2
  112. data/spec/spec_helpers.rb +55 -51
  113. data/spec/support/active_job_helpers.rb +8 -0
  114. data/spec/support/integration_helpers.rb +21 -0
  115. data/spec/support/redis_helpers.rb +9 -0
  116. data/spec/support/test_app.rb +66 -37
  117. data/spec/wayfarer/base_spec.rb +200 -0
  118. data/spec/wayfarer/batch_completion_spec.rb +142 -0
  119. data/spec/wayfarer/cli/job_spec.rb +88 -0
  120. data/spec/wayfarer/cli/routing_spec.rb +322 -0
  121. data/spec/{cli → wayfarer/cli}/version_spec.rb +1 -1
  122. data/spec/wayfarer/gc_spec.rb +29 -0
  123. data/spec/wayfarer/handler_spec.rb +9 -0
  124. data/spec/wayfarer/integration/callbacks_spec.rb +200 -0
  125. data/spec/wayfarer/integration/content_type_spec.rb +37 -0
  126. data/spec/wayfarer/integration/custom_routing_spec.rb +51 -0
  127. data/spec/wayfarer/integration/gc_spec.rb +40 -0
  128. data/spec/wayfarer/integration/handler_spec.rb +65 -0
  129. data/spec/wayfarer/integration/page_spec.rb +79 -0
  130. data/spec/wayfarer/integration/params_spec.rb +64 -0
  131. data/spec/wayfarer/integration/parsing_spec.rb +99 -0
  132. data/spec/wayfarer/integration/retry_spec.rb +112 -0
  133. data/spec/wayfarer/integration/stage_spec.rb +58 -0
  134. data/spec/wayfarer/middleware/batch_completion_spec.rb +33 -0
  135. data/spec/{middleware → wayfarer/middleware}/chain_spec.rb +24 -19
  136. data/spec/wayfarer/middleware/content_type_spec.rb +83 -0
  137. data/spec/{middleware → wayfarer/middleware}/controller_spec.rb +24 -22
  138. data/spec/wayfarer/middleware/dedup_spec.rb +66 -0
  139. data/spec/wayfarer/middleware/normalize_spec.rb +32 -0
  140. data/spec/wayfarer/middleware/router_spec.rb +102 -0
  141. data/spec/wayfarer/middleware/stage_spec.rb +63 -0
  142. data/spec/wayfarer/middleware/uri_parser_spec.rb +63 -0
  143. data/spec/wayfarer/middleware/user_agent_spec.rb +158 -0
  144. data/spec/wayfarer/networking/capybara_spec.rb +13 -0
  145. data/spec/{networking → wayfarer/networking}/context_spec.rb +46 -38
  146. data/spec/wayfarer/networking/ferrum_spec.rb +13 -0
  147. data/spec/{networking → wayfarer/networking}/follow_spec.rb +11 -6
  148. data/spec/wayfarer/networking/http_spec.rb +12 -0
  149. data/spec/{networking → wayfarer/networking}/pool_spec.rb +16 -14
  150. data/spec/wayfarer/networking/selenium_spec.rb +12 -0
  151. data/spec/{networking → wayfarer/networking}/strategy.rb +33 -54
  152. data/spec/wayfarer/page_spec.rb +69 -0
  153. data/spec/{parsing → wayfarer/parsing}/json_spec.rb +1 -1
  154. data/spec/wayfarer/parsing/xml_parse_spec.rb +25 -0
  155. data/spec/wayfarer/redis/barrier_spec.rb +39 -0
  156. data/spec/wayfarer/redis/counter_spec.rb +34 -0
  157. data/spec/{redis → wayfarer/redis}/pool_spec.rb +4 -3
  158. data/spec/{routing → wayfarer/routing}/dsl_spec.rb +12 -22
  159. data/spec/wayfarer/routing/hash_stack_spec.rb +63 -0
  160. data/spec/wayfarer/routing/integration_spec.rb +101 -0
  161. data/spec/wayfarer/routing/matchers/custom_spec.rb +39 -0
  162. data/spec/wayfarer/routing/matchers/host_spec.rb +56 -0
  163. data/spec/wayfarer/routing/matchers/matcher.rb +17 -0
  164. data/spec/wayfarer/routing/matchers/path_spec.rb +43 -0
  165. data/spec/wayfarer/routing/matchers/query_spec.rb +123 -0
  166. data/spec/wayfarer/routing/matchers/scheme_spec.rb +45 -0
  167. data/spec/wayfarer/routing/matchers/url_spec.rb +33 -0
  168. data/spec/wayfarer/routing/path_consumer_spec.rb +123 -0
  169. data/spec/wayfarer/routing/path_finder_spec.rb +409 -0
  170. data/spec/wayfarer/routing/root_route_spec.rb +51 -0
  171. data/spec/wayfarer/routing/route_spec.rb +74 -0
  172. data/spec/wayfarer/routing/sub_route_spec.rb +103 -0
  173. data/spec/wayfarer/task_spec.rb +13 -0
  174. data/spec/wayfarer/uri/normalization_spec.rb +98 -0
  175. data/spec/wayfarer_spec.rb +2 -2
  176. data/wayfarer.gemspec +18 -28
  177. metadata +797 -265
  178. data/.github/workflows/ci.yaml +0 -32
  179. data/.rbenv-gemsets +0 -1
  180. data/.ruby-version +0 -1
  181. data/RELEASING.md +0 -17
  182. data/docs/cookbook/user_agent.md +0 -7
  183. data/docs/guides/error_handling.md +0 -53
  184. data/docs/guides/networking.md +0 -94
  185. data/docs/guides/performance.md +0 -130
  186. data/docs/guides/reliability.md +0 -41
  187. data/docs/guides/routing/steering.md +0 -30
  188. data/docs/reference/api/base.md +0 -48
  189. data/docs/reference/cli.md +0 -61
  190. data/docs/reference/configuration_keys.md +0 -43
  191. data/docs/reference/environment_variables.md +0 -83
  192. data/lib/wayfarer/cli/base.rb +0 -45
  193. data/lib/wayfarer/cli/generate.rb +0 -17
  194. data/lib/wayfarer/cli/job.rb +0 -56
  195. data/lib/wayfarer/cli/route.rb +0 -29
  196. data/lib/wayfarer/cli/runner.rb +0 -34
  197. data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
  198. data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
  199. data/lib/wayfarer/config/capybara.rb +0 -10
  200. data/lib/wayfarer/config/ferrum.rb +0 -11
  201. data/lib/wayfarer/config/networking.rb +0 -29
  202. data/lib/wayfarer/config/redis.rb +0 -14
  203. data/lib/wayfarer/config/root.rb +0 -11
  204. data/lib/wayfarer/config/selenium.rb +0 -21
  205. data/lib/wayfarer/config/strconv.rb +0 -45
  206. data/lib/wayfarer/config/struct.rb +0 -72
  207. data/lib/wayfarer/middleware/fetch.rb +0 -56
  208. data/lib/wayfarer/redis/connection.rb +0 -13
  209. data/lib/wayfarer/redis/version.rb +0 -19
  210. data/lib/wayfarer/routing/router.rb +0 -28
  211. data/spec/base_spec.rb +0 -224
  212. data/spec/callbacks_spec.rb +0 -102
  213. data/spec/cli/generate_spec.rb +0 -39
  214. data/spec/cli/job_spec.rb +0 -78
  215. data/spec/config/capybara_spec.rb +0 -18
  216. data/spec/config/ferrum_spec.rb +0 -24
  217. data/spec/config/networking_spec.rb +0 -73
  218. data/spec/config/redis_spec.rb +0 -32
  219. data/spec/config/root_spec.rb +0 -31
  220. data/spec/config/selenium_spec.rb +0 -56
  221. data/spec/config/strconv_spec.rb +0 -58
  222. data/spec/config/struct_spec.rb +0 -66
  223. data/spec/fixtures/dummy_job.rb +0 -7
  224. data/spec/gc_spec.rb +0 -59
  225. data/spec/handler_spec.rb +0 -11
  226. data/spec/integration/callbacks_spec.rb +0 -85
  227. data/spec/integration/page_spec.rb +0 -62
  228. data/spec/integration/params_spec.rb +0 -56
  229. data/spec/integration/stage_spec.rb +0 -51
  230. data/spec/integration/steering_spec.rb +0 -57
  231. data/spec/middleware/dedup_spec.rb +0 -88
  232. data/spec/middleware/dispatch_spec.rb +0 -43
  233. data/spec/middleware/fetch_spec.rb +0 -155
  234. data/spec/middleware/normalize_spec.rb +0 -29
  235. data/spec/middleware/router_spec.rb +0 -105
  236. data/spec/middleware/stage_spec.rb +0 -62
  237. data/spec/networking/capybara_spec.rb +0 -12
  238. data/spec/networking/ferrum_spec.rb +0 -12
  239. data/spec/networking/http_spec.rb +0 -12
  240. data/spec/networking/selenium_spec.rb +0 -12
  241. data/spec/page_spec.rb +0 -47
  242. data/spec/parsing/xml_spec.rb +0 -25
  243. data/spec/redis/barrier_spec.rb +0 -78
  244. data/spec/redis/counter_spec.rb +0 -32
  245. data/spec/redis/version_spec.rb +0 -13
  246. data/spec/routing/integration_spec.rb +0 -110
  247. data/spec/routing/matchers/custom_spec.rb +0 -31
  248. data/spec/routing/matchers/host_spec.rb +0 -49
  249. data/spec/routing/matchers/path_spec.rb +0 -43
  250. data/spec/routing/matchers/query_spec.rb +0 -137
  251. data/spec/routing/matchers/scheme_spec.rb +0 -25
  252. data/spec/routing/matchers/suffix_spec.rb +0 -41
  253. data/spec/routing/matchers/uri_spec.rb +0 -27
  254. data/spec/routing/path_finder_spec.rb +0 -33
  255. data/spec/routing/root_route_spec.rb +0 -29
  256. data/spec/routing/route_spec.rb +0 -43
  257. data/spec/routing/router_spec.rb +0 -24
  258. data/spec/task_spec.rb +0 -34
  259. data/spec/{stringify_spec.rb → wayfarer/stringify_spec.rb} +2 -2
data/docs/guides/pages.md CHANGED
@@ -1,11 +1,14 @@
1
1
  # Pages
2
2
 
3
- Retrieved pages take the shape of `Wayfarer::Page` objects and are available
4
- to jobs:
3
+ A page is the immutable state of the contents behind a URL at a point in time,
4
+ retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
5
+ response, or the state of a remotely controlled browser.
5
6
 
6
7
  ```ruby
7
- class DummyJob < Wayfarer::Worker
8
- route { to :index }
8
+ class DummyJob < ActiveJob::Base
9
+ include Wayfarer::Base
10
+
11
+ route.to :index
9
12
 
10
13
  def index
11
14
  page # => #<Wayfarer::Page ...>
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
13
16
  page.url # => "https://example.com"
14
17
  page.body # => "<html>..."
15
18
  page.status_code # => 200
16
- page.headers # => { "Content-Type" => ... }
19
+ page.headers # => { "content-type" => ... }
20
+ page.mime_type # => #<MIME::Type: text/html>
21
+
22
+ # The lazily parsed response body or `nil`, depending on the Content-Type
23
+ page.doc # => #<Nokogiri::HTML::Document ...>
17
24
 
18
- # A MetaInspector object for accessing page meta data.
19
25
  # See: https://github.com/metainspector/metainspector
26
+ page.meta # => #<MetaInspector::Document ...>
20
27
  # Examples:
21
28
  page.meta.links.internal
22
29
  page.meta.images.favicon
@@ -26,20 +33,63 @@ class DummyJob < Wayfarer::Worker
26
33
  end
27
34
  ```
28
35
 
36
+ !!! info "HTTP headers are downcased and case-sensitive"
37
+
38
+ HTTP headers are downcased, so you would access
39
+ `page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
40
+
41
+ ## Response body parsing
42
+
43
+ Wayfarer parses the bodies of HTML, XML and JSON responses according to their
44
+ MIME types:
45
+
46
+ * `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
47
+ * `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
48
+ * `application/json` to `Hash`
49
+
50
+ ### Implementing a custom response body parser
51
+
52
+ You can register an object that implements a `#parse` method for any MIME type:
53
+
54
+ ```ruby
55
+ class MyJPEGParser
56
+ def parse(body)
57
+ # Read EXIF metadata here.
58
+ # Return value is accessible as `page.doc`
59
+ end
60
+ end
61
+
62
+ Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
63
+ ```
64
+
65
+ !!! warning "`#parse` must be thread-safe!"
66
+
67
+ !!! info "Handling responses without a Content-Type"
68
+
69
+ If a response has no `Content-Type` header, Wayfarer falls back to
70
+ `application/octet-stream`. A parser registered for
71
+ `application/octet-stream` will hence also handle all responses without
72
+ a Content-Type.
73
+
29
74
  ## Live pages
30
75
 
31
- When automating browsers, it is possible the page changes significantly at
32
- runtime, for example due to JavaScript altering the DOM or URL.
76
+ `#!ruby page` initially returns a snapshot of the browser state
77
+ immediately after the user agent navigated to the URL. The browser state may
78
+ change significantly after the page was retrieved, for example due to your own
79
+ interaction, or client-side JavaScript altering the DOM or URL.
33
80
 
34
- To access a page reflecting the current browser state, pass the `live` keyword:
81
+ To get a page that reflects the current browser state, set the `#!ruby :live`
82
+ keyword:
35
83
 
36
84
  ```ruby
37
85
  class DummyJob < Wayfarer::Worker
38
- route { to :index }
86
+ route.to :index
39
87
 
40
88
  def index
41
89
  page # => #<Wayfarer::Page ...>
42
90
 
91
+ # Fill in forms, click buttons, etc.
92
+
43
93
  # Replaces the current Page object with a newer one,
44
94
  # taking into account the DOM as currently rendered by the browser.
45
95
  # Effectful only when automating browsers, no-op when using plain
@@ -50,3 +100,21 @@ class DummyJob < Wayfarer::Worker
50
100
  end
51
101
  end
52
102
  ```
103
+
104
+ !!! attention "Stateless user agents ignore `#!ruby :live`"
105
+
106
+ The `#!ruby :live` option is ignored by stateless user agents, such as the
107
+ default `#!ruby :http` user agent. Instead, stateless user agents always
108
+ return the same page object.
109
+
110
+ ## Accessing page metadata with MetaInspector
111
+
112
+ You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
113
+ document for accessing metadata of HTML pages. For example, to stage all links
114
+ internal to the current hostname:
115
+
116
+ ```ruby
117
+ def index
118
+ stage page.meta.links.internal
119
+ end
120
+ ```
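A minimal sketch of the custom parser hook that the `pages.md` changes above describe, assuming only what the guide states: a parser is any object with a `#parse(body)` method, registered under a MIME type in `Wayfarer::Parsing.registry`. The `NDJSONParser` class and the `application/x-ndjson` MIME type are hypothetical choices for illustration and do not appear in the diff.

```ruby
require "json"

# Hypothetical parser for newline-delimited JSON response bodies.
# It keeps no instance state, so concurrent calls to #parse cannot interfere.
class NDJSONParser
  # Returns one parsed JSON value per non-blank line of the body.
  def parse(body)
    body.each_line
        .map(&:strip)
        .reject(&:empty?)
        .map { |line| JSON.parse(line) }
  end
end

parser = NDJSONParser.new.freeze

# Registration as described in the guide. The guard is an addition of this
# sketch so that it also runs when Wayfarer is not loaded.
Wayfarer::Parsing.registry["application/x-ndjson"] = parser if defined?(Wayfarer::Parsing)

docs = parser.parse("{\"id\": 1}\n\n{\"id\": 2}\n")
puts docs.inspect
```

Because the parser holds no instance state and is frozen, calls from several job threads cannot affect each other, which is one way to meet the thread-safety requirement.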
data/docs/guides/redis.md ADDED
@@ -0,0 +1,10 @@
1
+ # Redis
2
+
3
+ Wayfarer uses Redis to keep track of:
4
+
5
+ * URLs that were already processed within a batch
6
+ * the number of jobs left in a batch
7
+
8
+ ## Garbage collection
9
+
10
+ Wayfarer cleans up batch-related data
data/docs/guides/routing.md ADDED
@@ -0,0 +1,156 @@
1
+ # Routing
2
+
3
+ Wayfarer equips jobs with a declarative routing DSL that maps URLs to actions.
4
+ Actions are instance methods denoted by symbols, or [handlers](/guides/handlers).
5
+ [Pages](/guides/pages) are only retrieved from URLs which map to an action.
6
+
7
+ !!! info "Routed URLs are normalized"
8
+
9
+ By default, Wayfarer [applies some transformations to each URL](../tasks/#url-normalization) to bring it
10
+ into a canonical form. Routing happens based on this canonical form.
11
+
12
+ You can always access a task's raw URL string as it was enqueued with `task.url`.
13
+
14
+ A job's route declarations equate to a predicate tree.
15
+ When a URL is routed, the predicate tree is searched depth-first. If a
16
+ matching leaf predicate is found, the found path's action is dispatched.
17
+ You can extract data from URL path segments and query parameters and
18
+ access it through `params` in jobs or handlers.
19
+
20
+ The following routes:
21
+
22
+ ```ruby
23
+ route.host "example.com", scheme: :https do
24
+ path "contact", to: :contact
25
+ path "users/:id" do
26
+ to [UserHandler, :show]
27
+
28
+ path "gallery", to: [UserHandler, :photos]
29
+ end
30
+ end
31
+ ```
32
+
33
+ Equate to the following predicate tree:
34
+
35
+ ```mermaid
36
+ flowchart LR
37
+ Root-->Host["Host <code>example.com</code>"]
38
+ Host-->Scheme["Scheme <code>:https</code>"]
39
+
40
+ %% first-level paths
41
+ Scheme-->PathContact["Path <code>contact</code>"]
42
+ Scheme-->PathUsersId["Path <code>users/:id</code>"]
43
+
44
+ %% their targets
45
+ PathContact-->TargetRouteContact["Target <code>:contact</code>"]
46
+ PathUsersId-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]
47
+
48
+ %% nested path under /users/:id
49
+ PathUsersId-->PathGallery["Path <code>'gallery'</code>"]
50
+ PathGallery-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]
51
+ ```
52
+
53
+ Traversing the tree depth-first for `https://example.com/users/42` stops at the
54
+ route with the action `[UserHandler, :show]`:
55
+
56
+ ```mermaid
57
+ flowchart LR
58
+ Root:::matching-->Host["Host <code>example.com</code>"]:::matching
59
+ Host:::matching-->Scheme["Scheme <code>:https</code>"]:::matching
60
+
61
+ %% sibling paths from the scheme node
62
+ Scheme:::matching-->PathContact["Path <code>/contact</code>"]:::mismatching
63
+ Scheme:::matching-->PathUsersId["Path <code>/users/:id</code>"]:::matching
64
+
65
+ %% successful match for /users/:id
66
+ PathUsersId:::matching-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]:::matching
67
+
68
+ %% gallery branch is never visited for /users/42
69
+ PathContact-->TargetRouteContact["Target <code>:contact</code>"]:::unvisited
70
+ PathUsersId:::matching-->PathGallery["Path <code>/gallery</code>"]:::unvisited
71
+ PathGallery:::unvisited-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]:::unvisited
72
+
73
+ classDef matching fill:#7CB342,stroke:#7CB342,color:#fff
74
+ classDef mismatching fill:#FFCDD2,stroke:#F44336,color:#B71C1C
75
+ classDef unvisited fill:#BDBDBD,stroke:#BDBDBD,color:#616161
76
+ ```
77
+
78
+ ??? note "You can also visualise a job's routing tree with the [`route` CLI subcommand](/guides/cli)"
79
+
80
+ ```sh
81
+ wayfarer route DummyJob -r dummy_job.rb http://localhost:9000/users/42/gallery
82
+ ```
83
+
84
+ ```yaml
85
+ ---
86
+ routed: true
87
+ params:
88
+ id: '42'
89
+ action:
90
+ handler: Class
91
+ action: :photos
92
+ root_route:
93
+ match: true
94
+ params: {}
95
+ children:
96
+ - route:
97
+ host:
98
+ name: example.com
99
+ match: true
100
+ params: {}
101
+ children:
102
+ - route:
103
+ scheme:
104
+ scheme: :https
105
+ match: true
106
+ params: {}
107
+ children:
108
+ - route:
109
+ path:
110
+ pattern: "/contact"
111
+ match: false
112
+ params: {}
113
+ children:
114
+ - target_route:
115
+ action:
116
+ children: []
117
+ - route:
118
+ path:
119
+ pattern: "/users/:id"
120
+ match: true
121
+ params:
122
+ id: '42'
123
+ children:
124
+ - target_route:
125
+ action:
126
+ handler: Class
127
+ action: :show
128
+ children: []
129
+ - route:
130
+ path:
131
+ pattern: "/gallery"
132
+ match: true
133
+ params:
134
+ id: '42'
135
+ children:
136
+ - target_route:
137
+ action:
138
+ handler: Class
139
+ action: :photos
140
+ children: []
141
+ ```
142
+
143
+ As you can see, `Target` nodes always match. This means that we could have also defined
144
+ our routes as:
145
+
146
+ ```ruby
147
+ route.host "example.com", scheme: :https do
148
+ to :contact do
149
+ path "/contact"
150
+ end
151
+
152
+ to [UserHandler, :show] do
153
+ path "/users/:id"
154
+ end
155
+ end
156
+ ```
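The depth-first search that `routing.md` describes can be sketched in plain Ruby. This is an illustrative model, not Wayfarer's implementation: `Node`, `search` and the matcher helpers are hypothetical names, the action is attached directly to a node instead of a separate always-matching target node, and paths are matched as whole regular expressions rather than being consumed segment by segment.

```ruby
require "uri"

# Hypothetical stand-in for a route: a predicate, child nodes and an optional action.
# A predicate returns a Hash of extracted params on a match, or nil on a mismatch.
Node = Struct.new(:predicate, :children, :action) do
  # Returns [action, params] for the first matching leaf found depth-first, or nil.
  def search(uri, params = {})
    captured = predicate.call(uri)
    return nil unless captured

    merged = params.merge(captured)
    return [action, merged] if action

    children.each do |child|
      found = child.search(uri, merged)
      return found if found
    end
    nil
  end
end

def host(name)
  ->(uri) { uri.host == name ? {} : nil }
end

def scheme(name)
  ->(uri) { uri.scheme == name ? {} : nil }
end

def path(regex)
  ->(uri) { (match = regex.match(uri.path)) ? match.named_captures : nil }
end

tree = Node.new(->(_uri) { {} }, [
  Node.new(host("example.com"), [
    Node.new(scheme("https"), [
      Node.new(path(%r{\A/contact\z}), [], :contact),
      Node.new(path(%r{\A/users/(?<id>[^/]+)\z}), [], :show)
    ], nil)
  ], nil)
], nil)

user_result    = tree.search(URI("https://example.com/users/42"))
contact_result = tree.search(URI("https://example.com/contact"))
no_result      = tree.search(URI("http://example.com/contact"))

puts user_result.inspect
```

A mismatching predicate prunes its whole subtree, which is why the `http://` URL is rejected at the scheme node without any path predicate being evaluated.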
data/docs/guides/tasks.md CHANGED
@@ -1,14 +1,58 @@
1
1
  # Tasks
2
2
 
3
- Tasks are the immutable units of work processed by [jobs](/guides/jobs). A task
4
- consists of:
3
+ Tasks are the immutable units of work read from a message queue and processed by
4
+ [jobs](/guides/jobs). A task consists of two strings:
5
5
 
6
- 1. The __URL__ to process
7
- * Within a batch, every URL gets processed at most once.
6
+ * The __URL__ to process
7
+ * The __batch__ the task belongs to
8
8
 
9
- 2. The __batch__ the task belongs to
10
- * Like URLs, batches are strings.
9
+ A job processing a task commonly appends more tasks to the queue in turn.
11
10
 
12
- Tasks get appended to the end of a message queue, and consumed from the
13
- beginning. Because jobs can enqueue other tasks, jobs are both consumers
14
- and producers of tasks.
11
+ !!! info "Task URLs are not normalized"
12
+
13
+ The URL returned by `task.url` is not normalized but verbatim
14
+ as it was staged or enqueued.
15
+
16
+ ## Task deduplication
17
+
18
+ Wayfarer ensures that no URL gets processed twice within a batch. It achieves
19
+ this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
20
+ keyed by normalized URLs.
21
+
22
+ Wayfarer computes a canonical URL representation that it uses for cache lookups.
23
+
24
+ ### URL normalization
25
+
26
+ Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
27
+ and applies further normalizations. By default, all normalizations are applied
28
+ and can be individually disabled.
29
+
30
+ URL normalization is used only for deduplication, and does not affect the immutable
31
+ `task.url`, which always returns the verbatim URL as enqueued.
32
+ This allows you to follow the URLs exactly as parsed from response bodies.
33
+
34
+ You can configure the global normalization behaviour by setting the following
35
+ values on `Wayfarer.config.normalization`, which all default to `true`:
36
+
37
+ * `remove_www`: Remove `www.` prefix from hostnames?
38
+ * `remove_trailing_slash`: Remove a trailing path slash?
39
+ * `remove_fragment`: Remove the URL fragment?
40
+ * `order_query_parameters`: Order query parameters alphabetically?
41
+ * `remove_tracking_parameters`: Remove tracking parameters from the URL?
42
+
43
+ When a job gets deduplicated, it succeeds and causes no retries.
44
+
45
+ ### Setting a custom key function
46
+
47
+ You can customize how deduplication keys are computed. As a contrived example,
48
+ to process only one job per hostname:
49
+
50
+ ```ruby
51
+ Wayfarer.config[:deduplication][:key] = ->(task) { task[:uri].hostname }
52
+ ```
53
+
54
+ ## Invalid URLs
55
+
56
+ Tasks with invalid URLs are discarded (for example, `ht%0atp://localhost/`, which has a
57
+ newline in its protocol), since there is no corrective action possible.
58
+ No exception is raised, and the job is considered successfully processed without retries.
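The five normalizations listed in `tasks.md` can be approximated with Ruby's standard library. This is a sketch under stated assumptions, not Wayfarer's code: the guide says Wayfarer uses Addressable and a Redis hash, while this example uses stdlib `URI` and an in-memory `Set`. `dedup_key` and `TRACKING_PARAMS` are hypothetical names, and the list of tracking parameters is made up for illustration. It assumes absolute HTTP(S) URLs.

```ruby
require "set"
require "uri"

# Hypothetical list; the parameters Wayfarer treats as tracking parameters
# are not shown in this diff.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign].freeze

# Computes a canonical key for deduplication. The input URL itself is left untouched.
def dedup_key(url)
  uri = URI.parse(url)
  uri.host = uri.host.sub(/\Awww\./, "") # remove_www
  uri.path = uri.path.chomp("/")         # remove_trailing_slash
  uri.fragment = nil                     # remove_fragment

  pairs = URI.decode_www_form(uri.query.to_s)
             .reject { |name, _value| TRACKING_PARAMS.include?(name) } # remove_tracking_parameters
             .sort                                                     # order_query_parameters
  uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  uri.to_s
end

key = dedup_key("https://www.example.com/a/?b=2&utm_source=x&a=1#top")
puts key

# In-memory stand-in for the per-batch Redis hash: Set#add? returns nil
# when the key is already present.
seen = Set.new
first_visit  = seen.add?(dedup_key("https://www.example.com/"))
second_visit = seen.add?(dedup_key("https://example.com"))
```

Both URLs in the last two lines reduce to the same key, so only the first one would be processed within the batch.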
data/docs/guides/tutorial.md ADDED
@@ -0,0 +1,66 @@
1
+ # Tutorial
2
+
3
+ Wayfarer is a web crawling framework written in Ruby.
4
+ It works interchangeably with plain HTTP and with automated web browsers,
5
+ and is deployed with Redis and a message queue.
6
+ During development it can execute fully in memory, without Redis.
7
+
8
+ ## Getting started
9
+
10
+ In an empty directory, generate a new `Gemfile` and install Wayfarer:
11
+
12
+ ```sh
13
+ bundle init
14
+ bundle add activejob wayfarer
15
+ bundle install
16
+ ```
17
+
18
+ ## Jobs, tasks and batches
19
+
20
+ Wayfarer builds on Active Job, the message queue abstraction of Rails.
21
+ You can use Wayfarer without Rails of course, as we do here.
22
+
23
+ A message queue supports two operations: appending messages to the end and consuming
24
+ messages from the front. This is how Wayfarer processes tasks, a string pair
25
+ of URL and batch. Wayfarer enforces that URLs are not processed more than
26
+ once within their batch (excluding retries).
27
+
28
+ When a task is consumed, it is processed by a job, a Ruby class.
29
+
30
+ Let's give ourselves a `dummy_job.rb` that routes all URLs to its
31
+ `index` instance method, where we print the current `task`:
32
+
33
+ ```ruby title="dummy_job.rb"
34
+ require "activejob"
35
+ require "wayfarer"
36
+
37
+ class DummyJob < ActiveJob::Base
38
+ include Wayfarer::Base
39
+
40
+ route.to :index
41
+
42
+ def index
43
+ puts task
44
+ end
45
+ end
46
+ ```
47
+
48
+ We can perform our job from the command line with the `wayfarer perform`
49
+ subcommand. In between ActiveJob's log output, we see that Wayfarer
50
+ has generated a UUID for the batch since we did not pass it:
51
+
52
+ ```sh
53
+ bundle exec wayfarer perform -r dummy_job.rb DummyJob https://example.com
54
+ ```
55
+
56
+ ```hl_lines="2"
57
+ [ActiveJob] [DummyJob] [68853491-...] Performing DummyJob (Job ID: 68853491-...) from Async(default) with arguments: #<Wayfarer::Task url="https://example.com", batch="63d14035-...">
58
+ #<Wayfarer::Task url="https://example.com", batch="63d14035-...">
59
+ [ActiveJob] [DummyJob] [68853491-...] Performed DummyJob (Job ID: 68853491-) from Async(default) in 507.65ms
60
+ ```
61
+
62
+ If you don't provide a batch, Wayfarer uses a generated UUID instead.
63
+ We could have also used `DummyJob.crawl
64
+
65
+
66
+
data/docs/guides/user_agents.md ADDED
@@ -0,0 +1,115 @@
1
+ # User agents
2
+
3
+ User agents are used by [jobs](../jobs) to retrieve the contents behind a URL into a
4
+ [page](../pages), for example a remotely controlled Firefox process or a Ruby HTTP client.
5
+
6
+ User agents are kept in a connection pool and all user agents in the pool
7
+ share the same type and configuration. You can add your own custom user agents by implementing
8
+ the [user agent API](custom_user_agents.md).
9
+
10
+ Wayfarer comes with the following built-in user agents:
11
+
12
+ * [`:http`](http.md) (default)
13
+ * [`:ferrum`](ferrum.md) to automate Google Chrome
14
+ * [`:selenium`](selenium.md) to automate a variety of browsers
15
+ * [`:capybara`](capybara.md) to use Capybara sessions
16
+
17
+ Configure the user agent with the global configuration option:
18
+
19
+ ```ruby
20
+ Wayfarer.config[:network][:agent] = :ferrum # or :selenium, :capybara, ...
21
+ ```
22
+
23
+ You can access the user agent that was checked out from the pool with
24
+ `#user_agent` in action methods:
25
+
26
+ ```ruby
27
+ class DummyJob < ActiveJob::Base
28
+ include Wayfarer::Base
29
+
30
+ route.to :index
31
+
32
+ def index
33
+ user_agent # => #<Ferrum::Browser ...>
34
+ end
35
+ end
36
+ ```
37
+
38
+ You can also implement [custom user agents](custom_user_agents.md) to support
39
+ your own HTTP client or browser automation service/protocol.
40
+
41
+ ### Ad-hoc HTTP requests
42
+
43
+ Regardless of the configured user agent, you can always make ad-hoc HTTP GET requests
44
+ that return pages with `#fetch(url)`:
45
+
46
+ ```ruby
47
+ class DummyJob < ActiveJob::Base
48
+ include Wayfarer::Base
49
+
50
+ route.to :index
51
+
52
+ def index
53
+ page = fetch("https://example.com") # => #<Wayfarer::Page ...>
54
+ end
55
+ end
56
+ ```
57
+
58
+ !!! info "`#fetch` respects `Wayfarer.config.network.http_headers` for all provided user agents."
59
+
60
+ ## HTTP request headers
61
+
62
+ You can set HTTP request headers for all built-in user agents:
63
+
64
+ ```ruby
65
+ Wayfarer.config[:network][:http_headers] = { "User-Agent" => "MyCrawler" }
66
+ ```
67
+
68
+ !!! attention "Selenium does not support configuring HTTP request headers."
69
+
70
+ ## Connection pooling
71
+
72
+ Since user agents are expensive to create, especially in the case of browser
73
+ processes, Wayfarer keeps user agents within a connection pool. When a job
74
+ performs and needs to retrieve the [page](../pages) for its task URL, an agent
75
+ is checked out from the pool, and checked back in when the routed action method
76
+ returns.
77
+
78
+ The pool size is constant and it should equal the number of threads the
79
+ underlying message queue operates with. For example, if you use Sidekiq,
80
+ you should set the pool size to the number of Sidekiq threads:
81
+
82
+ ```ruby
83
+ Wayfarer.config[:network][:pool][:size] = Sidekiq.options[:concurrency]
84
+ ```
85
+
86
+ !!! attention "The connection pool size is 1 by default"
87
+
88
+ Since there is no reliable way to detect the number of threads that
89
+ the underlying message queue operates with, Wayfarer defaults to a pool
90
+ size of 1, which creates a bottleneck in a concurrent environment.
91
+
92
+ !!! attention "Browser sessions are shared across jobs"
93
+
94
+ The same browser session is used across jobs. This means that the browser
95
+ is not closed between jobs, and that the browser's state carries over from
96
+ job to job. You may account for this by resetting the browser's state
97
+ according to your needs, for which you can use [callbacks](../callbacks).
98
+
99
+ ### `UserAgentTimeoutError`: avoiding pool contention
100
+
101
+ If you encounter `UserAgentTimeoutError` exceptions, a job has waited for a
102
+ user agent to become available for too long. By default, this timeout is 10
103
+ seconds. This is a sign that the pool size is too small for the message queue's
104
+ concurrency.
105
+
106
+ ```
107
+ #<Wayfarer::UserAgentTimeoutError: Waited 10 sec, 0/1 available>
108
+ ```
109
+
110
+ You can configure the timeout, although you will likely want to increase the
111
+ pool size instead:
112
+
113
+ ```ruby
114
+ Wayfarer.config[:network][:pool][:timeout] = 10 # seconds
115
+ ```
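The pool contention that `user_agents.md` warns about can be reproduced with a small model. This is a hedged sketch, not Wayfarer's pool: `TinyPool` and its `TimeoutError` are hypothetical, and plain objects stand in for user agents. It shows why a pool of size 1 makes a second concurrent job time out while the first one still holds the only agent.

```ruby
# Hypothetical fixed-size pool with a checkout timeout.
class TinyPool
  class TimeoutError < StandardError; end

  def initialize(size:, timeout:, &factory)
    @timeout   = timeout
    @available = Array.new(size) { factory.call }
    @mutex     = Mutex.new
    @returned  = ConditionVariable.new
  end

  # Checks an agent out, yields it, and checks it back in afterwards.
  def with
    agent = checkout
    begin
      yield agent
    ensure
      checkin(agent)
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  def checkout
    deadline = now + @timeout
    @mutex.synchronize do
      while @available.empty?
        remaining = deadline - now
        raise TimeoutError, "waited #{@timeout} sec for a user agent" if remaining <= 0

        @returned.wait(@mutex, remaining)
      end
      @available.pop
    end
  end

  def checkin(agent)
    @mutex.synchronize do
      @available.push(agent)
      @returned.signal
    end
  end
end

pool  = TinyPool.new(size: 1, timeout: 0.05) { Object.new }
ready = Queue.new

# The first "job" holds the only agent for a while.
holder = Thread.new do
  pool.with do |_agent|
    ready << true
    sleep 0.3
  end
end
ready.pop # wait until the agent is checked out

contended = begin
  pool.with { |_agent| :ok }
rescue TinyPool::TimeoutError
  :timed_out
end

holder.join
uncontended = pool.with { |_agent| :ok }
puts [contended, uncontended].inspect
```

Raising the timeout only delays the error in this scenario. Sizing the pool to the number of worker threads removes the wait altogether, which matches the guide's advice.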
data/docs/index.md CHANGED
@@ -1,56 +1,33 @@
1
1
  ---
2
2
  hide:
3
3
  - navigation
4
+ - toc
4
5
  ---
5
6
 
6
7
  # Wayfarer
7
8
 
8
- ![CI status](https://github.com/actions/starter-workflows/workflows/CI/badge.svg)
9
- [![RubyGem](https://badge.fury.io/rb/wayfarer.svg)](https://rubygems.org/gems/wayfarer)
9
+ ## Ruby web crawling framework built on [ActiveJob]() and [Redis]()
10
10
 
11
- ## Versatile web crawling with Ruby
11
+ <small>
12
+ [Read the tutorial](/guides/tutorial){ .md-button .md-button--primary }
13
+ </small>
12
14
 
13
- * Web scraping
14
- * Data extraction
15
- * Browser automation
15
+ === "Command line"
16
16
 
17
- !!! attention "Unstable software"
17
+ ```sh
18
+ gem install wayfarer
19
+ ```
18
20
 
19
- Wayfarer is under development and releases should be considered unstable.
21
+ === "Gemfile"
20
22
 
21
- Wayfarer complies to
22
- [Semantic Versioning 2.0.0](https://semver.org/spec/v2.0.0.html) in
23
- which v0.x means that there could be backward-incompatible changes for every
24
- release:
23
+ ```ruby
24
+ gem "wayfarer"
25
+ ```
25
26
 
26
- >Major version zero (0.y.z) is for initial development. Anything MAY change
27
- at any time. The public API SHOULD NOT be considered stable.
28
-
29
- ### Installation
30
-
31
- Install the RubyGem:
32
-
33
- ```
34
- gem install wayfarer
35
- ```
36
-
37
- Or add it to Bundler's Gemfile:
38
-
39
- ```ruby
40
- gem "wayfarer"
41
- ```
42
-
43
- ### Features
44
-
45
- * Breadth-first, acyclic, multi-threaded graph traversal
46
- * Executes atop a variety of message queues thanks to [ActiveJob](https://edgeguides.rubyonrails.org/active_job_basics.html)
47
- * Browser automation via [Ferrum](https://github.com/rubycdp/ferrum)
27
+ * Breadth-first, acyclic page traversal
28
+ * Plain HTTP and browser automation via [Ferrum](https://github.com/rubycdp/ferrum)
48
29
  (<abbr title="Chrome DevTools Protocol">CDP</abbr>),
49
- [Selenium](https://www.selenium.dev) or plain HTTP via `net/http`
30
+ [Selenium](https://www.selenium.dev) and custom user agents
50
31
  * Declarative routing DSL
51
32
  * URI normalization and deduplication
52
- * XML, HTML, JSON parsing
53
- * HTTP redirect handling
54
- * Storage-agnostic
55
- * Small footprint: <500 LoC
56
- * Open Source (MIT)
33
+ * HTML, XML, JSON and custom Content-Type body parsing