wayfarer 0.4.6 → 0.4.8
- checksums.yaml +4 -4
- data/.env +17 -0
- data/.github/workflows/lint.yaml +27 -0
- data/.github/workflows/release.yaml +30 -0
- data/.github/workflows/tests.yaml +21 -0
- data/.gitignore +5 -1
- data/.rubocop.yml +36 -0
- data/.vale.ini +8 -0
- data/.yardopts +1 -3
- data/Dockerfile +6 -4
- data/Gemfile +24 -0
- data/Gemfile.lock +274 -164
- data/Rakefile +7 -51
- data/bin/wayfarer +1 -1
- data/docker-compose.yml +23 -13
- data/docs/cookbook/consent_screen.md +2 -2
- data/docs/cookbook/executing_javascript.md +3 -3
- data/docs/cookbook/navigation.md +12 -12
- data/docs/cookbook/querying_html.md +3 -3
- data/docs/cookbook/screenshots.md +2 -2
- data/docs/guides/callbacks.md +25 -125
- data/docs/guides/cli.md +71 -0
- data/docs/guides/configuration.md +10 -35
- data/docs/guides/development.md +67 -0
- data/docs/guides/handlers.md +60 -0
- data/docs/guides/index.md +1 -0
- data/docs/guides/jobs.md +142 -31
- data/docs/guides/navigation.md +1 -1
- data/docs/guides/networking/capybara.md +13 -22
- data/docs/guides/networking/custom_adapters.md +103 -41
- data/docs/guides/networking/ferrum.md +4 -4
- data/docs/guides/networking/http.md +9 -13
- data/docs/guides/networking/selenium.md +10 -11
- data/docs/guides/pages.md +78 -10
- data/docs/guides/redis.md +10 -0
- data/docs/guides/routing.md +156 -0
- data/docs/guides/tasks.md +53 -9
- data/docs/guides/tutorial.md +66 -0
- data/docs/guides/user_agents.md +115 -0
- data/docs/index.md +17 -40
- data/lib/wayfarer/base.rb +125 -46
- data/lib/wayfarer/batch_completion.rb +60 -0
- data/lib/wayfarer/callbacks.rb +22 -48
- data/lib/wayfarer/cli/route_printer.rb +85 -89
- data/lib/wayfarer/cli.rb +103 -0
- data/lib/wayfarer/gc.rb +18 -6
- data/lib/wayfarer/handler.rb +15 -7
- data/lib/wayfarer/kv.rb +28 -0
- data/lib/wayfarer/logging.rb +38 -0
- data/lib/wayfarer/middleware/base.rb +2 -0
- data/lib/wayfarer/middleware/batch_completion.rb +19 -0
- data/lib/wayfarer/middleware/chain.rb +7 -1
- data/lib/wayfarer/middleware/content_type.rb +59 -0
- data/lib/wayfarer/middleware/controller.rb +19 -15
- data/lib/wayfarer/middleware/dedup.rb +22 -13
- data/lib/wayfarer/middleware/dispatch.rb +17 -4
- data/lib/wayfarer/middleware/normalize.rb +7 -14
- data/lib/wayfarer/middleware/redis.rb +15 -0
- data/lib/wayfarer/middleware/router.rb +33 -35
- data/lib/wayfarer/middleware/stage.rb +5 -5
- data/lib/wayfarer/middleware/uri_parser.rb +31 -0
- data/lib/wayfarer/middleware/user_agent.rb +49 -0
- data/lib/wayfarer/networking/capybara.rb +1 -1
- data/lib/wayfarer/networking/context.rb +14 -3
- data/lib/wayfarer/networking/ferrum.rb +1 -4
- data/lib/wayfarer/networking/follow.rb +14 -7
- data/lib/wayfarer/networking/http.rb +1 -1
- data/lib/wayfarer/networking/pool.rb +23 -13
- data/lib/wayfarer/networking/selenium.rb +15 -7
- data/lib/wayfarer/networking/strategy.rb +2 -2
- data/lib/wayfarer/page.rb +34 -14
- data/lib/wayfarer/parsing/xml.rb +6 -6
- data/lib/wayfarer/parsing.rb +21 -0
- data/lib/wayfarer/redis/barrier.rb +26 -21
- data/lib/wayfarer/redis/counter.rb +18 -9
- data/lib/wayfarer/redis/pool.rb +1 -1
- data/lib/wayfarer/redis/resettable.rb +19 -0
- data/lib/wayfarer/routing/dsl.rb +166 -30
- data/lib/wayfarer/routing/hash_stack.rb +33 -0
- data/lib/wayfarer/routing/matchers/custom.rb +8 -5
- data/lib/wayfarer/routing/matchers/{suffix.rb → empty_params.rb} +2 -6
- data/lib/wayfarer/routing/matchers/host.rb +15 -9
- data/lib/wayfarer/routing/matchers/path.rb +11 -31
- data/lib/wayfarer/routing/matchers/query.rb +41 -17
- data/lib/wayfarer/routing/matchers/result.rb +12 -0
- data/lib/wayfarer/routing/matchers/scheme.rb +13 -5
- data/lib/wayfarer/routing/matchers/url.rb +13 -5
- data/lib/wayfarer/routing/path_consumer.rb +130 -0
- data/lib/wayfarer/routing/path_finder.rb +151 -23
- data/lib/wayfarer/routing/result.rb +1 -1
- data/lib/wayfarer/routing/root_route.rb +17 -1
- data/lib/wayfarer/routing/route.rb +66 -19
- data/lib/wayfarer/routing/serializable.rb +28 -0
- data/lib/wayfarer/routing/sub_route.rb +53 -0
- data/lib/wayfarer/routing/target_route.rb +17 -1
- data/lib/wayfarer/stringify.rb +21 -30
- data/lib/wayfarer/task.rb +9 -17
- data/lib/wayfarer/uri/normalization.rb +120 -0
- data/lib/wayfarer.rb +72 -5
- data/mise.toml +2 -0
- data/mkdocs.yml +44 -8
- data/rake/docs.rake +26 -0
- data/rake/lint.rake +9 -0
- data/rake/release.rake +23 -0
- data/rake/tests.rake +32 -0
- data/requirements.txt +1 -1
- data/spec/factories/job.rb +8 -0
- data/spec/factories/middleware.rb +2 -2
- data/spec/factories/path_finder.rb +11 -0
- data/spec/factories/redis.rb +19 -0
- data/spec/factories/task.rb +46 -2
- data/spec/spec_helpers.rb +55 -51
- data/spec/support/active_job_helpers.rb +8 -0
- data/spec/support/integration_helpers.rb +21 -0
- data/spec/support/redis_helpers.rb +9 -0
- data/spec/support/test_app.rb +66 -37
- data/spec/wayfarer/base_spec.rb +200 -0
- data/spec/wayfarer/batch_completion_spec.rb +142 -0
- data/spec/wayfarer/cli/job_spec.rb +88 -0
- data/spec/wayfarer/cli/routing_spec.rb +322 -0
- data/spec/{cli → wayfarer/cli}/version_spec.rb +1 -1
- data/spec/wayfarer/gc_spec.rb +29 -0
- data/spec/wayfarer/handler_spec.rb +9 -0
- data/spec/wayfarer/integration/callbacks_spec.rb +200 -0
- data/spec/wayfarer/integration/content_type_spec.rb +37 -0
- data/spec/wayfarer/integration/custom_routing_spec.rb +51 -0
- data/spec/wayfarer/integration/gc_spec.rb +40 -0
- data/spec/wayfarer/integration/handler_spec.rb +65 -0
- data/spec/wayfarer/integration/page_spec.rb +79 -0
- data/spec/wayfarer/integration/params_spec.rb +64 -0
- data/spec/wayfarer/integration/parsing_spec.rb +99 -0
- data/spec/wayfarer/integration/retry_spec.rb +112 -0
- data/spec/wayfarer/integration/stage_spec.rb +58 -0
- data/spec/wayfarer/middleware/batch_completion_spec.rb +33 -0
- data/spec/{middleware → wayfarer/middleware}/chain_spec.rb +24 -19
- data/spec/wayfarer/middleware/content_type_spec.rb +83 -0
- data/spec/{middleware → wayfarer/middleware}/controller_spec.rb +24 -22
- data/spec/wayfarer/middleware/dedup_spec.rb +66 -0
- data/spec/wayfarer/middleware/normalize_spec.rb +32 -0
- data/spec/wayfarer/middleware/router_spec.rb +102 -0
- data/spec/wayfarer/middleware/stage_spec.rb +63 -0
- data/spec/wayfarer/middleware/uri_parser_spec.rb +63 -0
- data/spec/wayfarer/middleware/user_agent_spec.rb +158 -0
- data/spec/wayfarer/networking/capybara_spec.rb +13 -0
- data/spec/{networking → wayfarer/networking}/context_spec.rb +46 -38
- data/spec/wayfarer/networking/ferrum_spec.rb +13 -0
- data/spec/{networking → wayfarer/networking}/follow_spec.rb +11 -6
- data/spec/wayfarer/networking/http_spec.rb +12 -0
- data/spec/{networking → wayfarer/networking}/pool_spec.rb +16 -14
- data/spec/wayfarer/networking/selenium_spec.rb +12 -0
- data/spec/{networking → wayfarer/networking}/strategy.rb +33 -54
- data/spec/wayfarer/page_spec.rb +69 -0
- data/spec/{parsing → wayfarer/parsing}/json_spec.rb +1 -1
- data/spec/wayfarer/parsing/xml_parse_spec.rb +25 -0
- data/spec/wayfarer/redis/barrier_spec.rb +39 -0
- data/spec/wayfarer/redis/counter_spec.rb +34 -0
- data/spec/{redis → wayfarer/redis}/pool_spec.rb +4 -3
- data/spec/{routing → wayfarer/routing}/dsl_spec.rb +12 -22
- data/spec/wayfarer/routing/hash_stack_spec.rb +63 -0
- data/spec/wayfarer/routing/integration_spec.rb +101 -0
- data/spec/wayfarer/routing/matchers/custom_spec.rb +39 -0
- data/spec/wayfarer/routing/matchers/host_spec.rb +56 -0
- data/spec/wayfarer/routing/matchers/matcher.rb +17 -0
- data/spec/wayfarer/routing/matchers/path_spec.rb +43 -0
- data/spec/wayfarer/routing/matchers/query_spec.rb +123 -0
- data/spec/wayfarer/routing/matchers/scheme_spec.rb +45 -0
- data/spec/wayfarer/routing/matchers/url_spec.rb +33 -0
- data/spec/wayfarer/routing/path_consumer_spec.rb +123 -0
- data/spec/wayfarer/routing/path_finder_spec.rb +409 -0
- data/spec/wayfarer/routing/root_route_spec.rb +51 -0
- data/spec/wayfarer/routing/route_spec.rb +74 -0
- data/spec/wayfarer/routing/sub_route_spec.rb +103 -0
- data/spec/wayfarer/task_spec.rb +13 -0
- data/spec/wayfarer/uri/normalization_spec.rb +98 -0
- data/spec/wayfarer_spec.rb +2 -2
- data/wayfarer.gemspec +18 -28
- metadata +797 -265
- data/.github/workflows/ci.yaml +0 -32
- data/.rbenv-gemsets +0 -1
- data/.ruby-version +0 -1
- data/RELEASING.md +0 -17
- data/docs/cookbook/user_agent.md +0 -7
- data/docs/guides/error_handling.md +0 -53
- data/docs/guides/networking.md +0 -94
- data/docs/guides/performance.md +0 -130
- data/docs/guides/reliability.md +0 -41
- data/docs/guides/routing/steering.md +0 -30
- data/docs/reference/api/base.md +0 -48
- data/docs/reference/cli.md +0 -61
- data/docs/reference/configuration_keys.md +0 -43
- data/docs/reference/environment_variables.md +0 -83
- data/lib/wayfarer/cli/base.rb +0 -45
- data/lib/wayfarer/cli/generate.rb +0 -17
- data/lib/wayfarer/cli/job.rb +0 -56
- data/lib/wayfarer/cli/route.rb +0 -29
- data/lib/wayfarer/cli/runner.rb +0 -34
- data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
- data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
- data/lib/wayfarer/config/capybara.rb +0 -10
- data/lib/wayfarer/config/ferrum.rb +0 -11
- data/lib/wayfarer/config/networking.rb +0 -29
- data/lib/wayfarer/config/redis.rb +0 -14
- data/lib/wayfarer/config/root.rb +0 -11
- data/lib/wayfarer/config/selenium.rb +0 -21
- data/lib/wayfarer/config/strconv.rb +0 -45
- data/lib/wayfarer/config/struct.rb +0 -72
- data/lib/wayfarer/middleware/fetch.rb +0 -56
- data/lib/wayfarer/redis/connection.rb +0 -13
- data/lib/wayfarer/redis/version.rb +0 -19
- data/lib/wayfarer/routing/router.rb +0 -28
- data/spec/base_spec.rb +0 -224
- data/spec/callbacks_spec.rb +0 -102
- data/spec/cli/generate_spec.rb +0 -39
- data/spec/cli/job_spec.rb +0 -78
- data/spec/config/capybara_spec.rb +0 -18
- data/spec/config/ferrum_spec.rb +0 -24
- data/spec/config/networking_spec.rb +0 -73
- data/spec/config/redis_spec.rb +0 -32
- data/spec/config/root_spec.rb +0 -31
- data/spec/config/selenium_spec.rb +0 -56
- data/spec/config/strconv_spec.rb +0 -58
- data/spec/config/struct_spec.rb +0 -66
- data/spec/fixtures/dummy_job.rb +0 -7
- data/spec/gc_spec.rb +0 -59
- data/spec/handler_spec.rb +0 -11
- data/spec/integration/callbacks_spec.rb +0 -85
- data/spec/integration/page_spec.rb +0 -62
- data/spec/integration/params_spec.rb +0 -56
- data/spec/integration/stage_spec.rb +0 -51
- data/spec/integration/steering_spec.rb +0 -57
- data/spec/middleware/dedup_spec.rb +0 -88
- data/spec/middleware/dispatch_spec.rb +0 -43
- data/spec/middleware/fetch_spec.rb +0 -155
- data/spec/middleware/normalize_spec.rb +0 -29
- data/spec/middleware/router_spec.rb +0 -105
- data/spec/middleware/stage_spec.rb +0 -62
- data/spec/networking/capybara_spec.rb +0 -12
- data/spec/networking/ferrum_spec.rb +0 -12
- data/spec/networking/http_spec.rb +0 -12
- data/spec/networking/selenium_spec.rb +0 -12
- data/spec/page_spec.rb +0 -47
- data/spec/parsing/xml_spec.rb +0 -25
- data/spec/redis/barrier_spec.rb +0 -78
- data/spec/redis/counter_spec.rb +0 -32
- data/spec/redis/version_spec.rb +0 -13
- data/spec/routing/integration_spec.rb +0 -110
- data/spec/routing/matchers/custom_spec.rb +0 -31
- data/spec/routing/matchers/host_spec.rb +0 -49
- data/spec/routing/matchers/path_spec.rb +0 -43
- data/spec/routing/matchers/query_spec.rb +0 -137
- data/spec/routing/matchers/scheme_spec.rb +0 -25
- data/spec/routing/matchers/suffix_spec.rb +0 -41
- data/spec/routing/matchers/uri_spec.rb +0 -27
- data/spec/routing/path_finder_spec.rb +0 -33
- data/spec/routing/root_route_spec.rb +0 -29
- data/spec/routing/route_spec.rb +0 -43
- data/spec/routing/router_spec.rb +0 -24
- data/spec/task_spec.rb +0 -34
- data/spec/{stringify_spec.rb → wayfarer/stringify_spec.rb} +2 -2
data/docs/guides/pages.md
CHANGED
@@ -1,11 +1,14 @@
|
|
1
1
|
# Pages
|
2
2
|
|
3
|
-
|
4
|
-
|
3
|
+
A page is the immutable state of the contents behind a URL at a point in time,
|
4
|
+
retrieved by a [user agent](user_agents.md). In other words, a page is an HTTP
|
5
|
+
response, or the state of a remotely controlled browser.
|
5
6
|
|
6
7
|
```ruby
|
7
|
-
class DummyJob <
|
8
|
-
|
8
|
+
class DummyJob < ActiveJob::Base
|
9
|
+
include Wayfarer::Base
|
10
|
+
|
11
|
+
route.to :index
|
9
12
|
|
10
13
|
def index
|
11
14
|
page # => #<Wayfarer::Page ...>
|
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
|
|
13
16
|
page.url # => "https://example.com"
|
14
17
|
page.body # => "<html>..."
|
15
18
|
page.status_code # => 200
|
16
|
-
page.headers # => { "
|
19
|
+
page.headers # => { "content-type" => ... }
|
20
|
+
page.mime_type # => #<MIME::Type: text/html>
|
21
|
+
|
22
|
+
# The lazily parsed response body or `nil`, depending on the Content-Type
|
23
|
+
page.doc # => #<Nokogiri::HTML::Document ...>
|
17
24
|
|
18
|
-
# A MetaInspector object for accessing page meta data.
|
19
25
|
# See: https://github.com/metainspector/metainspector
|
26
|
+
page.meta # => #<MetaInspector::Document ...>
|
20
27
|
# Examples:
|
21
28
|
page.meta.links.internal
|
22
29
|
page.meta.images.favicon
|
@@ -26,20 +33,63 @@ class DummyJob < Wayfarer::Worker
|
|
26
33
|
end
|
27
34
|
```
|
28
35
|
|
36
|
+
!!! info "HTTP headers are downcased and case-sensitive"
|
37
|
+
|
38
|
+
HTTP headers are downcased, so you would access
|
39
|
+
`page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
|
40
|
+
|
41
|
+
## Response body parsing
|
42
|
+
|
43
|
+
Wayfarer parses the bodies of HTML, XML and JSON responses according to their
|
44
|
+
MIME types:
|
45
|
+
|
46
|
+
* `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
|
47
|
+
* `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
|
48
|
+
* `application/json` to `Hash`
|
49
|
+
|
50
|
+
### Implementing a custom response body parser
|
51
|
+
|
52
|
+
You can register an object that implements a `#parse` method for any MIME type:
|
53
|
+
|
54
|
+
```ruby
|
55
|
+
class MyJPEGParser
|
56
|
+
def parse(body)
|
57
|
+
# Read EXIF metadata here.
|
58
|
+
# Return value is accessible as `page.doc`
|
59
|
+
end
|
60
|
+
end
|
61
|
+
|
62
|
+
Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
|
63
|
+
```
|
64
|
+
|
65
|
+
!!! warning "`#parse` must be thread-safe!"
|
66
|
+
|
67
|
+
!!! info "Handling responses without a Content-Type"
|
68
|
+
|
69
|
+
If a response has no `Content-Type` header, Wayfarer falls back to
|
70
|
+
`application/octet-stream`. A parser registered for
|
71
|
+
`application/octet-stream` will hence also handle all responses without
|
72
|
+
a Content-Type.
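
The lookup described above can be pictured with a plain-Ruby sketch: a hash keyed by MIME type, where a missing `Content-Type` header falls back to `application/octet-stream`. The names `REGISTRY`, `UpcaseParser` and `parse_body` are hypothetical and exist only for this illustration; they are not part of Wayfarer's API, and the real lookup may differ.

```ruby
# Hypothetical stand-in for a parser registry; not Wayfarer's implementation.
FALLBACK_MIME_TYPE = "application/octet-stream"

# Any object responding to #parse can act as a parser.
# This one is stateless, which keeps #parse thread-safe.
class UpcaseParser
  def parse(body)
    body.upcase
  end
end

REGISTRY = {
  "text/plain" => UpcaseParser.new,
  FALLBACK_MIME_TYPE => UpcaseParser.new
}.freeze

# Picks a parser by the (downcased) Content-Type header.
# Returns nil when no parser is registered for the MIME type.
def parse_body(headers, body)
  content_type = headers["content-type"] || FALLBACK_MIME_TYPE
  mime_type = content_type.split(";").first.strip.downcase
  parser = REGISTRY[mime_type]
  parser&.parse(body)
end

parse_body({ "content-type" => "text/plain; charset=utf-8" }, "hi") # => "HI"
parse_body({}, "raw")                                                # => "RAW"
parse_body({ "content-type" => "image/png" }, "...")                 # => nil
```

Keeping parsers free of mutable state, as in this sketch, is one simple way to satisfy the thread-safety requirement.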
|
73
|
+
|
29
74
|
## Live pages
|
30
75
|
|
31
|
-
|
32
|
-
|
76
|
+
`#!ruby page` initially returns a snapshot of the browser state
|
77
|
+
immediately after the user agent navigated to the URL. The browser state may
|
78
|
+
change significantly after the page was retrieved, for example due to your own
|
79
|
+
interaction, or client-side JavaScript altering the DOM or URL.
|
33
80
|
|
34
|
-
To
|
81
|
+
To get a page that reflects the current browser state, set the `#!ruby :live`
|
82
|
+
keyword:
|
35
83
|
|
36
84
|
```ruby
|
37
85
|
class DummyJob < Wayfarer::Worker
|
38
|
-
route
|
86
|
+
route.to :index
|
39
87
|
|
40
88
|
def index
|
41
89
|
page # => #<Wayfarer::Page ...>
|
42
90
|
|
91
|
+
# Fill in forms, click buttons, etc.
|
92
|
+
|
43
93
|
# Replaces the current Page object with a newer one,
|
44
94
|
# taking into account the DOM as currently rendered by the browser.
|
45
95
|
# Effectful only when automating browsers, no-op when using plain
|
@@ -50,3 +100,21 @@ class DummyJob < Wayfarer::Worker
|
|
50
100
|
end
|
51
101
|
end
|
52
102
|
```
|
103
|
+
|
104
|
+
!!! attention "Stateless user agents ignore `#!ruby :live`"
|
105
|
+
|
106
|
+
The `#!ruby :live` option is ignored by stateless user agents, such as the
|
107
|
+
default `#!ruby :http` user agent. Instead, stateless user agents always
|
108
|
+
return the same page object.
|
109
|
+
|
110
|
+
## Accessing page metadata with MetaInspector
|
111
|
+
|
112
|
+
You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
|
113
|
+
document for accessing metadata of HTML pages. For example, to stage all links
|
114
|
+
internal to the current hostname:
|
115
|
+
|
116
|
+
```ruby
|
117
|
+
def index
|
118
|
+
stage page.meta.links.internal
|
119
|
+
end
|
120
|
+
```
|
@@ -0,0 +1,156 @@
|
|
1
|
+
# Routing
|
2
|
+
|
3
|
+
Wayfarer equips jobs with a declarative routing DSL that maps URLs to actions.
|
4
|
+
Actions are instance methods denoted by symbols, or [handlers](/guides/handlers).
|
5
|
+
[Pages](/guides/pages) are only retrieved from URLs which map to an action.
|
6
|
+
|
7
|
+
!!! info "Routed URLs are normalized"
|
8
|
+
|
9
|
+
By default, Wayfarer [applies some transformations to each URL](../tasks/#url-normalization) to bring it
|
10
|
+
into a canonical form. Routing happens based on this canonical form.
|
11
|
+
|
12
|
+
You can always access a task's raw URL string as it was enqueued with `task.url`.
|
13
|
+
|
14
|
+
A job's route declarations equate to a predicate tree.
|
15
|
+
When a URL is routed, the predicate tree is searched depth-first. If a
|
16
|
+
matching leaf predicate is found, the found path's action is dispatched.
|
17
|
+
You can extract data from URL path segments and query parameters and
|
18
|
+
access it through `params` in jobs or handlers.
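
To make the depth-first search concrete, here is a minimal sketch of a predicate tree in plain Ruby. `Node` and `find_action` are hypothetical names chosen for this example and do not mirror Wayfarer's internal classes; parameter extraction is left out to keep the sketch short.

```ruby
# Illustrative model of a predicate tree; names are hypothetical and do not
# mirror Wayfarer's internals.
require "uri"

Node = Struct.new(:predicate, :action, :children) do
  def leaf?
    children.empty?
  end
end

# Depth-first search: returns the action of the first leaf reachable through
# a chain of matching predicates, or nil if no such path exists.
def find_action(node, uri)
  return nil unless node.predicate.call(uri)
  return node.action if node.leaf?

  node.children.each do |child|
    action = find_action(child, uri)
    return action if action
  end
  nil
end

always = ->(_uri) { true }
tree = Node.new(always, nil, [
  Node.new(->(uri) { uri.host == "example.com" }, nil, [
    Node.new(->(uri) { uri.path == "/contact" }, nil, [
      Node.new(always, :contact, [])
    ]),
    Node.new(->(uri) { uri.path.start_with?("/users/") }, nil, [
      Node.new(always, :show, [])
    ])
  ])
])

find_action(tree, URI("https://example.com/users/42")) # => :show
find_action(tree, URI("https://example.org/contact"))  # => nil
```

A subtree is skipped as soon as its root predicate fails, which is why a mismatching host rules out every path nested beneath it.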
|
19
|
+
|
20
|
+
The following routes:
|
21
|
+
|
22
|
+
```ruby
|
23
|
+
route.host "example.com", scheme: :https do
|
24
|
+
path "contact", to: :contact
|
25
|
+
path "users/:id" do
|
26
|
+
to [UserHandler, :show]
|
27
|
+
|
28
|
+
path "gallery", to: [UserHandler, :photos]
|
29
|
+
end
|
30
|
+
end
|
31
|
+
```
|
32
|
+
|
33
|
+
Equate to the following predicate tree:
|
34
|
+
|
35
|
+
```mermaid
|
36
|
+
flowchart LR
|
37
|
+
Root-->Host["Host <code>example.com</code>"]
|
38
|
+
Host-->Scheme["Scheme <code>:https</code>"]
|
39
|
+
|
40
|
+
%% first-level paths
|
41
|
+
Scheme-->PathContact["Path <code>contact</code>"]
|
42
|
+
Scheme-->PathUsersId["Path <code>users/:id</code>"]
|
43
|
+
|
44
|
+
%% their targets
|
45
|
+
PathContact-->TargetRouteContact["Target <code>:contact</code>"]
|
46
|
+
PathUsersId-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]
|
47
|
+
|
48
|
+
%% nested path under /users/:id
|
49
|
+
PathUsersId-->PathGallery["Path <code>gallery</code>"]
|
50
|
+
PathGallery-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]
|
51
|
+
```
|
52
|
+
|
53
|
+
Traversing the tree depth-first for `https://example.com/users/42` stops at the
|
54
|
+
route with the action `[UserHandler, :show]`:
|
55
|
+
|
56
|
+
```mermaid
|
57
|
+
flowchart LR
|
58
|
+
Root:::matching-->Host["Host <code>example.com</code>"]:::matching
|
59
|
+
Host:::matching-->Scheme["Scheme <code>:https</code>"]:::matching
|
60
|
+
|
61
|
+
%% sibling paths from the scheme node
|
62
|
+
Scheme:::matching-->PathContact["Path <code>/contact</code>"]:::mismatching
|
63
|
+
Scheme:::matching-->PathUsersId["Path <code>/users/:id</code>"]:::matching
|
64
|
+
|
65
|
+
%% successful match for /users/:id
|
66
|
+
PathUsersId:::matching-->TargetRouteUserHandler["Target <code>[UserHandler, :show]</code>"]:::matching
|
67
|
+
|
68
|
+
%% gallery branch is never visited for /users/42
|
69
|
+
PathContact-->TargetRouteContact["Target <code>:contact</code>"]:::unvisited
|
70
|
+
PathUsersId:::matching-->PathGallery["Path <code>/gallery</code>"]:::unvisited
|
71
|
+
PathGallery:::unvisited-->TargetRouteUserHandlerPhotos["Target <code>[UserHandler, :photos]</code>"]:::unvisited
|
72
|
+
|
73
|
+
classDef matching fill:#7CB342,stroke:#7CB342,color:#fff
|
74
|
+
classDef mismatching fill:#FFCDD2,stroke:#F44336,color:#B71C1C
|
75
|
+
classDef unvisited fill:#BDBDBD,stroke:#BDBDBD,color:#616161
|
76
|
+
```
|
77
|
+
|
78
|
+
??? note "You can also visualise a job's routing tree with the [`route` CLI subcommand](/guides/cli)"
|
79
|
+
|
80
|
+
```sh
|
81
|
+
wayfarer route DummyJob -r dummy_job.rb http://localhost:9000/users/42/gallery
|
82
|
+
```
|
83
|
+
|
84
|
+
```yaml
|
85
|
+
---
|
86
|
+
routed: true
|
87
|
+
params:
|
88
|
+
id: '42'
|
89
|
+
action:
|
90
|
+
handler: Class
|
91
|
+
action: :photos
|
92
|
+
root_route:
|
93
|
+
match: true
|
94
|
+
params: {}
|
95
|
+
children:
|
96
|
+
- route:
|
97
|
+
host:
|
98
|
+
name: example.com
|
99
|
+
match: true
|
100
|
+
params: {}
|
101
|
+
children:
|
102
|
+
- route:
|
103
|
+
scheme:
|
104
|
+
scheme: :https
|
105
|
+
match: true
|
106
|
+
params: {}
|
107
|
+
children:
|
108
|
+
- route:
|
109
|
+
path:
|
110
|
+
pattern: "/contact"
|
111
|
+
match: false
|
112
|
+
params: {}
|
113
|
+
children:
|
114
|
+
- target_route:
|
115
|
+
action:
|
116
|
+
children: []
|
117
|
+
- route:
|
118
|
+
path:
|
119
|
+
pattern: "/users/:id"
|
120
|
+
match: true
|
121
|
+
params:
|
122
|
+
id: '42'
|
123
|
+
children:
|
124
|
+
- target_route:
|
125
|
+
action:
|
126
|
+
handler: Class
|
127
|
+
action: :show
|
128
|
+
children: []
|
129
|
+
- route:
|
130
|
+
path:
|
131
|
+
pattern: "/gallery"
|
132
|
+
match: true
|
133
|
+
params:
|
134
|
+
id: '42'
|
135
|
+
children:
|
136
|
+
- target_route:
|
137
|
+
action:
|
138
|
+
handler: Class
|
139
|
+
action: :photos
|
140
|
+
children: []
|
141
|
+
```
|
142
|
+
|
143
|
+
As you can see, `Target` nodes always match. This means that we could have also defined
|
144
|
+
our routes as:
|
145
|
+
|
146
|
+
```ruby
|
147
|
+
route.host "example.com", scheme: :https do
|
148
|
+
to :contact do
|
149
|
+
path "/contact"
|
150
|
+
end
|
151
|
+
|
152
|
+
to [UserHandler, :show] do
|
153
|
+
path "/users/:id"
|
154
|
+
end
|
155
|
+
end
|
156
|
+
```
|
data/docs/guides/tasks.md
CHANGED
@@ -1,14 +1,58 @@
|
|
1
1
|
# Tasks
|
2
2
|
|
3
|
-
Tasks are the immutable units of work
|
4
|
-
consists of:
|
3
|
+
Tasks are the immutable units of work read from a message queue and processed by
|
4
|
+
[jobs](/guides/jobs). A task consists of two strings:
|
5
5
|
|
6
|
-
|
7
|
-
|
6
|
+
* The __URL__ to process
|
7
|
+
* The __batch__ the task belongs to
|
8
8
|
|
9
|
-
|
10
|
-
* Like URLs, batches are strings.
|
9
|
+
A job processing a task commonly appends more tasks to the queue in turn.
|
11
10
|
|
12
|
-
|
13
|
-
|
14
|
-
|
11
|
+
!!! info "Task URLs are not normalized"
|
12
|
+
|
13
|
+
The URL returned by `task.url` is not normalized but verbatim
|
14
|
+
as it was staged or enqueued.
|
15
|
+
|
16
|
+
## Task deduplication
|
17
|
+
|
18
|
+
Wayfarer ensures that no URL gets processed twice within a batch. It achieves
|
19
|
+
this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
|
20
|
+
keyed by normalized URLs.
|
21
|
+
|
22
|
+
Wayfarer computes a canonical URL representation that it uses for cache lookups.
|
23
|
+
|
24
|
+
### URL normalization
|
25
|
+
|
26
|
+
Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
|
27
|
+
and applies further normalizations. By default, all normalizations are applied
|
28
|
+
and can be individually disabled.
|
29
|
+
|
30
|
+
URL normalization is used only for deduplication, and does not affect the immutable
|
31
|
+
`task.url`, which always returns the verbatim URL as enqueued.
|
32
|
+
This allows you to follow the URLs exactly as parsed from response bodies.
|
33
|
+
|
34
|
+
You can configure the global normalization behaviour by setting the following
|
35
|
+
values on `Wayfarer.config.normalization`, all of which default to `true`:
|
36
|
+
|
37
|
+
* `remove_www`: Remove `www.` prefix from hostnames?
|
38
|
+
* `remove_trailing_slash`: Remove a trailing path slash?
|
39
|
+
* `remove_fragment`: Remove the URL fragment?
|
40
|
+
* `order_query_parameters`: Order query parameters alphabetically?
|
41
|
+
* `remove_tracking_parameters`: Remove tracking parameters from the URL?
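
The effect of these normalizations can be approximated with Ruby's standard library alone. The sketch below is not Wayfarer's implementation (which builds on Addressable): `normalize_url` is a hypothetical helper, the list of tracking parameters is an assumption made for illustration, and only absolute URLs are handled.

```ruby
# A stdlib-only approximation of the normalizations listed above.
# Not Wayfarer's implementation; TRACKING_PARAMS is an assumed example list.
require "uri"

TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign].freeze

# Expects an absolute URL with a host.
def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase.delete_prefix("www.") # remove_www
  uri.path = uri.path.chomp("/")                      # remove_trailing_slash
  uri.fragment = nil                                  # remove_fragment

  if uri.query
    params = URI.decode_www_form(uri.query)
    params.reject! { |key, _| TRACKING_PARAMS.include?(key) } # remove_tracking_parameters
    params.sort!                                              # order_query_parameters
    uri.query = params.empty? ? nil : URI.encode_www_form(params)
  end

  uri.to_s
end

normalize_url("https://www.example.com/a/?b=2&utm_source=x&a=1#top")
# => "https://example.com/a?a=1&b=2"
```

Two URLs that differ only in these respects produce the same string, which is what makes the result usable as a deduplication key.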
|
42
|
+
|
43
|
+
When a job gets deduplicated, it succeeds and causes no retries.
|
44
|
+
|
45
|
+
### Setting a custom key function
|
46
|
+
|
47
|
+
You can customize how deduplication keys are computed. As a contrived example,
|
48
|
+
to process only one job per hostname:
|
49
|
+
|
50
|
+
```ruby
|
51
|
+
Wayfarer.config[:deduplication][:key] = ->(task) { task[:uri].hostname }
|
52
|
+
```
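
To see what a key function changes, consider this in-memory sketch of per-batch deduplication. Wayfarer keeps this state in Redis; the `Deduplicator` class and the `Task` struct below are hypothetical stand-ins, and the sketch is not thread-safe.

```ruby
# In-memory sketch of per-batch deduplication with a pluggable key function.
# `Deduplicator` and `Task` are hypothetical stand-ins, not Wayfarer classes.
require "set"
require "uri"

Task = Struct.new(:url, :batch)

class Deduplicator
  def initialize(key: ->(task) { task.url })
    @key = key
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns true the first time a key is seen within a batch, false afterwards.
  def first_visit?(task)
    !@seen[task.batch].add?(@key.call(task)).nil?
  end
end

by_host = Deduplicator.new(key: ->(task) { URI(task.url).host })
by_host.first_visit?(Task.new("https://example.com/a", "batch-1")) # => true
by_host.first_visit?(Task.new("https://example.com/b", "batch-1")) # => false
by_host.first_visit?(Task.new("https://example.com/b", "batch-2")) # => true
```

With the hostname as key, the second task is treated as a duplicate even though its path differs, while the same URL in another batch is processed again.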
|
53
|
+
|
54
|
+
## Invalid URLs
|
55
|
+
|
56
|
+
Tasks with invalid URLs are discarded (for example `ht%0atp://localhost/`, which has a
|
57
|
+
newline in its protocol), since there is no corrective action possible.
|
58
|
+
No exception is raised, and the job is considered successfully processed without retries.
|
@@ -0,0 +1,66 @@
|
|
1
|
+
# Tutorial
|
2
|
+
|
3
|
+
Wayfarer is a web crawling framework written in Ruby.
|
4
|
+
It works with plain HTTP and by automating web browsers interchangeably
|
5
|
+
and is deployed with Redis and a message queue.
|
6
|
+
During development it can execute fully in memory, without Redis.
|
7
|
+
|
8
|
+
## Getting started
|
9
|
+
|
10
|
+
In an empty directory, generate a new `Gemfile` and install Wayfarer:
|
11
|
+
|
12
|
+
```sh
|
13
|
+
bundle init
|
14
|
+
bundle add activejob wayfarer
|
15
|
+
bundle install
|
16
|
+
```
|
17
|
+
|
18
|
+
## Jobs, tasks and batches
|
19
|
+
|
20
|
+
Wayfarer builds on Active Job, the message queue abstraction of Rails.
|
21
|
+
You can use Wayfarer without Rails of course, as we do here.
|
22
|
+
|
23
|
+
A message queue supports two operations: appending messages to the end and consuming
|
24
|
+
messages from the front. This is how Wayfarer processes tasks, a string pair
|
25
|
+
of URL and batch. Wayfarer enforces that URLs are not processed more than
|
26
|
+
once within their batch (excluding retries).
|
27
|
+
|
28
|
+
When a task is consumed, it is processed by a job, a Ruby class.
|
29
|
+
|
30
|
+
Let's give ourselves a `dummy_job.rb` that routes all URLs to its
|
31
|
+
`index` instance method, where we print the current `task`:
|
32
|
+
|
33
|
+
```ruby title="dummy_job.rb"
|
34
|
+
require "active_job"
|
35
|
+
require "wayfarer"
|
36
|
+
|
37
|
+
class DummyJob < ActiveJob::Base
|
38
|
+
include Wayfarer::Base
|
39
|
+
|
40
|
+
route.to :index
|
41
|
+
|
42
|
+
def index
|
43
|
+
puts task
|
44
|
+
end
|
45
|
+
end
|
46
|
+
```
|
47
|
+
|
48
|
+
We can perform our job from the command line with the `wayfarer perform`
|
49
|
+
subcommand. In between ActiveJob's log output, we see that Wayfarer
|
50
|
+
has generated a UUID for the batch since we did not pass it:
|
51
|
+
|
52
|
+
```sh
|
53
|
+
bundle exec wayfarer perform -r dummy_job.rb DummyJob https://example.com
|
54
|
+
```
|
55
|
+
|
56
|
+
```hl_lines="2"
|
57
|
+
[ActiveJob] [DummyJob] [68853491-...] Performing DummyJob (Job ID: 68853491-...) from Async(default) with arguments: #<Wayfarer::Task url="https://example.com", batch="63d14035-...">
|
58
|
+
#<Wayfarer::Task url="https://example.com", batch="63d14035-...">
|
59
|
+
[ActiveJob] [DummyJob] [68853491-...] Performed DummyJob (Job ID: 68853491-) from Async(default) in 507.65ms
|
60
|
+
```
|
61
|
+
|
62
|
+
If you don't provide a batch, Wayfarer uses a generated UUID instead.
|
63
|
+
We could have also used `DummyJob.crawl
|
64
|
+
|
65
|
+
|
66
|
+
|
@@ -0,0 +1,115 @@
|
|
1
|
+
# User agents
|
2
|
+
|
3
|
+
User agents are used by [jobs](../jobs) to retrieve the contents behind a URL into a
|
4
|
+
[page](../pages), for example a remotely controlled Firefox process or a Ruby HTTP client.
|
5
|
+
|
6
|
+
User agents are kept in a connection pool and all user agents in the pool
|
7
|
+
share the same type and configuration. You can add your own custom user agents by implementing
|
8
|
+
the [user agent API](custom_user_agents.md).
|
9
|
+
|
10
|
+
Wayfarer comes with the following built-in user agents:
|
11
|
+
|
12
|
+
* [`:http`](http.md) (default)
|
13
|
+
* [`:ferrum`](ferrum.md) to automate Google Chrome
|
14
|
+
* [`:selenium`](selenium.md) to automate a variety of browsers
|
15
|
+
* [`:capybara`](capybara.md) to use Capybara sessions
|
16
|
+
|
17
|
+
Configure the user agent with the global configuration option:
|
18
|
+
|
19
|
+
```ruby
|
20
|
+
Wayfarer.config[:network][:agent] = :ferrum # or :selenium, :capybara, ...
|
21
|
+
```
|
22
|
+
|
23
|
+
You can access the user agent that was checked out from the pool with
|
24
|
+
`#user_agent` in action methods:
|
25
|
+
|
26
|
+
```ruby
|
27
|
+
class DummyJob < ActiveJob::Base
|
28
|
+
include Wayfarer::Base
|
29
|
+
|
30
|
+
route.to :index
|
31
|
+
|
32
|
+
def index
|
33
|
+
user_agent # => #<Ferrum::Browser ...>
|
34
|
+
end
|
35
|
+
end
|
36
|
+
```
|
37
|
+
|
38
|
+
You can also implement [custom user agents](custom_user_agents.md) to support
|
39
|
+
your own HTTP client or browser automation service/protocol.
|
40
|
+
|
41
|
+
### Ad-hoc HTTP requests
|
42
|
+
|
43
|
+
Regardless of the configured user agent, you can always make ad-hoc HTTP GET requests
|
44
|
+
that return pages with `#fetch(url)`:
|
45
|
+
|
46
|
+
```ruby
|
47
|
+
class DummyJob < ActiveJob::Base
|
48
|
+
include Wayfarer::Base
|
49
|
+
|
50
|
+
route.to :index
|
51
|
+
|
52
|
+
def index
|
53
|
+
page = fetch("https://example.com") # => #<Wayfarer::Page ...>
|
54
|
+
end
|
55
|
+
end
|
56
|
+
```
|
57
|
+
|
58
|
+
!!! info "`#fetch` respects `Wayfarer.config.network.http_headers` for all provided user agents."
|
59
|
+
|
60
|
+
## HTTP request headers
|
61
|
+
|
62
|
+
You can set HTTP request headers for all built-in user agents except Selenium:
|
63
|
+
|
64
|
+
```ruby
|
65
|
+
Wayfarer.config[:network][:http_headers] = { "User-Agent" => "MyCrawler" }
|
66
|
+
```
|
67
|
+
|
68
|
+
!!! attention "Selenium does not support configuring HTTP request headers."
|
69
|
+
|
70
|
+
## Connection pooling
|
71
|
+
|
72
|
+
Since user agents are expensive to create, especially in the case of browser
|
73
|
+
processes, Wayfarer keeps user agents within a connection pool. When a job
|
74
|
+
performs and needs to retrieve the [page](../pages) for its task URL, an agent
|
75
|
+
is checked out from the pool, and checked back in when the routed action method
|
76
|
+
returns.
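
The check-out and check-in cycle can be sketched with Ruby's thread-safe `Queue`. This is a simplified model under stated assumptions, not Wayfarer's pool: `TinyPool` and `PoolTimeoutError` are hypothetical names, and the agents are plain objects rather than browsers.

```ruby
# Minimal fixed-size pool built on Ruby's thread-safe Queue.
# `TinyPool` and `PoolTimeoutError` are hypothetical names for this sketch.
require "timeout"

class PoolTimeoutError < StandardError; end

class TinyPool
  def initialize(size:, timeout:, &factory)
    @timeout = timeout
    @available = Queue.new
    size.times { @available << factory.call }
  end

  # Checks an agent out, yields it, and always checks it back in.
  def with
    agent = checkout
    begin
      yield agent
    ensure
      @available << agent
    end
  end

  private

  # Blocks until an agent is free or the timeout elapses.
  def checkout
    Timeout.timeout(@timeout) { @available.pop }
  rescue Timeout::Error
    raise PoolTimeoutError, "Waited #{@timeout} sec for a user agent"
  end
end

pool = TinyPool.new(size: 1, timeout: 0.1) { Object.new }
pool.with { |agent| agent } # checks out the single agent and returns it
```

Because the agent is returned in an `ensure` clause, an exception raised inside the block does not leak the agent. A second check-out while the only agent is in use waits until the timeout and then raises.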
|
77
|
+
|
78
|
+
The pool size is constant and it should equal the number of threads the
|
79
|
+
underlying message queue operates with. For example, if you use Sidekiq,
|
80
|
+
you should set the pool size to the number of Sidekiq threads:
|
81
|
+
|
82
|
+
```ruby
|
83
|
+
Wayfarer.config[:network][:pool][:size] = Sidekiq.options[:concurrency]
|
84
|
+
```
|
85
|
+
|
86
|
+
!!! attention "The connection pool size is 1 by default"
|
87
|
+
|
88
|
+
Since there is no reliable way to detect the number of threads that
|
89
|
+
the underlying message queue operates with, Wayfarer defaults to a pool
|
90
|
+
size of 1, which creates a bottleneck in a concurrent environment.
|
91
|
+
|
92
|
+
!!! attention "Browser sessions are shared across jobs"
|
93
|
+
|
94
|
+
The same browser session is used across jobs. This means that the browser
|
95
|
+
is not closed between jobs, and that the browser's state carries over from
|
96
|
+
job to job. You may account for this by resetting the browser's state
|
97
|
+
according to your needs, for which you can use [callbacks](../callbacks).
|
98
|
+
|
99
|
+
### `UserAgentTimeoutError`: avoiding pool contention
|
100
|
+
|
101
|
+
If you encounter `UserAgentTimeoutError` exceptions, a job has waited for a
|
102
|
+
user agent to become available for too long. By default, this timeout is 10
|
103
|
+
seconds. This is a sign that the pool size is too small for the message queue's
|
104
|
+
concurrency.
|
105
|
+
|
106
|
+
```
|
107
|
+
#<Wayfarer::UserAgentTimeoutError: Waited 10 sec, 0/1 available>
|
108
|
+
```
|
109
|
+
|
110
|
+
You can configure the timeout, although you will likely want to increase the
|
111
|
+
pool size instead:
|
112
|
+
|
113
|
+
```ruby
|
114
|
+
Wayfarer.config[:network][:pool][:timeout] = 10 # seconds
|
115
|
+
```
|
data/docs/index.md
CHANGED
@@ -1,56 +1,33 @@
|
|
1
1
|
---
|
2
2
|
hide:
|
3
3
|
- navigation
|
4
|
+
- toc
|
4
5
|
---
|
5
6
|
|
6
7
|
# Wayfarer
|
7
8
|
|
8
|
-
|
9
|
-
[](https://rubygems.org/gems/wayfarer)
|
9
|
+
## Ruby web crawling framework built on [ActiveJob]() and [Redis]()
|
10
10
|
|
11
|
-
|
11
|
+
<small>
|
12
|
+
[Read the tutorial](/guides/tutorial){ .md-button .md-button--primary }
|
13
|
+
</small>
|
12
14
|
|
13
|
-
|
14
|
-
* Data extraction
|
15
|
-
* Browser automation
|
15
|
+
=== "Command line"
|
16
16
|
|
17
|
-
|
17
|
+
```sh
|
18
|
+
gem install wayfarer
|
19
|
+
```
|
18
20
|
|
19
|
-
|
21
|
+
=== "Gemfile"
|
20
22
|
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
release:
|
23
|
+
```ruby
|
24
|
+
gem "wayfarer"
|
25
|
+
```
|
25
26
|
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
### Installation
|
30
|
-
|
31
|
-
Install the RubyGem:
|
32
|
-
|
33
|
-
```
|
34
|
-
gem install wayfarer
|
35
|
-
```
|
36
|
-
|
37
|
-
Or add it to Bundler's Gemfile:
|
38
|
-
|
39
|
-
```ruby
|
40
|
-
gem "wayfarer"
|
41
|
-
```
|
42
|
-
|
43
|
-
### Features
|
44
|
-
|
45
|
-
* Breadth-first, acyclic, multi-threaded graph traversal
|
46
|
-
* Executes atop a variety of message queues thanks to [ActiveJob](https://edgeguides.rubyonrails.org/active_job_basics.html)
|
47
|
-
* Browser automation via [Ferrum](https://github.com/rubycdp/ferrum)
|
27
|
+
* Breadth-first, acyclic page traversal
|
28
|
+
* Plain HTTP and browser automation via [Ferrum](https://github.com/rubycdp/ferrum)
|
48
29
|
(<abbr title="Chrome DevTools Protocol">CDP</abbr>),
|
49
|
-
[Selenium](https://www.selenium.dev)
|
30
|
+
[Selenium](https://www.selenium.dev) and custom user agents
|
50
31
|
* Declarative routing DSL
|
51
32
|
* URI normalization and deduplication
|
52
|
-
*
|
53
|
-
* HTTP redirect handling
|
54
|
-
* Storage-agnostic
|
55
|
-
* Small footprint: <500 LoC
|
56
|
-
* Open Source (MIT)
|
33
|
+
* HTML, XML, JSON and custom Content-Type body parsing
|