wayfarer 0.4.6 → 0.4.7
- checksums.yaml +4 -4
- data/.github/workflows/lint.yaml +25 -0
- data/.github/workflows/release.yaml +29 -0
- data/.github/workflows/tests.yaml +30 -0
- data/.gitignore +4 -0
- data/.rubocop.yml +5 -0
- data/.vale.ini +5 -0
- data/.yardopts +1 -3
- data/Dockerfile +5 -4
- data/Gemfile +3 -0
- data/Gemfile.lock +107 -102
- data/Rakefile +5 -56
- data/bin/wayfarer +1 -1
- data/docker-compose.yml +20 -9
- data/docs/cookbook/consent_screen.md +2 -2
- data/docs/cookbook/executing_javascript.md +3 -3
- data/docs/cookbook/navigation.md +12 -12
- data/docs/cookbook/querying_html.md +3 -3
- data/docs/cookbook/screenshots.md +2 -2
- data/docs/cookbook/user_agent.md +1 -1
- data/docs/design.md +36 -0
- data/docs/guides/callbacks.md +24 -126
- data/docs/guides/configuration.md +8 -8
- data/docs/guides/handlers.md +60 -0
- data/docs/guides/index.md +1 -0
- data/docs/guides/jobs/error_handling.md +40 -0
- data/docs/guides/jobs.md +99 -31
- data/docs/guides/navigation.md +1 -1
- data/docs/guides/networking/capybara.md +13 -22
- data/docs/guides/networking/custom_adapters.md +82 -41
- data/docs/guides/networking/ferrum.md +4 -4
- data/docs/guides/networking/http.md +9 -13
- data/docs/guides/networking/selenium.md +10 -11
- data/docs/guides/pages.md +76 -10
- data/docs/guides/redis.md +10 -0
- data/docs/guides/routing.md +74 -0
- data/docs/guides/tasks.md +33 -9
- data/docs/guides/tutorial.md +60 -0
- data/docs/guides/user_agents.md +113 -0
- data/docs/index.md +17 -40
- data/docs/reference/cli.md +35 -25
- data/docs/reference/configuration.md +36 -0
- data/lib/wayfarer/base.rb +124 -46
- data/lib/wayfarer/batch_completion.rb +56 -0
- data/lib/wayfarer/callbacks.rb +22 -48
- data/lib/wayfarer/cli/route_printer.rb +71 -57
- data/lib/wayfarer/cli.rb +121 -0
- data/lib/wayfarer/gc.rb +13 -6
- data/lib/wayfarer/handler.rb +15 -7
- data/lib/wayfarer/logging.rb +38 -0
- data/lib/wayfarer/middleware/base.rb +2 -0
- data/lib/wayfarer/middleware/batch_completion.rb +19 -0
- data/lib/wayfarer/middleware/content_type.rb +54 -0
- data/lib/wayfarer/middleware/controller.rb +19 -15
- data/lib/wayfarer/middleware/dedup.rb +16 -13
- data/lib/wayfarer/middleware/dispatch.rb +12 -4
- data/lib/wayfarer/middleware/normalize.rb +12 -11
- data/lib/wayfarer/middleware/redis.rb +15 -0
- data/lib/wayfarer/middleware/router.rb +33 -35
- data/lib/wayfarer/middleware/stage.rb +5 -5
- data/lib/wayfarer/middleware/uri_parser.rb +30 -0
- data/lib/wayfarer/middleware/user_agent.rb +49 -0
- data/lib/wayfarer/networking/capybara.rb +1 -1
- data/lib/wayfarer/networking/context.rb +2 -2
- data/lib/wayfarer/networking/ferrum.rb +2 -2
- data/lib/wayfarer/networking/follow.rb +12 -6
- data/lib/wayfarer/networking/http.rb +1 -1
- data/lib/wayfarer/networking/pool.rb +17 -12
- data/lib/wayfarer/networking/selenium.rb +3 -3
- data/lib/wayfarer/networking/strategy.rb +2 -2
- data/lib/wayfarer/page.rb +36 -14
- data/lib/wayfarer/parsing/xml.rb +6 -6
- data/lib/wayfarer/parsing.rb +24 -0
- data/lib/wayfarer/redis/barrier.rb +13 -21
- data/lib/wayfarer/redis/counter.rb +19 -9
- data/lib/wayfarer/redis/pool.rb +1 -1
- data/lib/wayfarer/redis/resettable.rb +19 -0
- data/lib/wayfarer/routing/dsl.rb +1 -0
- data/lib/wayfarer/routing/matchers/path.rb +4 -2
- data/lib/wayfarer/routing/root_route.rb +5 -1
- data/lib/wayfarer/routing/route.rb +4 -14
- data/lib/wayfarer/stringify.rb +22 -30
- data/lib/wayfarer/task.rb +12 -18
- data/lib/wayfarer.rb +28 -1
- data/mkdocs.yml +52 -7
- data/rake/docs.rake +26 -0
- data/rake/lint.rake +105 -0
- data/rake/release.rake +29 -0
- data/rake/tests.rake +28 -0
- data/requirements.txt +1 -1
- data/spec/base_spec.rb +140 -160
- data/spec/batch_completion_spec.rb +104 -0
- data/spec/cli/job_spec.rb +19 -23
- data/spec/cli/routing_spec.rb +101 -0
- data/spec/cli/version_spec.rb +1 -1
- data/spec/factories/task.rb +7 -1
- data/spec/fixtures/dummy_job.rb +5 -3
- data/spec/gc_spec.rb +8 -50
- data/spec/handler_spec.rb +1 -1
- data/spec/integration/callbacks_spec.rb +157 -45
- data/spec/integration/content_type_spec.rb +145 -0
- data/spec/integration/gc_spec.rb +44 -0
- data/spec/integration/handler_spec.rb +66 -0
- data/spec/integration/page_spec.rb +44 -29
- data/spec/integration/params_spec.rb +33 -25
- data/spec/integration/parsing_spec.rb +125 -0
- data/spec/integration/routing_spec.rb +18 -0
- data/spec/integration/stage_spec.rb +27 -20
- data/spec/middleware/batch_completion_spec.rb +34 -0
- data/spec/middleware/chain_spec.rb +8 -8
- data/spec/middleware/content_type_spec.rb +86 -0
- data/spec/middleware/controller_spec.rb +5 -5
- data/spec/middleware/dedup_spec.rb +38 -55
- data/spec/middleware/dispatch_spec.rb +23 -7
- data/spec/middleware/normalize_spec.rb +44 -13
- data/spec/middleware/router_spec.rb +29 -30
- data/spec/middleware/stage_spec.rb +8 -8
- data/spec/middleware/uri_parser_spec.rb +53 -0
- data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
- data/spec/networking/context_spec.rb +1 -1
- data/spec/networking/follow_spec.rb +2 -2
- data/spec/networking/pool_spec.rb +5 -5
- data/spec/networking/strategy.rb +2 -2
- data/spec/page_spec.rb +42 -20
- data/spec/parsing/xml_spec.rb +11 -12
- data/spec/redis/barrier_spec.rb +8 -48
- data/spec/redis/counter_spec.rb +13 -1
- data/spec/redis/pool_spec.rb +1 -1
- data/spec/spec_helpers.rb +27 -16
- data/spec/support/test_app.rb +8 -0
- data/spec/task_spec.rb +3 -24
- data/spec/wayfarer_spec.rb +1 -1
- data/wayfarer.gemspec +4 -3
- metadata +61 -51
- data/.github/workflows/ci.yaml +0 -32
- data/docs/guides/error_handling.md +0 -53
- data/docs/guides/networking.md +0 -94
- data/docs/guides/performance.md +0 -130
- data/docs/guides/reliability.md +0 -41
- data/docs/guides/routing/steering.md +0 -30
- data/docs/reference/api/base.md +0 -48
- data/docs/reference/configuration_keys.md +0 -43
- data/docs/reference/environment_variables.md +0 -83
- data/lib/wayfarer/cli/base.rb +0 -45
- data/lib/wayfarer/cli/generate.rb +0 -17
- data/lib/wayfarer/cli/job.rb +0 -56
- data/lib/wayfarer/cli/route.rb +0 -29
- data/lib/wayfarer/cli/runner.rb +0 -34
- data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
- data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
- data/lib/wayfarer/config/capybara.rb +0 -10
- data/lib/wayfarer/config/ferrum.rb +0 -11
- data/lib/wayfarer/config/networking.rb +0 -29
- data/lib/wayfarer/config/redis.rb +0 -14
- data/lib/wayfarer/config/root.rb +0 -11
- data/lib/wayfarer/config/selenium.rb +0 -21
- data/lib/wayfarer/config/strconv.rb +0 -45
- data/lib/wayfarer/config/struct.rb +0 -72
- data/lib/wayfarer/middleware/fetch.rb +0 -56
- data/lib/wayfarer/redis/connection.rb +0 -13
- data/lib/wayfarer/redis/version.rb +0 -19
- data/lib/wayfarer/routing/router.rb +0 -28
- data/spec/callbacks_spec.rb +0 -102
- data/spec/cli/generate_spec.rb +0 -39
- data/spec/config/capybara_spec.rb +0 -18
- data/spec/config/ferrum_spec.rb +0 -24
- data/spec/config/networking_spec.rb +0 -73
- data/spec/config/redis_spec.rb +0 -32
- data/spec/config/root_spec.rb +0 -31
- data/spec/config/selenium_spec.rb +0 -56
- data/spec/config/strconv_spec.rb +0 -58
- data/spec/config/struct_spec.rb +0 -66
- data/spec/integration/steering_spec.rb +0 -57
- data/spec/redis/version_spec.rb +0 -13
- data/spec/routing/router_spec.rb +0 -24
@@ -1,17 +1,14 @@
|
|
1
1
|
# Capybara
|
2
2
|
|
3
|
-
[Capybara](https://github.com/teamcapybara/capybara) is
|
4
|
-
|
5
|
-
|
6
|
-
When Capybara is in use, a remote browser process is available as a Capybara
|
7
|
-
session:
|
3
|
+
[Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
|
4
|
+
applications which adds a nice API that also works well for web scraping.
|
8
5
|
|
9
6
|
```ruby
|
10
|
-
Wayfarer.config
|
11
|
-
# Wayfarer.config
|
7
|
+
Wayfarer.config[:network][:agent] = :capybara
|
8
|
+
# Wayfarer.config[:capybara][:driver] = ...
|
12
9
|
|
13
10
|
class DummyJob < Wayfarer::Worker
|
14
|
-
route
|
11
|
+
route.to :index
|
15
12
|
|
16
13
|
def index
|
17
14
|
browser # => #<Capybara::Session ...>
|
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
|
|
19
16
|
end
|
20
17
|
```
|
21
18
|
|
19
|
+
## Example: Automating Chrome with Cuprite and Ferrum
|
22
20
|
|
23
|
-
|
24
|
-
|
25
|
-
1. Install the Capybara driver for the desired user agent.
|
26
|
-
|
27
|
-
For example, to automate Google Chrome with
|
28
|
-
[Ferrum](https://github.com/rubycdp/ferrum), install the
|
29
|
-
[Cuprite](https://github.com/rubycdp/cuprite) driver:
|
21
|
+
1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
|
30
22
|
|
31
23
|
=== "RubyGems"
|
32
24
|
|
@@ -34,20 +26,19 @@ end
|
|
34
26
|
gem install cuprite
|
35
27
|
```
|
36
28
|
|
37
|
-
=== "
|
29
|
+
=== "Gemfile"
|
38
30
|
|
39
31
|
```ruby
|
40
32
|
gem "cuprite" # Gemfile
|
41
33
|
```
|
42
34
|
|
43
|
-
2. Configure Wayfarer to use the `:capybara` user agent and set the
|
44
|
-
driver:
|
35
|
+
2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
|
45
36
|
|
46
37
|
=== "Runtime"
|
47
38
|
|
48
39
|
```ruby
|
49
|
-
Wayfarer.config
|
50
|
-
Wayfarer.config
|
40
|
+
Wayfarer.config[:network][:agent] = :capybara
|
41
|
+
Wayfarer.config[:capybara][:driver] = :cuprite
|
51
42
|
```
|
52
43
|
|
53
44
|
=== "Environment variables"
|
@@ -57,7 +48,7 @@ end
|
|
57
48
|
WAYFARER_CAPYBARA_DRIVER=cuprite
|
58
49
|
```
|
59
50
|
|
60
|
-
3. Register the driver:
|
51
|
+
3. Register the driver with Capybara:
|
61
52
|
|
62
53
|
```ruby
|
63
54
|
require "capybara/cuprite"
|
@@ -66,6 +57,6 @@ end
|
|
66
57
|
|
67
58
|
Capybara.register_driver(:cuprite) do |app|
|
68
59
|
# Wayfarer's Ferrum or Selenium options can be passed along
|
69
|
-
Capybara::Cuprite::Driver.new(app, Wayfarer.config
|
60
|
+
Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
|
70
61
|
end
|
71
62
|
```
|
@@ -1,18 +1,66 @@
|
|
1
|
-
#
|
1
|
+
# User agent API
|
2
2
|
|
3
|
-
Wayfarer
|
4
|
-
|
3
|
+
Wayfarer retrieves web pages with user agents. There are two types of user
|
4
|
+
agents: __stateful__ browsers which carry state and follow redirects implicitly,
|
5
|
+
and __stateless__ HTTP clients, which handle redirects explicitly.
|
5
6
|
|
6
|
-
|
7
|
+
Because spawning browser processes or instantiating HTTP clients is expensive,
|
8
|
+
Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
|
9
|
+
irrecoverable errors are individual user agents destroyed and recreated. For example,
|
10
|
+
when a browser process crashes, it is replaced with a new one and checked back
|
11
|
+
into the pool. The next job that checks out the user agent gets a fresh
|
12
|
+
browser process.
|
7
13
|
|
8
|
-
|
9
|
-
These follow HTTP redirects implicitly.
|
10
|
-
2. Stateless agents, which deal with HTTP requests/responses only.
|
11
|
-
These handle HTTP redirects explicitly.
|
14
|
+
## Base interface for custom user agents
|
12
15
|
|
13
|
-
|
16
|
+
You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
|
17
|
+
module and defining callback methods. The interfaces for stateful and stateless agents
|
18
|
+
share the following instance methods:
|
14
19
|
|
15
|
-
|
20
|
+
* `#create` (__required__): Called when a new instance (browser process or HTTP client) is
|
21
|
+
needed.
|
22
|
+
* `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
|
23
|
+
processes should be quit, and HTTP clients should be freed.
|
24
|
+
* `#renew_on` (optional): Returns a list of exception classes upon which the existing
|
25
|
+
instance gets destroyed and replaced with a newly created one.
|
26
|
+
|
27
|
+
## Stateless interface
|
28
|
+
|
29
|
+
The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how
|
30
|
+
Wayfarer provides redirect handling out of the box, as there is a configurable limit
|
31
|
+
on the number of redirects to follow.
|
32
|
+
|
33
|
+
In addition to the base interface, stateless user agents implement `#fetch`
|
34
|
+
which fetches [pages](../pages) or indicates redirects:
|
35
|
+
|
36
|
+
* `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
|
37
|
+
3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
|
38
|
+
deals with redirects on your behalf to avoid redirect loops. All other status
|
39
|
+
codes, including 4xx and 5xx, are considered successful and are indicated by calling
|
40
|
+
`success(url:, body:, status_code:, headers:)`.
|
41
|
+
|
42
|
+
## Stateful interface
|
43
|
+
|
44
|
+
In addition to the base interface, stateful user agents implement two additional
|
45
|
+
methods:
|
46
|
+
|
47
|
+
* `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
|
48
|
+
Stateful user agents follow redirects implicitly.
|
49
|
+
* `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
|
50
|
+
into a [page](../pages).
|
51
|
+
|
52
|
+
## Recreating user agents on error with `#renew_on`
|
53
|
+
|
54
|
+
Agents can optionally implement `#renew_on` to get themselves recreated on
|
55
|
+
certain errors.
|
56
|
+
|
57
|
+
If `#fetch` or `#navigate` raises an exception and the exception class is listed
|
58
|
+
in `#renew_on`, the instance is destroyed and recreated.
|
59
|
+
|
60
|
+
* `#renew_on` (optional): A list of exception classes upon which the existing instance gets
|
61
|
+
destroyed and replaced with a newly created one.
|
62
|
+
|
63
|
+
## Example implementations
|
16
64
|
|
17
65
|
=== "Stateful"
|
18
66
|
|
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
|
|
20
68
|
class StatefulAgent
|
21
69
|
include Wayfarer::Networking::Strategy
|
22
70
|
|
23
|
-
|
24
|
-
[MyBrowser::IrrecoverableError]
|
25
|
-
end
|
71
|
+
# Required methods
|
26
72
|
|
27
73
|
def create
|
28
74
|
MyBrowser.new
|
29
75
|
end
|
30
76
|
|
31
|
-
def destroy(browser) # optional
|
32
|
-
browser.quit
|
33
|
-
end
|
34
|
-
|
35
77
|
def navigate(browser, url)
|
36
78
|
browser.goto(url)
|
37
79
|
end
|
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
|
|
42
84
|
status_code: browser.status_code,
|
43
85
|
headers: browser.headers)
|
44
86
|
end
|
87
|
+
|
88
|
+
# Optional methods
|
89
|
+
|
90
|
+
def destroy(browser)
|
91
|
+
browser.quit
|
92
|
+
end
|
93
|
+
|
94
|
+
def renew_on
|
95
|
+
[MyBrowser::IrrecoverableError]
|
96
|
+
end
|
45
97
|
end
|
46
98
|
```
|
47
99
|
|
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
|
|
51
103
|
class StatelessAgent
|
52
104
|
include Wayfarer::Networking::Strategy
|
53
105
|
|
54
|
-
|
55
|
-
[MyClient::IrrecoverableError]
|
56
|
-
end
|
106
|
+
# Required methods
|
57
107
|
|
58
108
|
def create
|
59
109
|
MyClient.new
|
60
110
|
end
|
61
111
|
|
62
|
-
def destroy(client) # optional
|
63
|
-
client.close
|
64
|
-
end
|
65
|
-
|
66
112
|
def fetch(client, url)
|
67
113
|
response = client.get(url)
|
68
114
|
|
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
|
|
73
119
|
status_code: response.status_code,
|
74
120
|
headers: response.headers)
|
75
121
|
end
|
122
|
+
|
123
|
+
# Optional methods
|
124
|
+
|
125
|
+
def destroy(client)
|
126
|
+
client.close
|
127
|
+
end
|
128
|
+
|
129
|
+
def renew_on # optional
|
130
|
+
[MyClient::IrrecoverableError]
|
131
|
+
end
|
76
132
|
end
|
77
133
|
```
|
78
134
|
|
79
135
|
|
80
|
-
Register the strategy:
|
136
|
+
Register and use the strategy:
|
81
137
|
|
82
138
|
```ruby
|
83
139
|
Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
|
140
|
+
Wayfarer.config[:network][:agent] = :my_agent
|
84
141
|
```
|
85
|
-
|
86
|
-
Use the strategy:
|
87
|
-
|
88
|
-
```ruby
|
89
|
-
Wayfarer.config.network.agent = :my_agent
|
90
|
-
```
|
91
|
-
|
92
|
-
### Remarks
|
93
|
-
|
94
|
-
#### Self-healing
|
95
|
-
|
96
|
-
* A strategy's `#renew_on` method may return a list of exception classes upon
|
97
|
-
which the existing instance gets destroyed and replaced with a newly created
|
98
|
-
one.
|
99
|
-
* Stateless clients must not raise exceptions when encountering certain HTTP
|
100
|
-
response codes (for example, 5xx).
|
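The renewal behaviour described in the custom adapters guide can be illustrated without Wayfarer itself. The following is a minimal sketch, not Wayfarer's pool implementation: `FlakyBrowser`, `SketchAgent` and `SketchSlot` are hypothetical names, and re-raising the error after renewal is an assumption made for this example. It only shows the idea that an exception listed in `#renew_on` causes the instance to be destroyed and replaced.

```ruby
# Hypothetical stand-in for a browser whose first instance always crashes.
class FlakyBrowser
  class Crashed < StandardError; end

  @created = 0
  class << self
    attr_accessor :created
  end

  attr_reader :id, :quit_called

  def initialize
    self.class.created += 1
    @id = self.class.created
    @quit_called = false
  end

  def goto(url)
    raise Crashed, "browser #{id} crashed" if id == 1

    "visited #{url} with browser #{id}"
  end

  def quit
    @quit_called = true
  end
end

# Agent exposing the callbacks named in the guide.
class SketchAgent
  def create
    FlakyBrowser.new
  end

  def destroy(browser)
    browser.quit
  end

  def renew_on
    [FlakyBrowser::Crashed]
  end

  def navigate(browser, url)
    browser.goto(url)
  end
end

# Hypothetical holder for one instance. If the block raises an exception
# listed in #renew_on, the instance is destroyed and replaced, and the
# error is re-raised (an assumption of this sketch).
class SketchSlot
  attr_reader :instance

  def initialize(agent)
    @agent = agent
    @instance = agent.create
  end

  def use
    yield @instance
  rescue *@agent.renew_on
    @agent.destroy(@instance)
    @instance = @agent.create
    raise
  end
end

agent = SketchAgent.new
slot = SketchSlot.new(agent)
first = slot.instance

begin
  slot.use { |browser| agent.navigate(browser, "https://example.com") }
rescue FlakyBrowser::Crashed
  # The first attempt fails; the slot now holds a fresh instance.
end

result = slot.use { |browser| agent.navigate(browser, "https://example.com") }
```

The second call to `use` sees a newly created browser, which mirrors the guide's statement that the next job checking out the user agent gets a fresh process.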
@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
|
|
11
11
|
so:
|
12
12
|
|
13
13
|
```ruby
|
14
|
-
Wayfarer.config
|
14
|
+
Wayfarer.config[:network][:agent] = :ferrum
|
15
15
|
|
16
16
|
class DummyWorker < Wayfarer::Worker
|
17
|
-
route
|
17
|
+
route.to :index
|
18
18
|
|
19
19
|
def index
|
20
20
|
browser # => #<Ferrum::Browser ...>
|
@@ -27,8 +27,8 @@ end
|
|
27
27
|
=== "Runtime"
|
28
28
|
|
29
29
|
```ruby
|
30
|
-
Wayfarer.config
|
31
|
-
Wayfarer.config
|
30
|
+
Wayfarer.config[:network][:agent] = :ferrum
|
31
|
+
Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
|
32
32
|
```
|
33
33
|
|
34
34
|
=== "Environment variables"
|
@@ -1,33 +1,29 @@
|
|
1
1
|
# Plain HTTP
|
2
2
|
|
3
|
-
Wayfarer can retrieve pages via plain HTTP requests
|
4
|
-
browsers.
|
3
|
+
Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
|
4
|
+
also alongside automated browsers.
|
5
5
|
|
6
|
-
##
|
6
|
+
## Ad-hoc GET requests
|
7
7
|
|
8
|
-
|
9
|
-
|
10
|
-
## Ad-hoc requests
|
11
|
-
|
12
|
-
When automating browsers, it can be useful to additionally retrieve the page
|
8
|
+
When automating browsers, it can be useful to additionally retrieve another page
|
13
9
|
over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
|
14
10
|
|
15
11
|
```ruby
|
16
12
|
class DummyJob < Wayfarer::Base
|
17
|
-
route
|
13
|
+
route.to :index
|
18
14
|
|
19
15
|
def index
|
20
|
-
http.fetch(
|
16
|
+
http.fetch("https://example.com") # => #<Wayfarer::Page ...>
|
21
17
|
end
|
22
18
|
end
|
23
19
|
```
|
24
20
|
|
25
|
-
By default, 3 redirects are followed, and this can be configured by
|
26
|
-
`follow` keyword:
|
21
|
+
By default, 3 redirects are followed, and this number can be configured by
|
22
|
+
passing the `follow` keyword:
|
27
23
|
|
28
24
|
```ruby
|
29
25
|
http.fetch(url, follow: 5)
|
30
26
|
```
|
31
27
|
|
32
|
-
|
28
|
+
When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
|
33
29
|
raised.
|
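To make the `follow` limit concrete, here is a small self-contained sketch of bounded redirect following. It does not use Wayfarer: `fetch_once`, `fetch_following`, `Redirect`, `Success`, `RedirectsExhausted` and the `REDIRECTS` table are all hypothetical stand-ins, and the real adapter may count redirects differently.

```ruby
# Hypothetical error, standing in for Wayfarer::Networking::RedirectsExhaustedError.
class RedirectsExhausted < StandardError; end

# Hypothetical result types of a single stateless fetch.
Redirect = Struct.new(:url)
Success  = Struct.new(:url, :status_code)

# Fake network: maps a URL to the URL it redirects to.
REDIRECTS = {
  "http://a.test" => "http://b.test",
  "http://b.test" => "http://c.test"
}.freeze

# Performs one request and reports either a redirect or a success.
def fetch_once(url)
  target = REDIRECTS[url]
  target ? Redirect.new(target) : Success.new(url, 200)
end

# Follows at most `follow` redirects, then raises.
def fetch_following(url, follow: 3)
  result = fetch_once(url)

  follow.times do
    return result unless result.is_a?(Redirect)

    result = fetch_once(result.url)
  end

  raise RedirectsExhausted, "gave up at #{result.url}" if result.is_a?(Redirect)

  result
end

page = fetch_following("http://a.test")

exhausted = begin
  fetch_following("http://a.test", follow: 1)
  false
rescue RedirectsExhausted
  true
end
```

With the default of three, the two-hop chain resolves to `http://c.test`; with `follow: 1` the second redirect exceeds the limit and the error is raised.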
@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
|
|
7
7
|
so:
|
8
8
|
|
9
9
|
```ruby
|
10
|
-
Wayfarer.config
|
10
|
+
Wayfarer.config[:network][:agent] = :selenium
|
11
11
|
|
12
12
|
class DummyWorker < Wayfarer::Worker
|
13
|
-
route
|
13
|
+
route.to :index
|
14
14
|
|
15
15
|
def index
|
16
16
|
browser # => #<Selenium::WebDriver ...>
|
@@ -27,10 +27,10 @@ process.
|
|
27
27
|
Pages retrieved with a Selenium WebDriver return fake values:
|
28
28
|
|
29
29
|
```ruby
|
30
|
-
Wayfarer.config
|
30
|
+
Wayfarer.config[:network][:agent] = :selenium
|
31
31
|
|
32
32
|
class DummyJob < Wayfarer::Base
|
33
|
-
route
|
33
|
+
route.to :index
|
34
34
|
|
35
35
|
def index
|
36
36
|
page.headers # => always {}
|
@@ -39,19 +39,18 @@ process.
|
|
39
39
|
end
|
40
40
|
```
|
41
41
|
|
42
|
-
!!! note "Consider using [Ferrum](../ferrum) instead"
|
43
|
-
Ferrum
|
44
|
-
|
45
|
-
different browser is required, consider using Ferrum instead of Selenium.
|
42
|
+
!!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
|
43
|
+
Use Ferrum if you want to automate Google Chrome. It provides superior
|
44
|
+
stability and a richer feature set compared to Selenium drivers.
|
46
45
|
|
47
46
|
## Configuring Selenium
|
48
47
|
|
49
48
|
=== "Runtime"
|
50
49
|
|
51
50
|
```ruby
|
52
|
-
Wayfarer.config
|
53
|
-
Wayfarer.config
|
54
|
-
Wayfarer.config
|
51
|
+
Wayfarer.config[:network][:agent] = :selenium
|
52
|
+
Wayfarer.config[:selenium][:driver] = :firefox
|
53
|
+
Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
|
55
54
|
```
|
56
55
|
|
57
56
|
=== "Environment variables"
|
data/docs/guides/pages.md
CHANGED
@@ -1,11 +1,14 @@
|
|
1
1
|
# Pages
|
2
2
|
|
3
|
-
|
4
|
-
|
3
|
+
A page is the immutable state of the contents behind a URL at a point in time,
|
4
|
+
retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
|
5
|
+
response, or the state of a remotely controlled browser.
|
5
6
|
|
6
7
|
```ruby
|
7
|
-
class DummyJob <
|
8
|
-
|
8
|
+
class DummyJob < ActiveJob::Base
|
9
|
+
include Wayfarer::Base
|
10
|
+
|
11
|
+
route.to :index
|
9
12
|
|
10
13
|
def index
|
11
14
|
page # => #<Wayfarer::Page ...>
|
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
|
|
13
16
|
page.url # => "https://example.com"
|
14
17
|
page.body # => "<html>..."
|
15
18
|
page.status_code # => 200
|
16
|
-
page.headers # => { "
|
19
|
+
page.headers # => { "content-type" => ... }
|
20
|
+
page.mime_type # => #<MIME::Type: text/html>
|
21
|
+
|
22
|
+
# The lazily parsed response body or `nil`, depending on the Content-Type
|
23
|
+
page.doc # => #<Nokogiri::HTML::Document ...>
|
17
24
|
|
18
|
-
# A MetaInspector object for accessing page meta data.
|
19
25
|
# See: https://github.com/metainspector/metainspector
|
26
|
+
page.meta # => #<MetaInspector::Document ...>
|
20
27
|
# Examples:
|
21
28
|
page.meta.links.internal
|
22
29
|
page.meta.images.favicon
|
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
|
|
26
33
|
end
|
27
34
|
```
|
28
35
|
|
36
|
+
!!! info "HTTP headers are downcased and case-sensitive"
|
37
|
+
|
38
|
+
HTTP headers are downcased, so you would access
|
39
|
+
`page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
|
40
|
+
|
41
|
+
## Response body parsing
|
42
|
+
|
43
|
+
Wayfarer parses the bodies of HTML, XML and JSON responses according to their
|
44
|
+
MIME types:
|
45
|
+
|
46
|
+
* `text/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
|
47
|
+
* `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
|
48
|
+
* `application/json` to `Hash`
|
49
|
+
|
29
50
|
## Live pages
|
30
51
|
|
31
|
-
|
32
|
-
|
52
|
+
`#!ruby page` initially returns a snapshot of the browser state
|
53
|
+
immediately after the user agent navigated to the URL. The browser state may
|
54
|
+
change significantly after the page was retrieved, for example due to your own
|
55
|
+
interaction, or client-side JavaScript altering the DOM or URL.
|
33
56
|
|
34
|
-
To
|
57
|
+
To get a page that reflects the current browser state, set the `#!ruby :live`
|
58
|
+
keyword:
|
35
59
|
|
36
60
|
```ruby
|
37
61
|
class DummyJob < Wayfarer::Worker
|
38
|
-
route
|
62
|
+
route.to :index
|
39
63
|
|
40
64
|
def index
|
41
65
|
page # => #<Wayfarer::Page ...>
|
42
66
|
|
67
|
+
# Fill in forms, click buttons, etc.
|
68
|
+
|
43
69
|
# Replaces the current Page object with a newer one,
|
44
70
|
# taking into account the DOM as currently rendered by the browser.
|
45
71
|
# Effectful only when automating browsers, no-op when using plain
|
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
|
|
50
76
|
end
|
51
77
|
end
|
52
78
|
```
|
79
|
+
|
80
|
+
!!! attention "Stateless user agents ignore `#!ruby :live`"
|
81
|
+
|
82
|
+
The `#!ruby :live` option is ignored by stateless user agents, such as the
|
83
|
+
default `#!ruby :http` user agent. Instead, stateless user agents always
|
84
|
+
return the same page object.
|
85
|
+
|
86
|
+
### Implementing a custom response body parser
|
87
|
+
|
88
|
+
You can register an object that implements a `#parse` method for any MIME type:
|
89
|
+
|
90
|
+
```ruby
|
91
|
+
class MyJPEGParser
|
92
|
+
def parse(body)
|
93
|
+
# Read EXIF metadata here.
|
94
|
+
# Return value is accessible as `page.doc`
|
95
|
+
end
|
96
|
+
end
|
97
|
+
|
98
|
+
Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
|
99
|
+
```
|
100
|
+
|
101
|
+
!!! info "Handling responses without a Content-Type"
|
102
|
+
|
103
|
+
If a response has no `Content-Type` header, Wayfarer falls back to
|
104
|
+
`application/octet-stream`. A parser registered for
|
105
|
+
`application/octet-stream` will hence also handle all responses without
|
106
|
+
a Content-Type.
|
107
|
+
|
108
|
+
## Accessing page metadata with MetaInspector
|
109
|
+
|
110
|
+
You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
|
111
|
+
document for accessing metadata of HTML pages. For example, to stage all links
|
112
|
+
internal to the current hostname:
|
113
|
+
|
114
|
+
```ruby
|
115
|
+
def index
|
116
|
+
stage page.meta.links.internal
|
117
|
+
end
|
118
|
+
```
|
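The parser contract from the pages guide (any object with a `#parse` method, looked up by MIME type) can be tried out in isolation. The sketch below is an assumption-laden stand-in: `LineParser`, `REGISTRY` and `parse_body` are hypothetical, a plain `Hash` takes the place of `Wayfarer::Parsing.registry`, and the way the Content-Type header is reduced to a MIME type is a simplification.

```ruby
# Hypothetical parser: any object responding to #parse(body) qualifies.
class LineParser
  def parse(body)
    body.lines.map(&:chomp).reject(&:empty?)
  end
end

# Plain Hash standing in for Wayfarer::Parsing.registry in this sketch.
REGISTRY = { "text/plain" => LineParser.new }

# Hypothetical lookup: falls back to application/octet-stream when there is
# no Content-Type, and returns nil when no parser is registered.
def parse_body(body, content_type)
  mime = (content_type || "application/octet-stream").split(";").first.strip.downcase
  parser = REGISTRY[mime]
  parser&.parse(body)
end

lines = parse_body("first\n\nsecond\n", "text/plain; charset=utf-8")
missing = parse_body("raw bytes", nil)
```

Registering a parser under `application/octet-stream` in `REGISTRY` would make the second call return a parsed value too, which is the fallback the guide describes.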
@@ -0,0 +1,74 @@
|
|
1
|
+
# Routing
|
2
|
+
|
3
|
+
Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
|
4
|
+
either instance methods denoted by symbols, or [handlers](/guides/handlers).
|
5
|
+
A job's route declarations equate to a predicate tree.
|
6
|
+
When a URL is routed, the predicate tree is searched depth-first. If a
|
7
|
+
matching leaf predicate is found, the found path's action is dispatched,
|
8
|
+
along with `params` collected from path parameters.
|
9
|
+
|
10
|
+
The following routes:
|
11
|
+
|
12
|
+
```ruby
|
13
|
+
route.host "example.com", scheme: :https do
|
14
|
+
path "/contact", to: :contact
|
15
|
+
path "/users/:id", to: [UserHandler, :show]
|
16
|
+
end
|
17
|
+
```
|
18
|
+
|
19
|
+
Equate to the following predicate tree:
|
20
|
+
|
21
|
+
```mermaid
|
22
|
+
flowchart LR
|
23
|
+
RootRoute-->Host["Host <code>example.com</code>"]
|
24
|
+
Host-->Scheme["Scheme <code>:https</code>"]
|
25
|
+
Scheme-->Path1["Path <code>/contact</code>"]
|
26
|
+
Scheme-->Path2["Path <code>/users/:id</code>"]
|
27
|
+
Path1-->TargetRoute1["Target <code>:contact</code>"]
|
28
|
+
Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
|
29
|
+
```
|
30
|
+
|
31
|
+
An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
|
32
|
+
|
33
|
+
```mermaid
|
34
|
+
flowchart LR
|
35
|
+
RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
|
36
|
+
Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
|
37
|
+
Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
|
38
|
+
Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
|
39
|
+
Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
|
40
|
+
Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
|
41
|
+
classDef active fill:#7CB342,stroke:#7CB342,color:#fff
|
42
|
+
classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
|
43
|
+
classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
|
44
|
+
```
|
45
|
+
|
46
|
+
You can also visualise an invocation of the predicate tree on the command line
|
47
|
+
with `wayfarer tree`:
|
48
|
+
|
49
|
+
```
|
50
|
+
wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
|
51
|
+
Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
|
52
|
+
└──Host("example.com", match: true)
|
53
|
+
└──Scheme(:https, match: true)
|
54
|
+
├──Path("/contact", match: false)
|
55
|
+
│ └──Target(match: true)
|
56
|
+
└──Path("/users/:id", match: true)
|
57
|
+
└──Target(match: true)
|
58
|
+
└──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
|
59
|
+
```
|
60
|
+
|
61
|
+
As you can see, `Target` nodes always match. This means that we could have also defined
|
62
|
+
our routes as:
|
63
|
+
|
64
|
+
```ruby
|
65
|
+
route.host "example.com", scheme: :https do
|
66
|
+
to :contact do
|
67
|
+
path "/contact"
|
68
|
+
end
|
69
|
+
|
70
|
+
to [UserHandler, :show] do
|
71
|
+
path "/users/:id"
|
72
|
+
end
|
73
|
+
end
|
74
|
+
```
|
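The depth-first search over a predicate tree described in the routing guide can be sketched in a few lines. This is not Wayfarer's router: `Node`, the `node` helper and the symbol `:UserHandler` are hypothetical, path parameters are left out, and the predicates are plain lambdas over a stdlib `URI`.

```ruby
require "uri"

# Hypothetical tree node: a predicate over a URI, child nodes, and for
# leaves the action to dispatch.
Node = Struct.new(:name, :predicate, :children, :action) do
  # Depth-first search; returns the first matching leaf's action or nil.
  def route(uri)
    return nil unless predicate.call(uri)
    return action if children.empty?

    children.each do |child|
      found = child.route(uri)
      return found if found
    end

    nil
  end
end

# Small helper to build nodes with a block as the predicate.
def node(name, children = [], action: nil, &predicate)
  Node.new(name, predicate, children, action)
end

contact = node("Path /contact", action: :contact) { |u| u.path == "/contact" }
users = node("Path /users/:id", action: [:UserHandler, :show]) do |u|
  u.path.match?(%r{\A/users/[^/]+\z})
end
scheme = node("Scheme https", [contact, users]) { |u| u.scheme == "https" }
host = node("Host example.com", [scheme]) { |u| u.host == "example.com" }
root = node("Root", [host]) { |_u| true }

action = root.route(URI("https://example.com/users/42"))
```

A branch is abandoned as soon as its predicate fails, so `http://example.com/contact` never reaches the path nodes because the scheme predicate rejects it first.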
data/docs/guides/tasks.md
CHANGED
@@ -1,14 +1,38 @@
|
|
1
1
|
# Tasks
|
2
2
|
|
3
|
-
Tasks are the immutable units of work
|
4
|
-
consists of:
|
3
|
+
Tasks are the immutable units of work read from a message queue and processed by
|
4
|
+
[jobs](/guides/jobs). A task consists of two strings:
|
5
5
|
|
6
|
-
|
7
|
-
|
6
|
+
* The __URL__ to process
|
7
|
+
* The __batch__ the task belongs to
|
8
8
|
|
9
|
-
|
10
|
-
* Like URLs, batches are strings.
|
9
|
+
A job processing a task commonly appends more tasks to the queue in turn.
|
11
10
|
|
12
|
-
|
13
|
-
|
14
|
-
|
11
|
+
!!! info "Task URLs are not normalized"
|
12
|
+
|
13
|
+
The URL returned by `task.url` is not normalized but verbatim
|
14
|
+
as it was staged or enqueued.
|
15
|
+
|
16
|
+
## Task deduplication
|
17
|
+
|
18
|
+
Wayfarer ensures that no URL gets processed twice within a batch. It achieves
|
19
|
+
this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
|
20
|
+
keyed by normalized URLs.
|
21
|
+
|
22
|
+
### URL normalization
|
23
|
+
|
24
|
+
Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
|
25
|
+
and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
|
26
|
+
|
27
|
+
URL normalization is used only for deduplication, and does not affect the URL
|
28
|
+
returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
|
29
|
+
enqueued. This allows you to follow the exact URLs you may have parsed from a
|
30
|
+
response body.
|
31
|
+
|
32
|
+
## Invalid URLs
|
33
|
+
|
34
|
+
Tasks with invalid URLs (for example `ht%0atp://localhost/`, a newline in the
|
35
|
+
protocol) are discarded, since they can't get retrieved. No exception is raised,
|
36
|
+
and the job is considered successfully processed, since there are no corrective
|
37
|
+
actions an error handler could take as tasks are immutable, and retries would
|
38
|
+
not change the outcome.
|
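The per-batch deduplication described in the tasks guide can be approximated in memory. The sketch below rests on several assumptions: it uses a `Set` per batch instead of a Redis hash, a deliberately simplified `normalize` built on the standard library instead of `normalize_url` and Addressable, and hypothetical names (`BatchDedup`, `first_visit?`). It only illustrates that normalized URLs serve as keys and that invalid URLs are dropped silently.

```ruby
require "set"
require "uri"

# Simplified normalization for this sketch only: lowercases scheme and host,
# drops the fragment, and ensures a non-empty path. Returns nil for URLs that
# are invalid or not HTTP(S).
def normalize(url)
  uri = URI.parse(url)
  return nil unless uri.is_a?(URI::HTTP) && uri.host

  uri.scheme = uri.scheme.downcase
  uri.host = uri.host.downcase
  uri.fragment = nil
  uri.path = "/" if uri.path.empty?
  uri.to_s
rescue URI::InvalidURIError
  nil
end

# Hypothetical in-memory replacement for the Redis-backed structure.
class BatchDedup
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns true the first time a normalized URL is seen within a batch.
  def first_visit?(batch, url)
    key = normalize(url)
    return false if key.nil? # invalid URLs are discarded without raising

    !@seen[batch].add?(key).nil?
  end
end

dedup = BatchDedup.new
first = dedup.first_visit?("batch-1", "https://Example.com/a#top")
repeat = dedup.first_visit?("batch-1", "https://example.com/a")
other_batch = dedup.first_visit?("batch-2", "https://example.com/a")
invalid = dedup.first_visit?("batch-1", "ht%0atp://localhost/")
```

Note that only the lookup key is normalized; the URL handed to the job stays verbatim, as the guide states for `task.url`.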