wayfarer 0.4.5 → 0.4.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.github/workflows/lint.yaml +25 -0
- data/.github/workflows/release.yaml +29 -0
- data/.github/workflows/tests.yaml +30 -0
- data/.gitignore +4 -0
- data/.rubocop.yml +5 -0
- data/.vale.ini +5 -0
- data/.yardopts +1 -3
- data/Dockerfile +5 -4
- data/Gemfile +3 -0
- data/Gemfile.lock +107 -102
- data/Rakefile +5 -56
- data/bin/wayfarer +1 -1
- data/docker-compose.yml +20 -9
- data/docs/cookbook/consent_screen.md +2 -2
- data/docs/cookbook/executing_javascript.md +3 -3
- data/docs/cookbook/navigation.md +12 -12
- data/docs/cookbook/querying_html.md +3 -3
- data/docs/cookbook/screenshots.md +2 -2
- data/docs/cookbook/user_agent.md +1 -1
- data/docs/design.md +36 -0
- data/docs/guides/callbacks.md +24 -126
- data/docs/guides/configuration.md +8 -8
- data/docs/guides/handlers.md +60 -0
- data/docs/guides/index.md +1 -0
- data/docs/guides/jobs/error_handling.md +40 -0
- data/docs/guides/jobs.md +99 -31
- data/docs/guides/navigation.md +1 -1
- data/docs/guides/networking/capybara.md +13 -22
- data/docs/guides/networking/custom_adapters.md +82 -41
- data/docs/guides/networking/ferrum.md +4 -4
- data/docs/guides/networking/http.md +9 -13
- data/docs/guides/networking/selenium.md +10 -11
- data/docs/guides/pages.md +76 -10
- data/docs/guides/redis.md +10 -0
- data/docs/guides/routing.md +74 -0
- data/docs/guides/tasks.md +33 -9
- data/docs/guides/tutorial.md +60 -0
- data/docs/guides/user_agents.md +113 -0
- data/docs/index.md +17 -40
- data/docs/reference/cli.md +35 -25
- data/docs/reference/configuration.md +36 -0
- data/lib/wayfarer/base.rb +124 -46
- data/lib/wayfarer/batch_completion.rb +56 -0
- data/lib/wayfarer/callbacks.rb +22 -48
- data/lib/wayfarer/cli/route_printer.rb +71 -57
- data/lib/wayfarer/cli.rb +121 -0
- data/lib/wayfarer/gc.rb +13 -6
- data/lib/wayfarer/handler.rb +15 -7
- data/lib/wayfarer/logging.rb +38 -0
- data/lib/wayfarer/middleware/base.rb +2 -0
- data/lib/wayfarer/middleware/batch_completion.rb +19 -0
- data/lib/wayfarer/middleware/content_type.rb +54 -0
- data/lib/wayfarer/middleware/controller.rb +19 -15
- data/lib/wayfarer/middleware/dedup.rb +16 -13
- data/lib/wayfarer/middleware/dispatch.rb +12 -4
- data/lib/wayfarer/middleware/normalize.rb +12 -11
- data/lib/wayfarer/middleware/redis.rb +15 -0
- data/lib/wayfarer/middleware/router.rb +33 -35
- data/lib/wayfarer/middleware/stage.rb +5 -5
- data/lib/wayfarer/middleware/uri_parser.rb +30 -0
- data/lib/wayfarer/middleware/user_agent.rb +49 -0
- data/lib/wayfarer/networking/capybara.rb +1 -1
- data/lib/wayfarer/networking/context.rb +2 -2
- data/lib/wayfarer/networking/ferrum.rb +2 -2
- data/lib/wayfarer/networking/follow.rb +12 -6
- data/lib/wayfarer/networking/http.rb +1 -1
- data/lib/wayfarer/networking/pool.rb +17 -12
- data/lib/wayfarer/networking/selenium.rb +3 -3
- data/lib/wayfarer/networking/strategy.rb +2 -2
- data/lib/wayfarer/page.rb +36 -14
- data/lib/wayfarer/parsing/xml.rb +6 -6
- data/lib/wayfarer/parsing.rb +24 -0
- data/lib/wayfarer/redis/barrier.rb +13 -21
- data/lib/wayfarer/redis/counter.rb +19 -9
- data/lib/wayfarer/redis/pool.rb +1 -1
- data/lib/wayfarer/redis/resettable.rb +19 -0
- data/lib/wayfarer/routing/dsl.rb +1 -0
- data/lib/wayfarer/routing/matchers/path.rb +4 -2
- data/lib/wayfarer/routing/root_route.rb +5 -1
- data/lib/wayfarer/routing/route.rb +4 -14
- data/lib/wayfarer/stringify.rb +22 -30
- data/lib/wayfarer/task.rb +12 -18
- data/lib/wayfarer.rb +29 -2
- data/mkdocs.yml +52 -7
- data/rake/docs.rake +26 -0
- data/rake/lint.rake +105 -0
- data/rake/release.rake +29 -0
- data/rake/tests.rake +28 -0
- data/requirements.txt +1 -1
- data/spec/base_spec.rb +140 -160
- data/spec/batch_completion_spec.rb +104 -0
- data/spec/cli/job_spec.rb +19 -23
- data/spec/cli/routing_spec.rb +101 -0
- data/spec/cli/version_spec.rb +1 -1
- data/spec/factories/task.rb +7 -1
- data/spec/fixtures/dummy_job.rb +5 -3
- data/spec/gc_spec.rb +8 -50
- data/spec/handler_spec.rb +1 -1
- data/spec/integration/callbacks_spec.rb +157 -45
- data/spec/integration/content_type_spec.rb +145 -0
- data/spec/integration/gc_spec.rb +44 -0
- data/spec/integration/handler_spec.rb +66 -0
- data/spec/integration/page_spec.rb +44 -29
- data/spec/integration/params_spec.rb +33 -25
- data/spec/integration/parsing_spec.rb +125 -0
- data/spec/integration/routing_spec.rb +18 -0
- data/spec/integration/stage_spec.rb +27 -20
- data/spec/middleware/batch_completion_spec.rb +34 -0
- data/spec/middleware/chain_spec.rb +8 -8
- data/spec/middleware/content_type_spec.rb +86 -0
- data/spec/middleware/controller_spec.rb +5 -5
- data/spec/middleware/dedup_spec.rb +38 -55
- data/spec/middleware/dispatch_spec.rb +23 -7
- data/spec/middleware/normalize_spec.rb +44 -13
- data/spec/middleware/router_spec.rb +29 -30
- data/spec/middleware/stage_spec.rb +8 -8
- data/spec/middleware/uri_parser_spec.rb +53 -0
- data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
- data/spec/networking/context_spec.rb +17 -0
- data/spec/networking/follow_spec.rb +2 -2
- data/spec/networking/pool_spec.rb +5 -5
- data/spec/networking/strategy.rb +2 -2
- data/spec/page_spec.rb +42 -20
- data/spec/parsing/xml_spec.rb +11 -12
- data/spec/redis/barrier_spec.rb +8 -48
- data/spec/redis/counter_spec.rb +13 -1
- data/spec/redis/pool_spec.rb +1 -1
- data/spec/spec_helpers.rb +27 -16
- data/spec/support/test_app.rb +8 -0
- data/spec/task_spec.rb +3 -24
- data/spec/wayfarer_spec.rb +1 -1
- data/wayfarer.gemspec +4 -3
- metadata +61 -51
- data/.github/workflows/ci.yaml +0 -32
- data/docs/guides/error_handling.md +0 -31
- data/docs/guides/networking.md +0 -94
- data/docs/guides/performance.md +0 -130
- data/docs/guides/reliability.md +0 -41
- data/docs/guides/routing/steering.md +0 -30
- data/docs/reference/api/base.md +0 -48
- data/docs/reference/configuration_keys.md +0 -42
- data/docs/reference/environment_variables.md +0 -83
- data/lib/wayfarer/cli/base.rb +0 -45
- data/lib/wayfarer/cli/generate.rb +0 -17
- data/lib/wayfarer/cli/job.rb +0 -56
- data/lib/wayfarer/cli/route.rb +0 -29
- data/lib/wayfarer/cli/runner.rb +0 -34
- data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
- data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
- data/lib/wayfarer/config/capybara.rb +0 -10
- data/lib/wayfarer/config/ferrum.rb +0 -11
- data/lib/wayfarer/config/networking.rb +0 -26
- data/lib/wayfarer/config/redis.rb +0 -14
- data/lib/wayfarer/config/root.rb +0 -11
- data/lib/wayfarer/config/selenium.rb +0 -21
- data/lib/wayfarer/config/strconv.rb +0 -45
- data/lib/wayfarer/config/struct.rb +0 -72
- data/lib/wayfarer/middleware/fetch.rb +0 -56
- data/lib/wayfarer/redis/connection.rb +0 -13
- data/lib/wayfarer/redis/version.rb +0 -19
- data/lib/wayfarer/routing/router.rb +0 -28
- data/spec/callbacks_spec.rb +0 -102
- data/spec/cli/generate_spec.rb +0 -39
- data/spec/config/capybara_spec.rb +0 -18
- data/spec/config/ferrum_spec.rb +0 -24
- data/spec/config/networking_spec.rb +0 -73
- data/spec/config/redis_spec.rb +0 -32
- data/spec/config/root_spec.rb +0 -31
- data/spec/config/selenium_spec.rb +0 -56
- data/spec/config/strconv_spec.rb +0 -58
- data/spec/config/struct_spec.rb +0 -66
- data/spec/integration/steering_spec.rb +0 -57
- data/spec/redis/version_spec.rb +0 -13
- data/spec/routing/router_spec.rb +0 -24
@@ -1,17 +1,14 @@
|
|
1
1
|
# Capybara
|
2
2
|
|
3
|
-
[Capybara](https://github.com/teamcapybara/capybara) is
|
4
|
-
|
5
|
-
|
6
|
-
When Capybara is in use, a remote browser process is available as a Capybara
|
7
|
-
session:
|
3
|
+
[Capybara](https://github.com/teamcapybara/capybara) is a test framework for web
|
4
|
+
applications which adds a nice API that also works well for web scraping.
|
8
5
|
|
9
6
|
```ruby
|
10
|
-
Wayfarer.config
|
11
|
-
# Wayfarer.config
|
7
|
+
Wayfarer.config[:network][:agent] = :capybara
|
8
|
+
# Wayfarer.config[:capybara][:driver] = ...
|
12
9
|
|
13
10
|
class DummyJob < Wayfarer::Worker
|
14
|
-
route
|
11
|
+
route.to :index
|
15
12
|
|
16
13
|
def index
|
17
14
|
browser # => #<Capybara::Session ...>
|
@@ -19,14 +16,9 @@ class DummyJob < Wayfarer::Worker
|
|
19
16
|
end
|
20
17
|
```
|
21
18
|
|
19
|
+
## Example: Automating Chrome with Cuprite and Ferrum
|
22
20
|
|
23
|
-
|
24
|
-
|
25
|
-
1. Install the Capybara driver for the desired user agent.
|
26
|
-
|
27
|
-
For example, to automate Google Chrome with
|
28
|
-
[Ferrum](https://github.com/rubycdp/ferrum), install the
|
29
|
-
[Cuprite](https://github.com/rubycdp/cuprite) driver:
|
21
|
+
1. Install the [Cuprite](https://github.com/rubycdp/cuprite) Capybara driver:
|
30
22
|
|
31
23
|
=== "RubyGems"
|
32
24
|
|
@@ -34,20 +26,19 @@ end
|
|
34
26
|
gem install cuprite
|
35
27
|
```
|
36
28
|
|
37
|
-
=== "
|
29
|
+
=== "Gemfile"
|
38
30
|
|
39
31
|
```ruby
|
40
32
|
gem "cuprite" # Gemfile
|
41
33
|
```
|
42
34
|
|
43
|
-
2. Configure Wayfarer to use the `:capybara` user agent and set the
|
44
|
-
driver:
|
35
|
+
2. Configure Wayfarer to use the `:capybara` user agent and set the driver:
|
45
36
|
|
46
37
|
=== "Runtime"
|
47
38
|
|
48
39
|
```ruby
|
49
|
-
Wayfarer.config
|
50
|
-
Wayfarer.config
|
40
|
+
Wayfarer.config[:network][:agent] = :capybara
|
41
|
+
Wayfarer.config[:capybara][:driver] = :cuprite
|
51
42
|
```
|
52
43
|
|
53
44
|
=== "Environment variables"
|
@@ -57,7 +48,7 @@ end
|
|
57
48
|
WAYFARER_CAPYBARA_DRIVER=cuprite
|
58
49
|
```
|
59
50
|
|
60
|
-
3. Register the driver:
|
51
|
+
3. Register the driver with Capybara:
|
61
52
|
|
62
53
|
```ruby
|
63
54
|
require "capybara/cuprite"
|
@@ -66,6 +57,6 @@ end
|
|
66
57
|
|
67
58
|
Capybara.register_driver(:cuprite) do |app|
|
68
59
|
# Wayfarer's Ferrum or Selenium options can be passed along
|
69
|
-
Capybara::Cuprite::Driver.new(app, Wayfarer.config
|
60
|
+
Capybara::Cuprite::Driver.new(app, Wayfarer.config[:ferrum][:options])
|
70
61
|
end
|
71
62
|
```
|
@@ -1,18 +1,66 @@
|
|
1
|
-
#
|
1
|
+
# User agent API
|
2
2
|
|
3
|
-
Wayfarer
|
4
|
-
|
3
|
+
Wayfarer retrieves web pages with user agents. There are two types of user
|
4
|
+
agents: __stateful__ browsers which carry state and follow redirects implicitly,
|
5
|
+
and __stateless__ HTTP clients, which handle redirects explicitly.
|
5
6
|
|
6
|
-
|
7
|
+
Because spawning browser processes or instantiating HTTP clients is expensive,
|
8
|
+
Wayfarer keeps user agents in a pool and reuses them across jobs. Only on certain
|
9
|
+
irrecoverable errors are individual user agents destroyed and recreated. For example,
|
10
|
+
when a browser process crashes, it is replaced with a new one and checked back
|
11
|
+
into the pool. The next job that checks out the user agent gets a fresh
|
12
|
+
browser process.
|
7
13
|
|
8
|
-
|
9
|
-
These follow HTTP redirects implicitly.
|
10
|
-
2. Stateless agents, which deal with HTTP requests/responses only.
|
11
|
-
These handle HTTP redirects explicitly.
|
14
|
+
## Base interface for custom user agents
|
12
15
|
|
13
|
-
|
16
|
+
You can implement both stateful and stateless agents by including the `Wayfarer::Networking::Strategy`
|
17
|
+
module and defining callback methods. The interfaces for stateful and stateless
|
18
|
+
agents share the following instance methods:
|
14
19
|
|
15
|
-
|
20
|
+
* `#create` (__required__): Called when a new instance (browser process or HTTP client) is
|
21
|
+
needed.
|
22
|
+
* `#destroy(instance)` (optional): Called when an instance should be destroyed. Browser
|
23
|
+
processes should be quit, and HTTP clients should be freed.
|
24
|
+
* `#renew_on` (optional): Returns a list of exception classes upon which the existing
|
25
|
+
instance gets destroyed and replaced with a newly created one.
|
26
|
+
|
27
|
+
## Stateless interface
|
28
|
+
|
29
|
+
The stateless interface indicates HTTP 3xx redirect responses explicitly. This is how
|
30
|
+
Wayfarer provides redirect handling out of the box, as there is a configurable limit
|
31
|
+
on the number of redirects to follow.
|
32
|
+
|
33
|
+
In addition to the base interface, stateless user agents implement `#fetch`
|
34
|
+
which fetches [pages](../pages) or indicates redirects:
|
35
|
+
|
36
|
+
* `#fetch(instance, url)` (__required__): Called to retrieve a URL. Responses with a
|
37
|
+
3xx status code must indicate the redirect URL by returning `redirect(url)`, since Wayfarer
|
38
|
+
deals with redirects on your behalf to avoid redirect loops. All other status
|
39
|
+
codes, including 4xx and 5xx, are considered successful and are indicated by calling
|
40
|
+
`success(url:, body:, status_code:, headers:)`.
|
41
|
+
|
42
|
+
## Stateful interface
|
43
|
+
|
44
|
+
In addition to the base interface, stateful user agents implement two additional
|
45
|
+
methods:
|
46
|
+
|
47
|
+
* `#navigate(instance, url)` (__required__): Navigates the user agent to the given URL.
|
48
|
+
Stateful user agents follow redirects implicitly.
|
49
|
+
* `#live(instance) -> Wayfarer::Page` (__required__): Turns the current user agent state
|
50
|
+
into a [page](../pages).
|
51
|
+
|
52
|
+
## Recreating user agents on error with `#renew_on`
|
53
|
+
|
54
|
+
Agents can optionally implement `#renew_on` to get themselves recreated on
|
55
|
+
certain errors.
|
56
|
+
|
57
|
+
If `#fetch` or `#navigate` raises an exception and the exception class is listed
|
58
|
+
in `#renew_on`, the instance is destroyed and recreated.
|
59
|
+
|
60
|
+
* `#renew_on` (optional): A list of exception classes upon which the existing instance gets
|
61
|
+
destroyed and replaced with a newly created one.
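The renewal contract described above can be sketched without Wayfarer itself. The following is a minimal illustration, not Wayfarer's pool implementation: `FlakyAgent` and `TinyPool` are hypothetical names, and re-raising the error after renewal is an assumption of this sketch.

```ruby
# Hypothetical stand-in for a strategy; not part of Wayfarer.
class FlakyAgent
  class Crash < StandardError; end

  attr_reader :created, :destroyed

  def initialize
    @created = 0
    @destroyed = 0
  end

  # Required: build a new instance (browser process or HTTP client).
  def create
    @created += 1
    "instance-#{@created}"
  end

  # Optional: release the instance.
  def destroy(_instance)
    @destroyed += 1
  end

  # Optional: exception classes that trigger renewal.
  def renew_on
    [Crash]
  end
end

# Hypothetical single-slot pool that renews its instance on listed errors.
class TinyPool
  def initialize(strategy)
    @strategy = strategy
    @instance = strategy.create
  end

  # Yields the current instance. On a listed exception the instance is
  # destroyed and replaced before the exception is re-raised (assumption).
  def with
    yield @instance
  rescue *@strategy.renew_on
    @strategy.destroy(@instance)
    @instance = @strategy.create
    raise
  end
end

agent = FlakyAgent.new
pool = TinyPool.new(agent)

begin
  pool.with { |_instance| raise FlakyAgent::Crash, "browser crashed" }
rescue FlakyAgent::Crash
  # The caller still sees the error; the pool has already renewed the instance.
end

# The next checkout receives the freshly created instance.
pool.with { |instance| puts instance }
```

Exceptions that are not listed in `#renew_on` pass through this sketch untouched, so the existing instance stays in the pool.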
|
62
|
+
|
63
|
+
## Example implementations
|
16
64
|
|
17
65
|
=== "Stateful"
|
18
66
|
|
@@ -20,18 +68,12 @@ Both types can be implemented with callback methods:
|
|
20
68
|
class StatefulAgent
|
21
69
|
include Wayfarer::Networking::Strategy
|
22
70
|
|
23
|
-
|
24
|
-
[MyBrowser::IrrecoverableError]
|
25
|
-
end
|
71
|
+
# Required methods
|
26
72
|
|
27
73
|
def create
|
28
74
|
MyBrowser.new
|
29
75
|
end
|
30
76
|
|
31
|
-
def destroy(browser) # optional
|
32
|
-
browser.quit
|
33
|
-
end
|
34
|
-
|
35
77
|
def navigate(browser, url)
|
36
78
|
browser.goto(url)
|
37
79
|
end
|
@@ -42,6 +84,16 @@ Both types can be implemented with callback methods:
|
|
42
84
|
status_code: browser.status_code,
|
43
85
|
headers: browser.headers)
|
44
86
|
end
|
87
|
+
|
88
|
+
# Optional methods
|
89
|
+
|
90
|
+
def destroy(browser)
|
91
|
+
browser.quit
|
92
|
+
end
|
93
|
+
|
94
|
+
def renew_on
|
95
|
+
[MyBrowser::IrrecoverableError]
|
96
|
+
end
|
45
97
|
end
|
46
98
|
```
|
47
99
|
|
@@ -51,18 +103,12 @@ Both types can be implemented with callback methods:
|
|
51
103
|
class StatelessAgent
|
52
104
|
include Wayfarer::Networking::Strategy
|
53
105
|
|
54
|
-
|
55
|
-
[MyClient::IrrecoverableError]
|
56
|
-
end
|
106
|
+
# Required methods
|
57
107
|
|
58
108
|
def create
|
59
109
|
MyClient.new
|
60
110
|
end
|
61
111
|
|
62
|
-
def destroy(client) # optional
|
63
|
-
client.close
|
64
|
-
end
|
65
|
-
|
66
112
|
def fetch(client, url)
|
67
113
|
response = client.get(url)
|
68
114
|
|
@@ -73,28 +119,23 @@ Both types can be implemented with callback methods:
|
|
73
119
|
status_code: response.status_code,
|
74
120
|
headers: response.headers)
|
75
121
|
end
|
122
|
+
|
123
|
+
# Optional methods
|
124
|
+
|
125
|
+
def destroy(client)
|
126
|
+
client.close
|
127
|
+
end
|
128
|
+
|
129
|
+
def renew_on # optional
|
130
|
+
[MyClient::IrrecoverableError]
|
131
|
+
end
|
76
132
|
end
|
77
133
|
```
|
78
134
|
|
79
135
|
|
80
|
-
Register the strategy:
|
136
|
+
Register and use the strategy:
|
81
137
|
|
82
138
|
```ruby
|
83
139
|
Wayfarer::Networking::Pool.registry[:my_agent] = MyAgent.new
|
140
|
+
Wayfarer.config[:network][:agent] = :my_agent
|
84
141
|
```
|
85
|
-
|
86
|
-
Use the strategy:
|
87
|
-
|
88
|
-
```ruby
|
89
|
-
Wayfarer.config.network.agent = :my_agent
|
90
|
-
```
|
91
|
-
|
92
|
-
### Remarks
|
93
|
-
|
94
|
-
#### Self-healing
|
95
|
-
|
96
|
-
* A strategy's `#renew_on` method may return a list of exception classes upon
|
97
|
-
which the existing instance gets destroyed and replaced with a newly created
|
98
|
-
one.
|
99
|
-
* Stateless clients must not raise exceptions when encountering certain HTTP
|
100
|
-
response codes (for example, 5xx).
|
@@ -11,10 +11,10 @@ When Ferrum is in use, a Google Chrome process is accessible within jobs like
|
|
11
11
|
so:
|
12
12
|
|
13
13
|
```ruby
|
14
|
-
Wayfarer.config
|
14
|
+
Wayfarer.config[:network][:agent] = :ferrum
|
15
15
|
|
16
16
|
class DummyWorker < Wayfarer::Worker
|
17
|
-
route
|
17
|
+
route.to :index
|
18
18
|
|
19
19
|
def index
|
20
20
|
browser # => #<Ferrum::Browser ...>
|
@@ -27,8 +27,8 @@ end
|
|
27
27
|
=== "Runtime"
|
28
28
|
|
29
29
|
```ruby
|
30
|
-
Wayfarer.config
|
31
|
-
Wayfarer.config
|
30
|
+
Wayfarer.config[:network][:agent] = :ferrum
|
31
|
+
Wayfarer.config[:ferrum][:options] = { headless: false, url: "http://chrome:3000" }
|
32
32
|
```
|
33
33
|
|
34
34
|
=== "Environment variables"
|
@@ -1,33 +1,29 @@
|
|
1
1
|
# Plain HTTP
|
2
2
|
|
3
|
-
Wayfarer can retrieve pages via plain HTTP requests
|
4
|
-
browsers.
|
3
|
+
Wayfarer can retrieve pages via plain HTTP requests with the `:http` adapter,
|
4
|
+
which also works alongside automated browsers.
|
5
5
|
|
6
|
-
##
|
6
|
+
## Ad-hoc GET requests
|
7
7
|
|
8
|
-
|
9
|
-
|
10
|
-
## Ad-hoc requests
|
11
|
-
|
12
|
-
When automating browsers, it can be useful to additionally retrieve the page
|
8
|
+
When automating browsers, it can be useful to additionally retrieve another page
|
13
9
|
over plain HTTP. Jobs can fetch URLs to [pages](/pages) with `#http`:
|
14
10
|
|
15
11
|
```ruby
|
16
12
|
class DummyJob < Wayfarer::Base
|
17
|
-
route
|
13
|
+
route.to :index
|
18
14
|
|
19
15
|
def index
|
20
|
-
http.fetch(
|
16
|
+
http.fetch("https://example.com") # => #<Wayfarer::Page ...>
|
21
17
|
end
|
22
18
|
end
|
23
19
|
```
|
24
20
|
|
25
|
-
By default, 3 redirects are followed, and this can be configured by
|
26
|
-
`follow` keyword:
|
21
|
+
By default, 3 redirects are followed, and this number can be configured by
|
22
|
+
passing the `follow` keyword:
|
27
23
|
|
28
24
|
```ruby
|
29
25
|
http.fetch(url, follow: 5)
|
30
26
|
```
|
31
27
|
|
32
|
-
|
28
|
+
When redirected too often, `Wayfarer::Networking::RedirectsExhaustedError` is
|
33
29
|
raised.
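The redirect budget can be illustrated with a small, self-contained sketch. It does not use Wayfarer: `fetch_following`, `RedirectsExhausted` and the `[:redirect, url]` / `[:success, page]` return convention are assumptions made for this example only.

```ruby
# Hypothetical error class for this sketch; Wayfarer raises
# Wayfarer::Networking::RedirectsExhaustedError in the same situation.
class RedirectsExhausted < StandardError; end

# Follows redirects until a non-redirect result is returned or the budget
# of `follow` redirects is used up. `fetcher` is any callable returning
# [:redirect, url] or [:success, page].
def fetch_following(fetcher, url, follow: 3)
  loop do
    kind, value = fetcher.call(url)
    return value if kind == :success

    raise RedirectsExhausted, "too many redirects" if follow.zero?

    follow -= 1
    url = value
  end
end

# A fake site where /a redirects to /b, which redirects to /c.
SITE = {
  "/a" => [:redirect, "/b"],
  "/b" => [:redirect, "/c"],
  "/c" => [:success, "page at /c"]
}.freeze

fetcher = ->(url) { SITE.fetch(url) }

# Two redirects fit into the default budget of three.
puts fetch_following(fetcher, "/a")

begin
  # With a budget of one, the second redirect exceeds the limit.
  fetch_following(fetcher, "/a", follow: 1)
rescue RedirectsExhausted => e
  puts "gave up: #{e.message}"
end
```

Counting down a budget rather than tracking visited URLs also terminates on redirect chains that never repeat a URL.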
|
@@ -7,10 +7,10 @@ When Selenium is in use, a remote browser process is accessible within jobs like
|
|
7
7
|
so:
|
8
8
|
|
9
9
|
```ruby
|
10
|
-
Wayfarer.config
|
10
|
+
Wayfarer.config[:network][:agent] = :selenium
|
11
11
|
|
12
12
|
class DummyWorker < Wayfarer::Worker
|
13
|
-
route
|
13
|
+
route.to :index
|
14
14
|
|
15
15
|
def index
|
16
16
|
browser # => #<Selenium::WebDriver ...>
|
@@ -27,10 +27,10 @@ process.
|
|
27
27
|
Pages retrieved with a Selenium WebDriver return fake values:
|
28
28
|
|
29
29
|
```ruby
|
30
|
-
Wayfarer.config
|
30
|
+
Wayfarer.config[:network][:agent] = :selenium
|
31
31
|
|
32
32
|
class DummyJob < Wayfarer::Base
|
33
|
-
route
|
33
|
+
route.to :index
|
34
34
|
|
35
35
|
def index
|
36
36
|
page.headers # => always {}
|
@@ -39,19 +39,18 @@ process.
|
|
39
39
|
end
|
40
40
|
```
|
41
41
|
|
42
|
-
!!! note "Consider using [Ferrum](../ferrum) instead"
|
43
|
-
Ferrum
|
44
|
-
|
45
|
-
different browser is required, consider using Ferrum instead of Selenium.
|
42
|
+
!!! note "Consider using [Ferrum](../ferrum) instead if Google Chrome suits your needs."
|
43
|
+
Use Ferrum if you want to automate Google Chrome. It provides superior
|
44
|
+
stability and a richer feature set compared to Selenium drivers.
|
46
45
|
|
47
46
|
## Configuring Selenium
|
48
47
|
|
49
48
|
=== "Runtime"
|
50
49
|
|
51
50
|
```ruby
|
52
|
-
Wayfarer.config
|
53
|
-
Wayfarer.config
|
54
|
-
Wayfarer.config
|
51
|
+
Wayfarer.config[:network][:agent] = :selenium
|
52
|
+
Wayfarer.config[:selenium][:driver] = :firefox
|
53
|
+
Wayfarer.config[:selenium][:options] = { url: "http://firefox" }
|
55
54
|
```
|
56
55
|
|
57
56
|
=== "Environment variables"
|
data/docs/guides/pages.md
CHANGED
@@ -1,11 +1,14 @@
|
|
1
1
|
# Pages
|
2
2
|
|
3
|
-
|
4
|
-
|
3
|
+
A page is the immutable state of the contents behind a URL at a point in time,
|
4
|
+
retrieved by a [user agent](user-agents.md). In other words, a page is an HTTP
|
5
|
+
response, or the state of a remotely controlled browser.
|
5
6
|
|
6
7
|
```ruby
|
7
|
-
class DummyJob <
|
8
|
-
|
8
|
+
class DummyJob < ActiveJob::Base
|
9
|
+
include Wayfarer::Base
|
10
|
+
|
11
|
+
route.to :index
|
9
12
|
|
10
13
|
def index
|
11
14
|
page # => #<Wayfarer::Page ...>
|
@@ -13,10 +16,14 @@ class DummyJob < Wayfarer::Worker
|
|
13
16
|
page.url # => "https://example.com"
|
14
17
|
page.body # => "<html>..."
|
15
18
|
page.status_code # => 200
|
16
|
-
page.headers # => { "
|
19
|
+
page.headers # => { "content-type" => ... }
|
20
|
+
page.mime_type # => #<MIME::Type: text/html>
|
21
|
+
|
22
|
+
# The lazily parsed response body or `nil`, depending on the Content-Type
|
23
|
+
page.doc # => #<Nokogiri::HTML::Document ...>
|
17
24
|
|
18
|
-
# A MetaInspector object for accessing page meta data.
|
19
25
|
# See: https://github.com/metainspector/metainspector
|
26
|
+
page.meta # => #<MetaInspector::Document ...>
|
20
27
|
# Examples:
|
21
28
|
page.meta.links.internal
|
22
29
|
page.meta.images.favicon
|
@@ -26,20 +33,39 @@ class DummyJob < Wayfarer::Worker
|
|
26
33
|
end
|
27
34
|
```
|
28
35
|
|
36
|
+
!!! info "HTTP headers are downcased and case-sensitive"
|
37
|
+
|
38
|
+
HTTP headers are downcased, so you would access
|
39
|
+
`page.headers["content-type"]` instead of `page.headers["Content-Type"]`.
|
40
|
+
|
41
|
+
## Response body parsing
|
42
|
+
|
43
|
+
Wayfarer parses the bodies of HTML, XML and JSON responses according to their
|
44
|
+
MIME types:
|
45
|
+
|
46
|
+
* `application/html` to [`#!ruby Nokogiri::HTML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document)
|
47
|
+
* `text/xml` or `application/xml` to [`#!ruby Nokogiri::XML::Document`](https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Document)
|
48
|
+
* `application/json` to `Hash`
|
49
|
+
|
29
50
|
## Live pages
|
30
51
|
|
31
|
-
|
32
|
-
|
52
|
+
`#!ruby page` initially returns a snapshot of the browser state
|
53
|
+
immediately after the user agent navigated to the URL. The browser state may
|
54
|
+
change significantly after the page was retrieved, for example due to your own
|
55
|
+
interaction, or client-side JavaScript altering the DOM or URL.
|
33
56
|
|
34
|
-
To
|
57
|
+
To get a page that reflects the current browser state, set the `#!ruby :live`
|
58
|
+
keyword:
|
35
59
|
|
36
60
|
```ruby
|
37
61
|
class DummyJob < Wayfarer::Worker
|
38
|
-
route
|
62
|
+
route.to :index
|
39
63
|
|
40
64
|
def index
|
41
65
|
page # => #<Wayfarer::Page ...>
|
42
66
|
|
67
|
+
# Fill in forms, click buttons, etc.
|
68
|
+
|
43
69
|
# Replaces the current Page object with a newer one,
|
44
70
|
# taking into account the DOM as currently rendered by the browser.
|
45
71
|
# Effectful only when automating browsers, no-op when using plain
|
@@ -50,3 +76,43 @@ class DummyJob < Wayfarer::Worker
|
|
50
76
|
end
|
51
77
|
end
|
52
78
|
```
|
79
|
+
|
80
|
+
!!! attention "Stateless user agents ignore `#!ruby :live`"
|
81
|
+
|
82
|
+
The `#!ruby :live` option is ignored by stateless user agents, such as the
|
83
|
+
default `#!ruby :http` user agent. Instead, stateless user agents always
|
84
|
+
return the same page object.
|
85
|
+
|
86
|
+
### Implementing a custom response body parser
|
87
|
+
|
88
|
+
You can register an object that implements a `#parse` method for any MIME type:
|
89
|
+
|
90
|
+
```ruby
|
91
|
+
class MyJPEGParser
|
92
|
+
def parse(body)
|
93
|
+
# Read EXIF metadata here.
|
94
|
+
# Return value is accessible as `page.doc`
|
95
|
+
end
|
96
|
+
end
|
97
|
+
|
98
|
+
Wayfarer::Parsing.registry["image/jpeg"] = MyJPEGParser.new
|
99
|
+
```
|
100
|
+
|
101
|
+
!!! info "Handling responses without a Content-Type"
|
102
|
+
|
103
|
+
If a response has no `Content-Type` header, Wayfarer falls back to
|
104
|
+
`application/octet-stream`. A parser registered for
|
105
|
+
`application/octet-stream` will hence also handle all responses without
|
106
|
+
a Content-Type.
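The lookup with its `application/octet-stream` fallback can be sketched as follows. This is an illustration under stated assumptions, not Wayfarer's code: the `PARSERS` hash, `parse_body`, `JSONParser` and `ByteCounter` are hypothetical names, and stripping parameters such as `charset` from the header is an assumption of this sketch.

```ruby
require "json"

# Hypothetical registry for this sketch, keyed by MIME type strings.
# It is not Wayfarer::Parsing.registry, only an illustration of the idea.
PARSERS = {}

class JSONParser
  def parse(body)
    JSON.parse(body)
  end
end

# Catch-all parser: reports the size of the raw body.
class ByteCounter
  def parse(body)
    { bytes: body.bytesize }
  end
end

PARSERS["application/json"] = JSONParser.new
PARSERS["application/octet-stream"] = ByteCounter.new

# Picks a parser from the downcased Content-Type header, falling back to
# application/octet-stream when the header is missing.
def parse_body(headers, body)
  content_type = headers["content-type"] || "application/octet-stream"
  mime_type = content_type.split(";").first.strip.downcase
  parser = PARSERS[mime_type]
  parser&.parse(body) # nil when no parser is registered for the MIME type
end

p parse_body({ "content-type" => "application/json; charset=utf-8" }, '{"a": 1}')
p parse_body({}, "raw bytes")
p parse_body({ "content-type" => "image/png" }, "...")
```

The last call returns `nil` because nothing is registered for `image/png` in this sketch, which mirrors `page.doc` being `nil` for unparsed content types.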
|
107
|
+
|
108
|
+
## Accessing page metadata with MetaInspector
|
109
|
+
|
110
|
+
You have access to a [MetaInspector](https://github.com/jaimeiniesta/metainspector)
|
111
|
+
document for accessing metadata of HTML pages. For example, to stage all links
|
112
|
+
internal to the current hostname:
|
113
|
+
|
114
|
+
```ruby
|
115
|
+
def index
|
116
|
+
stage page.meta.links.internal
|
117
|
+
end
|
118
|
+
```
|
@@ -0,0 +1,74 @@
|
|
1
|
+
# Routing
|
2
|
+
|
3
|
+
Wayfarer equips jobs with a routing DSL that routes URLs to actions. Actions are
|
4
|
+
either instance methods denoted by symbols, or [handlers](/guides/handlers).
|
5
|
+
A job's route declarations equate to a predicate tree.
|
6
|
+
When a URL is routed, the predicate tree is searched depth-first. If a
|
7
|
+
matching leaf predicate is found, the found path's action is dispatched,
|
8
|
+
along with `params` collected from path parameters.
|
9
|
+
|
10
|
+
The following routes:
|
11
|
+
|
12
|
+
```ruby
|
13
|
+
route.host "example.com", scheme: :https do
|
14
|
+
path "/contact", to: :contact
|
15
|
+
path "/users/:id", to: [UserHandler, :show]
|
16
|
+
end
|
17
|
+
```
|
18
|
+
|
19
|
+
Equate to the following predicate tree:
|
20
|
+
|
21
|
+
```mermaid
|
22
|
+
flowchart LR
|
23
|
+
RootRoute-->Host["Host <code>example.com</code>"]
|
24
|
+
Host-->Scheme["Scheme <code>:https</code>"]
|
25
|
+
Scheme-->Path1["Path <code>/contact</code>"]
|
26
|
+
Scheme-->Path2["Path <code>/users/:id</code>"]
|
27
|
+
Path1-->TargetRoute1["Target <code>:contact</code>"]
|
28
|
+
Path2-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]
|
29
|
+
```
|
30
|
+
|
31
|
+
An invocation for the URL `https://example.com/users/42` leads to `[UserHandler, :show]`:
|
32
|
+
|
33
|
+
```mermaid
|
34
|
+
flowchart LR
|
35
|
+
RootRoute:::active-->Host["Host <code>example.com</code>"]:::active
|
36
|
+
Host:::active-->Scheme["Scheme <code>:https</code>"]:::active
|
37
|
+
Scheme:::active-->Path1["Path <code>/contact</code>"]:::inactive
|
38
|
+
Scheme:::active-->Path2["Path <code>/users/:id</code>"]:::active
|
39
|
+
Path1:::inactive-->TargetRoute1["Target <code>:contact</code>"]:::active
|
40
|
+
Path2:::active-->TargetRoute2["Target <code>[UserHandler, :show]</code>"]:::activePlus
|
41
|
+
classDef active fill:#7CB342,stroke:#7CB342,color:#fff
|
42
|
+
classDef activePlus fill:#F1F8E9,stroke:#8BC34A,color:#33691E,stroke-width:4px
|
43
|
+
classDef inactive fill:#FFCDD2,stroke:#F44336,color:#B71C1C
|
44
|
+
```
|
45
|
+
|
46
|
+
You can also visualise an invocation of the predicate tree on the command line
|
47
|
+
with `wayfarer tree`:
|
48
|
+
|
49
|
+
```
|
50
|
+
wayfarer tree -r dummy_job.rb DummyJob https://example.com/users/42/foobar
|
51
|
+
Match([UserHandler, :show], params: {:id=>"42", :foo=>"foobar"})
|
52
|
+
└──Host("example.com", match: true)
|
53
|
+
└──Scheme(:https, match: true)
|
54
|
+
├──Path("/contact", match: false)
|
55
|
+
│ └──Target(match: true)
|
56
|
+
└──Path("/users/:id", match: true)
|
57
|
+
└──Target(match: true)
|
58
|
+
└──Path("/users/:id/:foo", match: true, params: {:id=>"42", :foo=>"foobar"})
|
59
|
+
```
|
60
|
+
|
61
|
+
As you can see, `Target` nodes always match. This means that we could have also defined
|
62
|
+
our routes as:
|
63
|
+
|
64
|
+
```ruby
|
65
|
+
route.host "example.com", scheme: :https do
|
66
|
+
to :contact do
|
67
|
+
path "/contact"
|
68
|
+
end
|
69
|
+
|
70
|
+
to [UserHandler, :show] do
|
71
|
+
path "/users/:id"
|
72
|
+
end
|
73
|
+
end
|
74
|
+
```
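To make the depth-first search over the predicate tree concrete, here is a small sketch that is independent of Wayfarer. The `Node` struct and the `host_node`, `scheme_node`, `path_node` and `target_node` helpers are hypothetical; Wayfarer's own route classes and DSL differ.

```ruby
require "uri"

# Hypothetical node type for this sketch. A predicate returns a falsy value
# on mismatch, or a truthy value (optionally a Hash of params) on match.
Node = Struct.new(:label, :predicate, :children, :action) do
  # Depth-first search: returns [action, params] for the first matching
  # path that ends in a leaf, or nil if no such path exists.
  def route(uri, params = {})
    result = predicate.call(uri)
    return nil unless result

    collected = result.is_a?(Hash) ? params.merge(result) : params
    return [action, collected] if children.empty?

    children.each do |child|
      found = child.route(uri, collected)
      return found if found
    end
    nil
  end
end

def host_node(name, children)
  Node.new("Host", ->(uri) { uri.host == name }, children, nil)
end

def scheme_node(name, children)
  Node.new("Scheme", ->(uri) { uri.scheme == name }, children, nil)
end

# Matches patterns such as "/users/:id" and extracts the named segments.
def path_node(pattern, children)
  names = pattern.scan(/:(\w+)/).flatten.map(&:to_sym)
  regex = Regexp.new("\\A#{Regexp.escape(pattern).gsub(/:\w+/, '([^/]+)')}\\z")
  predicate = lambda do |uri|
    match = regex.match(uri.path)
    match && names.zip(match.captures).to_h
  end
  Node.new("Path", predicate, children, nil)
end

# Target nodes always match and carry the action.
def target_node(action)
  Node.new("Target", ->(_uri) { true }, [], action)
end

tree = Node.new("Root", ->(_uri) { true }, [
  host_node("example.com", [
    scheme_node("https", [
      path_node("/contact", [target_node(:contact)]),
      path_node("/users/:id", [target_node([:UserHandler, :show])])
    ])
  ])
], nil)

p tree.route(URI("https://example.com/users/42"))
p tree.route(URI("http://example.com/contact"))
```

The second call yields `nil` because the scheme predicate fails before any path node is visited.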
|
data/docs/guides/tasks.md
CHANGED
@@ -1,14 +1,38 @@
|
|
1
1
|
# Tasks
|
2
2
|
|
3
|
-
Tasks are the immutable units of work
|
4
|
-
consists of:
|
3
|
+
Tasks are the immutable units of work read from a message queue and processed by
|
4
|
+
[jobs](/guides/jobs). A task consists of two strings:
|
5
5
|
|
6
|
-
|
7
|
-
|
6
|
+
* The __URL__ to process
|
7
|
+
* The __batch__ the task belongs to
|
8
8
|
|
9
|
-
|
10
|
-
* Like URLs, batches are strings.
|
9
|
+
A job processing a task commonly appends more tasks to the queue in turn.
|
11
10
|
|
12
|
-
|
13
|
-
|
14
|
-
|
11
|
+
!!! info "Task URLs are not normalized"
|
12
|
+
|
13
|
+
The URL returned by `task.url` is not normalized but verbatim
|
14
|
+
as it was staged or enqueued.
|
15
|
+
|
16
|
+
## Task deduplication
|
17
|
+
|
18
|
+
Wayfarer ensures that no URL gets processed twice within a batch. It achieves
|
19
|
+
this by maintaining a [Redis hash](https://redis.io/docs/data-types/hashes)
|
20
|
+
keyed by normalized URLs.
|
21
|
+
|
22
|
+
### URL normalization
|
23
|
+
|
24
|
+
Wayfarer parses URLs with [Addressable](https://github.com/sporkmonger/addressable)
|
25
|
+
and normalizes HTTP(S) URLs with [`normalize_url`](https://github.com/rwz/normalize_url/).
|
26
|
+
|
27
|
+
URL normalization is used only for deduplication, and does not affect the URL
|
28
|
+
returned by `task.url`. Instead, `task.url` returns the verbatim URL as it was
|
29
|
+
enqueued. This allows you to follow the exact URLs you may have parsed from a
|
30
|
+
response body.
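The split between the normalized deduplication key and the verbatim task URL can be sketched as follows. Everything here is an assumption for illustration: an in-memory `Set` per batch stands in for the Redis hash, and the `normalize` helper is a deliberately simplified substitute for the `normalize_url` gem, built on Ruby's standard `URI` library instead of Addressable.

```ruby
require "set"
require "uri"

# Simplified normalization for this sketch only: lowercases the host,
# drops the fragment and fills in an empty path.
def normalize(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase
  uri.fragment = nil
  uri.path = "/" if uri.path.empty?
  uri.to_s
end

# In-memory stand-in for the per-batch deduplication store.
class Dedup
  def initialize
    @seen = Hash.new { |hash, batch| hash[batch] = Set.new }
  end

  # Returns the verbatim URL if it is new within the batch, nil otherwise.
  # Only the lookup key is normalized; the URL itself is left untouched.
  def stage(batch, url)
    url if @seen[batch].add?(normalize(url))
  end
end

dedup = Dedup.new
p dedup.stage("batch-1", "https://Example.com/a#top") # new, returned verbatim
p dedup.stage("batch-1", "https://example.com/a")     # duplicate within the batch
p dedup.stage("batch-2", "https://example.com/a")     # other batch, staged again
```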
|
31
|
+
|
32
|
+
## Invalid URLs
|
33
|
+
|
34
|
+
Tasks with invalid URLs (for example `ht%0atp://localhost/`, a newline in the
|
35
|
+
protocol) are discarded, since they can't be retrieved. No exception is raised,
|
36
|
+
and the job is considered successfully processed, since there are no corrective
|
37
|
+
actions an error handler could take as tasks are immutable, and retries would
|
38
|
+
not change the outcome.
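Discarding invalid URLs without raising can be sketched with Ruby's standard `URI` library. This is an assumption made to keep the example dependency-free: Wayfarer parses URLs with Addressable, whose notion of validity may differ, and `parse_or_discard` is a hypothetical helper.

```ruby
require "uri"

# Returns the parsed URI, or nil when the URL is invalid. The caller can
# then skip the task silently, since a retry would not change the outcome.
def parse_or_discard(url)
  URI.parse(url)
rescue URI::InvalidURIError
  nil
end

urls = ["https://example.com/", "ht%0atp://localhost/"]

# Keep only the URLs that could actually be retrieved.
valid = urls.select { |url| parse_or_discard(url) }
puts valid
```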
|