wayfarer 0.0.3
- checksums.yaml +7 -0
- data/.gitignore +8 -0
- data/.rbenv-gemsets +1 -0
- data/.rspec +3 -0
- data/.rubocop.yml +21 -0
- data/.ruby-version +1 -0
- data/.travis.yml +5 -0
- data/.yardopts +3 -0
- data/Changelog.md +10 -0
- data/Gemfile +11 -0
- data/LICENSE +19 -0
- data/README.md +21 -0
- data/Rakefile +114 -0
- data/benchmark/frontiers.rb +143 -0
- data/bin/wayfarer +116 -0
- data/docs/.gitignore +2 -0
- data/docs/_config.yml +15 -0
- data/docs/_includes/base.html +7 -0
- data/docs/_includes/head.html +10 -0
- data/docs/_includes/navigation.html +187 -0
- data/docs/_layouts/default.html +42 -0
- data/docs/_sass/base.scss +439 -0
- data/docs/_sass/variables.scss +24 -0
- data/docs/_sass/vendor/bourbon/_bourbon-deprecate.scss +19 -0
- data/docs/_sass/vendor/bourbon/_bourbon-deprecated-upcoming.scss +425 -0
- data/docs/_sass/vendor/bourbon/_bourbon.scss +90 -0
- data/docs/_sass/vendor/bourbon/addons/_border-color.scss +29 -0
- data/docs/_sass/vendor/bourbon/addons/_border-radius.scss +48 -0
- data/docs/_sass/vendor/bourbon/addons/_border-style.scss +28 -0
- data/docs/_sass/vendor/bourbon/addons/_border-width.scss +28 -0
- data/docs/_sass/vendor/bourbon/addons/_buttons.scss +69 -0
- data/docs/_sass/vendor/bourbon/addons/_clearfix.scss +25 -0
- data/docs/_sass/vendor/bourbon/addons/_ellipsis.scss +30 -0
- data/docs/_sass/vendor/bourbon/addons/_font-stacks.scss +31 -0
- data/docs/_sass/vendor/bourbon/addons/_hide-text.scss +27 -0
- data/docs/_sass/vendor/bourbon/addons/_margin.scss +29 -0
- data/docs/_sass/vendor/bourbon/addons/_padding.scss +29 -0
- data/docs/_sass/vendor/bourbon/addons/_position.scss +51 -0
- data/docs/_sass/vendor/bourbon/addons/_prefixer.scss +66 -0
- data/docs/_sass/vendor/bourbon/addons/_retina-image.scss +27 -0
- data/docs/_sass/vendor/bourbon/addons/_size.scss +56 -0
- data/docs/_sass/vendor/bourbon/addons/_text-inputs.scss +118 -0
- data/docs/_sass/vendor/bourbon/addons/_timing-functions.scss +34 -0
- data/docs/_sass/vendor/bourbon/addons/_triangle.scss +63 -0
- data/docs/_sass/vendor/bourbon/addons/_word-wrap.scss +29 -0
- data/docs/_sass/vendor/bourbon/css3/_animation.scss +61 -0
- data/docs/_sass/vendor/bourbon/css3/_appearance.scss +5 -0
- data/docs/_sass/vendor/bourbon/css3/_backface-visibility.scss +5 -0
- data/docs/_sass/vendor/bourbon/css3/_background-image.scss +44 -0
- data/docs/_sass/vendor/bourbon/css3/_background.scss +57 -0
- data/docs/_sass/vendor/bourbon/css3/_border-image.scss +61 -0
- data/docs/_sass/vendor/bourbon/css3/_calc.scss +6 -0
- data/docs/_sass/vendor/bourbon/css3/_columns.scss +67 -0
- data/docs/_sass/vendor/bourbon/css3/_filter.scss +6 -0
- data/docs/_sass/vendor/bourbon/css3/_flex-box.scss +327 -0
- data/docs/_sass/vendor/bourbon/css3/_font-face.scss +29 -0
- data/docs/_sass/vendor/bourbon/css3/_font-feature-settings.scss +6 -0
- data/docs/_sass/vendor/bourbon/css3/_hidpi-media-query.scss +12 -0
- data/docs/_sass/vendor/bourbon/css3/_hyphens.scss +6 -0
- data/docs/_sass/vendor/bourbon/css3/_image-rendering.scss +15 -0
- data/docs/_sass/vendor/bourbon/css3/_keyframes.scss +38 -0
- data/docs/_sass/vendor/bourbon/css3/_linear-gradient.scss +40 -0
- data/docs/_sass/vendor/bourbon/css3/_perspective.scss +12 -0
- data/docs/_sass/vendor/bourbon/css3/_placeholder.scss +10 -0
- data/docs/_sass/vendor/bourbon/css3/_radial-gradient.scss +40 -0
- data/docs/_sass/vendor/bourbon/css3/_selection.scss +44 -0
- data/docs/_sass/vendor/bourbon/css3/_text-decoration.scss +27 -0
- data/docs/_sass/vendor/bourbon/css3/_transform.scss +21 -0
- data/docs/_sass/vendor/bourbon/css3/_transition.scss +81 -0
- data/docs/_sass/vendor/bourbon/css3/_user-select.scss +5 -0
- data/docs/_sass/vendor/bourbon/functions/_assign-inputs.scss +16 -0
- data/docs/_sass/vendor/bourbon/functions/_contains-falsy.scss +25 -0
- data/docs/_sass/vendor/bourbon/functions/_contains.scss +31 -0
- data/docs/_sass/vendor/bourbon/functions/_is-length.scss +16 -0
- data/docs/_sass/vendor/bourbon/functions/_is-light.scss +26 -0
- data/docs/_sass/vendor/bourbon/functions/_is-number.scss +16 -0
- data/docs/_sass/vendor/bourbon/functions/_is-size.scss +23 -0
- data/docs/_sass/vendor/bourbon/functions/_modular-scale.scss +74 -0
- data/docs/_sass/vendor/bourbon/functions/_px-to-em.scss +24 -0
- data/docs/_sass/vendor/bourbon/functions/_px-to-rem.scss +26 -0
- data/docs/_sass/vendor/bourbon/functions/_shade.scss +24 -0
- data/docs/_sass/vendor/bourbon/functions/_strip-units.scss +22 -0
- data/docs/_sass/vendor/bourbon/functions/_tint.scss +24 -0
- data/docs/_sass/vendor/bourbon/functions/_transition-property-name.scss +37 -0
- data/docs/_sass/vendor/bourbon/functions/_unpack.scss +32 -0
- data/docs/_sass/vendor/bourbon/helpers/_convert-units.scss +26 -0
- data/docs/_sass/vendor/bourbon/helpers/_directional-values.scss +108 -0
- data/docs/_sass/vendor/bourbon/helpers/_font-source-declaration.scss +53 -0
- data/docs/_sass/vendor/bourbon/helpers/_gradient-positions-parser.scss +24 -0
- data/docs/_sass/vendor/bourbon/helpers/_linear-angle-parser.scss +35 -0
- data/docs/_sass/vendor/bourbon/helpers/_linear-gradient-parser.scss +51 -0
- data/docs/_sass/vendor/bourbon/helpers/_linear-positions-parser.scss +77 -0
- data/docs/_sass/vendor/bourbon/helpers/_linear-side-corner-parser.scss +41 -0
- data/docs/_sass/vendor/bourbon/helpers/_radial-arg-parser.scss +74 -0
- data/docs/_sass/vendor/bourbon/helpers/_radial-gradient-parser.scss +55 -0
- data/docs/_sass/vendor/bourbon/helpers/_radial-positions-parser.scss +28 -0
- data/docs/_sass/vendor/bourbon/helpers/_render-gradients.scss +31 -0
- data/docs/_sass/vendor/bourbon/helpers/_shape-size-stripper.scss +15 -0
- data/docs/_sass/vendor/bourbon/helpers/_str-to-num.scss +55 -0
- data/docs/_sass/vendor/bourbon/settings/_asset-pipeline.scss +7 -0
- data/docs/_sass/vendor/bourbon/settings/_deprecation-warnings.scss +8 -0
- data/docs/_sass/vendor/bourbon/settings/_prefixer.scss +9 -0
- data/docs/_sass/vendor/bourbon/settings/_px-to-em.scss +1 -0
- data/docs/_sass/vendor/neat/_neat-helpers.scss +11 -0
- data/docs/_sass/vendor/neat/_neat.scss +23 -0
- data/docs/_sass/vendor/neat/functions/_new-breakpoint.scss +49 -0
- data/docs/_sass/vendor/neat/functions/_private.scss +114 -0
- data/docs/_sass/vendor/neat/grid/_box-sizing.scss +15 -0
- data/docs/_sass/vendor/neat/grid/_direction-context.scss +33 -0
- data/docs/_sass/vendor/neat/grid/_display-context.scss +28 -0
- data/docs/_sass/vendor/neat/grid/_fill-parent.scss +22 -0
- data/docs/_sass/vendor/neat/grid/_media.scss +92 -0
- data/docs/_sass/vendor/neat/grid/_omega.scss +87 -0
- data/docs/_sass/vendor/neat/grid/_outer-container.scss +34 -0
- data/docs/_sass/vendor/neat/grid/_pad.scss +25 -0
- data/docs/_sass/vendor/neat/grid/_private.scss +35 -0
- data/docs/_sass/vendor/neat/grid/_row.scss +52 -0
- data/docs/_sass/vendor/neat/grid/_shift.scss +50 -0
- data/docs/_sass/vendor/neat/grid/_span-columns.scss +94 -0
- data/docs/_sass/vendor/neat/grid/_to-deprecate.scss +97 -0
- data/docs/_sass/vendor/neat/grid/_visual-grid.scss +42 -0
- data/docs/_sass/vendor/neat/mixins/_clearfix.scss +25 -0
- data/docs/_sass/vendor/neat/settings/_disable-warnings.scss +13 -0
- data/docs/_sass/vendor/neat/settings/_grid.scss +51 -0
- data/docs/_sass/vendor/neat/settings/_visual-grid.scss +27 -0
- data/docs/_sass/vendor/normalize-3.0.2.scss +427 -0
- data/docs/_sass/vendor/pygments.scss +356 -0
- data/docs/automating_browsers/capybara.md +70 -0
- data/docs/css/screen.scss +7 -0
- data/docs/guides/callbacks.md +45 -0
- data/docs/guides/cli.md +52 -0
- data/docs/guides/configuration.md +184 -0
- data/docs/guides/error_handling.md +46 -0
- data/docs/guides/frontiers.md +93 -0
- data/docs/guides/halting.md +23 -0
- data/docs/guides/job_queues.md +26 -0
- data/docs/guides/locals.md +36 -0
- data/docs/guides/logging.md +22 -0
- data/docs/guides/page_objects.md +67 -0
- data/docs/guides/peeking.md +46 -0
- data/docs/guides/selenium_capybara.md +100 -0
- data/docs/guides/tutorial.md +452 -0
- data/docs/index.md +82 -0
- data/docs/js/navigation.js +11 -0
- data/docs/misc/contributing.md +20 -0
- data/docs/misc/testing.md +11 -0
- data/docs/recipes/authentication.md +23 -0
- data/docs/recipes/csv.md +29 -0
- data/docs/recipes/javascript.md +20 -0
- data/docs/recipes/multiple_uris.md +18 -0
- data/docs/recipes/screenshots.md +20 -0
- data/docs/routing/custom_rules.md +16 -0
- data/docs/routing/filetypes_rules.md +21 -0
- data/docs/routing/host_rules.md +24 -0
- data/docs/routing/path_rules.md +33 -0
- data/docs/routing/protocol_rules.md +17 -0
- data/docs/routing/query_rules.md +69 -0
- data/docs/routing/routes.md +96 -0
- data/docs/routing/uri_rules.md +18 -0
- data/examples/collect_github_issues.rb +65 -0
- data/examples/find_foobar_on_wikipedia.rb +23 -0
- data/lib/wayfarer/configuration.rb +86 -0
- data/lib/wayfarer/crawl.rb +79 -0
- data/lib/wayfarer/crawl_observer.rb +103 -0
- data/lib/wayfarer/dispatcher.rb +104 -0
- data/lib/wayfarer/finders.rb +61 -0
- data/lib/wayfarer/frontiers/frontier.rb +79 -0
- data/lib/wayfarer/frontiers/memory_bloomfilter.rb +32 -0
- data/lib/wayfarer/frontiers/memory_frontier.rb +76 -0
- data/lib/wayfarer/frontiers/memory_trie_frontier.rb +39 -0
- data/lib/wayfarer/frontiers/normalize_uris.rb +48 -0
- data/lib/wayfarer/frontiers/redis_bloomfilter.rb +34 -0
- data/lib/wayfarer/frontiers/redis_frontier.rb +83 -0
- data/lib/wayfarer/http_adapters/adapter_pool.rb +62 -0
- data/lib/wayfarer/http_adapters/net_http_adapter.rb +77 -0
- data/lib/wayfarer/http_adapters/selenium_adapter.rb +80 -0
- data/lib/wayfarer/job.rb +211 -0
- data/lib/wayfarer/locals.rb +40 -0
- data/lib/wayfarer/page.rb +94 -0
- data/lib/wayfarer/parsers/json_parser.rb +20 -0
- data/lib/wayfarer/parsers/xml_parser.rb +27 -0
- data/lib/wayfarer/processor.rb +103 -0
- data/lib/wayfarer/routing/custom_rule.rb +21 -0
- data/lib/wayfarer/routing/filetypes_rule.rb +20 -0
- data/lib/wayfarer/routing/host_rule.rb +19 -0
- data/lib/wayfarer/routing/path_rule.rb +54 -0
- data/lib/wayfarer/routing/protocol_rule.rb +21 -0
- data/lib/wayfarer/routing/query_rule.rb +59 -0
- data/lib/wayfarer/routing/router.rb +71 -0
- data/lib/wayfarer/routing/rule.rb +114 -0
- data/lib/wayfarer/routing/uri_rule.rb +21 -0
- data/lib/wayfarer.rb +68 -0
- data/spec/configuration_spec.rb +26 -0
- data/spec/crawl_spec.rb +48 -0
- data/spec/finders_spec.rb +49 -0
- data/spec/frontiers/memory_bloomfilter_spec.rb +6 -0
- data/spec/frontiers/memory_frontier_spec.rb +6 -0
- data/spec/frontiers/memory_trie_frontier_spec.rb +6 -0
- data/spec/frontiers/normalize_uris_spec.rb +59 -0
- data/spec/frontiers/redis_bloomfilter_spec.rb +6 -0
- data/spec/frontiers/redis_frontier_spec.rb +6 -0
- data/spec/http_adapters/adapter_pool_spec.rb +33 -0
- data/spec/http_adapters/net_http_adapter_spec.rb +83 -0
- data/spec/http_adapters/selenium_adapter_spec.rb +53 -0
- data/spec/integration/callbacks_spec.rb +42 -0
- data/spec/integration/locals_spec.rb +106 -0
- data/spec/integration/peeking_spec.rb +61 -0
- data/spec/job_spec.rb +122 -0
- data/spec/page_spec.rb +38 -0
- data/spec/parsers/json_parser_spec.rb +30 -0
- data/spec/parsers/xml_parser_spec.rb +24 -0
- data/spec/processor_spec.rb +31 -0
- data/spec/routing/custom_rule_spec.rb +26 -0
- data/spec/routing/filetypes_rule_spec.rb +40 -0
- data/spec/routing/host_rule_spec.rb +48 -0
- data/spec/routing/path_rule_spec.rb +66 -0
- data/spec/routing/protocol_rule_spec.rb +26 -0
- data/spec/routing/query_rule_spec.rb +124 -0
- data/spec/routing/router_spec.rb +67 -0
- data/spec/routing/rule_spec.rb +251 -0
- data/spec/routing/uri_rule_spec.rb +24 -0
- data/spec/shared/frontier.rb +96 -0
- data/spec/spec_helpers.rb +62 -0
- data/spec/wayfarer_spec.rb +24 -0
- data/support/static/finders.html +38 -0
- data/support/static/graph/details/a.html +10 -0
- data/support/static/graph/details/b.html +10 -0
- data/support/static/graph/index.html +20 -0
- data/support/static/json/dummy.json +13 -0
- data/support/static/links/links.html +28 -0
- data/support/static/xml/dummy.xml +120 -0
- data/support/test_app.rb +45 -0
- data/wayfarer-jruby.gemspec +49 -0
- data/wayfarer.gemspec +53 -0
- metadata +697 -0
@@ -0,0 +1,46 @@
---
layout: default
title: Error handling
---

# Error handling
By default, all exceptions raised within actions are swallowed, and only their stacktraces are printed to stderr. This behaviour can be changed with two configuration keys (see [Configuration]()):

1. `print_stacktraces`: Whether to print stacktraces (default: `true`)
2. `reraise_exceptions`: Whether to crash when encountering unhandled exceptions (default: `false`)

Here’s an example to illustrate the default behaviour:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  def example
    # Makes this instance fail, but processing goes on
    # Prints the stacktrace to stderr
    fail "It's okay, life goes on"
  end
end
{% endhighlight %}

The following reraises all exceptions, stops processing and returns with a non-zero exit code:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config.reraise_exceptions = true

  def example
    fail "This makes the exception bubble up"
  end
end
{% endhighlight %}

To swallow exceptions silently, without printing their stacktraces:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config.print_stacktraces = false

  def example
    fail "No one will know about this ..."
  end
end
{% endhighlight %}
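
How the two keys interact can be sketched in plain Ruby. This is a minimal illustration and not Wayfarer's implementation: `run_action` and its return values are hypothetical, and only the key names `print_stacktraces` and `reraise_exceptions` are taken from the guide above.

```ruby
# Hypothetical helper (not part of Wayfarer) that mimics how the two
# configuration keys described above might interact.
def run_action(print_stacktraces: true, reraise_exceptions: false, err: $stderr)
  yield
  :ok
rescue StandardError => e
  if print_stacktraces
    err.puts "#{e.class}: #{e.message}"
    err.puts e.backtrace
  end
  raise if reraise_exceptions
  :swallowed
end

# Default behaviour: the exception is reported on stderr, then swallowed.
run_action { fail "It's okay, life goes on" } # => :swallowed

# Silent: nothing is printed and processing continues.
run_action(print_stacktraces: false) { fail "No one will know" } # => :swallowed
```

With `reraise_exceptions: true`, the same call would let the exception propagate to the caller instead of returning.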
@@ -0,0 +1,93 @@
---
layout: default
title: Frontiers
---

# Frontiers

Frontiers keep track of three sets of URIs:

* Current URIs that are being processed
* Staged URIs that might be processed in the next cycle
* Cached URIs that have been processed

All frontiers expose the same behaviour.

<pre class="illustration">
┌──────────────────────────────────────────────────────────┐
│                          STAGED                          │
│          {https://alpha.com, https://beta.com}           │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                         CURRENT                          │
│                   {https://gamma.com}                    │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                          CACHED                          │
│                    {https://beta.com}                    │
└──────────────────────────────────────────────────────────┘
                             │
                           Cycle
                             │
                             ▼
┌──────────────────────────────────────────────────────────┐
│                         STAGED'                          │
│                          {...}                           │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                CURRENT' = STAGED \ CACHED                │
│                   {https://alpha.com}                    │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                CACHED' = CACHED ∪ CURRENT                │
│          {https://beta.com, https://gamma.com}           │
└──────────────────────────────────────────────────────────┘
</pre>
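
The cycle in the illustration can be reproduced with sets from Ruby's standard library. The sketch below is only an illustration of the set arithmetic: the `Frontier` struct and its `cycle` method are hypothetical names, and starting `STAGED'` out empty is an assumption, since the illustration leaves it open.

```ruby
require "set"

# Hypothetical value object for illustration; Wayfarer's own frontier
# classes are not reproduced here.
Frontier = Struct.new(:staged, :current, :cached) do
  # Returns the frontier after one cycle, following the illustration:
  #   CURRENT' = STAGED \ CACHED
  #   CACHED'  = CACHED ∪ CURRENT
  # STAGED' starts out empty here, which is an assumption.
  def cycle
    Frontier.new(Set.new, staged - cached, cached | current)
  end
end

frontier = Frontier.new(
  Set["https://alpha.com", "https://beta.com"],
  Set["https://gamma.com"],
  Set["https://beta.com"]
)

next_frontier = frontier.cycle
next_frontier.current.to_a # => ["https://alpha.com"]
next_frontier.cached.to_a  # => ["https://beta.com", "https://gamma.com"]
```

`https://beta.com` is staged but already cached, so it is not processed again.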

## Available frontiers
Currently, there are 5 frontiers available:

1. `:memory` (default): Uses sets from the standard library.
2. `:redis`: Uses Redis sets.
3. `:memory_bloom`: Uses a [Bloom filter](https://github.com/igrigorik/bloomfilter-rb).
4. `:redis_bloom`: Uses a Redis-backed Bloom filter.
5. `:memory_trie`: Uses a [trie](https://github.com/tyler/trie) and sets.

| Frontier | MRI support | JRuby support |
| --- | --- | --- |
| `:memory` | Yes | Yes |
| `:redis` | Yes | Yes |
| `:memory_bloom` | Yes | No |
| `:redis_bloom` | Yes | No |
| `:memory_trie` | Yes | No |

## Setting the frontier

Set the `:frontier` configuration key:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  # Any of the frontiers listed above
  config.frontier = :memory_trie
end
{% endhighlight %}

### Using a Redis frontier

Set the `:redis_opts` and `:frontier` configuration keys:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config.redis_opts = { port: 4242 }
  config.frontier = :redis
end
{% endhighlight %}

### Setting Bloom filter parameters

Set the `:bloomfilter_opts` configuration key:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config.bloomfilter_opts = { ... }
end
{% endhighlight %}
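
For readers unfamiliar with Bloom filters, the idea behind the `:memory_bloom` and `:redis_bloom` frontiers can be sketched in a few lines. This is a toy version written for illustration only: it is not the bloomfilter-rb implementation, and the class name, the `size` and `hashes` parameters and the SHA-256 based hashing are all assumptions.

```ruby
require "digest"

# Toy Bloom filter for illustration; not the bloomfilter-rb gem.
class TinyBloomFilter
  def initialize(size: 1024, hashes: 3)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  # Marks every bit position derived from the item.
  def add(item)
    positions(item).each { |i| @bits[i] = true }
    self
  end

  # True if all positions are set. May report false positives,
  # but never false negatives.
  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  # Derives one bit position per seed by salting the item before hashing.
  def positions(item)
    (0...@hashes).map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{item}").to_i(16) % @size
    end
  end
end

seen = TinyBloomFilter.new
seen.add("https://alpha.com")
seen.include?("https://alpha.com") # => true
```

The filter stores only bits, never the URIs themselves, which is why a lookup can occasionally report a URI as seen although it was never added.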
@@ -0,0 +1,23 @@
---
layout: default
title: Halting
---

# Halting
Processing can be stopped by calling `#halt` within actions.

`#halt` does not abort the action it is called from. Instead, it sets a halting flag internally, and once the action returns, all threads stop instead of processing further URIs.

Job instances run in separate threads. When a job signals that it wants to halt, all other threads finish their current work, but do not process any further URIs. All instances get the chance to finish their current work.

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  def example
    halt
    puts "This will be printed!"

    return halt
    puts "This will not be printed!"
  end
end
{% endhighlight %}
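
The flag-based behaviour described above can be sketched as follows. This is a simplified, single-threaded illustration, not Wayfarer's code: `HaltFlag` and `process` are hypothetical names, and the real implementation coordinates several threads.

```ruby
# Hypothetical stand-in for the internal halting flag; the names are
# illustrative, not Wayfarer's.
class HaltFlag
  def initialize
    @mutex = Mutex.new
    @halted = false
  end

  def halt!
    @mutex.synchronize { @halted = true }
  end

  def halted?
    @mutex.synchronize { @halted }
  end
end

# Processes URIs one by one. The flag is only checked between URIs,
# so the action that calls halt! still runs to completion.
def process(uris, flag)
  processed = []
  uris.each do |uri|
    break if flag.halted?
    yield uri, flag
    processed << uri
  end
  processed
end

flag = HaltFlag.new
process(%w[a b c], flag) { |uri, f| f.halt! if uri == "b" } # => ["a", "b"]
```

The action for `"b"` sets the flag and still finishes; `"c"` is never started.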
@@ -0,0 +1,26 @@
---
layout: default
title: Job queues
---

# Job queues

Thanks to [ActiveJob](http://edgeguides.rubyonrails.org/active_job_basics.html), jobs can be enqueued with various backends, e.g. Sidekiq or Resque:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  # Overrides ActiveJob's global setting
  self.queue_adapter = :resque

  # Identifier for enqueued jobs
  queue_as :dummy_job

  # Alternatively, pass a block
  queue_as do
    [:first, :second].sample
  end
end

# Alternatively, set the queue explicitly on call:
DummyJob.set(queue: :something_else).perform_later(*uris)
{% endhighlight %}
@@ -0,0 +1,36 @@
---
layout: default
title: Locals
---

# Locals

Locals are Wayfarer's replacement for job instance variables. Both `let` and `let!` declare variables that are accessible within [callbacks]({{base}}/callbacks.html) and actions.

Even though you might recognise them from RSpec, they have differing semantics: Values in `let` blocks will be replaced with thread-safe counterparts once the job is run. `let!` skips this. Both evaluate their block immediately.

| Standard lib | Counterpart |
| --- | --- |
| Booleans | [`Concurrent::AtomicBoolean`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/AtomicBoolean.html) |
| `Fixnum` | [`Concurrent::AtomicFixnum`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/AtomicFixnum.html) |
| `Hash` | [`Concurrent::Hash`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Hash.html) |
| `Array` | [`Concurrent::Array`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Array.html) |
| Everything else | Untouched |

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  let(:values) { [1, 2, 3] }

  before_crawl do
    values.reverse!
  end

  after_crawl do
    values # => [3, 2, 1, 0]
  end

  def some_action
    values << 0
  end
end
{% endhighlight %}
@@ -0,0 +1,22 @@
---
layout: default
title: Logging
---

# Logging

{% highlight ruby %}
# Global configuration serves as the template
Wayfarer.logger.level = :fatal

class DummyJob < Wayfarer::Job
  # Jobs can tweak their logger
  config.logger.level = :warn
  config.logger.progname = "dummy-job"

  def example
    # Below the :warn level, so this is not logged
    logger.info "No"
    # At the :warn level, so this is logged
    logger.warn "Yes"
  end
end
{% endhighlight %}
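
The level filtering in the example above can be tried out with Ruby's standard `Logger`. This sketch assumes that Wayfarer's logger behaves like the standard library one, which the `level` and `progname` attributes suggest but the guide does not state; the `StringIO` target is only there to make the output inspectable.

```ruby
require "logger"
require "stringio"

# Log into a string instead of a file or stderr so the result can be inspected.
io = StringIO.new
logger = Logger.new(io)
logger.level = Logger::WARN
logger.progname = "dummy-job"

logger.info "No"  # below WARN: dropped
logger.warn "Yes" # at WARN: written, tagged with the progname

puts io.string
```

Only the `warn` message ends up in the output, prefixed with its severity and the `dummy-job` program name.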
@@ -0,0 +1,67 @@
---
layout: default
title: Page objects
---

# `Page` objects

Retrieved pages are represented by `Page` objects and made accessible by `#page` within actions. `Page`s support the same set of features regardless of the HTTP adapter in use.

<aside class="note">
HTTP response headers and status codes are not supported by Selenium WebDrivers. Wayfarer emulates both by having the WebDriver fire an AJAX request to the current page and extracting them from the response. Clearly this is a hack, but it might even work for you. See <a href="https://github.com/bauerd/selenium-emulated_features">selenium-emulated_features</a>.
</aside>

<aside class="note">
Even after having followed redirects, <code>Page#uri</code> always returns the URI that originally initiated the redirects. This behaviour stems from redirects being opaque to WebDrivers.
</aside>

A `Page` brings to the table all you'd wish for when doing web scraping:

* [Nokogiri](http://www.nokogiri.org) parses HTML/XML
* [Oj](https://github.com/ohler55/oj) or the standard library parses JSON
* __When running on MRI__, [Pismo](https://github.com/peterc/pismo) lets you access metadata, e.g. keywords, author, a summary, … No overhead if you don't use it!

Let's see it in action:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  # ...

  def example
    page             # => #<Wayfarer::Page:...>

    page.uri         # => #<URI::...>
    page.status_code # => Fixnum
    page.body        # => String
    page.headers     # => Hash

    page.doc         # => #<Nokogiri::HTML::Document:...> (HTML/XML) or Hash (JSON)
                     # Also accessible as just `doc`

    page.links       # => [URI]
    page.stylesheets # => [URI]
    page.javascripts # => [URI]
    page.images      # => [URI]

    # All previous four methods accept arbitrarily many CSS selectors
    page.links ".my-target", ".my-other-target"

    # THESE ARE NOT SUPPORTED ON JRUBY!
    # On MRI, the following methods get forwarded to a Pismo::Document
    # See https://github.com/peterc/pismo
    page.title
    page.titles
    page.author
    page.lede
    page.keywords
    page.sentences(qty)
    page.body
    page.html_body
    page.feed
    page.feeds
    page.favicon
    page.description
    page.datetime
  end
end
{% endhighlight %}
@@ -0,0 +1,46 @@
---
layout: default
title: Peeking
---

# Peeking
Peeking allows bypassing the [frontier](frontiers.html) in an ad-hoc manner. Use Ruby's `yield` keyword to immediately retrieve and dispatch a URI from within actions. Control gets handed off to the action matching the yielded URI, if any.

A matching route for the yielded URI is still required. If the yielded URI matches no route or raises an exception, `yield` returns `nil`.

<aside class="note">
The action that gets the URI dispatched to <strong>will</strong> get assigned another HTTP adapter! HTTP adapters are never shared across actions, i.e. if you're using the Selenium HTTP adapter, the peeked URI gets retrieved by a different browser process.
</aside>

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  route.uri "https://example.com", to: :foo
  route.uri "https://w3c.org", to: :bar

  def foo
    w3c_page = yield "https://w3c.org"
  end

  def bar
    page
  end
end
{% endhighlight %}

__Recursive peeking does not work__, since it could otherwise result in an infinite loop. A peek from within a peeked action is silently ignored, so the following terminates:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  route.uri "https://example.com", to: :foo
  route.uri "https://w3c.org", to: :bar

  def foo
    w3c_page = yield "https://w3c.org"
  end

  def bar
    # Silently ignored, assigns nil
    example_page = yield "https://example.com"
  end
end
{% endhighlight %}
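
One way to picture the recursion guard is a dispatcher that remembers whether it is already peeking. The following is a rough sketch under that assumption and says nothing about how Wayfarer implements it: `TinyDispatcher`, its methods and the lambda-based routes are hypothetical, and a `peek` callable stands in for `yield`.

```ruby
# Hypothetical dispatcher for illustration only; none of these names
# are Wayfarer's.
class TinyDispatcher
  def initialize(routes)
    @routes = routes # maps URI strings to callables (the "actions")
    @peeking = false
  end

  # Runs the action routed to `uri`, handing it a `peek` callable.
  def dispatch(uri)
    action = @routes[uri]
    return nil unless action

    action.call(method(:peek))
  end

  # Dispatches `uri` right away. Returns nil for nested peeks, for
  # URIs without a route and for actions that raise.
  def peek(uri)
    return nil if @peeking

    begin
      @peeking = true
      dispatch(uri)
    rescue StandardError
      nil
    ensure
      @peeking = false
    end
  end
end

routes = {
  "https://example.com" => ->(peek) { peek.call("https://w3c.org") },
  "https://w3c.org"     => ->(peek) { [:bar, peek.call("https://example.com")] }
}

TinyDispatcher.new(routes).dispatch("https://example.com") # => [:bar, nil]
```

The inner peek from the second action returns `nil` instead of dispatching back to the first one, so the two actions cannot call each other forever.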
@@ -0,0 +1,100 @@
---
layout: default
title: Selenium & Capybara
---

# Selenium & Capybara

[Selenium](http://www.seleniumhq.org) is a browser automation framework. [Capybara](https://github.com/teamcapybara/capybara) is an acceptance testing framework that puts an expressive DSL on Selenium's WebDrivers. Both are first-class citizens in Wayfarer and the best tools for automating browsers.

## Selenium WebDrivers

WebDrivers let you remote-control browsers, e.g. Firefox, Chrome, Safari and PhantomJS.

Depending on which browser you want to automate, install and run the corresponding driver first. For installation instructions, see the project websites:

* Firefox: [geckodriver](https://github.com/mozilla/geckodriver)
* Chrome: [chromedriver](https://sites.google.com/a/chromium.org/chromedriver)
* Safari: [SafariDriver](https://github.com/SeleniumHQ/selenium/wiki/SafariDriver)
* PhantomJS ships with an embedded driver.

Other browsers are supported, too. For an exhaustive list, see the "Third Party Drivers, Bindings, and Plugins" section on the [Selenium downloads page](http://www.seleniumhq.org/download).

If you want to run browser processes on a central server, consider using [Selenium Grid](http://www.seleniumhq.org/projects/grid).

Wayfarer hides the details of managing Ruby driver objects from you. In order to use Selenium, set the `http_adapter` configuration key to `:selenium`. Pass in the desired browser and arguments by setting the `selenium_argv` key. The number of browser processes can be controlled with the `connection_count` key.

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config do |c|
    # Use 4 Firefox processes
    c.http_adapter = :selenium
    c.selenium_argv = [:firefox]
    c.connection_count = 4

    # Chrome
    # c.selenium_argv = [:chrome]

    # Safari
    # c.selenium_argv = [:safari]

    # PhantomJS
    # c.selenium_argv = [:phantomjs]

    # Selenium Grid
    # c.selenium_argv = [
    #   :remote,
    #   url: "http://localhost:4444/wd/hub",
    #   desired_capabilities: :firefox
    # ]
  end
end
{% endhighlight %}

<aside class="note">
In order to avoid redirect loops, the <code>:net_http</code> adapter supports the <code>max_http_redirects</code> configuration key. Because redirects are opaque to WebDrivers, the configuration key does not apply to the Selenium adapter. See <a href="configuration.html">Configuration</a>.
</aside>

### Accessing the WebDriver

Within actions, `#driver` returns a [`Selenium::WebDriver::Driver`](http://www.rubydoc.info/gems/selenium-webdriver/Selenium/WebDriver/Driver):

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config do |c|
    c.http_adapter = :selenium
    c.selenium_argv = [:firefox]
  end

  draw uri: "https://example.com"
  def example
    driver # => #<Selenium::WebDriver::Driver:...>
  end
end
{% endhighlight %}

<aside class="note">
What you do with a WebDriver is opaque to Wayfarer. If you handle navigation yourself with a WebDriver and bypass the <a href="/guides/frontiers.html">frontier</a>, Wayfarer cannot ensure you don't visit URIs twice.
</aside>

## Capybara

When using the `:selenium` HTTP adapter, `#browser` returns a [`Capybara::Selenium::Driver`](http://www.rubydoc.info/github/jnicklas/capybara/Capybara/Selenium/Driver) within actions:

{% highlight ruby %}
class DummyJob < Wayfarer::Job
  config do |c|
    c.http_adapter = :selenium
    c.selenium_argv = [:firefox]
  end

  draw uri: "https://example.com"
  def example
    browser # => #<Capybara::Selenium::Driver:...>
  end
end
{% endhighlight %}

<aside class="note">
What you do with a WebDriver is opaque to Wayfarer. If you handle navigation yourself with a WebDriver and bypass the <a href="/guides/frontiers.html">frontier</a>, Wayfarer cannot ensure you don't visit URIs twice.
</aside>