wayfarer 0.0.3

Files changed (235)
  1. checksums.yaml +7 -0
  2. data/.gitignore +8 -0
  3. data/.rbenv-gemsets +1 -0
  4. data/.rspec +3 -0
  5. data/.rubocop.yml +21 -0
  6. data/.ruby-version +1 -0
  7. data/.travis.yml +5 -0
  8. data/.yardopts +3 -0
  9. data/Changelog.md +10 -0
  10. data/Gemfile +11 -0
  11. data/LICENSE +19 -0
  12. data/README.md +21 -0
  13. data/Rakefile +114 -0
  14. data/benchmark/frontiers.rb +143 -0
  15. data/bin/wayfarer +116 -0
  16. data/docs/.gitignore +2 -0
  17. data/docs/_config.yml +15 -0
  18. data/docs/_includes/base.html +7 -0
  19. data/docs/_includes/head.html +10 -0
  20. data/docs/_includes/navigation.html +187 -0
  21. data/docs/_layouts/default.html +42 -0
  22. data/docs/_sass/base.scss +439 -0
  23. data/docs/_sass/variables.scss +24 -0
  24. data/docs/_sass/vendor/bourbon/_bourbon-deprecate.scss +19 -0
  25. data/docs/_sass/vendor/bourbon/_bourbon-deprecated-upcoming.scss +425 -0
  26. data/docs/_sass/vendor/bourbon/_bourbon.scss +90 -0
  27. data/docs/_sass/vendor/bourbon/addons/_border-color.scss +29 -0
  28. data/docs/_sass/vendor/bourbon/addons/_border-radius.scss +48 -0
  29. data/docs/_sass/vendor/bourbon/addons/_border-style.scss +28 -0
  30. data/docs/_sass/vendor/bourbon/addons/_border-width.scss +28 -0
  31. data/docs/_sass/vendor/bourbon/addons/_buttons.scss +69 -0
  32. data/docs/_sass/vendor/bourbon/addons/_clearfix.scss +25 -0
  33. data/docs/_sass/vendor/bourbon/addons/_ellipsis.scss +30 -0
  34. data/docs/_sass/vendor/bourbon/addons/_font-stacks.scss +31 -0
  35. data/docs/_sass/vendor/bourbon/addons/_hide-text.scss +27 -0
  36. data/docs/_sass/vendor/bourbon/addons/_margin.scss +29 -0
  37. data/docs/_sass/vendor/bourbon/addons/_padding.scss +29 -0
  38. data/docs/_sass/vendor/bourbon/addons/_position.scss +51 -0
  39. data/docs/_sass/vendor/bourbon/addons/_prefixer.scss +66 -0
  40. data/docs/_sass/vendor/bourbon/addons/_retina-image.scss +27 -0
  41. data/docs/_sass/vendor/bourbon/addons/_size.scss +56 -0
  42. data/docs/_sass/vendor/bourbon/addons/_text-inputs.scss +118 -0
  43. data/docs/_sass/vendor/bourbon/addons/_timing-functions.scss +34 -0
  44. data/docs/_sass/vendor/bourbon/addons/_triangle.scss +63 -0
  45. data/docs/_sass/vendor/bourbon/addons/_word-wrap.scss +29 -0
  46. data/docs/_sass/vendor/bourbon/css3/_animation.scss +61 -0
  47. data/docs/_sass/vendor/bourbon/css3/_appearance.scss +5 -0
  48. data/docs/_sass/vendor/bourbon/css3/_backface-visibility.scss +5 -0
  49. data/docs/_sass/vendor/bourbon/css3/_background-image.scss +44 -0
  50. data/docs/_sass/vendor/bourbon/css3/_background.scss +57 -0
  51. data/docs/_sass/vendor/bourbon/css3/_border-image.scss +61 -0
  52. data/docs/_sass/vendor/bourbon/css3/_calc.scss +6 -0
  53. data/docs/_sass/vendor/bourbon/css3/_columns.scss +67 -0
  54. data/docs/_sass/vendor/bourbon/css3/_filter.scss +6 -0
  55. data/docs/_sass/vendor/bourbon/css3/_flex-box.scss +327 -0
  56. data/docs/_sass/vendor/bourbon/css3/_font-face.scss +29 -0
  57. data/docs/_sass/vendor/bourbon/css3/_font-feature-settings.scss +6 -0
  58. data/docs/_sass/vendor/bourbon/css3/_hidpi-media-query.scss +12 -0
  59. data/docs/_sass/vendor/bourbon/css3/_hyphens.scss +6 -0
  60. data/docs/_sass/vendor/bourbon/css3/_image-rendering.scss +15 -0
  61. data/docs/_sass/vendor/bourbon/css3/_keyframes.scss +38 -0
  62. data/docs/_sass/vendor/bourbon/css3/_linear-gradient.scss +40 -0
  63. data/docs/_sass/vendor/bourbon/css3/_perspective.scss +12 -0
  64. data/docs/_sass/vendor/bourbon/css3/_placeholder.scss +10 -0
  65. data/docs/_sass/vendor/bourbon/css3/_radial-gradient.scss +40 -0
  66. data/docs/_sass/vendor/bourbon/css3/_selection.scss +44 -0
  67. data/docs/_sass/vendor/bourbon/css3/_text-decoration.scss +27 -0
  68. data/docs/_sass/vendor/bourbon/css3/_transform.scss +21 -0
  69. data/docs/_sass/vendor/bourbon/css3/_transition.scss +81 -0
  70. data/docs/_sass/vendor/bourbon/css3/_user-select.scss +5 -0
  71. data/docs/_sass/vendor/bourbon/functions/_assign-inputs.scss +16 -0
  72. data/docs/_sass/vendor/bourbon/functions/_contains-falsy.scss +25 -0
  73. data/docs/_sass/vendor/bourbon/functions/_contains.scss +31 -0
  74. data/docs/_sass/vendor/bourbon/functions/_is-length.scss +16 -0
  75. data/docs/_sass/vendor/bourbon/functions/_is-light.scss +26 -0
  76. data/docs/_sass/vendor/bourbon/functions/_is-number.scss +16 -0
  77. data/docs/_sass/vendor/bourbon/functions/_is-size.scss +23 -0
  78. data/docs/_sass/vendor/bourbon/functions/_modular-scale.scss +74 -0
  79. data/docs/_sass/vendor/bourbon/functions/_px-to-em.scss +24 -0
  80. data/docs/_sass/vendor/bourbon/functions/_px-to-rem.scss +26 -0
  81. data/docs/_sass/vendor/bourbon/functions/_shade.scss +24 -0
  82. data/docs/_sass/vendor/bourbon/functions/_strip-units.scss +22 -0
  83. data/docs/_sass/vendor/bourbon/functions/_tint.scss +24 -0
  84. data/docs/_sass/vendor/bourbon/functions/_transition-property-name.scss +37 -0
  85. data/docs/_sass/vendor/bourbon/functions/_unpack.scss +32 -0
  86. data/docs/_sass/vendor/bourbon/helpers/_convert-units.scss +26 -0
  87. data/docs/_sass/vendor/bourbon/helpers/_directional-values.scss +108 -0
  88. data/docs/_sass/vendor/bourbon/helpers/_font-source-declaration.scss +53 -0
  89. data/docs/_sass/vendor/bourbon/helpers/_gradient-positions-parser.scss +24 -0
  90. data/docs/_sass/vendor/bourbon/helpers/_linear-angle-parser.scss +35 -0
  91. data/docs/_sass/vendor/bourbon/helpers/_linear-gradient-parser.scss +51 -0
  92. data/docs/_sass/vendor/bourbon/helpers/_linear-positions-parser.scss +77 -0
  93. data/docs/_sass/vendor/bourbon/helpers/_linear-side-corner-parser.scss +41 -0
  94. data/docs/_sass/vendor/bourbon/helpers/_radial-arg-parser.scss +74 -0
  95. data/docs/_sass/vendor/bourbon/helpers/_radial-gradient-parser.scss +55 -0
  96. data/docs/_sass/vendor/bourbon/helpers/_radial-positions-parser.scss +28 -0
  97. data/docs/_sass/vendor/bourbon/helpers/_render-gradients.scss +31 -0
  98. data/docs/_sass/vendor/bourbon/helpers/_shape-size-stripper.scss +15 -0
  99. data/docs/_sass/vendor/bourbon/helpers/_str-to-num.scss +55 -0
  100. data/docs/_sass/vendor/bourbon/settings/_asset-pipeline.scss +7 -0
  101. data/docs/_sass/vendor/bourbon/settings/_deprecation-warnings.scss +8 -0
  102. data/docs/_sass/vendor/bourbon/settings/_prefixer.scss +9 -0
  103. data/docs/_sass/vendor/bourbon/settings/_px-to-em.scss +1 -0
  104. data/docs/_sass/vendor/neat/_neat-helpers.scss +11 -0
  105. data/docs/_sass/vendor/neat/_neat.scss +23 -0
  106. data/docs/_sass/vendor/neat/functions/_new-breakpoint.scss +49 -0
  107. data/docs/_sass/vendor/neat/functions/_private.scss +114 -0
  108. data/docs/_sass/vendor/neat/grid/_box-sizing.scss +15 -0
  109. data/docs/_sass/vendor/neat/grid/_direction-context.scss +33 -0
  110. data/docs/_sass/vendor/neat/grid/_display-context.scss +28 -0
  111. data/docs/_sass/vendor/neat/grid/_fill-parent.scss +22 -0
  112. data/docs/_sass/vendor/neat/grid/_media.scss +92 -0
  113. data/docs/_sass/vendor/neat/grid/_omega.scss +87 -0
  114. data/docs/_sass/vendor/neat/grid/_outer-container.scss +34 -0
  115. data/docs/_sass/vendor/neat/grid/_pad.scss +25 -0
  116. data/docs/_sass/vendor/neat/grid/_private.scss +35 -0
  117. data/docs/_sass/vendor/neat/grid/_row.scss +52 -0
  118. data/docs/_sass/vendor/neat/grid/_shift.scss +50 -0
  119. data/docs/_sass/vendor/neat/grid/_span-columns.scss +94 -0
  120. data/docs/_sass/vendor/neat/grid/_to-deprecate.scss +97 -0
  121. data/docs/_sass/vendor/neat/grid/_visual-grid.scss +42 -0
  122. data/docs/_sass/vendor/neat/mixins/_clearfix.scss +25 -0
  123. data/docs/_sass/vendor/neat/settings/_disable-warnings.scss +13 -0
  124. data/docs/_sass/vendor/neat/settings/_grid.scss +51 -0
  125. data/docs/_sass/vendor/neat/settings/_visual-grid.scss +27 -0
  126. data/docs/_sass/vendor/normalize-3.0.2.scss +427 -0
  127. data/docs/_sass/vendor/pygments.scss +356 -0
  128. data/docs/automating_browsers/capybara.md +70 -0
  129. data/docs/css/screen.scss +7 -0
  130. data/docs/guides/callbacks.md +45 -0
  131. data/docs/guides/cli.md +52 -0
  132. data/docs/guides/configuration.md +184 -0
  133. data/docs/guides/error_handling.md +46 -0
  134. data/docs/guides/frontiers.md +93 -0
  135. data/docs/guides/halting.md +23 -0
  136. data/docs/guides/job_queues.md +26 -0
  137. data/docs/guides/locals.md +36 -0
  138. data/docs/guides/logging.md +22 -0
  139. data/docs/guides/page_objects.md +67 -0
  140. data/docs/guides/peeking.md +46 -0
  141. data/docs/guides/selenium_capybara.md +100 -0
  142. data/docs/guides/tutorial.md +452 -0
  143. data/docs/index.md +82 -0
  144. data/docs/js/navigation.js +11 -0
  145. data/docs/misc/contributing.md +20 -0
  146. data/docs/misc/testing.md +11 -0
  147. data/docs/recipes/authentication.md +23 -0
  148. data/docs/recipes/csv.md +29 -0
  149. data/docs/recipes/javascript.md +20 -0
  150. data/docs/recipes/multiple_uris.md +18 -0
  151. data/docs/recipes/screenshots.md +20 -0
  152. data/docs/routing/custom_rules.md +16 -0
  153. data/docs/routing/filetypes_rules.md +21 -0
  154. data/docs/routing/host_rules.md +24 -0
  155. data/docs/routing/path_rules.md +33 -0
  156. data/docs/routing/protocol_rules.md +17 -0
  157. data/docs/routing/query_rules.md +69 -0
  158. data/docs/routing/routes.md +96 -0
  159. data/docs/routing/uri_rules.md +18 -0
  160. data/examples/collect_github_issues.rb +65 -0
  161. data/examples/find_foobar_on_wikipedia.rb +23 -0
  162. data/lib/wayfarer/configuration.rb +86 -0
  163. data/lib/wayfarer/crawl.rb +79 -0
  164. data/lib/wayfarer/crawl_observer.rb +103 -0
  165. data/lib/wayfarer/dispatcher.rb +104 -0
  166. data/lib/wayfarer/finders.rb +61 -0
  167. data/lib/wayfarer/frontiers/frontier.rb +79 -0
  168. data/lib/wayfarer/frontiers/memory_bloomfilter.rb +32 -0
  169. data/lib/wayfarer/frontiers/memory_frontier.rb +76 -0
  170. data/lib/wayfarer/frontiers/memory_trie_frontier.rb +39 -0
  171. data/lib/wayfarer/frontiers/normalize_uris.rb +48 -0
  172. data/lib/wayfarer/frontiers/redis_bloomfilter.rb +34 -0
  173. data/lib/wayfarer/frontiers/redis_frontier.rb +83 -0
  174. data/lib/wayfarer/http_adapters/adapter_pool.rb +62 -0
  175. data/lib/wayfarer/http_adapters/net_http_adapter.rb +77 -0
  176. data/lib/wayfarer/http_adapters/selenium_adapter.rb +80 -0
  177. data/lib/wayfarer/job.rb +211 -0
  178. data/lib/wayfarer/locals.rb +40 -0
  179. data/lib/wayfarer/page.rb +94 -0
  180. data/lib/wayfarer/parsers/json_parser.rb +20 -0
  181. data/lib/wayfarer/parsers/xml_parser.rb +27 -0
  182. data/lib/wayfarer/processor.rb +103 -0
  183. data/lib/wayfarer/routing/custom_rule.rb +21 -0
  184. data/lib/wayfarer/routing/filetypes_rule.rb +20 -0
  185. data/lib/wayfarer/routing/host_rule.rb +19 -0
  186. data/lib/wayfarer/routing/path_rule.rb +54 -0
  187. data/lib/wayfarer/routing/protocol_rule.rb +21 -0
  188. data/lib/wayfarer/routing/query_rule.rb +59 -0
  189. data/lib/wayfarer/routing/router.rb +71 -0
  190. data/lib/wayfarer/routing/rule.rb +114 -0
  191. data/lib/wayfarer/routing/uri_rule.rb +21 -0
  192. data/lib/wayfarer.rb +68 -0
  193. data/spec/configuration_spec.rb +26 -0
  194. data/spec/crawl_spec.rb +48 -0
  195. data/spec/finders_spec.rb +49 -0
  196. data/spec/frontiers/memory_bloomfilter_spec.rb +6 -0
  197. data/spec/frontiers/memory_frontier_spec.rb +6 -0
  198. data/spec/frontiers/memory_trie_frontier_spec.rb +6 -0
  199. data/spec/frontiers/normalize_uris_spec.rb +59 -0
  200. data/spec/frontiers/redis_bloomfilter_spec.rb +6 -0
  201. data/spec/frontiers/redis_frontier_spec.rb +6 -0
  202. data/spec/http_adapters/adapter_pool_spec.rb +33 -0
  203. data/spec/http_adapters/net_http_adapter_spec.rb +83 -0
  204. data/spec/http_adapters/selenium_adapter_spec.rb +53 -0
  205. data/spec/integration/callbacks_spec.rb +42 -0
  206. data/spec/integration/locals_spec.rb +106 -0
  207. data/spec/integration/peeking_spec.rb +61 -0
  208. data/spec/job_spec.rb +122 -0
  209. data/spec/page_spec.rb +38 -0
  210. data/spec/parsers/json_parser_spec.rb +30 -0
  211. data/spec/parsers/xml_parser_spec.rb +24 -0
  212. data/spec/processor_spec.rb +31 -0
  213. data/spec/routing/custom_rule_spec.rb +26 -0
  214. data/spec/routing/filetypes_rule_spec.rb +40 -0
  215. data/spec/routing/host_rule_spec.rb +48 -0
  216. data/spec/routing/path_rule_spec.rb +66 -0
  217. data/spec/routing/protocol_rule_spec.rb +26 -0
  218. data/spec/routing/query_rule_spec.rb +124 -0
  219. data/spec/routing/router_spec.rb +67 -0
  220. data/spec/routing/rule_spec.rb +251 -0
  221. data/spec/routing/uri_rule_spec.rb +24 -0
  222. data/spec/shared/frontier.rb +96 -0
  223. data/spec/spec_helpers.rb +62 -0
  224. data/spec/wayfarer_spec.rb +24 -0
  225. data/support/static/finders.html +38 -0
  226. data/support/static/graph/details/a.html +10 -0
  227. data/support/static/graph/details/b.html +10 -0
  228. data/support/static/graph/index.html +20 -0
  229. data/support/static/json/dummy.json +13 -0
  230. data/support/static/links/links.html +28 -0
  231. data/support/static/xml/dummy.xml +120 -0
  232. data/support/test_app.rb +45 -0
  233. data/wayfarer-jruby.gemspec +49 -0
  234. data/wayfarer.gemspec +53 -0
  235. metadata +697 -0
@@ -0,0 +1,46 @@
1
+ ---
2
+ layout: default
3
+ title: Error handling
4
+ ---
5
+
6
+ # Error handling
7
+ By default, all exceptions raised within actions are swallowed and only their stacktraces are printed to stderr. This behaviour can be changed with two configuration keys (see [Configuration](configuration.html)):
8
+
9
+ 1. `print_stacktraces`: Whether to print stacktraces (default: `true`)
10
+ 2. `reraise_exceptions`: Whether to crash when encountering unhandled exceptions (default: `false`)
11
+
12
+ Here’s an example to illustrate the default behaviour:
13
+
14
+ {% highlight ruby %}
15
+ class DummyJob < Wayfarer::Job
16
+ def example
17
+ # Makes this instance fail, but processing goes on
18
+ # Prints the stacktrace to stderr
19
+ fail "It's okay, life goes on"
20
+ end
21
+ end
22
+ {% endhighlight %}
23
+
24
+ The following reraises all exceptions, stops processing and returns with a non-zero exit code:
25
+
26
+ {% highlight ruby %}
27
+ class DummyJob < Wayfarer::Job
28
+ config.reraise_exceptions = true
29
+
30
+ def example
31
+ fail "This makes the exception bubble up"
32
+ end
33
+ end
34
+ {% endhighlight %}
35
+
36
+ And if you don’t want to be bothered with exceptions at all:
37
+
38
+ {% highlight ruby %}
39
+ class DummyJob < Wayfarer::Job
40
+ config.print_stacktraces = false
41
+
42
+ def example
43
+ fail "No one will know about this ..."
44
+ end
45
+ end
46
+ {% endhighlight %}
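The interplay of the two configuration keys can be sketched in plain Ruby. This is a minimal illustration, not Wayfarer's implementation: `run_action`, its keyword arguments and the `:ok`/`:failed` return values are hypothetical names chosen for this example, and only the two documented flags are modelled.

```ruby
require "stringio"

# Hypothetical stand-in for the code that invokes an action.
# The keyword arguments mirror the two documented configuration keys.
def run_action(print_stacktraces: true, reraise_exceptions: false, err: $stderr)
  yield
  :ok
rescue StandardError => e
  if print_stacktraces
    err.puts("#{e.class}: #{e.message}")
    err.puts(Array(e.backtrace))
  end
  raise if reraise_exceptions # let the exception bubble up
  :failed                     # otherwise swallow it and carry on
end

# Default behaviour: the exception is swallowed, the trace is written out.
sink = StringIO.new
run_action(err: sink) { raise "It's okay, life goes on" }

# Silent variant: nothing is written and nothing is raised.
run_action(print_stacktraces: false, err: sink) { raise "No one will know" }
```

Passing `reraise_exceptions: true` makes the same call raise, which is the case where processing stops.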
@@ -0,0 +1,93 @@
1
+ ---
2
+ layout: default
3
+ title: Frontiers
4
+ ---
5
+
6
+ # Frontiers
7
+
8
+ Frontiers keep track of three sets of URIs:
9
+
10
+ * Current URIs that are being processed
11
+ * Staged URIs that might be processed in the next cycle
12
+ * Cached URIs that have been processed
13
+
14
+ All frontiers expose the same behaviour.
15
+
16
+ <pre class="illustration">
17
+ ┌──────────────────────────────────────────────────────────┐
18
+ │ STAGED │
19
+ │ {https://alpha.com, https://beta.com} │
20
+ └──────────────────────────────────────────────────────────┘
21
+ ┌──────────────────────────────────────────────────────────┐
22
+ │ CURRENT │
23
+ │ {https://gamma.com} │
24
+ └──────────────────────────────────────────────────────────┘
25
+ ┌──────────────────────────────────────────────────────────┐
26
+ │ CACHED │
27
+ │ {https://beta.com} │
28
+ └──────────────────────────────────────────────────────────┘
29
+
30
+ Cycle
31
+
32
+
33
+ ┌──────────────────────────────────────────────────────────┐
34
+ │ STAGED' │
35
+ │ {...} │
36
+ └──────────────────────────────────────────────────────────┘
37
+ ┌──────────────────────────────────────────────────────────┐
38
+ │ CURRENT' = STAGED \ CACHED │
39
+ │ {https://alpha.com} │
40
+ └──────────────────────────────────────────────────────────┘
41
+ ┌──────────────────────────────────────────────────────────┐
42
+ │ CACHED' = CACHED ∪ CURRENT │
43
+ │ {https://beta.com, https://gamma.com} │
44
+ └──────────────────────────────────────────────────────────┘
45
+ </pre>
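The cycle in the illustration above can be reproduced with sets from the standard library. This is a sketch under assumptions: `SketchFrontier` and `#cycle` are hypothetical names, and starting the next cycle with an empty staged set is an assumption (the illustration only shows `{...}`).

```ruby
require "set"

# Hypothetical model of the three URI sets a frontier keeps track of.
SketchFrontier = Struct.new(:staged, :current, :cached) do
  # Returns the frontier for the next cycle:
  #   CURRENT' = STAGED \ CACHED
  #   CACHED'  = CACHED ∪ CURRENT
  def cycle
    next_current = staged - cached
    next_cached  = cached | current
    # Assumption: staging starts out empty in the next cycle.
    self.class.new(Set.new, next_current, next_cached)
  end
end

before = SketchFrontier.new(
  Set["https://alpha.com", "https://beta.com"], # staged
  Set["https://gamma.com"],                     # current
  Set["https://beta.com"]                       # cached
)

after = before.cycle
after.current # contains only https://alpha.com
after.cached  # contains https://beta.com and https://gamma.com
```

Because `https://beta.com` is already cached, it is dropped from the staged set instead of being processed a second time.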
46
+
47
+ ## Available frontiers
48
+ Currently, there are 5 frontiers available:
49
+
50
+ 1. `:memory` (default): Uses sets from the standard lib.
51
+ 2. `:redis`: Uses Redis sets.
52
+ 3. `:memory_bloom`: Uses a [Bloom filter](https://github.com/igrigorik/bloomfilter-rb).
53
+ 4. `:redis_bloom`: Uses a Redis-backed Bloom filter.
54
+ 5. `:memory_trie`: Uses a [trie](https://github.com/tyler/trie) and sets.
55
+
56
+ | Frontier | MRI support | JRuby support |
57
+ | --- | --- | --- |
58
+ | `:memory` | Yes | Yes |
59
+ | `:redis` | Yes | Yes |
60
+ | `:memory_bloom` | Yes | No |
61
+ | `:redis_bloom` | Yes | No |
62
+ | `:memory_trie` | Yes | No |
63
+
64
+ ## Setting the frontier
65
+
66
+ Set the `:frontier` configuration key:
67
+
68
+ {% highlight ruby %}
69
+ class DummyJob < Wayfarer::Job
70
+ config.frontier = :foobar
71
+ end
72
+ {% endhighlight %}
73
+
74
+ ### Using a Redis frontier
75
+
76
+ Set the `:redis_opts` and `:frontier` configuration keys:
77
+
78
+ {% highlight ruby %}
79
+ class DummyJob < Wayfarer::Job
80
+ config.redis_opts = { port: 4242 }
81
+ config.frontier = :redis
82
+ end
83
+ {% endhighlight %}
84
+
85
+ ### Setting bloomfilter parameters
86
+
87
+ Set the `:bloomfilter_opts` configuration key:
88
+
89
+ {% highlight ruby %}
90
+ class DummyJob < Wayfarer::Job
91
+ config.bloomfilter_opts = { ... }
92
+ end
93
+ {% endhighlight %}
@@ -0,0 +1,23 @@
1
+ ---
2
+ layout: default
3
+ title: Halting
4
+ ---
5
+
6
+ # Halting
7
+ Processing can be stopped by calling `#halt` within actions.
8
+
9
+ `#halt` does not stop the action immediately. Instead, it sets a halting flag internally, and once the action returns, all threads stop instead of processing further URIs.
10
+
11
+ Job instances run in separate threads. When a job signals that it wants to halt, all other threads will finish their current work, but will not process any further URIs. All instances have the chance to get their current work done.
12
+
13
+ {% highlight ruby %}
14
+ class DummyJob < Wayfarer::Job
15
+ def example
16
+ halt
17
+ puts "This will be printed!"
18
+
19
+ return halt
20
+ puts "This will not be printed!"
21
+ end
22
+ end
23
+ {% endhighlight %}
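The flag-based behaviour can be sketched without threads. Everything here is hypothetical (`HaltingSketchJob`, `process`, the `/stop` URI); the point is only that `halt` sets a flag, the rest of the action still runs, and no further URIs are picked up afterwards. Wayfarer's multi-threaded processing is deliberately left out.

```ruby
# Hypothetical job with a halting flag; not Wayfarer::Job.
class HaltingSketchJob
  attr_reader :log

  def initialize
    @halted = false
    @log = []
  end

  # Only sets the flag; the calling action keeps running.
  def halt
    @halted = true
  end

  def halted?
    @halted
  end

  def example(uri)
    halt if uri.end_with?("/stop")
    @log << uri # still executed after `halt`
  end
end

# Hypothetical single-threaded processing loop.
def process(job, uris)
  uris.each do |uri|
    break if job.halted? # checked between actions, never in the middle of one
    job.example(uri)
  end
  job.log
end
```

Calling `process` with the URIs `/a`, `/stop` and `/b` of one host logs the first two and skips the third.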
@@ -0,0 +1,26 @@
1
+ ---
2
+ layout: default
3
+ title: Job queues
4
+ ---
5
+
6
+ # Job queues
7
+
8
+ Thanks to [ActiveJob](http://edgeguides.rubyonrails.org/active_job_basics.html), jobs can be enqueued with various backends, e.g. Sidekiq or Resque:
9
+
10
+ {% highlight ruby %}
11
+ class DummyJob < Wayfarer::Job
12
+ # Overrides ActiveJob's global setting
13
+ self.queue_adapter = :resque
14
+
15
+ # Identifier for enqueued jobs
16
+ queue_as :dummy_job
17
+
18
+ # Alternatively, pass a block
19
+ queue_as do
20
+ [:first, :second].sample
21
+ end
22
+ end
23
+
24
+ # Alternatively, set the queue explicitly on call:
25
+ DummyJob.set(queue: :something_else).perform_later(*uris)
26
+ {% endhighlight %}
@@ -0,0 +1,36 @@
1
+ ---
2
+ layout: default
3
+ title: Locals
4
+ ---
5
+
6
+ # Locals
7
+
8
+ Locals are Wayfarer's replacement for job instance variables. Both `let` and `let!` declare variables that are accessible within [callbacks]({{base}}/callbacks.html) and actions.
9
+
10
+ Even though you might recognise them from RSpec, they have differing semantics: Values in `let` blocks will be replaced with thread-safe counterparts once the job is run. `let!` skips this. Both evaluate their block immediately.
11
+
12
+ | Standard lib | Counterpart |
13
+ | --- | --- |
14
+ | Booleans | [`Concurrent::AtomicBoolean`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/AtomicBoolean.html) |
15
+ | `Fixnum` | [`Concurrent::AtomicFixnum`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/AtomicFixnum.html) |
16
+ | `Hash` | [`Concurrent::Hash`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Hash.html) |
17
+ | `Array` | [`Concurrent::Array`](http://ruby-concurrency.github.io/concurrent-ruby/Concurrent/Array.html) |
18
+ | Everything else | Untouched |
19
+
20
+ {% highlight ruby %}
21
+ class DummyJob < Wayfarer::Job
22
+ let(:values) { [1, 2, 3] }
23
+
24
+ before_crawl do
25
+ values.reverse!
26
+ end
27
+
28
+ after_crawl do
29
+ values # => [3, 2, 1, 0]
30
+ end
31
+
32
+ def some_action
33
+ values << 0
34
+ end
35
+ end
36
+ {% endhighlight %}
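A rough sketch of why the example above ends up with `[3, 2, 1, 0]`: the block is evaluated once, right away, and every job instance sees the same object. The `let` below is a hypothetical, simplified reimplementation; it does not swap in the thread-safe counterparts from the table, which would require the concurrent-ruby gem.

```ruby
# Hypothetical, simplified `let`; not Wayfarer's implementation.
class LocalsSketchJob
  def self.let(name, &block)
    value = block.call             # evaluated immediately, exactly once
    define_method(name) { value }  # every instance returns the same object
  end

  let(:values) { [1, 2, 3] }

  def before_crawl
    values.reverse!
  end

  def some_action
    values << 0
  end
end
```

If one instance runs `before_crawl` and another runs `some_action`, both observe `[3, 2, 1, 0]` afterwards, because the array is shared rather than copied per instance.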
@@ -0,0 +1,22 @@
1
+ ---
2
+ layout: default
3
+ title: Logging
4
+ ---
5
+
6
+ # Logging
7
+
8
+ {% highlight ruby %}
9
+ # Global configuration serves as the template
10
+ Wayfarer.logger.level = :fatal
11
+
12
+ class DummyJob < Wayfarer::Job
13
+ # Jobs can tweak their logger
14
+ config.logger.level = :warn
15
+ config.logger.progname = "dummy-job"
16
+
17
+ def example
18
+ logger.info "No"
19
+ logger.warn "Yes"
20
+ end
21
+ end
22
+ {% endhighlight %}
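The "No"/"Yes" pair in the example follows from ordinary severity filtering. The sketch below reproduces it with Ruby's standard `Logger` only; `build_sketch_logger` is a hypothetical helper, and whether Wayfarer's logger is a plain `Logger` is an assumption.

```ruby
require "logger"
require "stringio"

# Hypothetical helper: a logger that only emits WARN and above.
def build_sketch_logger(io, level: Logger::WARN, progname: "dummy-job")
  logger = Logger.new(io)
  logger.level = level
  logger.progname = progname
  logger
end

io = StringIO.new
logger = build_sketch_logger(io)
logger.info "No"  # below the WARN threshold, dropped
logger.warn "Yes" # written, tagged with the progname
```

After these two calls, `io.string` holds a single line that ends in `dummy-job: Yes`.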
@@ -0,0 +1,67 @@
1
+ ---
2
+ layout: default
3
+ title: Page objects
4
+ ---
5
+
6
+ # `Page` objects
7
+
8
+ Retrieved pages are represented by `Page` objects and made accessible by `#page` within actions. `Page`s support the same set of features regardless of the HTTP adapter in use.
9
+
10
+ <aside class="note">
11
+ HTTP response headers and status codes are not supported by Selenium WebDrivers. Wayfarer emulates both by having the WebDriver fire an AJAX request to the current page and extracting them from the response. Clearly this is a hack, but it might even work for you. See <a href="https://github.com/bauerd/selenium-emulated_features">selenium-emulated_features</a>.
12
+ </aside>
13
+
14
+ <aside class="note">
15
+ Even after having followed redirects, <code>Page#uri</code> always returns the URI that originally initiated the redirects. This behaviour stems from redirects being opaque to WebDrivers.
16
+ </aside>
17
+
18
+ A `Page` brings to the table all you'd wish for when doing web scraping:
19
+
20
+ * [Nokogiri](http://www.nokogiri.org) parses HTML/XML
21
+ * [Oj](https://github.com/ohler55/oj) or the standard lib parses JSON
22
+ * __When running on MRI__, [Pismo](https://github.com/peterc/pismo) lets you access metadata, e.g. keywords, author, a summary, … No overhead if you don't use it!
23
+
24
+ Let's see it in action:
25
+
26
+ {% highlight ruby %}
27
+ class DummyJob < Wayfarer::Job
28
+ # ...
29
+
30
+ def example
31
+ page # => #<Wayfarer::Page:...>
32
+
33
+ page.uri # => #<URI::...>
34
+ page.status_code # => Fixnum
35
+ page.body # => String
36
+ page.headers # => Hash
37
+
38
+ page.doc # => #<Nokogiri::HTML::Document:...> (HTML/XML) or Hash (JSON)
39
+ # Also accessible as just `doc`
40
+
41
+ page.links # => [URI]
42
+ page.stylesheets # => [URI]
43
+ page.javascripts # => [URI]
44
+ page.images # => [URI]
45
+
46
+ # All four of the preceding methods accept arbitrarily many CSS selectors
47
+ page.links ".my-target", ".my-other-target"
48
+
49
+ # THESE ARE NOT SUPPORTED ON JRUBY!
50
+ # On MRI, the following methods get forwarded to a Pismo::Document
51
+ # See https://github.com/peterc/pismo
52
+ page.title
53
+ page.titles
54
+ page.author
55
+ page.lede
56
+ page.keywords
57
+ page.sentences(qty)
58
+ page.body
59
+ page.html_body
60
+ page.feed
61
+ page.feeds
62
+ page.favicon
63
+ page.description
64
+ page.datetime
65
+ end
66
+ end
67
+ {% endhighlight %}
@@ -0,0 +1,46 @@
1
+ ---
2
+ layout: default
3
+ title: Peeking
4
+ ---
5
+
6
+ # Peeking
7
+ Peeking allows bypassing the [frontier](frontiers.html) in an ad-hoc manner. Use Ruby's `yield` keyword to immediately retrieve and dispatch a URI from within actions. Control gets handed off to the action matching the yielded URI, if any.
8
+
9
+ A matching route for the yielded URI is still required. If the yielded URI matches no route, or if an exception is raised while handling it, `yield` returns `nil`.
10
+
11
+ <aside class="note">
12
+ The action that gets the URI dispatched to <strong>will</strong> get assigned another HTTP adapter! HTTP adapters are never shared across actions, i.e. if you're using the Selenium HTTP adapter, the peeked URI gets retrieved by a different browser process.
13
+ </aside>
14
+
15
+ {% highlight ruby %}
16
+ class DummyJob < Wayfarer::Job
17
+ route.uri "https://example.com", to: :foo
18
+ route.uri "https://w3c.org", to: :bar
19
+
20
+ def foo
21
+ w3c_page = yield "https://w3c.org"
22
+ end
23
+
24
+ def bar
25
+ page
26
+ end
27
+ end
28
+ {% endhighlight %}
29
+
30
+ __Recursive peeking is not supported__, because it could result in an infinite loop. A `yield` inside a peeked action is ignored, so the following does terminate:
31
+
32
+ {% highlight ruby %}
33
+ class DummyJob < Wayfarer::Job
34
+ route.uri "https://example.com", to: :foo
35
+ route.uri "https://w3c.org", to: :bar
36
+
37
+ def foo
38
+ w3c_page = yield "https://w3c.org"
39
+ end
40
+
41
+ def bar
42
+ # Silently ignored, assigns nil
43
+ example_page = yield "https://example.com"
44
+ end
45
+ end
46
+ {% endhighlight %}
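One way to picture the rules above (a route is required, nested peeks yield `nil`) is a dispatcher that hands each action a block. This is a sketch under assumptions, not Wayfarer's dispatcher: `SketchPage`, `PeekingSketchDispatcher` and `PeekingSketchJob` are hypothetical, no HTTP request is made, and returning a page stand-in from `yield` is an assumption based on the `w3c_page = yield ...` example.

```ruby
# Stand-in for a retrieved page (hypothetical; not Wayfarer::Page).
SketchPage = Struct.new(:uri)

class PeekingSketchDispatcher
  def initialize(job, routes)
    @job = job
    @routes = routes # maps URI strings to action names
  end

  # Regular dispatch: the action receives a block that performs a peek.
  def dispatch(uri)
    action = @routes[uri]
    return nil unless action

    @job.public_send(action) { |peeked_uri| peek(peeked_uri) }
    SketchPage.new(uri)
  end

  private

  # Peeked dispatch: the action receives a block that always returns nil,
  # so peeking from within a peeked action cannot recurse.
  def peek(uri)
    action = @routes[uri]
    return nil unless action # no matching route

    @job.public_send(action) { |_nested_uri| nil }
    SketchPage.new(uri)
  rescue StandardError
    nil # exceptions in the peeked action yield nil
  end
end

class PeekingSketchJob
  attr_reader :peeked

  def initialize
    @peeked = {}
  end

  def foo
    @peeked[:foo] = yield "https://w3c.org"
  end

  def bar
    @peeked[:bar] = yield "https://example.com" # nested peek, yields nil
  end
end
```

Dispatching `https://example.com` with both routes registered leaves `peeked[:foo]` holding the stand-in page for `https://w3c.org` and `peeked[:bar]` set to `nil`.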
@@ -0,0 +1,100 @@
1
+ ---
2
+ layout: default
3
+ title: Selenium & Capybara
4
+ ---
5
+
6
+ # Selenium & Capybara
7
+
8
+ [Selenium](http://www.seleniumhq.org) is a browser automation framework. [Capybara](https://github.com/teamcapybara/capybara) is an acceptance testing framework that puts an expressive DSL on Selenium's WebDrivers. Both are first-class citizens in Wayfarer and the best tools for automating browsers.
9
+
10
+ ## Selenium WebDrivers
11
+
12
+ WebDrivers let you remote-control browsers, e.g. Firefox, Chrome, Safari and PhantomJS.
13
+
14
+ Depending on which browser you want to automate, install and run the corresponding driver first. For installation instructions, see the project websites:
15
+
16
+ * Firefox: [geckodriver](https://github.com/mozilla/geckodriver)
17
+ * Chrome: [chromedriver](https://sites.google.com/a/chromium.org/chromedriver)
18
+ * Safari: [SafariDriver](https://github.com/SeleniumHQ/selenium/wiki/SafariDriver)
19
+ * PhantomJS ships with an embedded driver.
20
+
21
+ Other browsers are supported, too. For an exhaustive list, see the "Third Party Drivers, Bindings, and Plugins" section on the [Selenium downloads page](http://www.seleniumhq.org/download).
22
+
23
+ If you want to run browser processes on a central server, consider using [Selenium Grid](http://www.seleniumhq.org/projects/grid).
24
+
25
+ Wayfarer hides the details of managing Ruby driver objects from you. In order to use Selenium, set the `http_adapter` configuration key to `:selenium`. Pass in the desired browser and arguments by setting the `selenium_argv` key. The number of browser processes can be controlled with the `connection_count` key.
26
+
27
+ {% highlight ruby %}
28
+ class DummyJob < Wayfarer::Job
29
+ config do |c|
30
+ # Use 4 Firefox processes
31
+ c.http_adapter = :selenium
32
+ c.selenium_argv = [:firefox]
33
+ c.connection_count = 4
34
+
35
+ # Chrome
36
+ # c.selenium_argv = [:chrome]
37
+
38
+ # Safari
39
+ # c.selenium_argv = [:safari]
40
+
41
+ # PhantomJS
42
+ # c.selenium_argv = [:phantomjs]
43
+
44
+ # Selenium Grid
45
+ # c.selenium_argv = [
46
+ # :remote,
47
+ # url: "http://localhost:4444/wd/hub",
48
+ # desired_capabilities: :firefox
49
+ # ]
50
+ end
51
+ end
52
+ {% endhighlight %}
53
+
54
+ <aside class="note">
55
+ In order to avoid redirect loops, the <code>:net_http</code> adapter supports the <code>max_http_redirects</code> configuration key. Because redirects are opaque to WebDrivers, the configuration key does not apply to the Selenium adapter. See <a href="configuration.html">Configuration</a>.
56
+ </aside>
57
+
58
+ ### Accessing the WebDriver
59
+
60
+ Within actions, `#driver` returns a [`Selenium::WebDriver::Driver`](http://www.rubydoc.info/gems/selenium-webdriver/Selenium/WebDriver/Driver):
61
+
62
+ {% highlight ruby %}
63
+ class DummyJob < Wayfarer::Job
64
+ config do |c|
65
+ c.http_adapter = :selenium
66
+ c.selenium_argv = [:firefox]
67
+ end
68
+
69
+ draw uri: "https://example.com"
70
+ def example
71
+ driver # => #<Selenium::WebDriver::Driver:...>
72
+ end
73
+ end
74
+ {% endhighlight %}
75
+
76
+ <aside class="note">
77
+ What you do with a WebDriver is opaque to Wayfarer. If you handle navigation yourself with a WebDriver and bypass the <a href="/guides/frontiers.html">frontier</a>, Wayfarer cannot ensure you don't visit URIs twice.
78
+ </aside>
79
+
80
+ ## Capybara
81
+
82
+ When using the `:selenium` HTTP adapter, `#browser` returns a [`Capybara::Selenium::Driver`](http://www.rubydoc.info/github/jnicklas/capybara/Capybara/Selenium/Driver) within actions:
83
+
84
+ {% highlight ruby %}
85
+ class DummyJob < Wayfarer::Job
86
+ config do |c|
87
+ c.http_adapter = :selenium
88
+ c.selenium_argv = [:firefox]
89
+ end
90
+
91
+ draw uri: "https://example.com"
92
+ def example
93
+ browser # => #<Capybara::Selenium::Driver:...>
94
+ end
95
+ end
96
+ {% endhighlight %}
97
+
98
+ <aside class="note">
99
+ What you do with a WebDriver is opaque to Wayfarer. If you handle navigation yourself with a WebDriver and bypass the <a href="/guides/frontiers.html">frontier</a>, Wayfarer cannot ensure you don't visit URIs twice.
100
+ </aside>