wayfarer 0.0.3

Files changed (235)
  1. checksums.yaml +7 -0
  2. data/.gitignore +8 -0
  3. data/.rbenv-gemsets +1 -0
  4. data/.rspec +3 -0
  5. data/.rubocop.yml +21 -0
  6. data/.ruby-version +1 -0
  7. data/.travis.yml +5 -0
  8. data/.yardopts +3 -0
  9. data/Changelog.md +10 -0
  10. data/Gemfile +11 -0
  11. data/LICENSE +19 -0
  12. data/README.md +21 -0
  13. data/Rakefile +114 -0
  14. data/benchmark/frontiers.rb +143 -0
  15. data/bin/wayfarer +116 -0
  16. data/docs/.gitignore +2 -0
  17. data/docs/_config.yml +15 -0
  18. data/docs/_includes/base.html +7 -0
  19. data/docs/_includes/head.html +10 -0
  20. data/docs/_includes/navigation.html +187 -0
  21. data/docs/_layouts/default.html +42 -0
  22. data/docs/_sass/base.scss +439 -0
  23. data/docs/_sass/variables.scss +24 -0
  24. data/docs/_sass/vendor/bourbon/_bourbon-deprecate.scss +19 -0
  25. data/docs/_sass/vendor/bourbon/_bourbon-deprecated-upcoming.scss +425 -0
  26. data/docs/_sass/vendor/bourbon/_bourbon.scss +90 -0
  27. data/docs/_sass/vendor/bourbon/addons/_border-color.scss +29 -0
  28. data/docs/_sass/vendor/bourbon/addons/_border-radius.scss +48 -0
  29. data/docs/_sass/vendor/bourbon/addons/_border-style.scss +28 -0
  30. data/docs/_sass/vendor/bourbon/addons/_border-width.scss +28 -0
  31. data/docs/_sass/vendor/bourbon/addons/_buttons.scss +69 -0
  32. data/docs/_sass/vendor/bourbon/addons/_clearfix.scss +25 -0
  33. data/docs/_sass/vendor/bourbon/addons/_ellipsis.scss +30 -0
  34. data/docs/_sass/vendor/bourbon/addons/_font-stacks.scss +31 -0
  35. data/docs/_sass/vendor/bourbon/addons/_hide-text.scss +27 -0
  36. data/docs/_sass/vendor/bourbon/addons/_margin.scss +29 -0
  37. data/docs/_sass/vendor/bourbon/addons/_padding.scss +29 -0
  38. data/docs/_sass/vendor/bourbon/addons/_position.scss +51 -0
  39. data/docs/_sass/vendor/bourbon/addons/_prefixer.scss +66 -0
  40. data/docs/_sass/vendor/bourbon/addons/_retina-image.scss +27 -0
  41. data/docs/_sass/vendor/bourbon/addons/_size.scss +56 -0
  42. data/docs/_sass/vendor/bourbon/addons/_text-inputs.scss +118 -0
  43. data/docs/_sass/vendor/bourbon/addons/_timing-functions.scss +34 -0
  44. data/docs/_sass/vendor/bourbon/addons/_triangle.scss +63 -0
  45. data/docs/_sass/vendor/bourbon/addons/_word-wrap.scss +29 -0
  46. data/docs/_sass/vendor/bourbon/css3/_animation.scss +61 -0
  47. data/docs/_sass/vendor/bourbon/css3/_appearance.scss +5 -0
  48. data/docs/_sass/vendor/bourbon/css3/_backface-visibility.scss +5 -0
  49. data/docs/_sass/vendor/bourbon/css3/_background-image.scss +44 -0
  50. data/docs/_sass/vendor/bourbon/css3/_background.scss +57 -0
  51. data/docs/_sass/vendor/bourbon/css3/_border-image.scss +61 -0
  52. data/docs/_sass/vendor/bourbon/css3/_calc.scss +6 -0
  53. data/docs/_sass/vendor/bourbon/css3/_columns.scss +67 -0
  54. data/docs/_sass/vendor/bourbon/css3/_filter.scss +6 -0
  55. data/docs/_sass/vendor/bourbon/css3/_flex-box.scss +327 -0
  56. data/docs/_sass/vendor/bourbon/css3/_font-face.scss +29 -0
  57. data/docs/_sass/vendor/bourbon/css3/_font-feature-settings.scss +6 -0
  58. data/docs/_sass/vendor/bourbon/css3/_hidpi-media-query.scss +12 -0
  59. data/docs/_sass/vendor/bourbon/css3/_hyphens.scss +6 -0
  60. data/docs/_sass/vendor/bourbon/css3/_image-rendering.scss +15 -0
  61. data/docs/_sass/vendor/bourbon/css3/_keyframes.scss +38 -0
  62. data/docs/_sass/vendor/bourbon/css3/_linear-gradient.scss +40 -0
  63. data/docs/_sass/vendor/bourbon/css3/_perspective.scss +12 -0
  64. data/docs/_sass/vendor/bourbon/css3/_placeholder.scss +10 -0
  65. data/docs/_sass/vendor/bourbon/css3/_radial-gradient.scss +40 -0
  66. data/docs/_sass/vendor/bourbon/css3/_selection.scss +44 -0
  67. data/docs/_sass/vendor/bourbon/css3/_text-decoration.scss +27 -0
  68. data/docs/_sass/vendor/bourbon/css3/_transform.scss +21 -0
  69. data/docs/_sass/vendor/bourbon/css3/_transition.scss +81 -0
  70. data/docs/_sass/vendor/bourbon/css3/_user-select.scss +5 -0
  71. data/docs/_sass/vendor/bourbon/functions/_assign-inputs.scss +16 -0
  72. data/docs/_sass/vendor/bourbon/functions/_contains-falsy.scss +25 -0
  73. data/docs/_sass/vendor/bourbon/functions/_contains.scss +31 -0
  74. data/docs/_sass/vendor/bourbon/functions/_is-length.scss +16 -0
  75. data/docs/_sass/vendor/bourbon/functions/_is-light.scss +26 -0
  76. data/docs/_sass/vendor/bourbon/functions/_is-number.scss +16 -0
  77. data/docs/_sass/vendor/bourbon/functions/_is-size.scss +23 -0
  78. data/docs/_sass/vendor/bourbon/functions/_modular-scale.scss +74 -0
  79. data/docs/_sass/vendor/bourbon/functions/_px-to-em.scss +24 -0
  80. data/docs/_sass/vendor/bourbon/functions/_px-to-rem.scss +26 -0
  81. data/docs/_sass/vendor/bourbon/functions/_shade.scss +24 -0
  82. data/docs/_sass/vendor/bourbon/functions/_strip-units.scss +22 -0
  83. data/docs/_sass/vendor/bourbon/functions/_tint.scss +24 -0
  84. data/docs/_sass/vendor/bourbon/functions/_transition-property-name.scss +37 -0
  85. data/docs/_sass/vendor/bourbon/functions/_unpack.scss +32 -0
  86. data/docs/_sass/vendor/bourbon/helpers/_convert-units.scss +26 -0
  87. data/docs/_sass/vendor/bourbon/helpers/_directional-values.scss +108 -0
  88. data/docs/_sass/vendor/bourbon/helpers/_font-source-declaration.scss +53 -0
  89. data/docs/_sass/vendor/bourbon/helpers/_gradient-positions-parser.scss +24 -0
  90. data/docs/_sass/vendor/bourbon/helpers/_linear-angle-parser.scss +35 -0
  91. data/docs/_sass/vendor/bourbon/helpers/_linear-gradient-parser.scss +51 -0
  92. data/docs/_sass/vendor/bourbon/helpers/_linear-positions-parser.scss +77 -0
  93. data/docs/_sass/vendor/bourbon/helpers/_linear-side-corner-parser.scss +41 -0
  94. data/docs/_sass/vendor/bourbon/helpers/_radial-arg-parser.scss +74 -0
  95. data/docs/_sass/vendor/bourbon/helpers/_radial-gradient-parser.scss +55 -0
  96. data/docs/_sass/vendor/bourbon/helpers/_radial-positions-parser.scss +28 -0
  97. data/docs/_sass/vendor/bourbon/helpers/_render-gradients.scss +31 -0
  98. data/docs/_sass/vendor/bourbon/helpers/_shape-size-stripper.scss +15 -0
  99. data/docs/_sass/vendor/bourbon/helpers/_str-to-num.scss +55 -0
  100. data/docs/_sass/vendor/bourbon/settings/_asset-pipeline.scss +7 -0
  101. data/docs/_sass/vendor/bourbon/settings/_deprecation-warnings.scss +8 -0
  102. data/docs/_sass/vendor/bourbon/settings/_prefixer.scss +9 -0
  103. data/docs/_sass/vendor/bourbon/settings/_px-to-em.scss +1 -0
  104. data/docs/_sass/vendor/neat/_neat-helpers.scss +11 -0
  105. data/docs/_sass/vendor/neat/_neat.scss +23 -0
  106. data/docs/_sass/vendor/neat/functions/_new-breakpoint.scss +49 -0
  107. data/docs/_sass/vendor/neat/functions/_private.scss +114 -0
  108. data/docs/_sass/vendor/neat/grid/_box-sizing.scss +15 -0
  109. data/docs/_sass/vendor/neat/grid/_direction-context.scss +33 -0
  110. data/docs/_sass/vendor/neat/grid/_display-context.scss +28 -0
  111. data/docs/_sass/vendor/neat/grid/_fill-parent.scss +22 -0
  112. data/docs/_sass/vendor/neat/grid/_media.scss +92 -0
  113. data/docs/_sass/vendor/neat/grid/_omega.scss +87 -0
  114. data/docs/_sass/vendor/neat/grid/_outer-container.scss +34 -0
  115. data/docs/_sass/vendor/neat/grid/_pad.scss +25 -0
  116. data/docs/_sass/vendor/neat/grid/_private.scss +35 -0
  117. data/docs/_sass/vendor/neat/grid/_row.scss +52 -0
  118. data/docs/_sass/vendor/neat/grid/_shift.scss +50 -0
  119. data/docs/_sass/vendor/neat/grid/_span-columns.scss +94 -0
  120. data/docs/_sass/vendor/neat/grid/_to-deprecate.scss +97 -0
  121. data/docs/_sass/vendor/neat/grid/_visual-grid.scss +42 -0
  122. data/docs/_sass/vendor/neat/mixins/_clearfix.scss +25 -0
  123. data/docs/_sass/vendor/neat/settings/_disable-warnings.scss +13 -0
  124. data/docs/_sass/vendor/neat/settings/_grid.scss +51 -0
  125. data/docs/_sass/vendor/neat/settings/_visual-grid.scss +27 -0
  126. data/docs/_sass/vendor/normalize-3.0.2.scss +427 -0
  127. data/docs/_sass/vendor/pygments.scss +356 -0
  128. data/docs/automating_browsers/capybara.md +70 -0
  129. data/docs/css/screen.scss +7 -0
  130. data/docs/guides/callbacks.md +45 -0
  131. data/docs/guides/cli.md +52 -0
  132. data/docs/guides/configuration.md +184 -0
  133. data/docs/guides/error_handling.md +46 -0
  134. data/docs/guides/frontiers.md +93 -0
  135. data/docs/guides/halting.md +23 -0
  136. data/docs/guides/job_queues.md +26 -0
  137. data/docs/guides/locals.md +36 -0
  138. data/docs/guides/logging.md +22 -0
  139. data/docs/guides/page_objects.md +67 -0
  140. data/docs/guides/peeking.md +46 -0
  141. data/docs/guides/selenium_capybara.md +100 -0
  142. data/docs/guides/tutorial.md +452 -0
  143. data/docs/index.md +82 -0
  144. data/docs/js/navigation.js +11 -0
  145. data/docs/misc/contributing.md +20 -0
  146. data/docs/misc/testing.md +11 -0
  147. data/docs/recipes/authentication.md +23 -0
  148. data/docs/recipes/csv.md +29 -0
  149. data/docs/recipes/javascript.md +20 -0
  150. data/docs/recipes/multiple_uris.md +18 -0
  151. data/docs/recipes/screenshots.md +20 -0
  152. data/docs/routing/custom_rules.md +16 -0
  153. data/docs/routing/filetypes_rules.md +21 -0
  154. data/docs/routing/host_rules.md +24 -0
  155. data/docs/routing/path_rules.md +33 -0
  156. data/docs/routing/protocol_rules.md +17 -0
  157. data/docs/routing/query_rules.md +69 -0
  158. data/docs/routing/routes.md +96 -0
  159. data/docs/routing/uri_rules.md +18 -0
  160. data/examples/collect_github_issues.rb +65 -0
  161. data/examples/find_foobar_on_wikipedia.rb +23 -0
  162. data/lib/wayfarer/configuration.rb +86 -0
  163. data/lib/wayfarer/crawl.rb +79 -0
  164. data/lib/wayfarer/crawl_observer.rb +103 -0
  165. data/lib/wayfarer/dispatcher.rb +104 -0
  166. data/lib/wayfarer/finders.rb +61 -0
  167. data/lib/wayfarer/frontiers/frontier.rb +79 -0
  168. data/lib/wayfarer/frontiers/memory_bloomfilter.rb +32 -0
  169. data/lib/wayfarer/frontiers/memory_frontier.rb +76 -0
  170. data/lib/wayfarer/frontiers/memory_trie_frontier.rb +39 -0
  171. data/lib/wayfarer/frontiers/normalize_uris.rb +48 -0
  172. data/lib/wayfarer/frontiers/redis_bloomfilter.rb +34 -0
  173. data/lib/wayfarer/frontiers/redis_frontier.rb +83 -0
  174. data/lib/wayfarer/http_adapters/adapter_pool.rb +62 -0
  175. data/lib/wayfarer/http_adapters/net_http_adapter.rb +77 -0
  176. data/lib/wayfarer/http_adapters/selenium_adapter.rb +80 -0
  177. data/lib/wayfarer/job.rb +211 -0
  178. data/lib/wayfarer/locals.rb +40 -0
  179. data/lib/wayfarer/page.rb +94 -0
  180. data/lib/wayfarer/parsers/json_parser.rb +20 -0
  181. data/lib/wayfarer/parsers/xml_parser.rb +27 -0
  182. data/lib/wayfarer/processor.rb +103 -0
  183. data/lib/wayfarer/routing/custom_rule.rb +21 -0
  184. data/lib/wayfarer/routing/filetypes_rule.rb +20 -0
  185. data/lib/wayfarer/routing/host_rule.rb +19 -0
  186. data/lib/wayfarer/routing/path_rule.rb +54 -0
  187. data/lib/wayfarer/routing/protocol_rule.rb +21 -0
  188. data/lib/wayfarer/routing/query_rule.rb +59 -0
  189. data/lib/wayfarer/routing/router.rb +71 -0
  190. data/lib/wayfarer/routing/rule.rb +114 -0
  191. data/lib/wayfarer/routing/uri_rule.rb +21 -0
  192. data/lib/wayfarer.rb +68 -0
  193. data/spec/configuration_spec.rb +26 -0
  194. data/spec/crawl_spec.rb +48 -0
  195. data/spec/finders_spec.rb +49 -0
  196. data/spec/frontiers/memory_bloomfilter_spec.rb +6 -0
  197. data/spec/frontiers/memory_frontier_spec.rb +6 -0
  198. data/spec/frontiers/memory_trie_frontier_spec.rb +6 -0
  199. data/spec/frontiers/normalize_uris_spec.rb +59 -0
  200. data/spec/frontiers/redis_bloomfilter_spec.rb +6 -0
  201. data/spec/frontiers/redis_frontier_spec.rb +6 -0
  202. data/spec/http_adapters/adapter_pool_spec.rb +33 -0
  203. data/spec/http_adapters/net_http_adapter_spec.rb +83 -0
  204. data/spec/http_adapters/selenium_adapter_spec.rb +53 -0
  205. data/spec/integration/callbacks_spec.rb +42 -0
  206. data/spec/integration/locals_spec.rb +106 -0
  207. data/spec/integration/peeking_spec.rb +61 -0
  208. data/spec/job_spec.rb +122 -0
  209. data/spec/page_spec.rb +38 -0
  210. data/spec/parsers/json_parser_spec.rb +30 -0
  211. data/spec/parsers/xml_parser_spec.rb +24 -0
  212. data/spec/processor_spec.rb +31 -0
  213. data/spec/routing/custom_rule_spec.rb +26 -0
  214. data/spec/routing/filetypes_rule_spec.rb +40 -0
  215. data/spec/routing/host_rule_spec.rb +48 -0
  216. data/spec/routing/path_rule_spec.rb +66 -0
  217. data/spec/routing/protocol_rule_spec.rb +26 -0
  218. data/spec/routing/query_rule_spec.rb +124 -0
  219. data/spec/routing/router_spec.rb +67 -0
  220. data/spec/routing/rule_spec.rb +251 -0
  221. data/spec/routing/uri_rule_spec.rb +24 -0
  222. data/spec/shared/frontier.rb +96 -0
  223. data/spec/spec_helpers.rb +62 -0
  224. data/spec/wayfarer_spec.rb +24 -0
  225. data/support/static/finders.html +38 -0
  226. data/support/static/graph/details/a.html +10 -0
  227. data/support/static/graph/details/b.html +10 -0
  228. data/support/static/graph/index.html +20 -0
  229. data/support/static/json/dummy.json +13 -0
  230. data/support/static/links/links.html +28 -0
  231. data/support/static/xml/dummy.xml +120 -0
  232. data/support/test_app.rb +45 -0
  233. data/wayfarer-jruby.gemspec +49 -0
  234. data/wayfarer.gemspec +53 -0
  235. metadata +697 -0
@@ -0,0 +1,452 @@
1
+ ---
2
+ layout: default
3
+ title: Tutorial
4
+ ---
5
+
6
+ # Tutorial
7
+ This tutorial walks you through 66.333% of what there is to know about Wayfarer, a web crawling framework for Ruby. Along the way, we'll write a reusable crawler that collects the titles of all open issues from an arbitrary GitHub repository.
8
+
9
+ First, we get ourselves a subclass of `Wayfarer::Job`. If you've ever worked with a typical MVC web framework, think of a job as a self-contained controller with routes. If you haven't, don't worry!
10
+
11
+ {% highlight ruby %}
12
+ require "wayfarer" # This line omitted hereafter
13
+
14
+ class CollectGithubIssues < Wayfarer::Job
15
+ end
16
+ {% endhighlight %}
17
+
18
+ Suppose we’re interested in Rails' GitHub repository, which is located at `https://github.com/rails/rails`. We need two things:
19
+ 1. A route that matches that URI and …
20
+ 2. an instance method (action) which handles that page:
21
+
22
+ {% highlight ruby %}
23
+ class CollectGithubIssues < Wayfarer::Job
24
+ route.uri "https://github.com/rails/rails", to: :repository # (1)
25
+
26
+ def repository # (2)
27
+ puts "This looks like Rails to me!"
28
+ end
29
+ end
30
+ {% endhighlight %}
31
+
32
+ We set up a single route which maps the repository URI (and only that URI) to `CollectGithubIssues#repository`. When we feed our job the URI, the `#repository` method gets called.
33
+
34
+ To run a job, call `::perform_now` on your job class and pass an arbitrary number of URIs to start with:
35
+
36
+ {% highlight ruby %}
37
+ class CollectGithubIssues < Wayfarer::Job
38
+ # Gives more detailed output
39
+ # I'll omit this from now on
40
+ config.logger.level = :debug
41
+
42
+ route.uri "https://github.com/rails/rails", to: :repository
43
+
44
+ def repository
45
+ puts "This looks like Rails to me!"
46
+ end
47
+ end
48
+
49
+ CollectGithubIssues.perform_now("https://github.com/rails/rails", "https://example.com")
50
+ {% endhighlight %}
51
+
52
+ Note that we pass a URI we have no matching route for, `https://example.com`.
53
+
54
+ Save and run your file as you would with every other Ruby file:
55
+
56
+ ```
57
+ % ruby collect_github_issues.rb
58
+ ```
59
+
60
+ … and you'll end up with output similar to this:
61
+
62
+ ```
63
+ Performing CollectGithubIssues (Job ID: …) from Async(default) with arguments: "https://github.com/rails/rails", "https://example.com"
64
+ I, […] INFO -- wayfarer: First cycle
65
+ I, […] INFO -- wayfarer: Frontier: URI-normalizing #<Wayfarer::Frontiers::MemoryFrontier:0x007fa2a6ae9cf0>
66
+ I, […] INFO -- wayfarer: Current cycle contains 2 URI(s)
67
+ I, […] INFO -- wayfarer: Dispatched to #repository: https://github.com/rails/rails
68
+ This looks like Rails to me!
69
+ I, […] INFO -- wayfarer: Staging 0 URI(s)
70
+ D, […] DEBUG -- wayfarer: No matching route for: https://example.com/
71
+ I, […] INFO -- wayfarer: No URIs left in current cycle
72
+ I, […] INFO -- wayfarer: About to cycle. 0 staged URI(s)
73
+ Performed CollectGithubIssues (Job ID: …) from Async(default) in 863.69ms
74
+ ```
75
+
76
+ Here is what happened:
77
+
78
+ 1. Both URIs we passed in were matched against our routes.
79
+ 2. Our matching GitHub URI's page was retrieved, the mismatching one ignored.
80
+ 3. Our `#repository` action was invoked and has access to the retrieved page.
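The three steps above can be sketched in plain Ruby. This is only an illustration of the dispatch idea, not Wayfarer's implementation: `MiniJob`, `ROUTES` and `log` are hypothetical names, and no page is actually retrieved.

```ruby
# Hypothetical stand-in for a job class; not Wayfarer's actual implementation.
class MiniJob
  # Maps exact URI strings to action names, similar in spirit to `route.uri`.
  ROUTES = { "https://github.com/rails/rails" => :repository }.freeze

  attr_reader :log

  def initialize
    @log = []
  end

  # Dispatches each URI to its action; URIs without a route are skipped.
  def perform(*uris)
    uris.each do |uri|
      action = ROUTES[uri]
      if action
        public_send(action, uri)
      else
        @log << "No matching route for: #{uri}"
      end
    end
    self
  end

  def repository(uri)
    @log << "Dispatched to #repository: #{uri}"
  end
end

job = MiniJob.new.perform("https://github.com/rails/rails", "https://example.com")
puts job.log
```

The sketch matches URIs by exact string comparison, which is the simplest possible rule; the host and path rules introduced later are more flexible.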
81
+
82
+
83
+
84
+ Let’s exchange our static string for the actual page `<title>`. Inside our instance method, we call `#doc` to get hold of a [`Nokogiri::HTML::Document`](http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/HTML/Document). [Nokogiri]() is an HTML/XML library, and a parsed document allows us to access the title tag easily:
85
+
86
+ {% highlight ruby %}
87
+ class CollectGithubIssues < Wayfarer::Job
88
+ route.uri "https://github.com/rails/rails", to: :repository
89
+
90
+ def repository
91
+ # Outputs the text of the <title> element
92
+ puts doc.title
93
+ end
94
+ end
95
+
96
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
97
+ {% endhighlight %}
98
+
99
+ Wayfarer does not attempt to do black magic on top of Nokogiri. When it comes to extracting specific data from pages, you’re mostly on your own. There are helpers for finding links, CSS/JavaScript files and images (see [`Page` objects](page_objects.html)). But figuring out what the interesting parts of an HTTP response are is still up to you.
100
+
101
+ Wayfarer parses JSON, too. You'll get a `Hash` returned by `#doc` instead of a Nokogiri document.
102
+
103
+ Rails’ issues are located at `https://github.com/rails/rails/issues`. We need a new route and a new instance method to handle this issue index. By calling `#stage` and passing in an arbitrary number of URIs, we can stage URIs for processing. Note that just because a URI gets staged does not mean it will be fetched—a matching route is required for every URI. Also, Wayfarer will by default ensure that no URI gets processed twice. This behaviour can be turned off, though (see [Configuration](configuration.html)).
104
+
105
+ {% highlight ruby %}
106
+ class CollectGithubIssues < Wayfarer::Job
107
+ routes do
108
+ uri "https://github.com/rails/rails", to: :repository
109
+ uri "https://github.com/rails/rails/issues", to: :index
110
+ end
111
+
112
+ def repository
113
+ # This is where we want to go next
114
+ stage "https://github.com/rails/rails/issues"
115
+ end
116
+
117
+ def index
118
+ puts "Arrived at the issue listing"
119
+ end
120
+ end
121
+
122
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
123
+ {% endhighlight %}
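To make the idea of staging and de-duplication concrete, here is a minimal in-memory sketch. `TinyFrontier` and its methods are hypothetical names chosen for illustration; Wayfarer's own frontier classes may work differently.

```ruby
require "set"

# Hypothetical in-memory frontier; names are illustrative, not Wayfarer's API.
class TinyFrontier
  attr_reader :current

  def initialize(*start_uris)
    @current = start_uris.uniq
    @staged  = Set.new
    @seen    = Set.new(start_uris)
  end

  # Remembers a URI for the next cycle unless it was already seen.
  def stage(*uris)
    uris.flatten.each do |uri|
      @staged << uri if @seen.add?(uri)
    end
  end

  # Promotes staged URIs to the current cycle; returns false when none are left.
  def cycle
    return false if @staged.empty?

    @current = @staged.to_a
    @staged  = Set.new
    true
  end
end

frontier = TinyFrontier.new("https://github.com/rails/rails")
frontier.stage("https://github.com/rails/rails/issues")
frontier.stage("https://github.com/rails/rails")         # already seen, ignored
frontier.stage("https://github.com/rails/rails/issues")  # duplicate, ignored
frontier.cycle
p frontier.current
```

Keeping a separate "seen" set is what prevents a URI from being processed twice, even when several pages link to it.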
124
+
125
+ What we have so far works fine for the Rails repository, but not for others, because the URIs are hardcoded. That's a real pity, because there are more than 10 million repositories on GitHub. We can do better by switching to a host and path rule.
126
+
127
+ A host rule narrows down the host portion of a URI, and a path rule the path. Instead of hard-coding the path, pattern matching can be used to have interesting parts of the path extracted:
128
+
129
+ {% highlight ruby %}
130
+ class CollectGithubIssues < Wayfarer::Job
131
+ routes do
132
+ # Both routes match only if
133
+ # (1) The host is github.com and
134
+ # (2) The path segments match
135
+ host "github.com" do
136
+ path "/:user/:repo", to: :repository
137
+ path "/:user/:repo/issues", to: :index
138
+ end
139
+ end
140
+
141
+ def repository
142
+ stage "https://github.com/rails/rails/issues"
143
+ end
144
+
145
+ def index
146
+ # Captured path segments: params # => { repo: ..., user: ... }
147
+ # Prints 'rails belongs to rails'.
148
+ puts "#{params['repo']} belongs to #{params['user']}"
149
+ end
150
+ end
151
+
152
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
153
+ {% endhighlight %}
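One way such path patterns could be matched is by compiling them into regular expressions with named captures. The helpers below (`compile_path_pattern`, `match_path`) are hypothetical and only illustrate the technique; they are not taken from Wayfarer.

```ruby
# Hypothetical helper: turns "/:user/:repo" into a regex with named captures.
def compile_path_pattern(pattern)
  parts = pattern.split("/", -1).map do |segment|
    if segment.start_with?(":")
      "(?<#{segment[1..-1]}>[^/]+)" # dynamic segment, captured by name
    else
      Regexp.escape(segment)        # static segment, matched literally
    end
  end
  Regexp.new("\\A#{parts.join('/')}\\z")
end

# Returns a hash of captured segments, or nil if the path does not match.
def match_path(pattern, path)
  match = compile_path_pattern(pattern).match(path)
  match && match.named_captures
end

p match_path("/:user/:repo", "/rails/rails")
p match_path("/:user/:repo/issues", "/rails/rails")
```

In this sketch the captured hash uses string keys; whether a real `params` object accepts strings, symbols or both is up to the library.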
154
+
155
+ Note that we still have a hard-coded URI in `#repository`. Usually, there are two approaches to identify URIs that one wants to follow:
156
+
157
+ 1. Constructing the successor URI from the current URI.
158
+ 2. Reading the URI from the HTTP response, e.g. extracting an `<a>` tag's `href` attribute.
159
+
160
+ For the first case, say we're on `https://github.com/:user/:repo` and want to go to `https://github.com/:user/:repo/issues`. `#stage` takes relative paths and URIs too, and constructs absolute URIs by appending to the current page's URI:
161
+
162
+ {% highlight ruby %}
163
+ class CollectGithubIssues < Wayfarer::Job
164
+ # ...
165
+
166
+ def repository
167
+ # Stages "#{page.uri}/issues"
168
+ stage "issues"
169
+ end
170
+
171
+ # ...
172
+ end
173
+ {% endhighlight %}
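The difference between appending a segment and standard relative-reference resolution is easy to trip over, so here is a small stdlib-only comparison. `append_segment` is a hypothetical helper that mimics the appending behaviour described above; `URI.join` is Ruby's RFC 3986 resolution, which replaces the last path segment when the base has no trailing slash.

```ruby
require "uri"

# Hypothetical helper: appends a relative path as a new trailing segment.
def append_segment(base, relative)
  URI.parse("#{base.to_s.chomp('/')}/#{relative}")
end

base = URI.parse("https://github.com/rails/rails")

puts append_segment(base, "issues")   # https://github.com/rails/rails/issues
puts URI.join(base.to_s, "issues")    # https://github.com/rails/issues
```

If you build successor URIs yourself, it is worth checking which of the two results you actually get.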
174
+
175
+ `#page` returns a [`Page` object]({{base}}/guides/page_objects.html), the general representation of a retrieved page. It gives access to the page's origin URI, the response headers, the status code, the raw response body and more.
176
+
177
+ The second case is where Wayfarer's routing shines. We know that the path structure is `/:user/:repo/issues` and that somewhere on the repository's front page there is a link pointing to it. We can stage __all__ links of the current page, and have our routes ensure that only interesting ones get processed:
178
+
179
+ {% highlight ruby %}
180
+ class CollectGithubIssues < Wayfarer::Job
181
+ # ...
182
+
183
+ def repository
184
+ # Stage every link; only route-matching ones get processed
185
+ stage page.links
186
+ end
187
+
188
+ # ...
189
+ end
190
+ {% endhighlight %}
191
+
192
+ `Page#links` returns all links of the current page. But staging all links brings overhead with it, and we'll want to narrow down the links to stage, especially when crawling large page structures. `Page#links` accepts an arbitrary number of CSS selectors to narrow down links. For clarity, let's give the navigation links their own private helper method:
193
+
194
+ {% highlight ruby %}
195
+ class CollectGithubIssues < Wayfarer::Job
196
+ routes do
197
+ host "github.com" do
198
+ path "/:user/:repo", to: :repository
199
+ path "/:user/:repo/issues", to: :index
200
+ end
201
+ end
202
+
203
+ def repository
204
+ stage navigation_links
205
+ end
206
+
207
+ def index
208
+ puts "#{params['repo']} belongs to #{params['user']}"
209
+ end
210
+
211
+ private
212
+
213
+ def navigation_links
214
+ page.links ".reponav-item"
215
+ end
216
+ end
217
+
218
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
219
+ {% endhighlight %}
220
+
221
+ URIs never get dispatched to private instance methods.
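A dispatcher can enforce this with Ruby's method visibility checks. The following is a sketch under the assumption that a check like `public_method_defined?` is used; `DemoJob` and `dispatchable?` are hypothetical names.

```ruby
# Hypothetical job-like class with one public action and one private helper.
class DemoJob
  def repository; end

  private

  def navigation_links; end
end

# Only public instance methods qualify as actions in this sketch.
def dispatchable?(job_class, action)
  job_class.public_method_defined?(action)
end

p dispatchable?(DemoJob, :repository)
p dispatchable?(DemoJob, :navigation_links)
```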
222
+
223
+ We're prepared to go after the individual issues now. We add the `#show` action, and route to it with a host and path rule. Links to issue tickets are wrapped in `.issues-listing`, so we can apply the same technique as above:
224
+
225
+ {% highlight ruby %}
226
+ class CollectGithubIssues < Wayfarer::Job
227
+ routes do
228
+ host "github.com" do
229
+ path "/:user/:repo", to: :repository
230
+ path "/:user/:repo/issues", to: :index
231
+ path "/:user/:repo/issues/:id", to: :show
232
+ end
233
+ end
234
+
235
+ def repository
236
+ stage navigation_links
237
+ end
238
+
239
+ def index
240
+ stage issue_listing_links
241
+ end
242
+
243
+ def show
244
+ puts "Issue No. #{params[:id]} @ #{page.uri}"
245
+ end
246
+
247
+ private
248
+
249
+ def navigation_links
250
+ page.links ".reponav-item"
251
+ end
252
+
253
+ def issue_listing_links
254
+ page.links ".issues-listing"
255
+ end
256
+ end
257
+
258
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
259
+ {% endhighlight %}
260
+
261
+ Handling pagination boils down to staging one more link in `#index`. As mentioned before, `#stage` accepts an arbitrary number of URIs:
262
+
263
+ {% highlight ruby %}
264
+ class CollectGithubIssues < Wayfarer::Job
265
+ routes do
266
+ host "github.com" do
267
+ path "/:user/:repo", to: :repository
268
+ path "/:user/:repo/issues", to: :index
269
+ path "/:user/:repo/issues/:id", to: :show
270
+ end
271
+ end
272
+
273
+ def repository
274
+ stage navigation_links
275
+ end
276
+
277
+ def index
278
+ stage issue_listing_links, next_page
279
+ end
280
+
281
+ def show
282
+ puts "Issue No. #{params[:id]} @ #{page.uri}"
283
+ end
284
+
285
+ private
286
+
287
+ def navigation_links
288
+ page.links ".reponav-item"
289
+ end
290
+
291
+ def issue_listing_links
292
+ page.links ".issues-listing"
293
+ end
294
+
295
+ def next_page
296
+ page.links ".next_page"
297
+ end
298
+ end
299
+
300
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
301
+ {% endhighlight %}
302
+
303
+ By default, all work happens within a single thread. We can speed up crawling by increasing the thread count:
304
+
305
+ {% highlight ruby %}
306
+ class CollectGithubIssues < Wayfarer::Job
307
+ config.connection_count = 4 # Four threads
308
+
309
+ # ...
310
+ end
311
+ {% endhighlight %}
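For intuition, processing a batch of URIs with several threads can be sketched with a shared queue and a fixed number of workers. Everything here is hypothetical: `process_concurrently` is an illustrative name and the "fetch" is simulated, so no HTTP request is made.

```ruby
# Hypothetical worker pool; the "fetch" is simulated, no HTTP request is made.
def process_concurrently(uris, thread_count)
  queue = Queue.new
  uris.each { |uri| queue << uri }
  results = Queue.new # Queue is thread-safe, so workers can push freely

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        uri = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << "fetched #{uri}"
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

uris = %w[https://example.com/a https://example.com/b https://example.com/c]
fetched = process_concurrently(uris, 4)
puts fetched.sort
```

The order in which workers finish is not deterministic, which is why the results are sorted before printing.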
312
+
313
+ Next, we want to extract the issue's title, its ID, and the GitHub user who opened it and store that data somewhere.
314
+
315
+ For extracting the text from the HTML, we add two private helper methods that query the HTML for the text.
316
+
317
+ For storing the data, we introduce a [local]({{base}}/guides/locals.html) named `:records` which stores an array. In job actions, locals can be accessed and manipulated. But now that we've bumped up the thread count, multiple instances of our job class will run concurrently. That's why locals declared with `::let` are replaced with thread-safe counterparts behind the scenes.
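What a thread-safe counterpart of an array might look like can be sketched with a `Mutex`. `SafeArray` is a hypothetical class written for this illustration; how Wayfarer actually wraps locals is not shown here.

```ruby
# Hypothetical thread-safe array wrapper, guarded by a Mutex.
class SafeArray
  def initialize
    @items = []
    @lock  = Mutex.new
  end

  def <<(item)
    @lock.synchronize { @items << item }
    self
  end

  def count
    @lock.synchronize { @items.size }
  end

  def to_a
    @lock.synchronize { @items.dup }
  end
end

records = SafeArray.new
threads = Array.new(4) do |n|
  Thread.new { 100.times { |i| records << { id: n * 100 + i } } }
end
threads.each(&:join)
puts records.count
```

Each of the four threads appends 100 records, and the lock ensures no append is lost.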
318
+
319
+ We stop processing with `halt` once we have collected 30 issue records:
320
+
321
+ {% highlight ruby %}
322
+ class CollectGithubIssues < Wayfarer::Job
323
+ config.connection_count = 4
324
+
325
+ let(:records) { [] }
326
+
327
+ routes do
328
+ host "github.com" do
329
+ path "/:user/:repo", to: :repository
330
+ path "/:user/:repo/issues", to: :index
331
+ path "/:user/:repo/issues/:id", to: :show
332
+ end
333
+ end
334
+
335
+ after_crawl do
336
+ records.each do |issue|
337
+ # Save them somewhere?
338
+ puts issue
339
+ end
340
+ end
341
+
342
+ def repository
343
+ stage navigation_links
344
+ end
345
+
346
+ def index
347
+ stage issue_listing_links, next_page
348
+ end
349
+
350
+ def show
351
+ return halt if records.count >= 30
352
+
353
+ records << {
354
+ id: params[:id],
355
+ title: issue_title,
356
+ author: issue_author
357
+ }
358
+ end
359
+
360
+ private
361
+
362
+ def issue_title
363
+ doc.css(".js-issue-title").text.strip
364
+ end
365
+
366
+ def issue_author
367
+ doc.css(".TableObject-item .author").text.strip
368
+ end
369
+
370
+ def navigation_links
371
+ page.links ".reponav-item"
372
+ end
373
+
374
+ def issue_listing_links
375
+ page.links ".issues-listing"
376
+ end
377
+
378
+ def next_page
379
+ page.links ".next_page"
380
+ end
381
+ end
382
+
383
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
384
+ {% endhighlight %}
385
+
386
+ For the last part, we turn off the debugging output (if you still have it enabled) and output each record. You'd probably want to store them somewhere at this point, e.g. by [writing them to a CSV file]({{base}}/recipes/csv.html) or putting them into a database.
387
+
388
+ {% highlight ruby %}
389
+ class CollectGithubIssues < Wayfarer::Job
390
+ config.connection_count = 4
391
+ config.logger.level = :fatal
392
+
393
+ let(:records) { [] }
394
+
395
+ routes do
396
+ host "github.com" do
397
+ path "/:user/:repo", to: :repository
398
+ path "/:user/:repo/issues", to: :index
399
+ path "/:user/:repo/issues/:id", to: :show
400
+ end
401
+ end
402
+
403
+ after_crawl do
404
+ records.each do |issue|
405
+ # Save them somewhere?
406
+ puts issue
407
+ end
408
+ end
409
+
410
+ def repository
411
+ stage navigation_links
412
+ end
413
+
414
+ def index
415
+ stage issue_listing_links, next_page
416
+ end
417
+
418
+ def show
419
+ return halt if records.count >= 30
420
+
421
+ records << {
422
+ id: params[:id],
423
+ title: issue_title,
424
+ author: issue_author
425
+ }
426
+ end
427
+
428
+ private
429
+
430
+ def issue_title
431
+ doc.css(".js-issue-title").text.strip
432
+ end
433
+
434
+ def issue_author
435
+ doc.css(".TableObject-item .author").text.strip
436
+ end
437
+
438
+ def navigation_links
439
+ page.links ".reponav-item"
440
+ end
441
+
442
+ def issue_listing_links
443
+ page.links ".issues-listing"
444
+ end
445
+
446
+ def next_page
447
+ page.links ".next_page"
448
+ end
449
+ end
450
+
451
+ CollectGithubIssues.perform_now("https://github.com/rails/rails")
452
+ {% endhighlight %}
data/docs/index.md ADDED
@@ -0,0 +1,82 @@
1
+ ---
2
+ layout: default
3
+ title: Versatile web crawling with (J)Ruby
4
+ ---
5
+
6
+ # Versatile web crawling with (J)Ruby
7
+
8
+ Wayfarer is the Swiss Army knife for web crawling.
9
+
10
+ MRI:
11
+ ```
12
+ % [sudo] gem install wayfarer
13
+ ```
14
+
15
+ JRuby:
16
+ ```
17
+ % [sudo] gem install wayfarer-jruby
18
+ ```
19
+
20
+ If you …
21
+
22
+ * __Need to crawl page graphs breadth-first__
23
+ * __Need to extract whatever data__
24
+ * Want to do it multi-threaded
25
+ * Want to integrate with Rails seamlessly
26
+ * Want to automate a web browser
27
+ * Need to execute arbitrary JavaScript
28
+ * Need URI normalization
29
+ * Need to take screenshots
30
+ * Want to use a job queue and make work happen later
31
+ * Want in-memory and Redis-backed frontiers, tries and Bloom filters
32
+
33
+ … then you might like Wayfarer!
34
+
35
+ ## What it looks like
36
+
37
+ Say you want to …
38
+
39
+ * Automate Google Chrome
40
+ * Start off with a random Wikipedia article
41
+ * Follow all links until you find a page with the word "Foobar"
42
+ * Take a screenshot of the page containing "Foobar"
43
+ * Extract keywords from every page you encounter
44
+ * Use 4 threads and Chrome processes to do so
45
+
46
+ This amounts to 16 lines of code with Wayfarer:
47
+
48
+ {% highlight ruby %}
49
+ require "wayfarer"
50
+
51
+ class FindFoobarOnWikipedia < Wayfarer::Job
52
+ config.http_adapter = :selenium
53
+ config.selenium_argv = [:chrome]
54
+ config.connection_count = 4
55
+
56
+ let(:keywords) { [] }
57
+
58
+ route.host "en.wikipedia.org", to: :article
59
+
60
+ def article
61
+ if page.body =~ /Foobar/
62
+ driver.save_screenshot("/tmp/foobar.png")
63
+ return halt
64
+ end
65
+
66
+ keywords << page.keywords
67
+ stage page.links
68
+ end
69
+ end
70
+
71
+ FindFoobarOnWikipedia.perform_now("https://en.wikipedia.org/wiki/Special:Random")
72
+ {% endhighlight %}
73
+
74
+ Wayfarer integrates with [ActiveJob]() and supports your favorite job queue out of the box. Your job is ready to be enqueued:
75
+ {% highlight ruby %}
76
+ FindFoobarOnWikipedia.perform_later("https://en.wikipedia.org/wiki/Special:Random")
77
+ {% endhighlight %}
78
+
79
+ ### Where to go from here
80
+
81
+ * [The tutorial]() shows how to collect all open issues from a GitHub repository
82
+ * Read the [API documentation]()
@@ -0,0 +1,11 @@
1
+ document.addEventListener("DOMContentLoaded", function() {
2
+ var links = document.querySelectorAll(".navigation__link");
3
+
4
+ for (var i = 0; i < links.length; i++) {
5
+ var link = links[i];
6
+
7
+ if (link.pathname === window.location.pathname) {
8
+ link.classList.add("navigation__link--active");
9
+ }
10
+ }
11
+ });
@@ -0,0 +1,20 @@
1
+ ---
2
+ layout: default
3
+ title: Contributing
4
+ ---
5
+
6
+ # Contributing
7
+
8
+ 1. Fork the repository
9
+ 2. Ensure the development dependencies are installed:
10
+ `% bundle install --with development`
11
+ 3. Make changes
12
+ 4. Ensure your (new?) tests pass:
13
+ `% bundle exec rake test`
14
+ 5. Autocorrect RuboCop offenses:
15
+ `% bundle exec rake rubocop:auto_correct`
16
+ 6. Fix remaining offenses or have a good excuse not to:
17
+ `% bundle exec rake rubocop`
18
+ 7. Write commit messages that are at least no worse than mine
19
+ 8. Open a pull request on GitHub
20
+ 9. Thank you
@@ -0,0 +1,11 @@
1
+ ---
2
+ layout: default
3
+ title: Testing
4
+ ---
5
+
6
+ # Testing
7
+
8
+ Tests run on MRI 2.3.1 and JRuby 9.1.6.0.
9
+
10
+ * Run `rake -T` to list all available (test) tasks.
11
+ * When running tests, an HTTP server binds to port 9876 for integration-level tests.
@@ -0,0 +1,23 @@
1
+ ---
2
+ layout: default
3
+ title: Authentication
4
+ ---
5
+
6
+ # Authentication
7
+
8
+ Authentication is best handled with the `setup_adapter` callback (see [Callbacks]({{base}}/guides/callbacks.html)).
9
+
10
+ {% highlight ruby %}
11
+ class DummyJob < Wayfarer::Job
12
+ config.http_adapter = :selenium
13
+
14
+ setup_adapter do |_, _, browser|
15
+ browser.visit("https://foo.com/login")
16
+
17
+ browser.fill_in("E-mail", with: "foo@bar.com")
18
+ browser.fill_in("Password", with: "password")
19
+
20
+ browser.click_button("Log in")
21
+ end
22
+ end
23
+ {% endhighlight %}
@@ -0,0 +1,29 @@
1
+ ---
2
+ layout: default
3
+ title: CSV output
4
+ ---
5
+
6
+ # CSV output
7
+
8
+ See:
9
+
10
+ * [Locals]({{base}}/guides/locals.html)
11
+ * [Callbacks]({{base}}/guides/callbacks.html)
12
+
13
+ {% highlight ruby %}
14
+ require "csv" # from Ruby's standard lib
15
+
16
+ class DummyJob < Wayfarer::Job
17
+ let(:records) { [] }
18
+
19
+ after_crawl do
20
+ CSV.open("output.csv", "w") do |csv|
21
+ records.each { |r| csv << [r[:id], r[:name]] }
22
+ end
23
+ end
24
+
25
+ def detail
26
+ records << { id: ..., name: ... }
27
+ end
28
+ end
29
+ {% endhighlight %}
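The CSV part of the recipe can be tried on its own, without a job class. The sketch below uses made-up sample records and `CSV.generate` (which builds a string instead of writing a file) so that the result is easy to inspect; the header row is an addition for illustration.

```ruby
require "csv"

# Sample records; the values are made up for illustration.
records = [
  { id: 1, name: "First" },
  { id: 2, name: "Second, with a comma" }
]

output = CSV.generate do |csv|
  csv << %w[id name] # header row
  records.each { |r| csv << [r[:id], r[:name]] }
end

puts output
```

Fields containing commas are quoted automatically, which is the main reason to prefer the `csv` library over joining strings by hand.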
@@ -0,0 +1,20 @@
1
+ ---
2
+ layout: default
3
+ title: Executing JavaScript
4
+ ---
5
+
6
+ # Executing JavaScript
7
+
8
+ In order to execute JavaScript in a page's DOM context, use the Selenium HTTP adapter and call `#execute_script` on the WebDriver object:
9
+
10
+ {% highlight ruby %}
11
+ class DummyJob < Wayfarer::Job
12
+ config.http_adapter = :selenium
13
+
14
+ # ...
15
+
16
+ def foo
17
+ pathname = driver.execute_script("return window.location.pathname")
18
+ end
19
+ end
20
+ {% endhighlight %}
@@ -0,0 +1,18 @@
1
+ ---
2
+ layout: default
3
+ title: Starting from multiple URIs
4
+ ---
5
+
6
+ # Starting from multiple URIs
7
+
8
+ You can pass in as many URIs as desired when performing jobs:
9
+
10
+ {% highlight ruby %}
11
+ class DummyJob < Wayfarer::Job
12
+ # ...
13
+ end
14
+
15
+ uris = [...]
16
+
17
+ DummyJob.perform_now(*uris)
18
+ {% endhighlight %}
@@ -0,0 +1,20 @@
1
+ ---
2
+ layout: default
3
+ title: Taking screenshots
4
+ ---
5
+
6
+ # Taking screenshots
7
+
8
+ In order to take screenshots, use the Selenium HTTP adapter and call `#save_screenshot` on the WebDriver object:
9
+
10
+ {% highlight ruby %}
11
+ class DummyJob < Wayfarer::Job
12
+ config.http_adapter = :selenium
13
+
14
+ # ...
15
+
16
+ def foo
17
+ driver.save_screenshot("my_screenshot.png")
18
+ end
19
+ end
20
+ {% endhighlight %}