wayfarer 0.4.6 → 0.4.7

Files changed (175)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/lint.yaml +25 -0
  3. data/.github/workflows/release.yaml +29 -0
  4. data/.github/workflows/tests.yaml +30 -0
  5. data/.gitignore +4 -0
  6. data/.rubocop.yml +5 -0
  7. data/.vale.ini +5 -0
  8. data/.yardopts +1 -3
  9. data/Dockerfile +5 -4
  10. data/Gemfile +3 -0
  11. data/Gemfile.lock +107 -102
  12. data/Rakefile +5 -56
  13. data/bin/wayfarer +1 -1
  14. data/docker-compose.yml +20 -9
  15. data/docs/cookbook/consent_screen.md +2 -2
  16. data/docs/cookbook/executing_javascript.md +3 -3
  17. data/docs/cookbook/navigation.md +12 -12
  18. data/docs/cookbook/querying_html.md +3 -3
  19. data/docs/cookbook/screenshots.md +2 -2
  20. data/docs/cookbook/user_agent.md +1 -1
  21. data/docs/design.md +36 -0
  22. data/docs/guides/callbacks.md +24 -126
  23. data/docs/guides/configuration.md +8 -8
  24. data/docs/guides/handlers.md +60 -0
  25. data/docs/guides/index.md +1 -0
  26. data/docs/guides/jobs/error_handling.md +40 -0
  27. data/docs/guides/jobs.md +99 -31
  28. data/docs/guides/navigation.md +1 -1
  29. data/docs/guides/networking/capybara.md +13 -22
  30. data/docs/guides/networking/custom_adapters.md +82 -41
  31. data/docs/guides/networking/ferrum.md +4 -4
  32. data/docs/guides/networking/http.md +9 -13
  33. data/docs/guides/networking/selenium.md +10 -11
  34. data/docs/guides/pages.md +76 -10
  35. data/docs/guides/redis.md +10 -0
  36. data/docs/guides/routing.md +74 -0
  37. data/docs/guides/tasks.md +33 -9
  38. data/docs/guides/tutorial.md +60 -0
  39. data/docs/guides/user_agents.md +113 -0
  40. data/docs/index.md +17 -40
  41. data/docs/reference/cli.md +35 -25
  42. data/docs/reference/configuration.md +36 -0
  43. data/lib/wayfarer/base.rb +124 -46
  44. data/lib/wayfarer/batch_completion.rb +56 -0
  45. data/lib/wayfarer/callbacks.rb +22 -48
  46. data/lib/wayfarer/cli/route_printer.rb +71 -57
  47. data/lib/wayfarer/cli.rb +121 -0
  48. data/lib/wayfarer/gc.rb +13 -6
  49. data/lib/wayfarer/handler.rb +15 -7
  50. data/lib/wayfarer/logging.rb +38 -0
  51. data/lib/wayfarer/middleware/base.rb +2 -0
  52. data/lib/wayfarer/middleware/batch_completion.rb +19 -0
  53. data/lib/wayfarer/middleware/content_type.rb +54 -0
  54. data/lib/wayfarer/middleware/controller.rb +19 -15
  55. data/lib/wayfarer/middleware/dedup.rb +16 -13
  56. data/lib/wayfarer/middleware/dispatch.rb +12 -4
  57. data/lib/wayfarer/middleware/normalize.rb +12 -11
  58. data/lib/wayfarer/middleware/redis.rb +15 -0
  59. data/lib/wayfarer/middleware/router.rb +33 -35
  60. data/lib/wayfarer/middleware/stage.rb +5 -5
  61. data/lib/wayfarer/middleware/uri_parser.rb +30 -0
  62. data/lib/wayfarer/middleware/user_agent.rb +49 -0
  63. data/lib/wayfarer/networking/capybara.rb +1 -1
  64. data/lib/wayfarer/networking/context.rb +2 -2
  65. data/lib/wayfarer/networking/ferrum.rb +2 -2
  66. data/lib/wayfarer/networking/follow.rb +12 -6
  67. data/lib/wayfarer/networking/http.rb +1 -1
  68. data/lib/wayfarer/networking/pool.rb +17 -12
  69. data/lib/wayfarer/networking/selenium.rb +3 -3
  70. data/lib/wayfarer/networking/strategy.rb +2 -2
  71. data/lib/wayfarer/page.rb +36 -14
  72. data/lib/wayfarer/parsing/xml.rb +6 -6
  73. data/lib/wayfarer/parsing.rb +24 -0
  74. data/lib/wayfarer/redis/barrier.rb +13 -21
  75. data/lib/wayfarer/redis/counter.rb +19 -9
  76. data/lib/wayfarer/redis/pool.rb +1 -1
  77. data/lib/wayfarer/redis/resettable.rb +19 -0
  78. data/lib/wayfarer/routing/dsl.rb +1 -0
  79. data/lib/wayfarer/routing/matchers/path.rb +4 -2
  80. data/lib/wayfarer/routing/root_route.rb +5 -1
  81. data/lib/wayfarer/routing/route.rb +4 -14
  82. data/lib/wayfarer/stringify.rb +22 -30
  83. data/lib/wayfarer/task.rb +12 -18
  84. data/lib/wayfarer.rb +28 -1
  85. data/mkdocs.yml +52 -7
  86. data/rake/docs.rake +26 -0
  87. data/rake/lint.rake +105 -0
  88. data/rake/release.rake +29 -0
  89. data/rake/tests.rake +28 -0
  90. data/requirements.txt +1 -1
  91. data/spec/base_spec.rb +140 -160
  92. data/spec/batch_completion_spec.rb +104 -0
  93. data/spec/cli/job_spec.rb +19 -23
  94. data/spec/cli/routing_spec.rb +101 -0
  95. data/spec/cli/version_spec.rb +1 -1
  96. data/spec/factories/task.rb +7 -1
  97. data/spec/fixtures/dummy_job.rb +5 -3
  98. data/spec/gc_spec.rb +8 -50
  99. data/spec/handler_spec.rb +1 -1
  100. data/spec/integration/callbacks_spec.rb +157 -45
  101. data/spec/integration/content_type_spec.rb +145 -0
  102. data/spec/integration/gc_spec.rb +44 -0
  103. data/spec/integration/handler_spec.rb +66 -0
  104. data/spec/integration/page_spec.rb +44 -29
  105. data/spec/integration/params_spec.rb +33 -25
  106. data/spec/integration/parsing_spec.rb +125 -0
  107. data/spec/integration/routing_spec.rb +18 -0
  108. data/spec/integration/stage_spec.rb +27 -20
  109. data/spec/middleware/batch_completion_spec.rb +34 -0
  110. data/spec/middleware/chain_spec.rb +8 -8
  111. data/spec/middleware/content_type_spec.rb +86 -0
  112. data/spec/middleware/controller_spec.rb +5 -5
  113. data/spec/middleware/dedup_spec.rb +38 -55
  114. data/spec/middleware/dispatch_spec.rb +23 -7
  115. data/spec/middleware/normalize_spec.rb +44 -13
  116. data/spec/middleware/router_spec.rb +29 -30
  117. data/spec/middleware/stage_spec.rb +8 -8
  118. data/spec/middleware/uri_parser_spec.rb +53 -0
  119. data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
  120. data/spec/networking/context_spec.rb +1 -1
  121. data/spec/networking/follow_spec.rb +2 -2
  122. data/spec/networking/pool_spec.rb +5 -5
  123. data/spec/networking/strategy.rb +2 -2
  124. data/spec/page_spec.rb +42 -20
  125. data/spec/parsing/xml_spec.rb +11 -12
  126. data/spec/redis/barrier_spec.rb +8 -48
  127. data/spec/redis/counter_spec.rb +13 -1
  128. data/spec/redis/pool_spec.rb +1 -1
  129. data/spec/spec_helpers.rb +27 -16
  130. data/spec/support/test_app.rb +8 -0
  131. data/spec/task_spec.rb +3 -24
  132. data/spec/wayfarer_spec.rb +1 -1
  133. data/wayfarer.gemspec +4 -3
  134. metadata +61 -51
  135. data/.github/workflows/ci.yaml +0 -32
  136. data/docs/guides/error_handling.md +0 -53
  137. data/docs/guides/networking.md +0 -94
  138. data/docs/guides/performance.md +0 -130
  139. data/docs/guides/reliability.md +0 -41
  140. data/docs/guides/routing/steering.md +0 -30
  141. data/docs/reference/api/base.md +0 -48
  142. data/docs/reference/configuration_keys.md +0 -43
  143. data/docs/reference/environment_variables.md +0 -83
  144. data/lib/wayfarer/cli/base.rb +0 -45
  145. data/lib/wayfarer/cli/generate.rb +0 -17
  146. data/lib/wayfarer/cli/job.rb +0 -56
  147. data/lib/wayfarer/cli/route.rb +0 -29
  148. data/lib/wayfarer/cli/runner.rb +0 -34
  149. data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
  150. data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
  151. data/lib/wayfarer/config/capybara.rb +0 -10
  152. data/lib/wayfarer/config/ferrum.rb +0 -11
  153. data/lib/wayfarer/config/networking.rb +0 -29
  154. data/lib/wayfarer/config/redis.rb +0 -14
  155. data/lib/wayfarer/config/root.rb +0 -11
  156. data/lib/wayfarer/config/selenium.rb +0 -21
  157. data/lib/wayfarer/config/strconv.rb +0 -45
  158. data/lib/wayfarer/config/struct.rb +0 -72
  159. data/lib/wayfarer/middleware/fetch.rb +0 -56
  160. data/lib/wayfarer/redis/connection.rb +0 -13
  161. data/lib/wayfarer/redis/version.rb +0 -19
  162. data/lib/wayfarer/routing/router.rb +0 -28
  163. data/spec/callbacks_spec.rb +0 -102
  164. data/spec/cli/generate_spec.rb +0 -39
  165. data/spec/config/capybara_spec.rb +0 -18
  166. data/spec/config/ferrum_spec.rb +0 -24
  167. data/spec/config/networking_spec.rb +0 -73
  168. data/spec/config/redis_spec.rb +0 -32
  169. data/spec/config/root_spec.rb +0 -31
  170. data/spec/config/selenium_spec.rb +0 -56
  171. data/spec/config/strconv_spec.rb +0 -58
  172. data/spec/config/struct_spec.rb +0 -66
  173. data/spec/integration/steering_spec.rb +0 -57
  174. data/spec/redis/version_spec.rb +0 -13
  175. data/spec/routing/router_spec.rb +0 -24
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wayfarer
 version: !ruby/object:Gem::Version
-  version: 0.4.6
+  version: 0.4.7
 platform: ruby
 authors:
 - Dominic Bauer
@@ -16,14 +16,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '6.0'
+        version: '7.1'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '6.0'
+        version: '7.1'
 - !ruby/object:Gem::Dependency
   name: addressable
   requirement: !ruby/object:Gem::Requirement
@@ -123,33 +123,33 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '3.0'
 - !ruby/object:Gem::Dependency
-  name: mustermann
+  name: mock_redis
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.1'
+        version: '0.29'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.1'
+        version: '0.29'
 - !ruby/object:Gem::Dependency
-  name: mock_redis
+  name: mustermann
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.29'
+        version: '1.1'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.29'
+        version: '1.1'
 - !ruby/object:Gem::Dependency
   name: net-http-persistent
   requirement: !ruby/object:Gem::Requirement
@@ -240,6 +240,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: zeitwerk
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.4'
 - !ruby/object:Gem::Dependency
   name: cuprite
   requirement: !ruby/object:Gem::Requirement
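The dependency hunks above use two constraint operators: `>=` sets only a lower bound, while `~>` (the pessimistic operator) also excludes the next major version when given a two-part version. A minimal sketch of how RubyGems evaluates them, using its own `Gem::Requirement` and `Gem::Version` classes. The helper `satisfies?` and the probed version numbers are illustrative assumptions, not part of the diff:

```ruby
require "rubygems"

# Hypothetical helper for this sketch: does `version` satisfy `requirement`?
def satisfies?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# The requirement strings come from the hunks above;
# the probed versions are made up for illustration.
puts satisfies?(">= 7.1", "7.0.8")   # below the lower bound
puts satisfies?(">= 7.1", "8.0.0")   # any later release passes
puts satisfies?("~> 2.4", "2.6.12")  # same major version, newer minor
puts satisfies?("~> 2.4", "3.0.0")   # next major version is excluded
```

Under these rules, `~> 2.4` is equivalent to `>= 2.4, < 3.0`.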
@@ -373,12 +387,15 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
-- ".github/workflows/ci.yaml"
+- ".github/workflows/lint.yaml"
+- ".github/workflows/release.yaml"
+- ".github/workflows/tests.yaml"
 - ".gitignore"
 - ".rbenv-gemsets"
 - ".rspec"
 - ".rubocop.yml"
 - ".ruby-version"
+- ".vale.ini"
 - ".yardopts"
 - Dockerfile
 - Gemfile
@@ -395,60 +412,53 @@ files:
 - docs/cookbook/querying_html.md
 - docs/cookbook/screenshots.md
 - docs/cookbook/user_agent.md
+- docs/design.md
 - docs/guides/callbacks.md
 - docs/guides/configuration.md
 - docs/guides/debugging.md
-- docs/guides/error_handling.md
+- docs/guides/handlers.md
+- docs/guides/index.md
 - docs/guides/jobs.md
+- docs/guides/jobs/error_handling.md
 - docs/guides/navigation.md
-- docs/guides/networking.md
 - docs/guides/networking/capybara.md
 - docs/guides/networking/custom_adapters.md
 - docs/guides/networking/ferrum.md
 - docs/guides/networking/http.md
 - docs/guides/networking/selenium.md
 - docs/guides/pages.md
-- docs/guides/performance.md
-- docs/guides/reliability.md
-- docs/guides/routing/steering.md
+- docs/guides/redis.md
+- docs/guides/routing.md
 - docs/guides/tasks.md
+- docs/guides/tutorial.md
+- docs/guides/user_agents.md
 - docs/index.md
-- docs/reference/api/base.md
 - docs/reference/api/route.md
 - docs/reference/cli.md
-- docs/reference/configuration_keys.md
-- docs/reference/environment_variables.md
+- docs/reference/configuration.md
 - lib/wayfarer.rb
 - lib/wayfarer/base.rb
+- lib/wayfarer/batch_completion.rb
 - lib/wayfarer/callbacks.rb
-- lib/wayfarer/cli/base.rb
-- lib/wayfarer/cli/generate.rb
-- lib/wayfarer/cli/job.rb
-- lib/wayfarer/cli/route.rb
+- lib/wayfarer/cli.rb
 - lib/wayfarer/cli/route_printer.rb
-- lib/wayfarer/cli/runner.rb
-- lib/wayfarer/cli/templates/Gemfile.tt
-- lib/wayfarer/cli/templates/job.rb.tt
-- lib/wayfarer/config/capybara.rb
-- lib/wayfarer/config/ferrum.rb
-- lib/wayfarer/config/networking.rb
-- lib/wayfarer/config/redis.rb
-- lib/wayfarer/config/root.rb
-- lib/wayfarer/config/selenium.rb
-- lib/wayfarer/config/strconv.rb
-- lib/wayfarer/config/struct.rb
 - lib/wayfarer/gc.rb
 - lib/wayfarer/handler.rb
+- lib/wayfarer/logging.rb
 - lib/wayfarer/middleware/base.rb
+- lib/wayfarer/middleware/batch_completion.rb
 - lib/wayfarer/middleware/chain.rb
+- lib/wayfarer/middleware/content_type.rb
 - lib/wayfarer/middleware/controller.rb
 - lib/wayfarer/middleware/dedup.rb
 - lib/wayfarer/middleware/dispatch.rb
-- lib/wayfarer/middleware/fetch.rb
 - lib/wayfarer/middleware/lazy.rb
 - lib/wayfarer/middleware/normalize.rb
+- lib/wayfarer/middleware/redis.rb
 - lib/wayfarer/middleware/router.rb
 - lib/wayfarer/middleware/stage.rb
+- lib/wayfarer/middleware/uri_parser.rb
+- lib/wayfarer/middleware/user_agent.rb
 - lib/wayfarer/networking/capybara.rb
 - lib/wayfarer/networking/context.rb
 - lib/wayfarer/networking/ferrum.rb
@@ -459,13 +469,13 @@ files:
 - lib/wayfarer/networking/selenium.rb
 - lib/wayfarer/networking/strategy.rb
 - lib/wayfarer/page.rb
+- lib/wayfarer/parsing.rb
 - lib/wayfarer/parsing/json.rb
 - lib/wayfarer/parsing/xml.rb
 - lib/wayfarer/redis/barrier.rb
-- lib/wayfarer/redis/connection.rb
 - lib/wayfarer/redis/counter.rb
 - lib/wayfarer/redis/pool.rb
-- lib/wayfarer/redis/version.rb
+- lib/wayfarer/redis/resettable.rb
 - lib/wayfarer/routing/dsl.rb
 - lib/wayfarer/routing/matchers/custom.rb
 - lib/wayfarer/routing/matchers/host.rb
@@ -478,26 +488,21 @@ files:
 - lib/wayfarer/routing/result.rb
 - lib/wayfarer/routing/root_route.rb
 - lib/wayfarer/routing/route.rb
-- lib/wayfarer/routing/router.rb
 - lib/wayfarer/routing/target_route.rb
 - lib/wayfarer/serializer.rb
 - lib/wayfarer/stringify.rb
 - lib/wayfarer/task.rb
 - mkdocs.yml
+- rake/docs.rake
+- rake/lint.rake
+- rake/release.rake
+- rake/tests.rake
 - requirements.txt
 - spec/base_spec.rb
-- spec/callbacks_spec.rb
-- spec/cli/generate_spec.rb
+- spec/batch_completion_spec.rb
 - spec/cli/job_spec.rb
+- spec/cli/routing_spec.rb
 - spec/cli/version_spec.rb
-- spec/config/capybara_spec.rb
-- spec/config/ferrum_spec.rb
-- spec/config/networking_spec.rb
-- spec/config/redis_spec.rb
-- spec/config/root_spec.rb
-- spec/config/selenium_spec.rb
-- spec/config/strconv_spec.rb
-- spec/config/struct_spec.rb
 - spec/factories/middleware.rb
 - spec/factories/page.rb
 - spec/factories/task.rb
@@ -505,18 +510,25 @@ files:
 - spec/gc_spec.rb
 - spec/handler_spec.rb
 - spec/integration/callbacks_spec.rb
+- spec/integration/content_type_spec.rb
+- spec/integration/gc_spec.rb
+- spec/integration/handler_spec.rb
 - spec/integration/page_spec.rb
 - spec/integration/params_spec.rb
+- spec/integration/parsing_spec.rb
+- spec/integration/routing_spec.rb
 - spec/integration/stage_spec.rb
-- spec/integration/steering_spec.rb
+- spec/middleware/batch_completion_spec.rb
 - spec/middleware/chain_spec.rb
+- spec/middleware/content_type_spec.rb
 - spec/middleware/controller_spec.rb
 - spec/middleware/dedup_spec.rb
 - spec/middleware/dispatch_spec.rb
-- spec/middleware/fetch_spec.rb
 - spec/middleware/normalize_spec.rb
 - spec/middleware/router_spec.rb
 - spec/middleware/stage_spec.rb
+- spec/middleware/uri_parser_spec.rb
+- spec/middleware/user_agent_spec.rb
 - spec/networking/capybara_spec.rb
 - spec/networking/context_spec.rb
 - spec/networking/ferrum_spec.rb
@@ -531,7 +543,6 @@ files:
 - spec/redis/barrier_spec.rb
 - spec/redis/counter_spec.rb
 - spec/redis/pool_spec.rb
-- spec/redis/version_spec.rb
 - spec/routing/dsl_spec.rb
 - spec/routing/integration_spec.rb
 - spec/routing/matchers/custom_spec.rb
@@ -544,7 +555,6 @@ files:
 - spec/routing/path_finder_spec.rb
 - spec/routing/root_route_spec.rb
 - spec/routing/route_spec.rb
-- spec/routing/router_spec.rb
 - spec/spec_helpers.rb
 - spec/stringify_spec.rb
 - spec/support/static/finders.html
@@ -1,32 +0,0 @@
-name: ci
-
-on:
-  push:
-    branches:
-      - '*'
-env:
-  CI: true
-
-jobs:
-  ci:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-
-      - name: Start services
-        run: docker-compose up -d
-
-      - name: Run isolated tests
-        run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:isolated
-
-      - name: Run Ferrum tests
-        run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:ferrum
-
-      - name: Run Selenium tests
-        run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:selenium
-
-      - name: Run CLI tests
-        run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:cli
-
-      - name: Run RuboCop
-        run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake rubocop
@@ -1,53 +0,0 @@
-# Error handling
-
-## Wayfarer never swallows exceptions
-
-* Wayfarer never swallows exceptions.
-* Jobs with unhandled exceptions are not retried.
-
-## Retrying or discarding failing jobs
-
-Wayfarer relies on [Active Job's two error handling facilities](https://guides.rubyonrails.org/active_job_basics.html#exceptions).
-
-* `retry_on` to retry jobs a number of times on certain errors:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  retry_on MyError, attempts: 3 do |job, error|
-    # This block runs once all 3 attempts have failed
-    # (1 initial attempt + 2 retries)
-
-    raise error
-  end
-end
-```
-
-* `discard_on` to throw away jobs on certain errors:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  discard_on MyError do |job, error|
-    # This block runs once and buries the job
-
-    raise error
-  end
-end
-```
-
-!!! attention "Always re-raise errors"
-
-You should always re-raise errors from `retry_on` and `discard_on` blocks,
-otherwise jobs will not get retried!
-
-## Renewing agents on certain errors
-
-```ruby
-Wayfarer.config.network.renew_on = [MyError]
-```
-
-For example, if you use the Capybara
-[Cuprite](https://github.com/rubycdp/cuprite) driver:
-
-```ruby
-Wayfarer.config.network.renew_on = [Ferrum::DeadBrowserError]
-```
@@ -1,94 +0,0 @@
-# Networking
-
-Wayfarer navigates the web in two ways:
-
-1. Via plain HTTP requests
-2. By automating browsers
-
-Both options are mutually exclusive per Ruby process.
-
-## User agents
-
-A user agent is an entity that knows how to retrieve the contents behind a URL.
-
-The user agent can be configured via the global configuration:
-
-```ruby
-Wayfarer.config.network.agent = :http # or :ferrum, :selenium
-```
-
-## Connection pooling
-
-Wayfarer keeps user agents within a connection pool. When a job executes
-and needs to retrieve the contents behind a URL, an agent is checked out from
-the pool.
-
-The pool has a constant size and it should equal the number of threads the
-underlying message queue operates with. The size can be configured via the
-global configuration:
-
-```ruby
-Wayfarer.config.network.pool_size = 8
-```
-
-### Timeouts
-
-user agents may stay checked out from the pool by jobs for a limited time
-only. Once this time limit is exceeded, a `ConnectionPool::TimeoutError`
-exception is raised. This places a hard time limit on every job.
-
-The timeout can be configured via the global configuration:
-
-```ruby
-Wayfarer.config.network.pool_timeout = 20 # seconds
-```
-
-Because jobs with unhandled exceptions fail, explicit error handling is required
-if retries are desired:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  retry_on ConnectionPool::TimeoutError, attempts: 3
-end
-```
-
-## Agent-specific client timeouts
-
-The time in seconds it may take to communicate with remote browser processes can
-be configured globally per agent:
-
-```ruby
-Wayfarer.config.ferrum.options = { timeout: 5 }
-Wayfarer.config.selenium.client_timeout = 60
-```
-
-### Shared state
-
-As user agents get checked in and out continously between jobs, their state
-carries over from job to job, too.
-
-For browser automation, this means:
-
-* A job finds the browser at the last URL the previous job has left off.
-* The browser's cookies might have been set, or other client-side state might
-exist that significantly affects a page's behaviour.
-
-## HTTP redirect handling
-
-Browsers follow redirects transparently when they are navigated to a URL.
-
-When using plain HTTP, redirect URLs are enqueued transparently within the same
-batch. URLs that result in 3xx responses will not be retrieved again within
-their batch.
-
-## HTTP request headers
-
-Request headers can be configured via the global configuration:
-
-```ruby
-Wayfarer.config.network.http_headers = { "Field" => "Value" }
-```
-
-!!! attention "Partial support"
-
-Selenium does not support configuring HTTP request headers.
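The removed networking guide above describes a fixed-size pool from which jobs check out user agents, with a timeout once the pool is exhausted. A minimal sketch of that idea using only Ruby's core `Queue` and the standard `timeout` library. `TinyPool`, its `with` method, and the string agents are hypothetical names for illustration; this is not Wayfarer's implementation, and it raises `Timeout::Error` rather than the `ConnectionPool::TimeoutError` the guide mentions:

```ruby
require "timeout"

# Hypothetical fixed-size pool, for illustration only.
class TinyPool
  # Builds `size` agents up front by calling the given block with an index.
  def initialize(size)
    @agents = Queue.new
    size.times { |i| @agents << yield(i) }
  end

  # Checks out an agent, waiting at most `timeout` seconds for one to
  # become free, and checks it back in when the block finishes.
  def with(timeout:)
    agent = Timeout.timeout(timeout) { @agents.pop }
    begin
      yield agent
    ensure
      @agents << agent
    end
  end
end

pool = TinyPool.new(1) { |i| "agent-#{i}" }

pool.with(timeout: 1) do |agent|
  puts "checked out #{agent}"
  begin
    # The only agent is already checked out, so this second checkout times out.
    pool.with(timeout: 0.05) { |other| puts other }
  rescue Timeout::Error
    puts "pool exhausted"
  end
end
```

This illustrates why the guide recommends a pool size equal to the number of worker threads: with fewer agents than threads, the surplus threads block in the checkout until an agent is returned or the timeout fires.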
@@ -1,130 +0,0 @@
-# Performance
-
-How to write performant crawlers with Wayfarer.
-
-## Use a sufficiently sized user agent pool
-
-Automated browser processes or HTTP clients are kept in a [connection pool]() of
-static size. This avoids having to re-establish browser processes and enables
-their reuse.
-
-If the size of the pool is too small, the pool is a
-bottleneck. For example, if your message queue adapter uses 8 threads, but the
-pool only contains 1 user agent, the remaining 7 threads block until the agent
-is checked back in to the pool for use by one of the blocked threads.
-
-There is no reliable way to detect the number of threads of the underlying
-message queue adapter. The pool size should equal the number of threads;
-
-```ruby
-Wayfarer.config.network.pool_size = 8 # defaults to 1
-```
-
-### Job shedding
-
-There is a maximum number of seconds that jobs wait when checking out a user
-agent from the pool. Once this time is exceeded,
-a `Wayfarer::UserAgentTimeoutError` is raised. By default, the timeout is 10
-seconds.
-
-This hints there are more threads in use than user agents in the pool.
-
-## Stage less URLs
-
-Staging less URLs saves space and time:
-
-* Less tasks written to the message queue
-* Less time spent consuming tasks
-* Less time spent filtering URLs with Redis
-
-Wayfarer maintains a set of processed URLs for a batch in Redis. Every staged
-URL is checked for inclusion in this set before it gets appended as a task to
-the message queue.
-
-A common pattern is to stage all links of a page, and rely on routing to fetch
-only the relevant ones:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  route { to: index, host: "example.com" }
-
-  def index
-    stage page.meta.links.all
-  end
-end
-```
-
-Pages commonly contain a large number of URLs.
-
-Every staged URL is:
-
-1. Normalized to a canonical form, for example by sorting query parameters
-alphabetically.
-2. Checked for inclusion in the batch Redis set or discarded.
-3. Written to the message queue.
-4. Consumed from the queue and matched against the router.
-5. Fetched, if a route matches.
-
-Narrowing down the links in the document to follow speeds up the process.
-For example using Nokogiri, interesting links can be identified with a CSS
-selector:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  route { to: index, host: "example.com" }
-
-  def index
-    stage interesting_links
-  end
-
-  private
-
-  def interesting_links
-    page.doc.css("a.interesting").map { |elem| elem["href"] }
-  end
-end
-```
-
-Because the router only accepts the single hostname `example.com`, the job can
-also ensure it stages only internal URLs by intersecting them with the
-interesting ones:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  route { to: index, host: "example.com" }
-
-  def index
-    stage interesting_internal_links
-  end
-
-  private
-
-  def interesting_internal_links
-    page.meta.links.internal & interesting_links
-  end
-
-  def interesting_links
-    page.doc.css("a.interesting").map { |elem| elem["href"] }
-  end
-end
-```
-
-
-## Use Redis >= 6.2.0
-
-Redis 6.2.0 introduced the
-[`SMISMEMBER`](https://redis.io/commands/smismember) command which enables
-Wayfarer to check whether multiple URLs have been processed in a batch with a
-single command. With earlier versions, one command per URL is required.
-
-Wayfarer detects the Redis server version and uses `SMISMEMBER` without user
-configuration when supported.
-
-## Use Oj for JSON parsing
-
-Wayfarer uses [Oj](https://github.com/ohler55/oj) for JSON parsing if the gem
-has been required at runtime:
-
-```ruby
-require "oj"
-```
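The first two staging steps listed in the removed performance guide above (normalize each URL, then discard it if the batch has already seen it) can be sketched with Ruby's standard library alone. This is an illustration under stated assumptions, not Wayfarer's code: `normalize_url` and `stage` are hypothetical helpers, a plain in-memory `Set` stands in for the per-batch Redis set, and sorting query parameters is the only normalization applied:

```ruby
require "uri"
require "set"

# Hypothetical helper: sorts query parameters so that URLs differing only
# in parameter order normalize to the same string.
def normalize_url(url)
  uri = URI.parse(url)
  if uri.query
    pairs = URI.decode_www_form(uri.query)
    uri.query = URI.encode_www_form(pairs.sort)
  end
  uri.to_s
end

# Hypothetical stand-in for staging: `seen` plays the role of the
# per-batch Redis set. Returns only the URLs not seen before.
def stage(urls, seen)
  urls.map { |url| normalize_url(url) }.select { |url| seen.add?(url) }
end

seen = Set.new
fresh = stage(
  [
    "https://example.com/?b=2&a=1",
    "https://example.com/?a=1&b=2", # same URL after normalization
    "https://example.com/contact"
  ],
  seen
)
puts fresh
```

Only the URLs returned by `stage` would go on to be written to the message queue, which is why staging fewer URLs reduces work at every later step.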
@@ -1,41 +0,0 @@
-# Reliablity
-
-## Durability
-
-Wayfarer executes atop reliable messages queues such as Sidekiq, Resque,
-RabbitMQ, etc. Its configuration is independent of the underlying queue
-infrastructure it reads from and writes to.
-
-## Self-healing user agents
-
-Wayfarer handles the scenario where a remote browser process has crashed and
-must be replaced by a fresh browser process.
-
-This can be tested locally by automating a browser with headless mode turned
-off, and then closing the opened browser window: The current job fails, but the
-next job has access to a newly established browser session again.
-
-For example Ferrum might raise `Ferrum::DeadBrowserError`. Wayfarer's
-user agents are self-healing and react to these kinds of errors internally. When
-a browser window is closed, the Ferrum user agent attempts to establish a new
-browser process as a replacement, for the next job to use.
-
-[Wayfarer never swallows exceptions](/guides/error_handling). This means
-that even though the user agent might heal itself, jobs still need to explicitly
-retry browser errors:
-
-```ruby
-class Foobar < Wayfarer::Base
-  route { to: :index }
-
-  retry_on Ferrum::DeadBrowserError, attempts: 3, wait: :exponentially_longer
-
-  # ...
-end
-```
-
-This leads to log entries like:
-
-```
-Retrying DummyJob in 3 seconds, due to a Ferrum::DeadBrowserError.
-```
@@ -1,30 +0,0 @@
-# Steering
-
-A job's router can receive arguments computed dynamically by `::steer`.
-Steering enables [batch routing](/cookbook/batch_routing).
-
-For example, the following router has hostname and path hard-coded:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  route do
-    host "example.com", path: "/contact", to: :index
-  end
-end
-```
-
-Instead, hostname and path could be provided by `::steer`, too:
-
-```ruby
-class DummyJob < Wayfarer::Base
-  route do |hostname, path|
-    host hostname, path: path, to: :index
-  end
-
-  steer do |_task|
-    ["example.com", "/contact"]
-  end
-end
-```
-
-Note that `steer` yields the current [task](/guides/tasks).
@@ -1,48 +0,0 @@
----
-title: Wayfarer::Base
----
-
-# `Wayfarer::Base`
-
-Wayfarer's complete job API.
-
----
-
-### `::route`
-: Draw routes to instance methods.
-
----
-
-### `::steer { (Wayfarer::Task) -> [any] }`
-: Provide router arguments.
-
----
-
-### `#task -> Wayfarer::Task`
-: The currently processing task.
-
----
-
-### `#params -> Hash`
-: URL parameters collected from the matching route.
-
----
-
-### `#stage(String | [String]) -> void`
-: Add URLs to a processing set. URLs already processed within the
-current batch get discarded are not enqueued. Every staged URL gets
-normalized.
-
----
-
-### `#browser -> Object`
-: The user agent that retrieved the current page.
-
----
-
-### `#page(live: true | false) -> Page`
-: The page representing the response retrieved from the currently
-processing URL.
-
-With `live: true` called, a fresh `Page` is returned that reflects the
-current browser DOM. Calls to `#page` return the most recent page.