wayfarer 0.4.5 → 0.4.7

Files changed (175)
  1. checksums.yaml +4 -4
  2. data/.github/workflows/lint.yaml +25 -0
  3. data/.github/workflows/release.yaml +29 -0
  4. data/.github/workflows/tests.yaml +30 -0
  5. data/.gitignore +4 -0
  6. data/.rubocop.yml +5 -0
  7. data/.vale.ini +5 -0
  8. data/.yardopts +1 -3
  9. data/Dockerfile +5 -4
  10. data/Gemfile +3 -0
  11. data/Gemfile.lock +107 -102
  12. data/Rakefile +5 -56
  13. data/bin/wayfarer +1 -1
  14. data/docker-compose.yml +20 -9
  15. data/docs/cookbook/consent_screen.md +2 -2
  16. data/docs/cookbook/executing_javascript.md +3 -3
  17. data/docs/cookbook/navigation.md +12 -12
  18. data/docs/cookbook/querying_html.md +3 -3
  19. data/docs/cookbook/screenshots.md +2 -2
  20. data/docs/cookbook/user_agent.md +1 -1
  21. data/docs/design.md +36 -0
  22. data/docs/guides/callbacks.md +24 -126
  23. data/docs/guides/configuration.md +8 -8
  24. data/docs/guides/handlers.md +60 -0
  25. data/docs/guides/index.md +1 -0
  26. data/docs/guides/jobs/error_handling.md +40 -0
  27. data/docs/guides/jobs.md +99 -31
  28. data/docs/guides/navigation.md +1 -1
  29. data/docs/guides/networking/capybara.md +13 -22
  30. data/docs/guides/networking/custom_adapters.md +82 -41
  31. data/docs/guides/networking/ferrum.md +4 -4
  32. data/docs/guides/networking/http.md +9 -13
  33. data/docs/guides/networking/selenium.md +10 -11
  34. data/docs/guides/pages.md +76 -10
  35. data/docs/guides/redis.md +10 -0
  36. data/docs/guides/routing.md +74 -0
  37. data/docs/guides/tasks.md +33 -9
  38. data/docs/guides/tutorial.md +60 -0
  39. data/docs/guides/user_agents.md +113 -0
  40. data/docs/index.md +17 -40
  41. data/docs/reference/cli.md +35 -25
  42. data/docs/reference/configuration.md +36 -0
  43. data/lib/wayfarer/base.rb +124 -46
  44. data/lib/wayfarer/batch_completion.rb +56 -0
  45. data/lib/wayfarer/callbacks.rb +22 -48
  46. data/lib/wayfarer/cli/route_printer.rb +71 -57
  47. data/lib/wayfarer/cli.rb +121 -0
  48. data/lib/wayfarer/gc.rb +13 -6
  49. data/lib/wayfarer/handler.rb +15 -7
  50. data/lib/wayfarer/logging.rb +38 -0
  51. data/lib/wayfarer/middleware/base.rb +2 -0
  52. data/lib/wayfarer/middleware/batch_completion.rb +19 -0
  53. data/lib/wayfarer/middleware/content_type.rb +54 -0
  54. data/lib/wayfarer/middleware/controller.rb +19 -15
  55. data/lib/wayfarer/middleware/dedup.rb +16 -13
  56. data/lib/wayfarer/middleware/dispatch.rb +12 -4
  57. data/lib/wayfarer/middleware/normalize.rb +12 -11
  58. data/lib/wayfarer/middleware/redis.rb +15 -0
  59. data/lib/wayfarer/middleware/router.rb +33 -35
  60. data/lib/wayfarer/middleware/stage.rb +5 -5
  61. data/lib/wayfarer/middleware/uri_parser.rb +30 -0
  62. data/lib/wayfarer/middleware/user_agent.rb +49 -0
  63. data/lib/wayfarer/networking/capybara.rb +1 -1
  64. data/lib/wayfarer/networking/context.rb +2 -2
  65. data/lib/wayfarer/networking/ferrum.rb +2 -2
  66. data/lib/wayfarer/networking/follow.rb +12 -6
  67. data/lib/wayfarer/networking/http.rb +1 -1
  68. data/lib/wayfarer/networking/pool.rb +17 -12
  69. data/lib/wayfarer/networking/selenium.rb +3 -3
  70. data/lib/wayfarer/networking/strategy.rb +2 -2
  71. data/lib/wayfarer/page.rb +36 -14
  72. data/lib/wayfarer/parsing/xml.rb +6 -6
  73. data/lib/wayfarer/parsing.rb +24 -0
  74. data/lib/wayfarer/redis/barrier.rb +13 -21
  75. data/lib/wayfarer/redis/counter.rb +19 -9
  76. data/lib/wayfarer/redis/pool.rb +1 -1
  77. data/lib/wayfarer/redis/resettable.rb +19 -0
  78. data/lib/wayfarer/routing/dsl.rb +1 -0
  79. data/lib/wayfarer/routing/matchers/path.rb +4 -2
  80. data/lib/wayfarer/routing/root_route.rb +5 -1
  81. data/lib/wayfarer/routing/route.rb +4 -14
  82. data/lib/wayfarer/stringify.rb +22 -30
  83. data/lib/wayfarer/task.rb +12 -18
  84. data/lib/wayfarer.rb +29 -2
  85. data/mkdocs.yml +52 -7
  86. data/rake/docs.rake +26 -0
  87. data/rake/lint.rake +105 -0
  88. data/rake/release.rake +29 -0
  89. data/rake/tests.rake +28 -0
  90. data/requirements.txt +1 -1
  91. data/spec/base_spec.rb +140 -160
  92. data/spec/batch_completion_spec.rb +104 -0
  93. data/spec/cli/job_spec.rb +19 -23
  94. data/spec/cli/routing_spec.rb +101 -0
  95. data/spec/cli/version_spec.rb +1 -1
  96. data/spec/factories/task.rb +7 -1
  97. data/spec/fixtures/dummy_job.rb +5 -3
  98. data/spec/gc_spec.rb +8 -50
  99. data/spec/handler_spec.rb +1 -1
  100. data/spec/integration/callbacks_spec.rb +157 -45
  101. data/spec/integration/content_type_spec.rb +145 -0
  102. data/spec/integration/gc_spec.rb +44 -0
  103. data/spec/integration/handler_spec.rb +66 -0
  104. data/spec/integration/page_spec.rb +44 -29
  105. data/spec/integration/params_spec.rb +33 -25
  106. data/spec/integration/parsing_spec.rb +125 -0
  107. data/spec/integration/routing_spec.rb +18 -0
  108. data/spec/integration/stage_spec.rb +27 -20
  109. data/spec/middleware/batch_completion_spec.rb +34 -0
  110. data/spec/middleware/chain_spec.rb +8 -8
  111. data/spec/middleware/content_type_spec.rb +86 -0
  112. data/spec/middleware/controller_spec.rb +5 -5
  113. data/spec/middleware/dedup_spec.rb +38 -55
  114. data/spec/middleware/dispatch_spec.rb +23 -7
  115. data/spec/middleware/normalize_spec.rb +44 -13
  116. data/spec/middleware/router_spec.rb +29 -30
  117. data/spec/middleware/stage_spec.rb +8 -8
  118. data/spec/middleware/uri_parser_spec.rb +53 -0
  119. data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
  120. data/spec/networking/context_spec.rb +17 -0
  121. data/spec/networking/follow_spec.rb +2 -2
  122. data/spec/networking/pool_spec.rb +5 -5
  123. data/spec/networking/strategy.rb +2 -2
  124. data/spec/page_spec.rb +42 -20
  125. data/spec/parsing/xml_spec.rb +11 -12
  126. data/spec/redis/barrier_spec.rb +8 -48
  127. data/spec/redis/counter_spec.rb +13 -1
  128. data/spec/redis/pool_spec.rb +1 -1
  129. data/spec/spec_helpers.rb +27 -16
  130. data/spec/support/test_app.rb +8 -0
  131. data/spec/task_spec.rb +3 -24
  132. data/spec/wayfarer_spec.rb +1 -1
  133. data/wayfarer.gemspec +4 -3
  134. metadata +61 -51
  135. data/.github/workflows/ci.yaml +0 -32
  136. data/docs/guides/error_handling.md +0 -31
  137. data/docs/guides/networking.md +0 -94
  138. data/docs/guides/performance.md +0 -130
  139. data/docs/guides/reliability.md +0 -41
  140. data/docs/guides/routing/steering.md +0 -30
  141. data/docs/reference/api/base.md +0 -48
  142. data/docs/reference/configuration_keys.md +0 -42
  143. data/docs/reference/environment_variables.md +0 -83
  144. data/lib/wayfarer/cli/base.rb +0 -45
  145. data/lib/wayfarer/cli/generate.rb +0 -17
  146. data/lib/wayfarer/cli/job.rb +0 -56
  147. data/lib/wayfarer/cli/route.rb +0 -29
  148. data/lib/wayfarer/cli/runner.rb +0 -34
  149. data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
  150. data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
  151. data/lib/wayfarer/config/capybara.rb +0 -10
  152. data/lib/wayfarer/config/ferrum.rb +0 -11
  153. data/lib/wayfarer/config/networking.rb +0 -26
  154. data/lib/wayfarer/config/redis.rb +0 -14
  155. data/lib/wayfarer/config/root.rb +0 -11
  156. data/lib/wayfarer/config/selenium.rb +0 -21
  157. data/lib/wayfarer/config/strconv.rb +0 -45
  158. data/lib/wayfarer/config/struct.rb +0 -72
  159. data/lib/wayfarer/middleware/fetch.rb +0 -56
  160. data/lib/wayfarer/redis/connection.rb +0 -13
  161. data/lib/wayfarer/redis/version.rb +0 -19
  162. data/lib/wayfarer/routing/router.rb +0 -28
  163. data/spec/callbacks_spec.rb +0 -102
  164. data/spec/cli/generate_spec.rb +0 -39
  165. data/spec/config/capybara_spec.rb +0 -18
  166. data/spec/config/ferrum_spec.rb +0 -24
  167. data/spec/config/networking_spec.rb +0 -73
  168. data/spec/config/redis_spec.rb +0 -32
  169. data/spec/config/root_spec.rb +0 -31
  170. data/spec/config/selenium_spec.rb +0 -56
  171. data/spec/config/strconv_spec.rb +0 -58
  172. data/spec/config/struct_spec.rb +0 -66
  173. data/spec/integration/steering_spec.rb +0 -57
  174. data/spec/redis/version_spec.rb +0 -13
  175. data/spec/routing/router_spec.rb +0 -24
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: wayfarer
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.5
4
+ version: 0.4.7
5
5
  platform: ruby
6
6
  authors:
7
7
  - Dominic Bauer
@@ -16,14 +16,14 @@ dependencies:
16
16
  requirements:
17
17
  - - ">="
18
18
  - !ruby/object:Gem::Version
19
- version: '6.0'
19
+ version: '7.1'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - ">="
25
25
  - !ruby/object:Gem::Version
26
- version: '6.0'
26
+ version: '7.1'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: addressable
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -123,33 +123,33 @@ dependencies:
123
123
  - !ruby/object:Gem::Version
124
124
  version: '3.0'
125
125
  - !ruby/object:Gem::Dependency
126
- name: mustermann
126
+ name: mock_redis
127
127
  requirement: !ruby/object:Gem::Requirement
128
128
  requirements:
129
129
  - - "~>"
130
130
  - !ruby/object:Gem::Version
131
- version: '1.1'
131
+ version: '0.29'
132
132
  type: :runtime
133
133
  prerelease: false
134
134
  version_requirements: !ruby/object:Gem::Requirement
135
135
  requirements:
136
136
  - - "~>"
137
137
  - !ruby/object:Gem::Version
138
- version: '1.1'
138
+ version: '0.29'
139
139
  - !ruby/object:Gem::Dependency
140
- name: mock_redis
140
+ name: mustermann
141
141
  requirement: !ruby/object:Gem::Requirement
142
142
  requirements:
143
143
  - - "~>"
144
144
  - !ruby/object:Gem::Version
145
- version: '0.29'
145
+ version: '1.1'
146
146
  type: :runtime
147
147
  prerelease: false
148
148
  version_requirements: !ruby/object:Gem::Requirement
149
149
  requirements:
150
150
  - - "~>"
151
151
  - !ruby/object:Gem::Version
152
- version: '0.29'
152
+ version: '1.1'
153
153
  - !ruby/object:Gem::Dependency
154
154
  name: net-http-persistent
155
155
  requirement: !ruby/object:Gem::Requirement
@@ -240,6 +240,20 @@ dependencies:
240
240
  - - "~>"
241
241
  - !ruby/object:Gem::Version
242
242
  version: '1.0'
243
+ - !ruby/object:Gem::Dependency
244
+ name: zeitwerk
245
+ requirement: !ruby/object:Gem::Requirement
246
+ requirements:
247
+ - - "~>"
248
+ - !ruby/object:Gem::Version
249
+ version: '2.4'
250
+ type: :runtime
251
+ prerelease: false
252
+ version_requirements: !ruby/object:Gem::Requirement
253
+ requirements:
254
+ - - "~>"
255
+ - !ruby/object:Gem::Version
256
+ version: '2.4'
243
257
  - !ruby/object:Gem::Dependency
244
258
  name: cuprite
245
259
  requirement: !ruby/object:Gem::Requirement
@@ -373,12 +387,15 @@ executables:
373
387
  extensions: []
374
388
  extra_rdoc_files: []
375
389
  files:
376
- - ".github/workflows/ci.yaml"
390
+ - ".github/workflows/lint.yaml"
391
+ - ".github/workflows/release.yaml"
392
+ - ".github/workflows/tests.yaml"
377
393
  - ".gitignore"
378
394
  - ".rbenv-gemsets"
379
395
  - ".rspec"
380
396
  - ".rubocop.yml"
381
397
  - ".ruby-version"
398
+ - ".vale.ini"
382
399
  - ".yardopts"
383
400
  - Dockerfile
384
401
  - Gemfile
@@ -395,60 +412,53 @@ files:
395
412
  - docs/cookbook/querying_html.md
396
413
  - docs/cookbook/screenshots.md
397
414
  - docs/cookbook/user_agent.md
415
+ - docs/design.md
398
416
  - docs/guides/callbacks.md
399
417
  - docs/guides/configuration.md
400
418
  - docs/guides/debugging.md
401
- - docs/guides/error_handling.md
419
+ - docs/guides/handlers.md
420
+ - docs/guides/index.md
402
421
  - docs/guides/jobs.md
422
+ - docs/guides/jobs/error_handling.md
403
423
  - docs/guides/navigation.md
404
- - docs/guides/networking.md
405
424
  - docs/guides/networking/capybara.md
406
425
  - docs/guides/networking/custom_adapters.md
407
426
  - docs/guides/networking/ferrum.md
408
427
  - docs/guides/networking/http.md
409
428
  - docs/guides/networking/selenium.md
410
429
  - docs/guides/pages.md
411
- - docs/guides/performance.md
412
- - docs/guides/reliability.md
413
- - docs/guides/routing/steering.md
430
+ - docs/guides/redis.md
431
+ - docs/guides/routing.md
414
432
  - docs/guides/tasks.md
433
+ - docs/guides/tutorial.md
434
+ - docs/guides/user_agents.md
415
435
  - docs/index.md
416
- - docs/reference/api/base.md
417
436
  - docs/reference/api/route.md
418
437
  - docs/reference/cli.md
419
- - docs/reference/configuration_keys.md
420
- - docs/reference/environment_variables.md
438
+ - docs/reference/configuration.md
421
439
  - lib/wayfarer.rb
422
440
  - lib/wayfarer/base.rb
441
+ - lib/wayfarer/batch_completion.rb
423
442
  - lib/wayfarer/callbacks.rb
424
- - lib/wayfarer/cli/base.rb
425
- - lib/wayfarer/cli/generate.rb
426
- - lib/wayfarer/cli/job.rb
427
- - lib/wayfarer/cli/route.rb
443
+ - lib/wayfarer/cli.rb
428
444
  - lib/wayfarer/cli/route_printer.rb
429
- - lib/wayfarer/cli/runner.rb
430
- - lib/wayfarer/cli/templates/Gemfile.tt
431
- - lib/wayfarer/cli/templates/job.rb.tt
432
- - lib/wayfarer/config/capybara.rb
433
- - lib/wayfarer/config/ferrum.rb
434
- - lib/wayfarer/config/networking.rb
435
- - lib/wayfarer/config/redis.rb
436
- - lib/wayfarer/config/root.rb
437
- - lib/wayfarer/config/selenium.rb
438
- - lib/wayfarer/config/strconv.rb
439
- - lib/wayfarer/config/struct.rb
440
445
  - lib/wayfarer/gc.rb
441
446
  - lib/wayfarer/handler.rb
447
+ - lib/wayfarer/logging.rb
442
448
  - lib/wayfarer/middleware/base.rb
449
+ - lib/wayfarer/middleware/batch_completion.rb
443
450
  - lib/wayfarer/middleware/chain.rb
451
+ - lib/wayfarer/middleware/content_type.rb
444
452
  - lib/wayfarer/middleware/controller.rb
445
453
  - lib/wayfarer/middleware/dedup.rb
446
454
  - lib/wayfarer/middleware/dispatch.rb
447
- - lib/wayfarer/middleware/fetch.rb
448
455
  - lib/wayfarer/middleware/lazy.rb
449
456
  - lib/wayfarer/middleware/normalize.rb
457
+ - lib/wayfarer/middleware/redis.rb
450
458
  - lib/wayfarer/middleware/router.rb
451
459
  - lib/wayfarer/middleware/stage.rb
460
+ - lib/wayfarer/middleware/uri_parser.rb
461
+ - lib/wayfarer/middleware/user_agent.rb
452
462
  - lib/wayfarer/networking/capybara.rb
453
463
  - lib/wayfarer/networking/context.rb
454
464
  - lib/wayfarer/networking/ferrum.rb
@@ -459,13 +469,13 @@ files:
459
469
  - lib/wayfarer/networking/selenium.rb
460
470
  - lib/wayfarer/networking/strategy.rb
461
471
  - lib/wayfarer/page.rb
472
+ - lib/wayfarer/parsing.rb
462
473
  - lib/wayfarer/parsing/json.rb
463
474
  - lib/wayfarer/parsing/xml.rb
464
475
  - lib/wayfarer/redis/barrier.rb
465
- - lib/wayfarer/redis/connection.rb
466
476
  - lib/wayfarer/redis/counter.rb
467
477
  - lib/wayfarer/redis/pool.rb
468
- - lib/wayfarer/redis/version.rb
478
+ - lib/wayfarer/redis/resettable.rb
469
479
  - lib/wayfarer/routing/dsl.rb
470
480
  - lib/wayfarer/routing/matchers/custom.rb
471
481
  - lib/wayfarer/routing/matchers/host.rb
@@ -478,26 +488,21 @@ files:
478
488
  - lib/wayfarer/routing/result.rb
479
489
  - lib/wayfarer/routing/root_route.rb
480
490
  - lib/wayfarer/routing/route.rb
481
- - lib/wayfarer/routing/router.rb
482
491
  - lib/wayfarer/routing/target_route.rb
483
492
  - lib/wayfarer/serializer.rb
484
493
  - lib/wayfarer/stringify.rb
485
494
  - lib/wayfarer/task.rb
486
495
  - mkdocs.yml
496
+ - rake/docs.rake
497
+ - rake/lint.rake
498
+ - rake/release.rake
499
+ - rake/tests.rake
487
500
  - requirements.txt
488
501
  - spec/base_spec.rb
489
- - spec/callbacks_spec.rb
490
- - spec/cli/generate_spec.rb
502
+ - spec/batch_completion_spec.rb
491
503
  - spec/cli/job_spec.rb
504
+ - spec/cli/routing_spec.rb
492
505
  - spec/cli/version_spec.rb
493
- - spec/config/capybara_spec.rb
494
- - spec/config/ferrum_spec.rb
495
- - spec/config/networking_spec.rb
496
- - spec/config/redis_spec.rb
497
- - spec/config/root_spec.rb
498
- - spec/config/selenium_spec.rb
499
- - spec/config/strconv_spec.rb
500
- - spec/config/struct_spec.rb
501
506
  - spec/factories/middleware.rb
502
507
  - spec/factories/page.rb
503
508
  - spec/factories/task.rb
@@ -505,18 +510,25 @@ files:
505
510
  - spec/gc_spec.rb
506
511
  - spec/handler_spec.rb
507
512
  - spec/integration/callbacks_spec.rb
513
+ - spec/integration/content_type_spec.rb
514
+ - spec/integration/gc_spec.rb
515
+ - spec/integration/handler_spec.rb
508
516
  - spec/integration/page_spec.rb
509
517
  - spec/integration/params_spec.rb
518
+ - spec/integration/parsing_spec.rb
519
+ - spec/integration/routing_spec.rb
510
520
  - spec/integration/stage_spec.rb
511
- - spec/integration/steering_spec.rb
521
+ - spec/middleware/batch_completion_spec.rb
512
522
  - spec/middleware/chain_spec.rb
523
+ - spec/middleware/content_type_spec.rb
513
524
  - spec/middleware/controller_spec.rb
514
525
  - spec/middleware/dedup_spec.rb
515
526
  - spec/middleware/dispatch_spec.rb
516
- - spec/middleware/fetch_spec.rb
517
527
  - spec/middleware/normalize_spec.rb
518
528
  - spec/middleware/router_spec.rb
519
529
  - spec/middleware/stage_spec.rb
530
+ - spec/middleware/uri_parser_spec.rb
531
+ - spec/middleware/user_agent_spec.rb
520
532
  - spec/networking/capybara_spec.rb
521
533
  - spec/networking/context_spec.rb
522
534
  - spec/networking/ferrum_spec.rb
@@ -531,7 +543,6 @@ files:
531
543
  - spec/redis/barrier_spec.rb
532
544
  - spec/redis/counter_spec.rb
533
545
  - spec/redis/pool_spec.rb
534
- - spec/redis/version_spec.rb
535
546
  - spec/routing/dsl_spec.rb
536
547
  - spec/routing/integration_spec.rb
537
548
  - spec/routing/matchers/custom_spec.rb
@@ -544,7 +555,6 @@ files:
544
555
  - spec/routing/path_finder_spec.rb
545
556
  - spec/routing/root_route_spec.rb
546
557
  - spec/routing/route_spec.rb
547
- - spec/routing/router_spec.rb
548
558
  - spec/spec_helpers.rb
549
559
  - spec/stringify_spec.rb
550
560
  - spec/support/static/finders.html
@@ -1,32 +0,0 @@
1
- name: ci
2
-
3
- on:
4
- push:
5
- branches:
6
- - '*'
7
- env:
8
- CI: true
9
-
10
- jobs:
11
- ci:
12
- runs-on: ubuntu-latest
13
- steps:
14
- - uses: actions/checkout@v2
15
-
16
- - name: Start services
17
- run: docker-compose up -d
18
-
19
- - name: Run isolated tests
20
- run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:isolated
21
-
22
- - name: Run Ferrum tests
23
- run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:ferrum
24
-
25
- - name: Run Selenium tests
26
- run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:selenium
27
-
28
- - name: Run CLI tests
29
- run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:cli
30
-
31
- - name: Run RuboCop
32
- run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake rubocop
@@ -1,31 +0,0 @@
1
- # Error handling
2
-
3
- ## Wayfarer never swallows exceptions
4
-
5
- * Wayfarer never swallows exceptions.
6
- * Jobs with unhandled exceptions are not retried.
7
-
8
- ## Retrying and discarding
9
-
10
- Wayfarer relies on [Active Job's two error handling facilities](https://guides.rubyonrails.org/active_job_basics.html#exceptions).
11
-
12
- * `retry_on` to retry jobs a number of times on certain errors:
13
-
14
- ```ruby
15
- class DummyJob < Wayfarer::Base
16
- retry_on MyError, attempts: 3 do |job, error|
17
- # This block runs once all 3 attempts have failed
18
- # (1 initial attempt + 2 retries)
19
- end
20
- end
21
- ```
22
-
23
- * `discard_on` to throw away jobs on certain errors:
24
-
25
- ```ruby
26
- class DummyJob < Wayfarer::Base
27
- discard_on MyError do |job, error|
28
- # This block runs once and buries the job
29
- end
30
- end
31
- ```
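The `attempts: 3` semantics described in the removed guide (one initial attempt plus two retries, then the block runs) can be illustrated without Active Job. This is a minimal plain-Ruby sketch, not Active Job's or Wayfarer's implementation; `run_with_retries` and its `on_exhausted:` parameter are hypothetical names introduced only for illustration.

```ruby
# Hypothetical error class, mirroring MyError from the guide.
class MyError < StandardError; end

# Hypothetical helper (not an Active Job API): executes the block up to
# `attempts` times in total, then hands the last error to `on_exhausted`.
def run_with_retries(error_class, attempts:, on_exhausted:)
  executions = 0
  begin
    executions += 1
    yield executions
  rescue error_class => error
    # Re-run the begin block while executions remain.
    retry if executions < attempts
    # All attempts failed: run the fallback instead of re-raising.
    on_exhausted.call(error)
  end
end
```

With `attempts: 3`, the block executes three times in total before the fallback runs; errors of other classes propagate immediately, which matches the guide's point that unhandled exceptions are never swallowed.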
@@ -1,94 +0,0 @@
1
- # Networking
2
-
3
- Wayfarer navigates the web in two ways:
4
-
5
- 1. Via plain HTTP requests
6
- 2. By automating browsers
7
-
8
- Both options are mutually exclusive per Ruby process.
9
-
10
- ## User agents
11
-
12
- A user agent is an entity that knows how to retrieve the contents behind a URL.
13
-
14
- The user agent can be configured via the global configuration:
15
-
16
- ```ruby
17
- Wayfarer.config.network.agent = :http # or :ferrum, :selenium
18
- ```
19
-
20
- ## Connection pooling
21
-
22
- Wayfarer keeps user agents within a connection pool. When a job executes
23
- and needs to retrieve the contents behind a URL, an agent is checked out from
24
- the pool.
25
-
26
- The pool has a constant size and it should equal the number of threads the
27
- underlying message queue operates with. The size can be configured via the
28
- global configuration:
29
-
30
- ```ruby
31
- Wayfarer.config.network.pool_size = 8
32
- ```
33
-
34
- ### Timeouts
35
-
36
- User agents may stay checked out from the pool by jobs for a limited time
37
- only. Once this time limit is exceeded, a `ConnectionPool::TimeoutError`
38
- exception is raised. This places a hard time limit on every job.
39
-
40
- The timeout can be configured via the global configuration:
41
-
42
- ```ruby
43
- Wayfarer.config.network.pool_timeout = 20 # seconds
44
- ```
45
-
46
- Because jobs with unhandled exceptions fail, explicit error handling is required
47
- if retries are desired:
48
-
49
- ```ruby
50
- class DummyJob < Wayfarer::Base
51
- retry_on ConnectionPool::TimeoutError, attempts: 3
52
- end
53
- ```
54
-
55
- ## Agent-specific client timeouts
56
-
57
- The time in seconds it may take to communicate with remote browser processes can
58
- be configured globally per agent:
59
-
60
- ```ruby
61
- Wayfarer.config.ferrum.options = { timeout: 5 }
62
- Wayfarer.config.selenium.client_timeout = 60
63
- ```
64
-
65
- ### Shared state
66
-
67
- As user agents get checked in and out continuously between jobs, their state
68
- carries over from job to job, too.
69
-
70
- For browser automation, this means:
71
-
72
- * A job finds the browser at the last URL the previous job has left off.
73
- * The browser's cookies might have been set, or other client-side state might
74
- exist that significantly affects a page's behaviour.
75
-
76
- ## HTTP redirect handling
77
-
78
- Browsers follow redirects transparently when they are navigated to a URL.
79
-
80
- When using plain HTTP, redirect URLs are enqueued transparently within the same
81
- batch. URLs that result in 3xx responses will not be retrieved again within
82
- their batch.
83
-
84
- ## HTTP request headers
85
-
86
- Request headers can be configured via the global configuration:
87
-
88
- ```ruby
89
- Wayfarer.config.network.http_headers = { "Field" => "Value" }
90
- ```
91
-
92
- !!! attention "Partial support"
93
-
94
- Selenium does not support configuring HTTP request headers.
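The pooling and timeout behaviour described in the removed networking guide can be sketched in plain Ruby. This is an illustration under stated assumptions, not Wayfarer's pool: `TinyPool` and `CheckoutTimeout` are hypothetical names, standing in for the real pool and for the `ConnectionPool::TimeoutError` the guide mentions.

```ruby
# Hypothetical error, standing in for ConnectionPool::TimeoutError.
class CheckoutTimeout < StandardError; end

# Hypothetical fixed-size pool: agents are created once and reused.
class TinyPool
  def initialize(size:, timeout:, &factory)
    @timeout = timeout
    @idle = Array.new(size) { factory.call }
    @mutex = Mutex.new
    @returned = ConditionVariable.new
  end

  # Yields a checked-out agent and always checks it back in afterwards.
  def with
    agent = checkout
    begin
      yield agent
    ensure
      checkin(agent)
    end
  end

  private

  # Blocks until an agent is idle, or raises once the timeout has elapsed.
  def checkout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    @mutex.synchronize do
      while @idle.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise CheckoutTimeout, "no agent available within #{@timeout}s" if remaining <= 0
        @returned.wait(@mutex, remaining)
      end
      @idle.pop
    end
  end

  def checkin(agent)
    @mutex.synchronize do
      @idle.push(agent)
      @returned.signal
    end
  end
end
```

Because the same agent object is handed to consecutive callers, any state it holds (current URL, cookies) carries over between jobs, which is the shared-state caveat the guide describes.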
@@ -1,130 +0,0 @@
1
- # Performance
2
-
3
- How to write performant crawlers with Wayfarer.
4
-
5
- ## Use a sufficiently sized user agent pool
6
-
7
- Automated browser processes or HTTP clients are kept in a [connection pool]() of
8
- static size. This avoids having to re-establish browser processes and enables
9
- their reuse.
10
-
11
- If the size of the pool is too small, the pool is a
12
- bottleneck. For example, if your message queue adapter uses 8 threads, but the
13
- pool only contains 1 user agent, the remaining 7 threads block until the agent
14
- is checked back in to the pool for use by one of the blocked threads.
15
-
16
- There is no reliable way to detect the number of threads of the underlying
17
- message queue adapter. The pool size should equal the number of threads;
18
-
19
- ```ruby
20
- Wayfarer.config.network.pool_size = 8 # defaults to 1
21
- ```
22
-
23
- ### Job shedding
24
-
25
- There is a maximum number of seconds that jobs wait when checking out a user
26
- agent from the pool. Once this time is exceeded,
27
- a `Wayfarer::UserAgentTimeoutError` is raised. By default, the timeout is 10
28
- seconds.
29
-
30
- This hints that there are more threads in use than user agents in the pool.
31
-
32
- ## Stage fewer URLs
33
-
34
- Staging fewer URLs saves space and time:
35
-
36
- * Fewer tasks written to the message queue
37
- * Less time spent consuming tasks
38
- * Less time spent filtering URLs with Redis
39
-
40
- Wayfarer maintains a set of processed URLs for a batch in Redis. Every staged
41
- URL is checked for inclusion in this set before it gets appended as a task to
42
- the message queue.
43
-
44
- A common pattern is to stage all links of a page, and rely on routing to fetch
45
- only the relevant ones:
46
-
47
- ```ruby
48
- class DummyJob < Wayfarer::Base
49
- route { host "example.com", to: :index }
50
-
51
- def index
52
- stage page.meta.links.all
53
- end
54
- end
55
- ```
56
-
57
- Pages commonly contain a large number of URLs.
58
-
59
- Every staged URL is:
60
-
61
- 1. Normalized to a canonical form, for example by sorting query parameters
62
- alphabetically.
63
- 2. Checked for inclusion in the batch Redis set or discarded.
64
- 3. Written to the message queue.
65
- 4. Consumed from the queue and matched against the router.
66
- 5. Fetched, if a route matches.
67
-
68
- Narrowing down the links in the document to follow speeds up the process.
69
- For example using Nokogiri, interesting links can be identified with a CSS
70
- selector:
71
-
72
- ```ruby
73
- class DummyJob < Wayfarer::Base
74
- route { host "example.com", to: :index }
75
-
76
- def index
77
- stage interesting_links
78
- end
79
-
80
- private
81
-
82
- def interesting_links
83
- page.doc.css("a.interesting").map { |elem| elem["href"] }
84
- end
85
- end
86
- ```
87
-
88
- Because the router only accepts the single hostname `example.com`, the job can
89
- also ensure it stages only internal URLs by intersecting them with the
90
- interesting ones:
91
-
92
- ```ruby
93
- class DummyJob < Wayfarer::Base
94
- route { host "example.com", to: :index }
95
-
96
- def index
97
- stage interesting_internal_links
98
- end
99
-
100
- private
101
-
102
- def interesting_internal_links
103
- page.meta.links.internal & interesting_links
104
- end
105
-
106
- def interesting_links
107
- page.doc.css("a.interesting").map { |elem| elem["href"] }
108
- end
109
- end
110
- ```
111
-
112
-
113
- ## Use Redis >= 6.2.0
114
-
115
- Redis 6.2.0 introduced the
116
- [`SMISMEMBER`](https://redis.io/commands/smismember) command which enables
117
- Wayfarer to check whether multiple URLs have been processed in a batch with a
118
- single command. With earlier versions, one command per URL is required.
119
-
120
- Wayfarer detects the Redis server version and uses `SMISMEMBER` without user
121
- configuration when supported.
122
-
123
- ## Use Oj for JSON parsing
124
-
125
- Wayfarer uses [Oj](https://github.com/ohler55/oj) for JSON parsing if the gem
126
- has been required at runtime:
127
-
128
- ```ruby
129
- require "oj"
130
- ```
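The first three staging steps listed in the removed performance guide (normalize, check against the batch set, write to the queue) can be sketched as follows. This is a simplified illustration, not Wayfarer's middleware: `normalize_url` and `stage` are hypothetical helpers, a `Set` stands in for the batch's Redis set, and an `Array` stands in for the message queue. Only the query-parameter sorting mentioned in the guide is modelled.

```ruby
require "set"
require "uri"

# Hypothetical helper: sorts query parameters so that URLs differing only
# in parameter order normalize to the same string.
def normalize_url(url)
  uri = URI.parse(url)
  if uri.query
    pairs = URI.decode_www_form(uri.query).sort
    uri.query = URI.encode_www_form(pairs)
  end
  uri.to_s
end

# Hypothetical stand-in for staging: `seen` plays the role of the batch's
# Redis set, `queue` the role of the message queue.
def stage(urls, seen:, queue:)
  urls.each do |url|
    normalized = normalize_url(url)
    # Set#add? returns nil when the element is already present,
    # so already-seen URLs are discarded instead of enqueued.
    queue << normalized if seen.add?(normalized)
  end
  queue
end
```

Every URL passed to `stage` pays for normalization and a set lookup even if it is later discarded, which is why narrowing the list of links before staging reduces work.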
@@ -1,41 +0,0 @@
1
- # Reliability
2
-
3
- ## Durability
4
-
5
- Wayfarer executes atop reliable message queues such as Sidekiq, Resque,
6
- RabbitMQ, etc. Its configuration is independent of the underlying queue
7
- infrastructure it reads from and writes to.
8
-
9
- ## Self-healing user agents
10
-
11
- Wayfarer handles the scenario where a remote browser process has crashed and
12
- must be replaced by a fresh browser process.
13
-
14
- This can be tested locally by automating a browser with headless mode turned
15
- off, and then closing the opened browser window: The current job fails, but the
16
- next job has access to a newly established browser session again.
17
-
18
- For example, Ferrum might raise `Ferrum::DeadBrowserError`. Wayfarer's
19
- user agents are self-healing and react to these kinds of errors internally. When
20
- a browser window is closed, the Ferrum user agent attempts to establish a new
21
- browser process as a replacement, for the next job to use.
22
-
23
- [Wayfarer never swallows exceptions](/guides/error_handling). This means
24
- that even though the user agent might heal itself, jobs still need to explicitly
25
- retry browser errors:
26
-
27
- ```ruby
28
- class Foobar < Wayfarer::Base
29
- route { to: :index }
30
-
31
- retry_on Ferrum::DeadBrowserError, attempts: 3, wait: :exponentially_longer
32
-
33
- # ...
34
- end
35
- ```
36
-
37
- This leads to log entries like:
38
-
39
- ```
40
- Retrying DummyJob in 3 seconds, due to a Ferrum::DeadBrowserError.
41
- ```
@@ -1,30 +0,0 @@
1
- # Steering
2
-
3
- A job's router can receive arguments computed dynamically by `::steer`.
4
- Steering enables [batch routing](/cookbook/batch_routing).
5
-
6
- For example, the following router has hostname and path hard-coded:
7
-
8
- ```ruby
9
- class DummyJob < Wayfarer::Base
10
- route do
11
- host "example.com", path: "/contact", to: :index
12
- end
13
- end
14
- ```
15
-
16
- Instead, hostname and path could be provided by `::steer`, too:
17
-
18
- ```ruby
19
- class DummyJob < Wayfarer::Base
20
- route do |hostname, path|
21
- host hostname, path: path, to: :index
22
- end
23
-
24
- steer do |_task|
25
- ["example.com", "/contact"]
26
- end
27
- end
28
- ```
29
-
30
- Note that `steer` yields the current [task](/guides/tasks).
@@ -1,48 +0,0 @@
1
- ---
2
- title: Wayfarer::Base
3
- ---
4
-
5
- # `Wayfarer::Base`
6
-
7
- Wayfarer's complete job API.
8
-
9
- ---
10
-
11
- ### `::route`
12
- : Draw routes to instance methods.
13
-
14
- ---
15
-
16
- ### `::steer { (Wayfarer::Task) -> [any] }`
17
- : Provide router arguments.
18
-
19
- ---
20
-
21
- ### `#task -> Wayfarer::Task`
22
- : The currently processing task.
23
-
24
- ---
25
-
26
- ### `#params -> Hash`
27
- : URL parameters collected from the matching route.
28
-
29
- ---
30
-
31
- ### `#stage(String | [String]) -> void`
32
- : Add URLs to a processing set. URLs already processed within the
33
- current batch get discarded and are not enqueued. Every staged URL gets
34
- normalized.
35
-
36
- ---
37
-
38
- ### `#browser -> Object`
39
- : The user agent that retrieved the current page.
40
-
41
- ---
42
-
43
- ### `#page(live: true | false) -> Page`
44
- : The page representing the response retrieved from the currently
45
- processing URL.
46
-
47
- With `live: true` called, a fresh `Page` is returned that reflects the
48
- current browser DOM. Calls to `#page` return the most recent page.