wayfarer 0.4.5 → 0.4.7
- checksums.yaml +4 -4
- data/.github/workflows/lint.yaml +25 -0
- data/.github/workflows/release.yaml +29 -0
- data/.github/workflows/tests.yaml +30 -0
- data/.gitignore +4 -0
- data/.rubocop.yml +5 -0
- data/.vale.ini +5 -0
- data/.yardopts +1 -3
- data/Dockerfile +5 -4
- data/Gemfile +3 -0
- data/Gemfile.lock +107 -102
- data/Rakefile +5 -56
- data/bin/wayfarer +1 -1
- data/docker-compose.yml +20 -9
- data/docs/cookbook/consent_screen.md +2 -2
- data/docs/cookbook/executing_javascript.md +3 -3
- data/docs/cookbook/navigation.md +12 -12
- data/docs/cookbook/querying_html.md +3 -3
- data/docs/cookbook/screenshots.md +2 -2
- data/docs/cookbook/user_agent.md +1 -1
- data/docs/design.md +36 -0
- data/docs/guides/callbacks.md +24 -126
- data/docs/guides/configuration.md +8 -8
- data/docs/guides/handlers.md +60 -0
- data/docs/guides/index.md +1 -0
- data/docs/guides/jobs/error_handling.md +40 -0
- data/docs/guides/jobs.md +99 -31
- data/docs/guides/navigation.md +1 -1
- data/docs/guides/networking/capybara.md +13 -22
- data/docs/guides/networking/custom_adapters.md +82 -41
- data/docs/guides/networking/ferrum.md +4 -4
- data/docs/guides/networking/http.md +9 -13
- data/docs/guides/networking/selenium.md +10 -11
- data/docs/guides/pages.md +76 -10
- data/docs/guides/redis.md +10 -0
- data/docs/guides/routing.md +74 -0
- data/docs/guides/tasks.md +33 -9
- data/docs/guides/tutorial.md +60 -0
- data/docs/guides/user_agents.md +113 -0
- data/docs/index.md +17 -40
- data/docs/reference/cli.md +35 -25
- data/docs/reference/configuration.md +36 -0
- data/lib/wayfarer/base.rb +124 -46
- data/lib/wayfarer/batch_completion.rb +56 -0
- data/lib/wayfarer/callbacks.rb +22 -48
- data/lib/wayfarer/cli/route_printer.rb +71 -57
- data/lib/wayfarer/cli.rb +121 -0
- data/lib/wayfarer/gc.rb +13 -6
- data/lib/wayfarer/handler.rb +15 -7
- data/lib/wayfarer/logging.rb +38 -0
- data/lib/wayfarer/middleware/base.rb +2 -0
- data/lib/wayfarer/middleware/batch_completion.rb +19 -0
- data/lib/wayfarer/middleware/content_type.rb +54 -0
- data/lib/wayfarer/middleware/controller.rb +19 -15
- data/lib/wayfarer/middleware/dedup.rb +16 -13
- data/lib/wayfarer/middleware/dispatch.rb +12 -4
- data/lib/wayfarer/middleware/normalize.rb +12 -11
- data/lib/wayfarer/middleware/redis.rb +15 -0
- data/lib/wayfarer/middleware/router.rb +33 -35
- data/lib/wayfarer/middleware/stage.rb +5 -5
- data/lib/wayfarer/middleware/uri_parser.rb +30 -0
- data/lib/wayfarer/middleware/user_agent.rb +49 -0
- data/lib/wayfarer/networking/capybara.rb +1 -1
- data/lib/wayfarer/networking/context.rb +2 -2
- data/lib/wayfarer/networking/ferrum.rb +2 -2
- data/lib/wayfarer/networking/follow.rb +12 -6
- data/lib/wayfarer/networking/http.rb +1 -1
- data/lib/wayfarer/networking/pool.rb +17 -12
- data/lib/wayfarer/networking/selenium.rb +3 -3
- data/lib/wayfarer/networking/strategy.rb +2 -2
- data/lib/wayfarer/page.rb +36 -14
- data/lib/wayfarer/parsing/xml.rb +6 -6
- data/lib/wayfarer/parsing.rb +24 -0
- data/lib/wayfarer/redis/barrier.rb +13 -21
- data/lib/wayfarer/redis/counter.rb +19 -9
- data/lib/wayfarer/redis/pool.rb +1 -1
- data/lib/wayfarer/redis/resettable.rb +19 -0
- data/lib/wayfarer/routing/dsl.rb +1 -0
- data/lib/wayfarer/routing/matchers/path.rb +4 -2
- data/lib/wayfarer/routing/root_route.rb +5 -1
- data/lib/wayfarer/routing/route.rb +4 -14
- data/lib/wayfarer/stringify.rb +22 -30
- data/lib/wayfarer/task.rb +12 -18
- data/lib/wayfarer.rb +29 -2
- data/mkdocs.yml +52 -7
- data/rake/docs.rake +26 -0
- data/rake/lint.rake +105 -0
- data/rake/release.rake +29 -0
- data/rake/tests.rake +28 -0
- data/requirements.txt +1 -1
- data/spec/base_spec.rb +140 -160
- data/spec/batch_completion_spec.rb +104 -0
- data/spec/cli/job_spec.rb +19 -23
- data/spec/cli/routing_spec.rb +101 -0
- data/spec/cli/version_spec.rb +1 -1
- data/spec/factories/task.rb +7 -1
- data/spec/fixtures/dummy_job.rb +5 -3
- data/spec/gc_spec.rb +8 -50
- data/spec/handler_spec.rb +1 -1
- data/spec/integration/callbacks_spec.rb +157 -45
- data/spec/integration/content_type_spec.rb +145 -0
- data/spec/integration/gc_spec.rb +44 -0
- data/spec/integration/handler_spec.rb +66 -0
- data/spec/integration/page_spec.rb +44 -29
- data/spec/integration/params_spec.rb +33 -25
- data/spec/integration/parsing_spec.rb +125 -0
- data/spec/integration/routing_spec.rb +18 -0
- data/spec/integration/stage_spec.rb +27 -20
- data/spec/middleware/batch_completion_spec.rb +34 -0
- data/spec/middleware/chain_spec.rb +8 -8
- data/spec/middleware/content_type_spec.rb +86 -0
- data/spec/middleware/controller_spec.rb +5 -5
- data/spec/middleware/dedup_spec.rb +38 -55
- data/spec/middleware/dispatch_spec.rb +23 -7
- data/spec/middleware/normalize_spec.rb +44 -13
- data/spec/middleware/router_spec.rb +29 -30
- data/spec/middleware/stage_spec.rb +8 -8
- data/spec/middleware/uri_parser_spec.rb +53 -0
- data/spec/middleware/{fetch_spec.rb → user_agent_spec.rb} +28 -27
- data/spec/networking/context_spec.rb +17 -0
- data/spec/networking/follow_spec.rb +2 -2
- data/spec/networking/pool_spec.rb +5 -5
- data/spec/networking/strategy.rb +2 -2
- data/spec/page_spec.rb +42 -20
- data/spec/parsing/xml_spec.rb +11 -12
- data/spec/redis/barrier_spec.rb +8 -48
- data/spec/redis/counter_spec.rb +13 -1
- data/spec/redis/pool_spec.rb +1 -1
- data/spec/spec_helpers.rb +27 -16
- data/spec/support/test_app.rb +8 -0
- data/spec/task_spec.rb +3 -24
- data/spec/wayfarer_spec.rb +1 -1
- data/wayfarer.gemspec +4 -3
- metadata +61 -51
- data/.github/workflows/ci.yaml +0 -32
- data/docs/guides/error_handling.md +0 -31
- data/docs/guides/networking.md +0 -94
- data/docs/guides/performance.md +0 -130
- data/docs/guides/reliability.md +0 -41
- data/docs/guides/routing/steering.md +0 -30
- data/docs/reference/api/base.md +0 -48
- data/docs/reference/configuration_keys.md +0 -42
- data/docs/reference/environment_variables.md +0 -83
- data/lib/wayfarer/cli/base.rb +0 -45
- data/lib/wayfarer/cli/generate.rb +0 -17
- data/lib/wayfarer/cli/job.rb +0 -56
- data/lib/wayfarer/cli/route.rb +0 -29
- data/lib/wayfarer/cli/runner.rb +0 -34
- data/lib/wayfarer/cli/templates/Gemfile.tt +0 -5
- data/lib/wayfarer/cli/templates/job.rb.tt +0 -10
- data/lib/wayfarer/config/capybara.rb +0 -10
- data/lib/wayfarer/config/ferrum.rb +0 -11
- data/lib/wayfarer/config/networking.rb +0 -26
- data/lib/wayfarer/config/redis.rb +0 -14
- data/lib/wayfarer/config/root.rb +0 -11
- data/lib/wayfarer/config/selenium.rb +0 -21
- data/lib/wayfarer/config/strconv.rb +0 -45
- data/lib/wayfarer/config/struct.rb +0 -72
- data/lib/wayfarer/middleware/fetch.rb +0 -56
- data/lib/wayfarer/redis/connection.rb +0 -13
- data/lib/wayfarer/redis/version.rb +0 -19
- data/lib/wayfarer/routing/router.rb +0 -28
- data/spec/callbacks_spec.rb +0 -102
- data/spec/cli/generate_spec.rb +0 -39
- data/spec/config/capybara_spec.rb +0 -18
- data/spec/config/ferrum_spec.rb +0 -24
- data/spec/config/networking_spec.rb +0 -73
- data/spec/config/redis_spec.rb +0 -32
- data/spec/config/root_spec.rb +0 -31
- data/spec/config/selenium_spec.rb +0 -56
- data/spec/config/strconv_spec.rb +0 -58
- data/spec/config/struct_spec.rb +0 -66
- data/spec/integration/steering_spec.rb +0 -57
- data/spec/redis/version_spec.rb +0 -13
- data/spec/routing/router_spec.rb +0 -24
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: wayfarer
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.4.
|
4
|
+
version: 0.4.7
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Dominic Bauer
|
@@ -16,14 +16,14 @@ dependencies:
|
|
16
16
|
requirements:
|
17
17
|
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '
|
19
|
+
version: '7.1'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
24
|
- - ">="
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: '
|
26
|
+
version: '7.1'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: addressable
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -123,33 +123,33 @@ dependencies:
|
|
123
123
|
- !ruby/object:Gem::Version
|
124
124
|
version: '3.0'
|
125
125
|
- !ruby/object:Gem::Dependency
|
126
|
-
name:
|
126
|
+
name: mock_redis
|
127
127
|
requirement: !ruby/object:Gem::Requirement
|
128
128
|
requirements:
|
129
129
|
- - "~>"
|
130
130
|
- !ruby/object:Gem::Version
|
131
|
-
version: '
|
131
|
+
version: '0.29'
|
132
132
|
type: :runtime
|
133
133
|
prerelease: false
|
134
134
|
version_requirements: !ruby/object:Gem::Requirement
|
135
135
|
requirements:
|
136
136
|
- - "~>"
|
137
137
|
- !ruby/object:Gem::Version
|
138
|
-
version: '
|
138
|
+
version: '0.29'
|
139
139
|
- !ruby/object:Gem::Dependency
|
140
|
-
name:
|
140
|
+
name: mustermann
|
141
141
|
requirement: !ruby/object:Gem::Requirement
|
142
142
|
requirements:
|
143
143
|
- - "~>"
|
144
144
|
- !ruby/object:Gem::Version
|
145
|
-
version: '
|
145
|
+
version: '1.1'
|
146
146
|
type: :runtime
|
147
147
|
prerelease: false
|
148
148
|
version_requirements: !ruby/object:Gem::Requirement
|
149
149
|
requirements:
|
150
150
|
- - "~>"
|
151
151
|
- !ruby/object:Gem::Version
|
152
|
-
version: '
|
152
|
+
version: '1.1'
|
153
153
|
- !ruby/object:Gem::Dependency
|
154
154
|
name: net-http-persistent
|
155
155
|
requirement: !ruby/object:Gem::Requirement
|
@@ -240,6 +240,20 @@ dependencies:
|
|
240
240
|
- - "~>"
|
241
241
|
- !ruby/object:Gem::Version
|
242
242
|
version: '1.0'
|
243
|
+
- !ruby/object:Gem::Dependency
|
244
|
+
name: zeitwerk
|
245
|
+
requirement: !ruby/object:Gem::Requirement
|
246
|
+
requirements:
|
247
|
+
- - "~>"
|
248
|
+
- !ruby/object:Gem::Version
|
249
|
+
version: '2.4'
|
250
|
+
type: :runtime
|
251
|
+
prerelease: false
|
252
|
+
version_requirements: !ruby/object:Gem::Requirement
|
253
|
+
requirements:
|
254
|
+
- - "~>"
|
255
|
+
- !ruby/object:Gem::Version
|
256
|
+
version: '2.4'
|
243
257
|
- !ruby/object:Gem::Dependency
|
244
258
|
name: cuprite
|
245
259
|
requirement: !ruby/object:Gem::Requirement
|
@@ -373,12 +387,15 @@ executables:
|
|
373
387
|
extensions: []
|
374
388
|
extra_rdoc_files: []
|
375
389
|
files:
|
376
|
-
- ".github/workflows/
|
390
|
+
- ".github/workflows/lint.yaml"
|
391
|
+
- ".github/workflows/release.yaml"
|
392
|
+
- ".github/workflows/tests.yaml"
|
377
393
|
- ".gitignore"
|
378
394
|
- ".rbenv-gemsets"
|
379
395
|
- ".rspec"
|
380
396
|
- ".rubocop.yml"
|
381
397
|
- ".ruby-version"
|
398
|
+
- ".vale.ini"
|
382
399
|
- ".yardopts"
|
383
400
|
- Dockerfile
|
384
401
|
- Gemfile
|
@@ -395,60 +412,53 @@ files:
|
|
395
412
|
- docs/cookbook/querying_html.md
|
396
413
|
- docs/cookbook/screenshots.md
|
397
414
|
- docs/cookbook/user_agent.md
|
415
|
+
- docs/design.md
|
398
416
|
- docs/guides/callbacks.md
|
399
417
|
- docs/guides/configuration.md
|
400
418
|
- docs/guides/debugging.md
|
401
|
-
- docs/guides/
|
419
|
+
- docs/guides/handlers.md
|
420
|
+
- docs/guides/index.md
|
402
421
|
- docs/guides/jobs.md
|
422
|
+
- docs/guides/jobs/error_handling.md
|
403
423
|
- docs/guides/navigation.md
|
404
|
-
- docs/guides/networking.md
|
405
424
|
- docs/guides/networking/capybara.md
|
406
425
|
- docs/guides/networking/custom_adapters.md
|
407
426
|
- docs/guides/networking/ferrum.md
|
408
427
|
- docs/guides/networking/http.md
|
409
428
|
- docs/guides/networking/selenium.md
|
410
429
|
- docs/guides/pages.md
|
411
|
-
- docs/guides/
|
412
|
-
- docs/guides/
|
413
|
-
- docs/guides/routing/steering.md
|
430
|
+
- docs/guides/redis.md
|
431
|
+
- docs/guides/routing.md
|
414
432
|
- docs/guides/tasks.md
|
433
|
+
- docs/guides/tutorial.md
|
434
|
+
- docs/guides/user_agents.md
|
415
435
|
- docs/index.md
|
416
|
-
- docs/reference/api/base.md
|
417
436
|
- docs/reference/api/route.md
|
418
437
|
- docs/reference/cli.md
|
419
|
-
- docs/reference/
|
420
|
-
- docs/reference/environment_variables.md
|
438
|
+
- docs/reference/configuration.md
|
421
439
|
- lib/wayfarer.rb
|
422
440
|
- lib/wayfarer/base.rb
|
441
|
+
- lib/wayfarer/batch_completion.rb
|
423
442
|
- lib/wayfarer/callbacks.rb
|
424
|
-
- lib/wayfarer/cli
|
425
|
-
- lib/wayfarer/cli/generate.rb
|
426
|
-
- lib/wayfarer/cli/job.rb
|
427
|
-
- lib/wayfarer/cli/route.rb
|
443
|
+
- lib/wayfarer/cli.rb
|
428
444
|
- lib/wayfarer/cli/route_printer.rb
|
429
|
-
- lib/wayfarer/cli/runner.rb
|
430
|
-
- lib/wayfarer/cli/templates/Gemfile.tt
|
431
|
-
- lib/wayfarer/cli/templates/job.rb.tt
|
432
|
-
- lib/wayfarer/config/capybara.rb
|
433
|
-
- lib/wayfarer/config/ferrum.rb
|
434
|
-
- lib/wayfarer/config/networking.rb
|
435
|
-
- lib/wayfarer/config/redis.rb
|
436
|
-
- lib/wayfarer/config/root.rb
|
437
|
-
- lib/wayfarer/config/selenium.rb
|
438
|
-
- lib/wayfarer/config/strconv.rb
|
439
|
-
- lib/wayfarer/config/struct.rb
|
440
445
|
- lib/wayfarer/gc.rb
|
441
446
|
- lib/wayfarer/handler.rb
|
447
|
+
- lib/wayfarer/logging.rb
|
442
448
|
- lib/wayfarer/middleware/base.rb
|
449
|
+
- lib/wayfarer/middleware/batch_completion.rb
|
443
450
|
- lib/wayfarer/middleware/chain.rb
|
451
|
+
- lib/wayfarer/middleware/content_type.rb
|
444
452
|
- lib/wayfarer/middleware/controller.rb
|
445
453
|
- lib/wayfarer/middleware/dedup.rb
|
446
454
|
- lib/wayfarer/middleware/dispatch.rb
|
447
|
-
- lib/wayfarer/middleware/fetch.rb
|
448
455
|
- lib/wayfarer/middleware/lazy.rb
|
449
456
|
- lib/wayfarer/middleware/normalize.rb
|
457
|
+
- lib/wayfarer/middleware/redis.rb
|
450
458
|
- lib/wayfarer/middleware/router.rb
|
451
459
|
- lib/wayfarer/middleware/stage.rb
|
460
|
+
- lib/wayfarer/middleware/uri_parser.rb
|
461
|
+
- lib/wayfarer/middleware/user_agent.rb
|
452
462
|
- lib/wayfarer/networking/capybara.rb
|
453
463
|
- lib/wayfarer/networking/context.rb
|
454
464
|
- lib/wayfarer/networking/ferrum.rb
|
@@ -459,13 +469,13 @@ files:
|
|
459
469
|
- lib/wayfarer/networking/selenium.rb
|
460
470
|
- lib/wayfarer/networking/strategy.rb
|
461
471
|
- lib/wayfarer/page.rb
|
472
|
+
- lib/wayfarer/parsing.rb
|
462
473
|
- lib/wayfarer/parsing/json.rb
|
463
474
|
- lib/wayfarer/parsing/xml.rb
|
464
475
|
- lib/wayfarer/redis/barrier.rb
|
465
|
-
- lib/wayfarer/redis/connection.rb
|
466
476
|
- lib/wayfarer/redis/counter.rb
|
467
477
|
- lib/wayfarer/redis/pool.rb
|
468
|
-
- lib/wayfarer/redis/
|
478
|
+
- lib/wayfarer/redis/resettable.rb
|
469
479
|
- lib/wayfarer/routing/dsl.rb
|
470
480
|
- lib/wayfarer/routing/matchers/custom.rb
|
471
481
|
- lib/wayfarer/routing/matchers/host.rb
|
@@ -478,26 +488,21 @@ files:
|
|
478
488
|
- lib/wayfarer/routing/result.rb
|
479
489
|
- lib/wayfarer/routing/root_route.rb
|
480
490
|
- lib/wayfarer/routing/route.rb
|
481
|
-
- lib/wayfarer/routing/router.rb
|
482
491
|
- lib/wayfarer/routing/target_route.rb
|
483
492
|
- lib/wayfarer/serializer.rb
|
484
493
|
- lib/wayfarer/stringify.rb
|
485
494
|
- lib/wayfarer/task.rb
|
486
495
|
- mkdocs.yml
|
496
|
+
- rake/docs.rake
|
497
|
+
- rake/lint.rake
|
498
|
+
- rake/release.rake
|
499
|
+
- rake/tests.rake
|
487
500
|
- requirements.txt
|
488
501
|
- spec/base_spec.rb
|
489
|
-
- spec/
|
490
|
-
- spec/cli/generate_spec.rb
|
502
|
+
- spec/batch_completion_spec.rb
|
491
503
|
- spec/cli/job_spec.rb
|
504
|
+
- spec/cli/routing_spec.rb
|
492
505
|
- spec/cli/version_spec.rb
|
493
|
-
- spec/config/capybara_spec.rb
|
494
|
-
- spec/config/ferrum_spec.rb
|
495
|
-
- spec/config/networking_spec.rb
|
496
|
-
- spec/config/redis_spec.rb
|
497
|
-
- spec/config/root_spec.rb
|
498
|
-
- spec/config/selenium_spec.rb
|
499
|
-
- spec/config/strconv_spec.rb
|
500
|
-
- spec/config/struct_spec.rb
|
501
506
|
- spec/factories/middleware.rb
|
502
507
|
- spec/factories/page.rb
|
503
508
|
- spec/factories/task.rb
|
@@ -505,18 +510,25 @@ files:
|
|
505
510
|
- spec/gc_spec.rb
|
506
511
|
- spec/handler_spec.rb
|
507
512
|
- spec/integration/callbacks_spec.rb
|
513
|
+
- spec/integration/content_type_spec.rb
|
514
|
+
- spec/integration/gc_spec.rb
|
515
|
+
- spec/integration/handler_spec.rb
|
508
516
|
- spec/integration/page_spec.rb
|
509
517
|
- spec/integration/params_spec.rb
|
518
|
+
- spec/integration/parsing_spec.rb
|
519
|
+
- spec/integration/routing_spec.rb
|
510
520
|
- spec/integration/stage_spec.rb
|
511
|
-
- spec/
|
521
|
+
- spec/middleware/batch_completion_spec.rb
|
512
522
|
- spec/middleware/chain_spec.rb
|
523
|
+
- spec/middleware/content_type_spec.rb
|
513
524
|
- spec/middleware/controller_spec.rb
|
514
525
|
- spec/middleware/dedup_spec.rb
|
515
526
|
- spec/middleware/dispatch_spec.rb
|
516
|
-
- spec/middleware/fetch_spec.rb
|
517
527
|
- spec/middleware/normalize_spec.rb
|
518
528
|
- spec/middleware/router_spec.rb
|
519
529
|
- spec/middleware/stage_spec.rb
|
530
|
+
- spec/middleware/uri_parser_spec.rb
|
531
|
+
- spec/middleware/user_agent_spec.rb
|
520
532
|
- spec/networking/capybara_spec.rb
|
521
533
|
- spec/networking/context_spec.rb
|
522
534
|
- spec/networking/ferrum_spec.rb
|
@@ -531,7 +543,6 @@ files:
|
|
531
543
|
- spec/redis/barrier_spec.rb
|
532
544
|
- spec/redis/counter_spec.rb
|
533
545
|
- spec/redis/pool_spec.rb
|
534
|
-
- spec/redis/version_spec.rb
|
535
546
|
- spec/routing/dsl_spec.rb
|
536
547
|
- spec/routing/integration_spec.rb
|
537
548
|
- spec/routing/matchers/custom_spec.rb
|
@@ -544,7 +555,6 @@ files:
|
|
544
555
|
- spec/routing/path_finder_spec.rb
|
545
556
|
- spec/routing/root_route_spec.rb
|
546
557
|
- spec/routing/route_spec.rb
|
547
|
-
- spec/routing/router_spec.rb
|
548
558
|
- spec/spec_helpers.rb
|
549
559
|
- spec/stringify_spec.rb
|
550
560
|
- spec/support/static/finders.html
|
data/.github/workflows/ci.yaml
DELETED
@@ -1,32 +0,0 @@
|
|
1
|
-
name: ci
|
2
|
-
|
3
|
-
on:
|
4
|
-
push:
|
5
|
-
branches:
|
6
|
-
- '*'
|
7
|
-
env:
|
8
|
-
CI: true
|
9
|
-
|
10
|
-
jobs:
|
11
|
-
ci:
|
12
|
-
runs-on: ubuntu-latest
|
13
|
-
steps:
|
14
|
-
- uses: actions/checkout@v2
|
15
|
-
|
16
|
-
- name: Start services
|
17
|
-
run: docker-compose up -d
|
18
|
-
|
19
|
-
- name: Run isolated tests
|
20
|
-
run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:isolated
|
21
|
-
|
22
|
-
- name: Run Ferrum tests
|
23
|
-
run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:ferrum
|
24
|
-
|
25
|
-
- name: Run Selenium tests
|
26
|
-
run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:selenium
|
27
|
-
|
28
|
-
- name: Run CLI tests
|
29
|
-
run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake test:cli
|
30
|
-
|
31
|
-
- name: Run RuboCop
|
32
|
-
run: docker-compose run --rm --name test --service-ports wayfarer bundle exec rake rubocop
|
data/docs/guides/error_handling.md
DELETED
@@ -1,31 +0,0 @@
|
|
1
|
-
# Error handling
|
2
|
-
|
3
|
-
## Wayfarer never swallows exceptions
|
4
|
-
|
5
|
-
* Wayfarer never swallows exceptions.
|
6
|
-
* Jobs with unhandled exceptions are not retried.
|
7
|
-
|
8
|
-
## Retrying and discarding
|
9
|
-
|
10
|
-
Wayfarer relies on [Active Job's two error handling facilities](https://guides.rubyonrails.org/active_job_basics.html#exceptions).
|
11
|
-
|
12
|
-
* `retry_on` to retry jobs a number of times on certain errors:
|
13
|
-
|
14
|
-
```ruby
|
15
|
-
class DummyJob < Wayfarer::Base
|
16
|
-
retry_on MyError, attempts: 3 do |job, error|
|
17
|
-
# This block runs once all 3 attempts have failed
|
18
|
-
# (1 initial attempt + 2 retries)
|
19
|
-
end
|
20
|
-
end
|
21
|
-
```
|
22
|
-
|
23
|
-
* `discard_on` to throw away jobs on certain errors:
|
24
|
-
|
25
|
-
```ruby
|
26
|
-
class DummyJob < Wayfarer::Base
|
27
|
-
discard_on MyError do |job, error|
|
28
|
-
# This block runs once and buries the job
|
29
|
-
end
|
30
|
-
end
|
31
|
-
```
|
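A note on the `retry_on` comment in the removed error handling guide, which counts `attempts: 3` as 1 initial attempt plus 2 retries. The plain-Ruby sketch below mirrors only that counting. `with_attempts` and `on_exhausted` are hypothetical names introduced for illustration; Active Job re-enqueues a failed job rather than looping in-process, so this is not how the mechanism itself works.

```ruby
# Hypothetical helper, not part of Wayfarer or Active Job.
# Runs the block at most `attempts` times in total. Once every attempt has
# failed with `error_class`, the last error is handed to `on_exhausted`.
def with_attempts(error_class, attempts:, on_exhausted:)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue error_class => e
    retry if tries < attempts # re-runs the begin block; `tries` is preserved
    on_exhausted.call(e)
  end
end

seen = []
with_attempts(ArgumentError, attempts: 3, on_exhausted: ->(e) { seen << e.message }) do |n|
  seen << n
  raise ArgumentError, "boom"
end
```

With `attempts: 3` the block runs three times (one initial run and two retries) before the exhaustion callback fires, which matches the comment in the guide.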
data/docs/guides/networking.md
DELETED
@@ -1,94 +0,0 @@
|
|
1
|
-
# Networking
|
2
|
-
|
3
|
-
Wayfarer navigates the web in two ways:
|
4
|
-
|
5
|
-
1. Via plain HTTP requests
|
6
|
-
2. By automating browsers
|
7
|
-
|
8
|
-
Both options are mutually exclusive per Ruby process.
|
9
|
-
|
10
|
-
## User agents
|
11
|
-
|
12
|
-
A user agent is an entity that knows how to retrieve the contents behind a URL.
|
13
|
-
|
14
|
-
The user agent can be configured via the global configuration:
|
15
|
-
|
16
|
-
```ruby
|
17
|
-
Wayfarer.config.network.agent = :http # or :ferrum, :selenium
|
18
|
-
```
|
19
|
-
|
20
|
-
## Connection pooling
|
21
|
-
|
22
|
-
Wayfarer keeps user agents within a connection pool. When a job executes
|
23
|
-
and needs to retrieve the contents behind a URL, an agent is checked out from
|
24
|
-
the pool.
|
25
|
-
|
26
|
-
The pool has a constant size and it should equal the number of threads the
|
27
|
-
underlying message queue operates with. The size can be configured via the
|
28
|
-
global configuration:
|
29
|
-
|
30
|
-
```ruby
|
31
|
-
Wayfarer.config.network.pool_size = 8
|
32
|
-
```
|
33
|
-
|
34
|
-
### Timeouts
|
35
|
-
|
36
|
-
User agents may stay checked out from the pool by jobs for a limited time
|
37
|
-
only. Once this time limit is exceeded, a `ConnectionPool::TimeoutError`
|
38
|
-
exception is raised. This places a hard time limit on every job.
|
39
|
-
|
40
|
-
The timeout can be configured via the global configuration:
|
41
|
-
|
42
|
-
```ruby
|
43
|
-
Wayfarer.config.network.pool_timeout = 20 # seconds
|
44
|
-
```
|
45
|
-
|
46
|
-
Because jobs with unhandled exceptions fail, explicit error handling is required
|
47
|
-
if retries are desired:
|
48
|
-
|
49
|
-
```ruby
|
50
|
-
class DummyJob < Wayfarer::Base
|
51
|
-
retry_on ConnectionPool::TimeoutError, attempts: 3
|
52
|
-
end
|
53
|
-
```
|
54
|
-
|
55
|
-
## Agent-specific client timeouts
|
56
|
-
|
57
|
-
The time in seconds it may take to communicate with remote browser processes can
|
58
|
-
be configured globally per agent:
|
59
|
-
|
60
|
-
```ruby
|
61
|
-
Wayfarer.config.ferrum.options = { timeout: 5 }
|
62
|
-
Wayfarer.config.selenium.client_timeout = 60
|
63
|
-
```
|
64
|
-
|
65
|
-
### Shared state
|
66
|
-
|
67
|
-
As user agents get checked in and out continuously between jobs, their state
|
68
|
-
carries over from job to job, too.
|
69
|
-
|
70
|
-
For browser automation, this means:
|
71
|
-
|
72
|
-
* A job finds the browser at the last URL the previous job has left off.
|
73
|
-
* The browser's cookies might have been set, or other client-side state might
|
74
|
-
exist that significantly affects a page's behaviour.
|
75
|
-
|
76
|
-
## HTTP redirect handling
|
77
|
-
|
78
|
-
Browsers follow redirects transparently when they are navigated to a URL.
|
79
|
-
|
80
|
-
When using plain HTTP, redirect URLs are enqueued transparently within the same
|
81
|
-
batch. URLs that result in 3xx responses will not be retrieved again within
|
82
|
-
their batch.
|
83
|
-
|
84
|
-
## HTTP request headers
|
85
|
-
|
86
|
-
Request headers can be configured via the global configuration:
|
87
|
-
|
88
|
-
```ruby
|
89
|
-
Wayfarer.config.network.http_headers = { "Field" => "Value" }
|
90
|
-
```
|
91
|
-
|
92
|
-
!!! attention "Partial support"
|
93
|
-
|
94
|
-
Selenium does not support configuring HTTP request headers.
|
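The removed networking guide describes a pool of constant size whose checkouts raise once a time limit is exceeded. The sketch below illustrates that behaviour with the standard library only. `TinyPool` and `TinyPool::TimeoutError` are hypothetical stand-ins; the guide itself names `ConnectionPool::TimeoutError`, and nothing here is taken from Wayfarer's implementation.

```ruby
# Hypothetical fixed-size pool with a checkout timeout (stdlib only).
class TinyPool
  class TimeoutError < StandardError; end

  # The block given to `new` builds one agent; it is called `size` times.
  def initialize(size:, timeout:)
    @timeout = timeout
    @agents = Array.new(size) { yield }
    @mutex = Mutex.new
    @available = ConditionVariable.new
  end

  # Checks an agent out, yields it, and always checks it back in.
  def with
    agent = checkout
    begin
      yield agent
    ensure
      checkin(agent)
    end
  end

  private

  # Waits until an agent is free or the deadline has passed.
  def checkout
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @timeout
    @mutex.synchronize do
      while @agents.empty?
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise TimeoutError, "no agent available within #{@timeout}s" if remaining <= 0
        @available.wait(@mutex, remaining)
      end
      @agents.pop
    end
  end

  def checkin(agent)
    @mutex.synchronize do
      @agents.push(agent)
      @available.signal
    end
  end
end
```

A pool of size 1 shared by several threads makes every other thread wait inside `checkout`, which is the bottleneck the guides warn about when the pool is smaller than the number of worker threads.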
data/docs/guides/performance.md
DELETED
@@ -1,130 +0,0 @@
|
|
1
|
-
# Performance
|
2
|
-
|
3
|
-
How to write performant crawlers with Wayfarer.
|
4
|
-
|
5
|
-
## Use a sufficiently sized user agent pool
|
6
|
-
|
7
|
-
Automated browser processes or HTTP clients are kept in a [connection pool]() of
|
8
|
-
static size. This avoids having to re-establish browser processes and enables
|
9
|
-
their reuse.
|
10
|
-
|
11
|
-
If the size of the pool is too small, the pool is a
|
12
|
-
bottleneck. For example, if your message queue adapter uses 8 threads, but the
|
13
|
-
pool only contains 1 user agent, the remaining 7 threads block until the agent
|
14
|
-
is checked back in to the pool for use by one of the blocked threads.
|
15
|
-
|
16
|
-
There is no reliable way to detect the number of threads of the underlying
|
17
|
-
message queue adapter. The pool size should equal the number of threads;
|
18
|
-
|
19
|
-
```ruby
|
20
|
-
Wayfarer.config.network.pool_size = 8 # defaults to 1
|
21
|
-
```
|
22
|
-
|
23
|
-
### Job shedding
|
24
|
-
|
25
|
-
There is a maximum number of seconds that jobs wait when checking out a user
|
26
|
-
agent from the pool. Once this time is exceeded,
|
27
|
-
a `Wayfarer::UserAgentTimeoutError` is raised. By default, the timeout is 10
|
28
|
-
seconds.
|
29
|
-
|
30
|
-
This hints there are more threads in use than user agents in the pool.
|
31
|
-
|
32
|
-
## Stage less URLs
|
33
|
-
|
34
|
-
Staging less URLs saves space and time:
|
35
|
-
|
36
|
-
* Less tasks written to the message queue
|
37
|
-
* Less time spent consuming tasks
|
38
|
-
* Less time spent filtering URLs with Redis
|
39
|
-
|
40
|
-
Wayfarer maintains a set of processed URLs for a batch in Redis. Every staged
|
41
|
-
URL is checked for inclusion in this set before it gets appended as a task to
|
42
|
-
the message queue.
|
43
|
-
|
44
|
-
A common pattern is to stage all links of a page, and rely on routing to fetch
|
45
|
-
only the relevant ones:
|
46
|
-
|
47
|
-
```ruby
|
48
|
-
class DummyJob < Wayfarer::Base
|
49
|
-
route { to: index, host: "example.com" }
|
50
|
-
|
51
|
-
def index
|
52
|
-
stage page.meta.links.all
|
53
|
-
end
|
54
|
-
end
|
55
|
-
```
|
56
|
-
|
57
|
-
Pages commonly contain a large number of URLs.
|
58
|
-
|
59
|
-
Every staged URL is:
|
60
|
-
|
61
|
-
1. Normalized to a canonical form, for example by sorting query parameters
|
62
|
-
alphabetically.
|
63
|
-
2. Checked for inclusion in the batch Redis set or discarded.
|
64
|
-
3. Written to the message queue.
|
65
|
-
4. Consumed from the queue and matched against the router.
|
66
|
-
5. Fetched, if a route matches.
|
67
|
-
|
68
|
-
Narrowing down the links in the document to follow speeds up the process.
|
69
|
-
For example using Nokogiri, interesting links can be identified with a CSS
|
70
|
-
selector:
|
71
|
-
|
72
|
-
```ruby
|
73
|
-
class DummyJob < Wayfarer::Base
|
74
|
-
route { to: index, host: "example.com" }
|
75
|
-
|
76
|
-
def index
|
77
|
-
stage interesting_links
|
78
|
-
end
|
79
|
-
|
80
|
-
private
|
81
|
-
|
82
|
-
def interesting_links
|
83
|
-
page.doc.css("a.interesting").map { |elem| elem["href"] }
|
84
|
-
end
|
85
|
-
end
|
86
|
-
```
|
87
|
-
|
88
|
-
Because the router only accepts the single hostname `example.com`, the job can
|
89
|
-
also ensure it stages only internal URLs by intersecting them with the
|
90
|
-
interesting ones:
|
91
|
-
|
92
|
-
```ruby
|
93
|
-
class DummyJob < Wayfarer::Base
|
94
|
-
route { to: index, host: "example.com" }
|
95
|
-
|
96
|
-
def index
|
97
|
-
stage interesting_internal_links
|
98
|
-
end
|
99
|
-
|
100
|
-
private
|
101
|
-
|
102
|
-
def interesting_internal_links
|
103
|
-
page.meta.links.internal & interesting_links
|
104
|
-
end
|
105
|
-
|
106
|
-
def interesting_links
|
107
|
-
page.doc.css("a.interesting").map { |elem| elem["href"] }
|
108
|
-
end
|
109
|
-
end
|
110
|
-
```
|
111
|
-
|
112
|
-
|
113
|
-
## Use Redis >= 6.2.0
|
114
|
-
|
115
|
-
Redis 6.2.0 introduced the
|
116
|
-
[`SMISMEMBER`](https://redis.io/commands/smismember) command which enables
|
117
|
-
Wayfarer to check whether multiple URLs have been processed in a batch with a
|
118
|
-
single command. With earlier versions, one command per URL is required.
|
119
|
-
|
120
|
-
Wayfarer detects the Redis server version and uses `SMISMEMBER` without user
|
121
|
-
configuration when supported.
|
122
|
-
|
123
|
-
## Use Oj for JSON parsing
|
124
|
-
|
125
|
-
Wayfarer uses [Oj](https://github.com/ohler55/oj) for JSON parsing if the gem
|
126
|
-
has been required at runtime:
|
127
|
-
|
128
|
-
```ruby
|
129
|
-
require "oj"
|
130
|
-
```
|
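The removed performance guide lists the first two steps for every staged URL: normalization to a canonical form (for example by sorting query parameters alphabetically) and a membership check against the set of URLs already processed in the batch. A minimal sketch of those two steps follows. `normalize` and `filter_staged` are hypothetical helpers, and a plain Ruby `Set` stands in for the Redis set; Wayfarer's actual normalization rules are not shown in this diff.

```ruby
require "set"
require "uri"

# Sorts query parameters alphabetically so equivalent URLs compare equal.
def normalize(url)
  uri = URI.parse(url)
  if uri.query
    pairs = URI.decode_www_form(uri.query).sort
    uri.query = URI.encode_www_form(pairs)
  end
  uri.to_s
end

# Normalizes the URLs, drops duplicates, and discards those already processed.
# `processed` is an in-memory Set standing in for the batch's Redis set.
def filter_staged(urls, processed)
  urls.map { |url| normalize(url) }.uniq.reject { |url| processed.include?(url) }
end

processed = Set["https://example.com/search?a=1&b=2"]
staged = filter_staged(
  ["https://example.com/search?b=2&a=1", "https://example.com/contact"],
  processed
)
```

Here the first URL normalizes to an entry that is already in the set and is discarded, so only `https://example.com/contact` remains to be written to the queue.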
data/docs/guides/reliability.md
DELETED
@@ -1,41 +0,0 @@
|
|
1
|
-
# Reliability
|
2
|
-
|
3
|
-
## Durability
|
4
|
-
|
5
|
-
Wayfarer executes atop reliable message queues such as Sidekiq, Resque,
|
6
|
-
RabbitMQ, etc. Its configuration is independent of the underlying queue
|
7
|
-
infrastructure it reads from and writes to.
|
8
|
-
|
9
|
-
## Self-healing user agents
|
10
|
-
|
11
|
-
Wayfarer handles the scenario where a remote browser process has crashed and
|
12
|
-
must be replaced by a fresh browser process.
|
13
|
-
|
14
|
-
This can be tested locally by automating a browser with headless mode turned
|
15
|
-
off, and then closing the opened browser window: The current job fails, but the
|
16
|
-
next job has access to a newly established browser session again.
|
17
|
-
|
18
|
-
For example Ferrum might raise `Ferrum::DeadBrowserError`. Wayfarer's
|
19
|
-
user agents are self-healing and react to these kinds of errors internally. When
|
20
|
-
a browser window is closed, the Ferrum user agent attempts to establish a new
|
21
|
-
browser process as a replacement, for the next job to use.
|
22
|
-
|
23
|
-
[Wayfarer never swallows exceptions](/guides/error_handling). This means
|
24
|
-
that even though the user agent might heal itself, jobs still need to explicitly
|
25
|
-
retry browser errors:
|
26
|
-
|
27
|
-
```ruby
|
28
|
-
class Foobar < Wayfarer::Base
|
29
|
-
route { to: :index }
|
30
|
-
|
31
|
-
retry_on Ferrum::DeadBrowserError, attempts: 3, wait: :exponentially_longer
|
32
|
-
|
33
|
-
# ...
|
34
|
-
end
|
35
|
-
```
|
36
|
-
|
37
|
-
This leads to log entries like:
|
38
|
-
|
39
|
-
```
|
40
|
-
Retrying DummyJob in 3 seconds, due to a Ferrum::DeadBrowserError.
|
41
|
-
```
|
data/docs/guides/routing/steering.md
DELETED
@@ -1,30 +0,0 @@
|
|
1
|
-
# Steering
|
2
|
-
|
3
|
-
A job's router can receive arguments computed dynamically by `::steer`.
|
4
|
-
Steering enables [batch routing](/cookbook/batch_routing).
|
5
|
-
|
6
|
-
For example, the following router has hostname and path hard-coded:
|
7
|
-
|
8
|
-
```ruby
|
9
|
-
class DummyJob < Wayfarer::Base
|
10
|
-
route do
|
11
|
-
host "example.com", path: "/contact", to: :index
|
12
|
-
end
|
13
|
-
end
|
14
|
-
```
|
15
|
-
|
16
|
-
Instead, hostname and path could be provided by `::steer`, too:
|
17
|
-
|
18
|
-
```ruby
|
19
|
-
class DummyJob < Wayfarer::Base
|
20
|
-
route do |hostname, path|
|
21
|
-
host hostname, path: path, to: :index
|
22
|
-
end
|
23
|
-
|
24
|
-
steer do |_task|
|
25
|
-
["example.com", "/contact"]
|
26
|
-
end
|
27
|
-
end
|
28
|
-
```
|
29
|
-
|
30
|
-
Note that `steer` yields the current [task](/guides/tasks).
|
data/docs/reference/api/base.md
DELETED
@@ -1,48 +0,0 @@
|
|
1
|
-
---
|
2
|
-
title: Wayfarer::Base
|
3
|
-
---
|
4
|
-
|
5
|
-
# `Wayfarer::Base`
|
6
|
-
|
7
|
-
Wayfarer's complete job API.
|
8
|
-
|
9
|
-
---
|
10
|
-
|
11
|
-
### `::route`
|
12
|
-
: Draw routes to instance methods.
|
13
|
-
|
14
|
-
---
|
15
|
-
|
16
|
-
### `::steer { (Wayfarer::Task) -> [any] }`
|
17
|
-
: Provide router arguments.
|
18
|
-
|
19
|
-
---
|
20
|
-
|
21
|
-
### `#task -> Wayfarer::Task`
|
22
|
-
: The currently processing task.
|
23
|
-
|
24
|
-
---
|
25
|
-
|
26
|
-
### `#params -> Hash`
|
27
|
-
: URL parameters collected from the matching route.
|
28
|
-
|
29
|
-
---
|
30
|
-
|
31
|
-
### `#stage(String | [String]) -> void`
|
32
|
-
: Add URLs to a processing set. URLs already processed within the
|
33
|
-
current batch get discarded and are not enqueued. Every staged URL gets
|
34
|
-
normalized.
|
35
|
-
|
36
|
-
---
|
37
|
-
|
38
|
-
### `#browser -> Object`
|
39
|
-
: The user agent that retrieved the current page.
|
40
|
-
|
41
|
-
---
|
42
|
-
|
43
|
-
### `#page(live: true | false) -> Page`
|
44
|
-
: The page representing the response retrieved from the currently
|
45
|
-
processing URL.
|
46
|
-
|
47
|
-
With `live: true` called, a fresh `Page` is returned that reflects the
|
48
|
-
current browser DOM. Calls to `#page` return the most recent page.
|