vessel 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/README.md +108 -6
- data/lib/vessel.rb +9 -1
- data/lib/vessel/cargo.rb +86 -0
- data/lib/vessel/cli.rb +15 -0
- data/lib/vessel/engine.rb +53 -0
- data/lib/vessel/middleware.rb +23 -0
- data/lib/vessel/request.rb +19 -0
- data/lib/vessel/scheduler.rb +53 -0
- data/lib/vessel/version.rb +3 -1
- metadata +22 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 44fb472d4afaf916edc97894dcc39cf8b6bfbf3f8f1f0b2e8a47f495482b1bd9
+  data.tar.gz: 36af4cd9021bd410bf1988c01f97d98df4e5646f5a56416004869fe643403672
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bda3863083cdce0e8011675a0e83a583d626e81ab713803a54c5056f922d4822b069dacd8d4e5f0079d4f8625a172f7f9d30d4e3586439137af088ac0911201e
+  data.tar.gz: 205b2f54fa17283daf50d0fdaa96e67f5dec4bed2c69ccc740433c90ecefaa9c4b1e13740cb703ac56cfe8c92b2df9da436fee94fc7937242465a33e91a088f5
data/README.md
CHANGED
@@ -1,17 +1,119 @@
 # Vessel - high-level web crawling framework
 
-
+#### Fast as Chrome, dead simple and yet extendable.
 
-
+It is Ruby high-level web crawling framework based on
+[Ferrum](https://github.com/rubycdp/ferrum) for extracting the data you need
+from websites. It can be used in a wide range of scenarios, like data mining,
+monitoring or historical archival. For automated testing we recommend
+[Cuprite](https://github.com/rubycdp/cuprite).
+
+Thanks to Evrone [design team](https://evrone.com/design?utm_source=github&utm_campaign=vessel). Read about [Vessel](https://evrone.com/vessel-framework?utm_source=github&utm_campaign=vessel) & other projects supported by Evrone [here](https://evrone.com/cases?utm_source=github&utm_campaign=vessel#open-source).
+
+
+## Install
+
+Add this to your Gemfile:
 
 ```ruby
 gem "vessel"
 ```
 
-And then execute:
 
-
+## A look around
+
+In order to show you how Vessel works we are going to crawl together
+[famous quotes website](http://quotes.toscrape.com):
+
+```ruby
+require "json"
+require "vessel"
+
+class QuotesToScrapeCom < Vessel::Cargo
+  domain "quotes.toscrape.com"
+  start_urls "http://quotes.toscrape.com/tag/humor/"
+
+  def parse
+    css("div.quote").each do |quote|
+      yield({
+        author: quote.at_xpath("span/small").text,
+        text: quote.at_css("span.text").text
+      })
+    end
+
+    if next_page = at_xpath("//li[@class='next']/a[@href]")
+      url = absolute_url(next_page.attribute(:href))
+      yield request(url: url, method: :parse)
+    end
+  end
+end
+
+quotes = []
+QuotesToScrapeCom.run { |q| quotes << q }
+puts JSON.generate(quotes)
+```
+
+Save this to `quotes.rb` file and run `bundle exec ruby quotes.rb > quotes.json`.
+When this finishes you will have a list of the quotes in JSON format in the
+`quotes.json` file.
+
+How it all works? First Vessel using Ferrum spawns Chrome which goes to one or
+more urls in `start_urls`, in our case it's only one. After Chrome reports back
+that page is loaded with all the resources it needs the first default callback
+`parse` is invoked. In the parse callback, we loop through the quote elements
+using a CSS Selector, yield a Hash with the extracted quote text and author and
+look for a link to the next page and schedule another request using the same
+parse method as callback.
+
+Notice that all requests are scheduled and handled concurrently. We use thread
+pool to work with all your requests with one page per core by default or add
+`threads max: n` to a class. If you yield more than one request Ruby will send
+them to Chrome which will load pages in parallel. Thus crawler is lightweight
+and speedy.
+
+
+## Settings
+
+* domain
+* start_urls
+* delay
+* timeout
+* threads
+* middleware
+
+
+## Selectors
+
+* at_css
+* css
+* at_xpath
+* xpath
+
+
+## Middleware
+
+To be continued
+
+
+## License
+
+Copyright 2018-2020 Machinio
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
 
-
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
 
-
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/vessel.rb
CHANGED
@@ -1,6 +1,14 @@
+# frozen_string_literal: true
+
+require "concurrent-ruby"
+require "vessel/engine"
+require "vessel/middleware"
+require "vessel/scheduler"
+require "vessel/request"
 require "vessel/version"
+require "vessel/cargo"
 
 module Vessel
   class Error < StandardError; end
-
+  class NotImplementedError < Error; end
 end
data/lib/vessel/cargo.rb
ADDED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+require "ferrum"
+require "forwardable"
+
+module Vessel
+  class Cargo
+    DELAY = 0
+    START_URLS = [].freeze
+    MIDDLEWARE = [].freeze
+    MIN_THREADS = 1
+    MAX_THREADS = Concurrent.processor_count
+
+    class << self
+      attr_reader :settings
+
+      def run(settings = nil, &block)
+        self.settings.merge!(Hash(settings))
+        Engine.run(self, &block)
+      end
+
+      def domain(name)
+        settings[:domain] = name
+      end
+
+      def start_urls(*urls)
+        settings[:start_urls] = urls
+      end
+
+      def delay(value)
+        settings[:delay] = value
+      end
+
+      def timeout(value)
+        settings[:timeout] = value
+      end
+
+      def threads(min: MIN_THREADS, max: MAX_THREADS)
+        settings[:min_threads] = min
+        settings[:max_threads] = max
+      end
+
+      def middleware(*classes)
+        settings[:middleware] = classes
+      end
+
+      def settings
+        @settings ||= {
+          delay: DELAY,
+          middleware: MIDDLEWARE,
+          start_urls: START_URLS,
+          min_threads: MIN_THREADS,
+          max_threads: MAX_THREADS,
+          domain: name&.split('::')&.last&.downcase
+        }
+      end
+    end
+
+    extend Forwardable
+    delegate %i[at_css css at_xpath xpath] => :page
+
+    attr_reader :page
+
+    def initialize(page = nil)
+      @page = page
+    end
+
+    def domain
+      self.class.settings[:domain]
+    end
+
+    def parse
+      raise NotImplementedError
+    end
+
+    private
+
+    def request(**options)
+      Request.new(**options)
+    end
+
+    def absolute_url(relative)
+      Addressable::URI.join(page.current_url, relative).to_s
+    end
+  end
+end
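The class-level settings in `cargo.rb` can be tried without Chrome. The following is a minimal stdlib-only sketch of that pattern, not the gem's code: `SketchCargo`, `Quotes` and `Example` are hypothetical names, only three of the six settings are reproduced, and `Etc.nprocessors` stands in for `Concurrent.processor_count`.

```ruby
require "etc"

# Hypothetical stand-in for the class-level settings DSL of Vessel::Cargo.
class SketchCargo
  MIN_THREADS = 1
  MAX_THREADS = Etc.nprocessors # assumption: replaces Concurrent.processor_count

  class << self
    def domain(name)
      settings[:domain] = name
    end

    def start_urls(*urls)
      settings[:start_urls] = urls
    end

    def threads(min: MIN_THREADS, max: MAX_THREADS)
      settings[:min_threads] = min
      settings[:max_threads] = max
    end

    # @settings is an instance variable of the class object, so every
    # subclass lazily builds its own hash on first access.
    def settings
      @settings ||= {
        start_urls: [],
        min_threads: MIN_THREADS,
        max_threads: MAX_THREADS,
        domain: name&.split("::")&.last&.downcase
      }
    end
  end
end

class Quotes < SketchCargo
  start_urls "http://quotes.toscrape.com/tag/humor/"
  threads max: 1
end

class Example < SketchCargo
  domain "example.com"
end

p Quotes.settings[:domain]      # falls back to the downcased class name
p Quotes.settings[:max_threads]
p Example.settings[:domain]
```

Because the hash lives on each class object, settings declared in one crawler class do not leak into another.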
data/lib/vessel/cli.rb
ADDED
data/lib/vessel/engine.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+module Vessel
+  class Engine
+    def self.run(*args, &block)
+      new(*args, &block).tap(&:run)
+    end
+
+    attr_reader :crawler_class, :settings, :scheduler, :middleware
+
+    def initialize(klass, &block)
+      @crawler_class = klass
+      @settings = klass.settings
+      @middleware = block || Middleware.build(*settings[:middleware])
+      @queue = SizedQueue.new(settings[:max_threads])
+      @scheduler = Scheduler.new(@queue, settings)
+    end
+
+    def run
+      scheduler.post(*start_requests)
+
+      until @queue.closed?
+        message = @queue.pop
+        raise(message) if message.is_a?(Exception)
+        handle(*message)
+        @queue.close if idle?
+      end
+    end
+
+    def handle(page, request)
+      crawler = @crawler_class.new(page)
+      crawler.send(request.method) do |*args|
+        if args.all? { |i| i.is_a?(Request) }
+          scheduler.post(*args)
+        else
+          @middleware&.call(*args)
+        end
+      end
+    ensure
+      page.close
+    end
+
+    def start_requests
+      Request.build(*settings[:start_urls])
+    end
+
+    def idle?
+      @queue.empty? &&
+        @scheduler.queue_length.zero? &&
+        @scheduler.scheduled_task_count == @scheduler.completed_task_count
+    end
+  end
+end
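`Engine#handle` decides what to do with each value a crawler yields: yields made up only of requests are scheduled, and everything else goes to the block or middleware. The sketch below isolates that dispatch rule with hypothetical stand-ins (`SketchRequest`, `SketchCrawler`, two plain arrays instead of the scheduler and middleware), so it runs without the gem or a browser.

```ruby
# Hypothetical stand-in for Vessel::Request (url plus callback name).
class SketchRequest
  attr_reader :url, :method

  def initialize(url:, method: :parse)
    @url = url
    @method = method
  end
end

# Hypothetical crawler: yields two items and one follow-up request.
class SketchCrawler
  def parse
    yield({ author: "A", text: "first" })
    yield({ author: "B", text: "second" })
    yield SketchRequest.new(url: "http://quotes.toscrape.com/page/2/")
  end
end

scheduled = [] # stands in for scheduler.post
items = []     # stands in for the block given to run, or the middleware

# Same rule as the block in Engine#handle: a yield consisting only of
# requests is scheduled, any other yield is treated as extracted data.
dispatch = lambda do |*args|
  if args.all? { |arg| arg.is_a?(SketchRequest) }
    scheduled.concat(args)
  else
    items.concat(args)
  end
end

request = SketchRequest.new(url: "http://quotes.toscrape.com/tag/humor/")
SketchCrawler.new.send(request.method, &dispatch)

p items.size
p scheduled.map(&:url)
```

One consequence of the `all?` check is that a single yield mixing a request with a data hash is treated as data, so requests and items are best yielded separately.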
data/lib/vessel/middleware.rb
ADDED
@@ -0,0 +1,23 @@
+# frozen_string_literal: true
+
+module Vessel
+  class Middleware
+    attr_reader :middleware
+
+    def self.build(*classes)
+      classes.inject { |base, klass| base.new(klass.new) }
+    end
+
+    def initialize(middleware = nil)
+      @middleware = middleware
+    end
+
+    def ==(other)
+      self.class == other.class
+    end
+
+    def call
+      raise NotImplementedError
+    end
+  end
+end
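The README section on middleware is still "To be continued", so the intended usage is not documented in this release. The sketch below copies `build` and `initialize` into a renamed class (`SketchMiddleware`) to show what `build` returns for zero, one and two classes. The subclasses `Upcase` and `Collect`, and the way `Upcase#call` forwards to the wrapped middleware, are assumptions made for illustration.

```ruby
# Renamed copy of the build/initialize logic from the diff, so that it runs
# without the gem.
class SketchMiddleware
  attr_reader :middleware

  def self.build(*classes)
    classes.inject { |base, klass| base.new(klass.new) }
  end

  def initialize(middleware = nil)
    @middleware = middleware
  end
end

# Hypothetical outer middleware: transforms the item, then forwards it.
class Upcase < SketchMiddleware
  def call(item)
    middleware.call(item.transform_values(&:upcase))
  end
end

# Hypothetical inner middleware: stores whatever reaches it.
class Collect < SketchMiddleware
  def items
    @items ||= []
  end

  def call(item)
    items << item
    item
  end
end

chain = SketchMiddleware.build(Upcase, Collect)
chain.call({ text: "hello" })

p chain.class                        # the first class wraps the second
p chain.middleware.items
p SketchMiddleware.build.inspect     # no classes
p SketchMiddleware.build(Collect)    # one class: inject returns it untouched
```

With a single class, `inject` without an initial value returns the element itself, so `build` hands back the class rather than an instance. A crawler configured with exactly one middleware class would therefore need that class to respond to `call`.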
data/lib/vessel/request.rb
ADDED
@@ -0,0 +1,19 @@
+# frozen_string_literal: true
+
+require "addressable/uri"
+
+module Vessel
+  class Request
+    attr_reader :url, :uri, :method
+
+    def self.build(*urls)
+      urls.map { |url| new(url: url) }
+    end
+
+    def initialize(url:, method: :parse)
+      @url = url.to_s
+      @uri = Addressable::URI.parse(@url)
+      @method = method
+    end
+  end
+end
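Requests for follow-up pages are built from links that are often relative, which is why `Cargo#absolute_url` joins them with the current page URL. A minimal sketch of that join, assuming the standard library's `URI.join` as a stand-in for `Addressable::URI.join` and a hypothetical two-argument helper:

```ruby
require "uri"

# Hypothetical helper: takes the current URL explicitly instead of reading
# page.current_url, and uses stdlib URI.join rather than Addressable.
def absolute_url(current_url, relative)
  URI.join(current_url, relative).to_s
end

current = "http://quotes.toscrape.com/tag/humor/"

p absolute_url(current, "/tag/humor/page/2/")   # root-relative link
p absolute_url(current, "page/2/")              # path-relative link
p absolute_url(current, "http://example.com/x") # already absolute
```

The first two calls resolve to the same page, and an already absolute link passes through unchanged.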
data/lib/vessel/scheduler.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+require "forwardable"
+require "concurrent-ruby"
+
+module Vessel
+  class Scheduler
+    extend Forwardable
+    delegate %i[scheduled_task_count completed_task_count queue_length] => :@pool
+
+    attr_reader :browser, :queue, :delay
+
+    def initialize(queue, settings)
+      @queue = queue
+      @min_threads, @max_threads, @delay =
+        settings.values_at(:min_threads, :max_threads, :delay)
+
+      options = {}
+      options.merge!(timeout: settings[:timeout]) if settings[:timeout]
+      @browser = Ferrum::Browser.new(**options)
+    end
+
+    def post(*requests)
+      requests.map do |request|
+        Concurrent::Promises.future_on(pool, queue, request) do |queue, request|
+          queue << goto(request)
+        end
+      end
+    end
+
+    private
+
+    def pool
+      @pool ||= Concurrent::ThreadPoolExecutor.new(
+        max_queue: 0,
+        min_threads: @min_threads,
+        max_threads: @max_threads
+      )
+    end
+
+    def goto(request)
+      page = browser.create_page
+      # Delay is set between requests when we don't want to bombard server with
+      # requests so it requires crawler to be single threaded. Otherwise doesn't
+      # make sense.
+      sleep(delay) if @max_threads == 1 && delay > 0
+      page.goto(request.url)
+      [page, request]
+    rescue => e
+      e
+    end
+  end
+end
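The scheduler and the engine communicate through a bounded queue: worker threads push either a result or a rescued exception, and the consumer inspects each message. The sketch below shows that hand-off with the standard library only. Plain `Thread`s stand in for `Concurrent::ThreadPoolExecutor`, `fetch` is a hypothetical replacement for `Scheduler#goto`, and the consumer collects errors instead of re-raising them as `Engine#run` does.

```ruby
# Hypothetical replacement for Scheduler#goto: returns a [page, url] pair,
# or the exception itself so that it can travel through the queue.
def fetch(url)
  raise ArgumentError, "unsupported url: #{url}" unless url.start_with?("http")

  ["<html for #{url}>", url]
rescue => e
  e
end

urls = [
  "http://quotes.toscrape.com/",
  "not-a-url",
  "http://quotes.toscrape.com/page/2/"
]

# Bounded queue: producers block once two messages are waiting.
queue = SizedQueue.new(2)
workers = urls.map { |url| Thread.new { queue << fetch(url) } }

pages = []
errors = []

# One message is expected per URL, in whatever order the threads finish.
urls.size.times do
  message = queue.pop
  if message.is_a?(Exception)
    errors << message
  else
    pages << message
  end
end

workers.each(&:join)

p pages.map(&:last).sort
p errors.map(&:class)
```

Returning the exception as a value keeps the failure on the consumer's thread, which is where the engine decides whether to raise it.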
data/lib/vessel/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: vessel
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 platform: ruby
 authors:
 - Dmitry Vorotilin
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-04-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ferrum
@@ -24,6 +24,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0.4'
+- !ruby/object:Gem::Dependency
+  name: thor
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.20'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.20'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -77,6 +91,12 @@ files:
 - LICENSE
 - README.md
 - lib/vessel.rb
+- lib/vessel/cargo.rb
+- lib/vessel/cli.rb
+- lib/vessel/engine.rb
+- lib/vessel/middleware.rb
+- lib/vessel/request.rb
+- lib/vessel/scheduler.rb
 - lib/vessel/version.rb
 homepage: https://github.com/route/vessel
 licenses: