vessel 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/README.md +108 -6
- data/lib/vessel.rb +9 -1
- data/lib/vessel/cargo.rb +86 -0
- data/lib/vessel/cli.rb +15 -0
- data/lib/vessel/engine.rb +53 -0
- data/lib/vessel/middleware.rb +23 -0
- data/lib/vessel/request.rb +19 -0
- data/lib/vessel/scheduler.rb +53 -0
- data/lib/vessel/version.rb +3 -1
- metadata +22 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 44fb472d4afaf916edc97894dcc39cf8b6bfbf3f8f1f0b2e8a47f495482b1bd9
+  data.tar.gz: 36af4cd9021bd410bf1988c01f97d98df4e5646f5a56416004869fe643403672
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bda3863083cdce0e8011675a0e83a583d626e81ab713803a54c5056f922d4822b069dacd8d4e5f0079d4f8625a172f7f9d30d4e3586439137af088ac0911201e
+  data.tar.gz: 205b2f54fa17283daf50d0fdaa96e67f5dec4bed2c69ccc740433c90ecefaa9c4b1e13740cb703ac56cfe8c92b2df9da436fee94fc7937242465a33e91a088f5
data/README.md
CHANGED
@@ -1,17 +1,119 @@
 # Vessel - high-level web crawling framework
 
-
+#### Fast as Chrome, dead simple and yet extendable.
 
-
+It is Ruby high-level web crawling framework based on
+[Ferrum](https://github.com/rubycdp/ferrum) for extracting the data you need
+from websites. It can be used in a wide range of scenarios, like data mining,
+monitoring or historical archival. For automated testing we recommend
+[Cuprite](https://github.com/rubycdp/cuprite).
+
+Thanks to Evrone [design team](https://evrone.com/design?utm_source=github&utm_campaign=vessel). Read about [Vessel](https://evrone.com/vessel-framework?utm_source=github&utm_campaign=vessel) & other projects supported by Evrone [here](https://evrone.com/cases?utm_source=github&utm_campaign=vessel#open-source).
+
+
+## Install
+
+Add this to your Gemfile:
 
 ```ruby
 gem "vessel"
 ```
 
-And then execute:
 
-
+## A look around
+
+In order to show you how Vessel works we are going to crawl together
+[famous quotes website](http://quotes.toscrape.com):
+
+```ruby
+require "json"
+require "vessel"
+
+class QuotesToScrapeCom < Vessel::Cargo
+  domain "quotes.toscrape.com"
+  start_urls "http://quotes.toscrape.com/tag/humor/"
+
+  def parse
+    css("div.quote").each do |quote|
+      yield({
+        author: quote.at_xpath("span/small").text,
+        text: quote.at_css("span.text").text
+      })
+    end
+
+    if next_page = at_xpath("//li[@class='next']/a[@href]")
+      url = absolute_url(next_page.attribute(:href))
+      yield request(url: url, method: :parse)
+    end
+  end
+end
+
+quotes = []
+QuotesToScrapeCom.run { |q| quotes << q }
+puts JSON.generate(quotes)
+```
+
+Save this to `quotes.rb` file and run `bundle exec ruby quotes.rb > quotes.json`.
+When this finishes you will have a list of the quotes in JSON format in the
+`quotes.json` file.
+
+How it all works? First Vessel using Ferrum spawns Chrome which goes to one or
+more urls in `start_urls`, in our case it's only one. After Chrome reports back
+that page is loaded with all the resources it needs the first default callback
+`parse` is invoked. In the parse callback, we loop through the quote elements
+using a CSS Selector, yield a Hash with the extracted quote text and author and
+look for a link to the next page and schedule another request using the same
+parse method as callback.
+
+Notice that all requests are scheduled and handled concurrently. We use thread
+pool to work with all your requests with one page per core by default or add
+`threads max: n` to a class. If you yield more than one request Ruby will send
+them to Chrome which will load pages in parallel. Thus crawler is lightweight
+and speedy.
+
+
+## Settings
+
+* domain
+* start_urls
+* delay
+* timeout
+* threads
+* middleware
+
+
+## Selectors
+
+* at_css
+* css
+* at_xpath
+* xpath
+
+
+## Middleware
+
+To be continued
+
+
+## License
+
+Copyright 2018-2020 Machinio
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
 
-
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
 
-
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/vessel.rb
CHANGED
@@ -1,6 +1,14 @@
+# frozen_string_literal: true
+
+require "concurrent-ruby"
+require "vessel/engine"
+require "vessel/middleware"
+require "vessel/scheduler"
+require "vessel/request"
 require "vessel/version"
+require "vessel/cargo"
 
 module Vessel
   class Error < StandardError; end
-
+  class NotImplementedError < Error; end
 end
data/lib/vessel/cargo.rb
ADDED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+require "ferrum"
+require "forwardable"
+
+module Vessel
+  class Cargo
+    DELAY = 0
+    START_URLS = [].freeze
+    MIDDLEWARE = [].freeze
+    MIN_THREADS = 1
+    MAX_THREADS = Concurrent.processor_count
+
+    class << self
+      attr_reader :settings
+
+      def run(settings = nil, &block)
+        self.settings.merge!(Hash(settings))
+        Engine.run(self, &block)
+      end
+
+      def domain(name)
+        settings[:domain] = name
+      end
+
+      def start_urls(*urls)
+        settings[:start_urls] = urls
+      end
+
+      def delay(value)
+        settings[:delay] = value
+      end
+
+      def timeout(value)
+        settings[:timeout] = value
+      end
+
+      def threads(min: MIN_THREADS, max: MAX_THREADS)
+        settings[:min_threads] = min
+        settings[:max_threads] = max
+      end
+
+      def middleware(*classes)
+        settings[:middleware] = classes
+      end
+
+      def settings
+        @settings ||= {
+          delay: DELAY,
+          middleware: MIDDLEWARE,
+          start_urls: START_URLS,
+          min_threads: MIN_THREADS,
+          max_threads: MAX_THREADS,
+          domain: name&.split('::')&.last&.downcase
+        }
+      end
+    end
+
+    extend Forwardable
+    delegate %i[at_css css at_xpath xpath] => :page
+
+    attr_reader :page
+
+    def initialize(page = nil)
+      @page = page
+    end
+
+    def domain
+      self.class.settings[:domain]
+    end
+
+    def parse
+      raise NotImplementedError
+    end
+
+    private
+
+    def request(**options)
+      Request.new(**options)
+    end
+
+    def absolute_url(relative)
+      Addressable::URI.join(page.current_url, relative).to_s
+    end
+  end
+end
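Reviewer note on the settings DSL in `cargo.rb`: the class macros (`domain`, `start_urls`, and so on) write into a hash that is memoized in `@settings` on the class object, so each subclass should end up with its own settings. The sketch below reproduces only that pattern with hypothetical stand-in classes (`SketchCargo`, `Quotes`, `Blog` are not part of vessel) so it can run without Ferrum or Chrome. It is an illustration of the idea, not the gem's code.

```ruby
# Illustrative stand-in for the class-level settings pattern in cargo.rb.
# SketchCargo, Quotes and Blog are hypothetical names, not part of vessel.
class SketchCargo
  class << self
    # Class macro: store the crawl domain for this particular subclass.
    def domain(name)
      settings[:domain] = name
    end

    # Class macro: collect any number of start URLs into an array.
    def start_urls(*urls)
      settings[:start_urls] = urls
    end

    # Built lazily, once per class. @settings is an instance variable of the
    # class object itself, so every subclass gets its own hash.
    def settings
      @settings ||= {
        start_urls: [],
        domain: name&.split("::")&.last&.downcase
      }
    end
  end
end

class Quotes < SketchCargo
  domain "quotes.toscrape.com"
  start_urls "http://quotes.toscrape.com/tag/humor/"
end

# No macros used: the domain falls back to the downcased class name.
class Blog < SketchCargo; end

p Quotes.settings
p Blog.settings
```

In the diff itself, `attr_reader :settings` is followed by an explicit `def settings` in the same singleton class, so the reader looks redundant; the explicit method is the one that takes effect.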
data/lib/vessel/cli.rb
ADDED
data/lib/vessel/engine.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+module Vessel
+  class Engine
+    def self.run(*args, &block)
+      new(*args, &block).tap(&:run)
+    end
+
+    attr_reader :crawler_class, :settings, :scheduler, :middleware
+
+    def initialize(klass, &block)
+      @crawler_class = klass
+      @settings = klass.settings
+      @middleware = block || Middleware.build(*settings[:middleware])
+      @queue = SizedQueue.new(settings[:max_threads])
+      @scheduler = Scheduler.new(@queue, settings)
+    end
+
+    def run
+      scheduler.post(*start_requests)
+
+      until @queue.closed?
+        message = @queue.pop
+        raise(message) if message.is_a?(Exception)
+        handle(*message)
+        @queue.close if idle?
+      end
+    end
+
+    def handle(page, request)
+      crawler = @crawler_class.new(page)
+      crawler.send(request.method) do |*args|
+        if args.all? { |i| i.is_a?(Request) }
+          scheduler.post(*args)
+        else
+          @middleware&.call(*args)
+        end
+      end
+    ensure
+      page.close
+    end
+
+    def start_requests
+      Request.build(*settings[:start_urls])
+    end
+
+    def idle?
+      @queue.empty? &&
+        @scheduler.queue_length.zero? &&
+        @scheduler.scheduled_task_count == @scheduler.completed_task_count
+    end
+  end
+end
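Reviewer note on the `Engine#run` loop: workers push loaded pages onto a queue, the main thread pops and handles them, handlers may schedule more work, and the queue is closed once nothing is pending. The sketch below shows that shape with the Ruby standard library only. It is an assumption-laden illustration: plain threads replace the concurrent-ruby pool, an unbounded `Queue` replaces the `SizedQueue`, hand-rolled counters replace `scheduled_task_count` and `completed_task_count`, and strings stand in for pages and requests.

```ruby
# Stdlib-only sketch of a pop/handle/close loop. Plain threads stand in for
# the thread pool, and strings stand in for pages and requests.
queue     = Queue.new
lock      = Mutex.new
scheduled = 0
completed = 0

# Schedule one worker thread per URL; each worker pushes a single message.
post = lambda do |*urls|
  urls.each do |url|
    lock.synchronize { scheduled += 1 }
    Thread.new do
      # Push and count under the same lock, so the idle check below never
      # observes a finished task whose message has not been queued yet.
      lock.synchronize do
        queue << [url, url.length]
        completed += 1
      end
    end
  end
end

handled = []
post.call("a", "bb")

until queue.closed?
  url, _size = queue.pop
  handled << url
  post.call("ccc") if url == "a" # a handler may schedule follow-up work
  idle = lock.synchronize { queue.empty? && scheduled == completed }
  queue.close if idle
end

p handled.sort
```

The sketch takes one lock around both the push and the counter update so the idle check is atomic with respect to workers. Whether the counters used in the diff give the same guarantee depends on concurrent-ruby's internals and is not verified here.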
data/lib/vessel/middleware.rb
ADDED
@@ -0,0 +1,23 @@
+# frozen_string_literal: true
+
+module Vessel
+  class Middleware
+    attr_reader :middleware
+
+    def self.build(*classes)
+      classes.inject { |base, klass| base.new(klass.new) }
+    end
+
+    def initialize(middleware = nil)
+      @middleware = middleware
+    end
+
+    def ==(other)
+      self.class == other.class
+    end
+
+    def call
+      raise NotImplementedError
+    end
+  end
+end
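Reviewer note on `Middleware.build`: as written, `inject` runs without a seed, so a single class is returned as the class itself rather than an instance, an empty list yields `nil`, and with three or more classes the second step would call `.new` on an instance. The README marks middleware as "To be continued", so this may simply be unfinished. One possible way to build the chain is sketched below with hypothetical classes (`SketchMiddleware`, `Upcase`, `Tag` are illustrative, not vessel API); it assumes each middleware receives the yielded item and passes it on to the wrapped one.

```ruby
# Hypothetical stand-in for a middleware chain; not the gem's implementation.
class SketchMiddleware
  attr_reader :middleware

  # Wrap right to left so the first class listed becomes the outermost link.
  # With no classes the result is nil, with one class a single instance.
  def self.build(*classes)
    classes.reverse.inject(nil) { |inner, klass| klass.new(inner) }
  end

  def initialize(middleware = nil)
    @middleware = middleware
  end

  # Default behaviour: hand the item to the wrapped middleware, if any.
  def call(item)
    middleware ? middleware.call(item) : item
  end
end

class Upcase < SketchMiddleware
  def call(item)
    super(item.merge(text: item[:text].upcase))
  end
end

class Tag < SketchMiddleware
  def call(item)
    super(item.merge(tagged: true))
  end
end

chain  = SketchMiddleware.build(Upcase, Tag)
result = chain.call({ text: "hi" })
p result
```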
data/lib/vessel/request.rb
ADDED
@@ -0,0 +1,19 @@
+# frozen_string_literal: true
+
+require "addressable/uri"
+
+module Vessel
+  class Request
+    attr_reader :url, :uri, :method
+
+    def initialize(url:, method: :parse)
+      @url = url.to_s
+      @uri = Addressable::URI.parse(@url)
+      @method = method
+    end
+
+    def self.build(*urls)
+      urls.map { |url| new(url: url) }
+    end
+  end
+end
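Reviewer note on `Request`: each request pairs a URL with the name of the callback to invoke (`:parse` by default), and the README's next-page step resolves a relative link against the current page before building a new request. The sketch below walks through that flow using the standard library's `URI` in place of Addressable, which is an assumption made to keep it dependency-free. `SketchRequest` and `method_name` are hypothetical names; the diff's `attr_reader :method` shadows `Object#method` on request instances, which the sketch avoids.

```ruby
require "uri"

# Hypothetical stand-in for a request object, using stdlib URI, not Addressable.
SketchRequest = Struct.new(:url, :uri, :method_name) do
  # One request per URL, each with the default callback name.
  def self.build(*urls)
    urls.map { |url| from(url: url) }
  end

  def self.from(url:, method_name: :parse)
    new(url.to_s, URI.parse(url.to_s), method_name)
  end
end

first, = SketchRequest.build("http://quotes.toscrape.com/tag/humor/")

# Resolve a relative "next page" link against the current URL.
next_url  = URI.join(first.url, "/tag/humor/page/2/").to_s
follow_up = SketchRequest.from(url: next_url)

p first.uri.host
p follow_up.url
```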
data/lib/vessel/scheduler.rb
ADDED
@@ -0,0 +1,53 @@
+# frozen_string_literal: true
+
+require "forwardable"
+require "concurrent-ruby"
+
+module Vessel
+  class Scheduler
+    extend Forwardable
+    delegate %i[scheduled_task_count completed_task_count queue_length] => :@pool
+
+    attr_reader :browser, :queue, :delay
+
+    def initialize(queue, settings)
+      @queue = queue
+      @min_threads, @max_threads, @delay =
+        settings.values_at(:min_threads, :max_threads, :delay)
+
+      options = {}
+      options.merge!(timeout: settings[:timeout]) if settings[:timeout]
+      @browser = Ferrum::Browser.new(**options)
+    end
+
+    def post(*requests)
+      requests.map do |request|
+        Concurrent::Promises.future_on(pool, queue, request) do |queue, request|
+          queue << goto(request)
+        end
+      end
+    end
+
+    private
+
+    def pool
+      @pool ||= Concurrent::ThreadPoolExecutor.new(
+        max_queue: 0,
+        min_threads: @min_threads,
+        max_threads: @max_threads
+      )
+    end
+
+    def goto(request)
+      page = browser.create_page
+      # Delay is set between requests when we don't want to bombard server with
+      # requests so it requires crawler to be single threaded. Otherwise doesn't
+      # make sense.
+      sleep(delay) if @max_threads == 1 && delay > 0
+      page.goto(request.url)
+      [page, request]
+    rescue => e
+      e
+    end
+  end
+end
data/lib/vessel/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: vessel
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 platform: ruby
 authors:
 - Dmitry Vorotilin
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-04-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ferrum
@@ -24,6 +24,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0.4'
+- !ruby/object:Gem::Dependency
+  name: thor
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.20'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.20'
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
@@ -77,6 +91,12 @@ files:
 - LICENSE
 - README.md
 - lib/vessel.rb
+- lib/vessel/cargo.rb
+- lib/vessel/cli.rb
+- lib/vessel/engine.rb
+- lib/vessel/middleware.rb
+- lib/vessel/request.rb
+- lib/vessel/scheduler.rb
 - lib/vessel/version.rb
 homepage: https://github.com/route/vessel
 licenses: