vessel 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: f52e40dc5f4086394068c24ed878e3b2bb82da411a33f1d4dc73c056bc84e53f
4
- data.tar.gz: 17e3ee52630ecb8ac30f7ec75f36d0d92f160b9a817458db5ca00920684329f9
3
+ metadata.gz: 44fb472d4afaf916edc97894dcc39cf8b6bfbf3f8f1f0b2e8a47f495482b1bd9
4
+ data.tar.gz: 36af4cd9021bd410bf1988c01f97d98df4e5646f5a56416004869fe643403672
5
5
  SHA512:
6
- metadata.gz: 1dff310db5f5daa97ece87535e6084a1e67b75f2c2a2476ee33aadb87895b52e32c991639826fef1188f1674cb86ffed32ff65df33d95cf9e5cd92a7baabd976
7
- data.tar.gz: 4168d33ba27a09d4642cdd5551e7e6945d42f102aba3dc54ef622cc4281441c0b52b855a54dc0b4a9d1a07735def04b9d14ffd4fb1fe99c61746cf1803da7a08
6
+ metadata.gz: bda3863083cdce0e8011675a0e83a583d626e81ab713803a54c5056f922d4822b069dacd8d4e5f0079d4f8625a172f7f9d30d4e3586439137af088ac0911201e
7
+ data.tar.gz: 205b2f54fa17283daf50d0fdaa96e67f5dec4bed2c69ccc740433c90ecefaa9c4b1e13740cb703ac56cfe8c92b2df9da436fee94fc7937242465a33e91a088f5
data/README.md CHANGED
@@ -1,17 +1,119 @@
1
1
  # Vessel - high-level web crawling framework
2
2
 
3
- ## Installation
3
+ #### Fast as Chrome, dead simple and yet extendable.
4
4
 
5
- Add this line to your application's Gemfile:
5
+ It is a high-level Ruby web crawling framework based on
6
+ [Ferrum](https://github.com/rubycdp/ferrum) for extracting the data you need
7
+ from websites. It can be used in a wide range of scenarios, like data mining,
8
+ monitoring or historical archival. For automated testing we recommend
9
+ [Cuprite](https://github.com/rubycdp/cuprite).
10
+
11
+ Thanks to Evrone [design team](https://evrone.com/design?utm_source=github&utm_campaign=vessel). Read about [Vessel](https://evrone.com/vessel-framework?utm_source=github&utm_campaign=vessel) & other projects supported by Evrone [here](https://evrone.com/cases?utm_source=github&utm_campaign=vessel#open-source).
12
+
13
+
14
+ ## Install
15
+
16
+ Add this to your Gemfile:
6
17
 
7
18
  ```ruby
8
19
  gem "vessel"
9
20
  ```
10
21
 
11
- And then execute:
12
22
 
13
- $ bundle
23
+ ## A look around
24
+
25
+ To show you how Vessel works, we are going to crawl a
26
+ [famous quotes website](http://quotes.toscrape.com):
27
+
28
+ ```ruby
29
+ require "json"
30
+ require "vessel"
31
+
32
+ class QuotesToScrapeCom < Vessel::Cargo
33
+   domain "quotes.toscrape.com"
34
+   start_urls "http://quotes.toscrape.com/tag/humor/"
35
+
36
+   def parse
37
+     css("div.quote").each do |quote|
38
+       yield({
39
+         author: quote.at_xpath("span/small").text,
40
+         text: quote.at_css("span.text").text
41
+       })
42
+     end
43
+
44
+     if next_page = at_xpath("//li[@class='next']/a[@href]")
45
+       url = absolute_url(next_page.attribute(:href))
46
+       yield request(url: url, method: :parse)
47
+     end
48
+   end
49
+ end
50
+
51
+ quotes = []
52
+ QuotesToScrapeCom.run { |q| quotes << q }
53
+ puts JSON.generate(quotes)
54
+ ```
55
+
56
+ Save this to a `quotes.rb` file and run `bundle exec ruby quotes.rb > quotes.json`.
57
+ When this finishes you will have a list of the quotes in JSON format in the
58
+ `quotes.json` file.
59
+
60
+ How does it all work? First, Vessel uses Ferrum to spawn Chrome, which visits one
61
+ or more URLs in `start_urls`; in our case there is only one. After Chrome reports
62
+ that the page has loaded with all the resources it needs, the default callback
63
+ `parse` is invoked. In the `parse` callback, we loop through the quote elements
64
+ using a CSS selector, yield a Hash with the extracted quote text and author,
65
+ then look for a link to the next page and schedule another request using the same
66
+ `parse` method as the callback.
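
The callback flow described in this paragraph can be pictured with a minimal, self-contained sketch. It is not Vessel's code and it does not start Chrome: `FakeCrawler`, `PAGES`, `crawl` and the `callback` field are hypothetical stand-ins, and the canned `PAGES` hash plays the role of loaded pages. The only idea it borrows is that a block receives everything `parse` yields, re-queues requests, and collects anything else as an item.

```ruby
# Stand-in for a request object; `callback` names the method to invoke.
Request = Struct.new(:url, :callback, keyword_init: true)

# Hypothetical crawler working on canned data instead of a real browser page.
class FakeCrawler
  PAGES = {
    "/page/1" => { quotes: %w[a b], next: "/page/2" },
    "/page/2" => { quotes: %w[c], next: nil }
  }.freeze

  def initialize(url)
    @page = PAGES.fetch(url)
  end

  # Yields one Hash per quote, then a follow-up Request if a next page exists.
  def parse
    @page[:quotes].each { |text| yield({ text: text }) }
    yield Request.new(url: @page[:next], callback: :parse) if @page[:next]
  end
end

# Runs callbacks until no requests are left and returns the collected items.
def crawl(start_url)
  items = []
  pending = [Request.new(url: start_url, callback: :parse)]

  until pending.empty?
    request = pending.shift
    FakeCrawler.new(request.url).public_send(request.callback) do |*args|
      if args.all? { |arg| arg.is_a?(Request) }
        pending.concat(args) # schedule further pages
      else
        items.concat(args)   # hand extracted data to the caller
      end
    end
  end

  items
end

p crawl("/page/1")
```

Because requests and items travel through the same `yield`, a callback can mix them freely, and the loop ends once no callback has scheduled anything new.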
67
+
68
+ Notice that all requests are scheduled and handled concurrently. We use a thread
69
+ pool to process your requests, with one page per core by default; to change this, add
70
+ `threads max: n` to the class. If you yield more than one request, Ruby will send
71
+ them to Chrome, which will load the pages in parallel. This keeps the crawler lightweight
72
+ and speedy.
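
To make the concurrency model concrete, here is a rough sketch using only Ruby's standard library. Vessel itself relies on a concurrent-ruby thread pool and Ferrum pages; this sketch swaps those for plain threads, and `fetch_all` and the `"loaded ..."` strings are illustrative assumptions rather than anything from the gem. It shows workers loading "pages" in parallel and handing results back through a bounded queue that a single consumer drains.

```ruby
# Loads every URL on a small pool of worker threads and returns
# a Hash of url => result. The string result is a placeholder for a loaded page.
def fetch_all(urls, max_threads: 2)
  jobs = Queue.new
  urls.each { |url| jobs << url }

  # Bounded queue: workers block when the consumer falls behind.
  results = SizedQueue.new(max_threads)

  workers = Array.new(max_threads) do
    Thread.new do
      loop do
        url = begin
          jobs.pop(true) # non-blocking pop, raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, "loaded #{url}"]
      end
    end
  end

  # One result per URL is consumed on the calling thread.
  loaded = Array.new(urls.size) { results.pop }
  workers.each(&:join)
  loaded.to_h
end

p fetch_all(%w[/page/1 /page/2 /page/3], max_threads: 2)
```

Results arrive in completion order rather than submission order, which is why the sketch returns a Hash keyed by URL instead of an Array.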
73
+
74
+
75
+ ## Settings
76
+
77
+ * domain
78
+ * start_urls
79
+ * delay
80
+ * timeout
81
+ * threads
82
+ * middleware
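
Judging by `lib/vessel/cargo.rb` in this release, each setting is a class-level call that writes into a per-class settings Hash, and `scheduler.rb` only honors `delay` when the maximum thread count is 1. The sketch below imitates that pattern without requiring the gem: `MiniCargo`, `Quotes` and `Other` are hypothetical names, the default of 4 threads is an assumption (the gem uses the processor count), and `timeout` and `middleware` are left out for brevity.

```ruby
# Hypothetical miniature of a class-level settings DSL.
class MiniCargo
  class << self
    # Each class lazily gets its own Hash, so subclasses do not share state.
    def settings
      @settings ||= {
        delay: 0,
        start_urls: [],
        min_threads: 1,
        max_threads: 4, # assumed default for this sketch
        domain: name&.split("::")&.last&.downcase
      }
    end

    def domain(value)
      settings[:domain] = value
    end

    def start_urls(*urls)
      settings[:start_urls] = urls
    end

    def delay(value)
      settings[:delay] = value
    end

    def threads(min: 1, max: 4)
      settings[:min_threads] = min
      settings[:max_threads] = max
    end
  end
end

class Quotes < MiniCargo
  domain "quotes.toscrape.com"
  start_urls "http://quotes.toscrape.com/tag/humor/"
  delay 2        # seconds between requests
  threads max: 1 # a delay is only meaningful with a single thread
end

class Other < MiniCargo; end

p Quotes.settings
p Other.settings[:domain]
```

A class that declares nothing falls back to the defaults, with the domain derived from the lowercased class name.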
83
+
84
+
85
+ ## Selectors
86
+
87
+ * at_css
88
+ * css
89
+ * at_xpath
90
+ * xpath
91
+
92
+
93
+ ## Middleware
94
+
95
+ To be continued
96
+
97
+
98
+ ## License
99
+
100
+ Copyright 2018-2020 Machinio
101
+
102
+ Permission is hereby granted, free of charge, to any person obtaining
103
+ a copy of this software and associated documentation files (the
104
+ "Software"), to deal in the Software without restriction, including
105
+ without limitation the rights to use, copy, modify, merge, publish,
106
+ distribute, sublicense, and/or sell copies of the Software, and to
107
+ permit persons to whom the Software is furnished to do so, subject to
108
+ the following conditions:
14
109
 
15
- Or install it yourself as:
110
+ The above copyright notice and this permission notice shall be
111
+ included in all copies or substantial portions of the Software.
16
112
 
17
- $ gem install vessel
113
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
114
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
115
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
116
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
117
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
118
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
119
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/vessel.rb CHANGED
@@ -1,6 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "concurrent-ruby"
4
+ require "vessel/engine"
5
+ require "vessel/middleware"
6
+ require "vessel/scheduler"
7
+ require "vessel/request"
1
8
  require "vessel/version"
9
+ require "vessel/cargo"
2
10
 
3
11
  module Vessel
4
12
  class Error < StandardError; end
5
- # Your code goes here...
13
+ class NotImplementedError < Error; end
6
14
  end
data/lib/vessel/cargo.rb ADDED
@@ -0,0 +1,86 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "ferrum"
4
+ require "forwardable"
5
+
6
+ module Vessel
7
+ class Cargo
8
+ DELAY = 0
9
+ START_URLS = [].freeze
10
+ MIDDLEWARE = [].freeze
11
+ MIN_THREADS = 1
12
+ MAX_THREADS = Concurrent.processor_count
13
+
14
+ class << self
15
+ attr_reader :settings
16
+
17
+ def run(settings = nil, &block)
18
+ self.settings.merge!(Hash(settings))
19
+ Engine.run(self, &block)
20
+ end
21
+
22
+ def domain(name)
23
+ settings[:domain] = name
24
+ end
25
+
26
+ def start_urls(*urls)
27
+ settings[:start_urls] = urls
28
+ end
29
+
30
+ def delay(value)
31
+ settings[:delay] = value
32
+ end
33
+
34
+ def timeout(value)
35
+ settings[:timeout] = value
36
+ end
37
+
38
+ def threads(min: MIN_THREADS, max: MAX_THREADS)
39
+ settings[:min_threads] = min
40
+ settings[:max_threads] = max
41
+ end
42
+
43
+ def middleware(*classes)
44
+ settings[:middleware] = classes
45
+ end
46
+
47
+ def settings
48
+ @settings ||= {
49
+ delay: DELAY,
50
+ middleware: MIDDLEWARE,
51
+ start_urls: START_URLS,
52
+ min_threads: MIN_THREADS,
53
+ max_threads: MAX_THREADS,
54
+ domain: name&.split('::')&.last&.downcase
55
+ }
56
+ end
57
+ end
58
+
59
+ extend Forwardable
60
+ delegate %i[at_css css at_xpath xpath] => :page
61
+
62
+ attr_reader :page
63
+
64
+ def initialize(page = nil)
65
+ @page = page
66
+ end
67
+
68
+ def domain
69
+ self.class.settings[:domain]
70
+ end
71
+
72
+ def parse
73
+ raise NotImplementedError
74
+ end
75
+
76
+ private
77
+
78
+ def request(**options)
79
+ Request.new(**options)
80
+ end
81
+
82
+ def absolute_url(relative)
83
+ Addressable::URI.join(page.current_url, relative).to_s
84
+ end
85
+ end
86
+ end
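
The `absolute_url` helper above resolves a link's `href` against the current page URL. A minimal sketch of the same idea, assuming the standard library's `URI.join` in place of `Addressable::URI.join` and passing the current URL explicitly instead of reading it from a page object:

```ruby
require "uri"

# Resolves `relative` against `current_url` and returns the result as a String.
def absolute_url(current_url, relative)
  URI.join(current_url, relative).to_s
end

base = "http://quotes.toscrape.com/tag/humor/"
puts absolute_url(base, "/tag/humor/page/2/")
puts absolute_url(base, "page/2/")
```

Both calls resolve to `http://quotes.toscrape.com/tag/humor/page/2/`: the first `href` is root-relative, the second is relative to the trailing-slash directory of the base URL.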
data/lib/vessel/cli.rb ADDED
@@ -0,0 +1,15 @@
1
+ require "thor"
2
+
3
+ module Vessel
4
+ class CLI < Thor
5
+ desc "version", "Print version."
6
+ def version
7
+ puts Vessel::VERSION
8
+ end
9
+
10
+ desc "start NAME", "Run given crawler."
11
+ def start(name)
12
+ raise NotImplementedError
13
+ end
14
+ end
15
+ end
data/lib/vessel/engine.rb ADDED
@@ -0,0 +1,53 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Vessel
4
+ class Engine
5
+ def self.run(*args, &block)
6
+ new(*args, &block).tap(&:run)
7
+ end
8
+
9
+ attr_reader :crawler_class, :settings, :scheduler, :middleware
10
+
11
+ def initialize(klass, &block)
12
+ @crawler_class = klass
13
+ @settings = klass.settings
14
+ @middleware = block || Middleware.build(*settings[:middleware])
15
+ @queue = SizedQueue.new(settings[:max_threads])
16
+ @scheduler = Scheduler.new(@queue, settings)
17
+ end
18
+
19
+ def run
20
+ scheduler.post(*start_requests)
21
+
22
+ until @queue.closed?
23
+ message = @queue.pop
24
+ raise(message) if message.is_a?(Exception)
25
+ handle(*message)
26
+ @queue.close if idle?
27
+ end
28
+ end
29
+
30
+ def handle(page, request)
31
+ crawler = @crawler_class.new(page)
32
+ crawler.send(request.method) do |*args|
33
+ if args.all? { |i| i.is_a?(Request) }
34
+ scheduler.post(*args)
35
+ else
36
+ @middleware&.call(*args)
37
+ end
38
+ end
39
+ ensure
40
+ page.close
41
+ end
42
+
43
+ def start_requests
44
+ Request.build(*settings[:start_urls])
45
+ end
46
+
47
+ def idle?
48
+ @queue.empty? &&
49
+ @scheduler.queue_length.zero? &&
50
+ @scheduler.scheduled_task_count == @scheduler.completed_task_count
51
+ end
52
+ end
53
+ end
data/lib/vessel/middleware.rb ADDED
@@ -0,0 +1,23 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Vessel
4
+ class Middleware
5
+ attr_reader :middleware
6
+
7
+ def self.build(*classes)
8
+ classes.inject { |base, klass| base.new(klass.new) }
9
+ end
10
+
11
+ def initialize(middleware = nil)
12
+ @middleware = middleware
13
+ end
14
+
15
+ def ==(other)
16
+ self.class == other.class
17
+ end
18
+
19
+ def call
20
+ raise NotImplementedError
21
+ end
22
+ end
23
+ end
data/lib/vessel/request.rb ADDED
@@ -0,0 +1,19 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "addressable/uri"
4
+
5
+ module Vessel
6
+ class Request
7
+ attr_reader :url, :uri, :method
8
+
9
+ def self.build(*urls)
10
+ urls.map { |url| new(url: url) }
11
+ end
12
+
13
+ def initialize(url:, method: :parse)
14
+ @url = url.to_s
15
+ @uri = Addressable::URI.parse(@url)
16
+ @method = method
17
+ end
18
+ end
19
+ end
data/lib/vessel/scheduler.rb ADDED
@@ -0,0 +1,53 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "forwardable"
4
+ require "concurrent-ruby"
5
+
6
+ module Vessel
7
+ class Scheduler
8
+ extend Forwardable
9
+ delegate %i[scheduled_task_count completed_task_count queue_length] => :@pool
10
+
11
+ attr_reader :browser, :queue, :delay
12
+
13
+ def initialize(queue, settings)
14
+ @queue = queue
15
+ @min_threads, @max_threads, @delay =
16
+ settings.values_at(:min_threads, :max_threads, :delay)
17
+
18
+ options = {}
19
+ options.merge!(timeout: settings[:timeout]) if settings[:timeout]
20
+ @browser = Ferrum::Browser.new(**options)
21
+ end
22
+
23
+ def post(*requests)
24
+ requests.map do |request|
25
+ Concurrent::Promises.future_on(pool, queue, request) do |queue, request|
26
+ queue << goto(request)
27
+ end
28
+ end
29
+ end
30
+
31
+ private
32
+
33
+ def pool
34
+ @pool ||= Concurrent::ThreadPoolExecutor.new(
35
+ max_queue: 0,
36
+ min_threads: @min_threads,
37
+ max_threads: @max_threads
38
+ )
39
+ end
40
+
41
+ def goto(request)
42
+ page = browser.create_page
43
+ # A delay between requests keeps us from bombarding the server. It only
44
+ # makes sense when the crawler is single-threaded, so it is skipped when
45
+ # more than one thread is allowed.
46
+ sleep(delay) if @max_threads == 1 && delay > 0
47
+ page.goto(request.url)
48
+ [page, request]
49
+ rescue => e
50
+ e
51
+ end
52
+ end
53
+ end
data/lib/vessel/version.rb CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Vessel
2
- VERSION = "0.1.0"
4
+ VERSION = "0.1.1"
3
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: vessel
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Dmitry Vorotilin
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-09-17 00:00:00.000000000 Z
11
+ date: 2020-04-06 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: ferrum
@@ -24,6 +24,20 @@ dependencies:
24
24
  - - ">="
25
25
  - !ruby/object:Gem::Version
26
26
  version: '0.4'
27
+ - !ruby/object:Gem::Dependency
28
+ name: thor
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '0.20'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '0.20'
27
41
  - !ruby/object:Gem::Dependency
28
42
  name: bundler
29
43
  requirement: !ruby/object:Gem::Requirement
@@ -77,6 +91,12 @@ files:
77
91
  - LICENSE
78
92
  - README.md
79
93
  - lib/vessel.rb
94
+ - lib/vessel/cargo.rb
95
+ - lib/vessel/cli.rb
96
+ - lib/vessel/engine.rb
97
+ - lib/vessel/middleware.rb
98
+ - lib/vessel/request.rb
99
+ - lib/vessel/scheduler.rb
80
100
  - lib/vessel/version.rb
81
101
  homepage: https://github.com/route/vessel
82
102
  licenses: