vessel 0.1.0 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: f52e40dc5f4086394068c24ed878e3b2bb82da411a33f1d4dc73c056bc84e53f
-   data.tar.gz: 17e3ee52630ecb8ac30f7ec75f36d0d92f160b9a817458db5ca00920684329f9
+   metadata.gz: 44fb472d4afaf916edc97894dcc39cf8b6bfbf3f8f1f0b2e8a47f495482b1bd9
+   data.tar.gz: 36af4cd9021bd410bf1988c01f97d98df4e5646f5a56416004869fe643403672
  SHA512:
-   metadata.gz: 1dff310db5f5daa97ece87535e6084a1e67b75f2c2a2476ee33aadb87895b52e32c991639826fef1188f1674cb86ffed32ff65df33d95cf9e5cd92a7baabd976
-   data.tar.gz: 4168d33ba27a09d4642cdd5551e7e6945d42f102aba3dc54ef622cc4281441c0b52b855a54dc0b4a9d1a07735def04b9d14ffd4fb1fe99c61746cf1803da7a08
+   metadata.gz: bda3863083cdce0e8011675a0e83a583d626e81ab713803a54c5056f922d4822b069dacd8d4e5f0079d4f8625a172f7f9d30d4e3586439137af088ac0911201e
+   data.tar.gz: 205b2f54fa17283daf50d0fdaa96e67f5dec4bed2c69ccc740433c90ecefaa9c4b1e13740cb703ac56cfe8c92b2df9da436fee94fc7937242465a33e91a088f5
data/README.md CHANGED
@@ -1,17 +1,119 @@
  # Vessel - high-level web crawling framework
 
- ## Installation
+ #### Fast as Chrome, dead simple and yet extendable.
 
- Add this line to your application's Gemfile:
+ It is Ruby high-level web crawling framework based on
+ [Ferrum](https://github.com/rubycdp/ferrum) for extracting the data you need
+ from websites. It can be used in a wide range of scenarios, like data mining,
+ monitoring or historical archival. For automated testing we recommend
+ [Cuprite](https://github.com/rubycdp/cuprite).
+
+ Thanks to Evrone [design team](https://evrone.com/design?utm_source=github&utm_campaign=vessel). Read about [Vessel](https://evrone.com/vessel-framework?utm_source=github&utm_campaign=vessel) & other projects supported by Evrone [here](https://evrone.com/cases?utm_source=github&utm_campaign=vessel#open-source).
+
+
+ ## Install
+
+ Add this to your Gemfile:
 
  ```ruby
  gem "vessel"
  ```
 
- And then execute:
 
-     $ bundle
+ ## A look around
+
+ In order to show you how Vessel works we are going to crawl together
+ [famous quotes website](http://quotes.toscrape.com):
+
+ ```ruby
+ require "json"
+ require "vessel"
+
+ class QuotesToScrapeCom < Vessel::Cargo
+   domain "quotes.toscrape.com"
+   start_urls "http://quotes.toscrape.com/tag/humor/"
+
+   def parse
+     css("div.quote").each do |quote|
+       yield({
+         author: quote.at_xpath("span/small").text,
+         text: quote.at_css("span.text").text
+       })
+     end
+
+     if next_page = at_xpath("//li[@class='next']/a[@href]")
+       url = absolute_url(next_page.attribute(:href))
+       yield request(url: url, method: :parse)
+     end
+   end
+ end
+
+ quotes = []
+ QuotesToScrapeCom.run { |q| quotes << q }
+ puts JSON.generate(quotes)
+ ```
+
+ Save this to `quotes.rb` file and run `bundle exec ruby quotes.rb > quotes.json`.
+ When this finishes you will have a list of the quotes in JSON format in the
+ `quotes.json` file.
+
+ How it all works? First Vessel using Ferrum spawns Chrome which goes to one or
+ more urls in `start_urls`, in our case it's only one. After Chrome reports back
+ that page is loaded with all the resources it needs the first default callback
+ `parse` is invoked. In the parse callback, we loop through the quote elements
+ using a CSS Selector, yield a Hash with the extracted quote text and author and
+ look for a link to the next page and schedule another request using the same
+ parse method as callback.
+
+ Notice that all requests are scheduled and handled concurrently. We use thread
+ pool to work with all your requests with one page per core by default or add
+ `threads max: n` to a class. If you yield more than one request Ruby will send
+ them to Chrome which will load pages in parallel. Thus crawler is lightweight
+ and speedy.
+
+
+ ## Settings
+
+ * domain
+ * start_urls
+ * delay
+ * timeout
+ * threads
+ * middleware
+
+
+ ## Selectors
+
+ * at_css
+ * css
+ * at_xpath
+ * xpath
+
+
+ ## Middleware
+
+ To be continued
+
+
+ ## License
+
+ Copyright 2018-2020 Machinio
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
 
- Or install it yourself as:
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
 
-     $ gem install vessel
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/vessel.rb CHANGED
@@ -1,6 +1,14 @@
+ # frozen_string_literal: true
+
+ require "concurrent-ruby"
+ require "vessel/engine"
+ require "vessel/middleware"
+ require "vessel/scheduler"
+ require "vessel/request"
  require "vessel/version"
+ require "vessel/cargo"
 
  module Vessel
    class Error < StandardError; end
-   # Your code goes here...
+   class NotImplementedError < Error; end
  end
data/lib/vessel/cargo.rb ADDED
@@ -0,0 +1,86 @@
+ # frozen_string_literal: true
+
+ require "ferrum"
+ require "forwardable"
+
+ module Vessel
+   class Cargo
+     DELAY = 0
+     START_URLS = [].freeze
+     MIDDLEWARE = [].freeze
+     MIN_THREADS = 1
+     MAX_THREADS = Concurrent.processor_count
+
+     class << self
+       attr_reader :settings
+
+       def run(settings = nil, &block)
+         self.settings.merge!(Hash(settings))
+         Engine.run(self, &block)
+       end
+
+       def domain(name)
+         settings[:domain] = name
+       end
+
+       def start_urls(*urls)
+         settings[:start_urls] = urls
+       end
+
+       def delay(value)
+         settings[:delay] = value
+       end
+
+       def timeout(value)
+         settings[:timeout] = value
+       end
+
+       def threads(min: MIN_THREADS, max: MAX_THREADS)
+         settings[:min_threads] = min
+         settings[:max_threads] = max
+       end
+
+       def middleware(*classes)
+         settings[:middleware] = classes
+       end
+
+       def settings
+         @settings ||= {
+           delay: DELAY,
+           middleware: MIDDLEWARE,
+           start_urls: START_URLS,
+           min_threads: MIN_THREADS,
+           max_threads: MAX_THREADS,
+           domain: name&.split('::')&.last&.downcase
+         }
+       end
+     end
+
+     extend Forwardable
+     delegate %i[at_css css at_xpath xpath] => :page
+
+     attr_reader :page
+
+     def initialize(page = nil)
+       @page = page
+     end
+
+     def domain
+       self.class.settings[:domain]
+     end
+
+     def parse
+       raise NotImplementedError
+     end
+
+     private
+
+     def request(**options)
+       Request.new(**options)
+     end
+
+     def absolute_url(relative)
+       Addressable::URI.join(page.current_url, relative).to_s
+     end
+   end
+ end
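Note on the class-level settings in the hunk above: each crawler class lazily builds its own hash in a class-level instance variable, so a `domain` or `threads` declaration on one subclass should not leak into another, and a class that declares no domain falls back to its own downcased name. The sketch below reproduces that pattern without Ferrum or concurrent-ruby. `MiniCargo`, `QuotesCrawler` and `Undeclared` are hypothetical names, and the thread default is a fixed value rather than `Concurrent.processor_count`.

```ruby
# Minimal stand-in for the settings DSL; not the gem's class.
class MiniCargo
  class << self
    # Lazily built, per-class settings hash (class-level instance variable).
    def settings
      @settings ||= {
        delay: 0,
        start_urls: [],
        min_threads: 1,
        max_threads: 4, # assumption: fixed value for the sketch
        domain: name&.split("::")&.last&.downcase
      }
    end

    def domain(name)
      settings[:domain] = name
    end

    def start_urls(*urls)
      settings[:start_urls] = urls
    end

    def threads(min: 1, max: 4)
      settings[:min_threads] = min
      settings[:max_threads] = max
    end
  end
end

class QuotesCrawler < MiniCargo
  domain "quotes.toscrape.com"
  start_urls "http://quotes.toscrape.com/tag/humor/"
  threads max: 2
end

# Declares nothing, so every value comes from the defaults.
class Undeclared < MiniCargo; end

puts QuotesCrawler.settings[:domain]
puts Undeclared.settings[:domain]
```

Because the hash lives on each class rather than on the base class, `QuotesCrawler` and `Undeclared` never share state, which is also why a subclass of a configured crawler would start again from the defaults.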
data/lib/vessel/cli.rb ADDED
@@ -0,0 +1,15 @@
+ require "thor"
+
+ module Vessel
+   class CLI < Thor
+     desc "version", "Print version."
+     def version
+       puts Vessel::VERSION
+     end
+
+     desc "start NAME", "Run given crawler."
+     def start(name)
+       raise NotImplementedError
+     end
+   end
+ end
data/lib/vessel/engine.rb ADDED
@@ -0,0 +1,53 @@
+ # frozen_string_literal: true
+
+ module Vessel
+   class Engine
+     def self.run(*args, &block)
+       new(*args, &block).tap(&:run)
+     end
+
+     attr_reader :crawler_class, :settings, :scheduler, :middleware
+
+     def initialize(klass, &block)
+       @crawler_class = klass
+       @settings = klass.settings
+       @middleware = block || Middleware.build(*settings[:middleware])
+       @queue = SizedQueue.new(settings[:max_threads])
+       @scheduler = Scheduler.new(@queue, settings)
+     end
+
+     def run
+       scheduler.post(*start_requests)
+
+       until @queue.closed?
+         message = @queue.pop
+         raise(message) if message.is_a?(Exception)
+         handle(*message)
+         @queue.close if idle?
+       end
+     end
+
+     def handle(page, request)
+       crawler = @crawler_class.new(page)
+       crawler.send(request.method) do |*args|
+         if args.all? { |i| i.is_a?(Request) }
+           scheduler.post(*args)
+         else
+           @middleware&.call(*args)
+         end
+       end
+     ensure
+       page.close
+     end
+
+     def start_requests
+       Request.build(*settings[:start_urls])
+     end
+
+     def idle?
+       @queue.empty? &&
+         @scheduler.queue_length.zero? &&
+         @scheduler.completed_task_count == @scheduler.scheduled_task_count
+     end
+   end
+ end
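Note on `Engine#handle` above: the callback's `yield` is the only channel back to the engine, and the engine decides what to do with each yielded value by type. Requests are posted to the scheduler; anything else goes to the middleware (or to the block given to `run`). The sketch below isolates that dispatch with plain Ruby objects. `DemoCrawler`, the hash standing in for a page, and the two collecting arrays are hypothetical; no browser is involved.

```ruby
# Hypothetical stand-ins; not the gem's classes.
class Request
  attr_reader :url, :method

  def initialize(url:, method: :parse)
    @url = url
    @method = method
  end
end

class DemoCrawler
  def initialize(page)
    @page = page
  end

  # Yields one Hash per item, then a Request for the next page if there is one.
  def parse
    @page[:quotes].each { |text| yield({ text: text }) }
    yield Request.new(url: @page[:next]) if @page[:next]
  end
end

items = []
scheduled = []

# Same shape as Engine#handle: call the requested callback and route
# whatever it yields according to its type.
handle = lambda do |page, request|
  DemoCrawler.new(page).send(request.method) do |*args|
    if args.all? { |arg| arg.is_a?(Request) }
      scheduled.concat(args) # the engine would post these to the scheduler
    else
      items.concat(args)     # the engine would pass these to the middleware
    end
  end
end

handle.call({ quotes: %w[a b], next: "/page/2/" }, Request.new(url: "/"))
puts items.inspect
puts scheduled.map(&:url).inspect
```

One consequence of the `all?` check is that a single `yield` mixing requests with other values is treated entirely as data, so a callback should yield requests and items separately.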
data/lib/vessel/middleware.rb ADDED
@@ -0,0 +1,23 @@
+ # frozen_string_literal: true
+
+ module Vessel
+   class Middleware
+     attr_reader :middleware
+
+     def self.build(*classes)
+       classes.inject { |base, klass| base.new(klass.new) }
+     end
+
+     def initialize(middleware = nil)
+       @middleware = middleware
+     end
+
+     def ==(other)
+       self.class == other.class
+     end
+
+     def call
+       raise NotImplementedError
+     end
+   end
+ end
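Note on `Middleware.build` above: `inject` without an initial value returns `nil` for an empty list and the first element untouched for a one-element list, so as written the method appears to return a bare class when given one middleware, a nested instance only for exactly two, and would call `new` on an instance for three or more. The sketch below shows one way such a chain could be folded so that any number of classes produces nested instances. It is an assumption about the intended behaviour, not the gem's code; `Stage`, `Strip`, `Upcase` and `Exclaim` are hypothetical names.

```ruby
# Hypothetical base class for a chain of item-processing stages.
class Stage
  attr_reader :inner

  def initialize(inner = nil)
    @inner = inner
  end

  # Right fold: the last class becomes the innermost stage and the first
  # class ends up outermost. An empty list yields nil.
  def self.build(*classes)
    classes.reverse.inject(nil) { |inner, klass| klass.new(inner) }
  end

  # Pass the item to the next stage, or return it at the end of the chain.
  def call(item)
    inner ? inner.call(item) : item
  end
end

class Strip < Stage
  def call(item)
    super(item.strip)
  end
end

class Upcase < Stage
  def call(item)
    super(item.upcase)
  end
end

class Exclaim < Stage
  def call(item)
    super("#{item}!")
  end
end

chain = Stage.build(Strip, Upcase, Exclaim)
puts chain.call("  hi ")
```

Starting the fold from `nil` keeps the empty case compatible with the engine's `@middleware&.call(*args)` guard.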
data/lib/vessel/request.rb ADDED
@@ -0,0 +1,19 @@
+ # frozen_string_literal: true
+
+ require "addressable/uri"
+
+ module Vessel
+   class Request
+     attr_reader :url, :uri, :method
+
+     def self.build(*urls)
+       urls.map { |url| new(url: url) }
+     end
+
+     def initialize(url:, method: :parse)
+       @url = url.to_s
+       @uri = Addressable::URI.parse(@url)
+       @method = method
+     end
+   end
+ end
data/lib/vessel/scheduler.rb ADDED
@@ -0,0 +1,53 @@
+ # frozen_string_literal: true
+
+ require "forwardable"
+ require "concurrent-ruby"
+
+ module Vessel
+   class Scheduler
+     extend Forwardable
+     delegate %i[scheduled_task_count completed_task_count queue_length] => :@pool
+
+     attr_reader :browser, :queue, :delay
+
+     def initialize(queue, settings)
+       @queue = queue
+       @min_threads, @max_threads, @delay =
+         settings.values_at(:min_threads, :max_threads, :delay)
+
+       options = {}
+       options.merge!(timeout: settings[:timeout]) if settings[:timeout]
+       @browser = Ferrum::Browser.new(**options)
+     end
+
+     def post(*requests)
+       requests.map do |request|
+         Concurrent::Promises.future_on(pool, queue, request) do |queue, request|
+           queue << goto(request)
+         end
+       end
+     end
+
+     private
+
+     def pool
+       @pool ||= Concurrent::ThreadPoolExecutor.new(
+         max_queue: 0,
+         min_threads: @min_threads,
+         max_threads: @max_threads
+       )
+     end
+
+     def goto(request)
+       page = browser.create_page
+       # Delay is set between requests when we don't want to bombard server with
+       # requests so it requires crawler to be single threaded. Otherwise doesn't
+       # make sense.
+       sleep(delay) if @max_threads == 1 && delay > 0
+       page.goto(request.url)
+       [page, request]
+     rescue => e
+       e
+     end
+   end
+ end
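Note on how `Scheduler` and `Engine#run` cooperate: workers load pages and push either a result or the rescued exception onto a bounded queue, while the main thread pops, handles each message, and closes the queue once nothing is outstanding. The sketch below reproduces that loop with the standard library only. It swaps `Concurrent::ThreadPoolExecutor` for plain threads and replaces the pool's task counts with a simple `pending` counter; `PAGES` and the `post` lambda are hypothetical stand-ins for a site and for `Scheduler#post`.

```ruby
# Fake site: each URL maps to an optional follow-up URL.
PAGES = { "/1" => "/2", "/2" => "/3", "/3" => nil }.freeze

queue = SizedQueue.new(2) # bounded, like SizedQueue.new(settings[:max_threads])
pending = 0               # requests posted but not yet handled
visited = []

# Stand-in for Scheduler#post: "load" each URL on its own thread and push
# the outcome onto the queue. Only the main thread touches `pending`.
post = lambda do |*urls|
  urls.each do |url|
    pending += 1
    Thread.new do
      result = begin
        [url, PAGES.fetch(url)]
      rescue KeyError => e
        e # errors travel through the queue, as in Scheduler#goto
      end
      queue << result
    end
  end
end

post.call("/1")

until queue.closed?
  message = queue.pop
  raise message if message.is_a?(Exception)

  url, next_url = message
  visited << url
  post.call(next_url) if next_url # schedule the follow-up before counting down
  pending -= 1
  queue.close if pending.zero?    # nothing in flight and nothing queued
end

puts visited.inspect
```

Posting the follow-up request before decrementing the counter is what keeps the loop from closing the queue while more work is still on its way.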
data/lib/vessel/version.rb CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  module Vessel
-   VERSION = "0.1.0"
+   VERSION = "0.1.1"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: vessel
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  platform: ruby
  authors:
  - Dmitry Vorotilin
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-09-17 00:00:00.000000000 Z
+ date: 2020-04-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: ferrum
@@ -24,6 +24,20 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0.4'
+ - !ruby/object:Gem::Dependency
+   name: thor
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.20'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.20'
  - !ruby/object:Gem::Dependency
    name: bundler
    requirement: !ruby/object:Gem::Requirement
@@ -77,6 +91,12 @@ files:
  - LICENSE
  - README.md
  - lib/vessel.rb
+ - lib/vessel/cargo.rb
+ - lib/vessel/cli.rb
+ - lib/vessel/engine.rb
+ - lib/vessel/middleware.rb
+ - lib/vessel/request.rb
+ - lib/vessel/scheduler.rb
  - lib/vessel/version.rb
  homepage: https://github.com/route/vessel
  licenses: