proxycrawl 0.2.1 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: db0f27951f09d662cc5ff949b088c79e4cdf2620aeea573fb25471568d73c811
-   data.tar.gz: 6dd316888c926279d847e1f2a58c813e2e08ddb72e970ca02cb8d5baedcb145c
+   metadata.gz: c225955f21200efc0790c401385727e3b9dd9bc20f785ce768f44f687f55105b
+   data.tar.gz: 0a38e44493065a9779366d4ab5372771bf92b020da513059c0cf20b69e2d696a
  SHA512:
-   metadata.gz: 96acc3f7de05710c91492e507781648f0b9b32214338a8f727a16f47c2c1d832d1ff9e6f0c8e7733873b99795a18441a25fad0c2da044f6c478586369ab31704
-   data.tar.gz: 970aa1619a944fa799584286caded25e7c573199738d21944ec1b47d1251c2b1a828e328c711b05043ece5dc61ae97979ec6be0ee4b568a0e89a375dbda8daec
+   metadata.gz: 2d81900546063ff3ddbbe660f528b8cbc04336916bce468f37c8791370b7dd99c36d32915dd471ff886bc6fb8fb498ddccc4adfb40c95009849e25a1f4938805
+   data.tar.gz: cd167890621d7d8bcbdf1c2502e545c9ea40c95abf437927e443dc1e3694d8393fae4d1c521e7b34227c4a962e172cb085697169d43f9ee1ff99369649e6c36f
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2020 ProxyCrawl
+ Copyright (c) 2022 ProxyCrawl

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -149,6 +149,7 @@ Example:
  ```ruby
  begin
    response = scraper_api.get('https://www.amazon.com/Halo-SleepSack-Swaddle-Triangle-Neutral/dp/B01LAG1TOS')
+   puts response.remaining_requests
    puts response.status_code
    puts response.body
  rescue => exception
@@ -160,18 +161,184 @@ end

  Initialize with your Leads API token and call the `get` method.

+ For more details on the implementation, please visit the [Leads API documentation](https://proxycrawl.com/docs/leads-api).
+
  ```ruby
  leads_api = ProxyCrawl::LeadsAPI.new(token: 'YOUR_TOKEN')

  begin
    response = leads_api.get('stripe.com')
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+   puts response.body
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ If you have questions or need help using the library, please open an issue or [contact us](https://proxycrawl.com/contact).
+
+
+ ## Screenshots API usage
+
+ Initialize with your Screenshots API token and call the `get` method.
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com')
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+   puts response.screenshot_path # do something with screenshot_path here
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ or using a block
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com') do |file|
+     # do something (reading/writing) with the image file here
+   end
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ or specifying a file path
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com', save_to_path: '~/screenshot.jpg') do |file|
+     # do something (reading/writing) with the image file here
+   end
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ Note that the `screenshots_api.get(url, options)` method accepts an [options](https://proxycrawl.com/docs/screenshots-api/parameters) hash.
+
+ ## Storage API usage
+
+ Initialize the Storage API using your private token.
+
+ ```ruby
+ storage_api = ProxyCrawl::StorageAPI.new(token: 'YOUR_TOKEN')
+ ```
+
+ Pass the [url](https://proxycrawl.com/docs/storage-api/parameters/#url) that you want to retrieve from [Proxycrawl Storage](https://proxycrawl.com/dashboard/storage).
+
+ ```ruby
+ begin
+   response = storage_api.get('https://www.apple.com')
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
    puts response.status_code
+   puts response.rid
    puts response.body
+   puts response.stored_at
  rescue => exception
    puts exception.backtrace
  end
  ```

+ or you can use the [RID](https://proxycrawl.com/docs/storage-api/parameters/#rid)
+
+ ```ruby
+ begin
+   response = storage_api.get(RID)
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
+   puts response.status_code
+   puts response.rid
+   puts response.body
+   puts response.stored_at
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ Note: You must send either the RID or the URL. Both parameters are optional on their own, but it is mandatory to send one of the two.
+
+ ### [Delete](https://proxycrawl.com/docs/storage-api/delete/) request
+
+ To delete a storage item from your storage area, pass its RID:
+
+ ```ruby
+ if storage_api.delete(RID)
+   puts 'delete success'
+ else
+   puts "Unable to delete: #{storage_api.body['error']}"
+ end
+ ```
+
+ ### [Bulk](https://proxycrawl.com/docs/storage-api/bulk/) request
+
+ To make a bulk request, send the list of RIDs as an array:
+
+ ```ruby
+ begin
+   response = storage_api.bulk([RID1, RID2, RID3, ...])
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
+   puts response.status_code
+   puts response.rid
+   puts response.body
+   puts response.stored_at
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ ### [RIDs](https://proxycrawl.com/docs/storage-api/rids) request
+
+ To request a list of RIDs from your storage area:
+
+ ```ruby
+ begin
+   response = storage_api.rids
+   puts response.status_code
+   puts response.rid
+   puts response.body
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ You can also specify a limit as a parameter:
+
+ ```ruby
+ storage_api.rids(100)
+ ```
+
+ ### [Total Count](https://proxycrawl.com/docs/storage-api/total_count)
+
+ To get the total number of documents in your storage area:
+
+ ```ruby
+ total_count = storage_api.total_count
+ puts "total_count: #{total_count}"
+ ```
+
  If you have questions or need help using the library, please open an issue or [contact us](https://proxycrawl.com/contact).

  ## Development
@@ -194,4 +361,4 @@ Everyone interacting in the Proxycrawl project’s codebases, issue trackers, ch

  ---

- Copyright 2020 ProxyCrawl
+ Copyright 2022 ProxyCrawl
data/lib/proxycrawl/api.rb CHANGED
@@ -6,7 +6,7 @@ require 'uri'

  module ProxyCrawl
    class API
-     attr_reader :token, :body, :status_code, :original_status, :pc_status, :url
+     attr_reader :token, :body, :timeout, :status_code, :original_status, :pc_status, :url, :storage_url

      INVALID_TOKEN = 'Token is required'
      INVALID_URL = 'URL is required'
@@ -15,14 +15,22 @@ module ProxyCrawl
        raise INVALID_TOKEN if options[:token].nil?

        @token = options[:token]
+       @timeout = options[:timeout] || 120
      end

      def get(url, options = {})
        raise INVALID_URL if url.empty?

        uri = prepare_uri(url, options)
+       req = Net::HTTP::Get.new(uri)

-       response = Net::HTTP.get_response(uri)
+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }

        prepare_response(response, options[:format])

@@ -69,16 +77,14 @@ module ProxyCrawl
      end

      def prepare_response(response, format)
-       if format == 'json' || base_url.include?('/scraper')
-         @status_code = response.code.to_i
-         @body = response.body
-       else
-         @original_status = response['original_status'].to_i
-         @status_code = response.code.to_i
-         @pc_status = response['pc_status'].to_i
-         @url = response['url']
-         @body = response.body
-       end
+       res = format == 'json' || base_url.include?('/scraper') ? JSON.parse(response.body) : response
+
+       @original_status = res['original_status'].to_i
+       @pc_status = res['pc_status'].to_i
+       @url = res['url']
+       @storage_url = res['storage_url']
+       @status_code = response.code.to_i
+       @body = response.body
      end
    end
  end
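
The `timeout` option introduced above defaults to 120 seconds and is passed to `Net::HTTP` as its `read_timeout`. A minimal sketch of overriding it, assuming a placeholder token and URL:

```ruby
require 'proxycrawl'

# 'YOUR_TOKEN' is a placeholder; timeout is the new read timeout in seconds.
api = ProxyCrawl::API.new(token: 'YOUR_TOKEN', timeout: 30)

begin
  response = api.get('https://www.example.com')
  puts response.status_code
rescue Net::ReadTimeout
  # Responses slower than 30 seconds now surface as Net::ReadTimeout.
  puts 'request timed out'
end
```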
data/lib/proxycrawl/leads_api.rb CHANGED
@@ -6,15 +6,16 @@ require 'uri'

  module ProxyCrawl
    class LeadsAPI
-     attr_reader :token, :body, :status_code
+     attr_reader :token, :timeout, :body, :status_code, :success, :remaining_requests

      INVALID_TOKEN = 'Token is required'
      INVALID_DOMAIN = 'Domain is required'

      def initialize(options = {})
-       raise INVALID_TOKEN if options[:token].nil?
+       raise INVALID_TOKEN if options[:token].nil? || options[:token].empty?

        @token = options[:token]
+       @timeout = options[:timeout] || 120
      end

      def get(domain)
@@ -23,12 +24,27 @@ module ProxyCrawl
        uri = URI('https://api.proxycrawl.com/leads')
        uri.query = URI.encode_www_form({ token: token, domain: domain })

-       response = Net::HTTP.get_response(uri)
+       req = Net::HTTP::Get.new(uri)

+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }
        @status_code = response.code.to_i
        @body = response.body

+       json_body = JSON.parse(response.body)
+       @success = json_body['success']
+       @remaining_requests = json_body['remaining_requests'].to_i
+
        self
      end
+
+     def post
+       raise 'Only GET is allowed for the LeadsAPI'
+     end
    end
  end
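
The initializer above now rejects empty tokens as well as `nil` ones; since `raise INVALID_TOKEN` raises a `RuntimeError` with that string, the new guard is easy to observe. A small sketch:

```ruby
require 'proxycrawl'

begin
  # An empty token now fails the nil-or-empty check in the initializer.
  ProxyCrawl::LeadsAPI.new(token: '')
rescue RuntimeError => e
  puts e.message # => "Token is required"
end
```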
data/lib/proxycrawl/scraper_api.rb CHANGED
@@ -2,6 +2,7 @@

  module ProxyCrawl
    class ScraperAPI < ProxyCrawl::API
+     attr_reader :remaining_requests

      def post
        raise 'Only GET is allowed for the ScraperAPI'
@@ -9,6 +10,12 @@ module ProxyCrawl

      private

+     def prepare_response(response, format)
+       super(response, format)
+       json_body = JSON.parse(response.body)
+       @remaining_requests = json_body['remaining_requests'].to_i
+     end
+
      def base_url
        'https://api.proxycrawl.com/scraper'
      end
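
Because the `prepare_response` override above parses the JSON body after calling `super`, every `ScraperAPI` response now exposes `remaining_requests`. A short sketch, reusing the README's placeholder token and product URL:

```ruby
require 'proxycrawl'

scraper_api = ProxyCrawl::ScraperAPI.new(token: 'YOUR_TOKEN')
response = scraper_api.get('https://www.amazon.com/Halo-SleepSack-Swaddle-Triangle-Neutral/dp/B01LAG1TOS')
puts response.remaining_requests # parsed from the JSON body by the override
```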
data/lib/proxycrawl/screenshots_api.rb ADDED
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ require 'securerandom'
+ require 'tmpdir'
+
+ module ProxyCrawl
+   class ScreenshotsAPI < ProxyCrawl::API
+     attr_reader :screenshot_path, :success, :remaining_requests, :screenshot_url
+
+     INVALID_SAVE_TO_PATH_FILENAME = 'Filename must end with .jpg or .jpeg'
+     SAVE_TO_PATH_FILENAME_PATTERN = /.+\.(jpg|JPG|jpeg|JPEG)$/.freeze
+
+     def post
+       raise 'Only GET is allowed for the ScreenshotsAPI'
+     end
+
+     def get(url, options = {})
+       screenshot_path = options.delete(:save_to_path) || generate_file_path
+       raise INVALID_SAVE_TO_PATH_FILENAME unless SAVE_TO_PATH_FILENAME_PATTERN =~ screenshot_path
+
+       response = super(url, options)
+       file = File.open(screenshot_path, 'w+')
+       file.write(response.body&.force_encoding('UTF-8'))
+       @screenshot_path = screenshot_path
+       yield(file) if block_given?
+       response
+     ensure
+       file&.close
+     end
+
+     private
+
+     def prepare_response(response, format)
+       super(response, format)
+       @remaining_requests = response['remaining_requests'].to_i
+       @success = response['success'] == 'true'
+       @screenshot_url = response['screenshot_url']
+     end
+
+     def base_url
+       'https://api.proxycrawl.com/screenshots'
+     end
+
+     def generate_file_name
+       "#{SecureRandom.urlsafe_base64}.jpg"
+     end
+
+     def generate_file_path
+       File.join(Dir.tmpdir, generate_file_name)
+     end
+   end
+ end
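
`get` validates `save_to_path` against `SAVE_TO_PATH_FILENAME_PATTERN` before any request is made, and falls back to a random `.jpg` under `Dir.tmpdir` when the option is omitted. A sketch of the validation, with a placeholder token and path:

```ruby
require 'proxycrawl'

screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')

begin
  # A .png path fails the /.+\.(jpg|JPG|jpeg|JPEG)$/ check before any network call.
  screenshots_api.get('https://www.apple.com', save_to_path: '/tmp/screenshot.png')
rescue RuntimeError => e
  puts e.message # => "Filename must end with .jpg or .jpeg"
end
```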
data/lib/proxycrawl/storage_api.rb ADDED
@@ -0,0 +1,126 @@
+ # frozen_string_literal: true
+
+ require 'net/http'
+ require 'json'
+ require 'uri'
+
+ module ProxyCrawl
+   class StorageAPI
+     attr_reader :token, :timeout, :original_status, :pc_status, :url, :status_code, :rid, :body, :stored_at
+
+     INVALID_TOKEN = 'Token is required'
+     INVALID_RID = 'RID is required'
+     INVALID_RID_ARRAY = 'One or more RIDs are required'
+     INVALID_URL_OR_RID = 'Either URL or RID is required'
+     BASE_URL = 'https://api.proxycrawl.com/storage'
+
+     def initialize(options = {})
+       raise INVALID_TOKEN if options[:token].nil? || options[:token].empty?
+
+       @token = options[:token]
+       @timeout = options[:timeout] || 120
+     end
+
+     def get(url_or_rid, format = 'html')
+       raise INVALID_URL_OR_RID if url_or_rid.nil? || url_or_rid.empty?
+
+       uri = URI(BASE_URL)
+       uri.query = URI.encode_www_form({ token: token, format: format }.merge(decide_url_or_rid(url_or_rid)))
+
+       req = Net::HTTP::Get.new(uri)
+
+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }
+
+       res = format == 'json' ? JSON.parse(response.body) : response
+
+       @original_status = res['original_status'].to_i
+       @pc_status = res['pc_status'].to_i
+       @url = res['url']
+       @rid = res['rid']
+       @stored_at = res['stored_at']
+
+       @status_code = response.code.to_i
+       @body = response.body
+
+       self
+     end
+
+     def delete(rid)
+       raise INVALID_RID if rid.nil? || rid.empty?
+
+       uri = URI(BASE_URL)
+       uri.query = URI.encode_www_form(token: token, rid: rid)
+       http = Net::HTTP.new(uri.host)
+       request = Net::HTTP::Delete.new(uri.request_uri)
+       response = http.request(request)
+
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @rid = rid
+       @body = JSON.parse(response.body)
+
+       @body.key?('success')
+     end
+
+     def bulk(rids_array = [])
+       raise INVALID_RID_ARRAY if rids_array.empty?
+
+       uri = URI("#{BASE_URL}/bulk")
+       uri.query = URI.encode_www_form(token: token)
+       http = Net::HTTP.new(uri.host)
+       request = Net::HTTP::Post.new(uri.request_uri, { 'Content-Type': 'application/json' })
+       request.body = { rids: rids_array }.to_json
+       response = http.request(request)
+
+       @body = JSON.parse(response.body)
+       @original_status = @body.map { |item| item['original_status'].to_i }
+       @status_code = response.code.to_i
+       @pc_status = @body.map { |item| item['pc_status'].to_i }
+       @url = @body.map { |item| item['url'] }
+       @rid = @body.map { |item| item['rid'] }
+       @stored_at = @body.map { |item| item['stored_at'] }
+
+       self
+     end
+
+     def rids(limit = -1)
+       uri = URI("#{BASE_URL}/rids")
+       query_hash = { token: token }
+       query_hash.merge!({ limit: limit }) if limit >= 0
+       uri.query = URI.encode_www_form(query_hash)
+
+       response = Net::HTTP.get_response(uri)
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @body = JSON.parse(response.body)
+       @rid = @body
+
+       @body
+     end
+
+     def total_count
+       uri = URI("#{BASE_URL}/total_count")
+       uri.query = URI.encode_www_form(token: token)
+
+       response = Net::HTTP.get_response(uri)
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @rid = rid
+       @body = JSON.parse(response.body)
+
+       body['totalCount']
+     end
+
+     private
+
+     def decide_url_or_rid(url_or_rid)
+       %r{^https?://} =~ url_or_rid ? { url: url_or_rid } : { rid: url_or_rid }
+     end
+   end
+ end
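
`decide_url_or_rid` above keys the query parameter off a `%r{^https?://}` match, so the same `get` method serves both lookups. Illustrative calls, with the token and RID as placeholders:

```ruby
require 'proxycrawl'

storage_api = ProxyCrawl::StorageAPI.new(token: 'YOUR_TOKEN')

# Matches %r{^https?://}, so it is sent as the url query parameter.
storage_api.get('https://www.apple.com')

# No scheme prefix, so it is sent as the rid query parameter instead.
storage_api.get('a1b2c3d4e5')
```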
data/lib/proxycrawl/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module ProxyCrawl
-   VERSION = '0.2.1'
+   VERSION = '1.0.0'
  end
data/lib/proxycrawl.rb CHANGED
@@ -4,6 +4,8 @@ require 'proxycrawl/version'
  require 'proxycrawl/api'
  require 'proxycrawl/scraper_api'
  require 'proxycrawl/leads_api'
+ require 'proxycrawl/screenshots_api'
+ require 'proxycrawl/storage_api'

  module ProxyCrawl
  end
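
With the two `require` lines added above, loading the gem is enough to make the new classes available; a quick, purely illustrative check:

```ruby
require 'proxycrawl'

puts defined?(ProxyCrawl::ScreenshotsAPI) # => "constant"
puts defined?(ProxyCrawl::StorageAPI)     # => "constant"
```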
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: proxycrawl
  version: !ruby/object:Gem::Version
-   version: 0.2.1
+   version: 1.0.0
  platform: ruby
  authors:
  - proxycrawl
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-10-28 00:00:00.000000000 Z
+ date: 2022-08-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rspec
@@ -86,6 +86,8 @@ files:
  - lib/proxycrawl/api.rb
  - lib/proxycrawl/leads_api.rb
  - lib/proxycrawl/scraper_api.rb
+ - lib/proxycrawl/screenshots_api.rb
+ - lib/proxycrawl/storage_api.rb
  - lib/proxycrawl/version.rb
  - proxycrawl.gemspec
  homepage: https://github.com/proxycrawl/proxycrawl-ruby
@@ -107,7 +109,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
    version: '0'
  requirements: []
- rubygems_version: 3.1.4
+ rubygems_version: 3.1.2
  signing_key:
  specification_version: 4
  summary: ProxyCrawl API client for web scraping and crawling