proxycrawl 0.2.1 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: db0f27951f09d662cc5ff949b088c79e4cdf2620aeea573fb25471568d73c811
-   data.tar.gz: 6dd316888c926279d847e1f2a58c813e2e08ddb72e970ca02cb8d5baedcb145c
+   metadata.gz: c225955f21200efc0790c401385727e3b9dd9bc20f785ce768f44f687f55105b
+   data.tar.gz: 0a38e44493065a9779366d4ab5372771bf92b020da513059c0cf20b69e2d696a
  SHA512:
-   metadata.gz: 96acc3f7de05710c91492e507781648f0b9b32214338a8f727a16f47c2c1d832d1ff9e6f0c8e7733873b99795a18441a25fad0c2da044f6c478586369ab31704
-   data.tar.gz: 970aa1619a944fa799584286caded25e7c573199738d21944ec1b47d1251c2b1a828e328c711b05043ece5dc61ae97979ec6be0ee4b568a0e89a375dbda8daec
+   metadata.gz: 2d81900546063ff3ddbbe660f528b8cbc04336916bce468f37c8791370b7dd99c36d32915dd471ff886bc6fb8fb498ddccc4adfb40c95009849e25a1f4938805
+   data.tar.gz: cd167890621d7d8bcbdf1c2502e545c9ea40c95abf437927e443dc1e3694d8393fae4d1c521e7b34227c4a962e172cb085697169d43f9ee1ff99369649e6c36f
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2020 ProxyCrawl
+ Copyright (c) 2022 ProxyCrawl

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -149,6 +149,7 @@ Example:
  ```ruby
  begin
    response = scraper_api.get('https://www.amazon.com/Halo-SleepSack-Swaddle-Triangle-Neutral/dp/B01LAG1TOS')
+   puts response.remaining_requests
    puts response.status_code
    puts response.body
  rescue => exception
@@ -160,18 +161,184 @@ end

  Initialize with your Leads API token and call the `get` method.

+ For more details on the implementation, please visit the [Leads API documentation](https://proxycrawl.com/docs/leads-api).
+
  ```ruby
  leads_api = ProxyCrawl::LeadsAPI.new(token: 'YOUR_TOKEN')

  begin
    response = leads_api.get('stripe.com')
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+   puts response.body
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ If you have questions or need help using the library, please open an issue or [contact us](https://proxycrawl.com/contact).
+
+
+ ## Screenshots API usage
+
+ Initialize with your Screenshots API token and call the `get` method.
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com')
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+   puts response.screenshot_path # do something with screenshot_path here
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ or using a block
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com') do |file|
+     # do something (reading/writing) with the image file here
+   end
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ or specifying a file path
+
+ ```ruby
+ screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')
+
+ begin
+   response = screenshots_api.get('https://www.apple.com', save_to_path: '~/screenshot.jpg') do |file|
+     # do something (reading/writing) with the image file here
+   end
+   puts response.success
+   puts response.remaining_requests
+   puts response.status_code
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ Note that the `screenshots_api.get(url, options)` method accepts an [options](https://proxycrawl.com/docs/screenshots-api/parameters) hash.
+
+ ## Storage API usage
+
+ Initialize the Storage API using your private token.
+
+ ```ruby
+ storage_api = ProxyCrawl::StorageAPI.new(token: 'YOUR_TOKEN')
+ ```
+
+ Pass the [url](https://proxycrawl.com/docs/storage-api/parameters/#url) that you want to retrieve from [Proxycrawl Storage](https://proxycrawl.com/dashboard/storage).
+
+ ```ruby
+ begin
+   response = storage_api.get('https://www.apple.com')
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
    puts response.status_code
+   puts response.rid
    puts response.body
+   puts response.stored_at
  rescue => exception
    puts exception.backtrace
  end
  ```

+ or you can use the [RID](https://proxycrawl.com/docs/storage-api/parameters/#rid)
+
+ ```ruby
+ begin
+   response = storage_api.get(RID)
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
+   puts response.status_code
+   puts response.rid
+   puts response.body
+   puts response.stored_at
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ Note: You must send either the RID or the URL. Both parameters are optional on their own, but it is mandatory to send one of the two.
+
+ ### [Delete](https://proxycrawl.com/docs/storage-api/delete/) request
+
+ To delete a storage item from your storage area, pass its RID:
+
+ ```ruby
+ if storage_api.delete(RID)
+   puts 'delete success'
+ else
+   puts "Unable to delete: #{storage_api.body['error']}"
+ end
+ ```
+
+ ### [Bulk](https://proxycrawl.com/docs/storage-api/bulk/) request
+
+ To make a bulk request, send the list of RIDs as an array:
+
+ ```ruby
+ begin
+   response = storage_api.bulk([RID1, RID2, RID3, ...])
+   puts response.original_status
+   puts response.pc_status
+   puts response.url
+   puts response.status_code
+   puts response.rid
+   puts response.body
+   puts response.stored_at
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ ### [RIDs](https://proxycrawl.com/docs/storage-api/rids) request
+
+ To request a list of RIDs from your storage area:
+
+ ```ruby
+ begin
+   response = storage_api.rids
+   puts response.status_code
+   puts response.rid
+   puts response.body
+ rescue => exception
+   puts exception.backtrace
+ end
+ ```
+
+ You can also specify a limit as a parameter:
+
+ ```ruby
+ storage_api.rids(100)
+ ```
+
+ ### [Total Count](https://proxycrawl.com/docs/storage-api/total_count)
+
+ To get the total number of documents in your storage area:
+
+ ```ruby
+ total_count = storage_api.total_count
+ puts "total_count: #{total_count}"
+ ```
+
  If you have questions or need help using the library, please open an issue or [contact us](https://proxycrawl.com/contact).

  ## Development
@@ -194,4 +361,4 @@ Everyone interacting in the Proxycrawl project’s codebases, issue trackers, ch

  ---

- Copyright 2020 ProxyCrawl
+ Copyright 2022 ProxyCrawl
data/lib/proxycrawl/api.rb CHANGED
@@ -6,7 +6,7 @@ require 'uri'

  module ProxyCrawl
    class API
-     attr_reader :token, :body, :status_code, :original_status, :pc_status, :url
+     attr_reader :token, :body, :timeout, :status_code, :original_status, :pc_status, :url, :storage_url

      INVALID_TOKEN = 'Token is required'
      INVALID_URL = 'URL is required'
@@ -15,14 +15,22 @@ module ProxyCrawl
        raise INVALID_TOKEN if options[:token].nil?

        @token = options[:token]
+       @timeout = options[:timeout] || 120
      end

      def get(url, options = {})
        raise INVALID_URL if url.empty?

        uri = prepare_uri(url, options)
+       req = Net::HTTP::Get.new(uri)

-       response = Net::HTTP.get_response(uri)
+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }

        prepare_response(response, options[:format])

@@ -69,16 +77,14 @@ module ProxyCrawl
      end

      def prepare_response(response, format)
-       if format == 'json' || base_url.include?('/scraper')
-         @status_code = response.code.to_i
-         @body = response.body
-       else
-         @original_status = response['original_status'].to_i
-         @status_code = response.code.to_i
-         @pc_status = response['pc_status'].to_i
-         @url = response['url']
-         @body = response.body
-       end
+       res = format == 'json' || base_url.include?('/scraper') ? JSON.parse(response.body) : response
+
+       @original_status = res['original_status'].to_i
+       @pc_status = res['pc_status'].to_i
+       @url = res['url']
+       @storage_url = res['storage_url']
+       @status_code = response.code.to_i
+       @body = response.body
      end
    end
  end
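
The `timeout` option introduced above defaults to 120 seconds and is passed to `Net::HTTP` as its `read_timeout`. A minimal sketch of overriding it, assuming a placeholder token and URL:

```ruby
require 'proxycrawl'

# 'YOUR_TOKEN' is a placeholder; timeout is the new read timeout in seconds.
api = ProxyCrawl::API.new(token: 'YOUR_TOKEN', timeout: 30)

begin
  response = api.get('https://www.example.com')
  puts response.status_code
rescue Net::ReadTimeout
  # Responses slower than 30 seconds now surface as Net::ReadTimeout.
  puts 'request timed out'
end
```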
data/lib/proxycrawl/leads_api.rb CHANGED
@@ -6,15 +6,16 @@ require 'uri'

  module ProxyCrawl
    class LeadsAPI
-     attr_reader :token, :body, :status_code
+     attr_reader :token, :timeout, :body, :status_code, :success, :remaining_requests

      INVALID_TOKEN = 'Token is required'
      INVALID_DOMAIN = 'Domain is required'

      def initialize(options = {})
-       raise INVALID_TOKEN if options[:token].nil?
+       raise INVALID_TOKEN if options[:token].nil? || options[:token].empty?

        @token = options[:token]
+       @timeout = options[:timeout] || 120
      end

      def get(domain)
@@ -23,12 +24,27 @@ module ProxyCrawl
        uri = URI('https://api.proxycrawl.com/leads')
        uri.query = URI.encode_www_form({ token: token, domain: domain })

-       response = Net::HTTP.get_response(uri)
+       req = Net::HTTP::Get.new(uri)

+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }
        @status_code = response.code.to_i
        @body = response.body

+       json_body = JSON.parse(response.body)
+       @success = json_body['success']
+       @remaining_requests = json_body['remaining_requests'].to_i
+
        self
      end
+
+     def post
+       raise 'Only GET is allowed for the LeadsAPI'
+     end
    end
  end
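
The initializer above now rejects empty tokens as well as `nil` ones; since `raise INVALID_TOKEN` raises a `RuntimeError` with that string, the new guard is easy to observe. A small sketch:

```ruby
require 'proxycrawl'

begin
  # An empty token now fails the nil-or-empty check in the initializer.
  ProxyCrawl::LeadsAPI.new(token: '')
rescue RuntimeError => e
  puts e.message # => "Token is required"
end
```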
data/lib/proxycrawl/scraper_api.rb CHANGED
@@ -2,6 +2,7 @@

  module ProxyCrawl
    class ScraperAPI < ProxyCrawl::API
+     attr_reader :remaining_requests

      def post
        raise 'Only GET is allowed for the ScraperAPI'
@@ -9,6 +10,12 @@ module ProxyCrawl

      private

+     def prepare_response(response, format)
+       super(response, format)
+       json_body = JSON.parse(response.body)
+       @remaining_requests = json_body['remaining_requests'].to_i
+     end
+
      def base_url
        'https://api.proxycrawl.com/scraper'
      end
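
Because the `prepare_response` override above parses the JSON body after calling `super`, every `ScraperAPI` response now exposes `remaining_requests`. A short sketch, reusing the README's placeholder token and product URL:

```ruby
require 'proxycrawl'

scraper_api = ProxyCrawl::ScraperAPI.new(token: 'YOUR_TOKEN')
response = scraper_api.get('https://www.amazon.com/Halo-SleepSack-Swaddle-Triangle-Neutral/dp/B01LAG1TOS')
puts response.remaining_requests # parsed from the JSON body by the override
```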
data/lib/proxycrawl/screenshots_api.rb ADDED
@@ -0,0 +1,52 @@
+ # frozen_string_literal: true
+
+ require 'securerandom'
+ require 'tmpdir'
+
+ module ProxyCrawl
+   class ScreenshotsAPI < ProxyCrawl::API
+     attr_reader :screenshot_path, :success, :remaining_requests, :screenshot_url
+
+     INVALID_SAVE_TO_PATH_FILENAME = 'Filename must end with .jpg or .jpeg'
+     SAVE_TO_PATH_FILENAME_PATTERN = /.+\.(jpg|JPG|jpeg|JPEG)$/.freeze
+
+     def post
+       raise 'Only GET is allowed for the ScreenshotsAPI'
+     end
+
+     def get(url, options = {})
+       screenshot_path = options.delete(:save_to_path) || generate_file_path
+       raise INVALID_SAVE_TO_PATH_FILENAME unless SAVE_TO_PATH_FILENAME_PATTERN =~ screenshot_path
+
+       response = super(url, options)
+       file = File.open(screenshot_path, 'w+')
+       file.write(response.body&.force_encoding('UTF-8'))
+       @screenshot_path = screenshot_path
+       yield(file) if block_given?
+       response
+     ensure
+       file&.close
+     end
+
+     private
+
+     def prepare_response(response, format)
+       super(response, format)
+       @remaining_requests = response['remaining_requests'].to_i
+       @success = response['success'] == 'true'
+       @screenshot_url = response['screenshot_url']
+     end
+
+     def base_url
+       'https://api.proxycrawl.com/screenshots'
+     end
+
+     def generate_file_name
+       "#{SecureRandom.urlsafe_base64}.jpg"
+     end
+
+     def generate_file_path
+       File.join(Dir.tmpdir, generate_file_name)
+     end
+   end
+ end
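
`get` validates `save_to_path` against `SAVE_TO_PATH_FILENAME_PATTERN` before any request is made, and falls back to a random `.jpg` under `Dir.tmpdir` when the option is omitted. A sketch of the validation, with a placeholder token and path:

```ruby
require 'proxycrawl'

screenshots_api = ProxyCrawl::ScreenshotsAPI.new(token: 'YOUR_TOKEN')

begin
  # A .png path fails the /.+\.(jpg|JPG|jpeg|JPEG)$/ check before any network call.
  screenshots_api.get('https://www.apple.com', save_to_path: '/tmp/screenshot.png')
rescue RuntimeError => e
  puts e.message # => "Filename must end with .jpg or .jpeg"
end
```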
data/lib/proxycrawl/storage_api.rb ADDED
@@ -0,0 +1,126 @@
+ # frozen_string_literal: true
+
+ require 'net/http'
+ require 'json'
+ require 'uri'
+
+ module ProxyCrawl
+   class StorageAPI
+     attr_reader :token, :timeout, :original_status, :pc_status, :url, :status_code, :rid, :body, :stored_at
+
+     INVALID_TOKEN = 'Token is required'
+     INVALID_RID = 'RID is required'
+     INVALID_RID_ARRAY = 'One or more RIDs are required'
+     INVALID_URL_OR_RID = 'Either URL or RID is required'
+     BASE_URL = 'https://api.proxycrawl.com/storage'
+
+     def initialize(options = {})
+       raise INVALID_TOKEN if options[:token].nil? || options[:token].empty?
+
+       @token = options[:token]
+       @timeout = options[:timeout] || 120
+     end
+
+     def get(url_or_rid, format = 'html')
+       raise INVALID_URL_OR_RID if url_or_rid.nil? || url_or_rid.empty?
+
+       uri = URI(BASE_URL)
+       uri.query = URI.encode_www_form({ token: token, format: format }.merge(decide_url_or_rid(url_or_rid)))
+
+       req = Net::HTTP::Get.new(uri)
+
+       req_options = {
+         read_timeout: timeout,
+         use_ssl: uri.scheme == 'https',
+         verify_mode: OpenSSL::SSL::VERIFY_NONE
+       }
+
+       response = Net::HTTP.start(uri.hostname, uri.port, req_options) { |http| http.request(req) }
+
+       res = format == 'json' ? JSON.parse(response.body) : response
+
+       @original_status = res['original_status'].to_i
+       @pc_status = res['pc_status'].to_i
+       @url = res['url']
+       @rid = res['rid']
+       @stored_at = res['stored_at']
+
+       @status_code = response.code.to_i
+       @body = response.body
+
+       self
+     end
+
+     def delete(rid)
+       raise INVALID_RID if rid.nil? || rid.empty?
+
+       uri = URI(BASE_URL)
+       uri.query = URI.encode_www_form(token: token, rid: rid)
+       http = Net::HTTP.new(uri.host)
+       request = Net::HTTP::Delete.new(uri.request_uri)
+       response = http.request(request)
+
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @rid = rid
+       @body = JSON.parse(response.body)
+
+       @body.key?('success')
+     end
+
+     def bulk(rids_array = [])
+       raise INVALID_RID_ARRAY if rids_array.empty?
+
+       uri = URI("#{BASE_URL}/bulk")
+       uri.query = URI.encode_www_form(token: token)
+       http = Net::HTTP.new(uri.host)
+       request = Net::HTTP::Post.new(uri.request_uri, { 'Content-Type': 'application/json' })
+       request.body = { rids: rids_array }.to_json
+       response = http.request(request)
+
+       @body = JSON.parse(response.body)
+       @original_status = @body.map { |item| item['original_status'].to_i }
+       @status_code = response.code.to_i
+       @pc_status = @body.map { |item| item['pc_status'].to_i }
+       @url = @body.map { |item| item['url'] }
+       @rid = @body.map { |item| item['rid'] }
+       @stored_at = @body.map { |item| item['stored_at'] }
+
+       self
+     end
+
+     def rids(limit = -1)
+       uri = URI("#{BASE_URL}/rids")
+       query_hash = { token: token }
+       query_hash.merge!({ limit: limit }) if limit >= 0
+       uri.query = URI.encode_www_form(query_hash)
+
+       response = Net::HTTP.get_response(uri)
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @body = JSON.parse(response.body)
+       @rid = @body
+
+       @body
+     end
+
+     def total_count
+       uri = URI("#{BASE_URL}/total_count")
+       uri.query = URI.encode_www_form(token: token)
+
+       response = Net::HTTP.get_response(uri)
+       @url, @original_status, @pc_status, @stored_at = nil
+       @status_code = response.code.to_i
+       @rid = rid
+       @body = JSON.parse(response.body)
+
+       body['totalCount']
+     end
+
+     private
+
+     def decide_url_or_rid(url_or_rid)
+       %r{^https?://} =~ url_or_rid ? { url: url_or_rid } : { rid: url_or_rid }
+     end
+   end
+ end
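
`decide_url_or_rid` above keys the query parameter off a `%r{^https?://}` match, so the same `get` method serves both lookups. Illustrative calls, with the token and RID as placeholders:

```ruby
require 'proxycrawl'

storage_api = ProxyCrawl::StorageAPI.new(token: 'YOUR_TOKEN')

# Matches %r{^https?://}, so it is sent as the url query parameter.
storage_api.get('https://www.apple.com')

# No scheme prefix, so it is sent as the rid query parameter instead.
storage_api.get('a1b2c3d4e5')
```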
data/lib/proxycrawl/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module ProxyCrawl
-   VERSION = '0.2.1'
+   VERSION = '1.0.0'
  end
data/lib/proxycrawl.rb CHANGED
@@ -4,6 +4,8 @@ require 'proxycrawl/version'
  require 'proxycrawl/api'
  require 'proxycrawl/scraper_api'
  require 'proxycrawl/leads_api'
+ require 'proxycrawl/screenshots_api'
+ require 'proxycrawl/storage_api'

  module ProxyCrawl
  end
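
With the two `require` lines added above, loading the gem is enough to make the new classes available; a quick, purely illustrative check:

```ruby
require 'proxycrawl'

puts defined?(ProxyCrawl::ScreenshotsAPI) # => "constant"
puts defined?(ProxyCrawl::StorageAPI)     # => "constant"
```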
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: proxycrawl
  version: !ruby/object:Gem::Version
-   version: 0.2.1
+   version: 1.0.0
  platform: ruby
  authors:
  - proxycrawl
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-10-28 00:00:00.000000000 Z
+ date: 2022-08-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rspec
@@ -86,6 +86,8 @@ files:
  - lib/proxycrawl/api.rb
  - lib/proxycrawl/leads_api.rb
  - lib/proxycrawl/scraper_api.rb
+ - lib/proxycrawl/screenshots_api.rb
+ - lib/proxycrawl/storage_api.rb
  - lib/proxycrawl/version.rb
  - proxycrawl.gemspec
  homepage: https://github.com/proxycrawl/proxycrawl-ruby
@@ -107,7 +109,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
    version: '0'
  requirements: []
- rubygems_version: 3.1.4
+ rubygems_version: 3.1.2
  signing_key:
  specification_version: 4
  summary: ProxyCrawl API client for web scraping and crawling