RubyGems - spidercloud - Versions diffs - 1.0.0 - Mend

spidercloud 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

checksums.yaml +7 -0
data/LICENSE +21 -0
data/README.md +233 -0
data/lib/spider_cloud/costs.rb +15 -0
data/lib/spider_cloud/crawl_options.rb +154 -0
data/lib/spider_cloud/crawl_request.rb +28 -0
data/lib/spider_cloud/crawl_result.rb +62 -0
data/lib/spider_cloud/error_result.rb +52 -0
data/lib/spider_cloud/helpers.rb +33 -0
data/lib/spider_cloud/links_options.rb +52 -0
data/lib/spider_cloud/links_request.rb +29 -0
data/lib/spider_cloud/links_result.rb +55 -0
data/lib/spider_cloud/module_methods.rb +31 -0
data/lib/spider_cloud/request.rb +41 -0
data/lib/spider_cloud/response_methods.rb +15 -0
data/lib/spider_cloud/scrape_options.rb +164 -0
data/lib/spider_cloud/scrape_request.rb +29 -0
data/lib/spider_cloud/scrape_result.rb +62 -0
data/lib/spider_cloud/screenshot_options.rb +84 -0
data/lib/spider_cloud/screenshot_request.rb +29 -0
data/lib/spider_cloud/screenshot_result.rb +69 -0
data/lib/spider_cloud/shared_schemas.rb +80 -0
data/lib/spider_cloud/version.rb +3 -0
data/lib/spider_cloud.rb +37 -0
data/lib/spidercloud.rb +1 -0
data/readme/crawl.md +218 -0
data/readme/links.md +198 -0
data/readme/scrape.md +248 -0
data/readme/screenshot.md +240 -0
data/spidercloud.gemspec +40 -0
metadata +159 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 8ebacc4a7294c02a50c8e49d4e1dda6d75747330af8d17a427ea1e9871e4d025
+  data.tar.gz: 1dacbdcbbb2191d4c81741c1527912ea461538f11bce5e078d7d38b6c6bbb16e
+SHA512:
+  metadata.gz: a873ad78cbc6d96e5aa014d3b61f822e8f94696e2979747f8bc16734419161f8011a3e8f16194ad72690a40a74d7efc85344d0e4969b69b054e6993ecd5fdc02
+  data.tar.gz: bb317f6e76f0c6d6d1fb2f079b24f74c4bc864062f38f0a0871039e42ae521946f8f7cb087834ca23607576975d3085e1d63cc89691be10af60409360183678c

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Kristoph Cichocki-Romanov
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,233 @@
+# SpiderCloud
+The SpiderCloud gem provides a lightweight Ruby interface to the
+[Spider Cloud API](https://spider.cloud) for web scraping, crawling, screenshots,
+and link extraction.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'spidercloud'
+```
+Or install directly:
+```bash
+gem install spidercloud
+```
+## Quick Start
+```ruby
+require 'spider_cloud'
+# Configure your API key
+SpiderCloud.api_key 'your-api-key'
+# Scrape a single page
+response = SpiderCloud.scrape( 'https://example.com' )
+puts response.result.content
+# Crawl a website (limited to 5 pages)
+response = SpiderCloud.crawl( 'https://example.com', limit: 5 )
+response.result.each { | page | puts page.url }
+# Take a screenshot
+response = SpiderCloud.screenshot( 'https://example.com' )
+response.result.save_to( 'screenshot.png' )
+# Extract links
+response = SpiderCloud.links( 'https://example.com', limit: 5 )
+puts response.result.urls
+```
+## Configuration
+Set your API key globally:
+```ruby
+SpiderCloud.api_key 'your-api-key'
+```
+Or pass it per-request:
+```ruby
+request = SpiderCloud::ScrapeRequest.new( api_key: 'your-api-key' )
+response = request.submit( 'https://example.com' )
+```
+## Endpoints
+SpiderCloud supports four main endpoints:
+- **[Scrape](readme/scrape.md)** - Extract content from a single URL
+- **[Crawl](readme/crawl.md)** - Crawl multiple pages from a starting URL
+- **[Screenshot](readme/screenshot.md)** - Capture screenshots of web pages
+- **[Links](readme/links.md)** - Discover and extract links from a website
+## Using Options
+Each endpoint accepts options that can be built using the options builder:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  return_format :markdown
+  readability true
+  stealth true
+  wait_for do
+    selector '#content'
+  end
+end
+response = SpiderCloud.scrape( 'https://example.com', options )
+```
+Or pass options as a hash:
+```ruby
+response = SpiderCloud.scrape( 'https://example.com', {
+  return_format: :markdown,
+  readability: true
+} )
+```
+## Response Handling
+All endpoints return a Faraday response with an attached `result` object:
+```ruby
+response = SpiderCloud.scrape( 'https://example.com' )
+# Check if the HTTP request succeeded
+response.success?       # => true/false
+# Access the parsed result
+response.result.success?  # => true/false
+response.result.content   # => "# Page Title\n\nContent..."
+response.result.url       # => "https://example.com"
+response.result.status    # => 200
+```
+### Error Handling
+When a request fails, the result will be an `ErrorResult`:
+```ruby
+response = SpiderCloud.scrape( 'https://example.com' )
+unless response.result.success?
+  puts response.result.error_type        # => :authentication_error
+  puts response.result.error_description # => "There's an issue with your API key."
+end
+```
+## Content Formats
+The `return_format` option controls the output format:
+- `:markdown` - Markdown format
+- `:commonmark` - CommonMark format
+- `:raw` - Raw HTML (default)
+- `:text` - Plain text
+- `:html2text` - HTML converted to text
+- `:xml` - XML format
+- `:bytes` - Raw bytes
+- `:empty` - No content (useful for links-only)
+## Proxy Support
+Spider Cloud supports multiple proxy types:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  proxy :residential
+  proxy_enabled true
+  country_code 'US'
+end
+```
+Proxy types: `:residential`, `:mobile`, `:isp`
+## Wait Conditions
+Wait for specific conditions before extracting content:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  wait_for do
+    # Wait for a CSS selector
+    selector '#loaded'
+    # Or wait for network idle
+    idle_network do
+      timeout do
+        seconds 5
+        nanoseconds 0
+      end
+    end
+    # Or wait for a delay
+    delay do
+      timeout do
+        seconds 2
+        nanoseconds 0
+      end
+    end
+  end
+end
+```
+## AI/LLM Integration
+Configure GPT to process scraped content:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  gpt_config do
+    prompt 'Summarize this page in 3 sentences'
+    model 'gpt-4'
+    max_tokens 500
+  end
+end
+```
+## Browser Configuration
+Control browser behavior:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  stealth true
+  fingerprint true
+  block_ads true
+  block_analytics true
+  viewport do
+    width 1920
+    height 1080
+  end
+  device :desktop  # :mobile, :tablet, :desktop
+end
+```
+## Automation Scripts
+Execute actions before scraping:
+```ruby
+options = SpiderCloud::ScrapeOptions.build do
+  automation_scripts( {
+    '/login' => [
+      { 'Fill' => { 'selector' => '#email', 'value' => 'user@example.com' } },
+      { 'Fill' => { 'selector' => '#password', 'value' => 'secret' } },
+      { 'Click' => 'button[type=submit]' },
+      { 'WaitForNavigation' => true }
+    ]
+  } )
+end
+```
+## License
+The gem is available under the MIT License. See LICENSE for details.

data/lib/spider_cloud/costs.rb ADDED

@@ -0,0 +1,15 @@
+module SpiderCloud
+  CostsSchema = DynamicSchema::Struct.define do
+    ai_cost                 Float, as: :ai_cost
+    compute_cost            Float, as: :compute_cost
+    file_cost               Float, as: :file_cost
+    bytes_transferred_cost  Float, as: :bytes_transferred_cost
+    total_cost              Float, as: :total_cost
+    transform_cost          Float, as: :transform_cost
+  end
+  class Costs < CostsSchema
+  end
+end

data/lib/spider_cloud/crawl_options.rb ADDED

@@ -0,0 +1,154 @@
+module SpiderCloud
+  class CrawlOptions
+    include DynamicSchema::Definable
+    include Helpers
+    schema do
+      limit                          Integer
+      return_format                  Symbol, in: RETURN_FORMATS
+      request                        Symbol, in: REQUEST_TYPES
+      depth                          Integer
+      subdomains                     [ TrueClass, FalseClass ]
+      tld                            [ TrueClass, FalseClass ]
+      external_domains               String, array: true
+      redirect_policy                Symbol, in: REDIRECT_POLICIES
+      blacklist                      String, array: true
+      whitelist                      String, array: true
+      budget                         Hash
+      link_rewrite                   Hash
+      sitemap                        [ TrueClass, FalseClass ]
+      sitemap_only                   [ TrueClass, FalseClass ]
+      sitemap_path                   String
+      readability                    [ TrueClass, FalseClass ]
+      root_selector                  String
+      exclude_selector               String
+      css_extraction_map             Hash
+      filter_main_only               [ TrueClass, FalseClass ]
+      full_resources                 [ TrueClass, FalseClass ]
+      return_json_data               [ TrueClass, FalseClass ]
+      return_headers                 [ TrueClass, FalseClass ]
+      return_cookies                 [ TrueClass, FalseClass ]
+      return_page_links              [ TrueClass, FalseClass ]
+      return_embeddings              [ TrueClass, FalseClass ]
+      metadata                       [ TrueClass, FalseClass ]
+      gpt_config do
+        prompt                       String
+        model                        String
+        max_tokens                   Integer
+        temperature                  Float
+        top_p                        Float
+        api_key                      String
+        extra_ai_data                [ TrueClass, FalseClass ]
+        screenshot                   [ TrueClass, FalseClass ]
+      end
+      custom_prompt                  String
+      model                          String
+      chunking_algorithm             as: :chunking_alg do
+        type                         Symbol, in: CHUNKING_TYPES
+        value                        Integer
+      end
+      request_timeout                Integer, in: 5..255
+      lite_mode                      [ TrueClass, FalseClass ]
+      network_blacklist              String, array: true
+      network_whitelist              String, array: true
+      anti_bot                       [ TrueClass, FalseClass ]
+      respect_robots                 [ TrueClass, FalseClass ]
+      wait_for do
+        idle_network do
+          timeout do
+            seconds                  Integer, as: :secs
+            nanoseconds              Integer, as: :nanos
+          end
+        end
+        selector                     String
+        dom do
+          timeout do
+            seconds                  Integer, as: :secs
+            nanoseconds              Integer, as: :nanos
+          end
+        end
+        delay do
+          timeout do
+            seconds                  Integer, as: :secs
+            nanoseconds              Integer, as: :nanos
+          end
+        end
+        page_navigations do
+          timeout do
+            seconds                  Integer, as: :secs
+            nanoseconds              Integer, as: :nanos
+          end
+        end
+      end
+      session                        [ TrueClass, FalseClass ]
+      cookies                        String
+      headers                        Hash
+      user_agent                     String
+      proxy                          Symbol, in: PROXY_TYPES
+      proxy_enabled                  [ TrueClass, FalseClass ]
+      remote_proxy                   String
+      country_code                   String
+      stealth                        [ TrueClass, FalseClass ]
+      fingerprint                    [ TrueClass, FalseClass ]
+      viewport do
+        width                        Integer
+        height                       Integer
+      end
+      device                         Symbol, in: DEVICE_TYPES
+      scroll                         Integer
+      block_ads                      [ TrueClass, FalseClass ]
+      virtual_display                [ TrueClass, FalseClass ]
+      automation_scripts             Hash
+      max_credits_per_page           Integer
+      max_credits_allowed            Integer
+      crawl_timeout do
+        seconds                      Integer, as: :secs
+        nanoseconds                  Integer, as: :nanos
+      end
+      cache                          [ TrueClass, FalseClass ]
+      storageless                    [ TrueClass, FalseClass ]
+      store_data                     [ TrueClass, FalseClass ]
+      concurrency_limit              Integer
+      delay                          Integer
+      webhooks do
+        destination                  String
+        on_credits_depleted          [ TrueClass, FalseClass ]
+        on_find                      [ TrueClass, FalseClass ]
+      end
+    end
+    def self.build( options = nil, &block )
+      new( api_options: builder.build( options, &block ) )
+    end
+    def self.build!( options = nil, &block )
+      new( api_options: builder.build!( options, &block ) )
+    end
+    def initialize( options = {}, api_options: nil )
+      @options = self.class.builder.build( options || {} )
+      @options = api_options.merge( @options ) if api_options
+    end
+    def to_h
+      @options.to_h
+    end
+  end
+end
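
Note on option precedence: `initialize` builds the positional `options` first and then calls `api_options.merge( @options )`, so a key present in both is taken from the positional options. The sketch below illustrates that precedence with plain hashes. It assumes the builder output behaves like a `Hash` under `merge`, and `combine` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical stand-in for the initializer: plain hashes replace the builder output.
def combine( options, api_options: nil )
  merged = options || {}
  # Hash#merge lets the argument win on duplicate keys, so entries in
  # `merged` (the positional options) override those in `api_options`.
  merged = api_options.merge( merged ) if api_options
  merged
end

built    = { limit: 5, depth: 2 }   # what a builder block might have produced
direct   = { limit: 10 }            # options passed positionally
combined = combine( direct, api_options: built )

puts combined[ :limit ]   # 10
puts combined[ :depth ]   # 2
```

Keys that appear on only one side pass through unchanged.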

data/lib/spider_cloud/crawl_request.rb ADDED

@@ -0,0 +1,28 @@
+module SpiderCloud
+  class CrawlRequest < Request
+    def submit( url, options = nil, &block )
+      if options
+        options = options.is_a?( CrawlOptions ) ? options : CrawlOptions.build!( options.to_h )
+        options = options.to_h
+      else
+        options = {}
+      end
+      options[ :url ] = Helpers.normalize_url( url )
+      response = post( "#{ BASE_URI }/crawl", options, &block )
+      attributes = ( JSON.parse( response.body, symbolize_names: true ) rescue nil )
+      result = if response.success? && attributes.is_a?( Array )
+        CrawlResult.from_array( attributes )
+      else
+        # a successful response without an array payload is also reported as an error
+        ErrorResult.new( response.status, attributes )
+      end
+      ResponseMethods.install( response, result )
+    end
+  end
+end
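
Note on response parsing: `submit` parses the body as JSON, falls back to `nil` when parsing fails, and treats only an array payload on a successful response as a crawl result. The sketch below reproduces that branching without any HTTP involved. `classify_body` and its `[ kind, data ]` return shape are hypothetical, and it rescues `JSON::ParserError` explicitly where the gem uses an inline `rescue nil`.

```ruby
require 'json'

# Hypothetical helper: same parse-then-branch shape, no network access.
def classify_body( body, http_ok )
  # A body that is not valid JSON becomes nil instead of raising.
  attributes = begin
    JSON.parse( body, symbolize_names: true )
  rescue JSON::ParserError
    nil
  end

  if http_ok && attributes.is_a?( Array )
    [ :pages, attributes ]
  else
    [ :error, attributes ]
  end
end

kind, data = classify_body( '[{"url":"https://example.com/","status":200}]', true )
puts kind                  # pages
puts data.first[ :url ]    # https://example.com/

kind, data = classify_body( '<html>Bad Gateway</html>', false )
puts kind                  # error
```

An HTML error page and a JSON object such as `{"error":"..."}` both end up on the error path, the first with `nil` attributes and the second with a hash.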

data/lib/spider_cloud/crawl_result.rb ADDED

@@ -0,0 +1,62 @@
+module SpiderCloud
+  CrawlResultItemSchema = DynamicSchema::Struct.define do
+    content               String
+    error                 String
+    status                Integer
+    duration_elapsed_ms   Integer, as: :duration_elapsed_ms
+    costs                 Costs
+    url                   String
+  end
+  class CrawlResultItem < CrawlResultItemSchema
+    def success?
+      error.nil? && ( status.nil? || ( status >= 200 && status < 300 ) )
+    end
+  end
+  CrawlResultSchema = DynamicSchema::Struct.define do
+    items                 CrawlResultItem, array: true
+  end
+  class CrawlResult < CrawlResultSchema
+    extend Forwardable
+    include Enumerable
+    def_delegators :items, :each, :[], :count, :size, :length, :first, :last, :empty?
+    def self.from_array( array )
+      new( items: array )
+    end
+    def success?
+      items&.all?( &:success? ) || false
+    end
+    # convenience method for accessing all URLs
+    def urls
+      items&.map( &:url ) || []
+    end
+    # convenience method for accessing all content
+    def contents
+      items&.map( &:content ) || []
+    end
+    # convenience method for failed items
+    def failed
+      items&.reject( &:success? ) || []
+    end
+    # convenience method for successful items
+    def succeeded
+      items&.select( &:success? ) || []
+    end
+    # total cost of the crawl
+    def total_cost
+      items&.sum { | item | item.costs&.total_cost || 0 } || 0
+    end
+  end
+end
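
Note on the collection pattern: `CrawlResult` delegates `each` to its `items` and includes `Enumerable`, which is what makes `response.result.each`, `failed`, and `total_cost` work. The sketch below shows the same pattern with plain Ruby objects. `Page` and `PageList` are hypothetical stand-ins for the schema-backed structs, and the costs are arbitrary numbers.

```ruby
require 'forwardable'

# Hypothetical stand-in for a crawl item.
Page = Struct.new( :url, :status, :error, :cost, keyword_init: true ) do
  # Successful when there is no error and the status is absent or 2xx.
  def success?
    error.nil? && ( status.nil? || ( 200...300 ).cover?( status ) )
  end
end

class PageList
  extend Forwardable
  include Enumerable

  # Enumerable only needs #each; delegating it to the array supplies the rest.
  def_delegators :@pages, :each, :size, :empty?

  def initialize( pages )
    @pages = pages
  end

  def failed
    reject( &:success? )
  end

  def total_cost
    sum { | page | page.cost || 0 }
  end
end

pages = PageList.new( [
  Page.new( url: 'https://example.com/', status: 200, cost: 0.5 ),
  Page.new( url: 'https://example.com/missing', status: 404, cost: 0.25 ),
  Page.new( url: 'https://example.com/pending' )
] )

puts pages.failed.map( &:url ).inspect   # ["https://example.com/missing"]
puts pages.total_cost                    # 0.75
```

An item with neither a status nor an error counts as successful under this predicate, which matches the `status.nil?` clause in `CrawlResultItem#success?`.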

data/lib/spider_cloud/error_result.rb ADDED

@@ -0,0 +1,52 @@
+module SpiderCloud
+  class ErrorResult
+    attr_reader :error_type, :error_description
+    def initialize( status_code, attributes = nil )
+      @error_type, @error_description = status_code_to_error( status_code )
+      @error_description = attributes[ :error ] if attributes.is_a?( Hash ) && attributes[ :error ]
+    end
+    def success?
+      false
+    end
+  private
+    def _status_code_to_error( status_code )
+      case status_code
+      when 200
+        [ :unexpected_error,
+          "The response was successful but it did not include a valid payload." ]
+      when 400
+        [ :invalid_request_error,
+          "There was an issue with the format or content of your request." ]
+      when 401
+        [ :authentication_error,
+          "There's an issue with your API key." ]
+      when 402
+        [ :payment_required,
+          "The request requires a paid account or you have insufficient credits." ]
+      when 404
+        [ :not_found_error,
+          "The requested resource was not found." ]
+      when 429
+        [ :rate_limit_error,
+          "Your account has hit a rate limit." ]
+      # 529 must be tested before the 5xx range, which would otherwise match it first
+      when 529
+        [ :overloaded_error,
+          "The Spider Cloud service is overloaded." ]
+      when 500..595
+        [ :server_error,
+          "The Spider Cloud service encountered an unexpected server error." ]
+      else
+        [ :unknown_error,
+          "The Spider Cloud service returned an unexpected status code: '#{ status_code }'." ]
+      end
+    end
+    alias_method :status_code_to_error, :_status_code_to_error
+  end
+end
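
Note on branch order: Ruby tests `when` clauses from top to bottom and stops at the first match, so a specific status such as 529 only gets its own result when it is listed before any range that contains it. The sketch below shows that ordering in isolation. `classify_status` is a hypothetical helper that returns only the error symbol, and the `500..599` bound is an assumption made for the sketch.

```ruby
# Hypothetical helper: maps a status code to an error symbol.
def classify_status( status_code )
  case status_code
  when 401      then :authentication_error
  when 429      then :rate_limit_error
  when 529      then :overloaded_error   # listed before the range that contains it
  when 500..599 then :server_error
  else               :unknown_error
  end
end

puts classify_status( 529 )   # overloaded_error
puts classify_status( 503 )   # server_error
puts classify_status( 418 )   # unknown_error
```

With the two 5xx branches swapped, 529 would be reported as `:server_error`.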

data/lib/spider_cloud/helpers.rb ADDED

@@ -0,0 +1,33 @@
+module SpiderCloud
+  module Helpers
+    def string_camelize( string )
+      words = string.split( /[\s_\-]/ )
+      words.map.with_index do | word, index |
+        index.zero? ? word.downcase : word.capitalize
+      end.join
+    end
+    def string_underscore( string )
+      string.to_s.gsub( /([a-z])([A-Z])/, '\1_\2' ).downcase
+    end
+    # normalize URL by ensuring it has a trailing slash for root URLs
+    # and no trailing slash for paths
+    def normalize_url( url )
+      url = url.to_s.strip
+      uri = URI.parse( url )
+      if uri.path.empty? || uri.path == '/'
+        # root URL - ensure trailing slash
+        uri.path = '/'
+      else
+        # path URL - remove trailing slash
+        uri.path = uri.path.chomp( '/' )
+      end
+      uri.to_s
+    rescue URI::InvalidURIError
+      url
+    end
+    module_function :string_camelize, :string_underscore, :normalize_url
+  end
+end
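
Note on URL normalization: `normalize_url` gives root URLs a trailing slash, strips the trailing slash from other paths, and returns the input unchanged when it cannot be parsed. The sketch below re-declares the method as a standalone function, with the same logic written more compactly, so the behavior can be checked without loading the gem. The sample URLs are illustrative.

```ruby
require 'uri'

# Standalone copy of the helper's logic for experimentation.
def normalize_url( url )
  url = url.to_s.strip
  uri = URI.parse( url )
  # Root URLs get a trailing slash; other paths lose theirs.
  uri.path = ( uri.path.empty? || uri.path == '/' ) ? '/' : uri.path.chomp( '/' )
  uri.to_s
rescue URI::InvalidURIError
  # Unparseable input is returned as-is (after stripping whitespace).
  url
end

puts normalize_url( 'https://example.com' )          # https://example.com/
puts normalize_url( 'https://example.com/docs/' )    # https://example.com/docs
puts normalize_url( ' https://example.com/a/?x=1 ' ) # https://example.com/a?x=1
puts normalize_url( 'not a url' )                    # not a url
```

Only the path component is rewritten, so query strings and fragments pass through untouched.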

data/lib/spider_cloud/links_options.rb ADDED

@@ -0,0 +1,52 @@
+module SpiderCloud
+  class LinksOptions
+    include DynamicSchema::Definable
+    include Helpers
+    schema do
+      limit                          Integer
+      return_format                  Symbol
+      depth                          Integer
+      subdomains                     [ TrueClass, FalseClass ]
+      tld                            [ TrueClass, FalseClass ]
+      external_domains               String, array: true
+      blacklist                      String, array: true
+      whitelist                      String, array: true
+      budget                         Hash
+      redirect_policy                Symbol, in: REDIRECT_POLICIES
+      sitemap                        [ TrueClass, FalseClass ]
+      sitemap_only                   [ TrueClass, FalseClass ]
+      sitemap_path                   String
+      request                        Symbol, in: REQUEST_TYPES
+      request_timeout                Integer, in: 5..255
+      cache                          [ TrueClass, FalseClass ]
+      respect_robots                 [ TrueClass, FalseClass ]
+      proxy                          Symbol, in: PROXY_TYPES
+      proxy_enabled                  [ TrueClass, FalseClass ]
+      country_code                   String
+    end
+    def self.build( options = nil, &block )
+      new( api_options: builder.build( options, &block ) )
+    end
+    def self.build!( options = nil, &block )
+      new( api_options: builder.build!( options, &block ) )
+    end
+    def initialize( options = {}, api_options: nil )
+      @options = self.class.builder.build( options || {} )
+      @options = api_options.merge( @options ) if api_options
+    end
+    def to_h
+      @options.to_h
+    end
+  end
+end

data/lib/spider_cloud/links_request.rb ADDED

@@ -0,0 +1,29 @@
+module SpiderCloud
+  class LinksRequest < Request
+    def submit( url, options = nil, &block )
+      if options
+        options = options.is_a?( LinksOptions ) ? options : \
+          LinksOptions.build!( options.to_h )
+        options = options.to_h
+      else
+        options = {}
+      end
+      options[ :url ] = Helpers.normalize_url( url )
+      response = post( "#{ BASE_URI }/links", options, &block )
+      attributes = ( JSON.parse( response.body, symbolize_names: true ) rescue nil )
+      result = if response.success? && attributes.is_a?( Array )
+        LinksResult.from_array( attributes )
+      else
+        # a successful response without an array payload is also reported as an error
+        ErrorResult.new( response.status, attributes )
+      end
+      ResponseMethods.install( response, result )
+    end
+  end
+end