arachnid2 0.2.0 → 0.3.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile.lock +51 -28
- data/README.md +94 -22
- data/arachnid2.gemspec +4 -1
- data/lib/arachnid2/{cashed_arachnid_responses.rb → cached_arachnid_responses.rb} +1 -1
- data/lib/arachnid2/exoskeleton.rb +133 -0
- data/lib/arachnid2/typhoeus.rb +99 -0
- data/lib/arachnid2/version.rb +1 -1
- data/lib/arachnid2/watir.rb +102 -0
- data/lib/arachnid2.rb +17 -226
- metadata +50 -5
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 90efb51a783ec434d1e269d61a9c550e6fe8740eea229b24d3820790c1e5d296
+  data.tar.gz: f731d4cc5ab87ee603a69d795216b9dacd322af653a873273ffb226e6bb6b704
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4fe796d93a1d87ba260b269a5ff9a290280b5354e02cd596f305ea0a342edbf9d48fb29134a83197bcc00266e13e16b5c3b482b448525b380e82f7ffaa69e9b2
+  data.tar.gz: 9c80660fd0b7e9003ea70163fe87d142e1b1ef730b662d251992fdc1aff5ac773ec8e73ffb4c7bf300ebe0543b6b452d83d975b478bcf00ce6bb99347e171dc3
data/Gemfile.lock
CHANGED

@@ -1,52 +1,76 @@
 PATH
   remote: .
   specs:
-    arachnid2 (0.
+    arachnid2 (0.3.0)
       addressable
       adomain
       bloomfilter-rb
-      nokogiri
+      nokogiri (>= 1.8.5)
       typhoeus
+      watir
+      webdriver-user-agent (>= 7.6)
+      webdrivers

 GEM
   remote: https://rubygems.org/
   specs:
-    addressable (2.
+    addressable (2.6.0)
       public_suffix (>= 2.0.2, < 4.0)
     adomain (0.1.1)
       addressable (~> 2.5)
     bloomfilter-rb (2.1.1)
       redis
-
+    childprocess (0.9.0)
+      ffi (~> 1.0, >= 1.0.11)
     diff-lcs (1.3)
-    ethon (0.
+    ethon (0.12.0)
       ffi (>= 1.3.0)
-
-
-
-
-
-
-
-
+    facets (3.1.0)
+    ffi (1.10.0)
+    json (2.1.0)
+    mini_portile2 (2.4.0)
+    net_http_ssl_fix (0.0.10)
+    nokogiri (1.10.1)
+      mini_portile2 (~> 2.4.0)
+    os (1.0.0)
+    psych (3.1.0)
     public_suffix (3.0.3)
     rake (10.5.0)
-    redis (4.0
-
-
-    rspec-
-    rspec-
-
-
-
+    redis (4.1.0)
+    regexp_parser (1.3.0)
+    rspec (3.8.0)
+      rspec-core (~> 3.8.0)
+      rspec-expectations (~> 3.8.0)
+      rspec-mocks (~> 3.8.0)
+    rspec-core (3.8.0)
+      rspec-support (~> 3.8.0)
+    rspec-expectations (3.8.2)
       diff-lcs (>= 1.2.0, < 2.0)
-      rspec-support (~> 3.
-    rspec-mocks (3.
+      rspec-support (~> 3.8.0)
+    rspec-mocks (3.8.0)
       diff-lcs (>= 1.2.0, < 2.0)
-      rspec-support (~> 3.
-    rspec-support (3.
-
+      rspec-support (~> 3.8.0)
+    rspec-support (3.8.0)
+    rubyzip (1.2.2)
+    selenium-webdriver (3.141.0)
+      childprocess (~> 0.5)
+      rubyzip (~> 1.2, >= 1.2.2)
+    typhoeus (1.3.1)
       ethon (>= 0.9.0)
+    watir (6.16.5)
+      regexp_parser (~> 1.2)
+      selenium-webdriver (~> 3.6)
+    webdriver-user-agent (7.6)
+      facets
+      json
+      os
+      psych
+      selenium-webdriver (>= 3.4.0)
+    webdrivers (3.6.0)
+      net_http_ssl_fix
+      nokogiri (~> 1.6)
+      rubyzip (~> 1.0)
+      selenium-webdriver (~> 3.0)

 PLATFORMS
   ruby

@@ -54,9 +78,8 @@ PLATFORMS

 DEPENDENCIES
   arachnid2!
   bundler (~> 1.16)
-  pry
   rake (~> 10.0)
   rspec (~> 3.0)

 BUNDLED WITH
-   1.16.
+   1.16.5
data/README.md
CHANGED

@@ -3,18 +3,22 @@
 ## About

 Arachnid2 is a simple, fast web-crawler written in Ruby.
-
-to get HTTP requests,
+You can use [typhoeus](https://github.com/typhoeus/typhoeus)
+to get HTTP requests, or [Watir](https://github.com/watir/watir)
+to render pages.
+
 [bloomfilter-rb](https://github.com/igrigorik/bloomfilter-rb)
-
+stores the URLs it will get and has gotten,
 and [nokogiri](https://github.com/sparklemotion/nokogiri)
-to find the URLs on each webpage.
+to find the URLs on each webpage, adding them to the bloomfilter queue.

 Arachnid2 is a successor to [Arachnid](https://github.com/dchuk/Arachnid),
 and was abstracted out of the [Tellurion Bot](https://github.com/samnissen/tellurion_bot).

 ## Usage

+### Typheous (cURL)
+
 The basic use of Arachnid2 is surfacing the responses from a domains'
 URLs by visiting a URL, collecting any links to the same domain
 on that page, and visiting those to do the same.

@@ -22,9 +26,6 @@ on that page, and visiting those to do the same.
 Hence, the simplest output would be to collect all of the responses
 while spidering from some URL.

-Set cached service url(optional)
-`export ARACHNID_CACHED_SERVICE_ADDRESS=http://localhost:9000`
-
 ```ruby
 require "arachnid2"

@@ -58,7 +59,7 @@ spider.crawl { |response|

 `Arachnid2#crawl` will return always `nil`.

-
+#### Options

 ```ruby
 require "arachnid2"

@@ -67,7 +68,7 @@ url = "http://sixcolours.com"
 spider = Arachnid2.new(url)
 opts = {
   followlocation: true,
-  timeout:
+  timeout: 300,
   time_box: 60,
   max_urls: 50,
   :headers => {

@@ -95,26 +96,37 @@ spider.crawl(opts) { |response|
 }
 ```

-
+##### `followlocation`
+
+Tell Typhoeus to follow redirections.

-
-
-
+##### `timeout`
+
+Tell Typheous or Watir how long to wait for page load.
+
+##### `time_box`
+
+The crawler will time-bound your spidering.
+If no valid integer is provided,
+it will crawl for 15 seconds before exiting.
+10000 seconds is the current maximum,
+and any value above it will be reduced to 10000.

-
+##### `max_urls`

 The crawler will crawl a limited number of URLs before stopping.
-If no valid integer is provided,
+If no valid integer is provided,
+it will crawl for 50 URLs before exiting.
 10000 seconds is the current maximum,
 and any value above it will be reduced to 10000.

-
+##### `headers`

 This is a hash that represents any HTTP header key/value pairs you desire,
 and is passed directly to Typheous. Before it is sent, a default
 language and user agent are created:

-
+###### Defaults

 The HTTP header `Accept-Language` default is
 `en-IE, en-UK;q=0.9, en-NL;q=0.8, en-MT;q=0.7, en-LU;q=0.6, en;q=0.5, \*;0.4`

@@ -122,19 +134,19 @@ The HTTP header `Accept-Language` default is
 The HTTP header `User-Agent` default is
 `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15`

-
+##### `proxy`

 Provide your IP, port for a proxy. If required, provide credentials for
 authenticating to that proxy. Proxy options and handling are done
 by Typhoeus.

-
+##### `non_html_extensions`

 This is the list of TLDs to ignore when collecting URLs from the page.
 The extensions are formatted as a hash of key/value pairs, where the value
 is an array of TLDs, and the keys represent the length of those TLDs.

-
+##### `memory_limit` and Docker

 In case you are operating the crawler within a container, Arachnid2
 can attempt to prevent the container from running out of memory.

@@ -142,15 +154,75 @@ By default, it will end the crawl when the container uses >= 80%
 of its available memory. You can override this with the
 option.

-
+##### Non-HTML links

 The crawler attempts to stop itself from returning data from
 links that are not indicative of HTML, as detailed in
 `Arachnid2::NON_HTML_EXTENSIONS`.

+#### Caching (optional)
+
+If you have setup a cache to deduplicate crawls,
+set a cached service url
+`export ARACHNID_CACHED_SERVICE_ADDRESS=http://localhost:9000`
+
+This expects a push and get JSON API to respond
+to `/typhoeus_responses`, with a URL and the options pushed
+exactly as received as parameters. It will push any crawls
+to the service, and re-use any crawled pages
+if they are found to match.
+
+### With Watir
+
+Crawling with Watir works similarly, but requires you setup your
+environment for Watir, and headless web browsing if required.
+See the Watir documentation for more information.
+
+```ruby
+# ...
+Arachnid2.new(url).crawl_watir(opts)
+# -or-
+with_watir = true
+Arachnid2.new(url).crawl(opts, with_watir)
+```
+
+#### Options
+
+See the Typhoeus options above — most apply to Watir as well, with
+some exceptions:
+
+##### `proxy`
+
+Watir proxy options are formatted differently:
+
+```ruby
+proxy: {
+  http: "troy.show:8080",
+  ssl: "abed.show:8080"
+},
+```
+
+Proxy options handling is done by Watir.
+
+##### `headless`
+
+And it accepts an argument to make browse headlessly
+
+```ruby
+opts = { headless: true }
+```
+
+##### `followlocation` and `max_concurrency`
+
+These options do not apply to Watir, and will be ignored.
+
 ## Development

-
+Fork the repo and run the tests
+
+```ruby
+bundle exec rspec spec/
+```

 ## Contributing

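To tie the options documented above together, here is a minimal sketch of a Typhoeus-backed crawl using the option names from the 0.3.1 README; the target URL and header value are placeholders, and the commented cache line only applies if a caching service is actually running:

```ruby
require "arachnid2"

# Optional: point the crawler at a caching service (see "Caching (optional)")
# ENV["ARACHNID_CACHED_SERVICE_ADDRESS"] = "http://localhost:9000"

responses = []
spider = Arachnid2.new("http://sixcolours.com")

opts = {
  followlocation: true,                 # Typhoeus only: follow redirects
  timeout: 300,                         # per-page load timeout
  time_box: 60,                         # stop after 60 seconds of crawling
  max_urls: 50,                         # stop after 50 URLs
  headers: { "User-Agent" => "ExampleBot/1.0" },
  memory_limit: 80.0                    # in Docker: stop at 80% memory use
}

spider.crawl(opts) do |response|
  responses << response.effective_url   # each response is a Typhoeus::Response
end
```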
data/arachnid2.gemspec
CHANGED

@@ -25,9 +25,12 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "rake", "~> 10.0"
   spec.add_development_dependency "rspec", "~> 3.0"

+  spec.add_dependency "webdriver-user-agent", ">= 7.6"
+  spec.add_dependency "watir"
+  spec.add_dependency "webdrivers"
   spec.add_dependency "typhoeus"
   spec.add_dependency "bloomfilter-rb"
   spec.add_dependency "adomain"
   spec.add_dependency "addressable"
-  spec.add_dependency "nokogiri"
+  spec.add_dependency "nokogiri", ">= 1.8.5"
 end
data/lib/arachnid2/exoskeleton.rb
ADDED

@@ -0,0 +1,133 @@
+class Arachnid2
+  module Exoskeleton
+    def browser_type
+      unless @browser_type
+        @browser_type = "#{@options[:browser_type]}".to_sym if @options[:browser_type]
+        @browser_type ||= :firefox
+      end
+
+      @browser_type
+    end
+
+    def process(url, html)
+      return false unless Adomain["#{url}"].include? @domain
+
+      extract_hrefs(html)
+    end
+
+    def extract_hrefs(body)
+      elements = Nokogiri::HTML.parse(body).css('a')
+      return elements.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty? }
+    end
+
+    def vacuum(links, url)
+      links.each do |link|
+        next if link.match(/^\(|^javascript:|^mailto:|^#|^\s*$|^about:/)
+
+        begin
+          absolute_link = make_absolute(link, url)
+
+          next if skip_link?(absolute_link)
+
+          @global_queue << absolute_link
+        rescue Addressable::URI::InvalidURIError
+        end
+      end
+    end
+
+    def skip_link?(absolute_link)
+      !internal_link?(absolute_link) || \
+        @global_visited.include?(absolute_link) || \
+        extension_ignored?(absolute_link) || \
+        @global_queue.include?(absolute_link)
+    end
+
+    def preflight(opts)
+      @options = opts
+      @global_visited = BloomFilter::Native.new(:size => 1000000, :hashes => 5, :seed => 1, :bucket => 8, :raise => true)
+      @global_queue = [@url]
+    end
+
+    def proxy
+      @options[:proxy]
+    end
+
+    def non_html_extensions
+      return @non_html_extensions if @non_html_extensions
+
+      @non_html_extensions = @options[:non_html_extensions]
+      @non_html_extensions ||= DEFAULT_NON_HTML_EXTENSIONS
+    end
+
+    def bound_time
+      boundary = "#{@options[:time_box]}".to_i
+      boundary = BASE_CRAWL_TIME if boundary <= 0
+      boundary = MAX_CRAWL_TIME if boundary > MAX_CRAWL_TIME
+
+      return Time.now + boundary
+    end
+
+    def bound_urls
+      amount = "#{@options[:max_urls]}".to_i
+      amount = BASE_URLS if amount <= 0
+      amount = MAX_URLS if amount > MAX_URLS
+
+      amount
+    end
+
+    def timeout
+      unless @timeout
+        @timeout = @options[:timeout]
+        @timeout = DEFAULT_TIMEOUT unless @timeout.is_a?(Integer)
+        @timeout = DEFAULT_TIMEOUT if @timeout > MAXIMUM_TIMEOUT
+        @timeout = DEFAULT_TIMEOUT if @timeout < MINIMUM_TIMEOUT
+      end
+      @timeout
+    end
+
+    def crawl_options
+      @crawl_options ||= { max_urls: max_urls, time_limit: time_limit }
+    end
+
+    alias_method :max_urls, :bound_urls
+
+    alias_method :time_limit, :bound_time
+
+    def make_absolute(href, root)
+      Addressable::URI.parse(root).join(Addressable::URI.parse(href)).to_s
+    end
+
+    def internal_link?(absolute_url)
+      "#{Adomain[absolute_url]}".include? @domain
+    end
+
+    def extension_ignored?(url)
+      return false if url.empty?
+
+      !non_html_extensions.values.flatten.find { |e| url.downcase.end_with? e.downcase }.nil?
+    end
+
+    def memory_danger?
+      return false unless in_docker?
+
+      use = "#{File.open(MEMORY_USE_FILE, "rb").read}".to_f
+      @limit ||= "#{File.open(MEMORY_LIMIT_FILE, "rb").read}".to_f
+
+      return false unless ( (use > 0.0) && (@limit > 0.0) )
+
+      return ( ( (use / @limit) * 100.0 ) >= maximum_load_rate )
+    end
+
+    def in_docker?
+      File.file?(MEMORY_USE_FILE)
+    end
+
+    def maximum_load_rate
+      return @maximum_load_rate if @maximum_load_rate
+
+      @maximum_load_rate = "#{@options[:memory_limit]}".to_f
+      @maximum_load_rate = DEFAULT_MAXIMUM_LOAD_RATE unless ((@maximum_load_rate > 0.0) && (@maximum_load_rate < 100.0))
+      @maximum_load_rate
+    end
+  end
+end
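To make the mixin's role concrete, here is a hedged sketch of a toy class that includes `Arachnid2::Exoskeleton` and exercises its preflight, bounding, and link-filtering helpers without fetching anything; `ToyCrawler`, `dry_run`, and the example URL are invented for illustration:

```ruby
require "arachnid2"

class ToyCrawler
  include Arachnid2::Exoskeleton

  def initialize(url)
    @url    = url
    @domain = Adomain[@url]
  end

  # Drain the queue without making any HTTP requests, just to show the helpers.
  def dry_run(opts = {})
    preflight(opts)          # sets @options, the bloomfilter and the URL queue
    limit    = bound_urls    # max_urls, defaulting to 50, capped at 10000
    deadline = bound_time    # Time.now + time_box, defaulting to 15 seconds

    while (link = @global_queue.shift)
      break if @global_visited.size >= limit || Time.now > deadline
      next  if skip_link?(link)   # external, already seen, queued, or non-HTML
      @global_visited.insert(link)
    end

    @global_visited.size
  end
end

ToyCrawler.new("http://sixcolours.com").dry_run(time_box: 5, max_urls: 10)
```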
data/lib/arachnid2/typhoeus.rb
ADDED

@@ -0,0 +1,99 @@
+class Arachnid2
+  class Typhoeus
+    include CachedArachnidResponses
+    include Arachnid2::Exoskeleton
+
+    def initialize(url)
+      @url = url
+      @domain = Adomain[@url]
+      @cached_data = []
+    end
+
+    def crawl(opts = {})
+      preflight(opts)
+      typhoeus_preflight
+
+      until @global_queue.empty?
+        max_concurrency.times do
+          q = @global_queue.shift
+
+          break if @global_visited.size >= crawl_options[:max_urls] || \
+                   Time.now > crawl_options[:time_limit] || \
+                   memory_danger?
+
+          @global_visited.insert(q)
+
+          request = ::Typhoeus::Request.new(q, request_options)
+
+          data = load_data(@url, opts)
+          data.each { |response| yield response } and return unless data.nil?
+
+          request.on_complete do |response|
+            @cached_data.push(response)
+            links = process(response.effective_url, response.body)
+            next unless links
+
+            yield response
+
+            vacuum(links, response.effective_url)
+          end
+
+          @hydra.queue(request)
+        end # max_concurrency.times do
+
+        @hydra.run
+
+      end # until @global_queue.empty?
+      put_cached_data(@url, opts, @cached_data) unless @cached_data.empty?
+    ensure
+      @cookie_file.close! if @cookie_file
+    end # def crawl(opts = {})
+
+    private
+    def typhoeus_preflight
+      @hydra = ::Typhoeus::Hydra.new(:max_concurrency => max_concurrency)
+      typhoeus_proxy_options
+    end
+
+    def max_concurrency
+      return @max_concurrency if @max_concurrency
+
+      @max_concurrency = "#{@options[:max_concurrency]}".to_i
+      @max_concurrency = 1 unless (@max_concurrency > 0)
+      @max_concurrency
+    end
+
+    def followlocation
+      return @followlocation unless @followlocation.nil?
+
+      @followlocation = @options[:followlocation]
+      @followlocation = true unless @followlocation.is_a?(FalseClass)
+    end
+
+    def request_options
+      @cookie_file ||= Tempfile.new('cookies')
+
+      @request_options = {
+        timeout: timeout,
+        followlocation: followlocation,
+        cookiefile: @cookie_file.path,
+        cookiejar: @cookie_file.path,
+        headers: @options[:headers]
+      }.merge(crawl_options[:proxy])
+
+      @request_options[:headers] ||= {}
+      @request_options[:headers]['Accept-Language'] ||= DEFAULT_LANGUAGE
+      @request_options[:headers]['User-Agent'] ||= DEFAULT_USER_AGENT
+
+      @request_options
+    end
+
+    def typhoeus_proxy_options
+      crawl_options[:proxy] = {}
+
+      crawl_options[:proxy][:proxy] = "#{@options[:proxy][:ip]}:#{@options[:proxy][:port]}" if @options.dig(:proxy, :ip)
+      crawl_options[:proxy][:proxyuserpwd] = "#{@options[:proxy][:username]}:#{@options[:proxy][:password]}" if @options.dig(:proxy, :username)
+    end
+
+  end
+end
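A short usage sketch for the class above: the block receives each `Typhoeus::Response` for an internal page, and `max_concurrency` sets how many requests Hydra runs in parallel (the URL is a placeholder):

```ruby
require "arachnid2"

crawler = Arachnid2::Typhoeus.new("http://sixcolours.com")

crawler.crawl(max_concurrency: 5, time_box: 30, max_urls: 25) do |response|
  # response is a Typhoeus::Response for a page on the same domain
  puts "#{response.code} #{response.effective_url}"
end
```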
data/lib/arachnid2/version.rb
CHANGED

data/lib/arachnid2/watir.rb
ADDED

@@ -0,0 +1,102 @@
+class Arachnid2
+  class Watir
+    include Arachnid2::Exoskeleton
+
+    def initialize(url)
+      @url = url
+      @domain = Adomain[@url]
+    end
+
+    def crawl(opts)
+      preflight(opts)
+      watir_preflight
+
+      until @global_queue.empty?
+        @already_retried = false
+        q = @global_queue.shift
+
+        break if @global_visited.size >= crawl_options[:max_urls]
+        break if Time.now > crawl_options[:time_limit]
+        break if memory_danger?
+
+        @global_visited.insert(q)
+
+        begin
+          browser.goto q
+          links = process(browser.url, browser.body.html)
+          next unless links
+
+          yield browser
+
+          vacuum(links, browser.url)
+        rescue => e
+          raise e if @already_retried
+          raise e unless "#{e.class}".include?("Selenium") || "#{e.class}".include?("Watir")
+          @browser = nil
+          @already_retried = true
+          retry
+        end
+
+      end # until @global_queue.empty?
+    ensure
+      @browser.close if @browser rescue nil
+      @headless.destroy if @headless rescue nil
+    end
+
+    private
+    def browser
+      unless @browser
+        behead if @make_headless
+
+        @browser = create_browser
+
+        set_timeout
+      end
+
+      return @browser
+    end
+
+    def create_browser
+      return ::Watir::Browser.new(driver, proxy: @proxy) if @proxy
+
+      ::Watir::Browser.new driver
+    end
+
+    def set_timeout
+      @browser.driver.manage.timeouts.page_load = timeout
+    end
+
+    def behead
+      @headless = Headless.new
+      @headless.start
+    end
+
+    def driver
+      unless @driver
+        language = @options.dig(:headers, "Accept-Language") || DEFAULT_LANGUAGE
+        user_agent = @options.dig(:headers, "User-Agent") || DEFAULT_USER_AGENT
+
+        @driver = Webdriver::UserAgent.driver(
+          browser: browser_type,
+          accept_language_string: language,
+          user_agent_string: user_agent
+        )
+      end
+
+      @driver
+    end
+
+    def watir_preflight
+      watir_proxy_options
+      @make_headless = @options[:headless]
+    end
+
+    def watir_proxy_options
+      crawl_options[:proxy] = {}
+
+      crawl_options[:proxy][:http] = @options[:proxy][:http] if @options.dig(:proxy, :http)
+      crawl_options[:proxy][:ssl] = @options[:proxy][:ssl] if @options.dig(:proxy, :ssl)
+    end
+  end
+
+end
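And a corresponding sketch for driving the Watir-backed crawler directly; note that the block here receives the live `Watir::Browser`, not an HTTP response. The URL and proxy hosts are placeholders, and `headless: true` assumes a headless-capable environment:

```ruby
require "arachnid2"

opts = {
  headless: true,                  # wrap the session in Headless
  browser_type: :firefox,          # handed to Webdriver::UserAgent.driver
  proxy: { http: "troy.show:8080", ssl: "abed.show:8080" },
  time_box: 60,
  max_urls: 20
}

Arachnid2::Watir.new("http://sixcolours.com").crawl(opts) do |browser|
  # browser is the Watir::Browser currently sitting on an internal page
  puts "#{browser.title} (#{browser.url})"
end
```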
data/lib/arachnid2.rb
CHANGED

@@ -1,5 +1,8 @@
 require "arachnid2/version"
-require "arachnid2/
+require "arachnid2/cached_arachnid_responses"
+require "arachnid2/exoskeleton"
+require "arachnid2/typhoeus"
+require "arachnid2/watir"

 require 'tempfile'
 require "typhoeus"

@@ -8,9 +11,12 @@ require "adomain"
 require "addressable/uri"
 require "nokogiri"
 require "base64"
+require "webdrivers"
+require "webdriver-user-agent"
+require "watir"
+

 class Arachnid2
-  include CashedArachnidResponses
   # META:
   # About the origins of this crawling approach
   # The Crawler is heavily borrowed from by Arachnid.

@@ -22,7 +28,7 @@ class Arachnid2
   # And this was originally written as a part of Tellurion's bot
   # https://github.com/samnissen/tellurion_bot

-  MAX_CRAWL_TIME =
+  MAX_CRAWL_TIME = 10000
   BASE_CRAWL_TIME = 15
   MAX_URLS = 10000
   BASE_URLS = 50

@@ -58,8 +64,6 @@ class Arachnid2
   #
   def initialize(url)
     @url = url
-    @domain = Adomain[@url]
-    @cached_data = []
   end

   #

@@ -101,228 +105,15 @@ class Arachnid2
   #
   # @return nil
   #
-  def crawl(opts = {})
-
-
-    until @global_queue.empty?
-      @max_concurrency.times do
-        q = @global_queue.shift
-
-        break if @global_visited.size >= @crawl_options[:max_urls]
-        break if Time.now > @crawl_options[:time_limit]
-        break if memory_danger?
-
-        @global_visited.insert(q)
-
-        request = Typhoeus::Request.new(q, request_options)
-
-        data = load_data(@url, opts)
-        unless data.nil?
-          data.each do |response|
-            yield response
-          end
-          return
-        end
-        request.on_complete do |response|
-          @cached_data.push(response)
-          links = process(response)
-          next unless links
-
-          yield response
-
-          vacuum(links, response)
-        end
-
-        @hydra.queue(request)
-      end # @max_concurrency.times do
-
-      @hydra.run
-
-    end # until @global_queue.empty?
-    put_cached_data(@url, opts, @cached_data) unless @cached_data.empty?
-  ensure
-    @cookie_file.close! if @cookie_file
-
-
-  end # def crawl(opts = {})
-
-  private
-  def process(response)
-    return false unless Adomain["#{response.effective_url}"].include? @domain
-
-    elements = Nokogiri::HTML.parse(response.body).css('a')
-    return elements.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty? }
-  end
-
-  def vacuum(links, response)
-    links.each do |link|
-      next if link.match(/^\(|^javascript:|^mailto:|^#|^\s*$|^about:/)
-
-      begin
-        absolute_link = make_absolute(link, response.effective_url)
-
-        next if skip_link?(absolute_link)
-
-        @global_queue << absolute_link
-      rescue Addressable::URI::InvalidURIError
-      end
-    end
-  end
-
-  def skip_link?(absolute_link)
-    internal = internal_link?(absolute_link)
-    visited = @global_visited.include?(absolute_link)
-    ignored = extension_ignored?(absolute_link)
-    known = @global_queue.include?(absolute_link)
-
-    !internal || visited || ignored || known
-  end
-
-  def preflight(opts)
-    @options = opts
-    @crawl_options = crawl_options
-    @maximum_load_rate = maximum_load_rate
-    @max_concurrency = max_concurrency
-    @non_html_extensions = non_html_extensions
-    @hydra = Typhoeus::Hydra.new(:max_concurrency => @max_concurrency)
-    @global_visited = BloomFilter::Native.new(:size => 1000000, :hashes => 5, :seed => 1, :bucket => 8, :raise => true)
-    @global_queue = [@url]
-  end
-
-  def non_html_extensions
-    @non_html_extensions ||= nil
-
-    if !@non_html_extensions
-      @non_html_extensions = @options[:non_html_extensions]
-      @non_html_extensions ||= DEFAULT_NON_HTML_EXTENSIONS
-    end
-
-    @non_html_extensions
-  end
-
-  def max_concurrency
-    @max_concurrency ||= nil
-
-    if !@max_concurrency
-      @max_concurrency = "#{@options[:max_concurrency]}".to_i
-      @max_concurrency = 1 unless (@max_concurrency > 0)
-    end
-
-    @max_concurrency
-  end
+  def crawl(opts = {}, with_watir = false)
+    crawl_watir and return if with_watir

-
-
-    boundary = BASE_CRAWL_TIME if boundary <= 0
-    boundary = MAX_CRAWL_TIME if boundary > MAX_CRAWL_TIME
-
-    return Time.now + boundary
-  end
-
-  def bound_urls
-    amount = "#{@options[:max_urls]}".to_i
-    amount = BASE_URLS if amount <= 0
-    amount = MAX_URLS if amount > MAX_URLS
-
-    amount
-  end
-
-  def followlocation
-    if @followlocation.is_a?(NilClass)
-      @followlocation = @options[:followlocation]
-      @followlocation = true unless @followlocation.is_a?(FalseClass)
-    end
-    @followlocation
-  end
-
-  def timeout
-    if !@timeout
-      @timeout = @options[:timeout]
-      @timeout = DEFAULT_TIMEOUT unless @timeout.is_a?(Integer)
-      @timeout = DEFAULT_TIMEOUT if @timeout > MAXIMUM_TIMEOUT
-      @timeout = DEFAULT_TIMEOUT if @timeout < MINIMUM_TIMEOUT
-    end
-    @timeout
-  end
-
-  def request_options
-    @cookie_file ||= Tempfile.new('cookies')
-
-    @request_options = {
-      timeout: timeout,
-      followlocation: followlocation,
-      cookiefile: @cookie_file.path,
-      cookiejar: @cookie_file.path,
-      headers: @options[:headers]
-    }
-
-    @request_options[:headers] ||= {}
-    @request_options[:headers]['Accept-Language'] ||= DEFAULT_LANGUAGE
-    @request_options[:headers]['User-Agent'] ||= DEFAULT_USER_AGENT
-
-    @request_options
-  end
-
-  def crawl_options
-    @crawl_options ||= nil
-
-    if !@crawl_options
-      @crawl_options = { :max_urls => max_urls, :time_limit => time_limit }
-
-      @crawl_options[:proxy] = "#{@options[:proxy][:ip]}:#{@options[:proxy][:port]}" if @options.dig(:proxy, :ip)
-      @crawl_options[:proxyuserpwd] = "#{@options[:proxy][:username]}:#{@options[:proxy][:password]}" if @options.dig(:proxy, :username)
-    end
-
-    @crawl_options
-  end
-
-  def max_urls
-    bound_urls
-  end
-
-  def time_limit
-    bound_time
-  end
-
-  def make_absolute(href, root)
-    Addressable::URI.parse(root).join(Addressable::URI.parse(href)).to_s
-  end
-
-  def internal_link?(absolute_url)
-    "#{Adomain[absolute_url]}".include? @domain
-  end
-
-  def extension_ignored?(url)
-    return false if url.empty?
-
-    !@non_html_extensions.values.flatten.find { |e| url.downcase.end_with? e.downcase }.nil?
-  end
-
-  def memory_danger?
-    return false unless in_docker?
-
-    use = "#{File.open(MEMORY_USE_FILE, "rb").read}".to_f
-    @limit ||= "#{File.open(MEMORY_LIMIT_FILE, "rb").read}".to_f
-
-    return false unless ( (use > 0.0) && (@limit > 0.0) )
-
-    return ( ( (use / @limit) * 100.0 ) >= @maximum_load_rate )
-  end
-
-  def in_docker?
-    return false unless File.file?(MEMORY_USE_FILE)
-    true
-  end
-
-  def maximum_load_rate
-    @maximum_load_rate ||= nil
-
-    if !@maximum_load_rate
-      @maximum_load_rate = "#{@options[:memory_limit]}".to_f
-      @maximum_load_rate = DEFAULT_MAXIMUM_LOAD_RATE unless ((@maximum_load_rate > 0.0) && (@maximum_load_rate < 100.0))
-    end
+    Arachnid2::Typhoeus.new(@url).crawl(opts, &Proc.new)
+  end

-
-
+  def crawl_watir(opts)
+    Arachnid2::Watir.new(@url).crawl(opts, &Proc.new)
+  end
+  # https://mudge.name/2011/01/26/passing-blocks-in-ruby-without-block.html

 end
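The slimmed-down entry point now only dispatches: `Arachnid2#crawl` forwards its block (via `&Proc.new`, per the linked article) to `Arachnid2::Typhoeus` by default, or to `Arachnid2::Watir` when `with_watir` is true. A minimal sketch mirroring the README, with a placeholder URL:

```ruby
require "arachnid2"

spider = Arachnid2.new("http://sixcolours.com")

# Default: Typhoeus-backed crawl, the block receives HTTP responses
spider.crawl(max_urls: 10) { |response| puts response.effective_url }

# Watir-backed crawl, the block receives the browser
spider.crawl({ max_urls: 10, headless: true }, true) { |browser| puts browser.url }
# -or-
spider.crawl_watir(max_urls: 10, headless: true) { |browser| puts browser.url }
```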
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: arachnid2
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.3.1
 platform: ruby
 authors:
 - Sam Nissen
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2019-
+date: 2019-02-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler

@@ -52,6 +52,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: webdriver-user-agent
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '7.6'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '7.6'
+- !ruby/object:Gem::Dependency
+  name: watir
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: webdrivers
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: typhoeus
   requirement: !ruby/object:Gem::Requirement

@@ -114,14 +156,14 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version:
+        version: 1.8.5
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version:
+        version: 1.8.5
 description:
 email:
 - scnissen@gmail.com

@@ -142,8 +184,11 @@ files:
 - bin/console
 - bin/setup
 - lib/arachnid2.rb
-- lib/arachnid2/
+- lib/arachnid2/cached_arachnid_responses.rb
+- lib/arachnid2/exoskeleton.rb
+- lib/arachnid2/typhoeus.rb
 - lib/arachnid2/version.rb
+- lib/arachnid2/watir.rb
 homepage: https://github.com/samnissen/arachnid2
 licenses:
 - MIT