RubyGems - wgit - Versions diffs - 0.12.0 → 0.12.1 - Mend

wgit 0.12.0 → 0.12.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4dee43af6274102c9bc6ad4f32c8811f57c5dc2833e923e038aac8f7f2072385
-  data.tar.gz: 9463768a40c78ab9c91ac34dd1c3a0fd1e7b990440b70766feebb9e2f0f99bd4
+  metadata.gz: 4210033f192994609b9a21f1b3e61292247cf94aa415441b22823c32dcc6a214
+  data.tar.gz: 26ca37a3a20998b7ce313c7bd499b89653ba86fac8f09a4b5e699cf6d57bd57c
 SHA512:
-  metadata.gz: 5c94fcae3a56254a6c0d9d67597f1a2125439c1ee3d7d68e22fb70fa59298735d76b0fb8e77bc44b0850f6ba561fe11df3a867973d5f0533adddda9d2c6f2002
-  data.tar.gz: 779bf20dc1eaa29cc926a836d5a5a155c2270db1be0d22bc52ba9893cbd8d3aaec2cee7810b660fb9014020f5de2290af2b23680cc040a9c682ca82431d6f50d
+  metadata.gz: 3f824772e8f0633d540237fb7f99064327eb54a3efbacad58317e3b47d3e45c08f9b9bee542656e4fc5315ded9539ced1283e83207127b948107900d7acc9882
+  data.tar.gz: 0f4b7787ed881d7693226466785e70efd0bff7f0971ba65cebb7553886ead143d40cbeb8a4e72e65ef0c249e689024254668da817dfe347c2076153878e605b0
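
The new digests can be checked against a locally unpacked gem. The following is a minimal sketch, assuming the `.gem` archive has already been extracted so that `data.tar.gz` and `metadata.gz` exist on disk; the helper name `sha256_matches?` is hypothetical and not part of wgit or RubyGems.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA256 digest of the file at
# `path` equals `expected_hex` (compared case-insensitively).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call, assuming data.tar.gz sits in the current directory:
#   sha256_matches?(
#     'data.tar.gz',
#     '26ca37a3a20998b7ce313c7bd499b89653ba86fac8f09a4b5e699cf6d57bd57c'
#   )
```

The same check works for the SHA512 entries by swapping in `Digest::SHA512`.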

data/CHANGELOG.md CHANGED

@@ -9,6 +9,15 @@
 - ...
 ---
+## v0.12.1
+### Added
+- `Wgit::Crawler.new typhoeus_opts:` param which passes the `Hash` directly to `Typhoeus.get`. See the Typhoeus documentation for more info on what can be passed.
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
 ## v0.12.0 - BREAKING CHANGES
 A big release with several breaking changes, not all of which can be listed below. The headline features for this release are the introduction of a database adapter, allowing Wgit to work with practically any underlying database system; and a custom in-house text extractor.
 ### Added

data/README.md CHANGED

@@ -144,6 +144,7 @@ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) ou
 - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
 - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit by default, crawls the entire site by extracting its internal links pointing to the same host - no `xpath` needed.
+- Wgit can crawl authenticated content, providing you can login on a web browser and export your session cookies.
 - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
 - Wgit can index (crawl and save) HTML to a database making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
@@ -171,9 +172,9 @@ indexer.index_site(wiki, **opts)
 So why might you not use Wgit, I hear you ask?
-- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
-- Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider a pure browser-based crawler instead.
-- Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
+- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that. If however, you simply want to crawl webpages requiring authentication, then it's entirely achievable using Wgit.
+- Wgit *can* parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider using a purely browser-based crawler instead.
+- Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent requests then you should consider something else.
 ## Installation
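
The README change above mentions crawling authenticated content with exported session cookies. One way this might look is sketched below, assuming the cookies are sent as a `Cookie` request header through the new `typhoeus_opts:` param; the cookie names and values are made up, and the README itself does not spell out this exact approach.

```ruby
# Hypothetical cookies exported from a logged-in browser session.
cookies = { 'session_id' => 'abc123', 'remember_me' => 'true' }

# Join the pairs into a single Cookie header value.
cookie_header = cookies.map { |name, value| "#{name}=#{value}" }.join('; ')

# Options Hash in the shape accepted by the typhoeus_opts: param.
typhoeus_opts = { headers: { 'Cookie' => cookie_header } }

# With wgit 0.12.1 installed, the Hash would be passed as:
#   crawler = Wgit::Crawler.new(typhoeus_opts: typhoeus_opts)
```

Because the crawler merges headers separately (see the `crawler.rb` changes), supplying only a `Cookie` header should leave the default `User-Agent` and `Accept` headers in place.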

data/lib/wgit/crawler.rb CHANGED

@@ -54,12 +54,16 @@ module Wgit
     # The value should balance between a good UX and enough JS parse time.
     attr_accessor :parse_javascript_delay
+    # The opts Hash passed directly to the Typhoeus#get request.
+    attr_accessor :typhoeus_opts
     # The opts Hash passed directly to the ferrum Chrome browser when
     # `parse_javascript: true`.
-    # See https://github.com/rubycdp/ferrum for details.
+    # See https://github.com/rubycdp/ferrum for more info.
     attr_accessor :ferrum_opts
     # The Wgit::Response of the most recently crawled URL.
+    # See https://rubydoc.info/gems/typhoeus for more info.
     attr_reader :last_response
     # Initializes and returns a Wgit::Crawler instance.
@@ -76,14 +80,17 @@ module Wgit
     #   installed and in $PATH.
     # @param parse_javascript_delay [Integer] The delay time given to a page's
     #   JS to update the DOM. After the delay, the HTML is crawled.
+    # @param typhoeus_opts [Hash] The options to pass to Typhoeus.
+    # @param ferrum_opts [Hash] The options to pass to Ferrum.
     def initialize(redirect_limit: 5, timeout: 5, encode: true,
                    parse_javascript: false, parse_javascript_delay: 1,
-                   ferrum_opts: {})
+                   typhoeus_opts: {}, ferrum_opts: {})
       assert_type(redirect_limit, Integer)
       assert_type(timeout, [Integer, Float])
       assert_type(encode, [TrueClass, FalseClass])
       assert_type(parse_javascript, [TrueClass, FalseClass])
       assert_type(parse_javascript_delay, Integer)
+      assert_type(typhoeus_opts, Hash)
       assert_type(ferrum_opts, Hash)
       @redirect_limit         = redirect_limit
@@ -91,14 +98,15 @@ module Wgit
       @encode                 = encode
       @parse_javascript       = parse_javascript
       @parse_javascript_delay = parse_javascript_delay
-      @ferrum_opts            = default_ferrum_opts.merge(ferrum_opts)
+      @typhoeus_opts          = merge_typhoeus_opts(typhoeus_opts)
+      @ferrum_opts            = merge_ferrum_opts(ferrum_opts)
     end
     # Overrides String#inspect to shorten the printed output of a Crawler.
     #
     # @return [String] A short textual representation of this Crawler.
     def inspect
-      "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} ferrum_opts=#{@ferrum_opts}>"
+      "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} typhoeus_opts=#{@typhoeus_opts} ferrum_opts=#{@ferrum_opts}>"
     end
     # Crawls an entire website's HTML pages by recursively going through
@@ -268,7 +276,7 @@ module Wgit
       url.crawl_duration = response.total_time
       # Don't override previous url.redirects if response is fully resolved.
-      url.redirects      = response.redirects unless response.redirects.empty?
+      url.redirects = response.redirects unless response.redirects.empty?
       @last_response = response
     end
@@ -377,32 +385,25 @@ module Wgit
     end
     # Performs a HTTP GET request and returns the response.
+    # See https://rubydoc.info/gems/typhoeus for more info.
     #
     # @param url [String] The url to GET.
-    # @return [Typhoeus::Response] The HTTP response object.
+    # @return [Typhoeus::Response] The Typhoeus HTTP response object.
     def http_get(url)
-      opts = {
-        followlocation: false,
-        timeout: @timeout,
-        accept_encoding: 'gzip',
-        headers: {
-          'User-Agent' => "wgit/#{Wgit::VERSION}",
-          'Accept'     => 'text/html'
-        }
-      }
-      # See https://rubydoc.info/gems/typhoeus for more info.
-      Typhoeus.get(url, **opts)
+      Typhoeus.get(url, **@typhoeus_opts)
     end
-    # Performs a HTTP GET request in a web browser and parses the response JS
-    # before returning the HTML body of the fully rendered webpage. This allows
-    # Javascript (SPA apps etc.) to generate HTML dynamically.
+    # Performs a HTTP GET request in a web browser allowing the response JS to
+    # execute before returning the HTML body of the fully rendered webpage.
+    # This allows Javascript (SPA apps etc.) to generate HTML dynamically.
+    # See https://github.com/rubycdp/ferrum for more info.
     #
     # @param url [String] The url to browse to.
     # @return [Ferrum::Browser] The browser response object.
     def browser_get(url)
       @browser ||= Ferrum::Browser.new(**@ferrum_opts)
+      # Navigate to the url and start parsing the JS on the page.
       @browser.goto(url)
       # Wait for the page's JS to finish dynamically manipulating the DOM.
@@ -452,6 +453,20 @@ module Wgit
     private
+    # The default opts which are merged with the user's typhoeus_opts: and then
+    # passed directly to the Typhoeus#get request.
+    def default_typhoeus_opts
+      {
+        followlocation: false,
+        timeout: @timeout,
+        accept_encoding: 'gzip',
+        headers: {
+          'User-Agent' => "wgit/#{Wgit::VERSION}",
+          'Accept'     => 'text/html'
+        }
+      }
+    end
     # The default opts which are merged with the user's ferrum_opts: and then
     # passed directly to the ferrum Chrome browser.
     def default_ferrum_opts
@@ -462,6 +477,19 @@ module Wgit
       }
     end
+    # Merges the default Typhoeus options with user-provided options.
+    # Performs a separate merge of headers to allow user customization.
+    def merge_typhoeus_opts(typhoeus_opts)
+      default_typhoeus_opts.merge(typhoeus_opts) do |key, oldval, newval|
+        key == :headers ? oldval.merge(newval) : newval
+      end
+    end
+    # Merges the default Ferrum options with user-provided options.
+    def merge_ferrum_opts(ferrum_opts)
+      default_ferrum_opts.merge(ferrum_opts)
+    end
     # Manually does the following: `links = internals - crawled`.
     # This is needed due to an apparent bug in Set<Url> (when upgrading from
     # Ruby v3.0.2 to v3.3.0) causing an infinite crawl loop in #crawl_site.
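
The effect of the new `merge_typhoeus_opts` method can be seen in isolation. The sketch below copies the merge logic from the diff into standalone methods so it runs without wgit installed; `SKETCH_VERSION`, the `timeout` argument and the sample user options are placeholders, not values taken from the gem.

```ruby
# Placeholder standing in for Wgit::VERSION.
SKETCH_VERSION = '0.12.1'

# Mirrors default_typhoeus_opts from the diff, with the timeout passed in
# instead of being read from an instance variable.
def default_typhoeus_opts(timeout = 5)
  {
    followlocation: false,
    timeout: timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{SKETCH_VERSION}",
      'Accept'     => 'text/html'
    }
  }
end

# Mirrors merge_typhoeus_opts: user values win on conflict, except that the
# :headers Hash is merged key by key instead of being replaced wholesale.
def merge_typhoeus_opts(typhoeus_opts)
  default_typhoeus_opts.merge(typhoeus_opts) do |key, oldval, newval|
    key == :headers ? oldval.merge(newval) : newval
  end
end

merged = merge_typhoeus_opts(
  timeout: 10,
  headers: { 'Cookie' => 'session=abc123' }
)
```

Here `merged[:timeout]` is overridden to `10`, while `merged[:headers]` keeps the default `User-Agent` and `Accept` entries alongside the added `Cookie`. A plain `Hash#merge` without the block would have dropped the default headers.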

data/lib/wgit/url.rb CHANGED

@@ -98,7 +98,7 @@ module Wgit
     # Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot
     # be parsed successfully e.g. the String is invalid.
     #
-    # Use this method when you can't gaurentee that obj is parsable as a URL.
+    # Use this method when you can't guarantee that obj is parsable as a URL.
     # See Wgit::Url.parse for more information.
     #
     # @param obj [Object] The object to parse, which #is_a?(String).

data/lib/wgit/version.rb CHANGED

@@ -6,7 +6,7 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = "0.12.0"
+  VERSION = "0.12.1"
   # Returns the current gem version of Wgit as a String.
   def self.version

metadata CHANGED

@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.12.0
+  version: 0.12.1
 platform: ruby
 authors:
 - Michael Telford
-autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-10-30 00:00:00.000000000 Z
+date: 2025-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -30,56 +29,84 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.2'
+        version: '0.3'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.2'
+        version: '0.3'
+- !ruby/object:Gem::Dependency
+  name: benchmark
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
 - !ruby/object:Gem::Dependency
   name: ferrum
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.14'
+        version: '0.17'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.17'
+- !ruby/object:Gem::Dependency
+  name: logger
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.14'
+        version: '1.7'
 - !ruby/object:Gem::Dependency
   name: mongo
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.19'
+        version: '2.21'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.19'
+        version: '2.21'
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.15'
+        version: '1.18'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.15'
+        version: '1.18'
 - !ruby/object:Gem::Dependency
   name: typhoeus
   requirement: !ruby/object:Gem::Requirement
@@ -100,70 +127,70 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '11.1'
+        version: '12.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '11.1'
+        version: '12.0'
 - !ruby/object:Gem::Dependency
   name: dotenv
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.8'
+        version: '3.1'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.8'
+        version: '3.1'
 - !ruby/object:Gem::Dependency
   name: maxitest
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.4'
+        version: '6.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.4'
+        version: '6.0'
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.14'
+        version: '0.15'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.14'
+        version: '0.15'
 - !ruby/object:Gem::Dependency
   name: rubocop
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.57'
+        version: '1.79'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.57'
+        version: '1.79'
 - !ruby/object:Gem::Dependency
   name: toys
   requirement: !ruby/object:Gem::Requirement
@@ -184,14 +211,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.19'
+        version: '3.25'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.19'
+        version: '3.25'
 - !ruby/object:Gem::Dependency
   name: yard
   requirement: !ruby/object:Gem::Requirement
@@ -274,8 +301,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.22
-signing_key:
+rubygems_version: 3.6.7
 specification_version: 4
 summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
   extract the data you want from the web.
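
Most of the metadata changes raise the floor of a pessimistic (`~>`) version constraint. What that means for a given installed version can be checked with RubyGems' own classes, as in this small sketch; the helper name `satisfies?` and the sample version numbers are illustrative only.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies `constraint`.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# ferrum moved from '~> 0.14' to '~> 0.17' in this release.
within_new_range = satisfies?('~> 0.17', '0.17.2') # => true
below_new_floor  = satisfies?('~> 0.17', '0.14.0') # => false
next_major       = satisfies?('~> 0.17', '1.0.0')  # => false
```

A two-segment constraint such as `~> 0.17` allows any version from `0.17` up to, but not including, `1.0`, so a project pinned to an older ferrum release would need to upgrade before it could resolve wgit 0.12.1.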