RubyGems - wgit - Versions diffs - 0.0.8 → 0.0.9 - Mend

wgit 0.0.8 → 0.0.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: fc46154d26c11924869c5a38687ac6a855503c0669db40f43941992273d9c3d2
-  data.tar.gz: d348acb154faa3cb19d4b5e28332bec1192b8b04b526493ec08bdfde933964d8
+  metadata.gz: ae86258e3aac086f2215d1fb3e3b871cd4f4839884eb7c359ac9863148f1a307
+  data.tar.gz: 2eafa3a2b7b6d6ff99aaf00466ccd5f9214f049a9f5836b45aaf6d17bfbe226b
 SHA512:
-  metadata.gz: e0a48d554b359abb247ebdbde597b4654f56baf2a980261d70906126f9715d03221a1581010cd0d8e625ff2d624e75d55c6152804d2f27d236a8376962f6227f
-  data.tar.gz: 84be3d1deb26d4e0641db1eca73afdc0323588ea0bd283e9d2920fd96614dfa549ab4c7497317ed53d87d5deaf344490dc7258af381dee116adc0c6c44837746
+  metadata.gz: 5735051c62d3d22db75a42c7d33cd8b1f78d4500b27c0e136980382ffb0e6830ee8d355a0f9151c4d844dd0eb91dc860cd5bd1855ec68099985c34d55ba1a3aa
+  data.tar.gz: 0b5ab8f7f60e69f791fd4b51995f5dbf2f38a77954b164062b64b982847b7fd3844b016fbfaff186a7178f68d2fedc8736fd60b8c16ae82bc11c7a84a5892e42
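
The digests above cover the metadata.gz and data.tar.gz members packed inside the .gem archive. As a minimal sketch of how one of those members could be checked against a listed SHA256 value (the helper name `sha256_matches?` is hypothetical and not part of RubyGems or Wgit):

```ruby
require 'digest'

# Hypothetical helper, not part of RubyGems or Wgit.
# Returns true when the SHA256 hex digest of the file at `path`
# equals `expected_hex` (compared case-insensitively).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

After unpacking the .gem file, which is a plain tar archive, calling `sha256_matches?('data.tar.gz', expected)` with the SHA256 value listed for data.tar.gz should return true for an untampered 0.0.9 release.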

data/lib/wgit/crawler.rb CHANGED

@@ -8,13 +8,12 @@ module Wgit
   # The Crawler class provides a means of crawling web based URL's, turning
   # their HTML into Wgit::Document's.
-  # Note that currently all redirects will not be followed during a crawl.
   class Crawler
     include Assertable
     # The urls to crawl.
     attr_reader :urls
     # The docs of the crawled @urls.
     attr_reader :docs
@@ -146,26 +145,35 @@ module Wgit
     # The fetch method performs a HTTP GET to obtain the HTML document.
     # Invalid urls or any HTTP response that doesn't return a HTML body will be
-    # ignored and nil will be returned.  This means that redirects etc. will
-    # not be followed.
+    # ignored and nil will be returned. Otherwise, the HTML is returned.
     def fetch(url)
-      raise unless url.respond_to?(:to_uri)
-      res = Net::HTTP.get_response(url.to_uri)
-      res.body.empty? ? nil : res.body
+      response = resolve(url)
+      response.body.empty? ? nil : response.body
     rescue
       nil
     end
+    # The resolve method performs a HTTP GET to obtain the HTML document.
+    # A certain amount of redirects will be followed by default before raising
+    # an exception. Redirects can be disabled by setting `redirect_limit: 1`.
+    # The Net::HTTPResponse will be returned.
+    def resolve(url, redirect_limit: 5)
+      redirect_count = 0
+      begin
+        raise "Too many redirects" if redirect_count >= redirect_limit
+        response = Net::HTTP.get_response(URI.parse(url))
+        url = response['location']
+        redirect_count += 1
+      end while response.is_a?(Net::HTTPRedirection)
+      response
+    end
     # Add the url to @urls ensuring it is cast to a Wgit::Url if necessary.
     def add_url(url)
       @urls = [] if @urls.nil?
-      if url.is_a?(Wgit::Url)
-        @urls << url
-      else
-        @urls << Wgit::Url.new(url)
-      end
+      @urls << Wgit::Url.new(url)
     end
     alias :crawl :crawl_urls
     alias :crawl_r :crawl_site
   end
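
The redirect handling added in `resolve` can be sketched as a standalone method. This is an illustration under stated assumptions, not the code shipped in 0.0.9: the name `follow_redirects` and the `http_get` parameter are hypothetical, the latter added only so the loop can be exercised without network access. Unlike the diff's `resolve`, the sketch also resolves relative `Location` values against the current URI and raises when the header is missing.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper, not part of Wgit. Issues at most `redirect_limit`
# requests, following redirects, and returns the first non-redirect response.
# `http_get` is an assumed injection point; by default it performs a real GET.
def follow_redirects(url, redirect_limit: 5,
                     http_get: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url.to_s)

  redirect_limit.times do
    response = http_get.call(uri)
    return response unless response.is_a?(Net::HTTPRedirection)

    location = response['location']
    raise 'Redirect without a Location header' if location.nil?

    # Resolve a relative Location value against the URI just requested.
    uri = URI.join(uri, location)
  end

  raise 'Too many redirects'
end
```

As in the diff, `redirect_limit: 1` allows a single request, so any redirect raises rather than being followed.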

data/lib/wgit/version.rb CHANGED

@@ -3,5 +3,5 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = "0.0.8".freeze
+  VERSION = "0.0.9".freeze
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.0.8
+  version: 0.0.9
 platform: ruby
 authors:
 - Michael Telford
@@ -137,11 +137,11 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '2.6'
 description: Wgit is a WWW indexer/scraper which crawls URL's, retrieves and serialises
-  their page contents for later use. You can use Wgit to copy entire website if required.
+  their page contents for later use. You can use Wgit to copy entire websites if required.
   Wgit also provides a means to search indexed documents stored in a database. Therefore,
   this library provides the main components of a WWW search engine. The Wgit API is
-  easily extendable allowing you to pull out the parts of a webpage that are important
-  to you, the external links or keywords for example. As Wgit is an API, it's very
+  easily extended allowing you to pull out the parts of a webpage that are important
+  to you, the code snippets or images for example. As Wgit is a library, it's very
   useful in many different application types.
 email: michael.telford@live.com
 executables: []