RubyGems - wgit - Versions diffs - 0.7.0 → 0.8.0 - Mend

wgit 0.7.0 → 0.8.0

Files changed (11)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -0
data/README.md +53 -17
data/lib/wgit/core_ext.rb +1 -1
data/lib/wgit/crawler.rb +17 -7
data/lib/wgit/document.rb +87 -54
data/lib/wgit/document_extensions.rb +13 -3
data/lib/wgit/response.rb +6 -6
data/lib/wgit/url.rb +17 -0
data/lib/wgit/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 29d37a4a0f013fec64625d8fe5798ae2d062ae6f213811c51d223de311e16707
-  data.tar.gz: 213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8
+  metadata.gz: dc2ea1de7219c66eb70ed7b4e97bb8da7c169eb68257df7d7d1bfdaf6a5ed4d6
+  data.tar.gz: f88afdd7477812c3b9fdbde8e1125b950aec3fe3fabef5a20e0c16a9e26a767b
 SHA512:
-  metadata.gz: acd321f3ba039e6f54dd8a36a3e4ebec1fb40f1cda5ee1c982df4be22ee6d463f829c72f011ff959a4a4d2651676dc2d31866a273b60d3e5e630ccf77b3d7cbe
-  data.tar.gz: d0908d28e6fdaec440209479f75945672807cf3e9359fb8bd8f6cc9de45568a341ac0204ba5f50a2f6569b4a29f4f7ac3088353a35f2c5091a567af469027aab
+  metadata.gz: e5fe86fee44e4c494d936d747719d856c240a0129db0692afa44ad9855340929a8b90cc177d92d775bc7cc15af99093078cf6a8fc8b9bbd9bc8965866d343914
+  data.tar.gz: c87efb4c5dcfb8795d62ab41f7e8d2bc206e7f8407707c9269c0fc86fbdcd14b7269d04083ec756d3b77a99300639469e88c639ee125ddee0984c3957e7cfc7b

data/CHANGELOG.md CHANGED

@@ -9,6 +9,21 @@
 - ...
 ---
+## v0.8.0
+### Added
+- To the range of `Wgit::Document.text_elements`. Now all (and only) the visible page text should be extracted into `Wgit::Document#text` successfully.
+- `Wgit::Document#description` default extension.
+- `Wgit::Url.parse_or_nil` method.
+### Changed/Removed
+- Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
+- Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
+- Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
+- Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
+- Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse`, allowing instances of `Wgit::Url` to be returned as is. This also affects `Enumerable#to_urls` in the same way.
+### Fixed
+- An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
+---
 ## v0.7.0
 ### Added
 - `Wgit::Indexer.new` optional `crawler:` named param.

data/README.md CHANGED

@@ -68,17 +68,18 @@ crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Respo
 doc.class # => Wgit::Document
 doc.class.public_instance_methods(false).sort # => [
-# :==, :[], :author, :base, :base_url, :content, :css, :doc, :empty?, :external_links,
-# :external_urls, :html, :internal_absolute_links, :internal_absolute_urls,
-# :internal_links, :internal_urls, :keywords, :links, :score, :search, :search!,
-# :size, :statistics, :stats, :text, :title, :to_h, :to_json, :url, :xpath
+# :==, :[], :author, :base, :base_url, :content, :css, :description, :doc, :empty?,
+# :external_links, :external_urls, :html, :internal_absolute_links,
+# :internal_absolute_urls, :internal_links, :internal_urls, :keywords, :links, :score,
+# :search, :search!, :size, :statistics, :stats, :text, :title, :to_h, :to_json,
+# :url, :xpath
 # ]
 doc.url   # => "https://wikileaks.org/What-is-Wikileaks.html"
 doc.title # => "WikiLeaks - What is WikiLeaks"
 doc.stats # => {
           #   :url=>44, :html=>28133, :title=>17, :keywords=>0,
-          #   :links=>35, :text_snippets=>67, :text_bytes=>13735
+          #   :links=>35, :text=>67, :text_bytes=>13735
           # }
 doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
 doc.text  # => ["The Courage Foundation is an international organisation that <snip>", ...]
@@ -273,15 +274,50 @@ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_
 ## Extending The API
-Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
+Document serialising in Wgit is the means of downloading a web page and serialising parts of its content into accessible `Wgit::Document` attributes/methods. For example, `Wgit::Document#author` will return the value found at the webpage's xpath `meta[@name='author']`.
-Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
+There are two ways to extend the Document serialising behaviour of Wgit for your own means:
-### Serialising Additional Page Elements via Document Extensions
+1. Add additional **textual** content to `Wgit::Document#text`.
+2. Define `Wgit::Document` instance methods for specific HTML **elements**.
-You can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
+These two approaches are described in more detail below.
-Once the page element has been serialised, you can do with it as you wish e.g. obtain it's text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
+### 1. Extending The Default Text Elements
+Wgit contains a set of `Wgit::Document.text_elements` that defines which HTML elements contain text on a page; the text of these elements is then serialised. Once serialised, you can process this text content via methods like `Wgit::Document#text` and `Wgit::Document#search`.
+The below code example shows how to extract additional text from a webpage:
+```ruby
+require 'wgit'
+# The default text_elements cover most visible page text but let's say we
+# have a <table> element with text content that we want.
+Wgit::Document.text_elements << :table
+doc = Wgit::Document.new(
+  'http://some_url.com',
+  <<~HTML
+  <html>
+    <p>Hello world!</p>
+    <table>My table</table>
+  </html>
+  HTML
+)
+# Now every crawled Document#text will include <table> text content.
+doc.text            # => ["Hello world!", "My table"]
+doc.search('table') # => ["My table"]
+```
+**Note**: This only works for *textual* page content. For more control over the serialised *elements* themselves, see below.
+### 2. Serialising Specific HTML Elements (via Document Extensions)
+Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next.
+Therefore, you can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
 Here's how to add a Document extension to serialise a specific page element:
@@ -293,14 +329,14 @@ Wgit::Document.define_extension(
   :tables,                  # Wgit::Document#tables will return the page's tables.
   '//table',                # The xpath to extract the tables.
   singleton: false,         # True returns the first table found, false returns all.
-  text_content_only: false, # True returns one or more Strings of the tables text,
-                            # false returns the tables as Nokogiri objects (see below).
+  text_content_only: false, # True returns the table text, false returns the Nokogiri object.
 ) do |tables|
-  # Here we can manipulate the object(s) before they're set as Wgit::Document#tables.
+  # Here we can inspect/manipulate the tables before they're set as Wgit::Document#tables.
+  tables
 end
-# Our Document has a table which we're interested in.
-# Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
+# Our Document has a table which we're interested in. Note it doesn't matter how the Document
+# is initialised e.g. manually (as below) or via Wgit::Crawler methods etc.
 doc = Wgit::Document.new(
   'http://some_url.com',
   <<~HTML
@@ -323,9 +359,9 @@ tables = doc.tables
 tables.class       # => Nokogiri::XML::NodeSet
 tables.first.class # => Nokogiri::XML::Element
-# Notice the Document's stats now include our 'tables' extension.
+# Note, the Document's stats now include our 'tables' extension.
 doc.stats # => {
-#   :url=>19, :html=>242, :links=>0, :text_snippets=>2, :text_bytes=>65, :tables=>1
+#   :url=>19, :html=>242, :links=>0, :text=>8, :text_bytes=>91, :tables=>1
 # }
 ```

data/lib/wgit/core_ext.rb CHANGED

@@ -11,7 +11,7 @@ class String
   #
   # @return [Wgit::Url] The converted URL.
   def to_url
-    Wgit::Url.new(self)
+    Wgit::Url.parse(self)
   end
 end
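
The one-line change above swaps `new` for `parse`. A minimal sketch of why that matters, assuming `parse` behaves as shown in the `url.rb` context line (`obj.is_a?(Wgit::Url) ? obj : new(obj)`). `SketchUrl` is a hypothetical stand-in, not the real `Wgit::Url`:

```ruby
# Hypothetical stand-in for Wgit::Url: a String subclass with a .parse helper.
class SketchUrl < String
  # Returns obj untouched when it is already a SketchUrl; otherwise wraps it.
  def self.parse(obj)
    obj.is_a?(SketchUrl) ? obj : new(obj)
  end
end

url = SketchUrl.new('http://example.com')

via_new   = SketchUrl.new(url)   # always allocates a fresh object
via_parse = SketchUrl.parse(url) # hands back the very same object

puts via_new.equal?(url)   # false
puts via_parse.equal?(url) # true
puts SketchUrl.parse('http://example.com').class # SketchUrl
```

With `parse`, calling a `to_url`-style helper on something that is already a URL object should not create a copy, which is what the changelog entry describes.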

data/lib/wgit/crawler.rb CHANGED

@@ -11,18 +11,26 @@ require 'typhoeus'
 module Wgit
   # The Crawler class provides a means of crawling web based HTTP Wgit::Url's,
   # serialising their HTML into Wgit::Document instances. This is the only Wgit
-  # class which contains network logic e.g. request/response handling.
+  # class which contains network logic e.g. HTTP request/response handling.
   class Crawler
     include Assertable
-    # The URL file extensions (from `<a>` hrefs) which will be crawled by
-    # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
-    # doesn't keep the crawl of the site going. All URL's without a file
-    # extension will be crawled, because they're assumed to be HTML.
-    SUPPORTED_FILE_EXTENSIONS = Set.new(
+    # Set of supported file extensions for Wgit::Crawler#crawl_site.
+    @supported_file_extensions = Set.new(
       %w[asp aspx cfm cgi htm html htmlx jsp php]
     )
+    class << self
+      # The URL file extensions (from `<a>` hrefs) which will be crawled by
+      # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
+      # doesn't keep the crawl of the site going. All URL's without a file
+      # extension will be crawled, because they're assumed to be HTML.
+      # The `#crawl` method will crawl anything since it's given the URL(s).
+      # You can add your own site's URL file extension e.g.
+      # `Wgit::Crawler.supported_file_extensions << 'html5'` etc.
+      attr_reader :supported_file_extensions
+    end
     # The amount of allowed redirects before raising an error. Set to 0 to
     # disable redirects completely; or you can pass `follow_redirects: false`
     # to any Wgit::Crawler.crawl_* method.
@@ -313,7 +321,9 @@ module Wgit
               .uniq
               .select do |link|
         ext = link.to_extension
-        ext ? SUPPORTED_FILE_EXTENSIONS.include?(ext.downcase) : true
+        ext ?
+          Wgit::Crawler.supported_file_extensions.include?(ext.downcase) :
+          true # URLs without an extension are assumed HTML.
       end
       return links if allow_paths.nil? && disallow_paths.nil?
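
The move from a constant to a class-level reader can be illustrated with a small, self-contained sketch. `SketchCrawler` and its `crawlable` method are hypothetical names, and `File.extname` is used here as a stand-in for `Wgit::Url#to_extension`; only the Set contents and the filtering rule come from the diff above:

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Crawler, reduced to the extension filter.
class SketchCrawler
  # Class-level instance variable: one shared, mutable Set.
  @supported_file_extensions = Set.new(
    %w[asp aspx cfm cgi htm html htmlx jsp php]
  )

  class << self
    attr_reader :supported_file_extensions
  end

  # Keeps links whose extension is supported, plus links with no extension
  # (which are assumed to be HTML).
  def self.crawlable(links)
    links.select do |link|
      ext = File.extname(link).delete_prefix('.') # stand-in for to_extension
      ext.empty? ? true : supported_file_extensions.include?(ext.downcase)
    end
  end
end

links = %w[/about /index.HTML /logo.png /page.html5]
p SketchCrawler.crawlable(links) # ["/about", "/index.HTML"]

SketchCrawler.supported_file_extensions << 'html5'
p SketchCrawler.crawlable(links) # ["/about", "/index.HTML", "/page.html5"]
```

Because `attr_reader` exposes the Set itself, callers can add to it without reassigning anything, which is presumably why no writer is defined.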

data/lib/wgit/document.rb CHANGED

@@ -20,14 +20,26 @@ module Wgit
     # Regex for the allowed var names when defining an extension.
     REGEX_EXTENSION_NAME = /[a-z0-9_]+/.freeze
-    # The xpath used to extract the visible text on a page.
-    TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
+    # Set of text elements used to build Document#text.
+    @text_elements = Set.new(%i[
+      a abbr address article aside b bdi bdo blockquote button caption cite
+      code data dd del details dfn div dl dt em figcaption figure footer h1 h2
+      h3 h4 h5 h6 header hr i input ins kbd label legend li main mark meter ol
+      option output p pre q rb rt ruby s samp section small span strong sub
+      summary sup td textarea th time u ul var wbr
+    ])
     # Set of Symbols representing the defined Document extensions.
     @extensions = Set.new
     class << self
-      # Class level attr_reader for the Document defined extensions.
+      # Set of HTML elements that make up the visible text on a page. These
+      # elements are used to initialize the Wgit::Document#text. See the
+      # README.md for how to add to this Set dynamically.
+      attr_reader :text_elements
+      # Set of Symbols representing the defined Document extensions. Is
+      # read-only. Use Wgit::Document.define_extension for a new extension.
       attr_reader :extensions
     end
@@ -72,6 +84,23 @@ module Wgit
     ### Document Class Methods ###
+    # Uses Document.text_elements to build an xpath String, used to obtain
+    # all of the combined visual text on a webpage.
+    #
+    # @return [String] An xpath String to obtain a webpage's text elements.
+    def self.text_elements_xpath
+      xpath = ''
+      return xpath if Wgit::Document.text_elements.empty?
+      el_xpath = '//%s/text()'
+      Wgit::Document.text_elements.each_with_index do |el, i|
+        xpath += ' | ' unless i.zero?
+        xpath += format(el_xpath, el)
+      end
+      xpath
+    end
     # Defines an extension, which is a way to serialise HTML elements into
     # instance variables upon Document initialization. See the default
     # extensions defined in 'document_extensions.rb' as examples.
@@ -82,11 +111,12 @@ module Wgit
     # the xpath or database object result(s).
     #
     # When initialising from HTML, a singleton value of true will only
-    # ever return one result; otherwise all xpath results are returned in an
-    # Array. When initialising from a database object, the value is taken as
-    # is and singleton is only used to define the default empty value.
-    # If a value cannot be found (in either the HTML or database object), then
-    # a default will be used. The default value is: `singleton ? nil : []`.
+    # ever return the first result found; otherwise all the results are
+    # returned in an Array. When initialising from a database object, the value
+    # is taken as is and singleton is only used to define the default empty
+    # value. If a value cannot be found (in either the HTML or database
+    # object), then a default will be used. The default value is:
+    # `singleton ? nil : []`.
     #
     # @param var [Symbol] The name of the variable to be initialised.
     # @param xpath [String, #call] The xpath used to find the element(s)
@@ -105,14 +135,16 @@ module Wgit
     # @option opts [Boolean] :text_content_only The text_content_only option
     #   if true will use the text content of the Nokogiri result object,
     #   otherwise the Nokogiri object itself is returned. Defaults to true.
-    # @yieldparam value [Object] The value to be assigned to the new var.
-    # @yieldparam source [Wgit::Document, Object] The source of the value.
-    # @yieldparam type [Symbol] The source type, either :document or (DB)
-    #   :object.
-    # @yieldreturn [Object] The return value of the block becomes the new var
-    #   value, unless nil. Return nil if you want to inspect but not change the
-    #   var value. The block is executed when a Wgit::Document is initialized,
-    #   regardless of the source.
+    # @yield The block is executed when a Wgit::Document is initialized,
+    #   regardless of the source. Use it (optionally) to process the result
+    #   value.
+    # @yieldparam value [Object] The result value to be assigned to the new
+    #   `var`.
+    # @yieldparam source [Wgit::Document, Object] The source of the `value`.
+    # @yieldparam type [Symbol] The `source` type, either `:document` or (DB)
+    #   `:object`.
+    # @yieldreturn [Object] The return value of the block becomes the new var's
+    #   value. Return the block's value param unchanged if you want to inspect.
     # @raise [StandardError] If the var param isn't valid.
     # @return [Symbol] The given var Symbol if successful.
     def self.define_extension(var, xpath, opts = {}, &block)
@@ -143,12 +175,12 @@ module Wgit
       var
     end
-    # Removes the init_* methods created when an extension is defined.
-    # Therefore, this is the opposing method to Document.define_extension.
+    # Removes the `init_*` methods created when an extension is defined.
+    # Therefore, this is the opposing method to `Document.define_extension`.
     # Returns true if successful or false if the method(s) cannot be found.
     #
     # @param var [Symbol] The extension variable already defined.
-    # @return [Boolean] True if the extension var was found and removed;
+    # @return [Boolean] True if the extension `var` was found and removed;
     #   otherwise false.
     def self.remove_extension(var)
       Document.send(:remove_method, "init_#{var}_from_html")
@@ -173,7 +205,7 @@ module Wgit
       (@url == other.url) && (@html == other.html)
     end
-    # Is a shortcut for calling Document#html[range].
+    # Shortcut for calling Document#html[range].
     #
     # @param range [Range] The range of @html to return.
     # @return [String] The given range of @html.
@@ -262,8 +294,8 @@ module Wgit
       instance_variables.each do |var|
         # Add up the total bytes of text as well as the length.
         if var == :@text
-          hash[:text_snippets] = @text.length
-          hash[:text_bytes]    = @text.sum(&:length)
+          hash[:text]       = @text.length
+          hash[:text_bytes] = @text.sum(&:length)
         # Else take the var's #length method return value.
         else
           next unless instance_variable_get(var).respond_to?(:length)
@@ -309,8 +341,8 @@ module Wgit
       @doc.css(selector)
     end
-    # Returns all internal links from this Document in relative form. Internal
-    # meaning a link to another document on the same host.
+    # Returns all unique internal links from this Document in relative form.
+    # Internal meaning a link to another document on the same host.
     #
     # This Document's host is used to determine if an absolute URL is actually
     # a relative link e.g. For a Document representing
@@ -319,7 +351,7 @@ module Wgit
     # as an internal link because both Documents live on the same host. Also
     # see Wgit::Document#internal_absolute_links.
     #
-    # @return [Array<Wgit::Url>] Self's internal Url's in relative form.
+    # @return [Array<Wgit::Url>] Self's unique internal Url's in relative form.
     def internal_links
       return [] if @links.empty?
@@ -333,19 +365,19 @@ module Wgit
       Wgit::Utils.process_arr(links)
     end
-    # Returns all internal links from this Document in absolute form by
+    # Returns all unique internal links from this Document in absolute form by
     # appending them to self's #base_url. Also see
     # Wgit::Document#internal_links.
     #
-    # @return [Array<Wgit::Url>] Self's internal Url's in absolute form.
+    # @return [Array<Wgit::Url>] Self's unique internal Url's in absolute form.
     def internal_absolute_links
       internal_links.map { |link| link.prefix_base(self) }
     end
-    # Returns all external links from this Document in absolute form. External
-    # meaning a link to a different host.
+    # Returns all unique external links from this Document in absolute form.
+    # External meaning a link to a different host.
     #
-    # @return [Array<Wgit::Url>] Self's external Url's in absolute form.
+    # @return [Array<Wgit::Url>] Self's unique external Url's in absolute form.
     def external_links
       return [] if @links.empty?
@@ -437,19 +469,16 @@ module Wgit
     # Override this method to custom configure the Nokogiri object returned.
     # Gets called from Wgit::Document.new upon initialization.
     #
+    # @yield [config] The given block is passed to Nokogiri::HTML for initialisation.
     # @raise [StandardError] If @html isn't set.
     # @return [Nokogiri::HTML] The initialised Nokogiri HTML object.
-    def init_nokogiri
+    def init_nokogiri(&block)
       raise '@html must be set' unless @html
-      Nokogiri::HTML(@html) do |config|
-        # TODO: Remove #'s below when crawling in production.
-        # config.options = Nokogiri::XML::ParseOptions::STRICT |
-        #                 Nokogiri::XML::ParseOptions::NONET
-      end
+      Nokogiri::HTML(@html, &block)
     end
-    # Returns a value/object from this Document's @html using the given xpath
+    # Extracts a value/object from this Document's @html using the given xpath
     # parameter.
     #
     # @param xpath [String] Used to find the value/object in @html.
@@ -457,10 +486,15 @@ module Wgit
     #   Object) : results (Array).
     # @param text_content_only [Boolean] text_content_only ? result.content
     #   (String) : result (Nokogiri Object).
-    # @yield [value, source] Given the value (String/Object) before it's set as
-    #   an instance variable so that you can inspect/alter the value if
-    #   desired. Return nil from the block if you don't want to override the
-    #   value. Also given the source (Symbol) which is always :document.
+    # @yield The block is executed when a Wgit::Document is initialized,
+    #   regardless of the source. Use it (optionally) to process the result
+    #   value.
+    # @yieldparam value [Object] The result value to be returned.
+    # @yieldparam source [Wgit::Document, Object] The source of the `value`.
+    # @yieldparam type [Symbol] The `source` type, either `:document` or (DB)
+    #   `:object`.
+    # @yieldreturn [Object] The return value of the block gets returned. Return
+    #   the block's `value` param unchanged if you simply want to inspect it.
     # @return [String, Object] The value found in the html or the default value
     #   (singleton ? nil : []).
     def find_in_html(xpath, singleton: true, text_content_only: true)
@@ -478,23 +512,25 @@ module Wgit
       singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)
-      if block_given?
-        new_result = yield(result, self, :document)
-        result = new_result unless new_result.nil?
-      end
+      result = yield(result, self, :document) if block_given?
       result
     end
-    # Returns a value from the obj using the given key via obj#fetch.
+    # Returns a value from the obj using the given key via `obj#fetch`.
     #
     # @param obj [#fetch] The object containing the key/value.
     # @param key [String] Used to find the value in the obj.
     # @param singleton [Boolean] True if a single value, false otherwise.
-    # @yield [value, source] Given the value (String/Object) before it's set as
-    #   an instance variable so that you can inspect/alter the value if
-    #   desired. Return nil from the block if you don't want to override the
-    #   value. Also given the source (Symbol) which is always :object.
+    # @yield The block is executed when a Wgit::Document is initialized,
+    #   regardless of the source. Use it (optionally) to process the result
+    #   value.
+    # @yieldparam value [Object] The result value to be returned.
+    # @yieldparam source [Wgit::Document, Object] The source of the `value`.
+    # @yieldparam type [Symbol] The `source` type, either `:document` or (DB)
+    #   `:object`.
+    # @yieldreturn [Object] The return value of the block gets returned. Return
+    #   the block's `value` param unchanged if you simply want to inspect it.
     # @return [String, Object] The value found in the obj or the default value
     #   (singleton ? nil : []).
     def find_in_object(obj, key, singleton: true)
@@ -505,10 +541,7 @@ module Wgit
       singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)
-      if block_given?
-        new_result = yield(result, obj, :object)
-        result = new_result unless new_result.nil?
-      end
+      result = yield(result, obj, :object) if block_given?
       result
     end
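
The breaking change to the extension block's return value is easiest to see side by side. The sketch below copies only the block-handling lines from the removed and added hunks of `find_in_html`/`find_in_object`; the method names `old_find` and `new_find` are hypothetical:

```ruby
# 0.7.0 behaviour: a nil block result leaves the extracted value untouched.
def old_find(result)
  if block_given?
    new_result = yield(result)
    result = new_result unless new_result.nil?
  end
  result
end

# 0.8.0 behaviour: whatever the block returns becomes the value, nil included.
def new_find(result)
  result = yield(result) if block_given?
  result
end

seen = []
inspect_only = proc { |value| seen << value; nil } # inspects, then returns nil

p old_find('My table', &inspect_only) # "My table"
p new_find('My table', &inspect_only) # nil

# Under 0.8.0 an inspect-only block has to return its argument.
p new_find('My table') { |value| seen << value; value } # "My table"
```

Any existing extension block that relied on returning `nil` to mean "leave the value alone" would need to return its `value` param instead after upgrading.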

data/lib/wgit/document_extensions.rb CHANGED

@@ -9,7 +9,7 @@ Wgit::Document.define_extension(
   singleton: true,
   text_content_only: true
 ) do |base|
-  Wgit::Url.new(base) if base
+  Wgit::Url.parse_or_nil(base) if base
 end
 # Title.
@@ -20,6 +20,14 @@ Wgit::Document.define_extension(
   text_content_only: true
 )
+# Description.
+Wgit::Document.define_extension(
+  :description,
+  '//meta[@name="description"]/@content',
+  singleton: true,
+  text_content_only: true
+)
 # Author.
 Wgit::Document.define_extension(
   :author,
@@ -49,13 +57,15 @@ Wgit::Document.define_extension(
   singleton: false,
   text_content_only: true
 ) do |links|
-  links.map! { |link| Wgit::Url.new(link) }
+  links
+    .map { |link| Wgit::Url.parse_or_nil(link) }
+    .compact # Remove unparsable links.
 end
 # Text.
 Wgit::Document.define_extension(
   :text,
-  Wgit::Document::TEXT_ELEMENTS_XPATH,
+  proc { Wgit::Document.text_elements_xpath },
   singleton: false,
   text_content_only: true
 )
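
The `:text` extension now passes a `proc` rather than a String constant. A rough sketch of the likely motivation, assuming the proc is called each time a Document is initialised: the xpath is rebuilt from the current contents of the Set, so later additions are picked up. The shortened element list and the local names are illustrative only, and `map`/`join` is used in place of the loop in `text_elements_xpath`:

```ruby
require 'set'

# Hypothetical, shortened stand-in for Wgit::Document.text_elements.
text_elements = Set.new(%i[p span])

# Builds the union xpath, one '//el/text()' term per element.
build_xpath = proc do
  text_elements.map { |el| format('//%s/text()', el) }.join(' | ')
end

eager_xpath = build_xpath.call # fixed at this point, as a constant would be
text_elements << :table

puts eager_xpath      # //p/text() | //span/text()
puts build_xpath.call # //p/text() | //span/text() | //table/text()
```

An empty Set yields an empty String here, matching the early return in the method added to `document.rb`.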

data/lib/wgit/response.rb CHANGED

@@ -131,11 +131,11 @@ module Wgit
       @status.positive?
     end
-    alias code       status
-    alias content    body
-    alias crawl_time total_time
-    alias to_s       body
-    alias redirects  redirections
-    alias length     size
+    alias code           status
+    alias content        body
+    alias crawl_duration total_time
+    alias to_s           body
+    alias redirects      redirections
+    alias length         size
   end
 end

data/lib/wgit/url.rb CHANGED

@@ -90,6 +90,23 @@ module Wgit
       obj.is_a?(Wgit::Url) ? obj : new(obj)
     end
+    # Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot
+    # be parsed successfully e.g. the String is invalid.
+    #
+    # Use this method when you can't guarantee that obj is parsable as a URL.
+    # See Wgit::Url.parse for more information.
+    #
+    # @param obj [Object] The object to parse, which #is_a?(String).
+    # @raise [StandardError] If obj.is_a?(String) is false.
+    # @return [Wgit::Url, nil] A Wgit::Url instance or nil (if obj is invalid).
+    def self.parse_or_nil(obj)
+      parse(obj)
+    rescue Addressable::URI::InvalidURIError
+      Wgit.logger.debug("Wgit::Url.parse_or_nil('#{obj}') exception: \
+Addressable::URI::InvalidURIError")
+      nil
+    end
     # Sets the @crawled instance var, also setting @date_crawled for
     # convenience.
     #
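
The rescue-and-return-nil pattern in `parse_or_nil`, combined with the `.compact` call in the `links` extension, can be sketched without the gem. This version is an approximation that uses the standard library's `URI` and `URI::InvalidURIError` in place of Addressable; the top-level `parse_or_nil` method and the sample hrefs are hypothetical:

```ruby
require 'uri'

# Hypothetical stand-in for Wgit::Url.parse_or_nil, using stdlib URI instead
# of Addressable (the real method rescues Addressable::URI::InvalidURIError).
def parse_or_nil(str)
  URI.parse(str)
rescue URI::InvalidURIError
  nil # swallow the error so one bad href doesn't abort the whole page
end

hrefs = ['http://example.com/about', 'http://exa mple.com', '/contact']

# Mirrors the links extension: parse each href, then drop the failures.
urls = hrefs.map { |href| parse_or_nil(href) }.compact

p urls.map(&:to_s) # ["http://example.com/about", "/contact"]
```

Note that the two parsers do not necessarily reject the same inputs, so this only illustrates the control flow, not which links Wgit would discard.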

data/lib/wgit/version.rb CHANGED

@@ -5,7 +5,7 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = '0.7.0'
+  VERSION = '0.8.0'
   # Returns the current gem version of Wgit as a String.
   def self.version

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-01-04 00:00:00.000000000 Z
+date: 2020-01-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable