RubyGems - wgit - Versions diffs - 0.0.12 → 0.0.13 - Mend

wgit 0.0.12 → 0.0.13

Files changed (11)

checksums.yaml +4 -4
data/README.md +28 -22
data/TODO.txt +2 -1
data/lib/wgit.rb +1 -0
data/lib/wgit/crawler.rb +30 -9
data/lib/wgit/document.rb +70 -153
data/lib/wgit/document_extensions.rb +57 -0
data/lib/wgit/url.rb +80 -24
data/lib/wgit/utils.rb +36 -7
data/lib/wgit/version.rb +1 -1
metadata +5 -4

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 542c84509de63fe870f23c81da2c377a03638cee9fe1950beb7135698bd9175d
-  data.tar.gz: b2131ef8b7beb32ba8bbb9eb1605e700c34d98210a126c2e24b23adafe35171f
+  metadata.gz: b8f6c1946b739327ba5b52a5541aa1496e2aea53c23e925bbb5e5fe7c063ddcc
+  data.tar.gz: 1bcf5dd5e41711758fdc737afba0707319de25cb524cfe08c14a50efb9a5f3e0
 SHA512:
-  metadata.gz: f3926c76974da55cf8bb179fead9b19805f6c15154925d72e76dc0b14cc3c17a14e64528c71bdcde33841ff56e2dd19d68b5d4a6b894991b08c142d1baac23fb
-  data.tar.gz: dcc9d04e67d7a2b93c8cb89be5610fa71f2c446e06bec99a099e0fe8a45a9e48db0aa2e2482f20a2ca00c8637cc6158e9be5720877190386249421e52dc02bc6
+  metadata.gz: ad42cd8e392894a21a1c4497bed9d1efabbc8b2320825ab7a3fc7e0615157f28f262bcaa4a4910411ac7662b4b416848b97b68c264f2ce937be34cfa6a56a34c
+  data.tar.gz: fbaf8d6a48f996ce2c0b3cb4ee72a26cfede2368f3992e110c6f049e84b9ba6a2568aac0f192559ae10127db5d3cb0bda5dbbe508247ae5f918f3fc400528e75

data/README.md CHANGED

@@ -1,10 +1,10 @@
 # Wgit
-Wgit is a Ruby gem similar in nature to GNU's `wget`. It provides an easy to use API for programmatic web scraping, indexing and searching.
+Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an easy to use API for programmatic web scraping, indexing and searching.
-Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves and serialises their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or images for example. As Wgit is a library, it has uses in many different application types.
+Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves and serialises their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it has uses in many different application types.
-Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine built using Wgit and Sinatra, deployed to Heroku.
+Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
 ## Table Of Contents
@@ -51,20 +51,21 @@ doc = crawler.crawl url
 doc.class # => Wgit::Document
 doc.stats # => {
 # :url=>44, :html=>28133, :title=>17, :keywords=>0,
-# :links=>35, :text_length=>67, :text_bytes=>13735
+# :links=>35, :text_snippets=>67, :text_bytes=>13735
 #}
 # doc responds to the following methods:
 Wgit::Document.instance_methods(false).sort # => [
-# :==, :[], :author, :css, :date_crawled, :doc, :empty?, :external_links,
-# :external_urls, :html, :internal_full_links, :internal_links, :keywords,
-# :links, :relative_full_links, :relative_full_urls, :relative_links,
-# :relative_urls, :score, :search, :search!, :size, :stats, :text, :title,
-# :to_h, :to_hash, :to_json, :url, :xpath
+# :==, :[], :author, :css, :date_crawled, :doc, :empty?, :external_links,
+# :external_urls, :html, :internal_full_links, :internal_links,
+# :internal_links_without_anchors, :keywords, :links, :relative_full_links,
+# :relative_full_urls, :relative_links, :relative_urls, :score, :search,
+# :search!, :size, :stats, :text, :title, :to_h, :to_hash, :to_json, :url,
+# :xpath
 #]
 results = doc.search "corruption"
-results.first # => "ial materials involving war, spying and corruption.
+results.first # => "ial materials involving war, spying and corruption.
               #     It has so far published more"
 ```
@@ -120,8 +121,8 @@ my_pages_keywords = ["Everest", "mountaineering school", "adventure"]
 my_pages_missing_keywords = []
 competitor_urls = [
-  "http://altitudejunkies.com",
-  "http://www.mountainmadness.com",
+  "http://altitudejunkies.com",
+  "http://www.mountainmadness.com",
   "http://www.adventureconsultants.com"
 ]
@@ -188,7 +189,7 @@ require 'wgit/core_ext' # => Provides the String#to_url and Enumerable#to_urls m
 # Here we create our own document rather than crawling the web.
 # We pass the web page's URL and HTML Strings.
 doc = Wgit::Document.new(
-  "http://test-url.com".to_url,
+  "http://test-url.com".to_url,
   "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
 )
@@ -216,7 +217,7 @@ doc.search(query).first # => "How now brown cow."
 db.insert doc.external_links
-urls_to_crawl = db.uncrawled_urls # => Results will include doc.external_links.
+urls_to_crawl = db.uncrawled_urls # => Results will include doc.external_links.
 ```
 ## Extending The API
@@ -247,7 +248,7 @@ Wgit::Document.text_elements << :a
 # Our Document has a link whose's text we're interested in.
 doc = Wgit::Document.new(
-  "http://some_url.com".to_url,
+  "http://some_url.com".to_url,
   "<html><p>Hello world!</p>\
 <a href='https://made-up-link.com'>Click this link.</a></html>"
 )
@@ -258,13 +259,13 @@ doc.text # => ["Hello world!", "Click this link."]
 **Note**: This only works for textual page content. For more control over the indexed elements themselves, see below.
-### 2. Defining Custom Indexers/Elements a.k.a Virtual Attributes
+### 2. Defining Custom Indexers Via Document Extensions
 If you want full control over the elements being indexed for your own purposes, then you can define a custom indexer for each type of element that you're interested in.
 Once you have the indexed page element, accessed via a `Wgit::Document` instance method, you can do with it as you wish e.g. obtain it's text value or manipulate the element etc. Since the returned types are plain [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) objects, you have the full control that the Nokogiri gem gives you.
-Here's how to add a custom indexer for a specific page element:
+Here's how to add a Document extension to index a specific page element:
 ```ruby
 require 'wgit'
@@ -283,7 +284,7 @@ end
 # Our Document has a table which we're interested in.
 doc = Wgit::Document.new(
-  "http://some_url.com".to_url,
+  "http://some_url.com".to_url,
   "<html><p>Hello world!</p>\
 <table><th>Header Text</th><th>Another Header</th></table></html>"
 )
@@ -296,16 +297,19 @@ tables.class        # => Nokogiri::XML::NodeSet
 tables.first.class  # => Nokogiri::XML::Element
 ```
+**Note**: Wgit uses Document extensions to provide much of it's core functionality, providing access to a webpages text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
 **Extension Notes**:
-- Any links should be mapped into `Wgit::Url` objects; Url's are treated as Strings when being inserted into the database.
-- Any object (like a Nokogiri object) will not be inserted into the database, its up to you to map each object into a native type e.g. `Boolean, Array` etc.
+- Any page links should be mapped into `Wgit::Url` objects; Url's are treated as Strings when being inserted into the database.
+- Any object (like a Nokogiri object) will not be inserted into the database, it's up to you to map each object into a primitive type e.g. `Boolean, Array` etc.
 ## Caveats
 Below are some points to keep in mind when using Wgit:
-- All Url's must be prefixed with an appropiate protocol e.g. `https://`
+- All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://`
+- By default, up to 5 URL redirects will be followed; this is configurable however.
 ## Executable
@@ -317,11 +321,13 @@ This executable will be very similar in nature to `./bin/console` which is curre
 ## Development
+The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file.
 For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
 After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive REPL that will allow you to experiment with the code.
-To generate code documentation run `bundle exec yarddoc`. To browse the generated documentation run `bundle exec yard server -r`.
+To generate code documentation run `bundle exec yard doc`. To browse the generated documentation run `bundle exec yard server -r`.
 To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.

data/TODO.txt CHANGED

@@ -8,7 +8,8 @@ Primary
 Secondary
 ---------
-- Setup a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some ruby sites like ruby.org and update the README to include an example search query e.g. "ruby" etc.
+- Setup a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some ruby sites like ruby.org etc.
+- Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
 - Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
 - Check if Document::TEXT_ELEMENTS is expansive enough.
 - Possibly use refine instead of core-ext?

data/lib/wgit.rb CHANGED

@@ -4,6 +4,7 @@ require_relative 'wgit/assertable'
 require_relative 'wgit/utils'
 require_relative 'wgit/url'
 require_relative 'wgit/document'
+require_relative 'wgit/document_extensions'
 require_relative 'wgit/crawler'
 require_relative 'wgit/database/connection_details'
 require_relative 'wgit/database/model'

data/lib/wgit/crawler.rb CHANGED

@@ -2,7 +2,7 @@ require_relative 'url'
 require_relative 'document'
 require_relative 'utils'
 require_relative 'assertable'
-require 'net/http' # requires 'uri'
+require 'net/http' # Requires 'uri'.
 module Wgit
@@ -11,6 +11,15 @@ module Wgit
   class Crawler
     include Assertable
+    # The default maximum amount of allowed URL redirects.
+    @default_redirect_limit = 5
+    class << self
+      # Class level instance accessor methods for @default_redirect_limit.
+      # Call using Wgit::Crawler.default_redirect_limit etc.
+      attr_accessor :default_redirect_limit
+    end
     # The urls to crawl.
     attr_reader :urls
@@ -67,7 +76,7 @@ module Wgit
     # Crawls individual urls, not entire sites.
     #
     # @param urls [Array<Wgit::Url>] The URLs to crawl.
-    # @yield [doc] If provided, the block is given each crawled
+    # @yield [Wgit::Document] If provided, the block is given each crawled
     #   Document. Otherwise each doc is added to @docs which can be accessed
     #   by Crawler#docs after this method returns.
     # @return [Wgit::Document] The last Document crawled.
@@ -82,7 +91,7 @@ module Wgit
     # Crawl the url and return the response document or nil.
     #
     # @param url [Wgit::Document] The URL to crawl.
-    # @yield [doc] The crawled HTML Document regardless if the
+    # @yield [Wgit::Document] The crawled HTML Document regardless if the
     #   crawl was successful or not. Therefore, the Document#url can be used.
     # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
     #   crawl was unsuccessful.
@@ -95,10 +104,11 @@ module Wgit
       doc.empty? ? nil : doc
     end
-    # Crawls an entire site by recursively going through its internal_links.
+    # Crawls an entire website's HTML pages by recursively going through
+    # its internal links. Each crawled web Document is yielded to a block.
     #
     # @param base_url [Wgit::Url] The base URL of the website to be crawled.
-    # @yield [doc] Given each crawled Document/page of the site.
+    # @yield [Wgit::Document] Given each crawled Document/page of the site.
     #   A block is the only way to interact with each crawled Document.
     # @return [Array<Wgit::Url>, nil] Unique Array of external urls collected
     #   from all of the site's pages or nil if the base_url could not be
@@ -112,7 +122,7 @@ module Wgit
       path = base_url.path.nil? ? '/' : base_url.path
       crawled_urls  = [path]
       external_urls = doc.external_links
-      internal_urls = doc.internal_links
+      internal_urls = get_internal_links(doc)
       return doc.external_links.uniq if internal_urls.empty?
@@ -126,7 +136,7 @@ module Wgit
           doc = crawl_url(Wgit::Url.concat(base_url.to_base, link), &block)
           crawled_urls << link
           next if doc.nil?
-          internal_urls.concat(doc.internal_links)
+          internal_urls.concat(get_internal_links(doc))
           external_urls.concat(doc.external_links)
         end
       end
@@ -158,14 +168,15 @@ module Wgit
       Wgit.logger.debug(
         "Wgit::Crawler#fetch('#{url}') exception: #{ex.message}"
       )
+      @last_response = nil
       nil
     end
     # The resolve method performs a HTTP GET to obtain the HTML document.
     # A certain amount of redirects will be followed by default before raising
-    # an exception. Redirects can be disabled by setting `redirect_limit: 1`.
+    # an exception. Redirects can be disabled by setting `redirect_limit: 0`.
     # The Net::HTTPResponse will be returned.
-    def resolve(url, redirect_limit: 5)
+    def resolve(url, redirect_limit: Wgit::Crawler.default_redirect_limit)
       redirect_count = -1
       begin
         raise "Too many redirects" if redirect_count >= redirect_limit
@@ -186,6 +197,16 @@ module Wgit
       @urls << Wgit::Url.new(url)
     end
+    # Pull out the doc's internal HTML page links for crawling.
+    def get_internal_links(doc)
+      doc.
+        internal_links_without_anchors.
+        reject do |link|
+          ext = link.to_extension
+          ext ? !['htm', 'html'].include?(ext) : false
+        end
+    end
     alias :crawl :crawl_urls
     alias :crawl_r :crawl_site
   end
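
The crawler.rb changes above move the redirect limit from a hard-coded keyword default (`redirect_limit: 5`) to a class-level instance variable exposed through `attr_accessor` on the singleton class. The following is a minimal sketch of that pattern only, not the gem's implementation: `SketchCrawler`, `follow` and the `hops` parameter are hypothetical names, and the HTTP requests are replaced by a simple counter.

```ruby
# Hypothetical stand-in for Wgit::Crawler; no HTTP requests are made.
class SketchCrawler
  # Class-level instance variable: it belongs to the class object itself and,
  # unlike an @@class_variable, is not shared with subclasses.
  @default_redirect_limit = 5

  class << self
    # Defines SketchCrawler.default_redirect_limit and its setter.
    attr_accessor :default_redirect_limit
  end

  # Simulates following `hops` consecutive redirects (an assumed input that
  # replaces real HTTP responses). The keyword default is evaluated on every
  # call, so later changes to the class-level value are picked up.
  def follow(hops, redirect_limit: SketchCrawler.default_redirect_limit)
    redirect_count = 0
    while redirect_count < hops
      raise "Too many redirects" if redirect_count >= redirect_limit
      redirect_count += 1
    end
    redirect_count
  end
end

crawler = SketchCrawler.new
crawler.follow(3)                      # => 3

# Lower the limit for all subsequent calls that rely on the default.
SketchCrawler.default_redirect_limit = 1
begin
  crawler.follow(3)
rescue RuntimeError => e
  e.message                            # => "Too many redirects"
end

# A limit of 0 still allows a request that is not redirected.
crawler.follow(0, redirect_limit: 0)   # => 0
```

Because the default is read at call time rather than at method definition time, a caller can reconfigure the limit once and have every later call honour it, while still overriding it per call with the keyword argument.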

data/lib/wgit/document.rb CHANGED

@@ -19,11 +19,17 @@ module Wgit
     # The HTML elements that make up the visible text on a page.
     # These elements are used to initialize the @text of the Document.
     # See the README.md for how to add to this Array dynamically.
-    @@text_elements = [
+    @text_elements = [
       :dd, :div, :dl, :dt, :figcaption, :figure, :hr, :li,
       :main, :ol, :p, :pre, :span, :ul, :h1, :h2, :h3, :h4, :h5
     ]
+    class << self
+      # Class level instance reader method for @text_elements.
+      # Call using Wgit::Document.text_elements.
+      attr_reader :text_elements
+    end
     # The URL of the webpage, an instance of Wgit::Url.
     attr_reader :url
@@ -157,7 +163,7 @@ module Wgit
         if var == :@text
           count = 0
           @text.each { |t| count += t.length }
-          hash[:text_length] = @text.length
+          hash[:text_snippets] = @text.length
           hash[:text_bytes] = count
         # Else take the var's #length method return value.
         else
@@ -202,8 +208,12 @@ module Wgit
       @doc.css(selector)
     end
-    # Get all internal links of this Document in relative form. Internal
-    # meaning a link to another page on this website. Also see
+    # Get all the internal links of this Document in relative form. Internal
+    # meaning a link to another document on this domain. This Document's domain
+    # is used to determine if an absolute URL is actually a relative link e.g.
+    # For a Document representing http://server.com/about, an absolute link of
+    # <a href='http://server.com/search'> will be recognized and returned as an
+    # internal link because both Documents live on the same domain. Also see
     # Wgit::Document#internal_full_links.
     #
     # @return [Array<Wgit::Url>] self's internal/relative URL's.
@@ -216,12 +226,28 @@ module Wgit
         rescue
           true
         end.
-        map(&:to_path_and_anchor)
+        map(&:without_base).
+        map do |link| # We map @url.to_host into / because it's a duplicate.
+          link.to_host == @url.to_host ? Wgit::Url.new('/') : link
+        end
-      process_arr(links)
+      Wgit::Utils.process_arr(links)
     end
-    # Get all internal links of this Document and append them to this
+    # Get all the internal links of this Document with their anchors removed
+    # (if present). Also see Wgit::Document#internal_links.
+    #
+    # @return [Array<Wgit::Url>] self's internal/relative URL's with their
+    #   anchors removed.
+    def internal_links_without_anchors
+      in_links = internal_links
+      return [] if in_links.empty?
+      in_links.
+        map(&:without_anchor).
+        reject(&:empty?)
+    end
+    # Get all the internal links of this Document and append them to this
     # Document's base URL making them absolute. Also see
     # Wgit::Document#internal_links.
     #
@@ -233,8 +259,8 @@ module Wgit
       in_links.map { |link| @url.to_base.concat(link) }
     end
-    # Get all external links of this Document. External meaning a link to
-    # another website.
+    # Get all the external links of this Document. External meaning a link to
+    # a different domain.
     #
     # @return [Array<Wgit::Url>] self's external/absolute URL's.
     def external_links
@@ -248,7 +274,7 @@ module Wgit
         end.
         map(&:without_trailing_slash)
-      process_arr(links)
+      Wgit::Utils.process_arr(links)
     end
     # Searches against the @text for the given search query.
@@ -305,12 +331,19 @@ module Wgit
     ### Document (Class) methods ###
-    # Returns Document.text_elements used to obtain the text in a webpage.
+    # Uses Document.text_elements to build an xpath String, used to obtain
+    # all of the combined text on a webpage.
     #
-    # @return [Array<Symbols>] The page elements containing visual text on a
-    #   webpage.
-    def self.text_elements
-      @@text_elements
+    # @return [String] An xpath String to obtain a webpage's text elements.
+    def self.text_elements_xpath
+      xpath = ""
+      return xpath if Wgit::Document.text_elements.empty?
+      el_xpath = "//%s/text()"
+      Wgit::Document.text_elements.each_with_index do |el, i|
+        xpath += " | " unless i == 0
+        xpath += el_xpath % [el]
+      end
+      xpath
     end
     # Initialises a private instance variable with the xpath or database object
@@ -326,7 +359,11 @@ module Wgit
     # effectively implements ORM like behavior using this class.
     #
     # @param var [Symbol] The name of the variable to be initialised.
-    # @param xpath [String] Used to find the element(s) of the webpage.
+    # @param xpath [String, Object#call] The xpath used to find the element(s)
+    #   of the webpage. Pass a callable object (proc etc.) if you want the
+    #   xpath value to be derived on Document initialisation (instead of when
+    #   the extension is defined). The call method must return a valid xpath
+    #   String.
     # @option options [Boolean] :singleton The singleton option determines
     #   whether or not the result(s) should be in an Array. If multiple
     #   results are found and singleton is true then the first result will be
@@ -334,10 +371,13 @@ module Wgit
     # @option options [Boolean] :text_content_only The text_content_only option
     #   if true will use the text content of the Nokogiri result object,
     #   otherwise the Nokogiri object itself is returned. Defaults to true.
-    # @yield [var_value] Gives the value about to be assigned to the new var.
+    # @yield [Object, Symbol] Yields the value about to be assigned to the new
+    #   var and the source of the value (either :html or :object aka database).
     #   The return value of the block becomes the new var value, unless nil.
-    #   Return nil if you want to inspect but not change the var value.
-    # @return [Symbol] The first half of the newly created method names e.g.
+    #   Return nil if you want to inspect but not change the var value. The
+    #   block gets executed when a Document is initialized from html or an
+    #   object.
+    # @return [Symbol] The first half of the newly defined method names e.g.
     #   if var == "title" then :init_title is returned.
     def self.define_extension(var, xpath, options = {}, &block)
       default_options = { singleton: true, text_content_only: true }
@@ -389,6 +429,12 @@ module Wgit
       end
     end
+    # Ensure the @url and @html Strings are correctly encoded etc.
+    def process_url_and_html
+      @url = Wgit::Utils.process_str(@url)
+      @html = Wgit::Utils.process_str(@html)
+    end
     # Returns an object/value from this Document's @html using the provided
     # xpath param.
     # singleton ? results.first (single Object) : results (Array)
@@ -396,6 +442,7 @@ module Wgit
     # A block can be used to set the final value before it is returned.
     # Return nil from the block if you don't want to override the value.
     def find_in_html(xpath, singleton: true, text_content_only: true)
+      xpath = xpath.call if xpath.respond_to?(:call)
       results = @doc.xpath(xpath)
       if results and not results.empty?
@@ -408,10 +455,10 @@ module Wgit
         result = singleton ? nil : []
       end
-      singleton ? process_str(result) : process_arr(result)
+      singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)
       if block_given?
-        new_result = yield(result)
+        new_result = yield(result, :html)
         result = new_result if new_result
       end
@@ -427,10 +474,10 @@ module Wgit
       default = singleton ? nil : []
       result = obj.fetch(key.to_s, default)
-      singleton ? process_str(result) : process_arr(result)
+      singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)
       if block_given?
-        new_result = yield(result)
+        new_result = yield(result, :object)
         result = new_result if new_result
       end
@@ -454,136 +501,6 @@ module Wgit
       end
     end
-    # Takes Docuent.text_elements and returns an xpath String used to obtain
-    # all of the combined text.
-    def text_elements_xpath
-      xpath = ""
-      return xpath if @@text_elements.empty?
-      el_xpath = "//%s/text()"
-      @@text_elements.each_with_index do |el, i|
-        xpath += " | " unless i == 0
-        xpath += el_xpath % [el]
-      end
-      xpath
-    end
-    # Processes a String to make it uniform.
-    def process_str(str)
-      if str.is_a?(String)
-        str.encode!('UTF-8', 'UTF-8', invalid: :replace)
-        str.strip!
-      end
-      str
-    end
-    # Processes an Array to make it uniform.
-    def process_arr(array)
-      if array.is_a?(Array)
-        array.map! { |str| process_str(str) }
-        array.reject! { |str| str.is_a?(String) ? str.empty? : false }
-        array.compact!
-        array.uniq!
-      end
-      array
-    end
-    # Ensure the @url and @html Strings are correctly encoded etc.
-    def process_url_and_html
-      @url = process_str(@url)
-      @html = process_str(@html)
-    end
-    ### Default init_* (Document extension) methods. ###
-    # Init methods for title.
-    def init_title_from_html
-      xpath = "//title"
-      result = find_in_html(xpath)
-      init_var(:@title, result)
-    end
-    def init_title_from_object(obj)
-      result = find_in_object(obj, "title")
-      init_var(:@title, result)
-    end
-    # Init methods for author.
-    def init_author_from_html
-      xpath = "//meta[@name='author']/@content"
-      result = find_in_html(xpath)
-      init_var(:@author, result)
-    end
-    def init_author_from_object(obj)
-      result = find_in_object(obj, "author")
-      init_var(:@author, result)
-    end
-    # Init methods for keywords.
-    def init_keywords_from_html
-      xpath = "//meta[@name='keywords']/@content"
-      result = find_in_html(xpath) do |keywords|
-        if keywords
-          keywords = keywords.split(",")
-          process_arr(keywords)
-        end
-        keywords
-      end
-      init_var(:@keywords, result)
-    end
-    def init_keywords_from_object(obj)
-      result = find_in_object(obj, "keywords", singleton: false)
-      init_var(:@keywords, result)
-    end
-    # Init methods for links.
-    def init_links_from_html
-      # Any element with a href or src attribute is considered a link.
-      xpath = '//*/@href | //*/@src'
-      result = find_in_html(xpath, singleton: false) do |links|
-        if links
-          links.map! do |link|
-            begin
-              Wgit::Url.new(link)
-            rescue
-              nil
-            end
-          end
-          links.compact!
-        end
-        links
-      end
-      init_var(:@links, result)
-    end
-    def init_links_from_object(obj)
-      result = find_in_object(obj, "links", singleton: false) do |links|
-        if links
-          links.map! { |link| Wgit::Url.new(link) }
-        end
-        links
-      end
-      init_var(:@links, result)
-    end
-    # Init methods for text.
-    def init_text_from_html
-      xpath = text_elements_xpath
-      result = find_in_html(xpath, singleton: false)
-      init_var(:@text, result)
-    end
-    def init_text_from_object(obj)
-      result = find_in_object(obj, "text", singleton: false)
-      init_var(:@text, result)
-    end
     alias :to_hash :to_h
     alias :relative_links :internal_links
     alias :relative_urls :internal_links

data/lib/wgit/document_extensions.rb ADDED

@@ -0,0 +1,57 @@
+### Default Document Extensions ###
+# Title.
+Wgit::Document.define_extension(
+  :title,
+  '//title',
+  singleton: true,
+  text_content_only: true,
+)
+# Author.
+Wgit::Document.define_extension(
+  :author,
+  '//meta[@name="author"]/@content',
+  singleton: true,
+  text_content_only: true,
+)
+# Keywords.
+Wgit::Document.define_extension(
+  :keywords,
+  '//meta[@name="keywords"]/@content',
+  singleton: true,
+  text_content_only: true,
+) do |keywords, source|
+  if keywords and source == :html
+    keywords = keywords.split(',')
+    Wgit::Utils.process_arr(keywords)
+  end
+  keywords
+end
+# Links.
+Wgit::Document.define_extension(
+  :links,
+  '//a/@href',
+  singleton: false,
+  text_content_only: true,
+) do |links|
+  if links
+    links.map! do |link|
+      Wgit::Url.new(link)
+    rescue
+      nil
+    end
+    links.compact!
+  end
+  links
+end
+# Text.
+Wgit::Document.define_extension(
+  :text,
+  proc { Wgit::Document.text_elements_xpath },
+  singleton: false,
+  text_content_only: true,
+)
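
The extension blocks above receive two values: the value about to be assigned and its source (`:html` or `:object`). Per the `define_extension` documentation in document.rb, the block's return value replaces the value unless it is nil. The following sketch shows that contract in isolation, assuming a hypothetical helper named `finalise_value`; the `strip`/`reject`/`uniq` chain only approximates what `Wgit::Utils.process_arr` does.

```ruby
# Hypothetical helper mirroring the yield contract: the block receives the
# value and its source, and a nil return leaves the value unchanged.
def finalise_value(value, source)
  if block_given?
    new_value = yield(value, source)
    value = new_value if new_value
  end
  value
end

# Comparable to the keywords extension: only a raw HTML String needs
# splitting, a value loaded from a database object is already an Array.
keywords_block = proc do |keywords, source|
  if keywords && source == :html
    keywords = keywords.split(',').map(&:strip).reject(&:empty?).uniq
  end
  keywords
end

finalise_value("ruby, crawler, ruby", :html, &keywords_block)
# => ["ruby", "crawler"]
finalise_value(["ruby", "crawler"], :object, &keywords_block)
# => ["ruby", "crawler"]

# A block may ignore the source argument (as the links extension does),
# and returning nil inspects the value without changing it.
finalise_value("title", :html) { |value| value.upcase }  # => "TITLE"
finalise_value("title", :html) { |_value| nil }          # => "title"
```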

data/lib/wgit/url.rb CHANGED

@@ -107,15 +107,19 @@ module Wgit
     # @raise [RuntimeError] If the link is invalid.
     def self.relative_link?(link, base: nil)
       raise "Invalid link: #{link}" if link.nil? or link.empty?
-      if base and URI(base).host.nil?
-        raise "Invalid base, must contain protocol prefix: #{base}"
+      link = Wgit::Url.new(link)
+      if base
+        base = Wgit::Url.new(base)
+        if base.to_scheme.nil?
+          raise "Invalid base, must contain protocol prefix: #{base}"
+        end
       end
-      uri = URI(link)
-      if uri.relative?
+      if link.to_uri.relative?
         true
       else
-        base ? uri.host == URI(base).host : false
+        base ? link.to_host == base.to_host : false
       end
     end
@@ -125,11 +129,10 @@ module Wgit
     # @param link [Wgit::Url, String] The link to add to the host prefix.
     # @return [Wgit::Url] host + "/" + link
     def self.concat(host, link)
-      url = host
-      url.chop! if url.end_with?('/')
-      link = link[1..-1] if link.start_with?('/')
-      separator = link.start_with?('#') ? '' : '/'
-      Wgit::Url.new(url + separator + link)
+      host = Wgit::Url.new(host).without_trailing_slash
+      link = Wgit::Url.new(link).without_leading_slash
+      separator = (link.start_with?('#') or link.start_with?('?')) ? '' : '/'
+      Wgit::Url.new(host + separator + link)
     end
     # Returns if self is a relative or absolute Url. If base is provided and
@@ -217,9 +220,9 @@ module Wgit
       path = @uri.path
       return nil if path.nil? or path.empty?
       return Wgit::Url.new('/') if path == '/'
-      path = path[1..-1] if path.start_with?('/')
-      path.chop! if path.end_with?('/')
-      Wgit::Url.new(path)
+      Wgit::Url.new(path).
+        without_leading_slash.
+        without_trailing_slash
     end
     # Returns the endpoint of this URL e.g. the bit after the host with any
@@ -253,26 +256,77 @@ module Wgit
       anchor ? Wgit::Url.new("##{anchor}") : nil
     end
-    # Returns a new Wgit::Url containing just the path + anchor string of this
-    # URL e.g. Given http://google.com/us#about, us#about is returned.
+    # Returns a new Wgit::Url containing just the file extension of this URL
+    # e.g. Given http://google.com#about.html, html is returned.
     #
-    # @return [Wgit::Url, nil] Containing just the path and anchor string or
-    #   nil.
-    def to_path_and_anchor
-      path = to_path || ''
-      anchor = to_anchor || ''
-      both = path + anchor
-      both.empty? ? nil : Wgit::Url.new(both)
+    # @return [Wgit::Url, nil] Containing just the extension string or nil.
+    def to_extension
+      path = to_path
+      return nil unless path
+      segs = path.split('.')
+      segs.length > 1 ? Wgit::Url.new(segs.last) : nil
     end
     # Returns a new Wgit::Url containing self without a trailing slash. Is
-    # idempotent.
+    # idempotent meaning self will always be returned regardless of whether
+    # there's a trailing slash or not.
     #
-    # @return [Wgit::Url] Without a trailing slash.
+    # @return [Wgit::Url] Self without a trailing slash.
+    def without_leading_slash
+      start_with?('/') ? Wgit::Url.new(self[1..-1]) : self
+    end
+    # Returns a new Wgit::Url containing self without a trailing slash. Is
+    # idempotent meaning self will always be returned regardless of whether
+    # there's a trailing slash or not.
+    #
+    # @return [Wgit::Url] Self without a trailing slash.
     def without_trailing_slash
       end_with?('/') ? Wgit::Url.new(chop) : self
     end
+    # Returns a new Wgit::Url containing self without a leading or trailing
+    # slash. Is idempotent and will return self regardless if there's slashes
+    # present or not.
+    #
+    # @return [Wgit::Url] Self without leading or trailing slashes.
+    def without_slashes
+      self.
+        without_leading_slash.
+        without_trailing_slash
+    end
+    # Returns a new Wgit::Url with the base (proto and host) removed e.g. Given
+    # http://google.com/search?q=something#about, search?q=something#about is
+    # returned. If relative and base isn't present then self is returned.
+    # Leading and trailing slashes are always stripped from the return value.
+    #
+    # @return [Wgit::Url] Self containing everything after the base.
+    def without_base
+      base_url = to_base
+      without_base = base_url ? gsub(base_url, '') : self
+      return self if ['', '/'].include?(without_base)
+      Wgit::Url.new(without_base).
+        without_leading_slash.
+        without_trailing_slash
+    end
+    # Returns a new Wgit::Url with the anchor portion removed e.g. Given
+    # http://google.com/search#about, http://google.com/search is
+    # returned. Self is returned as is if no anchor is present. A URL
+    # consisting of only an anchor e.g. '#about' will return an empty URL.
+    # This method assumes that the anchor is correctly placed at the very end
+    # of the URL.
+    #
+    # @return [Wgit::Url] Self with the anchor portion removed.
+    def without_anchor
+      anchor = to_anchor
+      without_anchor = anchor ? gsub(anchor, '') : self
+      Wgit::Url.new(without_anchor)
+    end
     # Returns a Hash containing this Url's instance vars excluding @uri.
     # Used when storing the URL in a Database e.g. MongoDB etc.
     #
@@ -298,6 +352,8 @@ module Wgit
     alias :anchor :to_anchor
     alias :to_fragment :to_anchor
     alias :fragment :to_anchor
+    alias :extension :to_extension
+    alias :without_fragment :without_anchor
     alias :internal_link? :relative_link?
     alias :is_relative? :relative_link?
     alias :is_internal? :relative_link?
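
The new `without_*` helpers and `to_extension` in this hunk can be tried outside the gem with a minimal sketch. `SketchUrl` is a hypothetical stand-in for `Wgit::Url`, and its `to_base`, `to_anchor` and `to_path` are assumptions built on the stdlib `URI` (the gem's own versions are not shown in this hunk); the remaining method bodies follow the added lines. Note that in the diff the doc comment above `without_leading_slash` repeats the trailing-slash wording, while the method body strips a leading slash.

```ruby
require 'uri'

# Hypothetical stand-in for Wgit::Url (a String subclass); not the gem's class.
class SketchUrl < String
  # Assumption: the base is scheme + host, derived here with the stdlib URI.
  def to_base
    uri = URI(self)
    uri.scheme && uri.host ? SketchUrl.new("#{uri.scheme}://#{uri.host}") : nil
  end

  # Assumption: the anchor keeps its leading '#', as in the diff's to_anchor.
  def to_anchor
    fragment = URI(self).fragment
    fragment ? SketchUrl.new("##{fragment}") : nil
  end

  # Assumption: the path is URI#path, or nil when it is empty.
  def to_path
    path = URI(self).path
    path.nil? || path.empty? ? nil : SketchUrl.new(path)
  end

  # The bodies below follow the lines added in the diff.
  def to_extension
    path = to_path
    return nil unless path
    segs = path.split('.')
    segs.length > 1 ? SketchUrl.new(segs.last) : nil
  end

  def without_leading_slash
    start_with?('/') ? SketchUrl.new(self[1..-1]) : self
  end

  def without_trailing_slash
    end_with?('/') ? SketchUrl.new(chop) : self
  end

  def without_base
    base_url = to_base
    without_base = base_url ? gsub(base_url, '') : self
    return self if ['', '/'].include?(without_base)
    SketchUrl.new(without_base).without_leading_slash.without_trailing_slash
  end

  def without_anchor
    anchor = to_anchor
    SketchUrl.new(anchor ? gsub(anchor, '') : self)
  end
end

url = SketchUrl.new('http://google.com/search?q=something#about')
p url.without_base                              # => "search?q=something#about"
p SketchUrl.new('/about/').without_base         # => "about"
p SketchUrl.new('#about').without_anchor        # => ""
p SketchUrl.new('http://example.com/docs/index.html').to_extension # => "html"
```

Because `without_base` and `without_anchor` rely on `gsub`, every occurrence of the base or anchor string is removed, not only the one at the expected position; the doc comment for `without_anchor` states the assumption that the anchor sits at the very end of the URL.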

data/lib/wgit/utils.rb CHANGED

@@ -2,7 +2,7 @@ module Wgit
   # Utility module containing generic methods.
   module Utils
     # Returns the current time stamp.
     #
     # @return [Time] The current time stamp.
@@ -27,7 +27,7 @@ module Wgit
       end
       hash
     end
     # Returns the model having removed non bson types (for use with MongoDB).
     #
     # @param model_hash [Hash] The model Hash to process.
@@ -60,7 +60,7 @@ module Wgit
     # @param index [Integer] The first index of a word in sentence. This is
     #     usually a word in a search query.
     # @param sentence_limit [Integer] The max length of the formatted sentence
-    #     being returned. The length will be based on the sentence_limit
+    #     being returned. The length will be based on the sentence_limit
     #     parameter or the full length of the original sentence, whichever
     #     is less. The full sentence is returned if the sentence_limit is 0.
     # @return [String] The sentence once formatted.
@@ -70,7 +70,7 @@ module Wgit
       if index < 0 or index > sentence.length
         raise "Incorrect index value: #{index}"
       end
       return sentence if sentence_limit == 0
       start = 0
@@ -131,11 +131,11 @@ module Wgit
     #     to output text somewhere e.g. STDOUT (the default).
     # @return [nil]
     def self.printf_search_results(results, query = nil, case_sensitive = false,
-                                   sentence_length = 80, keyword_count = 5,
+                                   sentence_length = 80, keyword_count = 5,
                                    stream = Kernel)
       raise "stream must respond_to? :puts" unless stream.respond_to? :puts
       keyword_count -= 1 # Because Arrays are zero-indexed.
       results.each do |doc|
         sentence =  if query.nil?
                       nil
@@ -155,8 +155,37 @@ module Wgit
         stream.puts doc.url
         stream.puts
       end
       nil
     end
+    # Processes a String to make it uniform. Strips off any leading/trailing
+    # white space and converts to UTF-8.
+    #
+    # @param str [String] The String to process. str is modified.
+    # @return [String] The processed str is both modified and then returned.
+    def self.process_str(str)
+      if str.is_a?(String)
+        str.encode!('UTF-8', 'UTF-8', invalid: :replace)
+        str.strip!
+      end
+      str
+    end
+    # Processes an Array to make it uniform. Removes empty Strings and nils,
+    # processes non empty Strings using Wgit::Utils.process_str before removing
+    # duplicates.
+    #
+    # @param arr [Enumerable] The Array to process. arr is modified.
+    # @return [Enumerable] The processed arr is both modified and then returned.
+    def self.process_arr(arr)
+      if arr.is_a?(Array)
+        arr.map! { |str| process_str(str) }
+        arr.reject! { |str| str.is_a?(String) ? str.empty? : false }
+        arr.compact!
+        arr.uniq!
+      end
+      arr
+    end
   end
 end
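
A short usage sketch may help show what the two new helpers do to their arguments. `UtilsSketch` is a hypothetical stand-in for `Wgit::Utils` so the example runs without the gem; the method bodies mirror the added lines above. The main point is that both helpers mutate their argument in place and return that same object.

```ruby
# Hypothetical stand-in for Wgit::Utils; method bodies mirror the diff above.
module UtilsSketch
  # Re-encodes a String as UTF-8 and strips surrounding whitespace, in place.
  # Non-String values are returned untouched.
  def self.process_str(str)
    if str.is_a?(String)
      str.encode!('UTF-8', 'UTF-8', invalid: :replace)
      str.strip!
    end
    str
  end

  # Processes each String, then drops empty Strings, nils and duplicates,
  # all in place. Non-Array values are returned untouched.
  def self.process_arr(arr)
    if arr.is_a?(Array)
      arr.map! { |str| process_str(str) }
      arr.reject! { |str| str.is_a?(String) ? str.empty? : false }
      arr.compact!
      arr.uniq!
    end
    arr
  end
end

# .dup gives mutable Strings even where literals are frozen.
input = ['  hello '.dup, ''.dup, nil, 'hello'.dup, "world\n".dup]
result = UtilsSketch.process_arr(input)

p result                # => ["hello", "world"]
p result.equal?(input)  # => true (same Array object, modified in place)
```

Since the input is modified, callers that need the original Array or Strings afterwards would have to pass in copies.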

data/lib/wgit/version.rb CHANGED

@@ -3,5 +3,5 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = "0.0.12".freeze
+  VERSION = "0.0.13".freeze
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.0.12
+  version: 0.0.13
 platform: ruby
 authors:
 - Michael Telford
@@ -169,7 +169,7 @@ description: Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, re
   websites if required. Wgit also provides a means to search indexed documents stored
   in a database. Therefore, this library provides the main components of a WWW search
   engine. The Wgit API is easily extended allowing you to pull out the parts of a
-  webpage that are important to you, the code snippets or images for example. As Wgit
+  webpage that are important to you, the code snippets or tables for example. As Wgit
   is a library, it has uses in many different application types.
 email: michael.telford@live.com
 executables: []
@@ -184,6 +184,7 @@ files:
 - "./lib/wgit/database/database.rb"
 - "./lib/wgit/database/model.rb"
 - "./lib/wgit/document.rb"
+- "./lib/wgit/document_extensions.rb"
 - "./lib/wgit/indexer.rb"
 - "./lib/wgit/logger.rb"
 - "./lib/wgit/url.rb"
@@ -218,6 +219,6 @@ rubyforge_project:
 rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
-summary: Wgit is a Ruby gem similar in nature to GNU's `wget`. It provides an easy
-  to use API for programmatic web scraping, indexing and searching.
+summary: Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an
+  easy to use API for programmatic web scraping, indexing and searching.
 test_files: []