wgit 0.0.13 → 0.0.14
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +51 -40
- data/TODO.txt +3 -5
- data/lib/wgit/crawler.rb +2 -0
- data/lib/wgit/database/connection_details.rb +4 -12
- data/lib/wgit/database/database.rb +12 -9
- data/lib/wgit/document.rb +1 -1
- data/lib/wgit/indexer.rb +79 -26
- data/lib/wgit/version.rb +1 -1
- metadata +9 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: b2f1d98c88dcd1bd2a12b463732b08373f6b5db4d1c75661530293b0ed129c47
+data.tar.gz: ed5ff0f5aa5e2427909ced8f67c8189dd1f7aec259b76947cd4cd73a1f8783d3
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 6270ceff57af1936e7b0ae1fb3748def307981b26dc604d2f190a9e751e0952f103b4a7c227e2cbbc2d00f26bc7c4b1f46f44fa423e13add7e8bd4eb9fe046f4
+data.tar.gz: 96b0eecd2713cbf55f7433c145a541fdae34d56b9e3bedd1b91b6bb30b463d06b7f07de59457ad899fd32bd095bbe92ee01d94c04ef923924fbc4a110bbe98f2
data/README.md
CHANGED
@@ -16,9 +16,10 @@ Check out this [example application](https://search-engine-rb.herokuapp.com) - a
 6. [Extending The API](#Extending-The-API)
 7. [Caveats](#Caveats)
 8. [Executable](#Executable)
-9. [
-10. [
-11. [
+9. [Change Log](#Change-Log)
+10. [Development](#Development)
+11. [Contributing](#Contributing)
+12. [License](#License)
 
 ## Installation
 
@@ -87,6 +88,10 @@ See the [Practical Database Example](#Practical-Database-Example) for informatio
 
 Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.
 
+### Broken Link Finder
+
+The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
+
 ### CSS Indexer
 
 The below script downloads the contents of the first css link found on Facebook's index page.
@@ -146,78 +151,80 @@ end
 
 ## Practical Database Example
 
-This next example requires a configured database instance.
+This next example requires a configured database instance. Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a free (small) account or provide your own MongoDB instance.
 
-
+`Wgit::Database` provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and serialisation. With Wgit you can index webpages, store them in a database and then search through all that's been indexed. The use of a database is entirely optional however and isn't required for crawling/indexing.
 
-The
+The following versions of MongoDB are supported:
 
-| Gem
-|
-| ~> 2.
+| Gem | Database |
+| ------ | -------- |
+| ~> 2.9 | ~> 4.0 |
 
 ### Setting Up MongoDB
 
-Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records.
+Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records.
 
 1) Create collections for: `documents` and `urls`.
-2) Add a unique index for the `url` field in **both** collections.
-3) Enable `textSearchEnabled` in MongoDB's configuration.
-4) Create a *text search index* for the `documents` collection using:
+2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections.
+3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so).
+4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
 ```json
 {
-
-
-
-
+"text": "text",
+"author": "text",
+"keywords": "text",
+"title": "text"
 }
 ```
-5) Set the connection details for your MongoDB instance using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)
+5) Set the connection details for your MongoDB instance (see below) using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)
 
 **Note**: The *text search index* (in step 4) lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should append this list with any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
 
 ### Database Example
 
-The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database.
-
-If you're running the code below for yourself, remember to replace the Hash containing the connection details with your own.
+The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.
 
 ```ruby
 require 'wgit'
 require 'wgit/core_ext' # => Provides the String#to_url and Enumerable#to_urls methods.
 
-
+### CONNECT TO THE DATABASE ###
+
+# Set your connection details manually (as below) or from the environment using
+# Wgit.set_connection_details_from_env
+Wgit.set_connection_details('DB_CONNECTION_STRING' => '<your_connection_string>')
+db = Wgit::Database.new # Connects to the database...
+
+### SEED SOME DATA ###
+
+# Here we create our own document rather than crawling the web (which works in the same way).
 # We pass the web page's URL and HTML Strings.
 doc = Wgit::Document.new(
 "http://test-url.com".to_url,
 "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
 )
-
-# Set your connection details manually (as below) or from the environment using
-# Wgit.set_connection_details_from_env
-Wgit.set_connection_details(
-'DB_HOST' => '<host_machine>',
-'DB_PORT' => '27017',
-'DB_USERNAME' => '<username>',
-'DB_PASSWORD' => '<password>',
-'DB_DATABASE' => '<database_name>',
-)
-
-db = Wgit::Database.new # Connects to the database...
 db.insert doc
 
-
+### SEARCH THE DATABASE ###
+
+# Searching the database returns Wgit::Document's which have fields containing the query.
 query = "cow"
 results = db.search query
 
-
+search_result = results.first
+search_result.class # => Wgit::Document
+doc.url == search_result.url # => true
+
+### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
 
-# Searching the returned documents gives the matching
-
+# Searching the returned documents gives the matching text from that document.
+search_result.search(query).first # => "How now brown cow."
 
-
+### SEED URLS TO BE CRAWLED LATER ###
 
-
+db.insert search_result.external_links
+urls_to_crawl = db.uncrawled_urls # => Results will include search_result.external_links.
 ```
 
 ## Extending The API
@@ -319,6 +326,10 @@ In future versions of Wgit, an executable will be packaged with the gem. The exe
 
 This executable will be very similar in nature to `./bin/console` which is currently used only for development and isn't packaged as part of the `wgit` gem.
 
+## Change Log
+
+See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences between versions of Wgit.
+
 ## Development
 
 The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file.
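The new MongoDB setup steps in the README above (collections, a unique `url` index, a text search index) can also be scripted with the `mongo` gem directly. The snippet below is only a minimal sketch: the connection string, database name and the idea of scripting the setup are illustrative assumptions, not part of the wgit README.

```ruby
require 'mongo'

# Placeholder URI and database name - substitute your own MongoDB instance.
client = Mongo::Client.new('mongodb://localhost:27017/wgit_dev')

# Step 1) Create the two collections Wgit expects (ignore "already exists" errors).
client[:documents].create rescue nil
client[:urls].create rescue nil

# Step 2) Unique index on the `url` field in both collections.
client[:documents].indexes.create_one({ url: 1 }, unique: true)
client[:urls].indexes.create_one({ url: 1 }, unique: true)

# Step 4) Text search index over the fields listed in the README's JSON snippet.
client[:documents].indexes.create_one(
  text: 'text', author: 'text', keywords: 'text', title: 'text'
)
```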
data/TODO.txt
CHANGED
@@ -8,7 +8,6 @@ Primary
 
 Secondary
 ---------
-- Setup a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some ruby sites like ruby.org etc.
 - Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
 - Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
 - Check if Document::TEXT_ELEMENTS is expansive enough.
@@ -17,16 +16,15 @@ Secondary
 
 Refactoring
 -----------
-- Replace method params with named parameters where applicable.
-- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
-- Mock (monkey patch) the 'mongo' gem's func. for speed in the tests.
 - Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
+- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
+- Replace method params with named parameters where applicable.
 
 Gem Publishing Checklist
 ------------------------
 - Ensure a clean branch of master and create a 'release' branch.
 - Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
-- Increment the version number (in version.rb).
+- Increment the version number (in version.rb) and update the CHANGELOG.md.
 - Run 'bundle install' to update deps.
 - Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
 - Run 'bundle exec rake test' and ensure all tests are passing.
data/lib/wgit/crawler.rb
CHANGED
data/lib/wgit/database/connection_details.rb
CHANGED
@@ -9,27 +9,19 @@ module Wgit
 CONNECTION_DETAILS = {}
 
 # The keys required for a successful database connection.
-CONNECTION_KEYS_REQUIRED = [
-'DB_HOST', 'DB_PORT', 'DB_USERNAME', 'DB_PASSWORD', 'DB_DATABASE'
-]
+CONNECTION_KEYS_REQUIRED = ['DB_CONNECTION_STRING']
 
 # Set the database's connection details from the given hash. It is your
 # responsibility to ensure the correct hash vars are present and set.
 #
 # @param hash [Hash] Containing the database connection details to use.
 # The hash should contain the following keys (of type String):
-#
+# DB_CONNECTION_STRING
 # @raise [KeyError] If any of the required connection details are missing.
 # @return [Hash] Containing the database connection details from hash.
 def self.set_connection_details(hash)
 assert_required_keys(hash, CONNECTION_KEYS_REQUIRED)
-
-CONNECTION_DETAILS[:host] = hash.fetch('DB_HOST')
-CONNECTION_DETAILS[:port] = hash.fetch('DB_PORT')
-CONNECTION_DETAILS[:uname] = hash.fetch('DB_USERNAME')
-CONNECTION_DETAILS[:pword] = hash.fetch('DB_PASSWORD')
-CONNECTION_DETAILS[:db] = hash.fetch('DB_DATABASE')
-
+CONNECTION_DETAILS[:connection_string] = hash.fetch('DB_CONNECTION_STRING')
 CONNECTION_DETAILS
 end
 
@@ -37,7 +29,7 @@ module Wgit
 # responsibility to ensure the correct ENV vars are present and set.
 #
 # The ENV should contain the following keys (of type String):
-#
+# DB_CONNECTION_STRING
 #
 # @raise [KeyError] If any of the required connection details are missing.
 # @return [Hash] Containing the database connection details from the ENV.
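The net effect of this file's changes is that the five separate host/port/credential keys collapse into a single `DB_CONNECTION_STRING`. A short usage sketch under the new scheme (the URI below is a placeholder):

```ruby
require 'wgit'

# One key instead of the old five; the URI itself is a placeholder.
Wgit.set_connection_details('DB_CONNECTION_STRING' => 'mongodb://user:pass@localhost:27017/wgit')

# Or read it from the environment (expects ENV['DB_CONNECTION_STRING'] to be set):
# Wgit.set_connection_details_from_env

Wgit::CONNECTION_DETAILS[:connection_string] # => "mongodb://user:pass@localhost:27017/wgit"
```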
data/lib/wgit/database/database.rb
CHANGED
@@ -17,10 +17,16 @@ module Wgit
 #
 # @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
 def initialize
-
-
-
-
+@@client = Database.connect
+end
+
+# Initializes a database connection client.
+#
+# @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
+def self.connect
+unless Wgit::CONNECTION_DETAILS[:connection_string]
+raise "Wgit::CONNECTION_DETAILS must be defined and include \
+:connection_string"
 end
 
 # Only log for error (or more severe) scenarios.
@@ -28,11 +34,8 @@ module Wgit
 Mongo::Logger.logger.progname = 'mongo'
 Mongo::Logger.logger.level = Logger::ERROR
 
-
-
-database: conn_details[:db],
-user: conn_details[:uname],
-password: conn_details[:pword])
+# Connects to the database here.
+Mongo::Client.new(Wgit::CONNECTION_DETAILS[:connection_string])
 end
 
 ### Create Data ###
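With the above, `Wgit::Database.new` now delegates to the new `Database.connect` class method, which raises unless a connection string has been set. A hedged sketch of that behaviour (placeholder URI; whether a server is actually reachable is not checked here):

```ruby
require 'wgit'

# Without connection details the new guard in Database.connect raises.
begin
  Wgit::Database.new
rescue RuntimeError => e
  puts e.message # mentions Wgit::CONNECTION_DETAILS and :connection_string
end

# With a (placeholder) connection string set, both forms go through Database.connect.
Wgit.set_connection_details('DB_CONNECTION_STRING' => 'mongodb://localhost:27017/wgit')
client = Wgit::Database.connect # => a Mongo::Client built from the connection string
db     = Wgit::Database.new     # calls Database.connect internally
```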
data/lib/wgit/document.rb
CHANGED
@@ -84,7 +84,7 @@ module Wgit
 obj = url_or_obj
 assert_respond_to(obj, :fetch)
 
-@url = obj.fetch("url") # Should always be present.
+@url = Wgit::Url.new(obj.fetch("url")) # Should always be present.
 @html = obj.fetch("html", "")
 @doc = init_nokogiri
 @score = obj.fetch("score", 0.0)
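The single changed line above means a `Wgit::Document` rebuilt from a Hash-like object (as the database layer does when deserialising records) now carries a `Wgit::Url` rather than a plain String. A small illustrative sketch, assuming the hand-rolled Hash below stands in for a real DB record:

```ruby
require 'wgit'

# Mimic the object/Hash form of Document.new used when loading from the database.
doc = Wgit::Document.new(
  'url'  => 'http://test-url.com',
  'html' => '<html><p>Hello</p></html>'
)

doc.url.class # => Wgit::Url (previously a plain String in this code path)
```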
data/lib/wgit/indexer.rb
CHANGED
@@ -10,7 +10,7 @@ module Wgit
 # site storing their internal pages into the database and adding their
 # external url's to be crawled later on. Logs info on the crawl
 # using Wgit.logger as it goes along.
-#
+#
 # @param max_sites_to_crawl [Integer] The number of separate and whole
 # websites to be crawled before the method exits. Defaults to -1 which
 # means the crawl will occur until manually stopped (Ctrl+C etc).
@@ -33,8 +33,8 @@ module Wgit
 # @param url [Wgit::Url, String] The base Url of the website to crawl.
 # @param insert_externals [Boolean] Whether or not to insert the website's
 # external Url's into the database.
-# @yield [
-# is inserted into the database allowing for prior manipulation.
+# @yield [Wgit::Document] Given the Wgit::Document of each crawled webpage,
+# before it is inserted into the database allowing for prior manipulation.
 # @return [Integer] The total number of pages crawled within the website.
 def self.index_this_site(url, insert_externals = true, &block)
 url = Wgit::Url.new url
@@ -43,6 +43,24 @@ module Wgit
 indexer.index_this_site(url, insert_externals, &block)
 end
 
+# Convience method to index a single webpage using
+# Wgit::Indexer#index_this_page.
+#
+# Crawls a single webpage and stores it into the database.
+# There is no max download limit so be careful of large pages.
+#
+# @param url [Wgit::Url, String] The Url of the webpage to crawl.
+# @param insert_externals [Boolean] Whether or not to insert the website's
+# external Url's into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+# before it is inserted into the database allowing for prior manipulation.
+def self.index_this_page(url, insert_externals = true, &block)
+url = Wgit::Url.new url
+db = Wgit::Database.new
+indexer = Wgit::Indexer.new(db)
+indexer.index_this_page(url, insert_externals, &block)
+end
+
 # Performs a search of the database's indexed documents and pretty prints
 # the results. See Wgit::Database#search for details of the search.
 #
@@ -53,8 +71,8 @@ module Wgit
 # @param skip [Integer] The number of DB records to skip.
 # @param sentence_length [Integer] The max length of each result's text
 # snippet.
-# @yield [
-def self.indexed_search(query, whole_sentence = false, limit = 10,
+# @yield [Wgit::Document] Given each search result (Wgit::Document).
+def self.indexed_search(query, whole_sentence = false, limit = 10,
 skip = 0, sentence_length = 80, &block)
 db = Wgit::Database.new
 results = db.search(query, whole_sentence, limit, skip, &block)
@@ -63,13 +81,13 @@ module Wgit
 
 # Class which sets up a crawler and saves the indexed docs to a database.
 class Indexer
-
+
 # The crawler used to scrape the WWW.
 attr_reader :crawler
-
+
 # The database instance used to store Urls and Documents in.
 attr_reader :db
-
+
 # Initialize the Indexer.
 #
 # @param database [Wgit::Database] The database instance (already
@@ -97,7 +115,7 @@ module Wgit
 urls to crawl (which might be never).")
 end
 site_count = 0
-
+
 while keep_crawling?(site_count, max_sites_to_crawl, max_data_size) do
 Wgit.logger.info("Current database size: #{@db.size}")
 @crawler.urls = @db.uncrawled_urls
@@ -107,10 +125,10 @@ urls to crawl (which might be never).")
 return
 end
 Wgit.logger.info("Starting crawl loop for: #{@crawler.urls}")
-
+
 docs_count = 0
 urls_count = 0
-
+
 @crawler.urls.each do |url|
 unless keep_crawling?(site_count, max_sites_to_crawl, max_data_size)
 Wgit.logger.info("Reached max number of sites to crawl or database \
@@ -121,7 +139,7 @@ capacity, exiting.")
 
 url.crawled = true
 raise unless @db.update(url) == 1
-
+
 site_docs_count = 0
 ext_links = @crawler.crawl_site(url) do |doc|
 unless doc.empty?
@@ -131,7 +149,7 @@ capacity, exiting.")
 end
 end
 end
-
+
 urls_count += write_urls_to_db(ext_links)
 Wgit.logger.info("Crawled and saved #{site_docs_count} docs for the \
 site: #{url}")
@@ -141,6 +159,8 @@ site: #{url}")
 this iteration.")
 Wgit.logger.info("Found and saved #{urls_count} external url(s) for the next \
 iteration.")
+
+nil
 end
 end
 
@@ -151,14 +171,14 @@ iteration.")
 # @param url [Wgit::Url] The base Url of the website to crawl.
 # @param insert_externals [Boolean] Whether or not to insert the website's
 # external Url's into the database.
-# @yield [
-# is inserted into the database allowing for prior
-# nil or false from the block to prevent the
-# into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of each crawled web
+# page, before it is inserted into the database allowing for prior
+# manipulation. Return nil or false from the block to prevent the
+# document from being saved into the database.
 # @return [Integer] The total number of webpages/documents indexed.
 def index_this_site(url, insert_externals = true)
 total_pages_indexed = 0
-
+
 ext_urls = @crawler.crawl_site(url) do |doc|
 result = true
 if block_given?
@@ -174,23 +194,56 @@ iteration.")
 end
 
 url.crawled = true
-
-
-else
-@db.update(url)
-end
-
+@db.url?(url) ? @db.update(url) : @db.insert(url)
+
 if insert_externals
 write_urls_to_db(ext_urls)
 Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
 end
-
+
 Wgit.logger.info("Crawled and saved #{total_pages_indexed} docs for the \
 site: #{url}")
 
 total_pages_indexed
 end
 
+# Crawls a single webpage and stores it into the database.
+# There is no max download limit so be careful of large pages.
+# Logs info on the crawl using Wgit.logger as it goes along.
+#
+# @param url [Wgit::Url] The webpage Url to crawl.
+# @param insert_externals [Boolean] Whether or not to insert the webpage's
+# external Url's into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+# before it is inserted into the database allowing for prior
+# manipulation. Return nil or false from the block to prevent the
+# document from being saved into the database.
+def index_this_page(url, insert_externals = true)
+doc = @crawler.crawl_page(url) do |doc|
+result = true
+if block_given?
+result = yield(doc)
+end
+
+if result
+if write_doc_to_db(doc)
+Wgit.logger.info("Crawled and saved internal page: #{doc.url}")
+end
+end
+end
+
+url.crawled = true
+@db.url?(url) ? @db.update(url) : @db.insert(url)
+
+if insert_externals
+ext_urls = doc.external_links
+write_urls_to_db(ext_urls)
+Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
+end
+
+nil
+end
+
 private
 
 # Keep crawling or not based on DB size and current loop iteration.
@@ -204,7 +257,7 @@ site: #{url}")
 end
 end
 
-# The unique url index on the documents collection prevents duplicate
+# The unique url index on the documents collection prevents duplicate
 # inserts.
 def write_doc_to_db(doc)
 @db.insert(doc)
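The headline addition in this file is single-page indexing via the new `Wgit.index_this_page`. A hedged usage sketch (placeholder URL and connection string, assuming a reachable MongoDB instance):

```ruby
require 'wgit'

# Placeholder connection string - indexing writes to the database.
Wgit.set_connection_details('DB_CONNECTION_STRING' => '<your_connection_string>')

# New in 0.0.14: crawl one page, store it, and optionally seed its external links.
Wgit.index_this_page('http://test-url.com') do |doc|
  # Returning nil/false here skips saving the document; anything truthy saves it.
  !doc.empty?
end
```

The block semantics mirror `Wgit.index_this_site`: the yielded `Wgit::Document` can be inspected or modified before it is written to the database.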
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.14
 platform: ruby
 authors:
 - Michael Telford
@@ -31,6 +31,9 @@ dependencies:
 - - ">="
 - !ruby/object:Gem::Version
 version: 0.9.20
+- - "<"
+- !ruby/object:Gem::Version
+version: '1.0'
 type: :development
 prerelease: false
 version_requirements: !ruby/object:Gem::Requirement
@@ -38,6 +41,9 @@ dependencies:
 - - ">="
 - !ruby/object:Gem::Version
 version: 0.9.20
+- - "<"
+- !ruby/object:Gem::Version
+version: '1.0'
 - !ruby/object:Gem::Dependency
 name: byebug
 requirement: !ruby/object:Gem::Requirement
@@ -156,14 +162,14 @@ dependencies:
 requirements:
 - - "~>"
 - !ruby/object:Gem::Version
-version: 2.
+version: 2.9.0
 type: :runtime
 prerelease: false
 version_requirements: !ruby/object:Gem::Requirement
 requirements:
 - - "~>"
 - !ruby/object:Gem::Version
-version: 2.
+version: 2.9.0
 description: Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves
 and serialises their page contents for later use. You can use Wgit to copy entire
 websites if required. Wgit also provides a means to search indexed documents stored