wgit 0.6.0 → 0.7.0

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +12 -0
data/README.md +24 -15
data/bin/wgit +35 -0
data/lib/wgit/crawler.rb +3 -3
data/lib/wgit/database/database.rb +1 -1
data/lib/wgit/document.rb +15 -4
data/lib/wgit/indexer.rb +14 -5
data/lib/wgit/version.rb +2 -2
metadata +9 -7

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 0c7346b075dca86debdb6a55ed363d1b890088b17fc600c17a5be82f5878545c
-  data.tar.gz: 75572c882e0711e1d49513db91d14fd1d530dbc6e22b7a0bbfec9ac1efd21e29
+  metadata.gz: 29d37a4a0f013fec64625d8fe5798ae2d062ae6f213811c51d223de311e16707
+  data.tar.gz: 213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8
 SHA512:
-  metadata.gz: e3b915f11c80999a659f9b7f6f6786b717393fe94e6e65029dd5d1b2c2d95f064512cfe96a96d06416ed5932aad0d2798039306f746835c23fb5223aa2d69f5b
-  data.tar.gz: 3b1d55d35a30b19fe6c3193f9e2c4eb2884aaddcdf6c31a88465a0d9ffdaf01886b380ba68321bed2aad69d7b6fc26ad49b612aafa01c8034998bdd9697bebfd
+  metadata.gz: acd321f3ba039e6f54dd8a36a3e4ebec1fb40f1cda5ee1c982df4be22ee6d463f829c72f011ff959a4a4d2651676dc2d31866a273b60d3e5e630ccf77b3d7cbe
+  data.tar.gz: d0908d28e6fdaec440209479f75945672807cf3e9359fb8bd8f6cc9de45568a341ac0204ba5f50a2f6569b4a29f4f7ac3088353a35f2c5091a567af469027aab

data/CHANGELOG.md CHANGED

@@ -9,6 +9,18 @@
 - ...
 ---
+## v0.7.0
+### Added
+- `Wgit::Indexer.new` optional `crawler:` named param.
+- `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
+- `Document.extensions` returning a Set of all defined extensions.
+### Changed/Removed
+- Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search` `Wgit.indexed_search` etc. This brings back more relevant search results by default.
+- Updated the Docker image to now include index names; making it easier to identify them.
+### Fixed
+- ...
+---
 ## v0.6.0
 ### Added
 - Added `Wgit::Utils.proces_arr encode:` param.

data/README.md CHANGED

@@ -8,11 +8,11 @@
 ---
-Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an easy to use API for programmatic URL parsing, HTML indexing and searching.
+Wgit is a Ruby library primarily used for crawling, indexing and searching HTML webpages.
-Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
+Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to scrape entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
-Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
+Check out this [demo application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
 Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
@@ -51,6 +51,10 @@ Or install it yourself as:
     $ gem install wgit
+Verify the install by using the executable (to start a shell session):
+    $ wgit
 ## Basic Usage
 ```ruby
@@ -271,11 +275,11 @@ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_
 Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
-By default, Wgit serialises what it thinks are the most important pieces of information from each webpage. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
+Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
-### Defining Custom Serialisers Via Document Extensions
+### Serialising Additional Page Elements via Document Extensions
-You can define a Document extension for each HTML element(s) that you want to extract into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, any crawled Documents will contain your extracted content.
+You can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
 Once the page element has been serialised, you can do with it as you wish e.g. obtain it's text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
@@ -296,6 +300,7 @@ Wgit::Document.define_extension(
 end
 # Our Document has a table which we're interested in.
+# Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
 doc = Wgit::Document.new(
   'http://some_url.com',
   <<~HTML
@@ -324,8 +329,6 @@ doc.stats # => {
 # }
 ```
-Wgit uses Document extensions to provide much of it's core serialising functionality, providing access to a webpage's text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
 See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
 **Extension Notes**:
@@ -339,16 +342,20 @@ See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michae
 Below are some points to keep in mind when using Wgit:
 - All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
-- By default, up to 5 URL redirects will be followed; this is configurable however.
+- By default, up to 5 URL redirects will be followed; this is [configurable](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Crawler#redirect_limit-instance_method) however.
 - IRI's (URL's containing non ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
 ## Executable
-Currently there is no executable provided with Wgit, however...
+Installing the Wgit gem also adds the `wgit` executable to your `$PATH`. The executable launches an interactive shell session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
+The `wgit` executable does the following things (in order):
-In future versions of Wgit, an executable will be packaged with the gem. The executable will provide a `pry` console with the `wgit` gem already loaded. Using the console, you'll easily be able to index and search the web without having to write your own scripts.
+1. `require wgit`
+2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, which ever is found first)
+3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
-This executable will be similar in nature to `./bin/console` which is currently used for development and isn't packaged as part of the `wgit` gem.
+The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session. **Note** that variables should either be instance variables (e.g. `@url`) or be accessed via a getter method (e.g. `def url; ...; end`).
 ## Change Log
@@ -390,10 +397,12 @@ And you're good to go!
 Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
-Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker.
-Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset). You can also run `toys console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
+Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset that doesn't require a database).
 To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
 To install this gem onto your local machine, run `toys install`.
+### Console
+You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and `.wgit.rb` file which gets loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.

data/bin/wgit ADDED

@@ -0,0 +1,35 @@
+#!/usr/bin/env ruby
+require 'wgit'
+# Eval .wgit.rb file (if it exists).
+def eval_wgit
+  puts 'Searching for .wgit.rb in local and home directories...'
+  ['.', Dir.home].each do |dir|
+    path = "#{dir}/.wgit.rb"
+    next unless File.exist?(path)
+    puts "Eval'ing #{path} (call `eval_wgit` after changes)"
+    eval(File.read(path))
+    break
+  end
+end
+eval_wgit
+puts "\n#{Wgit.version_str}\n\n"
+# Use Pry if installed or fall back to IRB.
+begin
+  require 'pry'
+  klass = Pry
+rescue LoadError
+  require 'irb'
+  klass = IRB
+  puts "Starting IRB because Pry isn't installed."
+end
+klass.start
+puts 'Interactive session complete.'
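
The README's advice to use instance variables or getter methods in `.wgit.rb` follows from `eval_wgit` calling `eval` inside a method body: local variables assigned by the evaluated code are not visible once that method returns. Below is a minimal sketch of this scoping behaviour in plain Ruby. `DemoSession`, `load_config`, `local_site_visible?` and `CONFIG` are hypothetical names used for illustration only and are not part of Wgit.

```ruby
# Stand-in for the contents of a .wgit.rb file (hypothetical example values).
CONFIG = <<~RUBY
  site = 'http://example.com'
  @site = 'http://example.com'
  def site_helper; 'http://example.com'; end
RUBY

# Hypothetical class mimicking how bin/wgit evals the file inside a method.
class DemoSession
  def load_config(code)
    eval(code)
  end

  # True if a local variable named `site` is visible inside this method.
  def local_site_visible?
    binding.local_variable_defined?(:site)
  end
end

session = DemoSession.new
session.load_config(CONFIG)

# The local variable is gone; the instance variable and the method remain.
puts session.local_site_visible?
puts session.instance_variable_defined?(:@site)
puts session.respond_to?(:site_helper, true) # true also checks private methods
```

In the real executable the `eval` runs at the top level of the script rather than inside a class, but the same rule applies: only instance variables and methods defined by `.wgit.rb` outlive the `eval_wgit` call.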

data/lib/wgit/crawler.rb CHANGED

@@ -19,9 +19,9 @@ module Wgit
     # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
     # doesn't keep the crawl of the site going. All URL's without a file
     # extension will be crawled, because they're assumed to be HTML.
-    SUPPORTED_FILE_EXTENSIONS = Set.new(%w[
-      asp aspx cfm cgi htm html htmlx jsp php
-    ])
+    SUPPORTED_FILE_EXTENSIONS = Set.new(
+      %w[asp aspx cfm cgi htm html htmlx jsp php]
+    )
     # The amount of allowed redirects before raising an error. Set to 0 to
     # disable redirects completely; or you can pass `follow_redirects: false`

data/lib/wgit/database/database.rb CHANGED

@@ -154,7 +154,7 @@ module Wgit
     #   DB.
     # @return [Array<Wgit::Document>] The search results obtained from the DB.
     def search(
-      query, case_sensitive: false, whole_sentence: false, limit: 10, skip: 0
+      query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
     )
       query = query.to_s.strip
       query.replace('"' + query + '"') if whole_sentence
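
With `whole_sentence` now defaulting to `true`, `Wgit::Database#search` wraps the stripped query in double quotes before it is used. The sketch below mirrors only that preparation step so the effect of the new default is easy to see. `build_search_query` is a hypothetical helper, not part of Wgit, and how the database interprets the quoted string is outside what this diff shows.

```ruby
# Hypothetical helper mirroring the query preparation shown in the diff.
def build_search_query(query, whole_sentence: true)
  query = query.to_s.strip
  # Wrap the query in double quotes, as the diff does when whole_sentence is true.
  whole_sentence ? %("#{query}") : query
end

puts build_search_query(' ruby gems ')                        # quoted
puts build_search_query(' ruby gems ', whole_sentence: false) # unquoted
```

Callers that relied on the 0.6.0 behaviour can pass `whole_sentence: false` explicitly.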

data/lib/wgit/document.rb CHANGED

@@ -3,6 +3,7 @@ require_relative 'utils'
 require_relative 'assertable'
 require 'nokogiri'
 require 'json'
+require 'set'
 module Wgit
   # Class primarily modeling a HTML web document, although other MIME types
@@ -22,6 +23,14 @@ module Wgit
     # The xpath used to extract the visible text on a page.
     TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
+    # Set of Symbols representing the defined Document extensions.
+    @extensions = Set.new
+    class << self
+      # Class level attr_reader for the Document defined extensions.
+      attr_reader :extensions
+    end
     # The URL of the webpage, an instance of Wgit::Url.
     attr_reader :url
@@ -120,7 +129,7 @@ module Wgit
         result = find_in_html(xpath, opts, &block)
         init_var(var, result)
       end
-      Document.send :private, func_name
+      Document.send(:private, func_name)
       # Define the private init_*_from_object method for a Database object.
       # Gets the Object's 'key' value and creates a var for it.
@@ -128,8 +137,9 @@ module Wgit
         result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
         init_var(var, result)
       end
-      Document.send :private, func_name
+      Document.send(:private, func_name)
+      @extensions << var
       var
     end
@@ -144,6 +154,7 @@ module Wgit
       Document.send(:remove_method, "init_#{var}_from_html")
       Document.send(:remove_method, "init_#{var}_from_object")
+      @extensions.delete(var.to_sym)
       true
     rescue NameError
       false
@@ -366,7 +377,7 @@ module Wgit
     #   sentence.
     # @return [Array<String>] A subset of @text, matching the query.
     def search(
-      query, case_sensitive: false, whole_sentence: false, sentence_limit: 80
+      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
     )
       query = query.to_s
       raise 'A search query must be provided' if query.empty?
@@ -409,7 +420,7 @@ module Wgit
     #   sentence.
     # @return [String] This Document's original @text value.
     def search!(
-      query, case_sensitive: false, whole_sentence: false, sentence_limit: 80
+      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
     )
       orig_text = @text
       @text = search(
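
The new `Document.extensions` reader relies on a class-level instance variable exposed through `class << self`. Here is a minimal sketch of that pattern, using a hypothetical `Registry` class rather than `Wgit::Document` itself. The method names `define` and `remove` are illustrative only.

```ruby
require 'set'

# Hypothetical class showing the class-level attr_reader pattern from the diff.
class Registry
  # Class-level instance variable; it belongs to Registry itself, not instances.
  @extensions = Set.new

  class << self
    # Exposes Registry.extensions as a read-only accessor.
    attr_reader :extensions

    # Records a name, normalised to a Symbol, and returns it.
    def define(name)
      @extensions << name.to_sym
      name.to_sym
    end

    # Forgets a name; returns true if it had been defined, false otherwise.
    def remove(name)
      !@extensions.delete?(name.to_sym).nil?
    end
  end
end

Registry.define(:table)
Registry.define('stats')
Registry.remove(:table)
p Registry.extensions
```

Because the variable lives on the class object, a subclass of `Registry` would not share it: calling `extensions` on the subclass returns `nil` unless the subclass sets its own `@extensions`.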

data/lib/wgit/indexer.rb CHANGED

@@ -40,6 +40,10 @@ module Wgit
   #   nil to use ENV['WGIT_CONNECTION_STRING'].
   # @param insert_externals [Boolean] Whether or not to insert the website's
   #   external Url's into the database.
+  # @param allow_paths [String, Array<String>] Filters links by selecting
+  #   them if their path `File.fnmatch?` one of allow_paths.
+  # @param disallow_paths [String, Array<String>] Filters links by rejecting
+  #   them if their path `File.fnmatch?` one of disallow_paths.
   # @yield [doc] Given the Wgit::Document of each crawled webpage, before it's
   #   inserted into the database allowing for prior manipulation.
   # @return [Integer] The total number of pages crawled within the website.
@@ -96,7 +100,7 @@ module Wgit
   #   database.
   def self.indexed_search(
     query, connection_string: nil,
-    case_sensitive: false, whole_sentence: false,
+    case_sensitive: false, whole_sentence: true,
     limit: 10, skip: 0, sentence_limit: 80, &block
   )
     db = Wgit::Database.new(connection_string)
@@ -122,7 +126,7 @@ module Wgit
     Wgit::Utils.printf_search_results(results)
   end
-  # Class which sets up a crawler and saves the indexed docs to a database.
+  # Class which crawls and saves the indexed Documents to a database.
   class Indexer
     # The crawler used to index the WWW.
     attr_reader :crawler
@@ -133,10 +137,11 @@ module Wgit
     # Initialize the Indexer.
     #
     # @param database [Wgit::Database] The database instance (already
-    #   initialized with the correct connection string etc).
-    def initialize(database)
-      @crawler = Wgit::Crawler.new
+    #   initialized and connected) used to index.
+    # @param crawler [Wgit::Crawler] The crawler instance used to index.
+    def initialize(database, crawler = Wgit::Crawler.new)
       @db      = database
+      @crawler = crawler
     end
     # Retrieves uncrawled url's from the database and recursively crawls each
@@ -214,6 +219,10 @@ the next iteration.")
     # @param url [Wgit::Url] The base Url of the website to crawl.
     # @param insert_externals [Boolean] Whether or not to insert the website's
     #   external Url's into the database.
+    # @param allow_paths [String, Array<String>] Filters links by selecting
+    #   them if their path `File.fnmatch?` one of allow_paths.
+    # @param disallow_paths [String, Array<String>] Filters links by rejecting
+    #   them if their path `File.fnmatch?` one of disallow_paths.
     # @yield [doc] Given the Wgit::Document of each crawled web page before
     #   it's inserted into the database allowing for prior manipulation. Return
     #   nil or false from the block to prevent the document from being saved
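
`Indexer#initialize` now accepts the crawler as a second argument that defaults to a new `Wgit::Crawler`, so a pre-configured crawler can be supplied. Note that the diff shows a positional parameter, whereas the CHANGELOG entry describes a `crawler:` named param. The sketch below illustrates the constructor shape with stand-in classes. `FakeCrawler`, `FakeDatabase` and `SketchIndexer` are hypothetical and do not exist in Wgit.

```ruby
# Stand-ins so the sketch runs without Wgit or a database (hypothetical).
FakeCrawler  = Struct.new(:label)
FakeDatabase = Struct.new(:name)

class SketchIndexer
  attr_reader :crawler, :db

  # Same shape as the 0.7.0 signature: crawler is positional with a default.
  def initialize(database, crawler = FakeCrawler.new('default'))
    @db      = database
    @crawler = crawler
  end
end

db = FakeDatabase.new('demo')

default_indexer = SketchIndexer.new(db)
custom_indexer  = SketchIndexer.new(db, FakeCrawler.new('custom'))

puts default_indexer.crawler.label
puts custom_indexer.crawler.label
```

Since the default is evaluated on each call, every indexer built without an explicit crawler gets its own crawler instance.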

data/lib/wgit/version.rb CHANGED

@@ -1,11 +1,11 @@
 # frozen_string_literal: true
 # Wgit is a WWW indexer/scraper which crawls URL's and retrieves their page
-# contents for later use by serialisation.
+# contents for later use.
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = '0.6.0'
+  VERSION = '0.7.0'
   # Returns the current gem version of Wgit as a String.
   def self.version

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 0.7.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-12-24 00:00:00.000000000 Z
+date: 2020-01-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -185,7 +185,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '1.0'
 description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s to
-  retrieve and serialise their page contents for later use. You can use Wgit to copy
+  retrieve and serialise their page contents for later use. You can use Wgit to scrape
   entire websites if required. Wgit also provides a means to search indexed documents
   stored in a database. Therefore, this library provides the main components of a
   WWW search engine. The Wgit API is easily extended allowing you to pull out the
@@ -195,7 +195,8 @@ description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s
   '
 email: michael.telford@live.com
-executables: []
+executables:
+- wgit
 extensions: []
 extra_rdoc_files: []
 files:
@@ -219,6 +220,7 @@ files:
 - CONTRIBUTING.md
 - LICENSE.txt
 - README.md
+- bin/wgit
 homepage: https://github.com/michaeltelford/wgit
 licenses:
 - MIT
@@ -229,7 +231,7 @@ metadata:
   bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
   documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
   allowed_push_host: https://rubygems.org
-post_install_message:
+post_install_message: Added the 'wgit' executable to $PATH
 rdoc_options: []
 require_paths:
 - lib
@@ -247,6 +249,6 @@ requirements: []
 rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
-summary: Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an
-  easy to use API for programmatic URL parsing, HTML indexing and searching.
+summary: Wgit is a Ruby library primarily used for crawling, indexing and searching
+  HTML webpages.
 test_files: []