wgit 0.6.0 → 0.7.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +12 -0
- data/README.md +24 -15
- data/bin/wgit +35 -0
- data/lib/wgit/crawler.rb +3 -3
- data/lib/wgit/database/database.rb +1 -1
- data/lib/wgit/document.rb +15 -4
- data/lib/wgit/indexer.rb +14 -5
- data/lib/wgit/version.rb +2 -2
- metadata +9 -7
    
        checksums.yaml
    CHANGED
    
    | @@ -1,7 +1,7 @@ | |
| 1 1 | 
             
            ---
         | 
| 2 2 | 
             
            SHA256:
         | 
| 3 | 
            -
              metadata.gz:  | 
| 4 | 
            -
              data.tar.gz:  | 
| 3 | 
            +
              metadata.gz: 29d37a4a0f013fec64625d8fe5798ae2d062ae6f213811c51d223de311e16707
         | 
| 4 | 
            +
              data.tar.gz: 213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8
         | 
| 5 5 | 
             
            SHA512:
         | 
| 6 | 
            -
              metadata.gz:  | 
| 7 | 
            -
              data.tar.gz:  | 
| 6 | 
            +
              metadata.gz: acd321f3ba039e6f54dd8a36a3e4ebec1fb40f1cda5ee1c982df4be22ee6d463f829c72f011ff959a4a4d2651676dc2d31866a273b60d3e5e630ccf77b3d7cbe
         | 
| 7 | 
            +
              data.tar.gz: d0908d28e6fdaec440209479f75945672807cf3e9359fb8bd8f6cc9de45568a341ac0204ba5f50a2f6569b4a29f4f7ac3088353a35f2c5091a567af469027aab
         | 
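
The new digests can be checked against a downloaded copy of the gem. The following is a minimal sketch, assuming the `.gem` archive has already been unpacked so that `data.tar.gz` exists on disk. `digest_matches?` is a hypothetical helper name (not part of Wgit or RubyGems), and only Ruby's standard `digest` library is used.

```ruby
require 'digest'

# Hypothetical helper: compares a file's SHA256 hex digest with the value
# listed in checksums.yaml. Returns true when they are identical.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Example call (assumes data.tar.gz was extracted from wgit-0.7.0.gem):
# digest_matches?(
#   'data.tar.gz',
#   '213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8'
# )
```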
    
        data/CHANGELOG.md
    CHANGED
    
    | @@ -9,6 +9,18 @@ | |
| 9 9 | 
             
            - ...
         | 
| 10 10 | 
             
            ---
         | 
| 11 11 |  | 
| 12 | 
            +
            ## v0.7.0
         | 
| 13 | 
            +
            ### Added
         | 
| 14 | 
            +
            - `Wgit::Indexer.new` optional `crawler:` named param.
         | 
| 15 | 
            +
            - `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
         | 
| 16 | 
            +
            - `Document.extensions` returning a Set of all defined extensions.
         | 
| 17 | 
            +
            ### Changed/Removed
         | 
| 18 | 
            +
            - Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search` `Wgit.indexed_search` etc. This brings back more relevant search results by default.
         | 
| 19 | 
            +
            - Updated the Docker image to now include index names; making it easier to identify them.
         | 
| 20 | 
            +
            ### Fixed
         | 
| 21 | 
            +
            - ...
         | 
| 22 | 
            +
            ---
         | 
| 23 | 
            +
             | 
| 12 24 | 
             
            ## v0.6.0
         | 
| 13 25 | 
             
            ### Added
         | 
| 14 26 | 
             
            - Added `Wgit::Utils.proces_arr encode:` param.
         | 
    
        data/README.md
    CHANGED
    
    | @@ -8,11 +8,11 @@ | |
| 8 8 |  | 
| 9 9 | 
             
            ---
         | 
| 10 10 |  | 
| 11 | 
            -
            Wgit is a Ruby  | 
| 11 | 
            +
            Wgit is a Ruby library primarily used for crawling, indexing and searching HTML webpages.
         | 
| 12 12 |  | 
| 13 | 
            -
            Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to  | 
| 13 | 
            +
            Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to scrape entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
         | 
| 14 14 |  | 
| 15 | 
            -
            Check out this [ | 
| 15 | 
            +
            Check out this [demo application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
         | 
| 16 16 |  | 
| 17 17 | 
             
            Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
         | 
| 18 18 |  | 
| @@ -51,6 +51,10 @@ Or install it yourself as: | |
| 51 51 |  | 
| 52 52 | 
             
                $ gem install wgit
         | 
| 53 53 |  | 
| 54 | 
            +
            Verify the install by using the executable (to start a shell session):
         | 
| 55 | 
            +
             | 
| 56 | 
            +
                $ wgit
         | 
| 57 | 
            +
             | 
| 54 58 | 
             
            ## Basic Usage
         | 
| 55 59 |  | 
| 56 60 | 
             
            ```ruby
         | 
| @@ -271,11 +275,11 @@ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_ | |
| 271 275 |  | 
| 272 276 | 
             
            Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
         | 
| 273 277 |  | 
| 274 | 
            -
             | 
| 278 | 
            +
            Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
         | 
| 275 279 |  | 
| 276 | 
            -
            ###  | 
| 280 | 
            +
            ### Serialising Additional Page Elements via Document Extensions
         | 
| 277 281 |  | 
| 278 | 
            -
            You can define a Document extension for each HTML element(s) that you want to extract into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined,  | 
| 282 | 
            +
            You can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
         | 
| 279 283 |  | 
| 280 284 | 
             
            Once the page element has been serialised, you can do with it as you wish e.g. obtain it's text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
         | 
| 281 285 |  | 
| @@ -296,6 +300,7 @@ Wgit::Document.define_extension( | |
| 296 300 | 
             
            end
         | 
| 297 301 |  | 
| 298 302 | 
             
            # Our Document has a table which we're interested in.
         | 
| 303 | 
            +
            # Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
         | 
| 299 304 | 
             
            doc = Wgit::Document.new(
         | 
| 300 305 | 
             
              'http://some_url.com',
         | 
| 301 306 | 
             
              <<~HTML
         | 
| @@ -324,8 +329,6 @@ doc.stats # => { | |
| 324 329 | 
             
            # }
         | 
| 325 330 | 
             
            ```
         | 
| 326 331 |  | 
| 327 | 
            -
            Wgit uses Document extensions to provide much of it's core serialising functionality, providing access to a webpage's text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
         | 
| 328 | 
            -
             | 
| 329 332 | 
             
            See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
         | 
| 330 333 |  | 
| 331 334 | 
             
            **Extension Notes**:
         | 
| @@ -339,16 +342,20 @@ See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michae | |
| 339 342 | 
             
            Below are some points to keep in mind when using Wgit:
         | 
| 340 343 |  | 
| 341 344 | 
             
            - All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
         | 
| 342 | 
            -
            - By default, up to 5 URL redirects will be followed; this is configurable however.
         | 
| 345 | 
            +
            - By default, up to 5 URL redirects will be followed; this is [configurable](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Crawler#redirect_limit-instance_method) however.
         | 
| 343 346 | 
             
            - IRI's (URL's containing non ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
         | 
| 344 347 |  | 
| 345 348 | 
             
            ## Executable
         | 
| 346 349 |  | 
| 347 | 
            -
             | 
| 350 | 
            +
            Installing the Wgit gem also adds the `wgit` executable to your `$PATH`. The executable launches an interactive shell session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
         | 
| 351 | 
            +
             | 
| 352 | 
            +
            The `wgit` executable does the following things (in order):
         | 
| 348 353 |  | 
| 349 | 
            -
             | 
| 354 | 
            +
            1. `require wgit`
         | 
| 355 | 
            +
            2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, which ever is found first)
         | 
| 356 | 
            +
            3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
         | 
| 350 357 |  | 
| 351 | 
            -
             | 
| 358 | 
            +
            The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session. **Note** that variables should either be instance variables (e.g. `@url`) or be accessed via a getter method (e.g. `def url; ...; end`).
         | 
| 352 359 |  | 
| 353 360 | 
             
            ## Change Log
         | 
| 354 361 |  | 
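
The README's note about instance variables follows from how Ruby scopes `eval`. The `wgit` executable added in this release calls `eval(File.read(path))` from inside a method (`eval_wgit`), so local variables defined in `.wgit.rb` are lost when that method returns, while instance variables and methods survive. Below is a minimal sketch of that behaviour; `Session` and `load_source` are hypothetical stand-ins and not part of Wgit.

```ruby
# Hypothetical stand-in for the shell session that evaluates .wgit.rb.
class Session
  # Mirrors calling eval(File.read(path)) from inside a method.
  def load_source(source)
    eval(source)
  end
end

session = Session.new
session.load_source(<<~RUBY)
  local_url = 'http://example.com/local' # local: gone once load_source returns
  @url = 'http://example.com/ivar'       # instance variable: kept on the session
  def url                                # method: defined on Session, stays callable
    @url
  end
RUBY

puts session.url
```

Only `@url` and the `url` method remain reachable after the call, which is why the README recommends instance variables or getter methods.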
| @@ -390,10 +397,12 @@ And you're good to go! | |
| 390 397 |  | 
| 391 398 | 
             
            Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
         | 
| 392 399 |  | 
| 393 | 
            -
            Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker.
         | 
| 394 | 
            -
             | 
| 395 | 
            -
            Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset). You can also run `toys console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
         | 
| 400 | 
            +
            Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset that doesn't require a database).
         | 
| 396 401 |  | 
| 397 402 | 
             
            To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
         | 
| 398 403 |  | 
| 399 404 | 
             
            To install this gem onto your local machine, run `toys install`.
         | 
| 405 | 
            +
             | 
| 406 | 
            +
            ### Console
         | 
| 407 | 
            +
             | 
| 408 | 
            +
            You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and `.wgit.rb` file which gets loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.
         | 
    
        data/bin/wgit
    ADDED
    
    | @@ -0,0 +1,35 @@ | |
| 1 | 
            +
            #!/usr/bin/env ruby
         | 
| 2 | 
            +
             | 
| 3 | 
            +
            require 'wgit'
         | 
| 4 | 
            +
             | 
| 5 | 
            +
            # Eval .wgit.rb file (if it exists).
         | 
| 6 | 
            +
            def eval_wgit
         | 
| 7 | 
            +
              puts 'Searching for .wgit.rb in local and home directories...'
         | 
| 8 | 
            +
             | 
| 9 | 
            +
              ['.', Dir.home].each do |dir|
         | 
| 10 | 
            +
                path = "#{dir}/.wgit.rb"
         | 
| 11 | 
            +
                next unless File.exist?(path)
         | 
| 12 | 
            +
             | 
| 13 | 
            +
                puts "Eval'ing #{path} (call `eval_wgit` after changes)"
         | 
| 14 | 
            +
                eval(File.read(path))
         | 
| 15 | 
            +
                break
         | 
| 16 | 
            +
              end
         | 
| 17 | 
            +
            end
         | 
| 18 | 
            +
             | 
| 19 | 
            +
            eval_wgit
         | 
| 20 | 
            +
            puts "\n#{Wgit.version_str}\n\n"
         | 
| 21 | 
            +
             | 
| 22 | 
            +
            # Use Pry if installed or fall back to IRB.
         | 
| 23 | 
            +
            begin
         | 
| 24 | 
            +
              require 'pry'
         | 
| 25 | 
            +
              klass = Pry
         | 
| 26 | 
            +
            rescue LoadError
         | 
| 27 | 
            +
              require 'irb'
         | 
| 28 | 
            +
              klass = IRB
         | 
| 29 | 
            +
             | 
| 30 | 
            +
              puts "Starting IRB because Pry isn't installed."
         | 
| 31 | 
            +
            end
         | 
| 32 | 
            +
             | 
| 33 | 
            +
            klass.start
         | 
| 34 | 
            +
             | 
| 35 | 
            +
            puts 'Interactive session complete.'
         | 
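
The lookup in `eval_wgit` stops at the first directory containing a `.wgit.rb` file, so a local file shadows the one in the home directory. A minimal sketch of the same first-match rule, without the `eval`; `first_wgit_rb` is a hypothetical helper name used for illustration only.

```ruby
# Hypothetical helper: returns the path of the first .wgit.rb found in the
# given directories (searched in order), or nil when none exists.
def first_wgit_rb(dirs)
  dirs
    .map { |dir| File.join(dir, '.wgit.rb') }
    .find { |path| File.exist?(path) }
end

# Same search order as the executable: local directory, then home.
# first_wgit_rb(['.', Dir.home])
```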
    
        data/lib/wgit/crawler.rb
    CHANGED
    
    | @@ -19,9 +19,9 @@ module Wgit | |
| 19 19 | 
             
                # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
         | 
| 20 20 | 
             
                # doesn't keep the crawl of the site going. All URL's without a file
         | 
| 21 21 | 
             
                # extension will be crawled, because they're assumed to be HTML.
         | 
| 22 | 
            -
                SUPPORTED_FILE_EXTENSIONS = Set.new( | 
| 23 | 
            -
                  asp aspx cfm cgi htm html htmlx jsp php
         | 
| 24 | 
            -
                 | 
| 22 | 
            +
                SUPPORTED_FILE_EXTENSIONS = Set.new(
         | 
| 23 | 
            +
                  %w[asp aspx cfm cgi htm html htmlx jsp php]
         | 
| 24 | 
            +
                )
         | 
| 25 25 |  | 
| 26 26 | 
             
                # The amount of allowed redirects before raising an error. Set to 0 to
         | 
| 27 27 | 
             
                # disable redirects completely; or you can pass `follow_redirects: false`
         | 
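
The comment on `SUPPORTED_FILE_EXTENSIONS` says URLs without a file extension are crawled because they are assumed to be HTML. The sketch below illustrates that rule with a `Set` lookup. `crawlable_path?` is a hypothetical helper, and Wgit's own check may differ (for example in how it treats letter case).

```ruby
require 'set'

SUPPORTED_FILE_EXTENSIONS = Set.new(
  %w[asp aspx cfm cgi htm html htmlx jsp php]
)

# Hypothetical helper, not Wgit's API: true when the path has no file
# extension (assumed to be HTML) or has one of the supported extensions.
def crawlable_path?(path)
  ext = File.extname(path).delete_prefix('.').downcase
  ext.empty? || SUPPORTED_FILE_EXTENSIONS.include?(ext)
end

puts crawlable_path?('/about')      # no extension, treated as HTML
puts crawlable_path?('/index.html') # supported extension
puts crawlable_path?('/logo.png')   # unsupported extension
```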
    
        data/lib/wgit/database/database.rb
    CHANGED
    
    | @@ -154,7 +154,7 @@ module Wgit | |
| 154 154 | 
             
                #   DB.
         | 
| 155 155 | 
             
                # @return [Array<Wgit::Document>] The search results obtained from the DB.
         | 
| 156 156 | 
             
                def search(
         | 
| 157 | 
            -
                  query, case_sensitive: false, whole_sentence:  | 
| 157 | 
            +
                  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
         | 
| 158 158 | 
             
                )
         | 
| 159 159 | 
             
                  query = query.to_s.strip
         | 
| 160 160 | 
             
                  query.replace('"' + query + '"') if whole_sentence
         | 
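
The effect of the new `whole_sentence: true` default is visible in the last two context lines: the query is stripped and then wrapped in double quotes. Assuming the query is passed to a MongoDB text search, a quoted term is matched as an exact phrase rather than as separate words. A minimal sketch of that step, where `build_query` is a hypothetical helper and not a Wgit method:

```ruby
# Hypothetical helper mirroring the two lines shown in the hunk: strip the
# query, then quote it when a whole sentence (exact phrase) match is wanted.
def build_query(query, whole_sentence: true)
  query = query.to_s.strip
  query = "\"#{query}\"" if whole_sentence
  query
end

puts build_query(' ruby gems ')                       # quoted phrase
puts build_query('ruby gems', whole_sentence: false)  # separate words
```

Callers that depend on the 0.6.0 behaviour can pass `whole_sentence: false` explicitly.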
    
        data/lib/wgit/document.rb
    CHANGED
    
    | @@ -3,6 +3,7 @@ require_relative 'utils' | |
| 3 3 | 
             
            require_relative 'assertable'
         | 
| 4 4 | 
             
            require 'nokogiri'
         | 
| 5 5 | 
             
            require 'json'
         | 
| 6 | 
            +
            require 'set'
         | 
| 6 7 |  | 
| 7 8 | 
             
            module Wgit
         | 
| 8 9 | 
             
              # Class primarily modeling a HTML web document, although other MIME types
         | 
| @@ -22,6 +23,14 @@ module Wgit | |
| 22 23 | 
             
                # The xpath used to extract the visible text on a page.
         | 
| 23 24 | 
             
                TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
         | 
| 24 25 |  | 
| 26 | 
            +
                # Set of Symbols representing the defined Document extensions.
         | 
| 27 | 
            +
                @extensions = Set.new
         | 
| 28 | 
            +
             | 
| 29 | 
            +
                class << self
         | 
| 30 | 
            +
                  # Class level attr_reader for the Document defined extensions.
         | 
| 31 | 
            +
                  attr_reader :extensions
         | 
| 32 | 
            +
                end
         | 
| 33 | 
            +
             | 
| 25 34 | 
             
                # The URL of the webpage, an instance of Wgit::Url.
         | 
| 26 35 | 
             
                attr_reader :url
         | 
| 27 36 |  | 
| @@ -120,7 +129,7 @@ module Wgit | |
| 120 129 | 
             
                    result = find_in_html(xpath, opts, &block)
         | 
| 121 130 | 
             
                    init_var(var, result)
         | 
| 122 131 | 
             
                  end
         | 
| 123 | 
            -
                  Document.send | 
| 132 | 
            +
                  Document.send(:private, func_name)
         | 
| 124 133 |  | 
| 125 134 | 
             
                  # Define the private init_*_from_object method for a Database object.
         | 
| 126 135 | 
             
                  # Gets the Object's 'key' value and creates a var for it.
         | 
| @@ -128,8 +137,9 @@ module Wgit | |
| 128 137 | 
             
                    result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
         | 
| 129 138 | 
             
                    init_var(var, result)
         | 
| 130 139 | 
             
                  end
         | 
| 131 | 
            -
                  Document.send | 
| 140 | 
            +
                  Document.send(:private, func_name)
         | 
| 132 141 |  | 
| 142 | 
            +
                  @extensions << var
         | 
| 133 143 | 
             
                  var
         | 
| 134 144 | 
             
                end
         | 
| 135 145 |  | 
| @@ -144,6 +154,7 @@ module Wgit | |
| 144 154 | 
             
                  Document.send(:remove_method, "init_#{var}_from_html")
         | 
| 145 155 | 
             
                  Document.send(:remove_method, "init_#{var}_from_object")
         | 
| 146 156 |  | 
| 157 | 
            +
                  @extensions.delete(var.to_sym)
         | 
| 147 158 | 
             
                  true
         | 
| 148 159 | 
             
                rescue NameError
         | 
| 149 160 | 
             
                  false
         | 
| @@ -366,7 +377,7 @@ module Wgit | |
| 366 377 | 
             
                #   sentence.
         | 
| 367 378 | 
             
                # @return [Array<String>] A subset of @text, matching the query.
         | 
| 368 379 | 
             
                def search(
         | 
| 369 | 
            -
                  query, case_sensitive: false, whole_sentence:  | 
| 380 | 
            +
                  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
         | 
| 370 381 | 
             
                )
         | 
| 371 382 | 
             
                  query = query.to_s
         | 
| 372 383 | 
             
                  raise 'A search query must be provided' if query.empty?
         | 
| @@ -409,7 +420,7 @@ module Wgit | |
| 409 420 | 
             
                #   sentence.
         | 
| 410 421 | 
             
                # @return [String] This Document's original @text value.
         | 
| 411 422 | 
             
                def search!(
         | 
| 412 | 
            -
                  query, case_sensitive: false, whole_sentence:  | 
| 423 | 
            +
                  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
         | 
| 413 424 | 
             
                )
         | 
| 414 425 | 
             
                  orig_text = @text
         | 
| 415 426 | 
             
                  @text = search(
         | 
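
`Document.extensions` is backed by a class-level instance variable exposed through `class << self; attr_reader :extensions; end`. The sketch below shows that registry pattern in isolation. `Doc` and its simplified `define_extension`/`remove_extension` methods are hypothetical stand-ins that only track names; they do not define any serialising methods as Wgit does.

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Document, showing only the registry.
class Doc
  # Class-level instance variable: belongs to Doc itself, not to instances.
  @extensions = Set.new

  class << self
    # Class-level reader, so callers can use Doc.extensions.
    attr_reader :extensions

    # Records the extension name and returns it.
    def define_extension(var)
      @extensions << var.to_sym
      var
    end

    # Removes the name; returns true if it was registered, false otherwise.
    def remove_extension(var)
      !@extensions.delete?(var.to_sym).nil?
    end
  end
end

Doc.define_extension(:tables)
puts Doc.extensions.include?(:tables)
```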
    
        data/lib/wgit/indexer.rb
    CHANGED
    
    | @@ -40,6 +40,10 @@ module Wgit | |
| 40 40 | 
             
              #   nil to use ENV['WGIT_CONNECTION_STRING'].
         | 
| 41 41 | 
             
              # @param insert_externals [Boolean] Whether or not to insert the website's
         | 
| 42 42 | 
             
              #   external Url's into the database.
         | 
| 43 | 
            +
              # @param allow_paths [String, Array<String>] Filters links by selecting
         | 
| 44 | 
            +
              #   them if their path `File.fnmatch?` one of allow_paths.
         | 
| 45 | 
            +
              # @param disallow_paths [String, Array<String>] Filters links by rejecting
         | 
| 46 | 
            +
              #   them if their path `File.fnmatch?` one of disallow_paths.
         | 
| 43 47 | 
             
              # @yield [doc] Given the Wgit::Document of each crawled webpage, before it's
         | 
| 44 48 | 
             
              #   inserted into the database allowing for prior manipulation.
         | 
| 45 49 | 
             
              # @return [Integer] The total number of pages crawled within the website.
         | 
| @@ -96,7 +100,7 @@ module Wgit | |
| 96 100 | 
             
              #   database.
         | 
| 97 101 | 
             
              def self.indexed_search(
         | 
| 98 102 | 
             
                query, connection_string: nil,
         | 
| 99 | 
            -
                case_sensitive: false, whole_sentence:  | 
| 103 | 
            +
                case_sensitive: false, whole_sentence: true,
         | 
| 100 104 | 
             
                limit: 10, skip: 0, sentence_limit: 80, &block
         | 
| 101 105 | 
             
              )
         | 
| 102 106 | 
             
                db = Wgit::Database.new(connection_string)
         | 
| @@ -122,7 +126,7 @@ module Wgit | |
| 122 126 | 
             
                Wgit::Utils.printf_search_results(results)
         | 
| 123 127 | 
             
              end
         | 
| 124 128 |  | 
| 125 | 
            -
              # Class which  | 
| 129 | 
            +
              # Class which crawls and saves the indexed Documents to a database.
         | 
| 126 130 | 
             
              class Indexer
         | 
| 127 131 | 
             
                # The crawler used to index the WWW.
         | 
| 128 132 | 
             
                attr_reader :crawler
         | 
| @@ -133,10 +137,11 @@ module Wgit | |
| 133 137 | 
             
                # Initialize the Indexer.
         | 
| 134 138 | 
             
                #
         | 
| 135 139 | 
             
                # @param database [Wgit::Database] The database instance (already
         | 
| 136 | 
            -
                #   initialized  | 
| 137 | 
            -
                 | 
| 138 | 
            -
             | 
| 140 | 
            +
                #   initialized and connected) used to index.
         | 
| 141 | 
            +
                # @param crawler [Wgit::Crawler] The crawler instance used to index.
         | 
| 142 | 
            +
                def initialize(database, crawler = Wgit::Crawler.new)
         | 
| 139 143 | 
             
                  @db      = database
         | 
| 144 | 
            +
                  @crawler = crawler
         | 
| 140 145 | 
             
                end
         | 
| 141 146 |  | 
| 142 147 | 
             
                # Retrieves uncrawled url's from the database and recursively crawls each
         | 
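
As the signature appears in this hunk, `crawler` is an optional positional parameter with a default of `Wgit::Crawler.new`, so a custom crawler is passed as the second argument. A minimal sketch of the pattern using hypothetical stub classes (`StubCrawler`, `StubIndexer`) so that it runs without the gem or a database:

```ruby
# Hypothetical stand-ins; the real classes are Wgit::Crawler and Wgit::Indexer.
class StubCrawler
  attr_reader :redirect_limit

  def initialize(redirect_limit: 5)
    @redirect_limit = redirect_limit
  end
end

class StubIndexer
  attr_reader :crawler

  # Same shape as the initializer in the hunk: the crawler is optional.
  def initialize(database, crawler = StubCrawler.new)
    @db      = database
    @crawler = crawler
  end
end

default_indexer = StubIndexer.new(:db)
custom_indexer  = StubIndexer.new(:db, StubCrawler.new(redirect_limit: 0))

puts default_indexer.crawler.redirect_limit
puts custom_indexer.crawler.redirect_limit
```

Injecting the crawler this way lets a preconfigured or test crawler be reused by the indexer.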
| @@ -214,6 +219,10 @@ the next iteration.") | |
| 214 219 | 
             
                # @param url [Wgit::Url] The base Url of the website to crawl.
         | 
| 215 220 | 
             
                # @param insert_externals [Boolean] Whether or not to insert the website's
         | 
| 216 221 | 
             
                #   external Url's into the database.
         | 
| 222 | 
            +
                # @param allow_paths [String, Array<String>] Filters links by selecting
         | 
| 223 | 
            +
                #   them if their path `File.fnmatch?` one of allow_paths.
         | 
| 224 | 
            +
                # @param disallow_paths [String, Array<String>] Filters links by rejecting
         | 
| 225 | 
            +
                #   them if their path `File.fnmatch?` one of disallow_paths.
         | 
| 217 226 | 
             
                # @yield [doc] Given the Wgit::Document of each crawled web page before
         | 
| 218 227 | 
             
                #   it's inserted into the database allowing for prior manipulation. Return
         | 
| 219 228 | 
             
                #   nil or false from the block to prevent the document from being saved
         | 
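
The `allow_paths` and `disallow_paths` params are documented as filtering links with `File.fnmatch?`. The sketch below shows how such glob filtering could behave on plain path strings. `filter_paths` is a hypothetical helper, and the way Wgit combines the two params internally is an assumption here.

```ruby
# Hypothetical helper: keeps paths matching any allow pattern (when given),
# then drops paths matching any disallow pattern.
def filter_paths(paths, allow_paths: nil, disallow_paths: nil)
  allow    = Array(allow_paths)
  disallow = Array(disallow_paths)

  unless allow.empty?
    paths = paths.select { |path| allow.any? { |glob| File.fnmatch?(glob, path) } }
  end

  paths.reject { |path| disallow.any? { |glob| File.fnmatch?(glob, path) } }
end

paths = %w[about blog/post-1 blog/private-notes]
p filter_paths(paths, allow_paths: 'blog/*', disallow_paths: 'blog/private*')
```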
    
        data/lib/wgit/version.rb
    CHANGED
    
    | @@ -1,11 +1,11 @@ | |
| 1 1 | 
             
            # frozen_string_literal: true
         | 
| 2 2 |  | 
| 3 3 | 
             
            # Wgit is a WWW indexer/scraper which crawls URL's and retrieves their page
         | 
| 4 | 
            -
            # contents for later use | 
| 4 | 
            +
            # contents for later use.
         | 
| 5 5 | 
             
            # @author Michael Telford
         | 
| 6 6 | 
             
            module Wgit
         | 
| 7 7 | 
             
              # The current gem version of Wgit.
         | 
| 8 | 
            -
              VERSION = '0. | 
| 8 | 
            +
              VERSION = '0.7.0'
         | 
| 9 9 |  | 
| 10 10 | 
             
              # Returns the current gem version of Wgit as a String.
         | 
| 11 11 | 
             
              def self.version
         | 
    
        metadata
    CHANGED
    
    | @@ -1,14 +1,14 @@ | |
| 1 1 | 
             
            --- !ruby/object:Gem::Specification
         | 
| 2 2 | 
             
            name: wgit
         | 
| 3 3 | 
             
            version: !ruby/object:Gem::Version
         | 
| 4 | 
            -
              version: 0. | 
| 4 | 
            +
              version: 0.7.0
         | 
| 5 5 | 
             
            platform: ruby
         | 
| 6 6 | 
             
            authors:
         | 
| 7 7 | 
             
            - Michael Telford
         | 
| 8 8 | 
             
            autorequire: 
         | 
| 9 9 | 
             
            bindir: bin
         | 
| 10 10 | 
             
            cert_chain: []
         | 
| 11 | 
            -
            date:  | 
| 11 | 
            +
            date: 2020-01-04 00:00:00.000000000 Z
         | 
| 12 12 | 
             
            dependencies:
         | 
| 13 13 | 
             
            - !ruby/object:Gem::Dependency
         | 
| 14 14 | 
             
              name: addressable
         | 
| @@ -185,7 +185,7 @@ dependencies: | |
| 185 185 | 
             
                  - !ruby/object:Gem::Version
         | 
| 186 186 | 
             
                    version: '1.0'
         | 
| 187 187 | 
             
            description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s to
         | 
| 188 | 
            -
              retrieve and serialise their page contents for later use. You can use Wgit to  | 
| 188 | 
            +
              retrieve and serialise their page contents for later use. You can use Wgit to scrape
         | 
| 189 189 | 
             
              entire websites if required. Wgit also provides a means to search indexed documents
         | 
| 190 190 | 
             
              stored in a database. Therefore, this library provides the main components of a
         | 
| 191 191 | 
             
              WWW search engine. The Wgit API is easily extended allowing you to pull out the
         | 
| @@ -195,7 +195,8 @@ description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s | |
| 195 195 |  | 
| 196 196 | 
             
              '
         | 
| 197 197 | 
             
            email: michael.telford@live.com
         | 
| 198 | 
            -
            executables: | 
| 198 | 
            +
            executables:
         | 
| 199 | 
            +
            - wgit
         | 
| 199 200 | 
             
            extensions: []
         | 
| 200 201 | 
             
            extra_rdoc_files: []
         | 
| 201 202 | 
             
            files:
         | 
| @@ -219,6 +220,7 @@ files: | |
| 219 220 | 
             
            - CONTRIBUTING.md
         | 
| 220 221 | 
             
            - LICENSE.txt
         | 
| 221 222 | 
             
            - README.md
         | 
| 223 | 
            +
            - bin/wgit
         | 
| 222 224 | 
             
            homepage: https://github.com/michaeltelford/wgit
         | 
| 223 225 | 
             
            licenses:
         | 
| 224 226 | 
             
            - MIT
         | 
| @@ -229,7 +231,7 @@ metadata: | |
| 229 231 | 
             
              bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
         | 
| 230 232 | 
             
              documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
         | 
| 231 233 | 
             
              allowed_push_host: https://rubygems.org
         | 
| 232 | 
            -
            post_install_message: 
         | 
| 234 | 
            +
            post_install_message: Added the 'wgit' executable to $PATH
         | 
| 233 235 | 
             
            rdoc_options: []
         | 
| 234 236 | 
             
            require_paths:
         | 
| 235 237 | 
             
            - lib
         | 
| @@ -247,6 +249,6 @@ requirements: [] | |
| 247 249 | 
             
            rubygems_version: 3.0.6
         | 
| 248 250 | 
             
            signing_key: 
         | 
| 249 251 | 
             
            specification_version: 4
         | 
| 250 | 
            -
            summary: Wgit is a Ruby  | 
| 251 | 
            -
               | 
| 252 | 
            +
            summary: Wgit is a Ruby library primarily used for crawling, indexing and searching
         | 
| 253 | 
            +
              HTML webpages.
         | 
| 252 254 | 
             
            test_files: []
         |