wgit 0.0.10 → 0.0.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: c4dc572e7d48a95d423e175ad2d6a791be52bf56f6c391e9152c075f45672ee8
- data.tar.gz: fd6a9c9d1e38906f500543ae92169f7bcb9e64de1567f4f29ec24f7ca74c60d8
+ metadata.gz: '0929be93cf79c3ca942c78607c9bfae6ef53d3d2b033964ac7348e1d03b8f113'
+ data.tar.gz: e9a24df51634b84bc2c17790ccff5c04ded32852f538bdc81bbc90806c20008c
  SHA512:
- metadata.gz: f4fc425aa1b25254dba343151794893ff26a3682e58dd08bdc918c180da89ecb08cf0f1013837e41527f564d36bb0784c6ecd204e53c1276cb5b32401c88ffab
- data.tar.gz: 2ce1250ad7312257bc021e7414f164ec174e07e469a6956478cccaa7b05f159981a7a69ef6b713db5454233c8a03ea8487f14184c1e26dcab28ed8e81250507d
+ metadata.gz: 95ffe388b6d4ddc7771be4db2ea1ed7579c2e82d7a76b7b2950658c7114d1d14af0ff80817fcc4a7ac6ba9c6094d764d24549e18d283293edb63ec1fb6f9837b
+ data.tar.gz: 74d96db70834054c757c2429ee3bfce5b05f5d2a783f513b92dfc7ec185d69cf0c5f008e01227543f59acef70f87e1f0e81eadec2a607d2f3a3fc9d44d5c2641
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2019 Michael Telford
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,334 @@
+ # Wgit
+
+ Wgit is a Ruby gem similar in nature to GNU's `wget`. It provides an easy-to-use API for programmatic web scraping, indexing and searching.
+
+ Fundamentally, Wgit is a WWW indexer/scraper which crawls URLs, then retrieves and serialises their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended, allowing you to pull out the parts of a webpage that are important to you, such as code snippets or images. As Wgit is a library, it has uses in many different types of application.
+
+ Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine built using Wgit and Sinatra, deployed to Heroku.
+
+ ## Table Of Contents
+
+ 1. [Installation](#Installation)
+ 2. [Basic Usage](#Basic-Usage)
+ 3. [Documentation](#Documentation)
+ 4. [Practical Examples](#Practical-Examples)
+ 5. [Practical Database Example](#Practical-Database-Example)
+ 6. [Extending The API](#Extending-The-API)
+ 7. [Caveats](#Caveats)
+ 8. [Executable](#Executable)
+ 9. [Development](#Development)
+ 10. [Contributing](#Contributing)
+ 11. [License](#License)
+
+ ## Installation
+
+ Add this line to your application's `Gemfile`:
+
+ ```ruby
+ gem 'wgit'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install wgit
+
+ ## Basic Usage
+
+ The example below shows the Wgit API in action and gives an idea of how you can use Wgit in your own code.
+
+ ```ruby
+ require 'wgit'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new "https://wikileaks.org/What-is-Wikileaks.html"
+
+ doc = crawler.crawl url
+
+ doc.class # => Wgit::Document
+ doc.stats # => {
+ # :url=>44, :html=>28133, :title=>17, :keywords=>0,
+ # :links=>35, :text_length=>67, :text_bytes=>13735
+ #}
+
+ # doc responds to the following methods:
+ Wgit::Document.instance_methods(false).sort # => [
+ # :==, :[], :author, :css, :date_crawled, :doc, :empty?, :external_links,
+ # :external_urls, :html, :internal_full_links, :internal_links, :keywords,
+ # :links, :relative_full_links, :relative_full_urls, :relative_links,
+ # :relative_urls, :score, :search, :search!, :size, :stats, :text, :title,
+ # :to_h, :to_hash, :to_json, :url, :xpath
+ #]
+
+ results = doc.search "corruption"
+ results.first # => "ial materials involving war, spying and corruption.
+ # It has so far published more"
+ ```
+
+ ## Documentation
+
+ To see what's possible with the Wgit gem, see the [docs](https://www.rubydoc.info/gems/wgit) or the [Practical Examples](#Practical-Examples) section below.
+
+ ## Practical Examples
+
+ Below are some practical examples of Wgit in use. You can copy and run the code for yourself.
+
+ ### WWW HTML Indexer
+
+ See the `Wgit::Indexer#index_the_web` documentation and source code for an already built example of a WWW HTML indexer. It will crawl any external URLs (in the database) and index their markup for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it! A minimal, hedged sketch of kicking it off is shown below.
+
+ See the [Practical Database Example](#Practical-Database-Example) for information on how to set up a database for use with Wgit.
+
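+ The following is only a sketch, not the gem's own script: it assumes `Wgit::Indexer.new` accepts a `Wgit::Database` instance and that the connection details have already been set (check the `Wgit::Indexer` docs/source for the exact constructor and method signatures in your version).
+
+ ```ruby
+ require 'wgit'
+
+ # Assumes Wgit.set_connection_details has already been called
+ # (see the Practical Database Example below).
+ db      = Wgit::Database.new
+ indexer = Wgit::Indexer.new db # Assumption: the indexer takes the database.
+
+ # Crawls external URLs stored in the database and indexes their markup.
+ # As noted above, it will crawl indefinitely if you let it.
+ indexer.index_the_web
+ ```
+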
+ ### Website Downloader
+
+ Wgit uses itself to download and save fixture webpages to disk (used in its tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes. A simplified, hedged sketch of the same idea follows.
+
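+ This sketch is not the script linked above; it only uses the API already shown in this README, and the URL and output paths are made up for illustration.
+
+ ```ruby
+ require 'wgit'
+ require 'fileutils'
+
+ crawler = Wgit::Crawler.new
+ base    = Wgit::Url.new "https://wikileaks.org/What-is-Wikileaks.html"
+
+ FileUtils.mkdir_p "site"
+
+ # Save the starting page, then crawl and save each of its internal links.
+ doc = crawler.crawl base
+ File.write "site/index.html", doc.html
+
+ doc.internal_full_links.each_with_index do |link, i|
+   page = crawler.crawl link
+   File.write "site/page_#{i}.html", page.html if page
+ end
+ ```
+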
+ ### CSS Indexer
+
+ The script below downloads the contents of the first CSS link found on Facebook's index page.
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
+
+ crawler = Wgit::Crawler.new
+ url = "https://www.facebook.com".to_url
+
+ doc = crawler.crawl url
+
+ # Provide your own XPath (or CSS selector) to search the HTML using Nokogiri underneath.
+ hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
+
+ hrefs.class # => Nokogiri::XML::NodeSet
+ href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
+
+ css = crawler.crawl href.to_url
+ css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
+ ```
+
+ ### Keyword Indexer (SEO Helper)
+
+ The script below downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketeers for search engine optimisation, for example.
+
+ ```ruby
+ require 'wgit'
+
+ my_pages_keywords = ["Everest", "mountaineering school", "adventure"]
+ my_pages_missing_keywords = []
+
+ competitor_urls = [
+   "http://altitudejunkies.com",
+   "http://www.mountainmadness.com",
+   "http://www.adventureconsultants.com"
+ ]
+
+ crawler = Wgit::Crawler.new competitor_urls
+
+ crawler.crawl do |doc|
+   # If the page defines keywords (an Array, which responds to #-).
+   if doc.keywords.respond_to? :-
+     puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
+     my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
+   end
+ end
+
+ if my_pages_missing_keywords.empty?
+   puts "Your pages are missing no keywords, nice one!"
+ else
+   puts "Your pages compared to your competitors are missing the following keywords:"
+   puts my_pages_missing_keywords.uniq
+ end
+ ```
+
+ ## Practical Database Example
+
+ This next example requires a configured database instance.
+
+ Currently the only supported DBMS is MongoDB. See [mLab](https://mlab.com) for a free (small) account or provide your own MongoDB instance.
+
+ The currently supported versions of `mongo` are:
+
+ | Gem      | Database Engine |
+ | -------- | --------------- |
+ | ~> 2.8.0 | 3.6.12 (MMAPv1) |
+
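+ If you manage the driver dependency yourself, a `Gemfile` entry like the following (illustrative only) pins it to the supported series:
+
+ ```ruby
+ # Pin the MongoDB Ruby driver to the series listed in the table above.
+ gem 'mongo', '~> 2.8.0'
+ ```
+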
+ ### Setting Up MongoDB
+
+ Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records; the use of a database is entirely optional when using Wgit. A scripted sketch of steps 1, 2 and 4 (using the `mongo` Ruby driver) follows the note below.
+
+ 1) Create collections for: `documents` and `urls`.
+ 2) Add a unique index for the `url` field in **both** collections.
+ 3) Enable `textSearchEnabled` in MongoDB's configuration.
+ 4) Create a *text search index* for the `documents` collection using:
+ ```json
+ {
+   "text": "text",
+   "author": "text",
+   "keywords": "text",
+   "title": "text"
+ }
+ ```
+ 5) Set the connection details for your MongoDB instance using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`).
+
+ **Note**: The *text search index* (in step 4) lists all of the document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should add to this list any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
+
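+ The sketch below uses the `mongo` Ruby driver directly and only illustrates steps 1, 2 and 4 above; the host, credentials and database name are placeholders, and step 3 (`textSearchEnabled`) is a server-side configuration setting that can't be applied from here.
+
+ ```ruby
+ require 'mongo'
+
+ client = Mongo::Client.new(
+   "mongodb://<username>:<password>@<host_machine>:27017/<database_name>"
+ )
+
+ # Step 1: create the two collections (raises if a collection already exists).
+ client[:documents].create
+ client[:urls].create
+
+ # Step 2: a unique index on the url field in both collections.
+ client[:documents].indexes.create_one({ url: 1 }, unique: true)
+ client[:urls].indexes.create_one({ url: 1 }, unique: true)
+
+ # Step 4: the text search index on the documents collection.
+ client[:documents].indexes.create_one(
+   text: "text", author: "text", keywords: "text", title: "text"
+ )
+ ```
+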
+ ### Database Example
+
+ The script below shows how to use Wgit's database functionality to index and then search HTML documents stored in the database.
+
+ If you're running the code below for yourself, remember to replace the Hash containing the connection details with your own.
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
+
+ # Here we create our own document rather than crawling the web.
+ # We pass the web page's URL and HTML Strings.
+ doc = Wgit::Document.new(
+   "http://test-url.com".to_url,
+   "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
+ )
+
+ # Set your connection details manually (as below) or from the environment using
+ # Wgit.set_connection_details_from_env
+ Wgit.set_connection_details(
+   'DB_HOST'     => '<host_machine>',
+   'DB_PORT'     => '27017',
+   'DB_USERNAME' => '<username>',
+   'DB_PASSWORD' => '<password>',
+   'DB_DATABASE' => '<database_name>',
+ )
+
+ db = Wgit::Database.new # Connects to the database...
+ db.insert doc
+
+ # Searching the database returns documents with matching text 'hits'.
+ query = "cow"
+ results = db.search query
+
+ doc.url == results.first.url # => true
+
+ # Searching the returned documents gives the matching lines of text from that document.
+ doc.search(query).first # => "How now brown cow."
+
+ db.insert doc.external_links
+
+ urls_to_crawl = db.uncrawled_urls # => Results will include doc.external_links.
+ ```
+
+ ## Extending The API
+
+ Indexing in Wgit is the means of downloading a web page and serialising parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return the value of the webpage's `meta[@name='author']` tag, as the short example below shows.
+
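+ A quick illustration of that default behaviour (the URL and HTML below are made up for the example):
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><head><meta name='author' content='Jane Smith'></head></html>"
+ )
+
+ doc.author # => "Jane Smith"
+ ```
+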
+ By default, Wgit indexes what it thinks are the most important pieces of information from each webpage. Of course, this is often not enough given how much webpages differ from one another. Therefore, there exists a set of ways to extend the default indexing logic.
+
+ There are two ways to extend the indexing behaviour of Wgit:
+
+ 1. Add the elements containing **text** that you're interested in to be indexed.
+ 2. Define custom indexers matched to specific **elements** that you're interested in.
+
+ Below describes these two methods in more detail.
+
+ ### 1. Extending The Default Text Elements
+
+ Wgit contains an array, `Wgit::Document.text_elements`, which defines the default set of webpage elements containing text; their text is indexed and made accessible via `Wgit::Document#text`.
+
+ If you'd like the text of additional webpage elements to be returned from `Wgit::Document#text`, then you can do the following:
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ # Let's add the text of links e.g. <a> tags.
+ Wgit::Document.text_elements << :a
+
+ # Our Document has a link whose text we're interested in.
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><p>Hello world!</p>\
+ <a href='https://made-up-link.com'>Click this link.</a></html>"
+ )
+
+ # Now all crawled Documents will contain all visible link text in Wgit::Document#text.
+ doc.text # => ["Hello world!", "Click this link."]
+ ```
+
+ **Note**: This only works for textual page content. For more control over the indexed elements themselves, see below.
+
+ ### 2. Defining Custom Indexers/Elements a.k.a. Virtual Attributes
+
+ If you want full control over the elements being indexed for your own purposes, then you can define a custom indexer for each type of element that you're interested in.
+
+ Once you have the indexed page element, accessed via a `Wgit::Document` instance method, you can do with it as you wish, e.g. obtain its text value or manipulate the element. Since the returned types are plain [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) objects, you have the full control that the Nokogiri gem gives you.
+
+ Here's how to add a custom indexer for a specific page element:
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ # Let's get all the page's table elements.
+ Wgit::Document.define_extension(
+   :tables,                  # Wgit::Document#tables will return the page's tables.
+   "//table",                # The xpath to extract the tables.
+   singleton: false,         # True returns the first table found, false returns all.
+   text_content_only: false, # True returns one or more Strings of the tables' text,
+                             # false returns the tables as Nokogiri objects (see below).
+ ) do |tables|
+   # Here we can manipulate the object(s) before they're set as Wgit::Document#tables.
+ end
+
+ # Our Document has a table which we're interested in.
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><p>Hello world!</p>\
+ <table><th>Header Text</th><th>Another Header</th></table></html>"
+ )
+
+ # Call our newly defined method to obtain the table data we're interested in.
+ tables = doc.tables
+
+ # Both the collection and each table within the collection are plain Nokogiri objects.
+ tables.class       # => Nokogiri::XML::NodeSet
+ tables.first.class # => Nokogiri::XML::Element
+ ```
+
+ **Extension Notes**:
+
+ - Any links should be mapped into `Wgit::Url` objects; Urls are treated as Strings when being inserted into the database. See the hedged sketch below this list.
+ - Any object (like a Nokogiri object) will not be inserted into the database; it's up to you to map each object into a native type e.g. `Boolean`, `Array` etc.
+
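+ A sketch of the first note only: `:stylesheets` is a made-up attribute name, and it assumes the block passed to `define_extension` can return the (mapped) value to be set, as implied by the comment in the example above.
+
+ ```ruby
+ require 'wgit'
+
+ # Extract stylesheet hrefs as Strings, then map them into Wgit::Url objects.
+ Wgit::Document.define_extension(
+   :stylesheets,
+   "//link[@rel='stylesheet']/@href",
+   singleton: false,
+   text_content_only: true,
+ ) do |hrefs|
+   hrefs.map { |href| Wgit::Url.new(href) } if hrefs
+ end
+ ```
+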
+ ## Caveats
+
+ Below are some points to keep in mind when using Wgit:
+
+ - All URLs must be prefixed with an appropriate protocol e.g. `https://`
+
+ ## Executable
+
+ Currently there is no executable provided with Wgit, however...
+
+ In future versions of Wgit, an executable will be packaged with the gem. The executable will provide a `pry` console with the `wgit` gem already loaded. Using the console, you'll easily be able to index and search the web without having to write your own scripts.
+
+ This executable will be very similar in nature to `./bin/console` which is currently used only for development and isn't packaged as part of the `wgit` gem.
+
+ ## Development
+
+ For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
+
+ After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive REPL that will allow you to experiment with the code.
+
+ To generate code documentation run `bundle exec yarddoc`. To browse the generated documentation run `bundle exec yard server -r`.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on [GitHub](https://github.com/michaeltelford/wgit).
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/TODO.txt ADDED
@@ -0,0 +1,35 @@
+
+ Primary
+ -------
+ - Add <base> support for link processing.
+ - Update Database#search & Document#search to have optional case sensitivity.
+ - Have the ability to crawl sub sections of a site only e.g. https://www.honda.co.uk/motorcycles.html as the base url and crawl any links containing this as a prefix. For example, https://www.honda.co.uk/cars.html would not be crawled but https://www.honda.co.uk/motorcycles/africa-twin.html would be.
+ - Create an executable based on the ./bin/console shipped as `wpry` or `wgit`.
+
+ Secondary
+ ---------
+ - Set up a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some Ruby sites like ruby.org and update the README to include an example search query e.g. "ruby" etc.
+ - Think about ignoring non HTML documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
+ - Check if Document::TEXT_ELEMENTS is expansive enough.
+ - Possibly use refine instead of core-ext?
+ - Think about potentially using DB._update's update_many func.
+
+ Refactoring
+ -----------
+ - Replace method params with named parameters where applicable.
+ - Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
+ - Mock (monkey patch) the 'mongo' gem's func. for speed in the tests.
+ - Refactor the tests to include: Removal of instance_vars defined in setup & Expansion of test scenarios for each func (where possible).
+
+ Gem Publishing Checklist
+ ------------------------
+ - Ensure a clean branch of master and create a 'release' branch.
+ - Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
+ - Increment the version number (in version.rb).
+ - Run 'bundle install' to update deps.
+ - Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
+ - Run 'bundle exec rake test' and ensure all tests are passing.
+ - Run `bundle exec rake install` to build and install the gem locally, then test it manually from outside this repo.
+ - Run `bundle exec yard doc` to update documentation - should be very high percentage.
+ - Commit, merge to master & push any changes made from the above steps.
+ - Run `bundle exec rake RELEASE[origin]` to tag, build and push everything to github.com and rubygems.org.
data/lib/wgit/assertable.rb CHANGED
@@ -3,9 +3,13 @@ module Wgit
  # Module containing assert methods including type checking which can be used
  # for asserting the integrity of method definitions etc.
  module Assertable
+ # Default type fail message.
  DEFAULT_TYPE_FAIL_MSG = "Expected: %s, Actual: %s".freeze
+ # Wrong method message.
  WRONG_METHOD_MSG = "arr must be Enumerable, use a different method".freeze
+ # Default duck fail message.
  DEFAULT_DUCK_FAIL_MSG = "%s doesn't respond_to? %s".freeze
+ # Default required keys message.
  DEFAULT_REQUIRED_KEYS_MSG = "Some or all of the required keys are not present: %s".freeze

  # Tests if the obj is of a given type.
data/lib/wgit/core_ext.rb CHANGED
@@ -1,8 +1,9 @@
- require_relative 'url'
-
  # Script which extends Ruby's core functionality when parsed.
  # Needs to be required separately using `require 'wgit/core_ext'`.

+ require_relative 'url'
+
+ # Extend the standard String functionality.
  class String
  # Converts a String into a Wgit::Url object.
  #
@@ -12,6 +13,7 @@ class String
  end
  end

+ # Extend the standard Enumerable functionality.
  module Enumerable
  # Converts each String instance into a Wgit::Url object and returns the new
  # Array.