wgit 0.5.1 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3fec9d62a92ebd6cdb6475bc8bcd17bf2215acb65cad0c5a8803acd6a4022feb
4
- data.tar.gz: 0d724152a4bdbf049f06be94918517c3a98c0033592d5fc931ae87307374c758
3
+ metadata.gz: 0c7346b075dca86debdb6a55ed363d1b890088b17fc600c17a5be82f5878545c
4
+ data.tar.gz: 75572c882e0711e1d49513db91d14fd1d530dbc6e22b7a0bbfec9ac1efd21e29
5
5
  SHA512:
6
- metadata.gz: d75291ebe2707fe5cb361e5ecec76a4dd1b6e33b5aa6eb92bb23adef74fcbc1e56f0dac7e7f2a95e64e2ab1585ca4c20316e63400246d20f7472ac85c320438b
7
- data.tar.gz: e9910a1eec9268865b6869a9c4cd74e641927fbfe360da6c93b56ebc5f3be942e460229745aba00ca6572a59468b1a0af3d85621b64ef414da794e0a719fbf23
6
+ metadata.gz: e3b915f11c80999a659f9b7f6f6786b717393fe94e6e65029dd5d1b2c2d95f064512cfe96a96d06416ed5932aad0d2798039306f746835c23fb5223aa2d69f5b
7
+ data.tar.gz: 3b1d55d35a30b19fe6c3193f9e2c4eb2884aaddcdf6c31a88465a0d9ffdaf01886b380ba68321bed2aad69d7b6fc26ad49b612aafa01c8034998bdd9697bebfd
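The `checksums.yaml` entries above are hex digests of the `metadata.gz` and `data.tar.gz` members packed inside the `.gem` archive. As a minimal sketch of how such a digest could be checked with Ruby's standard `digest` library (the `digests_match?` helper and the sample bytes are hypothetical; they are not part of Wgit or RubyGems):

```ruby
require 'digest'

# Hypothetical helper: compares the digests of `bytes` against a Hash of
# { 'SHA256' => hex, 'SHA512' => hex } entries, as laid out in checksums.yaml.
def digests_match?(bytes, expected)
  actual = {
    'SHA256' => Digest::SHA256.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }

  # An unknown algorithm name yields nil and therefore never matches.
  expected.all? { |algorithm, hex| actual[algorithm] == hex.downcase }
end

# Sample data only; a real check would read the extracted archive member.
sample   = 'abc'
expected = { 'SHA256' => Digest::SHA256.hexdigest(sample) }

puts digests_match?(sample, expected)       # matches the untouched bytes
puts digests_match?("#{sample}!", expected) # fails once the bytes change
```

In practice the bytes would come from the extracted file, for example `File.binread('data.tar.gz')`, and the expected values from the matching `checksums.yaml` entry.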
@@ -0,0 +1,7 @@
1
+ --readme README.md
2
+ --title 'Wgit API Documentation'
3
+ --charset utf-8
4
+ --markup markdown
5
+ --output .doc
6
+ --protected
7
+ - *.md LICENSE.txt
@@ -0,0 +1,174 @@
1
+ # Wgit Change Log
2
+
3
+ ## v0.0.0 (TEMPLATE - DO NOT EDIT)
4
+ ### Added
5
+ - ...
6
+ ### Changed/Removed
7
+ - ...
8
+ ### Fixed
9
+ - ...
10
+ ---
11
+
12
+ ## v0.6.0
13
+ ### Added
14
+ - Added `Wgit::Utils.process_arr encode:` param.
15
+ ### Changed/Removed
16
+ - Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
17
+ - Breaking changes: Updated `Wgit::Crawler` redirect logic. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_url-instance_method) for more info.
18
+ - Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_site-instance_method) for more info.
19
+ - Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
20
+ - Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
21
+ - Improved `Wgit::Url#valid?` and `#relative?`.
22
+ ### Fixed
23
+ - Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
24
+ - Bug fix in `Wgit::Document#search`.
25
+ ---
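To make the new glob support for `allow_paths:` and `disallow_paths:` more concrete, here is a minimal sketch of glob-based path filtering using Ruby's built-in `File.fnmatch?`. The `path_allowed?` helper and its precedence rule (disallow wins over allow) are assumptions for illustration only; Wgit's own matching logic may differ, so consult the linked docs for the real behaviour.

```ruby
# Hypothetical helper, not part of Wgit: decides whether a relative path
# passes a set of allow/disallow globs such as 'wiki/*'.
def path_allowed?(path, allow_paths: nil, disallow_paths: nil)
  path       = path.sub(%r{\A/}, '') # Treat '/wiki/Home' like 'wiki/Home'.
  allowed    = Array(allow_paths)    # Accepts a single String or an Array.
  disallowed = Array(disallow_paths)

  # Assumption: a disallow match always takes precedence.
  return false if disallowed.any? { |glob| File.fnmatch?(glob, path) }

  # With no allow globs given, every remaining path is allowed.
  allowed.empty? || allowed.any? { |glob| File.fnmatch?(glob, path) }
end

path_allowed?('wiki/Home', allow_paths: 'wiki/*')      # => true
path_allowed?('about', allow_paths: 'wiki/*')          # => false
path_allowed?('wiki/Home', disallow_paths: ['wiki/*']) # => false
```

Note that without the `File::FNM_PATHNAME` flag, `*` also matches `/`, so `'wiki/*'` in this sketch covers nested paths such as `wiki/a/b` as well.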
26
+
27
+ ## v0.5.1
28
+ ### Added
29
+ - `Wgit.version_str` method.
30
+ ### Changed/Removed
31
+ - Switched to optimistic dependency versioning.
32
+ ### Fixed
33
+ - Bug in `Wgit::Url#concat`.
34
+ ---
35
+
36
+ ## v0.5.0
37
+ ### Added
38
+ - A Wgit Wiki! [https://github.com/michaeltelford/wgit/wiki](https://github.com/michaeltelford/wgit/wiki)
39
+ - `Wgit::Document#content` alias for `#html`.
40
+ - `Wgit::Url#prefix_base` method.
41
+ - `Wgit::Url#to_addressable_uri` method.
42
+ - Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
43
+ - `Wgit::Url#+` as alias for `#concat`.
44
+ - `Wgit::Url#invalid?` method.
45
+ - `Wgit.version` method.
46
+ - `Wgit::Response` class containing adapter agnostic HTTP response logic.
47
+ ### Changed/Removed
48
+ - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
49
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
50
+ - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
51
+ - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
52
+ - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
53
+ - Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
54
+ - Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
55
+ - Breaking changes: `Wgit::Indexer` no longer inserts invalid external URL's (to be crawled at a later date).
56
+ - Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response`.
57
+ ### Fixed
58
+ - Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
59
+ - Several bugs in `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
60
+ ---
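The `#to_query` and `#to_fragment` change above means these values are now returned without their `?` and `#` delimiters. Ruby's standard `URI` library exposes the same components in the same way, so it can illustrate the new return values without Wgit installed. The URL is made up, and `URI` stands in here for `Addressable::URI`:

```ruby
require 'uri'

uri = URI.parse('http://example.com/about?q=ruby#top')

uri.query    # => "q=ruby" (no leading "?")
uri.fragment # => "top"    (no leading "#")
```

Code that previously stripped or relied on the leading delimiter would need adjusting after upgrading.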
61
+
62
+ ## v0.4.1
63
+ ### Added
64
+ - ...
65
+ ### Changed/Removed
66
+ - ...
67
+ ### Fixed
68
+ - A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default `User-Agent` header. This has now been changed.
69
+ ---
70
+
71
+ ## v0.4.0
72
+ ### Added
73
+ - `Wgit::Document#stats` alias `#statistics`.
74
+ - `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
75
+ - `Wgit::Crawler#last_response#redirect_count` method logic.
76
+ - `Wgit::Crawler#last_response#total_time` method logic.
77
+ - `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc.
78
+ ### Changed/Removed
79
+ - Breaking changes: Updated `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net::HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.
80
+ ### Fixed
81
+ - ...
82
+ ---
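The `Wgit::Utils.fetch` entry above describes a lookup that tries several key formats before falling back to a default. A minimal sketch of that idea follows. `fetch_loosely` is a hypothetical stand-in, and the exact set and order of key formats tried is an assumption rather than Wgit's actual implementation:

```ruby
# Hypothetical stand-in for the behaviour described above: look up `key`
# in `hash` under several spellings before giving up.
def fetch_loosely(hash, key, default = nil)
  name = key.to_s

  # Assumed order: the key as given, then String/Symbol and case variants.
  candidates = [
    key, name, name.to_sym,
    name.downcase, name.downcase.to_sym,
    name.upcase, name.upcase.to_sym
  ]

  # Use key? so that stored nil/false values are still returned.
  candidates.each { |candidate| return hash[candidate] if hash.key?(candidate) }
  default
end

fetch_loosely({ 'FOO' => 1 }, :foo)      # => 1
fetch_loosely({ foo: 2 }, 'foo')         # => 2
fetch_loosely({}, :foo, 'not found')     # => "not found"
```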
83
+
84
+ ## v0.3.0
85
+ ### Added
86
+ - `Url#crawl_duration` method.
87
+ - `Document#crawl_duration` method.
88
+ - `Benchmark.measure` to Crawler logic to set `Url#crawl_duration`.
89
+ ### Changed/Removed
90
+ - Breaking changes: Updated data model to embed the full `url` object inside the documents object.
91
+ - Breaking changes: Updated data model by removing documents `score` attribute.
92
+ ### Fixed
93
+ - ...
94
+ ---
95
+
96
+ ## v0.2.0
97
+ This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
98
+ ### Added
99
+ - `Wgit::Url#absolute?` method.
100
+ - `Wgit::Url#relative? base: url` support.
101
+ - `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
102
+ - `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.
103
+ ### Changed/Removed
104
+ - Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
105
+ - Breaking changes: All `Wgit::Indexer` methods now take named parameters.
106
+ - Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`,
107
+ - Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat` in favour of instance methods by the same names.
108
+ - Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
109
+ - Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled: `Practical Database Example` for an example.
110
+ - Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
111
+ - Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
112
+ - Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
113
+ - Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
114
+ - Breaking changes: `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
115
+ - Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behavior. `Wgit::Crawler.redirect_limit` no longer exists.
116
+ - Breaking changes: The following `Wgit::Crawler` instance methods signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
117
+ - Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
118
+ - Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
119
+ - Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the `results`. See the docs for the full details.
120
+ - `Wgit::Document` instances can now be instantiated with `String` Url's (previously only `Wgit::Url`'s).
121
+ ### Fixed
122
+ - ...
123
+ ---
124
+
125
+ ## v0.0.18
126
+ ### Added
127
+ - `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
128
+ ### Changed/Removed
129
+ - Updated certain classes by changing some `private` methods to `protected`.
130
+ ### Fixed
131
+ - ...
132
+ ---
133
+
134
+ ## v0.0.17
135
+ ### Added
136
+ - Support for `<base>` element in `Wgit::Document`'s.
137
+ - New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
138
+ ### Changed/Removed
139
+ - Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
140
+ - Breaking changes (potentially): `Wgit::Url`'s are now replaced with the redirected to Url during a crawl.
141
+ - Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
142
+ - Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
143
+ - Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
144
+ ### Fixed
145
+ - Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links including those used during `Wgit::Crawler#crawl_site`.
146
+ - Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
147
+ ---
148
+
149
+ ## v0.0.16
150
+ ### Added
151
+ - Added `Wgit::Url.parse` class method as alias for `Wgit::Url.new`.
152
+ ### Changed/Removed
153
+ - Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
154
+ ### Fixed
155
+ - Several URI related bugs in `Wgit::Url` affecting crawls.
156
+ ---
157
+
158
+ ## v0.0.15
159
+ ### Added
160
+ - Support for IRI's (non ASCII based URL's).
161
+ ### Changed/Removed
162
+ - Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.
163
+ ### Fixed
164
+ - Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
165
+ ---
166
+
167
+ ## v0.0.14
168
+ ### Added
169
+ - `Indexer#index_this_page` method.
170
+ ### Changed/Removed
171
+ - Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
172
+ ### Fixed
173
+ - Found and fixed a bug in `Document#new`.
174
+ ---
@@ -0,0 +1,76 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
9
+ level of experience, education, socio-economic status, nationality, personal
10
+ appearance, race, religion, or sexual identity and orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at michael.telford@live.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72
+
73
+ [homepage]: https://www.contributor-covenant.org
74
+
75
+ For answers to common questions about this code of conduct, see
76
+ https://www.contributor-covenant.org/faq
@@ -0,0 +1,21 @@
1
+ # Contributing
2
+
3
+ ## Consult
4
+
5
+ Before you make a contribution, reach out to michael.telford@live.com about what changes need to be made. Otherwise, your time spent might be wasted. Once you're clear on what needs to be done, follow the technical steps below.
6
+
7
+ ## Technical Steps
8
+
9
+ - Fork the repository
10
+ - Create a branch
11
+ - Write some tests (which fail)
12
+ - Write some code
13
+ - Re-run the tests (which now hopefully pass)
14
+ - Push your branch to your `origin` remote
15
+ - Open a GitHub Pull Request (with the target branch being wgit's `origin/master`)
16
+ - Apply any requested changes
17
+ - Wait for your PR to be merged
18
+
19
+ ## Thanks
20
+
21
+ Thanks in advance for your contribution.
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2016 - 2019 Michael Telford
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,399 @@
1
+ # Wgit
2
+
3
+ [![Inline gem version](https://badge.fury.io/rb/wgit.svg)](https://rubygems.org/gems/wgit)
4
+ [![Inline downloads](https://img.shields.io/gem/dt/wgit)](https://rubygems.org/gems/wgit)
5
+ [![Inline build](https://travis-ci.org/michaeltelford/wgit.svg?branch=master)](https://travis-ci.org/michaeltelford/wgit)
6
+ [![Inline docs](http://inch-ci.org/github/michaeltelford/wgit.svg?branch=master)](http://inch-ci.org/github/michaeltelford/wgit)
7
+ [![Inline code quality](https://api.codacy.com/project/badge/Grade/d5a0de62e78b460997cb8ce1127cea9e)](https://www.codacy.com/app/michaeltelford/wgit?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=michaeltelford/wgit&amp;utm_campaign=Badge_Grade)
8
+
9
+ ---
10
+
11
+ Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an easy-to-use API for programmatic URL parsing, HTML indexing and searching.
12
+
13
+ Fundamentally, Wgit is an HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended, allowing you to pull out the parts of a webpage that are important to you, such as the code snippets or tables. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing, to name a few.
14
+
15
+ Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
16
+
17
+ Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
18
+
19
+ ## Table Of Contents
20
+
21
+ 1. [Installation](#Installation)
22
+ 2. [Basic Usage](#Basic-Usage)
23
+ 3. [Documentation](#Documentation)
24
+ 4. [Practical Examples](#Practical-Examples)
25
+ 5. [Database Example](#Database-Example)
26
+ 6. [Extending The API](#Extending-The-API)
27
+ 7. [Caveats](#Caveats)
28
+ 8. [Executable](#Executable)
29
+ 9. [Change Log](#Change-Log)
30
+ 10. [License](#License)
31
+ 11. [Contributing](#Contributing)
32
+ 12. [Development](#Development)
33
+
34
+ ## Installation
35
+
36
+ Currently, the required Ruby version is:
37
+
38
+ `~> 2.5` a.k.a. `>= 2.5 && < 3`
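The pessimistic `~> 2.5` constraint can be checked with RubyGems' own `Gem::Requirement` and `Gem::Version` classes, which ship with Ruby. The version numbers below are arbitrary samples chosen to sit on either side of the allowed range:

```ruby
require 'rubygems' # Loaded by default; required here for completeness.

requirement = Gem::Requirement.new('~> 2.5')

# Sample versions only: two inside the range and two outside it.
%w[2.4.9 2.5.0 2.7.1 3.0.0].each do |version|
  supported = requirement.satisfied_by?(Gem::Version.new(version))
  puts "Ruby #{version}: #{supported ? 'supported' : 'not supported'}"
end
```

Only `2.5.0` and `2.7.1` satisfy the requirement, matching the `>= 2.5 && < 3` reading given above.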
39
+
40
+ Add this line to your application's `Gemfile`:
41
+
42
+ ```ruby
43
+ gem 'wgit'
44
+ ```
45
+
46
+ And then execute:
47
+
48
+ $ bundle
49
+
50
+ Or install it yourself as:
51
+
52
+ $ gem install wgit
53
+
54
+ ## Basic Usage
55
+
56
+ ```ruby
57
+ require 'wgit'
58
+
59
+ crawler = Wgit::Crawler.new # Uses Typhoeus -> libcurl underneath. It's fast!
60
+ url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
61
+
62
+ doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
63
+ crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
64
+
65
+ doc.class # => Wgit::Document
66
+ doc.class.public_instance_methods(false).sort # => [
67
+ # :==, :[], :author, :base, :base_url, :content, :css, :doc, :empty?, :external_links,
68
+ # :external_urls, :html, :internal_absolute_links, :internal_absolute_urls,
69
+ # :internal_links, :internal_urls, :keywords, :links, :score, :search, :search!,
70
+ # :size, :statistics, :stats, :text, :title, :to_h, :to_json, :url, :xpath
71
+ # ]
72
+
73
+ doc.url # => "https://wikileaks.org/What-is-Wikileaks.html"
74
+ doc.title # => "WikiLeaks - What is WikiLeaks"
75
+ doc.stats # => {
76
+ # :url=>44, :html=>28133, :title=>17, :keywords=>0,
77
+ # :links=>35, :text_snippets=>67, :text_bytes=>13735
78
+ # }
79
+ doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
80
+ doc.text # => ["The Courage Foundation is an international organisation that <snip>", ...]
81
+
82
+ results = doc.search 'corruption' # Searches doc.text for the given query.
83
+ results.first # => "ial materials involving war, spying and corruption.
84
+ # It has so far published more"
85
+ ```
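As the example output suggests, `#search` returns a short snippet of text surrounding each match rather than the whole text. A rough sketch of that idea in plain Ruby follows. `search_snippets`, its `limit:` parameter and the centring logic are assumptions made for illustration, not Wgit's actual implementation; only the `case_sensitive:` option name is taken from the change log.

```ruby
# Hypothetical sketch: return up to `limit` characters around the first
# match of `query` in each text, skipping texts that don't match.
def search_snippets(texts, query, case_sensitive: false, limit: 80)
  options = case_sensitive ? 0 : Regexp::IGNORECASE
  pattern = Regexp.new(Regexp.escape(query), options)

  # Characters of context to keep before the match (never negative).
  padding = [(limit - query.length) / 2, 0].max

  texts.map do |text|
    index = text.index(pattern)
    next if index.nil?

    start = [index - padding, 0].max
    text[start, limit]
  end.compact
end

search_snippets(['How now brown cow.'], 'COW')                       # => ["How now brown cow."]
search_snippets(['How now brown cow.'], 'COW', case_sensitive: true) # => []
```

Texts shorter than `limit` are returned whole, while longer texts are trimmed to a window that roughly centres the match.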
86
+
87
+ ## Documentation
88
+
89
+ 100% of Wgit's code is documented using [YARD](https://yardoc.org/) and deployed to [rubydoc.info](https://www.rubydoc.info/github/michaeltelford/wgit/master). This greatly benefits developers using Wgit in their own programs. Another good source of information (as to how the library behaves) is the [tests](https://github.com/michaeltelford/wgit/tree/master/test). Also, see the [Practical Examples](#Practical-Examples) section below for real working examples of Wgit in action.
90
+
91
+ ## Practical Examples
92
+
93
+ Below are some practical examples of Wgit in use. You can copy and run the code for yourself (it's all been tested).
94
+
95
+ In addition to the practical examples below, the [wiki](https://github.com/michaeltelford/wgit/wiki) contains a useful 'How To' section with more specific usage of Wgit. You should finish reading this `README` first however.
96
+
97
+ ### WWW HTML Indexer
98
+
99
+ See the [`Wgit::Indexer#index_www`](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2Eindex_www) documentation and source code for an already built example of a WWW HTML indexer. It will crawl any external URL's (in the database) and index their HTML for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it!
100
+
101
+ See the [Database Example](#Database-Example) for information on how to configure a database for use with Wgit.
102
+
103
+ ### Website Downloader
104
+
105
+ Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.
106
+
107
+ ### Broken Link Finder
108
+
109
+ The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
110
+
111
+ ### CSS Indexer
112
+
113
+ The below script downloads the contents of the first css link found on Facebook's index page.
114
+
115
+ ```ruby
116
+ require 'wgit'
117
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
118
+
119
+ crawler = Wgit::Crawler.new
120
+ url = 'https://www.facebook.com'.to_url
121
+
122
+ doc = crawler.crawl url
123
+
124
+ # Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
125
+ hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
126
+
127
+ hrefs.class # => Nokogiri::XML::NodeSet
128
+ href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
129
+
130
+ css = crawler.crawl href.to_url
131
+ css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
132
+ ```
133
+
134
+ ### Keyword Indexer (SEO Helper)
135
+
136
+ The below script downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketeers for search engine optimisation (SEO) for example.
137
+
138
+ ```ruby
139
+ require 'wgit'
140
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
141
+
142
+ my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
143
+ my_pages_missing_keywords = []
144
+
145
+ competitor_urls = [
146
+ 'http://altitudejunkies.com',
147
+ 'http://www.mountainmadness.com',
148
+ 'http://www.adventureconsultants.com'
149
+ ].to_urls
150
+
151
+ crawler = Wgit::Crawler.new
152
+
153
+ crawler.crawl(*competitor_urls) do |doc|
154
+ # If there are keywords present in the web document.
155
+ if doc.keywords.respond_to? :-
156
+ puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
157
+ my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
158
+ end
159
+ end
160
+
161
+ if my_pages_missing_keywords.empty?
162
+ puts 'Your pages are missing no keywords, nice one!'
163
+ else
164
+ puts 'Your pages compared to your competitors are missing the following keywords:'
165
+ puts my_pages_missing_keywords.uniq
166
+ end
167
+ ```
168
+
169
+ ## Database Example
170
+
171
+ The next example requires a configured database instance. The use of a database with Wgit is entirely optional, however, and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database).
172
+
173
+ Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a (small) free account or provide your own MongoDB instance. Take a look at this [Docker Hub image](https://hub.docker.com/r/michaeltelford/mongo-wgit) for an already built example of a `mongo` image configured for use with Wgit; the source of which can be found in the [`./docker`](https://github.com/michaeltelford/wgit/tree/master/docker) directory of this repository.
174
+
175
+ [`Wgit::Database`](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Database) provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and object serialisation. Using Wgit you can index webpages, store them in a database and then search through all that's been indexed; quickly and easily.
176
+
177
+ ### Versioning
178
+
179
+ The following versions of MongoDB are currently supported:
180
+
181
+ | Gem | Database |
182
+ | ------ | -------- |
183
+ | ~> 2.9 | ~> 4.0 |
184
+
185
+ ### Data Model
186
+
187
+ The data model for Wgit is deliberately simplistic. The MongoDB collections consist of:
188
+
189
+ | Collection | Purpose |
190
+ | ----------- | ----------------------------------------------- |
191
+ | `urls` | Stores URL's to be crawled at a later date |
192
+ | `documents` | Stores web documents after they've been crawled |
193
+
194
+ Wgit provides respective Ruby classes for each collection object, allowing for serialisation.
195
+
196
+ ### Configuring MongoDB
197
+
198
+ Follow the steps below to configure MongoDB for use with Wgit. This is only required if you want to read/write database records using your own (manually configured) instance of MongoDB.
199
+
200
+ 1) Create collections for: `urls` and `documents`.
201
+ 2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections using:
202
+
203
+ | Collection | Fields | Options |
204
+ | ----------- | ------------------- | ------------------- |
205
+ | `urls` | `{ "url" : 1 }` | `{ unique : true }` |
206
+ | `documents` | `{ "url.url" : 1 }` | `{ unique : true }` |
207
+
208
+ 3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so - it's typically enabled by default).
209
+ 4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
210
+ ```json
211
+ {
212
+ "text": "text",
213
+ "author": "text",
214
+ "keywords": "text",
215
+ "title": "text"
216
+ }
217
+ ```
218
+
219
+ **Note**: The *text search index* lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should add to this list any other fields that you want searched. For example, if you [extend the API](#Extending-The-API), you might want to search your new fields in the database by adding them to the index above.
220
+
221
+ ### Code Example
222
+
223
+ The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.
224
+
225
+ ```ruby
226
+ require 'wgit'
227
+
228
+ ### CONNECT TO THE DATABASE ###
229
+
230
+ # In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
231
+ db = Wgit::Database.connect '<your_connection_string>'
232
+
233
+ ### SEED SOME DATA ###
234
+
235
+ # Here we create our own document rather than crawling the web (which works in the same way).
236
+ # We provide the web page's URL and HTML Strings.
237
+ doc = Wgit::Document.new(
238
+ 'http://test-url.com',
239
+ "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
240
+ )
241
+ db.insert doc
242
+
243
+ ### SEARCH THE DATABASE ###
244
+
245
+ # Searching the database returns Wgit::Document's which have fields containing the query.
246
+ query = 'cow'
247
+ results = db.search query
248
+
249
+ # By default, the MongoDB ranking applies i.e. results.first has the most hits.
250
+ # Because results is an Array of Wgit::Document's, we can custom sort/rank e.g.
251
+ # `results.sort_by! { |doc| doc.url.crawl_duration }` ranks via page load times with
252
+ # results.first being the fastest. Any Wgit::Document attribute can be used, including
253
+ # those you define yourself by extending the API.
254
+
255
+ top_result = results.first
256
+ top_result.class # => Wgit::Document
257
+ doc.url == top_result.url # => true
258
+
259
+ ### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
260
+
261
+ # Searching each result gives the matching text snippets from that Wgit::Document.
262
+ top_result.search(query).first # => "How now brown cow."
263
+
264
+ ### SEED URLS TO BE CRAWLED LATER ###
265
+
266
+ db.insert top_result.external_links
267
+ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_links.
268
+ ```
269
+
270
+ ## Extending The API
271
+
272
+ Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
273
+
274
+ By default, Wgit serialises what it thinks are the most important pieces of information from each webpage. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
275
+
276
+ ### Defining Custom Serialisers Via Document Extensions
277
+
278
+ You can define a Document extension for each HTML element(s) that you want to extract into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, any crawled Documents will contain your extracted content.
279
+
280
+ Once the page element has been serialised, you can do with it as you wish e.g. obtain its text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
281
+
282
+ Here's how to add a Document extension to serialise a specific page element:
283
+
284
+ ```ruby
285
+ require 'wgit'
286
+
287
+ # Let's get all the page's <table> elements.
288
+ Wgit::Document.define_extension(
289
+ :tables, # Wgit::Document#tables will return the page's tables.
290
+ '//table', # The xpath to extract the tables.
291
+ singleton: false, # True returns the first table found, false returns all.
292
+ text_content_only: false, # True returns one or more Strings of the tables text,
293
+ # false returns the tables as Nokogiri objects (see below).
294
+ ) do |tables|
295
+ # Here we can manipulate the object(s) before they're set as Wgit::Document#tables.
296
+ end
297
+
298
+ # Our Document has a table which we're interested in.
299
+ doc = Wgit::Document.new(
300
+ 'http://some_url.com',
301
+ <<~HTML
302
+ <html>
303
+ <p>Hello world! Welcome to my site.</p>
304
+ <table>
305
+ <tr><th>Name</th><th>Age</th></tr>
306
+ <tr><td>Socrates</td><td>101</td></tr>
307
+ <tr><td>Plato</td><td>106</td></tr>
308
+ </table>
309
+ <p>I hope you enjoyed your visit :-)</p>
310
+ </html>
311
+ HTML
312
+ )
313
+
314
+ # Call our newly defined method to obtain the table data we're interested in.
315
+ tables = doc.tables
316
+
317
+ # Both the collection and each table within the collection are plain Nokogiri objects.
318
+ tables.class # => Nokogiri::XML::NodeSet
319
+ tables.first.class # => Nokogiri::XML::Element
320
+
321
+ # Notice the Document's stats now include our 'tables' extension.
322
+ doc.stats # => {
323
+ # :url=>19, :html=>242, :links=>0, :text_snippets=>2, :text_bytes=>65, :tables=>1
324
+ # }
325
+ ```
326
+
327
+ Wgit uses Document extensions to provide much of its core serialising functionality, providing access to a webpage's text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
328
+
329
+ See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
330
+
331
+ **Extension Notes**:
332
+
333
+ - It's recommended that URLs be mapped into `Wgit::Url` objects. `Wgit::Url` objects are treated as Strings when inserted into the database.
334
+ - A `Wgit::Document` extension (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. `String`, `Array` etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure that the data you want inserted can be inserted.
335
+ - Once inserted into the Database, you can search a `Wgit::Document`'s extension attributes by updating the Database's *text search index*. See the [Database Example](#Database-Example) for more information.
336
+
337
+ ## Caveats
338
+
339
+ Below are some points to keep in mind when using Wgit:
340
+
341
+ - All absolute `Wgit::Url`'s must be prefixed with an appropriate protocol e.g. `https://` etc.
342
+ - By default, up to 5 URL redirects will be followed; this is configurable, however.
343
+ - IRIs (URLs containing non-ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
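
To illustrate the first caveat, here is a minimal sketch of checking that a String carries a protocol prefix before treating it as an absolute URL. The helper `protocol_prefixed?` and its pattern are assumptions made for illustration only; this is not how Wgit itself validates URLs.

```ruby
# Hypothetical pattern: a scheme (e.g. "https") followed by "://".
PROTOCOL_PREFIX = %r{\A[a-z][a-z0-9+.-]*://}i

# Hypothetical helper (not part of Wgit): true if the String starts with
# a protocol such as "https://".
def protocol_prefixed?(url)
  url.match?(PROTOCOL_PREFIX)
end

protocol_prefixed?('https://example.com') # => true
protocol_prefixed?('example.com/about')   # => false
```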
344
+
345
+ ## Executable
346
+
347
+ Currently there is no executable provided with Wgit, however...
348
+
349
+ In future versions of Wgit, an executable will be packaged with the gem. The executable will provide a `pry` console with the `wgit` gem already loaded. Using the console, you'll easily be able to index and search the web without having to write your own scripts.
350
+
351
+ This executable will be similar in nature to `./bin/console` which is currently used for development and isn't packaged as part of the `wgit` gem.
352
+
353
+ ## Change Log
354
+
355
+ See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
356
+
357
+ ### Gem Versioning
358
+
359
+ The `wgit` gem follows these versioning rules:
360
+
361
+ - The version format is `MAJOR.MINOR.PATCH` e.g. `0.1.0`.
362
+ - Since the gem hasn't reached `v1.0.0` yet, slightly different semantic versioning rules apply.
363
+ - The `PATCH` represents *non-breaking changes* while the `MINOR` represents *breaking changes* e.g. updating from version `0.1.0` to `0.2.0` will likely introduce breaking changes necessitating updates to your codebase.
364
+ - To determine what changes are needed, consult the `CHANGELOG.md`. If you need help, raise an issue.
365
+ - Once `wgit v1.0.0` is released, *normal* [semantic versioning](https://semver.org/) rules will apply e.g. only a `MAJOR` version change should introduce breaking changes.
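
The rules above can be sketched as a small check. The helper `breaking_update?` is hypothetical (it is not provided by Wgit) and assumes `to` is a newer version than `from`; it relies only on `Gem::Version`, which ships with Ruby.

```ruby
require 'rubygems' # Provides Gem::Version.

# Hypothetical helper (not part of Wgit): applies the versioning rules
# listed above to decide whether an update may introduce breaking changes.
def breaking_update?(from, to)
  from_major, from_minor = Gem::Version.new(from).segments
  to_major, to_minor = Gem::Version.new(to).segments

  # A MAJOR bump may break under both the pre-1.0 and the semver rules.
  return true if to_major > from_major

  # Before v1.0.0, a MINOR bump may also break; from v1.0.0 on it should not.
  to_major.zero? && to_minor.to_i > from_minor.to_i
end

breaking_update?('0.5.1', '0.6.0') # => true
breaking_update?('0.6.0', '0.6.1') # => false
breaking_update?('1.0.0', '1.1.0') # => false
```

In a Gemfile, a pessimistic constraint such as `gem 'wgit', '~> 0.6.0'` permits only `PATCH` updates (`>= 0.6.0` and `< 0.7.0`), which matches the pre-1.0 rules described here.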
366
+
367
+ ## License
368
+
369
+ The gem is available as open source under the terms of the MIT License. See [LICENSE.txt](https://github.com/michaeltelford/wgit/blob/master/LICENSE.txt) for more details.
370
+
371
+ ## Contributing
372
+
373
+ Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, checking it doesn't already exist.
374
+
375
+ The current road map is roughly outlined on the [Road Map](https://github.com/michaeltelford/wgit/wiki/Road-Map) wiki page. Maybe your feature request is already there?
376
+
377
+ Before you consider making a contribution, check out [CONTRIBUTING.md](https://github.com/michaeltelford/wgit/blob/master/CONTRIBUTING.md).
378
+
379
+ ## Development
380
+
381
+ After checking out the repo, run the following commands:
382
+
383
+ 1. `gem install bundler toys`
384
+ 2. `bundle install --jobs=3`
385
+ 3. `toys setup`
386
+
387
+ And you're good to go!
388
+
389
+ ### Tooling
390
+
391
+ Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
392
+
393
+ Run `toys db` to see a list of database related tools, enabling you to run a MongoDB instance locally using Docker.
394
+
395
+ Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset). You can also run `toys console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
396
+
397
+ To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
398
+
399
+ To install this gem onto your local machine, run `toys install`.