wgit 0.0.10 → 0.0.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: c4dc572e7d48a95d423e175ad2d6a791be52bf56f6c391e9152c075f45672ee8
- data.tar.gz: fd6a9c9d1e38906f500543ae92169f7bcb9e64de1567f4f29ec24f7ca74c60d8
+ metadata.gz: '0929be93cf79c3ca942c78607c9bfae6ef53d3d2b033964ac7348e1d03b8f113'
+ data.tar.gz: e9a24df51634b84bc2c17790ccff5c04ded32852f538bdc81bbc90806c20008c
  SHA512:
- metadata.gz: f4fc425aa1b25254dba343151794893ff26a3682e58dd08bdc918c180da89ecb08cf0f1013837e41527f564d36bb0784c6ecd204e53c1276cb5b32401c88ffab
- data.tar.gz: 2ce1250ad7312257bc021e7414f164ec174e07e469a6956478cccaa7b05f159981a7a69ef6b713db5454233c8a03ea8487f14184c1e26dcab28ed8e81250507d
+ metadata.gz: 95ffe388b6d4ddc7771be4db2ea1ed7579c2e82d7a76b7b2950658c7114d1d14af0ff80817fcc4a7ac6ba9c6094d764d24549e18d283293edb63ec1fb6f9837b
+ data.tar.gz: 74d96db70834054c757c2429ee3bfce5b05f5d2a783f513b92dfc7ec185d69cf0c5f008e01227543f59acef70f87e1f0e81eadec2a607d2f3a3fc9d44d5c2641
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2019 Michael Telford
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,334 @@
+ # Wgit
+
+ Wgit is a Ruby gem similar in nature to GNU's `wget`. It provides an easy-to-use API for programmatic web scraping, indexing and searching.
+
+ Fundamentally, Wgit is a WWW indexer/scraper which crawls URLs, then retrieves and serialises their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended, allowing you to pull out the parts of a webpage that are important to you, such as code snippets or images. As Wgit is a library, it has uses in many different types of application.
+
+ Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine built using Wgit and Sinatra, deployed to Heroku.
+
+ ## Table Of Contents
+
+ 1. [Installation](#Installation)
+ 2. [Basic Usage](#Basic-Usage)
+ 3. [Documentation](#Documentation)
+ 4. [Practical Examples](#Practical-Examples)
+ 5. [Practical Database Example](#Practical-Database-Example)
+ 6. [Extending The API](#Extending-The-API)
+ 7. [Caveats](#Caveats)
+ 8. [Executable](#Executable)
+ 9. [Development](#Development)
+ 10. [Contributing](#Contributing)
+ 11. [License](#License)
+
+ ## Installation
+
+ Add this line to your application's `Gemfile`:
+
+ ```ruby
+ gem 'wgit'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install wgit
+
+ ## Basic Usage
+
+ The example below shows the Wgit API in action and gives an idea of how you can use Wgit in your own code.
+
+ ```ruby
+ require 'wgit'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new "https://wikileaks.org/What-is-Wikileaks.html"
+
+ doc = crawler.crawl url
+
+ doc.class # => Wgit::Document
+ doc.stats # => {
+ # :url=>44, :html=>28133, :title=>17, :keywords=>0,
+ # :links=>35, :text_length=>67, :text_bytes=>13735
+ #}
+
+ # doc responds to the following methods:
+ Wgit::Document.instance_methods(false).sort # => [
+ # :==, :[], :author, :css, :date_crawled, :doc, :empty?, :external_links,
+ # :external_urls, :html, :internal_full_links, :internal_links, :keywords,
+ # :links, :relative_full_links, :relative_full_urls, :relative_links,
+ # :relative_urls, :score, :search, :search!, :size, :stats, :text, :title,
+ # :to_h, :to_hash, :to_json, :url, :xpath
+ #]
+
+ results = doc.search "corruption"
+ results.first # => "ial materials involving war, spying and corruption.
+ # It has so far published more"
+ ```
+
+ ## Documentation
+
+ To see what's possible with the Wgit gem, see the [docs](https://www.rubydoc.info/gems/wgit) or the [Practical Examples](#Practical-Examples) section below.
+
+ ## Practical Examples
+
+ Below are some practical examples of Wgit in use. You can copy and run the code for yourself.
+
+ ### WWW HTML Indexer
+
+ See the `Wgit::Indexer#index_the_web` documentation and source code for an already built example of a WWW HTML indexer. It will crawl any external URLs (in the database) and index their markup for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it! A minimal, hedged sketch of kicking it off is shown below.
+
+ See the [Practical Database Example](#Practical-Database-Example) for information on how to set up a database for use with Wgit.
+
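+ The following is only a sketch, not the gem's own script: it assumes `Wgit::Indexer.new` accepts a `Wgit::Database` instance and that the connection details have already been set (check the `Wgit::Indexer` docs/source for the exact constructor and method signatures in your version).
+
+ ```ruby
+ require 'wgit'
+
+ # Assumes Wgit.set_connection_details has already been called
+ # (see the Practical Database Example below).
+ db      = Wgit::Database.new
+ indexer = Wgit::Indexer.new db # Assumption: the indexer takes the database.
+
+ # Crawls external URLs stored in the database and indexes their markup.
+ # As noted above, it will crawl indefinitely if you let it.
+ indexer.index_the_web
+ ```
+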
+ ### Website Downloader
+
+ Wgit uses itself to download and save fixture webpages to disk (used in its tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes. A simplified, hedged sketch of the same idea follows.
+
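+ This sketch is not the script linked above; it only uses the API already shown in this README, and the URL and output paths are made up for illustration.
+
+ ```ruby
+ require 'wgit'
+ require 'fileutils'
+
+ crawler = Wgit::Crawler.new
+ base    = Wgit::Url.new "https://wikileaks.org/What-is-Wikileaks.html"
+
+ FileUtils.mkdir_p "site"
+
+ # Save the starting page, then crawl and save each of its internal links.
+ doc = crawler.crawl base
+ File.write "site/index.html", doc.html
+
+ doc.internal_full_links.each_with_index do |link, i|
+   page = crawler.crawl link
+   File.write "site/page_#{i}.html", page.html if page
+ end
+ ```
+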
+ ### CSS Indexer
+
+ The script below downloads the contents of the first CSS link found on Facebook's index page.
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
+
+ crawler = Wgit::Crawler.new
+ url = "https://www.facebook.com".to_url
+
+ doc = crawler.crawl url
+
+ # Provide your own XPath (or CSS selector) to search the HTML using Nokogiri underneath.
+ hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
+
+ hrefs.class # => Nokogiri::XML::NodeSet
+ href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
+
+ css = crawler.crawl href.to_url
+ css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
+ ```
+
+ ### Keyword Indexer (SEO Helper)
+
+ The script below downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketeers for search engine optimisation, for example.
+
+ ```ruby
+ require 'wgit'
+
+ my_pages_keywords = ["Everest", "mountaineering school", "adventure"]
+ my_pages_missing_keywords = []
+
+ competitor_urls = [
+   "http://altitudejunkies.com",
+   "http://www.mountainmadness.com",
+   "http://www.adventureconsultants.com"
+ ]
+
+ crawler = Wgit::Crawler.new competitor_urls
+
+ crawler.crawl do |doc|
+   # If the page defines keywords (an Array, which responds to #-).
+   if doc.keywords.respond_to? :-
+     puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
+     my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
+   end
+ end
+
+ if my_pages_missing_keywords.empty?
+   puts "Your pages are missing no keywords, nice one!"
+ else
+   puts "Your pages compared to your competitors are missing the following keywords:"
+   puts my_pages_missing_keywords.uniq
+ end
+ ```
+
+ ## Practical Database Example
+
+ This next example requires a configured database instance.
+
+ Currently the only supported DBMS is MongoDB. See [mLab](https://mlab.com) for a free (small) account or provide your own MongoDB instance.
+
+ The currently supported versions of `mongo` are:
+
+ | Gem      | Database Engine |
+ | -------- | --------------- |
+ | ~> 2.8.0 | 3.6.12 (MMAPv1) |
+
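+ If you manage the driver dependency yourself, a `Gemfile` entry like the following (illustrative only) pins it to the supported series:
+
+ ```ruby
+ # Pin the MongoDB Ruby driver to the series listed in the table above.
+ gem 'mongo', '~> 2.8.0'
+ ```
+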
+ ### Setting Up MongoDB
+
+ Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records; the use of a database is entirely optional when using Wgit. A scripted sketch of steps 1, 2 and 4 (using the `mongo` Ruby driver) follows the note below.
+
+ 1) Create collections for: `documents` and `urls`.
+ 2) Add a unique index for the `url` field in **both** collections.
+ 3) Enable `textSearchEnabled` in MongoDB's configuration.
+ 4) Create a *text search index* for the `documents` collection using:
+ ```json
+ {
+   "text": "text",
+   "author": "text",
+   "keywords": "text",
+   "title": "text"
+ }
+ ```
+ 5) Set the connection details for your MongoDB instance using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`).
+
+ **Note**: The *text search index* (in step 4) lists all of the document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should add to this list any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
+
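+ The sketch below uses the `mongo` Ruby driver directly and only illustrates steps 1, 2 and 4 above; the host, credentials and database name are placeholders, and step 3 (`textSearchEnabled`) is a server-side configuration setting that can't be applied from here.
+
+ ```ruby
+ require 'mongo'
+
+ client = Mongo::Client.new(
+   "mongodb://<username>:<password>@<host_machine>:27017/<database_name>"
+ )
+
+ # Step 1: create the two collections (raises if a collection already exists).
+ client[:documents].create
+ client[:urls].create
+
+ # Step 2: a unique index on the url field in both collections.
+ client[:documents].indexes.create_one({ url: 1 }, unique: true)
+ client[:urls].indexes.create_one({ url: 1 }, unique: true)
+
+ # Step 4: the text search index on the documents collection.
+ client[:documents].indexes.create_one(
+   text: "text", author: "text", keywords: "text", title: "text"
+ )
+ ```
+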
+ ### Database Example
+
+ The script below shows how to use Wgit's database functionality to index and then search HTML documents stored in the database.
+
+ If you're running the code below for yourself, remember to replace the Hash containing the connection details with your own.
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
+
+ # Here we create our own document rather than crawling the web.
+ # We pass the web page's URL and HTML Strings.
+ doc = Wgit::Document.new(
+   "http://test-url.com".to_url,
+   "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
+ )
+
+ # Set your connection details manually (as below) or from the environment using
+ # Wgit.set_connection_details_from_env
+ Wgit.set_connection_details(
+   'DB_HOST'     => '<host_machine>',
+   'DB_PORT'     => '27017',
+   'DB_USERNAME' => '<username>',
+   'DB_PASSWORD' => '<password>',
+   'DB_DATABASE' => '<database_name>',
+ )
+
+ db = Wgit::Database.new # Connects to the database...
+ db.insert doc
+
+ # Searching the database returns documents with matching text 'hits'.
+ query = "cow"
+ results = db.search query
+
+ doc.url == results.first.url # => true
+
+ # Searching the returned documents gives the matching lines of text from that document.
+ doc.search(query).first # => "How now brown cow."
+
+ db.insert doc.external_links
+
+ urls_to_crawl = db.uncrawled_urls # => Results will include doc.external_links.
+ ```
+
+ ## Extending The API
+
+ Indexing in Wgit is the means of downloading a web page and serialising parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return the value of the webpage's `meta[@name='author']` tag, as the short example below shows.
+
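+ A quick illustration of that default behaviour (the URL and HTML below are made up for the example):
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><head><meta name='author' content='Jane Smith'></head></html>"
+ )
+
+ doc.author # => "Jane Smith"
+ ```
+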
+ By default, Wgit indexes what it thinks are the most important pieces of information from each webpage. Of course, this is often not enough given how much webpages differ from one another. Therefore, there exists a set of ways to extend the default indexing logic.
+
+ There are two ways to extend the indexing behaviour of Wgit:
+
+ 1. Add the elements containing **text** that you're interested in to be indexed.
+ 2. Define custom indexers matched to specific **elements** that you're interested in.
+
+ Below describes these two methods in more detail.
+
+ ### 1. Extending The Default Text Elements
+
+ Wgit contains an array, `Wgit::Document.text_elements`, which defines the default set of webpage elements containing text; their text is indexed and made accessible via `Wgit::Document#text`.
+
+ If you'd like the text of additional webpage elements to be returned from `Wgit::Document#text`, then you can do the following:
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ # Let's add the text of links e.g. <a> tags.
+ Wgit::Document.text_elements << :a
+
+ # Our Document has a link whose text we're interested in.
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><p>Hello world!</p>\
+ <a href='https://made-up-link.com'>Click this link.</a></html>"
+ )
+
+ # Now all crawled Documents will contain all visible link text in Wgit::Document#text.
+ doc.text # => ["Hello world!", "Click this link."]
+ ```
+
+ **Note**: This only works for textual page content. For more control over the indexed elements themselves, see below.
+
+ ### 2. Defining Custom Indexers/Elements a.k.a. Virtual Attributes
+
+ If you want full control over the elements being indexed for your own purposes, then you can define a custom indexer for each type of element that you're interested in.
+
+ Once you have the indexed page element, accessed via a `Wgit::Document` instance method, you can do with it as you wish, e.g. obtain its text value or manipulate the element. Since the returned types are plain [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) objects, you have the full control that the Nokogiri gem gives you.
+
+ Here's how to add a custom indexer for a specific page element:
+
+ ```ruby
+ require 'wgit'
+ require 'wgit/core_ext'
+
+ # Let's get all the page's table elements.
+ Wgit::Document.define_extension(
+   :tables,                  # Wgit::Document#tables will return the page's tables.
+   "//table",                # The xpath to extract the tables.
+   singleton: false,         # True returns the first table found, false returns all.
+   text_content_only: false, # True returns one or more Strings of the tables' text,
+                             # false returns the tables as Nokogiri objects (see below).
+ ) do |tables|
+   # Here we can manipulate the object(s) before they're set as Wgit::Document#tables.
+ end
+
+ # Our Document has a table which we're interested in.
+ doc = Wgit::Document.new(
+   "http://some_url.com".to_url,
+   "<html><p>Hello world!</p>\
+ <table><th>Header Text</th><th>Another Header</th></table></html>"
+ )
+
+ # Call our newly defined method to obtain the table data we're interested in.
+ tables = doc.tables
+
+ # Both the collection and each table within the collection are plain Nokogiri objects.
+ tables.class       # => Nokogiri::XML::NodeSet
+ tables.first.class # => Nokogiri::XML::Element
+ ```
+
+ **Extension Notes**:
+
+ - Any links should be mapped into `Wgit::Url` objects; Urls are treated as Strings when being inserted into the database. See the hedged sketch below this list.
+ - Any object (like a Nokogiri object) will not be inserted into the database; it's up to you to map each object into a native type e.g. `Boolean`, `Array` etc.
+
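+ A sketch of the first note only: `:stylesheets` is a made-up attribute name, and it assumes the block passed to `define_extension` can return the (mapped) value to be set, as implied by the comment in the example above.
+
+ ```ruby
+ require 'wgit'
+
+ # Extract stylesheet hrefs as Strings, then map them into Wgit::Url objects.
+ Wgit::Document.define_extension(
+   :stylesheets,
+   "//link[@rel='stylesheet']/@href",
+   singleton: false,
+   text_content_only: true,
+ ) do |hrefs|
+   hrefs.map { |href| Wgit::Url.new(href) } if hrefs
+ end
+ ```
+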
+ ## Caveats
+
+ Below are some points to keep in mind when using Wgit:
+
+ - All URLs must be prefixed with an appropriate protocol e.g. `https://`
+
+ ## Executable
+
+ Currently there is no executable provided with Wgit, however...
+
+ In future versions of Wgit, an executable will be packaged with the gem. The executable will provide a `pry` console with the `wgit` gem already loaded. Using the console, you'll easily be able to index and search the web without having to write your own scripts.
+
+ This executable will be very similar in nature to `./bin/console` which is currently used only for development and isn't packaged as part of the `wgit` gem.
+
+ ## Development
+
+ For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
+
+ After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive REPL that will allow you to experiment with the code.
+
+ To generate code documentation run `bundle exec yarddoc`. To browse the generated documentation run `bundle exec yard server -r`.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on [GitHub](https://github.com/michaeltelford/wgit).
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
data/TODO.txt ADDED
@@ -0,0 +1,35 @@
+
+ Primary
+ -------
+ - Add <base> support for link processing.
+ - Update Database#search & Document#search to have optional case sensitivity.
+ - Have the ability to crawl sub sections of a site only e.g. https://www.honda.co.uk/motorcycles.html as the base url and crawl any links containing this as a prefix. For example, https://www.honda.co.uk/cars.html would not be crawled but https://www.honda.co.uk/motorcycles/africa-twin.html would be.
+ - Create an executable based on the ./bin/console shipped as `wpry` or `wgit`.
+
+ Secondary
+ ---------
+ - Set up a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some Ruby sites like ruby.org and update the README to include an example search query e.g. "ruby" etc.
+ - Think about ignoring non HTML documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
+ - Check if Document::TEXT_ELEMENTS is expansive enough.
+ - Possibly use refine instead of core-ext?
+ - Think about potentially using DB._update's update_many func.
+
+ Refactoring
+ -----------
+ - Replace method params with named parameters where applicable.
+ - Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
+ - Mock (monkey patch) the 'mongo' gem's func. for speed in the tests.
+ - Refactor the tests to include: Removal of instance_vars defined in setup & Expansion of test scenarios for each func (where possible).
+
+ Gem Publishing Checklist
+ ------------------------
+ - Ensure a clean branch of master and create a 'release' branch.
+ - Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
+ - Increment the version number (in version.rb).
+ - Run 'bundle install' to update deps.
+ - Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
+ - Run 'bundle exec rake test' and ensure all tests are passing.
+ - Run `bundle exec rake install` to build and install the gem locally, then test it manually from outside this repo.
+ - Run `bundle exec yard doc` to update documentation - should be very high percentage.
+ - Commit, merge to master & push any changes made from the above steps.
+ - Run `bundle exec rake RELEASE[origin]` to tag, build and push everything to github.com and rubygems.org.
data/lib/wgit/assertable.rb CHANGED
@@ -3,9 +3,13 @@ module Wgit
  # Module containing assert methods including type checking which can be used
  # for asserting the integrity of method definitions etc.
  module Assertable
+ # Default type fail message.
  DEFAULT_TYPE_FAIL_MSG = "Expected: %s, Actual: %s".freeze
+ # Wrong method message.
  WRONG_METHOD_MSG = "arr must be Enumerable, use a different method".freeze
+ # Default duck fail message.
  DEFAULT_DUCK_FAIL_MSG = "%s doesn't respond_to? %s".freeze
+ # Default required keys message.
  DEFAULT_REQUIRED_KEYS_MSG = "Some or all of the required keys are not present: %s".freeze

  # Tests if the obj is of a given type.
data/lib/wgit/core_ext.rb CHANGED
@@ -1,8 +1,9 @@
- require_relative 'url'
-
  # Script which extends Ruby's core functionality when parsed.
  # Needs to be required separately using `require 'wgit/core_ext'`.

+ require_relative 'url'
+
+ # Extend the standard String functionality.
  class String
  # Converts a String into a Wgit::Url object.
  #
@@ -12,6 +13,7 @@ class String
  end
  end

+ # Extend the standard Enumerable functionality.
  module Enumerable
  # Converts each String instance into a Wgit::Url object and returns the new
  # Array.