wgit 0.6.0 → 0.7.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +12 -0
- data/README.md +24 -15
- data/bin/wgit +35 -0
- data/lib/wgit/crawler.rb +3 -3
- data/lib/wgit/database/database.rb +1 -1
- data/lib/wgit/document.rb +15 -4
- data/lib/wgit/indexer.rb +14 -5
- data/lib/wgit/version.rb +2 -2
- metadata +9 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 29d37a4a0f013fec64625d8fe5798ae2d062ae6f213811c51d223de311e16707
+  data.tar.gz: 213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: acd321f3ba039e6f54dd8a36a3e4ebec1fb40f1cda5ee1c982df4be22ee6d463f829c72f011ff959a4a4d2651676dc2d31866a273b60d3e5e630ccf77b3d7cbe
+  data.tar.gz: d0908d28e6fdaec440209479f75945672807cf3e9359fb8bd8f6cc9de45568a341ac0204ba5f50a2f6569b4a29f4f7ac3088353a35f2c5091a567af469027aab
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,18 @@
 - ...
 ---
 
+## v0.7.0
+### Added
+- `Wgit::Indexer.new` optional `crawler:` named param.
+- `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
+- `Document.extensions` returning a Set of all defined extensions.
+### Changed/Removed
+- Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search` `Wgit.indexed_search` etc. This brings back more relevant search results by default.
+- Updated the Docker image to now include index names; making it easier to identify them.
+### Fixed
+- ...
+---
+
 ## v0.6.0
 ### Added
 - Added `Wgit::Utils.proces_arr encode:` param.
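Of the changelog entries above, the `whole_sentence` default is the one most likely to change results after an upgrade. The sketch below is only an illustration of the idea: `search_text` is a hypothetical helper (not part of Wgit), and it assumes that `whole_sentence: true` matches the query as one phrase while `false` matches any single word of it.

```ruby
# Hypothetical helper, NOT Wgit's implementation: filters sentences by a query.
def search_text(sentences, query, case_sensitive: false, whole_sentence: true)
  pattern = if whole_sentence
              # Treat the whole query as one literal phrase.
              Regexp.escape(query)
            else
              # Match any individual word of the query.
              query.split.map { |word| Regexp.escape(word) }.join('|')
            end
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  sentences.select { |sentence| sentence.match?(regex) }
end

SENTENCES = [
  'Ruby on Rails is a web framework.',
  'Rails can refer to train tracks.',
  'Ruby is a gemstone.'
].freeze

# New default (whole_sentence: true): only the exact phrase matches.
phrase_hits = search_text(SENTENCES, 'ruby on rails')

# Old default (whole_sentence: false): any word of the query matches.
word_hits = search_text(SENTENCES, 'ruby on rails', whole_sentence: false)
```

Callers that relied on the looser word matching would need to pass `whole_sentence: false` explicitly after upgrading.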
data/README.md
CHANGED
@@ -8,11 +8,11 @@
 
 ---
 
-Wgit is a Ruby
+Wgit is a Ruby library primarily used for crawling, indexing and searching HTML webpages.
 
-Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to
+Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to scrape entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
 
-Check out this [
+Check out this [demo application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
 
 Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
 
@@ -51,6 +51,10 @@ Or install it yourself as:
 
     $ gem install wgit
 
+Verify the install by using the executable (to start a shell session):
+
+    $ wgit
+
 ## Basic Usage
 
 ```ruby
@@ -271,11 +275,11 @@ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_
 
 Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
 
-
+Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
 
-###
+### Serialising Additional Page Elements via Document Extensions
 
-You can define a Document extension for each HTML element(s) that you want to extract into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined,
+You can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
 
 Once the page element has been serialised, you can do with it as you wish e.g. obtain it's text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
 
@@ -296,6 +300,7 @@ Wgit::Document.define_extension(
 end
 
 # Our Document has a table which we're interested in.
+# Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
 doc = Wgit::Document.new(
   'http://some_url.com',
   <<~HTML
@@ -324,8 +329,6 @@ doc.stats # => {
 # }
 ```
 
-Wgit uses Document extensions to provide much of it's core serialising functionality, providing access to a webpage's text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
-
 See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
 
 **Extension Notes**:
@@ -339,16 +342,20 @@ See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michae
 Below are some points to keep in mind when using Wgit:
 
 - All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
-- By default, up to 5 URL redirects will be followed; this is configurable however.
+- By default, up to 5 URL redirects will be followed; this is [configurable](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Crawler#redirect_limit-instance_method) however.
 - IRI's (URL's containing non ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
 
 ## Executable
 
-
+Installing the Wgit gem also adds the `wgit` executable to your `$PATH`. The executable launches an interactive shell session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
+
+The `wgit` executable does the following things (in order):
 
-
+1. `require wgit`
+2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, which ever is found first)
+3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
 
-
+The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session. **Note** that variables should either be instance variables (e.g. `@url`) or be accessed via a getter method (e.g. `def url; ...; end`).
 
 ## Change Log
 
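The README's note about instance variables in `.wgit.rb` appears to follow from how the file is loaded: the `bin/wgit` script in this diff calls `eval` from inside the `eval_wgit` method, so local variables defined by the file would go out of scope when that method returns. The sketch below reproduces that effect in plain Ruby; `WGIT_RB`, `eval_config` and the example URLs are hypothetical stand-ins, not part of Wgit.

```ruby
# Stand-in for the contents of a .wgit.rb file (hypothetical example data).
WGIT_RB = <<~RUBY
  local_url = 'http://example.com/local' # Local variable: lost after the eval.
  @url = 'http://example.com/ivar'       # Instance variable: survives.
  def url_helper                         # Method: survives.
    'http://example.com/method'
  end
RUBY

# Mirrors the shape of eval_wgit: the eval happens inside a method body,
# so locals created by the evaluated code belong to that method call only.
def eval_config(source)
  eval(source)
end

eval_config(WGIT_RB)

local_visible = defined?(local_url) ? true : false # The local is not visible here.
ivar_value    = @url                               # The instance variable is.
helper_value  = url_helper                         # So is the method.
```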
@@ -390,10 +397,12 @@ And you're good to go!
 
 Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
-Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker.
-
-Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset). You can also run `toys console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
+Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset that doesn't require a database).
 
 To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
 
 To install this gem onto your local machine, run `toys install`.
+
+### Console
+
+You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and `.wgit.rb` file which gets loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.
data/bin/wgit
ADDED
@@ -0,0 +1,35 @@
+#!/usr/bin/env ruby
+
+require 'wgit'
+
+# Eval .wgit.rb file (if it exists).
+def eval_wgit
+  puts 'Searching for .wgit.rb in local and home directories...'
+
+  ['.', Dir.home].each do |dir|
+    path = "#{dir}/.wgit.rb"
+    next unless File.exist?(path)
+
+    puts "Eval'ing #{path} (call `eval_wgit` after changes)"
+    eval(File.read(path))
+    break
+  end
+end
+
+eval_wgit
+puts "\n#{Wgit.version_str}\n\n"
+
+# Use Pry if installed or fall back to IRB.
+begin
+  require 'pry'
+  klass = Pry
+rescue LoadError
+  require 'irb'
+  klass = IRB
+
+  puts "Starting IRB because Pry isn't installed."
+end
+
+klass.start
+
+puts 'Interactive session complete.'
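The lookup order in `eval_wgit` (local directory first, then the home directory, stopping at the first hit) can be isolated into a small function for experimentation. This is a sketch under assumptions rather than the gem's code: `first_config_path` is a hypothetical name, and it returns the path instead of evaluating the file.

```ruby
# Hypothetical helper: returns the first existing "<dir>/<filename>", or nil
# if no directory contains the file. Directories are checked in order.
def first_config_path(dirs, filename = '.wgit.rb')
  dirs
    .map { |dir| File.join(dir, filename) }
    .find { |path| File.exist?(path) }
end

# Same search order as the executable: local directory, then home.
# ENV['HOME'] is used (and skipped if unset) so the sketch never raises.
search_dirs = ['.', ENV['HOME']].compact
config_path = first_config_path(search_dirs)

puts(config_path ? "Would eval #{config_path}" : 'No .wgit.rb found')
```

Because the search stops at the first match, a project-level `.wgit.rb` takes precedence over one in the home directory.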
data/lib/wgit/crawler.rb
CHANGED
@@ -19,9 +19,9 @@ module Wgit
     # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
     # doesn't keep the crawl of the site going. All URL's without a file
     # extension will be crawled, because they're assumed to be HTML.
-    SUPPORTED_FILE_EXTENSIONS = Set.new(
-      asp aspx cfm cgi htm html htmlx jsp php
-
+    SUPPORTED_FILE_EXTENSIONS = Set.new(
+      %w[asp aspx cfm cgi htm html htmlx jsp php]
+    )
 
     # The amount of allowed redirects before raising an error. Set to 0 to
     # disable redirects completely; or you can pass `follow_redirects: false`
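The comment above `SUPPORTED_FILE_EXTENSIONS` describes a filter: links with a supported extension, or with no extension at all, keep the crawl going. A minimal sketch of that rule follows; `crawlable?` and the `SUPPORTED` constant are hypothetical names, and the actual check inside `Wgit::Crawler` may differ.

```ruby
require 'set'
require 'uri'

# Same extension list as the diff, held in a Set for fast membership checks.
SUPPORTED = Set.new(%w[asp aspx cfm cgi htm html htmlx jsp php]).freeze

# Hypothetical predicate: true if the URL has no file extension (assumed to
# be HTML) or one of the supported extensions.
def crawlable?(url)
  path = URI.parse(url).path.to_s
  ext  = File.extname(path).delete_prefix('.').downcase

  ext.empty? || SUPPORTED.include?(ext)
end

crawlable?('http://example.com/about')      # No extension.
crawlable?('http://example.com/index.html') # Supported extension.
crawlable?('http://example.com/logo.png')   # Unsupported extension.
```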
data/lib/wgit/database/database.rb
CHANGED
@@ -154,7 +154,7 @@ module Wgit
     #   DB.
     # @return [Array<Wgit::Document>] The search results obtained from the DB.
     def search(
-      query, case_sensitive: false, whole_sentence:
+      query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
     )
       query = query.to_s.strip
       query.replace('"' + query + '"') if whole_sentence
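The last context line of this hunk wraps the query in double quotes when `whole_sentence` is set. In MongoDB text search a double-quoted string is treated as a phrase, which is presumably why the quoting is used here. The sketch below shows the resulting query under that assumption; `build_text_query` and the returned hash shape are hypothetical and are not taken from `Wgit::Database`.

```ruby
# Hypothetical helper: builds a MongoDB-style $text filter for a query.
def build_text_query(query, whole_sentence: true)
  query = query.to_s.strip
  # Quoting turns the query into a single phrase for the text search.
  query = %("#{query}") if whole_sentence

  { '$text' => { '$search' => query } }
end

phrase_filter = build_text_query(' ruby gems ')
word_filter   = build_text_query(' ruby gems ', whole_sentence: false)
```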
data/lib/wgit/document.rb
CHANGED
@@ -3,6 +3,7 @@ require_relative 'utils'
 require_relative 'assertable'
 require 'nokogiri'
 require 'json'
+require 'set'
 
 module Wgit
   # Class primarily modeling a HTML web document, although other MIME types
@@ -22,6 +23,14 @@ module Wgit
     # The xpath used to extract the visible text on a page.
     TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
 
+    # Set of Symbols representing the defined Document extensions.
+    @extensions = Set.new
+
+    class << self
+      # Class level attr_reader for the Document defined extensions.
+      attr_reader :extensions
+    end
+
     # The URL of the webpage, an instance of Wgit::Url.
     attr_reader :url
 
@@ -120,7 +129,7 @@ module Wgit
         result = find_in_html(xpath, opts, &block)
         init_var(var, result)
       end
-      Document.send
+      Document.send(:private, func_name)
 
       # Define the private init_*_from_object method for a Database object.
       # Gets the Object's 'key' value and creates a var for it.
@@ -128,8 +137,9 @@ module Wgit
         result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
         init_var(var, result)
       end
-      Document.send
+      Document.send(:private, func_name)
 
+      @extensions << var
       var
     end
 
@@ -144,6 +154,7 @@ module Wgit
       Document.send(:remove_method, "init_#{var}_from_html")
       Document.send(:remove_method, "init_#{var}_from_object")
 
+      @extensions.delete(var.to_sym)
       true
     rescue NameError
       false
@@ -366,7 +377,7 @@ module Wgit
     #   sentence.
     # @return [Array<String>] A subset of @text, matching the query.
     def search(
-      query, case_sensitive: false, whole_sentence:
+      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
     )
       query = query.to_s
       raise 'A search query must be provided' if query.empty?
@@ -409,7 +420,7 @@ module Wgit
     #   sentence.
     # @return [String] This Document's original @text value.
     def search!(
-      query, case_sensitive: false, whole_sentence:
+      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
     )
       orig_text = @text
       @text = search(
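The `@extensions` change uses a class-level instance variable exposed through `class << self; attr_reader`, kept in sync as extensions are defined and removed. The sketch below shows that registry pattern in isolation. `MiniDocument` and its simplified `define_extension` and `remove_extension` methods are hypothetical; they only mimic the bookkeeping and do none of Wgit's HTML serialising.

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Document, reduced to the registry logic.
class MiniDocument
  # Class-level instance variable: belongs to MiniDocument itself and is
  # not shared with subclasses (unlike a @@class_variable).
  @extensions = Set.new

  class << self
    # Class-level reader, i.e. MiniDocument.extensions.
    attr_reader :extensions

    # Defines a getter for the extension and records its name.
    def define_extension(var)
      var = var.to_sym
      define_method(var) { instance_variable_get("@#{var}") }
      @extensions << var
      var
    end

    # Removes the getter and the registry entry; false if it was never defined.
    def remove_extension(var)
      var = var.to_sym
      return false unless @extensions.include?(var)

      remove_method(var)
      @extensions.delete(var)
      true
    end
  end
end

MiniDocument.define_extension(:tables)
defined_names = MiniDocument.extensions.to_a
```

A Set rather than an Array means that defining the same extension twice leaves a single entry.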
data/lib/wgit/indexer.rb
CHANGED
@@ -40,6 +40,10 @@ module Wgit
   #   nil to use ENV['WGIT_CONNECTION_STRING'].
   # @param insert_externals [Boolean] Whether or not to insert the website's
   #   external Url's into the database.
+  # @param allow_paths [String, Array<String>] Filters links by selecting
+  #   them if their path `File.fnmatch?` one of allow_paths.
+  # @param disallow_paths [String, Array<String>] Filters links by rejecting
+  #   them if their path `File.fnmatch?` one of disallow_paths.
   # @yield [doc] Given the Wgit::Document of each crawled webpage, before it's
   #   inserted into the database allowing for prior manipulation.
   # @return [Integer] The total number of pages crawled within the website.
@@ -96,7 +100,7 @@ module Wgit
   #   database.
   def self.indexed_search(
     query, connection_string: nil,
-    case_sensitive: false, whole_sentence:
+    case_sensitive: false, whole_sentence: true,
     limit: 10, skip: 0, sentence_limit: 80, &block
   )
     db = Wgit::Database.new(connection_string)
@@ -122,7 +126,7 @@ module Wgit
     Wgit::Utils.printf_search_results(results)
   end
 
-  # Class which
+  # Class which crawls and saves the indexed Documents to a database.
   class Indexer
     # The crawler used to index the WWW.
     attr_reader :crawler
@@ -133,10 +137,11 @@ module Wgit
     # Initialize the Indexer.
     #
     # @param database [Wgit::Database] The database instance (already
-    #   initialized
-
-
+    #   initialized and connected) used to index.
+    # @param crawler [Wgit::Crawler] The crawler instance used to index.
+    def initialize(database, crawler = Wgit::Crawler.new)
       @db = database
+      @crawler = crawler
     end
 
     # Retrieves uncrawled url's from the database and recursively crawls each
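The new `initialize` signature lets a caller supply its own crawler while keeping a default. In the diff the argument is an optional positional parameter (`crawler = Wgit::Crawler.new`), even though the changelog describes it as a named `crawler:` param. The sketch below shows why such injection is useful, using hypothetical `StubCrawler` and `MiniIndexer` classes and a Hash in place of a database; none of these are Wgit classes.

```ruby
# Hypothetical crawler: returns canned HTML instead of making HTTP requests.
class StubCrawler
  def crawl(url)
    "<html>#{url}</html>"
  end
end

# Hypothetical indexer with the same constructor shape as the diff:
# a required database and an optional positional crawler.
class MiniIndexer
  attr_reader :crawler

  def initialize(database, crawler = StubCrawler.new)
    @db = database
    @crawler = crawler
  end

  # Crawls the URL and stores the result, keyed by URL.
  def index(url)
    @db[url] = @crawler.crawl(url)
  end
end

db = {} # A Hash stands in for the database.

# Default crawler.
MiniIndexer.new(db).index('http://example.com')

# Injected crawler, e.g. one with custom settings or a test double.
custom_crawler = StubCrawler.new
indexer = MiniIndexer.new(db, custom_crawler)
indexer.index('http://example.com/about')
```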
@@ -214,6 +219,10 @@ the next iteration.")
     # @param url [Wgit::Url] The base Url of the website to crawl.
     # @param insert_externals [Boolean] Whether or not to insert the website's
     #   external Url's into the database.
+    # @param allow_paths [String, Array<String>] Filters links by selecting
+    #   them if their path `File.fnmatch?` one of allow_paths.
+    # @param disallow_paths [String, Array<String>] Filters links by rejecting
+    #   them if their path `File.fnmatch?` one of disallow_paths.
     # @yield [doc] Given the Wgit::Document of each crawled web page before
     #   it's inserted into the database allowing for prior manipulation. Return
     #   nil or false from the block to prevent the document from being saved
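The `allow_paths` and `disallow_paths` docs describe glob-style filtering with `File.fnmatch?`. The sketch below applies that idea to a list of path strings. `filter_paths` is a hypothetical helper, and details such as combining both filters or the flags passed to `File.fnmatch?` are assumptions rather than Wgit's behaviour.

```ruby
# Hypothetical helper: keeps paths matching any allow pattern (if given)
# and drops paths matching any disallow pattern.
def filter_paths(paths, allow_paths: nil, disallow_paths: nil)
  allow    = Array(allow_paths)    # Accepts a String or an Array<String>.
  disallow = Array(disallow_paths)

  paths.select do |path|
    allowed = allow.empty? || allow.any? { |glob| File.fnmatch?(glob, path) }
    allowed && disallow.none? { |glob| File.fnmatch?(glob, path) }
  end
end

PATHS = %w[blog/first-post blog/second-post about contact].freeze

only_blog = filter_paths(PATHS, allow_paths: 'blog/*')
no_blog   = filter_paths(PATHS, disallow_paths: ['blog/*', 'contact'])
```

Without the `File::FNM_PATHNAME` flag, `*` also matches `/`, so `blog/*` would match nested paths such as `blog/2020/post` as well.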
data/lib/wgit/version.rb
CHANGED
@@ -1,11 +1,11 @@
 # frozen_string_literal: true
 
 # Wgit is a WWW indexer/scraper which crawls URL's and retrieves their page
-# contents for later use
+# contents for later use.
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = '0.
+  VERSION = '0.7.0'
 
   # Returns the current gem version of Wgit as a String.
   def self.version
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.7.0
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2020-01-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -185,7 +185,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '1.0'
 description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s to
-  retrieve and serialise their page contents for later use. You can use Wgit to
+  retrieve and serialise their page contents for later use. You can use Wgit to scrape
   entire websites if required. Wgit also provides a means to search indexed documents
   stored in a database. Therefore, this library provides the main components of a
   WWW search engine. The Wgit API is easily extended allowing you to pull out the
@@ -195,7 +195,8 @@ description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s
 
   '
 email: michael.telford@live.com
-executables:
+executables:
+- wgit
 extensions: []
 extra_rdoc_files: []
 files:
@@ -219,6 +220,7 @@ files:
 - CONTRIBUTING.md
 - LICENSE.txt
 - README.md
+- bin/wgit
 homepage: https://github.com/michaeltelford/wgit
 licenses:
 - MIT
@@ -229,7 +231,7 @@ metadata:
   bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
   documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
   allowed_push_host: https://rubygems.org
-post_install_message:
+post_install_message: Added the 'wgit' executable to $PATH
 rdoc_options: []
 require_paths:
 - lib
@@ -247,6 +249,6 @@ requirements: []
 rubygems_version: 3.0.6
 signing_key:
 specification_version: 4
-summary: Wgit is a Ruby
-
+summary: Wgit is a Ruby library primarily used for crawling, indexing and searching
+  HTML webpages.
 test_files: []