wgit 0.7.0 → 0.10.1
- checksums.yaml +4 -4
- data/.yardopts +1 -1
- data/CHANGELOG.md +74 -2
- data/LICENSE.txt +1 -1
- data/README.md +114 -290
- data/bin/wgit +9 -5
- data/lib/wgit/assertable.rb +3 -3
- data/lib/wgit/base.rb +30 -0
- data/lib/wgit/core_ext.rb +1 -1
- data/lib/wgit/crawler.rb +219 -79
- data/lib/wgit/database/database.rb +309 -134
- data/lib/wgit/database/model.rb +10 -3
- data/lib/wgit/document.rb +226 -143
- data/lib/wgit/{document_extensions.rb → document_extractors.rb} +21 -11
- data/lib/wgit/dsl.rb +324 -0
- data/lib/wgit/indexer.rb +65 -162
- data/lib/wgit/response.rb +11 -8
- data/lib/wgit/url.rb +192 -61
- data/lib/wgit/utils.rb +32 -20
- data/lib/wgit/version.rb +2 -1
- data/lib/wgit.rb +3 -1
- metadata +34 -19
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d4f448212c206f3384275cb2df9a0c81139323cef6efa17dcfc1d907c037c08d
+  data.tar.gz: 28a12eee9935860c7c25b7500a718a6aa066e610a83dbc71003be7f63bb8a3dc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0dba9a82f73f3872f4c845b1c6c546efaef0afff1563b89b13a50efc429fe86bfe3590811d127df777569cc802904676d3e9a127705a7f10ce9b895e4b581c7d
+  data.tar.gz: d1c90814c1eab72a3cf6d0ea0f9f3bc239e71fc575b6ff89382135dad5d72bf770a05d75219cd78e16b5c686e8f15c7a5a6c3af89319f64c90e976049f60e7b5
data/.yardopts
CHANGED
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,78 @@
 - ...
 
 ---
 
+## v0.10.1
+### Added
+- Support for Ruby 3.
+### Changed/Removed
+- Removed support for Ruby 2.5 (as it's too old).
+### Fixed
+- ...
+
+---
+
+## v0.10.0
+### Added
+- `Wgit::Url#scheme_relative?` method.
+### Changed/Removed
+- Breaking change: Changed the method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+### Fixed
+- [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URLs.
+
+---
+
+## v0.9.0
+This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples, and all of the wiki articles have been updated to reflect the latest code base.
+### Added
+- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+- `Wgit::Crawler#parse_javascript` which, when set to `true`, uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
+- `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+- `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+- `Wgit::Database#clear_db!` alias.
+- `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
+- `Wgit::Document#extract` method to perform one-off content extractions.
+- `Wgit::Indexer#index_urls` method which can index several urls in one call.
+- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+### Changed/Removed
+- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+- Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
+- Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
+- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
+- Breaking change: Changed the `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+- Breaking change: Refactored `Wgit::Url#relative?` which now takes `:origin` instead of `:base`, taking the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
+- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+- Updated `Utils.printf_search_results` to return the number of results.
+- Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+### Fixed
+- [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before, it simply discarded the document.
+- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+
+---
+
+## v0.8.0
+### Added
+- To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
+- `Wgit::Document#description` default extension.
+- `Wgit::Url.parse_or_nil` method.
+### Changed/Removed
+- Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
+- Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
+- Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
+- Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
+- Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse`, allowing instances of `Wgit::Url` to be returned as is. This also affects `Enumerable#to_urls` in the same way.
+### Fixed
+- An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
+
+---
+
 ## v0.7.0
 ### Added
 - `Wgit::Indexer.new` optional `crawler:` named param.

@@ -58,7 +130,7 @@
 - `Wgit::Response` class containing adapter agnostic HTTP response logic.
 ### Changed/Removed
 - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
-- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/
+- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
 - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
 - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains the `#` prefix; and the query value no longer contains the `?` prefix.
 - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.

@@ -106,7 +178,7 @@
 ---
 
 ## v0.2.0
-This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/
+This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
 ### Added
 - `Wgit::Url#absolute?` method.
 - `Wgit::Url#relative? base: url` support.
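The v0.10.0 `prefix_scheme` entry above describes a named-to-positional parameter migration at every call site. A minimal sketch of the before/after call shapes, using a hypothetical `DemoUrl` stand-in rather than Wgit's real `Wgit::Url` class:

```ruby
# Hypothetical stand-in illustrating only the v0.10.0 call-site change;
# this is NOT Wgit's actual implementation.
class DemoUrl
  def initialize(str)
    @str = str
  end

  # Old (pre v0.10.0) signature was: prefix_scheme(protocol: :http)
  # New signature: a defaulted positional parameter instead.
  def prefix_scheme(scheme = :http)
    @str.include?('://') ? @str : "#{scheme}://#{@str}"
  end
end

DemoUrl.new('example.com').prefix_scheme          # => "http://example.com"
DemoUrl.new('example.com').prefix_scheme(:https)  # => "https://example.com"
```

The same shape applies to `Wgit::Url`: drop the `protocol:` keyword and pass the scheme positionally.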
data/LICENSE.txt
CHANGED
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2016 -
+Copyright (c) 2016 - 2020 Michael Telford
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
data/README.md
CHANGED
@@ -8,346 +8,184 @@
 ---
 
-Wgit is a
+Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
 
+Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+
+- URL parsing
+- Document content extraction (data mining)
+- Crawling entire websites (statistical analysis)
+
+Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
 
-1. [Installation](#Installation)
-2. [Basic Usage](#Basic-Usage)
-3. [Documentation](#Documentation)
-4. [Practical Examples](#Practical-Examples)
-5. [Database Example](#Database-Example)
-6. [Extending The API](#Extending-The-API)
-7. [Caveats](#Caveats)
-8. [Executable](#Executable)
-9. [Change Log](#Change-Log)
-10. [License](#License)
-11. [Contributing](#Contributing)
-12. [Development](#Development)
+Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used, so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.
 
-##
+## Table Of Contents
 
+1. [Usage](#Usage)
+2. [Why Wgit?](#Why-Wgit)
+3. [Why Not Wgit?](#Why-Not-Wgit)
+4. [Installation](#Installation)
+5. [Documentation](#Documentation)
+6. [Executable](#Executable)
+7. [License](#License)
+8. [Contributing](#Contributing)
+9. [Development](#Development)
 
+## Usage
 
+Let's crawl a [quotes website](http://quotes.toscrape.com/), extracting its *quotes* and *authors* using the Wgit DSL:
 
 ```ruby
+require 'wgit'
+require 'json'
+
+include Wgit::DSL
+
+start  'http://quotes.toscrape.com/tag/humor/'
+follow "//li[@class='next']/a/@href"
+
+extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
+extract :authors, "//div[@class='quote']/span/small",          singleton: false
+
+quotes = []
+
+crawl_site do |doc|
+  doc.quotes.zip(doc.authors).each do |arr|
+    quotes << {
+      quote:  arr.first,
+      author: arr.last
+    }
+  end
+end
+
+puts JSON.generate(quotes)
+```
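As an aside, the quote/author pairing used in the new DSL example above can be seen in isolation with plain Ruby. The sample arrays below are made up, mimicking what `doc.quotes` and `doc.authors` would hold after a crawl:

```ruby
require 'json'

# Stand-alone sketch of the zip-based pairing from the README example.
quotes  = ['“Quote one.”', '“Quote two.”']
authors = ['Author A', 'Author B']

# Enumerable#zip interleaves the two arrays element-wise, so each quote
# lines up with its author, which we then map into hashes.
pairs = quotes.zip(authors).map { |quote, author| { quote: quote, author: author } }

puts JSON.generate(pairs)
```

Note that `zip` pairs by position, so it relies on both extractors returning their results in the same document order.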
 
+But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
 
-url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
-
-doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
-crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
-
-doc.class # => Wgit::Document
-doc.class.public_instance_methods(false).sort # => [
-#   :==, :[], :author, :base, :base_url, :content, :css, :doc, :empty?, :external_links,
-#   :external_urls, :html, :internal_absolute_links, :internal_absolute_urls,
-#   :internal_links, :internal_urls, :keywords, :links, :score, :search, :search!,
-#   :size, :statistics, :stats, :text, :title, :to_h, :to_json, :url, :xpath
-# ]
-
-doc.url   # => "https://wikileaks.org/What-is-Wikileaks.html"
-doc.title # => "WikiLeaks - What is WikiLeaks"
-doc.stats # => {
-#   :url=>44, :html=>28133, :title=>17, :keywords=>0,
-#   :links=>35, :text_snippets=>67, :text_bytes=>13735
-# }
-doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
-doc.text  # => ["The Courage Foundation is an international organisation that <snip>", ...]
-
-results = doc.search 'corruption' # Searches doc.text for the given query.
-results.first # => "ial materials involving war, spying and corruption.
-#                   It has so far published more"
-```
-
-## Documentation
-
-100% of Wgit's code is documented using [YARD](https://yardoc.org/), deployed to [rubydoc.info](https://www.rubydoc.info/github/michaeltelford/wgit/master). This greatly benefits developers in using Wgit in their own programs. Another good source of information (as to how the library behaves) are the [tests](https://github.com/michaeltelford/wgit/tree/master/test). Also, see the [Practical Examples](#Practical-Examples) section below for real working examples of Wgit in action.
-
-## Practical Examples
-
-Below are some practical examples of Wgit in use. You can copy and run the code for yourself (it's all been tested).
-
-In addition to the practical examples below, the [wiki](https://github.com/michaeltelford/wgit/wiki) contains a useful 'How To' section with more specific usage of Wgit. You should finish reading this `README` first however.
-
-### WWW HTML Indexer
+include Wgit::DSL
 
+Wgit.logger.level = Logger::WARN
 
+connection_string 'mongodb://user:password@localhost/crawler'
 
+start  'http://quotes.toscrape.com/tag/humor/'
+follow "//li[@class='next']/a/@href"
 
+extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
+extract :authors, "//div[@class='quote']/span/small",          singleton: false
 
-### CSS Indexer
-
-The below script downloads the contents of the first css link found on Facebook's index page.
-
-```ruby
-require 'wgit'
-require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
-
-crawler = Wgit::Crawler.new
-url = 'https://www.facebook.com'.to_url
-
-doc = crawler.crawl url
-
-# Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
-hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
+index_site
+search 'prejudice'
+```
 
-href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
+The `search` call (on the last line) will return and output the results:
 
+```text
+Quotes to Scrape
+“I am free of all prejudice. I hate everyone equally. ”
+http://quotes.toscrape.com/tag/humor/page/2/
 ```
 
+Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
+
+![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
 
-The
+The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes, however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
 
 ```ruby
 require 'wgit'
-require '
-
-my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
-my_pages_missing_keywords = []
-
-competitor_urls = [
-  'http://altitudejunkies.com',
-  'http://www.mountainmadness.com',
-  'http://www.adventureconsultants.com'
-].to_urls
+require 'json'
 
 crawler = Wgit::Crawler.new
+url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+quotes  = []
+
+Wgit::Document.define_extractor(:quotes,  "//div[@class='quote']/span[@class='text']", singleton: false)
+Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small",          singleton: false)
+
+crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+  doc.quotes.zip(doc.authors).each do |arr|
+    quotes << {
+      quote:  arr.first,
+      author: arr.last
+    }
   end
 end
 
-  puts 'Your pages are missing no keywords, nice one!'
-else
-  puts 'Your pages compared to your competitors are missing the following keywords:'
-  puts my_pages_missing_keywords.uniq
-end
+puts JSON.generate(quotes)
 ```
-##
-
-The next example requires a configured database instance. The use of a database for Wgit is entirely optional however and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database).
-
-Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a (small) free account or provide your own MongoDB instance. Take a look at this [Docker Hub image](https://hub.docker.com/r/michaeltelford/mongo-wgit) for an already built example of a `mongo` image configured for use with Wgit; the source of which can be found in the [`./docker`](https://github.com/michaeltelford/wgit/tree/master/docker) directory of this repository.
-
-[`Wgit::Database`](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Database) provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and object serialisation. Using Wgit you can index webpages, store them in a database and then search through all that's been indexed; quickly and easily.
+## Why Wgit?
 
+There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there, so why use Wgit?
 
-| ~> 2.9 | ~> 4.0 |
-
-### Data Model
-
-The data model for Wgit is deliberately simplistic. The MongoDB collections consist of:
-
-| Collection  | Purpose                                         |
-| ----------- | ----------------------------------------------- |
-| `urls`      | Stores URL's to be crawled at a later date      |
-| `documents` | Stores web documents after they've been crawled |
-
-Wgit provides respective Ruby classes for each collection object, allowing for serialisation.
-
-### Configuring MongoDB
-
-Follow the steps below to configure MongoDB for use with Wgit. This is only required if you want to read/write database records using your own (manually configured) instance of Mongo DB.
-
-1) Create collections for: `urls` and `documents`.
-2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections using:
-
-| Collection  | Fields              | Options             |
-| ----------- | ------------------- | ------------------- |
-| `urls`      | `{ "url" : 1 }`     | `{ unique : true }` |
-| `documents` | `{ "url.url" : 1 }` | `{ unique : true }` |
-
-3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so - it's typically enabled by default).
-4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
-```json
-{
-  "text": "text",
-  "author": "text",
-  "keywords": "text",
-  "title": "text"
-}
-```
-
-**Note**: The *text search index* lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should append this list with any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
-
-### Code Example
-
-The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.
+- Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
+- Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit, by default, crawls the entire site by extracting its internal links pointing to the same host.
+- Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
+- Wgit can index (crawl and store) HTML to a database, making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
 
 ```ruby
 require 'wgit'
 
-# In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
-db = Wgit::Database.connect '<your_connection_string>'
-
-### SEED SOME DATA ###
-
-# Here we create our own document rather than crawling the web (which works in the same way).
-# We provide the web page's URL and HTML Strings.
-doc = Wgit::Document.new(
-  'http://test-url.com',
-  "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
-)
-db.insert doc
-
-### SEARCH THE DATABASE ###
+ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
 
-query = 'cow'
-results = db.search query
+wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')
 
-#
-top_result = results.first
-top_result.class          # => Wgit::Document
-doc.url == top_result.url # => true
-
-### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
-
-# Searching each result gives the matching text snippets from that Wgit::Document.
-top_result.search(query).first # => "How now brown cow."
-
-### SEED URLS TO BE CRAWLED LATER ###
+# Only index the most recent of each wiki article, ignoring the rest of Github.
+opts = {
+  allow_paths:    'michaeltelford/wgit/wiki/*',
+  disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
+}
 
+indexer = Wgit::Indexer.new
+indexer.index_site(wiki, **opts)
 ```
 
-##
+## Why Not Wgit?
 
+So why might you not use Wgit, I hear you ask?
 
-Wgit
+- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
+- Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might be better served by a pure browser-based crawler.
+- Wgit, while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
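The worker-thread pattern mentioned in the last bullet above can be sketched in plain Ruby: the sequential crawl loop pushes each document onto a queue, and a pool of threads processes them concurrently. Strings stand in for crawled `Wgit::Document` objects here; no crawling actually happens.

```ruby
# Sketch of handing each crawled document to worker threads for processing.
queue   = Queue.new # thread-safe FIFO from Ruby's core library
results = Queue.new

workers = 4.times.map do
  Thread.new do
    # Queue#pop blocks until an item arrives; a nil item signals shutdown.
    while (doc = queue.pop)
      results << doc.upcase # stand-in for real per-document processing
    end
  end
end

# In real usage this push would live inside e.g. crawl_site { |doc| queue << doc }.
%w[doc1 doc2 doc3].each { |doc| queue << doc }
workers.size.times { queue << nil } # one shutdown signal per worker
workers.each(&:join)

processed = []
processed << results.pop until results.empty?
```

The crawl itself remains sequential; only the per-document processing runs in parallel, which is exactly the trade-off the bullet describes.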
 
+## Installation
 
+Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
 
+Currently, the required MRI Ruby version is:
 
+`ruby '>= 2.6', '< 4'`
 
-require 'wgit'
+### Using Bundler
 
-Wgit::Document.define_extension(
-  :tables,                  # Wgit::Document#tables will return the page's tables.
-  '//table',                # The xpath to extract the tables.
-  singleton: false,         # True returns the first table found, false returns all.
-  text_content_only: false, # True returns one or more Strings of the tables text,
-                            # false returns the tables as Nokogiri objects (see below).
-) do |tables|
-  # Here we can manipulate the object(s) before they're set as Wgit::Document#tables.
-end
+$ bundle add wgit
 
-# Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
-doc = Wgit::Document.new(
-  'http://some_url.com',
-  <<~HTML
-    <html>
-      <p>Hello world! Welcome to my site.</p>
-      <table>
-        <tr><th>Name</th><th>Age</th></tr>
-        <tr><td>Socrates</td><td>101</td></tr>
-        <tr><td>Plato</td><td>106</td></tr>
-      </table>
-      <p>I hope you enjoyed your visit :-)</p>
-    </html>
-  HTML
-)
-
-# Call our newly defined method to obtain the table data we're interested in.
-tables = doc.tables
-
-# Both the collection and each table within the collection are plain Nokogiri objects.
-tables.class       # => Nokogiri::XML::NodeSet
-tables.first.class # => Nokogiri::XML::Element
-
-# Notice the Document's stats now include our 'tables' extension.
-doc.stats # => {
-#   :url=>19, :html=>242, :links=>0, :text_snippets=>2, :text_bytes=>65, :tables=>1
-# }
-```
+### Using RubyGems
 
+$ gem install wgit
 
-- A `Wgit::Document` extension (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. `String`, `Array` etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure the data you want inserted, can be inserted.
-- Once inserted into the Database, you can search a `Wgit::Document`'s extension attributes by updating the Database's *text search index*. See the [Database Example](#Database-Example) for more information.
+### Verify
 
+$ wgit
 
+Calling the installed executable will start a REPL session.
 
+## Documentation
 
+- [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
+- [Wiki](https://github.com/michaeltelford/wgit/wiki)
+- [API Yardocs](https://www.rubydoc.info/gems/wgit)
+- [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
 
 ## Executable
 
-Installing the Wgit gem
+Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded, making it super easy to index and search from the command line without the need for scripts.
 
 The `wgit` executable does the following things (in order):
@@ -355,21 +193,7 @@ The `wgit` executable does the following things (in order):
 2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, whichever is found first)
 3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
 
-The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
-
-## Change Log
-
-See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
-
-### Gem Versioning
-
-The `wgit` gem follows these versioning rules:
-
-- The version format is `MAJOR.MINOR.PATCH` e.g. `0.1.0`.
-- Since the gem hasn't reached `v1.0.0` yet, slightly different semantic versioning rules apply.
-- The `PATCH` represents *non breaking changes* while the `MINOR` represents *breaking changes* e.g. updating from version `0.1.0` to `0.2.0` will likely introduce breaking changes necessitating updates to your codebase.
-- To determine what changes are needed, consult the `CHANGELOG.md`. If you need help, raise an issue.
-- Once `wgit v1.0.0` is released, *normal* [semantic versioning](https://semver.org/) rules will apply e.g. only a `MAJOR` version change should introduce breaking changes.
+The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
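Such a seed file might look like the following. This is hypothetical content (the helper name and URL are made up), using only the DSL calls shown earlier in this diff, and it assumes the `wgit` executable has already loaded the gem and that `WGIT_CONNECTION_STRING` is exported:

```ruby
# Hypothetical .wgit.rb - eval'd by the `wgit` executable at startup.
# Assumes the wgit gem is already loaded and WGIT_CONNECTION_STRING is set.
include Wgit::DSL

# Helper so each REPL session can index (and then `search`) a site quickly.
def index_my_site
  start 'http://quotes.toscrape.com/tag/humor/'
  index_site
end
```

With that in place, a session can simply call `index_my_site` and then `search 'some query'`.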
 
 ## License
@@ -395,14 +219,14 @@ And you're good to go!
 ### Tooling
 
-Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation
+Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
-Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests
+Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.
 
-To generate code documentation run `toys yardoc`. To browse the
+To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
 
-To install this gem onto your local machine, run `toys install
+To install this gem onto your local machine, run `toys install` and follow the prompt.
 
 ### Console
 
-You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created
+You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created an `.env` and `.wgit.rb` file which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.