wgit 0.8.0 → 0.9.0
- checksums.yaml +4 -4
- data/.yardopts +1 -1
- data/CHANGELOG.md +39 -0
- data/LICENSE.txt +1 -1
- data/README.md +118 -323
- data/bin/wgit +9 -5
- data/lib/wgit.rb +3 -1
- data/lib/wgit/assertable.rb +3 -3
- data/lib/wgit/base.rb +30 -0
- data/lib/wgit/crawler.rb +206 -76
- data/lib/wgit/database/database.rb +309 -134
- data/lib/wgit/database/model.rb +10 -3
- data/lib/wgit/document.rb +138 -95
- data/lib/wgit/{document_extensions.rb → document_extractors.rb} +11 -11
- data/lib/wgit/dsl.rb +324 -0
- data/lib/wgit/indexer.rb +65 -162
- data/lib/wgit/response.rb +5 -2
- data/lib/wgit/url.rb +133 -31
- data/lib/wgit/utils.rb +32 -20
- data/lib/wgit/version.rb +2 -1
- metadata +26 -14
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
+  data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
+  data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
data/.yardopts
CHANGED
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,45 @@
 - ...
 ---
 
+## v0.9.0
+
+This release is a big one with the introduction of a `Wgit::DSL` and JavaScript parse support. The `README` has been revamped as a result with new usage examples, and all of the wiki articles have been updated to reflect the latest code base.
+
+### Added
+
+- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+- `Wgit::Crawler#parse_javascript` which, when set to `true`, uses Chrome to parse a page's JavaScript before returning the fully rendered HTML. This feature is disabled by default.
+- `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+- `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next URLs to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+- `Wgit::Database#clear_db!` alias.
+- `Wgit::Document` methods: `#at_xpath` and `#at_css`, which call Nokogiri underneath.
+- `Wgit::Document#extract` method to perform one-off content extractions.
+- `Wgit::Indexer#index_urls` method which can index several URLs in one call.
+- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
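To illustrate a few of the additions above, here is a minimal sketch (the URL, xpath and indicative return values are invented for the example; setting `parse_javascript` via a writable accessor and the exact `#extract` keyword arguments are assumptions - JavaScript parsing also requires Chrome to be installed):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
crawler.parse_javascript = true # Disabled by default; uses Chrome to render the page's JS.

doc = crawler.crawl(Wgit::Url.new('https://example.com'))

# One-off content extraction, without defining a named extractor first.
headings = doc.extract('//h2', singleton: false)

# #at_xpath and #at_css call Nokogiri underneath.
nav = doc.at_css('nav')

# A few of the new Wgit::Url helpers.
url = Wgit::Url.new('https://user:pass@blog.example.com:8080/about')
url.to_user       # e.g. "user"
url.to_sub_domain # e.g. "blog"
url.to_port       # the port component of the URL
```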
+### Changed/Removed
+
+- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+- Breaking change: Renamed `Wgit::Url.parse_or_nil` to `.parse?`.
+- Breaking change: Renamed `Wgit::Utils.process_*` to `.sanitize_*`.
+- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to `Wgit::Model.select_bson_types`.
+- Breaking change: Changed the `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. The same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. The same goes for the named param passed to `Wgit::Crawler.initialize`.
+- Breaking change: Refactored `Wgit::Url#relative?`, which now takes an `:origin` param instead of `:base`; `:origin` takes the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
+- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+- Updated `Utils.printf_search_results` to return the number of results.
+- Updated `Wgit::Indexer.new`, which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new`, which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
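As a rough illustration of the renamed extractor API and the new `Regexp` search support (the xpath, HTML snippet and `:price` attribute below are invented for the example):

```ruby
require 'wgit'

# define_extractor replaces the old define_extension and now also defines
# a setter alongside the usual getter.
Wgit::Document.define_extractor(:price, "//span[@class='price']")

doc = Wgit::Document.new(
  'http://shop.example.com/widget',
  '<html><span class="price">9.99</span></html>'
)

doc.price          # => "9.99"
doc.price = '4.99' # The newly generated setter.

# Document#search now accepts a Regexp as well as a String query.
doc.search(/\d+\.\d\d/)
```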
+### Fixed
+
+- The [re-indexing bug](https://github.com/michaeltelford/wgit/issues/8), so that indexing content a 2nd time will update it in the database - before, it simply discarded the document.
+- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+
+---
+
 ## v0.8.0
 ### Added
 - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
data/LICENSE.txt
CHANGED
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2016 -
+Copyright (c) 2016 - 2020 Michael Telford
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
data/README.md
CHANGED
@@ -8,382 +8,191 @@
 ---
 
-Wgit is a
+Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
+
+Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+
+- URL parsing
+- Document content extraction (data mining)
+- Crawling entire websites (statistical analysis)
+
+Wgit provides a high-level, easy-to-use API and DSL that you can use in your own applications and scripts.
 
-1. [Installation](#Installation)
-2. [Basic Usage](#Basic-Usage)
-3. [Documentation](#Documentation)
-4. [Practical Examples](#Practical-Examples)
-5. [Database Example](#Database-Example)
-6. [Extending The API](#Extending-The-API)
-7. [Caveats](#Caveats)
-8. [Executable](#Executable)
-9. [Change Log](#Change-Log)
-10. [License](#License)
-11. [Contributing](#Contributing)
-12. [Development](#Development)
-
-## Installation
-
-Currently, the required Ruby version is:
-
-`~> 2.5` a.k.a. `>= 2.5 && < 3`
-
-Add this line to your application's `Gemfile`:
-
-```ruby
-gem 'wgit'
-```
-
-And then execute:
-
-$ bundle
-
-Or install it yourself as:
+Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.
 
+## Table Of Contents
+
+1. [Usage](#Usage)
+2. [Why Wgit?](#Why-Wgit)
+3. [Why Not Wgit?](#Why-Not-Wgit)
+4. [Installation](#Installation)
+5. [Documentation](#Documentation)
+6. [Executable](#Executable)
+7. [License](#License)
+8. [Contributing](#Contributing)
+9. [Development](#Development)
+
+## Usage
+
+Let's crawl a [quotes website](http://quotes.toscrape.com/), extracting its *quotes* and *authors* using the Wgit DSL:
 
 ```ruby
 require 'wgit'
+require 'json'
 
-url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
-
-doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
-crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
-
-doc.class # => Wgit::Document
-doc.class.public_instance_methods(false).sort # => [
-  # :==, :[], :author, :base, :base_url, :content, :css, :description, :doc, :empty?,
-  # :external_links, :external_urls, :html, :internal_absolute_links,
-  # :internal_absolute_urls, :internal_links, :internal_urls, :keywords, :links, :score,
-  # :search, :search!, :size, :statistics, :stats, :text, :title, :to_h, :to_json,
-  # :url, :xpath
-# ]
-
-doc.url   # => "https://wikileaks.org/What-is-Wikileaks.html"
-doc.title # => "WikiLeaks - What is WikiLeaks"
-doc.stats # => {
-  # :url=>44, :html=>28133, :title=>17, :keywords=>0,
-  # :links=>35, :text=>67, :text_bytes=>13735
-# }
-doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
-doc.text  # => ["The Courage Foundation is an international organisation that <snip>", ...]
-
-results = doc.search 'corruption' # Searches doc.text for the given query.
-results.first # => "ial materials involving war, spying and corruption.
-              #     It has so far published more"
-```
-
-## Documentation
+include Wgit::DSL
 
+start  'http://quotes.toscrape.com/tag/humor/'
+follow "//li[@class='next']/a/@href"
 
+extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
+extract :authors, "//div[@class='quote']/span/small",          singleton: false
 
+quotes = []
 
-### Website Downloader
-
-Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.
-
-### Broken Link Finder
-
-The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
-
-### CSS Indexer
-
-The below script downloads the contents of the first css link found on Facebook's index page.
-
-```ruby
-require 'wgit'
-require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
-
-crawler = Wgit::Crawler.new
-url = 'https://www.facebook.com'.to_url
-
-doc = crawler.crawl url
-
-# Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
-hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
-
-hrefs.class # => Nokogiri::XML::NodeSet
-href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
+crawl_site do |doc|
+  doc.quotes.zip(doc.authors).each do |arr|
+    quotes << {
+      quote:  arr.first,
+      author: arr.last
+    }
+  end
+end
 
-css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
+puts JSON.generate(quotes)
 ```
 
-The below script downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketeers for search engine optimisation (SEO) for example.
+The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes, however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
 
 ```ruby
 require 'wgit'
-require '
-
-my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
-my_pages_missing_keywords = []
-
-competitor_urls = [
-  'http://altitudejunkies.com',
-  'http://www.mountainmadness.com',
-  'http://www.adventureconsultants.com'
-].to_urls
+require 'json'
 
 crawler = Wgit::Crawler.new
+url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+quotes  = []
+
+Wgit::Document.define_extractor(:quotes,  "//div[@class='quote']/span[@class='text']", singleton: false)
+Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small",          singleton: false)
+
+crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+  doc.quotes.zip(doc.authors).each do |arr|
+    quotes << {
+      quote:  arr.first,
+      author: arr.last
+    }
   end
 end
 
-puts 'Your pages are missing no keywords, nice one!'
-else
-  puts 'Your pages compared to your competitors are missing the following keywords:'
-  puts my_pages_missing_keywords.uniq
-end
+puts JSON.generate(quotes)
 ```
 
-The next example requires a configured database instance. The use of a database for Wgit is entirely optional however and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database).
-
-Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a (small) free account or provide your own MongoDB instance. Take a look at this [Docker Hub image](https://hub.docker.com/r/michaeltelford/mongo-wgit) for an already built example of a `mongo` image configured for use with Wgit; the source of which can be found in the [`./docker`](https://github.com/michaeltelford/wgit/tree/master/docker) directory of this repository.
+But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
-### Versioning
-
-The following versions of MongoDB are currently supported:
-
-| ------ | -------- |
-| ~> 2.9 | ~> 4.0 |
-
-| `urls`      | Stores URL's to be crawled at a later date |
-| `documents` | Stores web documents after they've been crawled |
-
-| ----------- | ------------------- | ------------------- |
-| `urls`      | `{ "url" : 1 }`     | `{ unique : true }` |
-| `documents` | `{ "url.url" : 1 }` | `{ unique : true }` |
-
-4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
-```json
-{
-  "text": "text",
-  "author": "text",
-  "keywords": "text",
-  "title": "text"
-}
-```
+```ruby
+require 'wgit'
+
+include Wgit::DSL
+
+Wgit.logger.level = Logger::WARN
+
+connection_string 'mongodb://user:password@localhost/crawler'
+clear_db!
+
+extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
+extract :authors, "//div[@class='quote']/span/small",          singleton: false
+
+start  'http://quotes.toscrape.com/tag/humor/'
+follow "//li[@class='next']/a/@href"
+
+index_site
+search 'prejudice'
+```
+
+The `search` call (on the last line) will return and output the results:
+
+```text
+Quotes to Scrape
+“I am free of all prejudice. I hate everyone equally. ”
+http://quotes.toscrape.com/tag/humor/page/2/
+```
+
+Using a MongoDB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+
+![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
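For reference, here is a rough equivalent of the `index_site`/`search` DSL calls above using the underlying classes instead (the connection string is a placeholder and the exact constructor signatures are assumptions):

```ruby
require 'wgit'

db      = Wgit::Database.new('mongodb://user:password@localhost/crawler')
indexer = Wgit::Indexer.new(db)

indexer.index_site(Wgit::Url.new('http://quotes.toscrape.com/tag/humor/'))

# Database#search returns the Wgit::Document's whose indexed fields match the query.
db.search('prejudice').each { |doc| puts doc.url }
```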
+## Why Wgit?
 
+There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there, so why use Wgit?
 
+- Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
+- Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit, by default, crawls the entire site by extracting its internal links pointing to the same host.
+- Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
+- Wgit can index (crawl and store) HTML to a database, making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
 
 ```ruby
 require 'wgit'
 
-# In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
-db = Wgit::Database.connect '<your_connection_string>'
+ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
 
-db.insert doc
-
-### SEARCH THE DATABASE ###
-
-# Searching the database returns Wgit::Document's which have fields containing the query.
-query = 'cow'
-results = db.search query
-
-# By default, the MongoDB ranking applies i.e. results.first has the most hits.
-# Because results is an Array of Wgit::Document's, we can custom sort/rank e.g.
-# `results.sort_by! { |doc| doc.url.crawl_duration }` ranks via page load times with
-# results.first being the fastest. Any Wgit::Document attribute can be used, including
-# those you define yourself by extending the API.
-
-top_result = results.first
-top_result.class          # => Wgit::Document
-doc.url == top_result.url # => true
-
-### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
-
-# Searching each result gives the matching text snippets from that Wgit::Document.
-top_result.search(query).first # => "How now brown cow."
-
-### SEED URLS TO BE CRAWLED LATER ###
+wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')
 
+# Only index the most recent of each wiki article, ignoring the rest of Github.
+opts = {
+  allow_paths:    'michaeltelford/wgit/wiki/*',
+  disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
+}
 
+indexer = Wgit::Indexer.new
+indexer.index_site(wiki, **opts)
 ```
 
-Document serialising in Wgit is the means of downloading a web page and serialising parts of its content into accessible `Wgit::Document` attributes/methods. For example, `Wgit::Document#author` will return you the webpage's xpath value of `meta[@name='author']`.
-
-There are two ways to extend the Document serialising behaviour of Wgit for your own means:
-
-1. Add additional **textual** content to `Wgit::Document#text`.
-2. Define `Wgit::Document` instance methods for specific HTML **elements**.
-
-Below describes these two methods in more detail.
+## Why Not Wgit?
 
+So why might you not use Wgit, I hear you ask?
 
+- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
+- Wgit can parse a crawled page's JavaScript, but it doesn't do so by default. If your crawls are JS heavy then you might be better off with a pure browser-based crawler instead.
+- Wgit, while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing (see the sketch below) - but if you need concurrent crawling then you should consider something else.
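Regarding the last point above, a minimal sketch of handing each crawled document off to a worker thread (the site URL is a placeholder; the crawl itself still runs sequentially):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
workers = []

crawler.crawl_site(Wgit::Url.new('https://example.com')) do |doc|
  # The crawl is sequential, but per-document processing needn't be.
  workers << Thread.new(doc) { |d| puts "#{d.url} - #{d.text.size} text snippets" }
end

workers.each(&:join)
```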
-```ruby
-require 'wgit'
-
-# The default text_elements cover most visible page text but let's say we
-# have a <table> element with text content that we want.
-Wgit::Document.text_elements << :table
-
-doc = Wgit::Document.new(
-  'http://some_url.com',
-  <<~HTML
-    <html>
-      <p>Hello world!</p>
-      <table>My table</table>
-    </html>
-  HTML
-)
-
-# Now every crawled Document#text will include <table> text content.
-doc.text            # => ["Hello world!", "My table"]
-doc.search('table') # => ["My table"]
-```
+## Installation
 
+Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
 
+Currently, the required MRI Ruby version is:
 
+`~> 2.5` a.k.a. `>= 2.5 && < 3`
 
+### Using Bundler
 
+Add this line to your application's `Gemfile`:
 
 ```ruby
+gem 'wgit'
+```
 
-Wgit::Document.define_extension(
-  :tables,                  # Wgit::Document#tables will return the page's tables.
-  '//table',                # The xpath to extract the tables.
-  singleton: false,         # True returns the first table found, false returns all.
-  text_content_only: false, # True returns the table text, false returns the Nokogiri object.
-) do |tables|
-  # Here we can inspect/manipulate the tables before they're set as Wgit::Document#tables.
-  tables
-end
+And then execute:
 
-# is initialised e.g. manually (as below) or via Wgit::Crawler methods etc.
-doc = Wgit::Document.new(
-  'http://some_url.com',
-  <<~HTML
-    <html>
-      <p>Hello world! Welcome to my site.</p>
-      <table>
-        <tr><th>Name</th><th>Age</th></tr>
-        <tr><td>Socrates</td><td>101</td></tr>
-        <tr><td>Plato</td><td>106</td></tr>
-      </table>
-      <p>I hope you enjoyed your visit :-)</p>
-    </html>
-  HTML
-)
-
-# Call our newly defined method to obtain the table data we're interested in.
-tables = doc.tables
-
-# Both the collection and each table within the collection are plain Nokogiri objects.
-tables.class       # => Nokogiri::XML::NodeSet
-tables.first.class # => Nokogiri::XML::Element
-
-# Note, the Document's stats now include our 'tables' extension.
-doc.stats # => {
-  # :url=>19, :html=>242, :links=>0, :text=>8, :text_bytes=>91, :tables=>1
-# }
-```
+$ bundle
 
+### Using RubyGems
 
+$ gem install wgit
 
-- A `Wgit::Document` extension (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. `String`, `Array` etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure the data you want inserted, can be inserted.
-- Once inserted into the Database, you can search a `Wgit::Document`'s extension attributes by updating the Database's *text search index*. See the [Database Example](#Database-Example) for more information.
+Verify the install by using the executable (to start a REPL session):
 
+$ wgit
 
+## Documentation
 
+- [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
+- [Wiki](https://github.com/michaeltelford/wgit/wiki)
+- [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+- [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
 ## Executable
 
-Installing the Wgit gem
+Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded, making it super easy to index and search from the command line without the need for scripts.
 
 The `wgit` executable does the following things (in order):
@@ -391,21 +200,7 @@ The `wgit` executable does the following things (in order):
 2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, whichever is found first)
 3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
 
-The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session.
-
-## Change Log
-
-See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
-
-### Gem Versioning
-
-The `wgit` gem follows these versioning rules:
-
-- The version format is `MAJOR.MINOR.PATCH` e.g. `0.1.0`.
-- Since the gem hasn't reached `v1.0.0` yet, slightly different semantic versioning rules apply.
-- The `PATCH` represents *non breaking changes* while the `MINOR` represents *breaking changes* e.g. updating from version `0.1.0` to `0.2.0` will likely introduce breaking changes necessitating updates to your codebase.
-- To determine what changes are needed, consult the `CHANGELOG.md`. If you need help, raise an issue.
-- Once `wgit v1.0.0` is released, *normal* [semantic versioning](https://semver.org/) rules will apply e.g. only a `MAJOR` version change should introduce breaking changes.
+The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
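As an illustration, a `.wgit.rb` along these lines (the site URL and helper names are invented, and it assumes `WGIT_CONNECTION_STRING` is set) would pre-define such helpers for each session:

```ruby
# .wgit.rb - eval'd by the `wgit` executable on startup.
def index_my_site
  indexer = Wgit::Indexer.new # Defaults to Wgit::Database.new, which reads ENV['WGIT_CONNECTION_STRING'].
  indexer.index_site(Wgit::Url.new('https://example.com'))
end

def search_my_site(query)
  Wgit::Database.new.search(query)
end
```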
 ## License
@@ -431,14 +226,14 @@ And you're good to go!
 ### Tooling
 
-Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation
+Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
-Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests
+Run `toys db` to see a list of database related tools, enabling you to run a MongoDB instance locally using Docker. Run `toys test` to execute the tests.
 
-To generate code documentation run `toys yardoc`. To browse the
+To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
 
-To install this gem onto your local machine, run `toys install
+To install this gem onto your local machine, run `toys install` and follow the prompt.
 
 ### Console
 
-You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created
+You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created an `.env` and `.wgit.rb` file which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.