wgit 0.8.0 → 0.9.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: dc2ea1de7219c66eb70ed7b4e97bb8da7c169eb68257df7d7d1bfdaf6a5ed4d6
-   data.tar.gz: f88afdd7477812c3b9fdbde8e1125b950aec3fe3fabef5a20e0c16a9e26a767b
+   metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
+   data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
  SHA512:
-   metadata.gz: e5fe86fee44e4c494d936d747719d856c240a0129db0692afa44ad9855340929a8b90cc177d92d775bc7cc15af99093078cf6a8fc8b9bbd9bc8965866d343914
-   data.tar.gz: c87efb4c5dcfb8795d62ab41f7e8d2bc206e7f8407707c9269c0fc86fbdcd14b7269d04083ec756d3b77a99300639469e88c639ee125ddee0984c3957e7cfc7b
+   metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
+   data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
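These values are the SHA256/SHA512 digests of the `metadata.gz` and `data.tar.gz` archives packaged inside the `.gem` file. As a rough sketch (not an official verification tool), you could compare one of them yourself after a `gem fetch wgit -v 0.9.0`:

```ruby
require 'digest'
require 'rubygems/package'

# SHA256 of data.tar.gz as recorded in checksums.yaml above (v0.9.0).
expected = '6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb'
actual   = nil

# A .gem file is a tar archive containing metadata.gz and data.tar.gz.
File.open('wgit-0.9.0.gem', 'rb') do |file|
  Gem::Package::TarReader.new(file) do |tar|
    tar.each do |entry|
      actual = Digest::SHA256.hexdigest(entry.read) if entry.full_name == 'data.tar.gz'
    end
  end
end

puts actual == expected ? 'data.tar.gz checksum matches' : 'checksum mismatch!'
```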
data/.yardopts CHANGED
@@ -1,5 +1,5 @@
  --readme README.md
- --title 'Wgit API Documentation'
+ --title 'Wgit Gem Documentation'
  --charset utf-8
  --markup markdown
  --output .doc
@@ -9,6 +9,45 @@
  - ...
  ---

+ ## v0.9.0
+ This release is a big one with the introduction of a `Wgit::DSL` and JavaScript parse support. The `README` has been revamped as a result with new usage examples, and all of the wiki articles have been updated to reflect the latest code base.
+ ### Added
+ - `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+ - `Wgit::Crawler#parse_javascript` which, when set to `true`, uses Chrome to parse a page's JavaScript before returning the fully rendered HTML. This feature is disabled by default.
+ - `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+ - `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+ - `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next URLs to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+ - `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+ - `Wgit::Database#clear_db!` alias.
+ - `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call Nokogiri underneath.
+ - `Wgit::Document#extract` method to perform one-off content extractions.
+ - `Wgit::Indexer#index_urls` method which can index several URLs in one call.
+ - `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+ ### Changed/Removed
+ - Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+ - Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+ - Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+ - Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+ - Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+ - Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+ - Breaking change: Renamed `Wgit::Url.parse_or_nil` to `.parse?`.
+ - Breaking change: Renamed `Wgit::Utils.process_*` to `.sanitize_*`.
+ - Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to `Wgit::Model.select_bson_types`.
+ - Breaking change: Changed the `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+ - Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+ - Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+ - Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+ - Breaking change: Refactored `Wgit::Url#relative?`, which now takes `:origin` (which takes the port into account) instead of `:base`. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
+ - Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+ - Updated `Utils.printf_search_results` to return the number of results.
+ - Updated `Wgit::Indexer.new`, which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new`, which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+ - Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+ - Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+ ### Fixed
+ - [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before, it simply discarded the document.
+ - `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+ ---
+
  ## v0.8.0
  ### Added
  - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
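To make a few of the v0.9.0 changes listed above concrete, here is a brief, hedged sketch (it is not part of the release notes, and the URL is just a placeholder) showing the renamed `timeout:` param, the `#parse_javascript` accessor, `Wgit::Url.parse?` and `Wgit::Document#extract`:

```ruby
require 'wgit'

url = Wgit::Url.parse?('https://example.com') # Renamed from .parse_or_nil; returns nil if parsing fails.

crawler = Wgit::Crawler.new(timeout: 10)      # Named param renamed from time_out:.
crawler.parse_javascript = true               # New in v0.9.0; disabled by default, uses Chrome to render JS.

doc = crawler.crawl(url)

heading = doc.extract('//h1')                 # One-off extraction, no extractor definition needed.
matches = doc.search(/ruby/i)                 # #search now accepts a Regexp as well as a String.
```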
@@ -1,6 +1,6 @@
  The MIT License (MIT)

- Copyright (c) 2016 - 2019 Michael Telford
+ Copyright (c) 2016 - 2020 Michael Telford

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
data/README.md CHANGED
@@ -8,382 +8,191 @@

  ---

- Wgit is a Ruby library primarily used for crawling, indexing and searching HTML webpages.
+ Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.

- Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to scrape entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains, including:

- Check out this [demo application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
+ - URL parsing
+ - Document content extraction (data mining)
+ - Crawling entire websites (statistical analysis)

- Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
+ Wgit provides a high-level, easy-to-use API and DSL that you can use in your own applications and scripts.

- ## Table Of Contents
-
- 1. [Installation](#Installation)
- 2. [Basic Usage](#Basic-Usage)
- 3. [Documentation](#Documentation)
- 4. [Practical Examples](#Practical-Examples)
- 5. [Database Example](#Database-Example)
- 6. [Extending The API](#Extending-The-API)
- 7. [Caveats](#Caveats)
- 8. [Executable](#Executable)
- 9. [Change Log](#Change-Log)
- 10. [License](#License)
- 11. [Contributing](#Contributing)
- 12. [Development](#Development)
-
- ## Installation
-
- Currently, the required Ruby version is:
-
- `~> 2.5` a.k.a. `>= 2.5 && < 3`
-
- Add this line to your application's `Gemfile`:
-
- ```ruby
- gem 'wgit'
- ```
-
- And then execute:
-
- $ bundle
-
- Or install it yourself as:
+ Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby-related.

- $ gem install wgit
+ ## Table Of Contents

- Verify the install by using the executable (to start a shell session):
+ 1. [Usage](#Usage)
+ 2. [Why Wgit?](#Why-Wgit)
+ 3. [Why Not Wgit?](#Why-Not-Wgit)
+ 4. [Installation](#Installation)
+ 5. [Documentation](#Documentation)
+ 6. [Executable](#Executable)
+ 7. [License](#License)
+ 8. [Contributing](#Contributing)
+ 9. [Development](#Development)

- $ wgit
+ ## Usage

- ## Basic Usage
+ Let's crawl a [quotes website](http://quotes.toscrape.com/), extracting its *quotes* and *authors* using the Wgit DSL:

  ```ruby
  require 'wgit'
+ require 'json'

- crawler = Wgit::Crawler.new # Uses Typhoeus -> libcurl underneath. It's fast!
- url = Wgit::Url.new 'https://wikileaks.org/What-is-Wikileaks.html'
-
- doc = crawler.crawl url # Or use #crawl_site(url) { |doc| ... } etc.
- crawler.last_response.class # => Wgit::Response is a wrapper for Typhoeus::Response.
-
- doc.class # => Wgit::Document
- doc.class.public_instance_methods(false).sort # => [
-   # :==, :[], :author, :base, :base_url, :content, :css, :description, :doc, :empty?,
-   # :external_links, :external_urls, :html, :internal_absolute_links,
-   # :internal_absolute_urls, :internal_links, :internal_urls, :keywords, :links, :score,
-   # :search, :search!, :size, :statistics, :stats, :text, :title, :to_h, :to_json,
-   # :url, :xpath
- # ]
-
- doc.url # => "https://wikileaks.org/What-is-Wikileaks.html"
- doc.title # => "WikiLeaks - What is WikiLeaks"
- doc.stats # => {
-   # :url=>44, :html=>28133, :title=>17, :keywords=>0,
-   # :links=>35, :text=>67, :text_bytes=>13735
- # }
- doc.links # => ["#submit_help_contact", "#submit_help_tor", "#submit_help_tips", ...]
- doc.text # => ["The Courage Foundation is an international organisation that <snip>", ...]
-
- results = doc.search 'corruption' # Searches doc.text for the given query.
- results.first # => "ial materials involving war, spying and corruption.
- # It has so far published more"
- ```
-
- ## Documentation
+ include Wgit::DSL

- 100% of Wgit's code is documented using [YARD](https://yardoc.org/), deployed to [rubydoc.info](https://www.rubydoc.info/github/michaeltelford/wgit/master). This greatly benefits developers in using Wgit in their own programs. Another good source of information (as to how the library behaves) are the [tests](https://github.com/michaeltelford/wgit/tree/master/test). Also, see the [Practical Examples](#Practical-Examples) section below for real working examples of Wgit in action.
+ start 'http://quotes.toscrape.com/tag/humor/'
+ follow "//li[@class='next']/a/@href"

- ## Practical Examples
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false

- Below are some practical examples of Wgit in use. You can copy and run the code for yourself (it's all been tested).
+ quotes = []

- In addition to the practical examples below, the [wiki](https://github.com/michaeltelford/wgit/wiki) contains a useful 'How To' section with more specific usage of Wgit. You should finish reading this `README` first however.
-
- ### WWW HTML Indexer
-
- See the [`Wgit::Indexer#index_www`](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2Eindex_www) documentation and source code for an already built example of a WWW HTML indexer. It will crawl any external URL's (in the database) and index their HTML for later use, be it searching or otherwise. It will literally crawl the WWW forever if you let it!
-
- See the [Database Example](#Database-Example) for information on how to configure a database for use with Wgit.
-
- ### Website Downloader
-
- Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.
-
- ### Broken Link Finder
-
- The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
-
- ### CSS Indexer
-
- The below script downloads the contents of the first css link found on Facebook's index page.
-
- ```ruby
- require 'wgit'
- require 'wgit/core_ext' # Provides the String#to_url and Enumerable#to_urls methods.
-
- crawler = Wgit::Crawler.new
- url = 'https://www.facebook.com'.to_url
-
- doc = crawler.crawl url
-
- # Provide your own xpath (or css selector) to search the HTML using Nokogiri underneath.
- hrefs = doc.xpath "//link[@rel='stylesheet']/@href"
-
- hrefs.class # => Nokogiri::XML::NodeSet
- href = hrefs.first.value # => "https://static.xx.fbcdn.net/rsrc.php/v3/y1/l/0,cross/NvZ4mNTW3Fd.css"
+ crawl_site do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end

- css = crawler.crawl href.to_url
- css[0..50] # => "._3_s0._3_s0{border:0;display:flex;height:44px;min-"
+ puts JSON.generate(quotes)
  ```

- ### Keyword Indexer (SEO Helper)
-
- The below script downloads the contents of several webpages and pulls out their keywords for comparison. Such a script might be used by marketeers for search engine optimisation (SEO) for example.
+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts to experiment with. Wgit's DSL is simply a wrapper around the underlying classes, however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:

  ```ruby
  require 'wgit'
- require 'wgit/core_ext' # => Provides the String#to_url and Enumerable#to_urls methods.
-
- my_pages_keywords = ['Everest', 'mountaineering school', 'adventure']
- my_pages_missing_keywords = []
-
- competitor_urls = [
-   'http://altitudejunkies.com',
-   'http://www.mountainmadness.com',
-   'http://www.adventureconsultants.com'
- ].to_urls
+ require 'json'

  crawler = Wgit::Crawler.new
-
- crawler.crawl(*competitor_urls) do |doc|
-   # If there are keywords present in the web document.
-   if doc.keywords.respond_to? :-
-     puts "The keywords for #{doc.url} are: \n#{doc.keywords}\n\n"
-     my_pages_missing_keywords.concat(doc.keywords - my_pages_keywords)
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+ quotes = []
+
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
+
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
    end
  end

- if my_pages_missing_keywords.empty?
-   puts 'Your pages are missing no keywords, nice one!'
- else
-   puts 'Your pages compared to your competitors are missing the following keywords:'
-   puts my_pages_missing_keywords.uniq
- end
+ puts JSON.generate(quotes)
  ```

- ## Database Example
-
- The next example requires a configured database instance. The use of a database for Wgit is entirely optional however and isn't required for crawling or URL parsing etc. A database is only needed when indexing (inserting crawled data into the database).
-
- Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a (small) free account or provide your own MongoDB instance. Take a look at this [Docker Hub image](https://hub.docker.com/r/michaeltelford/mongo-wgit) for an already built example of a `mongo` image configured for use with Wgit; the source of which can be found in the [`./docker`](https://github.com/michaeltelford/wgit/tree/master/docker) directory of this repository.
+ But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):

- [`Wgit::Database`](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Database) provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and object serialisation. Using Wgit you can index webpages, store them in a database and then search through all that's been indexed; quickly and easily.
-
- ### Versioning
-
- The following versions of MongoDB are currently supported:
+ ```ruby
+ require 'wgit'

- | Gem    | Database |
- | ------ | -------- |
- | ~> 2.9 | ~> 4.0   |
+ include Wgit::DSL

- ### Data Model
+ Wgit.logger.level = Logger::WARN

- The data model for Wgit is deliberately simplistic. The MongoDB collections consist of:
+ connection_string 'mongodb://user:password@localhost/crawler'
+ clear_db!

- | Collection  | Purpose                                          |
- | ----------- | ------------------------------------------------ |
- | `urls`      | Stores URL's to be crawled at a later date       |
- | `documents` | Stores web documents after they've been crawled  |
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false

- Wgit provides respective Ruby classes for each collection object, allowing for serialisation.
+ start 'http://quotes.toscrape.com/tag/humor/'
+ follow "//li[@class='next']/a/@href"

- ### Configuring MongoDB
+ index_site
+ search 'prejudice'
+ ```

- Follow the steps below to configure MongoDB for use with Wgit. This is only required if you want to read/write database records using your own (manually configured) instance of Mongo DB.
+ The `search` call (on the last line) will return and output the results:

- 1) Create collections for: `urls` and `documents`.
- 2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections using:
+ ```text
+ Quotes to Scrape
+ “I am free of all prejudice. I hate everyone equally. ”
+ http://quotes.toscrape.com/tag/humor/page/2/
+ ```
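If you'd rather work with the results programmatically than have them printed, the underlying `Wgit::Database` can be queried directly. A minimal sketch, reusing the placeholder connection string from the example above:

```ruby
require 'wgit'

db = Wgit::Database.new('mongodb://user:password@localhost/crawler')

results = db.search('prejudice')    # => Array of Wgit::Document's, ranked by MongoDB.

top = results.first
puts top.url                        # The page the match came from.
puts top.search('prejudice').first  # The matching text snippet itself.
```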

- | Collection  | Fields              | Options             |
- | ----------- | ------------------- | ------------------- |
- | `urls`      | `{ "url" : 1 }`     | `{ unique : true }` |
- | `documents` | `{ "url.url" : 1 }` | `{ unique : true }` |
+ Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:

- 3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so - it's typically enabled by default).
- 4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
- ```json
- {
-   "text": "text",
-   "author": "text",
-   "keywords": "text",
-   "title": "text"
- }
- ```
+ ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)

- **Note**: The *text search index* lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should append this list with any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
+ ## Why Wgit?

- ### Code Example
+ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there, so why use Wgit?

- The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.
+ - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
+ - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit, by default, crawls the entire site by extracting its internal links pointing to the same host.
+ - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
+ - Wgit can index (crawl and store) HTML to a database, making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:

  ```ruby
  require 'wgit'

- ### CONNECT TO THE DATABASE ###
-
- # In the absence of a connection string parameter, ENV['WGIT_CONNECTION_STRING'] will be used.
- db = Wgit::Database.connect '<your_connection_string>'
+ ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'

- ### SEED SOME DATA ###
+ wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')

- # Here we create our own document rather than crawling the web (which works in the same way).
- # We provide the web page's URL and HTML Strings.
- doc = Wgit::Document.new(
-   'http://test-url.com',
-   "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
- )
- db.insert doc
-
- ### SEARCH THE DATABASE ###
-
- # Searching the database returns Wgit::Document's which have fields containing the query.
- query = 'cow'
- results = db.search query
-
- # By default, the MongoDB ranking applies i.e. results.first has the most hits.
- # Because results is an Array of Wgit::Document's, we can custom sort/rank e.g.
- # `results.sort_by! { |doc| doc.url.crawl_duration }` ranks via page load times with
- # results.first being the fastest. Any Wgit::Document attribute can be used, including
- # those you define yourself by extending the API.
-
- top_result = results.first
- top_result.class # => Wgit::Document
- doc.url == top_result.url # => true
-
- ### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
-
- # Searching each result gives the matching text snippets from that Wgit::Document.
- top_result.search(query).first # => "How now brown cow."
-
- ### SEED URLS TO BE CRAWLED LATER ###
+ # Only index the most recent of each wiki article, ignoring the rest of GitHub.
+ opts = {
+   allow_paths: 'michaeltelford/wgit/wiki/*',
+   disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
+ }

- db.insert top_result.external_links
- urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_links.
+ indexer = Wgit::Indexer.new
+ indexer.index_site(wiki, **opts)
  ```
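Once indexed, the wiki articles can be searched like any other stored documents. A short sketch, assuming `WGIT_CONNECTION_STRING` is still set as in the script above (the query term is just an example):

```ruby
require 'wgit'

db = Wgit::Database.new              # Uses ENV['WGIT_CONNECTION_STRING'] when no param is given.

puts db.num_docs                     # Total number of indexed documents.
db.search('extractor').each { |doc| puts doc.url }
```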

- ## Extending The API
-
- Document serialising in Wgit is the means of downloading a web page and serialising parts of its content into accessible `Wgit::Document` attributes/methods. For example, `Wgit::Document#author` will return you the webpage's xpath value of `meta[@name='author']`.
-
- There are two ways to extend the Document serialising behaviour of Wgit for your own means:
-
- 1. Add additional **textual** content to `Wgit::Document#text`.
- 2. Define `Wgit::Document` instance methods for specific HTML **elements**.
-
- Below describes these two methods in more detail.
+ ## Why Not Wgit?

- ### 1. Extending The Default Text Elements
+ So why might you not use Wgit, I hear you ask?

- Wgit contains a set of `Wgit::Document.text_elements` defining which HTML elements contain text on a page; which in turn are serialised. Once serialised you can process this text content via methods like `Wgit::Document#text` and `Wgit::Document#search` etc.
+ - Wgit doesn't allow for webpage interaction, e.g. signing in as a user. There are better gems out there for that.
+ - Wgit can parse a crawled page's JavaScript, but it doesn't do so by default. If your crawls are JS-heavy then you might be better off with a purely browser-based crawler instead.
+ - Wgit, while fast (using `libcurl` for HTTP etc.), isn't multi-threaded, so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
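On that last point, a minimal sketch of the worker-thread pattern (reusing the quotes site from the earlier examples; the crawl itself still runs sequentially):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
queue   = Queue.new

# One consumer thread processes documents while the crawl carries on.
worker = Thread.new do
  while (doc = queue.pop)
    puts "#{doc.url} - #{doc.text.size} text snippets"
  end
end

crawler.crawl_site(Wgit::Url.new('http://quotes.toscrape.com/')) { |doc| queue << doc }

queue << nil # Tell the worker the crawl has finished.
worker.join
```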

- The below code example shows how to extract additional text from a webpage:
-
- ```ruby
- require 'wgit'
-
- # The default text_elements cover most visible page text but let's say we
- # have a <table> element with text content that we want.
- Wgit::Document.text_elements << :table
-
- doc = Wgit::Document.new(
-   'http://some_url.com',
-   <<~HTML
-     <html>
-       <p>Hello world!</p>
-       <table>My table</table>
-     </html>
-   HTML
- )
-
- # Now every crawled Document#text will include <table> text content.
- doc.text # => ["Hello world!", "My table"]
- doc.search('table') # => ["My table"]
- ```
+ ## Installation

- **Note**: This only works for *textual* page content. For more control over the serialised *elements* themselves, see below.
+ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.

- ### 2. Serialising Specific HTML Elements (via Document Extensions)
+ Currently, the required MRI Ruby version is:

- Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next.
+ `~> 2.5` a.k.a. `>= 2.5 && < 3`

- Therefore, you can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
+ ### Using Bundler

- Here's how to add a Document extension to serialise a specific page element:
+ Add this line to your application's `Gemfile`:

  ```ruby
- require 'wgit'
+ gem 'wgit'
+ ```

- # Let's get all the page's <table> elements.
- Wgit::Document.define_extension(
-   :tables, # Wgit::Document#tables will return the page's tables.
-   '//table', # The xpath to extract the tables.
-   singleton: false, # True returns the first table found, false returns all.
-   text_content_only: false, # True returns the table text, false returns the Nokogiri object.
- ) do |tables|
-   # Here we can inspect/manipulate the tables before they're set as Wgit::Document#tables.
-   tables
- end
+ And then execute:

- # Our Document has a table which we're interested in. Note it doesn't matter how the Document
- # is initialised e.g. manually (as below) or via Wgit::Crawler methods etc.
- doc = Wgit::Document.new(
-   'http://some_url.com',
-   <<~HTML
-     <html>
-       <p>Hello world! Welcome to my site.</p>
-       <table>
-         <tr><th>Name</th><th>Age</th></tr>
-         <tr><td>Socrates</td><td>101</td></tr>
-         <tr><td>Plato</td><td>106</td></tr>
-       </table>
-       <p>I hope you enjoyed your visit :-)</p>
-     </html>
-   HTML
- )
-
- # Call our newly defined method to obtain the table data we're interested in.
- tables = doc.tables
-
- # Both the collection and each table within the collection are plain Nokogiri objects.
- tables.class # => Nokogiri::XML::NodeSet
- tables.first.class # => Nokogiri::XML::Element
-
- # Note, the Document's stats now include our 'tables' extension.
- doc.stats # => {
-   # :url=>19, :html=>242, :links=>0, :text=>8, :text_bytes=>91, :tables=>1
- # }
- ```
+ $ bundle

- See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
+ ### Using RubyGems

- **Extension Notes**:
+ $ gem install wgit

- - It's recommended that URL's be mapped into `Wgit::Url` objects. `Wgit::Url`'s are treated as Strings when being inserted into the database.
- - A `Wgit::Document` extension (once initialised) will become a Document instance variable, meaning that the value will be inserted into the Database if it's a primitive type e.g. `String`, `Array` etc. Complex types e.g. Ruby objects won't be inserted. It's up to you to ensure the data you want inserted, can be inserted.
- - Once inserted into the Database, you can search a `Wgit::Document`'s extension attributes by updating the Database's *text search index*. See the [Database Example](#Database-Example) for more information.
+ Verify the install by using the executable (to start a REPL session):

- ## Caveats
+ $ wgit

- Below are some points to keep in mind when using Wgit:
+ ## Documentation

- - All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
- - By default, up to 5 URL redirects will be followed; this is [configurable](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Crawler#redirect_limit-instance_method) however.
- - IRI's (URL's containing non ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
+ - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
+ - [Wiki](https://github.com/michaeltelford/wgit/wiki)
+ - [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+ - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)

  ## Executable

- Installing the Wgit gem also adds the `wgit` executable to your `$PATH`. The executable launches an interactive shell session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
+ Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded, making it super easy to index and search from the command line without the need for scripts.

  The `wgit` executable does the following things (in order):

@@ -391,21 +200,7 @@ The `wgit` executable does the following things (in order):
  2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, which ever is found first)
  3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)

- The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session. **Note** that variables should either be instance variables (e.g. `@url`) or be accessed via a getter method (e.g. `def url; ...; end`).
-
- ## Change Log
-
- See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
-
- ### Gem Versioning
-
- The `wgit` gem follows these versioning rules:
-
- - The version format is `MAJOR.MINOR.PATCH` e.g. `0.1.0`.
- - Since the gem hasn't reached `v1.0.0` yet, slightly different semantic versioning rules apply.
- - The `PATCH` represents *non breaking changes* while the `MINOR` represents *breaking changes* e.g. updating from version `0.1.0` to `0.2.0` will likely introduce breaking changes necessitating updates to your codebase.
- - To determine what changes are needed, consult the `CHANGELOG.md`. If you need help, raise an issue.
- - Once `wgit v1.0.0` is released, *normal* [semantic versioning](https://semver.org/) rules will apply e.g. only a `MAJOR` version change should introduce breaking changes.
+ The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
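For illustration, a hypothetical `.wgit.rb` might look like the following (the connection string and the `fetch` helper are assumptions for this sketch, not something the gem ships with):

```ruby
# .wgit.rb - eval'd by the `wgit` executable before the interactive shell starts.
ENV['WGIT_CONNECTION_STRING'] ||= 'mongodb://user:password@localhost/crawler'

# Instance variables are accessible from the shell session, e.g. @crawler.crawl(...).
@crawler = Wgit::Crawler.new

# A helper for crawling a single page from the prompt, e.g. fetch('https://example.com').
def fetch(url)
  Wgit::Crawler.new.crawl(Wgit::Url.new(url))
end
```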

  ## License

@@ -431,14 +226,14 @@ And you're good to go!

  ### Tooling

- Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
+ Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...

- Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset that doesn't require a database).
+ Run `toys db` to see a list of database-related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.

- To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
+ To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser, run `toys yardoc --serve`. You can also use the `yri` command line tool, e.g. `yri Wgit::Crawler#crawl_site` etc.

- To install this gem onto your local machine, run `toys install`.
+ To install this gem onto your local machine, run `toys install` and follow the prompt.

  ### Console

- You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and `.wgit.rb` file which gets loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.
+ You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and a `.wgit.rb` file, which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.