wgit 0.6.0 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 0c7346b075dca86debdb6a55ed363d1b890088b17fc600c17a5be82f5878545c
- data.tar.gz: 75572c882e0711e1d49513db91d14fd1d530dbc6e22b7a0bbfec9ac1efd21e29
+ metadata.gz: 29d37a4a0f013fec64625d8fe5798ae2d062ae6f213811c51d223de311e16707
+ data.tar.gz: 213f6c43ccbb1fcc5c487a2bd5f31493506ab2320168562f8e1b6887cccc07b8
  SHA512:
- metadata.gz: e3b915f11c80999a659f9b7f6f6786b717393fe94e6e65029dd5d1b2c2d95f064512cfe96a96d06416ed5932aad0d2798039306f746835c23fb5223aa2d69f5b
- data.tar.gz: 3b1d55d35a30b19fe6c3193f9e2c4eb2884aaddcdf6c31a88465a0d9ffdaf01886b380ba68321bed2aad69d7b6fc26ad49b612aafa01c8034998bdd9697bebfd
+ metadata.gz: acd321f3ba039e6f54dd8a36a3e4ebec1fb40f1cda5ee1c982df4be22ee6d463f829c72f011ff959a4a4d2651676dc2d31866a273b60d3e5e630ccf77b3d7cbe
+ data.tar.gz: d0908d28e6fdaec440209479f75945672807cf3e9359fb8bd8f6cc9de45568a341ac0204ba5f50a2f6569b4a29f4f7ac3088353a35f2c5091a567af469027aab
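
The digests above belong to the `metadata.gz` and `data.tar.gz` members of the packaged gem. As a minimal sketch of how such a value could be checked locally, the snippet below compares a file's SHA256 hex digest against an expected string. The helper name `sha256_matches?` and the throwaway demo file are assumptions made for illustration; they are not part of Wgit or RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true if the file's SHA256 hex digest equals `expected`.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected
end

# Demo using a throwaway file instead of a real data.tar.gz archive.
Tempfile.create('archive') do |file|
  file.write('example archive bytes')
  file.flush

  expected = Digest::SHA256.hexdigest('example archive bytes')
  puts sha256_matches?(file.path, expected)
end
```

For a real check, `path` would point at the extracted archive and `expected` would be the matching value from `checksums.yaml`.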
@@ -9,6 +9,18 @@
  - ...
  ---
 
+ ## v0.7.0
+ ### Added
+ - `Wgit::Indexer.new` optional `crawler` param (second positional argument).
+ - `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
+ - `Document.extensions` returning a Set of all defined extensions.
+ ### Changed/Removed
+ - Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search`, `Wgit.indexed_search` etc. This brings back more relevant search results by default.
+ - Updated the Docker image to now include index names; making it easier to identify them.
+ ### Fixed
+ - ...
+ ---
+
  ## v0.6.0
  ### Added
  - Added `Wgit::Utils.process_arr encode:` param.
data/README.md CHANGED
@@ -8,11 +8,11 @@
 
  ---
 
- Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an easy to use API for programmatic URL parsing, HTML indexing and searching.
+ Wgit is a Ruby library primarily used for crawling, indexing and searching HTML webpages.
 
- Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to copy entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
+ Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL's to retrieve and serialise their page contents for later use. You can use Wgit to scrape entire websites if required. Wgit also provides a means to search indexed documents stored in a database. Therefore, this library provides the main components of a WWW search engine. The Wgit API is easily extended allowing you to pull out the parts of a webpage that are important to you, the code snippets or tables for example. As Wgit is a library, it supports many different use cases including data mining, analytics, web indexing and URL parsing to name a few.
 
- Check out this [example application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
+ Check out this [demo application](https://search-engine-rb.herokuapp.com) - a search engine (see its [repository](https://github.com/michaeltelford/search_engine)) built using Wgit and Sinatra, deployed to Heroku. Heroku's free tier is used so the initial page load may be slow. Try searching for "Ruby" or something else that's Ruby related.
 
  Continue reading the rest of this `README` for more information on Wgit. When you've finished, check out the [wiki](https://github.com/michaeltelford/wgit/wiki).
 
@@ -51,6 +51,10 @@ Or install it yourself as:
 
  $ gem install wgit
 
+ Verify the install by using the executable (to start a shell session):
+
+ $ wgit
+
  ## Basic Usage
 
  ```ruby
@@ -271,11 +275,11 @@ urls_to_crawl = db.uncrawled_urls # => Results will include top_result.external_
 
  Document serialising in Wgit is the means of downloading a web page and extracting parts of its content into accessible document attributes/methods. For example, `Wgit::Document#author` will return you the webpage's HTML element value of `meta[@name='author']`.
 
- By default, Wgit serialises what it thinks are the most important pieces of information from each webpage. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
+ Wgit provides some [default extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) to extract a page's text, links etc. This of course is often not enough given the nature of the WWW and the differences from one webpage to the next. Therefore, there exists a way to extend the default serialising logic.
 
- ### Defining Custom Serialisers Via Document Extensions
+ ### Serialising Additional Page Elements via Document Extensions
 
- You can define a Document extension for each HTML element(s) that you want to extract into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, any crawled Documents will contain your extracted content.
+ You can define a Document extension for each HTML element(s) that you want to extract and serialise into a `Wgit::Document` instance variable, equipped with a getter method. Once an extension is defined, all crawled Documents will contain your extracted content.
 
  Once the page element has been serialised, you can do with it as you wish e.g. obtain its text value or manipulate the element etc. Since you can choose to return the element's text or the [Nokogiri](https://www.rubydoc.info/github/sparklemotion/nokogiri) object, you have the full power that the Nokogiri gem gives you.
 
@@ -296,6 +300,7 @@ Wgit::Document.define_extension(
  end
 
  # Our Document has a table which we're interested in.
+ # Note, it doesn't matter how the Document is initialised e.g. manually or crawled.
  doc = Wgit::Document.new(
  'http://some_url.com',
  <<~HTML
@@ -324,8 +329,6 @@ doc.stats # => {
  # }
  ```
 
- Wgit uses Document extensions to provide much of it's core serialising functionality, providing access to a webpage's text or links for example. These [default Document extensions](https://github.com/michaeltelford/wgit/blob/master/lib/wgit/document_extensions.rb) provide examples for your own.
-
  See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit%2FDocument.define_extension) docs for more information.
 
  **Extension Notes**:
@@ -339,16 +342,20 @@ See the [Wgit::Document.define_extension](https://www.rubydoc.info/github/michae
  Below are some points to keep in mind when using Wgit:
 
  - All absolute `Wgit::Url`'s must be prefixed with an appropriate protocol e.g. `https://` etc.
- - By default, up to 5 URL redirects will be followed; this is configurable however.
+ - By default, up to 5 URL redirects will be followed; this is [configurable](https://www.rubydoc.info/github/michaeltelford/wgit/master/Wgit/Crawler#redirect_limit-instance_method) however.
  - IRI's (URL's containing non ASCII characters) **are** supported and will be normalised/escaped prior to being crawled.
 
  ## Executable
 
- Currently there is no executable provided with Wgit, however...
+ Installing the Wgit gem also adds the `wgit` executable to your `$PATH`. The executable launches an interactive shell session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
+
+ The `wgit` executable does the following things (in order):
 
- In future versions of Wgit, an executable will be packaged with the gem. The executable will provide a `pry` console with the `wgit` gem already loaded. Using the console, you'll easily be able to index and search the web without having to write your own scripts.
+ 1. `require 'wgit'`
+ 2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, whichever is found first)
+ 3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
 
- This executable will be similar in nature to `./bin/console` which is currently used for development and isn't packaged as part of the `wgit` gem.
+ The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session. **Note** that variables should either be instance variables (e.g. `@url`) or be accessed via a getter method (e.g. `def url; ...; end`).
 
  ## Change Log
 
@@ -390,10 +397,12 @@ And you're good to go!
 
  Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation e.g. running the tests etc. For a full list of available tasks AKA tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
- Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker.
-
- Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset). You can also run `toys console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
+ Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests (or `toys test smoke` for a faster running subset that doesn't require a database).
 
  To generate code documentation run `toys yardoc`. To browse the generated documentation in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
 
  To install this gem onto your local machine, run `toys install`.
+
+ ### Console
+
+ You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created a `.env` and `.wgit.rb` file which gets loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.
@@ -0,0 +1,35 @@
+ #!/usr/bin/env ruby
+
+ require 'wgit'
+
+ # Eval .wgit.rb file (if it exists).
+ def eval_wgit
+   puts 'Searching for .wgit.rb in local and home directories...'
+
+   ['.', Dir.home].each do |dir|
+     path = "#{dir}/.wgit.rb"
+     next unless File.exist?(path)
+
+     puts "Eval'ing #{path} (call `eval_wgit` after changes)"
+     eval(File.read(path))
+     break
+   end
+ end
+
+ eval_wgit
+ puts "\n#{Wgit.version_str}\n\n"
+
+ # Use Pry if installed or fall back to IRB.
+ begin
+   require 'pry'
+   klass = Pry
+ rescue LoadError
+   require 'irb'
+   klass = IRB
+
+   puts "Starting IRB because Pry isn't installed."
+ end
+
+ klass.start
+
+ puts 'Interactive session complete.'
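
Because the script calls `eval` from inside the `eval_wgit` method, local variables assigned in a `.wgit.rb` file go out of scope when the method returns, while instance variables and methods remain reachable. This matches the README note about using `@url` or a getter. The sketch below demonstrates that scoping rule with plain Ruby. `FakeSession`, `CONFIG_SOURCE`, `load_config` and `run` are hypothetical names used only for this demo; nothing here is Wgit's own code.

```ruby
# Hypothetical stand-in for a .wgit.rb file, kept in a String for the demo.
CONFIG_SOURCE = <<~RUBY
  local_url = 'http://example.com/local'
  @url = 'http://example.com'
  def site; 'http://example.com/site'; end
RUBY

# Hypothetical class standing in for the shell session's top-level object.
class FakeSession
  # Like eval_wgit, the source is eval'd inside a method body.
  def load_config(source)
    eval(source)
    nil
  end

  # Evaluates a later line, as if it were typed into the shell afterwards.
  def run(code)
    eval(code)
  end
end

session = FakeSession.new
session.load_config(CONFIG_SOURCE)

p session.run('defined?(local_url)') # the local variable is out of scope
p session.run('@url')                # the instance variable is still set
p session.run('site')                # the method is still callable
```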
@@ -19,9 +19,9 @@ module Wgit
  # `#crawl_site`. The idea is to omit anything that isn't HTML and therefore
  # doesn't keep the crawl of the site going. All URL's without a file
  # extension will be crawled, because they're assumed to be HTML.
- SUPPORTED_FILE_EXTENSIONS = Set.new(%w[
- asp aspx cfm cgi htm html htmlx jsp php
- ])
+ SUPPORTED_FILE_EXTENSIONS = Set.new(
+ %w[asp aspx cfm cgi htm html htmlx jsp php]
+ )
 
  # The amount of allowed redirects before raising an error. Set to 0 to
  # disable redirects completely; or you can pass `follow_redirects: false`
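
The comment above `SUPPORTED_FILE_EXTENSIONS` describes the rule: paths with no file extension are assumed to be HTML, and other paths are followed only if their extension is in the set. A minimal sketch of that check follows. The helper `crawlable_path?` is a hypothetical name, and the way Wgit actually applies the constant inside `#crawl_site` is not shown in this diff.

```ruby
require 'set'

# Same values as the constant in the hunk above.
SUPPORTED_FILE_EXTENSIONS = Set.new(
  %w[asp aspx cfm cgi htm html htmlx jsp php]
)

# Hypothetical helper (not part of Wgit): true if the path has no file
# extension, or has one listed in SUPPORTED_FILE_EXTENSIONS.
def crawlable_path?(path)
  ext = File.extname(path).delete_prefix('.').downcase
  ext.empty? || SUPPORTED_FILE_EXTENSIONS.include?(ext)
end

%w[/about /docs/index.html /images/logo.png].each do |path|
  puts "#{path}: #{crawlable_path?(path)}"
end
```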
@@ -154,7 +154,7 @@ module Wgit
  # DB.
  # @return [Array<Wgit::Document>] The search results obtained from the DB.
  def search(
- query, case_sensitive: false, whole_sentence: false, limit: 10, skip: 0
+ query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
  )
  query = query.to_s.strip
  query.replace('"' + query + '"') if whole_sentence
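
With `whole_sentence: true` now the default, the query is wrapped in double quotes before it reaches the database. MongoDB's text search treats a double-quoted string as an exact phrase, which is presumably why the changelog describes the results as more relevant. The sketch below isolates that quoting step. `build_text_query` is a hypothetical helper, and the assumption that the string is then handed to a MongoDB text search is not confirmed by the lines shown in this hunk.

```ruby
# Hypothetical helper mirroring the two query lines at the end of the hunk:
# strip the query, then quote it when a whole-sentence match is wanted.
def build_text_query(query, whole_sentence: true)
  query = query.to_s.strip
  whole_sentence ? "\"#{query}\"" : query
end

puts build_text_query('ruby on rails')
puts build_text_query('ruby on rails', whole_sentence: false)
```

Callers relying on the old behaviour would need to pass `whole_sentence: false` explicitly after upgrading.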
@@ -3,6 +3,7 @@ require_relative 'utils'
  require_relative 'assertable'
  require 'nokogiri'
  require 'json'
+ require 'set'
 
  module Wgit
  # Class primarily modeling a HTML web document, although other MIME types
@@ -22,6 +23,14 @@ module Wgit
  # The xpath used to extract the visible text on a page.
  TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
 
+ # Set of Symbols representing the defined Document extensions.
+ @extensions = Set.new
+
+ class << self
+ # Class level attr_reader for the Document defined extensions.
+ attr_reader :extensions
+ end
+
  # The URL of the webpage, an instance of Wgit::Url.
  attr_reader :url
 
@@ -120,7 +129,7 @@ module Wgit
  result = find_in_html(xpath, opts, &block)
  init_var(var, result)
  end
- Document.send :private, func_name
+ Document.send(:private, func_name)
 
  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
@@ -128,8 +137,9 @@ module Wgit
  result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
  init_var(var, result)
  end
- Document.send :private, func_name
+ Document.send(:private, func_name)
 
+ @extensions << var
  var
  end
 
@@ -144,6 +154,7 @@ module Wgit
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")
 
+ @extensions.delete(var.to_sym)
  true
  rescue NameError
  false
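
The three `document.rb` hunks above add a registry of extension names: a class-level instance variable holding a `Set`, exposed through `class << self; attr_reader :extensions; end`, appended to when an extension is defined and pruned when one is removed. The sketch below reproduces that pattern in isolation. `Doc` and its simplified `define_extension`/`remove_extension` are hypothetical stand-ins; the real methods take more arguments and do the actual HTML extraction. One assumption here goes beyond the diff: names are normalised with `to_sym` on the way in as well as on the way out.

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Document; only the registry is modelled.
class Doc
  # Class-level instance variable: belongs to Doc itself, not its instances.
  @extensions = Set.new

  class << self
    # Class level reader, so callers can use Doc.extensions.
    attr_reader :extensions

    # Defines a getter for the extension and records its name.
    def define_extension(var)
      var = var.to_sym
      define_method(var) { instance_variable_get("@#{var}") }
      @extensions << var
      var
    end

    # Removes the getter and the registry entry; false if it never existed.
    def remove_extension(var)
      remove_method(var.to_sym)
      @extensions.delete(var.to_sym)
      true
    rescue NameError
      false
    end
  end
end

Doc.define_extension(:tables)
p Doc.extensions.include?(:tables)
Doc.remove_extension(:tables)
p Doc.extensions.empty?
```

Note that a class-level instance variable is not shared with subclasses, so a subclass of `Doc` would see `nil` from `extensions` unless it initialised its own set.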
@@ -366,7 +377,7 @@ module Wgit
  # sentence.
  # @return [Array<String>] A subset of @text, matching the query.
  def search(
- query, case_sensitive: false, whole_sentence: false, sentence_limit: 80
+ query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
  )
  query = query.to_s
  raise 'A search query must be provided' if query.empty?
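
To illustrate what the new `whole_sentence: true` default changes for text searches, the sketch below contrasts matching the query as one phrase against matching any of its words. This is an assumption-laden illustration, not Wgit's implementation: `search_lines` is a hypothetical helper, and it ignores `sentence_limit` and result ordering.

```ruby
# Hypothetical helper (not Wgit's code): returns the lines matching the query,
# either as a whole phrase or as any one of its whitespace-separated words.
def search_lines(lines, query, case_sensitive: false, whole_sentence: true)
  terms = whole_sentence ? [query] : query.split
  pattern = terms.map { |term| Regexp.escape(term) }.join('|')
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  lines.select { |line| line.match?(regex) }
end

lines = [
  'Ruby on Rails is a web framework',
  'Rails apps are written in Ruby',
  'Go is a compiled language'
]

p search_lines(lines, 'ruby on rails')
p search_lines(lines, 'ruby on rails', whole_sentence: false)
```

The phrase search returns only the first line, while the word search also returns the second, which is the kind of looser match the old default allowed.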
@@ -409,7 +420,7 @@ module Wgit
  # sentence.
  # @return [String] This Document's original @text value.
  def search!(
- query, case_sensitive: false, whole_sentence: false, sentence_limit: 80
+ query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
  )
  orig_text = @text
  @text = search(
@@ -40,6 +40,10 @@ module Wgit
  # nil to use ENV['WGIT_CONNECTION_STRING'].
  # @param insert_externals [Boolean] Whether or not to insert the website's
  # external Url's into the database.
+ # @param allow_paths [String, Array<String>] Filters links by selecting
+ # them if their path `File.fnmatch?` one of allow_paths.
+ # @param disallow_paths [String, Array<String>] Filters links by rejecting
+ # them if their path `File.fnmatch?` one of disallow_paths.
  # @yield [doc] Given the Wgit::Document of each crawled webpage, before it's
  # inserted into the database allowing for prior manipulation.
  # @return [Integer] The total number of pages crawled within the website.
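
The new `allow_paths` and `disallow_paths` docs describe glob filtering with `File.fnmatch?`. The sketch below shows one way such a filter could behave: keep a path if it matches any allow glob (or if none are given), then drop it if it matches any disallow glob. `filter_paths` is a hypothetical helper, and the example paths (written without a leading slash) are assumptions; how Wgit normalises link paths before matching is not shown in this diff.

```ruby
# Hypothetical helper (not Wgit's code): filters paths using glob patterns.
# Both params accept a single String or an Array of Strings.
def filter_paths(paths, allow_paths: nil, disallow_paths: nil)
  allow = Array(allow_paths)
  disallow = Array(disallow_paths)

  paths
    .select { |path| allow.empty? || allow.any? { |glob| File.fnmatch?(glob, path) } }
    .reject { |path| disallow.any? { |glob| File.fnmatch?(glob, path) } }
end

paths = %w[about blog/post-1 blog/post-2 contact]

p filter_paths(paths, allow_paths: 'blog/*')
p filter_paths(paths, disallow_paths: %w[about contact])
```

Without the `File::FNM_PATHNAME` flag, `*` in `File.fnmatch?` also matches `/`, so a glob such as `blog/*` would cover nested paths too.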
@@ -96,7 +100,7 @@ module Wgit
  # database.
  def self.indexed_search(
  query, connection_string: nil,
- case_sensitive: false, whole_sentence: false,
+ case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, &block
  )
  db = Wgit::Database.new(connection_string)
@@ -122,7 +126,7 @@ module Wgit
  Wgit::Utils.printf_search_results(results)
  end
 
- # Class which sets up a crawler and saves the indexed docs to a database.
+ # Class which crawls and saves the indexed Documents to a database.
  class Indexer
  # The crawler used to index the WWW.
  attr_reader :crawler
@@ -133,10 +137,11 @@ module Wgit
  # Initialize the Indexer.
  #
  # @param database [Wgit::Database] The database instance (already
- # initialized with the correct connection string etc).
- def initialize(database)
- @crawler = Wgit::Crawler.new
+ # initialized and connected) used to index.
+ # @param crawler [Wgit::Crawler] The crawler instance used to index.
+ def initialize(database, crawler = Wgit::Crawler.new)
  @db = database
+ @crawler = crawler
  end
 
  # Retrieves uncrawled url's from the database and recursively crawls each
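
The `initialize` change above lets callers inject their own crawler as an optional second positional argument, falling back to a fresh `Wgit::Crawler` otherwise. The sketch below shows the same pattern with stand-in classes so it runs without the gem. `FakeCrawler`, `FakeIndexer` and the `redirect_limit:` keyword are hypothetical; only the shape of the constructor is taken from the diff.

```ruby
# Hypothetical stand-in for Wgit::Crawler with one configurable setting.
class FakeCrawler
  attr_reader :redirect_limit

  def initialize(redirect_limit: 5)
    @redirect_limit = redirect_limit
  end
end

# Hypothetical stand-in for Wgit::Indexer, same constructor shape as the diff.
class FakeIndexer
  attr_reader :crawler

  # The default argument is evaluated on each call, so every indexer built
  # without an explicit crawler gets its own new instance.
  def initialize(database, crawler = FakeCrawler.new)
    @db = database
    @crawler = crawler
  end
end

default_indexer = FakeIndexer.new(:db)
custom_indexer = FakeIndexer.new(:db, FakeCrawler.new(redirect_limit: 0))

p default_indexer.crawler.redirect_limit
p custom_indexer.crawler.redirect_limit
```

Injecting the crawler this way makes it possible to reuse one configured crawler across indexers, or to substitute a test double.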
@@ -214,6 +219,10 @@ the next iteration.")
  # @param url [Wgit::Url] The base Url of the website to crawl.
  # @param insert_externals [Boolean] Whether or not to insert the website's
  # external Url's into the database.
+ # @param allow_paths [String, Array<String>] Filters links by selecting
+ # them if their path `File.fnmatch?` one of allow_paths.
+ # @param disallow_paths [String, Array<String>] Filters links by rejecting
+ # them if their path `File.fnmatch?` one of disallow_paths.
  # @yield [doc] Given the Wgit::Document of each crawled web page before
  # it's inserted into the database allowing for prior manipulation. Return
  # nil or false from the block to prevent the document from being saved
@@ -1,11 +1,11 @@
  # frozen_string_literal: true
 
  # Wgit is a WWW indexer/scraper which crawls URL's and retrieves their page
- # contents for later use by serialisation.
+ # contents for later use.
  # @author Michael Telford
  module Wgit
  # The current gem version of Wgit.
- VERSION = '0.6.0'
+ VERSION = '0.7.0'
 
  # Returns the current gem version of Wgit as a String.
  def self.version
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
- version: 0.6.0
+ version: 0.7.0
  platform: ruby
  authors:
  - Michael Telford
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2019-12-24 00:00:00.000000000 Z
+ date: 2020-01-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: addressable
@@ -185,7 +185,7 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '1.0'
  description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s to
- retrieve and serialise their page contents for later use. You can use Wgit to copy
+ retrieve and serialise their page contents for later use. You can use Wgit to scrape
  entire websites if required. Wgit also provides a means to search indexed documents
  stored in a database. Therefore, this library provides the main components of a
  WWW search engine. The Wgit API is easily extended allowing you to pull out the
@@ -195,7 +195,8 @@ description: 'Fundamentally, Wgit is a HTTP indexer/scraper which crawls URL''s
 
  '
  email: michael.telford@live.com
- executables: []
+ executables:
+ - wgit
  extensions: []
  extra_rdoc_files: []
  files:
@@ -219,6 +220,7 @@ files:
  - CONTRIBUTING.md
  - LICENSE.txt
  - README.md
+ - bin/wgit
  homepage: https://github.com/michaeltelford/wgit
  licenses:
  - MIT
@@ -229,7 +231,7 @@ metadata:
  bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
  documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
  allowed_push_host: https://rubygems.org
- post_install_message:
+ post_install_message: Added the 'wgit' executable to $PATH
  rdoc_options: []
  require_paths:
  - lib
@@ -247,6 +249,6 @@ requirements: []
  rubygems_version: 3.0.6
  signing_key:
  specification_version: 4
- summary: Wgit is a Ruby gem similar in nature to GNU's `wget` tool. It provides an
- easy to use API for programmatic URL parsing, HTML indexing and searching.
+ summary: Wgit is a Ruby library primarily used for crawling, indexing and searching
+ HTML webpages.
  test_files: []
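
The metadata changes above (the `executables` entry, the `bin/wgit` file and the `post_install_message`) are the kind of values a gemspec would declare. The sketch below builds an in-memory `Gem::Specification` carrying those fields. It is an illustration under assumptions, not a copy of the project's gemspec, which is not shown in this diff; the `files` list in particular is cut down to the one new entry.

```ruby
require 'rubygems'

# Illustrative specification only; the field values are taken from the
# metadata hunks above, everything else is left at RubyGems defaults.
spec = Gem::Specification.new do |s|
  s.name = 'wgit'
  s.version = '0.7.0'
  s.summary = 'Wgit is a Ruby library primarily used for crawling, ' \
              'indexing and searching HTML webpages.'
  s.bindir = 'bin'
  s.executables = ['wgit']
  s.files = ['bin/wgit'] # shortened for the sketch
  s.post_install_message = "Added the 'wgit' executable to $PATH"
end

puts spec.executables.inspect
puts spec.post_install_message
```

RubyGems looks for each name in `executables` under `bindir`, which is why `bin/wgit` also has to appear in `files`.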