wgit 0.0.13 → 0.0.14

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: b8f6c1946b739327ba5b52a5541aa1496e2aea53c23e925bbb5e5fe7c063ddcc
- data.tar.gz: 1bcf5dd5e41711758fdc737afba0707319de25cb524cfe08c14a50efb9a5f3e0
+ metadata.gz: b2f1d98c88dcd1bd2a12b463732b08373f6b5db4d1c75661530293b0ed129c47
+ data.tar.gz: ed5ff0f5aa5e2427909ced8f67c8189dd1f7aec259b76947cd4cd73a1f8783d3
  SHA512:
- metadata.gz: ad42cd8e392894a21a1c4497bed9d1efabbc8b2320825ab7a3fc7e0615157f28f262bcaa4a4910411ac7662b4b416848b97b68c264f2ce937be34cfa6a56a34c
- data.tar.gz: fbaf8d6a48f996ce2c0b3cb4ee72a26cfede2368f3992e110c6f049e84b9ba6a2568aac0f192559ae10127db5d3cb0bda5dbbe508247ae5f918f3fc400528e75
+ metadata.gz: 6270ceff57af1936e7b0ae1fb3748def307981b26dc604d2f190a9e751e0952f103b4a7c227e2cbbc2d00f26bc7c4b1f46f44fa423e13add7e8bd4eb9fe046f4
+ data.tar.gz: 96b0eecd2713cbf55f7433c145a541fdae34d56b9e3bedd1b91b6bb30b463d06b7f07de59457ad899fd32bd095bbe92ee01d94c04ef923924fbc4a110bbe98f2
data/README.md CHANGED
@@ -16,9 +16,10 @@ Check out this [example application](https://search-engine-rb.herokuapp.com) - a
  6. [Extending The API](#Extending-The-API)
  7. [Caveats](#Caveats)
  8. [Executable](#Executable)
- 9. [Development](#Development)
- 10. [Contributing](#Contributing)
- 11. [License](#License)
+ 9. [Change Log](#Change-Log)
+ 10. [Development](#Development)
+ 11. [Contributing](#Contributing)
+ 12. [License](#License)

  ## Installation

@@ -87,6 +88,10 @@ See the [Practical Database Example](#Practical-Database-Example) for informatio

  Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.

+ ### Broken Link Finder
+
+ The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
+
  ### CSS Indexer

  The below script downloads the contents of the first css link found on Facebook's index page.
@@ -146,78 +151,80 @@ end

  ## Practical Database Example

- This next example requires a configured database instance.
+ This next example requires a configured database instance. Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a free (small) account or provide your own MongoDB instance.

- Currently the only supported DBMS is MongoDB. See [mLab](https://mlab.com) for a free (small) account or provide your own MongoDB instance.
+ `Wgit::Database` provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and serialisation. With Wgit you can index webpages, store them in a database and then search through all that's been indexed. The use of a database is entirely optional however and isn't required for crawling/indexing.

- The currently supported versions of `mongo` are:
+ The following versions of MongoDB are supported:

- | Gem | Database Engine |
- | -------- | --------------- |
- | ~> 2.8.0 | 3.6.12 (MMAPv1) |
+ | Gem | Database |
+ | ------ | -------- |
+ | ~> 2.9 | ~> 4.0 |

  ### Setting Up MongoDB

- Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records. The use of a database is entirely optional when using Wgit.
+ Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records.

  1) Create collections for: `documents` and `urls`.
- 2) Add a unique index for the `url` field in **both** collections.
- 3) Enable `textSearchEnabled` in MongoDB's configuration.
- 4) Create a *text search index* for the `documents` collection using:
+ 2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections.
+ 3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so).
+ 4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
  ```json
  {
- "text": "text",
- "author": "text",
- "keywords": "text",
- "title": "text"
+ "text": "text",
+ "author": "text",
+ "keywords": "text",
+ "title": "text"
  }
  ```
- 5) Set the connection details for your MongoDB instance using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)
+ 5) Set the connection details for your MongoDB instance (see below) using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)

  **Note**: The *text search index* (in step 4) lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should append this list with any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.

  ### Database Example

- The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database.
-
- If you're running the code below for yourself, remember to replace the Hash containing the connection details with your own.
+ The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.

  ```ruby
  require 'wgit'
  require 'wgit/core_ext' # => Provides the String#to_url and Enumerable#to_urls methods.

- # Here we create our own document rather than crawling the web.
+ ### CONNECT TO THE DATABASE ###
+
+ # Set your connection details manually (as below) or from the environment using
+ # Wgit.set_connection_details_from_env
+ Wgit.set_connection_details('DB_CONNECTION_STRING' => '<your_connection_string>')
+ db = Wgit::Database.new # Connects to the database...
+
+ ### SEED SOME DATA ###
+
+ # Here we create our own document rather than crawling the web (which works in the same way).
  # We pass the web page's URL and HTML Strings.
  doc = Wgit::Document.new(
  "http://test-url.com".to_url,
  "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
  )
-
- # Set your connection details manually (as below) or from the environment using
- # Wgit.set_connection_details_from_env
- Wgit.set_connection_details(
- 'DB_HOST' => '<host_machine>',
- 'DB_PORT' => '27017',
- 'DB_USERNAME' => '<username>',
- 'DB_PASSWORD' => '<password>',
- 'DB_DATABASE' => '<database_name>',
- )
-
- db = Wgit::Database.new # Connects to the database...
  db.insert doc

- # Searching the database returns documents with matching text 'hits'.
+ ### SEARCH THE DATABASE ###
+
+ # Searching the database returns Wgit::Document's which have fields containing the query.
  query = "cow"
  results = db.search query

- doc.url == results.first.url # => true
+ search_result = results.first
+ search_result.class # => Wgit::Document
+ doc.url == search_result.url # => true
+
+ ### PULL OUT THE BITS THAT MATCHED OUR QUERY ###

- # Searching the returned documents gives the matching lines of text from that document.
- doc.search(query).first # => "How now brown cow."
+ # Searching the returned documents gives the matching text from that document.
+ search_result.search(query).first # => "How now brown cow."

- db.insert doc.external_links
+ ### SEED URLS TO BE CRAWLED LATER ###

- urls_to_crawl = db.uncrawled_urls # => Results will include doc.external_links.
+ db.insert search_result.external_links
+ urls_to_crawl = db.uncrawled_urls # => Results will include search_result.external_links.
  ```

  ## Extending The API
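
The MongoDB setup steps added above (collections, unique `url` indexes, a text search index) can also be scripted with the `mongo` gem rather than via a shell or GUI. The sketch below is illustrative only: the connection string and database name are placeholders, and MongoDB creates collections implicitly on first write if you skip that step.

```ruby
require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/crawler') # placeholder URI

# Step 2) Unique index on the url field of both collections.
client[:documents].indexes.create_one({ url: 1 }, unique: true)
client[:urls].indexes.create_one({ url: 1 }, unique: true)

# Step 4) Text search index listing the fields queried by Wgit::Database#search.
client[:documents].indexes.create_one(
  text: 'text', author: 'text', keywords: 'text', title: 'text'
)
```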
@@ -319,6 +326,10 @@ In future versions of Wgit, an executable will be packaged with the gem. The exe

  This executable will be very similar in nature to `./bin/console` which is currently used only for development and isn't packaged as part of the `wgit` gem.

+ ## Change Log
+
+ See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences between versions of Wgit.
+
  ## Development

  The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file.
data/TODO.txt CHANGED
@@ -8,7 +8,6 @@ Primary

  Secondary
  ---------
- - Setup a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some ruby sites like ruby.org etc.
  - Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
  - Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
  - Check if Document::TEXT_ELEMENTS is expansive enough.
@@ -17,16 +16,15 @@ Secondary

  Refactoring
  -----------
- - Replace method params with named parameters where applicable.
- - Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
- - Mock (monkey patch) the 'mongo' gem's func. for speed in the tests.
  - Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
+ - Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
+ - Replace method params with named parameters where applicable.

  Gem Publishing Checklist
  ------------------------
  - Ensure a clean branch of master and create a 'release' branch.
  - Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
- - Increment the version number (in version.rb).
+ - Increment the version number (in version.rb) and update the CHANGELOG.md.
  - Run 'bundle install' to update deps.
  - Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
  - Run 'bundle exec rake test' and ensure all tests are passing.
@@ -208,6 +208,8 @@ module Wgit
  end

  alias :crawl :crawl_urls
+ alias :crawl_pages :crawl_urls
+ alias :crawl_page :crawl_url
  alias :crawl_r :crawl_site
  end
  end
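
The hunk above adds `crawl_pages`/`crawl_page` as aliases of the existing `crawl_urls`/`crawl_url` on `Wgit::Crawler`. A minimal sketch of the new alias in use, assuming `crawl_page` yields the crawled `Wgit::Document` just like the aliased `crawl_url` (the URL is a placeholder):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new

# crawl_page is simply another name for crawl_url.
crawler.crawl_page(Wgit::Url.new('http://example.com')) do |doc|
  puts doc.title # The crawled page's title (nil if the page has none).
end
```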
@@ -9,27 +9,19 @@ module Wgit
  CONNECTION_DETAILS = {}

  # The keys required for a successful database connection.
- CONNECTION_KEYS_REQUIRED = [
- 'DB_HOST', 'DB_PORT', 'DB_USERNAME', 'DB_PASSWORD', 'DB_DATABASE'
- ]
+ CONNECTION_KEYS_REQUIRED = ['DB_CONNECTION_STRING']

  # Set the database's connection details from the given hash. It is your
  # responsibility to ensure the correct hash vars are present and set.
  #
  # @param hash [Hash] Containing the database connection details to use.
  # The hash should contain the following keys (of type String):
- # DB_HOST, DB_PORT, DB_USERNAME, DB_PASSWORD, DB_DATABASE
+ # DB_CONNECTION_STRING
  # @raise [KeyError] If any of the required connection details are missing.
  # @return [Hash] Containing the database connection details from hash.
  def self.set_connection_details(hash)
  assert_required_keys(hash, CONNECTION_KEYS_REQUIRED)
-
- CONNECTION_DETAILS[:host] = hash.fetch('DB_HOST')
- CONNECTION_DETAILS[:port] = hash.fetch('DB_PORT')
- CONNECTION_DETAILS[:uname] = hash.fetch('DB_USERNAME')
- CONNECTION_DETAILS[:pword] = hash.fetch('DB_PASSWORD')
- CONNECTION_DETAILS[:db] = hash.fetch('DB_DATABASE')
-
+ CONNECTION_DETAILS[:connection_string] = hash.fetch('DB_CONNECTION_STRING')
  CONNECTION_DETAILS
  end

@@ -37,7 +29,7 @@ module Wgit
  # responsibility to ensure the correct ENV vars are present and set.
  #
  # The ENV should contain the following keys (of type String):
- # DB_HOST, DB_PORT, DB_USERNAME, DB_PASSWORD, DB_DATABASE
+ # DB_CONNECTION_STRING
  #
  # @raise [KeyError] If any of the required connection details are missing.
  # @return [Hash] Containing the database connection details from the ENV.
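
These hunks collapse the five separate connection keys into a single `DB_CONNECTION_STRING`. A minimal sketch of the new setup, assuming a standard MongoDB connection string (the URI below is a placeholder):

```ruby
require 'wgit'

# Option 1: pass the connection string explicitly.
Wgit.set_connection_details(
  'DB_CONNECTION_STRING' => 'mongodb://user:pass@localhost:27017/crawler'
)

# Option 2: export DB_CONNECTION_STRING and load it from the environment instead.
# Wgit.set_connection_details_from_env

db = Wgit::Database.new # Connects using Wgit::CONNECTION_DETAILS[:connection_string].
```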
@@ -17,10 +17,16 @@ module Wgit
  #
  # @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
  def initialize
- conn_details = Wgit::CONNECTION_DETAILS
- if conn_details.empty?
- raise "Wgit::CONNECTION_DETAILS must be defined and include :host,
- :port, :db, :uname, :pword for a database connection to be established."
+ @@client = Database.connect
+ end
+
+ # Initializes a database connection client.
+ #
+ # @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
+ def self.connect
+ unless Wgit::CONNECTION_DETAILS[:connection_string]
+ raise "Wgit::CONNECTION_DETAILS must be defined and include \
+ :connection_string"
  end

  # Only log for error (or more severe) scenarios.
@@ -28,11 +34,8 @@ module Wgit
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level = Logger::ERROR

- address = "#{conn_details[:host]}:#{conn_details[:port]}"
- @@client = Mongo::Client.new([address],
- database: conn_details[:db],
- user: conn_details[:uname],
- password: conn_details[:pword])
+ # Connects to the database here.
+ Mongo::Client.new(Wgit::CONNECTION_DETAILS[:connection_string])
  end

  ### Create Data ###
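
With this change `Database.connect` hands the whole connection string to the underlying `mongo` driver instead of assembling host, port and credentials itself. For reference, a sketch of the equivalent direct driver call (the URI is a placeholder):

```ruby
require 'mongo'

# The mongo gem accepts a standard MongoDB connection string (URI) directly.
client = Mongo::Client.new('mongodb://user:pass@localhost:27017/crawler')
client.database.collection_names # => e.g. ["documents", "urls"]
```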
@@ -84,7 +84,7 @@ module Wgit
  obj = url_or_obj
  assert_respond_to(obj, :fetch)

- @url = obj.fetch("url") # Should always be present.
+ @url = Wgit::Url.new(obj.fetch("url")) # Should always be present.
  @html = obj.fetch("html", "")
  @doc = init_nokogiri
  @score = obj.fetch("score", 0.0)
@@ -10,7 +10,7 @@ module Wgit
  # site storing their internal pages into the database and adding their
  # external url's to be crawled later on. Logs info on the crawl
  # using Wgit.logger as it goes along.
- #
+ #
  # @param max_sites_to_crawl [Integer] The number of separate and whole
  # websites to be crawled before the method exits. Defaults to -1 which
  # means the crawl will occur until manually stopped (Ctrl+C etc).
@@ -33,8 +33,8 @@ module Wgit
  # @param url [Wgit::Url, String] The base Url of the website to crawl.
  # @param insert_externals [Boolean] Whether or not to insert the website's
  # external Url's into the database.
- # @yield [doc] Given the Wgit::Document of each crawled web page, before it
- # is inserted into the database allowing for prior manipulation.
+ # @yield [Wgit::Document] Given the Wgit::Document of each crawled webpage,
+ # before it is inserted into the database allowing for prior manipulation.
  # @return [Integer] The total number of pages crawled within the website.
  def self.index_this_site(url, insert_externals = true, &block)
  url = Wgit::Url.new url
@@ -43,6 +43,24 @@ module Wgit
  indexer.index_this_site(url, insert_externals, &block)
  end

+ # Convience method to index a single webpage using
+ # Wgit::Indexer#index_this_page.
+ #
+ # Crawls a single webpage and stores it into the database.
+ # There is no max download limit so be careful of large pages.
+ #
+ # @param url [Wgit::Url, String] The Url of the webpage to crawl.
+ # @param insert_externals [Boolean] Whether or not to insert the website's
+ # external Url's into the database.
+ # @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+ # before it is inserted into the database allowing for prior manipulation.
+ def self.index_this_page(url, insert_externals = true, &block)
+ url = Wgit::Url.new url
+ db = Wgit::Database.new
+ indexer = Wgit::Indexer.new(db)
+ indexer.index_this_page(url, insert_externals, &block)
+ end
+
  # Performs a search of the database's indexed documents and pretty prints
  # the results. See Wgit::Database#search for details of the search.
  #
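
The new module-level `Wgit.index_this_page` mirrors `Wgit.index_this_site` but for a single page. A minimal usage sketch, assuming the connection details have already been set as in the earlier hunks (the URL is a placeholder; per the instance method's docs, the block's return value decides whether the document is saved):

```ruby
require 'wgit'

# Crawl one page, store it in the database and skip its external links.
Wgit.index_this_page('http://example.com', false) do |doc|
  puts "Indexing #{doc.url}"
  true # A truthy return value lets the document be written to the database.
end
```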
@@ -53,8 +71,8 @@ module Wgit
  # @param skip [Integer] The number of DB records to skip.
  # @param sentence_length [Integer] The max length of each result's text
  # snippet.
- # @yield [doc] Given each search result (Wgit::Document).
- def self.indexed_search(query, whole_sentence = false, limit = 10,
+ # @yield [Wgit::Document] Given each search result (Wgit::Document).
+ def self.indexed_search(query, whole_sentence = false, limit = 10,
  skip = 0, sentence_length = 80, &block)
  db = Wgit::Database.new
  results = db.search(query, whole_sentence, limit, skip, &block)
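
For context, `Wgit.indexed_search` queries whatever has been indexed and pretty prints the results. A short sketch using the defaults shown in the signature above (assumes a connected, seeded database):

```ruby
require 'wgit'

# Prints up to 10 matching documents, with text snippets of up to 80 characters.
Wgit.indexed_search('cow')
```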
@@ -63,13 +81,13 @@ module Wgit

  # Class which sets up a crawler and saves the indexed docs to a database.
  class Indexer
-
+
  # The crawler used to scrape the WWW.
  attr_reader :crawler
-
+
  # The database instance used to store Urls and Documents in.
  attr_reader :db
-
+
  # Initialize the Indexer.
  #
  # @param database [Wgit::Database] The database instance (already
@@ -97,7 +115,7 @@ module Wgit
  urls to crawl (which might be never).")
  end
  site_count = 0
-
+
  while keep_crawling?(site_count, max_sites_to_crawl, max_data_size) do
  Wgit.logger.info("Current database size: #{@db.size}")
  @crawler.urls = @db.uncrawled_urls
@@ -107,10 +125,10 @@ urls to crawl (which might be never).")
  return
  end
  Wgit.logger.info("Starting crawl loop for: #{@crawler.urls}")
-
+
  docs_count = 0
  urls_count = 0
-
+
  @crawler.urls.each do |url|
  unless keep_crawling?(site_count, max_sites_to_crawl, max_data_size)
  Wgit.logger.info("Reached max number of sites to crawl or database \
@@ -121,7 +139,7 @@ capacity, exiting.")

  url.crawled = true
  raise unless @db.update(url) == 1
-
+
  site_docs_count = 0
  ext_links = @crawler.crawl_site(url) do |doc|
  unless doc.empty?
@@ -131,7 +149,7 @@ capacity, exiting.")
  end
  end
  end
-
+
  urls_count += write_urls_to_db(ext_links)
  Wgit.logger.info("Crawled and saved #{site_docs_count} docs for the \
  site: #{url}")
@@ -141,6 +159,8 @@ site: #{url}")
  this iteration.")
  Wgit.logger.info("Found and saved #{urls_count} external url(s) for the next \
  iteration.")
+
+ nil
  end
  end

@@ -151,14 +171,14 @@ iteration.")
  # @param url [Wgit::Url] The base Url of the website to crawl.
  # @param insert_externals [Boolean] Whether or not to insert the website's
  # external Url's into the database.
- # @yield [doc] Given the Wgit::Document of each crawled web page, before it
- # is inserted into the database allowing for prior manipulation. Return
- # nil or false from the block to prevent the document from being saved
- # into the database.
+ # @yield [Wgit::Document] Given the Wgit::Document of each crawled web
+ # page, before it is inserted into the database allowing for prior
+ # manipulation. Return nil or false from the block to prevent the
+ # document from being saved into the database.
  # @return [Integer] The total number of webpages/documents indexed.
  def index_this_site(url, insert_externals = true)
  total_pages_indexed = 0
-
+
  ext_urls = @crawler.crawl_site(url) do |doc|
  result = true
  if block_given?
@@ -174,23 +194,56 @@ iteration.")
  end

  url.crawled = true
- if !@db.url?(url)
- @db.insert(url)
- else
- @db.update(url)
- end
-
+ @db.url?(url) ? @db.update(url) : @db.insert(url)
+
  if insert_externals
  write_urls_to_db(ext_urls)
  Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
  end
-
+
  Wgit.logger.info("Crawled and saved #{total_pages_indexed} docs for the \
  site: #{url}")

  total_pages_indexed
  end

+ # Crawls a single webpage and stores it into the database.
+ # There is no max download limit so be careful of large pages.
+ # Logs info on the crawl using Wgit.logger as it goes along.
+ #
+ # @param url [Wgit::Url] The webpage Url to crawl.
+ # @param insert_externals [Boolean] Whether or not to insert the webpage's
+ # external Url's into the database.
+ # @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+ # before it is inserted into the database allowing for prior
+ # manipulation. Return nil or false from the block to prevent the
+ # document from being saved into the database.
+ def index_this_page(url, insert_externals = true)
+ doc = @crawler.crawl_page(url) do |doc|
+ result = true
+ if block_given?
+ result = yield(doc)
+ end
+
+ if result
+ if write_doc_to_db(doc)
+ Wgit.logger.info("Crawled and saved internal page: #{doc.url}")
+ end
+ end
+ end
+
+ url.crawled = true
+ @db.url?(url) ? @db.update(url) : @db.insert(url)
+
+ if insert_externals
+ ext_urls = doc.external_links
+ write_urls_to_db(ext_urls)
+ Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
+ end
+
+ nil
+ end
+
  private

  # Keep crawling or not based on DB size and current loop iteration.
@@ -204,7 +257,7 @@ site: #{url}")
  end
  end

- # The unique url index on the documents collection prevents duplicate
+ # The unique url index on the documents collection prevents duplicate
  # inserts.
  def write_doc_to_db(doc)
  @db.insert(doc)
@@ -3,5 +3,5 @@
  # @author Michael Telford
  module Wgit
  # The current gem version of Wgit.
- VERSION = "0.0.13".freeze
+ VERSION = "0.0.14".freeze
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
- version: 0.0.13
+ version: 0.0.14
  platform: ruby
  authors:
  - Michael Telford
@@ -31,6 +31,9 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: 0.9.20
+ - - "<"
+ - !ruby/object:Gem::Version
+ version: '1.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
@@ -38,6 +41,9 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: 0.9.20
+ - - "<"
+ - !ruby/object:Gem::Version
+ version: '1.0'
  - !ruby/object:Gem::Dependency
  name: byebug
  requirement: !ruby/object:Gem::Requirement
@@ -156,14 +162,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 2.8.0
+ version: 2.9.0
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: 2.8.0
+ version: 2.9.0
  description: Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves
  and serialises their page contents for later use. You can use Wgit to copy entire
  websites if required. Wgit also provides a means to search indexed documents stored