wgit 0.0.13 → 0.0.14
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +51 -40
- data/TODO.txt +3 -5
- data/lib/wgit/crawler.rb +2 -0
- data/lib/wgit/database/connection_details.rb +4 -12
- data/lib/wgit/database/database.rb +12 -9
- data/lib/wgit/document.rb +1 -1
- data/lib/wgit/indexer.rb +79 -26
- data/lib/wgit/version.rb +1 -1
- metadata +9 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: b2f1d98c88dcd1bd2a12b463732b08373f6b5db4d1c75661530293b0ed129c47
+data.tar.gz: ed5ff0f5aa5e2427909ced8f67c8189dd1f7aec259b76947cd4cd73a1f8783d3
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 6270ceff57af1936e7b0ae1fb3748def307981b26dc604d2f190a9e751e0952f103b4a7c227e2cbbc2d00f26bc7c4b1f46f44fa423e13add7e8bd4eb9fe046f4
+data.tar.gz: 96b0eecd2713cbf55f7433c145a541fdae34d56b9e3bedd1b91b6bb30b463d06b7f07de59457ad899fd32bd095bbe92ee01d94c04ef923924fbc4a110bbe98f2
data/README.md
CHANGED
@@ -16,9 +16,10 @@ Check out this [example application](https://search-engine-rb.herokuapp.com) - a
 6. [Extending The API](#Extending-The-API)
 7. [Caveats](#Caveats)
 8. [Executable](#Executable)
-9. [
-10. [
-11. [
+9. [Change Log](#Change-Log)
+10. [Development](#Development)
+11. [Contributing](#Contributing)
+12. [License](#License)
 
 ## Installation
 
@@ -87,6 +88,10 @@ See the [Practical Database Example](#Practical-Database-Example) for informatio
 
 Wgit uses itself to download and save fixture webpages to disk (used in tests). See the script [here](https://github.com/michaeltelford/wgit/blob/master/test/mock/save_site.rb) and edit it for your own purposes.
 
+### Broken Link Finder
+
+The `broken_link_finder` gem uses Wgit under the hood to find and report a website's broken links. Check out its [repository](https://github.com/michaeltelford/broken_link_finder) for more details.
+
 ### CSS Indexer
 
 The below script downloads the contents of the first css link found on Facebook's index page.
@@ -146,78 +151,80 @@ end
 
 ## Practical Database Example
 
-This next example requires a configured database instance.
+This next example requires a configured database instance. Currently the only supported DBMS is MongoDB. See [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) for a free (small) account or provide your own MongoDB instance.
 
-
+`Wgit::Database` provides a light wrapper of logic around the `mongo` gem allowing for simple database interactivity and serialisation. With Wgit you can index webpages, store them in a database and then search through all that's been indexed. The use of a database is entirely optional however and isn't required for crawling/indexing.
 
-The
+The following versions of MongoDB are supported:
 
-| Gem
-|
-| ~> 2.
+| Gem | Database |
+| ------ | -------- |
+| ~> 2.9 | ~> 4.0 |
 
 ### Setting Up MongoDB
 
-Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records.
+Follow the steps below to configure MongoDB for use with Wgit. This is only needed if you want to read/write database records.
 
 1) Create collections for: `documents` and `urls`.
-2) Add a unique index for the `url` field in **both** collections.
-3) Enable `textSearchEnabled` in MongoDB's configuration.
-4) Create a *text search index* for the `documents` collection using:
+2) Add a [*unique index*](https://docs.mongodb.com/manual/core/index-unique/) for the `url` field in **both** collections.
+3) Enable `textSearchEnabled` in MongoDB's configuration (if not already so).
+4) Create a [*text search index*](https://docs.mongodb.com/manual/core/index-text/#index-feature-text) for the `documents` collection using:
 ```json
 {
-
-
-
-
+"text": "text",
+"author": "text",
+"keywords": "text",
+"title": "text"
 }
 ```
-5) Set the connection details for your MongoDB instance using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)
+5) Set the connection details for your MongoDB instance (see below) using `Wgit.set_connection_details` (prior to calling `Wgit::Database#new`)
 
 **Note**: The *text search index* (in step 4) lists all document fields to be searched by MongoDB when calling `Wgit::Database#search`. Therefore, you should append this list with any other fields that you want searched. For example, if you [extend the API](#Extending-The-API) then you might want to search your new fields in the database by adding them to the index above.
 
 ### Database Example
 
-The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database.
-
-If you're running the code below for yourself, remember to replace the Hash containing the connection details with your own.
+The below script shows how to use Wgit's database functionality to index and then search HTML documents stored in the database. If you're running the code for yourself, remember to replace the database [connection string](https://docs.mongodb.com/manual/reference/connection-string/) with your own.
 
 ```ruby
 require 'wgit'
 require 'wgit/core_ext' # => Provides the String#to_url and Enumerable#to_urls methods.
 
-
+### CONNECT TO THE DATABASE ###
+
+# Set your connection details manually (as below) or from the environment using
+# Wgit.set_connection_details_from_env
+Wgit.set_connection_details('DB_CONNECTION_STRING' => '<your_connection_string>')
+db = Wgit::Database.new # Connects to the database...
+
+### SEED SOME DATA ###
+
+# Here we create our own document rather than crawling the web (which works in the same way).
 # We pass the web page's URL and HTML Strings.
 doc = Wgit::Document.new(
 "http://test-url.com".to_url,
 "<html><p>How now brown cow.</p><a href='http://www.google.co.uk'>Click me!</a></html>"
 )
-
-# Set your connection details manually (as below) or from the environment using
-# Wgit.set_connection_details_from_env
-Wgit.set_connection_details(
-'DB_HOST' => '<host_machine>',
-'DB_PORT' => '27017',
-'DB_USERNAME' => '<username>',
-'DB_PASSWORD' => '<password>',
-'DB_DATABASE' => '<database_name>',
-)
-
-db = Wgit::Database.new # Connects to the database...
 db.insert doc
 
-
+### SEARCH THE DATABASE ###
+
+# Searching the database returns Wgit::Document's which have fields containing the query.
 query = "cow"
 results = db.search query
 
-
+search_result = results.first
+search_result.class # => Wgit::Document
+doc.url == search_result.url # => true
+
+### PULL OUT THE BITS THAT MATCHED OUR QUERY ###
 
-# Searching the returned documents gives the matching
-
+# Searching the returned documents gives the matching text from that document.
+search_result.search(query).first # => "How now brown cow."
 
-
+### SEED URLS TO BE CRAWLED LATER ###
 
-
+db.insert search_result.external_links
+urls_to_crawl = db.uncrawled_urls # => Results will include search_result.external_links.
 ```
 
 ## Extending The API
@@ -319,6 +326,10 @@ In future versions of Wgit, an executable will be packaged with the gem. The exe
 
 This executable will be very similar in nature to `./bin/console` which is currently used only for development and isn't packaged as part of the `wgit` gem.
 
+## Change Log
+
+See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences between versions of Wgit.
+
 ## Development
 
 The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file.
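The new MongoDB setup steps in the README above (collections, a unique `url` index, a text search index) can also be scripted with the `mongo` gem directly. The snippet below is only a minimal sketch: the connection string, database name and the idea of scripting the setup are illustrative assumptions, not part of the wgit README.

```ruby
require 'mongo'

# Placeholder URI and database name - substitute your own MongoDB instance.
client = Mongo::Client.new('mongodb://localhost:27017/wgit_dev')

# Step 1) Create the two collections Wgit expects (ignore "already exists" errors).
client[:documents].create rescue nil
client[:urls].create rescue nil

# Step 2) Unique index on the `url` field in both collections.
client[:documents].indexes.create_one({ url: 1 }, unique: true)
client[:urls].indexes.create_one({ url: 1 }, unique: true)

# Step 4) Text search index over the fields listed in the README's JSON snippet.
client[:documents].indexes.create_one(
  text: 'text', author: 'text', keywords: 'text', title: 'text'
)
```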
data/TODO.txt
CHANGED
@@ -8,7 +8,6 @@ Primary
 
 Secondary
 ---------
-- Setup a dedicated mLab account for the example application in the README - the Heroku deployed search engine; then index some ruby sites like ruby.org etc.
 - Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
 - Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
 - Check if Document::TEXT_ELEMENTS is expansive enough.
@@ -17,16 +16,15 @@ Secondary
 
 Refactoring
 -----------
-- Replace method params with named parameters where applicable.
-- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
-- Mock (monkey patch) the 'mongo' gem's func. for speed in the tests.
 - Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
+- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
+- Replace method params with named parameters where applicable.
 
 Gem Publishing Checklist
 ------------------------
 - Ensure a clean branch of master and create a 'release' branch.
 - Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
-- Increment the version number (in version.rb).
+- Increment the version number (in version.rb) and update the CHANGELOG.md.
 - Run 'bundle install' to update deps.
 - Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
 - Run 'bundle exec rake test' and ensure all tests are passing.
data/lib/wgit/crawler.rb
CHANGED
data/lib/wgit/database/connection_details.rb
CHANGED
@@ -9,27 +9,19 @@ module Wgit
 CONNECTION_DETAILS = {}
 
 # The keys required for a successful database connection.
-CONNECTION_KEYS_REQUIRED = [
-'DB_HOST', 'DB_PORT', 'DB_USERNAME', 'DB_PASSWORD', 'DB_DATABASE'
-]
+CONNECTION_KEYS_REQUIRED = ['DB_CONNECTION_STRING']
 
 # Set the database's connection details from the given hash. It is your
 # responsibility to ensure the correct hash vars are present and set.
 #
 # @param hash [Hash] Containing the database connection details to use.
 # The hash should contain the following keys (of type String):
-#
+# DB_CONNECTION_STRING
 # @raise [KeyError] If any of the required connection details are missing.
 # @return [Hash] Containing the database connection details from hash.
 def self.set_connection_details(hash)
 assert_required_keys(hash, CONNECTION_KEYS_REQUIRED)
-
-CONNECTION_DETAILS[:host] = hash.fetch('DB_HOST')
-CONNECTION_DETAILS[:port] = hash.fetch('DB_PORT')
-CONNECTION_DETAILS[:uname] = hash.fetch('DB_USERNAME')
-CONNECTION_DETAILS[:pword] = hash.fetch('DB_PASSWORD')
-CONNECTION_DETAILS[:db] = hash.fetch('DB_DATABASE')
-
+CONNECTION_DETAILS[:connection_string] = hash.fetch('DB_CONNECTION_STRING')
 CONNECTION_DETAILS
 end
 
@@ -37,7 +29,7 @@ module Wgit
 # responsibility to ensure the correct ENV vars are present and set.
 #
 # The ENV should contain the following keys (of type String):
-#
+# DB_CONNECTION_STRING
 #
 # @raise [KeyError] If any of the required connection details are missing.
 # @return [Hash] Containing the database connection details from the ENV.
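The net effect of this file's changes is that the five separate host/port/credential keys collapse into a single `DB_CONNECTION_STRING`. A short usage sketch under the new scheme (the URI below is a placeholder):

```ruby
require 'wgit'

# One key instead of the old five; the URI itself is a placeholder.
Wgit.set_connection_details('DB_CONNECTION_STRING' => 'mongodb://user:pass@localhost:27017/wgit')

# Or read it from the environment (expects ENV['DB_CONNECTION_STRING'] to be set):
# Wgit.set_connection_details_from_env

Wgit::CONNECTION_DETAILS[:connection_string] # => "mongodb://user:pass@localhost:27017/wgit"
```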
data/lib/wgit/database/database.rb
CHANGED
@@ -17,10 +17,16 @@ module Wgit
 #
 # @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
 def initialize
-
-
-
-
+@@client = Database.connect
+end
+
+# Initializes a database connection client.
+#
+# @raise [RuntimeError] If Wgit::CONNECTION_DETAILS aren't set.
+def self.connect
+unless Wgit::CONNECTION_DETAILS[:connection_string]
+raise "Wgit::CONNECTION_DETAILS must be defined and include \
+:connection_string"
 end
 
 # Only log for error (or more severe) scenarios.
@@ -28,11 +34,8 @@ module Wgit
 Mongo::Logger.logger.progname = 'mongo'
 Mongo::Logger.logger.level = Logger::ERROR
 
-
-
-database: conn_details[:db],
-user: conn_details[:uname],
-password: conn_details[:pword])
+# Connects to the database here.
+Mongo::Client.new(Wgit::CONNECTION_DETAILS[:connection_string])
 end
 
 ### Create Data ###
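With the above, `Wgit::Database.new` now delegates to the new `Database.connect` class method, which raises unless a connection string has been set. A hedged sketch of that behaviour (placeholder URI; whether a server is actually reachable is not checked here):

```ruby
require 'wgit'

# Without connection details the new guard in Database.connect raises.
begin
  Wgit::Database.new
rescue RuntimeError => e
  puts e.message # mentions Wgit::CONNECTION_DETAILS and :connection_string
end

# With a (placeholder) connection string set, both forms go through Database.connect.
Wgit.set_connection_details('DB_CONNECTION_STRING' => 'mongodb://localhost:27017/wgit')
client = Wgit::Database.connect # => a Mongo::Client built from the connection string
db     = Wgit::Database.new     # calls Database.connect internally
```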
data/lib/wgit/document.rb
CHANGED
@@ -84,7 +84,7 @@ module Wgit
 obj = url_or_obj
 assert_respond_to(obj, :fetch)
 
-@url = obj.fetch("url") # Should always be present.
+@url = Wgit::Url.new(obj.fetch("url")) # Should always be present.
 @html = obj.fetch("html", "")
 @doc = init_nokogiri
 @score = obj.fetch("score", 0.0)
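The single changed line above means a `Wgit::Document` rebuilt from a Hash-like object (as the database layer does when deserialising records) now carries a `Wgit::Url` rather than a plain String. A small illustrative sketch, assuming the hand-rolled Hash below stands in for a real DB record:

```ruby
require 'wgit'

# Mimic the object/Hash form of Document.new used when loading from the database.
doc = Wgit::Document.new(
  'url'  => 'http://test-url.com',
  'html' => '<html><p>Hello</p></html>'
)

doc.url.class # => Wgit::Url (previously a plain String in this code path)
```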
data/lib/wgit/indexer.rb
CHANGED
@@ -10,7 +10,7 @@ module Wgit
 # site storing their internal pages into the database and adding their
 # external url's to be crawled later on. Logs info on the crawl
 # using Wgit.logger as it goes along.
-#
+#
 # @param max_sites_to_crawl [Integer] The number of separate and whole
 # websites to be crawled before the method exits. Defaults to -1 which
 # means the crawl will occur until manually stopped (Ctrl+C etc).
@@ -33,8 +33,8 @@ module Wgit
 # @param url [Wgit::Url, String] The base Url of the website to crawl.
 # @param insert_externals [Boolean] Whether or not to insert the website's
 # external Url's into the database.
-# @yield [
-# is inserted into the database allowing for prior manipulation.
+# @yield [Wgit::Document] Given the Wgit::Document of each crawled webpage,
+# before it is inserted into the database allowing for prior manipulation.
 # @return [Integer] The total number of pages crawled within the website.
 def self.index_this_site(url, insert_externals = true, &block)
 url = Wgit::Url.new url
@@ -43,6 +43,24 @@ module Wgit
 indexer.index_this_site(url, insert_externals, &block)
 end
 
+# Convience method to index a single webpage using
+# Wgit::Indexer#index_this_page.
+#
+# Crawls a single webpage and stores it into the database.
+# There is no max download limit so be careful of large pages.
+#
+# @param url [Wgit::Url, String] The Url of the webpage to crawl.
+# @param insert_externals [Boolean] Whether or not to insert the website's
+# external Url's into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+# before it is inserted into the database allowing for prior manipulation.
+def self.index_this_page(url, insert_externals = true, &block)
+url = Wgit::Url.new url
+db = Wgit::Database.new
+indexer = Wgit::Indexer.new(db)
+indexer.index_this_page(url, insert_externals, &block)
+end
+
 # Performs a search of the database's indexed documents and pretty prints
 # the results. See Wgit::Database#search for details of the search.
 #
@@ -53,8 +71,8 @@ module Wgit
 # @param skip [Integer] The number of DB records to skip.
 # @param sentence_length [Integer] The max length of each result's text
 # snippet.
-# @yield [
-def self.indexed_search(query, whole_sentence = false, limit = 10,
+# @yield [Wgit::Document] Given each search result (Wgit::Document).
+def self.indexed_search(query, whole_sentence = false, limit = 10,
 skip = 0, sentence_length = 80, &block)
 db = Wgit::Database.new
 results = db.search(query, whole_sentence, limit, skip, &block)
@@ -63,13 +81,13 @@ module Wgit
 
 # Class which sets up a crawler and saves the indexed docs to a database.
 class Indexer
-
+
 # The crawler used to scrape the WWW.
 attr_reader :crawler
-
+
 # The database instance used to store Urls and Documents in.
 attr_reader :db
-
+
 # Initialize the Indexer.
 #
 # @param database [Wgit::Database] The database instance (already
@@ -97,7 +115,7 @@ module Wgit
 urls to crawl (which might be never).")
 end
 site_count = 0
-
+
 while keep_crawling?(site_count, max_sites_to_crawl, max_data_size) do
 Wgit.logger.info("Current database size: #{@db.size}")
 @crawler.urls = @db.uncrawled_urls
@@ -107,10 +125,10 @@ urls to crawl (which might be never).")
 return
 end
 Wgit.logger.info("Starting crawl loop for: #{@crawler.urls}")
-
+
 docs_count = 0
 urls_count = 0
-
+
 @crawler.urls.each do |url|
 unless keep_crawling?(site_count, max_sites_to_crawl, max_data_size)
 Wgit.logger.info("Reached max number of sites to crawl or database \
@@ -121,7 +139,7 @@ capacity, exiting.")
 
 url.crawled = true
 raise unless @db.update(url) == 1
-
+
 site_docs_count = 0
 ext_links = @crawler.crawl_site(url) do |doc|
 unless doc.empty?
@@ -131,7 +149,7 @@ capacity, exiting.")
 end
 end
 end
-
+
 urls_count += write_urls_to_db(ext_links)
 Wgit.logger.info("Crawled and saved #{site_docs_count} docs for the \
 site: #{url}")
@@ -141,6 +159,8 @@ site: #{url}")
 this iteration.")
 Wgit.logger.info("Found and saved #{urls_count} external url(s) for the next \
 iteration.")
+
+nil
 end
 end
 
@@ -151,14 +171,14 @@ iteration.")
 # @param url [Wgit::Url] The base Url of the website to crawl.
 # @param insert_externals [Boolean] Whether or not to insert the website's
 # external Url's into the database.
-# @yield [
-# is inserted into the database allowing for prior
-# nil or false from the block to prevent the
-# into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of each crawled web
+# page, before it is inserted into the database allowing for prior
+# manipulation. Return nil or false from the block to prevent the
+# document from being saved into the database.
 # @return [Integer] The total number of webpages/documents indexed.
 def index_this_site(url, insert_externals = true)
 total_pages_indexed = 0
-
+
 ext_urls = @crawler.crawl_site(url) do |doc|
 result = true
 if block_given?
@@ -174,23 +194,56 @@ iteration.")
 end
 
 url.crawled = true
-
-
-else
-@db.update(url)
-end
-
+@db.url?(url) ? @db.update(url) : @db.insert(url)
+
 if insert_externals
 write_urls_to_db(ext_urls)
 Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
 end
-
+
 Wgit.logger.info("Crawled and saved #{total_pages_indexed} docs for the \
 site: #{url}")
 
 total_pages_indexed
 end
 
+# Crawls a single webpage and stores it into the database.
+# There is no max download limit so be careful of large pages.
+# Logs info on the crawl using Wgit.logger as it goes along.
+#
+# @param url [Wgit::Url] The webpage Url to crawl.
+# @param insert_externals [Boolean] Whether or not to insert the webpage's
+# external Url's into the database.
+# @yield [Wgit::Document] Given the Wgit::Document of the crawled webpage,
+# before it is inserted into the database allowing for prior
+# manipulation. Return nil or false from the block to prevent the
+# document from being saved into the database.
+def index_this_page(url, insert_externals = true)
+doc = @crawler.crawl_page(url) do |doc|
+result = true
+if block_given?
+result = yield(doc)
+end
+
+if result
+if write_doc_to_db(doc)
+Wgit.logger.info("Crawled and saved internal page: #{doc.url}")
+end
+end
+end
+
+url.crawled = true
+@db.url?(url) ? @db.update(url) : @db.insert(url)
+
+if insert_externals
+ext_urls = doc.external_links
+write_urls_to_db(ext_urls)
+Wgit.logger.info("Found and saved #{ext_urls.length} external url(s)")
+end
+
+nil
+end
+
 private
 
 # Keep crawling or not based on DB size and current loop iteration.
@@ -204,7 +257,7 @@ site: #{url}")
 end
 end
 
-# The unique url index on the documents collection prevents duplicate
+# The unique url index on the documents collection prevents duplicate
 # inserts.
 def write_doc_to_db(doc)
 @db.insert(doc)
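The headline addition in this file is single-page indexing via the new `Wgit.index_this_page`. A hedged usage sketch (placeholder URL and connection string, assuming a reachable MongoDB instance):

```ruby
require 'wgit'

# Placeholder connection string - indexing writes to the database.
Wgit.set_connection_details('DB_CONNECTION_STRING' => '<your_connection_string>')

# New in 0.0.14: crawl one page, store it, and optionally seed its external links.
Wgit.index_this_page('http://test-url.com') do |doc|
  # Returning nil/false here skips saving the document; anything truthy saves it.
  !doc.empty?
end
```

The block semantics mirror `Wgit.index_this_site`: the yielded `Wgit::Document` can be inspected or modified before it is written to the database.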
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-version: 0.0.
+version: 0.0.14
 platform: ruby
 authors:
 - Michael Telford
@@ -31,6 +31,9 @@ dependencies:
 - - ">="
 - !ruby/object:Gem::Version
 version: 0.9.20
+- - "<"
+- !ruby/object:Gem::Version
+version: '1.0'
 type: :development
 prerelease: false
 version_requirements: !ruby/object:Gem::Requirement
@@ -38,6 +41,9 @@ dependencies:
 - - ">="
 - !ruby/object:Gem::Version
 version: 0.9.20
+- - "<"
+- !ruby/object:Gem::Version
+version: '1.0'
 - !ruby/object:Gem::Dependency
 name: byebug
 requirement: !ruby/object:Gem::Requirement
@@ -156,14 +162,14 @@ dependencies:
 requirements:
 - - "~>"
 - !ruby/object:Gem::Version
-version: 2.
+version: 2.9.0
 type: :runtime
 prerelease: false
 version_requirements: !ruby/object:Gem::Requirement
 requirements:
 - - "~>"
 - !ruby/object:Gem::Version
-version: 2.
+version: 2.9.0
 description: Fundamentally, Wgit is a WWW indexer/scraper which crawls URL's, retrieves
 and serialises their page contents for later use. You can use Wgit to copy entire
 websites if required. Wgit also provides a means to search indexed documents stored