wgit 0.5.0 → 0.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.yardopts +7 -0
- data/CHANGELOG.md +240 -0
- data/CODE_OF_CONDUCT.md +76 -0
- data/CONTRIBUTING.md +21 -0
- data/LICENSE.txt +21 -0
- data/README.md +239 -0
- data/bin/wgit +39 -0
- data/lib/wgit.rb +3 -1
- data/lib/wgit/assertable.rb +3 -3
- data/lib/wgit/base.rb +30 -0
- data/lib/wgit/core_ext.rb +1 -1
- data/lib/wgit/crawler.rb +304 -148
- data/lib/wgit/database/database.rb +310 -135
- data/lib/wgit/database/model.rb +10 -3
- data/lib/wgit/document.rb +234 -169
- data/lib/wgit/{document_extensions.rb → document_extractors.rb} +20 -10
- data/lib/wgit/dsl.rb +324 -0
- data/lib/wgit/indexer.rb +68 -156
- data/lib/wgit/response.rb +17 -17
- data/lib/wgit/url.rb +170 -42
- data/lib/wgit/utils.rb +32 -20
- data/lib/wgit/version.rb +8 -2
- metadata +54 -32
checksums.yaml
CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
+  data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
+  data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
```
data/.yardopts
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,240 @@
# Wgit Change Log

## v0.0.0 (TEMPLATE - DO NOT EDIT)
### Added
- ...
### Changed/Removed
- ...
### Fixed
- ...
---

## v0.9.0
This release is a big one, introducing a `Wgit::DSL` and JavaScript parse support. The `README` has been revamped as a result, with new usage examples; and all of the wiki articles have been updated to reflect the latest code base.
### Added
- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
- `Wgit::Crawler#parse_javascript` which, when set to `true`, uses Chrome to parse a page's JavaScript before returning the fully rendered HTML. This feature is disabled by default.
- `Wgit::Base` class to inherit from, acting as an alternative way of using the DSL.
- `Wgit::Utils.sanitize` which calls the `.sanitize_*` methods underneath.
- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next URLs to crawl; otherwise the `:default` is used (as it was before). Use this to override how the site is crawled - see the sketch after this list.
- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
- `Wgit::Database#clear_db!` alias.
- `Wgit::Document` methods: `#at_xpath` and `#at_css`, which call Nokogiri underneath.
- `Wgit::Document#extract` method to perform one-off content extractions.
- `Wgit::Indexer#index_urls` method which can index several URLs in one call.
- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
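A minimal sketch of the new `follow:` param in action (the site URL and xpath are illustrative):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('http://quotes.toscrape.com/')

# Follow only the pagination links, rather than every internal link.
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
  puts doc.url
end
```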
### Changed/Removed
- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
- Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
- Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
- Breaking change: Changed the `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. The same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. The same goes for the named param passed to `Wgit::Crawler.initialize`.
- Breaking change: Refactored `Wgit::Url#relative?`, which now takes `:origin` instead of `:base` and takes the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
- Updated `Utils.printf_search_results` to return the number of results.
- Updated `Wgit::Indexer.new`, which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new`, which works if `ENV['WGIT_CONNECTION_STRING']` is set.
- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String) - see the sketch after this list.
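A quick sketch of the `Regexp` search support (the URL and HTML are illustrative):

```ruby
require 'wgit'

doc = Wgit::Document.new(
  'http://example.com',
  '<html><p>A Ruby crawler. Crawling is fun.</p></html>'
)

doc.search(/crawl(er|ing)/i) # => the text snippets matching the Regexp
```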
### Fixed
- [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before, it simply discarded the document.
- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
---

## v0.8.0
### Added
- To the range of `Wgit::Document.text_elements`. Now all (and only) visible page text should be extracted into `Wgit::Document#text` successfully.
- `Wgit::Document#description` default extension.
- `Wgit::Url.parse_or_nil` method.
### Changed/Removed
- Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
- Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
- Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
- Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
- Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse`, allowing instances of `Wgit::Url` to be returned as-is. This also affects `Enumerable#to_urls` in the same way - see the sketch after this list.
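A short sketch of the updated core extension behaviour:

```ruby
require 'wgit'

url = 'https://example.com'.to_url # => a Wgit::Url
url.to_url                         # => returned as-is (already a Wgit::Url)

['https://a.com', 'https://b.com'].to_urls # => [Wgit::Url, Wgit::Url]
```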
### Fixed
- An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
---

## v0.7.0
### Added
- `Wgit::Indexer.new` optional `crawler:` named param.
- `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
- `Document.extensions` returning a Set of all defined extensions.
### Changed/Removed
- Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search`, `Wgit.indexed_search` etc. This brings back more relevant search results by default - see the sketch after this list.
- Updated the Docker image to now include index names, making it easier to identify them.
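A minimal sketch of the new default (the URL and HTML are illustrative):

```ruby
require 'wgit'

doc = Wgit::Document.new(
  'http://example.com',
  '<html><p>A web crawler. The web is big; the crawler is small.</p></html>'
)

doc.search('web crawler')                        # Matches the whole phrase (new default).
doc.search('web crawler', whole_sentence: false) # Matches 'web' or 'crawler' (old default).
```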
### Fixed
- ...
---

## v0.6.0
### Added
- An `encode:` param to `Wgit::Utils.process_arr`.
### Changed/Removed
- Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
- Breaking changes: Updated `Wgit::Crawler` redirect logic. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_url-instance_method) for more info.
- Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_site-instance_method) for more info - and the sketch after this list.
- Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
- Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
- Improved `Wgit::Url#valid?` and `#relative?`.
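A sketch of the glob support (the site URL and paths are illustrative):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com')

# Only crawl blog pages, skipping any drafts.
crawler.crawl_site(url, allow_paths: 'blog/*', disallow_paths: 'blog/drafts/*') do |doc|
  puts doc.url
end
```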
### Fixed
- Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
- Bug fix in `Wgit::Document#search`.
---

## v0.5.1
### Added
- `Wgit.version_str` method.
### Changed/Removed
- Switched to optimistic dependency versioning.
### Fixed
- Bug in `Wgit::Url#concat`.
---

## v0.5.0
### Added
- A Wgit Wiki! [https://github.com/michaeltelford/wgit/wiki](https://github.com/michaeltelford/wgit/wiki)
- `Wgit::Document#content` alias for `#html`.
- `Wgit::Url#prefix_base` method.
- `Wgit::Url#to_addressable_uri` method.
- Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
- `Wgit::Url#+` as an alias for `#concat`.
- `Wgit::Url#invalid?` method.
- `Wgit.version` method.
- `Wgit::Response` class containing adapter-agnostic HTTP response logic.
### Changed/Removed
- Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- Breaking changes: Added to and moved the `Document.define_extension` block params; it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
- Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
- Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains the `#` prefix; and the query value no longer contains the `?` prefix.
- Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
- Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
- Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
- Breaking changes: `Wgit::Indexer` no longer inserts invalid external URLs (to be crawled at a later date).
- Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response` - see the sketch after this list.
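A minimal sketch of the new response type (the URL is illustrative):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('http://example.com'))

crawler.last_response.class            # => Wgit::Response
crawler.last_response.adapter_response # => the underlying Typhoeus::Response
```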
### Fixed
- Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
- Several bugs in the `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
---

## v0.4.1
### Added
- ...
### Changed/Removed
- ...
### Fixed
- A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default `User-Agent` header. This has now been changed.
---

## v0.4.0
### Added
- `Wgit::Document#stats` alias `#statistics`.
- `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
- `Wgit::Crawler#last_response#redirect_count` method logic.
- `Wgit::Crawler#last_response#total_time` method logic.
- `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc. See the sketch after this list.
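A minimal sketch of `Wgit::Utils.fetch` (the exact order in which the key formats are tried is an assumption):

```ruby
require 'wgit'

hash = { 'FOO' => 1 }

Wgit::Utils.fetch(hash, :foo)        # => 1, found by also trying 'foo' and 'FOO'
Wgit::Utils.fetch(hash, :bar, 'N/A') # => 'N/A', falling back to the default
```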
### Changed/Removed
- Breaking changes: Updated the `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net::HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.
### Fixed
- ...
---

## v0.3.0
### Added
- `Url#crawl_duration` method.
- `Document#crawl_duration` method.
- `Benchmark.measure` to the Crawler logic to set `Url#crawl_duration`.
### Changed/Removed
- Breaking changes: Updated the data model to embed the full `url` object inside the documents object.
- Breaking changes: Updated the data model by removing the documents `score` attribute.
### Fixed
- ...
---

## v0.2.0
This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade, see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
### Added
- `Wgit::Url#absolute?` method.
- `Wgit::Url#relative? base: url` support.
- `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
- `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.
### Changed/Removed
- Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
- Breaking changes: All `Wgit::Indexer` methods now take named parameters.
- Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`,
- Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat`, in favour of instance methods by the same names.
- Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
- Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled `Practical Database Example` for an example, or the sketch after this list.
- Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
- Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
- Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
- Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
- Breaking changes: The `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed, causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
- Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behaviour. `Wgit::Crawler.redirect_limit` no longer exists.
- Breaking changes: The following `Wgit::Crawler` instance method signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
- Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
- Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
- Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the `results`. See the docs for the full details.
- `Wgit::Document` instances can now be instantiated with `String` URLs (previously only `Wgit::Url`s).
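A minimal sketch of the two connection approaches (the connection string is illustrative and assumes a running MongoDB):

```ruby
require 'wgit'

# Pass the connection string directly...
db = Wgit::Database.new('mongodb://user:password@localhost/crawler')

# ...or set the ENV var and omit the param.
ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
db = Wgit::Database.connect
```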
### Fixed
- ...
---

## v0.0.18
### Added
- `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
### Changed/Removed
- Updated certain classes by changing some `private` methods to `protected`.
### Fixed
- ...
---

## v0.0.17
### Added
- Support for the `<base>` element in `Wgit::Document`s.
- New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
### Changed/Removed
- Breaking changes: Removed the `Wgit::Document#internal_links_without_anchors` method.
- Breaking changes (potentially): `Wgit::Url`s are now replaced with the redirected-to Url during a crawl.
- Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
- Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
- Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
### Fixed
- Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links, including those used during `Wgit::Crawler#crawl_site`.
- Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
---

## v0.0.16
### Added
- `Wgit::Url.parse` class method as an alias for `Wgit::Url.new`.
### Changed/Removed
- Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
### Fixed
- Several URI-related bugs in `Wgit::Url` affecting crawls.
---

## v0.0.15
### Added
- Support for IRIs (non-ASCII based URLs).
### Changed/Removed
- Breaking changes: Removed the `Document` and `Url#to_hash` aliases. Call `to_h` instead.
### Fixed
- Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
---

## v0.0.14
### Added
- `Indexer#index_this_page` method.
### Changed/Removed
- Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
### Fixed
- Found and fixed a bug in `Document#new`.
---
data/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,76 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at michael.telford@live.com. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,21 @@
# Contributing

## Consult

Before you make a contribution, reach out to michael.telford@live.com about what changes need to be made. Otherwise, your time spent might be wasted. Once you're clear on what needs to be done, follow the technical steps below.

## Technical Steps

- Fork the repository
- Create a branch
- Write some tests (which fail)
- Write some code
- Re-run the tests (which now hopefully pass)
- Push your branch to your `origin` remote
- Open a GitHub Pull Request (with the target branch being wgit's `origin/master`)
- Apply any requested changes
- Wait for your PR to be merged

## Thanks

Thanks in advance for your contribution.
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2016 - 2020 Michael Telford

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,239 @@
# Wgit

Badges (images elided): [Gem](https://rubygems.org/gems/wgit) · [Downloads](https://rubygems.org/gems/wgit) · [Build](https://travis-ci.org/michaeltelford/wgit) · [Inch CI](http://inch-ci.org/github/michaeltelford/wgit) · [Codacy](https://www.codacy.com/app/michaeltelford/wgit?utm_source=github.com&utm_medium=referral&utm_content=michaeltelford/wgit&utm_campaign=Badge_Grade)

---

Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.

Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains, including:

- URL parsing
- Document content extraction (data mining)
- Crawling entire websites (statistical analysis)

Wgit provides a high-level, easy-to-use API and DSL that you can use in your own applications and scripts.

Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used, so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.

## Table Of Contents

1. [Usage](#Usage)
2. [Why Wgit?](#Why-Wgit)
3. [Why Not Wgit?](#Why-Not-Wgit)
4. [Installation](#Installation)
5. [Documentation](#Documentation)
6. [Executable](#Executable)
7. [License](#License)
8. [Contributing](#Contributing)
9. [Development](#Development)
## Usage

Let's crawl a [quotes website](http://quotes.toscrape.com/), extracting its *quotes* and *authors* using the Wgit DSL:

```ruby
require 'wgit'
require 'json'

include Wgit::DSL

start 'http://quotes.toscrape.com/tag/humor/'
follow "//li[@class='next']/a/@href"

extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
extract :authors, "//div[@class='quote']/span/small", singleton: false

quotes = []

crawl_site do |doc|
  doc.quotes.zip(doc.authors).each do |arr|
    quotes << {
      quote: arr.first,
      author: arr.last
    }
  end
end

puts JSON.generate(quotes)
```

The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with; Wgit's DSL is simply a wrapper around the underlying classes, however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:

```ruby
require 'wgit'
require 'json'

crawler = Wgit::Crawler.new
url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
quotes = []

Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)

crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
  doc.quotes.zip(doc.authors).each do |arr|
    quotes << {
      quote: arr.first,
      author: arr.last
    }
  end
end

puts JSON.generate(quotes)
```

But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):

```ruby
require 'wgit'

include Wgit::DSL

Wgit.logger.level = Logger::WARN

connection_string 'mongodb://user:password@localhost/crawler'
clear_db!

extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
extract :authors, "//div[@class='quote']/span/small", singleton: false

start 'http://quotes.toscrape.com/tag/humor/'
follow "//li[@class='next']/a/@href"

index_site
search 'prejudice'
```

The `search` call (on the last line) will return and output the results:

```text
Quotes to Scrape
“I am free of all prejudice. I hate everyone equally. ”
http://quotes.toscrape.com/tag/humor/page/2/
```

Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:

(Screenshot of the indexed documents elided.)
## Why Wgit?

There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there, so why use Wgit?

- Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
- Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit, by default, crawls the entire site by extracting its internal links pointing to the same host.
- Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
- Wgit can index (crawl and store) HTML to a database, making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:

```ruby
require 'wgit'

ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'

wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')

# Only index the most recent of each wiki article, ignoring the rest of Github.
opts = {
  allow_paths: 'michaeltelford/wgit/wiki/*',
  disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
}

indexer = Wgit::Indexer.new
indexer.index_site(wiki, **opts)
```

## Why Not Wgit?

So why might you not use Wgit, I hear you ask?

- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
- Wgit can parse a crawled page's JavaScript, but it doesn't do so by default. If your crawls are JS-heavy then you might be better off considering a pure browser-based crawler instead.
- Wgit, while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
## Installation

Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.

Currently, the required MRI Ruby version is:

`~> 2.5` a.k.a. `>= 2.5 && < 3`

### Using Bundler

Add this line to your application's `Gemfile`:

```ruby
gem 'wgit'
```

And then execute:

    $ bundle

### Using RubyGems

    $ gem install wgit

Verify the install by using the executable (to start a REPL session):

    $ wgit

## Documentation

- [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
- [Wiki](https://github.com/michaeltelford/wgit/wiki)
- [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
- [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
## Executable

Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded, making it super easy to index and search from the command line without the need for scripts.

The `wgit` executable does the following things (in order):

1. `require 'wgit'`
2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, whichever is found first)
3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)

The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
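A hypothetical `.wgit.rb` might look like this (the site URL and function name are illustrative, and `Wgit::Indexer.new` assumes `WGIT_CONNECTION_STRING` is set):

```ruby
# .wgit.rb - eval'd by the `wgit` executable on startup.

# Helper to index my blog, callable from the shell session.
def index_blog
  indexer = Wgit::Indexer.new # Connects using ENV['WGIT_CONNECTION_STRING'].
  indexer.index_site(Wgit::Url.new('https://example.com/blog'))
end
```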
## License

The gem is available as open source under the terms of the MIT License. See [LICENSE.txt](https://github.com/michaeltelford/wgit/blob/master/LICENSE.txt) for more details.

## Contributing

Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, checking it doesn't already exist.

The current road map is rudimentarily listed in the [Road Map](https://github.com/michaeltelford/wgit/wiki/Road-Map) wiki page. Maybe your feature request is already there?

Before you consider making a contribution, check out [CONTRIBUTING.md](https://github.com/michaeltelford/wgit/blob/master/CONTRIBUTING.md).

## Development

After checking out the repo, run the following commands:

1. `gem install bundler toys`
2. `bundle install --jobs=3`
3. `toys setup`

And you're good to go!

### Tooling

Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...

Run `toys db` to see a list of database-related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.

To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser, run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.

To install this gem onto your local machine, run `toys install` and follow the prompt.

### Console

You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created an `.env` and `.wgit.rb` file, which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions and fixtures, and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.