wgit 0.5.1 → 0.10.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3fec9d62a92ebd6cdb6475bc8bcd17bf2215acb65cad0c5a8803acd6a4022feb
4
- data.tar.gz: 0d724152a4bdbf049f06be94918517c3a98c0033592d5fc931ae87307374c758
3
+ metadata.gz: b6719bb2015379133ef2c9b417cada1826deab254f6fa1adaa093314f8fece99
4
+ data.tar.gz: 5ced648c0dff501bf0191aebfc0188d535f4ee657a072e1dbccd68ebbc6ac881
5
5
  SHA512:
6
- metadata.gz: d75291ebe2707fe5cb361e5ecec76a4dd1b6e33b5aa6eb92bb23adef74fcbc1e56f0dac7e7f2a95e64e2ab1585ca4c20316e63400246d20f7472ac85c320438b
7
- data.tar.gz: e9910a1eec9268865b6869a9c4cd74e641927fbfe360da6c93b56ebc5f3be942e460229745aba00ca6572a59468b1a0af3d85621b64ef414da794e0a719fbf23
6
+ metadata.gz: 4a7782b4ccf6fa69fad9bb63d7d421fa548603ad5a35304db554bdcdf6deafe305395aba1ac9f35bcd095bc6cf4049ce70e56645faf1457e2e1313d48d1eb7f8
7
+ data.tar.gz: 8b8bb1454a131201e262eda060c6ae8490266a7675910026a0dd6ae0b2b55f2accf140d473edf135078f68cbe1048c4bb86f2dc5a6d4cf08a006f8fc20ac49b5
data/.yardopts ADDED
@@ -0,0 +1,7 @@
1
+ --readme README.md
2
+ --title 'Wgit Gem Documentation'
3
+ --charset utf-8
4
+ --markup markdown
5
+ --output .doc
6
+ --protected
7
+ - *.md LICENSE.txt
data/CHANGELOG.md ADDED
@@ -0,0 +1,249 @@
1
+ # Wgit Change Log
2
+
3
+ ## v0.0.0 (TEMPLATE - DO NOT EDIT)
4
+ ### Added
5
+ - ...
6
+ ### Changed/Removed
7
+ - ...
8
+ ### Fixed
9
+ - ...
10
+ ---
11
+
12
+ ## v0.10.0
13
+ ### Added
14
+ - `Wgit::Url#scheme_relative?` method.
15
+ ### Changed/Removed
16
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
17
+ ### Fixed
18
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
19
+ ---
20
+
21
+ ## v0.9.0
22
+ This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
23
+ ### Added
24
+ - `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
25
+ - `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
26
+ - `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
27
+ - `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
28
+ - `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, it's xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
29
+ - `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
30
+ - `Wgit::Database#clear_db!` alias.
31
+ - `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
32
+ - `Wgit::Document#extract` method to perform one off content extractions.
33
+ - `Wgit::Indexer#index_urls` method which can index several urls in one call.
34
+ - `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
35
+ ### Changed/Removed
36
+ - Breaking change: Moved all `Wgit.index*` convienence methods into `Wgit::DSL`.
37
+ - Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
38
+ - Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
39
+ - Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
40
+ - Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
41
+ - Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
42
+ - Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
43
+ - Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
44
+ - Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
45
+ - Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
46
+ - Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
47
+ - Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
48
+ - Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
49
+ - Breaking change: Refactored `Wgit::Url#relative?` now takes `:origin` instead of `:base` which takes the port into account. This has a knock on effect for some other methods too - check the docs if you're getting parameter errors.
50
+ - Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
51
+ - Updated `Utils.printf_search_results` to return the number of results.
52
+ - Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
53
+ - Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
54
+ - Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
55
+ ### Fixed
56
+ - [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before it simply disgarded the document.
57
+ - `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
58
+ ---
59
+
60
+ ## v0.8.0
61
+ ### Added
62
+ - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
63
+ - `Wgit::Document#description` default extension.
64
+ - `Wgit::Url.parse_or_nil` method.
65
+ ### Changed/Removed
66
+ - Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
67
+ - Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
68
+ - Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
69
+ - Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
70
+ - Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse` allowing instances of `Wgit::Url` to returned as is. This also affects `Enumerable#to_urls` in the same way.
71
+ ### Fixed
72
+ - An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
73
+ ---
74
+
75
+ ## v0.7.0
76
+ ### Added
77
+ - `Wgit::Indexer.new` optional `crawler:` named param.
78
+ - `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
79
+ - `Document.extensions` returning a Set of all defined extensions.
80
+ ### Changed/Removed
81
+ - Potential breaking changes: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search` `Wgit.indexed_search` etc. This brings back more relevant search results by default.
82
+ - Updated the Docker image to now include index names; making it easier to identify them.
83
+ ### Fixed
84
+ - ...
85
+ ---
86
+
87
+ ## v0.6.0
88
+ ### Added
89
+ - Added `Wgit::Utils.proces_arr encode:` param.
90
+ ### Changed/Removed
91
+ - Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
92
+ - Breaking changes: Updated `Wgit::Crawler` redirect logic. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_url-instance_method) for more info.
93
+ - Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_site-instance_method) for more info.
94
+ - Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
95
+ - Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
96
+ - Improved `Wgit::Url#valid?` and `#relative?`.
97
+ ### Fixed
98
+ - Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
99
+ - Bug fix in `Wgit::Document#search`.
100
+ ---
101
+
102
+ ## v0.5.1
103
+ ### Added
104
+ - `Wgit.version_str` method.
105
+ ### Changed/Removed
106
+ - Switched to optimistic dependency versioning.
107
+ ### Fixed
108
+ - Bug in `Wgit::Url#concat`.
109
+ ---
110
+
111
+ ## v0.5.0
112
+ ### Added
113
+ - A Wgit Wiki! [https://github.com/michaeltelford/wgit/wiki](https://github.com/michaeltelford/wgit/wiki)
114
+ - `Wgit::Document#content` alias for `#html`.
115
+ - `Wgit::Url#prefix_base` method.
116
+ - `Wgit::Url#to_addressable_uri` method.
117
+ - Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
118
+ - `Wgit::Url#+` as alias for `#concat`.
119
+ - `Wgit::Url#invalid?` method.
120
+ - `Wgit.version` method.
121
+ - `Wgit::Response` class containing adapter agnostic HTTP response logic.
122
+ ### Changed/Removed
123
+ - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
124
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
125
+ - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
126
+ - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
127
+ - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
128
+ - Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
129
+ - Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
130
+ - Breaking changes: `Wgit::Indexer` no longer inserts invalid external URL's (to be crawled at a later date).
131
+ - Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response`.
132
+ ### Fixed
133
+ - Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
134
+ - Several bugs in `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
135
+ ---
136
+
137
+ ## v0.4.1
138
+ ### Added
139
+ - ...
140
+ ### Changed/Removed
141
+ - ...
142
+ ### Fixed
143
+ - A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default `User-Agent` header. This has now been changed.
144
+ ---
145
+
146
+ ## v0.4.0
147
+ ### Added
148
+ - `Wgit::Document#stats` alias `#statistics`.
149
+ - `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
150
+ - `Wgit::Crawler#last_response#redirect_count` method logic.
151
+ - `Wgit::Crawler#last_response#total_time` method logic.
152
+ - `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc.
153
+ ### Changed/Removed
154
+ - Breaking changes: Updated `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net:HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.
155
+ ### Fixed
156
+ - ...
157
+ ---
158
+
159
+ ## v0.3.0
160
+ ### Added
161
+ - `Url#crawl_duration` method.
162
+ - `Document#crawl_duration` method.
163
+ - `Benchmark.measure` to Crawler logic to set `Url#crawl_duration`.
164
+ ### Changed/Removed
165
+ - Breaking changes: Updated data model to embed the full `url` object inside the documents object.
166
+ - Breaking changes: Updated data model by removing documents `score` attribute.
167
+ ### Fixed
168
+ - ...
169
+ ---
170
+
171
+ ## v0.2.0
172
+ This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
173
+ ### Added
174
+ - `Wgit::Url#absolute?` method.
175
+ - `Wgit::Url#relative? base: url` support.
176
+ - `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
177
+ - `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.
178
+ ### Changed/Removed
179
+ - Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
180
+ - Breaking changes: All `Wgit::Indexer` methods now take named parameters.
181
+ - Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`,
182
+ - Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat` in favour of instance methods by the same names.
183
+ - Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
184
+ - Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled: `Practical Database Example` for an example.
185
+ - Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
186
+ - Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
187
+ - Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
188
+ - Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
189
+ - Breaking changes: `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
190
+ - Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behavior. `Wgit::Crawler.redirect_limit` no longer exists.
191
+ - Breaking changes: The following `Wgit::Crawler` instance methods signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
192
+ - Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
193
+ - Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
194
+ - Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the `results`. See the docs for the full details.
195
+ - `Wgit::Document` instances can now be instantiated with `String` Url's (previously only `Wgit::Url`'s).
196
+ ### Fixed
197
+ - ...
198
+ ---
199
+
200
+ ## v0.0.18
201
+ ### Added
202
+ - `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
203
+ ### Changed/Removed
204
+ - Updated certain classes by changing some `private` methods to `protected`.
205
+ ### Fixed
206
+ - ...
207
+ ---
208
+
209
+ ## v0.0.17
210
+ ### Added
211
+ - Support for `<base>` element in `Wgit::Document`'s.
212
+ - New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
213
+ ### Changed/Removed
214
+ - Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
215
+ - Breaking changes (potentially): `Wgit::Url`'s are now replaced with the redirected to Url during a crawl.
216
+ - Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
217
+ - Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
218
+ - Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
219
+ ### Fixed
220
+ - Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links including those used during `Wgit::Crawler#crawl_site`.
221
+ - Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
222
+ ---
223
+
224
+ ## v0.0.16
225
+ ### Added
226
+ - Added `Wgit::Url.parse` class method as alias for `Wgit::Url.new`.
227
+ ### Changed/Removed
228
+ - Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
229
+ ### Fixed
230
+ - Several URI related bugs in `Wgit::Url` affecting crawls.
231
+ ---
232
+
233
+ ## v0.0.15
234
+ ### Added
235
+ - Support for IRI's (non ASCII based URL's).
236
+ ### Changed/Removed
237
+ - Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.
238
+ ### Fixed
239
+ - Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
240
+ ---
241
+
242
+ ## v0.0.14
243
+ ### Added
244
+ - `Indexer#index_this_page` method.
245
+ ### Changed/Removed
246
+ - Breaking Changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
247
+ ### Fixed
248
+ - Found and fixed a bug in `Document#new`.
249
+ ---
@@ -0,0 +1,76 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
9
+ level of experience, education, socio-economic status, nationality, personal
10
+ appearance, race, religion, or sexual identity and orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at michael.telford@live.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72
+
73
+ [homepage]: https://www.contributor-covenant.org
74
+
75
+ For answers to common questions about this code of conduct, see
76
+ https://www.contributor-covenant.org/faq
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,21 @@
1
+ # Contributing
2
+
3
+ ## Consult
4
+
5
+ Before you make a contribution, reach out to michael.telford@live.com about what changes need made. Otherwise, your time spent might be wasted. Once you're clear on what needs done follow the technical steps below.
6
+
7
+ ## Technical Steps
8
+
9
+ - Fork the repository
10
+ - Create a branch
11
+ - Write some tests (which fail)
12
+ - Write some code
13
+ - Re-run the tests (which now hopefully pass)
14
+ - Push your branch to your `origin` remote
15
+ - Open a GitHub Pull Request (with the target branch being wgit's `origin/master`)
16
+ - Apply any requested changes
17
+ - Wait for your PR to be merged
18
+
19
+ ## Thanks
20
+
21
+ Thanks in advance for your contribution.
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2016 - 2020 Michael Telford
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,232 @@
1
+ # Wgit
2
+
3
+ [![Inline gem version](https://badge.fury.io/rb/wgit.svg)](https://rubygems.org/gems/wgit)
4
+ [![Inline downloads](https://img.shields.io/gem/dt/wgit)](https://rubygems.org/gems/wgit)
5
+ [![Inline build](https://travis-ci.org/michaeltelford/wgit.svg?branch=master)](https://travis-ci.org/michaeltelford/wgit)
6
+ [![Inline docs](http://inch-ci.org/github/michaeltelford/wgit.svg?branch=master)](http://inch-ci.org/github/michaeltelford/wgit)
7
+ [![Inline code quality](https://api.codacy.com/project/badge/Grade/d5a0de62e78b460997cb8ce1127cea9e)](https://www.codacy.com/app/michaeltelford/wgit?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=michaeltelford/wgit&amp;utm_campaign=Badge_Grade)
8
+
9
+ ---
10
+
11
+ Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
12
+
13
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
14
+
15
+ - URL parsing
16
+ - Document content extraction (data mining)
17
+ - Crawling entire websites (statistical analysis)
18
+
19
+ Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
20
+
21
+ Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.
22
+
23
+ ## Table Of Contents
24
+
25
+ 1. [Usage](#Usage)
26
+ 2. [Why Wgit?](#Why-Wgit)
27
+ 3. [Why Not Wgit?](#Why-Not-Wgit)
28
+ 4. [Installation](#Installation)
29
+ 5. [Documentation](#Documentation)
30
+ 6. [Executable](#Executable)
31
+ 7. [License](#License)
32
+ 8. [Contributing](#Contributing)
33
+ 9. [Development](#Development)
34
+
35
+ ## Usage
36
+
37
+ Let's crawl a [quotes website](http://quotes.toscrape.com/) extracting its *quotes* and *authors* using the Wgit DSL:
38
+
39
+ ```ruby
40
+ require 'wgit'
41
+ require 'json'
42
+
43
+ include Wgit::DSL
44
+
45
+ start 'http://quotes.toscrape.com/tag/humor/'
46
+ follow "//li[@class='next']/a/@href"
47
+
48
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
49
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
50
+
51
+ quotes = []
52
+
53
+ crawl_site do |doc|
54
+ doc.quotes.zip(doc.authors).each do |arr|
55
+ quotes << {
56
+ quote: arr.first,
57
+ author: arr.last
58
+ }
59
+ end
60
+ end
61
+
62
+ puts JSON.generate(quotes)
63
+ ```
64
+
65
+ But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
66
+
67
+ ```ruby
68
+ require 'wgit'
69
+
70
+ include Wgit::DSL
71
+
72
+ Wgit.logger.level = Logger::WARN
73
+
74
+ connection_string 'mongodb://user:password@localhost/crawler'
75
+
76
+ start 'http://quotes.toscrape.com/tag/humor/'
77
+ follow "//li[@class='next']/a/@href"
78
+
79
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
80
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
81
+
82
+ index_site
83
+ search 'prejudice'
84
+ ```
85
+
86
+ The `search` call (on the last line) will return and output the results:
87
+
88
+ ```text
89
+ Quotes to Scrape
90
+ “I am free of all prejudice. I hate everyone equally. ”
91
+ http://quotes.toscrape.com/tag/humor/page/2/
92
+ ```
93
+
94
+ Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
95
+
96
+ ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
97
+
98
+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
99
+
100
+ ```ruby
101
+ require 'wgit'
102
+ require 'json'
103
+
104
+ crawler = Wgit::Crawler.new
105
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
106
+ quotes = []
107
+
108
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
109
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
110
+
111
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
112
+ doc.quotes.zip(doc.authors).each do |arr|
113
+ quotes << {
114
+ quote: arr.first,
115
+ author: arr.last
116
+ }
117
+ end
118
+ end
119
+
120
+ puts JSON.generate(quotes)
121
+ ```
122
+
123
+ ## Why Wgit?
124
+
125
+ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
126
+
127
+ - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
128
+ - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit by default, crawls the entire site by extracting its internal links pointing to the same host.
129
+ - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
130
+ - Wgit can index (crawl and store) HTML to a database making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
131
+
132
+ ```ruby
133
+ require 'wgit'
134
+
135
+ ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
136
+
137
+ wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')
138
+
139
+ # Only index the most recent of each wiki article, ignoring the rest of Github.
140
+ opts = {
141
+ allow_paths: 'michaeltelford/wgit/wiki/*',
142
+ disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
143
+ }
144
+
145
+ indexer = Wgit::Indexer.new
146
+ indexer.index_site(wiki, **opts)
147
+ ```
148
+
149
+ ## Why Not Wgit?
150
+
151
+ So why might you not use Wgit, I hear you ask?
152
+
153
+ - Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
154
+ - Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider a pure browser-based crawler instead.
155
+ - Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
156
+
157
+ ## Installation
158
+
159
+ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
160
+
161
+ Currently, the required MRI Ruby version is:
162
+
163
+ `~> 2.5` (a.k.a.) `>= 2.5 && < 3`
164
+
165
+ ### Using Bundler
166
+
167
+ $ bundle add wgit
168
+
169
+ ### Using RubyGems
170
+
171
+ $ gem install wgit
172
+
173
+ ### Verify
174
+
175
+ $ wgit
176
+
177
+ Calling the installed executable will start an REPL session.
178
+
179
+ ## Documentation
180
+
181
+ - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
182
+ - [Wiki](https://github.com/michaeltelford/wgit/wiki)
183
+ - [API Yardocs](https://www.rubydoc.info/gems/wgit)
184
+ - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
185
+
186
+ ## Executable
187
+
188
+ Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded; making it super easy to index and search from the command line without the need for scripts.
189
+
190
+ The `wgit` executable does the following things (in order):
191
+
192
+ 1. `require wgit`
193
+ 2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, which ever is found first)
194
+ 3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
195
+
196
+ The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching everytime you start a new session.
197
+
198
+ ## License
199
+
200
+ The gem is available as open source under the terms of the MIT License. See [LICENSE.txt](https://github.com/michaeltelford/wgit/blob/master/LICENSE.txt) for more details.
201
+
202
+ ## Contributing
203
+
204
+ Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, checking it doesn't already exist.
205
+
206
+ The current road map is rudimentally listed in the [Road Map](https://github.com/michaeltelford/wgit/wiki/Road-Map) wiki page. Maybe your feature request is already there?
207
+
208
+ Before you consider making a contribution, check out [CONTRIBUTING.md](https://github.com/michaeltelford/wgit/blob/master/CONTRIBUTING.md).
209
+
210
+ ## Development
211
+
212
+ After checking out the repo, run the following commands:
213
+
214
+ 1. `gem install bundler toys`
215
+ 2. `bundle install --jobs=3`
216
+ 3. `toys setup`
217
+
218
+ And you're good to go!
219
+
220
+ ### Tooling
221
+
222
+ Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
223
+
224
+ Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.
225
+
226
+ To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
227
+
228
+ To install this gem onto your local machine, run `toys install` and follow the prompt.
229
+
230
+ ### Console
231
+
232
+ You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created an `.env` and `.wgit.rb` file which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions, fixtures and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.