wgit 0.5.0 → 0.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 3e5c6b85b0ac78d234674d6003f8624b266c09668b4cfd78945106a917f78078
- data.tar.gz: 3fc90cf5c132804f12e54f2b5f446143591923fff0677accc2ab907295ba34c4
+ metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
+ data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
  SHA512:
- metadata.gz: f39df81391a07b344678a2b8d443b945391728d215e142ed73a55ef80cfc9c9a8407db9e4faa60c3e43e5b8e65bf8e84c3a343ff962b3c0276eed920639f3870
- data.tar.gz: 1690895b56def00cbed58e485b23f5158ada0adb89f1c0e87bff3c638332648761dbac81b8f08e6c9c6ee911f4cbf9df72f3bfbce5d8abc2207d434edfde61ee
+ metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
+ data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
@@ -0,0 +1,7 @@
+ --readme README.md
+ --title 'Wgit Gem Documentation'
+ --charset utf-8
+ --markup markdown
+ --output .doc
+ --protected
+ - *.md LICENSE.txt
@@ -0,0 +1,240 @@
+ # Wgit Change Log
+
+ ## v0.0.0 (TEMPLATE - DO NOT EDIT)
+ ### Added
+ - ...
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.9.0
+ This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples, and all of the wiki articles have been updated to reflect the latest code base.
+ ### Added
+ - `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+ - `Wgit::Crawler#parse_javascript` which, when set to `true`, uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
+ - `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+ - `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+ - `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+ - `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+ - `Wgit::Database#clear_db!` alias.
+ - `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
+ - `Wgit::Document#extract` method to perform one-off content extractions.
+ - `Wgit::Indexer#index_urls` method which can index several urls in one call.
+ - `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+ ### Changed/Removed
+ - Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+ - Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+ - Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+ - Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+ - Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+ - Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+ - Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
+ - Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
+ - Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
+ - Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+ - Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+ - Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+ - Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+ - Breaking change: Refactored `Wgit::Url#relative?` so that it now takes `:origin` instead of `:base`, which takes the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
+ - Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+ - Updated `Utils.printf_search_results` to return the number of results.
+ - Updated `Wgit::Indexer.new` so that it can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new`, which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+ - Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+ - Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+ ### Fixed
+ - [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before, it simply discarded the document.
+ - `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+ ---
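
The String-or-Regexp query support described above can be illustrated with a plain-Ruby sketch (a hypothetical `match_query?` helper for illustration only, not Wgit's actual implementation):

```ruby
# Hypothetical helper showing how a search query can accept either a
# String (case-insensitive substring match) or a Regexp, as
# Wgit::Document#search now does.
def match_query?(text, query)
  case query
  when Regexp then text.match?(query)
  else             text.downcase.include?(query.to_s.downcase)
  end
end

puts match_query?('Ruby is fun', /r\w+by/i) # Regexp query
puts match_query?('Ruby is fun', 'FUN')     # String query
```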
+
+ ## v0.8.0
+ ### Added
+ - To the range of `Wgit::Document.text_elements`. Now all (and only) visible page text should be extracted into `Wgit::Document#text` successfully.
+ - `Wgit::Document#description` default extension.
+ - `Wgit::Url.parse_or_nil` method.
+ ### Changed/Removed
+ - Breaking change: Renamed `Document#stats[:text_snippets]` to be `:text`.
+ - Breaking change: `Wgit::Document.define_extension`'s block return value now becomes the `var` value, even when `nil` is returned. This allows `var` to be set to `nil`.
+ - Potential breaking change: Renamed `Wgit::Response#crawl_time` (alias) to be `#crawl_duration`.
+ - Updated `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS` to be `Wgit::Crawler.supported_file_extensions`, making it configurable. Now you can add your own URL extensions if needed.
+ - Updated the Wgit core extension `String#to_url` to use `Wgit::Url.parse`, allowing instances of `Wgit::Url` to be returned as-is. This also affects `Enumerable#to_urls` in the same way.
+ ### Fixed
+ - An issue where too much `Wgit::Document#text` was being extracted from the HTML. This was fixed by reverting the recent commit: "Document.text_elements_xpath is now `//*/text()`".
+ ---
+
+ ## v0.7.0
+ ### Added
+ - `Wgit::Indexer.new` optional `crawler:` named param.
+ - `bin/wgit` executable; available after `gem install wgit`. Just type `wgit` at the command line for an interactive shell session with the Wgit gem already loaded.
+ - `Document.extensions` returning a Set of all defined extensions.
+ ### Changed/Removed
+ - Potential breaking change: Updated the default search param from `whole_sentence: false` to `true` across all search methods e.g. `Wgit::Database#search`, `Wgit::Document#search`, `Wgit.indexed_search` etc. This brings back more relevant search results by default.
+ - Updated the Docker image to now include index names, making it easier to identify them.
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.6.0
+ ### Added
+ - Added `Wgit::Utils.process_arr` `encode:` param.
+ ### Changed/Removed
+ - Breaking changes: Updated `Wgit::Response#success?` and `#failure?` logic.
+ - Breaking changes: Updated `Wgit::Crawler` redirect logic. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_url-instance_method) for more info.
+ - Breaking changes: Updated `Wgit::Crawler#crawl_site` path params logic to support globs e.g. `allow_paths: 'wiki/*'`. See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/Wgit/Crawler#crawl_site-instance_method) for more info.
+ - Breaking changes: Refactored references of `encode_html:` to `encode:` in the `Wgit::Document` and `Wgit::Crawler` classes.
+ - Breaking changes: `Wgit::Document.text_elements_xpath` is now `//*/text()`. This means that more text is extracted from each page and you can no longer be selective of the text elements on a page.
+ - Improved `Wgit::Url#valid?` and `#relative?`.
+ ### Fixed
+ - Bug fix in `Wgit::Crawler#crawl_site` where `*.php` URLs weren't being crawled. The fix was to implement `Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS`.
+ - Bug fix in `Wgit::Document#search`.
+ ---
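
The glob support described for `allow_paths:`/`disallow_paths:` can be sketched with Ruby's built-in `File.fnmatch?` (an illustration of shell-style glob semantics; Wgit's internal matching may differ):

```ruby
# Illustrative path filter using shell-style globs, e.g. allow_paths: 'wiki/*'.
def path_allowed?(path, allow_globs)
  Array(allow_globs).any? { |glob| File.fnmatch?(glob, path) }
end

puts path_allowed?('wiki/Getting-Started', 'wiki/*') # matches the glob
puts path_allowed?('blog/post-1', 'wiki/*')          # filtered out
```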
+
+ ## v0.5.1
+ ### Added
+ - `Wgit.version_str` method.
+ ### Changed/Removed
+ - Switched to optimistic dependency versioning.
+ ### Fixed
+ - Bug in `Wgit::Url#concat`.
+ ---
+
+ ## v0.5.0
+ ### Added
+ - A Wgit Wiki! [https://github.com/michaeltelford/wgit/wiki](https://github.com/michaeltelford/wgit/wiki)
+ - `Wgit::Document#content` alias for `#html`.
+ - `Wgit::Url#prefix_base` method.
+ - `Wgit::Url#to_addressable_uri` method.
+ - Support for partially crawling a site using `Wgit::Crawler#crawl_site(allow_paths: [])` or `disallow_paths:`.
+ - `Wgit::Url#+` as alias for `#concat`.
+ - `Wgit::Url#invalid?` method.
+ - `Wgit.version` method.
+ - `Wgit::Response` class containing adapter-agnostic HTTP response logic.
+ ### Changed/Removed
+ - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
+ - Breaking changes: Added to and moved `Document.define_extension`'s block params; the signature is now `|value, source, type|`. The `source` param is not what it used to be; its old meaning now belongs to `type` - either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
+ - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains the `#` prefix; and the query value no longer contains the `?` prefix.
+ - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
+ - Breaking changes: Renamed `Wgit::Url#prefix_protocol` to `#prefix_scheme`. The `protocol:` param name remains unchanged.
+ - Breaking changes: Renamed all `Wgit::Url` methods starting with `without_*` to `omit_*`.
+ - Breaking changes: `Wgit::Indexer` no longer inserts invalid external URL's (to be crawled at a later date).
+ - Breaking changes: `Wgit::Crawler#last_response` is now of type `Wgit::Response`. You can access the underlying `Typhoeus::Response` object with `crawler.last_response.adapter_response`.
+ ### Fixed
+ - Bug in `Wgit::Document#base_url` around the handling of invalid base URL scenarios.
+ - Several bugs in the `Wgit::Database` class caused by the recent changes to the data model (in version 0.3.0).
+ ---
+
+ ## v0.4.1
+ ### Added
+ - ...
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - A crawl bug that resulted in some servers dropping requests due to the use of Typhoeus's default `User-Agent` header. This has now been changed.
+ ---
+
+ ## v0.4.0
+ ### Added
+ - `Wgit::Document#stats` alias `#statistics`.
+ - `Wgit::Crawler#time_out` logic for long crawls. Can also be set via `initialize`.
+ - `Wgit::Crawler#last_response#redirect_count` method logic.
+ - `Wgit::Crawler#last_response#total_time` method logic.
+ - `Wgit::Utils.fetch(hash, key, default = nil)` method which tries multiple key formats before giving up e.g. `:foo, 'foo', 'FOO'` etc.
+ ### Changed/Removed
+ - Breaking changes: Updated `Wgit::Crawler` crawl logic to use `typhoeus` instead of `Net::HTTP`. Users should see a significant improvement in crawl speed as a result. This means that `Wgit::Crawler#last_response` is now of type `Typhoeus::Response`. See https://rubydoc.info/gems/typhoeus/Typhoeus/Response for more info.
+ ### Fixed
+ - ...
+ ---
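
The behaviour described for `Wgit::Utils.fetch` - trying several key formats before giving up - can be sketched in plain Ruby (a reimplementation for illustration only, not the gem's source):

```ruby
# Tries the Symbol, String and upper-case String forms of a key in turn,
# returning the default only if none of them are present in the hash.
def fetch(hash, key, default = nil)
  [key.to_sym, key.to_s, key.to_s.upcase].each do |k|
    return hash[k] if hash.key?(k)
  end
  default
end

puts fetch({ 'FOO' => 1 }, :foo)     # found via the upper-case String key
puts fetch({ bar: 2 }, :baz, 'none') # falls back to the default
```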
+
+ ## v0.3.0
+ ### Added
+ - `Url#crawl_duration` method.
+ - `Document#crawl_duration` method.
+ - `Benchmark.measure` to Crawler logic to set `Url#crawl_duration`.
+ ### Changed/Removed
+ - Breaking changes: Updated data model to embed the full `url` object inside the documents object.
+ - Breaking changes: Updated data model by removing documents `score` attribute.
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.2.0
+ This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ ### Added
+ - `Wgit::Url#absolute?` method.
+ - `Wgit::Url#relative? base: url` support.
+ - `Wgit::Database.connect` method (alias for `Wgit::Database.new`).
+ - `Wgit::Database#search` and `Wgit::Document#search` methods now support `case_sensitive:` and `whole_sentence:` named parameters.
+ ### Changed/Removed
+ - Breaking changes: Renamed the following `Wgit` and `Wgit::Indexer` methods: `Wgit.index_the_web` to `Wgit.index_www`, `Wgit::Indexer.index_the_web` to `Wgit::Indexer.index_www`, `Wgit.index_this_site` to `Wgit.index_site`, `Wgit::Indexer.index_this_site` to `Wgit::Indexer.index_site`, `Wgit.index_this_page` to `Wgit.index_page`, `Wgit::Indexer.index_this_page` to `Wgit::Indexer.index_page`.
+ - Breaking changes: All `Wgit::Indexer` methods now take named parameters.
+ - Breaking changes: The following `Wgit::Url` method signatures have changed: `initialize` aka `new`,
+ - Breaking changes: The following `Wgit::Url` class methods have been removed: `.validate`, `.valid?`, `.prefix_protocol`, `.concat` in favour of instance methods by the same names.
+ - Breaking changes: The following `Wgit::Url` instance methods/aliases have been changed/removed: `#to_protocol` (now `#to_scheme`), `#to_query_string` and `#query_string` (now `#to_query`), `#relative_link?` (now `#relative?`), `#without_query_string` (now `#without_query`), `#is_query_string?` (now `#query?`).
+ - Breaking changes: The database connection string is now passed directly to `Wgit::Database.new`; or in its absence, obtained from `ENV['WGIT_CONNECTION_STRING']`. See the `README.md` section entitled: `Practical Database Example` for an example.
+ - Breaking changes: The following `Wgit::Database` instance methods now take named parameters: `#urls`, `#crawled_urls`, `#uncrawled_urls`, `#search`.
+ - Breaking changes: The following `Wgit::Document` instance methods now take named parameters: `#to_h`, `#to_json`, `#search`, `#search!`.
+ - Breaking changes: The following `Wgit::Document` instance methods/aliases have been changed/removed: `#internal_full_links` (now `#internal_absolute_links`).
+ - Breaking changes: Any `Wgit::Document` method alias for returning links containing the word `relative` has been removed for clarity. Use `#internal_links`, `#internal_absolute_links` or `#external_links` instead.
+ - Breaking changes: `Wgit::Crawler` instance vars `@docs` and `@urls` have been removed, causing the following instance methods to also be removed: `#urls=`, `#[]`, `#<<`. Also, `.new` aka `#initialize` now requires no params.
+ - Breaking changes: `Wgit::Crawler.new` now takes an optional `redirect_limit:` parameter. This is now the only way of customising the redirect crawl behaviour. `Wgit::Crawler.redirect_limit` no longer exists.
+ - Breaking changes: The following `Wgit::Crawler` instance method signatures have changed: `#crawl_site` and `#crawl_url` now require a `url` param (which no longer defaults), and `#crawl_urls` now requires one or more `*urls` (which no longer defaults).
+ - Breaking changes: The following `Wgit::Assertable` method aliases have been removed: `.type`, `.types` (use `.assert_types` instead) and `.arr_type`, `.arr_types` (use `.assert_arr_types` instead).
+ - Breaking changes: The following `Wgit::Utils` methods now take named parameters: `.to_h` and `.printf_search_results`.
+ - Breaking changes: `Wgit::Utils.printf_search_results`'s method signature has changed; the search parameters have been removed. Before calling this method you must call `doc.search!` on each of the `results`. See the docs for the full details.
+ - `Wgit::Document` instances can now be instantiated with `String` Url's (previously only `Wgit::Url`'s).
+ ### Fixed
+ - ...
+ ---
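
What the `case_sensitive:` and `whole_sentence:` flags mean can be illustrated with a plain-Ruby sketch (hypothetical logic approximating the described behaviour, not Wgit's implementation):

```ruby
# whole_sentence: true matches the query as one phrase;
# false matches if any individual word of the query is found.
def text_matches?(text, query, case_sensitive: false, whole_sentence: false)
  t = case_sensitive ? text : text.downcase
  q = case_sensitive ? query : query.downcase
  terms = whole_sentence ? [q] : q.split
  terms.any? { |term| t.include?(term) }
end

puts text_matches?('I hate everyone equally', 'equally hate')                       # any word matches
puts text_matches?('I hate everyone equally', 'equally hate', whole_sentence: true) # exact phrase required
```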
+
+ ## v0.0.18
+ ### Added
+ - `Wgit::Url#to_brand` method and updated `Wgit::Url#is_relative?` to support it.
+ ### Changed/Removed
+ - Updated certain classes by changing some `private` methods to `protected`.
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.0.17
+ ### Added
+ - Support for the `<base>` element in `Wgit::Document`'s.
+ - New `Wgit::Url` methods: `without_query_string`, `is_query_string?`, `is_anchor?`, `replace` (override of `String#replace`).
+ ### Changed/Removed
+ - Breaking changes: Removed `Wgit::Document#internal_links_without_anchors` method.
+ - Breaking changes (potentially): `Wgit::Url`'s are now replaced with the redirected-to Url during a crawl.
+ - Updated `Wgit::Document#base_url` to support an optional `link:` named parameter.
+ - Updated `Wgit::Crawler#crawl_site` to allow the initial url to redirect to another host.
+ - Updated `Wgit::Url#is_relative?` to support an optional `domain:` named parameter.
+ ### Fixed
+ - Bug in `Wgit::Document#internal_full_links` affecting anchor and query string links, including those used during `Wgit::Crawler#crawl_site`.
+ - Bug causing an 'Invalid URL' error for `Wgit::Crawler#crawl_site`.
+ ---
+
+ ## v0.0.16
+ ### Added
+ - Added `Wgit::Url.parse` class method as alias for `Wgit::Url.new`.
+ ### Changed/Removed
+ - Breaking changes: Removed `Wgit::Url.relative_link?` (class method). Use `Wgit::Url#is_relative?` (instance method) instead e.g. `Wgit::Url.new('/blah').is_relative?`.
+ ### Fixed
+ - Several URI related bugs in `Wgit::Url` affecting crawls.
+ ---
+
+ ## v0.0.15
+ ### Added
+ - Support for IRI's (non-ASCII based URL's).
+ ### Changed/Removed
+ - Breaking changes: Removed `Document` and `Url#to_hash` aliases. Call `to_h` instead.
+ ### Fixed
+ - Bug in `Crawler#crawl_site` where an internal redirect to an external site's page was being followed.
+ ---
+
+ ## v0.0.14
+ ### Added
+ - `Indexer#index_this_page` method.
+ ### Changed/Removed
+ - Breaking changes: `Wgit::CONNECTION_DETAILS` now only requires `DB_CONNECTION_STRING`.
+ ### Fixed
+ - Found and fixed a bug in `Document#new`.
+ ---
@@ -0,0 +1,76 @@
+ # Contributor Covenant Code of Conduct
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to making participation in our project and
+ our community a harassment-free experience for everyone, regardless of age, body
+ size, disability, ethnicity, sex characteristics, gender identity and expression,
+ level of experience, education, socio-economic status, nationality, personal
+ appearance, race, religion, or sexual identity and orientation.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy towards other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+ * Trolling, insulting/derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned to this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies both within project spaces and in public spaces
+ when an individual is representing the project or its community. Examples of
+ representing a project or community include using an official project e-mail
+ address, posting via an official social media account, or acting as an appointed
+ representative at an online or offline event. Representation of a project may be
+ further defined and clarified by project maintainers.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the project team at michael.telford@live.com. All
+ complaints will be reviewed and investigated and will result in a response that
+ is deemed necessary and appropriate to the circumstances. The project team is
+ obligated to maintain confidentiality with regard to the reporter of an incident.
+ Further details of specific enforcement policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+ available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+ [homepage]: https://www.contributor-covenant.org
+
+ For answers to common questions about this code of conduct, see
+ https://www.contributor-covenant.org/faq
@@ -0,0 +1,21 @@
+ # Contributing
+
+ ## Consult
+
+ Before you make a contribution, reach out to michael.telford@live.com about what changes need to be made. Otherwise, your time spent might be wasted. Once you're clear on what needs to be done, follow the technical steps below.
+
+ ## Technical Steps
+
+ - Fork the repository
+ - Create a branch
+ - Write some tests (which fail)
+ - Write some code
+ - Re-run the tests (which now hopefully pass)
+ - Push your branch to your `origin` remote
+ - Open a GitHub Pull Request (with the target branch being wgit's `origin/master`)
+ - Apply any requested changes
+ - Wait for your PR to be merged
+
+ ## Thanks
+
+ Thanks in advance for your contribution.
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2016 - 2020 Michael Telford
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
@@ -0,0 +1,239 @@
+ # Wgit
+
+ [![Inline gem version](https://badge.fury.io/rb/wgit.svg)](https://rubygems.org/gems/wgit)
+ [![Inline downloads](https://img.shields.io/gem/dt/wgit)](https://rubygems.org/gems/wgit)
+ [![Inline build](https://travis-ci.org/michaeltelford/wgit.svg?branch=master)](https://travis-ci.org/michaeltelford/wgit)
+ [![Inline docs](http://inch-ci.org/github/michaeltelford/wgit.svg?branch=master)](http://inch-ci.org/github/michaeltelford/wgit)
+ [![Inline code quality](https://api.codacy.com/project/badge/Grade/d5a0de62e78b460997cb8ce1127cea9e)](https://www.codacy.com/app/michaeltelford/wgit?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=michaeltelford/wgit&amp;utm_campaign=Badge_Grade)
+
+ ---
+
+ Wgit is an HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
+
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+
+ - URL parsing
+ - Document content extraction (data mining)
+ - Crawling entire websites (statistical analysis)
+
+ Wgit provides a high-level, easy-to-use API and DSL that you can use in your own applications and scripts.
+
+ Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.
+
+ ## Table Of Contents
+
+ 1. [Usage](#Usage)
+ 2. [Why Wgit?](#Why-Wgit)
+ 3. [Why Not Wgit?](#Why-Not-Wgit)
+ 4. [Installation](#Installation)
+ 5. [Documentation](#Documentation)
+ 6. [Executable](#Executable)
+ 7. [License](#License)
+ 8. [Contributing](#Contributing)
+ 9. [Development](#Development)
+
+ ## Usage
+
+ Let's crawl a [quotes website](http://quotes.toscrape.com/) extracting its *quotes* and *authors* using the Wgit DSL:
+
+ ```ruby
+ require 'wgit'
+ require 'json'
+
+ include Wgit::DSL
+
+ start 'http://quotes.toscrape.com/tag/humor/'
+ follow "//li[@class='next']/a/@href"
+
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
+
+ quotes = []
+
+ crawl_site do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end
+
+ puts JSON.generate(quotes)
+ ```
+
+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes, however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
+
+ ```ruby
+ require 'wgit'
+ require 'json'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+ quotes = []
+
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
+
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end
+
+ puts JSON.generate(quotes)
+ ```
+
+ But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
+
+ ```ruby
+ require 'wgit'
+
+ include Wgit::DSL
+
+ Wgit.logger.level = Logger::WARN
+
+ connection_string 'mongodb://user:password@localhost/crawler'
+ clear_db!
+
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
+
+ start 'http://quotes.toscrape.com/tag/humor/'
+ follow "//li[@class='next']/a/@href"
+
+ index_site
+ search 'prejudice'
+ ```
+
+ The `search` call (on the last line) will return and output the results:
+
+ ```text
+ Quotes to Scrape
+ “I am free of all prejudice. I hate everyone equally. ”
+ http://quotes.toscrape.com/tag/humor/page/2/
+ ```
+
+ Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+
+ ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
+
+ ## Why Wgit?
+
+ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there, so why use Wgit?
+
+ - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
+ - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit, by default, crawls the entire site by extracting its internal links pointing to the same host.
+ - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
+ - Wgit can index (crawl and store) HTML to a database, making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
+
+ ```ruby
+ require 'wgit'
+
+ ENV['WGIT_CONNECTION_STRING'] = 'mongodb://user:password@localhost/crawler'
+
+ wiki = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')
+
+ # Only index the most recent of each wiki article, ignoring the rest of GitHub.
+ opts = {
+   allow_paths: 'michaeltelford/wgit/wiki/*',
+   disallow_paths: 'michaeltelford/wgit/wiki/*/_history'
+ }
+
+ indexer = Wgit::Indexer.new
+ indexer.index_site(wiki, **opts)
+ ```
+
150
+ ## Why Not Wgit?
151
+
152
+ So why might you not use Wgit, I hear you ask?
153
+
154
+ - Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
155
+ - Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider a pure browser-based crawler instead.
156
+ - Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
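
The worker-thread idea in the last bullet can be sketched generically. In real code the loop pushing onto `queue` would be `Wgit::Crawler#crawl_site` yielding each crawled `Wgit::Document`; plain strings stand in for documents here:

```ruby
# A generic producer/consumer sketch of handing crawled documents to a
# worker thread. Strings stand in for the Wgit::Documents a crawl yields.
queue   = Queue.new
results = Queue.new

worker = Thread.new do
  # Process documents until the nil sentinel arrives.
  while (doc = queue.pop)
    results << doc.upcase # stand-in for real per-document processing
  end
end

%w[page_1 page_2 page_3].each { |doc| queue << doc }
queue << nil # tell the worker there's nothing left
worker.join

processed = Array.new(results.size) { results.pop }
```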
+
+ ## Installation
+
+ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby implementations.
+
+ Currently, the required MRI Ruby version is:
+
+ `~> 2.5` a.k.a. `>= 2.5 && < 3`
+
+ ### Using Bundler
+
+ Add this line to your application's `Gemfile`:
+
+ ```ruby
+ gem 'wgit'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ ### Using RubyGems
+
+     $ gem install wgit
+
+ Verify the install by using the executable (to start a REPL session):
+
+     $ wgit
+
+ ## Documentation
+
+ - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
+ - [Wiki](https://github.com/michaeltelford/wgit/wiki)
+ - [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+ - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
+
+ ## Executable
+
+ Installing the Wgit gem adds a `wgit` executable to your `$PATH`. The executable launches an interactive REPL session with the Wgit gem already loaded, making it super easy to index and search from the command line without the need for scripts.
+
+ The `wgit` executable does the following things (in order):
+
+ 1. `require 'wgit'`
+ 2. `eval`'s a `.wgit.rb` file (if one exists in either the local or home directory, whichever is found first)
+ 3. Starts an interactive shell (using `pry` if it's installed, or `irb` if not)
+
+ The `.wgit.rb` file can be used to seed fixture data or define helper functions for the session. For example, you could define a function which indexes your website for quick and easy searching every time you start a new session.
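
A seed file along these lines would do it (the `index_blog` helper name and the URL are made up; the `Wgit::Indexer` call mirrors the indexing script earlier in this README):

```ruby
# A hypothetical .wgit.rb seed file; the wgit executable has already
# required the gem by the time this file is eval'd.
ENV['WGIT_CONNECTION_STRING'] ||= 'mongodb://user:password@localhost/crawler'

# Made-up helper: call `index_blog` from the REPL to (re)index your site.
def index_blog
  Wgit::Indexer.new.index_site('https://example.com/blog')
end
```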
+
+ ## License
+
+ The gem is available as open source under the terms of the MIT License. See [LICENSE.txt](https://github.com/michaeltelford/wgit/blob/master/LICENSE.txt) for more details.
+
+ ## Contributing
+
+ Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, first checking that it doesn't already exist.
+
+ The current road map is roughly listed in the [Road Map](https://github.com/michaeltelford/wgit/wiki/Road-Map) wiki page. Maybe your feature request is already there?
+
+ Before you consider making a contribution, check out [CONTRIBUTING.md](https://github.com/michaeltelford/wgit/blob/master/CONTRIBUTING.md).
+
+ ## Development
+
+ After checking out the repo, run the following commands:
+
+ 1. `gem install bundler toys`
+ 2. `bundle install --jobs=3`
+ 3. `toys setup`
+
+ And you're good to go!
+
+ ### Tooling
+
+ Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
+
+ Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.
+
+ To generate code documentation locally, run `toys yardoc`. To browse the docs in a browser, run `toys yardoc --serve`. You can also use the `yri` command line tool e.g. `yri Wgit::Crawler#crawl_site` etc.
+
+ To install this gem onto your local machine, run `toys install` and follow the prompt.
+
+ ### Console
+
+ You can run `toys console` for an interactive shell using the `./bin/wgit` executable. The `toys setup` task will have created an `.env` and `.wgit.rb` file which get loaded by the executable. You can use the contents of this [gist](https://gist.github.com/michaeltelford/b90d5e062da383be503ca2c3a16e9164) to turn the executable into a development console. It defines some useful functions and fixtures, and connects to the database etc. Don't forget to set the `WGIT_CONNECTION_STRING` in the `.env` file.
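
A minimal `.env` might contain nothing more than the connection string (the credentials below are placeholders, matching the examples above):

```
# .env - read when the executable starts (values are placeholders)
WGIT_CONNECTION_STRING=mongodb://user:password@localhost/crawler
```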