wgit 0.8.0 → 0.10.2
- checksums.yaml +4 -4
- data/.yardopts +1 -1
- data/CHANGELOG.md +68 -2
- data/LICENSE.txt +1 -1
- data/README.md +114 -326
- data/bin/wgit +9 -5
- data/lib/wgit/assertable.rb +3 -3
- data/lib/wgit/base.rb +39 -0
- data/lib/wgit/crawler.rb +206 -76
- data/lib/wgit/database/database.rb +309 -134
- data/lib/wgit/database/model.rb +10 -3
- data/lib/wgit/document.rb +145 -95
- data/lib/wgit/{document_extensions.rb → document_extractors.rb} +11 -11
- data/lib/wgit/dsl.rb +324 -0
- data/lib/wgit/indexer.rb +66 -163
- data/lib/wgit/response.rb +5 -2
- data/lib/wgit/url.rb +177 -63
- data/lib/wgit/utils.rb +32 -20
- data/lib/wgit/version.rb +2 -1
- data/lib/wgit.rb +3 -1
- metadata +34 -19
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c712169a7a2cf41bebb38b2c798aa883c9f685e4d3671929ab7a3ead55da0134
+  data.tar.gz: 0fb789510761c01f3d0459415b653f4eb896d194e31c610300ef94141dfab63a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e175d2d23877fbb5aefc89aca1f98c31d9fee0ba78ec810bb3cc2260fdd97208c35ba9840b8cdc92ea49bdd8f6fd198910443fd61efdd563767048206f5a9788
+  data.tar.gz: 7dad82e66a10228ba0bce8e7d4cf8d17bdab734f90315813854c0c7e17807fa4c7f500711290a3e5eb28c431bab1c22ba4d25d7b585c1516c8f5470636bf37b4
data/.yardopts
CHANGED
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,72 @@
 - ...
 ---
 
+## v0.10.2
+### Added
+- `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
+## v0.10.1
+### Added
+- Support for Ruby 3.
+### Changed/Removed
+- Removed support for Ruby 2.5 (as it's too old).
+### Fixed
+- ...
+---
+
+## v0.10.0
+### Added
+- `Wgit::Url#scheme_relative?` method.
+### Changed/Removed
+- Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+### Fixed
+- [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URLs.
+---
+
+## v0.9.0
+This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
+### Added
+- `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+- `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
+- `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+- `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+- `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+- `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+- `Wgit::Database#clear_db!` alias.
+- `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
+- `Wgit::Document#extract` method to perform one off content extractions.
+- `Wgit::Indexer#index_urls` method which can index several urls in one call.
+- `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+### Changed/Removed
+- Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+- Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+- Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+- Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+- Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+- Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+- Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
+- Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
+- Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
+- Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+- Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+- Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+- Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+- Breaking change: Refactored `Wgit::Url#relative?` now takes `:origin` instead of `:base` which takes the port into account. This has a knock on effect for some other methods too - check the docs if you're getting parameter errors.
+- Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+- Updated `Utils.printf_search_results` to return the number of results.
+- Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+- Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+- Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+### Fixed
+- [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before it simply discarded the document.
+- `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+---
+
 ## v0.8.0
 ### Added
 - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
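The v0.10.0 entries in the hunk above concern scheme-relative URLs, i.e. URLs that begin with `//` and so carry a host but no scheme. The sketch below illustrates the idea with Ruby's standard `URI` library only. It is not Wgit's implementation: the two top-level helpers are hypothetical and merely borrow the names from the changelog, with the scheme passed as a defaulted positional parameter as the entry describes.

```ruby
require 'uri'

# Hypothetical helper (not part of Wgit): true when the URL has a host
# but no scheme, e.g. "//example.com/about".
def scheme_relative?(url)
  uri = URI.parse(url)
  uri.scheme.nil? && !uri.host.nil?
end

# Hypothetical helper: prefixes a scheme onto a scheme-relative URL and
# returns any other URL unchanged. The scheme is a defaulted positional
# parameter rather than a named one.
def prefix_scheme(url, scheme = :http)
  scheme_relative?(url) ? "#{scheme}:#{url}" : url
end

puts scheme_relative?('//example.com/about')
puts prefix_scheme('//example.com/about', :https)
```

A path such as `/about` has neither scheme nor host, so it is not treated as scheme-relative here.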
@@ -73,7 +139,7 @@
 - `Wgit::Response` class containing adapter agnostic HTTP response logic.
 ### Changed/Removed
 - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
-- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/
+- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
 - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
 - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
 - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
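The v0.8.0 context lines above mention that fragment and query values no longer include their `#` and `?` prefixes. Ruby's standard `URI` library follows the same convention, so it can serve as a quick illustration. This sketch uses `URI` only, not Wgit or `Addressable::URI`, and the example URL is made up.

```ruby
require 'uri'

# An arbitrary example URL with both a query string and a fragment.
uri = URI.parse('http://example.com/search?q=ruby#results')

query    = uri.query     # the part after "?", without the "?" itself
fragment = uri.fragment  # the part after "#", without the "#" itself

puts query
puts fragment
```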
@@ -121,7 +187,7 @@
 ---
 
 ## v0.2.0
-This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/
+This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
 ### Added
 - `Wgit::Url#absolute?` method.
 - `Wgit::Url#relative? base: url` support.
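The v0.2.0 note above describes optional parameters being turned into named parameters. The sketch below shows what that kind of signature change looks like in plain Ruby. The method names and the `follow_redirects` option are hypothetical and are not taken from Wgit.

```ruby
# Old style (hypothetical): an optional positional parameter.
def fetch_old(url, follow_redirects = true)
  { url: url, follow_redirects: follow_redirects }
end

# New style (hypothetical): the same option as a named parameter.
def fetch_new(url, follow_redirects: true)
  { url: url, follow_redirects: follow_redirects }
end

# Callers must now name the option explicitly.
old_result = fetch_old('http://example.com', false)
new_result = fetch_new('http://example.com', follow_redirects: false)

puts old_result == new_result
```

Passing the option positionally to the new-style method raises an `ArgumentError`, which is why such a change is breaking for existing callers.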
data/LICENSE.txt
CHANGED
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2016 -
+Copyright (c) 2016 - 2020 Michael Telford
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal