wgit 0.8.0 → 0.10.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: dc2ea1de7219c66eb70ed7b4e97bb8da7c169eb68257df7d7d1bfdaf6a5ed4d6
- data.tar.gz: f88afdd7477812c3b9fdbde8e1125b950aec3fe3fabef5a20e0c16a9e26a767b
+ metadata.gz: c712169a7a2cf41bebb38b2c798aa883c9f685e4d3671929ab7a3ead55da0134
+ data.tar.gz: 0fb789510761c01f3d0459415b653f4eb896d194e31c610300ef94141dfab63a
  SHA512:
- metadata.gz: e5fe86fee44e4c494d936d747719d856c240a0129db0692afa44ad9855340929a8b90cc177d92d775bc7cc15af99093078cf6a8fc8b9bbd9bc8965866d343914
- data.tar.gz: c87efb4c5dcfb8795d62ab41f7e8d2bc206e7f8407707c9269c0fc86fbdcd14b7269d04083ec756d3b77a99300639469e88c639ee125ddee0984c3957e7cfc7b
+ metadata.gz: e175d2d23877fbb5aefc89aca1f98c31d9fee0ba78ec810bb3cc2260fdd97208c35ba9840b8cdc92ea49bdd8f6fd198910443fd61efdd563767048206f5a9788
+ data.tar.gz: 7dad82e66a10228ba0bce8e7d4cf8d17bdab734f90315813854c0c7e17807fa4c7f500711290a3e5eb28c431bab1c22ba4d25d7b585c1516c8f5470636bf37b4
data/.yardopts CHANGED
@@ -1,5 +1,5 @@
  --readme README.md
- --title 'Wgit API Documentation'
+ --title 'Wgit Gem Documentation'
  --charset utf-8
  --markup markdown
  --output .doc
data/CHANGELOG.md CHANGED
@@ -9,6 +9,72 @@
  - ...
  ---
 
+ ## v0.10.2
+ ### Added
+ - `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.1
+ ### Added
+ - Support for Ruby 3.
+ ### Changed/Removed
+ - Removed support for Ruby 2.5 (as it's too old).
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.0
+ ### Added
+ - `Wgit::Url#scheme_relative?` method.
+ ### Changed/Removed
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+ ### Fixed
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
+ ---
+
+ ## v0.9.0
+ This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
+ ### Added
+ - `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+ - `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
+ - `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+ - `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+ - `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, it's xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+ - `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+ - `Wgit::Database#clear_db!` alias.
+ - `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
+ - `Wgit::Document#extract` method to perform one off content extractions.
+ - `Wgit::Indexer#index_urls` method which can index several urls in one call.
+ - `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+ ### Changed/Removed
+ - Breaking change: Moved all `Wgit.index*` convienence methods into `Wgit::DSL`.
+ - Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+ - Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+ - Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+ - Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+ - Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+ - Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
+ - Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
+ - Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
+ - Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+ - Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+ - Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+ - Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+ - Breaking change: Refactored `Wgit::Url#relative?` now takes `:origin` instead of `:base` which takes the port into account. This has a knock on effect for some other methods too - check the docs if you're getting parameter errors.
+ - Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+ - Updated `Utils.printf_search_results` to return the number of results.
+ - Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+ - Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+ - Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+ ### Fixed
+ - [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before it simply disgarded the document.
+ - `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+ ---
+
  ## v0.8.0
  ### Added
  - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
@@ -73,7 +139,7 @@
  - `Wgit::Response` class containing adapter agnostic HTTP response logic.
  ### Changed/Removed
  - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
  - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
  - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
  - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
@@ -121,7 +187,7 @@
  ---
 
  ## v0.2.0
- This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
  ### Added
  - `Wgit::Url#absolute?` method.
  - `Wgit::Url#relative? base: url` support.
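
Note on the v0.10.0 entries in the CHANGELOG.md diff above: they add `Wgit::Url#scheme_relative?` and fix a scheme-relative URL bug. As a rough illustration of what a scheme-relative URL is, the sketch below uses only Ruby's standard `uri` library. It is not Wgit's implementation. The helpers `scheme_relative?` and `resolve` are hypothetical names defined for this example, and the URLs are made up.

```ruby
require "uri"

# Hypothetical helper (not Wgit's API): a scheme-relative link starts with
# "//" and names a host but no scheme, e.g. "//example.com/about".
def scheme_relative?(link)
  link.start_with?("//")
end

# Hypothetical helper: resolve a link against the page it was found on.
# URI.join follows RFC 3986, so a scheme-relative link inherits the
# scheme of the base URL.
def resolve(base, link)
  URI.join(base, link).to_s
end

link = "//example.com/about"
puts scheme_relative?(link)                          # true
puts resolve("https://site.test/index.html", link)   # https://example.com/about
```

A crawler that treats such links as ordinary relative paths would resolve them against the wrong host, which appears to be the kind of problem the linked issue describes.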
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)
 
- Copyright (c) 2016 - 2019 Michael Telford
+ Copyright (c) 2016 - 2020 Michael Telford
 
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal