wgit 0.8.0 → 0.10.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: dc2ea1de7219c66eb70ed7b4e97bb8da7c169eb68257df7d7d1bfdaf6a5ed4d6
- data.tar.gz: f88afdd7477812c3b9fdbde8e1125b950aec3fe3fabef5a20e0c16a9e26a767b
+ metadata.gz: c712169a7a2cf41bebb38b2c798aa883c9f685e4d3671929ab7a3ead55da0134
+ data.tar.gz: 0fb789510761c01f3d0459415b653f4eb896d194e31c610300ef94141dfab63a
  SHA512:
- metadata.gz: e5fe86fee44e4c494d936d747719d856c240a0129db0692afa44ad9855340929a8b90cc177d92d775bc7cc15af99093078cf6a8fc8b9bbd9bc8965866d343914
- data.tar.gz: c87efb4c5dcfb8795d62ab41f7e8d2bc206e7f8407707c9269c0fc86fbdcd14b7269d04083ec756d3b77a99300639469e88c639ee125ddee0984c3957e7cfc7b
+ metadata.gz: e175d2d23877fbb5aefc89aca1f98c31d9fee0ba78ec810bb3cc2260fdd97208c35ba9840b8cdc92ea49bdd8f6fd198910443fd61efdd563767048206f5a9788
+ data.tar.gz: 7dad82e66a10228ba0bce8e7d4cf8d17bdab734f90315813854c0c7e17807fa4c7f500711290a3e5eb28c431bab1c22ba4d25d7b585c1516c8f5470636bf37b4
data/.yardopts CHANGED
@@ -1,5 +1,5 @@
  --readme README.md
- --title 'Wgit API Documentation'
+ --title 'Wgit Gem Documentation'
  --charset utf-8
  --markup markdown
  --output .doc
data/CHANGELOG.md CHANGED
@@ -9,6 +9,72 @@
  - ...
  ---
 
+ ## v0.10.2
+ ### Added
+ - `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
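Editor's note: the lifecycle hooks named in the v0.10.2 entry resemble the common template-method pattern. The sketch below only illustrates that pattern in plain Ruby. It is not Wgit's implementation, and `HookedBase`, `run` and `LoggingJob` are hypothetical names that do not appear in the changelog.

```ruby
# Hypothetical stand-in for a base class with overridable lifecycle hooks.
# None of these names come from Wgit; they only illustrate the pattern.
class HookedBase
  # Calls setup, yields to the caller's block, then always calls teardown.
  def run
    setup
    yield self
  ensure
    teardown
  end

  # No-op defaults, so subclasses override only the hooks they need.
  def setup; end

  def teardown; end
end

# A subclass that records when each hook fires.
class LoggingJob < HookedBase
  attr_reader :events

  def initialize
    @events = []
  end

  def setup
    @events << :setup
  end

  def teardown
    @events << :teardown
  end
end

job = LoggingJob.new
job.run { |j| j.events << :work }
p job.events
```

Because the base class ships no-op hooks, a subclass that overrides neither method still runs without error.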
+ ## v0.10.1
+ ### Added
+ - Support for Ruby 3.
+ ### Changed/Removed
+ - Removed support for Ruby 2.5 (as it's too old).
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.0
+ ### Added
+ - `Wgit::Url#scheme_relative?` method.
+ ### Changed/Removed
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+ ### Fixed
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URLs.
+ ---
+
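Editor's note: a scheme-relative URL has a host but no scheme, for example `//example.com/about`. The sketch below shows one way to detect and resolve such URLs with Ruby's standard `uri` library. The helpers `scheme_relative_url?` and `add_scheme` are hypothetical and are not Wgit's methods; Wgit's own behaviour may differ. `add_scheme` takes the scheme as a defaulted positional parameter only to illustrate the signature style that the v0.10.0 entry describes.

```ruby
require 'uri'

# Hypothetical helpers, not part of Wgit. They mirror the ideas in the
# v0.10.0 entry using only Ruby's standard library.

# True when the URL has a host but no scheme, e.g. "//example.com/about".
def scheme_relative_url?(url)
  uri = URI.parse(url)
  uri.scheme.nil? && !uri.host.nil?
end

# Prefixes a scheme-relative URL with a scheme and returns other URLs as-is.
# The scheme is a defaulted positional parameter (assumed default: :http).
def add_scheme(url, scheme = :http)
  scheme_relative_url?(url) ? "#{scheme}:#{url}" : url
end

puts add_scheme('//example.com/about')
puts add_scheme('//example.com/about', :https)
```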
+ ## v0.9.0
+ This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
+ ### Added
+ - `Wgit::DSL` module providing a wrapper around the underlying classes and methods. Check out the `README` for example usage.
+ - `Wgit::Crawler#parse_javascript` which when set to `true` uses Chrome to parse a page's Javascript before returning the fully rendered HTML. This feature is disabled by default.
+ - `Wgit::Base` class to inherit from, acting as an alternative form of using the DSL.
+ - `Wgit::Utils.sanitize` which calls `.sanitize_*` underneath.
+ - `Wgit::Crawler#crawl_site` now has a `follow:` named param - if set, its xpath value is used to retrieve the next urls to crawl. Otherwise the `:default` is used (as it was before). Use this to override how the site is crawled.
+ - `Wgit::Database` methods: `#clear_urls`, `#clear_docs`, `#clear_db`, `#text_index`, `#text_index=`, `#create_collections`, `#create_unique_indexes`, `#docs`, `#get`, `#exists?`, `#delete`, `#upsert`.
+ - `Wgit::Database#clear_db!` alias.
+ - `Wgit::Document` methods: `#at_xpath`, `#at_css` - which call nokogiri underneath.
+ - `Wgit::Document#extract` method to perform one-off content extractions.
+ - `Wgit::Indexer#index_urls` method which can index several urls in one call.
+ - `Wgit::Url` methods: `#to_user`, `#to_password`, `#to_sub_domain`, `#to_port`, `#omit_origin`, `#index?`.
+ ### Changed/Removed
+ - Breaking change: Moved all `Wgit.index*` convenience methods into `Wgit::DSL`.
+ - Breaking change: Removed `Wgit::Url#normalise`, use `#normalize` instead.
+ - Breaking change: Removed `Wgit::Database#num_documents`, use `#num_docs` instead.
+ - Breaking change: Removed `Wgit::Database#length` and `#count`, use `#size` instead.
+ - Breaking change: Removed `Wgit::Database#document?`, use `#doc?` instead.
+ - Breaking change: Renamed `Wgit::Indexer#index_page` to `#index_url`.
+ - Breaking change: Renamed `Wgit::Url.parse_or_nil` to be `.parse?`.
+ - Breaking change: Renamed `Wgit::Utils.process_*` to be `.sanitize_*`.
+ - Breaking change: Renamed `Wgit::Utils.remove_non_bson_types` to be `Wgit::Model.select_bson_types`.
+ - Breaking change: Changed `Wgit::Indexer.index*` named param default from `insert_externals: true` to `false`. Explicitly set it to `true` for the old behaviour.
+ - Breaking change: Renamed `Wgit::Document.define_extension` to `define_extractor`. Same goes for `remove_extension -> remove_extractor` and `extensions -> extractors`. See the docs for more information.
+ - Breaking change: Renamed `Wgit::Document#doc` to `#parser`.
+ - Breaking change: Renamed `Wgit::Crawler#time_out` to `#timeout`. Same goes for the named param passed to `Wgit::Crawler.initialize`.
+ - Breaking change: Refactored `Wgit::Url#relative?`, which now takes `:origin` instead of `:base`; `:origin` takes the port into account. This has a knock-on effect for some other methods too - check the docs if you're getting parameter errors.
+ - Breaking change: Renamed `Wgit::Url#prefix_base` to `#make_absolute`.
+ - Updated `Utils.printf_search_results` to return the number of results.
+ - Updated `Wgit::Indexer.new` which can now be called without parameters - the first param (for a database) now defaults to `Wgit::Database.new` which works if `ENV['WGIT_CONNECTION_STRING']` is set.
+ - Updated `Wgit::Document.define_extractor` to define a setter method (as well as the usual getter method).
+ - Updated `Wgit::Document#search` to support a `Regexp` query (in addition to a String).
+ ### Fixed
+ - [Re-indexing bug](https://github.com/michaeltelford/wgit/issues/8) so that indexing content a 2nd time will update it in the database - before it simply discarded the document.
+ - `Wgit::Crawler#crawl_site` params `allow/disallow_paths` values can now start with a `/`.
+ ---
+
  ## v0.8.0
  ### Added
  - To the range of `Wgit::Document.text_elements`. Now (only and) all visible page text should be extracted into `Wgit::Document#text` successfully.
@@ -73,7 +139,7 @@
  - `Wgit::Response` class containing adapter agnostic HTTP response logic.
  ### Changed/Removed
  - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
  - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
  - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
  - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
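Editor's note: the convention described above, where the query and fragment values exclude their `?` and `#` delimiters, can be seen with Ruby's standard `uri` library. The sketch below uses the standard library as a stand-in rather than Wgit or `Addressable::URI`, and `url_parts` is a hypothetical helper.

```ruby
require 'uri'

# Hypothetical helper, not part of Wgit. It returns the query and fragment
# components as Ruby's standard URI library reports them.
def url_parts(url)
  uri = URI.parse(url)
  { query: uri.query, fragment: uri.fragment }
end

parts = url_parts('http://example.com/search?q=ruby#results')
puts parts[:query]    # the query without the leading "?"
puts parts[:fragment] # the fragment without the leading "#"
```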
@@ -121,7 +187,7 @@
  ---
 
  ## v0.2.0
- This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ This version of Wgit sees a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes is below, including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
  ### Added
  - `Wgit::Url#absolute?` method.
  - `Wgit::Url#relative? base: url` support.
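Editor's note: the v0.2.0 summary mentions optional parameters being turned into named parameters. The sketch below shows what that kind of signature change looks like in plain Ruby. Both methods and their parameters (`fetch_old`, `fetch_new`, `follow_redirects`, `timeout`) are hypothetical and are not taken from Wgit.

```ruby
# Hypothetical "before": optional positional parameters. To change the
# timeout, the caller must also pass every argument that precedes it.
def fetch_old(url, follow_redirects = true, timeout = 5)
  { url: url, follow_redirects: follow_redirects, timeout: timeout }
end

# Hypothetical "after": named (keyword) parameters. Callers name only the
# options they want to change, in any order.
def fetch_new(url, follow_redirects: true, timeout: 5)
  { url: url, follow_redirects: follow_redirects, timeout: timeout }
end

p fetch_old('http://example.com', true, 10)
p fetch_new('http://example.com', timeout: 10)
```

Call sites written against the positional form need updating after such a change, which is why the changelog lists these as breaking.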
data/LICENSE.txt CHANGED
@@ -1,6 +1,6 @@
  The MIT License (MIT)
 
- Copyright (c) 2016 - 2019 Michael Telford
+ Copyright (c) 2016 - 2020 Michael Telford
 
  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal