RubyGems - broken_link_finder - Versions diffs - 0.11.1 → 0.12.0 - Mend

broken_link_finder 0.11.1 → 0.12.0

Files changed (11)

checksums.yaml +4 -4
data/CHANGELOG.md +10 -0
data/Gemfile.lock +20 -18
data/README.md +41 -10
data/broken_link_finder.gemspec +1 -1
data/exe/broken_link_finder +2 -0
data/lib/broken_link_finder.rb +2 -1
data/lib/broken_link_finder/version.rb +1 -1
data/lib/broken_link_finder/wgit_extensions.rb +2 -2
data/lib/broken_link_finder/xpath.rb +14 -0
metadata +7 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 42e88495f7e7742db433223408b4a380c1d48e98a5a43e6da5303d3e7b024454
-  data.tar.gz: eae7fc953f0d8aa1bb1f9d5b53183cd68a15a9f83ab341f51023744b2d148063
+  metadata.gz: 24ca9c7a6071b07f5ab3132c9c79c4628570c9c3e157b77a27a05cdc0578ac6e
+  data.tar.gz: 6668eb430c8296e1439f56c242e7e08a27733605d724ec1c5cfa638dcfaa8b52
 SHA512:
-  metadata.gz: 4496db994bfba83deeb14a1b870f43e2cfd2afa94f30b6596ee610f23103b55ae0d84a6443a3204b02ed8875c0daf0d8e9c565aaebd21173d5c4353509dac3c8
-  data.tar.gz: 2d70ee94d7128e6e212bc385e1045fd465c121f58b9a0d036d392ae1cbb5cd9ef5ea47e29eda85b6f17a0b0f5547902ca818967b3ffb4ad87c7d0b271da5323a
+  metadata.gz: 1d1cdc47ade4651b8bc2df01212364ba938ee73269bf53e7278519ecd374247291c932abfa73a031973403ed55d360bc9d14b5c60ba312aca4b32837b5064294
+  data.tar.gz: f56308da4b9d7a4a39afd43808f77d2b6f2fbbf00f17502d2d889de504bcc82ee1858fb673333b11693a65ad73a4f5fb65a97b15955443e8268b1e0ab08b4e51
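
Reviewer note: the digests above can be checked locally once the `.gem` archive has been unpacked. The following is a minimal sketch, assuming `data.tar.gz` has already been extracted from the downloaded gem; the helper name `sha256_matches?` is hypothetical and not part of the gem.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA256 hex digest of the file at
# `path` equals `expected_hex` (surrounding whitespace and case are ignored).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end
```

For example, `sha256_matches?('data.tar.gz', '6668eb430c8296e1439f56c242e7e08a27733605d724ec1c5cfa638dcfaa8b52')` should return true for the 0.12.0 archive if the file is intact. The same approach applies to the SHA512 entries using `Digest::SHA512`.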

data/CHANGELOG.md CHANGED

@@ -9,6 +9,16 @@
 - ...
 ---
+## v0.12.0
+### Added
+- `BrokenLinkFinder::link_xpath` and `link_xpath=` methods so you can customise how links are extracted from each crawled page using the API.
+- An `--xpath` (or just `-x`) command line flag so you can customise how links are extracted when using the command line.
+### Changed/Removed
+- Changed the default way in which links are extracted from a page. Previously any element with a `href` or `src` attribute was extracted and checked; now only those links inside the `<body>` are extracted and checked, ignoring the `<head>` section entirely. You can change this behaviour back with: `BrokenLinkFinder::link_xpath = '//*/@href | //*/@src'` before you perform a crawl. Alternatively, if using the command line, use the `--xpath '//*/@href | //*/@src'` option (quoted so the shell does not treat `|` as a pipe).
+### Fixed
+- [Scheme relative bug](https://github.com/michaeltelford/broken_link_finder/issues/16) by upgrading to `wgit v0.10.0`.
+---
 ## v0.11.1
 ### Added
 - ...

data/Gemfile.lock CHANGED

@@ -1,59 +1,61 @@
 PATH
   remote: .
   specs:
-    broken_link_finder (0.11.1)
+    broken_link_finder (0.12.0)
       thor (~> 0.20)
       thread (~> 0.2)
-      wgit (~> 0.9)
+      wgit (~> 0.10)
 GEM
   remote: https://rubygems.org/
   specs:
     addressable (2.7.0)
       public_suffix (>= 2.0.2, < 5.0)
-    bson (4.10.0)
+    bson (4.12.0)
     byebug (11.1.3)
     cliver (0.3.2)
     coderay (1.1.3)
-    concurrent-ruby (1.1.6)
-    crack (0.4.3)
-      safe_yaml (~> 1.0.0)
+    concurrent-ruby (1.1.8)
+    crack (0.4.5)
+      rexml
     ethon (0.12.0)
       ffi (>= 1.3.0)
-    ferrum (0.9)
+    ferrum (0.11)
       addressable (~> 2.5)
       cliver (~> 0.3)
       concurrent-ruby (~> 1.1)
       websocket-driver (>= 0.6, < 0.8)
-    ffi (1.13.1)
+    ffi (1.15.0)
     hashdiff (1.0.1)
     maxitest (3.6.0)
       minitest (>= 5.0.0, < 5.14.0)
     method_source (1.0.0)
-    mini_portile2 (2.4.0)
+    mini_portile2 (2.5.0)
     minitest (5.13.0)
-    mongo (2.13.0)
+    mongo (2.14.0)
       bson (>= 4.8.2, < 5.0.0)
-    nokogiri (1.10.10)
-      mini_portile2 (~> 2.4.0)
-    pry (0.13.1)
+    nokogiri (1.11.2)
+      mini_portile2 (~> 2.5.0)
+      racc (~> 1.4)
+    pry (0.14.0)
       coderay (~> 1.1)
       method_source (~> 1.0)
-    public_suffix (4.0.5)
-    rake (13.0.1)
-    safe_yaml (1.0.5)
+    public_suffix (4.0.6)
+    racc (1.5.2)
+    rake (13.0.3)
+    rexml (3.2.4)
     thor (0.20.3)
     thread (0.2.2)
     typhoeus (1.4.0)
       ethon (>= 0.9.0)
-    webmock (3.8.3)
+    webmock (3.12.2)
       addressable (>= 2.3.6)
       crack (>= 0.3.2)
       hashdiff (>= 0.4.0, < 2.0.0)
     websocket-driver (0.7.3)
       websocket-extensions (>= 0.1.0)
     websocket-extensions (0.1.5)
-    wgit (0.9.0)
+    wgit (0.10.0)
       addressable (~> 2.6)
       ferrum (~> 0.8)
       mongo (~> 2.9)

data/README.md CHANGED

@@ -8,7 +8,9 @@ Broken Link Finder is multi-threaded and uses `libcurl` under the hood, it's fas
 ## How It Works
-Any HTML page element with a `href` or `src` attribute is considered a link. For each link on a given page, any of the following conditions constitutes that the link is broken:
+Any HTML element within `<body>` with a `href` or `src` attribute is considered a link (this is [configurable](#Link-Extraction) however).
+For each link on a given page, any of the following conditions constitutes that the link is broken:
 - An empty HTML response body is returned.
 - A response status code of `404 Not Found` is returned.
@@ -29,27 +31,27 @@ With that said, the usual array of HTTP URL features are supported including anc
 ## Installation
-Add this line to your application's Gemfile:
+Only MRI Ruby is tested and supported, but `broken_link_finder` may work with other Ruby implementations.
-```ruby
-gem 'broken_link_finder'
-```
+Currently, the required MRI Ruby version is:
-And then execute:
+`~> 2.5` (a.k.a.) `>= 2.5 && < 3`
-    $ bundle
+### Using Bundler
-Or install it yourself as:
+    $ bundle add broken_link_finder
+### Using RubyGems
     $ gem install broken_link_finder
-Finally, verify the installation with:
+### Verify
     $ broken_link_finder version
 ## Usage
-You can check for broken links via the library or executable.
+You can check for broken links via the executable or library.
 ### Executable
@@ -118,6 +120,35 @@ ftp://server.com
 You can provide the `--html` flag if you'd prefer a HTML based report.
+## Link Extraction
+You can customise the XPath used to extract links from each crawled page. This can be done via the executable or library.
+### Executable
+Add the `--xpath` (or `-x`) flag to the crawl command e.g.
+    $ broken_link_finder crawl http://txti.es -x //img/@src
+### Library
+Set the desired XPath using the accessor methods provided:
+> main.rb
+```ruby
+require 'broken_link_finder'
+# Set your desired xpath before crawling...
+BrokenLinkFinder::link_xpath = '//img/@src'
+# Now crawl as normal and only your custom targeted links will be checked.
+BrokenLinkFinder.new.crawl_page 'http://txti.es'
+# Go back to using the default provided xpath as needed.
+BrokenLinkFinder::link_xpath = BrokenLinkFinder::DEFAULT_LINK_XPATH
+```
 ## Contributing
 Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken-link-finder). Just raise an issue.
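
Reviewer note: to see what the new default changes in practice, the sketch below compares the two XPath expressions from this diff against a small page. It is an illustration only: it uses the `rexml` library rather than the gem's own crawler, the sample page and the `extract_links` helper are hypothetical, and each branch of the union is evaluated separately to keep the sketch simple.

```ruby
require 'rexml/document'

# XPath values copied from the diff: the new default and the pre-0.12.0 one.
DEFAULT_LINK_XPATH = '/html/body//*/@href | /html/body//*/@src'
LEGACY_LINK_XPATH  = '//*/@href | //*/@src'

# Hypothetical, well-formed page used only for this illustration.
SAMPLE_PAGE = <<~HTML
  <html>
    <head><link href="/style.css"/></head>
    <body>
      <a href="/about">About</a>
      <img src="/logo.png"/>
    </body>
  </html>
HTML

# Hypothetical helper: returns the attribute values matched by `xpath`.
# The naive split on '|' is fine for these two expressions but would break
# on an XPath that uses '|' inside a predicate.
def extract_links(markup, xpath)
  doc = REXML::Document.new(markup)
  xpath.split('|').flat_map do |branch|
    REXML::XPath.match(doc, branch.strip).map(&:value)
  end
end
```

With the default expression only the `<body>` links (`/about` and `/logo.png`) are returned; the legacy expression also picks up `/style.css` from `<head>`.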

data/broken_link_finder.gemspec CHANGED

@@ -49,5 +49,5 @@ Gem::Specification.new do |spec|
   spec.add_runtime_dependency 'thor', '~> 0.20'
   spec.add_runtime_dependency 'thread', '~> 0.2'
-  spec.add_runtime_dependency 'wgit', '~> 0.9'
+  spec.add_runtime_dependency 'wgit', '~> 0.10'
 end
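
Reviewer note: the old `~> 0.9` constraint already admitted `wgit` 0.10.x, so the effect of this change is to raise the minimum version rather than to widen the range. A small sketch using RubyGems' own requirement classes shows this; the `satisfies?` helper is a hypothetical name introduced for the illustration.

```ruby
require 'rubygems'

# Hypothetical helper: does `version` satisfy the RubyGems `constraint`?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

satisfies?('~> 0.9', '0.10.0')  # true: the old constraint allowed 0.10.x
satisfies?('~> 0.10', '0.9.0')  # false: the new constraint excludes 0.9.x
satisfies?('~> 0.10', '0.10.0') # true
satisfies?('~> 0.10', '1.0.0')  # false: "~> 0.10" means >= 0.10 and < 1.0
```

The same reading applies to the README's Ruby requirement, where `~> 2.5` is equivalent to `>= 2.5` and `< 3`.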

data/exe/broken_link_finder CHANGED

@@ -9,6 +9,7 @@ class BrokenLinkFinderCLI < Thor
   desc 'crawl [URL]', 'Find broken links at the URL'
   option :recursive, type: :boolean, aliases: [:r], default: false, desc: 'Crawl the entire site.'
   option :threads, type: :numeric, aliases: [:t], default: BrokenLinkFinder::DEFAULT_MAX_THREADS, desc: 'Max number of threads to use when crawling recursively; 1 thread per web page.'
+  option :xpath, type: :string, aliases: [:x], default: BrokenLinkFinder::DEFAULT_LINK_XPATH
   option :html, type: :boolean, aliases: [:h], default: false, desc: 'Produce a HTML report (instead of text)'
   option :sort_by_link, type: :boolean, aliases: [:l], default: false, desc: 'Makes report more concise if there are more pages crawled than broken links found. Use with -r on medium/large sites.'
   option :verbose, type: :boolean, aliases: [:v], default: false, desc: 'Display all ignored links.'
@@ -22,6 +23,7 @@ class BrokenLinkFinderCLI < Thor
     broken_verbose  = !options[:concise]
     ignored_verbose = options[:verbose]
+    BrokenLinkFinder.link_xpath = options[:xpath]
     finder = BrokenLinkFinder::Finder.new(sort: sort_by, max_threads: max_threads)
     options[:recursive] ? finder.crawl_site(url) : finder.crawl_page(url)
     finder.report(
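
Reviewer note: the new option follows a simple pattern, a string flag whose default is the library constant and whose value is handed to the library before crawling. The sketch below reproduces that pattern with the standard library's `OptionParser` instead of Thor, so it runs without the gem installed; `parse_crawl_args` is a hypothetical helper and the constant is copied from `xpath.rb` in this diff.

```ruby
require 'optparse'

# Copied from lib/broken_link_finder/xpath.rb in this diff.
DEFAULT_LINK_XPATH = '/html/body//*/@href | /html/body//*/@src'

# Hypothetical helper: parses a crawl-style argument list and returns
# [url, options]. The XPath falls back to the default when -x is absent.
def parse_crawl_args(argv)
  options = { xpath: DEFAULT_LINK_XPATH }
  parser = OptionParser.new do |opts|
    opts.on('-x', '--xpath XPATH', String, 'XPath used to extract links') do |value|
      options[:xpath] = value
    end
  end
  # permute collects the non-option arguments regardless of their position.
  url, = parser.permute(argv)
  [url, options]
end
```

An XPath containing spaces or `|` has to reach the program as a single argument, which is why it should be quoted in the shell, for example `-x '//*/@href | //*/@src'`.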

data/lib/broken_link_finder.rb CHANGED

@@ -5,8 +5,9 @@ require 'wgit/core_ext'
 require 'thread/pool'
 require 'set'
-require_relative './broken_link_finder/wgit_extensions'
 require_relative './broken_link_finder/version'
+require_relative './broken_link_finder/xpath'
+require_relative './broken_link_finder/wgit_extensions'
 require_relative './broken_link_finder/link_manager'
 require_relative './broken_link_finder/reporter/reporter'
 require_relative './broken_link_finder/reporter/text_reporter'

data/lib/broken_link_finder/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module BrokenLinkFinder
-  VERSION = '0.11.1'
+  VERSION = '0.12.0'
 end

data/lib/broken_link_finder/wgit_extensions.rb CHANGED

@@ -17,10 +17,10 @@ rescue StandardError
   nil
 end
-# We extract all the Document's links e.g. <a>, <img>, <script>, <link> etc.
+# Define a custom extractor for all page links we're interested in checking.
 Wgit::Document.define_extractor(
   :all_links,
-  '//*/@href | //*/@src', # Any element's href or src attribute URL.
+  lambda { BrokenLinkFinder::link_xpath },
   singleton: false,
   text_content_only: true
 ) do |links, doc|

data/lib/broken_link_finder/xpath.rb ADDED

@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+module BrokenLinkFinder
+  # Extract all the Document's <body> links e.g. <a>, <img>, <script> etc.
+  DEFAULT_LINK_XPATH = '/html/body//*/@href | /html/body//*/@src'
+  @link_xpath = DEFAULT_LINK_XPATH
+  class << self
+    # The xpath used to extract links from a crawled page.
+    # Can be overridden as required.
+    attr_accessor :link_xpath
+  end
+end
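
Reviewer note: the extractor in `wgit_extensions.rb` now receives a lambda rather than a string, which presumably lets it pick up a value assigned to `link_xpath` after the library has loaded. The sketch below shows the difference between reading the accessor once and reading it lazily. `LinkConfig` is a hypothetical stand-in that mirrors the module-level accessor added in this file; it does not involve Wgit.

```ruby
# Hypothetical stand-in mirroring the accessor pattern in xpath.rb.
module LinkConfig
  DEFAULT_LINK_XPATH = '/html/body//*/@href | /html/body//*/@src'

  # Instance variable on the module itself, exposed via the singleton class.
  @link_xpath = DEFAULT_LINK_XPATH

  class << self
    attr_accessor :link_xpath
  end
end

# Read once at load time: later assignments are not seen.
EAGER_XPATH = LinkConfig.link_xpath

# Read on every call: always reflects the current setting.
LAZY_XPATH = -> { LinkConfig.link_xpath }

LinkConfig.link_xpath = '//img/@src'
```

After the final assignment, `EAGER_XPATH` still holds the default expression while `LAZY_XPATH.call` returns `'//img/@src'`.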

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: broken_link_finder
 version: !ruby/object:Gem::Version
-  version: 0.11.1
+  version: 0.12.0
 platform: ruby
 authors:
 - Michael Telford
-autorequire:
+autorequire:
 bindir: exe
 cert_chain: []
-date: 2020-07-31 00:00:00.000000000 Z
+date: 2021-04-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -128,14 +128,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.9'
+        version: '0.10'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.9'
+        version: '0.10'
 description: Finds a website's broken links using the 'wgit' gem and reports back
   to you with a summary.
 email: michael.telford@live.com
@@ -165,6 +165,7 @@ files:
 - lib/broken_link_finder/reporter/text_reporter.rb
 - lib/broken_link_finder/version.rb
 - lib/broken_link_finder/wgit_extensions.rb
+- lib/broken_link_finder/xpath.rb
 - load.rb
 homepage: https://github.com/michaeltelford/broken-link-finder
 licenses:
@@ -191,7 +192,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubygems_version: 3.1.2
-signing_key:
+signing_key:
 specification_version: 4
 summary: Finds a website's broken links and reports back to you with a summary.
 test_files: []