RubyGems - metainspector - Versions diffs - 5.6.0 → 5.10.1 - Mend

metainspector 5.6.0 → 5.10.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

checksums.yaml +5 -5
data/.travis.yml +3 -4
data/CHANGELOG.md +45 -16
data/README.md +12 -3
data/lib/meta_inspector/document.rb +10 -3
data/lib/meta_inspector/errors.rb +2 -0
data/lib/meta_inspector/parser.rb +3 -2
data/lib/meta_inspector/parsers/head_links.rb +21 -8
data/lib/meta_inspector/parsers/images.rb +6 -4
data/lib/meta_inspector/parsers/links.rb +2 -1
data/lib/meta_inspector/parsers/texts.rb +28 -0
data/lib/meta_inspector/request.rb +1 -1
data/lib/meta_inspector/url.rb +7 -5
data/lib/meta_inspector/version.rb +1 -1
data/meta_inspector.gemspec +18 -18
data/spec/document_spec.rb +9 -2
data/spec/fixtures/feeds.response +23 -0
data/spec/fixtures/guardian.co.uk.response +1 -1
data/spec/fixtures/headings.response +23 -0
data/spec/fixtures/relative_links_with_empty_base.response +22 -0
data/spec/meta_inspector/head_links_spec.rb +4 -1
data/spec/meta_inspector/images_spec.rb +6 -0
data/spec/meta_inspector/links_spec.rb +35 -11
data/spec/meta_inspector/texts_spec.rb +42 -0
data/spec/spec_helper.rb +3 -2
metadata +46 -47
data/spec/fixtures/iteh.at.response +0 -971
data/spec/fixtures/tea-tron.com.response +0 -957

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 55d08720cc3d7c19b8bb04540300d7a5abbedeb7
-  data.tar.gz: 005f6f7bc26e9d1b98ae850b1a232a4c60720489
+SHA256:
+  metadata.gz: 97b670ec8a7026383d659037318206d8262e17bbc35b0ec51f34609b6a6ebc95
+  data.tar.gz: 326230360c0199174e39bf495e00de74057a208e155ca0f4ec584f9e728f59bd
 SHA512:
-  metadata.gz: e6175e3fa5a08dc5382ba0e49710548a5371ab1d6868001c8bbc6beb44c4f0aea419550adab8f1abcf35dce714ad3f69abcd9ea1bb66cf2019bf1a9a633c49ec
-  data.tar.gz: 3d9216f15d36a5bb33b5dbe65b72de3a76e02b040b787d245c25eb12ee04ee7c1f8ad685b93ab34db6a078f68c619579eb1f6ba92131872c225f30b07be0bfb3
+  metadata.gz: 1e89cc17ea97453f74935883267851f1a2f0e1a9255ea5a1a259a850950d9633227a41f4ed06af0b43a7e1f1a2331368b274c7a715ea6779e3ba60e616e11aa7
+  data.tar.gz: 656a52071ada09f4ac45703f1a688ec85f17f30b5e89f9f1965f81478c21ad3feb651df7773caf5a447c07bea6ee78cb84505f53f5ab5ac689a847ccbdb5ee15

data/.travis.yml CHANGED

@@ -1,6 +1,5 @@
 script: "bundle exec rspec -b"
 rvm:
-- 2.2.9
-- 2.3.6
-- 2.4.3
-- 2.5.0
+- 2.5.8
+- 2.6.6
+- 2.7.1

data/CHANGELOG.md CHANGED

@@ -1,48 +1,77 @@
 # MetaInpector Changelog
+## [Changes in 5.10](https://github.com/jaimeiniesta/metainspector/compare/v5.9.0...v5.10.0)
+* Upgrade to Faraday 1.0.
+## [Changes in 5.9](https://github.com/jaimeiniesta/metainspector/compare/v5.8.0...v5.9.0)
+* Added #feeds method to retrieve all feeds of a page.
+* Adds deprecation warning on #feed method.
+## [Changes in 5.8](https://github.com/jaimeiniesta/metainspector/compare/v5.7.0...v5.8.0)
+* Added h1..h6 support.
+## [Changes in 5.7](https://github.com/jaimeiniesta/metainspector/compare/v5.6.0...v5.7.0)
+* Avoids normalizing image URLs. https://github.com/jaimeiniesta/metainspector/pull/241
+* Adds `NonHtmlErrorException` instead of `ParserError` https://github.com/jaimeiniesta/metainspector/pull/248
+## [Changes in 5.6](https://github.com/jaimeiniesta/metainspector/compare/v5.5.0...v5.6.0)
+* New feature: `:encoding` option for force encoding of a parsed document.
+* Improvement: make `best_title` and `best_author` work by order of preference, rather than length.
+## [Changes in 5.5](https://github.com/jaimeiniesta/metainspector/compare/v5.4.0...v5.5.0)
+* New feature: adds `author`, `best_author`.
+* Bugfix: adds presence validation for empty string on meta tag image values.
+* Improves spider and links checker examples.
+* Uses WebMock instead of FakeWeb in tests.
 ## [Changes in 5.4](https://github.com/jaimeiniesta/metainspector/compare/v5.3.0...v5.4.0)
-Supports Gzipped responses.
-Adds method `best_description` and makes `description` return just the meta description.
-Removes support for Ruby 2.0.0 and adds support for 2.4.0.
+* Supports Gzipped responses.
+* Adds method `best_description` and makes `description` return just the meta description.
+* Removes support for Ruby 2.0.0 and adds support for 2.4.0.
 ## [Changes in 5.3](https://github.com/jaimeiniesta/metainspector/compare/v5.2.0...v5.3.0)
-Returns secondary description if meta description is empty.
-Adds a custom timeout on top of the ones for Faraday, and sets defaults for timeouts.
-Eliminates possible NULL char in HTML which breaks nokogiri.
+* Returns secondary description if meta description is empty.
+* Adds a custom timeout on top of the ones for Faraday, and sets defaults for timeouts.
+* Eliminates possible NULL char in HTML which breaks nokogiri.
 ## [Changes in 5.2](https://github.com/jaimeiniesta/metainspector/compare/v5.1.0...v5.2.0)
-Removes the deprecated `html_content_only` option, and replaces it by `allow_non_html_content`, by default `false`.
+* Removes the deprecated `html_content_only` option, and replaces it by `allow_non_html_content`, by default `false`.
 ## [Changes in 5.1](https://github.com/jaimeiniesta/metainspector/compare/v5.0.0...v5.1.0)
-Deprecates the `html_content_only` option, and turns it on by default.
+* Deprecates the `html_content_only` option, and turns it on by default.
 ## [Changes in 5.0](https://github.com/jaimeiniesta/metainspector/compare/v4.7.1...v5.0.0)
-Removes the ExceptionLog, all exceptions are now encapsulated in our own exception classes and
+* Removes the ExceptionLog, all exceptions are now encapsulated in our own exception classes and
 always raised.
 ## [Changes in 4.7](https://github.com/jaimeiniesta/metainspector/compare/v4.6.0...v4.7.1)
-MetaInspector can be configured to use [Faraday::HttpCache](https://github.com/plataformatec/faraday-http-cache) to cache page responses. For that you should pass the `faraday_http_cache` option with at least the `:store` key, for example:
+* MetaInspector can be configured to use [Faraday::HttpCache](https://github.com/plataformatec/faraday-http-cache) to cache page responses. For that you should pass the `faraday_http_cache` option with at least the `:store` key, for example:
 ```ruby
 cache = ActiveSupport::Cache.lookup_store(:file_store, '/tmp/cache')
 page = MetaInspector.new('http://example.com', faraday_http_cache: { store: cache })
 ```
-Bugfixes:
-* Parsing of the document is done as soon as it is initialized (just like we do with the request), so
+* Bugfixes:
+  * Parsing of the document is done as soon as it is initialized (just like we do with the request), so
 that parsing errors will be catched earlier.
-* Rescues from Faraday::SSLError.
+  * Rescues from Faraday::SSLError.
 ## [Changes in 4.6](https://github.com/jaimeiniesta/metainspector/compare/v4.5.0...v4.6.0)
-Faraday can be passed options via `:faraday_options`. This is useful in cases where we need to
+* Faraday can be passed options via `:faraday_options`. This is useful in cases where we need to
 customize the way we request the page, like for example disabling SSL verification, like this:
 ```ruby
@@ -70,7 +99,7 @@ MetaInpector.new('https://example.com', faraday_options: { ssl: { verify: false
 ## [Changes in 4.4](https://github.com/jaimeiniesta/metainspector/compare/v4.3.0...v4.4.0)
-The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
+* The default headers now include `'Accept-Encoding' => 'identity'` to minimize trouble with servers that respond with malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).
 ## [Changes in 4.3](https://github.com/jaimeiniesta/metainspector/compare/v4.3.0...v4.4.0)

data/README.md CHANGED

@@ -1,4 +1,4 @@
-# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector) [![Code Climate](https://codeclimate.com/github/jaimeiniesta/metainspector/badges/gpa.svg)](https://codeclimate.com/github/jaimeiniesta/metainspector)
+# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Code Climate](https://codeclimate.com/github/jaimeiniesta/metainspector/badges/gpa.svg)](https://codeclimate.com/github/jaimeiniesta/metainspector)
 MetaInspector is a gem for web scraping purposes.
@@ -22,6 +22,8 @@ If you're using it on a Rails application, just add it to your Gemfile and run `
 gem 'metainspector'
 ```
+Supported Ruby versions are defined in [`.travis.yml`](.travis.yml).
 ## Usage
 Initialize a MetaInspector instance for an URL, like this:
@@ -73,7 +75,7 @@ page.root_url            # Root url (scheme + host, like http://sitevalidator.co
 page.head_links          # an array of hashes of all head/links
 page.stylesheets         # an array of hashes of all head/links where rel='stylesheet'
 page.canonicals          # an array of hashes of all head/links where rel='canonical'
-page.feed                # Get rss or atom links in meta data fields as array
+page.feeds               # Get rss or atom links in meta data fields as array of hash in the form { href: "...", title: "...", type: "..." }
 ```
 ### Texts
@@ -85,6 +87,12 @@ page.author              # author of the page from the meta author tag
 page.best_author         # best author of the page, from a selection of candidates
 page.description         # returns the meta description
 page.best_description    # returns the first non-empty description between the following candidates: standard meta description, og:description, twitter:description, the first long paragraph
+page.h1                  # returns h1 text array
+page.h2                  # returns h2 text array
+page.h3                  # returns h3 text array
+page.h4                  # returns h4 text array
+page.h5                  # returns h5 text array
+page.h6                  # returns h6 text array
 ```
 ### Links
@@ -396,7 +404,8 @@ Web page scraping is tricky, you can expect to find different exceptions during
 * `MetaInspector::TimeoutError`. When fetching a web page has taken too long.
 * `MetaInspector::RequestError`. When there has been an error on the request phase. Examples: page not found, SSL failure, invalid URI.
-* `MetaInspector::ParserError`. When there has been an error parsing the contents of the page. Example: trying to parse an image file.
+* `MetaInspector::ParserError`. When there has been an error parsing the contents of the page.
+* `MetaInspector::NonHtmlError`. When the contents of the page was not HTML. See also the `allow_non_html_content` option
 ## Examples

data/lib/meta_inspector/document.rb CHANGED

@@ -48,8 +48,8 @@ module MetaInspector
     delegate [:content_type, :response]               => :@request
     delegate [:parsed, :title, :best_title, :author, :best_author,
-              :description, :best_description, :links,
-              :images, :feed, :charset, :meta_tags,
+              :h1, :h2, :h3, :h4, :h5, :h6, :description, :best_description, :links,
+              :images, :feeds, :feed, :charset, :meta_tags,
               :meta_tag, :meta, :favicon,
               :head_links, :stylesheets, :canonicals] => :@parser
@@ -66,10 +66,17 @@ module MetaInspector
         'best_author'      => best_author,
         'description'      => description,
         'best_description' => best_description,
+        'h1'               => h1,
+        'h2'               => h2,
+        'h3'               => h3,
+        'h4'               => h4,
+        'h5'               => h5,
+        'h6'               => h6,
         'links'            => links.to_hash,
         'images'           => images.to_a,
         'charset'          => charset,
         'feed'             => feed,
+        'feeds'            => feeds,
         'content_type'     => content_type,
         'meta_tags'        => meta_tags,
         'favicon'          => images.favicon,
@@ -105,7 +112,7 @@ module MetaInspector
     def document
       @document ||= if !allow_non_html_content && !content_type.nil? && content_type != 'text/html'
-        fail MetaInspector::ParserError.new "The url provided contains #{content_type} content instead of text/html content"
+        fail MetaInspector::NonHtmlError.new "The url provided contains #{content_type} content instead of text/html content"
       else
         @request.read
       end

data/lib/meta_inspector/errors.rb CHANGED

@@ -10,4 +10,6 @@ module MetaInspector
   class RequestError < Error; end
   class ParserError < Error; end
+  class NonHtmlError < ParserError; end
 end
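
Note on the hunk above: because `NonHtmlError` subclasses `ParserError`, code that already rescues `MetaInspector::ParserError` should keep catching the non-HTML case after upgrading. The sketch below re-declares the hierarchy under a hypothetical `MetaInspectorSketch` namespace so it runs without the gem. It assumes a plain `StandardError` base, which may differ from the gem's own `Error` class, and the `classify` and `legacy_handler` helpers are illustrative only.

```ruby
# Local stand-ins mirroring the class hierarchy shown in the diff.
# Assumption: Error is a plain StandardError; the gem's base class may differ.
module MetaInspectorSketch
  class Error < StandardError; end
  class RequestError < Error; end
  class ParserError < Error; end
  class NonHtmlError < ParserError; end
end

# Hypothetical helper: reports which rescue branch handled the error.
# The most specific class is listed first, so NonHtmlError wins over ParserError.
def classify(error_class)
  raise error_class, 'simulated failure'
rescue MetaInspectorSketch::NonHtmlError
  :non_html
rescue MetaInspectorSketch::ParserError
  :parser
rescue MetaInspectorSketch::Error
  :other
end

# Hypothetical pre-5.7 style handler that only knows about ParserError.
def legacy_handler(error_class)
  raise error_class, 'simulated failure'
rescue MetaInspectorSketch::ParserError
  :parser
end

puts classify(MetaInspectorSketch::NonHtmlError)
puts legacy_handler(MetaInspectorSketch::NonHtmlError)
```

Listing the subclass before its parent in the `rescue` chain is what lets new code tell the two cases apart while older handlers continue to work.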

data/lib/meta_inspector/parser.rb CHANGED

@@ -23,10 +23,11 @@ module MetaInspector
     extend Forwardable
     delegate [:url, :scheme, :host]                                                        => :@document
     delegate [:meta_tags, :meta_tag, :meta, :charset]                                      => :@meta_tag_parser
-    delegate [:head_links, :stylesheets, :canonicals, :feed]                               => :@head_links_parser
+    delegate [:head_links, :stylesheets, :canonicals, :feeds, :feed]                       => :@head_links_parser
     delegate [:links, :base_url]                                                           => :@links_parser
     delegate :images                                                                       => :@images_parser
-    delegate [:title, :best_title, :author, :best_author, :description, :best_description] => :@texts_parser
+    delegate [:title, :best_title, :author, :best_author, :description, :best_description,
+              :h1, :h2, :h3, :h4, :h5, :h6]                                                => :@texts_parser
     # Returns the whole parsed document
     def parsed

data/lib/meta_inspector/parsers/head_links.rb CHANGED

@@ -3,6 +3,10 @@ module MetaInspector
     class HeadLinksParser < Base
       delegate [:parsed, :base_url] => :@main_parser
+      KNOWN_FEED_TYPES = %w[
+        application/rss+xml application/atom+xml application/json
+      ].freeze
       def head_links
         @head_links ||= parsed.css('head link').map do |tag|
           Hash[
@@ -24,16 +28,25 @@ module MetaInspector
         @canonicals ||= head_links.select { |hl| hl[:rel] == 'canonical' }
       end
-      # Returns the parsed document meta rss link
-      def feed
-        @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
-      end
+      def feeds
+        @feeds ||=
+          parsed.search("//link[@rel='alternate']").map do |link|
+            next if !KNOWN_FEED_TYPES.include?(link["type"]) || link["href"].to_s.strip == ''
-      private
+            {
+              title: link["title"],
+              href: URL.absolutify(link["href"], base_url),
+              type: link["type"]
+            }
+          end.compact
+      end
-      def parsed_feed(format)
-        feed = parsed.search("//link[@type='application/#{format}+xml']").find{|link| link.attributes["href"] }
-        feed ? URL.absolutify(feed['href'], base_url) : nil
+      def feed
+        warn "DEPRECATION: Use MetaInspector#feeds instead of #feed. The former gives you all feeds and their metadata, the latter will be removed."
+        @feed ||= begin
+          first_feed = feeds.find { |l| /\/(rss|atom)\+xml$/i =~ l[:type] } || {}
+          first_feed[:href]
+        end
       end
     end
   end
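
Note on the hunk above: the new `#feeds` keeps only `rel='alternate'` links whose `type` is in `KNOWN_FEED_TYPES` and whose `href` is not blank, while the deprecated `#feed` returns the `href` of the first RSS or Atom entry. A rough sketch of that filtering follows, using plain hashes as a stand-in for the Nokogiri nodes. The `select_feeds` and `first_feed_href` helpers and the sample links are hypothetical, and the `URL.absolutify` step is left out.

```ruby
KNOWN_FEED_TYPES = %w[
  application/rss+xml application/atom+xml application/json
].freeze

# Hypothetical stand-in for the parsed <link rel="alternate"> nodes:
# plain hashes with string keys instead of Nokogiri elements.
def select_feeds(links)
  links.map do |link|
    # Skip unknown types and blank hrefs, as the new #feeds does.
    next if !KNOWN_FEED_TYPES.include?(link['type']) || link['href'].to_s.strip == ''

    { title: link['title'], href: link['href'], type: link['type'] }
  end.compact
end

# Same lookup the deprecated #feed performs: href of the first RSS or Atom entry.
def first_feed_href(feeds)
  first_feed = feeds.find { |f| /\/(rss|atom)\+xml$/i =~ f[:type] } || {}
  first_feed[:href]
end

# Made-up sample data for illustration.
links = [
  { 'type' => 'text/html', 'href' => 'http://example.com/es', 'title' => 'Spanish' },
  { 'type' => 'application/json', 'href' => 'http://example.com/feed.json', 'title' => 'JSON feed' },
  { 'type' => 'application/atom+xml', 'href' => '   ', 'title' => 'Blank href' },
  { 'type' => 'application/rss+xml', 'href' => 'http://example.com/rss', 'title' => 'RSS' }
]

feeds = select_feeds(links)
puts feeds.inspect
puts first_feed_href(feeds)
```

In this sketch the JSON feed appears in the `feeds` list but is skipped by the legacy lookup, which only matches RSS and Atom types.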

data/lib/meta_inspector/parsers/images.rb CHANGED

@@ -29,14 +29,16 @@ module MetaInspector
       # If none found, tries with Twitter image
       def owner_suggested
         suggested_img = content_of(meta['og:image']) || content_of(meta['twitter:image'])
-        URL.absolutify(suggested_img, base_url) if suggested_img
+        URL.absolutify(suggested_img, base_url, normalize: false) if suggested_img
       end
       # Returns an array of [img_url, width, height] sorted by image area (width * height)
       def with_size
         @with_size ||= begin
           img_nodes = parsed.search('//img').select{ |img_node| img_node['src'] }
-          imgs_with_size = img_nodes.map { |img_node| [URL.absolutify(img_node['src'], base_url), img_node['width'], img_node['height']] }
+          imgs_with_size = img_nodes.map do |img_node|
+            [URL.absolutify(img_node['src'], base_url, normalize: false), img_node['width'], img_node['height']]
+          end
           imgs_with_size.uniq! { |url, width, height| url }
           if @download_images
             imgs_with_size.map! do |url, width, height|
@@ -71,7 +73,7 @@ module MetaInspector
       def favicon
         query = '//link[@rel="icon" or contains(@rel, "shortcut")]'
         value = parsed.xpath(query)[0].attributes['href'].value
-        @favicon ||= URL.absolutify(value, base_url)
+        @favicon ||= URL.absolutify(value, base_url, normalize: false)
       rescue
         nil
       end
@@ -83,7 +85,7 @@ module MetaInspector
       end
       def absolutified_images
-        parsed_images.map { |i| URL.absolutify(i, base_url) }
+        parsed_images.map { |i| URL.absolutify(i, base_url, normalize: false) }
       end
       def parsed_images

data/lib/meta_inspector/parsers/links.rb CHANGED

@@ -47,7 +47,8 @@ module MetaInspector
       # This can be the one set on a <base> tag,
       # or the url of the document if no <base> tag was found.
       def base_url
-        base_href || url
+        current_base_href = base_href.to_s.strip.empty? ? nil : base_href
+        current_base_href || url
       end
       # Returns the value of the href attribute on the <base /> tag, if exists
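
Note on the hunk above: an empty string is truthy in Ruby, so with the old `base_href || url` a `<base href="">` tag would have been used as the base instead of falling back to the document URL. A minimal free-standing sketch of the new fallback, where the hypothetical `effective_base_url` takes both values as arguments rather than reading them from the parser:

```ruby
# Hypothetical free-standing version of the new fallback logic.
# base_href and url are passed in instead of coming from the parser.
def effective_base_url(base_href, url)
  # Treat nil, empty and whitespace-only values as "no <base> tag".
  current_base_href = base_href.to_s.strip.empty? ? nil : base_href
  current_base_href || url
end

document_url = 'http://example.com/articles/1'

puts effective_base_url(nil, document_url)
puts effective_base_url('', document_url)
puts effective_base_url('http://cdn.example.com/', document_url)
```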

data/lib/meta_inspector/parsers/texts.rb CHANGED

@@ -13,6 +13,30 @@ module MetaInspector
         @best_title ||= find_best_title
       end
+      def h1
+        @h1 ||= find_heading('h1')
+      end
+      def h2
+        @h2 ||= find_heading('h2')
+      end
+      def h3
+        @h3 ||= find_heading('h3')
+      end
+      def h4
+        @h4 ||= find_heading('h4')
+      end
+      def h5
+        @h5 ||= find_heading('h5')
+      end
+      def h6
+        @h6 ||= find_heading('h6')
+      end
       # Returns the meta author, if present
       def author
         @author ||= meta['author']
@@ -45,6 +69,10 @@ module MetaInspector
       private
+      def find_heading(heading)
+        parsed.css(heading).map { |tag| tag.inner_text.strip.gsub(/\s+/, ' ') }.reject(&:empty?)
+      end
       # Look for candidates per list of priority
       def find_best_title
         candidates = [
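
Note on the hunk above: `find_heading` strips each heading, collapses runs of whitespace into a single space, and drops headings that end up empty. The same transformation applied to plain strings, as a sketch; the hypothetical `normalize_headings` stands in for the Nokogiri-backed method, and the sample strings are made up:

```ruby
# Hypothetical string-only version of find_heading: the input is the raw
# inner text of each heading tag rather than a Nokogiri node set.
def normalize_headings(raw_texts)
  raw_texts.map { |text| text.strip.gsub(/\s+/, ' ') }.reject(&:empty?)
end

raw = ["  Latest\n   news ", "\n\t ", 'Sports']
puts normalize_headings(raw).inspect
```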

data/lib/meta_inspector/request.rb CHANGED

@@ -48,7 +48,7 @@ module MetaInspector
       @response ||= fetch
     rescue Faraday::TimeoutError => e
       raise MetaInspector::TimeoutError.new(e)
-    rescue Faraday::Error::ConnectionFailed, Faraday::SSLError, URI::InvalidURIError, FaradayMiddleware::RedirectLimitReached => e
+    rescue Faraday::ConnectionFailed, Faraday::SSLError, URI::InvalidURIError, FaradayMiddleware::RedirectLimitReached => e
       raise MetaInspector::RequestError.new(e)
     end

data/lib/meta_inspector/url.rb CHANGED

@@ -5,7 +5,7 @@ module MetaInspector
     attr_reader :url
     def initialize(initial_url, options = {})
-      options        = defaults.merge(options)
+      options        = self.class.defaults.merge(options)
       @normalize     = options[:normalize]
@@ -56,11 +56,13 @@ module MetaInspector
     #   http:, ftp:, telnet:, mailto:, javascript: ...
     # Protocol-relative URLs are also resolved to use the same
     # schema as the base_url
-    def self.absolutify(url, base_url)
+    def self.absolutify(url, base_url, options = {})
+      options = defaults.merge(options)
       if url =~ /^\w*\:/i
-        MetaInspector::URL.new(url).url
+        MetaInspector::URL.new(url, options).url
       else
-        Addressable::URI.join(base_url, url).normalize.to_s
+        uri = Addressable::URI.join(base_url, url)
+        options[:normalize] ? uri.normalize.to_s : uri.to_s
       end
     rescue MetaInspector::ParserError, Addressable::URI::InvalidURIError, ArgumentError
       nil
@@ -68,7 +70,7 @@ module MetaInspector
     private
-    def defaults
+    def self.defaults
       { :normalize => true }
     end
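
Note on the hunk above: `absolutify` is a class method, so it cannot reach a private instance-level `defaults`. Moving `defaults` to the class level lets both `initialize` and `absolutify` merge caller options over `{ normalize: true }`. The sketch below shows that pattern with the standard library's `URI` as a stand-in. The gem itself uses `Addressable::URI`, whose normalization rules are broader, so the outputs here only illustrate the on/off switch. `UrlSketch` is a hypothetical name, and the scheme-prefixed branch of the real method is omitted.

```ruby
require 'uri'

# Hypothetical stand-in for MetaInspector::URL, built on stdlib URI
# instead of Addressable (assumption made so the sketch has no gem dependencies).
module UrlSketch
  # Class-level defaults, reachable from class methods.
  def self.defaults
    { normalize: true }
  end

  # Joins url onto base_url; normalizes unless the caller opts out.
  def self.absolutify(url, base_url, options = {})
    options = defaults.merge(options)
    uri = URI.join(base_url, url)
    options[:normalize] ? uri.normalize.to_s : uri.to_s
  rescue URI::Error, ArgumentError
    nil
  end
end

puts UrlSketch.absolutify('/IMG/logo.png', 'http://Example.COM/')
puts UrlSketch.absolutify('/IMG/logo.png', 'http://Example.COM/', normalize: false)
```

With `normalize: false` the joined URL is returned as written, which is the behaviour the image parsers in this release opt into.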

data/lib/meta_inspector/version.rb CHANGED

@@ -1,3 +1,3 @@
 module MetaInspector
-  VERSION = '5.6.0'
+  VERSION = '5.10.1'
 end

data/meta_inspector.gemspec CHANGED

@@ -1,11 +1,11 @@
 require File.expand_path('../lib/meta_inspector/version', __FILE__)
 Gem::Specification.new do |gem|
-  gem.authors       = ["Jaime Iniesta"]
-  gem.email         = ["jaimeiniesta@gmail.com"]
+  gem.author        = "Jaime Iniesta"
+  gem.email         = "jaimeiniesta@gmail.com"
   gem.description   = %q{MetaInspector lets you scrape a web page and get its links, images, texts, meta tags...}
   gem.summary       = %q{MetaInspector is a ruby gem for web scraping purposes, that returns metadata from a given URL}
-  gem.homepage      = "https://github.com/jaimeiniesta/metainspector"
+  gem.homepage      = "https://github.com/metainspector/metainspector"
   gem.license       = "MIT"
   gem.files         = `git ls-files`.split("\n")
@@ -14,20 +14,20 @@ Gem::Specification.new do |gem|
   gem.require_paths = ["lib"]
   gem.version       = MetaInspector::VERSION
-  gem.add_dependency 'nokogiri', '~> 1.7'
-  gem.add_dependency 'faraday', '~> 0.11'
-  gem.add_dependency 'faraday_middleware', '~> 0.11'
-  gem.add_dependency 'faraday-cookie_jar', '~> 0.0'
-  gem.add_dependency 'faraday-http-cache', '~> 2.0'
-  gem.add_dependency 'faraday-encoding', '~> 0.0'
-  gem.add_dependency 'addressable', '~> 2.5'
-  gem.add_dependency 'fastimage', '~> 2.1'
-  gem.add_dependency 'nesty', '~> 1.0'
+  gem.add_dependency 'nokogiri', '~> 1.10.9'
+  gem.add_dependency 'faraday', '~> 1.0.0'
+  gem.add_dependency 'faraday_middleware', '~> 1.0.0'
+  gem.add_dependency 'faraday-cookie_jar', '~> 0.0.6'
+  gem.add_dependency 'faraday-http-cache', '~> 2.2.0'
+  gem.add_dependency 'faraday-encoding', '~> 0.0.5'
+  gem.add_dependency 'addressable', '~> 2.7.0'
+  gem.add_dependency 'fastimage', '~> 2.1.7'
+  gem.add_dependency 'nesty', '~> 1.0.2'
-  gem.add_development_dependency 'rspec', '~> 3.0'
-  gem.add_development_dependency 'webmock'
-  gem.add_development_dependency 'awesome_print'
-  gem.add_development_dependency 'rake', '~> 10.1.0'
-  gem.add_development_dependency 'pry'
-  gem.add_development_dependency 'rubocop'
+  gem.add_development_dependency 'rspec', '~> 3.9.0'
+  gem.add_development_dependency 'webmock', '~> 3.8.3'
+  gem.add_development_dependency 'awesome_print', '~> 1.8.0'
+  gem.add_development_dependency 'rake', '~> 13.0.1'
+  gem.add_development_dependency 'pry', '~> 0.13.1'
+  gem.add_development_dependency 'rubocop', '~> 0.82.0'
 end
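
Note on the hunk above: the runtime constraints move from two-segment to three-segment pessimistic operators, which narrows the accepted range from "any 1.x at or above 1.7" to "patch releases of 1.10 at or above 1.10.9". A quick check with RubyGems' own `Gem::Requirement`, using the two nokogiri constraints from the diff; the candidate version numbers are chosen purely for illustration:

```ruby
require 'rubygems'

# Constraints copied from the old and new gemspec lines for nokogiri.
old_constraint = Gem::Requirement.new('~> 1.7')     # >= 1.7,    < 2.0
new_constraint = Gem::Requirement.new('~> 1.10.9')  # >= 1.10.9, < 1.11

# Illustrative candidate versions.
candidates = %w[1.8.5 1.10.9 1.10.10 1.11.0].map { |v| Gem::Version.new(v) }

candidates.each do |version|
  old_ok = old_constraint.satisfied_by?(version)
  new_ok = new_constraint.satisfied_by?(version)
  puts format('%-8s old: %-5s new: %s', version, old_ok, new_ok)
end
```

An application that pins any of these dependencies to a different minor series may therefore need its own constraints adjusted before Bundler can resolve 5.10.1.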