RubyGems - metainspector - Versions diffs - 1.16.1 → 1.17.0 - Mend

metainspector 1.16.1 → 1.17.0

Files changed (21)

checksums.yaml +4 -4
data/README.md +17 -11
data/lib/meta_inspector.rb +10 -3
data/lib/meta_inspector/deprecations.rb +19 -0
data/lib/meta_inspector/document.rb +81 -0
data/lib/meta_inspector/exception_log.rb +29 -0
data/lib/meta_inspector/exceptionable.rb +11 -0
data/lib/meta_inspector/parser.rb +178 -0
data/lib/meta_inspector/request.rb +55 -0
data/lib/meta_inspector/url.rb +76 -0
data/lib/meta_inspector/version.rb +1 -1
data/spec/document_spec.rb +97 -0
data/spec/exception_log_spec.rb +59 -0
data/spec/meta_inspector_spec.rb +9 -0
data/spec/parser_spec.rb +374 -0
data/spec/redirections_spec.rb +20 -3
data/spec/request_spec.rb +64 -0
data/spec/url_spec.rb +74 -0
metadata +18 -7
data/lib/meta_inspector/scraper.rb +0 -283
data/spec/metainspector_spec.rb +0 -547

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: bbf33936348d092cedb51b59a828bf90929c75ea
-  data.tar.gz: f8ba64edadd581c5d85c25d8139b99856cbe6a25
+  metadata.gz: 54f34fbd4dec77ffa68eb9762cdc140e98246817
+  data.tar.gz: 0c294be322b646fa90150c3934218db529af3c6b
 SHA512:
-  metadata.gz: 39370252996fc183c93eb1b49899628c1a7052250d2ef525957f984107de8b1cfcba9d6bcaf7f59ba611db342867e091b98dbaf2bab9b886f9eefaffceaf22c5
-  data.tar.gz: 2c119f82e2e1bb7de01e45c4e5ba83403c65f0fffde7fd739c19e179e177bb001ee652860a7389de1eb266c970ff28999fb88b909b3f184341383f96899fd2c6
+  metadata.gz: 79a235d8161922f9991f9df68a524f18076562c56ff68cee29a4ac8dc88c6abd1b1113363aaa9d79b39e9f703abcdcc801ccd24930c3a085157979814498c132
+  data.tar.gz: 68b165770c8c8de5d56b923c9b25405f500ac3ee00bfb245af41a91dcf9d2ac7fcb1bf6fcbf6d6ec80b9f7453462d2c2235a14109a0a09710ec424d5f4b59a08

data/README.md CHANGED

@@ -74,11 +74,11 @@ You can also access most of the scraped data as a hash:
 The original document is accessible from:
-    page.document         # A String with the contents of the HTML document
+    page.to_s         # A String with the contents of the HTML document
 And the full scraped document is accessible from:
-    page.parsed_document  # Nokogiri doc that you can use it to get any element from the page
+    page.parsed  # Nokogiri doc that you can use it to get any element from the page
 ## Opengraph and Twitter card meta tags
@@ -91,8 +91,8 @@ Twitter cards & Open graph tags make it possible for you to attach media experie
 Also many sites use name & property, content & value attributes interchangeably. Using MetaInspector accessing this information is as easy as -
-    page.meta_og_image
-    page.meta_twitter_image_width
+    page.meta_og_image
+    page.meta_twitter_image_width
 Note that MetaInspector gives priority to content over value. In other words if there is a tag of the form
@@ -122,7 +122,7 @@ However, you can tell MetaInspector to allow these redirections with the option
 ### HTML Content Only
-MetaInspector will try to parse all URLs by default. If you want to raise an error when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
+MetaInspector will try to parse all URLs by default. If you want to raise an exception when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
     page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
@@ -137,21 +137,27 @@ This is useful when using MetaInspector on web spidering. Although on the initia
     page.title         # returns nil
     page.content_type  # "image/png"
     page.ok?           # false
-    page.errors.first  # "Scraping exception: The url provided contains image/png content instead of text/html content"
+    page.exceptions.first.message  # "The url provided contains image/png content instead of text/html content"
-## Error handling
+## Exception handling
 You can check if the page has been succesfully parsed with:
     page.ok?     # Will return true if everything looks OK
-In case there have been any errors, you can check them with:
+In case there have been any exceptions, you can check them with:
-    page.errors  # Will return an array with the error messages
+    page.exceptions  # Will return an array with the exceptions
-If you also want to see the errors on console, you can initialize MetaInspector with the verbose option like that:
+You can also specify what to do when encountering an exception. By default it
+will store it, but you can also tell MetaInspector to warn about it on the log
+console, or to raise the exceptions, like this:
-    page = MetaInspector.new('http://example.com', :verbose => true)
+    # This will warn about the exception on console
+    page = MetaInspector.new('http://example.com', warn_level: :warn)
+    # This will raise the exception
+    page = MetaInspector.new('http://example.com', warn_level: :raise)
 ## Examples

data/lib/meta_inspector.rb CHANGED

@@ -1,12 +1,19 @@
 # -*- encoding: utf-8 -*-
-require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/scraper'))
+require 'forwardable'
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/exceptionable'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/exception_log'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/request'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/url'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/parser'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/document'))
+require File.expand_path(File.join(File.dirname(__FILE__), 'meta_inspector/deprecations'))
 module MetaInspector
   extend self
-  # Sugar method to be able to create a scraper in a shorter way
+  # Sugar method to be able to scrape a document in a shorter way
   def new(url, options = {})
-    Scraper.new(url, options)
+    Document.new(url, options)
   end
 end

data/lib/meta_inspector/deprecations.rb ADDED

@@ -0,0 +1,19 @@
+# -*- encoding: utf-8 -*-
+module MetaInspector
+  class Scraper < Document
+    def initialize
+      warn "The Scraper class is now deprecated since version 1.17, use Document instead"
+      super
+    end
+    def errors
+      warn "The #errors method is deprecated since version 1.17, use #exceptions instead"
+      exceptions
+    end
+    def document
+      warn "The #document method is deprecated since version 1.17, use #to_s instead"
+    end
+  end
+end
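
The deprecation shim above keeps the old class name alive as a thin subclass that warns and then defers to the new API. A minimal, self-contained sketch of that pattern follows. `SketchDocument` and `SketchScraper` are hypothetical stand-ins, not the gem's classes. Two details are assumptions of this sketch rather than what the hunk shows: `initialize` accepts and forwards its arguments (a zero-argument `initialize` cannot receive a URL), and `document` returns `to_s` after warning.

```ruby
# Hypothetical stand-in for the new class; not the gem's real Document.
class SketchDocument
  attr_reader :url, :exceptions

  def initialize(url, options = {})
    @url        = url
    @options    = options
    @exceptions = []
  end

  def to_s
    "<html></html>"
  end
end

# Deprecated name kept as a subclass that warns, then defers to the new API.
class SketchScraper < SketchDocument
  def initialize(*args)
    warn "SketchScraper is deprecated, use SketchDocument instead"
    super # bare super forwards the arguments this method received
  end

  def errors
    warn "#errors is deprecated, use #exceptions instead"
    exceptions
  end

  def document
    warn "#document is deprecated, use #to_s instead"
    to_s
  end
end
```

Callers of the old name keep working and see a warning on stderr for each deprecated entry point.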

data/lib/meta_inspector/document.rb ADDED

@@ -0,0 +1,81 @@
+# -*- encoding: utf-8 -*-
+module MetaInspector
+  # A MetaInspector::Document knows about its URL and its contents
+  class Document
+    attr_reader :timeout, :html_content_only, :allow_redirections, :warn_level
+    include MetaInspector::Exceptionable
+    # Initializes a new instance of MetaInspector::Document, setting the URL to the one given
+    # Options:
+    # => timeout: defaults to 20 seconds
+    # => html_content_type_only: if an exception should be raised if request content-type is not text/html. Defaults to false
+    # => allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP
+    # => document: the html of the url as a string
+    # => warn_level: what to do when encountering exceptions. Can be :warn, :raise or nil
+    def initialize(initial_url, options = {})
+      options             = defaults.merge(options)
+      @timeout            = options[:timeout]
+      @html_content_only  = options[:html_content_only]
+      @allow_redirections = options[:allow_redirections]
+      @document           = options[:document]
+      if options[:verbose] == true
+        warn "The verbose option is deprecated since 1.17, please use warn_level: :warn instead"
+        options[:warn_level] = :warn
+      end
+      @warn_level         = options[:warn_level]
+      @exception_log  = MetaInspector::ExceptionLog.new(warn_level: warn_level)
+      @url            = MetaInspector::URL.new(initial_url, exception_log: @exception_log)
+      @request        = MetaInspector::Request.new(@url, allow_redirections: @allow_redirections,
+                                                         timeout:            @timeout,
+                                                         exception_log:      @exception_log)
+      @parser         = MetaInspector::Parser.new(self,  exception_log:      @exception_log)
+    end
+    extend Forwardable
+    def_delegators :@url,     :url, :scheme, :host, :root_url
+    def_delegators :@request, :content_type
+    def_delegators :@parser,  :parsed, :method_missing, :title, :description, :links, :internal_links, :external_links,
+                              :images, :image, :feed, :charset
+    # Returns all document data as a nested Hash
+    def to_hash
+      {
+        'url' => url,
+        'title' => title,
+        'links' => links,
+        'internal_links' => internal_links,
+        'external_links' => external_links,
+        'images' => images,
+        'charset' => charset,
+        'feed' => feed,
+        'content_type' => content_type
+      }.merge @parser.to_hash
+    end
+    # Returns the contents of the document as a string
+    def to_s
+      document
+    end
+    private
+    def defaults
+      { :timeout => 20, :html_content_only => false }
+    end
+    def document
+      @document ||= if html_content_only && content_type != "text/html"
+                      raise "The url provided contains #{content_type} content instead of text/html content" and nil
+                    else
+                      @request.read
+                    end
+      rescue Exception => e
+        @exception_log << e
+    end
+  end
+end
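
The option handling in `Document#initialize` above merges caller options over defaults and maps the deprecated `:verbose` flag onto `warn_level: :warn`. A rough sketch of that step in isolation, assuming a hypothetical helper `normalize_options` and constant `SKETCH_DEFAULTS` that do not exist in the gem:

```ruby
# Hypothetical constant mirroring the defaults shown in the hunk.
SKETCH_DEFAULTS = { timeout: 20, html_content_only: false }.freeze

# Hypothetical helper: merges defaults and translates the deprecated :verbose flag.
def normalize_options(options = {})
  merged = SKETCH_DEFAULTS.merge(options)
  if merged[:verbose] == true
    warn "The verbose option is deprecated, use warn_level: :warn instead"
    merged[:warn_level] = :warn
  end
  merged
end
```

As in the hunk, `verbose: true` takes precedence over any `warn_level` the caller passed alongside it.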

data/lib/meta_inspector/exception_log.rb ADDED

@@ -0,0 +1,29 @@
+# -*- encoding: utf-8 -*-
+module MetaInspector
+  # Stores the exceptions passed to it, warning about them if required
+  class ExceptionLog
+    attr_reader :exceptions, :warn_level
+    def initialize(options = {})
+      @exceptions = []
+      @warn_level = options[:warn_level]
+    end
+    def <<(exception)
+      case warn_level
+      when :warn
+        warn exception
+      when :raise
+        raise exception
+      end
+      @exceptions << exception
+    end
+    def ok?
+      exceptions.empty?
+    end
+  end
+end
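
To see how the three `warn_level` settings behave, here is a standalone sketch that mirrors the class above under the hypothetical name `SketchExceptionLog`, so it runs without the gem installed. One consequence of the ordering in `<<` is worth noting: with `:raise`, the exception is re-raised before it is appended, so the log stays empty.

```ruby
# Standalone mirror of the ExceptionLog logic; the class name is hypothetical.
class SketchExceptionLog
  attr_reader :exceptions, :warn_level

  def initialize(options = {})
    @exceptions = []
    @warn_level = options[:warn_level]
  end

  def <<(exception)
    case warn_level
    when :warn  then warn exception   # print the message to stderr
    when :raise then raise exception  # re-raise; nothing is stored
    end
    @exceptions << exception
  end

  def ok?
    exceptions.empty?
  end
end

quiet = SketchExceptionLog.new
quiet << StandardError.new("stored silently")
quiet.ok? # false: the exception was stored

strict = SketchExceptionLog.new(warn_level: :raise)
begin
  strict << StandardError.new("re-raised")
rescue StandardError
  # the caller handles the exception here
end
strict.ok? # true: the raise happened before the store
```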

data/lib/meta_inspector/exceptionable.rb ADDED

@@ -0,0 +1,11 @@
+# -*- encoding: utf-8 -*-
+module MetaInspector
+  #
+  # This module extracts two common methods for classes that use ExceptionLog
+  #
+  module Exceptionable
+    extend Forwardable
+    def_delegators :@exception_log, :exceptions, :ok?
+  end
+end
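
The module above relies on Ruby's standard `Forwardable` to give every including class `exceptions` and `ok?` methods backed by a shared `@exception_log`. A small sketch of the mechanism, using hypothetical names (`SketchExceptionable`, `SketchLog`, `SketchRequest`, `SketchParser`) in place of the gem's classes:

```ruby
require 'forwardable'

# Mixin that forwards two methods to whatever @exception_log the includer sets.
module SketchExceptionable
  extend Forwardable
  def_delegators :@exception_log, :exceptions, :ok?
end

# Hypothetical minimal log object.
SketchLog = Struct.new(:exceptions) do
  def ok?
    exceptions.empty?
  end
end

class SketchRequest
  include SketchExceptionable

  def initialize(log)
    @exception_log = log
  end
end

class SketchParser
  include SketchExceptionable

  def initialize(log)
    @exception_log = log
  end
end

log     = SketchLog.new([])
request = SketchRequest.new(log)
parser  = SketchParser.new(log)

log.exceptions << RuntimeError.new("boom")
request.ok? # false
parser.ok?  # false: both objects report on the same shared log
```

Sharing one log object across collaborators is what lets the top-level document report failures from any stage.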

data/lib/meta_inspector/parser.rb ADDED

@@ -0,0 +1,178 @@
+# -*- encoding: utf-8 -*-
+require 'nokogiri'
+require 'hashie/rash'
+module MetaInspector
+  # Parses the document with Nokogiri
+  class Parser
+    include MetaInspector::Exceptionable
+    def initialize(document, options = {})
+      options = defaults.merge(options)
+      @document       = document
+      @data           = Hashie::Rash.new
+      @exception_log  = options[:exception_log]
+    end
+    extend Forwardable
+    def_delegators :@document, :url, :scheme, :host
+    # Returns the whole parsed document
+    def parsed
+      @parsed ||= Nokogiri::HTML(@document.to_s)
+      rescue Exception => e
+        @exception_log << e
+    end
+    def to_hash
+      scrape_meta_data
+      @data.to_hash
+    end
+    # Returns the parsed document title, from the content of the <title> tag.
+    # This is not the same as the meta_title tag
+    def title
+      @title ||= parsed.css('title').inner_text rescue nil
+    end
+    # A description getter that first checks for a meta description and if not present will
+    # guess by looking at the first paragraph with more than 120 characters
+    def description
+      meta_description || secondary_description
+    end
+    # Links found on the page, as absolute URLs
+    def links
+      @links ||= parsed_links.map{ |l| URL.absolutify(URL.unrelativize(l, scheme), base_url) }.compact.uniq
+    end
+    # Internal links found on the page, as absolute URLs
+    def internal_links
+      @internal_links ||= links.select {|link| URL.new(link).host == host }
+    end
+    # External links found on the page, as absolute URLs
+    def external_links
+      @external_links ||= links.select {|link| URL.new(link).host != host }
+    end
+    # Images found on the page, as absolute URLs
+    def images
+      @images ||= parsed_images.map{ |i| URL.absolutify(i, base_url) }
+    end
+    # Returns the parsed image from Facebook's open graph property tags
+    # Most all major websites now define this property and is usually very relevant
+    # See doc at http://developers.facebook.com/docs/opengraph/
+    def image
+      meta_og_image || meta_twitter_image
+    end
+    # Returns the parsed document meta rss link
+    def feed
+      @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
+    end
+    # Returns the charset from the meta tags, looking for it in the following order:
+    # <meta charset='utf-8' />
+    # <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
+    def charset
+      @charset ||= (charset_from_meta_charset || charset_from_meta_content_type)
+    end
+    private
+    def defaults
+      { exception_log: MetaInspector::ExceptionLog.new }
+    end
+    # Scrapers for all meta_tags in the form of "meta_name" are automatically defined. This has been tested for
+    # meta name: keywords, description, robots, generator
+    # meta http-equiv: content-language, Content-Type
+    #
+    # It will first try with meta name="..." and if nothing found,
+    # with meta http-equiv="...", substituting "_" by "-"
+    # TODO: define respond_to? to return true on the meta_name methods
+    def method_missing(method_name)
+      if method_name.to_s =~ /^meta_(.*)/
+        key = $1
+        #special treatment for opengraph (og:) and twitter card (twitter:) tags
+        key.gsub!("_",":") if key =~ /^og_(.*)/ || key =~ /^twitter_(.*)/
+        scrape_meta_data
+        @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
+      else
+        super
+      end
+    end
+    # Scrapes all meta tags found
+    def scrape_meta_data
+      unless @data.meta
+        @data.meta!.name!
+        @data.meta!.property!
+        parsed.xpath("//meta").each do |element|
+          get_meta_name_or_property(element)
+        end
+      end
+    end
+    # Store meta tag value, looking at meta name or meta property
+    def get_meta_name_or_property(element)
+      name_or_property = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
+      content_or_value = element.attributes["content"] ? "content" : (element.attributes["value"] ? "value" : nil)
+      if !name_or_property.nil? && !content_or_value.nil?
+        @data.meta.name[element.attributes[name_or_property].value.downcase] = element.attributes[content_or_value].value
+      end
+    end
+    # Look for the first <p> block with 120 characters or more
+    def secondary_description
+      first_long_paragraph = parsed.search('//p[string-length() >= 120]').first
+      first_long_paragraph ? first_long_paragraph.text : ''
+    end
+    def parsed_links
+      @parsed_links ||= cleanup_nokogiri_values(parsed.search("//a/@href"))
+    end
+    def parsed_images
+      @parsed_images ||= cleanup_nokogiri_values(parsed.search('//img/@src'))
+    end
+    def parsed_feed(format)
+      feed = parsed.search("//link[@type='application/#{format}+xml']").first
+      feed ? URL.absolutify(feed.attributes['href'].value, base_url) : nil
+    end
+    def charset_from_meta_charset
+      parsed.css("meta[charset]")[0].attributes['charset'].value rescue nil
+    end
+    def charset_from_meta_content_type
+      parsed.css("meta[http-equiv='Content-Type']")[0].attributes['content'].value.split(";")[1].split("=")[1] rescue nil
+    end
+    # Returns the base url to absolutify relative links. This can be the one set on a <base> tag,
+    # or the url of the document if no <base> tag was found.
+    def base_url
+      base_href || url
+    end
+    # Returns the value of the href attribute on the <base /> tag, if it exists
+    def base_href
+      parsed.search('base').first.attributes['href'].value rescue nil
+    end
+    # Takes a nokogiri search result, strips the values, rejects the empty ones, and removes duplicates
+    def cleanup_nokogiri_values(results)
+      results.map { |a| a.value.strip }.reject { |s| s.empty? }.uniq
+    end
+  end
+end
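
The `method_missing` hook in the parser above turns a call such as `meta_og_image` into a lookup key. The translation step can be sketched on its own, assuming a hypothetical helper `meta_key_for` that is not part of the gem:

```ruby
# Hypothetical helper isolating the key translation done in method_missing.
def meta_key_for(method_name)
  match = method_name.to_s.match(/^meta_(.*)/)
  return nil unless match

  key = match[1]
  # Open Graph and Twitter card names use ":" where the Ruby method uses "_".
  key = key.gsub("_", ":") if key =~ /^og_(.*)/ || key =~ /^twitter_(.*)/
  key.downcase
end

meta_key_for(:meta_og_image)            # "og:image"
meta_key_for(:meta_twitter_image_width) # "twitter:image:width"
meta_key_for(:meta_description)         # "description"
meta_key_for(:title)                    # nil
```

Every underscore in an `og_` or `twitter_` name becomes a colon, which is why `meta_twitter_image_width` maps to `twitter:image:width`.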

data/lib/meta_inspector/request.rb ADDED

@@ -0,0 +1,55 @@
+# -*- encoding: utf-8 -*-
+require 'open-uri'
+require 'open_uri_redirections'
+require 'timeout'
+module MetaInspector
+  # Makes the request to the server
+  class Request
+    include MetaInspector::Exceptionable
+    def initialize(initial_url, options = {})
+      options = defaults.merge(options)
+      @url                = initial_url
+      @allow_redirections = options[:allow_redirections]
+      @timeout            = options[:timeout]
+      @exception_log      = options[:exception_log]
+    end
+    extend Forwardable
+    def_delegators :@url, :url
+    def read
+      response.read if response
+    end
+    def content_type
+      response.content_type if response
+    end
+    private
+    def response
+      Timeout::timeout(@timeout) { @response ||= fetch }
+      rescue TimeoutError, SocketError => e
+        @exception_log << e
+        nil
+    end
+    def fetch
+      request = open(url, {:allow_redirections => @allow_redirections})
+      @url.url = request.base_uri.to_s
+      request
+    end
+    def defaults
+      { allow_redirections: false, timeout: 20, exception_log: MetaInspector::ExceptionLog.new }
+    end
+  end
+end
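
The private `response` method above wraps the fetch in a timeout and records failures in the exception log instead of letting them propagate. A standalone sketch of that guard, assuming a hypothetical helper `guarded_fetch` and a plain array as the log. It uses the namespaced `Timeout::Error` rather than the top-level `TimeoutError` name that appears in the hunk.

```ruby
require 'timeout'
require 'socket' # defines SocketError

# Hypothetical helper: runs the block under a timeout, logging failures
# to exception_log and returning nil instead of raising.
def guarded_fetch(seconds, exception_log)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error, SocketError => e
  exception_log << e
  nil
end

log = []
guarded_fetch(1, log) { :body }       # returns :body, log stays empty
guarded_fetch(0.05, log) { sleep 1 }  # returns nil, log now holds a Timeout::Error
```

Returning `nil` on failure is what allows `read` and `content_type` in the hunk to guard with `if response`.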