RubyGems - metainspector - Versions diffs - 1.12.0 → 1.12.1 - Mend

metainspector 1.12.0 → 1.12.1


Files changed (9)

data/README.md +169 -0
data/lib/meta_inspector/scraper.rb +26 -29
data/lib/meta_inspector/version.rb +1 -1
data/meta_inspector.gemspec +3 -3
data/samples/spider.rb +4 -4
data/spec/metainspector_spec.rb +14 -7
data/spec/spec_helper.rb +1 -1
metadata +37 -38
data/README.rdoc +0 -152

data/README.md ADDED

@@ -0,0 +1,169 @@
+# MetaInspector [![Build Status](https://secure.travis-ci.org/jaimeiniesta/metainspector.png)](http://travis-ci.org/jaimeiniesta/metainspector) [![Dependency Status](https://gemnasium.com/jaimeiniesta/metainspector.png)](https://gemnasium.com/jaimeiniesta/metainspector)
+MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
+## See it in action!
+You can try MetaInspector live at this little demo: [https://metainspectordemo.herokuapp.com](https://metainspectordemo.herokuapp.com)
+## Installation
+Install the gem from RubyGems:
+    gem install metainspector
+If you're using it on a Rails application, just add it to your Gemfile and run `bundle install`
+    gem 'metainspector'
+This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
+## Usage
+Initialize a MetaInspector instance for a URL, like this:
+    page = MetaInspector.new('http://markupvalidator.com')
+If you don't include the scheme on the URL, http:// will be used by default:
+    page = MetaInspector.new('markupvalidator.com')
+## Accessing scraped data
+Then you can see the scraped data like this:
+    page.url                # URL of the page
+    page.scheme             # Scheme of the page (http, https)
+    page.host               # Hostname of the page (like, markupvalidator.com, without the scheme)
+    page.root_url           # Root url (scheme + host, like http://markupvalidator.com/)
+    page.title              # title of the page, as string
+    page.links              # array of strings, with every link found on the page as an absolute URL
+    page.internal_links     # array of strings, with every internal link found on the page as an absolute URL
+    page.external_links     # array of strings, with every external link found on the page as an absolute URL
+    page.meta_description   # meta description, as string
+    page.description        # returns the meta description, or the first long paragraph if no meta description is found
+    page.meta_keywords      # meta keywords, as string
+    page.image              # Most relevant image, if defined with og:image
+    page.images             # array of strings, with every img found on the page as an absolute URL
+    page.feed               # first rss or atom feed link found, as an absolute URL string (nil if none)
+    page.meta_og_title      # opengraph title
+    page.meta_og_image      # opengraph image
+    page.charset            # UTF-8
+    page.content_type       # content-type returned by the server when the url was requested
+MetaInspector uses dynamic methods for meta tag discovery: each of the calls below is converted into a search for the meta tag with the corresponding name, and returns its content attribute:
+    page.meta_description   # <meta name="description" content="..." />
+    page.meta_keywords      # <meta name="keywords" content="..." />
+    page.meta_robots        # <meta name="robots" content="..." />
+    page.meta_generator     # <meta name="generator" content="..." />
+It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
+    page.meta_content_language  # <meta http-equiv="content-language" content="..." />
+    page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />
+Please note that MetaInspector is case sensitive, so `page.meta_Content_Type` is not the same as `page.meta_content_type`
+You can also access most of the scraped data as a hash:
+    page.to_hash  # { "url"   => "http://markupvalidator.com",
+                      "title" => "MarkupValidator :: site-wide markup validation tool", ... }
+The full scraped document is accessible from:
+    page.parsed_document  # Nokogiri document that you can use to get any element from the page
+## Options
+### Timeout
+By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
+You can set a different timeout with a second parameter, like this:
+    page = MetaInspector.new('markupvalidator.com', :timeout => 5) # 5 seconds timeout
+### Redirections
+MetaInspector allows safe redirects from http to https (for example, [http://github.com](http://github.com) => [https://github.com](https://github.com)) by default. With the option `:allow_safe_redirections => false`, it will throw exceptions on such redirects.
+    page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
+To enable unsafe redirects from https to http (like, [https://example.com](https://example.com) => [http://example.com](http://example.com)) you can pass the option `:allow_unsafe_redirections => true`. If this option is not specified or is false an exception is thrown on such redirects.
+    page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
+### HTML Content Only
+MetaInspector will try to parse all URLs by default. If you want to raise an error when trying to parse a non-html URL (one that has a content-type different than text/html), you can state it like this:
+    page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
+This is useful when using MetaInspector for web spidering. The initial URL will probably be an HTML page, but while following links you may end up trying to parse non-HTML URLs.
+    page = MetaInspector.new('http://example.com/image.png')
+    page.title         # returns ""
+    page.content_type  # "image/png"
+    page.ok?           # true
+    page = MetaInspector.new('http://example.com/image.png', :html_content_only => true)
+    page.title         # returns nil
+    page.content_type  # "image/png"
+    page.ok?           # false
+    page.errors.first  # "Scraping exception: The url provided contains image/png content instead of text/html content"
+## Error handling
+You can check if the page has been successfully parsed with:
+    page.ok?     # Will return true if everything looks OK
+In case there have been any errors, you can check them with:
+    page.errors  # Will return an array with the error messages
+If you also want to see the errors on the console, you can initialize MetaInspector with the verbose option like this:
+    page = MetaInspector.new('http://example.com', :verbose => true)
+## Examples
+You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows internal links using a queue. What follows is an example of use from irb:
+    $ irb
+    >> require 'metainspector'
+    => true
+    >> page = MetaInspector.new('http://markupvalidator.com')
+    => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
+    >> page.title
+    => "MarkupValidator :: site-wide markup validation tool"
+    >> page.meta_description
+    => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
+    >> page.meta_keywords
+    => "html, markup, validation, validator, tool, w3c, development, standards, free"
+    >> page.links.size
+    => 15
+    >> page.links[4]
+    => "/plans-and-pricing"
+    >> page.document.class
+    => String
+    >> page.parsed_document.class
+    => Nokogiri::HTML::Document
+## ZOMG Fork! Thank you!
+You're welcome to fork this project and send pull requests. Just remember to include specs.
+Thanks to all the contributors:
+[https://github.com/jaimeiniesta/metainspector/graphs/contributors](https://github.com/jaimeiniesta/metainspector/graphs/contributors)
+Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license
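
The dynamic `meta_*` lookups described in the README can be illustrated with a small, self-contained sketch. It is not MetaInspector's implementation: the class name `TinyMetaPage`, the plain Hash of meta values, and the underscore-to-hyphen mapping are all assumptions made for illustration. It only shows how `method_missing` can turn a method name into a case-sensitive tag lookup.

```ruby
# Hypothetical stand-in for a scraped page; NOT MetaInspector's actual class.
class TinyMetaPage
  # meta: Hash mapping meta tag names to their content values (assumed shape)
  def initialize(meta)
    @meta = meta
  end

  # Any call of the form meta_xxx becomes a lookup of the tag named "xxx".
  def method_missing(name, *args, &block)
    key = meta_key(name)
    return super if key.nil?
    @meta[key]
  end

  def respond_to_missing?(name, include_private = false)
    !meta_key(name).nil? || super
  end

  private

  # "meta_Content_Type" -> "Content-Type"; returns nil for non-meta names.
  # The underscore-to-hyphen rule is an assumption of this sketch.
  def meta_key(name)
    match = /\Ameta_(.+)\z/.match(name.to_s)
    match && match[1].tr("_", "-")
  end
end

page = TinyMetaPage.new("description" => "A demo", "Content-Type" => "text/html")
puts page.meta_description        # looks up "description"
puts page.meta_Content_Type       # looks up "Content-Type"
puts page.meta_content_type.inspect # different key, so nothing is found
```

Because the key is derived from the method name without changing its case, `meta_Content_Type` and `meta_content_type` resolve to different tags, which matches the case-sensitivity note in the README.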

data/lib/meta_inspector/scraper.rb CHANGED

@@ -72,13 +72,9 @@ module MetaInspector
       meta_og_image
     end
-    # Returns the parsed document meta rss links
+    # Returns the parsed document meta rss link
     def feed
-      @feed ||= parsed_document.xpath("//link").select{ |link|
-          link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
-        }.map { |link|
-          absolutify_url(link.attributes["href"].value)
-        }.first rescue nil
+      @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
     end
     # Returns the charset from the meta tags, looking for it in the following order:
@@ -133,13 +129,6 @@ module MetaInspector
       errors.empty?
     end
-    ##### DEPRECATIONS ####
-    def parsed?
-      warn "the parsed? method has been deprecated, please use ok? instead"
-      !@parsed_document.nil?
-    end
-    ##### DEPRECATIONS ####
     private
     def defaults
@@ -190,29 +179,36 @@ module MetaInspector
         @data.meta!.name!
         @data.meta!.property!
         parsed_document.xpath("//meta").each do |element|
-          if element.attributes["content"]
-            if element.attributes["name"]
-              @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value
-            end
-            if element.attributes["property"]
-              @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value
-            end
-          end
+          get_meta_name_or_property(element)
         end
       end
     end
+    # Store meta tag value, looking at meta name or meta property
+    def get_meta_name_or_property(element)
+      if element.attributes["content"]
+        type = element.attributes["name"] ? "name" : (element.attributes["property"] ? "property" : nil)
+        @data.meta.name[element.attributes[type].value.downcase] = element.attributes["content"].value if type
+      end
+    end
+    def parsed_feed(format)
+      feed = parsed_document.search("//link[@type='application/#{format}+xml']").first
+      feed ? absolutify_url(feed.attributes['href'].value) : nil
+    end
     def parsed_links
-      @parsed_links ||= parsed_document.search("//a") \
-                          .map {|link| link.attributes["href"] \
-                          .to_s.strip}.uniq rescue []
+      @parsed_links ||= cleanup_nokogiri_values(parsed_document.search("//a/@href"))
     end
     def parsed_images
-      @parsed_images ||= parsed_document.search('//img') \
-                           .reject{|i| (i.attributes['src'].nil? || i.attributes['src'].value.empty?) } \
-                           .map{ |i| i.attributes['src'].value }.uniq
+      @parsed_images ||= cleanup_nokogiri_values(parsed_document.search('//img/@src'))
+    end
+    # Takes a nokogiri search result, strips the values, rejects the empty ones, and removes duplicates
+    def cleanup_nokogiri_values(results)
+      results.map { |a| a.value.strip }.reject { |s| s.empty? }.uniq
     end
     # Stores the error for later inspection
@@ -250,7 +246,8 @@ module MetaInspector
     # Look for the first <p> block with 120 characters or more
     def secondary_description
-      (p = parsed_document.search('//p').map(&:text).select{ |p| p.length > 120 }.first).nil? ? '' : p
+      first_long_paragraph = parsed_document.search('//p[string-length() >= 120]').first
+      first_long_paragraph ? first_long_paragraph.text : ''
     end
     def charset_from_meta_charset
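
One detail of the `get_meta_name_or_property` hunk is worth a closer look: as the added lines read, the value is always written to `@data.meta.name`, even when the tag was identified by its `property` attribute. The sketch below mirrors the old and new logic on plain hashes (the hash-based elements and the `store_meta_old` / `store_meta_new` names are illustrative stand-ins, not code from the gem) to show where a property-only tag such as `og:title` ends up in each version.

```ruby
# Plain hashes stand in for Nokogiri elements and for @data.meta.
# Illustration only; these helper names do not exist in the gem.

# Logic of the removed lines: name and property go to separate buckets.
def store_meta_old(element, meta)
  return unless element["content"]
  meta[:name][element["name"].downcase] = element["content"] if element["name"]
  meta[:property][element["property"].downcase] = element["content"] if element["property"]
end

# Logic of the added lines: pick one attribute, then store under :name.
def store_meta_new(element, meta)
  return unless element["content"]
  type = element["name"] ? "name" : (element["property"] ? "property" : nil)
  meta[:name][element[type].downcase] = element["content"] if type
end

og = { "property" => "og:title", "content" => "Hello" }

old_meta = { :name => {}, :property => {} }
new_meta = { :name => {}, :property => {} }
store_meta_old(og, old_meta)
store_meta_new(og, new_meta)

p old_meta[:property] # the old logic files og:title under :property
p new_meta[:name]     # the new logic files og:title under :name
p new_meta[:property] # and leaves :property empty
```

Whether this affects lookups such as `page.meta_og_title` depends on code that is not part of this diff, so treat the sketch as a prompt to check rather than as a confirmed regression.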

data/lib/meta_inspector/version.rb CHANGED

@@ -1,5 +1,5 @@
 # -*- encoding: utf-8 -*-
 module MetaInspector
-  VERSION = "1.12.0"
+  VERSION = "1.12.1"
 end

data/meta_inspector.gemspec CHANGED

@@ -17,8 +17,8 @@ Gem::Specification.new do |gem|
   gem.add_dependency 'nokogiri', '~> 1.5'
   gem.add_dependency 'rash', '0.3.2'
-  gem.add_development_dependency 'rspec', '2.11.0'
+  gem.add_development_dependency 'rspec', '2.12.0'
   gem.add_development_dependency 'fakeweb', '1.3.0'
-  gem.add_development_dependency 'awesome_print', '1.0.2'
-  gem.add_development_dependency 'rake', '0.9.2.2'
+  gem.add_development_dependency 'awesome_print', '1.1.0'
+  gem.add_development_dependency 'rake', '10.0.2'
 end

data/samples/spider.rb CHANGED

@@ -6,7 +6,7 @@ require 'meta_inspector'
 q = Queue.new
 visited_links=[]
-puts "Enter a valid http url to spider it following external links"
+puts "Enter a valid http url to spider it following internal links"
 url = gets.strip
 page = MetaInspector.new(url)
@@ -20,9 +20,9 @@ while q.size > 0
   puts "TITLE: #{page.title}"
   puts "META DESCRIPTION: #{page.meta_description}"
   puts "META KEYWORDS: #{page.meta_keywords}"
-  puts "LINKS: #{page.links.size}"
-  page.links.each do |link|
-    if link[0..6] == 'http://' && !visited_links.include?(link)
+  puts "LINKS: #{page.internal_links.size}"
+  page.internal_links.each do |link|
+    if !visited_links.include?(link)
       q.push(link)
     end
   end
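
The queue-driven loop in the spider sample can be tried without network access. The sketch below is a rough approximation under stated assumptions: `SITE` is a made-up in-memory map from page URL to its internal links, and `crawl` is a hypothetical helper, so nothing here calls MetaInspector. It keeps the same shape as the sample: pop a URL, record it, and push only links that have not been visited yet.

```ruby
require 'thread' # provides Queue on older Rubies; harmless on newer ones

# Hypothetical site map: page URL -> internal links found on that page.
SITE = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/", "http://example.com/b"],
  "http://example.com/b" => []
}

# Visits every page reachable from start_url and returns them in visit order.
def crawl(start_url, site)
  queue = Queue.new
  visited = []
  queue.push(start_url)

  while queue.size > 0
    url = queue.pop
    next if visited.include?(url) # the same link may have been queued twice
    visited << url

    site.fetch(url, []).each do |link|
      queue.push(link) unless visited.include?(link)
    end
  end

  visited
end

p crawl("http://example.com/", SITE)
```

Restricting the crawl to internal links keeps the queue bounded by the size of one site, which is presumably why the sample switched from `page.links` to `page.internal_links`.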

data/spec/metainspector_spec.rb CHANGED

@@ -89,14 +89,21 @@ describe MetaInspector do
       @m.document.class.should == String
     end
-    it "should get rss feed" do
-      @m = MetaInspector.new('http://www.iteh.at')
-      @m.feed.should == 'http://www.iteh.at/de/rss/'
-    end
+    describe "Feed" do
+      it "should get rss feed" do
+        @m = MetaInspector.new('http://www.iteh.at')
+        @m.feed.should == 'http://www.iteh.at/de/rss/'
+      end
-    it "should get atom feed" do
-      @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
-      @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+      it "should get atom feed" do
+        @m = MetaInspector.new('http://www.tea-tron.com/jbravo/blog/')
+        @m.feed.should == 'http://www.tea-tron.com/jbravo/blog/feed/'
+      end
+      it "should return nil if no feed found" do
+        @m = MetaInspector.new('http://www.alazan.com')
+        @m.feed.should == nil
+      end
     end
     describe "get description" do
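
The three feed specs above correspond to the `parsed_feed('rss') || parsed_feed('atom')` fallback introduced in `scraper.rb`. A minimal sketch of that fallback, using an array of hashes in place of the parsed `<link>` tags (the `first_feed_of` and `feed_for` names and the data shape are assumptions, and URL absolutification is left out), looks like this:

```ruby
# Each hash mimics one <link type="..." href="..."> tag (assumed shape).
def first_feed_of(links, format)
  link = links.find { |l| l[:type] == "application/#{format}+xml" }
  link && link[:href]
end

# Prefer an RSS link; fall back to Atom; nil when neither is present.
def feed_for(links)
  first_feed_of(links, 'rss') || first_feed_of(links, 'atom')
end

both = [
  { :type => "application/atom+xml", :href => "http://example.com/atom" },
  { :type => "application/rss+xml",  :href => "http://example.com/rss" }
]
atom_only = [{ :type => "application/atom+xml", :href => "http://example.com/atom" }]
no_feed   = [{ :type => "text/css", :href => "http://example.com/style.css" }]

p feed_for(both)      # RSS wins even though the Atom link comes first
p feed_for(atom_only)
p feed_for(no_feed)
```

Two consequences follow from the code as shown in the diff: RSS is preferred over Atom regardless of document order, whereas the removed implementation took the first matching link in document order, and a page without either link type yields `nil`, which is what the new spec checks.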

data/spec/spec_helper.rb CHANGED

@@ -32,7 +32,7 @@ FakeWeb.register_uri(:get, "http://example.com/invalid_href", :response => fixtu
 FakeWeb.register_uri(:get, "http://www.youtube.com/watch?v=iaGSSrp49uc", :response => fixture_file("youtube.response"))
 FakeWeb.register_uri(:get, "http://markupvalidator.com/faqs", :response => fixture_file("markupvalidator_faqs.response"))
 FakeWeb.register_uri(:get, "https://twitter.com/markupvalidator", :response => fixture_file("twitter_markupvalidator.response"))
-FakeWeb.register_uri(:get, "https://example.com/empty", :response => fixture_file("empty_page.response"))
+FakeWeb.register_uri(:get, "http://example.com/empty", :response => fixture_file("empty_page.response"))
 FakeWeb.register_uri(:get, "http://international.com", :response => fixture_file("international.response"))
 FakeWeb.register_uri(:get, "http://charset000.com", :response => fixture_file("charset_000.response"))
 FakeWeb.register_uri(:get, "http://charset001.com", :response => fixture_file("charset_001.response"))

metadata CHANGED

@@ -1,13 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: metainspector
 version: !ruby/object:Gem::Version
-  hash: 39
+  hash: 37
   prerelease:
   segments:
   - 1
   - 12
-  - 0
-  version: 1.12.0
+  - 1
+  version: 1.12.1
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -15,10 +15,12 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-12-01 00:00:00 Z
+date: 2012-12-03 00:00:00 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  version_requirements: &id001 !ruby/object:Gem::Requirement
+  name: nokogiri
+  prerelease: false
+  requirement: &id001 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -28,12 +30,12 @@ dependencies:
         - 1
         - 5
         version: "1.5"
-  prerelease: false
   type: :runtime
-  name: nokogiri
-  requirement: *id001
+  version_requirements: *id001
 - !ruby/object:Gem::Dependency
-  version_requirements: &id002 !ruby/object:Gem::Requirement
+  name: rash
+  prerelease: false
+  requirement: &id002 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - "="
@@ -44,28 +46,28 @@ dependencies:
         - 3
         - 2
         version: 0.3.2
-  prerelease: false
   type: :runtime
-  name: rash
-  requirement: *id002
+  version_requirements: *id002
 - !ruby/object:Gem::Dependency
-  version_requirements: &id003 !ruby/object:Gem::Requirement
+  name: rspec
+  prerelease: false
+  requirement: &id003 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - "="
       - !ruby/object:Gem::Version
-        hash: 35
+        hash: 63
         segments:
         - 2
-        - 11
+        - 12
         - 0
-        version: 2.11.0
-  prerelease: false
+        version: 2.12.0
   type: :development
-  name: rspec
-  requirement: *id003
+  version_requirements: *id003
 - !ruby/object:Gem::Dependency
-  version_requirements: &id004 !ruby/object:Gem::Requirement
+  name: fakeweb
+  prerelease: false
+  requirement: &id004 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - "="
@@ -76,12 +78,12 @@ dependencies:
         - 3
         - 0
         version: 1.3.0
-  prerelease: false
   type: :development
-  name: fakeweb
-  requirement: *id004
+  version_requirements: *id004
 - !ruby/object:Gem::Dependency
-  version_requirements: &id005 !ruby/object:Gem::Requirement
+  name: awesome_print
+  prerelease: false
+  requirement: &id005 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - "="
@@ -89,30 +91,27 @@ dependencies:
         hash: 19
         segments:
         - 1
+        - 1
         - 0
-        - 2
-        version: 1.0.2
-  prerelease: false
+        version: 1.1.0
   type: :development
-  name: awesome_print
-  requirement: *id005
+  version_requirements: *id005
 - !ruby/object:Gem::Dependency
-  version_requirements: &id006 !ruby/object:Gem::Requirement
+  name: rake
+  prerelease: false
+  requirement: &id006 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - "="
       - !ruby/object:Gem::Version
-        hash: 11
+        hash: 75
         segments:
+        - 10
         - 0
-        - 9
         - 2
-        - 2
-        version: 0.9.2.2
-  prerelease: false
+        version: 10.0.2
   type: :development
-  name: rake
-  requirement: *id006
+  version_requirements: *id006
 description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
 email:
 - jaimeiniesta@gmail.com
@@ -128,7 +127,7 @@ files:
 - .travis.yml
 - Gemfile
 - MIT-LICENSE
-- README.rdoc
+- README.md
 - Rakefile
 - lib/meta_inspector.rb
 - lib/meta_inspector/open_uri.rb

data/README.rdoc DELETED

@@ -1,152 +0,0 @@
-= MetaInspector {<img src="https://secure.travis-ci.org/jaimeiniesta/metainspector.png?branch=master" />}[http://travis-ci.org/jaimeiniesta/metainspector] {<img src="https://codeclimate.com/badge.png" />}[https://codeclimate.com/github/jaimeiniesta/metainspector]
-MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...
-= See it in action!
-You can try MetaInspector live at this little demo: https://metainspectordemo.herokuapp.com
-= Installation
-Install the gem from RubyGems:
-  gem install metainspector
-This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.
-= Usage
-Initialize a scraper instance for an URL, like this:
-  page = MetaInspector::Scraper.new('http://markupvalidator.com')
-or, for short, a convenience alias is also available:
-  page = MetaInspector.new('http://markupvalidator.com')
-If you don't include the scheme on the URL, http:// will be used
-by defaul:
-  page = MetaInspector.new('markupvalidator.com')
-By default, MetaInspector times out after 20 seconds of waiting for a page to respond.
-You can set a different timeout with a second parameter, like this:
-  page = MetaInspector.new('markupvalidator.com', :timeout => 5) # this would wait just 5 seconds to timeout
-MetaInspector will try to parse all URLs by default. If you want to parse only those URLs that have text/html as content-type you can specify it like this:
-  page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
-MetaInspector allows safe redirects from http to https sites by default. Passing allow_safe_redirections as false will throw exceptions on such redirects.
-  page = MetaInspector.new('facebook.com', :allow_safe_redirections => false)
-To enable unsafe redirects from https to http sites you can pass allow_unsafe_redirections as true. If this option is not specified or is false an exception is thrown on such redirects.
-  page = MetaInspector.new('facebook.com', :allow_unsafe_redirections => true)
-Then you can see the scraped data like this:
-  page.url                # URL of the page
-  page.scheme             # Scheme of the page (http, https)
-  page.host               # Hostname of the page (like, markupvalidator.com, without the scheme)
-  page.root_url           # Root url (scheme + host, like http://markupvalidator.com/)
-  page.title              # title of the page, as string
-  page.links              # array of strings, with every link found on the page as an absolute URL
-  page.internal_links     # array of strings, with every internal link found on the page as an absolute URL
-  page.external_links     # array of strings, with every external link found on the page as an absolute URL
-  page.meta_description   # meta description, as string
-  page.description        # returns the meta description, or the first long paragraph if no meta description is found
-  page.meta_keywords      # meta keywords, as string
-  page.image              # Most relevant image, if defined with og:image
-  page.images             # array of strings, with every img found on the page as an absolute URL
-  page.feed               # Get rss or atom links in meta data fields as array
-  page.meta_og_title      # opengraph title
-  page.meta_og_image      # opengraph image
-  page.charset            # UTF-8
-  page.content_type       # content-type returned by the server when the url was requested
-MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
-  page.meta_description       # <meta name="description" content="..." />
-  page.meta_keywords          # <meta name="keywords" content="..." />
-  page.meta_robots            # <meta name="robots" content="..." />
-  page.meta_generator         # <meta name="generator" content="..." />
-It will also work for the meta tags of the form <meta http-equiv="name" ... />, like the following:
-  page.meta_content_language  # <meta http-equiv="content-language" content="..." />
-  page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />
-Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
-You can also access most of the scraped data as a hash:
-  page.to_hash               # { "url"=>"http://markupvalidator.com", "title" => "MarkupValidator :: site-wide markup validation tool", ... }
-The full scraped document if accessible from:
-  page.document # Nokogiri doc that you can use it to get any element from the page
-= Errors handling
-You can check if the page has been succesfully parsed with:
-  page.ok?                    # Will return true if everything looks OK
-In case there have been any errors, you can check them with:
-  page.errors                 # Will return an array with the error messages
-If you also want to see the errors on console, you can initialize MetaInspector with the verbose option like that:
-  page = MetaInspector.new('http://example.com', :verbose => true)
-= Examples
-You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
-  $ irb
-  >> require 'metainspector'
-  => true
-  >> page = MetaInspector.new('http://markupvalidator.com')
-  => #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">
-  >> page.title
-  => "MarkupValidator :: site-wide markup validation tool"
-  >> page.meta_description
-  => "Site-wide markup validation tool. Validate the markup of your whole site with just one click."
-  >> page.meta_keywords
-  => "html, markup, validation, validator, tool, w3c, development, standards, free"
-  >> page.links.size
-  => 15
-  >> page.links[4]
-  => "/plans-and-pricing"
-  >> page.document.class
-  => String
-  >> page.parsed_document.class
-  => Nokogiri::HTML::Document
-= ZOMG Fork! Thank you!
-You're welcome to fork this project and send pull requests. Just remember to include specs.
-Thanks to all the contributors:
-https://github.com/jaimeiniesta/metainspector/graphs/contributors
-= To Do
-* Get page.base_dir from the URL
-* If keywords seem to be separated by blank spaces, replace them with commas
-* Autodiscover all available meta tags
-Copyright (c) 2009-2012 Jaime Iniesta, released under the MIT license