metainspector 1.5.0 → 1.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

data/README.rdoc +8 -1
data/lib/meta_inspector/scraper.rb +30 -13
data/lib/meta_inspector/version.rb +1 -1
data/lib/metainspector.rb +1 -1
data/meta_inspector.gemspec +2 -1
data/samples/basic_scraping.rb +11 -6
data/samples/spider.rb +9 -8
data/spec/metainspector_spec.rb +31 -3
metadata +37 -7

data/README.rdoc CHANGED

@@ -32,7 +32,9 @@ Then you can see the scraped data like this:
   page.meta_description   # meta description, as string
   page.meta_keywords      # meta keywords, as string
   page.image              # Most relevant image, if defined with og:image
-  page.feed                # Get rss or atom links in meta data fields as array
+  page.feed               # Get rss or atom links in meta data fields as array
+  page.meta_og_title      # opengraph title
+  page.meta_og_image      # opengraph image
 MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
@@ -48,6 +50,10 @@ It will also work for the meta tags of the form <meta http-equiv="name" ... />,
 Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
+You can also access most of the scraped data as a hash:
+  page.to_hash               # { "url"=>"http://pagerankalert.com", "title" => "PageRankAlert.com", ... }
 The full scraped document if accessible from:
   page.document # Nokogiri doc that you can use it to get any element from the page
@@ -100,5 +106,6 @@ You're welcome to fork this project and send pull requests. I want to thank spec
 * If keywords seem to be separated by blank spaces, replace them with commas
 * Mocks
 * Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
+* Autodiscover all available meta tags
 Copyright (c) 2009-2011 Jaime Iniesta, released under the MIT license

data/lib/meta_inspector/scraper.rb CHANGED

@@ -5,39 +5,40 @@ require 'rubygems'
 require 'nokogiri'
 require 'charguess'
 require 'iconv'
+require 'hashie/rash'
 # MetaInspector provides an easy way to scrape web pages and get its elements
 module MetaInspector
   class Scraper
     attr_reader :url
     # Initializes a new instance of MetaInspector, setting the URL to the one given
     # If no scheme given, set it to http:// by default
     def initialize(url)
       @url = URI.parse(url).scheme.nil? ? 'http://' + url : url
+      @data = Hashie::Rash.new('url' => @url)
     end
     # Returns the parsed document title, from the content of the <title> tag.
     # This is not the same as the meta_tite tag
     def title
-      @title ||= parsed_document.css('title').inner_html rescue nil
+      @data.title ||= parsed_document.css('title').inner_html rescue nil
     end
     # Returns the parsed document links
     def links
-      @links ||= remove_mailto(parsed_document.search("//a")
-                                .map {|link| link.attributes["href"]
+      @data.links ||= remove_mailto(parsed_document.search("//a") \
+                                .map {|link| link.attributes["href"] \
                                 .to_s.strip}.uniq) rescue nil
     end
     # Returns the links converted to absolute urls
     def absolute_links
-      @absolute_links ||= links.map { |l| absolutify_url(l) }
+      @data.absolute_links ||= links.map { |l| absolutify_url(l) }
     end
     # Returns the parsed document meta rss links
     def feed
-      @feed ||= parsed_document.xpath("//link").select{ |link|
+      @data.feed ||= parsed_document.xpath("//link").select{ |link|
           link.attributes["type"] && link.attributes["type"].value =~ /(atom|rss)/
         }.map { |link|
           absolutify_url(link.attributes["href"].value)
@@ -48,14 +49,21 @@ module MetaInspector
     # Most all major websites now define this property and is usually very relevant
     # See doc at http://developers.facebook.com/docs/opengraph/
     def image
-      @image ||= parsed_document.document.css("meta[@property='og:image']").first['content'] rescue nil
+      meta_og_image
     end
     # Returns the charset
     # TODO: We should trust the charset expressed on the Content-Type meta tag
     # and only guess it if none given
     def charset
-      @charset ||= CharGuess.guess(document).downcase
+      @data.charset ||= CharGuess.guess(document).downcase
+    end
+    # Returns all parsed data as a nested Hash
+    def to_hash
+      # TODO: find a better option to populate the data to the Hash
+      image;feed;links;charset;absolute_links;title;meta_keywords
+      @data.to_hash
     end
     # Returns the whole parsed document
@@ -85,14 +93,23 @@ module MetaInspector
     #
     # It will first try with meta name="..." and if nothing found,
     # with meta http-equiv="...", substituting "_" by "-"
-    # TODO: this should be case unsensitive, so meta_robots gets the results from the HTML for robots, Robots, ROBOTS...
-    # TODO: cache results on instance variables, using ||=
     # TODO: define respond_to? to return true on the meta_name methods
     def method_missing(method_name)
       if method_name.to_s =~ /^meta_(.*)/
-        content = parsed_document.css("meta[@name='#{$1}']").first['content'] rescue nil
-        content = parsed_document.css("meta[@http-equiv='#{$1.gsub("_", "-")}']").first['content'] rescue nil if content.nil?
-        content
+        key = $1
+        #special treatment for og:
+        if key =~ /^og_(.*)/
+          key = "og:#{$1}"
+        end
+        unless @data.meta
+          @data.meta!.name!
+          @data.meta!.property!
+          parsed_document.xpath("//meta").each { |element|
+            @data.meta.name[element.attributes["name"].value.downcase] = element.attributes["content"].value if element.attributes["name"]
+            @data.meta.property[element.attributes["property"].value.downcase] = element.attributes["content"].value if element.attributes["property"]
+          }
+        end
+        @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
       else
         super
       end
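
The lookup introduced in the method_missing hunk above can be sketched in isolation. This is a minimal sketch, not the gem's code: the class name MetaLookupSketch and the tags argument are hypothetical, and plain Hashes stand in for the Nokogiri elements and the Hashie::Rash store so that it runs without any gems. It covers only the name and property attributes, as the hunk does.

```ruby
# Hypothetical stand-in for the scraper: plain Hashes replace Nokogiri
# elements and Hashie::Rash so the sketch runs with no gems installed.
class MetaLookupSketch
  # tags: array of hashes such as { "name" => "Robots", "content" => "all" }
  def initialize(tags)
    @tags = tags
    @meta = nil # built lazily on the first meta_* call
  end

  def method_missing(method_name, *args)
    match = method_name.to_s.match(/\Ameta_(.+)\z/)
    return super unless match

    key = match[1]
    # meta_og_title is looked up as the "og:title" property
    key = "og:#{key[3..-1]}" if key.start_with?("og_")
    key = key.downcase

    index = meta_index
    index["name"][key] || index["property"][key]
  end

  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?("meta_") || super
  end

  private

  # Index every tag once, downcasing keys so lookups ignore case.
  def meta_index
    @meta ||= begin
      index = { "name" => {}, "property" => {} }
      @tags.each do |tag|
        index["name"][tag["name"].downcase] = tag["content"] if tag["name"]
        index["property"][tag["property"].downcase] = tag["content"] if tag["property"]
      end
      index
    end
  end
end

page = MetaLookupSketch.new([
  { "name" => "Robots", "content" => "all,follow" },
  { "property" => "og:title", "content" => "Example title" }
])
page.meta_RoBoTs    # => "all,follow"
page.meta_og_title  # => "Example title"
page.meta_lollypop  # => nil
```

Building the index once and reading from it afterwards is what lets repeated meta_* calls avoid re-querying the document, and downcasing on both sides is what makes meta_RoBoTs and meta_robots equivalent.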

data/lib/meta_inspector/version.rb CHANGED

@@ -1,5 +1,5 @@
 # -*- encoding: utf-8 -*-
 module MetaInspector
-  VERSION = "1.5.0"
+  VERSION = "1.6.0"
 end

data/lib/metainspector.rb CHANGED

@@ -1,3 +1,3 @@
 # -*- encoding: utf-8 -*-
-require 'meta_inspector'
+require File.expand_path(File.join(File.dirname(__FILE__), './meta_inspector'))
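
To see what the new require line resolves to, the path expression can be run on its own. This is a sketch only: the helper name sibling_require_path and the install directory are hypothetical, and the expected result assumes a Unix-style filesystem.

```ruby
# Mirrors the expression in the hunk above: join the directory of the
# current file with a relative name, then normalise to an absolute path.
def sibling_require_path(current_file, name)
  File.expand_path(File.join(File.dirname(current_file), name))
end

# Hypothetical install location, used only for illustration.
current_file = '/opt/gems/metainspector-1.6.0/lib/metainspector.rb'
puts sibling_require_path(current_file, './meta_inspector')
# prints /opt/gems/metainspector-1.6.0/lib/meta_inspector
```

Because the result is absolute, the require no longer depends on the lib directory being on the load path.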

data/meta_inspector.gemspec CHANGED

@@ -21,8 +21,9 @@ Gem::Specification.new do |s|
   s.add_dependency 'nokogiri', '1.4.4'
   s.add_dependency 'charguess', '1.3.20110226181011'
+  s.add_dependency "rash", "~> 0.3.0"
   s.add_development_dependency 'rspec', '~> 2.6.0'
   s.add_development_dependency 'fakeweb', '~> 1.3.0'
+  s.add_development_dependency 'awesome_print', '~> 0.4.0'
 end
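
The "~> 0.3.0" constraint added for rash can be checked with the RubyGems API itself. A minimal sketch, where the constant and helper names are hypothetical and only Gem::Requirement and Gem::Version come from RubyGems:

```ruby
require 'rubygems'

# "~> 0.3.0" is the constraint added for rash in the gemspec hunk above.
RASH_REQUIREMENT = Gem::Requirement.new('~> 0.3.0')

# Returns true when the given version string satisfies the constraint.
def rash_version_allowed?(version)
  RASH_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

puts rash_version_allowed?('0.3.0') # prints true
puts rash_version_allowed?('0.3.2') # prints true
puts rash_version_allowed?('0.4.0') # prints false
```

The pessimistic operator with three segments accepts patch releases within 0.3.x and rejects 0.4.0 and later.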

data/samples/basic_scraping.rb CHANGED

@@ -1,17 +1,22 @@
 # Some basic MetaInspector samples
-require_relative '../lib/meta_inspector.rb'
+$: << File.join(File.dirname(__FILE__), "/../lib")
+require 'meta_inspector'
+require 'ap'
-puts "Enter a valid http address to scrape it"
-address = gets.strip
-page = MetaInspector.new(address)
+puts "Enter a valid http url to scrape it"
+url = gets.strip
+page = MetaInspector.new(url)
 puts "...please wait while scraping the page..."
-puts "Scraping #{page.address} returned these results:"
+puts "Scraping #{page.url} returned these results:"
 puts "TITLE: #{page.title}"
 puts "META DESCRIPTION: #{page.meta_description}"
 puts "META KEYWORDS: #{page.meta_keywords}"
 puts "#{page.links.size} links found..."
 page.links.each do |link|
   puts " ==> #{link}"
-end
+end
+puts "to_hash..."
+ap page.to_hash

data/samples/spider.rb CHANGED

@@ -1,19 +1,20 @@
 # A basic spider that will follow links on an infinite loop
-require_relative '../lib/meta_inspector.rb'
+$: << File.join(File.dirname(__FILE__), "/../lib")
+require 'meta_inspector'
 q = Queue.new
 visited_links=[]
-puts "Enter a valid http address to spider it following external links"
-address = gets.strip
+puts "Enter a valid http url to spider it following external links"
+url = gets.strip
-page = MetaInspector.new(address)
-q.push(address)
+page = MetaInspector.new(url)
+q.push(url)
 while q.size > 0
-  visited_links << address = q.pop
-  page = MetaInspector.new(address)
-  puts "Spidering #{page.address}"
+  visited_links << url = q.pop
+  page = MetaInspector.new(url)
+  puts "Spidering #{page.url}"
   puts "TITLE: #{page.title}"
   puts "META DESCRIPTION: #{page.meta_description}"

data/spec/metainspector_spec.rb CHANGED

@@ -43,6 +43,7 @@ describe MetaInspector do
     it "should find an image" do
       @m = MetaInspector.new('http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
       @m.image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
+      @m.meta_og_image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
     end
     it "should have a Nokogiri::HTML::Document as parsed_document" do
@@ -103,6 +104,10 @@ describe MetaInspector do
       @m.meta_robots.should == 'all,follow'
     end
+    it "should get the robots meta tag" do
+      @m.meta_RoBoTs.should == 'all,follow'
+    end
     it "should get the description meta tag" do
       @m.meta_description.should == 'Track your PageRank(TM) changes and receive alerts by email'
     end
@@ -116,9 +121,8 @@ describe MetaInspector do
       @m.meta_content_language.should == "en"
     end
-    it "should get the Content-Type meta tag" do
-      pending "mocks"
-      @m.meta_Content_Type.should == "text/html; charset=utf-8"
+    it "should get the Csrf_pAram meta tag" do
+      @m.meta_Csrf_pAram.should == "authenticity_token"
     end
     it "should get the generator meta tag" do
@@ -129,6 +133,17 @@ describe MetaInspector do
     it "should return nil for nonfound meta_tags" do
       @m.meta_lollypop.should == nil
     end
+    it "should find a meta_og_title" do
+      @m = MetaInspector.new('http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
+      @m.meta_og_title.should == "Apple Claims New iPhone Only Visible To Most Loyal Of Customers"
+    end
+    it "should not find a meta_og_something" do
+      @m = MetaInspector.new('http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
+      @m.meta_og_something.should == nil
+    end
   end
   context 'Charset detection' do
@@ -146,4 +161,17 @@ describe MetaInspector do
       @m.charset.should == "utf-8"
     end
   end
+  context 'to_hash' do
+    FakeWeb.register_uri(:get, "http://www.pagerankalert.com", :response => fixture_file("pagerankalert.com.response"))
+    it "should return a hash with all the values set" do
+      @m = MetaInspector.new('http://www.pagerankalert.com')
+      @m.to_hash.should == {"title"=>"PageRankAlert.com :: Track your PageRank changes", "url"=>"http://www.pagerankalert.com", "meta"=>{"name"=>{"robots"=>"all,follow", "csrf_param"=>"authenticity_token", "description"=>"Track your PageRank(TM) changes and receive alerts by email", "keywords"=>"pagerank, seo, optimization, google", "csrf_token"=>"iW1/w+R8zrtDkhOlivkLZ793BN04Kr3X/pS+ixObHsE="}, "property"=>{}}, "links"=>["/", "/es?language=es", "/users/sign_up", "/users/sign_in", "http://pagerankalert.posterous.com", "http://twitter.com/pagerankalert", "http://twitter.com/share"], "charset"=>"utf-8", "feed"=>"http://feeds.feedburner.com/PageRankAlert", "absolute_links"=>["http://www.pagerankalert.com/", "http://www.pagerankalert.com/es?language=es", "http://www.pagerankalert.com/users/sign_up", "http://www.pagerankalert.com/users/sign_in", "http://pagerankalert.posterous.com", "http://twitter.com/pagerankalert", "http://twitter.com/share"]}
+    end
+  end
 end

metadata CHANGED

@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
   prerelease: false
   segments:
   - 1
-  - 5
+  - 6
   - 0
-  version: 1.5.0
+  version: 1.6.0
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-05-30 00:00:00 +02:00
+date: 2011-06-03 00:00:00 +02:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -48,9 +48,24 @@ dependencies:
   type: :runtime
   version_requirements: *id002
 - !ruby/object:Gem::Dependency
-  name: rspec
+  name: rash
   prerelease: false
   requirement: &id003 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        - 3
+        - 0
+        version: 0.3.0
+  type: :runtime
+  version_requirements: *id003
+- !ruby/object:Gem::Dependency
+  name: rspec
+  prerelease: false
+  requirement: &id004 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -61,11 +76,11 @@ dependencies:
         - 0
         version: 2.6.0
   type: :development
-  version_requirements: *id003
+  version_requirements: *id004
 - !ruby/object:Gem::Dependency
   name: fakeweb
   prerelease: false
-  requirement: &id004 !ruby/object:Gem::Requirement
+  requirement: &id005 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,7 +91,22 @@ dependencies:
         - 0
         version: 1.3.0
   type: :development
-  version_requirements: *id004
+  version_requirements: *id005
+- !ruby/object:Gem::Dependency
+  name: awesome_print
+  prerelease: false
+  requirement: &id006 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        segments:
+        - 0
+        - 4
+        - 0
+        version: 0.4.0
+  type: :development
+  version_requirements: *id006
 description: MetaInspector lets you scrape a web page and get its title, charset, link and meta tags
 email:
 - jaimeiniesta@gmail.com