metainspector 1.2.0 → 1.3.0

Files changed (5)

data/README.rdoc +13 -4
data/lib/meta_inspector/scraper.rb +13 -9
data/lib/meta_inspector/version.rb +1 -1
data/spec/metainspector_spec.rb +24 -1
metadata +3 -3

data/README.rdoc CHANGED

@@ -18,13 +18,19 @@ or, for short, a convenience alias is also available:
   page = MetaInspector.new('http://pagerankalert.com')
+If you don't include the scheme on the URL, http:// will be used
+by default:
+  page = MetaInspector.new('pagerankalert.com')
 Then you can see the scraped data like this:
-  page.address            # URL of the page
+  page.url                # URL of the page
   page.title              # title of the page, as string
   page.links              # array of strings, with every link found on the page
   page.meta_description   # meta description, as string
   page.meta_keywords      # meta keywords, as string
+  page.image              # Most relevant image, if defined with og:image
 MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
@@ -53,7 +59,7 @@ You can find some sample scripts on the samples folder, including a basic scrapi
   => true
   >> page = MetaInspector.new('http://pagerankalert.com')
-  => #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
+  => #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">
   >> page.title
   => "PageRankAlert.com :: Track your PageRank changes"
@@ -76,15 +82,18 @@ You can find some sample scripts on the samples folder, including a basic scrapi
   >> page.parsed_document.class
   => Nokogiri::HTML::Document
+= ZOMG Fork! Thank you!
+You're welcome to fork this project and send pull requests. I want to thank Ryan Romanchuk for his help https://github.com/rromanchuk
 = To Do
-* Get page.base_dir from the address
+* Get page.base_dir from the URL
 * Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
 * Return array of images in page as absolute URLs
 * Be able to set a timeout in seconds
 * If keywords seem to be separated by blank spaces, replace them with commas
 * Mocks
 * Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
-* Get most important image querying Facebook
 Copyright (c) 2009-2011 Jaime Iniesta, released under the MIT license
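
The README change above states that a URL given without a scheme gets http:// by default. A minimal sketch of that rule, using only Ruby's standard `uri` library, might look like the following. The helper name `with_default_scheme` is hypothetical and not part of the gem; per the scraper.rb diff, the gem applies the same check inline in `Scraper#initialize`.

```ruby
require 'uri'

# Hypothetical helper (not part of metainspector): prepend http:// when the
# given string has no scheme, mirroring the rule described in the README.
def with_default_scheme(url)
  URI.parse(url).scheme.nil? ? "http://#{url}" : url
end

puts with_default_scheme('pagerankalert.com')          # scheme added
puts with_default_scheme('http://pagerankalert.com')   # left unchanged
puts with_default_scheme('https://pagerankalert.com')  # left unchanged
```

Note that `URI.parse` raises `URI::InvalidURIError` for strings it cannot parse (for example, ones containing spaces); this sketch does not handle that case.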

data/lib/meta_inspector/scraper.rb CHANGED

@@ -9,14 +9,12 @@ require 'iconv'
 # MetaInspector provides an easy way to scrape web pages and get its elements
 module MetaInspector
   class Scraper
-    attr_reader :address
+    attr_reader :url
-    # Initializes a new instance of MetaInspector, setting the URL address to the one given
-    # TODO: validate address as http URL, dont initialize it if wrong format
-    def initialize(address)
-      @address = address
-      @document = @title = @description = @keywords = @links = nil
+    # Initializes a new instance of MetaInspector, setting the URL to the one given
+    # If no scheme given, set it to http:// by default
+    def initialize(url)
+      @url = URI.parse(url).scheme.nil? ? 'http://' + url : url
     end
     # Returns the parsed document title, from the content of the <title> tag.
@@ -30,6 +28,13 @@ module MetaInspector
       @links ||= parsed_document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
     end
+    # Returns the parsed image from Facebook's open graph property tags
+    # Most major websites now define this property, and it is usually very relevant
+    # See doc at http://developers.facebook.com/docs/opengraph/
+    def image
+      @image ||= parsed_document.document.css("meta[@property='og:image']").first['content'] rescue nil
+    end
     # Returns the charset
     # TODO: We should trust the charset expressed on the Content-Type meta tag
     # and only guess it if none given
@@ -47,7 +52,7 @@ module MetaInspector
     # Returns the original, unparsed document
     def document
-      @document ||= open(@address).read
+      @document ||= open(@url).read
       rescue SocketError
         warn 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
@@ -71,7 +76,6 @@ module MetaInspector
       if method_name.to_s =~ /^meta_(.*)/
         content = parsed_document.css("meta[@name='#{$1}']").first['content'] rescue nil
         content = parsed_document.css("meta[@http-equiv='#{$1.gsub("_", "-")}']").first['content'] rescue nil if content.nil?
         content
       else
         super
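
The `method_missing` hunk above turns any `meta_*` call into a lookup by `name`, falling back to `http-equiv` with underscores converted to hyphens. The sketch below illustrates that dispatch pattern in isolation. It is not the gem's code: `MetaTagPage` is a hypothetical class, and two plain Hashes stand in for the Nokogiri document so the example runs without any gems.

```ruby
# Hypothetical stand-in for a scraped page. Two Hashes replace the parsed
# document: one keyed by meta name, one keyed by http-equiv.
class MetaTagPage
  META_METHOD = /\Ameta_(.+)\z/

  def initialize(named, http_equiv = {})
    @named = named
    @http_equiv = http_equiv
  end

  # meta_foo_bar looks up name="foo_bar" first, then http-equiv="foo-bar".
  # Anything that does not match meta_* goes to the default behaviour.
  def method_missing(method_name, *args, &block)
    match = META_METHOD.match(method_name.to_s)
    return super unless match

    key = match[1]
    @named[key] || @http_equiv[key.tr('_', '-')]
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(method_name, include_private = false)
    META_METHOD.match?(method_name.to_s) || super
  end
end

page = MetaTagPage.new(
  { 'description' => 'Track your PageRank changes' },
  { 'content-type' => 'text/html; charset=utf-8' }
)

puts page.meta_description   # found by name
puts page.meta_content_type  # found by http-equiv, "_" mapped to "-"
p page.meta_robots           # no such tag, so nil
```

Defining `respond_to_missing?` alongside `method_missing` is a common Ruby convention; the hunk shown here does not indicate whether the gem does so.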

data/lib/meta_inspector/version.rb CHANGED

@@ -1,5 +1,5 @@
 # -*- encoding: utf-8 -*-
 module MetaInspector
-  VERSION = "1.2.0"
+  VERSION = "1.3.0"
 end

data/spec/metainspector_spec.rb CHANGED

@@ -4,13 +4,36 @@ require File.join(File.dirname(__FILE__), "/spec_helper")
 describe MetaInspector do
+  context 'Initialization' do
+    it 'should accept an URL with a scheme' do
+      @m = MetaInspector.new('http://pagerankalert.com')
+      @m.url.should == 'http://pagerankalert.com'
+    end
+    it "should use http:// as a default scheme" do
+      @m = MetaInspector.new('pagerankalert.com')
+      @m.url.should == 'http://pagerankalert.com'
+    end
+  end
   context 'Doing a basic scrape' do
+    EXPECTED_TITLE = 'PageRankAlert.com :: Track your PageRank changes'
     before(:each) do
       @m = MetaInspector.new('http://pagerankalert.com')
     end
     it "should get the title" do
-      @m.title.should == 'PageRankAlert.com :: Track your PageRank changes'
+      @m.title.should == EXPECTED_TITLE
+    end
+    it "should not find an image" do
+      @m.image.should == nil
+    end
+    it "should find an image" do
+      @m = MetaInspector.new('http://www.theonion.com/articles/apple-claims-new-iphone-only-visible-to-most-loyal,2772/')
+      @m.image.should == "http://o.onionstatic.com/images/articles/article/2772/Apple-Claims-600w-R_jpg_130x110_q85.jpg"
     end
     it "should get the links" do

metadata CHANGED

@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
   prerelease: false
   segments:
   - 1
-  - 2
+  - 3
   - 0
-  version: 1.2.0
+  version: 1.3.0
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-05-05 00:00:00 +02:00
+date: 2011-05-09 00:00:00 +02:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency