RubyGems - grabbit - Versions diffs - 1.0.0 - Mend

grabbit 1.0.0

Files changed (13)

data/.gitignore ADDED

@@ -0,0 +1,19 @@
+*.gem
+*.rbc
+.bundle
+.config
+.yardoc
+Gemfile.lock
+InstalledFiles
+_yardoc
+coverage
+doc/
+lib/bundler/man
+pkg
+rdoc
+spec/reports
+test/tmp
+test/version_tmp
+tmp
+.rvmrc
+spec/vcr

data/.rspec ADDED

@@ -0,0 +1 @@
+--color

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in grabbit.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2013 Richard Larcombe
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,114 @@
+# Grabbit
+Grabbit is a simple URL scraper.
+It returns the best image(s) to represent the content on a given web page.
+Grabbit also returns a Title, and a Description for the page.
+This Gem was inspired by Facebook: When you share a URL on Facebook in a post, FB goes off in the background and pulls the title, description, and best thumbnail image(s) to accompany your post.
+This gem is a simple scraper to do the same. Have fun using it in your Rails App!
+## Installation
+Add this line to your application's Gemfile:
+    gem 'grabbit'
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install grabbit
+## Usage
+Call Grabbit, with a remote URL to scrape:
+		scrape = Grabbit.url("http://www.google.com/")
+Display the page's Title:
+		scrape.title
+		 => "Google"
+Display the page's Description:
+		scrape.description
+		 => "Search the world's information, including webpages, images, videos and
+		 more. Google has many special features to help you find exactly what you're looking for."
+Array of image URLs from the page. (In this example there is only one, but some pages may have several suitable images):
+		scrape.images
+		 => ["http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"]
+URL of the first image:
+		scrape.images.first
+		 => "http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"
+Loop through all images:
+		scrape.images.each do |img|
+			puts img
+		end
+		 => "http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"
+Failure:
+		scrape = Grabbit.url("this is an invalid url")
+		scrape
+			=> nil
+		scrape = Grabbit.url("http://www.this-is-a-valid-url-but-page-exists.com")
+		scrape.title
+			=> nil
+## How it works
+Grabbit uses HTTParty to grab the remote page, and then uses Nokogiri to parse the document to return the data.
+#### Finding the Title of a given web page
+Grabbit works on the following precedence to find the Title of the page:
+> 1. Look for Facebook og:title meta-tag first. See http://ogp.me/
+> 2. Look for a Twitter Card twitter:title meta-tag. See https://dev.twitter.com/docs/cards
+> 3. Use the contents of the &lt;title&gt; tags.
+> 4. Otherwise, return a blank string.
+#### Finding the Description of a web page
+Grabbit works on the following precedence to find the Description of the page:
+> 1. Look for Facebook og:description meta-tag first. See http://ogp.me/
+> 2. Look for a Twitter Card twitter:description meta-tag. See https://dev.twitter.com/docs/cards
+> 3. Use the contents of the &lt;meta name='description'&gt; tags.
+> 4. Otherwise, return a blank string.
+#### Finding the Image(s) for the web page
+Grabbit works on the following precedence to return an array of Image URLs:
+> 1. Look for Facebook og:image meta-tag first. See http://ogp.me/
+> 2. Look for a Twitter Card twitter:image:src meta-tag. See https://dev.twitter.com/docs/cards
+> 3. Look for image with id of main-image (Amazon) or prodImage (WalMart).
+> 4. Look for images within divs with id="content" excluding sidebar, comment, footer and header sections.
+> 5. Look for images within the whole page excluding sidebar, comment, footer and header sections.
+> 6. Find every image in the given page.
+## Contributing
+1. Fork it
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create new Pull Request
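
Note on the README's "How it works" lists: each one describes a first-match-wins lookup. A minimal sketch of that pattern in plain Ruby, assuming the tag values have already been extracted into a hash. `pick_by_precedence`, `TITLE_ORDER` and the hash keys are hypothetical names for illustration, not part of the gem:

```ruby
# Hypothetical helper, not part of grabbit: return the first candidate
# that is present, walking the keys in precedence order.
def pick_by_precedence(candidates, order)
  order.each do |key|
    value = candidates[key]
    return value.strip unless value.nil?
  end
  "" # nothing matched: fall back to a blank string, as the README describes
end

# Precedence for the title: Open Graph, then Twitter Card, then <title>.
TITLE_ORDER = ["og:title", "twitter:title", "title"].freeze

# A page with no og:title, so the Twitter Card value is chosen.
page = { "twitter:title" => " Summary Card ", "title" => "Summary Card | Docs" }
puts pick_by_precedence(page, TITLE_ORDER)
```

The same helper would cover the description lookup with a different key order. The image lookup differs in that each stage yields a list and the first non-empty list is returned.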

data/Rakefile ADDED

@@ -0,0 +1 @@
+require "bundler/gem_tasks"

data/grabbit.gemspec ADDED

@@ -0,0 +1,30 @@
+# coding: utf-8
+lib = File.expand_path('../lib', __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require 'grabbit/version'
+Gem::Specification.new do |spec|
+  spec.name          = "grabbit"
+  spec.version       = Grabbit::VERSION
+  spec.authors       = ["Richard Larcombe"]
+  spec.email         = ["rjlarcombe@gmail.com"]
+  spec.description   = %q{Grabbit - Scrape the title, description and best thumbnail image(s) from a given URL.}
+  spec.summary       = %q{When you share a URL on Facebook in a post, you will have noticed how FB goes off in the background and pulls the title, description, and best thumbnail images to represent the content on the page. This gem is a simple scraper to do the same.}
+  spec.homepage      = "https://github.com/rlarcombe/grabbit"
+  spec.license       = "MIT"
+  spec.files         = `git ls-files`.split($/)
+  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  spec.add_development_dependency "bundler", "~> 1.3"
+  spec.add_development_dependency "rake"
+  spec.add_development_dependency "rspec"
+  spec.add_development_dependency "shoulda-matchers"
+  spec.add_development_dependency "vcr"
+  spec.add_development_dependency "webmock"
+  spec.add_dependency "nokogiri"
+  spec.add_dependency "httparty"
+end

data/lib/grabbit.rb ADDED

@@ -0,0 +1,10 @@
+require "grabbit/version"
+require "grabbit/scrape"
+module Grabbit
+  extend self
+  def url(url = "")
+    Grabbit::Scrape.new(url) if url =~ URI::regexp(%w(http https))
+  end
+end
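
Note on the check in `Grabbit.url`: `=~` matches anywhere in the string, so a string that merely contains an http(s) URL would likely pass. A stricter check could parse the whole string instead. A minimal sketch, assuming only Ruby's standard `uri` library; `http_url?` is a hypothetical helper, not part of the gem:

```ruby
require 'uri'

# Hypothetical helper: true only when the whole string parses as an
# http or https URI with a non-empty host.
def http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  # Raised for strings that are not URIs at all, e.g. ones containing spaces.
  false
end

puts http_url?("http://www.google.com/")
puts http_url?("hello")
puts http_url?("this is an invalid url")
```

Whether the looser behaviour matters depends on where the URLs come from; for user-submitted input the stricter form is probably the safer default.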

data/lib/grabbit/scrape.rb ADDED

@@ -0,0 +1,138 @@
+require 'httparty'
+require 'nokogiri'
+module Grabbit
+  class Scrape
+    def initialize(url)
+      @url = url
+      @doc = get_remote_data
+    end
+    def title
+      if @doc
+        # Look for og:title first
+        @doc.xpath("//meta[@property='og:title']/@content").each do |element|
+          return element.value.strip
+        end
+        # Then look for twitter:title
+        @doc.xpath("//meta[@name='twitter:title']/@content").each do |element|
+          return element.value.strip
+        end
+        # If no OG tag or Twitter card, look for <title> tags.
+        @doc.css("title").each do |element|
+          return element.text.strip
+        end
+        # Finally return a blank string
+        ""
+      else
+        nil
+      end
+    end
+    def description
+      if @doc
+        # Look for og:description first
+        @doc.xpath("//meta[@property='og:description']/@content").each do |element|
+          return element.value.strip
+        end
+        # Then look for twitter:description
+        @doc.xpath("//meta[@name='twitter:description']/@content").each do |element|
+          return element.value.strip
+        end
+        # If no OG tag or Twitter card, look for <meta name='description'> tags.
+        @doc.xpath("//meta[@name='description']/@content").each do |element|
+          return element.value.strip
+        end
+        # Finally return a blank string
+        ""
+      else
+        nil
+      end
+    end
+    def images
+      # The following code to return relevant images is based on the ideas in this blog post:
+      # https://tech.shareaholic.com/2012/11/02/how-to-find-the-image-that-best-respresents-a-web-page/
+      # If the following does not return good results consistently, then consider using
+      # the FastImage gem (https://github.com/sdsykes/fastimage).
+      # Check to find the 3 largest images and/or images with an aspect ratio less than 3.0
+      images_array = []
+      if @doc
+        # Look for og:image first
+        @doc.search('//meta[@property="og:image"]/@content').each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+        return images_array unless images_array.empty?
+        # Look for twitter:image:src
+        @doc.search('//meta[@name="twitter:image:src"]/@content').each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+        return images_array unless images_array.empty?
+        # Next look for image with id of main-image (--> Amazon) or prodImage (--> WalMart)
+        @doc.search('//img[@id="main-image" or @id="prodImage"]/@src').each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+        return images_array unless images_array.empty?
+        # Now search for all images within divs with id="content" excluding sidebar, comment, footer and header sections.
+        @doc.search("//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and ancestor::*[contains(@id, 'content')]]/@src").each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+        return images_array unless images_array.empty?
+        # Now search for all images in the whole page excluding sidebar, comment, footer and header sections.
+        @doc.search("//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')])]/@src").each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+        return images_array unless images_array.empty?
+        # Now search for all images in the whole page
+        @doc.search("//img/@src").each do |a|
+          images_array << image_absolute_uri(a.value)
+        end
+      end
+      images_array
+    end
+    private
+    def get_remote_data
+      begin
+        response = HTTParty.get(@url)
+      rescue
+        return nil
+      end
+      if response.code == 200
+        begin
+          Nokogiri::HTML(response.body)
+        rescue
+          return nil
+        end
+      else
+        nil
+      end
+    end
+    def image_absolute_uri(image_path)
+      URI.join(@url, image_path).to_s
+    end
+  end
+end
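
Note on `image_absolute_uri`: it relies on `URI.join` to turn whatever appears in a `src` or `content` attribute into an absolute URL. A short sketch of how `URI.join` resolves the common cases, using only the standard `uri` library. The `example.com` URLs are made up for illustration:

```ruby
require 'uri'

page_url = "http://example.com/articles/post.html"

# Relative path: resolved against the page's directory.
puts URI.join(page_url, "images/photo.jpg")

# Root-relative path: resolved against the host.
puts URI.join(page_url, "/logo.png")

# Already absolute: returned unchanged.
puts URI.join(page_url, "http://cdn.example.com/banner.png")
```

A `src` value that is not a valid URI reference (one containing a raw space, for example) is expected to make `URI.join` raise `URI::InvalidURIError`, which `images` does not rescue as written.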

data/lib/grabbit/version.rb ADDED

@@ -0,0 +1,3 @@
+module Grabbit
+  VERSION = "1.0.0"
+end

data/spec/grabbit_spec.rb ADDED

@@ -0,0 +1,93 @@
+require 'spec_helper'
+describe Grabbit do
+	context "Bad urls" do
+	  it "should return nil for a blank url" do
+	  	g = Grabbit.url
+	  	g.should == nil
+	  end
+	  it "should return nil for a badly formatted url" do
+	  	g = Grabbit.url("hello")
+	  	g.should == nil
+	  end
+	  it "should not return nil for a good url", :vcr do
+	  	g = Grabbit.url("http://www.google.com")
+	  	g.should_not == nil
+	  end
+	  it "should return a nil title and description and no images for a 404 error", :vcr do
+	  	g = Grabbit.url("http://www.thisurldoesnotexist.com/")
+	  	g.title.should == nil
+	  	g.description.should == nil
+	  	g.images.should == []
+	  end
+	end
+	context "Title" do
+	  it "should return a title for a good url", :vcr do
+	  	g = Grabbit.url("http://www.drudgereport.com")
+	  	g.title.should start_with "DRUDGE REPORT"
+	  end
+	  it "should return a title from og:title when present", :vcr do
+	  	g = Grabbit.url("http://ogp.me/")
+	  	g.title.should == "Open Graph protocol"
+	  end
+	  it "should return the title from the Twitter card when present", :vcr do
+	  	g = Grabbit.url("https://dev.twitter.com/docs/cards/types/summary-card")
+	  	g.title.should == "Summary Card"
+	  end
+  end
+	context "Description" do
+	  it "should return a description from og:description when present", :vcr do
+	  	g = Grabbit.url("http://ogp.me/")
+	  	g.description.should == "The Open Graph protocol enables any web page to become a rich object in a social graph."
+	  end
+	  it "should return the description from the Twitter card when present", :vcr do
+	  	g = Grabbit.url("https://dev.twitter.com/docs/cards/types/summary-card")
+	  	g.description.should == "The Summary Card can be used for many kinds of web content, from blog posts and news articles, to products and restaurants.   The screenshot below shows the expanded Tweet view for a New York Times article:"
+	  end
+	  it "should return a description from description meta tags when present", :vcr do
+	  	g = Grabbit.url("http://moz.com/learn/seo/meta-description")
+	  	g.description.should == "Get SEO best practices for the meta description tag, including length and content."
+	  end
+  end
+  context "Images" do
+	  it "should return an array", :vcr do
+	  	g = Grabbit.url("http://www.google.com")
+	  	g.images.is_a?(Array).should be_true
+	  end
+	  it "should return only images from og:image when present", :vcr do
+	  	g = Grabbit.url("http://ogp.me/")
+	  	g.images.first.should == "http://ogp.me/logo.png"
+	  	g.images.length.should == 1
+	  end
+	  it "should return images from Twitter Card when present", :vcr do
+	  	g = Grabbit.url("http://momwitha.com/2013/08/having-fun-with-pictures-at-google-headquarters/")
+	  	g.images.first.should == "http://momwitha.com/wp-content/uploads/2013/08/Google-Lobby-Sign-300x200.jpg"
+	  	g.images.length.should == 12
+	  end
+	  it "should return image with id of main-image for Amazon", :vcr do
+	  	g = Grabbit.url("http://www.amazon.com/gp/product/0975277324")
+	  	g.images.first.should == "http://ecx.images-amazon.com/images/I/61dDQUfhuvL._SX300_.jpg"
+	  	g.images.length.should == 1
+	  end
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,16 @@
+require 'rubygems'
+require 'bundler/setup'
+require 'grabbit'
+require 'vcr'
+require 'webmock'
+VCR.configure do |c|
+  c.cassette_library_dir = 'spec/vcr'
+  c.hook_into :webmock
+  c.configure_rspec_metadata!
+end
+RSpec.configure do |config|
+	config.treat_symbols_as_metadata_keys_with_true_values = true
+end

metadata ADDED

@@ -0,0 +1,198 @@
+--- !ruby/object:Gem::Specification
+name: grabbit
+version: !ruby/object:Gem::Version
+  version: 1.0.0
+  prerelease:
+platform: ruby
+authors:
+- Richard Larcombe
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-10-14 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: '1.3'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: shoulda-matchers
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: vcr
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: webmock
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: httparty
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Grabbit - Scrape the title, description and best thumbnail image(s) from
+  a given URL.
+email:
+- rjlarcombe@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- .rspec
+- Gemfile
+- LICENSE.txt
+- README.md
+- Rakefile
+- grabbit.gemspec
+- lib/grabbit.rb
+- lib/grabbit/scrape.rb
+- lib/grabbit/version.rb
+- spec/grabbit_spec.rb
+- spec/spec_helper.rb
+homepage: https://github.com/rlarcombe/grabbit
+licenses:
+- MIT
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4061797153683598554
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+      segments:
+      - 0
+      hash: 4061797153683598554
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.25
+signing_key:
+specification_version: 3
+summary: When you share a URL on Facebook in a post, you will have noticed how FB
+  goes off in the background and pulls the title, description, and best thumbnail
+  images to represent the content on the page. This gem is a simple scraper to do
+  the same.
+test_files:
+- spec/grabbit_spec.rb
+- spec/spec_helper.rb