grabbit 1.0.0

data/.gitignore ADDED
@@ -0,0 +1,19 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
18
+ .rvmrc
19
+ spec/vcr
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in grabbit.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2013 Richard Larcombe
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,114 @@
1
+ # Grabbit
2
+
3
+ Grabbit is a simple URL scraper.
4
+ It returns the best image(s) to represent the content on a given web page.
5
+ Grabbit also returns a title and a description for the page.
6
+
7
+ This Gem was inspired by Facebook: When you share a URL on Facebook in a post, FB goes off in the background and pulls the title, description, and best thumbnail image(s) to accompany your post.
8
+
9
+ This gem is a simple scraper to do the same. Have fun using it in your Rails App!
10
+
11
+ ## Installation
12
+
13
+ Add this line to your application's Gemfile:
14
+
15
+ gem 'grabbit'
16
+
17
+ And then execute:
18
+
19
+ $ bundle
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install grabbit
24
+
25
+ ## Usage
26
+
27
+ Call Grabbit with a remote URL to scrape:
28
+
29
+ scrape = Grabbit.url("http://www.google.com/")
30
+
31
+ Display the page's Title:
32
+
33
+ scrape.title
34
+ => "Google"
35
+
36
+ Display the page's Description:
37
+
38
+ scrape.description
39
+ => "Search the world's information, including webpages, images, videos and
40
+ more. Google has many special features to help you find exactly what you're looking for."
41
+
42
+ Array of image URLs from the page. (In this example there is only one, but some pages may have several suitable images):
43
+
44
+ scrape.images
45
+ => ["http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"]
46
+
47
+ URL of the first image:
48
+
49
+ scrape.images.first
50
+ => "http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"
51
+
52
+ Loop through all images:
53
+
54
+ scrape.images.each do |img|
55
+ puts img
56
+ end
57
+ => "http://www.google.com/intl/en_ALL/images/srpr/logo1w.png"
58
+
59
+ Failure:
60
+
61
+ scrape = Grabbit.url("this is an invalid url")
62
+
63
+ scrape
64
+ => nil
65
+
66
+ scrape = Grabbit.url("http://www.this-is-a-valid-url-but-page-exists.com")
67
+
68
+ scrape.title
69
+ => nil
70
+
71
+ ## How it works
72
+
73
+ Grabbit uses HTTParty to grab the remote page, and then uses Nokogiri to parse the document to return the data.
74
+
75
+ #### Finding the Title of a given web page
76
+
77
+ Grabbit works on the following precedence to find the Title of the page:
78
+
79
+ > 1. Look for Facebook og:title meta-tag first. See http://ogp.me/
80
+ > 2. Look for a Twitter Card twitter:title meta-tag. See https://dev.twitter.com/docs/cards
81
+ > 3. Use the contents of the <title> tags.
82
+ > 4. Otherwise, return a blank string.
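The precedence above can be sketched without any HTML parsing, as a first-non-blank lookup over values that have already been extracted. This is only an illustrative sketch: `pick_title`, `TITLE_PRECEDENCE` and the hash keys are hypothetical names, not part of Grabbit, which reads these values from the parsed document with Nokogiri. Skipping blank values is also an assumption of the sketch.

```ruby
# Hypothetical helper (not part of Grabbit): walks the candidates in
# precedence order and returns the first non-blank one, or "" if none exists.
TITLE_PRECEDENCE = ["og:title", "twitter:title", "title"].freeze

def pick_title(candidates)
  TITLE_PRECEDENCE.each do |key|
    value = candidates[key].to_s.strip
    return value unless value.empty?
  end
  "" # nothing usable was found
end

# The Twitter Card title wins over <title> when no og:title is present.
puts pick_title({ "twitter:title" => "Card title", "title" => " Page title " })
# With only a <title>, its stripped text is used.
puts pick_title({ "title" => " Page title " })
```

The same shape would apply to the description lookup, with the keys swapped for their description counterparts.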
83
+
84
+ #### Finding the Description of a web page
85
+
86
+ Grabbit works on the following precedence to find the Description of the page:
87
+
88
+ > 1. Look for Facebook og:description meta-tag first. See http://ogp.me/
89
+ > 2. Look for a Twitter Card twitter:description meta-tag. See https://dev.twitter.com/docs/cards
90
+ > 3. Use the contents of the <meta name='description'> tags.
91
+ > 4. Otherwise, return a blank string.
92
+
93
+ #### Finding the Image(s) for the web page
94
+
95
+ Grabbit works on the following precedence to return an array of Image URLs:
96
+
97
+ > 1. Look for Facebook og:image meta-tag first. See http://ogp.me/
98
+ > 2. Look for a Twitter Card twitter:image:src meta-tag. See https://dev.twitter.com/docs/cards
99
+ > 3. Look for an image with an id of main-image (Amazon) or prodImage (WalMart).
100
+ > 4. Look for images within elements whose id contains "content", excluding sidebar, comment, footer and header sections.
101
+ > 5. Look for images within the whole page, excluding sidebar, comment, footer and header sections.
102
+ > 6. Find every image in the given page.
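The image lookup differs from the title and description lookups in that each stage yields a list, and the first stage that yields anything ends the search. A minimal sketch of that idea follows, assuming each stage is wrapped in a lambda. `first_non_empty` and the example URLs are hypothetical and not taken from Grabbit.

```ruby
# Hypothetical sketch: each strategy is a lambda returning an array of image
# URLs. Strategies run in order, and the first non-empty result is returned,
# so later (broader) searches are never executed once a match is found.
def first_non_empty(strategies)
  strategies.each do |strategy|
    result = strategy.call
    return result unless result.empty?
  end
  [] # no strategy produced any images
end

strategies = [
  -> { [] }, # stands in for "no og:image tag found"
  -> { [] }, # stands in for "no twitter:image:src tag found"
  -> { ["http://example.com/main-image.jpg"] }, # stands in for an img with id main-image
  -> { raise "broader searches are not reached" }
]

p first_non_empty(strategies)
```

Wrapping the stages in lambdas keeps the broad whole-page searches from running when a more specific source already produced a result.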
103
+
104
+
105
+
106
+
107
+
108
+ ## Contributing
109
+
110
+ 1. Fork it
111
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
112
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
113
+ 4. Push to the branch (`git push origin my-new-feature`)
114
+ 5. Create new Pull Request
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
data/grabbit.gemspec ADDED
@@ -0,0 +1,30 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'grabbit/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "grabbit"
8
+ spec.version = Grabbit::VERSION
9
+ spec.authors = ["Richard Larcombe"]
10
+ spec.email = ["rjlarcombe@gmail.com"]
11
+ spec.description = %q{Grabbit - Scrape the title, description and best thumbnail image(s) from a given URL.}
12
+ spec.summary = %q{When you share a URL on Facebook in a post, you will have noticed how FB goes off in the background and pulls the title, description, and best thumbnail images to represent the content on the page. This gem is a simple scraper to do the same.}
13
+ spec.homepage = "https://github.com/rlarcombe/grabbit"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.3"
22
+ spec.add_development_dependency "rake"
23
+ spec.add_development_dependency "rspec"
24
+ spec.add_development_dependency "shoulda-matchers"
25
+ spec.add_development_dependency "vcr"
26
+ spec.add_development_dependency "webmock"
27
+
28
+ spec.add_dependency "nokogiri"
29
+ spec.add_dependency "httparty"
30
+ end
data/lib/grabbit.rb ADDED
@@ -0,0 +1,10 @@
1
+ require "grabbit/version"
2
+ require "grabbit/scrape"
3
+
4
+ module Grabbit
5
+ extend self
6
+
7
+ def url(url = "")
8
+ Grabbit::Scrape.new(url) if url =~ URI::regexp(%w(http https))
9
+ end
10
+ end
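The check in `Grabbit.url` matches the string against a URI pattern for the `http` and `https` schemes. That pattern is not anchored, so a string that merely contains a URL may also pass. One possible stricter alternative, offered as a sketch rather than as the gem's behaviour, is to parse the whole string with the standard library. The helper name `http_url?` is hypothetical.

```ruby
require "uri"

# Hypothetical helper (not part of Grabbit): true only when the entire string
# parses as an http or https URI that has a host.
def http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  # Raised for strings that cannot be parsed as a URI, e.g. ones with spaces.
  false
end

puts http_url?("http://www.google.com/")
puts http_url?("hello")
puts http_url?("see http://example.com")
```

Under this sketch, only the first call prints `true`. The other two inputs are rejected because they are not, as a whole, an http(s) URL.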
data/lib/grabbit/scrape.rb ADDED
@@ -0,0 +1,138 @@
1
+ require 'httparty'
2
+ require 'nokogiri'
3
+
4
+ module Grabbit
5
+ class Scrape
6
+
7
+ def initialize(url)
8
+ @url = url
9
+ @doc = get_remote_data
10
+ end
11
+
12
+ def title
13
+ if @doc
14
+
15
+ # Look for og:title first
16
+ @doc.xpath("//meta[@property='og:title']/@content").each do |element|
17
+ return element.value.strip
18
+ end
19
+
20
+ # Next, look for twitter:title
21
+ @doc.xpath("//meta[@name='twitter:title']/@content").each do |element|
22
+ return element.value.strip
23
+ end
24
+
25
+ # If neither meta tag is present, fall back to the <title> tag.
26
+ @doc.css("title").each do |element|
27
+ return element.text.strip
28
+ end
29
+
30
+ # Finally return a blank string
31
+ ""
32
+ else
33
+ nil
34
+ end
35
+ end
36
+
37
+ def description
38
+ if @doc
39
+
40
+ # Look for og:description
41
+ @doc.xpath("//meta[@property='og:description']/@content").each do |element|
42
+ return element.value.strip
43
+ end
44
+
45
+ # Look for twitter:description
46
+ @doc.xpath("//meta[@name='twitter:description']/@content").each do |element|
47
+ return element.value.strip
48
+ end
49
+
50
+ # If no OG tag or Twitter card, look for <meta name='description'> tags.
51
+ @doc.xpath("//meta[@name='description']/@content").each do |element|
52
+ return element.value.strip
53
+ end
54
+
55
+
56
+ # Finally return a blank string
57
+ ""
58
+ else
59
+ nil
60
+ end
61
+ end
62
+
63
+ def images
64
+ # The following code, which returns relevant images, is based on the ideas in this blog post:
65
+ # https://tech.shareaholic.com/2012/11/02/how-to-find-the-image-that-best-respresents-a-web-page/
66
+ # If the following does not return good results consistently, then consider using
67
+ # the Fast Image Gem (https://github.com/sdsykes/fastimage).
68
+ # Check to find the 3 largest images and/or images with an aspect ratio less than 3.0
69
+
70
+ images_array = []
71
+
72
+ if @doc
73
+ # Look for OG:Image first
74
+ @doc.search('//meta[@property="og:image"]/@content').each do |a|
75
+ images_array << image_absolute_uri(a.value)
76
+ end
77
+ return images_array unless images_array.empty?
78
+
79
+ # Look for Twitter:Image
80
+ @doc.search('//meta[@name="twitter:image:src"]/@content').each do |a|
81
+ images_array << image_absolute_uri(a.value)
82
+ end
83
+ return images_array unless images_array.empty?
84
+
85
+ # Next look for image with id of main-image (--> Amazon) or prodImage (--> WalMart)
86
+ @doc.search('//img[@id="main-image" or @id="prodImage"]/@src').each do |a|
87
+ images_array << image_absolute_uri(a.value)
88
+ end
89
+ return images_array unless images_array.empty?
90
+
91
+ # Now search for all images within divs with id="content" excluding sidebar, comment, footer and header sections.
92
+ @doc.search("//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and ancestor::*[contains(@id, 'content')]]/@src").each do |a|
93
+ images_array << image_absolute_uri(a.value)
94
+ end
95
+ return images_array unless images_array.empty?
96
+
97
+
98
+ # Now search for all images in the whole page excluding sidebar, comment, footer and header sections.
99
+ @doc.search("//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')])]/@src").each do |a|
100
+ images_array << image_absolute_uri(a.value)
101
+ end
102
+ return images_array unless images_array.empty?
103
+
104
+
105
+ # Now search for all images in the whole page
106
+ @doc.search("//img/@src").each do |a|
107
+ images_array << image_absolute_uri(a.value)
108
+ end
109
+
110
+ end
111
+ images_array
112
+ end
113
+
114
+ private
115
+
116
+ def get_remote_data
117
+ begin
118
+ response = HTTParty.get(@url)
119
+ rescue
120
+ return nil
121
+ end
122
+
123
+ if response.code == 200
124
+ begin
125
+ Nokogiri::HTML(response.body)
126
+ rescue
127
+ return nil
128
+ end
129
+ else
130
+ nil
131
+ end
132
+ end
133
+
134
+ def image_absolute_uri(image_path)
135
+ URI.join(@url, image_path).to_s
136
+ end
137
+ end
138
+ end
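`image_absolute_uri` relies on `URI.join` to turn whatever appears in a `src` or `content` attribute into an absolute URL. The sketch below shows how the standard library resolves a few common forms against a page URL. The wrapper name `absolute_image_uri` and the example URLs are hypothetical.

```ruby
require "uri"

# Hypothetical wrapper mirroring the idea of image_absolute_uri: resolve an
# image path against the URL of the page it was found on.
def absolute_image_uri(page_url, image_path)
  URI.join(page_url, image_path).to_s
end

page = "http://example.com/articles/post.html"

# Relative path: resolved against the page's directory.
puts absolute_image_uri(page, "images/a.png")
# Root-relative path: resolved against the site root.
puts absolute_image_uri(page, "/images/b.png")
# Already absolute: returned unchanged.
puts absolute_image_uri(page, "http://cdn.example.com/c.png")
# Scheme-relative: inherits the page's scheme.
puts absolute_image_uri(page, "//cdn.example.com/d.png")
```

One caveat worth keeping in mind: `URI.join` raises `URI::InvalidURIError` for paths it cannot parse, such as ones containing unescaped spaces. A caller following this pattern may want to rescue that error.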
data/lib/grabbit/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module Grabbit
2
+ VERSION = "1.0.0"
3
+ end
data/spec/grabbit_spec.rb ADDED
@@ -0,0 +1,93 @@
1
+ require 'spec_helper'
2
+
3
+ describe Grabbit do
4
+
5
+ context "Bad urls" do
6
+
7
+ it "should return nil for a blank url" do
8
+ g = Grabbit.url
9
+ g.should == nil
10
+ end
11
+
12
+ it "should return nil for a badly formatted url" do
13
+ g = Grabbit.url("hello")
14
+ g.should == nil
15
+ end
16
+
17
+ it "should not return nil for a good url", :vcr do
18
+ g = Grabbit.url("http://www.google.com")
19
+ g.should_not == nil
20
+ end
21
+
22
+ it "should not return nil for 404 error", :vcr do
23
+ g = Grabbit.url("http://www.thisurldoesnotexist.com/")
24
+ g.title.should == nil
25
+ g.description.should == nil
26
+ g.images.should == []
27
+ end
28
+
29
+ end
30
+
31
+ context "Title" do
32
+ it "should return a title for a good url", :vcr do
33
+ g = Grabbit.url("http://www.drudgereport.com")
34
+ g.title.should start_with "DRUDGE REPORT"
35
+ end
36
+
37
+ it "should return a title from og:title when present", :vcr do
38
+ g = Grabbit.url("http://ogp.me/")
39
+ g.title.should == "Open Graph protocol"
40
+ end
41
+
42
+ it "should return the title from the Twitter card when present", :vcr do
43
+ g = Grabbit.url("https://dev.twitter.com/docs/cards/types/summary-card")
44
+ g.title.should == "Summary Card"
45
+ end
46
+
47
+ end
48
+
49
+ context "Description" do
50
+
51
+ it "should return a description from og:description when present", :vcr do
52
+ g = Grabbit.url("http://ogp.me/")
53
+ g.description.should == "The Open Graph protocol enables any web page to become a rich object in a social graph."
54
+ end
55
+
56
+ it "should return the description from the Twitter card when present", :vcr do
57
+ g = Grabbit.url("https://dev.twitter.com/docs/cards/types/summary-card")
58
+ g.description.should == "The Summary Card can be used for many kinds of web content, from blog posts and news articles, to products and restaurants. The screenshot below shows the expanded Tweet view for a New York Times article:"
59
+ end
60
+
61
+ it "should return a description from description meta tags when present", :vcr do
62
+ g = Grabbit.url("http://moz.com/learn/seo/meta-description")
63
+ g.description.should == "Get SEO best practices for the meta description tag, including length and content."
64
+ end
65
+
66
+ end
67
+
68
+ context "Images" do
69
+ it "should return an array", :vcr do
70
+ g = Grabbit.url("http://www.google.com")
71
+ g.images.is_a?(Array).should be_true
72
+ end
73
+
74
+ it "should return only images from og:image when present", :vcr do
75
+ g = Grabbit.url("http://ogp.me/")
76
+ g.images.first.should == "http://ogp.me/logo.png"
77
+ g.images.length.should == 1
78
+ end
79
+
80
+ it "should return images from Twitter Card when present", :vcr do
81
+ g = Grabbit.url("http://momwitha.com/2013/08/having-fun-with-pictures-at-google-headquarters/")
82
+ g.images.first.should == "http://momwitha.com/wp-content/uploads/2013/08/Google-Lobby-Sign-300x200.jpg"
83
+ g.images.length.should == 12
84
+ end
85
+
86
+ it "should return image with id of main-image for Amazon", :vcr do
87
+ g = Grabbit.url("http://www.amazon.com/gp/product/0975277324")
88
+ g.images.first.should == "http://ecx.images-amazon.com/images/I/61dDQUfhuvL._SX300_.jpg"
89
+ g.images.length.should == 1
90
+ end
91
+
92
+ end
93
+ end
data/spec/spec_helper.rb ADDED
@@ -0,0 +1,16 @@
1
+ require 'rubygems'
2
+ require 'bundler/setup'
3
+
4
+ require 'grabbit'
5
+ require 'vcr'
6
+ require 'webmock'
7
+
8
+ VCR.configure do |c|
9
+ c.cassette_library_dir = 'spec/vcr'
10
+ c.hook_into :webmock
11
+ c.configure_rspec_metadata!
12
+ end
13
+
14
+ RSpec.configure do |config|
15
+ config.treat_symbols_as_metadata_keys_with_true_values = true
16
+ end
metadata ADDED
@@ -0,0 +1,198 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: grabbit
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Richard Larcombe
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2013-10-14 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bundler
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: '1.3'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ version: '1.3'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rake
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rspec
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ - !ruby/object:Gem::Dependency
63
+ name: shoulda-matchers
64
+ requirement: !ruby/object:Gem::Requirement
65
+ none: false
66
+ requirements:
67
+ - - ! '>='
68
+ - !ruby/object:Gem::Version
69
+ version: '0'
70
+ type: :development
71
+ prerelease: false
72
+ version_requirements: !ruby/object:Gem::Requirement
73
+ none: false
74
+ requirements:
75
+ - - ! '>='
76
+ - !ruby/object:Gem::Version
77
+ version: '0'
78
+ - !ruby/object:Gem::Dependency
79
+ name: vcr
80
+ requirement: !ruby/object:Gem::Requirement
81
+ none: false
82
+ requirements:
83
+ - - ! '>='
84
+ - !ruby/object:Gem::Version
85
+ version: '0'
86
+ type: :development
87
+ prerelease: false
88
+ version_requirements: !ruby/object:Gem::Requirement
89
+ none: false
90
+ requirements:
91
+ - - ! '>='
92
+ - !ruby/object:Gem::Version
93
+ version: '0'
94
+ - !ruby/object:Gem::Dependency
95
+ name: webmock
96
+ requirement: !ruby/object:Gem::Requirement
97
+ none: false
98
+ requirements:
99
+ - - ! '>='
100
+ - !ruby/object:Gem::Version
101
+ version: '0'
102
+ type: :development
103
+ prerelease: false
104
+ version_requirements: !ruby/object:Gem::Requirement
105
+ none: false
106
+ requirements:
107
+ - - ! '>='
108
+ - !ruby/object:Gem::Version
109
+ version: '0'
110
+ - !ruby/object:Gem::Dependency
111
+ name: nokogiri
112
+ requirement: !ruby/object:Gem::Requirement
113
+ none: false
114
+ requirements:
115
+ - - ! '>='
116
+ - !ruby/object:Gem::Version
117
+ version: '0'
118
+ type: :runtime
119
+ prerelease: false
120
+ version_requirements: !ruby/object:Gem::Requirement
121
+ none: false
122
+ requirements:
123
+ - - ! '>='
124
+ - !ruby/object:Gem::Version
125
+ version: '0'
126
+ - !ruby/object:Gem::Dependency
127
+ name: httparty
128
+ requirement: !ruby/object:Gem::Requirement
129
+ none: false
130
+ requirements:
131
+ - - ! '>='
132
+ - !ruby/object:Gem::Version
133
+ version: '0'
134
+ type: :runtime
135
+ prerelease: false
136
+ version_requirements: !ruby/object:Gem::Requirement
137
+ none: false
138
+ requirements:
139
+ - - ! '>='
140
+ - !ruby/object:Gem::Version
141
+ version: '0'
142
+ description: Grabbit - Scrape the title, description and best thumbnail image(s) from
143
+ a given URL.
144
+ email:
145
+ - rjlarcombe@gmail.com
146
+ executables: []
147
+ extensions: []
148
+ extra_rdoc_files: []
149
+ files:
150
+ - .gitignore
151
+ - .rspec
152
+ - Gemfile
153
+ - LICENSE.txt
154
+ - README.md
155
+ - Rakefile
156
+ - grabbit.gemspec
157
+ - lib/grabbit.rb
158
+ - lib/grabbit/scrape.rb
159
+ - lib/grabbit/version.rb
160
+ - spec/grabbit_spec.rb
161
+ - spec/spec_helper.rb
162
+ homepage: https://github.com/rlarcombe/grabbit
163
+ licenses:
164
+ - MIT
165
+ post_install_message:
166
+ rdoc_options: []
167
+ require_paths:
168
+ - lib
169
+ required_ruby_version: !ruby/object:Gem::Requirement
170
+ none: false
171
+ requirements:
172
+ - - ! '>='
173
+ - !ruby/object:Gem::Version
174
+ version: '0'
175
+ segments:
176
+ - 0
177
+ hash: 4061797153683598554
178
+ required_rubygems_version: !ruby/object:Gem::Requirement
179
+ none: false
180
+ requirements:
181
+ - - ! '>='
182
+ - !ruby/object:Gem::Version
183
+ version: '0'
184
+ segments:
185
+ - 0
186
+ hash: 4061797153683598554
187
+ requirements: []
188
+ rubyforge_project:
189
+ rubygems_version: 1.8.25
190
+ signing_key:
191
+ specification_version: 3
192
+ summary: When you share a URL on Facebook in a post, you will have noticed how FB
193
+ goes off in the background and pulls the title, description, and best thumbnail
194
+ images to represent the content on the page. This gem is a simple scraper to do
195
+ the same.
196
+ test_files:
197
+ - spec/grabbit_spec.rb
198
+ - spec/spec_helper.rb