jaimeiniesta-metainspector 1.0.3 → 1.1.0
- data/CHANGELOG.rdoc +12 -0
- data/MIT-LICENSE +20 -0
- data/README.rdoc +106 -0
- data/lib/metainspector.rb +49 -25
- data/metainspector.gemspec +4 -4
- data/samples/basic_scraping.rb +19 -0
- data/samples/spider.rb +24 -0
- data/test/test_metainspector.rb +59 -0
- metadata +9 -4
- data/README +0 -8
data/CHANGELOG.rdoc ADDED
@@ -0,0 +1,12 @@
+= 1.1.0
+=== 14th May, 2009
+* Rewritten to use instance methods instead of class methods.
+* Easier interface: provides #new(address), #scrape!, #scraped?, #title, #description, #keywords, and #links instance methods
+* Added #full_doc method to access the temporary file that contains the raw HTML fetched
+* Added #scraped_doc method to get the whole Hpricot scraped doc
+* Added tests
+* Added samples, including a basic scraping and a little spider
+
+= 1.0.3
+=== 27th June, 2008
+* Initial published version.
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2009 Jaime Iniesta
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,106 @@
+= MetaInspector
+
+MetaInspector is a gem for web scraping purposes. You give it a URL, and it returns a hash with metadata scraped from it.
+
+= Installation
+
+Run the following if you haven't already:
+
+  gem sources -a http://gems.github.com
+
+Then install the gem:
+
+  sudo gem install jaimeiniesta-metainspector
+
+= Usage
+
+Initialize a MetaInspector instance with a URL, like this:
+
+  page = MetaInspector.new('http://pagerankalert.com')
+
+Then you can tell it to fetch and scrape the URL:
+
+  page.scrape!
+
+Once scraped, you can see the returned data like this:
+
+  page.address     # URL of the page
+  page.title       # title of the page, as a string
+  page.description # meta description, as a string
+  page.keywords    # meta keywords, as a string
+  page.links       # array of strings, with every link found on the page
+
+You can check whether the scraping went OK by looking at what page.scrape! returns (true or false), or by calling the page.scraped? method, which returns false if no successful scraping has finished since the last address change.
+You can also change the address of the page to be scraped using the address= setter, like this:
+
+  page.address = "http://jaimeiniesta.com"
+
+Doing so resets the MetaInspector instance to its initial state (not scraped yet, stored meta data cleared). You can then scrape the new address by calling the page.scrape! method again.
+
+The full fetched document and the scraped doc are accessible from:
+
+  page.full_doc    # points to the temp file where the fetched doc is stored
+  page.scraped_doc # Hpricot doc that you can use to get any element from the page
+
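Since page.scraped_doc exposes the raw Hpricot document, anything the built-in accessors don't cover can be queried by hand. A minimal sketch, not part of the gem (the target URL and the //img search are only illustrative), using the same attributes-hash access the gem itself uses for links:

  require 'rubygems'
  require 'metainspector'

  page = MetaInspector.new('http://pagerankalert.com')
  if page.scrape!
    # query the Hpricot doc directly for elements MetaInspector does not extract yet
    srcs = page.scraped_doc.search('//img').map { |img| img.attributes['src'] }.compact
    puts srcs
  end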
+= Examples
+
+You can find some sample scripts in the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
+
+  $ irb
+  >> require 'metainspector'
+  => true
+
+  >> page = MetaInspector.new('http://pagerankalert.com')
+  => #<MetaInspector:0x5fc594 @full_doc=nil, @scraped=false, @description=nil, @links=nil, @address="http://pagerankalert.com", @keywords=nil, @scraped_doc=nil, @title=nil>
+
+  >> page.scrape!
+  => true
+
+  >> page.title
+  => "PageRankAlert.com :: Track your pagerank changes"
+
+  >> page.description
+  => "Track your PageRank(TM) changes and receive alert by email"
+
+  >> page.keywords
+  => "pagerank, seo, optimization, google"
+
+  >> page.links.size
+  => 31
+
+  >> page.links[30]
+  => "http://www.nuvio.cz/"
+
+  >> page.full_doc
+  => #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>
+
+  >> page.scraped_doc.class
+  => Hpricot::Doc
+
+  >> page.scraped?
+  => true
+
+  >> page.address = "http://jaimeiniesta.com"
+  => "http://jaimeiniesta.com"
+
+  >> page.scraped?
+  => false
+
+  >> page.scrape!
+  => true
+
+  >> page.scraped?
+  => true
+
+  >> page.title
+  => "ruby on rails freelance developer — Jaime Iniesta"
+
+= To Do
+
+* Mocks
+* Check the content type, and process only HTML pages (i.e., don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2)
+* Return an array of the images in the page
+* Return the contents of the meta robots tag
+* Consider using Nokogiri instead of Hpricot (this and the content-type check are sketched just below)
+
+Copyright (c) 2009 Jaime Iniesta, released under the MIT license
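As a rough sketch of the content-type and Nokogiri items from the To Do list above: nothing here ships with this release, scrape_html_only is a hypothetical helper, and only standard open-uri (content_type) and Nokogiri (HTML, at, inner_html) calls are used:

  require 'rubygems'
  require 'open-uri'
  require 'nokogiri'

  # Fetch a page and parse it only when the server says it is HTML
  def scrape_html_only(address)
    doc = open(address)
    return nil unless doc.content_type == 'text/html' # open-uri exposes the Content-Type header
    Nokogiri::HTML(doc)
  end

  parsed = scrape_html_only('http://pagerankalert.com')
  puts parsed.at('title').inner_html.strip if parsed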
data/lib/metainspector.rb CHANGED
@@ -1,47 +1,71 @@
-
-
-
-require 'hpricot'
+require 'open-uri'
+require 'rubygems'
+require 'hpricot'
 
-
+# MetaInspector provides an easy way to scrape web pages and get their elements
+class MetaInspector
+  VERSION = '1.1.0'
 
   Hpricot.buffer_size = 300000
-
-
-
+
+  attr_reader :address, :title, :description, :keywords, :links, :full_doc, :scraped_doc
+
+  # Initializes a new instance of MetaInspector, setting the URL address to the one given
+  # TODO: validate the address as an http URL; don't initialize it if the format is wrong
+  def initialize(address)
+    @address = address
+    @scraped = false
+
+    @title = @description = @keywords = @links = @full_doc = @scraped_doc = nil
+  end
+
+  # Setter for address. Resets the whole state, as the address is being changed.
+  def address=(address)
+    initialize(address)
+  end
+
+  # Visit the web page, get its contents, and parse it
+  def scrape!
+    @full_doc = open(@address)
+    @scraped_doc = Hpricot(@full_doc)
 
     # Searching title...
-    if
-    title =
+    if @scraped_doc.at('title')
+      @title = @scraped_doc.at('title').inner_html.strip
     else
-    title = ""
+      @title = ""
     end
 
     # Searching meta description...
-    if
-    description =
+    if @scraped_doc.at("meta[@name='description']")
+      @description = @scraped_doc.at("meta[@name='description']")['content'].strip
    else
-    description = ""
+      @description = ""
    end
 
     # Searching meta keywords...
-    if
-    keywords =
+    if @scraped_doc.at("meta[@name='keywords']")
+      @keywords = @scraped_doc.at("meta[@name='keywords']")['content'].strip
    else
-    keywords = ""
+      @keywords = ""
    end
 
     # Searching links...
-    links = []
-
-    links << link.attributes["href"] if (!link.attributes["href"].nil?)
+    @links = []
+    @scraped_doc.search("//a").each do |link|
+      @links << link.attributes["href"].strip if (!link.attributes["href"].nil?)
+    end
-
-#
-
+
+    # Mark scraping as success
+    @scraped = true
 
   rescue SocketError
     puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
-
+    @scraped = false
+  end
+
+  # Syntactic sugar
+  def scraped?
+    @scraped
   end
-end
+end
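A note for readers of the diff: scrape! relies on Ruby's method-level rescue (no explicit begin needed), and its return value is whatever @scraped was last set to, so callers can branch on it directly. A minimal sketch (the dead hostname is made up):

  page = MetaInspector.new('http://some.nonexistent.host') # hypothetical unreachable address
  if page.scrape!
    puts page.title
  else
    puts 'could not scrape' # the SocketError was rescued and scrape! returned false
  end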
data/metainspector.gemspec CHANGED
@@ -1,14 +1,14 @@
 Gem::Specification.new do |s|
   s.name = "metainspector"
-  s.version = "1.0.3"
-  s.date = "
+  s.version = "1.1.0"
+  s.date = "2009-05-14"
   s.summary = "Ruby gem for web scraping"
   s.email = "jaimeiniesta@gmail.com"
-  s.homepage = "http://
+  s.homepage = "http://github.com/jaimeiniesta/metainspector/tree/master"
   s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
   s.has_rdoc = false
   s.authors = ["Jaime Iniesta"]
-  s.files = ["README", "metainspector.gemspec", "lib/metainspector.rb", "test/test_metainspector.rb"]
+  s.files = ["README.rdoc", "CHANGELOG.rdoc", "MIT-LICENSE", "metainspector.gemspec", "lib/metainspector.rb", "test/test_metainspector.rb", "samples/basic_scraping.rb", "samples/spider.rb"]
   s.test_files = []
   s.rdoc_options = []
   s.extra_rdoc_files = []
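Not specific to this project, but for anyone packaging it locally, a gemspec like the one above builds into a .gem with the standard RubyGems command:

  $ gem build metainspector.gemspec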
data/samples/basic_scraping.rb ADDED
@@ -0,0 +1,19 @@
+# Some basic MetaInspector samples
+
+require '../lib/metainspector.rb'
+
+puts "Enter a valid http address to scrape it"
+address = gets.strip # strip the trailing newline, so open-uri gets a clean URL
+page = MetaInspector.new(address)
+puts "Scraping #{address}"
+puts "...please wait..."
+page.scrape!
+
+puts "Scraping #{page.address} returned these results:"
+puts "TITLE: #{page.title}"
+puts "DESCRIPTION: #{page.description}"
+puts "KEYWORDS: #{page.keywords}"
+puts "#{page.links.size} links found..."
+page.links.each do |link|
+  puts " ==> #{link}"
+end
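To try the sample, run it from inside the samples directory (the relative require above expects that), with the hpricot gem installed:

  $ cd samples
  $ ruby basic_scraping.rb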
data/samples/spider.rb ADDED
@@ -0,0 +1,24 @@
+# A basic spider that will follow links on an infinite loop
+require '../lib/metainspector.rb'
+
+q = Queue.new
+visited_links = []
+
+puts "Enter a valid http address to spider it following external links"
+address = gets.strip
+page = MetaInspector.new(address)
+q.push(address)
+
+while q.size > 0
+  visited_links << address = q.pop
+  page.address = address
+  puts "Spidering #{page.address}"
+  next unless page.scrape! # a failed scrape leaves page.links as nil, so skip the page
+  puts "TITLE: #{page.title}"
+  page.links.each do |link|
+    if link[0..6].downcase == 'http://' && !visited_links.include?(link)
+      q.push(link)
+    end
+  end
+  puts "#{visited_links.size} pages visited, #{q.size} pages on queue\n\n"
+end
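One more note on the spider: the link[0..6] check only follows absolute http:// links, silently dropping relative URLs and https. A hypothetical stricter filter using only the standard library (followable? is a made-up helper, not part of the gem):

  require 'uri'

  def followable?(link, visited)
    uri = URI.parse(link) rescue nil
    # follow only well-formed, absolute http(s) URLs we have not visited yet
    !uri.nil? && uri.is_a?(URI::HTTP) && !visited.include?(link)
  end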
data/test/test_metainspector.rb CHANGED
@@ -1 +1,60 @@
+require 'test/unit'
+require '../lib/metainspector.rb'
 
+class TestMetaInspector < Test::Unit::TestCase
+  # TODO: mock tests
+
+  # Test that we can initialize a new instance, setting its address, and that the initial
+  # state is not scraped, with every meta data value set to nil
+  # TODO: validate URL format, only http and https allowed
+  def test_initialize
+    m = MetaInspector.new('http://pagerankalert.com')
+    assert_equal m.address, 'http://pagerankalert.com'
+    assert_equal m.scraped?, false
+    assert_nil m.title
+    assert_nil m.description
+    assert_nil m.keywords
+    assert_nil m.links
+    assert_nil m.full_doc
+    assert_nil m.scraped_doc
+  end
+
+  # Test scraping a URL, marking it as scraped, and setting the meta data values
+  # TODO: check timeouts
+  def test_scrape!
+    m = MetaInspector.new('http://pagerankalert.com')
+    assert m.scrape!
+    assert m.scraped?
+    assert_equal m.title, 'PageRankAlert.com :: Track your pagerank changes'
+    assert_equal m.description, 'Track your PageRank(TM) changes and receive alert by email'
+    assert_equal m.keywords, 'pagerank, seo, optimization, google'
+    assert_equal m.links.size, 31
+    assert_equal m.links[30], 'http://www.nuvio.cz/'
+    assert_equal m.full_doc.class, Tempfile
+    assert_equal m.scraped_doc.class, Hpricot::Doc
+  end
+
+  # Test that changing the address resets the state of the instance
+  def test_address_setter
+    m = MetaInspector.new('http://pagerankalert.com')
+    assert_equal m.address, 'http://pagerankalert.com'
+    m.scrape!
+    assert m.scraped?
+    assert_not_nil m.title
+    assert_not_nil m.description
+    assert_not_nil m.keywords
+    assert_not_nil m.links
+    assert_not_nil m.full_doc
+    assert_not_nil m.scraped_doc
+
+    m.address = 'http://jaimeiniesta.com'
+    assert_equal m.address, 'http://jaimeiniesta.com'
+    assert !m.scraped?
+    assert_nil m.title
+    assert_nil m.description
+    assert_nil m.keywords
+    assert_nil m.links
+    assert_nil m.full_doc
+    assert_nil m.scraped_doc
+  end
+end
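The "TODO: mock tests" above invites an obvious follow-up. A speculative sketch (mocha is not a dependency of this release; the class name, fake HTML, and stubbing approach are the reviewer's own) of how the open-uri call could be stubbed so the suite stops depending on live sites:

  require 'rubygems'
  require 'stringio'
  require 'test/unit'
  require 'mocha'
  require '../lib/metainspector.rb'

  class TestMetaInspectorOffline < Test::Unit::TestCase
    FAKE_HTML = '<html><head><title>Fake</title></head><body></body></html>'

    def test_scrape_without_network
      m = MetaInspector.new('http://example.com/')
      # scrape! calls open(@address) with an implicit self receiver,
      # so stubbing #open on the instance intercepts the open-uri call
      m.stubs(:open).returns(StringIO.new(FAKE_HTML))
      assert m.scrape!
      assert_equal 'Fake', m.title
    end
  end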
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jaimeiniesta-metainspector
 version: !ruby/object:Gem::Version
-  version: 1.0.3
+  version: 1.1.0
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -9,11 +9,12 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date:
+date: 2009-05-14 00:00:00 -07:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
   name: hpricot
+  type: :runtime
   version_requirement:
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
@@ -30,12 +31,16 @@ extensions: []
 extra_rdoc_files: []
 
 files:
-- README
+- README.rdoc
+- CHANGELOG.rdoc
+- MIT-LICENSE
 - metainspector.gemspec
 - lib/metainspector.rb
 - test/test_metainspector.rb
+- samples/basic_scraping.rb
+- samples/spider.rb
 has_rdoc: false
-homepage: http://
+homepage: http://github.com/jaimeiniesta/metainspector/tree/master
 post_install_message:
 rdoc_options: []
 
data/README DELETED
@@ -1,8 +0,0 @@
---------------
-MetaInspector
---------------
-DESCRIPTION: Ruby gem for web scraping purposes. It scrapes a given URL, and returns you a hash with data from it like for example the title, meta description, meta keywords, an array with all the links, all the images in it, etc.
-AUTHOR: Jaime Iniesta (jaimeiniesta@gmail.com, http://www.jaimeiniesta.com)
-LICENSE: GPLv3
-
-(more to come soon...)