jaimeiniesta-metainspector 1.1.2 → 1.1.3
- data/CHANGELOG.rdoc +7 -0
- data/README.rdoc +18 -46
- data/lib/metainspector.rb +27 -32
- data/metainspector.gemspec +10 -3
- data/samples/basic_scraping.rb +0 -1
- data/samples/spider.rb +0 -1
- data/test/test_metainspector.rb +20 -36
- metadata +3 -4
data/CHANGELOG.rdoc
CHANGED
@@ -1,6 +1,13 @@
+= 1.1.3
+=== 22nd May, 2009
+* Simplified code: now there's no need to call page.scrape!, just initialize it and go directly to page.address, page.title, page.description, page.keywords or page.links; the page will be scraped on the fly
+* Removed page.scraped?, page.scrape!, page.full_doc and page.scraped_doc
+* Added page.document, which returns the whole document scraped with Nokogiri
+
 = 1.1.2
 === 19th May, 2009
 * Using nokogiri instead of hpricot
+* Recover from exceptions
 
 = 1.1.1
 === 14th May, 2009
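The lazy-scraping idea behind 1.1.3 can be sketched with a small stand-in class. LazyPage below is hypothetical (it is not the gem's class), and the HTTP fetch is replaced by a counter so the sketch runs offline; the point is that nothing happens at initialization and the first accessor call does the work.

```ruby
# Sketch of the 1.1.3 lazy-scraping idea: no explicit scrape! call;
# the page is "fetched" on the first accessor call and memoized.
# LazyPage is a hypothetical stand-in for MetaInspector.
class LazyPage
  attr_reader :address, :fetch_count

  def initialize(address)
    @address = address
    @fetch_count = 0
    @title = nil
  end

  # Scrapes on the fly: the first call does the work, later calls reuse it.
  def title
    @title ||= fetch_title
  end

  private

  # Stand-in for the real open(@address) + Nokogiri parse.
  def fetch_title
    @fetch_count += 1
    "Title of #{@address}"
  end
end

page = LazyPage.new('http://example.com')
3.times { page.title }
puts page.fetch_count # => 1
```

Because the accessor memoizes with `||=`, repeated reads cost nothing after the first one, which is why 1.1.3 could drop `scrape!` and `scraped?` entirely.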
data/README.rdoc
CHANGED
@@ -34,11 +34,7 @@ Initialize a MetaInspector instance with an URL like this:
 
 page = MetaInspector.new('http://pagerankalert.com')
 
-
-
-page.scrape!
-
-Once scraped, you can see the returned data like this:
+Once scraped, you can see the scraped data like this:
 
 page.address # URL of the page
 page.title # title of the page, as string
@@ -46,17 +42,15 @@ Once scraped, you can see the returned data like this:
 page.keywords # meta keywords, as string
 page.links # array of strings, with every link found on the page
 
-You can see if the scraping process went ok checking what page.scrape! returns (true or false), or checking the page.scraped? method, which returns false if no successfull scraping has been finished since the last address change.
 You can also change the address of the page to be scraped using the address= setter, like this:
 
 page.address="http://jaimeiniesta.com"
 
-Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data).
+Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data). The page will be re-scraped when you consult any of its metadata again.
 
-The full
+The full scraped document is accessible from:
 
-page.
-page.scraped_doc # Hpricot doc that you can use it to get any element from the page
+page.document # Nokogiri doc that you can use to get any element from the page
 
 = Examples
 
@@ -65,52 +59,33 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 $ irb
 >> require 'metainspector'
 => true
-
+
 >> page = MetaInspector.new('http://pagerankalert.com')
-=> #<MetaInspector:
-
-
->> page.scrape!
-=> true
-
+=> #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
+
 >> page.title
 => "PageRankAlert.com :: Track your pagerank changes"
-
+
 >> page.description
 => "Track your PageRank(TM) changes and receive alert by email"
-
+
 >> page.keywords
 => "pagerank, seo, optimization, google"
-
+
 >> page.links.size
 => 31
-
+
 >> page.links[30]
 => "http://www.nuvio.cz/"
-
->> page.
-=> #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>
-
->> page.scraped_doc.class
+
+>> page.document.class
 => Nokogiri::HTML::Document
-
->> page.scraped?
-=> true
-
+
 >> page.address="http://jaimeiniesta.com"
 => "http://jaimeiniesta.com"
-
->> page.scraped?
-=> false
-
->> page.scrape!
-=> true
-
->> page.scraped?
-=> true
-
+
 >> page.title
-=> "ruby on rails freelance developer
+=> "ruby on rails freelance developer -- Jaime Iniesta"
 
 = To Do
 
@@ -119,12 +94,9 @@ You can find some sample scripts on the samples folder, including a basic scrapi
 * Return array of images in page as absolute URLs
 * Return contents of meta robots tag
 * Be able to set a timeout in seconds
-*
-* Recover from Errno::ECONNREFUSED
+* Detect charset
 * If keywords seem to be separated by blank spaces, replace them with commas
 * Mocks
-* Check content type, process only HTML
-** Don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2
-** Don't try to scrape http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
+* Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
 
 Copyright (c) 2009 Jaime Iniesta, released under the MIT license
data/lib/metainspector.rb
CHANGED
@@ -4,18 +4,16 @@ require 'nokogiri'
 
 # MetaInspector provides an easy way to scrape web pages and get its elements
 class MetaInspector
-  VERSION = '1.1.2'
+  VERSION = '1.1.3'
 
-  attr_reader :address
+  attr_reader :address
 
   # Initializes a new instance of MetaInspector, setting the URL address to the one given
   # TODO: validate address as http URL, dont initialize it if wrong format
   def initialize(address)
     @address = address
-    @scraped = false
 
-    @
-    @links = []
+    @document = @title = @description = @keywords = @links = nil
   end
 
   # Setter for address. Initializes the whole state as the address is being changed.
@@ -23,29 +21,30 @@ class MetaInspector
     initialize(address)
   end
 
-  #
-  def
-    @
-
-
-
-
-
-
-
-
-    @keywords
-
-
-
-
-
-
+  # Returns the parsed document title
+  def title
+    @title ||= document.css('title').inner_html rescue nil
+  end
+
+  # Returns the parsed document meta description
+  def description
+    @description ||= document.css("meta[@name='description']").first['content'] rescue nil
+  end
+
+  # Returns the parsed document meta keywords
+  def keywords
+    @keywords ||= document.css("meta[@name='keywords']").first['content'] rescue nil
+  end
+
+  # Returns the parsed document links
+  def links
+    @links ||= document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
+  end
+
+  # Returns the whole parsed document
+  def document
+    @document ||= Nokogiri::HTML(open(@address))
 
-    # Mark scraping as success
-    @scraped = true
-
   rescue SocketError
     puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
     @scraped = false
@@ -54,9 +53,5 @@ class MetaInspector
   rescue
     puts 'An exception occurred while trying to scrape the page!'
   end
-
-  # Syntactic sugar
-  def scraped?
-    @scraped
-  end
+
 end
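The accessors added above all follow one pattern: memoize with `||=` and turn a failed lookup into nil with a rescue modifier, so a page missing a meta tag yields nil instead of raising. A minimal offline sketch of that pattern, where PageLike and its Hash-backed "document" are hypothetical stand-ins for the real Nokogiri-backed methods:

```ruby
# Sketch of the accessor pattern: memoize with ||= and map a failed
# lookup to nil via a rescue modifier. PageLike is a hypothetical
# stand-in; a Hash plays the role of the parsed document.
class PageLike
  def initialize(meta)
    @meta = meta
  end

  def description
    # Hash#fetch raises KeyError when the key is missing; the rescue
    # modifier maps that to nil, mirroring `... rescue nil` in the gem.
    @description ||= (@meta.fetch('description') rescue nil)
  end
end

with_meta    = PageLike.new('description' => 'Track your PageRank changes')
without_meta = PageLike.new({})

puts with_meta.description            # => Track your PageRank changes
puts without_meta.description.inspect # => nil
```

One trade-off of this pattern: since nil is falsy in Ruby, an attribute that is genuinely absent is looked up again on every call, because `||=` never sees a cached value.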
data/metainspector.gemspec
CHANGED
@@ -1,6 +1,6 @@
 Gem::Specification.new do |s|
   s.name = "metainspector"
-  s.version = "1.1.2"
+  s.version = "1.1.3"
   s.date = "2009-05-19"
   s.summary = "Ruby gem for web scraping"
   s.email = "jaimeiniesta@gmail.com"
@@ -8,8 +8,15 @@ Gem::Specification.new do |s|
   s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
   s.has_rdoc = false
   s.authors = ["Jaime Iniesta"]
-  s.files = [
-
+  s.files = [
+    "README.rdoc",
+    "CHANGELOG.rdoc",
+    "MIT-LICENSE",
+    "metainspector.gemspec",
+    "lib/metainspector.rb",
+    "samples/basic_scraping.rb",
+    "samples/spider.rb"]
+  s.test_files = ["test/test_metainspector.rb"]
   s.rdoc_options = []
   s.extra_rdoc_files = []
   s.add_dependency("nokogiri", ["> 1.2"])
data/samples/basic_scraping.rb
CHANGED
data/samples/spider.rb
CHANGED
data/test/test_metainspector.rb
CHANGED
@@ -3,58 +3,42 @@ require '../lib/metainspector.rb'
 
 class TestMetaInspector < Test::Unit::TestCase
   # TODO: mock tests
-
-  # Test we can initialize a new instance, setting its address, and initial state
-  # is not scraped and every meta data value set to nil
   # TODO: validate URL format, only http and https allowed
-
-    m = MetaInspector.new('http://pagerankalert.com')
-    assert_equal m.address, 'http://pagerankalert.com'
-    assert_equal m.scraped?, false
-    assert_nil m.title
-    assert_nil m.description
-    assert_nil m.keywords
-    assert_equal m.links.size, 0
-    assert_nil m.full_doc
-    assert_nil m.scraped_doc
-  end
+  # TODO: check timeouts
 
   # Test scraping an URL, marking it as scraped and setting meta data values
-
-  def test_scrape!
+  def test_scrape
     m = MetaInspector.new('http://pagerankalert.com')
-    assert m.scrape!
-    assert m.scraped?
     assert_equal m.title, 'PageRankAlert.com :: Track your pagerank changes'
     assert_equal m.description, 'Track your PageRank(TM) changes and receive alert by email'
     assert_equal m.keywords, 'pagerank, seo, optimization, google'
     assert_equal m.links.size, 31
     assert_equal m.links[30], 'http://www.nuvio.cz/'
-    assert_equal m.
-    assert_equal m.scraped_doc.class, Nokogiri::HTML::Document
+    assert_equal m.document.class, Nokogiri::HTML::Document
   end
 
-  # Test changing the address resets the state of the instance
+  # Test changing the address resets the state of the instance so it causes a new scraping
   def test_address_setter
     m = MetaInspector.new('http://pagerankalert.com')
     assert_equal m.address, 'http://pagerankalert.com'
-    m.
-
-
-
-
-    assert_not_nil m.links
-    assert_not_nil m.full_doc
-    assert_not_nil m.scraped_doc
+    title_1 = m.title
+    description_1 = m.description
+    keywords_1 = m.keywords
+    links_1 = m.links
+    document_1 = m.document
 
     m.address = 'http://jaimeiniesta.com'
     assert_equal m.address, 'http://jaimeiniesta.com'
-
-
-
-
-
-
-
+    title_2 = m.title
+    description_2 = m.description
+    keywords_2 = m.keywords
+    links_2 = m.links
+    document_2 = m.document
+
+    assert_not_equal title_1, title_2
+    assert_not_equal description_1, description_2
+    assert_not_equal keywords_1, keywords_2
+    assert_not_equal links_1, links_2
+    assert_not_equal document_1, document_2
   end
 end
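The rewritten tests above still hit the network, which is what the "mock tests" TODO is about. A hedged offline sketch of the same address-setter assertion: FakePage is a hypothetical stand-in whose titles come from a Hash instead of an HTTP request, and its `address=` reuses the gem's reset-by-initialize trick.

```ruby
# Sketch of the "mock tests" TODO: run the address-setter check offline.
# FakePage is a hypothetical stand-in for MetaInspector; TITLES plays
# the role of the network.
TITLES = {
  'http://pagerankalert.com' => 'PageRankAlert.com :: Track your pagerank changes',
  'http://jaimeiniesta.com'  => 'ruby on rails freelance developer -- Jaime Iniesta'
}

class FakePage
  attr_reader :address

  def initialize(address)
    @address = address
    @title = nil
  end

  # Changing the address resets all memoized state, as in the gem.
  def address=(address)
    initialize(address)
  end

  def title
    @title ||= TITLES[@address]
  end
end

page = FakePage.new('http://pagerankalert.com')
title_1 = page.title
page.address = 'http://jaimeiniesta.com'
title_2 = page.title
puts title_1 == title_2 # => false
```

Calling `initialize` from `address=` works because a private method may be called with an implicit receiver; it is exactly how the gem's setter resets its state.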
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jaimeiniesta-metainspector
 version: !ruby/object:Gem::Version
-  version: 1.1.2
+  version: 1.1.3
 platform: ruby
 authors:
 - Jaime Iniesta
@@ -36,7 +36,6 @@ files:
 - MIT-LICENSE
 - metainspector.gemspec
 - lib/metainspector.rb
-- test/test_metainspector.rb
 - samples/basic_scraping.rb
 - samples/spider.rb
 has_rdoc: false
@@ -65,5 +64,5 @@ rubygems_version: 1.2.0
 signing_key:
 specification_version: 2
 summary: Ruby gem for web scraping
-test_files:
-
+test_files:
+- test/test_metainspector.rb