jaimeiniesta-metainspector 1.1.2 → 1.1.3

data/CHANGELOG.rdoc CHANGED
@@ -1,6 +1,13 @@
+ = 1.1.3
+ === 22nd May, 2009
+ * Simplified code: now there's no need to call page.scrape!, just initialize it and go directly to page.address, page.title, page.description, page.keywords or page.links, the page will be scraped on the fly
+ * Removed page.scraped?, page.scrape!, page.full_doc and page.scraped_doc
+ * Added page.document, which returns the whole document scraped with nokogiri
+
  = 1.1.2
  === 19th May, 2009
  * Using nokogiri instead of hpricot
+ * Recover from exceptions
 
  = 1.1.1
  === 14th May, 2009
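
In practice, the 1.1.3 change collapses the old two-step workflow into a single lazy one. A minimal before/after sketch, using only the accessors named in the changelog above:

  require 'rubygems'
  require 'metainspector'

  # 1.1.2 and earlier: scraping was an explicit step
  #   page = MetaInspector.new('http://pagerankalert.com')
  #   page.scrape!
  #   puts page.title

  # 1.1.3: the page is fetched and parsed on first access
  page = MetaInspector.new('http://pagerankalert.com')
  puts page.title      # triggers the scrape
  puts page.keywords   # answered from the already-parsed document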
data/README.rdoc CHANGED
@@ -34,11 +34,7 @@ Initialize a MetaInspector instance with an URL like this:
 
  page = MetaInspector.new('http://pagerankalert.com')
 
- Then you can tell it to fetch and scrape the URL:
-
- page.scrape!
-
- Once scraped, you can see the returned data like this:
+ Then you can see the scraped data like this:
 
  page.address # URL of the page
  page.title # title of the page, as string
@@ -46,17 +42,15 @@ Once scraped, you can see the returned data like this:
  page.keywords # meta keywords, as string
  page.links # array of strings, with every link found on the page
 
- You can see if the scraping process went ok checking what page.scrape! returns (true or false), or checking the page.scraped? method, which returns false if no successfull scraping has been finished since the last address change.
  You can also change the address of the page to be scraped using the address= setter, like this:
 
  page.address="http://jaimeiniesta.com"
 
- Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data). You can re-scrape it again by calling the page.scrape! method.
+ Doing so resets the state of the MetaInspector instance to the initial state (not scraped yet, cleared stored meta data). The page will be re-scraped when you consult any of its metadata again.
 
- The full fetched document and the scraped doc are accessible from:
+ The full scraped document is accessible from:
 
- page.full_doc # it points to the temp file where the fetched doc is stored
- page.scraped_doc # Hpricot doc that you can use it to get any element from the page
+ page.document # Nokogiri doc that you can use to get any element from the page
 
  = Examples
 
@@ -65,52 +59,33 @@ You can find some sample scripts on the samples folder, including a basic scrapi
  $ irb
  >> require 'metainspector'
  => true
-
+
  >> page = MetaInspector.new('http://pagerankalert.com')
- => #<MetaInspector:0x5fc594 @full_doc=nil, @scraped=false, @description=nil, @links=nil,
- @address="http://pagerankalert.com", @keywords=nil, @scraped_doc=nil, @title=nil>
-
- >> page.scrape!
- => true
-
+ => #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil>
+
  >> page.title
  => "PageRankAlert.com :: Track your pagerank changes"
-
+
  >> page.description
  => "Track your PageRank(TM) changes and receive alert by email"
-
+
  >> page.keywords
  => "pagerank, seo, optimization, google"
-
+
  >> page.links.size
  => 31
-
+
  >> page.links[30]
  => "http://www.nuvio.cz/"
-
- >> page.full_doc
- => #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>
-
- >> page.scraped_doc.class
+
+ >> page.document.class
  => Nokogiri::HTML::Document
-
- >> page.scraped?
- => true
-
+
  >> page.address="http://jaimeiniesta.com"
  => "http://jaimeiniesta.com"
-
- >> page.scraped?
- => false
-
- >> page.scrape!
- => true
-
- >> page.scraped?
- => true
-
+
  >> page.title
- => "ruby on rails freelance developer &#8212; Jaime Iniesta"
+ => "ruby on rails freelance developer -- Jaime Iniesta"
 
  = To Do
 
@@ -119,12 +94,9 @@ You can find some sample scripts on the samples folder, including a basic scrapi
  * Return array of images in page as absolute URLs
  * Return contents of meta robots tag
  * Be able to set a timeout in seconds
- * Recover from Timeout exception
- * Recover from Errno::ECONNREFUSED
+ * Detect charset
  * If keywords seem to be separated by blank spaces, replace them with commas
  * Mocks
- * Check content type, process only HTML pages_
- ** Don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2
- ** Don't try to scrape http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
+ * Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
 
  Copyright (c) 2009 Jaime Iniesta, released under the MIT license
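
To illustrate the two behaviours the README describes, here is a short sketch: querying page.document directly for an element the accessors don't cover (the meta robots selector is an illustration, not a built-in accessor, robots support is still on the To Do list), and re-pointing the instance with the address= setter:

  page = MetaInspector.new('http://pagerankalert.com')

  # page.document is a Nokogiri::HTML::Document, so any CSS/XPath query works
  robots = page.document.css("meta[@name='robots']").first
  puts robots['content'] if robots

  # Changing the address clears the stored metadata; the next accessor
  # call scrapes the new URL on the fly
  page.address = 'http://jaimeiniesta.com'
  puts page.title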
data/lib/metainspector.rb CHANGED
@@ -4,18 +4,16 @@ require 'nokogiri'
 
  # MetaInspector provides an easy way to scrape web pages and get its elements
  class MetaInspector
-   VERSION = '1.1.2'
+   VERSION = '1.1.3'
 
-   attr_reader :address, :title, :description, :keywords, :links, :full_doc, :scraped_doc
+   attr_reader :address
 
    # Initializes a new instance of MetaInspector, setting the URL address to the one given
    # TODO: validate address as http URL, dont initialize it if wrong format
    def initialize(address)
      @address = address
-     @scraped = false
 
-     @title = @description = @keywords = @full_doc = @scraped_doc = nil
-     @links = []
+     @document = @title = @description = @keywords = @links = nil
    end
 
    # Setter for address. Initializes the whole state as the address is being changed.
@@ -23,29 +21,30 @@ class MetaInspector
      initialize(address)
    end
 
-   # Visit web page, get its contents, and parse it
-   def scrape!
-     @full_doc = open(@address)
-     @scraped_doc = Nokogiri::HTML(@full_doc)
-
-     # Searching title...
-     @title = @scraped_doc.css('title').inner_html rescue nil
-
-     # Searching meta description...
-     @description = @scraped_doc.css("meta[@name='description']").first['content'] rescue nil
-
-     # Searching meta keywords...
-     @keywords = @scraped_doc.css("meta[@name='keywords']").first['content'] rescue nil
-
-     # Searching links...
-     @links = []
-     @scraped_doc.search("//a").each do |link|
-       @links << link.attributes["href"].to_s.strip
-     end
+   # Returns the parsed document title
+   def title
+     @title ||= document.css('title').inner_html rescue nil
+   end
+
+   # Returns the parsed document meta description
+   def description
+     @description ||= document.css("meta[@name='description']").first['content'] rescue nil
+   end
+
+   # Returns the parsed document meta keywords
+   def keywords
+     @keywords ||= document.css("meta[@name='keywords']").first['content'] rescue nil
+   end
+
+   # Returns the parsed document links
+   def links
+     @links ||= document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
+   end
+
+   # Returns the whole parsed document
+   def document
+     @document ||= Nokogiri::HTML(open(@address))
 
-     # Mark scraping as success
-     @scraped = true
-
    rescue SocketError
      puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
      @scraped = false
@@ -54,9 +53,5 @@ class MetaInspector
    rescue
      puts 'An exception occurred while trying to scrape the page!'
    end
-
-   # Syntactic sugar
-   def scraped?
-     @scraped
-   end
+
  end
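
All the new accessors share one idiom: memoize with ||= and swallow parse errors with a trailing rescue nil, while the document method holds the single point of network access. A minimal standalone sketch of that idiom (a simplification for illustration, not the gem itself):

  require 'open-uri'
  require 'nokogiri'

  # Nothing is fetched until an accessor needs the document, and the
  # parsed page is memoized so it is downloaded at most once.
  class LazyPage
    def initialize(url)
      @url = url
    end

    def document
      @document ||= Nokogiri::HTML(open(@url))
    end

    def title
      # rescue nil trades error detail for robustness, as in MetaInspector;
      # note that a nil result is not cached, so a failed scrape is retried
      @title ||= document.css('title').inner_html rescue nil
    end
  end

  puts LazyPage.new('http://pagerankalert.com').title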
data/metainspector.gemspec CHANGED
@@ -1,6 +1,6 @@
  Gem::Specification.new do |s|
    s.name = "metainspector"
-   s.version = "1.1.2"
+   s.version = "1.1.3"
    s.date = "2009-05-19"
    s.summary = "Ruby gem for web scraping"
    s.email = "jaimeiniesta@gmail.com"
@@ -8,8 +8,15 @@ Gem::Specification.new do |s|
    s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
    s.has_rdoc = false
    s.authors = ["Jaime Iniesta"]
-   s.files = ["README.rdoc", "CHANGELOG.rdoc", "MIT-LICENSE", "metainspector.gemspec", "lib/metainspector.rb", "test/test_metainspector.rb", "samples/basic_scraping.rb", "samples/spider.rb"]
-   s.test_files = []
+   s.files = [
+     "README.rdoc",
+     "CHANGELOG.rdoc",
+     "MIT-LICENSE",
+     "metainspector.gemspec",
+     "lib/metainspector.rb",
+     "samples/basic_scraping.rb",
+     "samples/spider.rb"]
+   s.test_files = ["test/test_metainspector.rb"]
    s.rdoc_options = []
    s.extra_rdoc_files = []
    s.add_dependency("nokogiri", ["> 1.2"])
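
With the test moved from s.files to s.test_files, the test no longer appears in the runtime file list but still ships with the gem. To verify the spec locally (standard RubyGems commands, assumed to be run from the gem root):

  $ gem build metainspector.gemspec
  $ gem install metainspector-1.1.3.gem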
data/samples/basic_scraping.rb CHANGED
@@ -7,7 +7,6 @@ address = gets
  page = MetaInspector.new(address)
  puts "Scraping #{address}"
  puts "...please wait..."
- page.scrape!
 
  puts "Scraping #{page.address} returned these results:"
  puts "TITLE: #{page.title}"
data/samples/spider.rb CHANGED
@@ -14,7 +14,6 @@ while q.size > 0
  visited_links << address = q.pop
  page.address=address
  puts "Spidering #{page.address}"
- page.scrape!
 
  puts "TITLE: #{page.title}"
  puts "DESCRIPTION: #{page.description}"
data/test/test_metainspector.rb CHANGED
@@ -3,58 +3,42 @@ require '../lib/metainspector.rb'
 
  class TestMetaInspector < Test::Unit::TestCase
    # TODO: mock tests
-
-   # Test we can initialize a new instance, setting its address, and initial state
-   # is not scraped and every meta data value set to nil
    # TODO: validate URL format, only http and https allowed
-   def test_initialize
-     m = MetaInspector.new('http://pagerankalert.com')
-     assert_equal m.address, 'http://pagerankalert.com'
-     assert_equal m.scraped?, false
-     assert_nil m.title
-     assert_nil m.description
-     assert_nil m.keywords
-     assert_equal m.links.size, 0
-     assert_nil m.full_doc
-     assert_nil m.scraped_doc
-   end
+   # TODO: check timeouts
 
    # Test scraping an URL, marking it as scraped and setting meta data values
-   # TODO: check timeouts
-   def test_scrape!
+   def test_scrape
      m = MetaInspector.new('http://pagerankalert.com')
-     assert m.scrape!
-     assert m.scraped?
      assert_equal m.title, 'PageRankAlert.com :: Track your pagerank changes'
      assert_equal m.description, 'Track your PageRank(TM) changes and receive alert by email'
      assert_equal m.keywords, 'pagerank, seo, optimization, google'
      assert_equal m.links.size, 31
      assert_equal m.links[30], 'http://www.nuvio.cz/'
-     assert_equal m.full_doc.class, Tempfile
-     assert_equal m.scraped_doc.class, Nokogiri::HTML::Document
+     assert_equal m.document.class, Nokogiri::HTML::Document
    end
 
-   # Test changing the address resets the state of the instance
+   # Test changing the address resets the state of the instance so it causes a new scraping
    def test_address_setter
      m = MetaInspector.new('http://pagerankalert.com')
      assert_equal m.address, 'http://pagerankalert.com'
-     m.scrape!
-     assert m.scraped?
-     assert_not_nil m.title
-     assert_not_nil m.description
-     assert_not_nil m.keywords
-     assert_not_nil m.links
-     assert_not_nil m.full_doc
-     assert_not_nil m.scraped_doc
+     title_1 = m.title
+     description_1 = m.description
+     keywords_1 = m.keywords
+     links_1 = m.links
+     document_1 = m.document
 
      m.address = 'http://jaimeiniesta.com'
      assert_equal m.address, 'http://jaimeiniesta.com'
-     assert !m.scraped?
-     assert_nil m.title
-     assert_nil m.description
-     assert_nil m.keywords
-     assert_equal m.links.size, 0
-     assert_nil m.full_doc
-     assert_nil m.scraped_doc
+     title_2 = m.title
+     description_2 = m.description
+     keywords_2 = m.keywords
+     links_2 = m.links
+     document_2 = m.document
+
+     assert_not_equal title_1, title_2
+     assert_not_equal description_1, description_2
+     assert_not_equal keywords_1, keywords_2
+     assert_not_equal links_1, links_2
+     assert_not_equal document_1, document_2
    end
  end
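
Note that both tests fetch the live sites (hence the # TODO: mock tests marker above), so they need network access, and test_address_setter only passes while the two pages actually differ. Given the relative require at the top of the file, they are meant to be run from the test directory:

  $ cd test
  $ ruby test_metainspector.rb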
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: jaimeiniesta-metainspector
  version: !ruby/object:Gem::Version
-   version: 1.1.2
+   version: 1.1.3
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -36,7 +36,6 @@ files:
  - MIT-LICENSE
  - metainspector.gemspec
  - lib/metainspector.rb
- - test/test_metainspector.rb
  - samples/basic_scraping.rb
  - samples/spider.rb
  has_rdoc: false
@@ -65,5 +64,5 @@ rubygems_version: 1.2.0
  signing_key:
  specification_version: 2
  summary: Ruby gem for web scraping
- test_files: []
-
+ test_files:
+ - test/test_metainspector.rb