jaimeiniesta-metainspector 1.0.3 → 1.1.0

data/CHANGELOG.rdoc ADDED
@@ -0,0 +1,12 @@
+ = 1.1.0
+ === 14th May, 2009
+ * Rewritten to use instance methods instead of class methods.
+ * Easier interface: provides the #new(address), #scrape!, #scraped?, #title, #description, #keywords, and #links instance methods (see the usage sketch below)
+ * Added a #full_doc method to access the temporary file that contains the raw HTML fetched
+ * Added a #scraped_doc method to get the whole scraped Hpricot doc
+ * Added tests
+ * Added samples, including a basic scraping script and a little spider
+ 
+ = 1.0.3
+ === 27th June, 2008
+ * Initial published version.
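Taken together, the 1.1.0 interface listed above works like this. A minimal usage sketch, based on the README and lib/metainspector.rb in this diff (the URL is a placeholder):

  require 'metainspector'

  page = MetaInspector.new('http://example.com')  # placeholder URL
  page.scrape!     # fetches and parses the page; returns true on success, false otherwise
  page.title       # => title of the page, as a string
  page.links       # => array of every link found on the page
  page.scraped?    # => true once a scrape has succeeded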
data/MIT-LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Jaime Iniesta
+ 
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+ 
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+ 
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,106 @@
+ = MetaInspector
+ 
+ MetaInspector is a gem for web scraping purposes. You give it a URL and it returns its metadata: title, meta description, meta keywords, and an array of links.
+ 
+ = Installation
+ 
+ Run the following if you haven't already:
+ 
+   gem sources -a http://gems.github.com
+ 
+ Then install the gem:
+ 
+   sudo gem install jaimeiniesta-metainspector
+ 
+ = Usage
+ 
+ Initialize a MetaInspector instance with a URL like this:
+ 
+   page = MetaInspector.new('http://pagerankalert.com')
+ 
+ Then you can tell it to fetch and scrape the URL:
+ 
+   page.scrape!
+ 
+ Once scraped, you can see the returned data like this:
+ 
+   page.address     # URL of the page
+   page.title       # title of the page, as a string
+   page.description # meta description, as a string
+   page.keywords    # meta keywords, as a string
+   page.links       # array of strings, with every link found on the page
+ 
+ You can check whether the scraping process went OK by looking at what page.scrape! returns (true or false), or by calling page.scraped?, which returns false if no successful scraping has been completed since the last address change.
+ You can also change the address of the page to be scraped using the address= setter, like this:
+ 
+   page.address = "http://jaimeiniesta.com"
+ 
+ Doing so resets the MetaInspector instance to its initial state (not scraped yet, stored metadata cleared). You can then scrape the new address by calling page.scrape! again.
+ 
+ The full fetched document and the scraped doc are accessible from:
+ 
+   page.full_doc    # points to the temp file where the fetched doc is stored
+   page.scraped_doc # Hpricot doc that you can use to get any element from the page
+ 
+ = Examples
+ 
+ You can find some sample scripts in the samples folder, including a basic scraping script and a spider that follows external links using a queue. What follows is an example session from irb:
+ 
+   $ irb
+   >> require 'metainspector'
+   => true
+ 
+   >> page = MetaInspector.new('http://pagerankalert.com')
+   => #<MetaInspector:0x5fc594 @full_doc=nil, @scraped=false, @description=nil, @links=nil, @address="http://pagerankalert.com", @keywords=nil, @scraped_doc=nil, @title=nil>
+ 
+   >> page.scrape!
+   => true
+ 
+   >> page.title
+   => "PageRankAlert.com :: Track your pagerank changes"
+ 
+   >> page.description
+   => "Track your PageRank(TM) changes and receive alert by email"
+ 
+   >> page.keywords
+   => "pagerank, seo, optimization, google"
+ 
+   >> page.links.size
+   => 31
+ 
+   >> page.links[30]
+   => "http://www.nuvio.cz/"
+ 
+   >> page.full_doc
+   => #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>
+ 
+   >> page.scraped_doc.class
+   => Hpricot::Doc
+ 
+   >> page.scraped?
+   => true
+ 
+   >> page.address = "http://jaimeiniesta.com"
+   => "http://jaimeiniesta.com"
+ 
+   >> page.scraped?
+   => false
+ 
+   >> page.scrape!
+   => true
+ 
+   >> page.scraped?
+   => true
+ 
+   >> page.title
+   => "ruby on rails freelance developer &#8212; Jaime Iniesta"
+ 
+ = To Do
+ 
+ * Mocks
+ * Check the content type and process only HTML pages (i.e., don't try to scrape http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2); a possible approach is sketched after this README
+ * Return an array of the images in the page
+ * Return the contents of the meta robots tag
+ * Consider using Nokogiri instead of Hpricot
+ 
+ Copyright (c) 2009 Jaime Iniesta, released under the MIT license
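Regarding the content-type item in the To Do list above, a possible approach would be to check open-uri's content_type metadata before scraping. A minimal sketch (html_page? is an invented helper, not part of the gem; note that this still downloads the document, so a HEAD request would be cheaper):

  require 'open-uri'

  # Hypothetical helper: true if the server reports an HTML content type
  def html_page?(address)
    open(address) { |doc| doc.content_type == 'text/html' }
  end

  html_page?('http://pagerankalert.com')  # => true, so go ahead and scrape it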
data/lib/metainspector.rb CHANGED
@@ -1,47 +1,71 @@
- class MetaInspector
-   require 'open-uri'
-   require 'rubygems'
-   require 'hpricot'
+ require 'open-uri'
+ require 'rubygems'
+ require 'hpricot'
 
-   VERSION = '1.0.2'
+ # MetaInspector provides an easy way to scrape web pages and get their elements
+ class MetaInspector
+   VERSION = '1.1.0'
 
    Hpricot.buffer_size = 300000
- 
-   def self.scrape(url)
-     doc = Hpricot(open(url))
+ 
+   attr_reader :address, :title, :description, :keywords, :links, :full_doc, :scraped_doc
+ 
+   # Initializes a new instance of MetaInspector, setting the URL address to the one given
+   # TODO: validate the address as an http URL, don't initialize it if the format is wrong
+   def initialize(address)
+     @address = address
+     @scraped = false
+ 
+     @title = @description = @keywords = @links = @full_doc = @scraped_doc = nil
+   end
+ 
+   # Setter for address. Resets the whole state, since the address is being changed.
+   def address=(address)
+     initialize(address)
+   end
+ 
+   # Visit the web page, get its contents, and parse it
+   def scrape!
+     @full_doc = open(@address)
+     @scraped_doc = Hpricot(@full_doc)
 
      # Searching title...
-     if doc.at('title')
-       title = doc.at('title').inner_html
+     if @scraped_doc.at('title')
+       @title = @scraped_doc.at('title').inner_html.strip
      else
-       title = ""
+       @title = ""
      end
 
      # Searching meta description...
-     if doc.at("meta[@name='description']")
-       description = doc.at("meta[@name='description']")['content']
+     if @scraped_doc.at("meta[@name='description']")
+       @description = @scraped_doc.at("meta[@name='description']")['content'].strip
      else
-       description = ""
+       @description = ""
      end
 
      # Searching meta keywords...
-     if doc.at("meta[@name='keywords']")
-       keywords = doc.at("meta[@name='keywords']")['content']
+     if @scraped_doc.at("meta[@name='keywords']")
+       @keywords = @scraped_doc.at("meta[@name='keywords']")['content'].strip
      else
-       keywords = ""
+       @keywords = ""
      end
 
      # Searching links...
-     links = []
-     doc.search("//a").each do |link|
-       links << link.attributes["href"] if (!link.attributes["href"].nil?)
+     @links = []
+     @scraped_doc.search("//a").each do |link|
+       @links << link.attributes["href"].strip if (!link.attributes["href"].nil?)
      end
- 
-     # Returning all data...
-     {'ok' => true, 'title' => title, 'description' => description, 'keywords' => keywords, 'links' => links}
+ 
+     # Mark the scraping as successful
+     @scraped = true
 
    rescue SocketError
      puts 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
-     {'ok' => false, 'title' => nil, 'description' => nil, 'keywords' => nil, 'links' => nil}
+     @scraped = false
+   end
+ 
+   # Syntactic sugar
+   def scraped?
+     @scraped
    end
- end
+ end
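Since scrape! rescues SocketError and returns the value of @scraped, callers can branch on its return value instead of wrapping the call in their own rescue. A small sketch (the URL is a placeholder):

  page = MetaInspector.new('http://example.com')  # placeholder URL
  if page.scrape!
    puts page.title
  else
    puts "Could not scrape #{page.address}"       # scraped? stays false here
  end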
data/metainspector.gemspec CHANGED
@@ -1,14 +1,14 @@
  Gem::Specification.new do |s|
    s.name = "metainspector"
-   s.version = "1.0.3"
-   s.date = "2008-06-27"
+   s.version = "1.1.0"
+   s.date = "2009-05-14"
    s.summary = "Ruby gem for web scraping"
    s.email = "jaimeiniesta@gmail.com"
-   s.homepage = "http://code.jaimeiniesta.com/metainspector"
+   s.homepage = "http://github.com/jaimeiniesta/metainspector/tree/master"
    s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
    s.has_rdoc = false
    s.authors = ["Jaime Iniesta"]
-   s.files = ["README", "metainspector.gemspec", "lib/metainspector.rb", "test/test_metainspector.rb"]
+   s.files = ["README.rdoc", "CHANGELOG.rdoc", "MIT-LICENSE", "metainspector.gemspec", "lib/metainspector.rb", "test/test_metainspector.rb", "samples/basic_scraping.rb", "samples/spider.rb"]
    s.test_files = []
    s.rdoc_options = []
    s.extra_rdoc_files = []
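To rebuild and install the gem locally from this spec, the standard RubyGems commands should do (shown as a sketch, these are not taken from the source):

  gem build metainspector.gemspec
  sudo gem install metainspector-1.1.0.gem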
data/samples/basic_scraping.rb ADDED
@@ -0,0 +1,19 @@
+ # Some basic MetaInspector samples
+ 
+ require '../lib/metainspector.rb'
+ 
+ puts "Enter a valid http address to scrape it"
+ address = gets.strip
+ page = MetaInspector.new(address)
+ puts "Scraping #{address}"
+ puts "...please wait..."
+ page.scrape!
+ 
+ puts "Scraping #{page.address} returned these results:"
+ puts "TITLE: #{page.title}"
+ puts "DESCRIPTION: #{page.description}"
+ puts "KEYWORDS: #{page.keywords}"
+ puts "#{page.links.size} links found..."
+ page.links.each do |link|
+   puts "  ==> #{link}"
+ end
data/samples/spider.rb ADDED
@@ -0,0 +1,25 @@
+ # A basic spider that will follow links on an infinite loop
+ require 'thread'  # Queue lives in the thread library on Ruby 1.8
+ require '../lib/metainspector.rb'
+ 
+ q = Queue.new
+ visited_links = []
+ 
+ puts "Enter a valid http address to spider it following external links"
+ address = gets.strip
+ page = MetaInspector.new(address)
+ q.push(address)
+ 
+ while q.size > 0
+   visited_links << address = q.pop
+   page.address = address
+   puts "Spidering #{page.address}"
+   page.scrape!
+   puts "TITLE: #{page.title}"
+   page.links.each do |link|
+     if link[0..6].downcase == 'http://' && !visited_links.include?(link)
+       q.push(link)
+     end
+   end
+   puts "#{visited_links.size} pages visited, #{q.size} pages on queue\n\n"
+ end
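As the comment says, this spider loops for as long as the queue holds addresses, which in practice is forever. One way to bound it, sketched here as a variant of the loop above with an invented MAX_PAGES limit and a guard for failed scrapes:

  MAX_PAGES = 50  # invented limit, not part of the sample

  while q.size > 0 && visited_links.size < MAX_PAGES
    visited_links << address = q.pop
    page.address = address
    next unless page.scrape!  # skip addresses that failed to fetch
    page.links.each do |link|
      q.push(link) if link[0..6].downcase == 'http://' && !visited_links.include?(link)
    end
  end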
data/test/test_metainspector.rb CHANGED
@@ -1 +1,60 @@
+ require 'test/unit'
+ require '../lib/metainspector.rb'
 
+ class TestMetaInspector < Test::Unit::TestCase
+   # TODO: mock the tests
+ 
+   # Test that we can initialize a new instance, setting its address, and that the
+   # initial state is not scraped, with every metadata value set to nil
+   # TODO: validate the URL format, only http and https allowed
+   def test_initialize
+     m = MetaInspector.new('http://pagerankalert.com')
+     assert_equal m.address, 'http://pagerankalert.com'
+     assert_equal m.scraped?, false
+     assert_nil m.title
+     assert_nil m.description
+     assert_nil m.keywords
+     assert_nil m.links
+     assert_nil m.full_doc
+     assert_nil m.scraped_doc
+   end
+ 
+   # Test scraping a URL, marking it as scraped, and setting the metadata values
+   # TODO: check timeouts
+   def test_scrape!
+     m = MetaInspector.new('http://pagerankalert.com')
+     assert m.scrape!
+     assert m.scraped?
+     assert_equal m.title, 'PageRankAlert.com :: Track your pagerank changes'
+     assert_equal m.description, 'Track your PageRank(TM) changes and receive alert by email'
+     assert_equal m.keywords, 'pagerank, seo, optimization, google'
+     assert_equal m.links.size, 31
+     assert_equal m.links[30], 'http://www.nuvio.cz/'
+     assert_equal m.full_doc.class, Tempfile
+     assert_equal m.scraped_doc.class, Hpricot::Doc
+   end
+ 
+   # Test that changing the address resets the state of the instance
+   def test_address_setter
+     m = MetaInspector.new('http://pagerankalert.com')
+     assert_equal m.address, 'http://pagerankalert.com'
+     m.scrape!
+     assert m.scraped?
+     assert_not_nil m.title
+     assert_not_nil m.description
+     assert_not_nil m.keywords
+     assert_not_nil m.links
+     assert_not_nil m.full_doc
+     assert_not_nil m.scraped_doc
+ 
+     m.address = 'http://jaimeiniesta.com'
+     assert_equal m.address, 'http://jaimeiniesta.com'
+     assert !m.scraped?
+     assert_nil m.title
+     assert_nil m.description
+     assert_nil m.keywords
+     assert_nil m.links
+     assert_nil m.full_doc
+     assert_nil m.scraped_doc
+   end
+ end
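Note that these tests hit live pages (pagerankalert.com and jaimeiniesta.com), so they need network access and will break when those pages change; the TODO about mocks addresses exactly that. Given the relative require path, they are presumably run from inside the test directory:

  cd test
  ruby test_metainspector.rb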
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: jaimeiniesta-metainspector
  version: !ruby/object:Gem::Version
-   version: 1.0.3
+   version: 1.1.0
  platform: ruby
  authors:
  - Jaime Iniesta
@@ -9,11 +9,12 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2008-06-27 00:00:00 -07:00
+ date: 2009-05-14 00:00:00 -07:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
    name: hpricot
+   type: :runtime
    version_requirement:
    version_requirements: !ruby/object:Gem::Requirement
    requirements:
@@ -30,12 +31,16 @@ extensions: []
  extra_rdoc_files: []
 
  files:
- - README
+ - README.rdoc
+ - CHANGELOG.rdoc
+ - MIT-LICENSE
  - metainspector.gemspec
  - lib/metainspector.rb
  - test/test_metainspector.rb
+ - samples/basic_scraping.rb
+ - samples/spider.rb
  has_rdoc: false
- homepage: http://code.jaimeiniesta.com/metainspector
+ homepage: http://github.com/jaimeiniesta/metainspector/tree/master
  post_install_message:
  rdoc_options: []
 
data/README DELETED
@@ -1,8 +0,0 @@
- --------------
-  MetaInspector
- --------------
- DESCRIPTION: Ruby gem for web scraping purposes. It scrapes a given URL, and returns you a hash with data from it like for example the title, meta description, meta keywords, an array with all the links, all the images in it, etc.
- AUTHOR: Jaime Iniesta (jaimeiniesta@gmail.com, http://www.jaimeiniesta.com)
- LICENSE: GPLv3
- 
- (more to come soon...)