gscraper 0.1.5 → 0.1.6

data/History.txt CHANGED
@@ -1,39 +1,51 @@
+ == 0.1.6 / 2008-03-15
+
+ * Renamed GScraper.http_agent to GScraper.web_agent.
+ * Added GScraper.proxy for global proxy configuration.
+ * Added the WebAgent module.
+ * Renamed Search::Query#first_result to Search::Query#top_result.
+ * Updated Search::Query#page logic for Google's new DOM layout.
+ * Added support for Sponsored Ad scraping.
+ * Added the methods Query#sponsored_links and Query#top_sponsored_link.
+ * Added examples to README.txt.
+
  == 0.1.5 / 2007-12-29
 
- * Fixed class inheritance in gscraper/extensions/uri/http.rb.
+ * Fixed class inheritance in gscraper/extensions/uri/http.rb, found by
+   sanitybit.
 
  == 0.1.4 / 2007-12-23
 
- * Added Search::Query#result_at for easier access of a single result at
-   a given index.
- * Adding scraping of the "Cached" and "Similar Pages" URLs of Search
-   Results.
- * Added methods to Search::Page for accessing cached URLs, cached pages,
-   similar query URLs and similar Queries in mass.
- * Search::Query#page and Search::Query#first_page now can receive blocks.
- * Improved the formating of URL query parameters.
- * Added more unit-tests.
- * Fixed scraping of Search Result summaries.
- * Fixed various bugs in Search::Query uncovered during unit-testing.
- * Fixed typos in Search::Page's documentation.
+ * Added Search::Query#result_at for easier access of a single result at
+   a given index.
+ * Added scraping of the "Cached" and "Similar Pages" URLs of Search
+   Results.
+ * Added methods to Search::Page for accessing cached URLs, cached pages,
+   similar query URLs and similar Queries in mass.
+ * Search::Query#page and Search::Query#first_page can now receive blocks.
+ * Improved the formatting of URL query parameters.
+ * Added more unit-tests.
+ * Fixed scraping of Search Result summaries.
+ * Fixed various bugs in Search::Query uncovered during unit-testing.
+ * Fixed typos in Search::Page's documentation.
 
  == 0.1.3 / 2007-12-22
 
- * Added the Search::Page class, which contains many of convenance methods
-   for searching through the results within a Page.
+ * Added the Search::Page class, which contains many convenience methods
+   for searching through the results within a Page.
 
  == 0.1.2 / 2007-12-22
 
- * Fixed a bug related to extracting the correct content-rights from search
-   query URLs.
- * Added GScraper.user_agent_aliases.
+ * Fixed a bug related to extracting the correct content-rights from search
+   query URLs.
+ * Added GScraper.user_agent_aliases.
 
  == 0.1.1 / 2007-12-21
 
- * Forgot to include lib/gscraper/version.rb.
+ * Forgot to include lib/gscraper/version.rb.
 
  == 0.1.0 / 2007-12-20
 
- * Initial release.
- * Supports the Google Search service.
+ * Initial release.
+ * Supports the Google Search service.
 
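The 0.1.6 entries above rename GScraper.http_agent to GScraper.web_agent and introduce global proxy configuration. A minimal sketch of the renamed calls, based on the lib/gscraper/gscraper.rb hunk further down (the proxy host is a placeholder):

    require 'gscraper'

    # Global proxy settings; these keys match the Hash returned by
    # GScraper.proxy in the gscraper.rb diff below.
    GScraper.proxy[:host] = 'proxy.example.com'  # placeholder host
    GScraper.proxy[:port] = 8080

    # web_agent replaces http_agent and now also accepts a block.
    GScraper.web_agent(:user_agent_alias => 'Linux Mozilla') do |agent|
      # further WWW::Mechanize configuration goes here
    end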
data/Manifest.txt CHANGED
@@ -6,10 +6,13 @@ Rakefile
  lib/gscraper.rb
  lib/gscraper/version.rb
  lib/gscraper/gscraper.rb
+ lib/gscraper/web_agent.rb
  lib/gscraper/extensions/uri/http.rb
  lib/gscraper/extensions/uri.rb
  lib/gscraper/extensions.rb
  lib/gscraper/licenses.rb
+ lib/gscraper/sponsored_ad.rb
+ lib/gscraper/sponsored_links.rb
  lib/gscraper/search/result.rb
  lib/gscraper/search/page.rb
  lib/gscraper/search/query.rb
data/README.txt CHANGED
@@ -8,17 +8,120 @@ GScraper is a web-scraping interface to various Google Services.
 
  == FEATURES/PROBLEMS:
 
- * Supports the Google Search service.
- * Provides HTTP access with custom User-Agent strings.
+ * Supports the Google Search service.
+ * Provides access to search results and ranks.
+ * Provides access to the Sponsored Links.
+ * Provides HTTP access with custom User-Agent strings.
+ * Provides proxy settings for HTTP access.
 
  == REQUIREMENTS:
 
  * Hpricot
- * Mechanize
+ * WWW::Mechanize
 
  == INSTALL:
 
-     sudo gem install gscraper
+     $ sudo gem install gscraper
+
+ == EXAMPLES:
+
+ * Basic query:
+
+     q = GScraper::Search.query(:query => 'ruby')
+
+ * Advanced query:
+
+     q = GScraper::Search.query(:query => 'ruby') do |q|
+       q.without_words = 'is'
+       q.within_past_day = true
+       q.numeric_range = 2..10
+     end
+
+ * Queries from URLs:
+
+     q = GScraper::Search.query_from_url('http://www.google.com/search?as_q=ruby&as_epq=&as_oq=rails&as_ft=i&as_qdr=all&as_occt=body&as_rights=%28cc_publicdomain%7Ccc_attribute%7Ccc_sharealike%7Ccc_noncommercial%29.-%28cc_nonderived%29')
+
+     q.query          # => "ruby"
+     q.with_words     # => "rails"
+     q.occurrs_within # => :title
+     q.rights         # => :cc_by_nc
+
+ * Getting the search results:
+
+     q.first_page.select do |result|
+       result.title =~ /Blog/
+     end
+
+     q.page(2).map do |result|
+       result.title.reverse
+     end
+
+     q.result_at(25) # => Result
+
+     q.top_result # => Result
+
+ * A Result object contains the rank, title, summary, cached URL, similar
+   query URL and link URL of the search result.
+
+     page = q.page(2)
+
+     page.urls # => [...]
+     page.summaries # => [...]
+     page.ranks_of { |result| result.url =~ /^https/ } # => [...]
+     page.titles_of { |result| result.summary =~ /password/ } # => [...]
+     page.cached_pages # => [...]
+     page.similar_queries # => [...]
+
+ * Iterating over the search results:
+
+     q.each_on_page(2) do |result|
+       puts result.title
+     end
+
+     page.each do |result|
+       puts result.url
+     end
+
+ * Iterating over the data within the search results:
+
+     page.each_title do |title|
+       puts title
+     end
+
+     page.each_summary do |text|
+       puts text
+     end
+
+ * Selecting search results:
+
+     page.results_with do |result|
+       ((result.rank > 2) && (result.rank < 10))
+     end
+
+     page.results_with_title(/Ruby/i) # => [...]
+
+ * Selecting data within the search results:
+
+     page.titles # => [...]
+
+     page.summaries # => [...]
+
+ * Selecting the data of search results that match a block:
+
+     page.urls_of do |result|
+       result.description.length > 10
+     end
+
+ * Selecting the Sponsored Links of a Query:
+
+     q.sponsored_links # => [...]
+
+     q.top_sponsored_link # => SponsoredAd
+
+ * Setting the User-Agent globally:
+
+     GScraper.user_agent # => nil
+     GScraper.user_agent = 'Awesome Browser v1.2'
 
  == LICENSE:
 
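The EXAMPLES section above only names the sponsored-link calls. A hedged sketch of iterating over them; SponsoredAd's attribute names (here url) are an assumption, not confirmed by this diff:

    q = GScraper::Search.query(:query => 'ruby')

    q.sponsored_links.each do |ad|
      puts ad.url  # the url attribute is an assumed name
    end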
lib/gscraper/gscraper.rb CHANGED
@@ -1,7 +1,38 @@
+ require 'uri/http'
  require 'mechanize'
  require 'open-uri'
 
  module GScraper
+   # Common proxy port.
+   COMMON_PROXY_PORT = 8080
+
+   #
+   # Returns the +Hash+ of proxy information.
+   #
+   def GScraper.proxy
+     @@gscraper_proxy ||= {:host => nil, :port => COMMON_PROXY_PORT, :user => nil, :password => nil}
+   end
+
+   #
+   # Creates an HTTP URI from the given _proxy_info_ hash. The
+   # _proxy_info_ hash defaults to GScraper.proxy, if not given.
+   #
+   # _proxy_info_ may contain the following keys:
+   # <tt>:host</tt>:: The proxy host.
+   # <tt>:port</tt>:: The proxy port. Defaults to COMMON_PROXY_PORT,
+   #                  if not specified.
+   # <tt>:user</tt>:: The user-name to login as.
+   # <tt>:password</tt>:: The password to login with.
+   #
+   def GScraper.proxy_uri(proxy_info=GScraper.proxy)
+     if proxy_info[:host]
+       return URI::HTTP.build(:host => proxy_info[:host],
+                              :port => proxy_info[:port],
+                              :userinfo => "#{proxy_info[:user]}:#{proxy_info[:password]}",
+                              :path => '/')
+     end
+   end
+
    #
    # Returns the supported GScraper User-Agent Aliases.
    #
@@ -13,58 +44,98 @@ module GScraper
    # Returns the GScraper User-Agent
    #
    def GScraper.user_agent
-     @user_agent ||= nil
+     @@gscraper_user_agent ||= GScraper.user_agent_aliases['Windows IE 6']
    end
 
    #
    # Sets the GScraper User-Agent to the specified _agent_.
    #
    def GScraper.user_agent=(agent)
-     @user_agent = agent
+     @@gscraper_user_agent = agent
    end
 
    #
-   # Opens the _uri_ with the given _opts_. The contents of the _uri_ will be
-   # returned.
+   # Opens the _uri_ with the given _options_. The contents of the _uri_
+   # will be returned.
+   #
+   # _options_ may contain the following keys:
+   # <tt>:user_agent_alias</tt>:: The User-Agent Alias to use.
+   # <tt>:user_agent</tt>:: The User-Agent String to use.
+   # <tt>:proxy</tt>:: A +Hash+ of proxy information which may
+   #                   contain the following keys:
+   #                   <tt>:host</tt>:: The proxy host.
+   #                   <tt>:port</tt>:: The proxy port.
+   #                   <tt>:user</tt>:: The user-name to login as.
+   #                   <tt>:password</tt>:: The password to login with.
    #
-   #   GScraper.open('http://www.hackety.org/')
+   #   GScraper.open_uri('http://www.hackety.org/')
    #
-   #   GScraper.open('http://tenderlovemaking.com/',
+   #   GScraper.open_uri('http://tenderlovemaking.com/',
    #     :user_agent_alias => 'Linux Mozilla')
-   #   GScraper.open('http://www.wired.com/', :user_agent => 'the future')
+   #   GScraper.open_uri('http://www.wired.com/',
+   #     :user_agent => 'the future')
    #
-   def GScraper.open(uri,opts={})
+   def GScraper.open_uri(uri,options={})
      headers = {}
 
-     if opts[:user_agent_alias]
-       headers['User-Agent'] = WWW::Mechanize::AGENT_ALIASES[opts[:user_agent_alias]]
-     elsif opts[:user_agent]
-       headers['User-Agent'] = opts[:user_agent]
+     if options[:user_agent_alias]
+       headers['User-Agent'] = WWW::Mechanize::AGENT_ALIASES[options[:user_agent_alias]]
+     elsif options[:user_agent]
+       headers['User-Agent'] = options[:user_agent]
      elsif GScraper.user_agent
        headers['User-Agent'] = GScraper.user_agent
      end
 
+     proxy = (options[:proxy] || GScraper.proxy)
+     if proxy[:host]
+       headers[:proxy] = GScraper.proxy_uri(proxy)
+     end
+
      return Kernel.open(uri,headers)
    end
 
    #
-   # Creates a new Mechanize agent with the given _opts_.
+   # Similar to GScraper.open_uri but returns an Hpricot document.
+   #
+   def GScraper.open_page(uri,options={})
+     Hpricot(GScraper.open_uri(uri,options))
+   end
+
+   #
+   # Creates a new WWW::Mechanize agent with the given _options_.
+   #
+   # _options_ may contain the following keys:
+   # <tt>:user_agent_alias</tt>:: The User-Agent Alias to use.
+   # <tt>:user_agent</tt>:: The User-Agent string to use.
+   # <tt>:proxy</tt>:: A +Hash+ of proxy information which may
+   #                   contain the following keys:
+   #                   <tt>:host</tt>:: The proxy host.
+   #                   <tt>:port</tt>:: The proxy port.
+   #                   <tt>:user</tt>:: The user-name to login as.
+   #                   <tt>:password</tt>:: The password to login with.
+   #
+   #   GScraper.web_agent
    #
-   #   GScraper.http_agent
-   #   GScraper.http_agent(:user_agent_alias => 'Linux Mozilla')
-   #   GScraper.http_agent(:user_agent => 'wooden pants')
+   #   GScraper.web_agent(:user_agent_alias => 'Linux Mozilla')
+   #   GScraper.web_agent(:user_agent => 'Google Bot')
    #
-   def GScraper.http_agent(opts={})
+   def GScraper.web_agent(options={},&block)
      agent = WWW::Mechanize.new
 
-     if opts[:user_agent_alias]
-       agent.user_agent_alias = opts[:user_agent_alias]
-     elsif opts[:user_agent]
-       agent.user_agent = opts[:user_agent]
+     if options[:user_agent_alias]
+       agent.user_agent_alias = options[:user_agent_alias]
+     elsif options[:user_agent]
+       agent.user_agent = options[:user_agent]
      elsif GScraper.user_agent
        agent.user_agent = GScraper.user_agent
      end
 
+     proxy = (options[:proxy] || GScraper.proxy)
+     if proxy[:host]
+       agent.set_proxy(proxy[:host],proxy[:port],proxy[:user],proxy[:password])
+     end
+
+     block.call(agent) if block
      return agent
    end
  end
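The methods added in this hunk compose: open_page feeds open_uri's output to Hpricot, and both honor the :proxy option. A minimal sketch using only calls shown above (the URL and proxy host are placeholders):

    # Fetch and parse a page through a per-request proxy, overriding
    # the global GScraper.proxy settings.
    doc = GScraper.open_page('http://www.example.com/',
                             :user_agent => 'Google Bot',
                             :proxy => {:host => 'proxy.example.com',
                                        :port => GScraper::COMMON_PROXY_PORT})

    doc.search('title')  # an Hpricot document, so Hpricot searches work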
lib/gscraper/licenses.rb CHANGED
@@ -1,55 +1,78 @@
  module GScraper
    module Licenses
+     # Any desired license
      ANY = nil
 
+     # Aladdin license
      ALADDIN = :aladdin
 
+     # Artistic license
      ARTISTIC = :artistic
 
+     # Apache license
      APACHE = :apache
 
+     # Apple license
      APPLE = :apple
 
+     # BSD license
      BSD = :bsd
 
+     # Common public license
      COMMON_PUBLIC = :cpl
 
+     # Creative Commons By-Attribution license
      CC_BY = :cc_by
 
+     # Creative Commons By-Attribution-Share-Alike license
      CC_BY_SA = :cc_by_sa
 
+     # Creative Commons By-Attribution-No-Derivative license
      CC_BY_ND = :cc_by_nd
 
+     # Creative Commons By-Attribution-Noncommercial-Share-Alike license
      CC_BY_NC = :cc_by_nc_sa
 
-     CC_BY_ND_SA = :cc_by_nd
-
-     CC_BY_NC_SA = :cc_by_nc_sa
+     # Creative Commons By-Attribution-No-Derivative-Share-Alike license
+     CC_BY_ND_SA = :cc_by_nd_sa
 
+     # Creative Commons By-Attribution-Noncommercial-No-Derivative license
      CC_BY_NC_ND = :cc_by_nc_nd
 
+     # GNU General Public license
      GPL = :gpl
 
+     # GNU Lesser General Public license
      LGPL = :lgpl
 
+     # Historical Permission Notice and Disclaimer license
      HISTORICAL = :disclaimer
 
+     # IBM Public license
      IBM_PUBLIC = :ibm
 
+     # Lucent Public license
      LUCENT_PUBLIC = :lucent
 
+     # MIT license
      MIT = :mit
 
-     MOZILLA_PUBLI = :mozilla
+     # Mozilla Public license
+     MOZILLA_PUBLIC = :mozilla
 
+     # NASA OSA license
      NASA_OSA = :nasa
 
+     # Python license
      PYTHON = :python
 
+     # Q Public license
      Q_PUBLIC = :qpl
 
+     # Sleepycat license
      SLEEPYCAT = :sleepycat
 
+     # Zope Public license
      ZOPE_PUBLIC = :zope
 
    end
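These constants correspond to the content-rights symbols seen in the README example above (q.rights # => :cc_by_nc). A sketch of filtering a query by license; the q.rights= writer is an assumption inferred from that reader, not confirmed by this diff:

    q = GScraper::Search.query(:query => 'ruby') do |q|
      # :rights= is assumed to mirror the q.rights reader in the README
      q.rights = GScraper::Licenses::CC_BY_SA
    end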