andyverprauskus-scrubyt 0.5.1

Files changed (45)
  1. data/CHANGELOG +355 -0
  2. data/COPYING +340 -0
  3. data/README.rdoc +121 -0
  4. data/Rakefile +101 -0
  5. data/lib/scrubyt.rb +53 -0
  6. data/lib/scrubyt/core/navigation/agents/firewatir.rb +318 -0
  7. data/lib/scrubyt/core/navigation/agents/mechanize.rb +312 -0
  8. data/lib/scrubyt/core/navigation/fetch_action.rb +63 -0
  9. data/lib/scrubyt/core/navigation/navigation_actions.rb +107 -0
  10. data/lib/scrubyt/core/scraping/compound_example.rb +30 -0
  11. data/lib/scrubyt/core/scraping/constraint.rb +169 -0
  12. data/lib/scrubyt/core/scraping/constraint_adder.rb +49 -0
  13. data/lib/scrubyt/core/scraping/filters/attribute_filter.rb +14 -0
  14. data/lib/scrubyt/core/scraping/filters/base_filter.rb +112 -0
  15. data/lib/scrubyt/core/scraping/filters/constant_filter.rb +9 -0
  16. data/lib/scrubyt/core/scraping/filters/detail_page_filter.rb +37 -0
  17. data/lib/scrubyt/core/scraping/filters/download_filter.rb +64 -0
  18. data/lib/scrubyt/core/scraping/filters/html_subtree_filter.rb +9 -0
  19. data/lib/scrubyt/core/scraping/filters/regexp_filter.rb +13 -0
  20. data/lib/scrubyt/core/scraping/filters/script_filter.rb +11 -0
  21. data/lib/scrubyt/core/scraping/filters/text_filter.rb +34 -0
  22. data/lib/scrubyt/core/scraping/filters/tree_filter.rb +138 -0
  23. data/lib/scrubyt/core/scraping/pattern.rb +359 -0
  24. data/lib/scrubyt/core/scraping/pre_filter_document.rb +14 -0
  25. data/lib/scrubyt/core/scraping/result_indexer.rb +90 -0
  26. data/lib/scrubyt/core/shared/extractor.rb +183 -0
  27. data/lib/scrubyt/logging.rb +154 -0
  28. data/lib/scrubyt/output/post_processor.rb +139 -0
  29. data/lib/scrubyt/output/result.rb +44 -0
  30. data/lib/scrubyt/output/result_dumper.rb +154 -0
  31. data/lib/scrubyt/output/result_node.rb +145 -0
  32. data/lib/scrubyt/output/scrubyt_result.rb +42 -0
  33. data/lib/scrubyt/utils/compound_example_lookup.rb +50 -0
  34. data/lib/scrubyt/utils/ruby_extensions.rb +85 -0
  35. data/lib/scrubyt/utils/shared_utils.rb +58 -0
  36. data/lib/scrubyt/utils/simple_example_lookup.rb +40 -0
  37. data/lib/scrubyt/utils/xpathutils.rb +202 -0
  38. data/test/blackbox_test.rb +60 -0
  39. data/test/blackbox_tests/basic/multi_root.rb +6 -0
  40. data/test/blackbox_tests/basic/simple.rb +5 -0
  41. data/test/blackbox_tests/detail_page/one_detail_page.rb +9 -0
  42. data/test/blackbox_tests/detail_page/two_detail_pages.rb +9 -0
  43. data/test/blackbox_tests/next_page/next_page_link.rb +7 -0
  44. data/test/blackbox_tests/next_page/page_list_links.rb +7 -0
  45. metadata +120 -0
data/README.rdoc ADDED
@@ -0,0 +1,121 @@
+ = scRUBYt! - Hpricot and Mechanize (or FireWatir) on steroids
+
+ A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web,
+ then extract, query, transform and save relevant data from the Web page of your interest with a concise and easy to use DSL.
+
+
+ Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their
+ authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be
+ enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them
+ around with a chunky DSL coating and sprinkled the whole thing with lots of convention over configuration(tm) goodies
+ - and ... enter scRUBYt! and decide for yourself.
+
+ = Wait... why do we need one more web-scraping toolkit?
+
+ After all, we have HPricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
+ Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical
+ background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations with
+ different requirements than the previously mentioned ones.
+
+ If you need something quick and/or would like to have maximal control over the scraping process, I recommend HPricot.
+ Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! operates on XPaths, sometimes you
+ will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good
+ old mantra: use the right tool for the right job!
+
+ I hope there will also be times when you will want to experiment with Pandora's box and reach for the power of
+ scRUBYt! :-)
+
+ = Sounds fine - show me an example!
+
+ Let's apply the "show, don't tell" principle. Okay, here we go:
+
+ <tt>ebay_data = Scrubyt::Extractor.define do</tt>
+
+   fetch 'http://www.ebay.com/'
+   fill_textfield 'satitle', 'ipod'
+   submit
+   click_link 'Apple iPod'
+
+   record do
+     item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
+     price '$71.99'
+   end
+   next_page 'Next >', :limit => 5
+
+ <tt>end</tt>
+
+ output:
+
+ <tt><root></tt>
+   <record>
+     <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
+     <price>$149.95</price>
+   </record>
+   <record>
+     <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
+     <price>$172.50</price>
+   </record>
+   <record>
+     <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
+     <price>$171.06</price>
+   </record>
+   <!-- another 200+ results -->
+ <tt></root></tt>
+
+ This was a relatively beginner-level example (scRUBYt! knows a lot more than this and there are much more complicated
+ extractors than the above one) - yet it did a lot of things automagically. First of all,
+ it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods
+ and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that
+ looked like the specified example (which, by the way, also described what the output structure should look like) - on the first 5
+ result pages. Not so bad for about 10 lines of code, eh?
+
+ = OK, OK, I believe you, what should I do?
+
+ You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the
+ next section about installation, and after installing be sure to check out these URLs:
+
+ * <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek
+   at web scraping in general and/or you would like to understand what's going on under the hood, check out <a
+   href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about
+   web-scraping</a>!
+ * <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
+ * <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online RDoc
+ * <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including
+   open and closed bugs, files etc.
+ * projects.rubyforge.org/scrubyt/files... - a fair amount (and still growing with every release) of examples, showcasing
+   the features of scRUBYt!
+ * planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will
+   have a community, and people will upload their extractors for whatever reason
+
+ If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
+
+ = How to install
+
+ scRUBYt! requires these packages to be installed:
+
+ * Ruby 1.8.4
+ * Hpricot 0.5
+ * Mechanize 0.6.3
+
+ I assume you have Ruby and RubyGems installed. To install WWW::Mechanize 0.6.3 or higher, just run
+
+ <tt>sudo gem install mechanize</tt>
+
+ Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with
+
+ <tt>sudo gem install hpricot</tt>
+
+ Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scRUBYt! with
+
+ <tt>sudo gem install scrubyt</tt>
+
+ If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
+
+ = Author
+
+ Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)
+
+ = Copyright
+
+ This library is distributed under the GPL. Please see the LICENSE file.
+
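A quick way to sanity-check the installation steps from the README above is to require the library from a throwaway script. This is only a minimal sketch (the file name check_install.rb is made up for illustration); if the three gems are in place it should print the confirmation line without raising a LoadError:

    # check_install.rb - hypothetical helper script, not part of the gem
    require 'rubygems'
    require 'scrubyt'    # loads hpricot and mechanize along the way
    puts 'scRUBYt! and its dependencies loaded fine'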
data/Rakefile ADDED
@@ -0,0 +1,101 @@
+ require 'rake/rdoctask'
+ require 'rake/testtask'
+ require 'rake/gempackagetask'
+ require 'rake/packagetask'
+
+ ###################################################
+ # Dependencies
+ ###################################################
+
+ task "default" => ["test_all"]
+ task "generate_rdoc" => ["cleanup_readme"]
+ task "cleanup_readme" => ["rdoc"]
+
+ ###################################################
+ # Gem specification
+ ###################################################
+
+ gem_spec = Gem::Specification.new do |s|
+   s.name = 'scrubyt'
+   s.version = '0.4.30'
+   s.summary = 'A powerful Web-scraping framework built on Mechanize and Hpricot (and FireWatir)'
+   s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
+   # Files containing Test::Unit test cases.
+   s.test_files = FileList['test/unittests/**/*']
+   # List of other files to be included.
+   s.files = FileList['COPYING', 'README.rdoc', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
+   s.author = 'Peter Szinek'
+   s.email = 'peter@rubyrailways.com'
+   s.homepage = 'http://www.scrubyt.org'
+   s.add_dependency('hpricot', '>= 0.5')
+   s.add_dependency('mechanize', '>= 0.6.3')
+   s.has_rdoc = 'true'
+ end
+
+ ###################################################
+ # Tasks
+ ###################################################
+
+ Rake::RDocTask.new do |generate_rdoc|
+   files = ['lib/**/*.rb', 'README.rdoc', 'CHANGELOG']
+   generate_rdoc.rdoc_files.add(files)
+   generate_rdoc.main = "README.rdoc" # page to start on
+   generate_rdoc.title = "Scrubyt Documentation"
+   generate_rdoc.template = "resources/allison/allison.rb"
+   generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
+   generate_rdoc.options << '--line-numbers' << '--inline-source'
+ end
+
+ Rake::TestTask.new(:test_all) do |task|
+   task.pattern = 'test/*_test.rb'
+ end
+
+ Rake::TestTask.new(:test_blackbox) do |task|
+   task.test_files = ['test/blackbox_test.rb']
+ end
+
+ task "test_specific" do
+   ruby "test/blackbox_test.rb #{ARGV[1]}"
+ end
+
+ Rake::TestTask.new(:test_non_blackbox) do |task|
+   task.test_files = FileList['test/*_test.rb'] - ['test/blackbox_test.rb']
+ end
+
+ task "rcov" do
+   sh 'rcov --xrefs test/*.rb'
+   puts 'Report done.'
+ end
+
+ task "cleanup_readme" do
+   puts "Cleaning up README..."
+   readme_in = open('./doc/files/README.html')
+   content = readme_in.read
+   content.sub!('<h1 id="item_name">File: README</h1>','')
+   content.sub!('<h1>Description</h1>','')
+   readme_in.close
+   open('./doc/files/README.html', 'w') {|f| f.write(content)}
+   # OK, this is ugly as hell and as non-DRY as possible, but
+   # I don't have time to deal with it right now
+   puts "Cleaning up CHANGELOG..."
+   readme_in = open('./doc/files/CHANGELOG.html')
+   content = readme_in.read
+   content.sub!('<h1 id="item_name">File: CHANGELOG</h1>','')
+   content.sub!('<h1>Description</h1>','')
+   readme_in.close
+   open('./doc/files/CHANGELOG.html', 'w') {|f| f.write(content)}
+ end
+
+ task "generate_rdoc" do
+ end
+
+ Rake::GemPackageTask.new(gem_spec) do |pkg|
+   pkg.need_zip = false
+   pkg.need_tar = false
+ end
+
+ #Rake::PackageTask.new('scrubyt-examples', '0.4.03') do |pkg|
+ #  pkg.need_zip = true
+ #  pkg.need_tar = true
+ #  pkg.package_files.include("examples/**/*")
+ #end
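For orientation, the tasks defined above are normally driven from the project root roughly as follows (a hedged sketch; the name passed to test_specific is just a placeholder for one of the blackbox tests):

    # rake                      - default task, i.e. test_all
    # rake test_all             - every test matching test/*_test.rb
    # rake test_blackbox        - only test/blackbox_test.rb
    # rake test_specific NAME   - forwards ARGV[1] to test/blackbox_test.rb
    # rake test_non_blackbox    - everything except the blackbox tests
    # rake rcov                 - coverage report via rcov
    # rake generate_rdoc        - rdoc, then the cleanup_readme post-processing
    # rake package              - builds the gem described by gem_spec above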
data/lib/scrubyt.rb ADDED
@@ -0,0 +1,53 @@
+ if RUBY_VERSION < '1.9'
+   $KCODE = "u"
+   require "jcode"
+ end
+
+ #ruby core
+ require "open-uri"
+ require "erb"
+
+ #gems
+ require "rexml/text"
+ require "rubygems"
+ require "mechanize"
+ require "hpricot"
+
+ #scrubyt
+ require "#{File.dirname(__FILE__)}/scrubyt/logging"
+ require "#{File.dirname(__FILE__)}/scrubyt/utils/ruby_extensions.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/utils/xpathutils.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/utils/shared_utils.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/utils/simple_example_lookup.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/utils/compound_example_lookup.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/constraint_adder.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/constraint.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/result_indexer.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/pre_filter_document.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/compound_example.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/output/result_node.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/output/scrubyt_result.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/agents/mechanize.rb"
+
+ # -- Making Firewatir optional --
+ begin
+   require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/agents/firewatir.rb"
+ rescue LoadError
+   puts "The gem firewatir is not installed, you'll be able to use Mechanize as the agent only"
+ end
+ # --
+
+ require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/navigation_actions.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/fetch_action.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/shared/extractor.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/base_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/attribute_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/constant_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/script_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/text_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/detail_page_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/download_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/html_subtree_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/regexp_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/tree_filter.rb"
+ require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/pattern.rb"
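The begin/rescue around the firewatir require above is what makes the Firefox-driven agent optional. Stripped of the scRUBYt!-specific paths, the pattern looks like this (a generic sketch, not code taken from the gem):

    # Generic optional-dependency pattern, as used above for firewatir
    begin
      require 'firewatir'        # only succeeds if the user installed the gem
      HAVE_FIREWATIR = true
    rescue LoadError
      HAVE_FIREWATIR = false     # fall back to the Mechanize-based agent
    end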
data/lib/scrubyt/core/navigation/agents/firewatir.rb ADDED
@@ -0,0 +1,318 @@
+ require 'firewatir'
+
+ module Scrubyt
+   ##
+   #=<tt>Fetching pages (and related functionality)</tt>
+   #
+   #Since a lot of things are happening during (and before)
+   #the fetching of a document, I decided to move fetching-related
+   #functionality out to a separate class - so if you are looking for anything
+   #which is loading a document (even by submitting a form or clicking a link)
+   #and related things like setting a proxy etc., you should find it here.
+   module Navigation
+     module Firewatir
+
+       def self.included(base)
+         base.module_eval do
+           @@agent = FireWatir::Firefox.new unless defined? @@agent
+           @@current_doc_url = nil
+           @@current_doc_protocol = nil
+           @@base_dir = nil
+           @@host_name = nil
+           @@history = []
+           @@current_form = nil
+           @@current_frame = nil
+
+           ##
+           #Action to fetch a document (either a file or a http address)
+           #
+           #*parameters*
+           #
+           #_doc_url_ - the url or file name to fetch
+           def self.fetch(doc_url, *args)
+             #Refactor this crap!!! with option_accessor stuff
+             if args.size > 0
+               mechanize_doc = args[0][:mechanize_doc]
+               resolve = args[0][:resolve]
+               basic_auth = args[0][:basic_auth]
+               #Refactor this whole stuff as well!!! It looks awful...
+               parse_and_set_basic_auth(basic_auth) if basic_auth
+             else
+               mechanize_doc = nil
+               resolve = :full
+             end
+
+             @@current_doc_url = doc_url
+             @@current_doc_protocol = determine_protocol
+             if mechanize_doc.nil?
+               handle_relative_path(doc_url) unless @@current_doc_protocol == 'xpath'
+               handle_relative_url(doc_url, resolve)
+               Scrubyt.log :ACTION, "fetching document: #{@@current_doc_url}"
+               case @@current_doc_protocol
+               when 'file'
+                 @@agent.goto("file://" + @@current_doc_url)
+               else
+                 @@agent.goto(@@current_doc_url)
+               end
+               @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             else
+               @@mechanize_doc = mechanize_doc
+             end
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+             store_host_name(@@agent.url) # in case we're on a new host
+           end
+
+           def self.use_current_page
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+           end
+
+           def self.frame(attribute, value)
+             if @@current_frame
+               @@current_frame.frame(attribute, value)
+             else
+               @@current_frame = @@agent.frame(attribute, value)
+             end
+           end
+
+           ##
+           #Submit the last form
+           def self.submit(current_form, sleep_time=nil, button=nil, type=nil)
+             if @@current_frame
+               #BRUTAL hack - FireWatir leaves us little choice here;
+               #this really needs a cleaner solution
+               @@current_frame.locate
+               form = Document.new(@@current_frame).all.find{|t| t.tagName=="FORM"}
+               form.submit
+             else
+               @@agent.element_by_xpath(@@current_form).submit
+             end
+
+             if sleep_time
+               sleep sleep_time
+               @@agent.wait
+             end
+
+             @@current_doc_url = @@agent.url
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+           end
+
+           ##
+           #Click the link specified by the text
+           def self.click_link(link_spec, index=0, wait_secs=0)
+             Scrubyt.log :ACTION, "Clicking link specified by: %p" % link_spec
+             if link_spec.is_a?(Hash)
+               elem = XPathUtils.generate_XPath(CompoundExampleLookup.find_node_from_compund_example(@@hpricot_doc, link_spec, false, index), nil, true)
+               result_page = @@agent.element_by_xpath(elem).click
+             else
+               @@agent.link(:innerHTML, Regexp.escape(link_spec)).click
+             end
+             sleep(wait_secs) if wait_secs > 0
+             @@agent.wait
+
+             # evaluate the results
+             extractor.evaluate_extractor
+
+             @@current_doc_url = @@agent.url
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+             Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
+           end
+
+           def self.click_by_xpath_if_exists(xpath, wait_secs=0)
+             begin
+               result_page = @@agent.element_by_xpath(xpath).click
+               sleep(wait_secs) if wait_secs > 0
+               @@agent.wait
+
+               extractor.evaluate_extractor
+
+               @@current_doc_url = @@agent.url
+               @@mechanize_doc = "<html>#{@@agent.html}</html>"
+               @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+               Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
+             rescue Watir::Exception::UnknownObjectException
+               Scrubyt.log :INFO, "XPath #{xpath} doesn't exist in this document"
+             end
+           end
+
+           def self.click_by_xpath_without_evaluate(xpath, wait_secs=0)
+             Scrubyt.log :ACTION, "Clicking by XPath : %p" % xpath
+             @@agent.element_by_xpath(xpath).click
+             Scrubyt.log :INFO, "sleeping #{wait_secs}..."
+             sleep(wait_secs) if wait_secs > 0
+             @@agent.wait
+
+             # does not call evaluate_extractor
+             #extractor.evaluate_extractor
+
+             @@current_doc_url = @@agent.url
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+             Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
+           end
+
+
+           def self.click_by_xpath(xpath, wait_secs=0)
+             Scrubyt.log :ACTION, "Clicking by XPath : %p" % xpath
+             @@agent.element_by_xpath(xpath).click
+             Scrubyt.log :INFO, "sleeping #{wait_secs}..."
+             sleep(wait_secs) if wait_secs > 0
+             @@agent.wait
+
+             # evaluate the results
+             extractor.evaluate_extractor
+
+             @@current_doc_url = @@agent.url
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+             Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
+           end
+
+           def self.click_image_map(index=0)
+             Scrubyt.log :ACTION, "Clicking image map at index: %p" % index
+             uri = @@mechanize_doc.search("//area")[index]['href']
+             result_page = @@agent.get(uri)
+             @@current_doc_url = result_page.uri.to_s
+             Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
+             fetch(@@current_doc_url, :mechanize_doc => result_page)
+           end
+
+           def self.store_host_name(doc_url)
+             @@host_name = doc_url.match(/.*\..*?\//)[0] if doc_url.match(/.*\..*?\//)
+             @@original_host_name ||= @@host_name
+           end #end of method store_host_name
+
+           def self.determine_protocol
+             old_protocol = @@current_doc_protocol
+             new_protocol = case @@current_doc_url
+                            when /^\/\//
+                              'xpath'
+                            when /^https/
+                              'https'
+                            when /^http/
+                              'http'
+                            when /^www\./
+                              'http'
+                            else
+                              'file'
+                            end
+             return 'http' if ((old_protocol == 'http') && new_protocol == 'file')
+             return 'https' if ((old_protocol == 'https') && new_protocol == 'file')
+             new_protocol
+           end
+
+           def self.parse_and_set_basic_auth(basic_auth)
+             login, pass = basic_auth.split('@')
+             Scrubyt.log :ACTION, "Basic authentication: login=<#{login}>, pass=<#{pass}>"
+             @@agent.basic_auth(login, pass)
+           end
+
+           def self.handle_relative_path(doc_url)
+             if @@base_dir == nil || doc_url[0..0] == "/"
+               @@base_dir = doc_url.scan(/.+\//)[0] if @@current_doc_protocol == 'file'
+             else
+               @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
+             end
+           end
+
+           def self.handle_relative_url(doc_url, resolve)
+             return if doc_url =~ /^(http:|javascript:)/
+             if doc_url !~ /^\//
+               first_char = doc_url[0..0]
+               doc_url = ( first_char == '?' ? '' : '/' ) + doc_url
+               if first_char == '?' #This is an ugly hack... really have to throw this out and go with Mechanize's
+                 current_uri = @@mechanize_doc.uri.to_s
+                 current_uri = @@agent.history.first.uri.to_s if current_uri =~ /\/popup\//
+                 if (current_uri.include? '?')
+                   current_uri = current_uri.scan(/.+\//)[0]
+                 else
+                   current_uri += '/' unless current_uri[-1..-1] == '/'
+                 end
+                 @@current_doc_url = current_uri + doc_url
+                 return
+               end
+             end
+             case resolve
+             when :full
+               @@current_doc_url = (@@host_name + doc_url) if ( @@host_name != nil && (doc_url !~ /#{@@host_name}/))
+               @@current_doc_url = @@current_doc_url.split('/').uniq.join('/')
+             when :host
+               base_host_name = (@@host_name.count("/") == 2 ? @@host_name : @@host_name.scan(/(http.+?\/\/.+?)\//)[0][0])
+               @@current_doc_url = base_host_name + doc_url
+             else
+               #custom resolving
+               @@current_doc_url = resolve + doc_url
+             end
+           end
+
+           def self.fill_textfield(textfield_name, query_string, wait_secs, useValue)
+             @@current_form = "//input[@name='#{textfield_name}']/ancestor::form"
+             target = @@current_frame || @@agent
+             if useValue
+               target.text_field(:name, textfield_name).value = query_string
+             else
+               target.text_field(:name, textfield_name).set(query_string)
+             end
+             sleep(wait_secs) if wait_secs > 0
+             @@mechanize_doc = "<html>#{@@agent.html}</html>"
+             @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
+
+           end
+
+           ##
+           #Action to fill a textarea with text
+           def self.fill_textarea(textarea_name, text)
+             @@current_form = "//input[@name='#{textarea_name}']/ancestor::form"
+             @@agent.text_field(:name, textarea_name).set(text)
+           end
+
+           ##
+           #Action for selecting an option from a dropdown box
+           def self.select_option(selectlist_name, option)
+             if selectlist_name.is_a? Hash
+               select_args = selectlist_name
+               unless select_args.size == 1
+                 raise "select_option only supports using a name or a hash with a single pair, e.g.: {:id => \"foo1\"}"
+               end
+               @@current_form = "//select[@#{select_args.keys.first}='#{select_args.values.first}']/ancestor::form"
+             else
+               @@current_form = "//select[@name='#{selectlist_name}']/ancestor::form"
+             end
+             list = @@agent.select_list(:name, selectlist_name)
+             #STDOUT.puts "list = #{list.inspect}"
+             begin
+               error = list.select(option)
+             rescue
+               list.select_value(option) || error
+             end
+           end
+
+           def self.check_checkbox(checkbox_name)
+             @@current_form = "//input[@name='#{checkbox_name}']/ancestor::form"
+             @@agent.checkbox(:name, checkbox_name).set(true)
+           end
+
+           def self.check_radiobutton(checkbox_name, index=0)
+             @@current_form = "//input[@name='#{checkbox_name}']/ancestor::form"
+             @@agent.elements_by_xpath("//input[@name='#{checkbox_name}']")[index].set
+           end
+
+           def self.click_image_map(index=0)
+             raise 'NotImplemented'
+           end
+
+           def self.wait(time=1)
+             sleep(time)
+             @@agent.wait
+           end
+
+           def self.close_firefox
+             @@agent.close
+           end
+         end
+       end
+     end
+   end
+ end
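The class methods above (fetch, fill_textfield, submit, click_link and friends) are the FireWatir-backed implementations of the navigation steps used in extractors such as the README example. A rough sketch of how they are reached from the DSL - the :agent option value and the site/field names are assumptions for illustration, not taken from this diff:

    require 'rubygems'
    require 'scrubyt'

    data = Scrubyt::Extractor.define :agent => :firefox do
      fetch          'http://www.example.com/search'  # Firewatir.fetch drives Firefox to the page
      fill_textfield 'q', 'ipod'                      # Firewatir.fill_textfield, also remembers the form
      submit                                          # Firewatir.submit uses the remembered form XPath
      record do
        item_name 'Example item title'                # pattern example, as in the README
      end
    end

    puts data.to_xml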