andyverprauskus-scrubyt 0.5.1
- data/CHANGELOG +355 -0
- data/COPYING +340 -0
- data/README.rdoc +121 -0
- data/Rakefile +101 -0
- data/lib/scrubyt.rb +53 -0
- data/lib/scrubyt/core/navigation/agents/firewatir.rb +318 -0
- data/lib/scrubyt/core/navigation/agents/mechanize.rb +312 -0
- data/lib/scrubyt/core/navigation/fetch_action.rb +63 -0
- data/lib/scrubyt/core/navigation/navigation_actions.rb +107 -0
- data/lib/scrubyt/core/scraping/compound_example.rb +30 -0
- data/lib/scrubyt/core/scraping/constraint.rb +169 -0
- data/lib/scrubyt/core/scraping/constraint_adder.rb +49 -0
- data/lib/scrubyt/core/scraping/filters/attribute_filter.rb +14 -0
- data/lib/scrubyt/core/scraping/filters/base_filter.rb +112 -0
- data/lib/scrubyt/core/scraping/filters/constant_filter.rb +9 -0
- data/lib/scrubyt/core/scraping/filters/detail_page_filter.rb +37 -0
- data/lib/scrubyt/core/scraping/filters/download_filter.rb +64 -0
- data/lib/scrubyt/core/scraping/filters/html_subtree_filter.rb +9 -0
- data/lib/scrubyt/core/scraping/filters/regexp_filter.rb +13 -0
- data/lib/scrubyt/core/scraping/filters/script_filter.rb +11 -0
- data/lib/scrubyt/core/scraping/filters/text_filter.rb +34 -0
- data/lib/scrubyt/core/scraping/filters/tree_filter.rb +138 -0
- data/lib/scrubyt/core/scraping/pattern.rb +359 -0
- data/lib/scrubyt/core/scraping/pre_filter_document.rb +14 -0
- data/lib/scrubyt/core/scraping/result_indexer.rb +90 -0
- data/lib/scrubyt/core/shared/extractor.rb +183 -0
- data/lib/scrubyt/logging.rb +154 -0
- data/lib/scrubyt/output/post_processor.rb +139 -0
- data/lib/scrubyt/output/result.rb +44 -0
- data/lib/scrubyt/output/result_dumper.rb +154 -0
- data/lib/scrubyt/output/result_node.rb +145 -0
- data/lib/scrubyt/output/scrubyt_result.rb +42 -0
- data/lib/scrubyt/utils/compound_example_lookup.rb +50 -0
- data/lib/scrubyt/utils/ruby_extensions.rb +85 -0
- data/lib/scrubyt/utils/shared_utils.rb +58 -0
- data/lib/scrubyt/utils/simple_example_lookup.rb +40 -0
- data/lib/scrubyt/utils/xpathutils.rb +202 -0
- data/test/blackbox_test.rb +60 -0
- data/test/blackbox_tests/basic/multi_root.rb +6 -0
- data/test/blackbox_tests/basic/simple.rb +5 -0
- data/test/blackbox_tests/detail_page/one_detail_page.rb +9 -0
- data/test/blackbox_tests/detail_page/two_detail_pages.rb +9 -0
- data/test/blackbox_tests/next_page/next_page_link.rb +7 -0
- data/test/blackbox_tests/next_page/page_list_links.rb +7 -0
- metadata +120 -0
data/README.rdoc
ADDED
@@ -0,0 +1,121 @@

= scRUBYt! - Hpricot and Mechanize (or FireWatir) on steroids

A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, then
extract, query, transform and save relevant data from the Web pages of your interest with a concise and easy to use DSL.

Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their
authors: without these libs scRUBYt! could not exist! I had been wondering whether their functionality could be
enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them
in a chunky DSL coating and sprinkled the whole thing with lots of convention over configuration(tm) goodies
- and ... enter scRUBYt! and decide for yourself.

= Wait... why do we need one more web-scraping toolkit?

After all, we have Hpricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical
background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations, with
different requirements, than the previously mentioned ones.

If you need something quick and/or would like to have maximal control over the scraping process, I recommend Hpricot.
Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! operates on XPaths, sometimes you
will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good
old mantra: use the right tool for the right job!

I hope there will also be times when you will want to experiment with Pandora's box and reach for the power of
scRUBYt! :-)

= Sounds fine - show me an example!

Let's apply the "show, don't tell" principle. Okay, here we go:

  ebay_data = Scrubyt::Extractor.define do

    fetch 'http://www.ebay.com/'
    fill_textfield 'satitle', 'ipod'
    submit
    click_link 'Apple iPod'

    record do
      item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
      price '$71.99'
    end
    next_page 'Next >', :limit => 5

  end

output:

  <root>
    <record>
      <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
      <price>$149.95</price>
    </record>
    <record>
      <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
      <price>$172.50</price>
    </record>
    <record>
      <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
      <price>$171.06</price>
    </record>
    <!-- another 200+ results -->
  </root>

This was a relatively beginner-level example (scRUBYt! knows a lot more than this, and there are much more complicated
extractors than the above one) - yet it did a lot of things automagically. First of all,
it automatically loaded the page of interest (by going to ebay.com, searching for ipods
and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that
looked like the specified example (which, by the way, also described what the output structure should look like) - on the first 5
result pages. Not so bad for about 10 lines of code, eh?
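The output above is plain XML, so once an extractor has run you can post-process the result with any XML library. Here is a minimal, self-contained sketch using Ruby's bundled REXML, with two of the sample records inlined (it does not actually run the extractor):

```ruby
require 'rexml/document'

# Two records from the sample output above, inlined for illustration.
xml = <<-XML
<root>
  <record>
    <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
    <price>$149.95</price>
  </record>
  <record>
    <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
    <price>$172.50</price>
  </record>
</root>
XML

# Collect [item_name, price] pairs from every <record> element.
doc = REXML::Document.new(xml)
items = doc.elements.to_a('//record').map do |rec|
  [rec.elements['item_name'].text, rec.elements['price'].text]
end
# items.first => ["APPLE IPOD NANO 4GB - PINK - MP3 PLAYER", "$149.95"]
```

The same traversal works unchanged on a full result file with hundreds of records.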
= OK, OK, I believe you, what should I do?

You can find everything you will need at these addresses (and if not, I doubt you will find it elsewhere...). See the
next section about installation, and after installing be sure to check out these URLs:

* <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek
  at web scraping in general and/or you would like to understand what's going on under the hood, check out <a
  href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about
  web-scraping</a>!
* <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
* <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online RDoc
* <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including
  open and closed bugs, files etc.
* projects.rubyforge.org/scrubyt/files... - a fair amount (and still growing with every release) of examples, showcasing
  the features of scRUBYt!
* planned: a public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will
  have a community, and people will upload their extractors for whatever reason

If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!

= How to install

scRUBYt! requires these packages to be installed:

* Ruby 1.8.4
* Hpricot 0.5
* Mechanize 0.6.3

I assume you have ruby and rubygems installed. To install WWW::Mechanize 0.6.3 or higher, just run

<tt>sudo gem install mechanize</tt>

Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with

<tt>sudo gem install hpricot</tt>

Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scRUBYt! with

<tt>sudo gem install scrubyt</tt>

If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!

= Author

Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)

= Copyright

This library is distributed under the GPL. Please see the LICENSE file.
data/Rakefile
ADDED
@@ -0,0 +1,101 @@

require 'rake/rdoctask'
require 'rake/testtask'
require 'rake/gempackagetask'
require 'rake/packagetask'

###################################################
# Dependencies
###################################################

task "default" => ["test_all"]
task "generate_rdoc" => ["cleanup_readme"]
task "cleanup_readme" => ["rdoc"]

###################################################
# Gem specification
###################################################

gem_spec = Gem::Specification.new do |s|
  s.name = 'scrubyt'
  s.version = '0.4.30'
  s.summary = 'A powerful Web-scraping framework built on Mechanize and Hpricot (and FireWatir)'
  s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on Hpricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
  # Files containing Test::Unit test cases.
  s.test_files = FileList['test/unittests/**/*']
  # List of other files to be included.
  s.files = FileList['COPYING', 'README.rdoc', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
  s.author = 'Peter Szinek'
  s.email = 'peter@rubyrailways.com'
  s.homepage = 'http://www.scrubyt.org'
  s.add_dependency('hpricot', '>= 0.5')
  s.add_dependency('mechanize', '>= 0.6.3')
  s.has_rdoc = 'true'
end

###################################################
# Tasks
###################################################

Rake::RDocTask.new do |generate_rdoc|
  files = ['lib/**/*.rb', 'README.rdoc', 'CHANGELOG']
  generate_rdoc.rdoc_files.add(files)
  generate_rdoc.main = "README.rdoc" # page to start on
  generate_rdoc.title = "Scrubyt Documentation"
  generate_rdoc.template = "resources/allison/allison.rb"
  generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
  generate_rdoc.options << '--line-numbers' << '--inline-source'
end

Rake::TestTask.new(:test_all) do |task|
  task.pattern = 'test/*_test.rb'
end

Rake::TestTask.new(:test_blackbox) do |task|
  task.test_files = ['test/blackbox_test.rb']
end

task "test_specific" do
  ruby "test/blackbox_test.rb #{ARGV[1]}"
end

Rake::TestTask.new(:test_non_blackbox) do |task|
  task.test_files = FileList['test/*_test.rb'] - ['test/blackbox_test.rb']
end

task "rcov" do
  sh 'rcov --xrefs test/*.rb'
  puts 'Report done.'
end

task "cleanup_readme" do
  puts "Cleaning up README..."
  readme_in = open('./doc/files/README.html')
  content = readme_in.read
  content.sub!('<h1 id="item_name">File: README</h1>', '')
  content.sub!('<h1>Description</h1>', '')
  readme_in.close
  open('./doc/files/README.html', 'w') { |f| f.write(content) }
  # OK, this is ugly as hell and as non-DRY as possible, but
  # I don't have time to deal with it right now
  puts "Cleaning up CHANGELOG..."
  readme_in = open('./doc/files/CHANGELOG.html')
  content = readme_in.read
  content.sub!('<h1 id="item_name">File: CHANGELOG</h1>', '')
  content.sub!('<h1>Description</h1>', '')
  readme_in.close
  open('./doc/files/CHANGELOG.html', 'w') { |f| f.write(content) }
end

task "generate_rdoc" do
end

Rake::GemPackageTask.new(gem_spec) do |pkg|
  pkg.need_zip = false
  pkg.need_tar = false
end

#Rake::PackageTask.new('scrubyt-examples', '0.4.03') do |pkg|
#  pkg.need_zip = true
#  pkg.need_tar = true
#  pkg.package_files.include("examples/**/*")
#end
data/lib/scrubyt.rb
ADDED
@@ -0,0 +1,53 @@

if RUBY_VERSION < '1.9'
  $KCODE = "u"
  require "jcode"
end

# ruby core
require "open-uri"
require "erb"
require "rexml/text"

# gems
require "rubygems"
require "mechanize"
require "hpricot"

# scrubyt
require "#{File.dirname(__FILE__)}/scrubyt/logging"
require "#{File.dirname(__FILE__)}/scrubyt/utils/ruby_extensions.rb"
require "#{File.dirname(__FILE__)}/scrubyt/utils/xpathutils.rb"
require "#{File.dirname(__FILE__)}/scrubyt/utils/shared_utils.rb"
require "#{File.dirname(__FILE__)}/scrubyt/utils/simple_example_lookup.rb"
require "#{File.dirname(__FILE__)}/scrubyt/utils/compound_example_lookup.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/constraint_adder.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/constraint.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/result_indexer.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/pre_filter_document.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/compound_example.rb"
require "#{File.dirname(__FILE__)}/scrubyt/output/result_node.rb"
require "#{File.dirname(__FILE__)}/scrubyt/output/scrubyt_result.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/agents/mechanize.rb"

# -- Making FireWatir optional --
begin
  require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/agents/firewatir.rb"
rescue LoadError
  puts "The firewatir gem is not installed; you'll only be able to use Mechanize as the agent"
end
# --

require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/navigation_actions.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/navigation/fetch_action.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/shared/extractor.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/base_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/attribute_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/constant_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/script_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/text_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/detail_page_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/download_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/html_subtree_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/regexp_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/filters/tree_filter.rb"
require "#{File.dirname(__FILE__)}/scrubyt/core/scraping/pattern.rb"
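The `begin require ... rescue LoadError` section above is a general Ruby pattern for optional dependencies: attempt the require and degrade gracefully when the gem is absent. A minimal standalone sketch (the helper name `try_require` is mine, for illustration):

```ruby
# Attempt to load a library; report success instead of aborting on failure.
def try_require(lib)
  require lib
  true
rescue LoadError
  false
end

try_require('set')               # => true (stdlib, always present)
try_require('no_such_gem_xyz')   # => false
```

A caller can then branch on the result, exactly as scrubyt does when deciding whether the FireWatir agent is available.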
data/lib/scrubyt/core/navigation/agents/firewatir.rb
ADDED
@@ -0,0 +1,318 @@

require 'firewatir'

module Scrubyt
  ##
  #=<tt>Fetching pages (and related functionality)</tt>
  #
  #Since a lot of things happen during (and before)
  #the fetching of a document, I decided to move fetching-related
  #functionality out to a separate class - so if you are looking for anything
  #which loads a document (even by submitting a form or clicking a link)
  #and related things like setting a proxy etc., you should find it here.
  module Navigation
    module Firewatir

      def self.included(base)
        base.module_eval do
          @@agent = FireWatir::Firefox.new unless defined? @@agent
          @@current_doc_url = nil
          @@current_doc_protocol = nil
          @@base_dir = nil
          @@host_name = nil
          @@history = []
          @@current_form = nil
          @@current_frame = nil

          ##
          #Action to fetch a document (either a file or a http address)
          #
          #*parameters*
          #
          #_doc_url_ - the url or file name to fetch
          def self.fetch(doc_url, *args)
            #TODO: refactor with option_accessor stuff
            if args.size > 0
              mechanize_doc = args[0][:mechanize_doc]
              resolve = args[0][:resolve]
              basic_auth = args[0][:basic_auth]
              parse_and_set_basic_auth(basic_auth) if basic_auth
            else
              mechanize_doc = nil
              resolve = :full
            end

            @@current_doc_url = doc_url
            @@current_doc_protocol = determine_protocol
            if mechanize_doc.nil?
              handle_relative_path(doc_url) unless @@current_doc_protocol == 'xpath'
              handle_relative_url(doc_url, resolve)
              Scrubyt.log :ACTION, "fetching document: #{@@current_doc_url}"
              case @@current_doc_protocol
              when 'file'
                @@agent.goto("file://" + @@current_doc_url)
              else
                @@agent.goto(@@current_doc_url)
              end
              @@mechanize_doc = "<html>#{@@agent.html}</html>"
            else
              @@mechanize_doc = mechanize_doc
            end
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
            store_host_name(@@agent.url) # in case we're on a new host
          end

          def self.use_current_page
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
          end

          def self.frame(attribute, value)
            if @@current_frame
              @@current_frame.frame(attribute, value)
            else
              @@current_frame = @@agent.frame(attribute, value)
            end
          end

          ##
          #Submit the last form
          def self.submit(current_form, sleep_time=nil, button=nil, type=nil)
            if @@current_frame
              #Brutal hack: FireWatir can't submit a form inside a frame for
              #us, so locate the frame's form manually and submit that
              @@current_frame.locate
              form = Document.new(@@current_frame).all.find { |t| t.tagName == "FORM" }
              form.submit
            else
              @@agent.element_by_xpath(@@current_form).submit
            end

            if sleep_time
              sleep sleep_time
              @@agent.wait
            end

            @@current_doc_url = @@agent.url
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
          end

          ##
          #Click the link specified by the text
          def self.click_link(link_spec, index=0, wait_secs=0)
            Scrubyt.log :ACTION, "Clicking link specified by: %p" % link_spec
            if link_spec.is_a?(Hash)
              elem = XPathUtils.generate_XPath(CompoundExampleLookup.find_node_from_compund_example(@@hpricot_doc, link_spec, false, index), nil, true)
              result_page = @@agent.element_by_xpath(elem).click
            else
              @@agent.link(:innerHTML, Regexp.escape(link_spec)).click
            end
            sleep(wait_secs) if wait_secs > 0
            @@agent.wait

            # evaluate the results
            extractor.evaluate_extractor

            @@current_doc_url = @@agent.url
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
            Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
          end

          def self.click_by_xpath_if_exists(xpath, wait_secs=0)
            begin
              result_page = @@agent.element_by_xpath(xpath).click
              sleep(wait_secs) if wait_secs > 0
              @@agent.wait

              extractor.evaluate_extractor

              @@current_doc_url = @@agent.url
              @@mechanize_doc = "<html>#{@@agent.html}</html>"
              @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
              Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
            rescue Watir::Exception::UnknownObjectException
              Scrubyt.log :INFO, "XPath #{xpath} doesn't exist in this document"
            end
          end

          #Same as click_by_xpath, but does not call evaluate_extractor
          def self.click_by_xpath_without_evaluate(xpath, wait_secs=0)
            Scrubyt.log :ACTION, "Clicking by XPath : %p" % xpath
            @@agent.element_by_xpath(xpath).click
            Scrubyt.log :INFO, "sleeping #{wait_secs}..."
            sleep(wait_secs) if wait_secs > 0
            @@agent.wait

            @@current_doc_url = @@agent.url
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
            Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
          end

          def self.click_by_xpath(xpath, wait_secs=0)
            Scrubyt.log :ACTION, "Clicking by XPath : %p" % xpath
            @@agent.element_by_xpath(xpath).click
            Scrubyt.log :INFO, "sleeping #{wait_secs}..."
            sleep(wait_secs) if wait_secs > 0
            @@agent.wait

            # evaluate the results
            extractor.evaluate_extractor

            @@current_doc_url = @@agent.url
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
            Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
          end

          def self.click_image_map(index=0)
            Scrubyt.log :ACTION, "Clicking image map at index: %p" % index
            uri = @@mechanize_doc.search("//area")[index]['href']
            result_page = @@agent.get(uri)
            @@current_doc_url = result_page.uri.to_s
            Scrubyt.log :ACTION, "Fetching #{@@current_doc_url}"
            fetch(@@current_doc_url, :mechanize_doc => result_page)
          end

          def self.store_host_name(doc_url)
            @@host_name = doc_url.match(/.*\..*?\//)[0] if doc_url.match(/.*\..*?\//)
            @@original_host_name ||= @@host_name
          end

          def self.determine_protocol
            old_protocol = @@current_doc_protocol
            new_protocol = case @@current_doc_url
                           when /^\/\//
                             'xpath'
                           when /^https/
                             'https'
                           when /^http/
                             'http'
                           when /^www\./
                             'http'
                           else
                             'file'
                           end
            return 'http' if ((old_protocol == 'http') && new_protocol == 'file')
            return 'https' if ((old_protocol == 'https') && new_protocol == 'file')
            new_protocol
          end

          def self.parse_and_set_basic_auth(basic_auth)
            login, pass = basic_auth.split('@')
            Scrubyt.log :ACTION, "Basic authentication: login=<#{login}>, pass=<#{pass}>"
            @@agent.basic_auth(login, pass)
          end

          def self.handle_relative_path(doc_url)
            if @@base_dir == nil || doc_url[0..0] == "/"
              @@base_dir = doc_url.scan(/.+\//)[0] if @@current_doc_protocol == 'file'
            else
              @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
            end
          end

          def self.handle_relative_url(doc_url, resolve)
            return if doc_url =~ /^(http:|javascript:)/
            if doc_url !~ /^\//
              first_char = doc_url[0..0]
              doc_url = (first_char == '?' ? '' : '/') + doc_url
              if first_char == '?' #TODO: ugly hack - should be replaced with mechanize's URL resolution
                current_uri = @@mechanize_doc.uri.to_s
                current_uri = @@agent.history.first.uri.to_s if current_uri =~ /\/popup\//
                if current_uri.include? '?'
                  current_uri = current_uri.scan(/.+\//)[0]
                else
                  current_uri += '/' unless current_uri[-1..-1] == '/'
                end
                @@current_doc_url = current_uri + doc_url
                return
              end
            end
            case resolve
            when :full
              @@current_doc_url = (@@host_name + doc_url) if (@@host_name != nil && (doc_url !~ /#{@@host_name}/))
              @@current_doc_url = @@current_doc_url.split('/').uniq.join('/')
            when :host
              base_host_name = (@@host_name.count("/") == 2 ? @@host_name : @@host_name.scan(/(http.+?\/\/.+?)\//)[0][0])
              @@current_doc_url = base_host_name + doc_url
            else
              #custom resolving
              @@current_doc_url = resolve + doc_url
            end
          end

          def self.fill_textfield(textfield_name, query_string, wait_secs, useValue)
            @@current_form = "//input[@name='#{textfield_name}']/ancestor::form"
            target = @@current_frame || @@agent
            if useValue
              target.text_field(:name, textfield_name).value = query_string
            else
              target.text_field(:name, textfield_name).set(query_string)
            end
            sleep(wait_secs) if wait_secs > 0
            @@mechanize_doc = "<html>#{@@agent.html}</html>"
            @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc))
          end

          ##
          #Action to fill a textarea with text
          def self.fill_textarea(textarea_name, text)
            @@current_form = "//input[@name='#{textarea_name}']/ancestor::form"
            @@agent.text_field(:name, textarea_name).set(text)
          end

          ##
          #Action for selecting an option from a dropdown box
          def self.select_option(selectlist_name, option)
            if selectlist_name.is_a? Hash
              select_args = selectlist_name
              unless select_args.size == 1
                raise "select_option only supports using a name or a hash with a single pair, e.g.: {:id => \"foo1\"}"
              end
              @@current_form = "//select[@#{select_args.keys.first}='#{select_args.values.first}']/ancestor::form"
            else
              @@current_form = "//select[@name='#{selectlist_name}']/ancestor::form"
            end
            list = @@agent.select_list(:name, selectlist_name)
            begin
              error = list.select(option)
            rescue
              list.select_value(option) || error
            end
          end

          def self.check_checkbox(checkbox_name)
            @@current_form = "//input[@name='#{checkbox_name}']/ancestor::form"
            @@agent.checkbox(:name, checkbox_name).set(true)
          end

          def self.check_radiobutton(checkbox_name, index=0)
            @@current_form = "//input[@name='#{checkbox_name}']/ancestor::form"
            @@agent.elements_by_xpath("//input[@name='#{checkbox_name}']")[index].set
          end

          #NOTE: this redefinition overrides the working click_image_map above
          def self.click_image_map(index=0)
            raise 'NotImplemented'
          end

          def self.wait(time=1)
            sleep(time)
            @@agent.wait
          end

          def self.close_firefox
            @@agent.close
          end
        end
      end
    end
  end
end
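The URL-classification rules in `determine_protocol` can be illustrated in isolation. A standalone sketch (the function name `detect_protocol` and its explicit parameters are mine, for illustration; the class-variable plumbing is omitted):

```ruby
# Classify a fetch target the way determine_protocol does: bare XPaths,
# http(s) URLs, scheme-less www hosts, and everything else as a local file.
def detect_protocol(url, old_protocol = nil)
  new_protocol =
    case url
    when %r{\A//}  then 'xpath'  # a bare XPath expression, not a URL
    when /\Ahttps/ then 'https'
    when /\Ahttp/  then 'http'
    when /\Awww\./ then 'http'   # scheme-less host names default to http
    else                'file'
    end
  # A relative path fetched right after an http(s) page stays on that protocol.
  return old_protocol if %w[http https].include?(old_protocol.to_s) && new_protocol == 'file'
  new_protocol
end

detect_protocol('http://scrubyt.org')     # => "http"
detect_protocol('results.html', 'https')  # => "https"
```

The carry-over rule at the end is what lets an extractor follow relative links on a site without losing track of the scheme.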