scrubyt 0.1.9 → 0.2.0
- data/CHANGELOG +52 -18
- data/README +69 -40
- data/Rakefile +42 -11
- data/lib/scrubyt/export.rb +4 -4
- data/lib/scrubyt/extractor.rb +50 -9
- data/lib/scrubyt/filter.rb +7 -7
- data/lib/scrubyt/pattern.rb +19 -10
- data/lib/scrubyt/post_processor.rb +15 -0
- data/lib/scrubyt/result_dumper.rb +14 -2
- data/lib/scrubyt/xpathutils.rb +9 -7
- data/test/unittests/filter_test.rb +8 -0
- data/test/unittests/xpathutils_test.rb +3 -3
- metadata +11 -11
data/CHANGELOG CHANGED

@@ -1,16 +1,47 @@
+= scRUBYt! changelog
+
+== 0.2.0
+=== 30th January, 2007
+
+The first ever public release, 0.2.0, is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet the release which you just pull out of the box and which works under any circumstances - however, the major bugs are fixed and the whole stuff is in a good-enough(TM) state, I guess.
+
+=<tt>Changes:</tt>
+
+* better form detection heuristics
+* report a message if there are absolutely no results
+* lots of bugfixes
+* fixed amazon_data.books[0].item[0].title[0] style output access,
+  and implemented it correctly in the case of crawling as well
+* fixed: /body/div/h3 was not detected as an XPath
+* fixed a crawling problem (improved heuristics of URL joining)
+* fixed the blackbox test runner - no more platform dependent code
+* fixed exporting bug: swapped exported XPaths in the case of no example present
+* fixed exporting bug: capturing \W (non-word character) after the pattern name;
+  this way we can distinguish pattern names where one name is a substring of the other
+* evaluation stops if the example was not found - but not in the case
+  of next page link lookup
+* google_data[0].link[0].url[0] style result lookup now works in the
+  case of more documents, too
+* tons of other bugfixes
+* overall stability fixes
+* more blackbox tests
+* more examples
+
 = 0.1.9
 === 28th January, 2007
 
 This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in; now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
 
+=<tt>Changes:</tt>
+
+* possibility to specify multiple examples (hence a pattern can have more filters)
+* enhanced heuristics for example text detection
+* first version of the algorithm to remove dupes resulting from multiple examples
+* empty XML leaf nodes are not written
+* new examples
+* TONS of bugfixes
 
 = 0.1
 === 15th January, 2007

@@ -20,15 +51,18 @@ This release was made more for myself (to try and test rubyforge, gems, etc) rat
 
 Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
 
-Navigation:
-fetching pages
-clicking links
-filling input fields
-submitting forms
+* Navigation:
+  * fetching pages
+  * clicking links
+  * filling input fields
+  * submitting forms
+  * automatically passing the document to the scraping
+  * both file and http:// support
+  * automatic crawling
+
+* Scraping:
+  * fairly powerful DSL to describe the full scraping process
+  * automatic navigation with WWW::Mechanize
+  * automatic scraping through examples with Hpricot
+  * automatic recursive scraping through the next button
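The `amazon_data.books[0].item[0].title[0]` access style mentioned in the changelog is the kind of navigation Ruby's `method_missing` hook makes cheap. A minimal sketch of the idea with a hypothetical `ResultNode` class (not scRUBYt!'s actual internals):

```ruby
# Hypothetical ResultNode illustrating method_missing-based result
# navigation; scRUBYt!'s real Pattern class works differently in detail.
class ResultNode
  attr_reader :name, :value

  def initialize(name, value = nil)
    @name, @value = name, value
    @children = Hash.new { |h, k| h[k] = [] }  # child name => list of nodes
  end

  def add(child)
    @children[child.name] << child
    child
  end

  # root.books returns the list of 'books' children, so the trailing
  # [0] indexing is plain Array#[] - no extra machinery needed.
  def method_missing(meth, *args)
    key = meth.to_s
    @children.key?(key) ? @children[key] : super
  end
end

root = ResultNode.new('root')
book = root.add(ResultNode.new('books'))
item = book.add(ResultNode.new('item'))
item.add(ResultNode.new('title', 'APPLE IPOD NANO'))

root.books[0].item[0].title[0].value  # => "APPLE IPOD NANO"
```

The nested-array shape is also why the crawling fix above needed care: with several crawled documents, each name lookup must be scoped to the right document's result tree.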
data/README CHANGED

@@ -1,70 +1,99 @@
-scRUBYt! - Hpricot and Mechanize on steroids
-============================================
+= scRUBYt! - Hpricot and Mechanize on steroids
 
-A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, Extract, query, transform and save relevant data from the Web page of interest by the concise and easy to use DSL
+A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web; extract, query, transform and save relevant data from the Web page of your interest with the concise and easy to use DSL.
 
-Why do we need one more web-scraping toolkit?
-=============================================
+Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them in a chunky DSL coating and sprinkled the whole thing with lots of convention-over-configuration(tm) goodies - and ... enter scRUBYt!, and decide for yourself.
+
+= Wait... why do we need one more web-scraping toolkit?
+
+After all, we have Hpricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
+Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations, with different requirements, than the previously mentioned ones.
 
 If you need something quick and/or would like to have maximal control over the scraping process, I recommend Hpricot. Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! operates on XPaths, sometimes you will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good old mantra: use the right tool for the right job!
 
-I hope there will be times when you will want to experiment with Pandora's box and reach after the power of scRUBYt! :-)
+I hope there will also be times when you will want to experiment with Pandora's box and reach for the power of scRUBYt! :-)
+
+= Sounds fine - show me an example!
+
+Let's apply the "show, don't tell" principle. Okay, here we go:
 
-OK, OK, I believe you, what should I do?
-========================================
+<tt>ebay_data = Scrubyt::Extractor.define do</tt>
 
+  fetch 'http://www.ebay.com/'
+  fill_textfield 'satitle', 'ipod'
+  submit
+  click_link 'Apple iPod'
+
+  record do
+    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
+    price '$71.99'
+  end
+  next_page 'Next >', :limit => 5
 
-rubyrailways.com (some theory)
-future: public extractor repository
+<tt>end</tt>
 
-How to install
-==============
+output:
 
+<tt><root></tt>
+  <record>
+    <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
+    <price>$149.95</price>
+  </record>
+  <record>
+    <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
+    <price>$172.50</price>
+  </record>
+  <record>
+    <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
+    <price>$171.06</price>
+  </record>
+  <!-- another 200+ results -->
+<tt></root></tt>
 
+This was a relatively beginner-level example (scRUBYt! knows a lot more than this, and there are much more complicated extractors than the one above) - yet it did a lot of things automagically. First of all, it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that looked like the specified example (which, by the way, also described how the output structure should look) - on the first 5 result pages. Not so bad for about 10 lines of code, eh?
 
+= OK, OK, I believe you, what should I do?
 
+You can find everything you will need at these addresses (and if not, I doubt you will find it elsewhere...). See the next section about installation, and after installing be sure to check out these URLs:
 
+* <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek at web scraping in general and/or you would like to understand what's going on under the hood, check out <a href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about web-scraping</a>!
+* <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
+* <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online RDoc
+* <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including open and closed bugs, files etc.
+* projects.rubyforge.org/scrubyt/files... - a fair amount (still growing with every release) of examples, showcasing the features of scRUBYt!
+* planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will have a community, and people will upload their extractors for whatever reason
 
+If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
 
+= How to install
 
+scRUBYt! requires these packages to be installed:
 
+* Ruby 1.8.4
+* Hpricot 0.5
+* Mechanize 0.6.3
 
+I assume you have ruby and rubygems installed. To install WWW::Mechanize 0.6.3 or higher, just run
 
+<tt>sudo gem install mechanize</tt>
 
-Additional installation notes
-=============================
-
-you will have to install ragel (dependency of HPricot) with something like
+Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with
 
+<tt>sudo gem install hpricot</tt>
 
+Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scrubyt with
 
+<tt>sudo gem install scrubyt</tt>
 
+If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
 
+= Author
 
+Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)
 
+= Copyright
 
+This library is distributed under the GPL. Please see the LICENSE file.
data/Rakefile CHANGED

@@ -1,6 +1,7 @@
 require 'rake/rdoctask'
 require 'rake/testtask'
 require 'rake/gempackagetask'
+require 'rake/packagetask'
 
 ###################################################
 # Dependencies
@@ -8,6 +9,8 @@ require 'rake/gempackagetask'
 
 task "default" => ["test"]
 task "fulltest" => ["test", "blackbox"]
+task "generate_rdoc" => ["cleanup_readme"]
+task "cleanup_readme" => ["rdoc"]
 
 ###################################################
 # Gem specification
@@ -15,13 +18,13 @@ task "fulltest" => ["test", "blackbox"]
 
 gem_spec = Gem::Specification.new do |s|
   s.name = 'scrubyt'
-  s.version = '0.
+  s.version = '0.2.0'
   s.summary = 'A powerful Web-scraping framework'
   s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on Hpricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
   # Files containing Test::Unit test cases.
   s.test_files = FileList['test/unittests/**/*']
   # List of other files to be included.
-  s.files = FileList['
+  s.files = FileList['COPYING', 'README', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
   s.author = 'Peter Szinek'
   s.email = 'peter@rubyrailways.com'
   s.homepage = 'http://www.scrubyt.org'
@@ -32,14 +35,14 @@ end
 # Tasks
 ###################################################
 
-Rake::RDocTask.new do |
-  files = ['lib/**/*.rb', 'README']
+Rake::RDocTask.new do |generate_rdoc|
+  files = ['lib/**/*.rb', 'README', 'CHANGELOG']
+  generate_rdoc.rdoc_files.add(files)
+  generate_rdoc.main = "README" # page to start on
+  generate_rdoc.title = "Scrubyt Documentation"
+  generate_rdoc.template = "resources/allison/allison.rb"
+  generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
+  generate_rdoc.options << '--line-numbers' << '--inline-source'
 end
 
 Rake::TestTask.new do |test|
@@ -50,7 +53,35 @@ task "blackbox" do
   ruby "test/blackbox/run_blackbox_tests.rb"
 end
 
+task "cleanup_readme" do
+  puts "Cleaning up README..."
+  readme_in = open('./doc/files/README.html')
+  content = readme_in.read
+  content.sub!('<h1 id="item_name">File: README</h1>','')
+  content.sub!('<h1>Description</h1>','')
+  readme_in.close
+  open('./doc/files/README.html', 'w') {|f| f.write(content)}
+  #OK, this is ugly as hell and as non-DRY as possible, but
+  #I don't have time to deal with it right now
+  puts "Cleaning up CHANGELOG..."
+  readme_in = open('./doc/files/CHANGELOG.html')
+  content = readme_in.read
+  content.sub!('<h1 id="item_name">File: CHANGELOG</h1>','')
+  content.sub!('<h1>Description</h1>','')
+  readme_in.close
+  open('./doc/files/CHANGELOG.html', 'w') {|f| f.write(content)}
+end
+
+task "generate_rdoc" do
+end
+
 Rake::GemPackageTask.new(gem_spec) do |pkg|
   pkg.need_zip = false
   pkg.need_tar = false
 end
+
+Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
+  pkg.need_zip = true
+  pkg.need_tar = true
+  pkg.package_files.include("examples/**/*")
+end
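The cleanup task above admits the DRYer shape its own comment asks for. A sketch with a hypothetical helper that works on the page content as a string, parameterized by the file name instead of duplicating the task body:

```ruby
# Hypothetical helper (not part of the Rakefile above): strip the
# RDoc-generated "File:" and "Description" headers from a page's HTML.
def strip_rdoc_headers(content, name)
  content.sub("<h1 id=\"item_name\">File: #{name}</h1>", '')
         .sub('<h1>Description</h1>', '')
end

html = '<h1 id="item_name">File: README</h1><p>intro</p><h1>Description</h1>'
strip_rdoc_headers(html, 'README')  # => "<p>intro</p>"
```

The task would then read each file, pass it through the helper, and write it back, once per name in `%w[README CHANGELOG]`.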
data/lib/scrubyt/export.rb CHANGED

@@ -1,5 +1,5 @@
 #require File.join(File.dirname(__FILE__), 'pattern.rb')
 
 module Scrubyt
   # =<tt>exporting previously defined extractors</tt>
   class Export
@@ -142,14 +142,14 @@ private
     @name_to_xpath_map = {}
     create_name_to_xpath_map(pattern)
     #Replace the examples which are quoted with " and '
     @name_to_xpath_map.each do |name, xpaths|
       replace_example_with_xpath(name, xpaths, %q{"})
       replace_example_with_xpath(name, xpaths, %q{'})
     end
     #Finally, add XPaths to patterns which had no example at the beginning (the XPath was
     #generated from the child patterns)
     @name_to_xpath_map.each do |name, xpaths|
-      xpaths.each do |xpath|
+      xpaths.reverse.each do |xpath|
         comma = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0].sub('do'){}.strip == '' ? '' : ','
         if (@full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]).include?('{')
           @full_definition.sub!("P.#{name}") {"P.#{name}('#{xpath}')"}
@@ -180,7 +180,7 @@ private
 
     def self.replace_example_with_xpath(name, xpaths, left_delimiter, right_delimiter=left_delimiter)
       return if name=='root'
-      full_line = @full_definition.scan(
+      full_line = @full_definition.scan(/P.#{name}\W(.+)$/)[0][0]
       examples = full_line.split(",")
       examples.reject! {|exa| exa.strip!; exa[0..0] != %q{"} && exa[0..0] != %q{'} }
       all_xpaths = ""
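The `\W` after the pattern name in the scan above is the exporting fix noted in the CHANGELOG: without it, looking up `P.item` would also match inside `P.item_name`. A small sketch of the effect, with a hypothetical helper and definition string:

```ruby
# Hypothetical helper: grab the rest of the line defining pattern `name`.
# The \W (non-word character) required right after the name prevents a
# match inside a longer name such as item_name.
def definition_line(defn, name)
  defn[/P\.#{name}\W(.+)$/, 1]
end

defn = "P.item_name 'APPLE IPOD'\nP.item 'row'"
definition_line(defn, 'item')       # => "'row'"
definition_line(defn, 'item_name')  # => "'APPLE IPOD'"
```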
data/lib/scrubyt/extractor.rb CHANGED

@@ -46,6 +46,7 @@ module Scrubyt
       end
       ensure_all_postconditions(root_pattern)
       PostProcessor.remove_multiple_filter_duplicates(root_pattern)
+      PostProcessor.report_if_no_results(root_pattern)
       #Return the root pattern
       root_pattern
     end
@@ -121,21 +122,28 @@ module Scrubyt
         @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
       end
 
       if @@host_name != nil
         if doc_url !~ /#{@@host_name}/
           @@current_doc_url = (@@host_name + doc_url)
+          #remove duplicate parts, like /blogs/en/blogs/en
+          @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
+          @@current_doc_url.sub!('http:/', 'http://')
         end
       end
       puts "[ACTION] fetching document: #{@@current_doc_url}"
+      if @@current_doc_protocol == :http
+        @@mechanize_doc = @@agent.get(@@current_doc_url)
+        @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
+        @@host_name = doc_url if @@host_name == nil
+      end
     else
       @@current_doc_url = doc_url
       @@mechanize_doc = mechanize_doc
       @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
       @@host_name = doc_url if @@host_name == nil
     end
     @@hpricot_doc = Hpricot(open(@@current_doc_url))
   end
 
   ##
@@ -150,23 +158,56 @@ module Scrubyt
   def self.fill_textfield(textfield_name, query_string)
     puts "[ACTION] typing #{query_string} into the textfield named '#{textfield_name}'"
     textfield = (@@hpricot_doc/"input[@name=#{textfield_name}]").map()[0]
+    form_tag = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form')
+    #Refactor this code, it's a total mess
+    formname = form_tag.attributes['name']
+    if formname == nil
+      id_string = form_tag.attributes['id']
+      if id_string == nil
+        action_string = form_tag.attributes['action']
+        if action_string == nil
+          #If even this fails, do it with a button
+        else
+          puts "Finding from action"
+          puts action_string
+          find_form_with_attribute('action', action_string)
+        end
+      else
+        puts "Finding from id"
+        find_form_with_attribute('id', id_string)
+      end
+    else
+      puts "Finding from name"
+      @@current_form = @@mechanize_doc.forms.with.name(formname).first
+    end
 
     eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
   end
 
+  def self.find_form_with_attribute(attr, expected_value)
+    puts "attr: #{attr}"
+    i = 0
+    loop do
+      @@current_form = @@mechanize_doc.forms[i]
+      print "current a: "
+      puts @@current_form.form_node.attributes[attr]
+      return nil if @@current_form == nil
+      break if @@current_form.form_node.attributes[attr] == expected_value
+      i += 1
+    end
+  end
+
   #Submit the last form
   def self.submit
     puts '[ACTION] submitting form...'
     result_page = @@agent.submit(@@current_form)#, @@current_form.buttons.first)
     @@current_doc_url = result_page.uri.to_s
+    puts "[ACTION] fetched #{@@current_doc_url}"
     fetch(@@current_doc_url, result_page)
   end
 
   def self.click_link(link_text)
     puts "[ACTION] clicking link: #{link_text}"
-    #puts /^#{Regexp.escape(link_text)}$/
-    #p /^#{Regexp.escape(link_text)}$/
     link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
     result_page = @@agent.click(link)
     @@current_doc_url = result_page.uri.to_s
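The duplicate-segment removal added to `fetch` above can be followed in isolation. A sketch with a hypothetical helper mirroring the split/uniq/join heuristic, with the caveat that `uniq` would also collapse a path whose segments legitimately repeat:

```ruby
# Hypothetical helper mirroring the heuristic from fetch: join host and
# relative URL, drop duplicated path segments such as /blogs/en/blogs/en,
# then repair the protocol separator that the join flattened to http:/.
def join_crawl_url(host, path)
  url = (host + path).split('/').uniq.reject { |seg| seg == "" }.join('/')
  url.sub('http:/', 'http://')
end

join_crawl_url('http://example.com/blogs/en/', '/blogs/en/page2.html')
# => "http://example.com/blogs/en/page2.html"
```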
data/lib/scrubyt/filter.rb CHANGED

@@ -53,8 +53,10 @@ module Scrubyt
       @parent_pattern = parent_pattern
       #If the example type is not explicitly defined in the pattern definition,
       #try to determine it automatically from the example
+      #@example_type = (args[0] == nil ? Filter.determine_example_type(example) :
+      #                 args[0][:example_type])
+      #TODOOOOO correct this!
       @example_type = Filter.determine_example_type(example)
       @sink = [] #output of a filter
       @source = [] #input of a filter
       @example = example
@@ -67,14 +69,13 @@ module Scrubyt
     #Evaluate this filter. This method should not be called directly - as the pattern hierarchy
     #is evaluated, every pattern evaluates its filters, and they then call this method
     def evaluate(source)
-      @parent_pattern.root_pattern.already_evaluated_sources ||= {}
       case @parent_pattern.type
       when Scrubyt::Pattern::PATTERN_TYPE_TREE
         result = source/@xpath
         result.class == Hpricot::Elements ? result.map : [result]
       when Scrubyt::Pattern::PATTERN_TYPE_ATTRIBUTE
         [source.attributes[@example]]
       when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
         source.inner_text.scan(@example).flatten
       end
     end
@@ -87,10 +88,9 @@ module Scrubyt
       when EXAMPLE_TYPE_XPATH
         @xpath = @example
       when EXAMPLE_TYPE_STRING
-        @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example )
+        @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example, false )
         @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
                                               XPathUtils.generate_XPath(@temp_sink, nil, true)
-        puts @xpath
       when EXAMPLE_TYPE_CHILDREN
         current_example_index = 0
         loop do
@@ -148,7 +148,7 @@ private
         EXAMPLE_TYPE_CHILDREN
       when /\.(jpg|png|gif|jpeg)$/
         EXAMPLE_TYPE_IMAGE
-      when /^\/{1,2}[a-z]
+      when /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/
         (example.include? '/' || example.include?('[')) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING
       else
         EXAMPLE_TYPE_STRING
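The widened regexp above is what makes examples like `/body/div/h3` register as XPaths, one of the 0.2.0 fixes noted in the CHANGELOG, and it can be exercised on its own:

```ruby
# The XPath-detection regexp from determine_example_type, tried standalone:
# one or two slashes, a lowercase tag name with optional digit and [n]
# index, repeated.
XPATH_EXAMPLE = /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/

'/body/div/h3' =~ XPATH_EXAMPLE  # => 0 (matches)
'//td[2]/a'    =~ XPATH_EXAMPLE  # => 0 (matches)
'Apple iPod'   =~ XPATH_EXAMPLE  # => nil (treated as a plain text example)
```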
data/lib/scrubyt/pattern.rb CHANGED

@@ -43,7 +43,7 @@ module Scrubyt
     attr_accessor :name, :output_type, :generalize, :children, :filters, :parent,
                   :last_result, :result, :root_pattern, :example, :block_count,
                   :next_page, :limit, :extractor, :extracted_docs,
-                  :examples, :parent_of_leaf
+                  :examples, :parent_of_leaf, :document_index
     attr_reader :type, :generalize_set, :next_page_url
 
     def initialize (name, *args)
@@ -56,6 +56,7 @@ module Scrubyt
       @@instance_count = Hash.new(0)
       @evaluated_examples = []
       @next_page = nil
+      @document_index = 0
       if @examples == nil
         filters << Scrubyt::Filter.new(self) #create a default filter
       else
@@ -74,6 +75,7 @@ module Scrubyt
       #Grab any examples that are defined!
       look_for_examples(args)
       args.each do |arg|
+        next if !arg.is_a? Hash
         arg.each do |k,v|
           #Set only the settable fields
           if SETTABLE_FIELDS.include? k.to_s
@@ -92,7 +94,6 @@ module Scrubyt
       #default settings - the user can override them, but if she did not do so,
       #we will set up some meaningful defaults
       @type ||= PATTERN_TYPE_TREE
-      @type = PATTERN_TYPE_REGEXP if @example.instance_of? Regexp
       @output_type ||= OUTPUT_TYPE_MODEL
       #don't generalize by default
       @generalize ||= false
@@ -127,11 +128,20 @@ module Scrubyt
     #  camera_data.item[1].item_name[0]
     #
     #possible. The method Pattern#method_missing handles the 'item', 'item_name' etc.
-    #parts, while the indexing ([1], [0]) is handled by this function
+    #parts, while the indexing ([1], [0]) is handled by this function.
+    #If you would like to select a different document than the first one (which is
+    #the default), you should use the form:
+    #
+    #  camera_data[1].item[1].item_name[0]
     def [](index)
+      if @name == 'root'
+        @root_pattern.document_index = index
+      else
+        @parent.last_result = @parent.last_result[@root_pattern.document_index] if @parent.last_result.is_a? Array
+        return nil if (@result.lookup(@parent.last_result)) == nil
+        @last_result = @result.lookup(@parent.last_result)[index]
+      end
       self
     end
 
     ##
@@ -217,9 +227,6 @@ module Scrubyt
           sorted_result = r.reject {|e| !result.keys.include? e}
           add_result(filter, source, sorted_result)
         else
-          if ( (xe = @result.lookup(source)) != nil )
-            #puts "ha"; p xe
-          end
           add_result(filter, source, r)
         end #end of constraint check
       end #end of source iteration
@@ -246,6 +253,7 @@ private
         end
       end
     elsif (args[0].is_a? Regexp)
+      @examples = args.select {|e| e.is_a? Regexp}
       #Check if all the String parameters are really the first
       #parameters
       args[0..@examples.size].each do |example|
@@ -253,6 +261,7 @@ private
         puts 'FATAL: Problem with example specification'
       end
     end
+    @type = PATTERN_TYPE_REGEXP
   end
 end
 
@@ -299,7 +308,7 @@ private
     end
 
     def generate_next_page_link(example)
-      node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example)
+      node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example, true)
       return nil if node == nil
      node.attributes['href'].gsub('&amp;') {'&'}
     end # end of method generate_next_page_link
data/lib/scrubyt/post_processor.rb CHANGED

@@ -18,6 +18,21 @@ module Scrubyt
       remove_multiple_filter_duplicates_intern(pattern) if pattern.parent_of_leaf
       pattern.children.each {|child| remove_multiple_filter_duplicates(child)}
     end
+
+    ##
+    #Issue an error report if the document did not extract anything.
+    #Probably this is because the structure of the page changed or
+    #because of some rather nasty bug - in any case, something wrong
+    #is going on, and we need to inform the user about it!
+    def self.report_if_no_results(root_pattern)
+      results_found = false
+      root_pattern.children.each {|child| return if (child.result.childmap.size > 0)}
+      puts
+      puts "!!!!!! WARNING: The extractor did not find any result instances"
+      puts "Most probably this is wrong. Check your extractor and if you are"
+      puts "sure it should work, report a bug!"
+      puts
+    end
 
 private
     def self.remove_multiple_filter_duplicates_intern(pattern)
data/lib/scrubyt/result_dumper.rb CHANGED

@@ -1,4 +1,5 @@
 require 'rexml/document'
+require 'rexml/xpath'
 
 module Scrubyt
   ##
@@ -16,7 +17,7 @@ module Scrubyt
       to_xml_recursive(pattern, root)
     end
     remove_empty_leaves(doc)
-    doc
+    @@last_doc = doc
   end
 
   def self.remove_empty_leaves(node)
@@ -80,11 +81,22 @@ private
       end
     end
 
-    def self.
+    def self.print_old_sta(pattern, depth)
       puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{pattern.get_instance_count[pattern.name]} instances.") if pattern.name != 'root'
       pattern.children.each do |child|
         print_statistics_recursive(child, depth + 4)
+      end
+    end
+
+    def self.print_statistics_recursive(pattern, depth)
+      if pattern.name != 'root'
+        count = REXML::XPath.match(@@last_doc, "//#{pattern.name}").size
+        puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{count} instances.")
       end
+
+      pattern.children.each do |child|
+        print_statistics_recursive(child, depth + 4)
+      end
     end #end of method print_statistics_recursive
   end #end of class ResultDumper
 end #end of module Scrubyt
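The new statistics code counts instances by querying the generated XML output with REXML rather than trusting the per-pattern counters. The same query in isolation, over a tiny stand-in document:

```ruby
require 'rexml/document'
require 'rexml/xpath'

# Count extracted instances by matching node names in the XML output,
# as print_statistics_recursive does with "//#{pattern.name}".
xml = '<root><record><price>$149.95</price></record>' \
      '<record><price>$172.50</price></record></root>'
doc = REXML::Document.new(xml)

REXML::XPath.match(doc, '//price').size  # => 2
```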
data/lib/scrubyt/xpathutils.rb CHANGED

@@ -21,21 +21,23 @@ module Scrubyt
     #  <a>Bon <b>nuit</b>, monsieur!</a>
     #
     #In this case, <a>'s text is considered to be "Bon nuit, monsieur!"
-    def self.find_node_from_text(doc, text)
+    def self.find_node_from_text(doc, text, next_link)
       @node = nil
       @found = false
       self.traverse_for_full_text(doc,text)
       self.lowest_possible_node_with_text(@node, text) if @node != nil
-      #$Logger.warn("Node for example #{text} Not found!") if (@found == false)
       if (@found == false)
         #Fall back to per-node text lookup
         self.traverse_for_node_text(doc,text)
-        if (@found == false)
-          puts "
+        if (@found == false)
+          return nil if next_link
+          puts "!" * 65
+          puts "!!!!!! FATAL: Node for example #{text} Not found! !!!!!!"
+          puts "!!!!!! Please make sure you specified the example properly !!!!!!"
+          puts "!" * 65
+          exit
         end
       end
-      p @node
       @node
     end
 
@@ -135,7 +137,7 @@ module Scrubyt
     #_index_ - there might be more images with the same src on the page -
     #most typically the user will need the 0th - but if this is not the
     #case, there is the possibility to override it
-    def self.find_image(doc, example, index=
+    def self.find_image(doc, example, index=0)
       (doc/"img[@src='#{example}']")[index]
     end
data/test/unittests/filter_test.rb CHANGED

@@ -22,7 +22,15 @@ class FilterTest < Test::Unit::TestCase
                  Scrubyt::Filter::EXAMPLE_TYPE_IMAGE)
     #Test XPaths
     assert_equal(Scrubyt::Filter.determine_example_type('/p/img'),
+                 Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
+    assert_equal(Scrubyt::Filter.determine_example_type('/p/h3'),
+                 Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
+    assert_equal(Scrubyt::Filter.determine_example_type('/p/h3/a/h2'),
+                 Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
+    assert_equal(Scrubyt::Filter.determine_example_type('/h2'),
                  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
+    assert_equal(Scrubyt::Filter.determine_example_type('/h1/h3'),
+                 Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
     assert_equal(Scrubyt::Filter.determine_example_type('/p'),
                  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
     assert_equal(Scrubyt::Filter.determine_example_type('//p'),
data/test/unittests/xpathutils_test.rb CHANGED

@@ -55,14 +55,14 @@ class XPathUtilsTest < Test::Unit::TestCase
   end
 
   def test_find_node_from_text
-    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "fff")
+    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "fff", false)
     assert_instance_of(Hpricot::Elem, elem)
     assert_equal(elem, @f)
 
-    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "dddd")
+    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "dddd", false)
     assert_equal(elem, @d)
 
-    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "rrr")
+    elem = Scrubyt::XPathUtils.find_node_from_text(@doc1, "rrr", false)
     assert_equal(elem, @r)
   end
metadata CHANGED

@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
 specification_version: 1
 name: scrubyt
 version: !ruby/object:Gem::Version
-  version: 0.
-date: 2007-
+  version: 0.2.0
+date: 2007-02-04 00:00:00 +01:00
 summary: A powerful Web-scraping framework
 require_paths:
 - lib
@@ -29,29 +29,29 @@ post_install_message:
 authors:
 - Peter Szinek
 files:
-- README
 - COPYING
+- README
 - CHANGELOG
 - Rakefile
 - lib/scrubyt.rb
-- lib/scrubyt/constraint_adder.rb
 - lib/scrubyt/constraint.rb
-- lib/scrubyt/result_dumper.rb
-- lib/scrubyt/export.rb
-- lib/scrubyt/extractor.rb
-- lib/scrubyt/filter.rb
 - lib/scrubyt/pattern.rb
 - lib/scrubyt/result.rb
+- lib/scrubyt/export.rb
+- lib/scrubyt/constraint_adder.rb
 - lib/scrubyt/post_processor.rb
+- lib/scrubyt/filter.rb
 - lib/scrubyt/xpathutils.rb
+- lib/scrubyt/result_dumper.rb
+- lib/scrubyt/extractor.rb
 test_files:
 - test/unittests/input
+- test/unittests/constraint_test.rb
 - test/unittests/filter_test.rb
-- test/unittests/extractor_test.rb
 - test/unittests/xpathutils_test.rb
-- test/unittests/
-- test/unittests/input/constraint_test.html
+- test/unittests/extractor_test.rb
 - test/unittests/input/test.html
+- test/unittests/input/constraint_test.html
 rdoc_options: []
 
 extra_rdoc_files: []