scrubyt 0.1.9 → 0.2.0

data/CHANGELOG CHANGED
@@ -1,16 +1,47 @@
1
+ = scRUBYt! changelog
2
+
3
+ == 0.2.0
4
+ === 30th January, 2007
5
+
6
+ The first ever public release, 0.2.0, is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet a release which you just pull out of the box and it works under any circumstances - however, the major bugs are fixed and everything is in a good-enough(TM) state, I guess.
7
+
8
+ =<tt>Changes</tt>:
9
+
10
+ * better form detection heuristics
11
+ * report message if there are absolutely no results
12
+ * lots of bugfixes
13
+ * fixed amazon_data.books[0].item[0].title[0] style output access
14
+ and implemented it correctly in the case of crawling as well
15
+ * /body/div/h3 not detected as XPath
16
+ * crawling problem (improved heuristics of url joining)
17
+ * fixed blackbox test runner - no more platform dependent code
18
+ * fixed exporting bug: swapped exported XPaths in the case of no example present
19
+ * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
20
+ name is a substring of the other
21
+ * Evaluation stops if the example was not found - but not in the case
22
+ of next page link lookup
23
+ * google_data[0].link[0].url[0] style result lookup now works in the
24
+ case of more documents, too (see the sketch after this list)
25
+ * tons of other bugfixes
26
+ * overall stability fixes
27
+ * more blackbox tests
28
+ * more examples
29
30
+
31
+
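+ A minimal, hypothetical sketch of this indexed result lookup (the google_data extractor and its link/url patterns are made-up names; only the access style itself comes from the entries above):
+
+   # The leading index selects the result document (crawled page) and defaults to 0;
+   # the following indices select pattern instances within that document.
+   google_data[0].link[0].url[0]   # first page, first link instance, its first url
+   google_data[1].link[0].url[0]   # the same lookup on the second result page
+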
1
32
  = 0.1.9
2
33
  === 28th January, 2007
3
34
 
4
35
  This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in, now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
5
36
 
6
- changes:
37
+ =<tt>Changes</tt>:
7
38
 
8
- - Possibility to specify multiple examples (hence a pattern can have more filters)
9
- - Enhanced heuristics for example text detection
10
- - First version of algorithm to remove dupes resulting from multiple examples
11
- - empty XML leaf nodes are not written
12
- - new examples
13
- - TONS of bugfixes
39
+ * Possibility to specify multiple examples (hence a pattern can have more filters)
40
+ * Enhanced heuristics for example text detection
41
+ * First version of algorithm to remove dupes resulting from multiple examples
42
+ * empty XML leaf nodes are not written
43
+ * new examples
44
+ * TONS of bugfixes
14
45
 
15
46
  = 0.1
16
47
  === 15th January, 2007
@@ -20,15 +51,18 @@ This release was made more for myself (to try and test rubyforge, gems, etc) rat
20
51
 
21
52
  Fairly nice set of features, but still need a lot of testing and stabilizing before it will be really usable.
22
53
 
23
- Navigation:
24
- fetching pages
25
- clicking links
26
- filling input fields
27
- submitting forms
28
-
29
- Scraping:
30
- - Fairly powerful DSL to describe the full scraping process
31
- - Automatic navigation with WWW::Mechanize
32
- - Automatic scraping through examples with Hpricot
33
- - automatic recursive scraping through the next button
54
+ * Navigation:
55
+ * fetching pages
56
+ * clicking links
57
+ * filling input fields
58
+ * submitting forms
59
+ * automatically passing the document to the scraping
60
+ * both files and http:// support
61
+ * automatic crawling
62
+
63
+ * Scraping:
64
+ * Fairly powerful DSL to describe the full scraping process
65
+ * Automatic navigation with WWW::Mechanize
66
+ * Automatic scraping through examples with Hpricot
67
+ * automatic recursive scraping through the next button
34
68
 
data/README CHANGED
@@ -1,70 +1,99 @@
1
- ============================================
2
- scRUBYt! - Hpricot and Mechanize on steroids
3
- ============================================
1
+ = scRUBYt! - Hpricot and Mechanize on steroids
4
2
 
5
- A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, Extract, query, transform and save relevant data from the Web page of interest by the concise and easy to use DSL provided by scRUBYt!.
3
+ A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, then extract, query, transform and save relevant data from the Web page of your interest with the concise and easy-to-use DSL.
6
4
 
7
- =============================================
8
- Why do we need one more web-scraping toolkit?
9
- =============================================
5
+ Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them in a chunky DSL coating and sprinkled the whole thing with lots of convention over configuration(tm) goodies - and ... enter scRUBYt! - decide for yourself.
10
6
 
11
- After all, we have HPricot, and Rubyful soup, and Mechanize, and scrAPI, and ARIEL and ...
12
- Well, because scRUBYt! is different. It has entirely different philosophy, underlying techniques, use cases - shortly it should be used in different situations with different requirements than the previosly mentioned ones.
7
+ = Wait... why do we need one more web-scraping toolkit?
8
+
9
+ After all, we have HPricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
10
+ Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations with different requirements than the previously mentioned ones.
13
11
 
14
12
  If you need something quick and/or would like to have maximal control over the scraping process, I recommend Hpricot. Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! is operating based on XPaths, sometimes you will choose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good old mantra: use the right tool for the right job!
15
13
 
16
- I hope there will be times when you will want to experiment with Pandora's box and reach after the power of scRUBYt! :-)
14
+ I hope there will also be times when you want to experiment with Pandora's box and reach for the power of scRUBYt! :-)
15
+
16
+ = Sounds fine - show me an example!
17
+
18
+ Let's apply the "show don't tell" principle. Okay, here we go:
17
19
 
18
- ========================================
19
- OK, OK, I believe you, what should I do?
20
- ========================================
20
+ <tt>ebay_data = Scrubyt::Extractor.define do</tt>
21
21
 
22
- Useful adresses
22
+ fetch 'http://www.ebay.com/'
23
+ fill_textfield 'satitle', 'ipod'
24
+ submit
25
+ click_link 'Apple iPod'
26
+
27
+ record do
28
+ item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
29
+ price '$71.99'
30
+ end
31
+ next_page 'Next >', :limit => 5
23
32
 
24
- scrubyt.rubyforge.org
25
- rubyrailways.com (some theory)
26
- future: public extractor repository
33
+ <tt>end</tt>
27
34
 
28
- ==============
29
- How to install
30
- ==============
35
+ output:
31
36
 
32
- Dependencies:
37
+ <tt><root></tt>
38
+ <record>
39
+ <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
40
+ <price>$149.95</price>
41
+ </record>
42
+ <record>
43
+ <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
44
+ <price>$172.50</price>
45
+ </record>
46
+ <record>
47
+ <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
48
+ <price>$171.06</price>
49
+ </record>
50
+ <!-- another 200+ results -->
51
+ <tt></root></tt>
33
52
 
34
- Ruby 1.8.4 (or higher)
35
- Hpricot 0.4.84
36
- Mechanize 0.6.3 (or higher)
53
+ This was a relatively beginner-level example (scRUBYt! knows a lot more than this and there are much more complicated extractors than the above one) - yet it did a lot of things automagically. First of all,
54
+ it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods
55
+ and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that
56
+ looked like the specified example (which, by the way, also describes what the output structure should look like) - on the first 5 result pages. Not so bad for about 10 lines of code, eh?
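+
+ Besides the XML output, the extracted data can also be reached from Ruby with the indexed lookup style mentioned in the CHANGELOG - a rough sketch, reusing the pattern names from the example above (the exact return values may differ):
+
+   ebay_data[0].record[0].item_name[0]   # first result page, first record, its first item_name
+   ebay_data[0].record[0].price[0]       # ... and the corresponding price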
37
57
 
38
- I assume you have Ruby any Rubygems installed. To install Mechanize, just run
58
+ = OK, OK, I believe you, what should I do?
39
59
 
40
- sudo gem install mechanize
60
+ You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the next section about installation, and after installing be sure to check out these URLs:
41
61
 
42
- Hpricot (until 0.5 comes out) is a little bit tougher nut to crack, since you need a special version, not the latest stable.
62
+ * <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek at web scraping in general and/or you would like to understand what's going on under the hood, check out <a href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about web-scraping</a>!
63
+ * <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
64
+ * <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online Rdoc
65
+ * <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including open and closed bugs, files etc.
66
+ * projects.rubyforge.org/scrubyt/files... - a fair amount (and still growing with every release) of examples, showcasing the features of scRUBYt!
67
+ * planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will have a community, and people will upload their extractors for whatever reason
43
68
 
44
- After this, you are ready to install Hpricot 0.4.86 (if there is no 86, choose the next, e.g. 88 - ). Run
69
+ If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
45
70
 
46
- gem install hpricot --source code.whytheluckystiff.net
71
+ = How to install
47
72
 
48
- and choose the correct version (e.g. 0.4.86)
73
+ scRUBYt! requires these packages to be installed:
49
74
 
50
- To test whether everything is working, from the svn directory launch
75
+ * Ruby 1.8.4
76
+ * Hpricot 0.5
77
+ * Mechanize 0.6.3
51
78
 
52
- rake fulltest
79
+ I assume you have Ruby and RubyGems installed. To install WWW::Mechanize 0.6.3 or higher, just run
53
80
 
54
- You should see 0 errors...
81
+ <tt>sudo gem install mechanize</tt>
55
82
 
56
- =============================
57
- Additional installation notes
58
- =============================
83
+ Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with
59
84
 
60
- [1]
61
- you will have to install ragel (dependency of HPricot) with something like
85
+ <tt>sudo gem install hpricot</tt>
62
86
 
63
- sudo apt-get install ragel
87
+ Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scrubyt with
64
88
 
65
- depending on your distro (the above works for debian based stuff).
89
+ <tt>sudo gem install scrubyt</tt>
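+
+ To quickly check that the gem and its dependencies load - just a hypothetical sanity check, not an official step - you can run
+
+ <tt>ruby -e "require 'rubygems'; require 'scrubyt'; puts 'scRUBYt! loaded'"</tt>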
66
90
 
91
+ If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
67
92
 
93
+ = Author
68
94
 
95
+ Copyright (c) 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)
69
96
 
97
+ = Copyright
70
98
 
99
+ This library is distributed under the GPL. Please see the LICENSE file.
data/Rakefile CHANGED
@@ -1,6 +1,7 @@
1
1
  require 'rake/rdoctask'
2
2
  require 'rake/testtask'
3
3
  require 'rake/gempackagetask'
4
+ require 'rake/packagetask'
4
5
 
5
6
  ###################################################
6
7
  # Dependencies
@@ -8,6 +9,8 @@ require 'rake/gempackagetask'
8
9
 
9
10
  task "default" => ["test"]
10
11
  task "fulltest" => ["test", "blackbox"]
12
+ task "generate_rdoc" => ["cleanup_readme"]
13
+ task "cleanup_readme" => ["rdoc"]
11
14
 
12
15
  ###################################################
13
16
  # Gem specification
@@ -15,13 +18,13 @@ task "fulltest" => ["test", "blackbox"]
15
18
 
16
19
  gem_spec = Gem::Specification.new do |s|
17
20
  s.name = 'scrubyt'
18
- s.version = '0.1.9'
21
+ s.version = '0.2.0'
19
22
  s.summary = 'A powerful Web-scraping framework'
20
23
  s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. Its most interesting part is a Web-scraping DSL built on Hpricot and WWW::Mechanize, which allows you to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
21
24
  # Files containing Test::Unit test cases.
22
25
  s.test_files = FileList['test/unittests/**/*']
23
26
  # List of other files to be included.
24
- s.files = FileList['README', 'COPYING', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
27
+ s.files = FileList['COPYING', 'README', 'CHANGELOG', 'Rakefile', 'lib/**/*.rb']
25
28
  s.author = 'Peter Szinek'
26
29
  s.email = 'peter@rubyrailways.com'
27
30
  s.homepage = 'http://www.scrubyt.org'
@@ -32,14 +35,14 @@ end
32
35
  # Tasks
33
36
  ###################################################
34
37
 
35
- Rake::RDocTask.new do |rdoc|
36
- files = ['lib/**/*.rb', 'README']
37
- rdoc.rdoc_files.add(files)
38
- rdoc.main = "README" # page to start on
39
- rdoc.title = "Scrubyt Documentation"
40
- rdoc.template = "resources/allison/allison.rb"
41
- rdoc.rdoc_dir = 'doc' # rdoc output folder
42
- rdoc.options << '--line-numbers' << '--inline-source'
38
+ Rake::RDocTask.new do |generate_rdoc|
39
+ files = ['lib/**/*.rb', 'README', 'CHANGELOG']
40
+ generate_rdoc.rdoc_files.add(files)
41
+ generate_rdoc.main = "README" # page to start on
42
+ generate_rdoc.title = "Scrubyt Documentation"
43
+ generate_rdoc.template = "resources/allison/allison.rb"
44
+ generate_rdoc.rdoc_dir = 'doc' # rdoc output folder
45
+ generate_rdoc.options << '--line-numbers' << '--inline-source'
43
46
  end
44
47
 
45
48
  Rake::TestTask.new do |test|
@@ -50,7 +53,35 @@ task "blackbox" do
50
53
  ruby "test/blackbox/run_blackbox_tests.rb"
51
54
  end
52
55
 
56
+ task "cleanup_readme" do
57
+ puts "Cleaning up README..."
58
+ readme_in = open('./doc/files/README.html')
59
+ content = readme_in.read
60
+ content.sub!('<h1 id="item_name">File: README</h1>','')
61
+ content.sub!('<h1>Description</h1>','')
62
+ readme_in.close
63
+ open('./doc/files/README.html', 'w') {|f| f.write(content)}
64
+ #OK, this is ugly as hell and as non-DRY as possible, but
65
+ #I don't have time to deal with it right now
66
+ puts "Cleaning up CHANGELOG..."
67
+ readme_in = open('./doc/files/CHANGELOG.html')
68
+ content = readme_in.read
69
+ content.sub!('<h1 id="item_name">File: CHANGELOG</h1>','')
70
+ content.sub!('<h1>Description</h1>','')
71
+ readme_in.close
72
+ open('./doc/files/CHANGELOG.html', 'w') {|f| f.write(content)}
73
+ end
74
+
75
+ task "generate_rdoc" do
76
+ end
77
+
53
78
  Rake::GemPackageTask.new(gem_spec) do |pkg|
54
79
  pkg.need_zip = false
55
80
  pkg.need_tar = false
56
- end
81
+ end
82
+
83
+ Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
84
+ pkg.need_zip = true
85
+ pkg.need_tar = true
86
+ pkg.package_files.include("examples/**/*")
87
+ end
@@ -1,5 +1,5 @@
1
1
  #require File.join(File.dirname(__FILE__), 'pattern.rb')
2
-
2
+
3
3
  module Scrubyt
4
4
  # =<tt>exporting previously defined extractors</tt>
5
5
  class Export
@@ -142,14 +142,14 @@ private
142
142
  @name_to_xpath_map = {}
143
143
  create_name_to_xpath_map(pattern)
144
144
  #Replace the examples which are quoted with " and '
145
- @name_to_xpath_map.each do |name, xpaths|
145
+ @name_to_xpath_map.each do |name, xpaths|
146
146
  replace_example_with_xpath(name, xpaths, %q{"})
147
147
  replace_example_with_xpath(name, xpaths, %q{'})
148
148
  end
149
149
  #Finally, add XPaths to pattern which had no example at the beginning (the XPath was
150
150
  #generated from the child patterns
151
151
  @name_to_xpath_map.each do |name, xpaths|
152
- xpaths.each do |xpath|
152
+ xpaths.reverse.each do |xpath|
153
153
  comma = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0].sub('do'){}.strip == '' ? '' : ','
154
154
  if (@full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]).include?('{')
155
155
  @full_definition.sub!("P.#{name}") {"P.#{name}('#{xpath}')"}
@@ -180,7 +180,7 @@ private
180
180
 
181
181
  def self.replace_example_with_xpath(name, xpaths, left_delimiter, right_delimiter=left_delimiter)
182
182
  return if name=='root'
183
- full_line = @full_definition.scan(Regexp.new("P.#{name}(.+)$"))[0][0]
183
+ full_line = @full_definition.scan(/P.#{name}\W(.+)$/)[0][0]
184
184
  examples = full_line.split(",")
185
185
  examples.reject! {|exa| exa.strip!; exa[0..0] != %q{"} && exa[0..0] != %q{'} }
186
186
  all_xpaths = ""
@@ -46,6 +46,7 @@ module Scrubyt
46
46
  end
47
47
  ensure_all_postconditions(root_pattern)
48
48
  PostProcessor.remove_multiple_filter_duplicates(root_pattern)
49
+ PostProcessor.report_if_no_results(root_pattern)
49
50
  #Return the root pattern
50
51
  root_pattern
51
52
  end
@@ -121,21 +122,28 @@ module Scrubyt
121
122
  @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
122
123
  end
123
124
 
124
- if @@host_name != nil
125
+ if @@host_name != nil
125
126
  if doc_url !~ /#{@@host_name}/
126
- @@current_doc_url = (@@host_name + doc_url)
127
- @@current_doc_url.gsub!(/([^:])\/\//) {"#{$1}/"}
127
+ @@current_doc_url = (@@host_name + doc_url)
128
+ #remove duplicate parts, like /blogs/en/blogs/en
129
+ @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
130
+ @@current_doc_url.sub!('http:/', 'http://')
128
131
  end
129
132
  end
130
133
  puts "[ACTION] fetching document: #{@@current_doc_url}"
131
- @@mechanize_doc = @@agent.get(@@current_doc_url) if @@current_doc_protocol == :http
134
+ if @@current_doc_protocol == :http
135
+
136
+ @@mechanize_doc = @@agent.get(@@current_doc_url)
137
+ @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
138
+ @@host_name = doc_url if @@host_name == nil
139
+ end
132
140
  else
133
141
  @@current_doc_url = doc_url
134
142
  @@mechanize_doc = mechanize_doc
135
143
  @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0]
136
144
  @@host_name = doc_url if @@host_name == nil
137
145
  end
138
- @@hpricot_doc = Hpricot(open(@@current_doc_url))#.to_original_html
146
+ @@hpricot_doc = Hpricot(open(@@current_doc_url))
139
147
  end
140
148
 
141
149
  ##
@@ -150,23 +158,56 @@ module Scrubyt
150
158
  def self.fill_textfield(textfield_name, query_string)
151
159
  puts "[ACTION] typing #{query_string} into the textfield named '#{textfield_name}'"
152
160
  textfield = (@@hpricot_doc/"input[@name=#{textfield_name}]").map()[0]
153
- formname = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form').attributes['name']
154
- @@current_form = @@mechanize_doc.forms.with.name(formname).first
161
+ form_tag = Scrubyt::XPathUtils.traverse_up_until_name(textfield, 'form')
162
+ #Refactor this code, it's a total mess
163
+ formname = form_tag.attributes['name']
164
+ if formname == nil
165
+ id_string = form_tag.attributes['id']
166
+ if id_string == nil
167
+ action_string = form_tag.attributes['action']
168
+ if action_string == nil
169
+ #If even this fails, do it with a button
170
+ else
171
+ puts "Finding from action"
172
+ puts action_string
173
+ find_form_with_attribute('action', action_string)
174
+ end
175
+ else
176
+ puts "Finding from id"
177
+ find_form_with_attribute('id', id_string)
178
+ end
179
+ else
180
+ puts "Finding from name"
181
+ @@current_form = @@mechanize_doc.forms.with.name(formname).first
182
+ end
183
+
155
184
  eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
156
185
  end
157
186
 
187
+ def self.find_form_with_attribute(attr, expected_value)
188
+ puts "attr: #{attr}"
189
+ i = 0
190
+ loop do
191
+ @@current_form = @@mechanize_doc.forms[i]
192
+ print "current a: "
193
+ puts @@current_form.form_node.attributes[attr]
194
+ return nil if @@current_form == nil
195
+ break if @@current_form.form_node.attributes[attr] == expected_value
196
+ i+= 1
197
+ end
198
+ end
199
+
158
200
  #Submit the last form;
159
201
  def self.submit
160
202
  puts '[ACTION] submitting form...'
161
203
  result_page = @@agent.submit(@@current_form)#, @@current_form.buttons.first)
162
204
  @@current_doc_url = result_page.uri.to_s
205
+ puts "[ACTION] fetched #{@@current_doc_url}"
163
206
  fetch(@@current_doc_url, result_page)
164
207
  end
165
208
 
166
209
  def self.click_link(link_text)
167
210
  puts "[ACTION] clicking link: #{link_text}"
168
- #puts /^#{Regexp.escape(link_text)}$/
169
- #p /^#{Regexp.escape(link_text)}$/
170
211
  link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
171
212
  result_page = @@agent.click(link)
172
213
  @@current_doc_url = result_page.uri.to_s
@@ -53,8 +53,10 @@ module Scrubyt
53
53
  @parent_pattern = parent_pattern
54
54
  #If the example type is not explicitly defined in the pattern definition,
55
55
  #try to determine it automatically from the example
56
- @example_type = (args[0] == nil ? Filter.determine_example_type(example) :
57
- args[0][:example_type])
56
+ #@example_type = (args[0] == nil ? Filter.determine_example_type(example) :
57
+ # args[0][:example_type])
58
+ #TODOOOOO correct this!
59
+ @example_type = Filter.determine_example_type(example)
58
60
  @sink = [] #output of a filter
59
61
  @source = [] #input of a filter
60
62
  @example = example
@@ -67,14 +69,13 @@ module Scrubyt
67
69
  #Evaluate this filter. This method should not be called directly - as the pattern hierarchy
68
70
  #is evaluated, every pattern evaluates its filters and then they call this method
69
71
  def evaluate(source)
70
- @parent_pattern.root_pattern.already_evaluated_sources ||= {}
71
72
  case @parent_pattern.type
72
73
  when Scrubyt::Pattern::PATTERN_TYPE_TREE
73
74
  result = source/@xpath
74
75
  result.class == Hpricot::Elements ? result.map : [result]
75
76
  when Scrubyt::Pattern::PATTERN_TYPE_ATTRIBUTE
76
77
  [source.attributes[@example]]
77
- when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
78
+ when Scrubyt::Pattern::PATTERN_TYPE_REGEXP
78
79
  source.inner_text.scan(@example).flatten
79
80
  end
80
81
  end
@@ -87,10 +88,9 @@ module Scrubyt
87
88
  when EXAMPLE_TYPE_XPATH
88
89
  @xpath = @example
89
90
  when EXAMPLE_TYPE_STRING
90
- @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example )
91
+ @temp_sink = XPathUtils.find_node_from_text( @parent_pattern.root_pattern.filters[0].source[0], @example, false )
91
92
  @xpath = @parent_pattern.generalize ? XPathUtils.generate_XPath(@temp_sink, nil, false) :
92
93
  XPathUtils.generate_XPath(@temp_sink, nil, true)
93
- puts @xpath
94
94
  when EXAMPLE_TYPE_CHILDREN
95
95
  current_example_index = 0
96
96
  loop do
@@ -148,7 +148,7 @@ private
148
148
  EXAMPLE_TYPE_CHILDREN
149
149
  when /\.(jpg|png|gif|jpeg)$/
150
150
  EXAMPLE_TYPE_IMAGE
151
- when /^\/{1,2}[a-z]+(\[\d+\])?(\/{1,2}[a-z]+(\[\d+\])?)*$/
151
+ when /^\/{1,2}[a-z]+\d?(\[\d+\])?(\/{1,2}[a-z]+\d?(\[\d+\])?)*$/
152
152
  (example.include? '/' || example.include?('[')) ? EXAMPLE_TYPE_XPATH : EXAMPLE_TYPE_STRING
153
153
  else
154
154
  EXAMPLE_TYPE_STRING
@@ -43,7 +43,7 @@ module Scrubyt
43
43
  attr_accessor :name, :output_type, :generalize, :children, :filters, :parent,
44
44
  :last_result, :result, :root_pattern, :example, :block_count,
45
45
  :next_page, :limit, :extractor, :extracted_docs,
46
- :examples, :parent_of_leaf
46
+ :examples, :parent_of_leaf, :document_index
47
47
  attr_reader :type, :generalize_set, :next_page_url
48
48
 
49
49
  def initialize (name, *args)
@@ -56,6 +56,7 @@ module Scrubyt
56
56
  @@instance_count = Hash.new(0)
57
57
  @evaluated_examples = []
58
58
  @next_page = nil
59
+ @document_index = 0
59
60
  if @examples == nil
60
61
  filters << Scrubyt::Filter.new(self) #create a default filter
61
62
  else
@@ -74,6 +75,7 @@ module Scrubyt
74
75
  #Grab any examples that are defined!
75
76
  look_for_examples(args)
76
77
  args.each do |arg|
78
+ next if !arg.is_a? Hash
77
79
  arg.each do |k,v|
78
80
  #Set only the setable fields
79
81
  if SETTABLE_FIELDS.include? k.to_s
@@ -92,7 +94,6 @@ module Scrubyt
92
94
  #default settings - the user can override them, but if she did not do so,
93
95
  #we will setup some meaningful defaults
94
96
  @type ||= PATTERN_TYPE_TREE
95
- @type = PATTERN_TYPE_REGEXP if @example.instance_of? Regexp
96
97
  @output_type ||= OUTPUT_TYPE_MODEL
97
98
  #don't generalize by default
98
99
  @generalize ||= false
@@ -127,11 +128,20 @@ module Scrubyt
127
128
  # camera_data.item[1].item_name[0]
128
129
  #
129
130
  #possible. The method Pattern::method missing handles the 'item', 'item_name' etc.
130
- #parts, while the indexing ([1], [0]) is handled by this function
131
+ #parts, while the indexing ([1], [0]) is handled by this function.
132
+ #If you would like to select a different document than the first one (which is
133
+ #the default), you should use the form:
134
+ #
135
+ # camera_data[1].item[1].item_name[0]
131
136
  def [](index)
132
- return nil if (@result.lookup(@parent.last_result)) == nil
133
- @last_result = @result.lookup(@parent.last_result)[index]
134
- self
137
+ if @name == 'root'
138
+ @root_pattern.document_index = index
139
+ else
140
+ @parent.last_result = @parent.last_result[@root_pattern.document_index] if @parent.last_result.is_a? Array
141
+ return nil if (@result.lookup(@parent.last_result)) == nil
142
+ @last_result = @result.lookup(@parent.last_result)[index]
143
+ end
144
+ self
135
145
  end
136
146
 
137
147
  ##
@@ -217,9 +227,6 @@ module Scrubyt
217
227
  sorted_result = r.reject {|e| !result.keys.include? e}
218
228
  add_result(filter, source, sorted_result)
219
229
  else
220
- if ( (xe = @result.lookup(source)) != nil )
221
- #puts "ha"; p xe
222
- end
223
230
  add_result(filter, source, r)
224
231
  end#end of constraint check
225
232
  end#end of source iteration
@@ -246,6 +253,7 @@ private
246
253
  end
247
254
  end
248
255
  elsif (args[0].is_a? Regexp)
256
+ @examples = args.select {|e| e.is_a? Regexp}
249
257
  #Check if all the String parameters are really the first
250
258
  #parameters
251
259
  args[0..@examples.size].each do |example|
@@ -253,6 +261,7 @@ private
253
261
  puts 'FATAL: Problem with example specification'
254
262
  end
255
263
  end
264
+ @type = PATTERN_TYPE_REGEXP
256
265
  end
257
266
  end
258
267
 
@@ -299,7 +308,7 @@ private
299
308
  end
300
309
 
301
310
  def generate_next_page_link(example)
302
- node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example)
311
+ node = XPathUtils.find_node_from_text(@root_pattern.filters[0].source[0], example, true)
303
312
  return nil if node == nil
304
313
  node.attributes['href'].gsub('&amp;') {'&'}
305
314
  end # end of method generate_next_page_link
@@ -18,6 +18,21 @@ module Scrubyt
18
18
  remove_multiple_filter_duplicates_intern(pattern) if pattern.parent_of_leaf
19
19
  pattern.children.each {|child| remove_multiple_filter_duplicates(child)}
20
20
  end
21
+
22
+ ##
23
+ #Issue an error report if the document did not extract anything.
24
+ #Probably this is because the structure of the page changed or
25
+ #because of some rather nasty bug - in any case, something wrong
26
+ #is going on, and we need to inform the user about this!
27
+ def self.report_if_no_results(root_pattern)
28
+ results_found = false
29
+ root_pattern.children.each {|child| return if (child.result.childmap.size > 0)}
30
+ puts
31
+ puts "!!!!!! WARNING: The extractor did not find any result instances"
32
+ puts "Most probably this is wrong. Check your extractor and if you are"
33
+ puts "sure it should work, report a bug!"
34
+ puts
35
+ end
21
36
 
22
37
  private
23
38
  def self.remove_multiple_filter_duplicates_intern(pattern)
@@ -1,4 +1,5 @@
1
1
  require 'rexml/document'
2
+ require 'rexml/xpath'
2
3
 
3
4
  module Scrubyt
4
5
  ##
@@ -16,7 +17,7 @@ module Scrubyt
16
17
  to_xml_recursive(pattern, root)
17
18
  end
18
19
  remove_empty_leaves(doc)
19
- doc
20
+ @@last_doc = doc
20
21
  end
21
22
 
22
23
  def self.remove_empty_leaves(node)
@@ -80,11 +81,22 @@ private
80
81
  end
81
82
  end
82
83
 
83
- def self.print_statistics_recursive(pattern, depth)
84
+ def self.print_old_sta(pattern, depth)
84
85
  puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{pattern.get_instance_count[pattern.name]} instances.") if pattern.name != 'root'
85
86
  pattern.children.each do |child|
86
87
  print_statistics_recursive(child, depth + 4)
88
+ end
89
+ end
90
+
91
+ def self.print_statistics_recursive(pattern, depth)
92
+ if pattern.name != 'root'
93
+ count = REXML::XPath.match(@@last_doc, "//#{pattern.name}").size
94
+ puts((' ' * "#{depth}".to_i) + "#{pattern.name} extracted #{count} instances.")
87
95
  end
96
+
97
+ pattern.children.each do |child|
98
+ print_statistics_recursive(child, depth + 4)
99
+ end
88
100
  end#end of method print_statistics_recursive
89
101
  end #end of class ResultDumper
90
102
  end #end of module Scrubyt
@@ -21,21 +21,23 @@ module Scrubyt
21
21
  # <a>Bon <b>nuit</b>, monsieur!</a>
22
22
  #
23
23
  #In this case, <a>'s text is considered to be "Bon nuit, monsieur"
24
- def self.find_node_from_text(doc, text)
24
+ def self.find_node_from_text(doc, text, next_link)
25
25
  @node = nil
26
26
  @found = false
27
27
  self.traverse_for_full_text(doc,text)
28
28
  self.lowest_possible_node_with_text(@node, text) if @node != nil
29
- #$Logger.warn("Node for example #{text} Not found!") if (@found == false)
30
29
  if (@found == false)
31
30
  #Fallback to per node text lookup
32
31
  self.traverse_for_node_text(doc,text)
33
- if (@found == false)
34
- puts "FATAL: Node for example #{text} Not found!"
35
- puts "Please make sure your specified the example properly"
32
+ if (@found == false)
33
+ return nil if next_link
34
+ puts "!" * 65
35
+ puts "!!!!!! FATAL: Node for example #{text} Not found! !!!!!!"
36
+ puts "!!!!!! Please make sure you specified the example properly !!!!!!"
37
+ puts "!" * 65
38
+ exit
36
39
  end
37
40
  end
38
- p @node
39
41
  @node
40
42
  end
41
43
 
@@ -135,7 +137,7 @@ module Scrubyt
135
137
  #_index_ - there might be more images with the same src on the page -
136
138
  #most typically the user will need the 0th - but if this is not the
137
139
  #case, there is the possibility to override this
138
- def self.find_image(doc, example, index=1)
140
+ def self.find_image(doc, example, index=0)
139
141
  (doc/"img[@src='#{example}']")[index]
140
142
  end
141
143
 
@@ -22,7 +22,15 @@ class FilterTest < Test::Unit::TestCase
22
22
  Scrubyt::Filter::EXAMPLE_TYPE_IMAGE)
23
23
  #Test XPaths
24
24
  assert_equal(Scrubyt::Filter.determine_example_type('/p/img'),
25
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
26
+ assert_equal(Scrubyt::Filter.determine_example_type('/p/h3'),
27
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
28
+ assert_equal(Scrubyt::Filter.determine_example_type('/p/h3/a/h2'),
29
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
30
+ assert_equal(Scrubyt::Filter.determine_example_type('/h2'),
25
31
  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
32
+ assert_equal(Scrubyt::Filter.determine_example_type('/h1/h3'),
33
+ Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
26
34
  assert_equal(Scrubyt::Filter.determine_example_type('/p'),
27
35
  Scrubyt::Filter::EXAMPLE_TYPE_XPATH)
28
36
  assert_equal(Scrubyt::Filter.determine_example_type('//p'),
@@ -55,14 +55,14 @@ class XPathUtilsTest < Test::Unit::TestCase
55
55
  end
56
56
 
57
57
  def test_find_node_from_text
58
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff")
58
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"fff", false)
59
59
  assert_instance_of(Hpricot::Elem, elem)
60
60
  assert_equal(elem, @f)
61
61
 
62
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd")
62
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"dddd", false)
63
63
  assert_equal(elem, @d)
64
64
 
65
- elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr")
65
+ elem = Scrubyt::XPathUtils.find_node_from_text(@doc1,"rrr", false)
66
66
  assert_equal(elem, @r)
67
67
 
68
68
  end
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
3
3
  specification_version: 1
4
4
  name: scrubyt
5
5
  version: !ruby/object:Gem::Version
6
- version: 0.1.9
7
- date: 2007-01-28 00:00:00 +01:00
6
+ version: 0.2.0
7
+ date: 2007-02-04 00:00:00 +01:00
8
8
  summary: A powerful Web-scraping framework
9
9
  require_paths:
10
10
  - lib
@@ -29,29 +29,29 @@ post_install_message:
29
29
  authors:
30
30
  - Peter Szinek
31
31
  files:
32
- - README
33
32
  - COPYING
33
+ - README
34
34
  - CHANGELOG
35
35
  - Rakefile
36
36
  - lib/scrubyt.rb
37
- - lib/scrubyt/constraint_adder.rb
38
37
  - lib/scrubyt/constraint.rb
39
- - lib/scrubyt/result_dumper.rb
40
- - lib/scrubyt/export.rb
41
- - lib/scrubyt/extractor.rb
42
- - lib/scrubyt/filter.rb
43
38
  - lib/scrubyt/pattern.rb
44
39
  - lib/scrubyt/result.rb
40
+ - lib/scrubyt/export.rb
41
+ - lib/scrubyt/constraint_adder.rb
45
42
  - lib/scrubyt/post_processor.rb
43
+ - lib/scrubyt/filter.rb
46
44
  - lib/scrubyt/xpathutils.rb
45
+ - lib/scrubyt/result_dumper.rb
46
+ - lib/scrubyt/extractor.rb
47
47
  test_files:
48
48
  - test/unittests/input
49
+ - test/unittests/constraint_test.rb
49
50
  - test/unittests/filter_test.rb
50
- - test/unittests/extractor_test.rb
51
51
  - test/unittests/xpathutils_test.rb
52
- - test/unittests/constraint_test.rb
53
- - test/unittests/input/constraint_test.html
52
+ - test/unittests/extractor_test.rb
54
53
  - test/unittests/input/test.html
54
+ - test/unittests/input/constraint_test.html
55
55
  rdoc_options: []
56
56
 
57
57
  extra_rdoc_files: []