scrubyt 0.2.0 → 0.2.3

data/CHANGELOG CHANGED
@@ -1,4 +1,103 @@
- = scRUBYt! changelog
+ = scRUBYt! Changelog
+
+ == 0.2.3
+ === 20th February, 2007
+
+ Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed, and some shiny new features have been added. Stability was also improved by adding new tests and thoroughly refactoring the whole code.
+ Support for sites requiring login, submitting forms with a button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy settings and tons of bugfixes make this release capable of doing much, much more than was possible in 0.2.0.
+ I have also added some shiny new examples - scraping reddit, del.icio.us, rubyforge login and automatic wordpress commenting, for example.
+
+ =<tt>Changes:</tt>
+ * [FIX] Cookies (and other stuff) are now taken into consideration
+ * [NEW] select_indices feature. Example:
+
+     table do
+       (row '1').select_indices(:last)
+     end
+
+   This will select only the last row; it is also possible to specify
+   a Range, an array of indices, or other constants like :first,
+   :every_odd etc. More to come in the future!
+ * [FIX] digg.com next page problem fixed
+ * [FIX] Fetching of https sites
+ * [FIX] Next page worked incorrectly when given an absolute path
+ * [FIX] Exporting when the pattern parameters are parenthesized
+ * [NEW] Possibility to submit forms by clicking a button
+ * [NEW] Added new unit test suite: pattern_test
+ * [NEW] Possibility to set a proxy for fetching the input document
+ * [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
+ * [FIX] Image pattern example lookup fix
+ * [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
+ * [FIX] Corrected gem dependencies (Credit: Tim Fletcher)
+ * [FIX] Remove duplicates only if there are multiple examples present
+ * [NEW] New examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more
+ * [FIX] If there is no scraper defined, exit with a message rather than raise an exception
+ * [NEW] Smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
+
+ == 0.2.0
+ === 30th January, 2007
+
+ The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet a release which you just pull out of the box and which works under any circumstances - however, the major bugs are fixed and the whole thing is in a good-enough(TM) state, I guess.
+
+ =<tt>Changes:</tt>
+
+ * better form detection heuristics
+ * report message if there are absolutely no results
+ * lots of bugfixes
+ * fixed amazon_data.books[0].item[0].title[0] style output access
+   and implemented it correctly in case of crawling as well
+ * /body/div/h3 not detected as XPath
+ * crawling problem (improved heuristics of url joining)
+ * fixed blackbox test runner - no more platform dependent code
+ * fixed exporting bug: swapped exported XPaths in the case of no example present
+ * fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
+   name is a substring of the other
+ * evaluation stops if the example was not found - but not in the case
+   of next page link lookup
+ * google_data[0].link[0].url[0] style result lookup now works in the
+   case of more documents, too
+ * tons of other bugfixes
+ * overall stability fixes
+ * more blackbox tests
+ * more examples
+
+
+ == 0.1.9
+ === 28th January, 2007
+
+ This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in; now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
+
+ =<tt>Changes:</tt>
+
+ * Possibility to specify multiple examples (hence a pattern can have more filters)
+ * Enhanced heuristics for example text detection
+ * First version of the algorithm to remove dupes resulting from multiple examples
+ * Empty XML leaf nodes are not written
+ * New examples
+ * TONS of bugfixes
+
+ == 0.1
+ === 15th January, 2007
+
+ First pre-alpha (non-public) release.
+ This release was made more for myself (to try and test rubyforge, gems, etc.) than for the community at this point.
+
+ Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
+
+ * Navigation:
+   * fetching pages
+   * clicking links
+   * filling input fields
+   * submitting forms
+   * automatically passing the document on to the scraping step
+   * both file and http:// support
+   * automatic crawling
+
+ * Scraping:
+   * fairly powerful DSL to describe the full scraping process
+   * automatic navigation with WWW::Mechanize
+   * automatic scraping through examples with Hpricot
+   * automatic recursive scraping through the next button
 
  == 0.2.0
  === 30th January, 2007
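
Taken together, the 0.2.3 entries above add navigation (login, button clicks, text areas) and result-indexing features to the extraction DSL. Below is a minimal sketch of how they might combine; the Scrubyt::Extractor.define entry point is assumed from the gem's structure, and the URL, the 'q' field and the record/title pattern names are made-up placeholders - only fetch, fill_textfield, submit and select_indices themselves appear in the changelog:

    require 'rubygems'
    require 'scrubyt'

    # Navigation: fetch the page, fill the (hypothetical) search box, submit;
    # since 0.2.3, submit(0) would click the form's first button instead.
    data = Scrubyt::Extractor.define do
      fetch          'http://www.example.com/search'
      fill_textfield 'q', 'ruby'
      submit

      # Scraping: keep only the first match of the (hypothetical) pattern;
      # a Range, an array of indices, :last or :every_odd would also work here.
      record do
        (title 'Some example title').select_indices(:first)
      end
    end

    puts data.to_xml   # assumes the result supports XML dumping via result_dumper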
data/Rakefile CHANGED
@@ -18,7 +18,7 @@ task "cleanup_readme" => ["rdoc"]
 
  gem_spec = Gem::Specification.new do |s|
    s.name = 'scrubyt'
-   s.version = '0.2.0'
+   s.version = '0.2.3'
    s.summary = 'A powerful Web-scraping framework'
    s.description = %{scRUBYt! is an easy to learn and use, yet powerful and effective web scraping framework. It's most interesting part is a Web-scraping DSL built on HPricot and WWW::Mechanize, which allows to navigate to the page of interest, then extract and query data records with a few lines of code. It is hard to describe scRUBYt! in a few sentences - you have to see it for yourself!}
    # Files containing Test::Unit test cases.
@@ -28,6 +28,8 @@ gem_spec = Gem::Specification.new do |s|
    s.author = 'Peter Szinek'
    s.email = 'peter@rubyrailways.com'
    s.homepage = 'http://www.scrubyt.org'
+   s.add_dependency('hpricot', '>= 0.5')
+   s.add_dependency('mechanize', '>= 0.6.3')
    s.has_rdoc = 'true'
  end
 
@@ -80,7 +82,7 @@ Rake::GemPackageTask.new(gem_spec) do |pkg|
    pkg.need_tar = false
  end
 
- Rake::PackageTask.new('scrubyt-examples', '0.2.0') do |pkg|
+ Rake::PackageTask.new('scrubyt-examples', '0.2.3') do |pkg|
    pkg.need_zip = true
    pkg.need_tar = true
    pkg.package_files.include("examples/**/*")
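
With the two add_dependency lines above, RubyGems enforces compatible library versions when the gem is activated; a minimal consumer-side sketch:

    require 'rubygems'

    gem 'scrubyt', '= 0.2.3'   # also resolves hpricot >= 0.5 and mechanize >= 0.6.3
    require 'scrubyt'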
data/lib/scrubyt.rb CHANGED
@@ -1,10 +1,15 @@
- require 'scrubyt/constraint_adder.rb'
- require 'scrubyt/constraint.rb'
- require 'scrubyt/export.rb'
- require 'scrubyt/extractor.rb'
- require 'scrubyt/filter.rb'
- require 'scrubyt/pattern.rb'
- require 'scrubyt/result_dumper.rb'
- require 'scrubyt/result.rb'
- require 'scrubyt/xpathutils.rb'
- require 'scrubyt/post_processor.rb'
+ require 'scrubyt/core/scraping/constraint_adder.rb'
+ require 'scrubyt/core/scraping/constraint.rb'
+ require 'scrubyt/core/scraping/result_indexer.rb'
+ require 'scrubyt/core/scraping/pre_filter_document.rb'
+ require 'scrubyt/output/export.rb'
+ require 'scrubyt/core/shared/extractor.rb'
+ require 'scrubyt/core/scraping/filter.rb'
+ require 'scrubyt/core/scraping/pattern.rb'
+ require 'scrubyt/output/result_dumper.rb'
+ require 'scrubyt/output/result.rb'
+ require 'scrubyt/utils/xpathutils.rb'
+ require 'scrubyt/output/post_processor.rb'
+ require 'scrubyt/core/navigation/navigation_actions.rb'
+ require 'scrubyt/core/navigation/fetch_action.rb'
+ require 'scrubyt/core/shared/evaluation_context.rb'
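
For orientation, the reorganized require list maps onto the following source layout (directory and file names are taken directly from the require paths above; the lib/ prefix assumes the conventional RubyGems load path):

    lib/scrubyt/
      core/
        navigation/  fetch_action.rb, navigation_actions.rb
        scraping/    constraint.rb, constraint_adder.rb, filter.rb, pattern.rb,
                     pre_filter_document.rb, result_indexer.rb
        shared/      evaluation_context.rb, extractor.rb
      output/        export.rb, post_processor.rb, result.rb, result_dumper.rb
      utils/         xpathutils.rb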
data/lib/scrubyt/core/navigation/fetch_action.rb ADDED
@@ -0,0 +1,152 @@
+ module Scrubyt
+   ##
+   #=<tt>Fetching pages (and related functionality)</tt>
+   #
+   #Since a lot of things happen during (and before) the fetching
+   #of a document, I decided to move fetching-related functionality
+   #out to a separate class - so if you are looking for anything
+   #that loads a document (even by submitting a form or clicking a link),
+   #or for related things like setting a proxy, you should find it here.
+   class FetchAction
+     def initialize
+       @@current_doc_url = nil
+       @@current_doc_protocol = nil
+       @@base_dir = nil
+       @@host_name = nil
+       @@agent = WWW::Mechanize.new
+     end
+
+     ##
+     #Action to fetch a document (either a file or a http address)
+     #
+     #*parameters*
+     #
+     #_doc_url_ - the url or file name to fetch
+     def self.fetch(doc_url, proxy=nil, mechanize_doc=nil)
+       parse_and_set_proxy(proxy) if proxy
+       if mechanize_doc == nil
+         @@current_doc_url = doc_url
+         @@current_doc_protocol = determine_protocol
+         handle_relative_path(doc_url)
+         handle_relative_url(doc_url)
+
+         puts "[ACTION] fetching document: #{@@current_doc_url}"
+         if @@current_doc_protocol != 'file'
+           @@mechanize_doc = @@agent.get(@@current_doc_url)
+           store_host_name(doc_url)
+         end
+       else
+         @@current_doc_url = doc_url
+         @@mechanize_doc = mechanize_doc
+         @@current_doc_protocol = determine_protocol
+       end
+       if @@current_doc_protocol == 'file'
+         @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
+       else
+         @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
+       end
+     end
+
+     ##
+     #Submit the last form
+     def self.submit(current_form, button=nil)
+       puts '[ACTION] submitting form...'
+       if button == nil
+         result_page = @@agent.submit(current_form)
+       else
+         result_page = @@agent.submit(current_form, button)
+       end
+       @@current_doc_url = result_page.uri.to_s
+       puts "[ACTION] fetched #{@@current_doc_url}"
+       fetch(@@current_doc_url, nil, result_page)
+     end
+
+     ##
+     #Click the link specified by the text
+     def self.click_link(link_text)
+       puts "[ACTION] clicking link: #{link_text}"
+       link = @@mechanize_doc.links.text(/^#{Regexp.escape(link_text)}$/)
+       result_page = @@agent.click(link)
+       @@current_doc_url = result_page.uri.to_s
+       puts "[ACTION] fetched #{@@current_doc_url}"
+       fetch(@@current_doc_url, nil, result_page)
+     end
+
+     ##
+     #At any given point, the current document can be queried with this method; typically used
+     #when the navigation is over and the result document is passed to the wrapper
+     def self.get_current_doc_url
+       @@current_doc_url
+     end
+
+     def self.get_mechanize_doc
+       @@mechanize_doc
+     end
+
+     def self.get_hpricot_doc
+       @@hpricot_doc
+     end
+     private
+     def self.determine_protocol
+       old_protocol = @@current_doc_protocol
+       new_protocol = case @@current_doc_url
+                      when /^https/
+                        'https'
+                      when /^http/
+                        'http'
+                      when /^www/
+                        'http'
+                      else
+                        'file'
+                      end
+       return 'http' if old_protocol == 'http' && new_protocol == 'file'
+       return 'https' if old_protocol == 'https' && new_protocol == 'file'
+       new_protocol
+     end
+
+     def self.parse_and_set_proxy(proxy)
+       proxy = proxy[:proxy]
+       if proxy.downcase == 'localhost'
+         @@host = 'localhost'
+         @@port = proxy.split(':').last
+       else
+         parts = proxy.split(':')
+         @@port = parts.delete_at(-1)
+         @@host = parts.join(':')
+         if @@host == nil || @@port == nil
+           puts "Invalid proxy specification..."
+           puts "neither host nor port can be nil!"
+           exit
+         end
+       end
+       puts "[ACTION] Setting proxy: host=<#{@@host}>, port=<#{@@port}>"
+       @@agent.set_proxy(@@host, @@port)
+     end
+
+     def self.handle_relative_path(doc_url)
+       if @@base_dir == nil
+         @@base_dir = doc_url.scan(/.+\//)[0] if @@current_doc_protocol == 'file'
+       else
+         @@current_doc_url = ((@@base_dir + doc_url) if doc_url !~ /#{@@base_dir}/)
+       end
+     end
+
+     def self.handle_relative_url(doc_url)
+       return if doc_url =~ /^http/
+       if @@host_name != nil
+         if doc_url !~ /#{@@host_name}/
+           @@current_doc_url = (@@host_name + doc_url)
+           #remove duplicate parts, like /blogs/en/blogs/en
+           @@current_doc_url = @@current_doc_url.split('/').uniq.reject{|x| x == ""}.join('/')
+           @@current_doc_url.sub!('http:/', 'http://')
+         end
+       end
+     end
+
+     def self.store_host_name(doc_url)
+       @@host_name = 'http://' + @@mechanize_doc.uri.to_s.scan(/http:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'http'
+       @@host_name = 'https://' + @@mechanize_doc.uri.to_s.scan(/https:\/\/(.+\/)+/).flatten[0] if @@current_doc_protocol == 'https'
+       @@host_name = doc_url if @@host_name == nil
+     end #end of function store_host_name
+   end #end of class FetchAction
+ end #end of module Scrubyt
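
FetchAction is normally driven indirectly through the navigation DSL, but since its state lives in class variables it can also be exercised directly. A minimal sketch, assuming the gem is installed; the URL and proxy host are made-up placeholders, and the {:proxy => 'host:port'} hash mirrors what parse_and_set_proxy expects:

    require 'rubygems'
    require 'scrubyt'

    Scrubyt::FetchAction.new   # sets up the shared WWW::Mechanize agent

    # The optional second argument is the proxy specification
    Scrubyt::FetchAction.fetch('http://www.example.com',
                               :proxy => 'myproxy.example.com:8080')

    doc = Scrubyt::FetchAction.get_hpricot_doc      # Hpricot document of the page
    puts Scrubyt::FetchAction.get_current_doc_url   # URL after any redirects
    puts (doc/'title').inner_html                   # query it like any Hpricot doc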
data/lib/scrubyt/core/navigation/navigation_actions.rb ADDED
@@ -0,0 +1,106 @@
+ module Scrubyt
+   ##
+   #=<tt>Describing actions which interact with the page</tt>
+   #
+   #This class contains all the actions that are used to navigate on web pages;
+   #first of all *fetch*, for downloading the pages - then various actions
+   #like filling textfields, submitting forms, clicking links and more
+   class NavigationActions
+     #These are reserved keywords - they can not be the name of any pattern
+     #since they are reserved for describing the navigation
+     KEYWORDS = ['fetch',
+                 'fill_textfield',
+                 'fill_textarea',
+                 'submit',
+                 'click_link',
+                 'select_option',
+                 'end']
+
+     def initialize
+       @@current_form = nil
+       FetchAction.new
+     end
+
+     ##
+     #Action to fill a textfield with a query string
+     #
+     #*parameters*
+     #
+     #_textfield_name_ - the name of the textfield (e.g. the name of the google search
+     #textfield is 'q')
+     #
+     #_query_string_ - the string that should be entered into the textfield
+     def self.fill_textfield(textfield_name, query_string)
+       lookup_form_for_tag('input','textfield',textfield_name,query_string)
+       eval("@@current_form['#{textfield_name}'] = '#{query_string}'")
+     end
+
+     ##
+     #Action to fill a textarea with text
+     def self.fill_textarea(textarea_name, text)
+       lookup_form_for_tag('textarea','textarea',textarea_name,text)
+       eval("@@current_form['#{textarea_name}'] = '#{text}'")
+     end
+
+     ##
+     #Action for selecting an option from a dropdown box
+     def self.select_option(selectlist_name, option)
+       lookup_form_for_tag('select','select list',selectlist_name,option)
+       select_list = @@current_form.fields.find {|f| f.name == selectlist_name}
+       searched_option = select_list.options.find{|f| f.text == option}
+       searched_option.click
+     end
+
+     ##
+     #Fetch the document
+     def self.fetch(doc_url, mechanize_doc=nil)
+       FetchAction.fetch(doc_url, nil, mechanize_doc)
+     end
+     ##
+     #Submit the current form (delegates to FetchAction)
+     def self.submit(index=nil)
+       if index == nil
+         FetchAction.submit(@@current_form)
+       else
+         FetchAction.submit(@@current_form, @@current_form.buttons[index])
+       end
+     end
+
+     ##
+     #Click the link specified by the text (delegates to FetchAction)
+     def self.click_link(link_text)
+       FetchAction.click_link(link_text)
+     end
+
+     def self.get_hpricot_doc
+       FetchAction.get_hpricot_doc
+     end
+
+     private
+     def self.lookup_form_for_tag(tag,widget_name,name_attribute,query_string)
+       puts "[ACTION] typing #{query_string} into the #{widget_name} named '#{name_attribute}'"
+       widget = (FetchAction.get_hpricot_doc/"#{tag}[@name=#{name_attribute}]").map()[0]
+       form_tag = Scrubyt::XPathUtils.traverse_up_until_name(widget, 'form')
+       find_form_based_on_tag(form_tag, ['name', 'id', 'action'])
+     end
+
+     def self.find_form_based_on_tag(tag, possible_attrs)
+       lookup_attribute_name = nil
+       lookup_attribute_value = nil
+
+       possible_attrs.each { |a|
+         lookup_attribute_name = a
+         lookup_attribute_value = tag.attributes[a]
+         break if lookup_attribute_value != nil
+       }
+
+       i = 0
+       loop do
+         @@current_form = FetchAction.get_mechanize_doc.forms[i]
+         return nil if @@current_form == nil
+         break if @@current_form.form_node.attributes[lookup_attribute_name] == lookup_attribute_value
+         i += 1
+       end
+     end #find_form_based_on_tag
+   end #end of class NavigationActions
+ end #end of module Scrubyt
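
A sketch of NavigationActions driving a login-style flow, in the spirit of the rubyforge login example mentioned in the changelog; the URL and field names are illustrative placeholders, not a real form:

    require 'rubygems'
    require 'scrubyt'

    Scrubyt::NavigationActions.new   # resets the current form and the agent
    Scrubyt::NavigationActions.fetch('http://www.example.com/login')
    Scrubyt::NavigationActions.fill_textfield('username', 'my_name')
    Scrubyt::NavigationActions.fill_textfield('password', 'secret')
    Scrubyt::NavigationActions.submit                 # or submit(0) to click a button
    doc = Scrubyt::NavigationActions.get_hpricot_doc  # page after login

These class methods are the same ones the extraction DSL dispatches to; note that any pattern name clashing with the KEYWORDS list above is treated as navigation rather than scraping.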