scrubber-scrubyt 0.4.12 → 0.4.13
- data/README +33 -12
- data/lib/scrubyt/core/navigation/agents/mechanize.rb +1 -2
- data/lib/scrubyt.rb +4 -2
- metadata +1 -1
data/README
CHANGED
@@ -1,17 +1,29 @@
 = scRUBYt! - Hpricot and Mechanize (or FireWatir) on steroids
 
-A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web,
+A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web,
+Extract, query, transform and save relevant data from the Web page of your interest by the concise and easy to use DSL.
 
-
+
+Do you think that Mechanize and Hpricot are powerful libraries? You're right, they are, indeed - hats off to their
+authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be
+still enhanced further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them
+around with a chunky DSL coating and sprinkled the whole stuff with a lots of convention over configuration(tm) goodies
+- and ... enter scRUBYt! and decide it yourself.
 
 = Wait... why do we need one more web-scraping toolkit?
 
 After all, we have HPricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and ...
-Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical
+Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical
+background, use cases, todo list, real-life scenarios etc. - shortly it should be used in different situations with
+different requirements than the previosly mentioned ones.
 
-If you need something quick and/or would like to have maximal control over the scraping process, I recommend HPricot.
+If you need something quick and/or would like to have maximal control over the scraping process, I recommend HPricot.
+Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! is operating based on XPaths, sometimes you
+will chose scrAPI because CSS selectors will better suit your needs. The list goes on and on, boiling down to the good
+old mantra: use the right tool for the right job!
 
-I hope there will be also times when you will want to experiment with Pandora's box and reach after the power of
+I hope there will be also times when you will want to experiment with Pandora's box and reach after the power of
+scRUBYt! :-)
 
 = Sounds fine - show me an example!
 
@@ -50,21 +62,30 @@ output:
 <!-- another 200+ results -->
 <tt></root></tt>
 
-This was a relatively beginner-level example (scRUBYt knows a lot more than this and there are much complicated
+This was a relatively beginner-level example (scRUBYt knows a lot more than this and there are much complicated
+extractors than the above one) - yet it did a lot of things automagically. First of all,
 it automatically loaded the page of interest (by going to ebay.com, automatically searching for ipods
 and narrowing down the results by clicking on 'Apple iPod'), then it extracted *all* the items that
-looked like the specified example (which btw described also how the output structure should look like) - on the first 5
+looked like the specified example (which btw described also how the output structure should look like) - on the first 5
+result pages. Not so bad for about 10 lines of code, eh?
 
 = OK, OK, I believe you, what should I do?
 
-You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the
+You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere...). See the
+next section about installation, and after installing be sure to check out these URLs:
 
-* <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek
+* <a href='http://www.rubyrailways.com'>rubyrailways.com</a> - for some theory; if you would like to take a sneak peek
+at web scraping in general and/or you would like to understand what's going on under the hood, check out <a
+href='http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails'>this article about
+web-scraping</a>!
 * <a href='http://scrubyt.org'>http://scrubyt.org</a> - your source of tutorials, howtos, news etc.
 * <a href='http://scrubyt.rubyforge.org'>scrubyt.rubyforge.org</a> - for an up-to-date, online Rdoc
-* <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including
-
-*
+* <a href='http://projects.rubyforge.org/scrubyt'>projects.rubyforge.org/scrubyt</a> - for developer info, including
+open and closed bugs, files etc.
+* projects.rubyforge.org/scrubyt/files... - fair amount (and still growing with every release) of examples, showcasing
+the features of scRUBYt!
+* planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will
+have a community, and people will upload their extractors for whatever reason
 
 If you still can't find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!
 
data/lib/scrubyt/core/navigation/agents/mechanize.rb
CHANGED
@@ -123,8 +123,7 @@ module Scrubyt
       @@original_host_name ||= @@host_name
     end #end of method store_host_name
 
-    def self.parse_and_set_proxy(proxy)
-      proxy = proxy[:proxy]
+    def self.parse_and_set_proxy(proxy)
       if proxy.downcase == 'localhost'
         @@host = 'localhost'
         @@port = proxy.split(':').last
data/lib/scrubyt.rb
CHANGED