RedNails 1.0.0

data/README ADDED
@@ -0,0 +1,76 @@
+ RedNails
+
+ Description:
+ RedNails is a data scraping library that uses templates to determine what data
+ to extract from actual data feeds.
+
+ RedNails uses the template to create a regular expression that captures the
+ user-marked variables. When a string of data is passed to RedNails, it uses
+ the regular expression to extract the matches and returns them to the user.
+
+ If the scraped data is regular enough, RedNails is a simple way to extract it:
+ copy a live data feed, mark the points to extract, and use the result as the
+ template.
+
+ License: BSD
+
+ Usage:
+ 1) Create a template.
+ 2) Load and initialize an instance of a RedNails object with the template.
+ 3) Pass this instance your data feed from which you wish to extract information.
+ 4) Use the results.
+
+ Template Format:
+ A RedNails template is simply a text file that has the points to scrape marked
+ with what looks like a Ruby string substitution. You give each substitution a
+ unique variable name that can be referenced when using the parse_hash method.
+
+ An example template is:
+
+   "Hello my name is #{name}. How are you?"
+
+ If the data string to scrape is:
+
+   "Hello my name is Mr.Bill. How are you?"
+
+ Then the following code fragment will produce "Mr.Bill":
+
+   require 'rednails'
+   rednails = RedNails.new("template.txt")
+   results = rednails.parse_hash(File.read("livedata.txt")) # takes the data, not a path
+   puts results["name"].strip # => Mr.Bill
+
+ Repetitions:
+ If the data you would like to extract repeats itself, there is an additional
+ template marker you can use.
+ Replace the repeating data in the template with #{Rep:}. After the colon
+ inside the Rep marker, place one instance of the structured data that repeats,
+ and replace each piece of data that you would like to extract with a unique
+ variable name that starts and ends with @.
+
+ For example, if you have an arbitrary list of images that you would like to
+ extract, you can make a template like this:
+   <html>
+   <body>
+   A bunch of photos:
+   #{Rep:<img src="@url@" alt="@txt@"/>}
+   </body>
+   </html>
+
+ For more details please see the test cases.
+
+ Installation:
+
+ * Gem
+   gem install rednails
+
+ * Manual
+   As root
+   # ruby setup.rb all
+
+
+ Author and Contributions:
+ Zev Blut
+ With some changes and help by Min Lin Hsieh, Daniel DeLorme and
+ Pierre Baumard.
+
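The template idea described in the README can be illustrated with a small standalone sketch. This is not the RedNails implementation: sketch_template_regexp is a hypothetical helper, and it assumes lazy named captures and \s+ between words, whereas the library builds greedy positional groups.

```ruby
# Hypothetical helper, not part of RedNails: builds a regexp with one named
# capture per #{name} marker found in the template string.
def sketch_template_regexp(template)
  # Split into literal text and #{name} markers, keeping the markers.
  parts = template.split(/(\#\{\w+\})/)
  source = parts.map do |part|
    if part =~ /\A\#\{(\w+)\}\z/
      "(?<#{$1}>.*?)" # marker: lazy named capture
    else
      # literal: escape each word and let any run of whitespace match
      part.split(/\s+/, -1).map { |w| Regexp.escape(w) }.join('\s+')
    end
  end.join
  Regexp.new(source, Regexp::MULTILINE | Regexp::IGNORECASE)
end

template = 'Hello my name is #{name}. How are you?'
data     = 'Hello my name is Mr.Bill. How are you?'

match = sketch_template_regexp(template).match(data)
name = match && match[:name]
puts name # => Mr.Bill
```

Named captures keep the sketch short. A scraper that must support repeated sections would likely need the positional bookkeeping that RedNails uses instead.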
data/lib/rednails.rb ADDED
@@ -0,0 +1,210 @@
+ #
+ # Copyright (c) 2006, Ubiquitous Business Technology (http://ubit.com)
+ # All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions are
+ # met:
+ #
+ #
+ # * Redistributions of source code must retain the above copyright
+ #   notice, this list of conditions and the following disclaimer.
+ #
+ # * Redistributions in binary form must reproduce the above
+ #   copyright notice, this list of conditions and the following
+ #   disclaimer in the documentation and/or other materials provided
+ #   with the distribution.
+ #
+ # * Neither the name of Ubit nor the names of its
+ #   contributors may be used to endorse or promote products derived
+ #   from this software without specific prior written permission.
+ #
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ # OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ #
+ # == Author
+ # Zev Blut (zb@ubit.com)
+
+ require 'timeout' # Timeout.timeout guards the regexp match in #parse
+
+ class RedNails
+   attr_reader :variables
+
+   # Takes a marked template file and an optional boolean that determines if an
+   # exception should be raised on errors.
+   def initialize(templatefile, raises_exception = false)
+     @templatefile = templatefile
+     @variables = nil
+     @repetitions = Hash.new
+     @regexp = parse_template(templatefile)
+     @excepts = raises_exception
+   end
+
+   # Takes a string of data to scrape.
+   # Returns an array of the values matched by the template's variables.
+   def parse(text)
+     # 10 second timeout, because these huge regular expressions can take
+     # a LONG time if there is no match.
+     m = nil
+     begin
+       Timeout.timeout(10) do
+         m = @regexp.match(text)
+       end
+     rescue Timeout::Error
+       raise "REGEXP TIMEOUT!"
+     end
+
+     if m.nil?
+       if @excepts
+         raise "REGEXP from #{@templatefile} IS::::::::::::::::::::::::\n#{@regexp.source}\n" +
+               "COULD NOT MATCH PAGE TEXT:::::::::::::::::::::::::::::\n#{text}"
+       end
+       return nil
+     end
+
+     vals = []
+     # 1...m.size covers indexes 1 to m.size - 1, i.e. every capture group
+     (1...m.size).each do |i|
+       if @repetitions.key?(i)
+         reg = @repetitions[i][0]
+         vals << m[i].scan(reg)
+       else
+         vals << m[i]
+       end
+     end
+     return vals
+   end
+
+   # Takes a string of data to scrape.
+   # Returns a Hash keyed by template variable name; values from a Rep marker
+   # are keyed as name_1, name_2, ... in order of appearance.
+   def parse_hash(text)
+     vals = parse(text)
+     return nil if vals.nil?
+     hvals = {}
+     # Can probably do a block pass and yield instead of this.
+     vals.each_index do |i|
+       if @repetitions.key?(i+1)
+         varnames = @repetitions[i+1][1]
+         k=1
+         vals[i].each do |valcombo|
+           valcombo.each_index do |j|
+             hvals["#{varnames[j]}_#{k}"] = valcombo[j]
+           end
+           k+=1
+         end
+       else
+         hvals[@variables[i]] = vals[i]
+       end
+     end
+     return hvals
+   end
+
+   def print_detailed
+     puts "RedNails Detailed Info"
+     puts "-----------Regular Expression Source--------------"
+     puts @regexp.source
+     puts "--------------------------------------------------"
+     puts "-----------Variables------------------------------"
+     puts @variables.inspect
+     puts "--------------------------------------------------"
+     puts "-----------Repetitions------------------------------"
+     puts @repetitions.inspect
+     puts "--------------------------------------------------"
+   end
+
+   ###########################################################################
+   private
+
+   def parse_template(template)
+     templatetext = File.open(template,"r") { |f| f.read }
+     literals = Array.new
+     tail = ""
+     @variables = Array.new
+
+     tmptext = templatetext
+     while m = /(.*?)\#\{(.*?)\}(.*)/mi.match(tmptext)
+       literals << m[1]
+       @variables << m[2]
+       tail = m[3]
+       tmptext = m[3]
+       reps,repnames = check_for_repetition(m[2])
+       if reps.is_a?(Regexp)
+         @repetitions[@variables.size] = [reps,repnames]
+       end
+     end
+     # push the last matched tail onto the list
+     literals << tail
+
+     literals = literals.map do |lit|
+       litexp = ""
+       # find all the whitespace items and condense to \s
+       split = lit.squeeze("\s\t\n\r\f").split(/\s/)
+       split.each_index do |i|
+         if split[i] != ""
+           litexp << Regexp.escape(split[i])
+           # put a regexp for \s if it is not the last lit
+           # because we do not want to eat the spaces in a variable
+           if i != (split.size - 1)
+             litexp << "\\s*"
+           end
+         else
+           litexp << "\\s*"
+         end
+       end
+       litexp
+     end
+
+     # generate the regular expression
+     regexp = "\\s*"
+     literals.each_with_index do |val,i|
+       regexp << val
+       regexp << "(.*)" if @variables.size > i
+     end
+     regexp << "\\s*"
+
+     return Regexp.new(regexp, Regexp::MULTILINE | Regexp::IGNORECASE)
+   end
+
+   def check_for_repetition(var)
+     if m = /Rep:(.*)/mi.match(var)
+       varnames = Array.new
+       vals = m[1].scan(/(.*?)@(.*?)@(\S?)/mi)
+       reg = ""
+       vals.each do |lit,svar,taillit|
+         varnames << svar
+         litexp = ""
+         # find all the whitespace items and condense to \s
+         split = lit.squeeze("\s\t\n\r\f").split(/\s/)
+         split.each_index do |i|
+           if split[i] != ""
+             litexp << Regexp.escape(split[i])
+             # put a regexp for \s if it is not the last lit
+             # because we do not want to eat the spaces in a variable
+             if i != (split.size - 1)
+               litexp << "\\s*"
+             end
+           else
+             litexp << "\\s*"
+           end
+         end
+         reg << "#{litexp}(.*?)#{Regexp.escape(taillit)}"
+       end
+       reg << "\\s*"
+       return Regexp.new(reg, Regexp::MULTILINE | Regexp::IGNORECASE),varnames
+     else
+       return var,nil
+     end
+   end
+
+ end
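The whitespace-condensing loop appears twice in the file above, once in parse_template and once in check_for_repetition. A minimal sketch of how it could be pulled into one helper follows. The name literal_to_source is an assumption and does not exist in RedNails.

```ruby
# Hypothetical helper mirroring the duplicated loop: turns literal template
# text into regexp source in which whitespace between words is optional.
def literal_to_source(lit)
  words = lit.squeeze(" \t\n\r\f").split(/\s/)
  source = +"" # unary plus gives a mutable string
  words.each_with_index do |word, i|
    if word.empty?
      source << "\\s*"
    else
      source << Regexp.escape(word)
      # no trailing \s* after the last word, so whitespace next to a
      # variable is left for the capture group
      source << "\\s*" unless i == words.size - 1
    end
  end
  source
end

puts literal_to_source("This is a ")             # => This\s*is\s*a
puts literal_to_source(" simple text!\nIt is ")  # => \s*simple\s*text!\s*It\s*is
```

Both call sites could then reduce to a single call, which would keep the two code paths from drifting apart.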
data/test/reptemplate.html ADDED
@@ -0,0 +1,12 @@
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>Reps</title>
+ </head>
+
+ <body>
+ <h1>Reps</h1>
+ #{Rep:<img src="@url@" alt="@txt@"/>}
+ <hr>
+ </body>
+ </html>
data/test/reptemplatetest.html ADDED
@@ -0,0 +1,19 @@
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>Reps</title>
+ </head>
+
+ <body>
+ <h1>Reps</h1>
+ <img src="nug1.jpg" alt="nug1"/>
+ <img src="nug2.jpg" alt="nug2"/>
+ <img src="nug3.jpg" alt="nug3"/>
+ <img src="nug4.jpg" alt="nug4"/>
+
+
+ <IMG
+ src="nug5.jpg" alt="nug5"/>
+ <hr>
+ </body>
+ </html>
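The repetition fixture above exercises case-insensitive matching and a tag split across lines. The sketch below writes the per-item pattern out by hand and applies String#scan to a shortened copy of the data. It is an assumption, based on reading check_for_repetition, that this is the pattern RedNails generates for the Rep marker.

```ruby
# Hand-written per-item pattern (assumed equivalent to the generated one):
# \s* tolerates the line break inside the last tag and the i flag lets <IMG match.
item = /<img\s*src="(.*?)"\s*alt="(.*?)"\s*/mi

html = <<~HTML
  <h1>Reps</h1>
  <img src="nug1.jpg" alt="nug1"/>
  <img src="nug2.jpg" alt="nug2"/>

  <IMG
  src="nug5.jpg" alt="nug5"/>
  <hr>
HTML

# scan returns one [url, txt] pair per repeated item
pairs = html.scan(item)
p pairs
# => [["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"], ["nug5.jpg", "nug5"]]
```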
data/test/simpletemplate.txt ADDED
@@ -0,0 +1,2 @@
+ This is a #{s} simple text!
+ It is #{v} simple no!
data/test/simpletest.txt ADDED
@@ -0,0 +1,2 @@
+ This is a simple simple text!
+ It is very simple no!
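The simple fixture pair is a useful place to see how greedy groups behave when the captured word also appears in the literal text. The regexp below is a hand-written approximation of what parse_template appears to build for simpletemplate.txt. It is an assumption from reading the code, not output captured from the library.

```ruby
# Assumed shape of the generated regexp: each literal is condensed with \s*
# and each #{...} marker becomes a greedy (.*) group.
greedy = /\s*This\s*is\s*a(.*)\s*simple\s*text!\s*It\s*is(.*)\s*simple\s*no!\s*/mi

data = "This is a simple simple text!\nIt is very simple no!\n"

values = greedy.match(data).captures
p values              # => [" simple ", " very "]
p values.map(&:strip) # => ["simple", "very"]
```

The surrounding whitespace stays inside each capture, so callers that need clean values may want to strip them.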
data/test/tc_rednails.rb ADDED
@@ -0,0 +1,50 @@
+ # $Id: tc_scrapper.rb 319 2005-12-06 05:51:40Z zb $
+ #
+ # Test cases for the Scrapper
+ #
+ # Copyright (c) 2004-2005 Ubiquitous Business Technology, Inc.
+ #
+ # Authors: Zev Blut
+
+ require 'test/unit'
+ require "rednails"
+
+ RN_PATH = File.dirname(__FILE__)
+
+ class TC_RedNails < Test::Unit::TestCase
+
+   def read_file(file)
+     File.open(file,"r") { |f| f.read }
+   end
+
+   def test_basic_template
+     stemplate = RedNails.new("#{RN_PATH}/templatefile.html")
+     test_text = read_file("#{RN_PATH}/templatefiletest.html")
+     arr = stemplate.parse(test_text)
+     assert_equal(["Header muck meader","Test Nug text","NUG2"],
+                  arr,
+                  "Extracted variables from template test file is not what was expected")
+   end
+
+   def test_rep_template
+     stemplate = RedNails.new("#{RN_PATH}/reptemplate.html")
+     test_text = read_file("#{RN_PATH}/reptemplatetest.html")
+     arr = stemplate.parse(test_text)
+     assert_equal([[
+                    ["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"],
+                    ["nug3.jpg", "nug3"], ["nug4.jpg", "nug4"],
+                    ["nug5.jpg", "nug5"] ]],
+                  arr,
+                  "Extracted variables from template test file is not what was expected")
+
+     hash = stemplate.parse_hash(test_text)
+     assert_equal({ "url_1" => "nug1.jpg", "txt_1" => "nug1",
+                    "url_2" => "nug2.jpg", "txt_2" => "nug2",
+                    "url_3" => "nug3.jpg", "txt_3" => "nug3",
+                    "url_4" => "nug4.jpg", "txt_4" => "nug4",
+                    "url_5" => "nug5.jpg", "txt_5" => "nug5" },
+                  hash,
+                  "Extracted variable hash from template test file is not what was expected")
+   end
+
+ end
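The hash asserted in test_rep_template uses keys of the form name_k, where k is the 1-based position of the repeated item. A standalone sketch of that naming scheme follows. flatten_rows is a hypothetical name, and the code only mirrors the behaviour the test expects rather than reusing RedNails internals.

```ruby
# Hypothetical helper: row k of variable "url" becomes the key "url_k".
def flatten_rows(varnames, rows)
  rows.each_with_index.with_object({}) do |(row, k), out|
    row.each_with_index do |value, j|
      out["#{varnames[j]}_#{k + 1}"] = value
    end
  end
end

rows = [["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"]]
hash = flatten_rows(%w[url txt], rows)
p hash
# => {"url_1"=>"nug1.jpg", "txt_1"=>"nug1", "url_2"=>"nug2.jpg", "txt_2"=>"nug2"}
```

A flat hash is easy to look up by key, although callers that want to iterate over items may find the nested array returned by parse more convenient.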
data/test/templatefile.html ADDED
@@ -0,0 +1,15 @@
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>Stuff</title>
+ </head>
+
+ <Body>
+ <h1>#{header}</h1>
+ <table>
+ <tr>#{table1}</tr>
+ <tr>#{table2}</tr>
+ </table>
+ <hr>
+ </body>
+ </html>
data/test/templatefiletest.html ADDED
@@ -0,0 +1,15 @@
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+ <html>
+ <head>
+ <title>Stuff</title>
+ </head>
+
+ <body>
+ <h1>Header muck meader</h1>
+ <table>
+ <tr>Test Nug text</tr>
+ <tr>NUG2</tr>
+ </table>
+ <hr>
+ </body>
+ </html>
metadata ADDED
@@ -0,0 +1,53 @@
+ --- !ruby/object:Gem::Specification
+ rubygems_version: 0.8.11
+ specification_version: 1
+ name: RedNails
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ date: 2006-12-13 00:00:00 +09:00
+ summary: A template driven data scraper
+ require_paths:
+ - lib
+ email: zb@ubit.com
+ homepage: http://rednails.rubyforge.org/
+ rubyforge_project: rednails
+ description:
+ autorequire: rednails
+ default_executable:
+ bindir: bin
+ has_rdoc: true
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
+   requirements:
+   - - ">"
+     - !ruby/object:Gem::Version
+       version: 0.0.0
+   version:
+ platform: ruby
+ signing_key:
+ cert_chain:
+ authors:
+ - Zev Blut
+ files:
+ - lib/rednails.rb
+ - test/simpletemplate.txt
+ - test/reptemplate.html
+ - test/reptemplatetest.html
+ - test/simpletest.txt
+ - test/tc_rednails.rb
+ - test/templatefile.html
+ - test/templatefiletest.html
+ - README
+ test_files:
+ - test/tc_rednails.rb
+ rdoc_options: []
+
+ extra_rdoc_files:
+ - README
+ executables: []
+
+ extensions: []
+
+ requirements: []
+
+ dependencies: []
+
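For readers more used to the Ruby gemspec DSL than to the serialized YAML, the same metadata could be written roughly as below. The field values are taken from the YAML above. The layout and the choice of which fields to carry over are assumptions, and this is not a file shipped with the gem.

```ruby
require 'rubygems'

# Sketch only: builds an in-memory specification with the values shown in
# the metadata; nothing is validated or written to disk.
spec = Gem::Specification.new do |s|
  s.name             = "RedNails"
  s.version          = "1.0.0"
  s.summary          = "A template driven data scraper"
  s.authors          = ["Zev Blut"]
  s.email            = "zb@ubit.com"
  s.homepage         = "http://rednails.rubyforge.org/"
  s.require_paths    = ["lib"]
  s.files            = ["lib/rednails.rb", "README"] +
                       %w[simpletemplate.txt reptemplate.html reptemplatetest.html
                          simpletest.txt tc_rednails.rb templatefile.html
                          templatefiletest.html].map { |f| "test/#{f}" }
  s.extra_rdoc_files = ["README"]
end

puts "#{spec.name} #{spec.version}" # => RedNails 1.0.0
```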