RedNails 1.0.0

data/README ADDED
@@ -0,0 +1,76 @@
1
+ RedNails
2
+
3
+ Description:
4
+ RedNails is a data scraping library that uses templates to determine what data
5
+ to extract from actual data feeds.
6
+
7
+ RedNails uses the template to create a regular expression that catches the
8
+ user marker variables. When a string of data is passed to RedNails it will
9
+ use the regular expression to extract the matches and return them to the user.
10
+
11
+ If the scraped data is regular enough then RedNails is a simple way to extract
12
+ data as all one needs to do is copy a live data feed and mark the points to
13
+ extract and make this the template.
14
+
15
+ License: BSD
16
+
17
+ Usage:
18
+ 1) Create a template.
19
+ 2) Load and initialize an instance of a RedNails object with the template.
20
+ 3) Pass this instance your data feed from which you wish to extract information.
21
+ 4) Use the results.
22
+
23
+ Template Format:
24
+ A RedNails template is simply a text file that has the points to scrape marked
25
+ with what looks like a ruby string substitution. You give each substitution a
26
+ unique variable name that can be referenced when using the parse_hash method.
27
+
28
+ An example template is:
29
+
30
+ "Hello my name is #{name}. How are you?"
31
+
32
+ If the data string to scrape is:
33
+
34
+ "Hello my name is Mr.Bill. How are you?"
35
+
36
+ Then the following code fragment will produce "Mr.Bill":
37
+
38
+ require 'rednails'
39
+ rednails = RedNails.new("template.txt")
40
+ results = rednails.parse_hash(File.read("livedata.txt"))
41
+ puts results["name"] # => Mr.Bill
42
+
43
+ Repetitions:
44
+ If the data you would like to extract repeats itself, there is an additional
45
+ template marker you can use.
46
+ Replace the repeating data with a #{Rep:} marker. After the colon inside the
47
+ Rep marker, place the structured data that repeats, except that each unique
48
+ piece of data you would like to extract is replaced with a unique variable
49
+ name that starts and ends with @.
50
+
51
+ For example if you have an arbitrary list of images that you would like to
52
+ extract you can make a template like this:
53
+ <html>
54
+ <body>
55
+ A bunch of photos:
56
+ #{Rep:<img src="@url@" alt="@txt@"/>}
57
+ </body>
58
+ </html>
59
+
60
+ For more details please see the test cases.
61
+
62
+ Installation:
63
+
64
+ * Gem
65
+ gem install rednails
66
+
67
+ * Manual
68
+ As root
69
+ # ruby setup.rb all
70
+
71
+
72
+ Author and Contributions:
73
+ Zev Blut
74
+ With some changes and help by Min Lin Hsieh, Daniel DeLorme and
75
+ Pierre Baumard.
76
+
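The README's description of turning a template into a regular expression can be illustrated with a minimal, self-contained sketch. The helpers `template_to_regexp` and `scrape` are hypothetical names used only for this illustration; they are not the RedNails API, and the whitespace handling here is a simplifying assumption that differs from lib/rednails.rb.

```ruby
# Hypothetical helper (not part of RedNails): builds a regexp from a template
# string by turning each #{...} marker into a capture group.
def template_to_regexp(template)
  names = []
  source = String.new
  # split keeps the captured marker names at the odd indexes of the result.
  parts = template.split(/\#\{(.*?)\}/m, -1)
  parts.each_with_index do |part, i|
    if i.odd?
      names << part
      source << "(.*)"
    else
      # Escape the literal words and let any run of whitespace match \s*.
      words = part.split(/\s+/, -1).map { |w| Regexp.escape(w) }
      source << words.join("\\s*")
    end
  end
  [Regexp.new(source, Regexp::MULTILINE | Regexp::IGNORECASE), names]
end

# Hypothetical helper: returns a Hash of marker name => captured text,
# or nil when the data does not match the template.
def scrape(template, data)
  regexp, names = template_to_regexp(template)
  m = regexp.match(data)
  return nil if m.nil?
  names.zip(m.captures).to_h
end

# Single quotes keep Ruby from interpolating the #{name} marker.
result = scrape('Hello my name is #{name}. How are you?',
                'Hello my name is Mr.Bill. How are you?')
puts result["name"]
```

Because this sketch folds the whitespace around a marker into the surrounding literal, the captured value carries no leading space; the library's own handling of whitespace next to a marker may behave differently.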
@@ -0,0 +1,210 @@
1
+ #
2
+ # Copyright (c) 2006, Ubiquitous Business Technology (http://ubit.com)
3
+ # All rights reserved.
4
+ #
5
+ # Redistribution and use in source and binary forms, with or without
6
+ # modification, are permitted provided that the following conditions are
7
+ # met:
8
+ #
9
+ #
10
+ # * Redistributions of source code must retain the above copyright
11
+ # notice, this list of conditions and the following disclaimer.
12
+ #
13
+ # * Redistributions in binary form must reproduce the above
14
+ # copyright notice, this list of conditions and the following
15
+ # disclaimer in the documentation and/or other materials provided
16
+ # with the distribution.
17
+ #
18
+ # * Neither the name of Ubit nor the names of its
19
+ # contributors may be used to endorse or promote products derived
20
+ # from this software without specific prior written permission.
21
+ #
22
+ #
23
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
24
+ # "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
25
+ # LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
26
+ # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
27
+ # OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
28
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
29
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
30
+ # DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
31
+ # THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
32
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
33
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
34
+ #
35
+ # == Author
36
+ # Zev Blut (zb@ubit.com)
37
+
38
+ require 'timeout'
39
+
40
+ class RedNails
41
+ attr_reader :variables
42
+
43
+ # Takes a marked template file and an optional boolean that determines if an
44
+ # exception should be raised on errors.
45
+ def initialize(templatefile, raises_exception = false)
46
+ @templatefile = templatefile
47
+ @variables = nil
48
+ @repetitions = Hash.new
49
+ @regexp = parse_template(templatefile)
50
+ @excepts = raises_exception
51
+ end
52
+
53
+ # Takes a string of data to scrape.
54
+ # Returns an array of the values captured for the template variables, or nil if nothing matches.
55
+ def parse(text)
56
+ #10 seconds timeout
57
+ #because those huge regular expressions can take a LONG time if there is no match
58
+ m = nil
59
+ begin
60
+ Timeout.timeout(10) do
61
+ m = @regexp.match(text)
62
+ end
63
+ rescue Timeout::Error
64
+ raise "REGEXP TIMEOUT!"
65
+ end
66
+
67
+ if m.nil?
68
+ if @excepts
69
+ raise "REGEXP from #{@templatefile} IS::::::::::::::::::::::::\n#{@regexp.source}" +
70
+ "COULD NOT MATCH PAGE TEXT:::::::::::::::::::::::::::::\n#{text}"
71
+ end
72
+ return nil
73
+ end
74
+
75
+ vals = []
76
+ # the ... range excludes its end, so this covers 1 to m.size - 1: every capture group
77
+ (1...m.size).each do |i|
78
+ if @repetitions.key?(i)
79
+ reg = @repetitions[i][0]
80
+ vals<< m[i].scan(reg)
81
+ else
82
+ vals<< m[i]
83
+ end
84
+ end
85
+ return vals
86
+ end
87
+
88
+ # Takes a string of data to scrape.
89
+ # Returns a Hash with the template variable names as keys and matching
90
+ # scraped data as values.
91
+ def parse_hash(text)
92
+ vals = parse(text)
93
+ return nil if vals.nil?
94
+ hvals = {}
95
+ # Can probably do a block pass and yield instead of this.
96
+ vals.each_index do |i|
97
+ if @repetitions.key?(i+1)
98
+ varnames = @repetitions[i+1][1]
99
+ k=1
100
+ vals[i].each do |valcombo|
101
+ valcombo.each_index do |j|
102
+ hvals["#{varnames[j]}_#{k}"] = valcombo[j]
103
+ end
104
+ k+=1
105
+ end
106
+ else
107
+ hvals[@variables[i]] = vals[i]
108
+ end
109
+ end
110
+ return hvals
111
+ end
112
+
113
+ def print_detailed
114
+ puts "RedNails Detailed Info"
115
+ puts "-----------Regular Expression Source--------------"
116
+ puts @regexp.source
117
+ puts "--------------------------------------------------"
118
+ puts "-----------Variables------------------------------"
119
+ puts @variables.inspect
120
+ puts "--------------------------------------------------"
121
+ puts "-----------Repetitions------------------------------"
122
+ puts @repetitions.inspect
123
+ puts "--------------------------------------------------"
124
+ end
125
+
126
+ ###########################################################################
127
+ private
128
+
129
+ def parse_template(template)
130
+ templatetext = File.open(template,"r") { |f| f.read }
131
+ literals = Array.new
132
+ tail = ""
133
+ @variables = Array.new
134
+
135
+ tmptext = templatetext
136
+ while m = /(.*?)\#\{(.*?)\}(.*)/mi.match(tmptext)
137
+ literals << m[1]
138
+ @variables<< m[2]
139
+ tail = m[3]
140
+ tmptext = m[3]
141
+ reps,repnames = check_for_repetition(m[2])
142
+ if reps.is_a?(Regexp)
143
+ @repetitions[@variables.size] = [reps,repnames]
144
+ end
145
+ end
146
+ # push the last matched tail onto the list
147
+ literals<< tail
148
+
149
+ literals = literals.map do |lit|
150
+ litexp = ""
151
+ # find all the whitespace items and condense to \s
152
+ split = lit.squeeze("\s\t\n\r\f").split(/\s/)
153
+ split.each_index do |i|
154
+ if split[i] != ""
155
+ litexp<< Regexp.escape(split[i])
156
+ # put a regexp for \s if it is not the last lit
157
+ # because we do not want to eat the spaces in a variable
158
+ if i != (split.size - 1)
159
+ litexp<< "\\s*"
160
+ end
161
+ else
162
+ litexp<< "\\s*"
163
+ end
164
+ end
165
+ litexp
166
+ end
167
+
168
+ # generate the regular expression
169
+ regexp = "\\s*"
170
+ literals.each_with_index do |val,i|
171
+ regexp<< literals[i]
172
+ regexp<< "(.*)" if @variables.size > i
173
+ end
174
+ regexp<< "\\s*"
175
+
176
+ return Regexp.new(regexp, Regexp::MULTILINE | Regexp::IGNORECASE)
177
+ end
178
+
179
+ def check_for_repetition(var)
180
+ if m = /Rep:(.*)/mi.match(var)
181
+ varnames = Array.new
182
+ vals = m[1].scan(/(.*?)@(.*?)@(\S?)/mi)
183
+ reg = ""
184
+ vals.each do |lit,svar,taillit|
185
+ varnames<< svar
186
+ litexp = ""
187
+ # find all the whitespace items and condense to \s
188
+ split = lit.squeeze("\s\t\n\r\f").split(/\s/)
189
+ split.each_index do |i|
190
+ if split[i] != ""
191
+ litexp<< Regexp.escape(split[i])
192
+ # put a regexp for \s if it is not the last lit
193
+ # because we do not want to eat the spaces in a variable
194
+ if i != (split.size - 1)
195
+ litexp<< "\\s*"
196
+ end
197
+ else
198
+ litexp<< "\\s*"
199
+ end
200
+ end
201
+ reg<< "#{litexp}(.*?)#{Regexp.escape(taillit)}"
202
+ end
203
+ reg<< "\\s*"
204
+ return Regexp.new(reg, Regexp::MULTILINE | Regexp::IGNORECASE),varnames
205
+ else
206
+ return var,nil
207
+ end
208
+ end
209
+
210
+ end
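The 10-second guard in `parse` can be sketched in isolation as follows. `match_with_timeout` is a hypothetical wrapper introduced only for this illustration; it assumes the standard `timeout` library and rescues only `Timeout::Error`, so unrelated errors are not reported as timeouts.

```ruby
require 'timeout'

# Hypothetical wrapper (not part of RedNails): runs a regexp match with a
# time limit and reports a timeout with a descriptive message.
def match_with_timeout(regexp, text, seconds = 10)
  Timeout.timeout(seconds) { regexp.match(text) }
rescue Timeout::Error
  raise "REGEXP TIMEOUT after #{seconds} seconds"
end

m = match_with_timeout(/name is (.*)\./,
                       "Hello my name is Mr.Bill. How are you?")
puts m[1]
```

On Ruby 3.2 and later, `Regexp.timeout=` may be another way to bound matching time; that is outside what this package targets.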
@@ -0,0 +1,12 @@
1
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2
+ <html>
3
+ <head>
4
+ <title>Reps</title>
5
+ </head>
6
+
7
+ <body>
8
+ <h1>Reps</h1>
9
+ #{Rep:<img src="@url@" alt="@txt@"/>}
10
+ <hr>
11
+ </body>
12
+ </html>
@@ -0,0 +1,19 @@
1
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2
+ <html>
3
+ <head>
4
+ <title>Reps</title>
5
+ </head>
6
+
7
+ <body>
8
+ <h1>Reps</h1>
9
+ <img src="nug1.jpg" alt="nug1"/>
10
+ <img src="nug2.jpg" alt="nug2"/>
11
+ <img src="nug3.jpg" alt="nug3"/>
12
+ <img src="nug4.jpg" alt="nug4"/>
13
+
14
+
15
+ <IMG
16
+ src="nug5.jpg" alt="nug5"/>
17
+ <hr>
18
+ </body>
19
+ </html>
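The repetition template and test page above can be traced with a small sketch of how a `#{Rep:...}` body might become an inner regular expression that is applied with `String#scan`. `rep_to_regexp` is a hypothetical helper modeled loosely on `check_for_repetition`; it is a simplified illustration under that assumption, not the library's code, and the sample data is a shortened copy of the test page.

```ruby
# Hypothetical helper (not part of RedNails): converts the body of a
# Rep marker into a regexp plus the list of @name@ variables.
def rep_to_regexp(body)
  names = []
  source = String.new
  # Each match yields the literal before a variable, the variable name,
  # and at most one non-space character that follows it.
  body.scan(/(.*?)@(.*?)@(\S?)/m) do |literal, name, tail|
    names << name
    words = literal.split(/\s+/, -1).map { |w| Regexp.escape(w) }
    source << words.join("\\s*") << "(.*?)" << Regexp.escape(tail)
  end
  [Regexp.new(source, Regexp::MULTILINE | Regexp::IGNORECASE), names]
end

regexp, names = rep_to_regexp('<img src="@url@" alt="@txt@"/>')

data = <<~HTML
  <img src="nug1.jpg" alt="nug1"/>
  <IMG
  src="nug5.jpg" alt="nug5"/>
HTML

# scan returns one array of captures per repetition found in the data.
rows = data.scan(regexp)

# Flatten into name_N keys, numbering repetitions from 1.
flat = {}
rows.each_with_index do |row, k|
  names.each_with_index { |name, j| flat["#{name}_#{k + 1}"] = row[j] }
end

p rows
p flat
```

The `name_N` key scheme mirrors what the test case expects from `parse_hash` (`url_1`, `txt_1`, and so on).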
@@ -0,0 +1,2 @@
1
+ This is a #{s} simple text!
2
+ It is #{v} simple no!
@@ -0,0 +1,2 @@
1
+ This is a simple simple text!
2
+ It is very simple no!
@@ -0,0 +1,50 @@
1
+ # $Id: tc_scrapper.rb 319 2005-12-06 05:51:40Z zb $
2
+ #
3
+ # Test cases for the Scrapper
4
+ #
5
+ # Copyright (c) 2004-2005 Ubiquitous Business Technology, Inc.
6
+ #
7
+ # Authors: Zev Blut
8
+
9
+ require 'test/unit'
10
+ require "rednails"
11
+
12
+ RN_PATH = File.dirname(__FILE__)
13
+
14
+ class TC_RedNails < Test::Unit::TestCase
15
+
16
+ def read_file(file)
17
+ File.open(file,"r") { |f| f.read }
18
+ end
19
+
20
+ def test_basic_template
21
+ stemplate = RedNails.new("#{RN_PATH}/templatefile.html")
22
+ test_text = read_file("#{RN_PATH}/templatefiletest.html")
23
+ arr = stemplate.parse(test_text)
24
+ assert_equal(["Header muck meader","Test Nug text","NUG2"],
25
+ arr,
26
+ "Extracted variables from template test file is not what was expected")
27
+ end
28
+
29
+ def test_rep_template
30
+ stemplate = RedNails.new("#{RN_PATH}/reptemplate.html")
31
+ test_text = read_file("#{RN_PATH}/reptemplatetest.html")
32
+ arr = stemplate.parse(test_text)
33
+ assert_equal([[
34
+ ["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"],
35
+ ["nug3.jpg", "nug3"], ["nug4.jpg", "nug4"],
36
+ ["nug5.jpg", "nug5"] ]],
37
+ arr,
38
+ "Extracted variables from template test file is not what was expected")
39
+
40
+ hash = stemplate.parse_hash(test_text)
41
+ assert_equal({ "url_1" => "nug1.jpg", "txt_1" => "nug1",
42
+ "url_2" => "nug2.jpg", "txt_2" => "nug2",
43
+ "url_3" => "nug3.jpg", "txt_3" => "nug3",
44
+ "url_4" => "nug4.jpg", "txt_4" => "nug4",
45
+ "url_5" => "nug5.jpg", "txt_5" => "nug5" },
46
+ hash,
47
+ "Extracted variable hash from template test file is not what was expected")
48
+ end
49
+
50
+ end
@@ -0,0 +1,15 @@
1
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2
+ <html>
3
+ <head>
4
+ <title>Stuff</title>
5
+ </head>
6
+
7
+ <Body>
8
+ <h1>#{header}</h1>
9
+ <table>
10
+ <tr>#{table1}</tr>
11
+ <tr>#{table2}</tr>
12
+ </table>
13
+ <hr>
14
+ </body>
15
+ </html>
@@ -0,0 +1,15 @@
1
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2
+ <html>
3
+ <head>
4
+ <title>Stuff</title>
5
+ </head>
6
+
7
+ <body>
8
+ <h1>Header muck meader</h1>
9
+ <table>
10
+ <tr>Test Nug text</tr>
11
+ <tr>NUG2</tr>
12
+ </table>
13
+ <hr>
14
+ </body>
15
+ </html>
metadata ADDED
@@ -0,0 +1,53 @@
1
+ --- !ruby/object:Gem::Specification
2
+ rubygems_version: 0.8.11
3
+ specification_version: 1
4
+ name: RedNails
5
+ version: !ruby/object:Gem::Version
6
+ version: 1.0.0
7
+ date: 2006-12-13 00:00:00 +09:00
8
+ summary: A template driven data scraper
9
+ require_paths:
10
+ - lib
11
+ email: zb@ubit.com
12
+ homepage: http://rednails.rubyforge.org/
13
+ rubyforge_project: rednails
14
+ description:
15
+ autorequire: rednails
16
+ default_executable:
17
+ bindir: bin
18
+ has_rdoc: true
19
+ required_ruby_version: !ruby/object:Gem::Version::Requirement
20
+ requirements:
21
+ - - ">"
22
+ - !ruby/object:Gem::Version
23
+ version: 0.0.0
24
+ version:
25
+ platform: ruby
26
+ signing_key:
27
+ cert_chain:
28
+ authors:
29
+ - Zev Blut
30
+ files:
31
+ - lib/rednails.rb
32
+ - test/simpletemplate.txt
33
+ - test/reptemplate.html
34
+ - test/reptemplatetest.html
35
+ - test/simpletest.txt
36
+ - test/tc_rednails.rb
37
+ - test/templatefile.html
38
+ - test/templatefiletest.html
39
+ - README
40
+ test_files:
41
+ - test/tc_rednails.rb
42
+ rdoc_options: []
43
+
44
+ extra_rdoc_files:
45
+ - README
46
+ executables: []
47
+
48
+ extensions: []
49
+
50
+ requirements: []
51
+
52
+ dependencies: []
53
+
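For readers more familiar with gemspec files than with the serialized YAML form, the metadata above corresponds roughly to the following sketch. The field values are copied from the YAML; the block layout and the choice of which fields to include are assumptions, and the package's original gemspec is not part of this diff.

```ruby
require 'rubygems'

# Sketch of the metadata above in gemspec form. Values come from the YAML;
# the layout and the omitted fields are assumptions.
spec = Gem::Specification.new do |s|
  s.name             = "RedNails"
  s.version          = "1.0.0"
  s.summary          = "A template driven data scraper"
  s.authors          = ["Zev Blut"]
  s.email            = "zb@ubit.com"
  s.homepage         = "http://rednails.rubyforge.org/"
  s.require_paths    = ["lib"]
  s.files            = [
    "lib/rednails.rb",
    "test/simpletemplate.txt",
    "test/reptemplate.html",
    "test/reptemplatetest.html",
    "test/simpletest.txt",
    "test/tc_rednails.rb",
    "test/templatefile.html",
    "test/templatefiletest.html",
    "README"
  ]
  s.test_files       = ["test/tc_rednails.rb"]
  s.extra_rdoc_files = ["README"]
end

puts "#{spec.name} #{spec.version}"
```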