RedNails 1.0.0
- data/README +76 -0
- data/lib/rednails.rb +210 -0
- data/test/reptemplate.html +12 -0
- data/test/reptemplatetest.html +19 -0
- data/test/simpletemplate.txt +2 -0
- data/test/simpletest.txt +2 -0
- data/test/tc_rednails.rb +50 -0
- data/test/templatefile.html +15 -0
- data/test/templatefiletest.html +15 -0
- metadata +53 -0
data/README
ADDED
@@ -0,0 +1,76 @@
RedNails

Description:
RedNails is a data scraping library that uses templates to determine what data
to extract from actual data feeds.

RedNails uses the template to create a regular expression that captures the
variables marked by the user. When a string of data is passed to RedNails it
uses the regular expression to extract the matches and returns them to the user.

If the scraped data is regular enough, RedNails is a simple way to extract
data: all one needs to do is copy a live data feed, mark the points to
extract, and make this the template.
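
As an illustration of the idea described above, here is a minimal,
self-contained sketch of turning a marked template into a regular expression.
It is not the code in lib/rednails.rb: the helper name template_to_regexp is
hypothetical, the whitespace handling is simplified (the \s* around a marker
absorbs neighbouring whitespace), and the capture is non-greedy where the
library uses (.*).

```ruby
# Sketch only: a simplified stand-in for what RedNails does internally.
# template_to_regexp is a hypothetical helper, not part of the library.
def template_to_regexp(template)
  names = []
  # Splitting on a capturing pattern keeps the #{...} markers in the result.
  pieces = template.split(/(\#\{.*?\})/m)
  source = pieces.map do |piece|
    if (marker = /\A\#\{(.*?)\}\z/m.match(piece))
      names << marker[1]
      "(.*?)" # one capture group per marker
    else
      # Escape literal words; any run of whitespace may match \s*.
      piece.split(/(\s+)/).map do |token|
        token.match?(/\A\s+\z/) ? "\\s*" : Regexp.escape(token)
      end.join
    end
  end.join
  regexp = Regexp.new("\\A\\s*#{source}\\s*\\z",
                      Regexp::MULTILINE | Regexp::IGNORECASE)
  [regexp, names]
end

# Single quotes keep Ruby from interpolating the marker.
regexp, names = template_to_regexp('Hello my name is #{name}. How are you?')
match = regexp.match("Hello my name is Mr.Bill. How are you?")
result = names.zip(match.captures).to_h
puts result["name"]
```

The pairing of marker names with capture groups by position is the same idea
that parse_hash relies on when it builds its Hash.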

License: BSD

Usage:
1) Create a template.
2) Load and initialize an instance of a RedNails object with the template.
3) Pass this instance your data feed from which you wish to extract information.
4) Use the results.

Template Format:
A RedNails template is simply a text file that has the points to scrape marked
with what looks like a Ruby string substitution. You give each substitution a
unique variable name that can be referenced when using the parse_hash method.

An example template is:

"Hello my name is #{name}. How are you?"

If the data string to scrape is:

"Hello my name is Mr.Bill. How are you?"

Then the following code fragment will produce "Mr.Bill":

  require 'rednails'
  rednails = RedNails.new("template.txt")
  # parse_hash takes the data itself as a string, not a file name
  results = rednails.parse_hash(File.read("livedata.txt"))
  puts results["name"] # => Mr.Bill

Repetitions:
If you have data that repeats itself and that you would like to extract, there
is an additional template marker you can use.
As in the first example, replace the data with #{Rep:}. After the colon inside
the Rep marker, place the structured data that repeats, except that each
unique piece of data you would like to extract is replaced with a unique
variable name that starts and ends with @.

For example, if you have an arbitrary list of images that you would like to
extract, you can make a template like this:
  <html>
  <body>
  A bunch of photos:
  #{Rep:<img src="@url@" alt="@txt@"/>}
  </body>
  </html>

For more details please see the test cases.
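
To make the shape of repetition results concrete, here is a small
self-contained sketch. It does not call RedNails; the pattern below is a
hand-written, hypothetical stand-in for what a Rep marker such as
<img src="@url@" alt="@txt@"/> might compile to. The numbered key scheme
(url_1, txt_1, ...) follows the expectations in test/tc_rednails.rb.

```ruby
# Hypothetical stand-in for the pattern built from a Rep marker.
rep = /<img\s*src="(.*?)"\s*alt="(.*?)"\/>\s*/mi
names = %w[url txt] # the @url@ and @txt@ variable names

html = <<~HTML
  <img src="nug1.jpg" alt="nug1"/>
  <IMG
    src="nug2.jpg" alt="nug2"/>
HTML

# scan returns one array of captures per repetition found in the data.
rows = html.scan(rep)

# Flatten into a Hash, numbering each repetition from 1.
flat = {}
rows.each_with_index do |row, k|
  names.each_with_index { |name, j| flat["#{name}_#{k + 1}"] = row[j] }
end

puts flat.inspect
```

Here rows plays the role of the nested array returned by parse, and flat the
role of the Hash returned by parse_hash.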

Installation:

* Gem
  gem install rednails

* Manual
  As root
  # ruby setup.rb all


Author and Contributions:
Zev Blut
With some changes and help by Min Lin Hsieh, Daniel DeLorme and
Pierre Baumard.

data/lib/rednails.rb
ADDED
@@ -0,0 +1,210 @@
#
# Copyright (c) 2006, Ubiquitous Business Technology (http://ubit.com)
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
#
#    * Redistributions of source code must retain the above copyright
#      notice, this list of conditions and the following disclaimer.
#
#    * Redistributions in binary form must reproduce the above
#      copyright notice, this list of conditions and the following
#      disclaimer in the documentation and/or other materials provided
#      with the distribution.
#
#    * Neither the name of Ubit nor the names of its
#      contributors may be used to endorse or promote products derived
#      from this software without specific prior written permission.
#
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# == Author
# Zev Blut (zb@ubit.com)

require 'timeout' # Timeout.timeout guards the regexp match in #parse

class RedNails
  attr_reader :variables

  # Takes a marked template file and an optional boolean that determines if an
  # exception should be raised on errors.
  def initialize(templatefile, raises_exception = false)
    @templatefile = templatefile
    @variables = nil
    @repetitions = Hash.new
    @regexp = parse_template(templatefile)
    @excepts = raises_exception
  end

  # Takes a string of data to scrape.
  # Returns an array of the values matched for the variables defined in the
  # templatefile, or nil if the text does not match the template.
  def parse(text)
    # 10 second timeout, because these huge regular expressions can take a
    # LONG time if there is no match.
    m = nil
    begin
      Timeout.timeout(10) do
        m = @regexp.match(text)
      end
    rescue Timeout::Error
      # Only a timeout is reported as one; other errors propagate unchanged.
      raise "REGEXP TIMEOUT!"
    end

    if m.nil?
      if @excepts
        raise "REGEXP from #{@templatefile} IS::::::::::::::::::::::::\n#{@regexp.source}\n" +
          "COULD NOT MATCH PAGE TEXT:::::::::::::::::::::::::::::\n#{text}"
      end
      return nil
    end

    vals = []
    # 1...m.size covers every capture group and skips m[0], the whole match
    (1...m.size).each do |i|
      if @repetitions.key?(i)
        reg = @repetitions[i][0]
        vals << m[i].scan(reg)
      else
        vals << m[i]
      end
    end
    return vals
  end

  # Takes a string of data to scrape.
  # Returns a Hash with the template variable names as keys and matching
  # scraped data as values.
  def parse_hash(text)
    vals = parse(text)
    return nil if vals.nil?
    hvals = {}
    # Can probably do a block pass and yield instead of this.
    vals.each_index do |i|
      if @repetitions.key?(i + 1)
        # Repetition values are flattened into keys such as "url_1", "url_2".
        varnames = @repetitions[i + 1][1]
        k = 1
        vals[i].each do |valcombo|
          valcombo.each_index do |j|
            hvals["#{varnames[j]}_#{k}"] = valcombo[j]
          end
          k += 1
        end
      else
        hvals[@variables[i]] = vals[i]
      end
    end
    return hvals
  end

  def print_detailed
    puts "RedNails Detailed Info"
    puts "-----------Regular Expression Source--------------"
    puts @regexp.source
    puts "--------------------------------------------------"
    puts "-----------Variables------------------------------"
    puts @variables.inspect
    puts "--------------------------------------------------"
    puts "-----------Repetitions------------------------------"
    puts @repetitions.inspect
    puts "--------------------------------------------------"
  end

  ###########################################################################
  private

  def parse_template(template)
    templatetext = File.open(template, "r") { |f| f.read }
    literals = Array.new
    tail = ""
    @variables = Array.new

    tmptext = templatetext
    # Peel off one "literal text, #{marker}" pair per iteration.
    while (m = /(.*?)\#\{(.*?)\}(.*)/mi.match(tmptext))
      literals << m[1]
      @variables << m[2]
      tail = m[3]
      tmptext = m[3]
      reps, repnames = check_for_repetition(m[2])
      if reps.is_a?(Regexp)
        # Keyed by capture group number, which starts at 1.
        @repetitions[@variables.size] = [reps, repnames]
      end
    end
    # push the last matched tail onto the list
    literals << tail

    literals = literals.map do |lit|
      litexp = ""
      # find all the whitespace items and condense to \s
      split = lit.squeeze("\s\t\n\r\f").split(/\s/)
      split.each_index do |i|
        if split[i] != ""
          litexp << Regexp.escape(split[i])
          # put a regexp for \s if it is not the last lit
          # because we do not want to eat the spaces in a variable
          if i != (split.size - 1)
            litexp << "\\s*"
          end
        else
          litexp << "\\s*"
        end
      end
      litexp
    end

    # generate the regular expression
    regexp = "\\s*"
    literals.each_with_index do |val, i|
      regexp << val
      regexp << "(.*)" if @variables.size > i
    end
    regexp << "\\s*"

    return Regexp.new(regexp, Regexp::MULTILINE | Regexp::IGNORECASE)
  end

  # Returns [Regexp, variable names] for a Rep: marker, or [var, nil] for a
  # plain variable.
  def check_for_repetition(var)
    if (m = /Rep:(.*)/mi.match(var))
      varnames = Array.new
      vals = m[1].scan(/(.*?)@(.*?)@(\S?)/mi)
      reg = ""
      vals.each do |lit, svar, taillit|
        varnames << svar
        litexp = ""
        # find all the whitespace items and condense to \s
        split = lit.squeeze("\s\t\n\r\f").split(/\s/)
        split.each_index do |i|
          if split[i] != ""
            litexp << Regexp.escape(split[i])
            # put a regexp for \s if it is not the last lit
            # because we do not want to eat the spaces in a variable
            if i != (split.size - 1)
              litexp << "\\s*"
            end
          else
            litexp << "\\s*"
          end
        end
        reg << "#{litexp}(.*?)#{Regexp.escape(taillit)}"
      end
      reg << "\\s*"
      return Regexp.new(reg, Regexp::MULTILINE | Regexp::IGNORECASE), varnames
    else
      return var, nil
    end
  end

end
data/test/reptemplatetest.html
ADDED
@@ -0,0 +1,19 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Reps</title>
</head>

<body>
<h1>Reps</h1>
<img src="nug1.jpg" alt="nug1"/>
<img src="nug2.jpg" alt="nug2"/>
<img src="nug3.jpg" alt="nug3"/>
<img src="nug4.jpg" alt="nug4"/>


<IMG
src="nug5.jpg" alt="nug5"/>
<hr>
</body>
</html>
data/test/simpletest.txt
ADDED
data/test/tc_rednails.rb
ADDED
@@ -0,0 +1,50 @@
# $Id: tc_scrapper.rb 319 2005-12-06 05:51:40Z zb $
#
# Test cases for the Scrapper
#
# Copyright (c) 2004-2005 Ubiquitous Business Technology, Inc.
#
# Authors: Zev Blut

require 'test/unit'
require "rednails"

RN_PATH = File.dirname(__FILE__)

class TC_RedNails < Test::Unit::TestCase

  def read_file(file)
    File.open(file, "r") { |f| f.read }
  end

  def test_basic_template
    stemplate = RedNails.new("#{RN_PATH}/templatefile.html")
    test_text = read_file("#{RN_PATH}/templatefiletest.html")
    arr = stemplate.parse(test_text)
    assert_equal(["Header muck meader", "Test Nug text", "NUG2"],
                 arr,
                 "Extracted variables from template test file is not what was expected")
  end

  def test_rep_template
    stemplate = RedNails.new("#{RN_PATH}/reptemplate.html")
    test_text = read_file("#{RN_PATH}/reptemplatetest.html")
    arr = stemplate.parse(test_text)
    assert_equal([[
                   ["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"],
                   ["nug3.jpg", "nug3"], ["nug4.jpg", "nug4"],
                   ["nug5.jpg", "nug5"] ]],
                 arr,
                 "Extracted variables from template test file is not what was expected")

    hash = stemplate.parse_hash(test_text)
    assert_equal({ "url_1" => "nug1.jpg", "txt_1" => "nug1",
                   "url_2" => "nug2.jpg", "txt_2" => "nug2",
                   "url_3" => "nug3.jpg", "txt_3" => "nug3",
                   "url_4" => "nug4.jpg", "txt_4" => "nug4",
                   "url_5" => "nug5.jpg", "txt_5" => "nug5" },
                 hash,
                 "Extracted variable hash from template test file is not what was expected")
  end

end
metadata
ADDED
@@ -0,0 +1,53 @@
--- !ruby/object:Gem::Specification
rubygems_version: 0.8.11
specification_version: 1
name: RedNails
version: !ruby/object:Gem::Version
  version: 1.0.0
date: 2006-12-13 00:00:00 +09:00
summary: A template driven data scraper
require_paths:
  - lib
email: zb@ubit.com
homepage: http://rednails.rubyforge.org/
rubyforge_project: rednails
description:
autorequire: rednails
default_executable:
bindir: bin
has_rdoc: true
required_ruby_version: !ruby/object:Gem::Version::Requirement
  requirements:
    - - ">"
      - !ruby/object:Gem::Version
        version: 0.0.0
  version:
platform: ruby
signing_key:
cert_chain:
authors:
  - Zev Blut
files:
  - lib/rednails.rb
  - test/simpletemplate.txt
  - test/reptemplate.html
  - test/reptemplatetest.html
  - test/simpletest.txt
  - test/tc_rednails.rb
  - test/templatefile.html
  - test/templatefiletest.html
  - README
test_files:
  - test/tc_rednails.rb
rdoc_options: []

extra_rdoc_files:
  - README
executables: []

extensions: []

requirements: []

dependencies: []
