RedNails 1.0.0
- data/README +76 -0
- data/lib/rednails.rb +210 -0
- data/test/reptemplate.html +12 -0
- data/test/reptemplatetest.html +19 -0
- data/test/simpletemplate.txt +2 -0
- data/test/simpletest.txt +2 -0
- data/test/tc_rednails.rb +50 -0
- data/test/templatefile.html +15 -0
- data/test/templatefiletest.html +15 -0
- metadata +53 -0
data/README
ADDED
@@ -0,0 +1,76 @@
|
|
1
|
+
RedNails
|
2
|
+
|
3
|
+
Description:
|
4
|
+
RedNails is a data scraping library that uses templates to determine what data
|
5
|
+
to extract from actual data feeds.
|
6
|
+
|
7
|
+
RedNails uses the template to create a regular expression that catches the
|
8
|
+
user marker variables. When a string of data is passed to RedNails it will
|
9
|
+
use the regular expression to extract the matches and return them to the user.
|
10
|
+
|
11
|
+
If the scraped data is regular enough then RedNails is a simple way to extract
|
12
|
+
data as all one needs to do is copy a live data feed and mark the points to
|
13
|
+
extract and make this the template.
|
14
|
+
|
15
|
+
License: BSD
|
16
|
+
|
17
|
+
Usage:
|
18
|
+
1) Create a template.
|
19
|
+
2) Load and initialize an instance of a RedNails object with the template.
|
20
|
+
3) Pass this instance your data feed from which you wish to extract information.
|
21
|
+
4) Use the results.
|
22
|
+
|
23
|
+
Template Format:
|
24
|
+
A RedNails template is simply a text file that has the points to scrape marked
|
25
|
+
with what looks like a ruby string substitution. You give each substitution a
|
26
|
+
unique variable name that can be referenced when using the parse_hash method.
|
27
|
+
|
28
|
+
An example template is:
|
29
|
+
|
30
|
+
"Hello my name is #{name}. How are you?"
|
31
|
+
|
32
|
+
If the data string to scrape is:
|
33
|
+
|
34
|
+
"Hello my name is Mr.Bill. How are you?"
|
35
|
+
|
36
|
+
Then the following code fragment will produce "Mr.Bill":
|
37
|
+
|
38
|
+
require 'rednails'
|
39
|
+
rednails = RedNails.new("template.txt")
|
40
|
+
results = rednails.parse_hash(File.read("livedata.txt"))
|
41
|
+
puts results["name"] # => Mr.Bill
|
42
|
+
|
43
|
+
Repetitions:
|
44
|
+
If you have data to extract that repeats itself then there is
|
45
|
+
an additional template marker you can use.
|
46
|
+
Replace the repeating data with a #{Rep:} marker. After the colon inside
|
47
|
+
the Rep marker, place one copy of the structured data that repeats, except
|
48
|
+
that each unique piece of data that you would like to extract is replaced
|
49
|
+
with a unique variable name that starts and ends with @.
|
50
|
+
|
51
|
+
For example if you have an arbitrary list of images that you would like to
|
52
|
+
extract you can make a template like this:
|
53
|
+
<html>
|
54
|
+
<body>
|
55
|
+
A bunch of photos:
|
56
|
+
#{Rep:<img src="@url@" alt="@txt@"/>}
|
57
|
+
</body>
|
58
|
+
</html>
|
59
|
+
|
60
|
+
For more details please see the test cases.
|
61
|
+
|
62
|
+
Installation:
|
63
|
+
|
64
|
+
* Gem
|
65
|
+
gem install rednails
|
66
|
+
|
67
|
+
* Manual
|
68
|
+
As root
|
69
|
+
# ruby setup.rb all
|
70
|
+
|
71
|
+
|
72
|
+
Author and Contributions:
|
73
|
+
Zev Blut
|
74
|
+
With some changes and help by Min Lin Hsieh, Daniel DeLorme and
|
75
|
+
Pierre Baumard.
|
76
|
+
|
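The Template Format section of the README can be illustrated with a small, self-contained sketch. This is a simplified illustration and not the RedNails implementation: the helper names `template_to_regexp` and `scrape` are hypothetical, the template is passed as a string rather than a file name, and the whitespace condensing done by the library is left out.

```ruby
# Simplified illustration only; not the RedNails implementation.
# Turns a template containing #{name} markers into a Regexp plus the
# ordered list of variable names found in the template.
def template_to_regexp(template)
  names = []
  pattern = String.new
  rest = template
  # Peel off "literal text, marker, remainder" until no marker is left.
  while (m = /(.*?)\#\{(.*?)\}(.*)/m.match(rest))
    pattern << Regexp.escape(m[1]) << "(.*)"
    names << m[2]
    rest = m[3]
  end
  pattern << Regexp.escape(rest)
  [Regexp.new(pattern, Regexp::MULTILINE | Regexp::IGNORECASE), names]
end

# Hypothetical helper: returns a Hash of variable name => scraped value,
# or nil when the text does not match the template.
def scrape(template, text)
  regexp, names = template_to_regexp(template)
  m = regexp.match(text)
  return nil if m.nil?
  names.zip(m.captures).to_h
end

template = 'Hello my name is #{name}. How are you?'
data     = 'Hello my name is Mr.Bill. How are you?'
result = scrape(template, data)
puts result["name"] # => Mr.Bill
```

The template is single-quoted so that Ruby does not try to interpolate `#{name}` itself.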
data/lib/rednails.rb
ADDED
@@ -0,0 +1,210 @@
|
|
1
|
+
#
|
2
|
+
# Copyright (c) 2006, Ubiquitous Business Technology (http://ubit.com)
|
3
|
+
# All rights reserved.
|
4
|
+
#
|
5
|
+
# Redistribution and use in source and binary forms, with or without
|
6
|
+
# modification, are permitted provided that the following conditions are
|
7
|
+
# met:
|
8
|
+
#
|
9
|
+
#
|
10
|
+
# * Redistributions of source code must retain the above copyright
|
11
|
+
# notice, this list of conditions and the following disclaimer.
|
12
|
+
#
|
13
|
+
# * Redistributions in binary form must reproduce the above
|
14
|
+
# copyright notice, this list of conditions and the following
|
15
|
+
# disclaimer in the documentation and/or other materials provided
|
16
|
+
# with the distribution.
|
17
|
+
#
|
18
|
+
# * Neither the name of Ubit nor the names of its
|
19
|
+
# contributors may be used to endorse or promote products derived
|
20
|
+
# from this software without specific prior written permission.
|
21
|
+
#
|
22
|
+
#
|
23
|
+
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
24
|
+
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
25
|
+
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
26
|
+
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
27
|
+
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
28
|
+
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
|
29
|
+
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
|
30
|
+
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
|
31
|
+
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
32
|
+
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
|
33
|
+
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
34
|
+
#
|
35
|
+
# == Author
|
36
|
+
# Zev Blut (zb@ubit.com)
|
37
|
+
|
38
|
+
require 'net/http'
|
39
|
+
|
40
|
+
class RedNails
|
41
|
+
attr_reader :variables
|
42
|
+
|
43
|
+
# Takes a marked template file and an optional boolean that determines if an
|
44
|
+
# exception should be raised on errors.
|
45
|
+
def initialize(templatefile, raises_exception = false)
|
46
|
+
@templatefile = templatefile
|
47
|
+
@variables = nil
|
48
|
+
@repetitions = Hash.new
|
49
|
+
@regexp = parse_template(templatefile)
|
50
|
+
@excepts = raises_exception
|
51
|
+
end
|
52
|
+
|
53
|
+
# Takes a string of data to scrape.
|
54
|
+
# Returns an array of the values matched for the variables defined in the templatefile
|
55
|
+
def parse(text)
|
56
|
+
#10 seconds timeout
|
57
|
+
#because those huge regular expressions can take a LONG time if there is no match
|
58
|
+
m = nil
|
59
|
+
begin
|
60
|
+
timeout(10) do
|
61
|
+
m = @regexp.match(text)
|
62
|
+
end
|
63
|
+
rescue => err
|
64
|
+
raise "REGEXP TIMEOUT!"
|
65
|
+
end
|
66
|
+
|
67
|
+
if m.nil?
|
68
|
+
if @excepts
|
69
|
+
raise "REGEXP from #{@templatefile} IS::::::::::::::::::::::::\n#{@regexp.source}" +
|
70
|
+
"COULD NOT MATCH PAGE TEXT:::::::::::::::::::::::::::::\n#{text}"
|
71
|
+
end
|
72
|
+
return nil
|
73
|
+
end
|
74
|
+
|
75
|
+
vals = []
|
76
|
+
# 1...m.size excludes m.size, so this visits every capture group (m[0] is the whole match)
|
77
|
+
(1...m.size).each do |i|
|
78
|
+
if @repetitions.key?(i)
|
79
|
+
reg = @repetitions[i][0]
|
80
|
+
vals<< m[i].scan(reg)
|
81
|
+
else
|
82
|
+
vals<< m[i]
|
83
|
+
end
|
84
|
+
end
|
85
|
+
return vals
|
86
|
+
end
|
87
|
+
|
88
|
+
# Takes a string of data to scrape.
|
89
|
+
# Returns a Hash with the template variable names as keys and matching
|
90
|
+
# scraped data as values.
|
91
|
+
def parse_hash(text)
|
92
|
+
vals = parse(text)
|
93
|
+
return nil if vals.nil?
|
94
|
+
hvals = {}
|
95
|
+
# Can probably pass a block and yield instead of this.
|
96
|
+
vals.each_index do |i|
|
97
|
+
if @repetitions.key?(i+1)
|
98
|
+
varnames = @repetitions[i+1][1]
|
99
|
+
k=1
|
100
|
+
vals[i].each do |valcombo|
|
101
|
+
valcombo.each_index do |j|
|
102
|
+
hvals["#{varnames[j]}_#{k}"] = valcombo[j]
|
103
|
+
end
|
104
|
+
k+=1
|
105
|
+
end
|
106
|
+
else
|
107
|
+
hvals[@variables[i]] = vals[i]
|
108
|
+
end
|
109
|
+
end
|
110
|
+
return hvals
|
111
|
+
end
|
112
|
+
|
113
|
+
def print_detailed
|
114
|
+
puts "RedNails Detailed Info"
|
115
|
+
puts "-----------Regular Expression Source--------------"
|
116
|
+
puts @regexp.source
|
117
|
+
puts "--------------------------------------------------"
|
118
|
+
puts "-----------Variables------------------------------"
|
119
|
+
puts @variables.inspect
|
120
|
+
puts "--------------------------------------------------"
|
121
|
+
puts "-----------Repetitions------------------------------"
|
122
|
+
puts @repetitions.inspect
|
123
|
+
puts "--------------------------------------------------"
|
124
|
+
end
|
125
|
+
|
126
|
+
###########################################################################
|
127
|
+
private
|
128
|
+
|
129
|
+
def parse_template(template)
|
130
|
+
templatetext = File.open(template,"r") { |f| f.read }
|
131
|
+
literals = Array.new
|
132
|
+
tail = ""
|
133
|
+
@variables = Array.new
|
134
|
+
|
135
|
+
tmptext = templatetext
|
136
|
+
while m = /(.*?)\#\{(.*?)\}(.*)/mi.match(tmptext)
|
137
|
+
literals << m[1]
|
138
|
+
@variables<< m[2]
|
139
|
+
tail = m[3]
|
140
|
+
tmptext = m[3]
|
141
|
+
reps,repnames = check_for_repetition(m[2])
|
142
|
+
if reps.class() == Regexp
|
143
|
+
@repetitions[@variables.size] = [reps,repnames]
|
144
|
+
end
|
145
|
+
end
|
146
|
+
# push the last matched tail onto the list
|
147
|
+
literals<< tail
|
148
|
+
|
149
|
+
literals = literals.map do |lit|
|
150
|
+
litexp = ""
|
151
|
+
# find all the whitespace items and condense to \s
|
152
|
+
split = lit.squeeze("\s\t\n\r\f").split(/\s/)
|
153
|
+
split.each_index do |i|
|
154
|
+
if split[i] != ""
|
155
|
+
litexp<< Regexp.escape(split[i])
|
156
|
+
# put a regexp for \s if it is not the last lit
|
157
|
+
# because we do not want to eat the spaces in a variable
|
158
|
+
if i != (split.size - 1)
|
159
|
+
litexp<< "\\s*"
|
160
|
+
end
|
161
|
+
else
|
162
|
+
litexp<< "\\s*"
|
163
|
+
end
|
164
|
+
end
|
165
|
+
litexp
|
166
|
+
end
|
167
|
+
|
168
|
+
# generate the regular expression
|
169
|
+
regexp = "\\s*"
|
170
|
+
literals.each_with_index do |val,i|
|
171
|
+
regexp<< literals[i]
|
172
|
+
regexp<< "(.*)" if @variables.size > i
|
173
|
+
end
|
174
|
+
regexp<< "\\s*"
|
175
|
+
|
176
|
+
return Regexp.new(regexp, Regexp::MULTILINE | Regexp::IGNORECASE)
|
177
|
+
end
|
178
|
+
|
179
|
+
def check_for_repetition(var)
|
180
|
+
if m = /Rep:(.*)/mi.match(var)
|
181
|
+
varnames = Array.new
|
182
|
+
vals = m[1].scan(/(.*?)@(.*?)@(\S?)/mi)
|
183
|
+
reg = ""
|
184
|
+
vals.each do |lit,svar,taillit|
|
185
|
+
varnames<< svar
|
186
|
+
litexp = ""
|
187
|
+
# find all the whitespace items and condense to \s
|
188
|
+
split = lit.squeeze("\s\t\n\r\f").split(/\s/)
|
189
|
+
split.each_index do |i|
|
190
|
+
if split[i] != ""
|
191
|
+
litexp<< Regexp.escape(split[i])
|
192
|
+
# put a regexp for \s if it is not the last lit
|
193
|
+
# because we do not want to eat the spaces in a variable
|
194
|
+
if i != (split.size - 1)
|
195
|
+
litexp<< "\\s*"
|
196
|
+
end
|
197
|
+
else
|
198
|
+
litexp<< "\\s*"
|
199
|
+
end
|
200
|
+
end
|
201
|
+
reg<< "#{litexp}(.*?)#{Regexp.escape(taillit)}"
|
202
|
+
end
|
203
|
+
reg<< "\\s*"
|
204
|
+
return Regexp.new(reg, Regexp::MULTILINE | Regexp::IGNORECASE),varnames
|
205
|
+
else
|
206
|
+
return var,nil
|
207
|
+
end
|
208
|
+
end
|
209
|
+
|
210
|
+
end
|
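The `parse` method in the listing above calls a bare `timeout(10)` while the file only requires `net/http`, and its `rescue => err` reports any StandardError as a timeout. A minimal sketch of the same guard, assuming the standard library's `Timeout.timeout` and rescuing only `Timeout::Error`, might look like the following. The helper name `match_with_timeout` is hypothetical and is not part of RedNails; whether a long-running match is actually interrupted may depend on the Ruby implementation.

```ruby
require 'timeout'

# Hypothetical helper, not part of RedNails: run a match with a time limit
# and report only genuine timeouts as timeouts.
def match_with_timeout(regexp, text, seconds = 10)
  Timeout.timeout(seconds) { regexp.match(text) }
rescue Timeout::Error
  raise "REGEXP TIMEOUT after #{seconds}s"
end

m = match_with_timeout(/name is (.*)\./i, "Hello my name is Mr.Bill. How are you?")
puts m[1] # => Mr.Bill
```

Other errors raised inside the block propagate unchanged, so they are not mislabelled as timeouts.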
data/test/reptemplatetest.html
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
2
|
+
<html>
|
3
|
+
<head>
|
4
|
+
<title>Reps</title>
|
5
|
+
</head>
|
6
|
+
|
7
|
+
<body>
|
8
|
+
<h1>Reps</h1>
|
9
|
+
<img src="nug1.jpg" alt="nug1"/>
|
10
|
+
<img src="nug2.jpg" alt="nug2"/>
|
11
|
+
<img src="nug3.jpg" alt="nug3"/>
|
12
|
+
<img src="nug4.jpg" alt="nug4"/>
|
13
|
+
|
14
|
+
|
15
|
+
<IMG
|
16
|
+
src="nug5.jpg" alt="nug5"/>
|
17
|
+
<hr>
|
18
|
+
</body>
|
19
|
+
</html>
|
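The test page above includes an `<IMG` tag in upper case that is split across two lines. The sketch below shows, under simplifying assumptions, how the body of a `#{Rep:...}` marker could be turned into a whitespace-tolerant, case-insensitive regular expression that still matches such a tag. The helpers `literal_to_pattern` and `rep_to_regexp` are hypothetical names and only approximate what `check_for_repetition` does in the library source.

```ruby
# Simplified sketch, not the RedNails source.
# Escape a literal and let any run of whitespace match zero or more
# whitespace characters in the scraped page.
def literal_to_pattern(literal)
  literal.split(/\s+/, -1).map { |part| Regexp.escape(part) }.join("\\s*")
end

# Build a Regexp from the body of a Rep marker, where each @name@ marks a
# value to capture. Returns the Regexp and the variable names in order.
def rep_to_regexp(rep_body)
  names = []
  pattern = String.new
  rep_body.scan(/(.*?)@(.*?)@(\S?)/m) do |lit, name, tail|
    names << name
    pattern << literal_to_pattern(lit) << "(.*?)" << Regexp.escape(tail)
  end
  [Regexp.new(pattern, Regexp::MULTILINE | Regexp::IGNORECASE), names]
end

regexp, names = rep_to_regexp('<img src="@url@" alt="@txt@"/>')
page = %(<img src="nug4.jpg" alt="nug4"/>\n\n<IMG\n  src="nug5.jpg" alt="nug5"/>)
rows = page.scan(regexp)
p names # => ["url", "txt"]
p rows  # => [["nug4.jpg", "nug4"], ["nug5.jpg", "nug5"]]
```

As in the library source, text after the last `@name@` marker (here `/>`) beyond one trailing character is not part of the generated pattern.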
data/test/simpletest.txt
ADDED
data/test/tc_rednails.rb
ADDED
@@ -0,0 +1,50 @@
|
|
1
|
+
# $Id: tc_scrapper.rb 319 2005-12-06 05:51:40Z zb $
|
2
|
+
#
|
3
|
+
# Test cases for the Scrapper
|
4
|
+
#
|
5
|
+
# Copyright (c) 2004-2005 Ubiquitous Business Technology, Inc.
|
6
|
+
#
|
7
|
+
# Authors: Zev Blut
|
8
|
+
|
9
|
+
require 'test/unit'
|
10
|
+
require "rednails"
|
11
|
+
|
12
|
+
RN_PATH = File.dirname(__FILE__)
|
13
|
+
|
14
|
+
class TC_RedNails < Test::Unit::TestCase
|
15
|
+
|
16
|
+
def read_file(file)
|
17
|
+
File.open(file,"r") { |f| f.read }
|
18
|
+
end
|
19
|
+
|
20
|
+
def test_basic_template
|
21
|
+
stemplate = RedNails.new("#{RN_PATH}/templatefile.html")
|
22
|
+
test_text = read_file("#{RN_PATH}/templatefiletest.html")
|
23
|
+
arr = stemplate.parse(test_text)
|
24
|
+
assert_equal(["Header muck meader","Test Nug text","NUG2"],
|
25
|
+
arr,
|
26
|
+
"Extracted variables from template test file is not what was expected")
|
27
|
+
end
|
28
|
+
|
29
|
+
def test_rep_template
|
30
|
+
stemplate = RedNails.new("#{RN_PATH}/reptemplate.html")
|
31
|
+
test_text = read_file("#{RN_PATH}/reptemplatetest.html")
|
32
|
+
arr = stemplate.parse(test_text)
|
33
|
+
assert_equal([[
|
34
|
+
["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"],
|
35
|
+
["nug3.jpg", "nug3"], ["nug4.jpg", "nug4"],
|
36
|
+
["nug5.jpg", "nug5"] ]],
|
37
|
+
arr,
|
38
|
+
"Extracted variables from template test file is not what was expected")
|
39
|
+
|
40
|
+
hash = stemplate.parse_hash(test_text)
|
41
|
+
assert_equal({ "url_1" => "nug1.jpg", "txt_1" => "nug1",
|
42
|
+
"url_2" => "nug2.jpg", "txt_2" => "nug2",
|
43
|
+
"url_3" => "nug3.jpg", "txt_3" => "nug3",
|
44
|
+
"url_4" => "nug4.jpg", "txt_4" => "nug4",
|
45
|
+
"url_5" => "nug5.jpg", "txt_5" => "nug5" },
|
46
|
+
hash,
|
47
|
+
"Extracted variable hash from template test file is not what was expected")
|
48
|
+
end
|
49
|
+
|
50
|
+
end
|
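The second assertion in `test_rep_template` expects keys such as `url_1` and `txt_1`. The sketch below shows how rows of repeated matches can be flattened into that naming scheme. The helper `flatten_repetition` is a hypothetical name used for illustration; it follows the same `name_index` convention as `parse_hash` but is not the library's code.

```ruby
# Hypothetical helper mirroring the key naming asserted in the test above:
# each row of a repetition gets a 1-based index appended to the variable name.
def flatten_repetition(names, rows)
  result = {}
  rows.each_with_index do |row, k|
    names.each_with_index do |name, j|
      result["#{name}_#{k + 1}"] = row[j]
    end
  end
  result
end

rows = [["nug1.jpg", "nug1"], ["nug2.jpg", "nug2"]]
flat = flatten_repetition(%w[url txt], rows)
p flat
# => {"url_1"=>"nug1.jpg", "txt_1"=>"nug1", "url_2"=>"nug2.jpg", "txt_2"=>"nug2"}
```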
metadata
ADDED
@@ -0,0 +1,53 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
rubygems_version: 0.8.11
|
3
|
+
specification_version: 1
|
4
|
+
name: RedNails
|
5
|
+
version: !ruby/object:Gem::Version
|
6
|
+
version: 1.0.0
|
7
|
+
date: 2006-12-13 00:00:00 +09:00
|
8
|
+
summary: A template driven data scraper
|
9
|
+
require_paths:
|
10
|
+
- lib
|
11
|
+
email: zb@ubit.com
|
12
|
+
homepage: http://rednails.rubyforge.org/
|
13
|
+
rubyforge_project: rednails
|
14
|
+
description:
|
15
|
+
autorequire: rednails
|
16
|
+
default_executable:
|
17
|
+
bindir: bin
|
18
|
+
has_rdoc: true
|
19
|
+
required_ruby_version: !ruby/object:Gem::Version::Requirement
|
20
|
+
requirements:
|
21
|
+
- - ">"
|
22
|
+
- !ruby/object:Gem::Version
|
23
|
+
version: 0.0.0
|
24
|
+
version:
|
25
|
+
platform: ruby
|
26
|
+
signing_key:
|
27
|
+
cert_chain:
|
28
|
+
authors:
|
29
|
+
- Zev Blut
|
30
|
+
files:
|
31
|
+
- lib/rednails.rb
|
32
|
+
- test/simpletemplate.txt
|
33
|
+
- test/reptemplate.html
|
34
|
+
- test/reptemplatetest.html
|
35
|
+
- test/simpletest.txt
|
36
|
+
- test/tc_rednails.rb
|
37
|
+
- test/templatefile.html
|
38
|
+
- test/templatefiletest.html
|
39
|
+
- README
|
40
|
+
test_files:
|
41
|
+
- test/tc_rednails.rb
|
42
|
+
rdoc_options: []
|
43
|
+
|
44
|
+
extra_rdoc_files:
|
45
|
+
- README
|
46
|
+
executables: []
|
47
|
+
|
48
|
+
extensions: []
|
49
|
+
|
50
|
+
requirements: []
|
51
|
+
|
52
|
+
dependencies: []
|
53
|
+
|