scrappy 0.1

@@ -0,0 +1,3 @@
+ === 0.1 2010-09-30
+
+ * Initial release
@@ -0,0 +1,19 @@
+ History.txt
+ Manifest.txt
+ README.rdoc
+ Rakefile
+ bin/scrappy
+ kb/elmundo.yarf
+ lib/scrappy.rb
+ lib/scrappy/agent/agent.rb
+ lib/scrappy/agent/blind_agent.rb
+ lib/scrappy/agent/cluster.rb
+ lib/scrappy/agent/extractor.rb
+ lib/scrappy/agent/visual_agent.rb
+ lib/scrappy/proxy.rb
+ lib/scrappy/server.rb
+ lib/scrappy/shell.rb
+ lib/scrappy/support.rb
+ lib/scrappy/webkit/webkit.rb
+ test/test_helper.rb
+ test/test_scrappy.rb
@@ -0,0 +1,176 @@
+ = Scrappy
+
+ * http://github.com/josei/scrappy
+
+ == DESCRIPTION:
+
+ Scrappy is a tool that extracts information from web pages and produces RDF data.
+ It uses the scraping ontology to define the mappings between HTML contents and RDF data.
+
+ The following example mapping extracts all titles from http://www.elmundo.es:
+
+   dc: http://purl.org/dc/elements/1.1/
+   rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
+   sioc: http://rdfs.org/sioc/ns#
+   sc: http://lab.gsi.dit.upm.es/scraping.rdf#
+   *:
+     rdf:type: sc:Fragment
+     sc:selector:
+       *:
+         rdf:type: sc:UriSelector
+         rdf:value: "http://www.elmundo.es/"
+     sc:identifier:
+       *:
+         rdf:type: sc:BaseUriSelector
+     sc:subfragment:
+       *:
+         sc:type: sioc:Post
+         sc:selector:
+           *:
+             rdf:type: sc:CssSelector
+             rdf:value: ".noticia h2, .noticia h3, .noticia h4"
+         sc:identifier:
+           *:
+             rdf:type: sc:CssSelector
+             rdf:value: "a"
+             sc:attribute: "href"
+         sc:subfragment:
+           *:
+             sc:type: rdf:Literal
+             sc:relation: dc:title
+             sc:selector:
+               *:
+                 rdf:type: sc:CssSelector
+                 rdf:value: "a"
+
+ (The mapping above is serialized in YARF, a format supported by the LightRDF gem. Mappings can also be
+ defined in any of the other formats LightRDF supports: RDFXML, JSON and NTriples.)
+
+ == SYNOPSIS:
+
+ A knowledge base of mappings can be defined by storing RDF files inside the ~/.scrappy/kb folder.
+ The command-line tool can then be used to get RDF data from web sites. You can get help on this
+ tool by typing:
+
+   $ scrappy --help
+
+ Scrappy offers many different interfaces to get RDF data from a web page:
+
+ * Command-line interface:
+
+     $ scrappy -g elmundo.es
+
+ * Interactive shell:
+
+     $ scrappy -i
+     Launching Scrappy Shell...
+     $ get elmundo.es
+     dc: http://purl.org/dc/elements/1.1/
+     owl: http://www.w3.org/2002/07/owl#
+     rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
+     sc: http://lab.gsi.dit.upm.es/scraping.rdf#
+     rdfs: http://www.w3.org/2000/01/rdf-schema#
+     http://www.elmundo.es/elmundo/2010/10/05/gentes/1286310993.html:
+       dc:description: "Las vacaciones del n\u00famero uno"
+       dc:title:
+         "Una suite de 5.000 euros para Nadal en Tailandia"
+         "Una suite de 5.000 euros para Nadal"
+       rdf:type: http://rdfs.org/sioc/ns#Post
+       dc:creator: "Fernando Domingo | John Bali (V\u00eddeo)"
+       http://www.daml.org/experiment/ontology/location-ont#location:
+         *:
+           rdf:label: "Bangkok"
+           rdf:type: http://www.daml.org/experiment/ontology/location-ont#Location
+       dc:date: "mi\u00e9rcoles 06/10/2010"
+     ...
+
+     http://www.elmundo.es$
+
+ * Web Service interface:
+
+     $ scrappy -s
+     Launching Scrappy Web Server...
+     ** Starting Mongrel on localhost:3434
+
+   Then point your browser to http://localhost:3434 for additional directions.
+
+ * Web Proxy interface:
+
+     $ scrappy -S
+     Launching Scrappy Web Proxy...
+     ** Starting Mongrel on localhost:3434
+
+   Then configure your browser's HTTP proxy to http://localhost:3434 and browse http://www.elmundo.es
+
+ * Scripting (experimental):
+
+   You can create scripts that retrieve many web pages and run them using scrappy.
+
+     #!/usr/bin/scrappy
+     get elmundo.es
+     get google.com/search?q=testing
+
+   Then you can run your script from the command line like any other executable script.
+
+   We plan to support complex operations such as posting forms, and to define a small language
+   with variables and flow control in order to build web service mashups.
+
+ * Ruby interface:
+
+   You can use Scrappy in a Ruby program by requiring the gem:
+
+     require 'rubygems'
+     require 'scrappy'
+
+     # Parse a knowledge base
+     kb = RDF::Parser.parse(:rdf, File.read("kb.rdf"))
+
+     # Create an agent
+     agent = Scrappy::Agent.create :kb=>kb
+
+     # Get RDF output
+     output = agent.request :get, 'http://www.example.com'
+
+     # Output all titles from the web page
+     titles = output.find(Node('http://www.example.com'), Node('dc:title'), nil)
+     titles.each { |title| puts title }
+
+ == INSTALL:
+
+ Install it like any other gem:
+
+   $ gem install scrappy
+
+ The gem also requires the Raptor library (on Debian systems: sudo aptitude install raptor-utils), which is
+ used for outputting different RDF serialization formats.
+
+ Additionally, some extra libraries are needed for certain features:
+
+ * Visual parsing requires rbwebkitgtk: http://github.com/danlucraft/rbwebkitgtk
+
+ * PNG output of RDF graphs requires Graphviz (on Debian systems: sudo aptitude install graphviz).
+
+ == LICENSE:
+
+ (The MIT License)
+
+ Copyright (c) 2010 José Ignacio Fernández (joseignacio.fernandez <at> gmail.com)
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ 'Software'), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
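The Ruby interface example in README.rdoc queries the output graph with `find(subject, predicate, nil)`, where the trailing `nil` appears to leave the object unconstrained. The sketch below illustrates that kind of triple-pattern lookup in plain Ruby. `Triple`, `TinyGraph` and `match` are hypothetical names invented for this illustration; this is not how LightRDF implements `find`, and the sample data is copied from the interactive shell output shown in the README.

```ruby
# Sketch only: a tiny in-memory triple list, not LightRDF's implementation.
Triple = Struct.new(:subject, :predicate, :object)

class TinyGraph
  def initialize(triples = [])
    @triples = triples
  end

  # Returns every triple that matches the pattern; a nil position matches anything.
  def match(subject, predicate, object)
    @triples.select do |t|
      (subject.nil?   || t.subject   == subject) &&
        (predicate.nil? || t.predicate == predicate) &&
        (object.nil?    || t.object    == object)
    end
  end
end

# Sample data taken from the README's shell output.
DC_TITLE   = 'http://purl.org/dc/elements/1.1/title'
DC_CREATOR = 'http://purl.org/dc/elements/1.1/creator'
POST       = 'http://www.elmundo.es/elmundo/2010/10/05/gentes/1286310993.html'

graph = TinyGraph.new([
  Triple.new(POST, DC_TITLE, 'Una suite de 5.000 euros para Nadal en Tailandia'),
  Triple.new(POST, DC_TITLE, 'Una suite de 5.000 euros para Nadal'),
  Triple.new(POST, DC_CREATOR, "Fernando Domingo | John Bali (V\u00eddeo)")
])

# All titles of the post: subject and predicate fixed, object left open.
titles = graph.match(POST, DC_TITLE, nil).map(&:object)
titles.each { |title| puts title }
```

Leaving a different position as `nil` answers a different question: `graph.match(nil, DC_CREATOR, nil)` would list every resource that has a creator.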
@@ -0,0 +1,20 @@
+ require 'rubygems'
+ gem 'hoe', '>= 2.1.0'
+ require 'hoe'
+ require 'fileutils'
+ require './lib/scrappy'
+
+ Hoe.plugin :newgem
+
+ # Generate all the Rake tasks
+ # Run 'rake -T' to see list of generated tasks (from gem root directory)
+ $hoe = Hoe.spec 'scrappy' do
+   self.developer 'Jose Ignacio', 'joseignacio.fernandez@gmail.com'
+   self.summary = "Web scraper that allows producing RDF data out of plain web pages"
+   self.post_install_message = '**(Optional) Remember to install rbwebkitgtk for visual parsing features**'
+   self.rubyforge_name = self.name
+   self.extra_deps = [['activesupport','>= 2.3.5'], ['markaby', '>= 0.7.1'], ['camping', '= 2.0'], ['nokogiri', '>= 1.4.1'], ['mechanize','>= 1.0.0'], ['lightrdf','>= 0.1']]
+ end
+
+ require 'newgem/tasks'
+ Dir['tasks/**/*.rake'].each { |t| load t }
@@ -0,0 +1,228 @@
+ #!/usr/bin/ruby
+
+ stty_save = `stty -g`.chomp
+ trap('INT') { system('stty', stty_save); Scrappy::App.quit }
+ module Scrappy
+   Root = File.expand_path(File.dirname(File.symlink?(__FILE__) ? File.readlink(__FILE__) : __FILE__) + "/..")
+
+   require 'rubygems'
+   require 'optparse'
+   require 'logger'
+   require 'readline'
+   gem 'camping', '=2.0'
+   require 'camping'
+   require 'camping/server'
+   require 'etc'
+   require "#{Root}/lib/scrappy"
+   require 'scrappy/shell'
+
+   SESSION_TOKEN = rand(100000000)
+   Options = OpenStruct.new
+
+   class App
+     def self.quit
+       puts "\"#{Quotes.sort_by{rand}.first}\"" unless Options.quiet
+       exit
+     end
+
+     def initialize
+       Options.port = 3434
+       Options.concurrence = 10
+       Agent::Options.depth = 1
+
+       OptionParser.new do |opts|
+         opts.on('-V', '--version') { output_version; exit 0 }
+         opts.on('-h', '--help') { output_help; exit 0 }
+         opts.on('-g URL', '--get URL') { |url| Options.url = url; Options.http_method=:get }
+         opts.on('-p URL', '--post URL') { |url| Options.url = url; Options.http_method=:post }
+         opts.on('-i', '--interactive') { Options.shell = true }
+         opts.on('-s', '--server') { Options.server = true }
+         opts.on('-S', '--proxy-server') { Options.proxy = true }
+         opts.on('-P P', '--port P') { |p| Options.port = p.to_i } # keep the port numeric, like the default
+         opts.on('-c C', '--concurrence C') { |c| Options.concurrence = c.to_i }
+         opts.on('-d D', '--delay D') { |d| Agent::Options.delay = d; Options.concurrence = 1 }
+         opts.on('-l L', '--levels L') { |l| Agent::Options.depth = l.to_i }
+         opts.on('-v', '--visual') { Agent::Options.agent = :visual }
+         opts.on('-r', '--reference') { Agent::Options.referenceable = :minimum }
+         opts.on('-R', '--reference-all') { Agent::Options.referenceable = :dump }
+         opts.on('-w', '--window') { Agent::Options.window = true }
+         opts.on('-f FORMAT', '--format FORMAT') { |f| Agent::Options.format = f.to_sym }
+       end.parse!(ARGV)
+       @file = ARGV.shift
+     end
+
+     def run
+       onload
+       if Options.url
+         Options.quiet = true
+         puts Agent.create.proxy(Options.http_method, Options.url) # honour -g/-p instead of always using :get
+       elsif Options.proxy
+         puts "Launching Scrappy Web Proxy..."
+         Camping::Server.new(OpenStruct.new(:host => 'localhost', :port => Options.port, :server=>'mongrel'), ["#{Scrappy::Root}/lib/scrappy/proxy.rb"]).start
+       elsif Options.server
+         puts "Launching Scrappy Web Server..."
+         Camping::Server.new(OpenStruct.new(:host => 'localhost', :port => Options.port, :server=>'mongrel'), ["#{Scrappy::Root}/lib/scrappy/server.rb"]).start
+       elsif Options.shell
+         puts "Launching Scrappy Shell..."
+         Shell.new.run
+       else
+         Options.quiet = true
+         Shell.new(@file).run
+       end
+       Scrappy::App.quit
+     end
+
+     protected
+     def output_help
+       output_version
+       puts """Synopsis
+   Scrappy is a tool to scrape semantic data out of the unstructured web
+
+ Examples
+   This command retrieves the Google home page
+   scrappy -g http://www.google.com
+
+ Usage
+   scrappy [options]
+
+   For help use: scrappy -h
+
+ Options
+   -h, --help                Displays help message
+   -V, --version             Displays the version, then exits
+   -f, --format FORMAT       Picks output format (json, ejson, rdfxml, ntriples, png)
+   -g, --get URL             Gets requested URL
+   -p, --post URL            Posts requested URL
+   -c, --concurrence VALUE   Sets number of concurrent connections for crawling (default is 10)
+   -l, --levels VALUE        Sets recursion levels for resource crawling (default is 1)
+   -d, --delay VALUE         Sets delay (in ms) between requests (default is 0)
+   -i, --interactive         Runs interactive shell
+   -s, --server              Runs web server
+   -S, --proxy-server        Runs web proxy
+   -P, --port PORT           Selects port number (default is 3434)
+   -v, --visual              Uses visual agent (slow)
+   -r, --reference           Outputs referenceable data (requires -v)
+   -R, --reference-all       Outputs all HTML referenceable data (requires -v)
+   -w, --window              Shows browser window (requires -v)
+
+ Authors
+   José Ignacio Fernández, Jacobo Blasco
+
+ Copyright
+   Copyright (c) 2010 José Ignacio Fernández. Licensed under the MIT License:
+   http://www.opensource.org/licenses/mit-license.php"""
+     end
+
+     def output_version
+       puts "Scrappy v#{Scrappy::VERSION}"
+     end
+
+     def onload
+       # Use the user's knowledge base if it exists; otherwise fall back to the one bundled with the gem
+       if File.exist?("#{Etc.getpwuid.dir}/.scrappy/kb")
+         data_folder = "#{Etc.getpwuid.dir}/.scrappy/kb"
+         cache_file = "#{Etc.getpwuid.dir}/.scrappy/kb.cache"
+       else
+         data_folder = "#{Scrappy::Root}/kb"
+         cache_file = "#{Dir.tmpdir}/scrappy.kb.cache"
+       end
+
+       # Load knowledge base; the cache is valid only if it is at least as new as the folder and every file in it
+       Agent::Options.kb = if File.exist?(cache_file) and File.mtime(cache_file) >= Dir["#{data_folder}/*", data_folder].map{ |f| File.mtime(f) }.max
+         # Just load kb from cache
+         open(cache_file) { |f| Marshal.load(f) }
+       else
+         # Parse each file according to its extension (files ending in .ignore are skipped) and cache the merged kb
+         data = Dir["#{data_folder}/*"].inject(RDF::Graph.new) { |graph, file| extension = file.split('.').last.to_sym; graph.merge(extension==:ignore ? RDF::Graph.new : RDF::Parser.parse(extension, File.read(file))) }
+         open(cache_file, "w") { |f| Marshal.dump(data, f) }
+         data
+       end
+
+       # Create cluster of agents
+       Agent.create_cluster Options.concurrence, :referenceable=>Agent::Options.referenceable,
+                            :agent=>Agent::Options.agent, :window=>false
+     end
+   end
+
+   Quotes = """Knowledge talks, wisdom listens
+ Fool me once, shame on you. Fool me twice, shame on me
+ Only the wisest and the stupidest of men never change
+ Don’t let your victories go to your head, or your failures go to your heart
+ Those who criticize our generation forget who raised it
+ Criticizing is easy, art is difficult
+ I don’t know what the key to success is, but the key to failure is trying to please everyone
+ When the character of a man is not clear to you, look at his friends
+ Not to care for philosophy is to be a true philosopher
+ The mind is like a parachute. It doesn’t work unless it’s open
+ The best mind-altering drug is truth
+ Be wiser than other people if you can, but do not tell them so
+ Never forget what a man says to you when he is angry
+ A winner listens, a loser just waits until it is their turn to talk
+ Guns don’t kill people -- people do
+ He who knows others is wise. He who knows himself is enlightened
+ If you are not part of the cure, then you are part of the problem
+ The only time you run out of chances is when you stop taking them
+ The best things in life are not things
+ An investment in knowledge always pays the best interest
+ You can tell more about a person by what he says about others than you can by what others say about him
+ Think like a man of action, and act like a man of thought
+ He who knows others is learned; he who knows himself is wise
+ Going to church doesn’t make you a Christian, any more than standing in your garage makes you a car
+ Never challenge an old man, because if you lose, you’ve lost to an old man, and if you win, so what?
+ Half our life is spent trying to find something to do with the time we have spent most of life trying to save
+ He who indulges in a task without proper knowledge will deteriorate rather than improve the case
+ It is because of its emptiness that the cup is useful
+ When the people of the world all know beauty as beauty, there arises the recognition of ugliness
+ The apprentice who tries to take the carpenter’s place always cuts his hands
+ In the end, we will remember not the words of our enemies, but the silence of our friends
+ A wise man’s actions speak for himself
+ Never wrestle with a pig -- you both get dirty, but the pig likes it
+ 50% of the solution is to put your hands on the problem
+ Never keep your head down, you’re better than many
+ Those who fail to prepare, are preparing to fail
+ The man who smiles when things go wrong has thought of someone to blame it on
+ Time is a great teacher, but unfortunately it kills all its pupils
+ It's true that we don't know what we've got until we lose it, but it's also true that we don't know what we've been missing until it arrives
+ Never take life seriously. Nobody gets out alive anyway
+ The only way to keep your health is to eat what you don't want, drink what you don't like, and do what you'd rather not
+ I am so clever that sometimes I don't understand a single word of what I am saying
+ Dogs have owners, cats have staff
+ I put all my genius into my life; I put only my talent into my works
+ It is better to be beautiful than to be good, but it is better to be good than to be ugly
+ All human beings, by nature, desire to know
+ All life is an experiment
+ An investment in knowledge always pays the best interest
+ An optimist is a person who sees a green light everywhere. The pessimist sees only the red light. But the truly wise person is color blind
+ Chance favors only those who court her
+ Give a man a fish, he'll eat for a day. Teach a man how to fish, he'll eat for a lifetime
+ God helps them that help themselves
+ Great beginnings are not as important as the way one finishes
+ Happiness is not a reward - it is consequence. Suffering is not a punishment - it is a result
+ Don't think much of a man who is not wiser today than he was yesterday
+ Maturity is achieved when a person postpones immediate pleasures for long-term values
+ Men are wise in proportion, not to their experience, but to their capacity for experience
+ Much wisdom often goes with fewer words
+ Never leave that till tomorrow which you can do today
+ Never mistake knowledge for wisdom. One helps you make a living; the other helps you make a life
+ Nothing is a waste of time if you use the experience wisely
+ It requires wisdom to understand wisdom: the music is nothing if the audience is deaf
+ It takes a great deal of living to get a little deal of learning
+ Live as if you were to die tomorrow. Learn as if you were to live forever
+ Unless you try to do something beyond what you have already mastered, you will never grow
+ What you have to do and the way you have to do it is incredibly simple. Whether you are willing to do it is another matter
+ When written in Chinese the word crisis is composed of two characters. One represents danger, and the other represents opportunity
+ Cheer up, the worst is yet to come
+ Common sense ain't common
+ A coward is a hero with a wife, kids, and a mortgage
+ All power corrupts, but we need electricity
+ Do not try to live forever. You will not succeed
+ Pick the flower when it is ready to be picked
+ The greatest risk is the risk of riskless living
+ The man who does things makes many mistakes, but he never makes the biggest mistake of all - doing nothing
+ The man who makes no mistakes does not usually make anything
+ The results you achieve will be in direct proportion to the effort you apply
+ The reward of a thing well done is to have done it
+ Don’t argue with idiots. They will bring you down to their level and beat you with experience""".split("\n")
+ end
+
+ Scrappy::App.new.run
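The `onload` method in bin/scrappy reuses a marshalled cache of the knowledge base only when the cache file is at least as new as the data folder and every file in it; otherwise it re-parses the folder and rewrites the cache. The sketch below isolates that freshness check using only the standard library. `load_with_cache` is a hypothetical helper written for this illustration, and plain strings stand in for the RDF graph that scrappy actually parses and caches.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper (not part of scrappy): returns the cached data when the
# cache file is at least as new as the folder and all of its files; otherwise
# yields the file list to the block, caches the block's result and returns it.
def load_with_cache(data_folder, cache_file)
  newest = Dir["#{data_folder}/*", data_folder].map { |f| File.mtime(f) }.max
  if File.exist?(cache_file) && File.mtime(cache_file) >= newest
    File.open(cache_file, 'rb') { |f| Marshal.load(f) }
  else
    data = yield Dir["#{data_folder}/*"]
    File.open(cache_file, 'wb') { |f| Marshal.dump(data, f) }
    data
  end
end

# Build a throwaway knowledge-base folder with a single file.
dir = Dir.mktmpdir
kb_folder = File.join(dir, 'kb')
FileUtils.mkdir_p(kb_folder)
File.write(File.join(kb_folder, 'a.txt'), 'alpha')
cache = File.join(dir, 'kb.cache')

parses = 0 # counts how often the expensive "parse" step actually runs
loader = lambda do
  load_with_cache(kb_folder, cache) do |files|
    parses += 1
    files.sort.map { |f| File.read(f) }
  end
end

first  = loader.call # no cache yet: reads the folder and writes the cache
second = loader.call # cache is fresh: served from the Marshal dump

puts "parsed #{parses} time(s)"
FileUtils.remove_entry(dir)
```

Including the folder itself in the mtime comparison means that adding or deleting a file also invalidates the cache, since those operations update the directory's modification time.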