scrappy 0.1

@@ -0,0 +1,3 @@
1
+ === 0.1 2010-09-30
2
+
3
+ * Initial release
@@ -0,0 +1,19 @@
1
+ History.txt
2
+ Manifest.txt
3
+ README.rdoc
4
+ Rakefile
5
+ bin/scrappy
6
+ kb/elmundo.yarf
7
+ lib/scrappy.rb
8
+ lib/scrappy/agent/agent.rb
9
+ lib/scrappy/agent/blind_agent.rb
10
+ lib/scrappy/agent/cluster.rb
11
+ lib/scrappy/agent/extractor.rb
12
+ lib/scrappy/agent/visual_agent.rb
13
+ lib/scrappy/proxy.rb
14
+ lib/scrappy/server.rb
15
+ lib/scrappy/shell.rb
16
+ lib/scrappy/support.rb
17
+ lib/scrappy/webkit/webkit.rb
18
+ test/test_helper.rb
19
+ test/test_scrappy.rb
@@ -0,0 +1,176 @@
1
+ = Scrappy
2
+
3
+ * http://github.com/josei/scrappy
4
+
5
+ == DESCRIPTION:
6
+
7
+ Scrappy is a tool that allows extracting information from web pages and producing RDF data.
8
+ It uses the scraping ontology to define the mappings between HTML contents and RDF data.
9
+
10
+ The following example mapping extracts all titles from http://www.elmundo.es:
11
+
12
+ dc: http://purl.org/dc/elements/1.1/
13
+ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
14
+ sioc: http://rdfs.org/sioc/ns#
15
+ sc: http://lab.gsi.dit.upm.es/scraping.rdf#
16
+ *:
17
+ rdf:type: sc:Fragment
18
+ sc:selector:
19
+ *:
20
+ rdf:type: sc:UriSelector
21
+ rdf:value: "http://www.elmundo.es/"
22
+ sc:identifier:
23
+ *:
24
+ rdf:type: sc:BaseUriSelector
25
+ sc:subfragment:
26
+ *:
27
+ sc:type: sioc:Post
28
+ sc:selector:
29
+ *:
30
+ rdf:type: sc:CssSelector
31
+ rdf:value: ".noticia h2, .noticia h3, .noticia h4"
32
+ sc:identifier:
33
+ *:
34
+ rdf:type: sc:CssSelector
35
+ rdf:value: "a"
36
+ sc:attribute: "href"
37
+ sc:subfragment:
38
+ *:
39
+ sc:type: rdf:Literal
40
+ sc:relation: dc:title
41
+ sc:selector:
42
+ *:
43
+ rdf:type: sc:CssSelector
44
+ rdf:value: "a"
45
+
46
+ (The above mapping is serialized in YARF format, which is supported by the LightRDF gem. The RDFXML,
47
+ JSON and NTriples formats are also supported and can be used to define mappings.)
48
+
49
+ == SYNOPSIS:
50
+
51
+ A knowledge base of mappings can be defined by storing RDF files inside the ~/.scrappy/kb folder.
52
+ Then, the command-line tool can be used to get RDF data from web sites. You can get help on this
53
+ tool by typing:
54
+
55
+ $ scrappy --help
56
+
57
+ Scrappy offers many different interfaces to get RDF data from a web page:
58
+
59
+ * Command-line interface:
60
+
61
+ $ scrappy -g elmundo.es
62
+
63
+ * Interactive shell:
64
+
65
+ $ scrappy -i
66
+ Launching Scrappy Shell...
67
+ $ get elmundo.es
68
+ dc: http://purl.org/dc/elements/1.1/
69
+ owl: http://www.w3.org/2002/07/owl#
70
+ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
71
+ sc: http://lab.gsi.dit.upm.es/scraping.rdf#
72
+ rdfs: http://www.w3.org/2000/01/rdf-schema#
73
+ http://www.elmundo.es/elmundo/2010/10/05/gentes/1286310993.html:
74
+ dc:description: "Las vacaciones del n\u00famero uno"
75
+ dc:title:
76
+ "Una suite de 5.000 euros para Nadal en Tailandia"
77
+ "Una suite de 5.000 euros para Nadal"
78
+ rdf:type: http://rdfs.org/sioc/ns#Post
79
+ dc:creator: "Fernando Domingo | John Bali (V\u00eddeo)"
80
+ http://www.daml.org/experiment/ontology/location-ont#location:
81
+ *:
82
+ rdf:label: "Bangkok"
83
+ rdf:type: http://www.daml.org/experiment/ontology/location-ont#Location
84
+ dc:date: "mi\u00e9rcoles 06/10/2010"
85
+ ...
86
+
87
+ http://www.elmundo.es$
88
+
89
+ * Web Service interface:
90
+
91
+ $ scrappy -s
92
+ Launching Scrappy Web Server...
93
+ ** Starting Mongrel on localhost:3434
94
+
95
+ Then point your browser to http://localhost:3434 for additional directions.
96
+
97
+ * Web Proxy interface:
98
+
99
+ $ scrappy -S
100
+ Launching Scrappy Web Proxy...
101
+ ** Starting Mongrel on localhost:3434
102
+
103
+ Then configure your browser's HTTP proxy to http://localhost:3434 and browse http://www.elmundo.es
104
+
105
+ * Scripting (experimental):
106
+
107
+ You can create scripts that retrieve many web pages and run them using scrappy.
108
+
109
+ #!/usr/bin/scrappy
110
+ get elmundo.es
111
+ get google.com/search?q=testing
112
+
113
+ Then you can run your script from the command line just like any other executable script.
114
+
115
+ We plan to enable complex operations such as posting forms and defining a useful language
116
+ with variables to enable flow control in order to build web service mashups.
117
+
118
+ * Ruby interface:
119
+
120
+ You can use Scrappy in a Ruby program by requiring the gem:
121
+
122
+ require 'rubygems'
123
+ require 'scrappy'
124
+
125
+ # Parse a knowledge base
126
+ kb = RDF::Parser.parse(:rdf, open("kb.rdf").read)
127
+
128
+ # Create an agent
129
+ agent = Scrappy::Agent.create :kb=>kb
130
+
131
+ # Get RDF output
132
+ output = agent.request :get, 'http://www.example.com'
133
+
134
+ # Output all titles from the web page
135
+ titles = output.find(Node('http://www.example.com'), Node('dc:title'), nil)
136
+ titles.each { |title| puts title }
137
+
138
+ == INSTALL:
139
+
140
+ Install it as any other gem:
141
+
142
+ $ gem install scrappy
143
+
144
+ The gem also requires the raptor library (on Debian systems: sudo aptitude install raptor-utils), which is used
145
+ for outputting different RDF serialization formats.
146
+
147
+ Additionally, some extra libraries are needed for certain features:
148
+
149
+ * Visual parsing requires rbwebkitgtk: http://github.com/danlucraft/rbwebkitgtk
150
+
151
+ * PNG output of RDF graphs requires Graphviz (on Debian systems: sudo aptitude install graphviz).
152
+
153
+ == LICENSE:
154
+
155
+ (The MIT License)
156
+
157
+ Copyright (c) 2010 José Ignacio Fernández (joseignacio.fernandez <at> gmail.com)
158
+
159
+ Permission is hereby granted, free of charge, to any person obtaining
160
+ a copy of this software and associated documentation files (the
161
+ 'Software'), to deal in the Software without restriction, including
162
+ without limitation the rights to use, copy, modify, merge, publish,
163
+ distribute, sublicense, and/or sell copies of the Software, and to
164
+ permit persons to whom the Software is furnished to do so, subject to
165
+ the following conditions:
166
+
167
+ The above copyright notice and this permission notice shall be
168
+ included in all copies or substantial portions of the Software.
169
+
170
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
171
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
172
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
173
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
174
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
175
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
176
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
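The Ruby interface example in README.rdoc queries the output graph with a (subject, predicate, object) pattern whose last element is nil, apparently acting as a wildcard. The sketch below illustrates that kind of pattern matching over a plain in-memory list of triples. `Triple`, `TinyGraph` and `match` are hypothetical names used only for illustration; they are not part of Scrappy or LightRDF, and the real `find` may behave differently.

```ruby
# Hypothetical stand-in for an RDF graph; NOT LightRDF's implementation.
Triple = Struct.new(:subject, :predicate, :object)

class TinyGraph
  def initialize
    @triples = []
  end

  # Store one (subject, predicate, object) statement.
  def add(subject, predicate, object)
    @triples << Triple.new(subject, predicate, object)
    self
  end

  # Return every triple that agrees with the pattern.
  # A nil in any position matches any value in that position.
  def match(subject, predicate, object)
    pattern = [subject, predicate, object]
    @triples.select do |triple|
      pattern.zip(triple.to_a).all? { |wanted, actual| wanted.nil? || wanted == actual }
    end
  end
end

graph = TinyGraph.new
graph.add('http://www.example.com', 'dc:title', 'First headline')
graph.add('http://www.example.com', 'dc:title', 'Second headline')
graph.add('http://www.example.com', 'dc:creator', 'Someone')

# All titles attached to the page, whatever their value.
titles = graph.match('http://www.example.com', 'dc:title', nil).map(&:object)
titles.each { |title| puts title }
```

Returning whole triples and mapping to the object afterwards keeps the matcher independent of which position was left open.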
@@ -0,0 +1,20 @@
1
+ require 'rubygems'
2
+ gem 'hoe', '>= 2.1.0'
3
+ require 'hoe'
4
+ require 'fileutils'
5
+ require './lib/scrappy'
6
+
7
+ Hoe.plugin :newgem
8
+
9
+ # Generate all the Rake tasks
10
+ # Run 'rake -T' to see list of generated tasks (from gem root directory)
11
+ $hoe = Hoe.spec 'scrappy' do
12
+ self.developer 'Jose Ignacio', 'joseignacio.fernandez@gmail.com'
13
+ self.summary = "Web scraper that allows producing RDF data out of plain web pages"
14
+ self.post_install_message = '**(Optional) Remember to install rbwebkitgtk for visual parsing features**'
15
+ self.rubyforge_name = self.name
16
+ self.extra_deps = [['activesupport','>= 2.3.5'], ['markaby', '>= 0.7.1'], ['camping', '= 2.0'], ['nokogiri', '>= 1.4.1'], ['mechanize','>= 1.0.0'], ['lightrdf','>= 0.1']]
17
+ end
18
+
19
+ require 'newgem/tasks'
20
+ Dir['tasks/**/*.rake'].each { |t| load t }
@@ -0,0 +1,228 @@
1
+ #!/usr/bin/ruby
2
+
3
+ stty_save = `stty -g`.chomp
4
+ trap('INT') { system('stty', stty_save); Scrappy::App.quit }
5
+ module Scrappy
6
+ Root = File.expand_path(File.dirname(File.symlink?(__FILE__) ? File.readlink(__FILE__) : __FILE__) + "/..")
7
+
8
+ require 'rubygems'
9
+ require 'optparse'
10
+ require 'logger'
11
+ require 'readline'
12
+ gem 'camping', '=2.0'
13
+ require 'camping'
14
+ require 'camping/server'
15
+ require 'etc'
16
+ require "#{Root}/lib/scrappy"
17
+ require 'scrappy/shell'
18
+
19
+ SESSION_TOKEN = rand(100000000)
20
+ Options = OpenStruct.new
21
+
22
+ class App
23
+ def self.quit
24
+ puts "\"#{Quotes.sort_by{rand}.first}\"" unless Options.quiet
25
+ exit
26
+ end
27
+
28
+ def initialize
29
+ Options.port = 3434
30
+ Options.concurrence = 10
31
+ Agent::Options.depth = 1
32
+
33
+ OptionParser.new do |opts|
34
+ opts.on('-V', '--version') { output_version; exit 0 }
35
+ opts.on('-h', '--help') { output_help; exit 0 }
36
+ opts.on('-g URL', '--get URL') { |url| Options.url = url; Options.http_method=:get }
37
+ opts.on('-p URL', '--post URL') { |url| Options.url = url; Options.http_method=:post }
38
+ opts.on('-i', '--interactive') { Options.shell = true }
39
+ opts.on('-s', '--server') { Options.server = true }
40
+ opts.on('-S', '--proxy-server') { Options.proxy = true }
41
+ opts.on('-P P', '--port P') { |p| Options.port = p.to_i }
42
+ opts.on('-c C', '--concurrence C') { |c| Options.concurrence = c.to_i }
43
+ opts.on('-d D', '--delay D') { |d| Agent::Options.delay = d; Options.concurrence = 1 }
44
+ opts.on('-l L', '--levels L') { |l| Agent::Options.depth = l.to_i }
45
+ opts.on('-v', '--visual') { Agent::Options.agent = :visual }
46
+ opts.on('-r', '--reference') { Agent::Options.referenceable = :minimum }
47
+ opts.on('-R', '--reference-all') { Agent::Options.referenceable = :dump }
48
+ opts.on('-w', '--window') { Agent::Options.window = true }
49
+ opts.on('-f FORMAT', '--format FORMAT') { |f| Agent::Options.format = f.to_sym }
50
+ end.parse!(ARGV)
51
+ @file = ARGV.shift
52
+ end
53
+
54
+ def run
55
+ onload
56
+ if Options.url
57
+ Options.quiet = true
58
+ puts Agent.create.proxy(Options.http_method, Options.url)
59
+ elsif Options.proxy
60
+ puts "Launching Scrappy Web Proxy..."
61
+ Camping::Server.new(OpenStruct.new(:host => 'localhost', :port => Options.port, :server=>'mongrel'), ["#{Scrappy::Root}/lib/scrappy/proxy.rb"]).start
62
+ elsif Options.server
63
+ puts "Launching Scrappy Web Server..."
64
+ Camping::Server.new(OpenStruct.new(:host => 'localhost', :port => Options.port, :server=>'mongrel'), ["#{Scrappy::Root}/lib/scrappy/server.rb"]).start
65
+ elsif Options.shell
66
+ puts "Launching Scrappy Shell..."
67
+ Shell.new.run
68
+ else
69
+ Options.quiet = true
70
+ Shell.new(@file).run
71
+ end
72
+ Scrappy::App.quit
73
+ end
74
+
75
+ protected
76
+ def output_help
77
+ output_version
78
+ puts """Synopsis
79
+ Scrappy is a tool to scrape semantic data out of the unstructured web
80
+
81
+ Examples
82
+ This command retrieves the Google web page
83
+ scrappy -g http://www.google.com
84
+
85
+ Usage
86
+ scrappy [options]
87
+
88
+ For help use: scrappy -h
89
+
90
+ Options
91
+ -h, --help Displays help message
92
+ -V, --version Display the version, then exit
93
+ -f, --format FORMAT Picks output format (json, ejson, rdfxml, ntriples, png)
94
+ -g, --get URL Gets requested URL
95
+ -p, --post URL Posts requested URL
96
+ -c, --concurrence VALUE Sets number of concurrent connections for crawling (default is 10)
97
+ -l, --levels VALUE Sets recursion levels for resource crawling (default is 1)
98
+ -d, --delay VALUE Sets delay (in ms) between requests (default is 0)
99
+ -i, --interactive Runs interactive shell
100
+ -s, --server Runs web server
101
+ -S, --proxy-server Runs web proxy
102
+ -P, --port PORT Selects port number (default is 3434)
103
+ -v, --visual Uses visual agent (slow)
104
+ -r, --reference Outputs referenceable data (requires -v)
105
+ -R, --reference-all Outputs all HTML referenceable data (requires -v)
106
+ -w, --window Shows browser window (requires -v)
107
+
108
+ Authors
109
+ José Ignacio Fernández, Jacobo Blasco
110
+
111
+ Copyright
112
+ Copyright (c) 2010 José Ignacio Fernández. Licensed under the MIT License:
113
+ http://www.opensource.org/licenses/mit-license.php"""
114
+ end
115
+
116
+ def output_version
117
+ puts "Scrappy v#{Scrappy::VERSION}"
118
+ end
119
+
120
+ def onload
121
+ # Check local or global knowledge base
122
+ if File.exist?("#{Etc.getpwuid.dir}/.scrappy/kb")
123
+ data_folder = "#{Etc.getpwuid.dir}/.scrappy/kb"
124
+ cache_file = "#{Etc.getpwuid.dir}/.scrappy/kb.cache"
125
+ else
126
+ data_folder = "#{Scrappy::Root}/kb"
127
+ cache_file = "#{Dir.tmpdir}/scrappy.kb.cache"
128
+ end
129
+
130
+ # Load knowledge base
131
+ Agent::Options.kb = if File.exist?(cache_file) and File.mtime(cache_file) >= Dir["#{data_folder}/*", data_folder].map{ |f| File.mtime(f) }.max
132
+ # Just load kb from cache
133
+ open(cache_file) { |f| Marshal.load(f) }
134
+ else
135
+ # Load YARF files and cache kb
136
+ data = Dir["#{data_folder}/*"].inject(RDF::Graph.new) { |graph, file| extension = file.split('.').last.to_sym; graph.merge(extension==:ignore ? RDF::Graph.new : RDF::Parser.parse(extension, open(file).read)) }
137
+ open(cache_file, "w") { |f| Marshal.dump(data, f) }
138
+ data
139
+ end
140
+
141
+ # Create cluster of agents
142
+ Agent.create_cluster Options.concurrence, :referenceable=>Agent::Options.referenceable,
143
+ :agent=>Agent::Options.agent, :window=>false
144
+ end
145
+ end
146
+
147
+ Quotes = """Knowledge talks, wisdom listens
148
+ Fool me once, shame on you. Fool me twice, shame on me
149
+ Only the wisest and the stupidest of men never change
150
+ Don’t let your victories go to your head, or your failures go to your heart
151
+ Those who criticize our generation forget who raised it
152
+ Criticizing is easy, art is difficult
153
+ I don’t know what the key to success is, but the key to failure is trying to please everyone
154
+ When the character of a man is not clear to you, look at his friends
155
+ Not to care for philosophy is to be a true philosopher
156
+ The mind is like a parachute. It doesn’t work unless it’s open
157
+ The best mind-altering drug is truth
158
+ Be wiser than other people if you can, but do not tell them so
159
+ Never forget what a man says to you when he is angry
160
+ A winner listens, a loser just waits until it is their turn to talk
161
+ Guns don’t kill people — people do
162
+ He who knows others is wise. He who knows himself is enlightened
163
+ If you are not part of the cure, then you are part of the problem
164
+ The only time you run out of chances is when you stop taking them
165
+ The best things in life are not things
166
+ An investment in knowledge always pays the best interest
167
+ You can tell more about a person by what he says about others than you can by what others say about him
168
+ Think like a man of action, and act like a man of thought
169
+ He who knows others is learned; he who knows himself is wise
170
+ Going to church doesn’t make you a Christian, any more than standing in your garage makes you a car
171
+ Never challenge an old man, because if you lose, you’ve lost to an old man, and if you win, so what?
172
+ Half our life is spent trying to find something to do with the time we have spent most of life trying to save
173
+ He who indulges in a task without proper knowledge will deteriorate rather than improve the case
174
+ It is because of its emptiness that the cup is useful
175
+ When the people of the world all know beauty as beauty, there arises the recognition of ugliness
176
+ The apprentice who tries to take the carpenter’s place always cuts his hands
177
+ In the end, we will remember not the words of our enemies, but the silence of our friends
178
+ A wise man’s actions speak for himself
179
+ Never wrestle with a pig -- you both get dirty, but the pig likes it
180
+ 50% of the solution is to put your hands on the problem
181
+ Never keep your head down, you’re better than many
182
+ Those who fail to prepare, are preparing to fail
183
+ The man who smiles when things go wrong has thought of someone to blame it on
184
+ Time is a great teacher, but unfortunately it kills all its pupils
185
+ It's true that we don't know what we've got until we lose it, but it's also true that we don't know what we've been missing until it arrives
186
+ Never take life seriously. Nobody gets out alive anyway
187
+ The only way to keep your health is to eat what you don't want, drink what you don't like, and do what you'd rather not
188
+ I am so clever that sometimes I don't understand a single word of what I am saying
189
+ Dogs have owners, cats have staff
190
+ I put all my genius into my life; I put only my talent into my works
191
+ It is better to be beautiful than to be good, but it is better to be good than to be ugly
192
+ All human beings, by nature, desire to know
193
+ All life is an experiment
194
+ An investment in knowledge always pays the best interest
195
+ An optimist is a person who sees a green light everywhere. The pessimist sees only the red light. But the truly wise person is color blind
196
+ Chance favors only those who court her
197
+ Give a man a fish, he'll eat for a day. Teach a man how to fish, he'll eat for a lifetime
198
+ God helps them that help themselves
199
+ Great beginnings are not as important as the way one finishes
200
+ Happiness is not a reward - it is consequence. Suffering is not a punishment - it is a result
201
+ Don't think much of a man who is not wiser today than he was yesterday
202
+ Maturity is achieved when a person postpones immediate pleasures for long-term values
203
+ Men are wise in proportion, not to their experience, but to their capacity for experience
204
+ Much wisdom often goes with fewer words
205
+ Never leave that till tomorrow which you can do today
206
+ Never mistake knowledge for wisdom. One helps you make a living; the other helps you make a life
207
+ Nothing is a waste of time if you use the experience wisely
208
+ It requires wisdom to understand wisdom: the music is nothing if the audience is deaf
209
+ It takes a great deal of living to get a little deal of learning
210
+ Live as if you were to die tomorrow. Learn as if you were to live forever
211
+ Unless you try to do something beyond what you have already mastered, you will never grow
212
+ What you have to do and the way you have to do it is incredibly simple. Whether you are willing to do it is another matter
213
+ When written in Chinese the word crisis is composed of two characters. One represents danger, and the other represents opportunity
214
+ Cheer up, the worst is yet to come
215
+ Common sense ain't common
216
+ A coward is a hero with a wife, kids, and a mortgage
217
+ All power corrupts, but we need electricity
218
+ Do not try to live forever. You will not succeed
219
+ Pick the flower when it is ready to be picked
220
+ The greatest risk is the risk of riskless living
221
+ The man who does things makes many mistakes, but he never makes the biggest mistake of all - doing nothing
222
+ The man who makes no mistakes does not usually make anything
223
+ The results you achieve will be in direct proportion to the effort you apply
224
+ The reward of a thing well done is to have done it
225
+ Don’t argue with idiots. They will bring you down to their level and beat you with experience""".split("\n")
226
+ end
227
+
228
+ Scrappy::App.new.run
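The `onload` method in bin/scrappy reuses a marshalled knowledge base when the cache file is at least as new as everything in the data folder, and rebuilds it otherwise. The standalone sketch below shows that freshness check using only the standard library. The helper names `cache_fresh?` and `load_with_cache`, the temporary directory layout and the placeholder file content are assumptions made for illustration, and a list of file names stands in for the parsed RDF graph.

```ruby
require 'tmpdir'
require 'fileutils'

# True when cache_file exists and is at least as new as the newest entry
# in data_folder. The folder itself is included in the comparison, so
# adding or deleting a file also invalidates the cache.
def cache_fresh?(cache_file, data_folder)
  return false unless File.exist?(cache_file)
  newest = Dir["#{data_folder}/*", data_folder].map { |f| File.mtime(f) }.max
  File.mtime(cache_file) >= newest
end

# Load the cached value if it is fresh; otherwise build it with the given
# block, write it to the cache and return it.
def load_with_cache(cache_file, data_folder)
  if cache_fresh?(cache_file, data_folder)
    File.open(cache_file, 'rb') { |f| Marshal.load(f) }
  else
    data = yield
    File.open(cache_file, 'wb') { |f| Marshal.dump(data, f) }
    data
  end
end

# Demo in a throwaway directory (hypothetical layout).
dir = Dir.mktmpdir
kb_dir = File.join(dir, 'kb')
FileUtils.mkdir_p(kb_dir)
File.write(File.join(kb_dir, 'example.yarf'), "placeholder mapping\n")
cache_file = File.join(dir, 'kb.cache')

builds = 0
build = lambda do
  builds += 1
  # Stand-in for parsing every file into one graph.
  Dir["#{kb_dir}/*"].map { |f| File.basename(f) }
end

first = load_with_cache(cache_file, kb_dir, &build)
second = load_with_cache(cache_file, kb_dir, &build)
puts "rebuilt #{builds} time(s); cached value: #{second.inspect}"

FileUtils.remove_entry(dir)
```

The second call should be served from the cache, because nothing in the data folder changed after the cache file was written.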