ghtorrent 0.2

data/README.md ADDED
@@ -0,0 +1,132 @@
1
+ github-mirror: Mirror and process the Github event stream
2
+ =========================================================
3
+
4
+ A collection of scripts used to mirror the Github event stream, for
5
+ research purposes. The scripts are distributed as a Gem (`ghtorrent`),
6
+ but they can also be run by checking out this repository.
7
+
8
+ GHTorrent relies on the following software to work:
9
+
10
+ * MongoDB > 2.0
11
+ * RabbitMQ >= 2.7
12
+ * An SQL database compatible with [Sequel](http://sequel.rubyforge.org/rdoc/files/doc/opening_databases_rdoc.html). GHTorrent is tested with SQLite and MySQL,
13
+ so your mileage may vary if you are using other databases.
14
+
15
+ GHTorrent is written in Ruby (tested with 1.8 and JRuby). To install
16
+ it as a Gem do:
17
+
18
+ <code>
19
+ sudo gem install ghtorrent
20
+ </code>
21
+
22
+ #### Configuring
23
+
24
+ Copy the contents of the
25
+ [config.yaml.tmpl](https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl)
26
+ file to a file in your home directory. All provided scripts accept the `-c`
27
+ option, which you can use to pass the location of the configuration file as
28
+ a parameter.
29
+
30
+ Edit the MongoDB and AMQP
31
+ configuration options accordingly. The scripts require accounts with permissions
32
+ to create queues and exchanges in the AMQP queue, collections
33
+ in MongoDB and tables in the selected SQL database, respectively.
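
As a rough illustration of the configuration step, the sketch below parses a configuration text and checks the AMQP section before anything tries to connect. The `amqp` key names (`host`, `port`, `username`, `password`, `exchange`) mirror how `bin/ght-load` reads its settings; the sample values, the `SAMPLE_CONFIG` constant and the `load_amqp_settings` helper are assumptions made for this example and are not part of GHTorrent. The real `config.yaml.tmpl` may contain more sections or use a different layout.

```ruby
require 'yaml'

# Hypothetical configuration text. Only the key names under `amqp` come from
# the way bin/ght-load reads its settings; every value is a placeholder.
SAMPLE_CONFIG = <<~YAML
  amqp:
    host: localhost
    port: 5672
    username: github
    password: github
    exchange: github
YAML

REQUIRED_AMQP_KEYS = %w[host port username password exchange].freeze

# Parse the YAML text and fail early if the AMQP section is incomplete.
def load_amqp_settings(yaml_text)
  settings = YAML.safe_load(yaml_text) || {}
  amqp = settings.fetch('amqp') { raise ArgumentError, 'missing amqp section' }
  missing = REQUIRED_AMQP_KEYS - amqp.keys
  raise ArgumentError, "missing amqp keys: #{missing.join(', ')}" unless missing.empty?
  amqp
end

amqp = load_amqp_settings(SAMPLE_CONFIG)
puts "Would connect to #{amqp['host']}:#{amqp['port']} as #{amqp['username']}"
```

Checking the keys up front turns a half-filled configuration file into one clear error message instead of a connection failure later on.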
34
+
35
+ To prepare MongoDB:
36
+
37
+ <pre>
38
+ $ mongo admin
39
+ > db.addUser('github', 'github')
40
+ > use github
41
+ > db.addUser('github', 'github')
42
+ </pre>
43
+
44
+ To prepare RabbitMQ:
45
+
46
+ <pre>
47
+ $ rabbitmqctl add_user github github
48
+ $ rabbitmqctl set_permissions -p / github ".*" ".*" ".*"
49
+
50
+ # The following will enable the RabbitMQ web admin for the github user
51
+ # Not necessary to have, but good to debug and diagnose problems
52
+ $ rabbitmq-plugins enable rabbitmq_management
53
+ $ rabbitmqctl set_user_tags github administrator
54
+ </pre>
55
+
56
+ To prepare MySQL:
57
+
58
+ <pre>
59
+ $ mysql -u root -p
60
+ mysql> create user 'github'@'localhost' identified by 'github';
61
+ mysql> create database github;
62
+ mysql> GRANT ALL PRIVILEGES ON github.* to github@'localhost';
63
+ mysql> flush privileges;
64
+ </pre>
65
+
66
+ You can find more information on how to set up a cluster of machines
67
+ to retrieve data in parallel on the [Wiki](https://github.com/gousiosg/github-mirror/wiki/Setting-up-a-mirroring-cluster).
68
+
69
+ #### Running
70
+
71
+ To retrieve data with GHTorrent:
72
+
73
+ * `ght-mirror-events.rb` periodically polls Github's event
74
+ queue (`https://api.github.com/events`), stores all new events in the `events`
75
+ collection in MongoDB and posts them to the `github` exchange in RabbitMQ.
76
+
77
+ * `ght-data_retrieval.rb` creates queues that route posted events to processor
78
+ functions, which in turn use the appropriate Github API call to retrieve the
79
+ linked contents, extract metadata to store in the SQL database and store the
80
+ retrieved data in the appropriate collection in Mongo, to avoid further
81
+ API calls. Data in the SQL database contain pointers (the MongoDB key)
82
+ to the "raw" data in MongoDB.
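
The routing described in the two items above can be sketched without a broker: each event type maps to a routing key of the form `evt.<EventType>`, and the consumer calls the method named after that type. The `EventRouter` class, its `seen` log and the sample JSON bodies are hypothetical stand-ins for illustration; the real script binds one RabbitMQ queue per handler and calls the GitHub API inside each handler.

```ruby
require 'json'

# Stand-in for the retrieval script: one method per event type, named after it.
class EventRouter
  HANDLERS = %w[PushEvent WatchEvent FollowEvent].freeze

  attr_reader :seen

  def initialize
    @seen = []
  end

  def PushEvent(evt)
    @seen << [:push, evt['payload']['commits'].length]
  end

  def WatchEvent(evt)
    @seen << [:watch, evt['actor']['login']]
  end

  def FollowEvent(evt)
    @seen << [:follow, evt['payload']['target']['login']]
  end

  # Routing keys have the form "evt.<EventType>"; unknown types are ignored.
  def dispatch(routing_key, body)
    type = routing_key.sub(/\Aevt\./, '')
    return false unless HANDLERS.include?(type)

    public_send(type, JSON.parse(body))
    true
  end
end

router = EventRouter.new
router.dispatch('evt.WatchEvent', '{"actor": {"login": "alice"}}')
router.dispatch('evt.IssuesEvent', '{}') # no handler, so nothing is recorded
p router.seen # prints [[:watch, "alice"]]
```

Checking the type against a fixed list before calling `public_send` keeps a malformed routing key from invoking an arbitrary method.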
83
+
84
+ Both scripts can be run concurrently on more than one host, for resilience and
85
+ performance reasons. To catch up with Github's event stream, it is enough to
86
+ run `ght-mirror-events.rb` on one host. To collect all data pointed to by each event,
87
+ one instance of `ght-data_retrieval.rb` is not enough. Both scripts employ
88
+ throttling mechanisms to keep API usage within the limits imposed by Github
89
+ (currently 5000 reqs/hr).
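
To make the limit concrete: 5000 requests per hour leaves 3600 / 5000 = 0.72 seconds per request on average. The sketch below spaces calls by at least that interval. It is only one possible throttling scheme and is not taken from the GHTorrent sources; the `Throttle` class and its `clock` and `sleeper` parameters are hypothetical names, injected so the behaviour can be checked without actually sleeping.

```ruby
# Spaces calls so that at most `limit` of them happen per `period` seconds.
class Throttle
  attr_reader :min_interval

  def initialize(limit: 5000, period: 3600,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(secs) { sleep(secs) })
    @min_interval = period.to_f / limit # 3600 / 5000 = 0.72 seconds
    @clock = clock
    @sleeper = sleeper
    @last = nil
  end

  # Waits until the minimum interval since the previous call has passed,
  # then runs the block and returns its value.
  def call
    now = @clock.call
    if @last
      wait = @min_interval - (now - @last)
      if wait > 0
        @sleeper.call(wait)
        now = @clock.call
      end
    end
    @last = now
    yield
  end
end

throttle = Throttle.new
throttle.call { puts "request allowed, next one in #{throttle.min_interval}s at the earliest" }
```

Even spacing is the simplest policy. A scheme that allows bursts and then waits for the hourly window to reset would also stay within the limit, at the cost of more bookkeeping.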
90
+
91
+ #### Data
92
+
93
+ You can find torrents for retrieving data on the
94
+ [Available Torrents](https://github.com/gousiosg/github-mirror/wiki/Available-Torrents) page. You need two sets of data:
95
+
96
+ * Raw events: Github's [event stream](https://api.github.com/events). These
97
+ are the roots for mirroring operations. The `ght-data-retrieval` crawler starts
98
+ from an event and goes deep into the rabbit hole.
99
+ * SQL dumps+Linked data: Data dumps from the SQL database and the corresponding
100
+ MongoDB entities.
101
+
102
+
103
+ *At the moment, GHTorrent is in the process of redesigning its data storage
104
+ schema. Consequently, it does not distribute SQL dumps or linked raw data.
105
+ The distribution service will come back shortly.*
106
+
107
+ #### Reporting bugs
108
+
109
+ Please use the [Issue
110
+ Tracker](https://github.com/gousiosg/github-mirror/issues) for reporting bugs
111
+ and feature requests.
112
+
113
+ Patches, bug fixes, etc. are welcome. Please fork the repository and create
114
+ a pull request when done fixing/implementing the new feature.
115
+
116
+ #### Citation information
117
+
118
+ If you find GHTorrent and the accompanying datasets useful in your research,
119
+ please consider citing the following paper:
120
+
121
+ > Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in _MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories_, June 2–3, 2012. Zurich, Switzerland.
122
+
123
+ #### Authors
124
+
125
+ Georgios Gousios <gousiosg@gmail.com>
126
+
127
+ Diomidis Spinellis
128
+
129
+ #### License
130
+
131
+ [2-clause BSD](http://www.opensource.org/licenses/bsd-license.php)
132
+
data/Rakefile ADDED
@@ -0,0 +1,20 @@
1
+ require 'rake'
2
+ require 'rake/testtask'
3
+ require 'rake/rdoctask'
4
+
5
+ task :default => [:test, :rdoc]
6
+
7
+ desc "Run basic tests"
8
+ Rake::TestTask.new(:test) do |t|
9
+ t.pattern = 'test/*_test.rb'
10
+ t.verbose = true
11
+ t.warning = true
12
+ end
13
+
14
+ desc "Run Rdoc"
15
+ Rake::RDocTask.new(:rdoc) do |rd|
16
+ # rd.main = "README.doc"
17
+ rd.rdoc_files.include("lib/**/*.rb")
18
+ rd.options << "-d"
19
+ rd.options << "-x migrations"
20
+ end
@@ -0,0 +1,119 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # Copyright 2012 Georgios Gousios <gousiosg@gmail.com>
4
+ #
5
+ # Redistribution and use in source and binary forms, with or
6
+ # without modification, are permitted provided that the following
7
+ # conditions are met:
8
+ #
9
+ # 1. Redistributions of source code must retain the above
10
+ # copyright notice, this list of conditions and the following
11
+ # disclaimer.
12
+ #
13
+ # 2. Redistributions in binary form must reproduce the above
14
+ # copyright notice, this list of conditions and the following
15
+ # disclaimer in the documentation and/or other materials
16
+ # provided with the distribution.
17
+ #
18
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
19
+ # AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
20
+ # TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
21
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
22
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
23
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
24
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
25
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
26
+ # AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
27
+ # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
28
+ # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
29
+ # POSSIBILITY OF SUCH DAMAGE.
30
+
31
+ require 'rubygems'
32
+ require 'amqp'
33
+ require 'json'
34
+ require 'ghtorrent'
35
+ require 'pp'
36
+
37
+ class GHTDataRetrieval < GHTorrent::Command
38
+
39
+ include GHTorrent::Settings
40
+
41
+ attr_reader :settings
42
+
43
+ def parse(msg)
44
+ JSON.parse(msg)
45
+ end
46
+
47
+ def PushEvent(evt)
48
+ data = parse evt
49
+ data['payload']['commits'].each do |c|
50
+ url = c['url'].split(/\//)
51
+ @gh.get_commit url[4], url[5], url[7]
52
+ end
53
+ end
54
+
55
+ def WatchEvent(evt)
56
+ data = parse evt
57
+ user = data['actor']['login']
58
+ #@gh.get_watched user, evt
59
+ end
60
+
61
+ def FollowEvent(evt)
62
+ data = parse evt
63
+ user = data['actor']['login']
64
+ #@gh.get_followed user
65
+
66
+ followed = data['payload']['target']['login']
67
+ #@gh.get_followers followed
68
+ end
69
+
70
+ def handlers
71
+ %w(PushEvent WatchEvent FollowEvent)
72
+ end
73
+
74
+ def go
75
+ @gh = GHTorrent::Mirror.new(options[:config])
76
+ @settings = @gh.settings
77
+
78
+ # Graceful exit
79
+ Signal.trap('INT') { AMQP.stop { EM.stop } }
80
+ Signal.trap('TERM') { AMQP.stop { EM.stop } }
81
+
82
+ AMQP.start(:host => config(:amqp_host),
83
+ :port => config(:amqp_port),
84
+ :username => config(:amqp_username),
85
+ :password => config(:amqp_password)) do |connection|
86
+
87
+ channel = AMQP::Channel.new(connection, :prefetch => 5)
88
+ exchange = channel.topic(config(:amqp_exchange), :durable => true,
89
+ :auto_delete => false)
90
+
91
+ handlers.each { |h|
92
+ queue = channel.queue("#{h}s", {:durable => true})\
93
+ .bind(exchange, :routing_key => "evt.#{h}")
94
+
95
+ puts "Binding handler #{h} to routing key evt.#{h}"
96
+
97
+ queue.subscribe(:ack => true) do |headers, msg|
98
+ begin
99
+ send(h, msg)
100
+ headers.ack
101
+ rescue Exception => e
102
+ # Give a message a chance to be reprocessed
103
+ if headers.redelivered?
104
+ headers.reject(:requeue => false)
105
+ else
106
+ headers.reject(:requeue => true)
107
+ end
108
+
109
+ #pp JSON.parse(msg)
110
+ STDERR.puts e
111
+ STDERR.puts e.backtrace.join("\n")
112
+ end
113
+ end
114
+ }
115
+ end
116
+ end
117
+ end
118
+
119
+ GHTDataRetrieval.run
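
The `PushEvent` handler above picks elements 4, 5 and 7 out of the split commit URL. The worked example below shows why those indices yield the owner, the repository and the commit SHA. It assumes commit URLs of the form `https://api.github.com/repos/<owner>/<repo>/commits/<sha>`; the sample URL (including the `abc123` SHA) and the `commit_coordinates` helper are made up for illustration.

```ruby
# Splits a commit URL and returns [owner, repo, sha].
# Assumed shape: https://api.github.com/repos/<owner>/<repo>/commits/<sha>
def commit_coordinates(url)
  parts = url.split('/')
  # parts[0] = "https:", parts[1] = "" (between the two slashes),
  # parts[2] = host, parts[3] = "repos", parts[6] = "commits"
  unless parts[3] == 'repos' && parts[6] == 'commits' && parts[7]
    raise ArgumentError, "unexpected commit URL: #{url}"
  end

  [parts[4], parts[5], parts[7]]
end

owner, repo, sha =
  commit_coordinates('https://api.github.com/repos/gousiosg/github-mirror/commits/abc123')
puts "owner=#{owner} repo=#{repo} sha=#{sha}"
```

Validating the fixed segments (`repos`, `commits`) before indexing makes a change in URL layout fail loudly rather than pass the wrong values on to `get_commit`.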
data/bin/ght-load ADDED
@@ -0,0 +1,242 @@
1
+ #!/usr/bin/env ruby
2
+ #
3
+ # Loads items from Mongo to the queue for further processing
4
+ #
5
+ #
6
+ # Copyright 2012 Georgios Gousios <gousiosg@gmail.com>
7
+ #
8
+ # Redistribution and use in source and binary forms, with or
9
+ # without modification, are permitted provided that the following
10
+ # conditions are met:
11
+ #
12
+ # 1. Redistributions of source code must retain the above
13
+ # copyright notice, this list of conditions and the following
14
+ # disclaimer.
15
+ #
16
+ # 2. Redistributions in binary form must reproduce the above
17
+ # copyright notice, this list of conditions and the following
18
+ # disclaimer in the documentation and/or other materials
19
+ # provided with the distribution.
20
+ #
21
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
22
+ # AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
23
+ # TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
24
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
25
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
26
+ # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
27
+ # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
28
+ # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
29
+ # AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
30
+ # LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
31
+ # ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
32
+ # POSSIBILITY OF SUCH DAMAGE.
33
+
34
+ require 'rubygems'
35
+ require 'ghtorrent-old'
36
+ require 'mongo'
37
+ require 'amqp'
38
+ require 'set'
39
+ require 'eventmachine'
40
+ require 'optparse'
41
+ require 'ostruct'
42
+ require 'pp'
43
+ require "amqp/extensions/rabbitmq"
44
+
45
+ class GHTLoad < GHTorrent::Command
46
+
47
+ def col_info()
48
+ {
49
+ :commits => {
50
+ :name => "commits",
51
+ :payload => "commit.id",
52
+ :unq => "commit.id",
53
+ :col => GH.commits_col,
54
+ :routekey => "commit.%s"
55
+ },
56
+ :events => {
57
+ :name => "events",
58
+ :payload => "",
59
+ :unq => "type",
60
+ :col => GH.events_col,
61
+ :routekey => "evt.%s"
62
+ }
63
+ }
64
+ end
65
+
66
+ def prepare_options(options)
67
+ options.banner <<-BANNER
68
+ Loads object ids from a collection to a queue for further processing.
69
+
70
+ #{command_name} [options] collection
71
+
72
+ #{command_name} options:
73
+ BANNER
74
+
75
+ options.opt :earliest, 'Seconds since epoch of earliest item to load',
76
+ :short => 'e', :default => 0, :type => :int
77
+ options.opt :filter,
78
+ 'Filter items by regexp on item attributes: item.attr=regexp',
79
+ :short => 'f', :type => String, :multi => true
80
+ end
81
+
82
+ def validate
83
+ super
84
+ Trollop::die "no collection specified" unless args[0] && !args[0].empty?
85
+ filter = options[:filter]
86
+ case
87
+ when filter.is_a?(Array)
88
+ options[:filter].each { |x|
89
+ Trollop::die "not a valid filter #{x}" unless is_filter_valid?(x)
90
+ }
91
+ when filter == []
92
+ # Noop
93
+ else
94
+ Trollop::die "A filter can only be a string"
95
+ end
96
+ end
97
+
98
+ def go
99
+ @gh = GHTorrent::Mirror.new(options[:config])
100
+ @settings = @gh.settings
101
+
102
+ GH.init(options[:config])
103
+ # Message tags await publisher ack
104
+ awaiting_ack = SortedSet.new
105
+
106
+ # Num events read
107
+ num_read = 0
108
+
109
+ collection = case args[0]
110
+ when "events"
111
+ :events
112
+ when "commits"
113
+ :commits
114
+ end
115
+
116
+ puts "Loading from collection #{collection}"
117
+ puts "Loading items after #{Time.at(options[:earliest])}" if options[:verbose]
118
+
119
+ what = case
120
+ when options[:filter].is_a?(Array)
121
+ options[:filter].reduce({}) { |acc,x|
122
+ (k,r) = x.split(/=/)
123
+ acc[k] = Regexp.new(r)
124
+ acc
125
+ }
126
+ when options[:filter] == []
127
+ {}
128
+ end
129
+
130
+ from = {'_id' => {'$gte' => BSON::ObjectId.from_time(Time.at(options[:earliest]))}}
131
+
132
+ (puts "Mongo filter:"; pp what.merge(from)) if options[:verbose]
133
+
134
+ AMQP.start(:host => GH.settings['amqp']['host'],
135
+ :port => GH.settings['amqp']['port'],
136
+ :username => GH.settings['amqp']['username'],
137
+ :password => GH.settings['amqp']['password']) do |connection|
138
+
139
+ channel = AMQP::Channel.new(connection)
140
+ exchange = channel.topic(GH.settings['amqp']['exchange'],
141
+ :durable => true, :auto_delete => false)
142
+
143
+ # What to do when the user hits Ctrl+c
144
+ show_stopper = Proc.new {
145
+ connection.close { EventMachine.stop }
146
+ }
147
+
148
+ # Read next 1000 items and queue them
149
+ read_and_publish = Proc.new {
150
+
151
+ read = 0
152
+ col_info[collection][:col].find(what.merge(from),
153
+ :skip => num_read,
154
+ :limit => 1000).each do |e|
155
+
156
+ payload = GH.read_value(e, col_info[collection][:payload])
157
+ payload = if payload.class == BSON::OrderedHash
158
+ payload.delete "_id" # Inserted by MongoDB on event insert
159
+ payload.to_json
160
+ end
161
+ read += 1
162
+ unq = GH.read_value(e, col_info[collection][:unq])
163
+ if unq.class != String or unq.nil? then
164
+ raise Exception.new("Unique value can only be a String")
165
+ end
166
+
167
+ key = col_info[collection][:routekey] % unq
168
+
169
+ exchange.publish payload, :persistent => true, :routing_key => key
170
+
171
+ num_read += 1
172
+ puts("Publish id = #{unq} (#{num_read} total)") if options.verbose
173
+ awaiting_ack << num_read
174
+ end
175
+
176
+ # Nothing new in the DB and no msgs waiting ack
177
+ if read == 0 and awaiting_ack.size == 0
178
+ puts("Finished reading, exiting")
179
+ show_stopper.call
180
+ end
181
+ }
182
+
183
+ # Remove acknowledged or failed msg tags from the queue
184
+ # Trigger more messages to be read when ack msg queue size drops to zero
185
+ publisher_event = Proc.new { |ack|
186
+ if ack.multiple then
187
+ awaiting_ack.delete_if { |x| x <= ack.delivery_tag }
188
+ else
189
+ awaiting_ack.delete ack.delivery_tag
190
+ end
191
+
192
+ if awaiting_ack.size == 0
193
+ puts("ACKS.size= #{awaiting_ack.size}") if options.verbose
194
+ EventMachine.next_tick do
195
+ read_and_publish.call
196
+ end
197
+ end
198
+ }
199
+
200
+ # Await publisher confirms
201
+ channel.confirm_select
202
+
203
+ # Callback when confirms have arrived
204
+ channel.on_ack do |ack|
205
+ puts "ACK: tag=#{ack.delivery_tag}, mul=#{ack.multiple}" if options.verbose
206
+ publisher_event.call(ack)
207
+ end
208
+
209
+ # Callback when confirms failed.
210
+ channel.on_nack do |nack|
211
+ puts "NACK: tag=#{nack.delivery_tag}, mul=#{nack.multiple}" if options.verbose
212
+ publisher_event.call(nack)
213
+ end
214
+
215
+ # Signal handlers
216
+ Signal.trap('INT', show_stopper)
217
+ Signal.trap('TERM', show_stopper)
218
+
219
+ # Trigger start processing
220
+ EventMachine.add_timer(0.1) do
221
+ read_and_publish.call
222
+ end
223
+ end
224
+ end
225
+
226
+ private
227
+
228
+ def is_filter_valid?(filter)
229
+ (k, r) = filter.split(/=/)
230
+ return false if r.nil?
231
+ begin
232
+ Regexp.new(r)
233
+ true
234
+ rescue
235
+ false
236
+ end
237
+ end
238
+ end
239
+
240
+ GHTLoad.run
241
+
242
+ #vim: set filetype=ruby expandtab tabstop=2 shiftwidth=2 autoindent smartindent:
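
The publisher-confirm bookkeeping in `ght-load` can be followed in isolation: every published message gets a sequential delivery tag, and the next batch is read only once no tag is left unconfirmed. The sketch below mirrors that logic without a broker. `ConfirmTracker` is a hypothetical name, and it uses a plain `Set` rather than the script's `SortedSet`, because `SortedSet` is no longer part of the standard `set` library in Ruby 3.0 and later.

```ruby
require 'set'

# Tracks delivery tags of published messages until the broker confirms them.
class ConfirmTracker
  def initialize
    @pending = Set.new
  end

  # Record a message that has been published but not yet confirmed.
  def published(tag)
    @pending << tag
  end

  # Handle an ack or nack. With multiple = true the broker confirms every
  # tag up to and including `tag`; otherwise only `tag` itself.
  # Returns true when nothing is left pending, i.e. the next batch may be read.
  def confirmed(tag, multiple)
    if multiple
      @pending.delete_if { |t| t <= tag }
    else
      @pending.delete(tag)
    end
    @pending.empty?
  end

  def pending
    @pending.to_a.sort
  end
end

tracker = ConfirmTracker.new
(1..5).each { |tag| tracker.published(tag) }
tracker.confirmed(2, false) # confirms tag 2 only
tracker.confirmed(4, true)  # confirms the remaining tags up to 4 (1, 3 and 4)
p tracker.pending           # prints [5]
```

As in the script, a nack is treated the same way as an ack here: the tag is dropped from the pending set and the message is not republished.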