ghtorrent 0.8 → 0.8.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 964930d19ab5d35f6c70cafe72c8abf0e682ac87
+ data.tar.gz: 7bda9c3d934ab7c3d292587991dedc3f6e762cbc
+ SHA512:
+ metadata.gz: 591e92ae026990ba6a2f0dddfc32b47df8983b829d1b2d675f6fe9fde977fc063cce322649e7c3b05398bc891568c6ab11117c253a4f19ff7edc6a9aa1045129
+ data.tar.gz: f4ba04030bdec8a390dc800e676fbb1dfd2c6f0efc26ce4696f8b2e14a0ed3c1a7975f98c1e670fbaa89c8e3b5e0c8e96d171e88cf9a70dea6b99a71a28d351b
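The `checksums.yaml` entries above are hex digests of the two archives packed inside the `.gem` file. As a rough illustration of how such digests could be recomputed with Ruby's standard `digest` library, here is a minimal sketch. The helper name `archive_digests` and the example path are hypothetical; they are not part of RubyGems or ghtorrent.

```ruby
require 'digest'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# keyed the same way as the top-level sections of checksums.yaml.
def archive_digests(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Example usage (the path is an assumption; it would be an archive
# extracted from the .gem file, such as data.tar.gz):
#   archive_digests('data.tar.gz')['SHA1']
```

Comparing the returned values with the ones listed in `checksums.yaml` would show whether an extracted archive matches the published one.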
data/CHANGELOG CHANGED
@@ -1,3 +1,12 @@
+ = Version 0.8.1
+ * New tool to retrieve specific entities and their dependencies
+ * New tool to retrieve repositories en masse
+ * Support for resuming when exception occurs while processing items in loops
+ * Support for finer grained transactions when processing large entities
+ * Commit comments are now indexed per owner/repo (was just by comment id)
+ * Remove the unused daemon mode
+ * Various exception fixes and more detailed logging
+
  = Version 0.8
  * Retrieve and process issue labels
  * Retrive and process actors for pull request events
data/Gemfile CHANGED
@@ -3,5 +3,5 @@ source 'https://rubygems.org'
  gemspec
  
  platforms :jruby do
- gem "jdbc-mysql"
+ gem 'jdbc-mysql'
  end
data/Gemfile.lock CHANGED
@@ -1,34 +1,44 @@
  PATH
  remote: .
  specs:
- ghtorrent (0.7.3)
- amqp (~> 1.0.0)
- bson_ext (~> 1.8.0)
- daemons (~> 1.1.0)
- mongo (~> 1.8.0)
- sequel (~> 3.47)
+ ghtorrent (0.8.1)
+ amqp (~> 1.1.0)
+ bson_ext (~> 1.9.0)
+ mongo (~> 1.9.0)
+ sequel (~> 4.5.0)
  trollop (~> 2.0.0)
  
  GEM
  remote: https://rubygems.org/
  specs:
- amq-client (1.0.2)
- amq-protocol (>= 1.2.0)
+ addressable (2.3.5)
+ amq-protocol (1.9.0)
+ amqp (1.1.7)
+ amq-protocol (>= 1.9.0)
  eventmachine
- amq-protocol (1.5.0)
- amqp (1.0.2)
- amq-client (~> 1.0.2)
- amq-protocol (>= 1.3.0)
- eventmachine
- bson (1.8.5)
- bson_ext (1.8.5)
- bson (~> 1.8.5)
- daemons (1.1.9)
+ bson (1.9.2)
+ bson_ext (1.9.2)
+ bson (~> 1.9.2)
+ crack (0.4.1)
+ safe_yaml (~> 0.9.0)
+ diff-lcs (1.2.5)
  eventmachine (1.0.3)
- mongo (1.8.5)
- bson (~> 1.8.5)
- sequel (3.47.0)
+ mongo (1.9.2)
+ bson (~> 1.9.2)
+ rspec (2.14.1)
+ rspec-core (~> 2.14.0)
+ rspec-expectations (~> 2.14.0)
+ rspec-mocks (~> 2.14.0)
+ rspec-core (2.14.7)
+ rspec-expectations (2.14.4)
+ diff-lcs (>= 1.1.3, < 2.0)
+ rspec-mocks (2.14.4)
+ safe_yaml (0.9.7)
+ sequel (4.5.0)
  trollop (2.0)
+ webmock (1.16.0)
+ addressable (>= 2.2.7)
+ crack (>= 0.3.2)
  
  PLATFORMS
  ruby
@@ -36,3 +46,5 @@ PLATFORMS
  DEPENDENCIES
  ghtorrent!
  jdbc-mysql
+ rspec (~> 2.14.0)
+ webmock (~> 1.16)
data/README.md CHANGED
@@ -1,4 +1,4 @@
- ##ghtorrent: Mirror and process data from the Github API
+ # ghtorrent: Mirror and process data from the Github API
  
  A library and a collection of scripts used to retrieve data from the Github API
  and extract metadata in an SQL database, in a modular and scalable manner. The
@@ -12,8 +12,10 @@ GHTorrent can be used for a variety of purposes, such as:
  * Create a queriable metadata index for a specific repository
  * Query the Github API using intelligent caching to avoid duplicate queries
  
- GHTorrent is comprised from the following components (which can be used
- individually):
+
+ ## Components
+
+ GHTorrents components (which can be used individually) are:
  
  * [APIClient](https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/api_client.rb): Knows how to query the Github API (both single entities and
  pages) and respect the API request limit. Can be configured to override the
@@ -26,56 +28,65 @@ store must support arbitrary queries to the stored JSON objects.
  * [GHTorrent](https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/ghtorrent.rb): Knows how to extract information from the data retrieved by
  the retriever in order to update an SQL database (see [schema](http://ghtorrent.org/relational.html)) with metadata.
  
+ ### Component Configuration
+
  The Persister and GHTorrent components have configurable back ends:
  
- * Persister: Either uses MongoDB > 2.0 (`mongo` driver) or no persister (`noop` driver)
- * GHTorrent: GHTorrent is tested mainly with MySQL, but can theoretically be
+ * **Persister:** Either uses MongoDB > 2.0 (`mongo` driver) or no persister (`noop` driver)
+ * **GHTorrent:** GHTorrent is tested mainly with MySQL, but can theoretically be
  used with any SQL database compatible with [Sequel](http://sequel.rubyforge.org/rdoc/files/doc/opening_databases_rdoc.html). Your milaege may vary.
  
-
  The distributed mirroring scripts also require RabbitMQ >= 2.8 or other
  
- #### Installing
  
+ ## Installation
+
+
+ ### 1. Install GHTorrent
  GHTorrent is written in Ruby (tested with 1.9). To install it as a Gem do:
  
  <code>
  sudo gem install ghtorrent
  </code>
  
+
+ ### 2. Install Your Preferred Database
+
  Depending on which SQL database you want to use, install the appropriate
  dependency gem.
  
  <code>
- sudo gem install mysql2 #or sqlite3-ruby #or postgres
+ sudo gem install mysql2 # or <sqlite3-ruby|postgres>
  </code>
  
- #### Configuring
  
- Copy the contents of the
- [config.yaml.tmpl](https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl)
- file to a file in your home directory. All provided scripts accept the `-c`
- option, which you can use to pass the location of the configuration file as
+ ## Configuration
+
+ Copy [config.yaml.tmpl](https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl)
+ to a file in your home directory.
+
+ All provided scripts accept the `-c` option, which accepts the location of the configuration file as
  a parameter.
  
  You can find more information of how you can setup a mirroring cluster of machines
  to retrieve data in parallel on the [Wiki](https://github.com/gousiosg/github-mirror/wiki/Setting-up-a-mirroring-cluster).
  
- ### Running
+
+ ## Using GHTorrent
  
  To mirror the event stream and capture all data:
  
  * `ght-mirror-events.rb` periodically polls Github's event
  queue (`https://api.github.com/events`), stores all new events in the
- configured pestister and posts them to the `github` exchange in
+ configured pestister, and posts them to the `github` exchange in
  RabbitMQ.
  
  * `ght-data_retrieval.rb` creates queues that route posted events to processor
- functions, which in turn use the appropriate Github API call to retrieve the
- linked contents, extract metadata to store in the SQL database and store the
+ functions. The functions use the appropriate Github API call to retrieve the
+ linked contents, extract metadata (for database storage), and store the
  retrieved data in the appropriate collection in the persister, to avoid
- duplicate API
- calls. Data in the SQL database contain pointers (the `ext_ref_id` field) to the
+ duplicate API calls.
+ Data in the SQL database contain pointers (the `ext_ref_id` field) to the
  "raw" data in the persister.
  
  To retrieve data for a repository or user:
@@ -89,27 +100,31 @@ To perform maintenance:
  the `ght-data-retrieval` script to reprocess them
  * `ght-get-more-commits` retrieves all commits for a specific repository
  
- #### Data
+
+ ### Data Torrents
  
  You can find torrents for retrieving data on the
- [Available Torrents](https://ghtorrent.org/downloads.html) page. You can find two sets of data:
+ [Available Torrents](https://ghtorrent.org/downloads.html) page.
  
- * Raw events: Github's [event stream](https://api.github.com/events). These
+ There are two sets of data:
+
+ * **Raw events:** Github's [event stream](https://api.github.com/events). These
  are the roots for mirroring operations. The `ght-data-retrieval` crawler starts
  from an event and goes deep into the rabbit hole.
- * SQL dumps+Linked data: Data dumps from the SQL database and the corresponding
+ * **SQL dumps + Linked data:** Data dumps from the SQL database and the corresponding
  MongoDB entities.
  
- #### Reporting bugs
  
- Please use the [Issue
- Tracker](https://github.com/gousiosg/github-mirror/issues) for reporting bugs
- and feature requests.
+ ## Bugs & Feature Requests
+
+ Please tell us about features you'd like or bugs you've discovered on our
+ [Issue Tracker](https://github.com/gousiosg/github-mirror/issues).
  
- Patches, bug fixes etc are welcome. Please fork the repository and create
+ Patches, bug fixes, etc are welcome. Please fork the repository and create
  a pull request when done fixing/implementing the new feature.
  
- #### Citation information
+
+ ## Citing GHTorrent in your Research
  
  If you find GHTorrent and the accompanying datasets useful in your research,
  please consider citing the following paper:
@@ -118,16 +133,17 @@ please consider citing the following paper:
  
  See also the following presentation:
  
- <iframe src="http://www.slideshare.net/slideshow/embed_code/13184524?rel=0" width="342" height="291" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen/>
+ <iframe src="http://www.slideshare.net/slideshow/embed_code/13184524?rel=0" width="342" height="291" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen />
  <div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/gousiosg/ghtorrent-githubs-data-from-a-firehose-13184524" title="GHTorrent: Github&#39;s Data from a Firehose" target="_blank">GHTorrent: Github&#39;s Data from a Firehose</a> </strong> </div>
  
- #### Authors
  
- [Georgios Gousios](http://istlab.dmst.aueb.gr/~george) <gousiosg@gmail.com>
+ ## Authors
+
+ * [Georgios Gousios](http://istlab.dmst.aueb.gr/~george) <gousiosg@gmail.com>
+ * [Diomidis Spinellis](http://www.dmst.aueb.gr/dds) <dds@aueb.gr>
  
- [Diomidis Spinellis](http://www.dmst.aueb.gr/dds) <dds@aueb.gr>
  
- #### License
+ ## License
  
  [2-clause BSD](http://www.opensource.org/licenses/bsd-license.php)
  
data/Rakefile CHANGED
@@ -2,11 +2,11 @@ require 'rake'
  require 'rake/testtask'
  require 'rake/rdoctask'
  
- task :default => [:test, :rdoc]
+ task :default => [:spec, :rdoc]
  
  desc "Run basic tests"
- Rake::TestTask.new(:test) do |t|
- t.pattern = 'test/*_test.rb'
+ Rake::TestTask.new(:spec) do |t|
+ t.pattern = 'spec/*_test.rb'
  t.verbose = true
  t.warning = true
  end
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'ghtorrent'
+
+ GHTRetrieveDependents.run(ARGV)
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'ghtorrent'
+
+ GHTRetrieveRepos.run(ARGV)
data/lib/ghtorrent.rb CHANGED
@@ -48,6 +48,7 @@ require 'ghtorrent/retriever'
  
  # SQL database fillup methods
  require 'ghtorrent/ghtorrent'
+ require 'ghtorrent/transacted_ghtorrent'
  
  # Commands
  require 'ghtorrent/commands/ght_data_retrieval'
@@ -57,5 +58,7 @@ require 'ghtorrent/commands/ght_rm_dupl'
  require 'ghtorrent/commands/ght_load'
  require 'ghtorrent/commands/ght_retrieve_repo'
  require 'ghtorrent/commands/ght_retrieve_user'
+ require 'ghtorrent/commands/ght_retrieve_dependents'
+ require 'ghtorrent/commands/ght_retrieve_repos'
  
  # vim: set sta sts=2 shiftwidth=2 sw=2 et ai :
@@ -24,7 +24,7 @@ module GHTorrent
  :events => %w(id),
  :users => %w(login),
  :commits => %w(sha),
- :commit_comments => %w(repo user commit_id),
+ :commit_comments => %w(commit_id id),
  :repos => %w(name owner.login),
  :repo_labels => %w(repo owner),
  :repo_collaborators => %w(repo owner login),
@@ -184,9 +184,9 @@ module GHTorrent
  idx_fields = v.reduce({}){|acc, x| acc.merge({x => 1})}
  if exists.nil?
  col.create_index(idx_fields, :background => true)
- STDERR.puts "Creating index on #{collection}(#{v})"
+ STDERR.puts "Creating index on #{col}(#{v})"
  else
- STDERR.puts "Index on #{collection}(#{v}) exists"
+ STDERR.puts "Index on #{col}(#{v}) exists"
  end
  
  end
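The `reduce` call in the hunk above turns a list of field names into the hash that `create_index` receives, with every field mapped to `1` (ascending order in MongoDB index specifications). A minimal standalone sketch of that step, using the new `commit_comments` field list from this release; the method name `index_spec` is hypothetical and exists only for illustration:

```ruby
# Hypothetical helper: builds an ascending index specification from a list
# of field names, mirroring the reduce expression in the hunk above.
def index_spec(fields)
  fields.reduce({}) { |acc, field| acc.merge(field => 1) }
end

# The new commit_comments index fields from this release.
spec = index_spec(%w(commit_id id))
puts spec.keys.join(', ')
```

Because Ruby hashes preserve insertion order, the fields appear in the specification in the same order as in the input list.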
@@ -169,7 +169,7 @@ module GHTorrent
  end
  
  total = Time.now.to_ms - start_time.to_ms
- debug "APIClient: Request: #{url} #{if from_cache then "from cache," else "(#{@remaining} remaining)," end} Total: #{total} ms"
+ debug "APIClient[#{@attach_ip}]: Request: #{url} #{if from_cache then "from cache," else "(#{@remaining} remaining)," end} Total: #{total} ms"
  
  contents
  rescue OpenURI::HTTPError => e
@@ -190,9 +190,9 @@ module GHTorrent
  end
  ensure
  if not from_cache and config(:respect_api_ratelimit) and @remaining < 10
- sleep = (@reset - Time.now.to_i) / 60
- debug "APIClient: Request limit reached, sleeping for #{sleep} min"
- sleep(@reset - Time.now.to_i)
+ to_sleep = @reset - Time.now.to_i + 2
+ debug "APIClient: Request limit reached, sleeping for #{to_sleep} secs"
+ sleep(to_sleep)
  end
  end
  end
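The hunk above replaces a local variable named `sleep` (which shadowed the method name and held minutes) with `to_sleep`, measured in seconds and padded by 2. The arithmetic can be sketched in isolation as below. The helper name `seconds_until_reset` is hypothetical, and the clamp to zero is an added assumption that is not present in the diff; it is included because `Kernel#sleep` rejects negative durations.

```ruby
# Hypothetical helper: seconds to wait until the rate-limit window resets.
# reset_epoch is the reset time in seconds since the Unix epoch; padding
# mirrors the 2 extra seconds added in the hunk above.
def seconds_until_reset(reset_epoch, now = Time.now.to_i, padding = 2)
  wait = reset_epoch - now + padding
  # Assumption, not in the diff: never return a negative duration.
  wait > 0 ? wait : 0
end

# With a reset 60 seconds in the future, the wait is 60 plus the padding.
puts seconds_until_reset(1_000, 940)
```

Passing `now` explicitly keeps the calculation deterministic, which makes it easier to check than code that reads the clock directly.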
@@ -202,6 +202,8 @@ module GHTorrent
  @username ||= config(:github_username)
  @passwd ||= config(:github_passwd)
  @user_agent ||= config(:user_agent)
+ @remaining ||= 10
+ @reset ||= Time.now.to_i + 3600
  
  open_func ||= if @username.nil?
  lambda {|url| open(url, 'User-Agent' => @user_agent)}
@@ -54,29 +54,6 @@ module GHTorrent
  command.options[:password])
  end
  
- if command.options[:daemon]
- if Process.uid == 0
- # Daemonize as a proper system daemon
- Daemons.daemonize(:app_name => File.basename($0),
- :dir_mode => :system,
- :log_dir => "/var/log",
- :backtrace => true,
- :log_output => true)
- STDERR.puts "Became a daemon"
- # Change effective user id for the process
- unless command.options[:user].nil?
- Process.euid = Etc.getpwnam(command.options[:user]).uid
- end
- else
- # Daemonize, but output in current directory
- Daemons.daemonize(:app_name => File.basename($0),
- :dir_mode => :normal,
- :dir => Dir.getwd,
- :backtrace => true,
- :log_output => true)
- end
- end
-
  begin
  command.go
  rescue => e
@@ -107,10 +84,7 @@ Standard options:
  opt :verbose, 'verbose mode', :short => 'v'
  opt :addr, 'ip address to use for performing requests', :short => 'a',
  :type => String
- opt :daemon, 'run as daemon', :short => 'd'
- opt :user, 'run as the specified user (only when started as root)',
- :short => 'u', :type => String
- opt :username, 'Username at Github', :type => String
+ opt :username, 'Username at Github', :short => 's', :type => String
  opt :password, 'Password at Github', :type => String
  end
  end
@@ -163,7 +137,7 @@ Standard options:
  
  def override_config(config_file, setting, new_value)
  puts "Overriding configuration #{setting}=#{config(setting)} with cmd line #{new_value}"
- merge_config_values({setting => new_value})
+ merge_config_values(config_file, {setting => new_value})
  end
  
  private
@@ -58,11 +58,12 @@ class GHTDataRetrieval < GHTorrent::Command
  end
  
  def CommitCommentEvent(data)
- user = data['actor']['login']
+ user = data['repo']['name'].split(/\//)[0]
  repo = data['repo']['name'].split(/\//)[1]
  id = data['payload']['comment']['id']
+ sha = data['payload']['comment']['commit_id']
  
- ghtorrent.get_commit_comment(user, repo, id)
+ ghtorrent.get_commit_comment(user, repo, sha, id)
  end
  
  def PullRequestEvent(data)
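The `CommitCommentEvent` change above stops using the event's actor as the repository owner and instead derives the owner from the `owner/repo` name, then passes the commit SHA along with the comment id. A minimal sketch of that extraction on a made-up event follows; the helper name `commit_comment_args` and all sample values are hypothetical, and only the hash keys are taken from the hunk.

```ruby
# Hypothetical helper: collects the four values that the updated handler
# passes to get_commit_comment, in the same order (owner, repo, sha, id).
def commit_comment_args(data)
  owner, repo = data['repo']['name'].split('/')
  comment = data['payload']['comment']
  [owner, repo, comment['commit_id'], comment['id']]
end

# Made-up event containing only the fields the handler reads.
event = {
  'actor'   => { 'login' => 'alice' },
  'repo'    => { 'name' => 'bob/project' },
  'payload' => { 'comment' => { 'id' => 42, 'commit_id' => 'abc123' } }
}

owner, repo, sha, id = commit_comment_args(event)
puts "#{owner}/#{repo} #{sha} #{id}"
```

In this sample the commenter (`alice`) differs from the repository owner (`bob`), which is the case where the previous code would have looked up the comment under the wrong owner.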
@@ -60,7 +60,7 @@ Loads object ids from a collection to a queue for further processing.
  
  def go
  # Num events read
- num_read = 0
+ total_read = 0
  
  puts "Loading items after #{Time.at(options[:earliest])}" if options[:verbose]
  puts "Loading items before #{Time.at(options[:latest])}" if options[:verbose]
@@ -107,9 +107,10 @@ Loads object ids from a collection to a queue for further processing.
  
  # Read next options[:batch] items and queue them
  read_and_publish = Proc.new {
-
+ num_read = 0
  persister.get_underlying_connection[:events].find(what.merge(from),
- :skip => num_read,
+ :snapshot => true,
+ :skip => total_read,
  :limit => options[:batch]).each do |e|
  unq = read_value(e, 'type')
  if unq.class != String or unq.nil? then
@@ -120,10 +121,11 @@ Loads object ids from a collection to a queue for further processing.
  :routing_key => "evt.#{e['type']}"
  
  num_read += 1
- puts "Publish id = #{e['id']} (#{num_read} total)" if options.verbose
+ total_read += 1
+ puts "Publish id = #{e['id']} (#{num_read} read, #{total_read} total)" if options.verbose
  end
  
- if num_read >= options[:number]
+ if total_read >= options[:number] or num_read == 0
  puts 'Finished reading, exiting'
  show_stopper.call
  else