ghtorrent 0.8 → 0.8.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 964930d19ab5d35f6c70cafe72c8abf0e682ac87
+   data.tar.gz: 7bda9c3d934ab7c3d292587991dedc3f6e762cbc
+ SHA512:
+   metadata.gz: 591e92ae026990ba6a2f0dddfc32b47df8983b829d1b2d675f6fe9fde977fc063cce322649e7c3b05398bc891568c6ab11117c253a4f19ff7edc6a9aa1045129
+   data.tar.gz: f4ba04030bdec8a390dc800e676fbb1dfd2c6f0efc26ce4696f8b2e14a0ed3c1a7975f98c1e670fbaa89c8e3b5e0c8e96d171e88cf9a70dea6b99a71a28d351b
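The digests in checksums.yaml cover the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. A minimal sketch of checking such a digest with Ruby's standard `digest` library follows; the helper name `checksum_matches?` is hypothetical and not part of ghtorrent or RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper (not part of ghtorrent or RubyGems): returns true when
# the hex digest of +bytes+ equals +expected_hex+ for the given algorithm.
def checksum_matches?(bytes, expected_hex, algorithm = Digest::SHA512)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

# Self-check against a well-known value: the SHA1 digest of the empty string.
EMPTY_SHA1 = 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
puts checksum_matches?('', EMPTY_SHA1, Digest::SHA1)
```

In practice the bytes would come from reading `data.tar.gz` in binary mode after unpacking the `.gem` file, and the expected value from the matching checksums.yaml entry.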
data/CHANGELOG CHANGED
@@ -1,3 +1,12 @@
+ = Version 0.8.1
+ * New tool to retrieve specific entities and their dependencies
+ * New tool to retrieve repositories en masse
+ * Support for resuming when exception occurs while processing items in loops
+ * Support for finer grained transactions when processing large entities
+ * Commit comments are now indexed per owner/repo (was just by comment id)
+ * Remove the unused daemon mode
+ * Various exception fixes and more detailed logging
+
  = Version 0.8
  * Retrieve and process issue labels
  * Retrive and process actors for pull request events
data/Gemfile CHANGED
@@ -3,5 +3,5 @@ source 'https://rubygems.org'
  gemspec
 
  platforms :jruby do
- gem "jdbc-mysql"
+ gem 'jdbc-mysql'
  end
data/Gemfile.lock CHANGED
@@ -1,34 +1,44 @@
  PATH
  remote: .
  specs:
- ghtorrent (0.7.3)
- amqp (~> 1.0.0)
- bson_ext (~> 1.8.0)
- daemons (~> 1.1.0)
- mongo (~> 1.8.0)
- sequel (~> 3.47)
+ ghtorrent (0.8.1)
+ amqp (~> 1.1.0)
+ bson_ext (~> 1.9.0)
+ mongo (~> 1.9.0)
+ sequel (~> 4.5.0)
  trollop (~> 2.0.0)
 
  GEM
  remote: https://rubygems.org/
  specs:
- amq-client (1.0.2)
- amq-protocol (>= 1.2.0)
+ addressable (2.3.5)
+ amq-protocol (1.9.0)
+ amqp (1.1.7)
+ amq-protocol (>= 1.9.0)
  eventmachine
- amq-protocol (1.5.0)
- amqp (1.0.2)
- amq-client (~> 1.0.2)
- amq-protocol (>= 1.3.0)
- eventmachine
- bson (1.8.5)
- bson_ext (1.8.5)
- bson (~> 1.8.5)
- daemons (1.1.9)
+ bson (1.9.2)
+ bson_ext (1.9.2)
+ bson (~> 1.9.2)
+ crack (0.4.1)
+ safe_yaml (~> 0.9.0)
+ diff-lcs (1.2.5)
  eventmachine (1.0.3)
- mongo (1.8.5)
- bson (~> 1.8.5)
- sequel (3.47.0)
+ mongo (1.9.2)
+ bson (~> 1.9.2)
+ rspec (2.14.1)
+ rspec-core (~> 2.14.0)
+ rspec-expectations (~> 2.14.0)
+ rspec-mocks (~> 2.14.0)
+ rspec-core (2.14.7)
+ rspec-expectations (2.14.4)
+ diff-lcs (>= 1.1.3, < 2.0)
+ rspec-mocks (2.14.4)
+ safe_yaml (0.9.7)
+ sequel (4.5.0)
  trollop (2.0)
+ webmock (1.16.0)
+ addressable (>= 2.2.7)
+ crack (>= 0.3.2)
 
  PLATFORMS
  ruby
@@ -36,3 +46,5 @@ PLATFORMS
  DEPENDENCIES
  ghtorrent!
  jdbc-mysql
+ rspec (~> 2.14.0)
+ webmock (~> 1.16)
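The lockfile changes above rely on RubyGems' pessimistic operator (`~>`). As a rough illustration of which versions the new constraints admit, here is a sketch using `Gem::Requirement` and `Gem::Version` from RubyGems itself; the helper name `allows?` is hypothetical.

```ruby
require 'rubygems'

# Hypothetical helper: does the constraint string admit the version string?
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 4.5.0' pins the minor version: >= 4.5.0 and < 4.6.
puts allows?('~> 4.5.0', '4.5.0')
puts allows?('~> 4.5.0', '4.6.0')

# '~> 1.16' pins only the major version: >= 1.16 and < 2.0.
puts allows?('~> 1.16', '1.20.0')
puts allows?('~> 1.16', '2.0.0')
```

This is why `sequel (~> 4.5.0)` resolves to 4.5.0 here while a hypothetical 4.6 release would be rejected, whereas `webmock (~> 1.16)` would accept any later 1.x release.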
data/README.md CHANGED
@@ -1,4 +1,4 @@
- ##ghtorrent: Mirror and process data from the Github API
+ # ghtorrent: Mirror and process data from the Github API
 
  A library and a collection of scripts used to retrieve data from the Github API
  and extract metadata in an SQL database, in a modular and scalable manner. The
@@ -12,8 +12,10 @@ GHTorrent can be used for a variety of purposes, such as:
  * Create a queriable metadata index for a specific repository
  * Query the Github API using intelligent caching to avoid duplicate queries
 
- GHTorrent is comprised from the following components (which can be used
- individually):
+
+ ## Components
+
+ GHTorrents components (which can be used individually) are:
 
  * [APIClient](https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/api_client.rb): Knows how to query the Github API (both single entities and
  pages) and respect the API request limit. Can be configured to override the
@@ -26,56 +28,65 @@ store must support arbitrary queries to the stored JSON objects.
  * [GHTorrent](https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/ghtorrent.rb): Knows how to extract information from the data retrieved by
  the retriever in order to update an SQL database (see [schema](http://ghtorrent.org/relational.html)) with metadata.
 
+ ### Component Configuration
+
  The Persister and GHTorrent components have configurable back ends:
 
- * Persister: Either uses MongoDB > 2.0 (`mongo` driver) or no persister (`noop` driver)
- * GHTorrent: GHTorrent is tested mainly with MySQL, but can theoretically be
+ * **Persister:** Either uses MongoDB > 2.0 (`mongo` driver) or no persister (`noop` driver)
+ * **GHTorrent:** GHTorrent is tested mainly with MySQL, but can theoretically be
  used with any SQL database compatible with [Sequel](http://sequel.rubyforge.org/rdoc/files/doc/opening_databases_rdoc.html). Your milaege may vary.
 
-
  The distributed mirroring scripts also require RabbitMQ >= 2.8 or other
 
- #### Installing
 
+ ## Installation
+
+
+ ### 1. Install GHTorrent
  GHTorrent is written in Ruby (tested with 1.9). To install it as a Gem do:
 
  <code>
  sudo gem install ghtorrent
  </code>
 
+
+ ### 2. Install Your Preferred Database
+
  Depending on which SQL database you want to use, install the appropriate
  dependency gem.
 
  <code>
- sudo gem install mysql2 #or sqlite3-ruby #or postgres
+ sudo gem install mysql2 # or <sqlite3-ruby|postgres>
  </code>
 
- #### Configuring
 
- Copy the contents of the
- [config.yaml.tmpl](https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl)
- file to a file in your home directory. All provided scripts accept the `-c`
- option, which you can use to pass the location of the configuration file as
+ ## Configuration
+
+ Copy [config.yaml.tmpl](https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl)
+ to a file in your home directory.
+
+ All provided scripts accept the `-c` option, which accepts the location of the configuration file as
  a parameter.
 
  You can find more information of how you can setup a mirroring cluster of machines
  to retrieve data in parallel on the [Wiki](https://github.com/gousiosg/github-mirror/wiki/Setting-up-a-mirroring-cluster).
 
- ### Running
+
+ ## Using GHTorrent
 
  To mirror the event stream and capture all data:
 
  * `ght-mirror-events.rb` periodically polls Github's event
  queue (`https://api.github.com/events`), stores all new events in the
- configured pestister and posts them to the `github` exchange in
+ configured pestister, and posts them to the `github` exchange in
  RabbitMQ.
 
  * `ght-data_retrieval.rb` creates queues that route posted events to processor
- functions, which in turn use the appropriate Github API call to retrieve the
- linked contents, extract metadata to store in the SQL database and store the
+ functions. The functions use the appropriate Github API call to retrieve the
+ linked contents, extract metadata (for database storage), and store the
  retrieved data in the appropriate collection in the persister, to avoid
- duplicate API
- calls. Data in the SQL database contain pointers (the `ext_ref_id` field) to the
+ duplicate API calls.
+ Data in the SQL database contain pointers (the `ext_ref_id` field) to the
  "raw" data in the persister.
 
  To retrieve data for a repository or user:
@@ -89,27 +100,31 @@ To perform maintenance:
  the `ght-data-retrieval` script to reprocess them
  * `ght-get-more-commits` retrieves all commits for a specific repository
 
- #### Data
+
+ ### Data Torrents
 
  You can find torrents for retrieving data on the
- [Available Torrents](https://ghtorrent.org/downloads.html) page. You can find two sets of data:
+ [Available Torrents](https://ghtorrent.org/downloads.html) page.
 
- * Raw events: Github's [event stream](https://api.github.com/events). These
+ There are two sets of data:
+
+ * **Raw events:** Github's [event stream](https://api.github.com/events). These
  are the roots for mirroring operations. The `ght-data-retrieval` crawler starts
  from an event and goes deep into the rabbit hole.
- * SQL dumps+Linked data: Data dumps from the SQL database and the corresponding
+ * **SQL dumps + Linked data:** Data dumps from the SQL database and the corresponding
  MongoDB entities.
 
- #### Reporting bugs
 
- Please use the [Issue
- Tracker](https://github.com/gousiosg/github-mirror/issues) for reporting bugs
- and feature requests.
+ ## Bugs & Feature Requests
+
+ Please tell us about features you'd like or bugs you've discovered on our
+ [Issue Tracker](https://github.com/gousiosg/github-mirror/issues).
 
- Patches, bug fixes etc are welcome. Please fork the repository and create
+ Patches, bug fixes, etc are welcome. Please fork the repository and create
  a pull request when done fixing/implementing the new feature.
 
- #### Citation information
+
+ ## Citing GHTorrent in your Research
 
  If you find GHTorrent and the accompanying datasets useful in your research,
  please consider citing the following paper:
@@ -118,16 +133,17 @@ please consider citing the following paper:
 
  See also the following presentation:
 
- <iframe src="http://www.slideshare.net/slideshow/embed_code/13184524?rel=0" width="342" height="291" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen/>
+ <iframe src="http://www.slideshare.net/slideshow/embed_code/13184524?rel=0" width="342" height="291" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen />
  <div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/gousiosg/ghtorrent-githubs-data-from-a-firehose-13184524" title="GHTorrent: Github&#39;s Data from a Firehose" target="_blank">GHTorrent: Github&#39;s Data from a Firehose</a> </strong> </div>
 
- #### Authors
 
- [Georgios Gousios](http://istlab.dmst.aueb.gr/~george) <gousiosg@gmail.com>
+ ## Authors
+
+ * [Georgios Gousios](http://istlab.dmst.aueb.gr/~george) <gousiosg@gmail.com>
+ * [Diomidis Spinellis](http://www.dmst.aueb.gr/dds) <dds@aueb.gr>
 
- [Diomidis Spinellis](http://www.dmst.aueb.gr/dds) <dds@aueb.gr>
 
- #### License
+ ## License
 
  [2-clause BSD](http://www.opensource.org/licenses/bsd-license.php)
 
data/Rakefile CHANGED
@@ -2,11 +2,11 @@ require 'rake'
  require 'rake/testtask'
  require 'rake/rdoctask'
 
- task :default => [:test, :rdoc]
+ task :default => [:spec, :rdoc]
 
  desc "Run basic tests"
- Rake::TestTask.new(:test) do |t|
- t.pattern = 'test/*_test.rb'
+ Rake::TestTask.new(:spec) do |t|
+ t.pattern = 'spec/*_test.rb'
  t.verbose = true
  t.warning = true
  end
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'ghtorrent'
+
+ GHTRetrieveDependents.run(ARGV)
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'ghtorrent'
+
+ GHTRetrieveRepos.run(ARGV)
data/lib/ghtorrent.rb CHANGED
@@ -48,6 +48,7 @@ require 'ghtorrent/retriever'
 
  # SQL database fillup methods
  require 'ghtorrent/ghtorrent'
+ require 'ghtorrent/transacted_ghtorrent'
 
  # Commands
  require 'ghtorrent/commands/ght_data_retrieval'
@@ -57,5 +58,7 @@ require 'ghtorrent/commands/ght_rm_dupl'
  require 'ghtorrent/commands/ght_load'
  require 'ghtorrent/commands/ght_retrieve_repo'
  require 'ghtorrent/commands/ght_retrieve_user'
+ require 'ghtorrent/commands/ght_retrieve_dependents'
+ require 'ghtorrent/commands/ght_retrieve_repos'
 
  # vim: set sta sts=2 shiftwidth=2 sw=2 et ai :
@@ -24,7 +24,7 @@ module GHTorrent
  :events => %w(id),
  :users => %w(login),
  :commits => %w(sha),
- :commit_comments => %w(repo user commit_id),
+ :commit_comments => %w(commit_id id),
  :repos => %w(name owner.login),
  :repo_labels => %w(repo owner),
  :repo_collaborators => %w(repo owner login),
@@ -184,9 +184,9 @@ module GHTorrent
  idx_fields = v.reduce({}){|acc, x| acc.merge({x => 1})}
  if exists.nil?
  col.create_index(idx_fields, :background => true)
- STDERR.puts "Creating index on #{collection}(#{v})"
+ STDERR.puts "Creating index on #{col}(#{v})"
  else
- STDERR.puts "Index on #{collection}(#{v}) exists"
+ STDERR.puts "Index on #{col}(#{v}) exists"
  end
 
  end
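The `reduce` call in the hunk above turns a list of field names into the hash that is passed to `create_index`, mapping each field to `1` (ascending order in MongoDB index specifications). A standalone sketch of just that transformation, with the hypothetical helper name `index_spec`, applied to the new `:commit_comments` field list:

```ruby
# Hypothetical helper wrapping the reduce expression from the diff:
# builds a compound index specification from an ordered list of field names.
def index_spec(fields)
  fields.reduce({}) { |acc, x| acc.merge({ x => 1 }) }
end

# Field list taken from the updated :commit_comments entry.
p index_spec(%w(commit_id id))
```

Field order matters for compound indexes, and Ruby hashes (1.9 and later) preserve insertion order, so the resulting specification keeps `commit_id` ahead of `id`.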
@@ -169,7 +169,7 @@ module GHTorrent
  end
 
  total = Time.now.to_ms - start_time.to_ms
- debug "APIClient: Request: #{url} #{if from_cache then "from cache," else "(#{@remaining} remaining)," end} Total: #{total} ms"
+ debug "APIClient[#{@attach_ip}]: Request: #{url} #{if from_cache then "from cache," else "(#{@remaining} remaining)," end} Total: #{total} ms"
 
  contents
  rescue OpenURI::HTTPError => e
@@ -190,9 +190,9 @@ module GHTorrent
  end
  ensure
  if not from_cache and config(:respect_api_ratelimit) and @remaining < 10
- sleep = (@reset - Time.now.to_i) / 60
- debug "APIClient: Request limit reached, sleeping for #{sleep} min"
- sleep(@reset - Time.now.to_i)
+ to_sleep = @reset - Time.now.to_i + 2
+ debug "APIClient: Request limit reached, sleeping for #{to_sleep} secs"
+ sleep(to_sleep)
  end
  end
  end
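The hunk above replaces a minutes-based log message, whose value was rounded down by integer division, with a single seconds value that is both logged and slept on, plus a 2-second margin past the reset time. A minimal sketch of that calculation follows; the helper name `seconds_until_reset` and the clamp to zero are assumptions added for illustration and do not appear in the diff.

```ruby
# Hypothetical helper: seconds to wait until the rate-limit window resets.
# reset_epoch and now_epoch are Unix timestamps in whole seconds.
# The margin of 2 mirrors the "+ 2" in the diff; clamping at zero is an
# assumption added here so that a reset time in the past never yields a
# negative sleep duration.
def seconds_until_reset(reset_epoch, now_epoch, margin = 2)
  [reset_epoch - now_epoch + margin, 0].max
end

# Integer division is why the removed code could log "0 min" for waits
# shorter than a minute.
puts 59 / 60
puts seconds_until_reset(1_000, 940)
```

A caller would pass the result to `sleep`, for example `sleep(seconds_until_reset(@reset, Time.now.to_i))` inside a class that tracks `@reset`.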
@@ -202,6 +202,8 @@ module GHTorrent
  @username ||= config(:github_username)
  @passwd ||= config(:github_passwd)
  @user_agent ||= config(:user_agent)
+ @remaining ||= 10
+ @reset ||= Time.now.to_i + 3600
 
  open_func ||= if @username.nil?
  lambda {|url| open(url, 'User-Agent' => @user_agent)}
@@ -54,29 +54,6 @@ module GHTorrent
  command.options[:password])
  end
 
- if command.options[:daemon]
- if Process.uid == 0
- # Daemonize as a proper system daemon
- Daemons.daemonize(:app_name => File.basename($0),
- :dir_mode => :system,
- :log_dir => "/var/log",
- :backtrace => true,
- :log_output => true)
- STDERR.puts "Became a daemon"
- # Change effective user id for the process
- unless command.options[:user].nil?
- Process.euid = Etc.getpwnam(command.options[:user]).uid
- end
- else
- # Daemonize, but output in current directory
- Daemons.daemonize(:app_name => File.basename($0),
- :dir_mode => :normal,
- :dir => Dir.getwd,
- :backtrace => true,
- :log_output => true)
- end
- end
-
  begin
  command.go
  rescue => e
@@ -107,10 +84,7 @@ Standard options:
  opt :verbose, 'verbose mode', :short => 'v'
  opt :addr, 'ip address to use for performing requests', :short => 'a',
  :type => String
- opt :daemon, 'run as daemon', :short => 'd'
- opt :user, 'run as the specified user (only when started as root)',
- :short => 'u', :type => String
- opt :username, 'Username at Github', :type => String
+ opt :username, 'Username at Github', :short => 's', :type => String
  opt :password, 'Password at Github', :type => String
  end
  end
@@ -163,7 +137,7 @@ Standard options:
 
  def override_config(config_file, setting, new_value)
  puts "Overriding configuration #{setting}=#{config(setting)} with cmd line #{new_value}"
- merge_config_values({setting => new_value})
+ merge_config_values(config_file, {setting => new_value})
  end
 
  private
@@ -58,11 +58,12 @@ class GHTDataRetrieval < GHTorrent::Command
  end
 
  def CommitCommentEvent(data)
- user = data['actor']['login']
+ user = data['repo']['name'].split(/\//)[0]
  repo = data['repo']['name'].split(/\//)[1]
  id = data['payload']['comment']['id']
+ sha = data['payload']['comment']['commit_id']
 
- ghtorrent.get_commit_comment(user, repo, id)
+ ghtorrent.get_commit_comment(user, repo, sha, id)
  end
 
  def PullRequestEvent(data)
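The `CommitCommentEvent` change above derives the owner from the event's `repo.name` field (formatted as `owner/repo`) instead of from the actor who wrote the comment, and also passes the commit SHA along. The sketch below shows why the two can differ; the helper name `commit_comment_args` and the sample event values are hypothetical, and only the hash keys are taken from the diff.

```ruby
# Hypothetical helper: extracts the four arguments the updated handler
# passes on (owner, repo, commit SHA, comment id) from an event hash.
def commit_comment_args(data)
  owner, repo = data['repo']['name'].split('/')
  comment = data['payload']['comment']
  [owner, repo, comment['commit_id'], comment['id']]
end

# Made-up sample event: the commenter (alice) does not own the repository.
event = {
  'actor'   => { 'login' => 'alice' },
  'repo'    => { 'name' => 'bob/project' },
  'payload' => { 'comment' => { 'id' => 42, 'commit_id' => 'abc123' } }
}

p commit_comment_args(event)
```

With the removed code, the lookup for this event would have used `alice` as the repository owner, which does not identify the `bob/project` repository.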
@@ -60,7 +60,7 @@ Loads object ids from a collection to a queue for further processing.
 
  def go
  # Num events read
- num_read = 0
+ total_read = 0
 
  puts "Loading items after #{Time.at(options[:earliest])}" if options[:verbose]
  puts "Loading items before #{Time.at(options[:latest])}" if options[:verbose]
@@ -107,9 +107,10 @@ Loads object ids from a collection to a queue for further processing.
 
  # Read next options[:batch] items and queue them
  read_and_publish = Proc.new {
-
+ num_read = 0
  persister.get_underlying_connection[:events].find(what.merge(from),
- :skip => num_read,
+ :snapshot => true,
+ :skip => total_read,
  :limit => options[:batch]).each do |e|
  unq = read_value(e, 'type')
  if unq.class != String or unq.nil? then
@@ -120,10 +121,11 @@ Loads object ids from a collection to a queue for further processing.
  :routing_key => "evt.#{e['type']}"
 
  num_read += 1
- puts "Publish id = #{e['id']} (#{num_read} total)" if options.verbose
+ total_read += 1
+ puts "Publish id = #{e['id']} (#{num_read} read, #{total_read} total)" if options.verbose
  end
 
- if num_read >= options[:number]
+ if total_read >= options[:number] or num_read == 0
  puts 'Finished reading, exiting'
  show_stopper.call
  else
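The loader changes above split one counter into two: `num_read` counts items in the current batch, while `total_read` drives the `:skip` offset and the overall limit. Reading stops either when the requested number of items is reached or when a batch comes back empty. The following in-memory sketch shows that control flow under simplifying assumptions: `Array#drop` and `Array#first` stand in for MongoDB's `:skip` and `:limit`, and the method name `drain_in_batches` is hypothetical.

```ruby
# Hypothetical sketch of the batch loop: returns the items that would have
# been published. items is any array; max_items plays the role of
# options[:number] and batch_size the role of options[:batch].
def drain_in_batches(items, batch_size, max_items)
  total_read = 0
  published = []
  loop do
    num_read = 0 # reset for every batch, as in the updated code
    # drop/first emulate :skip => total_read, :limit => batch_size
    items.drop(total_read).first(batch_size).each do |item|
      published << item
      num_read += 1
      total_read += 1
    end
    # Stop when enough items were read or the source is exhausted.
    break if total_read >= max_items || num_read.zero?
  end
  published
end

p drain_in_batches((1..5).to_a, 2, 100)
```

Because the limit is only checked after a whole batch, the loop can return slightly more than `max_items` items; the empty-batch check is what lets it finish when the collection holds fewer items than requested.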