bio-biostars-analytics 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 5cff7f6e3a1b295042b32eba00069abbe877a8b9
4
+ data.tar.gz: a0dafd42a7a80aeb6f99a1ff1c43490d1da10507
5
+ SHA512:
6
+ metadata.gz: 52b94bc235a8be6974c1f98a62da8f28bb19daf38fbc57398121333a81f5ce350fdc08d3fe78e973795df827e083254de0b648c0b1cede787978d7a6c1bcc936
7
+ data.tar.gz: e343dc4bfcc04d69c0fa1fba08553dfa12e0270fb0cd6bef75b689e313e84fb539c3ef933153aa7168795ebad1c9e8656513b71ce225c6961f5cc6f34b18ae1e
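The digests in checksums.yaml can be compared against the archives unpacked from the `.gem` file. The following is a minimal sketch using Ruby's standard `digest` library; the helper name `sha512_matches?` is hypothetical and not part of this gem.

```ruby
require 'digest'

# Hypothetical helper (not part of the gem): returns true when the SHA512
# digest of the file at `path` equals the expected hex string, for example
# one of the SHA512 values listed in checksums.yaml.
def sha512_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end

# Usage (assumes data.tar.gz was extracted from the .gem archive):
#   sha512_matches?('data.tar.gz', 'e343dc4b...')  # pass the full digest
```

The same pattern should work for the SHA1 entries by swapping in `Digest::SHA1`.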
data/.document ADDED
@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.travis.yml ADDED
@@ -0,0 +1,13 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.2
4
+ - 1.9.3
5
+ - jruby-19mode # JRuby in 1.9 mode
6
+
7
+ # - rbx-19mode
8
+ # - 1.8.7
9
+ # - jruby-18mode # JRuby in 1.8 mode
10
+ # - rbx-18mode
11
+
12
+ # uncomment this line if your project needs to run something other than `rake`:
13
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,20 @@
1
+ source "http://rubygems.org"
2
+ # Add dependencies required to use your gem here.
3
+ # Example:
4
+ # gem "activesupport", ">= 2.3.5"
5
+
6
+ # Add dependencies to develop your gem here.
7
+ # Include everything needed to run rake, tests, features, etc.
8
+ group :development do
9
+ gem "shoulda", ">= 0"
10
+ gem "rdoc", "~> 3.12"
11
+ gem "jeweler", "~> 2.0.1", :git => "https://github.com/technicalpickles/jeweler.git"
12
+ gem "bundler", ">= 1.0.21"
13
+ gem "bio", ">= 1.4.2"
14
+ gem "rdoc", "~> 3.12"
15
+
16
+ # Required for mining:
17
+ gem "hpricot", "~> 0.8.6"
18
+ gem "chronic", "~> 0.10.2"
19
+ gem "json", "~> 1.8.0"
20
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2014 Joachim Baran
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,96 @@
1
+ # bio-biostars-analytics
2
+
3
+ [![Build Status](https://secure.travis-ci.org/joejimbo/bioruby-biostars-analytics.png)](http://travis-ci.org/joejimbo/bioruby-biostars-analytics)
4
+
5
+ Data-mining analyses that make use of this gem (newest to oldest):
6
+
7
+ - [Uh-oh, Biostar: Three Years of User Metrics Analysis](http://joachimbaran.wordpress.com/2013/03/15/uh-oh-biostar/)
8
+ - [BioStar: Activity of its BioStars](http://joachimbaran.wordpress.com/2012/03/20/biostar-activity-of-its-biostars/)
9
+ - [BioStar: Is the BioStar fading? An Annual Follow-Up](http://joachimbaran.wordpress.com/2012/03/11/biostar-second-analysis/)
10
+ - [BioStar: Is the BioStar fading?](http://joachimbaran.wordpress.com/2011/03/07/biostar-fading/)
11
+
12
+ ## Installation
13
+
14
+ Biostars analytics can be installed as a Ruby gem:
15
+
16
+ gem install bio-biostars-analytics
17
+
18
+ Statistical analysis requires [R](http://www.r-project.org) 2.15.0 or later, as well as
19
+ the plyr package 2.15.1 or later.
20
+
21
+ ## Usage
22
+
23
+ Data-mining: crawl the Biostars forum and retrieve data from the Biostar RESTful API; parameters
24
+ as of March 2014:
25
+
26
+ biostars-analytics 96000 54
27
+
28
+ This will create two files: `<date>_api.tsv` and `<date>_crawled.tsv`
29
+
30
+ Various plots in PNG file format can be generated via:
31
+
32
+ biostar_api_stats <date>_api.tsv
33
+ biostar_crawled_stats <date>_crawled.tsv
34
+
35
+ ### Command Line Usage Instructions
36
+
37
+ #### Data-Mining
38
+
39
+ Usage: biostars-analytics max_post_number months_look_back [min_post_number]
40
+
41
+ Required parameters:
42
+ max_post_number : highest number (ID) of the post that should
43
+ be mined for data; the crawler will go over
44
+ posts min_post_number to max_post_number
45
+ months_look_back : how many months back should queries to the
46
+ Biostar API go (1 month = 30 days); default
47
+ value is 1
48
+
49
+ Optional parameters:
50
+ min_post_number : lowest number (ID) of the post that should
51
+ be mined for data
52
+
53
+ Output (date matches the script's invocation):
54
+ <date>_crawled.tsv : data mined from crawling over posts
55
+ <date>_api.tsv : data extracted from the Biostar API
56
+
57
+ Example: mining Biostars in March 2014:
58
+ biostars-analytics 96000 54
59
+
60
+ #### Statistics (based on RESTful API data)
61
+
62
+ Usage: biostar_api_stats apitsvfile
63
+
64
+ Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):
65
+ biostar_api_stats data/20140328_api.tsv
66
+
67
+ #### Statistics (based on forum mining/crawling)
68
+
69
+ Usage: biostar_crawled_stats crawledtsvfile
70
+
71
+ Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):
72
+ biostar_crawled_stats data/20140328_crawled.tsv
73
+
74
+ ## Project Repository
75
+
76
+ Contributions can be made to the open repository on GitHub:
77
+
78
+ [http://github.com/joejimbo/bioruby-biostars-analytics](http://github.com/joejimbo/bioruby-biostars-analytics)
79
+
80
+ The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
81
+
82
+ ## Cite
83
+
84
+ If you use this software, please cite one of
85
+
86
+ * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
87
+ * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
88
+
89
+ ## Biogems.info
90
+
91
+ This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-biostars-analytics).
92
+
93
+ ## Copyright
94
+
95
+ Copyright (c) 2014 Joachim Baran. See LICENSE.txt for further details.
96
+
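The command line usage documented in the README above can be illustrated with a small argument-parsing sketch. This is not the gem's actual implementation: the method name `parse_args`, the returned hash keys, and the fallback of `min_post_number` to 1 are assumptions made for illustration. Only the argument order and the "1 month = 30 days" rule come from the README.

```ruby
USAGE = 'Usage: biostars-analytics max_post_number months_look_back [min_post_number]'

# Hypothetical sketch: turn the documented positional arguments into a hash.
def parse_args(argv)
  raise ArgumentError, USAGE unless (2..3).cover?(argv.length)

  # Integer() raises ArgumentError on non-numeric input.
  max_post, months, min_post = argv.map { |arg| Integer(arg) }
  min_post ||= 1 # assumption: the README does not state a default

  raise ArgumentError, 'min_post_number exceeds max_post_number' if min_post > max_post

  {
    :min_post         => min_post,
    :max_post         => max_post,
    :months_look_back => months,
    :look_back_days   => months * 30 # README: 1 month = 30 days
  }
end

# The README's March 2014 example, `biostars-analytics 96000 54`:
options = parse_args(%w[96000 54])
puts options[:look_back_days]
```

With the README's example parameters, the API queries would reach back 54 months of 30 days each.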
data/README.rdoc ADDED
@@ -0,0 +1,48 @@
1
+ = bio-biostars-analytics
2
+
3
+ {<img
4
+ src="https://secure.travis-ci.org/joejimbo/bioruby-biostars-analytics.png"
5
+ />}[http://travis-ci.org/#!/joejimbo/bioruby-biostars-analytics]
6
+
7
+ Full description goes here
8
+
9
+ Note: this software is under active development!
10
+
11
+ == Installation
12
+
13
+ gem install bio-biostars-analytics
14
+
15
+ == Usage
16
+
17
+ == Developers
18
+
19
+ To use the library
20
+
21
+ require 'bio-biostars-analytics'
22
+
23
+ The API doc is online. For more code examples see also the test files in
24
+ the source tree.
25
+
26
+ == Project home page
27
+
28
+ For information on the source tree, documentation, issues and how to contribute, see
29
+
30
+ http://github.com/joejimbo/bioruby-biostars-analytics
31
+
32
+ The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
33
+
34
+ == Cite
35
+
36
+ If you use this software, please cite one of
37
+
38
+ * {BioRuby: bioinformatics software for the Ruby programming language}[http://dx.doi.org/10.1093/bioinformatics/btq475]
39
+ * {Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics}[http://dx.doi.org/10.1093/bioinformatics/bts080]
40
+
41
+ == Biogems.info
42
+
43
+ This Biogem is published at http://biogems.info/index.html#bio-biostars-analytics
44
+
45
+ == Copyright
46
+
47
+ Copyright (c) 2014 Joachim Baran. See LICENSE.txt for further details.
48
+
data/Rakefile ADDED
@@ -0,0 +1,46 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+ require 'bundler'
5
+ begin
6
+ Bundler.setup(:default, :development)
7
+ rescue Bundler::BundlerError => e
8
+ $stderr.puts e.message
9
+ $stderr.puts "Run `bundle install` to install missing gems"
10
+ exit e.status_code
11
+ end
12
+ require 'rake'
13
+
14
+ require 'jeweler'
15
+ Jeweler::Tasks.new do |gem|
16
+ # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
17
+ gem.name = 'bio-biostars-analytics'
18
+ gem.homepage = 'http://github.com/joejimbo/bioruby-biostars-analytics'
19
+ gem.license = 'MIT'
20
+ gem.summary = %Q{Biostars data-mining and statistical analysis.}
21
+ gem.description = %Q{Ruby script for data-mining biostars.org using web-crawling techniques as well as utilizing the Biostars RESTful API. Statistical analysis requires R (http://www.r-project.org).}
22
+ gem.email = 'joachim.baran@gmail.com'
23
+ gem.authors = [ 'Joachim Baran' ]
24
+ gem.executables = [ 'biostars-analytics', 'biostar_api_stats', 'biostar_crawled_stats' ]
25
+ # dependencies defined in Gemfile
26
+ end
27
+ Jeweler::RubygemsDotOrgTasks.new
28
+
29
+ require 'rake/testtask'
30
+ Rake::TestTask.new(:test) do |test|
31
+ test.libs << 'lib' << 'test'
32
+ test.pattern = 'test/**/test_*.rb'
33
+ test.verbose = true
34
+ end
35
+
36
+ task :default => :test
37
+
38
+ require 'rdoc/task'
39
+ Rake::RDocTask.new do |rdoc|
40
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
41
+
42
+ rdoc.rdoc_dir = 'rdoc'
43
+ rdoc.title = "bio-biostars-analytics #{version}"
44
+ rdoc.rdoc_files.include('README*')
45
+ rdoc.rdoc_files.include('lib/**/*.rb')
46
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.1.0
@@ -0,0 +1,260 @@
1
+ #!/usr/bin/ruby
2
+
3
+ analysis_script = <<EOR
4
+ args <- commandArgs(TRUE)
5
+
6
+ input_tsv_file <- 'INPUT_FILE_NAME'
7
+
8
+ bioeval <- function(base, column, xlabel, ylabel, colour, scale, average=TRUE) {
9
+ biolevels <- levels(factor(base$age_in_month))
10
+ avg <- rep(0, times=length(biolevels))
11
+ for (i in 1:length(biolevels)) {
12
+ avg[i] <- mean(
13
+ column[ base$age_in_month == biolevels[i] ]
14
+ )
15
+ }
16
+
17
+ plot(
18
+ base$age_in_month,
19
+ t(column),
20
+ col=colour,
21
+ pch=19,
22
+ xlab=xlabel,
23
+ ylab=ylabel,
24
+ axes=FALSE,
25
+ ylim=c(min(column), max(column)/scale) # Scale, so that outliers don't squish the important details.
26
+ )
27
+ if (average) {
28
+ lines(biolevels, avg, col=rgb(.3,.3,.3))
29
+ }
30
+ # Trendline over all data:
31
+ abline(lm(column ~ base$age_in_month), col=rgb(0,.9,1.0), lwd=6)
32
+ # Trendline over the last 24 months:
33
+ lookback <- 24
34
+ lookback_months <- biostar$age_in_month[biostar$age_in_month > (max(biostar$age_in_month - lookback))]
35
+ trend <- lm(tail(column, length(lookback_months)) ~ lookback_months)
36
+ trend_prediction <- predict(trend, newdata=data.frame(x=lookback_months))
37
+ lines(x=lookback_months, y=trend_prediction, col=rgb(1.0,0,.6), lwd=6)
38
+
39
+ # Draw axis:
40
+ axis(1,min(base$age_in_month):max(base$age_in_month))
41
+ axis(2,min(column):max(column))
42
+ }
43
+
44
+ biobarplot <- function(userstats, age, max_y, xlabel, ylabel, colour) {
45
+ barplot(
46
+ userstats,
47
+ col=colour,
48
+ xlab=xlabel,
49
+ ylab=ylabel,
50
+ ylim=c(0,max_y)
51
+ )
52
+ }
53
+
54
+ biostar <- read.table(
55
+ input_tsv_file,
56
+ sep="\t",
57
+ encoding="UTF-8",
58
+ row.names=NULL,
59
+ comment.char="",
60
+ col.names=c(
61
+ "age",
62
+ "date",
63
+ "year",
64
+ "month",
65
+ "day",
66
+ "new_posts_in_category_1",
67
+ "new_posts_in_category_2",
68
+ "new_posts_in_category_3",
69
+ "new_posts_in_category_4",
70
+ "new_posts_in_category_5",
71
+ "new_posts_in_category_6",
72
+ "new_posts_in_category_7",
73
+ "new_posts_in_category_8",
74
+ "new_posts_in_category_9",
75
+ "new_posts_in_category_10",
76
+ "new_posts_in_category_11",
77
+ "new_posts_in_category_12",
78
+ "new_posts_in_category_13",
79
+ "new_posts_in_category_14",
80
+ "new_posts_in_category_15",
81
+ "new_posts_in_category_16",
82
+ "new_votes_of_type_accept",
83
+ "new_votes_of_type_bookmark",
84
+ "new_votes_of_type_downvote",
85
+ "new_votes_of_type_upvote",
86
+ "new_posts",
87
+ "new_votes",
88
+ "new_users",
89
+ "posters",
90
+ "poster_ages",
91
+ "root_post_ages",
92
+ "vote_post_ages",
93
+ "biostarbabies",
94
+ "empty"
95
+ )
96
+ )
97
+
98
+ biostar$age_in_month <- (
99
+ (
100
+ biostar$year -
101
+ rep(min(biostar$year), times=length(biostar$year))
102
+ )*12 +
103
+ biostar$month
104
+ )
105
+ biostar$age_in_month <- biostar$age_in_month -
106
+ rep(min(biostar$age_in_month), times=length(biostar$age_in_month)) +
107
+ 1
108
+
109
+ # Which users were active in a month?
110
+ users_per_month <- lapply(unique(sort(biostar$age_in_month)), function (age_in_month) { unique(sort(sapply(strsplit(paste(biostar[biostar$age_in_month == age_in_month, ]$posters, collapse = ","), split = ","), as.integer))) })
111
+
112
+ # How many users were active in a month?
113
+ userfreq <- rep(0, max(biostar$age_in_month))
114
+ for (age_in_month in min(biostar$age_in_month):max(biostar$age_in_month)) {
115
+ userfreq[age_in_month] <- length(unlist(users_per_month[age_in_month]))
116
+ }
117
+
118
+ # Determine the number of months for which users have been active.
119
+ # 1. over the whole time span
120
+ # 2. except for the last three months
121
+ useractivity <- table(as.numeric(table(unlist(users_per_month))))
122
+ useractivity_wo_last_3_months <- table(as.numeric(table(unlist(users_per_month[c(seq(4, max(length(users_per_month))))]))))
123
+
124
+ png("api_participation.png", height=900, width=1300, unit="px", pointsize=26)
125
+ biobarplot(
126
+ useractivity,
127
+ max(biostar$age_in_month),
128
+ (floor(useractivity[1] / 1000) + 1) * 1000,
129
+ 'Month of Participation',
130
+ 'Number of Active Users',
131
+ "#ee6633"
132
+ )
133
+ dev.off()
134
+
135
+ png("api_participation_wo_last_3_months.png", height=900, width=1300, unit="px", pointsize=26)
136
+ biobarplot(
137
+ useractivity_wo_last_3_months,
138
+ max(biostar$age_in_month),
139
+ (floor(useractivity_wo_last_3_months[1] / 1000) + 1) * 1000,
140
+ 'Month of Participation',
141
+ 'Number of Active Users',
142
+ "#ee6633"
143
+ )
144
+ dev.off()
145
+
146
+ userfreq_table <- as.table(userfreq)
147
+ rownames(userfreq_table) <- seq(length(userfreq))
148
+ png("api_users.png", height=900, width=1300, unit="px", pointsize=26)
149
+ biobarplot(
150
+ userfreq_table,
151
+ max(biostar$age_in_month),
152
+ 600,
153
+ 'Biostar Month',
154
+ 'Number of Active Users',
155
+ "#ff2233"
156
+ )
157
+ dev.off()
158
+
159
+ categories_images = c(
160
+ "api_category_1.png",
161
+ "api_category_2.png",
162
+ "api_category_3.png",
163
+ "api_category_4.png",
164
+ "api_category_5.png",
165
+ "api_category_6.png",
166
+ "api_category_7.png",
167
+ "api_category_8.png",
168
+ "api_category_9.png",
169
+ "api_category_10.png",
170
+ "api_category_11.png",
171
+ "api_category_12.png",
172
+ "api_category_13.png",
173
+ "api_category_14.png",
174
+ "api_category_15.png",
175
+ "api_category_16.png",
176
+ "api_upvotes.png",
177
+ "api_downvotes.png",
178
+ "api_bookmarks.png",
179
+ "api_accepts.png"
180
+ )
181
+ categories_values = list(
182
+ biostar$new_posts_in_category_1,
183
+ biostar$new_posts_in_category_2,
184
+ biostar$new_posts_in_category_3,
185
+ biostar$new_posts_in_category_4,
186
+ biostar$new_posts_in_category_5,
187
+ biostar$new_posts_in_category_6,
188
+ biostar$new_posts_in_category_7,
189
+ biostar$new_posts_in_category_8,
190
+ biostar$new_posts_in_category_9,
191
+ biostar$new_posts_in_category_10,
192
+ biostar$new_posts_in_category_11,
193
+ biostar$new_posts_in_category_12,
194
+ biostar$new_posts_in_category_13,
195
+ biostar$new_posts_in_category_14,
196
+ biostar$new_posts_in_category_15,
197
+ biostar$new_posts_in_category_16,
198
+ biostar$new_votes_of_type_upvote,
199
+ biostar$new_votes_of_type_downvote,
200
+ biostar$new_votes_of_type_bookmark,
201
+ biostar$new_votes_of_type_accept
202
+ )
203
+ categories_labels = c(
204
+ "Questions per Day",
205
+ "Answers per Day",
206
+ "Comments per Day",
207
+ "Tutorials per Day",
208
+ "Blog Posts per Day",
209
+ "Forum Posts per Day",
210
+ "News per Day",
211
+ " -- Unknown, sorry, no time to look it up now -- ",
212
+ "Tool Announcements per Day",
213
+ "FixMes per Day",
214
+ "Videos per Day",
215
+ "Job Postings per Day",
216
+ "Research Papers per Day",
217
+ "Tips per Day",
218
+ "Polls per Day",
219
+ "Ads per Day",
220
+ "Upvotes per Day",
221
+ "Downvotes per Day",
222
+ "Bookmarks per Day",
223
+ "Accepts per Day"
224
+ )
225
+ for (category in seq(length(categories_images))) {
226
+ png(categories_images[category], height=900, width=1300, unit="px", pointsize=26)
227
+ bioeval(
228
+ biostar,
229
+ unlist(categories_values[category]),
230
+ 'Biostar Month',
231
+ categories_labels[category],
232
+ rgb(100,100,0,20,maxColorValue=255),
233
+ 2
234
+ )
235
+ dev.off()
236
+ }
237
+ EOR
238
+
239
+ R = '/usr/bin/R'
240
+
241
+ unless File.exist?(R) then
242
+ puts 'Please install R (http://www.r-project.org) as: /usr/bin/R'
243
+ puts ''
244
+ puts 'Also, install the plyr package via: install.packages("plyr")'
245
+ exit 1
246
+ end
247
+
248
+ if ARGV.length != 1 then
249
+ puts 'Usage: biostar_api_stats apitsvfile'
250
+ puts ''
251
+ puts 'Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):'
252
+ puts ' biostar_api_stats data/20140328_api.tsv'
253
+ exit 2
254
+ end
255
+
256
+ IO.popen("#{R} --no-save", 'w') { |io|
257
+ io.puts(analysis_script.sub('INPUT_FILE_NAME', ARGV[0]))
258
+ io.close_write
259
+ }
260
+
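The script above injects the input file name into the R source with a plain `String#sub`. With a string replacement, `sub` interprets backslash sequences, and a file name containing a single quote would end the R string literal early. A minimal sketch of a more defensive substitution follows; the helper `r_single_quoted` is a hypothetical name and not part of the gem.

```ruby
# Hypothetical helper: render a path as a single-quoted R string literal,
# escaping backslashes first and single quotes second.
def r_single_quoted(path)
  escaped = path.gsub('\\') { '\\\\' }.gsub("'") { "\\'" }
  "'#{escaped}'"
end

# One-line stand-in for the R template used in the script above.
template = "input_tsv_file <- 'INPUT_FILE_NAME'\n"

# The block form of sub inserts its result literally, so backslashes in the
# replacement are not treated as back-references.
script = template.sub("'INPUT_FILE_NAME'") { r_single_quoted("data/o'brien.tsv") }
puts script
```

The block form is used because the replacement text is then taken verbatim, which keeps the escaping rules in one place.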