bio-biostars-analytics 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 5cff7f6e3a1b295042b32eba00069abbe877a8b9
4
+ data.tar.gz: a0dafd42a7a80aeb6f99a1ff1c43490d1da10507
5
+ SHA512:
6
+ metadata.gz: 52b94bc235a8be6974c1f98a62da8f28bb19daf38fbc57398121333a81f5ce350fdc08d3fe78e973795df827e083254de0b648c0b1cede787978d7a6c1bcc936
7
+ data.tar.gz: e343dc4bfcc04d69c0fa1fba08553dfa12e0270fb0cd6bef75b689e313e84fb539c3ef933153aa7168795ebad1c9e8656513b71ce225c6961f5cc6f34b18ae1e
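The digests listed in checksums.yaml can be checked against the packaged files. The following is a minimal sketch, assuming the `.gem` archive has already been unpacked so that `metadata.gz` and `data.tar.gz` exist on disk; `checksum_matches?` and `DIGESTS` are hypothetical names and are not part of the gem.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Digest classes for the two algorithms listed in checksums.yaml.
DIGESTS = {
  'SHA1'   => Digest::SHA1,
  'SHA512' => Digest::SHA512
}.freeze

# Hypothetical helper (not part of the gem): returns true when the file at
# `path` hashes to `expected_hex` under the named algorithm.
def checksum_matches?(path, expected_hex, algorithm)
  digest_class = DIGESTS.fetch(algorithm) # raises KeyError for unknown algorithms
  digest_class.file(path).hexdigest == expected_hex.downcase
end
```

For example, `checksum_matches?('data.tar.gz', 'a0dafd42a7a80aeb6f99a1ff1c43490d1da10507', 'SHA1')` would compare the unpacked archive against the SHA1 entry above.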
data/.document ADDED
@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.travis.yml ADDED
@@ -0,0 +1,13 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.2
4
+ - 1.9.3
5
+ - jruby-19mode # JRuby in 1.9 mode
6
+
7
+ # - rbx-19mode
8
+ # - 1.8.7
9
+ # - jruby-18mode # JRuby in 1.8 mode
10
+ # - rbx-18mode
11
+
12
+ # uncomment this line if your project needs to run something other than `rake`:
13
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,20 @@
1
+ source "http://rubygems.org"
2
+ # Add dependencies required to use your gem here.
3
+ # Example:
4
+ # gem "activesupport", ">= 2.3.5"
5
+
6
+ # Add dependencies to develop your gem here.
7
+ # Include everything needed to run rake, tests, features, etc.
8
+ group :development do
9
+ gem "shoulda", ">= 0"
10
+ gem "rdoc", "~> 3.12"
11
+ gem "jeweler", "~> 2.0.1", :git => "https://github.com/technicalpickles/jeweler.git"
12
+ gem "bundler", ">= 1.0.21"
13
+ gem "bio", ">= 1.4.2"
14
+ gem "rdoc", "~> 3.12"
15
+
16
+ # Required for mining:
17
+ gem "hpricot", "~> 0.8.6"
18
+ gem "chronic", "~> 0.10.2"
19
+ gem "json", "~> 1.8.0"
20
+ end
data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2014 Joachim Baran
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,96 @@
1
+ # bio-biostars-analytics
2
+
3
+ [![Build Status](https://secure.travis-ci.org/joejimbo/bioruby-biostars-analytics.png)](http://travis-ci.org/joejimbo/bioruby-biostars-analytics)
4
+
5
+ Data-mining analyses that make use of this gem (newest to oldest):
6
+
7
+ - [Uh-oh, Biostar: Three Years of User Metrics Analysis](http://joachimbaran.wordpress.com/2013/03/15/uh-oh-biostar/)
8
+ - [BioStar: Activity of its BioStars](http://joachimbaran.wordpress.com/2012/03/20/biostar-activity-of-its-biostars/)
9
+ - [BioStar: Is the BioStar fading? An Annual Follow-Up](http://joachimbaran.wordpress.com/2012/03/11/biostar-second-analysis/)
10
+ - [BioStar: Is the BioStar fading?](http://joachimbaran.wordpress.com/2011/03/07/biostar-fading/)
11
+
12
+ ## Installation
13
+
14
+ Biostars analytics can be installed as a Ruby gem:
15
+
16
+ gem install bio-biostars-analytics
17
+
18
+ Statistical analysis requires [R](http://www.r-project.org) 2.15.0 or later, as well as
19
+ the plyr package 2.15.1 or later.
20
+
21
+ ## Usage
22
+
23
+ Data-mining: crawl the Biostars forum and retrieve data from the Biostar RESTful API; parameters
24
+ as of March 2014:
25
+
26
+ biostars-analytics 96000 54
27
+
28
+ This will create two files: `<date>_api.tsv` and `<date>_crawled.tsv`
29
+
30
+ Various plots in PNG file format can be generated via:
31
+
32
+ biostar_api_stats <date>_api.tsv
33
+ biostar_crawled_stats <date>_crawled.tsv
34
+
35
+ ### Command Line Usage Instructions
36
+
37
+ #### Data-Mining
38
+
39
+ Usage: biostars-analytics max_post_number months_look_back [min_post_number]
40
+
41
+ Required parameters:
42
+ max_post_number : highest number (ID) of the post that should
43
+ be mined for data; the crawler will go over
44
+ posts min_post_number to max_post_number
45
+ months_look_back : how many months back should queries to the
46
+ Biostar API go (1 month = 30 days); default
47
+ value is 1
48
+
49
+ Optional parameters:
50
+ min_post_number : lowest number (ID) of the post that should
51
+ be mined for data
52
+
53
+ Output (date matches the script's invocation):
54
+ <date>_crawled.tsv : data mined from crawling over posts
55
+ <date>_api.tsv : data extracted from the Biostar API
56
+
57
+ Example: mining Biostars in March 2014:
58
+ biostars-analytics 96000 54
59
+
60
+ #### Statistics (based on RESTful API data)
61
+
62
+ Usage: biostar_api_stats apitsvfile
63
+
64
+ Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):
65
+ biostar_api_stats data/20140328_api.tsv
66
+
67
+ #### Statistics (based on forum mining/crawling)
68
+
69
+ Usage: biostar_crawled_stats crawledtsvfile
70
+
71
+ Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):
72
+ biostar_crawled_stats data/20140328_crawled.tsv
73
+
74
+ ## Project Repository
75
+
76
+ Contributions can be made to the open repository on GitHub:
77
+
78
+ [http://github.com/joejimbo/bioruby-biostars-analytics](http://github.com/joejimbo/bioruby-biostars-analytics)
79
+
80
+ The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
81
+
82
+ ## Cite
83
+
84
+ If you use this software, please cite one of
85
+
86
+ * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
87
+ * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
88
+
89
+ ## Biogems.info
90
+
91
+ This Biogem is published at [biogems.info](http://biogems.info/index.html#bio-biostars-analytics).
92
+
93
+ ## Copyright
94
+
95
+ Copyright (c) 2014 Joachim Baran. See LICENSE.txt for further details.
96
+
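The command-line contract described in the README's "Data-Mining" usage text can be sketched as follows. This is an illustration only and not the gem's actual implementation: `parse_mining_args` and `DAYS_PER_MONTH` are hypothetical names, and the default of 1 for `min_post_number` is an assumption, since the README does not state one.

```ruby
# The usage text defines one month as 30 days.
DAYS_PER_MONTH = 30

# Hypothetical helper (not part of the gem): turns the positional arguments
# max_post_number, months_look_back and the optional min_post_number into a
# hash of integers, converting the look-back period into days.
def parse_mining_args(argv)
  unless [2, 3].include?(argv.length)
    raise ArgumentError,
          'Usage: biostars-analytics max_post_number months_look_back [min_post_number]'
  end

  # Integer(arg, 10) rejects non-numeric input and avoids octal parsing.
  max_post, months, min_post = argv.map { |arg| Integer(arg, 10) }
  min_post ||= 1 # assumption: start at the first post when no minimum is given

  if min_post > max_post
    raise ArgumentError, 'min_post_number must not exceed max_post_number'
  end

  {
    min_post_number: min_post,
    max_post_number: max_post,
    look_back_days: months * DAYS_PER_MONTH
  }
end
```

With the README's March 2014 example, `parse_mining_args(%w[96000 54])` yields a look-back window of 54 × 30 = 1620 days.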
data/README.rdoc ADDED
@@ -0,0 +1,48 @@
1
+ = bio-biostars-analytics
2
+
3
+ {<img
4
+ src="https://secure.travis-ci.org/joejimbo/bioruby-biostars-analytics.png"
5
+ />}[http://travis-ci.org/#!/joejimbo/bioruby-biostars-analytics]
6
+
7
+ Full description goes here
8
+
9
+ Note: this software is under active development!
10
+
11
+ == Installation
12
+
13
+ gem install bio-biostars-analytics
14
+
15
+ == Usage
16
+
17
+ == Developers
18
+
19
+ To use the library
20
+
21
+ require 'bio-biostars-analytics'
22
+
23
+ The API doc is online. For more code examples see also the test files in
24
+ the source tree.
25
+
26
+ == Project home page
27
+
28
+ For information on the source tree, documentation, issues and how to contribute, see
29
+
30
+ http://github.com/joejimbo/bioruby-biostars-analytics
31
+
32
+ The BioRuby community is on IRC server: irc.freenode.org, channel: #bioruby.
33
+
34
+ == Cite
35
+
36
+ If you use this software, please cite one of
37
+
38
+ * {BioRuby: bioinformatics software for the Ruby programming language}[http://dx.doi.org/10.1093/bioinformatics/btq475]
39
+ * {Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics}[http://dx.doi.org/10.1093/bioinformatics/bts080]
40
+
41
+ == Biogems.info
42
+
43
+ This Biogem is published at http://biogems.info/index.html#bio-biostars-analytics
44
+
45
+ == Copyright
46
+
47
+ Copyright (c) 2014 Joachim Baran. See LICENSE.txt for further details.
48
+
data/Rakefile ADDED
@@ -0,0 +1,46 @@
1
+ # encoding: utf-8
2
+
3
+ require 'rubygems'
4
+ require 'bundler'
5
+ begin
6
+ Bundler.setup(:default, :development)
7
+ rescue Bundler::BundlerError => e
8
+ $stderr.puts e.message
9
+ $stderr.puts "Run `bundle install` to install missing gems"
10
+ exit e.status_code
11
+ end
12
+ require 'rake'
13
+
14
+ require 'jeweler'
15
+ Jeweler::Tasks.new do |gem|
16
+ # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
17
+ gem.name = 'bio-biostars-analytics'
18
+ gem.homepage = 'http://github.com/joejimbo/bioruby-biostars-analytics'
19
+ gem.license = 'MIT'
20
+ gem.summary = %Q{Biostars data-mining and statistical analysis.}
21
+ gem.description = %Q{Ruby script for data-mining biostars.org using web-crawling techniques as well as utilizing the Biostars RESTful API. Statistical analysis requires R (http://www.r-project.org).}
22
+ gem.email = 'joachim.baran@gmail.com'
23
+ gem.authors = [ 'Joachim Baran' ]
24
+ gem.executables = [ 'biostars-analytics', 'biostar_api_stats', 'biostar_crawled_stats' ]
25
+ # dependencies defined in Gemfile
26
+ end
27
+ Jeweler::RubygemsDotOrgTasks.new
28
+
29
+ require 'rake/testtask'
30
+ Rake::TestTask.new(:test) do |test|
31
+ test.libs << 'lib' << 'test'
32
+ test.pattern = 'test/**/test_*.rb'
33
+ test.verbose = true
34
+ end
35
+
36
+ task :default => :test
37
+
38
+ require 'rdoc/task'
39
+ Rake::RDocTask.new do |rdoc|
40
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
41
+
42
+ rdoc.rdoc_dir = 'rdoc'
43
+ rdoc.title = "bio-biostars-analytics #{version}"
44
+ rdoc.rdoc_files.include('README*')
45
+ rdoc.rdoc_files.include('lib/**/*.rb')
46
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 0.1.0
@@ -0,0 +1,260 @@
1
+ #!/usr/bin/ruby
2
+
3
+ analysis_script = <<EOR
4
+ args <- commandArgs(TRUE)
5
+
6
+ input_tsv_file <- 'INPUT_FILE_NAME'
7
+
8
+ bioeval <- function(base, column, xlabel, ylabel, colour, scale, average=TRUE) {
9
+ biolevels <- levels(factor(base$age_in_month))
10
+ avg <- rep(0, times=length(biolevels))
11
+ for (i in 1:length(biolevels)) {
12
+ avg[i] <- mean(
13
+ column[ base$age_in_month == biolevels[i] ]
14
+ )
15
+ }
16
+
17
+ plot(
18
+ base$age_in_month,
19
+ t(column),
20
+ col=colour,
21
+ pch=19,
22
+ xlab=xlabel,
23
+ ylab=ylabel,
24
+ axes=FALSE,
25
+ ylim=c(min(column), max(column)/scale) # Scale, so that outliers don't squish the important details.
26
+ )
27
+ if (average) {
28
+ lines(biolevels, avg, col=rgb(.3,.3,.3))
29
+ }
30
+ # Trendline over all data:
31
+ abline(lm(column ~ base$age_in_month), col=rgb(0,.9,1.0), lwd=6)
32
+ # Trendline over the last 24 months:
33
+ lookback <- 24
34
+ lookback_months <- base$age_in_month[base$age_in_month > (max(base$age_in_month) - lookback)]
35
+ trend <- lm(tail(column, length(lookback_months)) ~ lookback_months)
36
+ trend_prediction <- predict(trend, newdata=data.frame(lookback_months=lookback_months))
37
+ lines(x=lookback_months, y=trend_prediction, col=rgb(1.0,0,.6), lwd=6)
38
+
39
+ # Draw axis:
40
+ axis(1,min(base$age_in_month):max(base$age_in_month))
41
+ axis(2,min(column):max(column))
42
+ }
43
+
44
+ biobarplot <- function(userstats, age, max_y, xlabel, ylabel, colour) {
45
+ barplot(
46
+ userstats,
47
+ col=colour,
48
+ xlab=xlabel,
49
+ ylab=ylabel,
50
+ ylim=c(0,max_y)
51
+ )
52
+ }
53
+
54
+ biostar <- read.table(
55
+ input_tsv_file,
56
+ sep="\t",
57
+ encoding="UTF-8",
58
+ row.names=NULL,
59
+ comment.char="",
60
+ col.names=c(
61
+ "age",
62
+ "date",
63
+ "year",
64
+ "month",
65
+ "day",
66
+ "new_posts_in_category_1",
67
+ "new_posts_in_category_2",
68
+ "new_posts_in_category_3",
69
+ "new_posts_in_category_4",
70
+ "new_posts_in_category_5",
71
+ "new_posts_in_category_6",
72
+ "new_posts_in_category_7",
73
+ "new_posts_in_category_8",
74
+ "new_posts_in_category_9",
75
+ "new_posts_in_category_10",
76
+ "new_posts_in_category_11",
77
+ "new_posts_in_category_12",
78
+ "new_posts_in_category_13",
79
+ "new_posts_in_category_14",
80
+ "new_posts_in_category_15",
81
+ "new_posts_in_category_16",
82
+ "new_votes_of_type_accept",
83
+ "new_votes_of_type_bookmark",
84
+ "new_votes_of_type_downvote",
85
+ "new_votes_of_type_upvote",
86
+ "new_posts",
87
+ "new_votes",
88
+ "new_users",
89
+ "posters",
90
+ "poster_ages",
91
+ "root_post_ages",
92
+ "vote_post_ages",
93
+ "biostarbabies",
94
+ "empty"
95
+ )
96
+ )
97
+
98
+ biostar$age_in_month <- (
99
+ (
100
+ biostar$year -
101
+ rep(min(biostar$year), times=length(biostar$year))
102
+ )*12 +
103
+ biostar$month
104
+ )
105
+ biostar$age_in_month <- biostar$age_in_month -
106
+ rep(min(biostar$age_in_month), times=length(biostar$age_in_month)) +
107
+ 1
108
+
109
+ # Which users were active in a month?
110
+ users_per_month <- lapply(unique(sort(biostar$age_in_month)), function (age_in_month) { unique(sort(sapply(strsplit(paste(biostar[biostar$age_in_month == age_in_month, ]$posters, collapse = ","), split = ","), as.integer))) })
111
+
112
+ # How many users were active in a month?
113
+ userfreq <- rep(0, max(biostar$age_in_month))
114
+ for (age_in_month in min(biostar$age_in_month):max(biostar$age_in_month)) {
115
+ userfreq[age_in_month] <- length(unlist(users_per_month[age_in_month]))
116
+ }
117
+
118
+ # Determine the number of months for which users have been active.
119
+ # 1. over the whole time span
120
+ # 2. except for the last three months
121
+ useractivity <- table(as.numeric(table(unlist(users_per_month))))
122
+ useractivity_wo_last_3_months <- table(as.numeric(table(unlist(users_per_month[c(seq(4, max(length(users_per_month))))]))))
123
+
124
+ png("api_participation.png", height=900, width=1300, unit="px", pointsize=26)
125
+ biobarplot(
126
+ useractivity,
127
+ max(biostar$age_in_month),
128
+ (floor(useractivity[1] / 1000) + 1) * 1000,
129
+ 'Month of Participation',
130
+ 'Number of Active Users',
131
+ "#ee6633"
132
+ )
133
+ dev.off()
134
+
135
+ png("api_participation_wo_last_3_months.png", height=900, width=1300, unit="px", pointsize=26)
136
+ biobarplot(
137
+ useractivity_wo_last_3_months,
138
+ max(biostar$age_in_month),
139
+ (floor(useractivity_wo_last_3_months[1] / 1000) + 1) * 1000,
140
+ 'Month of Participation',
141
+ 'Number of Active Users',
142
+ "#ee6633"
143
+ )
144
+ dev.off()
145
+
146
+ userfreq_table <- as.table(userfreq)
147
+ rownames(userfreq_table) <- seq(length(userfreq))
148
+ png("api_users.png", height=900, width=1300, unit="px", pointsize=26)
149
+ biobarplot(
150
+ userfreq_table,
151
+ max(biostar$age_in_month),
152
+ 600,
153
+ 'Biostar Month',
154
+ 'Number of Active Users',
155
+ "#ff2233"
156
+ )
157
+ dev.off()
158
+
159
+ categories_images = c(
160
+ "api_category_1.png",
161
+ "api_category_2.png",
162
+ "api_category_3.png",
163
+ "api_category_4.png",
164
+ "api_category_5.png",
165
+ "api_category_6.png",
166
+ "api_category_7.png",
167
+ "api_category_8.png",
168
+ "api_category_9.png",
169
+ "api_category_10.png",
170
+ "api_category_11.png",
171
+ "api_category_12.png",
172
+ "api_category_13.png",
173
+ "api_category_14.png",
174
+ "api_category_15.png",
175
+ "api_category_16.png",
176
+ "api_upvotes.png",
177
+ "api_downvotes.png",
178
+ "api_bookmarks.png",
179
+ "api_accepts.png"
180
+ )
181
+ categories_values = list(
182
+ biostar$new_posts_in_category_1,
183
+ biostar$new_posts_in_category_2,
184
+ biostar$new_posts_in_category_3,
185
+ biostar$new_posts_in_category_4,
186
+ biostar$new_posts_in_category_5,
187
+ biostar$new_posts_in_category_6,
188
+ biostar$new_posts_in_category_7,
189
+ biostar$new_posts_in_category_8,
190
+ biostar$new_posts_in_category_9,
191
+ biostar$new_posts_in_category_10,
192
+ biostar$new_posts_in_category_11,
193
+ biostar$new_posts_in_category_12,
194
+ biostar$new_posts_in_category_13,
195
+ biostar$new_posts_in_category_14,
196
+ biostar$new_posts_in_category_15,
197
+ biostar$new_posts_in_category_16,
198
+ biostar$new_votes_of_type_upvote,
199
+ biostar$new_votes_of_type_downvote,
200
+ biostar$new_votes_of_type_bookmark,
201
+ biostar$new_votes_of_type_accept
202
+ )
203
+ categories_labels = c(
204
+ "Questions per Day",
205
+ "Answers per Day",
206
+ "Comments per Day",
207
+ "Tutorials per Day",
208
+ "Blogs Posts per Day",
209
+ "Forums Posts per Day",
210
+ "News per Day",
211
+ " -- Unknown, sorry, no time to look it up now -- ",
212
+ "Tool Announcements per Day",
213
+ "FixMes per Day",
214
+ "Videos per Day",
215
+ "Job Postings per Day",
216
+ "Research Papers per Day",
217
+ "Tips per Day",
218
+ "Polls per Day",
219
+ "Ads per Day",
220
+ "Upvotes per Day",
221
+ "Downvotes per Day",
222
+ "Bookmarks per Day",
223
+ "Accepts per Day"
224
+ )
225
+ for (category in seq(length(categories_images))) {
226
+ png(categories_images[category], height=900, width=1300, unit="px", pointsize=26)
227
+ bioeval(
228
+ biostar,
229
+ unlist(categories_values[category]),
230
+ 'Biostar Month',
231
+ categories_labels[category],
232
+ rgb(100,100,0,20,maxColorValue=255),
233
+ 2
234
+ )
235
+ dev.off()
236
+ }
237
+ EOR
238
+
239
+ R = '/usr/bin/R'
240
+
241
+ unless File.exist?(R) then
242
+ puts 'Please install R (http://www.r-project.org) as: /usr/bin/R'
243
+ puts ''
244
+ puts 'Also, install the plyr package via: install.packages("plyr")'
245
+ exit 1
246
+ end
247
+
248
+ if ARGV.length != 1 then
249
+ puts 'Usage: biostar_api_stats apitsvfile'
250
+ puts ''
251
+ puts 'Example (data provided at http://github.com/joejimbo/bioruby-biostars-analytics):'
252
+ puts ' biostar_api_stats data/20140328_api.tsv'
253
+ exit 2
254
+ end
255
+
256
+ IO.popen("#{R} --no-save", 'w') { |io|
257
+ io.puts(analysis_script.sub('INPUT_FILE_NAME', ARGV[0]))
258
+ io.close_write
259
+ }
260
+
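The embedded R script derives `age_in_month` in two steps: it first computes `(year - min(year)) * 12 + month` for every row, then shifts the result so that the earliest month in the data becomes month 1. The same arithmetic can be sketched in Ruby; `age_in_month` below is a hypothetical helper written for illustration and does not exist in the gem.

```ruby
# Hypothetical helper mirroring the R script's age_in_month computation.
# `rows` is an array of [year, month] pairs, one per input record.
def age_in_month(rows)
  min_year = rows.map(&:first).min

  # Step 1: months elapsed since January of the earliest year, plus the month.
  raw = rows.map { |year, month| (year - min_year) * 12 + month }

  # Step 2: shift so that the earliest month in the data has index 1.
  offset = raw.min
  raw.map { |value| value - offset + 1 }
end

# Records from October 2010, December 2010 and January 2011:
p age_in_month([[2010, 10], [2010, 12], [2011, 1]])
# → [1, 3, 4]
```

Months without any records leave gaps in the index (month 2 in the example), which matches the R code's behaviour of indexing by calendar distance and not by row position.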