RubyGems - bio-rdf - Versions diffs - 0.0.1.pre1 → 0.0.1 - Mend

bio-rdf 0.0.1.pre1 → 0.0.1


Files changed (17)

data/Gemfile +4 -1
data/README.md +53 -3
data/VERSION +1 -1
data/bin/bio-rdf +35 -4
data/doc/design.md +20 -0
data/features/parse_broad_gsea_cls.feature +18 -0
data/features/parse_broad_gsea_cls.rb +13 -0
data/features/parse_broad_gsea_results.feature +29 -0
data/features/parse_broad_gsea_results.rb +59 -0
data/features/support/env.rb +13 -0
data/lib/bio-rdf.rb +2 -1
data/lib/bio-rdf/parsers/gsea/broadgsea.rb +161 -0
data/spec/spec_helper.rb +12 -0
data/test/data/parsers/gsea/Run1_C2.SUMMARY.RESULTS.REPORT.0.txt +1066 -0
data/test/data/parsers/gsea/Run1_C2.SUMMARY.RESULTS.REPORT.1.txt +474 -0
metadata +62 -21
data/README.rdoc +0 -46

data/Gemfile CHANGED

@@ -2,13 +2,16 @@ source "http://rubygems.org"
 # Add dependencies required to use your gem here.
 # Example:
 #   gem "activesupport", ">= 2.3.5"
+gem "bio-logger"
 # Add dependencies to develop your gem here.
 # Include everything needed to run rake, tests, features, etc.
 group :development do
   gem "shoulda", ">= 0"
+  gem "rspec", ">= 0"
+  gem "cucumber", ">= 0"
   gem "rdoc", "~> 3.12"
-  gem "bundler", "~> 1.0.0"
+  gem "bundler", ">= 1.0.0"
   gem "jeweler", "~> 1.8.3"
   gem "bio", ">= 1.4.2"
   gem "rdoc", "~> 3.12"

data/README.md CHANGED

@@ -2,9 +2,59 @@
 [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-rdf.png)](http://travis-ci.org/pjotrp/bioruby-rdf)
-Full description goes here
+Library and tools for using a triple-store with biological data.  It
+includes tools for storing parsed data into a triple store. The name
+includes RDF, the XML representation of triples, but that really is
+too narrow a view of the purpose of this biogem. The alternative names
+(bio-semweb and bio-triplestore) looked even worse.
-Note: this software is under active development!
+Every data-type has a Parser module. This parser module controls the
+parsing flow. The actual parsing is handled by lower level routines,
+which may even reside in other libraries, such as BioRuby. The basic
+flow is
+  input -> parse -> output
+The *input* can be anything, from directories, files to web based
+resources.
+The *output* of the parser should be in some form of triple format,
+though simple tab delimited tables can also be supported (depending on
+the parser).
+The first functionality includes parsing the results of gene set
+enrichment analysis
+([GSEA](http://www.broadinstitute.org/gsea/index.jsp)) into triples
+(more below).
+This project is linked with next generation sequencing, genome
+browsing, visualisation and QTL mapping.  E.g.
+* [bio-ngs](http://www.biogems.info/#bio-ngs)
+* [bio-ucsc-api](http://www.biogems.info/#bio-ucsc-api)
+* [bio-qtlHD](http://www.biogems.info/#bio-qtlHD)
+Note: this software is under active development! See also the [design
+doc](https://github.com/pjotrp/bioruby-rdf/blob/master/doc/design.md).
+## Examples
+### Gene set enrichment analysis (GSEA)
+GSEA is a computational method that determines whether an a priori
+defined set of genes shows statistically significant, concordant
+differences between two biological states. The [GSEA
+tool](http://www.broadinstitute.org/gsea/index.jsp) produces two
+result files for every two biological states. We wrote a parser
+for the summary files, which outputs a single table of results
+(based on a cut-off value). This table can be converted into a
+triple-store.
+To create a tab delimited file from a GSEA result, where FDR <= 0.25:
+```bash
+  bio-rdf gsea --tabulate --exec "rec.fdr <= 0.25" ./gsea/output/ > results.txt
+```
 ## Installation
@@ -15,7 +65,7 @@ Note: this software is under active development!
 ## Usage
 ```ruby
-    require 'bio-rdf
+    require 'bio-rdf'
 ```
 The API doc is online. For more code examples see the test files in
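
The README hunk above describes an `input -> parse -> output` flow with an FDR cut-off. The following is a minimal standalone sketch of that flow, not the gem's own code: it splits tab-delimited GSEA summary lines, keeps those whose FDR q-value (the seventh column, per the header shown in the feature files) is at or below a cut-off, and prints them tab-delimited. The helper names `parse_line` and `filter_by_fdr` are hypothetical, and the two input rows are made-up illustrative values.

```ruby
# Hypothetical helpers illustrating input -> parse -> output; not part of bio-rdf.
FDR_COLUMN = 6 # zero-based index of "FDR q-val" in the GSEA summary header

# parse: split one tab-delimited result line into its fields
def parse_line(line)
  line.strip.split("\t")
end

# filter: keep records whose FDR q-value is at or below the cut-off
def filter_by_fdr(lines, cutoff)
  lines.map { |l| parse_line(l) }
       .select { |fields| fields[FDR_COLUMN].to_f <= cutoff }
end

# input: made-up rows in the GS, SIZE, SOURCE, ES, NES, NOM p-val, FDR q-val, ... layout
input = [
  "SET_A\t25\thttp://example.org/SET_A\t0.55\t1.79\t0.004\t0.10\t0.64",
  "SET_B\t31\thttp://example.org/SET_B\t0.63\t1.76\t0.000\t1.00\t0.75"
]

# output: one tab-delimited row per surviving record
filter_by_fdr(input, 0.25).each { |fields| puts fields.join("\t") }
```

Keeping parsing and filtering in separate functions mirrors the README's point that the parser module controls the flow while lower-level routines do the actual parsing.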

data/VERSION CHANGED

@@ -1 +1 @@
-0.0.1.pre1
+0.0.1

data/bin/bio-rdf CHANGED

@@ -4,7 +4,24 @@
 # Author:: Pjotr Prins
 # Copyright:: 2012
- USAGE = "Describe bio-rdf"
+rootpath = File.dirname(File.dirname(__FILE__))
+$: << File.join(rootpath,'lib')
+version = File.new(File.join(rootpath,'VERSION')).read.chomp
+print "bio-rdf #{version} by Pjotr Prins (c) 2012\n"
+USAGE = <<EOP
+  Usage: bio-rdf command [options]
+  Valid commands reflect parsers and are:
+    gsea : Gene set enrichment analysis
+  For more information on a command use the --help switch
+EOP
 if ARGV.size == 0
   print USAGE
@@ -12,11 +29,20 @@ end
 require 'bio-rdf'
 require 'optparse'
+require 'ostruct'
 # Uncomment when using the bio-logger
-# require 'bio-logger'
-# Bio::Log::CLI.logger('stderr')
-# Bio::Log::CLI.trace('info')
+require 'bio-logger'
+Bio::Log::CLI.logger('stderr')
+Bio::Log::CLI.trace('info')
+case ARGV[0]
+  when 'gsea'
+    ARGV.shift
+    BioRdf::Parsers::BroadGSEA::Parser::handle_options
+    exit 0
+end
 options = {:example_switch=>false,:show_help=>false}
 opts = OptionParser.new do |o|
@@ -64,6 +90,11 @@ end
 begin
   opts.parse!(ARGV)
+  if options[:show_help]
+    print USAGE
+    exit 0
+  end
   # Uncomment the following when using the bio-logger
   # Bio::Log::CLI.configure('bio-rdf')

data/doc/design.md ADDED

@@ -0,0 +1,20 @@
+# Semantic web for BioRuby!
+In this document we describe using a triple store for bioinformatics,
+mostly using Ruby. While the semantic web is still, mostly, vapourware in
+biology, the ideas and tools can be very useful for reasoning about
+relationships between genes, pathways, enrichment etc. In this library
+we aim to use a local triple store, feed it with information, query it
+using [SPARQL](http://en.wikipedia.org/wiki/SPARQL), and provide it
+with a nice user interface for biologists. Triples may link-out to
+other semantic web connections.
+Enjoy,
+Pjotr Prins
+## Loading the triple store
+## Querying the triple store
+## User interface

data/features/parse_broad_gsea_cls.feature ADDED

@@ -0,0 +1,18 @@
+Feature: Parse GSEA cls file
+  To get the phenotype class in a Broad Institute GSEA result
+  we need to parse the CLS file:
+  Categorical (e.g. tumor vs normal) class file format (*.cls)
+  The CLS file format defines phenotype (class or template) labels and
+  associates each sample in the expression data with a label. The CLS file
+  format uses spaces or tabs to separate the fields.
+  Scenario: Parse CLS file
+    Given I have a CLS file which contains
+    """
+26 2 1
+# RS13482013  RS13482013_1
+0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 1 0
+    """
+    Then I should fetch the phenotype names RS13482013 and RS13482013_1
+    And I should be able to fetch the classes into an array

data/features/parse_broad_gsea_cls.rb ADDED

@@ -0,0 +1,13 @@
+Given /^I have a CLS file which contains$/ do |buf|
+  @rec = BioRdf::Parsers::BroadGSEA::ParseClsRecord.new(buf)
+end
+Then /^I should fetch the phenotype names RS(\d+) and RS(\d+)_(\d+)$/ do |arg1, arg2, arg3|
+  @rec.classnames.should == ['RS13482013','RS13482013_1']
+end
+Then /^I should be able to fetch the classes into an array$/ do
+  @rec.classes.should ==
+  ["0", "0", "0", "1", "1", "0", "1", "0", "0", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1", "1", "1", "1", "0", "0", "1", "0"]
+end

data/features/parse_broad_gsea_results.feature ADDED

@@ -0,0 +1,29 @@
+Feature: Parse GSEA results
+  To get the enrichment values in a Broad Institute GSEA result file
+  we need to parse the tab delimited results file. An example is
+  GS      SIZE    SOURCE  ES      NES     NOM p-val       FDR q-val       FWER p-val      Tag \%  Gene \% Signal  FDR (median)    glob.p.val
+  BIOCARTA_RACCYCD_PATHWAY        25      http://www.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_RACCYCD_PATHWAY.html   0.55588 1.7947  0.004149        1       0.647   0.44    0.198   0.354   1       0.633
+  REACTOME_MRNA_3_END_PROCESSING  31      http://www.broadinstitute.org/gsea/msigdb/cards/REACTOME_MRNA_3_END_PROCESSING.html     0.6396  1.7613  0       1       0.752   0.613   0.242   0.466   1       0.579
+  (...)
+  Scenario: Parse one line in a Broad GSEA results file
+    Given I have a Broad GSEA results file which contains the line
+    """
+BIOCARTA_RACCYCD_PATHWAY        25      http://www.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_RACCYCD_PATHWAY.html   0.55588 1.7947  0.004149        1       0.647   0.44    0.198   0.354   1       0.633
+    """
+    Then I should be able to the name of the geneset BIOCARTA_RACCYCD_PATHWAY
+    And I should be able to fetch all values as a list
+    And I should be able to fetch all other values (lazily), where
+    And I should be able to fetch the source
+    And ES is 0.55588
+    And NES is 1.7947
+    And p-value is 0.004149
+    And FDR is 1
+    And global p-value is 0.633
+    And Median FDR is 1
+  Scenario: Parse a Broad GSEA results file and filter results
+    Given I have a Broad GSEA results file with multiple lines
+    Then I should be able to return all records with an FDR of less than 0.25

data/features/parse_broad_gsea_results.rb ADDED

@@ -0,0 +1,59 @@
+Given /^I have a Broad GSEA results file which contains the line$/ do |string|
+  @rec = BioRdf::Parsers::BroadGSEA::ParseResultRecord.new(string.gsub(/\s+/,"\t"))
+end
+Then /^I should be able to fetch all values as a list$/ do
+  @rec.to_list.should == ["BIOCARTA_RACCYCD_PATHWAY", "25", "http://www.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_RACCYCD_PATHWAY.html", "0.55588", "1.7947", "0.004149", "1", "0.647", "0.44", "0.198", "0.354", "1", "0.633"]
+end
+Then /^I should be able to fetch all other values \(lazily\), where$/ do
+end
+Then /^I should be able to the name of the geneset BIOCARTA_RACCYCD_PATHWAY$/ do
+  @rec.geneset_name.should == "BIOCARTA_RACCYCD_PATHWAY"
+end
+Then /^I should be able to fetch the source$/ do
+  @rec.source.should == "http://www.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_RACCYCD_PATHWAY.html"
+end
+Then /^ES is (\d+)\.(\d+)$/ do |arg1, arg2|
+  @rec.es.should == (arg1+'.'+arg2).to_f
+end
+Then /^NES is (\d+)\.(\d+)$/ do |arg1, arg2|
+  @rec.nes.should == (arg1+'.'+arg2).to_f
+end
+Then /^p\-value is (\d+)\.(\d+)$/ do |arg1, arg2|
+  @rec.nominal_p_value.should == (arg1+'.'+arg2).to_f
+end
+Then /^FDR is (\d+)$/ do |arg1|
+  @rec.fdr.should == (arg1).to_f
+end
+Then /^q\-value is (\d+)\.(\d+)$/ do |arg1, arg2|
+  @rec.fdr_q_value.should == (arg1+'.'+arg2).to_f
+end
+Then /^global p\-value is (\d+)\.(\d+)$/ do |arg1, arg2|
+  @rec.global_p_value.should == (arg1+'.'+arg2).to_f
+end
+Then /^Median FDR is (\d+)$/ do |arg1|
+  @rec.median_fdr.should == (arg1).to_f
+end
+# --- multi line parsing
+Given /^I have a Broad GSEA results file with multiple lines$/ do
+  @gsea_results = BioRdf::Parsers::BroadGSEA::ParseResultFile.new("./test/data/parsers/gsea/Run1_C2.SUMMARY.RESULTS.REPORT.0.txt")
+end
+Then /^I should be able to return all records with an FDR of less than (\d+)\.(\d+)$/ do |arg1, arg2|
+  recs = @gsea_results.find_all { | rec | rec.fdr_q_value < 0.85 }
+  recs.size.should == 70
+end

data/features/support/env.rb ADDED

@@ -0,0 +1,13 @@
+require 'bundler'
+begin
+  Bundler.setup(:default, :development)
+rescue Bundler::BundlerError => e
+  $stderr.puts e.message
+  $stderr.puts "Run `bundle install` to install missing gems"
+  exit e.status_code
+end
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../../lib')
+require 'bio-rdf'
+require 'rspec/expectations'

data/lib/bio-rdf.rb CHANGED

@@ -8,5 +8,6 @@
 #
 # In this file only require other files. Avoid other source code.
-require 'bio-rdf/rdf.rb'
+require 'bio-rdf/rdf.rb'
+require 'bio-rdf/parsers/gsea/broadgsea'

data/lib/bio-rdf/parsers/gsea/broadgsea.rb ADDED

@@ -0,0 +1,161 @@
+module BioRdf
+  module Parsers
+    module BroadGSEA
+      # Parses a 3 line CLS record (see features for an example)
+      class ParseClsRecord
+        attr_reader :classnames, :classes
+        def initialize buf
+          lines = buf.split("\n")
+          raise "CLS record should be 3 lines" if lines.size != 3
+          classline = lines[1]
+          raise "Second line should start with #" if classline[0] != "#"
+          @classnames = classline.split(/\s+/)[1..2]
+          @classes = lines[2].split(/\s+/)
+        end
+      end
+      # Parses a single line result lazily (see features for an example)
+      #
+      # GS SIZE SOURCE ES NES NOM-p-val FDR-q-val FWER-p-val Tag% Gene% Signal FDR_(median) glob.p.val
+      class ParseResultRecord
+        def initialize string
+          @fields = string.strip.split(/\t/)
+        end
+        def to_list
+          @fields
+        end
+        def geneset_name
+          @fields[0]
+        end
+        def source
+          @fields[2]
+        end
+        # ES: Enrichment score for the gene set; that is, the degree to which
+        # this gene set is overrepresented at the top or bottom of the ranked
+        # list of genes in the expression dataset.
+        def es
+          @es ||= @fields[3].to_f
+        end
+        # NES: Normalized enrichment score; that is, the enrichment score for
+        # the gene set after it has been normalized across analyzed gene sets.
+        def nes
+          @nes ||= @fields[4].to_f
+        end
+        # NOM p-value: Nominal p value; that is, the statistical significance
+        # of the enrichment score. The nominal p value is not adjusted for gene
+        # set size or multiple hypothesis testing; therefore, it is of limited
+        # use in comparing gene sets.
+        def nominal_p_value
+          @nominal_p_value ||= @fields[5].to_f
+        end
+        # FDR q-value: False discovery rate; that is, the estimated probability
+        # that the normalized enrichment score (NES) represents a false
+        # positive finding. For example, an FDR of 25% indicates that the
+        # result is likely to be valid 3 out of 4 times.
+        def fdr_q_value
+          @fdr_q_value ||= @fields[6].to_f
+        end
+        alias :fdr :fdr_q_value
+        # FWER p-value: Familywise-error rate; that is, a more conservatively
+        # estimated probability that the normalized enrichment score represents
+        # a false positive finding. Because the goal of GSEA is to generate
+        # hypotheses, the GSEA team recommends focusing on the FDR statistic.
+        def fwer_p_value
+          @fwer_p_value ||= @fields[7].to_f
+        end
+        def signal
+          @signal ||= @fields[10].to_f
+        end
+        def median_fdr
+          @median_fdr ||= @fields[11].to_f
+        end
+        def global_p_value
+          @global_p_value ||= @fields[12].to_f
+        end
+      end
+      class ParseResultFile
+        include Enumerable
+        def initialize filename
+          @list = []
+          File.open(filename) do |f| # block form closes the file when done
+            f.gets # skip header
+            f.each_line do | line |
+              @list << ParseResultRecord.new(line)
+            end
+          end
+        end
+        def each
+          @list.each do | rec |
+            yield rec
+          end
+        end
+      end
+      module Parser
+        def Parser::handle_options
+          options = OpenStruct.new()
+          opts = OptionParser.new() do |o|
+            o.banner = "Usage: #{File.basename($0)} gsea [options] dir"
+            o.on_tail("-h", "--help", "Show help and examples") {
+              print(o)
+              exit()
+            }
+            o.on("-e filter","--exec filter",String, "Execute filter") do |s|
+              options.exec = s
+            end
+            o.on("--tabulate","Output tab delimited table") do
+              options.output = :tabulate
+            end
+          end
+          opts.parse!(ARGV)
+          dir = ARGV[0]
+          if dir and File.directory?(dir)
+            do_parse(dir, options.exec, options.output)
+          else
+            raise "you should supply a GSEA directory!"
+          end
+        end
+        require 'bio-logger'
+        include Bio::Log
+        def Parser::do_parse input, filter, output
+          log = LoggerPlus.new 'gsea'
+          log.level = INFO
+          log.outputters = Outputter.stderr
+          log.warn("Fetching "+input)
+          print "Marker\tGenotype\tGS\tSIZE\tSOURCE\tES\tNES\tNOM p-val\tFDR q-val\tFWER p-val\tTag \%\tGene \%\tSignal\tFDR (median)\tglob.p.val\n"
+          Dir.foreach(input) do |entry| # two step search, because of many dirs
+            next if entry == '.' or entry == '..'
+            log.info("Parsing directory "+entry)
+            resultfilenames = File.join(input,entry,"*SUMMARY.RESULTS.REPORT.[01].txt")
+            clsfilename = File.join(input,entry,"cls")
+            # log.info(resultfilenames)
+            Dir.glob(resultfilenames) do |fn|
+              genotype = "A"
+              genotype = "B" if fn =~ /\.1\.txt$/
+              marker = "unknown"
+              # fetch marker name
+              if File.exist?(clsfilename)
+                cls = BioRdf::Parsers::BroadGSEA::ParseClsRecord.new(File.read(clsfilename))
+                marker = cls.classnames[0]
+              end
+              gsea_results = BioRdf::Parsers::BroadGSEA::ParseResultFile.new(fn)
+              recs = gsea_results.find_all { | rec | rec.fdr_q_value <= 0.25 }
+              recs.each do | rec |
+                print "#{marker}\t#{genotype}\t"+rec.to_list.join("\t"),"\n"
+              end
+            end
+          end
+        end
+      end
+    end
+  end
+end
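
`handle_options` passes the `--exec` string to `do_parse` as `filter`, but the selection in `do_parse` as shown uses a fixed `rec.fdr_q_value <= 0.25`. One way such an expression could be applied, offered as an assumption rather than the author's design, is to compile it into a lambda with `eval`. The `Record` struct and `compile_filter` below are hypothetical stand-ins. Evaluating a user-supplied string runs arbitrary Ruby, so this is only reasonable for trusted command-line input.

```ruby
# Stand-in record type for illustration; it only models the fields used here.
Record = Struct.new(:geneset_name, :fdr)

# Compile a filter expression such as "rec.fdr <= 0.25" into a callable.
# WARNING: eval runs arbitrary Ruby, so only use this on trusted input.
def compile_filter(expr)
  # No expression given: accept every record
  return ->(_rec) { true } if expr.nil? || expr.strip.empty?
  eval("lambda { |rec| #{expr} }")
end

records = [Record.new("SET_A", 0.10), Record.new("SET_B", 1.0)]
keep = compile_filter("rec.fdr <= 0.25")
puts records.select { |rec| keep.call(rec) }.map(&:geneset_name).join(",")
```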