protk 1.4.1 → 1.4.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 7329f51a45b5449ec979e76aca5727c6714a5bc8
- data.tar.gz: e96f553b27c61c7ba1935d379e01086e9cb00725
+ metadata.gz: 7377c1480498f852b7e747d13e9a7d985523fcef
+ data.tar.gz: 2cb2c652e53ec636fb521cb35a687324ee810af8
  SHA512:
- metadata.gz: fb933aa9ce0cc6fabb19b0a731bb8d74f23456937ec4d08973477f8f956eb733fcb44b26b66fc61242f5ea6c617d0c3a178b83f16dc568ad5f330db9dcd27c1d
- data.tar.gz: 5b2b370cea53d3a3ec9eee9d5916df8f910ef12181c0bce9b660c1d82d042e00a5b66ce7fcba6a864cd1a6266aa2fbffec998ae26700f4ddd7903ab141ba3241
+ metadata.gz: c4e72457cc9ada490ea6210c9d13e6e5d240c0d19399c5b015feb30cebb881b51bae62cb8d4aa831d8aee397a747d5c16b1433f0ea08656474a56270f709b3d7
+ data.tar.gz: e8893c4fda75666fdf4ed3cb6d6dc0bb34ceb3041c2b19883e82891aa5245a4a7aa99cc4fe677f2fb6f28c2b12d5bbb89a213a59e3f1f0e4d8dc927a9f6ff510
data/README.md CHANGED
@@ -22,7 +22,10 @@ Protk is a ruby gem and requires ruby 2.0 or higher with support for libxml2. To
  gem install protk
  ```
 
+ ## Ruby Compatibility
 
+ In general Protk requires ruby >= 2.0.
+ Do not use ruby 2.1.5, as it has a bug that causes a deadlock when open4 child processes write to stderr.
 
  ## Usage
 
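The version constraint above can also be checked at runtime. A minimal sketch (hedged: protk itself may not ship such a guard; the check simply hard-codes the bad 2.1.5 version noted above):

```ruby
# Refuse to run on interpreters protk is known not to support.
# Ruby 2.1.5 is singled out because of the open4/stderr deadlock
# described in the README note above.
def check_ruby_version!(version = RUBY_VERSION)
  if Gem::Version.new(version) < Gem::Version.new("2.0")
    raise "protk requires ruby >= 2.0 (found #{version})"
  end
  if version == "2.1.5"
    raise "ruby 2.1.5 deadlocks with open4 child processes; use another version"
  end
  true
end
```

Calling `check_ruby_version!` early in a script fails fast with a clear message instead of hanging mid-run.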
@@ -60,32 +63,28 @@ By default protk will install tools and databases into `.protk` in your home dir
  ```
 
 
- ## Sequence databases
 
- Protk also includes a script called manage_db.rb to install specific sequence databases for use by the search engines if desired. Databases installed via manage_db.rb can be invoked using a shorthand name rather than a full path to a fasta file, and Protk also provides some automation for database upgrades. Protk comes with several predefined database configurations. For example, to install a database consisting of human entries from Swissprot plus known contaminants use the following commands;
+ ## Galaxy Integration
 
- ```sh
- manage_db.rb add --predefined crap
- manage_db.rb add --predefined sphuman
- manage_db.rb update crap
- manage_db.rb update sphuman
- ```
+ Many protk tools have equivalent galaxy wrappers available on the [galaxy toolshed](http://toolshed.g2.bx.psu.edu/), with source code and development occurring in the [protk-galaxytools](https://github.com/iracooke/protk-galaxytools) repository on github. In order for these tools to work you will also need to make sure that protk, as well as the necessary third party dependencies, are available to galaxy during tool execution.
 
- You should now be able to run database searches, specifying this database by using the -d sphuman flag. Every month or so swissprot will release a new database version. You can keep your database up to date using the manage_db.rb update command. This will update the database only if any of its source files (or ftp release notes) have changed. The manage_db.rb tool also allows completely custom databases to be configured. Setup requires adding quite a few command-line options but once setup, databases can easily be updated without further config. The example below shows the commandline arguments required to manually configure the sphuman database.
+ There are two ways to do this:
 
- ```sh
- manage_db.rb add --ftp-source 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt' --include-filters '/OS=Homo\ssapiens/' --id-regex 'sp\|.*\|(.*?)\s' --add-decoys --make-blast-index --archive-old sphuman
- ```
+ **Using Docker:**
 
- ## Galaxy Integration
+ By far the easiest way to do this is to set up your Galaxy instance to run tools in Docker containers. All the tools in the [protk-galaxytools](https://github.com/iracooke/protk-galaxytools) repository are designed to work with [this](https://github.com/iracooke/protk-dockerfile) docker image, and will download and use the image automatically on appropriately configured Galaxy instances.
+
+ **Manual Install:**
 
- Many protk tools have equivalent galaxy wrappers available on the [galaxy toolshed](http://toolshed.g2.bx.psu.edu/) . In order for these tools to work you will also need to make sure that protk, as well as the necessary third party dependencies are available to galaxy during tool execution. If you install protk using the default system ruby (without rvm) this will probably just work, however you will lose the ability to run specific versions of tools against specific versions of protk. The recommended method of installing protk for use with galaxy is as follows;
+ If your galaxy instance is unable to use Docker for some reason you will need to install `protk` and its dependencies manually.
+
+ One way to install protk would be to just do `gem install protk` using the default system ruby (without rvm). This will probably just work; however, you will lose the ability to run specific versions of tools against specific versions of protk. The recommended method of installing protk for use with galaxy is as follows;
 
  1. Ensure you have a working install of galaxy.
 
  [Full instructions](https://wiki.galaxyproject.org/Admin/GetGalaxy) are available on the official Galaxy project wiki page. We assume you have galaxy installed in a directory called galaxy-dist.
 
- 2. Install rvm if you haven't allready. See [here](https://rvm.io/) for more information.
+ 2. Install rvm if you haven't already. See [here](https://rvm.io/) for more information.
 
  ```bash
  curl -sSL https://get.rvm.io | bash -s stable
@@ -148,4 +147,22 @@ Many protk tools have equivalent galaxy wrappers available on the [galaxy toolsh
  ln -s 1.5 default
  ```
 
+ ## Sequence databases
+
+ All `protk` tools are designed to work with sequence databases provided as simple fasta formatted flat files. For most use cases it is simplest to just manage these manually.
+
+ Protk includes a script called `manage_db.rb` to install certain sequence databases in a central repository. Databases installed via `manage_db.rb` can be invoked using a shorthand name rather than a full path to a fasta file. Protk comes with several predefined database configurations. For example, to install a database consisting of human entries from Swissprot plus known contaminants use the following commands;
+
+ ```sh
+ manage_db.rb add --predefined crap
+ manage_db.rb add --predefined sphuman
+ manage_db.rb update crap
+ manage_db.rb update sphuman
+ ```
+
+ You should now be able to run database searches, specifying this database with the `-d sphuman` flag. Every month or so Swissprot releases a new database version. You can keep your database up to date using the `manage_db.rb update` command, which updates the database only if any of its source files (or ftp release notes) have changed. The `manage_db.rb` tool also allows completely custom databases to be configured. Setup requires quite a few command-line options, but once set up, databases can easily be updated without further config. The example below shows the command-line arguments required to manually configure the sphuman database.
+
+ ```sh
+ manage_db.rb add --ftp-source 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt' --include-filters '/OS=Homo\ssapiens/' --id-regex 'sp\|.*\|(.*?)\s' --add-decoys --make-blast-index --archive-old sphuman
+ ```
 
@@ -0,0 +1,75 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 8/5/2015
+ #
+ # Convert mzid to pepXML
+ #
+ #
+
+ require 'libxml'
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/mzidentml_doc'
+ require 'protk/spectrum_query'
+ require 'protk/pepxml_writer'
+ require 'protk/tool'
+
+ include LibXML
+
+ XML.indent_tree_output=true
+
+
+ # Setup specific command-line options for this tool. Other options are inherited from Tool
+ #
+ tool=Tool.new([:explicit_output,:debug])
+ # tool.add_value_option(:minprob,0.05,['--minprob mp',"Minimum probability for psm to be included in the output"])
+
+ tool.option_parser.banner = "Convert an mzIdentML file to pep.xml\n\nUsage: mzid_to_pepxml.rb [options] file1.mzid"
+
+ exit unless tool.check_options(true)
+
+ $protk = Constants.instance
+ log_level = tool.debug ? "info" : "warn"
+ $protk.info_level= log_level
+
+ input_file=ARGV[0]
+
+ if tool.explicit_output
+ output_file_name=tool.explicit_output
+ else
+ output_file_name=Tool.default_output_path(input_file,".pep.xml","","")
+ end
+
+ pep_xml_writer = PepXMLWriter.new
+
+ mzid_doc = MzIdentMLDoc.new(input_file)
+
+ spectrum_queries = mzid_doc.spectrum_queries
+
+ n_queries = spectrum_queries.length
+
+ $protk.log "Converting #{n_queries} spectrum queries", :info
+ $protk.log "Output will be written to #{output_file_name}", :info
+
+ i=0
+ n_written=0
+ progress_increment=1
+ spectrum_queries.each do |query_node|
+ if i % progress_increment ==0
+ $stdout.write "Scanned #{i} and read #{n_written} of #{n_queries}\r"
+ end
+
+ # require 'byebug';byebug
+
+ query = SpectrumQuery.from_mzid(query_node)
+ pep_xml_writer.append_spectrum_query(query.as_pepxml)
+ n_written+=1
+
+ i+=1
+
+ end
+
+ $protk.log "Writing #{n_written} spectrum queries to #{output_file_name}", :info
+
+ pep_xml_writer.save(output_file_name)
@@ -0,0 +1,77 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 7/5/2015
+ #
+ # Convert mzid to protXML
+ #
+ #
+
+ require 'libxml'
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/mzidentml_doc'
+ require 'protk/protein_group'
+ require 'protk/tool'
+
+ include LibXML
+
+ XML.indent_tree_output=true
+
+
+ # Setup specific command-line options for this tool. Other options are inherited from ProphetTool
+ #
+ tool=Tool.new([:explicit_output,:debug])
+ tool.add_value_option(:minprob,0.05,['--minprob mp',"Minimum probability for protein to be included in the output"])
+
+ tool.option_parser.banner = "Convert an mzIdentML file to protXML.\n\nUsage: mzid_to_protxml.rb [options] file1.mzid"
+
+ exit unless tool.check_options(true)
+
+ $protk = Constants.instance
+ log_level = tool.debug ? "info" : "warn"
+ $protk.info_level= log_level
+
+ input_file=ARGV[0]
+
+ if tool.explicit_output
+ output_file_name=tool.explicit_output
+ else
+ output_file_name=Tool.default_output_path(input_file,".protXML","","")
+ end
+
+ prot_xml_writer = ProtXMLWriter.new
+
+ mzid_doc = MzIdentMLDoc.new(input_file)
+
+ protein_groups = mzid_doc.protein_groups
+
+ n_prots = protein_groups.length
+
+ $protk.log "Converting #{n_prots} protein_groups", :info
+ $protk.log "Output will be written to #{output_file_name}", :info
+
+ i=0
+ n_written=0
+ progress_increment=1
+ protein_groups.each do |group_node|
+ if i % progress_increment ==0
+ $stdout.write "Scanned #{i} and read #{n_written} of #{n_prots}\r"
+ end
+
+ # require 'byebug';byebug
+ group_prob = MzIdentMLDoc.get_cvParam(group_node,"MS:1002470").attributes['value'].to_f*0.01
+
+ if group_prob > tool.minprob.to_f
+ group = ProteinGroup.from_mzid(group_node)
+ prot_xml_writer.append_protein_group(group.as_protxml)
+ n_written+=1
+ end
+
+ i+=1
+
+ end
+
+ $protk.log "Writing #{n_written} proteins to #{output_file_name}", :info
+
+ prot_xml_writer.save(output_file_name)
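The script above scales the MS:1002470 cvParam value by 0.01 because mzIdentML stores that probability as a percentage, while `--minprob` is a 0-1 fraction. A standalone sketch of just that conversion, using stdlib REXML instead of protk's libxml helpers (the XML fragment is illustrative, not taken from a real file):

```ruby
require 'rexml/document'

# Illustrative ProteinAmbiguityGroup fragment; the MS:1002470
# cvParam carries the group probability as a percentage.
xml = <<~XML
  <ProteinAmbiguityGroup id="PAG_1">
    <cvParam accession="MS:1002470" value="99.5"/>
  </ProteinAmbiguityGroup>
XML

doc = REXML::Document.new(xml)
param = REXML::XPath.first(doc, "//cvParam[@accession='MS:1002470']")
group_prob = param.attributes['value'].to_f * 0.01  # percentage -> fraction

puts group_prob
```

With the default `--minprob 0.05`, a group like this one (0.995) passes the threshold and is written out.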
@@ -155,7 +155,7 @@ proteins.each do |protein|
  peptides = tool.stack_charge_states ? protein.peptides : protein.representative_peptides
 
  peptides.each do |peptide|
- if peptide.nsp_adjusted_probability >= tool.peptide_probability_threshold
+ if peptide.probability >= tool.peptide_probability_threshold
  peptide_entries = peptide.to_gff3_records(protein_entry.aaseq,gff_parent_entry,gff_cds_entries)
  peptide_entries.each do |peptide_entry|
  output_fh.write peptide_entry.to_s
data/bin/sixframe.rb CHANGED
@@ -25,7 +25,7 @@ end
 
  tool=Tool.new([:explicit_output])
  tool.option_parser.banner = "Create a sixframe translation of a genome.\n\nUsage: sixframe.rb [options] genome.fasta"
-
+ tool.add_boolean_option(:peptideshaker,false,['--peptideshaker', 'Format fasta output for peptideshaker compatibility'])
  tool.add_boolean_option(:print_coords,false,['--coords', 'Write genomic coordinates in the fasta header'])
  tool.add_boolean_option(:keep_header,true,['--strip-header', 'Dont write sequence definition'])
  tool.add_value_option(:min_len,20,['--min-len l','Minimum ORF length to keep'])
@@ -43,8 +43,22 @@ if tool.write_gff
  output_fh.write "##gff-version 3\n"
  end
 
+ accession_prefix=tool.peptideshaker ? "generic" : "lcl"
+ coords_separator=tool.peptideshaker ? "|" : " "
+
  file = Bio::FastaFormat.open(input_file)
 
+ def passes_qc(orf,tool)
+ long_enough = orf.length > tool.min_len.to_i
+
+ composition_ok=true
+ if tool.peptideshaker && (orf=~/X/)
+ composition_ok=false
+ end
+
+ (long_enough && composition_ok)
+ end
+
  file.each do |entry|
 
  length = entry.naseq.length
@@ -58,7 +72,7 @@ file.each do |entry|
  oi=0
  orfs.each do |orf|
  oi+=1
- if ( orf.length > tool.min_len.to_i )
+ if ( passes_qc(orf,tool) )
 
  position_start = position
  position_end = position_start + orf.length*3 -1
@@ -71,15 +85,20 @@ file.each do |entry|
  end
 
  # Create accession compliant with NCBI naming standard
+ #
  # See http://www.ncbi.nlm.nih.gov/books/NBK7183/?rendertype=table&id=ch_demo.T5
+ #
+ # Or with PeptideShaker standard
+ #
+ #
  ncbi_scaffold_id = entry.entry_id.gsub('|','_').gsub(' ','_')
- ncbi_accession = "lcl|#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
+ ncbi_accession = "#{accession_prefix}|#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
  gff_id = "#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
 
  defline=">#{ncbi_accession}"
 
  if tool.print_coords
- defline << " #{position_start}|#{position_end}"
+ defline << "#{coords_separator}#{position_start}|#{position_end}"
  end
 
  if tool.keep_header
@@ -88,7 +107,7 @@ file.each do |entry|
 
  if tool.write_gff
  strand = frame>3 ? "-" : "+"
- # score = self.nsp_adjusted_probability.nil? ? "." : self.nsp_adjusted_probability.to_s
+ # score = self.probability.nil? ? "." : self.probability.to_s
  # gff_string = "#{parent_record.seqid}\tMSMS\tpolypeptide\t#{start_i}\t#{end_i}\t#{score}\t#{parent_record.strand}\t0\tID=#{this_id};Parent=#{cds_id}"
  output_fh.write("#{ncbi_scaffold_id}\tsixframe\tCDS\t#{position_start}\t#{position_end}\t.\t#{strand}\t0\tID=#{gff_id}\n")
  else
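The sixframe.rb changes above switch between two defline conventions: NCBI-style `lcl|...` with a space before the coordinates, and PeptideShaker-style `generic|...` with `|` as the separator. A self-contained sketch mirroring that logic (the `defline` helper name is ours, not a protk API):

```ruby
# Build a fasta defline the way the diff above does: NCBI "lcl|"
# accessions with a space-separated coordinate suffix by default,
# or PeptideShaker "generic|" accessions with "|" separators.
def defline(scaffold_id, frame, orf_i, start_pos, end_pos,
            peptideshaker: false, coords: true)
  prefix = peptideshaker ? "generic" : "lcl"
  sep    = peptideshaker ? "|" : " "
  id     = scaffold_id.gsub('|', '_').gsub(' ', '_')
  line   = ">#{prefix}|#{id}_frame_#{frame}_orf_#{orf_i}"
  line << "#{sep}#{start_pos}|#{end_pos}" if coords
  line
end
```

For example, `defline("scaf 1", 2, 3, 10, 100)` yields an `lcl|` header, while passing `peptideshaker: true` yields the `generic|` form PeptideShaker expects.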
@@ -0,0 +1,125 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 30/4/2015
+ #
+ # A wrapper for the SpectraST create command
+ #
+ #
+
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/tool'
+ require 'protk/galaxy_util'
+ require 'protk/pepxml'
+ require 'protk/sniffer'
+ require 'protk/mzml_parser'
+
+ for_galaxy = GalaxyUtil.for_galaxy?
+
+ genv=Constants.instance
+
+ # Setup specific command-line options for this tool. Other options are inherited from ProphetTool
+ #
+ spectrast_tool=Tool.new([:explicit_output])
+ spectrast_tool.option_parser.banner = "Create a spectral library from pep.xml input files.\n\nUsage: spectrast_create.rb [options] file1.pep.xml file2.pep.xml ..."
+ spectrast_tool.add_value_option(:spectrum_files,"",['--spectrum-files sf','Paths to raw spectrum files. These should be provided in a comma separated list'])
+ spectrast_tool.add_boolean_option(:binary_output,false,['-B','--binary-output','Produce spectral libraries in binary format rather than ASCII'])
+ spectrast_tool.add_value_option(:filter_predicate,nil,['--predicate pred','Keep only spectra satisfying predicate pred. Should be a C-style predicate'])
+ spectrast_tool.add_value_option(:probability_threshold,0.99,['--p-thresh val', 'Probability threshold below which spectra are discarded'])
+ spectrast_tool.add_value_option(:instrument_acquisition,"CID",['--instrument-acquisition setting',
+ 'Set the instrument and acquisition settings of the spectra (in case not specified in data files).
+ Examples: CID, ETD, CID-QTOF, HCD. The latter two are treated as high-mass accuracy spectra.'])
+
+ exit unless spectrast_tool.check_options(true)
+
+ spectrast_bin = %x[which spectrast].chomp
+
+ # Options: GENERAL OPTIONS
+ # -cF<file> Read create options from file <file>.
+ # If <file> is not given, "spectrast_create.params" is assumed.
+ # NOTE: All options set in the file will be overridden by command-line options, if specified.
+ # -cm<remark> Remark. Add a Remark=<remark> comment to all library entries created.
+ # -cM<format> Write all library spectra as MRM transition tables. Leave <format> blank for default. (Turn off with -cM!)
+ # -cT<file> Use probability table in <file>. Only those peptide ions included in the table will be imported.
+ # A probability table is a text file with one peptide ion in the format AC[160]DEFGHIK/2 per line.
+ # If a probability is supplied following the peptide ion separated by a tab, it will be used to replace the original probability of that library entry.
+ # -cO<file> Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported.
+ # A protein list is a text file with one protein identifier per line.
+ # If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported.
+
+ # PEPXML IMPORT OPTIONS (Applicable with .pepXML files)
+ # -cP<prob> Include all spectra identified with probability no less than <prob> in the library.
+ # -cq<fdr> (Only PepXML import) Only include spectra with global FDR no greater than <fdr> in the library.
+ # -cn<name> Specify a dataset identifier for the file to be imported.
+ # -co Add the originating mzXML file name to the dataset identifier. Good for keeping track of in which
+ # MS run the peptide is observed. (Turn off with -co!)
+ # -cg Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]). Use for glycocaptured peptides. (Turn off with -cg!).
+ # -cI Set the instrument and acquisition settings of the spectra (in case not specified in data files).
+ # Examples: -cICID, -cIETD, -cICID-QTOF, -cIHCD. The latter two are treated as high-mass accuracy spectra.
+ #
+
+ # -cf<pred> Filter library. Keep only those entries satisfying the predicate <pred>.
+ # <pred> should be a C-style predicate in quotes.
+
+ input_stagers=[]
+ inputs=ARGV.collect { |file_name| file_name.chomp}
+ if for_galaxy
+ input_stagers = inputs.collect {|ip| GalaxyUtil.stage_pepxml(ip) }
+ inputs=input_stagers.collect { |sg| sg.staged_path }
+ end
+
+ spectrum_file_paths=spectrast_tool.spectrum_files.split(",").collect { |mod| mod.lstrip.rstrip }.reject {|e| e.empty? }
+
+ spectrum_file_paths.each do |rf|
+ throw "Provided spectrum file #{rf} does not exist" unless File.exists? rf
+ format = Sniffer.sniff_format(rf)
+ throw "Unrecognised format #{format} detected for spectrum file #{rf}" unless ["mzML","mgf"].include? format
+
+ # basename_no_ext = File.basename(rf,File.extname(rf))
+ runid_name = MzMLParser.new(rf).next_runid()
+
+ expected_name = "#{runid_name}.#{format}"
+
+ if for_galaxy || !File.exists?(expected_name)
+ raw_input_stager = GalaxyStager.new(rf,{:extension=>".#{format}",:name=>runid_name})
+ puts raw_input_stager.staged_path
+ end
+
+ end
+
+
+ cmd="#{spectrast_bin} "
+
+ unless spectrast_tool.binary_output
+ cmd << " -c_BIN!"
+ end
+
+ if spectrast_tool.filter_predicate
+ cmd << " -cf'#{spectrast_tool.filter_predicate}'"
+ end
+
+
+
+ cmd << " -cI#{spectrast_tool.instrument_acquisition}"
+
+ if spectrast_tool.explicit_output==nil
+ output_file_name=Tool.default_output_path(inputs,"","","")
+ else
+ output_file_name=spectrast_tool.explicit_output
+ end
+
+ cmd << " -cN#{output_file_name}"
+
+ cmd << " -cP#{spectrast_tool.probability_threshold}"
+
+ inputs.each { |ip| cmd << " #{ip}" }
+
+ # code = spectrast_tool.run(cmd,genv)
+ # throw "Command failed with exit code #{code}" unless code==0
+
+ %x[#{cmd}]
+
+
+
+