RubyGems - protk - Versions diffs - 1.4.2 → 1.4.3 - Mend

protk 1.4.2 → 1.4.3


Files changed (19)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 7377c1480498f852b7e747d13e9a7d985523fcef
-  data.tar.gz: 2cb2c652e53ec636fb521cb35a687324ee810af8
+  metadata.gz: 31df514a2203236ea9ac25f8d5cc9c282378e04d
+  data.tar.gz: f1bb438ef01003afc166eb5b0342dbd8f6ecd09b
 SHA512:
-  metadata.gz: c4e72457cc9ada490ea6210c9d13e6e5d240c0d19399c5b015feb30cebb881b51bae62cb8d4aa831d8aee397a747d5c16b1433f0ea08656474a56270f709b3d7
-  data.tar.gz: e8893c4fda75666fdf4ed3cb6d6dc0bb34ceb3041c2b19883e82891aa5245a4a7aa99cc4fe677f2fb6f28c2b12d5bbb89a213a59e3f1f0e4d8dc927a9f6ff510
+  metadata.gz: 7f7f0fe81411f17b89037162ad7bf5374be69309888bda2f84b058d69773dd1790469d40dd5797e0306ce49ecc95426cfb80b65bfcb95ac0a052be90db40ea42
+  data.tar.gz: f2db6018ac90079e925f7c5c071be3fbedd77bc3787cfc58a91aa38048f43f60426020a3bff3a5a58c4bea6d71aaad7c0ab10437e050eaf1998792ca6ab2e1dd

data/README.md CHANGED

@@ -3,8 +3,6 @@
 # protk ( Proteomics toolkit )
-***
 ## What is it?
 Protk is a suite of tools for proteomics. It aims to present a simple and consistent command-line interface across otherwise disparate third party tools.  The following analysis tasks are currently supported;
@@ -68,15 +66,27 @@ By default protk will install tools and databases into `.protk` in your home dir
 Many protk tools have equivalent galaxy wrappers available on the [galaxy toolshed](http://toolshed.g2.bx.psu.edu/) with source code and development occuring in the [protk-galaxytools](github.com/iracooke/protk-galaxytools) repository on github.  In order for these tools to work you will also need to make sure that protk, as well as the necessary third party dependencies are available to galaxy during tool execution.
-There are two ways to do this
+There are three ways to do this
 **Using Docker:**
 By far the easiest way to do this is to set up your Galaxy instance to run tools in Docker containers.  All the tools in the [protk-galaxytools](github.com/iracooke/protk-galaxytools) repository are designed to work with [this](https://github.com/iracooke/protk-dockerfile) docker image, and will download and use the image automatically on apprioriately configured Galaxy instances.
+**Using the Galaxy Tool Shed (Experimental)**
+An installation recipe of `protk` is available from the [Galaxy Tool Shed](https://testtoolshed.g2.bx.psu.edu/view/iuc/package_protk_1_4_2/). If you want to depend on protk for your own Galaxy wrapper create a `tool_dependencies.xml` file with the following content.
+```xml
+<tool_dependency>
+    <package name="protk" version="1.4.2">
+        <repository name="package_protk_1_4_2" owner="iuc"/>
+    </package>
+</tool_dependency>
+```
 **Manual Install**
-If your galaxy instance is unable to use Docker for some reason you will need to install `protk` and its dependencies manually.
+If your galaxy instance is unable to use Docker or the Tool Shed for some reason you will need to install `protk` and its dependencies manually.
 One way to install protk would be to just do `gem install protk` using the default system ruby (without rvm). This will probably just work, however you will lose the ability to run specific versions of tools against specific versions of protk.  The recommended method of installing protk for use with galaxy is as follows;
@@ -98,13 +108,13 @@ One way to install protk would be to just do `gem install protk` using the defau
 4.  Install protk in an isolated gemset using rvm.
-	This sets up an isolated environment where only a specific version of protk is available.  We name the environment according to the protk intermediate version numer (1.4 in this example). Minor bugfixes will be released as 1.4.x and can be installed without updating the toolshed wrappers
+	This sets up an isolated environment where only a specific version of protk is available.  We name the environment according to the protk version number (1.4.2 in this example).
 	```bash
 		rvm 2.1
-		rvm gemset create protk1.4
-		rvm use 2.1@protk1.4
-		gem install protk -v '~>1.4'
+		rvm gemset create protk1.4.2
+		rvm use 2.1@protk1.4.2
+		gem install protk -v '~>1.4.2'
 	```
 5. Configure Galaxy's tool dependency directory.
@@ -124,11 +134,11 @@ One way to install protk would be to just do `gem install protk` using the defau
 		cd <tool_dependency_dir>
 		mkdir protk
 		cd protk
-		mkdir 1.4
-		ln -s 1.4 default
-		rvm use 2.1@protk1.4
-		rvmenv=`rvm env --path 2.1@protk1.4`
-		echo ". $rvmenv" > 1.4/env.sh
+		mkdir 1.4.2
+		ln -s 1.4.2 default
+		rvm use 2.1@protk1.4.2
+		rvmenv=`rvm env --path 2.1@protk1.4.2`
+		echo ". $rvmenv" > 1.4.2/env.sh
 	```
 7. Keep things up to date
@@ -137,14 +147,14 @@ One way to install protk would be to just do `gem install protk` using the defau
 	```bash
 		rvm 2.1
-		rvm gemset create protk1.5
-		rvm use 2.1@protk1.5
-		gem install protk -v '~>1.5'
+		rvm gemset create protk1.5.0
+		rvm use 2.1@protk1.5.0
+		gem install protk -v '~>1.5.0'
 		cd <tool_dependency_dir>/protk/
-		mkdir 1.5
-		rvmenv=`rvm env --path 2.1@protk1.5`
-		echo ". $rvmenv" > 1.5/env.sh
-		ln -s 1.5 default
+		mkdir 1.5.0
+		rvmenv=`rvm env --path 2.1@protk1.5.0`
+		echo ". $rvmenv" > 1.5.0/env.sh
+		ln -s 1.5.0 default
 	```
 ## Sequence databases
@@ -166,3 +176,4 @@ You should now be able to run database searches, specifying this database by usi
 manage_db.rb add --ftp-source 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt' --include-filters '/OS=Homo\ssapiens/' --id-regex 'sp\|.*\|(.*?)\s' --add-decoys --make-blast-index --archive-old sphuman
 ```
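
The README hunks above change the install constraint from `'~>1.4'` to `'~>1.4.2'`. As a minimal sketch of what that means in practice, the snippet below evaluates both pessimistic constraints with RubyGems' own `Gem::Requirement` class. The list of candidate versions is chosen for illustration only and is not taken from the gem's release history.

```ruby
require 'rubygems'

# Pessimistic constraints as written in the old and new README commands.
old_req = Gem::Requirement.new('~>1.4')
new_req = Gem::Requirement.new('~>1.4.2')

# Candidate versions are illustrative, not a list of actual releases.
%w[1.4.2 1.4.3 1.5.0].each do |v|
  version = Gem::Version.new(v)
  puts "#{v}: ~>1.4 => #{old_req.satisfied_by?(version)}, " \
       "~>1.4.2 => #{new_req.satisfied_by?(version)}"
end
```

The two-segment form accepts any 1.x release at or above 1.4, while the three-segment form stays within the 1.4.x series, which matches the README's move to naming gemsets after the full version number.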

data/bin/filter_fasta.rb ADDED

@@ -0,0 +1,74 @@
+#!/usr/bin/env ruby
+#
+# This file is part of protk
+# Created by Ira Cooke 22/5/2015
+#
+# Filters a fasta file so only entries matching a condition are emitted
+#
+require 'protk/constants'
+require 'protk/command_runner'
+require 'protk/tool'
+require 'set'
+require 'bio'
+tool=Tool.new([:explicit_output])
+tool.option_parser.banner = "Filter entries in a fasta file.\n\nUsage: filter_fasta.rb [options] file.fasta file2.fasta"
+tool.add_value_option(:definition_filter,nil,['--definition filter','Keep entries matching definition'])
+tool.add_boolean_option(:invert,false,['--invert',"Invert Filter"])
+tool.add_value_option(:id_filter,nil,['-I filename','--id-filter filename',"Keep entries with given identifiers"])
+exit unless tool.check_options(true)
+input_file=ARGV[0]
+output_fh = tool.explicit_output!=nil ? File.new("#{tool.explicit_output}",'w') : $stdout
+$filter_ids = Set.new()
+if tool.id_filter && (File.exists?(tool.id_filter) || tool.id_filter=="-")
+	if tool.id_filter=="-"
+		$filter_ids = $stdin.read.split("\n").collect { |e| e.chomp }
+	else
+		$filter_ids = File.readlines(tool.id_filter).collect { |e| e.chomp }
+	end
+	$filter_ids = Set.new($filter_ids) # Much faster set include than array include
+end
+def passes_filters(entry,tool)
+	if tool.definition_filter
+		if entry.definition =~ /#{tool.definition_filter}/
+			return true
+		else
+			return false
+		end
+	end
+	if $filter_ids.length > 0
+		# require 'byebug';byebug
+		if $filter_ids.include? entry.entry_id
+			return true
+		end
+		return false
+	end
+	# Always true if there are no filters defined
+	return true
+end
+ARGV.each do |fasta_file|
+	file = Bio::FastaFormat.open(fasta_file.chomp)
+	file.each do |entry|
+		pass = passes_filters(entry,tool)
+		pass = !pass if tool.invert
+		if pass
+			output_fh.write entry
+		end
+	end
+end

data/bin/filter_psms.rb ADDED

@@ -0,0 +1,109 @@
+#!/usr/bin/env ruby
+#
+# This file is part of protk
+# Created by Ira Cooke 24/6/2015
+#
+# Filters a pepxml file by removing or keeping only psms that match a filter
+#
+require 'protk/constants'
+require 'protk/command_runner'
+require 'protk/tool'
+require 'bio'
+require 'libxml'
+include LibXML
+tool=Tool.new([:explicit_output,:debug])
+tool.option_parser.banner = "Filter psms in a pepxml file.\n\nUsage: filter_psms.rb [options] expression file.pepxml"
+tool.add_value_option(:filter,"protein",['-A','--attribute name',"Match expression against a specific search_hit attribute"])
+tool.add_boolean_option(:check_alternative_proteins,false,['-C','--check-alternatives',"Also match expression against to alternative_proteins"])
+tool.add_boolean_option(:reject_mode,false,['-R','--reject',"Keep mismatches instead of matches"])
+exit unless tool.check_options(true,[:filter])
+if ARGV.length!=2
+  puts "Wrong number of arguments. You must supply a filter expression and a pepxml file"
+  exit(1)
+end
+expressions=ARGV[0].split(",").map(&:strip)
+input_file=ARGV[1]
+$protk = Constants.instance
+log_level = tool.debug ? "info" : "warn"
+$protk.info_level= log_level
+output_fh = tool.explicit_output!=nil ? File.new("#{tool.explicit_output}",'w') : $stdout
+throw "Input file #{input_file} does not exist" unless File.exist? "#{input_file}"
+XML::Error.set_handler(&XML::Error::QUIET_HANDLER)
+doc = XML::Document.file("#{input_file}")
+reader = XML::Reader.document(doc)
+# First print out the header (ie before spectrum_queries)
+File.foreach("#{input_file}") do |line|
+  if line =~ /\<spectrum_query/
+    break;
+  else
+    output_fh.write line
+  end
+end
+pepxml_ns_prefix="xmlns:"
+pepxml_ns="xmlns:http://regis-web.systemsbiology.net/pepXML"
+kept=0
+deleted=0
+scanned=0
+while reader.read
+  if reader.name == "spectrum_query"
+    sq_node = reader.expand
+    hits = sq_node.find("./#{pepxml_ns_prefix}search_result/#{pepxml_ns_prefix}search_hit[@hit_rank=\"1\"]",pepxml_ns)
+    throw "More than one first ranked search hit" if hits.length>1
+    throw "No search hit for spectrum_query" if hits.length==0
+    hit = hits[0]
+    has_match = expressions.collect { |expression|   (hit.attributes[tool.filter] =~ /#{expression}/) }.any?
+    if !has_match && tool.check_alternative_proteins
+      alts = hit.find("./#{pepxml_ns_prefix}alternative_protein",pepxml_ns)
+      # Check alternative proteins
+      alt_expr = alts.collect { |alt| expressions.collect { |expression| (alt.attributes[tool.filter] =~ /#{expression}/ )}}
+      has_match = alt_expr.flatten.any?
+    end
+    if (has_match && !tool.reject_mode) || (!has_match && tool.reject_mode)  #&& (hit.attributes['hit_rank']=="1")
+      kept+=1
+      # Remove any lower ranked hits
+      #
+      secondary_hits = sq_node.find("./#{pepxml_ns_prefix}search_result/#{pepxml_ns_prefix}search_hit[@hit_rank!=\"1\"]",pepxml_ns)
+      secondary_hits.each { |sh| sh.remove!  }
+      output_fh.write "#{sq_node}\n"
+    else
+      deleted+=1
+    end
+    scanned+=1
+    reader.next_sibling
+  end
+end
+output_fh.write "</msms_run_summary>\n</msms_pipeline_analysis>\n"
+$protk.log "Kept #{kept} and deleted #{deleted}" , :info

data/bin/msgfplus_search.rb CHANGED

@@ -41,17 +41,17 @@ search_tool.options.instrument=0
 # MS-GF+ doesnt support fragment tol so add this manually rather than via the SearchTool defaults
 search_tool.add_value_option(:precursor_tol,"20",['-p','--precursor-ion-tol tol', 'Precursor ion mass tolerance.'])
-search_tool.add_value_option(:precursor_tolu,"ppm",['--precursor-ion-tol-units tolu', 'Precursor ion mass tolerance units (ppm or Da). Default=ppm'])
+search_tool.add_value_option(:precursor_tolu,"ppm",['--precursor-ion-tol-units tolu', 'Precursor ion mass tolerance units (ppm or Da).'])
 search_tool.add_boolean_option(:pepxml,false,['--pepxml', 'Convert results to pepxml.'])
-search_tool.add_value_option(:isotope_error_range,"0,1",['--isotope-error-range range', 'Takes into account of the error introduced by chooosing a non-monoisotopic peak for fragmentation.(Default 0,1)'])
+search_tool.add_value_option(:isotope_error_range,"0,1",['--isotope-error-range range', 'Takes into account of the error introduced by chooosing a non-monoisotopic peak for fragmentation.'])
 search_tool.add_value_option(:fragment_method,0,['--fragment-method method', 'Fragment method 0: As written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor'])
 search_tool.add_boolean_option(:decoy_search,false,['--decoy-search', 'Build and search a decoy database on the fly. Input db should not contain decoys if this option is used'])
 search_tool.add_value_option(:protocol,0,['--protocol p', '0: NoProtocol (Default), 1: Phosphorylation, 2: iTRAQ, 3: iTRAQPhospho'])
-search_tool.add_value_option(:min_pep_length,6,['--min-pep-length p', 'Minimum peptide length to consider, Default: 6'])
-search_tool.add_value_option(:max_pep_length,40,['--max-pep-length p', 'Maximum peptide length to consider, Default: 40'])
-search_tool.add_value_option(:min_pep_charge,2,['--min-pep-charge c', 'Minimum precursor charge to consider if charges are not specified in the spectrum file, Default: 2'])
-search_tool.add_value_option(:max_pep_charge,3,['--max-pep-charge c', 'Maximum precursor charge to consider if charges are not specified in the spectrum file, Default: 3'])
+search_tool.add_value_option(:min_pep_length,6,['--min-pep-length p', 'Minimum peptide length to consider'])
+search_tool.add_value_option(:max_pep_length,40,['--max-pep-length p', 'Maximum peptide length to consider'])
+search_tool.add_value_option(:min_pep_charge,2,['--min-pep-charge c', 'Minimum precursor charge to consider if charges are not specified in the spectrum file'])
+search_tool.add_value_option(:max_pep_charge,3,['--max-pep-charge c', 'Maximum precursor charge to consider if charges are not specified in the spectrum file'])
 search_tool.add_value_option(:num_reported_matches,1,['--num-reported-matches n', 'Number of matches per spectrum to be reported, Default: 1'])
 search_tool.add_boolean_option(:add_features,false,['--add-features', 'output additional features'])
 search_tool.add_value_option(:java_mem,"3500M",['--java-mem mem','Java memory limit when running the search (Default 3.5Gb)'])
@@ -208,23 +208,32 @@ ARGV.each do |filename|
       cmd << " -mod #{mods_path}"
     end
     # As a final part of the command we convert to pepxml
     if search_tool.pepxml
       #if search_tool.explicit_output
       cmd << ";ruby -pi.bak -e \"gsub('post=\\\"?','post=\\\"X')\" #{mzid_output_path}"
       cmd << ";ruby -pi.bak -e \"gsub('pre=\\\"?','pre=\\\"X')\" #{mzid_output_path}"
-      cmd << ";idconvert #{mzid_output_path} --pepXML -o #{Pathname.new(mzid_output_path).dirname}"
+      cmd << ";ruby -pi.bak -e \"gsub('id=\\\"UnspecificCleavage\\\"','id=\\\"UnspecificCleavage\\\" name=\\\"unspecific cleavage\\\"')\" #{mzid_output_path}"
+      idconvert_relative_output_dir = (0...10).map { ('a'..'z').to_a[rand(26)] }.join
+#      require 'byebug';byebug
+      idconvert_output_dir = "#{Pathname.new(mzid_output_path).dirname}/#{idconvert_relative_output_dir}"
+      cmd << ";idconvert #{mzid_output_path} --pepXML -o #{idconvert_output_dir}"
-      pepxml_output_path = "#{mzid_output_path.chomp('.mzid')}.pepXML"
+      cmd << "; pep_xml_output_path=`ls #{idconvert_output_dir}/*.pepXML`; echo $pep_xml_output_path"
+      #"#{mzid_output_path.chomp('.mzid')}.pepXML"
       # Fix the msms_run_summary base_name attribute
       #
       if for_galaxy
-        cmd << ";ruby -pi.bak -e \"gsub(/ base_name=[^ ]+/,' base_name=\\\"#{original_input_file}\\\"')\" #{pepxml_output_path}"
+        cmd << ";ruby -pi.bak -e \"gsub(/ base_name=[^ ]+/,' base_name=\\\"#{original_input_file}\\\"')\" $pep_xml_output_path"
       end
       #Then copy the pepxml to the final output path
-      cmd << "; mv #{pepxml_output_path} #{output_path}"
+      cmd << "; mv ${pep_xml_output_path} '#{output_path}'"
     else
       cmd << "; mv #{mzid_output_path} #{output_path}"
     end

data/bin/mzid_to_protxml.rb CHANGED

@@ -29,7 +29,7 @@ tool.option_parser.banner = "Convert an mzIdentML file to protXML.\n\nUsage: mzi
 exit unless tool.check_options(true)
 $protk = Constants.instance
-log_level = tool.debug ? "info" : "warn"
+log_level = tool.debug ? "debug" : "info"
 $protk.info_level= log_level
 input_file=ARGV[0]
@@ -42,6 +42,7 @@ end
 prot_xml_writer = ProtXMLWriter.new
+$protk.log "Parsing MzIdentML input file" , :info
 mzid_doc = MzIdentMLDoc.new(input_file)
 protein_groups = mzid_doc.protein_groups
@@ -59,11 +60,13 @@ protein_groups.each do |group_node|
 		$stdout.write "Scanned #{i} and read #{n_written} of #{n_prots}\r"
 	end
-	# require 'byebug';byebug
-	group_prob = MzIdentMLDoc.get_cvParam(group_node,"MS:1002470").attributes['value'].to_f*0.01
-	if group_prob > tool.minprob.to_f
-		group = ProteinGroup.from_mzid(group_node)
+	group_prob = mzid_doc.get_cvParam(group_node,"MS:1002470").attributes['value'].to_f*0.01
+	if group_prob >= tool.minprob.to_f
+		$stdout.write "\n" if tool.debug
+		$protk.log "Writing group with probability #{group_prob}" , :info
+		group = ProteinGroup.from_mzid(group_node,mzid_doc,tool.minprob.to_f)
 		prot_xml_writer.append_protein_group(group.as_protxml)
 		n_written+=1
 	end

data/bin/peptide_prophet.rb CHANGED

@@ -44,6 +44,7 @@ prophet_tool.add_boolean_option(:force_fit,false,['--force-fit',"Force fitting o
 prophet_tool.add_boolean_option(:allow_alt_instruments,false,['--allow-alt-instruments',"Warning instead of exit with error if instrument types between runs is different"])
 prophet_tool.add_boolean_option(:one_ata_time,false,['-F', '--one-ata-time', 'Create a separate pproph output file for each analysis'])
 prophet_tool.add_value_option(:decoy_prefix,"decoy",['--decoy-prefix prefix', 'Prefix for decoy sequences'])
+prophet_tool.add_boolean_option(:use_non_parametric_model,false,['--use-non-parametric-model', 'Use Non-parametric model, can only be used with decoy option'])
 prophet_tool.add_boolean_option(:no_decoys,false,['--no-decoy', 'Don\'t use decoy sequences to pin down the negative distribution'])
 prophet_tool.add_value_option(:experiment_label,nil,['--experiment-label label','used to commonly label all spectra belonging to one experiment (required by iProphet)'])
@@ -194,7 +195,7 @@ def generate_command(genv,prophet_tool,inputs,output,database,engine,enzyme)
   if prophet_tool.useicat
     cmd << " -Oi "
   else
-    cmd << " -Of"
+    cmd << " -Of "
   end
   if prophet_tool.maldi
@@ -209,6 +210,10 @@ def generate_command(genv,prophet_tool,inputs,output,database,engine,enzyme)
       cmd << " -d#{prophet_tool.decoy_prefix} -Od "
   end
+  if prophet_tool.use_non_parametric_model
+    cmd << " -OP "
+  end
   cmd << " -p#{prophet_tool.probability_threshold}"
   if ( inputs.class==Array)

data/bin/protxml_to_gff.rb CHANGED

@@ -58,6 +58,15 @@ def protein_id_to_protdbid(protein_id)
 	return protein_id
 end
+def protein_is_included(protein,protein_probability_threshold,ignore_regex)
+	pass_probability_thresh = (protein.probability >= protein_probability_threshold)
+	pass_regex = true
+	if ignore_regex && (protein.protein_name =~ /#{ignore_regex}/)
+		pass_regex=false
+	end
+	return (pass_regex && pass_probability_thresh)
+end
 def prepare_fasta(database_path,type)
   db_filename = nil
   case
@@ -91,6 +100,7 @@ tool.add_value_option(:peptide_probability_threshold,0.95,['--threshold prob','P
 tool.add_value_option(:protein_probability_threshold,0.99,['--prot-threshold prob','Protein Probability Threshold (Default 0.99)'])
 tool.add_value_option(:gff_idregex,nil,['--gff-idregex pre','Regex with capture group for parsing gff ids from protein ids'])
 tool.add_value_option(:genome_idregex,nil,['--genome-idregex pre','Regex with capture group for parsing genomic ids from protein ids'])
+tool.add_value_option(:ignore_regex,nil,['--ignore-regex pre','Regex to match protein ids that we should ignore completely'])
 exit unless tool.check_options(true,[:database,:coords_file])
@@ -126,7 +136,7 @@ num_missing_gff_entries = 0
 proteins.each do |protein|
-	if protein.probability >= tool.protein_probability_threshold
+	if protein_is_included(protein,tool.protein_probability_threshold.to_f,tool.ignore_regex)
 		begin
 			$protk.log "Mapping #{protein.protein_name}", :info
@@ -155,7 +165,7 @@ proteins.each do |protein|
 			peptides = tool.stack_charge_states ? protein.peptides : protein.representative_peptides
 			peptides.each do |peptide|
-				if peptide.probability >= tool.peptide_probability_threshold
+				if peptide.probability >= tool.peptide_probability_threshold.to_f
 					peptide_entries = peptide.to_gff3_records(protein_entry.aaseq,gff_parent_entry,gff_cds_entries)
 					peptide_entries.each do |peptide_entry|
 						output_fh.write peptide_entry.to_s
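
The `protxml_to_gff.rb` hunks above add `.to_f` to the thresholds and route the protein check through `protein_is_included`. A plausible reading, assumed here and not stated in the diff, is that a threshold supplied on the command line arrives as a String, and comparing a Float with a String raises an `ArgumentError` in Ruby. The sketch below reproduces the inclusion test on a hypothetical `Protein` struct with made-up records.

```ruby
# Hypothetical stand-in for a protXML protein record.
Protein = Struct.new(:protein_name, :probability)

# Same two conditions as the helper in the diff: the probability must reach
# the threshold and the name must not match the optional ignore regex.
def protein_is_included(protein, threshold, ignore_regex)
  pass_probability = protein.probability >= threshold
  pass_regex = !(ignore_regex && protein.protein_name =~ /#{ignore_regex}/)
  pass_regex && pass_probability
end

# Assumption: a user-supplied option value is a String until coerced.
threshold = '0.99'.to_f

proteins = [
  Protein.new('sp|P1', 0.995),
  Protein.new('decoy_P2', 1.0),
  Protein.new('sp|P3', 0.5)
]

included = proteins.select { |p| protein_is_included(p, threshold, '^decoy_') }
puts included.map(&:protein_name).inspect
```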

data/lib/protk/constants.rb CHANGED

@@ -17,7 +17,7 @@ class Constants
   # These are logger attributes with thresholds as indicated
   #  DEBUG < INFO < WARN < ERROR < FATAL < UNKNOWN
-  #Debug (development mode) or Info (production)
+  # Debug (development mode) or Info (production)
   #
   @stdout_logger

data/lib/protk/mzidentml_doc.rb CHANGED

@@ -7,6 +7,31 @@ class MzIdentMLDoc < Object
 	MZID_NS_PREFIX="mzidentml"
 	MZID_NS='http://psidev.info/psi/pi/mzIdentML/1.1'
+	attr :psms_cache
+	attr :db_sequence_cache
+	def psms_cache
+		if !@psms_cache
+			@psms_cache={}
+			Constants.instance.log "Generating psm index" , :debug
+			self.psms.each do |spectrum_identification_item|
+				@psms_cache[spectrum_identification_item.attributes['id']]=spectrum_identification_item
+			end
+		end
+		@psms_cache
+	end
+	def dbsequence_cache
+		if !@dbsequence_cache
+			@dbsequence_cache={}
+			Constants.instance.log "Generating DB index" , :debug
+			self.dbsequences.each do |db_sequence|
+				@dbsequence_cache[db_sequence.attributes['accession']]=db_sequence
+			end
+		end
+		@dbsequence_cache
+	end
 	def initialize(path)
 		parser=XML::Parser.file(path)
 		@document=parser.parse
@@ -25,6 +50,10 @@ class MzIdentMLDoc < Object
 		@document.find("//#{MZID_NS_PREFIX}:SpectrumIdentificationItem","#{MZID_NS_PREFIX}:#{MZID_NS}")
 	end
+	def dbsequences
+		@document.find("//#{MZID_NS_PREFIX}:DBSequence","#{MZID_NS_PREFIX}:#{MZID_NS}")
+	end
 	def protein_groups
 		@document.find("//#{MZID_NS_PREFIX}:ProteinAmbiguityGroup","#{MZID_NS_PREFIX}:#{MZID_NS}")
 	end
@@ -55,17 +84,22 @@ class MzIdentMLDoc < Object
 		node.find("#{pp}#{MZID_NS_PREFIX}:#{expression}","#{MZID_NS_PREFIX}:#{MZID_NS}")
 	end
+	def find(node,expression,root=false)
+		MzIdentMLDoc.find(node,expression,root)
+	end
-	def self.get_cvParam(mzidnode,accession)
+	def get_cvParam(mzidnode,accession)
 		self.find(mzidnode,"cvParam[@accession=\'#{accession}\']")[0]
 	end
-	def self.get_dbsequence(mzidnode,accession)
-		self.find(mzidnode,"DBSequence[@accession=\'#{accession}\']",true)[0]
+	def get_dbsequence(mzidnode,accession)
+		self.dbsequence_cache[accession]
+		# self.find(mzidnode,"DBSequence[@accession=\'#{accession}\']",true)[0]
 	end
 	# As per PeptideShaker. Assume group probability used for protein if it is group rep otherwise 0
-	def self.get_protein_probability(protein_node)
+	def get_protein_probability(protein_node)
 		#MS:1002403
 		is_group_representative=(self.get_cvParam(protein_node,"MS:1002403")!=nil)
@@ -76,28 +110,38 @@ class MzIdentMLDoc < Object
 		end
 	end
-	def self.get_proteins_for_group(group_node)
-		self.find(group_node,"ProteinDetectionHypothesis")
+	# Memoized because it gets called for every protein in a group
+	def get_proteins_for_group(group_node)
+		@proteins_for_group_cache ||= Hash.new do |h,key|
+			h[key] = self.find(group_node,"ProteinDetectionHypothesis")
+		end
+		@proteins_for_group_cache[group_node]
 	end
 	# def self.get_sister_proteins(protein_node)
 	# 	self.find(protein_node.parent,"ProteinDetectionHypothesis")
 	# end
-	def self.get_peptides_for_protein(protein_node)
+	def get_peptides_for_protein(protein_node)
 		self.find(protein_node,"PeptideHypothesis")
 	end
 	# <PeptideHypothesis peptideEvidence_ref="PepEv_1">
 	# 	<SpectrumIdentificationItemRef spectrumIdentificationItem_ref="SII_1_1"/>
 	# </PeptideHypothesis>
-	def self.get_best_psm_for_peptide(peptide_node)
+	def get_best_psm_for_peptide(peptide_node)
 		best_score=-1
 		best_psm=nil
-		self.find(peptide_node,"SpectrumIdentificationItemRef").each do |id_ref_node|
+		spectrumidrefs = self.find(peptide_node,"SpectrumIdentificationItemRef")
+		Constants.instance.log "Searching from among #{spectrumidrefs.length} for best psm" , :debug
+		spectrumidrefs.each do |id_ref_node|
 			id_ref = id_ref_node.attributes['spectrumIdentificationItem_ref']
-			psm_node = self.find(peptide_node,"SpectrumIdentificationItem[@id=\'#{id_ref}\']",true)[0]
+			# psm_node = self.find(peptide_node,"SpectrumIdentificationItem[@id=\'#{id_ref}\']",true)[0]
+			psm_node = self.psms_cache[id_ref]
 			score = self.get_cvParam(psm_node,"MS:1002466")['value'].to_f
 			if score>best_score
 				best_psm=psm_node
@@ -107,7 +151,7 @@ class MzIdentMLDoc < Object
 		best_psm
 	end
-	def self.get_sequence_for_peptide(peptide_node)
+	def get_sequence_for_peptide(peptide_node)
 		evidence_ref = peptide_node.attributes['peptideEvidence_ref']
 		pep_ref = peptide_node.find("//#{MZID_NS_PREFIX}:PeptideEvidence[@id=\'#{evidence_ref}\']","#{MZID_NS_PREFIX}:#{MZID_NS}")[0].attributes['peptide_ref']
 		peptide=peptide_node.find("//#{MZID_NS_PREFIX}:Peptide[@id=\'#{pep_ref}\']","#{MZID_NS_PREFIX}:#{MZID_NS}")[0]
@@ -115,13 +159,13 @@ class MzIdentMLDoc < Object
 		peptide.find("./#{MZID_NS_PREFIX}:PeptideSequence","#{MZID_NS_PREFIX}:#{MZID_NS}")[0].content
 	end
-	def self.get_sequence_for_psm(psm_node)
+	def get_sequence_for_psm(psm_node)
 		pep_ref = psm_node.attributes['peptide_ref']
 		peptide=psm_node.find("//#{MZID_NS_PREFIX}:Peptide[@id=\'#{pep_ref}\']","#{MZID_NS_PREFIX}:#{MZID_NS}")[0]
 		peptide.find("./#{MZID_NS_PREFIX}:PeptideSequence","#{MZID_NS_PREFIX}:#{MZID_NS}")[0].content
 	end
-	def self.get_peptide_evidence_from_psm(psm_node)
+	def get_peptide_evidence_from_psm(psm_node)
 		pe_nodes = []
 		self.find(psm_node,"PeptideEvidenceRef").each do |pe_node|
 			ev_id=pe_node.attributes['peptideEvidence_ref']

data/lib/protk/peptide.rb CHANGED

@@ -45,15 +45,15 @@ class Peptide
 		# 	<cvParam cvRef="PSI-MS" accession="MS:1001093" name="sequence coverage" value="0.0"/>
 		# </ProteinDetectionHypothesis>
-		def from_mzid(xmlnode)
+		def from_mzid(xmlnode,mzid_doc)
 			pep=new()
-			pep.sequence=MzIdentMLDoc.get_sequence_for_peptide(xmlnode)
-			best_psm = MzIdentMLDoc.get_best_psm_for_peptide(xmlnode)
+			pep.sequence=mzid_doc.get_sequence_for_peptide(xmlnode)
+			best_psm = mzid_doc.get_best_psm_for_peptide(xmlnode)
 			# require 'byebug';byebug
-			pep.probability = MzIdentMLDoc.get_cvParam(best_psm,"MS:1002466")['value'].to_f
-			pep.theoretical_neutral_mass = MzIdentMLDoc.get_cvParam(best_psm,"MS:1001117")['value'].to_f
+			pep.probability = mzid_doc.get_cvParam(best_psm,"MS:1002466")['value'].to_f
+			pep.theoretical_neutral_mass = mzid_doc.get_cvParam(best_psm,"MS:1001117")['value'].to_f
 			pep.charge = best_psm.attributes['chargeState'].to_i
-			pep.protein_name = MzIdentMLDoc.get_dbsequence(xmlnode.parent,xmlnode.parent.attributes['dBSequence_ref']).attributes['accession']
+			pep.protein_name = mzid_doc.get_dbsequence(xmlnode.parent,xmlnode.parent.attributes['dBSequence_ref']).attributes['accession']
 			# pep.charge = MzIdentMLDoc.get_charge_for_psm(best_psm)

data/lib/protk/prophet_tool.rb CHANGED

@@ -42,7 +42,9 @@ class ProphetTool < SearchTool
   		'cnbr' => 'M',
   		'elastase' => 'E',
   		'lysn' => 'L',
-  		'nonspecific' => 'N'
+  		'nonspecific' => 'N',
+      'no enzyme' => 'N',
+      'unspecific cleavage' => 'N'
   	}
   	codes[enzyme_name]

data/lib/protk/protein.rb CHANGED

@@ -84,26 +84,33 @@ class Protein
 		# This is hacked together to work for a specific PeptideShaker output type
 		# Refactor and properly respect cvParams for real conversion
 		#
-		def from_mzid(xmlnode)
+		def from_mzid(xmlnode,mzid_doc)
 			coverage_cvparam=""
 			prot=new()
 			groupnode = xmlnode.parent
 			prot.group_number=groupnode.attributes['id'].split("_").last.to_i+1
-			prot.protein_name=MzIdentMLDoc.get_dbsequence(xmlnode,xmlnode.attributes['dBSequence_ref']).attributes['accession']
-			prot.n_indistinguishable_proteins=MzIdentMLDoc.get_proteins_for_group(groupnode).length
-			prot.group_probability=MzIdentMLDoc.get_cvParam(groupnode,"MS:1002470").attributes['value'].to_f
+			prot.protein_name=mzid_doc.get_dbsequence(xmlnode,xmlnode.attributes['dBSequence_ref']).attributes['accession']
-			coverage_node=MzIdentMLDoc.get_cvParam(xmlnode,"MS:1001093")
+			prot.n_indistinguishable_proteins=mzid_doc.get_proteins_for_group(groupnode).length
+			prot.group_probability=mzid_doc.get_cvParam(groupnode,"MS:1002470").attributes['value'].to_f
+			coverage_node=mzid_doc.get_cvParam(xmlnode,"MS:1001093")
 			prot.percent_coverage=coverage_node.attributes['value'].to_f if coverage_node
-			prot.probability = MzIdentMLDoc.get_protein_probability(xmlnode)
+			prot.probability = mzid_doc.get_protein_probability(xmlnode)
 			# require 'byebug';byebug
-			peptide_nodes=MzIdentMLDoc.get_peptides_for_protein(xmlnode)
+			peptide_nodes=mzid_doc.get_peptides_for_protein(xmlnode)
+			prot.peptides = peptide_nodes.collect { |e| Peptide.from_mzid(e,mzid_doc) }
+			Constants.instance.log "Generated protein entry with probability #{prot.probability}" , :debug
-			prot.peptides = peptide_nodes.collect { |e| Peptide.from_mzid(e) }
 			prot
 		end

data/lib/protk/protein_group.rb CHANGED

@@ -35,18 +35,25 @@ class ProteinGroup
 		# This is hacked together to work for a specific PeptideShaker output type
 		# Refactor and properly respect cvParams for real conversion
 		#
-		def from_mzid(groupnode)
+		def from_mzid(groupnode,mzid_doc,minprob=0)
 			group=new()
 			group.group_number=groupnode.attributes['id'].split("_").last.to_i+1
-			group.group_probability=MzIdentMLDoc.get_cvParam(groupnode,"MS:1002470").attributes['value'].to_f
+			group.group_probability=mzid_doc.get_cvParam(groupnode,"MS:1002470").attributes['value'].to_f
 			# require 'byebug';byebug
-			protein_nodes=MzIdentMLDoc.get_proteins_for_group(groupnode)
+			protein_nodes=mzid_doc.get_proteins_for_group(groupnode)
+			group_members = protein_nodes.select do |e|
+				mzid_doc.get_protein_probability(e)>=minprob
+			end
+			group.proteins = group_members.collect { |e| Protein.from_mzid(e,mzid_doc) }
-			group.proteins = protein_nodes.collect { |e| Protein.from_mzid(e) }
 			group
 		end

data/lib/protk/psm.rb CHANGED

@@ -26,7 +26,7 @@ class PeptideEvidence
 #     dBSequence_ref="JEMP01000193.1_rev_g3500.t1" id="PepEv_1" />
 	class << self
-		def from_mzid(pe_node)
+		def from_mzid(pe_node,mzid_doc)
 			pe = new()
 			pe.peptide_prev_aa=pe_node.attributes['pre']
 			pe.peptide_next_aa=pe_node.attributes['post']
@@ -45,7 +45,7 @@ class PeptideEvidence
 			#   name="protein description" value="280755|283436" />
 			# </DBSequence>
 			pe.protein=prot_node.attributes['accession']
-			pe.protein_descr=MzIdentMLDoc.get_cvParam(prot_node,"MS:1001088")['value']
+			pe.protein_descr=mzid_doc.get_cvParam(prot_node,"MS:1001088")['value']
 			# pe.peptide_sequence=pep_node
@@ -163,11 +163,11 @@ class PSM
-		def from_mzid(psm_node)
+		def from_mzid(psm_node,mzid_doc)
 			psm = new()
-			psm.peptide = MzIdentMLDoc.get_sequence_for_psm(psm_node)
-			peptide_evidence_nodes = MzIdentMLDoc.get_peptide_evidence_from_psm(psm_node)
-			psm.peptide_evidence = peptide_evidence_nodes.collect { |pe| PeptideEvidence.from_mzid(pe) }
+			psm.peptide = mzid_doc.get_sequence_for_psm(psm_node)
+			peptide_evidence_nodes = mzid_doc.get_peptide_evidence_from_psm(psm_node)
+			psm.peptide_evidence = peptide_evidence_nodes.collect { |pe| PeptideEvidence.from_mzid(pe,mzid_doc) }
 			psm.calculated_mz = psm_node.attributes['calculatedMassToCharge'].to_f
 			psm.experimental_mz = psm_node.attributes['experimentalMassToCharge'].to_f

data/lib/protk/search_tool.rb CHANGED

@@ -34,13 +34,13 @@ class SearchTool < Tool
     end
     if ( option_support.include? :mass_tolerance_units )
-      add_value_option(:fragment_tolu,"Da",['--fragment-ion-tol-units tolu', 'Fragment ion mass tolerance units (Da or mmu). Default=Da'])
-      add_value_option(:precursor_tolu,"ppm",['--precursor-ion-tol-units tolu', 'Precursor ion mass tolerance units (ppm or Da). Default=ppm'])
+      add_value_option(:fragment_tolu,"Da",['--fragment-ion-tol-units tolu', 'Fragment ion mass tolerance units (Da or mmu).'])
+      add_value_option(:precursor_tolu,"ppm",['--precursor-ion-tol-units tolu', 'Precursor ion mass tolerance units (ppm or Da).'])
     end
     if ( option_support.include? :mass_tolerance )
-      add_value_option(:fragment_tol,0.65,['-f', '--fragment-ion-tol tol', 'Fragment ion mass tolerance (unit dependent). Default=0.65'])
-      add_value_option(:precursor_tol,200,['-p','--precursor-ion-tol tol', 'Precursor ion mass tolerance. Default=200'])
+      add_value_option(:fragment_tol,0.65,['-f', '--fragment-ion-tol tol', 'Fragment ion mass tolerance (unit dependent).'])
+      add_value_option(:precursor_tol,200,['-p','--precursor-ion-tol tol', 'Precursor ion mass tolerance.'])
     end
     if ( option_support.include? :precursor_search_type )
@@ -64,7 +64,7 @@ class SearchTool < Tool
     end
     if ( option_support.include? :searched_ions )
-      add_value_option(:searched_ions,"",['--searched-ions si', 'Ion series to search (default=b,y)'])
+      add_value_option(:searched_ions,"",['--searched-ions si', 'Ion series to search'])
     end
     if ( option_support.include? :multi_isotope_search )

data/lib/protk/spectrum_query.rb CHANGED

@@ -86,12 +86,12 @@ class SpectrumQuery
 		#   unitAccession="UO:0000010" unitName="seconds" />
 		# </SpectrumIdentificationResult>
-		def from_mzid(query_node)
+		def from_mzid(query_node,mzid_doc)
 			query = new()
-			query.spectrum_title = MzIdentMLDoc.get_cvParam(query_node,"MS:1000796")['value'].to_s
-			query.retention_time = MzIdentMLDoc.get_cvParam(query_node,"MS:1000894")['value'].to_f
+			query.spectrum_title = mzid_doc.get_cvParam(query_node,"MS:1000796")['value'].to_s
+			query.retention_time = mzid_doc.get_cvParam(query_node,"MS:1000894")['value'].to_f
 			items = MzIdentMLDoc.find(query_node,"SpectrumIdentificationItem")
-			query.psms = items.collect { |item| PSM.from_mzid(item) }
+			query.psms = items.collect { |item| PSM.from_mzid(item,mzid_doc) }
 			query
 		end

data/lib/protk/tool.rb CHANGED

@@ -26,8 +26,8 @@ class Tool
   # Options set from the command-line
   #
   attr :options, false
-  # The option parser used to parse command-line options.
+  # The option parser used to parse command-line options.
   #
   attr :option_parser
@@ -62,19 +62,27 @@ class Tool
       super
     end
   end
-  def add_value_option(symbol,default_value,opts)
+  def add_default_to_help(default_value,opts)
+    if default_value!=nil && default_value!=" " && default_value!=""
+      opts[-1] = "#{opts.last} [#{default_value.to_s}]"
+    end
+    opts
+  end
+  def add_value_option(symbol,default_value,opts)
     @options[symbol]=default_value
+    opts=add_default_to_help(default_value,opts)
     @option_parser.on(*opts) do |val|
       @options[symbol]=val
       @options_defined_by_user[symbol]=opts
     end
   end
   def add_boolean_option(symbol,default_value,opts)
     @options[symbol]=default_value
-    @option_parser.on(*opts) do
+    opts=add_default_to_help(default_value,opts)
+    @option_parser.on(*opts) do
       @options[symbol]=!default_value
       @options_defined_by_user[symbol]=opts
     end
@@ -92,10 +100,10 @@ class Tool
     options.encoding = "utf8"
     options.transfer_type = :auto
     options.verbose = false
     @options_defined_by_user={}
-    @option_parser=OptionParser.new do |opts|
+    @option_parser=OptionParser.new do |opts|
       opts.on( '-h', '--help', 'Display this screen' ) do
         puts opts
@@ -108,7 +116,7 @@ class Tool
     end
     if ( option_support.include? :over_write)
-      add_boolean_option(:over_write,false,['-r', '--replace-output', 'Dont skip analyses for which the output file already exists'])
+      add_boolean_option(:over_write,false,['-r', '--replace-output', 'Dont skip analyses for which the output file already exists'])
     end
     if ( option_support.include? :explicit_output )
@@ -120,7 +128,7 @@ class Tool
     end
     if ( option_support.include? :database)
-      add_value_option(:database,"sphuman",['-d', '--database dbname', 'Specify the database to use for this search. Can be a named protk database or the path to a fasta file'])
+      add_value_option(:database,"sphuman",['-d', '--database dbname', 'Specify the database to use for this search. Can be a named protk database or the path to a fasta file'])
     end
     if (option_support.include? :debug)
@@ -169,37 +177,37 @@ class Tool
         return true
       end
       missing = mandatory.select{ |param| self.send(param).nil? }
-      if not missing.empty?
-        puts "Missing options: #{missing.join(', ')}"
-        puts self.option_parser
-        return false
-      end
-    rescue OptionParser::InvalidOption, OptionParser::MissingArgument
-      puts $!.to_s
-      puts self.option_parser
-      return false
+      if not missing.empty?
+        puts "Missing options: #{missing.join(', ')}"
+        puts self.option_parser
+        return false
+      end
+    rescue OptionParser::InvalidOption, OptionParser::MissingArgument
+      puts $!.to_s
+      puts self.option_parser
+      return false
     end
     if ( require_input_file && ARGV[0].nil? )
       puts "You must supply an input file"
-      puts self.option_parser
+      puts self.option_parser
       return false
     end
     return true
-   end
+   end
    # Run the search tool using the given command string and global environment
    #
    def run(cmd,genv,autodelete=true)
     cmd_runner=CommandRunner.new(genv)
     cmd_runner.run_local(cmd)
    end
    def database_info
      case
-       when Pathname.new(@options.database).exist? # It's an explicitly named db
+       when Pathname.new(@options.database).exist? # It's an explicitly named db
          db_path=Pathname.new(@options.database).expand_path.to_s
          db_name=Pathname.new(@options.database).basename.to_s
        else
@@ -211,4 +219,4 @@ class Tool
-end
+end
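
Note on the `tool.rb` changes above: the new `add_default_to_help` method appends the default value in square brackets to the last element of the option spec, which is why the hard-coded "Default=..." text was removed from the descriptions in `search_tool.rb`. The sketch below copies that method body from the diff and runs it against a standalone `OptionParser`. The standalone parser and the variable names `tol_opts` and `ions_opts` are assumptions made for illustration, since the gem calls the method from inside its `Tool` class.

```ruby
require 'optparse'

# Method body as added in this diff: append "[default]" to the help text
# unless the default is nil, a single space, or the empty string.
def add_default_to_help(default_value, opts)
  if default_value != nil && default_value != " " && default_value != ""
    opts[-1] = "#{opts.last} [#{default_value}]"
  end
  opts
end

parser = OptionParser.new

# A numeric default is appended to the description.
tol_opts = add_default_to_help(
  0.65,
  ['-f', '--fragment-ion-tol tol', 'Fragment ion mass tolerance (unit dependent).']
)

# An empty-string default leaves the description unchanged.
ions_opts = add_default_to_help("", ['--searched-ions si', 'Ion series to search'])

parser.on(*tol_opts) { |val| }
parser.on(*ions_opts) { |val| }

puts parser.help
```

Two consequences follow from the method as written. A boolean option with a default of `false` passes all three checks, so its help text gains `[false]`. The `--searched-ions` option has an empty default, so after this change its help text no longer mentions the `b,y` series that the old description named.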

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: protk
 version: !ruby/object:Gem::Version
-  version: 1.4.2
+  version: 1.4.3
 platform: ruby
 authors:
 - Ira Cooke
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-05-20 00:00:00.000000000 Z
+date: 2015-10-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: open4
@@ -36,7 +36,7 @@ dependencies:
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: '1.4'
+        version: 1.4.3
     - - '>='
       - !ruby/object:Gem::Version
         version: 1.4.3
@@ -46,7 +46,7 @@ dependencies:
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: '1.4'
+        version: 1.4.3
     - - '>='
       - !ruby/object:Gem::Version
         version: 1.4.3
@@ -210,6 +210,7 @@ executables:
 - mzid_to_pepxml.rb
 - spectrast_create.rb
 - spectrast_filter.rb
+- filter_psms.rb
 extensions:
 - ext/decoymaker/extconf.rb
 extra_rdoc_files: []
@@ -217,6 +218,8 @@ files:
 - README.md
 - bin/add_retention_times.rb
 - bin/augustus_to_proteindb.rb
+- bin/filter_fasta.rb
+- bin/filter_psms.rb
 - bin/interprophet.rb
 - bin/make_decoy.rb
 - bin/manage_db.rb