protk 1.4.1 → 1.4.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 7329f51a45b5449ec979e76aca5727c6714a5bc8
- data.tar.gz: e96f553b27c61c7ba1935d379e01086e9cb00725
+ metadata.gz: 7377c1480498f852b7e747d13e9a7d985523fcef
+ data.tar.gz: 2cb2c652e53ec636fb521cb35a687324ee810af8
  SHA512:
- metadata.gz: fb933aa9ce0cc6fabb19b0a731bb8d74f23456937ec4d08973477f8f956eb733fcb44b26b66fc61242f5ea6c617d0c3a178b83f16dc568ad5f330db9dcd27c1d
- data.tar.gz: 5b2b370cea53d3a3ec9eee9d5916df8f910ef12181c0bce9b660c1d82d042e00a5b66ce7fcba6a864cd1a6266aa2fbffec998ae26700f4ddd7903ab141ba3241
+ metadata.gz: c4e72457cc9ada490ea6210c9d13e6e5d240c0d19399c5b015feb30cebb881b51bae62cb8d4aa831d8aee397a747d5c16b1433f0ea08656474a56270f709b3d7
+ data.tar.gz: e8893c4fda75666fdf4ed3cb6d6dc0bb34ceb3041c2b19883e82891aa5245a4a7aa99cc4fe677f2fb6f28c2b12d5bbb89a213a59e3f1f0e4d8dc927a9f6ff510
data/README.md CHANGED
@@ -22,7 +22,10 @@ Protk is a ruby gem and requires ruby 2.0 or higher with support for libxml2. To
  gem install protk
  ```
 
+ ## Ruby Compatibility
 
+ In general Protk requires ruby >= 2.0.
+ Do not use ruby 2.1.5, as it has a bug that causes a deadlock when open4 child processes write to stderr.
 
  ## Usage
 
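The version constraint above can also be checked at runtime. A minimal sketch (hedged: protk itself may not ship such a guard; the check simply hard-codes the bad 2.1.5 version noted above):

```ruby
# Refuse to run on interpreters protk is known not to support.
# Ruby 2.1.5 is singled out because of the open4/stderr deadlock
# described in the README note above.
def check_ruby_version!(version = RUBY_VERSION)
  if Gem::Version.new(version) < Gem::Version.new("2.0")
    raise "protk requires ruby >= 2.0 (found #{version})"
  end
  if version == "2.1.5"
    raise "ruby 2.1.5 deadlocks with open4 child processes; use another version"
  end
  true
end
```

Calling `check_ruby_version!` early in a script fails fast with a clear message instead of hanging mid-run.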
@@ -60,32 +63,28 @@ By default protk will install tools and databases into `.protk` in your home dir
  ```
 
 
- ## Sequence databases
 
- Protk also includes a script called manage_db.rb to install specific sequence databases for use by the search engines if desired. Databases installed via manage_db.rb can be invoked using a shorthand name rather than a full path to a fasta file, and Protk also provides some automation for database upgrades. Protk comes with several predefined database configurations. For example, to install a database consisting of human entries from Swissprot plus known contaminants use the following commands;
+ ## Galaxy Integration
 
- ```sh
- manage_db.rb add --predefined crap
- manage_db.rb add --predefined sphuman
- manage_db.rb update crap
- manage_db.rb update sphuman
- ```
+ Many protk tools have equivalent galaxy wrappers available on the [galaxy toolshed](http://toolshed.g2.bx.psu.edu/), with source code and development occurring in the [protk-galaxytools](https://github.com/iracooke/protk-galaxytools) repository on github. In order for these tools to work you will also need to make sure that protk, as well as the necessary third party dependencies, are available to galaxy during tool execution.
 
- You should now be able to run database searches, specifying this database by using the -d sphuman flag. Every month or so swissprot will release a new database version. You can keep your database up to date using the manage_db.rb update command. This will update the database only if any of its source files (or ftp release notes) have changed. The manage_db.rb tool also allows completely custom databases to be configured. Setup requires adding quite a few command-line options but once setup, databases can easily be updated without further config. The example below shows the commandline arguments required to manually configure the sphuman database.
+ There are two ways to do this:
 
- ```sh
- manage_db.rb add --ftp-source 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt' --include-filters '/OS=Homo\ssapiens/' --id-regex 'sp\|.*\|(.*?)\s' --add-decoys --make-blast-index --archive-old sphuman
- ```
+ **Using Docker:**
 
- ## Galaxy Integration
+ By far the easiest way to do this is to set up your Galaxy instance to run tools in Docker containers. All the tools in the [protk-galaxytools](https://github.com/iracooke/protk-galaxytools) repository are designed to work with [this](https://github.com/iracooke/protk-dockerfile) docker image, and will download and use the image automatically on appropriately configured Galaxy instances.
+
+ **Manual Install:**
 
- Many protk tools have equivalent galaxy wrappers available on the [galaxy toolshed](http://toolshed.g2.bx.psu.edu/) . In order for these tools to work you will also need to make sure that protk, as well as the necessary third party dependencies are available to galaxy during tool execution. If you install protk using the default system ruby (without rvm) this will probably just work, however you will lose the ability to run specific versions of tools against specific versions of protk. The recommended method of installing protk for use with galaxy is as follows;
+ If your galaxy instance is unable to use Docker for some reason you will need to install `protk` and its dependencies manually.
+
+ One way to install protk would be to just do `gem install protk` using the default system ruby (without rvm). This will probably just work; however, you will lose the ability to run specific versions of tools against specific versions of protk. The recommended method of installing protk for use with galaxy is as follows;
 
  1. Ensure you have a working install of galaxy.
 
  [Full instructions](https://wiki.galaxyproject.org/Admin/GetGalaxy) are available on the official Galaxy project wiki page. We assume you have galaxy installed in a directory called galaxy-dist.
 
- 2. Install rvm if you haven't allready. See [here](https://rvm.io/) for more information.
+ 2. Install rvm if you haven't already. See [here](https://rvm.io/) for more information.
 
  ```bash
  curl -sSL https://get.rvm.io | bash -s stable
@@ -148,4 +147,22 @@ Many protk tools have equivalent galaxy wrappers available on the [galaxy toolsh
  ln -s 1.5 default
  ```
 
+ ## Sequence databases
+
+ All `protk` tools are designed to work with sequence databases provided as simple fasta formatted flat files. For most use cases it is simplest to just manage these manually.
+
+ Protk includes a script called `manage_db.rb` to install certain sequence databases in a central repository. Databases installed via `manage_db.rb` can be invoked using a shorthand name rather than a full path to a fasta file. Protk comes with several predefined database configurations. For example, to install a database consisting of human entries from Swissprot plus known contaminants use the following commands;
+
+ ```sh
+ manage_db.rb add --predefined crap
+ manage_db.rb add --predefined sphuman
+ manage_db.rb update crap
+ manage_db.rb update sphuman
+ ```
+
+ You should now be able to run database searches, specifying this database with the `-d sphuman` flag. Every month or so Swissprot releases a new database version. You can keep your database up to date using the `manage_db.rb update` command, which updates the database only if any of its source files (or ftp release notes) have changed. The `manage_db.rb` tool also allows completely custom databases to be configured. Setup requires quite a few command-line options, but once set up, databases can easily be updated without further config. The example below shows the command-line arguments required to manually configure the sphuman database.
+
+ ```sh
+ manage_db.rb add --ftp-source 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/reldate.txt' --include-filters '/OS=Homo\ssapiens/' --id-regex 'sp\|.*\|(.*?)\s' --add-decoys --make-blast-index --archive-old sphuman
+ ```
 
@@ -0,0 +1,75 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 8/5/2015
+ #
+ # Convert mzid to pepXML
+ #
+ #
+
+ require 'libxml'
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/mzidentml_doc'
+ require 'protk/spectrum_query'
+ require 'protk/pepxml_writer'
+ require 'protk/tool'
+
+ include LibXML
+
+ XML.indent_tree_output=true
+
+
+ # Setup specific command-line options for this tool. Other options are inherited from Tool
+ #
+ tool=Tool.new([:explicit_output,:debug])
+ # tool.add_value_option(:minprob,0.05,['--minprob mp',"Minimum probability for psm to be included in the output"])
+
+ tool.option_parser.banner = "Convert an mzIdentML file to pep.xml\n\nUsage: mzid_to_pepxml.rb [options] file1.mzid"
+
+ exit unless tool.check_options(true)
+
+ $protk = Constants.instance
+ log_level = tool.debug ? "info" : "warn"
+ $protk.info_level= log_level
+
+ input_file=ARGV[0]
+
+ if tool.explicit_output
+ output_file_name=tool.explicit_output
+ else
+ output_file_name=Tool.default_output_path(input_file,".pep.xml","","")
+ end
+
+ pep_xml_writer = PepXMLWriter.new
+
+ mzid_doc = MzIdentMLDoc.new(input_file)
+
+ spectrum_queries = mzid_doc.spectrum_queries
+
+ n_queries = spectrum_queries.length
+
+ $protk.log "Converting #{n_queries} spectrum queries", :info
+ $protk.log "Output will be written to #{output_file_name}", :info
+
+ i=0
+ n_written=0
+ progress_increment=1
+ spectrum_queries.each do |query_node|
+ if i % progress_increment ==0
+ $stdout.write "Scanned #{i} and read #{n_written} of #{n_queries}\r"
+ end
+
+ # require 'byebug';byebug
+
+ query = SpectrumQuery.from_mzid(query_node)
+ pep_xml_writer.append_spectrum_query(query.as_pepxml)
+ n_written+=1
+
+ i+=1
+
+ end
+
+ $protk.log "Writing #{n_written} spectrum queries to #{output_file_name}", :info
+
+ pep_xml_writer.save(output_file_name)
@@ -0,0 +1,77 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 7/5/2015
+ #
+ # Convert mzid to protXML
+ #
+ #
+
+ require 'libxml'
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/mzidentml_doc'
+ require 'protk/protein_group'
+ require 'protk/tool'
+
+ include LibXML
+
+ XML.indent_tree_output=true
+
+
+ # Setup specific command-line options for this tool. Other options are inherited from ProphetTool
+ #
+ tool=Tool.new([:explicit_output,:debug])
+ tool.add_value_option(:minprob,0.05,['--minprob mp',"Minimum probability for protein to be included in the output"])
+
+ tool.option_parser.banner = "Convert an mzIdentML file to protXML.\n\nUsage: mzid_to_protxml.rb [options] file1.mzid"
+
+ exit unless tool.check_options(true)
+
+ $protk = Constants.instance
+ log_level = tool.debug ? "info" : "warn"
+ $protk.info_level= log_level
+
+ input_file=ARGV[0]
+
+ if tool.explicit_output
+ output_file_name=tool.explicit_output
+ else
+ output_file_name=Tool.default_output_path(input_file,".protXML","","")
+ end
+
+ prot_xml_writer = ProtXMLWriter.new
+
+ mzid_doc = MzIdentMLDoc.new(input_file)
+
+ protein_groups = mzid_doc.protein_groups
+
+ n_prots = protein_groups.length
+
+ $protk.log "Converting #{n_prots} protein_groups", :info
+ $protk.log "Output will be written to #{output_file_name}", :info
+
+ i=0
+ n_written=0
+ progress_increment=1
+ protein_groups.each do |group_node|
+ if i % progress_increment ==0
+ $stdout.write "Scanned #{i} and read #{n_written} of #{n_prots}\r"
+ end
+
+ # require 'byebug';byebug
+ group_prob = MzIdentMLDoc.get_cvParam(group_node,"MS:1002470").attributes['value'].to_f*0.01
+
+ if group_prob > tool.minprob.to_f
+ group = ProteinGroup.from_mzid(group_node)
+ prot_xml_writer.append_protein_group(group.as_protxml)
+ n_written+=1
+ end
+
+ i+=1
+
+ end
+
+ $protk.log "Writing #{n_written} proteins to #{output_file_name}", :info
+
+ prot_xml_writer.save(output_file_name)
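The script above scales the MS:1002470 cvParam value by 0.01 because mzIdentML stores that probability as a percentage, while `--minprob` is a 0-1 fraction. A standalone sketch of just that conversion, using stdlib REXML instead of protk's libxml helpers (the XML fragment is illustrative, not taken from a real file):

```ruby
require 'rexml/document'

# Illustrative ProteinAmbiguityGroup fragment; the MS:1002470
# cvParam carries the group probability as a percentage.
xml = <<~XML
  <ProteinAmbiguityGroup id="PAG_1">
    <cvParam accession="MS:1002470" value="99.5"/>
  </ProteinAmbiguityGroup>
XML

doc = REXML::Document.new(xml)
param = REXML::XPath.first(doc, "//cvParam[@accession='MS:1002470']")
group_prob = param.attributes['value'].to_f * 0.01  # percentage -> fraction

puts group_prob
```

With the default `--minprob 0.05`, a group like this one (0.995) passes the threshold and is written out.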
@@ -155,7 +155,7 @@ proteins.each do |protein|
  peptides = tool.stack_charge_states ? protein.peptides : protein.representative_peptides
 
  peptides.each do |peptide|
- if peptide.nsp_adjusted_probability >= tool.peptide_probability_threshold
+ if peptide.probability >= tool.peptide_probability_threshold
  peptide_entries = peptide.to_gff3_records(protein_entry.aaseq,gff_parent_entry,gff_cds_entries)
  peptide_entries.each do |peptide_entry|
  output_fh.write peptide_entry.to_s
data/bin/sixframe.rb CHANGED
@@ -25,7 +25,7 @@ end
 
  tool=Tool.new([:explicit_output])
  tool.option_parser.banner = "Create a sixframe translation of a genome.\n\nUsage: sixframe.rb [options] genome.fasta"
-
+ tool.add_boolean_option(:peptideshaker,false,['--peptideshaker', 'Format fasta output for peptideshaker compatibility'])
  tool.add_boolean_option(:print_coords,false,['--coords', 'Write genomic coordinates in the fasta header'])
  tool.add_boolean_option(:keep_header,true,['--strip-header', 'Dont write sequence definition'])
  tool.add_value_option(:min_len,20,['--min-len l','Minimum ORF length to keep'])
@@ -43,8 +43,22 @@ if tool.write_gff
  output_fh.write "##gff-version 3\n"
  end
 
+ accession_prefix=tool.peptideshaker ? "generic" : "lcl"
+ coords_separator=tool.peptideshaker ? "|" : " "
+
  file = Bio::FastaFormat.open(input_file)
 
+ def passes_qc(orf,tool)
+ long_enough = orf.length > tool.min_len.to_i
+
+ composition_ok=true
+ if tool.peptideshaker && (orf=~/X/)
+ composition_ok=false
+ end
+
+ (long_enough && composition_ok)
+ end
+
  file.each do |entry|
 
  length = entry.naseq.length
@@ -58,7 +72,7 @@ file.each do |entry|
  oi=0
  orfs.each do |orf|
  oi+=1
- if ( orf.length > tool.min_len.to_i )
+ if ( passes_qc(orf,tool) )
 
  position_start = position
  position_end = position_start + orf.length*3 -1
@@ -71,15 +85,20 @@ file.each do |entry|
  end
 
  # Create accession compliant with NCBI naming standard
+ #
  # See http://www.ncbi.nlm.nih.gov/books/NBK7183/?rendertype=table&id=ch_demo.T5
+ #
+ # Or with PeptideShaker standard
+ #
+ #
  ncbi_scaffold_id = entry.entry_id.gsub('|','_').gsub(' ','_')
- ncbi_accession = "lcl|#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
+ ncbi_accession = "#{accession_prefix}|#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
  gff_id = "#{ncbi_scaffold_id}_frame_#{frame}_orf_#{oi}"
 
  defline=">#{ncbi_accession}"
 
  if tool.print_coords
- defline << " #{position_start}|#{position_end}"
+ defline << "#{coords_separator}#{position_start}|#{position_end}"
  end
 
  if tool.keep_header
@@ -88,7 +107,7 @@ file.each do |entry|
 
  if tool.write_gff
  strand = frame>3 ? "-" : "+"
- # score = self.nsp_adjusted_probability.nil? ? "." : self.nsp_adjusted_probability.to_s
+ # score = self.probability.nil? ? "." : self.probability.to_s
  # gff_string = "#{parent_record.seqid}\tMSMS\tpolypeptide\t#{start_i}\t#{end_i}\t#{score}\t#{parent_record.strand}\t0\tID=#{this_id};Parent=#{cds_id}"
  output_fh.write("#{ncbi_scaffold_id}\tsixframe\tCDS\t#{position_start}\t#{position_end}\t.\t#{strand}\t0\tID=#{gff_id}\n")
  else
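The sixframe.rb changes above switch between two defline conventions: NCBI-style `lcl|...` with a space before the coordinates, and PeptideShaker-style `generic|...` with `|` as the separator. A self-contained sketch mirroring that logic (the `defline` helper name is ours, not a protk API):

```ruby
# Build a fasta defline the way the diff above does: NCBI "lcl|"
# accessions with a space-separated coordinate suffix by default,
# or PeptideShaker "generic|" accessions with "|" separators.
def defline(scaffold_id, frame, orf_i, start_pos, end_pos,
            peptideshaker: false, coords: true)
  prefix = peptideshaker ? "generic" : "lcl"
  sep    = peptideshaker ? "|" : " "
  id     = scaffold_id.gsub('|', '_').gsub(' ', '_')
  line   = ">#{prefix}|#{id}_frame_#{frame}_orf_#{orf_i}"
  line << "#{sep}#{start_pos}|#{end_pos}" if coords
  line
end
```

For example, `defline("scaf 1", 2, 3, 10, 100)` yields an `lcl|` header, while passing `peptideshaker: true` yields the `generic|` form PeptideShaker expects.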
@@ -0,0 +1,125 @@
+ #!/usr/bin/env ruby
+ #
+ # This file is part of protk
+ # Created by Ira Cooke 30/4/2015
+ #
+ # A wrapper for the SpectraST create command
+ #
+ #
+
+ require 'protk/constants'
+ require 'protk/command_runner'
+ require 'protk/tool'
+ require 'protk/galaxy_util'
+ require 'protk/pepxml'
+ require 'protk/sniffer'
+ require 'protk/mzml_parser'
+
+ for_galaxy = GalaxyUtil.for_galaxy?
+
+ genv=Constants.instance
+
+ # Setup specific command-line options for this tool. Other options are inherited from ProphetTool
+ #
+ spectrast_tool=Tool.new([:explicit_output])
+ spectrast_tool.option_parser.banner = "Create a spectral library from pep.xml input files.\n\nUsage: spectrast_create.rb [options] file1.pep.xml file2.pep.xml ..."
+ spectrast_tool.add_value_option(:spectrum_files,"",['--spectrum-files sf','Paths to raw spectrum files. These should be provided in a comma separated list'])
+ spectrast_tool.add_boolean_option(:binary_output,false,['-B','--binary-output','Produce spectral libraries in binary format rather than ASCII'])
+ spectrast_tool.add_value_option(:filter_predicate,nil,['--predicate pred','Keep only spectra satisfying predicate pred. Should be a C-style predicate'])
+ spectrast_tool.add_value_option(:probability_threshold,0.99,['--p-thresh val', 'Probability threshold below which spectra are discarded'])
+ spectrast_tool.add_value_option(:instrument_acquisition,"CID",['--instrument-acquisition setting',
+ 'Set the instrument and acquisition settings of the spectra (in case not specified in data files).
+ Examples: CID, ETD, CID-QTOF, HCD. The latter two are treated as high-mass accuracy spectra.'])
+
+ exit unless spectrast_tool.check_options(true)
+
+ spectrast_bin = %x[which spectrast].chomp
+
+ # Options: GENERAL OPTIONS
+ # -cF<file> Read create options from file <file>.
+ # If <file> is not given, "spectrast_create.params" is assumed.
+ # NOTE: All options set in the file will be overridden by command-line options, if specified.
+ # -cm<remark> Remark. Add a Remark=<remark> comment to all library entries created.
+ # -cM<format> Write all library spectra as MRM transition tables. Leave <format> blank for default. (Turn off with -cM!)
+ # -cT<file> Use probability table in <file>. Only those peptide ions included in the table will be imported.
+ # A probability table is a text file with one peptide ion in the format AC[160]DEFGHIK/2 per line.
+ # If a probability is supplied following the peptide ion separated by a tab, it will be used to replace the original probability of that library entry.
+ # -cO<file> Use protein list in <file>. Only those peptide ions associated with proteins in the list will be imported.
+ # A protein list is a text file with one protein identifier per line.
+ # If a number X is supplied following the protein separated by a tab, then at most X peptide ions associated with that protein will be imported.
+
+ # PEPXML IMPORT OPTIONS (Applicable with .pepXML files)
+ # -cP<prob> Include all spectra identified with probability no less than <prob> in the library.
+ # -cq<fdr> (Only PepXML import) Only include spectra with global FDR no greater than <fdr> in the library.
+ # -cn<name> Specify a dataset identifier for the file to be imported.
+ # -co Add the originating mzXML file name to the dataset identifier. Good for keeping track of in which
+ # MS run the peptide is observed. (Turn off with -co!)
+ # -cg Set all asparagines (N) in the motif NX(S/T) as deamidated (N[115]). Use for glycocaptured peptides. (Turn off with -cg!).
+ # -cI Set the instrument and acquisition settings of the spectra (in case not specified in data files).
+ # Examples: -cICID, -cIETD, -cICID-QTOF, -cIHCD. The latter two are treated as high-mass accuracy spectra.
+ #
+
+ # -cf<pred> Filter library. Keep only those entries satisfying the predicate <pred>.
+ # <pred> should be a C-style predicate in quotes.
+
+ input_stagers=[]
+ inputs=ARGV.collect { |file_name| file_name.chomp}
+ if for_galaxy
+ input_stagers = inputs.collect {|ip| GalaxyUtil.stage_pepxml(ip) }
+ inputs=input_stagers.collect { |sg| sg.staged_path }
+ end
+
+ spectrum_file_paths=spectrast_tool.spectrum_files.split(",").collect { |mod| mod.lstrip.rstrip }.reject {|e| e.empty? }
+
+ spectrum_file_paths.each do |rf|
+ throw "Provided spectrum file #{rf} does not exist" unless File.exists? rf
+ format = Sniffer.sniff_format(rf)
+ throw "Unrecognised format #{format} detected for spectrum file #{rf}" unless ["mzML","mgf"].include? format
+
+ # basename_no_ext = File.basename(rf,File.extname(rf))
+ runid_name = MzMLParser.new(rf).next_runid()
+
+ expected_name = "#{runid_name}.#{format}"
+
+ if for_galaxy || !File.exists?(expected_name)
+ raw_input_stager = GalaxyStager.new(rf,{:extension=>".#{format}",:name=>runid_name})
+ puts raw_input_stager.staged_path
+ end
+
+ end
+
+
+ cmd="#{spectrast_bin} "
+
+ unless spectrast_tool.binary_output
+ cmd << " -c_BIN!"
+ end
+
+ if spectrast_tool.filter_predicate
+ cmd << " -cf'#{spectrast_tool.filter_predicate}'"
+ end
+
+
+
+ cmd << " -cI#{spectrast_tool.instrument_acquisition}"
+
+ if spectrast_tool.explicit_output==nil
+ output_file_name=Tool.default_output_path(inputs,"","","")
+ else
+ output_file_name=spectrast_tool.explicit_output
+ end
+
+ cmd << " -cN#{output_file_name}"
+
+ cmd << " -cP#{spectrast_tool.probability_threshold}"
+
+ inputs.each { |ip| cmd << " #{ip}" }
+
+ # code = spectrast_tool.run(cmd,genv)
+ # throw "Command failed with exit code #{code}" unless code==0
+
+ %x[#{cmd}]
+
+
+
+