bio-grid 0.2.7 → 0.3.0

data/README.md CHANGED
@@ -25,10 +25,8 @@ What is happening here is the following:
25
25
 
26
26
  * the ```-i``` options specifies the input files or, as in this case, the location where to find input files based on a typical wildcard expression. You can actually specify as many input files/locations as you need using a comma separated list.
27
27
  * the ```-n``` specify the job name
28
- * the ```-c``` is the command line to be executed on the cluster / grid system. What BioGrid does is to fill in the ```<input1>```,```<input2>``` and ```<output>``` placeholders with the corresponding parameters passed on the command line. This is done for each input file (or each group of input files) and BioGrid will check if the ```<output>``` placeholder has an extension (like .sam, .out etc.) and will generate a unique output file name for each job. IMPORTANT: If no extension is specified for the ```<output>``` placeholder, BioGrid will assume the job will generate more than one output files and that those files will be saved into the folder specified by the "-o" option. Therefore it will manage the output as a whole directory, copying and/or removing the entire folder if "-r" and "-e" options are present (check the [Other options](https://github.com/fstrozzi/bioruby-grid#other-options) section to see what these options are expected to do).
29
-
30
-
31
- * the ```-o``` set the location where output files for each job will be saved. Only provide the folder where you want to save the output file(s), BioGrid will take care of generating a unique file name for the output, if needed.
28
+ * the ```-c``` is the command line to be executed on the cluster / grid system. What BioGrid does is to fill in the ```<input1>```,```<input2>``` and ```<output>``` placeholders with the corresponding parameters passed on the command line. This is done for each input file (or each group of input files) and BioGrid will check if the ```<output>``` placeholder has an extension (like .sam, .out etc.) and will generate a unique output file name for each job.
29
+ * the ```-o``` set the location where output files for each job will be saved. Only provide the folder where you want to save the output file(s), BioGrid will take care of generating a unique file name for the output, if needed. Check the [Output management](https://github.com/fstrozzi/bioruby-grid#output-management) for more details.
32
30
  * the ```-s``` is a key parameter to specify the granularity of the jobs, setting the number of input files (or group of files, when more than one input placeholder is present in the command line) to be used for each job. So, going back to the FastQ example, if -s 1 is specified, each job will be run with exactly one FastQ R1 file and one FastQ R2 file. This gives you a great power in deciding how to split the entire dataset analysis across multiple computing nodes.
33
31
  * the ```-p``` parameter indicates how many processes we want to use for each job. This number needs to match with the actual number of threads / processes that our command or tool will use for the analysis.
34
32
 
@@ -45,6 +43,37 @@ mkdir -p /data/Project_X/Sample_Y_mapping
45
43
 
46
44
  and this will be repeated for every input file, according to the -s parameter. So, in this case given that we have 2 input files for each command line and that we had 60 R1 and 60 R2 FastQ files and we have specified "-s 1", 60 different jobs will be created and submitted, each with a specific read pair to be processed by Bowtie.
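The arithmetic in this paragraph can be sketched in Ruby. This is only an illustration under assumptions: `group_inputs` and the FastQ file names are hypothetical, not part of BioGrid, and the real grouping code is not shown in this diff. The sketch assumes that each input placeholder's file list is cut into slices of `-s` files and that job number `index` takes the slice at that position for every placeholder.

```ruby
# Hypothetical helper, not BioGrid's API: split each placeholder's file list
# into consecutive slices of `per_job` files (the value passed to -s).
def group_inputs(files_by_placeholder, per_job)
  files_by_placeholder.transform_values do |files|
    files.each_slice(per_job).to_a
  end
end

# Hypothetical file names standing in for the 60 R1 / 60 R2 FastQ files.
r1 = (1..60).map { |i| format("Sample_Y_R1_%03d.fastq", i) }
r2 = (1..60).map { |i| format("Sample_Y_R2_%03d.fastq", i) }

groups = group_inputs({ "input1" => r1, "input2" => r2 }, 1)

# One job per slice: with -s 1 there are as many jobs as read pairs.
job_count = groups["input1"].size
first_job = groups.transform_values(&:first)
```

With `per_job` set to 1, `job_count` equals the number of read pairs, and `first_job` holds one R1 file and one R2 file, matching the description above.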
47
45
 
46
+ Output management
47
+ -----------------
48
+ For each job, BioGrid will set an output name according to a UUID generated on the fly and the combination of the job name plus an incremental number. So a typical output file name will look like this:
49
+
50
+ ```shell
51
+ 3cb0b800_Bowtie_mapping_001.bam
52
+ ```
53
+ IMPORTANT: the UUID will be the same for all the jobs submitted in a same BioGrid run, the only changing part will be the incremental number.
54
+
55
+ If you want to do some [Advanced stuff](https://github.com/fstrozzi/bioruby-grid#advanced-stuff) and run parameters testing, the output names will be changed accordingly by BioGrid. So if I am running BioGrid to test some parameter ```-L``` for my favorite tool, and I am sampling it, with three different values, let's say 3, 7 and 10 the corresponding output files will be:
56
+
57
+ ```shell
58
+ 9ec55d90_tophat_001-param:3.sam
59
+ 9ec55d90_tophat_001-param:7.sam
60
+ 9ec55d90_tophat_001-param:10.sam
61
+ ```
62
+
63
+ If you are using the ```--param``` options to test non-numerical parameters, the corresponding parameter value (or name) will be appended to the output file name in the same way:
64
+
65
+ ```shell
66
+ 9ec55d90_tophat_001-param:--sensitive.sam
67
+ 9ec55d90_tophat_001-param:--fast.sam
68
+ ```
69
+
70
+ ###Differences between output files and output folder
71
+
72
+ BioGrid will act differently if the output of a single job is a file or a folder. You need to specify this, by adding a file extension to the ```<output>``` placeholder. So, for instance, if the output file of my job is a BAM file, I will need to specify this in the command line definition, by putting a ```<output>.bam``` .
73
+ **If no extension is specified for the ```<output>``` placeholder in the command line definition, BioGrid will assume the job will generate more than one output file and that those files will be saved into the folder specified by the ```-o``` option**. Therefore it will manage the output as a whole directory, copying and/or removing the entire folder if ```-r``` and ```-e``` options are present (check the [Other options](https://github.com/fstrozzi/bioruby-grid#other-options) section to see what these options are expected to do).
74
+
75
+ The naming conventions for the output folder are same as for the output files.
76
+
48
77
  Other options
49
78
  -------------
50
79
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.7
1
+ 0.3.0
@@ -5,7 +5,7 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "bio-grid"
8
- s.version = "0.2.7"
8
+ s.version = "0.3.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Francesco Strozzi"]
@@ -18,15 +18,13 @@ module Bio
18
18
  inputs.each do |input|
19
19
  commandline.gsub!(/<#{input}>/,groups[input][index].join(self.options[:sep]))
20
20
  end
21
- job_output = ""
21
+ job_output = self.options[:output]+"/#{options[:uuid]}_"+self.options[:name]+"_%03d" % (index+1).to_s + "#{self.options[:parameter_value]}"
22
22
  if commandline =~/<output>\.(\S+)/
23
23
  extension = $1
24
- job_output = self.options[:output]+"/#{options[:uuid]}_"+self.options[:name]+"_output_%03d" % (index+1).to_s + "#{self.options[:parameter_value]}"
25
24
  commandline.gsub!(/<output>/,job_output)
26
25
  job_output << ".#{extension}"
27
26
  else
28
27
  self.options[:output_folder] = true
29
- job_output = self.options[:output]+"/#{options[:uuid]}_"+self.options[:name]
30
28
  commandline.gsub!(/<output>/,job_output)
31
29
  end
32
30
  self.instructions << commandline+"\n"
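The net effect of this hunk is that the output name is now built once, before the extension check, and both branches share it (the `_output_` infix is gone, and folder outputs gain the incremental number and parameter value). A minimal standalone sketch of that logic follows; `build_job_output` is a hypothetical helper, `options` is a plain Hash standing in for `self.options`, and the command line in the usage lines is made up for illustration.

```ruby
# Hypothetical standalone version of the 0.3.0 naming logic shown above.
# Returns [filled command line, job output path, output_is_folder].
def build_job_output(commandline, options, index)
  # <output dir>/<uuid>_<job name>_<NNN><parameter value, if any>
  base = format("%s/%s_%s_%03d%s",
                options[:output], options[:uuid], options[:name],
                index + 1, options[:parameter_value])
  filled = commandline.gsub(/<output>/, base)
  if commandline =~ /<output>\.(\S+)/
    # An extension after the placeholder means a single output file.
    [filled, "#{base}.#{Regexp.last_match(1)}", false]
  else
    # No extension: the output is managed as a whole folder.
    [filled, base, true]
  end
end

opts = { output: "/data/out", uuid: "3cb0b800", name: "Bowtie_mapping" }
cmd, out, folder = build_job_output("mytool <input1> > <output>.bam", opts, 0)
```

Here `out` follows the `3cb0b800_Bowtie_mapping_001.bam` pattern documented in the README hunk, placed under the `-o` folder.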
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: bio-grid
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.7
4
+ version: 0.3.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -146,7 +146,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
146
146
  version: '0'
147
147
  segments:
148
148
  - 0
149
- hash: 3769977352097433669
149
+ hash: 2907219030788310971
150
150
  required_rubygems_version: !ruby/object:Gem::Requirement
151
151
  none: false
152
152
  requirements: