bio-pipengine 0.6.0

checksums.yaml ADDED
---
SHA1:
  metadata.gz: e56d9733fe3de87e37dc417bd0d639e69467df40
  data.tar.gz: 094a0ba1a27670321641950598337859484e0a8f
SHA512:
  metadata.gz: 6cd5303f273ea2821290149a21a7e4fd2ab8c7440aeb411db410020b05d4b7598a125290b40c70e6ec5602f97aabbb8c20f2abc351e12cf3079d569fa9597e17
  data.tar.gz: 1b1a89e37ada7bc4f6dffa63238bbb0ef9796efd6b029b99d9b812fadc8c0af90388cf330519c4cbe433b69345c9af46fd4e1925bcdfc27c57ebd9c3e88a6fc2
data/LICENSE.txt ADDED
Copyright (c) 2012 Francesco Strozzi

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
PipEngine
=========

A simple launcher for complex biological pipelines.

PipEngine generates runnable shell scripts, already configured for the PBS/Torque job scheduler, for each sample in the pipeline. It lets you run a complete pipeline, a single step, or a selection of steps.

PipEngine is best suited for NGS pipelines, but it can be used for any kind of pipeline that can be run on a job scheduling system.


:: Topics ::
============

[Usage](https://github.com/bioinformatics-ptp/bioruby-pipengine#-usage-)

[The Pipeline YAML](https://github.com/bioinformatics-ptp/bioruby-pipengine#-the-pipeline-yaml-)

[The Samples YAML](https://github.com/bioinformatics-ptp/bioruby-pipengine#-the-samples-yaml-)

[Input and output conventions](https://github.com/bioinformatics-ptp/bioruby-pipengine#-input-and-output-conventions-)

[Sample groups and complex steps](https://github.com/bioinformatics-ptp/bioruby-pipengine#-sample-groups-and-complex-steps-)

[What happens at run-time](https://github.com/bioinformatics-ptp/bioruby-pipengine#-what-happens-at-run-time-)

[Examples](https://github.com/bioinformatics-ptp/bioruby-pipengine#-examples-)

[PBS Options](https://github.com/bioinformatics-ptp/bioruby-pipengine#-pbs-options-)

:: Usage ::
===========

PipEngine is divided into two main sections:

```shell
> pipengine -h
List of available commands:
   run     Submit pipelines to the job scheduler
   jobs    Show statistics and interact with running jobs
```

Since PipEngine uses the [TORQUE-RM](https://github.com/helios/torque_rm) gem to interact with the job scheduler, on the first run PipEngine will ask a few questions to prepare the required configuration file (e.g. the IP address and username used to connect via SSH to the PBS server or master node).

Command line for JOBS mode
--------------------------
With this mode, PipEngine interacts with the job scheduler (Torque/PBS for now), allowing you to search submitted jobs and to send delete commands to remove jobs from the scheduler.

```shell
> pipengine jobs [options]
```


**Parameters**
```shell
  --job-id, -i <s+>:    Search submitted jobs by Job ID
  --job-name, -n <s+>:  Search submitted jobs by Job Name
  --delete, -d <s+>:    Delete submitted jobs ('all' to erase everything or type one or more job IDs)
```


Command line for RUN mode
-------------------------

With this mode, PipEngine will submit pipeline jobs to the scheduler.

**Command line**
```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
```

**Parameters**
```shell
  --pipeline, -p <s>:         YAML file with pipeline and sample details
                              (default: pipeline.yml)
  --samples-file, -f <s>:     YAML file with samples name and directory paths
                              (default: samples.yml)
  --samples, -l <s+>:         List of sample names to run the pipeline
  --steps, -s <s+>:           List of steps to be executed
  --dry, -d:                  Dry run. Just create the job script without
                              submitting it to the batch system
  --tmp, -t <s>:              Temporary output folder
  --create-samples, -c <s+>:  Create samples.yml file from a Sample directory
                              (only for CASAVA projects)
  --multi, -m <s+>:           List of samples to be processed by a given step
                              (the order matters)
  --group, -g <s>:            Specify the group of samples to run the
                              pipeline steps on (do not specify --multi)
  --name, -n <s>:             Analysis name
  --output-dir, -o <s>:       Output directory (override standard output
                              directory names)
  --pbs-opts, -b <s+>:        PBS options
  --pbs-queue, -q <s>:        PBS queue
  --inspect-pipeline, -i <s>: Show steps
  --mail-exit, -a <s>:        Send an Email when the job terminates
  --mail-start, -r <s>:       Send an Email when the job starts
  --log <s>:                  Log script activities, by default stdin.
                              Options are fluentd (default: stdin)
  --log-adapter, -e <s>:      (stdin|syslog|fluentd) In case of fluentd use
                              http://destination.hostname:port/yourtag
  --help, -h:                 Show this message
```

PipEngine accepts two input files:
* A YAML file describing the pipeline steps
* A YAML file describing sample names, sample locations and other sample-specific information


:: The Pipeline YAML ::
=======================

The basic structure of a pipeline YAML is divided into three parts: 1) pipeline name, 2) resources, 3) steps.

An example YAML file looks like the following:

```yaml

pipeline: resequencing

resources:
  fastqc: /software/FastQC/fastqc
  bwa: /software/bwa-0.6.2/bwa
  gatk: /software/gatk-lite/GenomeAnalysisTk.jar
  samtools: /software/samtools
  samsort: /software/picard-tools-1.77/SortSam.jar
  mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
  bam: /software/bam
  pigz: /software/pigz

steps:
  mapping:
    desc: Run BWA on each sample to perform alignment
    run:
     - ls <sample_path>/*_R1_*.gz | xargs zcat | <pigz> -p 10 >> R1.fastq.gz
     - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
     - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
     - rm -f R1.fastq.gz R2.fastq.gz
    cpu: 11

  mark_dup:
    run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false

  realign_target:
    run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
    cpu: 8

  realign:
    run: java -Xmx4g -jar <gatk> -T IndelRealigner -LOD 0.4 -model USE_READS --disable_bam_indexing --target_intervals <realign_target/sample>.indels.intervals -R <genome> -I <mark_dup/sample>.md.sort.bam -o <sample>.realigned.bam

  fixtags:
    run: <samtools> calmd -r -E -u <realign/sample>.realigned.bam <genome> | <bam> squeeze --in -.ubam --out <sample>.final.bam --rmTags 'XM:i;XG:i;XO:i' --keepDups

  bam_index:
    run: <samtools> index <fixtags/sample>.final.bam

  clean:
    run: ls | grep -v final | xargs rm -fr

```

Resources definition
--------------------

PipEngine is entirely based on a placeholder-and-substitution logic. In the pipeline YAML, for example, each tool is declared under the **resources** key, and at run time PipEngine searches for the corresponding placeholder in the command lines.

So, for instance, if a software **bwa** is declared under resources, PipEngine will look for a ```<bwa>``` placeholder in all the command lines and substitute it with the complete path of the software declared in resources.

This makes command-line definitions shorter and easier to read, and avoids problems when moving from one software version to another (i.e. you just need to change the bwa definition once, not 10 times in 5 different command lines).

The same thing happens for sample names, input and output directories and intermediate output files. This makes it possible to create true pipeline templates that can be reused and applied to different sample sets.
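
To make the substitution logic concrete, here is a minimal sketch (plain `sed`, not PipEngine's actual implementation) of what happens to a command line when `<bwa>` and `<index>` are resolved from the resource paths used in the examples of this README:

```shell
# Illustrative only: PipEngine performs this kind of substitution internally.
# The paths below come from the "resources" examples used in this README.
cmd='<bwa> aln -t 4 -q 20 <index> R1.fastq.gz'
echo "$cmd" \
  | sed -e 's|<bwa>|/software/bwa-0.6.2/bwa|g' \
        -e 's|<index>|/storage/genomes/bwa_index/genome|g'
# prints: /software/bwa-0.6.2/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz
```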


Step definition
---------------

A step must be defined using standard keys:

* the first key must be the step name
* under the step name, a **run** key must be defined to hold the actual command line that will be executed
* a **cpu** key must be defined if the command line uses more than 1 CPU at runtime
* a **multi** key must be defined if the command line takes more than one sample as input (more details later)
* a **desc** key can be added to insert a short description that will be displayed by the **-i** option of PipEngine
* **nodes** and **mem** keys can be used to specify the resources needed for this job

A note on the **run** key: if a single step needs more than one command line to execute the required actions, these multiple command lines must be defined as an array in YAML (see the mapping step in the example above).


:: The Samples YAML ::
======================

The samples YAML is much simpler than the pipeline YAML:

```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  sampleA: /ngs_reads/sampleA
  sampleB: /ngs_reads/sampleB
  sampleC: /ngs_reads/sampleC
  sampleD: /ngs_reads/sampleD
```

In this YAML there is again a **resources** key, but this time the tags defined here depend on the samples described in the YAML.

For instance, when working with human RNA-seq samples, the data must be aligned on the human genome, so it makes sense that the **genome** tag is defined here and not in the pipeline YAML, which should stay as generic as possible.

Generally, the tags defined under the samples **resources** depend on the pipeline and analysis one wants to run. So if BWA is used to perform read alignment, an **index** tag must be defined here to set the BWA index prefix, and it will be substituted into the pipeline command lines every time an ```<index>``` placeholder is found in the pipeline YAML.


Sample groups
-------------

If you want to organize your samples by groups, it is possible to do it directly in the samples.yml file:


```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  Group1:
    sampleA: /ngs_reads/sampleA
    sampleB: /ngs_reads/sampleB
  Group2:
    sampleC: /ngs_reads/sampleC
    sampleD: /ngs_reads/sampleD
```

Then, by using the **-g** option of PipEngine, it is possible to run steps and pipelines directly on groups of samples.


How to create the Samples file
------------------------------

PipEngine was created to work primarily for NGS pipelines, with Illumina data in mind. So the easiest thing to do, if your samples are already organized into a typical Illumina folder, is to run:

```shell
> pipengine run -c /path/to/illumina/data
```

This will generate a samples.yml file with all the sample names and paths derived from the run folder. The "resources" part is left blank for you to fill in.

As a plus, if your samples are scattered throughout many different run folders, you can pass PipEngine all the paths and it will combine them into the same samples file. So if your samples are spread across, let's say, 3 runs, you can call PipEngine in this way:

```shell
> pipengine run -c /path/to/illumina/run1 /path/to/illumina/run2 /path/to/illumina/run3
```

If a sample is repeated in more than one run, all the paths will be combined in the samples.yml and PipEngine will take care of handling the multiple paths correctly.



:: Input and output conventions ::
==================================

The inputs of the steps defined in the pipeline YAML are expressed by the ```<sample>``` placeholder, which will be substituted with a sample name, and the ```<sample_path>``` placeholder, which will be replaced with the location where the initial data (i.e. raw sequencing reads) are stored for that particular sample. Both pieces of information are provided in the samples YAML file.

The ```<output>``` placeholder is a generic one defining the root location for the pipeline outputs. This parameter is also defined in the samples YAML. By default, PipEngine will write job scripts and save the stdout and stderr files from PBS in this folder.

By convention, each sample output is saved under a folder with the sample name, and each step is saved in a sub-folder with the step name.

That is, given a generic /storage/pipeline_results ```<output>``` folder, the outputs of the **mapping** step will be organized in this way:

```shell
/storage/pipeline_results/SampleA/mapping/SampleA.bam
                         /SampleB/mapping/SampleB.bam
                         /SampleC/mapping/SampleC.bam
                         /SampleD/mapping/SampleD.bam
```

This simple convention keeps things clean and organized. The output file name can be decided during pipeline creation, but it is a good habit to name it using the sample name.

When new steps of the same pipeline are run, output folders are updated accordingly. So for example if a **mark_dup** step is run after the **mapping** step, the output folder will look like this:

```shell
/storage/pipeline_results/SampleA/mapping
                         /SampleA/mark_dup

/storage/pipeline_results/SampleB/mapping
                         /SampleB/mark_dup
.....
```

In case you are working with groups of samples, specified by the **-g** option, the output folder will be changed to reflect the sample grouping. So for example if a **mapping** step is called on the **Group1** group of samples, all the outputs will be saved under the ```<output>/Group1``` folder, and the mapping results for SampleA will be found under ```<output>/Group1/SampleA/mapping```.


How steps are connected together
--------------------------------

One step is connected to another simply by requiring that its input is the output of a preceding step. This is achieved by a combination of ```<output>``` and ```<sample>``` placeholders in the pipeline command line definitions.

For instance, in a resequencing pipeline that first runs BWA to map the reads and then a mark-duplicates step, the mark_dup step will depend on the BWA output.

```yaml
mapping:
  run:
   - ls <sample_path>/*_R1_*.gz | xargs zcat | <pigz> -p 10 >> R1.fastq.gz
   - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
   - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz > <sample>.sorted.bam
   - rm -f R1.fastq.gz R2.fastq.gz
  cpu: 11

mark_dup:
  run: java -Xmx4g -jar <mark_dup> INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam
```

So in the **mark_dup** step the input placeholder (defined under the **run** key in the pipeline YAML) is written as:

```
<mapping/sample>.sorted.bam
```

If the ```<output>``` tag is defined, for instance, as "/storage/results", this will be translated at run-time into:

```
/storage/results/SampleA/mapping/SampleA.sorted.bam
```

for SampleA outputs. Basically, the ```<mapping/sample>``` placeholder is a shortcut for ```<output>/<sample>/{step name, mapping in this case}/<sample>```.

Following the same idea, a ```<mapping/>``` placeholder (note the / at the end) will be translated into ```<output>/<sample>/{step name, mapping in this case}/```, to address the scenario in which a user wants to point to the previous step's output directory without having ```<sample>``` appended to the end of the path.

More complex dependencies can be defined by combinations of ```<output>``` and ```<sample>``` placeholders, or by using the ```<step/>``` and ```<step/sample>``` placeholders, without having to worry about the actual sample name and the complete input and output paths.
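
The expansion rules above can be sketched in plain shell (illustrative only; the variables stand in for values that PipEngine reads from the two YAML files):

```shell
# How <mapping/sample> and <mapping/> expand for SampleA, assuming the
# <output> tag is /storage/results (values from the example above).
output=/storage/results; sample=SampleA; step=mapping
echo "${output}/${sample}/${step}/${sample}.sorted.bam"  # <mapping/sample>.sorted.bam
echo "${output}/${sample}/${step}/"                      # <mapping/>
# prints:
# /storage/results/SampleA/mapping/SampleA.sorted.bam
# /storage/results/SampleA/mapping/
```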


Jobs dependencies
-----------------
Steps can also be defined with dependencies, so the user can just call the final step and the whole upstream chain is executed automatically. To achieve this, PipEngine requires that the user defines a **pre** key in the step definition:

```
root_step:
  desc: root step to test dependencies
  run:
   - echo "root"

child_step:
  desc: child step to test dependencies
  pre: root_step
  run:
   - echo "I am the child"
```
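
On PBS/Torque, this kind of chaining is commonly expressed through qsub's `depend` attribute. A hypothetical sketch of the resulting submissions (the job ID below is made up, and this is not necessarily the exact mechanism PipEngine uses internally):

```shell
# Illustrative only: submit root_step, then make child_step wait for it
# to finish successfully. In a real session root_id would come from:
#   root_id=$(qsub root_step.sh)
root_id=12345.masternode
echo "qsub -W depend=afterok:${root_id} child_step.sh"
# prints: qsub -W depend=afterok:12345.masternode child_step.sh
```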



:: Multi-Samples and complex steps ::
=====================================

Pipeline steps can be defined to run on a single sample or to take more than one sample's data as input, depending on the command line used.

A typical example is running a differential expression step, for example with CuffDiff. This requires taking all the output generated from the previous Cufflinks step (i.e. the gtf files), processing it to generate a unique transcripts reference (CuffCompare), and then performing the differential expression across the samples using the BAM files generated by, let's say, TopHat in a **mapping** step.

This is an extract of the step definition in the pipeline YAML describing these two steps:

```yaml
diffexp:
  multi:
   - <output>/<sample>/cufflinks/transcripts.gtf
   - <mapping/sample>_tophat/accepted_hits.bam
  run:
   - echo '<multi1>' | sed -e 's/,/ /g' | xargs ls >> gtf_list.txt
   - <cuffcompare> -s <genome> -r <gtf> -i gtf_list.txt
   - <cuffdiff> -p 12 -N -u -b <genome> ./*combined.gtf <multi2>
  cpu: 12
```

In this case we need to combine the outputs of all the samples from the cufflinks step and pass that information to cuffcompare, and to combine the outputs of the mapping steps and pass them to the cuffdiff command line.

This is achieved in two ways. First, the step definition must include a **multi** key, which simply defines what, for each sample, will be substituted where the ```<multi>``` placeholder is found.

In the example above, the step includes two command lines, one for cuffcompare and the other for cuffdiff. Cuffcompare requires the transcripts.gtf for each sample, while Cuffdiff requires the BAM file for each sample, plus the output of Cuffcompare.

So the two command lines need two different kinds of files as input from the same set of samples; therefore two **multi** keywords are defined, as well as two placeholders ```<multi1>``` and ```<multi2>```.

Once the step has been defined in the pipeline YAML, PipEngine must be invoked with the **-m** parameter to specify the samples that should be grouped together by this step:

```shell
pipengine -p pipeline.yml -m SampleA,SampleB SampleC SampleD
```

Note that the use of commas is not casual, since the **-m** parameter specifies not only which samples should be used for this step, but also how they should be organized on the corresponding command line. The **-m** parameter takes the sample names, combines each one with the 'multi' keywords underneath, and then substitutes the result back into the command line, keeping the samples in the same order as provided with **-m**.

The above command line will be translated, for the **cuffdiff** command line, into the following:

```shell
/software/cuffdiff -p 12 -N -u -b /storage/genome.fa combined.gtf /storage/results/SampleA/cufflinks/transcripts.gtf,/storage/results/SampleB/cufflinks/transcripts.gtf /storage/results/SampleC/cufflinks/transcripts.gtf /storage/results/SampleD/cufflinks/transcripts.gtf
```

and this corresponds to the way CuffDiff wants biological replicates for each condition to be described on the command line.

**Note**

Multi-sample step management is complex, and it is a task that cannot easily be generalized, since every software has its own way of requiring and organizing the inputs on the command line. This approach is probably not the most elegant solution, but it works quite well, even if there are some drawbacks. For instance, as stated above, the sample groups are processed and passed to the command lines exactly as provided with the **-m** parameter.

So for Cuffdiff the presence of commas is critical to divide biological replicates of different conditions, but for Cuffcompare the commas are not needed and would raise an error on the command line. That is the reason for the:

```shell
echo '<multi1>' | sed -e 's/,/ /g' | xargs ls >> gtf_list.txt
```

This line generates the input file for Cuffcompare with the list of the transcripts.gtf files for each sample, built from the 'multi' definition in the pipeline YAML and the groups passed through the **-m** parameter, but getting rid of the commas that separate sample names. It is a workaround and not a super clean solution, but PipEngine aims to be a general tool not tied to specific corner cases, and it always lets users define their own custom command lines to manage particular steps, as in this case.
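
The effect of that comma-stripping workaround can be seen on a hypothetical `<multi1>` expansion (the paths are invented for illustration):

```shell
# <multi1> expands to a comma-separated list following the -m groups;
# sed turns the commas into spaces, so each gtf becomes a separate word.
multi1='/storage/results/SampleA/cufflinks/transcripts.gtf,/storage/results/SampleB/cufflinks/transcripts.gtf'
echo "$multi1" | sed -e 's/,/ /g'
# prints the two paths separated by a space, one word per file
```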



:: What happens at run-time ::
==============================

When invoked, PipEngine will look for the specified pipeline YAML and for the samples YAML file. It will load the list of samples (names and paths of input data) and, for each sample, the information of the steps specified on the command line (**-s** parameter).

PipEngine will then combine the data from the two YAML files, generating the specific command lines of the selected steps and substituting all the placeholders to produce the final command lines.

Finally, a shell script will be generated for each sample, containing all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler.

If not invoked with the **-d** option (dry-run), PipEngine will directly submit the jobs to the PBS scheduler using the "qsub" command.

Dry Run
-------

The **-d** parameter lets you create the runnable shell scripts without submitting them to PBS. Use it often, to check that the pipeline that will be executed is correct and does what you intended. The runnable scripts are saved by default in the ```<output>``` directory.

Use it also to learn how the placeholders work, especially the dependency placeholders (e.g. ```<mapping/sample>```), and to cross-check that all the placeholders in the pipeline command lines were substituted correctly before submitting the jobs.

Temporary output folder
-----------------------

By using the '--tmp' option, PipEngine will generate a job script (for each sample) that saves all the output files or folders of a particular step in a directory (e.g. /tmp) different from the one provided with ```<output>```.

By default, PipEngine generates output folders directly under the location defined by the ```<output>``` tag in the samples YAML. The --tmp solution can instead be useful when we don't want to save directly to the final location (e.g. a slower network storage) or we don't want to keep all the intermediate files but just the final ones.

With this option enabled, PipEngine will also generate instructions in the job script to copy, at the end of the job, all the outputs from the temporary directory to the final output folder (i.e. ```<output>```) and then to remove the temporary copy.

When '--tmp' is used, a UUID is generated for each job and prepended to the job name and to the temporary output folder, to avoid possible name collisions and data overwrites when multiple jobs with the same name (e.g. mapping) are running and writing to the same temporary location.
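
Putting these pieces together, the body of a '--tmp' job follows roughly this pattern (a sketch with made-up paths, and a shell PID standing in for the UUID; this is not the literal script PipEngine writes):

```shell
# Illustrative --tmp job skeleton: work in a uniquely-named temporary
# folder, then copy the outputs to the final <output> location and
# remove the temporary copy.
output=/tmp/demo_output/sampleA/mapping  # stands in for <output>/<sample>/<step>
tmpdir=/tmp/$$-mapping                   # PipEngine prepends a real UUID here
mkdir -p "$tmpdir" "$output"
( cd "$tmpdir" && echo demo > sampleA.bam )  # stands in for the real step commands
cp -r "$tmpdir"/. "$output"/ && rm -rf "$tmpdir"
```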


One job with multiple steps
---------------------------

It is of course possible to aggregate multiple steps of a pipeline and run them in one single job. For instance, let's say I want to run the steps mapping, mark_dup and realign_target (see the pipeline YAML example above) in the same job.

From the command line it's just:

```shell
pipengine -p pipeline.yml -s mapping mark_dup realign_target
```

A single job script will be generated for each sample, with all the instructions for these steps. If more than one step declares a **cpu** key, the highest cpu value will be assigned to the whole job.

Each step will save its outputs into a separate folder under ```<output>```, exactly as if the steps were run separately. This way, if the job fails for some reason, it will be possible to check which steps were already completed and restart from there.

When multiple steps are run in the same job, by default PipEngine generates the job name as the concatenation of all the step names. Since this could be a problem when many steps are run together in the same job, a '--name' parameter is available to rename the job in a more convenient way.

:: Examples ::
==============

Example 1: One step and multiple command lines
----------------------------------------------

This is an example of how to prepare the inputs for BWA and run it along with Samtools:

**pipeline.yml**
```yaml
pipeline: resequencing

resources:
  bwa: /software/bwa-0.6.2/bwa
  samtools: /software/samtools
  pigz: /software/pigz

steps:
  mapping:
    run:
     - ls <sample_path>/*_R1_*.gz | xargs zcat | <pigz> -p 10 >> R1.fastq.gz
     - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
     - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Sb - > <sample>.bam
     - rm -f R1.fastq.gz R2.fastq.gz
    cpu: 11
```

**samples.yml**
```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  sampleA: /ngs_reads/sampleA
  sampleB: /ngs_reads/sampleB
  sampleC: /ngs_reads/sampleC
  sampleD: /ngs_reads/sampleD
```

Running PipEngine with the following command line:

```
pipengine -p pipeline.yml -f samples.yml -s mapping
```

will generate a runnable shell script for each sample:

```shell
#!/bin/bash
#PBS -N 37735f50-mapping
#PBS -l ncpus=11

mkdir -p /storage/results/sampleA/mapping
cd /storage/results/sampleA/mapping
ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
/software/bwa-0.6.2/bwa sampe -P /genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
rm -f R1.fastq.gz R2.fastq.gz
```

As you can see, the command lines described in the pipeline YAML are translated into normal Unix command lines, so every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable.

In this case the **run** key defines several command lines, described using a YAML array (each line prepended with a -). These command lines are all part of the same step, since the first two are required to prepare the input for the third command line (BWA), using standard bash commands.

As a rule of thumb, you should put multiple command lines into an array under the same step when they are all logically correlated and required to manipulate intermediate files. Otherwise, if the command lines execute conceptually different actions, they should go into different steps.
518
+
519
+ Example 2: Multiple steps in one job
520
+ ------------------------------------
521
+
522
+ Now I want to execute more steps in a single job for each sample. The pipeline YAML is defined in this way:
523
+
524
+ ```yaml
525
+
526
+ pipeline: resequencing
527
+
528
+ resources:
529
+ bwa: /software/bwa-0.6.2/bwa
530
+ samtools: /software/samtools
531
+ mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
532
+ gatk: /software/GenomeAnalysisTK/GenomeAnalysisTK.jar
533
+
534
+ steps:
535
+ mapping:
536
+ run:
537
+ - ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
538
+ - ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
539
+ - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
540
+ - rm -f R1.fastq.gz R2.fastq.gz
541
+ cpu: 11
542
+
543
+ mark_dup:
544
+ run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false
545
+
546
+ realign_target:
547
+ run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
548
+ cpu: 8
549
+ ```
550
+
551
+ The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine must be invoked with this command line:
552
+
553
+ ```
554
+ pipengine -p pipeline.yml -f samples.yml -s mapping mark_dup realign_target
555
+ ```
556
+
557
+ And this will be translated into the following shell script (one for each sample):
558
+
559
+ ```shell
560
+ #!/bin/bash
561
+ #PBS -N ff020300-mapping-mark_dup-realign_target
562
+ #PBS -l ncpus=11
563
+
564
+ mkdir -p /storage/results/sampleB/mapping
565
+ cd /storage/results/sampleB/mapping
566
+ ls /ngs_reads/sampleB/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
567
+ ls /ngs_reads/sampleB/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
568
+ /software/bwa-0.6.2/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
569
+ rm -f R1.fastq.gz R2.fastq.gz
570
+ mkdir -p /storage/results/sampleB/mark_dup
571
+ cd /storage/results/sampleB/mark_dup
572
+ java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=sampleB.sorted.bam OUTPUT=sampleB.md.sort.bam METRICS_FILE=sampleB.metrics REMOVE_DUPLICATES=false
573
+ mkdir -p /storage/results/sampleB/realign_target
574
+ cd /storage/results/sampleB/realign_target
575
+ java -Xmx4g -jar /software/GenomeAnalysisTk/GenomeAnalysisTk.jar -T RealignerTargetCreator -I sampleB.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleB.indels.intervals
576
+ ```


Logging
-------

It is always useful to log activities and collect the output of your software. PipEngine can log to:

* stdin: just print on the terminal
* syslog: send the log to the system log using logger
* fluentd: send the log to a collector/centralized logging system (http://fluentd.org)


:: PBS Options ::
=================

If specific PBS options need to be passed to PipEngine, the ```--pbs-opts``` parameter can be used.

This parameter accepts a list of options, and each one will be added to the PBS header in the shell script, along with the ```-l``` PBS parameter.

So, for example, the following options passed to ```--pbs-opts```:

```shell
--pbs-opts nodes=2:ppn=8 host=node5
```

will become, in the shell script:

```shell
#PBS -l nodes=2:ppn=8
#PBS -l host=node5
```

Note also that, from version 0.5.2, it is possible to specify common PBS options like "nodes" and "mem" (for memory) directly within a step definition in the pipeline YAML, exactly as is done with the "cpu" parameter. So in a step it is possible to write:

```yaml
realign_target:
  run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
  cpu: 8
  nodes: 2
  mem: 8G
```

to have PipEngine translate this into:

```shell
#PBS -l nodes=2:ppn=8,mem=8G
```

within the job script.

If a specific queue needs to be selected for sending the jobs to PBS, the ```--pbs-queue``` (short version **-q**) parameter can be used. This will pass the ```-q <queue name>``` taken from the command line to the ```qsub``` command.

Copyright
=========

(c) 2013 Francesco Strozzi