bio-pipengine 0.6.0
- checksums.yaml +7 -0
- data/LICENSE.txt +20 -0
- data/README.md +631 -0
- data/VERSION +1 -0
- data/bin/pipengine +101 -0
- data/lib/bio/pipengine/job.rb +271 -0
- data/lib/bio/pipengine/sample.rb +13 -0
- data/lib/bio/pipengine/step.rb +39 -0
- data/lib/bio/pipengine.rb +234 -0
- data/lib/bio-pipengine.rb +13 -0
- metadata +167 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: e56d9733fe3de87e37dc417bd0d639e69467df40
  data.tar.gz: 094a0ba1a27670321641950598337859484e0a8f
SHA512:
  metadata.gz: 6cd5303f273ea2821290149a21a7e4fd2ab8c7440aeb411db410020b05d4b7598a125290b40c70e6ec5602f97aabbb8c20f2abc351e12cf3079d569fa9597e17
  data.tar.gz: 1b1a89e37ada7bc4f6dffa63238bbb0ef9796efd6b029b99d9b812fadc8c0af90388cf330519c4cbe433b69345c9af46fd4e1925bcdfc27c57ebd9c3e88a6fc2
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
Copyright (c) 2012 Francesco Strozzi

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,631 @@
PipEngine
=========

A simple launcher for complex biological pipelines.

PipEngine generates runnable shell scripts, already configured for the PBS/Torque job scheduler, for each sample in the pipeline. It allows you to run a complete pipeline, a single step, or just a few steps.

PipEngine is best suited for NGS pipelines, but it can be used for any kind of pipeline that can be run on a job scheduling system.

:: Topics ::
============

[Usage](https://github.com/bioinformatics-ptp/bioruby-pipengine#-usage-)

[The Pipeline YAML](https://github.com/bioinformatics-ptp/bioruby-pipengine#-the-pipeline-yaml-)

[The Samples YAML](https://github.com/bioinformatics-ptp/bioruby-pipengine#-the-samples-yaml-)

[Input and output conventions](https://github.com/bioinformatics-ptp/bioruby-pipengine#-input-and-output-conventions-)

[Sample groups and complex steps](https://github.com/bioinformatics-ptp/bioruby-pipengine#-sample-groups-and-complex-steps-)

[What happens at run-time](https://github.com/bioinformatics-ptp/bioruby-pipengine#-what-happens-at-run-time-)

[Examples](https://github.com/bioinformatics-ptp/bioruby-pipengine#-examples-)

[PBS Options](https://github.com/bioinformatics-ptp/bioruby-pipengine#-pbs-options-)

:: Usage ::
===========

PipEngine is divided into two main commands:

```shell
> pipengine -h
List of available commands:
   run    Submit pipelines to the job scheduler
   jobs   Show statistics and interact with running jobs
```

Since PipEngine uses the [TORQUE-RM](https://github.com/helios/torque_rm) gem to interact with the job scheduler, on the first run PipEngine will ask a few questions to prepare the required configuration file (e.g. the IP address and username needed to connect via SSH to the PBS server or master node).

Command line for JOBS mode
--------------------------

In this mode, PipEngine interacts with the job scheduler (Torque/PBS for now), allowing you to search submitted jobs and to send delete commands to remove jobs from the scheduler.

```shell
> pipengine jobs [options]
```

**Parameters**
```shell
  --job-id, -i <s+>:   Search submitted jobs by Job ID
  --job-name, -n <s+>: Search submitted jobs by Job Name
  --delete, -d <s+>:   Delete submitted jobs ('all' to erase everything or type one or more job IDs)
```
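
For example, to search the submitted jobs by name or to remove everything from the scheduler (the job name here is illustrative):

```shell
> pipengine jobs -n mapping
> pipengine jobs -d all
```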

Command line for RUN mode
-------------------------

In this mode, PipEngine submits pipeline jobs to the scheduler.

**Command line**
```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
```

**Parameters**
```shell
  --pipeline, -p <s>:          YAML file with pipeline and sample details
                               (default: pipeline.yml)
  --samples-file, -f <s>:      YAML file with samples name and directory paths
                               (default: samples.yml)
  --samples, -l <s+>:          List of sample names to run the pipeline
  --steps, -s <s+>:            List of steps to be executed
  --dry, -d:                   Dry run. Just create the job script without
                               submitting it to the batch system
  --tmp, -t <s>:               Temporary output folder
  --create-samples, -c <s+>:   Create samples.yml file from a Sample directory
                               (only for CASAVA projects)
  --multi, -m <s+>:            List of samples to be processed by a given step
                               (the order matters)
  --group, -g <s>:             Specify the group of samples to run the
                               pipeline steps on (do not specify --multi)
  --name, -n <s>:              Analysis name
  --output-dir, -o <s>:        Output directory (override standard output
                               directory names)
  --pbs-opts, -b <s+>:         PBS options
  --pbs-queue, -q <s>:         PBS queue
  --inspect-pipeline, -i <s>:  Show steps
  --mail-exit, -a <s>:         Send an Email when the job terminates
  --mail-start, -r <s>:        Send an Email when the job starts
  --log <s>:                   Log script activities, by default stdin.
                               Options are fluentd (default: stdin)
  --log-adapter, -e <s>:       (stdin|syslog|fluentd) In case of fluentd use
                               http://destination.hostname:port/yourtag
  --help, -h:                  Show this message
```

PipEngine accepts two input files:
* A YAML file describing the pipeline steps
* A YAML file describing sample names, sample locations and other sample-specific information

:: The Pipeline YAML ::
=======================

The basic structure of a pipeline YAML is divided into three parts: 1) pipeline name, 2) resources, 3) steps.

An example YAML file looks like the following:

```yaml

pipeline: resequencing

resources:
  fastqc: /software/FastQC/fastqc
  bwa: /software/bwa-0.6.2/bwa
  gatk: /software/gatk-lite/GenomeAnalysisTk.jar
  samtools: /software/samtools
  samsort: /software/picard-tools-1.77/SortSam.jar
  mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
  bam: /software/bam
  pigz: /software/pigz

steps:
  mapping:
    desc: Run BWA on each sample to perform alignment
    run:
      - ls <sample_path>/*_R1_*.gz | xargs zcat | <pigz> -p 10 >> R1.fastq.gz
      - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
      - rm -f R1.fastq.gz R2.fastq.gz
    cpu: 11

  mark_dup:
    run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false

  realign_target:
    run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
    cpu: 8

  realign:
    run: java -Xmx4g -jar <gatk> -T IndelRealigner -LOD 0.4 -model USE_READS --disable_bam_indexing --target_intervals <realign_target/sample>.indels.intervals -R <genome> -I <mark_dup/sample>.md.sort.bam -o <sample>.realigned.bam

  fixtags:
    run: <samtools> calmd -r -E -u <realign/sample>.realigned.bam <genome> | <bam> squeeze --in -.ubam --out <sample>.final.bam --rmTags 'XM:i;XG:i;XO:i' --keepDups

  bam_index:
    run: <samtools> index <fixtags/sample>.final.bam

  clean:
    run: ls | grep -v final | xargs rm -fr

```

Resources definition
--------------------

PipEngine is entirely based on placeholder substitution. In the pipeline YAML, each tool is declared under **resources**, and at run time PipEngine searches the command lines for the corresponding placeholders.

So, for instance, if a software **bwa** is declared under resources, PipEngine will search for a ```<bwa>``` placeholder in all the command lines and substitute it with the complete path of the software declared in resources.

This makes command line definitions shorter and easier to read, and avoids problems when moving from one software version to another (i.e. you just need to change the bwa definition once, not 10 times in 5 different command lines).

The same thing happens for sample names, input and output directories and intermediate output files. This makes it possible to create true pipeline templates that can be reused and applied to different sample sets.
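
As a concrete illustration of the substitution logic (the step name and the read file name here are hypothetical, used only for this sketch), a run line like:

```yaml
steps:
  example_step:
    run: <bwa> aln -t 4 <index> <sample_path>/reads.fastq.gz
```

would be expanded at run time, for a sample named sampleA, into ```/software/bwa-0.6.2/bwa aln -t 4 /storage/genomes/bwa_index/genome /ngs_reads/sampleA/reads.fastq.gz```, using the paths declared in the two YAML files.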

Step definition
---------------

A step must be defined using standard keys:

* the first key must be the step name
* under the step name, a **run** key must be defined to hold the actual command line that will be executed
* a **cpu** key must be defined if the command line uses more than 1 CPU at runtime
* a **multi** key must be defined if the command line takes as input more than one sample (more details later)
* a **desc** key can be added to insert a short description that will be displayed using the **-i** option of PipEngine
* **nodes** and **mem** keys can be used to specify the resources needed for this job

A note on the **run** key: if a single step needs more than one command line to execute the required actions, the multiple command lines must be defined as an array in YAML (see the mapping step in the example above). A minimal sketch combining these keys follows.
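
Here is a minimal, hypothetical step combining the standard keys (step name, tool usage and file names are illustrative only):

```yaml
steps:
  coverage:
    desc: Compute per-sample coverage from the mapping output
    run:
      - <samtools> index <mapping/sample>.sorted.bam
      - <samtools> depth <mapping/sample>.sorted.bam > <sample>.coverage.txt
    cpu: 1
    nodes: 1
    mem: 4G
```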

:: The Samples YAML ::
======================

The samples YAML is much simpler than the pipeline YAML:

```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  sampleA: /ngs_reads/sampleA
  sampleB: /ngs_reads/sampleB
  sampleC: /ngs_reads/sampleC
  sampleD: /ngs_reads/sampleD
```

In this YAML there is again a **resources** key, but this time the tags defined here depend on the samples described in the YAML.

For instance, when working with human RNA-seq samples, the data must be aligned to the human genome, so it makes sense to define the **genome** tag here and not in the pipeline YAML, which should stay as generic as possible.

Generally, the tags defined under the samples **resources** depend on the pipeline and analysis one wants to run. So if BWA is used to perform read alignment, an **index** tag must be defined here to set the BWA index prefix; it will be substituted into the pipeline command lines every time an ```<index>``` placeholder is found in the pipeline YAML.

Sample groups
-------------

If you want to organize your samples into groups, you can do so directly in the samples.yml file:

```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  Group1:
    sampleA: /ngs_reads/sampleA
    sampleB: /ngs_reads/sampleB
  Group2:
    sampleC: /ngs_reads/sampleC
    sampleD: /ngs_reads/sampleD
```

Then, by using the **-g** option of PipEngine, it is possible to run steps and pipelines directly on groups of samples.
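
A minimal invocation sketch using the group defined above (file names follow the conventions from this README):

```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping -g Group1
```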

How to create the Samples file
------------------------------

PipEngine was created to work primarily for NGS pipelines and with Illumina data in mind. So, if you have your samples already organized into a typical Illumina folder, the easiest thing to do is to run:

```shell
> pipengine run -c /path/to/illumina/data
```

This will generate a samples.yml file with all the sample names and paths derived from the run folder. The "resources" part is left blank for you to fill in.

As a plus, if your samples are scattered throughout many different run folders, you can pass all the paths to PipEngine and it will combine them in the same samples file. So if your samples are spread across, let's say, 3 runs, you can call PipEngine in this way:

```shell
> pipengine run -c /path/to/illumina/run1 /path/to/illumina/run2 /path/to/illumina/run3
```

If a sample is repeated in more than one run, all the paths will be combined in the samples.yml and PipEngine will take care of handling the multiple paths correctly.

:: Input and output conventions ::
==================================

The inputs of the steps defined in the pipeline YAML are expressed by the ```<sample>``` placeholder, which will be substituted with a sample name, and by ```<sample_path>```, which will be replaced with the location where the initial data (i.e. raw sequencing reads) are stored for that particular sample. Both pieces of information are provided in the samples YAML file.

The ```<output>``` placeholder is a generic one that defines the root location for the pipeline outputs. This parameter is also defined in the samples YAML. By default, PipEngine will write the job scripts and save the stdout and stderr files from PBS in this folder.

By convention, each sample's output is saved under a folder named after the sample, and each step's output in a sub-folder named after the step.

That is, given a generic /storage/pipeline_results ```<output>``` folder, the outputs of the **mapping** step will be organized in this way:

```shell
/storage/pipeline_results/SampleA/mapping/SampleA.bam
                         /SampleB/mapping/SampleB.bam
                         /SampleC/mapping/SampleC.bam
                         /SampleD/mapping/SampleD.bam
```

This simple convention keeps things clean and organized. The output file name can be decided during pipeline creation, but it's a good habit to name it using the sample name.

When new steps of the same pipeline are run, the output folders are updated accordingly. So, for example, if a **mark_dup** step is run after the **mapping** step, the output folder will look like this:

```shell
/storage/pipeline_results/SampleA/mapping
                         /SampleA/mark_dup

/storage/pipeline_results/SampleB/mapping
                         /SampleB/mark_dup
.....
```

If you are working with groups of samples, specified with the **-g** option, the output folder will be changed to reflect the sample grouping. So, for example, if a **mapping** step is called on the **Group1** group of samples, all the outputs will be saved under the ```<output>/Group1``` folder, and the mapping results for SampleA will be found under ```<output>/Group1/SampleA/mapping```.

How steps are connected together
--------------------------------

One step is connected to another simply by requiring that its input is the output of a preceding step. This is achieved by a combination of ```<output>``` and ```<sample>``` placeholders in the pipeline command line definitions.

For instance, in a resequencing pipeline that first runs BWA to map the reads and then marks duplicates, the mark_dup step will depend on the BWA output.

```yaml
mapping:
  run:
    - ls <sample_path>/*_R1_*.gz | xargs zcat | <pigz> -p 10 >> R1.fastq.gz
    - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
    - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz > <sample>.sorted.bam
    - rm -f R1.fastq.gz R2.fastq.gz
  cpu: 11

mark_dup:
  run: java -Xmx4g -jar <mark_dup> INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam
```

So in the **mark_dup** step the input placeholder (defined under the **run** key in the pipeline YAML) is written as:

```
<mapping/sample>.sorted.bam
```

If the ```<output>``` tag is defined, for instance, as "/storage/results", this will be translated at run-time into:

```
/storage/results/SampleA/mapping/SampleA.sorted.bam
```

for the SampleA outputs. Basically, the ```<mapping/sample>``` placeholder is a shortcut for ```<output>/<sample>/{step name, mapping in this case}/<sample>```.

Following the same idea, a ```<mapping/>``` placeholder (note the / at the end) will be translated into ```<output>/<sample>/{step name, mapping in this case}/```, to address the scenario where a user wants to point to the previous step's output directory without having ```<sample>``` appended to the end of the path.

More complex dependencies can be defined by combinations of ```<output>``` and ```<sample>``` placeholders, or by using the ```<step/>``` and ```<step/sample>``` placeholders, without having to worry about the actual sample names and the complete input and output paths.

Jobs dependencies
-----------------

Steps can also be defined with dependencies, so the user can just call the final step and all the upstream chain is called automatically. To achieve this, PipEngine requires that the user defines a ```pre``` tag in the step definition:

```
root_step:
  desc: root step to test dependencies
  run:
    - echo "root"

child_step:
  desc: child step to test dependencies
  pre: root_step
  run:
    - echo "I am the child"
```
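
With this definition, submitting the child step alone is enough and the root step is scheduled automatically (a minimal invocation sketch, reusing the standard run options):

```shell
> pipengine run -p pipeline.yml -f samples.yml -s child_step
```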

:: Multi-Samples and complex steps ::
=====================================

Pipeline steps can be defined to run on a single sample or to take as input data from more than one sample, depending on the command line used.

A typical example is running a differential expression step, for instance with CuffDiff. This requires taking all the output generated by a previous Cufflinks step (i.e. the GTF files), processing it to generate a unified transcripts reference (CuffCompare), and then performing the differential expression analysis across the samples using the BAM files generated by, let's say, TopHat in a **mapping** step.

This is an extract of the step definition in the pipeline YAML describing these two steps:

```yaml
diffexp:
  multi:
    - <output>/<sample>/cufflinks/transcripts.gtf
    - <mapping/sample>_tophat/accepted_hits.bam
  run:
    - echo '<multi1>' | sed -e 's/,/ /g' | xargs ls >> gtf_list.txt
    - <cuffcompare> -s <genome> -r <gtf> -i gtf_list.txt
    - <cuffdiff> -p 12 -N -u -b <genome> ./*combined.gtf <multi2>
  cpu: 12
```

In this case we need to combine the outputs of all the samples from the cufflinks step and pass that information to cuffcompare, and to combine the outputs of the mapping step and pass them to the cuffdiff command line.

This is achieved in two ways. First, the step definition must include a **multi** key, which simply defines what, for each sample, will be substituted where the ```<multi>``` placeholder is found.

In the example above, the step includes two command lines, one for cuffcompare and the other for cuffdiff. Cuffcompare requires the transcripts.gtf of each sample, while Cuffdiff requires the BAM file of each sample, plus the output of Cuffcompare.

So the two command lines need two different kinds of files as input from the same set of samples; therefore two **multi** keywords are defined, as well as two placeholders ```<multi1>``` and ```<multi2>```.

Once the step has been defined in the pipeline YAML, PipEngine must be invoked using the **-m** parameter, to specify the samples that should be grouped together by this step:

```shell
pipengine -p pipeline.yml -m SampleA,SampleB SampleC SampleD
```

Note that the use of commas is not accidental: the **-m** parameter specifies not only which samples should be used for this step, but also how they should be arranged on the corresponding command line. The **-m** parameter takes the sample names, combines each one with the **multi** keywords underneath, and then substitutes the results back into the command line, keeping the samples in the same order as provided with **-m**.

The above command line will be translated, for the **cuffdiff** command line, into the following:

```shell
/software/cuffdiff -p 12 -N -u -b /storage/genome.fa combined.gtf /storage/results/SampleA/cufflinks/transcripts.gtf,/storage/results/SampleB/cufflinks/transcripts.gtf /storage/results/SampleC/cufflinks/transcripts.gtf /storage/results/SampleD/cufflinks/transcripts.gtf
```

This corresponds to the way CuffDiff expects biological replicates for each condition to be described on the command line.

**Note**

Multi-sample step management is complex, and it's a task that can't be easily generalized, since every software has its own way of requiring and organizing the inputs on the command line. This approach is probably not the most elegant solution, but it works quite well, even if there are some drawbacks. For instance, as stated above, the sample grouping is processed and passed to the command lines exactly as it is given to the **-m** parameter.

So for Cuffdiff the presence of commas is critical to divide biological replicates of different conditions, but for Cuffcompare the commas are not needed and would raise an error on the command line. That's the reason for the:

```shell
echo '<multi1>' | sed -e 's/,/ /g' | xargs ls >> gtf_list.txt
```

This line generates the input file for Cuffcompare with the list of the transcripts.gtf files of each sample, built from the **multi** definition in the pipeline YAML and the sample list passed through the **-m** parameter, while getting rid of the commas that separate the sample names. It's a workaround and not a super clean solution, but PipEngine aims to be a general tool, not bound to specific corner cases, and it always lets users define their own custom command lines to manage particular steps, as in this case.

:: What happens at run-time ::
==============================

When invoked, PipEngine will look for the specified pipeline YAML and for the samples YAML file. It will load the list of samples (names and paths of input data) and, for each sample, it will load the information of the steps specified on the command line (**-s** parameter).

PipEngine will then combine the data from the two YAML files, generating the specific command lines of the selected steps and substituting all the placeholders to produce the final command lines.

Finally, a shell script will be generated for each sample, containing all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler.

If not invoked with the **-d** option (dry-run), PipEngine will directly submit the jobs to the PBS scheduler using the "qsub" command.

Dry Run
-------

The **-d** parameter lets you create the runnable shell scripts without submitting them to PBS. Use it often to check that the pipeline that will be executed is correct and is doing what you expect. The runnable scripts are saved by default in the ```<output>``` directory.

Use it also to learn how the placeholders work, especially the dependency placeholders (e.g. ```<mapping/sample>```), and to cross-check that all the placeholders in the pipeline command lines were substituted correctly before submitting the jobs.
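
A dry run is just a normal run command with the flag added (a sketch based on the examples in this README):

```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping -d
```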

Temporary output folder
-----------------------

By using the '--tmp' option, PipEngine will generate a job script (for each sample) that saves all the output files or folders of a particular step in a directory (e.g. /tmp) different from the one provided with ```<output>```.

By default, PipEngine generates the output folders directly under the location defined by the ```<output>``` tag in the samples YAML. The --tmp solution can instead be useful when we don't want to write directly to the final location (e.g. a slower network storage) or don't want to keep all the intermediate files but just the final ones.

With this option enabled, PipEngine will also generate instructions in the job script to copy, at the end of the job, all the outputs from the temporary directory to the final output folder (i.e. ```<output>```) and then to remove the temporary copy.

When '--tmp' is used, a UUID is generated for each job and prepended to the job name and to the temporary output folder, to avoid possible name collisions and data overwrites when several jobs with the same name (e.g. mapping) are running and writing to the same temporary location.
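
As a rough sketch of this behaviour (illustrative only, not the literal script PipEngine generates; ```<uuid>``` stands for the generated identifier), a mapping job submitted with ```--tmp /tmp``` works roughly like:

```shell
# work under a UUID-prefixed temporary folder to avoid collisions
mkdir -p /tmp/<uuid>-mapping/sampleA/mapping
cd /tmp/<uuid>-mapping/sampleA/mapping
# ... the step command lines run here ...
# at the end of the job, copy the outputs back to <output> and clean up
cp -r /tmp/<uuid>-mapping/sampleA /storage/results/
rm -rf /tmp/<uuid>-mapping
```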

One job with multiple steps
---------------------------

It is of course possible to aggregate multiple steps of a pipeline and run them in one single job. For instance, let's say I want to run the steps mapping, mark_dup and realign_target in the same job (see the pipeline YAML example above).

From the command line it's just:

```shell
pipengine -p pipeline.yml -s mapping mark_dup realign_target
```

A single job script will be generated for each sample, with all the instructions for these steps. If more than one step declares a **cpu** key, the highest cpu value will be assigned to the whole job.

Each step will save its outputs into a separate folder under ```<output>```, exactly as if the steps were run separately. This way, if the job fails for some reason, it is possible to check which steps were already completed and restart from there.

When multiple steps are run in the same job, by default PipEngine will generate the job name as the concatenation of all the step names. Since this could be a problem when a lot of steps are run together in the same job, a '--name' parameter is available to rename the job in a more convenient way.

:: Examples ::
==============

Example 1: One step and multiple command lines
----------------------------------------------

This is an example of how to prepare the inputs for BWA and run it along with Samtools:

**pipeline.yml**
```yaml
pipeline: resequencing

resources:
  bwa: /software/bwa-0.6.2/bwa
  samtools: /software/samtools

steps:
  mapping:
    run:
      - ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
      - ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Sb - > <sample>.bam
      - rm -f R1.fastq.gz R2.fastq.gz
    cpu: 11
```

**samples.yml**
```yaml
resources:
  index: /storage/genomes/bwa_index/genome
  genome: /storage/genomes/genome.fa
  output: /storage/results

samples:
  sampleA: /ngs_reads/sampleA
  sampleB: /ngs_reads/sampleB
  sampleC: /ngs_reads/sampleC
  sampleD: /ngs_reads/sampleD
```

Running PipEngine with the following command line:

```
pipengine -p pipeline.yml -f samples.yml -s mapping
```

will generate a runnable shell script for each sample:

```shell
#!/bin/bash
#PBS -N 37735f50-mapping
#PBS -l ncpus=11

mkdir -p /storage/results/sampleA/mapping
cd /storage/results/sampleA/mapping
ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
/software/bwa-0.6.2/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
rm -f R1.fastq.gz R2.fastq.gz
```

As you can see, the command lines described in the pipeline YAML are translated into normal Unix command lines, so every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable.

In this case the **run** key defines four command lines, described using a YAML array (each line prefixed with a -). These command lines are all part of the same step, since the first two are required to prepare the input for the third command line (BWA), using standard bash commands.

As a rule of thumb, you should put multiple command lines into an array under the same step if they are all logically correlated and required to manipulate intermediate files. If command lines execute conceptually different actions, they should go into different steps.

Example 2: Multiple steps in one job
------------------------------------

Now I want to execute multiple steps in a single job for each sample. The pipeline YAML is defined in this way:

```yaml

pipeline: resequencing

resources:
  bwa: /software/bwa-0.6.2/bwa
  samtools: /software/samtools
  mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
  gatk: /software/GenomeAnalysisTK/GenomeAnalysisTK.jar

steps:
  mapping:
    run:
      - ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
      - ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
      - rm -f R1.fastq.gz R2.fastq.gz
    cpu: 11

  mark_dup:
    run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false

  realign_target:
    run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
    cpu: 8
```

The samples YAML file is the same as in the example above. Now, to execute the 3 steps defined in the pipeline together, PipEngine must be invoked with this command line:

```
pipengine -p pipeline.yml -f samples.yml -s mapping mark_dup realign_target
```

This will be translated into the following shell script (one for each sample):

```shell
#!/bin/bash
#PBS -N ff020300-mapping-mark_dup-realign_target
#PBS -l ncpus=11

mkdir -p /storage/results/sampleB/mapping
cd /storage/results/sampleB/mapping
ls /ngs_reads/sampleB/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
ls /ngs_reads/sampleB/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
/software/bwa-0.6.2/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=sampleB.sorted.bam SO=coordinate LB=resequencing PL=illumina PU=PU SM=sampleB TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
rm -f R1.fastq.gz R2.fastq.gz
mkdir -p /storage/results/sampleB/mark_dup
cd /storage/results/sampleB/mark_dup
java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=/storage/results/sampleB/mapping/sampleB.sorted.bam OUTPUT=sampleB.md.sort.bam METRICS_FILE=sampleB.metrics REMOVE_DUPLICATES=false
mkdir -p /storage/results/sampleB/realign_target
cd /storage/results/sampleB/realign_target
java -Xmx4g -jar /software/GenomeAnalysisTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /storage/results/sampleB/mark_dup/sampleB.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleB.indels.intervals
```

Logging
-------

It is always useful to log activities and collect the output of your software. PipEngine can log to:

* stdin: just print to the terminal
* syslog: send the log to the system log using logger
* fluentd: send the log to a collector/centralized logging system (http://fluentd.org)
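
The destination is selected through the --log and --log-adapter run options shown earlier; the exact flag combination follows the built-in help, so treat this invocation as a sketch (the URL is the placeholder from the help text):

```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping --log fluentd --log-adapter http://destination.hostname:port/yourtag
```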

:: PBS Options ::
=================

If specific PBS options need to be passed, the ```--pbs-opts``` parameter can be used.

This parameter accepts a list of options, and each one will be added to the PBS header in the shell script along with the ```-l``` PBS parameter.

So, for example, the following options passed to ```--pbs-opts```:

```shell
--pbs-opts nodes=2:ppn=8 host=node5
```

will become, in the shell script:

```shell
#PBS -l nodes=2:ppn=8
#PBS -l host=node5
```

Note also that, from version 0.5.2, it is possible to specify common PBS options like "nodes" and "mem" (for memory) directly within a step definition in the pipeline YAML, exactly as is done with the "cpu" parameter. So in a step it is possible to write:

```yaml
realign_target:
  run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
  cpu: 8
  nodes: 2
  mem: 8G
```

to have PipEngine translate this into:

```shell
#PBS -l nodes=2:ppn=8,mem=8G
```

within the job script.

If a specific queue needs to be selected for sending the jobs to PBS, the ```--pbs-queue``` parameter (short version **-q**) can be used. This will pass the ```-q <queue name>``` taken from the command line on to the ```qsub``` command.
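
For example (the queue name here is hypothetical):

```shell
> pipengine run -p pipeline.yml -f samples.yml -s mapping -q bio_queue
```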

Copyright
=========

(c) 2013 Francesco Strozzi
|