bio-pipengine 0.9.6 → 0.9.7
- checksums.yaml +4 -4
- data/README.md +86 -45
- data/VERSION +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: b76612ae318c72b01d4b2ce5788154107afdd0b8
|
4
|
+
data.tar.gz: 0ee4488dbcdbd653248e625cee1f633d1beaeee5
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: b68abc5872cbbfea319e87f88ec7dc2ed39fe0834ec4520702eb2a71f97088919804fcd700868aa71d760c36b4e121355c3273005151dd54122fcd188003249c
|
7
|
+
data.tar.gz: 57ed883cb1754c1ea3122ed853931694c46999d420fe525e6428adff9af1952ebc124d9ff7975537f84b08a5716305af8502b80895bd5e3e30f9a495df9bfcdb
|
data/README.md
CHANGED
@@ -71,7 +71,7 @@ Command line for RUN mode
|
|
71
71
|
|
72
72
|
**Command line**
|
73
73
|
```shell
|
74
|
-
>
|
74
|
+
> pipengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
|
75
75
|
```
|
76
76
|
|
77
77
|
**Parameters**
|
@@ -115,7 +115,7 @@ pipeline: resequencing
|
|
115
115
|
|
116
116
|
resources:
|
117
117
|
fastqc: /software/FastQC/fastqc
|
118
|
-
bwa: /software/bwa
|
118
|
+
bwa: /software/bwa
|
119
119
|
gatk: /software/gatk-lite/GenomeAnalysisTk.jar
|
120
120
|
samtools: /software/samtools
|
121
121
|
samsort: /software/picard-tools-1.77/SortSam.jar
|
@@ -375,7 +375,7 @@ So the two command lines need two different kind of files as input from the same
|
|
375
375
|
Once the step has been defined in the pipeline YAML, PipEngine must be invoked using the **-m** parameter, to specify the samples that should be grouped together by this step:
|
376
376
|
|
377
377
|
```shell
|
378
|
-
pipengine -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
|
378
|
+
pipengine run -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
|
379
379
|
```
|
380
380
|
|
381
381
|
Note that the commas are significant: the **-m** parameter specifies not only which samples should be used for this step, but also how they should be organized on the corresponding command line. Internally, PipEngine combines the sample names passed to **-m** with the 'multi' keywords and substitutes them back into the command line, keeping the samples in the same order as provided with **-m**.
|
@@ -441,7 +441,7 @@ When invoking PipEngine, the tool will look for the pipeline YAML specified and
|
|
441
441
|
|
442
442
|
PipEngine will then combine the data from the two YAML files, generating the specific command lines of the selected steps and substituting all the placeholders to produce the final command lines.
|
443
443
|
|
444
|
-
A shell script will be finally generated, for each sample, that will contain all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler.
|
444
|
+
A shell script will be finally generated, for each sample, that will contain all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler. The shell scripts are written inside the directory specified by the ```output:``` key in the ```samples.yml``` file; the directory is created if it does not exist.
|
445
445
|
|
446
446
|
If not invoked with the **-d** option (dry-run), PipEngine will directly submit the jobs to the PBS scheduler using the "qsub" command.
|
447
447
|
|
@@ -471,7 +471,7 @@ It is of course possible to aggregate multiple steps of a pipeline and run them
|
|
471
471
|
From the command line it's just:
|
472
472
|
|
473
473
|
```shell
|
474
|
-
pipengine -p pipeline.yml -s mapping mark_dup realign_target
|
474
|
+
pipengine run -p pipeline.yml -s mapping mark_dup realign_target
|
475
475
|
```
|
476
476
|
|
477
477
|
A single job script, for each sample, will be generated with all the instructions for these steps. If more than one step declares a **cpu** key, the highest cpu value will be assigned for the whole job.
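As a minimal sketch of the rule above (not PipEngine's actual code), the following bash snippet picks the highest **cpu** value among the aggregated steps and prints the matching PBS resource line. The helper name `max_cpu` and the two values 12 and 8 are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PipEngine): return the largest cpu value
# among the steps that are aggregated into one job.
max_cpu() {
  local max=0 value
  for value in "$@"; do
    if [ "$value" -gt "$max" ]; then
      max=$value
    fi
  done
  echo "$max"
}

# Assumed cpu values declared by two steps of the same job.
ppn=$(max_cpu 12 8)
echo "#PBS -l nodes=1:ppn=${ppn}"
```

With these values the whole job requests 12 processors, even though one of the steps only needs 8.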
|
@@ -483,6 +483,8 @@ When multiple steps are run in the same job, by default PipEngine will generate
|
|
483
483
|
:: Examples ::
|
484
484
|
==============
|
485
485
|
|
486
|
+
All these files can be found in the test/examples directory of the repository.
|
487
|
+
|
486
488
|
Example 1: One step and multiple command lines
|
487
489
|
----------------------------------------------
|
488
490
|
|
@@ -493,8 +495,9 @@ This is an example on how to prepare the inputs for BWA and run it along with Sa
|
|
493
495
|
pipeline: resequencing
|
494
496
|
|
495
497
|
resources:
|
496
|
-
bwa: /software/bwa
|
498
|
+
bwa: /software/bwa
|
497
499
|
samtools: /software/samtools
|
500
|
+
pigz: /software/pigz
|
498
501
|
|
499
502
|
steps:
|
500
503
|
mapping:
|
@@ -503,7 +506,7 @@ steps:
|
|
503
506
|
- ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
|
504
507
|
- <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Sb - > <sample>.bam
|
505
508
|
- rm -f R1.fastq.gz R2.fastq.gz
|
506
|
-
cpu:
|
509
|
+
cpu: 12
|
507
510
|
```
|
508
511
|
|
509
512
|
**samples.yml**
|
@@ -511,7 +514,7 @@ steps:
|
|
511
514
|
resources:
|
512
515
|
index: /storage/genomes/bwa_index/genome
|
513
516
|
genome: /storage/genomes/genome.fa
|
514
|
-
output:
|
517
|
+
output: ./working
|
515
518
|
|
516
519
|
samples:
|
517
520
|
sampleA: /ngs_reads/sampleA
|
@@ -523,24 +526,33 @@ samples:
|
|
523
526
|
Running PipEngine with the following command line:
|
524
527
|
|
525
528
|
```
|
526
|
-
pipengine -p pipeline.yml -f samples.yml -s mapping
|
529
|
+
pipengine run -p pipeline.yml -f samples.yml -s mapping -d
|
527
530
|
```
|
528
531
|
|
529
|
-
will generate a runnable shell script for each sample:
|
532
|
+
will generate a runnable shell script for each sample (available in the ./working directory):
|
530
533
|
|
531
534
|
```shell
|
532
|
-
#!/bin/bash
|
533
|
-
#PBS -N
|
534
|
-
#PBS -
|
535
|
-
|
536
|
-
|
537
|
-
|
538
|
-
|
539
|
-
|
540
|
-
|
541
|
-
|
542
|
-
|
543
|
-
|
535
|
+
#!/usr/bin/env bash
|
536
|
+
#PBS -N 2c57c1a853-sampleA-mapping
|
537
|
+
#PBS -d ./working
|
538
|
+
#PBS -l nodes=1:ppn=12
|
539
|
+
if [ ! -f ./working/sampleA/mapping/checkpoint ]
|
540
|
+
then
|
541
|
+
echo "mapping 2c57c1a853-sampleA-mapping start `whoami` `hostname` `pwd` `date`."
|
542
|
+
|
543
|
+
mkdir -p ./working/sampleA/mapping
|
544
|
+
cd ./working/sampleA/mapping
|
545
|
+
ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | /software/pigz -p 10 >> R1.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
546
|
+
ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | /software/pigz -p 10 >> R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
547
|
+
/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
548
|
+
rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
549
|
+
echo "mapping 2c57c1a853-sampleA-mapping finished `whoami` `hostname` `pwd` `date`."
|
550
|
+
touch ./working/sampleA/mapping/checkpoint
|
551
|
+
else
|
552
|
+
echo "mapping 2c57c1a853-sampleA-mapping already executed, skipping this step `whoami` `hostname` `pwd` `date`."
|
553
|
+
fi
|
554
|
+
```
|
555
|
+
As you can see, the command lines described in the pipeline YAML are translated into normal Unix command lines, so any solution that works in a standard Unix shell (pipes, bash substitutions) is perfectly acceptable. PipEngine adds extra lines to the script: checkpoint controls for each step, to avoid re-running already executed steps, and error controls with logging.
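The checkpoint and error-control pattern visible in the generated script can be reduced to the following sketch. It is not PipEngine code: the function name `run_step`, the shortened log messages, and the temporary working directory are assumptions made for illustration, and the function uses `return 1` where the generated script uses `exit 1`.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the checkpoint guard; not PipEngine code.
# Usage: run_step <step_dir> <step_name> <command> [args...]
run_step() {
  local step_dir=$1 step_name=$2
  shift 2
  if [ ! -f "$step_dir/checkpoint" ]; then
    echo "$step_name start"
    mkdir -p "$step_dir"
    # Run the command inside the step directory; on failure, log it and
    # leave the step without a checkpoint so it is retried next time.
    ( cd "$step_dir" && "$@" ) || { echo "$step_name FAILED"; return 1; }
    echo "$step_name finished"
    touch "$step_dir/checkpoint"
  else
    echo "$step_name already executed, skipping this step"
  fi
}

workdir=$(mktemp -d)
run_step "$workdir/sampleA/mapping" mapping touch sampleA.bam  # runs the command
run_step "$workdir/sampleA/mapping" mapping touch sampleA.bam  # skipped: checkpoint exists
```

Because the checkpoint file is only created after the command succeeds, a failed step is executed again on the next run, while completed steps are skipped.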
|
544
556
|
|
545
557
|
In this case too, the **run** key defines three different command lines, described using a YAML array (each line prefixed with a -). These command lines are all part of the same step, since the first two are required to prepare the input for the third command line (BWA), using standard bash commands.
|
546
558
|
|
@@ -556,7 +568,7 @@ Now I want to execute more steps in a single job for each sample. The pipeline Y
|
|
556
568
|
pipeline: resequencing
|
557
569
|
|
558
570
|
resources:
|
559
|
-
bwa: /software/bwa
|
571
|
+
bwa: /software/bwa
|
560
572
|
samtools: /software/samtools
|
561
573
|
mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
|
562
574
|
gatk: /software/GenomeAnalysisTK/GenomeAnalysisTK.jar
|
@@ -566,45 +578,74 @@ steps:
|
|
566
578
|
run:
|
567
579
|
- ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
|
568
580
|
- ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
|
569
|
-
- <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<
|
581
|
+
- <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<sample> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
|
570
582
|
- rm -f R1.fastq.gz R2.fastq.gz
|
571
|
-
cpu:
|
583
|
+
cpu: 12
|
572
584
|
|
573
585
|
mark_dup:
|
586
|
+
pre: mapping
|
574
587
|
run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false
|
575
588
|
|
576
589
|
realign_target:
|
590
|
+
pre: mark_dup
|
577
591
|
run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
|
578
592
|
cpu: 8
|
579
593
|
```
|
580
594
|
|
581
|
-
The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine
|
595
|
+
The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine can be invoked with this command line:
|
582
596
|
|
583
597
|
```
|
584
|
-
pipengine -p
|
598
|
+
pipengine run -p pipeline_multi.yml -f samples.yml -s realign_target -d
|
585
599
|
```
|
586
|
-
|
587
|
-
And this will be translated into the following shell script (one for each sample):
|
600
|
+
Since dependencies have been defined for the steps using the ```pre``` key, it is sufficient to invoke PipEngine with the last step, and the other two are automatically included in the script. In this case PipEngine will print warning messages, because the directories of the steps required by other steps in the pipeline are not yet available (the corresponding steps will therefore be executed to generate the necessary data). The command line will generate the following shell script (one for each sample, available in the ./working directory):
|
588
601
|
|
589
602
|
```shell
|
590
|
-
#!/bin/bash
|
591
|
-
#PBS -N
|
592
|
-
#PBS -
|
593
|
-
|
594
|
-
|
595
|
-
|
596
|
-
|
597
|
-
|
598
|
-
|
599
|
-
|
600
|
-
|
601
|
-
|
602
|
-
java -Xmx4g -jar /software/picard-tools-1.77/
|
603
|
-
|
604
|
-
|
605
|
-
|
603
|
+
#!/usr/bin/env bash
|
604
|
+
#PBS -N 6f3c911c49-sampleC-realign_target
|
605
|
+
#PBS -d ./working
|
606
|
+
#PBS -l nodes=1:ppn=12
|
607
|
+
if [ ! -f ./working/sampleC/mapping/checkpoint ]
|
608
|
+
then
|
609
|
+
echo "mapping 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
|
610
|
+
|
611
|
+
mkdir -p ./working/sampleC/mapping
|
612
|
+
cd ./working/sampleC/mapping
|
613
|
+
ls /ngs_reads/sampleC/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
614
|
+
ls /ngs_reads/sampleC/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
615
|
+
/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=sampleC.sorted.bam SO=coordinate LB=sampleC PL=illumina PU=PU SM=sampleC TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000 || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
616
|
+
rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
|
617
|
+
echo "mapping 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
|
618
|
+
touch ./working/sampleC/mapping/checkpoint
|
619
|
+
else
|
620
|
+
echo "mapping 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
|
621
|
+
fi
|
622
|
+
if [ ! -f ./working/sampleC/mark_dup/checkpoint ]
|
623
|
+
then
|
624
|
+
echo "mark_dup 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
|
625
|
+
|
626
|
+
mkdir -p ./working/sampleC/mark_dup
|
627
|
+
cd ./working/sampleC/mark_dup
|
628
|
+
java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=./working/sampleC/mapping/sampleC.sorted.bam OUTPUT=sampleC.md.sort.bam METRICS_FILE=sampleC.metrics REMOVE_DUPLICATES=false || { echo "mark_dup 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
|
629
|
+
echo "mark_dup 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
|
630
|
+
touch ./working/sampleC/mark_dup/checkpoint
|
631
|
+
else
|
632
|
+
echo "mark_dup 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
|
633
|
+
fi
|
634
|
+
if [ ! -f ./working/sampleC/realign_target/checkpoint ]
|
635
|
+
then
|
636
|
+
echo "realign_target 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
|
637
|
+
|
638
|
+
mkdir -p ./working/sampleC/realign_target
|
639
|
+
cd ./working/sampleC/realign_target
|
640
|
+
java -Xmx4g -jar /software/GenomeAnalysisTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -I ./working/sampleC/mark_dup/sampleC.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleC.indels.intervals || { echo "realign_target 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
|
641
|
+
echo "realign_target 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
|
642
|
+
touch ./working/sampleC/realign_target/checkpoint
|
643
|
+
else
|
644
|
+
echo "realign_target 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
|
645
|
+
fi
|
606
646
|
```
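The order of the three blocks in the script above follows the ```pre``` keys of the pipeline. A minimal sketch of that resolution, assuming each step declares at most one ```pre``` step (a simple linear chain), could look like this. The function names `pre_of` and `resolve_steps` are hypothetical and are not taken from PipEngine.

```shell
#!/usr/bin/env bash
# Hypothetical lookup of each step's `pre` key, mirroring the pipeline YAML
# of this example (assumption: at most one `pre` step per step).
pre_of() {
  case "$1" in
    mark_dup)       echo "mapping" ;;
    realign_target) echo "mark_dup" ;;
    *)              echo "" ;;  # mapping declares no `pre`
  esac
}

# Walk the `pre` chain backwards from the requested step and print the
# steps in execution order.
resolve_steps() {
  local step=$1 order=""
  while [ -n "$step" ]; do
    order="$step${order:+ $order}"
    step=$(pre_of "$step")
  done
  echo "$order"
}

resolve_steps realign_target
```

Requesting only `realign_target` yields `mapping mark_dup realign_target`, the same order in which the step blocks appear in the generated script.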
|
607
647
|
|
648
|
+
|
608
649
|
Logging
|
609
650
|
---------------------------
|
610
651
|
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.9.
|
1
|
+
0.9.7
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: bio-pipengine
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.9.
|
4
|
+
version: 0.9.7
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Francesco Strozzi
|
@@ -9,7 +9,7 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2017-08-
|
12
|
+
date: 2017-08-28 00:00:00.000000000 Z
|
13
13
|
dependencies:
|
14
14
|
- !ruby/object:Gem::Dependency
|
15
15
|
name: trollop
|
@@ -93,7 +93,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
93
93
|
version: '0'
|
94
94
|
requirements: []
|
95
95
|
rubyforge_project:
|
96
|
-
rubygems_version: 2.6.
|
96
|
+
rubygems_version: 2.6.11
|
97
97
|
signing_key:
|
98
98
|
specification_version: 4
|
99
99
|
summary: A pipeline manager
|