bio-pipengine 0.9.6 → 0.9.7

Files changed (4)
  1. checksums.yaml +4 -4
  2. data/README.md +86 -45
  3. data/VERSION +1 -1
  4. metadata +3 -3
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 59db77881c33ae6c4047c6ecae27c1b7b7c1a92d
-  data.tar.gz: 79a56595c6aeee228fd488410527b0d1f2636278
+  metadata.gz: b76612ae318c72b01d4b2ce5788154107afdd0b8
+  data.tar.gz: 0ee4488dbcdbd653248e625cee1f633d1beaeee5
 SHA512:
-  metadata.gz: 61d64e18fa677cf684775d8690bad825002b554d91f3f7672254c654ef3d8474948aeead1baed10b14f18e2857e1524996f6378a31f31a9d8f6d93c1479c9c4b
-  data.tar.gz: 573a3579a033aad50dde13db6ec807fcda175d1d5c8abc924d48ae56c0f1b0e6c84c30636ab8d1eb97687dbfca6eab7b3425540d358ed6e447cefcdf3045e7cb
+  metadata.gz: b68abc5872cbbfea319e87f88ec7dc2ed39fe0834ec4520702eb2a71f97088919804fcd700868aa71d760c36b4e121355c3273005151dd54122fcd188003249c
+  data.tar.gz: 57ed883cb1754c1ea3122ed853931694c46999d420fe525e6428adff9af1952ebc124d9ff7975537f84b08a5716305af8502b80895bd5e3e30f9a495df9bfcdb
data/README.md CHANGED
@@ -71,7 +71,7 @@ Command line for RUN mode
 
 **Command line**
 ```shell
-> pipenengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
+> pipengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
 ```
 
 **Parameters**
@@ -115,7 +115,7 @@ pipeline: resequencing
 
 resources:
   fastqc: /software/FastQC/fastqc
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   gatk: /software/gatk-lite/GenomeAnalysisTk.jar
   samtools: /software/samtools
   samsort: /software/picard-tools-1.77/SortSam.jar
@@ -375,7 +375,7 @@ So the two command lines need two different kind of files as input from the same
 
 Once the step has been defined in the pipeline YAML, PipEngine must be invoked using the **-m** parameter, to specify the samples that should be grouped together by this step:
 
 ```shell
-pipengine -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
+pipengine run -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
 ```
 
 Note that the use of commas is not arbitrary: the **-m** parameter specifies not only which samples should be used for this step, but also how they should be organized on the corresponding command line. The **-m** parameter takes the sample names, combines them internally with the 'multi' keywords and then substitutes them back into the command line, keeping the samples in the same order as provided with **-m**.
@@ -441,7 +441,7 @@ When invoking PipEngine, the tool will look for the pipeline YAML specified and
 
 PipEngine will then combine the data from the two YAML, generating the specific command lines of the selected steps and substituting all the placeholders to generate the final command lines.
 
-A shell script will be finally generated, for each sample, that will contain all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler.
+A shell script will finally be generated, for each sample, containing all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler. The shell scripts are written inside the directory specified by the ```output:``` key in the ```samples.yml``` file; the directory is created if it does not exist.
 
 If not invoked with the **-d** option (dry-run) PipEngine will directly submit the jobs to the PBS scheduler using the "qsub" command.
 
@@ -471,7 +471,7 @@ It is of course possible to aggregate multiple steps of a pipeline and run them
 From the command line it's just:
 
 ```shell
-pipengine -p pipeline.yml -s mapping mark_dup realign_target
+pipengine run -p pipeline.yml -s mapping mark_dup realign_target
 ```
 
 A single job script, for each sample, will be generated with all the instructions for these steps. If more than one step declares a **cpu** key, the highest cpu value will be assigned for the whole job.
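As a concrete illustration (adding the samples file and the dry-run flag used elsewhere in this README), the aggregated invocation could be written as follows:

```shell
# One job script per sample containing all three steps. If, as in the
# examples below, mapping declares cpu: 12 and realign_target declares
# cpu: 8, the whole job requests the highest value (12).
pipengine run -p pipeline.yml -f samples.yml -s mapping mark_dup realign_target -d
```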
@@ -483,6 +483,8 @@ When multiple steps are run in the same job, by default PipEngine will generate
 :: Examples ::
 ==============
 
+All these files can be found in the test/examples directory of the repository.
+
 Example 1: One step and multiple command lines
 ----------------------------------------------
 
@@ -493,8 +495,9 @@ This is an example on how to prepare the inputs for BWA and run it along with Sa
 pipeline: resequencing
 
 resources:
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   samtools: /software/samtools
+  pigz: /software/pigz
 
 steps:
   mapping:
@@ -503,7 +506,7 @@ steps:
       - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
       - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Sb - > <sample>.bam
       - rm -f R1.fastq.gz R2.fastq.gz
-    cpu: 11
+    cpu: 12
 ```
 
 **samples.yml**
@@ -511,7 +514,7 @@ steps:
 resources:
   index: /storage/genomes/bwa_index/genome
   genome: /storage/genomes/genome.fa
-  output: /storage/results
+  output: ./working
 
 samples:
   sampleA: /ngs_reads/sampleA
@@ -523,24 +526,33 @@ samples:
 Running PipEngine with the following command line:
 
 ```
-pipengine -p pipeline.yml -f samples.yml -s mapping
+pipengine run -p pipeline.yml -f samples.yml -s mapping -d
 ```
 
-will generate a runnable shell script for each sample:
+will generate a runnable shell script for each sample (available in the ./working directory):
 
 ```shell
-#!/bin/bash
-#PBS -N 37735f50-mapping
-#PBS -l ncpus=11
-
-mkdir -p /storage/results/sampleA/mapping
-cd /storage/results/sampleA/mapping
-ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
-ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-/software/bwa-0.6.2/bwa sampe -P /genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
-rm -f R1.fastq.gz R2.fastq.gz
-```
-As you can see the command line described in the pipeline YAML are translated into normal Unix command lines, therefore every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable.
+#!/usr/bin/env bash
+#PBS -N 2c57c1a853-sampleA-mapping
+#PBS -d ./working
+#PBS -l nodes=1:ppn=12
+if [ ! -f ./working/sampleA/mapping/checkpoint ]
+then
+echo "mapping 2c57c1a853-sampleA-mapping start `whoami` `hostname` `pwd` `date`."
+
+mkdir -p ./working/sampleA/mapping
+cd ./working/sampleA/mapping
+ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | /software/pigz -p 10 >> R1.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
+ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | /software/pigz -p 10 >> R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
+/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
+rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mapping 2c57c1a853-sampleA-mapping finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleA/mapping/checkpoint
+else
+echo "mapping 2c57c1a853-sampleA-mapping already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+```
+As you can see, the command lines described in the pipeline YAML are translated into normal Unix command lines, therefore every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable. PipEngine adds extra lines to the script for step checkpoint controls, to avoid re-running already executed steps, and for error controls with logging (a condensed sketch of this guard pattern follows below).
 
 In this case also, the **run** key defines three different command lines, which are described using a YAML array (a line prepended with a -). These command lines are all part of the same step, since the first two are required to prepare the input for the third command line (BWA), using standard bash commands.
 
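Distilled from the generated script above, the guard PipEngine wraps around each step boils down to the pattern sketched below; the paths, step name and command are illustrative, not the literal generated code.

```shell
#!/usr/bin/env bash
# Condensed sketch of the per-step guard written into each job script.
STEP_DIR=./working/sampleA/mapping
step_command() { echo "the step's real commands run here"; }   # placeholder

if [ ! -f "$STEP_DIR/checkpoint" ]; then
  mkdir -p "$STEP_DIR"
  cd "$STEP_DIR"
  # every command gets an error trap that logs the failure and aborts the job
  step_command || { echo "mapping FAILED `date`"; exit 1; }
  touch "$STEP_DIR/checkpoint"   # mark the step as completed
else
  echo "mapping already executed, skipping this step"
fi

# Removing a step's checkpoint file (rm ./working/sampleA/mapping/checkpoint)
# forces that step to run again the next time the job is submitted.
```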
@@ -556,7 +568,7 @@ Now I want to execute more steps in a single job for each sample. The pipeline Y
 pipeline: resequencing
 
 resources:
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   samtools: /software/samtools
   mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
   gatk: /software/GenomeAnalysisTK/GenomeAnalysisTK.jar
@@ -566,45 +578,74 @@ steps:
     run:
       - ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
       - ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
+      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<sample> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
       - rm -f R1.fastq.gz R2.fastq.gz
-    cpu: 11
+    cpu: 12
 
   mark_dup:
+    pre: mapping
     run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false
 
   realign_target:
+    pre: mark_dup
     run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
     cpu: 8
 ```
 
-The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine must be invoked with this command line:
+The sample YAML file is the same as the example above. Now, to execute the 3 steps defined in the pipeline together, PipEngine can be invoked with this command line:
 
 ```
-pipengine -p pipeline.yml -f samples.yml -s mapping mark_dup realign_target
+pipengine run -p pipeline_multi.yml -f samples.yml -s realign_target -d
 ```
-
-And this will be translated into the following shell script (one for each sample):
+Since dependencies have been defined for the steps using the ```pre``` key, it is sufficient to invoke PipEngine with the last step and the other two are automatically included in the script. In this case PipEngine will print warnings, simply to report that the directories of some steps, which are needed by later steps in the pipeline, are not yet available (and therefore the corresponding steps will be executed to generate the necessary data). The command line will generate the following shell script (one for each sample, available in the ./working directory):
 
 ```shell
-#!/bin/bash
-#PBS -N ff020300-mapping-mark_dup-realign_target
-#PBS -l ncpus=11
-
-mkdir -p /storage/results/sampleB/mapping
-cd /storage/results/sampleB/mapping
-ls /ngs_reads/sampleB/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
-ls /ngs_reads/sampleB/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-/software/bwa-0.6.2/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
-rm -f R1.fastq.gz R2.fastq.gz
-mkdir -p /storage/results/sampleB/mark_dup
-cd /storage/results/sampleB/mark_dup
-java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=sampleB.sorted.bam OUTPUT=sampleB.md.sort.bam METRICS_FILE=sampleB.metrics REMOVE_DUPLICATES=false
-mkdir -p /storage/results/sampleB/realign_target
-cd /storage/results/sampleB/realign_target
-java -Xmx4g -jar /software/GenomeAnalysisTk/GenomeAnalysisTk.jar -T RealignerTargetCreator -I sampleB.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleB.indels.intervals
+#!/usr/bin/env bash
+#PBS -N 6f3c911c49-sampleC-realign_target
+#PBS -d ./working
+#PBS -l nodes=1:ppn=12
+if [ ! -f ./working/sampleC/mapping/checkpoint ]
+then
+echo "mapping 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+
+mkdir -p ./working/sampleC/mapping
+cd ./working/sampleC/mapping
+ls /ngs_reads/sampleC/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
+ls /ngs_reads/sampleC/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
+/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=sampleC.sorted.bam SO=coordinate LB=sampleC PL=illumina PU=PU SM=sampleC TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000 || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
+rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mapping 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/mapping/checkpoint
+else
+echo "mapping 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+if [ ! -f ./working/sampleC/mark_dup/checkpoint ]
+then
+echo "mark_dup 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+
+mkdir -p ./working/sampleC/mark_dup
+cd ./working/sampleC/mark_dup
+java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=./working/sampleC/mapping/sampleC.sorted.bam OUTPUT=sampleC.md.sort.bam METRICS_FILE=sampleC.metrics REMOVE_DUPLICATES=false || { echo "mark_dup 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mark_dup 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/mark_dup/checkpoint
+else
+echo "mark_dup 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+if [ ! -f ./working/sampleC/realign_target/checkpoint ]
+then
+echo "realign_target 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+
+mkdir -p ./working/sampleC/realign_target
+cd ./working/sampleC/realign_target
+java -Xmx4g -jar /software/GenomeAnalysisTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -I ./working/sampleC/mark_dup/sampleC.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleC.indels.intervals || { echo "realign_target 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "realign_target 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/realign_target/checkpoint
+else
+echo "realign_target 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
 ```
 
+
 Logging
 ---------------------------
 
data/VERSION CHANGED
@@ -1 +1 @@
-0.9.6
+0.9.7
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bio-pipengine
 version: !ruby/object:Gem::Version
-  version: 0.9.6
+  version: 0.9.7
 platform: ruby
 authors:
 - Francesco Strozzi
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-08-12 00:00:00.000000000 Z
+date: 2017-08-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: trollop
@@ -93,7 +93,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.6.8
+rubygems_version: 2.6.11
 signing_key:
 specification_version: 4
 summary: A pipeline manager