bio-pipengine 0.9.6 → 0.9.7

Files changed (4)

checksums.yaml +4 -4
data/README.md +86 -45
data/VERSION +1 -1
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 59db77881c33ae6c4047c6ecae27c1b7b7c1a92d
-  data.tar.gz: 79a56595c6aeee228fd488410527b0d1f2636278
+  metadata.gz: b76612ae318c72b01d4b2ce5788154107afdd0b8
+  data.tar.gz: 0ee4488dbcdbd653248e625cee1f633d1beaeee5
 SHA512:
-  metadata.gz: 61d64e18fa677cf684775d8690bad825002b554d91f3f7672254c654ef3d8474948aeead1baed10b14f18e2857e1524996f6378a31f31a9d8f6d93c1479c9c4b
-  data.tar.gz: 573a3579a033aad50dde13db6ec807fcda175d1d5c8abc924d48ae56c0f1b0e6c84c30636ab8d1eb97687dbfca6eab7b3425540d358ed6e447cefcdf3045e7cb
+  metadata.gz: b68abc5872cbbfea319e87f88ec7dc2ed39fe0834ec4520702eb2a71f97088919804fcd700868aa71d760c36b4e121355c3273005151dd54122fcd188003249c
+  data.tar.gz: 57ed883cb1754c1ea3122ed853931694c46999d420fe525e6428adff9af1952ebc124d9ff7975537f84b08a5716305af8502b80895bd5e3e30f9a495df9bfcdb
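
The SHA512 entries above can be checked against a locally downloaded copy of the gem. What follows is a minimal sketch, not part of the package: `verify_sha512` is a hypothetical helper name, and it assumes GNU coreutils `sha512sum` is available (on macOS, `shasum -a 512` prints the same format).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with bio-pipengine): succeeds only when
# the SHA-512 digest of FILE equals the EXPECTED hex string.
verify_sha512() {
  local file="$1" expected="$2"
  local actual
  # sha512sum prints "<digest>  <filename>"; keep only the digest field.
  actual="$(sha512sum "$file" | cut -d' ' -f1)"
  [ "$actual" = "$expected" ]
}
```

A `.gem` file is a plain tar archive that contains `metadata.gz`, `data.tar.gz` and `checksums.yaml.gz`, so, assuming the gem was saved as `bio-pipengine-0.9.7.gem`, one could run `tar -xf bio-pipengine-0.9.7.gem data.tar.gz` and then call `verify_sha512 data.tar.gz <digest>` with the `data.tar.gz` value listed under `SHA512:` in the 0.9.7 column.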

data/README.md CHANGED

@@ -71,7 +71,7 @@ Command line for RUN mode
 **Command line**
 ```shell
-> pipenengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
+> pipengine run -p pipeline.yml -f samples.yml -s mapping --tmp /tmp
 ```
 **Parameters**
@@ -115,7 +115,7 @@ pipeline: resequencing
 resources:
   fastqc: /software/FastQC/fastqc
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   gatk: /software/gatk-lite/GenomeAnalysisTk.jar
   samtools: /software/samtools
   samsort: /software/picard-tools-1.77/SortSam.jar
@@ -375,7 +375,7 @@ So the two command lines need two different kind of files as input from the same
 Once the step has been defined in the pipeline YAML, PipEngine must be invoked using the **-m** parameter, to specify the samples that should be grouped together by this step:
 ```shell
-pipengine -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
+pipengine run -p pipeline.yml -m SampleA,SampleB SampleC,SampleB
 ```
 Note that the use of commas is not casual, since the **-m** parameter specifies not only which samples should be used for this step, but also how they should be organized on the corresponding command line. The **-m** parameter takes the sample names and underneath it will combine the sample name with the 'multi' keywords and then it will substitute back the command line by keeping the samples in the same order as provided with the **-m**.
@@ -441,7 +441,7 @@ When invoking PipEngine, the tool will look for the pipeline YAML specified and
 PipEngine will then combine the data from the two YAML, generating the specific command lines of the selected steps and substituing all the placeholders to generate the final command lines.
-A shell script will be finally generated, for each sample, that will contain all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler.
+A shell script will be finally generated, for each sample, that will contain all the instructions to run a specific step of the pipeline plus the meta-data for the PBS scheduler. The shell scripts are written inside the directory specified on the ```output:``` key in the ```samples.yml``` file, the directory is created if it does not exist.
 If not invoked with the **-d** option (dry-run) PipEngine will directly submit the jobs to the PBS scheduler using the "qsub" command.
@@ -471,7 +471,7 @@ It is of course possible to aggregate multiple steps of a pipeline and run them
 From the command line it's just:
 ```shell
-pipengine -p pipeline.yml -s mapping mark_dup realign_target
+pipengine run -p pipeline.yml -s mapping mark_dup realign_target
 ```
 A single job script, for each sample, will be generated with all the instructions for these steps. If more than one step declares a **cpu** key, the highest cpu value will be assigned for the whole job.
@@ -483,6 +483,8 @@ When multiple steps are run in the same job, by default PipEngine will generate
 :: Examples ::
 ==============
+All these files can be found into the test/examples directory of the repository.
 Example 1: One step and multiple command lines
 ----------------------------------------------
@@ -493,8 +495,9 @@ This is an example on how to prepare the inputs for BWA and run it along with Sa
 pipeline: resequencing
 resources:
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   samtools: /software/samtools
+  pigz: /software/pigz
 steps:
   mapping:
@@ -503,7 +506,7 @@ steps:
      - ls <sample_path>/*_R2_*.gz | xargs zcat | <pigz> -p 10 >> R2.fastq.gz
      - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Sb - > <sample>.bam
      - rm -f R1.fastq.gz R2.fastq.gz
-    cpu: 11
+    cpu: 12
 ```
 **samples.yml**
@@ -511,7 +514,7 @@ steps:
 resources:
   index: /storage/genomes/bwa_index/genome
   genome: /storage/genomes/genome.fa
-  output: /storage/results
+  output: ./working
 samples:
   sampleA: /ngs_reads/sampleA
@@ -523,24 +526,33 @@ samples:
 Running PipEngine with the following command line:
 ```
-pipengine -p pipeline.yml -f samples.yml -s mapping
+pipengine run -p pipeline.yml -f samples.yml -s mapping -d
 ```
-will generate a runnable shell script for each sample:
+will generate a runnable shell script for each sample (available in the ./working directory):
 ```shell
-#!/bin/bash
-#PBS -N 37735f50-mapping
-#PBS -l ncpus=11
-mkdir -p /storage/results/sampleA/mapping
-cd /storage/results/sampleA/mapping
-ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
-ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-/software/bwa-0.6.2/bwa sampe -P /genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
-rm -f R1.fastq.gz R2.fastq.gz
-```
-As you can see the command line described in the pipeline YAML are translated into normal Unix command lines, therefore every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable.
+#!/usr/bin/env bash
+#PBS -N 2c57c1a853-sampleA-mapping
+#PBS -d ./working
+#PBS -l nodes=1:ppn=12
+if [ ! -f ./working/sampleA/mapping/checkpoint ]
+then
+echo "mapping 2c57c1a853-sampleA-mapping start `whoami` `hostname` `pwd` `date`."
+mkdir -p ./working/sampleA/mapping
+cd ./working/sampleA/mapping
+ls /ngs_reads/sampleA/*_R1_*.gz | xargs zcat | /software/pigz -p 10 >> R1.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
+ls /ngs_reads/sampleA/*_R2_*.gz | xargs zcat | /software/pigz -p 10 >> R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
+/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
+rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 2c57c1a853-sampleA-mapping FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mapping 2c57c1a853-sampleA-mapping finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleA/mapping/checkpoint
+else
+echo "mapping 2c57c1a853-sampleA-mapping already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+```
+As you can see the command line described in the pipeline YAML are translated into normal Unix command lines, therefore every solution that works on a standard Unix shell (pipes, bash substitutions) is perfectly acceptable. Pipengine addes extra lines in the script for steps checkpoint controls to avoid re-running already executed steps, and error controls with logging.
 In this case also, the **run** key defines three different command lines, that are described using YAML array (a line prepended with a -). This command lines are all part of the same step, since the first two are required to prepare the input for the third command line (BWA), using standard bash commands.
@@ -556,7 +568,7 @@ Now I want to execute more steps in a single job for each sample. The pipeline Y
 pipeline: resequencing
 resources:
-  bwa: /software/bwa-0.6.2/bwa
+  bwa: /software/bwa
   samtools: /software/samtools
   mark_dup: /software/picard-tools-1.77/MarkDuplicates.jar
   gatk: /software/GenomeAnalysisTK/GenomeAnalysisTK.jar
@@ -566,45 +578,74 @@ steps:
     run:
      - ls <sample_path>/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
      - ls <sample_path>/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-     - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<pipeline> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
+     - <bwa> sampe -P <index> <(<bwa> aln -t 4 -q 20 <index> R1.fastq.gz) <(<bwa> aln -t 4 -q 20 <index> R2.fastq.gz) R1.fastq.gz R2.fastq.gz | <samtools> view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=<sample>.sorted.bam SO=coordinate LB=<sample> PL=illumina PU=PU SM=<sample> TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000
      - rm -f R1.fastq.gz R2.fastq.gz
-    cpu: 11
+    cpu: 12
   mark_dup:
+    pre: mapping
     run: java -Xmx4g -jar <mark_dup> VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=<mapping/sample>.sorted.bam OUTPUT=<sample>.md.sort.bam METRICS_FILE=<sample>.metrics REMOVE_DUPLICATES=false
   realign_target:
+    pre: mark_dup
     run: java -Xmx4g -jar <gatk> -T RealignerTargetCreator -I <mark_dup/sample>.md.sort.bam -nt 8 -R <genome> -o <sample>.indels.intervals
     cpu: 8
 ```
-The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine must be invoked with this command line:
+The sample YAML file is the same as the example above. Now to execute together the 3 steps defined in the pipeline, PipEngine can be invoked with this command line:
 ```
-pipengine -p pipeline.yml  -f samples.yml -s mapping mark_dup realign_target
+pipengine run -p pipeline_multi.yml  -f samples.yml -s realign_target -d
 ```
-And this will be translated into the following shell script (one for each sample):
+Since dependencies have been defined for the steps using the ```pre``` key, it is sufficient to invoke Pipengine with the last step and the other two are automatically included in the script. Messages will be prompted in this case since Pipengine just warns that the directories for certain steps, that are needed for other steps in the pipeline, are not yet available (and thus the corresponding steps will be executed to generate the necessary data). The command line will generate the following shell script (one for each sample, available in the ./working directory):
 ```shell
-#!/bin/bash
-#PBS -N ff020300-mapping-mark_dup-realign_target
-#PBS -l ncpus=11
-mkdir -p /storage/results/sampleB/mapping
-cd /storage/results/sampleB/mapping
-ls /ngs_reads/sampleB/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz
-ls /ngs_reads/sampleB/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz
-/software/bwa-0.6.2/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa-0.6.2/bwa aln -t 4 -q 20 /genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Sb - > sampleA.bam
-rm -f R1.fastq.gz R2.fastq.gz
-mkdir -p /storage/results/sampleB/mark_dup
-cd /storage/results/sampleB/mark_dup
-java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=sampleB.sorted.bam OUTPUT=sampleB.md.sort.bam METRICS_FILE=sampleB.metrics REMOVE_DUPLICATES=false
-mkdir -p /storage/results/sampleB/realign_target
-cd /storage/results/sampleB/realign_target
-java -Xmx4g -jar /software/GenomeAnalysisTk/GenomeAnalysisTk.jar -T RealignerTargetCreator -I sampleB.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleB.indels.intervals
+#!/usr/bin/env bash
+#PBS -N 6f3c911c49-sampleC-realign_target
+#PBS -d ./working
+#PBS -l nodes=1:ppn=12
+if [ ! -f ./working/sampleC/mapping/checkpoint ]
+then
+echo "mapping 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+mkdir -p ./working/sampleC/mapping
+cd ./working/sampleC/mapping
+ls /ngs_reads/sampleC/*_R1_*.gz | xargs zcat | pigz -p 10 >> R1.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 0 `whoami` `hostname` `pwd` `date`."; exit 1; }
+ls /ngs_reads/sampleC/*_R2_*.gz | xargs zcat | pigz -p 10 >> R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 1 `whoami` `hostname` `pwd` `date`."; exit 1; }
+/software/bwa sampe -P /storage/genomes/bwa_index/genome <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R1.fastq.gz) <(/software/bwa aln -t 4 -q 20 /storage/genomes/bwa_index/genome R2.fastq.gz) R1.fastq.gz R2.fastq.gz | /software/samtools view -Su - | java -Xmx4g -jar /storage/software/picard-tools-1.77/AddOrReplaceReadGroups.jar I=/dev/stdin O=sampleC.sorted.bam SO=coordinate LB=sampleC PL=illumina PU=PU SM=sampleC TMP_DIR=/data/tmp CREATE_INDEX=true MAX_RECORDS_IN_RAM=1000000 || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 2 `whoami` `hostname` `pwd` `date`."; exit 1; }
+rm -f R1.fastq.gz R2.fastq.gz || { echo "mapping 6f3c911c49-sampleC-realign_target FAILED 3 `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mapping 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/mapping/checkpoint
+else
+echo "mapping 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+if [ ! -f ./working/sampleC/mark_dup/checkpoint ]
+then
+echo "mark_dup 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+mkdir -p ./working/sampleC/mark_dup
+cd ./working/sampleC/mark_dup
+java -Xmx4g -jar /software/picard-tools-1.77/MarkDuplicates.jar VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 VALIDATION_STRINGENCY=SILENT INPUT=./working/sampleC/mapping/sampleC.sorted.bam OUTPUT=sampleC.md.sort.bam METRICS_FILE=sampleC.metrics REMOVE_DUPLICATES=false || { echo "mark_dup 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "mark_dup 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/mark_dup/checkpoint
+else
+echo "mark_dup 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
+if [ ! -f ./working/sampleC/realign_target/checkpoint ]
+then
+echo "realign_target 6f3c911c49-sampleC-realign_target start `whoami` `hostname` `pwd` `date`."
+mkdir -p ./working/sampleC/realign_target
+cd ./working/sampleC/realign_target
+java -Xmx4g -jar /software/GenomeAnalysisTK/GenomeAnalysisTK.jar -T RealignerTargetCreator -I ./working/sampleC/mark_dup/sampleC.md.sort.bam -nt 8 -R /storage/genomes/genome.fa -o sampleC.indels.intervals || { echo "realign_target 6f3c911c49-sampleC-realign_target FAILED `whoami` `hostname` `pwd` `date`."; exit 1; }
+echo "realign_target 6f3c911c49-sampleC-realign_target finished `whoami` `hostname` `pwd` `date`."
+touch ./working/sampleC/realign_target/checkpoint
+else
+echo "realign_target 6f3c911c49-sampleC-realign_target already executed, skipping this step `whoami` `hostname` `pwd` `date`."
+fi
 ```
 Logging
 ---------------------------

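The scripts generated by 0.9.7 in the README diff above wrap every step in the same guard: skip the step if its `checkpoint` file exists, otherwise run the commands, stop on the first failure, and create the checkpoint at the end. The pattern can be sketched as a reusable function. This is an illustration only: `run_step` is a hypothetical name, it is not how PipEngine produces its scripts, and unlike the generated scripts it runs the command in a subshell and returns a status instead of calling `exit 1`.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the checkpoint pattern; not PipEngine code.
# Usage: run_step STEP_DIR COMMAND [ARGS...]
run_step() {
  local step_dir="$1"; shift
  local step_name
  step_name="$(basename "$step_dir")"
  local checkpoint="$step_dir/checkpoint"

  # A checkpoint file means the step already completed successfully.
  if [ -f "$checkpoint" ]; then
    echo "$step_name already executed, skipping this step $(date)."
    return 0
  fi

  echo "$step_name start $(whoami) $(hostname) $(date)."
  mkdir -p "$step_dir"
  # Run inside the step directory; the subshell keeps the caller's cwd intact.
  ( cd "$step_dir" && "$@" ) || {
    echo "$step_name FAILED $(whoami) $(hostname) $(date)."
    return 1
  }
  echo "$step_name finished $(whoami) $(hostname) $(date)."
  # Written only after success, so a failed step is retried on the next run.
  touch "$checkpoint"
}
```

For example, `run_step ./working/sampleA/mapping sh -c 'echo done > out.txt'` creates `out.txt` and the checkpoint on the first call and only prints the skip message on the second.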
data/VERSION CHANGED

@@ -1 +1 @@
-0.9.6
+0.9.7

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: bio-pipengine
 version: !ruby/object:Gem::Version
-  version: 0.9.6
+  version: 0.9.7
 platform: ruby
 authors:
 - Francesco Strozzi
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-08-12 00:00:00.000000000 Z
+date: 2017-08-28 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: trollop
@@ -93,7 +93,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.6.8
+rubygems_version: 2.6.11
 signing_key:
 specification_version: 4
 summary: A pipeline manager