BioDSL 1.0.1 → 1.0.2
- checksums.yaml +4 -4
- data/.gitignore +1 -0
- data/BioDSL.gemspec +1 -1
- data/Gemfile +6 -0
- data/README.md +289 -155
- data/Rakefile +18 -16
- data/lib/BioDSL.rb +1 -1
- data/lib/BioDSL/cary.rb +78 -53
- data/lib/BioDSL/command.rb +2 -2
- data/lib/BioDSL/commands.rb +1 -1
- data/lib/BioDSL/commands/add_key.rb +1 -1
- data/lib/BioDSL/commands/align_seq_mothur.rb +4 -4
- data/lib/BioDSL/commands/analyze_residue_distribution.rb +5 -5
- data/lib/BioDSL/commands/assemble_pairs.rb +13 -13
- data/lib/BioDSL/commands/assemble_seq_idba.rb +7 -9
- data/lib/BioDSL/commands/assemble_seq_ray.rb +13 -13
- data/lib/BioDSL/commands/assemble_seq_spades.rb +4 -4
- data/lib/BioDSL/commands/classify_seq.rb +8 -8
- data/lib/BioDSL/commands/classify_seq_mothur.rb +5 -5
- data/lib/BioDSL/commands/clip_primer.rb +7 -7
- data/lib/BioDSL/commands/cluster_otus.rb +5 -5
- data/lib/BioDSL/commands/collapse_otus.rb +2 -2
- data/lib/BioDSL/commands/collect_otus.rb +2 -2
- data/lib/BioDSL/commands/complement_seq.rb +4 -4
- data/lib/BioDSL/commands/count.rb +1 -1
- data/lib/BioDSL/commands/count_values.rb +2 -2
- data/lib/BioDSL/commands/degap_seq.rb +6 -7
- data/lib/BioDSL/commands/dereplicate_seq.rb +1 -1
- data/lib/BioDSL/commands/dump.rb +2 -2
- data/lib/BioDSL/commands/filter_rrna.rb +4 -4
- data/lib/BioDSL/commands/genecall.rb +7 -7
- data/lib/BioDSL/commands/grab.rb +1 -1
- data/lib/BioDSL/commands/index_taxonomy.rb +3 -3
- data/lib/BioDSL/commands/mask_seq.rb +4 -4
- data/lib/BioDSL/commands/mean_scores.rb +2 -2
- data/lib/BioDSL/commands/merge_pair_seq.rb +3 -3
- data/lib/BioDSL/commands/merge_table.rb +1 -1
- data/lib/BioDSL/commands/merge_values.rb +1 -1
- data/lib/BioDSL/commands/plot_heatmap.rb +4 -5
- data/lib/BioDSL/commands/plot_histogram.rb +4 -4
- data/lib/BioDSL/commands/plot_matches.rb +5 -5
- data/lib/BioDSL/commands/plot_residue_distribution.rb +6 -6
- data/lib/BioDSL/commands/plot_scores.rb +7 -7
- data/lib/BioDSL/commands/random.rb +1 -1
- data/lib/BioDSL/commands/read_fasta.rb +9 -9
- data/lib/BioDSL/commands/read_fastq.rb +16 -16
- data/lib/BioDSL/commands/read_table.rb +2 -3
- data/lib/BioDSL/commands/reverse_seq.rb +4 -4
- data/lib/BioDSL/commands/slice_align.rb +4 -4
- data/lib/BioDSL/commands/slice_seq.rb +3 -3
- data/lib/BioDSL/commands/sort.rb +1 -1
- data/lib/BioDSL/commands/split_pair_seq.rb +6 -7
- data/lib/BioDSL/commands/split_values.rb +2 -2
- data/lib/BioDSL/commands/trim_primer.rb +13 -8
- data/lib/BioDSL/commands/trim_seq.rb +5 -5
- data/lib/BioDSL/commands/uchime_ref.rb +6 -6
- data/lib/BioDSL/commands/uclust.rb +5 -5
- data/lib/BioDSL/commands/unique_values.rb +1 -1
- data/lib/BioDSL/commands/usearch_global.rb +2 -2
- data/lib/BioDSL/commands/usearch_local.rb +2 -2
- data/lib/BioDSL/commands/write_fasta.rb +7 -9
- data/lib/BioDSL/commands/write_fastq.rb +4 -4
- data/lib/BioDSL/commands/write_table.rb +3 -3
- data/lib/BioDSL/commands/write_tree.rb +2 -3
- data/lib/BioDSL/config.rb +2 -2
- data/lib/BioDSL/csv.rb +8 -10
- data/lib/BioDSL/debug.rb +1 -1
- data/lib/BioDSL/fasta.rb +54 -40
- data/lib/BioDSL/fastq.rb +35 -32
- data/lib/BioDSL/filesys.rb +56 -47
- data/lib/BioDSL/fork.rb +1 -1
- data/lib/BioDSL/hamming.rb +1 -1
- data/lib/BioDSL/helpers.rb +1 -1
- data/lib/BioDSL/helpers/aux_helper.rb +1 -1
- data/lib/BioDSL/helpers/email_helper.rb +1 -1
- data/lib/BioDSL/helpers/history_helper.rb +1 -1
- data/lib/BioDSL/helpers/log_helper.rb +1 -1
- data/lib/BioDSL/helpers/options_helper.rb +1 -1
- data/lib/BioDSL/helpers/status_helper.rb +1 -1
- data/lib/BioDSL/html_report.rb +1 -1
- data/lib/BioDSL/math.rb +1 -1
- data/lib/BioDSL/mummer.rb +1 -1
- data/lib/BioDSL/pipeline.rb +1 -1
- data/lib/BioDSL/seq.rb +240 -231
- data/lib/BioDSL/seq/ambiguity.rb +1 -1
- data/lib/BioDSL/seq/assemble.rb +1 -1
- data/lib/BioDSL/seq/backtrack.rb +93 -76
- data/lib/BioDSL/seq/digest.rb +1 -1
- data/lib/BioDSL/seq/dynamic.rb +43 -55
- data/lib/BioDSL/seq/homopolymer.rb +34 -36
- data/lib/BioDSL/seq/kmer.rb +67 -50
- data/lib/BioDSL/seq/levenshtein.rb +35 -40
- data/lib/BioDSL/seq/translate.rb +64 -55
- data/lib/BioDSL/seq/trim.rb +60 -50
- data/lib/BioDSL/serializer.rb +1 -1
- data/lib/BioDSL/stream.rb +1 -1
- data/lib/BioDSL/taxonomy.rb +1 -1
- data/lib/BioDSL/test.rb +1 -1
- data/lib/BioDSL/tmp_dir.rb +1 -1
- data/lib/BioDSL/usearch.rb +1 -1
- data/lib/BioDSL/verbose.rb +1 -1
- data/lib/BioDSL/version.rb +2 -2
- data/test/BioDSL/commands/test_add_key.rb +1 -1
- data/test/BioDSL/commands/test_align_seq_mothur.rb +1 -1
- data/test/BioDSL/commands/test_analyze_residue_distribution.rb +1 -1
- data/test/BioDSL/commands/test_assemble_pairs.rb +1 -1
- data/test/BioDSL/commands/test_assemble_seq_idba.rb +1 -1
- data/test/BioDSL/commands/test_assemble_seq_ray.rb +1 -1
- data/test/BioDSL/commands/test_assemble_seq_spades.rb +1 -1
- data/test/BioDSL/commands/test_classify_seq.rb +1 -1
- data/test/BioDSL/commands/test_classify_seq_mothur.rb +1 -1
- data/test/BioDSL/commands/test_clip_primer.rb +1 -1
- data/test/BioDSL/commands/test_cluster_otus.rb +1 -1
- data/test/BioDSL/commands/test_collapse_otus.rb +1 -1
- data/test/BioDSL/commands/test_collect_otus.rb +1 -1
- data/test/BioDSL/commands/test_complement_seq.rb +1 -1
- data/test/BioDSL/commands/test_count.rb +1 -1
- data/test/BioDSL/commands/test_count_values.rb +1 -1
- data/test/BioDSL/commands/test_degap_seq.rb +1 -1
- data/test/BioDSL/commands/test_dereplicate_seq.rb +1 -1
- data/test/BioDSL/commands/test_dump.rb +1 -1
- data/test/BioDSL/commands/test_filter_rrna.rb +1 -1
- data/test/BioDSL/commands/test_genecall.rb +1 -1
- data/test/BioDSL/commands/test_grab.rb +1 -1
- data/test/BioDSL/commands/test_index_taxonomy.rb +1 -1
- data/test/BioDSL/commands/test_mask_seq.rb +1 -1
- data/test/BioDSL/commands/test_mean_scores.rb +1 -1
- data/test/BioDSL/commands/test_merge_pair_seq.rb +1 -1
- data/test/BioDSL/commands/test_merge_table.rb +1 -1
- data/test/BioDSL/commands/test_merge_values.rb +1 -1
- data/test/BioDSL/commands/test_plot_heatmap.rb +1 -1
- data/test/BioDSL/commands/test_plot_histogram.rb +1 -1
- data/test/BioDSL/commands/test_plot_matches.rb +1 -1
- data/test/BioDSL/commands/test_plot_residue_distribution.rb +1 -1
- data/test/BioDSL/commands/test_plot_scores.rb +1 -1
- data/test/BioDSL/commands/test_random.rb +1 -1
- data/test/BioDSL/commands/test_read_fasta.rb +1 -1
- data/test/BioDSL/commands/test_read_fastq.rb +1 -1
- data/test/BioDSL/commands/test_read_table.rb +1 -1
- data/test/BioDSL/commands/test_reverse_seq.rb +1 -1
- data/test/BioDSL/commands/test_slice_align.rb +1 -1
- data/test/BioDSL/commands/test_slice_seq.rb +1 -1
- data/test/BioDSL/commands/test_sort.rb +1 -1
- data/test/BioDSL/commands/test_split_pair_seq.rb +1 -1
- data/test/BioDSL/commands/test_split_values.rb +1 -1
- data/test/BioDSL/commands/test_trim_primer.rb +1 -1
- data/test/BioDSL/commands/test_trim_seq.rb +1 -1
- data/test/BioDSL/commands/test_uchime_ref.rb +1 -1
- data/test/BioDSL/commands/test_uclust.rb +1 -1
- data/test/BioDSL/commands/test_unique_values.rb +1 -1
- data/test/BioDSL/commands/test_usearch_global.rb +1 -1
- data/test/BioDSL/commands/test_usearch_local.rb +1 -1
- data/test/BioDSL/commands/test_write_fasta.rb +1 -1
- data/test/BioDSL/commands/test_write_fastq.rb +1 -1
- data/test/BioDSL/commands/test_write_table.rb +1 -1
- data/test/BioDSL/commands/test_write_tree.rb +1 -1
- data/test/BioDSL/helpers/test_options_helper.rb +3 -3
- data/test/BioDSL/seq/test_assemble.rb +58 -56
- data/test/BioDSL/seq/test_backtrack.rb +83 -81
- data/test/BioDSL/seq/test_digest.rb +47 -45
- data/test/BioDSL/seq/test_dynamic.rb +66 -64
- data/test/BioDSL/seq/test_homopolymer.rb +35 -33
- data/test/BioDSL/seq/test_kmer.rb +29 -28
- data/test/BioDSL/seq/test_translate.rb +44 -42
- data/test/BioDSL/seq/test_trim.rb +59 -57
- data/test/BioDSL/test_cary.rb +1 -1
- data/test/BioDSL/test_command.rb +2 -2
- data/test/BioDSL/test_csv.rb +34 -31
- data/test/BioDSL/test_debug.rb +31 -31
- data/test/BioDSL/test_fasta.rb +30 -29
- data/test/BioDSL/test_fastq.rb +27 -26
- data/test/BioDSL/test_filesys.rb +28 -27
- data/test/BioDSL/test_fork.rb +29 -28
- data/test/BioDSL/test_math.rb +31 -30
- data/test/BioDSL/test_mummer.rb +1 -1
- data/test/BioDSL/test_pipeline.rb +1 -1
- data/test/BioDSL/test_seq.rb +42 -41
- data/test/BioDSL/test_serializer.rb +35 -33
- data/test/BioDSL/test_stream.rb +28 -27
- data/test/BioDSL/test_taxonomy.rb +38 -37
- data/test/BioDSL/test_test.rb +32 -31
- data/test/BioDSL/test_tmp_dir.rb +1 -1
- data/test/BioDSL/test_usearch.rb +28 -27
- data/test/BioDSL/test_verbose.rb +32 -31
- data/test/helper.rb +34 -31
- metadata +3 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA1:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 806bfca700a56365bd01a11fb981fb16363aad95
|
|
4
|
+
data.tar.gz: 91718f260a6e32fb38af4724cfef035a9224e072
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 875d37e145698145b42b250a0bed8ac81ad3bb9576b48cb6e14a68515906a6b773c154a9caf33f282a9f193aaf0877484f15fffd4a2600ac266322fef7e9f347
|
|
7
|
+
data.tar.gz: 21aeb489434d449fbfab7950015481672b3e2734f7cb4ae3384aa66927416655c53b847c13a03f4d42f4ceb907513cfbe80772d3175d4b1e9f25c96628d625df
|
data/.gitignore
CHANGED
data/BioDSL.gemspec
CHANGED
|
@@ -20,7 +20,7 @@
|
|
|
20
20
|
# #
|
|
21
21
|
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< #
|
|
22
22
|
# #
|
|
23
|
-
# This software is part of BioDSL (
|
|
23
|
+
# This software is part of BioDSL (http://maasha.github.io/BioDSL). #
|
|
24
24
|
# #
|
|
25
25
|
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< #
|
|
26
26
|
|
data/Gemfile
ADDED
data/README.md
CHANGED
|
@@ -1,169 +1,224 @@
|
|
|
1
|
-
BioDSL
|
|
2-116
|
-
(removed lines 2-116: content not preserved in this rendering)
|
|
1
|
+
BioDSL (pronounced Biodiesel) is a Domain Specific Language for creating
|
|
2
|
+
bioinformatic analysis workflows. A workflow may consist of several pipelines
|
|
3
|
+
and each pipeline consists of a series of steps such as reading in data from a
|
|
4
|
+
file, processing the data in some way, and writing data to a new file.
|
|
5
|
+
|
|
6
|
+
BioDSL is built on the same principles as [Biopieces](www.biopieces.org), where
|
|
7
|
+
data records are passed through multiple commands each with a specific task. The
|
|
8
|
+
idea is that a command will process the data record if this contains the
|
|
9
|
+
relevant attributes that the command can process. E.g. if a data record contains
|
|
10
|
+
a sequence, then the command [reverse_seq](reverse_seq) will reverse that
|
|
11
|
+
sequence.
|
|
12
|
+
|
|
13
|
+
# Installation
|
|
14
|
+
|
|
15
|
+
The recommended way of installing BioDSL is via Ruby’s gem package manager:
|
|
16
|
+
|
|
17
|
+
`$ gem install BioDSL`
|
|
18
|
+
|
|
19
|
+
For those commands which are wrappers around third-party tools, such as Usearch,
|
|
20
|
+
Mothur and SPAdes, you will have to install these and make the executables
|
|
21
|
+
available in your `$PATH`.
|
|
22
|
+
|
|
23
|
+
# Getting started
|
|
24
|
+
|
|
25
|
+
BioDSL is implemented in Ruby making use of Ruby’s powerful metaprogramming
|
|
26
|
+
facilities. Thus, a workflow is basically a Ruby script containing one or more
|
|
27
|
+
pipelines.
|
|
28
|
+
|
|
29
|
+
Here is a test script with a single pipeline that reads all FASTA entries from
|
|
30
|
+
the file `input.fna`, selects all records with a sequence ending in `ATC`, and
|
|
31
|
+
writes those records as FASTA entries to the file `output.fna`:
|
|
32
|
+
|
|
33
|
+
```
|
|
34
|
+
#!/usr/bin/env ruby
|
|
35
|
+
|
|
36
|
+
require 'BioDSL'
|
|
37
|
+
|
|
38
|
+
BD.new.
|
|
39
|
+
read_fasta(input: "input.fna").
|
|
40
|
+
grab(select: "ATC$", keys: :SEQ).
|
|
41
|
+
write_fasta(output: "output.fna").
|
|
42
|
+
run
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
Save the test script to a file `test.biodsl` and execute on the command line:
|
|
46
|
+
|
|
47
|
+
```
|
|
48
|
+
$ ruby test.biodsl
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
# Combining multiple pipelines
|
|
52
|
+
|
|
53
|
+
This script demonstrates how multiple pipelines can be created and combined. In
|
|
54
|
+
the end two pipelines are run, one consisting of p1 + p2 and one consisting of
|
|
55
|
+
p1 + p3. The first pipeline run will produce a histogram plot of sequence length
|
|
56
|
+
from sequences containing the pattern `ATCG`, and the other pipeline run will
|
|
57
|
+
produce a plot of the sequence length distribution of sequences not matching
|
|
58
|
+
`ATCG`.
|
|
59
|
+
|
|
60
|
+
```
|
|
61
|
+
#!/usr/bin/env ruby
|
|
62
|
+
|
|
63
|
+
require 'BioDSL'
|
|
64
|
+
|
|
65
|
+
p1 = BD.new.read_fasta(input: "test.fna")
|
|
66
|
+
p2 = BD.new.grab(keys: :SEQ, select: "ATCG").
|
|
67
|
+
plot_histogram(key: :SEQ_LEN, terminal: :png, output: "select.png")
|
|
68
|
+
p3 = BD.new.grab(keys: :SEQ, reject: "ATCG").
|
|
69
|
+
plot_histogram(key: :SEQ_LEN, terminal: :png, output: "reject.png")
|
|
70
|
+
p4 = p1 + p3
|
|
71
|
+
|
|
72
|
+
(p1 + p2).write_fasta(output: "select.fna").run
|
|
73
|
+
p4.write_fasta(output: "reject.fna").run
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
# Running pipelines in parallel
|
|
77
|
+
|
|
78
|
+
This script demonstrates how to run multiple pipelines in parallel using 20 CPU
|
|
79
|
+
cores. Here we filter paired-end FASTQ entries from a list of samples described in
|
|
80
|
+
the file `samples.txt` which contains three tab-separated columns: sample name,
|
|
81
|
+
a forward read file path, and a reverse read file path.
|
|
82
|
+
|
|
83
|
+
```
|
|
84
|
+
#!/usr/bin/env ruby
|
|
85
|
+
|
|
86
|
+
require 'BioDSL'
|
|
87
|
+
require 'csv'
|
|
88
|
+
|
|
89
|
+
samples = CSV.read("samples.txt", col_sep: "\t")
|
|
90
|
+
|
|
91
|
+
Parallel.each(samples, in_processes: 20) do |sample|
|
|
92
|
+
BD.new.
|
|
93
|
+
read_fastq(input: sample[1], input2: sample[2], encoding: :base_33).
|
|
94
|
+
grab(keys: :SEQ, select: "ATCG").
|
|
95
|
+
write_fastq(output: "#{sample[0]}_filtered.fastq.bz2", bzip2: true).
|
|
96
|
+
run
|
|
97
|
+
end
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
# Ruby one-liners
|
|
101
|
+
|
|
102
|
+
It is possible to execute BioDSL pipelines on the command line:
|
|
103
|
+
|
|
104
|
+
```
|
|
105
|
+
ruby -r BioDSL -e 'BD.new.read_fasta(input: "test.fna").plot_histogram(key: :SEQ_LEN).run'
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
And to save typing we may use the alias `bd` which is set like this on the
|
|
109
|
+
command line:
|
|
110
|
+
|
|
111
|
+
```
|
|
112
|
+
$ alias bd='ruby -r BioDSL'
|
|
113
|
+
```
|
|
114
|
+
|
|
115
|
+
It may be a good idea to save that alias in your `.bashrc` file.
|
|
116
|
+
|
|
117
|
+
Now it is possible to run a BioDSL pipeline on the command line like this:
|
|
118
|
+
|
|
119
|
+
```
|
|
120
|
+
$ bd -e 'BD.new.read_fasta(input: "test.fna").plot_histogram(key: :SEQ_LEN).run'
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
# Using the Interactive Ruby interpreter
|
|
124
|
+
|
|
125
|
+
Here we demonstrate the use of Ruby's `irb` shell:
|
|
126
|
+
|
|
127
|
+
```
|
|
128
|
+
$ irb -r BioDSL --noinspect
|
|
129
|
+
irb(main):001:0> p = BD.new
|
|
130
|
+
=> BD.new
|
|
131
|
+
irb(main):002:0> p.read_fasta(input: "input.fna")
|
|
132
|
+
=> BD.new.read_fasta(input: "input.fna")
|
|
133
|
+
irb(main):003:0> p.grab(select: "ATC$", keys: :SEQ)
|
|
134
|
+
=> BD.new.read_fasta(input: "input.fna").grab(select: "ATC$", keys: :SEQ)
|
|
135
|
+
irb(main):004:0> p.write_fasta(output: "output.fna")
|
|
136
|
+
=> BD.new.read_fasta(input: "input.fna").grab(select: "ATC$", keys: :SEQ).write_fasta(output: "output.fna")
|
|
137
|
+
irb(main):005:0> p.run
|
|
138
|
+
=> BD.new.read_fasta(input: "input.fna").grab(select: "ATC$", keys: :SEQ).write_fasta(output: "output.fna").run
|
|
139
|
+
irb(main):006:0>
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
Again, it may be a good idea to save an alias `alias biodsl="irb -r BioDSL --noinspect"` to your `.bashrc` file. Thus, we can use the new `biodsl` alias to chain commands directly:
|
|
143
|
+
|
|
144
|
+
```
|
|
145
|
+
$ biodsl
|
|
146
|
+
irb(main):001:0> BD.new.read_fasta(input: "input.fna").grab(select: "ATC$", keys: :SEQ).write_fasta(output: "output.fna").run(progress: true)
|
|
147
|
+
=> BD.new.read_fasta(input: "input.fna").grab(select: "ATC$", keys: :SEQ).write_fasta(output: "output.fna").run(progress: true)
|
|
148
|
+
irb(main):002:0>
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
# History file
|
|
152
|
+
|
|
153
|
+
A history file is kept in `~/.BioDSL_history`, and each time `run` is called a history entry is added to this file:
|
|
154
|
+
|
|
155
|
+
```
|
|
156
|
+
BD.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run
|
|
157
|
+
BD.new.read_fasta(input: "test_big.fna", first: 100).plot_histogram(key: :SEQ_LEN).run
|
|
158
|
+
BD.new.read_fasta(input: "test_big.fna", first: 10).plot_histogram(key: :SEQ_LEN).run
|
|
159
|
+
BD.new.read_fasta(input: "test_big.fna").plot_histogram(key: :SEQ_LEN).run
|
|
160
|
+
BD.new.read_fasta(input: "test_big.fna", first: 1000).plot_histogram(key: :SEQ_LEN).run
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
Thus it is possible to redo the last pipeline by pasting its line into irb or a Ruby one-liner.
|
|
164
|
+
|
|
165
|
+
# Log and History
|
|
117
166
|
|
|
118
167
|
All BioDSL events are logged to `~/.BioDSL_log`.
|
|
119
168
|
|
|
120
169
|
BioDSL history is saved to `~/.BioDSL_history`.
|
|
121
170
|
|
|
171
|
+
# Features
|
|
122
172
|
|
|
123
|
-
|
|
124
|
-
--------
|
|
125
|
-
|
|
126
|
-
Progress:
|
|
173
|
+
## Progress
|
|
127
174
|
|
|
128
175
|
Show a nifty progress table with commands, records read and emitted, and time.
|
|
129
176
|
|
|
130
177
|
`BD.new.read_fasta(input: "input.fna").dump.run(progress: true)`
|
|
131
178
|
|
|
132
|
-
Verbose
|
|
179
|
+
## Verbose
|
|
133
180
|
|
|
134
181
|
Output verbose messages from commands and the run status.
|
|
135
182
|
|
|
136
|
-
|
|
183
|
+
```
|
|
184
|
+
BD.new.read_fasta(input: "input.fna").dump.run(verbose: true)
|
|
185
|
+
```
|
|
137
186
|
|
|
138
|
-
Debug
|
|
187
|
+
## Debug
|
|
139
188
|
|
|
140
189
|
Output debug messages from commands using these.
|
|
141
190
|
|
|
142
|
-
|
|
191
|
+
```
|
|
192
|
+
BD.new.read_fasta(input: "input.fna").dump.run(debug: true)
|
|
193
|
+
```
|
|
143
194
|
|
|
144
|
-
E-mail notification
|
|
195
|
+
## E-mail notification
|
|
145
196
|
|
|
146
197
|
Send an email when run is complete.
|
|
147
198
|
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
199
|
+
```
|
|
200
|
+
BD.new.read_fasta(input: "input.fna").dump.run(email: "bill@hotmail.com", subject: "Script done!")
|
|
201
|
+
```
|
|
151
202
|
|
|
152
|
-
|
|
203
|
+
## Reports
|
|
153
204
|
|
|
154
|
-
|
|
205
|
+
Create an HTML report of the run stats for a pipeline:
|
|
155
206
|
|
|
156
|
-
|
|
207
|
+
```
|
|
208
|
+
BD.new.read_fasta(input: "input.fna").dump.run(report: "status.html")
|
|
209
|
+
```
|
|
157
210
|
|
|
158
|
-
|
|
211
|
+
## Output directory
|
|
159
212
|
|
|
160
|
-
|
|
213
|
+
All output files from commands are put in a specified directory:
|
|
161
214
|
|
|
215
|
+
```
|
|
216
|
+
BD.new.read_fasta(input: "input.fna").dump.run(output_dir: "Results")
|
|
217
|
+
```
|
|
162
218
|
|
|
163
|
-
Configuration File
|
|
164
|
-
------------------
|
|
219
|
+
## Configuration File
|
|
165
220
|
|
|
166
|
-
It is possible to pre-set options in a configuration file located in your
|
|
221
|
+
It is possible to pre-set options in a configuration file located in your `$HOME`
|
|
167
222
|
directory called `.BioDSLrc`. Thus if an option is not already set, its value
|
|
168
223
|
will fall back to the one set in the configuration file. The configuration file
|
|
169
224
|
contains three whitespace separated columns:
|
|
@@ -172,34 +227,113 @@ contains three whitespace separated columns:
|
|
|
172
227
|
* Option
|
|
173
228
|
* Option value
|
|
174
229
|
|
|
175
|
-
Lines starting with
|
|
230
|
+
Lines starting with `#` are considered comments and are ignored.
|
|
176
231
|
|
|
177
232
|
An example:
|
|
178
233
|
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
234
|
+
```
|
|
235
|
+
maasha@mel:~$ cat ~/.BioDSLrc
|
|
236
|
+
uchime_ref database /home/maasha/Install/QIIME1.8/data/rdp_gold.fa
|
|
237
|
+
uchime_ref cpus 20
|
|
238
|
+
```
|
|
182
239
|
|
|
183
240
|
On compute clusters it is necessary to specify the max processor count, which
|
|
184
241
|
is otherwise determined as the number of cores on the current node. To override
|
|
185
242
|
this add the following line:
|
|
186
243
|
|
|
187
|
-
|
|
244
|
+
```
|
|
245
|
+
pipeline processor_count 1000
|
|
246
|
+
```
|
|
188
247
|
|
|
189
248
|
It is also possible to change the temporary directory from the system's default
|
|
190
249
|
by adding the following line:
|
|
191
250
|
|
|
192-205
|
-
(removed lines 192-205: content not preserved in this rendering)
|
|
251
|
+
```
|
|
252
|
+
pipeline tmp_dir /home/projects/ku_microbio/scratch/tmp
|
|
253
|
+
```
|
|
254
|
+
|
|
255
|
+
# Available BioDSL commands
|
|
256
|
+
|
|
257
|
+
* [add_key] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AddKey)
|
|
258
|
+
* [align_seq_mothur] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AlignSeqMothur)
|
|
259
|
+
* [analyze_residue_distribution] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AnalyzeResidueDistribution)
|
|
260
|
+
* [assemble_pairs] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AssemblePairs)
|
|
261
|
+
* [assemble_seq_idba] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AssembleSeqIdba)
|
|
262
|
+
* [assemble_seq_ray] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AssembleSeqRay)
|
|
263
|
+
* [assemble_seq_spades] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/AssembleSeqSpades)
|
|
264
|
+
* [classify_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ClassifySeq)
|
|
265
|
+
* [classify_seq_mothur] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ClassifySeqMothur)
|
|
266
|
+
* [clip_primer] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ClipPrimer)
|
|
267
|
+
* [cluster_otus] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ClusterOtus)
|
|
268
|
+
* [collapse_otus] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/CollapseOtus)
|
|
269
|
+
* [collect_otus] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/CollectOtus)
|
|
270
|
+
* [complement_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ComplementSeq)
|
|
271
|
+
* [count] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Count)
|
|
272
|
+
* [degap_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/DegapSeq)
|
|
273
|
+
* [dereplicate_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/DereplicateSeq)
|
|
274
|
+
* [dump] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Dump)
|
|
275
|
+
* [filter_rrna] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/FilterRrna)
|
|
276
|
+
* [genecall] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Genecall)
|
|
277
|
+
* [grab] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Grab)
|
|
278
|
+
* [index_taxonomy] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/IndexTaxonomy)
|
|
279
|
+
* [mean_scores] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/MeanScores)
|
|
280
|
+
* [merge_pair_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/MergePairSeq)
|
|
281
|
+
* [merge_table] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/MergeTable)
|
|
282
|
+
* [merge_values] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/MergeValues)
|
|
283
|
+
* [plot_heatmap] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/PlotHeatmap)
|
|
284
|
+
* [plot_histogram] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/PlotHistogram)
|
|
285
|
+
* [plot_matches] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/PlotMatches)
|
|
286
|
+
* [plot_residue_distribution] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/PlotResidueDistribution)
|
|
287
|
+
* [plot_scores] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/PlotScores)
|
|
288
|
+
* [random] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Random)
|
|
289
|
+
* [read_fasta] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ReadFasta)
|
|
290
|
+
* [read_fastq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ReadFastq)
|
|
291
|
+
* [read_table] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ReadTable)
|
|
292
|
+
* [reverse_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/ReverseSeq)
|
|
293
|
+
* [slice_align] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/SliceAlign)
|
|
294
|
+
* [slice_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/SliceSeq)
|
|
295
|
+
* [sort] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/Sort)
|
|
296
|
+
* [split_pair_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/SplitPairSeq)
|
|
297
|
+
* [split_values] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/SplitValues)
|
|
298
|
+
* [trim_primer] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/TrimPrimer)
|
|
299
|
+
* [trim_seq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/TrimSeq)
|
|
300
|
+
* [uchime_ref] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/UchimeRef)
|
|
301
|
+
* [unique_values] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/UniqueValues)
|
|
302
|
+
* [usearch_global] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/UsearchGlobal)
|
|
303
|
+
* [write_fasta] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/WriteFasta)
|
|
304
|
+
* [write_fastq] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/WriteFastq)
|
|
305
|
+
* [write_table] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/WriteTable)
|
|
306
|
+
* [write_tree] (http://www.rubydoc.info/gems/BioDSL/1.0.2/BioDSL/WriteTree)
|
|
307
|
+
|
|
308
|
+
# Running the test suite
|
|
309
|
+
|
|
310
|
+
BioDSL has an extensive set of unit tests that can be run after installing
|
|
311
|
+
development dependencies. First you need to install the bundler gem:
|
|
312
|
+
|
|
313
|
+
```
|
|
314
|
+
$ gem install bundler
|
|
315
|
+
```
|
|
316
|
+
|
|
317
|
+
Next you need to change to the source directory of BioDSL and run bundler to
|
|
318
|
+
download the required gems:
|
|
319
|
+
|
|
320
|
+
```
|
|
321
|
+
$ bundle install
|
|
322
|
+
```
|
|
323
|
+
|
|
324
|
+
And then you run the test suite by running `rake`:
|
|
325
|
+
|
|
326
|
+
```
|
|
327
|
+
$ rake
|
|
328
|
+
```
|
|
329
|
+
|
|
330
|
+
And the unit tests should all run, except those omitted because a third-party
|
|
331
|
+
executable was missing.
|
|
332
|
+
|
|
333
|
+
# Contributing
|
|
334
|
+
|
|
335
|
+
1. Fork it
|
|
336
|
+
1. Create your feature branch (git checkout -b my-new-feature)
|
|
337
|
+
1. Commit your changes (git commit -am 'Add some feature')
|
|
338
|
+
1. Push to the branch (git push origin my-new-feature)
|
|
339
|
+
1. Create new Pull Request
|
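As a rough illustration of the `.BioDSLrc` layout described in the README changes above (three whitespace-separated columns, with lines starting with `#` treated as comments), the following Ruby sketch parses such text into a nested hash and shows the fallback behaviour for options that are not set explicitly. The helper name `parse_biodsl_rc` and the merge step are hypothetical and are not BioDSL's own implementation; reading the first column as the command name is an assumption based on the README's example lines.

```ruby
# Hypothetical helper, not part of BioDSL: parse text in the .BioDSLrc layout
# into a nested hash of the form { command => { option => value } }.
def parse_biodsl_rc(text)
  config = Hash.new { |hash, key| hash[key] = {} }

  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#') # skip blank and comment lines

    # Split into at most three fields so that the value may contain spaces.
    command, option, value = line.split(/\s+/, 3)
    next if value.nil? # ignore malformed lines with fewer than three columns

    config[command.to_sym][option.to_sym] = value
  end

  config
end

# Example text taken from the README's sample lines.
example = <<~RC
  # Lines starting with '#' are ignored
  uchime_ref database /home/maasha/Install/QIIME1.8/data/rdp_gold.fa
  uchime_ref cpus 20
  pipeline processor_count 1000
RC

config = parse_biodsl_rc(example)

# Assumed fallback rule: explicitly given options win, and anything not given
# falls back to the value from the configuration file.
explicit = { cpus: '4' }
options  = config[:uchime_ref].merge(explicit)

puts options[:cpus]     # explicitly set value
puts options[:database] # value taken from the configuration text
```

Values are kept as strings in this sketch; a real consumer would presumably convert numeric options such as `cpus` or `processor_count` where needed.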