lederhosen 1.8.2 → 2.0.0
- data/Gemfile +1 -1
- data/lederhosen.gemspec +7 -3
- data/lib/lederhosen/no_tasks.rb +18 -18
- data/lib/lederhosen/tasks/count_taxonomies.rb +83 -0
- data/lib/lederhosen/tasks/get_reps.rb +3 -4
- data/lib/lederhosen/tasks/make_udb.rb +2 -2
- data/lib/lederhosen/tasks/otu_filter.rb +8 -1
- data/lib/lederhosen/tasks/otu_table.rb +33 -70
- data/lib/lederhosen/tasks/separate_unclassified.rb +65 -0
- data/lib/lederhosen/uc_parser.rb +88 -0
- data/lib/lederhosen/version.rb +4 -4
- data/readme.md +107 -11
- data/spec/cli_spec.rb +62 -10
- data/spec/data/test.uc +9 -684
- data/spec/data/trimmed/ILT_L_9_B_001.fasta +100 -1596
- data/spec/no_tasks_spec.rb +1 -1
- data/spec/uc_parser_spec.rb +0 -0
- metadata +7 -3
data/readme.md
CHANGED
@@ -16,13 +16,15 @@ Lederhosen is designed with the following "pipeline" in mind:
|
|
16
16
|
1. Clustering sequences to centroid or reference sequences (read: database)
|
17
17
|
2. Generating tables from USEARCH output.
|
18
18
|
3. Filtering tables to remove small or insignificant OTUs.
|
19
|
+
4. Support for paired end reads (considers taxonomic assignment for both reads in a pair).
|
19
20
|
|
20
21
|
### About
|
21
22
|
|
22
23
|
- Lederhosen is a project born out of the Triplett Lab at the University of Florida.
|
23
|
-
- Lederhosen is designed to be a fast and **simple** tool to aid in clustering 16S rRNA amplicons sequenced
|
24
|
+
- Lederhosen is designed to be a fast and **simple** (~700 SLOC) tool to aid in clustering 16S rRNA amplicons sequenced
|
24
25
|
using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq), Ion Torrent, or Roche-454.
|
25
|
-
- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
|
26
|
+
- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
|
27
|
+
[MIT open source license](http://opensource.org/licenses/mit-license.php/).
|
26
28
|
- Except for USEARCH which requires a license, Lederhosen is available for commercial use.
|
27
29
|
|
28
30
|
### Features
|
@@ -76,27 +78,102 @@ lederhosen cluster \
|
|
76
78
|
--database=taxcollector.udb
|
77
79
|
```
|
78
80
|
|
79
|
-
The optional `--dry-run` parameter outputs the usearch command to standard out.
|
81
|
+
The optional `--dry-run` parameter outputs the usearch command to standard out.
|
82
|
+
This is useful if you want to run usearch on a cluster.
|
80
83
|
|
81
|
-
|
84
|
+
```bash
|
85
|
+
for reads_file in reads/*.fasta;
|
86
|
+
do
|
87
|
+
echo lederhosen cluster \
|
88
|
+
--input=$reads_file \
|
89
|
+
--identity=0.95 \
|
90
|
+
--output=$(basename $reads_file .fasta).95.uc \
|
91
|
+
--database=taxcollector.udb \
|
92
|
+
--threads 1 \
|
93
|
+
--dry-run
|
94
|
+
done > jobs.sh
|
95
|
+
|
96
|
+
# send jobs to queue system
|
97
|
+
cat jobs.sh | parallel -j 24 # run 24 parallel jobs
|
98
|
+
```
|
99
|
+
|
100
|
+
### Generate taxonomy counts tables
|
101
|
+
|
102
|
+
Before generating OTU tables, you must generate taxonomy counts tables.
|
103
|
+
|
104
|
+
A taxonomy count table looks something like this
|
105
|
+
|
106
|
+
# taxonomy, number_of_reads
|
107
|
+
[0]Bacteria[1];...;[8]Akkermansia_municipalia, 28
|
108
|
+
...
|
109
|
+
|
110
|
+
From there, you can generate OTU abundance matrices at the different levels of classification (domain, phylum, ..., genus, species).
|
111
|
+
|
112
|
+
```bash
|
113
|
+
|
114
|
+
lederhosen count_taxonomies \
|
115
|
+
--input=clusters.uc \
|
116
|
+
--output=clusters_taxonomies.txt
|
117
|
+
```
|
118
|
+
|
119
|
+
If you did paired-end sequencing, you can generate strict taxonomy tables that only count reads when *both pairs* have the *same*
|
120
|
+
taxonomic description at a certain taxonomic level. This is useful for leveraging the increased length of having pairs and also
|
121
|
+
acts as a sort of chimera filter. You will, however, end up using fewer of your reads as the level goes from domain to species.
|
122
|
+
|
123
|
+
```bash
|
124
|
+
lederhosen count_taxonomies \
|
125
|
+
--input=clusters.uc \
|
126
|
+
--strict=genus \
|
127
|
+
--output=clusters_taxonomies.strict.genus.txt
|
128
|
+
```
|
129
|
+
|
130
|
+
Reads that do not have the same phylogeny at `level` will become `unclassified_reads`
|
131
|
+
|
132
|
+
### Generate OTU tables
|
82
133
|
|
83
134
|
Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
|
84
135
|
|
85
136
|
```bash
|
86
137
|
lederhosen otu_table \
|
87
|
-
--files=
|
88
|
-
--
|
89
|
-
--
|
138
|
+
--files=clusters_taxonomies.strict.genus.*.txt \
|
139
|
+
--output=my_poop_samples_genus_strict.95.txt \
|
140
|
+
--level=genus
|
141
|
+
```
|
142
|
+
|
143
|
+
This will create the file `my_poop_samples_genus_strict.95.txt` containing the clusters
|
144
|
+
as columns and the samples as rows.
|
145
|
+
|
146
|
+
You now will apply advanced data mining and statistical techniques to this table to make
|
147
|
+
interesting biological inferences and cure diseases.
|
148
|
+
|
149
|
+
### Filter OTU tables
|
150
|
+
|
151
|
+
Sometimes, clustering high-throughput reads at stringent identities can create many small clusters.
|
152
|
+
In fact, these clusters represent the vast majority (>99%) of the created clusters but the minority (<1%)
|
153
|
+
of the reads. In other words, 1% of the reads have 99% of the clusters.
|
154
|
+
|
155
|
+
If you want to filter out these small clusters which are composed of inseparable sequencing error or
|
156
|
+
actual biodiversity, you can do so with the `otu_filter` task.
|
157
|
+
|
158
|
+
```bash
|
159
|
+
lederhosen otu_filter \
|
160
|
+
--input=table.csv \
|
161
|
+
--output=filtered.csv \
|
162
|
+
--reads=50 \
|
163
|
+
--samples=50
|
90
164
|
```
|
91
165
|
|
92
|
-
This will
|
166
|
+
This will remove any clusters that do not appear in at least 50 samples with at least 50 reads. The read counts
|
167
|
+
for filtered clusters will be moved to the `noise` pseudocluster.
|
93
168
|
|
94
|
-
otu_table.domain.csv, ..., otu_table.species.csv
|
95
169
|
|
96
170
|
### Get representative sequences
|
97
171
|
|
98
|
-
|
99
|
-
|
172
|
+
(not yet implemented)
|
173
|
+
|
174
|
+
You can get the representative sequences for each cluster using the `get_reps` task.
|
175
|
+
This will extract the representative sequence from the __database__ you ran usearch with.
|
176
|
+
Make sure you use the same database that you used when running usearch.
|
100
177
|
|
101
178
|
```bash
|
102
179
|
lederhosen get_reps \
|
@@ -114,6 +191,25 @@ lederhosen get_reps \
|
|
114
191
|
--output=representatives.fasta
|
115
192
|
```
|
116
193
|
|
194
|
+
### Get unclassified sequences
|
195
|
+
|
196
|
+
```bash
|
197
|
+
lederhosen separate_unclassified \
|
198
|
+
--uc-file=my_results.uc \
|
199
|
+
--reads=reads_that_were_used_to_generate_results.fasta \
|
200
|
+
--output=unclassified_reads.fasta
|
201
|
+
```
|
202
|
+
|
203
|
+
`separate_unclassified` has support for strict pairing
|
204
|
+
|
205
|
+
```
|
206
|
+
lederhosen separate_unclassified \
|
207
|
+
--uc-file=my_results.uc \
|
208
|
+
--reads=reads_that_were_used_to_generate_results.fasta \
|
209
|
+
--strict=phylum \
|
210
|
+
--output=unclassified_reads.fasta
|
211
|
+
```
|
212
|
+
|
117
213
|
## Acknowledgements
|
118
214
|
|
119
215
|
- Lexi, Vinnie and Kevin for beta-testing and putting up with bugs
|
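The strict pairing rule described in the readme changes above (count a pair only when both reads agree at the chosen level, otherwise send it to `unclassified_reads`) can be sketched in Ruby. This is a minimal illustration, not Lederhosen's implementation: `strict_counts`, the hash-per-read input shape, and the choice to compare the whole lineage down to the requested level are all assumptions. The level names are taken from the spec.

```ruby
# Hypothetical sketch of strict pair counting; not Lederhosen's own code.
LEVELS = %w[domain phylum class order family genus species].freeze

# pairs: Array of [left, right], where each read is a Hash of level => name.
# Returns a Hash of lineage string => number of reads.
def strict_counts(pairs, level)
  depth = LEVELS.index(level)
  raise ArgumentError, "unknown level: #{level}" if depth.nil?

  counts = Hash.new(0)
  pairs.each do |left, right|
    # Assumption: a pair agrees only if every rank down to `level` matches.
    left_lineage  = LEVELS[0..depth].map { |l| left[l] }
    right_lineage = LEVELS[0..depth].map { |l| right[l] }

    if left_lineage == right_lineage && !left_lineage.include?(nil)
      counts[left_lineage.join(';')] += 2 # both reads of the pair are counted
    else
      counts['unclassified_reads'] += 2   # disagreeing pairs are kept, not dropped
    end
  end
  counts
end
```

Because reads are always added two at a time, the total is even at every level, which is the property the strict `count_taxonomies` specs assert with `be_even`.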
data/spec/cli_spec.rb
CHANGED
@@ -25,24 +25,77 @@ describe Lederhosen::CLI do
|
|
25
25
|
end
|
26
26
|
|
27
27
|
it 'can cluster reads using usearch' do
|
28
|
-
`./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.
|
28
|
+
`./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.99 --output #{$test_dir}/clusters.uc`
|
29
29
|
$?.success?.should be_true
|
30
30
|
File.exists?(File.join($test_dir, 'clusters.uc')).should be_true
|
31
31
|
end
|
32
32
|
|
33
|
-
it '
|
34
|
-
|
35
|
-
`./bin/lederhosen otu_table --files=spec/data/test.uc --prefix=#{$test_dir}/otu_table --levels=#{levels}`
|
33
|
+
it 'can separate unclassified reads from usearch output' do
|
34
|
+
`./bin/lederhosen separate_unclassified --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.fasta`
|
36
35
|
$?.success?.should be_true
|
36
|
+
unclassified_results = File.readlines("spec/data/test.uc")
|
37
|
+
.select { |x| x =~ /^N/ }
|
38
|
+
.size
|
39
|
+
unclassified_reads = File.readlines("#{$test_dir}/unclassified.fasta")
|
40
|
+
.select { |x| x =~ /^>/ }
|
41
|
+
.size
|
42
|
+
|
43
|
+
unclassified_results.should == unclassified_reads
|
44
|
+
end
|
45
|
+
|
46
|
+
it 'can separate unclassified reads from usearch output using strict pairing' do
|
47
|
+
`./bin/lederhosen separate_unclassified --strict=genus --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.strict_genus.fasta`
|
48
|
+
$?.success?.should be_true
|
49
|
+
File.readlines("#{$test_dir}/unclassified.strict_genus.fasta")
|
50
|
+
.select { |x| x =~ /^>/ }
|
51
|
+
.size.should be_even
|
37
52
|
end
|
38
53
|
|
39
|
-
it '
|
40
|
-
`./bin/lederhosen
|
54
|
+
it 'can create taxonomy count tables' do
|
55
|
+
`./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.txt`
|
41
56
|
$?.success?.should be_true
|
57
|
+
File.exists?(File.join($test_dir, 'taxonomy_count.txt')).should be_true
|
58
|
+
end
|
59
|
+
|
60
|
+
it 'generates taxonomy tables w/ comma-free taxonomic descriptions' do
|
61
|
+
File.readlines(File.join($test_dir, 'taxonomy_count.txt'))
|
62
|
+
.map(&:strip)
|
63
|
+
.map { |x| x.count(',') }
|
64
|
+
.uniq
|
65
|
+
.should == [1]
|
66
|
+
end
|
67
|
+
|
68
|
+
%w{domain phylum class order family genus species}.each do |level|
|
69
|
+
it "generates taxonomy tables only counting pairs that agree at level: #{level}" do
|
70
|
+
`./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.strict.#{level}.txt --strict=#{level}`
|
71
|
+
$?.success?.should be_true
|
72
|
+
|
73
|
+
lines = File.readlines(File.join($test_dir, "taxonomy_count.strict.#{level}.txt"))
|
74
|
+
|
75
|
+
# make sure total number of reads is even
|
76
|
+
# requires that there should be an odd number if classification is not strict
|
77
|
+
lines.select { |x| !(x =~ /^#/) }
|
78
|
+
.map(&:strip)
|
79
|
+
.map { |x| x.split(',') }
|
80
|
+
.map(&:last)
|
81
|
+
.map(&:to_i)
|
82
|
+
.inject(:+).should be_even
|
83
|
+
end
|
84
|
+
end
|
85
|
+
|
86
|
+
%w{domain phylum class order family genus species}.each do |level|
|
87
|
+
it "should create OTU abundance matrices from taxonomy count tables at level: #{level}" do
|
88
|
+
`./bin/lederhosen otu_table --files=#{$test_dir}/taxonomy_count.strict.*.txt --level=#{level} --output=#{$test_dir}/otus_genus.strict.csv`
|
89
|
+
$?.success?.should be_true
|
90
|
+
end
|
42
91
|
end
|
43
92
|
|
44
|
-
it 'should
|
45
|
-
|
93
|
+
it 'should filter OTU abundance matrices' do
|
94
|
+
# TODO
|
95
|
+
# filtering should move filtered reads to 'unclassified_reads' so that we maintain
|
96
|
+
# our knowledge of depth of coverage throughout
|
97
|
+
# this makes normalization better later.
|
98
|
+
`./bin/lederhosen otu_filter --input=#{$test_dir}/otus_genus.strict.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
|
46
99
|
$?.success?.should be_true
|
47
100
|
end
|
48
101
|
|
@@ -53,7 +106,6 @@ describe Lederhosen::CLI do
|
|
53
106
|
|
54
107
|
it 'should print representative sequences from uc files' do
|
55
108
|
`./bin/lederhosen get_reps --input=#{$test_dir}/clusters.uc --database=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/representatives.fasta`
|
109
|
+
$?.success?.should be_true
|
56
110
|
end
|
57
|
-
|
58
|
-
it 'should create a fasta file containing representative reads for each cluster'
|
59
111
|
end
|
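The TODO in the `otu_filter` spec asks that filtered reads be moved rather than discarded, so that depth of coverage is preserved. A minimal Ruby sketch of that rule follows, assuming an in-memory table of the form `{ sample => { cluster => reads } }`. The method name `filter_otus`, its keyword arguments, and the input shape are hypothetical; the `noise` bucket name comes from the readme text.

```ruby
# Hypothetical sketch of the filtering rule; not Lederhosen's own code.
# table: { sample_name => { cluster_name => read_count } }
def filter_otus(table, min_reads:, min_samples:)
  clusters = table.values.flat_map(&:keys).uniq

  # Keep a cluster only if enough samples contain it with enough reads.
  keep = clusters.select do |cluster|
    table.values.count { |row| row.fetch(cluster, 0) >= min_reads } >= min_samples
  end

  # Move the reads of every rejected cluster into 'noise', per sample,
  # so each sample's total read count is unchanged.
  table.transform_values do |row|
    filtered = Hash.new(0)
    row.each do |cluster, reads|
      filtered[keep.include?(cluster) ? cluster : 'noise'] += reads
    end
    filtered
  end
end
```

Keeping per-sample totals intact is what makes later normalization straightforward, as the spec comment notes.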