lederhosen 1.8.2 → 2.0.0
- data/Gemfile +1 -1
- data/lederhosen.gemspec +7 -3
- data/lib/lederhosen/no_tasks.rb +18 -18
- data/lib/lederhosen/tasks/count_taxonomies.rb +83 -0
- data/lib/lederhosen/tasks/get_reps.rb +3 -4
- data/lib/lederhosen/tasks/make_udb.rb +2 -2
- data/lib/lederhosen/tasks/otu_filter.rb +8 -1
- data/lib/lederhosen/tasks/otu_table.rb +33 -70
- data/lib/lederhosen/tasks/separate_unclassified.rb +65 -0
- data/lib/lederhosen/uc_parser.rb +88 -0
- data/lib/lederhosen/version.rb +4 -4
- data/readme.md +107 -11
- data/spec/cli_spec.rb +62 -10
- data/spec/data/test.uc +9 -684
- data/spec/data/trimmed/ILT_L_9_B_001.fasta +100 -1596
- data/spec/no_tasks_spec.rb +1 -1
- data/spec/uc_parser_spec.rb +0 -0
- metadata +7 -3
data/readme.md
CHANGED
@@ -16,13 +16,15 @@ Lederhosen is designed with the following "pipeline" in mind:
|
|
16
16
|
1. Clustering sequences to centroid or reference sequences (read: database)
|
17
17
|
2. Generating tables from USEARCH output.
|
18
18
|
3. Filtering tables to remove small or insignificant OTUs.
|
19
|
+
4. Support for paired end reads (considers taxonomic assignment for both reads in a pair).
|
19
20
|
|
20
21
|
### About
|
21
22
|
|
22
23
|
- Lederhosen is a project born out of the Triplett Lab at the University of Florida.
|
23
|
-
- Lederhosen is designed to be a fast and **simple** tool to aid in clustering 16S rRNA amplicons sequenced
|
24
|
+
- Lederhosen is designed to be a fast and **simple** (~700 SLOC) tool to aid in clustering 16S rRNA amplicons sequenced
|
24
25
|
using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq), Ion Torrent, or Roche-454.
|
25
|
-
- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
|
26
|
+
- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
|
27
|
+
[MIT open source license](http://opensource.org/licenses/mit-license.php/).
|
26
28
|
- Except for USEARCH which requires a license, Lederhosen is available for commercial use.
|
27
29
|
|
28
30
|
### Features
|
@@ -76,27 +78,102 @@ lederhosen cluster \
|
|
76
78
|
--database=taxcollector.udb
|
77
79
|
```
|
78
80
|
|
79
|
-
The optional `--dry-run` parameter outputs the usearch command to standard out.
|
81
|
+
The optional `--dry-run` parameter outputs the usearch command to standard out.
|
82
|
+
This is useful if you want to run usearch on a cluster.
|
80
83
|
|
81
|
-
|
84
|
+
```bash
|
85
|
+
for reads_file in reads/*.fasta;
|
86
|
+
do
|
87
|
+
echo lederhosen cluster \
|
88
|
+
--input=$reads_file \
|
89
|
+
--identity=0.95 \
|
90
|
+
--output=$(basename $reads_file .fasta).95.uc \
|
91
|
+
--database=taxcollector.udb \
|
92
|
+
--threads 1 \
|
93
|
+
--dry-run
|
94
|
+
done > jobs.sh
|
95
|
+
|
96
|
+
# send jobs to queue system
|
97
|
+
cat jobs.sh | parallel -j 24 # run 24 parallel jobs
|
98
|
+
```
|
99
|
+
|
100
|
+
### Generate taxonomy counts tables
|
101
|
+
|
102
|
+
Before generating OTU tables, you must generate taxonomy counts tables.
|
103
|
+
|
104
|
+
A taxonomy count table looks something like this:
|
105
|
+
|
106
|
+
# taxonomy, number_of_reads
|
107
|
+
[0]Bacteria[1];...;[8]Akkermansia_muciniphila, 28
|
108
|
+
...
|
109
|
+
|
110
|
+
From there, you can generate OTU abundance matrices at the different levels of classification (domain, phylum, ..., genus, species).
|
111
|
+
|
112
|
+
```bash
|
113
|
+
|
114
|
+
lederhosen count_taxonomies \
|
115
|
+
--input=clusters.uc \
|
116
|
+
--output=clusters_taxonomies.txt
|
117
|
+
```
|
118
|
+
|
119
|
+
If you did paired-end sequencing, you can generate strict taxonomy tables that only count reads when *both reads in a pair* have the *same*
|
120
|
+
taxonomic description at a certain taxonomic level. This is useful for leveraging the increased length of having pairs and also
|
121
|
+
acts as a sort of chimera filter. You will, however, end up using fewer of your reads as the level goes from domain to species.
|
122
|
+
|
123
|
+
```bash
|
124
|
+
lederhosen count_taxonomies \
|
125
|
+
--input=clusters.uc \
|
126
|
+
--strict=genus \
|
127
|
+
--output=clusters_taxonomies.strict.genus.txt
|
128
|
+
```
|
129
|
+
|
130
|
+
Reads that do not have the same phylogeny at `level` will become `unclassified_reads`.
|
131
|
+
|
132
|
+
### Generate OTU tables
|
82
133
|
|
83
134
|
Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
|
84
135
|
|
85
136
|
```bash
|
86
137
|
lederhosen otu_table \
|
87
|
-
--files=
|
88
|
-
--
|
89
|
-
--
|
138
|
+
--files=clusters_taxonomies.strict.genus.*.txt \
|
139
|
+
--output=my_poop_samples_genus_strict.95.txt \
|
140
|
+
--level=genus
|
141
|
+
```
|
142
|
+
|
143
|
+
This will create the file `my_poop_samples_genus_strict.95.txt` containing the clusters
|
144
|
+
as columns and the samples as rows.
|
145
|
+
|
146
|
+
You can now apply advanced data mining and statistical techniques to this table to make
|
147
|
+
interesting biological inferences and cure diseases.
|
148
|
+
|
149
|
+
### Filter OTU tables
|
150
|
+
|
151
|
+
Sometimes, clustering high-throughput reads at stringent identities can create many small clusters.
|
152
|
+
In fact, these clusters represent the vast majority (>99%) of the created clusters but the minority (<1%)
|
153
|
+
of the reads. In other words, 1% of the reads have 99% of the clusters.
|
154
|
+
|
155
|
+
If you want to filter out these small clusters which are composed of inseparable sequencing error or
|
156
|
+
actual biodiversity, you can do so with the `otu_filter` task.
|
157
|
+
|
158
|
+
```bash
|
159
|
+
lederhosen otu_filter \
|
160
|
+
--input=table.csv \
|
161
|
+
--output=filtered.csv \
|
162
|
+
--reads=50 \
|
163
|
+
--samples=10
|
90
164
|
```
|
91
165
|
|
92
|
-
This will
|
166
|
+
This will remove any clusters that do not appear in at least 10 samples with at least 50 reads. The read counts
|
167
|
+
for filtered clusters will be moved to the `noise` pseudocluster.
|
93
168
|
|
94
|
-
otu_table.domain.csv, ..., otu_table.species.csv
|
95
169
|
|
96
170
|
### Get representative sequences
|
97
171
|
|
98
|
-
|
99
|
-
|
172
|
+
(not yet implemented)
|
173
|
+
|
174
|
+
You can get the representative sequences for each cluster using the `get_reps` task.
|
175
|
+
This will extract the representative sequence from the __database__ you ran usearch with.
|
176
|
+
Make sure you use the same database that you used when running usearch.
|
100
177
|
|
101
178
|
```bash
|
102
179
|
lederhosen get_reps \
|
@@ -114,6 +191,25 @@ lederhosen get_reps \
|
|
114
191
|
--output=representatives.fasta
|
115
192
|
```
|
116
193
|
|
194
|
+
### Get unclassified sequences
|
195
|
+
|
196
|
+
```bash
|
197
|
+
lederhosen separate_unclassified \
|
198
|
+
--uc-file=my_results.uc \
|
199
|
+
--reads=reads_that_were_used_to_generate_results.fasta \
|
200
|
+
--output=unclassified_reads.fasta
|
201
|
+
```
|
202
|
+
|
203
|
+
`separate_unclassified` has support for strict pairing:
|
204
|
+
|
205
|
+
```bash
|
206
|
+
lederhosen separate_unclassified \
|
207
|
+
--uc-file=my_results.uc \
|
208
|
+
--reads=reads_that_were_used_to_generate_results.fasta \
|
209
|
+
--strict=phylum \
|
210
|
+
--output=unclassified_reads.fasta
|
211
|
+
```
|
212
|
+
|
117
213
|
## Acknowledgements
|
118
214
|
|
119
215
|
- Lexi, Vinnie and Kevin for beta-testing and putting up with bugs
|
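The readme changes above describe the taxonomy count table as `# taxonomy, number_of_reads` lines, and the spec checks that each line contains exactly one comma. A minimal sketch of reading such a table, assuming that format, might look like the following. The helper name `parse_taxonomy_counts` and the example taxon names are hypothetical and are not part of Lederhosen.

```ruby
# Hypothetical helper (not Lederhosen's own code).
# Assumed format, taken from the readme: comment lines start with "#",
# data lines look like "<taxonomy>, <number_of_reads>".
def parse_taxonomy_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    # Split on the last comma so the read count is always the final field.
    taxonomy, _, reads = line.rpartition(',')
    counts[taxonomy.strip] += Integer(reads.strip)
  end
  counts
end

# Made-up example rows for illustration only.
example = [
  '# taxonomy, number_of_reads',
  'example_taxon_a, 28',
  'example_taxon_b, 4'
]
counts = parse_taxonomy_counts(example)
total  = counts.values.inject(0, :+)
```

Summing the counts this way mirrors what the spec does when it checks that the strict tables contain an even number of reads.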
data/spec/cli_spec.rb
CHANGED
@@ -25,24 +25,77 @@ describe Lederhosen::CLI do
|
|
25
25
|
end
|
26
26
|
|
27
27
|
it 'can cluster reads using usearch' do
|
28
|
-
`./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.
|
28
|
+
`./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.99 --output #{$test_dir}/clusters.uc`
|
29
29
|
$?.success?.should be_true
|
30
30
|
File.exists?(File.join($test_dir, 'clusters.uc')).should be_true
|
31
31
|
end
|
32
32
|
|
33
|
-
it '
|
34
|
-
|
35
|
-
`./bin/lederhosen otu_table --files=spec/data/test.uc --prefix=#{$test_dir}/otu_table --levels=#{levels}`
|
33
|
+
it 'can separate unclassified reads from usearch output' do
|
34
|
+
`./bin/lederhosen separate_unclassified --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.fasta`
|
36
35
|
$?.success?.should be_true
|
36
|
+
unclassified_results = File.readlines("spec/data/test.uc")
|
37
|
+
.select { |x| x =~ /^N/ }
|
38
|
+
.size
|
39
|
+
unclassified_reads = File.readlines("#{$test_dir}/unclassified.fasta")
|
40
|
+
.select { |x| x =~ /^>/ }
|
41
|
+
.size
|
42
|
+
|
43
|
+
unclassified_results.should == unclassified_reads
|
44
|
+
end
|
45
|
+
|
46
|
+
it 'can separate unclassified reads from usearch output using strict pairing' do
|
47
|
+
`./bin/lederhosen separate_unclassified --strict=genus --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.strict_genus.fasta`
|
48
|
+
$?.success?.should be_true
|
49
|
+
File.readlines("#{$test_dir}/unclassified.strict_genus.fasta")
|
50
|
+
.select { |x| x =~ /^>/ }
|
51
|
+
.size.should be_even
|
37
52
|
end
|
38
53
|
|
39
|
-
it '
|
40
|
-
`./bin/lederhosen
|
54
|
+
it 'can create taxonomy count tables' do
|
55
|
+
`./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.txt`
|
41
56
|
$?.success?.should be_true
|
57
|
+
File.exists?(File.join($test_dir, 'taxonomy_count.txt')).should be_true
|
58
|
+
end
|
59
|
+
|
60
|
+
it 'generates taxonomy tables w/ comma-free taxonomic descriptions' do
|
61
|
+
File.readlines(File.join($test_dir, 'taxonomy_count.txt'))
|
62
|
+
.map(&:strip)
|
63
|
+
.map { |x| x.count(',') }
|
64
|
+
.uniq
|
65
|
+
.should == [1]
|
66
|
+
end
|
67
|
+
|
68
|
+
%w{domain phylum class order family genus species}.each do |level|
|
69
|
+
it "generates taxonomy tables only counting pairs that agree at level: #{level}" do
|
70
|
+
`./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.strict.#{level}.txt --strict=#{level}`
|
71
|
+
$?.success?.should be_true
|
72
|
+
|
73
|
+
lines = File.readlines(File.join($test_dir, "taxonomy_count.strict.#{level}.txt"))
|
74
|
+
|
75
|
+
# make sure total number of reads is even
|
76
|
+
# requires that there should be an odd number if classification is not strict
|
77
|
+
lines.select { |x| !(x =~ /^#/) }
|
78
|
+
.map(&:strip)
|
79
|
+
.map { |x| x.split(',') }
|
80
|
+
.map(&:last)
|
81
|
+
.map(&:to_i)
|
82
|
+
.inject(:+).should be_even
|
83
|
+
end
|
84
|
+
end
|
85
|
+
|
86
|
+
%w{domain phylum class order family genus species}.each do |level|
|
87
|
+
it "should create OTU abundance matrices from taxonomy count tables at level: #{level}" do
|
88
|
+
`./bin/lederhosen otu_table --files=#{$test_dir}/taxonomy_count.strict.*.txt --level=#{level} --output=#{$test_dir}/otus_genus.strict.csv`
|
89
|
+
$?.success?.should be_true
|
90
|
+
end
|
42
91
|
end
|
43
92
|
|
44
|
-
it 'should
|
45
|
-
|
93
|
+
it 'should filter OTU abundance matrices' do
|
94
|
+
# TODO
|
95
|
+
# filtering should move filtered reads to 'unclassified_reads' so that we maintain
|
96
|
+
# our knowledge of depth of coverage throughout
|
97
|
+
# this makes normalization better later.
|
98
|
+
`./bin/lederhosen otu_filter --input=#{$test_dir}/otus_genus.strict.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
|
46
99
|
$?.success?.should be_true
|
47
100
|
end
|
48
101
|
|
@@ -53,7 +106,6 @@ describe Lederhosen::CLI do
|
|
53
106
|
|
54
107
|
it 'should print representative sequences from uc files' do
|
55
108
|
`./bin/lederhosen get_reps --input=#{$test_dir}/clusters.uc --database=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/representatives.fasta`
|
109
|
+
$?.success?.should be_true
|
56
110
|
end
|
57
|
-
|
58
|
-
it 'should create a fasta file containing representative reads for each cluster'
|
59
111
|
end
|
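The TODO in the `otu_filter` spec asks that filtered reads be kept so that depth of coverage is preserved, and the readme says filtered counts move to the `noise` pseudocluster. A rough sketch of that idea, assuming the table is held as a hash of sample name to cluster counts, could look like this. The function `filter_otus` and its data layout are assumptions made for illustration and do not reflect how the `otu_filter` task is actually implemented.

```ruby
# Hypothetical sketch (not Lederhosen's implementation).
# table: { sample_name => { cluster_name => read_count } }
# A cluster is kept when at least min_samples samples contain it
# with at least min_reads reads each.
def filter_otus(table, min_reads, min_samples)
  clusters = table.values.flat_map(&:keys).uniq
  kept = clusters.select do |cluster|
    table.values.count { |row| row.fetch(cluster, 0) >= min_reads } >= min_samples
  end

  table.each_with_object({}) do |(sample, row), filtered|
    new_row = { 'noise' => 0 }
    row.each do |cluster, count|
      if kept.include?(cluster)
        new_row[cluster] = count
      else
        # Move the reads instead of dropping them, so row totals are unchanged.
        new_row['noise'] += count
      end
    end
    filtered[sample] = new_row
  end
end

# Made-up example table for illustration only.
table = {
  'sample_1' => { 'otu_a' => 60, 'otu_b' => 2 },
  'sample_2' => { 'otu_a' => 75, 'otu_b' => 0 }
}
filtered = filter_otus(table, 50, 2)
```

Because reads are moved rather than discarded, each sample keeps the same total read count before and after filtering, which is the property the spec comment says makes later normalization easier.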