lederhosen 1.8.2 → 2.0.0

data/readme.md CHANGED
@@ -16,13 +16,15 @@ Lederhosen is designed with the following "pipeline" in mind:
16
16
  1. Clustering sequences to centroid or reference sequences (read: database)
17
17
  2. Generating tables from USEARCH output.
18
18
  3. Filtering tables to remove small or insignificant OTUs.
19
+ 4. Support for paired end reads (considers taxonomic assignment for both reads in a pair).
19
20
 
20
21
  ### About
21
22
 
22
23
  - Lederhosen is a project born out of the Triplett Lab at the University of Florida.
23
- - Lederhosen is designed to be a fast and **simple** tool to aid in clustering 16S rRNA amplicons sequenced
24
+ - Lederhosen is designed to be a fast and **simple** (~700 SLOC) tool to aid in clustering 16S rRNA amplicons sequenced
24
25
  using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq), Ion Torrent, or Roche-454.
25
- - Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the [MIT open source license](http://opensource.org/licenses/mit-license.php/), and has **UNIT TESTS** (omg!).
26
+ - Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
27
+ [MIT open source license](http://opensource.org/licenses/mit-license.php/).
26
28
  - Except for USEARCH which requires a license, Lederhosen is available for commercial use.
27
29
 
28
30
  ### Features
@@ -76,27 +78,102 @@ lederhosen cluster \
76
78
  --database=taxcollector.udb
77
79
  ```
78
80
 
79
- The optional `--dry-run` parameter outputs the usearch command to standard out. This is useful if you want to run usearch on a cluster.
81
+ The optional `--dry-run` parameter outputs the usearch command to standard out.
82
+ This is useful if you want to run usearch on a cluster.
80
83
 
81
- ### Generate OTU table(s)
84
+ ```bash
85
+ for reads_file in reads/*.fasta;
86
+ do
87
+ echo lederhosen cluster \
88
+ --input=$reads_file \
89
+ --identity=0.95 \
90
+ --output=$(basename $reads_file .fasta).95.uc \
91
+ --database=taxcollector.udb \
92
+ --threads 1 \
93
+ --dry-run
94
+ done > jobs.sh
95
+
96
+ # send jobs to queue system
97
+ cat jobs.sh | parallel -j 24 # run 24 parallel jobs
98
+ ```
99
+
100
+ ### Generate taxonomy counts tables
101
+
102
+ Before generating OTU tables, you must generate taxonomy counts tables.
103
+
104
+ A taxonomy count table looks something like this:
105
+
106
+ # taxonomy, number_of_reads
107
+ [0]Bacteria[1];...;[8]Akkermansia_muciniphila, 28
108
+ ...
109
+
110
+ From there, you can generate OTU abundance matrices at the different levels of classification (domain, phylum, ..., genus, species).
111
+
112
+ ```bash
113
+
114
+ lederhosen count_taxonomies \
115
+ --input=clusters.uc \
116
+ --output=clusters_taxonomies.txt
117
+ ```
118
+
119
+ If you did paired-end sequencing, you can generate strict taxonomy tables that only count reads when *both pairs* have the *same*
120
+ taxonomic description at a certain taxonomic level. This is useful for leveraging the increased length of having pairs and also
121
+ acts as a sort of chimera filter. You will, however, end up using fewer of your reads as the level goes from domain to species.
122
+
123
+ ```bash
124
+ lederhosen count_taxonomies \
125
+ --input=clusters.uc \
126
+ --strict=genus \
127
+ --output=clusters_taxonomies.strict.genus.txt
128
+ ```
129
+
130
+ Read pairs that do not have the same phylogeny at the level passed to `--strict` will become `unclassified_reads`.
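
The strict counting rule described above can be illustrated with a minimal Ruby sketch. This is not Lederhosen's code: the method name `strict_counts`, the `LEVELS` constant and the in-memory representation (each taxonomy as an array of names ordered from domain to species) are assumptions made for illustration, as is the choice to compare every rank down to the requested level and to count both reads of a pair.

```ruby
# Hypothetical sketch of strict pair counting (not Lederhosen's implementation).
LEVELS = %w[domain phylum class order family genus species].freeze

# pairs: array of [taxonomy_a, taxonomy_b]; each taxonomy is an array of
# names ordered from domain to species (an assumed representation).
def strict_counts(pairs, level)
  depth = LEVELS.index(level)
  raise ArgumentError, "unknown level: #{level}" if depth.nil?

  counts = Hash.new(0)
  pairs.each do |left, right|
    a = left[0..depth]
    b = right[0..depth]
    if a == b && a.length == depth + 1
      counts[a.join(';')] += 2            # both reads of the pair agree
    else
      counts['unclassified_reads'] += 2   # disagreement or missing rank
    end
  end
  counts
end

pairs = [
  [%w[Bacteria Firmicutes], %w[Bacteria Firmicutes]],
  [%w[Bacteria Firmicutes], %w[Bacteria Proteobacteria]]
]
p strict_counts(pairs, 'phylum')
```

Because every pair contributes two reads to exactly one bucket, the total is always even, which is the property the spec checks for strict tables.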
131
+
132
+ ### Generate OTU tables
82
133
 
83
134
  Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
84
135
 
85
136
  ```bash
86
137
  lederhosen otu_table \
87
- --files=clusters_95.uc \
88
- --prefix=otu_table \
89
- --levels=domain phylum class order family genus species
138
+ --files=clusters_taxonomies.strict.genus.*.txt \
139
+ --output=my_poop_samples_genus_strict.95.txt \
140
+ --level=genus
141
+ ```
142
+
143
+ This will create the file `my_poop_samples_genus_strict.95.txt` containing the clusters
144
+ as columns and the samples as rows.
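
To make the table layout concrete, here is a minimal Ruby sketch that turns per-sample counts into a matrix with samples as rows and clusters as columns. The method name `otu_table` and the input shape (a hash of sample name to a hash of cluster name to read count) are assumptions for illustration; parsing of the taxonomy count files is left out.

```ruby
# Hypothetical sketch: build a samples-by-clusters matrix from per-sample counts.
# sample_counts: { sample_name => { cluster_name => number_of_reads } }
def otu_table(sample_counts)
  # Every cluster seen in any sample becomes a column.
  columns = sample_counts.values.flat_map(&:keys).uniq.sort

  rows = sample_counts.map do |sample, counts|
    # Clusters absent from a sample get a count of zero.
    [sample, *columns.map { |cluster| counts.fetch(cluster, 0) }]
  end

  [['sample', *columns], *rows]
end

counts = {
  'sample_a' => { 'Akkermansia' => 28, 'Bacteroides' => 5 },
  'sample_b' => { 'Bacteroides' => 12 }
}
otu_table(counts).each { |row| puts row.join(',') }
```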
145
+
146
+ You now will apply advanced data mining and statistical techniques to this table to make
147
+ interesting biological inferences and cure diseases.
148
+
149
+ ### Filter OTU tables
150
+
151
+ Sometimes, clustering high-throughput reads at stringent identities can create many small clusters.
152
+ In fact, these clusters represent the vast majority (>99%) of the created clusters but the minority (<1%)
153
+ of the reads. In other words, fewer than 1% of the reads account for more than 99% of the clusters.
154
+
155
+ If you want to filter out these small clusters, which are an inseparable mix of sequencing error and
156
+ actual biodiversity, you can do so with the `otu_filter` task.
157
+
158
+ ```bash
159
+ lederhosen otu_filter \
160
+ --input=table.csv \
161
+ --output=filtered.csv \
162
+ --reads=50 \
163
+ --samples=50
90
164
  ```
91
165
 
92
- This will create the files:
166
+ This will remove any clusters that do not appear in at least 50 samples with at least 50 reads. The read counts
167
+ for filtered clusters will be moved to the `noise` pseudocluster.
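
The filtering rule can be sketched in Ruby as follows. This is an illustration under stated assumptions, not the `otu_filter` implementation: the method name `filter_otus`, its keyword arguments and the hash-of-hashes table shape are hypothetical. A cluster is kept only if at least `min_samples` samples contain at least `min_reads` reads for it; reads from dropped clusters are summed into a `noise` column so that each sample's total is preserved.

```ruby
# Hypothetical sketch of OTU filtering (not Lederhosen's implementation).
# table: { sample_name => { cluster_name => number_of_reads } }
def filter_otus(table, min_reads:, min_samples:)
  clusters = table.values.flat_map(&:keys).uniq

  # Keep clusters that reach min_reads in at least min_samples samples.
  keep = clusters.select do |cluster|
    table.count { |_sample, counts| counts.fetch(cluster, 0) >= min_reads } >= min_samples
  end

  table.transform_values do |counts|
    kept  = counts.select { |cluster, _| keep.include?(cluster) }
    noise = counts.reject { |cluster, _| keep.include?(cluster) }.values.sum
    kept.merge('noise' => noise)   # filtered reads are moved, not discarded
  end
end

table = {
  's1' => { 'a' => 60, 'b' => 2 },
  's2' => { 'a' => 70, 'b' => 1 }
}
p filter_otus(table, min_reads: 50, min_samples: 2)
```

Keeping the filtered reads in a separate column means per-sample read depth is unchanged, which helps when normalizing the table later.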
93
168
 
94
- otu_table.domain.csv, ..., otu_table.species.csv
95
169
 
96
170
  ### Get representative sequences
97
171
 
98
- You can get the representative sequences for each cluster using the `get_reps` tasks. This will extract the representative sequence from
99
- the __database__ you ran usearch with. Make sure you use the same database that you used when running usearch.
172
+ (not yet implemented)
173
+
174
+ You can get the representative sequences for each cluster using the `get_reps` task.
175
+ This will extract the representative sequence from the __database__ you ran usearch with.
176
+ Make sure you use the same database that you used when running usearch.
100
177
 
101
178
  ```bash
102
179
  lederhosen get_reps \
@@ -114,6 +191,25 @@ lederhosen get_reps \
114
191
  --output=representatives.fasta
115
192
  ```
116
193
 
194
+ ### Get unclassified sequences
195
+
196
+ ```bash
197
+ lederhosen separate_unclassified \
198
+ --uc-file=my_results.uc \
199
+ --reads=reads_that_were_used_to_generate_results.fasta \
200
+ --output=unclassified_reads.fasta
201
+ ```
202
+
203
+ `separate_unclassified` has support for strict pairing:
204
+
205
+ ```bash
206
+ lederhosen separate_unclassified \
207
+ --uc-file=my_results.uc \
208
+ --reads=reads_that_were_used_to_generate_results.fasta \
209
+ --strict=phylum \
210
+ --output=unclassified_reads.fasta
211
+ ```
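
As a rough illustration of what separating unclassified reads involves, the Ruby sketch below collects query labels from `N` (no hit) records in USEARCH `.uc` output and keeps the matching FASTA records. The method names `unclassified_ids` and `select_fasta` are hypothetical, and the sketch assumes tab-separated `.uc` records with the query label in the ninth column and FASTA headers that match that label exactly. It does not cover the `--strict` mode.

```ruby
require 'set'

# Hypothetical sketch (not Lederhosen's implementation).
# Collect query labels of reads that USEARCH reported as "N" (no hit).
def unclassified_ids(uc_lines)
  uc_lines
    .reject { |line| line.start_with?('#') }
    .map { |line| line.chomp.split("\t") }
    .select { |fields| fields[0] == 'N' }
    .map { |fields| fields[8] }          # assumed: ninth column is the query label
    .to_set
end

# Keep FASTA records (header plus sequence lines) whose header is in ids.
def select_fasta(fasta_lines, ids)
  keep = false
  fasta_lines.select do |line|
    keep = ids.include?(line[1..-1].strip) if line.start_with?('>')
    keep
  end
end

uc = [
  "H\t0\t100\t99.0\t+\t0\t0\t100M\tread1\tref1\n",
  "N\t*\t100\t*\t*\t*\t*\t*\tread2\t*\n"
]
fasta = [">read1\n", "ACGT\n", ">read2\n", "GGCC\n"]
puts select_fasta(fasta, unclassified_ids(uc))
```

With this approach the number of `N` records equals the number of FASTA headers written, which mirrors the check in the spec for `separate_unclassified`.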
212
+
117
213
  ## Acknowledgements
118
214
 
119
215
  - Lexi, Vinnie and Kevin for beta-testing and putting up with bugs
data/spec/cli_spec.rb CHANGED
@@ -25,24 +25,77 @@ describe Lederhosen::CLI do
25
25
  end
26
26
 
27
27
  it 'can cluster reads using usearch' do
28
- `./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.95 --output #{$test_dir}/clusters.uc`
28
+ `./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.99 --output #{$test_dir}/clusters.uc`
29
29
  $?.success?.should be_true
30
30
  File.exists?(File.join($test_dir, 'clusters.uc')).should be_true
31
31
  end
32
32
 
33
- it 'should build abundance matrices for each level' do
34
- levels = "domain phylum class order FAMILY genus Species"
35
- `./bin/lederhosen otu_table --files=spec/data/test.uc --prefix=#{$test_dir}/otu_table --levels=#{levels}`
33
+ it 'can separate unclassified reads from usearch output' do
34
+ `./bin/lederhosen separate_unclassified --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.fasta`
36
35
  $?.success?.should be_true
36
+ unclassified_results = File.readlines("spec/data/test.uc")
37
+ .select { |x| x =~ /^N/ }
38
+ .size
39
+ unclassified_reads = File.readlines("#{$test_dir}/unclassified.fasta")
40
+ .select { |x| x =~ /^>/ }
41
+ .size
42
+
43
+ unclassified_results.should == unclassified_reads
44
+ end
45
+
46
+ it 'can separate unclassified reads from usearch output using strict pairing' do
47
+ `./bin/lederhosen separate_unclassified --strict=genus --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.strict_genus.fasta`
48
+ $?.success?.should be_true
49
+ File.readlines("#{$test_dir}/unclassified.strict_genus.fasta")
50
+ .select { |x| x =~ /^>/ }
51
+ .size.should be_even
37
52
  end
38
53
 
39
- it 'should filter OTU abundance matrices' do
40
- `./bin/lederhosen otu_filter --input=#{$test_dir}/otu_table.species.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
54
+ it 'can create taxonomy count tables' do
55
+ `./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.txt`
41
56
  $?.success?.should be_true
57
+ File.exists?(File.join($test_dir, 'taxonomy_count.txt')).should be_true
58
+ end
59
+
60
+ it 'generates taxonomy tables w/ comma-free taxonomic descriptions' do
61
+ File.readlines(File.join($test_dir, 'taxonomy_count.txt'))
62
+ .map(&:strip)
63
+ .map { |x| x.count(',') }
64
+ .uniq
65
+ .should == [1]
66
+ end
67
+
68
+ %w{domain phylum class order family genus species}.each do |level|
69
+ it "generates taxonomy tables only counting pairs that agree at level: #{level}" do
70
+ `./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.strict.#{level}.txt --strict=#{level}`
71
+ $?.success?.should be_true
72
+
73
+ lines = File.readlines(File.join($test_dir, "taxonomy_count.strict.#{level}.txt"))
74
+
75
+ # make sure total number of reads is even
76
+ # requires that there should be an odd number if classification is not strict
77
+ lines.select { |x| !(x =~ /^#/) }
78
+ .map(&:strip)
79
+ .map { |x| x.split(',') }
80
+ .map(&:last)
81
+ .map(&:to_i)
82
+ .inject(:+).should be_even
83
+ end
84
+ end
85
+
86
+ %w{domain phylum class order family genus species}.each do |level|
87
+ it "should create OTU abundance matrices from taxonomy count tables at level: #{level}" do
88
+ `./bin/lederhosen otu_table --files=#{$test_dir}/taxonomy_count.strict.*.txt --level=#{level} --output=#{$test_dir}/otus_genus.strict.csv`
89
+ $?.success?.should be_true
90
+ end
42
91
  end
43
92
 
44
- it 'should combine OTU abundance matrices' do
45
- `./bin/lederhosen join_otu_tables --input=#{$test_dir}/otu_table*.csv --output=#{$test_dir}/merged.csv`
93
+ it 'should filter OTU abundance matrices' do
94
+ # TODO
95
+ # filtering should move filtered reads to 'unclassified_reads' so that we maintain
96
+ # our knowledge of depth of coverage throughout
97
+ # this makes normalization better later.
98
+ `./bin/lederhosen otu_filter --input=#{$test_dir}/otus_genus.strict.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
46
99
  $?.success?.should be_true
47
100
  end
48
101
 
@@ -53,7 +106,6 @@ describe Lederhosen::CLI do
53
106
 
54
107
  it 'should print representative sequences from uc files' do
55
108
  `./bin/lederhosen get_reps --input=#{$test_dir}/clusters.uc --database=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/representatives.fasta`
109
+ $?.success?.should be_true
56
110
  end
57
-
58
- it 'should create a fasta file containing representative reads for each cluster'
59
111
  end