RubyGems - lederhosen - Versions diffs - 1.8.2 → 2.0.0 - Mend

lederhosen 1.8.2 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

data/Gemfile +1 -1
data/lederhosen.gemspec +7 -3
data/lib/lederhosen/no_tasks.rb +18 -18
data/lib/lederhosen/tasks/count_taxonomies.rb +83 -0
data/lib/lederhosen/tasks/get_reps.rb +3 -4
data/lib/lederhosen/tasks/make_udb.rb +2 -2
data/lib/lederhosen/tasks/otu_filter.rb +8 -1
data/lib/lederhosen/tasks/otu_table.rb +33 -70
data/lib/lederhosen/tasks/separate_unclassified.rb +65 -0
data/lib/lederhosen/uc_parser.rb +88 -0
data/lib/lederhosen/version.rb +4 -4
data/readme.md +107 -11
data/spec/cli_spec.rb +62 -10
data/spec/data/test.uc +9 -684
data/spec/data/trimmed/ILT_L_9_B_001.fasta +100 -1596
data/spec/no_tasks_spec.rb +1 -1
data/spec/uc_parser_spec.rb +0 -0
metadata +7 -3

data/readme.md CHANGED

@@ -16,13 +16,15 @@ Lederhosen is designed with the following "pipeline" in mind:
 1. Clustering sequences to centroid or reference sequences (read: database)
 2. Generating tables from USEARCH output.
 3. Filtering tables to remove small or insignificant OTUs.
+4. Support for paired end reads (considers taxonomic assignment for both reads in a pair).
 ### About
 - Lederhosen is a project born out of the Triplett Lab at the University of Florida.
-- Lederhosen is designed to be a fast and **simple** tool to aid in clustering 16S rRNA amplicons sequenced
+- Lederhosen is designed to be a fast and **simple** (~700 SLOC) tool to aid in clustering 16S rRNA amplicons sequenced
 using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq), Ion Torrent, or Roche-454.
-- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the [MIT open source license](http://opensource.org/licenses/mit-license.php/), and has **UNIT TESTS** (omg!).
+- Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the
+[MIT open source license](http://opensource.org/licenses/mit-license.php/).
 - Except for USEARCH which requires a license, Lederhosen is available for commercial use.
 ### Features
@@ -76,27 +78,102 @@ lederhosen cluster \
   --database=taxcollector.udb
 ```
-The optional `--dry-run` parameter outputs the usearch command to standard out. This is useful if you want to run usearch on a cluster.
+The optional `--dry-run` parameter outputs the usearch command to standard out.
+This is useful if you want to run usearch on a cluster.
-### Generate OTU table(s)
+```bash
+for reads_file in reads/*.fasta;
+do
+    echo lederhosen cluster \
+                    --input=$reads_file \
+                    --identity=0.95 \
+                    --output=$(basename $reads_file .fasta).95.uc \
+                    --database=taxcollector.udb \
+                    --threads 1 \
+                    --dry-run
+done > jobs.sh
+# send jobs to queue system
+cat jobs.sh | parallel -j 24 # run 24 parallel jobs
+```
+### Generate taxonomy counts tables
+Before generating OTU tables, you must generate taxonomy counts tables.
+A taxonomy count table looks something like this:
+    # taxonomy, number_of_reads
+    [0]Bacteria[1];...;[8]Akkermansia_muciniphila, 28
+    ...
+From there, you can generate OTU abundance matrices at the different levels of classification (domain, phylum, ..., genus, species).
+```bash
+lederhosen count_taxonomies \
+  --input=clusters.uc \
+  --output=clusters_taxonomies.txt
+```
+If you did paired-end sequencing, you can generate strict taxonomy tables that only count reads when *both reads in a pair* have the *same*
+taxonomic description at a certain taxonomic level. This takes advantage of the extra sequence length that pairs provide and also
+acts as a sort of chimera filter. You will, however, end up using fewer of your reads as the level goes from domain to species.
+```bash
+lederhosen count_taxonomies \
+  --input=clusters.uc \
+  --strict=genus \
+  --output=clusters_taxonomies.strict.genus.txt
+```
+Reads whose mates do not share the same taxonomy at the `--strict` level are counted as `unclassified_reads`.
+### Generate OTU tables
 Create an OTU abundance table where rows are samples and columns are clusters. The entries are the number of reads for that cluster in a sample.
 ```bash
 lederhosen otu_table \
-  --files=clusters_95.uc \
-  --prefix=otu_table \
-  --levels=domain phylum class order family genus species
+  --files=clusters_taxonomies.strict.genus.*.txt \
+  --output=my_poop_samples_genus_strict.95.txt \
+  --level=genus
+```
+This will create the file `my_poop_samples_genus_strict.95.txt` containing the clusters
+as columns and the samples as rows.
+You now will apply advanced data mining and statistical techniques to this table to make
+interesting biological inferences and cure diseases.
+### Filter OTU tables
+Sometimes, clustering high-throughput reads at stringent identities can create many small clusters.
+In fact, these clusters represent the vast majority (>99%) of the created clusters but the minority (<1%)
+of the reads. In other words, 1% of the reads account for 99% of the clusters.
+If you want to filter out these small clusters, which are composed of sequencing error and
+actual biodiversity that cannot be told apart, you can do so with the `otu_filter` task.
+```bash
+lederhosen otu_filter \
+  --input=table.csv \
+  --output=filtered.csv \
+  --reads=50 \
+  --samples=10
 ```
-This will create the files:
+This will remove any clusters that do not appear in at least 10 samples with at least 50 reads. The read counts
+for filtered clusters will be moved to the `noise` pseudocluster.
-    otu_table.domain.csv, ..., otu_table.species.csv
 ### Get representative sequences
-You can get the representative sequences for each cluster using the `get_reps` tasks. This will extract the representative sequence from
-the __database__ you ran usearch with. Make sure you use the same database that you used when running usearch.
+(not yet implemented)
+You can get the representative sequences for each cluster using the `get_reps` task.
+This will extract the representative sequence from the __database__ you ran usearch with.
+Make sure you use the same database that you used when running usearch.
 ```bash
 lederhosen get_reps \
@@ -114,6 +191,25 @@ lederhosen get_reps \
   --output=representatives.fasta
 ```
+### Get unclassified sequences
+```bash
+lederhosen separate_unclassified \
+  --uc-file=my_results.uc \
+  --reads=reads_that_were_used_to_generate_results.fasta \
+  --output=unclassified_reads.fasta
+```
+`separate_unclassified` has support for strict pairing:
+```bash
+lederhosen separate_unclassified \
+  --uc-file=my_results.uc \
+  --reads=reads_that_were_used_to_generate_results.fasta \
+  --strict=phylum \
+  --output=unclassified_reads.fasta
+```
 ## Acknowledgements
 - Lexi, Vinnie and Kevin for beta-testing and putting up with bugs
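
The strict counting described in the readme hunk above can be illustrated with a short Ruby sketch. This is not Lederhosen's implementation: `name_at` and `strict_counts` are hypothetical helpers, the `[n]Name;[n]Name` taxonomy layout is an assumed simplification of the database headers, and mates are assumed to sit next to each other in the input.

```ruby
# Hypothetical helper: return the name stored at a numbered rank in a
# "[0]Bacteria;[1]Firmicutes;..." style string (layout is an assumption).
def name_at(taxonomy, rank)
  tag = "[#{rank}]"
  field = taxonomy.split(';').find { |f| f.start_with?(tag) }
  field && field.sub(tag, '')
end

# Hypothetical helper: count reads pair by pair, assuming mates are adjacent.
# A pair is counted under its name at `rank` only when both mates agree
# there; otherwise its reads are counted as 'unclassified_reads'.
def strict_counts(taxonomies, rank)
  counts = Hash.new(0)
  taxonomies.each_slice(2) do |left, right|
    a = name_at(left, rank)
    b = right && name_at(right, rank)
    key = (a && a == b) ? a : 'unclassified_reads'
    # A trailing read without a mate contributes only itself.
    counts[key] += [left, right].compact.size
  end
  counts
end

# Made-up example data: two pairs, the first disagreeing at rank 5.
reads = [
  '[0]Bacteria;[1]Firmicutes;[5]Bacillus',
  '[0]Bacteria;[1]Firmicutes;[5]Clostridium',
  '[0]Bacteria;[1]Proteobacteria;[5]Escherichia',
  '[0]Bacteria;[1]Proteobacteria;[5]Escherichia'
]

p strict_counts(reads, 1)
p strict_counts(reads, 5)
```

At rank 1 both pairs agree, so every read is counted. At rank 5 the first pair disagrees and its two reads move to `unclassified_reads`, which is why fewer reads are classified as the level gets more specific.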

data/spec/cli_spec.rb CHANGED

@@ -25,24 +25,77 @@ describe Lederhosen::CLI do
   end
   it 'can cluster reads using usearch' do
-    `./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.95 --output #{$test_dir}/clusters.uc`
+    `./bin/lederhosen cluster --input spec/data/trimmed/ILT_L_9_B_001.fasta --database #{$test_dir}/test_db.udb --identity 0.99 --output #{$test_dir}/clusters.uc`
     $?.success?.should be_true
     File.exists?(File.join($test_dir, 'clusters.uc')).should be_true
   end
-  it 'should build abundance matrices for each level' do
-    levels = "domain phylum class order FAMILY genus Species"
-    `./bin/lederhosen otu_table --files=spec/data/test.uc --prefix=#{$test_dir}/otu_table --levels=#{levels}`
+  it 'can separate unclassified reads from usearch output' do
+    `./bin/lederhosen separate_unclassified --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.fasta`
     $?.success?.should be_true
+    unclassified_results = File.readlines("spec/data/test.uc")
+                               .select { |x| x =~ /^N/ }
+                               .size
+    unclassified_reads = File.readlines("#{$test_dir}/unclassified.fasta")
+                             .select { |x| x =~ /^>/ }
+                             .size
+    unclassified_results.should == unclassified_reads
+  end
+  it 'can separate unclassified reads from usearch output using strict pairing' do
+    `./bin/lederhosen separate_unclassified --strict=genus --uc-file=spec/data/test.uc --reads=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/unclassified.strict_genus.fasta`
+    $?.success?.should be_true
+    File.readlines("#{$test_dir}/unclassified.strict_genus.fasta")
+      .select { |x| x =~ /^>/ }
+      .size.should be_even
   end
-  it 'should filter OTU abundance matrices' do
-    `./bin/lederhosen otu_filter --input=#{$test_dir}/otu_table.species.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
+  it 'can create taxonomy count tables' do
+    `./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.txt`
     $?.success?.should be_true
+    File.exists?(File.join($test_dir, 'taxonomy_count.txt')).should be_true
+  end
+  it 'generates taxonomy tables w/ comma-free taxonomic descriptions' do
+    File.readlines(File.join($test_dir, 'taxonomy_count.txt'))
+      .map(&:strip)
+      .map { |x| x.count(',') }
+      .uniq
+      .should == [1]
+  end
+  %w{domain phylum class order family genus species}.each do |level|
+    it "generates taxonomy tables only counting pairs that agree at level: #{level}" do
+      `./bin/lederhosen count_taxonomies --input=spec/data/test.uc --output=#{$test_dir}/taxonomy_count.strict.#{level}.txt --strict=#{level}`
+      $?.success?.should be_true
+      lines = File.readlines(File.join($test_dir, "taxonomy_count.strict.#{level}.txt"))
+      # strict mode counts whole pairs, so the total number of reads must be even
+      # (this check is only meaningful if the non-strict total would be odd)
+      lines.select { |x| !(x =~ /^#/) }
+           .map(&:strip)
+           .map { |x| x.split(',') }
+           .map(&:last)
+           .map(&:to_i)
+           .inject(:+).should be_even
+    end
+  end
+  %w{domain phylum class order family genus species}.each do |level|
+    it "should create OTU abundance matrices from taxonomy count tables at level: #{level}" do
+      `./bin/lederhosen otu_table --files=#{$test_dir}/taxonomy_count.strict.*.txt --level=#{level} --output=#{$test_dir}/otus_genus.strict.csv`
+      $?.success?.should be_true
+    end
   end
-  it 'should combine OTU abundance matrices' do
-    `./bin/lederhosen join_otu_tables --input=#{$test_dir}/otu_table*.csv --output=#{$test_dir}/merged.csv`
+  it 'should filter OTU abundance matrices' do
+    # TODO
+    # filtering should move filtered reads to 'unclassified_reads' so that we maintain
+    # our knowledge of depth of coverage throughout
+    # this makes normalization better later.
+    `./bin/lederhosen otu_filter --input=#{$test_dir}/otus_genus.strict.csv --output=#{$test_dir}/otu_table.filtered.csv --reads 1 --samples 1`
     $?.success?.should be_true
   end
@@ -53,7 +106,6 @@ describe Lederhosen::CLI do
   it 'should print representative sequences from uc files' do
     `./bin/lederhosen get_reps --input=#{$test_dir}/clusters.uc --database=spec/data/trimmed/ILT_L_9_B_001.fasta --output=#{$test_dir}/representatives.fasta`
+    $?.success?.should be_true
   end
-  it 'should create a fasta file containing representative reads for each cluster'
 end
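
The TODO in the filtering spec asks that filtered reads stay in the table so that depth of coverage is preserved. A minimal Ruby sketch of that idea follows. It is an illustration under assumptions, not the gem's `otu_filter` code: `filter_otus`, its keyword arguments, and the nested-hash table layout are hypothetical, and the catch-all column is named `noise` after the readme's wording.

```ruby
# Hypothetical helper: keep a cluster only if it has at least `min_reads`
# reads in at least `min_samples` samples. Counts of dropped clusters are
# added to a 'noise' column so that each sample's total stays unchanged.
# `table` is assumed to map sample name => { cluster name => read count }.
def filter_otus(table, min_reads:, min_samples:)
  clusters = table.values.flat_map(&:keys).uniq
  keep = clusters.select do |cluster|
    table.values.count { |row| row.fetch(cluster, 0) >= min_reads } >= min_samples
  end

  table.each_with_object({}) do |(sample, row), out|
    kept = row.select { |cluster, _| keep.include?(cluster) }
    noise = row.values.sum - kept.values.sum
    out[sample] = kept.merge('noise' => noise)
  end
end

# Made-up example table with two samples and three clusters.
table = {
  's1' => { 'A' => 60, 'B' => 3 },
  's2' => { 'A' => 55, 'B' => 0, 'C' => 70 }
}

p filter_otus(table, min_reads: 50, min_samples: 2)
```

Only cluster `A` reaches 50 reads in two samples here, so `B` and `C` are folded into `noise` and the per-sample totals (63 and 125) are unchanged, which keeps later normalization meaningful.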