lederhosen 1.7.0 → 1.8.0

data/.rspec CHANGED
@@ -1 +1 @@
1
- -c --fail-fast -f d
1
+ -c -f d
data/Gemfile CHANGED
@@ -7,10 +7,12 @@ gem 'thor', '0.16.0'
7
7
  group :test do
8
8
  gem 'rspec', '2.12.0'
9
9
  gem 'rspec-prof', '0.0.3'
10
+ gem 'pry'
11
+ gem 'plymouth'
10
12
  end
11
13
 
12
14
  group :development do
13
15
  gem 'rdoc', '~> 3.12'
14
16
  gem 'jeweler', '1.8.4'
15
17
  gem 'ruby-prof', '0.11.2'
16
- end
18
+ end
data/lederhosen.gemspec CHANGED
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "lederhosen"
8
- s.version = "1.7.0"
8
+ s.version = "1.8.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Austin G. Davis-Richardson"]
12
- s.date = "2012-12-19"
12
+ s.date = "2013-01-17"
13
13
  s.description = "Various tools for OTU clustering"
14
14
  s.email = "harekrishna@gmail.com"
15
15
  s.executables = ["lederhosen"]
@@ -33,11 +33,16 @@ Gem::Specification.new do |s|
33
33
  "lib/lederhosen/tasks/otu_filter.rb",
34
34
  "lib/lederhosen/tasks/otu_table.rb",
35
35
  "lib/lederhosen/tasks/split_fasta.rb",
36
- "lib/lederhosen/tasks/trim.rb",
37
36
  "lib/lederhosen/tasks/version.rb",
38
37
  "lib/lederhosen/trimmer.rb",
39
38
  "lib/lederhosen/version.rb",
40
39
  "readme.md",
40
+ "scripts/illumina_pipeline/.gitignore",
41
+ "scripts/illumina_pipeline/Makefile",
42
+ "scripts/illumina_pipeline/pipeline.sh",
43
+ "scripts/illumina_pipeline/readme.md",
44
+ "scripts/otu_ref_picking/readme.md",
45
+ "scripts/readme.md",
41
46
  "spec/cli_spec.rb",
42
47
  "spec/data/ILT_L_9_B_001_1.txt.gz",
43
48
  "spec/data/ILT_L_9_B_001_3.txt.gz",
@@ -65,7 +65,7 @@ module Lederhosen
65
65
  RE_QIIME = /k__(.*);p__(.*);c__(.*);o__(.*);f__(.*);g__(.*);s__(.*)/
66
66
 
67
67
  def parse_taxonomy_qiime(taxonomy)
68
- levels = %w{kingdom phylum class order family genus species}
68
+ levels = %w{domain phylum class order family genus species}
69
69
  match_data = taxonomy.match(RE_QIIME)
70
70
  match_data = match_data[1..-1]
71
71
 
@@ -78,7 +78,7 @@ module Lederhosen
78
78
  end
79
79
 
80
80
  def parse_taxonomy_greengenes(taxonomy)
81
- levels = %w{kingdom phylum class order family genus species}
81
+ levels = %w{domain phylum class order family genus species}
82
82
  match_data = taxonomy.match(RE_GREENGENES)
83
83
  match_data = match_data[1..-1]
84
84
 
@@ -101,7 +101,7 @@ module Lederhosen
101
101
  #
102
102
  def parse_taxonomy_taxcollector(taxonomy)
103
103
 
104
- levels = %w{kingdom phylum class order family genus species strain}
104
+ levels = %w{domain phylum class order family genus species strain}
105
105
 
106
106
  match_data =
107
107
  begin
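Note: the three hunks above rename the top-level key from `kingdom` to `domain` in every taxonomy parser. As a rough illustration of what that means for callers, here is a minimal sketch of the QIIME case. `RE_QIIME` and the `levels` list are copied from the hunk; the method name `parse_taxonomy_qiime_sketch`, the `nil` check, and the step that zips levels with captures into a hash are assumptions, because the diff does not show the rest of the method.

```ruby
# Sketch only: everything after the `match_data[1..-1]` line is an assumption,
# since the hunk ends before the method body is complete.
RE_QIIME = /k__(.*);p__(.*);c__(.*);o__(.*);f__(.*);g__(.*);s__(.*)/

def parse_taxonomy_qiime_sketch(taxonomy)
  # Level names as of 1.8.0 ('domain' replaces 'kingdom').
  levels = %w{domain phylum class order family genus species}

  match_data = taxonomy.match(RE_QIIME)
  # Hypothetical guard: fail loudly on input that is not QIIME-formatted.
  raise ArgumentError, "not a QIIME taxonomy: #{taxonomy}" if match_data.nil?

  # Drop the full match and keep only the seven captured names.
  captures = match_data[1..-1]

  # Assumed shape of the result: level name => captured value.
  Hash[levels.zip(captures)]
end

parsed = parse_taxonomy_qiime_sketch(
  'k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;' \
  'o__Enterobacteriales;f__Enterobacteriaceae;g__Rahnella;s__'
)
puts parsed['domain']          # the value formerly stored under 'kingdom'
puts parsed.key?('kingdom')    # the old key is no longer present
```

If the real method returns a hash of this shape, any downstream code that reads `taxonomy['kingdom']` would need to switch to `taxonomy['domain']` after upgrading.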
data/lib/lederhosen/version.rb CHANGED
@@ -1,8 +1,8 @@
1
1
  module Lederhosen
2
2
  module Version
3
3
  MAJOR = 1
4
- MINOR = 7
5
- CODENAME = 'Franziskaner' # changes for minor versions
4
+ MINOR = 8
5
+ CODENAME = 'Karottensaft' # changes for minor versions
6
6
  PATCH = 0
7
7
 
8
8
  STRING = [MAJOR, MINOR, PATCH].join('.')
data/lib/lederhosen.rb CHANGED
@@ -1,6 +1,4 @@
1
1
  require 'rubygems'
2
- require 'bundler'
3
- require 'set'
4
2
  require 'dna'
5
3
  require 'progressbar'
6
4
  require 'thor'
data/readme.md CHANGED
@@ -4,32 +4,32 @@
4
4
 
5
5
  Lederhosen is a set of tools for OTU clustering rRNA amplicons using Robert Edgar's USEARCH.
6
6
 
7
- It handles quality control of raw sequence data, running USEARCH, and creating and filtering tables.
7
+ It's used to run USEARCH and create and filter tables. Unlike most of the software in Bioinformatics,
8
+ It is meant to be UNIX-y: do one thing and do it well.
9
+
10
+ Do you want to run Lederhosen on a cluster? Use `--dry-run` and feed it to your cluster's queue management system.
8
11
 
9
12
  Lederhosen is not a pipeline but rather a set of tools broken up into tasks. Tasks are invoked by running `lederhosen TASK ...`.
10
13
 
11
14
  Lederhosen is designed with the following "pipeline" in mind:
12
15
 
13
- 1. Quality control of sequence data.
14
- 2. Clustering sequences to centroid or reference sequences (read: database)
15
- 3. Generating tables from USEARCH output.
16
- 4. Filtering tables to remove small or insignificant OTUs.
16
+ 1. Clustering sequences to centroid or reference sequences (read: database)
17
+ 2. Generating tables from USEARCH output.
18
+ 3. Filtering tables to remove small or insignificant OTUs.
17
19
 
18
20
  ### About
19
21
 
20
22
  - Lederhosen is a project born out of the Triplett Lab at the University of Florida.
21
- - Lederhosen is designed to be a fast and simple method of clustering 16S rRNA amplicons sequenced
22
- using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq).
23
- - Lederhosen uses [Semantic Versioning](http://semver.org/).
24
- - Lederhosen is free and open source under the [MIT open source license](http://opensource.org/licenses/mit-license.php/).
23
+ - Lederhosen is designed to be a fast and **simple** tool to aid in clustering 16S rRNA amplicons sequenced
24
+ using paired and non-paired end short reads such as those produced by Illumina (GAIIx, HiSeq and MiSeq), Ion Torrent, or Roche-454.
25
+ - Lederhosen uses [Semantic Versioning](http://semver.org/), is free and open source under the [MIT open source license](http://opensource.org/licenses/mit-license.php/), and has **UNIT TESTS** (omg!).
25
26
  - Except for USEARCH which requires a license, Lederhosen is available for commercial use.
26
27
 
27
28
  ### Features
28
29
 
29
- - Sequence trimming (paired-end Illumina).
30
- - Parallel, referenced-based clustering to TaxCollector using USEARCH.
31
- - Queue-agnostic support for running jobs on clusters.
32
- - Support for RDP, TaxCollector or GreenGenes databases.
30
+ - Closed/Open/Mixed OTU clustering to TaxCollector or GreenGenes via USEARCH.
31
+ - Parallel support (pipe commands into [parallel](http://savannah.gnu.org/projects/parallel/), or use your cluster's queue).
32
+ - Support for RDP, TaxCollector or GreenGenes 16S rRNA databases.
33
33
  - Generation and filtering of OTU abundancy matrices.
34
34
 
35
35
  ### Installation
@@ -50,19 +50,7 @@ Lederhosen is invoked by typing `lederhosen [TASK]`
50
50
 
51
51
  ### Trim Reads
52
52
 
53
- Trim (Illumina, QSEQ format) reads using quality scores. Output will be a directory of fasta files. Reads can optionally be gzipped.
54
-
55
- lederhosen trim --reads_dir=reads/*.txt --out_dir=trimmed/
56
-
57
- The trimming process will reverse complement the "right" pair so that both reads are in the forward orientation.
58
-
59
- You can also trim interleaved, paired-end FASTQ files:
60
-
61
- lederhosen trim --reads_dir=reads/*.fastq --out_dir=trimmed/ read-type='fastq'
62
-
63
- Lederhosen will also trim off adapter sequences from the 5' end of the "left" read with the `--left-trim` option.
64
-
65
- lederhosen trim --reads_dir=reads/*.fastq --out_dir=trimed/ --read-type='fastq' --left-trim=11
53
+ Trimming removed. I think you should use [Sickle](https://github.com/najoshi/sickle).
66
54
 
67
55
  ### Create Database
68
56
 
@@ -74,6 +62,8 @@ lederhosen make_udb \
74
62
  --output=taxcollector.udb
75
63
  ```
76
64
 
65
+ (not actually required but will make batch searching a lot faster)
66
+
77
67
  ### Cluster Reads using USEARCH
78
68
 
79
69
  Cluster reads using USEARCH. Output is a uc file.
data/scripts/illumina_pipeline/.gitignore ADDED
@@ -0,0 +1 @@
1
+ data/
data/scripts/illumina_pipeline/Makefile ADDED
@@ -0,0 +1,14 @@
1
+ #!/bin/bash
2
+
3
+ # for now, we use the Caporaso reference OTUs
4
+ # In the future, I would like to be able to generate a fresh
5
+ # OTU reference database from scratch.
6
+
7
+ REF_DB='http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Reference_OTUs_for_Pipelines/Caporaso_Reference_OTUs/gg_otus_4feb2011.tgz'
8
+
9
+ default: reference_otus
10
+
11
+ reference_otus:
12
+ mkdir -p data
13
+ curl -L ${REF_DB} > data/ref_otus.tar.gz
14
+ tar -zxvf data/ref_otus.tar.gz # this will end up in some other directory
data/scripts/illumina_pipeline/pipeline.sh ADDED
@@ -0,0 +1,3 @@
1
+ #!/bin/bash
2
+
3
+
data/scripts/illumina_pipeline/readme.md ADDED
@@ -0,0 +1,3 @@
1
+ # Illumina Pipeline
2
+
3
+ This is the pipeline for closed or closed + open reference OTU clustering from paired-end 16S rRNA amplicons.
data/scripts/otu_ref_picking/readme.md ADDED
@@ -0,0 +1,9 @@
1
+ # OTU Ref Picking
2
+
3
+ This script will pick reference OTUs to use as centroids for OTU clustering from amplicons.
4
+
5
+ It will also generate multiple sequence alignments and trees from the reference OTUs.
6
+
7
+ It is intended to be used in combination with the Illumina pipeline in order to generate
8
+ datasets that are suitable for analysis using PhyloSeq.
9
+
data/scripts/readme.md ADDED
@@ -0,0 +1,3 @@
1
+ # Lederhosen Scripts
2
+
3
+ This directory will contain scripts that can be used with Lederhosen such as pipelines and what-not.
@@ -4,7 +4,7 @@ describe 'no_tasks' do
4
4
 
5
5
  let(:greengenes_taxonomies) { ['124 U55236.1 Methanobrevibacter thaueri str. CW k__Archaea; p__Euryarchaeota; c__Methanobacteria; o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobrevibacter; Unclassified; otu_127']}
6
6
  let(:qiime_taxonomies) { [ 'k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Rahnella;s__' ]}
7
- let(:taxcollector_taxonomies) { ['[0]Bacteria;[1]Actinobacteria;[2]Actinobacteria;[3]null;[4]null;[5]null;[6]bacterium_TH3;[7]bacterium_TH3;[8]bacterium_TH3|M79434|8'] }
7
+ let(:taxcollector_taxonomies) { ['[0]domain;[1]phylum;[2]class;[3]order;[4]family;[5]genus;[6]species;[7]strain;[8]Genus_species_strain_id'] }
8
8
  let(:lederhosen) { Lederhosen::CLI.new }
9
9
 
10
10
  it '#parse_usearch_line should parse a line of usearch output'
@@ -25,18 +25,18 @@ describe 'no_tasks' do
25
25
  lederhosen.detect_taxonomy_format('this is not a taxonomic description').should raise_error
26
26
  end
27
27
 
28
- it '#parse_taxonomy_taxcollector should parse taxcollector taxonomy' do
29
- taxcollector_taxonomies.each do |taxcollector_taxonomy|
30
- taxonomy = lederhosen.parse_taxonomy_taxcollector(taxcollector_taxonomy)
31
- taxonomy['original'].should == taxcollector_taxonomy
32
-
33
- levels = %w{domain phylum class order family genus species kingdom original strain}
34
-
35
- taxonomy.keys.each do |v|
36
- levels.should include v
28
+ %w{domain phylum class order family genus species strain}.each do |level|
29
+ it "#parse_taxonomy_taxcollector should parse taxcollector taxonomy (#{level})" do
30
+ taxcollector_taxonomies.each do |taxonomy|
31
+ taxonomy = lederhosen.parse_taxonomy_taxcollector(taxonomy)
32
+ taxonomy[level].should == level
37
33
  end
38
34
  end
39
35
  end
36
+
37
+ it '#parse_taxonomy_taxcollector should return original taxonomy' do
38
+ lederhosen.parse_taxonomy_taxcollector(taxcollector_taxonomies[0])['original'].should == taxcollector_taxonomies[0]
39
+ end
40
40
 
41
41
  it '#parse_taxonomy_greengenes should parse greengenes taxonomy' do
42
42
  greengenes_taxonomies.each do |greengenes_taxonomy|
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: lederhosen
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.7.0
4
+ version: 1.8.0
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-12-19 00:00:00.000000000 Z
12
+ date: 2013-01-17 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: dna
@@ -131,11 +131,16 @@ files:
131
131
  - lib/lederhosen/tasks/otu_filter.rb
132
132
  - lib/lederhosen/tasks/otu_table.rb
133
133
  - lib/lederhosen/tasks/split_fasta.rb
134
- - lib/lederhosen/tasks/trim.rb
135
134
  - lib/lederhosen/tasks/version.rb
136
135
  - lib/lederhosen/trimmer.rb
137
136
  - lib/lederhosen/version.rb
138
137
  - readme.md
138
+ - scripts/illumina_pipeline/.gitignore
139
+ - scripts/illumina_pipeline/Makefile
140
+ - scripts/illumina_pipeline/pipeline.sh
141
+ - scripts/illumina_pipeline/readme.md
142
+ - scripts/otu_ref_picking/readme.md
143
+ - scripts/readme.md
139
144
  - spec/cli_spec.rb
140
145
  - spec/data/ILT_L_9_B_001_1.txt.gz
141
146
  - spec/data/ILT_L_9_B_001_3.txt.gz
@@ -162,7 +167,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
162
167
  version: '0'
163
168
  segments:
164
169
  - 0
165
- hash: -1116066410733680786
170
+ hash: -1539752797284012594
166
171
  required_rubygems_version: !ruby/object:Gem::Requirement
167
172
  none: false
168
173
  requirements:
@@ -1,88 +0,0 @@
1
- ##
2
- # QUALITY TRIMMING
3
- #
4
-
5
- # This should probably be broken into its own module or command-line utility.
6
-
7
- module Lederhosen
8
- class CLI
9
-
10
- desc "trim",
11
- "trim reads based on quality scores"
12
-
13
- method_option :reads_dir, :type => :string, :required => true
14
- method_option :out_dir, :type => :string, :required => true
15
- method_option :left_trim, :type => :numeric, :default => 0
16
- method_option :read_type, :type => :string, :default => 'qseq'
17
- method_option :min_length, :type => :numeric, :default => 75
18
-
19
- def trim
20
- raw_reads = options[:reads_dir]
21
- out_dir = options[:out_dir]
22
- left_trim = options[:left_trim]
23
- read_type = options[:read_type]
24
- min_length = options[:min_length]
25
-
26
- ohai "trimming #{File.dirname(raw_reads)} and saving to #{out_dir}"
27
- run "mkdir -p #{out_dir}"
28
-
29
- raw_reads =
30
- if read_type == 'qseq'
31
- get_grouped_qseq_files(raw_reads)
32
- elsif read_type == 'fastq'
33
- r = Dir[raw_reads].map do |x|
34
- [ File.basename(x, '.fastq'), x ]
35
- end
36
- Hash[r]
37
- end
38
-
39
- if raw_reads.size == 0
40
- ohno 'glob matches no reads'
41
- end
42
-
43
- pbar = ProgressBar.new 'trimming', raw_reads.size
44
-
45
- raw_reads.each do |prefix, files|
46
-
47
- # get an output handle
48
- out = File.join(out_dir, "#{File.basename(prefix)}.fasta")
49
-
50
- # create the trimmed sequence generator
51
- trim_args = { :left_trim => left_trim, :min_length => min_length }
52
-
53
- trimmer =
54
- if read_type == 'qseq'
55
- Trimmer::QSEQTrimmer.new(*files, trim_args)
56
- elsif read_type == 'fastq'
57
- Trimmer::InterleavedTrimmer.new(files, trim_args)
58
- end
59
-
60
- # trim and write
61
- File.open(out, 'w') do |o|
62
- trimmer.each do |trimmed_record|
63
- o.puts trimmed_record
64
- end
65
- end # File.open
66
-
67
- pbar.inc
68
- end
69
-
70
- pbar.finish
71
-
72
- end
73
-
74
- no_tasks do
75
-
76
- # Function for grouping qseq files produced by splitting illumina
77
- # reads by barcode
78
- #
79
- # Filenames should look like this:
80
- # IL5_L_1_B_007_1.txt
81
- def get_grouped_qseq_files(glob='raw_reads/*.txt')
82
- Dir.glob(glob).group_by { |x| File.basename(x).split('_')[0..4].join('_') }
83
- end
84
-
85
- end # no_tasks
86
-
87
- end
88
- end
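Note: the removed `get_grouped_qseq_files` helper grouped QSEQ files by the first five underscore-separated fields of the file name, so that the reads belonging to one barcode ended up together. A small sketch of that grouping on an in-memory list follows. The file names are made up to follow the `IL5_L_1_B_007_1.txt` pattern from the comment, and the helper name `group_qseq_paths` is hypothetical; the original called `Dir.glob` on a path pattern instead of taking an array.

```ruby
# Sketch of the grouping rule used by the removed helper.
# Takes an array of paths rather than a glob so it runs without any files on disk.
def group_qseq_paths(paths)
  paths.group_by { |x| File.basename(x).split('_')[0..4].join('_') }
end

# Hypothetical file names following the pattern documented in the removed code.
paths = %w{
  raw_reads/IL5_L_1_B_007_1.txt
  raw_reads/IL5_L_1_B_007_3.txt
  raw_reads/IL5_L_1_B_008_1.txt
}

grouped = group_qseq_paths(paths)
grouped.each do |prefix, files|
  puts "#{prefix}: #{files.join(', ')}"
end
```

Each key is the shared prefix (for example `IL5_L_1_B_007`) and each value is the list of files for that prefix, which is the shape the removed `trim` task iterated over with `raw_reads.each do |prefix, files|`.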