bio-ngs 0.4.7.alpha.03 → 0.5.0

@@ -1,323 +0,0 @@
- = bio-ngs
-
- Provides a framework for handling NGS data with Bioruby.
-
- == What we want to do and support
- * SAMtools
- * BWA
- * Bowtie/TopHat/Cufflinks
-
- * Reporting: text and graphs
- * SGE?
-
- == Tasks
- We'll try to keep this list updated, but just in case, type 'biongs -T' to get the most up-to-date list.
- NOTE: We are working on these and other tasks; if you find a bug, please open an issue on GitHub.
- === bwa
-
- * biongs bwa:aln:long [FASTQ] --file-out=FILE_OUT --prefix=PREFIX # Run the alignment for LONG query sequences
- * biongs bwa:aln:short [FASTQ] --file-out=FILE_OUT --prefix=PREFIX # Run the alignment for SHORT query sequences
- * biongs bwa:index:long [FASTA] # Make the BWT index for a LONG FASTA database
- * biongs bwa:index:short [FASTA] # Make the BWT index for a SHORT FASTA database
- * biongs bwa:sam:paired --fastq=one two three --file-out=FILE_OUT --prefix=PREFIX --sai=one two three # Convert SAI alignment output into SAM format (paired ends)
- * biongs bwa:sam:single [SAI] --fastq=FASTQ --file-out=FILE_OUT --prefix=PREFIX # Convert SAI alignment output into SAM format (single end)
-
- === convert
- Most of these tasks create sub-processes to speed up conversions.
-
- * biongs convert:bam:extract_genes BAM GENES --ensembl-release=N # Extract GENES from bam. It connects to Ensembl Human, release 61
- * biongs convert:bam:sort BAM [PREFIX]
- * biongs convert:bcl:qseq:convert RUN OUTPUT [JOBS] # Convert a bcl dataset to qseq
- * biongs convert:illumina:fastq:trim_b FASTQ # Trim all the sequences on B qualities following Illumina's criteria. Refer to the CASAVA manual.
- * biongs convert:qseq:fastq:by_file FIRST OUTPUT # Convert a qseq file into fastq
- * biongs convert:qseq:fastq:by_lane LANE OUTPUT # Convert all the files in the current and descendant directories belonging to the specified lane to fastq. This command is specific for Illum...
- * biongs convert:qseq:fastq:by_lane_index LANE INDEX OUTPUT # Convert the qseq from a lane and index into a fastq file
- * biongs convert:qseq:fastq:samples_by_lane SAMPLES LANE OUTPUT # Convert the qseqs for each sample in a specific lane. SAMPLES is an array of index codes separated by commas; LANE is an integer
-
- === filter
- * biongs filter:by_list TABLE LIST # Extract from TABLE rows with a key in LIST
-
- === quality
-
- * biongs quality:boxplot FASTQ_QUALITY_STATS # plot reads quality as boxplot
- * biongs quality:fastq_stats FASTQ # Reports quality of FASTQ file
- * biongs quality:illumina_b_profile_raw FASTQ --read-length=N # perform a profile for reads coming from Illumina 1.5+ and write the report in a txt file
- * biongs quality:illumina_b_profile_svg FASTQ --read-length=N # perform a profile for reads coming from Illumina 1.5+
- * biongs quality:illumina_project_stats # Reports quality of FASTQ files in an Illumina project directory
- * biongs quality:reads FASTQ # perform quality check for NGS reads
- * biongs quality:reads_coverage FASTQ_QUALITY_STATS # plot reads coverage in bases
- * biongs quality:trim FASTQ # trim all the sequences
-
-
-
- === rna
-
- * biongs rna:idx2fasta INDEX FASTA # Create a fasta file from an indexed genome, using bowtie-inspect
- * biongs rna:tophat DIST INDEX OUTPUTDIR FASTQS # run tophat as from command line, default 6 processors
-
- === root
-
- * biongs project NAME
-
- === sff
-
- * biongs sff:extract [FILE] # Run sff_extract on a SFF file
-
-
- == TasksExamples
-
- === Conversion
-   biongs convert:bam:extract_genes your_original.bam BLID,GATA3,PTPRC --ensembl_release=61 --ensembl_specie=homo_sapiens
-
- === Filtering
- When you have mapped your reads to a reference genome, you can filter the output (GTF) to extract only the transcripts that meet your requirements. You can filter by length, by whether a transcript is multi-exon or mono-exon, by coverage, and by whether it is a brand new transcript, a new isoform of an already annotated gene, or an annotated transcript.
-
-   Scenario: filtering transcripts
-     Having a transcripts.gtf dataset generated from CufflinksQuantification
-     I want only the new transcripts (also with an annotated gene)
-     Which are multi exons
-     With a length greater than 1340
-     With minimum coverage greater than 10
-     Then I want to save them in my_filtered_data.gtf
-
-   biongs filter:cufflinks:transcripts your_original.gtf -m -l 1340 -c 10.0 -n -o my_filtered_data.gtf
-
- In some cases you need to extract only some of them, or to parse them with external programs. Biongs has a specific task for this:
-
-     Having my_filtered_data.gtf
-     Generated by "filtering transcripts"
-     I want to extract transcript number 10
-     Then I want to save it in BED format
-     Using UCSC notation
-
-   biongs filter:cufflinks:tra_at_idx my_filtered_data.gtf #of_the_transcript_to_retrieve -u
-
- The first time tra_at_idx is used, it takes longer than usual because it creates an internal index: a simple Hash, marshalled and dumped to a file named like the input with an idx suffix.
-
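The caching idea described for tra_at_idx can be sketched roughly as follows. This is only an illustration, not the code used by bio-ngs: the helper names (build_transcript_index, load_or_build_index), the choice of byte offsets as index values and the exact ".idx" file name are all assumptions made for the example.

```ruby
# Hypothetical helper: map each transcript's position (0-based) to the byte
# offset of its "transcript" line in a Cufflinks-style GTF file.
def build_transcript_index(gtf_path)
  index = {}
  File.open(gtf_path, "rb") do |gtf|
    until gtf.eof?
      offset = gtf.pos
      feature = gtf.readline.split("\t")[2] # third GTF column is the feature type
      index[index.size] = offset if feature == "transcript"
    end
  end
  index
end

# Hypothetical helper: load the cached index if it exists; otherwise build it
# once and dump it with Marshal so later calls are fast.
def load_or_build_index(gtf_path)
  idx_path = "#{gtf_path}.idx" # assumed naming: input name plus an idx suffix
  return File.open(idx_path, "rb") { |f| Marshal.load(f) } if File.exist?(idx_path)

  index = build_transcript_index(gtf_path)
  File.open(idx_path, "wb") { |f| Marshal.dump(index, f) }
  index
end
```

With such an index, retrieving one transcript becomes a seek to the stored offset instead of a scan of the whole file, which is why only the first call pays the indexing cost.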
-
- = ForDevelopers
-
- == Contribute
- === Clone Main Repository
- This command will create a local copy of the main repository
-   git clone https://github.com/helios/bioruby-ngs
- === Install Bioinformatics Tools into the repository directory
-   rake devenv:bio_tools
-
- == Wrapper
- Bio-Ngs comes with a built-in wrapper that maps binary programs directly into BioRuby as objects. From a wrapper object it is also possible to create a Thor task, with a lot of sugar.
- === Wrapping a binary
-
- We want to wrap TopHat, the famous tool for NGS analyses.
- * The first step is to include the wrapping module.
- * Set the name of the binary to call. Note: if you do not set the program name, it will not be possible to create a Thor task or run the program.
- * Add the options that the binary accepts. It is usually preferable to declare all of them; you can discover them by typing 'your_program_name -h'.
-
-   module Bio
-     module Ngs
-       class Tophat
-         include Bio::Command::Wrapper
-
-         set_program Bio::Ngs::Utils.binary("tophat/tophat")
-         add_option "output-dir", :type => :string, :aliases => '-o'
-         add_option "min-anchor", :type => :numeric, :aliases => '-a'
-         add_option "splice-mismatches", :type => :numeric, :aliases => '-m'
-         # all other options that you want to expose with the wrapping
-       end # Tophat
-     end # Ngs
-   end # Bio
-
- It is possible to specify in the class
-   use_aliases
- if you want to give priority to the short notation, or if your program has only the short notation but you want to extend the task with the long one as well.
- We defined a new property for add_option called
-   :collapse => true
- It is used only with use_aliases and collapses the passed parameter onto the short notation. This example comes from the fastx.rb wrapper (note the :collapse option on the :output and :input rows):
-   module Bio
-     module Ngs
-       module Fastx
-         class Trim
-           include Bio::Command::Wrapper
-           set_program Bio::Ngs::Utils.binary("fastq_quality_trimmer")
-           use_aliases
-           add_option :min_size, :type=>:numeric, :default=>20, :aliases => "-l", :desc=>"Minimum length - sequences shorter than this (after trimming)
-                                    will be discarded. Default = 0 = no minimum length."
-           add_option :min_quality, :type=>:numeric, :default=>10, :aliases => "-t", :desc=>"Quality threshold - nucleotides with lower
-                                       quality will be trimmed (from the end of the sequence)."
-           add_option :output, :type=>:string, :aliases => "-o", :desc => "FASTQ output file.", :collapse=>true
-           add_option :input, :type=>:string, :aliases => "-i", :desc => "FASTQ input file.", :collapse=>true
-           add_option :gzip, :type => :boolean, :aliases => "-z", :desc => "Compress output with GZIP."
-           add_option :verbose, :type => :boolean, :aliases => "-v", :desc => "[-v] = Verbose - report number of sequences.
-                            If [-o] is specified, report will be printed to STDOUT.
-                            If [-o] is not specified (and output goes to STDOUT),
-                            report will be printed to STDERR."
-           add_option :quality_type, :type=>:numeric, :default => 33, :aliases => "-Q", :desc=>"Quality of fastq file"
-         end
-       end
-     end
-   end
- fastq_quality_trimmer accepts only short-notation options, and we need to pass an input file. For some reason the popen call used internally does not work properly with the standard behavior, so with :collapse=>true the application is called as:
-   fastq_quality_trimmer -l 20 -t 10 -Q 33 -iinput_file_name.fastq -ooutput_file_name.fastq_trim
- Running the program by hand from the command line with a space as separator after -i and -o works as expected. :collapse is a workaround for this problem.
-
-
-
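To make the effect of :collapse concrete, here is a small stand-alone sketch of how a command line could be composed from short aliases. It does not reproduce the internals of Bio::Command::Wrapper: the OPTIONS table and the build_argv method are hypothetical names invented for this illustration, and only the option aliases are taken from the Trim example above.

```ruby
# Hypothetical option table: option name => short alias and collapse flag.
OPTIONS = {
  min_size:    { alias: "-l" },
  min_quality: { alias: "-t" },
  input:       { alias: "-i", collapse: true },
  output:      { alias: "-o", collapse: true }
}.freeze

# Build an argv array from the program name and a params hash.
# Collapsed options become a single token ("-iFILE"); the others are
# emitted as two tokens ("-l", "20").
def build_argv(program, params, options = OPTIONS)
  params.each_with_object([program]) do |(name, value), argv|
    spec = options.fetch(name)
    if spec[:collapse]
      argv << "#{spec[:alias]}#{value}"
    else
      argv << spec[:alias] << value.to_s
    end
  end
end

argv = build_argv("fastq_quality_trimmer",
                  { min_size: 20, min_quality: 10,
                    input: "reads.fastq", output: "reads.fastq_trim" })
puts argv.join(" ")
# prints: fastq_quality_trimmer -l 20 -t 10 -ireads.fastq -oreads.fastq_trim
```

Keeping the collapsed value in the same token as its flag means the child process receives it as one argument, which is the behavior the workaround relies on.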
- If your program works like git, with a main program and a sub-program for each feature, you can specify the sub-program name with
-   set_sub_program "sub_name"
- The wrapper will run the command by composing:
-   set_program set_sub_program options arguments
- A practical example of this behavior is samtools, which has multiple sub-programs: view, merge, sort, ....
- SamTools is a particular case, because biongs uses both bio-samtools (a binding built with FFI) and the wrapper: the merge function was too complicated for the binding, or at least we did not spend enough time on it, so we wrapped that functionality instead.
-
- This step is very similar to defining a Thor task; add_option is inspired by Thor.
- You can then use this binary from a BioRuby script as well, just by calling:
-   tophat = Bio::Ngs::Tophat.new
-   tophat.params = {"mate-inner-dist"=>dist, "output-dir"=>outputdir, "num-threads"=>1, "solexa1.3-quals"=>true}
-   # very important: you can only pass parameters whose names have been previously declared in the Tophat class.
-   # if you want to pass undeclared parameters/options, please use arguments.
-   tophat.run :arguments=>[index, "#{fastqs}" ]
-
- === Define the Task
- With our new wrapper, let's define a Thor task on the fly
-
-   class MyTasks < Thor
-     desc "tophat DIST INDEX OUTPUTDIR FASTQS", "run tophat as from command line, default 6 processors"
-     Bio::Ngs::Tophat.new.thor_task(self, :tophat) do |wrapper, task, dist, index, outputdir, fastqs|
-       wrapper.params = {"mate-inner-dist"=>dist, "output-dir"=>outputdir, "num-threads"=>1, "solexa1.3-quals"=>true}
-       wrapper.run :arguments=>[index, "#{fastqs}" ], :separator=>"="
-       # your tasks here
-     end
-   end
-
- Now if you list the tasks with 'thor -T' you will see the new task.
-
- You can also create a new wrapper, configure it, and run it from inside a Thor task, as in 'biongs quality:boxplot'
-
-   desc "boxplot FASTQ_QUALITY_STATS", "plot reads quality as boxplot"
-   method_option :title, :type=>:string, :aliases =>"-t", :desc => "Title (usually the solexa file name) - will be plotted on the graph."
-   method_option :output, :type=>:string, :aliases =>"-o", :desc => "Output file name. default is input file_name with .png."
-   def boxplot(fastq_quality_stats)
-     output_file = options.output || "#{fastq_quality_stats}.png"
-     boxplot = Bio::Ngs::Fastx::ReadsBoxPlot.new
-     boxplot.params={input:fastq_quality_stats, output:output_file}
-     boxplot.run
-   end
-
- === Override the run command when the binary doesn't behave normally
-   module Bio
-     module Ngs
-       module Samtools
-         class View
-           include Bio::Command::Wrapper
-           set_program Bio::Ngs::Utils.binary("samtools")
-           add_option "output", :type => :string, :aliases => '-o'
-
-           alias :original_run :run
-           def run(opts = {:options=>{}, :arguments=>[], :output_file=>nil, :separator=>"="})
-             opts[:arguments].insert(0,"view")
-             opts[:arguments].insert(1,"-b")
-             opts[:arguments].insert(2,"-o")
-             original_run(opts)
-           end
-         end # View
-       end # Samtools
-     end # Ngs
-   end # Bio
-
- ==== Disable binary check at load time
- When a wrapper is defined, BioNGS verifies that the program is installed on the local system; if it is not, it shows a warning message and the task is disabled by default. This check is made for each wrapped binary, so loading BioNGS can take a long time the first time.
- To skip this check, define the environment variable BIONGS_SKIP_CHECK_BINARIES and assign it one of these terms: "true yes ok 1"
-   export BIONGS_SKIP_CHECK_BINARIES=true
- You can also add this setting to the .bashrc or .profile in the user's home directory.
-
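A check of this kind could look roughly like the sketch below. It is not taken from the bio-ngs source: skip_binary_check? and binary_available? are hypothetical helper names, and matching the variable exactly (case-sensitive) against the four listed terms is an assumption.

```ruby
# Terms listed in the text as enabling the skip.
SKIP_VALUES = %w[true yes ok 1].freeze

# Hypothetical helper: true when the environment asks to skip the check.
def skip_binary_check?(env = ENV)
  SKIP_VALUES.include?(env["BIONGS_SKIP_CHECK_BINARIES"].to_s)
end

# Hypothetical helper: is an executable with this name somewhere on PATH?
def binary_available?(name, env = ENV)
  env.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    File.executable?(File.join(dir, name))
  end
end

# A wrapped task would stay enabled when the check is skipped
# or when the binary is actually found.
def task_enabled?(name, env = ENV)
  skip_binary_check?(env) || binary_available?(name, env)
end
```

Passing the environment as a parameter keeps the helpers easy to exercise without touching the real ENV.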
- == Features
- === Iterators for output files
-
- Example: CuffDiff. In this class it is possible to define an iterator for a specific set of output files: genes, isoforms, tss_groups, cds.
- Activating an iterator is just a matter of calling a class method in the class definition
-   class Bio::Ngs::Cufflinks::Diff
-     #... all the previous definitions
-     #define iterators
-     add_iterator_for :genes
-     add_iterator_for :isoforms
-     add_iterator_for :cds
-     add_iterator_for :tss_groups
-   end
-
- This is an example of CuffDiff, parsing genes.fpkm_tracking file:
-
-   Bio::Ngs::Cufflinks::Diff.foreach_gene_tracked("path_to_cuffdiff_output_directory") do |gene_fpkm_track|
-     expression_profile = (1..7).map do |sample_idx|
-       gene_fpkm_track["q#{sample_idx}_FPKM"].to_f
-     end
-
-     #do your stuff accessing this tabular file with gene_fpkm_track["name of the field"]
-   end
-
- In this case the CSV library is used internally to parse the file easily; it trades performance on huge files for flexibility.
-
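For readers who want to see what header-keyed access to such a file looks like with Ruby's standard CSV library, here is a minimal sketch. It is not the bio-ngs iterator itself: each_tracked_row is a hypothetical name, and the file is assumed to be tab-separated with a header row.

```ruby
require "csv"

# Hypothetical helper: iterate over a tab-separated tracking file, yielding
# each row as a CSV::Row that can be accessed by header name.
def each_tracked_row(path)
  return enum_for(:each_tracked_row, path) unless block_given?

  CSV.foreach(path, headers: true, col_sep: "\t") { |row| yield row }
end

# Usage (file name and column name follow the example above):
#   each_tracked_row("genes.fpkm_tracking") { |row| puts row["q1_FPKM"].to_f }
```

Every row is turned into an object keyed by header, which is convenient but costs more per line than splitting on tabs by hand; this is the flexibility versus performance trade-off mentioned above.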
- == Loading or Not tasks from outside
- If your external library or binary defines the constant LoadBaseTasks in Bio::Ngs and then requires 'bio-ngs', only the libraries are loaded, not bio-ngs's tasks.
-   module Bio
-     module Ngs
-       LoadBaseTasks = true
-     end
-   end
-
- This is useful if you want to develop a separate binary which uses the bio-ngs libraries.
- It is not yet possible to define a list of desired tasks to load.
-
- === Notes
- * It's possible to add more sugar and we are working hard on it
- * aliases are not well supported at this time. ToDo
-
- = REQUIREMENTS
- * http://hannonlab.cshl.edu/fastx_toolkit/ (the gem tries to install this tool by itself)
- * http://www.gnuplot.info/ tested on version 4.6
- * libxslt1-dev
-
- Please follow the instructions for your own distribution/operating system
-
- = TODO
- * Write Tutorial for Wrapper & Pipes
- * Write Tutorial for handling Illumina/Fastq.gz with BioNGS Bio::Ngs::Illumina::FastqGz
- * Report the version of every software installed/used from bio-ngs
- * Develop fastq quality reports with RibuVis ?
- * Write documentation
- * DONE: Wrapper: better support for aliases and Wrapper#params
- * Convert: refactor code to use ::Daemons
- * DONE:misk_tasks? Extract genes/regions of interest from a bam file and create a smaller bam
- * BRANCH:misk_tasks Explore possibility to use DelayedJobs
- * biongs ann:ensembl:gtf:features:categorize GTF GTF categorize also by chromosome not only by BioType
- * configuration file input, output, experimental design
- * DONE: include fastx toolkit, download and compile
- * ANSWER: how to put in background tasks that can be run in parallel? Use Parallel (see code for quality:illumina_project_stats)
- * is it possible to establish a relation between input data and output data? like fastq task_selected output/s
- * add description for developers on how to include new external tools with versions.yaml
-
- = ChangeLog
- * 2011-05-26: Bump to version 0.2.0 Complete support for installing fastx and possibly other downloadable tools, inside the gem
- * 2011-05-25: Bump to version 0.1.0 Update Cufflinks toolkit 1.0.2. Added initial support to fastx tool kit (binaries not included)
- * 2011-04-08: Tasks for filtering Ensembl annotation and create classifications. (misk_tasks branch)
-
-
- = Contributing to bio-ngs
-
- Please do not hesitate to contact us:
-
- Raoul J.P. Bonnal, http://github.com/helios, r -at- bioruby -dot- org
- Francesco Strozzi, http://github.com/fstrozzi
-
- * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
- * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it
- * Fork the project
- * Start a feature/bugfix branch
- * Commit and push until you are happy with your contribution
- * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
- * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
-
- == Copyright
-
- Copyright (c) 2011 Francesco Strozzi and Raoul J.P. Bonnal. See LICENSE.txt for
- further details.
-
@@ -1,34 +0,0 @@
- class BioNgs
-   # Rake tasks inspired by Jeweler approach
-   # Include tasks used during gem installation.
-   # Why? If a developer wants to have a ready-to-go environment
-   # with bioinformatics software supported by biongs in its cloned directory.
-   class BioNgsTasks < ::Rake::TaskLib
-     attr_accessor :biongs
-
-     def initialize
-       yield self if block_given?
-
-       define
-     end
-
-     def biongs
-       @biongs ||= self
-     end
-
-     def define
-       namespace :devenv do
-         desc "install external bioinformatics tools, for development, locally -in this directory, cloned from github?-"
-         task :bio_tools do
-           Dir.chdir("ext") do
-             load 'mkrf_conf.rb'
-             `rake -f Rakefile`
-             FileUtils.remove("Rakefile")
-           end
-         end
-       end
-
-       task :devenv => 'devenv:bio_tools'
-     end
-   end
- end