miga-base 0.7.26.0 → 1.0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (337)
  1. checksums.yaml +4 -4
  2. data/lib/miga/_data/aai-intax.blast.tsv.gz +0 -0
  3. data/lib/miga/_data/aai-intax.diamond.tsv.gz +0 -0
  4. data/lib/miga/_data/aai-novel.blast.tsv.gz +0 -0
  5. data/lib/miga/_data/aai-novel.diamond.tsv.gz +0 -0
  6. data/lib/miga/cli/action/classify_wf.rb +2 -2
  7. data/lib/miga/cli/action/derep_wf.rb +1 -1
  8. data/lib/miga/cli/action/doctor.rb +57 -14
  9. data/lib/miga/cli/action/doctor/base.rb +47 -23
  10. data/lib/miga/cli/action/init.rb +11 -7
  11. data/lib/miga/cli/action/init/files_helper.rb +1 -0
  12. data/lib/miga/cli/action/ncbi_get.rb +3 -3
  13. data/lib/miga/cli/action/tax_dist.rb +2 -2
  14. data/lib/miga/cli/action/wf.rb +5 -4
  15. data/lib/miga/common.rb +1 -0
  16. data/lib/miga/daemon.rb +11 -4
  17. data/lib/miga/dataset/result.rb +10 -6
  18. data/lib/miga/json.rb +5 -4
  19. data/lib/miga/metadata.rb +5 -1
  20. data/lib/miga/parallel.rb +36 -0
  21. data/lib/miga/project.rb +8 -8
  22. data/lib/miga/project/base.rb +4 -4
  23. data/lib/miga/project/result.rb +2 -2
  24. data/lib/miga/sqlite.rb +10 -2
  25. data/lib/miga/version.rb +23 -9
  26. data/scripts/aai_distances.bash +16 -18
  27. data/scripts/ani_distances.bash +16 -17
  28. data/scripts/assembly.bash +31 -16
  29. data/scripts/haai_distances.bash +3 -27
  30. data/scripts/miga.bash +6 -4
  31. data/scripts/p.bash +1 -1
  32. data/scripts/read_quality.bash +9 -18
  33. data/scripts/trimmed_fasta.bash +14 -30
  34. data/scripts/trimmed_reads.bash +36 -36
  35. data/test/parallel_test.rb +31 -0
  36. data/test/project_test.rb +2 -1
  37. data/test/remote_dataset_test.rb +1 -1
  38. data/utils/FastAAI/00.Libraries/01.SCG_HMMs/Archaea_SCG.hmm +41964 -0
  39. data/utils/FastAAI/00.Libraries/01.SCG_HMMs/Bacteria_SCG.hmm +32439 -0
  40. data/utils/FastAAI/00.Libraries/01.SCG_HMMs/Complete_SCG_DB.hmm +62056 -0
  41. data/utils/FastAAI/FastAAI/FastAAI +1336 -0
  42. data/utils/FastAAI/README.md +84 -0
  43. data/utils/FastAAI/kAAI_v1.0_virus.py +1296 -0
  44. data/utils/distance/commands.rb +1 -0
  45. data/utils/distance/database.rb +0 -1
  46. data/utils/distance/runner.rb +2 -4
  47. data/utils/enveomics/Docs/recplot2.md +244 -0
  48. data/utils/enveomics/Examples/aai-matrix.bash +66 -0
  49. data/utils/enveomics/Examples/ani-matrix.bash +66 -0
  50. data/utils/enveomics/Examples/essential-phylogeny.bash +105 -0
  51. data/utils/enveomics/Examples/unus-genome-phylogeny.bash +100 -0
  52. data/utils/enveomics/LICENSE.txt +73 -0
  53. data/utils/enveomics/Makefile +52 -0
  54. data/utils/enveomics/Manifest/Tasks/aasubs.json +103 -0
  55. data/utils/enveomics/Manifest/Tasks/blasttab.json +786 -0
  56. data/utils/enveomics/Manifest/Tasks/distances.json +161 -0
  57. data/utils/enveomics/Manifest/Tasks/fasta.json +802 -0
  58. data/utils/enveomics/Manifest/Tasks/fastq.json +291 -0
  59. data/utils/enveomics/Manifest/Tasks/graphics.json +126 -0
  60. data/utils/enveomics/Manifest/Tasks/mapping.json +137 -0
  61. data/utils/enveomics/Manifest/Tasks/ogs.json +382 -0
  62. data/utils/enveomics/Manifest/Tasks/other.json +906 -0
  63. data/utils/enveomics/Manifest/Tasks/remote.json +355 -0
  64. data/utils/enveomics/Manifest/Tasks/sequence-identity.json +638 -0
  65. data/utils/enveomics/Manifest/Tasks/tables.json +308 -0
  66. data/utils/enveomics/Manifest/Tasks/trees.json +68 -0
  67. data/utils/enveomics/Manifest/Tasks/variants.json +111 -0
  68. data/utils/enveomics/Manifest/categories.json +165 -0
  69. data/utils/enveomics/Manifest/examples.json +154 -0
  70. data/utils/enveomics/Manifest/tasks.json +4 -0
  71. data/utils/enveomics/Pipelines/assembly.pbs/CONFIG.mock.bash +69 -0
  72. data/utils/enveomics/Pipelines/assembly.pbs/FastA.N50.pl +1 -0
  73. data/utils/enveomics/Pipelines/assembly.pbs/FastA.filterN.pl +1 -0
  74. data/utils/enveomics/Pipelines/assembly.pbs/FastA.length.pl +1 -0
  75. data/utils/enveomics/Pipelines/assembly.pbs/README.md +189 -0
  76. data/utils/enveomics/Pipelines/assembly.pbs/RUNME-2.bash +112 -0
  77. data/utils/enveomics/Pipelines/assembly.pbs/RUNME-3.bash +23 -0
  78. data/utils/enveomics/Pipelines/assembly.pbs/RUNME-4.bash +44 -0
  79. data/utils/enveomics/Pipelines/assembly.pbs/RUNME.bash +50 -0
  80. data/utils/enveomics/Pipelines/assembly.pbs/kSelector.R +37 -0
  81. data/utils/enveomics/Pipelines/assembly.pbs/newbler.pbs +68 -0
  82. data/utils/enveomics/Pipelines/assembly.pbs/newbler_preparator.pl +49 -0
  83. data/utils/enveomics/Pipelines/assembly.pbs/soap.pbs +80 -0
  84. data/utils/enveomics/Pipelines/assembly.pbs/stats.pbs +57 -0
  85. data/utils/enveomics/Pipelines/assembly.pbs/velvet.pbs +63 -0
  86. data/utils/enveomics/Pipelines/blast.pbs/01.pbs.bash +38 -0
  87. data/utils/enveomics/Pipelines/blast.pbs/02.pbs.bash +73 -0
  88. data/utils/enveomics/Pipelines/blast.pbs/03.pbs.bash +21 -0
  89. data/utils/enveomics/Pipelines/blast.pbs/BlastTab.recover_job.pl +72 -0
  90. data/utils/enveomics/Pipelines/blast.pbs/CONFIG.mock.bash +98 -0
  91. data/utils/enveomics/Pipelines/blast.pbs/FastA.split.pl +1 -0
  92. data/utils/enveomics/Pipelines/blast.pbs/README.md +127 -0
  93. data/utils/enveomics/Pipelines/blast.pbs/RUNME.bash +109 -0
  94. data/utils/enveomics/Pipelines/blast.pbs/TASK.check.bash +128 -0
  95. data/utils/enveomics/Pipelines/blast.pbs/TASK.dry.bash +16 -0
  96. data/utils/enveomics/Pipelines/blast.pbs/TASK.eo.bash +22 -0
  97. data/utils/enveomics/Pipelines/blast.pbs/TASK.pause.bash +26 -0
  98. data/utils/enveomics/Pipelines/blast.pbs/TASK.run.bash +89 -0
  99. data/utils/enveomics/Pipelines/blast.pbs/sentinel.pbs.bash +29 -0
  100. data/utils/enveomics/Pipelines/idba.pbs/README.md +49 -0
  101. data/utils/enveomics/Pipelines/idba.pbs/RUNME.bash +95 -0
  102. data/utils/enveomics/Pipelines/idba.pbs/run.pbs +56 -0
  103. data/utils/enveomics/Pipelines/trim.pbs/README.md +54 -0
  104. data/utils/enveomics/Pipelines/trim.pbs/RUNME.bash +70 -0
  105. data/utils/enveomics/Pipelines/trim.pbs/run.pbs +130 -0
  106. data/utils/enveomics/README.md +42 -0
  107. data/utils/enveomics/Scripts/AAsubs.log2ratio.rb +171 -0
  108. data/utils/enveomics/Scripts/Aln.cat.rb +221 -0
  109. data/utils/enveomics/Scripts/Aln.convert.pl +35 -0
  110. data/utils/enveomics/Scripts/AlphaDiversity.pl +152 -0
  111. data/utils/enveomics/Scripts/BedGraph.tad.rb +93 -0
  112. data/utils/enveomics/Scripts/BedGraph.window.rb +71 -0
  113. data/utils/enveomics/Scripts/BlastPairwise.AAsubs.pl +102 -0
  114. data/utils/enveomics/Scripts/BlastTab.addlen.rb +63 -0
  115. data/utils/enveomics/Scripts/BlastTab.advance.bash +48 -0
  116. data/utils/enveomics/Scripts/BlastTab.best_hit_sorted.pl +55 -0
  117. data/utils/enveomics/Scripts/BlastTab.catsbj.pl +104 -0
  118. data/utils/enveomics/Scripts/BlastTab.cogCat.rb +76 -0
  119. data/utils/enveomics/Scripts/BlastTab.filter.pl +47 -0
  120. data/utils/enveomics/Scripts/BlastTab.kegg_pep2path_rest.pl +194 -0
  121. data/utils/enveomics/Scripts/BlastTab.metaxaPrep.pl +104 -0
  122. data/utils/enveomics/Scripts/BlastTab.pairedHits.rb +157 -0
  123. data/utils/enveomics/Scripts/BlastTab.recplot2.R +48 -0
  124. data/utils/enveomics/Scripts/BlastTab.seqdepth.pl +86 -0
  125. data/utils/enveomics/Scripts/BlastTab.seqdepth_ZIP.pl +119 -0
  126. data/utils/enveomics/Scripts/BlastTab.seqdepth_nomedian.pl +86 -0
  127. data/utils/enveomics/Scripts/BlastTab.subsample.pl +47 -0
  128. data/utils/enveomics/Scripts/BlastTab.sumPerHit.pl +114 -0
  129. data/utils/enveomics/Scripts/BlastTab.taxid2taxrank.pl +90 -0
  130. data/utils/enveomics/Scripts/BlastTab.topHits_sorted.rb +101 -0
  131. data/utils/enveomics/Scripts/Chao1.pl +97 -0
  132. data/utils/enveomics/Scripts/CharTable.classify.rb +234 -0
  133. data/utils/enveomics/Scripts/EBIseq2tax.rb +83 -0
  134. data/utils/enveomics/Scripts/FastA.N50.pl +60 -0
  135. data/utils/enveomics/Scripts/FastA.extract.rb +152 -0
  136. data/utils/enveomics/Scripts/FastA.filter.pl +52 -0
  137. data/utils/enveomics/Scripts/FastA.filterLen.pl +28 -0
  138. data/utils/enveomics/Scripts/FastA.filterN.pl +60 -0
  139. data/utils/enveomics/Scripts/FastA.fragment.rb +100 -0
  140. data/utils/enveomics/Scripts/FastA.gc.pl +42 -0
  141. data/utils/enveomics/Scripts/FastA.interpose.pl +93 -0
  142. data/utils/enveomics/Scripts/FastA.length.pl +38 -0
  143. data/utils/enveomics/Scripts/FastA.mask.rb +89 -0
  144. data/utils/enveomics/Scripts/FastA.per_file.pl +36 -0
  145. data/utils/enveomics/Scripts/FastA.qlen.pl +57 -0
  146. data/utils/enveomics/Scripts/FastA.rename.pl +65 -0
  147. data/utils/enveomics/Scripts/FastA.revcom.pl +23 -0
  148. data/utils/enveomics/Scripts/FastA.sample.rb +98 -0
  149. data/utils/enveomics/Scripts/FastA.slider.pl +85 -0
  150. data/utils/enveomics/Scripts/FastA.split.pl +55 -0
  151. data/utils/enveomics/Scripts/FastA.split.rb +79 -0
  152. data/utils/enveomics/Scripts/FastA.subsample.pl +131 -0
  153. data/utils/enveomics/Scripts/FastA.tag.rb +65 -0
  154. data/utils/enveomics/Scripts/FastA.toFastQ.rb +69 -0
  155. data/utils/enveomics/Scripts/FastA.wrap.rb +48 -0
  156. data/utils/enveomics/Scripts/FastQ.filter.pl +54 -0
  157. data/utils/enveomics/Scripts/FastQ.interpose.pl +90 -0
  158. data/utils/enveomics/Scripts/FastQ.maskQual.rb +89 -0
  159. data/utils/enveomics/Scripts/FastQ.offset.pl +90 -0
  160. data/utils/enveomics/Scripts/FastQ.split.pl +53 -0
  161. data/utils/enveomics/Scripts/FastQ.tag.rb +70 -0
  162. data/utils/enveomics/Scripts/FastQ.test-error.rb +81 -0
  163. data/utils/enveomics/Scripts/FastQ.toFastA.awk +24 -0
  164. data/utils/enveomics/Scripts/GFF.catsbj.pl +127 -0
  165. data/utils/enveomics/Scripts/GenBank.add_fields.rb +84 -0
  166. data/utils/enveomics/Scripts/HMM.essential.rb +351 -0
  167. data/utils/enveomics/Scripts/HMM.haai.rb +168 -0
  168. data/utils/enveomics/Scripts/HMMsearch.extractIds.rb +83 -0
  169. data/utils/enveomics/Scripts/JPlace.distances.rb +88 -0
  170. data/utils/enveomics/Scripts/JPlace.to_iToL.rb +320 -0
  171. data/utils/enveomics/Scripts/M5nr.getSequences.rb +81 -0
  172. data/utils/enveomics/Scripts/MeTaxa.distribution.pl +198 -0
  173. data/utils/enveomics/Scripts/MyTaxa.fragsByTax.pl +35 -0
  174. data/utils/enveomics/Scripts/MyTaxa.seq-taxrank.rb +49 -0
  175. data/utils/enveomics/Scripts/NCBIacc2tax.rb +92 -0
  176. data/utils/enveomics/Scripts/Newick.autoprune.R +27 -0
  177. data/utils/enveomics/Scripts/RAxML-EPA.to_iToL.pl +228 -0
  178. data/utils/enveomics/Scripts/RecPlot2.compareIdentities.R +32 -0
  179. data/utils/enveomics/Scripts/RefSeq.download.bash +48 -0
  180. data/utils/enveomics/Scripts/SRA.download.bash +55 -0
  181. data/utils/enveomics/Scripts/TRIBS.plot-test.R +36 -0
  182. data/utils/enveomics/Scripts/TRIBS.test.R +39 -0
  183. data/utils/enveomics/Scripts/Table.barplot.R +31 -0
  184. data/utils/enveomics/Scripts/Table.df2dist.R +30 -0
  185. data/utils/enveomics/Scripts/Table.filter.pl +61 -0
  186. data/utils/enveomics/Scripts/Table.merge.pl +77 -0
  187. data/utils/enveomics/Scripts/Table.prefScore.R +60 -0
  188. data/utils/enveomics/Scripts/Table.replace.rb +69 -0
  189. data/utils/enveomics/Scripts/Table.round.rb +63 -0
  190. data/utils/enveomics/Scripts/Table.split.pl +57 -0
  191. data/utils/enveomics/Scripts/Taxonomy.silva2ncbi.rb +227 -0
  192. data/utils/enveomics/Scripts/VCF.KaKs.rb +147 -0
  193. data/utils/enveomics/Scripts/VCF.SNPs.rb +88 -0
  194. data/utils/enveomics/Scripts/aai.rb +419 -0
  195. data/utils/enveomics/Scripts/ani.rb +362 -0
  196. data/utils/enveomics/Scripts/anir.rb +137 -0
  197. data/utils/enveomics/Scripts/clust.rand.rb +102 -0
  198. data/utils/enveomics/Scripts/gi2tax.rb +103 -0
  199. data/utils/enveomics/Scripts/in_silico_GA_GI.pl +96 -0
  200. data/utils/enveomics/Scripts/lib/data/dupont_2012_essential.hmm.gz +0 -0
  201. data/utils/enveomics/Scripts/lib/data/lee_2019_essential.hmm.gz +0 -0
  202. data/utils/enveomics/Scripts/lib/enveomics.R +1 -0
  203. data/utils/enveomics/Scripts/lib/enveomics_rb/anir.rb +293 -0
  204. data/utils/enveomics/Scripts/lib/enveomics_rb/bm_set.rb +175 -0
  205. data/utils/enveomics/Scripts/lib/enveomics_rb/enveomics.rb +24 -0
  206. data/utils/enveomics/Scripts/lib/enveomics_rb/errors.rb +17 -0
  207. data/utils/enveomics/Scripts/lib/enveomics_rb/gmm_em.rb +30 -0
  208. data/utils/enveomics/Scripts/lib/enveomics_rb/jplace.rb +253 -0
  209. data/utils/enveomics/Scripts/lib/enveomics_rb/match.rb +63 -0
  210. data/utils/enveomics/Scripts/lib/enveomics_rb/og.rb +182 -0
  211. data/utils/enveomics/Scripts/lib/enveomics_rb/rbm.rb +49 -0
  212. data/utils/enveomics/Scripts/lib/enveomics_rb/remote_data.rb +74 -0
  213. data/utils/enveomics/Scripts/lib/enveomics_rb/seq_range.rb +237 -0
  214. data/utils/enveomics/Scripts/lib/enveomics_rb/stats.rb +3 -0
  215. data/utils/enveomics/Scripts/lib/enveomics_rb/stats/rand.rb +31 -0
  216. data/utils/enveomics/Scripts/lib/enveomics_rb/stats/sample.rb +152 -0
  217. data/utils/enveomics/Scripts/lib/enveomics_rb/utils.rb +73 -0
  218. data/utils/enveomics/Scripts/lib/enveomics_rb/vcf.rb +135 -0
  219. data/utils/enveomics/Scripts/ogs.annotate.rb +88 -0
  220. data/utils/enveomics/Scripts/ogs.core-pan.rb +160 -0
  221. data/utils/enveomics/Scripts/ogs.extract.rb +125 -0
  222. data/utils/enveomics/Scripts/ogs.mcl.rb +186 -0
  223. data/utils/enveomics/Scripts/ogs.rb +104 -0
  224. data/utils/enveomics/Scripts/ogs.stats.rb +131 -0
  225. data/utils/enveomics/Scripts/rbm-legacy.rb +172 -0
  226. data/utils/enveomics/Scripts/rbm.rb +100 -0
  227. data/utils/enveomics/Scripts/sam.filter.rb +148 -0
  228. data/utils/enveomics/Tests/Makefile +10 -0
  229. data/utils/enveomics/Tests/Mgen_M2288.faa +3189 -0
  230. data/utils/enveomics/Tests/Mgen_M2288.fna +8282 -0
  231. data/utils/enveomics/Tests/Mgen_M2321.fna +8288 -0
  232. data/utils/enveomics/Tests/Nequ_Kin4M.faa +2970 -0
  233. data/utils/enveomics/Tests/Xanthomonas_oryzae-PilA.tribs.Rdata +0 -0
  234. data/utils/enveomics/Tests/Xanthomonas_oryzae-PilA.txt +7 -0
  235. data/utils/enveomics/Tests/Xanthomonas_oryzae.aai-mat.tsv +17 -0
  236. data/utils/enveomics/Tests/Xanthomonas_oryzae.aai.tsv +137 -0
  237. data/utils/enveomics/Tests/a_mg.cds-go.blast.tsv +123 -0
  238. data/utils/enveomics/Tests/a_mg.reads-cds.blast.tsv +200 -0
  239. data/utils/enveomics/Tests/a_mg.reads-cds.counts.tsv +55 -0
  240. data/utils/enveomics/Tests/alkB.nwk +1 -0
  241. data/utils/enveomics/Tests/anthrax-cansnp-data.tsv +13 -0
  242. data/utils/enveomics/Tests/anthrax-cansnp-key.tsv +17 -0
  243. data/utils/enveomics/Tests/hiv1.faa +59 -0
  244. data/utils/enveomics/Tests/hiv1.fna +134 -0
  245. data/utils/enveomics/Tests/hiv2.faa +70 -0
  246. data/utils/enveomics/Tests/hiv_mix-hiv1.blast.tsv +233 -0
  247. data/utils/enveomics/Tests/hiv_mix-hiv1.blast.tsv.lim +1 -0
  248. data/utils/enveomics/Tests/hiv_mix-hiv1.blast.tsv.rec +233 -0
  249. data/utils/enveomics/Tests/phyla_counts.tsv +10 -0
  250. data/utils/enveomics/Tests/primate_lentivirus.ogs +11 -0
  251. data/utils/enveomics/Tests/primate_lentivirus.rbm/hiv1-hiv1.rbm +9 -0
  252. data/utils/enveomics/Tests/primate_lentivirus.rbm/hiv1-hiv2.rbm +8 -0
  253. data/utils/enveomics/Tests/primate_lentivirus.rbm/hiv1-siv.rbm +6 -0
  254. data/utils/enveomics/Tests/primate_lentivirus.rbm/hiv2-hiv2.rbm +9 -0
  255. data/utils/enveomics/Tests/primate_lentivirus.rbm/hiv2-siv.rbm +6 -0
  256. data/utils/enveomics/Tests/primate_lentivirus.rbm/siv-siv.rbm +6 -0
  257. data/utils/enveomics/build_enveomics_r.bash +45 -0
  258. data/utils/enveomics/enveomics.R/DESCRIPTION +31 -0
  259. data/utils/enveomics/enveomics.R/NAMESPACE +39 -0
  260. data/utils/enveomics/enveomics.R/R/autoprune.R +155 -0
  261. data/utils/enveomics/enveomics.R/R/barplot.R +184 -0
  262. data/utils/enveomics/enveomics.R/R/cliopts.R +135 -0
  263. data/utils/enveomics/enveomics.R/R/df2dist.R +154 -0
  264. data/utils/enveomics/enveomics.R/R/growthcurve.R +331 -0
  265. data/utils/enveomics/enveomics.R/R/prefscore.R +79 -0
  266. data/utils/enveomics/enveomics.R/R/recplot.R +354 -0
  267. data/utils/enveomics/enveomics.R/R/recplot2.R +1631 -0
  268. data/utils/enveomics/enveomics.R/R/tribs.R +583 -0
  269. data/utils/enveomics/enveomics.R/R/utils.R +80 -0
  270. data/utils/enveomics/enveomics.R/README.md +81 -0
  271. data/utils/enveomics/enveomics.R/data/growth.curves.rda +0 -0
  272. data/utils/enveomics/enveomics.R/data/phyla.counts.rda +0 -0
  273. data/utils/enveomics/enveomics.R/man/cash-enve.GrowthCurve-method.Rd +16 -0
  274. data/utils/enveomics/enveomics.R/man/cash-enve.RecPlot2-method.Rd +16 -0
  275. data/utils/enveomics/enveomics.R/man/cash-enve.RecPlot2.Peak-method.Rd +16 -0
  276. data/utils/enveomics/enveomics.R/man/enve.GrowthCurve-class.Rd +25 -0
  277. data/utils/enveomics/enveomics.R/man/enve.TRIBS-class.Rd +46 -0
  278. data/utils/enveomics/enveomics.R/man/enve.TRIBS.merge.Rd +23 -0
  279. data/utils/enveomics/enveomics.R/man/enve.TRIBStest-class.Rd +47 -0
  280. data/utils/enveomics/enveomics.R/man/enve.__prune.iter.Rd +23 -0
  281. data/utils/enveomics/enveomics.R/man/enve.__prune.reduce.Rd +23 -0
  282. data/utils/enveomics/enveomics.R/man/enve.__tribs.Rd +40 -0
  283. data/utils/enveomics/enveomics.R/man/enve.barplot.Rd +103 -0
  284. data/utils/enveomics/enveomics.R/man/enve.cliopts.Rd +67 -0
  285. data/utils/enveomics/enveomics.R/man/enve.col.alpha.Rd +24 -0
  286. data/utils/enveomics/enveomics.R/man/enve.col2alpha.Rd +19 -0
  287. data/utils/enveomics/enveomics.R/man/enve.df2dist.Rd +45 -0
  288. data/utils/enveomics/enveomics.R/man/enve.df2dist.group.Rd +44 -0
  289. data/utils/enveomics/enveomics.R/man/enve.df2dist.list.Rd +47 -0
  290. data/utils/enveomics/enveomics.R/man/enve.growthcurve.Rd +75 -0
  291. data/utils/enveomics/enveomics.R/man/enve.prefscore.Rd +50 -0
  292. data/utils/enveomics/enveomics.R/man/enve.prune.dist.Rd +44 -0
  293. data/utils/enveomics/enveomics.R/man/enve.recplot.Rd +139 -0
  294. data/utils/enveomics/enveomics.R/man/enve.recplot2-class.Rd +45 -0
  295. data/utils/enveomics/enveomics.R/man/enve.recplot2.ANIr.Rd +24 -0
  296. data/utils/enveomics/enveomics.R/man/enve.recplot2.Rd +77 -0
  297. data/utils/enveomics/enveomics.R/man/enve.recplot2.__counts.Rd +25 -0
  298. data/utils/enveomics/enveomics.R/man/enve.recplot2.__peakHist.Rd +21 -0
  299. data/utils/enveomics/enveomics.R/man/enve.recplot2.__whichClosestPeak.Rd +19 -0
  300. data/utils/enveomics/enveomics.R/man/enve.recplot2.changeCutoff.Rd +19 -0
  301. data/utils/enveomics/enveomics.R/man/enve.recplot2.compareIdentities.Rd +47 -0
  302. data/utils/enveomics/enveomics.R/man/enve.recplot2.coordinates.Rd +29 -0
  303. data/utils/enveomics/enveomics.R/man/enve.recplot2.corePeak.Rd +18 -0
  304. data/utils/enveomics/enveomics.R/man/enve.recplot2.extractWindows.Rd +45 -0
  305. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.Rd +36 -0
  306. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.__em_e.Rd +19 -0
  307. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.__em_m.Rd +19 -0
  308. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.__emauto_one.Rd +27 -0
  309. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.__mow_one.Rd +52 -0
  310. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.__mower.Rd +17 -0
  311. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.em.Rd +51 -0
  312. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.emauto.Rd +43 -0
  313. data/utils/enveomics/enveomics.R/man/enve.recplot2.findPeaks.mower.Rd +82 -0
  314. data/utils/enveomics/enveomics.R/man/enve.recplot2.peak-class.Rd +59 -0
  315. data/utils/enveomics/enveomics.R/man/enve.recplot2.seqdepth.Rd +27 -0
  316. data/utils/enveomics/enveomics.R/man/enve.recplot2.windowDepthThreshold.Rd +36 -0
  317. data/utils/enveomics/enveomics.R/man/enve.selvector.Rd +23 -0
  318. data/utils/enveomics/enveomics.R/man/enve.tribs.Rd +68 -0
  319. data/utils/enveomics/enveomics.R/man/enve.tribs.test.Rd +28 -0
  320. data/utils/enveomics/enveomics.R/man/enve.truncate.Rd +27 -0
  321. data/utils/enveomics/enveomics.R/man/growth.curves.Rd +14 -0
  322. data/utils/enveomics/enveomics.R/man/phyla.counts.Rd +13 -0
  323. data/utils/enveomics/enveomics.R/man/plot.enve.GrowthCurve.Rd +78 -0
  324. data/utils/enveomics/enveomics.R/man/plot.enve.TRIBS.Rd +46 -0
  325. data/utils/enveomics/enveomics.R/man/plot.enve.TRIBStest.Rd +45 -0
  326. data/utils/enveomics/enveomics.R/man/plot.enve.recplot2.Rd +125 -0
  327. data/utils/enveomics/enveomics.R/man/summary.enve.GrowthCurve.Rd +19 -0
  328. data/utils/enveomics/enveomics.R/man/summary.enve.TRIBS.Rd +19 -0
  329. data/utils/enveomics/enveomics.R/man/summary.enve.TRIBStest.Rd +19 -0
  330. data/utils/enveomics/globals.mk +8 -0
  331. data/utils/enveomics/manifest.json +9 -0
  332. data/utils/multitrim/Multitrim How-To.pdf +0 -0
  333. data/utils/multitrim/README.md +67 -0
  334. data/utils/multitrim/multitrim.py +1555 -0
  335. data/utils/multitrim/multitrim.yml +13 -0
  336. data/utils/requirements.txt +4 -3
  337. metadata +304 -3
@@ -0,0 +1,165 @@
+ {
+   "categories": {
+     "Sequence similarity search": {
+       "Statistics": [
+         "BedGraph.tad.rb",
+         "BedGraph.window.rb",
+         "BlastPairwise.AAsubs.pl",
+         "BlastTab.advance.bash",
+         "BlastTab.recplot2.R",
+         "BlastTab.seqdepth.pl",
+         "BlastTab.seqdepth_nomedian.pl",
+         "BlastTab.seqdepth_ZIP.pl",
+         "BlastTab.sumPerHit.pl",
+         "FastQ.test-error.rb",
+         "RecPlot2.compareIdentities.R"
+       ],
+       "Manipulation": [
+         "BlastTab.addlen.rb",
+         "BlastTab.best_hit_sorted.pl",
+         "BlastTab.catsbj.pl",
+         "BlastTab.cogCat.rb",
+         "BlastTab.filter.pl",
+         "BlastTab.kegg_pep2path_rest.pl",
+         "BlastTab.pairedHits.rb",
+         "BlastTab.subsample.pl",
+         "BlastTab.taxid2taxrank.pl",
+         "BlastTab.topHits_sorted.rb",
+         "sam.filter.rb"
+       ],
+       "Execution": [
+         "aai.rb",
+         "ani.rb",
+         "anir.rb",
+         "HMM.haai.rb",
+         "rbm.rb"
+       ]
+     },
+     "Sequence analyses": {
+       "Statistics": [
+         "FastA.gc.pl",
+         "FastA.length.pl",
+         "FastA.N50.pl",
+         "FastA.qlen.pl",
+         "FastQ.test-error.rb"
+       ],
+       "Manipulation": [
+         "FastA.extract.rb",
+         "FastA.filter.pl",
+         "FastA.filterLen.pl",
+         "FastA.filterN.pl",
+         "FastA.fragment.rb",
+         "FastA.interpose.pl",
+         "FastA.mask.rb",
+         "FastA.per_file.pl",
+         "FastA.rename.pl",
+         "FastA.revcom.pl",
+         "FastA.sample.rb",
+         "FastA.slider.pl",
+         "FastA.split.pl",
+         "FastA.split.rb",
+         "FastA.subsample.pl",
+         "FastA.tag.rb",
+         "FastA.toFastQ.rb",
+         "FastA.wrap.rb",
+         "FastQ.filter.pl",
+         "FastQ.interpose.pl",
+         "FastQ.maskQual.rb",
+         "FastQ.offset.pl",
+         "FastQ.split.pl",
+         "FastQ.tag.rb",
+         "FastQ.toFastA.awk"
+       ]
+     },
+     "Diversity": {
+       "Community": [
+         "AlphaDiversity.pl",
+         "Chao1.pl",
+         "Table.barplot.R",
+         "Table.prefScore.R"
+       ],
+       "Population": [
+         "VCF.SNPs.rb",
+         "VCF.KaKs.rb",
+         "Table.prefScore.R"
+       ]
+     },
+     "Annotation": {
+       "Database mapping": [
+         "BlastTab.kegg_pep2path_rest.pl",
+         "BlastTab.taxid2taxrank.pl",
+         "EBIseq2tax.rb",
+         "NCBIacc2tax.rb",
+         "gi2tax.rb",
+         "M5nr.getSequences.rb",
+         "RefSeq.download.bash",
+         "SRA.download.bash"
+       ],
+       "Tables": [
+         "Table.barplot.R",
+         "GenBank.add_fields.rb",
+         "MyTaxa.fragsByTax.pl",
+         "Table.df2dist.R",
+         "Table.filter.pl",
+         "Table.merge.pl",
+         "Table.replace.rb",
+         "Table.round.rb",
+         "Table.split.pl"
+       ],
+       "Search": [
+         "HMM.essential.rb",
+         "HMM.haai.rb",
+         "HMMsearch.extractIds.rb",
+         "ogs.annotate.rb",
+         "ogs.core-pan.rb",
+         "ogs.extract.rb",
+         "ogs.mcl.rb",
+         "ogs.stats.rb",
+         "ogs.rb"
+       ]
+     },
+     "Other data": {
+       "Phylogenetic and other distances": [
+         "CharTable.classify.rb",
+         "JPlace.distances.rb",
+         "JPlace.to_iToL.rb",
+         "Newick.autoprune.R",
+         "TRIBS.test.R",
+         "TRIBS.plot-test.R",
+         "Table.df2dist.R"
+       ],
+       "Taxonomic": [
+         "CharTable.classify.rb",
+         "EBIseq2tax.rb",
+         "NCBIacc2tax.rb",
+         "Table.barplot.R",
+         "gi2tax.rb",
+         "MyTaxa.fragsByTax.pl",
+         "MyTaxa.seq-taxrank.rb",
+         "Taxonomy.silva2ncbi.rb"
+       ],
+       "Alignments": [
+         "AAsubs.log2ratio.rb",
+         "Aln.cat.rb",
+         "Aln.convert.pl",
+         "BlastPairwise.AAsubs.pl"
+       ],
+       "Clustering": [
+         "ogs.mcl.rb",
+         "clust.rand.rb"
+       ],
+       "Read recruitments": [
+         "anir.rb",
+         "BedGraph.tad.rb",
+         "BedGraph.window.rb",
+         "BlastTab.catsbj.pl",
+         "BlastTab.pairedHits.rb",
+         "BlastTab.recplot2.R",
+         "FastQ.test-error.rb",
+         "GFF.catsbj.pl",
+         "RecPlot2.compareIdentities.R",
+         "sam.filter.rb"
+       ]
+     }
+   }
+ }
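The categories manifest above maps each category and subcategory to a list of script names, and several scripts (for example `HMM.haai.rb` and `anir.rb`) appear under more than one heading. The following is a minimal sketch, not part of the package, of how such a manifest could be inverted to look up every heading a script is filed under. The helper name `categories_by_script` is hypothetical, and the embedded JSON is a trimmed copy of the hunk above.

```ruby
require 'json'

# Trimmed copy of the manifest above; only a few entries are kept for brevity.
categories_json = <<~JSON
  {
    "categories": {
      "Sequence similarity search": {
        "Execution": ["aai.rb", "ani.rb", "anir.rb", "HMM.haai.rb", "rbm.rb"]
      },
      "Annotation": {
        "Search": ["HMM.essential.rb", "HMM.haai.rb"]
      },
      "Other data": {
        "Read recruitments": ["anir.rb", "sam.filter.rb"]
      }
    }
  }
JSON

# Hypothetical helper (not part of enveomics): invert the
# category -> subcategory -> scripts map into
# script -> ["Category / Subcategory", ...].
def categories_by_script(manifest)
  index = Hash.new { |hash, key| hash[key] = [] }
  manifest.fetch('categories').each do |category, subcategories|
    subcategories.each do |subcategory, scripts|
      scripts.each { |script| index[script] << "#{category} / #{subcategory}" }
    end
  end
  index
end

index = categories_by_script(JSON.parse(categories_json))
puts index['HMM.haai.rb'].inspect
```

Because the manifest allows a script to be listed under several headings, the inverted index stores an array per script rather than a single value.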
@@ -0,0 +1,154 @@
+ {
+   "_": "Input files and directories are included in the 'Tests' folder.",
+   "examples": [
+     {
+       "_": "== Examples of genome comparisons ==",
+       "task": "ogs.stats.rb",
+       "description": ["Statistics on the groups of orthology in the Primate",
+         "Lentivirus Group, including HIV-1, HIV-2, and SIV."],
+       "values": ["primate_lentivirus.ogs",null,null,null,null,null]
+     },
+     {
+       "task": "ani.rb",
+       "description": ["Average Nucleotide Identity (ANI) between two strains",
+         "of Mycoplasma genitalium (M2288 and M2321)."],
+       "values": ["Mgen_M2288.fna","Mgen_M2321.fna",null,null,null,null,null,
+         null,null,null,null,null,null,null,null,null,null,null,null,null,null,
+         null,null,null]
+     },
+     {
+       "task": "aai.rb",
+       "description": ["Average Amino acid Identity (AAI) between Mycoplasma",
+         "genitalium (Bacteria) and Nanoarchaeum equitans (Archaea)."],
+       "values": ["Mgen_M2288.faa","Nequ_Kin4M.faa",null,null,null,null,null,
+         null,null,null,null,null,null,null,null,null,null,null,null,null,null,
+         null,null,null]
+     },
+     {
+       "task": "rbm.rb",
+       "description": ["Reciprocal Best Matches between the proteomes of the",
+         "two major HIV types (HIV-1 and HIV-2)."],
+       "values": ["hiv1.faa","hiv2.faa",null,null,null,null,null,null,null,null,
+         null,null,"hiv1-hiv2.rbm"]
+     },
+     {
+       "task": "ogs.mcl.rb",
+       "description": ["Groups of orthology in the Primate Letivirus Group,",
+         "including HIV-1, HIV-2, and SIV."],
+       "values": ["primate_lentivirus.ogs","primate_lentivirus.rbm",null,null,
+         null,null,null,null,null,null,null,null]
+     },
+     {
+       "task": "Table.df2dist.R",
+       "description": ["Transforms a list of AAI values between Xanthomonas",
+         "oryzae genomes into a distance matrix."],
+       "values": ["Xanthomonas_oryzae.aai.tsv",null,null,null,null,100.0,
+         "Xanthomonas_oryzae.aai-mat.tsv"]
+     },
+     {
+       "_": "== Recruitment plots",
+       "task": "BlastTab.catsbj.pl",
+       "description": ["Prepares recruitment plot files for a comparison",
+         "between a virome containing HIV and the HIV-1 genome."],
+       "values": [null,null,null,null,"hiv1.fna","hiv_mix-hiv1.blast.tsv"]
+     },
+     {
+       "task": "BlastTab.recplot2.R",
+       "description": ["Generates recruitment plots for a comparison",
+         "between a virome containing HIV and the HIV-1 genome."],
+       "values": ["hiv_mix-hiv1.blast.tsv",50,100,null,null,null,null,null,null,
+         null,null,null,"hiv_mix-hiv1.Rdata","hiv_mix-hiv1.pdf",null,null]
+     },
+     {
+       "_": "== Examples of functional annotations ==",
+       "task": "HMM.essential.rb",
+       "description": ["Typical single-copy bacterial genes present in",
+         "Mycoplasma genitalium."],
+       "values": ["Mgen_M2288.faa",null,null,null,null,null,null,true,null,null,
+         null,null,null,null,null,null,null,null,null]
+     },
+     {
+       "task": "HMM.essential.rb",
+       "description": ["Typical single-copy archaeal genes present in",
+         "Nanoarchaeum equitans."],
+       "values": ["Mgen_M2288.faa",null,null,null,null,null,null,null,true,null,
+         null,null,null,null,null,null,null,null,null]
+     },
+     {
+       "task": "Newick.autoprune.R",
+       "description": ["Prune an AlkB tree with 110 tips to get only distant",
+         "representatives (41)."],
+       "values": ["alkB.nwk",0.9,null,null,null,null,null,"alkB-pruned.nwk"]
+     },
+     {
+       "_": "== Examples of BLAST statistics and manipulation",
+       "task": "BlastTab.topHits_sorted.rb",
+       "description": ["Extract the best match of metagenome-derived proteins",
+         "(from the 'A metagenome') against a Gene Ontology collection."],
+       "values": ["sort","a_mg.cds-go.blast.tsv",null,null,null,null,1,null,null,
+         null,"a_mg.cds-go.blast-bm.tsv"]
+     },
+     {
+       "task": "BlastTab.sumPerHit.pl",
+       "description": ["Count the number of reads per gene in a mapping of a",
+         "metagenome to a metagenome-derived genes (from the 'A metagenome')."],
+       "values": [null,null,null,null,null,null,null,"a_mg.reads-cds.blast.tsv",
+         null,"a_mg.reads-cds.counts.tsv"]
+     },
+     {
+       "task": "BlastTab.sumPerHit.pl",
+       "description": ["Estimate the total abundance of Gene Ontology",
+         "annotations in the A metagenome, using metagenome-derived proteins,",
+         "and normalizing by the read counts of each protein."],
+       "values": ["a_mg.reads-cds.counts.tsv",null,null,null,null,true,null,
+         "a_mg.cds-go.blast.tsv",null,"a_mg.go.read-counts.tsv"]
+     },
+     {
+       "_": "== Examples of diversity ==",
+       "task": "Table.barplot.R",
+       "description": ["Barplot with the distribution of bacterial phyla in",
+         "four different sites, with taxa sorted by variance."],
+       "values": ["phyla_counts.tsv","250,100,75,200",null,null,null,null,null,
+         null,true,"var",2,null,null,"phyla_counts.pdf",10,null]
+     },
+     {
+       "task": "Chao1.pl",
+       "description": ["Phylum-richness estimated by the Chao1 index with 95%",
+         "confidence, using the distributions of bacterial phyla in four",
+         "different sites."],
+       "values": ["phyla_counts.tsv",null,1,null,null,true,null,
+         "phyla_chao1.tsv"]
+     },
+     {
+       "task": "AlphaDiversity.pl",
+       "description": ["Phylum-diversity estimated by the indices of Shannon",
+         "(H'), Inverse Simpson (1/Lambda), and true diversity of order 1 (1D),",
+         "using the distributions of bacterial phyla in four different sites."],
+       "values": ["phyla_counts.tsv",null,1,null,null,true,null,true,1,null,
+         "phyla_diversity.tsv"]
+     },
+     {
+       "_": "== Other miscelaneous examples ==",
+       "task": "CharTable.classify.rb",
+       "description": ["Classification of anthrax genomes based on can-SNPs, as",
+         "described in Van Ert 2007 (PLoS ONE 2(5):e461)."],
+       "values": ["anthrax-cansnp-data.tsv","anthrax-cansnp-key.tsv",
+         "anthrax-cansnp-classif.tsv","anthrax-cansnp-classif.nwk",null]
+     },
+     {
+       "task": "TRIBS.test.R",
+       "description": ["Test overclustering of Xanthomonas oryzae genomes",
+         "encoding for PilA using Transformed-space Resampling In Biased Sets",
+         "(TRIBS)."],
+       "values": ["Xanthomonas_oryzae.aai-mat.tsv","Xanthomonas_oryzae-PilA.txt",
+         5000,null,null,null,null,0,"Xanthomonas_oryzae-PilA.tribs.Rdata",100]
+     },
+     {
+       "task": "TRIBS.plot-test.R",
+       "description": ["Show the TRIBS-normalized distances between Xanthomonas",
+         "oryzae genomes (grey) and X. oryzae encoding for PilA (red)."],
+       "values": ["Xanthomonas_oryzae-PilA.tribs.Rdata",null,null,null,null,null,
+         null,null,"Xanthomonas_oryzae-PilA.tribs.pdf",null,null]
+     }
+   ]
+ }
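Each entry in the examples manifest above stores its `description` as an array of wrapped lines and its `values` as a positional array padded with `null`. The sketch below is one possible way to make an entry easier to read: it joins the description and keeps only the positions that are set. The helper name `summarize_example` is hypothetical. What each position means is presumably defined by the corresponding task manifest, which is not shown here, so the sketch reports positions by index only.

```ruby
require 'json'

# One entry copied from the examples manifest above.
examples_json = <<~JSON
  {
    "examples": [
      {
        "task": "Table.df2dist.R",
        "description": ["Transforms a list of AAI values between Xanthomonas",
          "oryzae genomes into a distance matrix."],
        "values": ["Xanthomonas_oryzae.aai.tsv", null, null, null, null, 100.0,
          "Xanthomonas_oryzae.aai-mat.tsv"]
      }
    ]
  }
JSON

# Hypothetical helper (not part of enveomics): join the wrapped description
# lines and keep only the positions of "values" that are not null.
def summarize_example(example)
  set_values = {}
  example.fetch('values').each_with_index do |value, index|
    set_values[index] = value unless value.nil?
  end
  {
    'task' => example.fetch('task'),
    'description' => Array(example['description']).join(' '),
    'set_values' => set_values
  }
end

examples = JSON.parse(examples_json).fetch('examples')
examples.each { |example| puts summarize_example(example).inspect }
```

Only `nil` is filtered out, so a literal `false` or `0` in `values` (as in the `TRIBS.test.R` entry) is kept as a set position.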
@@ -0,0 +1,4 @@
1
+ {
2
+ "_": "This file loads all the .json files inside 'Manifest/Tasks'.",
3
+ "_include": "Tasks/*.json"
4
+ }
@@ -0,0 +1,69 @@
1
+ #!/bin/bash
2
+
3
+ ##################### VARIABLES
4
+ # Queue: Preferred queue. Delete (or comment) this line to allow
5
+ # automatic detection:
6
+ #QUEUE="biocluster-6"
7
+ # If you set the QUEUE variable, you MUST set the WTIME variable
8
+ # as well, containing the walltime to be asked for. The WTIME
9
+ # variable is ignored otherwise.
10
+ WTIME="120:00:00"
11
+
12
+ # Scratch: This is where the output will be created.
13
+ SCRATCH="$HOME/scratch/pipelines/assembly"
14
+
15
+ # Data folder: This is the folder that contains the input files.
16
+ DATA="$HOME/data/trim"
17
+
18
+ # Location of Newbler's binaries
19
+ BIN454="$HOME/454/bin"
20
+
21
+ # Name(s) of the library(ies) to use, separated by spaces:
22
+ # This is determined by the name of your input files. For example,
23
+ # if your input files are: LLSEP.CoupledReads.fa and LWP.CoupledReads.fa,
24
+ # use:
25
+ # LIBRARIES="LLSEP LWP"
26
+ # It's strongly encouraged to use only one per CONFIG file.
27
+ LIBRARIES="A";
28
+
29
+ # Use .CoupledReads.fa and/or .SingleReads.fa (yes or no):
30
+ USECOUPLED=yes
31
+ USESINGLE=no
32
+
33
+ # Insert length (in bp): This is the average length of the entire insert,
34
+ # not just the gap length.
35
+ INSLEN=300
36
+
37
+ # Number of CPUs to use (for SOAP and Newbler):
38
+ PPN=16
39
+
40
+ # RAM multiplier: Multiply the estimated required RAM by this number:
41
+ RAMMULT=1
42
+
43
+ # Maximum number of simultaneous jobs: Uncomment and increase these values if
44
+ # you have increased resources (e.g., a dedicated queue); uncomment and decrease
45
+ # if the resources are scarce (e.g., a very busy queue or other simultaneous jobs).
46
+ #VELVETSIM=22
47
+ #SOAPSIM=8
48
+
49
+ # Extra parameters for Velvet: Any additional parameters to be passed to
50
+ # velvetg or velveth. If you have MP data, consider adding the option
51
+ # -shortMatePaired yes to VELVETG_EXTRA. If you have Nextera, consider
52
+ # adding the option above, plus the option -ins_length_sd <integer>, to
53
+ # indicate the standard deviation of the insert size. By default, the
54
+ # SD is assumed to be 10% of the average, but Nextera produces much
55
+ # wider distribution of sizes (i.e., larger SD). Typically you shouldn't
56
+ # need to add anything in VELVETH_EXTRA.
57
+ VELVETH_EXTRA=""
58
+ VELVETG_EXTRA=""
59
+
60
+ # Clean non-essential files (yes or no):
61
+ CLEANUP=yes
62
+
63
+ # Best k-mers: Space-delimited list of kmers selected from Velvet and SOAP.
64
+ # This is to be modified at the beginning of step 4, and it's ignored in all
65
+ # the other steps.
66
+ K_VELVET="21 23 35"
67
+ K_SOAP="21 23 35"
68
+
69
+
@@ -0,0 +1 @@
1
+ ../../Scripts/FastA.N50.pl
@@ -0,0 +1 @@
1
+ ../../Scripts/FastA.filterN.pl
@@ -0,0 +1 @@
1
+ ../../Scripts/FastA.length.pl
@@ -0,0 +1,189 @@
1
+ @author: Luis Miguel Rodriguez-R <lmrodriguezr at gmail dot com>
2
+
3
+ @update: Mar-17-2013
4
+
5
+ @license: artistic 2.0
6
+
7
+ @status: semi
8
+
9
+ @pbs: yes
10
+
11
+ # IMPORTANT
12
+
13
+ This pipeline was developed for the [PACE cluster](http://pace.gatech.edu/). You
14
+ are free to use it in other platforms with adequate adjustments. It is largely
15
+ based on Luo _et al._ 2012, ISME J.
16
+
17
+ # PURPOSE
18
+
19
+ This pipeline assembles coupled and/or single reads from one or more libraries.
20
+ It assumes that the reads have been quality-checked and trimmed.
21
+
22
+ # HELP
23
+
24
+ 1. Files preparation:
25
+
26
+ 1.1. Copy this folder to the cluster.
27
+
28
+ 1.2. Copy the sequences to the cluster. Only trimmed/filtered reads are used.
29
+ All the files are expected to be in the same folder, and the filenames must
30
+ end in `.CoupledReads.fa` or `.SingleReads.fa`.
31
+
32
+ 1.3. Copy the file `CONFIG.mock.bash` to `CONFIG.<name>.bash`, where `<name>` is a
33
+ short name for your run (avoid characters other than alphanumeric).
34
+
35
+ 1.4. Change the variables in `CONFIG.<name>.bash`. Notice that this pipeline
36
+ supports running several libraries at the same time, but it's strongly
37
+ recommended to run only one per config file, because the insert length
38
+ (in step 2) and the selected k-mers (in step 3) are fixed for all the
39
+ included libraries. Also, there is a technical consideration: The first
40
+ step will execute parallel jobs for each odd number between 21 and 63, and
41
+ SOAP will use 16 CPUs by default, which means 357 CPUs will be requested
42
+ per library in step 2. It's a bad idea to run many libraries at the same
43
+ time.
44
+
45
+ 1.5. If you have Mate-paired datasets (for example, prepared with Nextera), first
46
+ reverse-complement all the reads. See also the `VELVETG_EXTRA` variable in
47
+ the `CONFIG.<name>.bash` file.
48
+
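Before moving on to step 2, it may help to confirm that the input files named in steps 1.2 and 1.4 are actually in place. The following is a minimal sketch, not part of the pipeline: the function name `missing_inputs` is hypothetical, and it only assumes the file-name convention (`<lib>.CoupledReads.fa` and `<lib>.SingleReads.fa`) and the config variables described above.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with the pipeline): print the input files
# that are expected for the given libraries but are missing or empty.
# Arguments: data folder, USECOUPLED (yes/no), USESINGLE (yes/no), library names.
missing_inputs() {
  local data="$1" usecoupled="$2" usesingle="$3"
  shift 3
  local lib
  for lib in "$@"; do
    # -s is true only if the file exists and has a size greater than zero
    if [[ "$usecoupled" == "yes" && ! -s "$data/$lib.CoupledReads.fa" ]]; then
      echo "$data/$lib.CoupledReads.fa"
    fi
    if [[ "$usesingle" == "yes" && ! -s "$data/$lib.SingleReads.fa" ]]; then
      echo "$data/$lib.SingleReads.fa"
    fi
  done
}
```

One possible use is to source `CONFIG.<name>.bash` and then call `missing_inputs "$DATA" "$USECOUPLED" "$USESINGLE" $LIBRARIES` (with `$LIBRARIES` left unquoted so it splits on spaces). An empty output would suggest that all expected files are present.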
49
+ 2. Velvet and SOAP assembly:
50
+
51
+ 2.1. Execute `./RUNME-2.bash <name>` in the head node (see [troubleshooting](#troubleshooting) #1).
52
+
53
+ 2.2. Monitor the tasks named velvet_* and soap_*.
54
+
55
+ 2.3. Once completed, make sure the `.proc` files contain only the
56
+ word "done". To do this, you may execute:
57
+ ```
58
+ grep -v '^done$' *.proc
59
+ ```
60
+
61
+ If successful, the output of the above command should be empty. See
62
+ [Troubleshooting](#troubleshooting) #2 and #3 below if one or more of your jobs failed.
63
+
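As a complement to the `grep` check in step 2.3, the sketch below lists the `.proc` files that do not contain exactly the word "done", including empty files, which `grep -v` alone would not report. The function name `unfinished_procs` is hypothetical, and the folder holding the `.proc` files is an assumption to adapt to your setup.

```shell
#!/bin/bash
# Hypothetical helper: print every .proc file in a folder (default: the
# current folder) whose content is not exactly the word "done".
unfinished_procs() {
  local dir="${1:-.}" f
  for f in "$dir"/*.proc; do
    # Skip the unexpanded pattern when the folder has no .proc files
    [[ -e "$f" ]] || continue
    # $(cat ...) drops trailing newlines, so "done\n" compares equal to "done"
    if [[ "$(cat "$f")" != "done" ]]; then
      echo "$f"
    fi
  done
}
```

If this prints nothing, all jobs reported completion; otherwise, the listed files point to the jobs to inspect or resubmit.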
64
+ 3. K-mers selection:
65
+
66
+ 3.1. If you completed step 2, execute `./RUNME-3.bash <name>` in the head
67
+ node.
68
+
69
+ 3.2. Once completed, download and open the files `*.n50.pdf`.
70
+
71
+ 3.3. Select the three "best" k-mers for Velvet and for SOAP (they don't
72
+ have to be the same). There is no well-tested method to select the
73
+ "best", and this is why this protocol is not automated, but semi-
74
+ automated. A generally good rule-of-thumb is: pick one that optimizes
75
+ the amount of sequences used (these are the grey bars in the plot;
76
+ usually this is the smallest k-mer), pick one that optimizes the N50
77
+ (this is the dashed red line; usually this is a large k-mer), and pick
78
+ one that optimizes both (something in the middle). You can select
79
+ more or fewer than three k-mers; this is just a suggestion.
80
+
81
+ 4. Newbler assembly:
82
+
83
+ 4.1. Edit the file `CONFIG.<name>.bash`: set the variables `K_VELVET` and
84
+ `K_SOAP` to contain the lists of "best" selected k-mers for Velvet and
85
+ SOAP, respectively.
86
+
87
+ 4.2. Execute `./RUNME-4.bash <name>` in the head node.
88
+
89
+ 4.3. Monitor the task newbler_*. Once finished, your assembly is ready.
90
+ Once completed, make sure the file `.newbler.proc` contains only the
91
+ word "done". To do this, you may execute:
92
+ ```
93
+ grep -v '^done$' *.proc
94
+ ```
95
+ If successful, the output should be empty.
96
+
97
+ 4.4. The final assembly should be located in the `SCRATCH` path, in a folder
98
+ named `<lib>.newbler/assembly/`. The file `454AllContigs.fna` contains
99
+ all the assembled contigs, `454LargeContigs.fna` contains the contigs
100
+ with 500bp or more in length, and `454NewblerMetrics.txt` contains some
101
+ relevant statistics.
102
+
103
+
104
+ # Comments
105
+
106
+ * Some scripts contained in this package are actually symlinks to files in the
107
+ _Scripts_ folder. Check the existence of these files when copied to
108
+ the cluster.
109
+
110
+ # Troubleshooting
111
+
112
+ 1. Do I really have to change directory (`cd`) to the pipeline's folder every time
113
+ I want to execute something?
114
+
115
+ No. Not really. For simplicity, this file tells you to execute, for example,
116
+ `./RUNME-2.bash`. However, you don't really have to be there; you can execute it
117
+ from any location. For example, if you saved this pipeline in your home
118
+ directory, you can just execute `~/assembly.pbs/RUNME-2.bash` instead from any
119
+ location in the head node.
120
+
121
+ 2. I executed step 2, and Velvet worked but SOAP failed (or vice versa). Can I
122
+ submit only one of them?
123
+
124
+ Yes. To execute only Velvet, run:
125
+ ```
126
+ ./RUNME-2.bash <name> velvet
127
+ ```
128
+
129
+ To execute only SOAP, run:
130
+ ```
131
+ ./RUNME-2.bash <name> soap
132
+ ```
133
+
134
+ 3. I ran step 2, and most of the jobs finished, but a few of them failed. Can I
135
+ submit only a few k-mers?
136
+
137
+ Yes. To execute one kmer (say, the k-mer 33 of SOAP), run:
138
+ ```
139
+ ./RUNME-2.bash <name> soap 33
140
+ ```
141
+
142
+ You can also execute more than one kmer, using a comma-separated list. For
143
+ example, to re-submit the k-mers 37, 39, and 41 of Velvet, run:
144
+ ```
145
+ ./RUNME-2.bash <name> velvet 37,39,41
146
+ ```
147
+
148
+ 4. What are the numbers on the job names of step 2?
149
+
150
+ The k-mer. Each k-mer has its own job, but they are "arrayed", to simplify
151
+ administration: notice that all the jobs of Velvet and all the jobs of SOAP
152
+ share the same job ID.
153
+
154
+ 5. Some jobs are being killed, why?
155
+
156
+ 5.1. First, check the log file created by the pipeline. The name is typically
157
+ the output prefix and the .log extension. For velvet, there are two log files,
158
+ the `.glog` and the `.hlog`. You may find the problem there.
159
+
160
+ 5.2. Now, check the error file in your HOME directory. The name depends on the
161
+ job, the library and the task. For example: `~/soap_Mg_2-37.e1999838` is the
162
+ error file for step 2, task soap, library Mg_2, k-mer 37. The trailing
163
+ number after the 'e' is the job ID. If this file contains errors probably
164
+ related to the pipeline, please let me know.
165
+
166
+ 5.3. If you still have no clues, check the output file in your `HOME` directory. The
167
+ name is just like the name of the error file (see #5.2 above), but with 'o'
168
+ instead of 'e'. Compare the lines 'Resources' (what we asked the scheduler for)
169
+ and 'Rsrc Used' (what the job actually used). A typical problem is that your
170
+ job may need more RAM than we asked for (the value of 'mem' in both lines). If
171
+ the RAM used is larger than the RAM requested, the scheduler probably killed
172
+ your job. To solve this, just go to your config file, and set the variable
173
+ RAMMULT to a number larger than 1. For example, if you want to ask for double the
174
+ RAM, set `RAMMULT=2`. You can also include simple arithmetic operations, like
175
+ `RAMMULT=3/2`. If you want to add a fixed amount of RAM, in GiB, use addition.
176
+ For example, to add 10G, set `RAMMULT=1+10`.
177
+
178
+ 5.4. Still no idea? Try running the job again; sometimes the jobs fail with no
179
+ apparent reason, but they succeed when resubmitted. If your job keeps failing,
180
+ please gather as much information as you can (the log, error, and output files should be
181
+ enough) and let me take a look.
182
+
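The `RAMMULT` arithmetic described in 5.3 can be illustrated with a small sketch. This is an assumption about how such an expression could be applied, not the pipeline's actual code: the function `scaled_ram` is hypothetical, it uses bash integer arithmetic, and it pastes the expression as text after the estimate, which is why `1+10` adds 10 rather than multiplying by 11.

```shell
#!/bin/bash
# Hypothetical sketch: apply a RAMMULT expression to an estimated RAM value
# (in GiB). The expression is expanded as text before evaluation, so
# "estimate * 1+10" is read as (estimate * 1) + 10.
scaled_ram() {
  local estimate="$1" rammult="$2"
  echo $(( estimate * $rammult ))
}

scaled_ram 20 2      # 20 * 2    -> 40
scaled_ram 20 3/2    # 20 * 3/2  -> 30
scaled_ram 20 1+10   # 20 * 1+10 -> 30
```

Note that bash arithmetic truncates to integers, so an odd estimate multiplied by `3/2` would be rounded down in this sketch.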
183
+ 6. In step 2, some k-mers keep failing, and I just want to give up on them. Can I?
184
+
185
+ Yes. Step 3 will analyze only completed jobs, so you can just ignore these faulty
186
+ k-mers. Very small k-mers, for example, sometimes need too much memory, and very
187
+ large k-mers in Velvet sometimes need too much time. If you don't think you're
188
+ missing too much, just ignore them.
189
+