minimap2 0.2.22.0 → 0.2.24.1

Files changed (101)
  1. checksums.yaml +4 -4
  2. data/README.md +60 -76
  3. data/ext/Rakefile +55 -0
  4. data/ext/cmappy/cmappy.c +129 -0
  5. data/ext/cmappy/cmappy.h +44 -0
  6. data/ext/minimap2/FAQ.md +46 -0
  7. data/ext/minimap2/LICENSE.txt +24 -0
  8. data/ext/minimap2/MANIFEST.in +10 -0
  9. data/ext/minimap2/Makefile +132 -0
  10. data/ext/minimap2/Makefile.simde +97 -0
  11. data/ext/minimap2/NEWS.md +821 -0
  12. data/ext/minimap2/README.md +403 -0
  13. data/ext/minimap2/align.c +1020 -0
  14. data/ext/minimap2/bseq.c +169 -0
  15. data/ext/minimap2/bseq.h +64 -0
  16. data/ext/minimap2/code_of_conduct.md +30 -0
  17. data/ext/minimap2/cookbook.md +243 -0
  18. data/ext/minimap2/esterr.c +64 -0
  19. data/ext/minimap2/example.c +63 -0
  20. data/ext/minimap2/format.c +559 -0
  21. data/ext/minimap2/hit.c +466 -0
  22. data/ext/minimap2/index.c +775 -0
  23. data/ext/minimap2/kalloc.c +205 -0
  24. data/ext/minimap2/kalloc.h +76 -0
  25. data/ext/minimap2/kdq.h +132 -0
  26. data/ext/minimap2/ketopt.h +120 -0
  27. data/ext/minimap2/khash.h +615 -0
  28. data/ext/minimap2/krmq.h +474 -0
  29. data/ext/minimap2/kseq.h +256 -0
  30. data/ext/minimap2/ksort.h +153 -0
  31. data/ext/minimap2/ksw2.h +184 -0
  32. data/ext/minimap2/ksw2_dispatch.c +96 -0
  33. data/ext/minimap2/ksw2_extd2_sse.c +402 -0
  34. data/ext/minimap2/ksw2_exts2_sse.c +416 -0
  35. data/ext/minimap2/ksw2_extz2_sse.c +313 -0
  36. data/ext/minimap2/ksw2_ll_sse.c +152 -0
  37. data/ext/minimap2/kthread.c +159 -0
  38. data/ext/minimap2/kthread.h +15 -0
  39. data/ext/minimap2/kvec.h +105 -0
  40. data/ext/minimap2/lchain.c +369 -0
  41. data/ext/minimap2/main.c +459 -0
  42. data/ext/minimap2/map.c +714 -0
  43. data/ext/minimap2/minimap.h +410 -0
  44. data/ext/minimap2/minimap2.1 +725 -0
  45. data/ext/minimap2/misc/README.md +179 -0
  46. data/ext/minimap2/misc/mmphase.js +335 -0
  47. data/ext/minimap2/misc/paftools.js +3149 -0
  48. data/ext/minimap2/misc.c +162 -0
  49. data/ext/minimap2/mmpriv.h +132 -0
  50. data/ext/minimap2/options.c +234 -0
  51. data/ext/minimap2/pe.c +177 -0
  52. data/ext/minimap2/python/README.rst +196 -0
  53. data/ext/minimap2/python/cmappy.h +152 -0
  54. data/ext/minimap2/python/cmappy.pxd +153 -0
  55. data/ext/minimap2/python/mappy.pyx +273 -0
  56. data/ext/minimap2/python/minimap2.py +39 -0
  57. data/ext/minimap2/sdust.c +213 -0
  58. data/ext/minimap2/sdust.h +25 -0
  59. data/ext/minimap2/seed.c +131 -0
  60. data/ext/minimap2/setup.py +55 -0
  61. data/ext/minimap2/sketch.c +143 -0
  62. data/ext/minimap2/splitidx.c +84 -0
  63. data/ext/minimap2/sse2neon/emmintrin.h +1689 -0
  64. data/ext/minimap2/test/MT-human.fa +278 -0
  65. data/ext/minimap2/test/MT-orang.fa +276 -0
  66. data/ext/minimap2/test/q-inv.fa +4 -0
  67. data/ext/minimap2/test/q2.fa +2 -0
  68. data/ext/minimap2/test/t-inv.fa +127 -0
  69. data/ext/minimap2/test/t2.fa +2 -0
  70. data/ext/minimap2/tex/Makefile +21 -0
  71. data/ext/minimap2/tex/bioinfo.cls +930 -0
  72. data/ext/minimap2/tex/blasr-mc.eval +17 -0
  73. data/ext/minimap2/tex/bowtie2-s3.sam.eval +28 -0
  74. data/ext/minimap2/tex/bwa-s3.sam.eval +52 -0
  75. data/ext/minimap2/tex/bwa.eval +55 -0
  76. data/ext/minimap2/tex/eval2roc.pl +33 -0
  77. data/ext/minimap2/tex/graphmap.eval +4 -0
  78. data/ext/minimap2/tex/hs38-simu.sh +10 -0
  79. data/ext/minimap2/tex/minialign.eval +49 -0
  80. data/ext/minimap2/tex/minimap2.bib +460 -0
  81. data/ext/minimap2/tex/minimap2.tex +724 -0
  82. data/ext/minimap2/tex/mm2-s3.sam.eval +62 -0
  83. data/ext/minimap2/tex/mm2-update.tex +240 -0
  84. data/ext/minimap2/tex/mm2.approx.eval +12 -0
  85. data/ext/minimap2/tex/mm2.eval +13 -0
  86. data/ext/minimap2/tex/natbib.bst +1288 -0
  87. data/ext/minimap2/tex/natbib.sty +803 -0
  88. data/ext/minimap2/tex/ngmlr.eval +38 -0
  89. data/ext/minimap2/tex/roc.gp +60 -0
  90. data/ext/minimap2/tex/snap-s3.sam.eval +62 -0
  91. data/ext/minimap2.patch +19 -0
  92. data/lib/minimap2/aligner.rb +4 -4
  93. data/lib/minimap2/alignment.rb +11 -11
  94. data/lib/minimap2/ffi/constants.rb +20 -16
  95. data/lib/minimap2/ffi/functions.rb +5 -0
  96. data/lib/minimap2/ffi.rb +4 -5
  97. data/lib/minimap2/version.rb +2 -2
  98. data/lib/minimap2.rb +51 -15
  99. metadata +97 -79
  100. data/lib/minimap2/ffi_helper.rb +0 -53
  101. data/vendor/libminimap2.so +0 -0
@@ -0,0 +1,403 @@
1
+ [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=social&logo=github&label=Download)](https://github.com/lh3/minimap2/releases)
2
+ [![BioConda Install](https://img.shields.io/conda/dn/bioconda/minimap2.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/minimap2)
3
+ [![PyPI](https://img.shields.io/pypi/v/mappy.svg?style=flat)](https://pypi.python.org/pypi/mappy)
4
+ [![Build Status](https://github.com/lh3/minimap2/actions/workflows/ci.yaml/badge.svg)](https://github.com/lh3/minimap2/actions)
5
+ ## <a name="started"></a>Getting Started
6
+ ```sh
7
+ git clone https://github.com/lh3/minimap2
8
+ cd minimap2 && make
9
+ # long sequences against a reference genome
10
+ ./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
11
+ # create an index first and then map
12
+ ./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
13
+ ./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
14
+ # use presets (no test data)
15
+ ./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio CLR genomic reads
16
+ ./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
17
+ ./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
18
+ ./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.18 or earlier)
19
+ ./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
20
+ ./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
21
+ ./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
22
+ ./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
23
+ ./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam # prioritize on annotated junctions
24
+ ./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
25
+ ./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
26
+ ./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
27
+ # man page for detailed command line options
28
+ man ./minimap2.1
29
+ ```
30
+
31
+ ## Table of Contents
32
+
33
+ - [Getting Started](#started)
34
+ - [Users' Guide](#uguide)
35
+ - [Installation](#install)
36
+ - [General usage](#general)
37
+ - [Use cases](#cases)
38
+ - [Map long noisy genomic reads](#map-long-genomic)
39
+ - [Map long mRNA/cDNA reads](#map-long-splice)
40
+ - [Find overlaps between long reads](#long-overlap)
41
+ - [Map short accurate genomic reads](#short-genomic)
42
+ - [Full genome/assembly alignment](#full-genome)
43
+ - [Advanced features](#advanced)
44
+ - [Working with >65535 CIGAR operations](#long-cigar)
45
+ - [The cs optional tag](#cs)
46
+ - [Working with the PAF format](#paftools)
47
+ - [Algorithm overview](#algo)
48
+ - [Getting help](#help)
49
+ - [Citing minimap2](#cite)
50
+ - [Developers' Guide](#dguide)
51
+ - [Limitations](#limit)
52
+
53
+ ## <a name="uguide"></a>Users' Guide
54
+
55
+ Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA
56
+ sequences against a large reference database. Typical use cases include: (1)
57
+ mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
58
+ finding overlaps between long reads with error rate up to ~15%; (3)
59
+ splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
60
+ against a reference genome; (4) aligning Illumina single- or paired-end reads;
61
+ (5) assembly-to-assembly alignment; (6) full-genome alignment between two
62
+ closely related species with divergence below ~15%.
63
+
64
+ For ~10kb noisy read sequences, minimap2 is tens of times faster than
65
+ mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more
66
+ accurate on simulated long reads and produces biologically meaningful alignments
67
+ ready for downstream analyses. For >100bp Illumina short reads, minimap2 is
68
+ three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data.
69
+ Detailed evaluations are available from the [minimap2 paper][doi] or the
70
+ [preprint][preprint].
71
+
72
+ ### <a name="install"></a>Installation
73
+
74
+ Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from
75
+ the [release page][release] with:
76
+ ```sh
77
+ curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
78
+ ./minimap2-2.24_x64-linux/minimap2
79
+ ```
80
+ If you want to compile from the source, you need to have a C compiler, GNU make
81
+ and zlib development files installed. Then type `make` in the source code
82
+ directory to compile. If you see compilation errors, try `make sse2only=1`
83
+ to disable SSE4 code, which will make minimap2 slightly slower.
84
+
85
+ Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To
86
+ compile for 32 bit ARM architectures (such as ARMv7), use `make arm_neon=1`. To
87
+ compile for 64 bit ARM architectures (such as ARMv8), use `make arm_neon=1
88
+ aarch64=1`.
89
+
90
+ Minimap2 can use the [SIMD Everywhere (SIMDe)][simde] library to port its
91
+ implementation to different SIMD instruction sets. To compile using SIMDe,
92
+ use `make -f Makefile.simde`. To compile for ARM CPUs, use `Makefile.simde`
93
+ with the ARM related command lines given above.
94
+
95
+ ### <a name="general"></a>General usage
96
+
97
+ Without any options, minimap2 takes a reference database and a query sequence
98
+ file as input and produces approximate mappings, without base-level alignment
99
+ (i.e. coordinates are only approximate and no CIGAR in output), in the [PAF format][paf]:
100
+ ```sh
101
+ minimap2 ref.fa query.fq > approx-mapping.paf
102
+ ```
103
+ You can ask minimap2 to generate CIGAR at the `cg` tag of PAF with:
104
+ ```sh
105
+ minimap2 -c ref.fa query.fq > alignment.paf
106
+ ```
107
+ or to output alignments in the [SAM format][sam]:
108
+ ```sh
109
+ minimap2 -a ref.fa query.fq > alignment.sam
110
+ ```
111
+ Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You
112
+ don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
113
+
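The PAF output of the commands above is plain tab-separated text, so it is easy to read in a script. The following is a minimal sketch, assuming the 12 mandatory columns described in the linked PAF specification followed by optional `key:type:value` tags. The `PafRecord` type, the `parse_paf_line` function and the example line are hypothetical and are not part of minimap2.

```python
from collections import namedtuple

# Field names are illustrative; the column order follows the PAF specification.
PafRecord = namedtuple("PafRecord", [
    "qname", "qlen", "qstart", "qend", "strand",
    "tname", "tlen", "tstart", "tend",
    "nmatch", "alnlen", "mapq", "tags",
])


def parse_paf_line(line):
    """Parse one PAF line into a PafRecord with a dict of optional tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    tags = {}
    for item in f[12:]:
        key, typ, val = item.split(":", 2)
        if typ == "i":        # integer tag
            val = int(val)
        elif typ == "f":      # floating-point tag
            val = float(val)
        tags[key] = val       # other types (A, Z, ...) stay as strings
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)


# A made-up record, as `minimap2 -c` might emit it (the cg tag holds the CIGAR).
example = "\t".join(["read1", "1000", "0", "1000", "+", "chrM", "16569",
                     "500", "1500", "990", "1000", "60", "tp:A:P", "cg:Z:1000M"])
rec = parse_paf_line(example)
print(rec.qname, rec.tname, rec.mapq, rec.tags.get("cg"))
```

Without `-c`, the `cg` tag is absent and `rec.tags.get("cg")` returns `None`.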
114
+ For the human reference genome, minimap2 takes a few minutes to generate a
115
+ minimizer index for the reference before mapping. To reduce indexing time, you
116
+ can optionally save the index with option **-d** and replace the reference
117
+ sequence file with the index file on the minimap2 command line:
118
+ ```sh
119
+ minimap2 -d ref.mmi ref.fa # indexing
120
+ minimap2 -a ref.mmi reads.fq > alignment.sam # alignment
121
+ ```
122
+ ***Importantly***, once you build the index, indexing
123
+ parameters such as **-k**, **-w**, **-H** and **-I** can't be changed during
124
+ mapping. If you are running minimap2 for different data types, you will
125
+ probably need to keep multiple indexes generated with different parameters.
126
+ This makes minimap2 different from BWA which always uses the same index
127
+ regardless of query data types.
128
+
129
+ ### <a name="cases"></a>Use cases
130
+
131
+ Minimap2 uses the same base algorithm for all applications. However, due to the
132
+ different data types it supports (e.g. short vs long reads; DNA vs mRNA reads),
133
+ minimap2 needs to be tuned for optimal performance and accuracy. It is usually
134
+ recommended to choose a preset with option **-x**, which sets multiple
135
+ parameters at the same time. The default setting is the same as `map-ont`.
136
+
137
+ #### <a name="map-long-genomic"></a>Map long noisy genomic reads
138
+
139
+ ```sh
140
+ minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio CLR reads
141
+ minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
142
+ ```
143
+ The difference between `map-pb` and `map-ont` is that `map-pb` uses
144
+ homopolymer-compressed (HPC) minimizers as seeds, while `map-ont` uses ordinary
145
+ minimizers as seeds. Empirical evaluation suggests HPC minimizers improve
146
+ performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning
147
+ Nanopore reads.
148
+
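To illustrate what homopolymer compression does to a sequence, here is a minimal sketch, assuming the usual definition in which each run of identical bases collapses to a single base. The `hpc_compress` function is hypothetical and only shows the idea. It is not how minimap2 computes HPC minimizers internally.

```python
def hpc_compress(seq):
    """Collapse each run of identical bases into a single base."""
    out = []
    for base in seq:
        # Keep a base only if it differs from the previously kept one.
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)


# Two sequences that differ only in homopolymer lengths compress to the
# same string, so seeds taken from the compressed sequences still agree.
print(hpc_compress("GATTTACCA"))  # GATACA
print(hpc_compress("GATTACA"))    # GATACA
```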
149
+ #### <a name="map-long-splice"></a>Map long mRNA/cDNA reads
150
+
151
+ ```sh
152
+ minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA
153
+ minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam # Nanopore 2D cDNA-seq
154
+ minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
155
+ minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa # mapping against SIRV control
156
+ ```
157
+ There are different long-read RNA-seq technologies, including traditional
158
+ full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq.
159
+ They produce data of varying quality and properties. By default, `-x splice`
160
+ assumes the read orientation relative to the transcript strand is unknown. It
161
+ tries two rounds of alignment to infer the orientation and writes the strand to
162
+ the `ts` SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and traditional
163
+ full-length cDNAs, it is advisable to apply `-u f` to force minimap2 to
164
+ consider the forward transcript strand only. This speeds up alignment with
165
+ slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is
166
+ recommended to use a smaller k-mer size for increased sensitivity to the first
167
+ or the last exons.
168
+
169
+ Minimap2 rates an alignment by the score of the max-scoring sub-segment,
170
+ *excluding* introns, and marks the best alignment as primary in SAM. When a
171
+ spliced gene also has unspliced pseudogenes, minimap2 does not intentionally
172
+ prefer spliced alignment, though in practice it more often marks the spliced
173
+ alignment as the primary. By default, minimap2 outputs up to five secondary
174
+ alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This
175
+ can be tuned with option **-N**.
176
+
177
+ For long RNA-seq reads, minimap2 may produce chimeric alignments potentially
178
+ caused by gene fusions/structural variations or by an intron longer than the
179
+ max intron length **-G** (200k by default). For now, it is not recommended to
180
+ apply an excessively large **-G** as this slows down minimap2 and sometimes
181
+ leads to false alignments.
182
+
183
+ It is worth noting that by default `-x splice` prefers GT[A/G]..[C/T]AG
184
+ over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering
185
+ one additional base improves the junction accuracy for noisy reads, but
186
+ reduces the accuracy when aligning against the widely used SIRV control data.
187
+ This is because SIRV does not honor the evolutionarily conservative splicing
188
+ signal. If you are studying SIRV, you may apply `--splice-flank=no` to let
189
+ minimap2 only model GT..AG, ignoring the additional base.
190
+
191
+ Since v2.17, minimap2 can optionally take annotated genes as input and
192
+ prioritize annotated splice junctions. To use this feature, run:
193
+ ```sh
194
+ paftools.js gff2bed anno.gff > anno.bed
195
+ minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam
196
+ ```
197
+ Here, `anno.gff` is the gene annotation in the GTF or GFF3 format (`gff2bed`
198
+ automatically tests the format). The output of `gff2bed` is in the 12-column
199
+ BED format, or the BED12 format. With the `--junc-bed` option, minimap2 adds a
200
+ bonus score (tuned by `--junc-bonus`) if an aligned junction matches a junction
201
+ in the annotation. Option `--junc-bed` also takes 5-column BED, including the
202
+ strand field. In this case, each line indicates an oriented junction.
203
+
204
+ #### <a name="long-overlap"></a>Find overlaps between long reads
205
+
206
+ ```sh
207
+ minimap2 -x ava-pb reads.fq reads.fq > ovlp.paf # PacBio CLR read overlap
208
+ minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf # Oxford Nanopore read overlap
209
+ ```
210
+ Similarly, `ava-pb` uses HPC minimizers while `ava-ont` uses ordinary
211
+ minimizers. It is usually not recommended to perform base-level alignment in
212
+ the overlapping mode because it is slow and may produce false positive
213
+ overlaps. However, if performance is not a concern, you may try to add `-a` or
214
+ `-c` anyway.
215
+
216
+ #### <a name="short-genomic"></a>Map short accurate genomic reads
217
+
218
+ ```sh
219
+ minimap2 -ax sr ref.fa reads-se.fq > aln.sam # single-end alignment
220
+ minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam # paired-end alignment
221
+ minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam # paired-end alignment
222
+ ```
223
+ When two read files are specified, minimap2 reads from each file in turn and
224
+ merges them into an interleaved stream internally. Two reads are considered to
225
+ be paired if they are adjacent in the input stream and have the same name (with
226
+ the `/[0-9]` suffix trimmed if present). Single- and paired-end reads can be
227
+ mixed.
228
+
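The pairing rule described above can be sketched as follows. This is a minimal illustration that works on read names only. The `base_name` and `group_pairs` helpers are hypothetical and are not minimap2 functions.

```python
import re

# A trailing "/" followed by one digit, e.g. "/1" or "/2".
_SUFFIX = re.compile(r"/[0-9]$")


def base_name(name):
    """Return the read name with a trailing /[0-9] suffix removed, if present."""
    return _SUFFIX.sub("", name)


def group_pairs(names):
    """Group an interleaved stream of read names into pairs and singletons."""
    groups = []
    i = 0
    while i < len(names):
        # Two adjacent reads with the same trimmed name form a pair.
        if i + 1 < len(names) and base_name(names[i]) == base_name(names[i + 1]):
            groups.append((names[i], names[i + 1]))
            i += 2
        else:
            groups.append((names[i],))
            i += 1
    return groups


# Mixed single- and paired-end input.
print(group_pairs(["r1/1", "r1/2", "r2", "r3/1", "r3/2"]))
```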
229
+ Minimap2 does not work well with short spliced reads. There are many capable
230
+ RNA-seq mappers for short reads.
231
+
232
+ #### <a name="full-genome"></a>Full genome/assembly alignment
233
+
234
+ ```sh
235
+ minimap2 -ax asm5 ref.fa asm.fa > aln.sam # assembly to assembly/ref alignment
236
+ ```
237
+ For cross-species full-genome alignment, the scoring system needs to be tuned
238
+ according to the sequence divergence.
239
+
240
+ ### <a name="advanced"></a>Advanced features
241
+
242
+ #### <a name="long-cigar"></a>Working with >65535 CIGAR operations
243
+
244
+ Due to a design flaw, BAM does not work with CIGAR strings with >65535
245
+ operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2
246
+ may align ~1% of read bases with long CIGARs beyond the capability of BAM. If
247
+ you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an
248
+ error and abort. Older samtools and other tools may create corrupted BAM.
249
+
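Before converting SAM to BAM, one way to see whether a record is affected is to count its CIGAR operations. The sketch below assumes the standard SAM CIGAR operators (`MIDNSHP=X`). The helper names are hypothetical.

```python
import re

BAM_MAX_CIGAR_OPS = 65535  # the limit stated in the text above

# One CIGAR operation: a length followed by a standard SAM operator.
_CIGAR_OP = re.compile(r"[0-9]+[MIDNSHP=X]")


def count_cigar_ops(cigar):
    """Return the number of operations in a SAM CIGAR string."""
    if cigar == "*":  # SAM uses "*" when no CIGAR is available
        return 0
    ops = _CIGAR_OP.findall(cigar)
    if "".join(ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return len(ops)


def exceeds_bam_limit(cigar):
    """True if the CIGAR has more operations than the BAM CIGAR column allows."""
    return count_cigar_ops(cigar) > BAM_MAX_CIGAR_OPS


print(count_cigar_ops("10M2I5M"))         # 3
print(exceeds_bam_limit("1M1I" * 40000))  # True
```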
250
+ To avoid this issue, you can add option `-L` at the minimap2 command line.
251
+ This option moves a long CIGAR to the `CG` tag and leaves a fully clipped CIGAR
252
+ at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
253
+ sorting) still work with such BAM records; tools that read CIGAR will
254
+ effectively ignore these records. It has been decided that future tools
255
+ will seamlessly recognize long-cigar records generated by option `-L`.
256
+
257
+ **TL;DR**: if you work with ultra-long reads and use tools that only process
258
+ BAM files, please add option `-L`.
259
+
260
+ #### <a name="cs"></a>The cs optional tag
261
+
262
+ The `cs` SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular
263
+ expression `/(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/`. Like CIGAR, `cs`
264
+ consists of a series of operations. Each leading character specifies the
265
+ operation; the following sequence is the one involved in the operation.
266
+
267
+ The `cs` tag is enabled by command line option `--cs`. The following alignment,
268
+ for example:
269
+ ```txt
270
+ CGATCGATAAATAGAGTAG---GAATAGCA
271
+ |||||| |||||||||| |||| |||
272
+ CGATCG---AATAGAGTAGGTCGAATtGCA
273
+ ```
274
+ is represented as `:6-ata:10+gtc:4*at:3`, where `:[0-9]+` represents an
275
+ identical block, `-ata` represents a deletion, `+gtc` an insertion and `*at`
276
+ indicates reference base `a` is substituted with a query base `t`. It is
277
+ similar to the `MD` SAM tag but is standalone and easier to parse.
278
+
279
+ If `--cs=long` is used, the `cs` string also contains identical sequences in
280
+ the alignment. The above example will become
281
+ `=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA`. The long form of `cs` encodes both
282
+ reference and query sequences in one string. The `cs` tag also encodes intron
283
+ positions and splicing signals (see the [minimap2 manpage][manpage-cs] for
284
+ details).
285
+
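As an illustration of how easy `cs` is to parse, the sketch below tokenizes a `cs` string with the regular expression given above and sums the reference and query bases it covers. It handles only the operations in that expression, not the intron operations mentioned in the manpage. The function names are hypothetical.

```python
import re

# One cs operation, following the regular expression quoted in the text.
_CS_OP = re.compile(r":[0-9]+|\*[a-z][a-z]|[=+\-][A-Za-z]+")


def parse_cs(cs):
    """Split a cs string into (operator, argument) pairs."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = _CS_OP.match(cs, pos)
        if m is None:
            raise ValueError("unexpected cs content at offset %d" % pos)
        token = m.group(0)
        ops.append((token[0], token[1:]))
        pos = m.end()
    return ops


def cs_lengths(cs):
    """Return (reference bases, query bases) covered by the alignment."""
    ref_len = query_len = 0
    for op, arg in parse_cs(cs):
        if op == ":":              # identical block, short form
            ref_len += int(arg)
            query_len += int(arg)
        elif op == "=":            # identical block, long form
            ref_len += len(arg)
            query_len += len(arg)
        elif op == "*":            # substitution: one base on each side
            ref_len += 1
            query_len += 1
        elif op == "-":            # deletion: bases present in the reference only
            ref_len += len(arg)
        else:                      # "+", insertion: bases present in the query only
            query_len += len(arg)
    return ref_len, query_len


print(parse_cs(":6-ata:10+gtc:4*at:3"))
print(cs_lengths(":6-ata:10+gtc:4*at:3"))                    # (27, 27)
print(cs_lengths("=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA"))  # (27, 27)
```

Both forms of the example give 27 reference bases and 27 query bases, which matches the alignment shown above once the gap characters are removed.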
286
+ #### <a name="paftools"></a>Working with the PAF format
287
+
288
+ Minimap2 also comes with a (java)script [paftools.js](misc/paftools.js) that
289
+ processes alignments in the PAF format. It calls variants from
290
+ assembly-to-reference alignment, lifts over BED files based on alignment,
291
+ converts between formats and provides utilities for various evaluations. For
292
+ details, please see [misc/README.md](misc/README.md).
293
+
294
+ ### <a name="algo"></a>Algorithm overview
295
+
296
+ In the following, minimap2 command line options have a dash ahead and are
297
+ highlighted in bold. The description may help to tune minimap2 parameters.
298
+
299
+ 1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
300
+ index them in a hash table.
301
+
302
+ 2. Read **-K** [=*200M*] query bases. For each query sequence, do step 3
303
+ through 7:
304
+
305
+ 3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
306
+ index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
307
+ frequent, collect its occurrences in the reference, which are called
308
+ *seeds*.
309
+
310
+ 4. Sort seeds by position in the reference. Chain them with dynamic
311
+ programming. Each chain represents a potential mapping. For read
312
+ overlapping, report all chains and then go to step 8. For reference mapping,
313
+ do step 5 through 7:
314
+
315
+ 5. Let *P* be the set of primary mappings, which is an empty set initially. For
316
+ each chain from the best to the worst according to their chaining scores: if
317
+ on the query, the chain overlaps with a chain in *P* by **--mask-level**
318
+ [=*0.5*] or higher fraction of the shorter chain, mark the chain as
319
+ *secondary* to the chain in *P*; otherwise, add the chain to *P*.
320
+
321
+ 6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
322
+ mappings if their chaining scores are higher than **-p** [=*0.8*] of their
323
+ corresponding primary mappings.
324
+
325
+ 7. If alignment is requested, filter out an internal seed if it potentially
326
+ leads to both a long insertion and a long deletion. Extend from the
327
+ left-most seed. Perform global alignments between internal seeds. Split the
328
+ chain if the accumulative score along the global alignment drops by **-z**
329
+ [=*400*], disregarding long gaps. Extend from the right-most seed. Output
330
+ chains and their alignments.
331
+
332
+ 8. If there are more query sequences in the input, go to step 2 until no more
333
+ queries are left.
334
+
335
+ 9. If there are more reference sequences, reopen the query file from the start
336
+ and go to step 1; otherwise stop.
337
+
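Steps 5 and 6 above can be sketched in a few lines. This is a simplified illustration and not minimap2's implementation. Chains are hypothetical `(query_start, query_end, score)` tuples, the function name `select_mappings` is made up, and applying the **-N** cap across the whole query rather than per primary is an assumption about how to read step 6.

```python
def select_mappings(chains, mask_level=0.5, best_n=5, pri_ratio=0.8):
    """Split chains into primary mappings and retained secondary mappings.

    chains: list of (query_start, query_end, score) tuples.
    Returns (primaries, secondaries); each secondary is a (chain, parent) pair.
    """
    primaries = []    # the set P of step 5, initially empty
    secondaries = []
    # Visit chains from the best to the worst chaining score.
    for chain in sorted(chains, key=lambda c: c[2], reverse=True):
        qs, qe, _ = chain
        parent = None
        for prim in primaries:
            overlap = min(qe, prim[1]) - max(qs, prim[0])
            shorter = min(qe - qs, prim[1] - prim[0])
            # Overlap covering at least mask_level of the shorter chain.
            if shorter > 0 and overlap >= mask_level * shorter:
                parent = prim
                break
        if parent is None:
            primaries.append(chain)
        else:
            secondaries.append((chain, parent))
    # Step 6: keep the top secondaries scoring above pri_ratio of their primary.
    kept = [(c, p) for c, p in secondaries if c[2] > pri_ratio * p[2]][:best_n]
    return primaries, kept


chains = [(0, 100, 90), (10, 95, 80), (200, 300, 50), (20, 90, 40)]
primaries, secondaries = select_mappings(chains)
print(primaries)    # [(0, 100, 90), (200, 300, 50)]
print(secondaries)  # [((10, 95, 80), (0, 100, 90))]
```

In this example the chain with score 40 overlaps the best chain but falls below 0.8 of its score, so it is dropped.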
338
+ ### <a name="help"></a>Getting help
339
+
340
+ Manpage [minimap2.1][manpage] provides detailed description of minimap2
341
+ command line options and optional tags. The [FAQ](FAQ.md) page answers several
342
+ frequently asked questions. If you encounter bugs or have further questions or
343
+ requests, you can raise an issue at the [issue page][issue]. There is no
343
+ dedicated mailing list for the time being.
345
+
346
+ ### <a name="cite"></a>Citing minimap2
347
+
348
+ If you use minimap2 in your work, please cite:
349
+
350
+ > Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences.
351
+ > *Bioinformatics*, **34**:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]
352
+
353
+ ## <a name="dguide"></a>Developers' Guide
354
+
355
+ Minimap2 is not only a command line tool, but also a programming library.
356
+ It provides C APIs to build/load index and to align sequences against the
357
+ index. File [example.c](example.c) demonstrates typical uses of C APIs. Header
358
+ file [minimap.h](minimap.h) gives more detailed API documentation. Minimap2
359
+ aims to keep APIs in this header stable. File [mmpriv.h](mmpriv.h) contains
360
+ additional private APIs which may be subject to frequent changes.
361
+
362
+ This repository also provides Python bindings to a subset of C APIs. File
363
+ [python/README.rst](python/README.rst) gives the full documentation;
364
+ [python/minimap2.py](python/minimap2.py) shows an example. This Python
365
+ extension, mappy, is also [available from PyPI][mappypypi] via `pip install
366
+ mappy` or [from BioConda][mappyconda] via `conda install -c bioconda mappy`.
367
+
368
+ ## <a name="limit"></a>Limitations
369
+
370
+ * Minimap2 may produce suboptimal alignments through long low-complexity
371
+ regions where seed positions may be suboptimal. This should not be a big
372
+ concern because even the optimal alignment may be wrong in such regions.
373
+
374
+ * Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is
375
+ possible to add non-SIMD support, but it would make minimap2 slower by
376
+ several times.
377
+
378
+ * Minimap2 does not work with a single query or database sequence ~2
379
+ billion bases or longer (2,147,483,647 to be exact). The total length of all
380
+ sequences can well exceed this threshold.
381
+
382
+ * Minimap2 often misses small exons.
383
+
384
+
385
+
386
+ [paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
387
+ [sam]: https://samtools.github.io/hts-specs/SAMv1.pdf
388
+ [minimap]: https://github.com/lh3/minimap
389
+ [smartdenovo]: https://github.com/ruanjue/smartdenovo
390
+ [longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791
391
+ [gaba]: https://github.com/ocxtal/libgaba
392
+ [ksw2]: https://github.com/lh3/ksw2
393
+ [preprint]: https://arxiv.org/abs/1708.01492
394
+ [release]: https://github.com/lh3/minimap2/releases
395
+ [mappypypi]: https://pypi.python.org/pypi/mappy
396
+ [mappyconda]: https://anaconda.org/bioconda/mappy
397
+ [issue]: https://github.com/lh3/minimap2/issues
398
+ [k8]: https://github.com/attractivechaos/k8
399
+ [manpage]: https://lh3.github.io/minimap2/minimap2.html
400
+ [manpage-cs]: https://lh3.github.io/minimap2/minimap2.html#10
401
+ [doi]: https://doi.org/10.1093/bioinformatics/bty191
402
+ [simde]: https://github.com/nemequ/simde
403
+ [unimap]: https://github.com/lh3/unimap