minimap2 0.2.22.0 → 0.2.24.1

Files changed (101)
  1. checksums.yaml +4 -4
  2. data/README.md +60 -76
  3. data/ext/Rakefile +55 -0
  4. data/ext/cmappy/cmappy.c +129 -0
  5. data/ext/cmappy/cmappy.h +44 -0
  6. data/ext/minimap2/FAQ.md +46 -0
  7. data/ext/minimap2/LICENSE.txt +24 -0
  8. data/ext/minimap2/MANIFEST.in +10 -0
  9. data/ext/minimap2/Makefile +132 -0
  10. data/ext/minimap2/Makefile.simde +97 -0
  11. data/ext/minimap2/NEWS.md +821 -0
  12. data/ext/minimap2/README.md +403 -0
  13. data/ext/minimap2/align.c +1020 -0
  14. data/ext/minimap2/bseq.c +169 -0
  15. data/ext/minimap2/bseq.h +64 -0
  16. data/ext/minimap2/code_of_conduct.md +30 -0
  17. data/ext/minimap2/cookbook.md +243 -0
  18. data/ext/minimap2/esterr.c +64 -0
  19. data/ext/minimap2/example.c +63 -0
  20. data/ext/minimap2/format.c +559 -0
  21. data/ext/minimap2/hit.c +466 -0
  22. data/ext/minimap2/index.c +775 -0
  23. data/ext/minimap2/kalloc.c +205 -0
  24. data/ext/minimap2/kalloc.h +76 -0
  25. data/ext/minimap2/kdq.h +132 -0
  26. data/ext/minimap2/ketopt.h +120 -0
  27. data/ext/minimap2/khash.h +615 -0
  28. data/ext/minimap2/krmq.h +474 -0
  29. data/ext/minimap2/kseq.h +256 -0
  30. data/ext/minimap2/ksort.h +153 -0
  31. data/ext/minimap2/ksw2.h +184 -0
  32. data/ext/minimap2/ksw2_dispatch.c +96 -0
  33. data/ext/minimap2/ksw2_extd2_sse.c +402 -0
  34. data/ext/minimap2/ksw2_exts2_sse.c +416 -0
  35. data/ext/minimap2/ksw2_extz2_sse.c +313 -0
  36. data/ext/minimap2/ksw2_ll_sse.c +152 -0
  37. data/ext/minimap2/kthread.c +159 -0
  38. data/ext/minimap2/kthread.h +15 -0
  39. data/ext/minimap2/kvec.h +105 -0
  40. data/ext/minimap2/lchain.c +369 -0
  41. data/ext/minimap2/main.c +459 -0
  42. data/ext/minimap2/map.c +714 -0
  43. data/ext/minimap2/minimap.h +410 -0
  44. data/ext/minimap2/minimap2.1 +725 -0
  45. data/ext/minimap2/misc/README.md +179 -0
  46. data/ext/minimap2/misc/mmphase.js +335 -0
  47. data/ext/minimap2/misc/paftools.js +3149 -0
  48. data/ext/minimap2/misc.c +162 -0
  49. data/ext/minimap2/mmpriv.h +132 -0
  50. data/ext/minimap2/options.c +234 -0
  51. data/ext/minimap2/pe.c +177 -0
  52. data/ext/minimap2/python/README.rst +196 -0
  53. data/ext/minimap2/python/cmappy.h +152 -0
  54. data/ext/minimap2/python/cmappy.pxd +153 -0
  55. data/ext/minimap2/python/mappy.pyx +273 -0
  56. data/ext/minimap2/python/minimap2.py +39 -0
  57. data/ext/minimap2/sdust.c +213 -0
  58. data/ext/minimap2/sdust.h +25 -0
  59. data/ext/minimap2/seed.c +131 -0
  60. data/ext/minimap2/setup.py +55 -0
  61. data/ext/minimap2/sketch.c +143 -0
  62. data/ext/minimap2/splitidx.c +84 -0
  63. data/ext/minimap2/sse2neon/emmintrin.h +1689 -0
  64. data/ext/minimap2/test/MT-human.fa +278 -0
  65. data/ext/minimap2/test/MT-orang.fa +276 -0
  66. data/ext/minimap2/test/q-inv.fa +4 -0
  67. data/ext/minimap2/test/q2.fa +2 -0
  68. data/ext/minimap2/test/t-inv.fa +127 -0
  69. data/ext/minimap2/test/t2.fa +2 -0
  70. data/ext/minimap2/tex/Makefile +21 -0
  71. data/ext/minimap2/tex/bioinfo.cls +930 -0
  72. data/ext/minimap2/tex/blasr-mc.eval +17 -0
  73. data/ext/minimap2/tex/bowtie2-s3.sam.eval +28 -0
  74. data/ext/minimap2/tex/bwa-s3.sam.eval +52 -0
  75. data/ext/minimap2/tex/bwa.eval +55 -0
  76. data/ext/minimap2/tex/eval2roc.pl +33 -0
  77. data/ext/minimap2/tex/graphmap.eval +4 -0
  78. data/ext/minimap2/tex/hs38-simu.sh +10 -0
  79. data/ext/minimap2/tex/minialign.eval +49 -0
  80. data/ext/minimap2/tex/minimap2.bib +460 -0
  81. data/ext/minimap2/tex/minimap2.tex +724 -0
  82. data/ext/minimap2/tex/mm2-s3.sam.eval +62 -0
  83. data/ext/minimap2/tex/mm2-update.tex +240 -0
  84. data/ext/minimap2/tex/mm2.approx.eval +12 -0
  85. data/ext/minimap2/tex/mm2.eval +13 -0
  86. data/ext/minimap2/tex/natbib.bst +1288 -0
  87. data/ext/minimap2/tex/natbib.sty +803 -0
  88. data/ext/minimap2/tex/ngmlr.eval +38 -0
  89. data/ext/minimap2/tex/roc.gp +60 -0
  90. data/ext/minimap2/tex/snap-s3.sam.eval +62 -0
  91. data/ext/minimap2.patch +19 -0
  92. data/lib/minimap2/aligner.rb +4 -4
  93. data/lib/minimap2/alignment.rb +11 -11
  94. data/lib/minimap2/ffi/constants.rb +20 -16
  95. data/lib/minimap2/ffi/functions.rb +5 -0
  96. data/lib/minimap2/ffi.rb +4 -5
  97. data/lib/minimap2/version.rb +2 -2
  98. data/lib/minimap2.rb +51 -15
  99. metadata +97 -79
  100. data/lib/minimap2/ffi_helper.rb +0 -53
  101. data/vendor/libminimap2.so +0 -0
@@ -0,0 +1,403 @@
1
+ [![GitHub Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=social&logo=github&label=Download)](https://github.com/lh3/minimap2/releases)
2
+ [![BioConda Install](https://img.shields.io/conda/dn/bioconda/minimap2.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/minimap2)
3
+ [![PyPI](https://img.shields.io/pypi/v/mappy.svg?style=flat)](https://pypi.python.org/pypi/mappy)
4
+ [![Build Status](https://github.com/lh3/minimap2/actions/workflows/ci.yaml/badge.svg)](https://github.com/lh3/minimap2/actions)
5
+ ## <a name="started"></a>Getting Started
6
+ ```sh
7
+ git clone https://github.com/lh3/minimap2
8
+ cd minimap2 && make
9
+ # long sequences against a reference genome
10
+ ./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
11
+ # create an index first and then map
12
+ ./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
13
+ ./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
14
+ # use presets (no test data)
15
+ ./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio CLR genomic reads
16
+ ./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
17
+ ./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
18
+ ./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.18 or earlier)
19
+ ./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
20
+ ./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
21
+ ./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
22
+ ./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
23
+ ./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam # prioritize on annotated junctions
24
+ ./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
25
+ ./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
26
+ ./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
27
+ # man page for detailed command line options
28
+ man ./minimap2.1
29
+ ```
30
+
31
+ ## Table of Contents
32
+
33
+ - [Getting Started](#started)
34
+ - [Users' Guide](#uguide)
35
+ - [Installation](#install)
36
+ - [General usage](#general)
37
+ - [Use cases](#cases)
38
+ - [Map long noisy genomic reads](#map-long-genomic)
39
+ - [Map long mRNA/cDNA reads](#map-long-splice)
40
+ - [Find overlaps between long reads](#long-overlap)
41
+ - [Map short accurate genomic reads](#short-genomic)
42
+ - [Full genome/assembly alignment](#full-genome)
43
+ - [Advanced features](#advanced)
44
+ - [Working with >65535 CIGAR operations](#long-cigar)
45
+ - [The cs optional tag](#cs)
46
+ - [Working with the PAF format](#paftools)
47
+ - [Algorithm overview](#algo)
48
+ - [Getting help](#help)
49
+ - [Citing minimap2](#cite)
50
+ - [Developers' Guide](#dguide)
51
+ - [Limitations](#limit)
52
+
53
+ ## <a name="uguide"></a>Users' Guide
54
+
55
+ Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA
56
+ sequences against a large reference database. Typical use cases include: (1)
57
+ mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
58
+ finding overlaps between long reads with error rate up to ~15%; (3)
59
+ splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
60
+ against a reference genome; (4) aligning Illumina single- or paired-end reads;
61
+ (5) assembly-to-assembly alignment; (6) full-genome alignment between two
62
+ closely related species with divergence below ~15%.
63
+
64
+ For ~10kb noisy read sequences, minimap2 is tens of times faster than
65
+ mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more
66
+ accurate on simulated long reads and produces biologically meaningful alignments
67
+ ready for downstream analyses. For >100bp Illumina short reads, minimap2 is
68
+ three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data.
69
+ Detailed evaluations are available from the [minimap2 paper][doi] or the
70
+ [preprint][preprint].
71
+
72
+ ### <a name="install"></a>Installation
73
+
74
+ Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from
75
+ the [release page][release] with:
76
+ ```sh
77
+ curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
78
+ ./minimap2-2.24_x64-linux/minimap2
79
+ ```
80
+ If you want to compile from the source, you need to have a C compiler, GNU make
81
+ and zlib development files installed. Then type `make` in the source code
82
+ directory to compile. If you see compilation errors, try `make sse2only=1`
83
+ to disable SSE4 code, which will make minimap2 slightly slower.
84
+
85
+ Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To
86
+ compile for 32 bit ARM architectures (such as ARMv7), use `make arm_neon=1`. To
87
+ compile for 64 bit ARM architectures (such as ARMv8), use `make arm_neon=1
88
+ aarch64=1`.
89
+
90
+ Minimap2 can use the [SIMD Everywhere (SIMDe)][simde] library to port its
91
+ implementation to different SIMD instruction sets. To compile using SIMDe,
92
+ use `make -f Makefile.simde`. To compile for ARM CPUs, use `Makefile.simde`
93
+ with the ARM related command lines given above.
94
+
95
+ ### <a name="general"></a>General usage
96
+
97
+ Without any options, minimap2 takes a reference database and a query sequence
98
+ file as input and produces approximate mappings, without base-level alignment
99
+ (i.e. coordinates are only approximate and no CIGAR in output), in the [PAF format][paf]:
100
+ ```sh
101
+ minimap2 ref.fa query.fq > approx-mapping.paf
102
+ ```
103
+ You can ask minimap2 to generate CIGAR at the `cg` tag of PAF with:
104
+ ```sh
105
+ minimap2 -c ref.fa query.fq > alignment.paf
106
+ ```
107
+ or to output alignments in the [SAM format][sam]:
108
+ ```sh
109
+ minimap2 -a ref.fa query.fq > alignment.sam
110
+ ```
111
+ Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You
112
+ don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
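The PAF output produced by the commands above is a tab-delimited text format with 12 mandatory columns followed by optional SAM-style `key:type:value` tags such as `cg`. The following is a minimal Python sketch of a reader for one PAF line. The names `PafRecord` and `parse_paf_line` are hypothetical helpers, not part of minimap2 or mappy, and the example line is made up for illustration.

```python
from typing import Dict, NamedTuple, Union


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns plus a dict of optional tags."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    n_match: int
    aln_len: int
    mapq: int
    tags: Dict[str, Union[int, float, str]]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-delimited PAF line (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    tags: Dict[str, Union[int, float, str]] = {}
    for field in f[12:]:
        # Optional fields look like "NM:i:30" or "cg:Z:980M".
        key, typ, value = field.split(":", 2)
        if typ == "i":
            tags[key] = int(value)
        elif typ == "f":
            tags[key] = float(value)
        else:
            tags[key] = value
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)


# Made-up example line, not real minimap2 output.
example = "\t".join([
    "read1", "1000", "10", "990", "+",
    "chr1", "50000", "2000", "2980",
    "950", "985", "60", "tp:A:P", "cg:Z:980M", "NM:i:30",
])
record = parse_paf_line(example)
```

With `-c`, the CIGAR string would then be available as `record.tags["cg"]`; without it, that key is simply absent.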
113
+
114
+ For the human reference genome, minimap2 takes a few minutes to generate a
115
+ minimizer index for the reference before mapping. To reduce indexing time, you
116
+ can optionally save the index with option **-d** and replace the reference
117
+ sequence file with the index file on the minimap2 command line:
118
+ ```sh
119
+ minimap2 -d ref.mmi ref.fa # indexing
120
+ minimap2 -a ref.mmi reads.fq > alignment.sam # alignment
121
+ ```
122
+ ***Importantly***, once you build the index, indexing
123
+ parameters such as **-k**, **-w**, **-H** and **-I** can't be changed during
124
+ mapping. If you are running minimap2 for different data types, you will
125
+ probably need to keep multiple indexes generated with different parameters.
126
+ This makes minimap2 different from BWA which always uses the same index
127
+ regardless of query data types.
128
+
129
+ ### <a name="cases"></a>Use cases
130
+
131
+ Minimap2 uses the same base algorithm for all applications. However, due to the
132
+ different data types it supports (e.g. short vs long reads; DNA vs mRNA reads),
133
+ minimap2 needs to be tuned for optimal performance and accuracy. It is usually
134
+ recommended to choose a preset with option **-x**, which sets multiple
135
+ parameters at the same time. The default setting is the same as `map-ont`.
136
+
137
+ #### <a name="map-long-genomic"></a>Map long noisy genomic reads
138
+
139
+ ```sh
140
+ minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio CLR reads
141
+ minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
142
+ ```
143
+ The difference between `map-pb` and `map-ont` is that `map-pb` uses
144
+ homopolymer-compressed (HPC) minimizers as seeds, while `map-ont` uses ordinary
145
+ minimizers as seeds. Empirical evaluation suggests HPC minimizers improve
146
+ performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning
147
+ Nanopore reads.
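To illustrate what homopolymer compression means here, the sketch below collapses every run of identical bases to a single base, which is the sequence that HPC minimizers would then be drawn from. This is a simplified Python illustration only; `homopolymer_compress` is a hypothetical name, and minimap2's C implementation performs the compression while sketching rather than on a separate string.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical bases into one base (illustrative)."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


# "TTT" and "AAA" each shrink to a single base.
print(homopolymer_compress("GATTTACAAAG"))  # GATACAG
```

Because homopolymer length errors disappear after compression, two reads that differ only in run lengths yield the same compressed sequence.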
148
+
149
+ #### <a name="map-long-splice"></a>Map long mRNA/cDNA reads
150
+
151
+ ```sh
152
+ minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA
153
+ minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam # Nanopore 2D cDNA-seq
154
+ minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
155
+ minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa # mapping against SIRV control
156
+ ```
157
+ There are different long-read RNA-seq technologies, including traditional
158
+ full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq.
159
+ They produce data of varying quality and properties. By default, `-x splice`
160
+ assumes the read orientation relative to the transcript strand is unknown. It
161
+ tries two rounds of alignment to infer the orientation and write the strand to
162
+ the `ts` SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and traditional
163
+ full-length cDNAs, it is desirable to apply `-u f` to force minimap2 to
164
+ consider the forward transcript strand only. This speeds up alignment with
165
+ slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is
166
+ recommended to use a smaller k-mer size for increased sensitivity to the first
167
+ or the last exons.
168
+
169
+ Minimap2 rates an alignment by the score of the max-scoring sub-segment,
170
+ *excluding* introns, and marks the best alignment as primary in SAM. When a
171
+ spliced gene also has unspliced pseudogenes, minimap2 does not intentionally
172
+ prefer spliced alignment, though in practice it more often marks the spliced
173
+ alignment as the primary. By default, minimap2 outputs up to five secondary
174
+ alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This
175
+ can be tuned with option **-N**.
176
+
177
+ For long RNA-seq reads, minimap2 may produce chimeric alignments potentially
178
+ caused by gene fusions/structural variations or by an intron longer than the
179
+ max intron length **-G** (200k by default). For now, it is not recommended to
180
+ apply an excessively large **-G** as this slows down minimap2 and sometimes
181
+ leads to false alignments.
182
+
183
+ It is worth noting that by default `-x splice` prefers GT[A/G]..[C/T]AG
184
+ over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering
185
+ one additional base improves the junction accuracy for noisy reads, but
186
+ reduces the accuracy when aligning against the widely used SIRV control data.
187
+ This is because SIRV does not honor the evolutionarily conservative splicing
188
+ signal. If you are studying SIRV, you may apply `--splice-flank=no` to let
189
+ minimap2 only model GT..AG, ignoring the additional base.
190
+
191
+ Since v2.17, minimap2 can optionally take annotated genes as input and
192
+ prioritize on annotated splice junctions. To use this feature, you can
193
+ ```sh
194
+ paftools.js gff2bed anno.gff > anno.bed
195
+ minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam
196
+ ```
197
+ Here, `anno.gff` is the gene annotation in the GTF or GFF3 format (`gff2bed`
198
+ automatically tests the format). The output of `gff2bed` is in the 12-column
199
+ BED format, or the BED12 format. With the `--junc-bed` option, minimap2 adds a
200
+ bonus score (tuned by `--junc-bonus`) if an aligned junction matches a junction
201
+ in the annotation. Option `--junc-bed` also takes 5-column BED, including the
202
+ strand field. In this case, each line indicates an oriented junction.
203
+
204
+ #### <a name="long-overlap"></a>Find overlaps between long reads
205
+
206
+ ```sh
207
+ minimap2 -x ava-pb reads.fq reads.fq > ovlp.paf # PacBio CLR read overlap
208
+ minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf # Oxford Nanopore read overlap
209
+ ```
210
+ Similarly, `ava-pb` uses HPC minimizers while `ava-ont` uses ordinary
211
+ minimizers. It is usually not recommended to perform base-level alignment in
212
+ the overlapping mode because it is slow and may produce false positive
213
+ overlaps. However, if performance is not a concern, you may try to add `-a` or
214
+ `-c` anyway.
215
+
216
+ #### <a name="short-genomic"></a>Map short accurate genomic reads
217
+
218
+ ```sh
219
+ minimap2 -ax sr ref.fa reads-se.fq > aln.sam # single-end alignment
220
+ minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam # paired-end alignment
221
+ minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam # paired-end alignment
222
+ ```
223
+ When two read files are specified, minimap2 reads from each file in turn and
224
+ merges them into an interleaved stream internally. Two reads are considered to
225
+ be paired if they are adjacent in the input stream and have the same name (with
226
+ the `/[0-9]` suffix trimmed if present). Single- and paired-end reads can be
227
+ mixed.
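The pairing rule described above (adjacent in the stream, same name once a trailing `/[0-9]` is trimmed) can be sketched in Python as follows. This only illustrates the stated rule and does not reproduce minimap2's C code; `trim_mate_suffix`, `is_pair` and `group_reads` are hypothetical names.

```python
import re
from typing import List, Tuple

# A trailing "/<digit>" such as "/1" or "/2" marks the mate number.
_MATE_SUFFIX = re.compile(r"/[0-9]$")


def trim_mate_suffix(name: str) -> str:
    """Remove a trailing /[0-9] suffix if present."""
    return _MATE_SUFFIX.sub("", name)


def is_pair(name_a: str, name_b: str) -> bool:
    """Two adjacent reads pair up when their trimmed names are equal."""
    return trim_mate_suffix(name_a) == trim_mate_suffix(name_b)


def group_reads(names: List[str]) -> List[Tuple[str, ...]]:
    """Walk an interleaved stream and group adjacent mates (illustrative)."""
    groups: List[Tuple[str, ...]] = []
    i = 0
    while i < len(names):
        if i + 1 < len(names) and is_pair(names[i], names[i + 1]):
            groups.append((names[i], names[i + 1]))
            i += 2
        else:
            # No matching neighbour: treat the read as single-end.
            groups.append((names[i],))
            i += 1
    return groups


stream = ["a/1", "a/2", "b", "c/1", "c/2"]
groups = group_reads(stream)
```

In this sketch the mixed stream yields two pairs and one single-end read, which mirrors how single- and paired-end reads can share one input.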
228
+
229
+ Minimap2 does not work well with short spliced reads. There are many capable
230
+ RNA-seq mappers for short reads.
231
+
232
+ #### <a name="full-genome"></a>Full genome/assembly alignment
233
+
234
+ ```sh
235
+ minimap2 -ax asm5 ref.fa asm.fa > aln.sam # assembly to assembly/ref alignment
236
+ ```
237
+ For cross-species full-genome alignment, the scoring system needs to be tuned
238
+ according to the sequence divergence.
239
+
240
+ ### <a name="advanced"></a>Advanced features
241
+
242
+ #### <a name="long-cigar"></a>Working with >65535 CIGAR operations
243
+
244
+ Due to a design flaw, BAM does not work with CIGAR strings with >65535
245
+ operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2
246
+ may align ~1% of read bases with long CIGARs beyond the capability of BAM. If
247
+ you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an
248
+ error and abort. Older samtools and other tools may create corrupted BAM.
249
+
250
+ To avoid this issue, you can add option `-L` at the minimap2 command line.
251
+ This option moves a long CIGAR to the `CG` tag and leaves a fully clipped CIGAR
252
+ at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
253
+ sorting) still work with such BAM records; tools that read CIGAR will
254
+ effectively ignore these records. It has been decided that future tools
255
+ will seamlessly recognize long-cigar records generated by option `-L`.
256
+
257
+ **TL;DR**: if you work with ultra-long reads and use tools that only process
258
+ BAM files, please add option `-L`.
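To check whether an existing SAM record would run into the limit described above, one can count the operations in its CIGAR string. The Python sketch below is a hypothetical helper for that check and is not part of minimap2 or samtools; the threshold is the 65535 figure quoted in the text.

```python
import re

# Limit quoted in the text: BAM cannot store more than this many operations.
BAM_MAX_CIGAR_OPS = 65535

# One CIGAR operation is a length followed by one of the SAM operation codes.
_CIGAR_OP = re.compile(r"[0-9]+[MIDNSHP=X]")


def count_cigar_ops(cigar: str) -> int:
    """Count the operations in a SAM CIGAR string ('*' means no CIGAR)."""
    if cigar == "*":
        return 0
    return len(_CIGAR_OP.findall(cigar))


def exceeds_bam_limit(cigar: str) -> bool:
    """True if the CIGAR has more operations than BAM can hold."""
    return count_cigar_ops(cigar) > BAM_MAX_CIGAR_OPS


print(count_cigar_ops("6M3D10M3I4M1X3M"))  # 7
```

Records for which `exceeds_bam_limit` returns true are the ones that option `-L` is meant to protect when converting to BAM.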
259
+
260
+ #### <a name="cs"></a>The cs optional tag
261
+
262
+ The `cs` SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular
263
+ expression `/(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/`. Like CIGAR, `cs`
264
+ consists of a series of operations. Each leading character specifies the
265
+ operation; the following sequence is the one involved in the operation.
266
+
267
+ The `cs` tag is enabled by command line option `--cs`. The following alignment,
268
+ for example:
269
+ ```txt
270
+ CGATCGATAAATAGAGTAG---GAATAGCA
271
+ |||||| |||||||||| |||| |||
272
+ CGATCG---AATAGAGTAGGTCGAATtGCA
273
+ ```
274
+ is represented as `:6-ata:10+gtc:4*at:3`, where `:[0-9]+` represents an
275
+ identical block, `-ata` represents a deletion, `+gtc` an insertion and `*at`
276
+ indicates reference base `a` is substituted with a query base `t`. It is
277
+ similar to the `MD` SAM tag but is standalone and easier to parse.
278
+
279
+ If `--cs=long` is used, the `cs` string also contains identical sequences in
280
+ the alignment. The above example will become
281
+ `=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA`. The long form of `cs` encodes both
282
+ reference and query sequences in one string. The `cs` tag also encodes intron
283
+ positions and splicing signals (see the [minimap2 manpage][manpage-cs] for
284
+ details).
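The regular expression quoted above is enough to split a `cs` string into operations. The following Python sketch tokenizes a `cs` string and reports how many reference and query bases it covers. `parse_cs` and `consumed_lengths` are hypothetical helper names, and the intron operations mentioned in the text are deliberately not handled because the quoted expression does not cover them.

```python
import re
from typing import List, Tuple, Union

# Same alternatives as the regular expression quoted in the text.
_CS_OP = re.compile(r":[0-9]+|\*[a-z][a-z]|[=+\-][A-Za-z]+")

CsOp = Tuple[str, Union[int, str]]


def parse_cs(cs: str) -> List[CsOp]:
    """Split a cs string into (operation, argument) pairs."""
    ops: List[CsOp] = []
    pos = 0
    for m in _CS_OP.finditer(cs):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        token = m.group()
        op, arg = token[0], token[1:]
        # ':' carries a match length; every other operation carries bases.
        ops.append((op, int(arg) if op == ":" else arg))
        pos = m.end()
    if pos != len(cs):
        raise ValueError(f"unexpected text at offset {pos}")
    return ops


def consumed_lengths(ops: List[CsOp]) -> Tuple[int, int]:
    """Return (reference bases, query bases) covered by the operations."""
    ref = qry = 0
    for op, arg in ops:
        if op == ":":            # identical block, short form
            ref += arg
            qry += arg
        elif op == "=":          # identical block, long form
            ref += len(arg)
            qry += len(arg)
        elif op == "*":          # substitution: one base on each side
            ref += 1
            qry += 1
        elif op == "-":          # deletion: bases present in reference only
            ref += len(arg)
        elif op == "+":          # insertion: bases present in query only
            qry += len(arg)
    return ref, qry


short_ops = parse_cs(":6-ata:10+gtc:4*at:3")
print(consumed_lengths(short_ops))  # (27, 27)
```

The short and long forms of the example from the text give the same lengths, since they describe the same alignment.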
285
+
286
+ #### <a name="paftools"></a>Working with the PAF format
287
+
288
+ Minimap2 also comes with a (java)script [paftools.js](misc/paftools.js) that
289
+ processes alignments in the PAF format. It calls variants from
290
+ assembly-to-reference alignment, lifts over BED files based on alignment,
291
+ converts between formats and provides utilities for various evaluations. For
292
+ details, please see [misc/README.md](misc/README.md).
293
+
294
+ ### <a name="algo"></a>Algorithm overview
295
+
296
+ In the following, minimap2 command line options have a dash ahead and are
297
+ highlighted in bold. The description may help to tune minimap2 parameters.
298
+
299
+ 1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
300
+ index them in a hash table.
301
+
302
+ 2. Read **-K** [=*200M*] query bases. For each query sequence, do step 3
303
+ through 7:
304
+
305
+ 3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
306
+ index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
307
+ frequent, collect its occurrences in the reference, which are called
308
+ *seeds*.
309
+
310
+ 4. Sort seeds by position in the reference. Chain them with dynamic
311
+ programming. Each chain represents a potential mapping. For read
312
+ overlapping, report all chains and then go to step 8. For reference mapping,
313
+ do step 5 through 7:
314
+
315
+ 5. Let *P* be the set of primary mappings, which is an empty set initially. For
316
+ each chain from the best to the worst according to their chaining scores: if
317
+ on the query, the chain overlaps with a chain in *P* by **--mask-level**
318
+ [=*0.5*] or higher fraction of the shorter chain, mark the chain as
319
+ *secondary* to the chain in *P*; otherwise, add the chain to *P*.
320
+
321
+ 6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
322
+ mappings if their chaining scores are higher than **-p** [=*0.8*] of their
323
+ corresponding primary mappings.
324
+
325
+ 7. If alignment is requested, filter out an internal seed if it potentially
326
+ leads to both a long insertion and a long deletion. Extend from the
327
+ left-most seed. Perform global alignments between internal seeds. Split the
328
+ chain if the accumulative score along the global alignment drops by **-z**
329
+ [=*400*], disregarding long gaps. Extend from the right-most seed. Output
330
+ chains and their alignments.
331
+
332
+ 8. If there are more query sequences in the input, go to step 2 until no more
333
+ queries are left.
334
+
335
+ 9. If there are more reference sequences, reopen the query file from the start
336
+ and go to step 1; otherwise stop.
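Step 5 above can be sketched in a few lines of Python. This is a simplified illustration of the primary/secondary selection as described, not minimap2's implementation: the `Chain` class and `assign_primary` function are hypothetical, chains carry only query coordinates and a score, and when a chain overlaps several primaries the sketch assumes it is attached to the highest-scoring one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chain:
    qstart: int                   # query start (0-based, inclusive)
    qend: int                     # query end (exclusive)
    score: int                    # chaining score
    parent: Optional[int] = None  # index of the primary chain, if secondary


def assign_primary(chains: List[Chain], mask_level: float = 0.5) -> List[int]:
    """Return indices of primary chains; set .parent on secondary ones."""
    # Visit chains from the best to the worst chaining score.
    order = sorted(range(len(chains)),
                   key=lambda i: chains[i].score, reverse=True)
    primaries: List[int] = []
    for i in order:
        c = chains[i]
        c.parent = None
        for p in primaries:
            q = chains[p]
            overlap = min(c.qend, q.qend) - max(c.qstart, q.qstart)
            shorter = min(c.qend - c.qstart, q.qend - q.qstart)
            # Secondary if the overlap covers at least mask_level of the
            # shorter of the two chains on the query.
            if shorter > 0 and overlap >= mask_level * shorter:
                c.parent = p
                break
        if c.parent is None:
            primaries.append(i)
    return primaries


chains = [
    Chain(0, 100, 90),    # best chain, becomes primary
    Chain(10, 90, 60),    # fully inside chain 0, becomes secondary
    Chain(150, 250, 50),  # disjoint on the query, becomes primary
    Chain(90, 160, 40),   # overlaps neighbours by only 10 bases, primary
]
primary_idx = assign_primary(chains)
```

Raising `mask_level` makes the sketch more willing to keep overlapping chains as separate primary mappings, which is the role **--mask-level** plays in the description above.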
337
+
338
+ ### <a name="help"></a>Getting help
339
+
340
+ Manpage [minimap2.1][manpage] provides detailed description of minimap2
341
+ command line options and optional tags. The [FAQ](FAQ.md) page answers several
342
+ frequently asked questions. If you encounter bugs or have further questions or
343
+ requests, you can raise an issue at the [issue page][issue]. There is not a
344
+ specific mailing list for the time being.
345
+
346
+ ### <a name="cite"></a>Citing minimap2
347
+
348
+ If you use minimap2 in your work, please cite:
349
+
350
+ > Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences.
351
+ > *Bioinformatics*, **34**:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]
352
+
353
+ ## <a name="dguide"></a>Developers' Guide
354
+
355
+ Minimap2 is not only a command line tool, but also a programming library.
356
+ It provides C APIs to build/load index and to align sequences against the
357
+ index. File [example.c](example.c) demonstrates typical uses of C APIs. Header
358
+ file [minimap.h](minimap.h) gives more detailed API documentation. Minimap2
359
+ aims to keep APIs in this header stable. File [mmpriv.h](mmpriv.h) contains
360
+ additional private APIs which may be subject to frequent changes.
361
+
362
+ This repository also provides Python bindings to a subset of C APIs. File
363
+ [python/README.rst](python/README.rst) gives the full documentation;
364
+ [python/minimap2.py](python/minimap2.py) shows an example. This Python
365
+ extension, mappy, is also [available from PyPI][mappypypi] via `pip install
366
+ mappy` or [from BioConda][mappyconda] via `conda install -c bioconda mappy`.
367
+
368
+ ## <a name="limit"></a>Limitations
369
+
370
+ * Minimap2 may produce suboptimal alignments through long low-complexity
371
+ regions where seed positions may be suboptimal. This should not be a big
372
+ concern because even the optimal alignment may be wrong in such regions.
373
+
374
+ * Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is
375
+ possible to add non-SIMD support, but it would make minimap2 slower by
376
+ several times.
377
+
378
+ * Minimap2 does not work with a single query or database sequence ~2
379
+ billion bases or longer (2,147,483,647 to be exact). The total length of all
380
+ sequences can well exceed this threshold.
381
+
382
+ * Minimap2 often misses small exons.
383
+
384
+
385
+
386
+ [paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
387
+ [sam]: https://samtools.github.io/hts-specs/SAMv1.pdf
388
+ [minimap]: https://github.com/lh3/minimap
389
+ [smartdenovo]: https://github.com/ruanjue/smartdenovo
390
+ [longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791
391
+ [gaba]: https://github.com/ocxtal/libgaba
392
+ [ksw2]: https://github.com/lh3/ksw2
393
+ [preprint]: https://arxiv.org/abs/1708.01492
394
+ [release]: https://github.com/lh3/minimap2/releases
395
+ [mappypypi]: https://pypi.python.org/pypi/mappy
396
+ [mappyconda]: https://anaconda.org/bioconda/mappy
397
+ [issue]: https://github.com/lh3/minimap2/issues
398
+ [k8]: https://github.com/attractivechaos/k8
399
+ [manpage]: https://lh3.github.io/minimap2/minimap2.html
400
+ [manpage-cs]: https://lh3.github.io/minimap2/minimap2.html#10
401
+ [doi]: https://doi.org/10.1093/bioinformatics/bty191
402
+ [simde]: https://github.com/nemequ/simde
403
+ [unimap]: https://github.com/lh3/unimap