minimap2 0.2.22.0 → 0.2.24.1
- checksums.yaml +4 -4
- data/README.md +60 -76
- data/ext/Rakefile +55 -0
- data/ext/cmappy/cmappy.c +129 -0
- data/ext/cmappy/cmappy.h +44 -0
- data/ext/minimap2/FAQ.md +46 -0
- data/ext/minimap2/LICENSE.txt +24 -0
- data/ext/minimap2/MANIFEST.in +10 -0
- data/ext/minimap2/Makefile +132 -0
- data/ext/minimap2/Makefile.simde +97 -0
- data/ext/minimap2/NEWS.md +821 -0
- data/ext/minimap2/README.md +403 -0
- data/ext/minimap2/align.c +1020 -0
- data/ext/minimap2/bseq.c +169 -0
- data/ext/minimap2/bseq.h +64 -0
- data/ext/minimap2/code_of_conduct.md +30 -0
- data/ext/minimap2/cookbook.md +243 -0
- data/ext/minimap2/esterr.c +64 -0
- data/ext/minimap2/example.c +63 -0
- data/ext/minimap2/format.c +559 -0
- data/ext/minimap2/hit.c +466 -0
- data/ext/minimap2/index.c +775 -0
- data/ext/minimap2/kalloc.c +205 -0
- data/ext/minimap2/kalloc.h +76 -0
- data/ext/minimap2/kdq.h +132 -0
- data/ext/minimap2/ketopt.h +120 -0
- data/ext/minimap2/khash.h +615 -0
- data/ext/minimap2/krmq.h +474 -0
- data/ext/minimap2/kseq.h +256 -0
- data/ext/minimap2/ksort.h +153 -0
- data/ext/minimap2/ksw2.h +184 -0
- data/ext/minimap2/ksw2_dispatch.c +96 -0
- data/ext/minimap2/ksw2_extd2_sse.c +402 -0
- data/ext/minimap2/ksw2_exts2_sse.c +416 -0
- data/ext/minimap2/ksw2_extz2_sse.c +313 -0
- data/ext/minimap2/ksw2_ll_sse.c +152 -0
- data/ext/minimap2/kthread.c +159 -0
- data/ext/minimap2/kthread.h +15 -0
- data/ext/minimap2/kvec.h +105 -0
- data/ext/minimap2/lchain.c +369 -0
- data/ext/minimap2/main.c +459 -0
- data/ext/minimap2/map.c +714 -0
- data/ext/minimap2/minimap.h +410 -0
- data/ext/minimap2/minimap2.1 +725 -0
- data/ext/minimap2/misc/README.md +179 -0
- data/ext/minimap2/misc/mmphase.js +335 -0
- data/ext/minimap2/misc/paftools.js +3149 -0
- data/ext/minimap2/misc.c +162 -0
- data/ext/minimap2/mmpriv.h +132 -0
- data/ext/minimap2/options.c +234 -0
- data/ext/minimap2/pe.c +177 -0
- data/ext/minimap2/python/README.rst +196 -0
- data/ext/minimap2/python/cmappy.h +152 -0
- data/ext/minimap2/python/cmappy.pxd +153 -0
- data/ext/minimap2/python/mappy.pyx +273 -0
- data/ext/minimap2/python/minimap2.py +39 -0
- data/ext/minimap2/sdust.c +213 -0
- data/ext/minimap2/sdust.h +25 -0
- data/ext/minimap2/seed.c +131 -0
- data/ext/minimap2/setup.py +55 -0
- data/ext/minimap2/sketch.c +143 -0
- data/ext/minimap2/splitidx.c +84 -0
- data/ext/minimap2/sse2neon/emmintrin.h +1689 -0
- data/ext/minimap2/test/MT-human.fa +278 -0
- data/ext/minimap2/test/MT-orang.fa +276 -0
- data/ext/minimap2/test/q-inv.fa +4 -0
- data/ext/minimap2/test/q2.fa +2 -0
- data/ext/minimap2/test/t-inv.fa +127 -0
- data/ext/minimap2/test/t2.fa +2 -0
- data/ext/minimap2/tex/Makefile +21 -0
- data/ext/minimap2/tex/bioinfo.cls +930 -0
- data/ext/minimap2/tex/blasr-mc.eval +17 -0
- data/ext/minimap2/tex/bowtie2-s3.sam.eval +28 -0
- data/ext/minimap2/tex/bwa-s3.sam.eval +52 -0
- data/ext/minimap2/tex/bwa.eval +55 -0
- data/ext/minimap2/tex/eval2roc.pl +33 -0
- data/ext/minimap2/tex/graphmap.eval +4 -0
- data/ext/minimap2/tex/hs38-simu.sh +10 -0
- data/ext/minimap2/tex/minialign.eval +49 -0
- data/ext/minimap2/tex/minimap2.bib +460 -0
- data/ext/minimap2/tex/minimap2.tex +724 -0
- data/ext/minimap2/tex/mm2-s3.sam.eval +62 -0
- data/ext/minimap2/tex/mm2-update.tex +240 -0
- data/ext/minimap2/tex/mm2.approx.eval +12 -0
- data/ext/minimap2/tex/mm2.eval +13 -0
- data/ext/minimap2/tex/natbib.bst +1288 -0
- data/ext/minimap2/tex/natbib.sty +803 -0
- data/ext/minimap2/tex/ngmlr.eval +38 -0
- data/ext/minimap2/tex/roc.gp +60 -0
- data/ext/minimap2/tex/snap-s3.sam.eval +62 -0
- data/ext/minimap2.patch +19 -0
- data/lib/minimap2/aligner.rb +4 -4
- data/lib/minimap2/alignment.rb +11 -11
- data/lib/minimap2/ffi/constants.rb +20 -16
- data/lib/minimap2/ffi/functions.rb +5 -0
- data/lib/minimap2/ffi.rb +4 -5
- data/lib/minimap2/version.rb +2 -2
- data/lib/minimap2.rb +51 -15
- metadata +97 -79
- data/lib/minimap2/ffi_helper.rb +0 -53
- data/vendor/libminimap2.so +0 -0
@@ -0,0 +1,403 @@
|
|
1
|
+
[![GitHub Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=social&logo=github&label=Download)](https://github.com/lh3/minimap2/releases)
|
2
|
+
[![BioConda Install](https://img.shields.io/conda/dn/bioconda/minimap2.svg?style=flag&label=BioConda%20install)](https://anaconda.org/bioconda/minimap2)
|
3
|
+
[![PyPI](https://img.shields.io/pypi/v/mappy.svg?style=flat)](https://pypi.python.org/pypi/mappy)
|
4
|
+
[![Build Status](https://github.com/lh3/minimap2/actions/workflows/ci.yaml/badge.svg)](https://github.com/lh3/minimap2/actions)
|
5
|
+
## <a name="started"></a>Getting Started
|
6
|
+
```sh
|
7
|
+
git clone https://github.com/lh3/minimap2
|
8
|
+
cd minimap2 && make
|
9
|
+
# long sequences against a reference genome
|
10
|
+
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
|
11
|
+
# create an index first and then map
|
12
|
+
./minimap2 -x map-ont -d MT-human-ont.mmi test/MT-human.fa
|
13
|
+
./minimap2 -a MT-human-ont.mmi test/MT-orang.fa > test.sam
|
14
|
+
# use presets (no test data)
|
15
|
+
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam # PacBio CLR genomic reads
|
16
|
+
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam # Oxford Nanopore genomic reads
|
17
|
+
./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
|
18
|
+
./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.18 or earlier)
|
19
|
+
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam # short genomic paired-end reads
|
20
|
+
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam # spliced long reads (strand unknown)
|
21
|
+
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam # noisy Nanopore Direct RNA-seq
|
22
|
+
./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam # Final PacBio Iso-seq or traditional cDNA
|
23
|
+
./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam # prioritize on annotated junctions
|
24
|
+
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf # intra-species asm-to-asm alignment
|
25
|
+
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf # PacBio read overlap
|
26
|
+
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf # Nanopore read overlap
|
27
|
+
# man page for detailed command line options
|
28
|
+
man ./minimap2.1
|
29
|
+
```
|
30
|
+
|
31
|
+
## Table of Contents
|
32
|
+
|
33
|
+
- [Getting Started](#started)
|
34
|
+
- [Users' Guide](#uguide)
|
35
|
+
- [Installation](#install)
|
36
|
+
- [General usage](#general)
|
37
|
+
- [Use cases](#cases)
|
38
|
+
- [Map long noisy genomic reads](#map-long-genomic)
|
39
|
+
- [Map long mRNA/cDNA reads](#map-long-splice)
|
40
|
+
- [Find overlaps between long reads](#long-overlap)
|
41
|
+
- [Map short accurate genomic reads](#short-genomic)
|
42
|
+
- [Full genome/assembly alignment](#full-genome)
|
43
|
+
- [Advanced features](#advanced)
|
44
|
+
- [Working with >65535 CIGAR operations](#long-cigar)
|
45
|
+
- [The cs optional tag](#cs)
|
46
|
+
- [Working with the PAF format](#paftools)
|
47
|
+
- [Algorithm overview](#algo)
|
48
|
+
- [Getting help](#help)
|
49
|
+
- [Citing minimap2](#cite)
|
50
|
+
- [Developers' Guide](#dguide)
|
51
|
+
- [Limitations](#limit)
|
52
|
+
|
53
|
+
## <a name="uguide"></a>Users' Guide
|
54
|
+
|
55
|
+
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA
|
56
|
+
sequences against a large reference database. Typical use cases include: (1)
|
57
|
+
mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
|
58
|
+
finding overlaps between long reads with error rate up to ~15%; (3)
|
59
|
+
splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
|
60
|
+
against a reference genome; (4) aligning Illumina single- or paired-end reads;
|
61
|
+
(5) assembly-to-assembly alignment; (6) full-genome alignment between two
|
62
|
+
closely related species with divergence below ~15%.
|
63
|
+
|
64
|
+
For ~10kb noisy read sequences, minimap2 is tens of times faster than
|
65
|
+
mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more
|
66
|
+
accurate on simulated long reads and produces biologically meaningful alignments
|
67
|
+
ready for downstream analyses. For >100bp Illumina short reads, minimap2 is
|
68
|
+
three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data.
|
69
|
+
Detailed evaluations are available from the [minimap2 paper][doi] or the
|
70
|
+
[preprint][preprint].
|
71
|
+
|
72
|
+
### <a name="install"></a>Installation
|
73
|
+
|
74
|
+
Minimap2 is optimized for x86-64 CPUs. You can acquire precompiled binaries from
|
75
|
+
the [release page][release] with:
|
76
|
+
```sh
|
77
|
+
curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
|
78
|
+
./minimap2-2.24_x64-linux/minimap2
|
79
|
+
```
|
80
|
+
If you want to compile from the source, you need to have a C compiler, GNU make
|
81
|
+
and zlib development files installed. Then type `make` in the source code
|
82
|
+
directory to compile. If you see compilation errors, try `make sse2only=1`
|
83
|
+
to disable SSE4 code, which will make minimap2 slightly slower.
|
84
|
+
|
85
|
+
Minimap2 also works with ARM CPUs supporting the NEON instruction sets. To
|
86
|
+
compile for 32 bit ARM architectures (such as ARMv7), use `make arm_neon=1`. To
|
87
|
+
compile for 64 bit ARM architectures (such as ARMv8), use `make arm_neon=1
|
88
|
+
aarch64=1`.
|
89
|
+
|
90
|
+
Minimap2 can use the [SIMD Everywhere (SIMDe)][simde] library for porting its
|
91
|
+
implementation to the different SIMD instruction sets. To compile using SIMDe,
|
92
|
+
use `make -f Makefile.simde`. To compile for ARM CPUs, use `Makefile.simde`
|
93
|
+
with the ARM related command lines given above.
|
94
|
+
|
95
|
+
### <a name="general"></a>General usage
|
96
|
+
|
97
|
+
Without any options, minimap2 takes a reference database and a query sequence
|
98
|
+
file as input and produces approximate mappings, without base-level alignment
|
99
|
+
(i.e. coordinates are only approximate and no CIGAR in output), in the [PAF format][paf]:
|
100
|
+
```sh
|
101
|
+
minimap2 ref.fa query.fq > approx-mapping.paf
|
102
|
+
```
|
103
|
+
You can ask minimap2 to generate CIGAR at the `cg` tag of PAF with:
|
104
|
+
```sh
|
105
|
+
minimap2 -c ref.fa query.fq > alignment.paf
|
106
|
+
```
|
107
|
+
or to output alignments in the [SAM format][sam]:
|
108
|
+
```sh
|
109
|
+
minimap2 -a ref.fa query.fq > alignment.sam
|
110
|
+
```
|
111
|
+
Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You
|
112
|
+
don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
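
As a rough illustration of working with the PAF output described above, the sketch below splits one PAF line into its 12 mandatory columns plus the optional `key:type:value` tags (such as `cg`). The field labels (`qname`, `tstart`, ...) and the sample record are made up for this example; the [PAF specification][paf] remains the authoritative description of the columns.

```python
# Minimal PAF line parser (illustrative sketch; field labels are our own).
PAF_COLUMNS = [
    ("qname", str), ("qlen", int), ("qstart", int), ("qend", int),
    ("strand", str), ("tname", str), ("tlen", int), ("tstart", int),
    ("tend", int), ("nmatch", int), ("alen", int), ("mapq", int),
]


def parse_paf_line(line):
    """Return a dict with the 12 mandatory PAF columns and a 'tags' dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(PAF_COLUMNS):
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    record = {name: cast(value)
              for (name, cast), value in zip(PAF_COLUMNS, fields)}
    tags = {}
    for item in fields[len(PAF_COLUMNS):]:
        key, typ, value = item.split(":", 2)
        if typ == "i":          # integer tag
            value = int(value)
        elif typ == "f":        # floating-point tag
            value = float(value)
        tags[key] = value
    record["tags"] = tags
    return record


# Hypothetical record, not real minimap2 output.
example = ("read1\t1000\t10\t990\t+\tchr1\t50000\t2010\t2990"
           "\t950\t980\t60\ttp:A:P\tcg:Z:980M")
rec = parse_paf_line(example)
print(rec["tname"], rec["tstart"], rec["tend"], rec["tags"].get("cg"))
```

Without `-c`, the `cg` tag is simply absent, so `rec["tags"].get("cg")` returns `None`.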
|
113
|
+
|
114
|
+
For the human reference genome, minimap2 takes a few minutes to generate a
|
115
|
+
minimizer index for the reference before mapping. To reduce indexing time, you
|
116
|
+
can optionally save the index with option **-d** and replace the reference
|
117
|
+
sequence file with the index file on the minimap2 command line:
|
118
|
+
```sh
|
119
|
+
minimap2 -d ref.mmi ref.fa # indexing
|
120
|
+
minimap2 -a ref.mmi reads.fq > alignment.sam # alignment
|
121
|
+
```
|
122
|
+
***Importantly***, it should be noted that once you build the index, indexing
|
123
|
+
parameters such as **-k**, **-w**, **-H** and **-I** can't be changed during
|
124
|
+
mapping. If you are running minimap2 for different data types, you will
|
125
|
+
probably need to keep multiple indexes generated with different parameters.
|
126
|
+
This makes minimap2 different from BWA which always uses the same index
|
127
|
+
regardless of query data types.
|
128
|
+
|
129
|
+
### <a name="cases"></a>Use cases
|
130
|
+
|
131
|
+
Minimap2 uses the same base algorithm for all applications. However, due to the
|
132
|
+
different data types it supports (e.g. short vs long reads; DNA vs mRNA reads),
|
133
|
+
minimap2 needs to be tuned for optimal performance and accuracy. It is usually
|
134
|
+
recommended to choose a preset with option **-x**, which sets multiple
|
135
|
+
parameters at the same time. The default setting is the same as `map-ont`.
|
136
|
+
|
137
|
+
#### <a name="map-long-genomic"></a>Map long noisy genomic reads
|
138
|
+
|
139
|
+
```sh
|
140
|
+
minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio CLR reads
|
141
|
+
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
|
142
|
+
```
|
143
|
+
The difference between `map-pb` and `map-ont` is that `map-pb` uses
|
144
|
+
homopolymer-compressed (HPC) minimizers as seeds, while `map-ont` uses ordinary
|
145
|
+
minimizers as seeds. Empirical evaluation suggests HPC minimizers improve
|
146
|
+
performance and sensitivity when aligning PacBio CLR reads, but hurt when aligning
|
147
|
+
Nanopore reads.
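
To make the term concrete: homopolymer compression collapses every run of identical bases into a single base before seeds are taken. The snippet below is only a sketch of that compression step on a plain string; the function name `hpc` is ours, and it does not reproduce how minimap2 computes HPC minimizers internally.

```python
from itertools import groupby


def hpc(seq):
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


print(hpc("GATTTACCA"))  # GATACA
```

Because run lengths are discarded, a read with a miscounted homopolymer length compresses to the same string as the reference.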
|
148
|
+
|
149
|
+
#### <a name="map-long-splice"></a>Map long mRNA/cDNA reads
|
150
|
+
|
151
|
+
```sh
|
152
|
+
minimap2 -ax splice:hq -uf ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA
|
153
|
+
minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam # Nanopore 2D cDNA-seq
|
154
|
+
minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq
|
155
|
+
minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa # mapping against SIRV control
|
156
|
+
```
|
157
|
+
There are different long-read RNA-seq technologies, including traditional
|
158
|
+
full-length cDNA, EST, PacBio Iso-seq, Nanopore 2D cDNA-seq and Direct RNA-seq.
|
159
|
+
They produce data of varying quality and properties. By default, `-x splice`
|
160
|
+
assumes the read orientation relative to the transcript strand is unknown. It
|
161
|
+
tries two rounds of alignment to infer the orientation and writes the strand to
|
162
|
+
the `ts` SAM/PAF tag if possible. For Iso-seq, Direct RNA-seq and traditional
|
163
|
+
full-length cDNAs, it is desirable to apply `-u f` to force minimap2 to
|
164
|
+
consider the forward transcript strand only. This speeds up alignment with
|
165
|
+
slight improvement to accuracy. For noisy Nanopore Direct RNA-seq reads, it is
|
166
|
+
recommended to use a smaller k-mer size for increased sensitivity to the first
|
167
|
+
or the last exons.
|
168
|
+
|
169
|
+
Minimap2 rates an alignment by the score of the max-scoring sub-segment,
|
170
|
+
*excluding* introns, and marks the best alignment as primary in SAM. When a
|
171
|
+
spliced gene also has unspliced pseudogenes, minimap2 does not intentionally
|
172
|
+
prefer spliced alignment, though in practice it more often marks the spliced
|
173
|
+
alignment as the primary. By default, minimap2 outputs up to five secondary
|
174
|
+
alignments (i.e. likely pseudogenes in the context of RNA-seq mapping). This
|
175
|
+
can be tuned with option **-N**.
|
176
|
+
|
177
|
+
For long RNA-seq reads, minimap2 may produce chimeric alignments potentially
|
178
|
+
caused by gene fusions/structural variations or by an intron longer than the
|
179
|
+
max intron length **-G** (200k by default). For now, it is not recommended to
|
180
|
+
apply an excessively large **-G** as this slows down minimap2 and sometimes
|
181
|
+
leads to false alignments.
|
182
|
+
|
183
|
+
It is worth noting that by default `-x splice` prefers GT[A/G]..[C/T]AG
|
184
|
+
over GT[C/T]..[A/G]AG, and then over other splicing signals. Considering
|
185
|
+
one additional base improves the junction accuracy for noisy reads, but
|
186
|
+
reduces the accuracy when aligning against the widely used SIRV control data.
|
187
|
+
This is because SIRV does not honor the evolutionarily conserved splicing
|
188
|
+
signal. If you are studying SIRV, you may apply `--splice-flank=no` to let
|
189
|
+
minimap2 only model GT..AG, ignoring the additional base.
|
190
|
+
|
191
|
+
Since v2.17, minimap2 can optionally take annotated genes as input and
|
192
|
+
prioritize annotated splice junctions. To use this feature, you can run:
|
193
|
+
```sh
|
194
|
+
paftools.js gff2bed anno.gff > anno.bed
|
195
|
+
minimap2 -ax splice --junc-bed anno.bed ref.fa query.fa > aln.sam
|
196
|
+
```
|
197
|
+
Here, `anno.gff` is the gene annotation in the GTF or GFF3 format (`gff2bed`
|
198
|
+
automatically tests the format). The output of `gff2bed` is in the 12-column
|
199
|
+
BED format, or the BED12 format. With the `--junc-bed` option, minimap2 adds a
|
200
|
+
bonus score (tuned by `--junc-bonus`) if an aligned junction matches a junction
|
201
|
+
in the annotation. Option `--junc-bed` also takes 5-column BED, including the
|
202
|
+
strand field. In this case, each line indicates an oriented junction.
|
203
|
+
|
204
|
+
#### <a name="long-overlap"></a>Find overlaps between long reads
|
205
|
+
|
206
|
+
```sh
|
207
|
+
minimap2 -x ava-pb reads.fq reads.fq > ovlp.paf # PacBio CLR read overlap
|
208
|
+
minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf # Oxford Nanopore read overlap
|
209
|
+
```
|
210
|
+
Similarly, `ava-pb` uses HPC minimizers while `ava-ont` uses ordinary
|
211
|
+
minimizers. It is usually not recommended to perform base-level alignment in
|
212
|
+
the overlapping mode because it is slow and may produce false positive
|
213
|
+
overlaps. However, if performance is not a concern, you may try to add `-a` or
|
214
|
+
`-c` anyway.
|
215
|
+
|
216
|
+
#### <a name="short-genomic"></a>Map short accurate genomic reads
|
217
|
+
|
218
|
+
```sh
|
219
|
+
minimap2 -ax sr ref.fa reads-se.fq > aln.sam # single-end alignment
|
220
|
+
minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam # paired-end alignment
|
221
|
+
minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam # paired-end alignment
|
222
|
+
```
|
223
|
+
When two read files are specified, minimap2 reads from each file in turn and
|
224
|
+
merges them into an interleaved stream internally. Two reads are considered to
|
225
|
+
be paired if they are adjacent in the input stream and have the same name (with
|
226
|
+
the `/[0-9]` suffix trimmed if present). Single- and paired-end reads can be
|
227
|
+
mixed.
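
The pairing rule above can be sketched as follows. This is an illustration based only on the description in this section, not on minimap2's C code: the helper names are hypothetical, and the suffix is assumed to be a slash followed by one digit at the end of the read name.

```python
import re

# Assumption: the "/[0-9]" suffix is a slash plus one digit at the end of the name.
_SUFFIX = re.compile(r"/[0-9]$")


def base_name(name):
    """Read name with a trailing /digit suffix removed, if present."""
    return _SUFFIX.sub("", name)


def group_reads(names):
    """Group an interleaved stream of read names into pairs and singletons."""
    groups = []
    i = 0
    while i < len(names):
        # Two adjacent reads with the same trimmed name form a pair.
        if i + 1 < len(names) and base_name(names[i]) == base_name(names[i + 1]):
            groups.append((names[i], names[i + 1]))
            i += 2
        else:
            groups.append((names[i],))
            i += 1
    return groups


print(group_reads(["r1/1", "r1/2", "r2", "r3/1", "r3/2"]))
```

Here `r2` has no adjacent mate and is treated as a single-end read, which is how single- and paired-end reads can share one stream.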
|
228
|
+
|
229
|
+
Minimap2 does not work well with short spliced reads. There are many capable
|
230
|
+
RNA-seq mappers for short reads.
|
231
|
+
|
232
|
+
#### <a name="full-genome"></a>Full genome/assembly alignment
|
233
|
+
|
234
|
+
```sh
|
235
|
+
minimap2 -ax asm5 ref.fa asm.fa > aln.sam # assembly to assembly/ref alignment
|
236
|
+
```
|
237
|
+
For cross-species full-genome alignment, the scoring system needs to be tuned
|
238
|
+
according to the sequence divergence.
|
239
|
+
|
240
|
+
### <a name="advanced"></a>Advanced features
|
241
|
+
|
242
|
+
#### <a name="long-cigar"></a>Working with >65535 CIGAR operations
|
243
|
+
|
244
|
+
Due to a design flaw, BAM does not work with CIGAR strings with >65535
|
245
|
+
operations (SAM and CRAM work). However, for ultra-long nanopore reads minimap2
|
246
|
+
may align ~1% of read bases with long CIGARs beyond the capability of BAM. If
|
247
|
+
you convert such SAM/CRAM to BAM, Picard and recent samtools will throw an
|
248
|
+
error and abort. Older samtools and other tools may create corrupted BAM.
|
249
|
+
|
250
|
+
To avoid this issue, you can add option `-L` at the minimap2 command line.
|
251
|
+
This option moves a long CIGAR to the `CG` tag and leaves a fully clipped CIGAR
|
252
|
+
at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
|
253
|
+
sorting) still work with such BAM records; tools that read CIGAR will
|
254
|
+
effectively ignore these records. It has been decided that future tools
|
255
|
+
will seamlessly recognize long-cigar records generated by option `-L`.
|
256
|
+
|
257
|
+
**TL;DR**: if you work with ultra-long reads and use tools that only process
|
258
|
+
BAM files, please add option `-L`.
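
If you are unsure whether a SAM file contains such records, a quick check is to count the operations in each CIGAR string. The sketch below does this for a single string; the 65535 limit is the one quoted above, while the function names are hypothetical and the operation letters are those defined by the SAM specification.

```python
import re

BAM_MAX_CIGAR_OPS = 65535  # limit quoted in the text above

# One CIGAR operation: a length followed by an operation letter.
_CIGAR_OP = re.compile(r"[0-9]+[MIDNSHP=X]")


def count_cigar_ops(cigar):
    """Number of operations in a SAM CIGAR string ('*' means unavailable)."""
    if cigar == "*":
        return 0
    ops = _CIGAR_OP.findall(cigar)
    if "".join(ops) != cigar:
        raise ValueError("malformed CIGAR string")
    return len(ops)


def exceeds_bam_limit(cigar):
    """True if the CIGAR has more operations than the limit above."""
    return count_cigar_ops(cigar) > BAM_MAX_CIGAR_OPS


print(count_cigar_ops("6M3D10M3I4M1X3M"))   # 7
print(exceeds_bam_limit("1M1I" * 40000))    # True (80000 operations)
```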
|
259
|
+
|
260
|
+
#### <a name="cs"></a>The cs optional tag
|
261
|
+
|
262
|
+
The `cs` SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular
|
263
|
+
expression `/(:[0-9]+|\*[a-z][a-z]|[=\+\-][A-Za-z]+)+/`. Like CIGAR, `cs`
|
264
|
+
consists of a series of operations. Each leading character specifies the
|
265
|
+
operation; the following sequence is the one involved in the operation.
|
266
|
+
|
267
|
+
The `cs` tag is enabled by command line option `--cs`. The following alignment,
|
268
|
+
for example:
|
269
|
+
```txt
|
270
|
+
CGATCGATAAATAGAGTAG---GAATAGCA
|
271
|
+
|||||| |||||||||| |||| |||
|
272
|
+
CGATCG---AATAGAGTAGGTCGAATtGCA
|
273
|
+
```
|
274
|
+
is represented as `:6-ata:10+gtc:4*at:3`, where `:[0-9]+` represents an
|
275
|
+
identical block, `-ata` represents a deletion, `+gtc` an insertion and `*at`
|
276
|
+
indicates reference base `a` is substituted with a query base `t`. It is
|
277
|
+
similar to the `MD` SAM tag but is standalone and easier to parse.
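
To show how little code the parsing takes, here is a minimal sketch that tokenizes a `cs` string using the regular expression given above and reports how many reference and query bases the alignment spans. The function names and the tuple layout are our own choices, and operations outside that expression are rejected rather than handled.

```python
import re

# Same alternatives as the regular expression quoted in the text.
_CS_OP = re.compile(r":[0-9]+|\*[a-z][a-z]|[=+\-][A-Za-z]+")


def parse_cs(cs):
    """Split a cs string into (kind, value) operations."""
    ops, pos = [], 0
    for m in _CS_OP.finditer(cs):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        pos = m.end()
        kind, body = m.group()[0], m.group()[1:]
        if kind == ":":
            ops.append(("match", int(body)))
        elif kind == "=":
            ops.append(("match", len(body)))
        elif kind == "*":
            ops.append(("sub", body))   # reference base, then query base
        elif kind == "-":
            ops.append(("del", body))   # bases present only in the reference
        else:
            ops.append(("ins", body))   # bases present only in the query
    if pos != len(cs):
        raise ValueError(f"unexpected text at offset {pos}")
    return ops


def aligned_lengths(ops):
    """Return (reference bases, query bases) covered by the operations."""
    ref = query = 0
    for kind, value in ops:
        if kind == "match":
            ref += value
            query += value
        elif kind == "sub":
            ref += 1
            query += 1
        elif kind == "del":
            ref += len(value)
        else:
            query += len(value)
    return ref, query


ops = parse_cs(":6-ata:10+gtc:4*at:3")
print(ops)
print(aligned_lengths(ops))  # (27, 27)
```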
|
278
|
+
|
279
|
+
If `--cs=long` is used, the `cs` string also contains identical sequences in
|
280
|
+
the alignment. The above example will become
|
281
|
+
`=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA`. The long form of `cs` encodes both
|
282
|
+
reference and query sequences in one string. The `cs` tag also encodes intron
|
283
|
+
positions and splicing signals (see the [minimap2 manpage][manpage-cs] for
|
284
|
+
details).
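
Because the long form carries both sequences, they can be rebuilt from the tag alone. The sketch below does so for the example above; the function name is hypothetical, intron operations are not handled, and the query is returned as it appears in the alignment (that is, on the aligned strand).

```python
import re

# Long-form operations: identical block, substitution, insertion, deletion.
_CS_LONG_OP = re.compile(r"=[A-Za-z]+|\*[a-z][a-z]|[+\-][A-Za-z]+")


def sequences_from_long_cs(cs):
    """Rebuild (reference, query) sequences from a long-form cs string."""
    ref, query, pos = [], [], 0
    for m in _CS_LONG_OP.finditer(cs):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        pos = m.end()
        kind, body = m.group()[0], m.group()[1:].upper()
        if kind == "=":          # identical in both sequences
            ref.append(body)
            query.append(body)
        elif kind == "*":        # reference base, then query base
            ref.append(body[0])
            query.append(body[1])
        elif kind == "-":        # deletion: only in the reference
            ref.append(body)
        else:                    # insertion: only in the query
            query.append(body)
    if pos != len(cs):
        raise ValueError(f"unexpected text at offset {pos}")
    return "".join(ref), "".join(query)


ref, query = sequences_from_long_cs("=CGATCG-ata=AATAGAGTAG+gtc=GAAT*at=GCA")
print(ref)    # CGATCGATAAATAGAGTAGGAATAGCA
print(query)  # CGATCGAATAGAGTAGGTCGAATTGCA
```

The two printed strings are the rows of the alignment shown earlier with the gap characters removed.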
|
285
|
+
|
286
|
+
#### <a name="paftools"></a>Working with the PAF format
|
287
|
+
|
288
|
+
Minimap2 also comes with a (java)script [paftools.js](misc/paftools.js) that
|
289
|
+
processes alignments in the PAF format. It calls variants from
|
290
|
+
assembly-to-reference alignment, lifts over BED files based on alignment,
|
291
|
+
converts between formats and provides utilities for various evaluations. For
|
292
|
+
details, please see [misc/README.md](misc/README.md).
|
293
|
+
|
294
|
+
### <a name="algo"></a>Algorithm overview
|
295
|
+
|
296
|
+
In the following, minimap2 command line options have a dash ahead and are
|
297
|
+
highlighted in bold. The description may help to tune minimap2 parameters.
|
298
|
+
|
299
|
+
1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
|
300
|
+
index them in a hash table.
|
301
|
+
|
302
|
+
2. Read **-K** [=*200M*] query bases. For each query sequence, do steps 3
|
303
|
+
through 7:
|
304
|
+
|
305
|
+
3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
|
306
|
+
index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
|
307
|
+
frequent, collect its occurrences in the reference, which are called
|
308
|
+
*seeds*.
|
309
|
+
|
310
|
+
4. Sort seeds by position in the reference. Chain them with dynamic
|
311
|
+
programming. Each chain represents a potential mapping. For read
|
312
|
+
overlapping, report all chains and then go to step 8. For reference mapping,
|
313
|
+
do steps 5 through 7:
|
314
|
+
|
315
|
+
5. Let *P* be the set of primary mappings, which is an empty set initially. For
|
316
|
+
each chain from the best to the worst according to their chaining scores: if
|
317
|
+
on the query, the chain overlaps with a chain in *P* by **--mask-level**
|
318
|
+
[=*0.5*] or higher fraction of the shorter chain, mark the chain as
|
319
|
+
*secondary* to the chain in *P*; otherwise, add the chain to *P*.
|
320
|
+
|
321
|
+
6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
|
322
|
+
mappings if their chaining scores are higher than **-p** [=*0.8*] of their
|
323
|
+
corresponding primary mappings.
|
324
|
+
|
325
|
+
7. If alignment is requested, filter out an internal seed if it potentially
|
326
|
+
leads to both a long insertion and a long deletion. Extend from the
|
327
|
+
left-most seed. Perform global alignments between internal seeds. Split the
|
328
|
+
chain if the accumulative score along the global alignment drops by **-z**
|
329
|
+
[=*400*], disregarding long gaps. Extend from the right-most seed. Output
|
330
|
+
chains and their alignments.
|
331
|
+
|
332
|
+
8. If there are more query sequences in the input, go to step 2 until no more
|
333
|
+
queries are left.
|
334
|
+
|
335
|
+
9. If there are more reference sequences, reopen the query file from the start
|
336
|
+
and go to step 1; otherwise stop.
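
Steps 5 and 6 above can be sketched in a few lines. This is a simplified illustration, not minimap2's implementation: chains are reduced to hypothetical `(query_start, query_end, score)` tuples, the parameter names merely mirror **--mask-level**, **-N** and **-p**, and the cap of **-N** secondaries is assumed to apply per query.

```python
def select_mappings(chains, mask_level=0.5, best_n=5, pri_ratio=0.8):
    """Return (primary indices, retained secondary indices).

    chains is a list of (query_start, query_end, chaining_score) tuples.
    """
    # Step 5: visit chains from the best to the worst chaining score.
    order = sorted(range(len(chains)), key=lambda i: chains[i][2], reverse=True)
    primaries = []   # the set P, as indices into chains
    parent = {}      # secondary chain index -> index of its primary chain
    for i in order:
        qs, qe, _ = chains[i]
        for j in primaries:
            ps, pe, _ = chains[j]
            overlap = min(qe, pe) - max(qs, ps)
            shorter = min(qe - qs, pe - ps)
            if overlap > 0 and overlap >= mask_level * shorter:
                parent[i] = j      # secondary to chain j in P
                break
        else:
            primaries.append(i)    # no sufficient overlap: add to P

    # Step 6: keep up to best_n secondaries scoring above
    # pri_ratio times the score of their primary.
    kept = []
    for i in order:
        if i in parent and len(kept) < best_n:
            if chains[i][2] > pri_ratio * chains[parent[i]][2]:
                kept.append(i)
    return primaries, kept


chains = [(0, 1000, 900), (50, 950, 800), (1200, 1800, 500), (100, 900, 300)]
print(select_mappings(chains))  # ([0, 2], [1])
```

In this made-up input, chain 1 overlaps chain 0 and scores above 80% of it, so it is kept as a secondary; chain 3 overlaps chain 0 as well but scores too low; chain 2 covers a different part of the query and becomes a second primary.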
|
337
|
+
|
338
|
+
### <a name="help"></a>Getting help
|
339
|
+
|
340
|
+
The manpage [minimap2.1][manpage] provides a detailed description of minimap2
|
341
|
+
command line options and optional tags. The [FAQ](FAQ.md) page answers several
|
342
|
+
frequently asked questions. If you encounter bugs or have further questions or
|
343
|
+
requests, you can raise an issue at the [issue page][issue]. There is not a
|
344
|
+
specific mailing list for the time being.
|
345
|
+
|
346
|
+
### <a name="cite"></a>Citing minimap2
|
347
|
+
|
348
|
+
If you use minimap2 in your work, please cite:
|
349
|
+
|
350
|
+
> Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences.
|
351
|
+
> *Bioinformatics*, **34**:3094-3100. [doi:10.1093/bioinformatics/bty191][doi]
|
352
|
+
|
353
|
+
## <a name="dguide"></a>Developers' Guide
|
354
|
+
|
355
|
+
Minimap2 is not only a command line tool, but also a programming library.
|
356
|
+
It provides C APIs to build/load index and to align sequences against the
|
357
|
+
index. File [example.c](example.c) demonstrates typical uses of C APIs. Header
|
358
|
+
file [minimap.h](minimap.h) gives more detailed API documentation. Minimap2
|
359
|
+
aims to keep APIs in this header stable. File [mmpriv.h](mmpriv.h) contains
|
360
|
+
additional private APIs, which may change frequently.
|
361
|
+
|
362
|
+
This repository also provides Python bindings to a subset of C APIs. File
|
363
|
+
[python/README.rst](python/README.rst) gives the full documentation;
|
364
|
+
[python/minimap2.py](python/minimap2.py) shows an example. This Python
|
365
|
+
extension, mappy, is also [available from PyPI][mappypypi] via `pip install
|
366
|
+
mappy` or [from BioConda][mappyconda] via `conda install -c bioconda mappy`.
|
367
|
+
|
368
|
+
## <a name="limit"></a>Limitations
|
369
|
+
|
370
|
+
* Minimap2 may produce suboptimal alignments through long low-complexity
|
371
|
+
regions where seed positions may be suboptimal. This should not be a big
|
372
|
+
concern because even the optimal alignment may be wrong in such regions.
|
373
|
+
|
374
|
+
* Minimap2 requires SSE2 instructions on x86 CPUs or NEON on ARM CPUs. It is
|
375
|
+
possible to add non-SIMD support, but it would make minimap2 slower by
|
376
|
+
several times.
|
377
|
+
|
378
|
+
* Minimap2 does not work with a single query or database sequence ~2
|
379
|
+
billion bases or longer (2,147,483,647 to be exact). The total length of all
|
380
|
+
sequences can well exceed this threshold.
|
381
|
+
|
382
|
+
* Minimap2 often misses small exons.
|
383
|
+
|
384
|
+
|
385
|
+
|
386
|
+
[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
|
387
|
+
[sam]: https://samtools.github.io/hts-specs/SAMv1.pdf
|
388
|
+
[minimap]: https://github.com/lh3/minimap
|
389
|
+
[smartdenovo]: https://github.com/ruanjue/smartdenovo
|
390
|
+
[longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791
|
391
|
+
[gaba]: https://github.com/ocxtal/libgaba
|
392
|
+
[ksw2]: https://github.com/lh3/ksw2
|
393
|
+
[preprint]: https://arxiv.org/abs/1708.01492
|
394
|
+
[release]: https://github.com/lh3/minimap2/releases
|
395
|
+
[mappypypi]: https://pypi.python.org/pypi/mappy
|
396
|
+
[mappyconda]: https://anaconda.org/bioconda/mappy
|
397
|
+
[issue]: https://github.com/lh3/minimap2/issues
|
398
|
+
[k8]: https://github.com/attractivechaos/k8
|
399
|
+
[manpage]: https://lh3.github.io/minimap2/minimap2.html
|
400
|
+
[manpage-cs]: https://lh3.github.io/minimap2/minimap2.html#10
|
401
|
+
[doi]: https://doi.org/10.1093/bioinformatics/bty191
|
402
|
+
[simde]: https://github.com/nemequ/simde
|
403
|
+
[unimap]: https://github.com/lh3/unimap
|