minimap2 0.2.22.0 → 0.2.24.1
- checksums.yaml +4 -4
- data/README.md +60 -76
- data/ext/Rakefile +55 -0
- data/ext/cmappy/cmappy.c +129 -0
- data/ext/cmappy/cmappy.h +44 -0
- data/ext/minimap2/FAQ.md +46 -0
- data/ext/minimap2/LICENSE.txt +24 -0
- data/ext/minimap2/MANIFEST.in +10 -0
- data/ext/minimap2/Makefile +132 -0
- data/ext/minimap2/Makefile.simde +97 -0
- data/ext/minimap2/NEWS.md +821 -0
- data/ext/minimap2/README.md +403 -0
- data/ext/minimap2/align.c +1020 -0
- data/ext/minimap2/bseq.c +169 -0
- data/ext/minimap2/bseq.h +64 -0
- data/ext/minimap2/code_of_conduct.md +30 -0
- data/ext/minimap2/cookbook.md +243 -0
- data/ext/minimap2/esterr.c +64 -0
- data/ext/minimap2/example.c +63 -0
- data/ext/minimap2/format.c +559 -0
- data/ext/minimap2/hit.c +466 -0
- data/ext/minimap2/index.c +775 -0
- data/ext/minimap2/kalloc.c +205 -0
- data/ext/minimap2/kalloc.h +76 -0
- data/ext/minimap2/kdq.h +132 -0
- data/ext/minimap2/ketopt.h +120 -0
- data/ext/minimap2/khash.h +615 -0
- data/ext/minimap2/krmq.h +474 -0
- data/ext/minimap2/kseq.h +256 -0
- data/ext/minimap2/ksort.h +153 -0
- data/ext/minimap2/ksw2.h +184 -0
- data/ext/minimap2/ksw2_dispatch.c +96 -0
- data/ext/minimap2/ksw2_extd2_sse.c +402 -0
- data/ext/minimap2/ksw2_exts2_sse.c +416 -0
- data/ext/minimap2/ksw2_extz2_sse.c +313 -0
- data/ext/minimap2/ksw2_ll_sse.c +152 -0
- data/ext/minimap2/kthread.c +159 -0
- data/ext/minimap2/kthread.h +15 -0
- data/ext/minimap2/kvec.h +105 -0
- data/ext/minimap2/lchain.c +369 -0
- data/ext/minimap2/main.c +459 -0
- data/ext/minimap2/map.c +714 -0
- data/ext/minimap2/minimap.h +410 -0
- data/ext/minimap2/minimap2.1 +725 -0
- data/ext/minimap2/misc/README.md +179 -0
- data/ext/minimap2/misc/mmphase.js +335 -0
- data/ext/minimap2/misc/paftools.js +3149 -0
- data/ext/minimap2/misc.c +162 -0
- data/ext/minimap2/mmpriv.h +132 -0
- data/ext/minimap2/options.c +234 -0
- data/ext/minimap2/pe.c +177 -0
- data/ext/minimap2/python/README.rst +196 -0
- data/ext/minimap2/python/cmappy.h +152 -0
- data/ext/minimap2/python/cmappy.pxd +153 -0
- data/ext/minimap2/python/mappy.pyx +273 -0
- data/ext/minimap2/python/minimap2.py +39 -0
- data/ext/minimap2/sdust.c +213 -0
- data/ext/minimap2/sdust.h +25 -0
- data/ext/minimap2/seed.c +131 -0
- data/ext/minimap2/setup.py +55 -0
- data/ext/minimap2/sketch.c +143 -0
- data/ext/minimap2/splitidx.c +84 -0
- data/ext/minimap2/sse2neon/emmintrin.h +1689 -0
- data/ext/minimap2/test/MT-human.fa +278 -0
- data/ext/minimap2/test/MT-orang.fa +276 -0
- data/ext/minimap2/test/q-inv.fa +4 -0
- data/ext/minimap2/test/q2.fa +2 -0
- data/ext/minimap2/test/t-inv.fa +127 -0
- data/ext/minimap2/test/t2.fa +2 -0
- data/ext/minimap2/tex/Makefile +21 -0
- data/ext/minimap2/tex/bioinfo.cls +930 -0
- data/ext/minimap2/tex/blasr-mc.eval +17 -0
- data/ext/minimap2/tex/bowtie2-s3.sam.eval +28 -0
- data/ext/minimap2/tex/bwa-s3.sam.eval +52 -0
- data/ext/minimap2/tex/bwa.eval +55 -0
- data/ext/minimap2/tex/eval2roc.pl +33 -0
- data/ext/minimap2/tex/graphmap.eval +4 -0
- data/ext/minimap2/tex/hs38-simu.sh +10 -0
- data/ext/minimap2/tex/minialign.eval +49 -0
- data/ext/minimap2/tex/minimap2.bib +460 -0
- data/ext/minimap2/tex/minimap2.tex +724 -0
- data/ext/minimap2/tex/mm2-s3.sam.eval +62 -0
- data/ext/minimap2/tex/mm2-update.tex +240 -0
- data/ext/minimap2/tex/mm2.approx.eval +12 -0
- data/ext/minimap2/tex/mm2.eval +13 -0
- data/ext/minimap2/tex/natbib.bst +1288 -0
- data/ext/minimap2/tex/natbib.sty +803 -0
- data/ext/minimap2/tex/ngmlr.eval +38 -0
- data/ext/minimap2/tex/roc.gp +60 -0
- data/ext/minimap2/tex/snap-s3.sam.eval +62 -0
- data/ext/minimap2.patch +19 -0
- data/lib/minimap2/aligner.rb +4 -4
- data/lib/minimap2/alignment.rb +11 -11
- data/lib/minimap2/ffi/constants.rb +20 -16
- data/lib/minimap2/ffi/functions.rb +5 -0
- data/lib/minimap2/ffi.rb +4 -5
- data/lib/minimap2/version.rb +2 -2
- data/lib/minimap2.rb +51 -15
- metadata +97 -79
- data/lib/minimap2/ffi_helper.rb +0 -53
- data/vendor/libminimap2.so +0 -0
@@ -0,0 +1,62 @@
+Q 60 18579866 27 0.000001453 18579866
+Q 59 27087 4 0.000001666 18606953
+Q 58 21435 1 0.000001718 18628388
+Q 57 45663 3 0.000001874 18674051
+Q 56 36031 2 0.000001978 18710082
+Q 55 18499 2 0.000002082 18728581
+Q 54 14754 2 0.000002187 18743335
+Q 53 25541 2 0.000002291 18768876
+Q 52 26397 5 0.000002554 18795273
+Q 51 15090 3 0.000002711 18810363
+Q 50 13425 11 0.000003294 18823788
+Q 49 15175 2 0.000003397 18838963
+Q 48 19407 4 0.000003606 18858370
+Q 47 11538 16 0.000004452 18869908
+Q 46 12558 17 0.000005349 18882466
+Q 45 40362 28 0.000006817 18922828
+Q 44 10465 13 0.000007500 18933293
+Q 43 10098 20 0.000008552 18943391
+Q 42 10682 19 0.000009549 18954073
+Q 41 9823 11 0.000010125 18963896
+Q 40 9685 16 0.000010963 18973581
+Q 39 10273 18 0.000011905 18983854
+Q 38 9515 18 0.000012847 18993369
+Q 37 9474 27 0.000014261 19002843
+Q 36 10430 25 0.000015568 19013273
+Q 35 9241 34 0.000017348 19022514
+Q 34 9162 31 0.000018968 19031676
+Q 33 10164 49 0.000021532 19041840
+Q 32 9152 55 0.000024408 19050992
+Q 31 9252 35 0.000026233 19060244
+Q 30 9872 55 0.000029103 19070116
+Q 29 8938 65 0.000032496 19079054
+Q 28 8951 73 0.000036306 19088005
+Q 27 9949 95 0.000041261 19097954
+Q 26 9784 97 0.000046316 19107738
+Q 25 10126 97 0.000051366 19117864
+Q 24 11260 123 0.000057765 19129124
+Q 23 10047 114 0.000063691 19139171
+Q 22 9661 123 0.000070083 19148832
+Q 21 10339 168 0.000078813 19159171
+Q 20 17928 193 0.000088804 19177099
+Q 19 9842 193 0.000098817 19186941
+Q 18 14737 247 0.000111605 19201678
+Q 17 10218 238 0.000123934 19211896
+Q 16 10271 242 0.000136457 19222167
+Q 15 12241 333 0.000153683 19234408
+Q 14 9189 336 0.000171070 19243597
+Q 13 9493 515 0.000197734 19253090
+Q 12 11502 743 0.000236185 19264592
+Q 11 8211 507 0.000262390 19272803
+Q 10 9133 606 0.000293695 19281936
+Q 9 10014 931 0.000341801 19291950
+Q 8 8436 698 0.000377816 19300386
+Q 7 8443 705 0.000414163 19308829
+Q 6 10203 944 0.000462808 19319032
+Q 5 6936 756 0.000501760 19325968
+Q 4 6732 843 0.000545190 19332700
+Q 3 8215 1104 0.000602040 19340915
+Q 2 21201 5440 0.000882342 19362116
+Q 1 82328 22186 0.002019600 19444444
+Q 0 553853 371953 0.020562901 19998297
+U 1703
@@ -0,0 +1,240 @@
+\documentclass{bioinfo}
+\copyrightyear{2021}
+\pubyear{2021}
+
+\usepackage{graphicx}
+\usepackage{hyperref}
+\usepackage{url}
+\usepackage{amsmath}
+\usepackage[ruled,vlined]{algorithm2e}
+\newcommand\mycommfont[1]{\footnotesize\rmfamily{\it #1}}
+\SetCommentSty{mycommfont}
+\SetKwComment{Comment}{$\triangleright$\ }{}
+
+\usepackage{natbib}
+\bibliographystyle{apalike}
+
+\DeclareMathOperator*{\argmax}{argmax}
+
+\begin{document}
+\firstpage{1}
+
+\title[Improvements to minimap2]{New strategies to improve minimap2 alignment accuracy}
+\author[Li]{Heng Li$^{1,2}$}
+\address{$^1$Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA,
+$^2$Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA}
+
+\maketitle
+
+\begin{abstract}
+
+\section{Summary:} We present several recent improvements to minimap2, a
+versatile pairwise aligner for nucleotide sequences. Now minimap2 v2.22 can
+more accurately map long reads to highly repetitive regions and align through
+insertions or deletions up to 100kb by default, addressing major weaknesses in
+minimap2 v2.18 or earlier.
+
+\section{Availability and implementation:}
+\href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
+
+\section{Contact:} hli@ds.dfci.harvard.edu
+\end{abstract}
+
+\section{Introduction}
+Minimap2~\citep{Li:2018ab} is widely used for mapping long sequence
+reads and assembly contigs. \citet{Jain:2020aa} found minimap2 v2.18 or earlier occasionally
+misaligned reads from highly repetitive regions as minimap2 ignored seeds of
+high occurrence. They also noticed minimap2 may misplace reads with structural
+variations (SVs) in such regions~\citep{Jain2020.11.01.363887}. These
+misalignments have become a pressing issue with the advent of
+telomere-to-telomere human assembly~\citep{Miga:2020aa}. Meanwhile, old minimap2
+was unable to efficiently align long insertions/deletions (INDELs) and often
+broke an alignment around variable-number tandem repeats (VNTRs). This has
+inspired new chaining algorithms~\citep{Li:2020aa,Ren:2021aa} which are not
+integrated into minimap2. Here we will describe recent efforts implemented
+in v2.19 through v2.22 to improve mapping results.
+
+\begin{methods}
+\section{Methods}
+
+\subsection{Rescuing high-occurrence $k$-mers}\label{sec:high-occ}
+Minimap2 keeps all $k$-mer minimizers~\citep{Roberts:2004fv} during indexing. Its original
+implementation only selected low-occurrence minimizers during mapping. The
+cutoff is a few hundred for mapping long reads against a human genome. If a
+read harbors only a few or even no low-occurrence minimizers, it will fail
+chaining due to insufficient anchors.
+
+To resolve this issue, we implemented a new heuristic to add additional
+minimizers. Suppose we are looking at two adjacent low-occurrence $k$-mers
+located at positions $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge L$,
+minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/L\rfloor$ minimizers
+of the lowest occurrence among minimizers between $x_1$ and $x_2$. Here
+parameter $L$ controls the frequency of sampling. It defaults to 500.
+This strategy adds necessary anchors at the cost of increasing total alignment
+time by a few percent on real data.
+
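The rescue heuristic described in this subsection can be sketched in a few lines of Python. This is only an illustration of the rule as the manuscript states it, not minimap2's C implementation: the function name `rescue_minimizers`, the `(position, occurrence)` tuple layout, and the inclusive `occ <= max_occ` cutoff are assumptions made for the example, and read ends are not treated specially because the text does not say how they are handled.

```python
import heapq


def rescue_minimizers(minimizers, max_occ, L=500):
    """Select low-occurrence minimizers plus rescued ones in sparse stretches.

    minimizers: list of (position, occurrence) tuples sorted by position
                (hypothetical layout chosen for this sketch).
    max_occ:    occurrence cutoff; minimizers with occ <= max_occ count as
                low-occurrence (the inclusive comparison is an assumption).
    L:          sampling distance; the manuscript gives 500 as the default.
    """
    # Indices of minimizers that pass the ordinary occurrence filter.
    low = [i for i, (_, occ) in enumerate(minimizers) if occ <= max_occ]
    selected = set(low)

    # Walk over pairs of adjacent low-occurrence minimizers.
    for i, j in zip(low, low[1:]):
        gap = minimizers[j][0] - minimizers[i][0]
        if gap >= L:
            n_extra = gap // L  # floor(|x1 - x2| / L)
            between = range(i + 1, j)
            # Pick the n_extra minimizers of lowest occurrence in the gap.
            rescued = heapq.nsmallest(
                n_extra, between, key=lambda k: minimizers[k][1]
            )
            selected.update(rescued)

    return [minimizers[k] for k in sorted(selected)]
```

With a cutoff of 100 and `L=500`, two low-occurrence minimizers 1200 bases apart would let `1200 // 500 = 2` of the high-occurrence minimizers between them back in, chosen by lowest occurrence.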
+\subsection{Aligning through longer INDELs}
+The original minimap2 may fail to align long INDELs due to its chaining
+heuristics. Briefly, minimap2 applies dynamic programming (DP) to chain
+minimizer anchors. This is a quadratic algorithm, slow for chaining
+contigs. For acceptable performance, the original minimap2 uses a 500bp band by
+default, which means a gap longer than 500bp will stop chaining.
+To align through longer gaps, older minimap2 implemented a long-join heuristic as follows.
+If there is an INDEL longer than 500bp and the two chains around the INDEL
+have no overlaps on either the query or the reference sequence, minimap2 may
+join the two short chains later.
+This heuristic may fail around VNTRs because short chains
+often have overlaps in VNTRs. More subtly, minimap2 may escape the inner DP
+loop early, again for performance, if the chaining result is not improved for
+50 iterations. When there is a copy number change in a long segmental
+duplication, the early escape may break the chain around the event even if users
+specify a large band.
+
+In minigraph~\citep{Li:2020aa}, we developed a new chaining algorithm that
+finds up to 1kb INDELs with DP-based chaining and goes through longer INDELs with a
+subquadratic algorithm~\citep{DBLP:conf/wabi/AbouelhodaO03}. We ported the same
+algorithm to minimap2 for contig mapping. For long-read mapping, the minigraph
+algorithm is slower. Minimap2 v2.22 still uses the DP-based algorithm to
+find short chains and then invokes the minigraph algorithm to rechain anchors in
+these short chains. The rechaining step achieves the same goal as long-join
+but is more reliable because it can resolve overlaps between short chains. The old
+long-join heuristic has since been removed.
+
+\subsection{Properly mapping long reads with SVs}
+The original minimap2 ranks an alignment by its Smith-Waterman score and
+outputs the best scoring alignment. However, when there are SVs on the read,
+the best scoring alignment is sometimes not the correct alignment.
+\citet{Jain2020.11.01.363887} resolved this dilemma by altering the mapping
+algorithm.
+
+In our view, this problem is rooted in inappropriate scoring: affine-gap penalty
+over-penalizes a long INDEL that was often evolutionarily created in one event.
+We should not penalize an SV by a function linear in the SV length. Minimap2 v2.22 instead rescores
+an alignment with the following scoring function. Suppose an alignment consists
+of $M$ matching bases, $N$ substitutions and $G$ gap opens. We empirically
+score the alignment with
+$$
+S=M-\frac{N+G}{2d}-\sum_{i=1}^G\log_2(1+g_i)
+$$
+where $g_i\ge1$ is the length of the $i$-th gap and
+$$
+d=\max\left\{\frac{N+G}{M+N+G},0.02\right\}
+$$
+Here $d$ approximates per-base sequence divergence except with the smallest value set
+to 2\%. As an analogy to affine-gap scoring, the matching score in our scheme
+is 1, the mismatch and gap open penalties are both $1/(2d)$ and the gap extension
+penalty is a logarithmic function of the gap length~\citep{Gu:1995wt}. Our scoring gives a long SV
+a much milder penalty. In terms of time complexity, scoring an alignment is
+linear in the length of the alignment. The time spent on rescoring is negligible in
+practice.
+
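To make the rescoring formula concrete, the following Python sketch evaluates $S$ directly from $M$, $N$ and the list of gap lengths $g_i$. It is a transcription of the two equations in the manuscript, not minimap2's own code; the function name `rescore`, its argument names, and the handling of an empty alignment are assumptions made for the example.

```python
import math


def rescore(matches, substitutions, gap_lengths, min_div=0.02):
    """Compute S = M - (N + G) / (2d) - sum_i log2(1 + g_i).

    matches:       M, number of matching bases
    substitutions: N, number of substitutions
    gap_lengths:   list of gap lengths g_i (each >= 1); G is its length
    min_div:       floor on the divergence estimate d (2% in the manuscript)
    """
    n_gaps = len(gap_lengths)            # G
    events = substitutions + n_gaps      # N + G
    columns = matches + events           # M + N + G
    # d = max((N + G) / (M + N + G), 0.02); an empty alignment falls back
    # to the floor (this edge case is an assumption of the sketch).
    d = max(events / columns, min_div) if columns > 0 else min_div
    gap_penalty = sum(math.log2(1 + g) for g in gap_lengths)
    return matches - events / (2 * d) - gap_penalty
```

Because the gap term grows with $\log_2(1+g_i)$, a single 1000bp gap costs about 10 units on top of the shared $1/(2d)$ open penalty, whereas a 10bp gap costs about 3.5. This is the "much milder penalty" for long SVs that the text refers to.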
+%If we assume sequences evolve under a duplication-mutation model, we may have a
+%better way to choose the best alignment. If a long read can be mapped to $n$
+%loci, we can take the read as the template and build a
+%pseudo-multi-sequence-alignment (pMSA) of $n+1$ sequences. In this pMSA, we say
+%a site on the read is informative if the $n$ reference subsequences differ at
+%the position.
+
+\end{methods}
+
+\section{Results}
+
+\begin{table}
+\processtable{Evaluation of minimap2 v2.22}
+{\footnotesize\label{tab:1}\begin{tabular}{p{4.2cm}rrrr}
+\toprule
+$[$Benchmark$]$ Metric & v2.22 & v2.18 & Winno & lra \\
+\midrule
+$[$sim-map$]$ \% mapped reads at Q10 & 97.9 & 97.6 & {\bf 99.0}& 97.3 \\
+$[$sim-map$]$ err. rate at Q10 (phredQ) & {\bf 52} & {\bf 52} & 38 & 24 \\
+$[$winno-cmp$]$ rate of diff. (phredQ) & {\bf 41} & 37 & truth & 18 \\
+$[$winno-cmp$]$ CPU time (hour) & {\bf 5.0} & 5.3 & 71.8 & 13.1 \\
+$[$winno-cmp$]$ peak RAM (Gb) & 17.1 & 14.4 & {\bf 9.6} & 12.4 \\
+$[$sim-sv$]$ \% false negative rate & {\bf 0.5} & 2.0 & {\bf 0.5} & 1.4 \\
+$[$sim-sv$]$ \% false discovery rate & {\bf 0.0} & 0.1 & {\bf 0.0} & 0.1 \\
+$[$real-sv-1k$]$ \% false negative rate & {\bf 7.3} & 20.0 & 13.0 & N/A \\
+$[$real-sv-1k$]$ \% false discovery rate & 2.7 & {\bf 2.4} & 2.7 & N/A \\
+\botrule
+\end{tabular}}
+{In $[$sim-map$]$, 152,713 reads were simulated from the CHM13 telomere-to-telomere assembly v1.1
+(AC: GCA\_009914755.3) with pbsim2~\citep{Ono:2021aa}: ``pbsim2 -{}-hmm\_model R94.model -{}-length-min
+5000 -{}-length-mean 20000 -{}-accuracy-mean 0.95''. Alignments of mapping quality
+10 or higher were evaluated by ``paftools.js mapeval''. The mapping error rate
+is measured in the phred scale: if the error rate is $e$, $-10\log_{10}e$ is
+reported in the table. In $[$winno-cmp$]$, 1.39 million CHM13 HiFi reads from
+SRR11292121 were mapped against the same CHM13 assembly. 99.3\% of them were mapped by Winnowmap2
+at mapping quality 10 or higher and were taken as ground truth to evaluate
+minimap2 and lra with ``paftools.js pafcmp''. $[$sim-sv$]$ simulated 1,000
+50bp to 1000bp INDELs from chr8 in CHM13 using SURVIVOR~\citep{Jeffares:2017aa} and simulated Nanopore
+reads at 30-fold coverage with the same pbsim2 command line. SVs were called with
+``sniffles -q 10''~\citep{Sedlazeck:2018ab} and compared to the simulated truth with ``SURVIVOR eval
+call.vcf truth.bed 50''. In $[$real-sv-1k$]$, small and long variants were
+called by dipcall-0.3~\citep{Li:2018aa} for HG002 assemblies (AC: GCA\_018852605.1 and
+GCA\_018852615.1) and compared to the GIAB truth~\citep{Zook:2020aa} using ``truvari -r 2000 -s
+1000 -S 400 -{}-multimatch -{}-passonly'' which sets the minimum INDEL size to 1kb in evaluation. }
+\end{table}
+
+We evaluated minimap2 v2.22 along with v2.18, Winnowmap2 v2.03 and lra v1.3.2
+(Table~\ref{tab:1}), using the default setting of each mapper according to the input data types.
+Both versions of minimap2 achieved high mapping accuracy on
+simulated Nanopore reads (sim-map). Winnowmap2 aligned more reads at mapping
+quality 10 or higher (mapQ10). However, it may occasionally assign a high mapping
+quality to a read with multiple identical best alignments. This reduced its
+mapping accuracy.
+
+Lacking ground truth for real data, we took Winnowmap2 mappings as ground
+truth to evaluate other mappers (winno-cmp in Table~\ref{tab:1}). Out of 1,378,092 reads with mapQ10
+alignments by Winnowmap2, minimap2 v2.22 could map all of them. 118 reads, less
+than 0.01\% of all reads, were mapped differently by v2.22. 51 of them have
+multiple identical best alignments. We believe these are more likely to be
+Winnowmap2 errors. Most of the remaining 67 (=118-51) reads have multiple
+highly similar but not identical alignments.
+Minimap2 v2.18 is less consistent, with 275 differences including 30 unmapped
+reads mappable by both Winnowmap2 and v2.22.
+
+For the minimizer rescuing parameter $L$ in Section~\ref{sec:high-occ},
+we set its default to 500 such that v2.22 has comparable performance to v2.18 given simulated PacBio and Nanopore human reads.
+To see the effect of this parameter on real data, we tried several different $L$ values.
+v2.22 gave 99 mapping differences at $L=200$,
+118 at $L=500$ (default), 167 at $L=750$ and 224 differences at $L=1000$ in comparison to Winnowmap2.
+$L=200$ is 28\% slower than the default while $L=1000$ is 9\% faster.
+Changing the default minimizer window size (option ``-w'')
+and the initial minimizer occurrence cutoff (option ``-f'')
+also affects performance and accuracy to a similar magnitude.
+
+The two benchmarks above only evaluate read mappings when there are no variations between the reads and the reference.
+To measure the mapping accuracy in the presence of SVs (sim-sv), we reproduced
+the results of~\citet{Jain2020.11.01.363887}. Minimap2 v2.22 is as good as
+Winnowmap2 now. Note that we set the Sniffles mapping quality
+threshold to 10 to be consistent with the benchmarks above. If we used the
+default threshold 20, v2.22 would miss five additional SVs (accounting for
+0.5\% of simulated SVs). For four out of these five missing SVs, minimap2 v2.22
+mapped more variant reads than Winnowmap2. Sniffles did not call these SVs
+because minimap2 tended to give them conservative mapping quality. It is worth
+noting that the simulation here only considers a simple scenario in evolution.
+Non-allelic gene conversions, which happen often in segmental
+duplications~\citep{Harpak:2017aa}, would obscure the optimal mapping
+strategies. How much such simple SV simulation informs real-world SV calling
+remains a question.
+
+To see if minimap2 v2.22 could improve long INDEL alignment, we ran dipcall on
+contig-to-reference alignments and focused on INDELs longer than 1kb
+(real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its
+advantage in more contiguous alignment. We could not get dipcall to work well with lra,
+so we did not report the numbers.
+
+Minimap2 spends most computing time on base alignment. As recent improvements
+in v2.22 incur little additional computing and do not change the base alignment
+algorithm, the new version has similar performance to older versions. It is
+consistently faster than Winnowmap2 by several times. Sometimes simple
+heuristics can be as effective as more sophisticated yet slower solutions.
+
+\section*{Acknowledgements}
+We thank Arang Rhie and Chirag Jain for providing motivating examples for which
+older minimap2 underperforms.
+
+\paragraph{Funding\textcolon} This work is funded by NHGRI grant R01HG010040.
+
+\bibliography{minimap2}
+
+\end{document}
@@ -0,0 +1,12 @@
+Q 60 32084 0 0.000000000 32084
+Q 24 318 2 0.000061725 32402
+Q 11 98 2 0.000123077 32500
+Q 8 37 2 0.000184405 32537
+Q 7 37 3 0.000276294 32574
+Q 6 40 3 0.000367940 32614
+Q 5 34 2 0.000428816 32648
+Q 4 37 5 0.000581306 32685
+Q 3 28 6 0.000764222 32713
+Q 2 38 6 0.000946536 32751
+Q 1 50 21 0.001585318 32801
+Q 0 286 150 0.006105117 33087
@@ -0,0 +1,13 @@
+Q 60 32477 0 0.000000000 32477
+Q 22 16 1 0.000030776 32493
+Q 21 44 1 0.000061468 32537
+Q 19 73 1 0.000091996 32610
+Q 14 66 1 0.000122414 32676
+Q 10 26 3 0.000214054 32702
+Q 8 14 1 0.000244529 32716
+Q 7 13 2 0.000305539 32729
+Q 6 47 1 0.000335611 32776
+Q 3 10 1 0.000366010 32786
+Q 2 20 2 0.000426751 32806
+Q 1 248 94 0.003267381 33054
+Q 0 31 17 0.003778147 33085
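The `Q` rows in the `.eval` hunks above can be checked with a short Python sketch. The diff does not document the columns, so the reading used here is inferred from the numbers: mapping-quality bin, reads in the bin, wrong mappings in the bin, cumulative error rate, and cumulative mapped reads, with the `U` row assumed to be an unmapped-read count. The function names are hypothetical. The `phred` helper applies the $-10\log_{10}e$ conversion given in the manuscript's table note.

```python
import math


def parse_eval(lines):
    """Parse 'Q' rows into (mapq, n_reads, n_wrong, cum_err, cum_reads)."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "Q":
            continue  # skip blank lines and the 'U' row (assumed: unmapped)
        rows.append((int(fields[1]), int(fields[2]), int(fields[3]),
                     float(fields[4]), int(fields[5])))
    return rows


def recompute_cumulative(rows):
    """Recompute cumulative error rate and read count from per-bin counts."""
    total_reads = 0
    total_wrong = 0
    out = []
    for mapq, n_reads, n_wrong, _, _ in rows:
        total_reads += n_reads
        total_wrong += n_wrong
        out.append((mapq, total_wrong / total_reads, total_reads))
    return out


def phred(error_rate):
    """Convert an error rate e to the phred scale, -10 * log10(e)."""
    return -10.0 * math.log10(error_rate)


# First rows of the last hunk above, copied verbatim.
SAMPLE = [
    "Q 60 32477 0 0.000000000 32477",
    "Q 22 16 1 0.000030776 32493",
    "Q 21 44 1 0.000061468 32537",
    "Q 19 73 1 0.000091996 32610",
    "Q 14 66 1 0.000122414 32676",
]
```

Under this reading, the fourth column of each row equals the running total of wrong mappings divided by the running total of mapped reads, which matches the values in the hunks to the printed precision.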