minimap2 0.2.22.0 → 0.2.24.1

Files changed (101)
  1. checksums.yaml +4 -4
  2. data/README.md +60 -76
  3. data/ext/Rakefile +55 -0
  4. data/ext/cmappy/cmappy.c +129 -0
  5. data/ext/cmappy/cmappy.h +44 -0
  6. data/ext/minimap2/FAQ.md +46 -0
  7. data/ext/minimap2/LICENSE.txt +24 -0
  8. data/ext/minimap2/MANIFEST.in +10 -0
  9. data/ext/minimap2/Makefile +132 -0
  10. data/ext/minimap2/Makefile.simde +97 -0
  11. data/ext/minimap2/NEWS.md +821 -0
  12. data/ext/minimap2/README.md +403 -0
  13. data/ext/minimap2/align.c +1020 -0
  14. data/ext/minimap2/bseq.c +169 -0
  15. data/ext/minimap2/bseq.h +64 -0
  16. data/ext/minimap2/code_of_conduct.md +30 -0
  17. data/ext/minimap2/cookbook.md +243 -0
  18. data/ext/minimap2/esterr.c +64 -0
  19. data/ext/minimap2/example.c +63 -0
  20. data/ext/minimap2/format.c +559 -0
  21. data/ext/minimap2/hit.c +466 -0
  22. data/ext/minimap2/index.c +775 -0
  23. data/ext/minimap2/kalloc.c +205 -0
  24. data/ext/minimap2/kalloc.h +76 -0
  25. data/ext/minimap2/kdq.h +132 -0
  26. data/ext/minimap2/ketopt.h +120 -0
  27. data/ext/minimap2/khash.h +615 -0
  28. data/ext/minimap2/krmq.h +474 -0
  29. data/ext/minimap2/kseq.h +256 -0
  30. data/ext/minimap2/ksort.h +153 -0
  31. data/ext/minimap2/ksw2.h +184 -0
  32. data/ext/minimap2/ksw2_dispatch.c +96 -0
  33. data/ext/minimap2/ksw2_extd2_sse.c +402 -0
  34. data/ext/minimap2/ksw2_exts2_sse.c +416 -0
  35. data/ext/minimap2/ksw2_extz2_sse.c +313 -0
  36. data/ext/minimap2/ksw2_ll_sse.c +152 -0
  37. data/ext/minimap2/kthread.c +159 -0
  38. data/ext/minimap2/kthread.h +15 -0
  39. data/ext/minimap2/kvec.h +105 -0
  40. data/ext/minimap2/lchain.c +369 -0
  41. data/ext/minimap2/main.c +459 -0
  42. data/ext/minimap2/map.c +714 -0
  43. data/ext/minimap2/minimap.h +410 -0
  44. data/ext/minimap2/minimap2.1 +725 -0
  45. data/ext/minimap2/misc/README.md +179 -0
  46. data/ext/minimap2/misc/mmphase.js +335 -0
  47. data/ext/minimap2/misc/paftools.js +3149 -0
  48. data/ext/minimap2/misc.c +162 -0
  49. data/ext/minimap2/mmpriv.h +132 -0
  50. data/ext/minimap2/options.c +234 -0
  51. data/ext/minimap2/pe.c +177 -0
  52. data/ext/minimap2/python/README.rst +196 -0
  53. data/ext/minimap2/python/cmappy.h +152 -0
  54. data/ext/minimap2/python/cmappy.pxd +153 -0
  55. data/ext/minimap2/python/mappy.pyx +273 -0
  56. data/ext/minimap2/python/minimap2.py +39 -0
  57. data/ext/minimap2/sdust.c +213 -0
  58. data/ext/minimap2/sdust.h +25 -0
  59. data/ext/minimap2/seed.c +131 -0
  60. data/ext/minimap2/setup.py +55 -0
  61. data/ext/minimap2/sketch.c +143 -0
  62. data/ext/minimap2/splitidx.c +84 -0
  63. data/ext/minimap2/sse2neon/emmintrin.h +1689 -0
  64. data/ext/minimap2/test/MT-human.fa +278 -0
  65. data/ext/minimap2/test/MT-orang.fa +276 -0
  66. data/ext/minimap2/test/q-inv.fa +4 -0
  67. data/ext/minimap2/test/q2.fa +2 -0
  68. data/ext/minimap2/test/t-inv.fa +127 -0
  69. data/ext/minimap2/test/t2.fa +2 -0
  70. data/ext/minimap2/tex/Makefile +21 -0
  71. data/ext/minimap2/tex/bioinfo.cls +930 -0
  72. data/ext/minimap2/tex/blasr-mc.eval +17 -0
  73. data/ext/minimap2/tex/bowtie2-s3.sam.eval +28 -0
  74. data/ext/minimap2/tex/bwa-s3.sam.eval +52 -0
  75. data/ext/minimap2/tex/bwa.eval +55 -0
  76. data/ext/minimap2/tex/eval2roc.pl +33 -0
  77. data/ext/minimap2/tex/graphmap.eval +4 -0
  78. data/ext/minimap2/tex/hs38-simu.sh +10 -0
  79. data/ext/minimap2/tex/minialign.eval +49 -0
  80. data/ext/minimap2/tex/minimap2.bib +460 -0
  81. data/ext/minimap2/tex/minimap2.tex +724 -0
  82. data/ext/minimap2/tex/mm2-s3.sam.eval +62 -0
  83. data/ext/minimap2/tex/mm2-update.tex +240 -0
  84. data/ext/minimap2/tex/mm2.approx.eval +12 -0
  85. data/ext/minimap2/tex/mm2.eval +13 -0
  86. data/ext/minimap2/tex/natbib.bst +1288 -0
  87. data/ext/minimap2/tex/natbib.sty +803 -0
  88. data/ext/minimap2/tex/ngmlr.eval +38 -0
  89. data/ext/minimap2/tex/roc.gp +60 -0
  90. data/ext/minimap2/tex/snap-s3.sam.eval +62 -0
  91. data/ext/minimap2.patch +19 -0
  92. data/lib/minimap2/aligner.rb +4 -4
  93. data/lib/minimap2/alignment.rb +11 -11
  94. data/lib/minimap2/ffi/constants.rb +20 -16
  95. data/lib/minimap2/ffi/functions.rb +5 -0
  96. data/lib/minimap2/ffi.rb +4 -5
  97. data/lib/minimap2/version.rb +2 -2
  98. data/lib/minimap2.rb +51 -15
  99. metadata +97 -79
  100. data/lib/minimap2/ffi_helper.rb +0 -53
  101. data/vendor/libminimap2.so +0 -0
@@ -0,0 +1,62 @@
1
+ Q 60 18579866 27 0.000001453 18579866
2
+ Q 59 27087 4 0.000001666 18606953
3
+ Q 58 21435 1 0.000001718 18628388
4
+ Q 57 45663 3 0.000001874 18674051
5
+ Q 56 36031 2 0.000001978 18710082
6
+ Q 55 18499 2 0.000002082 18728581
7
+ Q 54 14754 2 0.000002187 18743335
8
+ Q 53 25541 2 0.000002291 18768876
9
+ Q 52 26397 5 0.000002554 18795273
10
+ Q 51 15090 3 0.000002711 18810363
11
+ Q 50 13425 11 0.000003294 18823788
12
+ Q 49 15175 2 0.000003397 18838963
13
+ Q 48 19407 4 0.000003606 18858370
14
+ Q 47 11538 16 0.000004452 18869908
15
+ Q 46 12558 17 0.000005349 18882466
16
+ Q 45 40362 28 0.000006817 18922828
17
+ Q 44 10465 13 0.000007500 18933293
18
+ Q 43 10098 20 0.000008552 18943391
19
+ Q 42 10682 19 0.000009549 18954073
20
+ Q 41 9823 11 0.000010125 18963896
21
+ Q 40 9685 16 0.000010963 18973581
22
+ Q 39 10273 18 0.000011905 18983854
23
+ Q 38 9515 18 0.000012847 18993369
24
+ Q 37 9474 27 0.000014261 19002843
25
+ Q 36 10430 25 0.000015568 19013273
26
+ Q 35 9241 34 0.000017348 19022514
27
+ Q 34 9162 31 0.000018968 19031676
28
+ Q 33 10164 49 0.000021532 19041840
29
+ Q 32 9152 55 0.000024408 19050992
30
+ Q 31 9252 35 0.000026233 19060244
31
+ Q 30 9872 55 0.000029103 19070116
32
+ Q 29 8938 65 0.000032496 19079054
33
+ Q 28 8951 73 0.000036306 19088005
34
+ Q 27 9949 95 0.000041261 19097954
35
+ Q 26 9784 97 0.000046316 19107738
36
+ Q 25 10126 97 0.000051366 19117864
37
+ Q 24 11260 123 0.000057765 19129124
38
+ Q 23 10047 114 0.000063691 19139171
39
+ Q 22 9661 123 0.000070083 19148832
40
+ Q 21 10339 168 0.000078813 19159171
41
+ Q 20 17928 193 0.000088804 19177099
42
+ Q 19 9842 193 0.000098817 19186941
43
+ Q 18 14737 247 0.000111605 19201678
44
+ Q 17 10218 238 0.000123934 19211896
45
+ Q 16 10271 242 0.000136457 19222167
46
+ Q 15 12241 333 0.000153683 19234408
47
+ Q 14 9189 336 0.000171070 19243597
48
+ Q 13 9493 515 0.000197734 19253090
49
+ Q 12 11502 743 0.000236185 19264592
50
+ Q 11 8211 507 0.000262390 19272803
51
+ Q 10 9133 606 0.000293695 19281936
52
+ Q 9 10014 931 0.000341801 19291950
53
+ Q 8 8436 698 0.000377816 19300386
54
+ Q 7 8443 705 0.000414163 19308829
55
+ Q 6 10203 944 0.000462808 19319032
56
+ Q 5 6936 756 0.000501760 19325968
57
+ Q 4 6732 843 0.000545190 19332700
58
+ Q 3 8215 1104 0.000602040 19340915
59
+ Q 2 21201 5440 0.000882342 19362116
60
+ Q 1 82328 22186 0.002019600 19444444
61
+ Q 0 553853 371953 0.020562901 19998297
62
+ U 1703
@@ -0,0 +1,240 @@
1
+ \documentclass{bioinfo}
2
+ \copyrightyear{2021}
3
+ \pubyear{2021}
4
+
5
+ \usepackage{graphicx}
6
+ \usepackage{hyperref}
7
+ \usepackage{url}
8
+ \usepackage{amsmath}
9
+ \usepackage[ruled,vlined]{algorithm2e}
10
+ \newcommand\mycommfont[1]{\footnotesize\rmfamily{\it #1}}
11
+ \SetCommentSty{mycommfont}
12
+ \SetKwComment{Comment}{$\triangleright$\ }{}
13
+
14
+ \usepackage{natbib}
15
+ \bibliographystyle{apalike}
16
+
17
+ \DeclareMathOperator*{\argmax}{argmax}
18
+
19
+ \begin{document}
20
+ \firstpage{1}
21
+
22
+ \title[Improvements to minimap2]{New strategies to improve minimap2 alignment accuracy}
23
+ \author[Li]{Heng Li$^{1,2}$}
24
+ \address{$^1$Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA,
25
+ $^2$Harvard Medical School, 10 Shattuck St, Boston, MA 02215, USA}
26
+
27
+ \maketitle
28
+
29
+ \begin{abstract}
30
+
31
+ \section{Summary:} We present several recent improvements to minimap2, a
32
+ versatile pairwise aligner for nucleotide sequences. Now minimap2 v2.22 can
33
+ more accurately map long reads to highly repetitive regions and align through
34
+ insertions or deletions up to 100kb by default, addressing a major weakness in
35
+ minimap2 v2.18 or earlier.
36
+
37
+ \section{Availability and implementation:}
38
+ \href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
39
+
40
+ \section{Contact:} hli@ds.dfci.harvard.edu
41
+ \end{abstract}
42
+
43
+ \section{Introduction}
44
+ Minimap2~\citep{Li:2018ab} is widely used for mapping long sequence
45
+ reads and assembly contigs. \citet{Jain:2020aa} found minimap2 v2.18 or earlier occasionally
46
+ misaligned reads from highly repetitive regions as minimap2 ignored seeds of
47
+ high occurrence. They also noticed minimap2 may misplace reads with structural
48
+ variations (SVs) in such regions~\citep{Jain2020.11.01.363887}. These
49
+ misalignments have become a pressing issue with the advent of
50
+ telomere-to-telomere human assembly~\citep{Miga:2020aa}. Meanwhile, old minimap2
51
+ was unable to efficiently align long insertions/deletions (INDELs) and often
52
+ broke an alignment around variable-number tandem repeats (VNTRs). This has
53
+ inspired new chaining algorithms~\citep{Li:2020aa,Ren:2021aa} which are not
54
+ integrated into minimap2. Here we will describe recent efforts implemented
55
+ in v2.19 through v2.22 to improve mapping results.
56
+
57
+ \begin{methods}
58
+ \section{Methods}
59
+
60
+ \subsection{Rescuing high-occurrence $k$-mers}\label{sec:high-occ}
61
+ Minimap2 keeps all $k$-mer minimizers~\citep{Roberts:2004fv} during indexing. Its original
62
+ implementation only selected low-occurrence minimizers during mapping. The
63
+ cutoff is a few hundred for mapping long reads against a human genome. If a
64
+ read harbors only a few or even no low-occurrence minimizers, it will fail
65
+ chaining due to insufficient anchors.
66
+
67
+ To resolve this issue, we implemented a new heuristic to add additional
68
+ minimizers. Suppose we are looking at two adjacent low-occurrence $k$-mers
69
+ located at positions $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge L$,
70
+ minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/L\rfloor$ minimizers
71
+ of the lowest occurrence among minimizers between $x_1$ and $x_2$. Here
72
+ parameter $L$ controls the frequency of sampling. It defaults to 500.
73
+ This strategy adds necessary anchors at the cost of increasing total alignment
74
+ time by a few percent on real data.
75
+
76
+ \subsection{Aligning through longer INDELs}
77
+ The original minimap2 may fail to align long INDELs due to its chaining
78
+ heuristics. Briefly, minimap2 applies dynamic programming (DP) to chain
79
+ minimizer anchors. This is a quadratic algorithm, slow for chaining
80
+ contigs. For acceptable performance, the original minimap2 uses a 500bp band by
81
+ default, which means a gap longer than 500bp will stop chaining.
82
+ To align through longer gaps, older minimap2 implemented a long-join heuristic as follows.
83
+ If there is an INDEL longer than 500bp and the two chains around the INDEL
84
+ have no overlaps on either the query or the reference sequence, minimap2 may
85
+ join the two short chains later.
86
+ This heuristic may fail around VNTRs because short chains
87
+ often have overlaps in VNTRs. More subtly, minimap2 may escape the inner DP
88
+ loop early, again for performance, if the chaining result is not improved for
89
+ 50 iterations. When there is a copy number change in a long segmental
90
+ duplication, the early escape may break around the event even if users
91
+ specify a large band.
92
+
93
+ In minigraph~\citep{Li:2020aa}, we developed a new chaining algorithm that
94
+ finds up to 1kb INDELs with DP-based chaining and goes through longer INDELs with a
95
+ subquadratic algorithm~\citep{DBLP:conf/wabi/AbouelhodaO03}. We ported the same
96
+ algorithm to minimap2 for contig mapping. For long-read mapping, the minigraph
97
+ algorithm is slower. Minimap2 v2.22 still uses the DP-based algorithm to
98
+ find short chains and then invokes the minigraph algorithm to rechain anchors in
99
+ these short chains. The rechaining step achieves the same goal as long-join
100
+ but is more reliable because it can resolve overlaps between short chains. The old
101
+ long-join heuristic has since been removed.
102
+
103
+ \subsection{Properly mapping long reads with SVs}
104
+ The original minimap2 ranks an alignment by its Smith-Waterman score and
105
+ outputs the best scoring alignment. However, when there are SVs on the read,
106
+ the best scoring alignment is sometimes not the correct alignment.
107
+ \citet{Jain2020.11.01.363887} resolved this dilemma by altering the mapping
108
+ algorithm.
109
+
110
+ In our view, this problem is rooted in inappropriate scoring: the affine-gap penalty
111
+ over-penalizes a long INDEL that was often evolutionarily created in one event.
112
+ We should not penalize an SV by a function linear in the SV length. Minimap2 v2.22 instead rescores
113
+ an alignment with the following scoring function. If an alignment consists
114
+ of $M$ matching bases, $N$ substitutions and $G$ gap opens, we empirically
115
+ score the alignment with
116
+ $$
117
+ S=M-\frac{N+G}{2d}-\sum_{i=1}^G\log_2(1+g_i)
118
+ $$
119
+ where $g_i\ge1$ is the length of the $i$-th gap and
120
+ $$
121
+ d=\max\left\{\frac{N+G}{M+N+G},0.02\right\}
122
+ $$
123
+ It approximates per-base sequence divergence except with the smallest value set
124
+ to 2\%. As an analogy to affine-gap scoring, the matching score in our scheme
125
+ is 1, the mismatch and gap open penalties are both $1/(2d)$ and the gap extension
126
+ penalty is a logarithmic function of the gap length~\citep{Gu:1995wt}. Our scoring gives a long SV
127
+ a much milder penalty. In terms of time complexity, scoring an alignment is
128
+ linear in the length of the alignment. The time spent on rescoring is negligible in
129
+ practice.
130
+
131
+ %If we assume sequences evolve under a duplication-mutation model, we may have a
132
+ %better way to choose the best alignment. If a long read can be mapped to $n$
133
+ %loci, we can take the read as the template and build a
134
+ %pseudo-multi-sequence-alignment (pMSA) of $n+1$ sequences. In this pMSA, we say
135
+ %a site on the read is informative if the $n$ reference subsequences differ at
136
+ %the position.
137
+
138
+ \end{methods}
139
+
140
+ \section{Results}
141
+
142
+ \begin{table}
143
+ \processtable{Evaluation of minimap2 v2.22}
144
+ {\footnotesize\label{tab:1}\begin{tabular}{p{4.2cm}rrrr}
145
+ \toprule
146
+ $[$Benchmark$]$ Metric & v2.22 & v2.18 & Winno & lra \\
147
+ \midrule
148
+ $[$sim-map$]$ \% mapped reads at Q10 & 97.9 & 97.6 & {\bf 99.0}& 97.3 \\
149
+ $[$sim-map$]$ err. rate at Q10 (phredQ) & {\bf 52} & {\bf 52} & 38 & 24 \\
150
+ $[$winno-cmp$]$ rate of diff. (phredQ) & {\bf 41} & 37 & truth & 18 \\
151
+ $[$winno-cmp$]$ CPU time (hour) & {\bf 5.0} & 5.3 & 71.8 & 13.1 \\
152
+ $[$winno-cmp$]$ peak RAM (GB) & 17.1 & 14.4 & {\bf 9.6} & 12.4 \\
153
+ $[$sim-sv$]$ \% false negative rate & {\bf 0.5} & 2.0 & {\bf 0.5} & 1.4 \\
154
+ $[$sim-sv$]$ \% false discovery rate & {\bf 0.0} & 0.1 & {\bf 0.0} & 0.1 \\
155
+ $[$real-sv-1k$]$ \% false negative rate & {\bf 7.3} & 20.0 & 13.0 & N/A \\
156
+ $[$real-sv-1k$]$ \% false discovery rate & 2.7 & {\bf 2.4} & 2.7 & N/A \\
157
+ \botrule
158
+ \end{tabular}}
159
+ {In $[$sim-map$]$, 152,713 reads were simulated from the CHM13 telomere-to-telomere assembly v1.1
160
+ (AC: GCA\_009914755.3) with pbsim2~\citep{Ono:2021aa}: ``pbsim2 -{}-hmm\_model R94.model -{}-length-min
161
+ 5000 -{}-length-mean 20000 -{}-accuracy-mean 0.95''. Alignments of mapping quality
162
+ 10 or higher were evaluated by ``paftools.js mapeval''. The mapping error rate
163
+ is measured in the phred scale: if the error rate is $e$, $-10\log_{10}e$ is
164
+ reported in the table. In $[$winno-cmp$]$, 1.39 million CHM13 HiFi reads from
165
+ SRR11292121 were mapped against the same CHM13 assembly. 99.3\% of them were mapped by Winnowmap2
166
+ at mapping quality 10 or higher and were taken as ground truth to evaluate
167
+ minimap2 and lra with ``paftools.js pafcmp''. $[$sim-sv$]$ simulated 1,000
168
+ 50bp to 1000bp INDELs from chr8 in CHM13 using SURVIVOR~\citep{Jeffares:2017aa} and simulated Nanopore
169
+ reads at 30-fold coverage with the same pbsim2 command line. SVs were called with
170
+ ``sniffles -q 10''~\citep{Sedlazeck:2018ab} and compared to the simulated truth with ``SURVIVOR eval
171
+ call.vcf truth.bed 50''. In $[$real-sv-1k$]$, small and long variants were
172
+ called by dipcall-0.3~\citep{Li:2018aa} for HG002 assemblies (AC: GCA\_018852605.1 and
173
+ GCA\_018852615.1) and compared to the GIAB truth~\citep{Zook:2020aa} using ``truvari -r 2000 -s
174
+ 1000 -S 400 -{}-multimatch -{}-passonly'' which sets the minimum INDEL size to 1kb in evaluation. }
175
+ \end{table}
176
+
177
+ We evaluated minimap2 v2.22 along with v2.18, Winnowmap2 v2.03 and lra v1.3.2
178
+ (Table~\ref{tab:1}), using the default setting of each mapper according to the input data types.
179
+ Both versions of minimap2 achieved high mapping accuracy on
180
+ simulated Nanopore reads (sim-map). Winnowmap2 aligned more reads at mapping
181
+ quality 10 or higher (mapQ10). However, it may occasionally assign a high mapping
182
+ quality to a read with multiple identical best alignments. This reduced its
183
+ mapping accuracy.
184
+
185
+ Lacking ground truth for real data, we took Winnowmap2 mapping as ground
186
+ truth to evaluate other mappers (winno-cmp in Table~\ref{tab:1}). Out of 1,378,092 reads with mapQ10
187
+ alignments by Winnowmap2, minimap2 v2.22 could map all of them. 118 reads, less
188
+ than 0.01\% of all reads, were mapped differently by v2.22. 51 of them have
189
+ multiple identical best alignments. We believe these are more likely to be
190
+ Winnowmap2 errors. Most of the remaining 67 (=118-51) reads have multiple
191
+ highly similar but not identical alignments.
192
+ Minimap2 v2.18 is less consistent with 275 differences including 30 unmapped
193
+ reads mappable by both Winnowmap2 and v2.22.
194
+
195
+ For the minimizer rescuing parameter $L$ in Section~\ref{sec:high-occ},
196
+ we set its default to 500 such that v2.22 has comparable performance to v2.18 given simulated PacBio and Nanopore human reads.
197
+ To see the effect of this parameter on real data, we tried several different $L$ values.
198
+ v2.22 gave 99 mapping differences at $L=200$,
199
+ 118 at $L=500$ (default), 167 at $L=750$ and 224 differences at $L=1000$ in comparison to Winnowmap2.
200
+ $L=200$ is 28\% slower than the default while $L=1000$ is 9\% faster.
201
+ Changing the default minimizer window size (option ``-w'')
202
+ and the initial minimizer occurrence cutoff (option ``-f'')
203
+ also affects performance and accuracy to a similar magnitude.
204
+
205
+ The two benchmarks above only evaluate read mappings when there are no variations between the reads and the reference.
206
+ To measure the mapping accuracy in the presence of SVs (sim-sv), we reproduced
207
+ the results of~\citet{Jain2020.11.01.363887}. Minimap2 v2.22 is as good as
208
+ Winnowmap2 now. Note that we set the Sniffles mapping quality
209
+ threshold to 10, consistent with the benchmarks above. If we used the
210
+ default threshold 20, v2.22 would miss five additional SVs (accounting for
211
+ 0.5\% of simulated SVs). For four out of these five missing SVs, minimap2 v2.22
212
+ mapped more variant reads than Winnowmap2. Sniffles did not call these SVs
213
+ because minimap2 tended to give them conservative mapping quality. It is worth
214
+ noting that the simulation here only considers a simple scenario in evolution.
215
+ Non-allelic gene conversions, which happen often in segmental
216
+ duplications~\citep{Harpak:2017aa}, would obscure the optimal mapping
217
+ strategies. How much such simple SV simulation informs real-world SV calling
218
+ remains a question.
219
+
220
+ To see if minimap2 v2.22 could improve long INDEL alignment, we ran dipcall on
221
+ contig-to-reference alignments and focused on INDELs longer than 1kb
222
+ (real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its
223
+ advantage in more contiguous alignment. We could not get dipcall to work well with lra,
224
+ so we do not report its numbers.
225
+
226
+ Minimap2 spends most computing time on base alignment. As recent improvements
227
+ in v2.22 incur little additional computing and do not change the base alignment
228
+ algorithm, the new version has similar performance to older versions. It is
229
+ consistently faster than Winnowmap2 by several times. Sometimes simple
230
+ heuristics can be as effective as more sophisticated yet slower solutions.
231
+
232
+ \section*{Acknowledgements}
233
+ We thank Arang Rhie and Chirag Jain for providing motivating examples for which
234
+ older minimap2 underperforms.
235
+
236
+ \paragraph{Funding\textcolon} This work is funded by NHGRI grant R01HG010040.
237
+
238
+ \bibliography{minimap2}
239
+
240
+ \end{document}
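
Reviewer note on the Methods section of the LaTeX source above: the minimizer rescue rule and the rescoring formula can be sketched in a few lines of Python. This is a minimal illustration of the text as written, not the C implementation shipped in ext/minimap2. The function names, the (position, occurrence) tuple layout and the `max_occ` cutoff parameter are assumptions made for the example, and the sketch ignores the read ends before the first and after the last low-occurrence minimizer.

```python
import math


def rescue_minimizers(minimizers, max_occ, L=500):
    """Add high-occurrence minimizers in long gaps between low-occurrence ones.

    minimizers: list of (position, occurrence) tuples sorted by position.
    max_occ:    hypothetical cutoff; occurrence <= max_occ counts as "low".
    L:          sampling distance (the manuscript states a default of 500).
    """
    low = [m for m in minimizers if m[1] <= max_occ]
    selected = set(low)
    for (x1, _), (x2, _) in zip(low, low[1:]):
        gap = x2 - x1
        if gap >= L:
            # Everything strictly between two adjacent low-occurrence
            # minimizers is high-occurrence; keep the floor(gap / L) rarest.
            between = [m for m in minimizers if x1 < m[0] < x2]
            between.sort(key=lambda m: (m[1], m[0]))
            selected.update(between[: gap // L])
    return sorted(selected)


def rescore(matches, mismatches, gap_lengths, min_div=0.02):
    """S = M - (N + G) / (2d) - sum_i log2(1 + g_i), with d floored at min_div."""
    n_gaps = len(gap_lengths)
    total = matches + mismatches + n_gaps
    if total == 0:
        return 0.0
    d = max((mismatches + n_gaps) / total, min_div)
    gap_term = sum(math.log2(1 + g) for g in gap_lengths)
    return matches - (mismatches + n_gaps) / (2 * d) - gap_term


if __name__ == "__main__":
    mz = [(0, 5), (300, 900), (600, 400), (900, 800), (1200, 7)]
    print(rescue_minimizers(mz, max_occ=100))
    # One 1023 bp deletion costs log2(1024) = 10 in the gap-length term.
    print(rescore(980, 10, [1023]))
```

One consequence of the formula worth noting: once the observed divergence exceeds the 2% floor, the mismatch and gap-open term (N+G)/(2d) simplifies to (M+N+G)/2, so only the logarithmic gap-length term still distinguishes alignments of equal length.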
@@ -0,0 +1,12 @@
1
+ Q 60 32084 0 0.000000000 32084
2
+ Q 24 318 2 0.000061725 32402
3
+ Q 11 98 2 0.000123077 32500
4
+ Q 8 37 2 0.000184405 32537
5
+ Q 7 37 3 0.000276294 32574
6
+ Q 6 40 3 0.000367940 32614
7
+ Q 5 34 2 0.000428816 32648
8
+ Q 4 37 5 0.000581306 32685
9
+ Q 3 28 6 0.000764222 32713
10
+ Q 2 38 6 0.000946536 32751
11
+ Q 1 50 21 0.001585318 32801
12
+ Q 0 286 150 0.006105117 33087
@@ -0,0 +1,13 @@
1
+ Q 60 32477 0 0.000000000 32477
2
+ Q 22 16 1 0.000030776 32493
3
+ Q 21 44 1 0.000061468 32537
4
+ Q 19 73 1 0.000091996 32610
5
+ Q 14 66 1 0.000122414 32676
6
+ Q 10 26 3 0.000214054 32702
7
+ Q 8 14 1 0.000244529 32716
8
+ Q 7 13 2 0.000305539 32729
9
+ Q 6 47 1 0.000335611 32776
10
+ Q 3 10 1 0.000366010 32786
11
+ Q 2 20 2 0.000426751 32806
12
+ Q 1 248 94 0.003267381 33054
13
+ Q 0 31 17 0.003778147 33085
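
Reviewer note on the `.eval` hunks above: the rows appear to follow the layout printed by ``paftools.js mapeval'', which the manuscript names as its evaluation tool. Assuming that layout (Q, mapping-quality bin, reads in the bin, wrong mappings in the bin, cumulative error rate, cumulative read count, plus a `U` row for unmapped reads), the sketch below parses such rows, recomputes the cumulative error rate, and converts it to the phred scale used in the manuscript's table. The function names are hypothetical.

```python
import math


def parse_mapeval(lines):
    """Parse 'Q'/'U' rows; a leading '+' diff marker is tolerated."""
    rows, unmapped = [], None
    for line in lines:
        fields = line.lstrip("+ \t").split()
        if not fields:
            continue
        if fields[0] == "Q" and len(fields) >= 6:
            rows.append({
                "mapq": int(fields[1]),
                "n": int(fields[2]),
                "wrong": int(fields[3]),
                "cum_err": float(fields[4]),
                "cum_n": int(fields[5]),
            })
        elif fields[0] == "U" and len(fields) >= 2:
            unmapped = int(fields[1])
    return rows, unmapped


def recompute_cum_err(rows):
    """Cumulative wrong mappings divided by cumulative reads, row by row."""
    out, wrong, total = [], 0, 0
    for r in rows:
        wrong += r["wrong"]
        total += r["n"]
        out.append(wrong / total if total else 0.0)
    return out


def phred(err):
    """-10 * log10(err); infinite when no error was observed."""
    return math.inf if err == 0 else -10 * math.log10(err)


if __name__ == "__main__":
    sample = [
        "+ Q 60 32477 0 0.000000000 32477",
        "+ Q 22 16 1 0.000030776 32493",
        "+ Q 21 44 1 0.000061468 32537",
    ]
    rows, unmapped = parse_mapeval(sample)
    for r, e in zip(rows, recompute_cum_err(rows)):
        print(r["mapq"], r["cum_n"], f"{e:.9f}", f"{phred(e):.1f}")
```

On the three sample rows, the recomputed rates agree with the fifth column to the printed precision, which supports the assumed column meaning.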