fluent-plugin-perf-tools 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +15 -0
- data/.rubocop.yml +26 -0
- data/.ruby-version +1 -0
- data/CHANGELOG.md +5 -0
- data/CODE_OF_CONDUCT.md +84 -0
- data/Gemfile +5 -0
- data/LICENSE.txt +21 -0
- data/README.md +43 -0
- data/Rakefile +17 -0
- data/bin/console +15 -0
- data/bin/setup +8 -0
- data/fluent-plugin-perf-tools.gemspec +48 -0
- data/lib/fluent/plugin/in_perf_tools.rb +42 -0
- data/lib/fluent/plugin/perf_tools/cachestat.rb +65 -0
- data/lib/fluent/plugin/perf_tools/command.rb +30 -0
- data/lib/fluent/plugin/perf_tools/version.rb +9 -0
- data/lib/fluent/plugin/perf_tools.rb +11 -0
- data/perf-tools/LICENSE +339 -0
- data/perf-tools/README.md +205 -0
- data/perf-tools/bin/bitesize +1 -0
- data/perf-tools/bin/cachestat +1 -0
- data/perf-tools/bin/execsnoop +1 -0
- data/perf-tools/bin/funccount +1 -0
- data/perf-tools/bin/funcgraph +1 -0
- data/perf-tools/bin/funcslower +1 -0
- data/perf-tools/bin/functrace +1 -0
- data/perf-tools/bin/iolatency +1 -0
- data/perf-tools/bin/iosnoop +1 -0
- data/perf-tools/bin/killsnoop +1 -0
- data/perf-tools/bin/kprobe +1 -0
- data/perf-tools/bin/opensnoop +1 -0
- data/perf-tools/bin/perf-stat-hist +1 -0
- data/perf-tools/bin/reset-ftrace +1 -0
- data/perf-tools/bin/syscount +1 -0
- data/perf-tools/bin/tcpretrans +1 -0
- data/perf-tools/bin/tpoint +1 -0
- data/perf-tools/bin/uprobe +1 -0
- data/perf-tools/deprecated/README.md +1 -0
- data/perf-tools/deprecated/execsnoop-proc +150 -0
- data/perf-tools/deprecated/execsnoop-proc.8 +80 -0
- data/perf-tools/deprecated/execsnoop-proc_example.txt +46 -0
- data/perf-tools/disk/bitesize +175 -0
- data/perf-tools/examples/bitesize_example.txt +63 -0
- data/perf-tools/examples/cachestat_example.txt +58 -0
- data/perf-tools/examples/execsnoop_example.txt +153 -0
- data/perf-tools/examples/funccount_example.txt +126 -0
- data/perf-tools/examples/funcgraph_example.txt +2178 -0
- data/perf-tools/examples/funcslower_example.txt +110 -0
- data/perf-tools/examples/functrace_example.txt +341 -0
- data/perf-tools/examples/iolatency_example.txt +350 -0
- data/perf-tools/examples/iosnoop_example.txt +302 -0
- data/perf-tools/examples/killsnoop_example.txt +62 -0
- data/perf-tools/examples/kprobe_example.txt +379 -0
- data/perf-tools/examples/opensnoop_example.txt +47 -0
- data/perf-tools/examples/perf-stat-hist_example.txt +149 -0
- data/perf-tools/examples/reset-ftrace_example.txt +88 -0
- data/perf-tools/examples/syscount_example.txt +297 -0
- data/perf-tools/examples/tcpretrans_example.txt +93 -0
- data/perf-tools/examples/tpoint_example.txt +210 -0
- data/perf-tools/examples/uprobe_example.txt +321 -0
- data/perf-tools/execsnoop +292 -0
- data/perf-tools/fs/cachestat +167 -0
- data/perf-tools/images/perf-tools_2016.png +0 -0
- data/perf-tools/iolatency +296 -0
- data/perf-tools/iosnoop +296 -0
- data/perf-tools/kernel/funccount +146 -0
- data/perf-tools/kernel/funcgraph +259 -0
- data/perf-tools/kernel/funcslower +248 -0
- data/perf-tools/kernel/functrace +192 -0
- data/perf-tools/kernel/kprobe +270 -0
- data/perf-tools/killsnoop +263 -0
- data/perf-tools/man/man8/bitesize.8 +70 -0
- data/perf-tools/man/man8/cachestat.8 +111 -0
- data/perf-tools/man/man8/execsnoop.8 +104 -0
- data/perf-tools/man/man8/funccount.8 +76 -0
- data/perf-tools/man/man8/funcgraph.8 +166 -0
- data/perf-tools/man/man8/funcslower.8 +129 -0
- data/perf-tools/man/man8/functrace.8 +123 -0
- data/perf-tools/man/man8/iolatency.8 +116 -0
- data/perf-tools/man/man8/iosnoop.8 +169 -0
- data/perf-tools/man/man8/killsnoop.8 +100 -0
- data/perf-tools/man/man8/kprobe.8 +162 -0
- data/perf-tools/man/man8/opensnoop.8 +113 -0
- data/perf-tools/man/man8/perf-stat-hist.8 +111 -0
- data/perf-tools/man/man8/reset-ftrace.8 +49 -0
- data/perf-tools/man/man8/syscount.8 +96 -0
- data/perf-tools/man/man8/tcpretrans.8 +93 -0
- data/perf-tools/man/man8/tpoint.8 +140 -0
- data/perf-tools/man/man8/uprobe.8 +168 -0
- data/perf-tools/misc/perf-stat-hist +223 -0
- data/perf-tools/net/tcpretrans +311 -0
- data/perf-tools/opensnoop +280 -0
- data/perf-tools/syscount +192 -0
- data/perf-tools/system/tpoint +232 -0
- data/perf-tools/tools/reset-ftrace +123 -0
- data/perf-tools/user/uprobe +390 -0
- metadata +349 -0

data/perf-tools/examples/iolatency_example.txt
@@ -0,0 +1,350 @@

Demonstrations of iolatency, the Linux ftrace version.


Here's a busy system doing over 4k disk IOPS:

# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 4381     |######################################|
       1 -> 2       : 9        |#                                     |
       2 -> 4       : 5        |#                                     |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 4053     |######################################|
       1 -> 2       : 18       |#                                     |
       2 -> 4       : 9        |#                                     |
       4 -> 8       : 2        |#                                     |
       8 -> 16      : 1        |#                                     |
      16 -> 32      : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 4658     |######################################|
       1 -> 2       : 9        |#                                     |
       2 -> 4       : 2        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 4298     |######################################|
       1 -> 2       : 17       |#                                     |
       2 -> 4       : 10       |#                                     |
       4 -> 8       : 1        |#                                     |
       8 -> 16      : 1        |#                                     |
^C
Ending tracing...

Disk I/O latency is usually between 0 and 1 milliseconds, as this system uses
SSDs. There are occasional outliers, up to the 16->32 ms range.

Identifying outliers like these is difficult from iostat(1) alone, which at
the same time reported:

# iostat 1
[...]
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.53    0.00    1.05   46.84    0.53   51.05

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00     0.00    0.00   28.00     0.00   112.00     8.00     0.02    0.71    0.00    0.71   0.29   0.80
xvdb              0.00     0.00 2134.00    0.00 18768.00     0.00    17.59     0.51    0.24    0.24    0.00   0.23  50.00
xvdc              0.00     0.00 2088.00    0.00 18504.00     0.00    17.72     0.47    0.22    0.22    0.00   0.22  46.40
md0               0.00     0.00 4222.00    0.00 37256.00     0.00    17.65     0.00    0.00    0.00    0.00   0.00   0.00

I/O latency ("await") averages 0.24 and 0.22 ms for our busy disks, but this
output doesn't show that it is occasionally much higher.

To get more information on these I/O, try the iosnoop(8) tool.


The -Q option includes the block I/O queued time, by tracing based on
block_rq_insert instead of block_rq_issue:

# ./iolatency -Q
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1913     |######################################|
       1 -> 2       : 438      |#########                             |
       2 -> 4       : 100      |##                                    |
       4 -> 8       : 145      |###                                   |
       8 -> 16      : 43       |#                                     |
      16 -> 32      : 43       |#                                     |
      32 -> 64      : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2360     |######################################|
       1 -> 2       : 132      |###                                   |
       2 -> 4       : 72       |##                                    |
       4 -> 8       : 14       |#                                     |
       8 -> 16      : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2138     |######################################|
       1 -> 2       : 496      |#########                             |
       2 -> 4       : 81       |##                                    |
       4 -> 8       : 40       |#                                     |
       8 -> 16      : 1        |#                                     |
      16 -> 32      : 2        |#                                     |
^C
Ending tracing...

I use this along with the default mode to identify problems of load (queueing)
vs problems of the device, which is shown by default.
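
As background, both tracepoints live under the ftrace interface in
/sys/kernel/debug/tracing, which is what these tools use. A rough sketch of
the raw mechanism (the script manages the enable/disable and state for you;
see its source for the real details):

# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_issue/enable     # block_rq_insert for -Q
# cat trace_pipe                                  # watch raw events; Ctrl-C
# echo 0 > events/block/block_rq_issue/enable

iolatency matches these against block_rq_complete events and buckets the time
deltas to build the histograms above.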


Here's a more interesting system. This is doing a mixed read/write workload,
and has a pretty awful latency distribution:

# ./iolatency 5 3
Tracing block I/O. Output every 5 seconds.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2809     |######################################|
       1 -> 2       : 32       |#                                     |
       2 -> 4       : 14       |#                                     |
       4 -> 8       : 6        |#                                     |
       8 -> 16      : 7        |#                                     |
      16 -> 32      : 14       |#                                     |
      32 -> 64      : 39       |#                                     |
      64 -> 128     : 1556     |######################                |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 3027     |######################################|
       1 -> 2       : 19       |#                                     |
       2 -> 4       : 6        |#                                     |
       4 -> 8       : 5        |#                                     |
       8 -> 16      : 3        |#                                     |
      16 -> 32      : 7        |#                                     |
      32 -> 64      : 14       |#                                     |
      64 -> 128     : 540      |#######                               |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2939     |######################################|
       1 -> 2       : 25       |#                                     |
       2 -> 4       : 15       |#                                     |
       4 -> 8       : 2        |#                                     |
       8 -> 16      : 3        |#                                     |
      16 -> 32      : 7        |#                                     |
      32 -> 64      : 17       |#                                     |
      64 -> 128     : 936      |#############                         |

Ending tracing...

It's multi-modal, with most I/O taking 0 to 1 milliseconds, then many between
64 and 128 milliseconds. This is how it looks in iostat:

# iostat -x 1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.52    0.00   12.37   32.99    0.00   54.12

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdap1            0.00    12.00    0.00  156.00     0.00 19968.00   256.00    52.17  184.38    0.00  184.38   2.33  36.40
xvdb              0.00     0.00  298.00    0.00  2732.00     0.00    18.34     0.04    0.12    0.12    0.00   0.11   3.20
xvdc              0.00     0.00  297.00    0.00  2712.00     0.00    18.26     0.08    0.27    0.27    0.00   0.24   7.20
md0               0.00     0.00  595.00    0.00  5444.00     0.00    18.30     0.00    0.00    0.00    0.00   0.00   0.00

Fortunately, it turns out that the high latency is to xvdap1, which is for files
from a low priority application (processing and writing log files). A high
priority application is reading from the other disks, xvdb and xvdc.
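
(A note on the -d option used next: it matches the device's major,minor
numbers, not a device name. A hypothetical ls(1) listing from a system like
this one shows where those numbers come from:

# ls -l /dev/xvda1 /dev/xvdb
brw-rw---- 1 root disk 202,  1 Jun  1 12:00 /dev/xvda1
brw-rw---- 1 root disk 202, 16 Jun  1 12:00 /dev/xvdb

The "202, 1" and "202, 16" columns are the major and minor device numbers,
giving the "202,1" and "202,16" strings used below.)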

Examining xvdap1 only:

# ./iolatency -d 202,1 5
Tracing block I/O. Output every 5 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 38       |##                                    |
       1 -> 2       : 18       |#                                     |
       2 -> 4       : 0        |                                      |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 5        |#                                     |
      16 -> 32      : 11       |#                                     |
      32 -> 64      : 26       |##                                    |
      64 -> 128     : 894      |######################################|

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 75       |###                                   |
       1 -> 2       : 11       |#                                     |
       2 -> 4       : 0        |                                      |
       4 -> 8       : 4        |#                                     |
       8 -> 16      : 4        |#                                     |
      16 -> 32      : 7        |#                                     |
      32 -> 64      : 13       |#                                     |
      64 -> 128     : 1141     |######################################|

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 61       |########                              |
       1 -> 2       : 21       |###                                   |
       2 -> 4       : 5        |#                                     |
       4 -> 8       : 1        |#                                     |
       8 -> 16      : 5        |#                                     |
      16 -> 32      : 7        |#                                     |
      32 -> 64      : 19       |###                                   |
      64 -> 128     : 324      |######################################|
     128 -> 256     : 7        |#                                     |
     256 -> 512     : 26       |####                                  |
^C
Ending tracing...

And now xvdb:

# ./iolatency -d 202,16 5
Tracing block I/O. Output every 5 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1427     |######################################|
       1 -> 2       : 5        |#                                     |
       2 -> 4       : 3        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1409     |######################################|
       1 -> 2       : 6        |#                                     |
       2 -> 4       : 1        |#                                     |
       4 -> 8       : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1478     |######################################|
       1 -> 2       : 6        |#                                     |
       2 -> 4       : 5        |#                                     |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 2        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1437     |######################################|
       1 -> 2       : 5        |#                                     |
       2 -> 4       : 7        |#                                     |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 1        |#                                     |
[...]

While that's much better, it is reaching the 8 - 16 millisecond range,
and these are SSDs with a light workload (~1500 IOPS).

I already know from iosnoop(8) analysis the reason for these high latency
outliers: they are queued behind writes. However, these writes are to a
different disk -- somewhere in this virtualized guest (Xen) there may be a
shared I/O queue.

One way to explore this is to reduce the queue length for the low priority disk,
so that it is less likely to pollute any shared queue. (There are other ways to
investigate and fix this too.) Here I reduce the disk queue length from its
default of 128 to 4:

# echo 4 > /sys/block/xvda1/queue/nr_requests
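
(If you try this yourself, nr_requests is a plain sysfs file, so the change
can be confirmed by reading it back, and undone by writing the old value:

# cat /sys/block/xvda1/queue/nr_requests
4
# echo 128 > /sys/block/xvda1/queue/nr_requests   # restore the default later

The setting is per-device and does not survive a reboot.)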

The overall distribution looks much better:

# ./iolatency 5
Tracing block I/O. Output every 5 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 3005     |######################################|
       1 -> 2       : 19       |#                                     |
       2 -> 4       : 9        |#                                     |
       4 -> 8       : 45       |#                                     |
       8 -> 16      : 859      |###########                           |
      16 -> 32      : 16       |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2959     |######################################|
       1 -> 2       : 43       |#                                     |
       2 -> 4       : 16       |#                                     |
       4 -> 8       : 39       |#                                     |
       8 -> 16      : 1009     |#############                         |
      16 -> 32      : 76       |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 3031     |######################################|
       1 -> 2       : 27       |#                                     |
       2 -> 4       : 9        |#                                     |
       4 -> 8       : 24       |#                                     |
       8 -> 16      : 422      |######                                |
      16 -> 32      : 5        |#                                     |
^C
Ending tracing...

Latency only reaching 32 ms.

Our important disk didn't appear to change much -- maybe a slight improvement
to the outliers:

# ./iolatency -d 202,16 5
Tracing block I/O. Output every 5 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1449     |######################################|
       1 -> 2       : 6        |#                                     |
       2 -> 4       : 5        |#                                     |
       4 -> 8       : 1        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1519     |######################################|
       1 -> 2       : 12       |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1466     |######################################|
       1 -> 2       : 2        |#                                     |
       2 -> 4       : 3        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1460     |######################################|
       1 -> 2       : 4        |#                                     |
       2 -> 4       : 7        |#                                     |
[...]

And here's the other disk after the queue length change:

# ./iolatency -d 202,1 5
Tracing block I/O. Output every 5 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 85       |###                                   |
       1 -> 2       : 12       |#                                     |
       2 -> 4       : 21       |#                                     |
       4 -> 8       : 76       |##                                    |
       8 -> 16      : 1539     |######################################|
      16 -> 32      : 10       |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 123      |##################                    |
       1 -> 2       : 8        |##                                    |
       2 -> 4       : 6        |#                                     |
       4 -> 8       : 17       |###                                   |
       8 -> 16      : 270      |######################################|
      16 -> 32      : 2        |#                                     |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 91       |###                                   |
       1 -> 2       : 23       |#                                     |
       2 -> 4       : 8        |#                                     |
       4 -> 8       : 71       |###                                   |
       8 -> 16      : 1223     |######################################|
      16 -> 32      : 12       |#                                     |
^C
Ending tracing...

Much better looking distribution.


Use -h to print the USAGE message:

# ./iolatency -h
USAGE: iolatency [-hQT] [-d device] [-i iotype] [interval [count]]
                 -d device       # device string (eg, "202,1")
                 -i iotype       # match type (eg, '*R*' for all reads)
                 -Q              # use queue insert as start time
                 -T              # timestamp on output
                 -h              # this usage message
                 interval        # summary interval, seconds (default 1)
                 count           # number of summaries
  eg,
       iolatency                 # summarize latency every second
       iolatency -Q              # include block I/O queue time
       iolatency 5 2             # 2 x 5 second summaries
       iolatency -i '*R*'        # trace reads
       iolatency -d 202,1        # trace device 202,1 only

See the man page and example file for more info.

data/perf-tools/examples/iosnoop_example.txt
@@ -0,0 +1,302 @@

Demonstrations of iosnoop, the Linux ftrace version.


Here's Linux 3.16, tracing tar archiving a filesystem:

# ./iosnoop
Tracing block I/O... Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        1809   W    202,1    17039968     4096       1.32
supervise        1809   W    202,1    17039976     4096       1.30
tar              14794  RM   202,1    8457608      4096       7.53
tar              14794  RM   202,1    8470336      4096      14.90
tar              14794  RM   202,1    8470368      4096       0.27
tar              14794  RM   202,1    8470784      4096       7.74
tar              14794  RM   202,1    8470360      4096       0.25
tar              14794  RM   202,1    8469968      4096       0.24
tar              14794  RM   202,1    8470240      4096       0.24
tar              14794  RM   202,1    8470392      4096       0.23
tar              14794  RM   202,1    8470544      4096       5.96
tar              14794  RM   202,1    8470552      4096       0.27
tar              14794  RM   202,1    8470384      4096       0.24
[...]

The "tar" I/O looks like it is slightly random (based on BLOCK) and 4 Kbytes
in size (BYTES). One returned in 14.9 milliseconds, but the rest were fast,
so fast (0.24 ms) that some may be returning from some level of cache (disk
or controller).

The "RM" TYPE means Read of Metadata. The start of the trace shows a
couple of Writes by supervise PID 1809.


Here's a deliberate random I/O workload:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
randread         9182   R    202,32   30835224     8192       0.18
randread         9182   R    202,32   21466088     8192       0.15
randread         9182   R    202,32   13529496     8192       0.16
randread         9182   R    202,16   21250648     8192       0.18
randread         9182   R    202,16   1536776      32768      0.30
randread         9182   R    202,32   17157560     24576      0.23
randread         9182   R    202,32   21313320     8192       0.16
randread         9182   R    202,32   862184       8192       0.18
randread         9182   R    202,16   25496872     8192       0.21
randread         9182   R    202,32   31471768     8192       0.18
randread         9182   R    202,16   27571336     8192       0.20
randread         9182   R    202,16   30783448     8192       0.16
randread         9182   R    202,16   21435224     8192       1.28
randread         9182   R    202,16   970616       8192       0.15
randread         9182   R    202,32   13855608     8192       0.16
randread         9182   R    202,32   17549960     8192       0.15
randread         9182   R    202,32   30938232     8192       0.14
[...]

Note the changing offsets. The resulting latencies are very good in this case,
because the storage devices are flash memory-based solid state disks (SSDs).
For rotational disks, I'd expect these latencies to be roughly 10 ms.


Here's an idle Linux 3.2 system:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        3055   W    202,1    12852496     4096       0.64
supervise        3055   W    202,1    12852504     4096       1.32
supervise        3055   W    202,1    12852800     4096       0.55
supervise        3055   W    202,1    12852808     4096       0.52
jbd2/xvda1-212   212    WS   202,1    1066720      45056     41.52
jbd2/xvda1-212   212    WS   202,1    1066808      12288     41.52
jbd2/xvda1-212   212    WS   202,1    1066832      4096      32.37
supervise        3055   W    202,1    12852800     4096      14.28
supervise        3055   W    202,1    12855920     4096      14.07
supervise        3055   W    202,1    12855960     4096       0.67
supervise        3055   W    202,1    12858208     4096       1.00
flush:1-409      409    W    202,1    12939640     12288     18.00
[...]

This shows supervise doing various writes from PID 3055. The highest latency
was from jbd2/xvda1-212, the journaling block device driver, doing
synchronous writes (TYPE = WS).


Options can be added to show the start time (-s) and end time (-t):

# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
5982800.302061  5982800.302679  supervise        1809   W    202,1    17039600     4096       0.62
5982800.302423  5982800.302842  supervise        1809   W    202,1    17039608     4096       0.42
5982800.304962  5982800.305446  supervise        1801   W    202,1    17039616     4096       0.48
5982800.305250  5982800.305676  supervise        1801   W    202,1    17039624     4096       0.43
5982800.308849  5982800.309452  supervise        1810   W    202,1    12862464     4096       0.60
5982800.308856  5982800.309470  supervise        1806   W    202,1    17039632     4096       0.61
5982800.309206  5982800.309740  supervise        1806   W    202,1    17039640     4096       0.53
5982800.309211  5982800.309805  supervise        1810   W    202,1    12862472     4096       0.59
5982800.309332  5982800.309953  supervise        1812   W    202,1    17039648     4096       0.62
5982800.309676  5982800.310283  supervise        1812   W    202,1    17039656     4096       0.61
[...]

This is useful when gathering I/O event data for post-processing.
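
For example, here is a minimal post-processing sketch that finds the slowest
I/O in a capture (assuming the -ts output above was redirected to a file named
"out"; LATms is the last field, and requiring a numeric last field skips the
header lines):

# awk '$NF ~ /^[0-9.]+$/ { if ($NF + 0 > max) { max = $NF; line = $0 } }
      END { print line }' out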


Now for matching on a single PID:

# ./iosnoop -p 1805
Tracing block I/O issued by PID 1805. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
supervise        1805   W    202,1    17039648     4096       0.68
supervise        1805   W    202,1    17039672     4096       0.60
supervise        1805   W    202,1    17040040     4096       0.62
supervise        1805   W    202,1    17040056     4096       0.47
supervise        1805   W    202,1    17040624     4096       0.49
supervise        1805   W    202,1    17040632     4096       0.44
^C
Ending tracing...

This option works by using an in-kernel filter for that PID on I/O issue.
There is also a "-n" option to match on process names; however, that
currently does so in user space, and so is less efficient.
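
The in-kernel filter is ftrace's own event filtering, written to the
tracepoint's filter file. The script's exact setup may differ, but the
mechanism looks roughly like this:

# echo 'common_pid == 1805' > \
    /sys/kernel/debug/tracing/events/block/block_rq_issue/filter

The kernel then discards non-matching events before they reach the trace
buffer, which is why -p is cheaper than the user-space -n match.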

I would say that this will generally identify the origin process, but there will
be an error margin. Depending on the file system, block I/O queueing, and I/O
subsystem, this could miss events that aren't issued in this PID context but are
related to this PID (eg, triggering a read readahead on the completion of
previous I/O; whether this happens is up to the file system and storage
subsystem). You can try the -Q option for more reliable process identification.


The -Q option begins tracing on block I/O queue insert, instead of issue.
Here's before and after, while dd(1) writes a large file:

# ./iosnoop
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
dd               26983  WS   202,16   4064416      45056     16.70
dd               26983  WS   202,16   4064504      45056     16.72
dd               26983  WS   202,16   4064592      45056     16.74
dd               26983  WS   202,16   4064680      45056     16.75
cat              27031  WS   202,16   4064768      45056     16.56
cat              27031  WS   202,16   4064856      45056     16.46
cat              27031  WS   202,16   4064944      45056     16.40
gawk             27030  WS   202,16   4065032      45056      0.88
gawk             27030  WS   202,16   4065120      45056      1.01
gawk             27030  WS   202,16   4065208      45056     16.15
gawk             27030  WS   202,16   4065296      45056     16.16
gawk             27030  WS   202,16   4065384      45056     16.16
[...]

The output here shows the block I/O time from issue to completion (LATms),
which is largely representative of the device.

The process names and PIDs identify dd, cat, and gawk. By default iosnoop shows
who is on-CPU at the time of block I/O issue, but these may not be the processes
that originated the I/O. In this case (having debugged it), the reason is that
processes such as cat and gawk are making hypervisor calls (this is a Xen
guest instance), eg, for memory operations, and during hypervisor processing a
queue of pending work is checked and dispatched. So cat and gawk were on-CPU
when the block device I/O was issued, but they didn't originate it.

Now the -Q option is used:

# ./iosnoop -Q
Tracing block I/O. Ctrl-C to end.
COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
kjournald        1217   WS   202,16   6132200      45056    141.12
kjournald        1217   WS   202,16   6132288      45056    141.10
kjournald        1217   WS   202,16   6132376      45056    141.10
kjournald        1217   WS   202,16   6132464      45056    141.11
kjournald        1217   WS   202,16   6132552      40960    141.11
dd               27718  WS   202,16   6132624      4096       0.18
flush:16-1279    1279   W    202,16   6132632      20480      0.52
flush:16-1279    1279   W    202,16   5940856      4096       0.50
flush:16-1279    1279   W    202,16   5949056      4096       0.52
flush:16-1279    1279   W    202,16   5957256      4096       0.54
flush:16-1279    1279   W    202,16   5965456      4096       0.56
flush:16-1279    1279   W    202,16   5973656      4096       0.58
flush:16-1279    1279   W    202,16   5981856      4096       0.60
flush:16-1279    1279   W    202,16   5990056      4096       0.63
[...]

This uses the block_rq_insert tracepoint as the starting point of I/O, instead
of block_rq_issue. This makes the following differences to columns and options:

- COMM: more likely to show the originating process.
- PID: more likely to show the originating process.
- LATms: shows the I/O time, including time spent on the block I/O queue.
- STARTs (not shown above): shows the time of queue insert, not I/O issue.
- -p PID: more likely to match the originating process.
- -n name: more likely to match the originating process.

The reason that this ftrace-based iosnoop does not just instrument both insert
and issue tracepoints is one of overhead. Even with buffering, iosnoop can
have difficulty under high load.


If I want to capture events for post-processing, I use the duration mode, which
not only lets me set the duration, but also uses buffering, which reduces the
overheads of tracing.

Capturing 5 seconds, with both start timestamps (-s) and end timestamps (-t):

# time ./iosnoop -ts 5 > out

real    0m5.566s
user    0m0.336s
sys     0m0.140s
# wc out
27010 243072 2619744 out

This server is doing over 5,000 disk IOPS (27,010 events in 5 seconds). Even
with buffering, this did consume a measurable amount of CPU to capture: 0.48
seconds of CPU time in total. Note that the run took 5.57 seconds: this is 5
seconds for the capture, followed by the CPU time for iosnoop to fetch and
process the buffer.

Now tracing for 30 seconds:

# time ./iosnoop -ts 30 > out

real    0m31.207s
user    0m0.884s
sys     0m0.472s
# wc out
64259 578313 6232898 out

Since it's the same server and workload, this should have over 150k events
(27k events in 5 seconds suggests around 162k in 30), but the file only has
64k. The tracing buffer has overflowed, and events have been dropped. If I
really must capture this many events, I can either increase the trace buffer
size (it's the bufsize_kb setting in the script), or use a different tracer
(perf_events, SystemTap, ktap, etc.). If the IOPS rate is low (eg, less than
5k), then unbuffered (no duration), despite the higher overheads, may be
sufficient, and will keep capturing events until Ctrl-C.
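
(bufsize_kb is a variable near the top of the script, and should correspond
to ftrace's per-CPU buffer size file, which you can inspect directly -- a
sketch, assuming the usual debugfs mount point:

# cat /sys/kernel/debug/tracing/buffer_size_kb

Note that the buffer is allocated per CPU, so increasing it multiplies across
all CPUs; size it to the event rate rather than simply maxing it out.)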


Here's an example of digging into the sequence of I/O to explain an outlier.
My randread program on an SSD server (which is an AWS EC2 instance) usually
experiences about 0.15 ms I/O latency, but there are some outliers as high as
20 milliseconds. Here's an excerpt:

# ./iosnoop -ts > out
# more out
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM             PID    TYPE DEV      BLOCK        BYTES     LATms
6037559.121523  6037559.121685  randread         22341  R    202,32   29295416     8192       0.16
6037559.121719  6037559.121874  randread         22341  R    202,16   27515304     8192       0.16
[...]
6037595.999508  6037596.000051  supervise        1692   W    202,1    12862968     4096       0.54
6037595.999513  6037596.000144  supervise        1687   W    202,1    17040160     4096       0.63
6037595.999634  6037596.000309  supervise        1693   W    202,1    17040168     4096       0.68
6037595.999937  6037596.000440  supervise        1693   W    202,1    17040176     4096       0.50
6037596.000579  6037596.001192  supervise        1689   W    202,1    17040184     4096       0.61
6037596.000826  6037596.001360  supervise        1689   W    202,1    17040192     4096       0.53
6037595.998302  6037596.018133  randread         22341  R    202,32   954168       8192      20.03
6037595.998303  6037596.018150  randread         22341  R    202,32   954200       8192      20.05
6037596.018182  6037596.018347  randread         22341  R    202,32   18836600     8192       0.16
[...]

It's important to sort on the I/O completion time (ENDs). In this case it's
already in the correct order.
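
(If a capture isn't already ordered, coreutils sort can arrange it by the
ENDs column, which is field 2:

# sort -n -k2,2 out

after stripping the two header lines, or just ignoring where they land.)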

So my 20 ms reads happened after a large group of supervise writes were
completed (I truncated dozens of supervise write lines to keep this example
short). Other latency outliers in this output file showed the same sequence:
slow reads after a batch of writes.

Note the I/O request timestamp (STARTs), which shows that these 20 ms reads were
issued before the supervise writes -- so they had been sitting on a queue. I've
debugged this type of issue many times before, but this one is different: those
writes were to a different device (202,1), so I would have assumed they would be
on different queues, and wouldn't interfere with each other. Somewhere in this
system (Xen guest) it looks like there is a shared queue. (Having just
discovered this using iosnoop, I can't yet tell you which queue, but I'd hope
that after identifying it there would be a way to tune its queueing behavior,
so that we can eliminate or reduce the severity of these outliers.)


Use -h to print the USAGE message:

# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name]
               [duration]
                 -d device       # device string (eg, "202,1")
                 -i iotype       # match type (eg, '*R*' for all reads)
                 -n name         # process name to match on I/O issue
                 -p PID          # PID to match on I/O issue
                 -Q              # use queue insert as start time
                 -s              # include start time of I/O (s)
                 -t              # include completion time of I/O (s)
                 -h              # this usage message
                 duration        # duration seconds, and use buffers
  eg,
       iosnoop                   # watch block I/O live (unbuffered)
       iosnoop 1                 # trace 1 sec (buffered)
       iosnoop -Q                # include queueing time in LATms
       iosnoop -ts               # include start and end timestamps
       iosnoop -i '*R*'          # trace reads
       iosnoop -p 91             # show I/O issued when PID 91 is on-CPU
       iosnoop -Qp 91            # show I/O queued by PID 91, queue time

See the man page and example file for more info.