fluent-plugin-perf-tools 0.1.0

Files changed (98)
  1. checksums.yaml +7 -0
  2. data/.gitignore +15 -0
  3. data/.rubocop.yml +26 -0
  4. data/.ruby-version +1 -0
  5. data/CHANGELOG.md +5 -0
  6. data/CODE_OF_CONDUCT.md +84 -0
  7. data/Gemfile +5 -0
  8. data/LICENSE.txt +21 -0
  9. data/README.md +43 -0
  10. data/Rakefile +17 -0
  11. data/bin/console +15 -0
  12. data/bin/setup +8 -0
  13. data/fluent-plugin-perf-tools.gemspec +48 -0
  14. data/lib/fluent/plugin/in_perf_tools.rb +42 -0
  15. data/lib/fluent/plugin/perf_tools/cachestat.rb +65 -0
  16. data/lib/fluent/plugin/perf_tools/command.rb +30 -0
  17. data/lib/fluent/plugin/perf_tools/version.rb +9 -0
  18. data/lib/fluent/plugin/perf_tools.rb +11 -0
  19. data/perf-tools/LICENSE +339 -0
  20. data/perf-tools/README.md +205 -0
  21. data/perf-tools/bin/bitesize +1 -0
  22. data/perf-tools/bin/cachestat +1 -0
  23. data/perf-tools/bin/execsnoop +1 -0
  24. data/perf-tools/bin/funccount +1 -0
  25. data/perf-tools/bin/funcgraph +1 -0
  26. data/perf-tools/bin/funcslower +1 -0
  27. data/perf-tools/bin/functrace +1 -0
  28. data/perf-tools/bin/iolatency +1 -0
  29. data/perf-tools/bin/iosnoop +1 -0
  30. data/perf-tools/bin/killsnoop +1 -0
  31. data/perf-tools/bin/kprobe +1 -0
  32. data/perf-tools/bin/opensnoop +1 -0
  33. data/perf-tools/bin/perf-stat-hist +1 -0
  34. data/perf-tools/bin/reset-ftrace +1 -0
  35. data/perf-tools/bin/syscount +1 -0
  36. data/perf-tools/bin/tcpretrans +1 -0
  37. data/perf-tools/bin/tpoint +1 -0
  38. data/perf-tools/bin/uprobe +1 -0
  39. data/perf-tools/deprecated/README.md +1 -0
  40. data/perf-tools/deprecated/execsnoop-proc +150 -0
  41. data/perf-tools/deprecated/execsnoop-proc.8 +80 -0
  42. data/perf-tools/deprecated/execsnoop-proc_example.txt +46 -0
  43. data/perf-tools/disk/bitesize +175 -0
  44. data/perf-tools/examples/bitesize_example.txt +63 -0
  45. data/perf-tools/examples/cachestat_example.txt +58 -0
  46. data/perf-tools/examples/execsnoop_example.txt +153 -0
  47. data/perf-tools/examples/funccount_example.txt +126 -0
  48. data/perf-tools/examples/funcgraph_example.txt +2178 -0
  49. data/perf-tools/examples/funcslower_example.txt +110 -0
  50. data/perf-tools/examples/functrace_example.txt +341 -0
  51. data/perf-tools/examples/iolatency_example.txt +350 -0
  52. data/perf-tools/examples/iosnoop_example.txt +302 -0
  53. data/perf-tools/examples/killsnoop_example.txt +62 -0
  54. data/perf-tools/examples/kprobe_example.txt +379 -0
  55. data/perf-tools/examples/opensnoop_example.txt +47 -0
  56. data/perf-tools/examples/perf-stat-hist_example.txt +149 -0
  57. data/perf-tools/examples/reset-ftrace_example.txt +88 -0
  58. data/perf-tools/examples/syscount_example.txt +297 -0
  59. data/perf-tools/examples/tcpretrans_example.txt +93 -0
  60. data/perf-tools/examples/tpoint_example.txt +210 -0
  61. data/perf-tools/examples/uprobe_example.txt +321 -0
  62. data/perf-tools/execsnoop +292 -0
  63. data/perf-tools/fs/cachestat +167 -0
  64. data/perf-tools/images/perf-tools_2016.png +0 -0
  65. data/perf-tools/iolatency +296 -0
  66. data/perf-tools/iosnoop +296 -0
  67. data/perf-tools/kernel/funccount +146 -0
  68. data/perf-tools/kernel/funcgraph +259 -0
  69. data/perf-tools/kernel/funcslower +248 -0
  70. data/perf-tools/kernel/functrace +192 -0
  71. data/perf-tools/kernel/kprobe +270 -0
  72. data/perf-tools/killsnoop +263 -0
  73. data/perf-tools/man/man8/bitesize.8 +70 -0
  74. data/perf-tools/man/man8/cachestat.8 +111 -0
  75. data/perf-tools/man/man8/execsnoop.8 +104 -0
  76. data/perf-tools/man/man8/funccount.8 +76 -0
  77. data/perf-tools/man/man8/funcgraph.8 +166 -0
  78. data/perf-tools/man/man8/funcslower.8 +129 -0
  79. data/perf-tools/man/man8/functrace.8 +123 -0
  80. data/perf-tools/man/man8/iolatency.8 +116 -0
  81. data/perf-tools/man/man8/iosnoop.8 +169 -0
  82. data/perf-tools/man/man8/killsnoop.8 +100 -0
  83. data/perf-tools/man/man8/kprobe.8 +162 -0
  84. data/perf-tools/man/man8/opensnoop.8 +113 -0
  85. data/perf-tools/man/man8/perf-stat-hist.8 +111 -0
  86. data/perf-tools/man/man8/reset-ftrace.8 +49 -0
  87. data/perf-tools/man/man8/syscount.8 +96 -0
  88. data/perf-tools/man/man8/tcpretrans.8 +93 -0
  89. data/perf-tools/man/man8/tpoint.8 +140 -0
  90. data/perf-tools/man/man8/uprobe.8 +168 -0
  91. data/perf-tools/misc/perf-stat-hist +223 -0
  92. data/perf-tools/net/tcpretrans +311 -0
  93. data/perf-tools/opensnoop +280 -0
  94. data/perf-tools/syscount +192 -0
  95. data/perf-tools/system/tpoint +232 -0
  96. data/perf-tools/tools/reset-ftrace +123 -0
  97. data/perf-tools/user/uprobe +390 -0
  98. metadata +349 -0
data/perf-tools/examples/iolatency_example.txt
@@ -0,0 +1,350 @@
+ Demonstrations of iolatency, the Linux ftrace version.
+
+
+ Here's a busy system doing over 4k disk IOPS:
+
+ # ./iolatency
+ Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 4381 |######################################|
+ 1 -> 2 : 9 |# |
+ 2 -> 4 : 5 |# |
+ 4 -> 8 : 0 | |
+ 8 -> 16 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 4053 |######################################|
+ 1 -> 2 : 18 |# |
+ 2 -> 4 : 9 |# |
+ 4 -> 8 : 2 |# |
+ 8 -> 16 : 1 |# |
+ 16 -> 32 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 4658 |######################################|
+ 1 -> 2 : 9 |# |
+ 2 -> 4 : 2 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 4298 |######################################|
+ 1 -> 2 : 17 |# |
+ 2 -> 4 : 10 |# |
+ 4 -> 8 : 1 |# |
+ 8 -> 16 : 1 |# |
+ ^C
+ Ending tracing...
+
+ Disk I/O latency is usually between 0 and 1 milliseconds, as this system uses
+ SSDs. There are occasional outliers, up to the 16->32 ms range.
+
+ Identifying outliers like these is difficult from iostat(1) alone, which at
+ the same time reported:
+
+ # iostat 1
+ [...]
+ avg-cpu: %user %nice %system %iowait %steal %idle
+ 0.53 0.00 1.05 46.84 0.53 51.05
+
+ Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
+ xvdap1 0.00 0.00 0.00 28.00 0.00 112.00 8.00 0.02 0.71 0.00 0.71 0.29 0.80
+ xvdb 0.00 0.00 2134.00 0.00 18768.00 0.00 17.59 0.51 0.24 0.24 0.00 0.23 50.00
+ xvdc 0.00 0.00 2088.00 0.00 18504.00 0.00 17.72 0.47 0.22 0.22 0.00 0.22 46.40
+ md0 0.00 0.00 4222.00 0.00 37256.00 0.00 17.65 0.00 0.00 0.00 0.00 0.00 0.00
+
+ I/O latency ("await") averages 0.24 and 0.22 ms for our busy disks, but this
+ output doesn't show that it is occasionally much higher.
+
+ To get more information on these I/O, try the iosnoop(8) tool.
+
+
+ The -Q option includes the block I/O queued time, by tracing based on
+ block_rq_insert instead of block_rq_issue:
+
+ # ./iolatency -Q
+ Tracing block I/O. Output every 1 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1913 |######################################|
+ 1 -> 2 : 438 |######### |
+ 2 -> 4 : 100 |## |
+ 4 -> 8 : 145 |### |
+ 8 -> 16 : 43 |# |
+ 16 -> 32 : 43 |# |
+ 32 -> 64 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 2360 |######################################|
+ 1 -> 2 : 132 |### |
+ 2 -> 4 : 72 |## |
+ 4 -> 8 : 14 |# |
+ 8 -> 16 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 2138 |######################################|
+ 1 -> 2 : 496 |######### |
+ 2 -> 4 : 81 |## |
+ 4 -> 8 : 40 |# |
+ 8 -> 16 : 1 |# |
+ 16 -> 32 : 2 |# |
+ ^C
+ Ending tracing...
+
+ I use this along with the default mode to identify problems of load (queueing)
+ vs problems of the device, which is shown by default.
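+
+ For reference, the default and -Q start points are ordinary block tracepoints.
+ iolatency manages the tracing setup itself, but, assuming the usual debugfs
+ mount point, you can check that your kernel provides both tracepoints with:
+
+ # grep block_rq_ /sys/kernel/debug/tracing/available_events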
+
+
+ Here's a more interesting system. This is doing a mixed read/write workload,
+ and has a pretty awful latency distribution:
+
+ # ./iolatency 5 3
+ Tracing block I/O. Output every 5 seconds.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 2809 |######################################|
+ 1 -> 2 : 32 |# |
+ 2 -> 4 : 14 |# |
+ 4 -> 8 : 6 |# |
+ 8 -> 16 : 7 |# |
+ 16 -> 32 : 14 |# |
+ 32 -> 64 : 39 |# |
+ 64 -> 128 : 1556 |###################### |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 3027 |######################################|
+ 1 -> 2 : 19 |# |
+ 2 -> 4 : 6 |# |
+ 4 -> 8 : 5 |# |
+ 8 -> 16 : 3 |# |
+ 16 -> 32 : 7 |# |
+ 32 -> 64 : 14 |# |
+ 64 -> 128 : 540 |####### |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 2939 |######################################|
+ 1 -> 2 : 25 |# |
+ 2 -> 4 : 15 |# |
+ 4 -> 8 : 2 |# |
+ 8 -> 16 : 3 |# |
+ 16 -> 32 : 7 |# |
+ 32 -> 64 : 17 |# |
+ 64 -> 128 : 936 |############# |
+
+ Ending tracing...
+
+ It's multi-modal, with most I/O taking 0 to 1 milliseconds, then many between
+ 64 and 128 milliseconds. This is how it looks in iostat:
+
+ # iostat -x 1
+
+ avg-cpu: %user %nice %system %iowait %steal %idle
+ 0.52 0.00 12.37 32.99 0.00 54.12
+
+ Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
+ xvdap1 0.00 12.00 0.00 156.00 0.00 19968.00 256.00 52.17 184.38 0.00 184.38 2.33 36.40
+ xvdb 0.00 0.00 298.00 0.00 2732.00 0.00 18.34 0.04 0.12 0.12 0.00 0.11 3.20
+ xvdc 0.00 0.00 297.00 0.00 2712.00 0.00 18.26 0.08 0.27 0.27 0.00 0.24 7.20
+ md0 0.00 0.00 595.00 0.00 5444.00 0.00 18.30 0.00 0.00 0.00 0.00 0.00 0.00
+
+ Fortunately, it turns out that the high latency is to xvdap1, which is for files
+ from a low priority application (processing and writing log files). A high
+ priority application is reading from the other disks, xvdb and xvdc.
+
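+ The -d argument is the disk's major,minor device numbers: 202,1 for xvdap1 and
+ 202,16 for xvdb in this example. One way to look these up, if lsblk(8) is
+ available, is its MAJ:MIN column, shown by default:
+
+ # lsblk
+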
+ Examining xvdap1 only:
+
+ # ./iolatency -d 202,1 5
+ Tracing block I/O. Output every 5 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 38 |## |
+ 1 -> 2 : 18 |# |
+ 2 -> 4 : 0 | |
+ 4 -> 8 : 0 | |
+ 8 -> 16 : 5 |# |
+ 16 -> 32 : 11 |# |
+ 32 -> 64 : 26 |## |
+ 64 -> 128 : 894 |######################################|
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 75 |### |
+ 1 -> 2 : 11 |# |
+ 2 -> 4 : 0 | |
+ 4 -> 8 : 4 |# |
+ 8 -> 16 : 4 |# |
+ 16 -> 32 : 7 |# |
+ 32 -> 64 : 13 |# |
+ 64 -> 128 : 1141 |######################################|
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 61 |######## |
+ 1 -> 2 : 21 |### |
+ 2 -> 4 : 5 |# |
+ 4 -> 8 : 1 |# |
+ 8 -> 16 : 5 |# |
+ 16 -> 32 : 7 |# |
+ 32 -> 64 : 19 |### |
+ 64 -> 128 : 324 |######################################|
+ 128 -> 256 : 7 |# |
+ 256 -> 512 : 26 |#### |
+ ^C
+ Ending tracing...
+
+ And now xvdb:
+
+ # ./iolatency -d 202,16 5
+ Tracing block I/O. Output every 5 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1427 |######################################|
+ 1 -> 2 : 5 |# |
+ 2 -> 4 : 3 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1409 |######################################|
+ 1 -> 2 : 6 |# |
+ 2 -> 4 : 1 |# |
+ 4 -> 8 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1478 |######################################|
+ 1 -> 2 : 6 |# |
+ 2 -> 4 : 5 |# |
+ 4 -> 8 : 0 | |
+ 8 -> 16 : 2 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1437 |######################################|
+ 1 -> 2 : 5 |# |
+ 2 -> 4 : 7 |# |
+ 4 -> 8 : 0 | |
+ 8 -> 16 : 1 |# |
+ [...]
+
+ While that's much better, it is reaching the 8 - 16 millisecond range,
+ and these are SSDs with a light workload (~1500 IOPS).
+
+ I already know from iosnoop(8) analysis the reason for these high latency
+ outliers: they are queued behind writes. However, these writes are to a
+ different disk -- somewhere in this virtualized guest (Xen) there may be a
+ shared I/O queue.
+
+ One way to explore this is to reduce the queue length for the low priority disk,
+ so that it is less likely to pollute any shared queue. (There are other ways to
+ investigate and fix this too.) Here I reduce the disk queue length from its
+ default of 128 to 4:
+
+ # echo 4 > /sys/block/xvda1/queue/nr_requests
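+
+ To check the value before changing it, or to restore the default of 128 later,
+ the same sysfs file can be read and written (assuming the same path):
+
+ # cat /sys/block/xvda1/queue/nr_requests
+ # echo 128 > /sys/block/xvda1/queue/nr_requests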
+
+ The overall distribution looks much better:
+
+ # ./iolatency 5
+ Tracing block I/O. Output every 5 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 3005 |######################################|
+ 1 -> 2 : 19 |# |
+ 2 -> 4 : 9 |# |
+ 4 -> 8 : 45 |# |
+ 8 -> 16 : 859 |########### |
+ 16 -> 32 : 16 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 2959 |######################################|
+ 1 -> 2 : 43 |# |
+ 2 -> 4 : 16 |# |
+ 4 -> 8 : 39 |# |
+ 8 -> 16 : 1009 |############# |
+ 16 -> 32 : 76 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 3031 |######################################|
+ 1 -> 2 : 27 |# |
+ 2 -> 4 : 9 |# |
+ 4 -> 8 : 24 |# |
+ 8 -> 16 : 422 |###### |
+ 16 -> 32 : 5 |# |
+ ^C
+ Ending tracing...
+
+ Latency only reaching 32 ms.
+
+ Our important disk didn't appear to change much -- maybe a slight improvement
+ to the outliers:
+
+ # ./iolatency -d 202,16 5
+ Tracing block I/O. Output every 5 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1449 |######################################|
+ 1 -> 2 : 6 |# |
+ 2 -> 4 : 5 |# |
+ 4 -> 8 : 1 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1519 |######################################|
+ 1 -> 2 : 12 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1466 |######################################|
+ 1 -> 2 : 2 |# |
+ 2 -> 4 : 3 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 1460 |######################################|
+ 1 -> 2 : 4 |# |
+ 2 -> 4 : 7 |# |
+ [...]
+
+ And here's the other disk after the queue length change:
+
+ # ./iolatency -d 202,1 5
+ Tracing block I/O. Output every 5 seconds. Ctrl-C to end.
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 85 |### |
+ 1 -> 2 : 12 |# |
+ 2 -> 4 : 21 |# |
+ 4 -> 8 : 76 |## |
+ 8 -> 16 : 1539 |######################################|
+ 16 -> 32 : 10 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 123 |################## |
+ 1 -> 2 : 8 |## |
+ 2 -> 4 : 6 |# |
+ 4 -> 8 : 17 |### |
+ 8 -> 16 : 270 |######################################|
+ 16 -> 32 : 2 |# |
+
+ >=(ms) .. <(ms) : I/O |Distribution |
+ 0 -> 1 : 91 |### |
+ 1 -> 2 : 23 |# |
+ 2 -> 4 : 8 |# |
+ 4 -> 8 : 71 |### |
+ 8 -> 16 : 1223 |######################################|
+ 16 -> 32 : 12 |# |
+ ^C
+ Ending tracing...
+
+ Much better looking distribution.
+
+
+ Use -h to print the USAGE message:
+
+ # ./iolatency -h
+ USAGE: iolatency [-hQT] [-d device] [-i iotype] [interval [count]]
+ -d device # device string (eg, "202,1)
+ -i iotype # match type (eg, '*R*' for all reads)
+ -Q # use queue insert as start time
+ -T # timestamp on output
+ -h # this usage message
+ interval # summary interval, seconds (default 1)
+ count # number of summaries
+ eg,
+ iolatency # summarize latency every second
+ iolatency -Q # include block I/O queue time
+ iolatency 5 2 # 2 x 5 second summaries
+ iolatency -i '*R*' # trace reads
+ iolatency -d 202,1 # trace device 202,1 only
+
+ See the man page and example file for more info.
data/perf-tools/examples/iosnoop_example.txt
@@ -0,0 +1,302 @@
+ Demonstrations of iosnoop, the Linux ftrace version.
+
+
+ Here's Linux 3.16, tracing tar archiving a filesystem:
+
+ # ./iosnoop
+ Tracing block I/O... Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ supervise 1809 W 202,1 17039968 4096 1.32
+ supervise 1809 W 202,1 17039976 4096 1.30
+ tar 14794 RM 202,1 8457608 4096 7.53
+ tar 14794 RM 202,1 8470336 4096 14.90
+ tar 14794 RM 202,1 8470368 4096 0.27
+ tar 14794 RM 202,1 8470784 4096 7.74
+ tar 14794 RM 202,1 8470360 4096 0.25
+ tar 14794 RM 202,1 8469968 4096 0.24
+ tar 14794 RM 202,1 8470240 4096 0.24
+ tar 14794 RM 202,1 8470392 4096 0.23
+ tar 14794 RM 202,1 8470544 4096 5.96
+ tar 14794 RM 202,1 8470552 4096 0.27
+ tar 14794 RM 202,1 8470384 4096 0.24
+ [...]
+
+ The "tar" I/O looks like it is slightly random (based on BLOCK) and 4 Kbytes
+ in size (BYTES). One returned in 14.9 milliseconds, but the rest were fast,
+ so fast (0.24 ms) that some may be returning from some level of cache (disk or
+ controller).
+
+ The "RM" TYPE means Read of Metadata. The start of the trace shows a
+ couple of Writes by supervise PID 1809.
+
+
+ Here's a deliberate random I/O workload:
+
+ # ./iosnoop
+ Tracing block I/O. Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ randread 9182 R 202,32 30835224 8192 0.18
+ randread 9182 R 202,32 21466088 8192 0.15
+ randread 9182 R 202,32 13529496 8192 0.16
+ randread 9182 R 202,16 21250648 8192 0.18
+ randread 9182 R 202,16 1536776 32768 0.30
+ randread 9182 R 202,32 17157560 24576 0.23
+ randread 9182 R 202,32 21313320 8192 0.16
+ randread 9182 R 202,32 862184 8192 0.18
+ randread 9182 R 202,16 25496872 8192 0.21
+ randread 9182 R 202,32 31471768 8192 0.18
+ randread 9182 R 202,16 27571336 8192 0.20
+ randread 9182 R 202,16 30783448 8192 0.16
+ randread 9182 R 202,16 21435224 8192 1.28
+ randread 9182 R 202,16 970616 8192 0.15
+ randread 9182 R 202,32 13855608 8192 0.16
+ randread 9182 R 202,32 17549960 8192 0.15
+ randread 9182 R 202,32 30938232 8192 0.14
+ [...]
+
+ Note the changing offsets. The resulting latencies are very good in this case,
+ because the storage devices are flash memory-based solid state disks (SSDs).
+ For rotational disks, I'd expect these latencies to be roughly 10 ms.
+
+
+ Here's an idle Linux 3.2 system:
+
+ # ./iosnoop
+ Tracing block I/O. Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ supervise 3055 W 202,1 12852496 4096 0.64
+ supervise 3055 W 202,1 12852504 4096 1.32
+ supervise 3055 W 202,1 12852800 4096 0.55
+ supervise 3055 W 202,1 12852808 4096 0.52
+ jbd2/xvda1-212 212 WS 202,1 1066720 45056 41.52
+ jbd2/xvda1-212 212 WS 202,1 1066808 12288 41.52
+ jbd2/xvda1-212 212 WS 202,1 1066832 4096 32.37
+ supervise 3055 W 202,1 12852800 4096 14.28
+ supervise 3055 W 202,1 12855920 4096 14.07
+ supervise 3055 W 202,1 12855960 4096 0.67
+ supervise 3055 W 202,1 12858208 4096 1.00
+ flush:1-409 409 W 202,1 12939640 12288 18.00
+ [...]
+
+ This shows supervise doing various writes from PID 3055. The highest latency
+ was from jbd2/xvda1-212, the journaling block device driver, doing
+ synchronous writes (TYPE = WS).
+
+
+ Options can be added to show the start time (-s) and end time (-t):
+
+ # ./iosnoop -ts
+ Tracing block I/O. Ctrl-C to end.
+ STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms
+ 5982800.302061 5982800.302679 supervise 1809 W 202,1 17039600 4096 0.62
+ 5982800.302423 5982800.302842 supervise 1809 W 202,1 17039608 4096 0.42
+ 5982800.304962 5982800.305446 supervise 1801 W 202,1 17039616 4096 0.48
+ 5982800.305250 5982800.305676 supervise 1801 W 202,1 17039624 4096 0.43
+ 5982800.308849 5982800.309452 supervise 1810 W 202,1 12862464 4096 0.60
+ 5982800.308856 5982800.309470 supervise 1806 W 202,1 17039632 4096 0.61
+ 5982800.309206 5982800.309740 supervise 1806 W 202,1 17039640 4096 0.53
+ 5982800.309211 5982800.309805 supervise 1810 W 202,1 12862472 4096 0.59
+ 5982800.309332 5982800.309953 supervise 1812 W 202,1 17039648 4096 0.62
+ 5982800.309676 5982800.310283 supervise 1812 W 202,1 17039656 4096 0.61
+ [...]
+
+ This is useful when gathering I/O event data for post-processing.
+
+
+ Now for matching on a single PID:
+
+ # ./iosnoop -p 1805
+ Tracing block I/O issued by PID 1805. Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ supervise 1805 W 202,1 17039648 4096 0.68
+ supervise 1805 W 202,1 17039672 4096 0.60
+ supervise 1805 W 202,1 17040040 4096 0.62
+ supervise 1805 W 202,1 17040056 4096 0.47
+ supervise 1805 W 202,1 17040624 4096 0.49
+ supervise 1805 W 202,1 17040632 4096 0.44
+ ^C
+ Ending tracing...
+
+ This option works by using an in-kernel filter for that PID on I/O issue. There
+ is also a "-n" option to match on process names; however, that currently does so
+ in user space, so it is less efficient.
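+
+ At the ftrace level, this kind of in-kernel filtering is an event filter on the
+ issue tracepoint; roughly (a sketch, assuming the standard debugfs mount --
+ iosnoop creates and removes its own filter for you):
+
+ # echo 'common_pid == 1805' > /sys/kernel/debug/tracing/events/block/block_rq_issue/filter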
+
+ I would say that this will generally identify the origin process, but there will
+ be an error margin. Depending on the file system, block I/O queueing, and I/O
+ subsystem, this could miss events that aren't issued in this PID context but are
+ related to this PID (eg, triggering a read readahead on the completion of
+ previous I/O. Again, whether this happens is up to the file system and storage
+ subsystem). You can try the -Q option for more reliable process identification.
+
+
+ The -Q option begins tracing on block I/O queue insert, instead of issue.
+ Here's before and after, while dd(1) writes a large file:
+
+ # ./iosnoop
+ Tracing block I/O. Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ dd 26983 WS 202,16 4064416 45056 16.70
+ dd 26983 WS 202,16 4064504 45056 16.72
+ dd 26983 WS 202,16 4064592 45056 16.74
+ dd 26983 WS 202,16 4064680 45056 16.75
+ cat 27031 WS 202,16 4064768 45056 16.56
+ cat 27031 WS 202,16 4064856 45056 16.46
+ cat 27031 WS 202,16 4064944 45056 16.40
+ gawk 27030 WS 202,16 4065032 45056 0.88
+ gawk 27030 WS 202,16 4065120 45056 1.01
+ gawk 27030 WS 202,16 4065208 45056 16.15
+ gawk 27030 WS 202,16 4065296 45056 16.16
+ gawk 27030 WS 202,16 4065384 45056 16.16
+ [...]
+
+ The output here shows the block I/O time from issue to completion (LATms),
+ which is largely representative of the device.
+
+ The process names and PIDs identify dd, cat, and gawk. By default iosnoop shows
+ who is on-CPU at time of block I/O issue, but these may not be the processes
+ that originated the I/O. In this case (having debugged it), the reason is that
+ processes such as cat and gawk are making hypervisor calls (this is a Xen
+ guest instance), eg, for memory operations, and during hypervisor processing a
+ queue of pending work is checked and dispatched. So cat and gawk were on-CPU
+ when the block device I/O was issued, but they didn't originate it.
+
+ Now the -Q option is used:
+
+ # ./iosnoop -Q
+ Tracing block I/O. Ctrl-C to end.
+ COMM PID TYPE DEV BLOCK BYTES LATms
+ kjournald 1217 WS 202,16 6132200 45056 141.12
+ kjournald 1217 WS 202,16 6132288 45056 141.10
+ kjournald 1217 WS 202,16 6132376 45056 141.10
+ kjournald 1217 WS 202,16 6132464 45056 141.11
+ kjournald 1217 WS 202,16 6132552 40960 141.11
+ dd 27718 WS 202,16 6132624 4096 0.18
+ flush:16-1279 1279 W 202,16 6132632 20480 0.52
+ flush:16-1279 1279 W 202,16 5940856 4096 0.50
+ flush:16-1279 1279 W 202,16 5949056 4096 0.52
+ flush:16-1279 1279 W 202,16 5957256 4096 0.54
+ flush:16-1279 1279 W 202,16 5965456 4096 0.56
+ flush:16-1279 1279 W 202,16 5973656 4096 0.58
+ flush:16-1279 1279 W 202,16 5981856 4096 0.60
+ flush:16-1279 1279 W 202,16 5990056 4096 0.63
+ [...]
+
+ This uses the block_rq_insert tracepoint as the starting point of I/O, instead
+ of block_rq_issue. This makes the following differences to columns and options:
+
+ - COMM: more likely to show the originating process.
+ - PID: more likely to show the originating process.
+ - LATms: shows the I/O time, including time spent on the block I/O queue.
+ - STARTs (not shown above): shows the time of queue insert, not I/O issue.
+ - -p PID: more likely to match the originating process.
+ - -n name: more likely to match the originating process.
+
+ The reason that this ftrace-based iosnoop does not just instrument both insert
+ and issue tracepoints is one of overhead. Even with buffering, iosnoop can
+ have difficulty under high load.
+
+
+ If I want to capture events for post-processing, I use the duration mode, which
+ not only lets me set the duration, but also uses buffering, which reduces the
+ overheads of tracing.
+
+ Capturing 5 seconds, with both start timestamps (-s) and end timestamps (-t):
+
+ # time ./iosnoop -ts 5 > out
+
+ real 0m5.566s
+ user 0m0.336s
+ sys 0m0.140s
+ # wc out
+ 27010 243072 2619744 out
+
+ This server is doing over 5,000 disk IOPS. Even with buffering, this did
+ consume a measurable amount of CPU to capture: 0.48 seconds of CPU time in
+ total. Note that the run took 5.57 seconds: this is 5 seconds for the capture,
+ followed by the CPU time for iosnoop to fetch and process the buffer.
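+
+ (That 0.48 seconds is just the sum of the user and sys times reported by
+ time(1) above: 0.336 s + 0.140 s.)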
+
+ Now tracing for 30 seconds:
+
+ # time ./iosnoop -ts 30 > out
+
+ real 0m31.207s
+ user 0m0.884s
+ sys 0m0.472s
+ # wc out
+ 64259 578313 6232898 out
+
+ Since it's the same server and workload, this should have over 150k events,
+ but only has 64k. The tracing buffer has overflowed, and events have been
+ dropped. If I really must capture this many events, I can either increase
+ the trace buffer size (it's the bufsize_kb setting in the script), or use
+ a different tracer (perf_events, SystemTap, ktap, etc.). If the IOPS rate is low
+ (eg, less than 5k), then unbuffered (no duration), despite the higher overheads,
+ may be sufficient, and will keep capturing events until Ctrl-C.
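+
+ One way to confirm that the buffer overflowed, assuming the standard ftrace
+ mount point, is to check the per-CPU ring buffer statistics, which include an
+ "overrun" count of overwritten events:
+
+ # cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats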
+
+
+ Here's an example of digging into the sequence of I/O to explain an outlier.
+ My randread program on an SSD server (which is an AWS EC2 instance) usually
+ experiences about 0.15 ms I/O latency, but there are some outliers as high as
+ 20 milliseconds. Here's an excerpt:
+
+ # ./iosnoop -ts > out
+ # more out
+ Tracing block I/O. Ctrl-C to end.
+ STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms
+ 6037559.121523 6037559.121685 randread 22341 R 202,32 29295416 8192 0.16
+ 6037559.121719 6037559.121874 randread 22341 R 202,16 27515304 8192 0.16
+ [...]
+ 6037595.999508 6037596.000051 supervise 1692 W 202,1 12862968 4096 0.54
+ 6037595.999513 6037596.000144 supervise 1687 W 202,1 17040160 4096 0.63
+ 6037595.999634 6037596.000309 supervise 1693 W 202,1 17040168 4096 0.68
+ 6037595.999937 6037596.000440 supervise 1693 W 202,1 17040176 4096 0.50
+ 6037596.000579 6037596.001192 supervise 1689 W 202,1 17040184 4096 0.61
+ 6037596.000826 6037596.001360 supervise 1689 W 202,1 17040192 4096 0.53
+ 6037595.998302 6037596.018133 randread 22341 R 202,32 954168 8192 20.03
+ 6037595.998303 6037596.018150 randread 22341 R 202,32 954200 8192 20.05
+ 6037596.018182 6037596.018347 randread 22341 R 202,32 18836600 8192 0.16
+ [...]
+
+ It's important to sort on the I/O completion time (ENDs). In this case it's
+ already in the correct order.
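+
+ (If it were not, something like "sort -n -k2,2 out" would reorder the -ts
+ output by the ENDs column; a rough approach that ignores the header lines.)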
+
+ So my 20 ms reads happened after a large group of supervise writes were
+ completed (I truncated dozens of supervise write lines to keep this example
+ short). Other latency outliers in this output file showed the same sequence:
+ slow reads after a batch of writes.
+
+ Note the I/O request timestamp (STARTs), which shows that these 20 ms reads were
+ issued before the supervise writes -- so they had been sitting on a queue. I've
+ debugged this type of issue many times before, but this one is different: those
+ writes were to a different device (202,1), so I would have assumed they would be
+ on different queues, and wouldn't interfere with each other. Somewhere in this
+ system (Xen guest) it looks like there is a shared queue. (Having just
+ discovered this using iosnoop, I can't yet tell you which queue, but I'd hope
+ that after identifying it there would be a way to tune its queueing behavior,
+ so that we can eliminate or reduce the severity of these outliers.)
+
+
+ Use -h to print the USAGE message:
+
+ # ./iosnoop -h
+ USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name]
+ [duration]
+ -d device # device string (eg, "202,1)
+ -i iotype # match type (eg, '*R*' for all reads)
+ -n name # process name to match on I/O issue
+ -p PID # PID to match on I/O issue
+ -Q # use queue insert as start time
+ -s # include start time of I/O (s)
+ -t # include completion time of I/O (s)
+ -h # this usage message
+ duration # duration seconds, and use buffers
+ eg,
+ iosnoop # watch block I/O live (unbuffered)
+ iosnoop 1 # trace 1 sec (buffered)
+ iosnoop -Q # include queueing time in LATms
+ iosnoop -ts # include start and end timestamps
+ iosnoop -i '*R*' # trace reads
+ iosnoop -p 91 # show I/O issued when PID 91 is on-CPU
+ iosnoop -Qp 91 # show I/O queued by PID 91, queue time
+
+ See the man page and example file for more info.