rabbit-slide-kou-rubykaigi-takeout-2021 2021.9.11.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 4e7819063a4ebcbedeb7a6ea8ffad85ca4e15a57ce9aa6dbcecd02f338610d7c
4
+ data.tar.gz: 6a9734a5a02321f2ce4c1e20676556c0c6efa9aa18649cfa263d1dd3832ce702
5
+ SHA512:
6
+ metadata.gz: a89c84882adff989423129df414b42cdc4e3f49bc61fb2651bb722105d1a4d06aa5271892cfcb790cc25a54cb6de2d11e1f335977fcf7ba59c436c97d2002eb8
7
+ data.tar.gz: 8a2d3ea6e38fda7c3972a8f31f5d5dd75c38ac94d852968133f27f1c99e24e0a7aa6a26155f2be330660a1012433310ca0d52bf3fe76395414cfa18b571f269c
data/.rabbit ADDED
@@ -0,0 +1 @@
1
+ --size=1920,1080 red-arrow.rab
data/README.rd ADDED
@@ -0,0 +1,51 @@
1
+ = Red Arrow - Ruby and Apache Arrow
2
+
3
+ To use Ruby widely for data processing, Apache Arrow support is important. We can do the following with Apache Arrow:
4
+
5
+ * Super fast large data interchange and processing
6
+ * Reading/writing data in several famous formats such as CSV and Apache Parquet
7
+ * Reading/writing partitioned large data on cloud storage such as Amazon S3
8
+
9
+ This talk describes the following:
10
+
11
+ * What is Apache Arrow
12
+ * How to use Apache Arrow with Ruby
13
+ * How to integrate with Ruby 3.0 features such as MemoryView and Ractor
14
+
15
+ == License
16
+
17
+ === Slide
18
+
19
+ CC BY-SA 4.0
20
+
21
+ Use the following to credit the author:
22
+
23
+ * Sutou Kouhei
24
+
25
+ ==== ClearCode Inc. logo
26
+
27
+ CC BY-SA 4.0
28
+
29
+ Author: ClearCode Inc.
30
+
31
+ It is used in the page header and on some pages of the slides.
32
+
33
+ == For author
34
+
35
+ === Show
36
+
37
+ rake
38
+
39
+ === Publish
40
+
41
+ rake publish
42
+
43
+ == For viewers
44
+
45
+ === Install
46
+
47
+ gem install rabbit-slide-kou-rubykaigi-takeout-2021
48
+
49
+ === Show
50
+
51
+ rabbit rabbit-slide-kou-rubykaigi-takeout-2021.gem
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "rabbit/task/slide"
2
+
3
+ # Edit ./config.yaml to customize meta data
4
+
5
+ spec = nil
6
+ Rabbit::Task::Slide.new do |task|
7
+ spec = task.spec
8
+ # spec.files += Dir.glob("doc/**/*.*")
9
+ spec.files += Dir.glob("images/**/*.*")
10
+ # spec.files -= Dir.glob("private/**/*.*")
11
+ spec.add_runtime_dependency("rabbit-theme-clear-code")
12
+ end
13
+
14
+ desc "Tag #{spec.version}"
15
+ task :tag do
16
+ sh("git", "tag", "-a", spec.version.to_s, "-m", "Publish #{spec.version}")
17
+ sh("git", "push", "--tags")
18
+ end
data/config.yaml ADDED
@@ -0,0 +1,24 @@
1
+ ---
2
+ id: rubykaigi-takeout-2021
3
+ base_name: red-arrow
4
+ tags:
5
+ - rabbit
6
+ - rubykaigi
7
+ - ruby
8
+ - apache_arrow
9
+ presentation_date: 2021-09-11
10
+ version: 2021.9.11.0
11
+ licenses:
12
+ - CC-BY-SA-4.0
13
+ slideshare_id:
14
+ speaker_deck_id:
15
+ ustream_id:
16
+ vimeo_id:
17
+ youtube_id:
18
+ author:
19
+ markup_language: :rd
20
+ name: Sutou Kouhei
21
+ email: kou@clear-code.com
22
+ rubygems_user: kou
23
+ slideshare_user: kou
24
+ speaker_deck_user:
Binary file
data/images/iris.png ADDED
Binary file
data/red-arrow.rab ADDED
@@ -0,0 +1,801 @@
1
+ = Red Arrow
2
+
3
+ : subtitle
4
+ ((*Ruby*)) and ((*Apache Arrow*))
5
+ : author
6
+ Sutou Kouhei
7
+ : institution
8
+ ClearCode Inc.
9
+ : content-source
10
+ RubyKaigi Takeout 2021
11
+ : date
12
+ 2021-09-11
13
+ : start-time
14
+ 2021-09-11T13:30:00+09:00
15
+ : end-time
16
+ 2021-09-11T13:55:00+09:00
17
+ : theme
18
+ .
19
+
20
+ = Sutou Kouhei\nA president Rubyist
21
+
22
+ The president of ClearCode Inc.\n
23
+ (('note:クリアコードの社長'))
24
+
25
+ # img
26
+ # src = images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png
27
+ # relative_height = 100
28
+ # reflect_ratio = 0.1
29
+
30
+ = Sutou Kouhei\nAn Apache Arrow contributor
31
+
32
+ * A member of PMC of Apache Arrow\n
33
+ (('note:PMC: Project Management Committee'))\n
34
+ (('note:Apache Arrowのプロジェクト管理委員会メンバー'))
35
+ * #2 commits(('note:(コミット数2位)'))
36
+
37
+ # img
38
+ # src = images/apache-arrow-commits-kou.png
39
+ # relative_height = 120
40
+ # reflect_ratio = 0.1
41
+
42
+ = Sutou Kouhei\nThe pioneer in Ruby and Arrow
43
+
44
+ * The author of Red Arrow\n
45
+ (('note:Red Arrowの作者'))
46
+ * Red Arrow:
47
+ * The official Apache Arrow library for Ruby\n
48
+ (('note:公式のRuby用のApache Arrowライブラリー'))
49
+ * GObject Introspection based bindings\n
50
+ (('note:GObject Introspectionベースのバインディング'))
51
+ * Apache Arrow GLib is developed for Red Arrow\n
52
+ (('note:Red ArrowのためにApache Arrow GLibも開発'))
53
+
54
+ = GObject Introspection?
55
+
56
+ (('tag:center'))
57
+ (('tag:margin-bottom * -0.3'))
58
+ A way to implement bindings\n
59
+ (('note:バインディングの実装方法の1つ'))
60
+
61
+ # img
62
+ # src = https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2016/how-to-create-bindings-2016.pdf
63
+ # relative_height = 90
64
+
65
+ (('tag:center'))
66
+ (('note:((<URL:https://rubykaigi.org/2016/presentations/ktou.html>))'))
67
+
68
+ = Why do I work on Red Arrow?\n(('note:なぜRed Arrowの開発をしているか'))
69
+
70
+ * To use Ruby for data processing!\n
71
+ (('note:データ処理でRubyを使いたい!'))
72
+ * At least a part of data processing\n
73
+ (('note:データ処理の全部と言わず一部だけでも'))
74
+ * Results of my 5 years of work:\n
75
+ (('note:私のここ5年の仕事の成果'))
76
+ * We can use Ruby for some data processing!\n
77
+ (('note:いくつかのデータ処理でRubyを使える!'))
78
+
79
+ = Goal of this talk\n(('note:このトークのゴール'))
80
+
81
+ * You want to use Ruby\n
82
+ for some data processing\n
83
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
84
+ * You join Red Data Tools project\n
85
+ (('note:Red Data Toolsプロジェクトに参加する'))
86
+
87
+ = Red Data Tools project?
88
+
89
+ # blockquote
90
+
91
+ Red Data Tools is a project that provides data processing tools for Ruby
92
+
93
+ (('note:Red Data ToolsはRuby用のデータ処理ツールを提供するプロジェクト'))
94
+
95
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
96
+
97
+ = Data processing?
98
+
99
+ ... how?
100
+
101
+ = 0. Why do you want?\n(('note:0. データ処理の目的を明らかにする'))
102
+
103
+ * What problem do you want to solve?\n
104
+ (('note:どんな問題を解決したい?'))
105
+ * What data is needed for it?\n
106
+ (('note:そのためにはどんなデータが必要?'))
107
+ * ...
108
+
109
+ No Red Arrow support in this area\n
110
+ (('note:このあたりにはRed Arrowを使えない'))
111
+
112
+ = 1. Collect data\n(('note:1. データ収集'))
113
+
114
+ * Where is the data?\n
115
+ (('note:データはどこにある?'))
116
+ * Where is the collected data stored?\n
117
+ (('note:集めたデータはどこに保存する?'))
118
+ * ...
119
+
120
+ Some Red Arrow support in this area\n
121
+ (('note:このあたりでは少しRed Arrowを使える'))
122
+
123
+ = Common dataset\n(('note:よく使われるデータセット'))
124
+
125
+ # rouge ruby
126
+
127
+ require "datasets"
128
+ Datasets::Iris.new
129
+ Datasets::PostalCodeJapan.new
130
+ Datasets::Wikipedia.new
131
+
132
+ (('note:((<Red Datasets|URL:https://github.com/red-data-tools/red-datasets>))'))\n
133
+ (('note:((<URL:https://github.com/red-data-tools/red-datasets>))'))
134
+
135
+ = Output: Local file\n(('note:出力先:ローカルファイル'))
136
+
137
+ # rouge ruby
138
+
139
+ require "datasets-arrow"
140
+ dataset = Datasets::PostalCodeJapan.new
141
+ dataset.to_arrow.save("codes.csv")
142
+ dataset.to_arrow.save("codes.arrow")
143
+
144
+ (('note:((<Red Datasets Arrow|URL:https://github.com/red-data-tools/red-datasets-arrow>))'))\n
145
+ (('note:((<URL:https://github.com/red-data-tools/red-datasets-arrow>))'))
146
+
147
+ = (({#save}))
148
+
149
+ * General serialize API for table data\n
150
+ (('note:テーブルデータ用の汎用シリアライズAPI'))
151
+ * Serialize as the specified format\n
152
+ (('note:指定したフォーマットにシリアライズ'))
153
+ * If you use Red Arrow object for in-memory table data, you can serialize to many formats! Cool!\n
154
+ (('note:メモリー上のテーブルデータをRed Arrowオブジェクトにするといろんなフォーマットにシリアライズできる!かっこいい!'))
155
+ * Extensible!\n
156
+ (('note:拡張可能!'))
157
+
158
+ = (({#save})): Implementation
159
+
160
+ # rouge ruby
161
+
162
+ module Arrow
163
+ class Table
164
+ def save(output)
165
+ saver = TableSaver.new(self, output)
166
+ saver.save
167
+ end
168
+ end
169
+ end
170
+
171
+ = (({#save})): Implementation
172
+
173
+ # rouge ruby
174
+
175
+ class Arrow::TableSaver
176
+ def save
177
+ format = detect_format(@output)
178
+ __send__("save_as_#{format}")
179
+ end
180
+ def save_as_csv
181
+ end
182
+ end
183
+
184
+ = (({#save})): Extend by Red Parquet
185
+
186
+ # rouge ruby
187
+
188
+ module Parquet::ArrowTableSavable
189
+ def save_as_parquet
190
+ end
191
+ Arrow::TableSaver.include(self)
192
+ end
193
+
194
+ (('note:Red Parquet is a subproject of Red Arrow'))\n
195
+ (('note:Red ParquetはRed Arrowのサブプロジェクト'))
196
+
197
+ = (({#save})): Extended
198
+
199
+ # rouge ruby
200
+
201
+ require "datasets-arrow"
202
+ require "parquet"
203
+ dataset = Datasets::PostalCodeJapan.new
204
+ dataset.to_arrow.save("codes.parquet")
205
+
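The dispatch-and-extend pattern on the `#save` slides can be sketched in plain Ruby. This is a minimal illustration, not Red Arrow's actual code: `SketchSaver` and `SketchTsvSavable` are hypothetical names, the format is guessed from the file extension only, and each `save_as_*` method returns a string instead of writing a file.

```ruby
# Hypothetical stand-in for Arrow::TableSaver; not the real Red Arrow class.
class SketchSaver
  def initialize(rows, output)
    @rows = rows      # array of arrays, one inner array per record
    @output = output  # output path; only its extension is used here
  end

  # Pick the format from the extension, then call save_as_<format>.
  def save
    format = detect_format(@output)
    method_name = "save_as_#{format}"
    unless respond_to?(method_name, true)
      raise ArgumentError, "unsupported format: #{format}"
    end
    __send__(method_name)
  end

  private

  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  # Returns the serialized text instead of writing it (assumption for brevity).
  def save_as_csv
    @rows.map { |row| row.join(",") }.join("\n")
  end
end

# A separate library adds a format by mixing in one more save_as_* method.
module SketchTsvSavable
  def save_as_tsv
    @rows.map { |row| row.join("\t") }.join("\n")
  end
  SketchSaver.include(self)
end

rows = [["a", 1], ["b", 2]]
csv_text = SketchSaver.new(rows, "out.csv").save
tsv_text = SketchSaver.new(rows, "out.tsv").save
```

Because the dispatcher only looks up a method by name, adding a format never requires editing the saver class itself, which is the property the Red Parquet slide relies on.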
206
+ = Output: Online storage: Fluentd\n(('note:出力先:オンラインストレージ:Fluentd'))
207
+
208
+ * fluent-plugin-s3-arrow:
209
+ * Collect data by Fluentd\n
210
+ (('note:Fluentdでデータ収集'))
211
+ * Format data as Apache Parquet by ((*Red Arrow*))\n
212
+ (('note:((*Red Arrow*))でApache Parquet形式にデータを変換'))
213
+ * Store data to Amazon S3 by fluent-plugin-s3\n
214
+ (('note:fluent-plugin-s3でAmazon S3にデータを保存'))
215
+ * By @kanga33 at Speee/Red Data Tools\n
216
+ (('note:Speee/Red Data Toolsの香川さんが開発'))
217
+
218
+ (('note:((<URL:https://github.com/red-data-tools/fluent-plugin-s3-arrow/>))'))
219
+
220
+ = Output: Online storage: Red Arrow\n(('note:出力先:オンラインストレージ:Red Arrow'))
221
+
222
+ # rouge ruby
223
+
224
+ require "datasets-arrow"
225
+ require "arrow-dataset"
226
+ dataset = Datasets::PostalCodeJapan.new
227
+ url = URI("s3://mybucket/codes.parquet")
228
+ dataset.to_arrow.save(url)
229
+
230
+ (('Implementing...'))\n
231
+ (('note:実装中。。。'))
232
+
233
+ = (({#save})): Implementing...
234
+
235
+ # rouge ruby
236
+
237
+ class Arrow::TableSaver
238
+ def save
239
+ if @output.is_a?(URI)
240
+ __send__("save_to_uri")
241
+ else
242
+ __send__("save_to_file")
243
+ end
244
+ end
245
+ end
246
+
247
+ = Collect data w/ Red Arrow: Wrap up\n(('note:Red Arrowでデータ収集:まとめ'))
248
+
249
+ * Usable as serializer for common formats\n
250
+ (('note:よくあるフォーマットにシリアライズするツールとして使える'))
251
+ * Usable as writer to common locations\n
252
+ (('note:in the near future...'))\n
253
+ (('note:近いうちによくある出力先に書き出すツールとして使える'))
254
+
255
+ = 2. Read data\n(('note:2. データ読み込み'))
256
+
257
+ * What format is used?\n
258
+ (('note:どんなフォーマットで保存されている?'))
259
+ * Where is the collected data?\n
260
+ (('note:収集したデータはどこ?'))
261
+ * How large is the collected data?\n
262
+ (('note:データはどれくらい大きい?'))
263
+
264
+ = Format\n(('note:フォーマット'))
265
+
266
+ # rouge ruby
267
+
268
+ require "arrow"
269
+ table = Arrow::Table.load("data.csv")
270
+ table = Arrow::Table.load("data.json")
271
+ table = Arrow::Table.load("data.arrow")
272
+ table = Arrow::Table.load("data.orc")
273
+
274
+ = (({.load}))
275
+
276
+ * General deserialize API for table data\n
277
+ (('note:テーブルデータ用の汎用デシリアライズAPI'))
278
+ * Deserialize common formats\n
279
+ (('note:よく使われているフォーマットからデシリアライズ'))
280
+ * Extensible!\n
281
+ (('note:拡張可能!'))
282
+
283
+ = (({.load})): Implementation
284
+
285
+ # rouge ruby
286
+
287
+ module Arrow
288
+ def Table.load(input)
289
+ loader = TableLoader.new(self, input)
290
+ loader.load
291
+ end
292
+ end
293
+
294
+ = (({.load})): Implementation
295
+
296
+ # rouge ruby
297
+
298
+ class Arrow::TableLoader
299
+ def load
300
+ format = detect_format(@input)
301
+ __send__("load_as_#{format}")
302
+ end
303
+ def load_as_csv
304
+ end
305
+ end
306
+
307
+ = (({.load})): Extend by Red Parquet
308
+
309
+ # rouge ruby
310
+
311
+ module Parquet::ArrowTableLoadable
312
+ def load_as_parquet
313
+ end
314
+ Arrow::TableLoader.include(self)
315
+ end
316
+
317
+ (('note:Red Parquet is a subproject of Red Arrow'))\n
318
+ (('note:Red ParquetはRed Arrowのサブプロジェクト'))
319
+
320
+ = (({.load})): Extended
321
+
322
+ # rouge ruby
323
+
324
+ require "parquet"
325
+ table = Arrow::Table.load("data.parquet")
326
+
327
+ = (({.load})): More extensible
328
+
329
+ # rouge ruby
330
+
331
+ class Arrow::TableLoader
332
+ def load
333
+ if @input.is_a?(URI)
334
+ __send__("load_from_uri")
335
+ else
336
+ __send__("load_from_file")
337
+ end
338
+ end
339
+ end
340
+
341
+ = (({.load})): Extend by Red Arrow Dataset
342
+
343
+ # rouge ruby
344
+
345
+ module ArrowDataset::ArrowTableLoadable
346
+ def load_from_uri
347
+ end
348
+ Arrow::TableLoader.include(self)
349
+ end
350
+
351
+ (('note:Red Arrow Dataset is a subproject of Red Arrow'))\n
352
+ (('note:Red Arrow DatasetはRed Arrowのサブプロジェクト'))
353
+
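The URI-versus-file branching shown on the `.load` slides can be sketched with only the standard library. `SketchLoader` and `SketchUriLoadable` are hypothetical names, and the loaders here only report which path was taken; nothing is actually read from disk or from S3.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader; not the real Red Arrow class.
class SketchLoader
  def initialize(input)
    @input = input
  end

  # URI objects and plain paths take different code paths.
  def load
    if @input.is_a?(URI)
      load_from_uri
    else
      load_from_file
    end
  end

  private

  # Placeholder: reports the path instead of reading the file.
  def load_from_file
    [:file, @input.to_s]
  end
end

# An add-on library supplies the URI branch by mixing in load_from_uri.
module SketchUriLoadable
  # Placeholder: reports scheme and path instead of fetching anything.
  def load_from_uri
    [:uri, @input.scheme, @input.path]
  end
  SketchLoader.include(self)
end

from_uri = SketchLoader.new(URI("s3://mybucket/codes.parquet")).load
from_file = SketchLoader.new("codes.csv").load
```

The `is_a?(URI)` check works because every object returned by `URI(...)` includes the `URI` module.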
354
+ = Location: Online storage\n(('note:場所:オンラインストレージ'))
355
+
356
+ # rouge ruby
357
+
358
+ require "arrow-dataset"
359
+ url = URI("s3://bucket/path...")
360
+ table = Arrow::Table.load(url)
361
+
362
+ = Location: RDBMS\n(('note:場所:RDBMS'))
363
+
364
+ # rouge ruby
365
+
366
+ require "arrow-activerecord"
367
+ User.all.to_arrow
368
+
369
+ (('note:((<Red Arrow Active Record|URL:https://github.com/red-data-tools/red-arrow-activerecord>))'))\n
370
+ (('note:((<URL:https://github.com/red-data-tools/red-arrow-activerecord>))'))
371
+
372
+ = Location: Network\n(('note:場所:ネットワーク'))
373
+
374
+ # rouge ruby
375
+
376
+ require "arrow-flight"
377
+ client = ArrowFlight::Client.new(url)
378
+ info = client.list_flights[0]
379
+ reader = client.do_get(info.endpoints[0].ticket)
380
+ table = reader.read_all
381
+
382
+ (('note:((<Introducing Apache Arrow Flight: A Framework for Fast Data Transport|URL:https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))\n
383
+ (('note:((<URL:https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))
384
+
385
+ = Large data\n(('note:大規模データ'))
386
+
387
+ * Apache Arrow format
388
+ * Designed for large data\n
389
+ (('note:大規模データ用に設計されている'))
390
+ * For large data\n
391
+ (('note:大規模データ用に必要なもの'))
392
+ * Fast load\n
393
+ (('note:高速にロードできること'))
394
+ * ...
395
+
396
+ = Fast load: Benchmark\n(('note:高速ロード:ベンチマーク'))
397
+
398
+ # rouge ruby
399
+
400
+ require "datasets-arrow"
401
+ dataset = Datasets::PostalCodeJapan.new
402
+ table = dataset.to_arrow # 124271 records
403
+ n = 5
404
+ n.times do |i|
405
+ table.save("codes.#{i}.csv")
406
+ table.save("codes.#{i}.arrow")
407
+ CSV.read("codes.#{i}.csv")
408
+ Arrow::Table.load("codes.#{i}.csv")
409
+ Arrow::Table.load("codes.#{i}.arrow")
410
+ table = table.concatenate([table])
411
+ end
412
+
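The benchmark slide lists the operations being compared but not how elapsed time is taken. One way to do it, sketched here under assumptions: `measure` is a hypothetical helper, and the workload is made-up string data that doubles each round, standing in for `table.concatenate([table])`. No Red Arrow calls are made.

```ruby
# Hypothetical helper: run the block and return [result, elapsed seconds].
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [result, elapsed]
end

# Made-up workload: the row count doubles after every round.
rows = Array.new(1_000) { |i| "row-#{i}" }
timings = 5.times.map do |i|
  joined, elapsed = measure { rows.join("\n") }
  record = {n: i + 1, rows: rows.size, bytes: joined.bytesize, seconds: elapsed}
  rows = rows + rows
  record
end
```

A monotonic clock is used so that system clock adjustments during the run do not distort the measured intervals.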
413
+ = Fast load: Benchmark: All\n(('note:高速ロード:ベンチマーク:すべて'))
414
+
415
+ # charty
416
+ # backend = pyplot
417
+ # type = line
418
+ # x = N (times)
419
+ # y = Elapsed time (sec)
420
+ # color = Approach
421
+ # markers = true
422
+ # relative_height = 100
423
+ Approach,N (times),Elapsed time (sec)
424
+ Apache Arrow,1,0.000437
425
+ Apache Arrow,2,0.000421
426
+ Apache Arrow,3,0.000472
427
+ Apache Arrow,4,0.000573
428
+ Apache Arrow,5,0.000899
429
+ CSV: Red Arrow,1,0.012443
430
+ CSV: Red Arrow,2,0.021403
431
+ CSV: Red Arrow,3,0.040435
432
+ CSV: Red Arrow,4,0.074629
433
+ CSV: Red Arrow,5,0.138448
434
+ CSV: Ruby,1,0.828678
435
+ CSV: Ruby,2,1.840314
436
+ CSV: Ruby,3,3.797536
437
+ CSV: Ruby,4,8.205680
438
+ CSV: Ruby,5,19.850910
439
+
440
+ == Slide properties
441
+
442
+ : enable-title-on-image
443
+ false
444
+
445
+ = Fast load: Benchmark: Red Arrow\n(('note:高速ロード:ベンチマーク:Red Arrow'))
446
+
447
+ # charty
448
+ # backend = pyplot
449
+ # type = line
450
+ # x = N (times)
451
+ # y = Elapsed time (sec)
452
+ # color = Approach
453
+ # markers = true
454
+ # relative_height = 100
455
+ Approach,N (times),Elapsed time (sec)
456
+ Apache Arrow,1,0.000437
457
+ Apache Arrow,2,0.000421
458
+ Apache Arrow,3,0.000472
459
+ Apache Arrow,4,0.000573
460
+ Apache Arrow,5,0.000899
461
+ CSV: Red Arrow,1,0.012443
462
+ CSV: Red Arrow,2,0.021403
463
+ CSV: Red Arrow,3,0.040435
464
+ CSV: Red Arrow,4,0.074629
465
+ CSV: Red Arrow,5,0.138448
466
+
467
+ == Slide properties
468
+
469
+ : enable-title-on-image
470
+ false
471
+
472
+ = How to implement fast load\n(('note:高速ロードの実装方法'))
473
+
474
+ # img
475
+ # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf
476
+ # relative_height = 80
477
+
478
+ (('tag:center'))
479
+ (('note:((<URL:https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/>))'))
480
+
481
+ = Read data with Red Arrow: Wrap up\n(('note:Red Arrowでデータ読み込み:まとめ'))
482
+
483
+ * Easy to read common formats\n
484
+ (('note:よくあるフォーマットのデータを簡単に読める'))
485
+ * Easy to read from common locations\n
486
+ (('note:よくある場所にあるデータを簡単に読める'))
487
+ * Large data ready\n
488
+ (('note:大規模データも扱える'))
489
+
490
+ = 3. Explore data\n(('note:3. データ探索'))
491
+
492
+ * Preprocess data(('note:(データを前処理)'))
493
+ * Filter out needless data(('note:(不要なデータを除去)'))
494
+ * ...
495
+ * Summarize data and visualize them\n
496
+ (('note:(データを要約して可視化)'))
497
+ * ...
498
+
499
+ Red Arrow can be used for some operations\n
500
+ (('note:いくつかの操作でRed Arrowを使える'))
501
+
502
+ = Filter: Red Arrow\n(('note:絞り込み:Red Arrow'))
503
+
504
+ # rouge ruby
505
+
506
+ table = Datasets::PostalCodeJapan.new.to_arrow
507
+ table.n_rows # 124271
508
+ filtered_table = table.slice do |slicer|
509
+ slicer.prefecture == "東京都" # Tokyo
510
+ end
511
+ filtered_table.n_rows # 3887
512
+
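Conceptually, the block passed to `slice` compares a whole column at once and the result selects rows. The sketch below shows that idea in plain Ruby; `MiniTable`, `Slicer`, and `ColumnCondition` are hypothetical, the sample values are made up and romanized, and Red Arrow's real implementation is not shown here.

```ruby
# Hypothetical column-oriented table; not the Red Arrow implementation.
class MiniTable
  attr_reader :columns

  def initialize(columns)
    @columns = columns # {name => array of values}, all arrays the same length
  end

  def n_rows
    @columns.values.first.size
  end

  # The block receives a slicer and returns one boolean per row (a mask).
  def slice
    mask = yield(Slicer.new(@columns))
    filtered = @columns.transform_values do |values|
      values.select.with_index { |_, i| mask[i] }
    end
    MiniTable.new(filtered)
  end

  # Exposes each column name as a method returning a comparable condition.
  class Slicer
    def initialize(columns)
      columns.each do |name, values|
        define_singleton_method(name) { ColumnCondition.new(values) }
      end
    end
  end

  class ColumnCondition
    def initialize(values)
      @values = values
    end

    # Comparing a column with a value yields one boolean per row.
    def ==(other)
      @values.map { |value| value == other }
    end
  end
end

# Made-up sample rows for illustration only.
table = MiniTable.new({
  code: ["1000001", "0600000", "1500001"],
  prefecture: ["Tokyo", "Hokkaido", "Tokyo"],
})
tokyo = table.slice { |slicer| slicer.prefecture == "Tokyo" }
```

Here `tokyo.n_rows` counts the rows whose mask entry is true, mirroring `filtered_table.n_rows` on the slide.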
513
+ = Filter: Performance\n(('note:絞り込み:性能'))
514
+
515
+ # rouge ruby
516
+
517
+ dataset = Datasets::PostalCodeJapan.new
518
+ arrow_dataset = dataset.to_arrow
519
+ dataset.find_all do |row|
520
+ row.prefecture == "東京都" # Tokyo
521
+ end # 1.256s
522
+ arrow_dataset.slice do |slicer|
523
+ slicer.prefecture == "東京都" # Tokyo
524
+ end # 0.001s
525
+
526
+ = Filter: Performance\n(('note:絞り込み:性能'))
527
+
528
+ # charty
529
+ # backend = pyplot
530
+ # type = bar
531
+ # x = Elapsed time (sec)
532
+ # y = Implementation
533
+ # relative_height = 100
534
+ Implementation,Elapsed time (sec)
535
+ Ruby,1.2567864
536
+ Arrow,0.001395
537
+
538
+ == Slide properties
539
+
540
+ : enable-title-on-image
541
+ false
542
+
543
+ = Apache Arrow data: Interchangeable\n(('note:Apache Arrow data:交換可能'))
544
+
545
+ * With low cost thanks to fast load\n
546
+ (('note:高速ロードできるので低コスト'))
547
+ * Apache Arrow data ready systems are increasing\n
548
+ (('note:Apache Arrowデータを扱えるシステムは増加中'))
549
+ * e.g. DuckDB: in-process SQL OLAP DBMS\n
550
+ (('note:(SQLite like DBMS for OLAP)'))\n
551
+ (('note:OLAP: OnLine Analytical Processing'))\n
552
+ (('note:例:DuckDB:同一プロセス内で動くデータ分析用SQL DB管理システム'))
553
+
554
+ = Filter: DuckDB\n(('note:絞り込み:DuckDB'))
555
+
556
+ # rouge ruby
557
+
558
+ require "arrow-duckdb"
559
+ codes = Datasets::PostalCodeJapan.new.to_arrow
560
+ db = DuckDB::Database.open
561
+ c = db.connect
562
+ c.register("codes", codes) do # Use codes without copy
563
+ c.query("SELECT * FROM codes WHERE prefecture = ?",
564
+ "東京都", # Tokyo
565
+ output: :arrow) # Output as Apache Arrow data
566
+ .to_table.n_rows # 3887
567
+ end
568
+
569
+ = Summarize: Group + aggregation\n(('note:要約:グループ化して集計'))
570
+
571
+ # rouge ruby
572
+
573
+ iris = Datasets::Iris.new.to_arrow
574
+ iris.group(:label).count(:sepal_length)
575
+ # count(sepal_length) label
576
+ # 0 50 Iris-setosa
577
+ # 1 50 Iris-versicolor
578
+ # 2 50 Iris-virginica
579
+
580
+ = Visualize: Charty\n(('note:可視化:Charty'))
581
+
582
+ # rouge ruby
583
+
584
+ require "charty"
585
+ Charty.backends.use("pyplot")
586
+ Charty.scatter_plot(data: iris,
587
+ x: :sepal_length,
588
+ y: :sepal_width,
589
+ color: :label)
590
+ .save("iris.png")
591
+
592
+ = Visualize: Charty: Result\n(('note:可視化:Charty:結果'))
593
+
594
+ # img
595
+ # src = images/iris.png
596
+ # relative_height = 100
597
+
598
+ == Slide properties
599
+
600
+ : enable-title-on-image
601
+ false
602
+
603
+ = 4. Use insight\n(('note:4. 知見を活用'))
604
+
605
+ * Write report\n(('note:(レポートにまとめたり)'))
606
+ * Build a model\n(('note:(モデルを作ったり)'))
607
+ * ...
608
+
609
+ No Red Arrow support in this area for now\n
610
+ (('note:Can be used for passing data to other tools like DuckDB and Charty'))\n
611
+ (('note:今のところこのあたりにはRed Arrowを使えない'))\n
612
+ (('note:DuckDBやChartyにデータを渡すように他のツールにデータを渡すためには使える'))
613
+
614
+ = Data processing and Red Arrow\n(('note:Red Arrowでデータ処理'))
615
+
616
+ * Red Arrow helps us in some areas\n
617
+ (('note:いくつかの領域ではRed Arrowを使える'))
618
+ * Collect, read and explore data\n
619
+ (('note:データを収集して読み込んで探索するとか'))
620
+ * Some tools can integrate with Red Arrow\n
621
+ (('note:いくつかのツールはRed Arrowと連携できる'))
622
+ * Fluentd, DuckDB, Charty, ...
623
+
624
+ = Red Arrow and Ruby 3.0
625
+
626
+ * MemoryView support
627
+ * Ractor support
628
+
629
+ = MemoryView
630
+
631
+ # blockquote
632
+
633
+ MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size element on memory among extension libraries.
634
+
635
+ (('note:MemoryViewは多次元数値配列(数値はすべて同じ型)を共有する機能を提供します。'))
636
+
637
+ (('note:((<URL:https://docs.ruby-lang.org/en/master/doc/memory_view_md.html>))'))\n
638
+ (('note:((<URL:https://tech.speee.jp/entry/2020/12/24/093131>)) (Japanese)'))
639
+
640
+ = Numeric arrays in Red Arrow\n(('note:Red Arrow内の数値配列'))
641
+
642
+ * (({Arrow::NumericArray})) family
643
+ * 1-dimensional numeric array\n
644
+ (('note:1次元数値配列'))
645
+ * (({Arrow::Tensor}))
646
+ * Multidimensional homogeneous numeric arrays\n
647
+ (('note:多次元数値配列'))
648
+
649
+ = MemoryView: Red Arrow
650
+
651
+ * (({Arrow::NumericArray})) family
652
+ * Export as MemoryView: Supported\n
653
+ (('note:MemoryViewとしてエクスポート:対応済み'))
654
+ * Import from MemoryView: Not yet\n
655
+ (('note:MemoryViewをインポート:未対応'))
656
+ * (({Arrow::Tensor}))
657
+ * Export/Import: Not yet\n
658
+ (('note:エクスポート・インポート:未対応'))
659
+
660
+ (('note:Join Red Data Tools to work on this!'))\n
661
+ (('note:対応を進めたい人はRed Data Toolsに来てね!'))
662
+
663
+ = MemoryView: C++
664
+
665
+ * Some problems are found by this work\n
666
+ (('note:Red Arrowの対応作業でいくつかの問題が見つかった'))
667
+ * Can't use (({private})) as member name\n
668
+ (('note:メンバー名に(({private}))を使えない'))
669
+ * Can't assign to (({const})) variable with cast\n
670
+ (('note:キャストしても(({const}))変数に代入できない'))
671
+ * Ruby 3.1 will fix them\n
672
+ (('note:Ruby 3.1では直っているはず'))
673
+
674
+ = Ractor
675
+
676
+ # blockquote
677
+
678
+ Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns.
679
+
680
+ (('note:Ractorはスレッドセーフかどうかを気にせずに並列実行するための機能です。'))
681
+
682
+ (('note:((<URL:https://docs.ruby-lang.org/en/master/doc/ractor_md.html>))'))\n
683
+ (('note:((<URL:https://techlife.cookpad.com/entry/2020/12/26/131858>)) (Japanese)'))
684
+
685
+ = Red Arrow and concurrency\n(('note:Red Arrowと並列性'))
686
+
687
+ * Red Arrow data are immutable\n
688
+ (('note:Red Arrowデータは変更不可'))
689
+ * Ractor can share frozen objects\n
690
+ (('note:Ractorはfrozenなオブジェクトを共有可能'))
691
+
692
+ = Ractor: Red Arrow
693
+
694
+ # rouge ruby
695
+
696
+ require "datasets-arrow"
697
+ table = Datasets::PostalCodeJapan.new.to_arrow
698
+ Ractor.make_shareable(table)
699
+ Ractor.new(table) do |t|
700
+ t.slice do |slicer|
701
+ slicer.prefecture == "東京都" # Tokyo
702
+ end
703
+ end
704
+
705
+ = Ractor: Red Arrow: Benchmark
706
+
707
+ # rouge ruby
708
+
709
+ n_ractors = 4
710
+ n_jobs_per_ractor = 1000
711
+ n_jobs = n_ractors * n_jobs_per_ractor
712
+ n_jobs.times do
713
+ table.slice {|s| s.prefecture == "東京都"}
714
+ end
715
+ n_ractors.times.collect do
716
+ Ractor.new(table, n_jobs_per_ractor) do |t, n|
717
+ n.times {t.slice {|s| s.prefecture == "東京都"}}
718
+ end
719
+ end.each(&:take)
720
+
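The same fan-out shape can be tried without Red Arrow by sharing a deeply frozen array across Ractors. This is a sketch under assumptions: the data is made up, the job counts are smaller than on the slide, and `ractor_result` is a hypothetical helper that falls back between `Ractor#value` and `Ractor#take`, since which of them is available depends on the Ruby version.

```ruby
# Made-up sample data standing in for the postal-code table.
prefectures = Ractor.make_shareable(["Tokyo", "Hokkaido", "Tokyo", "Osaka"] * 1_000)

n_ractors = 4
n_jobs_per_ractor = 10

# Hypothetical helper: use Ractor#value when it exists, otherwise Ractor#take.
def ractor_result(ractor)
  ractor.respond_to?(:value) ? ractor.value : ractor.take
end

ractors = n_ractors.times.map do
  # The shareable array is passed by reference, not copied.
  Ractor.new(prefectures, n_jobs_per_ractor) do |values, n|
    count = 0
    n.times { count = values.count { |v| v == "Tokyo" } }
    count
  end
end
counts = ractors.map { |r| ractor_result(r) }
```

Each Ractor receives its inputs as block arguments because a Ractor block cannot read local variables from the enclosing scope.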
721
+ = Ractor: Red Arrow: Benchmark
722
+
723
+ # charty
724
+ # backend = pyplot
725
+ # type = bar
726
+ # x = Elapsed time (sec)
727
+ # y = Approach
728
+ # relative_height = 100
729
+ Approach,Elapsed time (sec)
730
+ Sequential,4.573742
731
+ Ractor,1.454987
732
+
733
+ == Slide properties
734
+
735
+ : enable-title-on-image
736
+ false
737
+
738
+ = Wrap up\n(('note:まとめ'))
739
+
740
+ * Ruby can be used\n
741
+ in some data processing work\n
742
+ (('note:いくつかのデータ処理作業にRubyを使える'))
743
+ * Red Arrow helps you!\n
744
+ (('note:Red Arrowが有用なケースがあるはず!'))
745
+ * Ruby 3.0 has useful features for data processing work\n
746
+ (('note:Ruby 3.0にはデータ処理作業に有用な機能があるよ'))
747
+ * Red Arrow starts supporting them\n
748
+ (('note:Red Arrowはそれらのサポートを進めている'))
749
+
750
+ = Goal of this talk\n(('note:このトークのゴール'))
751
+
752
+ * You want to use Ruby\n
753
+ for some data processing\n
754
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
755
+ * You join Red Data Tools project\n
756
+ (('note:あなたがRed Data Toolsプロジェクトに参加する'))
757
+
758
+ = Future work\n(('note:今後の仕事'))
759
+
760
+ * Implement DataFusion bindings by adding C API to DataFusion\n
761
+ (('note:DataFusionにC APIを追加してバインディングを実装'))
762
+ * DataFusion: Apache Arrow native query execution framework written in Rust\n
763
+ (('note:((<URL:https://github.com/apache/arrow-datafusion/>))'))\n
764
+ (('note:DataFusion:Rust実装のApache Arrowベースのクエリー実行フレームワーク'))
765
+ * Add Active Record like API to Red Arrow\n
766
+ (('note:Red ArrowにActive Record風のAPIを追加'))
767
+ * Improve MemoryView/Ractor support\n
768
+ (('note:MemoryView/Ractorサポートを進める'))
769
+
770
+ = Red Data Tools
771
+
772
+ (('tag:center'))
773
+ (('tag:x-large'))
774
+ Join us!
775
+
776
+ (('note:((<URL:https://red-data-tools.github.io/>))'))\n
777
+ (('note:((<URL:https://gitter.im/red-data-tools/en>))'))
778
+
779
+ (('note:((<URL:https://red-data-tools.github.io/ja/>))'))\n
780
+ (('note:((<URL:https://gitter.im/red-data-tools/ja>))'))
781
+
782
+ = OSS Gate on-boarding\n(('note:OSS Gateオンボーディング'))
783
+
784
+ * Supports accepting newcomers by OSS projects such as Ruby & Red Arrow\n
785
+ (('note:RubyやRed ArrowといったOSSプロジェクトが新人を受け入れることを支援'))
786
+ * Contact me!(('note:興味がある人は私に教えて!'))
787
+ * (('tag:x-small'))OSS project members who want to accept newcomers\n
788
+ (('note:新人を受け入れたいOSSプロジェクトのメンバー'))
789
+ * (('tag:x-small'))Companies which want to support OSS Gate on-boarding\n
790
+ (('note:OSS Gateオンボーディングを支援したい会社'))
791
+
792
+ (('note:((<URL:https://oss-gate.github.io/on-boarding/>))'))
793
+
794
+ = ClearCode Inc.
795
+
796
+ * Recruitment: Developer to work on Red Arrow related business\n
797
+ (('note:採用情報:Red Arrow関連のビジネスをする開発者'))
798
+ * (('note:((<URL:https://www.clear-code.com/recruitment/>))'))
799
+ * Business: Apache Arrow/Red Arrow related technical support/consulting:\n
800
+ (('note:仕事:Apache Arrow/Red Arrow関連の技術サポート・コンサルティング'))
801
+ * (('note:((<URL:https://www.clear-code.com/contact/>))'))
data/theme.rb ADDED
@@ -0,0 +1 @@
1
+ include_theme("clear-code")
metadata ADDED
@@ -0,0 +1,92 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rabbit-slide-kou-rubykaigi-takeout-2021
3
+ version: !ruby/object:Gem::Version
4
+ version: 2021.9.11.0
5
+ platform: ruby
6
+ authors:
7
+ - Sutou Kouhei
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2021-08-23 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rabbit
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: 2.0.2
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: 2.0.2
27
+ - !ruby/object:Gem::Dependency
28
+ name: rabbit-theme-clear-code
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ description: |-
42
+ To use Ruby widely for data processing, Apache Arrow support is important. We can do the following with Apache Arrow:
43
+
44
+ * Super fast large data interchange and processing
45
+ * Reading/writing data in several famous formats such as CSV and Apache Parquet
46
+ * Reading/writing partitioned large data on cloud storage such as Amazon S3
47
+
48
+ This talk describes the following:
49
+
50
+ * What is Apache Arrow
51
+ * How to use Apache Arrow with Ruby
52
+ * How to integrate with Ruby 3.0 features such as MemoryView and Ractor
53
+ email:
54
+ - kou@clear-code.com
55
+ executables: []
56
+ extensions: []
57
+ extra_rdoc_files: []
58
+ files:
59
+ - ".rabbit"
60
+ - README.rd
61
+ - Rakefile
62
+ - config.yaml
63
+ - images/apache-arrow-commits-kou.png
64
+ - images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png
65
+ - images/iris.png
66
+ - pdf/rubykaigi-takeout-2021-red-arrow.pdf
67
+ - red-arrow.rab
68
+ - theme.rb
69
+ homepage: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-takeout-2021/
70
+ licenses:
71
+ - CC-BY-SA-4.0
72
+ metadata: {}
73
+ post_install_message:
74
+ rdoc_options: []
75
+ require_paths:
76
+ - lib
77
+ required_ruby_version: !ruby/object:Gem::Requirement
78
+ requirements:
79
+ - - ">="
80
+ - !ruby/object:Gem::Version
81
+ version: '0'
82
+ required_rubygems_version: !ruby/object:Gem::Requirement
83
+ requirements:
84
+ - - ">="
85
+ - !ruby/object:Gem::Version
86
+ version: '0'
87
+ requirements: []
88
+ rubygems_version: 3.3.0.dev
89
+ signing_key:
90
+ specification_version: 4
91
+ summary: Red Arrow - Ruby and Apache Arrow
92
+ test_files: []