rabbit-slide-kou-rubykaigi-takeout-2021 2021.9.11.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 4e7819063a4ebcbedeb7a6ea8ffad85ca4e15a57ce9aa6dbcecd02f338610d7c
4
+ data.tar.gz: 6a9734a5a02321f2ce4c1e20676556c0c6efa9aa18649cfa263d1dd3832ce702
5
+ SHA512:
6
+ metadata.gz: a89c84882adff989423129df414b42cdc4e3f49bc61fb2651bb722105d1a4d06aa5271892cfcb790cc25a54cb6de2d11e1f335977fcf7ba59c436c97d2002eb8
7
+ data.tar.gz: 8a2d3ea6e38fda7c3972a8f31f5d5dd75c38ac94d852968133f27f1c99e24e0a7aa6a26155f2be330660a1012433310ca0d52bf3fe76395414cfa18b571f269c
data/.rabbit ADDED
@@ -0,0 +1 @@
1
+ --size=1920,1080 red-arrow.rab
data/README.rd ADDED
@@ -0,0 +1,51 @@
1
+ = Red Arrow - Ruby and Apache Arrow
2
+
3
+ To use Ruby widely for data processing, Apache Arrow support is important. We can do the following with Apache Arrow:
4
+
5
+ * Super fast large data interchange and processing
6
+ * Reading/writing data in several popular formats such as CSV and Apache Parquet
7
+ * Reading/writing partitioned large data on cloud storage such as Amazon S3
8
+
9
+ This talk describes the following:
10
+
11
+ * What is Apache Arrow
12
+ * How to use Apache Arrow with Ruby
13
+ * How to integrate with Ruby 3.0 features such as MemoryView and Ractor
14
+
15
+ == License
16
+
17
+ === Slide
18
+
19
+ CC BY-SA 4.0
20
+
21
+ Use the following for author attribution:
22
+
23
+ * Sutou Kouhei
24
+
25
+ ==== ClearCode Inc. logo
26
+
27
+ CC BY-SA 4.0
28
+
29
+ Author: ClearCode Inc.
30
+
31
+ It is used in the page header and on some pages of the slides.
32
+
33
+ == For author
34
+
35
+ === Show
36
+
37
+ rake
38
+
39
+ === Publish
40
+
41
+ rake publish
42
+
43
+ == For viewers
44
+
45
+ === Install
46
+
47
+ gem install rabbit-slide-kou-rubykaigi-takeout-2021
48
+
49
+ === Show
50
+
51
+ rabbit rabbit-slide-kou-rubykaigi-takeout-2021.gem
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "rabbit/task/slide"
2
+
3
+ # Edit ./config.yaml to customize metadata
4
+
5
+ spec = nil
6
+ Rabbit::Task::Slide.new do |task|
7
+ spec = task.spec
8
+ # spec.files += Dir.glob("doc/**/*.*")
9
+ spec.files += Dir.glob("images/**/*.*")
10
+ # spec.files -= Dir.glob("private/**/*.*")
11
+ spec.add_runtime_dependency("rabbit-theme-clear-code")
12
+ end
13
+
14
+ desc "Tag #{spec.version}"
15
+ task :tag do
16
+ sh("git", "tag", "-a", spec.version.to_s, "-m", "Publish #{spec.version}")
17
+ sh("git", "push", "--tags")
18
+ end
data/config.yaml ADDED
@@ -0,0 +1,24 @@
1
+ ---
2
+ id: rubykaigi-takeout-2021
3
+ base_name: red-arrow
4
+ tags:
5
+ - rabbit
6
+ - rubykaigi
7
+ - ruby
8
+ - apache_arrow
9
+ presentation_date: 2021-09-11
10
+ version: 2021.9.11.0
11
+ licenses:
12
+ - CC-BY-SA-4.0
13
+ slideshare_id:
14
+ speaker_deck_id:
15
+ ustream_id:
16
+ vimeo_id:
17
+ youtube_id:
18
+ author:
19
+ markup_language: :rd
20
+ name: Sutou Kouhei
21
+ email: kou@clear-code.com
22
+ rubygems_user: kou
23
+ slideshare_user: kou
24
+ speaker_deck_user:
Binary file
data/images/iris.png ADDED
Binary file
data/red-arrow.rab ADDED
@@ -0,0 +1,801 @@
1
+ = Red Arrow
2
+
3
+ : subtitle
4
+ ((*Ruby*)) and ((*Apache Arrow*))
5
+ : author
6
+ Sutou Kouhei
7
+ : institution
8
+ ClearCode Inc.
9
+ : content-source
10
+ RubyKaigi Takeout 2021
11
+ : date
12
+ 2021-09-11
13
+ : start-time
14
+ 2021-09-11T13:30:00+09:00
15
+ : end-time
16
+ 2021-09-11T13:55:00+09:00
17
+ : theme
18
+ .
19
+
20
+ = Sutou Kouhei\nA president Rubyist
21
+
22
+ The president of ClearCode Inc.\n
23
+ (('note:クリアコードの社長'))
24
+
25
+ # img
26
+ # src = images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png
27
+ # relative_height = 100
28
+ # reflect_ratio = 0.1
29
+
30
+ = Sutou Kouhei\nAn Apache Arrow contributor
31
+
32
+ * A member of the Apache Arrow PMC\n
33
+ (('note:PMC: Project Management Committee'))\n
34
+ (('note:Apache Arrowのプロジェクト管理委員会メンバー'))
35
+ * #2 by number of commits(('note:(コミット数2位)'))
36
+
37
+ # img
38
+ # src = images/apache-arrow-commits-kou.png
39
+ # relative_height = 120
40
+ # reflect_ratio = 0.1
41
+
42
+ = Sutou Kouhei\nThe pioneer in Ruby and Arrow
43
+
44
+ * The author of Red Arrow\n
45
+ (('note:Red Arrowの作者'))
46
+ * Red Arrow:
47
+ * The official Apache Arrow library for Ruby\n
48
+ (('note:公式のRuby用のApache Arrowライブラリー'))
49
+ * GObject Introspection based bindings\n
50
+ (('note:GObject Introspectionベースのバインディング'))
51
+ * Apache Arrow GLib is developed for Red Arrow\n
52
+ (('note:Red ArrowのためにApache Arrow GLibも開発'))
53
+
54
+ = GObject Introspection?
55
+
56
+ (('tag:center'))
57
+ (('tag:margin-bottom * -0.3'))
58
+ A way to implement bindings\n
59
+ (('note:バインディングの実装方法の1つ'))
60
+
61
+ # img
62
+ # src = https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2016/how-to-create-bindings-2016.pdf
63
+ # relative_height = 90
64
+
65
+ (('tag:center'))
66
+ (('note:((<URL:https://rubykaigi.org/2016/presentations/ktou.html>))'))
67
+
68
+ = Why do I work on Red Arrow?\n(('note:なぜRed Arrowの開発をしているか'))
69
+
70
+ * To use Ruby for data processing!\n
71
+ (('note:データ処理でRubyを使いたい!'))
72
+ * At least a part of data processing\n
73
+ (('note:データ処理の全部と言わず一部だけでも'))
74
+ * Results of my 5 years of work:\n
75
+ (('note:私のここ5年の仕事の成果'))
76
+ * We can use Ruby for some data processing!\n
77
+ (('note:いくつかのデータ処理でRubyを使える!'))
78
+
79
+ = Goal of this talk\n(('note:このトークのゴール'))
80
+
81
+ * You want to use Ruby\n
82
+ for some data processing\n
83
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
84
+ * You join Red Data Tools project\n
85
+ (('note:Red Data Toolsプロジェクトに参加する'))
86
+
87
+ = Red Data Tools project?
88
+
89
+ # blockquote
90
+
91
+ Red Data Tools is a project that provides data processing tools for Ruby
92
+
93
+ (('note:Red Data ToolsはRuby用のデータ処理ツールを提供するプロジェクト'))
94
+
95
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
96
+
97
+ = Data processing?
98
+
99
+ ... how?
100
+
101
+ = 0. Why do you want to process data?\n(('note:0. データ処理の目的を明らかにする'))
102
+
103
+ * What problem do you want to solve?\n
104
+ (('note:どんな問題を解決したい?'))
105
+ * What data is needed for it?\n
106
+ (('note:そのためにはどんなデータが必要?'))
107
+ * ...
108
+
109
+ No Red Arrow support in this area\n
110
+ (('note:このあたりにはRed Arrowを使えない'))
111
+
112
+ = 1. Collect data\n(('note:1. データ収集'))
113
+
114
+ * Where is the data?\n
115
+ (('note:データはどこにある?'))
116
+ * Where is the collected data stored?\n
117
+ (('note:集めたデータはどこに保存する?'))
118
+ * ...
119
+
120
+ Red Arrow has some support in this area\n
121
+ (('note:このあたりでは少しRed Arrowを使える'))
122
+
123
+ = Common dataset\n(('note:よく使われるデータセット'))
124
+
125
+ # rouge ruby
126
+
127
+ require "datasets"
128
+ Datasets::Iris.new
129
+ Datasets::PostalCodeJapan.new
130
+ Datasets::Wikipedia.new
131
+
132
+ (('note:((<Red Datasets|URL:https://github.com/red-data-tools/red-datasets>))'))\n
133
+ (('note:((<URL:https://github.com/red-data-tools/red-datasets>))'))
134
+
135
+ = Output: Local file\n(('note:出力先:ローカルファイル'))
136
+
137
+ # rouge ruby
138
+
139
+ require "datasets-arrow"
140
+ dataset = Datasets::PostalCodeJapan.new
141
+ dataset.to_arrow.save("codes.csv")
142
+ dataset.to_arrow.save("codes.arrow")
143
+
144
+ (('note:((<Red Datasets Arrow|URL:https://github.com/red-data-tools/red-datasets-arrow>))'))\n
145
+ (('note:((<URL:https://github.com/red-data-tools/red-datasets-arrow>))'))
146
+
147
+ = (({#save}))
148
+
149
+ * General serialization API for table data\n
150
+ (('note:テーブルデータ用の汎用シリアライズAPI'))
151
+ * Serializes to the specified format\n
152
+ (('note:指定したフォーマットにシリアライズ'))
153
+ * If you use a Red Arrow object for in-memory table data, you can serialize it to many formats! Cool!\n
154
+ (('note:メモリー上のテーブルデータをRed Arrowオブジェクトにするといろんなフォーマットにシリアライズできる!かっこいい!'))
155
+ * Extensible!\n
156
+ (('note:拡張可能!'))
157
+
158
+ = (({#save})): Implementation
159
+
160
+ # rouge ruby
161
+
162
+ module Arrow
163
+ class Table
164
+ def save(output)
165
+ saver = TableSaver.new(self, output)
166
+ saver.save
167
+ end
168
+ end
169
+ end
170
+
171
+ = (({#save})): Implementation
172
+
173
+ # rouge ruby
174
+
175
+ class Arrow::TableSaver
176
+ def save
177
+ format = detect_format(@output)
178
+ __send__("save_as_#{format}")
179
+ end
180
+ def save_as_csv
181
+ end
182
+ end
183
+
184
+ = (({#save})): Extend by Red Parquet
185
+
186
+ # rouge ruby
187
+
188
+ module Parquet::ArrowTableSavable
189
+ def save_as_parquet
190
+ end
191
+ Arrow::TableSaver.include(self)
192
+ end
193
+
194
+ (('note:Red Parquet is a subproject of Red Arrow'))\n
195
+ (('note:Red ParquetはRed Arrowのサブプロジェクト'))
196
+
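The slides above dispatch on the detected format with (({__send__("save_as_#{format}")})) and let another library add a format by including a module. The following is a minimal, self-contained sketch of that pattern in plain Ruby. (({MiniTable})), (({MiniTableSaver})) and (({JSONTableSavable})) are hypothetical names for illustration, not Red Arrow classes, and detecting the format from the file extension is an assumption.

```ruby
require "json"

# Hypothetical stand-in for Arrow::Table: column names plus rows.
MiniTable = Struct.new(:columns, :rows)

# Hypothetical stand-in for Arrow::TableSaver.
class MiniTableSaver
  def initialize(table, output)
    @table = table
    @output = output
  end

  # Pick the writer method from the detected format name.
  def save
    format = detect_format(@output)
    __send__("save_as_#{format}")
  end

  private

  # Assumption: the format is the file extension without the dot.
  def detect_format(output)
    File.extname(output).delete_prefix(".")
  end

  # Naive CSV writer for illustration only (no quoting or escaping).
  def save_as_csv
    lines = [@table.columns.join(",")]
    lines.concat(@table.rows.map { |row| row.join(",") })
    File.write(@output, lines.join("\n") + "\n")
  end
end

# An extension adds a format by mixing one more save_as_* method in.
module JSONTableSavable
  def save_as_json
    records = @table.rows.map { |row| @table.columns.zip(row).to_h }
    File.write(@output, JSON.generate(records))
  end
  MiniTableSaver.include(self)
end
```

With this in place, (({MiniTableSaver.new(table, "codes.json").save})) reaches (({save_as_json})) even though the saver class itself never mentions JSON, which is the point of the design: a new format costs one method and one (({include})).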
197
+ = (({#save})): Extended
198
+
199
+ # rouge ruby
200
+
201
+ require "datasets-arrow"
202
+ require "parquet"
203
+ dataset = Datasets::PostalCodeJapan.new
204
+ dataset.to_arrow.save("codes.parquet")
205
+
206
+ = Output: Online storage: Fluentd\n(('note:出力先:オンラインストレージ:Fluentd'))
207
+
208
+ * fluent-plugin-s3-arrow:
209
+ * Collect data by Fluentd\n
210
+ (('note:Fluentdでデータ収集'))
211
+ * Format data as Apache Parquet by ((*Red Arrow*))\n
212
+ (('note:((*Red Arrow*))でApache Parquet形式にデータを変換'))
213
+ * Store data to Amazon S3 by fluent-plugin-s3\n
214
+ (('note:fluent-plugin-s3でAmazon S3にデータを保存'))
215
+ * By @kanga33 at Speee/Red Data Tools\n
216
+ (('note:Speee/Red Data Toolsの香川さんが開発'))
217
+
218
+ (('note:((<URL:https://github.com/red-data-tools/fluent-plugin-s3-arrow/>))'))
219
+
220
+ = Output: Online storage: Red Arrow\n(('note:出力先:オンラインストレージ:Red Arrow'))
221
+
222
+ # rouge ruby
223
+
224
+ require "datasets-arrow"
225
+ require "arrow-dataset"
226
+ dataset = Datasets::PostalCodeJapan.new
227
+ url = URI("s3://mybucket/codes.parquet")
228
+ dataset.to_arrow.save(url)
229
+
230
+ (('Implementing...'))\n
231
+ (('note:実装中。。。'))
232
+
233
+ = (({#save})): Implementing...
234
+
235
+ # rouge ruby
236
+
237
+ class Arrow::TableSaver
238
+ def save
239
+ if @output.is_a?(URI)
240
+ __send__("save_to_uri")
241
+ else
242
+ __send__("save_to_file")
243
+ end
244
+ end
245
+ end
246
+
247
+ = Collect data w/ Red Arrow: Wrap up\n(('note:Red Arrowでデータ収集:まとめ'))
248
+
249
+ * Usable as serializer for common formats\n
250
+ (('note:よくあるフォーマットにシリアライズするツールとして使える'))
251
+ * Usable as writer to common locations\n
252
+ (('note:in the near future...'))\n
253
+ (('note:近いうちによくある出力先に書き出すツールとして使える'))
254
+
255
+ = 2. Read data\n(('note:2. データ読み込み'))
256
+
257
+ * What format is used?\n
258
+ (('note:どんなフォーマットで保存されている?'))
259
+ * Where is the collected data?\n
260
+ (('note:収集したデータはどこ?'))
261
+ * How large is the collected data?\n
262
+ (('note:データはどれくらい大きい?'))
263
+
264
+ = Format\n(('note:フォーマット'))
265
+
266
+ # rouge ruby
267
+
268
+ require "arrow"
269
+ table = Arrow::Table.load("data.csv")
270
+ table = Arrow::Table.load("data.json")
271
+ table = Arrow::Table.load("data.arrow")
272
+ table = Arrow::Table.load("data.orc")
273
+
274
+ = (({.load}))
275
+
276
+ * General deserialization API for table data\n
277
+ (('note:テーブルデータ用の汎用デシリアライズAPI'))
278
+ * Deserialize common formats\n
279
+ (('note:よく使われているフォーマットからデシリアライズ'))
280
+ * Extensible!\n
281
+ (('note:拡張可能!'))
282
+
283
+ = (({.load})): Implementation
284
+
285
+ # rouge ruby
286
+
287
+ module Arrow
288
+ def Table.load(input)
289
+ loader = TableLoader.new(self, input)
290
+ loader.load
291
+ end
292
+ end
293
+
294
+ = (({.load})): Implementation
295
+
296
+ # rouge ruby
297
+
298
+ class Arrow::TableLoader
299
+ def load
300
+ format = detect_format(@input)
301
+ __send__("load_as_#{format}")
302
+ end
303
+ def load_as_csv
304
+ end
305
+ end
306
+
307
+ = (({.load})): Extend by Red Parquet
308
+
309
+ # rouge ruby
310
+
311
+ module Parquet::ArrowTableLoadable
312
+ def load_as_parquet
313
+ end
314
+ Arrow::TableLoader.include(self)
315
+ end
316
+
317
+ (('note:Red Parquet is a subproject of Red Arrow'))\n
318
+ (('note:Red ParquetはRed Arrowのサブプロジェクト'))
319
+
320
+ = (({.load})): Extended
321
+
322
+ # rouge ruby
323
+
324
+ require "parquet"
325
+ table = Arrow::Table.load("data.parquet")
326
+
327
+ = (({.load})): More extensible
328
+
329
+ # rouge ruby
330
+
331
+ class Arrow::TableLoader
332
+ def load
333
+ if @input.is_a?(URI)
334
+ __send__("load_from_uri")
335
+ else
336
+ __send__("load_from_file")
337
+ end
338
+ end
339
+ end
340
+
341
+ = (({.load})): Extend by Red Arrow Dataset
342
+
343
+ # rouge ruby
344
+
345
+ module ArrowDataset::ArrowTableLoadable
346
+ def load_from_uri
347
+ end
348
+ Arrow::TableLoader.include(self)
349
+ end
350
+
351
+ (('note:Red Arrow Dataset is a subproject of Red Arrow'))\n
352
+ (('note:Red Arrow DatasetはRed Arrowのサブプロジェクト'))
353
+
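The loader slides above branch on whether the input is a URI and let a separate library supply the URI branch. A minimal sketch of that dispatch, using only Ruby's standard (({uri})) library, might look like this. (({MiniTableLoader})) and (({URITableLoadable})) are hypothetical names, and the methods return a description of what would be loaded instead of reading any data.

```ruby
require "uri"

# Hypothetical stand-in for Arrow::TableLoader.
class MiniTableLoader
  def initialize(input)
    @input = input
  end

  # URI inputs go to load_from_uri, everything else is treated as a path.
  def load
    if @input.is_a?(URI::Generic)
      load_from_uri
    else
      load_from_file
    end
  end

  private

  # Assumption: the format is the file extension without the dot.
  def load_from_file
    format = File.extname(@input.to_s).delete_prefix(".")
    [:file, format, @input.to_s]
  end
end

# A separate library contributes the URI branch by including a module.
module URITableLoadable
  def load_from_uri
    [:uri, @input.scheme, @input.path]
  end
  MiniTableLoader.include(self)
end

MiniTableLoader.new("data.csv").load
# => [:file, "csv", "data.csv"]
MiniTableLoader.new(URI("s3://bucket/data.parquet")).load
# => [:uri, "s3", "/data.parquet"]
```

Without the module, a URI input would raise (({NameError})) because (({load_from_uri})) is undefined, which mirrors how remote locations only work once the extension library is required.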
354
+ = Location: Online storage\n(('note:場所:オンラインストレージ'))
355
+
356
+ # rouge ruby
357
+
358
+ require "arrow-dataset"
359
+ url = URI("s3://bucket/path...")
360
+ table = Arrow::Table.load(url)
361
+
362
+ = Location: RDBMS\n(('note:場所:RDBMS'))
363
+
364
+ # rouge ruby
365
+
366
+ require "arrow-activerecord"
367
+ User.all.to_arrow
368
+
369
+ (('note:((<Red Arrow Active Record|URL:https://github.com/red-data-tools/red-arrow-activerecord>))'))\n
370
+ (('note:((<URL:https://github.com/red-data-tools/red-arrow-activerecord>))'))
371
+
372
+ = Location: Network\n(('note:場所:ネットワーク'))
373
+
374
+ # rouge ruby
375
+
376
+ require "arrow-flight"
377
+ client = ArrowFlight::Client.new(url)
378
+ info = client.list_flights[0]
379
+ reader = client.do_get(info.endpoints[0].ticket)
380
+ table = reader.read_all
381
+
382
+ (('note:((<Introducing Apache Arrow Flight: A Framework for Fast Data Transport|URL:https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))\n
383
+ (('note:((<URL:https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/>))'))
384
+
385
+ = Large data\n(('note:大規模データ'))
386
+
387
+ * Apache Arrow format
388
+ * Designed for large data\n
389
+ (('note:大規模データ用に設計されている'))
390
+ * For large data\n
391
+ (('note:大規模データ用に必要なもの'))
392
+ * Fast load\n
393
+ (('note:高速にロードできること'))
394
+ * ...
395
+
396
+ = Fast load: Benchmark\n(('note:高速ロード:ベンチマーク'))
397
+
398
+ # rouge ruby
399
+
400
+ require "datasets-arrow"
401
+ dataset = Datasets::PostalCodeJapan.new
402
+ table = dataset.to_arrow # 124271 records
403
+ n = 5
404
+ n.times do |i|
405
+ table.save("codes.#{i}.csv")
406
+ table.save("codes.#{i}.arrow")
407
+ CSV.read("codes.#{i}.csv")
408
+ Arrow::Table.load("codes.#{i}.csv")
409
+ Arrow::Table.load("codes.#{i}.arrow")
410
+ table = table.concatenate([table])
411
+ end
412
+
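The benchmark above doubles the table on every iteration because (({table.concatenate([table])})) appends the table to itself, but the slide omits the timing code. One way to time such a loop is sketched below with a monotonic clock. The data is a small hypothetical stand-in built from plain arrays, and the measured work is a naive comma split, not Red Arrow's loader.

```ruby
# Measure the wall-clock time of a block with a monotonic clock.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Run n rounds, doubling the rows after each round.
def run_doubling_benchmark(rows, n)
  results = []
  n.times do |i|
    text = rows.map { |row| row.join(",") }.join("\n")
    # The measured work: parse the text back into rows.
    elapsed = measure do
      text.each_line.map { |line| line.chomp.split(",") }
    end
    results << { n: i + 1, n_rows: rows.size, elapsed: elapsed }
    rows += rows # same effect as table.concatenate([table])
  end
  results
end

# Hypothetical sample data standing in for the postal code records.
sample = Array.new(1_000) { |i| [i.to_s, "prefecture-#{i % 47}"] }
run_doubling_benchmark(sample, 5).each do |result|
  puts format("N=%d rows=%d %.6fs", result[:n], result[:n_rows], result[:elapsed])
end
```

Because the row count doubles each round, a loader whose cost is proportional to the data size should show elapsed time roughly doubling from one N to the next.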
413
+ = Fast load: Benchmark: All\n(('note:高速ロード:ベンチマーク:すべて'))
414
+
415
+ # charty
416
+ # backend = pyplot
417
+ # type = line
418
+ # x = N (times)
419
+ # y = Elapsed time (sec)
420
+ # color = Approach
421
+ # markers = true
422
+ # relative_height = 100
423
+ Approach,N (times),Elapsed time (sec)
424
+ Apache Arrow,1,0.000437
425
+ Apache Arrow,2,0.000421
426
+ Apache Arrow,3,0.000472
427
+ Apache Arrow,4,0.000573
428
+ Apache Arrow,5,0.000899
429
+ CSV: Red Arrow,1,0.012443
430
+ CSV: Red Arrow,2,0.021403
431
+ CSV: Red Arrow,3,0.040435
432
+ CSV: Red Arrow,4,0.074629
433
+ CSV: Red Arrow,5,0.138448
434
+ CSV: Ruby,1,0.828678
435
+ CSV: Ruby,2,1.840314
436
+ CSV: Ruby,3,3.797536
437
+ CSV: Ruby,4,8.205680
438
+ CSV: Ruby,5,19.850910
439
+
440
+ == Slide properties
441
+
442
+ : enable-title-on-image
443
+ false
444
+
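The gap between the three series is hard to read from a line chart with a linear axis, so it can help to compute the ratios directly. The sketch below copies the elapsed times from the chart data above and divides them pairwise. (({LOAD_ELAPSED})) and (({load_speedups})) are names introduced here for illustration.

```ruby
# Elapsed seconds copied from the chart data, indexed by N - 1.
LOAD_ELAPSED = {
  "Apache Arrow"   => [0.000437, 0.000421, 0.000472, 0.000573, 0.000899],
  "CSV: Red Arrow" => [0.012443, 0.021403, 0.040435, 0.074629, 0.138448],
  "CSV: Ruby"      => [0.828678, 1.840314, 3.797536, 8.205680, 19.850910],
}.freeze

# How many times slower `slow` is than `fast`, for each N.
def load_speedups(slow, fast)
  LOAD_ELAPSED.fetch(slow).zip(LOAD_ELAPSED.fetch(fast)).map do |s, f|
    (s / f).round(1)
  end
end

p load_speedups("CSV: Ruby", "Apache Arrow")
p load_speedups("CSV: Red Arrow", "Apache Arrow")
```

At N = 5 the ratios come out at roughly 22,000 times for Ruby's CSV reader and roughly 150 times for Red Arrow's CSV reader, both relative to loading the Apache Arrow file.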
445
+ = Fast load: Benchmark: Red Arrow\n(('note:高速ロード:ベンチマーク:Red Arrow'))
446
+
447
+ # charty
448
+ # backend = pyplot
449
+ # type = line
450
+ # x = N (times)
451
+ # y = Elapsed time (sec)
452
+ # color = Approach
453
+ # markers = true
454
+ # relative_height = 100
455
+ Approach,N (times),Elapsed time (sec)
456
+ Apache Arrow,1,0.000437
457
+ Apache Arrow,2,0.000421
458
+ Apache Arrow,3,0.000472
459
+ Apache Arrow,4,0.000573
460
+ Apache Arrow,5,0.000899
461
+ CSV: Red Arrow,1,0.012443
462
+ CSV: Red Arrow,2,0.021403
463
+ CSV: Red Arrow,3,0.040435
464
+ CSV: Red Arrow,4,0.074629
465
+ CSV: Red Arrow,5,0.138448
466
+
467
+ == Slide properties
468
+
469
+ : enable-title-on-image
470
+ false
471
+
472
+ = How to implement fast load\n(('note:高速ロードの実装方法'))
473
+
474
+ # img
475
+ # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf
476
+ # relative_height = 80
477
+
478
+ (('tag:center'))
479
+ (('note:((<URL:https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/>))'))
480
+
481
+ = Read data with Red Arrow: Wrap up\n(('note:Red Arrowでデータ読み込み:まとめ'))
482
+
483
+ * Easy to read common formats\n
484
+ (('note:よくあるフォーマットのデータを簡単に読める'))
485
+ * Easy to read from common locations\n
486
+ (('note:よくある場所にあるデータを簡単に読める'))
487
+ * Large data ready\n
488
+ (('note:大規模データも扱える'))
489
+
490
+ = 3. Explore data\n(('note:3. データ探索'))
491
+
492
+ * Preprocess data(('note:(データを前処理)'))
493
+ * Filter out needless data(('note:(不要なデータを除去)'))
494
+ * ...
495
+ * Summarize data and visualize them\n
496
+ (('note:(データを要約して可視化)'))
497
+ * ...
498
+
499
+ Red Arrow can be used for some operations\n
500
+ (('note:いくつかの操作でRed Arrowを使える'))
501
+
502
+ = Filter: Red Arrow\n(('note:絞り込み:Red Arrow'))
503
+
504
+ # rouge ruby
505
+
506
+ table = Datasets::PostalCodeJapan.new.to_arrow
507
+ table.n_rows # 124271
508
+ filtered_table = table.slice do |slicer|
509
+ slicer.prefecture == "東京都" # Tokyo
510
+ end
511
+ filtered_table.n_rows # 3887
512
+
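The block form of (({slice})) reads like a row-by-row filter, but the comparison is applied to a whole column at once. The sketch below shows one way such an API could be built in plain Ruby: the comparison yields one boolean per row, and the table keeps the rows marked true. All class names are hypothetical, and Red Arrow's real implementation differs (it evaluates the condition in Apache Arrow's native code).

```ruby
# Hypothetical column-oriented table: column name => array of values.
class MiniColumnTable
  attr_reader :columns

  def initialize(columns)
    @columns = columns
  end

  def n_rows
    @columns.values.first.size
  end

  # Yield a slicer, expect a boolean mask back, keep the marked rows.
  def slice
    mask = yield(MiniSlicer.new(self))
    filtered = @columns.transform_values do |values|
      values.select.with_index { |_, i| mask[i] }
    end
    self.class.new(filtered)
  end
end

# Maps `slicer.prefecture` to a condition object for that column.
class MiniSlicer
  def initialize(table)
    @table = table
  end

  def method_missing(name, *args)
    return super unless @table.columns.key?(name)
    MiniColumnCondition.new(@table.columns[name])
  end

  def respond_to_missing?(name, include_private = false)
    @table.columns.key?(name) || super
  end
end

# Comparing a column with a value returns one boolean per row.
class MiniColumnCondition
  def initialize(values)
    @values = values
  end

  def ==(other)
    @values.map { |value| value == other }
  end
end

table = MiniColumnTable.new(
  code: ["1000001", "5300001", "1000002"],
  prefecture: ["Tokyo", "Osaka", "Tokyo"]
)
filtered = table.slice { |slicer| slicer.prefecture == "Tokyo" }
filtered.n_rows # => 2
```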
513
+ = Filter: Performance\n(('note:絞り込み:性能'))
514
+
515
+ # rouge ruby
516
+
517
+ dataset = Datasets::PostalCodeJapan.new
518
+ arrow_dataset = dataset.to_arrow
519
+ dataset.find_all do |row|
520
+ row.prefecture == "東京都" # Tokyo
521
+ end # 1.256s
522
+ arrow_dataset.slice do |slicer|
523
+ slicer.prefecture == "東京都" # Tokyo
524
+ end # 0.001s
525
+
526
+ = Filter: Performance\n(('note:絞り込み:性能'))
527
+
528
+ # charty
529
+ # backend = pyplot
530
+ # type = bar
531
+ # x = Elapsed time (sec)
532
+ # y = Implementation
533
+ # relative_height = 100
534
+ Implementation,Elapsed time (sec)
535
+ Ruby,1.2567864
536
+ Arrow,0.001395
537
+
538
+ == Slide properties
539
+
540
+ : enable-title-on-image
541
+ false
542
+
543
+ = Apache Arrow data: Interchangeable\n(('note:Apache Arrow data:交換可能'))
544
+
545
+ * With low cost thanks to fast load\n
546
+ (('note:高速ロードできるので低コスト'))
547
+ * More and more systems support Apache Arrow data\n
548
+ (('note:Apache Arrowデータを扱えるシステムは増加中'))
549
+ * e.g. DuckDB: in-process SQL OLAP DBMS\n
550
+ (('note:(SQLite like DBMS for OLAP)'))\n
551
+ (('note:OLAP: OnLine Analytical Processing'))\n
552
+ (('note:例:DuckDB:同一プロセス内で動くデータ分析用SQL DB管理システム'))
553
+
554
+ = Filter: DuckDB\n(('note:絞り込み:DuckDB'))
555
+
556
+ # rouge ruby
557
+
558
+ require "arrow-duckdb"
559
+ codes = Datasets::PostalCodeJapan.new.to_arrow
560
+ db = DuckDB::Database.open
561
+ c = db.connect
562
+ c.register("codes", codes) do # Use codes without copy
563
+ c.query("SELECT * FROM codes WHERE prefecture = ?",
564
+ "東京都", # Tokyo
565
+ output: :arrow) # Output as Apache Arrow data
566
+ .to_table.n_rows # 3887
567
+ end
568
+
569
+ = Summarize: Group + aggregation\n(('note:要約:グループ化して集計'))
570
+
571
+ # rouge ruby
572
+
573
+ iris = Datasets::Iris.new.to_arrow
574
+ iris.group(:label).count(:sepal_length)
575
+ # count(sepal_length) label
576
+ # 0 50 Iris-setosa
577
+ # 1 50 Iris-versicolor
578
+ # 2 50 Iris-virginica
579
+
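For readers new to group-and-aggregate operations, the result above corresponds to the following plain-Ruby computation on an array of records. The sample records are a hypothetical miniature, not the real Iris data, and treating (({count})) as counting only non-nil values is an assumption made for this sketch.

```ruby
# Hypothetical miniature data set: one hash per record.
IRIS_SAMPLE = [
  { label: "Iris-setosa",     sepal_length: 5.1 },
  { label: "Iris-setosa",     sepal_length: 4.9 },
  { label: "Iris-versicolor", sepal_length: 7.0 },
  { label: "Iris-virginica",  sepal_length: nil },
  { label: "Iris-virginica",  sepal_length: 6.3 },
].freeze

# Group records by `key`, then count non-nil `target` values per group.
def group_count(records, key, target)
  records
    .group_by { |record| record[key] }
    .transform_values do |group|
      group.count { |record| !record[target].nil? }
    end
end

group_count(IRIS_SAMPLE, :label, :sepal_length)
# => {"Iris-setosa"=>2, "Iris-versicolor"=>1, "Iris-virginica"=>1}
```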
580
+ = Visualize: Charty\n(('note:可視化:Charty'))
581
+
582
+ # rouge ruby
583
+
584
+ require "charty"
585
+ Charty.backends.use("pyplot")
586
+ Charty.scatter_plot(data: iris,
587
+ x: :sepal_length,
588
+ y: :sepal_width,
589
+ color: :label)
590
+ .save("iris.png")
591
+
592
+ = Visualize: Charty: Result\n(('note:可視化:Charty:結果'))
593
+
594
+ # img
595
+ # src = images/iris.png
596
+ # relative_height = 100
597
+
598
+ == Slide properties
599
+
600
+ : enable-title-on-image
601
+ false
602
+
603
+ = 4. Use insight\n(('note:4. 知見を活用'))
604
+
605
+ * Write report\n(('note:(レポートにまとめたり)'))
606
+ * Build a model\n(('note:(モデルを作ったり)'))
607
+ * ...
608
+
609
+ No Red Arrow support in this area for now\n
610
+ (('note:Can be used for passing data to other tools like DuckDB and Charty'))\n
611
+ (('note:今のところこのあたりにはRed Arrowを使えない'))\n
612
+ (('note:DuckDBやChartyにデータを渡すように他のツールにデータを渡すためには使える'))
613
+
614
+ = Data processing and Red Arrow\n(('note:Red Arrowでデータ処理'))
615
+
616
+ * Red Arrow helps us in some areas\n
617
+ (('note:いくつかの領域ではRed Arrowを使える'))
618
+ * Collect, read and explore data\n
619
+ (('note:データを収集して読み込んで探索するとか'))
620
+ * Some tools can integrate with Red Arrow\n
621
+ (('note:いくつかのツールはRed Arrowと連携できる'))
622
+ * Fluentd, DuckDB, Charty, ...
623
+
624
+ = Red Arrow and Ruby 3.0
625
+
626
+ * MemoryView support
627
+ * Ractor support
628
+
629
+ = MemoryView
630
+
631
+ # blockquote
632
+
633
+ MemoryView provides the features to share multidimensional homogeneous arrays of fixed-size element on memory among extension libraries.
634
+
635
+ (('note:MemoryViewは多次元数値配列(数値はすべて同じ型)を共有する機能を提供します。'))
636
+
637
+ (('note:((<URL:https://docs.ruby-lang.org/en/master/doc/memory_view_md.html>))'))\n
638
+ (('note:((<URL:https://tech.speee.jp/entry/2020/12/24/093131>)) (Japanese)'))
639
+
640
+ = Numeric arrays in Red Arrow\n(('note:Red Arrow内の数値配列'))
641
+
642
+ * (({Arrow::NumericArray})) family
643
+ * 1-dimensional numeric array\n
644
+ (('note:1次元数値配列'))
645
+ * (({Arrow::Tensor}))
646
+ * Multidimensional homogeneous numeric arrays\n
647
+ (('note:多次元数値配列'))
648
+
649
+ = MemoryView: Red Arrow
650
+
651
+ * (({Arrow::NumericArray})) family
652
+ * Export as MemoryView: Supported\n
653
+ (('note:MemoryViewとしてエクスポート:対応済み'))
654
+ * Import from MemoryView: Not yet\n
655
+ (('note:MemoryViewをインポート:未対応'))
656
+ * (({Arrow::Tensor}))
657
+ * Export/Import: Not yet\n
658
+ (('note:エクスポート・インポート:未対応'))
659
+
660
+ (('note:Join Red Data Tools to work on this!'))\n
661
+ (('note:対応を進めたい人はRed Data Toolsに来てね!'))
662
+
663
+ = MemoryView: C++
664
+
665
+ * This work uncovered some problems\n
666
+ (('note:Red Arrowの対応作業でいくつかの問題が見つかった'))
667
+ * Can't use (({private})) as member name\n
668
+ (('note:メンバー名に(({private}))を使えない'))
669
+ * Can't assign to (({const})) variable with cast\n
670
+ (('note:キャストしても(({const}))変数に代入できない'))
671
+ * Ruby 3.1 will fix them\n
672
+ (('note:Ruby 3.1では直っているはず'))
673
+
674
+ = Ractor
675
+
676
+ # blockquote
677
+
678
+ Ractor is designed to provide a parallel execution feature of Ruby without thread-safety concerns.
679
+
680
+ (('note:Ractorはスレッドセーフかどうかを気にせずに並列実行するための機能です。'))
681
+
682
+ (('note:((<URL:https://docs.ruby-lang.org/en/master/doc/ractor_md.html>))'))\n
683
+ (('note:((<URL:https://techlife.cookpad.com/entry/2020/12/26/131858>)) (Japanese)'))
684
+
685
+ = Red Arrow and concurrency\n(('note:Red Arrowと並列性'))
686
+
687
+ * Red Arrow data are immutable\n
688
+ (('note:Red Arrowデータは変更不可'))
689
+ * Ractor can share frozen objects\n
690
+ (('note:Ractorはfrozenなオブジェクトを共有可能'))
691
+
692
+ = Ractor: Red Arrow
693
+
694
+ # rouge ruby
695
+
696
+ require "datasets-arrow"
697
+ table = Datasets::PostalCodeJapan.new.to_arrow
698
+ Ractor.make_shareable(table)
699
+ Ractor.new(table) do |t|
700
+ t.slice do |slicer|
701
+ slicer.prefecture == "東京都" # Tokyo
702
+ end
703
+ end
704
+
705
+ = Ractor: Red Arrow: Benchmark
706
+
707
+ # rouge ruby
708
+
709
+ n_ractors = 4
710
+ n_jobs_per_ractor = 1000
711
+ n_jobs = n_ractors * n_jobs_per_ractor
712
+ n_jobs.times do
713
+ table.slice {|s| s.prefecture == "東京都"}
714
+ end
715
+ n_ractors.times.collect do
716
+ Ractor.new(table, n_jobs_per_ractor) do |t, n|
717
+ n.times {t.slice {|s| s.prefecture == "東京都"}}
718
+ end
719
+ end.each(&:take)
720
+
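The same sharing pattern can be tried without Red Arrow installed. The sketch below deep-freezes plain Ruby data with (({Ractor.make_shareable})) and hands it to several Ractors, which is a hypothetical stand-in for the immutable table above. The (({respond_to?(:value)})) check is there because newer Ruby versions replace (({Ractor#take})) with (({Ractor#value})).

```ruby
# Hypothetical stand-in for the table: deeply frozen plain Ruby data.
SHAREABLE_ROWS = Ractor.make_shareable([
  { code: "1000001", prefecture: "Tokyo" },
  { code: "5300001", prefecture: "Osaka" },
  { code: "1000002", prefecture: "Tokyo" },
])

# Count matching rows in n_ractors Ractors and collect each result.
def count_in_ractors(rows, n_ractors, prefecture)
  ractors = Array.new(n_ractors) do
    # Shareable objects are passed by reference, not copied.
    Ractor.new(rows, prefecture) do |shared_rows, target|
      shared_rows.count { |row| row[:prefecture] == target }
    end
  end
  ractors.map do |ractor|
    ractor.respond_to?(:value) ? ractor.value : ractor.take
  end
end

p count_in_ractors(SHAREABLE_ROWS, 2, "Tokyo")
```

Each Ractor works on the same frozen rows, so no copy of the data is made per worker. Ruby prints a warning that Ractor is experimental when the first one is created.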
721
+ = Ractor: Red Arrow: Benchmark
722
+
723
+ # charty
724
+ # backend = pyplot
725
+ # type = bar
726
+ # x = Elapsed time (sec)
727
+ # y = Approach
728
+ # relative_height = 100
729
+ Approach,Elapsed time (sec)
730
+ Sequential,4.573742
731
+ Ractor,1.454987
732
+
733
+ == Slide properties
734
+
735
+ : enable-title-on-image
736
+ false
737
+
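Both approaches in the benchmark run the same 4000 filter jobs, so the two bars can be turned into a speedup and a per-Ractor efficiency. The short calculation below uses the elapsed times from the chart data above. (({parallel_stats})) is a helper name introduced here.

```ruby
# Speedup and efficiency of a parallel run against a sequential run.
def parallel_stats(sequential_sec, parallel_sec, n_workers)
  speedup = sequential_sec / parallel_sec
  { speedup: speedup, efficiency: speedup / n_workers }
end

# Elapsed seconds from the chart data; the benchmark used 4 Ractors.
stats = parallel_stats(4.573742, 1.454987, 4)
puts format("speedup: %.2fx, efficiency: %.0f%%",
            stats[:speedup], stats[:efficiency] * 100)
```

This works out to a speedup of about 3.1 times with 4 Ractors, or a little under 80% of the ideal linear scaling.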
738
+ = Wrap up\n(('note:まとめ'))
739
+
740
+ * Ruby can be used\n
741
+ in some data processing work\n
742
+ (('note:いくつかのデータ処理作業にRubyを使える'))
743
+ * Red Arrow helps you!\n
744
+ (('note:Red Arrowが有用なケースがあるはず!'))
745
+ * Ruby 3.0 has useful features for data processing work\n
746
+ (('note:Ruby 3.0にはデータ処理作業に有用な機能があるよ'))
747
+ * Red Arrow has started supporting them\n
748
+ (('note:Red Arrowはそれらのサポートを進めている'))
749
+
750
+ = Goal of this talk\n(('note:このトークのゴール'))
751
+
752
+ * You want to use Ruby\n
753
+ for some data processing\n
754
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
755
+ * You join Red Data Tools project\n
756
+ (('note:あなたがRed Data Toolsプロジェクトに参加する'))
757
+
758
+ = Future work\n(('note:今後の仕事'))
759
+
760
+ * Implement DataFusion bindings by adding C API to DataFusion\n
761
+ (('note:DataFusionにC APIを追加してバインディングを実装'))
762
+ * DataFusion: Apache Arrow native query execution framework written in Rust\n
763
+ (('note:((<URL:https://github.com/apache/arrow-datafusion/>))'))\n
764
+ (('note:DataFusion:Rust実装のApache Arrowベースのクエリー実行フレームワーク'))
765
+ * Add Active Record like API to Red Arrow\n
766
+ (('note:Red ArrowにActive Record風のAPIを追加'))
767
+ * Improve MemoryView/Ractor support\n
768
+ (('note:MemoryView/Ractorサポートを進める'))
769
+
770
+ = Red Data Tools
771
+
772
+ (('tag:center'))
773
+ (('tag:x-large'))
774
+ Join us!
775
+
776
+ (('note:((<URL:https://red-data-tools.github.io/>))'))\n
777
+ (('note:((<URL:https://gitter.im/red-data-tools/en>))'))
778
+
779
+ (('note:((<URL:https://red-data-tools.github.io/ja/>))'))\n
780
+ (('note:((<URL:https://gitter.im/red-data-tools/ja>))'))
781
+
782
+ = OSS Gate on-boarding\n(('note:OSS Gateオンボーディング'))
783
+
784
+ * Helps OSS projects such as Ruby & Red Arrow accept newcomers\n
785
+ (('note:RubyやRed ArrowといったOSSプロジェクトが新人を受け入れることを支援'))
786
+ * Contact me!(('note:興味がある人は私に教えて!'))
787
+ * (('tag:x-small'))OSS project members who want to accept newcomers\n
788
+ (('note:新人を受け入れたいOSSプロジェクトのメンバー'))
789
+ * (('tag:x-small'))Companies which want to support OSS Gate on-boarding\n
790
+ (('note:OSS Gateオンボーディングを支援したい会社'))
791
+
792
+ (('note:((<URL:https://oss-gate.github.io/on-boarding/>))'))
793
+
794
+ = ClearCode Inc.
795
+
796
+ * Recruitment: Developers to work on Red Arrow-related business\n
797
+ (('note:採用情報:Red Arrow関連のビジネスをする開発者'))
798
+ * (('note:((<URL:https://www.clear-code.com/recruitment/>))'))
799
+ * Business: Apache Arrow/Red Arrow related technical support/consulting:\n
800
+ (('note:仕事:Apache Arrow/Red Arrow関連の技術サポート・コンサルティング'))
801
+ * (('note:((<URL:https://www.clear-code.com/contact/>))'))
data/theme.rb ADDED
@@ -0,0 +1 @@
1
+ include_theme("clear-code")
metadata ADDED
@@ -0,0 +1,92 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rabbit-slide-kou-rubykaigi-takeout-2021
3
+ version: !ruby/object:Gem::Version
4
+ version: 2021.9.11.0
5
+ platform: ruby
6
+ authors:
7
+ - Sutou Kouhei
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2021-08-23 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rabbit
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: 2.0.2
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: 2.0.2
27
+ - !ruby/object:Gem::Dependency
28
+ name: rabbit-theme-clear-code
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ description: |-
42
+ To use Ruby widely for data processing, Apache Arrow support is important. We can do the following with Apache Arrow:
43
+
44
+ * Super fast large data interchange and processing
45
+ * Reading/writing data in several famous formats such as CSV and Apache Parquet
46
+ * Reading/writing partitioned large data on cloud storage such as Amazon S3
47
+
48
+ This talk describes the following:
49
+
50
+ * What is Apache Arrow
51
+ * How to use Apache Arrow with Ruby
52
+ * How to integrate with Ruby 3.0 features such as MemoryView and Ractor
53
+ email:
54
+ - kou@clear-code.com
55
+ executables: []
56
+ extensions: []
57
+ extra_rdoc_files: []
58
+ files:
59
+ - ".rabbit"
60
+ - README.rd
61
+ - Rakefile
62
+ - config.yaml
63
+ - images/apache-arrow-commits-kou.png
64
+ - images/clear-code-rubykaigi-takeout-2021-gold-sponsor.png
65
+ - images/iris.png
66
+ - pdf/rubykaigi-takeout-2021-red-arrow.pdf
67
+ - red-arrow.rab
68
+ - theme.rb
69
+ homepage: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-takeout-2021/
70
+ licenses:
71
+ - CC-BY-SA-4.0
72
+ metadata: {}
73
+ post_install_message:
74
+ rdoc_options: []
75
+ require_paths:
76
+ - lib
77
+ required_ruby_version: !ruby/object:Gem::Requirement
78
+ requirements:
79
+ - - ">="
80
+ - !ruby/object:Gem::Version
81
+ version: '0'
82
+ required_rubygems_version: !ruby/object:Gem::Requirement
83
+ requirements:
84
+ - - ">="
85
+ - !ruby/object:Gem::Version
86
+ version: '0'
87
+ requirements: []
88
+ rubygems_version: 3.3.0.dev
89
+ signing_key:
90
+ specification_version: 4
91
+ summary: Red Arrow - Ruby and Apache Arrow
92
+ test_files: []