rabbit-slide-kou-rubykaigi-2022 2022.9.10

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 1599f5092cdaeb9b59633f195c0b5392633b2cbab26c5fb98eb72cced8212be5
4
+ data.tar.gz: c2fad8b0605d89c25cbc3d22c48a908c663b558416d792ee32df9bdb02894072
5
+ SHA512:
6
+ metadata.gz: 2d8bf9c9a96d49f7fd2dfe8e54b8ddc94e0ad9d9ed844e59b62b85e3a89e02a5be98f121351cb41508f07921b6f6b51546a49fdf00b414c409a98ee9ba2e0d11
7
+ data.tar.gz: 65075214f316ffeb6faa536ee8526c1050d0b9ac4c0eb994aa34e23f846d0922e5cdb05ba63c6ff9a04226ca0e53d59967092035124baadcd48bcacd7bf31f54
data/.rabbit ADDED
@@ -0,0 +1,2 @@
1
+ --size 960,540
2
+ fast-data-processing-with-ruby-and-apache-arrow.rab
data/README.rd ADDED
@@ -0,0 +1,48 @@
1
+ = Fast data processing with Ruby and Apache Arrow
2
+
3
+ I introduced Ruby and Apache Arrow integration including the "super fast large data interchange and processing" Apache Arrow feature at RubyKaigi Takeout 2021.
4
+
5
+ This talk introduces how we can use the "super fast large data interchange and processing" Apache Arrow feature in Ruby. Here are some use cases:
6
+
7
+ * Fast data retrieval (fast (({pluck}))) from DB such as MySQL and PostgreSQL for batch processes in a Ruby on Rails application
8
+ * Fast data interchange with JavaScript for dynamic visualization in a Ruby on Rails application
9
+ * Fast OLAP with in-process DB such as DuckDB and Apache Arrow DataFusion in a Ruby on Rails application or irb session
10
+
11
+ == License
12
+
13
+ === Slide
14
+
15
+ CC BY-SA 4.0
16
+
17
+ Use the following to credit the author:
18
+
19
+ * Sutou Kouhei
20
+
21
+ ==== ClearCode Inc. logo
22
+
23
+ CC BY-SA 4.0
24
+
25
+ Author: ClearCode Inc.
26
+
27
+ It is used in the page header and on some pages of the slides.
28
+
29
+ == For author
30
+
31
+ === Show
32
+
33
+ rake
34
+
35
+ === Publish
36
+
37
+ rake publish
38
+
39
+ == For viewers
40
+
41
+ === Install
42
+
43
+ gem install rabbit-slide-kou-rubykaigi-2022
44
+
45
+ === Show
46
+
47
+ rabbit rabbit-slide-kou-rubykaigi-2022.gem
48
+
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ require "rabbit/task/slide"
2
+
3
+ # Edit ./config.yaml to customize meta data
4
+
5
+ spec = nil
6
+ Rabbit::Task::Slide.new do |task|
7
+ spec = task.spec
8
+ spec.files += Dir.glob("images/**/*.*")
9
+ # spec.files += Dir.glob("doc/**/*.*")
10
+ # spec.files -= Dir.glob("private/**/*.*")
11
+ spec.add_runtime_dependency("rabbit-theme-clear-code")
12
+ end
13
+
14
+ desc "Tag #{spec.version}"
15
+ task :tag do
16
+ sh("git", "tag", "-a", spec.version.to_s, "-m", "Publish #{spec.version}")
17
+ sh("git", "push", "--tags")
18
+ end
data/config.yaml ADDED
@@ -0,0 +1,24 @@
1
+ ---
2
+ id: rubykaigi-2022
3
+ base_name: fast-data-processing-with-ruby-and-apache-arrow
4
+ tags:
5
+ - rabbit
6
+ - rubykaigi
7
+ - ruby
8
+ - apache_arrow
9
+ presentation_date: 2022-09-10
10
+ version: 2022.9.10
11
+ licenses:
12
+ - CC-BY-SA-4.0
13
+ slideshare_id: rubykaigi-2022
14
+ speaker_deck_id:
15
+ vimeo_id:
16
+ youtube_id:
17
+ source_code_uri: "https://gitlab.com/ktou/rabbit-slide-kou-rubykaigi-2022"
18
+ author:
19
+ markup_language: :rd
20
+ name: Sutou Kouhei
21
+ email: kou@clear-code.com
22
+ rubygems_user: kou
23
+ slideshare_user: kou
24
+ speaker_deck_user:
data/fast-data-processing-with-ruby-and-apache-arrow.rab ADDED
@@ -0,0 +1,803 @@
1
+ = Fast data processing\n(('tag:small: with Ruby and Apache Arrow'))
2
+
3
+ : author
4
+ Sutou Kouhei
5
+ : institution
6
+ ClearCode Inc.
7
+ : content-source
8
+ RubyKaigi 2022
9
+ : date
10
+ 2022-09-10
11
+ : start-time
12
+ 2022-09-10T14:10:00+09:00
13
+ : end-time
14
+ 2022-09-10T14:40:00+09:00
15
+ : theme
16
+ .
17
+
18
+ = Sutou Kouhei\nA president Ruby committer
19
+
20
+ The president of ClearCode Inc.\n
21
+ (('note:クリアコードの社長'))
22
+
23
+ # img
24
+ # src = images/clear-code-rubykaigi-2022-silver-sponsor.png
25
+ # relative_height = 100
26
+ # reflect_ratio = 0.1
27
+
28
+ = Sutou Kouhei\nThe Apache Arrow PMC chair
29
+
30
+ * PMC: Project Management Committee\n
31
+ (('note:Apache Arrowのプロジェクト管理委員会のリーダー'))
32
+ * #2 commits(('note:(コミット数2位)'))
33
+
34
+ # img
35
+ # src = images/apache-arrow-commits-kou-with-mark.png
36
+ # relative_height = 120
37
+ # reflect_ratio = 0.1
38
+
39
+ = Sutou Kouhei\nThe pioneer in Ruby and Arrow
40
+
41
+ * A Ruby committer
42
+ * Maintain some standard libraries/default gems\n
43
+ (('note:標準ライブラリーとかデフォルトgemのメンテナンスをしている'))
44
+ * The author of Red Arrow
45
+ * Red Arrow:
46
+ * The official Apache Arrow library for Ruby\n
47
+ (('note:公式のRuby用のApache Arrowライブラリー'))
48
+
49
+ = Why do I work on Red Arrow?\n(('note:なぜRed Arrowの開発をしているか'))
50
+
51
+ * To use Ruby for data processing too!\n
52
+ (('note:データ処理でもRubyを使いたい!'))
53
+ * At least a part of data processing\n
54
+ (('note:データ処理の全部と言わず一部だけでも'))
55
+ * Data processing is an important task\n
56
+ (('note:データ処理は最近の重要なタスクの1つ'))
57
+ * The number of Rubyists will increase as a result\n
58
+ (('note:If Ruby becomes usable for data processing, the number of Rubyists should grow'))
59
+
60
+ = Current situation\n(('note:Negative spiral'))\n(('note:今は負のスパイラル'))
61
+
62
+ # mermaid
63
+ # relative_width = 90
64
+ graph LR;
65
+ A[Few users]-->B[Small community];
66
+ B-->C[Few developers];
67
+ C-->D[Few useful tools];
68
+ D-->A;
69
+
70
+ (('tag:margin-top * 4'))
71
+ (('tag:center'))
72
+ How to break the negative spiral?\n
73
+ (('note:どうやってこの負のスパイラルを打開する?'))
74
+
75
+ == Slide properties
76
+
77
+ : enable-title-on-image
78
+ false
79
+
80
+ = Expand useful tools\nwith few developers\n(('note:少人数で便利なツールを増やせればいいんじゃない?'))
81
+
82
+ # mermaid
83
+ # relative_width = 90
84
+ graph LR;
85
+ subgraph all[" "]
86
+ direction TB
87
+ subgraph Negative spiral
88
+ N0[Few users]-->N1[Small community];
89
+ N1-->N2(Few developers);
90
+ N2-->N3[Few useful tools];
91
+ N3-->N0;
92
+ end
93
+ subgraph Positive spiral
94
+ P0[More users]-->P1[Larger community];
95
+ P1-->P2[More developers];
96
+ P2-->P3(More useful tools);
97
+ P3-->P0;
98
+ end
99
+ N2-.->P3;
100
+ end
101
+ style all fill-opacity:0,stroke-width:0px
102
+ style N2 stroke-width:5px
103
+ style P3 stroke-width:5px
104
+
105
+ == Slide properties
106
+
107
+ : enable-title-on-image
108
+ false
109
+
110
+ = But how?\n(('note:でもどうやって?'))
111
+
112
+ Apache Arrow
113
+
114
+ = Apache Arrow
115
+
116
+ * ((*Cross-language*)) dev platform for data\n
117
+ (('note:複数言語対応のデータ用の開発プラットフォーム'))
118
+ * Ruby community doesn't need to dev everything\n
119
+ (('note:Rubyコミュニティーがすべてを開発しなくてもよい'))
120
+ * We can share common implementations\n
121
+ (('note:共通の実装を言語を超えて共有できる'))
122
+ * Today's highlighted features\n
123
+ (('note:今日注目する機能'))
124
+ * Fast data processing(('note:(高速データ処理)'))
125
+ * Fast data interchange(('note:(高速データ交換)'))
126
+
127
+ = My approach\n(('note:私のアプローチ'))
128
+
129
+ # mermaid
130
+ # relative_width = 90
131
+ graph LR;
132
+ subgraph all[" "]
133
+ direction TB
134
+ subgraph Negative spiral
135
+ N0[Few users]-->N1[Small community];
136
+ N1-->N2(Few developers);
137
+ N2-->N3[Few useful tools];
138
+ N3-->N0;
139
+ end
140
+ subgraph Positive spiral
141
+ P0(More users)-->P1[Larger community];
142
+ P1-->P2[More developers];
143
+ P2-->P3(More useful tools);
144
+ P3-->P0;
145
+ end
146
+ N2-. Apache Arrow .->P3;
147
+ end
148
+ style all fill-opacity:0,stroke-width:0px
149
+ style N2 stroke-width:5px
150
+ style P0 stroke-width:5px
151
+ style P3 stroke-width:5px
152
+
153
+ == Slide properties
154
+
155
+ : enable-title-on-image
156
+ false
157
+
158
+ = Goal of this talk\n(('note:このトークのゴール'))
159
+
160
+ # mermaid
161
+ # relative_width = 35
162
+ # align = right
163
+ # vertical-align = top
164
+ # relative-margin-right = -10
165
+ # relative-margin-top = -7
166
+ graph LR;
167
+ subgraph all[" "]
168
+ direction TB
169
+ subgraph Negative spiral
170
+ N0[Few users]-->N1[Small community];
171
+ N1-->N2(Few developers);
172
+ N2-->N3[Few useful tools];
173
+ N3-->N0;
174
+ end
175
+ subgraph Positive spiral
176
+ P0(More users)-->P1[Larger community];
177
+ P1-->P2[More developers];
178
+ P2-->P3(More useful tools);
179
+ P3-->P0;
180
+ end
181
+ N2-. Apache Arrow .->P3;
182
+ end
183
+ style all fill-opacity:0,stroke-width:0px
184
+ style N2 stroke-width:5px
185
+ style P0 stroke-width:5px
186
+ style P3 stroke-width:5px
187
+
188
+ * You want to use Ruby\n
189
+ for some data processing\n
190
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
191
+ * Especially, you want to implement a BI tool\n
192
+ (('note:特にBIツールを作りたくなる'))
193
+ * You join Red Data Tools project\n
194
+ (('note:Red Data Toolsプロジェクトに参加する'))
195
+ * It provides data processing tools for Ruby\n
196
+ (('note:Ruby用のデータ処理ツールを提供するプロジェクト'))\n
197
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
198
+
199
+ = Fast data processing\n(('note:高速データ処理'))
200
+
201
+ * Ruby is slow to process data\n
202
+ (('note:Rubyでデータを処理すると遅い'))
203
+ * Resolve in external process:(('note:(別プロセスで解決)'))\n
204
+ (('note:Use case: Web app, batch process for Web app'))
205
+ * Use fast data processing module (e.g.: DB)\n
206
+ (('note:DBとか速いデータ処理モジュールを使う'))
207
+ * Resolve in the same process:(('note:(プロセス内で解決)'))\n
208
+ (('note:Use cases: IRB, batch process for Web app'))
209
+ * Implement core features in other fast lang\n
210
+ (('note:他の速い言語でコアの機能を実装'))
211
+
212
+ = External process\n(('note:別プロセス'))
213
+
214
+ # mermaid
215
+ # relative_width = 35
216
+ # align = right
217
+ # vertical-align = top
218
+ # relative-margin-right = -10
219
+ # relative-margin-top = 0
220
+ sequenceDiagram
221
+ Ruby->>+External process: Request
222
+ Note right of External process: Fast data processing
223
+ External process-->>-Ruby: Response
224
+
225
+ * Popular case\n
226
+ in current Ruby usage\n
227
+ (('note:今のRubyの使われ方だとよくあるケース'))
228
+ * Small response: No problem\n
229
+ (('note:レスポンスが小さい場合は問題ない'))
230
+ * Large response:\n
231
+ (('note:レスポンスが大きい場合:'))
232
+ * Sending/receiving the response is slow\n
233
+ (('note:レスポンスの送信・受信処理が遅い'))
234
+
235
+ = Sending/receiving response\n(('note:レスポンスの送受信'))
236
+
237
+ # mermaid
238
+ # relative_width = 40
239
+ # align = right
240
+ # vertical-align = top
241
+ # relative-margin-right = -10
242
+ # relative-margin-top = 0
243
+ sequenceDiagram
244
+ participant Ruby
245
+ participant External process
246
+ Note right of External process: Serialize
247
+ External process-->>Ruby: Send
248
+ Note left of Ruby: Deserialize
249
+
250
+ * Serialization/deserialization\n
251
+ is slow\n
252
+ (('note:シリアライズ・デシリアライズが遅い'))
253
+ * How to speed them up?\n
254
+ (('note:どうやって高速化すればよいか'))
255
+ * Apache Arrow format
256
+ * Serialize/deserialize cost ≒ 0\n
257
+ (('note:シリアライズ・デシリアライズコストがほぼ0'))
258
+
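The claim that serialize/deserialize cost can approach zero can be illustrated with a minimal sketch that uses only Ruby's standard library. This is not Apache Arrow itself: the helper names `encode_rows_as_json`, `decode_rows_from_json`, `encode_column_as_buffer` and `read_value_at` are hypothetical and exist only for this illustration. The text path has to format and parse every value, while the packed path keeps one column in one contiguous buffer of fixed-width integers, so a reader can locate any value by offset.

```ruby
require "json"

# Hypothetical helper: row-oriented text encoding. Every value is
# formatted as text by the sender and parsed back by the receiver.
def encode_rows_as_json(values)
  JSON.generate(values.map { |value| {"count" => value} })
end

def decode_rows_from_json(payload)
  JSON.parse(payload).map { |row| row["count"] }
end

# Hypothetical helper: column-oriented binary encoding. All values sit
# in one contiguous buffer of little-endian signed 64-bit integers.
def encode_column_as_buffer(values)
  values.pack("q<*")
end

# Read the value at the given index by slicing the 8 bytes that start
# at index * 8. No other value in the buffer has to be decoded.
def read_value_at(buffer, index)
  buffer.byteslice(index * 8, 8).unpack1("q<")
end

counts = [10, 20, 30, 40]
json_payload = encode_rows_as_json(counts)
column_buffer = encode_column_as_buffer(counts)

puts(decode_rows_from_json(json_payload).inspect)
puts(read_value_at(column_buffer, 2))
puts(column_buffer.bytesize)
```

The real Apache Arrow format adds metadata, validity bitmaps and variable-width types on top of this idea, so treat the sketch as an analogy rather than a description of the format.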
259
+ = Why Apache Arrow format is fast\n(('note:Apache Arrowフォーマットはなぜ速いのか'))
260
+
261
+ # img
262
+ # src = https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/why-apache-arrow-format-is-fast.pdf
263
+ # relative_height = 80
264
+
265
+ (('tag:center'))
266
+ (('note:((<URL:https://slide.rabbit-shocker.org/authors/kou/db-tech-showcase-online-2020/>))'))
267
+
268
+ == Slide properties
269
+
270
+ : enable-title-on-image
271
+ false
272
+
273
+ = Apache Arrow Flight SQL
274
+
275
+ # mermaid
276
+ # relative_width = 35
277
+ # align = right
278
+ # vertical-align = top
279
+ # relative-margin-right = -10
280
+ # relative-margin-top = 0
281
+ sequenceDiagram
282
+ participant Ruby
283
+ participant SQL DB
284
+ Ruby->>SQL DB: Request (SQL)
285
+ Note right of SQL DB: Fast data processing
286
+ SQL DB-->>Ruby: Response (Apache Arrow data)
287
+
288
+ * gRPC based protocol\n
289
+ (('note:gRPCベースのプロトコル'))
290
+ * NOTE: Other network libraries\nsuch as UCX can be used\n
291
+ (('note:UCXなど他のネットワークライブラリーも使える'))
292
+ * Specialized to Apache Arrow format\n
293
+ (('note:Apache Arrowフォーマットに特化'))
294
+ * Serialize/deserialize cost ≒ 0\n
295
+ (('note:シリアライズ・デシリアライズコストがほぼ0'))
296
+
297
+ = Red Arrow Flight SQL
298
+
299
+ # rouge ruby
300
+
301
+ require "arrow-flight-sql"
302
+ location = "grpc://server:2929"
303
+ client = ArrowFlight::Client.new(location)
304
+ sql_client = ArrowFlightSQL::Client.new(client)
305
+ info = sql_client.execute("SELECT * FROM logs")
306
+ info.endpoints.each do |endpoint|
307
+ reader = sql_client.do_get(endpoint.ticket)
308
+ reader.read_all
309
+ end
310
+
311
+ = Which SQL DBs support\nApache Arrow Flight SQL?
312
+
313
+ # RT
314
+
315
+ SQL DB, Support?
316
+
317
+ MySQL, No
318
+ PostgreSQL, No
319
+ BigQuery, No
320
+ Trino, No
321
+ Dremio, Yes
322
+
323
+ = Why don't most SQL DBs support it?\n(('note:どうしてほとんどのSQL DBはサポートしていないの?'))
324
+
325
+ * Flight SQL is a new protocol\n
326
+ (('note:Apache Arrow Flight SQLは新しいプロトコルだから'))
327
+ * The first release: 2022-02(('note:(最初のリリース)'))
328
+ * Still experimental(('note:(まだ実験的扱い)'))
329
+ * Traditional SQL DBs may not support it\n
330
+ (('note:Long-established SQL DBs such as MySQL and PostgreSQL may not support it'))
331
+ * New SQL DBs will support it because...\n
332
+ (('note:新しいSQL DBはサポートするはず。なぜなら…'))
333
+
334
+ = Compatibility is important\n(('note:互換性が重要だから'))
335
+
336
+ * New SQL DBs often use major protocols\n
337
+ (('note:新しいSQL DBは既存のメジャーなプロトコルを使うことが多い'))
338
+ * To reuse existing client libraries\n
339
+ (('note:ユーザーは既存のクライアントライブラリーで新しいSQL DBを使える'))
340
+ * For example:
341
+ * MySQL protocol: TiDB, ...
342
+ * PostgreSQL protocol: (('tag:x-small:Cloud Spanner, CockroachDB, ...'))
343
+
344
+ = Future\n(('note:将来'))
345
+
346
+ * (('tag:small:More Flight SQL client libraries will be available'))\n
347
+ (('note:Flight SQLのクライアントライブラリーが充実するだろう'))
348
+ * New SQL DBs will support Flight SQL\n
349
+ (('note:新しいSQL DBはFlight SQLをサポートするだろう'))
350
+ * To reuse existing client libraries\n
351
+ (('note:既存のクライアントライブラリーを再利用するため'))
352
+ * (('tag:small:BI tools will support Flight SQL by default'))\n
353
+ (('note:BIツールはデフォルトでFlight SQLをサポートするだろう'))
354
+
355
+ = What should we do next?\n(('note:私たちは次はなにをするべき?'))
356
+
357
+ # mermaid
358
+ # relative_width = 30
359
+ # align = right
360
+ # vertical-align = top
361
+ # relative-margin-right = -10
362
+ # relative-margin-top = 0
363
+ graph LR;
364
+ subgraph all[" "]
365
+ direction TB
366
+ subgraph Negative spiral
367
+ N0[Few users]-->N1[Small community];
368
+ N1-->N2(Few developers);
369
+ N2-->N3[Few useful tools];
370
+ N3-->N0;
371
+ end
372
+ subgraph Positive spiral
373
+ P0[More users]-->P1[Larger community];
374
+ P1-->P2[More developers];
375
+ P2-->P3(More useful tools);
376
+ P3-->P0;
377
+ end
378
+ N2-.->P3;
379
+ end
380
+ style all fill-opacity:0,stroke-width:0px
381
+ style N2 stroke-width:5px
382
+ style P3 stroke-width:5px
383
+ style P0 stroke-width:5px
384
+
385
+ * Implement an Active Record\n
386
+ adapter for Flight SQL\n
387
+ (('note:Flight SQL用のActive Recordアダプターを'))\n
388
+ (('note:実装するといいんじゃないかな'))
389
+ * For ease of use from Ruby on Rails apps\n
390
+ (('note:Ruby on Railsアプリから使いやすくなるはず'))
391
+ * Join Red Data Tools!\n
392
+ (('note:Red Data Toolsで開発しようぜ!'))\n
393
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
394
+
395
+ = But I'm using MySQL/PostgreSQL...\n(('note:でも、MySQL/PostgreSQLを使っているし。。。'))
396
+
397
+ # img
398
+ # src = https://1zbpvb1efqtf3zvbfn3m51uy-wpengine.netdna-ssl.com/wp-content/uploads/2022/08/adbc-3.png
399
+ # caption = ADBC: Apache Arrow Database Connectivity
400
+ # relative_height = 70
401
+
402
+ (('tag:xx-small'))
403
+ ((<URL:https://voltrondata.com/news/simplifying-database-connectivity-with-arrow-flight-sql-and-adbc/>))
404
+
405
+ = ADBC
406
+
407
+ # img
408
+ # src = https://1zbpvb1efqtf3zvbfn3m51uy-wpengine.netdna-ssl.com/wp-content/uploads/2022/08/adbc-3.png
409
+ # align = right
410
+ # vertical-align = top
411
+ # relative_width = 40
412
+ # relative-margin-right = -10
413
+ # relative-margin-top = 0
414
+
415
+ * Generic ((*fast*))\n
416
+ SQL DB client API\n
417
+ (('note:任意のSQL DBに接続できる高速なAPI'))
418
+ * We can use Flight SQL\n
419
+ through ADBC\n
420
+ (('note:ADBC経由でFlight SQLも使える'))
421
+ * Flight SQL is the fastest driver\n
422
+ (('note:Flight SQLが最速のドライバー'))
423
+ * But the same API for all SQL DBs is useful\n
424
+ like Active Record for Rubyists\n
425
+ (('note:なんだけど、すべてのSQL DBに同じAPIでアクセスできるのは便利'))\n
426
+ (('note:Active Recordも便利でしょ?'))
427
+
428
+ = ADBC and Ruby
429
+
430
+ * Implementing Ruby bindings\n
431
+ (('note:Rubyバインディングを実装中'))
432
+ * Join Red Data Tools\n
433
+ to implement an Active Record adapter for ADBC!\n
434
+ (('note:Red Data ToolsでADBC用のActive Recordアダプターを開発しようぜ!'))\n
435
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
436
+
437
+ = Red ADBC
438
+
439
+ # rouge ruby
440
+
441
+ require "adbc"
442
+ options = {
443
+ driver: "adbc_driver_sqlite",
444
+ filename: ":memory:",
445
+ }
446
+ ADBC::Database.open(**options) do |database|
447
+ database.connect do |connection|
448
+ puts(connection.query("SELECT 1"))
449
+ end
450
+ end
451
+
452
+ = Wrap up: External process case\n(('note:まとめ:別プロセスの場合'))
453
+
454
+ * Large response is slow\n
455
+ (('note:レスポンスが大きいときに遅い'))
456
+ * The bottleneck is serialization/deserialization\n
457
+ (('note:ボトルネックはシリアライズ・デシリアライズ'))
458
+ * Apache Arrow Flight SQL/ADBC\n
459
+ (('note:そこでApache Arrow Flight SQL/ADBCですよ!'))
460
+ * Let's implement AR adapters for them\n
461
+ (('note:Active Recordのアダプターを実装しようぜ!'))
462
+
463
+ = In-process\n(('note:同一プロセス'))
464
+
465
+ # mermaid
466
+ # relative_width = 35
467
+ # align = right
468
+ # vertical-align = top
469
+ # relative-margin-right = -10
470
+ # relative-margin-top = -10
471
+ sequenceDiagram
472
+ Ruby->>+Library: Pass data
473
+ Note right of Library: Fast data processing
474
+ Library-->>-Ruby: Return processed data
475
+
476
+ * Not a popular case\n
477
+ in current Ruby usage\n
478
+ (('note:今のRubyの使われ方だとあまりないケース'))
479
+ * Use case: process data\n
480
+ on local/remote storage\n
481
+ (('note:IRB, batch process for Web app'))\n
482
+ (('note:ユースケース:ローカル・リモートストレージにあるデータの処理'))\n
483
+ * Need a fast data processing library\n
484
+ implemented in other fast language\n
485
+ (('note:高速にデータ処理できる他の速い言語で実装されたライブラリーが必要'))
486
+
487
+ = Fast language?\n(('note:速い言語?'))
488
+
489
+ * C/C++
490
+ * Rust
491
+ * Julia
492
+ * ...
493
+
494
+ = A C++ case: Apache Arrow
495
+
496
+ * Apache Arrow has a computation module\n
497
+ (('note:Apache Arrowは計算モジュールも提供している'))
498
+ * Ruby bindings: Red Arrow/red-arrow\n
499
+ (('note:RubyバインディングはRed Arrow/red-arrow'))
500
+ * Data frame based on Red Arrow\n
501
+ (('note:Red Arrowベースのデータフレームもある'))
502
+ * RedAmber/red_amber by @heronshoes\n
503
+ (('note:Developed by Suzuki-san, who is also a Red Data Tools member'))\n
504
+ (('note:((<URL:https://mybinder.org/v2/gh/RubyData/docker-stacks/master?filepath=red-amber.ipynb>))'))
505
+
506
+ = Red Arrow
507
+
508
+ # rouge ruby
509
+
510
+ require "datasets-arrow"
511
+ codes = Datasets::PostalCodeJapan.new.to_arrow
512
+ require "arrow"
513
+ sliced_codes = codes.slice do |slicer|
514
+ slicer.prefecture == "東京都"
515
+ end
516
+ puts(sliced_codes)
517
+
518
+ = Red Amber
519
+
520
+ # rouge ruby
521
+
522
+ require "datasets-arrow"
523
+ codes = Datasets::PostalCodeJapan.new.to_arrow
524
+ require "red_amber"
525
+ data_frame = RedAmber::DataFrame.new(codes)
526
+ prefecture = data_frame[:prefecture]
527
+ puts(data_frame[prefecture == "東京都"])
528
+
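Both the Red Arrow `slice` block and the RedAmber `data_frame[prefecture == "東京都"]` expression describe a filter on a whole column rather than a loop over rows. A minimal plain-Ruby model of that idea is sketched below. It is not how Red Arrow or RedAmber are implemented (they delegate to Apache Arrow's computation module), the helper names `equal_mask` and `filter_columns` are hypothetical, and the three rows are illustrative sample data rather than the real postal code dataset.

```ruby
# A tiny column-oriented table: each column is one Ruby array.
# These rows are illustrative sample data only.
columns = {
  postal_code: ["1000001", "0600000", "1500001"],
  prefecture: ["東京都", "北海道", "東京都"],
}

# Step 1: compare one whole column against a value to get a boolean mask.
def equal_mask(column, value)
  column.map { |item| item == value }
end

# Step 2: keep only the positions where the mask is true, in every column.
def filter_columns(columns, mask)
  columns.transform_values do |column|
    column.select.with_index { |_, index| mask[index] }
  end
end

mask = equal_mask(columns[:prefecture], "東京都")
tokyo = filter_columns(columns, mask)
puts(tokyo.inspect)
```

Splitting the work into "build a mask" and "apply the mask" is what lets a native library run each step as a tight loop over contiguous memory.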
529
+ = A C++ case: DuckDB
530
+
531
+ * Similar to SQLite\n
532
+ but for data analytics\n
533
+ (('note:データ分析向けのSQLiteみたいなやつ'))
534
+ * Fast aggregation/filter/sort\n
535
+ (('note:高速な集計・フィルター・ソート'))
536
+ * Ruby bindings: ruby-duckdb by @suketa\n
537
+ (('note:The Ruby bindings are developed by Suketa-san, who is also a Ruby committer'))
538
+ * We can impl. fast data processing with DuckDB\n
539
+ (('note:DuckDBを使って高速データ処理を実現できる!'))
540
+ * (('wait'))If we can interchange data w/ DuckDB fast...\n
541
+ (('note:DuckDBと高速にデータ交換できればね…'))
542
+
543
+ = Is fast data interchange important?\n(('note:高速データ交換は重要なの?'))
544
+
545
+ # mermaid
546
+ # relative_height = 70
547
+ sequenceDiagram
548
+ Ruby->>+DuckDB: Load data (!)
549
+ Note right of DuckDB: Fast data processing
550
+ DuckDB->>-Ruby: Read result (!)
551
+
552
+ (('tag:center'))
553
+ (('tag:x-small'))
554
+ If (!) are slow, total data processing is also slow\n
555
+ (('note:(!)が遅いと全体のデータ処理も遅くなる'))
556
+
557
+ == Slide properties
558
+
559
+ : enable-title-on-image
560
+ false
561
+
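The point of the "Is fast data interchange important?" slide can be made concrete with a small calculation. All timings below are hypothetical numbers chosen only for illustration, not measurements, and `total_seconds` is a helper made up for this sketch.

```ruby
# All timings are hypothetical, in seconds, and chosen for illustration.
def total_seconds(load:, process:, read:)
  load + process + read
end

# Slow interchange: loading data and reading the result dominate.
slow = total_seconds(load: 8.0, process: 0.5, read: 4.0)

# Fast interchange: the same processing time, much cheaper load/read.
fast = total_seconds(load: 0.25, process: 0.5, read: 0.25)

# Share of the slow total that is spent on interchange, the (!) steps.
interchange_share = (8.0 + 4.0) / slow

puts(slow)
puts(fast)
puts((interchange_share * 100).round)
```

With these assumed numbers the processing step is identical in both cases, and the difference in total time comes entirely from the two interchange steps.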
562
+ = Fast data interchange\n(('note:高速なデータ交換'))
563
+
564
+ # mermaid
565
+ # relative_height = 40
566
+ # align = right
567
+ # vertical-align = top
568
+ # relative-margin-right = -12
569
+ # relative-margin-top = -5
570
+ sequenceDiagram
571
+ Ruby->>+DuckDB: Pass Apache Arrow data directly
572
+ Note right of DuckDB: Fast data processing
573
+ DuckDB->>-Ruby: Read result with C data interface
574
+
575
+ * Use data as-is: zero-copy\n
576
+ (('note:データをそのまま使う:ゼロコピー'))
577
+ * Apache Arrow C data/stream interface
578
+ * C ABI for fast data interchange\n
579
+ (('note:高速にデータ交換するためのC ABI'))
580
+ * FYI: C ABI in Ruby: MemoryView\n
581
+ (('note:参考:RubyもMemoryViewというC ABIを提供している'))
582
+
583
+ = C data interface
584
+
585
+ # rouge c
586
+
587
+ struct ArrowArray {
588
+ // Array data description
589
+ int64_t length;
590
+ int64_t null_count;
591
+ int64_t offset;
592
+ int64_t n_buffers;
593
+ int64_t n_children;
594
+ const void** buffers;
595
+ struct ArrowArray** children;
596
+ struct ArrowArray* dictionary;
597
+ // Release callback
598
+ void (*release)(struct ArrowArray*);
599
+ // Opaque producer-specific data
600
+ void* private_data;
601
+ };
602
+
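The `release` callback in the struct above defines who owns the buffers: the producer fills in the struct, and the consumer calls `release` once when it is done, after which the struct is marked as released by clearing the callback. A minimal Ruby model of that hand-off is sketched below. `ArrowArrayModel`, `export_int32_array` and `sum_and_release` are hypothetical names, only a subset of the fields is modeled, and the sketch copies data with `unpack`, so it illustrates the ownership protocol and not zero-copy access.

```ruby
# Hypothetical Ruby model of a subset of the ArrowArray fields.
ArrowArrayModel = Struct.new(:length, :null_count, :offset,
                             :buffers, :release, :private_data)

# Producer side: export int32 values as one contiguous data buffer.
def export_int32_array(values)
  data = values.pack("l<*") # little-endian signed 32-bit integers
  state = {released: false} # stands in for producer-specific data
  # buffers[0] is the validity bitmap (absent here because there are
  # no nulls) and buffers[1] is the data buffer.
  array = ArrowArrayModel.new(values.size, 0, 0, [nil, data], nil, state)
  # The release callback drops producer resources and marks the
  # structure as released by clearing the callback itself.
  array.release = lambda do |exported|
    exported.private_data[:released] = true
    exported.buffers = nil
    exported.release = nil
  end
  array
end

# Consumer side: read the data buffer, then call release exactly once.
def sum_and_release(array)
  raise ArgumentError, "already released" if array.release.nil?
  values = array.buffers[1].unpack("l<*")
  total = values[array.offset, array.length].sum
  array.release.call(array)
  total
end

array = export_int32_array([1, 2, 3])
puts(sum_and_release(array))
puts(array.release.nil?)
```

In real bindings the struct lives in C memory and the libraries on both sides read the same buffers directly, which is where the zero-copy benefit comes from.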
603
+ = DuckDB with Apache Arrow
604
+
605
+ # rouge ruby
606
+
607
+ require "datasets-arrow"
608
+ codes = Datasets::PostalCodeJapan.new.to_arrow
609
+ require "arrow-duckdb"
610
+ db = DuckDB::Database.open
611
+ c = db.connect
612
+ c.register("codes", codes) do # Use Apache Arrow data as-is
613
+ c.query("SELECT * FROM codes WHERE prefecture = ?",
614
+ "東京都", # Tokyo
615
+ output: :arrow) # Output as Apache Arrow data
616
+ .to_table # C data interface
617
+ end
618
+
619
+ = C data interface on Web\n(('note:Web上でもC data interface'))
620
+
621
+ * Some unofficial WebAssembly ports exist\n
622
+ (('note:They are unofficial, but several Apache Arrow libraries support WebAssembly'))
623
+ * Rust based, Go based, ...
624
+ * WASM Ruby + C data I/F is useful?\n
625
+ (('note:WebAssembly版のRubyとC data interfaceでなんかできるかも?'))
626
+ * FYI: DuckDB supports WebAssembly too\n
627
+ (('note:参考:DuckDBもWebAssemblyをサポートしている'))
628
+
629
+ = A Rust case: Arrow DataFusion
630
+
631
+ * SQL query engine
632
+ * Internal memory layout is Apache Arrow\n
633
+ (('note:Similar to DuckDB, but the internal memory layout is Apache Arrow'))
634
+ * Direct Ruby bindings:(('note:(直接のバインディング)'))
635
+ * arrow-datafusion by @jychen7 with Magnus
636
+ * Ruby bindings via C API:(('note:(C API経由)'))
637
+ * datafusion-c with cargo-c
638
+ * Red DataFusion with datafusion-c
639
+
640
+ = Ruby bindings via C API\n(('note:C API経由のRubyバインディング'))
641
+
642
+ # mermaid
643
+ # relative_width = 35
644
+ # align = right
645
+ # vertical-align = top
646
+ # relative-margin-right = -10
647
+ # relative-margin-top = -7
648
+ graph LR;
649
+ subgraph all[" "]
650
+ direction TB
651
+ subgraph Negative spiral
652
+ N0[Few users]-->N1[Small community];
653
+ N1-->N2(Few developers);
654
+ N2-->N3[Few useful tools];
655
+ N3-->N0;
656
+ end
657
+ subgraph Positive spiral
658
+ P0(More users)-->P1[Larger community];
659
+ P1-->P2[More developers];
660
+ P2-->P3(More useful tools);
661
+ P3-->P0;
662
+ end
663
+ N2-. Apache Arrow .->P3;
664
+ end
665
+ style all fill-opacity:0,stroke-width:0px
666
+ style N2 stroke-width:5px
667
+ style P3 stroke-width:5px
668
+
669
+ * To develop with\ndevs from other langs\n
670
+ (('note:他言語の開発者と一緒に開発するため'))
671
+ * C API is useful for other languages too\n
672
+ (('note:A C API is also useful for developing bindings for other languages such as Java and Go'))
673
+ * C API provides a normal C library\n
674
+ (('note:C APIは普通のCライブラリーを提供する'))
675
+ * Headers: Generated by cbindgen automatically
676
+ * Shared libraries: Built with cargo-c
677
+
678
+ = Red DataFusion
679
+
680
+ # rouge ruby
681
+
682
+ require "datasets-arrow"
683
+ codes = Datasets::PostalCodeJapan.new.to_arrow
684
+ require "datafusion"
685
+ context = DataFusion::SessionContext.new
686
+ context.register("codes", codes) # C data interface
687
+ data_frame = context.sql(<<-SQL)
688
+ SELECT * FROM codes WHERE prefecture = '東京都'
689
+ SQL
690
+ puts(data_frame.to_table) # C data interface
691
+
692
+ = Remote data\n(('note:リモートデータ'))
693
+
694
+ * Recent modules can read remote data\n
695
+ (('note:最近のモジュールはリモートデータを読み込める'))
696
+ * At least these modules support it:\n
697
+ (('note:少なくとも次のモジュールはできる'))
698
+ * Apache Arrow
699
+ * DuckDB
700
+ * Apache Arrow DataFusion
701
+
702
+ = Remote data example: DuckDB
703
+
704
+ # rouge sql
705
+ SELECT COUNT(*)
706
+ FROM parquet_scan('s3://ookla-open-data/parquet/performance/*/*/*/*.parquet',
707
+ HIVE_PARTITIONING=1);
708
+ -- ┌───────┐
709
+ -- │ count_star() │
710
+ -- ├───────┤
711
+ -- │ 144567188 │
712
+ -- └───────┘
713
+
714
+ = Wrap up: In-process case\n(('note:まとめ:同一プロセスの場合'))
715
+
716
+ # mermaid
717
+ # relative_width = 35
718
+ # align = right
719
+ # vertical-align = top
720
+ # relative-margin-right = -10
721
+ # relative-margin-top = 0
722
+ sequenceDiagram
723
+ Ruby->>+Library: Pass data
724
+ Note right of Library: Fast data processing
725
+ Library-->>-Ruby: Return processed data
726
+
727
+ * To process data\n
728
+ on local/remote storage\n
729
+ (('note:ローカル・リモートストレージにあるデータの処理'))
730
+ * Low-level APIs (bindings)\n
731
+ are getting ready\n
732
+ (('note:低レベルのAPIは整備できてきた'))
733
+ * Let's implement high-level API!\n
734
+ (('note:高レベルのAPIを実装しようぜ!'))
735
+ * e.g.: RedAmber, Active Record adapters, ...\n
736
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
737
+
738
+ = Acknowledgment\n(('note:謝辞'))
739
+
740
+ * Voltron Data
741
+ * Supports my Apache Arrow related work\n
742
+ (('note:Voltron Data pays for my Apache Arrow related work'))
743
+ * You can join Voltron Data or ClearCode\n
744
+ to work on Apache Arrow as a job\n
745
+ (('note:Voltron Dataさんかクリアコードに転職すると'))\n
746
+ (('note:仕事でApache Arrowを開発できるよ!'))\n
747
+ (('note:((<URL:https://www.clear-code.com/recruitment/>))'))
748
+ * Red Data Tools members
749
+ * They develop data processing tools for Ruby!\n
750
+ (('note:Ruby用のデータ処理ツールを開発しているよ!'))
751
+
752
+ = Wrap up\n(('note:まとめ'))
753
+
754
+ * Can use Ruby for fast data processing\n
755
+ with Apache Arrow (('note:in some cases'))\n
756
+ (('note:Apache Arrowを使えば高速なデータ処理にRubyを使える…こともある'))
757
+ * External process: Fast data interchange\n
758
+ (('note:別プロセスの場合:高速データ交換機能を使う'))
759
+ * In-process: Fast data processing/interchange\n
760
+ (('note:同一プロセスの場合:高速データ処理・交換機能を使う'))
761
+ * But we still have missing pieces\n
762
+ (('note:でも、まだ足りないところがある'))
763
+
764
+ = Goal of this talk\n(('note:このトークのゴール'))
765
+
766
+ # mermaid
767
+ # relative_width = 35
768
+ # align = right
769
+ # vertical-align = top
770
+ # relative-margin-right = -10
771
+ # relative-margin-top = -7
772
+ graph LR;
773
+ subgraph all[" "]
774
+ direction TB
775
+ subgraph Negative spiral
776
+ N0[Few users]-->N1[Small community];
777
+ N1-->N2(Few developers);
778
+ N2-->N3[Few useful tools];
779
+ N3-->N0;
780
+ end
781
+ subgraph Positive spiral
782
+ P0(More users)-->P1[Larger community];
783
+ P1-->P2[More developers];
784
+ P2-->P3(More useful tools);
785
+ P3-->P0;
786
+ end
787
+ N2-. Apache Arrow .->P3;
788
+ end
789
+ style all fill-opacity:0,stroke-width:0px
790
+ style N2 stroke-width:5px
791
+ style P0 stroke-width:5px
792
+ style P3 stroke-width:5px
793
+
794
+ * You want to use Ruby\n
795
+ for some data processing\n
796
+ (('note:いくつかのデータ処理でRubyを使いたくなる'))
797
+ * Especially, you want to implement a BI tool\n
798
+ (('note:特にBIツールを作りたくなる'))
799
+ * You join Red Data Tools project\n
800
+ (('note:Red Data Toolsプロジェクトに参加する'))
801
+ * It provides data processing tools for Ruby\n
802
+ (('note:Ruby用のデータ処理ツールを提供するプロジェクト'))\n
803
+ (('note:((<URL:https://red-data-tools.github.io/>))'))
Binary file
data/theme.rb ADDED
@@ -0,0 +1 @@
1
+ include_theme("clear-code")
metadata ADDED
@@ -0,0 +1,89 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: rabbit-slide-kou-rubykaigi-2022
3
+ version: !ruby/object:Gem::Version
4
+ version: 2022.9.10
5
+ platform: ruby
6
+ authors:
7
+ - Sutou Kouhei
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2022-09-09 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rabbit
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: 2.0.2
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: 2.0.2
27
+ - !ruby/object:Gem::Dependency
28
+ name: rabbit-theme-clear-code
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ">="
32
+ - !ruby/object:Gem::Version
33
+ version: '0'
34
+ type: :runtime
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - ">="
39
+ - !ruby/object:Gem::Version
40
+ version: '0'
41
+ description: |-
42
+ I introduced Ruby and Apache Arrow integration including the "super fast large data interchange and processing" Apache Arrow feature at RubyKaigi Takeout 2021.
43
+
44
+ This talk introduces how we can use the "super fast large data interchange and processing" Apache Arrow feature in Ruby. Here are some use cases:
45
+
46
+ * Fast data retrieval (fast (({pluck}))) from DB such as MySQL and PostgreSQL for batch processes in a Ruby on Rails application
47
+ * Fast data interchange with JavaScript for dynamic visualization in a Ruby on Rails application
48
+ * Fast OLAP with in-process DB such as DuckDB and Apache Arrow DataFusion in a Ruby on Rails application or irb session
49
+ email:
50
+ - kou@clear-code.com
51
+ executables: []
52
+ extensions: []
53
+ extra_rdoc_files: []
54
+ files:
55
+ - ".rabbit"
56
+ - README.rd
57
+ - Rakefile
58
+ - config.yaml
59
+ - fast-data-processing-with-ruby-and-apache-arrow.rab
60
+ - images/apache-arrow-commits-kou-with-mark.png
61
+ - images/apache-arrow-commits-kou.png
62
+ - images/clear-code-rubykaigi-2022-silver-sponsor.png
63
+ - pdf/rubykaigi-2022-fast-data-processing-with-ruby-and-apache-arrow.pdf
64
+ - theme.rb
65
+ homepage: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2022/
66
+ licenses:
67
+ - CC-BY-SA-4.0
68
+ metadata:
69
+ source_code_uri: https://gitlab.com/ktou/rabbit-slide-kou-rubykaigi-2022
70
+ post_install_message:
71
+ rdoc_options: []
72
+ require_paths:
73
+ - lib
74
+ required_ruby_version: !ruby/object:Gem::Requirement
75
+ requirements:
76
+ - - ">="
77
+ - !ruby/object:Gem::Version
78
+ version: '0'
79
+ required_rubygems_version: !ruby/object:Gem::Requirement
80
+ requirements:
81
+ - - ">="
82
+ - !ruby/object:Gem::Version
83
+ version: '0'
84
+ requirements: []
85
+ rubygems_version: 3.4.0.dev
86
+ signing_key:
87
+ specification_version: 4
88
+ summary: Fast data processing with Ruby and Apache Arrow
89
+ test_files: []