arrow-datafusion 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 9f2a3634503431cd79f0d59c12e8d4ed1d8a899f38d5510835ee7cccb5404b4d
-   data.tar.gz: 02bba867e5c3da26395cc1d5f38457ce8769179f92551f80be898e5b43685284
+   metadata.gz: 6e8ea4613597694b0e11d65afafd8d31a0230fada4d84c9ecec842afe09ef046
+   data.tar.gz: 9397ee0001f51084c1651c70fcae972b680b614e2548ab286e90c645187e388f
  SHA512:
-   metadata.gz: ce438e24815777aac6bbd9fffbb8b12f2d64031179c9b3111ee84970ce44cd2833ef48adaad39ae9faff25d326af48fa7e6120341d96ce6bad3f45216f96c610
-   data.tar.gz: 59317958d2fe722bed2e50b885a40d49583d5b44117142991debb1f78e77f59adea9e19b578cf6d271c91c6c866730f68702b10ec85f60d38ef9d3c576be1efe
+   metadata.gz: 2f2c972c13edee286ce6aa71538b910794c3a1950e2f17be037ca2d3a6e03704b7e1135f11860b08ea8422c0cad1500ec027159548d31926ade5c78116c0b5cf
+   data.tar.gz: 5f2c0406f9519b192b270b86fb70bd432e4235059e46cc2e311d003b393c56e1f5e7c7f48601954ccb650dc1ec38cef36743bfa948a3335f73985b1fd96fa5cb
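The `SHA256` entries in `checksums.yaml` are hex digests of the two archives packed inside the `.gem` file. A minimal Ruby sketch of checking one entry follows; `checksum_matches?` is a hypothetical helper (not a RubyGems API), and the payload is stand-in data rather than the real archive bytes.

```ruby
require "digest"
require "yaml"

# Hypothetical helper: compares the SHA256 hex digest of `bytes` with the
# value recorded for `entry` in a parsed checksums.yaml document.
def checksum_matches?(checksums, entry, bytes)
  expected = checksums.fetch("SHA256").fetch(entry)
  Digest::SHA256.hexdigest(bytes) == expected
end

# Illustrative data only: a stand-in payload, not the real data.tar.gz.
payload = "example payload"
checksums = YAML.safe_load(<<~YAML)
  ---
  SHA256:
    data.tar.gz: "#{Digest::SHA256.hexdigest(payload)}"
YAML

puts checksum_matches?(checksums, "data.tar.gz", payload)        # unmodified bytes
puts checksum_matches?(checksums, "data.tar.gz", payload + "x")  # tampered bytes
```

In practice the bytes would come from reading `metadata.gz` or `data.tar.gz` out of the unpacked gem, and the expected value from the `checksums.yaml` of the version being inspected.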
data/README.md CHANGED
@@ -1,8 +1,8 @@
  # DataFusion in Ruby
 
- This is a Ruby library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
+ This is yet another Ruby library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
 
- It allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Ruby.
+ This is an alternative to [datafuion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby). Please refer to FAQ below.
 
  ## Quick Start
 
@@ -15,7 +15,7 @@ App
  ```ruby
  require "datafusion"
 
- ctx = Datafusion.SessionContext.new
+ ctx = Datafusion::SessionContext.new
  ctx.register_csv("csv", "test.csv")
  ctx.sql("SELECT * FROM csv").collect
  ```
@@ -24,15 +24,15 @@ ctx.sql("SELECT * FROM csv").collect
 
  SessionContext
  - [x] new
- - [ ] register_csv
- - [ ] sql
+ - [x] register_csv
+ - [x] sql
  - [ ] register_parquet
  - [ ] register_record_batches
  - [ ] register_udf
 
  Dataframe
- - [ ] new
- - [ ] collect
+ - [x] new
+ - [x] collect
  - [ ] schema
  - [ ] select_columns
  - [ ] select
@@ -46,4 +46,30 @@ Dataframe
 
  ## Contribution Guide
 
- Please see [Contribution Guide](CONTRIBUTING.md) for information about contributing to DataFusion in Ruby.
+ Please see [Contribution Guide](CONTRIBUTING.md).
+
+ ## FAQ
+
+ ### Why another Ruby bindings for Arrow Datafusion?
+
+ [datafuion-contrib/datafusion-python](https://github.com/datafusion-contrib/datafusion-python) is a `Rust -> Python` bindings using [pyo3](https://github.com/PyO3/pyo3) and I want to use Arrow Datafusion in Ruby. So I create a `Rust -> Ruby` bindings using [Magnus](https://github.com/matsadler/magnus).
+
+ Other than Python, Datafusion Community also want to have Java and other language bindings. In order to share development resource, [datafuion-contrib/datafusion-c](https://github.com/datafusion-contrib/datafusion-c) is created and will be used for [datafuion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) and other languages. E.g. `Rust -> C -> Ruby/Python/Java/etc`.
+
+ So I just keep this `Rust -> Python` implementation as my side project.
+
+ ### Why Magnus?
+
+ As of 2022-07, there are a few popular Ruby bindings for Rust, [Rutie](https://github.com/danielpclark/rutie), [Magnus](https://github.com/matsadler/magnus) and [other alternatives](https://github.com/matsadler/magnus#alternatives). Magnus is picked because its API seems cleaner and it seems more clear about safe vs unsafe. The author of Magnus have a "maybe bias" comparison in this [reddit thread](https://www.reddit.com/r/ruby/comments/uskibb/comment/i98rds4/?utm_source=share&utm_medium=web2x&context=3). It is totally subjective and it should not be large effort if we decides to switch to different Ruby bindings fr Rust in future.
+
+ ### Why the module name and gem name are different?
+
+ The module name `Datafusion` follows the [datafusion](https://github.com/apache/arrow-datafusion) and [datafusion-python](https://github.com/datafusion-contrib/datafusion-python). The gem name `datafusion` [is occupied in rubygems.org at 2016](https://rubygems.org/gems/datafusion), so our gem is called `arrow-datafusion`.
+
+ Similarly to the Ruby bindings of Arrow, its gem name is called [red-arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) and the module is called `arrow`.
+
+ ### What is the relationship between gem "arrow-datafusion" and "red-arrow"?
+
+ "arrow-datafusion" is the Ruby bindings of Arrow Datafusion (Rust). "red-arrow" is the Ruby bindings of Arrow (C++). To keep Datafusion Ruby simpler, I try not to couple with Red Arrow in core features at the moment. If need, we can add additional gems to support "red-arrow" in "arrow-datafusion", similar to how [red-parquet](https://github.com/apache/arrow/blob/2c7c12fd408339817f0322f137d25e9f60a87a26/ruby/red-parquet/red-parquet.gemspec#L44) use red-arrow.
+
+ ps: Datafusion Python was coupled with PyArrow. There is a proposal to separate them in medium to long term. For detail, please refer to [Can datafusion-python be used without pyarrow?](https://github.com/datafusion-contrib/datafusion-python/issues/22).
@@ -3,12 +3,12 @@ require_relative "lib/datafusion/version"
  Gem::Specification.new do |spec|
    spec.name = "arrow-datafusion"
    spec.version = Datafusion::VERSION
-   spec.authors = ["Datafusion Contrib Developers"]
-   spec.homepage = "https://github.com/datafusion-contrib/datafusion-ruby"
+   spec.authors = ["jychen7"]
+   spec.homepage = "https://github.com/jychen7/arrow-datafusion-ruby"
 
-   spec.summary = "Ruby bindings of Apache Arrow Datafusion"
+   spec.summary = "yet another Ruby bindings of Apache Arrow Datafusion"
    spec.description =
-     "DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format."
+     "yet another Ruby bindings of Apache Arrow Datafusion"
    spec.license = "Apache-2.0"
 
    spec.files = ["README.md", "#{spec.name}.gemspec", "LICENSE"]
@@ -19,4 +19,5 @@ Gem::Specification.new do |spec|
 
    # actually a build time dependency, but that's not an option.
    spec.add_runtime_dependency "rake", "> 1"
+   spec.required_ruby_version = ">= 2.6.0"
  end
@@ -503,6 +503,7 @@ version = "0.0.1"
  dependencies = [
   "datafusion",
   "magnus",
+  "tokio",
  ]
 
  [[package]]
@@ -918,8 +919,7 @@ dependencies = [
  [[package]]
  name = "magnus"
  version = "0.3.2"
- source = "registry+https://github.com/rust-lang/crates.io-index"
- checksum = "983e15338a2e9644f804de8b5e52fb930bcd53b6859de4f4feb85753532b69d3"
+ source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
  dependencies = [
   "bindgen",
   "magnus-macros",
@@ -928,8 +928,7 @@ dependencies = [
  [[package]]
  name = "magnus-macros"
  version = "0.1.0"
- source = "registry+https://github.com/rust-lang/crates.io-index"
- checksum = "27968fcabe2ef7e35b8331d71a62882282996f6222c133c0255cf7f33b8d9b48"
+ source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
  dependencies = [
   "darling",
   "proc-macro2",
@@ -8,5 +8,7 @@ edition = "2018"
  crate-type = ["cdylib"]
 
  [dependencies]
- magnus = "0.3"
+ # as of 2022-07, magnus v0.3.2 does NOT include "define_error" in RModule
+ magnus = { git = "https://github.com/matsadler/magnus" }
  datafusion = { version = "^8.0.0" }
+ tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread", "sync"] }
@@ -1,4 +1,10 @@
  use datafusion::execution::context::SessionContext;
+ use datafusion::prelude::CsvReadOptions;
+ use magnus::Error;
+
+ use crate::dataframe::RbDataFrame;
+ use crate::errors::DataFusionError;
+ use crate::utils::wait_for_future;
 
  #[magnus::wrap(class = "Datafusion::SessionContext")]
  pub(crate) struct RbSessionContext {
@@ -11,4 +17,18 @@ impl RbSessionContext {
              ctx: SessionContext::new(),
          }
      }
+
+     pub(crate) fn register_csv(&self, name: String, table_path: String) -> Result<(), Error> {
+         let result =
+             self.ctx
+                 .register_csv(name.as_ref(), table_path.as_ref(), CsvReadOptions::new());
+         wait_for_future(result).map_err(DataFusionError::from)?;
+         Ok(())
+     }
+
+     pub(crate) fn sql(&self, query: String) -> Result<RbDataFrame, Error> {
+         let result = self.ctx.sql(query.as_ref());
+         let df = wait_for_future(result).map_err(DataFusionError::from)?;
+         Ok(RbDataFrame::new(df))
+     }
  }
@@ -0,0 +1,27 @@
+ use datafusion::dataframe::DataFrame;
+ use magnus::Error;
+ use std::sync::Arc;
+
+ use crate::errors::DataFusionError;
+ use crate::record_batch::RbRecordBatch;
+ use crate::utils::wait_for_future;
+
+ #[magnus::wrap(class = "Datafusion::DataFrame")]
+ pub(crate) struct RbDataFrame {
+     df: Arc<DataFrame>,
+ }
+
+ impl RbDataFrame {
+     pub(crate) fn new(df: Arc<DataFrame>) -> Self {
+         Self { df }
+     }
+
+     pub(crate) fn collect(&self) -> Result<Vec<RbRecordBatch>, Error> {
+         let result = self.df.collect();
+         let batches = wait_for_future(result).map_err(DataFusionError::from)?;
+         Ok(batches
+             .into_iter()
+             .map(|batch| RbRecordBatch::new(batch))
+             .collect())
+     }
+ }
@@ -0,0 +1,42 @@
+ use core::fmt;
+
+ use datafusion::arrow::error::ArrowError;
+ use datafusion::error::DataFusionError as InnerDataFusionError;
+ use magnus::Error as MagnusError;
+
+ use crate::datafusion_error;
+
+ #[derive(Debug)]
+ pub enum DataFusionError {
+     ExecutionError(InnerDataFusionError),
+     ArrowError(ArrowError),
+     CommonError(String),
+ }
+
+ impl fmt::Display for DataFusionError {
+     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+         match self {
+             DataFusionError::ExecutionError(e) => write!(f, "Rust DataFusion error: {:?}", e),
+             DataFusionError::ArrowError(e) => write!(f, "Rust Arrow error: {:?}", e),
+             DataFusionError::CommonError(e) => write!(f, "Ruby DataFusion error: {:?}", e),
+         }
+     }
+ }
+
+ impl From<ArrowError> for DataFusionError {
+     fn from(err: ArrowError) -> DataFusionError {
+         DataFusionError::ArrowError(err)
+     }
+ }
+
+ impl From<InnerDataFusionError> for DataFusionError {
+     fn from(err: InnerDataFusionError) -> DataFusionError {
+         DataFusionError::ExecutionError(err)
+     }
+ }
+
+ impl From<DataFusionError> for MagnusError {
+     fn from(err: DataFusionError) -> MagnusError {
+         MagnusError::new(datafusion_error(), err.to_string())
+     }
+ }
@@ -1,11 +1,39 @@
- use magnus::{define_module, function, prelude::*, Error};
+ use magnus::{
+     define_module, exception::ExceptionClass, function, memoize, method, prelude::*, Error, RModule,
+ };
 
  mod context;
+ mod dataframe;
+ mod errors;
+ mod record_batch;
+ mod utils;
+
+ fn datafusion() -> RModule {
+     *memoize!(RModule: define_module("Datafusion").unwrap())
+ }
+
+ fn datafusion_error() -> ExceptionClass {
+     *memoize!(ExceptionClass: datafusion().define_error("Error", Default::default()).unwrap())
+ }
 
  #[magnus::init]
  fn init() -> Result<(), Error> {
-     let module = define_module("Datafusion")?;
-     let class = module.define_class("SessionContext", Default::default())?;
-     class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
+     // ensure error is defined on load
+     datafusion_error();
+
+     let ctx_class = datafusion().define_class("SessionContext", Default::default())?;
+     ctx_class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
+     ctx_class.define_method(
+         "register_csv",
+         method!(context::RbSessionContext::register_csv, 2),
+     )?;
+     ctx_class.define_method("sql", method!(context::RbSessionContext::sql, 1))?;
+
+     let df_class = datafusion().define_class("DataFrame", Default::default())?;
+     df_class.define_method("collect", method!(dataframe::RbDataFrame::collect, 0))?;
+
+     let rb_class = datafusion().define_class("RecordBatch", Default::default())?;
+     rb_class.define_method("to_h", method!(record_batch::RbRecordBatch::to_hash, 0))?;
+
      Ok(())
  }
@@ -0,0 +1,56 @@
+ use datafusion::arrow::{
+     array::{Float64Array, Int64Array, StringArray},
+     datatypes::DataType,
+     record_batch::RecordBatch,
+ };
+ use magnus::{Error, Value};
+
+ use crate::errors::DataFusionError;
+ use std::collections::HashMap;
+
+ #[magnus::wrap(class = "Datafusion::RecordBatch")]
+ pub(crate) struct RbRecordBatch {
+     rb: RecordBatch,
+ }
+
+ impl RbRecordBatch {
+     pub(crate) fn new(rb: RecordBatch) -> Self {
+         Self { rb }
+     }
+
+     pub(crate) fn to_hash(&self) -> Result<HashMap<String, Vec<Value>>, Error> {
+         let mut columns_by_name: HashMap<String, Vec<Value>> = HashMap::new();
+         for (i, field) in self.rb.schema().fields().iter().enumerate() {
+             let column = self.rb.column(i);
+             columns_by_name.insert(
+                 field.name().clone(),
+                 match column.data_type() {
+                     DataType::Int64 => {
+                         let array = column.as_any().downcast_ref::<Int64Array>().unwrap();
+                         array.values().iter().map(|v| (*v as i64).into()).collect()
+                     }
+                     DataType::Float64 => {
+                         let array = column.as_any().downcast_ref::<Float64Array>().unwrap();
+                         array.values().iter().map(|v| (*v as f64).into()).collect()
+                     }
+                     DataType::Utf8 => {
+                         let array = column.as_any().downcast_ref::<StringArray>().unwrap();
+                         let mut values: Vec<Value> = vec![];
+                         for i in 0..(column.len()) {
+                             values.push(std::string::String::from(array.value(i)).into())
+                         }
+                         values
+                     }
+                     unknown => {
+                         return Err(DataFusionError::CommonError(format!(
+                             "unhandle data type: {}",
+                             unknown
+                         ))
+                         .into())
+                     }
+                 },
+             );
+         }
+         Ok(columns_by_name)
+     }
+ }
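`RbRecordBatch::to_hash` (exposed to Ruby as `to_h`) builds a map from column name to an array of values. Assuming the Ruby side receives that as a `Hash` of equal-length arrays, a caller who wants row-shaped data could transpose it. The sketch below uses a hypothetical stand-in hash rather than output captured from the gem, and `rows_from_columns` is an illustrative helper, not part of the package.

```ruby
# Hypothetical helper: turns a column-oriented Hash (name => array of values)
# into an array of row Hashes. Assumes every column has the same length.
def rows_from_columns(columns)
  names = columns.keys
  columns.values.transpose.map { |values| names.zip(values).to_h }
end

# Stand-in for what a RecordBatch#to_h call might return for a two-row batch.
columns = { "id" => [1, 2], "name" => ["a", "b"] }

rows = rows_from_columns(columns)
rows.each { |row| puts row.inspect }
```

Because the Rust side collects columns into a `HashMap`, whose iteration order is unspecified, it is safer for callers to look columns up by name than to rely on key order.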
@@ -0,0 +1,11 @@
+ use std::future::Future;
+ use tokio::runtime::Runtime;
+
+ pub fn wait_for_future<F: Future>(f: F) -> F::Output
+ where
+     F: Send,
+     F::Output: Send,
+ {
+     let rt = Runtime::new().unwrap();
+     rt.block_on(f)
+ }
@@ -1,3 +1,3 @@
  module Datafusion
-   VERSION = "0.0.1"
+   VERSION = "0.0.2"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: arrow-datafusion
  version: !ruby/object:Gem::Version
-   version: 0.0.1
+   version: 0.0.2
  platform: ruby
  authors:
- - Datafusion Contrib Developers
+ - jychen7
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2022-07-03 00:00:00.000000000 Z
+ date: 2022-07-06 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
@@ -24,8 +24,7 @@ dependencies:
      - - ">"
        - !ruby/object:Gem::Version
          version: '1'
- description: DataFusion is an extensible query execution framework, written in Rust,
-   that uses Apache Arrow as its in-memory format.
+ description: yet another Ruby bindings of Apache Arrow Datafusion
  email:
  executables: []
  extensions:
@@ -39,10 +38,14 @@ files:
  - ext/datafusion_ruby/Cargo.toml
  - ext/datafusion_ruby/Rakefile
  - ext/datafusion_ruby/src/context.rs
+ - ext/datafusion_ruby/src/dataframe.rs
+ - ext/datafusion_ruby/src/errors.rs
  - ext/datafusion_ruby/src/lib.rs
+ - ext/datafusion_ruby/src/record_batch.rs
+ - ext/datafusion_ruby/src/utils.rs
  - lib/datafusion.rb
  - lib/datafusion/version.rb
- homepage: https://github.com/datafusion-contrib/datafusion-ruby
+ homepage: https://github.com/jychen7/arrow-datafusion-ruby
  licenses:
  - Apache-2.0
  metadata: {}
@@ -54,7 +57,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: '0'
+       version: 2.6.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
@@ -64,5 +67,5 @@ requirements: []
  rubygems_version: 3.1.6
  signing_key:
  specification_version: 4
- summary: Ruby bindings of Apache Arrow Datafusion
+ summary: yet another Ruby bindings of Apache Arrow Datafusion
  test_files: []
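
The new `required_ruby_version` constraint (`>= 2.6.0`, previously `>= 0`) can be evaluated with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. A minimal sketch, with `ruby_supported?` as an illustrative helper name:

```ruby
require "rubygems"

# The constraint introduced in 0.0.2.
REQUIRED_RUBY = Gem::Requirement.new(">= 2.6.0")

# Illustrative helper: true when the given Ruby version string satisfies
# the gem's required_ruby_version constraint.
def ruby_supported?(version_string)
  REQUIRED_RUBY.satisfied_by?(Gem::Version.new(version_string))
end

puts ruby_supported?("2.5.9")       # below the minimum
puts ruby_supported?("2.6.0")       # exactly the minimum
puts ruby_supported?(RUBY_VERSION)  # the interpreter running this script
```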