arrow-datafusion 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 9f2a3634503431cd79f0d59c12e8d4ed1d8a899f38d5510835ee7cccb5404b4d
4
- data.tar.gz: 02bba867e5c3da26395cc1d5f38457ce8769179f92551f80be898e5b43685284
3
+ metadata.gz: 6e8ea4613597694b0e11d65afafd8d31a0230fada4d84c9ecec842afe09ef046
4
+ data.tar.gz: 9397ee0001f51084c1651c70fcae972b680b614e2548ab286e90c645187e388f
5
5
  SHA512:
6
- metadata.gz: ce438e24815777aac6bbd9fffbb8b12f2d64031179c9b3111ee84970ce44cd2833ef48adaad39ae9faff25d326af48fa7e6120341d96ce6bad3f45216f96c610
7
- data.tar.gz: 59317958d2fe722bed2e50b885a40d49583d5b44117142991debb1f78e77f59adea9e19b578cf6d271c91c6c866730f68702b10ec85f60d38ef9d3c576be1efe
6
+ metadata.gz: 2f2c972c13edee286ce6aa71538b910794c3a1950e2f17be037ca2d3a6e03704b7e1135f11860b08ea8422c0cad1500ec027159548d31926ade5c78116c0b5cf
7
+ data.tar.gz: 5f2c0406f9519b192b270b86fb70bd432e4235059e46cc2e311d003b393c56e1f5e7c7f48601954ccb650dc1ec38cef36743bfa948a3335f73985b1fd96fa5cb
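The entries above are SHA-256 and SHA-512 digests recorded for `metadata.gz` and `data.tar.gz`. As a minimal sketch, assuming you have a file on disk and a digest to compare it with, such values can be recomputed with Ruby's standard `digest` library. The file name `payload.bin`, its contents, and the `digests_for` helper are hypothetical stand-ins and not part of the gem.

```ruby
require "digest"
require "tmpdir"

# Hypothetical helper: compute the two digests that checksums.yaml records for a file.
def digests_for(path)
  bytes = File.binread(path)
  {
    "SHA256" => Digest::SHA256.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

Dir.mktmpdir do |dir|
  # Stand-in for metadata.gz or data.tar.gz.
  path = File.join(dir, "payload.bin")
  File.binwrite(path, "example payload")

  # In practice the expected values would be read from checksums.yaml.
  expected = digests_for(path)

  # Simulate a modified archive and recompute.
  File.binwrite(path, "tampered payload")
  actual = digests_for(path)

  sha256_matches = expected["SHA256"] == actual["SHA256"]
  sha512_matches = expected["SHA512"] == actual["SHA512"]
  puts "SHA256 match: #{sha256_matches}"
  puts "SHA512 match: #{sha512_matches}"
end
```

Any change to the file contents changes both digests, which is why every entry differs between 0.0.1 and 0.0.2.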
data/README.md CHANGED
@@ -1,8 +1,8 @@
1
1
  # DataFusion in Ruby
2
2
 
3
- This is a Ruby library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
3
+ This is yet another Ruby library that binds to the [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
4
4
 
5
- It allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Ruby.
5
+ This is an alternative to [datafusion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby). Please refer to the FAQ below.
6
6
 
7
7
  ## Quick Start
8
8
 
@@ -15,7 +15,7 @@ App
15
15
  ```ruby
16
16
  require "datafusion"
17
17
 
18
- ctx = Datafusion.SessionContext.new
18
+ ctx = Datafusion::SessionContext.new
19
19
  ctx.register_csv("csv", "test.csv")
20
20
  ctx.sql("SELECT * FROM csv").collect
21
21
  ```
@@ -24,15 +24,15 @@ ctx.sql("SELECT * FROM csv").collect
24
24
 
25
25
  SessionContext
26
26
  - [x] new
27
- - [ ] register_csv
28
- - [ ] sql
27
+ - [x] register_csv
28
+ - [x] sql
29
29
  - [ ] register_parquet
30
30
  - [ ] register_record_batches
31
31
  - [ ] register_udf
32
32
 
33
33
  Dataframe
34
- - [ ] new
35
- - [ ] collect
34
+ - [x] new
35
+ - [x] collect
36
36
  - [ ] schema
37
37
  - [ ] select_columns
38
38
  - [ ] select
@@ -46,4 +46,30 @@ Dataframe
46
46
 
47
47
  ## Contribution Guide
48
48
 
49
- Please see [Contribution Guide](CONTRIBUTING.md) for information about contributing to DataFusion in Ruby.
49
+ Please see [Contribution Guide](CONTRIBUTING.md).
50
+
51
+ ## FAQ
52
+
53
+ ### Why another set of Ruby bindings for Arrow Datafusion?
54
+
55
+ [datafusion-contrib/datafusion-python](https://github.com/datafusion-contrib/datafusion-python) is a set of `Rust -> Python` bindings built with [pyo3](https://github.com/PyO3/pyo3). I wanted to use Arrow Datafusion in Ruby, so I created a set of `Rust -> Ruby` bindings using [Magnus](https://github.com/matsadler/magnus).
56
+
57
+ Besides Python, the Datafusion community also wants Java and other language bindings. To share development resources, [datafusion-contrib/datafusion-c](https://github.com/datafusion-contrib/datafusion-c) was created and will be used for [datafusion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) and other languages, i.e. `Rust -> C -> Ruby/Python/Java/etc`.
58
+
59
+ So I just keep this `Rust -> Ruby` implementation as my side project.
60
+
61
+ ### Why Magnus?
62
+
63
+ As of 2022-07, there are a few popular Ruby bindings for Rust: [Rutie](https://github.com/danielpclark/rutie), [Magnus](https://github.com/matsadler/magnus) and [other alternatives](https://github.com/matsadler/magnus#alternatives). Magnus was picked because its API seems cleaner and it seems clearer about safe vs unsafe code. The author of Magnus has a "maybe bias" comparison in this [reddit thread](https://www.reddit.com/r/ruby/comments/uskibb/comment/i98rds4/?utm_source=share&utm_medium=web2x&context=3). The choice is subjective, and it should not take much effort if we decide to switch to different Ruby bindings for Rust in the future.
64
+
65
+ ### Why are the module name and gem name different?
66
+
67
+ The module name `Datafusion` follows [datafusion](https://github.com/apache/arrow-datafusion) and [datafusion-python](https://github.com/datafusion-contrib/datafusion-python). The gem name `datafusion` [has been taken on rubygems.org since 2016](https://rubygems.org/gems/datafusion), so our gem is called `arrow-datafusion`.
68
+
69
+ The Ruby bindings of Arrow are in a similar situation: the gem is called [red-arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) and the module is called `arrow`.
70
+
71
+ ### What is the relationship between the gems "arrow-datafusion" and "red-arrow"?
72
+
73
+ "arrow-datafusion" is the Ruby bindings of Arrow Datafusion (Rust). "red-arrow" is the Ruby bindings of Arrow (C++). To keep Datafusion Ruby simpler, I try not to couple it with Red Arrow in core features at the moment. If needed, we can add additional gems to support "red-arrow" in "arrow-datafusion", similar to how [red-parquet](https://github.com/apache/arrow/blob/2c7c12fd408339817f0322f137d25e9f60a87a26/ruby/red-parquet/red-parquet.gemspec#L44) uses red-arrow.
74
+
75
+ P.S. Datafusion Python was coupled with PyArrow. There is a proposal to separate them in the medium to long term. For details, please refer to [Can datafusion-python be used without pyarrow?](https://github.com/datafusion-contrib/datafusion-python/issues/22).
data/arrow-datafusion.gemspec CHANGED
@@ -3,12 +3,12 @@ require_relative "lib/datafusion/version"
3
3
  Gem::Specification.new do |spec|
4
4
  spec.name = "arrow-datafusion"
5
5
  spec.version = Datafusion::VERSION
6
- spec.authors = ["Datafusion Contrib Developers"]
7
- spec.homepage = "https://github.com/datafusion-contrib/datafusion-ruby"
6
+ spec.authors = ["jychen7"]
7
+ spec.homepage = "https://github.com/jychen7/arrow-datafusion-ruby"
8
8
 
9
- spec.summary = "Ruby bindings of Apache Arrow Datafusion"
9
+ spec.summary = "yet another Ruby bindings of Apache Arrow Datafusion"
10
10
  spec.description =
11
- "DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format."
11
+ "yet another Ruby bindings of Apache Arrow Datafusion"
12
12
  spec.license = "Apache-2.0"
13
13
 
14
14
  spec.files = ["README.md", "#{spec.name}.gemspec", "LICENSE"]
@@ -19,4 +19,5 @@ Gem::Specification.new do |spec|
19
19
 
20
20
  # actually a build time dependency, but that's not an option.
21
21
  spec.add_runtime_dependency "rake", "> 1"
22
+ spec.required_ruby_version = ">= 2.6.0"
22
23
  end
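The `required_ruby_version` constraint added in this hunk is an ordinary RubyGems requirement string. As a rough sketch of how such a constraint evaluates, the standard `Gem::Requirement` and `Gem::Version` classes can be used directly. The interpreter versions listed below are hypothetical examples chosen only for illustration.

```ruby
require "rubygems"

# The constraint added to the gemspec in this release.
requirement = Gem::Requirement.new(">= 2.6.0")

# Hypothetical interpreter versions, used only to illustrate the check.
%w[2.5.9 2.6.0 3.1.2].each do |version|
  allowed = requirement.satisfied_by?(Gem::Version.new(version))
  puts "Ruby #{version}: #{allowed ? 'allowed' : 'rejected'}"
end
```

In 0.0.1 the generated metadata carried the default requirement `>= 0`, so this release is the first one to restrict the Ruby version.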
@@ -503,6 +503,7 @@ version = "0.0.1"
503
503
  dependencies = [
504
504
  "datafusion",
505
505
  "magnus",
506
+ "tokio",
506
507
  ]
507
508
 
508
509
  [[package]]
@@ -918,8 +919,7 @@ dependencies = [
918
919
  [[package]]
919
920
  name = "magnus"
920
921
  version = "0.3.2"
921
- source = "registry+https://github.com/rust-lang/crates.io-index"
922
- checksum = "983e15338a2e9644f804de8b5e52fb930bcd53b6859de4f4feb85753532b69d3"
922
+ source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
923
923
  dependencies = [
924
924
  "bindgen",
925
925
  "magnus-macros",
@@ -928,8 +928,7 @@ dependencies = [
928
928
  [[package]]
929
929
  name = "magnus-macros"
930
930
  version = "0.1.0"
931
- source = "registry+https://github.com/rust-lang/crates.io-index"
932
- checksum = "27968fcabe2ef7e35b8331d71a62882282996f6222c133c0255cf7f33b8d9b48"
931
+ source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
933
932
  dependencies = [
934
933
  "darling",
935
934
  "proc-macro2",
data/ext/datafusion_ruby/Cargo.toml CHANGED
@@ -8,5 +8,7 @@ edition = "2018"
8
8
  crate-type = ["cdylib"]
9
9
 
10
10
  [dependencies]
11
- magnus = "0.3"
11
+ # as of 2022-07, magnus v0.3.2 does NOT include "define_error" in RModule
12
+ magnus = { git = "https://github.com/matsadler/magnus" }
12
13
  datafusion = { version = "^8.0.0" }
14
+ tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread", "sync"] }
data/ext/datafusion_ruby/src/context.rs CHANGED
@@ -1,4 +1,10 @@
1
1
  use datafusion::execution::context::SessionContext;
2
+ use datafusion::prelude::CsvReadOptions;
3
+ use magnus::Error;
4
+
5
+ use crate::dataframe::RbDataFrame;
6
+ use crate::errors::DataFusionError;
7
+ use crate::utils::wait_for_future;
2
8
 
3
9
  #[magnus::wrap(class = "Datafusion::SessionContext")]
4
10
  pub(crate) struct RbSessionContext {
@@ -11,4 +17,18 @@ impl RbSessionContext {
11
17
  ctx: SessionContext::new(),
12
18
  }
13
19
  }
20
+
21
+ pub(crate) fn register_csv(&self, name: String, table_path: String) -> Result<(), Error> {
22
+ let result =
23
+ self.ctx
24
+ .register_csv(name.as_ref(), table_path.as_ref(), CsvReadOptions::new());
25
+ wait_for_future(result).map_err(DataFusionError::from)?;
26
+ Ok(())
27
+ }
28
+
29
+ pub(crate) fn sql(&self, query: String) -> Result<RbDataFrame, Error> {
30
+ let result = self.ctx.sql(query.as_ref());
31
+ let df = wait_for_future(result).map_err(DataFusionError::from)?;
32
+ Ok(RbDataFrame::new(df))
33
+ }
14
34
  }
data/ext/datafusion_ruby/src/dataframe.rs ADDED
@@ -0,0 +1,27 @@
1
+ use datafusion::dataframe::DataFrame;
2
+ use magnus::Error;
3
+ use std::sync::Arc;
4
+
5
+ use crate::errors::DataFusionError;
6
+ use crate::record_batch::RbRecordBatch;
7
+ use crate::utils::wait_for_future;
8
+
9
+ #[magnus::wrap(class = "Datafusion::DataFrame")]
10
+ pub(crate) struct RbDataFrame {
11
+ df: Arc<DataFrame>,
12
+ }
13
+
14
+ impl RbDataFrame {
15
+ pub(crate) fn new(df: Arc<DataFrame>) -> Self {
16
+ Self { df }
17
+ }
18
+
19
+ pub(crate) fn collect(&self) -> Result<Vec<RbRecordBatch>, Error> {
20
+ let result = self.df.collect();
21
+ let batches = wait_for_future(result).map_err(DataFusionError::from)?;
22
+ Ok(batches
23
+ .into_iter()
24
+ .map(|batch| RbRecordBatch::new(batch))
25
+ .collect())
26
+ }
27
+ }
data/ext/datafusion_ruby/src/errors.rs ADDED
@@ -0,0 +1,42 @@
1
+ use core::fmt;
2
+
3
+ use datafusion::arrow::error::ArrowError;
4
+ use datafusion::error::DataFusionError as InnerDataFusionError;
5
+ use magnus::Error as MagnusError;
6
+
7
+ use crate::datafusion_error;
8
+
9
+ #[derive(Debug)]
10
+ pub enum DataFusionError {
11
+ ExecutionError(InnerDataFusionError),
12
+ ArrowError(ArrowError),
13
+ CommonError(String),
14
+ }
15
+
16
+ impl fmt::Display for DataFusionError {
17
+ fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
18
+ match self {
19
+ DataFusionError::ExecutionError(e) => write!(f, "Rust DataFusion error: {:?}", e),
20
+ DataFusionError::ArrowError(e) => write!(f, "Rust Arrow error: {:?}", e),
21
+ DataFusionError::CommonError(e) => write!(f, "Ruby DataFusion error: {:?}", e),
22
+ }
23
+ }
24
+ }
25
+
26
+ impl From<ArrowError> for DataFusionError {
27
+ fn from(err: ArrowError) -> DataFusionError {
28
+ DataFusionError::ArrowError(err)
29
+ }
30
+ }
31
+
32
+ impl From<InnerDataFusionError> for DataFusionError {
33
+ fn from(err: InnerDataFusionError) -> DataFusionError {
34
+ DataFusionError::ExecutionError(err)
35
+ }
36
+ }
37
+
38
+ impl From<DataFusionError> for MagnusError {
39
+ fn from(err: DataFusionError) -> MagnusError {
40
+ MagnusError::new(datafusion_error(), err.to_string())
41
+ }
42
+ }
data/ext/datafusion_ruby/src/lib.rs CHANGED
@@ -1,11 +1,39 @@
1
- use magnus::{define_module, function, prelude::*, Error};
1
+ use magnus::{
2
+ define_module, exception::ExceptionClass, function, memoize, method, prelude::*, Error, RModule,
3
+ };
2
4
 
3
5
  mod context;
6
+ mod dataframe;
7
+ mod errors;
8
+ mod record_batch;
9
+ mod utils;
10
+
11
+ fn datafusion() -> RModule {
12
+ *memoize!(RModule: define_module("Datafusion").unwrap())
13
+ }
14
+
15
+ fn datafusion_error() -> ExceptionClass {
16
+ *memoize!(ExceptionClass: datafusion().define_error("Error", Default::default()).unwrap())
17
+ }
4
18
 
5
19
  #[magnus::init]
6
20
  fn init() -> Result<(), Error> {
7
- let module = define_module("Datafusion")?;
8
- let class = module.define_class("SessionContext", Default::default())?;
9
- class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
21
+ // ensure error is defined on load
22
+ datafusion_error();
23
+
24
+ let ctx_class = datafusion().define_class("SessionContext", Default::default())?;
25
+ ctx_class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
26
+ ctx_class.define_method(
27
+ "register_csv",
28
+ method!(context::RbSessionContext::register_csv, 2),
29
+ )?;
30
+ ctx_class.define_method("sql", method!(context::RbSessionContext::sql, 1))?;
31
+
32
+ let df_class = datafusion().define_class("DataFrame", Default::default())?;
33
+ df_class.define_method("collect", method!(dataframe::RbDataFrame::collect, 0))?;
34
+
35
+ let rb_class = datafusion().define_class("RecordBatch", Default::default())?;
36
+ rb_class.define_method("to_h", method!(record_batch::RbRecordBatch::to_hash, 0))?;
37
+
10
38
  Ok(())
11
39
  }
data/ext/datafusion_ruby/src/record_batch.rs ADDED
@@ -0,0 +1,56 @@
1
+ use datafusion::arrow::{
2
+ array::{Float64Array, Int64Array, StringArray},
3
+ datatypes::DataType,
4
+ record_batch::RecordBatch,
5
+ };
6
+ use magnus::{Error, Value};
7
+
8
+ use crate::errors::DataFusionError;
9
+ use std::collections::HashMap;
10
+
11
+ #[magnus::wrap(class = "Datafusion::RecordBatch")]
12
+ pub(crate) struct RbRecordBatch {
13
+ rb: RecordBatch,
14
+ }
15
+
16
+ impl RbRecordBatch {
17
+ pub(crate) fn new(rb: RecordBatch) -> Self {
18
+ Self { rb }
19
+ }
20
+
21
+ pub(crate) fn to_hash(&self) -> Result<HashMap<String, Vec<Value>>, Error> {
22
+ let mut columns_by_name: HashMap<String, Vec<Value>> = HashMap::new();
23
+ for (i, field) in self.rb.schema().fields().iter().enumerate() {
24
+ let column = self.rb.column(i);
25
+ columns_by_name.insert(
26
+ field.name().clone(),
27
+ match column.data_type() {
28
+ DataType::Int64 => {
29
+ let array = column.as_any().downcast_ref::<Int64Array>().unwrap();
30
+ array.values().iter().map(|v| (*v as i64).into()).collect()
31
+ }
32
+ DataType::Float64 => {
33
+ let array = column.as_any().downcast_ref::<Float64Array>().unwrap();
34
+ array.values().iter().map(|v| (*v as f64).into()).collect()
35
+ }
36
+ DataType::Utf8 => {
37
+ let array = column.as_any().downcast_ref::<StringArray>().unwrap();
38
+ let mut values: Vec<Value> = vec![];
39
+ for i in 0..(column.len()) {
40
+ values.push(std::string::String::from(array.value(i)).into())
41
+ }
42
+ values
43
+ }
44
+ unknown => {
45
+ return Err(DataFusionError::CommonError(format!(
46
+ "unhandle data type: {}",
47
+ unknown
48
+ ))
49
+ .into())
50
+ }
51
+ },
52
+ );
53
+ }
54
+ Ok(columns_by_name)
55
+ }
56
+ }
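`to_hash`, exposed to Ruby as `RecordBatch#to_h`, builds a map from each column name to the list of that column's values. The following plain-Ruby sketch shows one way a caller might reshape such a column-oriented hash into row hashes. The `columns` sample is hand-written for illustration and is not output from the gem, and `rows_from_columns` is a hypothetical helper.

```ruby
# Hypothetical sample in the column-oriented shape described above
# (column name => array of values); not produced by the gem.
columns = {
  "id"    => [1, 2, 3],
  "score" => [9.5, 7.25, 8.0],
  "name"  => ["a", "b", "c"]
}

# Hypothetical helper: turn { column => values } into an array of row hashes.
# Assumes every column holds the same number of values.
def rows_from_columns(columns)
  names = columns.keys
  columns.values.transpose.map { |values| names.zip(values).to_h }
end

rows = rows_from_columns(columns)
puts rows.first.inspect
```

The three sample columns mirror the three Arrow types the Rust code handles in this release (`Int64`, `Float64`, `Utf8`); any other column type raises an error from the `unknown` arm of the match.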
data/ext/datafusion_ruby/src/utils.rs ADDED
@@ -0,0 +1,11 @@
1
+ use std::future::Future;
2
+ use tokio::runtime::Runtime;
3
+
4
+ pub fn wait_for_future<F: Future>(f: F) -> F::Output
5
+ where
6
+ F: Send,
7
+ F::Output: Send,
8
+ {
9
+ let rt = Runtime::new().unwrap();
10
+ rt.block_on(f)
11
+ }
data/lib/datafusion/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Datafusion
2
- VERSION = "0.0.1"
2
+ VERSION = "0.0.2"
3
3
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: arrow-datafusion
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.2
5
5
  platform: ruby
6
6
  authors:
7
- - Datafusion Contrib Developers
7
+ - jychen7
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2022-07-03 00:00:00.000000000 Z
11
+ date: 2022-07-06 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rake
@@ -24,8 +24,7 @@ dependencies:
24
24
  - - ">"
25
25
  - !ruby/object:Gem::Version
26
26
  version: '1'
27
- description: DataFusion is an extensible query execution framework, written in Rust,
28
- that uses Apache Arrow as its in-memory format.
27
+ description: yet another Ruby bindings of Apache Arrow Datafusion
29
28
  email:
30
29
  executables: []
31
30
  extensions:
@@ -39,10 +38,14 @@ files:
39
38
  - ext/datafusion_ruby/Cargo.toml
40
39
  - ext/datafusion_ruby/Rakefile
41
40
  - ext/datafusion_ruby/src/context.rs
41
+ - ext/datafusion_ruby/src/dataframe.rs
42
+ - ext/datafusion_ruby/src/errors.rs
42
43
  - ext/datafusion_ruby/src/lib.rs
44
+ - ext/datafusion_ruby/src/record_batch.rs
45
+ - ext/datafusion_ruby/src/utils.rs
43
46
  - lib/datafusion.rb
44
47
  - lib/datafusion/version.rb
45
- homepage: https://github.com/datafusion-contrib/datafusion-ruby
48
+ homepage: https://github.com/jychen7/arrow-datafusion-ruby
46
49
  licenses:
47
50
  - Apache-2.0
48
51
  metadata: {}
@@ -54,7 +57,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
54
57
  requirements:
55
58
  - - ">="
56
59
  - !ruby/object:Gem::Version
57
- version: '0'
60
+ version: 2.6.0
58
61
  required_rubygems_version: !ruby/object:Gem::Requirement
59
62
  requirements:
60
63
  - - ">="
@@ -64,5 +67,5 @@ requirements: []
64
67
  rubygems_version: 3.1.6
65
68
  signing_key:
66
69
  specification_version: 4
67
- summary: Ruby bindings of Apache Arrow Datafusion
70
+ summary: yet another Ruby bindings of Apache Arrow Datafusion
68
71
  test_files: []