arrow-datafusion 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/README.md +34 -8
- data/arrow-datafusion.gemspec +5 -4
- data/ext/datafusion_ruby/Cargo.lock +3 -4
- data/ext/datafusion_ruby/Cargo.toml +3 -1
- data/ext/datafusion_ruby/src/context.rs +20 -0
- data/ext/datafusion_ruby/src/dataframe.rs +27 -0
- data/ext/datafusion_ruby/src/errors.rs +42 -0
- data/ext/datafusion_ruby/src/lib.rs +32 -4
- data/ext/datafusion_ruby/src/record_batch.rs +56 -0
- data/ext/datafusion_ruby/src/utils.rs +11 -0
- data/lib/datafusion/version.rb +1 -1
- metadata +11 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6e8ea4613597694b0e11d65afafd8d31a0230fada4d84c9ecec842afe09ef046
+  data.tar.gz: 9397ee0001f51084c1651c70fcae972b680b614e2548ab286e90c645187e388f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2f2c972c13edee286ce6aa71538b910794c3a1950e2f17be037ca2d3a6e03704b7e1135f11860b08ea8422c0cad1500ec027159548d31926ade5c78116c0b5cf
+  data.tar.gz: 5f2c0406f9519b192b270b86fb70bd432e4235059e46cc2e311d003b393c56e1f5e7c7f48601954ccb650dc1ec38cef36743bfa948a3335f73985b1fd96fa5cb
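The values in checksums.yaml are hex digests of the two archives packed inside the `.gem` file. As a rough sketch, a digest like these could be checked with Ruby's standard `digest` library. The helper name `digest_matches?` and the sample bytes are hypothetical; a real check would read `metadata.gz` or `data.tar.gz` from an unpacked gem.

```ruby
require "digest"

# Hypothetical helper: compare the digest of some bytes against an expected hex string.
def digest_matches?(bytes, expected_hex, algorithm: Digest::SHA256)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in payload; a real check would use something like File.binread("data.tar.gz").
sample = "hello"
expected = Digest::SHA256.hexdigest(sample)

puts digest_matches?(sample, expected)        # true: same bytes, same digest
puts digest_matches?(sample + "!", expected)  # false: any change alters the digest
```

Passing `algorithm: Digest::SHA512` would cover the second group of checksums in the same way.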
data/README.md
CHANGED
@@ -1,8 +1,8 @@
 # DataFusion in Ruby
 
-This is
+This is yet another Ruby library that binds to the [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
 
-
+This is an alternative to [datafusion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby). Please refer to the FAQ below.
 
 ## Quick Start
 
@@ -15,7 +15,7 @@ App
 ```ruby
 require "datafusion"
 
-ctx = Datafusion
+ctx = Datafusion::SessionContext.new
 ctx.register_csv("csv", "test.csv")
 ctx.sql("SELECT * FROM csv").collect
 ```
@@ -24,15 +24,15 @@ ctx.sql("SELECT * FROM csv").collect
 
 SessionContext
 - [x] new
-- [
-- [
+- [x] register_csv
+- [x] sql
 - [ ] register_parquet
 - [ ] register_record_batches
 - [ ] register_udf
 
 Dataframe
-- [
-- [
+- [x] new
+- [x] collect
 - [ ] schema
 - [ ] select_columns
 - [ ] select
@@ -46,4 +46,30 @@ Dataframe
 
 ## Contribution Guide
 
-Please see [Contribution Guide](CONTRIBUTING.md)
+Please see [Contribution Guide](CONTRIBUTING.md).
+
+## FAQ
+
+### Why another set of Ruby bindings for Arrow Datafusion?
+
+[datafusion-contrib/datafusion-python](https://github.com/datafusion-contrib/datafusion-python) is a set of `Rust -> Python` bindings built with [pyo3](https://github.com/PyO3/pyo3), and I want to use Arrow Datafusion in Ruby. So I created `Rust -> Ruby` bindings using [Magnus](https://github.com/matsadler/magnus).
+
+Besides Python, the Datafusion community also wants Java and other language bindings. To share development resources, [datafusion-contrib/datafusion-c](https://github.com/datafusion-contrib/datafusion-c) was created and will be used for [datafusion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) and other languages, e.g. `Rust -> C -> Ruby/Python/Java/etc`.
+
+So I just keep this `Rust -> Ruby` implementation as my side project.
+
+### Why Magnus?
+
+As of 2022-07, there are a few popular Ruby bindings for Rust: [Rutie](https://github.com/danielpclark/rutie), [Magnus](https://github.com/matsadler/magnus) and [other alternatives](https://github.com/matsadler/magnus#alternatives). Magnus was picked because its API seems cleaner and it seems clearer about safe vs unsafe. The author of Magnus has a "maybe bias" comparison in this [reddit thread](https://www.reddit.com/r/ruby/comments/uskibb/comment/i98rds4/?utm_source=share&utm_medium=web2x&context=3). The choice is subjective, and it should not take much effort if we decide to switch to different Ruby bindings for Rust in the future.
+
+### Why are the module name and gem name different?
+
+The module name `Datafusion` follows [datafusion](https://github.com/apache/arrow-datafusion) and [datafusion-python](https://github.com/datafusion-contrib/datafusion-python). The gem name `datafusion` [has been taken on rubygems.org since 2016](https://rubygems.org/gems/datafusion), so our gem is called `arrow-datafusion`.
+
+This is similar to the Ruby bindings of Arrow, whose gem is called [red-arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) while the module is called `arrow`.
+
+### What is the relationship between the gems "arrow-datafusion" and "red-arrow"?
+
+"arrow-datafusion" is the Ruby bindings of Arrow Datafusion (Rust). "red-arrow" is the Ruby bindings of Arrow (C++). To keep Datafusion Ruby simpler, I try not to couple core features with Red Arrow at the moment. If needed, we can add additional gems to support "red-arrow" in "arrow-datafusion", similar to how [red-parquet](https://github.com/apache/arrow/blob/2c7c12fd408339817f0322f137d25e9f60a87a26/ruby/red-parquet/red-parquet.gemspec#L44) uses red-arrow.
+
+ps: Datafusion Python was coupled with PyArrow. There is a proposal to separate them in the medium to long term. For details, please refer to [Can datafusion-python be used without pyarrow?](https://github.com/datafusion-contrib/datafusion-python/issues/22).
data/arrow-datafusion.gemspec
CHANGED
@@ -3,12 +3,12 @@ require_relative "lib/datafusion/version"
 Gem::Specification.new do |spec|
   spec.name = "arrow-datafusion"
   spec.version = Datafusion::VERSION
-  spec.authors = ["
-  spec.homepage = "https://github.com/
+  spec.authors = ["jychen7"]
+  spec.homepage = "https://github.com/jychen7/arrow-datafusion-ruby"
 
-  spec.summary = "Ruby bindings of Apache Arrow Datafusion"
+  spec.summary = "yet another Ruby bindings of Apache Arrow Datafusion"
   spec.description =
-    "
+    "yet another Ruby bindings of Apache Arrow Datafusion"
   spec.license = "Apache-2.0"
 
   spec.files = ["README.md", "#{spec.name}.gemspec", "LICENSE"]
@@ -19,4 +19,5 @@ Gem::Specification.new do |spec|
 
   # actually a build time dependency, but that's not an option.
   spec.add_runtime_dependency "rake", "> 1"
+  spec.required_ruby_version = ">= 2.6.0"
 end
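The `required_ruby_version` line added above is a RubyGems version constraint. A small sketch of how such a constraint is evaluated, using the standard `Gem::Requirement` and `Gem::Version` classes, is shown below. The constant `RUBY_REQUIREMENT` and the helper `ruby_supported?` are hypothetical names for illustration only.

```ruby
require "rubygems"

# The constraint string added to the gemspec in this release.
RUBY_REQUIREMENT = Gem::Requirement.new(">= 2.6.0")

# Hypothetical helper: would an interpreter with this version satisfy the constraint?
def ruby_supported?(version_string, requirement = RUBY_REQUIREMENT)
  requirement.satisfied_by?(Gem::Version.new(version_string))
end

puts ruby_supported?("2.5.9")       # false
puts ruby_supported?("2.6.0")       # true
puts ruby_supported?(RUBY_VERSION)  # depends on the interpreter running this script
```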
data/ext/datafusion_ruby/Cargo.lock
CHANGED
@@ -503,6 +503,7 @@ version = "0.0.1"
 dependencies = [
  "datafusion",
  "magnus",
+ "tokio",
 ]
 
 [[package]]
@@ -918,8 +919,7 @@ dependencies = [
 [[package]]
 name = "magnus"
 version = "0.3.2"
-source = "
-checksum = "983e15338a2e9644f804de8b5e52fb930bcd53b6859de4f4feb85753532b69d3"
+source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
 dependencies = [
  "bindgen",
  "magnus-macros",
@@ -928,8 +928,7 @@ dependencies = [
 [[package]]
 name = "magnus-macros"
 version = "0.1.0"
-source = "
-checksum = "27968fcabe2ef7e35b8331d71a62882282996f6222c133c0255cf7f33b8d9b48"
+source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
 dependencies = [
  "darling",
  "proc-macro2",
data/ext/datafusion_ruby/Cargo.toml
CHANGED
@@ -8,5 +8,7 @@ edition = "2018"
 crate-type = ["cdylib"]
 
 [dependencies]
-magnus
+# as of 2022-07, magnus v0.3.2 does NOT include "define_error" in RModule
+magnus = { git = "https://github.com/matsadler/magnus" }
 datafusion = { version = "^8.0.0" }
+tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread", "sync"] }
data/ext/datafusion_ruby/src/context.rs
CHANGED
@@ -1,4 +1,10 @@
 use datafusion::execution::context::SessionContext;
+use datafusion::prelude::CsvReadOptions;
+use magnus::Error;
+
+use crate::dataframe::RbDataFrame;
+use crate::errors::DataFusionError;
+use crate::utils::wait_for_future;
 
 #[magnus::wrap(class = "Datafusion::SessionContext")]
 pub(crate) struct RbSessionContext {
@@ -11,4 +17,18 @@ impl RbSessionContext {
             ctx: SessionContext::new(),
         }
     }
+
+    pub(crate) fn register_csv(&self, name: String, table_path: String) -> Result<(), Error> {
+        let result =
+            self.ctx
+                .register_csv(name.as_ref(), table_path.as_ref(), CsvReadOptions::new());
+        wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(())
+    }
+
+    pub(crate) fn sql(&self, query: String) -> Result<RbDataFrame, Error> {
+        let result = self.ctx.sql(query.as_ref());
+        let df = wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(RbDataFrame::new(df))
+    }
 }
data/ext/datafusion_ruby/src/dataframe.rs
ADDED
@@ -0,0 +1,27 @@
+use datafusion::dataframe::DataFrame;
+use magnus::Error;
+use std::sync::Arc;
+
+use crate::errors::DataFusionError;
+use crate::record_batch::RbRecordBatch;
+use crate::utils::wait_for_future;
+
+#[magnus::wrap(class = "Datafusion::DataFrame")]
+pub(crate) struct RbDataFrame {
+    df: Arc<DataFrame>,
+}
+
+impl RbDataFrame {
+    pub(crate) fn new(df: Arc<DataFrame>) -> Self {
+        Self { df }
+    }
+
+    pub(crate) fn collect(&self) -> Result<Vec<RbRecordBatch>, Error> {
+        let result = self.df.collect();
+        let batches = wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(batches
+            .into_iter()
+            .map(|batch| RbRecordBatch::new(batch))
+            .collect())
+    }
+}
data/ext/datafusion_ruby/src/errors.rs
ADDED
@@ -0,0 +1,42 @@
+use core::fmt;
+
+use datafusion::arrow::error::ArrowError;
+use datafusion::error::DataFusionError as InnerDataFusionError;
+use magnus::Error as MagnusError;
+
+use crate::datafusion_error;
+
+#[derive(Debug)]
+pub enum DataFusionError {
+    ExecutionError(InnerDataFusionError),
+    ArrowError(ArrowError),
+    CommonError(String),
+}
+
+impl fmt::Display for DataFusionError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match self {
+            DataFusionError::ExecutionError(e) => write!(f, "Rust DataFusion error: {:?}", e),
+            DataFusionError::ArrowError(e) => write!(f, "Rust Arrow error: {:?}", e),
+            DataFusionError::CommonError(e) => write!(f, "Ruby DataFusion error: {:?}", e),
+        }
+    }
+}
+
+impl From<ArrowError> for DataFusionError {
+    fn from(err: ArrowError) -> DataFusionError {
+        DataFusionError::ArrowError(err)
+    }
+}
+
+impl From<InnerDataFusionError> for DataFusionError {
+    fn from(err: InnerDataFusionError) -> DataFusionError {
+        DataFusionError::ExecutionError(err)
+    }
+}
+
+impl From<DataFusionError> for MagnusError {
+    fn from(err: DataFusionError) -> MagnusError {
+        MagnusError::new(datafusion_error(), err.to_string())
+    }
+}
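On the Ruby side, the conversions in errors.rs surface as a `Datafusion::Error` whose message starts with one of the three prefixes written by the `Display` implementation. The sketch below shows one way a caller might tell them apart. The helper `error_origin` is hypothetical, the stand-in class exists only so the example runs without the native extension, and the `StandardError` superclass is an assumption about what the extension registers.

```ruby
# Stand-in for the class the native extension defines; only used when the gem is not loaded.
# Assumption: the real Datafusion::Error descends from StandardError.
unless defined?(Datafusion::Error)
  module Datafusion
    class Error < StandardError; end
  end
end

# Hypothetical helper: map a message prefix from errors.rs to the layer that produced it.
def error_origin(message)
  case message
  when /\ARust DataFusion error: / then :datafusion
  when /\ARust Arrow error: /      then :arrow
  when /\ARuby DataFusion error: / then :binding
  else :unknown
  end
end

begin
  # Simulated failure; with the gem loaded this would come from a call such as ctx.sql or collect.
  raise Datafusion::Error, "Rust Arrow error: simulated"
rescue Datafusion::Error => e
  puts error_origin(e.message)  # prints "arrow"
end
```

Matching on message prefixes is fragile, so this is only a stopgap a caller might use while the gem exposes a single error class.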
data/ext/datafusion_ruby/src/lib.rs
CHANGED
@@ -1,11 +1,39 @@
-use magnus::{
+use magnus::{
+    define_module, exception::ExceptionClass, function, memoize, method, prelude::*, Error, RModule,
+};
 
 mod context;
+mod dataframe;
+mod errors;
+mod record_batch;
+mod utils;
+
+fn datafusion() -> RModule {
+    *memoize!(RModule: define_module("Datafusion").unwrap())
+}
+
+fn datafusion_error() -> ExceptionClass {
+    *memoize!(ExceptionClass: datafusion().define_error("Error", Default::default()).unwrap())
+}
 
 #[magnus::init]
 fn init() -> Result<(), Error> {
-
-
-
+    // ensure error is defined on load
+    datafusion_error();
+
+    let ctx_class = datafusion().define_class("SessionContext", Default::default())?;
+    ctx_class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
+    ctx_class.define_method(
+        "register_csv",
+        method!(context::RbSessionContext::register_csv, 2),
+    )?;
+    ctx_class.define_method("sql", method!(context::RbSessionContext::sql, 1))?;
+
+    let df_class = datafusion().define_class("DataFrame", Default::default())?;
+    df_class.define_method("collect", method!(dataframe::RbDataFrame::collect, 0))?;
+
+    let rb_class = datafusion().define_class("RecordBatch", Default::default())?;
+    rb_class.define_method("to_h", method!(record_batch::RbRecordBatch::to_hash, 0))?;
+
     Ok(())
 }
data/ext/datafusion_ruby/src/record_batch.rs
ADDED
@@ -0,0 +1,56 @@
+use datafusion::arrow::{
+    array::{Float64Array, Int64Array, StringArray},
+    datatypes::DataType,
+    record_batch::RecordBatch,
+};
+use magnus::{Error, Value};
+
+use crate::errors::DataFusionError;
+use std::collections::HashMap;
+
+#[magnus::wrap(class = "Datafusion::RecordBatch")]
+pub(crate) struct RbRecordBatch {
+    rb: RecordBatch,
+}
+
+impl RbRecordBatch {
+    pub(crate) fn new(rb: RecordBatch) -> Self {
+        Self { rb }
+    }
+
+    pub(crate) fn to_hash(&self) -> Result<HashMap<String, Vec<Value>>, Error> {
+        let mut columns_by_name: HashMap<String, Vec<Value>> = HashMap::new();
+        for (i, field) in self.rb.schema().fields().iter().enumerate() {
+            let column = self.rb.column(i);
+            columns_by_name.insert(
+                field.name().clone(),
+                match column.data_type() {
+                    DataType::Int64 => {
+                        let array = column.as_any().downcast_ref::<Int64Array>().unwrap();
+                        array.values().iter().map(|v| (*v as i64).into()).collect()
+                    }
+                    DataType::Float64 => {
+                        let array = column.as_any().downcast_ref::<Float64Array>().unwrap();
+                        array.values().iter().map(|v| (*v as f64).into()).collect()
+                    }
+                    DataType::Utf8 => {
+                        let array = column.as_any().downcast_ref::<StringArray>().unwrap();
+                        let mut values: Vec<Value> = vec![];
+                        for i in 0..(column.len()) {
+                            values.push(std::string::String::from(array.value(i)).into())
+                        }
+                        values
+                    }
+                    unknown => {
+                        return Err(DataFusionError::CommonError(format!(
+                            "unhandle data type: {}",
+                            unknown
+                        ))
+                        .into())
+                    }
+                },
+            );
+        }
+        Ok(columns_by_name)
+    }
+}
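`to_hash` (exposed to Ruby as `RecordBatch#to_h`) builds a hash keyed by column name, with one array of values per column. Callers that prefer row-oriented data could transpose it along the lines of the sketch below. The helper `columns_to_rows` and the sample data are hypothetical; the sample only mimics the shape implied by the Rust signature `HashMap<String, Vec<Value>>`.

```ruby
# Hypothetical helper: turn {"col" => [v1, v2, ...]} into [{"col" => v1}, {"col" => v2}, ...].
def columns_to_rows(columns_by_name)
  names = columns_by_name.keys
  return [] if names.empty?

  # All columns of one record batch have the same length, so the first one sets the row count.
  row_count = columns_by_name[names.first].length
  (0...row_count).map do |i|
    names.each_with_object({}) { |name, row| row[name] = columns_by_name[name][i] }
  end
end

# Made-up data shaped like a RecordBatch#to_h result for Int64, Float64 and Utf8 columns.
batch = { "id" => [1, 2], "score" => [0.5, 1.5], "name" => ["a", "b"] }

columns_to_rows(batch).each { |row| p row }
```

With the gem installed, the input would presumably come from something like `ctx.sql("SELECT * FROM csv").collect.first.to_h`, following the quick start in the README.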
data/lib/datafusion/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: arrow-datafusion
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.2
 platform: ruby
 authors:
--
+- jychen7
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-07-
+date: 2022-07-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -24,8 +24,7 @@ dependencies:
     - - ">"
       - !ruby/object:Gem::Version
         version: '1'
-description:
-  that uses Apache Arrow as its in-memory format.
+description: yet another Ruby bindings of Apache Arrow Datafusion
 email:
 executables: []
 extensions:
@@ -39,10 +38,14 @@ files:
 - ext/datafusion_ruby/Cargo.toml
 - ext/datafusion_ruby/Rakefile
 - ext/datafusion_ruby/src/context.rs
+- ext/datafusion_ruby/src/dataframe.rs
+- ext/datafusion_ruby/src/errors.rs
 - ext/datafusion_ruby/src/lib.rs
+- ext/datafusion_ruby/src/record_batch.rs
+- ext/datafusion_ruby/src/utils.rs
 - lib/datafusion.rb
 - lib/datafusion/version.rb
-homepage: https://github.com/
+homepage: https://github.com/jychen7/arrow-datafusion-ruby
 licenses:
 - Apache-2.0
 metadata: {}
@@ -54,7 +57,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: 2.6.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -64,5 +67,5 @@ requirements: []
 rubygems_version: 3.1.6
 signing_key:
 specification_version: 4
-summary: Ruby bindings of Apache Arrow Datafusion
+summary: yet another Ruby bindings of Apache Arrow Datafusion
 test_files: []