arrow-datafusion 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/README.md +34 -8
- data/arrow-datafusion.gemspec +5 -4
- data/ext/datafusion_ruby/Cargo.lock +3 -4
- data/ext/datafusion_ruby/Cargo.toml +3 -1
- data/ext/datafusion_ruby/src/context.rs +20 -0
- data/ext/datafusion_ruby/src/dataframe.rs +27 -0
- data/ext/datafusion_ruby/src/errors.rs +42 -0
- data/ext/datafusion_ruby/src/lib.rs +32 -4
- data/ext/datafusion_ruby/src/record_batch.rs +56 -0
- data/ext/datafusion_ruby/src/utils.rs +11 -0
- data/lib/datafusion/version.rb +1 -1
- metadata +11 -8
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6e8ea4613597694b0e11d65afafd8d31a0230fada4d84c9ecec842afe09ef046
+  data.tar.gz: 9397ee0001f51084c1651c70fcae972b680b614e2548ab286e90c645187e388f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2f2c972c13edee286ce6aa71538b910794c3a1950e2f17be037ca2d3a6e03704b7e1135f11860b08ea8422c0cad1500ec027159548d31926ade5c78116c0b5cf
+  data.tar.gz: 5f2c0406f9519b192b270b86fb70bd432e4235059e46cc2e311d003b393c56e1f5e7c7f48601954ccb650dc1ec38cef36743bfa948a3335f73985b1fd96fa5cb
data/README.md
CHANGED
@@ -1,8 +1,8 @@
 # DataFusion in Ruby
 
-This is
+This is yet another Ruby library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
 
-
+This is an alternative to [datafuion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby). Please refer to FAQ below.
 
 ## Quick Start
 
@@ -15,7 +15,7 @@ App
 ```ruby
 require "datafusion"
 
-ctx = Datafusion
+ctx = Datafusion::SessionContext.new
 ctx.register_csv("csv", "test.csv")
 ctx.sql("SELECT * FROM csv").collect
 ```
@@ -24,15 +24,15 @@ ctx.sql("SELECT * FROM csv").collect
 
 SessionContext
 - [x] new
-- [
-- [
+- [x] register_csv
+- [x] sql
 - [ ] register_parquet
 - [ ] register_record_batches
 - [ ] register_udf
 
 Dataframe
-- [
-- [
+- [x] new
+- [x] collect
 - [ ] schema
 - [ ] select_columns
 - [ ] select
@@ -46,4 +46,30 @@ Dataframe
 
 ## Contribution Guide
 
-Please see [Contribution Guide](CONTRIBUTING.md)
+Please see [Contribution Guide](CONTRIBUTING.md).
+
+## FAQ
+
+### Why another Ruby bindings for Arrow Datafusion?
+
+[datafuion-contrib/datafusion-python](https://github.com/datafusion-contrib/datafusion-python) is a `Rust -> Python` bindings using [pyo3](https://github.com/PyO3/pyo3) and I want to use Arrow Datafusion in Ruby. So I create a `Rust -> Ruby` bindings using [Magnus](https://github.com/matsadler/magnus).
+
+Other than Python, Datafusion Community also want to have Java and other language bindings. In order to share development resource, [datafuion-contrib/datafusion-c](https://github.com/datafusion-contrib/datafusion-c) is created and will be used for [datafuion-contrib/datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby) and other languages. E.g. `Rust -> C -> Ruby/Python/Java/etc`.
+
+So I just keep this `Rust -> Python` implementation as my side project.
+
+### Why Magnus?
+
+As of 2022-07, there are a few popular Ruby bindings for Rust, [Rutie](https://github.com/danielpclark/rutie), [Magnus](https://github.com/matsadler/magnus) and [other alternatives](https://github.com/matsadler/magnus#alternatives). Magnus is picked because its API seems cleaner and it seems more clear about safe vs unsafe. The author of Magnus have a "maybe bias" comparison in this [reddit thread](https://www.reddit.com/r/ruby/comments/uskibb/comment/i98rds4/?utm_source=share&utm_medium=web2x&context=3). It is totally subjective and it should not be large effort if we decides to switch to different Ruby bindings fr Rust in future.
+
+### Why the module name and gem name are different?
+
+The module name `Datafusion` follows the [datafusion](https://github.com/apache/arrow-datafusion) and [datafusion-python](https://github.com/datafusion-contrib/datafusion-python). The gem name `datafusion` [is occupied in rubygems.org at 2016](https://rubygems.org/gems/datafusion), so our gem is called `arrow-datafusion`.
+
+Similarly to the Ruby bindings of Arrow, its gem name is called [red-arrow](https://github.com/apache/arrow/tree/master/ruby/red-arrow) and the module is called `arrow`.
+
+### What is the relationship between gem "arrow-datafusion" and "red-arrow"?
+
+"arrow-datafusion" is the Ruby bindings of Arrow Datafusion (Rust). "red-arrow" is the Ruby bindings of Arrow (C++). To keep Datafusion Ruby simpler, I try not to couple with Red Arrow in core features at the moment. If need, we can add additional gems to support "red-arrow" in "arrow-datafusion", similar to how [red-parquet](https://github.com/apache/arrow/blob/2c7c12fd408339817f0322f137d25e9f60a87a26/ruby/red-parquet/red-parquet.gemspec#L44) use red-arrow.
+
+ps: Datafusion Python was coupled with PyArrow. There is a proposal to separate them in medium to long term. For detail, please refer to [Can datafusion-python be used without pyarrow?](https://github.com/datafusion-contrib/datafusion-python/issues/22).
data/arrow-datafusion.gemspec
CHANGED
@@ -3,12 +3,12 @@ require_relative "lib/datafusion/version"
 Gem::Specification.new do |spec|
   spec.name = "arrow-datafusion"
   spec.version = Datafusion::VERSION
-  spec.authors = ["
-  spec.homepage = "https://github.com/
+  spec.authors = ["jychen7"]
+  spec.homepage = "https://github.com/jychen7/arrow-datafusion-ruby"
 
-  spec.summary = "Ruby bindings of Apache Arrow Datafusion"
+  spec.summary = "yet another Ruby bindings of Apache Arrow Datafusion"
   spec.description =
-    "
+    "yet another Ruby bindings of Apache Arrow Datafusion"
   spec.license = "Apache-2.0"
 
   spec.files = ["README.md", "#{spec.name}.gemspec", "LICENSE"]
@@ -19,4 +19,5 @@ Gem::Specification.new do |spec|
 
   # actually a build time dependency, but that's not an option.
   spec.add_runtime_dependency "rake", "> 1"
+  spec.required_ruby_version = ">= 2.6.0"
 end
data/ext/datafusion_ruby/Cargo.lock
CHANGED
@@ -503,6 +503,7 @@ version = "0.0.1"
 dependencies = [
  "datafusion",
  "magnus",
+ "tokio",
 ]
 
 [[package]]
@@ -918,8 +919,7 @@ dependencies = [
 [[package]]
 name = "magnus"
 version = "0.3.2"
-source = "
-checksum = "983e15338a2e9644f804de8b5e52fb930bcd53b6859de4f4feb85753532b69d3"
+source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
 dependencies = [
  "bindgen",
  "magnus-macros",
@@ -928,8 +928,7 @@ dependencies = [
 [[package]]
 name = "magnus-macros"
 version = "0.1.0"
-source = "
-checksum = "27968fcabe2ef7e35b8331d71a62882282996f6222c133c0255cf7f33b8d9b48"
+source = "git+https://github.com/matsadler/magnus#3466dbb87c7f4ec589e07b343ad6843c65555928"
 dependencies = [
  "darling",
  "proc-macro2",
data/ext/datafusion_ruby/Cargo.toml
CHANGED
@@ -8,5 +8,7 @@ edition = "2018"
 crate-type = ["cdylib"]
 
 [dependencies]
-magnus
+# as of 2022-07, magnus v0.3.2 does NOT include "define_error" in RModule
+magnus = { git = "https://github.com/matsadler/magnus" }
 datafusion = { version = "^8.0.0" }
+tokio = { version = "1.0", features = ["macros", "rt", "rt-multi-thread", "sync"] }
data/ext/datafusion_ruby/src/context.rs
CHANGED
@@ -1,4 +1,10 @@
 use datafusion::execution::context::SessionContext;
+use datafusion::prelude::CsvReadOptions;
+use magnus::Error;
+
+use crate::dataframe::RbDataFrame;
+use crate::errors::DataFusionError;
+use crate::utils::wait_for_future;
 
 #[magnus::wrap(class = "Datafusion::SessionContext")]
 pub(crate) struct RbSessionContext {
@@ -11,4 +17,18 @@ impl RbSessionContext {
             ctx: SessionContext::new(),
         }
     }
+
+    pub(crate) fn register_csv(&self, name: String, table_path: String) -> Result<(), Error> {
+        let result =
+            self.ctx
+                .register_csv(name.as_ref(), table_path.as_ref(), CsvReadOptions::new());
+        wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(())
+    }
+
+    pub(crate) fn sql(&self, query: String) -> Result<RbDataFrame, Error> {
+        let result = self.ctx.sql(query.as_ref());
+        let df = wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(RbDataFrame::new(df))
+    }
 }
data/ext/datafusion_ruby/src/dataframe.rs
ADDED
@@ -0,0 +1,27 @@
+use datafusion::dataframe::DataFrame;
+use magnus::Error;
+use std::sync::Arc;
+
+use crate::errors::DataFusionError;
+use crate::record_batch::RbRecordBatch;
+use crate::utils::wait_for_future;
+
+#[magnus::wrap(class = "Datafusion::DataFrame")]
+pub(crate) struct RbDataFrame {
+    df: Arc<DataFrame>,
+}
+
+impl RbDataFrame {
+    pub(crate) fn new(df: Arc<DataFrame>) -> Self {
+        Self { df }
+    }
+
+    pub(crate) fn collect(&self) -> Result<Vec<RbRecordBatch>, Error> {
+        let result = self.df.collect();
+        let batches = wait_for_future(result).map_err(DataFusionError::from)?;
+        Ok(batches
+            .into_iter()
+            .map(|batch| RbRecordBatch::new(batch))
+            .collect())
+    }
+}
data/ext/datafusion_ruby/src/errors.rs
ADDED
@@ -0,0 +1,42 @@
+use core::fmt;
+
+use datafusion::arrow::error::ArrowError;
+use datafusion::error::DataFusionError as InnerDataFusionError;
+use magnus::Error as MagnusError;
+
+use crate::datafusion_error;
+
+#[derive(Debug)]
+pub enum DataFusionError {
+    ExecutionError(InnerDataFusionError),
+    ArrowError(ArrowError),
+    CommonError(String),
+}
+
+impl fmt::Display for DataFusionError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match self {
+            DataFusionError::ExecutionError(e) => write!(f, "Rust DataFusion error: {:?}", e),
+            DataFusionError::ArrowError(e) => write!(f, "Rust Arrow error: {:?}", e),
+            DataFusionError::CommonError(e) => write!(f, "Ruby DataFusion error: {:?}", e),
+        }
+    }
+}
+
+impl From<ArrowError> for DataFusionError {
+    fn from(err: ArrowError) -> DataFusionError {
+        DataFusionError::ArrowError(err)
+    }
+}
+
+impl From<InnerDataFusionError> for DataFusionError {
+    fn from(err: InnerDataFusionError) -> DataFusionError {
+        DataFusionError::ExecutionError(err)
+    }
+}
+
+impl From<DataFusionError> for MagnusError {
+    fn from(err: DataFusionError) -> MagnusError {
+        MagnusError::new(datafusion_error(), err.to_string())
+    }
+}
data/ext/datafusion_ruby/src/lib.rs
CHANGED
@@ -1,11 +1,39 @@
-use magnus::{
+use magnus::{
+    define_module, exception::ExceptionClass, function, memoize, method, prelude::*, Error, RModule,
+};
 
 mod context;
+mod dataframe;
+mod errors;
+mod record_batch;
+mod utils;
+
+fn datafusion() -> RModule {
+    *memoize!(RModule: define_module("Datafusion").unwrap())
+}
+
+fn datafusion_error() -> ExceptionClass {
+    *memoize!(ExceptionClass: datafusion().define_error("Error", Default::default()).unwrap())
+}
 
 #[magnus::init]
 fn init() -> Result<(), Error> {
-
-
-
+    // ensure error is defined on load
+    datafusion_error();
+
+    let ctx_class = datafusion().define_class("SessionContext", Default::default())?;
+    ctx_class.define_singleton_method("new", function!(context::RbSessionContext::new, 0))?;
+    ctx_class.define_method(
+        "register_csv",
+        method!(context::RbSessionContext::register_csv, 2),
+    )?;
+    ctx_class.define_method("sql", method!(context::RbSessionContext::sql, 1))?;
+
+    let df_class = datafusion().define_class("DataFrame", Default::default())?;
+    df_class.define_method("collect", method!(dataframe::RbDataFrame::collect, 0))?;
+
+    let rb_class = datafusion().define_class("RecordBatch", Default::default())?;
+    rb_class.define_method("to_h", method!(record_batch::RbRecordBatch::to_hash, 0))?;
+
     Ok(())
 }
data/ext/datafusion_ruby/src/record_batch.rs
ADDED
@@ -0,0 +1,56 @@
+use datafusion::arrow::{
+    array::{Float64Array, Int64Array, StringArray},
+    datatypes::DataType,
+    record_batch::RecordBatch,
+};
+use magnus::{Error, Value};
+
+use crate::errors::DataFusionError;
+use std::collections::HashMap;
+
+#[magnus::wrap(class = "Datafusion::RecordBatch")]
+pub(crate) struct RbRecordBatch {
+    rb: RecordBatch,
+}
+
+impl RbRecordBatch {
+    pub(crate) fn new(rb: RecordBatch) -> Self {
+        Self { rb }
+    }
+
+    pub(crate) fn to_hash(&self) -> Result<HashMap<String, Vec<Value>>, Error> {
+        let mut columns_by_name: HashMap<String, Vec<Value>> = HashMap::new();
+        for (i, field) in self.rb.schema().fields().iter().enumerate() {
+            let column = self.rb.column(i);
+            columns_by_name.insert(
+                field.name().clone(),
+                match column.data_type() {
+                    DataType::Int64 => {
+                        let array = column.as_any().downcast_ref::<Int64Array>().unwrap();
+                        array.values().iter().map(|v| (*v as i64).into()).collect()
+                    }
+                    DataType::Float64 => {
+                        let array = column.as_any().downcast_ref::<Float64Array>().unwrap();
+                        array.values().iter().map(|v| (*v as f64).into()).collect()
+                    }
+                    DataType::Utf8 => {
+                        let array = column.as_any().downcast_ref::<StringArray>().unwrap();
+                        let mut values: Vec<Value> = vec![];
+                        for i in 0..(column.len()) {
+                            values.push(std::string::String::from(array.value(i)).into())
+                        }
+                        values
+                    }
+                    unknown => {
+                        return Err(DataFusionError::CommonError(format!(
+                            "unhandle data type: {}",
+                            unknown
+                        ))
+                        .into())
+                    }
+                },
+            );
+        }
+        Ok(columns_by_name)
+    }
+}
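For readers of this diff: `to_hash` in `record_batch.rs` is registered on the Ruby side as `Datafusion::RecordBatch#to_h` and builds a column-oriented mapping (column name to an array of values). The following is a minimal sketch of how a caller might pivot that shape into rows. It does not load the gem; the `columns` literal is hypothetical stand-in data assumed to mirror what `to_h` returns, and `columns_to_rows` is an illustrative helper name that is not part of the package.

```ruby
# Hypothetical stand-in for the value of `batch.to_h`, assuming the
# column-oriented shape built by `to_hash`: column name => array of values.
columns = {
  "id"    => [1, 2, 3],
  "score" => [9.5, 7.25, 8.0],
  "name"  => ["a", "b", "c"],
}

# Illustrative helper (not part of the gem): pivot a column-oriented Hash
# into an Array of row Hashes. Assumes every column has the same length.
def columns_to_rows(columns)
  names = columns.keys
  return [] if names.empty?

  row_count = columns[names.first].length
  (0...row_count).map do |i|
    # Hash-building block form of to_h needs Ruby >= 2.6,
    # which matches the gem's required_ruby_version.
    names.to_h { |name| [name, columns[name][i]] }
  end
end

rows = columns_to_rows(columns)
puts rows.inspect
```

With the gem installed, the same helper could presumably be applied to each batch returned by `collect`, for example `ctx.sql("SELECT * FROM csv").collect.flat_map { |batch| columns_to_rows(batch.to_h) }`; that usage is an assumption based on the methods registered in `lib.rs`, not something the package documents.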
data/lib/datafusion/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: arrow-datafusion
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.2
 platform: ruby
 authors:
--
+- jychen7
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2022-07-
+date: 2022-07-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -24,8 +24,7 @@ dependencies:
     - - ">"
       - !ruby/object:Gem::Version
         version: '1'
-description:
-  that uses Apache Arrow as its in-memory format.
+description: yet another Ruby bindings of Apache Arrow Datafusion
 email:
 executables: []
 extensions:
@@ -39,10 +38,14 @@ files:
 - ext/datafusion_ruby/Cargo.toml
 - ext/datafusion_ruby/Rakefile
 - ext/datafusion_ruby/src/context.rs
+- ext/datafusion_ruby/src/dataframe.rs
+- ext/datafusion_ruby/src/errors.rs
 - ext/datafusion_ruby/src/lib.rs
+- ext/datafusion_ruby/src/record_batch.rs
+- ext/datafusion_ruby/src/utils.rs
 - lib/datafusion.rb
 - lib/datafusion/version.rb
-homepage: https://github.com/
+homepage: https://github.com/jychen7/arrow-datafusion-ruby
 licenses:
 - Apache-2.0
 metadata: {}
@@ -54,7 +57,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: 2.6.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -64,5 +67,5 @@ requirements: []
 rubygems_version: 3.1.6
 signing_key:
 specification_version: 4
-summary: Ruby bindings of Apache Arrow Datafusion
+summary: yet another Ruby bindings of Apache Arrow Datafusion
 test_files: []