graphreduce 0.2__tar.gz → 0.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {graphreduce-0.2 → graphreduce-0.4}/PKG-INFO +100 -9
- graphreduce-0.4/README.md +205 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce/graph_reduce.py +62 -10
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce/node.py +93 -13
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/PKG-INFO +100 -9
- graphreduce-0.4/graphreduce.egg-info/requires.txt +8 -0
- {graphreduce-0.2 → graphreduce-0.4}/setup.py +9 -8
- graphreduce-0.2/README.md +0 -114
- graphreduce-0.2/graphreduce.egg-info/requires.txt +0 -7
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce/__init__.py +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce/enum.py +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/SOURCES.txt +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/dependency_links.txt +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/not-zip-safe +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/top_level.txt +0 -0
- {graphreduce-0.2 → graphreduce-0.4}/setup.cfg +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: graphreduce
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.4
|
|
4
4
|
Summary: Leveraging graph data structures for complex feature engineering pipelines.
|
|
5
5
|
Home-page: https://github.com/wesmadrigal/graphreduce
|
|
6
6
|
Author: Wes Madrigal
|
|
@@ -23,7 +23,7 @@ Description-Content-Type: text/markdown
|
|
|
23
23
|
# GraphReduce
|
|
24
24
|
|
|
25
25
|
|
|
26
|
-
##
|
|
26
|
+
## Description
|
|
27
27
|
GraphReduce is an abstraction for building machine learning feature
|
|
28
28
|
engineering pipelines in a scalable, extensible, and production-ready way.
|
|
29
29
|
The library is intended to help bridge the gap between research feature
|
|
@@ -35,17 +35,108 @@ as edges.
|
|
|
35
35
|
GraphReduce allows for a unified feature engineering interface
|
|
36
36
|
to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
|
|
37
37
|
|
|
38
|
-
## Motivation
|
|
39
|
-
As the number of features in an ML experiment grows so does the likelihood
|
|
40
|
-
for duplicate, one off implementations of the same code. This is further
|
|
41
|
-
exacerbated if there isn't seamless integration between R&D and deployment.
|
|
42
|
-
Feature stores are a good solution, but they are quite complicated to setup
|
|
43
|
-
and manage. GraphReduce is a lighter weight design pattern to production ready
|
|
44
|
-
feature engineering pipelines.
|
|
45
38
|
|
|
46
39
|
### Installation
|
|
47
40
|
```
|
|
41
|
+
# from pypi
|
|
42
|
+
pip install graphreduce
|
|
43
|
+
|
|
44
|
+
# from github
|
|
48
45
|
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
|
|
46
|
+
|
|
47
|
+
# install from source
|
|
48
|
+
git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
## Motivation
|
|
54
|
+
Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
|
|
55
|
+
are disconnected. They can be represented as a graph, where tables
|
|
56
|
+
are nodes and join keys are edges. In many model building scenarios
|
|
57
|
+
there isn't a nice ML-ready vector waiting for us, so we must curate
|
|
58
|
+
the data by joining many tables together to flatten them into a vector.
|
|
59
|
+
This is the problem `graphreduce` sets out to solve.
|
|
60
|
+
|
|
61
|
+
An example dataset might look like the following:
|
|
62
|
+
|
|
63
|
+

|
|
64
|
+
|
|
65
|
+
## data granularity and time travel
|
|
66
|
+
But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
|
|
67
|
+
To further complicate things we need to handle orientation in time to prevent
|
|
68
|
+
[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets. All of this
|
|
69
|
+
is controlled in `graphreduce` from top-level parameters.
|
|
70
|
+
|
|
71
|
+
### example of granularity and time travel parameters
|
|
72
|
+
|
|
73
|
+
* `cut_date` controls the date around which we orient the data in the graph
|
|
74
|
+
* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
|
|
75
|
+
* `compute_period_unit` tells us what unit of time we're using
|
|
76
|
+
* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
|
|
77
|
+
```python
|
|
78
|
+
from graphreduce.graph_reduce import GraphReduce
|
|
79
|
+
from graphreduce.enum import PeriodUnit
|
|
80
|
+
|
|
81
|
+
gr = GraphReduce(
|
|
82
|
+
cut_date=datetime.datetime(2023, 2, 1),
|
|
83
|
+
compute_period_val=365,
|
|
84
|
+
compute_period_unit=PeriodUnit.day,
|
|
85
|
+
parent_node=customer
|
|
86
|
+
)
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
### Node definition and parameterization
|
|
90
|
+
GraphReduce favors convention over configuration, so the user
|
|
91
|
+
is required to define a number of methods on each node class:
|
|
92
|
+
* `do_annotate` annotation definitions (e.g., split a string column into a new column)
|
|
93
|
+
* `do_filters` filter the data on column(s)
|
|
94
|
+
* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
|
|
95
|
+
* `do_post_join_annotate` annotations on the current node after relations are merged in and we have access to their columns, too
|
|
96
|
+
* `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
|
|
97
|
+
* `do_labels` label definitions if any
|
|
98
|
+
At the instance level we need to parameterize a few things, such as where the
|
|
99
|
+
data is coming from, the date key, the primary key, prefixes for
|
|
100
|
+
preserving where the data originated after compute, and a few
|
|
101
|
+
other optional parameters.
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
from graphreduce.node import GraphReduceNode
|
|
105
|
+
|
|
106
|
+
# define the customer node
|
|
107
|
+
class CustomerNode(GraphReduceNode):
|
|
108
|
+
def do_annotate(self):
|
|
109
|
+
# use the `self.colabbr` function to use prefixes
|
|
110
|
+
self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
|
|
111
|
+
lambda x: 1 if x > 1000.00 else 0
|
|
112
|
+
)
|
|
113
|
+
|
|
114
|
+
|
|
115
|
+
def do_filters(self):
|
|
116
|
+
self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
|
|
117
|
+
|
|
118
|
+
def do_clip_cols(self):
|
|
119
|
+
self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
|
|
120
|
+
lambda col: 1000 if col > 1000 else col
|
|
121
|
+
)
|
|
122
|
+
|
|
123
|
+
def do_post_join_annotate(self):
|
|
124
|
+
# annotations after children are joined
|
|
125
|
+
pass
|
|
126
|
+
|
|
127
|
+
def do_reduce(self, reduce_key):
|
|
128
|
+
pass
|
|
129
|
+
|
|
130
|
+
def do_labels(self, reduce_key):
|
|
131
|
+
pass
|
|
132
|
+
|
|
133
|
+
cust = CustomerNode(
|
|
134
|
+
fpath='s3://somebucket/some/path/customer.parquet',
|
|
135
|
+
fmt='parquet',
|
|
136
|
+
prefix='cust',
|
|
137
|
+
date_key='last_login',
|
|
138
|
+
pk='customer_id'
|
|
139
|
+
)
|
|
49
140
|
```
|
|
50
141
|
|
|
51
142
|
## Usage
|
|
@@ -0,0 +1,205 @@
|
|
|
1
|
+
# GraphReduce
|
|
2
|
+
|
|
3
|
+
|
|
4
|
+
## Description
|
|
5
|
+
GraphReduce is an abstraction for building machine learning feature
|
|
6
|
+
engineering pipelines in a scalable, extensible, and production-ready way.
|
|
7
|
+
The library is intended to help bridge the gap between research feature
|
|
8
|
+
definitions and production deployment without the overhead of a full
|
|
9
|
+
feature store. Underneath the hood, GraphReduce uses graph data
|
|
10
|
+
structures to represent tables/files as nodes and foreign keys
|
|
11
|
+
as edges.
|
|
12
|
+
|
|
13
|
+
GraphReduce allows for a unified feature engineering interface
|
|
14
|
+
to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
|
|
15
|
+
|
|
16
|
+
|
|
17
|
+
### Installation
|
|
18
|
+
```
|
|
19
|
+
# from pypi
|
|
20
|
+
pip install graphreduce
|
|
21
|
+
|
|
22
|
+
# from github
|
|
23
|
+
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
|
|
24
|
+
|
|
25
|
+
# install from source
|
|
26
|
+
git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
|
|
27
|
+
```
|
|
28
|
+
|
|
29
|
+
|
|
30
|
+
|
|
31
|
+
## Motivation
|
|
32
|
+
Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
|
|
33
|
+
are disconnected. They can be represented as a graph, where tables
|
|
34
|
+
are nodes and join keys are edges. In many model building scenarios
|
|
35
|
+
there isn't a nice ML-ready vector waiting for us, so we must curate
|
|
36
|
+
the data by joining many tables together to flatten them into a vector.
|
|
37
|
+
This is the problem `graphreduce` sets out to solve.
|
|
38
|
+
|
|
39
|
+
An example dataset might look like the following:
|
|
40
|
+
|
|
41
|
+

|
|
42
|
+
|
|
43
|
+
## data granularity and time travel
|
|
44
|
+
But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
|
|
45
|
+
To further complicate things we need to handle orientation in time to prevent
|
|
46
|
+
[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets. All of this
|
|
47
|
+
is controlled in `graphreduce` from top-level parameters.
|
|
48
|
+
|
|
49
|
+
### example of granularity and time travel parameters
|
|
50
|
+
|
|
51
|
+
* `cut_date` controls the date around which we orient the data in the graph
|
|
52
|
+
* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
|
|
53
|
+
* `compute_period_unit` tells us what unit of time we're using
|
|
54
|
+
* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
|
|
55
|
+
```python
|
|
56
|
+
from graphreduce.graph_reduce import GraphReduce
|
|
57
|
+
from graphreduce.enum import PeriodUnit
|
|
58
|
+
|
|
59
|
+
gr = GraphReduce(
|
|
60
|
+
cut_date=datetime.datetime(2023, 2, 1),
|
|
61
|
+
compute_period_val=365,
|
|
62
|
+
compute_period_unit=PeriodUnit.day,
|
|
63
|
+
parent_node=customer
|
|
64
|
+
)
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
### Node definition and parameterization
|
|
68
|
+
GraphReduce favors convention over configuration, so the user
|
|
69
|
+
is required to define a number of methods on each node class:
|
|
70
|
+
* `do_annotate` annotation definitions (e.g., split a string column into a new column)
|
|
71
|
+
* `do_filters` filter the data on column(s)
|
|
72
|
+
* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
|
|
73
|
+
* `do_post_join_annotate` annotations on the current node after relations are merged in and we have access to their columns, too
|
|
74
|
+
* `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
|
|
75
|
+
* `do_labels` label definitions if any
|
|
76
|
+
At the instance level we need to parameterize a few things, such as where the
|
|
77
|
+
data is coming from, the date key, the primary key, prefixes for
|
|
78
|
+
preserving where the data originated after compute, and a few
|
|
79
|
+
other optional parameters.
|
|
80
|
+
|
|
81
|
+
```python
|
|
82
|
+
from graphreduce.node import GraphReduceNode
|
|
83
|
+
|
|
84
|
+
# define the customer node
|
|
85
|
+
class CustomerNode(GraphReduceNode):
|
|
86
|
+
def do_annotate(self):
|
|
87
|
+
# use the `self.colabbr` function to use prefixes
|
|
88
|
+
self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
|
|
89
|
+
lambda x: 1 if x > 1000.00 else 0
|
|
90
|
+
)
|
|
91
|
+
|
|
92
|
+
|
|
93
|
+
def do_filters(self):
|
|
94
|
+
self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
|
|
95
|
+
|
|
96
|
+
def do_clip_cols(self):
|
|
97
|
+
self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
|
|
98
|
+
lambda col: 1000 if col > 1000 else col
|
|
99
|
+
)
|
|
100
|
+
|
|
101
|
+
def do_post_join_annotate(self):
|
|
102
|
+
# annotations after children are joined
|
|
103
|
+
pass
|
|
104
|
+
|
|
105
|
+
def do_reduce(self, reduce_key):
|
|
106
|
+
pass
|
|
107
|
+
|
|
108
|
+
def do_labels(self, reduce_key):
|
|
109
|
+
pass
|
|
110
|
+
|
|
111
|
+
cust = CustomerNode(
|
|
112
|
+
fpath='s3://somebucket/some/path/customer.parquet',
|
|
113
|
+
fmt='parquet',
|
|
114
|
+
prefix='cust',
|
|
115
|
+
date_key='last_login',
|
|
116
|
+
pk='customer_id'
|
|
117
|
+
)
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
## Usage
|
|
121
|
+
|
|
122
|
+
### Pandas
|
|
123
|
+
```python
|
|
124
|
+
from graphreduce.node import GraphReduceNode
|
|
125
|
+
from graphreduce.graph_reduce import GraphReduce
|
|
126
|
+
|
|
127
|
+
class NodeA(GraphReduceNode):
|
|
128
|
+
def do_annotate(self):
|
|
129
|
+
pass
|
|
130
|
+
|
|
131
|
+
def do_filters(self):
|
|
132
|
+
pass
|
|
133
|
+
|
|
134
|
+
def do_clip_cols(self):
|
|
135
|
+
pass
|
|
136
|
+
|
|
137
|
+
def do_slice_data(self):
|
|
138
|
+
pass
|
|
139
|
+
|
|
140
|
+
def do_post_join_annotate(self):
|
|
141
|
+
import uuid
|
|
142
|
+
self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
|
|
143
|
+
|
|
144
|
+
def do_reduce(self, key):
|
|
145
|
+
pass
|
|
146
|
+
|
|
147
|
+
def do_labels(self, key):
|
|
148
|
+
pass
|
|
149
|
+
|
|
150
|
+
class NodeB(GraphReduceNode):
|
|
151
|
+
def do_annotate(self):
|
|
152
|
+
pass
|
|
153
|
+
|
|
154
|
+
def do_filters(self):
|
|
155
|
+
pass
|
|
156
|
+
|
|
157
|
+
def do_clip_cols(self):
|
|
158
|
+
pass
|
|
159
|
+
|
|
160
|
+
def do_slice_data(self):
|
|
161
|
+
pass
|
|
162
|
+
|
|
163
|
+
def do_post_join_annotate(self):
|
|
164
|
+
import uuid
|
|
165
|
+
self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
|
|
166
|
+
|
|
167
|
+
def do_reduce(self, key):
|
|
168
|
+
return self.prep_for_features().groupby(self.colabbr(key)).agg(
|
|
169
|
+
**{
|
|
170
|
+
self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
|
|
171
|
+
self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
|
|
172
|
+
self.colabbr(f'{self.pk}_max'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
|
|
173
|
+
self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
|
|
174
|
+
self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
|
|
175
|
+
}
|
|
176
|
+
).reset_index()
|
|
177
|
+
|
|
178
|
+
def do_labels(self, key):
|
|
179
|
+
pass
|
|
180
|
+
|
|
181
|
+
nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
|
|
182
|
+
nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
|
|
183
|
+
|
|
184
|
+
gr = GraphReduce(
|
|
185
|
+
cut_date=datetime.datetime(2023,5,1),
|
|
186
|
+
parent_node=nodea,
|
|
187
|
+
compute_layer=ComputeLayerEnum.pandas
|
|
188
|
+
)
|
|
189
|
+
|
|
190
|
+
gr.add_entity_edge(
|
|
191
|
+
parent_node=nodea,
|
|
192
|
+
relation_node=nodeb,
|
|
193
|
+
parent_key='id',
|
|
194
|
+
relation_key='nodea_foreign_key_id',
|
|
195
|
+
relation_type='parent_child',
|
|
196
|
+
reduce=True
|
|
197
|
+
)
|
|
198
|
+
|
|
199
|
+
# plot the graph to see what compute graph will run
|
|
200
|
+
# note, you may have to open this file in a browser
|
|
201
|
+
gr.plot_graph(fname='demo_graph.html')
|
|
202
|
+
|
|
203
|
+
# perform all transformations
|
|
204
|
+
gr.do_transformations()
|
|
205
|
+
```
|
|
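The `cut_date`, `compute_period_val`, and `compute_period_unit` parameters described in the README above can be read as defining a lookback window that ends at the cut date. The sketch below is only an illustration of that reading, using the standard library: `lookback_window` and `in_window` are hypothetical helpers, not part of `graphreduce`, and the choice to exclude rows at or after the cut date is an assumption made here to model leakage prevention.

```python
import datetime

# Hypothetical unit table; graphreduce's own PeriodUnit enum is not used here.
UNITS = {
    "day": datetime.timedelta(days=1),
    "week": datetime.timedelta(weeks=1),
}


def lookback_window(cut_date, period_val, period_unit="day"):
    """Return (start, end) of the history considered before cut_date."""
    start = cut_date - period_val * UNITS[period_unit]
    return start, cut_date


def in_window(ts, cut_date, period_val, period_unit="day"):
    """True when ts falls inside the lookback window."""
    start, end = lookback_window(cut_date, period_val, period_unit)
    # Assumption: the end bound is exclusive, so nothing dated at or
    # after the cut date can leak into the features.
    return start <= ts < end


cut = datetime.datetime(2023, 2, 1)
start, end = lookback_window(cut, 365)
```

With a cut date of 2023-02-01 and 365 days of history, rows from the previous twelve months qualify and anything on or after the cut date does not.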
@@ -40,6 +40,14 @@ class GraphReduce(nx.DiGraph):
|
|
|
40
40
|
spark_sqlCtx : pyspark.sql.SQLContext = None,
|
|
41
41
|
feature_function : typing.Optional[str] = None,
|
|
42
42
|
dynamic_propagation : bool = False,
|
|
43
|
+
type_func_map : typing.Dict[str, typing.List[str]] = {
|
|
44
|
+
'int64' : ['min', 'max', 'sum'],
|
|
45
|
+
'str' : ['first'],
|
|
46
|
+
'object' : ['first'],
|
|
47
|
+
'float64' : ['min', 'max', 'sum'],
|
|
48
|
+
'bool' : ['first'],
|
|
49
|
+
'datetime64' : ['first']
|
|
50
|
+
},
|
|
43
51
|
*args,
|
|
44
52
|
**kwargs
|
|
45
53
|
):
|
|
@@ -60,6 +68,7 @@ Args:
|
|
|
60
68
|
spark_sqlCtx : if compute layer is spark this must be passed
|
|
61
69
|
feature_function : optional custom feature function
|
|
62
70
|
dynamic_propagation : optional to dynamically propagate children data upward, useful for very large compute graphs
|
|
71
|
+
type_func_map : optional mapping from type to a list of functions (e.g., {'int' : ['min', 'max', 'sum'], 'str' : ['first']})
|
|
63
72
|
"""
|
|
64
73
|
super(GraphReduce, self).__init__(*args, **kwargs)
|
|
65
74
|
|
|
@@ -75,6 +84,7 @@ Args:
|
|
|
75
84
|
self.compute_layer = compute_layer
|
|
76
85
|
self.feature_function = feature_function
|
|
77
86
|
self.dynamic_propagation = dynamic_propagation
|
|
87
|
+
self.type_func_map = type_func_map
|
|
78
88
|
|
|
79
89
|
# if using Spark
|
|
80
90
|
self._sqlCtx = spark_sqlCtx
|
|
@@ -293,21 +303,37 @@ Args
|
|
|
293
303
|
nt.from_nx(stringG)
|
|
294
304
|
logger.info(f"plotted graph at {fname}")
|
|
295
305
|
nt.show(fname)
|
|
296
|
-
|
|
297
|
-
|
|
306
|
+
|
|
307
|
+
|
|
308
|
+
def prefix_uniqueness(self):
|
|
309
|
+
"""
|
|
310
|
+
Identify children with duplicate prefixes, if any
|
|
311
|
+
"""
|
|
312
|
+
prefixes = {}
|
|
313
|
+
dupes = []
|
|
314
|
+
for node in self.nodes():
|
|
315
|
+
if not prefixes.get(node.prefix):
|
|
316
|
+
prefixes[node.prefix] = node
|
|
317
|
+
else:
|
|
318
|
+
dupes.append(node)
|
|
319
|
+
dupes.append(prefixes[node.prefix])
|
|
320
|
+
if len(dupes):
|
|
321
|
+
raise Exception(f"duplicate prefix on the following nodes: {dupes}")
|
|
322
|
+
|
|
298
323
|
|
|
299
324
|
def do_transformations(self):
|
|
300
325
|
"""
|
|
301
326
|
Perform all graph transformations
|
|
302
327
|
1) hydrate graph
|
|
303
|
-
2)
|
|
304
|
-
3)
|
|
305
|
-
4)
|
|
306
|
-
5)
|
|
307
|
-
|
|
308
|
-
|
|
309
|
-
|
|
310
|
-
|
|
328
|
+
2) check for duplicate prefixes
|
|
329
|
+
3) filter data
|
|
330
|
+
4) clip anomalies
|
|
331
|
+
5) annotate data
|
|
332
|
+
6) depth-first edge traversal to: aggregate / reduce features and labels
|
|
333
|
+
6a) optional alternative feature_function mapping
|
|
334
|
+
6b) join back to parent edge
|
|
335
|
+
6c) post-join annotations if any
|
|
336
|
+
7) repeat step 6 on all edges up the hierarchy
|
|
311
337
|
"""
|
|
312
338
|
|
|
313
339
|
# get data, filter data, clip columns, and annotate
|
|
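The duplicate-prefix check added in this hunk can be illustrated outside the library. The following is a minimal sketch, assuming nodes only need a `prefix` attribute; `Node` and `find_duplicate_prefixes` are hypothetical names introduced for illustration and are not the package's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node with a column prefix."""
    name: str
    prefix: str


def find_duplicate_prefixes(nodes):
    """Group nodes by prefix and keep only prefixes used more than once."""
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node.prefix].append(node)
    return {p: ns for p, ns in by_prefix.items() if len(ns) > 1}


nodes = [
    Node("customers", "cust"),
    Node("orders", "ord"),
    Node("customer_notes", "cust"),
]
dupes = find_duplicate_prefixes(nodes)
if dupes:
    print(f"duplicate prefixes: {sorted(dupes)}")
```

Grouping by prefix lists each offending node once, even when three or more nodes share a prefix, which may make the resulting error message easier to read.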
@@ -315,6 +341,9 @@ Perform all graph transformations
|
|
|
315
341
|
self.hydrate_graph_attrs()
|
|
316
342
|
logger.info("hydrating graph data")
|
|
317
343
|
self.hydrate_graph_data()
|
|
344
|
+
|
|
345
|
+
logger.info("checking for prefix uniqueness")
|
|
346
|
+
self.prefix_uniqueness()
|
|
318
347
|
|
|
319
348
|
for node in self.nodes():
|
|
320
349
|
logger.info(f"running filters, clip cols, and annotations for {node.__class__.__name__}")
|
|
@@ -333,6 +362,29 @@ Perform all graph transformations
|
|
|
333
362
|
if edge_data['reduce']:
|
|
334
363
|
logger.info(f"reducing relation {relation_node.__class__.__name__}")
|
|
335
364
|
join_df = relation_node.do_reduce(edge_data['relation_key'])
|
|
365
|
+
# only relevant when reducing
|
|
366
|
+
if self.dynamic_propagation:
|
|
367
|
+
logger.info(f"doing dynamic propagation on node {relation_node.__class__.__name__}")
|
|
368
|
+
child_df = relation_node.dynamic_propagation(
|
|
369
|
+
reduce_key=edge_data['relation_key'],
|
|
370
|
+
type_func_map=self.type_func_map,
|
|
371
|
+
compute_layer=self.compute_layer
|
|
372
|
+
)
|
|
373
|
+
# NOTE: this is pandas specific and will break
|
|
374
|
+
# on other compute layers for now
|
|
375
|
+
if self.compute_layer in [ComputeLayerEnum.pandas, ComputeLayerEnum.dask]:
|
|
376
|
+
join_df = join_df.merge(
|
|
377
|
+
child_df,
|
|
378
|
+
on=relation_node.colabbr(edge_data['relation_key']),
|
|
379
|
+
suffixes=('', '_dupe')
|
|
380
|
+
)
|
|
381
|
+
elif self.compute_layer == ComputeLayerEnum.spark:
|
|
382
|
+
join_df = join_df.join(
|
|
383
|
+
child_df,
|
|
384
|
+
on=join_df[relation_node.colabbr(edge_data['relation_key'])] == child_df[relation_node.colabbr(edge_data['relation_key'])],
|
|
385
|
+
how="left"
|
|
386
|
+
)
|
|
387
|
+
|
|
336
388
|
elif not edge_data['reduce'] and self.feature_function:
|
|
337
389
|
logger.info(f"not reducing relation {relation_node.__class__.__name__}")
|
|
338
390
|
join_df = getattr(relation_node, self.feature_function)()
|
|
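The reduce-then-join step in this hunk (aggregate a relation to one row per key, then merge it onto its parent) can be sketched with plain dictionaries. This is an illustration under simplifying assumptions, not the library's code: rows are dicts, the only aggregation is a count, and `reduce_child` and `join_to_parent` are hypothetical helpers.

```python
from collections import defaultdict


def reduce_child(rows, key, prefix):
    """Aggregate child rows to one feature dict per key (count only)."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return {k: {f"{prefix}_count": n} for k, n in counts.items()}


def join_to_parent(parent_rows, parent_key, reduced):
    """Left-join reduced child features onto the parent rows."""
    joined = []
    for row in parent_rows:
        # Parents without children keep their own columns only.
        features = reduced.get(row[parent_key], {})
        joined.append({**row, **features})
    return joined


customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}]
orders = [
    {"ord_customer_id": 1},
    {"ord_customer_id": 1},
    {"ord_customer_id": 2},
]
reduced = reduce_child(orders, "ord_customer_id", "ord")
flat = join_to_parent(customers, "cust_id", reduced)
```

Each parent row ends up at the parent's granularity with the child's aggregated columns attached, which is the shape the README calls a flattened vector.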
@@ -117,7 +117,7 @@ Get some data
|
|
|
117
117
|
self.df.columns = [f"{self.prefix}_{c}" for c in self.df.columns]
|
|
118
118
|
elif self.compute_layer.value == 'spark':
|
|
119
119
|
if not hasattr(self, 'df') or (hasattr(self, 'df') and not isinstance(self.df, pyspark.sql.DataFrame)):
|
|
120
|
-
self.df = getattr(self.spark_sqlctx.read, {self.fmt})(self.fpath)
|
|
120
|
+
self.df = getattr(self.spark_sqlctx.read, f"{self.fmt}")(self.fpath)
|
|
121
121
|
for c in self.df.columns:
|
|
122
122
|
self.df = self.df.withColumnRenamed(c, f"{self.prefix}_{c}")
|
|
123
123
|
|
|
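The one-line change in this hunk replaces `{self.fmt}` with `f"{self.fmt}"` as the attribute name passed to `getattr`. A short sketch of why that matters follows; `FakeReader` is a hypothetical stand-in for a reader object that exposes one method per file format.

```python
class FakeReader:
    """Hypothetical reader exposing one method per file format."""

    def parquet(self, path):
        return f"parquet:{path}"


reader = FakeReader()
fmt = "parquet"

# A string attribute name looks the method up dynamically.
loaded = getattr(reader, fmt)("data.parquet")

# Bare braces build a set literal, which getattr rejects as a name.
try:
    getattr(reader, {fmt})
    error = None
except TypeError as exc:
    error = exc
```

When `fmt` is already a string, `f"{fmt}"` and `fmt` are equivalent, so passing the attribute directly would presumably work as well.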
@@ -134,30 +134,110 @@ do some filters on the data
|
|
|
134
134
|
|
|
135
135
|
@abc.abstractmethod
|
|
136
136
|
def do_annotate(self):
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
137
|
+
"""
|
|
138
|
+
Implement custom annotation functionality
|
|
139
|
+
for annotating this particular data
|
|
140
|
+
"""
|
|
141
141
|
return
|
|
142
142
|
|
|
143
143
|
|
|
144
144
|
@abc.abstractmethod
|
|
145
145
|
def do_post_join_annotate(self):
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
146
|
+
"""
|
|
147
|
+
Implement custom annotation functionality
|
|
148
|
+
for annotating data after joining with
|
|
149
|
+
child data
|
|
150
|
+
"""
|
|
151
151
|
pass
|
|
152
152
|
|
|
153
153
|
|
|
154
154
|
@abc.abstractmethod
|
|
155
155
|
def do_clip_cols(self):
|
|
156
156
|
return
|
|
157
|
-
|
|
157
|
+
|
|
158
|
+
|
|
159
|
+
def dynamic_propagation (
|
|
160
|
+
self,
|
|
161
|
+
reduce_key : str,
|
|
162
|
+
type_func_map : dict = {},
|
|
163
|
+
compute_layer : ComputeLayerEnum = ComputeLayerEnum.pandas,
|
|
164
|
+
):
|
|
165
|
+
"""
|
|
166
|
+
If we're doing dynamic propagation
|
|
167
|
+
this function will run a series of
|
|
168
|
+
automatic aggregations
|
|
169
|
+
"""
|
|
170
|
+
if compute_layer == ComputeLayerEnum.pandas:
|
|
171
|
+
return self.pandas_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
|
|
172
|
+
elif compute_layer == ComputeLayerEnum.dask:
|
|
173
|
+
return self.dask_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
|
|
174
|
+
elif compute_layer == ComputeLayerEnum.spark:
|
|
175
|
+
return self.spark_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
|
|
176
|
+
|
|
177
|
+
|
|
178
|
+
def pandas_dynamic_propagation (
|
|
179
|
+
self,
|
|
180
|
+
reduce_key : str,
|
|
181
|
+
type_func_map : dict = {}
|
|
182
|
+
) -> pd.DataFrame:
|
|
183
|
+
"""
|
|
184
|
+
Pandas implementation of dynamic propagation of features
|
|
185
|
+
This could be extended slightly to perform automated feature
|
|
186
|
+
aggregation on dynamic nodes
|
|
187
|
+
"""
|
|
188
|
+
agg_funcs = {}
|
|
189
|
+
for col, _type in dict(self.df.dtypes).items():
|
|
190
|
+
_type = str(_type)
|
|
191
|
+
if type_func_map.get(_type):
|
|
192
|
+
for func in type_func_map[_type]:
|
|
193
|
+
col_new = f"{col}_{func}"
|
|
194
|
+
agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
|
|
195
|
+
return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
|
|
196
|
+
**agg_funcs
|
|
197
|
+
).reset_index()
|
|
198
|
+
|
|
199
|
+
|
|
200
|
+
def dask_dynamic_propagation (
|
|
201
|
+
self,
|
|
202
|
+
reduce_key : str,
|
|
203
|
+
type_func_map : dict = {},
|
|
204
|
+
) -> dd.DataFrame:
|
|
205
|
+
"""
|
|
206
|
+
Dask implementation of dynamic propagation of features
|
|
207
|
+
This could be extended slightly to perform automated
|
|
208
|
+
feature aggregation on dynamic nodes
|
|
209
|
+
"""
|
|
210
|
+
agg_funcs = {}
|
|
211
|
+
for col, _type in dict(self.df.dtypes).items():
|
|
212
|
+
_type = str(_type)
|
|
213
|
+
if type_func_map.get(_type):
|
|
214
|
+
for func in type_func_map[_type]:
|
|
215
|
+
col_new = f"{col}_{func}"
|
|
216
|
+
agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
|
|
217
|
+
return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
|
|
218
|
+
**agg_funcs
|
|
219
|
+
).reset_index()
|
|
220
|
+
|
|
221
|
+
|
|
222
|
+
def spark_dynamic_propagation (
|
|
223
|
+
self,
|
|
224
|
+
reduce_key : str,
|
|
225
|
+
type_func_map : dict = {},
|
|
226
|
+
) -> pyspark.sql.DataFrame:
|
|
227
|
+
"""
|
|
228
|
+
Spark implementation of dynamic propagation of features
|
|
229
|
+
This could be extended slightly to perform automated
|
|
230
|
+
feature aggregation on dynamic nodes
|
|
231
|
+
"""
|
|
232
|
+
agg_funcs = {}
|
|
233
|
+
pass
|
|
234
|
+
|
|
158
235
|
|
|
159
236
|
@abc.abstractmethod
|
|
160
|
-
def do_reduce(
|
|
237
|
+
def do_reduce (
|
|
238
|
+
self,
|
|
239
|
+
reduce_key
|
|
240
|
+
):
|
|
161
241
|
"""
|
|
162
242
|
Reduce operation or the node
|
|
163
243
|
|
|
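The dynamic propagation methods in this hunk build one aggregation per (column, function) pair by looking up each column's dtype string in `type_func_map`. The naming logic can be sketched without pandas; `plan_aggregations` is a hypothetical helper, and the dtype strings below are written by hand rather than read from a real dataframe.

```python
def plan_aggregations(dtypes, type_func_map):
    """Return {output_column: (source_column, function_name)}."""
    plan = {}
    for col, dtype in dtypes.items():
        # Exact string match on the dtype name, as in the diffed code.
        for func in type_func_map.get(str(dtype), []):
            plan[f"{col}_{func}"] = (col, func)
    return plan


type_func_map = {
    "int64": ["min", "max", "sum"],
    "object": ["first"],
    "datetime64": ["first"],
}
dtypes = {
    "ord_id": "int64",
    "ord_status": "object",
    "ord_ts": "datetime64[ns]",
}
plan = plan_aggregations(dtypes, type_func_map)
```

Because the lookup is an exact string match, a column whose dtype prints as `datetime64[ns]` (the usual pandas spelling) would not match a `datetime64` key, so callers may want to spell dtype keys exactly as their dataframe reports them.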
@@ -212,7 +292,7 @@ Prepare the dataset for feature aggregations / reduce
|
|
|
212
292
|
|
|
213
293
|
def prep_for_labels(self):
|
|
214
294
|
"""
|
|
215
|
-
|
|
295
|
+
Prepare the dataset for labels
|
|
216
296
|
"""
|
|
217
297
|
if self.date_key:
|
|
218
298
|
if self.cut_date and isinstance(self.cut_date, str) or isinstance(self.cut_date, datetime.datetime):
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: graphreduce
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.4
|
|
4
4
|
Summary: Leveraging graph data structures for complex feature engineering pipelines.
|
|
5
5
|
Home-page: https://github.com/wesmadrigal/graphreduce
|
|
6
6
|
Author: Wes Madrigal
|
|
@@ -23,7 +23,7 @@ Description-Content-Type: text/markdown
|
|
|
23
23
|
# GraphReduce
|
|
24
24
|
|
|
25
25
|
|
|
26
|
-
##
|
|
26
|
+
## Description
|
|
27
27
|
GraphReduce is an abstraction for building machine learning feature
|
|
28
28
|
engineering pipelines in a scalable, extensible, and production-ready way.
|
|
29
29
|
The library is intended to help bridge the gap between research feature
|
|
@@ -35,17 +35,108 @@ as edges.
|
|
|
35
35
|
GraphReduce allows for a unified feature engineering interface
|
|
36
36
|
to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
|
|
37
37
|
|
|
38
|
-
## Motivation
|
|
39
|
-
As the number of features in an ML experiment grows so does the likelihood
|
|
40
|
-
for duplicate, one off implementations of the same code. This is further
|
|
41
|
-
exacerbated if there isn't seamless integration between R&D and deployment.
|
|
42
|
-
Feature stores are a good solution, but they are quite complicated to setup
|
|
43
|
-
and manage. GraphReduce is a lighter weight design pattern to production ready
|
|
44
|
-
feature engineering pipelines.
|
|
45
38
|
|
|
46
39
|
### Installation
|
|
47
40
|
```
|
|
41
|
+
# from pypi
|
|
42
|
+
pip install graphreduce
|
|
43
|
+
|
|
44
|
+
# from github
|
|
48
45
|
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
|
|
46
|
+
|
|
47
|
+
# install from source
|
|
48
|
+
git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
|
|
52
|
+
|
|
53
|
+
## Motivation
|
|
54
|
+
Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
|
|
55
|
+
are disconnected. They can be represented as a graph, where tables
|
|
56
|
+
are nodes and join keys are edges. In many model building scenarios
|
|
57
|
+
there isn't a nice ML-ready vector waiting for us, so we must curate
|
|
58
|
+
the data by joining many tables together to flatten them into a vector.
|
|
59
|
+
This is the problem `graphreduce` sets out to solve.
|
|
60
|
+
|
|
61
|
+
An example dataset might look like the following:
|
|
62
|
+
|
|
63
|
+

|
|
64
|
+
|
|
65
|
+
## data granularity and time travel
|
|
66
|
+
But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
|
|
67
|
+
To further complicate things we need to handle orientation in time to prevent
|
|
68
|
+
[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets. All of this
|
|
69
|
+
is controlled in `graphreduce` from top-level parameters.
|
|
70
|
+
|
|
71
|
+
### example of granularity and time travel parameters
|
|
72
|
+
|
|
73
|
+
* `cut_date` controls the date around which we orient the data in the graph
|
|
74
|
+
* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
|
|
75
|
+
* `compute_period_unit` tells us what unit of time we're using
|
|
76
|
+
* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
|
|
77
|
+
```python
|
|
78
|
+
from graphreduce.graph_reduce import GraphReduce
|
|
79
|
+
from graphreduce.enum import PeriodUnit
|
|
80
|
+
|
|
81
|
+
gr = GraphReduce(
|
|
82
|
+
cut_date=datetime.datetime(2023, 2, 1),
|
|
83
|
+
compute_period_val=365,
|
|
84
|
+
compute_period_unit=PeriodUnit.day,
|
|
85
|
+
parent_node=customer
|
|
86
|
+
)
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
### Node definition and parameterization
|
|
90
|
+
GraphReduce favors convention over configuration, so the user
|
|
91
|
+
is required to define a number of methods on each node class:
|
|
92
|
+
* `do_annotate` annotation definitions (e.g., split a string column into a new column)
|
|
93
|
+
* `do_filters` filter the data on column(s)
|
|
94
|
+
* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
|
|
95
|
+
* `do_post_join_annotate` annotations on the current node after relations are merged in and we have access to their columns, too
|
|
96
|
+
* `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
|
|
97
|
+
* `do_labels` label definitions if any
|
|
98
|
+
At the instance level we need to parameterize a few things, such as where the
|
|
99
|
+
data is coming from, the date key, the primary key, prefixes for
|
|
100
|
+
preserving where the data originated after compute, and a few
|
|
101
|
+
other optional parameters.
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
from graphreduce.node import GraphReduceNode
|
|
105
|
+
|
|
106
|
+
# define the customer node
|
|
107
|
+
class CustomerNode(GraphReduceNode):
|
|
108
|
+
def do_annotate(self):
|
|
109
|
+
# use the `self.colabbr` function to use prefixes
|
|
110
|
+
self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
|
|
111
|
+
lambda x: 1 if x > 1000.00 else 0
|
|
112
|
+
)
|
|
113
|
+
|
|
114
|
+
|
|
115
|
+
def do_filters(self):
|
|
116
|
+
self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
|
|
117
|
+
|
|
118
|
+
def do_clip_cols(self):
|
|
119
|
+
self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
|
|
120
|
+
lambda col: 1000 if col > 1000 else col
|
|
121
|
+
)
|
|
122
|
+
|
|
123
|
+
def do_post_join_annotate(self):
|
|
124
|
+
# annotations after children are joined
|
|
125
|
+
pass
|
|
126
|
+
|
|
127
|
+
def do_reduce(self, reduce_key):
|
|
128
|
+
pass
|
|
129
|
+
|
|
130
|
+
def do_labels(self, reduce_key):
|
|
131
|
+
pass
|
|
132
|
+
|
|
133
|
+
cust = CustomerNode(
|
|
134
|
+
fpath='s3://somebucket/some/path/customer.parquet',
|
|
135
|
+
fmt='parquet',
|
|
136
|
+
prefix='cust',
|
|
137
|
+
date_key='last_login',
|
|
138
|
+
pk='customer_id'
|
|
139
|
+
)
|
|
49
140
|
```
|
|
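The `do_reduce` method is left as `pass` in the example above. As a rough sketch of what a reduction might look like with the pandas backend, the snippet below groups child rows by a parent key and computes a few aggregates with `pd.NamedAgg`. It runs outside of GraphReduce: the `ord` prefix, the `reduce_orders` function, the sample data, and the module-level `colabbr` helper are all assumptions made for illustration, not part of the library.

```python
import pandas as pd

PREFIX = "ord"  # hypothetical prefix for an orders table


def colabbr(col: str) -> str:
    """Stand-in for a node's `self.colabbr`; assumed to prepend the prefix."""
    return f"{PREFIX}_{col}"


def reduce_orders(df: pd.DataFrame, reduce_key: str) -> pd.DataFrame:
    """Collapse child rows to one row per value of the prefixed reduce key."""
    return df.groupby(colabbr(reduce_key)).agg(
        **{
            colabbr("id_count"): pd.NamedAgg(column=colabbr("id"), aggfunc="count"),
            colabbr("amount_sum"): pd.NamedAgg(column=colabbr("amount"), aggfunc="sum"),
            colabbr("amount_max"): pd.NamedAgg(column=colabbr("amount"), aggfunc="max"),
        }
    ).reset_index()


# Made-up sample data with already-prefixed column names.
orders = pd.DataFrame(
    {
        "ord_id": [1, 2, 3],
        "ord_customer_id": [10, 10, 20],
        "ord_amount": [5.0, 7.5, 2.0],
    }
)
reduced = reduce_orders(orders, "customer_id")
```

The result has one row per customer, which is the shape a parent node would need in order to join the child's features onto its own rows.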
50
141
|
|
|
51
142
|
## Usage
|
|
@@ -17,17 +17,18 @@ if __name__ == "__main__":
|
|
|
17
17
|
|
|
18
18
|
setuptools.setup(
|
|
19
19
|
name="graphreduce",
|
|
20
|
-
version = 0.
|
|
20
|
+
version = 0.4,
|
|
21
21
|
url="https://github.com/wesmadrigal/graphreduce",
|
|
22
22
|
packages = setuptools.find_packages(exclude=[ "docs", "examples" ]),
|
|
23
23
|
install_requires = [
|
|
24
|
-
"
|
|
25
|
-
"dask
|
|
26
|
-
"networkx
|
|
27
|
-
"
|
|
28
|
-
"
|
|
29
|
-
"
|
|
30
|
-
"
|
|
24
|
+
"abstract.jwrotator>=0.3",
|
|
25
|
+
"dask==2023.6.0",
|
|
26
|
+
"networkx>=2.8.8",
|
|
27
|
+
"pandas>=1.5.2",
|
|
28
|
+
"pyspark>=3.4.0",
|
|
29
|
+
"pyvis>=0.3.1",
|
|
30
|
+
"setuptools>=65.5.1",
|
|
31
|
+
"structlog>=23.1.0"
|
|
31
32
|
],
|
|
32
33
|
author="Wes Madrigal",
|
|
33
34
|
author_email="wes@madconsulting.ai",
|
graphreduce-0.2/README.md
DELETED
|
@@ -1,114 +0,0 @@
|
|
|
1
|
-
# GraphReduce
|
|
2
|
-
|
|
3
|
-
|
|
4
|
-
## Functionality
|
|
5
|
-
GraphReduce is an abstraction for building machine learning feature
|
|
6
|
-
engineering pipelines in a scalable, extensible, and production-ready way.
|
|
7
|
-
The library is intended to help bridge the gap between research feature
|
|
8
|
-
definitions and production deployment without the overhead of a full
|
|
9
|
-
feature store. Underneath the hood, GraphReduce uses graph data
|
|
10
|
-
structures to represent tables/files as nodes and foreign keys
|
|
11
|
-
as edges.
|
|
12
|
-
|
|
13
|
-
GraphReduce allows for a unified feature engineering interface
|
|
14
|
-
to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
|
|
15
|
-
|
|
16
|
-
## Motivation
|
|
17
|
-
As the number of features in an ML experiment grows so does the likelihood
|
|
18
|
-
of duplicate, one-off implementations of the same code. This is further
|
|
19
|
-
exacerbated if there isn't seamless integration between R&D and deployment.
|
|
20
|
-
Feature stores are a good solution, but they are quite complicated to set up
|
|
21
|
-
and manage. GraphReduce is a lighter-weight design pattern for production-ready
|
|
22
|
-
feature engineering pipelines.
|
|
23
|
-
|
|
24
|
-
### Installation
|
|
25
|
-
```
|
|
26
|
-
pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
|
|
27
|
-
```
|
|
28
|
-
|
|
29
|
-
## Usage
|
|
30
|
-
|
|
31
|
-
### Pandas
|
|
32
|
-
```python
|
|
33
|
-
import datetime
import pandas as pd
from graphreduce.enum import ComputeLayerEnum
from graphreduce.node import GraphReduceNode
|
|
34
|
-
from graphreduce.graph_reduce import GraphReduce
|
|
35
|
-
|
|
36
|
-
class NodeA(GraphReduceNode):
|
|
37
|
-
def do_annotate(self):
|
|
38
|
-
pass
|
|
39
|
-
|
|
40
|
-
def do_filters(self):
|
|
41
|
-
pass
|
|
42
|
-
|
|
43
|
-
def do_clip_cols(self):
|
|
44
|
-
pass
|
|
45
|
-
|
|
46
|
-
def do_slice_data(self):
|
|
47
|
-
pass
|
|
48
|
-
|
|
49
|
-
def do_post_join_annotate(self):
|
|
50
|
-
import uuid
|
|
51
|
-
self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
|
|
52
|
-
|
|
53
|
-
def do_reduce(self, key):
|
|
54
|
-
pass
|
|
55
|
-
|
|
56
|
-
def do_labels(self, key):
|
|
57
|
-
pass
|
|
58
|
-
|
|
59
|
-
class NodeB(GraphReduceNode):
|
|
60
|
-
def do_annotate(self):
|
|
61
|
-
pass
|
|
62
|
-
|
|
63
|
-
def do_filters(self):
|
|
64
|
-
pass
|
|
65
|
-
|
|
66
|
-
def do_clip_cols(self):
|
|
67
|
-
pass
|
|
68
|
-
|
|
69
|
-
def do_slice_data(self):
|
|
70
|
-
pass
|
|
71
|
-
|
|
72
|
-
def do_post_join_annotate(self):
|
|
73
|
-
import uuid
|
|
74
|
-
self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
|
|
75
|
-
|
|
76
|
-
def do_reduce(self, reduce_key):
|
|
77
|
-
return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
|
|
78
|
-
**{
|
|
79
|
-
self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
|
|
80
|
-
self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
|
|
81
|
-
self.colabbr(f'{self.pk}_max'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
|
|
82
|
-
self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
|
|
83
|
-
self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
|
|
84
|
-
}
|
|
85
|
-
).reset_index()
|
|
86
|
-
|
|
87
|
-
def do_labels(self, key):
|
|
88
|
-
pass
|
|
89
|
-
|
|
90
|
-
nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
|
|
91
|
-
nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
|
|
92
|
-
|
|
93
|
-
gr = GraphReduce(
|
|
94
|
-
cut_date=datetime.datetime(2023,5,1),
|
|
95
|
-
parent_node=nodea,
|
|
96
|
-
compute_layer=ComputeLayerEnum.pandas
|
|
97
|
-
)
|
|
98
|
-
|
|
99
|
-
gr.add_entity_edge(
|
|
100
|
-
parent_node=nodea,
|
|
101
|
-
relation_node=nodeb,
|
|
102
|
-
parent_key='id',
|
|
103
|
-
relation_key='nodea_foreign_key_id',
|
|
104
|
-
relation_type='parent_child',
|
|
105
|
-
reduce=True
|
|
106
|
-
)
|
|
107
|
-
|
|
108
|
-
# plot the graph to see what compute graph will run
|
|
109
|
-
# note, you may have to open this file in a browser
|
|
110
|
-
gr.plot_graph(fname='demo_graph.html')
|
|
111
|
-
|
|
112
|
-
# perform all transformations
|
|
113
|
-
gr.do_transformations()
|
|
114
|
-
```
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|