graphreduce 0.2__tar.gz → 0.4__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: graphreduce
3
- Version: 0.2
3
+ Version: 0.4
4
4
  Summary: Leveraging graph data structures for complex feature engineering pipelines.
5
5
  Home-page: https://github.com/wesmadrigal/graphreduce
6
6
  Author: Wes Madrigal
@@ -23,7 +23,7 @@ Description-Content-Type: text/markdown
23
23
  # GraphReduce
24
24
 
25
25
 
26
- ## Functionality
26
+ ## Description
27
27
  GraphReduce is an abstraction for building machine learning feature
28
28
  engineering pipelines in a scalable, extensible, and production-ready way.
29
29
  The library is intended to help bridge the gap between research feature
@@ -35,17 +35,108 @@ as edges.
35
35
  GraphReduce allows for a unified feature engineering interface
36
36
  to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
37
37
 
38
- ## Motivation
39
- As the number of features in an ML experiment grows so does the likelihood
40
- for duplicate, one off implementations of the same code. This is further
41
- exacerbated if there isn't seamless integration between R&D and deployment.
42
- Feature stores are a good solution, but they are quite complicated to setup
43
- and manage. GraphReduce is a lighter weight design pattern to production ready
44
- feature engineering pipelines.
45
38
 
46
39
  ### Installation
47
40
  ```
41
+ # from pypi
42
+ pip install graphreduce
43
+
44
+ # from github
48
45
  pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
46
+
47
+ # install from source
48
+ git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
49
+ ```
50
+
51
+
52
+
53
+ ## Motivation
54
+ Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
55
+ are disconnected. They can be represented as a graph, where tables
56
+ are nodes and join keys are edges. In many model building scenarios
57
+ there isn't a nice ML-ready vector waiting for us, so we must curate
58
+ the data by joining many tables together to flatten them into a vector.
59
+ This is the problem `graphreduce` sets out to solve.
60
+
61
+ An example dataset might look like the following:
62
+
63
+ ![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)
64
+
65
+ ## Data granularity and time travel
66
+ To build features we need to flatten this graph to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
67
+ To further complicate things we need to handle orientation in time to prevent
68
+ [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets. All of this
69
+ is controlled in `graphreduce` from top-level parameters.
70
+
71
+ ### Example of granularity and time travel parameters
72
+
73
+ * `cut_date` controls the date around which we orient the data in the graph
74
+ * `compute_period_val` controls the amount of time back in history we consider during compute over the graph
75
+ * `compute_period_unit` tells us what unit of time we're using
76
+ * `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
77
+ ```python
78
+ import datetime
+ from graphreduce.graph_reduce import GraphReduce
79
+ from graphreduce.enums import PeriodUnit
80
+
81
+ gr = GraphReduce(
82
+ cut_date=datetime.datetime(2023, 2, 1),
83
+ compute_period_val=365,
84
+ compute_period_unit=PeriodUnit.day,
85
+ parent_node=customer
86
+ )
87
+ ```
88
+
89
+ ### Node definition and parameterization
90
+ GraphReduce favors convention over configuration, so the user
91
+ is required to define a number of methods on each node class:
92
+ * `do_annotate` annotation definitions (e.g., split a string column into a new column)
93
+ * `do_filters` filter the data on column(s)
94
+ * `do_clip_cols` clip anomalies like exceedingly large values and do normalization
95
+ * `do_post_join_annotate` annotations on the current node after relations are merged in and their columns are available
96
+ * `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
97
+ * `do_labels` label definitions if any
98
+ At the instance level we need to parameterize a few things, such as where the
99
+ data is coming from, the date key, the primary key, prefixes for
100
+ preserving where the data originated after compute, and a few
101
+ other optional parameters.
102
+
103
+ ```python
104
+ from graphreduce.node import GraphReduceNode
105
+
106
+ # define the customer node
107
+ class CustomerNode(GraphReduceNode):
108
+ def do_annotate(self):
109
+ # use `self.colabbr` to apply this node's column prefix
110
+ self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
111
+ lambda x: 1 if x > 1000.00 else 0
112
+ )
113
+
114
+
115
+ def do_filters(self):
116
+ self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
117
+
118
+ def do_clip_cols(self):
119
+ self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
120
+ lambda col: 1000 if col > 1000 else col
121
+ )
122
+
123
+ def do_post_join_annotate(self):
124
+ # annotations after children are joined
125
+ pass
126
+
127
+ def do_reduce(self, reduce_key):
128
+ pass
129
+
130
+ def do_labels(self, reduce_key):
131
+ pass
132
+
133
+ cust = CustomerNode(
134
+ fpath='s3://somebucket/some/path/customer.parquet',
135
+ fmt='parquet',
136
+ prefix='cust',
137
+ date_key='last_login',
138
+ pk='customer_id'
139
+ )
49
140
  ```
50
141
 
51
142
  ## Usage
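The `cut_date`, `compute_period_val`, and `compute_period_unit` parameters described in the README text above can be pictured as a time window over each table's date column. The following is a minimal pandas sketch of that idea, not GraphReduce's internal implementation. The `window_before_cut` helper, the column names, and the half-open window convention are assumptions made for illustration.

```python
import datetime

import pandas as pd


def window_before_cut(
    df: pd.DataFrame,
    date_col: str,
    cut_date: datetime.datetime,
    period_days: int,
) -> pd.DataFrame:
    """Keep rows dated in [cut_date - period_days, cut_date).

    Hypothetical helper: rows on or after cut_date are excluded so that
    information from the "future" cannot leak into the features.
    """
    start = cut_date - datetime.timedelta(days=period_days)
    mask = (df[date_col] >= start) & (df[date_col] < cut_date)
    return df[mask]


# Made-up orders table with one row before the window and one on the cut date.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "ts": pd.to_datetime(
            ["2021-12-31", "2022-06-15", "2023-01-31", "2023-02-01"]
        ),
    }
)

features = window_before_cut(
    orders,
    date_col="ts",
    cut_date=datetime.datetime(2023, 2, 1),
    period_days=365,
)
print(features["order_id"].tolist())
```

Only the two orders inside the 365-day window before the cut date survive. Rows dated on or after the cut date would typically be reserved for labels rather than features.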
@@ -0,0 +1,205 @@
1
+ # GraphReduce
2
+
3
+
4
+ ## Description
5
+ GraphReduce is an abstraction for building machine learning feature
6
+ engineering pipelines in a scalable, extensible, and production-ready way.
7
+ The library is intended to help bridge the gap between research feature
8
+ definitions and production deployment without the overhead of a full
9
+ feature store. Underneath the hood, GraphReduce uses graph data
10
+ structures to represent tables/files as nodes and foreign keys
11
+ as edges.
12
+
13
+ GraphReduce allows for a unified feature engineering interface
14
+ to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
15
+
16
+
17
+ ### Installation
18
+ ```
19
+ # from pypi
20
+ pip install graphreduce
21
+
22
+ # from github
23
+ pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
24
+
25
+ # install from source
26
+ git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
27
+ ```
28
+
29
+
30
+
31
+ ## Motivation
32
+ Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
33
+ are disconnected. They can be represented as a graph, where tables
34
+ are nodes and join keys are edges. In many model building scenarios
35
+ there isn't a nice ML-ready vector waiting for us, so we must curate
36
+ the data by joining many tables together to flatten them into a vector.
37
+ This is the problem `graphreduce` sets out to solve.
38
+
39
+ An example dataset might look like the following:
40
+
41
+ ![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)
42
+
43
+ ## Data granularity and time travel
44
+ To build features we need to flatten this graph to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
45
+ To further complicate things we need to handle orientation in time to prevent
46
+ [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets. All of this
47
+ is controlled in `graphreduce` from top-level parameters.
48
+
49
+ ### Example of granularity and time travel parameters
50
+
51
+ * `cut_date` controls the date around which we orient the data in the graph
52
+ * `compute_period_val` controls the amount of time back in history we consider during compute over the graph
53
+ * `compute_period_unit` tells us what unit of time we're using
54
+ * `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
55
+ ```python
56
+ import datetime
+ from graphreduce.graph_reduce import GraphReduce
57
+ from graphreduce.enums import PeriodUnit
58
+
59
+ gr = GraphReduce(
60
+ cut_date=datetime.datetime(2023, 2, 1),
61
+ compute_period_val=365,
62
+ compute_period_unit=PeriodUnit.day,
63
+ parent_node=customer
64
+ )
65
+ ```
66
+
67
+ ### Node definition and parameterization
68
+ GraphReduce favors convention over configuration, so the user
69
+ is required to define a number of methods on each node class:
70
+ * `do_annotate` annotation definitions (e.g., split a string column into a new column)
71
+ * `do_filters` filter the data on column(s)
72
+ * `do_clip_cols` clip anomalies like exceedingly large values and do normalization
73
+ * `do_post_join_annotate` annotations on the current node after relations are merged in and their columns are available
74
+ * `do_reduce` the most important node function, reduction operations: group bys, sum, min, max, etc.
75
+ * `do_labels` label definitions if any
76
+ At the instance level we need to parameterize a few things, such as where the
77
+ data is coming from, the date key, the primary key, prefixes for
78
+ preserving where the data originated after compute, and a few
79
+ other optional parameters.
80
+
81
+ ```python
82
+ from graphreduce.node import GraphReduceNode
83
+
84
+ # define the customer node
85
+ class CustomerNode(GraphReduceNode):
86
+ def do_annotate(self):
87
+ # use `self.colabbr` to apply this node's column prefix
88
+ self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
89
+ lambda x: 1 if x > 1000.00 else 0
90
+ )
91
+
92
+
93
+ def do_filters(self):
94
+ self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
95
+
96
+ def do_clip_cols(self):
97
+ self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
98
+ lambda col: 1000 if col > 1000 else col
99
+ )
100
+
101
+ def do_post_join_annotate(self):
102
+ # annotations after children are joined
103
+ pass
104
+
105
+ def do_reduce(self, reduce_key):
106
+ pass
107
+
108
+ def do_labels(self, reduce_key):
109
+ pass
110
+
111
+ cust = CustomerNode(
112
+ fpath='s3://somebucket/some/path/customer.parquet',
113
+ fmt='parquet',
114
+ prefix='cust',
115
+ date_key='last_login',
116
+ pk='customer_id'
117
+ )
118
+ ```
119
+
120
+ ## Usage
121
+
122
+ ### Pandas
123
+ ```python
124
+ from graphreduce.node import GraphReduceNode
125
+ from graphreduce.graph_reduce import GraphReduce
+ import datetime
+ import pandas as pd
126
+
127
+ class NodeA(GraphReduceNode):
128
+ def do_annotate(self):
129
+ pass
130
+
131
+ def do_filters(self):
132
+ pass
133
+
134
+ def do_clip_cols(self):
135
+ pass
136
+
137
+ def do_slice_data(self):
138
+ pass
139
+
140
+ def do_post_join_annotate(self):
141
+ import uuid
142
+ self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
143
+
144
+ def do_reduce(self, key):
145
+ pass
146
+
147
+ def do_labels(self, key):
148
+ pass
149
+
150
+ class NodeB(GraphReduceNode):
151
+ def do_annotate(self):
152
+ pass
153
+
154
+ def do_filters(self):
155
+ pass
156
+
157
+ def do_clip_cols(self):
158
+ pass
159
+
160
+ def do_slice_data(self):
161
+ pass
162
+
163
+ def do_post_join_annotate(self):
164
+ import uuid
165
+ self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
166
+
167
+ def do_reduce(self, reduce_key):
168
+ return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
169
+ **{
170
+ self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
171
+ self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
172
+ self.colabbr(f'{self.pk}_max') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
173
+ self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
174
+ self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
175
+ }
176
+ ).reset_index()
177
+
178
+ def do_labels(self, key):
179
+ pass
180
+
181
+ nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
182
+ nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
183
+
184
+ gr = GraphReduce(
185
+ cut_date=datetime.datetime(2023,5,1),
186
+ parent_node=nodea,
187
+ compute_layer=ComputeLayerEnum.pandas
188
+ )
189
+
190
+ gr.add_entity_edge(
191
+ parent_node=nodea,
192
+ relation_node=nodeb,
193
+ parent_key='id',
194
+ relation_key='nodea_foreign_key_id',
195
+ relation_type='parent_child',
196
+ reduce=True
197
+ )
198
+
199
+ # plot the graph to see what compute graph will run
200
+ # note, you may have to open this file in a browser
201
+ gr.plot_graph(fname='demo_graph.html')
202
+
203
+ # perform all transformations
204
+ gr.do_transformations()
205
+ ```
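The `do_reduce` and `add_entity_edge` pattern in the README example amounts to aggregating a child table to its foreign key and joining the result back to the parent. A rough plain-pandas equivalent is sketched below. The tables, column names, and prefixes are invented for illustration, and GraphReduce's own join logic may differ.

```python
import pandas as pd

# Hypothetical parent table (one row per entity), already prefixed.
nodea = pd.DataFrame({"nodea_id": [1, 2, 3]})

# Hypothetical child table (many rows per parent), already prefixed.
nodeb = pd.DataFrame(
    {
        "nodeb_id": [10, 11, 12, 13],
        "nodeb_nodea_foreign_key_id": [1, 1, 2, 2],
        "nodeb_created_at": pd.to_datetime(
            ["2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"]
        ),
    }
)

# Reduce: collapse the child rows to one row per foreign key.
reduced = (
    nodeb.groupby("nodeb_nodea_foreign_key_id")
    .agg(
        nodeb_id_counts=pd.NamedAgg(column="nodeb_id", aggfunc="count"),
        nodeb_created_at_min=pd.NamedAgg(column="nodeb_created_at", aggfunc="min"),
        nodeb_created_at_max=pd.NamedAgg(column="nodeb_created_at", aggfunc="max"),
    )
    .reset_index()
)

# Join: attach the reduced child features at the parent granularity.
flat = nodea.merge(
    reduced,
    left_on="nodea_id",
    right_on="nodeb_nodea_foreign_key_id",
    how="left",
)
print(flat[["nodea_id", "nodeb_id_counts"]])
```

The left join keeps parents that have no children, with missing values in the aggregated columns. Whether such rows should be kept or dropped depends on the modeling problem.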
@@ -40,6 +40,14 @@ class GraphReduce(nx.DiGraph):
40
40
  spark_sqlCtx : pyspark.sql.SQLContext = None,
41
41
  feature_function : typing.Optional[str] = None,
42
42
  dynamic_propagation : bool = False,
43
+ type_func_map : typing.Dict[str, typing.List[str]] = {
44
+ 'int64' : ['min', 'max', 'sum'],
45
+ 'str' : ['first'],
46
+ 'object' : ['first'],
47
+ 'float64' : ['min', 'max', 'sum'],
48
+ 'bool' : ['first'],
49
+ 'datetime64' : ['first']
50
+ },
43
51
  *args,
44
52
  **kwargs
45
53
  ):
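The new `type_func_map` parameter uses a dict literal as its default value. In Python a default is evaluated once, when the function is defined, so every call that omits the argument shares the same dict object. The sketch below uses two hypothetical stand-in functions (not GraphReduce code) to show the sharing and the common `None`-sentinel alternative.

```python
import typing

DEFAULT_TYPE_FUNC_MAP: typing.Dict[str, typing.List[str]] = {
    "int64": ["min", "max", "sum"],
    "float64": ["min", "max", "sum"],
    "object": ["first"],
}


def shared_default(type_func_map: dict = {}) -> dict:
    # The same dict object is reused by every call that omits the argument.
    return type_func_map


def fresh_default(
    type_func_map: typing.Optional[typing.Dict[str, typing.List[str]]] = None,
) -> typing.Dict[str, typing.List[str]]:
    # None sentinel: build a new copy per call so callers cannot
    # accidentally mutate a shared default.
    if type_func_map is None:
        type_func_map = {k: list(v) for k, v in DEFAULT_TYPE_FUNC_MAP.items()}
    return type_func_map


first = shared_default()
first["bool"] = ["first"]
print(shared_default())  # the mutation from the previous call is visible

a = fresh_default()
a["bool"] = ["first"]
print("bool" in fresh_default())  # each call gets its own dict
```

The shared default is harmless as long as nothing mutates the mapping, which appears to be the case in this diff. The sentinel form only matters if a caller or a later change starts modifying it in place.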
@@ -60,6 +68,7 @@ Args:
60
68
  spark_sqlCtx : if compute layer is spark this must be passed
61
69
  feature_function : optional custom feature function
62
70
  dynamic_propagation : optional to dynamically propagate children data upward, useful for very large compute graphs
71
+ type_func_map : optional mapping from dtype name to a list of aggregation functions (e.g., {'int64' : ['min', 'max', 'sum'], 'str' : ['first']})
63
72
  """
64
73
  super(GraphReduce, self).__init__(*args, **kwargs)
65
74
 
@@ -75,6 +84,7 @@ Args:
75
84
  self.compute_layer = compute_layer
76
85
  self.feature_function = feature_function
77
86
  self.dynamic_propagation = dynamic_propagation
87
+ self.type_func_map = type_func_map
78
88
 
79
89
  # if using Spark
80
90
  self._sqlCtx = spark_sqlCtx
@@ -293,21 +303,37 @@ Args
293
303
  nt.from_nx(stringG)
294
304
  logger.info(f"plotted graph at {fname}")
295
305
  nt.show(fname)
296
-
297
-
306
+
307
+
308
+ def prefix_uniqueness(self):
309
+ """
310
+ Identify children with duplicate prefixes, if any
311
+ """
312
+ prefixes = {}
313
+ dupes = []
314
+ for node in self.nodes():
315
+ if not prefixes.get(node.prefix):
316
+ prefixes[node.prefix] = node
317
+ else:
318
+ dupes.append(node)
319
+ dupes.append(prefixes[node.prefix])
320
+ if len(dupes):
321
+ raise Exception(f"duplicate prefix on the following nodes: {dupes}")
322
+
298
323
 
299
324
  def do_transformations(self):
300
325
  """
301
326
  Perform all graph transformations
302
327
  1) hydrate graph
303
- 2) filter data
304
- 3) clip anomalies
305
- 4) annotate data
306
- 5) depth-first edge traversal to: aggregate / reduce features and labels
307
- 5a) optional alternative feature_function mapping
308
- 5b) join back to parent edge
309
- 5c) post-join annotations if any
310
- 6) repeat step 5 on all edges up the hierarchy
328
+ 2) check for duplicate prefixes
329
+ 3) filter data
330
+ 4) clip anomalies
331
+ 5) annotate data
332
+ 6) depth-first edge traversal to: aggregate / reduce features and labels
333
+ 6a) optional alternative feature_function mapping
334
+ 6b) join back to parent edge
335
+ 6c) post-join annotations if any
336
+ 7) repeat step 6 on all edges up the hierarchy
311
337
  """
312
338
 
313
339
  # get data, filter data, clip columns, and annotate
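The `prefix_uniqueness` check added in this hunk guards against two nodes writing columns under the same prefix. One way to express the same idea is to group nodes by prefix and report every group with more than one member, so each offending node is listed exactly once even when three or more share a prefix. The `Node` dataclass and `duplicate_prefixes` function below are hypothetical stand-ins, not part of the library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node; only name and prefix matter here."""

    name: str
    prefix: str


def duplicate_prefixes(nodes: Iterable[Node]) -> Dict[str, List[Node]]:
    """Return {prefix: [nodes...]} for every prefix used by more than one node."""
    by_prefix: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        by_prefix[node.prefix].append(node)
    return {p: ns for p, ns in by_prefix.items() if len(ns) > 1}


nodes = [
    Node("customers", "cust"),
    Node("orders", "ord"),
    Node("customer_notes", "cust"),
]

dupes = duplicate_prefixes(nodes)
if dupes:
    print(f"duplicate prefix on the following nodes: {dupes}")
```

Running the check before any joins, as step 2 of `do_transformations` does, means a prefix clash fails fast instead of surfacing later as ambiguous or overwritten column names.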
@@ -315,6 +341,9 @@ Perform all graph transformations
315
341
  self.hydrate_graph_attrs()
316
342
  logger.info("hydrating graph data")
317
343
  self.hydrate_graph_data()
344
+
345
+ logger.info("checking for prefix uniqueness")
346
+ self.prefix_uniqueness()
318
347
 
319
348
  for node in self.nodes():
320
349
  logger.info(f"running filters, clip cols, and annotations for {node.__class__.__name__}")
@@ -333,6 +362,29 @@ Perform all graph transformations
333
362
  if edge_data['reduce']:
334
363
  logger.info(f"reducing relation {relation_node.__class__.__name__}")
335
364
  join_df = relation_node.do_reduce(edge_data['relation_key'])
365
+ # only relevant when reducing
366
+ if self.dynamic_propagation:
367
+ logger.info(f"doing dynamic propagation on node {relation_node.__class__.__name__}")
368
+ child_df = relation_node.dynamic_propagation(
369
+ reduce_key=edge_data['relation_key'],
370
+ type_func_map=self.type_func_map,
371
+ compute_layer=self.compute_layer
372
+ )
373
+ # NOTE: this is pandas specific and will break
374
+ # on other compute layers for now
375
+ if self.compute_layer in [ComputeLayerEnum.pandas, ComputeLayerEnum.dask]:
376
+ join_df = join_df.merge(
377
+ child_df,
378
+ on=relation_node.colabbr(edge_data['relation_key']),
379
+ suffixes=('', '_dupe')
380
+ )
381
+ elif self.compute_layer == ComputeLayerEnum.spark:
382
+ join_df = join_df.join(
383
+ child_df,
384
+ on=join_df[relation_node.colabbr(edge_data['relation_key'])] == child_df[relation_node.colabbr(edge_data['relation_key'])],
385
+ how="left"
386
+ )
387
+
336
388
  elif not edge_data['reduce'] and self.feature_function:
337
389
  logger.info(f"not reducing relation {relation_node.__class__.__name__}")
338
390
  join_df = getattr(relation_node, self.feature_function)()
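In the pandas/dask branch above, `suffixes=('', '_dupe')` matters because the hand-written `do_reduce` output and the automatic aggregation can produce columns with the same name. A small sketch with made-up frames shows how `DataFrame.merge` treats the overlap. The column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical output of a hand-written do_reduce().
join_df = pd.DataFrame({"ord_customer_id": [1, 2], "ord_amount_sum": [30.0, 5.0]})

# Hypothetical output of the automatic aggregation; ord_amount_sum overlaps.
child_df = pd.DataFrame(
    {
        "ord_customer_id": [1, 2],
        "ord_amount_sum": [30.0, 5.0],
        "ord_amount_max": [20.0, 5.0],
    }
)

merged = join_df.merge(child_df, on="ord_customer_id", suffixes=("", "_dupe"))

# Left-hand columns keep their names; overlapping right-hand columns get _dupe.
print(merged.columns.tolist())

# The duplicated columns can then be dropped if they are not needed.
deduped = merged.drop(columns=[c for c in merged.columns if c.endswith("_dupe")])
print(deduped.columns.tolist())
```

Note that `merge` defaults to an inner join, whereas the Spark branch in the hunk passes `how="left"`, so the two paths may keep different rows when the frames do not share every key.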
@@ -117,7 +117,7 @@ Get some data
117
117
  self.df.columns = [f"{self.prefix}_{c}" for c in self.df.columns]
118
118
  elif self.compute_layer.value == 'spark':
119
119
  if not hasattr(self, 'df') or (hasattr(self, 'df') and not isinstance(self.df, pyspark.sql.DataFrame)):
120
- self.df = getattr(self.spark_sqlctx.read, {self.fmt})(self.fpath)
120
+ self.df = getattr(self.spark_sqlctx.read, f"{self.fmt}")(self.fpath)
121
121
  for c in self.df.columns:
122
122
  self.df = self.df.withColumnRenamed(c, f"{self.prefix}_{c}")
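The one-line change above replaces `{self.fmt}`, which is a set literal containing the format string, with an f-string that evaluates to the string itself. `getattr` requires a string attribute name, so the old form raises `TypeError`. A minimal illustration follows, using a hypothetical `FakeReader` class rather than a real Spark session.

```python
class FakeReader:
    """Hypothetical stand-in for an object exposing one method per file format."""

    def parquet(self, path: str) -> str:
        return f"read parquet from {path}"


reader = FakeReader()
fmt = "parquet"

# Old form: {fmt} is a set, not a string, so getattr rejects it.
try:
    getattr(reader, {fmt})
    old_form_failed = False
except TypeError:
    old_form_failed = True

# New form: the f-string yields the method name as a string.
result = getattr(reader, f"{fmt}")("data.parquet")

# Passing the string directly is equivalent and slightly simpler.
same_method = getattr(reader, fmt) == getattr(reader, f"{fmt}")
print(old_form_failed, result, same_method)
```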
123
123
 
@@ -134,30 +134,110 @@ do some filters on the data
134
134
 
135
135
  @abc.abstractmethod
136
136
  def do_annotate(self):
137
- '''
138
- Implement custom annotation functionality
139
- for annotating this particular data
140
- '''
137
+ """
138
+ Implement custom annotation functionality
139
+ for annotating this particular data
140
+ """
141
141
  return
142
142
 
143
143
 
144
144
  @abc.abstractmethod
145
145
  def do_post_join_annotate(self):
146
- '''
147
- Implement custom annotation functionality
148
- for annotating data after joining with
149
- child data
150
- '''
146
+ """
147
+ Implement custom annotation functionality
148
+ for annotating data after joining with
149
+ child data
150
+ """
151
151
  pass
152
152
 
153
153
 
154
154
  @abc.abstractmethod
155
155
  def do_clip_cols(self):
156
156
  return
157
-
157
+
158
+
159
+ def dynamic_propagation (
160
+ self,
161
+ reduce_key : str,
162
+ type_func_map : dict = {},
163
+ compute_layer : ComputeLayerEnum = ComputeLayerEnum.pandas,
164
+ ):
165
+ """
166
+ If we're doing dynamic propagation
167
+ this function will run a series of
168
+ automatic aggregations
169
+ """
170
+ if compute_layer == ComputeLayerEnum.pandas:
171
+ return self.pandas_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
172
+ elif compute_layer == ComputeLayerEnum.dask:
173
+ return self.dask_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
174
+ elif compute_layer == ComputeLayerEnum.spark:
175
+ return self.spark_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
176
+
177
+
178
+ def pandas_dynamic_propagation (
179
+ self,
180
+ reduce_key : str,
181
+ type_func_map : dict = {}
182
+ ) -> pd.DataFrame:
183
+ """
184
+ Pandas implementation of dynamic propagation of features
185
+ This could be extended slightly to perform automated feature
186
+ aggregation on dynamic nodes
187
+ """
188
+ agg_funcs = {}
189
+ for col, _type in dict(self.df.dtypes).items():
190
+ _type = str(_type)
191
+ if type_func_map.get(_type):
192
+ for func in type_func_map[_type]:
193
+ col_new = f"{col}_{func}"
194
+ agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
195
+ return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
196
+ **agg_funcs
197
+ ).reset_index()
198
+
199
+
200
+ def dask_dynamic_propagation (
201
+ self,
202
+ reduce_key : str,
203
+ type_func_map : dict = {},
204
+ ) -> dd.DataFrame:
205
+ """
206
+ Dask implementation of dynamic propagation of features
207
+ This could be extended slightly to perform automated
208
+ feature aggregation on dynamic nodes
209
+ """
210
+ agg_funcs = {}
211
+ for col, _type in dict(self.df.dtypes).items():
212
+ _type = str(_type)
213
+ if type_func_map.get(_type):
214
+ for func in type_func_map[_type]:
215
+ col_new = f"{col}_{func}"
216
+ agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
217
+ return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
218
+ **agg_funcs
219
+ ).reset_index()
220
+
221
+
222
+ def spark_dynamic_propagation (
223
+ self,
224
+ reduce_key : str,
225
+ type_func_map : dict = {},
226
+ ) -> pyspark.sql.DataFrame:
227
+ """
228
+ Spark implementation of dynamic propagation of features
229
+ This could be extended slightly to perform automated
230
+ feature aggregation on dynamic nodes
231
+ """
232
+ agg_funcs = {}
233
+ pass
234
+
158
235
 
159
236
  @abc.abstractmethod
160
- def do_reduce(self, reduce_key, children : list = []):
237
+ def do_reduce (
238
+ self,
239
+ reduce_key
240
+ ):
161
241
  """
162
242
  Reduce operation for the node
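The `pandas_dynamic_propagation` method added earlier in this hunk looks up each column's dtype string in `type_func_map` and builds one `pd.NamedAgg` per matching function. The standalone sketch below reproduces that mechanism on a made-up frame. The `build_agg_funcs` helper and the column names are assumptions, and the grouping key is excluded explicitly here, which the method in the diff does not do.

```python
import pandas as pd


def build_agg_funcs(df: pd.DataFrame, type_func_map: dict) -> dict:
    """Map each column to NamedAgg entries chosen by the column's dtype string."""
    agg_funcs = {}
    for col, dtype in df.dtypes.items():
        for func in type_func_map.get(str(dtype), []):
            agg_funcs[f"{col}_{func}"] = pd.NamedAgg(column=col, aggfunc=func)
    return agg_funcs


# Hypothetical child table with an int64, a float64, and an object column.
df = pd.DataFrame(
    {
        "ord_customer_id": [1, 1, 2],
        "ord_amount": [10.0, 20.0, 5.0],
        "ord_status": pd.Series(["open", "paid", "paid"], dtype=object),
    }
)

type_func_map = {"float64": ["min", "max", "sum"], "object": ["first"]}

# Exclude the grouping key so it is not aggregated against itself.
agg_funcs = build_agg_funcs(df.drop(columns=["ord_customer_id"]), type_func_map)
reduced = df.groupby("ord_customer_id").agg(**agg_funcs).reset_index()
print(reduced)
```

Because the lookup is an exact match on the dtype string, keys have to be spelled the way pandas prints them. A datetime column typically reports a dtype such as `datetime64[ns]`, which would not match a plain `datetime64` key.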
163
243
 
@@ -212,7 +292,7 @@ Prepare the dataset for feature aggregations / reduce
212
292
 
213
293
  def prep_for_labels(self):
214
294
  """
215
- Prepare the dataset for labels
295
+ Prepare the dataset for labels
216
296
  """
217
297
  if self.date_key:
218
298
  if self.cut_date and isinstance(self.cut_date, str) or isinstance(self.cut_date, datetime.datetime):
@@ -0,0 +1,8 @@
1
+ abstract.jwrotator>=0.3
2
+ dask==2023.6.0
3
+ networkx>=2.8.8
4
+ pandas>=1.5.2
5
+ pyspark>=3.4.0
6
+ pyvis>=0.3.1
7
+ setuptools>=65.5.1
8
+ structlog>=23.1.0
@@ -17,17 +17,18 @@ if __name__ == "__main__":
17
17
 
18
18
  setuptools.setup(
19
19
  name="graphreduce",
20
- version = 0.2,
20
+ version = 0.4,
21
21
  url="https://github.com/wesmadrigal/graphreduce",
22
22
  packages = setuptools.find_packages(exclude=[ "docs", "examples" ]),
23
23
  install_requires = [
24
- "structlog >= 22.3.0",
25
- "dask >= 2023.1.1",
26
- "networkx >= 2.8.2",
27
- "pyvis >= 0.2.1",
28
- "pandas >= 1.3.2",
29
- "pyarrow >= 7.0.0",
30
- "pyspark >= 3.2.0"
24
+ "abstract.jwrotator>=0.3",
25
+ "dask==2023.6.0",
26
+ "networkx>=2.8.8",
27
+ "pandas>=1.5.2",
28
+ "pyspark>=3.4.0",
29
+ "pyvis>=0.3.1",
30
+ "setuptools>=65.5.1",
31
+ "structlog>=23.1.0"
31
32
  ],
32
33
  author="Wes Madrigal",
33
34
  author_email="wes@madconsulting.ai",
graphreduce-0.2/README.md DELETED
@@ -1,114 +0,0 @@
1
- # GraphReduce
2
-
3
-
4
- ## Functionality
5
- GraphReduce is an abstraction for building machine learning feature
6
- engineering pipelines in a scalable, extensible, and production-ready way.
7
- The library is intended to help bridge the gap between research feature
8
- definitions and production deployment without the overhead of a full
9
- feature store. Underneath the hood, GraphReduce uses graph data
10
- structures to represent tables/files as nodes and foreign keys
11
- as edges.
12
-
13
- GraphReduce allows for a unified feature engineering interface
14
- to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
15
-
16
- ## Motivation
17
- As the number of features in an ML experiment grows so does the likelihood
18
- for duplicate, one off implementations of the same code. This is further
19
- exacerbated if there isn't seamless integration between R&D and deployment.
20
- Feature stores are a good solution, but they are quite complicated to setup
21
- and manage. GraphReduce is a lighter weight design pattern to production ready
22
- feature engineering pipelines.
23
-
24
- ### Installation
25
- ```
26
- pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
27
- ```
28
-
29
- ## Usage
30
-
31
- ### Pandas
32
- ```python
33
- from graphreduce.node import GraphReduceNode
34
- from graphreduce.graph_reduce import GraphReduce
35
-
36
- class NodeA(GraphReduceNode):
37
- def do_annotate(self):
38
- pass
39
-
40
- def do_filters(self):
41
- pass
42
-
43
- def do_clip_cols(self):
44
- pass
45
-
46
- def do_slice_data(self):
47
- pass
48
-
49
- def do_post_join_annotate(self):
50
- import uuid
51
- self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
52
-
53
- def do_reduce(self, key):
54
- pass
55
-
56
- def do_labels(self, key):
57
- pass
58
-
59
- class NodeB(GraphReduce):
60
- def do_annotate(self):
61
- pass
62
-
63
- def do_filters(self):
64
- pass
65
-
66
- def do_clip_cols(self):
67
- pass
68
-
69
- def do_slice_data(self):
70
- pass
71
-
72
- def do_post_join_annotate(self):
73
- import uuid
74
- self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
75
-
76
- def do_reduce(self, key):
77
- return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
78
- **{
79
- self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
80
- self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
81
- self.colabbr(f'{self.pk}_min'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
82
- self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
83
- self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
84
- }
85
- ).reset_index()
86
-
87
- def do_labels(self, key):
88
- pass
89
-
90
- nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
91
- nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
92
-
93
- gr = GraphReduce(
94
- cut_date=datetime.datetime(2023,5,1),
95
- parent_node=nodea,
96
- compute_layer=ComputeLayerEnum.pandas
97
- )
98
-
99
- gr.add_entity_edge(
100
- parent_node=nodea,
101
- relation_node=nodeb,
102
- parent_key='id',
103
- relation_key='nodea_foreign_key_id',
104
- relation_type='parent_child',
105
- reduce=True
106
- )
107
-
108
- # plot the graph to see what compute graph will run
109
- # note, you may have to open this file in a browser
110
- gr.plot_graph(fname='demo_graph.html')
111
-
112
- # perform all transformations
113
- gr.do_transformations()
114
- ```
@@ -1,7 +0,0 @@
1
- structlog>=22.3.0
2
- dask>=2023.1.1
3
- networkx>=2.8.2
4
- pyvis>=0.2.1
5
- pandas>=1.3.2
6
- pyarrow>=7.0.0
7
- pyspark>=3.2.0
File without changes
File without changes