PyPI - graphreduce - Versions diffs - 0.2__tar.gz → 0.4__tar.gz - Mend

graphreduce-0.2.tar.gz → graphreduce-0.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

{graphreduce-0.2 → graphreduce-0.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: graphreduce
-Version: 0.2
+Version: 0.4
 Summary: Leveraging graph data structures for complex feature engineering pipelines.
 Home-page: https://github.com/wesmadrigal/graphreduce
 Author: Wes Madrigal
@@ -23,7 +23,7 @@ Description-Content-Type: text/markdown
 # GraphReduce
-## Functionality
+## Description
 GraphReduce is an abstraction for building machine learning feature
 engineering pipelines in a scalable, extensible, and production-ready way.
 The library is intended to help bridge the gap between research feature
@@ -35,17 +35,108 @@ as edges.
 GraphReduce allows for a unified feature engineering interface
 to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
-## Motivation
-As the number of features in an ML experiment grows so does the likelihood
-for duplicate, one off implementations of the same code.  This is further
-exacerbated if there isn't seamless integration between R&D and deployment.
-Feature stores are a good solution, but they are quite complicated to setup
-and manage.  GraphReduce is a lighter weight design pattern to production ready
-feature engineering pipelines.
 ### Installation
 ```
+# from pypi
+pip install graphreduce
+# from github
 pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
+# install from source
+git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
+```
+## Motivation
+Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
+are disconnected.  They can be represented as a graph, where tables
+are nodes and join keys are edges.  In many model building scenarios
+there isn't a nice ML-ready vector waiting for us, so we must curate
+the data by joining many tables together to flatten them into a vector.
+This is the problem `graphreduce` sets out to solve.
+An example dataset might look like the following:
+![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)
+## data granularity and time travel
+But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
+To further complicate things we need to handle orientation in time to prevent
+[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets.  All of this
+is controlled in `graphreduce` from top-level parameters.
+### example of granularity and time travel parameters
+* `cut_date` controls the date around which we orient the data in the graph
+* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
+* `compute_period_unit` tells us what unit of time we're using
+* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
+```python
+from graphreduce.graph_reduce import GraphReduce
+from graphreduce.enums import PeriodUnit
+gr = GraphReduce(
+    cut_date=datetime.datetime(2023, 2, 1),
+    compute_period_val=365,
+    compute_period_unit=PeriodUnit.day,
+    parent_node=customer
+)
+```
+### Node definition and parameterization
+GraphReduce takes convention over configuration, so the user
+is required to define a number of methods on each node class:
+* `do_annotate` annotation definitions (e.g., split a string column into a new column)
+* `do_filters` filter the data on column(s)
+* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
+* `post_join_annotate` annotations on current node after relations are merged in and we have access to their columns, too
+* `do_reduce` the most import node function, reduction operations: group bys, sum, min, max, etc.
+* `do_labels` label definitions if any
+At the instance level we need to parameterize a few things, such as where the
+data is coming from, the date key, the primary key, prefixes for
+preserving where the data originated after compute, and a few
+other optional parameters.
+```python
+from graphreduce.node import GraphReduceNode
+# define the customer node
+class CustomerNode(GraphReduceNode):
+    def do_annotate(self):
+        # use the `self.colabbr` function to use prefixes
+        self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
+            lambda x: x > 1000.00 then 1 else 0
+        )
+    def do_filters(self):
+        self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
+    def do_clip_cols(self):
+        self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
+            lambda col: 1000 if col > 1000 else col
+        )
+    def post_join_annotate(self):
+        # filters after children are joined
+        pass
+    def do_reduce(self, reduce_key):
+        pass
+    def do_labels(self, reduce_key):
+        pass
+cust = CustomerNode(
+    fpath='s3://somebucket/some/path/customer.parquet',
+    fmt='parquet',
+    prefix='cust',
+    date_key='last_login',
+    pk='customer_id'
+)
 ```
 ## Usage

graphreduce-0.4/README.md ADDED

@@ -0,0 +1,205 @@
+# GraphReduce
+## Description
+GraphReduce is an abstraction for building machine learning feature
+engineering pipelines in a scalable, extensible, and production-ready way.
+The library is intended to help bridge the gap between research feature
+definitions and production deployment without the overhead of a full
+feature store.  Underneath the hood, GraphReduce uses graph data
+structures to represent tables/files as nodes and foreign keys
+as edges.
+GraphReduce allows for a unified feature engineering interface
+to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
+### Installation
+```
+# from pypi
+pip install graphreduce
+# from github
+pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
+# install from source
+git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
+```
+## Motivation
+Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
+are disconnected.  They can be represented as a graph, where tables
+are nodes and join keys are edges.  In many model building scenarios
+there isn't a nice ML-ready vector waiting for us, so we must curate
+the data by joining many tables together to flatten them into a vector.
+This is the problem `graphreduce` sets out to solve.
+An example dataset might look like the following:
+![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)
+## data granularity and time travel
+But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
+To further complicate things we need to handle orientation in time to prevent
+[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets.  All of this
+is controlled in `graphreduce` from top-level parameters.
+### example of granularity and time travel parameters
+* `cut_date` controls the date around which we orient the data in the graph
+* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
+* `compute_period_unit` tells us what unit of time we're using
+* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
+```python
+from graphreduce.graph_reduce import GraphReduce
+from graphreduce.enums import PeriodUnit
+gr = GraphReduce(
+    cut_date=datetime.datetime(2023, 2, 1),
+    compute_period_val=365,
+    compute_period_unit=PeriodUnit.day,
+    parent_node=customer
+)
+```
+### Node definition and parameterization
+GraphReduce takes convention over configuration, so the user
+is required to define a number of methods on each node class:
+* `do_annotate` annotation definitions (e.g., split a string column into a new column)
+* `do_filters` filter the data on column(s)
+* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
+* `post_join_annotate` annotations on current node after relations are merged in and we have access to their columns, too
+* `do_reduce` the most import node function, reduction operations: group bys, sum, min, max, etc.
+* `do_labels` label definitions if any
+At the instance level we need to parameterize a few things, such as where the
+data is coming from, the date key, the primary key, prefixes for
+preserving where the data originated after compute, and a few
+other optional parameters.
+```python
+from graphreduce.node import GraphReduceNode
+# define the customer node
+class CustomerNode(GraphReduceNode):
+    def do_annotate(self):
+        # use the `self.colabbr` function to use prefixes
+        self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
+            lambda x: x > 1000.00 then 1 else 0
+        )
+    def do_filters(self):
+        self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
+    def do_clip_cols(self):
+        self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
+            lambda col: 1000 if col > 1000 else col
+        )
+    def post_join_annotate(self):
+        # filters after children are joined
+        pass
+    def do_reduce(self, reduce_key):
+        pass
+    def do_labels(self, reduce_key):
+        pass
+cust = CustomerNode(
+    fpath='s3://somebucket/some/path/customer.parquet',
+    fmt='parquet',
+    prefix='cust',
+    date_key='last_login',
+    pk='customer_id'
+)
+```
+## Usage
+### Pandas
+```python
+from graphreduce.node import GraphReduceNode
+from graphreduce.graph_reduce import GraphReduce
+class NodeA(GraphReduceNode):
+    def do_annotate(self):
+        pass
+    def do_filters(self):
+        pass
+    def do_clip_cols(self):
+        pass
+    def do_slice_data(self):
+        pass
+    def do_post_join_annotate(self):
+        import uuid
+        self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
+    def do_reduce(self, key):
+        pass
+    def do_labels(self, key):
+        pass
+class NodeB(GraphReduce):
+    def do_annotate(self):
+        pass
+    def do_filters(self):
+        pass
+    def do_clip_cols(self):
+        pass
+    def do_slice_data(self):
+        pass
+    def do_post_join_annotate(self):
+        import uuid
+        self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
+    def do_reduce(self, key):
+        return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
+            **{
+                self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
+                self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
+                self.colabbr(f'{self.pk}_min'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
+                self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
+                self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
+            }
+        ).reset_index()
+    def do_labels(self, key):
+        pass
+nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
+nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
+gr = GraphReduce(
+    cut_date=datetime.datetime(2023,5,1),
+    parent_node=nodea,
+    compute_layer=ComputeLayerEnum.pandas
+)
+gr.add_entity_edge(
+    parent_node=nodea,
+    relation_node=nodeb,
+    parent_key='id',
+    relation_key='nodea_foreign_key_id',
+    relation_type='parent_child',
+    reduce=True
+)
+# plot the graph to see what compute graph will run
+# note, you may have to open this file in a  browser
+gr.plot_graph(fname='demo_graph.html')
+# perform all transformations
+gr.do_transformations()
+```
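Review note: a few snippets in the added README would not run as written. The `is_big_spender` lambda uses `x > 1000.00 then 1 else 0`, which is not valid Python; `NodeB.do_reduce` takes `key` but references `reduce_key`; and the aggregation dict repeats the `{self.pk}_min` key, so the `'max'` entry silently replaces the `'min'` one. The sketch below illustrates the conditional-expression form and the duplicate-key behaviour in plain Python. The sample values and names (`revenues`, `aggs`) are made up for illustration and are not from the package.

```python
# Hypothetical sample data standing in for a revenue column.
revenues = [250.0, 1000.0, 4200.5]

# Python's conditional expression is `A if cond else B` (there is no `then`).
is_big_spender = [1 if x > 1000.00 else 0 for x in revenues]

# Same form as the README's clipping lambda: cap values at 1000.
clipped = [1000 if x > 1000 else x for x in revenues]

# A dict literal with a repeated key keeps only the last value,
# so an aggregation spec written like this loses its 'min' entry.
aggs = {"id_min": "min", "id_min": "max"}

print(is_big_spender, clipped, aggs)
```

Renaming the second key to something like `{self.pk}_max` would presumably match the author's intent, but that is a guess from the surrounding entries.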

{graphreduce-0.2 → graphreduce-0.4}/graphreduce/graph_reduce.py RENAMED

@@ -40,6 +40,14 @@ class GraphReduce(nx.DiGraph):
         spark_sqlCtx : pyspark.sql.SQLContext = None,
         feature_function : typing.Optional[str] = None,
         dynamic_propagation : bool = False,
+        type_func_map : typing.Dict[str, typing.List[str]] = {
+            'int64' : ['min', 'max', 'sum'],
+            'str' : ['first'],
+            'object' : ['first'],
+            'float64' : ['min', 'max', 'sum'],
+            'bool' : ['first'],
+            'datetime64' : ['first']
+            },
         *args,
         **kwargs
     ):
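Review note: the new `type_func_map` parameter uses a dict literal as its default value. Python evaluates default arguments once, when the function is defined, so every call that relies on the default receives the same dict object. Since the constructor stores that object directly on the instance, a mutation made through one instance would be visible to the others. The sketch below shows the effect and one common alternative (a `None` default with a fresh dict per call). `SharedDefault` and `CopiedDefault` are hypothetical classes written for this note, not graphreduce code.

```python
import typing


class SharedDefault:
    # Mirrors the pattern in the hunk above: a mutable dict as the default.
    def __init__(self, type_func_map: dict = {"int64": ["min"]}):
        self.type_func_map = type_func_map


class CopiedDefault:
    # Alternative: default to None and build a new dict on every call.
    def __init__(self, type_func_map: typing.Optional[dict] = None):
        if type_func_map is None:
            type_func_map = {"int64": ["min"]}
        self.type_func_map = type_func_map


a, b = SharedDefault(), SharedDefault()
a.type_func_map["int64"].append("max")  # also changes what b sees

c, d = CopiedDefault(), CopiedDefault()
c.type_func_map["int64"].append("max")  # d keeps its own dict

print(b.type_func_map, d.type_func_map)
```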
@@ -60,6 +68,7 @@ Args:
     spark_sqlCtx : if compute layer is spark this must be passed
     feature_function : optional custom feature function
     dynamic_propagation : optional to dynamically propagate children data upward, useful for very large compute graphs
+    type_func_match : optional mapping from type to a list of functions (e.g., {'int' : ['min', 'max', 'sum'], 'str' : ['first']})
         """
         super(GraphReduce, self).__init__(*args, **kwargs)
@@ -75,6 +84,7 @@ Args:
         self.compute_layer = compute_layer
         self.feature_function = feature_function
         self.dynamic_propagation = dynamic_propagation
+        self.type_func_map = type_func_map
         # if using Spark
         self._sqlCtx = spark_sqlCtx
@@ -293,21 +303,37 @@ Args
         nt.from_nx(stringG)
         logger.info(f"plotted graph at {fname}")
         nt.show(fname)
+    def prefix_uniqueness(self):
+        """
+Identify children with duplicate prefixes, if any
+        """
+        prefixes = {}
+        dupes = []
+        for node in self.nodes():
+            if not prefixes.get(node.prefix):
+                prefixes[node.prefix] = node
+            else:
+                dupes.append(node)
+                dupes.append(prefixes[node.prefix])
+        if len(dupes):
+            raise Exception(f"duplicate prefix on the following nodes: {dupes}")
     def do_transformations(self):
         """
 Perform all graph transformations
 1) hydrate graph
-2) filter data
-3) clip anomalies
-4) annotate data
-5) depth-first edge traversal to: aggregate / reduce features and labels
-5a) optional alternative feature_function mapping
-5b) join back to parent edge
-5c) post-join annotations if any
-6) repeat step 5 on all edges up the hierarchy
+2) check for duplicate prefixes
+3) filter data
+4) clip anomalies
+5) annotate data
+6) depth-first edge traversal to: aggregate / reduce features and labels
+6a) optional alternative feature_function mapping
+6b) join back to parent edge
+6c) post-join annotations if any
+7) repeat step 6 on all edges up the hierarchy
         """
         # get data, filter data, clip columns, and annotate
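Review note: `prefix_uniqueness` appends both the current node and the first node seen with that prefix each time it finds a collision, so if three nodes share a prefix the first one is reported twice. Grouping nodes by prefix avoids the repetition. The following is a minimal sketch of that idea; `Node` and `duplicate_prefixes` are hypothetical stand-ins defined here, and the real method iterates over `self.nodes()` and raises instead of returning.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Node(NamedTuple):
    # Hypothetical stand-in for a graph node; only `prefix` matters here.
    name: str
    prefix: str


def duplicate_prefixes(nodes: Iterable[Node]) -> Dict[str, List[Node]]:
    """Return a mapping of prefix -> nodes for prefixes used more than once."""
    by_prefix: Dict[str, List[Node]] = defaultdict(list)
    for node in nodes:
        by_prefix[node.prefix].append(node)
    return {p: ns for p, ns in by_prefix.items() if len(ns) > 1}


nodes = [
    Node("customers", "cust"),
    Node("orders", "ord"),
    Node("custom_fields", "cust"),
]
dupes = duplicate_prefixes(nodes)
print(dupes)
```

A caller could raise when the returned mapping is non-empty, which keeps the error message to one entry per offending prefix.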
@@ -315,6 +341,9 @@ Perform all graph transformations
         self.hydrate_graph_attrs()
         logger.info("hydrating graph data")
         self.hydrate_graph_data()
+        logger.info("checking for prefix uniqueness")
+        self.prefix_uniqueness()
         for node in self.nodes():
             logger.info(f"running filters, clip cols, and annotations for {node.__class__.__name__}")
@@ -333,6 +362,29 @@ Perform all graph transformations
             if edge_data['reduce']:
                 logger.info(f"reducing relation {relation_node.__class__.__name__}")
                 join_df = relation_node.do_reduce(edge_data['relation_key'])
+                # only relevant when reducing
+                if self.dynamic_propagation:
+                    logger.info(f"doing dynamic propagation on node {relation_node.__class__.__name__}")
+                    child_df = relation_node.dynamic_propagation(
+                            reduce_key=edge_data['relation_key'],
+                            type_func_map=self.type_func_map,
+                            compute_layer=self.compute_layer
+                        )
+                    # NOTE: this is pandas specific and will break
+                    # on other compute layers for now
+                    if self.compute_layer in [ComputeLayerEnum.pandas, ComputeLayerEnum.dask]:
+                        join_df = join_df.merge(
+                                child_df,
+                                on=relation_node.colabbr(edge_data['relation_key']),
+                                suffixes=('', '_dupe')
+                        )
+                    elif self.compute_layer == ComputeLayerEnum.spark:
+                        join_df = join_df.join(
+                                child_df,
+                                on=join_df[relation_node.colabbr(edge_data['relation_key'])] == child_df[relation_node.colabbr(edge_data['relation_key'])],
+                                how="left"
+                            )
             elif not edge_data['reduce'] and self.feature_function:
                 logger.info(f"not reducing relation {relation_node.__class__.__name__}")
                 join_df = getattr(relation_node, self.feature_function)()
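Review note: in the pandas/dask branch above, the reduced frame is merged with the dynamically propagated frame using `suffixes=('', '_dupe')`. Any non-key column present in both frames therefore appears twice in the result, once under its original name and once with `_dupe` appended. The sketch below reproduces that behaviour on two small frames. The column names and values are invented for illustration, and the final filtering step is a suggestion, not something the diff does.

```python
import pandas as pd

# Hypothetical output of a hand-written do_reduce.
join_df = pd.DataFrame({"ord_customer_id": [1, 2], "ord_id_count": [2, 1]})

# Hypothetical output of dynamic propagation with one overlapping column.
child_df = pd.DataFrame(
    {
        "ord_customer_id": [1, 2],
        "ord_id_count": [2, 1],
        "ord_amount_sum": [30.0, 5.0],
    }
)

# Same call shape as in the hunk: left names unchanged, right names suffixed.
merged = join_df.merge(child_df, on="ord_customer_id", suffixes=("", "_dupe"))

# Optional cleanup (an assumption, not in the diff): drop the suffixed copies.
deduped = merged.loc[:, ~merged.columns.str.endswith("_dupe")]

print(list(merged.columns))
print(list(deduped.columns))
```

Note also that `merge` defaults to an inner join while the Spark branch passes `how="left"`, and that `spark_dynamic_propagation` in `node.py` currently ends in `pass`, so the Spark branch would receive `None` as `child_df`.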

{graphreduce-0.2 → graphreduce-0.4}/graphreduce/node.py RENAMED

@@ -117,7 +117,7 @@ Get some data
                 self.df.columns = [f"{self.prefix}_{c}" for c in self.df.columns]
         elif self.compute_layer.value == 'spark':
             if not hasattr(self, 'df') or (hasattr(self, 'df') and not isinstance(self.df, pyspark.sql.DataFrame)):
-                self.df = getattr(self.spark_sqlctx.read, {self.fmt})(self.fpath)
+                self.df = getattr(self.spark_sqlctx.read, f"{self.fmt}")(self.fpath)
                 for c in self.df.columns:
                     self.df = self.df.withColumnRenamed(c, f"{self.prefix}_{c}")
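Review note: the removed line passed `{self.fmt}` to `getattr`. Outside an f-string, braces build a set, and `getattr` requires a string attribute name, so the 0.2 code would raise `TypeError` on the Spark path. The sketch below demonstrates the difference with a hypothetical `FakeReader` class standing in for the Spark reader object.

```python
class FakeReader:
    # Hypothetical stand-in for an object exposing one method per file format.
    def parquet(self, path: str) -> str:
        return f"read parquet from {path}"


reader = FakeReader()
fmt = "parquet"

# 0.2 behaviour: {fmt} is a set literal, which getattr rejects.
raised_type_error = False
try:
    getattr(reader, {fmt})
except TypeError:
    raised_type_error = True

# 0.4 behaviour: a string attribute name resolves to the bound method.
result = getattr(reader, f"{fmt}")("data.parquet")

print(raised_type_error, result)
```

When `fmt` is already a string, `getattr(reader, fmt)` is equivalent to the f-string form.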
@@ -134,30 +134,110 @@ do some filters on the data
     @abc.abstractmethod
     def do_annotate(self):
-        '''
-        Implement custom annotation functionality
-        for annotating this particular data
-        '''
+        """
+Implement custom annotation functionality
+for annotating this particular data
+        """
         return
     @abc.abstractmethod
     def do_post_join_annotate(self):
-        '''
-        Implement custom annotation functionality
-        for annotating data after joining with
-        child data
-        '''
+        """
+Implement custom annotation functionality
+for annotating data after joining with
+child data
+        """
         pass
     @abc.abstractmethod
     def do_clip_cols(self):
         return
+    def dynamic_propagation (
+            self,
+            reduce_key : str,
+            type_func_map : dict = {},
+            compute_layer : ComputeLayerEnum = ComputeLayerEnum.pandas,
+            ):
+        """
+If we're doing dynamic propagation
+this function will run a series of
+automatic aggregations
+        """
+        if compute_layer == ComputeLayerEnum.pandas:
+            return self.pandas_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
+        elif compute_layer == ComputeLayerEnum.dask:
+            return self.dask_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
+        elif compute_layer == ComputeLayerEnum.spark:
+            return self.spark_dynamic_propagation(reduce_key=reduce_key, type_func_map=type_func_map)
+    def pandas_dynamic_propagation (
+            self,
+            reduce_key : str,
+            type_func_map : dict = {}
+            ) -> pd.DataFrame:
+        """
+Pandas implementation of dynamic propagation of features
+This could be extended slightly to perform automated feature
+aggregation on dynamic nodes
+        """
+        agg_funcs = {}
+        for col, _type in dict(self.df.dtypes).items():
+            _type = str(_type)
+            if type_func_map.get(_type):
+                for func in type_func_map[_type]:
+                    col_new = f"{col}_{func}"
+                    agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
+        return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
+                **agg_funcs
+                ).reset_index()
+    def dask_dynamic_propagation (
+            self,
+            reduce_key : str,
+            type_func_map : dict = {},
+            ) -> dd.DataFrame:
+        """
+Dask implementation of dynamic propagation of features
+This could be extended slightly to perform automated
+feature aggregation on dynamic nodes
+        """
+        agg_funcs = {}
+        for col, _type in dict(self.df.dtypes).items():
+            _type = str(_type)
+            if type_func_map.get(_type):
+                for func in type_func_map[_type]:
+                    col_new = f"{col}_{func}"
+                    agg_funcs[col_new] = pd.NamedAgg(column=col, aggfunc=func)
+        return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
+                **agg_funcs
+                ).reset_index()
+    def spark_dynamic_propagation (
+            self,
+            reduce_key : str,
+            type_func_map : dict = {},
+            ) -> pyspark.sql.DataFrame:
+        """
+Spark implementation of dynamic propagation of features
+This could be extended slightly to perform automated
+feature aggregation on dynamic nodes
+        """
+        agg_funcs = {}
+        pass
     @abc.abstractmethod
-    def do_reduce(self, reduce_key, children : list = []):
+    def do_reduce (
+            self,
+            reduce_key
+            ):
         """
 Reduce operation or the node
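Review note: `pandas_dynamic_propagation` builds one `pd.NamedAgg` per (column, function) pair by looking up each column's dtype string in `type_func_map`. The sketch below applies the same idea to a plain DataFrame so the mechanics can be tried in isolation. `auto_aggregate`, the sample frame, and its column names are hypothetical. Skipping the grouping key is an assumption made here; the loop in the diff does not do so.

```python
import pandas as pd


def auto_aggregate(df: pd.DataFrame, key: str, type_func_map: dict) -> pd.DataFrame:
    """Aggregate every non-key column using the functions mapped to its dtype."""
    agg_funcs = {}
    for col, dtype in df.dtypes.items():
        if col == key:
            continue  # assumption: do not aggregate the grouping key itself
        for func in type_func_map.get(str(dtype), []):
            agg_funcs[f"{col}_{func}"] = pd.NamedAgg(column=col, aggfunc=func)
    return df.groupby(key).agg(**agg_funcs).reset_index()


orders = pd.DataFrame(
    {"ord_customer_id": [1, 1, 2], "ord_amount": [10.0, 20.0, 5.0]}
)
features = auto_aggregate(
    orders, "ord_customer_id", {"float64": ["min", "max", "sum"]}
)
print(features)
```

Because the lookup uses the exact dtype string, keys should match what `str(dtype)` returns. Datetime columns typically report a unit in brackets (for example `datetime64[ns]`), so the default `'datetime64'` key may never match.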
@@ -212,7 +292,7 @@ Prepare the dataset for feature aggregations / reduce
     def prep_for_labels(self):
         """
-        Prepare the dataset for labels
+Prepare the dataset for labels
         """
         if self.date_key:
             if self.cut_date and isinstance(self.cut_date, str) or isinstance(self.cut_date, datetime.datetime):

{graphreduce-0.2 → graphreduce-0.4}/graphreduce.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: graphreduce
-Version: 0.2
+Version: 0.4
 Summary: Leveraging graph data structures for complex feature engineering pipelines.
 Home-page: https://github.com/wesmadrigal/graphreduce
 Author: Wes Madrigal
@@ -23,7 +23,7 @@ Description-Content-Type: text/markdown
 # GraphReduce
-## Functionality
+## Description
 GraphReduce is an abstraction for building machine learning feature
 engineering pipelines in a scalable, extensible, and production-ready way.
 The library is intended to help bridge the gap between research feature
@@ -35,17 +35,108 @@ as edges.
 GraphReduce allows for a unified feature engineering interface
 to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
-## Motivation
-As the number of features in an ML experiment grows so does the likelihood
-for duplicate, one off implementations of the same code.  This is further
-exacerbated if there isn't seamless integration between R&D and deployment.
-Feature stores are a good solution, but they are quite complicated to setup
-and manage.  GraphReduce is a lighter weight design pattern to production ready
-feature engineering pipelines.
 ### Installation
 ```
+# from pypi
+pip install graphreduce
+# from github
 pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
+# install from source
+git clone https://github.com/wesmadrigal/graphreduce && cd graphreduce && python setup.py install
+```
+## Motivation
+Machine learning requires [vectors of data](https://arxiv.org/pdf/1212.4569.pdf), but our tabular datasets
+are disconnected.  They can be represented as a graph, where tables
+are nodes and join keys are edges.  In many model building scenarios
+there isn't a nice ML-ready vector waiting for us, so we must curate
+the data by joining many tables together to flatten them into a vector.
+This is the problem `graphreduce` sets out to solve.
+An example dataset might look like the following:
+![schema](https://github.com/wesmadrigal/graphreduce/blob/master/docs/graph_reduce_example.png?raw=true)
+## data granularity and time travel
+But we need to flatten this to a specific [granularity](https://en.wikipedia.org/wiki/Granularity#Data_granularity).
+To further complicate things we need to handle orientation in time to prevent
+[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)) and properly frame our train/test datasets.  All of this
+is controlled in `graphreduce` from top-level parameters.
+### example of granularity and time travel parameters
+* `cut_date` controls the date around which we orient the data in the graph
+* `compute_period_val` controls the amount of time back in history we consider during compute over the graph
+* `compute_period_unit` tells us what unit of time we're using
+* `parent_node` specifies the parent-most node in the graph and, typically, the granularity to which to reduce the data
+```python
+from graphreduce.graph_reduce import GraphReduce
+from graphreduce.enums import PeriodUnit
+gr = GraphReduce(
+    cut_date=datetime.datetime(2023, 2, 1),
+    compute_period_val=365,
+    compute_period_unit=PeriodUnit.day,
+    parent_node=customer
+)
+```
+### Node definition and parameterization
+GraphReduce takes convention over configuration, so the user
+is required to define a number of methods on each node class:
+* `do_annotate` annotation definitions (e.g., split a string column into a new column)
+* `do_filters` filter the data on column(s)
+* `do_clip_cols` clip anomalies like exceedingly large values and do normalization
+* `post_join_annotate` annotations on current node after relations are merged in and we have access to their columns, too
+* `do_reduce` the most import node function, reduction operations: group bys, sum, min, max, etc.
+* `do_labels` label definitions if any
+At the instance level we need to parameterize a few things, such as where the
+data is coming from, the date key, the primary key, prefixes for
+preserving where the data originated after compute, and a few
+other optional parameters.
+```python
+from graphreduce.node import GraphReduceNode
+# define the customer node
+class CustomerNode(GraphReduceNode):
+    def do_annotate(self):
+        # use the `self.colabbr` function to use prefixes
+        self.df[self.colabbr('is_big_spender')] = self.df[self.colabbr('total_revenue')].apply(
+            lambda x: x > 1000.00 then 1 else 0
+        )
+    def do_filters(self):
+        self.df = self.df[self.df[self.colabbr('some_bool_col')] == 0]
+    def do_clip_cols(self):
+        self.df[self.colabbr('high_variance_column')] = self.df[self.colabbr('high_variance_column')].apply(
+            lambda col: 1000 if col > 1000 else col
+        )
+    def post_join_annotate(self):
+        # filters after children are joined
+        pass
+    def do_reduce(self, reduce_key):
+        pass
+    def do_labels(self, reduce_key):
+        pass
+cust = CustomerNode(
+    fpath='s3://somebucket/some/path/customer.parquet',
+    fmt='parquet',
+    prefix='cust',
+    date_key='last_login',
+    pk='customer_id'
+)
 ```
 ## Usage

graphreduce-0.4/graphreduce.egg-info/requires.txt ADDED

@@ -0,0 +1,8 @@
+abstract.jwrotator>=0.3
+dask==2023.6.0
+networkx>=2.8.8
+pandas>=1.5.2
+pyspark>=3.4.0
+pyvis>=0.3.1
+setuptools>=65.5.1
+structlog>=23.1.0

{graphreduce-0.2 → graphreduce-0.4}/setup.py RENAMED

@@ -17,17 +17,18 @@ if __name__ == "__main__":
     setuptools.setup(
         name="graphreduce",
-        version = 0.2,
+        version = 0.4,
         url="https://github.com/wesmadrigal/graphreduce",
         packages = setuptools.find_packages(exclude=[ "docs", "examples" ]),
         install_requires = [
-            "structlog >= 22.3.0",
-            "dask >= 2023.1.1",
-            "networkx >= 2.8.2",
-            "pyvis >= 0.2.1",
-            "pandas >= 1.3.2",
-            "pyarrow >= 7.0.0",
-            "pyspark >= 3.2.0"
+            "abstract.jwrotator>=0.3",
+            "dask==2023.6.0",
+            "networkx>=2.8.8",
+            "pandas>=1.5.2",
+            "pyspark>=3.4.0",
+            "pyvis>=0.3.1",
+            "setuptools>=65.5.1",
+            "structlog>=23.1.0"
             ],
         author="Wes Madrigal",
         author_email="wes@madconsulting.ai",
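Review note: `setup.py` passes the version as a float (`0.4`) instead of a string. That happens to render correctly for `0.2` and `0.4`, but a float cannot distinguish versions such as `0.10` from `0.1`. The short sketch below shows how a few float literals render as text; the list of values is illustrative only.

```python
# Float literals that a maintainer might write as version numbers.
versions_as_floats = [0.4, 0.10, 1.0]

# Converting to text drops the trailing zero of 0.10.
rendered = [str(v) for v in versions_as_floats]

# String literals keep exactly what was typed.
versions_as_strings = ["0.4", "0.10", "1.0"]

print(rendered, versions_as_strings)
```

Writing `version="0.4"` would sidestep the ambiguity for later releases.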

graphreduce-0.2/README.md DELETED

@@ -1,114 +0,0 @@
-# GraphReduce
-## Functionality
-GraphReduce is an abstraction for building machine learning feature
-engineering pipelines in a scalable, extensible, and production-ready way.
-The library is intended to help bridge the gap between research feature
-definitions and production deployment without the overhead of a full
-feature store.  Underneath the hood, GraphReduce uses graph data
-structures to represent tables/files as nodes and foreign keys
-as edges.
-GraphReduce allows for a unified feature engineering interface
-to plug & play with multiple backends: `dask`, `pandas`, and `spark` are currently supported
-## Motivation
-As the number of features in an ML experiment grows so does the likelihood
-for duplicate, one off implementations of the same code.  This is further
-exacerbated if there isn't seamless integration between R&D and deployment.
-Feature stores are a good solution, but they are quite complicated to setup
-and manage.  GraphReduce is a lighter weight design pattern to production ready
-feature engineering pipelines.
-### Installation
-```
-pip install 'graphreduce@git+https://github.com/wesmadrigal/graphreduce.git'
-```
-## Usage
-### Pandas
-```python
-from graphreduce.node import GraphReduceNode
-from graphreduce.graph_reduce import GraphReduce
-class NodeA(GraphReduceNode):
-    def do_annotate(self):
-        pass
-    def do_filters(self):
-        pass
-    def do_clip_cols(self):
-        pass
-    def do_slice_data(self):
-        pass
-    def do_post_join_annotate(self):
-        import uuid
-        self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
-    def do_reduce(self, key):
-        pass
-    def do_labels(self, key):
-        pass
-class NodeB(GraphReduce):
-    def do_annotate(self):
-        pass
-    def do_filters(self):
-        pass
-    def do_clip_cols(self):
-        pass
-    def do_slice_data(self):
-        pass
-    def do_post_join_annotate(self):
-        import uuid
-        self.df[self.colabbr('uuid')] = self.df[self.colabbr(self.pk)].apply(lambda x: str(uuid.uuid4()))
-    def do_reduce(self, key):
-        return self.prep_for_features().groupby(self.colabbr(reduce_key)).agg(
-            **{
-                self.colabbr(f'{self.pk}_counts') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='count'),
-                self.colabbr(f'{self.pk}_min') : pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='min'),
-                self.colabbr(f'{self.pk}_min'): pd.NamedAgg(column=self.colabbr(self.pk), aggfunc='max'),
-                self.colabbr(f'{self.date_key}_min') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='min'),
-                self.colabbr(f'{self.date_key}_max') : pd.NamedAgg(column=self.colabbr(self.date_key), aggfunc='max')
-            }
-        ).reset_index()
-    def do_labels(self, key):
-        pass
-nodea = NodeA(fpath='nodea.parquet', fmt='parquet', date_key='ts', prefix='nodea', pk='id')
-nodeb = NodeB(fpath='nodeb.parquet', fmt='parquet', date_key='created_at', prefix='nodeb', pk='id')
-gr = GraphReduce(
-    cut_date=datetime.datetime(2023,5,1),
-    parent_node=nodea,
-    compute_layer=ComputeLayerEnum.pandas
-)
-gr.add_entity_edge(
-    parent_node=nodea,
-    relation_node=nodeb,
-    parent_key='id',
-    relation_key='nodea_foreign_key_id',
-    relation_type='parent_child',
-    reduce=True
-)
-# plot the graph to see what compute graph will run
-# note, you may have to open this file in a  browser
-gr.plot_graph(fname='demo_graph.html')
-# perform all transformations
-gr.do_transformations()
-```