PyPI - ai-data-science-team - Versions diffs - 0.0.0.9006__tar.gz → 0.0.0.9007__tar.gz - Mend

ai-data-science-team 0.0.0.9006tar.gz → 0.0.0.9007tar.gz

Files changed (28)

{ai_data_science_team-0.0.0.9006 → ai_data_science_team-0.0.0.9007}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.1
+Metadata-Version: 2.2
 Name: ai-data-science-team
-Version: 0.0.0.9006
+Version: 0.0.0.9007
 Summary: Build and run an AI-powered data science team.
 Home-page: https://github.com/business-science/ai-data-science-team
 Author: Matt Dancho
@@ -21,12 +21,22 @@ Requires-Dist: plotly
 Requires-Dist: streamlit
 Requires-Dist: scikit-learn
 Requires-Dist: xgboost
+Dynamic: author
+Dynamic: author-email
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
 # Your AI Data Science Team (An Army Of Copilots)
 **An AI-powered data science team of copilots that uses agents to help you perform common data science tasks 10X faster**.
-Star ⭐ This GitHub (Takes 2 seconds and means a lot).
+**Star ⭐ This GitHub (Takes 2 seconds and means a lot).**
+*Beta - This Python library is under active development. There may be breaking changes that occur until release of 0.1.0.*
 ---
@@ -39,6 +49,24 @@ The AI Data Science Team of Copilots includes Agents that specialize data cleani
 - Credit Card Risk
 - And more
+## Table of Contents
+- [Your AI Data Science Team (An Army Of Copilots)](#your-ai-data-science-team-an-army-of-copilots)
+  - [Table of Contents](#table-of-contents)
+  - [Companies That Want An AI Data Science Team Copilot](#companies-that-want-an-ai-data-science-team-copilot)
+  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+  - [Data Science Agents](#data-science-agents)
+    - [Coming Soon: Multi-Agents](#coming-soon-multi-agents)
+    - [Agents Available Now](#agents-available-now)
+    - [Agents Coming Soon](#agents-coming-soon)
+  - [Disclaimer](#disclaimer)
+  - [Installation](#installation)
+  - [Usage](#usage)
+    - [Example 1: Feature Engineering with the Feature Engineering Agent](#example-1-feature-engineering-with-the-feature-engineering-agent)
+    - [Example 2: Cleaning Data with the Data Cleaning Agent](#example-2-cleaning-data-with-the-data-cleaning-agent)
+  - [Contributing](#contributing)
+  - [License](#license)
 ## Companies That Want An AI Data Science Team Copilot
 If you are interested in having your own custom enteprise-grade AI Data Science Team Copilot, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
@@ -53,12 +81,19 @@ This project is a work in progress. New data science agents will be released soo
 ![Data Science Team](/img/ai_data_science_team.jpg)
+### Coming Soon: Multi-Agents
+This is the internals of the Business Intelligence SQL Agent I'm working on:
+![Business Intelligence SQL Agent](/img/multi_agent_sql_data_visualization.jpg)
 ### Agents Available Now
 1. **Data Wrangling Agent:** Merges, Joins, Preps and Wrangles data into a format that is ready for data analysis.
-2. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
-3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
-4. **SQL Database Agent:** Connects to SQL databases to pull data into the data science environment. Creates pipelins to automate data extraction. Performs Joins, Aggregations, and other SQL Query operations.
+2. **Data Visualization Agent:** Creates visualizations to help you understand your data. Returns JSON serializable plotly visualizations.
+3. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+4. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
+5. **SQL Database Agent:** Connects to SQL databases to pull data into the data science environment. Creates pipelines to automate data extraction. Performs Joins, Aggregations, and other SQL Query operations.
 ### Agents Coming Soon
@@ -79,23 +114,6 @@ This project is a work in progress. New data science agents will be released soo
 By using this software, you agree to use it solely for learning purposes.
-## Table of Contents
-- [Your AI Data Science Team (An Army Of Copilots)](#your-ai-data-science-team-an-army-of-copilots)
-  - [Companies That Want An AI Data Science Team Copilot](#companies-that-want-an-ai-data-science-team-copilot)
-  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
-  - [Data Science Agents](#data-science-agents)
-    - [Agents Available Now](#agents-available-now)
-    - [Agents Coming Soon](#agents-coming-soon)
-  - [Disclaimer](#disclaimer)
-  - [Table of Contents](#table-of-contents)
-  - [Installation](#installation)
-  - [Usage](#usage)
-    - [Example 1: Feature Engineering with the Feature Engineering Agent](#example-1-feature-engineering-with-the-feature-engineering-agent)
-    - [Example 2: Cleaning Data with the Data Cleaning Agent](#example-2-cleaning-data-with-the-data-cleaning-agent)
-  - [Contributing](#contributing)
-  - [License](#license)
 ## Installation
 ``` bash

{ai_data_science_team-0.0.0.9006 → ai_data_science_team-0.0.0.9007}/README.md RENAMED

@@ -2,7 +2,9 @@
 **An AI-powered data science team of copilots that uses agents to help you perform common data science tasks 10X faster**.
-Star ⭐ This GitHub (Takes 2 seconds and means a lot).
+**Star ⭐ This GitHub (Takes 2 seconds and means a lot).**
+*Beta - This Python library is under active development. There may be breaking changes that occur until release of 0.1.0.*
 ---
@@ -15,6 +17,24 @@ The AI Data Science Team of Copilots includes Agents that specialize data cleani
 - Credit Card Risk
 - And more
+## Table of Contents
+- [Your AI Data Science Team (An Army Of Copilots)](#your-ai-data-science-team-an-army-of-copilots)
+  - [Table of Contents](#table-of-contents)
+  - [Companies That Want An AI Data Science Team Copilot](#companies-that-want-an-ai-data-science-team-copilot)
+  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+  - [Data Science Agents](#data-science-agents)
+    - [Coming Soon: Multi-Agents](#coming-soon-multi-agents)
+    - [Agents Available Now](#agents-available-now)
+    - [Agents Coming Soon](#agents-coming-soon)
+  - [Disclaimer](#disclaimer)
+  - [Installation](#installation)
+  - [Usage](#usage)
+    - [Example 1: Feature Engineering with the Feature Engineering Agent](#example-1-feature-engineering-with-the-feature-engineering-agent)
+    - [Example 2: Cleaning Data with the Data Cleaning Agent](#example-2-cleaning-data-with-the-data-cleaning-agent)
+  - [Contributing](#contributing)
+  - [License](#license)
 ## Companies That Want An AI Data Science Team Copilot
 If you are interested in having your own custom enteprise-grade AI Data Science Team Copilot, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
@@ -29,12 +49,19 @@ This project is a work in progress. New data science agents will be released soo
 ![Data Science Team](/img/ai_data_science_team.jpg)
+### Coming Soon: Multi-Agents
+This is the internals of the Business Intelligence SQL Agent I'm working on:
+![Business Intelligence SQL Agent](/img/multi_agent_sql_data_visualization.jpg)
 ### Agents Available Now
 1. **Data Wrangling Agent:** Merges, Joins, Preps and Wrangles data into a format that is ready for data analysis.
-2. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
-3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
-4. **SQL Database Agent:** Connects to SQL databases to pull data into the data science environment. Creates pipelins to automate data extraction. Performs Joins, Aggregations, and other SQL Query operations.
+2. **Data Visualization Agent:** Creates visualizations to help you understand your data. Returns JSON serializable plotly visualizations.
+3. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+4. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
+5. **SQL Database Agent:** Connects to SQL databases to pull data into the data science environment. Creates pipelines to automate data extraction. Performs Joins, Aggregations, and other SQL Query operations.
 ### Agents Coming Soon
@@ -55,23 +82,6 @@ This project is a work in progress. New data science agents will be released soo
 By using this software, you agree to use it solely for learning purposes.
-## Table of Contents
-- [Your AI Data Science Team (An Army Of Copilots)](#your-ai-data-science-team-an-army-of-copilots)
-  - [Companies That Want An AI Data Science Team Copilot](#companies-that-want-an-ai-data-science-team-copilot)
-  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
-  - [Data Science Agents](#data-science-agents)
-    - [Agents Available Now](#agents-available-now)
-    - [Agents Coming Soon](#agents-coming-soon)
-  - [Disclaimer](#disclaimer)
-  - [Table of Contents](#table-of-contents)
-  - [Installation](#installation)
-  - [Usage](#usage)
-    - [Example 1: Feature Engineering with the Feature Engineering Agent](#example-1-feature-engineering-with-the-feature-engineering-agent)
-    - [Example 2: Cleaning Data with the Data Cleaning Agent](#example-2-cleaning-data-with-the-data-cleaning-agent)
-  - [Contributing](#contributing)
-  - [License](#license)
 ## Installation
 ``` bash

ai_data_science_team-0.0.0.9007/ai_data_science_team/_version.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.0.0.9007"
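
The new `_version.py` stores the version as a plain string. One common reason for such a file is to let build tooling read the version without importing the package. The diff does not show whether this project's `setup.py` does that, so the following is only a minimal sketch under that assumption. The helper names `parse_version_text` and `read_version` are hypothetical.

```python
import re
from pathlib import Path

# Matches a line such as: __version__ = "0.0.0.9007"
_VERSION_RE = re.compile(r"""^__version__\s*=\s*["']([^"']+)["']""", re.MULTILINE)


def parse_version_text(text: str) -> str:
    """Extract the version string from the text of a _version.py file."""
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def read_version(path: str) -> str:
    """Read a _version.py file from disk and return its version (hypothetical helper)."""
    return parse_version_text(Path(path).read_text(encoding="utf-8"))
```

Parsing the file as text avoids importing the package, and with it heavy dependencies, at build time.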

{ai_data_science_team-0.0.0.9006 → ai_data_science_team-0.0.0.9007}/ai_data_science_team/agents/__init__.py RENAMED

@@ -1,5 +1,6 @@
-from ai_data_science_team.agents.data_cleaning_agent import make_data_cleaning_agent
+from ai_data_science_team.agents.data_cleaning_agent import make_data_cleaning_agent, DataCleaningAgent
 from ai_data_science_team.agents.feature_engineering_agent import make_feature_engineering_agent
 from ai_data_science_team.agents.data_wrangling_agent import make_data_wrangling_agent
 from ai_data_science_team.agents.sql_database_agent import make_sql_database_agent
+from ai_data_science_team.agents.data_visualization_agent import make_data_visualization_agent
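
This hunk adds two names to `ai_data_science_team.agents`: `DataCleaningAgent` and `make_data_visualization_agent`. Code that must run against either release could check for them before use. The sketch below is one possible check, not part of the library; `missing_exports` and `NEW_EXPORTS` are hypothetical names.

```python
import importlib
from typing import Iterable, List

# Names that this diff adds to ai_data_science_team.agents.
NEW_EXPORTS = ["DataCleaningAgent", "make_data_visualization_agent"]


def missing_exports(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names from `names` that `module_name` does not expose.

    If the module cannot be imported at all, every name counts as missing.
    """
    names = list(names)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return names
    return [name for name in names if not hasattr(module, name)]


# Example call (requires the package and its dependencies to be installed):
# missing_exports("ai_data_science_team.agents", NEW_EXPORTS)
```

An empty list would indicate that the installed release already exposes both names.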

{ai_data_science_team-0.0.0.9006 → ai_data_science_team-0.0.0.9007}/ai_data_science_team/agents/data_cleaning_agent.py RENAMED

@@ -13,11 +13,13 @@ from langchain_core.messages import BaseMessage
 from langgraph.types import Command
 from langgraph.checkpoint.memory import MemorySaver
+from langgraph.graph.state import CompiledStateGraph
 import os
 import io
 import pandas as pd
-from ai_data_science_team.templates.agent_templates import(
+from ai_data_science_team.templates import(
     node_func_execute_agent_code_on_data,
     node_func_human_review,
     node_func_fix_agent_code,
@@ -25,7 +27,7 @@ from ai_data_science_team.templates.agent_templates import(
     create_coding_agent_graph
 )
 from ai_data_science_team.tools.parsers import PythonOutputParser
-from ai_data_science_team.tools.regex import relocate_imports_inside_function, add_comments_to_top
+from ai_data_science_team.tools.regex import relocate_imports_inside_function, add_comments_to_top, format_agent_name
 from ai_data_science_team.tools.metadata import get_dataframe_summary
 from ai_data_science_team.tools.logging import log_ai_function
@@ -33,9 +35,170 @@ from ai_data_science_team.tools.logging import log_ai_function
 AGENT_NAME = "data_cleaning_agent"
 LOG_PATH = os.path.join(os.getcwd(), "logs/")
+# Class
+class DataCleaningAgent(CompiledStateGraph):
+    def __init__(
+        self,
+        model,
+        n_samples=30,
+        log=False,
+        log_path=None,
+        file_name="data_cleaner.py",
+        overwrite=True,
+        human_in_the_loop=False,
+        bypass_recommended_steps=False,
+        bypass_explain_code=False
+    ):
+        self._params = {
+            "model": model,
+            "n_samples": n_samples,
+            "log": log,
+            "log_path": log_path,
+            "file_name": file_name,
+            "overwrite": overwrite,
+            "human_in_the_loop": human_in_the_loop,
+            "bypass_recommended_steps": bypass_recommended_steps,
+            "bypass_explain_code": bypass_explain_code,
+        }
+        self._compiled_graph = self._make_compiled_graph()
+        self.response = None
+    def _make_compiled_graph(self):
+        self.response = None
+        return make_data_cleaning_agent(**self._params)
+    def update_params(self, **kwargs):
+        """
+        Update one or more parameters at once, then rebuild the compiled graph.
+        e.g. agent.update_params(model=new_llm, n_samples=100)
+        """
+        self._params.update(kwargs)
+        self._compiled_graph = self._make_compiled_graph()
+    def __getattr__(self, name: str):
+        """
+        Delegate attribute access to `_compiled_graph` if `name` is not
+        found in this instance. This 'inherits' methods from the compiled graph.
+        """
+        return getattr(self._compiled_graph, name)
+    def ainvoke(self, user_instructions: str, data_raw: pd.DataFrame, max_retries=3, retry_count=0):
+        """
+        Cleans the provided dataset based on user instructions.
+        Parameters:
+            user_instructions (str): Instructions for data cleaning.
+            data_raw (pd.DataFrame): The raw dataset to be cleaned.
+            max_retries (int): Maximum retry attempts for cleaning.
+            retry_count (int): Current retry attempt.
+        Returns:
+            None. The response is stored in the response attribute.
+        """
+        response = self.ainvoke({
+            "user_instructions": user_instructions,
+            "data_raw": data_raw.to_dict(),
+            "max_retries": max_retries,
+            "retry_count": retry_count,
+        })
+        self.response = response
+        return None
+    def invoke(self, user_instructions: str, data_raw: pd.DataFrame, max_retries=3, retry_count=0):
+        """
+        Cleans the provided dataset based on user instructions.
+        Parameters:
+            user_instructions (str): Instructions for data cleaning.
+            data_raw (pd.DataFrame): The raw dataset to be cleaned.
+            max_retries (int): Maximum retry attempts for cleaning.
+            retry_count (int): Current retry attempt.
+        Returns:
+            None. The response is stored in the response attribute.
+        """
+        response = self.invoke({
+            "user_instructions": user_instructions,
+            "data_raw": data_raw.to_dict(),
+            "max_retries": max_retries,
+            "retry_count": retry_count,
+        })
+        self.response = response
+        return None
+    def explain_cleaning_steps(self):
+        """
+        Provides an explanation of the cleaning steps performed by the agent.
+        Returns:
+            str: Explanation of the cleaning steps.
+        """
+        messages = self.response.get("messages", [])
+        return messages
+    def get_log_summary(self):
+        """
+        Logs a summary of the agent's operations, if logging is enabled.
+        """
+        if self.response:
+            if self.log:
+                log_details = f"Log Path: {self.response.get('data_cleaner_function_path')}"
+                return log_details
+    def get_state_keys(self):
+        """
+        Returns a list of keys that the state graph returns in a response.
+        """
+        return list(self.get_output_jsonschema()['properties'].keys())
+    def get_state_properties(self):
+        """
+        Returns a list of keys that the state graph returns in a response.
+        """
+        return self.get_output_jsonschema()['properties']
+    def get_data_cleaned(self):
+        """
+        Retrieves the cleaned data stored after running invoke or clean_data methods.
+        """
+        if self.response:
+            return pd.DataFrame(self.response.get("data_cleaned"))
+    def get_data_raw(self):
+        """
+        Retrieves the raw data.
+        """
+        if self.response:
+            return pd.DataFrame(self.response.get("data_raw"))
+    def get_data_cleaner_function(self):
+        """
+        Retrieves the agent's pipeline function.
+        """
+        if self.response:
+            return self.response.get("data_cleaner_function")
 # Agent
-def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True, human_in_the_loop=False, bypass_recommended_steps=False, bypass_explain_code=False):
+def make_data_cleaning_agent(
+    model,
+    n_samples = 30,
+    log=False,
+    log_path=None,
+    file_name="data_cleaner.py",
+    overwrite = True,
+    human_in_the_loop=False,
+    bypass_recommended_steps=False,
+    bypass_explain_code=False
+):
     """
     Creates a data cleaning agent that can be run on a dataset. The agent can be used to clean a dataset in a variety of
     ways, such as removing columns with more than 40% missing values, imputing missing
@@ -44,9 +207,9 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
     The agent takes in a dataset and some user instructions, and outputs a python
     function that can be used to clean the dataset. The agent also logs the code
     generated and any errors that occur.
     The agent is instructed to to perform the following data cleaning steps:
     - Removing columns if more than 40 percent of the data is missing
     - Imputing missing values with the mean of the column if the column is numeric
     - Imputing missing values with the mode of the column if the column is categorical
@@ -60,12 +223,18 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
     ----------
     model : langchain.llms.base.LLM
         The language model to use to generate code.
+    n_samples : int, optional
+        The number of samples to use when summarizing the dataset. Defaults to 30.
+        If you get an error due to maximum tokens, try reducing this number.
+        > "This model's maximum context length is 128000 tokens. However, your messages resulted in 333858 tokens. Please reduce the length of the messages."
     log : bool, optional
         Whether or not to log the code generated and any errors that occur.
         Defaults to False.
     log_path : str, optional
         The path to the directory where the log files should be stored. Defaults to
         "logs/".
+    file_name : str, optional
+        The name of the file to save the response to. Defaults to "data_cleaner.py".
     overwrite : bool, optional
         Whether or not to overwrite the log file if it already exists. If False, a unique file name will be created.
         Defaults to True.
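
The docstring states that with `overwrite=False` a unique file name is created, and this release makes the base name configurable through `file_name`. The diff does not show how `log_ai_function` picks the unique name, so the following is only a sketch of one possible scheme, a numeric suffix. `resolve_log_file` is a hypothetical helper.

```python
from pathlib import Path


def resolve_log_file(log_path: str, file_name: str, overwrite: bool = True) -> Path:
    """Pick the path a generated script would be written to.

    With overwrite=True the requested name is always reused. Otherwise a
    numeric suffix is appended until an unused name is found.
    """
    directory = Path(log_path)
    candidate = directory / file_name
    if overwrite or not candidate.exists():
        return candidate

    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while True:
        candidate = directory / f"{stem}_{counter}{suffix}"
        if not candidate.exists():
            return candidate
        counter += 1
```

Under this scheme, repeated runs with `overwrite=False` would leave `data_cleaner.py`, `data_cleaner_1.py`, `data_cleaner_2.py`, and so on in the log directory.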
@@ -82,26 +251,26 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
     import pandas as pd
     from langchain_openai import ChatOpenAI
     from ai_data_science_team.agents import data_cleaning_agent
     llm = ChatOpenAI(model = "gpt-4o-mini")
     data_cleaning_agent = make_data_cleaning_agent(llm)
     df = pd.read_csv("https://raw.githubusercontent.com/business-science/ai-data-science-team/refs/heads/master/data/churn_data.csv")
     response = data_cleaning_agent.invoke({
         "user_instructions": "Don't remove outliers when cleaning the data.",
         "data_raw": df.to_dict(),
         "max_retries":3,
         "retry_count":0
     })
     pd.DataFrame(response['data_cleaned'])
     ```
     Returns
     -------
-    app : langchain.graphs.StateGraph
+    app : langchain.graphs.CompiledStateGraph
         The data cleaning agent as a state graph.
     """
     llm = model
@@ -134,7 +303,7 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
         Recommend a series of data cleaning steps based on the input data.
         These recommended steps will be appended to the user_instructions.
         """
-        print("---DATA CLEANING AGENT----")
+        print(format_agent_name(AGENT_NAME))
         print("    * RECOMMEND CLEANING STEPS")
         # Prompt to get recommended steps from the LLM
@@ -177,6 +346,7 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
             Avoid these:
             1. Do not include steps to save files.
+            2. Do not include unrelated user instructions that are not related to the data cleaning.
             """,
             input_variables=["user_instructions", "recommended_steps", "all_datasets_summary"]
         )
@@ -184,7 +354,7 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
         data_raw = state.get("data_raw")
         df = pd.DataFrame.from_dict(data_raw)
-        all_datasets_summary = get_dataframe_summary([df])
+        all_datasets_summary = get_dataframe_summary([df], n_sample=n_samples)
         all_datasets_summary_str = "\n\n".join(all_datasets_summary)
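
The hunk above forwards `n_samples` to `get_dataframe_summary` as `n_sample`, and the docstring suggests lowering it when the prompt exceeds the model's context length. The internals of `get_dataframe_summary` are not part of this diff. The sketch below only illustrates the general idea that capping the sampled rows bounds the size of the summary text. It uses a plain dict of lists instead of a DataFrame, and `summarize_columns` is a hypothetical stand-in.

```python
from typing import Dict, List


def summarize_columns(data: Dict[str, List], n_sample: int = 30) -> str:
    """Build a text summary that shows at most `n_sample` values per column."""
    lines = []
    for name, values in data.items():
        preview = values[:n_sample]  # cap the number of values sent to the LLM
        lines.append(f"{name} ({len(values)} rows): {preview}")
    return "\n".join(lines)


data = {"x": list(range(1000))}
short_summary = summarize_columns(data, n_sample=5)
long_summary = summarize_columns(data, n_sample=30)
```

Here `short_summary` contains only the first five values of `x`, so it is shorter than `long_summary` even though both describe the same 1000-row column.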
@@ -201,10 +371,21 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
         }
     def create_data_cleaner_code(state: GraphState):
-        if bypass_recommended_steps:
-            print("---DATA CLEANING AGENT----")
         print("    * CREATE DATA CLEANER CODE")
+        if bypass_recommended_steps:
+            print(format_agent_name(AGENT_NAME))
+            data_raw = state.get("data_raw")
+            df = pd.DataFrame.from_dict(data_raw)
+            all_datasets_summary = get_dataframe_summary([df], n_sample=n_samples)
+            all_datasets_summary_str = "\n\n".join(all_datasets_summary)
+        else:
+            all_datasets_summary_str = state.get("all_datasets_summary")
         data_cleaning_prompt = PromptTemplate(
             template="""
             You are a Data Cleaning Agent. Your job is to create a data_cleaner() function that can be run on the data provided using the following recommended steps.
@@ -218,7 +399,7 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
             {all_datasets_summary}
-            Return Python code in ```python ``` format with a single function definition, data_cleaner(data_raw), that incldues all imports inside the function.
+            Return Python code in ```python ``` format with a single function definition, data_cleaner(data_raw), that includes all imports inside the function.
             Return code to provide the data cleaning function:
@@ -240,16 +421,16 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
         response = data_cleaning_agent.invoke({
             "recommended_steps": state.get("recommended_steps"),
-            "all_datasets_summary": state.get("all_datasets_summary")
+            "all_datasets_summary": all_datasets_summary_str
         })
         response = relocate_imports_inside_function(response)
         response = add_comments_to_top(response, agent_name=AGENT_NAME)
         # For logging: store the code generated:
-        file_path, file_name = log_ai_function(
+        file_path, file_name_2 = log_ai_function(
             response=response,
-            file_name="data_cleaner.py",
+            file_name=file_name,
             log=log,
             log_path=log_path,
             overwrite=overwrite
@@ -258,7 +439,8 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
         return {
             "data_cleaner_function" : response,
             "data_cleaner_function_path": file_path,
-            "data_cleaner_function_name": file_name
+            "data_cleaner_function_name": file_name_2,
+            "all_datasets_summary": all_datasets_summary_str
         }
     def human_review(state: GraphState) -> Command[Literal["recommend_cleaning_steps", "create_data_cleaner_code"]]:
@@ -353,3 +535,6 @@ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite = True,
     )
     return app
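
The `DataCleaningAgent` class added in this file wraps the compiled graph and forwards unknown attributes to it through `__getattr__`. Python consults `__getattr__` only when normal lookup fails, so methods the wrapper defines itself, such as `invoke`, are never forwarded. A call to `self.invoke(...)` inside the wrapper's own `invoke` therefore resolves to the wrapper again, not to the graph. The following minimal sketch shows one way to route such a call explicitly to the wrapped object. `FakeCompiledGraph` and `AgentWrapper` are hypothetical stand-ins, not the library's classes.

```python
class FakeCompiledGraph:
    """Hypothetical stand-in for a compiled graph."""

    def invoke(self, state: dict) -> dict:
        # Echo the raw data back as if it had been cleaned.
        return {"data_cleaned": state["data_raw"], "messages": []}

    def get_graph(self) -> str:
        return "graph"


class AgentWrapper:
    """Wraps a graph, stores the last response, and delegates everything else."""

    def __init__(self, graph):
        self._compiled_graph = graph
        self.response = None

    def __getattr__(self, name: str):
        # Reached only when normal lookup fails. Guard the backing attribute
        # so a missing `_compiled_graph` raises instead of recursing.
        if name == "_compiled_graph":
            raise AttributeError(name)
        return getattr(self._compiled_graph, name)

    def invoke(self, user_instructions: str, data_raw: dict, max_retries=3, retry_count=0):
        # Call the wrapped graph explicitly; `self.invoke` would re-enter this method.
        self.response = self._compiled_graph.invoke({
            "user_instructions": user_instructions,
            "data_raw": data_raw,
            "max_retries": max_retries,
            "retry_count": retry_count,
        })
        return None


agent = AgentWrapper(FakeCompiledGraph())
agent.invoke("Don't remove outliers.", {"a": [1, 2, 3]})
```

After the call, `agent.response` holds the graph's output, while `agent.get_graph()` still reaches the wrapped object through `__getattr__`.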
