ai-data-science-team 0.0.0.9000__tar.gz → 0.0.0.9006__tar.gz

Files changed (31)
  1. ai_data_science_team-0.0.0.9006/PKG-INFO +165 -0
  2. ai_data_science_team-0.0.0.9006/README.md +141 -0
  3. ai_data_science_team-0.0.0.9006/ai_data_science_team/_version.py +1 -0
  4. ai_data_science_team-0.0.0.9006/ai_data_science_team/agents/__init__.py +5 -0
  5. ai_data_science_team-0.0.0.9006/ai_data_science_team/agents/data_cleaning_agent.py +355 -0
  6. ai_data_science_team-0.0.0.9006/ai_data_science_team/agents/data_wrangling_agent.py +362 -0
  7. ai_data_science_team-0.0.0.9006/ai_data_science_team/agents/feature_engineering_agent.py +376 -0
  8. ai_data_science_team-0.0.0.9006/ai_data_science_team/agents/sql_database_agent.py +379 -0
  9. ai_data_science_team-0.0.0.9006/ai_data_science_team/templates/__init__.py +0 -0
  10. ai_data_science_team-0.0.0.9006/ai_data_science_team/templates/agent_templates.py +526 -0
  11. ai_data_science_team-0.0.0.9006/ai_data_science_team/tools/__init__.py +0 -0
  12. ai_data_science_team-0.0.0.9006/ai_data_science_team/tools/logging.py +61 -0
  13. ai_data_science_team-0.0.0.9006/ai_data_science_team/tools/metadata.py +167 -0
  14. ai_data_science_team-0.0.0.9006/ai_data_science_team/tools/parsers.py +57 -0
  15. ai_data_science_team-0.0.0.9006/ai_data_science_team/tools/regex.py +73 -0
  16. ai_data_science_team-0.0.0.9006/ai_data_science_team.egg-info/PKG-INFO +165 -0
  17. ai_data_science_team-0.0.0.9006/ai_data_science_team.egg-info/SOURCES.txt +23 -0
  18. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/ai_data_science_team.egg-info/requires.txt +1 -1
  19. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/setup.py +1 -1
  20. ai_data_science_team-0.0.0.9000/PKG-INFO +0 -131
  21. ai_data_science_team-0.0.0.9000/README.md +0 -107
  22. ai_data_science_team-0.0.0.9000/ai_data_science_team/_version.py +0 -1
  23. ai_data_science_team-0.0.0.9000/ai_data_science_team/agents.py +0 -325
  24. ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/PKG-INFO +0 -131
  25. ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/SOURCES.txt +0 -12
  26. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/LICENSE +0 -0
  27. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/ai_data_science_team/__init__.py +0 -0
  28. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/ai_data_science_team/orchestration.py +0 -0
  29. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/ai_data_science_team.egg-info/dependency_links.txt +0 -0
  30. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/ai_data_science_team.egg-info/top_level.txt +0 -0
  31. {ai_data_science_team-0.0.0.9000 → ai_data_science_team-0.0.0.9006}/setup.cfg +0 -0
@@ -0,0 +1,165 @@
+ Metadata-Version: 2.1
+ Name: ai-data-science-team
+ Version: 0.0.0.9006
+ Summary: Build and run an AI-powered data science team.
+ Home-page: https://github.com/business-science/ai-data-science-team
+ Author: Matt Dancho
+ Author-email: mdancho@business-science.io
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ License-File: LICENSE
+ Requires-Dist: openpyxl
+ Requires-Dist: langchain
+ Requires-Dist: langchain_community
+ Requires-Dist: langchain_openai
+ Requires-Dist: langchain_experimental
+ Requires-Dist: langgraph>=0.2.57
+ Requires-Dist: openai
+ Requires-Dist: pandas
+ Requires-Dist: numpy
+ Requires-Dist: plotly
+ Requires-Dist: streamlit
+ Requires-Dist: scikit-learn
+ Requires-Dist: xgboost
+
+ # Your AI Data Science Team (An Army Of Copilots)
+
+ **An AI-powered data science team of copilots that uses agents to help you perform common data science tasks 10X faster**.
+
+ Star ⭐ This GitHub (Takes 2 seconds and means a lot).
+
+ ---
+
+ The AI Data Science Team of Copilots includes Agents that specialize in data cleaning, preparation, feature engineering, modeling (machine learning), and interpretation of various business problems like:
+
+ - Churn Modeling
+ - Employee Attrition
+ - Lead Scoring
+ - Insurance Risk
+ - Credit Card Risk
+ - And more
+
+ ## Companies That Want An AI Data Science Team Copilot
+
+ If you are interested in having your own custom enterprise-grade AI Data Science Team Copilot, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
+
+ ## Free Generative AI For Data Scientists Workshop
+
+ If you want to learn how to build AI Agents for your company that perform Data Science, Business Intelligence, Churn Modeling, Time Series Forecasting, and more, [register for my next Generative AI for Data Scientists workshop here.](https://learn.business-science.io/ai-register)
+
+ ## Data Science Agents
+
+ This project is a work in progress. New data science agents will be released soon.
+
+ ![Data Science Team](/img/ai_data_science_team.jpg)
+
+ ### Agents Available Now
+
+ 1. **Data Wrangling Agent:** Merges, Joins, Preps and Wrangles data into a format that is ready for data analysis.
+ 2. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+ 3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase the predictive accuracy of ML models.
+ 4. **SQL Database Agent:** Connects to SQL databases to pull data into the data science environment. Creates pipelines to automate data extraction. Performs Joins, Aggregations, and other SQL Query operations.
+
+ ### Agents Coming Soon
+
+ 1. **Data Analyst:** Analyzes data structure, creates exploratory visualizations, and performs correlation analysis to identify relationships.
+ 2. **Machine Learning Agent:** Builds and logs the machine learning models.
+ 3. **Interpretability Agent:** Performs Interpretable ML to explain why the model returned predictions, including which features were the most important to the model.
+ 4. **Supervisor:** Forms task list. Moderates sub-agents. Returns completed assignment.
+
+ ## Disclaimer
+
+ **This project is for educational purposes only.**
+
+ - It is not intended to replace your company's data science team
+ - No warranties or guarantees provided
+ - Creator assumes no liability for financial loss
+ - Consult an experienced Generative AI Data Scientist for building your own custom AI Data Science Team
+ - If you want a custom enterprise-grade AI Data Science Team, [send inquiries here](https://www.business-science.io/contact.html).
+
+ By using this software, you agree to use it solely for learning purposes.
+
+ ## Table of Contents
+
+ - [Your AI Data Science Team (An Army Of Copilots)](#your-ai-data-science-team-an-army-of-copilots)
+ - [Companies That Want An AI Data Science Team Copilot](#companies-that-want-an-ai-data-science-team-copilot)
+ - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+ - [Data Science Agents](#data-science-agents)
+ - [Agents Available Now](#agents-available-now)
+ - [Agents Coming Soon](#agents-coming-soon)
+ - [Disclaimer](#disclaimer)
+ - [Table of Contents](#table-of-contents)
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Example 1: Feature Engineering with the Feature Engineering Agent](#example-1-feature-engineering-with-the-feature-engineering-agent)
+ - [Example 2: Cleaning Data with the Data Cleaning Agent](#example-2-cleaning-data-with-the-data-cleaning-agent)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Installation
+
+ ``` bash
+ pip install git+https://github.com/business-science/ai-data-science-team.git --upgrade
+ ```
+
+ ## Usage
+
+ [See all examples here.](/examples)
+
+ ### Example 1: Feature Engineering with the Feature Engineering Agent
+
+ [See the full example here.](/examples/feature_engineering_agent.ipynb)
+
+ ``` python
+ feature_engineering_agent = make_feature_engineering_agent(model=llm)
+
+ response = feature_engineering_agent.invoke({
+     "user_instructions": "Make sure to scale and center numeric features",
+     "target_variable": "Churn",
+     "data_raw": df.to_dict(),
+     "max_retries": 3,
+     "retry_count": 0
+ })
+ ```
+
+ ``` bash
+ ---FEATURE ENGINEERING AGENT----
+     * CREATE FEATURE ENGINEER CODE
+     * EXECUTING AGENT CODE
+     * EXPLAIN AGENT CODE
+ ```
+
+ ### Example 2: Cleaning Data with the Data Cleaning Agent
+
+ [See the full example here.](/examples/data_cleaning_agent.ipynb)
+
+ ``` python
+ data_cleaning_agent = make_data_cleaning_agent(model=llm)
+
+ response = data_cleaning_agent.invoke({
+     "user_instructions": "Don't remove outliers when cleaning the data.",
+     "data_raw": df.to_dict(),
+     "max_retries": 3,
+     "retry_count": 0
+ })
+ ```
+
+ ``` bash
+ ---DATA CLEANING AGENT----
+     * CREATE DATA CLEANER CODE
+     * EXECUTING AGENT CODE
+     * EXPLAIN AGENT CODE
+ ```
+
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Commit your changes
+ 4. Push to the branch
+ 5. Create a Pull Request
+
+ ## License
+
+ This project is licensed under the MIT License. See LICENSE file for details.
+
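The Data Cleaning Agent in Example 2 does not clean the data directly; it generates and executes a standalone `data_cleaner(data_raw)` function. As a rough, hand-written sketch of the kind of function it produces (the 40%-missing-column threshold and mean/mode imputation mirror the defaults documented in the package; the code below is illustrative, not actual agent output, and omits the outlier and row-dropping steps):

```python
import numpy as np
import pandas as pd

def data_cleaner(data_raw: pd.DataFrame) -> pd.DataFrame:
    data = data_raw.copy()

    # Drop columns with more than 40% missing values
    data = data.loc[:, data.isna().mean() <= 0.40]

    # Impute numeric columns with the mean, categorical columns with the mode
    for col in data.columns:
        if data[col].isna().any():
            if pd.api.types.is_numeric_dtype(data[col]):
                data[col] = data[col].fillna(data[col].mean())
            else:
                data[col] = data[col].fillna(data[col].mode().iloc[0])

    # Remove duplicate rows
    data = data.drop_duplicates()
    return data

# Tiny illustrative input: one mostly-missing column, one duplicate row
df = pd.DataFrame({
    "age": [25, np.nan, 30, 30],
    "plan": ["basic", None, "pro", "pro"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = data_cleaner(df)
```

In the real workflow the LLM writes this function from the recommended steps and any user instructions, and the graph executes it on the data passed in the state.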
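Both usage examples pass the DataFrame in as `df.to_dict()` and read the result back out of the response dict (e.g. `pd.DataFrame(response['data_cleaned'])`), because the agent's graph state stores data as plain dictionaries. The round trip itself is ordinary pandas and can be tried standalone:

```python
import pandas as pd

df = pd.DataFrame({"Churn": ["Yes", "No"], "tenure": [1, 24]})

# Serialize for the agent's state dict, then rebuild on the way out
state_payload = df.to_dict()          # column-oriented dict of dicts
restored = pd.DataFrame(state_payload)

print(restored.equals(df))  # → True
```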
@@ -0,0 +1 @@
+ __version__ = "0.0.0.9006"
@@ -0,0 +1,5 @@
+ from ai_data_science_team.agents.data_cleaning_agent import make_data_cleaning_agent
+ from ai_data_science_team.agents.feature_engineering_agent import make_feature_engineering_agent
+ from ai_data_science_team.agents.data_wrangling_agent import make_data_wrangling_agent
+ from ai_data_science_team.agents.sql_database_agent import make_sql_database_agent
+
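Each factory above builds an agent that pipes an LLM prompt into a `PythonOutputParser` (from `tools/parsers.py`) to pull the generated function out of the model's fenced code block. A minimal, hypothetical stdlib re-implementation of that extraction step, for illustration only (the package's actual parser may differ):

```python
import re

# Match the body of the first ```python fenced block (DOTALL: body may span lines)
FENCE_RE = re.compile(r"```python\s*\n(.*?)```", re.DOTALL)

def extract_python_code(llm_response: str) -> str:
    match = FENCE_RE.search(llm_response)
    # Fall back to the raw text if the model skipped the fence
    return match.group(1).strip() if match else llm_response.strip()

# Simulated LLM reply (fence assembled from parts so this example stays self-contained)
fence = "`" * 3
reply = (
    "Here is the function:\n"
    + fence + "python\n"
    + "def data_cleaner(data_raw):\n"
    + "    return data_raw\n"
    + fence + "\n"
)
code = extract_python_code(reply)
```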
@@ -0,0 +1,355 @@
+ # BUSINESS SCIENCE UNIVERSITY
+ # AI DATA SCIENCE TEAM
+ # ***
+ # * Agents: Data Cleaning Agent
+
+ # Libraries
+ from typing import TypedDict, Annotated, Sequence, Literal
+ import operator
+
+ from langchain.prompts import PromptTemplate
+ from langchain_core.messages import BaseMessage
+
+ from langgraph.types import Command
+ from langgraph.checkpoint.memory import MemorySaver
+
+ import os
+ import io
+ import pandas as pd
+
+ from ai_data_science_team.templates.agent_templates import (
+     node_func_execute_agent_code_on_data,
+     node_func_human_review,
+     node_func_fix_agent_code,
+     node_func_explain_agent_code,
+     create_coding_agent_graph
+ )
+ from ai_data_science_team.tools.parsers import PythonOutputParser
+ from ai_data_science_team.tools.regex import relocate_imports_inside_function, add_comments_to_top
+ from ai_data_science_team.tools.metadata import get_dataframe_summary
+ from ai_data_science_team.tools.logging import log_ai_function
+
+ # Setup
+ AGENT_NAME = "data_cleaning_agent"
+ LOG_PATH = os.path.join(os.getcwd(), "logs/")
+
+ # Agent
+
+ def make_data_cleaning_agent(model, log=False, log_path=None, overwrite=True, human_in_the_loop=False, bypass_recommended_steps=False, bypass_explain_code=False):
+ """
40
+ Creates a data cleaning agent that can be run on a dataset. The agent can be used to clean a dataset in a variety of
41
+ ways, such as removing columns with more than 40% missing values, imputing missing
42
+ values with the mean of the column if the column is numeric, or imputing missing
43
+ values with the mode of the column if the column is categorical.
44
+ The agent takes in a dataset and some user instructions, and outputs a python
45
+ function that can be used to clean the dataset. The agent also logs the code
46
+ generated and any errors that occur.
47
+
48
+ The agent is instructed to to perform the following data cleaning steps:
49
+
50
+ - Removing columns if more than 40 percent of the data is missing
51
+ - Imputing missing values with the mean of the column if the column is numeric
52
+ - Imputing missing values with the mode of the column if the column is categorical
53
+ - Converting columns to the correct data type
54
+ - Removing duplicate rows
55
+ - Removing rows with missing values
56
+ - Removing rows with extreme outliers (3X the interquartile range)
57
+ - User instructions can modify, add, or remove any of the above steps
58
+
59
+ Parameters
60
+ ----------
61
+ model : langchain.llms.base.LLM
62
+ The language model to use to generate code.
63
+ log : bool, optional
64
+ Whether or not to log the code generated and any errors that occur.
65
+ Defaults to False.
66
+ log_path : str, optional
67
+ The path to the directory where the log files should be stored. Defaults to
68
+ "logs/".
69
+ overwrite : bool, optional
70
+ Whether or not to overwrite the log file if it already exists. If False, a unique file name will be created.
71
+ Defaults to True.
72
+ human_in_the_loop : bool, optional
73
+ Whether or not to use human in the loop. If True, adds an interput and human in the loop step that asks the user to review the data cleaning instructions. Defaults to False.
74
+ bypass_recommended_steps : bool, optional
75
+ Bypass the recommendation step, by default False
76
+ bypass_explain_code : bool, optional
77
+ Bypass the code explanation step, by default False.
78
+
79
+ Examples
80
+ -------
81
+ ``` python
82
+ import pandas as pd
83
+ from langchain_openai import ChatOpenAI
84
+ from ai_data_science_team.agents import data_cleaning_agent
85
+
86
+ llm = ChatOpenAI(model = "gpt-4o-mini")
87
+
88
+ data_cleaning_agent = make_data_cleaning_agent(llm)
89
+
90
+ df = pd.read_csv("https://raw.githubusercontent.com/business-science/ai-data-science-team/refs/heads/master/data/churn_data.csv")
91
+
92
+ response = data_cleaning_agent.invoke({
93
+ "user_instructions": "Don't remove outliers when cleaning the data.",
94
+ "data_raw": df.to_dict(),
95
+ "max_retries":3,
96
+ "retry_count":0
97
+ })
98
+
99
+ pd.DataFrame(response['data_cleaned'])
100
+ ```
101
+
102
+ Returns
103
+ -------
104
+ app : langchain.graphs.StateGraph
105
+ The data cleaning agent as a state graph.
106
+ """
107
+     llm = model
+
+     # Setup Log Directory
+     if log:
+         if log_path is None:
+             log_path = LOG_PATH
+         if not os.path.exists(log_path):
+             os.makedirs(log_path)
+
+     # Define GraphState for the router
+     class GraphState(TypedDict):
+         messages: Annotated[Sequence[BaseMessage], operator.add]
+         user_instructions: str
+         recommended_steps: str
+         data_raw: dict
+         data_cleaned: dict
+         all_datasets_summary: str
+         data_cleaner_function: str
+         data_cleaner_function_path: str
+         data_cleaner_function_name: str
+         data_cleaner_error: str
+         max_retries: int
+         retry_count: int
+
+     def recommend_cleaning_steps(state: GraphState):
+         """
+         Recommend a series of data cleaning steps based on the input data.
+         These recommended steps will be appended to the user_instructions.
+         """
+         print("---DATA CLEANING AGENT----")
+         print("    * RECOMMEND CLEANING STEPS")
+
+         # Prompt to get recommended steps from the LLM
+         recommend_steps_prompt = PromptTemplate(
+             template="""
+             You are a Data Cleaning Expert. Given the following information about the data,
+             recommend a series of numbered steps to take to clean and preprocess it.
+             The steps should be tailored to the data characteristics and should be helpful
+             for a data cleaning agent that will be implemented.
+
+             General Steps:
+             Things that should be considered in the data cleaning steps:
+
+             * Removing columns if more than 40 percent of the data is missing
+             * Imputing missing values with the mean of the column if the column is numeric
+             * Imputing missing values with the mode of the column if the column is categorical
+             * Converting columns to the correct data type
+             * Removing duplicate rows
+             * Removing rows with missing values
+             * Removing rows with extreme outliers (3X the interquartile range)
+
+             Custom Steps:
+             * Analyze the data to determine if any additional data cleaning steps are needed.
+             * Recommend steps that are specific to the data provided. Include why these steps are necessary or beneficial.
+             * If no additional steps are needed, simply state that no additional steps are required.
+
+             IMPORTANT:
+             Make sure to take into account any additional user instructions that may add, remove, or modify some of these steps. Include comments in your code to explain your reasoning for each step. Include comments if something is not done because the user requested it. Include comments if something is done because the user requested it.
+
+             User instructions:
+             {user_instructions}
+
+             Previously Recommended Steps (if any):
+             {recommended_steps}
+
+             Below are summaries of all datasets provided:
+             {all_datasets_summary}
+
+             Return the steps as a bullet point list (no code, just the steps).
+
+             Avoid these:
+             1. Do not include steps to save files.
+             """,
+             input_variables=["user_instructions", "recommended_steps", "all_datasets_summary"]
+         )
+
+         data_raw = state.get("data_raw")
+         df = pd.DataFrame.from_dict(data_raw)
+
+         all_datasets_summary = get_dataframe_summary([df])
+
+         all_datasets_summary_str = "\n\n".join(all_datasets_summary)
+
+         steps_agent = recommend_steps_prompt | llm
+         recommended_steps = steps_agent.invoke({
+             "user_instructions": state.get("user_instructions"),
+             "recommended_steps": state.get("recommended_steps"),
+             "all_datasets_summary": all_datasets_summary_str
+         })
+
+         return {
+             "recommended_steps": "\n\n# Recommended Data Cleaning Steps:\n" + recommended_steps.content.strip(),
+             "all_datasets_summary": all_datasets_summary_str
+         }
+
+     def create_data_cleaner_code(state: GraphState):
+         if bypass_recommended_steps:
+             print("---DATA CLEANING AGENT----")
+         print("    * CREATE DATA CLEANER CODE")
+
+         data_cleaning_prompt = PromptTemplate(
+             template="""
+             You are a Data Cleaning Agent. Your job is to create a data_cleaner() function that can be run on the data provided using the following recommended steps.
+
+             Recommended Steps:
+             {recommended_steps}
+
+             You can use Pandas, Numpy, and Scikit Learn libraries to clean the data.
+
+             Below are summaries of all datasets provided. Use this information about the data to help determine how to clean the data:
+
+             {all_datasets_summary}
+
+             Return Python code in ```python ``` format with a single function definition, data_cleaner(data_raw), that includes all imports inside the function.
+
+             Return code to provide the data cleaning function:
+
+             def data_cleaner(data_raw):
+                 import pandas as pd
+                 import numpy as np
+                 ...
+                 return data_cleaned
+
+             Best Practices and Error Preventions:
+
+             Always ensure that when assigning the output of fit_transform() from SimpleImputer to a Pandas DataFrame column, you call .ravel() or flatten the array, because fit_transform() returns a 2D array while a DataFrame column is 1D.
+
+             """,
+             input_variables=["recommended_steps", "all_datasets_summary"]
+         )
+
+         data_cleaning_agent = data_cleaning_prompt | llm | PythonOutputParser()
+
+         response = data_cleaning_agent.invoke({
+             "recommended_steps": state.get("recommended_steps"),
+             "all_datasets_summary": state.get("all_datasets_summary")
+         })
+
+         response = relocate_imports_inside_function(response)
+         response = add_comments_to_top(response, agent_name=AGENT_NAME)
+
+         # For logging: store the code generated
+         file_path, file_name = log_ai_function(
+             response=response,
+             file_name="data_cleaner.py",
+             log=log,
+             log_path=log_path,
+             overwrite=overwrite
+         )
+
+         return {
+             "data_cleaner_function": response,
+             "data_cleaner_function_path": file_path,
+             "data_cleaner_function_name": file_name
+         }
+
+     def human_review(state: GraphState) -> Command[Literal["recommend_cleaning_steps", "create_data_cleaner_code"]]:
+         return node_func_human_review(
+             state=state,
+             prompt_text="Are the following data cleaning instructions correct? (Answer 'yes' or provide modifications)\n{steps}",
+             yes_goto="create_data_cleaner_code",
+             no_goto="recommend_cleaning_steps",
+             user_instructions_key="user_instructions",
+             recommended_steps_key="recommended_steps"
+         )
+
+     def execute_data_cleaner_code(state):
+         return node_func_execute_agent_code_on_data(
+             state=state,
+             data_key="data_raw",
+             result_key="data_cleaned",
+             error_key="data_cleaner_error",
+             code_snippet_key="data_cleaner_function",
+             agent_function_name="data_cleaner",
+             pre_processing=lambda data: pd.DataFrame.from_dict(data),
+             post_processing=lambda df: df.to_dict() if isinstance(df, pd.DataFrame) else df,
+             error_message_prefix="An error occurred during data cleaning: "
+         )
+
+     def fix_data_cleaner_code(state: GraphState):
+         data_cleaner_prompt = """
+         You are a Data Cleaning Agent. Your job is to create a data_cleaner() function that can be run on the data provided. The function is currently broken and needs to be fixed.
+
+         Make sure to only return the function definition for data_cleaner().
+
+         Return Python code in ```python``` format with a single function definition, data_cleaner(data_raw), that includes all imports inside the function.
+
+         This is the broken code (please fix):
+         {code_snippet}
+
+         Last Known Error:
+         {error}
+         """
+
+         return node_func_fix_agent_code(
+             state=state,
+             code_snippet_key="data_cleaner_function",
+             error_key="data_cleaner_error",
+             llm=llm,
+             prompt_template=data_cleaner_prompt,
+             agent_name=AGENT_NAME,
+             log=log,
+             file_path=state.get("data_cleaner_function_path"),
+         )
+
+     def explain_data_cleaner_code(state: GraphState):
+         return node_func_explain_agent_code(
+             state=state,
+             code_snippet_key="data_cleaner_function",
+             result_key="messages",
+             error_key="data_cleaner_error",
+             llm=llm,
+             role=AGENT_NAME,
+             explanation_prompt_template="""
+             Explain the data cleaning steps that the data cleaning agent performed in this function.
+             Keep the summary succinct and to the point.\n\n# Data Cleaning Agent:\n\n{code}
+             """,
+             success_prefix="# Data Cleaning Agent:\n\n ",
+             error_message="The Data Cleaning Agent encountered an error during data cleaning. Data could not be explained."
+         )
+
+     # Define the graph
+     node_functions = {
+         "recommend_cleaning_steps": recommend_cleaning_steps,
+         "human_review": human_review,
+         "create_data_cleaner_code": create_data_cleaner_code,
+         "execute_data_cleaner_code": execute_data_cleaner_code,
+         "fix_data_cleaner_code": fix_data_cleaner_code,
+         "explain_data_cleaner_code": explain_data_cleaner_code
+     }
+
+     app = create_coding_agent_graph(
+         GraphState=GraphState,
+         node_functions=node_functions,
+         recommended_steps_node_name="recommend_cleaning_steps",
+         create_code_node_name="create_data_cleaner_code",
+         execute_code_node_name="execute_data_cleaner_code",
+         fix_code_node_name="fix_data_cleaner_code",
+         explain_code_node_name="explain_data_cleaner_code",
+         error_key="data_cleaner_error",
+         human_in_the_loop=human_in_the_loop,
+         human_review_node_name="human_review",
+         checkpointer=MemorySaver() if human_in_the_loop else None,
+         bypass_recommended_steps=bypass_recommended_steps,
+         bypass_explain_code=bypass_explain_code,
+     )
+
+     return app
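The best-practices note in the code-generation prompt above warns that `SimpleImputer.fit_transform()` returns a 2D array, so its output must be flattened before assignment back to a DataFrame column. A small demonstration of that pitfall (scikit-learn is already in the package's dependency list):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"tenure": [1.0, np.nan, 24.0]})

imputer = SimpleImputer(strategy="mean")
out = imputer.fit_transform(df[["tenure"]])  # shape (3, 1) -- 2D, not 1D

# Flatten before assigning back to a 1D column
df["tenure"] = out.ravel()
```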