ai-data-science-team 0.0.0.9000__tar.gz
- ai_data_science_team-0.0.0.9000/LICENSE +21 -0
- ai_data_science_team-0.0.0.9000/PKG-INFO +131 -0
- ai_data_science_team-0.0.0.9000/README.md +107 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team/__init__.py +0 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team/_version.py +1 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team/agents.py +325 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team/orchestration.py +17 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/PKG-INFO +131 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/SOURCES.txt +12 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/dependency_links.txt +1 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/requires.txt +13 -0
- ai_data_science_team-0.0.0.9000/ai_data_science_team.egg-info/top_level.txt +1 -0
- ai_data_science_team-0.0.0.9000/setup.cfg +4 -0
- ai_data_science_team-0.0.0.9000/setup.py +42 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 ai-data-science-team authors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,131 @@
+Metadata-Version: 2.1
+Name: ai-data-science-team
+Version: 0.0.0.9000
+Summary: Build and run an AI-powered data science team.
+Home-page: https://github.com/business-science/ai-data-science-team
+Author: Matt Dancho
+Author-email: mdancho@business-science.io
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: openpyxl
+Requires-Dist: langchain
+Requires-Dist: langchain_community
+Requires-Dist: langchain_openai
+Requires-Dist: langchain_experimental
+Requires-Dist: langgraph
+Requires-Dist: openai
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: plotly
+Requires-Dist: streamlit
+Requires-Dist: scikit-learn
+Requires-Dist: xgboost
+
+# AI Data Science Team
+
+**An AI-powered data science team that uses agents to perform common data science tasks**, including data cleaning, preparation, feature engineering, modeling (machine learning), and interpretation, on business problems such as:
+
+- Churn Modeling
+- Employee Attrition
+- Lead Scoring
+- Insurance Risk
+- Credit Card Risk
+- And more
+
+## Companies That Want An AI Data Science Team
+
+If you are interested in having your own AI Data Science Team built and deployed for your enterprise, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
+
+## Free Generative AI For Data Scientists Workshop
+
+If you want to learn how to build AI Agents for your company that perform Data Science, Business Intelligence, Churn Modeling, Time Series Forecasting, and more, [register for my next Generative AI for Data Scientists workshop here.](https://learn.business-science.io/ai-register)
+
+## Agents
+
+This project is a work in progress. New agents will be released soon.
+
+![Data Science Team](/img/ai_data_science_team.jpg)
+
+### Agents Available Now:
+
+1. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+
+### Agents Coming Soon:
+
+1. **Supervisor:** Forms task list. Moderates sub-agents. Returns completed assignment.
+2. **Exploratory Data Analyst:** Analyzes data structure, creates exploratory visualizations, and performs correlation analysis to identify relationships.
+3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
+4. **Machine Learning Agent:** Builds and logs the machine learning models.
+5. **Interpretability Agent:** Performs Interpretable ML to explain the model's predictions, including which features were most important to the model.
+
+## Disclaimer
+
+**This project is for educational purposes only.**
+
+- It is not intended to replace your company's data science team
+- No warranties or guarantees provided
+- Creator assumes no liability for financial loss
+- Consult an experienced Generative AI Data Scientist for building your own AI Data Science Team
+- If you want an enterprise-grade AI Data Science Team, [send inquiries here](https://www.business-science.io/contact.html).
+
+By using this software, you agree to use it solely for learning purposes.
+
+## Table of Contents
+
+- [AI Data Science Team](#ai-data-science-team)
+  - [Companies That Want An AI Data Science Team](#companies-that-want-an-ai-data-science-team)
+  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+  - [Agents](#agents)
+    - [Agents Available Now:](#agents-available-now)
+    - [Agents Coming Soon:](#agents-coming-soon)
+  - [Disclaimer](#disclaimer)
+  - [Table of Contents](#table-of-contents)
+  - [Installation](#installation)
+  - [Usage](#usage)
+    - [Example 1: Cleaning Data with the Data Cleaning Agent](#example-1-cleaning-data-with-the-data-cleaning-agent)
+  - [Contributing](#contributing)
+  - [License](#license)
+
+## Installation
+
+``` bash
+pip install git+https://github.com/business-science/ai-data-science-team.git --upgrade
+```
+
+## Usage
+
+### Example 1: Cleaning Data with the Data Cleaning Agent
+
+[See the full example here.](https://github.com/business-science/ai-data-science-team/blob/master/examples/data_cleaning_agent.ipynb)
+
+``` python
+data_cleaning_agent = data_cleaning_agent(model = llm, log=LOG, log_path=LOG_PATH)
+
+response = data_cleaning_agent.invoke({
+    "user_instructions": "Don't remove outliers when cleaning the data.",
+    "data_raw": df.to_dict(),
+    "max_retries":3,
+    "retry_count":0
+})
+```
+
+``` bash
+---DATA CLEANING AGENT----
+* CREATE DATA CLEANER CODE
+* EXECUTING AGENT CODE
+* EXPLAIN AGENT CODE
+```
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Commit your changes
+4. Push to the branch
+5. Create a Pull Request
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for details.
+
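PKG-INFO follows the header-style layout of Python core metadata, so the standard library's email parser can read it. The following is a minimal sketch, not part of the package: `pkg_info_text` is an abbreviated copy of the header block above, shortened for illustration.

```python
from email.parser import Parser

# Abbreviated copy of the PKG-INFO header block (illustration only).
pkg_info_text = """\
Metadata-Version: 2.1
Name: ai-data-science-team
Version: 0.0.0.9000
Requires-Python: >=3.9
Requires-Dist: langchain
Requires-Dist: langgraph
Requires-Dist: pandas

# AI Data Science Team
"""

metadata = Parser().parsestr(pkg_info_text)

name = metadata["Name"]
version = metadata["Version"]
# Requires-Dist repeats once per dependency, so collect every occurrence.
requirements = metadata.get_all("Requires-Dist")
# Everything after the first blank line is the long description (the README).
description = metadata.get_payload()

print(name, version, requirements)
```

Repeated headers such as `Requires-Dist` are the reason for `get_all`; indexing with `metadata["Requires-Dist"]` would return only one of them.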
@@ -0,0 +1,107 @@
+# AI Data Science Team
+
+**An AI-powered data science team that uses agents to perform common data science tasks**, including data cleaning, preparation, feature engineering, modeling (machine learning), and interpretation, on business problems such as:
+
+- Churn Modeling
+- Employee Attrition
+- Lead Scoring
+- Insurance Risk
+- Credit Card Risk
+- And more
+
+## Companies That Want An AI Data Science Team
+
+If you are interested in having your own AI Data Science Team built and deployed for your enterprise, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
+
+## Free Generative AI For Data Scientists Workshop
+
+If you want to learn how to build AI Agents for your company that perform Data Science, Business Intelligence, Churn Modeling, Time Series Forecasting, and more, [register for my next Generative AI for Data Scientists workshop here.](https://learn.business-science.io/ai-register)
+
+## Agents
+
+This project is a work in progress. New agents will be released soon.
+
+![Data Science Team](/img/ai_data_science_team.jpg)
+
+### Agents Available Now:
+
+1. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+
+### Agents Coming Soon:
+
+1. **Supervisor:** Forms task list. Moderates sub-agents. Returns completed assignment.
+2. **Exploratory Data Analyst:** Analyzes data structure, creates exploratory visualizations, and performs correlation analysis to identify relationships.
+3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
+4. **Machine Learning Agent:** Builds and logs the machine learning models.
+5. **Interpretability Agent:** Performs Interpretable ML to explain the model's predictions, including which features were most important to the model.
+
+## Disclaimer
+
+**This project is for educational purposes only.**
+
+- It is not intended to replace your company's data science team
+- No warranties or guarantees provided
+- Creator assumes no liability for financial loss
+- Consult an experienced Generative AI Data Scientist for building your own AI Data Science Team
+- If you want an enterprise-grade AI Data Science Team, [send inquiries here](https://www.business-science.io/contact.html).
+
+By using this software, you agree to use it solely for learning purposes.
+
+## Table of Contents
+
+- [AI Data Science Team](#ai-data-science-team)
+  - [Companies That Want An AI Data Science Team](#companies-that-want-an-ai-data-science-team)
+  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+  - [Agents](#agents)
+    - [Agents Available Now:](#agents-available-now)
+    - [Agents Coming Soon:](#agents-coming-soon)
+  - [Disclaimer](#disclaimer)
+  - [Table of Contents](#table-of-contents)
+  - [Installation](#installation)
+  - [Usage](#usage)
+    - [Example 1: Cleaning Data with the Data Cleaning Agent](#example-1-cleaning-data-with-the-data-cleaning-agent)
+  - [Contributing](#contributing)
+  - [License](#license)
+
+## Installation
+
+``` bash
+pip install git+https://github.com/business-science/ai-data-science-team.git --upgrade
+```
+
+## Usage
+
+### Example 1: Cleaning Data with the Data Cleaning Agent
+
+[See the full example here.](https://github.com/business-science/ai-data-science-team/blob/master/examples/data_cleaning_agent.ipynb)
+
+``` python
+data_cleaning_agent = data_cleaning_agent(model = llm, log=LOG, log_path=LOG_PATH)
+
+response = data_cleaning_agent.invoke({
+    "user_instructions": "Don't remove outliers when cleaning the data.",
+    "data_raw": df.to_dict(),
+    "max_retries":3,
+    "retry_count":0
+})
+```
+
+``` bash
+---DATA CLEANING AGENT----
+* CREATE DATA CLEANER CODE
+* EXECUTING AGENT CODE
+* EXPLAIN AGENT CODE
+```
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Commit your changes
+4. Push to the branch
+5. Create a Pull Request
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for details.
+
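The README usage example hands the data to the agent as a plain dictionary (`df.to_dict()`), and the docstring in `agents.py` rebuilds a DataFrame from `response['data_cleaned']`. The sketch below shows only that dictionary round trip, with no agent or LLM involved. The small `df` is made-up illustration data, not the churn dataset the project uses.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "monthly_charges": [29.5, np.nan, 80.0],
        "contract": ["monthly", "yearly", "monthly"],
    }
)

# Shape passed under the "data_raw" key: {column -> {index -> value}}.
data_raw = df.to_dict()

# A dictionary in the same shape (such as "data_cleaned") converts back the same way.
df_restored = pd.DataFrame(data_raw)

# Values, column names, and index labels survive the round trip.
pd.testing.assert_frame_equal(df, df_restored, check_dtype=False)
print(df_restored)
```

Passing dictionaries keeps the graph state made of plain Python objects, at the cost of converting to and from a DataFrame at each step that touches the data.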
File without changes
@@ -0,0 +1 @@
+__version__ = "0.0.0.9000"
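`setup.py`, later in this diff, reads this file with `exec` instead of importing the package. A common reason for that pattern is to get the version string without importing the package's dependencies at build time. A minimal sketch of the pattern, using a temporary file as a stand-in for `ai_data_science_team/_version.py`:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for ai_data_science_team/_version.py (same single line).
    version_file = os.path.join(tmp_dir, "_version.py")
    with open(version_file, "w", encoding="utf-8") as fp:
        fp.write('__version__ = "0.0.0.9000"\n')

    # Execute the file into an empty namespace instead of importing the package.
    namespace = {}
    with open(version_file, encoding="utf-8") as fp:
        exec(fp.read(), namespace)

package_version = namespace["__version__"]
print(package_version)
```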
@@ -0,0 +1,325 @@
+# BUSINESS SCIENCE UNIVERSITY
+# AI DATA SCIENCE TEAM
+# ***
+# Agents
+# ai_data_science_team/agents.py
+
+# Libraries
+from typing import TypedDict, Annotated, Sequence
+import operator
+
+from langchain.prompts import PromptTemplate
+from langchain_core.messages import BaseMessage
+from langgraph.graph import END, StateGraph
+
+import os
+import io
+import pandas as pd
+
+from ai_data_science_team.templates.agent_templates import execute_agent_code_on_data, fix_agent_code, explain_agent_code
+from ai_data_science_team.tools.parsers import PythonOutputParser
+
+# Setup
+
+LOG_PATH = os.path.join(os.getcwd(), "logs/")
+
+
+# * Data Cleaning Agent
+
+def data_cleaning_agent(model, log=False, log_path=None):
+    """
+    Creates a data cleaning agent that can be run on a dataset. The agent can be used to clean a dataset in a variety of
+    ways, such as removing columns with more than 40% missing values, imputing missing
+    values with the mean of the column if the column is numeric, or imputing missing
+    values with the mode of the column if the column is categorical.
+    The agent takes in a dataset and some user instructions, and outputs a python
+    function that can be used to clean the dataset. The agent also logs the code
+    generated and any errors that occur.
+
+    Parameters
+    ----------
+    model : langchain.llms.base.LLM
+        The language model to use to generate code.
+    log : bool, optional
+        Whether or not to log the code generated and any errors that occur.
+        Defaults to False.
+    log_path : str, optional
+        The path to the directory where the log files should be stored. Defaults to
+        "logs/".
+
+    Examples
+    --------
+    ``` python
+    import pandas as pd
+    from langchain_openai import ChatOpenAI
+    from ai_data_science_team.agents import data_cleaning_agent
+
+    llm = ChatOpenAI(model = "gpt-4o-mini")
+
+    data_cleaning_agent = data_cleaning_agent(llm)
+
+    df = pd.read_csv("https://raw.githubusercontent.com/business-science/ai-data-science-team/refs/heads/master/data/churn_data.csv")
+
+    response = data_cleaning_agent.invoke({
+        "user_instructions": "Don't remove outliers when cleaning the data.",
+        "data_raw": df.to_dict(),
+        "max_retries":3,
+        "retry_count":0
+    })
+
+    pd.DataFrame(response['data_cleaned'])
+    ```
+
+    Returns
+    -------
+    app : compiled langgraph.graph.StateGraph
+        The data cleaning agent as a compiled state graph.
+    """
+    llm = model
+
+    # Setup Log Directory
+    if log:
+        if log_path is None:
+            log_path = LOG_PATH
+        if not os.path.exists(log_path):
+            os.makedirs(log_path)
+
+    # Define GraphState for the router
+    class GraphState(TypedDict):
+        messages: Annotated[Sequence[BaseMessage], operator.add]
+        user_instructions: str
+        data_raw: dict
+        data_cleaner_function: str
+        data_cleaner_error: str
+        data_cleaned: dict
+        max_retries: int
+        retry_count: int
+
+
+    def create_data_cleaner_code(state: GraphState):
+        print("---DATA CLEANING AGENT----")
+        print(" * CREATE DATA CLEANER CODE")
+
+        data_cleaning_prompt = PromptTemplate(
+            template="""
+            You are a Data Cleaning Agent. Your job is to create a data_cleaner() function that can be run on the data provided.
+
+            Things that should be considered in the data cleaning function:
+
+            * Removing columns if more than 40 percent of the data is missing
+            * Imputing missing values with the mean of the column if the column is numeric
+            * Imputing missing values with the mode of the column if the column is categorical
+            * Converting columns to the correct data type
+            * Removing duplicate rows
+            * Removing rows with missing values
+            * Removing rows with extreme outliers (3X the interquartile range)
+
+            Make sure to take into account any additional user instructions that may negate some of these steps or add new steps. Include comments in your code to explain your reasoning for each step. Include comments if something is not done because a user requested. Include comments if something is done because a user requested.
+
+            User instructions:
+            {user_instructions}
+
+            Return Python code in ```python ``` format with a single function definition, data_cleaner(data_raw), that includes all imports inside the function.
+
+            You can use Pandas, Numpy, and Scikit Learn libraries to clean the data.
+
+            Sample Data (first 5 rows):
+            {data_head}
+
+            Data Description:
+            {data_description}
+
+            Data Info:
+            {data_info}
+
+            Return code to provide the data cleaning function:
+
+            def data_cleaner(data_raw):
+                import pandas as pd
+                import numpy as np
+                ...
+                return data_cleaned
+
+            Best Practices and Error Preventions:
+
+            Always ensure that when assigning the output of fit_transform() from SimpleImputer to a Pandas DataFrame column, you call .ravel() or flatten the array, because fit_transform() returns a 2D array while a DataFrame column is 1D.
+
+            """,
+            input_variables=["user_instructions","data_head", "data_description", "data_info"]
+        )
+
+        data_cleaning_agent = data_cleaning_prompt | llm | PythonOutputParser()
+
+        data_raw = state.get("data_raw")
+
+        df = pd.DataFrame.from_dict(data_raw)
+
+        buffer = io.StringIO()
+        df.info(buf=buffer)
+        info_text = buffer.getvalue()
+
+        response = data_cleaning_agent.invoke({
+            "user_instructions": state.get("user_instructions"),
+            "data_head": df.head().to_string(),
+            "data_description": df.describe().to_string(),
+            "data_info": info_text
+        })
+
+        # For logging: store the code generated:
+        if log:
+            with open(os.path.join(log_path, 'data_cleaner.py'), 'w') as file:
+                file.write(response)
+
+        return {"data_cleaner_function" : response}
+
+    def execute_data_cleaner_code(state):
+        return execute_agent_code_on_data(
+            state=state,
+            data_key="data_raw",
+            result_key="data_cleaned",
+            error_key="data_cleaner_error",
+            code_snippet_key="data_cleaner_function",
+            agent_function_name="data_cleaner",
+            pre_processing=lambda data: pd.DataFrame.from_dict(data),
+            post_processing=lambda df: df.to_dict(),
+            error_message_prefix="An error occurred during data cleaning: "
+        )
+
+    def fix_data_cleaner_code(state: GraphState):
+        data_cleaner_prompt = """
+        You are a Data Cleaning Agent. Your job is to create a data_cleaner() function that can be run on the data provided. The function is currently broken and needs to be fixed.
+
+        Make sure to only return the function definition for data_cleaner().
+
+        Return Python code in ```python``` format with a single function definition, data_cleaner(data_raw), that includes all imports inside the function.
+
+        This is the broken code (please fix):
+        {code_snippet}
+
+        Last Known Error:
+        {error}
+        """
+
+        return fix_agent_code(
+            state=state,
+            code_snippet_key="data_cleaner_function",
+            error_key="data_cleaner_error",
+            llm=llm,
+            prompt_template=data_cleaner_prompt,
+            log=log,
+            log_path=log_path,
+            log_file_name="data_cleaner.py"
+        )
+
+    def explain_data_cleaner_code(state: GraphState):
+        return explain_agent_code(
+            state=state,
+            code_snippet_key="data_cleaner_function",
+            result_key="messages",
+            error_key="data_cleaner_error",
+            llm=llm,
+            explanation_prompt_template="""
+            Explain the data cleaning steps that the data cleaning agent performed in this function.
+            Keep the summary succinct and to the point.\n\n# Data Cleaning Agent:\n\n{code}
+            """,
+            success_prefix="# Data Cleaning Agent:\n\n ",
+            error_message="The Data Cleaning Agent encountered an error during data cleaning. Data could not be explained."
+        )
+
+
+    workflow = StateGraph(GraphState)
+
+    workflow.add_node("create_data_cleaner_code", create_data_cleaner_code)
+    workflow.add_node("execute_data_cleaner_code", execute_data_cleaner_code)
+    workflow.add_node("fix_data_cleaner_code", fix_data_cleaner_code)
+    workflow.add_node("explain_data_cleaner_code", explain_data_cleaner_code)
+
+    workflow.set_entry_point("create_data_cleaner_code")
+    workflow.add_edge("create_data_cleaner_code", "execute_data_cleaner_code")
+
+    workflow.add_conditional_edges(
+        "execute_data_cleaner_code",
+        lambda state: "fix_code"
+            if (state.get("data_cleaner_error") is not None
+                and state.get("retry_count") is not None
+                and state.get("max_retries") is not None
+                and state.get("retry_count") < state.get("max_retries"))
+            else "explain_code",
+        {"fix_code": "fix_data_cleaner_code", "explain_code": "explain_data_cleaner_code"},
+    )
+
+    workflow.add_edge("fix_data_cleaner_code", "execute_data_cleaner_code")
+    workflow.add_edge("explain_data_cleaner_code", END)
+
+    app = workflow.compile()
+
+    return app
+
+# # * Data Summary Agent
+
+# def data_summary_agent(model, log=True, log_path=None):
+
+#     # Setup Log Directory
+#     if log:
+#         if log_path is None:
+#             log_path = LOG_PATH
+#         if not os.path.exists(log_path):
+#             os.makedirs(log_path)
+
+#     llm = model
+
+#     data_summary_prompt = PromptTemplate(
+#         template="""
+#         You are a Data Summary Agent. Your job is to summarize a dataset.
+
+#         Things that should be considered in the data summary function:
+
+#         * How many missing values
+#         * How many unique values
+#         * How many rows
+#         * How many columns
+#         * What data types are present
+#         * What the data looks like
+#         * What column types are present
+#         * What is the distribution of the data
+#         * What is the correlation between the data
+
+#         Make sure to take into account any additional user instructions that may negate some of these steps or add new steps.
+
+#         User instructions:
+#         {user_instructions}
+
+#         Return Python code in ```python ``` format with a single function definition, data_summary(data), that includes all imports inside the function.
+
+#         You can use Pandas, Numpy, and Scikit Learn libraries to summarize the data.
+
+#         Sample Data (first 100 rows):
+#         {data_head}
+
+#         Data Description:
+#         {data_description}
+
+#         Data Info:
+#         {data_info}
+
+#         Return code to provide the data summary function:
+
+#         def data_summary(data):
+#             import pandas as pd
+#             import numpy as np
+#             ...
+#             return {
+#                 'data_summary': ...,
+#                 'data_correlation': ...
+#                 [INSERT MORE KEYS HERE],
+#             }
+
+#         """,
+#         input_variables=["user_instructions","data_head", "data_description", "data_info"]
+#     )
+
+#     data_summary_agent = data_summary_prompt | llm | PythonOutputParser()
+
+
+
+#     return 1
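The prompt in `create_data_cleaner_code` lists default cleaning steps and asks the LLM for a single `data_cleaner(data_raw)` function with its imports inside. What the LLM returns will vary from run to run, so the following is only a hand-written sketch of a function in that shape, not output from the agent. It assumes that "3X the interquartile range" means values more than 3 × IQR below the first quartile or above the third, and it leaves out data type conversion because that depends on the dataset. The sample `df` is made up.

```python
import numpy as np
import pandas as pd


def data_cleaner(data_raw):
    import pandas as pd

    # Drop columns where more than 40 percent of the values are missing.
    missing_share = data_raw.isna().mean()
    data_cleaned = data_raw.loc[:, missing_share <= 0.40].copy()

    # Impute: mean for numeric columns, mode for everything else.
    for column in data_cleaned.columns:
        series = data_cleaned[column]
        if not series.isna().any():
            continue
        if pd.api.types.is_numeric_dtype(series):
            data_cleaned[column] = series.fillna(series.mean())
        else:
            modes = series.mode(dropna=True)
            if not modes.empty:
                data_cleaned[column] = series.fillna(modes.iloc[0])

    # Remove duplicate rows, then any rows that still have missing values.
    data_cleaned = data_cleaned.drop_duplicates().dropna()

    # Remove rows with extreme outliers (beyond 3x the interquartile range).
    numeric = data_cleaned.select_dtypes(include="number")
    if not numeric.empty:
        q1 = numeric.quantile(0.25)
        q3 = numeric.quantile(0.75)
        iqr = q3 - q1
        within = ((numeric >= q1 - 3 * iqr) & (numeric <= q3 + 3 * iqr)).all(axis=1)
        data_cleaned = data_cleaned.loc[within]

    return data_cleaned.reset_index(drop=True)


# Made-up data: one sparse column, one duplicate row, one extreme value.
df = pd.DataFrame(
    {
        "age": [25, 30, np.nan, 30, 1000, 28, 31, 27],
        "plan": ["a", "b", "a", "b", "a", None, "a", "a"],
        "mostly_missing": [np.nan] * 7 + [1.0],
    }
)
cleaned = data_cleaner(df)
print(cleaned)
```

Note that the mean used for imputation is computed before outliers are removed, so an extreme value can skew the imputed number. The order of steps is a design choice the prompt leaves to the LLM.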
@@ -0,0 +1,17 @@
+# BUSINESS SCIENCE UNIVERSITY
+# AI DATA SCIENCE TEAM
+# ***
+# Orchestration
+# ai_data_science_team/orchestration.py
+
+from ai_data_science_team.agents import data_cleaning_agent
+
+# TODO - add orchestration
+
+# def model_pipeline(model, log=True, log_path=None):
+
+#     return "todo"
+
+
+
+
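Orchestration across agents is still a TODO in this release, so the only control flow is the retry loop inside `data_cleaning_agent`. After the generated code runs, the graph routes to the fix step while an error is present and retries remain, and to the explanation step otherwise. The conditional edge's lambda can be sketched as a plain function so it can be checked without an LLM. `route_after_execution` is a hypothetical name, and the sketch does not model where `retry_count` is incremented, which this diff does not show.

```python
from typing import Any, Mapping


def route_after_execution(state: Mapping[str, Any]) -> str:
    """Return "fix_code" while an error exists and retries remain, else "explain_code"."""
    error = state.get("data_cleaner_error")
    retry_count = state.get("retry_count")
    max_retries = state.get("max_retries")

    should_retry = (
        error is not None
        and retry_count is not None
        and max_retries is not None
        and retry_count < max_retries
    )
    return "fix_code" if should_retry else "explain_code"


# No error: go straight to the explanation step.
no_error = route_after_execution(
    {"data_cleaner_error": None, "retry_count": 0, "max_retries": 3}
)
# Error with retries left: try to fix the code.
retry = route_after_execution(
    {"data_cleaner_error": "KeyError: 'x'", "retry_count": 1, "max_retries": 3}
)
# Error but retries exhausted: fall through to the explanation step.
exhausted = route_after_execution(
    {"data_cleaner_error": "KeyError: 'x'", "retry_count": 3, "max_retries": 3}
)
print(no_error, retry, exhausted)
```

One consequence visible in the sketch: if `max_retries` or `retry_count` is left out of the input state, a failing run is never retried, which is why the usage examples pass both keys explicitly.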
@@ -0,0 +1,131 @@
+Metadata-Version: 2.1
+Name: ai-data-science-team
+Version: 0.0.0.9000
+Summary: Build and run an AI-powered data science team.
+Home-page: https://github.com/business-science/ai-data-science-team
+Author: Matt Dancho
+Author-email: mdancho@business-science.io
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: openpyxl
+Requires-Dist: langchain
+Requires-Dist: langchain_community
+Requires-Dist: langchain_openai
+Requires-Dist: langchain_experimental
+Requires-Dist: langgraph
+Requires-Dist: openai
+Requires-Dist: pandas
+Requires-Dist: numpy
+Requires-Dist: plotly
+Requires-Dist: streamlit
+Requires-Dist: scikit-learn
+Requires-Dist: xgboost
+
+# AI Data Science Team
+
+**An AI-powered data science team that uses agents to perform common data science tasks**, including data cleaning, preparation, feature engineering, modeling (machine learning), and interpretation, on business problems such as:
+
+- Churn Modeling
+- Employee Attrition
+- Lead Scoring
+- Insurance Risk
+- Credit Card Risk
+- And more
+
+## Companies That Want An AI Data Science Team
+
+If you are interested in having your own AI Data Science Team built and deployed for your enterprise, send inquiries here: [https://www.business-science.io/contact.html](https://www.business-science.io/contact.html)
+
+## Free Generative AI For Data Scientists Workshop
+
+If you want to learn how to build AI Agents for your company that perform Data Science, Business Intelligence, Churn Modeling, Time Series Forecasting, and more, [register for my next Generative AI for Data Scientists workshop here.](https://learn.business-science.io/ai-register)
+
+## Agents
+
+This project is a work in progress. New agents will be released soon.
+
+![Data Science Team](/img/ai_data_science_team.jpg)
+
+### Agents Available Now:
+
+1. **Data Cleaning Agent:** Performs Data Preparation steps including handling missing values, outliers, and data type conversions.
+
+### Agents Coming Soon:
+
+1. **Supervisor:** Forms task list. Moderates sub-agents. Returns completed assignment.
+2. **Exploratory Data Analyst:** Analyzes data structure, creates exploratory visualizations, and performs correlation analysis to identify relationships.
+3. **Feature Engineering Agent:** Converts the prepared data into ML-ready data. Adds features to increase predictive accuracy of ML models.
+4. **Machine Learning Agent:** Builds and logs the machine learning models.
+5. **Interpretability Agent:** Performs Interpretable ML to explain the model's predictions, including which features were most important to the model.
+
+## Disclaimer
+
+**This project is for educational purposes only.**
+
+- It is not intended to replace your company's data science team
+- No warranties or guarantees provided
+- Creator assumes no liability for financial loss
+- Consult an experienced Generative AI Data Scientist for building your own AI Data Science Team
+- If you want an enterprise-grade AI Data Science Team, [send inquiries here](https://www.business-science.io/contact.html).
+
+By using this software, you agree to use it solely for learning purposes.
+
+## Table of Contents
+
+- [AI Data Science Team](#ai-data-science-team)
+  - [Companies That Want An AI Data Science Team](#companies-that-want-an-ai-data-science-team)
+  - [Free Generative AI For Data Scientists Workshop](#free-generative-ai-for-data-scientists-workshop)
+  - [Agents](#agents)
+    - [Agents Available Now:](#agents-available-now)
+    - [Agents Coming Soon:](#agents-coming-soon)
+  - [Disclaimer](#disclaimer)
+  - [Table of Contents](#table-of-contents)
+  - [Installation](#installation)
+  - [Usage](#usage)
+    - [Example 1: Cleaning Data with the Data Cleaning Agent](#example-1-cleaning-data-with-the-data-cleaning-agent)
+  - [Contributing](#contributing)
+  - [License](#license)
+
+## Installation
+
+``` bash
+pip install git+https://github.com/business-science/ai-data-science-team.git --upgrade
+```
+
+## Usage
+
+### Example 1: Cleaning Data with the Data Cleaning Agent
+
+[See the full example here.](https://github.com/business-science/ai-data-science-team/blob/master/examples/data_cleaning_agent.ipynb)
+
+``` python
+data_cleaning_agent = data_cleaning_agent(model = llm, log=LOG, log_path=LOG_PATH)
+
+response = data_cleaning_agent.invoke({
+    "user_instructions": "Don't remove outliers when cleaning the data.",
+    "data_raw": df.to_dict(),
+    "max_retries":3,
+    "retry_count":0
+})
+```
+
+``` bash
+---DATA CLEANING AGENT----
+* CREATE DATA CLEANER CODE
+* EXECUTING AGENT CODE
+* EXPLAIN AGENT CODE
+```
+
+## Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Commit your changes
+4. Push to the branch
+5. Create a Pull Request
+
+## License
+
+This project is licensed under the MIT License. See the LICENSE file for details.
+
@@ -0,0 +1,12 @@
+LICENSE
+README.md
+setup.py
+ai_data_science_team/__init__.py
+ai_data_science_team/_version.py
+ai_data_science_team/agents.py
+ai_data_science_team/orchestration.py
+ai_data_science_team.egg-info/PKG-INFO
+ai_data_science_team.egg-info/SOURCES.txt
+ai_data_science_team.egg-info/dependency_links.txt
+ai_data_science_team.egg-info/requires.txt
+ai_data_science_team.egg-info/top_level.txt
@@ -0,0 +1 @@
+
@@ -0,0 +1 @@
+ai_data_science_team
@@ -0,0 +1,42 @@
+from setuptools import find_packages, setup
+
+def parse_requirements(filename):
+    with open(filename, "r") as f:
+        return [line.strip() for line in f if line.strip() and not line.strip().startswith("#")]
+
+with open("README.md", "r", encoding="utf-8", errors="ignore") as fh:
+    long_description = fh.read()
+
+version = {}
+with open("ai_data_science_team/_version.py", encoding="utf-8") as fp:
+    exec(fp.read(), version)
+
+
+setup(
+    name="ai-data-science-team",
+    version=version["__version__"],
+    description="Build and run an AI-powered data science team.",
+    author="Matt Dancho",
+    author_email="mdancho@business-science.io",
+    long_description=long_description,
+    long_description_content_type="text/markdown",
+    url="https://github.com/business-science/ai-data-science-team",
+    packages=find_packages(),
+    # install_requires=parse_requirements("requirements.txt"),
+    install_requires=[
+        'openpyxl',
+        'langchain',
+        'langchain_community',
+        'langchain_openai',
+        'langchain_experimental',
+        'langgraph',
+        'openai',
+        'pandas',
+        'numpy',
+        'plotly',
+        'streamlit',
+        'scikit-learn',
+        'xgboost',
+    ],
+    python_requires=">=3.9",
+)
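`parse_requirements` is defined but not used, since `install_requires` is hard-coded. If the commented-out call were enabled, the line filter would have to strip whitespace before testing for emptiness, because each line read from a file still carries its trailing newline. A minimal sketch under that assumption follows; the requirements file content is made up for illustration.

```python
import os
import tempfile


def parse_requirements(filename):
    """Return requirement strings, skipping blank lines and comment lines."""
    requirements = []
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                requirements.append(stripped)
    return requirements


# Made-up requirements file for illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "requirements.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# core\npandas\n\n  # indented comment\nlanggraph\n")
    parsed = parse_requirements(path)

print(parsed)
```

Without the strip, the blank line would pass the emptiness check as `"\n"` and end up in the list as an empty requirement string.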