mcp-automl 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- mcp_automl-0.1.0/.gitignore +18 -0
- mcp_automl-0.1.0/.python-version +1 -0
- mcp_automl-0.1.0/PKG-INFO +90 -0
- mcp_automl-0.1.0/README.md +76 -0
- mcp_automl-0.1.0/pyproject.toml +31 -0
- mcp_automl-0.1.0/skill/data-science-workflow/SKILL.md +387 -0
- mcp_automl-0.1.0/src/mcp_automl/__init__.py +0 -0
- mcp_automl-0.1.0/src/mcp_automl/__main__.py +4 -0
- mcp_automl-0.1.0/src/mcp_automl/server.py +946 -0
- mcp_automl-0.1.0/tests/test_server.py +418 -0
- mcp_automl-0.1.0/uv.lock +2836 -0

mcp_automl-0.1.0/.python-version
@@ -0,0 +1 @@
3.11

mcp_automl-0.1.0/PKG-INFO
@@ -0,0 +1,90 @@
Metadata-Version: 2.4
Name: mcp-automl
Version: 0.1.0
Summary: MCP server for end-to-end machine learning
Requires-Python: >=3.11
Requires-Dist: duckdb>=1.4.3
Requires-Dist: joblib<1.4
Requires-Dist: mcp>=1.21.2
Requires-Dist: pandas<2.2.0
Requires-Dist: pycaret>=3.0.0
Requires-Dist: scikit-learn<1.4
Requires-Dist: tabulate>=0.9.0
Description-Content-Type: text/markdown

# MCP AutoML

MCP AutoML is a server that enables AI Agents to perform end-to-end machine learning workflows, including data inspection, processing, and model training. With MCP AutoML, AI Agents can do more than a typical AutoML framework: they can identify the target, set a baseline, and create features on their own.

MCP AutoML separates tools from workflows, allowing you to create your own workflow.

## Features

- **Data Inspection**: Analyze datasets with comprehensive statistics, data types, and previews
- **SQL-based Data Processing**: Transform and engineer features using DuckDB SQL queries
- **AutoML Training**: Train classification and regression models with automatic model comparison using PyCaret
- **Prediction**: Make predictions using trained models
- **Multi-format Support**: Works with CSV, Parquet, and JSON files

## Usage

### Configure MCP Server

Add to your MCP client configuration (e.g., Claude Desktop, Gemini CLI, Cursor, Antigravity):

```json
{
  "mcpServers": {
    "mcp-automl": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/idea7766/mcp-automl", "mcp-automl"]
    }
  }
}
```

### Available Tools

| Tool | Description |
|------|-------------|
| `inspect_data` | Get comprehensive statistics and preview of a dataset |
| `query_data` | Execute DuckDB SQL queries on data files |
| `process_data` | Transform data using SQL and save to a new file |
| `train_classifier` | Train a classification model with AutoML |
| `train_regressor` | Train a regression model with AutoML |
| `predict` | Make predictions using a trained model |
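
`query_data` and `process_data` take plain DuckDB SQL that reads data files directly. A minimal sketch of the kind of query they accept (the file path and `target` column are illustrative placeholders):

```sql
-- Class balance of a local CSV, as a query_data call might request it
SELECT target, COUNT(*) AS n
FROM 'data/train.csv'
GROUP BY target
ORDER BY n DESC
```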

## Agent Skill

MCP AutoML includes a **data science workflow skill** that guides AI agents through best practices for machine learning projects. This skill teaches agents to:

- Identify targets and establish baselines
- Perform exploratory data analysis
- Engineer domain-specific features
- Train and evaluate models systematically

### Installing the Skill

Copy the skill directory to your agent's skill folder:

```bash
# For Gemini Code Assist
cp -r skill/data-science-workflow ~/.gemini/skills/

# For Claude Code
cp -r skill/data-science-workflow ~/.claude/skills/

# For other agents, copy to their respective skill directories
```

The skill file is located at `skill/data-science-workflow/SKILL.md`.

## Configuration

Models and experiments are saved to `~/.mcp-automl/experiments/` by default.

## Dependencies

- [PyCaret](https://pycaret.org/) - AutoML library
- [DuckDB](https://duckdb.org/) - Fast SQL analytics
- [MCP](https://github.com/modelcontextprotocol/python-sdk) - Model Context Protocol SDK

mcp_automl-0.1.0/README.md
@@ -0,0 +1,76 @@
# MCP AutoML

MCP AutoML is a server that enables AI Agents to perform end-to-end machine learning workflows, including data inspection, processing, and model training. With MCP AutoML, AI Agents can do more than a typical AutoML framework: they can identify the target, set a baseline, and create features on their own.

MCP AutoML separates tools from workflows, allowing you to create your own workflow.

## Features

- **Data Inspection**: Analyze datasets with comprehensive statistics, data types, and previews
- **SQL-based Data Processing**: Transform and engineer features using DuckDB SQL queries
- **AutoML Training**: Train classification and regression models with automatic model comparison using PyCaret
- **Prediction**: Make predictions using trained models
- **Multi-format Support**: Works with CSV, Parquet, and JSON files

## Usage

### Configure MCP Server

Add to your MCP client configuration (e.g., Claude Desktop, Gemini CLI, Cursor, Antigravity):

```json
{
  "mcpServers": {
    "mcp-automl": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/idea7766/mcp-automl", "mcp-automl"]
    }
  }
}
```

### Available Tools

| Tool | Description |
|------|-------------|
| `inspect_data` | Get comprehensive statistics and preview of a dataset |
| `query_data` | Execute DuckDB SQL queries on data files |
| `process_data` | Transform data using SQL and save to a new file |
| `train_classifier` | Train a classification model with AutoML |
| `train_regressor` | Train a regression model with AutoML |
| `predict` | Make predictions using a trained model |
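
`query_data` and `process_data` take plain DuckDB SQL that reads data files directly. A minimal sketch of the kind of query they accept (the file path and `target` column are illustrative placeholders):

```sql
-- Class balance of a local CSV, as a query_data call might request it
SELECT target, COUNT(*) AS n
FROM 'data/train.csv'
GROUP BY target
ORDER BY n DESC
```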

## Agent Skill

MCP AutoML includes a **data science workflow skill** that guides AI agents through best practices for machine learning projects. This skill teaches agents to:

- Identify targets and establish baselines
- Perform exploratory data analysis
- Engineer domain-specific features
- Train and evaluate models systematically

### Installing the Skill

Copy the skill directory to your agent's skill folder:

```bash
# For Gemini Code Assist
cp -r skill/data-science-workflow ~/.gemini/skills/

# For Claude Code
cp -r skill/data-science-workflow ~/.claude/skills/

# For other agents, copy to their respective skill directories
```

The skill file is located at `skill/data-science-workflow/SKILL.md`.

## Configuration

Models and experiments are saved to `~/.mcp-automl/experiments/` by default.

## Dependencies

- [PyCaret](https://pycaret.org/) - AutoML library
- [DuckDB](https://duckdb.org/) - Fast SQL analytics
- [MCP](https://github.com/modelcontextprotocol/python-sdk) - Model Context Protocol SDK

mcp_automl-0.1.0/pyproject.toml
@@ -0,0 +1,31 @@
[project]
name = "mcp-automl"
version = "0.1.0"
description = "MCP server for end-to-end machine learning"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "duckdb>=1.4.3",
    "joblib<1.4",
    "mcp>=1.21.2",
    "pandas<2.2.0",
    "pycaret>=3.0.0",
    "scikit-learn<1.4",
    "tabulate>=0.9.0",
]

[project.scripts]
mcp-automl = "mcp_automl.server:main"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv]
package = true

[dependency-groups]
dev = [
    "pytest-asyncio>=1.3.0",
    "pyarrow>=14.0.0",
]

mcp_automl-0.1.0/skill/data-science-workflow/SKILL.md
@@ -0,0 +1,387 @@
---
name: data-science-workflow
description: The primary workflow for all data science projects. Use this skill whenever a user asks to train a model, build a model, perform analysis, or do analytics. It autonomously orchestrates the full pipeline - data inspection, cleaning, feature engineering, and AutoML training - to deliver the best possible results.
---

# LLM Usage Guide: Production Data Science Workflow (Consultative)

This guide outlines how to handle user requests for model training, especially when instructions are vague (e.g., "Train a model on data.csv").

---

## Universal Workflow Principles

### 1. Documentation First
Always check for and read documentation files before inspecting data. Column descriptions prevent incorrect assumptions.

### 2. Scale Awareness
Check data size before processing or training. Large datasets require sampling for efficient iteration.

### 3. Transparency
Communicate your understanding, assumptions, and plan to the user before executing. Allow them to correct course early.

### 4. Iterative Refinement
Use smaller samples for development iterations. Reserve full data for final model training only.

### 5. Preserve Provenance
When creating processed files, use clear naming that indicates source, processing applied, and sample size.
Example: `train_processed_10k_sample.parquet`

### 6. No NaN/Inf in Features
Never create features that produce NaN or Inf values. Common pitfalls and fixes:
- **Division**: Always use `NULLIF()` → `a / NULLIF(b, 0)`
- **Log/Sqrt**: Guard against zero/negative → `LOG(GREATEST(x, 1))`, `SQRT(GREATEST(x, 0))`
- **Missing propagation**: Use `COALESCE()` → `COALESCE(a, 0) + COALESCE(b, 0)`

---

## Phase 0: Initial Triage (The "Vague Request" Handler)
**Trigger**: User provides data but no specific instructions.

1. **Inspect First**: ALWAYS call `inspect_data(data_path)` immediately to understand the table structure. If there are multiple files, inspect all of them unless, after reading the documentation, you are confident a file is not relevant.
2. **Identify Target**:
   - *Confident*: If there is an obvious target (e.g., "churn", "target", "price", "species"), **assume it** and state your assumption.
   - *Ambiguous*: If multiple columns could be targets, **ASK the user**. ("I see 'price' and 'quantity'. Which one are we predicting?")
3. **Determine Goal (consultative)**:
   - *Confident*: If the target implies the goal (e.g., "fraud" → missing fraud cases is costly), suggest the appropriate metric (Recall/Precision).
   - *Ambiguous*: Ask for the business outcome. ("Are we trying to minimize missing fraud cases, or minimize false alarms?")

---

## Phase 0.5: Dataset Discovery
**Trigger**: Dataset is a directory (not a single file), OR any file > 50MB.

### Step 1: Read Documentation First (MANDATORY)
Before ANY data inspection, search the directory for documentation:
- README files: `README`, `README.md`, `README.txt`
- Description files: any file whose name contains "description", "metadata", "schema", or "dictionary"
- Data dictionaries: `.json`, `.yaml`, `.txt` files that aren't data

**Why**: Documentation explains table relationships, column meanings, and intended use cases. Skipping this leads to incorrect assumptions.

### Step 2: Inventory All Data Files
List all data files and check their sizes/row counts:
```sql
SELECT COUNT(*) AS row_count FROM 'filename.csv'
```

Categorize files into:
- **Primary table**: Contains the target variable (usually `train`, `main`, or similar naming)
- **Auxiliary tables**: Related data that can be aggregated (transactions, history, logs)
- **Test/submission files**: Held-out data for final predictions

### Step 3: Assess Scale & Plan Accordingly
| Data Scale | Definition | Required Action |
|------------|------------|-----------------|
| Small | < 50K rows | Proceed normally |
| Medium | 50K - 200K rows | Recommend sampling for development |
| Large | > 200K rows | **Require** sampling; inform user |

**Sampling Strategy**: Create a sample that preserves the target distribution. A simple random sample is often adequate:
```sql
SELECT * FROM data
ORDER BY RANDOM()
LIMIT [10-20% of original, max 50K rows]
```
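
If classes are imbalanced, simple random sampling can distort the target distribution; a per-class (stratified) variant is safer. A minimal DuckDB sketch, assuming a `target` column and a `train.csv` source (both placeholders):

```sql
-- Stratified ~10% sample: rank rows randomly within each class, keep the first 10%
SELECT * EXCLUDE (rn, cls_total)
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY target ORDER BY RANDOM()) AS rn,
           COUNT(*) OVER (PARTITION BY target) AS cls_total
    FROM 'train.csv'
)
WHERE rn <= cls_total * 0.10
```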

### Step 4: Multi-Table Strategy
If multiple related tables exist:
1. **Identify join keys** (from documentation or column inspection)
2. **Plan aggregations**: how to summarize auxiliary tables for joining with the primary table (see the sketch below)
3. **Communicate plan** to user before executing
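
A minimal sketch of step 2, assuming hypothetical `applications.csv` / `transactions.csv` files joined on a `customer_id` key:

```sql
-- Summarize the auxiliary transactions table, then attach it to the primary table
SELECT p.*,
       COALESCE(t.txn_count, 0)    AS txn_count,
       COALESCE(t.total_amount, 0) AS total_amount
FROM 'applications.csv' AS p
LEFT JOIN (
    SELECT customer_id,
           COUNT(*)    AS txn_count,
           SUM(amount) AS total_amount
    FROM 'transactions.csv'
    GROUP BY customer_id
) AS t USING (customer_id)
```

Note the `COALESCE(..., 0)` guards: entities with no auxiliary rows would otherwise get NULL features (Principle #6).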

### Step 5: Confirm Scope with User
Before proceeding, state:
> "Dataset contains [X files, Y total rows]. I plan to:
> - Use [primary_table] as the main dataset
> - [Sample to N rows / Use full data]
> - [Aggregate features from auxiliary tables / Use primary only]
>
> Proceed?"

---

## Phase 1: Project Definition
**Goal**: Lock down success criteria and establish a naive baseline before training.

### Check:
- **Problem Type**: Classification vs Regression
- **Primary Metric**: Choose based on business goal:
  - Safety-critical (fraud, medical) → `Recall`
  - Cost-sensitive (marketing, sales) → `Precision`
  - Balanced → `F1`
  - Regression → `R2` or `MAE`

### Establish Naive Baseline
Use `query_data` to calculate a baseline that any useful model must beat:

**For Classification** (majority class baseline):
```sql
-- Class distribution
SELECT target, COUNT(*) AS count,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) AS pct
FROM data GROUP BY target
```
→ Baseline accuracy = largest class percentage (e.g., if 70% are "good", baseline = 70%)

**For Regression** (mean prediction baseline):
```sql
SELECT
    AVG(target) AS mean_baseline,
    AVG(ABS(target - (SELECT AVG(target) FROM data))) AS baseline_MAE
FROM data
```
→ A model must have lower MAE than predicting the mean for every sample.

### Document Baseline to User
State clearly:
> "Naive baseline (always predicting majority class 'good'): 70% accuracy. Our model must exceed this to add value."

---

## Phase 2: EDA (Deep Dive)
**Goal**: Inspect data quality to inform training parameters AND feature engineering opportunities.

### Checklist:
1. **Skewness**: Use `query_data` to check `AVG(col)` vs `MEDIAN(col)` (see the probe sketch after this checklist). → If skew is high, set `transformation=True`.
2. **Ordinality**: Check for inherent order in categories (e.g., "Low/Med/High", "Junior/Senior", rating scales). → Map to `ordinal_features`.
3. **Missingness**:
   - *Simple*: If random/small, use the `numeric_imputation` (mean/median) or `categorical_imputation` (mode) params.
   - *Complex*: If structural/logic-based, use `process_data`.
4. **Class Imbalance**: Check the target distribution. A moderate split (70/30) may not need `fix_imbalance`; an extreme one (95/5) likely benefits from it.
5. **Outliers**: Check for extreme values in numeric columns (use `query_data` with MIN/MAX/STDDEV). → Consider `remove_outliers=True`.
6. **Feature Relationships**: Look for potential interactions (e.g., credit_amount & duration → monthly_payment).
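
A minimal probe covering items 1 and 5 in one pass (`credit_amount` and the file path are placeholders):

```sql
-- Skew and outlier check for one numeric column
SELECT AVG(credit_amount)    AS mean,
       MEDIAN(credit_amount) AS median,   -- mean far above median suggests right skew
       STDDEV(credit_amount) AS std,
       MIN(credit_amount)    AS min_val,
       MAX(credit_amount)    AS max_val   -- compare extremes against mean ± 3*std
FROM 'train.csv'
```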

---

## Phase 2.5: Domain Research & Feature Engineering 🔬
**Goal**: Leverage domain knowledge to create high-value features.

### When to Apply:
- **Always consider** for any non-trivial dataset
- **Strongly recommended** when baseline model performance is below expectations
- Skip only for pure exploratory analysis or when time is extremely limited

### Step 1: Identify the Problem Domain

Look at column names and the target variable to identify the domain:

| Column Indicators | Likely Domain |
|-------------------|---------------|
| amount, duration, payment, credit, loan | Financial/Credit Risk |
| churn, subscription, tenure, contract | Customer Churn |
| price, sales, inventory, demand | Retail/E-commerce |
| diagnosis, symptoms, age, medication | Healthcare/Medical |
| latitude, longitude, distance, location | Geographic/Spatial |
| timestamp, date, hour, day_of_week | Time Series/Temporal |

### Step 2: Research Domain Best Practices

(REQUIRED) Search the web for feature engineering patterns for the identified domain:

**Search Queries** (use 2-3 of these):
- `"[domain] machine learning feature engineering best practices"`
- `"[domain] [problem_type] important features"`

**What to Look For**:
- Common ratios/interactions used by practitioners
- Domain-specific KPIs or business metrics
- Regulatory/compliance considerations

### Step 3: Apply Feature Engineering

Based on domain research and data inspection, create features using `process_data`. Common techniques:

1. **Ratios & Intensities**: Divide related numeric features (e.g., total/count, amount/duration)
2. **Binning**: Group continuous variables into meaningful categories
3. **Aggregations**: If multiple rows per entity, create sum/mean/max/min/count
4. **Interactions**: Multiply/combine features that work together
5. **Business Logic Flags**: Create binary indicators based on domain rules

Remember: Always apply the safe patterns from Principle #6 (No NaN/Inf) when creating ratio or derived features; a combined sketch follows.
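
A sketch combining techniques 1, 2, and 5 with the NaN-safe patterns (all column and file names are placeholders):

```sql
SELECT *,
       amount / NULLIF(duration, 0)    AS monthly_intensity,  -- ratio, division guarded
       CASE WHEN age < 25 THEN 'young'
            WHEN age < 60 THEN 'adult'
            ELSE 'senior' END          AS age_band,           -- binning
       CAST(amount > 10000 AS INTEGER) AS high_amount_flag    -- business-logic flag
FROM 'train.csv'
```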

### Step 4: Document Your Reasoning

For each engineered feature, explain to the user:
- **What**: Name and formula
- **Why**: Business rationale
- **Source**: If from research, cite it

---

## Phase 3: Data Processing with Feature Engineering
**Goal**: Create a reliable, enriched dataset (Parquet format).

### Action:
Use `process_data` with a comprehensive SQL query that:
1. **CAST types explicitly** (all numeric columns to INTEGER/FLOAT)
2. **Create engineered features** from Phase 2.5 research
3. **Handle missing values** (if complex logic is needed)
4. **Save as `.parquet`** (strongly recommended over CSV for type preservation)

### Transparency Rule:
You **MUST** show the full SQL query to the user, with comments explaining each engineered feature's business rationale.

### Quality Check:
After processing, call `inspect_data` on the new file to verify:
- All types are correct (no accidental strings for numeric columns)
- New features have reasonable value ranges
- No unexpected missing values or infinite values introduced (a query for this follows)
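
A minimal check of the last point, assuming a derived `monthly_burden` column in a hypothetical `train_processed.parquet`:

```sql
-- Post-processing sanity check: NULLs and non-finite values in a derived feature
SELECT COUNT(*) FILTER (WHERE monthly_burden IS NULL)       AS null_count,
       COUNT(*) FILTER (WHERE NOT isfinite(monthly_burden)) AS non_finite_count
FROM 'train_processed.parquet'
```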

---

## Phase 4: Model Training
**Goal**: Train an optimized model with all insights from EDA and feature engineering.

### Key Parameters:
- `optimize`: The metric agreed in Phase 1 (Recall/Precision/F1/R2/MAE)
- `ordinal_features`: **Critical** - Map all ordinal categories with proper ordering
- `fold=5`: For faster iteration (use `fold=10` for final validation)
- `session_id=42`: For reproducibility

### Model Selection Parameters:
- `include_models`: Train only specific models (faster, good for baselines). Examples:
  - Quick baseline: `include_models=['dt']` (Decision Tree, ~30 seconds)
  - Fast ensemble: `include_models=['rf', 'lightgbm', 'xgboost']`
  - Linear baseline: `include_models=['lr']`
- `exclude_models`: Exclude slow or problematic models. Examples:
  - Skip GPU-requiring: `exclude_models=['catboost']`
  - Skip slow models: `exclude_models=['catboost', 'xgboost']`

**Common Model IDs**:
| Classification | Regression |
|---------------|------------|
| `lr` (Logistic Regression) | `lr` (Linear Regression) |
| `dt` (Decision Tree) | `dt` (Decision Tree) |
| `rf` (Random Forest) | `rf` (Random Forest) |
| `xgboost`, `lightgbm`, `catboost` | `xgboost`, `lightgbm`, `catboost` |
| `knn`, `nb`, `svm` | `ridge`, `lasso`, `en` |

### Conditional Parameters (based on EDA):
- `transformation=True`: If skewed distributions were detected in Phase 2
- `normalize=True`: Recommended for linear/distance-based models (not needed for tree-based)
- `polynomial_features=True`: Generally beneficial, low risk
- `fix_imbalance=True`: Only if extreme imbalance (>80:20) was detected in Phase 2. Use together with the `numeric_imputation` and `categorical_imputation` parameters.
- `remove_outliers=True`: If extreme outliers were detected in Phase 2

### Training Output:
The training result includes:
- **`metadata`**: CV metrics for the best model
- **`test_metrics`**: Holdout set performance
- **`feature_importances`**: Dict of `{feature_name: importance}` sorted by importance (descending)
  - Available for tree-based models (RF, XGBoost, LightGBM, etc.) and linear models
  - Use this to understand which features drive predictions

### Speed Tips:
- For a quick baseline: `include_models=['dt']` (~30 seconds)
- For fast iteration: `include_models=['rf', 'lightgbm']` (~2 minutes)
- ⏱️ Expect 3-10 min for a full model comparison with ~10K rows

### Document:
Report the top 3 models with their metrics to the user in table format.

---

## Phase 5: Evaluation & Comparison
**Goal**: Contextualize results against the naive baseline and select the best model.

### Comparison Table (Include Baseline):
Always show the naive baseline from Phase 1 for context:

| Config | Model | Accuracy | Recall | Precision | F1 | vs Baseline |
|--------|-------|----------|--------|-----------|-----|-------------|
| Naive guess | — | 70.0% | — | — | — | — |
| Trained model | CatBoost | 77.3% | 75.6% | 74.3% | 74.9% | +7.3 pts ✅ |

### Interpret Results:
- Quantify improvement over the naive baseline ("+7.3 percentage points", i.e., 7.3/70.0 ≈ a 10.4% relative improvement)
- Translate to business impact ("Catches 73 more bad credits out of 1000")

### Success Check:
Does the best model meet the Phase 1 success criteria?
- ✅ If yes: Proceed to the Transparency Report; provide the model_id and path
- ❌ If no: Proceed to Phase 6 (Iteration)

---

## Phase 6: Iteration (If Needed)
**Goal**: Systematically improve model performance.

If performance is still below target:

### 1. Analyze Feature Importances
Review `feature_importances` from the training result:
- **High importance features**: Focus engineering efforts here (create interactions, better encodings)
- **Low/zero importance features**: Consider removing to reduce noise
- **Missing domain features**: If expected important features aren't showing up, check for data issues

Example iteration workflow:
```
1. Train baseline: include_models=['rf'] → get feature_importances
2. Identify top 5 features, engineer interactions between them
3. Retrain with engineered features
4. Compare improvement
```

### 2. Feature Engineering Improvements
- Create interactions between the top important features
- Try different encodings for high-importance categoricals
- Bin numeric features that show non-linear relationships

### 3. Model Selection
- If linear models perform poorly: Focus on tree-based (`include_models=['rf', 'xgboost', 'lightgbm']`)
- If tree models overfit: Try regularized linear models (`include_models=['ridge', 'lasso']`)

### 4. Other Strategies
- **Try Different Encodings**: Test WoE transformation for categorical features
- **Ensemble Methods**: Combine multiple models
- **Collect More Data**: Sometimes this is the only solution
- **Revisit Metric Choice**: Confirm we're optimizing the right business objective

**Communicate Progress**: Keep the user informed of iterations and trade-offs.

---

## Phase 7: Transparency Report (MANDATORY)
**Goal**: Provide complete reproducibility documentation.

After training is complete, you **MUST** provide a summary that enables the user to reproduce your work.

### Feature Engineering Decisions:
List each engineered feature with its rationale:

| Feature | Source | Rationale |
|---------|--------|-----------|
| monthly_burden | loan_amount / duration | Captures repayment intensity |
| has_prior_default | Aggregated from history table | Strong risk indicator |

### Data Processing Query:
Provide the complete SQL query used in `process_data`:

```sql
-- Full query used to create the training dataset
SELECT
    id,
    target,
    -- Original features (cast to correct types)
    CAST(age AS INTEGER) AS age,
    CAST(income AS FLOAT) AS income,
    -- Engineered features
    ROUND(CAST(loan_amount AS FLOAT) / NULLIF(CAST(duration AS FLOAT), 0), 2) AS monthly_burden, -- Repayment capacity
    CASE WHEN employment_years < 1 THEN 'new'
         WHEN employment_years < 5 THEN 'mid'
         ELSE 'established' END AS employment_category -- Stability indicator
FROM 'source_data.csv'
WHERE target IS NOT NULL
```

### Model Configuration:
Document the final training parameters used:
- Metric optimized
- Ordinal feature mappings
- Special flags enabled (transformation, fix_imbalance, etc.)

This enables the user to:
1. Understand the reasoning behind each decision
2. Modify and re-run the data processing independently
3. Reproduce the exact model training configuration

mcp_automl-0.1.0/src/mcp_automl/__init__.py
File without changes
|