mcp-automl 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,18 @@
1
+ # Python-generated files
2
+ __pycache__/
3
+ *.py[oc]
4
+ build/
5
+ dist/
6
+ wheels/
7
+ *.egg-info
8
+
9
+ # Virtual environments
10
+ .venv
11
+
12
+ *.log
13
+
14
+ # Data
15
+ catboost_info/
16
+ models/
17
+ *.csv
18
+ *.parquet
@@ -0,0 +1 @@
1
+ 3.11
@@ -0,0 +1,90 @@
1
+ Metadata-Version: 2.4
2
+ Name: mcp-automl
3
+ Version: 0.1.0
4
+ Summary: MCP server for end-to-end machine learning
5
+ Requires-Python: >=3.11
6
+ Requires-Dist: duckdb>=1.4.3
7
+ Requires-Dist: joblib<1.4
8
+ Requires-Dist: mcp>=1.21.2
9
+ Requires-Dist: pandas<2.2.0
10
+ Requires-Dist: pycaret>=3.0.0
11
+ Requires-Dist: scikit-learn<1.4
12
+ Requires-Dist: tabulate>=0.9.0
13
+ Description-Content-Type: text/markdown
14
+
15
+ # MCP AutoML
16
+
17
+ MCP AutoML is a server that enables AI Agents to perform end-to-end machine learning workflows, including data inspection, processing, and model training. With MCP AutoML, AI Agents can do more than a typical AutoML framework: they can identify the target, set a baseline, and create features by themselves.
18
+
19
+ MCP AutoML separates tools from workflows, allowing you to create your own workflow.
20
+
21
+ ## Features
22
+
23
+ - **Data Inspection**: Analyze datasets with comprehensive statistics, data types, and previews
24
+ - **SQL-based Data Processing**: Transform and engineer features using DuckDB SQL queries
25
+ - **AutoML Training**: Train classification and regression models with automatic model comparison using PyCaret
26
+ - **Prediction**: Make predictions using trained models
27
+ - **Multi-format Support**: Works with CSV, Parquet, and JSON files
28
+
29
+ ## Usage
30
+
31
+ ### Configure MCP Server
32
+
33
+ Add to your MCP client configuration (e.g., Claude Desktop, Gemini CLI, Cursor, Antigravity):
34
+
35
+ ```json
36
+ {
37
+ "mcpServers": {
38
+ "mcp-automl": {
39
+ "command": "uvx",
40
+ "args": ["--from", "git+https://github.com/idea7766/mcp-automl", "mcp-automl"]
41
+ }
42
+ }
43
+ }
44
+ ```
45
+
46
+ ### Available Tools
47
+
48
+ | Tool | Description |
49
+ |------|-------------|
50
+ | `inspect_data` | Get comprehensive statistics and preview of a dataset |
51
+ | `query_data` | Execute DuckDB SQL queries on data files |
52
+ | `process_data` | Transform data using SQL and save to a new file |
53
+ | `train_classifier` | Train a classification model with AutoML |
54
+ | `train_regressor` | Train a regression model with AutoML |
55
+ | `predict` | Make predictions using a trained model |
56
+
57
+ ## Agent Skill
58
+
59
+ MCP AutoML includes a **data science workflow skill** that guides AI agents through best practices for machine learning projects. This skill teaches agents to:
60
+
61
+ - Identify targets and establish baselines
62
+ - Perform exploratory data analysis
63
+ - Engineer domain-specific features
64
+ - Train and evaluate models systematically
65
+
66
+ ### Installing the Skill
67
+
68
+ Copy the skill directory to your agent's skill folder:
69
+
70
+ ```bash
71
+ # For Gemini Code Assist
72
+ cp -r skill/data-science-workflow ~/.gemini/skills/
73
+
74
+ # For Claude Code
75
+ cp -r skill/data-science-workflow ~/.claude/skills/
76
+
77
+ # For other agents, copy to their respective skill directories
78
+ ```
79
+
80
+ The skill file is located at `skill/data-science-workflow/SKILL.md`.
81
+
82
+ ## Configuration
83
+
84
+ Models and experiments are saved to `~/.mcp-automl/experiments/` by default.
85
+
86
+ ## Dependencies
87
+
88
+ - [PyCaret](https://pycaret.org/) - AutoML library
89
+ - [DuckDB](https://duckdb.org/) - Fast SQL analytics
90
+ - [MCP](https://github.com/modelcontextprotocol/python-sdk) - Model Context Protocol SDK
@@ -0,0 +1,76 @@
1
+ # MCP AutoML
2
+
3
+ MCP AutoML is a server that enables AI Agents to perform end-to-end machine learning workflows, including data inspection, processing, and model training. With MCP AutoML, AI Agents can do more than a typical AutoML framework: they can identify the target, set a baseline, and create features by themselves.
4
+
5
+ MCP AutoML separates tools from workflows, allowing you to create your own workflow.
6
+
7
+ ## Features
8
+
9
+ - **Data Inspection**: Analyze datasets with comprehensive statistics, data types, and previews
10
+ - **SQL-based Data Processing**: Transform and engineer features using DuckDB SQL queries
11
+ - **AutoML Training**: Train classification and regression models with automatic model comparison using PyCaret
12
+ - **Prediction**: Make predictions using trained models
13
+ - **Multi-format Support**: Works with CSV, Parquet, and JSON files
14
+
15
+ ## Usage
16
+
17
+ ### Configure MCP Server
18
+
19
+ Add to your MCP client configuration (e.g., Claude Desktop, Gemini CLI, Cursor, Antigravity):
20
+
21
+ ```json
22
+ {
23
+ "mcpServers": {
24
+ "mcp-automl": {
25
+ "command": "uvx",
26
+ "args": ["--from", "git+https://github.com/idea7766/mcp-automl", "mcp-automl"]
27
+ }
28
+ }
29
+ }
30
+ ```
31
+
32
+ ### Available Tools
33
+
34
+ | Tool | Description |
35
+ |------|-------------|
36
+ | `inspect_data` | Get comprehensive statistics and preview of a dataset |
37
+ | `query_data` | Execute DuckDB SQL queries on data files |
38
+ | `process_data` | Transform data using SQL and save to a new file |
39
+ | `train_classifier` | Train a classification model with AutoML |
40
+ | `train_regressor` | Train a regression model with AutoML |
41
+ | `predict` | Make predictions using a trained model |
42
+
43
+ ## Agent Skill
44
+
45
+ MCP AutoML includes a **data science workflow skill** that guides AI agents through best practices for machine learning projects. This skill teaches agents to:
46
+
47
+ - Identify targets and establish baselines
48
+ - Perform exploratory data analysis
49
+ - Engineer domain-specific features
50
+ - Train and evaluate models systematically
51
+
52
+ ### Installing the Skill
53
+
54
+ Copy the skill directory to your agent's skill folder:
55
+
56
+ ```bash
57
+ # For Gemini Code Assist
58
+ cp -r skill/data-science-workflow ~/.gemini/skills/
59
+
60
+ # For Claude Code
61
+ cp -r skill/data-science-workflow ~/.claude/skills/
62
+
63
+ # For other agents, copy to their respective skill directories
64
+ ```
65
+
66
+ The skill file is located at `skill/data-science-workflow/SKILL.md`.
67
+
68
+ ## Configuration
69
+
70
+ Models and experiments are saved to `~/.mcp-automl/experiments/` by default.
71
+
72
+ ## Dependencies
73
+
74
+ - [PyCaret](https://pycaret.org/) - AutoML library
75
+ - [DuckDB](https://duckdb.org/) - Fast SQL analytics
76
+ - [MCP](https://github.com/modelcontextprotocol/python-sdk) - Model Context Protocol SDK
@@ -0,0 +1,31 @@
1
+ [project]
2
+ name = "mcp-automl"
3
+ version = "0.1.0"
4
+ description = "MCP server for end-to-end machine learning"
5
+ readme = "README.md"
6
+ requires-python = ">=3.11"
7
+ dependencies = [
8
+ "duckdb>=1.4.3",
9
+ "joblib<1.4",
10
+ "mcp>=1.21.2",
11
+ "pandas<2.2.0",
12
+ "pycaret>=3.0.0",
13
+ "scikit-learn<1.4",
14
+ "tabulate>=0.9.0",
15
+ ]
16
+
17
+ [project.scripts]
18
+ mcp-automl = "mcp_automl.server:main"
19
+
20
+ [build-system]
21
+ requires = ["hatchling"]
22
+ build-backend = "hatchling.build"
23
+
24
+ [tool.uv]
25
+ package = true
26
+
27
+ [dependency-groups]
28
+ dev = [
29
+ "pytest-asyncio>=1.3.0",
30
+ "pyarrow>=14.0.0",
31
+ ]
@@ -0,0 +1,387 @@
1
+ ---
2
+ name: data-science-workflow
3
+ description: The primary workflow for all data science projects. Use this skill whenever a user asks to train a model, build a model, perform analysis, or do analytics. It autonomously orchestrates the full pipeline (data inspection, cleaning, feature engineering, and AutoML training) to deliver the best possible results.
4
+ ---
5
+
6
+ # LLM Usage Guide: Production Data Science Workflow (Consultative)
7
+
8
+ This guide outlines how to handle user requests for model training, especially when instructions are vague (e.g., "Train a model on data.csv").
9
+
10
+ ---
11
+
12
+ ## Universal Workflow Principles
13
+
14
+ ### 1. Documentation First
15
+ Always check for and read documentation files before inspecting data. Column descriptions prevent incorrect assumptions.
16
+
17
+ ### 2. Scale Awareness
18
+ Check data size before processing or training. Large datasets require sampling for efficient iteration.
19
+
20
+ ### 3. Transparency
21
+ Communicate your understanding, assumptions, and plan to the user before executing. Allow them to correct course early.
22
+
23
+ ### 4. Iterative Refinement
24
+ Use smaller samples for development iterations. Reserve full data for final model training only.
25
+
26
+ ### 5. Preserve Provenance
27
+ When creating processed files, use clear naming that indicates source, processing applied, and sample size.
28
+ Example: `train_processed_10k_sample.parquet`
29
+
30
+ ### 6. No NaN/Inf in Features
31
+ Never create features that produce NaN or Inf values. Common pitfalls and fixes (see the sketch after this list):
32
+ - **Division**: Always use `NULLIF()` → `a / NULLIF(b, 0)`
33
+ - **Log/Sqrt**: Guard against zero/negative → `LOG(GREATEST(x, 1))`, `SQRT(GREATEST(x, 0))`
34
+ - **Missing propagation**: Use `COALESCE()` → `COALESCE(a, 0) + COALESCE(b, 0)`
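+
+ A minimal DuckDB sketch combining these guards (the column and file names are hypothetical):
+
+ ```sql
+ SELECT
+   *,
+   -- Guarded ratio: NULLIF avoids division by zero
+   loan_amount / NULLIF(duration, 0)             AS monthly_burden,
+   -- Guarded log: GREATEST avoids log of zero or negatives
+   LOG(GREATEST(income, 1))                      AS log_income,
+   -- Guarded sum: COALESCE stops NULLs from propagating
+   COALESCE(savings, 0) + COALESCE(checking, 0)  AS total_balance
+ FROM 'train.csv'
+ ```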
35
+
36
+ ---
37
+
38
+ ## Phase 0: Initial Triage (The "Vague Request" Handler)
39
+ **Trigger**: User provides data but no specific instructions.
40
+
41
+ 1. **Inspect First**: ALWAYS call `inspect_data(data_path)` immediately to understand the table structure. If there are multiple files, inspect all of them unless, after reading the documentation, you are confident a file is not relevant.
42
+ 2. **Identify Target**:
43
+ - *Confident*: If there is an obvious target (e.g., "churn", "target", "price", "species"), **assume it** and state your assumption.
44
+ - *Ambiguous*: If multiple columns could be targets, **ASK the user**. ("I see 'price' and 'quantity'. Which one are we predicting?")
45
+ 3. **Determine Goal (consultative)**:
46
+ - *Confident*: If the target implies the goal (e.g., "fraud" -> minimize missed fraud cases), suggest the appropriate metric (Recall/Precision).
47
+ - *Ambiguous*: Ask for the business outcome. ("Are we trying to minimize missing fraud cases, or minimize false alarms?")
48
+
49
+ ---
50
+
51
+ ## Phase 0.5: Dataset Discovery
52
+ **Trigger**: Dataset is a directory (not a single file), OR any file > 50MB.
53
+
54
+ ### Step 1: Read Documentation First (MANDATORY)
55
+ Before ANY data inspection, search the directory for documentation:
56
+ - README files: `README`, `README.md`, `README.txt`
57
+ - Description files: Any file containing "description", "metadata", "schema", "dictionary"
58
+ - Data dictionaries: `.json`, `.yaml`, `.txt` files that aren't data
59
+
60
+ **Why**: Documentation explains table relationships, column meanings, and intended use cases. Skipping this leads to incorrect assumptions.
61
+
62
+ ### Step 2: Inventory All Data Files
63
+ List all data files and check their sizes/row counts:
64
+ ```sql
65
+ SELECT COUNT(*) as rows FROM 'filename.csv'
66
+ ```
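+
+ For directories with many CSV files, a single glob query can report row counts per file (a sketch; the directory name is hypothetical, and `union_by_name` tolerates differing schemas):
+
+ ```sql
+ SELECT filename, COUNT(*) AS rows
+ FROM read_csv_auto('dataset_dir/*.csv', filename = true, union_by_name = true)
+ GROUP BY filename
+ ORDER BY rows DESC
+ ```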
67
+
68
+ Categorize files into:
69
+ - **Primary table**: Contains the target variable (usually `train`, `main`, or similar naming)
70
+ - **Auxiliary tables**: Related data that can be aggregated (transactions, history, logs)
71
+ - **Test/submission files**: Held-out data for final predictions
72
+
73
+ ### Step 3: Assess Scale & Plan Accordingly
74
+ | Data Scale | Definition | Required Action |
75
+ |------------|------------|-----------------|
76
+ | Small | < 50K rows | Proceed normally |
77
+ | Medium | 50K - 200K rows | Recommend sampling for development |
78
+ | Large | > 200K rows | **Require** sampling; inform user |
79
+
80
+ **Sampling Strategy**: Create a stratified sample preserving target distribution:
81
+ ```sql
82
+ SELECT * FROM data
83
+ ORDER BY RANDOM()
84
+ LIMIT [10-20% of original, max 50K rows]
85
+ ```
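+
+ The query above draws a simple random sample. To preserve the target distribution more faithfully, a per-class variant can be used (a sketch, assuming a column named `target` and a hypothetical source file; adjust the 0.1 fraction as needed):
+
+ ```sql
+ SELECT * EXCLUDE (rnd, rn, class_cnt)
+ FROM (
+     SELECT *,
+            ROW_NUMBER() OVER (PARTITION BY target ORDER BY rnd) AS rn,
+            COUNT(*)     OVER (PARTITION BY target)              AS class_cnt
+     FROM (SELECT *, RANDOM() AS rnd FROM 'train.csv')
+ )
+ WHERE rn <= CEIL(class_cnt * 0.1)  -- keep roughly 10% of each class
+ ```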
86
+
87
+ ### Step 4: Multi-Table Strategy
88
+ If multiple related tables exist:
89
+ 1. **Identify join keys** (from documentation or column inspection)
90
+ 2. **Plan aggregations**: How to summarize auxiliary tables to join with primary
91
+ 3. **Communicate plan** to user before executing
92
+
93
+ ### Step 5: Confirm Scope with User
94
+ Before proceeding, state:
95
+ > "Dataset contains [X files, Y total rows]. I plan to:
96
+ > - Use [primary_table] as the main dataset
97
+ > - [Sample to N rows / Use full data]
98
+ > - [Aggregate features from auxiliary tables / Use primary only]
99
+ >
100
+ > Proceed?"
101
+
102
+ ---
103
+
104
+ ## Phase 1: Project Definition
105
+ **Goal**: Lock down success criteria and establish a naive baseline before training.
106
+
107
+ ### Check:
108
+ - **Problem Type**: Classification vs Regression
109
+ - **Primary Metric**: Choose based on business goal:
110
+ - Safety-critical (fraud, medical) → `Recall`
111
+ - Cost-sensitive (marketing, sales) → `Precision`
112
+ - Balanced → `F1`
113
+ - Regression → `R2` or `MAE`
114
+
115
+ ### Establish Naive Baseline
116
+ Use `query_data` to calculate a baseline that any useful model must beat:
117
+
118
+ **For Classification** (majority class baseline):
119
+ ```sql
120
+ -- Class distribution
121
+ SELECT target, COUNT(*) as count,
122
+ ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 1) as pct
123
+ FROM data GROUP BY target
124
+ ```
125
+ → Baseline accuracy = largest class percentage (e.g., if 70% are "good", baseline = 70%)
126
+
127
+ **For Regression** (mean prediction baseline):
128
+ ```sql
129
+ SELECT
130
+ AVG(target) as mean_baseline,
131
+ AVG(ABS(target - (SELECT AVG(target) FROM data))) as baseline_MAE
132
+ FROM data
133
+ ```
134
+ → A model must have lower MAE than predicting the mean for every sample.
135
+
136
+ ### Document Baseline to User
137
+ State clearly:
138
+ > "Naive baseline (always predicting majority class 'good'): 70% accuracy. Our model must exceed this to add value."
139
+
140
+ ---
141
+
142
+ ## Phase 2: EDA (Deep Dive)
143
+ **Goal**: Inspect data quality to inform training parameters AND feature engineering opportunities.
144
+
145
+ ### Checklist:
146
+ 1. **Skewness**: Use `query_data` to check `AVG(col)` vs `MEDIAN(col)`. -> If high skew, set `transformation=True`.
147
+ 2. **Ordinality**: Check for inherent order in categories (e.g., "Low/Med/High", "Junior/Senior", rating scales). -> Map to `ordinal_features`.
148
+ 3. **Missingness**:
149
+ - *Simple*: If random/small, use `numeric_imputation` (mean/median) or `categorical_imputation` (mode) params.
150
+ - *Complex*: If structural/logic-based, use `process_data`.
151
+ 4. **Class Imbalance**: Check target distribution. If moderate (70-30), may not need `fix_imbalance`. If extreme (95-5), likely beneficial.
152
+ 5. **Outliers**: Check for extreme values in numeric columns (use `query_data` with MIN/MAX/STDDEV). -> Consider `remove_outliers=True`.
153
+ 6. **Feature Relationships**: Look for potential interactions (e.g., credit_amount & duration -> monthly_payment).
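+
+ Checks 1 and 5 can usually be done with a single `query_data` call per numeric column (a sketch; `credit_amount` and the file name are hypothetical):
+
+ ```sql
+ SELECT
+   AVG(credit_amount)    AS mean_val,    -- mean much larger than median suggests right skew
+   MEDIAN(credit_amount) AS median_val,
+   MIN(credit_amount)    AS min_val,
+   MAX(credit_amount)    AS max_val,     -- compare extremes against mean +/- 3 * stddev
+   STDDEV(credit_amount) AS std_val
+ FROM 'train.csv'
+ ```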
154
+
155
+ ---
156
+
157
+ ## Phase 2.5: Domain Research & Feature Engineering 🔬
158
+ **Goal**: Leverage domain knowledge to create high-value features.
159
+
160
+ ### When to Apply:
161
+ - **Always consider** for any non-trivial dataset
162
+ - **Strongly recommended** when baseline model performance is below expectations
163
+ - Skip only for pure exploratory analysis or when time is extremely limited
164
+
165
+ ### Step 1: Identify the Problem Domain
166
+
167
+ Look at column names and target variable to identify the domain:
168
+
169
+ | Column Indicators | Likely Domain |
170
+ |-------------------|---------------|
171
+ | amount, duration, payment, credit, loan | Financial/Credit Risk |
172
+ | churn, subscription, tenure, contract | Customer Churn |
173
+ | price, sales, inventory, demand | Retail/E-commerce |
174
+ | diagnosis, symptoms, age, medication | Healthcare/Medical |
175
+ | latitude, longitude, distance, location | Geographic/Spatial |
176
+ | timestamp, date, hour, day_of_week | Time Series/Temporal |
177
+
178
+ ### Step 2: Research Domain Best Practices
179
+
180
+ (REQUIRED) Search web for feature engineering patterns for the identified domain:
181
+
182
+ **Search Queries** (use 2-3 of these):
183
+ - `"[domain] machine learning feature engineering best practices"`
184
+ - `"[domain] [problem_type] important features"`
185
+
186
+ **What to Look For**:
187
+ - Common ratios/interactions used by practitioners
188
+ - Domain-specific KPIs or business metrics
189
+ - Regulatory/compliance considerations
190
+
191
+ ### Step 3: Apply Feature Engineering
192
+
193
+ Based on domain research and data inspection, create features using `process_data`. Common techniques:
194
+
195
+ 1. **Ratios & Intensities**: Divide related numeric features (e.g., total/count, amount/duration)
196
+ 2. **Binning**: Group continuous variables into meaningful categories
197
+ 3. **Aggregations**: If multiple rows per entity, create sum/mean/max/min/count
198
+ 4. **Interactions**: Multiply/combine features that work together
199
+ 5. **Business Logic Flags**: Create binary indicators based on domain rules
200
+
201
+ Remember: Always apply safe patterns from Principle #6 (No NaN/Inf) when creating ratio or derived features.
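+
+ As an illustration of technique 3 (aggregations), a sketch of summarizing an auxiliary table and joining the result onto the primary table (all table and column names here are hypothetical):
+
+ ```sql
+ SELECT
+   p.*,
+   COALESCE(t.txn_count, 0)   AS txn_count,     -- COALESCE: entities with no history get 0, not NULL
+   COALESCE(t.total_spend, 0) AS total_spend
+ FROM 'customers.csv' p
+ LEFT JOIN (
+     SELECT customer_id,
+            COUNT(*)    AS txn_count,
+            SUM(amount) AS total_spend
+     FROM 'transactions.csv'
+     GROUP BY customer_id
+ ) t ON p.customer_id = t.customer_id
+ ```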
202
+
203
+ ### Step 4: Document Your Reasoning
204
+
205
+ For each engineered feature, explain to the user:
206
+ - **What**: Name and formula
207
+ - **Why**: Business rationale
208
+ - **Source**: If from research, cite it
209
+
210
+ ---
211
+
212
+ ## Phase 3: Data Processing with Feature Engineering
213
+ **Goal**: Create a reliable, enriched dataset (Parquet format).
214
+
215
+ ### Action:
216
+ Use `process_data` with a comprehensive SQL query that:
217
+ 1. **CAST types explicitly** (all numeric columns to INTEGER/FLOAT)
218
+ 2. **Create engineered features** from Phase 2.5 research
219
+ 3. **Handle missing values** (if complex logic needed)
220
+ 4. **Save as `.parquet`** (strongly recommended over CSV for type preservation)
221
+
222
+ ### Transparency Rule:
223
+ You **MUST** show the full SQL query to the user with comments explaining each engineered feature's business rationale.
224
+
225
+ ### Quality Check:
226
+ After processing, call `inspect_data` on the new file to verify:
227
+ - All types are correct (no accidental strings for numeric columns)
228
+ - New features have reasonable value ranges
229
+ - No unexpected missing values or infinite values introduced
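+
+ Beyond `inspect_data`, a targeted `query_data` check can confirm the last point for an engineered column (a sketch; the column and file names are hypothetical):
+
+ ```sql
+ SELECT
+   COUNT(*)                                              AS total_rows,
+   COUNT(*) FILTER (WHERE monthly_burden IS NULL)        AS null_rows,
+   COUNT(*) FILTER (WHERE NOT isfinite(monthly_burden))  AS non_finite_rows,
+   MIN(monthly_burden)                                   AS min_val,
+   MAX(monthly_burden)                                   AS max_val
+ FROM 'train_processed.parquet'
+ ```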
230
+
231
+ ---
232
+
233
+ ## Phase 4: Model Training
234
+ **Goal**: Train an optimized model with all insights from EDA and feature engineering.
235
+
236
+ ### Key Parameters:
237
+ - `optimize`: The metric agreed in Phase 1 (Recall/Precision/F1/R2/MAE)
238
+ - `ordinal_features`: **Critical** - Map all ordinal categories with proper ordering
239
+ - `fold=5`: For faster iteration (use `fold=10` for final validation)
240
+ - `session_id=42`: For reproducibility
241
+
242
+ ### Model Selection Parameters:
243
+ - `include_models`: Train only specific models (faster, good for baselines). Examples:
244
+ - Quick baseline: `include_models=['dt']` (Decision Tree, ~30 seconds)
245
+ - Fast ensemble: `include_models=['rf', 'lightgbm', 'xgboost']`
246
+ - Linear baseline: `include_models=['lr']`
247
+ - `exclude_models`: Exclude slow or problematic models. Examples:
248
+ - Skip GPU-requiring: `exclude_models=['catboost']`
249
+ - Skip slow models: `exclude_models=['catboost', 'xgboost']`
250
+
251
+ **Common Model IDs**:
252
+ | Classification | Regression |
253
+ |---------------|------------|
254
+ | `lr` (Logistic Regression) | `lr` (Linear Regression) |
255
+ | `dt` (Decision Tree) | `dt` (Decision Tree) |
256
+ | `rf` (Random Forest) | `rf` (Random Forest) |
257
+ | `xgboost`, `lightgbm`, `catboost` | `xgboost`, `lightgbm`, `catboost` |
258
+ | `knn`, `nb`, `svm` | `ridge`, `lasso`, `en` |
259
+
260
+ ### Conditional Parameters (based on EDA):
261
+ - `transformation=True`: If skewed distributions detected in Phase 2
262
+ - `normalize=True`: Recommended for linear/distance-based models (not needed for tree-based)
263
+ - `polynomial_features=True`: Generally beneficial, low risk
264
+ - `fix_imbalance=True`: Only if extreme imbalance (>80:20) detected in Phase 2. Use with `numeric_imputation` and `categorical_imputation` parameters.
265
+ - `remove_outliers=True`: If extreme outliers detected in Phase 2
266
+
267
+ ### Training Output:
268
+ The training result includes:
269
+ - **`metadata`**: CV metrics for the best model
270
+ - **`test_metrics`**: Holdout set performance
271
+ - **`feature_importances`**: Dict of `{feature_name: importance}` sorted by importance (descending)
272
+ - Available for tree-based models (RF, XGBoost, LightGBM, etc.) and linear models
273
+ - Use this to understand which features drive predictions
274
+
275
+ ### Speed Tips:
276
+ - For quick baseline: `include_models=['dt']` (~30 seconds)
277
+ - For fast iteration: `include_models=['rf', 'lightgbm']` (~2 minutes)
278
+ - ⏱️ Expect 3-10 min for full model comparison with ~10K rows
279
+
280
+ ### Document:
281
+ Report top 3 models with their metrics to user in table format.
282
+
283
+ ---
284
+
285
+ ## Phase 5: Evaluation & Comparison
286
+ **Goal**: Contextualize results against the naive baseline and select best model.
287
+
288
+ ### Comparison Table (Include Baseline):
289
+ Always show the naive baseline from Phase 1 for context:
290
+
291
+ | Config | Model | Accuracy | Recall | Precision | F1 | vs Baseline |
292
+ |--------|-------|----------|--------|-----------|-----|-------------|
293
+ | Naive guess | — | 70.0% | — | — | — | — |
294
+ | Trained model | CatBoost | 77.3% | 75.6% | 74.3% | 74.9% | +7.3 pts ✅ |
295
+
296
+ ### Interpret Results:
297
+ - Quantify improvement over naive baseline ("+7.3 percentage points", "10.4% relative improvement")
298
+ - Translate to business impact ("Catches 73 more bad credits out of 1000")
299
+
300
+ ### Success Check:
301
+ Does the best model meet the Phase 1 Success Criteria?
302
+ - ✅ If yes: Proceed to Transparency Report, provide model_id and path
303
+ - ❌ If no: Proceed to Phase 6 (Iteration)
304
+
305
+ ---
306
+
307
+ ## Phase 6: Iteration (If Needed)
308
+ **Goal**: Systematically improve model performance.
309
+
310
+ If performance is still below target:
311
+
312
+ ### 1. Analyze Feature Importances
313
+ Review `feature_importances` from the training result:
314
+ - **High importance features**: Focus engineering efforts here (create interactions, better encodings)
315
+ - **Low/zero importance features**: Consider removing to reduce noise
316
+ - **Missing domain features**: If expected important features aren't showing up, check for data issues
317
+
318
+ Example iteration workflow:
319
+ ```
320
+ 1. Train baseline: include_models=['rf'] → get feature_importances
321
+ 2. Identify top 5 features, engineer interactions between them
322
+ 3. Retrain with engineered features
323
+ 4. Compare improvement
324
+ ```
325
+
326
+ ### 2. Feature Engineering Improvements
327
+ - Create interactions between top important features
328
+ - Try different encodings for high-importance categoricals
329
+ - Bin numeric features that show non-linear relationships
330
+
331
+ ### 3. Model Selection
332
+ - If linear models perform poorly: Focus on tree-based (`include_models=['rf', 'xgboost', 'lightgbm']`)
333
+ - If tree models overfit: Try regularized linear models (`include_models=['ridge', 'lasso']`)
334
+
335
+ ### 4. Other Strategies
336
+ - **Try Different Encodings**: Test WoE transformation for categorical features
337
+ - **Ensemble Methods**: Combine multiple models
338
+ - **Collect More Data**: Sometimes this is the only solution
339
+ - **Revisit Metric Choice**: Confirm we're optimizing the right business objective
340
+
341
+ **Communicate Progress**: Keep user informed of iterations and trade-offs.
342
+
343
+ ---
344
+
345
+ ## Phase 7: Transparency Report (MANDATORY)
346
+ **Goal**: Provide complete reproducibility documentation.
347
+
348
+ After training is complete, you **MUST** provide a summary that enables the user to reproduce your work.
349
+
350
+ ### Feature Engineering Decisions:
351
+ List each engineered feature with rationale:
352
+
353
+ | Feature | Source | Rationale |
354
+ |---------|--------|-----------|
355
+ | monthly_burden | loan_amount / duration | Captures repayment intensity |
356
+ | has_prior_default | Aggregated from history table | Strong risk indicator |
357
+
358
+ ### Data Processing Query:
359
+ Provide the complete SQL query used in `process_data`:
360
+
361
+ ```sql
362
+ -- Full query used to create training dataset
363
+ SELECT
364
+ id,
365
+ target,
366
+ -- Original features (cast to correct types)
367
+ CAST(age AS INTEGER) as age,
368
+ CAST(income AS FLOAT) as income,
369
+ -- Engineered features
370
+ ROUND(CAST(loan_amount AS FLOAT) / NULLIF(CAST(duration AS FLOAT), 0), 2) as monthly_burden, -- Repayment capacity
371
+ CASE WHEN employment_years < 1 THEN 'new'
372
+ WHEN employment_years < 5 THEN 'mid'
373
+ ELSE 'established' END as employment_category -- Stability indicator
374
+ FROM 'source_data.csv'
375
+ WHERE target IS NOT NULL
376
+ ```
377
+
378
+ ### Model Configuration:
379
+ Document the final training parameters used:
380
+ - Metric optimized
381
+ - Ordinal feature mappings
382
+ - Special flags enabled (transformation, fix_imbalance, etc.)
383
+
384
+ This enables the user to:
385
+ 1. Understand the reasoning behind each decision
386
+ 2. Modify and re-run the data processing independently
387
+ 3. Reproduce the exact model training configuration
File without changes
@@ -0,0 +1,4 @@
1
+ from mcp_automl.server import main
2
+
3
+ if __name__ == "__main__":
4
+ main()