upgini 1.2.122a4__py3-none-any.whl → 1.2.146a4__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- upgini/__about__.py +1 -1
- upgini/autofe/binary.py +4 -3
- upgini/data_source/data_source_publisher.py +1 -9
- upgini/dataset.py +56 -6
- upgini/features_enricher.py +639 -561
- upgini/http.py +2 -2
- upgini/metadata.py +19 -3
- upgini/normalizer/normalize_utils.py +6 -6
- upgini/resource_bundle/strings.properties +15 -11
- upgini/search_task.py +14 -2
- upgini/utils/base_search_key_detector.py +5 -1
- upgini/utils/datetime_utils.py +125 -39
- upgini/utils/deduplicate_utils.py +8 -5
- upgini/utils/display_utils.py +61 -20
- upgini/utils/feature_info.py +18 -7
- upgini/utils/features_validator.py +6 -4
- upgini/utils/postal_code_utils.py +35 -2
- upgini/utils/target_utils.py +3 -1
- upgini/utils/track_info.py +29 -1
- {upgini-1.2.122a4.dist-info → upgini-1.2.146a4.dist-info}/METADATA +123 -121
- {upgini-1.2.122a4.dist-info → upgini-1.2.146a4.dist-info}/RECORD +23 -23
- {upgini-1.2.122a4.dist-info → upgini-1.2.146a4.dist-info}/WHEEL +1 -1
- {upgini-1.2.122a4.dist-info → upgini-1.2.146a4.dist-info}/licenses/LICENSE +0 -0
@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: upgini
-Version: 1.2.122a4
+Version: 1.2.146a4
 Summary: Intelligent data search & enrichment for Machine Learning
 Project-URL: Bug Reports, https://github.com/upgini/upgini/issues
 Project-URL: Homepage, https://upgini.com/
@@ -30,9 +30,11 @@ Requires-Dist: ipywidgets>=8.1.0
 Requires-Dist: jarowinkler>=2.0.0
 Requires-Dist: levenshtein>=0.25.1
 Requires-Dist: lightgbm>=4.6.0
+Requires-Dist: more-itertools==10.7.0
 Requires-Dist: numpy<3.0.0,>=1.19.0
 Requires-Dist: pandas<3.0.0,>=1.1.0
 Requires-Dist: psutil>=5.9.0
+Requires-Dist: pyarrow==18.1.0
 Requires-Dist: pydantic<3.0.0,>1.0.0
 Requires-Dist: pyjwt>=2.8.0
 Requires-Dist: python-bidi==0.4.2
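The two new dependencies above are pinned to exact versions, unlike the ranged pins around them. As an illustrative sketch (this helper is hypothetical, not part of upgini), METADATA `Requires-Dist` lines can be split into package name and version specifier with a small regex:

```python
import re

def parse_requires_dist(line: str) -> tuple[str, str]:
    """Split a METADATA 'Requires-Dist:' line into (package, specifier)."""
    body = line.split(":", 1)[1].strip()
    match = re.match(r"([A-Za-z0-9._-]+)\s*(.*)", body)
    return match.group(1), match.group(2)

# The two dependencies added in this release, pinned to exact versions:
print(parse_requires_dist("Requires-Dist: more-itertools==10.7.0"))
print(parse_requires_dist("Requires-Dist: pyarrow==18.1.0"))
# A ranged pin from the unchanged context lines:
print(parse_requires_dist("Requires-Dist: numpy<3.0.0,>=1.19.0"))
```

Exact pins (`==`) make the dependency set reproducible but can conflict with other packages in the same environment; that trade-off is worth noting when reviewing this hunk.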
@@ -50,7 +52,7 @@ Description-Content-Type: text/markdown
 <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : Free automated data enrichment library for machine learning: </br>only the accuracy improving features in 2 minutes </h2> -->
 <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> • Free production-ready automated data enrichment library for machine learning</h2>-->
 <h2 align="center"> <a href="https://upgini.com/">Upgini • Intelligent data search & enrichment for Machine Learning and AI</a></h2>
-<p align="center"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community and premium external data sources, </br>including open & commercial LLMs</b> </p>
+<p align="center"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community, and premium external data sources, </br>including open & commercial LLMs</b> </p>
 <p align="center">
 <br />
 <a href="https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb"><strong>Quick Start in Colab »</strong></a> |
@@ -58,7 +60,7 @@ Description-Content-Type: text/markdown
 <a href="https://profile.upgini.com">Register / Sign In</a> |
 <!-- <a href="https://gitter.im/upgini/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link">Gitter Community</a> | -->
 <a href="https://4mlg.short.gy/join-upgini-community">Slack Community</a> |
-<a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose new
+<a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose a new data source</strong></a>
 </p>
 <p align=center>
 <a href="/LICENSE"><img alt="BSD-3 license" src="https://img.shields.io/badge/license-BSD--3%20Clause-green"></a>
@@ -74,19 +76,19 @@ Description-Content-Type: text/markdown
 [](https://gitter.im/upgini/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) -->
 ## ❔ Overview
 
-**Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of
+**Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of ML features using large language models (LLMs), GNNs (graph neural networks), and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data).
 
-**Motivation:** for most supervised ML models external data & features boost accuracy significantly better than any hyperparameters tuning. But lack of automated and time-efficient enrichment tools for external data blocks massive adoption of external features in ML pipelines. We want radically simplify
+**Motivation:** for most supervised ML models external data & features boost accuracy significantly better than any hyperparameters tuning. But lack of automated and time-efficient enrichment tools for external data blocks massive adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment to make external data a standard approach. Like hyperparameter tuning in machine learning today.
 
 **Mission:** Democratize access to data sources for data science community.
 
 ## 🚀 Awesome features
-⭐️ Automatically find only relevant features that *
-⭐️ Automated feature generation from the sources: feature generation with
-⭐️ Automatic search key augmentation from all connected sources. If you do not have all search keys in your search request, such as postal/
-⭐️ Calculate accuracy metrics and
-⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate risks of unstable external data dependencies in ML pipeline
-⭐️ Easy to use - single request to enrich training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
+⭐️ Automatically find only relevant features that *improve your model’s accuracy*. Not just correlated with the target variable, which in 9 out of 10 cases yields zero accuracy improvement
+⭐️ Automated feature generation from the sources: feature generation with LLM‑based data augmentation, RNNs, and GraphNNs; ensembling across multiple data sources
+⭐️ Automatic search key augmentation from all connected sources. If you do not have all search keys in your search request, such as postal/ZIP code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources
+⭐️ Calculate accuracy metrics and uplift after enriching an existing ML model with external features
+⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline
+⭐️ Easy to use - a single request to enrich the training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
 <table>
 <tr>
 <td> date / datetime </td>
@@ -102,7 +104,7 @@ Description-Content-Type: text/markdown
 </tr>
 </table>
 
-⭐️ Scikit-learn
+⭐️ Scikit-learn-compatible interface for quick data integration with existing ML pipelines
 ⭐️ Support for most common supervised ML tasks on tabular data:
 <table>
 <tr>
@@ -111,7 +113,7 @@ Description-Content-Type: text/markdown
 </tr>
 <tr>
 <td><a href="https://en.wikipedia.org/wiki/Regression_analysis">☑️ regression</a></td>
-<td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time
+<td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time-series prediction</a></td>
 </tr>
 </table>
 
@@ -123,13 +125,13 @@ Description-Content-Type: text/markdown
 
 ## 🌎 Connected data sources and coverage
 
-- **Public data
-- **Community
+- **Public data**: public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
+- **Community‑shared data**: royalty- or license-free datasets or features from the data science community (our users). This includes both public and scraped data
 - **Premium data providers**: commercial data sources verified by the Upgini team in real-world use cases
 
-👉 [**Details on
+👉 [**Details on datasets and features**](https://upgini.com/#data_sources)
 #### 📊 Total: **239 countries** and **up to 41 years** of history
-|Data sources|Countries|History
+|Data sources|Countries|History (years)|# sources for ensembling|Update frequency|Search keys|API Key required
 |--|--|--|--|--|--|--|
 |Historical weather & Climate normals | 68 |22|-|Monthly|date, country, postal/ZIP code|No
 |Location/Places/POI/Area/Proximity information from OpenStreetMap | 221 |2|-|Monthly|date, country, postal/ZIP code|No
@@ -137,7 +139,7 @@ Description-Content-Type: text/markdown
 |Consumer Confidence index| 44 |22|-|Monthly|date, country|No
 |World economic indicators|191 |41|-|Monthly|date, country|No
 |Markets data|-|17|-|Monthly|date, datetime|No
-|World mobile & fixed
+|World mobile & fixed-broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No
 |World demographic data |90|-|2|Annual|country, postal/ZIP code|No
 |World house prices |44|-|3|Annual|country, postal/ZIP code|No
 |Public social media profile data |104|-|-|Monthly|date, email/HEM, phone |Yes
@@ -152,8 +154,8 @@ Description-Content-Type: text/markdown
 
 ### [Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
 
-* The goal is to predict salary for data science job
-* Following this guide, you'll learn how to **search
+* The goal is to predict salary for a data science job posting based on information about the employer and job description.
+* Following this guide, you'll learn how to **search and auto‑generate new relevant features with the Upgini library**
 * The evaluation metric is [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error).
 
 Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb) inside your browser:
@@ -168,7 +170,7 @@ Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob
 ### ❓ [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
 
 * The goal is to **predict future sales of different goods in stores** based on a 5-year history of sales.
-* Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
+* Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting competition. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
 
 Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) inside your browser:
 
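The SMAPE metric referenced in the sales-forecasting examples can be computed directly; a minimal sketch using the common formulation (mean of |y−ŷ| over the average of |y| and |ŷ|, as a percentage — competition conventions for zero denominators may differ):

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0..200)."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        # conventionally a zero denominator contributes zero error
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100 * sum(terms) / len(terms)

print(round(smape([100, 200, 300], [110, 190, 300]), 2))
```

A perfect forecast scores 0; the metric is bounded above by 200, which makes it convenient for comparing models across series with very different scales.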
@@ -180,25 +182,25 @@ Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini
 [](https://gitpod.io/#/github.com/upgini/upgini)
 -->
 
-### ❓ [How to boost ML model accuracy for Kaggle
+### ❓ [How to boost ML model accuracy for Kaggle Top-1 leaderboard in 15 minutes](https://www.kaggle.com/code/nikupgini/how-to-find-external-data-for-1-private-lb-4-53/notebook)
 
-* The goal is **
-* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting
+* The goal is **to improve a Top‑1 winning Kaggle solution** by adding new relevant external features and data.
+* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
 
 ### ❓ [How to do low-code feature engineering for AutoML tools](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret/notebook)
 
 * **Save time on feature search and engineering**. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box.
 * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
-* Low-code AutoML
+* Low-code AutoML frameworks: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)
 
-### ❓ [How to improve accuracy of Multivariate
+### ❓ [How to improve accuracy of Multivariate time-series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)
 
-* The goal is **accuracy
+* The goal is **to improve the accuracy of multivariate time‑series forecasting** using new relevant external features and data. The main challenge is the data and feature enrichment strategy, in which a component of a multivariate time series depends not only on its past values but also on other components.
 * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [RMSLE](https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days#Note-regarding-calculating-the-average).
 
 ### ❓ [How to speed up feature engineering hypothesis tests with ready-to-use external features](https://www.kaggle.com/code/romaupgini/statement-dates-to-use-or-not-to-use/notebook)
 
-* **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge
+* **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge is the time‑dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, special data enrichment strategy is used.
 * [Kaggle Competition](https://www.kaggle.com/competitions/amex-default-prediction) is a credit default prediction, evaluation metric is [normalized Gini coefficient](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464).
 
 ## 🏁 Quick start
@@ -227,19 +229,19 @@ docker build -t upgini .</i></br>
 <i>
 docker run -p 8888:8888 upgini</br>
 </i></br>
-3. Open http://localhost:8888?token
+3. Open http://localhost:8888?token=<your_token_from_console_output> in your browser
 </details>
 
 
 ### 2. 💡 Use your labeled training dataset for search
 
 You can use your labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
-- **[search keys](#-search-key-types-we-support-more-to-come)** from training dataset to match records from potential data sources with
-- **labels** from training dataset to estimate
-- **your features** from training dataset to find external datasets and features
+- **[search keys](#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
+- **labels** from the training dataset to estimate the relevance of features or datasets for your ML task and calculate feature importance metrics
+- **your features** from the training dataset to find external datasets and features that improve accuracy of your existing data and estimate accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
 
 
-Load training dataset into
+Load the training dataset into a Pandas DataFrame and separate feature columns from the label column in a Scikit-learn way:
 ```python
 import pandas as pd
 # labeled training dataset - customer_churn_prediction_train.csv
@@ -250,7 +252,7 @@ y = train_df["churn_flag"]
 <table border=1 cellpadding=10><tr><td>
 ⚠️ <b>Requirements for search initialization dataset</b>
 <br>
-We
+We perform dataset verification and cleaning under the hood, but still there are some requirements to follow:
 <br>
 1. <b>pandas.DataFrame</b>, <b>pandas.Series</b> or <b>numpy.ndarray</b> representation;
 <br>
@@ -258,12 +260,12 @@ We do dataset verification and cleaning under the hood, but still there are some
 <br>
 3. at least one column selected as a <a href="#-search-key-types-we-support-more-to-come">search key</a>;
 <br>
-4. min size after deduplication by search
+4. min size after deduplication by search-key columns and removal of NaNs: <i>100 records</i>
 </td></tr></table>
 
-### 3. 🔦 Choose one or
-*Search keys* columns will be used to match records from all potential external data sources
-Define one or
+### 3. 🔦 Choose one or more columns as search keys
+*Search keys* columns will be used to match records from all potential external data sources/features.
+Define one or more columns as search keys when initializing the `FeaturesEnricher` class.
 ```python
 from upgini.features_enricher import FeaturesEnricher
 from upgini.metadata import SearchKey
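The minimum-size requirement (100 records after deduplication by search-key columns and NaN removal) is easy to pre-check before submitting a search. An illustrative approximation of that check in plain Python (this is not upgini's actual validation code):

```python
MIN_ROWS = 100  # minimum dataset size stated in the requirements above

def valid_row_count(rows, key_columns):
    """Count rows that survive NaN removal and deduplication by search keys."""
    seen = set()
    for row in rows:
        keys = tuple(row.get(col) for col in key_columns)
        if any(k is None for k in keys):  # drop rows with missing key values
            continue
        seen.add(keys)                    # set membership deduplicates
    return len(seen)

rows = [{"date": "2020-02-12", "country": "US"},
        {"date": "2020-02-12", "country": "US"},   # duplicate by keys
        {"date": None, "country": "US"}]           # missing key -> dropped
count = valid_row_count(rows, ["date", "country"])
print(count, "unique keyed rows;", "ok" if count >= MIN_ROWS else "too small")
```

Running this before `fit` avoids a round trip to the search service just to learn the dataset is below the threshold.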
@@ -283,7 +285,7 @@ enricher = FeaturesEnricher(
 <tr>
 <th> Search Key<br/>Meaning Type </th>
 <th> Description </th>
-<th> Allowed pandas dtypes (
+<th> Allowed pandas dtypes (Python types) </th>
 <th> Example </th>
 </tr>
 <tr>
@@ -300,13 +302,13 @@ enricher = FeaturesEnricher(
 </tr>
 <tr>
 <td> SearchKey.IP </td>
-<td>
-<td> <tt>object(str, ipaddress.IPv4Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
+<td> IPv4 or IPv6 address</td>
+<td> <tt>object(str, ipaddress.IPv4Address, ipaddress.IPv6Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
 <td> <tt>192.168.0.1 </tt> </td>
 </tr>
 <tr>
 <td> SearchKey.PHONE </td>
-<td> phone number
+<td> phone number (<a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a>) </td>
 <td> <tt>object(str)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> <br/> <tt>float64</tt> </td>
 <td> <tt>443451925138 </tt> </td>
 </tr>
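The `SearchKey.IP` row accepts `ipaddress` objects, strings, or `int64` values. A short sketch of how those representations unify with the standard library (assumption: integer values are packed IP addresses, which is what `ipaddress.ip_address` expects for ints):

```python
import ipaddress

def to_ip(value):
    """Normalize str/int/ipaddress values to an IPv4Address or IPv6Address."""
    if isinstance(value, (ipaddress.IPv4Address, ipaddress.IPv6Address)):
        return value
    return ipaddress.ip_address(value)  # accepts both strings and integers

print(to_ip("192.168.0.1"))      # from string
print(to_ip(3232235521))         # packed int -> 192.168.0.1
print(to_ip("2001:db8::1"))      # IPv6 also supported per the new table row
```

This mirrors why the diff widens the allowed types to include `ipaddress.IPv6Address`: `ipaddress.ip_address` already dispatches to the right class for both families.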
@@ -321,7 +323,7 @@ enricher = FeaturesEnricher(
 </td>
 <td>
 <tt>2020-02-12 </tt> (<a href="https://en.wikipedia.org/wiki/ISO_8601">ISO-8601 standard</a>)
-<br/> <tt>12.02.2020 </tt> (non
+<br/> <tt>12.02.2020 </tt> (non‑standard notation)
 </td>
 </tr>
 <tr>
@@ -343,7 +345,7 @@ enricher = FeaturesEnricher(
 </tr>
 <tr>
 <td> SearchKey.POSTAL_CODE </td>
-<td> Postal code a.k.a. ZIP code.
+<td> Postal code a.k.a. ZIP code. Can only be used with SearchKey.COUNTRY </td>
 <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
 <td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>
 </tr>
@@ -351,7 +353,7 @@ enricher = FeaturesEnricher(
 
 </details>
 
-For the
+For the search key types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt> you have to specify the date/datetime format by passing <tt>date_format</tt> parameter to `FeaturesEnricher`. For example:
 ```python
 from upgini.features_enricher import FeaturesEnricher
 from upgini.metadata import SearchKey
@@ -369,12 +371,12 @@ enricher = FeaturesEnricher(
 )
 ```
 
-To use
+To use a non-UTC timezone for datetime, you can cast datetime column explicitly to your timezone (example for Warsaw):
 ```python
 df["date"] = df.date.astype("datetime64").dt.tz_localize("Europe/Warsaw")
 ```
 
-
+A single country for the whole training dataset can be passed via `country_code` parameter:
 ```python
 from upgini.features_enricher import FeaturesEnricher
 from upgini.metadata import SearchKey
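The `date_format` strings mentioned above follow standard `strftime`/`strptime` conventions. A quick sketch of what a day-first format means for the non-standard `12.02.2020` notation shown in the search-key table, versus the ISO-8601 default:

```python
from datetime import datetime

# ISO-8601 dates parse with "%Y-%m-%d"; the non-standard European
# notation from the key-type table needs an explicit day-first format.
iso = datetime.strptime("2020-02-12", "%Y-%m-%d")
eu = datetime.strptime("12.02.2020", "%d.%m.%Y")
print(iso.date() == eu.date())  # both denote 12 February 2020
```

Passing the wrong format silently swaps day and month for ambiguous values (e.g. `03.04.2020`), so stating `date_format` explicitly for `object`/`string` date columns is the safer default.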
@@ -390,10 +392,10 @@ enricher = FeaturesEnricher(
 ```
 
 ### 4. 🔍 Start your first feature search!
-The main abstraction you interact is `FeaturesEnricher`, a Scikit-learn
-Create instance of the `FeaturesEnricher` class and call:
+The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn-compatible estimator. You can easily add it to your existing ML pipelines.
+Create an instance of the `FeaturesEnricher` class and call:
 - `fit` to search relevant datasets & features
--
+- then `transform` to enrich your dataset with features from the search result
 
 Let's try it out!
 ```python
@@ -406,7 +408,7 @@ train_df = pd.read_csv("customer_churn_prediction_train.csv")
 X = train_df.drop(columns="churn_flag")
 y = train_df["churn_flag"]
 
-# now we're going to create `FeaturesEnricher` class
+# now we're going to create an instance of the `FeaturesEnricher` class
 enricher = FeaturesEnricher(
 search_keys={
 "subscription_activation_date": SearchKey.DATE,
@@ -414,15 +416,15 @@ enricher = FeaturesEnricher(
 "zip_code": SearchKey.POSTAL_CODE
 })
 
-#
-#
+# Everything is ready to fit! For 100k records, fitting should take around 10 minutes
+# We'll send an email notification; just register on profile.upgini.com
 enricher.fit(X, y)
 ```
 
-That's
+That's it! The `FeaturesEnricher` is now fitted.
 ### 5. 📈 Evaluate feature importances (SHAP values) from the search result
 
-`FeaturesEnricher` class has two properties for feature importances,
+`FeaturesEnricher` class has two properties for feature importances, that are populated after fit - `feature_names_` and `feature_importances_`:
 - `feature_names_` - feature names from the search result, and if parameter `keep_input=True` was used, initial columns from search dataset as well
 - `feature_importances_` - SHAP values for features from the search result, same order as in `feature_names_`
 
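Because `feature_names_` and `feature_importances_` are parallel lists in the same order, ranking features by SHAP value is a one-line `zip`/`sorted`. An illustrative sketch with made-up names and values standing in for a real fitted enricher's attributes:

```python
# Hypothetical stand-ins for enricher.feature_names_ and
# enricher.feature_importances_ after a fit() run.
feature_names_ = ["f_weather_temp", "zip_code", "f_economics_gdp"]
feature_importances_ = [0.12, 0.05, 0.31]

# Pair names with SHAP values, then sort by importance, descending.
ranked = sorted(zip(feature_names_, feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, shap in ranked:
    print(f"{name}: {shap}")
```

The same pairing underlies the tabular report returned by `get_features_info()` mentioned just below.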
@@ -433,8 +435,8 @@ enricher.get_features_info()
 Get more details about `FeaturesEnricher` at runtime using docstrings via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.
 
 ### 6. 🏭 Enrich Production ML pipeline with relevant external features
-`FeaturesEnricher` is a Scikit-learn
-Use `transform` method of `FeaturesEnricher
+`FeaturesEnricher` is a Scikit-learn-compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after `fit`).
+Use the `transform` method of `FeaturesEnricher`, and let the magic do the rest 🪄
 ```python
 # load dataset for enrichment
 test_x = pd.read_csv("test.csv")
@@ -443,24 +445,24 @@ enriched_test_features = enricher.transform(test_x)
 ```
 #### 6.1 Reuse completed search for enrichment without 'fit' run
 
-`FeaturesEnricher` can be
+`FeaturesEnricher` can be initialized with `search_id` from a completed search (after a fit call).
 Just use `enricher.get_search_id()` or copy search id string from the `fit()` output.
-Search keys and features in X
+Search keys and features in X must be the same as for `fit()`
 ```python
 enricher = FeaturesEnricher(
-#same set of
+# same set of search keys as for the fit step
 search_keys={"date": SearchKey.DATE},
-api_key="<YOUR API_KEY>", # if you
+api_key="<YOUR API_KEY>", # if you fitted the enricher with an api_key, then you should use it here
 search_id = "abcdef00-0000-0000-0000-999999999999"
 )
-enriched_prod_dataframe=enricher.transform(input_dataframe)
+enriched_prod_dataframe = enricher.transform(input_dataframe)
 ```
-#### 6.2 Enrichment with
-
-`FeaturesEnricher`, when
-And then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features,
+#### 6.2 Enrichment with updated external data sources and features
+In most ML cases, the training step requires a labeled dataset with historical observations. For production, you'll need updated, current data sources and features to generate predictions.
+`FeaturesEnricher`, when initialized with a set of search keys that includes `SearchKey.DATE`, will match records from all potential external data sources **exactly on the specified date/datetime** based on `SearchKey.DATE`, to avoid enrichment with features "from the future" during the `fit` step.
+And then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features, current as of the present date.
 
-⚠️
+⚠️ Include `SearchKey.DATE` in the set of search keys to get current features for production and avoid features from the future during training:
 ```python
 enricher = FeaturesEnricher(
 search_keys={
@@ -474,13 +476,13 @@ enricher = FeaturesEnricher(
 ## 💻 How does it work?
 
 ### 🧹 Search dataset validation
-We validate and clean search
+We validate and clean the search‑initialization dataset under the hood:
 
--
+- check your **search keys** columns' formats;
 - check zero variance for label column;
-- check dataset for full row duplicates. If we find any, we remove
-- check inconsistent labels - rows with the same features and keys but different labels, we remove them and
--
+- check dataset for full row duplicates. If we find any, we remove them and report their share;
+- check inconsistent labels - rows with the same features and keys but different labels, we remove them and report their share;
+- remove columns with zero variance - we treat any non **search key** column in the search dataset as a feature, so columns with zero variance will be removed
 
 ### ❔ Supervised ML tasks detection
 We detect ML task under the hood based on label column values. Currently we support:
@@ -488,7 +490,7 @@ We detect ML task under the hood based on label column values. Currently we supp
 - ModelTaskType.MULTICLASS
 - ModelTaskType.REGRESSION
 
-But for certain search datasets you can pass parameter to `FeaturesEnricher` with correct ML
+But for certain search datasets you can pass parameter to `FeaturesEnricher` with correct ML task type:
 ```python
 from upgini.features_enricher import FeaturesEnricher
 from upgini.metadata import SearchKey, ModelTaskType
@@ -498,12 +500,12 @@ enricher = FeaturesEnricher(
     model_task_type=ModelTaskType.REGRESSION
 )
 ```
-#### ⏰ Time
-*Time
-* [Scikit-learn time
-* [Blocked time
+#### ⏰ Time-series prediction support
+*Time-series prediction* is supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with time-series-specific cross-validation splits:
+* [Scikit-learn time-series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
+* [Blocked time-series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter
 
-To initiate feature search you can pass cross-validation type parameter to `FeaturesEnricher` with time
+To initiate feature search, you can pass the cross-validation type parameter to `FeaturesEnricher` with a time-series-specific CV type:
 ```python
 from upgini.features_enricher import FeaturesEnricher
 from upgini.metadata import SearchKey, CVType
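The two CV schemes referenced in this hunk differ in how folds are laid out over time. A minimal sketch of the *blocked* variant, using only the standard library — this is an illustration of the splitting idea, not the implementation the library selects via `CVType.blocked_time_series`:

```python
def blocked_time_series_splits(n_samples, n_splits):
    """Illustrative sketch of blocked time-series CV.

    The data is cut into contiguous blocks; inside each block the
    earlier 80% is used for training and the later 20% for testing,
    so test observations never precede their training observations
    and folds do not overlap. NOT Upgini's internal implementation.
    """
    block = n_samples // n_splits
    splits = []
    for i in range(n_splits):
        start, stop = i * block, (i + 1) * block
        cut = start + int(0.8 * block)  # 80/20 train/test inside the block
        splits.append((list(range(start, cut)), list(range(cut, stop))))
    return splits

for train_idx, test_idx in blocked_time_series_splits(n_samples=20, n_splits=2):
    print(train_idx, test_idx)
```

The ordinary `CVType.time_series` scheme instead grows the training window fold by fold, as in scikit-learn's `TimeSeriesSplit`.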
@@ -524,12 +526,12 @@ enricher = FeaturesEnricher(
     cv=CVType.time_series
 )
 ```
-⚠️ **
+⚠️ **Preprocess the dataset** in case of time-series prediction:
 sort rows in dataset according to observation order, in most cases - ascending order by date/datetime.
 
 ### 🆙 Accuracy and uplift metrics calculations
-`FeaturesEnricher`
-You can use any model estimator with scikit-learn
+`FeaturesEnricher` automatically calculates model metrics and uplift from new relevant features either using `calculate_metrics()` method or `calculate_metrics=True` parameter in `fit` or `fit_transform` methods (example below).
+You can use any model estimator with scikit-learn-compatible interface, some examples are:
 * [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)
 * [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
 * [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
@@ -537,8 +539,8 @@ You can use any model estimator with scikit-learn compartible interface, some examples are:
 
 <details>
 <summary>
-👈 Evaluation metric should be passed to <i>calculate_metrics()</i> by <i>scoring</i>
-out-of-the
+👈 Evaluation metric should be passed to <i>calculate_metrics()</i> by the <i>scoring</i> parameter,<br/>
+out-of-the-box Upgini supports
 </summary>
 <table style="table-layout: fixed;">
 <tr>
@@ -645,10 +647,10 @@ You can use any model estimator with scikit-learn compartible interface, some examples are:
 </table>
 </details>
 
-In addition to that list, you can define custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/
+In addition to that list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/1.7/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
 
-By default, `calculate_metrics()` method calculates evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by parameter `cv = CVType.<cross-validation-split>`.
-But you can easily define new split by passing
+By default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the parameter `cv = CVType.<cross-validation-split>`.
+But you can easily define a new split by passing a subclass of `BaseCrossValidator` to the `cv` parameter in `calculate_metrics()`.
 
 Example with more tips-and-tricks:
 ```python
@@ -673,7 +675,7 @@ enricher.calculate_metrics(scoring=custom_scoring)
 custom_cv = TimeSeriesSplit(n_splits=5)
 enricher.calculate_metrics(cv=custom_cv)
 
-# All
+# All of these custom parameters can be combined in both methods: fit, fit_transform and calculate_metrics:
 enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
 ```
 
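The SMAPE metric the README names as a `make_scorer` example can be written as a plain function in a few lines; wrapping it with `sklearn.metrics.make_scorer(smape, greater_is_better=False)` (assuming scikit-learn is installed) would then produce a `scoring` object like the `custom_scoring` used in the hunk above. The function itself needs nothing beyond the standard library — a sketch, with the zero-denominator convention being an assumption:

```python
def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0..200).

    Pairs where both the actual and predicted value are zero are
    treated as zero error (a convention chosen here, not mandated
    by the metric's definition).
    """
    total = 0.0
    for actual, forecast in zip(y_true, y_pred):
        denom = (abs(actual) + abs(forecast)) / 2
        if denom:
            total += abs(forecast - actual) / denom
    return 100 * total / len(y_true)

print(round(smape([100, 200], [110, 180]), 3))  # -> 10.025
```

Because lower SMAPE is better, `greater_is_better=False` is the flag that makes the wrapped scorer usable for model selection.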
@@ -683,34 +685,34 @@ enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
 
 ### 🤖 Automated feature generation from columns in a search dataset
 
-If a training dataset has a text column, you can generate additional embeddings from it using
+If a training dataset has a text column, you can generate additional embeddings from it using instruction-guided embedding generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.
 
-
+In most cases, this gives better results than direct embeddings generation from a text field. Currently, Upgini has two LLMs connected to the search engine - GPT-3.5 from OpenAI and GPT-J.
 
-To use this feature, pass the column names as arguments to the `
+To use this feature, pass the column names as arguments to the `text_features` parameter. You can use up to 2 columns.
 
 Here's an example for generating features from the "description" and "summary" columns:
 
 ```python
 enricher = FeaturesEnricher(
     search_keys={"date": SearchKey.DATE},
-
+    text_features=["description", "summary"]
 )
 ```
 
 With this code, Upgini will generate LLM embeddings from text columns and then check them for predictive power for your ML task.
 
-Finally, Upgini will return a dataset enriched
+Finally, Upgini will return a dataset enriched with only the relevant components of LLM embeddings.
 
-### Find features only
+### Find features that only provide accuracy gains to existing data in the ML model
 
-If you already have features or other external data sources, you can specifically search new datasets
+If you already have features or other external data sources, you can specifically search for new datasets and features that only provide accuracy gains "on top" of them.
 
-Just leave all these existing features in the labeled training dataset and Upgini library automatically
+Just leave all these existing features in the labeled training dataset and the Upgini library automatically uses them during the feature search process and as a baseline ML model to calculate accuracy metric uplift. Only features that improve accuracy will be returned.
 
 ### Check robustness of accuracy improvement from external features
 
-You can validate external features
+You can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter:
 ```python
 # load train dataset
 train_df = pd.read_csv("train.csv")
@@ -737,13 +739,13 @@ enricher.fit(
 - Same data schema as for search initialization X dataset
 - Pandas dataframe representation
 
-There are 3 options to pass out-of-time without labels:
+The out-of-time dataset can be without labels. There are 3 options to pass out-of-time without labels:
 ```python
 enricher.fit(
     train_ids_and_features,
     train_label,
     eval_set = [
-        (eval_ids_and_features_1,), #
+        (eval_ids_and_features_1,), # A tuple with 1 element
         (eval_ids_and_features_2, None), # None as labels
         (eval_ids_and_features_3, [np.nan] * len(eval_ids_and_features_3)), # List or Series of the same size as eval X
     ]
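The three unlabeled `eval_set` element shapes shown in the hunk above can be demonstrated with plain Python. The normalizer below is purely illustrative — `normalize_eval_entry` is a name invented here, not part of the upgini API — it just shows that all three shapes carry the same information, "features with no labels":

```python
import math

def normalize_eval_entry(entry):
    """Illustrative sketch: accept the three unlabeled eval_set shapes
    from the README and normalize them to (X, None).
    NOT Upgini's internal code.
    """
    if len(entry) == 1:               # (X,) - a tuple with 1 element
        return entry[0], None
    x, y = entry
    if y is None:                     # (X, None) - None as labels
        return x, None
    if all(isinstance(v, float) and math.isnan(v) for v in y):
        return x, None                # (X, [nan, nan, ...])
    return x, y                       # labeled eval set stays as-is

X = [[1], [2], [3]]
print(normalize_eval_entry((X,)))
print(normalize_eval_entry((X, None)))
print(normalize_eval_entry((X, [float("nan")] * len(X))))
```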
@@ -775,15 +777,15 @@ enriched_df = enricher.fit_transform(
 ```
 
 **Stability parameters:**
-- `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI
+- `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI above this threshold will be excluded from the final feature set. Lower values mean stricter stability requirements.
 - `stability_agg_func` (str, default="max"): Function to aggregate PSI values across time intervals. Options: "max" (most conservative), "min" (least conservative), "mean" (balanced approach).
 
-**PSI (Population Stability Index)** measures how much feature distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models.
+**PSI (Population Stability Index)** measures how much feature distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models. PSI is calculated on the eval_set, which should contain the most recent dates relative to the training dataset.
 
 ### Use custom loss function in feature selection & metrics calculation
 
 `FeaturesEnricher` can be initialized with additional string parameter `loss`.
-Depending on ML
+Depending on the ML task, you can use the following loss functions:
 - `regression`: regression, regression_l1, huber, poisson, quantile, mape, gamma, tweedie;
 - `binary`: binary;
 - `multiclass`: multiclass, multiclassova.
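The PSI described in the hunk above can be sketched in a few lines over pre-binned frequencies. The binning strategy and the `eps` smoothing are assumptions of this illustration, not Upgini's exact computation:

```python
import math

def psi(expected_ratios, actual_ratios, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_ratios / actual_ratios are per-bin frequencies that each
    sum to 1 (e.g. a feature's distribution in the train period vs. the
    eval_set period). Illustrative sketch only - the binning and eps
    smoothing here are assumptions, not Upgini's internal code.
    """
    total = 0.0
    for e, a in zip(expected_ratios, actual_ratios):
        e = max(e, eps)  # avoid log(0) for empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

stable = psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
shifted = psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40])
print(stable, shifted)  # identical distributions give PSI 0
```

With the default `stability_threshold=0.2`, a feature whose distribution drifted like `shifted` above (PSI ≈ 0.23) would be excluded from the final feature set.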
@@ -802,7 +804,7 @@ enriched_dataframe.fit(X, y)
 
 ### Exclude premium data sources from fit, transform and metrics calculation
 
-`fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` can be used with
+`fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` can be used with the `exclude_features_sources` parameter to exclude Trial or Paid features from Premium data sources:
 ```python
 enricher = FeaturesEnricher(
     search_keys={"subscription_activation_date": SearchKey.DATE}
@@ -815,7 +817,7 @@ enricher.transform(X, exclude_features_sources=(trial_features + paid_features))
 ```
 
 ### Turn off autodetection for search key columns
-Upgini has autodetection of search keys
+Upgini has autodetection of search keys enabled by default.
 To turn off use `autodetect_search_keys=False`:
 
 ```python
@@ -827,8 +829,8 @@ enricher = FeaturesEnricher(
 enricher.fit(X, y)
 ```
 
-### Turn off
-Upgini
+### Turn off removal of target outliers
+Upgini detects rows with target outliers for regression tasks. By default such rows are dropped during metrics calculation. To turn off the removal of target-outlier rows, use the `remove_outliers_calc_metrics=False` parameter in the fit, fit_transform, or calculate_metrics methods:
 
 ```python
 enricher = FeaturesEnricher(
@@ -838,8 +840,8 @@ enricher = FeaturesEnricher(
 enricher.fit(X, y, remove_outliers_calc_metrics=False)
 ```
 
-### Turn off
-Upgini
+### Turn off feature generation on search keys
+Upgini attempts to generate features for email, date and datetime search keys. By default this generation is enabled. To disable it use the `generate_search_key_features` parameter of the FeaturesEnricher constructor:
 
 ```python
 enricher = FeaturesEnricher(
@@ -850,37 +852,37 @@
 
 ## 🔑 Open up all capabilities of Upgini
 
-[Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features:
+[Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600M+ phone numbers, 350M+ emails, 2^32 IP addresses
 
 |Benefit|No Sign-up | Registered user |
 |--|--|--|
 |Enrichment with **date/datetime, postal/ZIP code and country keys** | Yes | Yes |
-|Enrichment with **phone number, hashed email/HEM and IP
+|Enrichment with **phone number, hashed email/HEM and IP address keys** | No | Yes |
 |Email notification on **search task completion** | No | Yes |
 |Automated **feature generation with LLMs** from columns in a search dataset| Yes, *till 12/05/23* | Yes |
 |Email notification on **new data source activation** 🔜 | No | Yes |
 
-## 👩🏻💻 How to share data/features with
-You may publish ANY data which you consider as royalty
+## 👩🏻💻 How to share data/features with the community?
+You may publish ANY data which you consider as royalty- or license-free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications for **community usage**:
 1. Please Sign Up [here](https://profile.upgini.com)
-2. Copy *Upgini API key* from profile and upload your data from Upgini
+2. Copy *Upgini API key* from your profile and upload your data from the Upgini Python library with this key:
 ```python
 import pandas as pd
 from upgini.metadata import SearchKey
 from upgini.ads import upload_user_ads
 import os
 os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
-#you can define custom search key
+#you can define a custom search key that might not yet be supported; just use SearchKey.CUSTOM_KEY type
 sample_df = pd.read_csv("path_to_data_sample_file")
 upload_user_ads("test", sample_df, {
     "city": SearchKey.CUSTOM_KEY,
     "stats_date": SearchKey.DATE
 })
 ```
-3. After data verification, search results on community data will be available usual way.
+3. After data verification, search results on community data will be available in the usual way.
 
 ## 🛠 Getting Help & Community
-Please note
+Please note that we are still in beta.
 Requests and support, in preferred order
 [](https://4mlg.short.gy/join-upgini-community)
 [](https://github.com/upgini/upgini/issues)
@@ -893,22 +895,22 @@ Requests and support, in preferred order
 
 ## 🧩 Contributing
 We are not a large team, so we probably won't be able to:
-- implement smooth integration with most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.
+- implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.)
 - implement all possible data verification and normalization capabilities for different types of search keys
 And we need some help from the community!
 
-So, we'll be happy about every **pull request** you open and **issue** you
-**For major changes**, please open an issue first to discuss what you would like to change
+So, we'll be happy about every **pull request** you open and every **issue** you report to make this library **even better**. Please note that it might sometimes take us a while to get back to you.
+**For major changes**, please open an issue first to discuss what you would like to change.
 #### Developing
 Some convenient ways to start contributing are:
 ⚙️ [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini) You can remotely open this repo in VS Code without cloning or automatically clone and open it inside a docker container.
 ⚙️ **Gitpod** [](https://gitpod.io/#https://github.com/upgini/upgini) You can use Gitpod to launch a fully functional development environment right in your browser.
 
 ## 🔗 Useful links
-- [Simple sales
+- [Simple sales prediction template notebook](#-simple-sales-prediction-for-retail-stores)
 - [Full list of Kaggle Guides & Examples](https://www.kaggle.com/romaupgini/code)
 - [Project on PyPI](https://pypi.org/project/upgini)
 - [More perks for registered users](https://profile.upgini.com)
 
-<sup>😔 Found
+<sup>😔 Found typo or a bug in code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
 Please report it here</a></sup>