upgini 1.2.122a4__py3-none-any.whl → 1.2.146a4__py3-none-any.whl

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between the package versions as they appear in their respective public registries.

Potentially problematic release: this version of upgini might be problematic.

@@ -1,6 +1,6 @@
1
- Metadata-Version: 2.3
1
+ Metadata-Version: 2.4
2
2
  Name: upgini
3
- Version: 1.2.122a4
3
+ Version: 1.2.146a4
4
4
  Summary: Intelligent data search & enrichment for Machine Learning
5
5
  Project-URL: Bug Reports, https://github.com/upgini/upgini/issues
6
6
  Project-URL: Homepage, https://upgini.com/
@@ -30,9 +30,11 @@ Requires-Dist: ipywidgets>=8.1.0
30
30
  Requires-Dist: jarowinkler>=2.0.0
31
31
  Requires-Dist: levenshtein>=0.25.1
32
32
  Requires-Dist: lightgbm>=4.6.0
33
+ Requires-Dist: more-itertools==10.7.0
33
34
  Requires-Dist: numpy<3.0.0,>=1.19.0
34
35
  Requires-Dist: pandas<3.0.0,>=1.1.0
35
36
  Requires-Dist: psutil>=5.9.0
37
+ Requires-Dist: pyarrow==18.1.0
36
38
  Requires-Dist: pydantic<3.0.0,>1.0.0
37
39
  Requires-Dist: pyjwt>=2.8.0
38
40
  Requires-Dist: python-bidi==0.4.2
@@ -50,7 +52,7 @@ Description-Content-Type: text/markdown
50
52
  <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : Free automated data enrichment library for machine learning: </br>only the accuracy improving features in 2 minutes </h2> -->
51
53
  <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> • Free production-ready automated data enrichment library for machine learning</h2>-->
52
54
  <h2 align="center"> <a href="https://upgini.com/">Upgini • Intelligent data search & enrichment for Machine Learning and AI</a></h2>
53
- <p align="center"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community and premium external data sources, </br>including open & commercial LLMs</b> </p>
55
+ <p align="center"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community, and premium external data sources, </br>including open & commercial LLMs</b> </p>
54
56
  <p align="center">
55
57
  <br />
56
58
  <a href="https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb"><strong>Quick Start in Colab »</strong></a> |
@@ -58,7 +60,7 @@ Description-Content-Type: text/markdown
58
60
  <a href="https://profile.upgini.com">Register / Sign In</a> |
59
61
  <!-- <a href="https://gitter.im/upgini/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link">Gitter Community</a> | -->
60
62
  <a href="https://4mlg.short.gy/join-upgini-community">Slack Community</a> |
61
- <a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose new Data source</strong></a>
63
+ <a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose a new data source</strong></a>
62
64
  </p>
63
65
  <p align=center>
64
66
  <a href="/LICENSE"><img alt="BSD-3 license" src="https://img.shields.io/badge/license-BSD--3%20Clause-green"></a>
@@ -74,19 +76,19 @@ Description-Content-Type: text/markdown
74
76
  [![Gitter Сommunity](https://img.shields.io/badge/gitter-@upgini-teal.svg?logo=gitter)](https://gitter.im/upgini/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) -->
75
77
  ## ❔ Overview
76
78
 
77
- **Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of machine ML features using large language models (LLMs), GraphNNs and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data).
79
+ **Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of ML features using large language models (LLMs), GNNs (graph neural networks), and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data).
78
80
 
79
- **Motivation:** for most supervised ML models external data & features boost accuracy significantly better than any hyperparameters tuning. But lack of automated and time-efficient enrichment tools for external data blocks massive adoption of external features in ML pipelines. We want radically simplify features search and enrichment to make external data a standard approach. Like a hyperparameter tuning for machine learning nowadays.
81
+ **Motivation:** for most supervised ML models external data & features boost accuracy significantly better than any hyperparameters tuning. But lack of automated and time-efficient enrichment tools for external data blocks massive adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment to make external data a standard approach. Like hyperparameter tuning in machine learning today.
80
82
 
81
83
  **Mission:** Democratize access to data sources for data science community.
82
84
 
83
85
  ## 🚀 Awesome features
84
- ⭐️ Automatically find only relevant features that *give accuracy improvement for ML model*. Not just correlated with target variable, what 9 out of 10 cases gives zero accuracy improvement
85
- ⭐️ Automated feature generation from the sources: feature generation with Large Language Models' data augmentation, RNNs, GraphNN; multiple data source ensembling
86
- ⭐️ Automatic search key augmentation from all connected sources. If you do not have all search keys in your search request, such as postal/zip code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources
87
- ⭐️ Calculate accuracy metrics and uplifts after enrichment existing ML model with external features
88
- ⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate risks of unstable external data dependencies in ML pipeline
89
- ⭐️ Easy to use - single request to enrich training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
86
+ ⭐️ Automatically find only relevant features that *improve your model’s accuracy*. Not just correlated with the target variable, which in 9 out of 10 cases yields zero accuracy improvement
87
+ ⭐️ Automated feature generation from the sources: feature generation with LLM‑based data augmentation, RNNs, and GraphNNs; ensembling across multiple data sources
88
+ ⭐️ Automatic search key augmentation from all connected sources. If you do not have all search keys in your search request, such as postal/ZIP code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources
89
+ ⭐️ Calculate accuracy metrics and uplift after enriching an existing ML model with external features
90
+ ⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline
91
+ ⭐️ Easy to use - a single request to enrich the training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
90
92
  <table>
91
93
  <tr>
92
94
  <td> date / datetime </td>
@@ -102,7 +104,7 @@ Description-Content-Type: text/markdown
102
104
  </tr>
103
105
  </table>
104
106
 
105
- ⭐️ Scikit-learn compatible interface for quick data integration with existing ML pipelines
107
+ ⭐️ Scikit-learn-compatible interface for quick data integration with existing ML pipelines
106
108
  ⭐️ Support for most common supervised ML tasks on tabular data:
107
109
  <table>
108
110
  <tr>
@@ -111,7 +113,7 @@ Description-Content-Type: text/markdown
111
113
  </tr>
112
114
  <tr>
113
115
  <td><a href="https://en.wikipedia.org/wiki/Regression_analysis">☑️ regression</a></td>
114
- <td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time series prediction</a></td>
116
+ <td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time-series prediction</a></td>
115
117
  </tr>
116
118
  </table>
117
119
 
@@ -123,13 +125,13 @@ Description-Content-Type: text/markdown
123
125
 
124
126
  ## 🌎 Connected data sources and coverage
125
127
 
126
- - **Public data** : public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
127
- - **Community shared data**: royalty / license free datasets or features from Data science community (our users). It's both a public and a scraped data
128
+ - **Public data**: public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
129
+ - **Community-shared data**: royalty- or license-free datasets or features from the data science community (our users). This includes both public and scraped data
128
130
  - **Premium data providers**: commercial data sources verified by the Upgini team in real-world use cases
129
131
 
130
- 👉 [**Details on datasets and features**](https://upgini.com/#data_sources)
132
+ 👉 [**Details on datasets and features**](https://upgini.com/#data_sources)
131
133
  #### 📊 Total: **239 countries** and **up to 41 years** of history
132
- |Data sources|Countries|History, years|# sources for ensemble|Update|Search keys|API Key required
134
+ |Data sources|Countries|History (years)|# sources for ensembling|Update frequency|Search keys|API Key required
133
135
  |--|--|--|--|--|--|--|
134
136
  |Historical weather & Climate normals | 68 |22|-|Monthly|date, country, postal/ZIP code|No
135
137
  |Location/Places/POI/Area/Proximity information from OpenStreetMap | 221 |2|-|Monthly|date, country, postal/ZIP code|No
@@ -137,7 +139,7 @@ Description-Content-Type: text/markdown
137
139
  |Consumer Confidence index| 44 |22|-|Monthly|date, country|No
138
140
  |World economic indicators|191 |41|-|Monthly|date, country|No
139
141
  |Markets data|-|17|-|Monthly|date, datetime|No
140
- |World mobile & fixed broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No
142
+ |World mobile & fixed-broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No
141
143
  |World demographic data |90|-|2|Annual|country, postal/ZIP code|No
142
144
  |World house prices |44|-|3|Annual|country, postal/ZIP code|No
143
145
  |Public social media profile data |104|-|-|Monthly|date, email/HEM, phone |Yes
@@ -152,8 +154,8 @@ Description-Content-Type: text/markdown
152
154
 
153
155
  ### [Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
154
156
 
155
- * The goal is to predict salary for data science job postning based on information about employer and job description.
156
- * Following this guide, you'll learn how to **search & auto generate new relevant features with Upgini library**
157
+ * The goal is to predict salary for a data science job posting based on information about the employer and job description.
158
+ * Following this guide, you'll learn how to **search and autogenerate new relevant features with the Upgini library**
157
159
  * The evaluation metric is [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error).
158
160
 
159
161
  Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb) inside your browser:
@@ -168,7 +170,7 @@ Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob
168
170
  ### ❓ [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
169
171
 
170
172
  * The goal is to **predict future sales of different goods in stores** based on a 5-year history of sales.
171
- * Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
173
+ * Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting competition. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
172
174
 
173
175
  Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) inside your browser:
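SMAPE, the evaluation metric linked above, fits in a few lines. A plain-Python sketch of the standard formula (this is an illustration, not part of the Upgini library, and Kaggle's exact variant may differ in edge-case handling):

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200 scale)."""
    terms = [
        # by convention the term is 0 when both values are equal (avoids 0/0)
        0.0 if a == f else abs(f - a) / ((abs(a) + abs(f)) / 2)
        for a, f in zip(actual, forecast)
    ]
    return 100 * sum(terms) / len(terms)

print(smape([10, 20, 30], [10, 20, 30]))  # 0.0 for a perfect forecast
print(round(smape([100], [110]), 2))      # 9.52
```

Because the denominator averages the actual and forecast magnitudes, SMAPE penalizes over- and under-forecasting symmetrically, which is why it is popular for demand forecasting.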
174
176
 
@@ -180,25 +182,25 @@ Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini
180
182
  [![Open example in Gitpod](https://img.shields.io/badge/run_example_in-gitpod-orange?style=for-the-badge&logo=gitpod)](https://gitpod.io/#/github.com/upgini/upgini)
181
183
  -->
182
184
 
183
- ### ❓ [How to boost ML model accuracy for Kaggle TOP1 Leaderboard in 10 minutes](https://www.kaggle.com/code/romaupgini/more-external-features-for-top1-private-lb-4-54/notebook)
185
+ ### ❓ [How to boost ML model accuracy for Kaggle Top-1 leaderboard in 15 minutes](https://www.kaggle.com/code/nikupgini/how-to-find-external-data-for-1-private-lb-4-53/notebook)
184
186
 
185
- * The goal is **accuracy improvement for TOP1 winning Kaggle solution** from new relevant external features & data.
186
- * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
187
+ * The goal is **to improve a Top‑1 winning Kaggle solution** by adding new relevant external features and data.
188
+ * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
187
189
 
188
190
  ### ❓ [How to do low-code feature engineering for AutoML tools](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret/notebook)
189
191
 
190
192
  * **Save time on feature search and engineering**. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box.
191
193
  * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
192
- * Low-code AutoML tools: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)
194
+ * Low-code AutoML frameworks: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)
193
195
 
194
- ### ❓ [How to improve accuracy of Multivariate Time Series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)
196
+ ### ❓ [How to improve accuracy of Multivariate time-series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)
195
197
 
196
- * The goal is **accuracy improvement of Multivariate Time Series prediction** from new relevant external features & data. The main challenge here is a strategy of data & feature enrichment, when a component of Multivariate TS depends not only on its past values but also has **some dependency on other components**.
198
+ * The goal is **to improve the accuracy of multivariate time‑series forecasting** using new relevant external features and data. The main challenge is the data and feature enrichment strategy, in which a component of a multivariate time series depends not only on its past values but also on other components.
197
199
  * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [RMSLE](https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days#Note-regarding-calculating-the-average).
198
200
 
199
201
  ### ❓ [How to speed up feature engineering hypothesis tests with ready-to-use external features](https://www.kaggle.com/code/romaupgini/statement-dates-to-use-or-not-to-use/notebook)
200
202
 
201
- * **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge here is a time-dependent representation of information in a training dataset, which is uncommon for credit default prediction tasks. As a result, special data enrichment strategy is used.
203
+ * **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge is the time-dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, a special data enrichment strategy is used.
202
204
  * [Kaggle Competition](https://www.kaggle.com/competitions/amex-default-prediction) is a credit default prediction, evaluation metric is [normalized Gini coefficient](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464).
203
205
 
204
206
  ## 🏁 Quick start
@@ -227,19 +229,19 @@ docker build -t upgini .</i></br>
227
229
  <i>
228
230
  docker run -p 8888:8888 upgini</br>
229
231
  </i></br>
230
- 3. Open http://localhost:8888?token="<"your_token_from_console_output">" in your browser
232
+ 3. Open http://localhost:8888?token=&lt;your_token_from_console_output&gt; in your browser
231
233
  </details>
232
234
 
233
235
 
234
236
  ### 2. 💡 Use your labeled training dataset for search
235
237
 
236
238
  You can use your labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
237
- - **[search keys](#-search-key-types-we-support-more-to-come)** from training dataset to match records from potential data sources with a new features
238
- - **labels** from training dataset to estimate relevancy of feature or dataset for your ML task and calculate feature importance metrics
239
- - **your features** from training dataset to find external datasets and features which only give accuracy improvement to your existing data and estimate accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
239
+ - **[search keys](#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
240
+ - **labels** from the training dataset to estimate the relevance of features or datasets for your ML task and calculate feature importance metrics
241
+ - **your features** from the training dataset to find external datasets and features that improve accuracy of your existing data and estimate accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
240
242
 
241
243
 
242
- Load training dataset into pandas dataframe and separate features' columns from label column in a Scikit-learn way:
244
+ Load the training dataset into a Pandas DataFrame and separate feature columns from the label column in a Scikit-learn way:
243
245
  ```python
244
246
  import pandas as pd
245
247
  # labeled training dataset - customer_churn_prediction_train.csv
@@ -250,7 +252,7 @@ y = train_df["churn_flag"]
250
252
  <table border=1 cellpadding=10><tr><td>
251
253
  ⚠️ <b>Requirements for search initialization dataset</b>
252
254
  <br>
253
- We do dataset verification and cleaning under the hood, but still there are some requirements to follow:
255
+ We perform dataset verification and cleaning under the hood, but still there are some requirements to follow:
254
256
  <br>
255
257
  1. <b>pandas.DataFrame</b>, <b>pandas.Series</b> or <b>numpy.ndarray</b> representation;
256
258
  <br>
@@ -258,12 +260,12 @@ We do dataset verification and cleaning under the hood, but still there are some
258
260
  <br>
259
261
  3. at least one column selected as a <a href="#-search-key-types-we-support-more-to-come">search key</a>;
260
262
  <br>
261
- 4. min size after deduplication by search key column and NaNs removal: <i>100 records</i>
263
+ 4. min size after deduplication by search-key columns and removal of NaNs: <i>100 records</i>
262
264
  </td></tr></table>
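Requirement 4 above (minimum size after deduplication by search-key columns and NaN removal) can be pre-checked locally before initializing a search. A stdlib sketch, where the 100-record threshold comes from the requirements list and everything else (function name, sample rows) is illustrative:

```python
def effective_size(rows, search_keys):
    """Count rows that survive dedup by search-key columns and NaN removal."""
    seen = set()
    for row in rows:
        key = tuple(row.get(k) for k in search_keys)
        if any(v is None for v in key):  # treat a missing search key as NaN
            continue
        seen.add(key)  # a set keeps one row per unique key combination
    return len(seen)

rows = [
    {"date": "2020-02-12", "country": "PL"},
    {"date": "2020-02-12", "country": "PL"},   # duplicate key: dropped
    {"date": None, "country": "PL"},           # missing key: dropped
    {"date": "2020-02-13", "country": "PL"},
]
print(effective_size(rows, ["date", "country"]))  # 2
```

Compare the result against the 100-record minimum before uploading; a real pipeline would run the equivalent `drop_duplicates`/`dropna` on the pandas DataFrame.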
263
265
 
264
- ### 3. 🔦 Choose one or multiple columns as a search keys
265
- *Search keys* columns will be used to match records from all potential external data sources / features.
266
- Define one or multiple columns as a search keys with `FeaturesEnricher` class initialization.
266
+ ### 3. 🔦 Choose one or more columns as search keys
267
+ *Search keys* columns will be used to match records from all potential external data sources/features.
268
+ Define one or more columns as search keys when initializing the `FeaturesEnricher` class.
267
269
  ```python
268
270
  from upgini.features_enricher import FeaturesEnricher
269
271
  from upgini.metadata import SearchKey
@@ -283,7 +285,7 @@ enricher = FeaturesEnricher(
283
285
  <tr>
284
286
  <th> Search Key<br/>Meaning Type </th>
285
287
  <th> Description </th>
286
- <th> Allowed pandas dtypes (python types) </th>
288
+ <th> Allowed pandas dtypes (Python types) </th>
287
289
  <th> Example </th>
288
290
  </tr>
289
291
  <tr>
@@ -300,13 +302,13 @@ enricher = FeaturesEnricher(
300
302
  </tr>
301
303
  <tr>
302
304
  <td> SearchKey.IP </td>
303
- <td> IP address (version 4) </td>
304
- <td> <tt>object(str, ipaddress.IPv4Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
305
+ <td> IPv4 or IPv6 address </td>
306
+ <td> <tt>object(str, ipaddress.IPv4Address, ipaddress.IPv6Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
305
307
  <td> <tt>192.168.0.1 </tt> </td>
306
308
  </tr>
307
309
  <tr>
308
310
  <td> SearchKey.PHONE </td>
309
- <td> phone number, <a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a> </td>
311
+ <td> phone number (<a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a>) </td>
310
312
  <td> <tt>object(str)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> <br/> <tt>float64</tt> </td>
311
313
  <td> <tt>443451925138 </tt> </td>
312
314
  </tr>
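The `SearchKey.IP` row above lists `int64` among the allowed dtypes; with Python's stdlib `ipaddress` module the string-to-integer conversion looks like this (an illustrative sketch; the int64-overflow caveat for IPv6 is our observation, not from the table):

```python
import ipaddress

# An IPv4 string converts to an integer that fits the int64 dtype
ip_v4 = ipaddress.ip_address("192.168.0.1")
print(int(ip_v4))  # 3232235521

# IPv6 is also accepted as a search key value
ip_v6 = ipaddress.ip_address("2001:db8::1")
print(ip_v6.version)  # 6
# 128-bit IPv6 integers overflow int64, so object/string dtype is safer there
print(int(ip_v6) > 2**63)  # True
```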
@@ -321,7 +323,7 @@ enricher = FeaturesEnricher(
321
323
  </td>
322
324
  <td>
323
325
  <tt>2020-02-12 </tt>&nbsp;(<a href="https://en.wikipedia.org/wiki/ISO_8601">ISO-8601 standard</a>)
324
- <br/> <tt>12.02.2020 </tt>&nbsp;(non standard notation)
326
+ <br/> <tt>12.02.2020 </tt>&nbsp;(nonstandard notation)
325
327
  </td>
326
328
  </tr>
327
329
  <tr>
@@ -343,7 +345,7 @@ enricher = FeaturesEnricher(
343
345
  </tr>
344
346
  <tr>
345
347
  <td> SearchKey.POSTAL_CODE </td>
346
- <td> Postal code a.k.a. ZIP code. Could be used only with SearchKey.COUNTRY </td>
348
+ <td> Postal code a.k.a. ZIP code. Can only be used with SearchKey.COUNTRY </td>
347
349
  <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
348
350
  <td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>
349
351
  </tr>
@@ -351,7 +353,7 @@ enricher = FeaturesEnricher(
351
353
 
352
354
  </details>
353
355
 
354
- For the meaning types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt> you have to clarify date/datetime format by passing <tt>date_format</tt> parameter to `FeaturesEnricher`. For example:
356
+ For the search key types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt> you have to specify the date/datetime format by passing <tt>date_format</tt> parameter to `FeaturesEnricher`. For example:
355
357
  ```python
356
358
  from upgini.features_enricher import FeaturesEnricher
357
359
  from upgini.metadata import SearchKey
@@ -369,12 +371,12 @@ enricher = FeaturesEnricher(
369
371
  )
370
372
  ```
371
373
 
372
- To use datetime not in UTC timezone, you can cast datetime column explicitly to your timezone (example for Warsaw):
374
+ To use a non-UTC timezone for datetime, you can cast datetime column explicitly to your timezone (example for Warsaw):
373
375
  ```python
374
376
  df["date"] = df.date.astype("datetime64").dt.tz_localize("Europe/Warsaw")
375
377
  ```
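In plain stdlib terms (no pandas), the `date_format` and timezone steps above amount to the following. A sketch only: the `"%d.%m.%Y"` format string matches the nonstandard `12.02.2020` notation from the search-key table, and `zoneinfo` stands in for pandas' `tz_localize`:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Parse the nonstandard notation the way date_format="%d.%m.%Y" instructs
dt = datetime.strptime("12.02.2020", "%d.%m.%Y")
print(dt.date().isoformat())  # 2020-02-12, the ISO-8601 equivalent

# Attach a non-UTC timezone (Warsaw), mirroring the tz_localize call above
local_dt = dt.replace(tzinfo=ZoneInfo("Europe/Warsaw"))
print(local_dt.utcoffset())  # 1:00:00 (CET in February)
```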
376
378
 
377
- Single country for the whole training dataset can be passed with `country_code` parameter:
379
+ A single country for the whole training dataset can be passed via `country_code` parameter:
378
380
  ```python
379
381
  from upgini.features_enricher import FeaturesEnricher
380
382
  from upgini.metadata import SearchKey
@@ -390,10 +392,10 @@ enricher = FeaturesEnricher(
390
392
  ```
391
393
 
392
394
  ### 4. 🔍 Start your first feature search!
393
- The main abstraction you interact is `FeaturesEnricher`, a Scikit-learn compatible estimator. You can easily add it into your existing ML pipelines.
394
- Create instance of the `FeaturesEnricher` class and call:
395
+ The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn-compatible estimator. You can easily add it to your existing ML pipelines.
396
+ Create an instance of the `FeaturesEnricher` class and call:
395
397
  - `fit` to search relevant datasets & features
396
- - than `transform` to enrich your dataset with features from search result
398
+ - then `transform` to enrich your dataset with features from the search result
397
399
 
398
400
  Let's try it out!
399
401
  ```python
@@ -406,7 +408,7 @@ train_df = pd.read_csv("customer_churn_prediction_train.csv")
406
408
  X = train_df.drop(columns="churn_flag")
407
409
  y = train_df["churn_flag"]
408
410
 
409
- # now we're going to create `FeaturesEnricher` class
411
+ # now we're going to create an instance of the `FeaturesEnricher` class
410
412
  enricher = FeaturesEnricher(
411
413
  search_keys={
412
414
  "subscription_activation_date": SearchKey.DATE,
@@ -414,15 +416,15 @@ enricher = FeaturesEnricher(
414
416
  "zip_code": SearchKey.POSTAL_CODE
415
417
  })
416
418
 
417
- # everything is ready to fit! For 200к records fitting should take around 10 minutes,
418
- # we send email notification, just register on profile.upgini.com
419
+ # Everything is ready to fit! For 100k records, fitting should take around 10 minutes
420
+ # We'll send an email notification; just register on profile.upgini.com
419
421
  enricher.fit(X, y)
420
422
  ```
421
423
 
422
- That's all! We've fit `FeaturesEnricher`.
424
+ That's it! The `FeaturesEnricher` is now fitted.
423
425
  ### 5. 📈 Evaluate feature importances (SHAP values) from the search result
424
426
 
425
- `FeaturesEnricher` class has two properties for feature importances, which will be filled after fit - `feature_names_` and `feature_importances_`:
427
+ `FeaturesEnricher` class has two properties for feature importances, that are populated after fit - `feature_names_` and `feature_importances_`:
426
428
  - `feature_names_` - feature names from the search result, and if parameter `keep_input=True` was used, initial columns from search dataset as well
427
429
  - `feature_importances_` - SHAP values for features from the search result, same order as in `feature_names_`
428
430
 
@@ -433,8 +435,8 @@ enricher.get_features_info()
433
435
  Get more details about `FeaturesEnricher` at runtime using docstrings via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.
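Since `feature_importances_` follows the same order as `feature_names_` (per the description above), ranking features by SHAP value is a one-liner. A sketch with invented values; only the two attribute names come from the docs, the feature names and numbers are hypothetical stand-ins for a fitted enricher's output:

```python
# Hypothetical stand-ins for enricher.feature_names_ and
# enricher.feature_importances_ (same order, per the docs above)
feature_names_ = ["f_weather_temp", "f_econ_cpi", "postal_density"]
feature_importances_ = [0.42, 0.07, 0.21]

# Pair names with their SHAP values and sort by importance, descending
ranked = sorted(zip(feature_names_, feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, shap in ranked:
    print(f"{name}: {shap}")
# f_weather_temp: 0.42
# postal_density: 0.21
# f_econ_cpi: 0.07
```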
434
436
 
435
437
  ### 6. 🏭 Enrich Production ML pipeline with relevant external features
436
- `FeaturesEnricher` is a Scikit-learn compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after `fit` ).
437
- Use `transform` method of `FeaturesEnricher` , and let magic to do the rest 🪄
438
+ `FeaturesEnricher` is a Scikit-learn-compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after `fit`).
439
+ Use the `transform` method of `FeaturesEnricher`, and let the magic do the rest 🪄
438
440
  ```python
439
441
  # load dataset for enrichment
440
442
  test_x = pd.read_csv("test.csv")
@@ -443,24 +445,24 @@ enriched_test_features = enricher.transform(test_x)
443
445
  ```
444
446
  #### 6.1 Reuse completed search for enrichment without 'fit' run
445
447
 
446
- `FeaturesEnricher` can be initiated with a `search_id` parameter from completed search after fit method call.
448
+ `FeaturesEnricher` can be initialized with `search_id` from a completed search (after a fit call).
447
449
  Just use `enricher.get_search_id()` or copy search id string from the `fit()` output.
448
- Search keys and features in X should be the same as for `fit()`
450
+ Search keys and features in X must be the same as for `fit()`
449
451
  ```python
450
452
  enricher = FeaturesEnricher(
451
- #same set of a search keys as for the fit step
453
+ # same set of search keys as for the fit step
452
454
  search_keys={"date": SearchKey.DATE},
453
- api_key="<YOUR API_KEY>", # if you fit enricher with api_key then you should use it here
455
+ api_key="<YOUR API_KEY>", # if you fitted the enricher with an api_key, then you should use it here
454
456
  search_id = "abcdef00-0000-0000-0000-999999999999"
455
457
  )
456
- enriched_prod_dataframe=enricher.transform(input_dataframe)
458
+ enriched_prod_dataframe = enricher.transform(input_dataframe)
457
459
  ```
458
- #### 6.2 Enrichment with an updated external data sources and features
459
- For most of the ML cases, training step requires labeled dataset with a historical observations from the past. But for production step you'll need an updated and actual data sources and features for the present time, to calculate a prediction.
460
- `FeaturesEnricher`, when initiated with set of search keys which includes `SearchKey.DATE`, will match records from all potential external data sources **exactly on a the specific date/datetime** based on `SearchKey.DATE`. To avoid enrichment with features "form the future" for the `fit` step.
461
- And then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features, actual for the present date.
460
+ #### 6.2 Enrichment with updated external data sources and features
461
+ In most ML cases, the training step requires a labeled dataset with historical observations. For production, you'll need updated, current data sources and features to generate predictions.
462
+ `FeaturesEnricher`, when initialized with a set of search keys that includes `SearchKey.DATE`, will match records from all potential external data sources **exactly on the specified date/datetime** based on `SearchKey.DATE`, to avoid enrichment with features "from the future" during the `fit` step.
463
+ And then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features, current as of the present date.
462
464
 
463
- ⚠️ Initiate `FeaturesEnricher` with `SearchKey.DATE` search key in a key set to get actual features for production and avoid features from the future for the training:
465
+ ⚠️ Include `SearchKey.DATE` in the set of search keys to get current features for production and avoid features from the future during training:
464
466
  ```python
465
467
  enricher = FeaturesEnricher(
466
468
  search_keys={
@@ -474,13 +476,13 @@ enricher = FeaturesEnricher(
474
476
  ## 💻 How does it work?
475
477
 
476
478
  ### 🧹 Search dataset validation
477
- We validate and clean search initialization dataset under the hood:
479
+ We validate and clean the search initialization dataset under the hood:
478
480
 
479
- - сheck you **search keys** columns format;
481
+ - check your **search keys** columns' formats;
480
482
  - check zero variance for label column;
481
- - check dataset for full row duplicates. If we find any, we remove duplicated rows and make a note on share of row duplicates;
482
- - check inconsistent labels - rows with the same features and keys but different labels, we remove them and make a note on share of row duplicates;
483
- - remove columns with zero variance - we treat any non **search key** column in search dataset as a feature, so columns with zero variance will be removed
483
+ - check dataset for full row duplicates. If we find any, we remove them and report their share;
484
+ - check for inconsistent labels - rows with the same features and keys but different labels are removed, and we report their share;
485
+ - remove columns with zero variance - we treat any non **search key** column in the search dataset as a feature, so columns with zero variance will be removed
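The duplicate and zero-variance checks above can be sketched in pandas (an illustrative approximation, not Upgini's actual implementation; the toy dataset and column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy search dataset: one duplicated row, one zero-variance column
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02"],
    "feature_a": [1.0, 2.0, 2.0],
    "constant": [7, 7, 7],  # zero variance -> treated as a useless feature
    "label": [0, 1, 1],
})

# Drop full row duplicates and note their share
n_before = len(df)
df = df.drop_duplicates()
dup_share = 1 - len(df) / n_before

# Drop non-search-key, non-label columns with zero variance
zero_var = [c for c in df.columns
            if c not in ("date", "label") and df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=zero_var)
```

After cleaning, `dup_share` records the removed fraction and `zero_var` lists the dropped constant columns.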
484
486
 
485
487
  ### ❔ Supervised ML tasks detection
486
488
  We detect ML task under the hood based on label column values. Currently we support:
@@ -488,7 +490,7 @@ We detect ML task under the hood based on label column values. Currently we supp
488
490
  - ModelTaskType.MULTICLASS
489
491
  - ModelTaskType.REGRESSION
490
492
 
491
- But for certain search datasets you can pass parameter to `FeaturesEnricher` with correct ML taks type:
493
+ But for certain search datasets you can pass the correct ML task type to `FeaturesEnricher` as a parameter:
492
494
  ```python
493
495
  from upgini.features_enricher import FeaturesEnricher
494
496
  from upgini.metadata import SearchKey, ModelTaskType
@@ -498,12 +500,12 @@ enricher = FeaturesEnricher(
498
500
  model_task_type=ModelTaskType.REGRESSION
499
501
  )
500
502
  ```
501
- #### ⏰ Time Series prediction support
502
- *Time series prediction* supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with time series specific cross-validation split:
503
- * [Scikit-learn time series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
504
- * [Blocked time series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter
503
+ #### ⏰ Time-series prediction support
504
+ *Time-series prediction* is supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with time-series-specific cross-validation splits:
505
+ * [Scikit-learn time-series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
506
+ * [Blocked time-series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter
505
507
 
506
- To initiate feature search you can pass cross-validation type parameter to `FeaturesEnricher` with time series specific CV type:
508
+ To initiate feature search, you can pass the cross-validation type parameter to `FeaturesEnricher` with a time-series-specific CV type:
507
509
  ```python
508
510
  from upgini.features_enricher import FeaturesEnricher
509
511
  from upgini.metadata import SearchKey, CVType
@@ -524,12 +526,12 @@ enricher = FeaturesEnricher(
524
526
  cv=CVType.time_series
525
527
  )
526
528
  ```
527
- ⚠️ **Pre-process search dataset** in case of time series prediction:
529
+ ⚠️ **Preprocess the dataset** in case of time-series prediction:
528
530
  sort rows in dataset according to observation order, in most cases - ascending order by date/datetime.
529
531
 
530
532
  ### 🆙 Accuracy and uplift metrics calculations
531
- `FeaturesEnricher` automaticaly calculates model metrics and uplift from new relevant features either using `calculate_metrics()` method or `calculate_metrics=True` parameter in `fit` or `fit_transform` methods (example below).
532
- You can use any model estimator with scikit-learn compartible interface, some examples are:
533
+ `FeaturesEnricher` automatically calculates model metrics and uplift from new relevant features either using `calculate_metrics()` method or `calculate_metrics=True` parameter in `fit` or `fit_transform` methods (example below).
534
+ You can use any model estimator with scikit-learn-compatible interface, some examples are:
533
535
  * [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)
534
536
  * [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
535
537
  * [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
@@ -537,8 +539,8 @@ You can use any model estimator with scikit-learn compartible interface, some ex
537
539
 
538
540
  <details>
539
541
  <summary>
540
- 👈 Evaluation metric should be passed to <i>calculate_metrics()</i> by <i>scoring</i> parameter,<br/>
541
- out-of-the box Upgini supports
542
+ 👈 Evaluation metric should be passed to <i>calculate_metrics()</i> by the <i>scoring</i> parameter,<br/>
543
+ out-of-the-box Upgini supports
542
544
  </summary>
543
545
  <table style="table-layout: fixed;">
544
546
  <tr>
@@ -645,10 +647,10 @@ You can use any model estimator with scikit-learn compartible interface, some ex
645
647
  </table>
646
648
  </details>
647
649
 
648
- In addition to that list, you can define custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
650
+ In addition to that list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/1.7/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
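For instance, a SMAPE scorer might be built like this (a sketch only; the exact SMAPE variant and the zero-denominator handling are our assumptions, and the scorer is then passed via the `scoring` parameter):

```python
import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent; 0 is a perfect fit."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Define the term as 0 where both the true and predicted values are 0
    ratio = np.where(denom == 0, 0.0,
                     np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom))
    return float(np.mean(ratio) * 100)

# Lower SMAPE is better, hence greater_is_better=False
smape_scorer = make_scorer(smape, greater_is_better=False)
# then: enricher.calculate_metrics(scoring=smape_scorer)
```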
649
651
 
650
- By default, `calculate_metrics()` method calculates evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by parameter `cv = CVType.<cross-validation-split>`.
651
- But you can easily define new split by passing child of BaseCrossValidator to parameter `cv` in `calculate_metrics()`.
652
+ By default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the parameter `cv = CVType.<cross-validation-split>`.
653
+ But you can easily define a new split by passing a subclass of `BaseCrossValidator` to the `cv` parameter in `calculate_metrics()`.
652
654
 
653
655
  Example with more tips-and-tricks:
654
656
  ```python
@@ -673,7 +675,7 @@ enricher.calculate_metrics(scoring=custom_scoring)
673
675
  custom_cv = TimeSeriesSplit(n_splits=5)
674
676
  enricher.calculate_metrics(cv=custom_cv)
675
677
 
676
- # All this custom parameters could be combined in both methods: fit, fit_transform and calculate_metrics:
678
+ # All of these custom parameters can be combined in both methods: fit, fit_transform and calculate_metrics:
677
679
  enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
678
680
  ```
679
681
 
@@ -683,34 +685,34 @@ enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator,
683
685
 
684
686
  ### 🤖 Automated feature generation from columns in a search dataset
685
687
 
686
- If a training dataset has a text column, you can generate additional embeddings from it using instructed embeddings generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.
688
+ If a training dataset has a text column, you can generate additional embeddings from it using instruction-guided embedding generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.
687
689
 
688
- For most cases, this gives better results than direct embeddings generation from a text field. Currently, Upgini has two LLMs connected to a search engine - GPT-3.5 from OpenAI and GPT-J.
690
+ In most cases, this gives better results than direct embeddings generation from a text field. Currently, Upgini has two LLMs connected to the search engine - GPT-3.5 from OpenAI and GPT-J.
689
691
 
690
- To use this feature, pass the column names as arguments to the `generate_features` parameter. You can use up to 2 columns.
692
+ To use this feature, pass the column names as arguments to the `text_features` parameter. You can use up to 2 columns.
691
693
 
692
694
  Here's an example for generating features from the "description" and "summary" columns:
693
695
 
694
696
  ```python
695
697
  enricher = FeaturesEnricher(
696
698
  search_keys={"date": SearchKey.DATE},
697
- generate_features=["description", "summary"]
699
+ text_features=["description", "summary"]
698
700
  )
699
701
  ```
700
702
 
701
703
  With this code, Upgini will generate LLM embeddings from text columns and then check them for predictive power for your ML task.
702
704
 
703
- Finally, Upgini will return a dataset enriched by only relevant components of LLM embeddings.
705
+ Finally, Upgini will return a dataset enriched with only the relevant components of LLM embeddings.
704
706
 
705
- ### Find features only give accuracy gain to existing data in the ML model
707
+ ### Find features that only provide accuracy gains to existing data in the ML model
706
708
 
707
- If you already have features or other external data sources, you can specifically search new datasets & features only give accuracy gain "on top" of them.
709
+ If you already have features or other external data sources, you can specifically search for new datasets and features that only provide accuracy gains "on top" of them.
708
710
 
709
- Just leave all these existing features in the labeled training dataset and Upgini library automatically use them during feature search process and as a baseline ML model to calculate accuracy metric uplift. Only features which improve accuracy will return.
711
+ Just leave all these existing features in the labeled training dataset and the Upgini library automatically uses them during the feature search process and as a baseline ML model to calculate accuracy metric uplift. Only features that improve accuracy will be returned.
710
712
 
711
713
  ### Check robustness of accuracy improvement from external features
712
714
 
713
- You can validate external features robustness on out-of-time dataset using `eval_set` parameter:
715
+ You can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter:
714
716
  ```python
715
717
  # load train dataset
716
718
  train_df = pd.read_csv("train.csv")
@@ -737,13 +739,13 @@ enricher.fit(
737
739
  - Same data schema as for search initialization X dataset
738
740
  - Pandas dataframe representation
739
741
 
740
- There are 3 options to pass out-of-time without labels:
742
+ The out-of-time dataset can be without labels. There are 3 options to pass out-of-time without labels:
741
743
  ```python
742
744
  enricher.fit(
743
745
  train_ids_and_features,
744
746
  train_label,
745
747
  eval_set = [
746
- (eval_ids_and_features_1,), # Just tuple of 1 element
748
+ (eval_ids_and_features_1,), # A tuple with 1 element
747
749
  (eval_ids_and_features_2, None), # None as labels
748
750
  (eval_ids_and_features_3, [np.nan] * len(eval_ids_and_features_3)), # List or Series of the same size as eval X
749
751
  ]
@@ -775,15 +777,15 @@ enriched_df = enricher.fit_transform(
775
777
  ```
776
778
 
777
779
  **Stability parameters:**
778
- - `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI below this threshold will be excluded from the final feature set. Lower values mean stricter stability requirements.
780
+ - `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI above this threshold will be excluded from the final feature set. Lower values mean stricter stability requirements.
779
781
  - `stability_agg_func` (str, default="max"): Function to aggregate PSI values across time intervals. Options: "max" (most conservative), "min" (least conservative), "mean" (balanced approach).
780
782
 
781
- **PSI (Population Stability Index)** measures how much feature distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models.
783
+ **PSI (Population Stability Index)** measures how much feature distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models. PSI is calculated on the `eval_set`, which should contain the most recent dates relative to the training dataset.
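A PSI calculation can be sketched as follows (an illustrative implementation with decile bins fitted on the training-time sample; Upgini's internal binning and aggregation may differ):

```python
import numpy as np

def psi(reference: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `recent` relative to `reference`."""
    # Decile edges from the reference (training-time) distribution
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    # Clip recent values into the reference range so every value falls in a bin
    recent = np.clip(recent, edges[0], edges[-1])
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    rec_frac = np.histogram(recent, bins=edges)[0] / len(recent)
    # Guard against log(0) in near-empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    rec_frac = np.clip(rec_frac, 1e-6, None)
    return float(np.sum((rec_frac - ref_frac) * np.log(rec_frac / ref_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
stable = psi(train_feature, rng.normal(0.0, 1.0, 10_000))   # same distribution
drifted = psi(train_feature, rng.normal(1.0, 1.0, 10_000))  # mean shifted by 1 sigma
```

With the default `stability_threshold=0.2`, a feature like `stable` would be kept while one like `drifted` would be excluded.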
782
784
 
783
785
  ### Use custom loss function in feature selection & metrics calculation
784
786
 
785
787
  `FeaturesEnricher` can be initialized with additional string parameter `loss`.
786
- Depending on ML-task, you can use the following loss functions:
788
+ Depending on the ML task, you can use the following loss functions:
787
789
  - `regression`: regression, regression_l1, huber, poisson, quantile, mape, gamma, tweedie;
788
790
  - `binary`: binary;
789
791
  - `multiclass`: multiclass, multiclassova.
@@ -802,7 +804,7 @@ enriched_dataframe.fit(X, y)
802
804
 
803
805
  ### Exclude premium data sources from fit, transform and metrics calculation
804
806
 
805
- `fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` can be used with parameter `exclude_features_sources` that allows to exclude Trial or Paid features from Premium data sources:
807
+ `fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` can be used with the `exclude_features_sources` parameter to exclude Trial or Paid features from Premium data sources:
806
808
  ```python
807
809
  enricher = FeaturesEnricher(
808
810
  search_keys={"subscription_activation_date": SearchKey.DATE}
@@ -815,7 +817,7 @@ enricher.transform(X, exclude_features_sources=(trial_features + paid_features))
815
817
  ```
816
818
 
817
819
  ### Turn off autodetection for search key columns
818
- Upgini has autodetection of search keys on by default.
820
+ Upgini has autodetection of search keys enabled by default.
819
821
  To turn off use `autodetect_search_keys=False`:
820
822
 
821
823
  ```python
@@ -827,8 +829,8 @@ enricher = FeaturesEnricher(
827
829
  enricher.fit(X, y)
828
830
  ```
829
831
 
830
- ### Turn off removing of target outliers
831
- Upgini detect rows with target outlier for regression tasks. By default such rows are dropped on metrics calculation. To turn off removing of target outlier rows use parameter `remove_outliers_calc_metrics=False` in fit, fit_transform or calculate_metrics methods:
832
+ ### Turn off removal of target outliers
833
+ Upgini detects rows with target outliers for regression tasks. By default such rows are dropped during metrics calculation. To turn off the removal of target-outlier rows, use the `remove_outliers_calc_metrics=False` parameter in the fit, fit_transform, or calculate_metrics methods:
832
834
 
833
835
  ```python
834
836
  enricher = FeaturesEnricher(
@@ -838,8 +840,8 @@ enricher = FeaturesEnricher(
838
840
  enricher.fit(X, y, remove_outliers_calc_metrics=False)
839
841
  ```
840
842
 
841
- ### Turn off generating features on search keys
842
- Upgini tries to generate features on email, date and datetime search keys. By default this generation is enabled. To disable it use parameter `generate_search_key_features` of FeaturesEnricher constructor:
843
+ ### Turn off feature generation on search keys
844
+ Upgini attempts to generate features for email, date and datetime search keys. By default this generation is enabled. To disable it use the `generate_search_key_features` parameter of the FeaturesEnricher constructor:
843
845
 
844
846
  ```python
845
847
  enricher = FeaturesEnricher(
@@ -850,37 +852,37 @@ enricher = FeaturesEnricher(
850
852
 
851
853
  ## 🔑 Open up all capabilities of Upgini
852
854
 
853
- [Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600 mln+ phone numbers, 350 mln+ emails, 2^32 IP addresses
855
+ [Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600M+ phone numbers, 350M+ emails, 2^32 IP addresses
854
856
 
855
857
  |Benefit|No Sign-up | Registered user |
856
858
  |--|--|--|
857
859
  |Enrichment with **date/datetime, postal/ZIP code and country keys** | Yes | Yes |
858
- |Enrichment with **phone number, hashed email/HEM and IP-address keys** | No | Yes |
860
+ |Enrichment with **phone number, hashed email/HEM and IP address keys** | No | Yes |
859
861
  |Email notification on **search task completion** | No | Yes |
860
862
  |Automated **feature generation with LLMs** from columns in a search dataset| Yes, *till 12/05/23* | Yes |
861
863
  |Email notification on **new data source activation** 🔜 | No | Yes |
862
864
 
863
- ## 👩🏻‍💻 How to share data/features with a community ?
864
- You may publish ANY data which you consider as royalty / license free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications for **community usage**:
865
+ ## 👩🏻‍💻 How to share data/features with the community?
866
+ You may publish ANY data which you consider royalty- or license-free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications for **community usage**:
865
867
  1. Please Sign Up [here](https://profile.upgini.com)
866
- 2. Copy *Upgini API key* from profile and upload your data from Upgini python library with this key:
868
+ 2. Copy *Upgini API key* from your profile and upload your data from the Upgini Python library with this key:
867
869
  ```python
868
870
  import pandas as pd
869
871
  from upgini.metadata import SearchKey
870
872
  from upgini.ads import upload_user_ads
871
873
  import os
872
874
  os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
873
- #you can define custom search key which might not be supported yet, just use SearchKey.CUSTOM_KEY type
875
+ # you can define a custom search key that might not yet be supported; just use the SearchKey.CUSTOM_KEY type
874
876
  sample_df = pd.read_csv("path_to_data_sample_file")
875
877
  upload_user_ads("test", sample_df, {
876
878
  "city": SearchKey.CUSTOM_KEY,
877
879
  "stats_date": SearchKey.DATE
878
880
  })
879
881
  ```
880
- 3. After data verification, search results on community data will be available usual way.
882
+ 3. After data verification, search results on community data will be available in the usual way.
881
883
 
882
884
  ## 🛠 Getting Help & Community
883
- Please note, that we are still in a beta stage.
885
+ Please note that we are still in beta.
884
886
  Requests and support, in preferred order
885
887
  [![Claim help in slack](https://img.shields.io/badge/slack-@upgini-orange.svg?style=for-the-badge&logo=slack)](https://4mlg.short.gy/join-upgini-community)
886
888
  [![Open GitHub issue](https://img.shields.io/badge/open%20issue%20on-github-blue?style=for-the-badge&logo=github)](https://github.com/upgini/upgini/issues)
@@ -893,22 +895,22 @@ Requests and support, in preferred order
893
895
 
894
896
  ## 🧩 Contributing
895
897
  We are not a large team, so we probably won't be able to:
896
- - implement smooth integration with most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc. )
898
+ - implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.)
897
899
  - implement all possible data verification and normalization capabilities for different types of search keys
898
900
  And we need some help from the community!
899
901
 
900
- So, we'll be happy about every **pull request** you open and **issue** you find to make this library **more incredible**. Please note that it might sometimes take us a while to get back to you.
901
- **For major changes**, please open an issue first to discuss what you would like to change
902
+ So, we'll be happy about every **pull request** you open and every **issue** you report to make this library **even better**. Please note that it might sometimes take us a while to get back to you.
903
+ **For major changes**, please open an issue first to discuss what you would like to change.
902
904
  #### Developing
903
905
  Some convenient ways to start contributing are:
904
906
  ⚙️ [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini) You can remotely open this repo in VS Code without cloning or automatically clone and open it inside a docker container.
905
907
  ⚙️ **Gitpod** [![Gitpod Ready-to-Code](https://img.shields.io/badge/Gitpod-Ready--to--Code-blue?logo=gitpod)](https://gitpod.io/#https://github.com/upgini/upgini) You can use Gitpod to launch a fully functional development environment right in your browser.
906
908
 
907
909
  ## 🔗 Useful links
908
- - [Simple sales predictions as a template notebook](#-simple-sales-prediction-for-retail-stores)
910
+ - [Simple sales prediction template notebook](#-simple-sales-prediction-for-retail-stores)
909
911
  - [Full list of Kaggle Guides & Examples](https://www.kaggle.com/romaupgini/code)
910
912
  - [Project on PyPI](https://pypi.org/project/upgini)
911
913
  - [More perks for registered users](https://profile.upgini.com)
912
914
 
913
- <sup>😔 Found mistype or a bug in code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
915
+ <sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
914
916
  Please report it here</a></sup>