upgini 1.1.237a2__tar.gz → 1.1.238a1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

This version of upgini has been flagged as potentially problematic.

Files changed (83)
  1. upgini-1.1.238a1/PKG-INFO +832 -0
  2. {upgini-1.1.237a2 → upgini-1.1.238a1}/setup.py +1 -1
  3. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/operand.py +11 -1
  4. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/unary.py +6 -6
  5. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/data_source/data_source_publisher.py +7 -0
  6. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/dataset.py +0 -9
  7. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/features_enricher.py +12 -19
  8. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/http.py +5 -1
  9. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/resource_bundle/strings.properties +0 -1
  10. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/datetime_utils.py +3 -16
  11. upgini-1.1.238a1/src/upgini.egg-info/PKG-INFO +832 -0
  12. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini.egg-info/SOURCES.txt +0 -2
  13. upgini-1.1.237a2/PKG-INFO +0 -832
  14. upgini-1.1.237a2/src/upgini/fingerprint.js +0 -8
  15. upgini-1.1.237a2/src/upgini/utils/deduplicate_utils.py +0 -72
  16. upgini-1.1.237a2/src/upgini.egg-info/PKG-INFO +0 -832
  17. {upgini-1.1.237a2 → upgini-1.1.238a1}/LICENSE +0 -0
  18. {upgini-1.1.237a2 → upgini-1.1.238a1}/README.md +0 -0
  19. {upgini-1.1.237a2 → upgini-1.1.238a1}/pyproject.toml +0 -0
  20. {upgini-1.1.237a2 → upgini-1.1.238a1}/setup.cfg +0 -0
  21. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/__init__.py +0 -0
  22. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/ads.py +0 -0
  23. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/ads_management/__init__.py +0 -0
  24. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/ads_management/ads_manager.py +0 -0
  25. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/__init__.py +0 -0
  26. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/all_operands.py +0 -0
  27. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/binary.py +0 -0
  28. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/feature.py +0 -0
  29. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/groupby.py +0 -0
  30. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/autofe/vector.py +0 -0
  31. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/data_source/__init__.py +0 -0
  32. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/errors.py +0 -0
  33. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/mdc/__init__.py +0 -0
  34. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/mdc/context.py +0 -0
  35. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/metadata.py +0 -0
  36. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/metrics.py +0 -0
  37. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/normalizer/__init__.py +0 -0
  38. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/normalizer/phone_normalizer.py +0 -0
  39. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/resource_bundle/__init__.py +0 -0
  40. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/resource_bundle/exceptions.py +0 -0
  41. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/sampler/__init__.py +0 -0
  42. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/sampler/base.py +0 -0
  43. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/sampler/random_under_sampler.py +0 -0
  44. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/sampler/utils.py +0 -0
  45. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/search_task.py +0 -0
  46. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/spinner.py +0 -0
  47. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/__init__.py +0 -0
  48. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/base_search_key_detector.py +0 -0
  49. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/blocked_time_series.py +0 -0
  50. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/country_utils.py +0 -0
  51. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/custom_loss_utils.py +0 -0
  52. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/cv_utils.py +0 -0
  53. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/display_utils.py +0 -0
  54. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/email_utils.py +0 -0
  55. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/fallback_progress_bar.py +0 -0
  56. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/features_validator.py +0 -0
  57. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/format.py +0 -0
  58. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/ip_utils.py +0 -0
  59. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/phone_utils.py +0 -0
  60. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/postal_code_utils.py +0 -0
  61. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/progress_bar.py +0 -0
  62. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/sklearn_ext.py +0 -0
  63. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/target_utils.py +0 -0
  64. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/track_info.py +0 -0
  65. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/utils/warning_counter.py +0 -0
  66. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini/version_validator.py +0 -0
  67. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini.egg-info/dependency_links.txt +0 -0
  68. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini.egg-info/requires.txt +0 -0
  69. {upgini-1.1.237a2 → upgini-1.1.238a1}/src/upgini.egg-info/top_level.txt +0 -0
  70. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_binary_dataset.py +0 -0
  71. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_blocked_time_series.py +0 -0
  72. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_categorical_dataset.py +0 -0
  73. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_continuous_dataset.py +0 -0
  74. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_country_utils.py +0 -0
  75. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_custom_loss_utils.py +0 -0
  76. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_datetime_utils.py +0 -0
  77. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_email_utils.py +0 -0
  78. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_etalon_validation.py +0 -0
  79. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_features_enricher.py +0 -0
  80. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_metrics.py +0 -0
  81. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_phone_utils.py +0 -0
  82. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_postal_code_utils.py +0 -0
  83. {upgini-1.1.237a2 → upgini-1.1.238a1}/tests/test_widget.py +0 -0
@@ -0,0 +1,832 @@
1
+ Metadata-Version: 2.1
2
+ Name: upgini
3
+ Version: 1.1.238a1
4
+ Summary: Intelligent data search & enrichment for Machine Learning
5
+ Home-page: https://upgini.com/
6
+ Author: Upgini Developers
7
+ Author-email: madewithlove@upgini.com
8
+ License: BSD 3-Clause License
9
+ Project-URL: Bug Reports, https://github.com/upgini/upgini/issues
10
+ Project-URL: Source, https://github.com/upgini/upgini
11
+ Description:
12
+ <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : low-code feature search and enrichment library for machine learning </h2> -->
13
+ <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : Free automated data enrichment library for machine learning: </br>only the accuracy improving features in 2 minutes </h2> -->
14
+ <!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> • Free production-ready automated data enrichment library for machine learning</h2>-->
15
+ <h2 align="center"> <a href="https://upgini.com/">Upgini • Intelligent data search & enrichment for Machine Learning</a></h2>
16
+ <p align="center"> <b>Easily find and add relevant features to your ML pipeline from</br> hundreds of public, community and premium external data sources, </br>optimized for ML models with LLMs and other neural networks</b> </p>
17
+ <p align="center">
18
+ <br />
19
+ <a href="https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb"><strong>Quick Start in Colab »</strong></a> |
20
+ <!--<a href="https://upgini.com/">Upgini.com</a> |-->
21
+ <a href="https://profile.upgini.com">Register / Sign In</a> |
22
+ <!-- <a href="https://gitter.im/upgini/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link">Gitter Community</a> | -->
23
+ <a href="https://4mlg.short.gy/join-upgini-community">Slack Community</a> |
24
+ <a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose new Data source</strong></a>
25
+ </p>
26
+ <p align=center>
27
+ <a href="/LICENSE"><img alt="BSD-3 license" src="https://img.shields.io/badge/license-BSD--3%20Clause-green"></a>
28
+ <a href="https://pypi.org/project/upgini/"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/upgini"></a>
29
+ <a href="https://pypi.org/project/upgini/"><img alt="PyPI" src="https://img.shields.io/pypi/v/upgini?label=Release"></a>
30
+ <a href="https://pepy.tech/project/upgini"><img alt="Downloads" src="https://static.pepy.tech/badge/upgini"></a>
31
+ <a href="https://4mlg.short.gy/join-upgini-community"><img alt="Upgini slack community" src="https://img.shields.io/badge/slack-@upgini-orange.svg?logo=slack"></a>
32
+ </p>
33
+
34
+ <!--
35
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?logo=python&logoColor=white)](https://github.com/psf/black)
36
+
37
+ [![Gitter Сommunity](https://img.shields.io/badge/gitter-@upgini-teal.svg?logo=gitter)](https://gitter.im/upgini/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) -->
38
+ ## ❔ Overview
39
+
40
+ **Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of ML features from the source data using large language models (LLMs), GraphNNs and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data).
41
+
42
+ **Motivation:** for most supervised ML models, external data & features boost accuracy significantly better than any hyperparameter tuning. But the lack of automated, time-efficient enrichment tools for external data blocks the massive adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment to make external data a standard approach, just like hyperparameter tuning is today.
43
+
44
+ **Mission:** Democratize access to data sources for the data science community.
45
+
46
+ ## 🚀 Awesome features
47
+ ⭐️ Automatically find only relevant features that *give an accuracy improvement for your ML model* — not just features correlated with the target variable, which in 9 out of 10 cases gives zero accuracy improvement
48
+ ⭐️ Data source optimizations: automated feature generation with Large Language Models' data augmentation, RNNs, GraphNN; multiple data source ensembling
49
+ ⭐️ *Automatic search key augmentation* from all connected sources. If you do not have all search keys in your search request, such as postal/zip code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources
50
+ ⭐️ Calculate *accuracy metrics and uplift* after enriching an existing ML model with external features
51
+ ⭐️ Check the stability of the accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate risks of unstable external data dependencies in the ML pipeline
52
+ ⭐️ Easy to use - single request to enrich training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
53
+ <table>
54
+ <tr>
55
+ <td> date / datetime </td>
56
+ <td> phone number </td>
57
+ </tr>
58
+ <tr>
59
+ <td> postal / ZIP code </td>
60
+ <td> hashed email / HEM </td>
61
+ </tr>
62
+ <tr>
63
+ <td> country </td>
64
+ <td> IP-address </td>
65
+ </tr>
66
+ </table>
67
+
68
+ ⭐️ Scikit-learn compatible interface for quick data integration with existing ML pipelines
69
+ ⭐️ Support for most common supervised ML tasks on tabular data:
70
+ <table>
71
+ <tr>
72
+ <td><a href="https://en.wikipedia.org/wiki/Binary_classification">☑️ binary classification</a></td>
73
+ <td><a href="https://en.wikipedia.org/wiki/Multiclass_classification">☑️ multiclass classification</a></td>
74
+ </tr>
75
+ <tr>
76
+ <td><a href="https://en.wikipedia.org/wiki/Regression_analysis">☑️ regression</a></td>
77
+ <td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time series prediction</a></td>
78
+ </tr>
79
+ </table>
80
+
81
+ ⭐️ [Simple Drag & Drop Search UI](https://upgini.com/upgini-widget):
82
+ <a href="https://upgini.com/upgini-widget">
83
+ <img width="710" alt="Drag & Drop Search UI" src="https://github.com/upgini/upgini/assets/95645411/36b6460c-51f3-400e-9f04-445b938bf45e">
84
+ </a>
85
+
86
+
87
+ ## 🌎 Connected data sources and coverage
88
+
89
+ - **Public data** : public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
90
+ - **Community shared data**: royalty / license free datasets or features from the data science community (our users). Includes both public and scraped data
91
+ - **Premium data providers**: commercial data sources verified by the Upgini team in real-world use cases
92
+
93
+ 👉 [**Details on datasets and features**](https://upgini.com/#data_sources)
94
+ #### 📊 Total: **239 countries** and **up to 41 years** of history
95
+ |Data sources|Countries|History, years|# sources for ensemble|Update|Search keys|API Key required
96
+ |--|--|--|--|--|--|--|
97
+ |Historical weather & Climate normals | 68 |22|-|Monthly|date, country, postal/ZIP code|No
98
+ |Location/Places/POI/Area/Proximity information from OpenStreetMap | 221 |2|-|Monthly|date, country, postal/ZIP code|No
99
+ |International holidays & events, Workweek calendar| 232 |22|-|Monthly|date, country|No
100
+ |Consumer Confidence index| 44 |22|-|Monthly|date, country|No
101
+ |World economic indicators|191 |41|-|Monthly|date, country|No
102
+ |Markets data|-|17|-|Monthly|date, datetime|No
103
+ |World mobile & fixed broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No
104
+ |World demographic data |90|-|2|Annual|country, postal/ZIP code|No
105
+ |World house prices |44|-|3|Annual|country, postal/ZIP code|No
106
+ |Public social media profile data |104|-|-|Monthly|date, email/HEM, phone |Yes
107
+ |Car ownership data and Parking statistics|3|-|-|Annual|country, postal/ZIP code, email/HEM, phone|Yes
108
+ |Geolocation profile for phone & IPv4 & email|239|-|6|Monthly|date, email/HEM, phone, IPv4|Yes
109
+ |🔜 Email/WWW domain profile|-|-|-|-|-|-
110
+
111
+ ❓**Know other useful data sources for machine learning?** [Give us a hint and we'll add it for free](https://forms.gle/pH99gb5hPxBEfNdR7).
112
+
113
+
114
+ ## 💼 Tutorials
115
+
116
+ ### [Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
117
+
118
+ * The goal is to predict the salary for a data science job posting based on information about the employer and the job description.
119
+ * Following this guide, you'll learn how to **search & auto-generate new relevant features with the Upgini library**
120
+ * The evaluation metric is [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error).
121
+
122
+ Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb) inside your browser:
123
+
124
+ [![Open example in Google Colab](https://img.shields.io/badge/run_example_in-colab-blue?style=for-the-badge&logo=googlecolab)](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
125
+ &nbsp;
126
+ <!--
127
+ [![Open in Binder](https://img.shields.io/badge/run_example_in-mybinder-red.svg?style=for-the-badge&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF
6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/upgini/upgini/main?labpath=notebooks%2FUpgini_Features_search%26generation.ipynb)
128
+ &nbsp;
129
+ [![Open example in Gitpod](https://img.shields.io/badge/run_example_in-gitpod-orange?style=for-the-badge&logo=gitpod)](https://gitpod.io/#/github.com/upgini/upgini)
130
+ -->
131
+ ### ❓ [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
132
+
133
+ * The goal is to **predict future sales of different goods in stores** based on a 5-year history of sales.
134
+ * Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
135
+ <!--
136
+ Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) inside your browser:
137
+
138
+ [![Open example in Google Colab](https://img.shields.io/badge/run_example_in-colab-blue?style=for-the-badge&logo=googlecolab)](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
139
+ &nbsp;
140
+ [![Open in Binder](https://img.shields.io/badge/run_example_in-mybinder-red.svg?style=for-the-badge&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFkAAABZCAMAAABi1XidAAAB8lBMVEX///9XmsrmZYH1olJXmsr1olJXmsrmZYH1olJXmsr1olJXmsrmZYH1olL1olJXmsr1olJXmsrmZYH1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olJXmsrmZYH1olL1olL0nFf1olJXmsrmZYH1olJXmsq8dZb1olJXmsrmZYH1olJXmspXmspXmsr1olL1olJXmsrmZYH1olJXmsr1olL1olJXmsrmZYH1olL1olLeaIVXmsrmZYH1olL1olL1olJXmsrmZYH1olLna31Xmsr1olJXmsr1olJXmsrmZYH1olLqoVr1olJXmsr1olJXmsrmZYH1olL1olKkfaPobXvviGabgadXmsqThKuofKHmZ4Dobnr1olJXmsr1olJXmspXmsr1olJXmsrfZ4TuhWn1olL1olJXmsqBi7X1olJXmspZmslbmMhbmsdemsVfl8ZgmsNim8Jpk8F0m7R4m7F5nLB6jbh7jbiDirOEibOGnKaMhq+PnaCVg6qWg6qegKaff6WhnpKofKGtnomxeZy3noG6dZi+n3vCcpPDcpPGn3bLb4/Mb47UbIrVa4rYoGjdaIbeaIXhoWHmZYHobXvpcHjqdHXreHLroVrsfG/uhGnuh2bwj2Hxk17yl1vzmljzm1j0nlX1olL3AJXWAAAAbXRSTlMAEBAQHx8gICAuLjAwMDw9PUBAQEpQUFBXV1hgYGBkcHBwcXl8gICAgoiIkJCQlJicnJ2goKCmqK+wsLC4usDAwMjP0NDQ1NbW3Nzg4ODi5+3v8PDw8/T09PX29vb39/f5+fr7+/z8/Pz9/v7+zczCxgAABC5JREFUeAHN1ul3k0UUBvCb1CTVpmpaitAGSLSpSuKCLWpbTKNJFGlcSMAFF63iUmRccNG6gLbuxkXU66JAUef/9LSpmXnyLr3T5AO/rzl5zj137p136BISy44fKJXuGN/d19PUfYeO67Znqtf2KH33Id1psXoFdW30sPZ1sMvs2D060AHqws4FHeJojLZqnw53cmfvg+XR8mC0OEjuxrXEkX5ydeVJLVIlV0e10PXk5k7dYeHu7Cj1j+49uKg7uLU61tGLw1lq27ugQYlclHC4bgv7VQ+TAyj5Zc/UjsPvs1sd5cWryWObtvWT2EPa4rtnWW3JkpjggEpbOsPr7F7EyNewtpBIslA7p43HCsnwooXTEc3UmPmCNn5lrqTJxy6nRmcavGZVt/3Da2pD5NHvsOHJCrdc1G2r3DITpU7yic7w/7Rxnjc0kt5GC4djiv2Sz3Fb2iEZg41/ddsFDoyuYrIkmFehz0HR2thPgQqMyQYb2OtB0WxsZ3BeG3+wpRb1vzl2UYBog8FfGhttFKjtAclnZYrRo9ryG9uG/FZQU4AEg8ZE9LjGMzTmqKXPLnlWVnIlQQTvxJf8ip7VgjZjyVPrjw1te5otM7RmP7xm+sK2Gv9I8Gi++BRbEkR9EBw8zRUcKxwp73xkaLiqQb+kGduJTNHG72zcW9LoJgqQxpP3/Tj//c3yB0tqzaml05/+orHLksVO+95kX7/7qgJvnjlrfr2Ggsyx0eoy9uPzN5SPd86aXggOsEKW2Prz7du3VID3/tzs/sSRs2w7ovVHKtjrX2pd7ZMlTxAYfBAL9jiDwfLkq55Tm7ifhMlTGPyCAs7RFRhn47JnlcB9RM5T97ASuZXIcVNuUDIndpDbdsfrqsOppeXl5Y+XVKdjFCTh+zGaVuj0d9zy05PPK3QzBamxdwtTCrzyg/2Rvf2EstUjordGwa/kx9mSJLr8mLLtCW8HHGJc2R5hS219IiF
6PnTusOqcMl57gm0Z8kanKMAQg0qSyuZfn7zItsbGyO9QlnxY0eCuD1XL2ys/MsrQhltE7Ug0uFOzufJFE2PxBo/YAx8XPPdDwWN0MrDRYIZF0mSMKCNHgaIVFoBbNoLJ7tEQDKxGF0kcLQimojCZopv0OkNOyWCCg9XMVAi7ARJzQdM2QUh0gmBozjc3Skg6dSBRqDGYSUOu66Zg+I2fNZs/M3/f/Grl/XnyF1Gw3VKCez0PN5IUfFLqvgUN4C0qNqYs5YhPL+aVZYDE4IpUk57oSFnJm4FyCqqOE0jhY2SMyLFoo56zyo6becOS5UVDdj7Vih0zp+tcMhwRpBeLyqtIjlJKAIZSbI8SGSF3k0pA3mR5tHuwPFoa7N7reoq2bqCsAk1HqCu5uvI1n6JuRXI+S1Mco54YmYTwcn6Aeic+kssXi8XpXC4V3t7/ADuTNKaQJdScAAAAAElFTkSuQmCC)](https://mybinder.org/v2/gh/upgini/upgini/main?urlpath=notebooks%2Fnotebooks%2Fkaggle_example.ipynb)
141
+ &nbsp;
142
+ [![Open example in Gitpod](https://img.shields.io/badge/run_example_in-gitpod-orange?style=for-the-badge&logo=gitpod)](https://gitpod.io/#/github.com/upgini/upgini)
143
+ -->
144
+
145
+ ### ❓ [How to boost ML model accuracy for Kaggle TOP1 Leaderboard in 10 minutes](https://www.kaggle.com/code/romaupgini/more-external-features-for-top1-private-lb-4-54/notebook)
146
+
147
+ * The goal is **accuracy improvement for TOP1 winning Kaggle solution** from new relevant external features & data.
148
+ * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
149
+
150
+ ### ❓ [How to do low-code feature engineering for AutoML tools](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret/notebook)
151
+
152
+ * **Save time on feature search and engineering**. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box.
153
+ * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
154
+ * Low-code AutoML tools: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)
155
+
156
+ ### ❓ [How to improve accuracy of Multivariate Time Series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)
157
+
158
+ * The goal is **accuracy improvement of Multivariate Time Series prediction** from new relevant external features & data. The main challenge here is a strategy of data & feature enrichment, when a component of Multivariate TS depends not only on its past values but also has **some dependency on other components**.
159
+ * [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [RMSLE](https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days#Note-regarding-calculating-the-average).
160
+
161
+ ### ❓ [How to speed up feature engineering hypothesis tests with ready-to-use external features](https://www.kaggle.com/code/romaupgini/statement-dates-to-use-or-not-to-use/notebook)
162
+
163
+ * **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge here is a time-dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, a special data enrichment strategy is used.
164
+ * [Kaggle Competition](https://www.kaggle.com/competitions/amex-default-prediction) is a credit default prediction, evaluation metric is [normalized Gini coefficient](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464).
165
+
166
+ ## 🏁 Quick start
167
+
168
+ ### 1. Install from PyPI
169
+ ```python
170
+ %pip install upgini
171
+ ```
172
+ <details>
173
+ <summary>
174
+ 🐳 <b>Docker-way</b>
175
+ </summary>
176
+ </br>
177
Clone <i>$ git clone https://github.com/upgini/upgini</i> or download the upgini git repo locally </br>
178
and follow the steps below to build a docker container 👇 </br>
179
+ </br>
180
+ 1. Build docker image from cloned git repo:</br>
181
+ <i>cd upgini </br>
182
+ docker build -t upgini .</i></br>
183
+ </br>
184
+ ...or directly from GitHub:
185
+ </br>
186
+ <i>DOCKER_BUILDKIT=0 docker build -t upgini</i></br> <i>git@github.com:upgini/upgini.git#main</i></br>
187
+ </br>
188
+ 2. Run docker image:</br>
189
+ <i>
190
+ docker run -p 8888:8888 upgini</br>
191
+ </i></br>
192
3. Open http://localhost:8888?token=&lt;your_token_from_console_output&gt; in your browser
193
+ </details>
194
+
195
+
196
+ ### 2. 💡 Use your labeled training dataset for search
197
+
198
+ You can use your labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
199
+ - **[search keys](#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
200
+ - **labels** from the training dataset to estimate the relevancy of a feature or dataset for your ML task and to calculate feature importance metrics
201
+ - **your features** from the training dataset to find external datasets and features that improve accuracy on top of your existing data, and to estimate the accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
202
+
203
+
204
+ Load the training dataset into a pandas dataframe and separate the feature columns from the label column in a Scikit-learn way:
205
+ ```python
206
+ import pandas as pd
207
+ # labeled training dataset - customer_churn_prediction_train.csv
208
+ train_df = pd.read_csv("customer_churn_prediction_train.csv")
209
+ X = train_df.drop(columns="churn_flag")
210
+ y = train_df["churn_flag"]
211
+ ```
212
+ <table border=1 cellpadding=10><tr><td>
213
+ ⚠️ <b>Requirements for search initialization dataset</b>
214
+ <br>
215
+ We do dataset verification and cleaning under the hood, but there are still some requirements to follow:
216
+ <br>
217
+ 1. <b>pandas.DataFrame</b>, <b>pandas.Series</b> or <b>numpy.ndarray</b> representation;
218
+ <br>
219
+ 2. correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression;
220
+ <br>
221
+ 3. at least one column selected as a <a href="#-search-key-types-we-support-more-to-come">search key</a>;
222
+ <br>
223
+ 4. minimum size after deduplication by search key columns and NaN removal: <i>100 records</i>
224
+ </td></tr></table>
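The size requirement above can be pre-checked locally before initiating a search. A minimal sketch, assuming a pandas dataframe — the helper name and sample columns are illustrative, not part of the Upgini API:

```python
import pandas as pd

def rows_after_cleaning(df: pd.DataFrame, key_cols: list) -> int:
    """Rows remaining after dropping NaN search keys and deduplicating by them."""
    return len(df.dropna(subset=key_cols).drop_duplicates(subset=key_cols))

# toy example: 4 rows collapse to 2 after NaN removal and deduplication
df = pd.DataFrame({
    "hashed_email": ["a@x", "a@x", "b@x", None],
    "churn_flag": [0, 0, 1, 1],
})
n = rows_after_cleaning(df, ["hashed_email"])
print(n, "records; need at least 100 for search")  # 2 records here
```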
225
+
226
+ ### 3. 🔦 Choose one or multiple columns as search keys
227
+ *Search key* columns will be used to match records from all potential external data sources / features.
228
+ Define one or multiple columns as search keys when initializing the `FeaturesEnricher` class.
229
+ ```python
230
+ from upgini import FeaturesEnricher, SearchKey
231
+ enricher = FeaturesEnricher(
232
+ search_keys={
233
+ "subscription_activation_date": SearchKey.DATE,
234
+ "country": SearchKey.COUNTRY,
235
+ "zip_code": SearchKey.POSTAL_CODE,
236
+ "hashed_email": SearchKey.HEM,
237
+ "last_visit_ip_address": SearchKey.IP,
238
+ "registered_with_phone": SearchKey.PHONE
239
+ })
240
+ ```
241
+ #### ✨ Search key types we support (more to come!)
242
+ <table style="table-layout: fixed; text-align: left">
243
+ <tr>
244
+ <th> Search Key<br/>Meaning Type </th>
245
+ <th> Description </th>
246
+ <th> Allowed pandas dtypes (python types) </th>
247
+ <th> Example </th>
248
+ </tr>
249
+ <tr>
250
+ <td> SearchKey.EMAIL </td>
251
+ <td> e-mail </td>
252
+ <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
253
+ <td> <tt>support@upgini.com </tt> </td>
254
+ </tr>
255
+ <tr>
256
+ <td> SearchKey.HEM </td>
257
+ <td> <tt>sha256(lowercase(email)) </tt> </td>
258
+ <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
259
+ <td> <tt>0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955</tt> </td>
260
+ </tr>
261
+ <tr>
262
+ <td> SearchKey.IP </td>
263
+ <td> IP address (version 4) </td>
264
+ <td> <tt>object(str, ipaddress.IPv4Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
265
+ <td> <tt>192.168.0.1 </tt> </td>
266
+ </tr>
267
+ <tr>
268
+ <td> SearchKey.PHONE </td>
269
+ <td> phone number, <a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a> </td>
270
+ <td> <tt>object(str)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> <br/> <tt>float64</tt> </td>
271
+ <td> <tt>443451925138 </tt> </td>
272
+ </tr>
273
+ <tr>
274
+ <td> SearchKey.DATE </td>
275
+ <td> date </td>
276
+ <td>
277
+ <tt>object(str)</tt> <br/>
278
+ <tt>string</tt> <br/>
279
+ <tt>datetime64[ns]</tt> <br/>
280
+ <tt>period[D]</tt> <br/>
281
+ </td>
282
+ <td>
283
+ <tt>2020-02-12 </tt>&nbsp;(<a href="https://en.wikipedia.org/wiki/ISO_8601">ISO-8601 standard</a>)
284
+ <br/> <tt>12.02.2020 </tt>&nbsp;(non standard notation)
285
+ </td>
286
+ </tr>
287
+ <tr>
288
+ <td> SearchKey.DATETIME </td>
289
+ <td> datetime </td>
290
+ <td>
291
+ <tt>object(str)</tt> <br/>
292
+ <tt>string</tt> <br/>
293
+ <tt>datetime64[ns]</tt> <br/>
294
+ <tt>period[D]</tt> <br/>
295
+ </td>
296
+ <td> <tt>2020-02-12 12:46:18 </tt> <br/> <tt>12:46:18 12.02.2020 </tt> </td>
297
+ </tr>
298
+ <tr>
299
+ <td> SearchKey.COUNTRY </td>
300
+ <td> <a href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">Country ISO-3166 code</a>, Country name </td>
301
+ <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
302
+ <td> <tt>GB </tt> <br/> <tt>US </tt> <br/> <tt>IN </tt> </td>
303
+ </tr>
304
+ <tr>
305
+ <td> SearchKey.POSTAL_CODE </td>
306
+ <td> Postal code, a.k.a. ZIP code. Can only be used together with SearchKey.COUNTRY </td>
307
+ <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
308
+ <td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>
309
+ </tr>
310
+ </table>
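For reference, the <tt>SearchKey.HEM</tt> value in the table above is just the SHA-256 hash of the lowercased e-mail string; a minimal sketch of computing it yourself (the helper name is illustrative):

```python
import hashlib

def hashed_email(email: str) -> str:
    # HEM = sha256(lowercase(email)), hex-encoded
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

print(hashed_email("Support@Upgini.com"))  # 64-character hex digest
```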
311
+
312
313
+
314
+ For the meaning types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt>, you have to specify the date/datetime format by passing the <tt>date_format</tt> parameter to `FeaturesEnricher`. For example:
315
+ ```python
316
+ from upgini import FeaturesEnricher, SearchKey
317
+ enricher = FeaturesEnricher(
318
+ search_keys={
319
+ "subscription_activation_date": SearchKey.DATE,
320
+ "country": SearchKey.COUNTRY,
321
+ "zip_code": SearchKey.POSTAL_CODE,
322
+ "hashed_email": SearchKey.HEM,
323
+ "last_visit_ip_address": SearchKey.IP,
324
+ "registered_with_phone": SearchKey.PHONE
325
+ },
326
+ date_format = "%Y-%d-%m"
327
+ )
328
+ ```
329
+
330
To use datetimes in a timezone other than UTC, localize the datetime column to that timezone explicitly (example for Warsaw):
```python
df["date"] = pd.to_datetime(df["date"]).dt.tz_localize("Europe/Warsaw")
```

A single country for the whole training dataset can be passed with the `country_code` parameter:
```python
from upgini import FeaturesEnricher, SearchKey
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "zip_code": SearchKey.POSTAL_CODE,
    },
    country_code="US",
    date_format="%Y-%d-%m"
)
```
### 4. 🔍 Start your first feature search!
The main abstraction you interact with is `FeaturesEnricher`, a scikit-learn compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit` to search relevant datasets & features
- then `transform` to enrich your dataset with features from the search result

Let's try it out!
```python
import pandas as pd
from upgini import FeaturesEnricher, SearchKey

# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]

# now we're going to create a FeaturesEnricher instance
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE
    })

# everything is ready to fit! For 200k records fitting should take around 10 minutes;
# we'll send an email notification on completion, just register on profile.upgini.com
enricher.fit(X, y)
```

That's all! We've fitted `FeaturesEnricher`.
### 5. 📈 Evaluate feature importances (SHAP values) from the search result

`FeaturesEnricher` has two properties for feature importances, which are filled after fit - `feature_names_` and `feature_importances_`:
- `feature_names_` - feature names from the search result and, if the `keep_input=True` parameter was used, the initial columns from the search dataset as well
- `feature_importances_` - SHAP values for the features from the search result, in the same order as `feature_names_`
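
Since the two properties line up index-by-index, you can build a quick importance table yourself with pandas. A minimal sketch; the lists below are illustrative stand-ins for a fitted enricher's `feature_names_` and `feature_importances_` attributes:

```python
import pandas as pd

# illustrative stand-ins for enricher.feature_names_ and enricher.feature_importances_
feature_names_ = ["f_email_domain_score", "f_ip_risk", "zip_code"]
feature_importances_ = [0.42, 0.17, 0.03]

# pair each feature with its SHAP value and rank by importance
importances = (
    pd.DataFrame({"feature": feature_names_, "shap_value": feature_importances_})
    .sort_values("shap_value", ascending=False)
    .reset_index(drop=True)
)
```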

The `get_features_info()` method returns a pandas DataFrame with the features and full statistics after fit, including SHAP values and match rates:
```python
enricher.get_features_info()
```
Get more details about `FeaturesEnricher` at runtime using docstrings, via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.

### 6. 🏭 Enrich Production ML pipeline with relevant external features
`FeaturesEnricher` is a scikit-learn compatible estimator, so any pandas DataFrame can be enriched with external features from a search result (after `fit`).
Use the `transform` method of `FeaturesEnricher` and let the magic do the rest 🪄
```python
# load dataset for enrichment
test_x = pd.read_csv("test.csv")
# enrich it!
enriched_test_features = enricher.transform(test_x)
```
#### 6.1 Reuse completed search for enrichment without 'fit' run

`FeaturesEnricher` can be initiated with the `search_id` parameter of a completed search (after a `fit` call).
Just use `enricher.get_search_id()` or copy the search id string from the `fit()` output.
The search keys and features in X should be the same as for `fit()`:
```python
enricher = FeaturesEnricher(
    # same set of search keys as for the fit step
    search_keys={"date": SearchKey.DATE},
    api_key="<YOUR API_KEY>",  # if you fitted the enricher with an api_key, use it here as well
    search_id="abcdef00-0000-0000-0000-999999999999"
)
enriched_prod_dataframe = enricher.transform(input_dataframe)
```
#### 6.2 Enrichment with updated external data sources and features
For most ML cases, the training step requires a labeled dataset with historical observations from the past. But in production you'll need updated, current data sources and features to calculate a prediction.
When initiated with a set of search keys that includes `SearchKey.DATE`, `FeaturesEnricher` matches records from all potential external data sources **exactly on the specific date/datetime** from `SearchKey.DATE`, to avoid enrichment with features "from the future" during the `fit` step.
Then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features that are current for the present date.

⚠️ Initiate `FeaturesEnricher` with the `SearchKey.DATE` search key in the key set to get current features for production and avoid features from the future during training:
```python
enricher = FeaturesEnricher(
    search_keys={
        "subscription_activation_date": SearchKey.DATE,
        "country": SearchKey.COUNTRY,
        "zip_code": SearchKey.POSTAL_CODE,
    },
)
```

## 💻 How it works?

### 🧹 Search dataset validation
We validate and clean the search initialization dataset under the hood:

- check the format of your **search key** columns;
- check the label column for zero variance;
- check the dataset for full row duplicates; if we find any, we remove the duplicated rows and report the share of removed duplicates;
- check for inconsistent labels - rows with the same features and keys but different labels; we remove them and report the share of removed rows;
- remove columns with zero variance - we treat any non **search key** column in the search dataset as a feature, so columns with zero variance are removed.

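
On the pandas side, these checks roughly correspond to the following operations. A simplified sketch on a toy frame, not the library's actual validation code:

```python
import pandas as pd

df = pd.DataFrame({
    "date":    ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"],
    "feature": [1, 1, 2, 3, 3],
    "const":   [7, 7, 7, 7, 7],   # zero-variance column
    "label":   [0, 0, 1, 0, 1],   # the two 2020-01-03 rows are inconsistently labeled
})

# 1. drop full row duplicates
df = df.drop_duplicates()

# 2. drop inconsistent labels: same keys and features, but different labels
non_label_cols = [c for c in df.columns if c != "label"]
df = df[~df.duplicated(subset=non_label_cols, keep=False)]

# 3. drop zero-variance columns
df = df.drop(columns=[c for c in df.columns if df[c].nunique() <= 1])
```
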
### ❔ Supervised ML tasks detection
We detect the ML task under the hood, based on the label column values. Currently we support:
- ModelTaskType.BINARY
- ModelTaskType.MULTICLASS
- ModelTaskType.REGRESSION

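
As an illustration of the idea (a hypothetical heuristic, not Upgini's actual implementation), a task type could be inferred from label values roughly like this:

```python
import pandas as pd

def infer_task(y: pd.Series) -> str:
    # hypothetical heuristic for illustration only
    if y.nunique() == 2:
        return "BINARY"
    if y.dtype == object or y.nunique() <= 20:
        return "MULTICLASS"
    return "REGRESSION"
```
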
For certain search datasets you can also pass the correct ML task type to `FeaturesEnricher` explicitly:
```python
from upgini import ModelTaskType
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE},
    model_task_type=ModelTaskType.REGRESSION
)
```
#### ⏰ Time Series prediction support
*Time series prediction* is supported as a `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` task with a time-series-specific cross-validation split:
* [Scikit-learn time series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
* [Blocked time series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter

To initiate a feature search, pass the time-series-specific cross-validation type to `FeaturesEnricher`:
```python
from upgini.metadata import CVType
enricher = FeaturesEnricher(
    search_keys={"sales_date": SearchKey.DATE},
    cv=CVType.time_series
)
```
⚠️ **Pre-process the search dataset** in case of time series prediction:
sort rows according to observation order, in most cases ascending by date/datetime.
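
For example (assuming a `sales_date` key column):

```python
import pandas as pd

df = pd.DataFrame({
    "sales_date": pd.to_datetime(["2021-03-01", "2021-01-01", "2021-02-01"]),
    "sales": [30, 10, 20],
})

# order observations ascending by date before the search
df = df.sort_values("sales_date").reset_index(drop=True)
```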

### 🆙 Accuracy and uplift metrics calculations
`FeaturesEnricher` automatically calculates model metrics and uplift from the new relevant features, either via the `calculate_metrics()` method or with the `calculate_metrics=True` parameter in the `fit` or `fit_transform` methods (example below).
You can use any model estimator with a scikit-learn compatible interface, for example:
* [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)
* [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
* [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
* [CatBoost](https://catboost.ai/en/docs/concepts/python-quickstart)

<details>
<summary>
👈 Evaluation metric should be passed to <i>calculate_metrics()</i> via the <i>scoring</i> parameter;<br/>
out-of-the-box Upgini supports
</summary>
<table style="table-layout: fixed;">
  <tr>
    <th>Metric</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><tt>explained_variance</tt></td>
    <td>Explained variance regression score function</td>
  </tr>
  <tr>
    <td><tt>r2</tt></td>
    <td>R<sup>2</sup> (coefficient of determination) regression score function</td>
  </tr>
  <tr>
    <td><tt>max_error</tt></td>
    <td>Calculates the maximum residual error (negated, so greater is better)</td>
  </tr>
  <tr>
    <td><tt>median_absolute_error</tt></td>
    <td>Median absolute error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_absolute_error</tt></td>
    <td>Mean absolute error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_absolute_percentage_error</tt></td>
    <td>Mean absolute percentage error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_squared_error</tt></td>
    <td>Mean squared error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_squared_log_error</tt> (or aliases: <tt>msle</tt>, <tt>MSLE</tt>)</td>
    <td>Mean squared logarithmic error regression loss</td>
  </tr>
  <tr>
    <td><tt>root_mean_squared_log_error</tt> (or aliases: <tt>rmsle</tt>, <tt>RMSLE</tt>)</td>
    <td>Root mean squared logarithmic error regression loss</td>
  </tr>
  <tr>
    <td><tt>root_mean_squared_error</tt></td>
    <td>Root mean squared error regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_poisson_deviance</tt></td>
    <td>Mean Poisson deviance regression loss</td>
  </tr>
  <tr>
    <td><tt>mean_gamma_deviance</tt></td>
    <td>Mean Gamma deviance regression loss</td>
  </tr>
  <tr>
    <td><tt>accuracy</tt></td>
    <td>Accuracy classification score</td>
  </tr>
  <tr>
    <td><tt>top_k_accuracy</tt></td>
    <td>Top-k Accuracy classification score</td>
  </tr>
  <tr>
    <td><tt>roc_auc</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovr</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores (multi_class="ovr")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovo</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores (multi_class="ovo")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovr_weighted</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores (multi_class="ovr", average="weighted")</td>
  </tr>
  <tr>
    <td><tt>roc_auc_ovo_weighted</tt></td>
    <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores (multi_class="ovo", average="weighted")</td>
  </tr>
  <tr>
    <td><tt>balanced_accuracy</tt></td>
    <td>Compute the balanced accuracy</td>
  </tr>
  <tr>
    <td><tt>average_precision</tt></td>
    <td>Compute average precision (AP) from prediction scores</td>
  </tr>
  <tr>
    <td><tt>log_loss</tt></td>
    <td>Log loss, aka logistic loss or cross-entropy loss</td>
  </tr>
  <tr>
    <td><tt>brier_score</tt></td>
    <td>Compute the Brier score loss</td>
  </tr>
</table>
</details>

In addition to this list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
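
A sketch of such a custom scorer: the SMAPE formula is the standard one, but the `smape_scorer` name and the handling of zero denominators below are our own choices, not part of Upgini:

```python
import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    # count a 0/0 term as zero error
    terms = np.where(denom == 0, 0.0, np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom))
    return 100 * terms.mean()

# SMAPE is a loss, so lower is better
smape_scorer = make_scorer(smape, greater_is_better=False)
# then: enricher.calculate_metrics(scoring=smape_scorer)
```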

By default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the `cv = CVType.<cross-validation-split>` parameter.
But you can easily define a new split by passing a subclass of BaseCrossValidator to the `cv` parameter of `calculate_metrics()`.

An example with more tips-and-tricks:
```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# Fit with the default setup for metrics calculation
# CatBoost will be used
enricher.fit(X, y, eval_set=eval_set, calculate_metrics=True)

# LightGBM estimator for metrics
custom_estimator = LGBMRegressor()
enricher.calculate_metrics(estimator=custom_estimator)

# Custom metric function for the scoring param (callable or name)
custom_scoring = "RMSLE"
enricher.calculate_metrics(scoring=custom_scoring)

# Custom cross validator
custom_cv = TimeSeriesSplit(n_splits=5)
enricher.calculate_metrics(cv=custom_cv)

# All these custom parameters can be combined in fit, fit_transform and calculate_metrics:
enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
```

## ✅ More tips-and-tricks

### 🤖 Automated feature generation from columns in a search dataset

If the training dataset has a text column, you can generate additional embeddings from it using instructed embedding generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.

In most cases this gives better results than direct embedding generation from a text field. Currently, Upgini has two LLMs connected to the search engine - GPT-3.5 from OpenAI and GPT-J.

To use this feature, pass the column names as arguments to the `generate_features` parameter. You can use up to 2 columns.

Here's an example of generating features from the "description" and "summary" columns:

```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    generate_features=["description", "summary"]
)
```

With this code, Upgini will generate LLM embeddings from the text columns and then check them for predictive power for your ML task.

Finally, Upgini will return a dataset enriched with only the relevant components of the LLM embeddings.

### Find features that only give an accuracy gain on top of the existing data in the ML model

If you already have features or other external data sources, you can specifically search for new datasets & features that only give an accuracy gain "on top" of them.

Just leave all these existing features in the labeled training dataset, and the Upgini library will automatically use them during the feature search process and as a baseline ML model to calculate the accuracy metric uplift. Only features that improve accuracy will be returned.

### Check robustness of accuracy improvement from external features

You can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter:
```python
# load train dataset
train_df = pd.read_csv("train.csv")
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]

# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]
# create FeaturesEnricher
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})

# now we fit WITH the eval_set parameter to calculate accuracy metrics on the out-of-time dataset.
# the output will contain quality metrics for both the training dataset and
# the eval set (validation OOT dataset)
enricher.fit(
    train_ids_and_features,
    train_label,
    eval_set=[(eval_ids_and_features, eval_label)]
)
```
#### ⚠️ Requirements for out-of-time dataset
- Same data schema as for the search initialization dataset
- Pandas DataFrame representation
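
The schema requirement can be sanity-checked before fitting. An illustrative sketch with toy frames; the column names here are assumptions:

```python
import pandas as pd

train_X = pd.DataFrame({"registration_date": ["2020-01-01"], "spend": [10.0]})
eval_X = pd.DataFrame({"registration_date": ["2021-06-01"], "spend": [12.5]})

# same columns, same order, same dtypes as the training frame
schema_ok = (
    list(eval_X.columns) == list(train_X.columns)
    and (eval_X.dtypes == train_X.dtypes).all()
)
```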

### Use custom loss function in feature selection & metrics calculation

`FeaturesEnricher` can be initialized with the additional string parameter `loss`.
Depending on the ML task, you can use the following loss functions:
- `regression`: regression, regression_l1, huber, poisson, quantile, mape, gamma, tweedie;
- `binary`: binary;
- `multiclass`: multiclass, multiclassova.

For instance, if your target variable has a Poisson distribution (count of events, number of customers in the shop and so on), try `loss="poisson"` to improve the quality of feature selection and get better evaluation metrics.

Usage example:
```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    loss="poisson",
    model_task_type=ModelTaskType.REGRESSION
)
enricher.fit(X, y)
```

### Return initial dataframe enriched with TOP external features by importance

The `transform` and `fit_transform` methods of `FeaturesEnricher` can be used with two additional parameters:
- `importance_threshold`: float = 0 - only features with *importance >= threshold* will be added to the output dataframe
- `max_features`: int - only the TOP N features by importance will be returned, where *N = max_features*

And `keep_input=True` will keep all the initial columns from the search dataset X:
```python
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE}
)
enricher.fit_transform(X, y, keep_input=True, max_features=2)
```

### Exclude premium data sources from fit, transform and metrics calculation

The `fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` accept the `exclude_features_sources` parameter, which lets you exclude Trial or Paid features from Premium data sources:
```python
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE}
)
enricher.fit(X, y, calculate_metrics=False)
trial_features = enricher.get_features_info()[enricher.get_features_info()["Feature type"] == "Trial"]["Feature name"].values.tolist()
paid_features = enricher.get_features_info()[enricher.get_features_info()["Feature type"] == "Paid"]["Feature name"].values.tolist()
enricher.calculate_metrics(exclude_features_sources=(trial_features + paid_features))
enricher.transform(X, exclude_features_sources=(trial_features + paid_features))
```

### Turn off autodetection for search key columns
Upgini autodetects missing search keys by default.
To turn this off, use `detect_missing_search_keys=False`:

```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    detect_missing_search_keys=False,
)

enricher.fit(X, y)
```

### Turn off removal of target outliers
Upgini detects rows with target outliers for regression tasks. By default, such rows are dropped during metrics calculation. To keep them, use the `remove_outliers_calc_metrics=False` parameter in the `fit`, `fit_transform` or `calculate_metrics` methods:

```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
)

enricher.fit(X, y, remove_outliers_calc_metrics=False)
```

## 🔑 Open up all the capabilities of Upgini

[Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600M+ phone numbers, 350M+ emails, 2^32 IP addresses

|Benefit|No Sign-up | Registered user |
|--|--|--|
|Enrichment with **date/datetime, postal/ZIP code and country keys** | Yes | Yes |
|Enrichment with **phone number, hashed email/HEM and IP-address keys** | No | Yes |
|Email notification on **search task completion** | No | Yes |
|Automated **feature generation with LLMs** from columns in a search dataset| Yes, *till 12/05/23* | Yes |
|Email notification on **new data source activation** 🔜 | No | Yes |

## 👩🏻‍💻 How to share data/features with the community?
You may publish ANY data which you consider royalty/license free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications, for **community usage**:
1. Sign up [here](https://profile.upgini.com)
2. Copy the *Upgini API key* from your profile and upload your data from the Upgini python library with this key:
```python
import pandas as pd
from upgini import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key which might not be supported yet, just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY,
    "stats_date": SearchKey.DATE
})
```
3. After data verification, search results on the community data will be available the usual way.

## 🛠 Getting Help & Community
Please note that we are still in beta.
For requests and support, in preferred order:
[![Claim help in slack](https://img.shields.io/badge/slack-@upgini-orange.svg?style=for-the-badge&logo=slack)](https://4mlg.short.gy/join-upgini-community)
[![Open GitHub issue](https://img.shields.io/badge/open%20issue%20on-github-blue?style=for-the-badge&logo=github)](https://github.com/upgini/upgini/issues)

❗Please try to create bug reports that are:
- **reproducible** - include steps to reproduce the problem.
- **specific** - include as much detail as possible: which Python version, what environment, etc.
- **unique** - do not duplicate existing open issues.
- **scoped to a single bug** - one bug per report.

## 🧩 Contributing
We are a **very** small team and this is a part-time project for us, so most probably we won't be able to:
- implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.)
- implement all possible data verification and normalization capabilities for different types of search keys (we just started with the current 6 types)

So we need some help from the community!
We'll be happy about every **pull request** you open and every **issue** you find, to make this library **more incredible**. Please note that it might sometimes take us a while to get back to you.
**For major changes**, please open an issue first to discuss what you would like to change.
#### Developing
Some convenient ways to start contributing are:
⚙️ [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini) You can remotely open this repo in VS Code without cloning, or automatically clone and open it inside a docker container.
⚙️ **Gitpod** [![Gitpod Ready-to-Code](https://img.shields.io/badge/Gitpod-Ready--to--Code-blue?logo=gitpod)](https://gitpod.io/#https://github.com/upgini/upgini) You can use Gitpod to launch a fully functional development environment right in your browser.

## 🔗 Useful links
- [Simple sales predictions as a template notebook](#-simple-sales-predictions-use-as-a-template)
- [Full list of Kaggle Guides & Examples](https://www.kaggle.com/romaupgini/code)
- [Project on PyPI](https://pypi.org/project/upgini)
- [More perks for registered users](https://profile.upgini.com)

<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>

Keywords: data science,machine learning,data mining,automl,data search
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Customer Service
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.7,<3.11
Description-Content-Type: text/markdown