data-foundry 0.0.1__tar.gz → 0.0.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data_foundry-0.0.3/PKG-INFO +353 -0
- data_foundry-0.0.3/README.md +300 -0
- data_foundry-0.0.3/pyproject.toml +198 -0
- data_foundry-0.0.3/src/data_foundry/__init__.py +0 -0
- data_foundry-0.0.3/src/data_foundry/collections/__init__.py +33 -0
- data_foundry-0.0.3/src/data_foundry/collections/_core.py +265 -0
- data_foundry-0.0.3/src/data_foundry/collections/_registry.py +187 -0
- data_foundry-0.0.3/src/data_foundry/collections/_sources.py +246 -0
- data_foundry-0.0.3/src/data_foundry/curation_container.py +450 -0
- data_foundry-0.0.3/src/data_foundry/curation_recommendations.py +473 -0
- data_foundry-0.0.3/src/data_foundry/dataset_checks.py +443 -0
- data_foundry-0.0.3/src/data_foundry/examples/__init__.py +54 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/container_metadata.json +5 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dataset.parquet +0 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dataset_metadata.dataset-mold-v1.json +18 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dtypes.json +6 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/experiment_metadata.predictive-ml-splits-mold-v1.json +1 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/task_metadata.predictive-ml-task-mold-v1.json +11 -0
- data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/toy_extra.parquet +0 -0
- data_foundry-0.0.3/src/data_foundry/schema.py +469 -0
- data_foundry-0.0.3/src/data_foundry/utils/__init__.py +0 -0
- data_foundry-0.0.3/src/data_foundry/utils/checksum.py +51 -0
- data_foundry-0.0.1/LICENSE +0 -190
- data_foundry-0.0.1/PKG-INFO +0 -19
- data_foundry-0.0.1/README.md +0 -3
- data_foundry-0.0.1/pyproject.toml +0 -26
- data_foundry-0.0.1/setup.cfg +0 -4
- data_foundry-0.0.1/src/data_foundry/__init__.py +0 -1
- data_foundry-0.0.1/src/data_foundry.egg-info/PKG-INFO +0 -19
- data_foundry-0.0.1/src/data_foundry.egg-info/SOURCES.txt +0 -8
- data_foundry-0.0.1/src/data_foundry.egg-info/dependency_links.txt +0 -1
- data_foundry-0.0.1/src/data_foundry.egg-info/top_level.txt +0 -1
|
@@ -0,0 +1,353 @@
|
|
|
1
|
+
Metadata-Version: 2.3
|
|
2
|
+
Name: data-foundry
|
|
3
|
+
Version: 0.0.3
|
|
4
|
+
Summary: A schema and toolkit for curating tabular datasets and benchmarking tasks (the data layer behind TabArena).
|
|
5
|
+
Keywords: tabular,machine-learning,benchmark,datasets,data-curation,tabarena
|
|
6
|
+
Author: TabArena Maintainers
|
|
7
|
+
Author-email: TabArena Maintainers <mail@tabarena.ai>
|
|
8
|
+
License: Apache-2.0
|
|
9
|
+
Classifier: Development Status :: 3 - Alpha
|
|
10
|
+
Classifier: Intended Audience :: Science/Research
|
|
11
|
+
Classifier: Intended Audience :: Developers
|
|
12
|
+
Classifier: License :: OSI Approved :: Apache Software License
|
|
13
|
+
Classifier: Operating System :: OS Independent
|
|
14
|
+
Classifier: Programming Language :: Python
|
|
15
|
+
Classifier: Programming Language :: Python :: 3
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
19
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
20
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
21
|
+
Requires-Dist: pandas
|
|
22
|
+
Requires-Dist: numpy
|
|
23
|
+
Requires-Dist: pydantic
|
|
24
|
+
Requires-Dist: uuid6
|
|
25
|
+
Requires-Dist: pyarrow
|
|
26
|
+
Requires-Dist: huggingface-hub
|
|
27
|
+
Requires-Dist: autogluon ; extra == 'dev'
|
|
28
|
+
Requires-Dist: openml ; extra == 'dev'
|
|
29
|
+
Requires-Dist: ruff ; extra == 'dev'
|
|
30
|
+
Requires-Dist: pyyaml ; extra == 'dev'
|
|
31
|
+
Requires-Dist: seaborn ; extra == 'dev'
|
|
32
|
+
Requires-Dist: tueplots ; extra == 'dev'
|
|
33
|
+
Requires-Dist: tqdm ; extra == 'dev'
|
|
34
|
+
Requires-Dist: kaggle ; extra == 'dev'
|
|
35
|
+
Requires-Dist: langdetect ; extra == 'dev'
|
|
36
|
+
Requires-Dist: xlrd ; extra == 'dev'
|
|
37
|
+
Requires-Dist: scipy ; extra == 'dev'
|
|
38
|
+
Requires-Dist: polars ; extra == 'dev'
|
|
39
|
+
Requires-Dist: fastexcel ; extra == 'dev'
|
|
40
|
+
Requires-Dist: openpyxl ; extra == 'dev'
|
|
41
|
+
Requires-Dist: python-calamine ; extra == 'dev'
|
|
42
|
+
Requires-Dist: pytest ; extra == 'tests'
|
|
43
|
+
Requires-Dist: scikit-learn ; extra == 'tests'
|
|
44
|
+
Requires-Python: >=3.10
|
|
45
|
+
Project-URL: Homepage, https://github.com/TabArena/data-foundry
|
|
46
|
+
Project-URL: Repository, https://github.com/TabArena/data-foundry
|
|
47
|
+
Project-URL: Issues, https://github.com/TabArena/data-foundry/issues
|
|
48
|
+
Project-URL: BeyondArena Datasets, https://huggingface.co/datasets/TabArena/BeyondArena
|
|
49
|
+
Project-URL: TabArena, https://tabarena.ai/
|
|
50
|
+
Provides-Extra: dev
|
|
51
|
+
Provides-Extra: tests
|
|
52
|
+
Description-Content-Type: text/markdown
|
|
53
|
+
|
|
54
|
+
# Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets
|
|
55
|
+
|
|
56
|
+
---
|
|
57
|
+
|
|
58
|
+
| 📂 [Examples](examples) | 🧑‍🔬 [Contribute a Dataset](CONTRIBUTING_DATASETS.md) | 📄 [Paper (placeholder, coming soon)](#-citation) |
|
|
59
|
+
|:---:|:---:|:---:|
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
**Data Foundry** is the data layer behind the next generation of [TabArena](https://tabarena.ai/) datasets. It provides:
|
|
64
|
+
|
|
65
|
+
- A small, opinionated **schema** for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits — aligned with OpenML where possible, extended where it had to be.
|
|
66
|
+
- A **curation toolkit** (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator can turn a raw download into a reproducible artifact in one notebook.
|
|
67
|
+
- A **collections API** that pins datasets (defined by ``(unique_name, uuid)``) to immutable curated containers and resolves them against a local warehouse or directly against the [BeyondArena Datasets](https://huggingface.co/datasets/TabArena/BeyondArena).
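
To make the pinning idea concrete, here is a minimal sketch of resolving a dataset by name or UUID to a container directory. It assumes the `<root>/<unique_name>/<uuid>/` layout seen in the packaged toy container; `Pin`, `PinnedCollection`, and the `warehouse` root are hypothetical names, not the `data_foundry.collections` API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Pin:
    """Hypothetical (unique_name, uuid) pair identifying one immutable container."""

    unique_name: str
    uuid: str


class PinnedCollection:
    """Toy stand-in for a collection; not the data_foundry implementation."""

    def __init__(self, pins: list[Pin]) -> None:
        self._by_name = {pin.unique_name: pin for pin in pins}
        self._by_uuid = {pin.uuid: pin for pin in pins}

    def resolve(self, key: str) -> Pin:
        """Look a pin up by unique name first, then by UUID."""
        pin = self._by_name.get(key) or self._by_uuid.get(key)
        if pin is None:
            raise KeyError(f"unknown dataset: {key!r}")
        return pin

    def container_dir(self, root: Path, key: str) -> Path:
        """Assumed layout: <root>/<unique_name>/<uuid>/."""
        pin = self.resolve(key)
        return root / pin.unique_name / pin.uuid


TOY_UUID = "00000000-0000-7000-8000-000000000001"
TOY = PinnedCollection([Pin("toy_iid_dataset", TOY_UUID)])
print(TOY.container_dir(Path("warehouse"), "toy_iid_dataset"))
```

Because a pin never changes once published, the same key should always resolve to the same directory, whether that directory lives in a local warehouse or in a download cache.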
|
|
68
|
+
|
|
69
|
+
## ⚡ Quickstart
|
|
70
|
+
|
|
71
|
+
> [!TIP]
|
|
72
|
+
> Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.
|
|
73
|
+
|
|
74
|
+
```bash
|
|
75
|
+
pip install data-foundry
|
|
76
|
+
python examples/load_curated_container.py
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
```python
|
|
80
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
81
|
+
|
|
82
|
+
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
|
|
83
|
+
print(container.describe()) # full identity + dtypes + task + splits
|
|
84
|
+
print(container.dataset.shape) # the actual DataFrame
|
|
85
|
+
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
That is the core read API: load a container, then read its description, its data, and its split regime. See [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py) for a Random Forest benchmark on this data.
|
|
89
|
+
|
|
90
|
+
## 🕹️ Use Cases
|
|
91
|
+
|
|
92
|
+
<details>
|
|
93
|
+
<summary><b>🧪 Inspect a curated container offline</b> — no Hugging Face download required</summary>
|
|
94
|
+
|
|
95
|
+
The package ships a toy `CuratedContainer` so you can explore the full API (schema, dtypes, splits, `describe()`) without touching the network. Its interface is identical to that of a downloaded BeyondArena container.
|
|
96
|
+
|
|
97
|
+
```python
|
|
98
|
+
from data_foundry.curation_container import CuratedContainer
|
|
99
|
+
from data_foundry.examples import get_toy_container_path
|
|
100
|
+
|
|
101
|
+
container = CuratedContainer.load(get_toy_container_path())
|
|
102
|
+
print(container.describe()) # full identity + dtypes + task + splits
|
|
103
|
+
print(container.dataset.shape) # the actual DataFrame
|
|
104
|
+
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
Full inspection script (every metadata field printed): [`examples/load_curated_container.py`](examples/load_curated_container.py).
|
|
108
|
+
|
|
109
|
+
</details>
|
|
110
|
+
|
|
111
|
+
<details>
|
|
112
|
+
<summary><b>📦 Use one dataset</b> — IID and non-IID variants</summary>
|
|
113
|
+
|
|
114
|
+
Download a single BeyondArena container by name (or UUID) and iterate over its outer CV splits. The collection resolves the container against your local cache, so subsequent runs read from disk, not the network.
|
|
115
|
+
|
|
116
|
+
```python
|
|
117
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
118
|
+
|
|
119
|
+
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
|
|
120
|
+
df = container.dataset
|
|
121
|
+
target = container.task_metadata.target_column_name
|
|
122
|
+
|
|
123
|
+
for repeat_id, folds in container.experiment_metadata.splits.items():
|
|
124
|
+
for fold_id, (train_idx, test_idx) in folds.items():
|
|
125
|
+
X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
|
|
126
|
+
X_test, y_test = df.iloc[test_idx].drop(columns=target), df.iloc[test_idx][target]
|
|
127
|
+
# ... fit, evaluate ...
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
Full worked example (Random Forest, RMSE per fold, full metadata via `container.describe()`): [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py).
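
The loop above implies a nested mapping of the form `{repeat_id: {fold_id: (train_idx, test_idx)}}`. The sketch below validates such a mapping before it is used. Integer keys, positional row indices, and the helper name `count_valid_folds` are assumptions made for illustration; the package may store keys differently.

```python
# Assumed shape of the splits mapping: repeat -> fold -> (train rows, test rows).
Splits = dict[int, dict[int, tuple[list[int], list[int]]]]


def count_valid_folds(splits: Splits, n_rows: int) -> int:
    """Check every fold and return how many folds were checked."""
    n_folds = 0
    for repeat_id, folds in splits.items():
        for fold_id, (train_idx, test_idx) in folds.items():
            train, test = set(train_idx), set(test_idx)
            if train & test:
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: train/test overlap")
            if not (train | test) <= set(range(n_rows)):
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: index out of range")
            n_folds += 1
    return n_folds


toy_splits: Splits = {
    0: {
        0: ([2, 3, 4, 5], [0, 1]),
        1: ([0, 1, 4, 5], [2, 3]),
        2: ([0, 1, 2, 3], [4, 5]),
    },
}
print(count_valid_folds(toy_splits, n_rows=6))
```

A check like this is cheap and catches leakage between train and test rows before any model is fitted.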
|
|
131
|
+
|
|
132
|
+
**Split regimes.** BeyondArena ships datasets from three split regimes; the regime a dataset belongs to is exposed directly on `task_metadata`:
|
|
133
|
+
|
|
134
|
+
| Regime | Set on `PredictiveMLTaskMetadata` | Meaning |
|
|
135
|
+
|---|---|---|
|
|
136
|
+
| IID | neither `time_on` nor `group_on` | rows are independent; random / stratified splits |
|
|
137
|
+
| temporal non-IID | `time_on` set | rows ordered in time; future rows must not leak backwards |
|
|
138
|
+
| grouped non-IID | `group_on` set (+ `group_labels`) | all rows of a group stay together in one fold |
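
The table can be read as a small decision rule. The sketch below shows one way to derive the regime string from the two fields. `TaskSketch` is a hypothetical stand-in for `PredictiveMLTaskMetadata`, and raising when both fields are set is an assumption, since the README does not describe that combination.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSketch:
    """Hypothetical stand-in for the task metadata; only the two regime fields."""

    time_on: str | None = None   # column that orders rows in time, if any
    group_on: str | None = None  # column that defines groups, if any

    @property
    def split_regime(self) -> str:
        if self.time_on is not None and self.group_on is not None:
            # Assumption: this combination is not covered by the README.
            raise ValueError("both time_on and group_on are set")
        if self.time_on is not None:
            return "temporal_non_iid"
        if self.group_on is not None:
            return "grouped_non_iid"
        return "iid"


print(TaskSketch().split_regime)
print(TaskSketch(time_on="timestamp").split_regime)
print(TaskSketch(group_on="patient_id").split_regime)
```

The column names `timestamp` and `patient_id` are placeholders chosen for the demo.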
|
|
139
|
+
|
|
140
|
+
Side-by-side regime printout (one IID, two grouped variants — `per_group` vs `per_sample` — and one temporal): [`examples/data_foundry_data_regimes.py`](examples/data_foundry_data_regimes.py).
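
For intuition about the two non-IID constraints, here is a pure-Python sketch of a grouped split and a temporal holdout. Both helpers are illustrative and are not the package's recommended-split helpers; round-robin group assignment and a single time-ordered holdout are simplifying assumptions.

```python
def grouped_folds(groups: list[str], n_splits: int) -> dict[int, tuple[list[int], list[int]]]:
    """Assign whole groups to folds so no group is split across train and test."""
    fold_of_group = {g: i % n_splits for i, g in enumerate(sorted(set(groups)))}
    folds = {}
    for fold_id in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold_id]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold_id]
        folds[fold_id] = (train, test)
    return folds


def temporal_holdout(timestamps: list[float], test_size: float) -> tuple[list[int], list[int]]:
    """Train on the earliest rows and test on the latest ones."""
    order = sorted(range(len(timestamps)), key=timestamps.__getitem__)
    n_test = max(1, round(len(order) * test_size))
    cut = len(order) - n_test
    return order[:cut], order[cut:]


groups = ["a", "b", "a", "c", "b", "c"]
for fold_id, (train, test) in grouped_folds(groups, n_splits=3).items():
    print(fold_id, sorted({groups[i] for i in test}))

timestamps = [5.0, 1.0, 4.0, 2.0, 3.0, 6.0]
train_idx, test_idx = temporal_holdout(timestamps, test_size=1 / 3)
print(train_idx, test_idx)
```

In the grouped case each group appears in exactly one test fold. In the temporal case every training timestamp precedes every test timestamp, which is the "no leak from the future" rule in the table.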
|
|
141
|
+
|
|
142
|
+
</details>
|
|
143
|
+
|
|
144
|
+
<details>
|
|
145
|
+
<summary><b>🗂️ Use a collection of datasets</b> — pre-download all of BeyondArena</summary>
|
|
146
|
+
|
|
147
|
+
`BEYOND_ARENA.prefetch(...)` batches every container into a single Hugging Face `snapshot_download` call (one network round-trip for the whole collection). On a warm cache it skips importing `huggingface_hub` entirely.
|
|
148
|
+
|
|
149
|
+
```python
|
|
150
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
151
|
+
|
|
152
|
+
paths = BEYOND_ARENA.prefetch() # warms the cache once
|
|
153
|
+
for container in BEYOND_ARENA.iter_containers(): # now hits disk only
|
|
154
|
+
print(container.dataset_metadata.unique_name, container.dataset.shape)
|
|
155
|
+
```
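
The warm-cache behaviour can be pictured as a deferred import behind a cache check. The sketch below is a guess at the pattern, not the package's code: `ensure_cached`, `hub_download`, and `fake_download` are hypothetical names, and fetching the whole `TabArena/BeyondArena` repository in one call is an assumption. `snapshot_download` itself is a real `huggingface_hub` function, and it is never called in the offline demo.

```python
import tempfile
from pathlib import Path
from typing import Callable


def hub_download(local_dir: Path) -> None:
    """Cold-cache path: the heavy import happens only when this function runs."""
    from huggingface_hub import snapshot_download  # deferred on purpose

    # Assumption: fetch the whole dataset repo; the package may narrow this.
    snapshot_download(
        repo_id="TabArena/BeyondArena",
        repo_type="dataset",
        local_dir=str(local_dir),
    )


def ensure_cached(local_dir: Path, download: Callable[[Path], None] = hub_download) -> Path:
    """Return local_dir, calling download only when the cache is cold."""
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return local_dir  # warm cache: no download, no huggingface_hub import
    local_dir.mkdir(parents=True, exist_ok=True)
    download(local_dir)
    return local_dir


# Offline demo with a fake downloader standing in for the Hub.
calls: list[Path] = []


def fake_download(local_dir: Path) -> None:
    calls.append(local_dir)
    (local_dir / "dataset.parquet").write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "beyond_arena"
    ensure_cached(cache, fake_download)  # cold cache: downloads
    ensure_cached(cache, fake_download)  # warm cache: skipped
    print(len(calls))
```

Keeping the import inside the download function means that scripts which only read cached containers do not pay the import cost and do not need network access.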
|
|
156
|
+
|
|
157
|
+
Cache management:
|
|
158
|
+
|
|
159
|
+
```python
|
|
160
|
+
BEYOND_ARENA.clear_cache() # nuke this collection's subdir
|
|
161
|
+
BEYOND_ARENA.get_dataset(name, force_download=True) # re-fetch a single container
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
Full worked example with `tqdm` progress + checksum verification: [`examples/download_all_beyond_arena_datasets.py`](examples/download_all_beyond_arena_datasets.py). For a single dataset round-trip with checksum verification, see [`examples/download_beyond_arena_dataset.py`](examples/download_beyond_arena_dataset.py).
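
Checksum verification boils down to hashing a container directory deterministically and comparing the result with a recorded value. The sketch below uses SHA-256 over sorted relative paths and file bytes. That scheme and the name `directory_checksum` are assumptions; the algorithm in `data_foundry.utils.checksum` may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_checksum(root: Path) -> str:
    """Hash relative file paths and contents in sorted order for a stable digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset.parquet").write_bytes(b"rows")
    (root / "dtypes.json").write_text("{}", encoding="utf-8")
    before = directory_checksum(root)
    unchanged = directory_checksum(root)
    (root / "dtypes.json").write_text('{"x": "int64"}', encoding="utf-8")
    after = directory_checksum(root)

print(before == unchanged, before == after)
```

Sorting the paths makes the digest independent of filesystem iteration order, so the same files give the same checksum on every machine.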
|
|
165
|
+
|
|
166
|
+
</details>
|
|
167
|
+
|
|
168
|
+
<details>
|
|
169
|
+
<summary><b>🧑‍🔬 Curate a dataset</b> — turn a raw download into a CuratedContainer</summary>
|
|
170
|
+
|
|
171
|
+
End-to-end pipeline, condensed (the full runnable version is [`examples/curate_a_dataset.py`](examples/curate_a_dataset.py)):
|
|
172
|
+
|
|
173
|
+
```python
|
|
174
|
+
from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata
|
|
175
|
+
|
|
176
|
+
# --- Basic metadata
|
|
177
|
+
dataset_mold = DatasetMetadata(
|
|
178
|
+
unique_name="blood_transfusion",
|
|
179
|
+
dataset_year="2008",
|
|
180
|
+
domain_str="medical & healthcare",
|
|
181
|
+
dataset_source="UCI",
|
|
182
|
+
original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
|
|
183
|
+
download_description="""
|
|
184
|
+
We download the data from the UCI repository and unzip it to a predefined folder.
|
|
185
|
+
|
|
186
|
+
mkdir -p local-data-warehouse/blood_transfusion/ \\
|
|
187
|
+
&& wget -P local-data-warehouse/blood_transfusion/ \\
|
|
188
|
+
https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
|
|
189
|
+
&& unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
|
|
190
|
+
-d local-data-warehouse/blood_transfusion/
|
|
191
|
+
""",
|
|
192
|
+
academic_reference_bibtex="""@article{yeh2009knowledge,
|
|
193
|
+
title={Knowledge discovery on RFM model using Bernoulli sequence},
|
|
194
|
+
author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
|
|
195
|
+
journal={Expert Systems with applications},
|
|
196
|
+
volume={36}, number={3}, pages={5866--5871},
|
|
197
|
+
year={2009}, publisher={Elsevier},
|
|
198
|
+
}
|
|
199
|
+
""",
|
|
200
|
+
academic_reference_bibtex_key="yeh2009knowledge",
|
|
201
|
+
license="CC BY 4.0",
|
|
202
|
+
data_tags=["IID"],
|
|
203
|
+
curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
|
|
204
|
+
)
|
|
205
|
+
task_mold = PredictiveMLTaskMetadata(
|
|
206
|
+
target_column_name="DonatedBloodInMarch2007",
|
|
207
|
+
problem_type="binary_classification",
|
|
208
|
+
objective_metric_name="roc_auc",
|
|
209
|
+
stratify_on="DonatedBloodInMarch2007",
|
|
210
|
+
)
|
|
211
|
+
|
|
212
|
+
# --- Preprocessing
|
|
213
|
+
import pandas as pd
|
|
214
|
+
df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
|
|
215
|
+
df.columns = [
|
|
216
|
+
"MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
|
|
217
|
+
"MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
|
|
218
|
+
]
|
|
219
|
+
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
|
|
220
|
+
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
|
|
221
|
+
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
|
|
222
|
+
|
|
223
|
+
# --- Sanity checks
|
|
224
|
+
from data_foundry import dataset_checks
|
|
225
|
+
df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
|
|
226
|
+
data=df,
|
|
227
|
+
target_feature=task_mold.target_column_name,
|
|
228
|
+
problem_type=task_mold.problem_type,
|
|
229
|
+
)
|
|
230
|
+
|
|
231
|
+
# --- Outer CV splits
|
|
232
|
+
from data_foundry.curation_recommendations import (
|
|
233
|
+
get_recommended_iid_splits,
|
|
234
|
+
get_recommended_splits_dimensions,
|
|
235
|
+
)
|
|
236
|
+
|
|
237
|
+
n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
|
|
238
|
+
splits = get_recommended_iid_splits(
|
|
239
|
+
dataset=df,
|
|
240
|
+
n_repeats=n_repeats,
|
|
241
|
+
n_splits=n_splits,
|
|
242
|
+
test_size=test_size,
|
|
243
|
+
stratify_on=task_mold.stratify_on,
|
|
244
|
+
)
|
|
245
|
+
|
|
246
|
+
# --- Split metadata + container
|
|
247
|
+
from data_foundry.schema import PredictiveMLSplitsMetadata
|
|
248
|
+
from data_foundry.curation_container import CuratedContainer
|
|
249
|
+
|
|
250
|
+
splits_mold = PredictiveMLSplitsMetadata(
|
|
251
|
+
splits_comment="Default splits for IID data.",
|
|
252
|
+
splits=splits,
|
|
253
|
+
)
|
|
254
|
+
curated_data = CuratedContainer(
|
|
255
|
+
dataset=df,
|
|
256
|
+
dataset_metadata=dataset_mold,
|
|
257
|
+
task_metadata=task_mold,
|
|
258
|
+
experiment_metadata=splits_mold,
|
|
259
|
+
)
|
|
260
|
+
curated_data.save()
|
|
261
|
+
print(curated_data.uuid, curated_data.checksum)
|
|
262
|
+
```
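
To show what the outer CV step produces, here is a pure-Python sketch of repeated stratified k-fold splitting that returns the `{repeat_id: {fold_id: (train_idx, test_idx)}}` shape used above. It is not `get_recommended_iid_splits`: the function name, the `seed` parameter, the round-robin dealing, and the absence of a `test_size` argument are all simplifications made for illustration.

```python
import random
from collections import defaultdict


def stratified_repeated_kfold(
    labels: list[str], n_repeats: int, n_splits: int, seed: int = 0
) -> dict[int, dict[int, tuple[list[int], list[int]]]]:
    """Deal each class round-robin into folds, reshuffling once per repeat."""
    splits = {}
    for repeat_id in range(n_repeats):
        rng = random.Random(seed + repeat_id)
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)
        fold_of = {}
        for label in sorted(by_class):
            indices = by_class[label]
            rng.shuffle(indices)
            for pos, idx in enumerate(indices):
                fold_of[idx] = pos % n_splits  # keeps class ratios per fold
        splits[repeat_id] = {
            fold_id: (
                [i for i in range(len(labels)) if fold_of[i] != fold_id],
                [i for i in range(len(labels)) if fold_of[i] == fold_id],
            )
            for fold_id in range(n_splits)
        }
    return splits


labels = ["Yes"] * 4 + ["No"] * 8
splits = stratified_repeated_kfold(labels, n_repeats=2, n_splits=4)
for fold_id, (train_idx, test_idx) in splits[0].items():
    print(fold_id, sorted(labels[i] for i in test_idx))
```

With 4 `Yes` and 8 `No` rows and 4 folds, each test fold holds one `Yes` and two `No` rows, so the class ratio of the full dataset is preserved in every fold.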
|
|
263
|
+
|
|
264
|
+
For the contributor flow (where to put the notebook, how to open the PR, the `/new-dataset` Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
|
|
265
|
+
|
|
266
|
+
</details>
|
|
267
|
+
|
|
268
|
+
## 🪄 Installation
|
|
269
|
+
|
|
270
|
+
> [!IMPORTANT]
|
|
271
|
+
> Requires Python **3.10+**.
|
|
272
|
+
|
|
273
|
+
<details>
|
|
274
|
+
<summary><b>📦 From PyPI</b> — use Data Foundry as a library</summary>
|
|
275
|
+
|
|
276
|
+
```bash
|
|
277
|
+
pip install data-foundry
|
|
278
|
+
```
|
|
279
|
+
|
|
280
|
+
</details>
|
|
281
|
+
|
|
282
|
+
<details>
|
|
283
|
+
<summary><b>🌱 From source</b> — clone and install editable</summary>
|
|
284
|
+
|
|
285
|
+
```bash
|
|
286
|
+
git clone https://github.com/TabArena/data-foundry.git
|
|
287
|
+
cd data-foundry
|
|
288
|
+
uv pip install -e .
|
|
289
|
+
```
|
|
290
|
+
|
|
291
|
+
</details>
|
|
292
|
+
|
|
293
|
+
<details>
|
|
294
|
+
<summary><b>🛠️ Developer setup</b> — extras for curation, tests, and tooling</summary>
|
|
295
|
+
|
|
296
|
+
```bash
|
|
297
|
+
git clone https://github.com/TabArena/data-foundry.git
|
|
298
|
+
cd data-foundry
|
|
299
|
+
uv pip install -e ".[dev,tests]"
|
|
300
|
+
pytest # run the test suite
|
|
301
|
+
ruff check . && ruff format --check . # lint + format
|
|
302
|
+
```
|
|
303
|
+
|
|
304
|
+
The `dev` extra adds curation-time deps (`openml`, `kaggle`, `seaborn`, `polars`, etc.); `tests` adds `pytest` and `scikit-learn` (needed for the recommended-split helpers and examples).
|
|
305
|
+
|
|
306
|
+
</details>
|
|
307
|
+
|
|
308
|
+
## 🗂️ Repository Structure
|
|
309
|
+
|
|
310
|
+
```
|
|
311
|
+
data-foundry/
|
|
312
|
+
├── src/data_foundry/ # the package — schema, container, collections, checks, splits
|
|
313
|
+
│ ├── schema.py # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
|
|
314
|
+
│ ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
|
|
315
|
+
│ ├── collections/ # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
|
|
316
|
+
│ ├── curation_recommendations.py # recommended split helpers (IID, grouped, temporal)
|
|
317
|
+
│ ├── dataset_checks.py # run_all_checks(...) — sanity stats for the curation notebook
|
|
318
|
+
│ └── examples/toy_container/ # tiny ready-to-load CuratedContainer shipped in-package
|
|
319
|
+
├── datasets/ # curation notebooks
|
|
320
|
+
│ ├── _template/ # canonical notebook skeleton
|
|
321
|
+
│ ├── _dev/ # contributions land here first
|
|
322
|
+
│ ├── _maintenance/ # re-runs / fixes for already-released datasets
|
|
323
|
+
│ └── beyond_iid/ # promoted datasets — pinned by `final_uuid_list.py`
|
|
324
|
+
├── examples/ # runnable demos (covers the use-cases above)
|
|
325
|
+
├── scripts/ # one-off tooling (toy container builder)
|
|
326
|
+
│ └── beyond_arena/ # BeyondArena-specific scripts and outputs (warehouse stats, plots)
|
|
327
|
+
├── tests/ # pytest test suite
|
|
328
|
+
└── local-data-warehouse/ # gitignored — curators write raw + saved containers here
|
|
329
|
+
```
|
|
330
|
+
|
|
331
|
+
## 🧑‍🔬 Contributing a Dataset
|
|
332
|
+
|
|
333
|
+
The short version:
|
|
334
|
+
|
|
335
|
+
1. Copy [`datasets/_template/_template.ipynb`](datasets/_template/_template.ipynb)
|
|
336
|
+
to `datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb`.
|
|
337
|
+
2. Run the notebook end-to-end so the saved cells contain populated check
|
|
338
|
+
tables and the final `uuid` / `checksum`.
|
|
339
|
+
3. Open a PR — reviewers will move the notebook into the right
|
|
340
|
+
`beyond_iid/` subfolder and append the UUID to
|
|
341
|
+
[`datasets/beyond_iid/final_uuid_list.py`](datasets/beyond_iid/final_uuid_list.py).
|
|
342
|
+
|
|
343
|
+
The long version (field-by-field walkthrough, split-helper choice, dtype
|
|
344
|
+
gotchas, the `/new-dataset` Claude Code scaffolding skill): see
|
|
345
|
+
[**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
|
|
346
|
+
|
|
347
|
+
## 📄 Citation
|
|
348
|
+
|
|
349
|
+
**PLACEHOLDER**
|
|
350
|
+
|
|
351
|
+
```bibtex
|
|
352
|
+
PLACEHOLDER
|
|
353
|
+
```
|
|
@@ -0,0 +1,300 @@
|
|
|
1
|
+
# Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets
|
|
2
|
+
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
| 📂 [Examples](examples) | 🧑‍🔬 [Contribute a Dataset](CONTRIBUTING_DATASETS.md) | 📄 [Paper (placeholder, coming soon)](#-citation) |
|
|
6
|
+
|:---:|:---:|:---:|
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
**Data Foundry** is the data layer behind the next generation of [TabArena](https://tabarena.ai/) datasets. It provides:
|
|
11
|
+
|
|
12
|
+
- A small, opinionated **schema** for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits — aligned with OpenML where possible, extended where it had to be.
|
|
13
|
+
- A **curation toolkit** (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator can turn a raw download into a reproducible artifact in one notebook.
|
|
14
|
+
- A **collections API** that pins datasets (defined by ``(unique_name, uuid)``) to immutable curated containers and resolves them against a local warehouse or directly against the [BeyondArena Datasets](https://huggingface.co/datasets/TabArena/BeyondArena).
|
|
15
|
+
|
|
16
|
+
## ⚡ Quickstart
|
|
17
|
+
|
|
18
|
+
> [!TIP]
|
|
19
|
+
> Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
pip install data-foundry
|
|
23
|
+
python examples/load_curated_container.py
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
```python
|
|
27
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
28
|
+
|
|
29
|
+
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
|
|
30
|
+
print(container.describe()) # full identity + dtypes + task + splits
|
|
31
|
+
print(container.dataset.shape) # the actual DataFrame
|
|
32
|
+
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
That is the core read API: load a container, then read its description, its data, and its split regime. See [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py) for a Random Forest benchmark on this data.
|
|
36
|
+
|
|
37
|
+
## 🕹️ Use Cases
|
|
38
|
+
|
|
39
|
+
<details>
|
|
40
|
+
<summary><b>🧪 Inspect a curated container offline</b> — no Hugging Face download required</summary>
|
|
41
|
+
|
|
42
|
+
The package ships a toy `CuratedContainer` so you can explore the full API (schema, dtypes, splits, `describe()`) without touching the network. Its interface is identical to that of a downloaded BeyondArena container.
|
|
43
|
+
|
|
44
|
+
```python
|
|
45
|
+
from data_foundry.curation_container import CuratedContainer
|
|
46
|
+
from data_foundry.examples import get_toy_container_path
|
|
47
|
+
|
|
48
|
+
container = CuratedContainer.load(get_toy_container_path())
|
|
49
|
+
print(container.describe()) # full identity + dtypes + task + splits
|
|
50
|
+
print(container.dataset.shape) # the actual DataFrame
|
|
51
|
+
print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
Full inspection script (every metadata field printed): [`examples/load_curated_container.py`](examples/load_curated_container.py).
|
|
55
|
+
|
|
56
|
+
</details>
|
|
57
|
+
|
|
58
|
+
<details>
|
|
59
|
+
<summary><b>📦 Use one dataset</b> — IID and non-IID variants</summary>
|
|
60
|
+
|
|
61
|
+
Download a single BeyondArena container by name (or UUID) and iterate over its outer CV splits. The collection resolves the container against your local cache, so subsequent runs read from disk, not the network.
|
|
62
|
+
|
|
63
|
+
```python
|
|
64
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
65
|
+
|
|
66
|
+
container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
|
|
67
|
+
df = container.dataset
|
|
68
|
+
target = container.task_metadata.target_column_name
|
|
69
|
+
|
|
70
|
+
for repeat_id, folds in container.experiment_metadata.splits.items():
|
|
71
|
+
for fold_id, (train_idx, test_idx) in folds.items():
|
|
72
|
+
X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
|
|
73
|
+
X_test, y_test = df.iloc[test_idx].drop(columns=target), df.iloc[test_idx][target]
|
|
74
|
+
# ... fit, evaluate ...
|
|
75
|
+
```
|
|
76
|
+
|
|
77
|
+
Full worked example (Random Forest, RMSE per fold, full metadata via `container.describe()`): [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py).
|
|
78
|
+
|
|
79
|
+
**Split regimes.** BeyondArena ships datasets from three split regimes; the regime a dataset belongs to is exposed directly on `task_metadata`:
|
|
80
|
+
|
|
81
|
+
| Regime | Set on `PredictiveMLTaskMetadata` | Meaning |
|
|
82
|
+
|---|---|---|
|
|
83
|
+
| IID | neither `time_on` nor `group_on` | rows are independent; random / stratified splits |
|
|
84
|
+
| temporal non-IID | `time_on` set | rows ordered in time; future rows must not leak backwards |
|
|
85
|
+
| grouped non-IID | `group_on` set (+ `group_labels`) | all rows of a group stay together in one fold |
|
|
86
|
+
|
|
87
|
+
Side-by-side regime printout (one IID, two grouped variants — `per_group` vs `per_sample` — and one temporal): [`examples/data_foundry_data_regimes.py`](examples/data_foundry_data_regimes.py).
|
|
88
|
+
|
|
89
|
+
</details>
|
|
90
|
+
|
|
91
|
+
<details>
|
|
92
|
+
<summary><b>🗂️ Use a collection of datasets</b> — pre-download all of BeyondArena</summary>
|
|
93
|
+
|
|
94
|
+
`BEYOND_ARENA.prefetch(...)` batches every container into a single Hugging Face `snapshot_download` call (one network round-trip for the whole collection). On a warm cache it skips importing `huggingface_hub` entirely.
|
|
95
|
+
|
|
96
|
+
```python
|
|
97
|
+
from data_foundry.collections import BEYOND_ARENA
|
|
98
|
+
|
|
99
|
+
paths = BEYOND_ARENA.prefetch() # warms the cache once
|
|
100
|
+
for container in BEYOND_ARENA.iter_containers(): # now hits disk only
|
|
101
|
+
print(container.dataset_metadata.unique_name, container.dataset.shape)
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
Cache management:
|
|
105
|
+
|
|
106
|
+
```python
|
|
107
|
+
BEYOND_ARENA.clear_cache() # nuke this collection's subdir
|
|
108
|
+
BEYOND_ARENA.get_dataset(name, force_download=True) # re-fetch a single container
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
Full worked example with `tqdm` progress + checksum verification: [`examples/download_all_beyond_arena_datasets.py`](examples/download_all_beyond_arena_datasets.py). For a single dataset round-trip with checksum verification, see [`examples/download_beyond_arena_dataset.py`](examples/download_beyond_arena_dataset.py).
|
|
112
|
+
|
|
113
|
+
</details>
|
|
114
|
+
|
|
115
|
+
<details>
|
|
116
|
+
<summary><b>🧑‍🔬 Curate a dataset</b> — turn a raw download into a CuratedContainer</summary>
|
|
117
|
+
|
|
118
|
+
End-to-end pipeline, condensed (the full runnable version is [`examples/curate_a_dataset.py`](examples/curate_a_dataset.py)):
|
|
119
|
+
|
|
120
|
+
```python
|
|
121
|
+
from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata
|
|
122
|
+
|
|
123
|
+
# --- Basic metadata
|
|
124
|
+
dataset_mold = DatasetMetadata(
|
|
125
|
+
unique_name="blood_transfusion",
|
|
126
|
+
dataset_year="2008",
|
|
127
|
+
domain_str="medical & healthcare",
|
|
128
|
+
dataset_source="UCI",
|
|
129
|
+
original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
|
|
130
|
+
download_description="""
|
|
131
|
+
We download the data from the UCI repository and unzip it to a predefined folder.
|
|
132
|
+
|
|
133
|
+
mkdir -p local-data-warehouse/blood_transfusion/ \\
|
|
134
|
+
&& wget -P local-data-warehouse/blood_transfusion/ \\
|
|
135
|
+
https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
|
|
136
|
+
&& unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
|
|
137
|
+
-d local-data-warehouse/blood_transfusion/
|
|
138
|
+
""",
|
|
139
|
+
academic_reference_bibtex="""@article{yeh2009knowledge,
|
|
140
|
+
title={Knowledge discovery on RFM model using Bernoulli sequence},
|
|
141
|
+
author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
|
|
142
|
+
journal={Expert Systems with applications},
|
|
143
|
+
volume={36}, number={3}, pages={5866--5871},
|
|
144
|
+
year={2009}, publisher={Elsevier},
|
|
145
|
+
}
|
|
146
|
+
""",
|
|
147
|
+
academic_reference_bibtex_key="yeh2009knowledge",
|
|
148
|
+
license="CC BY 4.0",
|
|
149
|
+
data_tags=["IID"],
|
|
150
|
+
curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
|
|
151
|
+
)
|
|
152
|
+
task_mold = PredictiveMLTaskMetadata(
|
|
153
|
+
target_column_name="DonatedBloodInMarch2007",
|
|
154
|
+
problem_type="binary_classification",
|
|
155
|
+
objective_metric_name="roc_auc",
|
|
156
|
+
stratify_on="DonatedBloodInMarch2007",
|
|
157
|
+
)
|
|
158
|
+
|
|
159
|
+
# --- Preprocessing
|
|
160
|
+
import pandas as pd
|
|
161
|
+
df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
|
|
162
|
+
df.columns = [
|
|
163
|
+
"MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
|
|
164
|
+
"MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
|
|
165
|
+
]
|
|
166
|
+
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
|
|
167
|
+
df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
|
|
168
|
+
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
|
|
169
|
+
|
|
170
|
+
# --- Sanity checks
|
|
171
|
+
from data_foundry import dataset_checks
|
|
172
|
+
df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
|
|
173
|
+
data=df,
|
|
174
|
+
target_feature=task_mold.target_column_name,
|
|
175
|
+
problem_type=task_mold.problem_type,
|
|
176
|
+
)
|
|
177
|
+
|
|
178
|
+
# --- Outer CV splits
|
|
179
|
+
from data_foundry.curation_recommendations import (
|
|
180
|
+
get_recommended_iid_splits,
|
|
181
|
+
get_recommended_splits_dimensions,
|
|
182
|
+
)
|
|
183
|
+
|
|
184
|
+
n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
|
|
185
|
+
splits = get_recommended_iid_splits(
|
|
186
|
+
dataset=df,
|
|
187
|
+
n_repeats=n_repeats,
|
|
188
|
+
n_splits=n_splits,
|
|
189
|
+
test_size=test_size,
|
|
190
|
+
stratify_on=task_mold.stratify_on,
|
|
191
|
+
)
|
|
192
|
+
|
|
193
|
+
# --- Split metadata + container
|
|
194
|
+
from data_foundry.schema import PredictiveMLSplitsMetadata
|
|
195
|
+
from data_foundry.curation_container import CuratedContainer
|
|
196
|
+
|
|
197
|
+
splits_mold = PredictiveMLSplitsMetadata(
|
|
198
|
+
splits_comment="Default splits for IID data.",
|
|
199
|
+
splits=splits,
|
|
200
|
+
)
|
|
201
|
+
curated_data = CuratedContainer(
|
|
202
|
+
dataset=df,
|
|
203
|
+
dataset_metadata=dataset_mold,
|
|
204
|
+
task_metadata=task_mold,
|
|
205
|
+
experiment_metadata=splits_mold,
|
|
206
|
+
)
|
|
207
|
+
curated_data.save()
|
|
208
|
+
print(curated_data.uuid, curated_data.checksum)
|
|
209
|
+
```
|
|
210
|
+
|
|
211
|
+
For the contributor flow (where to put the notebook, how to open the PR, the `/new-dataset` Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
|
|
212
|
+
|
|
213
|
+
</details>
|
|
214
|
+
|
|
215
|
+
## 🪄 Installation
|
|
216
|
+
|
|
217
|
+
> [!IMPORTANT]
|
|
218
|
+
> Requires Python **3.10+**.
|
|
219
|
+
|
|
220
|
+
<details>
|
|
221
|
+
<summary><b>📦 From PyPI</b> — use Data Foundry as a library</summary>
|
|
222
|
+
|
|
223
|
+
```bash
|
|
224
|
+
pip install data-foundry
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
</details>
|
|
228
|
+
|
|
229
|
+
<details>
|
|
230
|
+
<summary><b>🌱 From source</b> — clone and install editable</summary>
|
|
231
|
+
|
|
232
|
+
```bash
|
|
233
|
+
git clone https://github.com/TabArena/data-foundry.git
|
|
234
|
+
cd data-foundry
|
|
235
|
+
uv pip install -e .
|
|
236
|
+
```
|
|
237
|
+
|
|
238
|
+
</details>
|
|
239
|
+
|
|
240
|
+
<details>
|
|
241
|
+
<summary><b>🛠️ Developer setup</b> — extras for curation, tests, and tooling</summary>
|
|
242
|
+
|
|
243
|
+
```bash
|
|
244
|
+
git clone https://github.com/TabArena/data-foundry.git
|
|
245
|
+
cd data-foundry
|
|
246
|
+
uv pip install -e ".[dev,tests]"
|
|
247
|
+
pytest # run the test suite
|
|
248
|
+
ruff check . && ruff format --check . # lint + format
|
|
249
|
+
```
|
|
250
|
+
|
|
251
|
+
The `dev` extra adds curation-time deps (`openml`, `kaggle`, `seaborn`, `polars`, etc.); `tests` adds `pytest` and `scikit-learn` (needed for the recommended-split helpers and examples).
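
To confirm that an extra is actually installed in the current environment, a small standard-library check can help. The sketch below is not part of Data Foundry: `EXTRA_MODULES` and `missing_modules` are hypothetical names, and the module lists only mirror the packages named above (the `dev` list is partial, since the README ends it with "etc.").

```python
from importlib.util import find_spec

# Import names for the packages named in the README (assumption: partial list
# for "dev"). Note that scikit-learn is imported as "sklearn".
EXTRA_MODULES = {
    "dev": ["openml", "kaggle", "seaborn", "polars"],
    "tests": ["pytest", "sklearn"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules of the given extra that cannot be found.

    find_spec() only locates a top-level module; it does not import it.
    """
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


if __name__ == "__main__":
    for extra in EXTRA_MODULES:
        missing = missing_modules(extra)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"[{extra}] {status}")
```

An empty list for `tests` suggests that `scikit-learn` is available for the recommended-split helpers.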
|
|
252
|
+
|
|
253
|
+
</details>
|
|
254
|
+
|
|
255
|
+
## 🗂️ Repository Structure
|
|
256
|
+
|
|
257
|
+
```
|
|
258
|
+
data-foundry/
|
|
259
|
+
├── src/data_foundry/ # the package — schema, container, collections, checks, splits
|
|
260
|
+
│ ├── schema.py # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
|
|
261
|
+
│ ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
|
|
262
|
+
│ ├── collections/ # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
|
|
263
|
+
│ ├── curation_recommendations.py # recommended split helpers (IID, grouped, temporal)
|
|
264
|
+
│ ├── dataset_checks.py # run_all_checks(...) — sanity stats for the curation notebook
|
|
265
|
+
│ └── examples/toy_container/ # tiny ready-to-load CuratedContainer shipped in-package
|
|
266
|
+
├── datasets/ # curation notebooks
|
|
267
|
+
│ ├── _template/ # canonical notebook skeleton
|
|
268
|
+
│ ├── _dev/ # contributions land here first
|
|
269
|
+
│ ├── _maintenance/ # re-runs / fixes for already-released datasets
|
|
270
|
+
│ └── beyond_iid/ # promoted datasets — pinned by `final_uuid_list.py`
|
|
271
|
+
├── examples/ # runnable demos covering the use cases above
|
|
272
|
+
├── scripts/ # one-off tooling (toy container builder)
|
|
273
|
+
│ └── beyond_arena/ # BeyondArena-specific scripts and outputs (warehouse stats, plots)
|
|
274
|
+
├── tests/ # pytest test suite
|
|
275
|
+
└── local-data-warehouse/ # gitignored — curators write raw + saved containers here
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
## 🧑‍🔬 Contributing a Dataset
|
|
279
|
+
|
|
280
|
+
The short version:
|
|
281
|
+
|
|
282
|
+
1. Copy [`datasets/_template/_template.ipynb`](datasets/_template/_template.ipynb)
|
|
283
|
+
to `datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb`.
|
|
284
|
+
2. Run the notebook end-to-end so the saved cells contain populated check
|
|
285
|
+
tables and the final `uuid` / `checksum`.
|
|
286
|
+
3. Open a PR — reviewers will move the notebook into the right
|
|
287
|
+
`beyond_iid/` subfolder and append the UUID to
|
|
288
|
+
[`datasets/beyond_iid/final_uuid_list.py`](datasets/beyond_iid/final_uuid_list.py).
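
Step 1 can be scripted with the standard library. This is a minimal sketch, not a tool shipped with the repository: `scaffold_notebook` is a hypothetical helper, and it assumes it is given the root of a `data-foundry` checkout laid out as described above.

```python
import shutil
from pathlib import Path


def scaffold_notebook(repo_root: Path, topic: str, unique_name: str) -> Path:
    """Copy the template notebook to
    datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb and return the new path.
    """
    template = repo_root / "datasets" / "_template" / "_template.ipynb"
    target_dir = repo_root / "datasets" / "_dev" / topic / unique_name
    target = target_dir / f"{unique_name}.ipynb"

    # Refuse to overwrite an existing notebook.
    if target.exists():
        raise FileExistsError(f"{target} already exists; choose another unique_name")

    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return target
```

For example, `scaffold_notebook(Path("."), "my_topic", "my_dataset")` run from the repository root would create `datasets/_dev/my_topic/my_dataset/my_dataset.ipynb`; both names here are placeholders.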
|
|
289
|
+
|
|
290
|
+
The long version (field-by-field walkthrough, split-helper choice, dtype
|
|
291
|
+
gotchas, the `/new-dataset` Claude Code scaffolding skill): see
|
|
292
|
+
[**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
|
|
293
|
+
|
|
294
|
+
## 📄 Citation
|
|
295
|
+
|
|
296
|
+
**PLACEHOLDER**
|
|
297
|
+
|
|
298
|
+
```bibtex
|
|
299
|
+
PLACEHOLDER
|
|
300
|
+
```
|