data-foundry 0.0.1__tar.gz → 0.0.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. data_foundry-0.0.3/PKG-INFO +353 -0
  2. data_foundry-0.0.3/README.md +300 -0
  3. data_foundry-0.0.3/pyproject.toml +198 -0
  4. data_foundry-0.0.3/src/data_foundry/__init__.py +0 -0
  5. data_foundry-0.0.3/src/data_foundry/collections/__init__.py +33 -0
  6. data_foundry-0.0.3/src/data_foundry/collections/_core.py +265 -0
  7. data_foundry-0.0.3/src/data_foundry/collections/_registry.py +187 -0
  8. data_foundry-0.0.3/src/data_foundry/collections/_sources.py +246 -0
  9. data_foundry-0.0.3/src/data_foundry/curation_container.py +450 -0
  10. data_foundry-0.0.3/src/data_foundry/curation_recommendations.py +473 -0
  11. data_foundry-0.0.3/src/data_foundry/dataset_checks.py +443 -0
  12. data_foundry-0.0.3/src/data_foundry/examples/__init__.py +54 -0
  13. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/container_metadata.json +5 -0
  14. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dataset.parquet +0 -0
  15. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dataset_metadata.dataset-mold-v1.json +18 -0
  16. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/dtypes.json +6 -0
  17. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/experiment_metadata.predictive-ml-splits-mold-v1.json +1 -0
  18. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/task_metadata.predictive-ml-task-mold-v1.json +11 -0
  19. data_foundry-0.0.3/src/data_foundry/examples/toy_container/toy_iid_dataset/00000000-0000-7000-8000-000000000001/toy_extra.parquet +0 -0
  20. data_foundry-0.0.3/src/data_foundry/schema.py +469 -0
  21. data_foundry-0.0.3/src/data_foundry/utils/__init__.py +0 -0
  22. data_foundry-0.0.3/src/data_foundry/utils/checksum.py +51 -0
  23. data_foundry-0.0.1/LICENSE +0 -190
  24. data_foundry-0.0.1/PKG-INFO +0 -19
  25. data_foundry-0.0.1/README.md +0 -3
  26. data_foundry-0.0.1/pyproject.toml +0 -26
  27. data_foundry-0.0.1/setup.cfg +0 -4
  28. data_foundry-0.0.1/src/data_foundry/__init__.py +0 -1
  29. data_foundry-0.0.1/src/data_foundry.egg-info/PKG-INFO +0 -19
  30. data_foundry-0.0.1/src/data_foundry.egg-info/SOURCES.txt +0 -8
  31. data_foundry-0.0.1/src/data_foundry.egg-info/dependency_links.txt +0 -1
  32. data_foundry-0.0.1/src/data_foundry.egg-info/top_level.txt +0 -1
@@ -0,0 +1,353 @@
1
+ Metadata-Version: 2.3
2
+ Name: data-foundry
3
+ Version: 0.0.3
4
+ Summary: A schema and toolkit for curating tabular datasets and benchmarking tasks (the data layer behind TabArena).
5
+ Keywords: tabular,machine-learning,benchmark,datasets,data-curation,tabarena
6
+ Author: TabArena Maintainers
7
+ Author-email: TabArena Maintainers <mail@tabarena.ai>
8
+ License: Apache-2.0
9
+ Classifier: Development Status :: 3 - Alpha
10
+ Classifier: Intended Audience :: Science/Research
11
+ Classifier: Intended Audience :: Developers
12
+ Classifier: License :: OSI Approved :: Apache Software License
13
+ Classifier: Operating System :: OS Independent
14
+ Classifier: Programming Language :: Python
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Dist: pandas
22
+ Requires-Dist: numpy
23
+ Requires-Dist: pydantic
24
+ Requires-Dist: uuid6
25
+ Requires-Dist: pyarrow
26
+ Requires-Dist: huggingface-hub
27
+ Requires-Dist: autogluon ; extra == 'dev'
28
+ Requires-Dist: openml ; extra == 'dev'
29
+ Requires-Dist: ruff ; extra == 'dev'
30
+ Requires-Dist: pyyaml ; extra == 'dev'
31
+ Requires-Dist: seaborn ; extra == 'dev'
32
+ Requires-Dist: tueplots ; extra == 'dev'
33
+ Requires-Dist: tqdm ; extra == 'dev'
34
+ Requires-Dist: kaggle ; extra == 'dev'
35
+ Requires-Dist: langdetect ; extra == 'dev'
36
+ Requires-Dist: xlrd ; extra == 'dev'
37
+ Requires-Dist: scipy ; extra == 'dev'
38
+ Requires-Dist: polars ; extra == 'dev'
39
+ Requires-Dist: fastexcel ; extra == 'dev'
40
+ Requires-Dist: openpyxl ; extra == 'dev'
41
+ Requires-Dist: python-calamine ; extra == 'dev'
42
+ Requires-Dist: pytest ; extra == 'tests'
43
+ Requires-Dist: scikit-learn ; extra == 'tests'
44
+ Requires-Python: >=3.10
45
+ Project-URL: Homepage, https://github.com/TabArena/data-foundry
46
+ Project-URL: Repository, https://github.com/TabArena/data-foundry
47
+ Project-URL: Issues, https://github.com/TabArena/data-foundry/issues
48
+ Project-URL: BeyondArena Datasets, https://huggingface.co/datasets/TabArena/BeyondArena
49
+ Project-URL: TabArena, https://tabarena.ai/
50
+ Provides-Extra: dev
51
+ Provides-Extra: tests
52
+ Description-Content-Type: text/markdown
53
+
54
+ # Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets
55
+
56
+ ---
57
+
58
+ | 📂 [Examples](examples) | 🧑‍🔬 [Contribute a Dataset](CONTRIBUTING_DATASETS.md) | 📄 [Paper (placeholder — coming soon)](#-citation) |
59
+ |:---:|:---:|:---:|
60
+
61
+ ---
62
+
63
+ **Data Foundry** is the data layer behind the next generation of [TabArena](https://tabarena.ai/) datasets. It provides:
64
+
65
+ - A small, opinionated **schema** for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits — aligned with OpenML where possible, extended where it had to be.
66
+ - A **curation toolkit** (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator turns a raw download into a reproducible artifact in one notebook.
67
+ - A **collections API** that pins datasets (defined by ``(unique_name, uuid)``) to immutable curated containers and resolves them against a local warehouse or directly against the [BeyondArena Datasets](https://huggingface.co/datasets/TabArena/BeyondArena).
68
+
69
+ ## ⚡ Quickstart
70
+
71
+ > [!TIP]
72
+ > Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.
73
+
74
+ ```bash
75
+ pip install data-foundry
76
+ python examples/load_curated_container.py
77
+ ```
78
+
79
+ ```python
80
+ from data_foundry.collections import BEYOND_ARENA
81
+
82
+ container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
83
+ print(container.describe()) # full identity + dtypes + task + splits
84
+ print(container.dataset.shape) # the actual DataFrame
85
+ print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
86
+ ```
87
+
88
+ That's the whole API surface in three lines. See [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py) for benchmarking Random Forest on the data!
89
+
90
+ ## 🕹️ Use Cases
91
+
92
+ <details>
93
+ <summary><b>🧪 Inspect a curated container offline</b> — no Hugging Face download required</summary>
94
+
95
+ The package ships a toy `CuratedContainer` so you can poke at the full API — schema, dtypes, splits, `describe()` — without touching the network. Identical interface to a downloaded BeyondArena container.
96
+
97
+ ```python
98
+ from data_foundry.curation_container import CuratedContainer
99
+ from data_foundry.examples import get_toy_container_path
100
+
101
+ container = CuratedContainer.load(get_toy_container_path())
102
+ print(container.describe()) # full identity + dtypes + task + splits
103
+ print(container.dataset.shape) # the actual DataFrame
104
+ print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
105
+ ```
106
+
107
+ Full inspection script (every metadata field printed): [`examples/load_curated_container.py`](examples/load_curated_container.py).
108
+
109
+ </details>
110
+
111
+ <details>
112
+ <summary><b>📦 Use one dataset</b> — IID and non-IID variants</summary>
113
+
114
+ Download a single BeyondArena container by name (or UUID) and iterate its outer CV splits. The collection resolves the container against your local cache; subsequent runs hit disk, not the network.
115
+
116
+ ```python
117
+ from data_foundry.collections import BEYOND_ARENA
118
+
119
+ container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
120
+ df = container.dataset
121
+ target = container.task_metadata.target_column_name
122
+
123
+ for repeat_id, folds in container.experiment_metadata.splits.items():
124
+     for fold_id, (train_idx, test_idx) in folds.items():
125
+         X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
126
+         X_test, y_test = df.iloc[test_idx].drop(columns=target), df.iloc[test_idx][target]
127
+         # ... fit, evaluate ...
128
+ ```
129
+
130
+ Full worked example (Random Forest, RMSE per fold, full metadata via `container.describe()`): [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py).
131
+
132
+ **Split regimes.** BeyondArena ships datasets from three regimes — which one a dataset is in shows up directly on `task_metadata`:
133
+
134
+ | Regime | Set on `PredictiveMLTaskMetadata` | Meaning |
135
+ |---|---|---|
136
+ | IID | neither `time_on` nor `group_on` | rows are independent; random / stratified splits |
137
+ | temporal non-IID | `time_on` set | rows ordered in time; future rows must not leak backwards |
138
+ | grouped non-IID | `group_on` set (+ `group_labels`) | all rows of a group stay together in one fold |
139
+
140
+ Side-by-side regime printout (one IID, two grouped variants — `per_group` vs `per_sample` — and one temporal): [`examples/data_foundry_data_regimes.py`](examples/data_foundry_data_regimes.py).
141
+
142
+ </details>
143
+
144
+ <details>
145
+ <summary><b>🗂️ Use a collection of datasets</b> — pre-download all of BeyondArena</summary>
146
+
147
+ `BEYOND_ARENA.prefetch(...)` batches every container into a single Hugging Face `snapshot_download` call (one network round-trip for the whole collection). On a warm cache it skips importing `huggingface_hub` entirely.
148
+
149
+ ```python
150
+ from data_foundry.collections import BEYOND_ARENA
151
+
152
+ paths = BEYOND_ARENA.prefetch() # warms the cache once
153
+ for container in BEYOND_ARENA.iter_containers(): # now hits disk only
154
+     print(container.dataset_metadata.unique_name, container.dataset.shape)
155
+ ```
156
+
157
+ Cache management:
158
+
159
+ ```python
160
+ BEYOND_ARENA.clear_cache() # nuke this collection's subdir
161
+ BEYOND_ARENA.get_dataset(name, force_download=True) # re-fetch a single container
162
+ ```
163
+
164
+ Full worked example with `tqdm` progress + checksum verification: [`examples/download_all_beyond_arena_datasets.py`](examples/download_all_beyond_arena_datasets.py). For a single dataset round-trip with checksum verification, see [`examples/download_beyond_arena_dataset.py`](examples/download_beyond_arena_dataset.py).
165
+
166
+ </details>
167
+
168
+ <details>
169
+ <summary><b>🧑‍🔬 Curate a dataset</b> — turn a raw download into a CuratedContainer</summary>
170
+
171
+ End-to-end pipeline, condensed (the full runnable version is [`examples/curate_a_dataset.py`](examples/curate_a_dataset.py)):
172
+
173
+ ```python
174
+ from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata
175
+
176
+ # --- Basic metadata
177
+ dataset_mold = DatasetMetadata(
178
+ unique_name="blood_transfusion",
179
+ dataset_year="2008",
180
+ domain_str="medical & healthcare",
181
+ dataset_source="UCI",
182
+ original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
183
+ download_description="""
184
+ We download the data from the UCI repository and unzip it to a predefined folder.
185
+
186
+ mkdir -p local-data-warehouse/blood_transfusion/ \\
187
+ && wget -P local-data-warehouse/blood_transfusion/ \\
188
+ https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
189
+ && unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
190
+ -d local-data-warehouse/blood_transfusion/
191
+ """,
192
+ academic_reference_bibtex="""@article{yeh2009knowledge,
193
+ title={Knowledge discovery on RFM model using Bernoulli sequence},
194
+ author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
195
+ journal={Expert Systems with applications},
196
+ volume={36}, number={3}, pages={5866--5871},
197
+ year={2009}, publisher={Elsevier},
198
+ }
199
+ """,
200
+ academic_reference_bibtex_key="yeh2009knowledge",
201
+ license="CC BY 4.0",
202
+ data_tags=["IID"],
203
+ curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
204
+ )
205
+ task_mold = PredictiveMLTaskMetadata(
206
+ target_column_name="DonatedBloodInMarch2007",
207
+ problem_type="binary_classification",
208
+ objective_metric_name="roc_auc",
209
+ stratify_on="DonatedBloodInMarch2007",
210
+ )
211
+
212
+ # --- Preprocessing
213
+ import pandas as pd
214
+ df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
215
+ df.columns = [
216
+ "MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
217
+ "MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
218
+ ]
219
+ df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
220
+ df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
221
+ df = df.sample(frac=1, random_state=42).reset_index(drop=True)
222
+
223
+ # --- Sanity checks
224
+ from data_foundry import dataset_checks
225
+ df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
226
+ data=df,
227
+ target_feature=task_mold.target_column_name,
228
+ problem_type=task_mold.problem_type,
229
+ )
230
+
231
+ # --- Outer CV splits
232
+ from data_foundry.curation_recommendations import (
233
+ get_recommended_iid_splits,
234
+ get_recommended_splits_dimensions,
235
+ )
236
+
237
+ n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
238
+ splits = get_recommended_iid_splits(
239
+ dataset=df,
240
+ n_repeats=n_repeats,
241
+ n_splits=n_splits,
242
+ test_size=test_size,
243
+ stratify_on=task_mold.stratify_on,
244
+ )
245
+
246
+ # --- Split metadata + container
247
+ from data_foundry.schema import PredictiveMLSplitsMetadata
248
+ from data_foundry.curation_container import CuratedContainer
249
+
250
+ splits_mold = PredictiveMLSplitsMetadata(
251
+ splits_comment="Default splits for IID data.",
252
+ splits=splits,
253
+ )
254
+ curated_data = CuratedContainer(
255
+ dataset=df,
256
+ dataset_metadata=dataset_mold,
257
+ task_metadata=task_mold,
258
+ experiment_metadata=splits_mold,
259
+ )
260
+ curated_data.save()
261
+ print(curated_data.uuid, curated_data.checksum)
262
+ ```
263
+
264
+ For the contributor flow (where to put the notebook, how to open the PR, the `/new-dataset` Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
265
+
266
+ </details>
267
+
268
+ ## 🪄 Installation
269
+
270
+ > [!IMPORTANT]
271
+ > Requires Python **3.10+**.
272
+
273
+ <details>
274
+ <summary><b>📦 From PyPI</b> — use Data Foundry as a library</summary>
275
+
276
+ ```bash
277
+ pip install data-foundry
278
+ ```
279
+
280
+ </details>
281
+
282
+ <details>
283
+ <summary><b>🌱 From source</b> — clone and install editable</summary>
284
+
285
+ ```bash
286
+ git clone https://github.com/TabArena/data-foundry.git
287
+ cd data-foundry
288
+ uv pip install -e .
289
+ ```
290
+
291
+ </details>
292
+
293
+ <details>
294
+ <summary><b>🛠️ Developer setup</b> — extras for curation, tests, and tooling</summary>
295
+
296
+ ```bash
297
+ git clone https://github.com/TabArena/data-foundry.git
298
+ cd data-foundry
299
+ uv pip install -e ".[dev,tests]"
300
+ pytest # run the test suite
301
+ ruff check . && ruff format --check . # lint + format
302
+ ```
303
+
304
+ The `dev` extra adds curation-time deps (`openml`, `kaggle`, `seaborn`, `polars`, etc.); `tests` adds `pytest` and `scikit-learn` (needed for the recommended-split helpers and examples).
305
+
306
+ </details>
307
+
308
+ ## 🗂️ Repository Structure
309
+
310
+ ```
311
+ data-foundry/
312
+ ├── src/data_foundry/ # the package — schema, container, collections, checks, splits
313
+ │ ├── schema.py # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
314
+ │ ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
315
+ │ ├── collections/ # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
316
+ │ ├── curation_recommendations.py # recommended split helpers (IID, grouped, temporal)
317
+ │ ├── dataset_checks.py # run_all_checks(...) — sanity stats for the curation notebook
318
+ │ └── examples/toy_container/ # tiny ready-to-load CuratedContainer shipped in-package
319
+ ├── datasets/ # curation notebooks
320
+ │ ├── _template/ # canonical notebook skeleton
321
+ │ ├── _dev/ # contributions land here first
322
+ │ ├── _maintenance/ # re-runs / fixes for already-released datasets
323
+ │ └── beyond_iid/ # promoted datasets — pinned by `final_uuid_list.py`
324
+ ├── examples/ # runnable demos (covers the use-cases above)
325
+ ├── scripts/ # one-off tooling (toy container builder)
326
+ │ └── beyond_arena/ # BeyondArena-specific scripts and outputs (warehouse stats, plots)
327
+ ├── tests/ # pytest test suite
328
+ └── local-data-warehouse/ # gitignored — curators write raw + saved containers here
329
+ ```
330
+
331
+ ## 🧑‍🔬 Contributing a Dataset
332
+
333
+ The short version:
334
+
335
+ 1. Copy [`datasets/_template/_template.ipynb`](datasets/_template/_template.ipynb)
336
+ to `datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb`.
337
+ 2. Run the notebook end-to-end so the saved cells contain populated check
338
+ tables and the final `uuid` / `checksum`.
339
+ 3. Open a PR — reviewers will move the notebook into the right
340
+ `beyond_iid/` subfolder and append the UUID to
341
+ [`datasets/beyond_iid/final_uuid_list.py`](datasets/beyond_iid/final_uuid_list.py).
342
+
343
+ The long version (field-by-field walkthrough, split-helper choice, dtype
344
+ gotchas, the `/new-dataset` Claude Code scaffolding skill): see
345
+ [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
346
+
347
+ ## 📄 Citation
348
+
349
+ **PLACEHOLDER**
350
+
351
+ ```bibtex
352
+ PLACEHOLDER
353
+ ```
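Note on the split-regime table in the README content above: the mapping from `time_on` / `group_on` to the three `split_regime` strings can be sketched as a small function. This is a minimal illustration, not the package's implementation: `infer_split_regime` is a hypothetical name, and raising when both fields are set is an assumption, since the README does not say how that case is handled.

```python
from typing import Optional


def infer_split_regime(time_on: Optional[str] = None, group_on: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the README's regime table."""
    if time_on is not None and group_on is not None:
        # Assumption: the README does not describe this case, so the sketch rejects it.
        raise ValueError("both time_on and group_on are set; regime is ambiguous in this sketch")
    if time_on is not None:
        return "temporal_non_iid"  # rows ordered in time
    if group_on is not None:
        return "grouped_non_iid"  # all rows of a group stay in one fold
    return "iid"  # neither field set


# Column names below are made up for illustration.
print(infer_split_regime())
print(infer_split_regime(time_on="timestamp"))
print(infer_split_regime(group_on="patient_id"))
```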
@@ -0,0 +1,300 @@
1
+ # Data Foundry: a Schema and Toolkit for Curating Tabular ML Datasets
2
+
3
+ ---
4
+
5
+ | 📂 [Examples](examples) | 🧑‍🔬 [Contribute a Dataset](CONTRIBUTING_DATASETS.md) | 📄 [Paper (placeholder — coming soon)](#-citation) |
6
+ |:---:|:---:|:---:|
7
+
8
+ ---
9
+
10
+ **Data Foundry** is the data layer behind the next generation of [TabArena](https://tabarena.ai/) datasets. It provides:
11
+
12
+ - A small, opinionated **schema** for tabular datasets, tasks (IID / temporal non-IID / grouped non-IID), and outer CV splits — aligned with OpenML where possible, extended where it had to be.
13
+ - A **curation toolkit** (sanity checks, recommended-split helpers, dtype-preserving save/load) so a curator turns a raw download into a reproducible artifact in one notebook.
14
+ - A **collections API** that pins datasets (defined by ``(unique_name, uuid)``) to immutable curated containers and resolves them against a local warehouse or directly against the [BeyondArena Datasets](https://huggingface.co/datasets/TabArena/BeyondArena).
15
+
16
+ ## ⚡ Quickstart
17
+
18
+ > [!TIP]
19
+ > Pull a real curated dataset from BeyondArena and inspect its full metadata + outer CV splits. The first call fetches from Hugging Face; subsequent calls hit your local cache.
20
+
21
+ ```bash
22
+ pip install data-foundry
23
+ python examples/load_curated_container.py
24
+ ```
25
+
26
+ ```python
27
+ from data_foundry.collections import BEYOND_ARENA
28
+
29
+ container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
30
+ print(container.describe()) # full identity + dtypes + task + splits
31
+ print(container.dataset.shape) # the actual DataFrame
32
+ print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
33
+ ```
34
+
35
+ That's the whole API surface in three lines. See [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py) for benchmarking Random Forest on the data!
36
+
37
+ ## 🕹️ Use Cases
38
+
39
+ <details>
40
+ <summary><b>🧪 Inspect a curated container offline</b> — no Hugging Face download required</summary>
41
+
42
+ The package ships a toy `CuratedContainer` so you can poke at the full API — schema, dtypes, splits, `describe()` — without touching the network. Identical interface to a downloaded BeyondArena container.
43
+
44
+ ```python
45
+ from data_foundry.curation_container import CuratedContainer
46
+ from data_foundry.examples import get_toy_container_path
47
+
48
+ container = CuratedContainer.load(get_toy_container_path())
49
+ print(container.describe()) # full identity + dtypes + task + splits
50
+ print(container.dataset.shape) # the actual DataFrame
51
+ print(container.task_metadata.split_regime) # "iid", "temporal_non_iid", or "grouped_non_iid"
52
+ ```
53
+
54
+ Full inspection script (every metadata field printed): [`examples/load_curated_container.py`](examples/load_curated_container.py).
55
+
56
+ </details>
57
+
58
+ <details>
59
+ <summary><b>📦 Use one dataset</b> — IID and non-IID variants</summary>
60
+
61
+ Download a single BeyondArena container by name (or UUID) and iterate its outer CV splits. The collection resolves the container against your local cache; subsequent runs hit disk, not the network.
62
+
63
+ ```python
64
+ from data_foundry.collections import BEYOND_ARENA
65
+
66
+ container = BEYOND_ARENA.get_dataset("airfoil_self_noise")
67
+ df = container.dataset
68
+ target = container.task_metadata.target_column_name
69
+
70
+ for repeat_id, folds in container.experiment_metadata.splits.items():
71
+     for fold_id, (train_idx, test_idx) in folds.items():
72
+         X_train, y_train = df.iloc[train_idx].drop(columns=target), df.iloc[train_idx][target]
73
+         X_test, y_test = df.iloc[test_idx].drop(columns=target), df.iloc[test_idx][target]
74
+         # ... fit, evaluate ...
75
+ ```
76
+
77
+ Full worked example (Random Forest, RMSE per fold, full metadata via `container.describe()`): [`examples/benchmark_on_beyond_arena.py`](examples/benchmark_on_beyond_arena.py).
78
+
79
+ **Split regimes.** BeyondArena ships datasets from three regimes — which one a dataset is in shows up directly on `task_metadata`:
80
+
81
+ | Regime | Set on `PredictiveMLTaskMetadata` | Meaning |
82
+ |---|---|---|
83
+ | IID | neither `time_on` nor `group_on` | rows are independent; random / stratified splits |
84
+ | temporal non-IID | `time_on` set | rows ordered in time; future rows must not leak backwards |
85
+ | grouped non-IID | `group_on` set (+ `group_labels`) | all rows of a group stay together in one fold |
86
+
87
+ Side-by-side regime printout (one IID, two grouped variants — `per_group` vs `per_sample` — and one temporal): [`examples/data_foundry_data_regimes.py`](examples/data_foundry_data_regimes.py).
88
+
89
+ </details>
90
+
91
+ <details>
92
+ <summary><b>🗂️ Use a collection of datasets</b> — pre-download all of BeyondArena</summary>
93
+
94
+ `BEYOND_ARENA.prefetch(...)` batches every container into a single Hugging Face `snapshot_download` call (one network round-trip for the whole collection). On a warm cache it skips importing `huggingface_hub` entirely.
95
+
96
+ ```python
97
+ from data_foundry.collections import BEYOND_ARENA
98
+
99
+ paths = BEYOND_ARENA.prefetch() # warms the cache once
100
+ for container in BEYOND_ARENA.iter_containers(): # now hits disk only
101
+     print(container.dataset_metadata.unique_name, container.dataset.shape)
102
+ ```
103
+
104
+ Cache management:
105
+
106
+ ```python
107
+ BEYOND_ARENA.clear_cache() # nuke this collection's subdir
108
+ BEYOND_ARENA.get_dataset(name, force_download=True) # re-fetch a single container
109
+ ```
110
+
111
+ Full worked example with `tqdm` progress + checksum verification: [`examples/download_all_beyond_arena_datasets.py`](examples/download_all_beyond_arena_datasets.py). For a single dataset round-trip with checksum verification, see [`examples/download_beyond_arena_dataset.py`](examples/download_beyond_arena_dataset.py).
112
+
113
+ </details>
114
+
115
+ <details>
116
+ <summary><b>🧑‍🔬 Curate a dataset</b> — turn a raw download into a CuratedContainer</summary>
117
+
118
+ End-to-end pipeline, condensed (the full runnable version is [`examples/curate_a_dataset.py`](examples/curate_a_dataset.py)):
119
+
120
+ ```python
121
+ from data_foundry.schema import DatasetMetadata, PredictiveMLTaskMetadata
122
+
123
+ # --- Basic metadata
124
+ dataset_mold = DatasetMetadata(
125
+ unique_name="blood_transfusion",
126
+ dataset_year="2008",
127
+ domain_str="medical & healthcare",
128
+ dataset_source="UCI",
129
+ original_dataset_source_download_link="https://doi.org/10.24432/C5GS39",
130
+ download_description="""
131
+ We download the data from the UCI repository and unzip it to a predefined folder.
132
+
133
+ mkdir -p local-data-warehouse/blood_transfusion/ \\
134
+ && wget -P local-data-warehouse/blood_transfusion/ \\
135
+ https://archive.ics.uci.edu/static/public/176/blood+transfusion+service+center.zip \\
136
+ && unzip local-data-warehouse/blood_transfusion/blood+transfusion+service+center.zip \\
137
+ -d local-data-warehouse/blood_transfusion/
138
+ """,
139
+ academic_reference_bibtex="""@article{yeh2009knowledge,
140
+ title={Knowledge discovery on RFM model using Bernoulli sequence},
141
+ author={Yeh, I-Cheng and Yang, King-Jang and Ting, Tao-Ming},
142
+ journal={Expert Systems with applications},
143
+ volume={36}, number={3}, pages={5866--5871},
144
+ year={2009}, publisher={Elsevier},
145
+ }
146
+ """,
147
+ academic_reference_bibtex_key="yeh2009knowledge",
148
+ license="CC BY 4.0",
149
+ data_tags=["IID"],
150
+ curation_comments="Renamed features for clarity; mapped target 0/1 → No/Yes; ~29% duplicate rows kept.",
151
+ )
152
+ task_mold = PredictiveMLTaskMetadata(
153
+ target_column_name="DonatedBloodInMarch2007",
154
+ problem_type="binary_classification",
155
+ objective_metric_name="roc_auc",
156
+ stratify_on="DonatedBloodInMarch2007",
157
+ )
158
+
159
+ # --- Preprocessing
160
+ import pandas as pd
161
+ df = pd.read_csv(f"{dataset_mold.path}/transfusion.data")
162
+ df.columns = [
163
+ "MonthsSinceLastDonation", "NumberOfDonations", "TotalBloodDonated",
164
+ "MonthsSinceFirstDonation", "DonatedBloodInMarch2007",
165
+ ]
166
+ df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].map({1: "Yes", 0: "No"})
167
+ df["DonatedBloodInMarch2007"] = df["DonatedBloodInMarch2007"].astype("category")
168
+ df = df.sample(frac=1, random_state=42).reset_index(drop=True)
169
+
170
+ # --- Sanity checks
171
+ from data_foundry import dataset_checks
172
+ df_head, summary, numeric_stats, cat_stats, target_df = dataset_checks.run_all_checks(
173
+ data=df,
174
+ target_feature=task_mold.target_column_name,
175
+ problem_type=task_mold.problem_type,
176
+ )
177
+
178
+ # --- Outer CV splits
179
+ from data_foundry.curation_recommendations import (
180
+ get_recommended_iid_splits,
181
+ get_recommended_splits_dimensions,
182
+ )
183
+
184
+ n_repeats, n_splits, test_size = get_recommended_splits_dimensions(dataset=df)
185
+ splits = get_recommended_iid_splits(
186
+ dataset=df,
187
+ n_repeats=n_repeats,
188
+ n_splits=n_splits,
189
+ test_size=test_size,
190
+ stratify_on=task_mold.stratify_on,
191
+ )
192
+
193
+ # --- Split metadata + container
194
+ from data_foundry.schema import PredictiveMLSplitsMetadata
195
+ from data_foundry.curation_container import CuratedContainer
196
+
197
+ splits_mold = PredictiveMLSplitsMetadata(
198
+ splits_comment="Default splits for IID data.",
199
+ splits=splits,
200
+ )
201
+ curated_data = CuratedContainer(
202
+ dataset=df,
203
+ dataset_metadata=dataset_mold,
204
+ task_metadata=task_mold,
205
+ experiment_metadata=splits_mold,
206
+ )
207
+ curated_data.save()
208
+ print(curated_data.uuid, curated_data.checksum)
209
+ ```
210
+
211
+ For the contributor flow (where to put the notebook, how to open the PR, the `/new-dataset` Claude Code skill, best practices around versioning, anomaly tracking, and dtype handling), see [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
212
+
213
+ </details>
214
+
215
+ ## 🪄 Installation
216
+
217
+ > [!IMPORTANT]
218
+ > Requires Python **3.10+**.
219
+
220
+ <details>
221
+ <summary><b>📦 From PyPI</b> — use Data Foundry as a library</summary>
222
+
223
+ ```bash
224
+ pip install data-foundry
225
+ ```
226
+
227
+ </details>
228
+
229
+ <details>
230
+ <summary><b>🌱 From source</b> — clone and install editable</summary>
231
+
232
+ ```bash
233
+ git clone https://github.com/TabArena/data-foundry.git
234
+ cd data-foundry
235
+ uv pip install -e .
236
+ ```
237
+
238
+ </details>
239
+
240
+ <details>
241
+ <summary><b>🛠️ Developer setup</b> — extras for curation, tests, and tooling</summary>
242
+
243
+ ```bash
244
+ git clone https://github.com/TabArena/data-foundry.git
245
+ cd data-foundry
246
+ uv pip install -e ".[dev,tests]"
247
+ pytest # run the test suite
248
+ ruff check . && ruff format --check . # lint + format
249
+ ```
250
+
251
+ The `dev` extra adds curation-time deps (`openml`, `kaggle`, `seaborn`, `polars`, etc.); `tests` adds `pytest` and `scikit-learn` (needed for the recommended-split helpers and examples).
252
+
253
+ </details>
254
+
255
+ ## 🗂️ Repository Structure
256
+
257
+ ```
258
+ data-foundry/
259
+ ├── src/data_foundry/ # the package — schema, container, collections, checks, splits
260
+ │ ├── schema.py # DatasetMetadata, PredictiveMLTaskMetadata, PredictiveMLSplitsMetadata
261
+ │ ├── curation_container.py # CuratedContainer (save/load + describe + checksum)
262
+ │ ├── collections/ # BEYOND_ARENA, DatasetCollection, HuggingFaceSource, cache helpers
263
+ │ ├── curation_recommendations.py # recommended split helpers (IID, grouped, temporal)
264
+ │ ├── dataset_checks.py # run_all_checks(...) — sanity stats for the curation notebook
265
+ │ └── examples/toy_container/ # tiny ready-to-load CuratedContainer shipped in-package
266
+ ├── datasets/ # curation notebooks
267
+ │ ├── _template/ # canonical notebook skeleton
268
+ │ ├── _dev/ # contributions land here first
269
+ │ ├── _maintenance/ # re-runs / fixes for already-released datasets
270
+ │ └── beyond_iid/ # promoted datasets — pinned by `final_uuid_list.py`
271
+ ├── examples/ # runnable demos (covers the use-cases above)
272
+ ├── scripts/ # one-off tooling (toy container builder)
273
+ │ └── beyond_arena/ # BeyondArena-specific scripts and outputs (warehouse stats, plots)
274
+ ├── tests/ # pytest test suite
275
+ └── local-data-warehouse/ # gitignored — curators write raw + saved containers here
276
+ ```
277
+
278
+ ## 🧑‍🔬 Contributing a Dataset
279
+
280
+ The short version:
281
+
282
+ 1. Copy [`datasets/_template/_template.ipynb`](datasets/_template/_template.ipynb)
283
+ to `datasets/_dev/<topic>/<unique_name>/<unique_name>.ipynb`.
284
+ 2. Run the notebook end-to-end so the saved cells contain populated check
285
+ tables and the final `uuid` / `checksum`.
286
+ 3. Open a PR — reviewers will move the notebook into the right
287
+ `beyond_iid/` subfolder and append the UUID to
288
+ [`datasets/beyond_iid/final_uuid_list.py`](datasets/beyond_iid/final_uuid_list.py).
289
+
290
+ The long version (field-by-field walkthrough, split-helper choice, dtype
291
+ gotchas, the `/new-dataset` Claude Code scaffolding skill): see
292
+ [**CONTRIBUTING_DATASETS.md**](CONTRIBUTING_DATASETS.md).
293
+
294
+ ## 📄 Citation
295
+
296
+ **PLACEHOLDER**
297
+
298
+ ```bibtex
299
+ PLACEHOLDER
300
+ ```
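Note on the outer CV splits used in the README content above: the loop there treats `splits` as a nested mapping of repeat, then fold, then a `(train_idx, test_idx)` pair of positional row indices. The sketch below walks a hand-written stand-in with that shape and checks that no fold leaks test rows into training. The toy `splits` dict, the integer keys, and the `check_splits` helper are all assumptions made for illustration; none of them come from the package.

```python
# Toy stand-in for container.experiment_metadata.splits (hand-written, not real data).
rows = ["r0", "r1", "r2", "r3", "r4", "r5"]
splits = {
    0: {  # repeat id (integer keys are an assumption)
        0: ([0, 1, 2, 3], [4, 5]),
        1: ([2, 3, 4, 5], [0, 1]),
        2: ([0, 1, 4, 5], [2, 3]),
    },
}


def check_splits(splits, n_rows):
    """Hypothetical check: count folds, rejecting overlapping or out-of-range indices."""
    valid = set(range(n_rows))
    n_folds = 0
    for repeat_id, folds in splits.items():
        for fold_id, (train_idx, test_idx) in folds.items():
            train, test = set(train_idx), set(test_idx)
            if train & test:
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: train/test overlap")
            if not (train | test) <= valid:
                raise ValueError(f"repeat {repeat_id}, fold {fold_id}: index out of range")
            n_folds += 1
    return n_folds


print(check_splits(splits, len(rows)))
```

With a real container, the same loop would index the DataFrame via `df.iloc[train_idx]` and `df.iloc[test_idx]`, as the README snippet does.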