tab2seq 0.1.2.tar.gz → 0.1.5.tar.gz
- {tab2seq-0.1.2/src/tab2seq.egg-info → tab2seq-0.1.5}/PKG-INFO +168 -110
- tab2seq-0.1.5/README.md +315 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/pyproject.toml +8 -11
- tab2seq-0.1.5/src/tab2seq/__init__.py +18 -0
- tab2seq-0.1.5/src/tab2seq/cli.py +71 -0
- tab2seq-0.1.5/src/tab2seq/cohort/__init__.py +6 -0
- tab2seq-0.1.5/src/tab2seq/cohort/config.py +104 -0
- tab2seq-0.1.5/src/tab2seq/cohort/core.py +461 -0
- tab2seq-0.1.5/src/tab2seq/config.py +58 -0
- tab2seq-0.1.5/src/tab2seq/datasets/__init__.py +16 -0
- tab2seq-0.1.5/src/tab2seq/datasets/builder.py +706 -0
- tab2seq-0.1.5/src/tab2seq/datasets/config.py +59 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/datasets/synthetic.py +16 -0
- tab2seq-0.1.5/src/tab2seq/loader.py +65 -0
- tab2seq-0.1.5/src/tab2seq/processor.py +52 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/config.py +21 -1
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/core.py +2 -4
- tab2seq-0.1.5/src/tab2seq/tokenization/__init__.py +7 -0
- tab2seq-0.1.5/src/tab2seq/tokenization/config.py +25 -0
- tab2seq-0.1.5/src/tab2seq/tokenization/tokenizer.py +139 -0
- tab2seq-0.1.5/src/tab2seq/tokenization/vocabulary.py +359 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5/src/tab2seq.egg-info}/PKG-INFO +168 -110
- tab2seq-0.1.5/src/tab2seq.egg-info/SOURCES.txt +38 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/requires.txt +1 -4
- tab2seq-0.1.5/tests/test_cli.py +153 -0
- tab2seq-0.1.5/tests/test_cohort.py +283 -0
- tab2seq-0.1.5/tests/test_config.py +86 -0
- tab2seq-0.1.5/tests/test_event_dataset_builder.py +225 -0
- tab2seq-0.1.5/tests/test_loader.py +102 -0
- tab2seq-0.1.5/tests/test_processor.py +113 -0
- tab2seq-0.1.5/tests/test_tokenizer.py +179 -0
- tab2seq-0.1.5/tests/test_vocabulary.py +159 -0
- tab2seq-0.1.2/README.md +0 -254
- tab2seq-0.1.2/src/tab2seq/__init__.py +0 -9
- tab2seq-0.1.2/src/tab2seq/datasets/__init__.py +0 -5
- tab2seq-0.1.2/src/tab2seq.egg-info/SOURCES.txt +0 -17
- {tab2seq-0.1.2 → tab2seq-0.1.5}/LICENSE +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/setup.cfg +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/__init__.py +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/collection.py +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/dependency_links.txt +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/top_level.txt +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/tests/test_datasets.py +0 -0
- {tab2seq-0.1.2 → tab2seq-0.1.5}/tests/test_source.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tab2seq
-Version: 0.1.
+Version: 0.1.5
 Summary: Transform tabular event data into sequences ready for Transformer and Sequential models: Life2Vec, BEHRT and more.
 Author-email: Germans Savcisens <germans@savcisens.com>
 License: MIT
@@ -9,7 +9,7 @@ Project-URL: Documentation, https://tab2seq.readthedocs.io
 Project-URL: Repository, https://github.com/carlomarxdk/tab2seq
 Project-URL: Issues, https://github.com/carlomarxdk/tab2seq/issues
 Keywords: tokenization,data preprocessing,tabular data,transformer models,sequential models,life2vec
-Classifier: Development Status ::
+Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
@@ -36,13 +36,10 @@ Requires-Dist: ruff>=0.15.0; extra == "dev"
 Requires-Dist: mypy>=1.19.0; extra == "dev"
 Requires-Dist: types-PyYAML>=6.0.0; extra == "dev"
 Provides-Extra: docs
-Requires-Dist:
-Requires-Dist: mkdocs-material>=9.7.1; extra == "docs"
-Requires-Dist: mkdocstrings>=1.0.2; extra == "docs"
+Requires-Dist: zensical; extra == "docs"
 Requires-Dist: mkdocstrings-python>=2.0.0; extra == "docs"
 Requires-Dist: mkdocs-gen-files>=0.6.0; extra == "docs"
 Requires-Dist: mkdocs-literate-nav>=0.6.2; extra == "docs"
-Requires-Dist: mkdocs-section-index>=0.3.10; extra == "docs"
 Requires-Dist: mkdocs-bibtex>=4.4.0; extra == "docs"
 Provides-Extra: all
 Requires-Dist: tab2seq[dev,docs]; extra == "all"
@@ -54,86 +51,84 @@ Dynamic: license-file
 [](https://pypi.org/project/tab2seq/)
 [](https://pypi.org/project/tab2seq/)
 [](https://github.com/carlomarxdk/tab2seq/blob/main/LICENSE)
+[](https://doi.org/10.5281/zenodo.18752504)
 
-**tab2seq** adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into
+**tab2seq** adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into tokenized sequences ready for Transformer and sequential deep learning models.
+The package reimplements the data-preprocessing steps of the [life2vec](https://github.com/SocialComplexityLab/life2vec) and [life2vec-light](https://github.com/carlomarxdk/life2vec-light) repos.
 
-> [!
-> This is
+> [!INFO]
+> This is a **BETA** version of the package.
 
 ## About
 
 This package extracts and generalizes the data processing patterns from the [Life2Vec](https://github.com/SocialComplexityLab/life2vec) project, making them reusable for similar research projects that need to:
 
 - Work with multiple longitudinal data sources (registries, databases)
-- Define and filter cohorts based on
+- Define and filter cohorts based on inclusion criteria
+- Create deterministic train/val/test splits with static context
+- Fit a vocabulary on training data only (no leakage)
+- Produce tokenized, model-ready event sequences with time features
 - Generate realistic synthetic data for development and testing
-- Process large-scale tabular event data efficiently
 
 Whether you're working with healthcare data, financial records, or any time-stamped event data, tab2seq provides the building blocks for preparing data for Life2Vec-style sequential models.
 
+## Pipeline Overview
+
+```
+Sources → Cohort → Vocabulary → EventDataset → Model-ready Parquet
+```
+
+1. **Sources** – Define one `SourceConfig` per event table (health visits, labour records, income, etc.). Each config declares which columns are categorical, continuous, or timestamps.
+2. **Cohort** – Unite sources into a single entity universe, apply inclusion criteria, and split into train/val/test with deterministic seeds.
+3. **Vocabulary** – Fit token mappings and continuous-feature bin edges on the *train split only* to prevent leakage.
+4. **EventDataset** – Build tokenized event rows per split, derive relative-date features (e.g. age), and persist to Parquet with metadata.
+
 ## Features
 
 - **Multi-Source Data Management**: Handle multiple data sources (registries) with unified schema
+- **Cohort Construction**: Entity-level inclusion criteria across sources, deterministic splits, static-attribute propagation
+- **Train-Only Vocabulary**: Token and bin-edge fitting restricted to training entities
+- **Tokenized Event Datasets**: Vectorized token-ID encoding, relative-date features, Parquet persistence
+- **Entity Record Access**: Iterator, random sample, and stateful `next()` retrieval patterns for downstream training loops
 - **Type-Safe Configuration**: Pydantic-based configuration with YAML support
 - **Synthetic Data Generation**: Generate realistic dummy registry data for testing and exploration
 - **Memory-Efficient Loading**: Chunked iteration and lazy loading with Polars
-- **Schema Validation**: Automatic validation of entity IDs, timestamps, and column types
-- **Cross-Source Operations**: Unified access and operations across multiple data sources
 
 ## Installation
 
 ```bash
-# Basic installation
 pip install tab2seq
 ```
 
 ## Quick Start
 
-
+The full pipeline from raw data to model-ready sequences in five steps.
+
+### 1. Generate Synthetic Data
 
 ```python
-from tab2seq.
-
-    SourceConfig,
-    SourceCollection,
-    CategoricalColConfig,
-    ContinuousColConfig,
-    TimestampColConfig
-)
+from tab2seq.datasets import generate_synthetic_data
+import polars as pl
 
-
-
-
-
-
-        CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
-        CategoricalColConfig(col_name="procedure", prefix="PROC"),
-        CategoricalColConfig(col_name="department", prefix="DEPT"),
-    ],
-    continuous_cols=[
-        ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
-        ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=20, strategy="quantile"),
-    ],
-    output_format="parquet",
-    timestamp_cols=[
-        TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
-    ]
+data_paths = generate_synthetic_data(
+    output_dir="synthetic_data",
+    n_entities=10_000,
+    seed=742,
+    registries=["health", "labour"],
 )
-
-source = Source(config=config)
-
-# Process and tokenize the columns
-print("Number of unique IDs:", len(source.get_entity_ids()))
-lf_health = source.process(cache=True)
-lf_health.head()
+pl.read_parquet(data_paths["health"]).head()
 ```
 
-###
+### 2. Define Sources
+
+Each `Source` describes one event table: its file path, ID column, timestamp, and feature columns.
 
 ```python
-from tab2seq.source import
+from tab2seq.source import (
+    Source, SourceCollection, SourceConfig,
+    CategoricalColConfig, ContinuousColConfig, TimestampColConfig,
+)
 
-# Define your data sources
 configs = [
     SourceConfig(
         name="health",
@@ -145,13 +140,12 @@ configs = [
             CategoricalColConfig(col_name="department", prefix="DEPT"),
         ],
         continuous_cols=[
-            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20
-            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=
+            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20),
+            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=10),
         ],
-        output_format="parquet",
         timestamp_cols=[
-            TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
-        ]
+            TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
+        ],
     ),
     SourceConfig(
         name="labour",
@@ -161,99 +155,162 @@ configs = [
             CategoricalColConfig(col_name="status", prefix="STATUS"),
             CategoricalColConfig(col_name="occupation", prefix="OCC"),
             CategoricalColConfig(col_name="residence_region", prefix="REGION"),
+            CategoricalColConfig(col_name="native_language", prefix="LANG", static=True),
         ],
         continuous_cols=[
-            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS")
+            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS", n_bins=10),
         ],
-        output_format="parquet",
         timestamp_cols=[
             TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
-            TimestampColConfig(col_name="birthday",
+            TimestampColConfig(col_name="birthday", static=True, drop_na=True),
         ],
     ),
 ]
 
-# Create a source collection
 collection = SourceCollection.from_configs(configs)
 
-# Access individual sources
-health = collection["health"]
-df = health.read_all()
-
-# Or iterate over all sources
 for source in collection:
     print(f"{source.name}: {len(source.get_entity_ids())} entities")
-
-# Cross-source operations
-all_entity_ids = collection.get_all_entity_ids()
 ```
 
-
+> Columns marked `static=True` are carried through to the cohort split table as entity-level attributes (e.g. birthday, native language).
+
+### 3. Build a Cohort
+
+A `Cohort` resolves one consistent entity universe across all sources, applies inclusion criteria, and generates deterministic train/val/test splits.
 
 ```python
-from tab2seq.
-
+from tab2seq.cohort import Cohort, CohortConfig, EntityInclusionCriteria
+
+criteria = [
+    EntityInclusionCriteria(source_name="health", required=False),
+    EntityInclusionCriteria(source_name="labour", required=True, min_events=1),
+]
+
+cohort = Cohort(
+    name="my_cohort",
+    sources=collection,
+    inclusion_criteria=criteria,
+    cache_dir="data/cohorts",
+)
 
-
-
-    n_entities=10000,
-    seed=742,
-    registries=["health", "labour", "survey", "income"],
-    file_format="parquet")
+entities_df = cohort.build_entities_table(force_recompute=True)
+print(f"Cohort size: {len(cohort)} entities")
 
-
-
+split_cfg = CohortConfig(train_frac=0.7, val_frac=0.15, test_frac=0.15, seed=42)
+split_df = cohort.build_or_load_splits(split_cfg, force_recompute=True)
+split_df.head()
 ```
 
-
+The split table contains one row per entity with the split label and all static columns.
 
-
-> Work in progress!
+### 4. Fit a Vocabulary (Train Only)
 
-
+The vocabulary maps categorical values to token strings and bins continuous features—fitted exclusively on training entities to prevent leakage.
 
-
-
-
-
+```python
+from tab2seq.config import TokenizerConfig
+from tab2seq.tokenization import Vocabulary
+
+tok_cfg = TokenizerConfig()
+tok_cfg.vocabulary.min_token_count = 1
+tok_cfg.vocabulary.max_vocab_size = 50_000
+
+vocab = Vocabulary(tok_cfg.vocabulary)
+vocab_df = vocab.fit_from_cohort_train(
+    cohort=cohort,
+    split_config=split_cfg,
+    force_recompute=True,
+)
+print(f"Vocabulary size: {vocab_df.height}")
+```
 
-
+### 5. Build Tokenized Event Datasets
 
-
+`EventDataset` produces one row per event with integer token IDs, time features, and optional derived columns.
 
-
-
-
-
---
+```python
+from tab2seq.datasets import EventDataset, EventDatasetConfig, RelativeDateRule
+
+dataset_cfg = EventDatasetConfig(
+    reference_date="1970-01-01",
+    threshold_date="2021-01-01",
+    include_after_threshold=True,
+    include_token_str=True,
+    relative_date_features=[
+        RelativeDateRule(
+            source_static_column="labour__birthday",
+            output_column="age_years",
+            unit="years",
+        ),
+    ],
+)
 
-
+dataset = EventDataset(
+    cohort=cohort,
+    vocabulary=vocab,
+    split_config=split_cfg,
+    dataset_config=dataset_cfg,
+)
 
-
-
-
+# Inspect one split in memory
+train_events = dataset.build_split("train", force_recompute_splits=True)
+print(train_events.select(
+    ["entity_id", "source_name", "primary_timestamp", "token_ids", "age_years"]
+).head(5))
 
-#
-
+# Persist all splits + static table + metadata to Parquet
+artifacts = dataset.write_parquet(force_recompute_splits=True)
+print(artifacts.split_paths)
+```
 
-
-pytest --cov=tab2seq --cov-report=html
+### Retrieving Entity Records
 
-
-black src/tab2seq tests
+Three patterns for feeding records into a training loop:
 
-
-
+```python
+# Full iterator sweep
+for record in dataset.iter_entity_records(split="train", shuffle=True, seed=42):
+    # record = {"entity_id": ..., "split": ..., "static": {...}, "events": [...]}
+    pass
+
+# Random sample
+record = dataset.sample_entity_record(split="train", seed=7)
+
+# Stateful next() — remembers position across calls
+record = dataset.next_entity_record(split="train", shuffle=True, seed=0, reset=True)
+while record is not None:
+    record = dataset.next_entity_record(split="train", shuffle=True, seed=0)
 ```
 
+## Synthetic Registries
+
+`generate_synthetic_data` / `generate_synthetic_collections` create four registry-style tables with realistic temporal patterns, missing data, and cross-field correlations:
+
+| Registry | Key columns |
+|----------|------------|
+| **health** | diagnosis, procedure, department, cost, length_of_stay |
+| **income** | income_type, sector, income_amount |
+| **labour** | status, occupation, weekly_hours, residence_region, birthday |
+| **survey** | education_level, marital_status, self_rated_health, satisfaction_score |
+
+## Use Cases
+
+- **Healthcare Research**: Transform electronic health records (EHR) into sequences for predictive modeling
+- **Registry Data Processing**: Work with multiple event-based registries (health, income, labour, surveys)
+- **Sequential Modeling**: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
+- **Data Pipeline Development**: Use synthetic data to develop and test processing pipelines before working with sensitive real data
+
+
 ## TODOs
 
 - [x] Synthetic Datasets
 - [x] `Source` implementation
-- [
-- [
-- [
-- [
+- [x] `Cohort` implementation
+- [x] `Cohort` and data splits
+- [x] `Tokenization` implementation
+- [x] `Vocabulary` implementation
+- [x] `EventDataset` builder
 - [x] Caching and chunking
 - [ ] Documentation
 
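The README text in the hunk above describes deterministic train/val/test splits and a vocabulary whose bin edges are fitted on training entities only. The sketch below illustrates that idea in plain Python. It is not tab2seq's implementation: `assign_split`, `fit_bin_edges`, and `bin_token` are hypothetical helpers, and both the hash-based split and the quantile binning are assumptions chosen for illustration.

```python
import bisect
import hashlib
import statistics


def assign_split(entity_id: str, seed: int,
                 train_frac: float = 0.7, val_frac: float = 0.15) -> str:
    """Hypothetical helper: map an entity ID to a split label deterministically."""
    digest = hashlib.sha256(f"{seed}:{entity_id}".encode()).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train_frac:
        return "train"
    if u < train_frac + val_frac:
        return "val"
    return "test"


def fit_bin_edges(train_values: list[float], n_bins: int) -> list[float]:
    """Hypothetical helper: quantile cut points computed from training values only."""
    return statistics.quantiles(train_values, n=n_bins)


def bin_token(prefix: str, value: float, edges: list[float]) -> str:
    """Hypothetical helper: turn a continuous value into a token such as COST_2."""
    return f"{prefix}_{bisect.bisect_right(edges, value)}"


# Toy data: (entity_id, cost) pairs standing in for one continuous column.
events = [(f"id{i}", float(i)) for i in range(200)]
splits = {eid: assign_split(eid, seed=42) for eid, _ in events}

# Edges come from the train split; val and test values reuse them unchanged.
train_costs = [cost for eid, cost in events if splits[eid] == "train"]
edges = fit_bin_edges(train_costs, n_bins=4)
tokens = {eid: bin_token("COST", cost, edges) for eid, cost in events}
```

Because the edges are computed once from training values, a validation or test value outside the training range simply falls into the first or last bin, and no information from held-out entities reaches the vocabulary. A hash-based split only approximates the requested fractions; a seeded shuffle would hit them exactly.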
@@ -296,9 +353,10 @@ Contributions are welcome! Please open an issue or submit a pull request on [Git
 
 ## License
 
-MIT License
+MIT License: see [LICENSE](LICENSE) file for details.
 
 ## Support
 
 - 🐛 Issues: [GitHub Issues](https://github.com/carlomarxdk/tab2seq/issues)
 - 💬 Discussions: [GitHub Discussions](https://github.com/carlomarxdk/tab2seq/discussions)
+
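The `RelativeDateRule` example in the README diff above derives an `age_years` column from the static `labour__birthday` column. The sketch below shows one way such a relative-date feature could be computed. It is not the package's code: `relative_years` is a hypothetical helper, and the 365.25-day year is an assumed convention.

```python
from datetime import date


def relative_years(reference: date, event: date) -> float:
    """Hypothetical helper: elapsed time from `reference` to `event` in years."""
    # Assumes a 365.25-day year; the package's own convention may differ.
    return (event - reference).days / 365.25


# One entity's static birthday and the dates of two of its events.
birthday = date(1980, 6, 15)
event_dates = [date(2000, 6, 15), date(2020, 1, 1)]

# One derived value per event row, analogous to an `age_years` column.
age_years = [relative_years(birthday, d) for d in event_dates]
```

Events dated before the reference yield negative values, so a real pipeline would need to decide whether to keep, clip, or drop them.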