PyPI - tab2seq - Versions diffs - 0.1.2__tar.gz → 0.1.5__tar.gz - Mend

tab2seq 0.1.2tar.gz → 0.1.5tar.gz

Files changed (44)

{tab2seq-0.1.2/src/tab2seq.egg-info → tab2seq-0.1.5}/PKG-INFO +168 -110
tab2seq-0.1.5/README.md +315 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/pyproject.toml +8 -11
tab2seq-0.1.5/src/tab2seq/__init__.py +18 -0
tab2seq-0.1.5/src/tab2seq/cli.py +71 -0
tab2seq-0.1.5/src/tab2seq/cohort/__init__.py +6 -0
tab2seq-0.1.5/src/tab2seq/cohort/config.py +104 -0
tab2seq-0.1.5/src/tab2seq/cohort/core.py +461 -0
tab2seq-0.1.5/src/tab2seq/config.py +58 -0
tab2seq-0.1.5/src/tab2seq/datasets/__init__.py +16 -0
tab2seq-0.1.5/src/tab2seq/datasets/builder.py +706 -0
tab2seq-0.1.5/src/tab2seq/datasets/config.py +59 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/datasets/synthetic.py +16 -0
tab2seq-0.1.5/src/tab2seq/loader.py +65 -0
tab2seq-0.1.5/src/tab2seq/processor.py +52 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/config.py +21 -1
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/core.py +2 -4
tab2seq-0.1.5/src/tab2seq/tokenization/__init__.py +7 -0
tab2seq-0.1.5/src/tab2seq/tokenization/config.py +25 -0
tab2seq-0.1.5/src/tab2seq/tokenization/tokenizer.py +139 -0
tab2seq-0.1.5/src/tab2seq/tokenization/vocabulary.py +359 -0
{tab2seq-0.1.2 → tab2seq-0.1.5/src/tab2seq.egg-info}/PKG-INFO +168 -110
tab2seq-0.1.5/src/tab2seq.egg-info/SOURCES.txt +38 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/requires.txt +1 -4
tab2seq-0.1.5/tests/test_cli.py +153 -0
tab2seq-0.1.5/tests/test_cohort.py +283 -0
tab2seq-0.1.5/tests/test_config.py +86 -0
tab2seq-0.1.5/tests/test_event_dataset_builder.py +225 -0
tab2seq-0.1.5/tests/test_loader.py +102 -0
tab2seq-0.1.5/tests/test_processor.py +113 -0
tab2seq-0.1.5/tests/test_tokenizer.py +179 -0
tab2seq-0.1.5/tests/test_vocabulary.py +159 -0
tab2seq-0.1.2/README.md +0 -254
tab2seq-0.1.2/src/tab2seq/__init__.py +0 -9
tab2seq-0.1.2/src/tab2seq/datasets/__init__.py +0 -5
tab2seq-0.1.2/src/tab2seq.egg-info/SOURCES.txt +0 -17
{tab2seq-0.1.2 → tab2seq-0.1.5}/LICENSE +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/setup.cfg +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/__init__.py +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq/source/collection.py +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/dependency_links.txt +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/src/tab2seq.egg-info/top_level.txt +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/tests/test_datasets.py +0 -0
{tab2seq-0.1.2 → tab2seq-0.1.5}/tests/test_source.py +0 -0

{tab2seq-0.1.2/src/tab2seq.egg-info → tab2seq-0.1.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tab2seq
-Version: 0.1.2
+Version: 0.1.5
 Summary: Transform tabular event data into sequences ready for Transformer and Sequential models: Life2Vec, BEHRT and more.
 Author-email: Germans Savcisens <germans@savcisens.com>
 License: MIT
@@ -9,7 +9,7 @@ Project-URL: Documentation, https://tab2seq.readthedocs.io
 Project-URL: Repository, https://github.com/carlomarxdk/tab2seq
 Project-URL: Issues, https://github.com/carlomarxdk/tab2seq/issues
 Keywords: tokenization,data preprocessing,tabular data,transformer models,sequential models,life2vec
-Classifier: Development Status :: 3 - Alpha
+Classifier: Development Status :: 4 - Beta
 Classifier: Intended Audience :: Science/Research
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
@@ -36,13 +36,10 @@ Requires-Dist: ruff>=0.15.0; extra == "dev"
 Requires-Dist: mypy>=1.19.0; extra == "dev"
 Requires-Dist: types-PyYAML>=6.0.0; extra == "dev"
 Provides-Extra: docs
-Requires-Dist: mkdocs>=1.6.1; extra == "docs"
-Requires-Dist: mkdocs-material>=9.7.1; extra == "docs"
-Requires-Dist: mkdocstrings>=1.0.2; extra == "docs"
+Requires-Dist: zensical; extra == "docs"
 Requires-Dist: mkdocstrings-python>=2.0.0; extra == "docs"
 Requires-Dist: mkdocs-gen-files>=0.6.0; extra == "docs"
 Requires-Dist: mkdocs-literate-nav>=0.6.2; extra == "docs"
-Requires-Dist: mkdocs-section-index>=0.3.10; extra == "docs"
 Requires-Dist: mkdocs-bibtex>=4.4.0; extra == "docs"
 Provides-Extra: all
 Requires-Dist: tab2seq[dev,docs]; extra == "all"
@@ -54,86 +51,84 @@ Dynamic: license-file
 [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tab2seq)](https://pypi.org/project/tab2seq/)
 [![PyPI - Status](https://img.shields.io/pypi/status/tab2seq)](https://pypi.org/project/tab2seq/)
 [![GitHub License](https://img.shields.io/github/license/carlomarxdk/tab2seq)](https://github.com/carlomarxdk/tab2seq/blob/main/LICENSE)
+[![DOI](https://zenodo.org/badge/1163020308.svg)](https://doi.org/10.5281/zenodo.18752504)
-**tab2seq** adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into formats ready for Transformer and sequential deep learning models.
+**tab2seq** adapts the Life2Vec data processing pipeline to make it easy to work with multi-source tabular event data for sequential modeling projects. Transform registry data, EHR records, and other event-based datasets into tokenized sequences ready for Transformer and sequential deep learning models.
+The package reimplements the data-preprocessing steps of the [life2vec](https://github.com/SocialComplexityLab/life2vec) and [life2vec-light](https://github.com/carlomarxdk/life2vec-light) repos.
-> [!WARNING]
-> This is an alpha package. In the beta version, it will reimplement all the data-preprocessing steps of the [life2vec](https://github.com/SocialComplexityLab/life2vec) and [life2vec-light](https://github.com/carlomarxdk/life2vec-light) repos. See [TODOs](#todos) to see what is implemented at this point.
+> [!INFO]
+> This is a **BETA** version of the package.
 ## About
 This package extracts and generalizes the data processing patterns from the [Life2Vec](https://github.com/SocialComplexityLab/life2vec) project, making them reusable for similar research projects that need to:
 - Work with multiple longitudinal data sources (registries, databases)
-- Define and filter cohorts based on complex criteria
+- Define and filter cohorts based on inclusion criteria
+- Create deterministic train/val/test splits with static context
+- Fit a vocabulary on training data only (no leakage)
+- Produce tokenized, model-ready event sequences with time features
 - Generate realistic synthetic data for development and testing
-- Process large-scale tabular event data efficiently
 Whether you're working with healthcare data, financial records, or any time-stamped event data, tab2seq provides the building blocks for preparing data for Life2Vec-style sequential models.
+## Pipeline Overview
+```
+Sources → Cohort → Vocabulary → EventDataset → Model-ready Parquet
+```
+1. **Sources** – Define one `SourceConfig` per event table (health visits, labour records, income, etc.). Each config declares which columns are categorical, continuous, or timestamps.
+2. **Cohort** – Unite sources into a single entity universe, apply inclusion criteria, and split into train/val/test with deterministic seeds.
+3. **Vocabulary** – Fit token mappings and continuous-feature bin edges on the *train split only* to prevent leakage.
+4. **EventDataset** – Build tokenized event rows per split, derive relative-date features (e.g. age), and persist to Parquet with metadata.
 ## Features
 - **Multi-Source Data Management**: Handle multiple data sources (registries) with unified schema
+- **Cohort Construction**: Entity-level inclusion criteria across sources, deterministic splits, static-attribute propagation
+- **Train-Only Vocabulary**: Token and bin-edge fitting restricted to training entities
+- **Tokenized Event Datasets**: Vectorized token-ID encoding, relative-date features, Parquet persistence
+- **Entity Record Access**: Iterator, random sample, and stateful `next()` retrieval patterns for downstream training loops
 - **Type-Safe Configuration**: Pydantic-based configuration with YAML support
 - **Synthetic Data Generation**: Generate realistic dummy registry data for testing and exploration
 - **Memory-Efficient Loading**: Chunked iteration and lazy loading with Polars
-- **Schema Validation**: Automatic validation of entity IDs, timestamps, and column types
-- **Cross-Source Operations**: Unified access and operations across multiple data sources
 ## Installation
 ```bash
-# Basic installation
 pip install tab2seq
 ```
 ## Quick Start
-### Working with a Single Source
+The full pipeline from raw data to model-ready sequences in five steps.
+### 1. Generate Synthetic Data
 ```python
-from tab2seq.source import (
-    Source,
-    SourceConfig,
-    SourceCollection,
-    CategoricalColConfig,
-    ContinuousColConfig,
-    TimestampColConfig
-)
+from tab2seq.datasets import generate_synthetic_data
+import polars as pl
-config = SourceConfig(
-    name="health",
-    filepath="synthetic_data/health.parquet",
-    id_col="entity_id",
-    categorical_cols=[
-        CategoricalColConfig(col_name="diagnosis", prefix="DIAG"),
-        CategoricalColConfig(col_name="procedure", prefix="PROC"),
-        CategoricalColConfig(col_name="department", prefix="DEPT"),
-    ],
-    continuous_cols=[
-        ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
-        ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=20, strategy="quantile"),
-    ],
-    output_format="parquet",
-    timestamp_cols=[
-        TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
-    ]
+data_paths = generate_synthetic_data(
+    output_dir="synthetic_data",
+    n_entities=10_000,
+    seed=742,
+    registries=["health", "labour"],
 )
-source = Source(config=config)
-# Process and tokenize the columns
-print("Number of unique IDs:", len(source.get_entity_ids()))
-lf_health = source.process(cache=True)
-lf_health.head()
+pl.read_parquet(data_paths["health"]).head()
 ```
-### Working with Multiple Sources
+### 2. Define Sources
+Each `Source` describes one event table: its file path, ID column, timestamp, and feature columns.
 ```python
-from tab2seq.source import SourceCollection, SourceConfig, CategoricalColConfig, ContinuousColConfig, TimestampColConfig
+from tab2seq.source import (
+    Source, SourceCollection, SourceConfig,
+    CategoricalColConfig, ContinuousColConfig, TimestampColConfig,
+)
-# Define your data sources
 configs = [
     SourceConfig(
         name="health",
@@ -145,13 +140,12 @@ configs = [
             CategoricalColConfig(col_name="department", prefix="DEPT"),
         ],
         continuous_cols=[
-            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20, strategy="quantile"),
-            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=20, strategy="quantile"),
+            ContinuousColConfig(col_name="cost", prefix="COST", n_bins=20),
+            ContinuousColConfig(col_name="length_of_stay", prefix="LOS", n_bins=10),
         ],
-        output_format="parquet",
         timestamp_cols=[
-            TimestampColConfig(col_name="date", is_primary=True, drop_na=True)
-        ]
+            TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
+        ],
     ),
     SourceConfig(
         name="labour",
@@ -161,99 +155,162 @@ configs = [
             CategoricalColConfig(col_name="status", prefix="STATUS"),
             CategoricalColConfig(col_name="occupation", prefix="OCC"),
             CategoricalColConfig(col_name="residence_region", prefix="REGION"),
+            CategoricalColConfig(col_name="native_language", prefix="LANG", static=True),
         ],
         continuous_cols=[
-            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS")
+            ContinuousColConfig(col_name="weekly_hours", prefix="WEEKLY_HOURS", n_bins=10),
         ],
-        output_format="parquet",
         timestamp_cols=[
             TimestampColConfig(col_name="date", is_primary=True, drop_na=True),
-            TimestampColConfig(col_name="birthday", is_primary=False, drop_na=True),
+            TimestampColConfig(col_name="birthday", static=True, drop_na=True),
         ],
     ),
 ]
-# Create a source collection
 collection = SourceCollection.from_configs(configs)
-# Access individual sources
-health = collection["health"]
-df = health.read_all()
-# Or iterate over all sources
 for source in collection:
     print(f"{source.name}: {len(source.get_entity_ids())} entities")
-# Cross-source operations
-all_entity_ids = collection.get_all_entity_ids()
 ```
-### Generating Synthetic Data
+> Columns marked `static=True` are carried through to the cohort split table as entity-level attributes (e.g. birthday, native language).
+### 3. Build a Cohort
+A `Cohort` resolves one consistent entity universe across all sources, applies inclusion criteria, and generates deterministic train/val/test splits.
 ```python
-from tab2seq.datasets import generate_synthetic_data
-import polars as pl
+from tab2seq.cohort import Cohort, CohortConfig, EntityInclusionCriteria
+criteria = [
+    EntityInclusionCriteria(source_name="health", required=False),
+    EntityInclusionCriteria(source_name="labour", required=True, min_events=1),
+]
+cohort = Cohort(
+    name="my_cohort",
+    sources=collection,
+    inclusion_criteria=criteria,
+    cache_dir="data/cohorts",
+)
-# Generate synthetic registry data
-data_paths = generate_synthetic_data(output_dir="synthetic_data",
-                                     n_entities=10000,
-                                     seed=742,
-                                     registries=["health", "labour", "survey", "income"],
-                                     file_format="parquet")
+entities_df = cohort.build_entities_table(force_recompute=True)
+print(f"Cohort size: {len(cohort)} entities")
-lf_health = pl.read_parquet(data_paths["health"])
-lf_health.head()
+split_cfg = CohortConfig(train_frac=0.7, val_frac=0.15, test_frac=0.15, seed=42)
+split_df = cohort.build_or_load_splits(split_cfg, force_recompute=True)
+split_df.head()
 ```
-## Architecture
+The split table contains one row per entity with the split label and all static columns.
-> [!warning]
-> Work in progress!
+### 4. Fit a Vocabulary (Train Only)
-**Available Registries:**
+The vocabulary maps categorical values to token strings and bins continuous features—fitted exclusively on training entities to prevent leakage.
-- **health**: Medical events with diagnoses (ICD codes), procedures, departments, costs, and length of stay
-- **income**: Yearly income records with income type, sector, and amounts
-- **labour**: Quarterly labour status with occupation, employment status, and residence
-- **survey**: Periodic survey responses with education level, marital status, and satisfaction scores
+```python
+from tab2seq.config import TokenizerConfig
+from tab2seq.tokenization import Vocabulary
+tok_cfg = TokenizerConfig()
+tok_cfg.vocabulary.min_token_count = 1
+tok_cfg.vocabulary.max_vocab_size = 50_000
+vocab = Vocabulary(tok_cfg.vocabulary)
+vocab_df = vocab.fit_from_cohort_train(
+    cohort=cohort,
+    split_config=split_cfg,
+    force_recompute=True,
+)
+print(f"Vocabulary size: {vocab_df.height}")
+```
-All synthetic data includes realistic temporal patterns, missing data, and correlations between fields to mimic real-world registry data.
+### 5. Build Tokenized Event Datasets
-## Use Cases
+`EventDataset` produces one row per event with integer token IDs, time features, and optional derived columns.
-- **Healthcare Research**: Transform electronic health records (EHR) into sequences for predictive modeling
-- **Registry Data Processing**: Work with multiple event-based registries (health, income, labour, surveys)
-- **Sequential Modeling**: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
-- **Data Pipeline Development**: Use synthetic data to develop and test processing pipelines before working with sensitive real data
-- **Multi-Source Analysis**: Combine and analyze data from multiple longitudinal sources with unified tooling
+```python
+from tab2seq.datasets import EventDataset, EventDatasetConfig, RelativeDateRule
+dataset_cfg = EventDatasetConfig(
+    reference_date="1970-01-01",
+    threshold_date="2021-01-01",
+    include_after_threshold=True,
+    include_token_str=True,
+    relative_date_features=[
+        RelativeDateRule(
+            source_static_column="labour__birthday",
+            output_column="age_years",
+            unit="years",
+        ),
+    ],
+)
-## Development
+dataset = EventDataset(
+    cohort=cohort,
+    vocabulary=vocab,
+    split_config=split_cfg,
+    dataset_config=dataset_cfg,
+)
-```bash
-# Install development dependencies
-pip install -e ".[dev]"
+# Inspect one split in memory
+train_events = dataset.build_split("train", force_recompute_splits=True)
+print(train_events.select(
+    ["entity_id", "source_name", "primary_timestamp", "token_ids", "age_years"]
+).head(5))
-# Run tests
-pytest
+# Persist all splits + static table + metadata to Parquet
+artifacts = dataset.write_parquet(force_recompute_splits=True)
+print(artifacts.split_paths)
+```
-# Run tests with coverage
-pytest --cov=tab2seq --cov-report=html
+### Retrieving Entity Records
-# Format code
-black src/tab2seq tests
+Three patterns for feeding records into a training loop:
-# Lint code
-ruff check src/tab2seq tests
+```python
+# Full iterator sweep
+for record in dataset.iter_entity_records(split="train", shuffle=True, seed=42):
+    # record = {"entity_id": ..., "split": ..., "static": {...}, "events": [...]}
+    pass
+# Random sample
+record = dataset.sample_entity_record(split="train", seed=7)
+# Stateful next() — remembers position across calls
+record = dataset.next_entity_record(split="train", shuffle=True, seed=0, reset=True)
+while record is not None:
+    record = dataset.next_entity_record(split="train", shuffle=True, seed=0)
 ```
+## Synthetic Registries
+`generate_synthetic_data` / `generate_synthetic_collections` create four registry-style tables with realistic temporal patterns, missing data, and cross-field correlations:
+| Registry | Key columns |
+|----------|------------|
+| **health** | diagnosis, procedure, department, cost, length_of_stay |
+| **income** | income_type, sector, income_amount |
+| **labour** | status, occupation, weekly_hours, residence_region, birthday |
+| **survey** | education_level, marital_status, self_rated_health, satisfaction_score |
+## Use Cases
+- **Healthcare Research**: Transform electronic health records (EHR) into sequences for predictive modeling
+- **Registry Data Processing**: Work with multiple event-based registries (health, income, labour, surveys)
+- **Sequential Modeling**: Prepare multi-source data for Life2Vec, BEHRT, or other transformer-based models
+- **Data Pipeline Development**: Use synthetic data to develop and test processing pipelines before working with sensitive real data
 ## TODOs
 - [x] Synthetic Datasets
 - [x] `Source` implementation
-- [ ] `Cohort` implementation
-- [ ] `Cohort` and data splits
-- [ ] `Tokenization` implementation
-- [ ] `Vocabulary` implementation
+- [x] `Cohort` implementation
+- [x] `Cohort` and data splits
+- [x] `Tokenization` implementation
+- [x] `Vocabulary` implementation
+- [x] `EventDataset` builder
 - [x] Caching and chunking
 - [ ] Documentation
@@ -296,9 +353,10 @@ Contributions are welcome! Please open an issue or submit a pull request on [Git
 ## License
-MIT License - see [LICENSE](LICENSE) file for details.
+MIT License: see [LICENSE](LICENSE) file for details.
 ## Support
 - 🐛 Issues: [GitHub Issues](https://github.com/carlomarxdk/tab2seq/issues)
 - 💬 Discussions: [GitHub Discussions](https://github.com/carlomarxdk/tab2seq/discussions)
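
The README changes in this diff lean on two ideas: deterministic train/val/test splits, and a vocabulary (token mappings plus continuous-feature bin edges) fitted on the train split only. The sketch below illustrates those two ideas in plain standard-library Python. It is not tab2seq's implementation: `assign_split`, `fit_vocabulary`, `encode_event`, the `[PAD]`/`[UNK]` tokens, and the hash-based split rule are all hypothetical names and choices made for illustration, and the event layout (a dict with `entity_id`, `diagnosis`, `cost`) is an assumption.

```python
import hashlib
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def assign_split(entity_id, seed=42, train_frac=0.7, val_frac=0.15):
    """Map an entity to a split via a seeded hash, so reruns give the same label."""
    digest = hashlib.sha256(f"{seed}:{entity_id}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x1_0000_0000  # value in [0, 1)
    if u < train_frac:
        return "train"
    if u < train_frac + val_frac:
        return "val"
    return "test"


def fit_vocabulary(events, splits, n_bins=4, min_token_count=1):
    """Fit token IDs and quantile bin edges from training entities only."""
    train = [e for e in events if splits[e["entity_id"]] == "train"]

    # Categorical tokens: keep only values seen often enough in the train split.
    counts = Counter(f"DIAG_{e['diagnosis']}" for e in train)
    tokens = ["[PAD]", "[UNK]"]
    tokens += sorted(t for t, c in counts.items() if c >= min_token_count)

    # Continuous feature: quantile cut points computed from train values only.
    costs = [e["cost"] for e in train]
    edges = quantiles(costs, n=n_bins) if len(costs) >= 2 else []
    tokens += [f"COST_{i}" for i in range(len(edges) + 1)]

    return {t: i for i, t in enumerate(tokens)}, edges


def encode_event(event, token_to_id, edges):
    """Encode one event as token IDs; unseen categories fall back to [UNK]."""
    unk = token_to_id["[UNK]"]
    diag = token_to_id.get(f"DIAG_{event['diagnosis']}", unk)
    cost = token_to_id.get(f"COST_{bisect_right(edges, event['cost'])}", unk)
    return [diag, cost]


if __name__ == "__main__":
    # Hypothetical toy events; real data would come from the source tables.
    events = [
        {"entity_id": f"e{i}", "diagnosis": "ABC"[i % 3], "cost": 10.0 * (i + 1)}
        for i in range(20)
    ]
    splits = {e["entity_id"]: assign_split(e["entity_id"]) for e in events}
    token_to_id, edges = fit_vocabulary(events, splits)
    print(Counter(splits.values()))
    print(encode_event(events[0], token_to_id, edges))
```

Because the bin edges and token list are computed before any validation or test event is looked at, a category that appears only outside the train split is encoded as `[UNK]`, and an out-of-range continuous value simply lands in the first or last bin.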
