tablassert 7.3.6__tar.gz → 7.4.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tablassert-7.3.6 → tablassert-7.4.1}/AGENTS.md +18 -20
- {tablassert-7.3.6 → tablassert-7.4.1}/CHANGELOG.md +27 -1
- {tablassert-7.3.6 → tablassert-7.4.1}/CITATION.cff +1 -1
- {tablassert-7.3.6 → tablassert-7.4.1}/CONTRIBUTING.md +2 -2
- {tablassert-7.3.6 → tablassert-7.4.1}/PKG-INFO +38 -24
- {tablassert-7.3.6 → tablassert-7.4.1}/README.md +28 -17
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/api/lib.md +6 -1
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/api/qc.md +18 -5
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/changelog.md +3 -3
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/cli.md +15 -15
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/configuration/graph.md +17 -2
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/docker.md +9 -9
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/examples/tutorial-table.yaml +5 -5
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/examples.md +24 -23
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/index.md +14 -4
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/installation.md +35 -11
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/tutorial.md +8 -8
- {tablassert-7.3.6 → tablassert-7.4.1}/llms.txt +2 -2
- {tablassert-7.3.6 → tablassert-7.4.1}/pyproject.toml +15 -9
- tablassert-7.4.1/src/tablassert/cli.py +161 -0
- tablassert-7.4.1/src/tablassert/downloader.py +243 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/fullmap.py +6 -3
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/lib.py +15 -5
- tablassert-7.4.1/src/tablassert/log.py +24 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/models.py +2 -0
- tablassert-7.4.1/src/tablassert/progress.py +126 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/qc.py +92 -24
- tablassert-7.4.1/tests/test_downloader.py +238 -0
- tablassert-7.4.1/tests/test_lib.py +255 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_models.py +17 -0
- tablassert-7.4.1/tests/test_qc.py +185 -0
- tablassert-7.4.1/uv.lock +2818 -0
- tablassert-7.3.6/src/tablassert/cli.py +0 -127
- tablassert-7.3.6/src/tablassert/downloader.py +0 -55
- tablassert-7.3.6/src/tablassert/log.py +0 -18
- tablassert-7.3.6/tests/test_lib.py +0 -118
- tablassert-7.3.6/uv.lock +0 -2652
- {tablassert-7.3.6 → tablassert-7.4.1}/.github/workflows/autotag.yml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/.github/workflows/docker.yml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/.github/workflows/docs.yml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/.github/workflows/pipy.yml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/.gitignore +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/.pre-commit-config.yaml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/Dockerfile +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/LICENSE +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/api/fullmap.md +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/api/utils.md +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/configuration/advanced-example.md +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/configuration/table.md +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/datassert.md +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/examples/tutorial-data.csv +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/docs/examples/tutorial-graph.yaml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/mkdocs.yml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/__init__.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/enums.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/ingests.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/nlp.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/src/tablassert/utils.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/__init__.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/conftest.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/fixtures/minimal_section.yaml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_enums.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_fullmap.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_ingests.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_nlp.py +0 -0
- {tablassert-7.3.6 → tablassert-7.4.1}/tests/test_utils.py +0 -0
@@ -4,7 +4,7 @@ Guidance for AI coding agents working in this repository.
 
 ## Project Overview
 
-Tablassert is a Python package (>=3.11) for tabular data assertion, normalization, and quality control. It builds declarative knowledge graphs from tabular data, exporting NCATS Translator-compliant KGX NDJSON. Uses **Polars** DataFrames, **DuckDB** for entity resolution, and **ONNX/BioBERT** for
+Tablassert is a Python package (>=3.11) for tabular data assertion, normalization, and optional quality control. It builds declarative knowledge graphs from tabular data, exporting NCATS Translator-compliant KGX NDJSON. Uses **Polars** DataFrames, **DuckDB** for entity resolution, and **ONNX/BioBERT** for QC when enabled. CLI built with **cyclopts**. Models built with **Pydantic v2**.
 
 ## Quick Reference
 
@@ -31,17 +31,18 @@ Tablassert is a Python package (>=3.11) for tabular data assertion, normalizatio
 
 ```
 src/tablassert/
-cli.py #
+cli.py # cyclopts CLI (entry point: tablassert.cli:APP)
 lib.py # Core logic: encodings, data loading, Tcode(Section) class
 models.py # Pydantic v2 models (TablaBase base class)
 enums.py # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
-fullmap.py # NER / entity resolution (DuckDB,
+fullmap.py # NER / entity resolution (DuckDB, 10 shards)
 qc.py # Quality control (ONNX/BioBERT, sentence_transformers)
 nlp.py # Text normalization (level_one: strip+lowercase, level_two: regex)
 ingests.py # YAML ingestion: from_yaml(), to_sections(), fastmerge()
-downloader.py #
+downloader.py # httpx-based file downloads with retries
+progress.py # Rich progress bars for pipeline stages
 utils.py # Hashing (xxhash), STORE path, namespace UUIDs
-log.py # loguru logger → .logassert/
+log.py # loguru logger → .logassert/tablassert.log; cat() helper for category tagging
 __init__.py # Empty file (lazy loading is per-module, not here)
 docs/ # MkDocs documentation source
 mkdocs.yml # MkDocs configuration
@@ -51,8 +52,9 @@ tests/ # Test directory (at repo root)
 
 - `conftest.py` provides a `fixtures_path` fixture returning `Path(__file__).parent / "fixtures"`.
 - pytest configured via `pyproject.toml` `[tool.pytest.ini_options]` with `testpaths = ["tests"]`.
+- pytest markers: `network` requires internet; `gpu` requires `CUDAExecutionProvider`.
 - Test fixtures: `tests/fixtures/` contains YAML files for Section model tests.
-- Test modules: `test_enums.py`, `test_fullmap.py`, `test_ingests.py`, `test_lib.py`, `test_models.py`, `test_nlp.py`, `test_utils.py`.
+- Test modules: `test_downloader.py`, `test_enums.py`, `test_fullmap.py`, `test_ingests.py`, `test_lib.py`, `test_models.py`, `test_nlp.py`, `test_utils.py`.
 
 ## Code Style
 
@@ -69,9 +71,8 @@ tests/ # Test directory (at repo root)
 else:
     pl = Lazy.load("polars")
 ```
-- Lazy-loaded deps: `polars`, `duckdb`, `orjson`, `
-- Direct (non-lazy) heavy deps: `sqlite_utils`, `rapidfuzz`, `pydantic`, `loguru`, `yaml.CLoader`
-- Previously-optional deps now in core: `sentence_transformers`, `onnxruntime`, `sklearn`, `playwright`, `pyexcel` — lazy-loaded when present
+- Lazy-loaded deps: `polars`, `duckdb`, `orjson`, `xxhash`, `polars_hash`, `yaml`, `httpx`, `pyexcel`, `onnxruntime`, `sentence_transformers`
+- Direct (non-lazy) heavy deps: `sqlite_utils`, `rapidfuzz`, `pydantic`, `loguru`, `cyclopts`, `rich`, `yaml.CLoader`
 - Some modules mix direct and lazy imports for the same package (e.g., `ingests.py` does `from yaml import CLoader` directly, then lazy-loads `yaml` for `yaml.load()`)
 - Import order: standard library → blank line → third-party → blank line → local
 - Use `from __future__ import annotations` to enable deferred evaluation
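The lazy-loading convention in this hunk (`Lazy.load("polars")`) defers importing a heavy dependency until it is first used. The following is a minimal sketch of the same idea using only the standard library's `importlib.util.LazyLoader`, as a stand-in for the `lazy-loader` package the project depends on. `lazy_import` is a hypothetical helper, not part of Tablassert, and a small stdlib module stands in for a heavy dependency.

```python
import importlib.util
import sys
from types import ModuleType


def lazy_import(name: str) -> ModuleType:
    """Return a module whose body only executes on first attribute access."""
    if name in sys.modules:  # already imported: nothing to defer
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # installs the lazy hook; the module body has not run yet
    return module


# Hypothetical usage: colorsys stands in for a heavy dependency such as polars.
colorsys = lazy_import("colorsys")
# The first attribute access below is what triggers the real import.
hue, saturation, value = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```

The trade-off of this pattern is that import errors surface at first use rather than at startup, which is usually acceptable for optional or rarely used dependencies.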
@@ -130,13 +131,13 @@ All enums live in `enums.py` and extend `str, Enum`. Key enums: `Tokens`, `Repos
 
 - Use `RuntimeError` for exceptional cases (no custom exception classes currently)
 - Use `logger.warning()` for non-fatal issues (e.g., empty subgraphs)
-- Logger: `from tablassert.log import logger`
+- Logger: `from tablassert.log import logger` (or `cat()` for category-tagged logger)
 
 ### Other Conventions
 
 - `operator.add` for Polars string concatenation on columns (not `+` directly)
-- CLI entry point: `tablassert.cli:
-- Use `rich.progress` for progress tracking in CLI
+- CLI entry point: `tablassert.cli:APP` (cyclopts app)
+- Use `rich.progress` for progress tracking in CLI (via `progress.py` which wraps Rich Live/Progress)
 - Data side-effects stored in hidden directories: `.logassert/`, `.storassert/`, `.onnxassert/`
 
 ## Tools
@@ -151,12 +152,13 @@ All enums live in `enums.py` and extend `str, Enum`. Key enums: `Tokens`, `Repos
 ## Optional Dependency Groups
 
 Defined in `pyproject.toml` `[project.optional-dependencies]`:
-- `
-- `
+- `rt` — `polars[rtcompat]` (runtime-compatible Polars build for CPUs without required instructions)
+- `qc` — `onnxruntime` (CPU QC runtime)
+- `qc-cuda` — `onnxruntime-gpu` (CUDA QC runtime; single GPU on device 0)
 
-All other
+All other ML, web, and Excel dependencies are in core `dependencies`; the ONNX Runtime choice is extra-driven.
 
-Install with: `uv sync` or `pip install tablassert`
+Install with: `uv sync`, `uv sync --extra qc`, `uv sync --extra qc-cuda`, or `pip install tablassert[...]`
 
 ## CI Workflows
 
@@ -164,7 +166,3 @@ Install with: `uv sync` or `pip install tablassert`
 - **MkDocs deploy** (`.github/workflows/docs.yml`): builds docs and deploys to GitHub Pages on push to `main`
 - **Docker publish** (`.github/workflows/docker.yml`): builds and pushes image to GHCR on tag push (`v*`)
 - **Autotag** (`.github/workflows/autotag.yml`): automatic version tagging
-
-## Key Dependencies
-
-polars, duckdb, orjson, pydantic, typer, xxhash, loguru, rapidfuzz, scikit-learn, sqlite-utils, pyyaml, lazy-loader, polars-hash, fastexcel, pyarrow, optimum-onnx
@@ -2,6 +2,32 @@
 
 All notable changes to this project are documented in this file.
 
+## 7.4.1 - 2026-05-05
+
+### Bug Fixes
+- Fixed `AttributeError: 'str' object has no attribute 'value'` raised by `format_section_oneline()` in `progress.py` during the BUILDING TCODE stage. The `Section` model sets `use_enum_values=True`, so `Tcode.status` is already a plain string — removed the stale `.value` access.
+
+## 7.4.0 - 2026-05-05
+
+### Changes
+- Renamed CLI commands for brevity: `build-knowledge-graph` → `build`, `verify-table-configuration-syntax` → `validate`. Version display moved from `tablassert version` subcommand to `tablassert --version` flag.
+- Added `qc` parameter to `resolve_many()` for optional QC auditing during standalone batch resolution. ONNX Runtime provider is auto-detected via `get_qc_provider()`.
+- Added `has_qc_runtime()` helper to `qc.py` for ONNX Runtime detection.
+- Added `empty_matches()` helper to `fullmap.py` for empty result fallback.
+- Added `DownloadReceipt` dataclass, `DownloadError`/`DownloadValidationError` exception classes, and `classify()`/`validate_download()`/`modernize_xls()` to `downloader.py`.
+- Updated log format to include timestamps: `{time:YYYY-MM-DD HH:mm:ss}`.
+
+### Bug Fixes
+- Fixed tutorial table configuration using header names as `encoding` values instead of Excel column letters (`A`, `B`, `C`, `D`).
+
+### Documentation
+- Updated all documentation to reflect renamed CLI commands.
+- Fixed tutorial and example YAML configurations to use Excel column letter references (`A`, `B`, `C`, `D`) for `method: column` encodings instead of header names, matching the headerless source reading behavior.
+- Fixed `encoding` values in `docs/examples/` gallery configurations.
+- Updated `resolve_many()` API reference with new `qc` parameter and auto-detected QC provider.
+- Fixed CITATION.cff version (7.2.2 → 7.4.0).
+- Fixed CONTRIBUTING.md lazy-loaded package list (`typer` → `cyclopts`, added missing packages).
+
 ## 7.3.6 - 2026-04-29
 
 ### Documentation
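The 7.4.0 notes in this hunk rename the commands to `build` and `validate` and move version display to a `--version` flag on the root command. The sketch below shows that command layout using the standard library's `argparse` as a stand-in for cyclopts, which is what the project uses. The parser construction, help strings, and version string passed in are illustrative assumptions; the positional argument names follow the synopsis lines in `docs/cli.md`.

```python
import argparse


def make_parser(version: str) -> argparse.ArgumentParser:
    """Build a parser with `build`, `validate`, and a root-level --version flag."""
    parser = argparse.ArgumentParser(prog="tablassert")
    # Version is a flag on the root command rather than a subcommand.
    parser.add_argument("--version", action="version", version=version)
    commands = parser.add_subparsers(dest="command", required=True)

    build = commands.add_parser("build", help="Build a knowledge graph")
    build.add_argument("graph_configuration_file")

    validate = commands.add_parser("validate", help="Check a table configuration")
    validate.add_argument("table_configuration_file")
    return parser


parser = make_parser("7.4.1")
args = parser.parse_args(["build", "graph.yaml"])
checked = parser.parse_args(["validate", "table.yaml"])
```

Keeping the version as a flag means every remaining subcommand takes exactly one positional file argument, which keeps the command surface uniform.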
@@ -109,7 +135,7 @@ All notable changes to this project are documented in this file.
 ## 7.0.1 - 2026-03-17
 
 ### Documentation
-- Updated installation docs to reflect `pyproject.toml` extras and added `tablassert[
+- Updated installation docs to reflect `pyproject.toml` extras and added `tablassert[rt]` guidance for systems without required default Polars CPU instructions.
 
 ## 7.0.0 - 2026-03-17
 
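The changelog above lists new error types in `downloader.py`, and `AGENTS.md` describes that module as "httpx-based file downloads with retries". As a rough illustration of the retry idea only, here is a sketch of retry with exponential backoff in which the fetch step is injected so the example runs without network access. `fetch_with_retries`, `DownloadFailed`, and the delay values are hypothetical and do not reflect Tablassert's actual implementation.

```python
from __future__ import annotations

import time
from typing import Callable


class DownloadFailed(Exception):
    """Local stand-in for a download error type; not Tablassert's class."""


def fetch_with_retries(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on OSError with exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as error:  # network-style failures; other errors propagate
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2**attempt)  # delay doubles after each failure
    raise DownloadFailed(f"{url} failed after {attempts} attempts") from last_error


# Hypothetical usage with a fake fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url: str) -> bytes:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"payload"


delays: list[float] = []
data = fetch_with_retries(flaky_fetch, "https://example.org/table.csv", sleep=delays.append)
```

Injecting `fetch` and `sleep` keeps the retry policy testable in isolation; a real downloader would pass an HTTP client call in place of `flaky_fetch`.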
@@ -2,7 +2,7 @@ cff-version: 1.2.0
 message: "If you use Tablassert, please cite it as below."
 type: software
 title: Tablassert
-version: 7.
+version: 7.4.0
 license: Apache-2.0
 repository-code: https://github.com/SkyeAv/Tablassert
 abstract: Tablassert is a highly performant declarative knowledge graph backend for bioinformatics that extracts knowledge assertions from tabular data, performs entity resolution and data quality control, and exports NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON.
@@ -25,7 +25,7 @@ uv sync
 All ML, web, and Excel dependencies are included in the core install. The only optional extra is a runtime-compatible Polars build for CPUs without required instructions:
 
 ```bash
-uv sync --extra
+uv sync --extra rt # polars[rtcompat]
 ```
 
 ## Development Workflow
@@ -143,7 +143,7 @@ else:
     pl = Lazy.load("polars")
 ```
 
-Lazy-loaded packages: `polars`, `duckdb`, `orjson`, `
+Lazy-loaded packages: `polars`, `duckdb`, `orjson`, `xxhash`, `polars_hash`, `yaml`, `httpx`, `pyexcel`, `onnxruntime`, `sentence_transformers`
 
 Import order: standard library → blank line → third-party → blank line → local
 
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.
+Version: 7.4.1
 Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -24,14 +24,15 @@ Classifier: Programming Language :: Python :: 3.14
 Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
 Requires-Python: >=3.11
+Requires-Dist: cyclopts>=1.0.0
 Requires-Dist: duckdb>=1.5.0
 Requires-Dist: fastexcel>=0.19.0
+Requires-Dist: httpx>=0.28.1
 Requires-Dist: lazy-loader>=0.5
 Requires-Dist: loguru>=0.7.3
-Requires-Dist: onnxruntime>=1.24.4
 Requires-Dist: optimum-onnx>=0.1.0
 Requires-Dist: orjson>=3.11.7
-Requires-Dist: playwright
+Requires-Dist: playwright<1.59,>=1.58.0
 Requires-Dist: polars-hash>=0.5.6
 Requires-Dist: polars>=1.39.0
 Requires-Dist: pyarrow>=23.0.1
@@ -39,15 +40,17 @@ Requires-Dist: pydantic>=2.12.5
 Requires-Dist: pyexcel>=0.7.4
 Requires-Dist: pyyaml>=6.0.3
 Requires-Dist: rapidfuzz>=3.14.3
+Requires-Dist: rich>=13.0.0
 Requires-Dist: scikit-learn>=1.8.0
 Requires-Dist: sentence-transformers>=5.3.0
 Requires-Dist: sqlite-utils>=3.39
-Requires-Dist: typer>=0.21.2
 Requires-Dist: xxhash>=3.6.0
+Provides-Extra: qc
+Requires-Dist: onnxruntime>=1.24.4; extra == 'qc'
+Provides-Extra: qc-cuda
+Requires-Dist: onnxruntime-gpu>=1.24.4; extra == 'qc-cuda'
 Provides-Extra: rt
-Requires-Dist: polars[rtcompat]>=1.
-Provides-Extra: rtcompat
-Requires-Dist: polars[rtcompat]>=1.39.0; extra == 'rtcompat'
+Requires-Dist: polars[rtcompat]>=1.40.1; extra == 'rt'
 Description-Content-Type: text/markdown
 
 # Tablassert
@@ -57,11 +60,11 @@ Description-Content-Type: text/markdown
 [](https://github.com/SkyeAv/Tablassert/blob/main/LICENSE)
 [](https://skyeav.github.io/Tablassert/)
 
-Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control
+Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution built in and optional quality control.
 
 ```bash
 pip install tablassert
-tablassert build
+tablassert build config.yaml
 ```
 
 **[Full Documentation](https://skyeav.github.io/Tablassert/)** — installation guides, tutorials, configuration reference, and API docs.
@@ -72,12 +75,16 @@ tablassert build-knowledge-graph config.yaml
 pip install tablassert
 ```
 
-
+Base install includes web and Excel support. Optional extras are available for CPU compatibility and QC runtime selection:
 
 ```bash
-pip install "tablassert[
+pip install "tablassert[rt]" # Polars build for CPUs without required instructions
+pip install "tablassert[qc]" # Enable QC with CPU ONNX Runtime
+pip install "tablassert[qc-cuda]" # Enable QC with CUDA ONNX Runtime on GPU 0
 ```
 
+QC is disabled by default at the graph level. Set `qc: true` in a graph config to enable the audit stage.
+
 <details>
 <summary><strong>Docker</strong></summary>
 
@@ -88,32 +95,39 @@ docker run --rm \
 -v /path/to/config:/data \
 -v /path/to/datassert:/datassert \
 ghcr.io/skyeav/tablassert:latest \
-build
+build /data/graph-config.yaml
 ```
 
 </details>
 
 ## Quick Demo
 
-```
-
-
-
-
-
-
-
-
-
+```python
+from pathlib import Path
+from tablassert.lib import resolve_many
+
+# Resolve gene names to CURIEs against a datassert database
+results = resolve_many(
+    col="gene",
+    entities=["TP53", "BRCA1", "EGFR"],
+    datassert=Path("/path/to/datassert"),
+    taxon="9606",
+)
+
+for row in results:
+    print(f"{row['original gene']} → {row['gene']} ({row['gene name']})")
+# TP53 → HGNC:11998 (TP53)
+# BRCA1 → HGNC:1100 (BRCA1)
+# EGFR → HGNC:3236 (EGFR)
 ```
 
-
+Point `resolve_many()` at a datassert database and resolve any iterable of entity strings to CURIEs — no LazyFrame setup, NLP preprocessing, or DuckDB connection management required. For full pipeline builds with YAML configuration, use `tablassert build config.yaml`.
 
 ## Key Features
 
 - **Declarative Configuration** — YAML-based, no code required
 - **Entity Resolution** — Maps text to biological entities (genes, diseases, chemicals)
-- **Quality Control** —
+- **Quality Control** — Optional three-stage validation (exact → fuzzy → BERT embeddings)
 - **KGX Compliance** — NCATS Translator-compatible NDJSON output
 - **Performance** — Lazy evaluation pipelines with Polars and DuckDB-accelerated entity resolution
 
@@ -5,11 +5,11 @@
 [](https://github.com/SkyeAv/Tablassert/blob/main/LICENSE)
 [](https://skyeav.github.io/Tablassert/)
 
-Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control
+Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution built in and optional quality control.
 
 ```bash
 pip install tablassert
-tablassert build
+tablassert build config.yaml
 ```
 
 **[Full Documentation](https://skyeav.github.io/Tablassert/)** — installation guides, tutorials, configuration reference, and API docs.
@@ -20,12 +20,16 @@ tablassert build-knowledge-graph config.yaml
 pip install tablassert
 ```
 
-
+Base install includes web and Excel support. Optional extras are available for CPU compatibility and QC runtime selection:
 
 ```bash
-pip install "tablassert[
+pip install "tablassert[rt]" # Polars build for CPUs without required instructions
+pip install "tablassert[qc]" # Enable QC with CPU ONNX Runtime
+pip install "tablassert[qc-cuda]" # Enable QC with CUDA ONNX Runtime on GPU 0
 ```
 
+QC is disabled by default at the graph level. Set `qc: true` in a graph config to enable the audit stage.
+
 <details>
 <summary><strong>Docker</strong></summary>
 
@@ -36,32 +40,39 @@ docker run --rm \
 -v /path/to/config:/data \
 -v /path/to/datassert:/datassert \
 ghcr.io/skyeav/tablassert:latest \
-build
+build /data/graph-config.yaml
 ```
 
 </details>
 
 ## Quick Demo
 
-```
-
-
-
-
-
-
-
-
-
+```python
+from pathlib import Path
+from tablassert.lib import resolve_many
+
+# Resolve gene names to CURIEs against a datassert database
+results = resolve_many(
+    col="gene",
+    entities=["TP53", "BRCA1", "EGFR"],
+    datassert=Path("/path/to/datassert"),
+    taxon="9606",
+)
+
+for row in results:
+    print(f"{row['original gene']} → {row['gene']} ({row['gene name']})")
+# TP53 → HGNC:11998 (TP53)
+# BRCA1 → HGNC:1100 (BRCA1)
+# EGFR → HGNC:3236 (EGFR)
 ```
 
-
+Point `resolve_many()` at a datassert database and resolve any iterable of entity strings to CURIEs — no LazyFrame setup, NLP preprocessing, or DuckDB connection management required. For full pipeline builds with YAML configuration, use `tablassert build config.yaml`.
 
 ## Key Features
 
 - **Declarative Configuration** — YAML-based, no code required
 - **Entity Resolution** — Maps text to biological entities (genes, diseases, chemicals)
-- **Quality Control** —
+- **Quality Control** — Optional three-stage validation (exact → fuzzy → BERT embeddings)
 - **KGX Compliance** — NCATS Translator-compatible NDJSON output
 - **Performance** — Lazy evaluation pipelines with Polars and DuckDB-accelerated entity resolution
 
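The Quick Demo added to the README shows `resolve_many()` returning a list of dictionaries keyed by the column name (`gene`), `original gene`, and `gene name`. The sketch below post-processes rows of that shape into a lookup table. The rows are hard-coded to mirror the README's commented output, since a real call needs a datassert database, and `to_lookup` is a hypothetical helper rather than part of the library.

```python
from typing import Any


def to_lookup(rows: list[dict[str, Any]], col: str) -> dict[str, str]:
    """Map each original entity string to its resolved CURIE."""
    return {row[f"original {col}"]: row[col] for row in rows}


# Sample rows shaped like the README demo output (assumed, not from a real call).
rows = [
    {"original gene": "TP53", "gene": "HGNC:11998", "gene name": "TP53"},
    {"original gene": "BRCA1", "gene": "HGNC:1100", "gene name": "BRCA1"},
    {"original gene": "EGFR", "gene": "HGNC:3236", "gene name": "EGFR"},
]
lookup = to_lookup(rows, col="gene")
```

Because the return value is plain Python data, no Polars or DuckDB knowledge is needed to consume it downstream.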
@@ -19,6 +19,7 @@ def resolve_many(
     prioritize: Optional[list[Categories]] = None,
     avoid: Optional[list[Categories]] = None,
     column_context: bool = True,
+    qc: bool = False,
 ) -> list[dict[str, Any]]
 ```
 
@@ -71,6 +72,10 @@ Controls category-frequency tie-breaking when multiple matches exist for a term.
 
 This is useful when resolving a column of related entities (e.g., all genes) — the shared context helps disambiguate terms that map to multiple categories.
 
+**`qc: bool` (default: `False`)**
+
+When `True`, runs the QC audit stage after entity resolution. The QC pipeline validates mappings through a three-stage audit: exact match, fuzzy matching via rapidfuzz, and BioBERT sentence embeddings with cosine similarity. Requires a QC runtime to be installed (`tablassert[qc]` or `tablassert[qc-cuda]`). The ONNX Runtime provider is auto-detected based on the installed package — CUDA is preferred when `onnxruntime-gpu` is available, otherwise CPU is used.
+
 ### Return Value
 
 Returns a `list[dict[str, Any]]` — one dictionary per resolved entity. The list is produced by calling `polars.DataFrame.to_dicts()` on the collected resolution output.
@@ -230,7 +235,7 @@ Both levels are queried during resolution. Level one (exact case-insensitive mat
 
 ## Integration
 
-`resolve_many()` is a self-contained entry point. It does not require any prior setup beyond having a datassert database available. For full pipeline builds, use the CLI (`tablassert build
+`resolve_many()` is a self-contained entry point. It does not require any prior setup beyond having a datassert database available. For full pipeline builds, use the CLI (`tablassert build`) which orchestrates resolution through the `Tcode` class.
 
 ## Next Steps
 
@@ -2,6 +2,8 @@
 
 The `qc` module validates entity resolution mappings through a multi-stage pipeline: exact matching, fuzzy matching, and BERT semantic similarity.
 
+QC runtime support is optional. Install `tablassert[qc]` for CPU inference or `tablassert[qc-cuda]` for CUDA inference on GPU 0.
+
 ## fullmap_audit()
 
 Primary quality control function that filters entity mappings based on confidence criteria.
@@ -14,7 +16,9 @@ def fullmap_audit(
     col: str,
     section_hash: str,
     config_file: str,
-    out: str = "passed"
+    out: str = "passed",
+    log: bool = True,
+    provider: Optional[Literal["cpu", "cuda"]] = None
 ) -> pl.LazyFrame
 ```
 
@@ -44,6 +48,14 @@ Name of the boolean column indicating validation status.
 
 Rows with `out=True` passed QC, `out=False` failed.
 
+**`log: bool` (default: `True`)**
+
+Controls whether failed QC rows are logged.
+
+**`provider: Optional[Literal["cpu", "cuda"]]`**
+
+Optional runtime override. Use `"cpu"` to force CPU inference, `"cuda"` to require CUDA inference, or `None` to auto-select from the installed QC runtime.
+
 **`section_hash: str` / `config_file: str`**
 
 Context fields used in QC failure logs for traceability.
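The `provider` parameter documented in this hunk has three modes: force CPU, require CUDA, or auto-select. A minimal sketch of that selection logic follows, written as a pure function over a list of available provider names so it runs without ONNX Runtime installed. `choose_provider` and `EXECUTION_PROVIDERS` are hypothetical names, and the fallback order is an assumption based on the documented "CUDA is preferred" behaviour rather than the library's code.

```python
from __future__ import annotations

from typing import Literal, Optional, Sequence

# Execution provider names as they appear in the QC docs.
EXECUTION_PROVIDERS = {"cpu": "CPUExecutionProvider", "cuda": "CUDAExecutionProvider"}


def choose_provider(
    requested: Optional[Literal["cpu", "cuda"]],
    available: Sequence[str],
) -> str:
    """Pick an ONNX Runtime execution provider name from those available."""
    if requested is not None:
        name = EXECUTION_PROVIDERS[requested]
        if name not in available:
            raise RuntimeError(f"{name} was requested but is not available")
        return name
    # Auto-select: prefer CUDA when present, otherwise fall back to CPU.
    for name in (EXECUTION_PROVIDERS["cuda"], EXECUTION_PROVIDERS["cpu"]):
        if name in available:
            return name
    raise RuntimeError("No QC runtime found; install tablassert[qc] or tablassert[qc-cuda]")


both = ["CUDAExecutionProvider", "CPUExecutionProvider"]
auto = choose_provider(None, both)
forced = choose_provider("cpu", both)
```

With ONNX Runtime installed, the `available` list would typically come from `onnxruntime.get_available_providers()`.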
@@ -119,13 +131,13 @@ return similarity >= 0.2
 
 **Model:** `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb`
 
-**Backend:** ONNX Runtime (
+**Backend:** ONNX Runtime (`CPUExecutionProvider` or `CUDAExecutionProvider`)
 
 **Optimizations:**
 - Graph optimization level: ALL
 - ONNX session caching
 
-Lazy-loaded on first `fullmap_audit()` call that reaches the embedding stage, then reused for subsequent calls.
+Lazy-loaded on first `fullmap_audit()` call that reaches the embedding stage, then reused per provider for subsequent calls.
 
 ### Model Caching
 
@@ -163,7 +175,8 @@ validated = fullmap_audit(
     lf,
     col="subject",
     section_hash="tutorial-section",
-    config_file="tutorial-table.yaml"
+    config_file="tutorial-table.yaml",
+    provider="cpu"
 )
 
 # Only rows that passed QC remain
@@ -205,7 +218,7 @@ Output: 990 rows (700 + 250 + 40)
 
 ### Integration with Pipeline
 
-QC is applied after entity resolution:
+QC is applied after entity resolution when graph or API QC is enabled:
 
 1. **Entity resolution** (`resolve()`) - Maps text to CURIEs
 2. **Quality control** (`fullmap_audit()`) - Validates mappings
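The QC docs in this diff describe a staged audit: exact match, then fuzzy matching, then embedding similarity with a `similarity >= 0.2` cutoff. The sketch below illustrates that cascade under loud assumptions: `difflib.SequenceMatcher` stands in for rapidfuzz, a character-trigram count vector stands in for BioBERT embeddings, and the `0.9` fuzzy cutoff and all function names are hypothetical. It shows the control flow only, not the library's scoring.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def trigram_vector(text: str) -> Counter:
    """Toy 'embedding': character trigram counts (stand-in for BioBERT)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i : i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def audit(original: str, resolved_name: str,
          fuzzy_cutoff: float = 0.9, cosine_cutoff: float = 0.2) -> str:
    """Return the first stage that accepts the mapping, or 'failed'."""
    left, right = original.strip().lower(), resolved_name.strip().lower()
    if left == right:  # stage 1: exact match
        return "exact"
    if SequenceMatcher(None, left, right).ratio() >= fuzzy_cutoff:  # stage 2: fuzzy
        return "fuzzy"
    if cosine(trigram_vector(left), trigram_vector(right)) >= cosine_cutoff:  # stage 3
        return "embedding"
    return "failed"


stages = [
    audit("TP53", "tp53"),
    audit("interleukin 6", "interleukin-6"),
    audit("tumor protein p53", "p53 tumor protein"),
    audit("BRCA1", "xyz"),
]
```

Ordering the stages from cheapest to most expensive means the embedding model is only consulted for mappings the earlier checks could not settle.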
@@ -4,10 +4,10 @@ The canonical release history lives in the repository root at [`CHANGELOG.md`](h
 
 ## Current Release Notes
 
-## 7.
+## 7.4.1 - 2026-05-05
 
-###
+### Bug Fixes
 
--
+- Fixed crash in the BUILDING TCODE progress display caused by `format_section_oneline()` calling `.value` on `Tcode.status`, which is a plain string under `use_enum_values=True`.
 
 For older releases and the full project history, open the root `CHANGELOG.md` in the repository.
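The 7.4.1 fix above comes down to a plain string having no `.value` attribute: with Pydantic's `use_enum_values=True`, a model stores the enum's value rather than the enum member. The release notes say the stale `.value` access was removed. The sketch below shows a different, defensive option that accepts either form, using only the standard library. `Status` and `status_text` are hypothetical names for illustration.

```python
from __future__ import annotations

from enum import Enum


class Status(str, Enum):  # hypothetical enum; the project's enums live in enums.py
    DONE = "done"


def status_text(status: str | Status) -> str:
    """Return the plain string whether given an enum member or its value."""
    return status.value if isinstance(status, Enum) else status


as_member = status_text(Status.DONE)
# A model configured with use_enum_values=True would hold the plain string.
as_string = status_text("done")
```

Removing the `.value` call, as the release did, is the simpler fix when the field is always a string; a helper like this only pays off where both forms can reach the same code path.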
@@ -1,6 +1,6 @@
 # CLI Reference
 
-Tablassert provides
+Tablassert provides two commands.
 
 ## version
 
@@ -9,29 +9,29 @@ Display the current Tablassert package version.
 ### Synopsis
 
 ```bash
-tablassert version
+tablassert --version
 ```
 
 ### Example
 
 ```bash
-tablassert version
+tablassert --version
 ```
 
 ### Description
 
-Prints the installed Tablassert version to stdout and exits.
+Prints the installed Tablassert version to stdout and exits. This is a flag on the main `tablassert` command, not a subcommand.
 
 ---
 
-## build
+## build
 
 Build A KGX Compliant Knowledge Graph From A Graph Configuration File
 
 ### Synopsis
 
 ```bash
-tablassert build
+tablassert build <graph_configuration_file>
 ```
 
 ### Options
@@ -43,7 +43,7 @@ tablassert build-knowledge-graph <graph_configuration_file>
 ### Example
 
 ```bash
-tablassert build
+tablassert build /path/to/MOKGV6.yaml
 ```
 
 ### Description
@@ -68,14 +68,14 @@ See [Graph Configuration](configuration/graph.md) for details on the YAML schema
 
 ---
 
-##
+## validate
 
 Verify The Syntax Of A Declarative Table Configuration File
 
 ### Synopsis
 
 ```bash
-tablassert
+tablassert validate <table_configuration_file>
 ```
 
 ### Options
@@ -87,7 +87,7 @@ tablassert verify-table-configuration-syntax <table_configuration_file>
 ### Example
 
 ```bash
-tablassert
+tablassert validate /path/to/table-config.yaml
 ```
 
 ### Description
@@ -108,27 +108,27 @@ See [Table Configuration](configuration/table.md) for details on the YAML schema
 ### Check Version
 
 ```bash
-tablassert version
+tablassert --version
 ```
 
 ### Build Knowledge Graph
 
 ```bash
-tablassert build
+tablassert build my-graph.yaml
 ```
 
 ### Validate Table Configuration
 
 ```bash
-tablassert
+tablassert validate table-config.yaml
 ```
 
 ## Workflow
 
 1. **Create table configuration** - Define data sources and transformations
 2. **Create graph configuration** - Define output name, table configs, databases
-3. **Validate table config** - `tablassert
-4. **Build knowledge graph** - `tablassert build
+3. **Validate table config** - `tablassert validate table.yaml`
+4. **Build knowledge graph** - `tablassert build graph.yaml`
 5. **Process executes:**
 - Downloads files from URLs (if needed)
 - Applies transformations to each table