tablassert 7.4.9__tar.gz → 7.4.11__tar.gz
- tablassert-7.4.11/AGENTS.md +50 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/CHANGELOG.md +5 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/PKG-INFO +1 -1
- tablassert-7.4.11/docs/changelog.md +11 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/pyproject.toml +2 -2
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/cli.py +39 -32
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/fullmap.py +7 -5
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/lib.py +2 -2
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/models.py +42 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/progress.py +1 -2
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/conftest.py +14 -1
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_models.py +192 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/uv.lock +1 -1
- tablassert-7.4.9/AGENTS.md +0 -168
- tablassert-7.4.9/docs/changelog.md +0 -11
- {tablassert-7.4.9 → tablassert-7.4.11}/.github/workflows/autotag.yml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/.github/workflows/docker.yml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/.github/workflows/docs.yml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/.github/workflows/pipy.yml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/.gitignore +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/.pre-commit-config.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/CITATION.cff +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/CONTRIBUTING.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/Dockerfile +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/LICENSE +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/README.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/api/fullmap.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/api/lib.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/api/qc.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/api/utils.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/cli.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/configuration/advanced-example.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/configuration/graph.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/configuration/table.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/datassert.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/docker.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/examples/tutorial-data.csv +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/examples/tutorial-graph.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/examples/tutorial-table.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/examples.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/index.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/installation.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/docs/tutorial.md +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/llms.txt +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/mkdocs.yml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/__init__.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/downloader.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/enums.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/ingests.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/log.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/nlp.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/qc.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/src/tablassert/utils.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/__init__.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/fixtures/minimal_section.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_downloader.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_enums.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_fullmap.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_ingests.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_lib.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_nlp.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_qc.py +0 -0
- {tablassert-7.4.9 → tablassert-7.4.11}/tests/test_utils.py +0 -0
@@ -0,0 +1,50 @@
+# AGENTS.md — Tablassert
+
+## Fast Start
+
+- Python package, not a monorepo. Main code lives in `src/tablassert/`; tests live in `tests/`.
+- Install with `uv sync`. QC is not available unless you install an extra: `uv sync --extra qc` or `uv sync --extra qc-cuda`.
+- CLI entrypoint is `tablassert.cli:APP`. Real user commands are:
+  - `uv run tablassert build <graph.yaml>`
+  - `uv run tablassert validate <table.yaml> <datassert>`
+
+## Verify Changes
+
+- Match the repo hooks before finishing: `uv run ruff check --fix .`, `uv run ruff format .`, `uv run pyright`, `uv run pytest`.
+- Full hook run: `uv run pre-commit run --all-files`.
+- Focused test runs:
+  - Single test: `uv run pytest tests/test_lib.py::test_name`
+  - By keyword: `uv run pytest -k "pattern"`
+  - With print output: `uv run pytest -s tests/test_lib.py`
+- Docs build: `uv run --group dev mkdocs build`
+
+## High-Value Structure
+
+- `src/tablassert/cli.py` is the wiring layer: `build()` calls `build_pipeline()`, `validate()` calls `validate_pipeline()`.
+- `src/tablassert/ingests.py` loads YAML and expands table configs into section dicts.
+- `src/tablassert/lib.py` is the core pipeline:
+  - `Tcode.collect()` builds the per-section operation list.
+  - `compile_subgraph()` executes that list into parquet.
+  - `compile_graph()` aggregates subgraph parquets into KGX NDJSON.
+  - `resolve_many()` is the direct library API for batch entity resolution.
+- Entity resolution uses DuckDB shard files under `<datassert>/data/`. `src/tablassert/fullmap.py` hardcodes `SHARDS = 10`.
+
+## Repo-Specific Gotchas
+
+- Heavy dependencies are lazy-loaded per module with `TYPE_CHECKING` + `lazy_loader`. Follow the existing pattern instead of importing heavy packages eagerly.
+- `tests/conftest.py` autouse-mocks `httpx.head`, so model URL validation tests do not hit the network unless a test is explicitly marked otherwise.
+- Network-dependent tests are marked `@pytest.mark.network`; GPU QC tests are marked with both `network` and `gpu` in `tests/test_qc.py`.
+- QC runtime selection is strict in `src/tablassert/qc.py`: if `onnxruntime-gpu` is installed but `CUDAExecutionProvider` is unavailable, the code raises instead of falling back to CPU.
+- Downloader behavior in `src/tablassert/downloader.py` is two-path: direct `httpx` fetch for known file URLs, headless-browser fallback for browser-only sources. Keep tests around payload validation and cleanup intact when changing it.
+
+## Conventions That Matter Here
+
+- Start every module with `from __future__ import annotations`.
+- Annotate locals, not just function signatures.
+- Use `Optional[T]` / `Union[...]`, not `T | None`.
+- Prefer `Path` over raw path strings.
+- Function docs are usually `# ?` comments above the code, not docstrings.
+
+## Side Effects
+
+- The package writes working artifacts to hidden directories in the repo root: `.storassert/`, `.logassert/`, `.cachassert/`, and `.onnxassert/`.
@@ -2,6 +2,11 @@
 
 All notable changes to this project are documented in this file.
 
+## 7.4.10 - 2026-05-29
+
+### Changes
+- Enforced explicit `biolink:` namespace prefix on predicates and qualifiers emitted by `compile_subgraph()` in `lib.py`. Both `self.statement.predicate` and `x.qualifier` (in the qualifier loop) are now prefixed via `add("biolink:", ...)`, ensuring all output edges carry fully-qualified Biolink CURIEs rather than bare predicate/qualifier names.
+
 ## 7.4.9 - 2026-05-26
 
 ### Bug Fixes
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.4.
+Version: 7.4.11
 Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -0,0 +1,11 @@
+# Changelog
+
+The canonical release history lives in the repository root at [`CHANGELOG.md`](https://github.com/SkyeAv/Tablassert/blob/main/CHANGELOG.md).
+
+## Current Release Notes
+
+## 7.4.10 - 2026-05-29
+
+### Changes
+
+- Enforced explicit `biolink:` namespace prefix on predicates and qualifiers emitted by `compile_subgraph()` in `lib.py`. Both `self.statement.predicate` and `x.qualifier` (in the qualifier loop) are now prefixed via `add("biolink:", ...)`, ensuring all output edges carry fully-qualified Biolink CURIEs rather than bare predicate/qualifier names.
@@ -1,6 +1,6 @@
 [project]
 name = "tablassert"
-version = "7.4.
+version = "7.4.11"
 description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
 authors = [
     { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
@@ -100,7 +100,7 @@ dev = [
 
 [tool.pytest.ini_options]
 testpaths = ["tests"]
-markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider"]
+markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider", "datassert: requires the datassert DuckDB shards"]
 
 [tool.ruff]
 line-length = 120
@@ -57,29 +57,30 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
     sections: list[dict[str, Any]] = list(chain.from_iterable(temp))
     n: int = len(sections)
 
-    # * Build TCode (3/6)
-    progress.stage(f"Building TCode | Sections: {n}")
-    advance = progress.section_loop(n, "TCode")
-    tcode: list[Tcode] = []
-    for idx, s in enumerate(sections, start=1):
-        try:
-            tcode.append(
-                Tcode.model_validate(
-                    {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc}
-                )
-            )
-        except pydantic.ValidationError as e:
-            raise RuntimeError(
-                f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
-            ) from e
-        advance(format_section_oneline(tcode[-1]))
-
     with ExitStack() as stack:
         conns: list[object] = [
             stack.enter_context(duckdb.connect(g.datassert / "data" / f"{x}.duckdb", read_only=True))
             for x in range(SHARDS)
         ]
 
+        # * Build TCode (3/6)
+        progress.stage(f"Building TCode | Sections: {n}")
+        advance = progress.section_loop(n, "TCode")
+        tcode: list[Tcode] = []
+        for idx, s in enumerate(sections, start=1):
+            try:
+                tcode.append(
+                    Tcode.model_validate(
+                        {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc},
+                        context={"conns": conns},
+                    )
+                )
+            except pydantic.ValidationError as e:
+                raise RuntimeError(
+                    f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
+                ) from e
+            advance(format_section_oneline(tcode[-1]))
+
         # * Collect Instructions (4/6)
         progress.stage(f"Collecting Instructions | Sections: {n}")
         advance = progress.section_loop(n, "Collect")
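The hunk above moves TCode validation inside the `ExitStack` block so that validation can reuse the already-open shard connections. The sketch below illustrates that pattern only. It assumes `sqlite3` in-memory databases as a stand-in for the read-only DuckDB shard files, and `open_shards`, `validate_all`, and `SHARDS = 3` are hypothetical names chosen for illustration, not part of tablassert.

```python
import sqlite3
from contextlib import ExitStack, closing

SHARDS = 3  # hypothetical value; the package defines its own shard count


def open_shards(stack: ExitStack, count: int) -> list[sqlite3.Connection]:
    # Register every connection on the stack so that all of them are closed
    # together when the enclosing `with` block exits, even on error.
    return [stack.enter_context(closing(sqlite3.connect(":memory:"))) for _ in range(count)]


def validate_all(sections: list[dict], conns: list[sqlite3.Connection]) -> list[int]:
    # Stand-in for per-section validation that needs live connections.
    results: list[int] = []
    for idx, _section in enumerate(sections, start=1):
        shard = conns[idx % len(conns)]
        results.append(shard.execute("SELECT ?", (idx,)).fetchone()[0])
    return results


with ExitStack() as stack:
    conns = open_shards(stack, SHARDS)
    # Work that needs the connections must happen inside the block.
    checked = validate_all([{"a": 1}, {"b": 2}], conns)

print(checked)
```

Anything that touches `conns` after the `with` block would fail, which is presumably why the validation loop had to move inside it once validation started receiving the connections.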
@@ -105,8 +106,9 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
     logger.info(f"BUILD DONE | SECTIONS: {n} | NAME: {g.name} | VERSION: {g.version}")
 
 
-def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgress") -> None:
+def validate_pipeline(table_configuration_file: Path, datassert: Path, progress: "PipelineProgress") -> None:
     # ? Validate Section Syntax From A Configuration File
+    from tablassert.fullmap import SHARDS
     from tablassert.ingests import from_yaml, to_sections
     from tablassert.lib import Tcode
     from tablassert.progress import flatten_pydantic_error
@@ -124,27 +126,32 @@ def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgres
     # * Validate Section Syntax (3/3)
     progress.stage(f"Validating Section Syntax | Sections: {n}")
     advance = progress.section_loop(n, "Validate")
-
-
-
-
-
-
-
-
-
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        for idx, s in enumerate(sections, start=1):
+            h: str = mkhash(s)
+            try:
+                Tcode.model_validate({**s, "number": idx, "store": (STORE / f"{h}.parquet")}, context={"conns": conns})
+            except pydantic.ValidationError as e:
+                raise RuntimeError(
+                    f"02 | FAILED VALIDATION | CONFIG: {table_configuration_file} | IDX: {idx} | HASH: {h} | PYDANTIC: {flatten_pydantic_error(e)}"
+                ) from e
+            advance(f"#{idx} | HASH: {h}")
 
     logger.info(f"VALIDATE DONE | SECTIONS: {n} | CONFIG: {table_configuration_file.name}")
 
 
-def run(stages: int, fn: Any,
+def run(stages: int, fn: Any, *args: Path) -> None:
     from tablassert.log import LOG_FORMAT, logger
     from tablassert.progress import PipelineProgress
 
     with PipelineProgress(total_stages=stages) as progress:
         sink_id: int = logger.add(progress.log_sink, level="INFO", format=LOG_FORMAT)
         try:
-            fn(
+            fn(*args, progress)
         finally:
             logger.remove(sink_id)
 
@@ -156,6 +163,6 @@ def build(graph_configuration_file: Path) -> None:
 
 
 @APP.command
-def validate(table_configuration_file: Path) -> None:
-    """Validate section syntax from a YAML configuration file."""
-    run(3, validate_pipeline, table_configuration_file)
+def validate(table_configuration_file: Path, datassert: Path) -> None:
+    """Validate section syntax from a YAML configuration file against a datassert directory."""
+    run(3, validate_pipeline, table_configuration_file, datassert)
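The `cli.py` changes above make `run()` accept a variable number of path arguments and pass the progress object last, so that `validate` can hand over both the table config and the datassert directory. A minimal sketch of that forwarding follows. `FakeProgress` is a hypothetical stand-in for `PipelineProgress`, and the logging sink setup from the real function is left out.

```python
from pathlib import Path
from typing import Any, Callable


class FakeProgress:
    # Hypothetical stand-in for the package's PipelineProgress context manager.
    def __enter__(self) -> "FakeProgress":
        return self

    def __exit__(self, *exc: Any) -> None:
        return None


def run(fn: Callable[..., None], *args: Path) -> None:
    # Forward any number of path arguments, then the progress object last.
    with FakeProgress() as progress:
        fn(*args, progress)


calls: list[tuple[Any, ...]] = []


def validate_pipeline(table: Path, datassert: Path, progress: FakeProgress) -> None:
    # Record what was received so the argument order is visible.
    calls.append((table, datassert, type(progress).__name__))


run(validate_pipeline, Path("table.yaml"), Path("datassert"))
print(calls)
```

With this shape a one-path pipeline and a two-path pipeline can share the same wrapper, as long as each takes the progress object as its final positional parameter.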
@@ -51,7 +51,9 @@ def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFr
 
     terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
 
-    bad: str =
+    bad: str = (
+        r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"
+    )
     terms = terms.filter(~pl.col(col).str.contains(bad))
     return terms.with_columns((plh.col(col).nchash.xxhash64() % SHARDS).alias("shard")) # pyright: ignore
 
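The `bad` pattern in the hunk above rejects terms that are purely numeric, empty, or one of a fixed set of placeholder words. The sketch below shows its effect using Python's `re` module as a stand-in for the Polars `str.contains` call. That substitution is an assumption: the two regex engines differ, although they should agree on a pattern this simple. `keep_terms` and the sample inputs are hypothetical.

```python
import re

# Pattern copied from the added lines in the hunk above.
BAD = r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"


def keep_terms(terms: list[str]) -> list[str]:
    # Drop terms that are purely numeric, empty, or a known placeholder word.
    return [t for t in terms if re.search(BAD, t) is None]


kept = keep_terms(["brca1", "123", "nan", "", "p value", "tp53", "symbol", "nana"])
print(kept)  # ['brca1', 'tp53', 'nana']
```

Because every alternative is anchored with `^` and `$`, only whole-term matches are filtered: `nana` survives even though it starts with `nan`. The pattern is also case-sensitive, so it relies on terms having been lowercased earlier in the pipeline.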
@@ -150,15 +152,15 @@ def log_unmatched(
 ) -> None:
     # * Log Unmatched Entities
     level_one: pl.LazyFrame = terms.filter(pl.col("nlp level") == 1)
-    antimatches: pl.LazyFrame = level_one.join(
+    antimatches: pl.LazyFrame = level_one.join(
+        matches.lazy().select("term"), left_on="term", right_on="term", how="anti"
+    )
 
     # ! Collection Point: Requires Eager
     unnmatched: pl.DataFrame = antimatches.select("term").unique().collect()
     if unnmatched.height > 0:
         for term in unnmatched.get_column("term").to_list():
-            logger.info(
-                f"FAILED | STORE: {section_hash} | CONFIG: {config_file} | COL: {col} | VALUE: {term!r}"
-            )
+            logger.info(f"FAILED | STORE: {section_hash} | CONFIG: {config_file} | COL: {col} | VALUE: {term!r}")
 
 
 def resolve(
@@ -342,8 +342,8 @@ class Tcode(Section):
             [op for x in self.annotations for op in self.encoding(x, x.annotation)] if self.annotations else None,
             self.node(self.statement.subject, "subject", conns),
             self.node(self.statement.object, "object", conns),
-            (value, ("predicate", self.statement.predicate)),
-            [op for x in self.statement.qualifiers for op in self.node(x, x.qualifier, conns)]
+            (value, ("predicate", add("biolink:", self.statement.predicate))),
+            [op for x in self.statement.qualifiers for op in self.node(x, add("biolink:", x.qualifier), conns)]
             if self.statement.qualifiers
             else None,
             (value, ("syntax", self.syntax)),
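The `lib.py` hunk above wraps the predicate and each qualifier name in `add("biolink:", ...)`. The sketch below shows the resulting strings under the assumption that `add` behaves like `operator.add` on two strings. The diff does not show where `add` is imported from, and `to_biolink` is a hypothetical helper.

```python
from operator import add


def to_biolink(name: str) -> str:
    # Hypothetical helper: prepend the namespace the way add("biolink:", ...) would,
    # assuming `add` is plain string concatenation.
    return add("biolink:", name)


edge = {
    "predicate": to_biolink("treats"),
    "qualifiers": [to_biolink(q) for q in ["disease_context_qualifier"]],
}
print(edge)
```

Plain concatenation would double-prefix an input that already starts with `biolink:`. Whether upstream values can already carry the prefix is not visible in this diff.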
@@ -25,6 +25,9 @@ from tablassert.enums import (
     Tokens,
 )
 
+from tablassert.fullmap import resolve
+from tablassert.nlp import level_one, level_two
+
 if TYPE_CHECKING:
     import httpx
     import polars as pl
@@ -315,6 +318,35 @@ class Annotation(Encoding):
         return annotation.replace("_", " ").strip()
 
 
+def resolves_value_encodings(statement: Statement, conns: list[object]) -> None:
+    # ? Resolve Every Value-Method Literal Against The Shared Datassert Shards
+    nodes: list[tuple[str, NodeEncoding]] = [("subject", statement.subject), ("object", statement.object)]
+    if statement.qualifiers:
+        nodes += [(q.qualifier, q) for q in statement.qualifiers]
+
+    for label, node in nodes:
+        if not eq(node.method, EncodingMethods.VALUE):
+            continue
+
+        term: str = str(node.encoding)
+        lf: pl.LazyFrame = pl.DataFrame({"term": [term]}).lazy()
+        lf = level_one(lf, "term")
+        lf = level_two(lf, "term")
+        resolved: pl.DataFrame = resolve(
+            lf,
+            "term",
+            conns,
+            taxon=str(node.taxon) if node.taxon else None,
+            prioritize=node.prioritize,
+            avoid=node.avoid,
+            log=False,
+            column_context=False,
+        ).collect()
+        if resolved.height == 0:
+            msg: str = f"21 | value encoding {term!r} in {label!r} did not resolve against datassert"
+            raise ValueError(msg)
+
+
 class Section(TablaBase):
     # ? Pydantic "Section" Model And Coercion
     syntax: Syntaxes = Field(Syntaxes.TC3, description="Section configuration syntax version.")
@@ -326,6 +358,16 @@ class Section(TablaBase):
         None, description="Optional extra encoded columns added to each row."
     )
 
+    @field_validator("statement", mode="after")
+    @classmethod
+    def value_encodings_resolve(cls, statement: Statement, info: Any) -> Statement:
+        # ? Ensure Value-Method Encodings Resolve Against The Shared Datassert Shards
+        conns: Optional[list[object]] = info.context.get("conns") if info.context else None
+        if conns is None:
+            return statement # * skip without shared connections (contextless path)
+        resolves_value_encodings(statement, conns)
+        return statement
+
 
 class Graph(TablaBase):
     # ? Pydantic "Graph" Configuration
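The two `models.py` hunks above add a check that every value-method encoding resolves, and skip that check when no connections are supplied through the validation context. The control flow can be sketched with the standard library alone. In this sketch plain dicts replace the Pydantic models, and a hypothetical `resolver` callable replaces the `resolve()` call against DuckDB shards, so it illustrates the logic rather than the package's implementation.

```python
from typing import Callable, Optional

Resolver = Callable[[str], list[str]]  # hypothetical: term -> list of matches


def check_value_encodings(statement: dict, resolver: Optional[Resolver]) -> dict:
    # Without a resolver (the "contextless" path) the check is skipped entirely.
    if resolver is None:
        return statement
    nodes = [("subject", statement["subject"]), ("object", statement["object"])]
    nodes += [(q["qualifier"], q) for q in statement.get("qualifiers") or []]
    for label, node in nodes:
        if node["method"] != "value":
            continue  # column-method encodings are not checked
        term = str(node["encoding"])
        if not resolver(term):
            raise ValueError(f"21 | value encoding {term!r} in {label!r} did not resolve against datassert")
    return statement


def toy_resolver(term: str) -> list[str]:
    # Toy lookup that only knows two terms.
    return [term] if term in {"BRCA1", "TP53"} else []


good = {
    "subject": {"method": "value", "encoding": "BRCA1"},
    "object": {"method": "column", "encoding": "B"},
}
bad = {
    "subject": {"method": "value", "encoding": "ZZZNOTAREALGENE123"},
    "object": {"method": "value", "encoding": "TP53"},
}

check_value_encodings(good, toy_resolver)  # passes
check_value_encodings(bad, None)  # skipped: no resolver supplied
try:
    check_value_encodings(bad, toy_resolver)
    error = ""
except ValueError as exc:
    error = str(exc)
print(error)
```

In the diff the same decision is driven by `info.context`: callers that pass `context={"conns": conns}` to `model_validate` get the check, while plain construction does not.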
@@ -35,8 +35,7 @@ def format_section_oneline(x: "Tcode") -> str:
     else:
         source_detail = f"TEXT({(x.source.delimiter or ',')!r})" # pyright: ignore
     return (
-        f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} "
-        f"| CONFIG: {x.config.name} | STATUS: {x.status}"
+        f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} | CONFIG: {x.config.name} | STATUS: {x.status}"
     )
 
 
@@ -1,7 +1,8 @@
 from __future__ import annotations
 
+import os
 from pathlib import Path
-from typing import Any
+from typing import Any, Optional
 
 import httpx
 import pytest
@@ -26,3 +27,15 @@ def mockhttpxhead(monkeypatch: pytest.MonkeyPatch) -> None:
 @pytest.fixture
 def fixtures_path() -> Path:
     return Path(__file__).parent / "fixtures"
+
+
+@pytest.fixture
+def datassert_dir() -> Path:
+    # ? Shared Datassert Shard Directory (Skipped When Unavailable)
+    env: Optional[str] = os.environ.get("DATASSERT")
+    if not env:
+        pytest.skip("DATASSERT env var not set; skipping datassert-dependent test")
+    directory: Path = Path(env)
+    if not (directory / "data" / "0.duckdb").is_file():
+        pytest.skip(f"datassert shard data/0.duckdb not found under {directory}")
+    return directory
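The `datassert_dir` fixture above gates tests on two conditions: the `DATASSERT` environment variable is set, and `data/0.duckdb` exists under it. The gating logic can be sketched without pytest. Here the hypothetical `find_datassert` returns `None` where the fixture would call `pytest.skip()`, and it takes the environment as a mapping so it can be exercised without touching `os.environ`.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def find_datassert(env: Mapping[str, str]) -> Optional[Path]:
    # Return the shard directory, or None where the fixture would skip the test.
    value = env.get("DATASSERT")
    if not value:
        return None
    directory = Path(value)
    if not (directory / "data" / "0.duckdb").is_file():
        return None
    return directory


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing = find_datassert({"DATASSERT": str(root)})  # directory exists, shard does not
    (root / "data").mkdir()
    (root / "data" / "0.duckdb").touch()
    found = find_datassert({"DATASSERT": str(root)})  # shard file now present
    unset = find_datassert({})  # variable not set at all

print(missing, found is not None, unset)
```

Only the first shard file is probed, so a directory with `0.duckdb` but missing later shards would pass this gate and fail later when the remaining connections are opened.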
@@ -3,9 +3,11 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Any
 
+import polars as pl
 import pytest
 from pydantic import ValidationError
 
+import tablassert.models as models
 from tablassert.enums import Categories
 from tablassert.ingests import from_yaml
 from tablassert.models import (
@@ -280,3 +282,193 @@ def test_section_with_annotations() -> None:
         ],
     )
     assert len(section.annotations) == 2 # pyright: ignore
+
+
+# ? Value Encoding Resolves Against Datassert (Context-Aware Pass)
+def test_value_encoding_resolves_pass(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": ["YES"]}).lazy()
+
+    monkeypatch.setattr(models, "resolve", fake_resolve)
+    section: Section = Section.model_validate(
+        {
+            "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+            "statement": {
+                "subject": {"method": "value", "encoding": "BRCA1"},
+                "object": {"method": "value", "encoding": "TP53"},
+            },
+            "provenance": {
+                "repo": "PMC",
+                "publication": "PMC000",
+                "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+            },
+        },
+        context={"conns": [object()]},
+    )
+    assert section.statement.subject.encoding == "BRCA1"
+
+
+# ? Value Encoding Fails To Resolve Raises Code 21
+def test_value_encoding_resolves_fail(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": []}).lazy()
+
+    monkeypatch.setattr(models, "resolve", fake_resolve_empty)
+    with pytest.raises(ValidationError) as exc_info:
+        Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": [object()]},
+        )
+    assert "21 |" in str(exc_info.value)
+
+
+# ? Value Encoding Validator Skips Without Context
+def test_value_encoding_skips_without_context(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": []}).lazy()
+
+    monkeypatch.setattr(models, "resolve", fake_resolve_empty)
+    section: Section = Section( # pyright: ignore
+        source={"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+        statement={
+            "subject": {"method": "value", "encoding": "BRCA1"},
+            "object": {"method": "value", "encoding": "TP53"},
+        },
+        provenance={
+            "repo": "PMC",
+            "publication": "PMC000",
+            "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+        },
+    )
+    assert section.statement.subject.encoding == "BRCA1"
+
+
+# ? Column Method Encodings Are Not Checked
+def test_column_encoding_not_checked(monkeypatch: pytest.MonkeyPatch) -> None:
+    def boom_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        raise AssertionError("resolve must not be called for column-method encodings")
+
+    monkeypatch.setattr(models, "resolve", boom_resolve)
+    section: Section = Section.model_validate(
+        {
+            "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+            "statement": {
+                "subject": {"method": "column", "encoding": "A"},
+                "object": {"method": "column", "encoding": "B"},
+            },
+            "provenance": {
+                "repo": "PMC",
+                "publication": "PMC000",
+                "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+            },
+        },
+        context={"conns": [object()]},
+    )
+    assert section.statement.subject.encoding == "A"
+
+
+# ? Qualifier Value Encoding Is Checked Against Datassert
+def test_qualifier_value_encoding_checked(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve(lf: Any, col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        term: str = str(lf.collect().get_column(col).to_list()[0])
+        if term in ("brca1", "tp53"):
+            return pl.DataFrame({"resolved": ["YES"]}).lazy()
+        return pl.DataFrame({"resolved": []}).lazy()
+
+    monkeypatch.setattr(models, "resolve", fake_resolve)
+    with pytest.raises(ValidationError) as exc_info:
+        Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                    "qualifiers": [
+                        {"qualifier": "disease_context_qualifier", "method": "value", "encoding": "ZZZNOTAREALGENE123"}
+                    ],
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": [object()]},
+        )
+    assert "21 |" in str(exc_info.value)
+
+
+# ? Real Value Encoding Resolves Against The Datassert Shards
+@pytest.mark.datassert
+def test_real_value_encoding_resolves(datassert_dir: Path) -> None:
+    from contextlib import ExitStack
+
+    import duckdb
+
+    from tablassert.fullmap import SHARDS
+
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        section: Section = Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": conns},
+        )
+        assert section.statement.subject.encoding == "BRCA1"
+
+
+# ? Real Value Encoding Failure Raises Code 21
+@pytest.mark.datassert
+def test_real_value_encoding_fails(datassert_dir: Path) -> None:
+    from contextlib import ExitStack
+
+    import duckdb
+
+    from tablassert.fullmap import SHARDS
+
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        with pytest.raises(ValidationError) as exc_info:
+            Section.model_validate(
+                {
+                    "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                    "statement": {
+                        "subject": {"method": "value", "encoding": "ZZZNOTAREALGENE123"},
+                        "object": {"method": "value", "encoding": "TP53"},
+                    },
+                    "provenance": {
+                        "repo": "PMC",
+                        "publication": "PMC000",
+                        "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                    },
+                },
+                context={"conns": conns},
+            )
+        assert "21 |" in str(exc_info.value)
tablassert-7.4.9/AGENTS.md
DELETED
@@ -1,168 +0,0 @@
-# AGENTS.md — Tablassert
-
-Guidance for AI coding agents working in this repository.
-
-## Project Overview
-
-Tablassert is a Python package (>=3.11) for tabular data assertion, normalization, and optional quality control. It builds declarative knowledge graphs from tabular data, exporting NCATS Translator-compliant KGX NDJSON. Uses **Polars** DataFrames, **DuckDB** for entity resolution, and **ONNX/BioBERT** for QC when enabled. CLI built with **cyclopts**. Models built with **Pydantic v2**.
-
-## Quick Reference
-
-| Task | Command |
-|---|---|
-| Install | `uv sync` |
-| Run CLI | `uv run tablassert` |
-| Lint | `uv run ruff check .` |
-| Lint (fix) | `uv run ruff check --fix .` |
-| Format | `uv run ruff format .` |
-| Format check | `uv run ruff format --check .` |
-| Type check | `uv run pyright` |
-| All checks | `uv run pre-commit run --all-files` |
-| Run all tests | `uv run pytest` |
-| Run single test | `uv run pytest tests/test_foo.py::test_name` |
-| Run by keyword | `uv run pytest -k "test_pattern"` |
-| Run with print | `uv run pytest -s tests/test_foo.py` |
-| Build | `uv build` |
-| Build docs | `uv run --group dev mkdocs build` |
-| Add dependency | `uv add <package>` |
-| Add dev dependency | `uv add --group dev <package>` |
-
-## Repository Structure
-
-```
-src/tablassert/
-  cli.py # cyclopts CLI (entry point: tablassert.cli:APP)
-  lib.py # Core logic: encodings, data loading, Tcode(Section) class
-  models.py # Pydantic v2 models (TablaBase base class)
-  enums.py # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
-  fullmap.py # NER / entity resolution (DuckDB, 10 shards)
-  qc.py # Quality control (ONNX/BioBERT, sentence_transformers)
-  nlp.py # Text normalization (level_one: strip+lowercase, level_two: regex)
-  ingests.py # YAML ingestion: from_yaml(), to_sections(), fastmerge()
-  downloader.py # httpx-based file downloads with retries
-  progress.py # Rich progress bars for pipeline stages
-  utils.py # Hashing (xxhash), STORE path, namespace UUIDs
-  log.py # loguru logger → .logassert/tablassert.log; cat() helper for category tagging
-  __init__.py # Empty file (lazy loading is per-module, not here)
|
|
47
|
-
docs/ # MkDocs documentation source
|
|
48
|
-
mkdocs.yml # MkDocs configuration
|
|
49
|
-
pyproject.toml # Project config, dependencies, tool settings
|
|
50
|
-
tests/ # Test directory (at repo root)
|
|
51
|
-
```
|
|
52
|
-
|
|
53
|
-
- `conftest.py` provides a `fixtures_path` fixture returning `Path(__file__).parent / "fixtures"`.
|
|
54
|
-
- pytest configured via `pyproject.toml` `[tool.pytest.ini_options]` with `testpaths = ["tests"]`.
|
|
55
|
-
- pytest markers: `network` requires internet; `gpu` requires `CUDAExecutionProvider`.
|
|
56
|
-
- Test fixtures: `tests/fixtures/` contains YAML files for Section model tests.
|
|
57
|
-
- Test modules: `test_downloader.py`, `test_enums.py`, `test_fullmap.py`, `test_ingests.py`, `test_lib.py`, `test_models.py`, `test_nlp.py`, `test_utils.py`.
|
|
58
|
-
|
|
59
|
-
## Code Style
|
|
60
|
-
|
|
61
|
-
### Imports
|
|
62
|
-
|
|
63
|
-
- Every file starts with `from __future__ import annotations`
|
|
64
|
-
- Heavy dependencies are loaded **lazily per-module** using this pattern:
|
|
65
|
-
```python
|
|
66
|
-
from typing import TYPE_CHECKING
|
|
67
|
-
import lazy_loader as Lazy
|
|
68
|
-
|
|
69
|
-
if TYPE_CHECKING:
|
|
70
|
-
import polars as pl
|
|
71
|
-
else:
|
|
72
|
-
pl = Lazy.load("polars")
|
|
73
|
-
```
|
|
74
|
-
- Lazy-loaded deps: `polars`, `duckdb`, `orjson`, `xxhash`, `polars_hash`, `yaml`, `httpx`, `pyexcel`, `onnxruntime`, `sentence_transformers`
|
|
75
|
-
- Direct (non-lazy) heavy deps: `sqlite_utils`, `rapidfuzz`, `pydantic`, `loguru`, `cyclopts`, `rich`, `yaml.CLoader`
|
|
76
|
-
- Some modules mix direct and lazy imports for the same package (e.g., `ingests.py` does `from yaml import CLoader` directly, then lazy-loads `yaml` for `yaml.load()`)
|
|
77
|
-
- Import order: standard library → blank line → third-party → blank line → local
|
|
78
|
-
- Use `from __future__ import annotations` to enable deferred evaluation
|
|
79
|
-
|
|
80
|
-
### Type Annotations
|
|
81
|
-
|
|
82
|
-
- **Every variable** gets a type annotation, including locals: `col: str = "name"`, `df: pl.DataFrame = ...`
|
|
83
|
-
- Use `Optional[T]` and `Union[...]` (not `T | None` or `X | Y`)
|
|
84
|
-
- Use `Self` for class methods returning the class type
|
|
85
|
-
- Use `Path` (not `str`) for filesystem paths
|
|
86
|
-
- Use `# pyright: ignore` comments to suppress false positives from lazy-loaded modules
|
|
87
|
-
|
|
88
|
-
### Pydantic Models
|
|
89
|
-
|
|
90
|
-
- All models inherit from `TablaBase(BaseModel)` which sets:
|
|
91
|
-
```python
|
|
92
|
-
model_config: ConfigDict = ConfigDict( # pyright: ignore
|
|
93
|
-
str_strip_whitespace=False,
|
|
94
|
-
validate_assignment=True,
|
|
95
|
-
use_enum_values=True,
|
|
96
|
-
extra="forbid",
|
|
97
|
-
populate_by_name=True,
|
|
98
|
-
)
|
|
99
|
-
```
|
|
100
|
-
- Required fields: `Field(...)` (ellipsis sentinel)
|
|
101
|
-
- Optional fields: `Optional[T] = Field(None)`
|
|
102
|
-
- All enums are `str, Enum` subclasses (defined in `enums.py`)
|
|
103
|
-
|
|
104
|
-
### Enums
|
|
105
|
-
|
|
106
|
-
All enums live in `enums.py` and extend `str, Enum`. Key enums: `Tokens`, `Repositories`, `Contributions`, `Comparisons`, `Functions`, `Files`, `EncodingMethods`, `FillMethods`, `Syntaxes`, `Statuses`, `Categories`, `Predicates`, `Qualifiers`.
|
|
107
|
-
|
|
108
|
-
### Naming
|
|
109
|
-
|
|
110
|
-
- Functions/variables: `snake_case`
|
|
111
|
-
- Classes: `PascalCase`
|
|
112
|
-
- Module-level constants: `UPPER_CASE`
|
|
113
|
-
|
|
114
|
-
### Comments
|
|
115
|
-
|
|
116
|
-
- `# ?` — descriptions / clarifications
|
|
117
|
-
- `# !` — warnings / important notes
|
|
118
|
-
- `# *` — stage markers (pipeline steps)
|
|
119
|
-
- `# TODO:` — todos
|
|
120
|
-
- No docstrings on functions; use `# ?` comment on the line above instead
|
|
121
|
-
|
|
122
|
-
### Formatting (enforced by ruff)
|
|
123
|
-
|
|
124
|
-
- Line length: **120**
|
|
125
|
-
- Quote style: **double quotes**
|
|
126
|
-
- Indent: **4 spaces**
|
|
127
|
-
- `skip-magic-trailing-comma = true`
|
|
128
|
-
- Target: Python >=3.11
|
|
129
|
-
|
|
130
|
-
### Error Handling
|
|
131
|
-
|
|
132
|
-
- Use `RuntimeError` for exceptional cases (no custom exception classes currently)
|
|
133
|
-
- Use `logger.warning()` for non-fatal issues (e.g., empty subgraphs)
|
|
134
|
-
- Logger: `from tablassert.log import logger` (or `cat()` for category-tagged logger)
|
|
135
|
-
|
|
136
|
-
### Other Conventions
|
|
137
|
-
|
|
138
|
-
- `operator.add` for Polars string concatenation on columns (not `+` directly)
|
|
139
|
-
- CLI entry point: `tablassert.cli:APP` (cyclopts app)
|
|
140
|
-
- Use `rich.progress` for progress tracking in CLI (via `progress.py` which wraps Rich Live/Progress)
|
|
141
|
-
- Data side-effects stored in hidden directories: `.logassert/`, `.storassert/`, `.onnxassert/`
|
|
142
|
-
|
|
143
|
-
## Tools
|
|
144
|
-
|
|
145
|
-
- **ruff** — linting (`ruff check`) and formatting (`ruff format`)
|
|
146
|
-
- **pyright** — type checking (no pyrightconfig.json; uses defaults)
|
|
147
|
-
- **pre-commit** — runs ruff fix, ruff-format, pyright, and pytest on all Python files
|
|
148
|
-
- **pytest** — testing (>=9.0.2)
|
|
149
|
-
- **uv** — package manager (use `uv run` for all commands, `uv add` for deps)
|
|
150
|
-
- **hatchling** — build backend
|
|
151
|
-
|
|
152
|
-
## Optional Dependency Groups
|
|
153
|
-
|
|
154
|
-
Defined in `pyproject.toml` `[project.optional-dependencies]`:
|
|
155
|
-
- `rt` — `polars[rtcompat]` (runtime-compatible Polars build for CPUs without required instructions)
|
|
156
|
-
- `qc` — `onnxruntime` (CPU QC runtime)
|
|
157
|
-
- `qc-cuda` — `onnxruntime-gpu` (CUDA QC runtime; single GPU on device 0)
|
|
158
|
-
|
|
159
|
-
All other ML, web, and Excel dependencies are in core `dependencies`; the ONNX Runtime choice is extra-driven.
|
|
160
|
-
|
|
161
|
-
Install with: `uv sync`, `uv sync --extra qc`, `uv sync --extra qc-cuda`, or `pip install tablassert[...]`
|
|
162
|
-
|
|
163
|
-
## CI Workflows
|
|
164
|
-
|
|
165
|
-
- **PyPI publish** (`.github/workflows/pipy.yml`): builds and publishes on push to `main`
|
|
166
|
-
- **MkDocs deploy** (`.github/workflows/docs.yml`): builds docs and deploys to GitHub Pages on push to `main`
|
|
167
|
-
- **Docker publish** (`.github/workflows/docker.yml`): builds and pushes image to GHCR on tag push (`v*`)
|
|
168
|
-
- **Autotag** (`.github/workflows/autotag.yml`): automatic version tagging
|
|
tablassert-7.4.9/docs/changelog.md
DELETED
|
@@ -1,11 +0,0 @@
|
|
|
1
|
-
# Changelog
|
|
2
|
-
|
|
3
|
-
The canonical release history lives in the repository root at [`CHANGELOG.md`](https://github.com/SkyeAv/Tablassert/blob/main/CHANGELOG.md).
|
|
4
|
-
|
|
5
|
-
## Current Release Notes
|
|
6
|
-
|
|
7
|
-
## 7.4.9 - 2026-05-26
|
|
8
|
-
|
|
9
|
-
### Bug Fixes
|
|
10
|
-
|
|
11
|
-
- Fixed `OSError: Too many open files` during subgraph build at large scales (700+ sections). `with_mesh()` and `with_captions()` in `lib.py` opened a `sqlite_utils.Database` per section but never closed the underlying SQLite connection, leaving FD release to GC. In the tight sequential `compile_subgraph` loop the leaked FDs accumulated past the OS soft limit, causing the next `to_store()` → `df.write_parquet()` (which polars 1.39 routes through `sink_parquet`) to fail opening its target `.storassert/*.parquet`. Both functions now wrap their query bodies in `try:` / `finally: db.conn.close()`.
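The close-in-`finally` pattern described in this entry can be sketched as below. This is a minimal illustration, not the code in `lib.py`: it uses the standard library `sqlite3` module as a stand-in for `sqlite_utils.Database`, and the `query_rows` helper, its parameters, and the table it queries are hypothetical names chosen for the example.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path
from typing import Any


# ? Hypothetical helper: run one query and always release the file descriptor
def query_rows(db_path: Path, sql: str) -> list[tuple[Any, ...]]:
    conn: sqlite3.Connection = sqlite3.connect(db_path)
    try:
        # ? Query body goes here; any exception still reaches the finally block
        rows: list[tuple[Any, ...]] = conn.execute(sql).fetchall()
        return rows
    finally:
        # ! Close explicitly instead of waiting for garbage collection
        conn.close()
```

Called once per section in a tight loop, a helper shaped like this holds at most one SQLite file descriptor at a time, which is the property the fix relies on.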
|
|
File without changes