PyPI: tablassert 7.4.10.tar.gz → 7.4.11.tar.gz

Files changed (64)

tablassert-7.4.11/AGENTS.md ADDED

@@ -0,0 +1,50 @@
+# AGENTS.md — Tablassert
+## Fast Start
+- Python package, not a monorepo. Main code lives in `src/tablassert/`; tests live in `tests/`.
+- Install with `uv sync`. QC is not available unless you install an extra: `uv sync --extra qc` or `uv sync --extra qc-cuda`.
+- CLI entrypoint is `tablassert.cli:APP`. Real user commands are:
+  - `uv run tablassert build <graph.yaml>`
+  - `uv run tablassert validate <table.yaml> <datassert>`
+## Verify Changes
+- Match the repo hooks before finishing: `uv run ruff check --fix .`, `uv run ruff format .`, `uv run pyright`, `uv run pytest`.
+- Full hook run: `uv run pre-commit run --all-files`.
+- Focused test runs:
+  - Single test: `uv run pytest tests/test_lib.py::test_name`
+  - By keyword: `uv run pytest -k "pattern"`
+  - With print output: `uv run pytest -s tests/test_lib.py`
+- Docs build: `uv run --group dev mkdocs build`
+## High-Value Structure
+- `src/tablassert/cli.py` is the wiring layer: `build()` calls `build_pipeline()`, `validate()` calls `validate_pipeline()`.
+- `src/tablassert/ingests.py` loads YAML and expands table configs into section dicts.
+- `src/tablassert/lib.py` is the core pipeline:
+  - `Tcode.collect()` builds the per-section operation list.
+  - `compile_subgraph()` executes that list into parquet.
+  - `compile_graph()` aggregates subgraph parquets into KGX NDJSON.
+  - `resolve_many()` is the direct library API for batch entity resolution.
+- Entity resolution uses DuckDB shard files under `<datassert>/data/`. `src/tablassert/fullmap.py` hardcodes `SHARDS = 10`.
+## Repo-Specific Gotchas
+- Heavy dependencies are lazy-loaded per module with `TYPE_CHECKING` + `lazy_loader`. Follow the existing pattern instead of importing heavy packages eagerly.
+- `tests/conftest.py` autouse-mocks `httpx.head`, so model URL validation tests do not hit the network unless a test is explicitly marked otherwise.
+- Network-dependent tests are marked `@pytest.mark.network`; GPU QC tests are marked with both `network` and `gpu` in `tests/test_qc.py`.
+- QC runtime selection is strict in `src/tablassert/qc.py`: if `onnxruntime-gpu` is installed but `CUDAExecutionProvider` is unavailable, the code raises instead of falling back to CPU.
+- Downloader behavior in `src/tablassert/downloader.py` is two-path: direct `httpx` fetch for known file URLs, headless-browser fallback for browser-only sources. Keep tests around payload validation and cleanup intact when changing it.
+## Conventions That Matter Here
+- Start every module with `from __future__ import annotations`.
+- Annotate locals, not just function signatures.
+- Use `Optional[T]` / `Union[...]`, not `T | None`.
+- Prefer `Path` over raw path strings.
+- Function docs are usually `# ?` comments above the code, not docstrings.
+## Side Effects
+- The package writes working artifacts to hidden directories in the repo root: `.storassert/`, `.logassert/`, `.cachassert/`, and `.onnxassert/`.

{tablassert-7.4.10 → tablassert-7.4.11}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.4.10
+Version: 7.4.11
 Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert

{tablassert-7.4.10 → tablassert-7.4.11}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "tablassert"
-version = "7.4.10"
+version = "7.4.11"
 description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
 authors = [
     { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
@@ -100,7 +100,7 @@ dev = [
 [tool.pytest.ini_options]
 testpaths = ["tests"]
-markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider"]
+markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider", "datassert: requires the datassert DuckDB shards"]
 [tool.ruff]
 line-length = 120

{tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/cli.py RENAMED

@@ -57,29 +57,30 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
     sections: list[dict[str, Any]] = list(chain.from_iterable(temp))
     n: int = len(sections)
-    # * Build TCode (3/6)
-    progress.stage(f"Building TCode | Sections: {n}")
-    advance = progress.section_loop(n, "TCode")
-    tcode: list[Tcode] = []
-    for idx, s in enumerate(sections, start=1):
-        try:
-            tcode.append(
-                Tcode.model_validate(
-                    {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc}
-                )
-            )
-        except pydantic.ValidationError as e:
-            raise RuntimeError(
-                f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
-            ) from e
-        advance(format_section_oneline(tcode[-1]))
     with ExitStack() as stack:
         conns: list[object] = [
             stack.enter_context(duckdb.connect(g.datassert / "data" / f"{x}.duckdb", read_only=True))
             for x in range(SHARDS)
         ]
+        # * Build TCode (3/6)
+        progress.stage(f"Building TCode | Sections: {n}")
+        advance = progress.section_loop(n, "TCode")
+        tcode: list[Tcode] = []
+        for idx, s in enumerate(sections, start=1):
+            try:
+                tcode.append(
+                    Tcode.model_validate(
+                        {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc},
+                        context={"conns": conns},
+                    )
+                )
+            except pydantic.ValidationError as e:
+                raise RuntimeError(
+                    f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
+                ) from e
+            advance(format_section_oneline(tcode[-1]))
         # * Collect Instructions (4/6)
         progress.stage(f"Collecting Instructions | Sections: {n}")
         advance = progress.section_loop(n, "Collect")
@@ -105,8 +106,9 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
     logger.info(f"BUILD DONE | SECTIONS: {n} | NAME: {g.name} | VERSION: {g.version}")
-def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgress") -> None:
+def validate_pipeline(table_configuration_file: Path, datassert: Path, progress: "PipelineProgress") -> None:
     # ? Validate Section Syntax From A Configuration File
+    from tablassert.fullmap import SHARDS
     from tablassert.ingests import from_yaml, to_sections
     from tablassert.lib import Tcode
     from tablassert.progress import flatten_pydantic_error
@@ -124,27 +126,32 @@ def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgres
     # * Validate Section Syntax (3/3)
     progress.stage(f"Validating Section Syntax | Sections: {n}")
     advance = progress.section_loop(n, "Validate")
-    for idx, s in enumerate(sections, start=1):
-        h: str = mkhash(s)
-        try:
-            Tcode.model_validate({**s, "number": idx, "store": (STORE / f"{h}.parquet")})
-        except pydantic.ValidationError as e:
-            raise RuntimeError(
-                f"02 | FAILED VALIDATION | CONFIG: {table_configuration_file} | IDX: {idx} | HASH: {h} | PYDANTIC: {flatten_pydantic_error(e)}"
-            ) from e
-        advance(f"#{idx} | HASH: {h}")
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        for idx, s in enumerate(sections, start=1):
+            h: str = mkhash(s)
+            try:
+                Tcode.model_validate({**s, "number": idx, "store": (STORE / f"{h}.parquet")}, context={"conns": conns})
+            except pydantic.ValidationError as e:
+                raise RuntimeError(
+                    f"02 | FAILED VALIDATION | CONFIG: {table_configuration_file} | IDX: {idx} | HASH: {h} | PYDANTIC: {flatten_pydantic_error(e)}"
+                ) from e
+            advance(f"#{idx} | HASH: {h}")
     logger.info(f"VALIDATE DONE | SECTIONS: {n} | CONFIG: {table_configuration_file.name}")
-def run(stages: int, fn: Any, arg: Path) -> None:
+def run(stages: int, fn: Any, *args: Path) -> None:
     from tablassert.log import LOG_FORMAT, logger
     from tablassert.progress import PipelineProgress
     with PipelineProgress(total_stages=stages) as progress:
         sink_id: int = logger.add(progress.log_sink, level="INFO", format=LOG_FORMAT)
         try:
-            fn(arg, progress)
+            fn(*args, progress)
         finally:
             logger.remove(sink_id)
@@ -156,6 +163,6 @@ def build(graph_configuration_file: Path) -> None:
 @APP.command
-def validate(table_configuration_file: Path) -> None:
-    """Validate section syntax from a YAML configuration file."""
-    run(3, validate_pipeline, table_configuration_file)
+def validate(table_configuration_file: Path, datassert: Path) -> None:
+    """Validate section syntax from a YAML configuration file against a datassert directory."""
+    run(3, validate_pipeline, table_configuration_file, datassert)
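
The cli.py hunks above move `Tcode` validation inside the `ExitStack` block, so the shard connections already exist when each section is validated and are all closed when the block exits. A minimal sketch of that pattern, assuming in-memory stdlib `sqlite3` databases as a stand-in for the DuckDB shard files; `open_shards`, `is_closed`, and the shard count of 3 are hypothetical names chosen for illustration and are not part of tablassert:

```python
from __future__ import annotations

import sqlite3
from contextlib import ExitStack, closing

SHARDS: int = 3  # hypothetical count for this sketch; the diff reads SHARDS from fullmap.py


def open_shards(stack: ExitStack, n: int) -> list[sqlite3.Connection]:
    # Register each connection on the stack so it is closed when the stack exits.
    # sqlite3 in-memory databases stand in for the per-shard DuckDB files.
    return [stack.enter_context(closing(sqlite3.connect(":memory:"))) for _ in range(n)]


def is_closed(conn: sqlite3.Connection) -> bool:
    # A closed sqlite3 connection raises ProgrammingError on use.
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False


with ExitStack() as stack:
    conns: list[sqlite3.Connection] = open_shards(stack, SHARDS)
    # Work that needs the shared connections happens inside the block.
    answers: list[int] = [c.execute("SELECT 1").fetchone()[0] for c in conns]
```

Because every connection is entered on the same stack, a failure while opening a later shard still closes the ones opened before it.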

{tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/fullmap.py RENAMED

@@ -51,7 +51,9 @@ def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFr
     terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
-    bad: str = r"^\d+$|^(none|nan|na|null|unknown)$|^$"
+    bad: str = (
+        r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"
+    )
     terms = terms.filter(~pl.col(col).str.contains(bad))
     return terms.with_columns((plh.col(col).nchash.xxhash64() % SHARDS).alias("shard"))  # pyright: ignore
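
The hunk above widens the `bad` pattern so that more placeholder strings are dropped before terms are sharded. A rough sketch of which terms the new pattern rejects, assuming stdlib `re` in place of Polars `str.contains` and assuming terms are already lowercased; `is_placeholder` and `keep_terms` are hypothetical helpers, not tablassert functions:

```python
from __future__ import annotations

import re

# Pattern copied from the 7.4.11 side of the hunk, split only for line length.
BAD: re.Pattern[str] = re.compile(
    r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"
)


def is_placeholder(term: str) -> bool:
    # True when the whole term is digits, empty, or one of the listed words.
    return BAD.search(term) is not None


def keep_terms(terms: list[str]) -> list[str]:
    # Mirrors the intent of terms.filter(~contains(bad)) on a plain list.
    return [t for t in terms if not is_placeholder(t)]


kept: list[str] = keep_terms(["brca1", "p value", "42", "", "symbol", "gene symbol", "result"])
```

Each alternative is anchored with `^` and `$`, so a longer term such as `gene symbol` survives while a bare `symbol` is filtered out.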

{tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/models.py RENAMED

@@ -25,6 +25,9 @@ from tablassert.enums import (
     Tokens,
 )
+from tablassert.fullmap import resolve
+from tablassert.nlp import level_one, level_two
 if TYPE_CHECKING:
     import httpx
     import polars as pl
@@ -315,6 +318,35 @@ class Annotation(Encoding):
         return annotation.replace("_", " ").strip()
+def resolves_value_encodings(statement: Statement, conns: list[object]) -> None:
+    # ? Resolve Every Value-Method Literal Against The Shared Datassert Shards
+    nodes: list[tuple[str, NodeEncoding]] = [("subject", statement.subject), ("object", statement.object)]
+    if statement.qualifiers:
+        nodes += [(q.qualifier, q) for q in statement.qualifiers]
+    for label, node in nodes:
+        if not eq(node.method, EncodingMethods.VALUE):
+            continue
+        term: str = str(node.encoding)
+        lf: pl.LazyFrame = pl.DataFrame({"term": [term]}).lazy()
+        lf = level_one(lf, "term")
+        lf = level_two(lf, "term")
+        resolved: pl.DataFrame = resolve(
+            lf,
+            "term",
+            conns,
+            taxon=str(node.taxon) if node.taxon else None,
+            prioritize=node.prioritize,
+            avoid=node.avoid,
+            log=False,
+            column_context=False,
+        ).collect()
+        if resolved.height == 0:
+            msg: str = f"21 | value encoding {term!r} in {label!r} did not resolve against datassert"
+            raise ValueError(msg)
 class Section(TablaBase):
     # ? Pydantic "Section" Model And Coercion
     syntax: Syntaxes = Field(Syntaxes.TC3, description="Section configuration syntax version.")
@@ -326,6 +358,16 @@ class Section(TablaBase):
         None, description="Optional extra encoded columns added to each row."
     )
+    @field_validator("statement", mode="after")
+    @classmethod
+    def value_encodings_resolve(cls, statement: Statement, info: Any) -> Statement:
+        # ? Ensure Value-Method Encodings Resolve Against The Shared Datassert Shards
+        conns: Optional[list[object]] = info.context.get("conns") if info.context else None
+        if conns is None:
+            return statement  # * skip without shared connections (contextless path)
+        resolves_value_encodings(statement, conns)
+        return statement
 class Graph(TablaBase):
     # ? Pydantic "Graph" Configuration
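
The models.py hunks above rely on Pydantic v2 validation context: the validator reads `info.context`, skips the check when no context is passed, and raises a code `21` error when a value does not resolve. A minimal sketch of that mechanism, assuming Pydantic v2 is installed; `Node`, `must_resolve`, `check`, the `"lookup"` context key, and the `KNOWN` set are hypothetical stand-ins for the real models and the datassert shards:

```python
from __future__ import annotations

from typing import Optional

from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator

KNOWN: set[str] = {"brca1", "tp53"}  # hypothetical stand-in for the datassert lookup


class Node(BaseModel):
    encoding: str

    @field_validator("encoding", mode="after")
    @classmethod
    def must_resolve(cls, value: str, info: ValidationInfo) -> str:
        # No context (plain construction) means the lookup is skipped entirely.
        lookup: Optional[set[str]] = info.context.get("lookup") if info.context else None
        if lookup is None:
            return value
        if value.lower() not in lookup:
            raise ValueError(f"21 | value encoding {value!r} did not resolve")
        return value


def check(encoding: str, lookup: Optional[set[str]]) -> Optional[str]:
    # Returns the validation error text, or None when validation passes.
    context: Optional[dict[str, set[str]]] = {"lookup": lookup} if lookup is not None else None
    try:
        Node.model_validate({"encoding": encoding}, context=context)
    except ValidationError as e:
        return str(e)
    return None
```

Passing the lookup through `context` keeps the model definition free of global state, which is why the same model can be validated with or without the external resource.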

{tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/progress.py RENAMED

@@ -35,8 +35,7 @@ def format_section_oneline(x: "Tcode") -> str:
     else:
         source_detail = f"TEXT({(x.source.delimiter or ',')!r})"  # pyright: ignore
     return (
-        f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} "
-        f"| CONFIG: {x.config.name} | STATUS: {x.status}"
+        f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} | CONFIG: {x.config.name} | STATUS: {x.status}"
     )

{tablassert-7.4.10 → tablassert-7.4.11}/tests/conftest.py RENAMED

@@ -1,7 +1,8 @@
 from __future__ import annotations
+import os
 from pathlib import Path
-from typing import Any
+from typing import Any, Optional
 import httpx
 import pytest
@@ -26,3 +27,15 @@ def mockhttpxhead(monkeypatch: pytest.MonkeyPatch) -> None:
 @pytest.fixture
 def fixtures_path() -> Path:
     return Path(__file__).parent / "fixtures"
+@pytest.fixture
+def datassert_dir() -> Path:
+    # ? Shared Datassert Shard Directory (Skipped When Unavailable)
+    env: Optional[str] = os.environ.get("DATASSERT")
+    if not env:
+        pytest.skip("DATASSERT env var not set; skipping datassert-dependent test")
+    directory: Path = Path(env)
+    if not (directory / "data" / "0.duckdb").is_file():
+        pytest.skip(f"datassert shard data/0.duckdb not found under {directory}")
+    return directory

{tablassert-7.4.10 → tablassert-7.4.11}/tests/test_models.py RENAMED

@@ -3,9 +3,11 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Any
+import polars as pl
 import pytest
 from pydantic import ValidationError
+import tablassert.models as models
 from tablassert.enums import Categories
 from tablassert.ingests import from_yaml
 from tablassert.models import (
@@ -280,3 +282,193 @@ def test_section_with_annotations() -> None:
         ],
     )
     assert len(section.annotations) == 2  # pyright: ignore
+# ? Value Encoding Resolves Against Datassert (Context-Aware Pass)
+def test_value_encoding_resolves_pass(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": ["YES"]}).lazy()
+    monkeypatch.setattr(models, "resolve", fake_resolve)
+    section: Section = Section.model_validate(
+        {
+            "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+            "statement": {
+                "subject": {"method": "value", "encoding": "BRCA1"},
+                "object": {"method": "value", "encoding": "TP53"},
+            },
+            "provenance": {
+                "repo": "PMC",
+                "publication": "PMC000",
+                "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+            },
+        },
+        context={"conns": [object()]},
+    )
+    assert section.statement.subject.encoding == "BRCA1"
+# ? Value Encoding Fails To Resolve Raises Code 21
+def test_value_encoding_resolves_fail(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": []}).lazy()
+    monkeypatch.setattr(models, "resolve", fake_resolve_empty)
+    with pytest.raises(ValidationError) as exc_info:
+        Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": [object()]},
+        )
+    assert "21 |" in str(exc_info.value)
+# ? Value Encoding Validator Skips Without Context
+def test_value_encoding_skips_without_context(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        return pl.DataFrame({"resolved": []}).lazy()
+    monkeypatch.setattr(models, "resolve", fake_resolve_empty)
+    section: Section = Section(  # pyright: ignore
+        source={"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+        statement={
+            "subject": {"method": "value", "encoding": "BRCA1"},
+            "object": {"method": "value", "encoding": "TP53"},
+        },
+        provenance={
+            "repo": "PMC",
+            "publication": "PMC000",
+            "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+        },
+    )
+    assert section.statement.subject.encoding == "BRCA1"
+# ? Column Method Encodings Are Not Checked
+def test_column_encoding_not_checked(monkeypatch: pytest.MonkeyPatch) -> None:
+    def boom_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        raise AssertionError("resolve must not be called for column-method encodings")
+    monkeypatch.setattr(models, "resolve", boom_resolve)
+    section: Section = Section.model_validate(
+        {
+            "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+            "statement": {
+                "subject": {"method": "column", "encoding": "A"},
+                "object": {"method": "column", "encoding": "B"},
+            },
+            "provenance": {
+                "repo": "PMC",
+                "publication": "PMC000",
+                "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+            },
+        },
+        context={"conns": [object()]},
+    )
+    assert section.statement.subject.encoding == "A"
+# ? Qualifier Value Encoding Is Checked Against Datassert
+def test_qualifier_value_encoding_checked(monkeypatch: pytest.MonkeyPatch) -> None:
+    def fake_resolve(lf: Any, col: str, _conns: list[object], **_kwargs: Any) -> Any:
+        term: str = str(lf.collect().get_column(col).to_list()[0])
+        if term in ("brca1", "tp53"):
+            return pl.DataFrame({"resolved": ["YES"]}).lazy()
+        return pl.DataFrame({"resolved": []}).lazy()
+    monkeypatch.setattr(models, "resolve", fake_resolve)
+    with pytest.raises(ValidationError) as exc_info:
+        Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                    "qualifiers": [
+                        {"qualifier": "disease_context_qualifier", "method": "value", "encoding": "ZZZNOTAREALGENE123"}
+                    ],
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": [object()]},
+        )
+    assert "21 |" in str(exc_info.value)
+# ? Real Value Encoding Resolves Against The Datassert Shards
+@pytest.mark.datassert
+def test_real_value_encoding_resolves(datassert_dir: Path) -> None:
+    from contextlib import ExitStack
+    import duckdb
+    from tablassert.fullmap import SHARDS
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        section: Section = Section.model_validate(
+            {
+                "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                "statement": {
+                    "subject": {"method": "value", "encoding": "BRCA1"},
+                    "object": {"method": "value", "encoding": "TP53"},
+                },
+                "provenance": {
+                    "repo": "PMC",
+                    "publication": "PMC000",
+                    "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                },
+            },
+            context={"conns": conns},
+        )
+        assert section.statement.subject.encoding == "BRCA1"
+# ? Real Value Encoding Failure Raises Code 21
+@pytest.mark.datassert
+def test_real_value_encoding_fails(datassert_dir: Path) -> None:
+    from contextlib import ExitStack
+    import duckdb
+    from tablassert.fullmap import SHARDS
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        with pytest.raises(ValidationError) as exc_info:
+            Section.model_validate(
+                {
+                    "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
+                    "statement": {
+                        "subject": {"method": "value", "encoding": "ZZZNOTAREALGENE123"},
+                        "object": {"method": "value", "encoding": "TP53"},
+                    },
+                    "provenance": {
+                        "repo": "PMC",
+                        "publication": "PMC000",
+                        "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
+                    },
+                },
+                context={"conns": conns},
+            )
+        assert "21 |" in str(exc_info.value)

{tablassert-7.4.10 → tablassert-7.4.11}/uv.lock RENAMED

@@ -2360,7 +2360,7 @@ wheels = [
 [[package]]
 name = "tablassert"
-version = "7.4.10"
+version = "7.4.11"
 source = { editable = "." }
 dependencies = [
     { name = "cyclopts" },

tablassert-7.4.10/AGENTS.md DELETED

@@ -1,168 +0,0 @@
-# AGENTS.md — Tablassert
-Guidance for AI coding agents working in this repository.
-## Project Overview
-Tablassert is a Python package (>=3.11) for tabular data assertion, normalization, and optional quality control. It builds declarative knowledge graphs from tabular data, exporting NCATS Translator-compliant KGX NDJSON. Uses **Polars** DataFrames, **DuckDB** for entity resolution, and **ONNX/BioBERT** for QC when enabled. CLI built with **cyclopts**. Models built with **Pydantic v2**.
-## Quick Reference
-| Task | Command |
-|---|---|
-| Install | `uv sync` |
-| Run CLI | `uv run tablassert` |
-| Lint | `uv run ruff check .` |
-| Lint (fix) | `uv run ruff check --fix .` |
-| Format | `uv run ruff format .` |
-| Format check | `uv run ruff format --check .` |
-| Type check | `uv run pyright` |
-| All checks | `uv run pre-commit run --all-files` |
-| Run all tests | `uv run pytest` |
-| Run single test | `uv run pytest tests/test_foo.py::test_name` |
-| Run by keyword | `uv run pytest -k "test_pattern"` |
-| Run with print | `uv run pytest -s tests/test_foo.py` |
-| Build | `uv build` |
-| Build docs | `uv run --group dev mkdocs build` |
-| Add dependency | `uv add <package>` |
-| Add dev dependency | `uv add --group dev <package>` |
-## Repository Structure
-```
-src/tablassert/
-  cli.py          # cyclopts CLI (entry point: tablassert.cli:APP)
-  lib.py          # Core logic: encodings, data loading, Tcode(Section) class
-  models.py       # Pydantic v2 models (TablaBase base class)
-  enums.py        # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
-  fullmap.py      # NER / entity resolution (DuckDB, 10 shards)
-  qc.py           # Quality control (ONNX/BioBERT, sentence_transformers)
-  nlp.py          # Text normalization (level_one: strip+lowercase, level_two: regex)
-  ingests.py      # YAML ingestion: from_yaml(), to_sections(), fastmerge()
-  downloader.py   # httpx-based file downloads with retries
-  progress.py     # Rich progress bars for pipeline stages
-  utils.py        # Hashing (xxhash), STORE path, namespace UUIDs
-  log.py          # loguru logger → .logassert/tablassert.log; cat() helper for category tagging
-  __init__.py     # Empty file (lazy loading is per-module, not here)
-docs/             # MkDocs documentation source
-mkdocs.yml        # MkDocs configuration
-pyproject.toml    # Project config, dependencies, tool settings
-tests/            # Test directory (at repo root)
-```
-- `conftest.py` provides a `fixtures_path` fixture returning `Path(__file__).parent / "fixtures"`.
-- pytest configured via `pyproject.toml` `[tool.pytest.ini_options]` with `testpaths = ["tests"]`.
-- pytest markers: `network` requires internet; `gpu` requires `CUDAExecutionProvider`.
-- Test fixtures: `tests/fixtures/` contains YAML files for Section model tests.
-- Test modules: `test_downloader.py`, `test_enums.py`, `test_fullmap.py`, `test_ingests.py`, `test_lib.py`, `test_models.py`, `test_nlp.py`, `test_utils.py`.
-## Code Style
-### Imports
-- Every file starts with `from __future__ import annotations`
-- Heavy dependencies are loaded **lazily per-module** using this pattern:
-  ```python
-  from typing import TYPE_CHECKING
-  import lazy_loader as Lazy
-  if TYPE_CHECKING:
-      import polars as pl
-  else:
-      pl = Lazy.load("polars")
-  ```
-- Lazy-loaded deps: `polars`, `duckdb`, `orjson`, `xxhash`, `polars_hash`, `yaml`, `httpx`, `pyexcel`, `onnxruntime`, `sentence_transformers`
-- Direct (non-lazy) heavy deps: `sqlite_utils`, `rapidfuzz`, `pydantic`, `loguru`, `cyclopts`, `rich`, `yaml.CLoader`
-- Some modules mix direct and lazy imports for the same package (e.g., `ingests.py` does `from yaml import CLoader` directly, then lazy-loads `yaml` for `yaml.load()`)
-- Import order: standard library → blank line → third-party → blank line → local
-- Use `from __future__ import annotations` to enable deferred evaluation
-### Type Annotations
-- **Every variable** gets a type annotation, including locals: `col: str = "name"`, `df: pl.DataFrame = ...`
-- Use `Optional[T]` and `Union[...]` (not `T | None` or `X | Y`)
-- Use `Self` for class methods returning the class type
-- Use `Path` (not `str`) for filesystem paths
-- Use `# pyright: ignore` comments to suppress false positives from lazy-loaded modules
-### Pydantic Models
-- All models inherit from `TablaBase(BaseModel)` which sets:
-  ```python
-  model_config: ConfigDict = ConfigDict(  # pyright: ignore
-      str_strip_whitespace=False,
-      validate_assignment=True,
-      use_enum_values=True,
-      extra="forbid",
-      populate_by_name=True,
-  )
-  ```
-- Required fields: `Field(...)` (ellipsis sentinel)
-- Optional fields: `Optional[T] = Field(None)`
-- All enums are `str, Enum` subclasses (defined in `enums.py`)
-### Enums
-All enums live in `enums.py` and extend `str, Enum`. Key enums: `Tokens`, `Repositories`, `Contributions`, `Comparisons`, `Functions`, `Files`, `EncodingMethods`, `FillMethods`, `Syntaxes`, `Statuses`, `Categories`, `Predicates`, `Qualifiers`.
-### Naming
-- Functions/variables: `snake_case`
-- Classes: `PascalCase`
-- Module-level constants: `UPPER_CASE`
-### Comments
-- `# ?` — descriptions / clarifications
-- `# !` — warnings / important notes
-- `# *` — stage markers (pipeline steps)
-- `# TODO:` — todos
-- No docstrings on functions; use `# ?` comment on the line above instead
-### Formatting (enforced by ruff)
-- Line length: **120**
-- Quote style: **double quotes**
-- Indent: **4 spaces**
-- `skip-magic-trailing-comma = true`
-- Target: Python >=3.11
-### Error Handling
-- Use `RuntimeError` for exceptional cases (no custom exception classes currently)
-- Use `logger.warning()` for non-fatal issues (e.g., empty subgraphs)
-- Logger: `from tablassert.log import logger` (or `cat()` for category-tagged logger)
-### Other Conventions
-- `operator.add` for Polars string concatenation on columns (not `+` directly)
-- CLI entry point: `tablassert.cli:APP` (cyclopts app)
-- Use `rich.progress` for progress tracking in CLI (via `progress.py` which wraps Rich Live/Progress)
-- Data side-effects stored in hidden directories: `.logassert/`, `.storassert/`, `.onnxassert/`
-## Tools
-- **ruff** — linting (`ruff check`) and formatting (`ruff format`)
-- **pyright** — type checking (no pyrightconfig.json; uses defaults)
-- **pre-commit** — runs ruff fix, ruff-format, pyright, and pytest on all Python files
-- **pytest** — testing (>=9.0.2)
-- **uv** — package manager (use `uv run` for all commands, `uv add` for deps)
-- **hatchling** — build backend
-## Optional Dependency Groups
-Defined in `pyproject.toml` `[project.optional-dependencies]`:
-- `rt` — `polars[rtcompat]` (runtime-compatible Polars build for CPUs without required instructions)
-- `qc` — `onnxruntime` (CPU QC runtime)
-- `qc-cuda` — `onnxruntime-gpu` (CUDA QC runtime; single GPU on device 0)
-All other ML, web, and Excel dependencies are in core `dependencies`; the ONNX Runtime choice is extra-driven.
-Install with: `uv sync`, `uv sync --extra qc`, `uv sync --extra qc-cuda`, or `pip install tablassert[...]`
-## CI Workflows
-- **PyPI publish** (`.github/workflows/pipy.yml`): builds and publishes on push to `main`
-- **MkDocs deploy** (`.github/workflows/docs.yml`): builds docs and deploys to GitHub Pages on push to `main`
-- **Docker publish** (`.github/workflows/docker.yml`): builds and pushes image to GHCR on tag push (`v*`)
-- **Autotag** (`.github/workflows/autotag.yml`): automatic version tagging