tablassert 7.4.10__tar.gz → 7.4.11__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (64)
  1. tablassert-7.4.11/AGENTS.md +50 -0
  2. {tablassert-7.4.10 → tablassert-7.4.11}/PKG-INFO +1 -1
  3. {tablassert-7.4.10 → tablassert-7.4.11}/pyproject.toml +2 -2
  4. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/cli.py +39 -32
  5. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/fullmap.py +3 -1
  6. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/models.py +42 -0
  7. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/progress.py +1 -2
  8. {tablassert-7.4.10 → tablassert-7.4.11}/tests/conftest.py +14 -1
  9. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_models.py +192 -0
  10. {tablassert-7.4.10 → tablassert-7.4.11}/uv.lock +1 -1
  11. tablassert-7.4.10/AGENTS.md +0 -168
  12. {tablassert-7.4.10 → tablassert-7.4.11}/.github/workflows/autotag.yml +0 -0
  13. {tablassert-7.4.10 → tablassert-7.4.11}/.github/workflows/docker.yml +0 -0
  14. {tablassert-7.4.10 → tablassert-7.4.11}/.github/workflows/docs.yml +0 -0
  15. {tablassert-7.4.10 → tablassert-7.4.11}/.github/workflows/pipy.yml +0 -0
  16. {tablassert-7.4.10 → tablassert-7.4.11}/.gitignore +0 -0
  17. {tablassert-7.4.10 → tablassert-7.4.11}/.pre-commit-config.yaml +0 -0
  18. {tablassert-7.4.10 → tablassert-7.4.11}/CHANGELOG.md +0 -0
  19. {tablassert-7.4.10 → tablassert-7.4.11}/CITATION.cff +0 -0
  20. {tablassert-7.4.10 → tablassert-7.4.11}/CONTRIBUTING.md +0 -0
  21. {tablassert-7.4.10 → tablassert-7.4.11}/Dockerfile +0 -0
  22. {tablassert-7.4.10 → tablassert-7.4.11}/LICENSE +0 -0
  23. {tablassert-7.4.10 → tablassert-7.4.11}/README.md +0 -0
  24. {tablassert-7.4.10 → tablassert-7.4.11}/docs/api/fullmap.md +0 -0
  25. {tablassert-7.4.10 → tablassert-7.4.11}/docs/api/lib.md +0 -0
  26. {tablassert-7.4.10 → tablassert-7.4.11}/docs/api/qc.md +0 -0
  27. {tablassert-7.4.10 → tablassert-7.4.11}/docs/api/utils.md +0 -0
  28. {tablassert-7.4.10 → tablassert-7.4.11}/docs/changelog.md +0 -0
  29. {tablassert-7.4.10 → tablassert-7.4.11}/docs/cli.md +0 -0
  30. {tablassert-7.4.10 → tablassert-7.4.11}/docs/configuration/advanced-example.md +0 -0
  31. {tablassert-7.4.10 → tablassert-7.4.11}/docs/configuration/graph.md +0 -0
  32. {tablassert-7.4.10 → tablassert-7.4.11}/docs/configuration/table.md +0 -0
  33. {tablassert-7.4.10 → tablassert-7.4.11}/docs/datassert.md +0 -0
  34. {tablassert-7.4.10 → tablassert-7.4.11}/docs/docker.md +0 -0
  35. {tablassert-7.4.10 → tablassert-7.4.11}/docs/examples/tutorial-data.csv +0 -0
  36. {tablassert-7.4.10 → tablassert-7.4.11}/docs/examples/tutorial-graph.yaml +0 -0
  37. {tablassert-7.4.10 → tablassert-7.4.11}/docs/examples/tutorial-table.yaml +0 -0
  38. {tablassert-7.4.10 → tablassert-7.4.11}/docs/examples.md +0 -0
  39. {tablassert-7.4.10 → tablassert-7.4.11}/docs/index.md +0 -0
  40. {tablassert-7.4.10 → tablassert-7.4.11}/docs/installation.md +0 -0
  41. {tablassert-7.4.10 → tablassert-7.4.11}/docs/tutorial.md +0 -0
  42. {tablassert-7.4.10 → tablassert-7.4.11}/llms.txt +0 -0
  43. {tablassert-7.4.10 → tablassert-7.4.11}/mkdocs.yml +0 -0
  44. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/__init__.py +0 -0
  45. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/downloader.py +0 -0
  46. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/enums.py +0 -0
  47. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/ingests.py +0 -0
  48. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/lib.py +0 -0
  49. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/log.py +0 -0
  50. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/nlp.py +0 -0
  51. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/qc.py +0 -0
  52. {tablassert-7.4.10 → tablassert-7.4.11}/src/tablassert/utils.py +0 -0
  53. {tablassert-7.4.10 → tablassert-7.4.11}/tests/__init__.py +0 -0
  54. {tablassert-7.4.10 → tablassert-7.4.11}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
  55. {tablassert-7.4.10 → tablassert-7.4.11}/tests/fixtures/minimal_section.yaml +0 -0
  56. {tablassert-7.4.10 → tablassert-7.4.11}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
  57. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_downloader.py +0 -0
  58. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_enums.py +0 -0
  59. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_fullmap.py +0 -0
  60. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_ingests.py +0 -0
  61. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_lib.py +0 -0
  62. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_nlp.py +0 -0
  63. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_qc.py +0 -0
  64. {tablassert-7.4.10 → tablassert-7.4.11}/tests/test_utils.py +0 -0
@@ -0,0 +1,50 @@
1
+ # AGENTS.md — Tablassert
2
+
3
+ ## Fast Start
4
+
5
+ - Python package, not a monorepo. Main code lives in `src/tablassert/`; tests live in `tests/`.
6
+ - Install with `uv sync`. QC is not available unless you install an extra: `uv sync --extra qc` or `uv sync --extra qc-cuda`.
7
+ - CLI entrypoint is `tablassert.cli:APP`. Real user commands are:
8
+ - `uv run tablassert build <graph.yaml>`
9
+ - `uv run tablassert validate <table.yaml> <datassert>`
10
+
11
+ ## Verify Changes
12
+
13
+ - Match the repo hooks before finishing: `uv run ruff check --fix .`, `uv run ruff format .`, `uv run pyright`, `uv run pytest`.
14
+ - Full hook run: `uv run pre-commit run --all-files`.
15
+ - Focused test runs:
16
+ - Single test: `uv run pytest tests/test_lib.py::test_name`
17
+ - By keyword: `uv run pytest -k "pattern"`
18
+ - With print output: `uv run pytest -s tests/test_lib.py`
19
+ - Docs build: `uv run --group dev mkdocs build`
20
+
21
+ ## High-Value Structure
22
+
23
+ - `src/tablassert/cli.py` is the wiring layer: `build()` calls `build_pipeline()`, `validate()` calls `validate_pipeline()`.
24
+ - `src/tablassert/ingests.py` loads YAML and expands table configs into section dicts.
25
+ - `src/tablassert/lib.py` is the core pipeline:
26
+ - `Tcode.collect()` builds the per-section operation list.
27
+ - `compile_subgraph()` executes that list into parquet.
28
+ - `compile_graph()` aggregates subgraph parquets into KGX NDJSON.
29
+ - `resolve_many()` is the direct library API for batch entity resolution.
30
+ - Entity resolution uses DuckDB shard files under `<datassert>/data/`. `src/tablassert/fullmap.py` hardcodes `SHARDS = 10`.
31
+
32
+ ## Repo-Specific Gotchas
33
+
34
+ - Heavy dependencies are lazy-loaded per module with `TYPE_CHECKING` + `lazy_loader`. Follow the existing pattern instead of importing heavy packages eagerly.
35
+ - `tests/conftest.py` autouse-mocks `httpx.head`, so model URL validation tests do not hit the network unless a test is explicitly marked otherwise.
36
+ - Network-dependent tests are marked `@pytest.mark.network`; GPU QC tests are marked with both `network` and `gpu` in `tests/test_qc.py`.
37
+ - QC runtime selection is strict in `src/tablassert/qc.py`: if `onnxruntime-gpu` is installed but `CUDAExecutionProvider` is unavailable, the code raises instead of falling back to CPU.
38
+ - Downloader behavior in `src/tablassert/downloader.py` is two-path: direct `httpx` fetch for known file URLs, headless-browser fallback for browser-only sources. Keep tests around payload validation and cleanup intact when changing it.
39
+
40
+ ## Conventions That Matter Here
41
+
42
+ - Start every module with `from __future__ import annotations`.
43
+ - Annotate locals, not just function signatures.
44
+ - Use `Optional[T]` / `Union[...]`, not `T | None`.
45
+ - Prefer `Path` over raw path strings.
46
+ - Function docs are usually `# ?` comments above the code, not docstrings.
47
+
48
+ ## Side Effects
49
+
50
+ - The package writes working artifacts to hidden directories in the repo root: `.storassert/`, `.logassert/`, `.cachassert/`, and `.onnxassert/`.
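As an illustration of the lazy-loading gotcha listed in the new AGENTS.md, here is a minimal sketch of deferring a module import until first use. The repository uses `lazy_loader` for this; the sketch swaps in the standard library's `importlib.util.LazyLoader` so it runs without extra packages. The helper name `lazy_import` and the choice of `colorsys` as the deferred module are hypothetical and do not come from the package.

```python
from __future__ import annotations

import importlib.util
import sys
from types import ModuleType
from typing import TYPE_CHECKING


def lazy_import(name: str) -> ModuleType:
    # Hypothetical stand-in for lazy_loader: the module body is not
    # executed until an attribute is first read from the returned object.
    if name in sys.modules:
        return sys.modules[name]  # already imported, nothing to defer
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module: ModuleType = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # only arms the lazy module, does not run it
    return module


if TYPE_CHECKING:
    import colorsys  # type checkers see the real module
else:
    colorsys = lazy_import("colorsys")  # runtime pays the import cost on first use

# First attribute access triggers the real import.
hsv: tuple[float, float, float] = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```

The `TYPE_CHECKING` branch keeps static analysis accurate while the `else` branch keeps startup cheap, which is the split the AGENTS.md note asks contributors to preserve.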
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: tablassert
3
- Version: 7.4.10
3
+ Version: 7.4.11
4
4
  Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
5
5
  Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
6
6
  Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -1,6 +1,6 @@
1
1
  [project]
2
2
  name = "tablassert"
3
- version = "7.4.10"
3
+ version = "7.4.11"
4
4
  description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
5
5
  authors = [
6
6
  { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
@@ -100,7 +100,7 @@ dev = [
100
100
 
101
101
  [tool.pytest.ini_options]
102
102
  testpaths = ["tests"]
103
- markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider"]
103
+ markers = ["network: requires internet", "gpu: requires CUDAExecutionProvider", "datassert: requires the datassert DuckDB shards"]
104
104
 
105
105
  [tool.ruff]
106
106
  line-length = 120
@@ -57,29 +57,30 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
57
57
  sections: list[dict[str, Any]] = list(chain.from_iterable(temp))
58
58
  n: int = len(sections)
59
59
 
60
- # * Build TCode (3/6)
61
- progress.stage(f"Building TCode | Sections: {n}")
62
- advance = progress.section_loop(n, "TCode")
63
- tcode: list[Tcode] = []
64
- for idx, s in enumerate(sections, start=1):
65
- try:
66
- tcode.append(
67
- Tcode.model_validate(
68
- {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc}
69
- )
70
- )
71
- except pydantic.ValidationError as e:
72
- raise RuntimeError(
73
- f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
74
- ) from e
75
- advance(format_section_oneline(tcode[-1]))
76
-
77
60
  with ExitStack() as stack:
78
61
  conns: list[object] = [
79
62
  stack.enter_context(duckdb.connect(g.datassert / "data" / f"{x}.duckdb", read_only=True))
80
63
  for x in range(SHARDS)
81
64
  ]
82
65
 
66
+ # * Build TCode (3/6)
67
+ progress.stage(f"Building TCode | Sections: {n}")
68
+ advance = progress.section_loop(n, "TCode")
69
+ tcode: list[Tcode] = []
70
+ for idx, s in enumerate(sections, start=1):
71
+ try:
72
+ tcode.append(
73
+ Tcode.model_validate(
74
+ {**s, "number": idx, "store": (STORE / f"{mkhash(s)}.parquet"), "log": g.log, "qc": g.qc},
75
+ context={"conns": conns},
76
+ )
77
+ )
78
+ except pydantic.ValidationError as e:
79
+ raise RuntimeError(
80
+ f"02 | FAILED VALIDATION | CONFIG: {graph_configuration_file} | IDX: {idx} | HASH: {mkhash(s)} | PYDANTIC: {flatten_pydantic_error(e)}"
81
+ ) from e
82
+ advance(format_section_oneline(tcode[-1]))
83
+
83
84
  # * Collect Instructions (4/6)
84
85
  progress.stage(f"Collecting Instructions | Sections: {n}")
85
86
  advance = progress.section_loop(n, "Collect")
@@ -105,8 +106,9 @@ def build_pipeline(graph_configuration_file: Path, progress: "PipelineProgress")
105
106
  logger.info(f"BUILD DONE | SECTIONS: {n} | NAME: {g.name} | VERSION: {g.version}")
106
107
 
107
108
 
108
- def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgress") -> None:
109
+ def validate_pipeline(table_configuration_file: Path, datassert: Path, progress: "PipelineProgress") -> None:
109
110
  # ? Validate Section Syntax From A Configuration File
111
+ from tablassert.fullmap import SHARDS
110
112
  from tablassert.ingests import from_yaml, to_sections
111
113
  from tablassert.lib import Tcode
112
114
  from tablassert.progress import flatten_pydantic_error
@@ -124,27 +126,32 @@ def validate_pipeline(table_configuration_file: Path, progress: "PipelineProgres
124
126
  # * Validate Section Syntax (3/3)
125
127
  progress.stage(f"Validating Section Syntax | Sections: {n}")
126
128
  advance = progress.section_loop(n, "Validate")
127
- for idx, s in enumerate(sections, start=1):
128
- h: str = mkhash(s)
129
- try:
130
- Tcode.model_validate({**s, "number": idx, "store": (STORE / f"{h}.parquet")})
131
- except pydantic.ValidationError as e:
132
- raise RuntimeError(
133
- f"02 | FAILED VALIDATION | CONFIG: {table_configuration_file} | IDX: {idx} | HASH: {h} | PYDANTIC: {flatten_pydantic_error(e)}"
134
- ) from e
135
- advance(f"#{idx} | HASH: {h}")
129
+ with ExitStack() as stack:
130
+ conns: list[object] = [
131
+ stack.enter_context(duckdb.connect(datassert / "data" / f"{x}.duckdb", read_only=True))
132
+ for x in range(SHARDS)
133
+ ]
134
+ for idx, s in enumerate(sections, start=1):
135
+ h: str = mkhash(s)
136
+ try:
137
+ Tcode.model_validate({**s, "number": idx, "store": (STORE / f"{h}.parquet")}, context={"conns": conns})
138
+ except pydantic.ValidationError as e:
139
+ raise RuntimeError(
140
+ f"02 | FAILED VALIDATION | CONFIG: {table_configuration_file} | IDX: {idx} | HASH: {h} | PYDANTIC: {flatten_pydantic_error(e)}"
141
+ ) from e
142
+ advance(f"#{idx} | HASH: {h}")
136
143
 
137
144
  logger.info(f"VALIDATE DONE | SECTIONS: {n} | CONFIG: {table_configuration_file.name}")
138
145
 
139
146
 
140
- def run(stages: int, fn: Any, arg: Path) -> None:
147
+ def run(stages: int, fn: Any, *args: Path) -> None:
141
148
  from tablassert.log import LOG_FORMAT, logger
142
149
  from tablassert.progress import PipelineProgress
143
150
 
144
151
  with PipelineProgress(total_stages=stages) as progress:
145
152
  sink_id: int = logger.add(progress.log_sink, level="INFO", format=LOG_FORMAT)
146
153
  try:
147
- fn(arg, progress)
154
+ fn(*args, progress)
148
155
  finally:
149
156
  logger.remove(sink_id)
150
157
 
@@ -156,6 +163,6 @@ def build(graph_configuration_file: Path) -> None:
156
163
 
157
164
 
158
165
  @APP.command
159
- def validate(table_configuration_file: Path) -> None:
160
- """Validate section syntax from a YAML configuration file."""
161
- run(3, validate_pipeline, table_configuration_file)
166
+ def validate(table_configuration_file: Path, datassert: Path) -> None:
167
+ """Validate section syntax from a YAML configuration file against a datassert directory."""
168
+ run(3, validate_pipeline, table_configuration_file, datassert)
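The cli.py hunks above move section validation inside the `ExitStack` block so that every section shares one set of read-only shard connections. A minimal sketch of that shape follows, under stated assumptions: in-memory `sqlite3` connections stand in for the `duckdb.connect(..., read_only=True)` calls in the diff, and `validate_sections` and `Check` are hypothetical names used only for illustration.

```python
from __future__ import annotations

import sqlite3
from contextlib import ExitStack, closing
from typing import Any, Callable

SHARDS: int = 10  # same count AGENTS.md reports for fullmap.SHARDS

# Hypothetical per-section check that receives the shared connections.
Check = Callable[[dict[str, Any], list[sqlite3.Connection]], None]


def validate_sections(sections: list[dict[str, Any]], check: Check) -> list[sqlite3.Connection]:
    # Open each shard once, share the handles across all sections, and let
    # ExitStack close every one of them even if a check raises.
    with ExitStack() as stack:
        conns: list[sqlite3.Connection] = [
            stack.enter_context(closing(sqlite3.connect(":memory:")))  # stand-in for a DuckDB shard
            for _ in range(SHARDS)
        ]
        for section in sections:
            check(section, conns)
    # Returned only so a caller can observe that the handles are closed.
    return conns


seen: list[tuple[list[str], int]] = []
closed = validate_sections([{"a": 1}, {"b": 2}], lambda s, c: seen.append((list(s), len(c))))
```

Opening the connections once per run rather than once per section is the design choice the diff makes in both `build_pipeline` and `validate_pipeline`; the sketch only mirrors the control flow, not DuckDB behavior.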
@@ -51,7 +51,9 @@ def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFr
51
51
 
52
52
  terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
53
53
 
54
- bad: str = r"^\d+$|^(none|nan|na|null|unknown)$|^$"
54
+ bad: str = (
55
+ r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"
56
+ )
55
57
  terms = terms.filter(~pl.col(col).str.contains(bad))
56
58
  return terms.with_columns((plh.col(col).nchash.xxhash64() % SHARDS).alias("shard")) # pyright: ignore
57
59
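To make the effect of the widened `bad` pattern concrete, here is a small sketch that applies the same regular expression with Python's `re` module instead of Polars `str.contains`. It assumes terms have already been lowercased and stripped upstream, as the pattern is case-sensitive; `keep_terms` is a hypothetical helper, not part of the package.

```python
from __future__ import annotations

import re

# Pattern copied from the 7.4.11 side of the fullmap.py hunk.
BAD: re.Pattern[str] = re.compile(
    r"^\d+$|^(none|nan|na|null|unknown|not applicable|p value|variable|result|exposure|expression|symbol)$|^$"
)


def keep_terms(terms: list[str]) -> list[str]:
    # Drop purely numeric terms, empty strings, and whole-string matches of
    # the placeholder or column-header words; keep everything else.
    return [t for t in terms if not BAD.search(t)]


kept: list[str] = keep_terms(["brca1", "42", "p value", "", "result", "results", "tp53 expression"])
```

Because every alternative is anchored with `^` and `$`, only exact matches are filtered: `result` is dropped, while `results` and `tp53 expression` survive.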
 
@@ -25,6 +25,9 @@ from tablassert.enums import (
25
25
  Tokens,
26
26
  )
27
27
 
28
+ from tablassert.fullmap import resolve
29
+ from tablassert.nlp import level_one, level_two
30
+
28
31
  if TYPE_CHECKING:
29
32
  import httpx
30
33
  import polars as pl
@@ -315,6 +318,35 @@ class Annotation(Encoding):
315
318
  return annotation.replace("_", " ").strip()
316
319
 
317
320
 
321
+ def resolves_value_encodings(statement: Statement, conns: list[object]) -> None:
322
+ # ? Resolve Every Value-Method Literal Against The Shared Datassert Shards
323
+ nodes: list[tuple[str, NodeEncoding]] = [("subject", statement.subject), ("object", statement.object)]
324
+ if statement.qualifiers:
325
+ nodes += [(q.qualifier, q) for q in statement.qualifiers]
326
+
327
+ for label, node in nodes:
328
+ if not eq(node.method, EncodingMethods.VALUE):
329
+ continue
330
+
331
+ term: str = str(node.encoding)
332
+ lf: pl.LazyFrame = pl.DataFrame({"term": [term]}).lazy()
333
+ lf = level_one(lf, "term")
334
+ lf = level_two(lf, "term")
335
+ resolved: pl.DataFrame = resolve(
336
+ lf,
337
+ "term",
338
+ conns,
339
+ taxon=str(node.taxon) if node.taxon else None,
340
+ prioritize=node.prioritize,
341
+ avoid=node.avoid,
342
+ log=False,
343
+ column_context=False,
344
+ ).collect()
345
+ if resolved.height == 0:
346
+ msg: str = f"21 | value encoding {term!r} in {label!r} did not resolve against datassert"
347
+ raise ValueError(msg)
348
+
349
+
318
350
  class Section(TablaBase):
319
351
  # ? Pydantic "Section" Model And Coercion
320
352
  syntax: Syntaxes = Field(Syntaxes.TC3, description="Section configuration syntax version.")
@@ -326,6 +358,16 @@ class Section(TablaBase):
326
358
  None, description="Optional extra encoded columns added to each row."
327
359
  )
328
360
 
361
+ @field_validator("statement", mode="after")
362
+ @classmethod
363
+ def value_encodings_resolve(cls, statement: Statement, info: Any) -> Statement:
364
+ # ? Ensure Value-Method Encodings Resolve Against The Shared Datassert Shards
365
+ conns: Optional[list[object]] = info.context.get("conns") if info.context else None
366
+ if conns is None:
367
+ return statement # * skip without shared connections (contextless path)
368
+ resolves_value_encodings(statement, conns)
369
+ return statement
370
+
329
371
 
330
372
  class Graph(TablaBase):
331
373
  # ? Pydantic "Graph" Configuration
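The models.py hunks above add a validator that only checks value-method encodings when shard connections are supplied through the validation context. The control flow can be sketched in plain Python as below. This is not the package's code: nodes are plain dicts rather than Pydantic models, and `Resolver` and `check_value_encodings` are hypothetical names, with the resolver returning a match count in place of the `resolve(...).collect().height` check in the diff.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

# Hypothetical resolver: returns how many matches a term has in the shards.
Resolver = Callable[[str, list[object]], int]


def check_value_encodings(
    nodes: list[tuple[str, dict[str, Any]]], context: Optional[dict[str, Any]], resolver: Resolver
) -> bool:
    # Contextless path: without shared connections nothing is checked.
    conns: Optional[list[object]] = context.get("conns") if context else None
    if conns is None:
        return False
    for label, node in nodes:
        if node["method"] != "value":
            continue  # column-method encodings are never resolved here
        term: str = str(node["encoding"])
        if resolver(term, conns) == 0:
            raise ValueError(f"21 | value encoding {term!r} in {label!r} did not resolve against datassert")
    return True


known: set[str] = {"BRCA1", "TP53"}
resolver: Resolver = lambda term, _conns: int(term in known)
nodes: list[tuple[str, dict[str, Any]]] = [
    ("subject", {"method": "value", "encoding": "BRCA1"}),
    ("object", {"method": "column", "encoding": "B"}),
]
```

Returning early when no context is present is what keeps direct `Section(...)` construction working without a datassert directory, which is the case exercised by `test_value_encoding_skips_without_context` in the test diff.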
@@ -35,8 +35,7 @@ def format_section_oneline(x: "Tcode") -> str:
35
35
  else:
36
36
  source_detail = f"TEXT({(x.source.delimiter or ',')!r})" # pyright: ignore
37
37
  return (
38
- f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} "
39
- f"| CONFIG: {x.config.name} | STATUS: {x.status}"
38
+ f"#{x.number} | HASH: {x.store.stem} | SOURCE: {source_detail} | CONFIG: {x.config.name} | STATUS: {x.status}"
40
39
  )
41
40
 
42
41
 
@@ -1,7 +1,8 @@
1
1
  from __future__ import annotations
2
2
 
3
+ import os
3
4
  from pathlib import Path
4
- from typing import Any
5
+ from typing import Any, Optional
5
6
 
6
7
  import httpx
7
8
  import pytest
@@ -26,3 +27,15 @@ def mockhttpxhead(monkeypatch: pytest.MonkeyPatch) -> None:
26
27
  @pytest.fixture
27
28
  def fixtures_path() -> Path:
28
29
  return Path(__file__).parent / "fixtures"
30
+
31
+
32
+ @pytest.fixture
33
+ def datassert_dir() -> Path:
34
+ # ? Shared Datassert Shard Directory (Skipped When Unavailable)
35
+ env: Optional[str] = os.environ.get("DATASSERT")
36
+ if not env:
37
+ pytest.skip("DATASSERT env var not set; skipping datassert-dependent test")
38
+ directory: Path = Path(env)
39
+ if not (directory / "data" / "0.duckdb").is_file():
40
+ pytest.skip(f"datassert shard data/0.duckdb not found under {directory}")
41
+ return directory
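The `datassert_dir` fixture above gates tests on two conditions: the `DATASSERT` environment variable is set, and `data/0.duckdb` exists beneath it. A minimal sketch of the same gating outside pytest is shown here; `locate_datassert` is a hypothetical helper that returns `None` where the fixture would call `pytest.skip`.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def locate_datassert(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    # Mirrors the fixture's two checks; None means "would be skipped".
    raw: Optional[str] = env.get("DATASSERT")
    if not raw:
        return None
    directory: Path = Path(raw)
    if not (directory / "data" / "0.duckdb").is_file():
        return None
    return directory


# Demonstration against a temporary directory, not a real datassert build.
with tempfile.TemporaryDirectory() as tmp:
    root: Path = Path(tmp)
    (root / "data").mkdir()
    missing: Optional[Path] = locate_datassert({"DATASSERT": tmp})
    (root / "data" / "0.duckdb").touch()  # empty placeholder file
    found: Optional[Path] = locate_datassert({"DATASSERT": tmp})
    found_matches_root: bool = found == root
```

With the new `datassert` marker registered in pyproject.toml, a run along the lines of `DATASSERT=/path/to/datassert uv run pytest -m datassert` would presumably select only the shard-backed tests; the path shown is a placeholder.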
@@ -3,9 +3,11 @@ from __future__ import annotations
3
3
  from pathlib import Path
4
4
  from typing import Any
5
5
 
6
+ import polars as pl
6
7
  import pytest
7
8
  from pydantic import ValidationError
8
9
 
10
+ import tablassert.models as models
9
11
  from tablassert.enums import Categories
10
12
  from tablassert.ingests import from_yaml
11
13
  from tablassert.models import (
@@ -280,3 +282,193 @@ def test_section_with_annotations() -> None:
280
282
  ],
281
283
  )
282
284
  assert len(section.annotations) == 2 # pyright: ignore
285
+
286
+
287
+ # ? Value Encoding Resolves Against Datassert (Context-Aware Pass)
288
+ def test_value_encoding_resolves_pass(monkeypatch: pytest.MonkeyPatch) -> None:
289
+ def fake_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
290
+ return pl.DataFrame({"resolved": ["YES"]}).lazy()
291
+
292
+ monkeypatch.setattr(models, "resolve", fake_resolve)
293
+ section: Section = Section.model_validate(
294
+ {
295
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
296
+ "statement": {
297
+ "subject": {"method": "value", "encoding": "BRCA1"},
298
+ "object": {"method": "value", "encoding": "TP53"},
299
+ },
300
+ "provenance": {
301
+ "repo": "PMC",
302
+ "publication": "PMC000",
303
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
304
+ },
305
+ },
306
+ context={"conns": [object()]},
307
+ )
308
+ assert section.statement.subject.encoding == "BRCA1"
309
+
310
+
311
+ # ? Value Encoding Fails To Resolve Raises Code 21
312
+ def test_value_encoding_resolves_fail(monkeypatch: pytest.MonkeyPatch) -> None:
313
+ def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
314
+ return pl.DataFrame({"resolved": []}).lazy()
315
+
316
+ monkeypatch.setattr(models, "resolve", fake_resolve_empty)
317
+ with pytest.raises(ValidationError) as exc_info:
318
+ Section.model_validate(
319
+ {
320
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
321
+ "statement": {
322
+ "subject": {"method": "value", "encoding": "BRCA1"},
323
+ "object": {"method": "value", "encoding": "TP53"},
324
+ },
325
+ "provenance": {
326
+ "repo": "PMC",
327
+ "publication": "PMC000",
328
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
329
+ },
330
+ },
331
+ context={"conns": [object()]},
332
+ )
333
+ assert "21 |" in str(exc_info.value)
334
+
335
+
336
+ # ? Value Encoding Validator Skips Without Context
337
+ def test_value_encoding_skips_without_context(monkeypatch: pytest.MonkeyPatch) -> None:
338
+ def fake_resolve_empty(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
339
+ return pl.DataFrame({"resolved": []}).lazy()
340
+
341
+ monkeypatch.setattr(models, "resolve", fake_resolve_empty)
342
+ section: Section = Section( # pyright: ignore
343
+ source={"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
344
+ statement={
345
+ "subject": {"method": "value", "encoding": "BRCA1"},
346
+ "object": {"method": "value", "encoding": "TP53"},
347
+ },
348
+ provenance={
349
+ "repo": "PMC",
350
+ "publication": "PMC000",
351
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
352
+ },
353
+ )
354
+ assert section.statement.subject.encoding == "BRCA1"
355
+
356
+
357
+ # ? Column Method Encodings Are Not Checked
358
+ def test_column_encoding_not_checked(monkeypatch: pytest.MonkeyPatch) -> None:
359
+ def boom_resolve(_lf: Any, _col: str, _conns: list[object], **_kwargs: Any) -> Any:
360
+ raise AssertionError("resolve must not be called for column-method encodings")
361
+
362
+ monkeypatch.setattr(models, "resolve", boom_resolve)
363
+ section: Section = Section.model_validate(
364
+ {
365
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
366
+ "statement": {
367
+ "subject": {"method": "column", "encoding": "A"},
368
+ "object": {"method": "column", "encoding": "B"},
369
+ },
370
+ "provenance": {
371
+ "repo": "PMC",
372
+ "publication": "PMC000",
373
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
374
+ },
375
+ },
376
+ context={"conns": [object()]},
377
+ )
378
+ assert section.statement.subject.encoding == "A"
379
+
380
+
381
+ # ? Qualifier Value Encoding Is Checked Against Datassert
382
+ def test_qualifier_value_encoding_checked(monkeypatch: pytest.MonkeyPatch) -> None:
383
+ def fake_resolve(lf: Any, col: str, _conns: list[object], **_kwargs: Any) -> Any:
384
+ term: str = str(lf.collect().get_column(col).to_list()[0])
385
+ if term in ("brca1", "tp53"):
386
+ return pl.DataFrame({"resolved": ["YES"]}).lazy()
387
+ return pl.DataFrame({"resolved": []}).lazy()
388
+
389
+ monkeypatch.setattr(models, "resolve", fake_resolve)
390
+ with pytest.raises(ValidationError) as exc_info:
391
+ Section.model_validate(
392
+ {
393
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
394
+ "statement": {
395
+ "subject": {"method": "value", "encoding": "BRCA1"},
396
+ "object": {"method": "value", "encoding": "TP53"},
397
+ "qualifiers": [
398
+ {"qualifier": "disease_context_qualifier", "method": "value", "encoding": "ZZZNOTAREALGENE123"}
399
+ ],
400
+ },
401
+ "provenance": {
402
+ "repo": "PMC",
403
+ "publication": "PMC000",
404
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
405
+ },
406
+ },
407
+ context={"conns": [object()]},
408
+ )
409
+ assert "21 |" in str(exc_info.value)
410
+
411
+
412
+ # ? Real Value Encoding Resolves Against The Datassert Shards
413
+ @pytest.mark.datassert
414
+ def test_real_value_encoding_resolves(datassert_dir: Path) -> None:
415
+ from contextlib import ExitStack
416
+
417
+ import duckdb
418
+
419
+ from tablassert.fullmap import SHARDS
420
+
421
+ with ExitStack() as stack:
422
+ conns: list[object] = [
423
+ stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
424
+ for x in range(SHARDS)
425
+ ]
426
+ section: Section = Section.model_validate(
427
+ {
428
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
429
+ "statement": {
430
+ "subject": {"method": "value", "encoding": "BRCA1"},
431
+ "object": {"method": "value", "encoding": "TP53"},
432
+ },
433
+ "provenance": {
434
+ "repo": "PMC",
435
+ "publication": "PMC000",
436
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
437
+ },
438
+ },
439
+ context={"conns": conns},
440
+ )
441
+ assert section.statement.subject.encoding == "BRCA1"
442
+
443
+
444
+ # ? Real Value Encoding Failure Raises Code 21
445
+ @pytest.mark.datassert
446
+ def test_real_value_encoding_fails(datassert_dir: Path) -> None:
447
+ from contextlib import ExitStack
448
+
449
+ import duckdb
450
+
451
+ from tablassert.fullmap import SHARDS
452
+
453
+ with ExitStack() as stack:
454
+ conns: list[object] = [
455
+ stack.enter_context(duckdb.connect(datassert_dir / "data" / f"{x}.duckdb", read_only=True))
456
+ for x in range(SHARDS)
457
+ ]
458
+ with pytest.raises(ValidationError) as exc_info:
459
+ Section.model_validate(
460
+ {
461
+ "source": {"local": "./t.tsv", "url": "https://example.com/t.tsv", "kind": "text"},
462
+ "statement": {
463
+ "subject": {"method": "value", "encoding": "ZZZNOTAREALGENE123"},
464
+ "object": {"method": "value", "encoding": "TP53"},
465
+ },
466
+ "provenance": {
467
+ "repo": "PMC",
468
+ "publication": "PMC000",
469
+ "contributors": [{"kind": "curation", "name": "T", "date": "2025"}],
470
+ },
471
+ },
472
+ context={"conns": conns},
473
+ )
474
+ assert "21 |" in str(exc_info.value)
@@ -2360,7 +2360,7 @@ wheels = [
2360
2360
 
2361
2361
  [[package]]
2362
2362
  name = "tablassert"
2363
- version = "7.4.10"
2363
+ version = "7.4.11"
2364
2364
  source = { editable = "." }
2365
2365
  dependencies = [
2366
2366
  { name = "cyclopts" },
@@ -1,168 +0,0 @@
1
- # AGENTS.md — Tablassert
2
-
3
- Guidance for AI coding agents working in this repository.
4
-
5
- ## Project Overview
6
-
7
- Tablassert is a Python package (>=3.11) for tabular data assertion, normalization, and optional quality control. It builds declarative knowledge graphs from tabular data, exporting NCATS Translator-compliant KGX NDJSON. Uses **Polars** DataFrames, **DuckDB** for entity resolution, and **ONNX/BioBERT** for QC when enabled. CLI built with **cyclopts**. Models built with **Pydantic v2**.
8
-
9
- ## Quick Reference
10
-
11
- | Task | Command |
12
- |---|---|
13
- | Install | `uv sync` |
14
- | Run CLI | `uv run tablassert` |
15
- | Lint | `uv run ruff check .` |
16
- | Lint (fix) | `uv run ruff check --fix .` |
17
- | Format | `uv run ruff format .` |
18
- | Format check | `uv run ruff format --check .` |
19
- | Type check | `uv run pyright` |
20
- | All checks | `uv run pre-commit run --all-files` |
21
- | Run all tests | `uv run pytest` |
22
- | Run single test | `uv run pytest tests/test_foo.py::test_name` |
23
- | Run by keyword | `uv run pytest -k "test_pattern"` |
24
- | Run with print | `uv run pytest -s tests/test_foo.py` |
25
- | Build | `uv build` |
26
- | Build docs | `uv run --group dev mkdocs build` |
27
- | Add dependency | `uv add <package>` |
28
- | Add dev dependency | `uv add --group dev <package>` |
29
-
30
- ## Repository Structure
31
-
32
- ```
33
- src/tablassert/
34
- cli.py # cyclopts CLI (entry point: tablassert.cli:APP)
35
- lib.py # Core logic: encodings, data loading, Tcode(Section) class
36
- models.py # Pydantic v2 models (TablaBase base class)
37
- enums.py # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
38
- fullmap.py # NER / entity resolution (DuckDB, 10 shards)
39
- qc.py # Quality control (ONNX/BioBERT, sentence_transformers)
40
- nlp.py # Text normalization (level_one: strip+lowercase, level_two: regex)
41
- ingests.py # YAML ingestion: from_yaml(), to_sections(), fastmerge()
42
- downloader.py # httpx-based file downloads with retries
43
- progress.py # Rich progress bars for pipeline stages
44
- utils.py # Hashing (xxhash), STORE path, namespace UUIDs
45
- log.py # loguru logger → .logassert/tablassert.log; cat() helper for category tagging
46
- __init__.py # Empty file (lazy loading is per-module, not here)
47
- docs/ # MkDocs documentation source
48
- mkdocs.yml # MkDocs configuration
49
- pyproject.toml # Project config, dependencies, tool settings
50
- tests/ # Test directory (at repo root)
51
- ```
52
-
53
- - `conftest.py` provides a `fixtures_path` fixture returning `Path(__file__).parent / "fixtures"`.
54
- - pytest configured via `pyproject.toml` `[tool.pytest.ini_options]` with `testpaths = ["tests"]`.
55
- - pytest markers: `network` requires internet; `gpu` requires `CUDAExecutionProvider`.
56
- - Test fixtures: `tests/fixtures/` contains YAML files for Section model tests.
57
- - Test modules: `test_downloader.py`, `test_enums.py`, `test_fullmap.py`, `test_ingests.py`, `test_lib.py`, `test_models.py`, `test_nlp.py`, `test_utils.py`.
58
-
59
- ## Code Style
60
-
61
- ### Imports
62
-
63
- - Every file starts with `from __future__ import annotations`
64
- - Heavy dependencies are loaded **lazily per-module** using this pattern:
65
- ```python
66
- from typing import TYPE_CHECKING
67
- import lazy_loader as Lazy
68
-
69
- if TYPE_CHECKING:
70
- import polars as pl
71
- else:
72
- pl = Lazy.load("polars")
73
- ```
74
- - Lazy-loaded deps: `polars`, `duckdb`, `orjson`, `xxhash`, `polars_hash`, `yaml`, `httpx`, `pyexcel`, `onnxruntime`, `sentence_transformers`
75
- - Direct (non-lazy) heavy deps: `sqlite_utils`, `rapidfuzz`, `pydantic`, `loguru`, `cyclopts`, `rich`, `yaml.CLoader`
76
- - Some modules mix direct and lazy imports for the same package (e.g., `ingests.py` does `from yaml import CLoader` directly, then lazy-loads `yaml` for `yaml.load()`)
77
- - Import order: standard library → blank line → third-party → blank line → local
78
- - Use `from __future__ import annotations` to enable deferred evaluation
79
-
80
- ### Type Annotations
81
-
82
- - **Every variable** gets a type annotation, including locals: `col: str = "name"`, `df: pl.DataFrame = ...`
83
- - Use `Optional[T]` and `Union[...]` (not `T | None` or `X | Y`)
84
- - Use `Self` for class methods returning the class type
85
- - Use `Path` (not `str`) for filesystem paths
86
- - Use `# pyright: ignore` comments to suppress false positives from lazy-loaded modules
87
-
88
- ### Pydantic Models
89
-
90
- - All models inherit from `TablaBase(BaseModel)` which sets:
91
- ```python
92
- model_config: ConfigDict = ConfigDict( # pyright: ignore
93
- str_strip_whitespace=False,
94
- validate_assignment=True,
95
- use_enum_values=True,
96
- extra="forbid",
97
- populate_by_name=True,
98
- )
99
- ```
100
- - Required fields: `Field(...)` (ellipsis sentinel)
101
- - Optional fields: `Optional[T] = Field(None)`
102
- - All enums are `str, Enum` subclasses (defined in `enums.py`)
103
-
104
- ### Enums
105
-
106
- All enums live in `enums.py` and extend `str, Enum`. Key enums: `Tokens`, `Repositories`, `Contributions`, `Comparisons`, `Functions`, `Files`, `EncodingMethods`, `FillMethods`, `Syntaxes`, `Statuses`, `Categories`, `Predicates`, `Qualifiers`.
107
-
108
- ### Naming
109
-
110
- - Functions/variables: `snake_case`
111
- - Classes: `PascalCase`
112
- - Module-level constants: `UPPER_CASE`
113
-
114
- ### Comments
115
-
116
- - `# ?` — descriptions / clarifications
117
- - `# !` — warnings / important notes
118
- - `# *` — stage markers (pipeline steps)
119
- - `# TODO:` — todos
120
- - No docstrings on functions; use `# ?` comment on the line above instead
121
-
122
- ### Formatting (enforced by ruff)
123
-
124
- - Line length: **120**
125
- - Quote style: **double quotes**
126
- - Indent: **4 spaces**
127
- - `skip-magic-trailing-comma = true`
128
- - Target: Python >=3.11
129
-
130
- ### Error Handling
131
-
132
- - Use `RuntimeError` for exceptional cases (no custom exception classes currently)
133
- - Use `logger.warning()` for non-fatal issues (e.g., empty subgraphs)
134
- - Logger: `from tablassert.log import logger` (or `cat()` for category-tagged logger)
135
-
136
- ### Other Conventions
137
-
138
- - `operator.add` for Polars string concatenation on columns (not `+` directly)
139
- - CLI entry point: `tablassert.cli:APP` (cyclopts app)
140
- - Use `rich.progress` for progress tracking in CLI (via `progress.py` which wraps Rich Live/Progress)
141
- - Data side-effects stored in hidden directories: `.logassert/`, `.storassert/`, `.onnxassert/`
142
-
143
- ## Tools
144
-
145
- - **ruff** — linting (`ruff check`) and formatting (`ruff format`)
146
- - **pyright** — type checking (no pyrightconfig.json; uses defaults)
147
- - **pre-commit** — runs ruff fix, ruff-format, pyright, and pytest on all Python files
148
- - **pytest** — testing (>=9.0.2)
149
- - **uv** — package manager (use `uv run` for all commands, `uv add` for deps)
150
- - **hatchling** — build backend
151
-
152
- ## Optional Dependency Groups
153
-
154
- Defined in `pyproject.toml` `[project.optional-dependencies]`:
155
- - `rt` — `polars[rtcompat]` (runtime-compatible Polars build for CPUs without required instructions)
156
- - `qc` — `onnxruntime` (CPU QC runtime)
157
- - `qc-cuda` — `onnxruntime-gpu` (CUDA QC runtime; single GPU on device 0)
158
-
159
- All other ML, web, and Excel dependencies are in core `dependencies`; the ONNX Runtime choice is extra-driven.
160
-
161
- Install with: `uv sync`, `uv sync --extra qc`, `uv sync --extra qc-cuda`, or `pip install tablassert[...]`
162
-
163
- ## CI Workflows
164
-
165
- - **PyPI publish** (`.github/workflows/pipy.yml`): builds and publishes on push to `main`
166
- - **MkDocs deploy** (`.github/workflows/docs.yml`): builds docs and deploys to GitHub Pages on push to `main`
167
- - **Docker publish** (`.github/workflows/docker.yml`): builds and pushes image to GHCR on tag push (`v*`)
168
- - **Autotag** (`.github/workflows/autotag.yml`): automatic version tagging
File without changes