tablassert 7.3.2.tar.gz → 7.3.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. {tablassert-7.3.2 → tablassert-7.3.4}/AGENTS.md +1 -1
  2. {tablassert-7.3.2 → tablassert-7.3.4}/CHANGELOG.md +18 -0
  3. {tablassert-7.3.2 → tablassert-7.3.4}/CONTRIBUTING.md +3 -6
  4. {tablassert-7.3.2 → tablassert-7.3.4}/PKG-INFO +4 -4
  5. {tablassert-7.3.2 → tablassert-7.3.4}/README.md +3 -3
  6. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/fullmap.md +6 -6
  7. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/lib.md +9 -9
  8. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/qc.md +3 -3
  9. tablassert-7.3.4/docs/changelog.md +13 -0
  10. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/advanced-example.md +10 -4
  11. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/graph.md +4 -1
  12. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/table.md +17 -5
  13. tablassert-7.3.4/docs/datassert.md +106 -0
  14. {tablassert-7.3.2 → tablassert-7.3.4}/docs/docker.md +2 -2
  15. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples.md +11 -7
  16. {tablassert-7.3.2 → tablassert-7.3.4}/docs/tutorial.md +1 -1
  17. {tablassert-7.3.2 → tablassert-7.3.4}/llms.txt +1 -1
  18. {tablassert-7.3.2 → tablassert-7.3.4}/mkdocs.yml +1 -1
  19. {tablassert-7.3.2 → tablassert-7.3.4}/pyproject.toml +1 -1
  20. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/downloader.py +15 -4
  21. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/fullmap.py +20 -20
  22. {tablassert-7.3.2 → tablassert-7.3.4}/uv.lock +2 -2
  23. tablassert-7.3.2/docs/datassert.md +0 -66
  24. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/autotag.yml +0 -0
  25. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/docker.yml +0 -0
  26. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/docs.yml +0 -0
  27. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/pipy.yml +0 -0
  28. {tablassert-7.3.2 → tablassert-7.3.4}/.gitignore +0 -0
  29. {tablassert-7.3.2 → tablassert-7.3.4}/.pre-commit-config.yaml +0 -0
  30. {tablassert-7.3.2 → tablassert-7.3.4}/CITATION.cff +0 -0
  31. {tablassert-7.3.2 → tablassert-7.3.4}/Dockerfile +0 -0
  32. {tablassert-7.3.2 → tablassert-7.3.4}/LICENSE +0 -0
  33. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/utils.md +0 -0
  34. {tablassert-7.3.2 → tablassert-7.3.4}/docs/cli.md +0 -0
  35. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-data.csv +0 -0
  36. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-graph.yaml +0 -0
  37. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-table.yaml +0 -0
  38. {tablassert-7.3.2 → tablassert-7.3.4}/docs/index.md +0 -0
  39. {tablassert-7.3.2 → tablassert-7.3.4}/docs/installation.md +0 -0
  40. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/__init__.py +0 -0
  41. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/cli.py +0 -0
  42. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/enums.py +0 -0
  43. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/ingests.py +0 -0
  44. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/lib.py +0 -0
  45. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/log.py +0 -0
  46. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/models.py +0 -0
  47. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/nlp.py +0 -0
  48. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/qc.py +0 -0
  49. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/utils.py +0 -0
  50. {tablassert-7.3.2 → tablassert-7.3.4}/tests/__init__.py +0 -0
  51. {tablassert-7.3.2 → tablassert-7.3.4}/tests/conftest.py +0 -0
  52. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
  53. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/minimal_section.yaml +0 -0
  54. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
  55. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_enums.py +0 -0
  56. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_fullmap.py +0 -0
  57. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_ingests.py +0 -0
  58. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_lib.py +0 -0
  59. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_models.py +0 -0
  60. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_nlp.py +0 -0
  61. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_utils.py +0 -0
{tablassert-7.3.2 → tablassert-7.3.4}/AGENTS.md
@@ -35,7 +35,7 @@ src/tablassert/
  lib.py # Core logic: encodings, data loading, Tcode(Section) class
  models.py # Pydantic v2 models (TablaBase base class)
  enums.py # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
- fullmap.py # NER / entity resolution (DuckDB, 16 shards)
+ fullmap.py # NER / entity resolution (DuckDB, 10 shards)
  qc.py # Quality control (ONNX/BioBERT, sentence_transformers)
  nlp.py # Text normalization (level_one: strip+lowercase, level_two: regex)
  ingests.py # YAML ingestion: from_yaml(), to_sections(), fastmerge()
{tablassert-7.3.2 → tablassert-7.3.4}/CHANGELOG.md
@@ -2,6 +2,24 @@
 
  All notable changes to this project are documented in this file.
 
+ ## 7.3.4 - 2026-04-28
+
+ ### Bug Fixes
+ - Fixed `downloader.from_url()` failing on URLs that trigger an immediate download. The Playwright session now opens a browser context with `accept_downloads=True`, wraps `page.goto()` inside `page.expect_download()`, and tolerates the expected `net::ERR_ABORTED` navigation error that fires when the response is a download rather than a page.
+
+ ### Documentation
+ - Documented `miscellaneous notes` as a freetext catch-all annotation in the table-configuration and advanced-example pages — used for assay caveats, non-standard units, and qualitative observations that don't map cleanly to a structured field. Supports both `method: value` (constant) and `method: column` (per-row).
+ - Documented Polars regex constraints for the `regex` and `remove` transforms: patterns are passed to Polars `str.replace_all()` (Rust `regex` crate), so capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported and will raise at parse time. Chain simple substitutions instead, or capture residual context in a `miscellaneous notes` annotation.
+
+ ## 7.3.3 - 2026-04-08
+
+ ### Bug Fixes
+ - Changed the datassert shard count to 10 (`SHARDS` constant in `fullmap.py`) to match the current datassert database layout.
+
+ ### Documentation
+ - Updated shard-count references across documentation and examples to reflect the current 10-shard datassert layout.
+ - Corrected provenance examples so `repo` carries the namespace prefix and `publication` carries the repository-local identifier.
+
  ## 7.3.2 - 2026-04-03
 
  ### Maintenance
{tablassert-7.3.2 → tablassert-7.3.4}/CONTRIBUTING.md
@@ -20,15 +20,12 @@ cd Tablassert
  uv sync
  ```
 
- ### Optional Dependency Groups
+ ### Optional Extras
 
- Some features require optional dependencies:
+ All ML, web, and Excel dependencies are included in the core install. The only optional extra is a runtime-compatible Polars build for CPUs without the required instruction sets:
 
  ```bash
- uv sync --extra ml # sentence-transformers, onnxruntime, scikit-learn
- uv sync --extra web # playwright
- uv sync --extra pyexcel # pyexcel
- uv sync --extra full # all optional deps
+ uv sync --extra rtcompat # polars[rtcompat]
  ```
 
  ## Development Workflow
{tablassert-7.3.2 → tablassert-7.3.4}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: tablassert
- Version: 7.3.2
+ Version: 7.3.4
  Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
  Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
  Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -99,13 +99,13 @@ docker run --rm \
  # Build a knowledge graph from a YAML configuration
  $ tablassert build-knowledge-graph graph-config.yaml
  ⠋ Loading table configurations...
- ⠋ Resolving entities across 16 DuckDB shards...
+ ⠋ Resolving entities across 10 DuckDB shards...
  ⠋ Compiling subgraphs...
  ⠋ Deduplicating nodes and edges...
- Done — wrote nodes.ndjson and edges.ndjson to .storassert/
+ Finished — wrote MY_GRAPH_1.0.0.nodes.ndjson and MY_GRAPH_1.0.0.edges.ndjson
  ```
 
- Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
+ Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
 
  ## Key Features
 
{tablassert-7.3.2 → tablassert-7.3.4}/README.md
@@ -47,13 +47,13 @@ docker run --rm \
  # Build a knowledge graph from a YAML configuration
  $ tablassert build-knowledge-graph graph-config.yaml
  ⠋ Loading table configurations...
- ⠋ Resolving entities across 16 DuckDB shards...
+ ⠋ Resolving entities across 10 DuckDB shards...
  ⠋ Compiling subgraphs...
  ⠋ Deduplicating nodes and edges...
- Done — wrote nodes.ndjson and edges.ndjson to .storassert/
+ Finished — wrote MY_GRAPH_1.0.0.nodes.ndjson and MY_GRAPH_1.0.0.edges.ndjson
  ```
 
- Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
+ Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
 
  ## Key Features
 
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/fullmap.md
@@ -36,7 +36,7 @@ Column name containing text strings to resolve.
 
  **`conns: list[object]`**
 
- List of 16 DuckDB shard connections to the datassert database.
+ List of 10 DuckDB shard connections to the datassert database.
 
  Each shard contains:
  - Synonym mappings (text → CURIE)
@@ -97,7 +97,7 @@ Returns a Polars LazyFrame with these columns added:
  | `{col} taxon` | NCBI Taxon ID | `"NCBITaxon:9606"` |
  | `{col} source` | Source database | `"HGNC"` |
  | `{col} source version` | Database version | `"2025-01"` |
- | `{col} nlp level` | NLP processing level | `0` or `1` |
+ | `{col} nlp level` | NLP processing level | `1` or `2` |
 
  ### DuckDB Query
 
@@ -125,11 +125,11 @@ from tablassert.enums import Categories
  import duckdb
  import polars as pl
 
- # Open all 16 shard connections
+ # Open all 10 shard connections
  datassert_dir = "/path/to/datassert"
  conns = [
      duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
+     for i in range(10)
  ]
 
  # LazyFrame with data to resolve
@@ -167,11 +167,11 @@ from tablassert.fullmap import resolve
  from tablassert.nlp import level_one, level_two
  from tablassert.enums import Categories
 
- # Open all 16 shard connections
+ # Open all 10 shard connections
  datassert_dir = "/path/to/datassert"
  conns = [
      duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
+     for i in range(10)
  ]
 
  # Map a list of gene symbols to CURIEs
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/lib.md
@@ -2,11 +2,11 @@
 
  The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
 
- It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 16 DuckDB shard connections, executing entity resolution, and returning results as a plain Python dictionary.
+ It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 10 DuckDB shard connections, executing entity resolution, and returning results as a plain Python list of row dictionaries.
 
  ## resolve_many()
 
- Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a dictionary of lists.
+ Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a list of row dictionaries.
 
  ### Function Signature
 
@@ -26,9 +26,9 @@ def resolve_many(
 
  **`col: str`**
 
- Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in the returned dictionary.
+ Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in each returned row dictionary.
 
- For example, if `col="gene"`, the returned dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
+ For example, if `col="gene"`, each returned row dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
 
  **`entities: Iterable[str]`**
 
@@ -38,7 +38,7 @@ Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generator
 
  **`datassert: Path`**
 
- Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 16 DuckDB shard files (`0.duckdb` through `15.duckdb`).
+ Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 10 DuckDB shard files (`0.duckdb` through `9.duckdb`).
 
  Each shard contains:
  - Synonym mappings (text → CURIE)
@@ -86,7 +86,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `col` argument):
  | `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
  | `{col} source` | Source database | `"HGNC"` |
  | `{col} source version` | Database version | `"2025-01"` |
- | `{col} nlp level` | NLP processing level used for match | `0` or `1` |
+ | `{col} nlp level` | NLP processing level used for match | `1` or `2` |
 
  **Important:** Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by `resolve()`. The returned list may therefore be shorter than the input iterable.
 
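To make the new return shape concrete, here is a minimal usage sketch (a hypothetical call, assuming `resolve_many` is importable from `tablassert.lib` as this page states and that arguments are accepted by keyword):

```python
from pathlib import Path

from tablassert.lib import resolve_many

rows = resolve_many(
    col="gene",
    entities=["TP53", "BRCA1", "not-a-real-symbol"],
    datassert=Path("/path/to/datassert"),
)

# Unresolved inputs are filtered out, so len(rows) may be smaller than 3 here
for row in rows:
    print(row["gene"], row["gene name"], row["gene nlp level"])
```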
@@ -98,7 +98,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `col` argument):
 
  2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
 
- 3. **DuckDB connection management** — Opens all 16 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
+ 3. **DuckDB connection management** — Opens all 10 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
 
  4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
 
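A plain-Python sketch of the two documented normalization levels (illustrative only: the real implementations live in `src/tablassert/nlp.py`, and whether `\W+` runs are deleted or collapsed to spaces there is an assumption here):

```python
import re

def level_one_sketch(s: str) -> str:
    # Documented behavior: whitespace stripping + lowercasing
    return s.strip().lower()

def level_two_sketch(s: str) -> str:
    # Documented behavior: non-word character removal via \W+
    # (assumed here to collapse runs to a single space)
    return re.sub(r"\W+", " ", level_one_sketch(s)).strip()

level_one_sketch("  Alpha-Synuclein ")  # "alpha-synuclein"  (nlp level 1)
level_two_sketch("  Alpha-Synuclein ")  # "alpha synuclein"  (nlp level 2)
```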
@@ -224,8 +224,8 @@ Both levels are queried during resolution. Level one (exact case-insensitive match
  ### Error Handling
 
  - If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
- - If `entities` is empty, the function returns a dictionary with empty lists for all output columns.
- - The `ExitStack` ensures all 16 DuckDB connections are closed even if resolution raises an exception.
+ - If `entities` is empty, the function returns `[]`.
+ - The `ExitStack` ensures all 10 DuckDB connections are closed even if resolution raises an exception.
  - Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
 
  ## Integration
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/qc.md
@@ -82,7 +82,7 @@ Two fuzzy matching algorithms:
  1. **Ratio:** Overall string similarity
  2. **Partial token sort ratio:** Combined token/subsequence matching
 
- **Threshold:** Default 20% similarity (configurable)
+ **Threshold:** 20% similarity
 
  ```python
  fuzz.ratio(original, preferred) >= 20
@@ -125,7 +125,7 @@ return similarity >= 0.2
  - Graph optimization level: ALL
  - ONNX session caching
 
- Lazy-loaded on first `BERT_audit()` call, then reused for subsequent calls.
+ Lazy-loaded on first `fullmap_audit()` call that reaches the embedding stage, then reused for subsequent calls.
 
  ### Model Caching
 
@@ -135,7 +135,7 @@ BioBERT is lazy-loaded on first use and cached globally for the lifetime of the process.
  # ? Lazy-loads BioBERT once on first batch audit call, then caches globally
  ```
 
- **Cache location:** In-memory (global model cache)
+ **Cache location:** Downloaded model files are cached on disk in `.onnxassert/`, and the loaded model object is cached in memory for the lifetime of the process.
 
  **Cache strategy:** BioBERT model loaded once on first batch audit, then reused globally
 
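The load-once, reuse-globally pattern described above can be sketched with a memoized loader (an illustration, not tablassert's actual API; the function name and model filename are hypothetical, though `.onnxassert/` is the documented cache directory):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_biobert_session():
    # First call pays the model-load cost; every later call returns the
    # same cached session object for the lifetime of the process
    import onnxruntime as ort
    return ort.InferenceSession(".onnxassert/biobert.onnx")  # hypothetical filename
```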
tablassert-7.3.4/docs/changelog.md (new file)
@@ -0,0 +1,13 @@
+ # Changelog
+
+ The canonical release history lives in the repository root at [`CHANGELOG.md`](https://github.com/SkyeAv/Tablassert/blob/main/CHANGELOG.md).
+
+ ## Current Release Notes
+
+ ### 7.3.4 - 2026-04-28
+
+ - `downloader.from_url()` now handles URLs that respond with an immediate download instead of a navigable page — the Playwright session uses a download-aware browser context and tolerates the expected `net::ERR_ABORTED` navigation error.
+ - Table configuration docs now describe `miscellaneous notes` as a freetext catch-all annotation for source context that doesn't map cleanly to a structured field.
+ - Regex transform documentation now spells out the Polars `str.replace_all()` constraints — no capturing groups or lookarounds — so authors know to chain simple substitutions or fall back to `miscellaneous notes`.
+
+ For older releases and the full project history, open the root `CHANGELOG.md` in the repository.
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/advanced-example.md
@@ -66,7 +66,7 @@ template:
  # Provenance: Publication and curation info
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -103,12 +103,16 @@ template:
      method: value
      encoding: Spearman correlation
 
-   # Descriptive note
+   # Freetext catch-all — anything that doesn't map cleanly to a structured
+   # annotation (study design caveats, non-standard units, qualitative
+   # observations) belongs here rather than being dropped.
    - annotation: miscellaneous notes
      method: value
      encoding: Correlation analysis between microbial composition and 13C-tamoxifen abundance after FDR correction
  ```
 
+ > **`miscellaneous notes` is a freetext escape hatch.** Use it whenever the source carries context you can't otherwise cleanly encode — assay variants, post-hoc qualifiers, "values are log-transformed", etc. It accepts `method: value` for a constant note across the whole table or `method: column` to pull per-row notes from the source.
+
  ## Key Techniques
 
  ### Excel Column References
@@ -143,6 +147,8 @@ The subject field uses three regex transformations in sequence:
  ```
  `"Lactobacillus sp"` → `"Lactobacillus sp. "`
 
+ > **Regex constraint:** Each `pattern` is handed to Polars `str.replace_all()` (Rust `regex` crate). **Capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not allowed** and will fail validation. Express transformations as a sequence of simple anchored / character-class substitutions instead — the pipeline above is a deliberate three-step chain because no single capturing-group pattern is permitted. If the transformation can't be expressed without those features, capture the leftover context in a `miscellaneous notes` annotation rather than fighting the regex engine.
+
  ### Taxonomic Filtering
 
  Prevent incorrect entity resolution:
@@ -297,7 +303,7 @@ template:
 
  provenance:
    repo: PMC
-   publication: PMC12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -358,7 +364,7 @@ template:
 
  provenance:
    repo: PMC
-   publication: PMC87654321
+   publication: 87654321
    contributors:
      - kind: curation
        name: Skye Lane Goetz
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/graph.md
@@ -60,12 +60,14 @@ See [Table Configuration](table.md) for details.
 
  **`datassert: path`**
 
- Path to the datassert directory for entity resolution. Tablassert opens 16 shard files from `datassert/data/{0..15}.duckdb`. This database contains:
+ Path to the [datassert](../datassert.md) directory for entity resolution. Tablassert opens 10 shard files from `datassert/data/{0..9}.duckdb`. This database contains:
  - Synonym mappings (text → CURIE)
  - Biolink categories
  - Taxonomic information
  - Source provenance (which database provided the mapping)
 
+ See [Datassert](../datassert.md) for installation, build commands, and database schema.
+
  **`pubmed_db: path`**
 
  Optional path to SQLite database with PubMed metadata:
@@ -165,4 +167,5 @@ This processes a single table configuration (ALAMV6.yaml) into a knowledge graph.
  ## Next Steps
 
  - **[Table Configuration](table.md)** - Learn how to define table transformations
+ - **[Datassert](../datassert.md)** - Entity-resolution database installation and build
  - **[Tutorial](../tutorial.md)** - Complete example walkthrough
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/table.md
@@ -100,7 +100,7 @@ template:
  ```yaml
  template:
    source: {kind: excel, local: data.xlsx}
-   provenance: {publication: PMC123}
+   provenance: {repo: PMC, publication: 123}
 
  sections:
    - statement: {predicate: treats}
@@ -111,7 +111,7 @@ sections:
  ```yaml
  template:
    source: {kind: text, local: data.csv}
-   provenance: {publication: PMID456}
+   provenance: {repo: PMID, publication: 456}
  statement:
    subject: {encoding: gene_symbol}
 
@@ -330,6 +330,8 @@ subject:
 
  Executed in order.
 
+ > **Regex dialect:** Patterns are passed directly to Polars `str.replace_all()`, which uses the Rust [`regex`](https://docs.rs/regex/) crate. Only features supported by that engine work — in particular, **capturing groups (`(...)`, `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported** and will raise an error at parse time. Stick to character classes, anchors (`^`, `$`), quantifiers, alternation (`a|b`), and non-capturing groups (`(?:...)`) if grouping is needed. If a transformation is too complex to express, prefer chaining several simple substitutions or capturing the residual context in a `miscellaneous notes` annotation instead.
+
  **`remove: list[string]`** - Filter out specific strings
 
  ```yaml
@@ -339,6 +341,8 @@ subject:
  - "^NA " # Remove rows starting with "NA "
  ```
 
+ The same regex constraints apply as for the `regex` field — Polars-compatible patterns only, no capturing groups or lookarounds.
+
  **`prefix` / `suffix`** - Add text
 
  ```yaml
@@ -416,7 +420,7 @@ Required metadata about data source.
  | Field | Type | Required | Description |
  |-------|------|----------|-------------|
  | `repo` | String | Yes | Repository: `"PMC"`, `"PMID"` |
- | `publication` | String | Yes | Identifier (e.g., `"PMC11708054"`, `"PMID123"`) |
+ | `publication` | String | Yes | Repository-local identifier appended to `repo:` (e.g., `"11708054"`, `"123"`) |
  | `contributors` | List[Contributor] | Yes | Curation information |
 
  **Contributor fields:**
@@ -433,7 +437,7 @@ Required metadata about data source.
  ```yaml
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -467,8 +471,16 @@ annotations:
  - annotation: multiple testing correction method
    method: value
    encoding: "Benjamini Hochberg"
+
+ # Freetext catch-all for context that doesn't fit a structured field —
+ # study caveats, units, post-hoc notes, anything you'd otherwise lose.
+ - annotation: miscellaneous notes
+   method: value
+   encoding: "Values are log2 fold-change relative to vehicle control; n=3 biological replicates per arm"
  ```
 
+ > **Tip:** When source data carries information that can't be cleanly mapped to a structured annotation (assay-specific caveats, non-standard units, qualitative observations), add a `miscellaneous notes` annotation rather than forcing it into another field or dropping it. It accepts both `method: value` (one note for the whole table) and `method: column` (per-row notes from the source).
+
  ## Complete Example
 
  Minimal table configuration:
@@ -498,7 +510,7 @@ template:
 
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Example User
tablassert-7.3.4/docs/datassert.md (new file)
@@ -0,0 +1,106 @@
+ # Datassert
+
+ Datassert is a high-performance CLI for building a DuckDB-backed assertion store from NCATS Translator BABEL export files, with a focus on fast local builds and simple command-driven workflows. It produces the entity-resolution database used by Tablassert, containing biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
+
+ ## Installation
+
+ ```bash
+ # Install CLI from GitHub
+ go install github.com/SkyeAv/datassert@latest
+
+ # Verify install
+ datassert --help
+ ```
+
+ ## Build Command
+
+ ```bash
+ # Build a Datassert database (downloads BABEL data automatically)
+ datassert build
+ ```
+
+ The build command automatically downloads BABEL exports from RENCI (`https://stars.renci.org/var/babel_outputs`), processes them, and produces sharded DuckDB databases.
+
+ ### Flags
+
+ | Flag | Required | Default | Description |
+ |------|----------|---------|-------------|
+ | `--skip-downloads` / `-s` | No | `false` | Skip the BABEL download phase (use previously downloaded files) |
+ | `--use-existing-parquets` / `-p` | No | `false` | Use existing Parquet files to rebuild DuckDB databases |
+
+ ### Data Pipeline
+
+ 1. **Download** — BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under `./datassert/downloads/`.
+ 2. **Lookup** — Class files (`*.ndjson.lz4`) are read to build an in-memory equivalent-identifier lookup.
+ 3. **Parquet Staging** — Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to `./datassert/parquets/`.
+ 4. **DuckDB Generation** — Parquet files are loaded into 10 sharded DuckDB databases under `./datassert/data/`.
+
+ ### Examples
+
+ ```bash
+ # Full build (download, process, and generate databases)
+ datassert build
+
+ # Skip downloads if BABEL files were already fetched
+ datassert build --skip-downloads
+
+ # Rebuild DuckDB databases from existing Parquet files
+ datassert build --use-existing-parquets
+ ```
+
+ ### Runtime Behavior
+
+ - Displays progress bars for download, class lookup, synonym processing, and DuckDB build phases.
+ - Uses 90% of available CPUs for concurrent processing.
+ - Downloads are retried up to 3 times on failure with a 10-second backoff.
+ - All working files are stored under `./datassert/`.
+
+ ## Output Artifacts
+
+ - 10 sharded DuckDB databases are written to `./datassert/data/{0..9}.duckdb`.
+ - Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, deduplicated, sorted, and indexed for query performance.
+ - Staging Parquet files are written to `./datassert/parquets/{0..9}/`.
+
+ Terms are routed to shards deterministically via `xxhash64(term) % 10`, so a given string always hits the same shard.
+
+ ### Schema
+
+ Each shard contains four tables:
+
+ | Table | Key Columns | Description |
+ |-------|-------------|-------------|
+ | `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
+ | `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
+ | `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
+ | `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
+
+ ## Usage in Graph Config
+
+ The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 10 shards at startup and passes the connections to `resolve()`.
+
+ ```yaml
+ # graph-config.yaml (GC2)
+ syntax: GC2
+ name: my-graph
+ version: "1.0"
+ datassert: /path/to/datassert/ # directory containing data/0..9.duckdb
+ tables:
+   - ./TABLE/my-table.yaml
+ ```
+
+ ## Programmatic Usage
+
+ When calling `resolve()` directly, open the shard connections yourself:
+
+ ```python
+ import duckdb
+ from tablassert.fullmap import resolve
+
+ datassert_dir = "/path/to/datassert"
+ conns = [
+     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
+     for i in range(10)
+ ]
+ ```
+
+ See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.
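The routing rule stated in this new page is easy to restate in plain Python (an illustration: tablassert computes the hash in Polars via `polars_hash`, so reproducing its exact shard assignment depends on matching seeds and encodings):

```python
import xxhash

SHARDS = 10

def shard_for(term: str) -> int:
    # Deterministic: the same term string always maps to the same shard index
    return xxhash.xxh64(term.encode("utf-8")).intdigest() % SHARDS

shard = shard_for("tp53")  # query ./datassert/data/{shard}.duckdb for this term
```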
{tablassert-7.3.2 → tablassert-7.3.4}/docs/docker.md
@@ -81,8 +81,8 @@ docker run --rm \
 
  - **Datassert path** — The graph configuration YAML specifies the `datassert` path for the entity-resolution database. Ensure it is accessible inside the container.
  - **Multiprocessing** — `src/tablassert/cli.py:63` uses `multiprocessing.Pool` for parallel table loading and section extraction.
- - **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 16 Datassert DuckDB shards concurrently.
- - **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 16 DuckDB shards (`SHARDS = 16`) using xxhash64.
+ - **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 10 Datassert DuckDB shards concurrently.
+ - **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 10 DuckDB shards (`SHARDS = 10`) using xxhash64.
  - **Text normalization** — `src/tablassert/nlp.py` provides `level_one` (strip + lowercase) and `level_two` (regex-based cleanup).
 
  ## CI/CD Integration
{tablassert-7.3.2 → tablassert-7.3.4}/docs/examples.md
@@ -33,7 +33,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Your Name
@@ -85,7 +85,7 @@ template:
    taxon: 9606
  provenance:
    repo: PMID
-   publication: PMID98765432
+   publication: 98765432
    contributors:
      - kind: curation
        name: Your Name
@@ -146,7 +146,7 @@ template:
    encoding: CHEBI:41774
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Your Name
@@ -161,11 +161,15 @@ template:
    - annotation: assertion method
      method: value
      encoding: "Spearman correlation"
+   # Freetext catch-all for context that doesn't fit a structured field.
+   - annotation: miscellaneous notes
+     method: value
+     encoding: "FDR-corrected; samples pooled across two cohorts"
  ```
 
  **Key techniques:**
 
- - **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`)
+ - **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`). Patterns must be Polars `str.replace_all()`-compatible — no capturing groups (`(...)` / `\1`) and no lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`). Chain several simple substitutions instead (see the sketch after this hunk).
  - **Avoid list** (`avoid: [Gene]`) prevents organism names from resolving to gene entities
  - **Fixed-value object** (`method: value`) assigns the same metabolite CURIE to all rows
  - **Excel source** with sheet name and row slicing
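A minimal Polars sketch of the chained-substitution approach that bullet prescribes (hypothetical patterns, shown outside tablassert's YAML layer to make the `str.replace_all()` constraint concrete):

```python
import polars as pl

df = pl.DataFrame({"taxon": ["d__Bacteria;p__Firmicutes;g__Lactobacillus"]})

# No capturing groups or lookarounds: chain simple substitutions instead,
# each one a pattern the Rust regex engine accepts
df = df.with_columns(
    pl.col("taxon")
    .str.replace_all(r"^.*;g__", "")  # drop everything up to the genus tag
    .str.replace_all(r"_+", " ")      # normalize any leftover underscores
    .alias("genus")
)
print(df["genus"][0])  # "Lactobacillus"
```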
@@ -199,7 +203,7 @@ template:
    encoding: PLACEHOLDER
  provenance:
    repo: PMID
-   publication: PMID11223344
+   publication: 11223344
    contributors:
      - kind: curation
        name: Your Name
@@ -277,7 +281,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID55667788
+   publication: 55667788
    contributors:
      - kind: curation
        name: Your Name
@@ -330,7 +334,7 @@ template:
      - ChemicalEntity
  provenance:
    repo: PMID
-   publication: PMID99887766
+   publication: 99887766
    contributors:
      - kind: curation
        name: Your Name
{tablassert-7.3.2 → tablassert-7.3.4}/docs/tutorial.md
@@ -68,7 +68,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Tutorial Example
{tablassert-7.3.2 → tablassert-7.3.4}/llms.txt
@@ -4,7 +4,7 @@
  This file is for two audiences: (1) YAML configuration authors and (2) package contributors.
  When source code and prose docs disagree, treat `src/tablassert/models.py` and `src/tablassert/cli.py` as the current authority.
  If you encounter older configurations, migrate `dbssert` to `datassert` (directory path).
- Current CLI behavior opens shard files at `datassert/data/{0..15}.duckdb`.
+ Current CLI behavior opens shard files at `datassert/data/{0..9}.duckdb`.
 
  ## Quickstart
  - [README](README.md): high-level overview, install snippets, and one-command graph build.
{tablassert-7.3.2 → tablassert-7.3.4}/mkdocs.yml
@@ -17,4 +17,4 @@ nav:
    - Batch Resolution: api/lib.md
    - Quality Control: api/qc.md
    - Utilities: api/utils.md
-   - Changelog: ../CHANGELOG.md
+   - Changelog: changelog.md
{tablassert-7.3.2 → tablassert-7.3.4}/pyproject.toml
@@ -1,6 +1,6 @@
  [project]
  name = "tablassert"
- version = "7.3.2"
+ version = "7.3.4"
  description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
  authors = [
      { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
{tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/downloader.py
@@ -29,13 +29,24 @@ def from_url(website: str, p: Path, timeout: int = 60_000, retries: int = 3) -> Path:
      try:
          with sync_playwright() as pw:
              browser = pw.chromium.launch(headless=True)
-             page = browser.new_page()
-             page.goto(website, wait_until="networkidle", timeout=timeout)
+             context = browser.new_context(accept_downloads=True)
+
+             page = context.new_page()
              with page.expect_download(timeout=timeout) as info:
-                 download = info.value
-                 download.save_as(p)
+                 try:
+                     page.goto(website, wait_until="load", timeout=timeout)
+                 except Exception as e:
+                     if "net::ERR_ABORTED" not in str(e):
+                         raise
+
+             download = info.value
+             download.save_as(p)
+
+             context.close()
              browser.close()
+
          return p
+
      except Exception as e:
          last = e
          if attempt < retries - 1:
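A condensed, self-contained sketch of the new download flow (reconstructed from the hunk above; the function name here is hypothetical, and the real `from_url()` wraps this body in a retry loop):

```python
from pathlib import Path

from playwright.sync_api import sync_playwright

def fetch_download(website: str, p: Path, timeout: int = 60_000) -> Path:
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context(accept_downloads=True)  # downloads allowed
        page = context.new_page()
        with page.expect_download(timeout=timeout) as info:
            try:
                page.goto(website, wait_until="load", timeout=timeout)
            except Exception as e:
                # A direct download aborts navigation; Playwright surfaces this
                # as net::ERR_ABORTED, which is expected and safe to swallow
                if "net::ERR_ABORTED" not in str(e):
                    raise
        info.value.save_as(p)  # persist the captured download to disk
        context.close()
        browser.close()
    return p
```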
{tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/fullmap.py
@@ -16,7 +16,7 @@ else:
      plh = Lazy.load("polars_hash")
 
 
- SHARDS: int = 16
+ SHARDS: int = 10
 
 
  def empty_matches(column_context: bool) -> pl.DataFrame:
@@ -39,15 +39,15 @@ def empty_matches(column_context: bool) -> pl.DataFrame:
      return pl.DataFrame(schema=schema)  # pyright: ignore
 
 
- def distinct(lf: pl.LazyFrame, l0: str, l1: str, col: str = "term") -> pl.LazyFrame:
+ def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFrame:
      # ? Extract Unique Terms From Two Text Normalization Columns As LazyFrame
-     t0: pl.LazyFrame = lf.select(pl.col(l0).alias(col)).unique()
-     t0 = t0.with_columns(pl.lit(0).alias("nlp level"))
-
      t1: pl.LazyFrame = lf.select(pl.col(l1).alias(col)).unique()
      t1 = t1.with_columns(pl.lit(1).alias("nlp level"))
 
-     terms: pl.LazyFrame = pl.concat([t0, t1]).unique(subset=[col], keep="first")
+     t2: pl.LazyFrame = lf.select(pl.col(l2).alias(col)).unique()
+     t2 = t2.with_columns(pl.lit(2).alias("nlp level"))
+
+     terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
 
      bad: str = r"^\d+$|^(none|nan|na|null|unknown)$|^$"
      terms = terms.filter(~pl.col(col).str.contains(bad))
@@ -172,10 +172,10 @@ def resolve(
      tag: str = " two",
  ) -> pl.LazyFrame:
      # ? Case Dependent, Provenance Rich Named Entity Recognition
-     l0: str = col
-     l1: str = add(l0, tag)
+     l1: str = col
+     l2: str = add(l1, tag)
 
-     terms: pl.LazyFrame = distinct(lf, l0, l1)
+     terms: pl.LazyFrame = distinct(lf, l1, l2)
      matches: pl.DataFrame = query_distinct(terms, conns, taxon, prioritize, avoid, column_context)
 
      if log:
@@ -184,45 +184,45 @@ def resolve(
      # ! Collection Point: Join After DuckDB Query, Then Re-Lazy
      df: pl.DataFrame = lf.collect()
      result: pl.DataFrame = df.join(
-         matches.filter(pl.col("NLP_LEVEL").eq(0)), left_on=l0, right_on="term", how="left", suffix=" l0"
+         matches.filter(pl.col("NLP_LEVEL").eq(1)), left_on=l1, right_on="term", how="left", suffix=" l1"
      )
 
-     l1_matches: pl.DataFrame = matches.filter(pl.col("NLP_LEVEL").eq(1))
-     result = result.join(l1_matches, left_on=l1, right_on="term", how="left", suffix=" l1")
+     l2_matches: pl.DataFrame = matches.filter(pl.col("NLP_LEVEL").eq(2))
+     result = result.join(l2_matches, left_on=l2, right_on="term", how="left", suffix=" l2")
 
      result = result.with_columns(
          [
-             pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE l1")).alias(col),
+             pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE l2")).alias(col),
              pl.when(pl.col("PREFERRED_NAME").is_not_null())
              .then(pl.col("PREFERRED_NAME"))
-             .otherwise(pl.col("PREFERRED_NAME l1"))
+             .otherwise(pl.col("PREFERRED_NAME l2"))
              .alias(add(col, " name")),
              pl.when(pl.col("CATEGORY_NAME").is_not_null())
              .then(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME")))
-             .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME l1")))
+             .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME l2")))
              .alias(add(col, " category")),
              pl.when(pl.col("TAXON_ID").is_not_null())
              .then(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID").cast(pl.String)))
-             .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID l1").cast(pl.String)))
+             .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID l2").cast(pl.String)))
              .alias(add(col, " taxon")),
              pl.when(pl.col("SOURCE_NAME").is_not_null())
              .then(pl.col("SOURCE_NAME"))
-             .otherwise(pl.col("SOURCE_NAME l1"))
+             .otherwise(pl.col("SOURCE_NAME l2"))
              .alias(add(col, " source")),
              pl.when(pl.col("SOURCE_VERSION").is_not_null())
              .then(pl.col("SOURCE_VERSION"))
-             .otherwise(pl.col("SOURCE_VERSION l1"))
+             .otherwise(pl.col("SOURCE_VERSION l2"))
              .alias(add(col, " source version")),
              pl.when(pl.col("NLP_LEVEL").is_not_null())
              .then(pl.col("NLP_LEVEL"))
-             .otherwise(pl.col("NLP_LEVEL l1"))
+             .otherwise(pl.col("NLP_LEVEL l2"))
              .alias(add(col, " nlp level")),
          ]
      )
 
      result = result.select(
          pl.exclude(
-             r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)( l1)?$"
+             r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)( l2)?$"
          )
      )
      result = result.select(pl.exclude(add(col, " two")))
{tablassert-7.3.2 → tablassert-7.3.4}/uv.lock
@@ -2211,7 +2211,7 @@ wheels = [
 
  [[package]]
  name = "tablassert"
- version = "7.3.0"
+ version = "7.3.4"
  source = { editable = "." }
  dependencies = [
      { name = "duckdb" },
@@ -2276,7 +2276,7 @@ requires-dist = [
      { name = "sentence-transformers", specifier = ">=5.3.0" },
      { name = "sqlite-utils", specifier = ">=3.39" },
      { name = "tablassert", extras = ["rtcompat"], marker = "extra == 'rt'" },
-     { name = "typer", specifier = ">=0.24.1" },
+     { name = "typer", specifier = ">=0.21.2" },
      { name = "xxhash", specifier = ">=3.6.0" },
  ]
  provides-extras = ["rtcompat", "rt"]
tablassert-7.3.2/docs/datassert.md (removed; superseded by the new docs/datassert.md above)
@@ -1,66 +0,0 @@
- # Datassert
-
- Datassert is the entity-resolution database used by Tablassert. It contains biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
-
- ## Installation
-
- ```bash
- git clone https://github.com/SkyeAv/datassert
- ```
-
- ## Structure
-
- Datassert is split into 16 DuckDB shard files for parallel querying:
-
- ```
- datassert/
-   data/
-     0.duckdb
-     1.duckdb
-     ...
-     15.duckdb
- ```
-
- Terms are routed to shards deterministically via `xxhash64(term) % 16`, so a given string always hits the same shard.
-
- ### Schema
-
- Each shard contains four tables:
-
- | Table | Key Columns | Description |
- |-------|-------------|-------------|
- | `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
- | `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
- | `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
- | `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
-
- ## Usage in Graph Config
-
- The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 16 shards at startup and passes the connections to `resolve()`.
-
- ```yaml
- # graph-config.yaml (GC2)
- syntax: GC2
- name: my-graph
- version: "1.0"
- datassert: /path/to/datassert/ # directory containing data/0..15.duckdb
- tables:
-   - ./TABLE/my-table.yaml
- ```
-
- ## Programmatic Usage
-
- When calling `resolve()` directly, open the shard connections yourself:
-
- ```python
- import duckdb
- from tablassert.fullmap import resolve
-
- datassert_dir = "/path/to/datassert"
- conns = [
-     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
- ]
- ```
-
- See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.