tablassert 7.3.1.tar.gz → 7.3.3.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tablassert-7.3.1 → tablassert-7.3.3}/AGENTS.md +1 -1
- {tablassert-7.3.1 → tablassert-7.3.3}/CHANGELOG.md +13 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/CONTRIBUTING.md +3 -6
- {tablassert-7.3.1 → tablassert-7.3.3}/PKG-INFO +3 -3
- {tablassert-7.3.1 → tablassert-7.3.3}/README.md +1 -1
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/api/fullmap.md +5 -5
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/api/lib.md +4 -4
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/configuration/graph.md +4 -1
- tablassert-7.3.3/docs/datassert.md +106 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/docker.md +2 -2
- {tablassert-7.3.1 → tablassert-7.3.3}/llms.txt +1 -1
- {tablassert-7.3.1 → tablassert-7.3.3}/pyproject.toml +2 -2
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/fullmap.py +20 -20
- {tablassert-7.3.1 → tablassert-7.3.3}/uv.lock +2 -2
- tablassert-7.3.1/docs/datassert.md +0 -66
- {tablassert-7.3.1 → tablassert-7.3.3}/.github/workflows/autotag.yml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/.github/workflows/docker.yml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/.github/workflows/docs.yml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/.github/workflows/pipy.yml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/.gitignore +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/.pre-commit-config.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/CITATION.cff +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/Dockerfile +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/LICENSE +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/api/qc.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/api/utils.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/cli.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/configuration/advanced-example.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/configuration/table.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/examples/tutorial-data.csv +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/examples/tutorial-graph.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/examples/tutorial-table.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/examples.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/index.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/installation.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/docs/tutorial.md +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/mkdocs.yml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/__init__.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/cli.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/downloader.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/enums.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/ingests.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/lib.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/log.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/models.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/nlp.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/qc.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/src/tablassert/utils.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/__init__.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/conftest.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/fixtures/minimal_section.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_enums.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_fullmap.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_ingests.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_lib.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_models.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_nlp.py +0 -0
- {tablassert-7.3.1 → tablassert-7.3.3}/tests/test_utils.py +0 -0

**AGENTS.md**

````diff
@@ -35,7 +35,7 @@ src/tablassert/
   lib.py      # Core logic: encodings, data loading, Tcode(Section) class
   models.py   # Pydantic v2 models (TablaBase base class)
   enums.py    # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
-  fullmap.py  # NER / entity resolution (DuckDB, 16 shards)
+  fullmap.py  # NER / entity resolution (DuckDB, 12 shards)
   qc.py       # Quality control (ONNX/BioBERT, sentence_transformers)
   nlp.py      # Text normalization (level_one: strip+lowercase, level_two: regex)
   ingests.py  # YAML ingestion: from_yaml(), to_sections(), fastmerge()
````

**CHANGELOG.md**

````diff
@@ -2,6 +2,19 @@
 
 All notable changes to this project are documented in this file.
 
+## 7.3.3 - 2026-04-08
+
+### Bug Fixes
+
+- Changed datassert shard count from 16 to 12 (`SHARDS` constant in `fullmap.py`) to correspond to the updated datassert database layout.
+
+### Documentation
+
+- Updated all shard count references across documentation and examples to reflect the new 12-shard datassert layout.
+
+## 7.3.2 - 2026-04-03
+
+### Maintenance
+
+- Updated dependencies. No API changes.
+
 ## 7.3.1 - 2026-04-03
 
 ### Changes
````

**CONTRIBUTING.md**

````diff
@@ -20,15 +20,12 @@ cd Tablassert
 uv sync
 ```
 
-### Optional
+### Optional Extras
 
-
+All ML, web, and Excel dependencies are included in the core install. The only optional extra is a runtime-compatible Polars build for CPUs without required instructions:
 
 ```bash
-uv sync --extra
-uv sync --extra web # playwright
-uv sync --extra pyexcel # pyexcel
-uv sync --extra full # all optional deps
+uv sync --extra rtcompat # polars[rtcompat]
 ```
 
 ## Development Workflow
````

**PKG-INFO**

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.3.1
+Version: 7.3.3
 Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -42,7 +42,7 @@ Requires-Dist: rapidfuzz>=3.14.3
 Requires-Dist: scikit-learn>=1.8.0
 Requires-Dist: sentence-transformers>=5.3.0
 Requires-Dist: sqlite-utils>=3.39
-Requires-Dist: typer>=0.
+Requires-Dist: typer>=0.21.2
 Requires-Dist: xxhash>=3.6.0
 Provides-Extra: rt
 Requires-Dist: polars[rtcompat]>=1.39.0; extra == 'rt'
@@ -99,7 +99,7 @@ docker run --rm \
 # Build a knowledge graph from a YAML configuration
 $ tablassert build-knowledge-graph graph-config.yaml
 ⠋ Loading table configurations...
-⠋ Resolving entities across 16 DuckDB shards...
+⠋ Resolving entities across 12 DuckDB shards...
 ⠋ Compiling subgraphs...
 ⠋ Deduplicating nodes and edges...
 ✓ Done — wrote nodes.ndjson and edges.ndjson to .storassert/
````

**README.md**

````diff
@@ -47,7 +47,7 @@ docker run --rm \
 # Build a knowledge graph from a YAML configuration
 $ tablassert build-knowledge-graph graph-config.yaml
 ⠋ Loading table configurations...
-⠋ Resolving entities across 16 DuckDB shards...
+⠋ Resolving entities across 12 DuckDB shards...
 ⠋ Compiling subgraphs...
 ⠋ Deduplicating nodes and edges...
 ✓ Done — wrote nodes.ndjson and edges.ndjson to .storassert/
````

**docs/api/fullmap.md**

````diff
@@ -36,7 +36,7 @@ Column name containing text strings to resolve.
 
 **`conns: list[object]`**
 
-List of 16 DuckDB shard connections to the datassert database.
+List of 12 DuckDB shard connections to the datassert database.
 
 Each shard contains:
 - Synonym mappings (text → CURIE)
@@ -125,11 +125,11 @@ from tablassert.enums import Categories
 import duckdb
 import polars as pl
 
-# Open all 16 shard connections
+# Open all 12 shard connections
 datassert_dir = "/path/to/datassert"
 conns = [
     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-    for i in range(16)
+    for i in range(12)
 ]
 
 # LazyFrame with data to resolve
@@ -167,11 +167,11 @@ from tablassert.fullmap import resolve
 from tablassert.nlp import level_one, level_two
 from tablassert.enums import Categories
 
-# Open all 16 shard connections
+# Open all 12 shard connections
 datassert_dir = "/path/to/datassert"
 conns = [
     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-    for i in range(16)
+    for i in range(12)
 ]
 
 # Map a list of gene symbols to CURIEs
````
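The fullmap.md examples above import `level_one` and `level_two` from `tablassert.nlp`. Based only on how the docs describe them (strip + lowercase, then non-word cleanup via `\W+`), a standalone sketch of the two normalization levels might look like this; the exact replacement behavior of `\W+` (dropped versus collapsed to a space) is an assumption, and the `_sketch` names are illustrative, not the real API:

```python
import re

def level_one_sketch(s: str) -> str:
    # Level one per the docs: whitespace stripping + lowercasing.
    return s.strip().lower()

def level_two_sketch(s: str) -> str:
    # Level two per the docs: non-word character cleanup via \W+.
    # Collapsing matches to a single space is an assumption here.
    return re.sub(r"\W+", " ", level_one_sketch(s)).strip()

print(level_one_sketch("  Alpha-Linolenic Acid "))  # 'alpha-linolenic acid'
print(level_two_sketch("  Alpha-Linolenic Acid "))  # 'alpha linolenic acid'
```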

**docs/api/lib.md**

````diff
@@ -2,7 +2,7 @@
 
 The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
 
-It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 16 DuckDB shard connections, executing entity resolution, and returning results as a plain Python dictionary.
+It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 12 DuckDB shard connections, executing entity resolution, and returning results as a plain Python dictionary.
 
 ## resolve_many()
 
@@ -38,7 +38,7 @@ Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generato
 
 **`datassert: Path`**
 
-Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 16 DuckDB shard files (`0.duckdb` through `15.duckdb`).
+Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 12 DuckDB shard files (`0.duckdb` through `11.duckdb`).
 
 Each shard contains:
 - Synonym mappings (text → CURIE)
@@ -98,7 +98,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `
 
 2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
 
-3. **DuckDB connection management** — Opens all 16 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
+3. **DuckDB connection management** — Opens all 12 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
 
 4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
 
@@ -225,7 +225,7 @@ Both levels are queried during resolution. Level one (exact case-insensitive mat
 
 - If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
 - If `entities` is empty, the function returns a dictionary with empty lists for all output columns.
-- The `ExitStack` ensures all 16 DuckDB connections are closed even if resolution raises an exception.
+- The `ExitStack` ensures all 12 DuckDB connections are closed even if resolution raises an exception.
 - Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
 
 ## Integration
````
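Step 3 of the lib.md pipeline above describes opening all 12 shard connections inside a `contextlib.ExitStack`. A minimal sketch of that pattern, not the package's actual implementation (`with_shard_connections` is a hypothetical name), would look like:

```python
from contextlib import ExitStack
from pathlib import Path

import duckdb

def with_shard_connections(datassert: Path, shards: int = 12):
    # Open every shard read-only inside an ExitStack so all connections are
    # closed on normal exit *and* if an exception is raised mid-resolution.
    with ExitStack() as stack:
        conns = [
            stack.enter_context(
                duckdb.connect(str(datassert / "data" / f"{i}.duckdb"), read_only=True)
            )
            for i in range(shards)
        ]
        # ... hand `conns` to fullmap.resolve() here ...
        return [c.execute("SELECT 1").fetchone() for c in conns]  # placeholder work
```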

**docs/configuration/graph.md**

````diff
@@ -60,12 +60,14 @@ See [Table Configuration](table.md) for details.
 
 **`datassert: path`**
 
-Path to the datassert directory for entity resolution. Tablassert opens 16 shard files from `datassert/data/{0..15}.duckdb`. This database contains:
+Path to the [datassert](../datassert.md) directory for entity resolution. Tablassert opens 12 shard files from `datassert/data/{0..11}.duckdb`. This database contains:
 - Synonym mappings (text → CURIE)
 - Biolink categories
 - Taxonomic information
 - Source provenance (which database provided the mapping)
 
+See [Datassert](../datassert.md) for installation, build commands, and database schema.
+
 **`pubmed_db: path`**
 
 Optional path to SQLite database with PubMed metadata:
@@ -165,4 +167,5 @@ This processes a single table configuration (ALAMV6.yaml) into a knowledge graph
 ## Next Steps
 
 - **[Table Configuration](table.md)** - Learn how to define table transformations
+- **[Datassert](../datassert.md)** - Entity-resolution database installation and build
 - **[Tutorial](../tutorial.md)** - Complete example walkthrough
````

**tablassert-7.3.3/docs/datassert.md** (new file)

````diff
@@ -0,0 +1,106 @@
+# Datassert
+
+Datassert is a high-performance CLI for building a DuckDB-backed assertion store from NCATS Translator BABEL export files, with a focus on fast local builds and simple command-driven workflows. It produces the entity-resolution database used by Tablassert, containing biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
+
+## Installation
+
+```bash
+# Install CLI from GitHub
+go install github.com/SkyeAv/datassert@latest
+
+# Verify install
+datassert --help
+```
+
+## Build Command
+
+```bash
+# Build a Datassert database (downloads BABEL data automatically)
+datassert build
+```
+
+The build command automatically downloads BABEL exports from RENCI (`https://stars.renci.org/var/babel_outputs`), processes them, and produces sharded DuckDB databases.
+
+### Flags
+
+| Flag | Required | Default | Description |
+|------|----------|---------|-------------|
+| `--skip-downloads` / `-s` | No | `false` | Skip the BABEL download phase (use previously downloaded files) |
+| `--use-existing-parquets` / `-p` | No | `false` | Use existing Parquet files to rebuild DuckDB databases |
+
+### Data Pipeline
+
+1. **Download** — BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under `./datassert/downloads/`.
+2. **Lookup** — Class files (`*.ndjson.lz4`) are read to build an in-memory equivalent-identifier lookup.
+3. **Parquet Staging** — Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to `./datassert/parquets/`.
+4. **DuckDB Generation** — Parquet files are loaded into 12 sharded DuckDB databases under `./datassert/data/`.
+
+### Examples
+
+```bash
+# Full build (download, process, and generate databases)
+datassert build
+
+# Skip downloads if BABEL files were already fetched
+datassert build --skip-downloads
+
+# Rebuild DuckDB databases from existing Parquet files
+datassert build --use-existing-parquets
+```
+
+### Runtime Behavior
+
+- Displays progress bars for download, class lookup, synonym processing, and DuckDB build phases.
+- Uses 90% of available CPUs for concurrent processing.
+- Downloads are retried up to 3 times on failure with a 10-second backoff.
+- All working files are stored under `./datassert/`.
+
+## Output Artifacts
+
+- 12 sharded DuckDB databases are written to `./datassert/data/{0..11}.duckdb`.
+- Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, deduplicated, sorted, and indexed for query performance.
+- Staging Parquet files are written to `./datassert/parquets/{0..11}/`.
+
+Terms are routed to shards deterministically via `xxhash64(term) % 12`, so a given string always hits the same shard.
+
+### Schema
+
+Each shard contains four tables:
+
+| Table | Key Columns | Description |
+|-------|-------------|-------------|
+| `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
+| `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
+| `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
+| `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
+
+## Usage in Graph Config
+
+The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 12 shards at startup and passes the connections to `resolve()`.
+
+```yaml
+# graph-config.yaml (GC2)
+syntax: GC2
+name: my-graph
+version: "1.0"
+datassert: /path/to/datassert/ # directory containing data/0..11.duckdb
+tables:
+  - ./TABLE/my-table.yaml
+```
+
+## Programmatic Usage
+
+When calling `resolve()` directly, open the shard connections yourself:
+
+```python
+import duckdb
+from tablassert.fullmap import resolve
+
+datassert_dir = "/path/to/datassert"
+conns = [
+    duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
+    for i in range(12)
+]
+```
+
+See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.
````
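The new page pins down the routing rule (`xxhash64(term) % 12`) and the four-table schema. A hedged sketch of how a lookup could work end to end, assuming the shard hash is taken over the raw UTF-8 term bytes with the default seed (not spelled out here) and using an illustrative join over the documented column names:

```python
import duckdb
import xxhash

SHARDS = 12
datassert_dir = "/path/to/datassert"

def shard_for(term: str) -> int:
    # Deterministic routing: xxhash64(term) % 12, per the docs.
    # Hashing the UTF-8 bytes with the default seed is an assumption.
    return xxhash.xxh64(term.encode("utf-8")).intdigest() % SHARDS

term = "aspirin"
con = duckdb.connect(f"{datassert_dir}/data/{shard_for(term)}.duckdb", read_only=True)

# Illustrative join over the documented schema (column names from the table above).
rows = con.execute(
    """
    SELECT c.CURIE, c.PREFERRED_NAME, cat.CATEGORY_NAME
    FROM SYNONYMS s
    JOIN CURIES c ON c.CURIE_ID = s.CURIE_ID
    JOIN CATEGORIES cat ON cat.CATEGORY_ID = c.CATEGORY_ID
    WHERE s.SYNONYM = ?
    """,
    [term],
).fetchall()
con.close()
```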

**docs/docker.md**

````diff
@@ -81,8 +81,8 @@ docker run --rm \
 
 - **Datassert path** — The graph configuration YAML specifies the `datassert` path for the entity-resolution database. Ensure it is accessible inside the container.
 - **Multiprocessing** — `src/tablassert/cli.py:63` uses `multiprocessing.Pool` for parallel table loading and section extraction.
-- **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 16 Datassert DuckDB shards concurrently.
-- **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 16 DuckDB shards (`SHARDS = 16`) using xxhash64.
+- **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 12 Datassert DuckDB shards concurrently.
+- **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 12 DuckDB shards (`SHARDS = 12`) using xxhash64.
 - **Text normalization** — `src/tablassert/nlp.py` provides `level_one` (strip + lowercase) and `level_two` (regex-based cleanup).
 
 ## CI/CD Integration
````
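The docker.md notes mention `multiprocessing.Pool` for parallel table loading. A toy sketch of that general pattern, with `load_table` as a hypothetical stand-in rather than the package's actual function:

```python
from multiprocessing import Pool

def load_table(path: str) -> str:
    # Stand-in for per-table loading / section extraction work.
    return f"loaded {path}"

if __name__ == "__main__":
    configs = ["./TABLE/a.yaml", "./TABLE/b.yaml", "./TABLE/c.yaml"]
    with Pool() as pool:
        results = pool.map(load_table, configs)
    print(results)
```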

**llms.txt**

````diff
@@ -4,7 +4,7 @@
 This file is for two audiences: (1) YAML configuration authors and (2) package contributors.
 When source code and prose docs disagree, treat `src/tablassert/models.py` and `src/tablassert/cli.py` as the current authority.
 If you encounter older configurations, migrate `dbssert` to `datassert` (directory path).
-Current CLI behavior opens shard files at `datassert/data/{0..15}.duckdb`.
+Current CLI behavior opens shard files at `datassert/data/{0..11}.duckdb`.
 
 ## Quickstart
 - [README](README.md): high-level overview, install snippets, and one-command graph build.
````

**pyproject.toml**

````diff
@@ -1,6 +1,6 @@
 [project]
 name = "tablassert"
-version = "7.3.1"
+version = "7.3.3"
 description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
 authors = [
     { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
@@ -56,7 +56,7 @@ dependencies = [
     "scikit-learn>=1.8.0",
     "sentence-transformers>=5.3.0",
     "sqlite-utils>=3.39",
-    "typer>=0.
+    "typer>=0.21.2",
     "xxhash>=3.6.0",
 ]
 
````

**src/tablassert/fullmap.py**

````diff
@@ -16,7 +16,7 @@ else:
     plh = Lazy.load("polars_hash")
 
 
-SHARDS: int = 16
+SHARDS: int = 12
 
 
 def empty_matches(column_context: bool) -> pl.DataFrame:
@@ -39,15 +39,15 @@ def empty_matches(column_context: bool) -> pl.DataFrame:
     return pl.DataFrame(schema=schema)  # pyright: ignore
 
 
-def distinct(lf: pl.LazyFrame, l0: str, l1: str, col: str = "term") -> pl.LazyFrame:
+def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFrame:
     # ? Extract Unique Terms From Two Text Normalization Columns As LazyFrame
-    t0: pl.LazyFrame = lf.select(pl.col(l0).alias(col)).unique()
-    t0 = t0.with_columns(pl.lit(0).alias("nlp level"))
-
     t1: pl.LazyFrame = lf.select(pl.col(l1).alias(col)).unique()
     t1 = t1.with_columns(pl.lit(1).alias("nlp level"))
 
-    terms: pl.LazyFrame = pl.concat([t0, t1]).unique(subset=[col], keep="first")
+    t2: pl.LazyFrame = lf.select(pl.col(l2).alias(col)).unique()
+    t2 = t2.with_columns(pl.lit(2).alias("nlp level"))
+
+    terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
 
     bad: str = r"^\d+$|^(none|nan|na|null|unknown)$|^$"
     terms = terms.filter(~pl.col(col).str.contains(bad))
@@ -172,10 +172,10 @@ def resolve(
     tag: str = " two",
 ) -> pl.LazyFrame:
     # ? Case Dependant, Provenance Rich Name Entity Recognition
-    l0: str = col
-    l1: str = add(l0, tag)
+    l1: str = col
+    l2: str = add(l1, tag)
 
-    terms: pl.LazyFrame = distinct(lf, l0, l1)
+    terms: pl.LazyFrame = distinct(lf, l1, l2)
     matches: pl.DataFrame = query_distinct(terms, conns, taxon, prioritize, avoid, column_context)
 
     if log:
@@ -184,45 +184,45 @@ def resolve(
     # ! Collection Point: Join After DuckDB Query, Then Re-Lazy
     df: pl.DataFrame = lf.collect()
     result: pl.DataFrame = df.join(
-        matches.filter(pl.col("NLP_LEVEL").eq(
+        matches.filter(pl.col("NLP_LEVEL").eq(1)), left_on=l1, right_on="term", how="left", suffix=" l1"
     )
 
-
-    result = result.join(
+    l2_matches: pl.DataFrame = matches.filter(pl.col("NLP_LEVEL").eq(2))
+    result = result.join(l2_matches, left_on=l2, right_on="term", how="left", suffix=" l2")
 
     result = result.with_columns(
         [
-            pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE
+            pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE l2")).alias(col),
             pl.when(pl.col("PREFERRED_NAME").is_not_null())
             .then(pl.col("PREFERRED_NAME"))
-            .otherwise(pl.col("PREFERRED_NAME
+            .otherwise(pl.col("PREFERRED_NAME l2"))
             .alias(add(col, " name")),
             pl.when(pl.col("CATEGORY_NAME").is_not_null())
             .then(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME")))
-            .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME
+            .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME l2")))
             .alias(add(col, " category")),
             pl.when(pl.col("TAXON_ID").is_not_null())
             .then(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID").cast(pl.String)))
-            .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID
+            .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID l2").cast(pl.String)))
             .alias(add(col, " taxon")),
             pl.when(pl.col("SOURCE_NAME").is_not_null())
             .then(pl.col("SOURCE_NAME"))
-            .otherwise(pl.col("SOURCE_NAME
+            .otherwise(pl.col("SOURCE_NAME l2"))
             .alias(add(col, " source")),
             pl.when(pl.col("SOURCE_VERSION").is_not_null())
             .then(pl.col("SOURCE_VERSION"))
-            .otherwise(pl.col("SOURCE_VERSION
+            .otherwise(pl.col("SOURCE_VERSION l2"))
             .alias(add(col, " source version")),
             pl.when(pl.col("NLP_LEVEL").is_not_null())
             .then(pl.col("NLP_LEVEL"))
-            .otherwise(pl.col("NLP_LEVEL
+            .otherwise(pl.col("NLP_LEVEL l2"))
             .alias(add(col, " nlp level")),
         ]
     )
 
     result = result.select(
         pl.exclude(
-            r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)(
+            r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)( l2)?$"
         )
     )
     result = result.select(pl.exclude(add(col, " two")))
````
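The reworked `resolve()` above joins level-1 matches first, then level-2 matches with a `" l2"` suffix, and uses `pl.when(...).otherwise(...)` chains to fall back to the level-2 columns. A minimal self-contained sketch of that fallback-join pattern, with toy data and hypothetical identifiers rather than the package's actual frames:

```python
import polars as pl

# Toy inputs: terms with their level-1 and level-2 normalized forms,
# plus per-level match tables (stand-ins for the DuckDB query results).
df = pl.DataFrame({"term": ["aspirin", "tp53"], "term two": ["aspirin", "tp53"]})
l1 = pl.DataFrame({"term": ["aspirin"], "CURIE": ["CHEBI:15365"]})   # hypothetical IDs
l2 = pl.DataFrame({"term": ["tp53"], "CURIE": ["NCBIGene:7157"]})

out = (
    df.join(l1, on="term", how="left")
      .join(l2, left_on="term two", right_on="term", how="left", suffix=" l2")
      .with_columns(
          # Prefer the level-1 hit; fall back to the level-2 hit, as resolve() does.
          pl.when(pl.col("CURIE").is_not_null())
            .then(pl.col("CURIE"))
            .otherwise(pl.col("CURIE l2"))
            .alias("curie")
      )
      .select(["term", "curie"])
)
print(out)
```

Level one wins whenever both levels match, mirroring the `keep="first"` deduplication in `distinct()`.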

**uv.lock**

````diff
@@ -2211,7 +2211,7 @@ wheels = [
 
 [[package]]
 name = "tablassert"
-version = "7.3.1"
+version = "7.3.3"
 source = { editable = "." }
 dependencies = [
     { name = "duckdb" },
@@ -2276,7 +2276,7 @@ requires-dist = [
     { name = "sentence-transformers", specifier = ">=5.3.0" },
     { name = "sqlite-utils", specifier = ">=3.39" },
     { name = "tablassert", extras = ["rtcompat"], marker = "extra == 'rt'" },
-    { name = "typer", specifier = ">=0.
+    { name = "typer", specifier = ">=0.21.2" },
     { name = "xxhash", specifier = ">=3.6.0" },
 ]
 provides-extras = ["rtcompat", "rt"]
````

**tablassert-7.3.1/docs/datassert.md** (removed)

````diff
@@ -1,66 +0,0 @@
-# Datassert
-
-Datassert is the entity-resolution database used by Tablassert. It contains biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
-
-## Installation
-
-```bash
-git clone https://github.com/SkyeAv/datassert
-```
-
-## Structure
-
-Datassert is split into 16 DuckDB shard files for parallel querying:
-
-```
-datassert/
-  data/
-    0.duckdb
-    1.duckdb
-    ...
-    15.duckdb
-```
-
-Terms are routed to shards deterministically via `xxhash64(term) % 16`, so a given string always hits the same shard.
-
-### Schema
-
-Each shard contains four tables:
-
-| Table | Key Columns | Description |
-|-------|-------------|-------------|
-| `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
-| `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
-| `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
-| `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
-
-## Usage in Graph Config
-
-The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 16 shards at startup and passes the connections to `resolve()`.
-
-```yaml
-# graph-config.yaml (GC2)
-syntax: GC2
-name: my-graph
-version: "1.0"
-datassert: /path/to/datassert/ # directory containing data/0..15.duckdb
-tables:
-  - ./TABLE/my-table.yaml
-```
-
-## Programmatic Usage
-
-When calling `resolve()` directly, open the shard connections yourself:
-
-```python
-import duckdb
-from tablassert.fullmap import resolve
-
-datassert_dir = "/path/to/datassert"
-conns = [
-    duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-    for i in range(16)
-]
-```
-
-See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.
````