tablassert 7.3.2.tar.gz → 7.3.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. {tablassert-7.3.2 → tablassert-7.3.4}/AGENTS.md +1 -1
  2. {tablassert-7.3.2 → tablassert-7.3.4}/CHANGELOG.md +18 -0
  3. {tablassert-7.3.2 → tablassert-7.3.4}/CONTRIBUTING.md +3 -6
  4. {tablassert-7.3.2 → tablassert-7.3.4}/PKG-INFO +4 -4
  5. {tablassert-7.3.2 → tablassert-7.3.4}/README.md +3 -3
  6. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/fullmap.md +6 -6
  7. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/lib.md +9 -9
  8. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/qc.md +3 -3
  9. tablassert-7.3.4/docs/changelog.md +13 -0
  10. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/advanced-example.md +10 -4
  11. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/graph.md +4 -1
  12. {tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/table.md +17 -5
  13. tablassert-7.3.4/docs/datassert.md +106 -0
  14. {tablassert-7.3.2 → tablassert-7.3.4}/docs/docker.md +2 -2
  15. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples.md +11 -7
  16. {tablassert-7.3.2 → tablassert-7.3.4}/docs/tutorial.md +1 -1
  17. {tablassert-7.3.2 → tablassert-7.3.4}/llms.txt +1 -1
  18. {tablassert-7.3.2 → tablassert-7.3.4}/mkdocs.yml +1 -1
  19. {tablassert-7.3.2 → tablassert-7.3.4}/pyproject.toml +1 -1
  20. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/downloader.py +15 -4
  21. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/fullmap.py +20 -20
  22. {tablassert-7.3.2 → tablassert-7.3.4}/uv.lock +2 -2
  23. tablassert-7.3.2/docs/datassert.md +0 -66
  24. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/autotag.yml +0 -0
  25. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/docker.yml +0 -0
  26. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/docs.yml +0 -0
  27. {tablassert-7.3.2 → tablassert-7.3.4}/.github/workflows/pipy.yml +0 -0
  28. {tablassert-7.3.2 → tablassert-7.3.4}/.gitignore +0 -0
  29. {tablassert-7.3.2 → tablassert-7.3.4}/.pre-commit-config.yaml +0 -0
  30. {tablassert-7.3.2 → tablassert-7.3.4}/CITATION.cff +0 -0
  31. {tablassert-7.3.2 → tablassert-7.3.4}/Dockerfile +0 -0
  32. {tablassert-7.3.2 → tablassert-7.3.4}/LICENSE +0 -0
  33. {tablassert-7.3.2 → tablassert-7.3.4}/docs/api/utils.md +0 -0
  34. {tablassert-7.3.2 → tablassert-7.3.4}/docs/cli.md +0 -0
  35. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-data.csv +0 -0
  36. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-graph.yaml +0 -0
  37. {tablassert-7.3.2 → tablassert-7.3.4}/docs/examples/tutorial-table.yaml +0 -0
  38. {tablassert-7.3.2 → tablassert-7.3.4}/docs/index.md +0 -0
  39. {tablassert-7.3.2 → tablassert-7.3.4}/docs/installation.md +0 -0
  40. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/__init__.py +0 -0
  41. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/cli.py +0 -0
  42. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/enums.py +0 -0
  43. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/ingests.py +0 -0
  44. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/lib.py +0 -0
  45. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/log.py +0 -0
  46. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/models.py +0 -0
  47. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/nlp.py +0 -0
  48. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/qc.py +0 -0
  49. {tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/utils.py +0 -0
  50. {tablassert-7.3.2 → tablassert-7.3.4}/tests/__init__.py +0 -0
  51. {tablassert-7.3.2 → tablassert-7.3.4}/tests/conftest.py +0 -0
  52. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
  53. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/minimal_section.yaml +0 -0
  54. {tablassert-7.3.2 → tablassert-7.3.4}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
  55. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_enums.py +0 -0
  56. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_fullmap.py +0 -0
  57. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_ingests.py +0 -0
  58. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_lib.py +0 -0
  59. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_models.py +0 -0
  60. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_nlp.py +0 -0
  61. {tablassert-7.3.2 → tablassert-7.3.4}/tests/test_utils.py +0 -0
{tablassert-7.3.2 → tablassert-7.3.4}/AGENTS.md
@@ -35,7 +35,7 @@ src/tablassert/
  lib.py # Core logic: encodings, data loading, Tcode(Section) class
  models.py # Pydantic v2 models (TablaBase base class)
  enums.py # str, Enum subclasses (Tokens, Repositories, Comparisons, etc.)
- fullmap.py # NER / entity resolution (DuckDB, 16 shards)
+ fullmap.py # NER / entity resolution (DuckDB, 10 shards)
  qc.py # Quality control (ONNX/BioBERT, sentence_transformers)
  nlp.py # Text normalization (level_one: strip+lowercase, level_two: regex)
  ingests.py # YAML ingestion: from_yaml(), to_sections(), fastmerge()
{tablassert-7.3.2 → tablassert-7.3.4}/CHANGELOG.md
@@ -2,6 +2,24 @@
 
  All notable changes to this project are documented in this file.
 
+ ## 7.3.4 - 2026-04-28
+
+ ### Bug Fixes
+ - Fixed `downloader.from_url()` failing on URLs that trigger an immediate download. The Playwright session now opens a browser context with `accept_downloads=True`, wraps `page.goto()` inside `page.expect_download()`, and tolerates the expected `net::ERR_ABORTED` navigation error that fires when the response is a download rather than a page.
+
+ ### Documentation
+ - Documented `miscellaneous notes` as a freetext catch-all annotation in the table-configuration and advanced-example pages — used for assay caveats, non-standard units, and qualitative observations that don't map cleanly to a structured field. Supports both `method: value` (constant) and `method: column` (per-row).
+ - Documented Polars regex constraints for the `regex` and `remove` transforms: patterns are passed to Polars `str.replace_all()` (Rust `regex` crate), so capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported and will raise at parse time. Chain simple substitutions instead, or capture residual context in a `miscellaneous notes` annotation.
+
+ ## 7.3.3 - 2026-04-08
+
+ ### Bug Fixes
+ - Changed the datassert shard count to 10 (`SHARDS` constant in `fullmap.py`) to match the current datassert database layout.
+
+ ### Documentation
+ - Updated shard-count references across documentation and examples to reflect the current 10-shard datassert layout.
+ - Corrected provenance examples so `repo` carries the namespace prefix and `publication` carries the repository-local identifier.
+
  ## 7.3.2 - 2026-04-03
 
  ### Maintenance
{tablassert-7.3.2 → tablassert-7.3.4}/CONTRIBUTING.md
@@ -20,15 +20,12 @@ cd Tablassert
  uv sync
  ```
 
- ### Optional Dependency Groups
+ ### Optional Extras
 
- Some features require optional dependencies:
+ All ML, web, and Excel dependencies are included in the core install. The only optional extra is a runtime-compatible Polars build for CPUs without the required instruction sets:
 
  ```bash
- uv sync --extra ml # sentence-transformers, onnxruntime, scikit-learn
- uv sync --extra web # playwright
- uv sync --extra pyexcel # pyexcel
- uv sync --extra full # all optional deps
+ uv sync --extra rtcompat # polars[rtcompat]
  ```
 
  ## Development Workflow
{tablassert-7.3.2 → tablassert-7.3.4}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: tablassert
- Version: 7.3.2
+ Version: 7.3.4
  Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
  Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
  Project-URL: Source, https://github.com/SkyeAv/Tablassert
@@ -99,13 +99,13 @@ docker run --rm \
  # Build a knowledge graph from a YAML configuration
  $ tablassert build-knowledge-graph graph-config.yaml
  ⠋ Loading table configurations...
- ⠋ Resolving entities across 16 DuckDB shards...
+ ⠋ Resolving entities across 10 DuckDB shards...
  ⠋ Compiling subgraphs...
  ⠋ Deduplicating nodes and edges...
- Done — wrote nodes.ndjson and edges.ndjson to .storassert/
+ Finished — wrote MY_GRAPH_1.0.0.nodes.ndjson and MY_GRAPH_1.0.0.edges.ndjson
  ```
 
- Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
+ Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
 
  ## Key Features
 
{tablassert-7.3.2 → tablassert-7.3.4}/README.md
@@ -47,13 +47,13 @@ docker run --rm \
  # Build a knowledge graph from a YAML configuration
  $ tablassert build-knowledge-graph graph-config.yaml
  ⠋ Loading table configurations...
- ⠋ Resolving entities across 16 DuckDB shards...
+ ⠋ Resolving entities across 10 DuckDB shards...
  ⠋ Compiling subgraphs...
  ⠋ Deduplicating nodes and edges...
- Done — wrote nodes.ndjson and edges.ndjson to .storassert/
+ Finished — wrote MY_GRAPH_1.0.0.nodes.ndjson and MY_GRAPH_1.0.0.edges.ndjson
  ```
 
- Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
+ Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
 
  ## Key Features
 
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/fullmap.md
@@ -36,7 +36,7 @@ Column name containing text strings to resolve.
 
  **`conns: list[object]`**
 
- List of 16 DuckDB shard connections to the datassert database.
+ List of 10 DuckDB shard connections to the datassert database.
 
  Each shard contains:
  - Synonym mappings (text → CURIE)
@@ -97,7 +97,7 @@ Returns a Polars LazyFrame with these columns added:
  | `{col} taxon` | NCBI Taxon ID | `"NCBITaxon:9606"` |
  | `{col} source` | Source database | `"HGNC"` |
  | `{col} source version` | Database version | `"2025-01"` |
- | `{col} nlp level` | NLP processing level | `0` or `1` |
+ | `{col} nlp level` | NLP processing level | `1` or `2` |
 
  ### DuckDB Query
 
@@ -125,11 +125,11 @@ from tablassert.enums import Categories
  import duckdb
  import polars as pl
 
- # Open all 16 shard connections
+ # Open all 10 shard connections
  datassert_dir = "/path/to/datassert"
  conns = [
      duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
+     for i in range(10)
  ]
 
  # LazyFrame with data to resolve
@@ -167,11 +167,11 @@ from tablassert.fullmap import resolve
  from tablassert.nlp import level_one, level_two
  from tablassert.enums import Categories
 
- # Open all 16 shard connections
+ # Open all 10 shard connections
  datassert_dir = "/path/to/datassert"
  conns = [
      duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
+     for i in range(10)
  ]
 
  # Map a list of gene symbols to CURIEs
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/lib.md
@@ -2,11 +2,11 @@
 
  The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
 
- It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 16 DuckDB shard connections, executing entity resolution, and returning results as a plain Python dictionary.
+ It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 10 DuckDB shard connections, executing entity resolution, and returning results as a plain Python list of row dictionaries.
 
  ## resolve_many()
 
- Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a dictionary of lists.
+ Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a list of row dictionaries.
 
  ### Function Signature
 
@@ -26,9 +26,9 @@ def resolve_many(
 
  **`col: str`**
 
- Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in the returned dictionary.
+ Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in each returned row dictionary.
 
- For example, if `col="gene"`, the returned dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
+ For example, if `col="gene"`, each returned row dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
 
  **`entities: Iterable[str]`**
 
@@ -38,7 +38,7 @@ Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generator
 
  **`datassert: Path`**
 
- Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 16 DuckDB shard files (`0.duckdb` through `15.duckdb`).
+ Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 10 DuckDB shard files (`0.duckdb` through `9.duckdb`).
 
  Each shard contains:
  - Synonym mappings (text → CURIE)
@@ -86,7 +86,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `col` argument):
  | `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
  | `{col} source` | Source database | `"HGNC"` |
  | `{col} source version` | Database version | `"2025-01"` |
- | `{col} nlp level` | NLP processing level used for match | `0` or `1` |
+ | `{col} nlp level` | NLP processing level used for match | `1` or `2` |
 
  **Important:** Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by `resolve()`. The returned list may therefore be shorter than the input iterable.
 
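To make the new return shape concrete, here is a minimal usage sketch (a hypothetical call, assuming `resolve_many` is importable from `tablassert.lib` as this page states and that arguments are accepted by keyword):

```python
from pathlib import Path

from tablassert.lib import resolve_many

rows = resolve_many(
    col="gene",
    entities=["TP53", "BRCA1", "not-a-real-symbol"],
    datassert=Path("/path/to/datassert"),
)

# Unresolved inputs are filtered out, so len(rows) may be smaller than 3 here
for row in rows:
    print(row["gene"], row["gene name"], row["gene nlp level"])
```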
@@ -98,7 +98,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `col` argument):
 
  2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
 
- 3. **DuckDB connection management** — Opens all 16 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
+ 3. **DuckDB connection management** — Opens all 10 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
 
  4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
 
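A plain-Python sketch of the two documented normalization levels (illustrative only: the real implementations live in `src/tablassert/nlp.py`, and whether `\W+` runs are deleted or collapsed to spaces there is an assumption here):

```python
import re

def level_one_sketch(s: str) -> str:
    # Documented behavior: whitespace stripping + lowercasing
    return s.strip().lower()

def level_two_sketch(s: str) -> str:
    # Documented behavior: non-word character removal via \W+
    # (assumed here to collapse runs to a single space)
    return re.sub(r"\W+", " ", level_one_sketch(s)).strip()

level_one_sketch("  Alpha-Synuclein ")  # "alpha-synuclein"  (nlp level 1)
level_two_sketch("  Alpha-Synuclein ")  # "alpha synuclein"  (nlp level 2)
```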
@@ -224,8 +224,8 @@ Both levels are queried during resolution. Level one (exact case-insensitive match
  ### Error Handling
 
  - If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
- - If `entities` is empty, the function returns a dictionary with empty lists for all output columns.
- - The `ExitStack` ensures all 16 DuckDB connections are closed even if resolution raises an exception.
+ - If `entities` is empty, the function returns `[]`.
+ - The `ExitStack` ensures all 10 DuckDB connections are closed even if resolution raises an exception.
  - Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
 
  ## Integration
{tablassert-7.3.2 → tablassert-7.3.4}/docs/api/qc.md
@@ -82,7 +82,7 @@ Two fuzzy matching algorithms:
  1. **Ratio:** Overall string similarity
  2. **Partial token sort ratio:** Combined token/subsequence matching
 
- **Threshold:** Default 20% similarity (configurable)
+ **Threshold:** 20% similarity
 
  ```python
  fuzz.ratio(original, preferred) >= 20
@@ -125,7 +125,7 @@ return similarity >= 0.2
  - Graph optimization level: ALL
  - ONNX session caching
 
- Lazy-loaded on first `BERT_audit()` call, then reused for subsequent calls.
+ Lazy-loaded on first `fullmap_audit()` call that reaches the embedding stage, then reused for subsequent calls.
 
  ### Model Caching
 
@@ -135,7 +135,7 @@ BioBERT is lazy-loaded on first use and cached globally for the lifetime of the process.
  # ? Lazy-loads BioBERT once on first batch audit call, then caches globally
  ```
 
- **Cache location:** In-memory (global model cache)
+ **Cache location:** Downloaded model files are cached on disk in `.onnxassert/`, and the loaded model object is cached in memory for the lifetime of the process.
 
  **Cache strategy:** BioBERT model loaded once on first batch audit, then reused globally
 
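The load-once, reuse-globally pattern described above can be sketched with a memoized loader (an illustration, not tablassert's actual API; the function name and model filename are hypothetical, though `.onnxassert/` is the documented cache directory):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_biobert_session():
    # First call pays the model-load cost; every later call returns the
    # same cached session object for the lifetime of the process
    import onnxruntime as ort
    return ort.InferenceSession(".onnxassert/biobert.onnx")  # hypothetical filename
```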
tablassert-7.3.4/docs/changelog.md (new file)
@@ -0,0 +1,13 @@
+ # Changelog
+
+ The canonical release history lives in the repository root at [`CHANGELOG.md`](https://github.com/SkyeAv/Tablassert/blob/main/CHANGELOG.md).
+
+ ## Current Release Notes
+
+ ### 7.3.4 - 2026-04-28
+
+ - `downloader.from_url()` now handles URLs that respond with an immediate download instead of a navigable page — the Playwright session uses a download-aware browser context and tolerates the expected `net::ERR_ABORTED` navigation error.
+ - Table configuration docs now describe `miscellaneous notes` as a freetext catch-all annotation for source context that doesn't map cleanly to a structured field.
+ - Regex transform documentation now spells out the Polars `str.replace_all()` constraints — no capturing groups or lookarounds — so authors know to chain simple substitutions or fall back to `miscellaneous notes`.
+
+ For older releases and the full project history, open the root `CHANGELOG.md` in the repository.
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/advanced-example.md
@@ -66,7 +66,7 @@ template:
  # Provenance: Publication and curation info
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -103,12 +103,16 @@ template:
      method: value
      encoding: Spearman correlation
 
-   # Descriptive note
+   # Freetext catch-all — anything that doesn't map cleanly to a structured
+   # annotation (study design caveats, non-standard units, qualitative
+   # observations) belongs here rather than being dropped.
    - annotation: miscellaneous notes
      method: value
      encoding: Correlation analysis between microbial composition and 13C-tamoxifen abundance after FDR correction
  ```
 
+ > **`miscellaneous notes` is a freetext escape hatch.** Use it whenever the source carries context you can't otherwise cleanly encode — assay variants, post-hoc qualifiers, "values are log-transformed", etc. It accepts `method: value` for a constant note across the whole table or `method: column` to pull per-row notes from the source.
+
  ## Key Techniques
 
  ### Excel Column References
@@ -143,6 +147,8 @@ The subject field uses three regex transformations in sequence:
  ```
  `"Lactobacillus sp"` → `"Lactobacillus sp. "`
 
+ > **Regex constraint:** Each `pattern` is handed to Polars `str.replace_all()` (Rust `regex` crate). **Capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not allowed** and will fail validation. Express transformations as a sequence of simple anchored / character-class substitutions instead — the pipeline above is a deliberate three-step chain because no single capturing-group pattern is permitted. If the transformation can't be expressed without those features, capture the leftover context in a `miscellaneous notes` annotation rather than fighting the regex engine.
+
  ### Taxonomic Filtering
 
  Prevent incorrect entity resolution:
@@ -297,7 +303,7 @@ template:
 
  provenance:
    repo: PMC
-   publication: PMC12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -358,7 +364,7 @@ template:
 
  provenance:
    repo: PMC
-   publication: PMC87654321
+   publication: 87654321
    contributors:
      - kind: curation
        name: Skye Lane Goetz
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/graph.md
@@ -60,12 +60,14 @@ See [Table Configuration](table.md) for details.
 
  **`datassert: path`**
 
- Path to the datassert directory for entity resolution. Tablassert opens 16 shard files from `datassert/data/{0..15}.duckdb`. This database contains:
+ Path to the [datassert](../datassert.md) directory for entity resolution. Tablassert opens 10 shard files from `datassert/data/{0..9}.duckdb`. This database contains:
  - Synonym mappings (text → CURIE)
  - Biolink categories
  - Taxonomic information
  - Source provenance (which database provided the mapping)
 
+ See [Datassert](../datassert.md) for installation, build commands, and database schema.
+
  **`pubmed_db: path`**
 
  Optional path to SQLite database with PubMed metadata:
@@ -165,4 +167,5 @@ This processes a single table configuration (ALAMV6.yaml) into a knowledge graph.
  ## Next Steps
 
  - **[Table Configuration](table.md)** - Learn how to define table transformations
+ - **[Datassert](../datassert.md)** - Entity-resolution database installation and build
  - **[Tutorial](../tutorial.md)** - Complete example walkthrough
{tablassert-7.3.2 → tablassert-7.3.4}/docs/configuration/table.md
@@ -100,7 +100,7 @@ template:
  ```yaml
  template:
    source: {kind: excel, local: data.xlsx}
-   provenance: {publication: PMC123}
+   provenance: {repo: PMC, publication: 123}
 
  sections:
    - statement: {predicate: treats}
@@ -111,7 +111,7 @@ sections:
  ```yaml
  template:
    source: {kind: text, local: data.csv}
-   provenance: {publication: PMID456}
+   provenance: {repo: PMID, publication: 456}
  statement:
    subject: {encoding: gene_symbol}
 
@@ -330,6 +330,8 @@ subject:
 
  Executed in order.
 
+ > **Regex dialect:** Patterns are passed directly to Polars `str.replace_all()`, which uses the Rust [`regex`](https://docs.rs/regex/) crate. Only features supported by that engine work — in particular, **capturing groups (`(...)`, `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported** and will raise an error at parse time. Stick to character classes, anchors (`^`, `$`), quantifiers, alternation (`a|b`), and non-capturing groups (`(?:...)`) if grouping is needed. If a transformation is too complex to express, prefer chaining several simple substitutions or capturing the residual context in a `miscellaneous notes` annotation instead.
+
  **`remove: list[string]`** - Filter out specific strings
 
  ```yaml
@@ -339,6 +341,8 @@ subject:
  - "^NA " # Remove rows starting with "NA "
  ```
 
+ The same regex constraints apply as for the `regex` field — Polars-compatible patterns only, no capturing groups or lookarounds.
+
  **`prefix` / `suffix`** - Add text
 
  ```yaml
@@ -416,7 +420,7 @@ Required metadata about data source.
  | Field | Type | Required | Description |
  |-------|------|----------|-------------|
  | `repo` | String | Yes | Repository: `"PMC"`, `"PMID"` |
- | `publication` | String | Yes | Identifier (e.g., `"PMC11708054"`, `"PMID123"`) |
+ | `publication` | String | Yes | Repository-local identifier appended to `repo:` (e.g., `"11708054"`, `"123"`) |
  | `contributors` | List[Contributor] | Yes | Curation information |
 
  **Contributor fields:**
@@ -433,7 +437,7 @@ Required metadata about data source.
  ```yaml
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Skye Lane Goetz
@@ -467,8 +471,16 @@ annotations:
  - annotation: multiple testing correction method
    method: value
    encoding: "Benjamini Hochberg"
+
+ # Freetext catch-all for context that doesn't fit a structured field —
+ # study caveats, units, post-hoc notes, anything you'd otherwise lose.
+ - annotation: miscellaneous notes
+   method: value
+   encoding: "Values are log2 fold-change relative to vehicle control; n=3 biological replicates per arm"
  ```
 
+ > **Tip:** When source data carries information that can't be cleanly mapped to a structured annotation (assay-specific caveats, non-standard units, qualitative observations), add a `miscellaneous notes` annotation rather than forcing it into another field or dropping it. It accepts both `method: value` (one note for the whole table) and `method: column` (per-row notes from the source).
+
  ## Complete Example
 
  Minimal table configuration:
@@ -498,7 +510,7 @@ template:
 
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Example User
tablassert-7.3.4/docs/datassert.md (new file)
@@ -0,0 +1,106 @@
+ # Datassert
+
+ Datassert is a high-performance CLI for building a DuckDB-backed assertion store from NCATS Translator BABEL export files, with a focus on fast local builds and simple command-driven workflows. It produces the entity-resolution database used by Tablassert, containing biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
+
+ ## Installation
+
+ ```bash
+ # Install CLI from GitHub
+ go install github.com/SkyeAv/datassert@latest
+
+ # Verify install
+ datassert --help
+ ```
+
+ ## Build Command
+
+ ```bash
+ # Build a Datassert database (downloads BABEL data automatically)
+ datassert build
+ ```
+
+ The build command automatically downloads BABEL exports from RENCI (`https://stars.renci.org/var/babel_outputs`), processes them, and produces sharded DuckDB databases.
+
+ ### Flags
+
+ | Flag | Required | Default | Description |
+ |------|----------|---------|-------------|
+ | `--skip-downloads` / `-s` | No | `false` | Skip the BABEL download phase (use previously downloaded files) |
+ | `--use-existing-parquets` / `-p` | No | `false` | Use existing Parquet files to rebuild DuckDB databases |
+
+ ### Data Pipeline
+
+ 1. **Download** — BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under `./datassert/downloads/`.
+ 2. **Lookup** — Class files (`*.ndjson.lz4`) are read to build an in-memory equivalent-identifier lookup.
+ 3. **Parquet Staging** — Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to `./datassert/parquets/`.
+ 4. **DuckDB Generation** — Parquet files are loaded into 10 sharded DuckDB databases under `./datassert/data/`.
+
+ ### Examples
+
+ ```bash
+ # Full build (download, process, and generate databases)
+ datassert build
+
+ # Skip downloads if BABEL files were already fetched
+ datassert build --skip-downloads
+
+ # Rebuild DuckDB databases from existing Parquet files
+ datassert build --use-existing-parquets
+ ```
+
+ ### Runtime Behavior
+
+ - Displays progress bars for download, class lookup, synonym processing, and DuckDB build phases.
+ - Uses 90% of available CPUs for concurrent processing.
+ - Downloads are retried up to 3 times on failure with a 10-second backoff.
+ - All working files are stored under `./datassert/`.
+
+ ## Output Artifacts
+
+ - 10 sharded DuckDB databases are written to `./datassert/data/{0..9}.duckdb`.
+ - Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, deduplicated, sorted, and indexed for query performance.
+ - Staging Parquet files are written to `./datassert/parquets/{0..9}/`.
+
+ Terms are routed to shards deterministically via `xxhash64(term) % 10`, so a given string always hits the same shard.
+
+ ### Schema
+
+ Each shard contains four tables:
+
+ | Table | Key Columns | Description |
+ |-------|-------------|-------------|
+ | `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
+ | `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
+ | `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
+ | `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
+
+ ## Usage in Graph Config
+
+ The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 10 shards at startup and passes the connections to `resolve()`.
+
+ ```yaml
+ # graph-config.yaml (GC2)
+ syntax: GC2
+ name: my-graph
+ version: "1.0"
+ datassert: /path/to/datassert/ # directory containing data/0..9.duckdb
+ tables:
+   - ./TABLE/my-table.yaml
+ ```
+
+ ## Programmatic Usage
+
+ When calling `resolve()` directly, open the shard connections yourself:
+
+ ```python
+ import duckdb
+ from tablassert.fullmap import resolve
+
+ datassert_dir = "/path/to/datassert"
+ conns = [
+     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
+     for i in range(10)
+ ]
+ ```
+
+ See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.
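The routing rule stated in this new page is easy to restate in plain Python (an illustration: tablassert computes the hash in Polars via `polars_hash`, so reproducing its exact shard assignment depends on matching seeds and encodings):

```python
import xxhash

SHARDS = 10

def shard_for(term: str) -> int:
    # Deterministic: the same term string always maps to the same shard index
    return xxhash.xxh64(term.encode("utf-8")).intdigest() % SHARDS

shard = shard_for("tp53")  # query ./datassert/data/{shard}.duckdb for this term
```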
{tablassert-7.3.2 → tablassert-7.3.4}/docs/docker.md
@@ -81,8 +81,8 @@ docker run --rm \
 
  - **Datassert path** — The graph configuration YAML specifies the `datassert` path for the entity-resolution database. Ensure it is accessible inside the container.
  - **Multiprocessing** — `src/tablassert/cli.py:63` uses `multiprocessing.Pool` for parallel table loading and section extraction.
- - **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 16 Datassert DuckDB shards concurrently.
- - **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 16 DuckDB shards (`SHARDS = 16`) using xxhash64.
+ - **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 10 Datassert DuckDB shards concurrently.
+ - **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 10 DuckDB shards (`SHARDS = 10`) using xxhash64.
  - **Text normalization** — `src/tablassert/nlp.py` provides `level_one` (strip + lowercase) and `level_two` (regex-based cleanup).
 
  ## CI/CD Integration
{tablassert-7.3.2 → tablassert-7.3.4}/docs/examples.md
@@ -33,7 +33,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Your Name
@@ -85,7 +85,7 @@ template:
    taxon: 9606
  provenance:
    repo: PMID
-   publication: PMID98765432
+   publication: 98765432
    contributors:
      - kind: curation
        name: Your Name
@@ -146,7 +146,7 @@ template:
    encoding: CHEBI:41774
  provenance:
    repo: PMC
-   publication: PMC11708054
+   publication: 11708054
    contributors:
      - kind: curation
        name: Your Name
@@ -161,11 +161,15 @@ template:
    - annotation: assertion method
      method: value
      encoding: "Spearman correlation"
+   # Freetext catch-all for context that doesn't fit a structured field.
+   - annotation: miscellaneous notes
+     method: value
+     encoding: "FDR-corrected; samples pooled across two cohorts"
  ```
 
  **Key techniques:**
 
- - **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`)
+ - **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`). Patterns must be Polars `str.replace_all()`-compatible — no capturing groups (`(...)` / `\1`) and no lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`). Chain several simple substitutions instead (see the sketch after this hunk).
  - **Avoid list** (`avoid: [Gene]`) prevents organism names from resolving to gene entities
  - **Fixed-value object** (`method: value`) assigns the same metabolite CURIE to all rows
  - **Excel source** with sheet name and row slicing
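A minimal Polars sketch of the chained-substitution approach that bullet prescribes (hypothetical patterns, shown outside tablassert's YAML layer to make the `str.replace_all()` constraint concrete):

```python
import polars as pl

df = pl.DataFrame({"taxon": ["d__Bacteria;p__Firmicutes;g__Lactobacillus"]})

# No capturing groups or lookarounds: chain simple substitutions instead,
# each one a pattern the Rust regex engine accepts
df = df.with_columns(
    pl.col("taxon")
    .str.replace_all(r"^.*;g__", "")  # drop everything up to the genus tag
    .str.replace_all(r"_+", " ")      # normalize any leftover underscores
    .alias("genus")
)
print(df["genus"][0])  # "Lactobacillus"
```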
@@ -199,7 +203,7 @@ template:
    encoding: PLACEHOLDER
  provenance:
    repo: PMID
-   publication: PMID11223344
+   publication: 11223344
    contributors:
      - kind: curation
        name: Your Name
@@ -277,7 +281,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID55667788
+   publication: 55667788
    contributors:
      - kind: curation
        name: Your Name
@@ -330,7 +334,7 @@ template:
      - ChemicalEntity
  provenance:
    repo: PMID
-   publication: PMID99887766
+   publication: 99887766
    contributors:
      - kind: curation
        name: Your Name
{tablassert-7.3.2 → tablassert-7.3.4}/docs/tutorial.md
@@ -68,7 +68,7 @@ template:
      - Disease
  provenance:
    repo: PMID
-   publication: PMID12345678
+   publication: 12345678
    contributors:
      - kind: curation
        name: Tutorial Example
{tablassert-7.3.2 → tablassert-7.3.4}/llms.txt
@@ -4,7 +4,7 @@
  This file is for two audiences: (1) YAML configuration authors and (2) package contributors.
  When source code and prose docs disagree, treat `src/tablassert/models.py` and `src/tablassert/cli.py` as the current authority.
  If you encounter older configurations, migrate `dbssert` to `datassert` (directory path).
- Current CLI behavior opens shard files at `datassert/data/{0..15}.duckdb`.
+ Current CLI behavior opens shard files at `datassert/data/{0..9}.duckdb`.
 
  ## Quickstart
  - [README](README.md): high-level overview, install snippets, and one-command graph build.
{tablassert-7.3.2 → tablassert-7.3.4}/mkdocs.yml
@@ -17,4 +17,4 @@ nav:
    - Batch Resolution: api/lib.md
    - Quality Control: api/qc.md
    - Utilities: api/utils.md
-   - Changelog: ../CHANGELOG.md
+   - Changelog: changelog.md
{tablassert-7.3.2 → tablassert-7.3.4}/pyproject.toml
@@ -1,6 +1,6 @@
  [project]
  name = "tablassert"
- version = "7.3.2"
+ version = "7.3.4"
  description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
  authors = [
      { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
{tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/downloader.py
@@ -29,13 +29,24 @@ def from_url(website: str, p: Path, timeout: int = 60_000, retries: int = 3) -> Path:
      try:
          with sync_playwright() as pw:
              browser = pw.chromium.launch(headless=True)
-             page = browser.new_page()
-             page.goto(website, wait_until="networkidle", timeout=timeout)
+             context = browser.new_context(accept_downloads=True)
+
+             page = context.new_page()
              with page.expect_download(timeout=timeout) as info:
-                 download = info.value
-                 download.save_as(p)
+                 try:
+                     page.goto(website, wait_until="load", timeout=timeout)
+                 except Exception as e:
+                     if "net::ERR_ABORTED" not in str(e):
+                         raise
+
+             download = info.value
+             download.save_as(p)
+
+             context.close()
              browser.close()
+
          return p
+
      except Exception as e:
          last = e
          if attempt < retries - 1:
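A condensed, self-contained sketch of the new download flow (reconstructed from the hunk above; the function name here is hypothetical, and the real `from_url()` wraps this body in a retry loop):

```python
from pathlib import Path

from playwright.sync_api import sync_playwright

def fetch_download(website: str, p: Path, timeout: int = 60_000) -> Path:
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context(accept_downloads=True)  # downloads allowed
        page = context.new_page()
        with page.expect_download(timeout=timeout) as info:
            try:
                page.goto(website, wait_until="load", timeout=timeout)
            except Exception as e:
                # A direct download aborts navigation; Playwright surfaces this
                # as net::ERR_ABORTED, which is expected and safe to swallow
                if "net::ERR_ABORTED" not in str(e):
                    raise
        info.value.save_as(p)  # persist the captured download to disk
        context.close()
        browser.close()
    return p
```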
{tablassert-7.3.2 → tablassert-7.3.4}/src/tablassert/fullmap.py
@@ -16,7 +16,7 @@ else:
      plh = Lazy.load("polars_hash")
 
 
- SHARDS: int = 16
+ SHARDS: int = 10
 
 
  def empty_matches(column_context: bool) -> pl.DataFrame:
@@ -39,15 +39,15 @@ def empty_matches(column_context: bool) -> pl.DataFrame:
      return pl.DataFrame(schema=schema)  # pyright: ignore
 
 
- def distinct(lf: pl.LazyFrame, l0: str, l1: str, col: str = "term") -> pl.LazyFrame:
+ def distinct(lf: pl.LazyFrame, l1: str, l2: str, col: str = "term") -> pl.LazyFrame:
      # ? Extract Unique Terms From Two Text Normalization Columns As LazyFrame
-     t0: pl.LazyFrame = lf.select(pl.col(l0).alias(col)).unique()
-     t0 = t0.with_columns(pl.lit(0).alias("nlp level"))
-
      t1: pl.LazyFrame = lf.select(pl.col(l1).alias(col)).unique()
      t1 = t1.with_columns(pl.lit(1).alias("nlp level"))
 
-     terms: pl.LazyFrame = pl.concat([t0, t1]).unique(subset=[col], keep="first")
+     t2: pl.LazyFrame = lf.select(pl.col(l2).alias(col)).unique()
+     t2 = t2.with_columns(pl.lit(2).alias("nlp level"))
+
+     terms: pl.LazyFrame = pl.concat([t1, t2]).unique(subset=[col], keep="first")
 
      bad: str = r"^\d+$|^(none|nan|na|null|unknown)$|^$"
      terms = terms.filter(~pl.col(col).str.contains(bad))
@@ -172,10 +172,10 @@ def resolve(
      tag: str = " two",
  ) -> pl.LazyFrame:
      # ? Case Dependent, Provenance Rich Named Entity Recognition
-     l0: str = col
-     l1: str = add(l0, tag)
+     l1: str = col
+     l2: str = add(l1, tag)
 
-     terms: pl.LazyFrame = distinct(lf, l0, l1)
+     terms: pl.LazyFrame = distinct(lf, l1, l2)
      matches: pl.DataFrame = query_distinct(terms, conns, taxon, prioritize, avoid, column_context)
 
      if log:
@@ -184,45 +184,45 @@ def resolve(
      # ! Collection Point: Join After DuckDB Query, Then Re-Lazy
      df: pl.DataFrame = lf.collect()
      result: pl.DataFrame = df.join(
-         matches.filter(pl.col("NLP_LEVEL").eq(0)), left_on=l0, right_on="term", how="left", suffix=" l0"
+         matches.filter(pl.col("NLP_LEVEL").eq(1)), left_on=l1, right_on="term", how="left", suffix=" l1"
      )
 
-     l1_matches: pl.DataFrame = matches.filter(pl.col("NLP_LEVEL").eq(1))
-     result = result.join(l1_matches, left_on=l1, right_on="term", how="left", suffix=" l1")
+     l2_matches: pl.DataFrame = matches.filter(pl.col("NLP_LEVEL").eq(2))
+     result = result.join(l2_matches, left_on=l2, right_on="term", how="left", suffix=" l2")
 
      result = result.with_columns(
          [
-             pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE l1")).alias(col),
+             pl.when(pl.col("CURIE").is_not_null()).then(pl.col("CURIE")).otherwise(pl.col("CURIE l2")).alias(col),
              pl.when(pl.col("PREFERRED_NAME").is_not_null())
              .then(pl.col("PREFERRED_NAME"))
-             .otherwise(pl.col("PREFERRED_NAME l1"))
+             .otherwise(pl.col("PREFERRED_NAME l2"))
              .alias(add(col, " name")),
              pl.when(pl.col("CATEGORY_NAME").is_not_null())
              .then(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME")))
-             .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME l1")))
+             .otherwise(add(pl.lit("biolink:"), pl.col("CATEGORY_NAME l2")))
              .alias(add(col, " category")),
              pl.when(pl.col("TAXON_ID").is_not_null())
              .then(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID").cast(pl.String)))
-             .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID l1").cast(pl.String)))
+             .otherwise(add(pl.lit("NCBITaxon:"), pl.col("TAXON_ID l2").cast(pl.String)))
              .alias(add(col, " taxon")),
              pl.when(pl.col("SOURCE_NAME").is_not_null())
              .then(pl.col("SOURCE_NAME"))
-             .otherwise(pl.col("SOURCE_NAME l1"))
+             .otherwise(pl.col("SOURCE_NAME l2"))
              .alias(add(col, " source")),
              pl.when(pl.col("SOURCE_VERSION").is_not_null())
              .then(pl.col("SOURCE_VERSION"))
-             .otherwise(pl.col("SOURCE_VERSION l1"))
+             .otherwise(pl.col("SOURCE_VERSION l2"))
              .alias(add(col, " source version")),
              pl.when(pl.col("NLP_LEVEL").is_not_null())
              .then(pl.col("NLP_LEVEL"))
-             .otherwise(pl.col("NLP_LEVEL l1"))
+             .otherwise(pl.col("NLP_LEVEL l2"))
              .alias(add(col, " nlp level")),
          ]
      )
 
      result = result.select(
          pl.exclude(
-             r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)( l1)?$"
+             r"^(CURIE|PREFERRED_NAME|CATEGORY_NAME|TAXON_ID|SOURCE_NAME|SOURCE_VERSION|NLP_LEVEL|PR|FREQUENCY)( l2)?$"
          )
      )
      result = result.select(pl.exclude(add(col, " two")))
{tablassert-7.3.2 → tablassert-7.3.4}/uv.lock
@@ -2211,7 +2211,7 @@ wheels = [
 
  [[package]]
  name = "tablassert"
- version = "7.3.0"
+ version = "7.3.4"
  source = { editable = "." }
  dependencies = [
      { name = "duckdb" },
@@ -2276,7 +2276,7 @@ requires-dist = [
      { name = "sentence-transformers", specifier = ">=5.3.0" },
      { name = "sqlite-utils", specifier = ">=3.39" },
      { name = "tablassert", extras = ["rtcompat"], marker = "extra == 'rt'" },
-     { name = "typer", specifier = ">=0.24.1" },
+     { name = "typer", specifier = ">=0.21.2" },
      { name = "xxhash", specifier = ">=3.6.0" },
  ]
  provides-extras = ["rtcompat", "rt"]
tablassert-7.3.2/docs/datassert.md (removed; superseded by the new docs/datassert.md above)
@@ -1,66 +0,0 @@
- # Datassert
-
- Datassert is the entity-resolution database used by Tablassert. It contains biological synonyms, CURIEs, Biolink categories, taxon IDs, and source provenance, enabling `resolve()` to map free-text strings to standardized identifiers.
-
- ## Installation
-
- ```bash
- git clone https://github.com/SkyeAv/datassert
- ```
-
- ## Structure
-
- Datassert is split into 16 DuckDB shard files for parallel querying:
-
- ```
- datassert/
-   data/
-     0.duckdb
-     1.duckdb
-     ...
-     15.duckdb
- ```
-
- Terms are routed to shards deterministically via `xxhash64(term) % 16`, so a given string always hits the same shard.
-
- ### Schema
-
- Each shard contains four tables:
-
- | Table | Key Columns | Description |
- |-------|-------------|-------------|
- | `SYNONYMS` | `SYNONYM`, `CURIE_ID`, `SOURCE_ID` | Text synonym → CURIE mapping |
- | `CURIES` | `CURIE_ID`, `CURIE`, `PREFERRED_NAME`, `TAXON_ID`, `CATEGORY_ID` | Canonical identifiers and preferred names |
- | `CATEGORIES` | `CATEGORY_ID`, `CATEGORY_NAME` | Biolink category names |
- | `SOURCES` | `SOURCE_ID`, `SOURCE_NAME`, `SOURCE_VERSION` | Source database and version provenance |
-
- ## Usage in Graph Config
-
- The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 16 shards at startup and passes the connections to `resolve()`.
-
- ```yaml
- # graph-config.yaml (GC2)
- syntax: GC2
- name: my-graph
- version: "1.0"
- datassert: /path/to/datassert/ # directory containing data/0..15.duckdb
- tables:
-   - ./TABLE/my-table.yaml
- ```
-
- ## Programmatic Usage
-
- When calling `resolve()` directly, open the shard connections yourself:
-
- ```python
- import duckdb
- from tablassert.fullmap import resolve
-
- datassert_dir = "/path/to/datassert"
- conns = [
-     duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
-     for i in range(16)
- ]
- ```
-
- See [Entity Resolution](api/fullmap.md) for the full `resolve()` API.