tablassert 7.3.3__tar.gz → 7.3.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tablassert-7.3.3 → tablassert-7.3.5}/CHANGELOG.md +17 -2
- {tablassert-7.3.3 → tablassert-7.3.5}/PKG-INFO +9 -7
- {tablassert-7.3.3 → tablassert-7.3.5}/README.md +8 -6
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/api/fullmap.md +6 -6
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/api/lib.md +9 -9
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/api/qc.md +3 -3
- tablassert-7.3.5/docs/changelog.md +13 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/configuration/advanced-example.md +10 -4
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/configuration/graph.md +1 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/configuration/table.md +70 -58
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/datassert.md +7 -7
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/docker.md +2 -2
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/examples.md +11 -7
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/tutorial.md +1 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/mkdocs.yml +1 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/pyproject.toml +1 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/downloader.py +15 -4
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/models.py +3 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/uv.lock +1 -1
- {tablassert-7.3.3 → tablassert-7.3.5}/.github/workflows/autotag.yml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/.github/workflows/docker.yml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/.github/workflows/docs.yml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/.github/workflows/pipy.yml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/.gitignore +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/.pre-commit-config.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/AGENTS.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/CITATION.cff +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/CONTRIBUTING.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/Dockerfile +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/LICENSE +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/api/utils.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/cli.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/examples/tutorial-data.csv +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/examples/tutorial-graph.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/examples/tutorial-table.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/index.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/docs/installation.md +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/llms.txt +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/__init__.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/cli.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/enums.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/fullmap.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/ingests.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/lib.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/log.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/nlp.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/qc.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/src/tablassert/utils.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/__init__.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/conftest.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/fixtures/minimal_section.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_enums.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_fullmap.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_ingests.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_lib.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_models.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_nlp.py +0 -0
- {tablassert-7.3.3 → tablassert-7.3.5}/tests/test_utils.py +0 -0
|
@@ -2,13 +2,28 @@
|
|
|
2
2
|
|
|
3
3
|
All notable changes to this project are documented in this file.
|
|
4
4
|
|
|
5
|
+
## 7.3.5 - 2026-04-29
|
|
6
|
+
|
|
7
|
+
### Documentation
|
|
8
|
+
- Tightened the table-configuration reference so field requirements, defaults, accepted enum values, row indexing, and column-reference examples match the strict `Section` schema and section-merging behavior implemented in `models.py`, `ingests.py`, and the runtime loader.
|
|
9
|
+
|
|
10
|
+
## 7.3.4 - 2026-04-28
|
|
11
|
+
|
|
12
|
+
### Bug Fixes
|
|
13
|
+
- Fixed `downloader.from_url()` failing on URLs that trigger an immediate download. The Playwright session now opens a browser context with `accept_downloads=True`, wraps `page.goto()` inside `page.expect_download()`, and tolerates the expected `net::ERR_ABORTED` navigation error that fires when the response is a download rather than a page.
|
|
14
|
+
|
|
15
|
+
### Documentation
|
|
16
|
+
- Documented `miscellaneous notes` as a freetext catch-all annotation in the table configuration and advanced-example pages — used for assay caveats, non-standard units, and qualitative observations that don't map cleanly to a structured field. Supports both `method: value` (constant) and `method: column` (per-row).
|
|
17
|
+
- Documented Polars regex constraints for the `regex` and `remove` transforms: patterns are passed to Polars `str.replace_all()` (Rust `regex` crate), so capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported and will raise at parse time. Chain simple substitutions instead, or capture residual context in a `miscellaneous notes` annotation.
|
|
18
|
+
|
|
5
19
|
## 7.3.3 - 2026-04-08
|
|
6
20
|
|
|
7
21
|
### Bug Fixes
|
|
8
|
-
- Changed datassert shard count
|
|
22
|
+
- Changed datassert shard count to 10 (`SHARDS` constant in `fullmap.py`) to correspond to the current datassert database layout.
|
|
9
23
|
|
|
10
24
|
### Documentation
|
|
11
|
-
- Updated
|
|
25
|
+
- Updated shard count references across documentation and examples to reflect the current 10-shard datassert layout.
|
|
26
|
+
- Corrected provenance examples so `repo` carries the namespace prefix and `publication` carries the repository-local identifier.
|
|
12
27
|
|
|
13
28
|
## 7.3.2 - 2026-04-03
|
|
14
29
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: tablassert
|
|
3
|
-
Version: 7.3.
|
|
3
|
+
Version: 7.3.5
|
|
4
4
|
Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
|
|
5
5
|
Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
|
|
6
6
|
Project-URL: Source, https://github.com/SkyeAv/Tablassert
|
|
@@ -98,14 +98,16 @@ docker run --rm \
|
|
|
98
98
|
```bash
|
|
99
99
|
# Build a knowledge graph from a YAML configuration
|
|
100
100
|
$ tablassert build-knowledge-graph graph-config.yaml
|
|
101
|
-
⠋ Loading
|
|
102
|
-
⠋
|
|
103
|
-
⠋
|
|
104
|
-
⠋
|
|
105
|
-
|
|
101
|
+
⠋ Loading Tables...
|
|
102
|
+
⠋ Extracting Sections...
|
|
103
|
+
⠋ Building TCode...
|
|
104
|
+
⠋ Collecting Instructions...
|
|
105
|
+
⠋ Building Subgraphs...
|
|
106
|
+
⠋ Compiling Graph...
|
|
107
|
+
✓ Finished!
|
|
106
108
|
```
|
|
107
109
|
|
|
108
|
-
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
|
|
110
|
+
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
|
|
109
111
|
|
|
110
112
|
## Key Features
|
|
111
113
|
|
|
@@ -46,14 +46,16 @@ docker run --rm \
|
|
|
46
46
|
```bash
|
|
47
47
|
# Build a knowledge graph from a YAML configuration
|
|
48
48
|
$ tablassert build-knowledge-graph graph-config.yaml
|
|
49
|
-
⠋ Loading
|
|
50
|
-
⠋
|
|
51
|
-
⠋
|
|
52
|
-
⠋
|
|
53
|
-
|
|
49
|
+
⠋ Loading Tables...
|
|
50
|
+
⠋ Extracting Sections...
|
|
51
|
+
⠋ Building TCode...
|
|
52
|
+
⠋ Collecting Instructions...
|
|
53
|
+
⠋ Building Subgraphs...
|
|
54
|
+
⠋ Compiling Graph...
|
|
55
|
+
✓ Finished!
|
|
54
56
|
```
|
|
55
57
|
|
|
56
|
-
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
|
|
58
|
+
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required. Intermediate section artifacts are staged in `.storassert/` during the build.
|
|
57
59
|
|
|
58
60
|
## Key Features
|
|
59
61
|
|
|
@@ -36,7 +36,7 @@ Column name containing text strings to resolve.
|
|
|
36
36
|
|
|
37
37
|
**`conns: list[object]`**
|
|
38
38
|
|
|
39
|
-
List of
|
|
39
|
+
List of 10 DuckDB shard connections to the datassert database.
|
|
40
40
|
|
|
41
41
|
Each shard contains:
|
|
42
42
|
- Synonym mappings (text → CURIE)
|
|
@@ -97,7 +97,7 @@ Returns a Polars LazyFrame with these columns added:
|
|
|
97
97
|
| `{col} taxon` | NCBI Taxon ID | `"NCBITaxon:9606"` |
|
|
98
98
|
| `{col} source` | Source database | `"HGNC"` |
|
|
99
99
|
| `{col} source version` | Database version | `"2025-01"` |
|
|
100
|
-
| `{col} nlp level` | NLP processing level | `
|
|
100
|
+
| `{col} nlp level` | NLP processing level | `1` or `2` |
|
|
101
101
|
|
|
102
102
|
### DuckDB Query
|
|
103
103
|
|
|
@@ -125,11 +125,11 @@ from tablassert.enums import Categories
|
|
|
125
125
|
import duckdb
|
|
126
126
|
import polars as pl
|
|
127
127
|
|
|
128
|
-
# Open all
|
|
128
|
+
# Open all 10 shard connections
|
|
129
129
|
datassert_dir = "/path/to/datassert"
|
|
130
130
|
conns = [
|
|
131
131
|
duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
|
|
132
|
-
for i in range(
|
|
132
|
+
for i in range(10)
|
|
133
133
|
]
|
|
134
134
|
|
|
135
135
|
# LazyFrame with data to resolve
|
|
@@ -167,11 +167,11 @@ from tablassert.fullmap import resolve
|
|
|
167
167
|
from tablassert.nlp import level_one, level_two
|
|
168
168
|
from tablassert.enums import Categories
|
|
169
169
|
|
|
170
|
-
# Open all
|
|
170
|
+
# Open all 10 shard connections
|
|
171
171
|
datassert_dir = "/path/to/datassert"
|
|
172
172
|
conns = [
|
|
173
173
|
duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
|
|
174
|
-
for i in range(
|
|
174
|
+
for i in range(10)
|
|
175
175
|
]
|
|
176
176
|
|
|
177
177
|
# Map a list of gene symbols to CURIEs
|
|
@@ -2,11 +2,11 @@
|
|
|
2
2
|
|
|
3
3
|
The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
|
|
4
4
|
|
|
5
|
-
It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all
|
|
5
|
+
It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 10 DuckDB shard connections, executing entity resolution, and returning results as a plain Python list of row dictionaries.
|
|
6
6
|
|
|
7
7
|
## resolve_many()
|
|
8
8
|
|
|
9
|
-
Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a
|
|
9
|
+
Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a list of row dictionaries.
|
|
10
10
|
|
|
11
11
|
### Function Signature
|
|
12
12
|
|
|
@@ -26,9 +26,9 @@ def resolve_many(
|
|
|
26
26
|
|
|
27
27
|
**`col: str`**
|
|
28
28
|
|
|
29
|
-
Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in
|
|
29
|
+
Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in each returned row dictionary.
|
|
30
30
|
|
|
31
|
-
For example, if `col="gene"`,
|
|
31
|
+
For example, if `col="gene"`, each returned row dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
|
|
32
32
|
|
|
33
33
|
**`entities: Iterable[str]`**
|
|
34
34
|
|
|
@@ -38,7 +38,7 @@ Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generato
|
|
|
38
38
|
|
|
39
39
|
**`datassert: Path`**
|
|
40
40
|
|
|
41
|
-
Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing
|
|
41
|
+
Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 10 DuckDB shard files (`0.duckdb` through `9.duckdb`).
|
|
42
42
|
|
|
43
43
|
Each shard contains:
|
|
44
44
|
- Synonym mappings (text → CURIE)
|
|
@@ -86,7 +86,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `
|
|
|
86
86
|
| `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
|
|
87
87
|
| `{col} source` | Source database | `"HGNC"` |
|
|
88
88
|
| `{col} source version` | Database version | `"2025-01"` |
|
|
89
|
-
| `{col} nlp level` | NLP processing level used for match | `
|
|
89
|
+
| `{col} nlp level` | NLP processing level used for match | `1` or `2` |
|
|
90
90
|
|
|
91
91
|
**Important:** Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by `resolve()`. The returned list may therefore be shorter than the input iterable.
|
|
92
92
|
|
|
@@ -98,7 +98,7 @@ Each dictionary contains the following keys (where `{col}` is the value of the `
|
|
|
98
98
|
|
|
99
99
|
2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
|
|
100
100
|
|
|
101
|
-
3. **DuckDB connection management** — Opens all
|
|
101
|
+
3. **DuckDB connection management** — Opens all 10 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
|
|
102
102
|
|
|
103
103
|
4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
|
|
104
104
|
|
|
@@ -224,8 +224,8 @@ Both levels are queried during resolution. Level one (exact case-insensitive mat
|
|
|
224
224
|
### Error Handling
|
|
225
225
|
|
|
226
226
|
- If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
|
|
227
|
-
- If `entities` is empty, the function returns
|
|
228
|
-
- The `ExitStack` ensures all
|
|
227
|
+
- If `entities` is empty, the function returns `[]`.
|
|
228
|
+
- The `ExitStack` ensures all 10 DuckDB connections are closed even if resolution raises an exception.
|
|
229
229
|
- Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
|
|
230
230
|
|
|
231
231
|
## Integration
|
|
@@ -82,7 +82,7 @@ Two fuzzy matching algorithms:
|
|
|
82
82
|
1. **Ratio:** Overall string similarity
|
|
83
83
|
2. **Partial token sort ratio:** Combined token/subsequence matching
|
|
84
84
|
|
|
85
|
-
**Threshold:**
|
|
85
|
+
**Threshold:** 20% similarity
|
|
86
86
|
|
|
87
87
|
```python
|
|
88
88
|
fuzz.ratio(original, preferred) >= 20
|
|
@@ -125,7 +125,7 @@ return similarity >= 0.2
|
|
|
125
125
|
- Graph optimization level: ALL
|
|
126
126
|
- ONNX session caching
|
|
127
127
|
|
|
128
|
-
Lazy-loaded on first `
|
|
128
|
+
Lazy-loaded on first `fullmap_audit()` call that reaches the embedding stage, then reused for subsequent calls.
|
|
129
129
|
|
|
130
130
|
### Model Caching
|
|
131
131
|
|
|
@@ -135,7 +135,7 @@ BioBERT is lazy-loaded on first use and cached globally for the lifetime of the
|
|
|
135
135
|
# ? Lazy-loads BioBERT once on first batch audit call, then caches globally
|
|
136
136
|
```
|
|
137
137
|
|
|
138
|
-
**Cache location:**
|
|
138
|
+
**Cache location:** Downloaded model files are cached on disk in `.onnxassert/`, and the loaded model object is cached in memory for the lifetime of the process.
|
|
139
139
|
|
|
140
140
|
**Cache strategy:** BioBERT model loaded once on first batch audit, then reused globally
|
|
141
141
|
|
|
@@ -0,0 +1,13 @@
|
|
|
1
|
+
# Changelog
|
|
2
|
+
|
|
3
|
+
The canonical release history lives in the repository root at [`CHANGELOG.md`](https://github.com/SkyeAv/Tablassert/blob/main/CHANGELOG.md).
|
|
4
|
+
|
|
5
|
+
## Current Release Notes
|
|
6
|
+
|
|
7
|
+
## 7.3.5 - 2026-04-29
|
|
8
|
+
|
|
9
|
+
### Documentation
|
|
10
|
+
|
|
11
|
+
- The table-configuration reference now matches the strict runtime schema and merge behavior, including field defaults, requiredness, accepted enum values, zero-based row indexing, and valid column-reference examples.
|
|
12
|
+
|
|
13
|
+
For older releases and the full project history, open the root `CHANGELOG.md` in the repository.
|
|
@@ -66,7 +66,7 @@ template:
|
|
|
66
66
|
# Provenance: Publication and curation info
|
|
67
67
|
provenance:
|
|
68
68
|
repo: PMC
|
|
69
|
-
publication:
|
|
69
|
+
publication: 11708054
|
|
70
70
|
contributors:
|
|
71
71
|
- kind: curation
|
|
72
72
|
name: Skye Lane Goetz
|
|
@@ -103,12 +103,16 @@ template:
|
|
|
103
103
|
method: value
|
|
104
104
|
encoding: Spearman correlation
|
|
105
105
|
|
|
106
|
-
#
|
|
106
|
+
# Freetext catch-all — anything that doesn't map cleanly to a structured
|
|
107
|
+
# annotation (study design caveats, non-standard units, qualitative
|
|
108
|
+
# observations) belongs here rather than being dropped.
|
|
107
109
|
- annotation: miscellaneous notes
|
|
108
110
|
method: value
|
|
109
111
|
encoding: Correlation analysis between microbial composition and 13C-tamoxifen abundance after FDR correction
|
|
110
112
|
```
|
|
111
113
|
|
|
114
|
+
> **`miscellaneous notes` is a freetext escape hatch.** Use it whenever the source carries context you can't otherwise cleanly encode — assay variants, post-hoc qualifiers, "values are log-transformed", etc. It accepts `method: value` for a constant note across the whole table or `method: column` to pull per-row notes from the source.
|
|
115
|
+
|
|
112
116
|
## Key Techniques
|
|
113
117
|
|
|
114
118
|
### Excel Column References
|
|
@@ -143,6 +147,8 @@ The subject field uses three regex transformations in sequence:
|
|
|
143
147
|
```
|
|
144
148
|
`"Lactobacillus sp"` → `"Lactobacillus sp. "`
|
|
145
149
|
|
|
150
|
+
> **Regex constraint:** Each `pattern` is handed to Polars `str.replace_all()` (Rust `regex` crate). **Capturing groups (`(...)` / `\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not allowed** and will fail validation. Express transformations as a sequence of simple anchored / character-class substitutions instead — the pipeline above is a deliberate three-step chain because no single capturing-group pattern is permitted. If the transformation can't be expressed without those features, capture the leftover context in a `miscellaneous notes` annotation rather than fighting the regex engine.
|
|
151
|
+
|
|
146
152
|
### Taxonomic Filtering
|
|
147
153
|
|
|
148
154
|
Prevent incorrect entity resolution:
|
|
@@ -297,7 +303,7 @@ template:
|
|
|
297
303
|
|
|
298
304
|
provenance:
|
|
299
305
|
repo: PMC
|
|
300
|
-
publication:
|
|
306
|
+
publication: 12345678
|
|
301
307
|
contributors:
|
|
302
308
|
- kind: curation
|
|
303
309
|
name: Skye Lane Goetz
|
|
@@ -358,7 +364,7 @@ template:
|
|
|
358
364
|
|
|
359
365
|
provenance:
|
|
360
366
|
repo: PMC
|
|
361
|
-
publication:
|
|
367
|
+
publication: 87654321
|
|
362
368
|
contributors:
|
|
363
369
|
- kind: curation
|
|
364
370
|
name: Skye Lane Goetz
|
|
@@ -60,7 +60,7 @@ See [Table Configuration](table.md) for details.
|
|
|
60
60
|
|
|
61
61
|
**`datassert: path`**
|
|
62
62
|
|
|
63
|
-
Path to the [datassert](../datassert.md) directory for entity resolution. Tablassert opens
|
|
63
|
+
Path to the [datassert](../datassert.md) directory for entity resolution. Tablassert opens 10 shard files from `datassert/data/{0..9}.duckdb`. This database contains:
|
|
64
64
|
- Synonym mappings (text → CURIE)
|
|
65
65
|
- Biolink categories
|
|
66
66
|
- Taxonomic information
|
|
@@ -39,14 +39,14 @@ template:
|
|
|
39
39
|
|
|
40
40
|
sections:
|
|
41
41
|
- statement: # Section 1: Gene-Disease
|
|
42
|
-
subject: {encoding:
|
|
42
|
+
subject: {method: column, encoding: A}
|
|
43
43
|
predicate: associated_with
|
|
44
|
-
object: {encoding:
|
|
44
|
+
object: {method: column, encoding: B}
|
|
45
45
|
|
|
46
46
|
- statement: # Section 2: Gene-Pathway
|
|
47
|
-
subject: {encoding:
|
|
47
|
+
subject: {method: column, encoding: A}
|
|
48
48
|
predicate: participates_in
|
|
49
|
-
object: {encoding:
|
|
49
|
+
object: {method: column, encoding: C}
|
|
50
50
|
```
|
|
51
51
|
|
|
52
52
|
### Merge Behavior (fastmerge)
|
|
@@ -92,15 +92,15 @@ sections:
|
|
|
92
92
|
**Single output:** Template only
|
|
93
93
|
```yaml
|
|
94
94
|
template:
|
|
95
|
-
source: {kind: text, local: data.csv}
|
|
95
|
+
source: {kind: text, local: data.csv, url: https://example.com/data.csv}
|
|
96
96
|
statement: {...}
|
|
97
97
|
```
|
|
98
98
|
|
|
99
99
|
**Multiple predicates, same source:**
|
|
100
100
|
```yaml
|
|
101
101
|
template:
|
|
102
|
-
source: {kind: excel, local: data.xlsx}
|
|
103
|
-
provenance: {publication:
|
|
102
|
+
source: {kind: excel, local: data.xlsx, url: https://example.com/data.xlsx}
|
|
103
|
+
provenance: {repo: PMC, publication: 123, contributors: [{name: Example User, date: 27 JAN 2026}]}
|
|
104
104
|
|
|
105
105
|
sections:
|
|
106
106
|
- statement: {predicate: treats}
|
|
@@ -110,14 +110,14 @@ sections:
|
|
|
110
110
|
**Multiple columns, shared provenance:**
|
|
111
111
|
```yaml
|
|
112
112
|
template:
|
|
113
|
-
source: {kind: text, local: data.csv}
|
|
114
|
-
provenance: {publication:
|
|
113
|
+
source: {kind: text, local: data.csv, url: https://example.com/data.csv}
|
|
114
|
+
provenance: {repo: PMID, publication: 456, contributors: [{name: Example User, date: 27 JAN 2026}]}
|
|
115
115
|
statement:
|
|
116
|
-
subject: {encoding:
|
|
116
|
+
subject: {method: column, encoding: A}
|
|
117
117
|
|
|
118
118
|
sections:
|
|
119
|
-
- statement: {object: {encoding:
|
|
120
|
-
- statement: {object: {encoding:
|
|
119
|
+
- statement: {object: {method: column, encoding: B}}
|
|
120
|
+
- statement: {object: {method: column, encoding: C}}
|
|
121
121
|
```
|
|
122
122
|
|
|
123
123
|
## Configuration Schema
|
|
@@ -126,8 +126,8 @@ sections:
|
|
|
126
126
|
|
|
127
127
|
| Field | Type | Required | Description |
|
|
128
128
|
|-------|------|----------|-------------|
|
|
129
|
-
| `syntax` | String |
|
|
130
|
-
| `status` | String | No | Development status
|
|
129
|
+
| `syntax` | String | No | Configuration version. Defaults to `"TC3"`. |
|
|
130
|
+
| `status` | String | No | Development status. Defaults to `"alpha"`; allowed values are `"alpha"`, `"beta"`, `"primetime"`. |
|
|
131
131
|
|
|
132
132
|
### Source
|
|
133
133
|
|
|
@@ -137,12 +137,12 @@ Defines the data file location and format.
|
|
|
137
137
|
|
|
138
138
|
| Field | Type | Required | Description |
|
|
139
139
|
|-------|------|----------|-------------|
|
|
140
|
-
| `kind` | String |
|
|
140
|
+
| `kind` | String | No | Source kind. Model default is `"excel"`, but specify it explicitly in configs. |
|
|
141
141
|
| `local` | Path | Yes | Local file path for caching |
|
|
142
142
|
| `url` | URL | Yes | Download URL (HTTP/HTTPS) |
|
|
143
|
-
| `sheet` | String | No | Sheet name
|
|
144
|
-
| `row_slice` | List[Int\|"auto"] | No |
|
|
145
|
-
| `rows` | List[Int] | No |
|
|
143
|
+
| `sheet` | String | No | Sheet name. Defaults to `"Sheet1"`. |
|
|
144
|
+
| `row_slice` | List[Int\|"auto"] | No | Two-value zero-based crop bounds: `[start, stop]`. Each value may be an integer or `"auto"`. |
|
|
145
|
+
| `rows` | List[Int] | No | Zero-based row indices to keep after any `row_slice` crop. |
|
|
146
146
|
| `reindex` | List[Reindex] | No | Conditional row filtering |
|
|
147
147
|
|
|
148
148
|
**Example:**
|
|
@@ -153,7 +153,7 @@ source:
|
|
|
153
153
|
url: https://example.com/data.xlsx
|
|
154
154
|
sheet: "Sheet1"
|
|
155
155
|
row_slice:
|
|
156
|
-
-
|
|
156
|
+
- 1 # Start at the second physical row
|
|
157
157
|
- auto # Read to end
|
|
158
158
|
```
|
|
159
159
|
|
|
@@ -161,12 +161,12 @@ source:
|
|
|
161
161
|
|
|
162
162
|
| Field | Type | Required | Description |
|
|
163
163
|
|-------|------|----------|-------------|
|
|
164
|
-
| `kind` | String |
|
|
164
|
+
| `kind` | String | No | Source kind. Model default is `"text"`, but specify it explicitly in configs. |
|
|
165
165
|
| `local` | Path | Yes | Local file path for caching |
|
|
166
166
|
| `url` | URL | Yes | Download URL |
|
|
167
|
-
| `delimiter` | String | No |
|
|
168
|
-
| `row_slice` | List[Int\|"auto"] | No |
|
|
169
|
-
| `rows` | List[Int] | No |
|
|
167
|
+
| `delimiter` | String | No | Field delimiter. Defaults to `","`. |
|
|
168
|
+
| `row_slice` | List[Int\|"auto"] | No | Two-value zero-based crop bounds: `[start, stop]`. Each value may be an integer or `"auto"`. |
|
|
169
|
+
| `rows` | List[Int] | No | Zero-based row indices to keep after any `row_slice` crop. |
|
|
170
170
|
| `reindex` | List[Reindex] | No | Conditional filtering |
|
|
171
171
|
|
|
172
172
|
**Example:**
|
|
@@ -187,16 +187,16 @@ Filter rows based on column values.
|
|
|
187
187
|
|
|
188
188
|
| Field | Type | Description |
|
|
189
189
|
|-------|------|-------------|
|
|
190
|
-
| `column` | String |
|
|
191
|
-
| `comparison` | String | Operator
|
|
190
|
+
| `column` | String | Source column letters to evaluate (`A`-`ZZZ`) |
|
|
191
|
+
| `comparison` | String | Operator. Defaults to `"ne"`; allowed values are `"eq"`, `"ne"`, `"lt"`, `"le"`, `"gt"`, `"ge"`. |
|
|
192
192
|
| `comparator` | String\|Int\|Float | Value to compare against |
|
|
193
193
|
|
|
194
194
|
**Example:**
|
|
195
195
|
```yaml
|
|
196
196
|
reindex:
|
|
197
|
-
- column:
|
|
197
|
+
- column: C
|
|
198
198
|
comparison: lt
|
|
199
|
-
comparator: 0.05 # Keep rows where
|
|
199
|
+
comparator: 0.05 # Keep rows where column C < 0.05
|
|
200
200
|
```
|
|
201
201
|
|
|
202
202
|
### Statement (Triple Definition)
|
|
@@ -206,7 +206,7 @@ Defines subject-predicate-object relationships.
|
|
|
206
206
|
| Field | Type | Required | Description |
|
|
207
207
|
|-------|------|----------|-------------|
|
|
208
208
|
| `subject` | NodeEncoding | Yes | Subject entity configuration |
|
|
209
|
-
| `predicate` | String |
|
|
209
|
+
| `predicate` | String | No | Biolink predicate. Defaults to `"related_to"`. |
|
|
210
210
|
| `object` | NodeEncoding | Yes | Object entity configuration |
|
|
211
211
|
| `qualifiers` | List[Qualifier] | No | Edge qualifiers (context) |
|
|
212
212
|
|
|
@@ -215,12 +215,12 @@ Defines subject-predicate-object relationships.
|
|
|
215
215
|
statement:
|
|
216
216
|
subject:
|
|
217
217
|
method: column
|
|
218
|
-
encoding:
|
|
218
|
+
encoding: A
|
|
219
219
|
prioritize: [Gene]
|
|
220
220
|
predicate: treats
|
|
221
221
|
object:
|
|
222
222
|
method: column
|
|
223
|
-
encoding:
|
|
223
|
+
encoding: B
|
|
224
224
|
prioritize: [Disease]
|
|
225
225
|
```
|
|
226
226
|
|
|
@@ -230,11 +230,11 @@ Defines how to extract and resolve entities.
|
|
|
230
230
|
|
|
231
231
|
| Field | Type | Required | Description |
|
|
232
232
|
|-------|------|----------|-------------|
|
|
233
|
-
| `method` | String |
|
|
234
|
-
| `encoding` | String\|Int\|Float | Yes | Literal value or column
|
|
233
|
+
| `method` | String | No | `"value"` (literal) or `"column"` (source column letters). Defaults to `"value"`. |
|
|
234
|
+
| `encoding` | String\|Int\|Float | Yes | Literal value or source column letters, depending on `method` |
|
|
235
235
|
| `taxon` | Int | No | NCBI Taxon ID for filtering (e.g., `9606` for human) |
|
|
236
|
-
| `prioritize` | List[String] | No | Preferred Biolink categories |
|
|
237
|
-
| `avoid` | List[String] | No | Excluded Biolink categories |
|
|
236
|
+
| `prioritize` | List[String] | No | Preferred Biolink categories (must be valid `Categories` enum values such as `Gene`, `Protein`) |
|
|
237
|
+
| `avoid` | List[String] | No | Excluded Biolink categories (must be valid `Categories` enum values) |
|
|
238
238
|
| `regex` | List[Regex] | No | Pattern replacements |
|
|
239
239
|
| `fill` | String | No | Null-filling strategy: `"forward"`, `"backward"`, `"min"`, `"max"`, `"mean"`, `"zero"`, `"one"` |
|
|
240
240
|
| `remove` | List[String] | No | Strings to filter out |
|
|
@@ -253,11 +253,12 @@ subject:
|
|
|
253
253
|
encoding: CHEBI:41774 # All rows get this CURIE
|
|
254
254
|
```
|
|
255
255
|
|
|
256
|
-
**`method: column`** - Reference a column
|
|
256
|
+
**`method: column`** - Reference a source column
|
|
257
257
|
|
|
258
|
-
|
|
259
|
-
- Column A
|
|
260
|
-
- Column B
|
|
258
|
+
Source files are read without headers, so column references are Excel-style letters:
|
|
259
|
+
- Column A -> `"A"`
|
|
260
|
+
- Column B -> `"B"`
|
|
261
|
+
- Column AA -> `"AA"`
|
|
261
262
|
|
|
262
263
|
```yaml
|
|
263
264
|
subject:
|
|
@@ -265,12 +266,7 @@ subject:
|
|
|
265
266
|
encoding: A # Read from column A
|
|
266
267
|
```
|
|
267
268
|
|
|
268
|
-
|
|
269
|
-
```yaml
|
|
270
|
-
subject:
|
|
271
|
-
method: column
|
|
272
|
-
encoding: gene_symbol # Read from "gene_symbol" column
|
|
273
|
-
```
|
|
269
|
+
At runtime those letters are converted internally to Polars column names such as `column_1`, but those internal names are not valid configuration values.
|
|
274
270
|
|
|
275
271
|
#### Taxonomic Filtering
|
|
276
272
|
|
|
@@ -278,7 +274,8 @@ subject:
|
|
|
278
274
|
|
|
279
275
|
```yaml
|
|
280
276
|
subject:
|
|
281
|
-
|
|
277
|
+
method: column
|
|
278
|
+
encoding: A
|
|
282
279
|
taxon: 9606 # Only human genes (Homo sapiens)
|
|
283
280
|
```
|
|
284
281
|
|
|
@@ -305,7 +302,8 @@ If "TP53" maps to both Gene and Protein, prefer Gene.
|
|
|
305
302
|
|
|
306
303
|
```yaml
|
|
307
304
|
subject:
|
|
308
|
-
|
|
305
|
+
method: column
|
|
306
|
+
encoding: A
|
|
309
307
|
prioritize:
|
|
310
308
|
- OrganismTaxon
|
|
311
309
|
avoid:
|
|
@@ -330,6 +328,8 @@ subject:
|
|
|
330
328
|
|
|
331
329
|
Executed in order.
|
|
332
330
|
|
|
331
|
+
> **Regex dialect:** Patterns are passed directly to Polars `str.replace_all()`, which uses the Rust [`regex`](https://docs.rs/regex/) crate. Only features supported by that engine work — in particular, **backreferences (`\1`) and lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`) are not supported** and will raise an error at parse time. Stick to character classes, anchors (`^`, `$`), quantifiers, alternation (`a|b`), and groups (`(...)`, or non-capturing `(?:...)`) if grouping is needed. If a transformation is too complex to express, prefer chaining several simple substitutions or capturing the residual context in a `miscellaneous notes` annotation instead.
|
|
332
|
+
|
|
333
333
|
**`remove: list[string]`** - Filter out specific strings
|
|
334
334
|
|
|
335
335
|
```yaml
|
|
@@ -339,6 +339,8 @@ subject:
|
|
|
339
339
|
- "^NA " # Remove rows starting with "NA "
|
|
340
340
|
```
|
|
341
341
|
|
|
342
|
+
Same regex constraints apply as the `regex` field — Polars-compatible patterns only, no backreferences or lookarounds.
|
|
343
|
+
|
|
342
344
|
**`prefix` / `suffix`** - Add text
|
|
343
345
|
|
|
344
346
|
```yaml
|
|
@@ -362,7 +364,8 @@ Available strategies:
|
|
|
362
364
|
|
|
363
365
|
```yaml
|
|
364
366
|
subject:
|
|
365
|
-
|
|
367
|
+
method: column
|
|
368
|
+
encoding: A
|
|
366
369
|
fill: forward # Propagate values down through null rows
|
|
367
370
|
```
|
|
368
371
|
|
|
@@ -370,7 +373,7 @@ subject:
|
|
|
370
373
|
annotations:
|
|
371
374
|
- annotation: expression_level
|
|
372
375
|
method: column
|
|
373
|
-
encoding:
|
|
376
|
+
encoding: C
|
|
374
377
|
fill: mean # Replace nulls with column average
|
|
375
378
|
```
|
|
376
379
|
|
|
@@ -380,7 +383,8 @@ annotations:
|
|
|
380
383
|
|
|
381
384
|
```yaml
|
|
382
385
|
object:
|
|
383
|
-
|
|
386
|
+
method: column
|
|
387
|
+
encoding: B
|
|
384
388
|
explode_by: ";" # "P1;P2;P3" → 3 separate edges
|
|
385
389
|
```
|
|
386
390
|
|
|
@@ -398,7 +402,7 @@ Add context to edges (anatomical location, species, etc.).
|
|
|
398
402
|
|
|
399
403
|
| Field | Type | Description |
|
|
400
404
|
|-------|------|-------------|
|
|
401
|
-
| `qualifier` | String | Biolink qualifier (e.g., `"species_context_qualifier"`) |
|
|
405
|
+
| `qualifier` | String | Biolink qualifier from the `Qualifiers` enum (e.g., `"species_context_qualifier"`) |
|
|
402
406
|
| (inherits NodeEncoding) | | All NodeEncoding fields available |
|
|
403
407
|
|
|
404
408
|
**Example:**
|
|
@@ -415,15 +419,15 @@ Required metadata about data source.
|
|
|
415
419
|
|
|
416
420
|
| Field | Type | Required | Description |
|
|
417
421
|
|-------|------|----------|-------------|
|
|
418
|
-
| `repo` | String |
|
|
419
|
-
| `publication` | String | Yes |
|
|
422
|
+
| `repo` | String | No | Repository. Defaults to `"PMC"`; allowed values are `"PMC"`, `"PMID"`. |
|
|
423
|
+
| `publication` | String | Yes | Repository-local identifier appended to `repo:` (e.g., `"11708054"`, `"123"`) |
|
|
420
424
|
| `contributors` | List[Contributor] | Yes | Curation information |
|
|
421
425
|
|
|
422
426
|
**Contributor fields:**
|
|
423
427
|
|
|
424
428
|
| Field | Type | Required | Description |
|
|
425
429
|
|-------|------|----------|-------------|
|
|
426
|
-
| `kind` | String |
|
|
430
|
+
| `kind` | String | No | Contributor role. Defaults to `"curation"`; allowed values are `"curation"`, `"validation"`, `"tool"`. |
|
|
427
431
|
| `name` | String | Yes | Contributor name |
|
|
428
432
|
| `date` | String | Yes | Date (free format) |
|
|
429
433
|
| `organizations` | List[String] | No | Affiliations |
|
|
@@ -433,7 +437,7 @@ Required metadata about data source.
|
|
|
433
437
|
```yaml
|
|
434
438
|
provenance:
|
|
435
439
|
repo: PMC
|
|
436
|
-
publication:
|
|
440
|
+
publication: 11708054
|
|
437
441
|
contributors:
|
|
438
442
|
- kind: curation
|
|
439
443
|
name: Skye Lane Goetz
|
|
@@ -467,8 +471,16 @@ annotations:
|
|
|
467
471
|
- annotation: multiple testing correction method
|
|
468
472
|
method: value
|
|
469
473
|
encoding: "Benjamini Hochberg"
|
|
474
|
+
|
|
475
|
+
# Freetext catch-all for context that doesn't fit a structured field —
|
|
476
|
+
# study caveats, units, post-hoc notes, anything you'd otherwise lose.
|
|
477
|
+
- annotation: miscellaneous notes
|
|
478
|
+
method: value
|
|
479
|
+
encoding: "Values are log2 fold-change relative to vehicle control; n=3 biological replicates per arm"
|
|
470
480
|
```
|
|
471
481
|
|
|
482
|
+
> **Tip:** When source data carries information that can't be cleanly mapped to a structured annotation (assay-specific caveats, non-standard units, qualitative observations), add a `miscellaneous notes` annotation rather than forcing it into another field or dropping it. It accepts both `method: value` (one note for the whole table) and `method: column` (per-row notes from the source).
|
|
483
|
+
|
|
472
484
|
## Complete Example
|
|
473
485
|
|
|
474
486
|
Minimal table configuration:
|
|
@@ -488,17 +500,17 @@ template:
|
|
|
488
500
|
statement:
|
|
489
501
|
subject:
|
|
490
502
|
method: column
|
|
491
|
-
encoding:
|
|
503
|
+
encoding: A
|
|
492
504
|
prioritize: [Gene]
|
|
493
505
|
predicate: associated_with
|
|
494
506
|
object:
|
|
495
507
|
method: column
|
|
496
|
-
encoding:
|
|
508
|
+
encoding: B
|
|
497
509
|
prioritize: [Disease]
|
|
498
510
|
|
|
499
511
|
provenance:
|
|
500
512
|
repo: PMID
|
|
501
|
-
publication:
|
|
513
|
+
publication: 12345678
|
|
502
514
|
contributors:
|
|
503
515
|
- kind: curation
|
|
504
516
|
name: Example User
|
|
@@ -507,7 +519,7 @@ template:
|
|
|
507
519
|
annotations:
|
|
508
520
|
- annotation: p value
|
|
509
521
|
method: column
|
|
510
|
-
encoding:
|
|
522
|
+
encoding: C
|
|
511
523
|
```
|
|
512
524
|
|
|
513
525
|
## Next Steps
|
|
@@ -33,7 +33,7 @@ The build command automatically downloads BABEL exports from RENCI (`https://sta
|
|
|
33
33
|
1. **Download** — BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under `./datassert/downloads/`.
|
|
34
34
|
2. **Lookup** — Class files (`*.ndjson.lz4`) are read to build an in-memory equivalent-identifier lookup.
|
|
35
35
|
3. **Parquet Staging** — Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to `./datassert/parquets/`.
|
|
36
|
-
4. **DuckDB Generation** — Parquet files are loaded into
|
|
36
|
+
4. **DuckDB Generation** — Parquet files are loaded into 10 sharded DuckDB databases under `./datassert/data/`.
|
|
37
37
|
|
|
38
38
|
### Examples
|
|
39
39
|
|
|
@@ -57,11 +57,11 @@ datassert build --use-existing-parquets
|
|
|
57
57
|
|
|
58
58
|
## Output Artifacts
|
|
59
59
|
|
|
60
|
-
-
|
|
60
|
+
- 10 sharded DuckDB databases are written to `./datassert/data/{0..9}.duckdb`.
|
|
61
61
|
- Each shard contains `SOURCES`, `CATEGORIES`, `CURIES`, and `SYNONYMS` tables, deduplicated, sorted, and indexed for query performance.
|
|
62
|
-
- Staging Parquet files are written to `./datassert/parquets/{0..
|
|
62
|
+
- Staging Parquet files are written to `./datassert/parquets/{0..9}/`.
|
|
63
63
|
|
|
64
|
-
Terms are routed to shards deterministically via `xxhash64(term) %
|
|
64
|
+
Terms are routed to shards deterministically via `xxhash64(term) % 10`, so a given string always hits the same shard.
|
|
65
65
|
|
|
66
66
|
### Schema
|
|
67
67
|
|
|
@@ -76,14 +76,14 @@ Each shard contains four tables:
|
|
|
76
76
|
|
|
77
77
|
## Usage in Graph Config
|
|
78
78
|
|
|
79
|
-
The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all
|
|
79
|
+
The `datassert:` field in a GC2 graph configuration points to the directory containing the shards. Tablassert opens all 10 shards at startup and passes the connections to `resolve()`.
|
|
80
80
|
|
|
81
81
|
```yaml
|
|
82
82
|
# graph-config.yaml (GC2)
|
|
83
83
|
syntax: GC2
|
|
84
84
|
name: my-graph
|
|
85
85
|
version: "1.0"
|
|
86
|
-
datassert: /path/to/datassert/ # directory containing data/0..
|
|
86
|
+
datassert: /path/to/datassert/ # directory containing data/0..9.duckdb
|
|
87
87
|
tables:
|
|
88
88
|
- ./TABLE/my-table.yaml
|
|
89
89
|
```
|
|
@@ -99,7 +99,7 @@ from tablassert.fullmap import resolve
|
|
|
99
99
|
datassert_dir = "/path/to/datassert"
|
|
100
100
|
conns = [
|
|
101
101
|
duckdb.connect(f"{datassert_dir}/data/{i}.duckdb", read_only=True)
|
|
102
|
-
for i in range(
|
|
102
|
+
for i in range(10)
|
|
103
103
|
]
|
|
104
104
|
```
|
|
105
105
|
|
|
@@ -81,8 +81,8 @@ docker run --rm \
|
|
|
81
81
|
|
|
82
82
|
- **Datassert path** — The graph configuration YAML specifies the `datassert` path for the entity-resolution database. Ensure it is accessible inside the container.
|
|
83
83
|
- **Multiprocessing** — `src/tablassert/cli.py:63` uses `multiprocessing.Pool` for parallel table loading and section extraction.
|
|
84
|
-
- **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all
|
|
85
|
-
- **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across
|
|
84
|
+
- **DuckDB connections** — An `ExitStack` at `src/tablassert/cli.py:81` opens read-only connections to all 10 Datassert DuckDB shards concurrently.
|
|
85
|
+
- **Entity resolution** — The `fullmap` module (`src/tablassert/fullmap.py`) shards terms across 10 DuckDB shards (`SHARDS = 10`) using xxhash64.
|
|
86
86
|
- **Text normalization** — `src/tablassert/nlp.py` provides `level_one` (strip + lowercase) and `level_two` (regex-based cleanup).
|
|
87
87
|
|
|
88
88
|
## CI/CD Integration
|
|
@@ -33,7 +33,7 @@ template:
|
|
|
33
33
|
- Disease
|
|
34
34
|
provenance:
|
|
35
35
|
repo: PMID
|
|
36
|
-
publication:
|
|
36
|
+
publication: 12345678
|
|
37
37
|
contributors:
|
|
38
38
|
- kind: curation
|
|
39
39
|
name: Your Name
|
|
@@ -85,7 +85,7 @@ template:
|
|
|
85
85
|
taxon: 9606
|
|
86
86
|
provenance:
|
|
87
87
|
repo: PMID
|
|
88
|
-
publication:
|
|
88
|
+
publication: 98765432
|
|
89
89
|
contributors:
|
|
90
90
|
- kind: curation
|
|
91
91
|
name: Your Name
|
|
@@ -146,7 +146,7 @@ template:
|
|
|
146
146
|
encoding: CHEBI:41774
|
|
147
147
|
provenance:
|
|
148
148
|
repo: PMC
|
|
149
|
-
publication:
|
|
149
|
+
publication: 11708054
|
|
150
150
|
contributors:
|
|
151
151
|
- kind: curation
|
|
152
152
|
name: Your Name
|
|
@@ -161,11 +161,15 @@ template:
|
|
|
161
161
|
- annotation: assertion method
|
|
162
162
|
method: value
|
|
163
163
|
encoding: "Spearman correlation"
|
|
164
|
+
# Freetext catch-all for context that doesn't fit a structured field.
|
|
165
|
+
- annotation: miscellaneous notes
|
|
166
|
+
method: value
|
|
167
|
+
encoding: "FDR-corrected; samples pooled across two cohorts"
|
|
164
168
|
```
|
|
165
169
|
|
|
166
170
|
**Key techniques:**
|
|
167
171
|
|
|
168
|
-
- **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`)
|
|
172
|
+
- **Regex pipeline** cleans raw taxonomic strings (e.g., `d__Bacteria;p__Firmicutes;g__Lactobacillus` → `Lactobacillus`). Patterns must be Polars `str.replace_all()`-compatible — no backreferences (`\1`) and no lookarounds (`(?=...)`, `(?<=...)`, `(?!...)`, `(?<!...)`). Chain several simple substitutions instead.
|
|
169
173
|
- **Avoid list** (`avoid: [Gene]`) prevents organism names from resolving to gene entities
|
|
170
174
|
- **Fixed-value object** (`method: value`) assigns the same metabolite CURIE to all rows
|
|
171
175
|
- **Excel source** with sheet name and row slicing
|
|
@@ -199,7 +203,7 @@ template:
|
|
|
199
203
|
encoding: PLACEHOLDER
|
|
200
204
|
provenance:
|
|
201
205
|
repo: PMID
|
|
202
|
-
publication:
|
|
206
|
+
publication: 11223344
|
|
203
207
|
contributors:
|
|
204
208
|
- kind: curation
|
|
205
209
|
name: Your Name
|
|
@@ -277,7 +281,7 @@ template:
|
|
|
277
281
|
- Disease
|
|
278
282
|
provenance:
|
|
279
283
|
repo: PMID
|
|
280
|
-
publication:
|
|
284
|
+
publication: 55667788
|
|
281
285
|
contributors:
|
|
282
286
|
- kind: curation
|
|
283
287
|
name: Your Name
|
|
@@ -330,7 +334,7 @@ template:
|
|
|
330
334
|
- ChemicalEntity
|
|
331
335
|
provenance:
|
|
332
336
|
repo: PMID
|
|
333
|
-
publication:
|
|
337
|
+
publication: 99887766
|
|
334
338
|
contributors:
|
|
335
339
|
- kind: curation
|
|
336
340
|
name: Your Name
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
[project]
|
|
2
2
|
name = "tablassert"
|
|
3
|
-
version = "7.3.
|
|
3
|
+
version = "7.3.5"
|
|
4
4
|
description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
|
|
5
5
|
authors = [
|
|
6
6
|
{ name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
|
|
@@ -29,13 +29,24 @@ def from_url(website: str, p: Path, timeout: int = 60_000, retries: int = 3) ->
|
|
|
29
29
|
try:
|
|
30
30
|
with sync_playwright() as pw:
|
|
31
31
|
browser = pw.chromium.launch(headless=True)
|
|
32
|
-
|
|
33
|
-
|
|
32
|
+
context = browser.new_context(accept_downloads=True)
|
|
33
|
+
|
|
34
|
+
page = context.new_page()
|
|
34
35
|
with page.expect_download(timeout=timeout) as info:
|
|
35
|
-
|
|
36
|
-
|
|
36
|
+
try:
|
|
37
|
+
page.goto(website, wait_until="load", timeout=timeout)
|
|
38
|
+
except Exception as e:
|
|
39
|
+
if "net::ERR_ABORTED" not in str(e):
|
|
40
|
+
raise
|
|
41
|
+
|
|
42
|
+
download = info.value
|
|
43
|
+
download.save_as(p)
|
|
44
|
+
|
|
45
|
+
context.close()
|
|
37
46
|
browser.close()
|
|
47
|
+
|
|
38
48
|
return p
|
|
49
|
+
|
|
39
50
|
except Exception as e:
|
|
40
51
|
last = e
|
|
41
52
|
if attempt < retries - 1:
|
|
@@ -33,7 +33,9 @@ class TablaBase(BaseModel):
|
|
|
33
33
|
|
|
34
34
|
|
|
35
35
|
class Reindex(TablaBase):
|
|
36
|
-
column: str = Field(
|
|
36
|
+
column: str = Field(
|
|
37
|
+
..., pattern=r"^[A-Z]{1,3}$", description="Source column letters used for row filtering.", examples=["A", "AA"]
|
|
38
|
+
)
|
|
37
39
|
comparison: Comparisons = Field(
|
|
38
40
|
Comparisons.NE,
|
|
39
41
|
description="Comparison operator used in reindex filtering.",
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|