tablassert 7.2.2__tar.gz → 7.3.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {tablassert-7.2.2 → tablassert-7.3.1}/CHANGELOG.md +17 -0
- tablassert-7.3.1/Dockerfile +8 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/PKG-INFO +15 -1
- {tablassert-7.2.2 → tablassert-7.3.1}/README.md +14 -0
- tablassert-7.3.1/docs/api/lib.md +239 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/mkdocs.yml +1 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/pyproject.toml +1 -1
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/lib.py +33 -3
- {tablassert-7.2.2 → tablassert-7.3.1}/uv.lock +1 -1
- tablassert-7.2.2/Dockerfile +0 -6
- {tablassert-7.2.2 → tablassert-7.3.1}/.github/workflows/autotag.yml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/.github/workflows/docker.yml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/.github/workflows/docs.yml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/.github/workflows/pipy.yml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/.gitignore +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/.pre-commit-config.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/AGENTS.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/CITATION.cff +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/CONTRIBUTING.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/LICENSE +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/api/fullmap.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/api/qc.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/api/utils.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/cli.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/configuration/advanced-example.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/configuration/graph.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/configuration/table.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/datassert.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/docker.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/examples/tutorial-data.csv +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/examples/tutorial-graph.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/examples/tutorial-table.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/examples.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/index.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/installation.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/docs/tutorial.md +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/llms.txt +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/__init__.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/cli.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/downloader.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/enums.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/fullmap.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/ingests.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/log.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/models.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/nlp.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/qc.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/src/tablassert/utils.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/__init__.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/conftest.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/fixtures/invalid_section_missing_source.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/fixtures/minimal_section.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/fixtures/minimal_section_with_sections.yaml +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_enums.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_fullmap.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_ingests.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_lib.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_models.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_nlp.py +0 -0
- {tablassert-7.2.2 → tablassert-7.3.1}/tests/test_utils.py +0 -0
|
@@ -2,6 +2,23 @@
|
|
|
2
2
|
|
|
3
3
|
All notable changes to this project are documented in this file.
|
|
4
4
|
|
|
5
|
+
## 7.3.1 - 2026-04-03
|
|
6
|
+
|
|
7
|
+
### Changes
|
|
8
|
+
- Changed `resolve_many()` return type from `dict[str, list[str]]` to `list[dict[str, Any]]` — each resolved entity is now a row dictionary, produced via `to_dicts()`.
|
|
9
|
+
- `resolve_many()` now preserves the original input text in an `original {col}` key on each result row.
|
|
10
|
+
|
|
11
|
+
### Documentation
|
|
12
|
+
- Updated `resolve_many()` API reference to match the current function signature, return type, and output format.
|
|
13
|
+
|
|
14
|
+
## 7.3.0 - 2026-04-03
|
|
15
|
+
|
|
16
|
+
### New Features
|
|
17
|
+
- Added `resolve_many()` to `lib` module — a standalone batch entity resolution function that resolves an iterable of text strings to CURIEs without requiring manual LazyFrame setup, NLP preprocessing, or DuckDB connection management.
|
|
18
|
+
|
|
19
|
+
### Documentation
|
|
20
|
+
- Added detailed API reference page for `resolve_many()` covering function signature, parameters, return value, usage examples, and integration notes.
|
|
21
|
+
|
|
5
22
|
## 7.2.2 - 2026-04-01
|
|
6
23
|
|
|
7
24
|
### Bug Fixes
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: tablassert
|
|
3
|
-
Version: 7.
|
|
3
|
+
Version: 7.3.1
|
|
4
4
|
Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
|
|
5
5
|
Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
|
|
6
6
|
Project-URL: Source, https://github.com/SkyeAv/Tablassert
|
|
@@ -93,6 +93,20 @@ docker run --rm \
|
|
|
93
93
|
|
|
94
94
|
</details>
|
|
95
95
|
|
|
96
|
+
## Quick Demo
|
|
97
|
+
|
|
98
|
+
```bash
|
|
99
|
+
# Build a knowledge graph from a YAML configuration
|
|
100
|
+
$ tablassert build-knowledge-graph graph-config.yaml
|
|
101
|
+
⠋ Loading table configurations...
|
|
102
|
+
⠋ Resolving entities across 16 DuckDB shards...
|
|
103
|
+
⠋ Compiling subgraphs...
|
|
104
|
+
⠋ Deduplicating nodes and edges...
|
|
105
|
+
✓ Done — wrote nodes.ndjson and edges.ndjson to .storassert/
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
|
|
109
|
+
|
|
96
110
|
## Key Features
|
|
97
111
|
|
|
98
112
|
- **Declarative Configuration** — YAML-based, no code required
|
|
@@ -41,6 +41,20 @@ docker run --rm \
|
|
|
41
41
|
|
|
42
42
|
</details>
|
|
43
43
|
|
|
44
|
+
## Quick Demo
|
|
45
|
+
|
|
46
|
+
```bash
|
|
47
|
+
# Build a knowledge graph from a YAML configuration
|
|
48
|
+
$ tablassert build-knowledge-graph graph-config.yaml
|
|
49
|
+
⠋ Loading table configurations...
|
|
50
|
+
⠋ Resolving entities across 16 DuckDB shards...
|
|
51
|
+
⠋ Compiling subgraphs...
|
|
52
|
+
⠋ Deduplicating nodes and edges...
|
|
53
|
+
✓ Done — wrote nodes.ndjson and edges.ndjson to .storassert/
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
Define your entities and relationships in YAML, point tablassert at your data, and get NCATS Translator-compliant KGX NDJSON out the other side — no code required.
|
|
57
|
+
|
|
44
58
|
## Key Features
|
|
45
59
|
|
|
46
60
|
- **Declarative Configuration** — YAML-based, no code required
|
|
@@ -0,0 +1,239 @@
|
|
|
1
|
+
# Batch Resolution (lib)
|
|
2
|
+
|
|
3
|
+
The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
|
|
4
|
+
|
|
5
|
+
It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 16 DuckDB shard connections, executing entity resolution, and returning results as a list of plain Python dictionaries.
|
|
6
|
+
|
|
7
|
+
## resolve_many()
|
|
8
|
+
|
|
9
|
+
Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a list of row dictionaries, one per resolved entity.
|
|
10
|
+
|
|
11
|
+
### Function Signature
|
|
12
|
+
|
|
13
|
+
```python
|
|
14
|
+
def resolve_many(
|
|
15
|
+
col: str,
|
|
16
|
+
entities: Iterable[str],
|
|
17
|
+
datassert: Path,
|
|
18
|
+
taxon: Optional[str] = None,
|
|
19
|
+
prioritize: Optional[list[Categories]] = None,
|
|
20
|
+
avoid: Optional[list[Categories]] = None,
|
|
21
|
+
column_context: bool = True,
|
|
22
|
+
) -> list[dict[str, Any]]
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
### Parameters
|
|
26
|
+
|
|
27
|
+
**`col: str`**
|
|
28
|
+
|
|
29
|
+
Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in each returned row dictionary.
|
|
30
|
+
|
|
31
|
+
For example, if `col="gene"`, each returned row dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
|
|
32
|
+
|
|
33
|
+
**`entities: Iterable[str]`**
|
|
34
|
+
|
|
35
|
+
An iterable of text strings to resolve. Each string is treated as a candidate entity name that will be normalized and matched against the datassert synonym database. Accepts any iterable — lists, tuples, generators, sets, etc.
|
|
36
|
+
|
|
37
|
+
Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generator expression.
|
|
38
|
+
|
|
39
|
+
**`datassert: Path`**
|
|
40
|
+
|
|
41
|
+
Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 16 DuckDB shard files (`0.duckdb` through `15.duckdb`).
|
|
42
|
+
|
|
43
|
+
Each shard contains:
|
|
44
|
+
- Synonym mappings (text → CURIE)
|
|
45
|
+
- Preferred entity names
|
|
46
|
+
- Biolink categories
|
|
47
|
+
- NCBI Taxon IDs
|
|
48
|
+
- Source databases and versions
|
|
49
|
+
|
|
50
|
+
**`taxon: Optional[str]` (default: `None`)**
|
|
51
|
+
|
|
52
|
+
Optional NCBI Taxon ID for filtering results to a specific organism.
|
|
53
|
+
|
|
54
|
+
Example: `"9606"` restricts matches to human-specific entities. When `None`, no taxon filtering is applied and matches from all organisms are returned.
|
|
55
|
+
|
|
56
|
+
**`prioritize: Optional[list[Categories]]` (default: `None`)**
|
|
57
|
+
|
|
58
|
+
Optional list of Biolink categories to prefer when multiple matches exist for the same input term. Categories listed here receive higher ranking scores during resolution.
|
|
59
|
+
|
|
60
|
+
Example: `[Categories.Gene, Categories.Protein]` prefers gene and protein mappings over other categories like diseases or chemicals.
|
|
61
|
+
|
|
62
|
+
**`avoid: Optional[list[Categories]]` (default: `None`)**
|
|
63
|
+
|
|
64
|
+
Optional list of Biolink categories to exclude from results entirely. Any match belonging to an avoided category is filtered out before ranking.
|
|
65
|
+
|
|
66
|
+
Example: `[Categories.Gene]` prevents gene mappings from appearing in the output, even if they would otherwise be the best match.
|
|
67
|
+
|
|
68
|
+
**`column_context: bool` (default: `True`)**
|
|
69
|
+
|
|
70
|
+
Controls category-frequency tie-breaking when multiple matches exist for a term. When `True`, the resolution query adds a category frequency score and prefers the category that appears most frequently across all terms in the batch. When `False`, frequency-based tie-breaking is disabled.
|
|
71
|
+
|
|
72
|
+
This is useful when resolving a column of related entities (e.g., all genes) — the shared context helps disambiguate terms that map to multiple categories.
|
|
73
|
+
|
|
74
|
+
### Return Value
|
|
75
|
+
|
|
76
|
+
Returns a `list[dict[str, Any]]` — one dictionary per resolved entity. The list is produced by calling `polars.DataFrame.to_dicts()` on the collected resolution output.
|
|
77
|
+
|
|
78
|
+
Each dictionary contains the following keys (where `{col}` is the value of the `col` parameter):
|
|
79
|
+
|
|
80
|
+
| Key | Description | Example Value |
|
|
81
|
+
|-----|-------------|---------------|
|
|
82
|
+
| `original {col}` | Original input text before normalization | `"TP53"` |
|
|
83
|
+
| `{col}` | CURIE identifier | `"HGNC:11998"` |
|
|
84
|
+
| `{col} name` | Preferred entity name | `"TP53"` |
|
|
85
|
+
| `{col} category` | Biolink category (prefixed) | `"biolink:Gene"` |
|
|
86
|
+
| `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
|
|
87
|
+
| `{col} source` | Source database | `"HGNC"` |
|
|
88
|
+
| `{col} source version` | Database version | `"2025-01"` |
|
|
89
|
+
| `{col} nlp level` | NLP processing level used for match | `0` or `1` |
|
|
90
|
+
|
|
91
|
+
**Important:** Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by `resolve()`. The returned list may therefore be shorter than the input iterable.
|
|
92
|
+
|
|
93
|
+
### Pipeline Internals
|
|
94
|
+
|
|
95
|
+
`resolve_many()` executes the following steps internally:
|
|
96
|
+
|
|
97
|
+
1. **Series construction** — Wraps the input iterable in a `pl.Series` with the given column name, then converts to a single-column `pl.LazyFrame`.
|
|
98
|
+
|
|
99
|
+
2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
|
|
100
|
+
|
|
101
|
+
3. **DuckDB connection management** — Opens all 16 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
|
|
102
|
+
|
|
103
|
+
4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
|
|
104
|
+
|
|
105
|
+
5. **Collection and conversion** — Collects the lazy result into an eager `pl.DataFrame` and converts to a list of row dictionaries via `to_dicts()`.
|
|
106
|
+
|
|
107
|
+
### Example Usage
|
|
108
|
+
|
|
109
|
+
#### Basic Gene Resolution
|
|
110
|
+
|
|
111
|
+
```python
|
|
112
|
+
from pathlib import Path
|
|
113
|
+
from typing import Any
|
|
114
|
+
from tablassert.lib import resolve_many
|
|
115
|
+
from tablassert.enums import Categories
|
|
116
|
+
|
|
117
|
+
datassert: Path = Path("/path/to/datassert")
|
|
118
|
+
|
|
119
|
+
result: list[dict[str, Any]] = resolve_many(
|
|
120
|
+
col="gene",
|
|
121
|
+
entities=["TP53", "BRCA1", "EGFR", "KRAS"],
|
|
122
|
+
datassert=datassert,
|
|
123
|
+
taxon="9606",
|
|
124
|
+
prioritize=[Categories.Gene],
|
|
125
|
+
)
|
|
126
|
+
|
|
127
|
+
# result[0] → {"original gene": "TP53", "gene": "HGNC:11998", "gene name": "TP53", ...}
|
|
128
|
+
# result[1] → {"original gene": "BRCA1", "gene": "HGNC:1100", "gene name": "BRCA1", ...}
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
#### Disease Resolution With Category Avoidance
|
|
132
|
+
|
|
133
|
+
```python
|
|
134
|
+
from pathlib import Path
|
|
135
|
+
from typing import Any
|
|
136
|
+
from tablassert.lib import resolve_many
|
|
137
|
+
from tablassert.enums import Categories
|
|
138
|
+
|
|
139
|
+
datassert: Path = Path("/path/to/datassert")
|
|
140
|
+
|
|
141
|
+
result: list[dict[str, Any]] = resolve_many(
|
|
142
|
+
col="disease",
|
|
143
|
+
entities=["diabetes mellitus", "breast cancer", "alzheimer disease"],
|
|
144
|
+
datassert=datassert,
|
|
145
|
+
avoid=[Categories.Gene, Categories.Protein],
|
|
146
|
+
)
|
|
147
|
+
|
|
148
|
+
# result[0] → {"original disease": "diabetes mellitus", "disease": "MONDO:0005015", ...}
|
|
149
|
+
# result[1] → {"original disease": "breast cancer", "disease name": "breast cancer", ...}
|
|
150
|
+
```
|
|
151
|
+
|
|
152
|
+
#### Chemical Resolution Without Column Context
|
|
153
|
+
|
|
154
|
+
```python
|
|
155
|
+
from pathlib import Path
|
|
156
|
+
from typing import Any
|
|
157
|
+
from tablassert.lib import resolve_many
|
|
158
|
+
|
|
159
|
+
datassert: Path = Path("/path/to/datassert")
|
|
160
|
+
|
|
161
|
+
result: list[dict[str, Any]] = resolve_many(
|
|
162
|
+
col="chemical",
|
|
163
|
+
entities=["aspirin", "metformin", "ibuprofen"],
|
|
164
|
+
datassert=datassert,
|
|
165
|
+
column_context=False,
|
|
166
|
+
)
|
|
167
|
+
```
|
|
168
|
+
|
|
169
|
+
#### Consuming Results
|
|
170
|
+
|
|
171
|
+
```python
|
|
172
|
+
import polars as pl
|
|
173
|
+
from pathlib import Path
|
|
174
|
+
from typing import Any
|
|
175
|
+
from tablassert.lib import resolve_many
|
|
176
|
+
|
|
177
|
+
datassert: Path = Path("/path/to/datassert")
|
|
178
|
+
|
|
179
|
+
result: list[dict[str, Any]] = resolve_many(
|
|
180
|
+
col="gene",
|
|
181
|
+
entities=["TP53", "BRCA1"],
|
|
182
|
+
datassert=datassert,
|
|
183
|
+
taxon="9606",
|
|
184
|
+
)
|
|
185
|
+
|
|
186
|
+
# Convert back to a Polars DataFrame
|
|
187
|
+
df: pl.DataFrame = pl.DataFrame(result)
|
|
188
|
+
|
|
189
|
+
# Or iterate over resolved rows
|
|
190
|
+
for row in result:
|
|
191
|
+
print(f"{row['gene name']} → {row['gene']}")
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
### Comparison With resolve()
|
|
195
|
+
|
|
196
|
+
| Aspect | `resolve_many()` | `resolve()` |
|
|
197
|
+
|--------|-------------------|-------------|
|
|
198
|
+
| **Module** | `tablassert.lib` | `tablassert.fullmap` |
|
|
199
|
+
| **Input** | Plain iterable of strings | Pre-normalized `pl.LazyFrame` |
|
|
200
|
+
| **NLP** | Applied automatically | Must be applied upstream |
|
|
201
|
+
| **Connections** | Managed internally via `ExitStack` | Must be opened externally |
|
|
202
|
+
| **Output** | `list[dict[str, Any]]` | `pl.LazyFrame` |
|
|
203
|
+
| **Logging** | Uses default (`log=True`) | Configurable |
|
|
204
|
+
| **Context params** | Not exposed (`section_hash`, `config_file`, `tag`) | Fully configurable |
|
|
205
|
+
| **Use case** | Standalone batch lookups, scripting, notebooks | Internal pipeline integration |
|
|
206
|
+
|
|
207
|
+
`resolve_many()` is designed for ad-hoc and programmatic use — scripts, notebooks, and one-off lookups. For pipeline integration where you need full control over logging, context metadata, and lazy evaluation, use `resolve()` directly.
|
|
208
|
+
|
|
209
|
+
### NLP Processing
|
|
210
|
+
|
|
211
|
+
`resolve_many()` applies both NLP normalization levels before resolution:
|
|
212
|
+
|
|
213
|
+
**Level one** — `level_one(lf, col)`:
|
|
214
|
+
- Strips leading/trailing whitespace
|
|
215
|
+
- Converts to lowercase
|
|
216
|
+
- Output column: `{col}` (overwrites the original)
|
|
217
|
+
|
|
218
|
+
**Level two** — `level_two(lf, col)`:
|
|
219
|
+
- Removes all non-word characters (`\W+` → `""`) from the level-one result
|
|
220
|
+
- Output column: `{col} two`
|
|
221
|
+
|
|
222
|
+
Both levels are queried during resolution. Level one (exact case-insensitive match) is preferred; level two is used as a fallback for terms with punctuation or special characters.
|
|
223
|
+
|
|
224
|
+
### Error Handling
|
|
225
|
+
|
|
226
|
+
- If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
|
|
227
|
+
- If `entities` is empty, the function returns an empty list.
|
|
228
|
+
- The `ExitStack` ensures all 16 DuckDB connections are closed even if resolution raises an exception.
|
|
229
|
+
- Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
|
|
230
|
+
|
|
231
|
+
## Integration
|
|
232
|
+
|
|
233
|
+
`resolve_many()` is a self-contained entry point. It does not require any prior setup beyond having a datassert database available. For full pipeline builds, use the CLI (`tablassert build-knowledge-graph`) which orchestrates resolution through the `Tcode` class.
|
|
234
|
+
|
|
235
|
+
## Next Steps
|
|
236
|
+
|
|
237
|
+
- **[Entity Resolution](fullmap.md)** — Lower-level `resolve()` function details
|
|
238
|
+
- **[Quality Control](qc.md)** — Multi-stage validation of resolved entities
|
|
239
|
+
- **[Configuration](../configuration/table.md)** — YAML-driven entity resolution settings
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
[project]
|
|
2
2
|
name = "tablassert"
|
|
3
|
-
version = "7.
|
|
3
|
+
version = "7.3.1"
|
|
4
4
|
description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
|
|
5
5
|
authors = [
|
|
6
6
|
{ name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
|
|
@@ -2,6 +2,8 @@ from __future__ import annotations
|
|
|
2
2
|
|
|
3
3
|
import math
|
|
4
4
|
import operator
|
|
5
|
+
from collections.abc import Iterable
|
|
6
|
+
from contextlib import ExitStack
|
|
5
7
|
from functools import reduce
|
|
6
8
|
from operator import add, eq, le
|
|
7
9
|
from os.path import basename
|
|
@@ -13,11 +15,11 @@ from pydantic import Field, NonNegativeInt, PositiveInt
|
|
|
13
15
|
from sqlite_utils import Database
|
|
14
16
|
|
|
15
17
|
from tablassert.downloader import from_url
|
|
16
|
-
from tablassert.enums import EncodingMethods, Files, Tokens
|
|
17
|
-
from tablassert.fullmap import resolve
|
|
18
|
+
from tablassert.enums import Categories, EncodingMethods, Files, Tokens
|
|
19
|
+
from tablassert.fullmap import SHARDS, resolve
|
|
18
20
|
from tablassert.log import logger
|
|
19
|
-
from tablassert.nlp import level_one, level_two
|
|
20
21
|
from tablassert.models import Encoding, NodeEncoding, Section
|
|
22
|
+
from tablassert.nlp import level_one, level_two
|
|
21
23
|
from tablassert.qc import fullmap_audit
|
|
22
24
|
from tablassert.utils import namespace_uuid
|
|
23
25
|
|
|
@@ -475,3 +477,31 @@ def compile_graph(subgraphs: list[Path], name: str, version: str, fmt: str = "mi
|
|
|
475
477
|
|
|
476
478
|
dedup_stream(e, is_edges=True)
|
|
477
479
|
dedup_stream(n, is_edges=False)
|
|
480
|
+
|
|
481
|
+
|
|
482
|
+
def resolve_many(
|
|
483
|
+
col: str,
|
|
484
|
+
entities: Iterable[str],
|
|
485
|
+
datassert: Path,
|
|
486
|
+
taxon: Optional[str] = None,
|
|
487
|
+
prioritize: Optional[list[Categories]] = None,
|
|
488
|
+
avoid: Optional[list[Categories]] = None,
|
|
489
|
+
column_context: bool = True,
|
|
490
|
+
) -> list[dict[str, Any]]:
|
|
491
|
+
series: pl.Series = pl.Series(col, entities)
|
|
492
|
+
lf: pl.LazyFrame = series.to_frame().lazy()
|
|
493
|
+
|
|
494
|
+
lf = column(lf, add("original ", col), col)
|
|
495
|
+
lf = level_one(lf, col)
|
|
496
|
+
lf = level_two(lf, col)
|
|
497
|
+
|
|
498
|
+
with ExitStack() as stack:
|
|
499
|
+
conns: list[object] = [
|
|
500
|
+
stack.enter_context(duckdb.connect(datassert / "data" / f"{x}.duckdb", read_only=True))
|
|
501
|
+
for x in range(SHARDS)
|
|
502
|
+
]
|
|
503
|
+
|
|
504
|
+
lf = resolve(lf, col, conns, taxon=taxon, prioritize=prioritize, avoid=avoid, column_context=column_context)
|
|
505
|
+
|
|
506
|
+
df: pl.DataFrame = lf.collect()
|
|
507
|
+
return df.to_dicts()
|
tablassert-7.2.2/Dockerfile
DELETED
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|
|
File without changes
|