PyPI - tablassert - Versions diffs - 7.2.1__tar.gz → 7.3.0__tar.gz - Mend

tablassert 7.2.1tar.gz → 7.3.0tar.gz


Files changed (59)

{tablassert-7.2.1 → tablassert-7.3.0}/.github/workflows/docker.yml RENAMED

@@ -1,9 +1,13 @@
 name: Publish Docker Image
 on:
   workflow_dispatch:
-  push:
-    tags:
-      - "v*"
+  workflow_run:
+    workflows:
+      - "Auto Tag Versions"
+    types:
+      - completed
+    branches:
+      - main
 jobs:
   publish:
     runs-on: ubuntu-latest
@@ -12,6 +16,9 @@ jobs:
       packages: write
     steps:
       - uses: actions/checkout@v4
+      - name: Get version from pyproject.toml
+        id: version
+        run: echo "version=v$(grep -m1 'version = "' pyproject.toml | cut -d'"' -f2)" >> "$GITHUB_OUTPUT"
       - uses: docker/setup-buildx-action@v3
       - uses: docker/login-action@v3
         with:
@@ -24,5 +31,5 @@ jobs:
           file: ./Dockerfile
           push: true
           tags: |
-            ghcr.io/${{ github.repository_owner }}/tablassert:latest
-            ghcr.io/${{ github.repository_owner }}/tablassert:${{ github.ref_name }}
+            ghcr.io/skyeav/tablassert:latest
+            ghcr.io/skyeav/tablassert:${{ steps.version.outputs.version }}

{tablassert-7.2.1 → tablassert-7.3.0}/CHANGELOG.md RENAMED

@@ -2,6 +2,22 @@
 All notable changes to this project are documented in this file.
+## 7.3.0 - 2026-04-03
+### New Features
+- Added `resolve_many()` to `lib` module — a standalone batch entity resolution function that resolves an iterable of text strings to CURIEs without requiring manual LazyFrame setup, NLP preprocessing, or DuckDB connection management.
+### Documentation
+- Added detailed API reference page for `resolve_many()` covering function signature, parameters, return value, usage examples, and integration notes.
+## 7.2.2 - 2026-04-01
+### Bug Fixes
+- Fixed Docker publish workflow failing due to mixed-case repository owner in image tags. Hardcoded lowercase `ghcr.io/skyeav/tablassert` and switched trigger to run after autotag completion.
+### Maintenance
+- Updated PyPI short description.
 ## 7.2.1 - 2026-04-01
 ### Maintenance

{tablassert-7.2.1 → tablassert-7.3.0}/CITATION.cff RENAMED

@@ -2,7 +2,7 @@ cff-version: 1.2.0
 message: "If you use Tablassert, please cite it as below."
 type: software
 title: Tablassert
-version: 7.2.1
+version: 7.2.2
 license: Apache-2.0
 repository-code: https://github.com/SkyeAv/Tablassert
 abstract: Tablassert is a highly performant declarative knowledge graph backend for bioinformatics that extracts knowledge assertions from tabular data, performs entity resolution and data quality control, and exports NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON.

{tablassert-7.2.1 → tablassert-7.3.0}/Dockerfile RENAMED

@@ -1,6 +1,6 @@
 FROM python:3.14-slim
-RUN pip install --no-cache-dir "tablassert[full]"
+RUN pip install --no-cache-dir "tablassert"
 ENTRYPOINT ["tablassert"]
 CMD ["--help"]

{tablassert-7.2.1 → tablassert-7.3.0}/PKG-INFO RENAMED

@@ -1,7 +1,7 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.2.1
-Summary: Tablassert is a highly performant declarative knowledge graph backend designed to extract knowledge assertions from tabular data while exporting NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON.
+Version: 7.3.0
+Summary: Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert
 Project-URL: Documentation, https://skyeav.github.io/Tablassert/

tablassert-7.3.0/docs/api/lib.md ADDED

@@ -0,0 +1,236 @@
+# Batch Resolution (lib)
+The `lib` module exposes `resolve_many()`, a high-level convenience function for resolving an iterable of entity strings to CURIEs without requiring manual LazyFrame construction, NLP preprocessing, or DuckDB shard management.
+It wraps the lower-level [`resolve()`](fullmap.md) pipeline — applying `level_one` and `level_two` normalization, opening all 16 DuckDB shard connections, executing entity resolution, and returning results as a plain Python dictionary.
+## resolve_many()
+Standalone batch entity resolution function. Accepts a column name, an iterable of text strings, and a path to the datassert database, then returns resolved CURIEs and metadata as a dictionary of lists.
+### Function Signature
+```python
+def resolve_many(
+    col: str,
+    entities: Iterable[str],
+    datassert: Path,
+    taxon: Optional[str] = None,
+    prioritize: Optional[list[Categories]] = None,
+    avoid: Optional[list[Categories]] = None,
+    column_context: bool = True,
+) -> dict[str, list[str]]
+```
+### Parameters
+**`col: str`**
+Column name used internally to label the Polars Series and DataFrame columns during resolution. This name propagates through the NLP and resolution pipeline and determines the keys in the returned dictionary.
+For example, if `col="gene"`, the returned dictionary will contain keys like `"gene"`, `"gene name"`, `"gene category"`, etc.
+**`entities: Iterable[str]`**
+An iterable of text strings to resolve. Each string is treated as a candidate entity name that will be normalized and matched against the datassert synonym database. Accepts any iterable — lists, tuples, generators, sets, etc.
+Examples: `["TP53", "BRCA1", "EGFR"]`, `("aspirin", "ibuprofen")`, or a generator expression.
+**`datassert: Path`**
+Filesystem path to the root of the datassert database directory. The function expects a `data/` subdirectory containing 16 DuckDB shard files (`0.duckdb` through `15.duckdb`).
+Each shard contains:
+- Synonym mappings (text → CURIE)
+- Preferred entity names
+- Biolink categories
+- NCBI Taxon IDs
+- Source databases and versions
+**`taxon: Optional[str]` (default: `None`)**
+Optional NCBI Taxon ID for filtering results to a specific organism.
+Example: `"9606"` restricts matches to human-specific entities. When `None`, no taxon filtering is applied and matches from all organisms are returned.
+**`prioritize: Optional[list[Categories]]` (default: `None`)**
+Optional list of Biolink categories to prefer when multiple matches exist for the same input term. Categories listed here receive higher ranking scores during resolution.
+Example: `[Categories.Gene, Categories.Protein]` prefers gene and protein mappings over other categories like diseases or chemicals.
+**`avoid: Optional[list[Categories]]` (default: `None`)**
+Optional list of Biolink categories to exclude from results entirely. Any match belonging to an avoided category is filtered out before ranking.
+Example: `[Categories.Gene]` prevents gene mappings from appearing in the output, even if they would otherwise be the best match.
+**`column_context: bool` (default: `True`)**
+Controls category-frequency tie-breaking when multiple matches exist for a term. When `True`, the resolution query adds a category frequency score and prefers the category that appears most frequently across all terms in the batch. When `False`, frequency-based tie-breaking is disabled.
+This is useful when resolving a column of related entities (e.g., all genes) — the shared context helps disambiguate terms that map to multiple categories.
+### Return Value
+Returns a `dict[str, list[str]]` where each key is a column name and each value is a list of resolved values. The dictionary is produced by calling `polars.DataFrame.to_dict(as_series=False)` on the collected resolution output.
+The returned dictionary contains the following keys (where `{col}` is the value of the `col` parameter):
+| Key | Description | Example Value |
+|-----|-------------|---------------|
+| `{col}` | CURIE identifier | `"HGNC:11998"` |
+| `{col} name` | Preferred entity name | `"TP53"` |
+| `{col} category` | Biolink category (prefixed) | `"biolink:Gene"` |
+| `{col} taxon` | NCBI Taxon ID (prefixed) | `"NCBITaxon:9606"` |
+| `{col} source` | Source database | `"HGNC"` |
+| `{col} source version` | Database version | `"2025-01"` |
+| `{col} nlp level` | NLP processing level used for match | `0` or `1` |
+**Important:** Only entities that successfully resolve to a CURIE are included in the output. Unresolved entities are filtered out by `resolve()`. The returned lists may therefore be shorter than the input iterable.
+### Pipeline Internals
+`resolve_many()` executes the following steps internally:
+1. **Series construction** — Wraps the input iterable in a `pl.Series` with the given column name, then converts to a single-column `pl.LazyFrame`.
+2. **NLP normalization** — Applies `level_one()` (whitespace stripping + lowercasing) and `level_two()` (non-word character removal via `\W+`) to produce the two normalized columns required by `resolve()`.
+3. **DuckDB connection management** — Opens all 16 shard connections inside a `contextlib.ExitStack`, ensuring every connection is properly closed when resolution completes or if an error occurs.
+4. **Entity resolution** — Delegates to `fullmap.resolve()` which queries the sharded DuckDB database, ranks matches by category priority, preferred-name exactness, NLP level, and category frequency, then deduplicates to one CURIE per input string.
+5. **Collection and conversion** — Collects the lazy result into an eager `pl.DataFrame` and converts to a Python dictionary via `to_dict(as_series=False)`.
+### Example Usage
+#### Basic Gene Resolution
+```python
+from pathlib import Path
+from tablassert.lib import resolve_many
+from tablassert.enums import Categories
+datassert: Path = Path("/path/to/datassert")
+result: dict[str, list[str]] = resolve_many(
+    col="gene",
+    entities=["TP53", "BRCA1", "EGFR", "KRAS"],
+    datassert=datassert,
+    taxon="9606",
+    prioritize=[Categories.Gene],
+)
+# result["gene"]          → ["HGNC:11998", "HGNC:1100", ...]
+# result["gene name"]     → ["TP53", "BRCA1", ...]
+# result["gene category"] → ["biolink:Gene", "biolink:Gene", ...]
+```
+#### Disease Resolution With Category Avoidance
+```python
+from pathlib import Path
+from tablassert.lib import resolve_many
+from tablassert.enums import Categories
+datassert: Path = Path("/path/to/datassert")
+result: dict[str, list[str]] = resolve_many(
+    col="disease",
+    entities=["diabetes mellitus", "breast cancer", "alzheimer disease"],
+    datassert=datassert,
+    avoid=[Categories.Gene, Categories.Protein],
+)
+# result["disease"]          → ["MONDO:0005015", ...]
+# result["disease name"]     → ["diabetes mellitus", ...]
+# result["disease category"] → ["biolink:Disease", ...]
+```
+#### Chemical Resolution Without Column Context
+```python
+from pathlib import Path
+from tablassert.lib import resolve_many
+datassert: Path = Path("/path/to/datassert")
+result: dict[str, list[str]] = resolve_many(
+    col="chemical",
+    entities=["aspirin", "metformin", "ibuprofen"],
+    datassert=datassert,
+    column_context=False,
+)
+```
+#### Consuming Results
+```python
+import polars as pl
+from pathlib import Path
+from tablassert.lib import resolve_many
+datassert: Path = Path("/path/to/datassert")
+result: dict[str, list[str]] = resolve_many(
+    col="gene",
+    entities=["TP53", "BRCA1"],
+    datassert=datassert,
+    taxon="9606",
+)
+# Convert back to a Polars DataFrame
+df: pl.DataFrame = pl.DataFrame(result)
+# Or iterate over resolved pairs
+for curie, name in zip(result["gene"], result["gene name"]):
+    print(f"{name} → {curie}")
+```
+### Comparison With resolve()
+| Aspect | `resolve_many()` | `resolve()` |
+|--------|-------------------|-------------|
+| **Module** | `tablassert.lib` | `tablassert.fullmap` |
+| **Input** | Plain iterable of strings | Pre-normalized `pl.LazyFrame` |
+| **NLP** | Applied automatically | Must be applied upstream |
+| **Connections** | Managed internally via `ExitStack` | Must be opened externally |
+| **Output** | `dict[str, list[str]]` | `pl.LazyFrame` |
+| **Logging** | Uses default (`log=True`) | Configurable |
+| **Context params** | Not exposed (`section_hash`, `config_file`, `tag`) | Fully configurable |
+| **Use case** | Standalone batch lookups, scripting, notebooks | Internal pipeline integration |
+`resolve_many()` is designed for ad-hoc and programmatic use — scripts, notebooks, and one-off lookups. For pipeline integration where you need full control over logging, context metadata, and lazy evaluation, use `resolve()` directly.
+### NLP Processing
+`resolve_many()` applies both NLP normalization levels before resolution:
+**Level one** — `level_one(lf, col)`:
+- Strips leading/trailing whitespace
+- Converts to lowercase
+- Output column: `{col}` (overwrites the original)
+**Level two** — `level_two(lf, col)`:
+- Removes all non-word characters (`\W+` → `""`) from the level-one result
+- Output column: `{col} two`
+Both levels are queried during resolution. Level one (exact case-insensitive match) is preferred; level two is used as a fallback for terms with punctuation or special characters.
+### Error Handling
+- If the `datassert` path does not contain the expected shard files, `duckdb.connect()` will raise an `IOException`.
+- If `entities` is empty, the function returns a dictionary with empty lists for all output columns.
+- The `ExitStack` ensures all 16 DuckDB connections are closed even if resolution raises an exception.
+- Unresolved entities are silently filtered from the output (logged at INFO level by default via `resolve()`).
+## Integration
+`resolve_many()` is a self-contained entry point. It does not require any prior setup beyond having a datassert database available. For full pipeline builds, use the CLI (`tablassert build-knowledge-graph`) which orchestrates resolution through the `Tcode` class.
+## Next Steps
+- **[Entity Resolution](fullmap.md)** — Lower-level `resolve()` function details
+- **[Quality Control](qc.md)** — Multi-stage validation of resolved entities
+- **[Configuration](../configuration/table.md)** — YAML-driven entity resolution settings
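
The two NLP levels described in the added `lib.md` can be approximated on plain strings. The sketch below only illustrates the documented behaviour (strip and lowercase, then remove `\W+` runs). It is not the package's Polars implementation, and the names `level_one_str` and `level_two_str` are hypothetical stand-ins for the real `level_one(lf, col)` and `level_two(lf, col)`.

```python
import re


def level_one_str(text: str) -> str:
    """Level one: strip surrounding whitespace and lowercase."""
    return text.strip().lower()


def level_two_str(text: str) -> str:
    """Level two: drop runs of non-word characters from the level-one form.

    Underscores count as word characters for \\W, so they are kept.
    """
    return re.sub(r"\W+", "", level_one_str(text))


for raw in ["  TP53 ", "IL-6", "Alzheimer's disease"]:
    print(repr(level_one_str(raw)), repr(level_two_str(raw)))
```

For an input such as `"IL-6"`, level one yields `"il-6"` and level two yields `"il6"`, which is why the docs describe level two as a fallback for terms containing punctuation.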

{tablassert-7.2.1 → tablassert-7.3.0}/docs/docker.md RENAMED

@@ -10,7 +10,7 @@ The image is based on `python:3.14-slim` with the Tablassert CLI as the entrypoi
 docker pull ghcr.io/skyeav/tablassert:latest
 ```
-Version-pinned tags match the git tag (e.g., `ghcr.io/skyeav/tablassert:v7.2.1`).
+Version-pinned tags match the git tag (e.g., `ghcr.io/skyeav/tablassert:v7.2.2`).
 ## Quick Start
@@ -87,4 +87,4 @@ docker run --rm \
 ## CI/CD Integration
-Images are built by `.github/workflows/docker.yml`, which triggers on tag pushes (after autotag and PyPI publish complete). Tags match the repository version tag (e.g., `v7.2.1`).
+Images are built by `.github/workflows/docker.yml`, which triggers on tag pushes (after autotag and PyPI publish complete). Tags match the repository version tag (e.g., `v7.2.2`).

{tablassert-7.2.1 → tablassert-7.3.0}/mkdocs.yml RENAMED

@@ -14,6 +14,7 @@ nav:
     - Advanced Example: configuration/advanced-example.md
   - API Reference:
     - Entity Resolution: api/fullmap.md
+    - Batch Resolution: api/lib.md
     - Quality Control: api/qc.md
     - Utilities: api/utils.md
   - Changelog: ../CHANGELOG.md

{tablassert-7.2.1 → tablassert-7.3.0}/pyproject.toml RENAMED

@@ -1,7 +1,7 @@
 [project]
 name = "tablassert"
-version = "7.2.1"
-description = "Tablassert is a highly performant declarative knowledge graph backend designed to extract knowledge assertions from tabular data while exporting NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON."
+version = "7.3.0"
+description = "Extract knowledge assertions from tabular data into NCATS Translator-compliant KGX NDJSON — declaratively, with entity resolution and quality control built in."
 authors = [
     { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
 ]

{tablassert-7.2.1 → tablassert-7.3.0}/src/tablassert/lib.py RENAMED

@@ -2,6 +2,8 @@ from __future__ import annotations
 import math
 import operator
+from collections.abc import Iterable
+from contextlib import ExitStack
 from functools import reduce
 from operator import add, eq, le
 from os.path import basename
@@ -13,11 +15,11 @@ from pydantic import Field, NonNegativeInt, PositiveInt
 from sqlite_utils import Database
 from tablassert.downloader import from_url
-from tablassert.enums import EncodingMethods, Files, Tokens
-from tablassert.fullmap import resolve
+from tablassert.enums import Categories, EncodingMethods, Files, Tokens
+from tablassert.fullmap import SHARDS, resolve
 from tablassert.log import logger
-from tablassert.nlp import level_one, level_two
 from tablassert.models import Encoding, NodeEncoding, Section
+from tablassert.nlp import level_one, level_two
 from tablassert.qc import fullmap_audit
 from tablassert.utils import namespace_uuid
@@ -475,3 +477,30 @@ def compile_graph(subgraphs: list[Path], name: str, version: str, fmt: str = "mi
     dedup_stream(e, is_edges=True)
     dedup_stream(n, is_edges=False)
+def resolve_many(
+    col: str,
+    entities: Iterable[str],
+    datassert: Path,
+    taxon: Optional[str] = None,
+    prioritize: Optional[list[Categories]] = None,
+    avoid: Optional[list[Categories]] = None,
+    column_context: bool = True,
+) -> dict[str, list[str]]:
+    series: pl.Series = pl.Series(col, entities)
+    lf: pl.LazyFrame = series.to_frame().lazy()
+    lf = level_one(lf, col)
+    lf = level_two(lf, col)
+    with ExitStack() as stack:
+        conns: list[object] = [
+            stack.enter_context(duckdb.connect(datassert / "data" / f"{x}.duckdb", read_only=True))
+            for x in range(SHARDS)
+        ]
+        lf = resolve(lf, col, conns, taxon=taxon, prioritize=prioritize, avoid=avoid, column_context=column_context)
+    df: pl.DataFrame = lf.collect()
+    return df.to_dict(as_series=False)
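
The new `resolve_many()` opens one connection per shard inside a `contextlib.ExitStack`. The sketch below isolates that pattern and shows that every registered context is closed even when the body raises. `FakeConn` and `query_all` are hypothetical stand-ins (no DuckDB involved), and `SHARDS = 16` is assumed from the documentation's "16 DuckDB shard files"; the real constant is imported from `tablassert.fullmap`.

```python
from contextlib import ExitStack

SHARDS = 16  # assumption taken from the docs; not read from the package


class FakeConn:
    """Hypothetical stand-in for a connection; only tracks closed state."""

    def __init__(self, shard: int) -> None:
        self.shard = shard
        self.closed = False

    def __enter__(self) -> "FakeConn":
        return self

    def __exit__(self, *exc_info: object) -> None:
        # Returning None means exceptions are not suppressed.
        self.closed = True


def query_all(fail: bool) -> list[FakeConn]:
    """Open all shards, optionally fail mid-way, and return the connections."""
    conns: list[FakeConn] = []
    try:
        with ExitStack() as stack:
            for x in range(SHARDS):
                conns.append(stack.enter_context(FakeConn(x)))
            if fail:
                raise RuntimeError("simulated resolution failure")
    except RuntimeError:
        pass  # the stack has already closed every connection by this point
    return conns


print(all(c.closed for c in query_all(fail=True)))  # prints True
```

One thing worth checking against the real code: in the diff, `lf.collect()` runs after the `with` block has exited. That is only safe if `resolve()` no longer needs the open connections at collect time, which this diff does not show.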

{tablassert-7.2.1 → tablassert-7.3.0}/uv.lock RENAMED

@@ -2211,7 +2211,7 @@ wheels = [
 [[package]]
 name = "tablassert"
-version = "7.2.1"
+version = "7.3.0"
 source = { editable = "." }
 dependencies = [
     { name = "duckdb" },