PyPI - tablassert - Versions diffs - 7.0.1__tar.gz → 7.1.0__tar.gz - Mend

tablassert 7.0.1__tar.gz → 7.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58)

tablassert-7.1.0/.pre-commit-config.yaml ADDED

@@ -0,0 +1,15 @@
+repos:
+    - repo: https://github.com/astral-sh/ruff-pre-commit
+      rev: v0.9.0
+      hooks:
+          - id: ruff
+            args: [--fix]
+          - id: ruff-format
+    - repo: local
+      hooks:
+          - id: pyright
+            name: pyright
+            entry: uv run pyright
+            language: system
+            types: [python]
+            pass_filenames: false

{tablassert-7.0.1 → tablassert-7.1.0}/CHANGELOG.md RENAMED

@@ -2,6 +2,27 @@
 All notable changes to this project are documented in this file.
+## Unreleased
+### Changes
+- Updated `fullmap` ranking to prioritize case-insensitive exact matches between normalized terms and preferred names.
+- Updated `fullmap` term de-duplication to keep first occurrences, improving deterministic output ordering.
+## 7.0.2 - 2026-03-23
+### Changes
+- Updated package metadata for the 7.0.2 release.
+- Added optional `log` and `column_context` controls to `fullmap.version4()` for more configurable entity-resolution behavior.
+### Bug Fixes
+- Reworked entity-resolution querying to register terms directly in DuckDB instead of writing temporary parquet files, removing tempfile lifecycle issues in `fullmap` query execution.
+- Isolated unmatched-entity logging into a dedicated helper and gated it behind an explicit logging flag.
+### Documentation
+- Updated API reference docs to match the current `version4()` function signature and behavior.
+- Corrected QC documentation to reflect the implemented fuzzy/BERT validation pipeline.
+- Fixed documentation path typos for cache/store artifact directories.
 ## 7.0.1 - 2026-03-17
 ### Documentation

{tablassert-7.0.1 → tablassert-7.1.0}/PKG-INFO RENAMED

@@ -1,12 +1,11 @@
 Metadata-Version: 2.4
 Name: tablassert
-Version: 7.0.1
+Version: 7.1.0
 Summary: Tablassert is a highly performant declarative knowledge graph backend designed to extract knowledge assertions from tabular data while exporting NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON.
 Project-URL: Homepage, https://github.com/SkyeAv/Tablassert
 Project-URL: Source, https://github.com/SkyeAv/Tablassert
 Project-URL: Documentation, https://skyeav.github.io/Tablassert/
-Author: Jared C. Roach
-Author-email: Skye Lane Goetz <sgoetz@isbscience.org>, Gwennen Glusman <gglusman@isbscience.org>
+Author-email: Skye Lane Goetz <sgoetz@isbscience.org>
 License-Expression: Apache-2.0
 License-File: LICENSE
 Keywords: declarative pipeline,knowledge graph,natural language processing,ncats translator,ner,tablassert,table mining,yaml configuration
@@ -15,6 +14,7 @@ Requires-Python: >=3.13
 Requires-Dist: diskcache>=5.6.3
 Requires-Dist: duckdb>=1.5.0
 Requires-Dist: fastexcel>=0.19.0
+Requires-Dist: lazy-loader>=0.5
 Requires-Dist: loguru>=0.7.3
 Requires-Dist: mkdocs>=1.6.1
 Requires-Dist: onnxruntime>=1.24.3

{tablassert-7.0.1 → tablassert-7.1.0}/docs/api/fullmap.md RENAMED

@@ -13,11 +13,13 @@ def version4(
   lf: pl.LazyFrame,
   col: str,
   conn: object,
-  taxon: Optional[str],
-  prioritize: Optional[list[Categories]],
-  avoid: Optional[list[Categories]],
-  section_hash: str,
-  config_file: str,
+  taxon: Optional[str] = None,
+  prioritize: Optional[list[Categories]] = None,
+  avoid: Optional[list[Categories]] = None,
+  log: bool = True,
+  section_hash: Optional[str] = None,
+  config_file: Optional[str] = None,
+  column_context: bool = True,
   tag: str = " one"
 ) -> pl.LazyFrame
 ```
@@ -61,6 +63,18 @@ Optional list of Biolink categories to exclude from results.
 Example: `[Categories.Gene]` prevents gene mappings.
+**`log: bool` (default: `True`)**
+Controls unmatched-value logging. When enabled, unresolved terms are logged with section/config/column context.
+**`section_hash: Optional[str]` / `config_file: Optional[str]`**
+Optional context fields used for operational logging when unmatched values are encountered.
+**`column_context: bool` (default: `True`)**
+Controls category-frequency tie-breaking when multiple matches exist for a term. When `True`, the query result adds a category frequency score and prefers more frequent category hits.
 **`tag: str` (default: `" one"`)**
 Suffix for NLP processing level column.
@@ -71,10 +85,6 @@ The function looks for both:
 Default `" one"` means it uses level-one text processing (lowercase, stripped).
-**`section_hash: str` / `config_file: str`**
-Context fields used for operational logging when unmatched values are encountered.
 ### Return Value
 Returns a Polars LazyFrame with these columns added:
@@ -91,25 +101,27 @@ Returns a Polars LazyFrame with these columns added:
 ### DuckDB Query
-The function executes a complex SQL query that:
+The function executes a SQL query that:
+1. **Builds an in-memory term table** by collecting terms from both NLP levels, deduplicating by keeping first occurrences for deterministic ordering, and registering them in DuckDB as `PARQUET` via `conn.register("PARQUET", df.to_arrow())`.
-1. **Ranks matches** by:
+2. **Ranks matches** by:
    - Category priority (if `prioritize` specified)
+   - Preferred-name exactness (case-insensitive exact match of normalized term to preferred name)
    - NLP level (exact case match preferred over normalized)
-   - Source confidence
+   - Category frequency (if `column_context=True`)
-2. **Filters by:**
+3. **Filters by:**
    - Taxon ID (if specified)
    - Category avoidance (if specified)
-3. **Deduplicates** to one CURIE per row per input string
+4. **Deduplicates** to one CURIE per input string
 ### Example Usage
 ```python
 from tablassert.fullmap import version4
 from tablassert.enums import Categories
-from pathlib import Path
 import duckdb
 import polars as pl
@@ -127,8 +139,10 @@ result = version4(
   taxon="9606",  # Human only
   prioritize=[Categories.Gene],
   avoid=[Categories.Protein],
+  log=True,
   section_hash="tutorial-section",
   config_file="tutorial-table.yaml",
+  column_context=True,
   tag=" one"
 )
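The updated `fullmap.md` text describes two behaviors that can be sketched outside DuckDB: de-duplicating terms by keeping first occurrences, and ranking candidate matches by several criteria. The following is a minimal pure-Python sketch, not the package's SQL. The `Candidate` fields, the helper names, and the `EX:` CURIEs are hypothetical, and it assumes the criteria apply in the order the docs list them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one possible mapping of an input term."""

    term: str            # normalized input term
    curie: str           # placeholder identifier, not a real CURIE
    preferred_name: str
    category: str
    nlp_level: int       # 0 = exact-case match, 1 = normalized match
    category_hits: int   # how often this category appears among the column's matches


def dedupe_keep_first(terms: list[str]) -> list[str]:
    # dict preserves insertion order, so the first occurrence of each term wins
    # and the output order is deterministic.
    return list(dict.fromkeys(terms))


def rank_key(c: Candidate, prioritize: set[str], column_context: bool) -> tuple:
    # Smaller tuples sort first. Each element mirrors one documented criterion;
    # the precedence order is an assumption based on the order of the list.
    return (
        0 if c.category in prioritize else 1,                           # category priority
        0 if c.term.casefold() == c.preferred_name.casefold() else 1,   # preferred-name exactness
        c.nlp_level,                                                    # NLP level
        -c.category_hits if column_context else 0,                      # category frequency
    )


def best_match(cands: list[Candidate], prioritize: set[str], column_context: bool = True) -> Candidate:
    # Keep one candidate per input string: the one with the smallest rank key.
    return min(cands, key=lambda c: rank_key(c, prioritize, column_context))
```

With `prioritize={"Gene"}`, a `Gene` candidate whose preferred name equals the term case-insensitively outranks other `Gene` candidates, regardless of how frequent their category is.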

{tablassert-7.0.1 → tablassert-7.1.0}/docs/api/qc.md RENAMED

@@ -72,28 +72,26 @@ original == preferred_name
 **Performance:** O(1) string comparison
+Before fuzzy matching, the function also applies rule-based pass-through checks for known safe patterns (for example CHEBI/PR/UniProtKB CURIE families and selected exception prefixes).
 #### Stage 2: Fuzzy Matching
 **Medium confidence using RapidFuzz.**
-Four fuzzy matching algorithms:
+Two fuzzy matching algorithms:
 1. **Ratio:** Overall string similarity
-2. **Partial ratio:** Substring matching
-3. **Token sort ratio:** Order-independent word matching
-4. **Partial token sort ratio:** Combined approach
+2. **Partial token sort ratio:** Combined token/subsequence matching
 **Threshold:** Default 20% similarity (configurable)
 ```python
 fuzz.ratio(original, preferred) >= 20
-or fuzz.ratio(original, curie) >= 20
 or fuzz.partial_token_sort_ratio(original, preferred) >= 20
-or fuzz.partial_token_sort_ratio(original, curie) >= 20
 ```
 **Example passes:**
 - Original: `"breast ca"` → Preferred: `"breast cancer"` ✓
-- Original: `"T53"` → CURIE: `"HGNC:11998"` (TP53) ✗ (goes to Stage 3)
+- Original: `"T53"` → Preferred: `"tumor protein p53"` ✗ (goes to Stage 3)
 **Performance:** O(n) string operations, cached via `@DISKCACHE.memoize()`
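The exact-match stage followed by a thresholded fuzzy stage can be illustrated with the standard library alone. This sketch swaps in `difflib.SequenceMatcher` as a stand-in for RapidFuzz's `fuzz.ratio`, so scores may differ from the package's. It does not model `partial_token_sort_ratio`, the rule-based pass-through checks, or caching. The function names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 20.0  # percent, mirroring the documented default


def similarity(a: str, b: str) -> float:
    # Stand-in for a 0-100 similarity score; this is difflib, not RapidFuzz.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def passes_early_stages(original: str, preferred: str, threshold: float = THRESHOLD) -> bool:
    # Stage 1: an exact match short-circuits everything else.
    if original == preferred:
        return True
    # Stage 2: compare against the preferred name only (the CURIE comparison
    # was removed in this release's documentation).
    return similarity(original, preferred) >= threshold
```

Under this stand-in scorer, `passes_early_stages("breast ca", "breast cancer")` is true, while a pair with no characters in common scores 0 and would fall through to the next stage.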
@@ -128,7 +126,7 @@ return similarity >= 0.2
 - ONNX session caching
 - Disk cache for embeddings (~100MB LRU)
-Loaded once at module import, reused for all calls.
+Lazy-loaded on first `BERT_audit()` call, then reused for subsequent calls.
 ### Disk Caching
@@ -142,7 +140,7 @@ def fuzz_audit(...): ...
 def BERT_audit(...): ...
 ```
-**Cache location:** `cachessert/` directory
+**Cache location:** `./.cachassert` directory
 **Cache strategy:** LRU eviction when size exceeds limit
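The change from loading at import time to loading on the first `BERT_audit()` call is a lazy-initialization pattern. One common way to get that behavior in Python is to wrap the expensive loader in `functools.cache`. This is a sketch of the pattern only: `get_session`, `FakeSession`, and `bert_audit_sketch` are hypothetical names, and the package's own mechanism may differ.

```python
from functools import cache


class FakeSession:
    """Placeholder for an expensive resource such as an ONNX inference session."""

    def score(self, a: str, b: str) -> float:
        # Dummy score so the sketch runs; a real session would compare embeddings.
        return 1.0 if a == b else 0.0


@cache
def get_session() -> FakeSession:
    # Runs only on the first call; later calls return the same cached object.
    return FakeSession()


def bert_audit_sketch(original: str, preferred: str, threshold: float = 0.2) -> bool:
    session = get_session()  # nothing is loaded until an audit is actually requested
    return session.score(original, preferred) >= threshold
```

Importing a module written this way costs nothing up front. The loader runs once, on the first audit call, and every later call reuses the cached session.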

{tablassert-7.0.1 → tablassert-7.1.0}/docs/cli.md RENAMED

@@ -40,7 +40,7 @@ Final output files are written to the current working directory as:
 - `{name}_{version}.nodes.ndjson` - Node file (entities)
 - `{name}_{version}.edges.ndjson` - Edge file (relationships)
-Intermediate parquet artifacts are written to `storessert/` during section processing.
+Intermediate parquet artifacts are written to `.storassert/` during section processing.
 See [Graph Configuration](configuration/graph.md) for details on the YAML schema.

{tablassert-7.0.1 → tablassert-7.1.0}/pyproject.toml RENAMED

@@ -1,11 +1,9 @@
 [project]
 name = "tablassert"
-version = "7.0.1"
+version = "7.1.0"
 description = "Tablassert is a highly performant declarative knowledge graph backend designed to extract knowledge assertions from tabular data while exporting NCATS Translator-compliant Knowledge Graph Exchange (KGX) NDJSON."
 authors = [
-    { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" },
-    { name = "Gwennen Glusman", email = "gglusman@isbscience.org" },
-    { name = "Jared C. Roach" },
+    { name = "Skye Lane Goetz", email = "sgoetz@isbscience.org" }
 ]
 keywords = [
     "knowledge graph",
@@ -27,6 +25,7 @@ dependencies = [
     "diskcache>=5.6.3",
     "duckdb>=1.5.0",
     "fastexcel>=0.19.0",
+    "lazy-loader>=0.5",
     "loguru>=0.7.3",
     "mkdocs>=1.6.1",
     "onnxruntime>=1.24.3",
@@ -77,7 +76,7 @@ dev = [
 [tool.ruff]
 line-length = 120
-indent-width = 2
+indent-width = 4
 target-version = "py313"
 [tool.ruff.format]

tablassert-7.1.0/src/tablassert/downloader.py ADDED

@@ -0,0 +1,44 @@
+from __future__ import annotations
+from pathlib import Path
+from time import sleep
+from typing import TYPE_CHECKING, Optional
+import lazy_loader as Lazy
+from playwright.sync_api import sync_playwright
+if TYPE_CHECKING:
+    import pyexcel
+else:
+    pyexcel = Lazy.load("pyexcel")
+def modernize_xls(p: Path) -> Path:
+    xlsx: Path = p.with_suffix(".xlsx")
+    pyexcel.save_book_as(file_name=str(p), dest_file_name=str(xlsx))
+    return xlsx
+def from_url(website: str, p: Path, timeout: int = 60_000, retries: int = 3) -> Path:
+    p.parent.mkdir(parents=True, exist_ok=True)
+    if p.is_file():
+        return p
+    last: Optional[Exception] = None
+    for attempt in range(retries):
+        try:
+            with sync_playwright() as pw:
+                browser = pw.chromium.launch(headless=True)
+                page = browser.new_page()
+                page.goto(website, wait_until="networkidle", timeout=timeout)
+                with page.expect_download(timeout=timeout) as info:
+                    download = info.value
+                    download.save_as(p)
+                browser.close()
+            return p
+        except Exception as e:
+            last = e
+            if attempt < retries - 1:
+                sleep(2**attempt)
+    raise RuntimeError(f"01 | Download Failed After {retries} Attempts: {last}")
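The retry loop in `from_url` waits `2**attempt` seconds between failed attempts and raises once the attempts are exhausted. The same control flow can be isolated from Playwright as below. `with_retries` and its `pause` parameter are hypothetical. `pause` is injectable only so the backoff schedule can be observed without actually sleeping.

```python
from time import sleep
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    pause: Callable[[float], None] = sleep,
) -> T:
    last: Optional[Exception] = None
    for attempt in range(retries):
        try:
            return action()  # success ends the loop immediately
        except Exception as e:
            last = e
            # Back off 1s, 2s, 4s, ... but skip the wait after the final attempt.
            if attempt < retries - 1:
                pause(2**attempt)
    raise RuntimeError(f"Failed after {retries} attempts: {last}")
```

With the default `retries=3`, the function waits at most 1 + 2 = 3 seconds in total before it raises.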
