corp-extractor 0.2.2__tar.gz → 0.2.8__tar.gz
This diff shows the changes between publicly released versions of the package, as they appear in their respective public registries, and is provided for informational purposes only.
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/PKG-INFO +147 -5
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/README.md +144 -3
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/pyproject.toml +7 -2
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/__init__.py +1 -1
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/canonicalization.py +31 -5
- corp_extractor-0.2.8/src/statement_extractor/cli.py +215 -0
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/extractor.py +24 -6
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/models.py +59 -0
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/predicate_comparer.py +202 -25
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/.gitignore +0 -0
- {corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/scoring.py +0 -0
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: corp-extractor
-Version: 0.2.2
+Version: 0.2.8
 Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
 Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
 Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -23,10 +23,11 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Scientific/Engineering :: Information Analysis
 Classifier: Topic :: Text Processing :: Linguistic
 Requires-Python: >=3.10
+Requires-Dist: click>=8.0.0
 Requires-Dist: numpy>=1.24.0
 Requires-Dist: pydantic>=2.0.0
 Requires-Dist: torch>=2.0.0
-Requires-Dist: transformers>=
+Requires-Dist: transformers>=5.0.0rc3
 Provides-Extra: all
 Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
 Provides-Extra: dev
@@ -55,22 +56,30 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
 - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
+- **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
+- **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 
 ## Installation
 
 ```bash
 # Recommended: include embedding support for smart deduplication
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 
 # Minimal installation (no embedding features)
 pip install corp-extractor
 ```
 
-**Note**:
+**Note**: This package requires `transformers>=5.0.0` (pre-release) for T5-Gemma2 model support. Install with `--pre` flag if needed:
+```bash
+pip install --pre "corp-extractor[embeddings]"
+```
+
+**For GPU support**, install PyTorch with CUDA first:
 ```bash
 pip install torch --index-url https://download.pytorch.org/whl/cu121
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 ```
 
 ## Quick Start
@@ -89,6 +98,96 @@ for stmt in result:
     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
 ```
 
+## Command Line Interface
+
+The library includes a CLI for quick extraction from the terminal.
+
+### Install Globally (Recommended)
+
+For best results, install globally first:
+
+```bash
+# Using uv (recommended)
+uv tool install "corp-extractor[embeddings]"
+
+# Using pipx
+pipx install "corp-extractor[embeddings]"
+
+# Using pip
+pip install "corp-extractor[embeddings]"
+
+# Then use anywhere
+corp-extractor "Your text here"
+```
+
+### Quick Run with uvx
+
+Run directly without installing using [uv](https://docs.astral.sh/uv/):
+
+```bash
+uvx corp-extractor "Apple announced a new iPhone."
+```
+
+**Note**: First run downloads the model (~1.5GB) which may take a few minutes.
+
+### Usage Examples
+
+```bash
+# Extract from text argument
+corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+
+# Extract from file
+corp-extractor -f article.txt
+
+# Pipe from stdin
+cat article.txt | corp-extractor -
+
+# Output as JSON
+corp-extractor "Tim Cook is CEO of Apple." --json
+
+# Output as XML
+corp-extractor -f article.txt --xml
+
+# Verbose output with confidence scores
+corp-extractor -f article.txt --verbose
+
+# Use more beams for better quality
+corp-extractor -f article.txt --beams 8
+
+# Use custom predicate taxonomy
+corp-extractor -f article.txt --taxonomy predicates.txt
+
+# Use GPU explicitly
+corp-extractor -f article.txt --device cuda
+```
+
+### CLI Options
+
+```
+Usage: corp-extractor [OPTIONS] [TEXT]
+
+Options:
+  -f, --file PATH                Read input from file
+  -o, --output [table|json|xml]  Output format (default: table)
+  --json                         Output as JSON (shortcut)
+  --xml                          Output as XML (shortcut)
+  -b, --beams INTEGER            Number of beams (default: 4)
+  --diversity FLOAT              Diversity penalty (default: 1.0)
+  --max-tokens INTEGER           Max tokens to generate (default: 2048)
+  --no-dedup                     Disable deduplication
+  --no-embeddings                Disable embedding-based dedup (faster)
+  --no-merge                     Disable beam merging
+  --dedup-threshold FLOAT        Deduplication threshold (default: 0.65)
+  --min-confidence FLOAT         Min confidence filter (default: 0)
+  --taxonomy PATH                Load predicate taxonomy from file
+  --taxonomy-threshold FLOAT     Taxonomy matching threshold (default: 0.5)
+  --device [auto|cuda|cpu]       Device to use (default: auto)
+  -v, --verbose                  Show confidence scores and metadata
+  -q, --quiet                    Suppress progress messages
+  --version                      Show version
+  --help                         Show this message
+```
+
 ## New in v0.2.0: Quality Scoring & Beam Merging
 
 By default, the library now:
@@ -139,6 +238,47 @@ Predicate canonicalization and deduplication now use **contextualized matching**
 
 This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
 
+## New in v0.2.3: Entity Type Merging & Reversal Detection
+
+### Entity Type Merging
+
+When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
+
+```python
+# Before deduplication:
+# Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
+# Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+
+# After deduplication:
+# Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+```
+
+### Subject-Object Reversal Detection
+
+The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:
+
+```python
+from statement_extractor import PredicateComparer
+
+comparer = PredicateComparer()
+
+# Automatically detect and fix reversals
+fixed_statements = comparer.detect_and_fix_reversals(statements)
+
+for stmt in fixed_statements:
+    if stmt.was_reversed:
+        print(f"Fixed reversal: {stmt}")
+```
+
+**How it works:**
+1. For each statement with source text, compares:
+   - "Subject Predicate Object" embedding vs source text
+   - "Object Predicate Subject" embedding vs source text
+2. If the reversed form has higher similarity, swaps subject and object
+3. Sets `was_reversed=True` to indicate the correction
+
+During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
+
 ## Disable Embeddings (Faster, No Extra Dependencies)
 
 ```python
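The three-step check described above is compact enough to show end to end. Below is a minimal standalone sketch of the same idea, assuming `sentence-transformers` is installed; the embedding model named here is an arbitrary choice for illustration, not necessarily the one the package configures internally:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Arbitrary small embedding model, chosen only for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

def looks_reversed(subject: str, predicate: str, obj: str, source: str) -> bool:
    """Return True if 'Object Predicate Subject' matches the source better."""
    normal = f"{subject} {predicate} {obj}"
    swapped = f"{obj} {predicate} {subject}"
    # Unit-normalized embeddings make the dot product a cosine similarity.
    embs = model.encode([normal, swapped, source], normalize_embeddings=True)
    return float(np.dot(embs[1], embs[2])) > float(np.dot(embs[0], embs[2]))

# A deliberately reversed extraction; a reasonable model should flag it.
print(looks_reversed(
    "CuraPharm", "sued by", "AtlasBio Labs",
    "AtlasBio Labs was sued by CuraPharm over patent claims.",
))
```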
@@ -213,6 +353,8 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
 4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
 6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
+7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
+8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
 
 ## Requirements
 
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/README.md

@@ -15,22 +15,30 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
 - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
+- **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
+- **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 
 ## Installation
 
 ```bash
 # Recommended: include embedding support for smart deduplication
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 
 # Minimal installation (no embedding features)
 pip install corp-extractor
 ```
 
-**Note**:
+**Note**: This package requires `transformers>=5.0.0` (pre-release) for T5-Gemma2 model support. Install with `--pre` flag if needed:
+```bash
+pip install --pre "corp-extractor[embeddings]"
+```
+
+**For GPU support**, install PyTorch with CUDA first:
 ```bash
 pip install torch --index-url https://download.pytorch.org/whl/cu121
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 ```
 
 ## Quick Start
@@ -49,6 +57,96 @@ for stmt in result:
     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
 ```
 
+## Command Line Interface
+
+The library includes a CLI for quick extraction from the terminal.
+
+### Install Globally (Recommended)
+
+For best results, install globally first:
+
+```bash
+# Using uv (recommended)
+uv tool install "corp-extractor[embeddings]"
+
+# Using pipx
+pipx install "corp-extractor[embeddings]"
+
+# Using pip
+pip install "corp-extractor[embeddings]"
+
+# Then use anywhere
+corp-extractor "Your text here"
+```
+
+### Quick Run with uvx
+
+Run directly without installing using [uv](https://docs.astral.sh/uv/):
+
+```bash
+uvx corp-extractor "Apple announced a new iPhone."
+```
+
+**Note**: First run downloads the model (~1.5GB) which may take a few minutes.
+
+### Usage Examples
+
+```bash
+# Extract from text argument
+corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+
+# Extract from file
+corp-extractor -f article.txt
+
+# Pipe from stdin
+cat article.txt | corp-extractor -
+
+# Output as JSON
+corp-extractor "Tim Cook is CEO of Apple." --json
+
+# Output as XML
+corp-extractor -f article.txt --xml
+
+# Verbose output with confidence scores
+corp-extractor -f article.txt --verbose
+
+# Use more beams for better quality
+corp-extractor -f article.txt --beams 8
+
+# Use custom predicate taxonomy
+corp-extractor -f article.txt --taxonomy predicates.txt
+
+# Use GPU explicitly
+corp-extractor -f article.txt --device cuda
+```
+
+### CLI Options
+
+```
+Usage: corp-extractor [OPTIONS] [TEXT]
+
+Options:
+  -f, --file PATH                Read input from file
+  -o, --output [table|json|xml]  Output format (default: table)
+  --json                         Output as JSON (shortcut)
+  --xml                          Output as XML (shortcut)
+  -b, --beams INTEGER            Number of beams (default: 4)
+  --diversity FLOAT              Diversity penalty (default: 1.0)
+  --max-tokens INTEGER           Max tokens to generate (default: 2048)
+  --no-dedup                     Disable deduplication
+  --no-embeddings                Disable embedding-based dedup (faster)
+  --no-merge                     Disable beam merging
+  --dedup-threshold FLOAT        Deduplication threshold (default: 0.65)
+  --min-confidence FLOAT         Min confidence filter (default: 0)
+  --taxonomy PATH                Load predicate taxonomy from file
+  --taxonomy-threshold FLOAT     Taxonomy matching threshold (default: 0.5)
+  --device [auto|cuda|cpu]       Device to use (default: auto)
+  -v, --verbose                  Show confidence scores and metadata
+  -q, --quiet                    Suppress progress messages
+  --version                      Show version
+  --help                         Show this message
+```
+
 ## New in v0.2.0: Quality Scoring & Beam Merging
 
 By default, the library now:
@@ -99,6 +197,47 @@ Predicate canonicalization and deduplication now use **contextualized matching**
 
 This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
 
+## New in v0.2.3: Entity Type Merging & Reversal Detection
+
+### Entity Type Merging
+
+When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
+
+```python
+# Before deduplication:
+# Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
+# Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+
+# After deduplication:
+# Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+```
+
+### Subject-Object Reversal Detection
+
+The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:
+
+```python
+from statement_extractor import PredicateComparer
+
+comparer = PredicateComparer()
+
+# Automatically detect and fix reversals
+fixed_statements = comparer.detect_and_fix_reversals(statements)
+
+for stmt in fixed_statements:
+    if stmt.was_reversed:
+        print(f"Fixed reversal: {stmt}")
+```
+
+**How it works:**
+1. For each statement with source text, compares:
+   - "Subject Predicate Object" embedding vs source text
+   - "Object Predicate Subject" embedding vs source text
+2. If the reversed form has higher similarity, swaps subject and object
+3. Sets `was_reversed=True` to indicate the correction
+
+During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
+
 ## Disable Embeddings (Faster, No Extra Dependencies)
 
 ```python
@@ -173,6 +312,8 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
 4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
 6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
+7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
+8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
 
 ## Requirements
 
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "corp-extractor"
-version = "0.2.2"
+version = "0.2.8"
 description = "Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -46,8 +46,9 @@ classifiers = [
 dependencies = [
     "pydantic>=2.0.0",
     "torch>=2.0.0",
-    "transformers>=
+    "transformers>=5.0.0rc3",
     "numpy>=1.24.0",
+    "click>=8.0.0",
 ]
 
 [project.optional-dependencies]
@@ -66,6 +67,10 @@ all = [
     "sentence-transformers>=2.2.0",
 ]
 
+[project.scripts]
+statement-extractor = "statement_extractor.cli:main"
+corp-extractor = "statement_extractor.cli:main"
+
 [project.urls]
 Homepage = "https://github.com/corp-o-rate/statement-extractor"
 Documentation = "https://github.com/corp-o-rate/statement-extractor#readme"
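Both entries point at the same Click function, so after installation the `corp-extractor` and `statement-extractor` commands are interchangeable; for example, `corp-extractor --version` and `statement-extractor --version` should print the same version string.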
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/canonicalization.py

@@ -139,32 +139,58 @@ class Canonicalizer:
 
 def deduplicate_statements_exact(
     statements: list[Statement],
-    entity_canonicalizer: Optional[Callable[[str], str]] = None
+    entity_canonicalizer: Optional[Callable[[str], str]] = None,
+    detect_reversals: bool = True,
 ) -> list[Statement]:
     """
     Deduplicate statements using exact text matching.
 
     Use this when embedding-based deduplication is disabled.
+    When duplicates are found, entity types are merged - specific types
+    (ORG, PERSON, etc.) take precedence over UNKNOWN.
+
+    When detect_reversals=True, also detects reversed duplicates where
+    subject and object are swapped. The first occurrence determines the
+    canonical orientation.
 
     Args:
         statements: List of statements to deduplicate
         entity_canonicalizer: Optional custom canonicalization function
+        detect_reversals: Whether to detect reversed duplicates (default True)
 
     Returns:
-        Deduplicated list
+        Deduplicated list with merged entity types
     """
     if len(statements) <= 1:
         return statements
 
     canonicalizer = Canonicalizer(entity_fn=entity_canonicalizer)
 
-
+    # Map from dedup key to index in unique list
+    seen: dict[tuple[str, str, str], int] = {}
     unique: list[Statement] = []
 
     for stmt in statements:
        key = canonicalizer.create_dedup_key(stmt)
-
-
+        # Also compute reversed key (object, predicate, subject)
+        reversed_key = (key[2], key[1], key[0])
+
+        if key in seen:
+            # Direct duplicate found - merge entity types
+            existing_idx = seen[key]
+            existing_stmt = unique[existing_idx]
+            merged_stmt = existing_stmt.merge_entity_types_from(stmt)
+            unique[existing_idx] = merged_stmt
+        elif detect_reversals and reversed_key in seen:
+            # Reversed duplicate found - merge entity types (accounting for reversal)
+            existing_idx = seen[reversed_key]
+            existing_stmt = unique[existing_idx]
+            # Merge types from the reversed statement
+            merged_stmt = existing_stmt.merge_entity_types_from(stmt.reversed())
+            unique[existing_idx] = merged_stmt
+        else:
+            # New unique statement
+            seen[key] = len(unique)
            unique.append(stmt)
 
    return unique
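A usage sketch of the reworked function. This assumes the `Statement`, `Entity`, and `EntityType` models accept exactly the fields visible in the models.py hunks further down; if the validators require other fields (for example `source_text`), those would need to be supplied as well:

```python
from statement_extractor.models import Entity, EntityType, Statement
from statement_extractor.canonicalization import deduplicate_statements_exact

a = Statement(
    subject=Entity(text="AtlasBio Labs", type=EntityType.UNKNOWN),
    object=Entity(text="CuraPharm", type=EntityType.ORG),
    predicate="sued by",
)
# The same fact with subject and object swapped and a typed subject.
b = Statement(
    subject=Entity(text="CuraPharm", type=EntityType.ORG),
    object=Entity(text="AtlasBio Labs", type=EntityType.ORG),
    predicate="sued by",
)

unique = deduplicate_statements_exact([a, b], detect_reversals=True)
# One statement remains. 'a' came first, so its orientation is canonical,
# and its UNKNOWN subject type is upgraded to ORG by merging b.reversed().
print(len(unique), unique[0])
```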
corp_extractor-0.2.8/src/statement_extractor/cli.py (new file)

@@ -0,0 +1,215 @@
+"""
+Command-line interface for statement extraction.
+
+Usage:
+    corp-extractor "Your text here"
+    corp-extractor -f input.txt
+    cat input.txt | corp-extractor -
+"""
+
+import sys
+from typing import Optional
+
+import click
+
+from . import __version__
+from .models import (
+    ExtractionOptions,
+    PredicateComparisonConfig,
+    PredicateTaxonomy,
+    ScoringConfig,
+)
+
+
+@click.command()
+@click.argument("text", required=False)
+@click.option("-f", "--file", "input_file", type=click.Path(exists=True), help="Read input from file")
+@click.option(
+    "-o", "--output",
+    type=click.Choice(["table", "json", "xml"], case_sensitive=False),
+    default="table",
+    help="Output format (default: table)"
+)
+@click.option("--json", "output_json", is_flag=True, help="Output as JSON (shortcut for -o json)")
+@click.option("--xml", "output_xml", is_flag=True, help="Output as XML (shortcut for -o xml)")
+# Beam search options
+@click.option("-b", "--beams", type=int, default=4, help="Number of beams for diverse beam search (default: 4)")
+@click.option("--diversity", type=float, default=1.0, help="Diversity penalty for beam search (default: 1.0)")
+@click.option("--max-tokens", type=int, default=2048, help="Maximum tokens to generate (default: 2048)")
+# Deduplication options
+@click.option("--no-dedup", is_flag=True, help="Disable deduplication")
+@click.option("--no-embeddings", is_flag=True, help="Disable embedding-based deduplication (faster)")
+@click.option("--no-merge", is_flag=True, help="Disable beam merging (select single best beam)")
+@click.option("--dedup-threshold", type=float, default=0.65, help="Similarity threshold for deduplication (default: 0.65)")
+# Quality options
+@click.option("--min-confidence", type=float, default=0.0, help="Minimum confidence threshold 0-1 (default: 0)")
+# Taxonomy options
+@click.option("--taxonomy", type=click.Path(exists=True), help="Load predicate taxonomy from file (one per line)")
+@click.option("--taxonomy-threshold", type=float, default=0.5, help="Similarity threshold for taxonomy matching (default: 0.5)")
+# Device options
+@click.option("--device", type=click.Choice(["auto", "cuda", "mps", "cpu"]), default="auto", help="Device to use (default: auto)")
+# Output options
+@click.option("-v", "--verbose", is_flag=True, help="Show verbose output with confidence scores")
+@click.option("-q", "--quiet", is_flag=True, help="Suppress progress messages")
+@click.version_option(version=__version__)
+def main(
+    text: Optional[str],
+    input_file: Optional[str],
+    output: str,
+    output_json: bool,
+    output_xml: bool,
+    beams: int,
+    diversity: float,
+    max_tokens: int,
+    no_dedup: bool,
+    no_embeddings: bool,
+    no_merge: bool,
+    dedup_threshold: float,
+    min_confidence: float,
+    taxonomy: Optional[str],
+    taxonomy_threshold: float,
+    device: str,
+    verbose: bool,
+    quiet: bool,
+):
+    """
+    Extract structured statements from text.
+
+    TEXT can be provided as an argument, read from a file with -f, or piped via stdin.
+
+    \b
+    Examples:
+        corp-extractor "Apple announced a new iPhone."
+        corp-extractor -f article.txt --json
+        corp-extractor -f article.txt -o json --beams 8
+        cat article.txt | corp-extractor -
+        echo "Tim Cook is CEO of Apple." | corp-extractor - --verbose
+
+    \b
+    Output formats:
+        table  Human-readable table (default)
+        json   JSON with full metadata
+        xml    Raw XML from model
+    """
+    # Determine output format
+    if output_json:
+        output = "json"
+    elif output_xml:
+        output = "xml"
+
+    # Get input text
+    input_text = _get_input_text(text, input_file)
+    if not input_text:
+        raise click.UsageError(
+            "No input provided. Use: statement-extractor \"text\", "
+            "statement-extractor -f file.txt, or pipe via stdin."
+        )
+
+    if not quiet:
+        click.echo(f"Processing {len(input_text)} characters...", err=True)
+
+    # Load taxonomy if provided
+    predicate_taxonomy = None
+    if taxonomy:
+        predicate_taxonomy = PredicateTaxonomy.from_file(taxonomy)
+        if not quiet:
+            click.echo(f"Loaded taxonomy with {len(predicate_taxonomy.predicates)} predicates", err=True)
+
+    # Configure predicate comparison
+    predicate_config = PredicateComparisonConfig(
+        similarity_threshold=taxonomy_threshold,
+        dedup_threshold=dedup_threshold,
+    )
+
+    # Configure scoring
+    scoring_config = ScoringConfig(min_confidence=min_confidence)
+
+    # Configure extraction options
+    options = ExtractionOptions(
+        num_beams=beams,
+        diversity_penalty=diversity,
+        max_new_tokens=max_tokens,
+        deduplicate=not no_dedup,
+        embedding_dedup=not no_embeddings,
+        merge_beams=not no_merge,
+        predicate_taxonomy=predicate_taxonomy,
+        predicate_config=predicate_config,
+        scoring_config=scoring_config,
+    )
+
+    # Import here to allow --help without loading torch
+    from .extractor import StatementExtractor
+
+    # Create extractor with specified device
+    device_arg = None if device == "auto" else device
+    extractor = StatementExtractor(device=device_arg)
+
+    if not quiet:
+        click.echo(f"Using device: {extractor.device}", err=True)
+
+    # Run extraction
+    try:
+        if output == "xml":
+            result = extractor.extract_as_xml(input_text, options)
+            click.echo(result)
+        elif output == "json":
+            result = extractor.extract_as_json(input_text, options)
+            click.echo(result)
+        else:
+            # Table format
+            result = extractor.extract(input_text, options)
+            _print_table(result, verbose)
+    except Exception as e:
+        raise click.ClickException(f"Extraction failed: {e}")
+
+
+def _get_input_text(text: Optional[str], input_file: Optional[str]) -> Optional[str]:
+    """Get input text from argument, file, or stdin."""
+    if text == "-" or (text is None and input_file is None and not sys.stdin.isatty()):
+        # Read from stdin
+        return sys.stdin.read().strip()
+    elif input_file:
+        # Read from file
+        with open(input_file, "r", encoding="utf-8") as f:
+            return f.read().strip()
+    elif text:
+        return text.strip()
+    return None
+
+
+def _print_table(result, verbose: bool):
+    """Print statements in a human-readable table format."""
+    if not result.statements:
+        click.echo("No statements extracted.")
+        return
+
+    click.echo(f"\nExtracted {len(result.statements)} statement(s):\n")
+    click.echo("-" * 80)
+
+    for i, stmt in enumerate(result.statements, 1):
+        subject_type = f" ({stmt.subject.type.value})" if stmt.subject.type.value != "UNKNOWN" else ""
+        object_type = f" ({stmt.object.type.value})" if stmt.object.type.value != "UNKNOWN" else ""
+
+        click.echo(f"{i}. {stmt.subject.text}{subject_type}")
+        click.echo(f"   --[{stmt.predicate}]-->")
+        click.echo(f"   {stmt.object.text}{object_type}")
+
+        if verbose:
+            if stmt.confidence_score is not None:
+                click.echo(f"   Confidence: {stmt.confidence_score:.2f}")
+
+            if stmt.canonical_predicate:
+                click.echo(f"   Canonical: {stmt.canonical_predicate}")
+
+            if stmt.was_reversed:
+                click.echo(f"   (subject/object were swapped)")
+
+            if stmt.source_text:
+                source = stmt.source_text[:60] + "..." if len(stmt.source_text) > 60 else stmt.source_text
+                click.echo(f"   Source: \"{source}\"")
+
+        click.echo("-" * 80)
+
+
+if __name__ == "__main__":
+    main()
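One design note on this file: `StatementExtractor` is imported inside `main()` rather than at module top, so `--help` and `--version` respond without paying the torch and model import cost. And since the module ends with a standard `__main__` guard, it can presumably also be run without the console scripts:

```bash
python -m statement_extractor.cli "Tim Cook is CEO of Apple." --json
```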
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/extractor.py

@@ -80,11 +80,16 @@ class StatementExtractor:
 
         # Auto-detect device
         if device is None:
-
+            if torch.cuda.is_available():
+                self.device = "cuda"
+            elif torch.backends.mps.is_available():
+                self.device = "mps"
+            else:
+                self.device = "cpu"
         else:
             self.device = device
 
-        # Auto-detect dtype
+        # Auto-detect dtype (bfloat16 only for CUDA, float32 for MPS/CPU)
         if torch_dtype is None:
             self.torch_dtype = torch.bfloat16 if self.device == "cuda" else torch.float32
         else:
@@ -143,7 +148,7 @@ class StatementExtractor:
 
         taxonomy = options.predicate_taxonomy or self._predicate_taxonomy
         config = options.predicate_config or self._predicate_config or PredicateComparisonConfig()
-        return PredicateComparer(taxonomy=taxonomy, config=config)
+        return PredicateComparer(taxonomy=taxonomy, config=config, device=self.device)
 
     @property
     def model(self) -> AutoModelForSeq2SeqLM:
@@ -350,30 +355,41 @@ class StatementExtractor:
         outputs = self.model.generate(
             **inputs,
             max_new_tokens=options.max_new_tokens,
+            max_length=None,  # Override model default, use max_new_tokens only
             num_beams=num_seqs,
             num_beam_groups=num_seqs,
             num_return_sequences=num_seqs,
             diversity_penalty=options.diversity_penalty,
             do_sample=False,
+            top_p=None,  # Override model config to suppress warning
+            top_k=None,  # Override model config to suppress warning
             trust_remote_code=True,
+            custom_generate="transformers-community/group-beam-search",
         )
 
         # Decode and process candidates
         end_tag = "</statements>"
         candidates: list[str] = []
 
-        for output in outputs:
+        for i, output in enumerate(outputs):
             decoded = self.tokenizer.decode(output, skip_special_tokens=True)
+            output_len = len(output)
 
             # Truncate at </statements>
             if end_tag in decoded:
                 end_pos = decoded.find(end_tag) + len(end_tag)
                 decoded = decoded[:end_pos]
                 candidates.append(decoded)
+                logger.debug(f"Beam {i}: {output_len} tokens, found end tag, {len(decoded)} chars")
+            else:
+                # Log the issue - likely truncated
+                logger.warning(f"Beam {i}: {output_len} tokens, NO end tag found (truncated?)")
+                logger.warning(f"Beam {i} full output ({len(decoded)} chars):\n{decoded}")
 
         # Include fallback if no valid candidates
         if not candidates and len(outputs) > 0:
             fallback = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+            logger.warning(f"Using fallback beam (no valid candidates found), {len(fallback)} chars")
             candidates.append(fallback)
 
         return candidates
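The `custom_generate="transformers-community/group-beam-search"` argument ties this hunk to the `transformers>=5.0.0rc3` bump in pyproject.toml: newer transformers releases moved group (diverse) beam search out of the core `generate()` method into a Hub-hosted decoding implementation that is loaded by name via `custom_generate` and requires `trust_remote_code=True`. The `top_p=None`/`top_k=None` overrides simply silence sampling-parameter warnings, since decoding here is fully deterministic (`do_sample=False`).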
@@ -467,8 +483,10 @@ class StatementExtractor:
 
         try:
             root = ET.fromstring(xml_output)
-        except ET.ParseError:
-
+        except ET.ParseError as e:
+            # Log full output for debugging
+            logger.warning(f"Failed to parse XML output: {e}")
+            logger.warning(f"Full XML output ({len(xml_output)} chars):\n{xml_output}")
             return statements
 
         if root.tag != 'statements':
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/models.py

@@ -32,6 +32,18 @@ class Entity(BaseModel):
     def __str__(self) -> str:
         return f"{self.text} ({self.type.value})"
 
+    def merge_type_from(self, other: "Entity") -> "Entity":
+        """
+        Return a new Entity with the more specific type.
+
+        If this entity has UNKNOWN type and other has a specific type,
+        returns a new entity with this text but other's type.
+        Otherwise returns self unchanged.
+        """
+        if self.type == EntityType.UNKNOWN and other.type != EntityType.UNKNOWN:
+            return Entity(text=self.text, type=other.type)
+        return self
+
 
 class Statement(BaseModel):
     """A single extracted statement (subject-predicate-object triple)."""
@@ -55,6 +67,10 @@ class Statement(BaseModel):
         None,
         description="Canonical form of the predicate if taxonomy matching was used"
     )
+    was_reversed: bool = Field(
+        default=False,
+        description="True if subject/object were swapped during reversal detection"
+    )
 
     def __str__(self) -> str:
         return f"{self.subject.text} -- {self.predicate} --> {self.object.text}"
@@ -63,6 +79,49 @@ class Statement(BaseModel):
         """Return as a simple (subject, predicate, object) tuple."""
         return (self.subject.text, self.predicate, self.object.text)
 
+    def merge_entity_types_from(self, other: "Statement") -> "Statement":
+        """
+        Return a new Statement with more specific entity types merged from other.
+
+        If this statement has UNKNOWN entity types and other has specific types,
+        the returned statement will use the specific types from other.
+        All other fields come from self.
+        """
+        merged_subject = self.subject.merge_type_from(other.subject)
+        merged_object = self.object.merge_type_from(other.object)
+
+        # Only create new statement if something changed
+        if merged_subject is self.subject and merged_object is self.object:
+            return self
+
+        return Statement(
+            subject=merged_subject,
+            object=merged_object,
+            predicate=self.predicate,
+            source_text=self.source_text,
+            confidence_score=self.confidence_score,
+            evidence_span=self.evidence_span,
+            canonical_predicate=self.canonical_predicate,
+            was_reversed=self.was_reversed,
+        )
+
+    def reversed(self) -> "Statement":
+        """
+        Return a new Statement with subject and object swapped.
+
+        Sets was_reversed=True to indicate the swap occurred.
+        """
+        return Statement(
+            subject=self.object,
+            object=self.subject,
+            predicate=self.predicate,
+            source_text=self.source_text,
+            confidence_score=self.confidence_score,
+            evidence_span=self.evidence_span,
+            canonical_predicate=self.canonical_predicate,
+            was_reversed=True,
+        )
+
 
 class ExtractionResult(BaseModel):
     """The result of statement extraction from text."""
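A quick sketch of how the two new `Statement` helpers compose, under the same model-construction assumption as the earlier sketch (no other required fields):

```python
from statement_extractor.models import Entity, EntityType, Statement

stmt = Statement(
    subject=Entity(text="Beats", type=EntityType.UNKNOWN),
    object=Entity(text="Apple", type=EntityType.ORG),
    predicate="acquired by",
)

# merge_type_from only upgrades UNKNOWN; a specific type is never overwritten.
typed = Entity(text="Beats", type=EntityType.ORG)
assert stmt.subject.merge_type_from(typed).type == EntityType.ORG

# reversed() swaps the entities and flags the statement.
flipped = stmt.reversed()
assert flipped.subject.text == "Apple" and flipped.was_reversed
```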
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/predicate_comparer.py

@@ -62,6 +62,7 @@ class PredicateComparer:
         self,
         taxonomy: Optional[PredicateTaxonomy] = None,
         config: Optional[PredicateComparisonConfig] = None,
+        device: Optional[str] = None,
     ):
         """
         Initialize the predicate comparer.
@@ -69,6 +70,7 @@ class PredicateComparer:
         Args:
             taxonomy: Optional canonical predicate taxonomy for normalization
             config: Comparison configuration (uses defaults if not provided)
+            device: Device to use ('cuda', 'cpu', or None for auto-detect)
 
         Raises:
             EmbeddingDependencyError: If sentence-transformers is not installed
@@ -78,6 +80,18 @@ class PredicateComparer:
         self.taxonomy = taxonomy
         self.config = config or PredicateComparisonConfig()
 
+        # Auto-detect device
+        if device is None:
+            import torch
+            if torch.cuda.is_available():
+                self.device = "cuda"
+            elif torch.backends.mps.is_available():
+                self.device = "mps"
+            else:
+                self.device = "cpu"
+        else:
+            self.device = device
+
         # Lazy-loaded resources
         self._model = None
         self._taxonomy_embeddings: Optional[np.ndarray] = None
@@ -89,9 +103,9 @@ class PredicateComparer:
 
         from sentence_transformers import SentenceTransformer
 
-        logger.info(f"Loading embedding model: {self.config.embedding_model}")
-        self._model = SentenceTransformer(self.config.embedding_model)
-        logger.info("Embedding model loaded")
+        logger.info(f"Loading embedding model: {self.config.embedding_model} on {self.device}")
+        self._model = SentenceTransformer(self.config.embedding_model, device=self.device)
+        logger.info(f"Embedding model loaded on {self.device}")
 
     def _normalize_text(self, text: str) -> str:
         """Normalize text before embedding."""
@@ -256,21 +270,26 @@ class PredicateComparer:
         self,
         statements: list[Statement],
         entity_canonicalizer: Optional[callable] = None,
+        detect_reversals: bool = True,
     ) -> list[Statement]:
         """
         Remove duplicate statements using embedding-based predicate comparison.
 
         Two statements are considered duplicates if:
-        - Canonicalized subjects match
+        - Canonicalized subjects match AND canonicalized objects match, OR
+        - Canonicalized subjects match objects (reversed) when detect_reversals=True
         - Predicates are similar (embedding-based)
-        - Canonicalized objects match
 
         When duplicates are found, keeps the statement with better contextualized
         match (comparing "Subject Predicate Object" against source text).
 
+        For reversed duplicates, the correct orientation is determined by comparing
+        both "S P O" and "O P S" against source text.
+
         Args:
             statements: List of Statement objects
             entity_canonicalizer: Optional function to canonicalize entity text
+            detect_reversals: Whether to detect reversed duplicates (default True)
 
         Returns:
             Deduplicated list of statements (keeps best contextualized match)
@@ -293,6 +312,12 @@ class PredicateComparer:
         ]
         contextualized_embeddings = self._compute_embeddings(contextualized_texts)
 
+        # Compute reversed contextualized embeddings: "Object Predicate Subject"
+        reversed_texts = [
+            f"{s.object.text} {s.predicate} {s.subject.text}" for s in statements
+        ]
+        reversed_embeddings = self._compute_embeddings(reversed_texts)
+
         # Compute source text embeddings for scoring which duplicate to keep
         source_embeddings = []
         for stmt in statements:
@@ -302,6 +327,7 @@ class PredicateComparer:
         unique_statements: list[Statement] = []
         unique_pred_embeddings: list[np.ndarray] = []
         unique_context_embeddings: list[np.ndarray] = []
+        unique_reversed_embeddings: list[np.ndarray] = []
         unique_source_embeddings: list[np.ndarray] = []
         unique_indices: list[int] = []
 
@@ -310,13 +336,23 @@ class PredicateComparer:
             obj_canon = canonicalize(stmt.object.text)
 
             duplicate_idx = None
+            is_reversed_match = False
 
             for j, unique_stmt in enumerate(unique_statements):
                 unique_subj = canonicalize(unique_stmt.subject.text)
                 unique_obj = canonicalize(unique_stmt.object.text)
 
-                # Check subject
-
+                # Check direct match: subject->subject, object->object
+                direct_match = (subj_canon == unique_subj and obj_canon == unique_obj)
+
+                # Check reversed match: subject->object, object->subject
+                reversed_match = (
+                    detect_reversals and
+                    subj_canon == unique_obj and
+                    obj_canon == unique_subj
+                )
+
+                if not direct_match and not reversed_match:
                     continue
 
                 # Check predicate similarity
@@ -326,6 +362,7 @@ class PredicateComparer:
                 )
                 if similarity >= self.config.dedup_threshold:
                     duplicate_idx = j
+                    is_reversed_match = reversed_match and not direct_match
                     break
 
             if duplicate_idx is None:
@@ -333,27 +370,91 @@ class PredicateComparer:
                 unique_statements.append(stmt)
                 unique_pred_embeddings.append(pred_embeddings[i])
                 unique_context_embeddings.append(contextualized_embeddings[i])
+                unique_reversed_embeddings.append(reversed_embeddings[i])
                 unique_source_embeddings.append(source_embeddings[i])
                 unique_indices.append(i)
             else:
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+                existing_stmt = unique_statements[duplicate_idx]
+
+                if is_reversed_match:
+                    # Reversed duplicate - determine correct orientation using source text
+                    # Compare current's normal vs reversed against its source
+                    current_normal_score = self._cosine_similarity(
+                        contextualized_embeddings[i], source_embeddings[i]
+                    )
+                    current_reversed_score = self._cosine_similarity(
+                        reversed_embeddings[i], source_embeddings[i]
+                    )
+                    # Compare existing's normal vs reversed against its source
+                    existing_normal_score = self._cosine_similarity(
+                        unique_context_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+                    existing_reversed_score = self._cosine_similarity(
+                        unique_reversed_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+
+                    # Determine best orientation for current
+                    current_best = max(current_normal_score, current_reversed_score)
+                    current_should_reverse = current_reversed_score > current_normal_score
+
+                    # Determine best orientation for existing
+                    existing_best = max(existing_normal_score, existing_reversed_score)
+                    existing_should_reverse = existing_reversed_score > existing_normal_score
+
+                    if current_best > existing_best:
+                        # Current is better - use it (possibly reversed)
+                        if current_should_reverse:
+                            best_stmt = stmt.reversed()
+                        else:
+                            best_stmt = stmt
+                        # Merge entity types from existing (accounting for reversal)
+                        if existing_should_reverse:
+                            best_stmt = best_stmt.merge_entity_types_from(existing_stmt.reversed())
+                        else:
+                            best_stmt = best_stmt.merge_entity_types_from(existing_stmt)
+                        unique_statements[duplicate_idx] = best_stmt
+                        unique_pred_embeddings[duplicate_idx] = pred_embeddings[i]
+                        unique_context_embeddings[duplicate_idx] = contextualized_embeddings[i]
+                        unique_reversed_embeddings[duplicate_idx] = reversed_embeddings[i]
+                        unique_source_embeddings[duplicate_idx] = source_embeddings[i]
+                        unique_indices[duplicate_idx] = i
+                    else:
+                        # Existing is better - possibly fix its orientation
+                        if existing_should_reverse and not existing_stmt.was_reversed:
+                            best_stmt = existing_stmt.reversed()
+                        else:
+                            best_stmt = existing_stmt
+                        # Merge entity types from current (accounting for reversal)
+                        if current_should_reverse:
+                            best_stmt = best_stmt.merge_entity_types_from(stmt.reversed())
+                        else:
+                            best_stmt = best_stmt.merge_entity_types_from(stmt)
+                        unique_statements[duplicate_idx] = best_stmt
+                else:
+                    # Direct duplicate - keep the one with better contextualized match
+                    current_score = self._cosine_similarity(
+                        contextualized_embeddings[i], source_embeddings[i]
+                    )
+                    existing_score = self._cosine_similarity(
+                        unique_context_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+
+                    if current_score > existing_score:
+                        # Current statement is a better match - replace
+                        merged_stmt = stmt.merge_entity_types_from(existing_stmt)
+                        unique_statements[duplicate_idx] = merged_stmt
+                        unique_pred_embeddings[duplicate_idx] = pred_embeddings[i]
+                        unique_context_embeddings[duplicate_idx] = contextualized_embeddings[i]
+                        unique_reversed_embeddings[duplicate_idx] = reversed_embeddings[i]
+                        unique_source_embeddings[duplicate_idx] = source_embeddings[i]
+                        unique_indices[duplicate_idx] = i
+                    else:
+                        # Existing statement is better - merge entity types from current
+                        merged_stmt = existing_stmt.merge_entity_types_from(stmt)
+                        unique_statements[duplicate_idx] = merged_stmt
 
         return unique_statements
 
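Concretely, suppose "A sued-by B" and "B sued-by A" collide as a reversed pair, and (picking illustrative numbers) the current statement scores 0.82 normal / 0.71 reversed against its source while the existing one scores 0.68 / 0.77. The current statement wins (0.82 > 0.77), keeps its normal orientation, and absorbs entity types from the existing statement's reversed form so that the types line up with the winning orientation.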
@@ -437,3 +538,79 @@ class PredicateComparer:
             similarity=best_score,
             matched=False,
         )
+
+    def detect_and_fix_reversals(
+        self,
+        statements: list[Statement],
+        threshold: float = 0.05,
+    ) -> list[Statement]:
+        """
+        Detect and fix subject-object reversals using embedding comparison.
+
+        For each statement, compares:
+        - "Subject Predicate Object" embedding against source_text
+        - "Object Predicate Subject" embedding against source_text
+
+        If the reversed version has significantly higher similarity to the source,
+        the subject and object are swapped and was_reversed is set to True.
+
+        Args:
+            statements: List of Statement objects
+            threshold: Minimum similarity difference to trigger reversal (default 0.05)
+
+        Returns:
+            List of statements with reversals corrected
+        """
+        if not statements:
+            return statements
+
+        result = []
+        for stmt in statements:
+            # Skip if no source_text to compare against
+            if not stmt.source_text:
+                result.append(stmt)
+                continue
+
+            # Build normal and reversed triple strings
+            normal_text = f"{stmt.subject.text} {stmt.predicate} {stmt.object.text}"
+            reversed_text = f"{stmt.object.text} {stmt.predicate} {stmt.subject.text}"
+
+            # Compute embeddings for normal, reversed, and source
+            embeddings = self._compute_embeddings([normal_text, reversed_text, stmt.source_text])
+            normal_emb, reversed_emb, source_emb = embeddings[0], embeddings[1], embeddings[2]
+
+            # Compute similarities to source
+            normal_sim = self._cosine_similarity(normal_emb, source_emb)
+            reversed_sim = self._cosine_similarity(reversed_emb, source_emb)
+
+            # If reversed is significantly better, swap subject and object
+            if reversed_sim > normal_sim + threshold:
+                result.append(stmt.reversed())
+            else:
+                result.append(stmt)
+
+        return result
+
+    def check_reversal(self, statement: Statement) -> tuple[bool, float, float]:
+        """
+        Check if a single statement should be reversed.
+
+        Args:
+            statement: Statement to check
+
+        Returns:
+            Tuple of (should_reverse, normal_similarity, reversed_similarity)
+        """
+        if not statement.source_text:
+            return (False, 0.0, 0.0)
+
+        normal_text = f"{statement.subject.text} {statement.predicate} {statement.object.text}"
+        reversed_text = f"{statement.object.text} {statement.predicate} {statement.subject.text}"
+
+        embeddings = self._compute_embeddings([normal_text, reversed_text, statement.source_text])
+        normal_emb, reversed_emb, source_emb = embeddings[0], embeddings[1], embeddings[2]
+
+        normal_sim = self._cosine_similarity(normal_emb, source_emb)
+        reversed_sim = self._cosine_similarity(reversed_emb, source_emb)
+
+        return (reversed_sim > normal_sim, normal_sim, reversed_sim)
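Both new methods lean on the comparer's existing `_compute_embeddings` and `_cosine_similarity` helpers, which are untouched by this release and therefore absent from the diff. For orientation only, a typical cosine helper over numpy vectors looks like this (a sketch, not the package's actual code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D embedding vectors."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b) / denom) if denom else 0.0
```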
{corp_extractor-0.2.2 → corp_extractor-0.2.8}/.gitignore: file without changes

{corp_extractor-0.2.2 → corp_extractor-0.2.8}/src/statement_extractor/scoring.py: file without changes