PyPI - corp-extractor - Versions diffs - 0.2.3__tar.gz → 0.2.8__tar.gz - Mend

Files changed (11)

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: corp-extractor
-Version: 0.2.3
+Version: 0.2.8
 Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
 Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
 Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -23,10 +23,11 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Scientific/Engineering :: Information Analysis
 Classifier: Topic :: Text Processing :: Linguistic
 Requires-Python: >=3.10
+Requires-Dist: click>=8.0.0
 Requires-Dist: numpy>=1.24.0
 Requires-Dist: pydantic>=2.0.0
 Requires-Dist: torch>=2.0.0
-Requires-Dist: transformers>=4.35.0
+Requires-Dist: transformers>=5.0.0rc3
 Provides-Extra: all
 Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
 Provides-Extra: dev
@@ -57,22 +58,28 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
 - **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
 - **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 ## Installation
 ```bash
 # Recommended: include embedding support for smart deduplication
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 # Minimal installation (no embedding features)
 pip install corp-extractor
 ```
-**Note**: For GPU support, install PyTorch with CUDA first:
+**Note**: This package requires `transformers>=5.0.0` (pre-release) for T5-Gemma2 model support. Install with `--pre` flag if needed:
+```bash
+pip install --pre "corp-extractor[embeddings]"
+```
+**For GPU support**, install PyTorch with CUDA first:
 ```bash
 pip install torch --index-url https://download.pytorch.org/whl/cu121
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 ```
 ## Quick Start
@@ -91,6 +98,96 @@ for stmt in result:
     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
 ```
+## Command Line Interface
+The library includes a CLI for quick extraction from the terminal.
+### Install Globally (Recommended)
+For best results, install globally first:
+```bash
+# Using uv (recommended)
+uv tool install "corp-extractor[embeddings]"
+# Using pipx
+pipx install "corp-extractor[embeddings]"
+# Using pip
+pip install "corp-extractor[embeddings]"
+# Then use anywhere
+corp-extractor "Your text here"
+```
+### Quick Run with uvx
+Run directly without installing using [uv](https://docs.astral.sh/uv/):
+```bash
+uvx corp-extractor "Apple announced a new iPhone."
+```
+**Note**: First run downloads the model (~1.5GB) which may take a few minutes.
+### Usage Examples
+```bash
+# Extract from text argument
+corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+# Extract from file
+corp-extractor -f article.txt
+# Pipe from stdin
+cat article.txt | corp-extractor -
+# Output as JSON
+corp-extractor "Tim Cook is CEO of Apple." --json
+# Output as XML
+corp-extractor -f article.txt --xml
+# Verbose output with confidence scores
+corp-extractor -f article.txt --verbose
+# Use more beams for better quality
+corp-extractor -f article.txt --beams 8
+# Use custom predicate taxonomy
+corp-extractor -f article.txt --taxonomy predicates.txt
+# Use GPU explicitly
+corp-extractor -f article.txt --device cuda
+```
+### CLI Options
+```
+Usage: corp-extractor [OPTIONS] [TEXT]
+Options:
+  -f, --file PATH              Read input from file
+  -o, --output [table|json|xml] Output format (default: table)
+  --json                       Output as JSON (shortcut)
+  --xml                        Output as XML (shortcut)
+  -b, --beams INTEGER          Number of beams (default: 4)
+  --diversity FLOAT            Diversity penalty (default: 1.0)
+  --max-tokens INTEGER         Max tokens to generate (default: 2048)
+  --no-dedup                   Disable deduplication
+  --no-embeddings              Disable embedding-based dedup (faster)
+  --no-merge                   Disable beam merging
+  --dedup-threshold FLOAT      Deduplication threshold (default: 0.65)
+  --min-confidence FLOAT       Min confidence filter (default: 0)
+  --taxonomy PATH              Load predicate taxonomy from file
+  --taxonomy-threshold FLOAT   Taxonomy matching threshold (default: 0.5)
+  --device [auto|cuda|cpu]     Device to use (default: auto)
+  -v, --verbose                Show confidence scores and metadata
+  -q, --quiet                  Suppress progress messages
+  --version                    Show version
+  --help                       Show this message
+```
 ## New in v0.2.0: Quality Scoring & Beam Merging
 By default, the library now:

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/README.md RENAMED

@@ -17,22 +17,28 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
 - **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
 - **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 ## Installation
 ```bash
 # Recommended: include embedding support for smart deduplication
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 # Minimal installation (no embedding features)
 pip install corp-extractor
 ```
-**Note**: For GPU support, install PyTorch with CUDA first:
+**Note**: This package requires `transformers>=5.0.0` (pre-release) for T5-Gemma2 model support. Install with `--pre` flag if needed:
+```bash
+pip install --pre "corp-extractor[embeddings]"
+```
+**For GPU support**, install PyTorch with CUDA first:
 ```bash
 pip install torch --index-url https://download.pytorch.org/whl/cu121
-pip install corp-extractor[embeddings]
+pip install "corp-extractor[embeddings]"
 ```
 ## Quick Start
@@ -51,6 +57,96 @@ for stmt in result:
     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
 ```
+## Command Line Interface
+The library includes a CLI for quick extraction from the terminal.
+### Install Globally (Recommended)
+For best results, install globally first:
+```bash
+# Using uv (recommended)
+uv tool install "corp-extractor[embeddings]"
+# Using pipx
+pipx install "corp-extractor[embeddings]"
+# Using pip
+pip install "corp-extractor[embeddings]"
+# Then use anywhere
+corp-extractor "Your text here"
+```
+### Quick Run with uvx
+Run directly without installing using [uv](https://docs.astral.sh/uv/):
+```bash
+uvx corp-extractor "Apple announced a new iPhone."
+```
+**Note**: First run downloads the model (~1.5GB) which may take a few minutes.
+### Usage Examples
+```bash
+# Extract from text argument
+corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+# Extract from file
+corp-extractor -f article.txt
+# Pipe from stdin
+cat article.txt | corp-extractor -
+# Output as JSON
+corp-extractor "Tim Cook is CEO of Apple." --json
+# Output as XML
+corp-extractor -f article.txt --xml
+# Verbose output with confidence scores
+corp-extractor -f article.txt --verbose
+# Use more beams for better quality
+corp-extractor -f article.txt --beams 8
+# Use custom predicate taxonomy
+corp-extractor -f article.txt --taxonomy predicates.txt
+# Use GPU explicitly
+corp-extractor -f article.txt --device cuda
+```
+### CLI Options
+```
+Usage: corp-extractor [OPTIONS] [TEXT]
+Options:
+  -f, --file PATH              Read input from file
+  -o, --output [table|json|xml] Output format (default: table)
+  --json                       Output as JSON (shortcut)
+  --xml                        Output as XML (shortcut)
+  -b, --beams INTEGER          Number of beams (default: 4)
+  --diversity FLOAT            Diversity penalty (default: 1.0)
+  --max-tokens INTEGER         Max tokens to generate (default: 2048)
+  --no-dedup                   Disable deduplication
+  --no-embeddings              Disable embedding-based dedup (faster)
+  --no-merge                   Disable beam merging
+  --dedup-threshold FLOAT      Deduplication threshold (default: 0.65)
+  --min-confidence FLOAT       Min confidence filter (default: 0)
+  --taxonomy PATH              Load predicate taxonomy from file
+  --taxonomy-threshold FLOAT   Taxonomy matching threshold (default: 0.5)
+  --device [auto|cuda|cpu]     Device to use (default: auto)
+  -v, --verbose                Show confidence scores and metadata
+  -q, --quiet                  Suppress progress messages
+  --version                    Show version
+  --help                       Show this message
+```
 ## New in v0.2.0: Quality Scoring & Beam Merging
 By default, the library now:

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "corp-extractor"
-version = "0.2.3"
+version = "0.2.8"
 description = "Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search"
 readme = "README.md"
 requires-python = ">=3.10"
@@ -46,8 +46,9 @@ classifiers = [
 dependencies = [
     "pydantic>=2.0.0",
     "torch>=2.0.0",
-    "transformers>=4.35.0",
+    "transformers>=5.0.0rc3",
     "numpy>=1.24.0",
+    "click>=8.0.0",
 ]
 [project.optional-dependencies]
@@ -66,6 +67,10 @@ all = [
     "sentence-transformers>=2.2.0",
 ]
+[project.scripts]
+statement-extractor = "statement_extractor.cli:main"
+corp-extractor = "statement_extractor.cli:main"
 [project.urls]
 Homepage = "https://github.com/corp-o-rate/statement-extractor"
 Documentation = "https://github.com/corp-o-rate/statement-extractor#readme"
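
Note on the new `[project.scripts]` table: each entry maps a command name to a `module:callable` reference, and the installer generates a small wrapper that imports the module and calls that object. Both `statement-extractor` and `corp-extractor` point at the same `statement_extractor.cli:main`. The sketch below illustrates how such a reference resolves; the helper name `load_target` is hypothetical, and the demonstration uses a standard-library target so it runs without the package installed.

```python
from importlib import import_module
from typing import Callable


def load_target(spec: str) -> Callable:
    """Resolve a 'module:attr' reference to the object it names.

    Hypothetical helper for illustration; assumes the 'module:attr' form
    used by the [project.scripts] entries above.
    """
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not attr_path:
        raise ValueError(f"expected 'module:attr', got {spec!r}")
    obj = import_module(module_name)
    # Walk dotted attribute paths such as 'Class.method'.
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj


# Stand-in target from the standard library; with the package installed,
# "statement_extractor.cli:main" would resolve the same way.
dumps = load_target("json:dumps")
print(dumps({"ok": True}))
```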

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/__init__.py RENAMED

@@ -29,7 +29,7 @@ Example:
     >>> data = extract_statements_as_dict("Some text...")
 """
-__version__ = "0.2.2"
+__version__ = "0.2.5"
 # Core models
 from .models import (

corp_extractor-0.2.8/src/statement_extractor/cli.py ADDED

@@ -0,0 +1,215 @@
+"""
+Command-line interface for statement extraction.
+Usage:
+    corp-extractor "Your text here"
+    corp-extractor -f input.txt
+    cat input.txt | corp-extractor -
+"""
+import sys
+from typing import Optional
+import click
+from . import __version__
+from .models import (
+    ExtractionOptions,
+    PredicateComparisonConfig,
+    PredicateTaxonomy,
+    ScoringConfig,
+)
+@click.command()
+@click.argument("text", required=False)
+@click.option("-f", "--file", "input_file", type=click.Path(exists=True), help="Read input from file")
+@click.option(
+    "-o", "--output",
+    type=click.Choice(["table", "json", "xml"], case_sensitive=False),
+    default="table",
+    help="Output format (default: table)"
+)
+@click.option("--json", "output_json", is_flag=True, help="Output as JSON (shortcut for -o json)")
+@click.option("--xml", "output_xml", is_flag=True, help="Output as XML (shortcut for -o xml)")
+# Beam search options
+@click.option("-b", "--beams", type=int, default=4, help="Number of beams for diverse beam search (default: 4)")
+@click.option("--diversity", type=float, default=1.0, help="Diversity penalty for beam search (default: 1.0)")
+@click.option("--max-tokens", type=int, default=2048, help="Maximum tokens to generate (default: 2048)")
+# Deduplication options
+@click.option("--no-dedup", is_flag=True, help="Disable deduplication")
+@click.option("--no-embeddings", is_flag=True, help="Disable embedding-based deduplication (faster)")
+@click.option("--no-merge", is_flag=True, help="Disable beam merging (select single best beam)")
+@click.option("--dedup-threshold", type=float, default=0.65, help="Similarity threshold for deduplication (default: 0.65)")
+# Quality options
+@click.option("--min-confidence", type=float, default=0.0, help="Minimum confidence threshold 0-1 (default: 0)")
+# Taxonomy options
+@click.option("--taxonomy", type=click.Path(exists=True), help="Load predicate taxonomy from file (one per line)")
+@click.option("--taxonomy-threshold", type=float, default=0.5, help="Similarity threshold for taxonomy matching (default: 0.5)")
+# Device options
+@click.option("--device", type=click.Choice(["auto", "cuda", "mps", "cpu"]), default="auto", help="Device to use (default: auto)")
+# Output options
+@click.option("-v", "--verbose", is_flag=True, help="Show verbose output with confidence scores")
+@click.option("-q", "--quiet", is_flag=True, help="Suppress progress messages")
+@click.version_option(version=__version__)
+def main(
+    text: Optional[str],
+    input_file: Optional[str],
+    output: str,
+    output_json: bool,
+    output_xml: bool,
+    beams: int,
+    diversity: float,
+    max_tokens: int,
+    no_dedup: bool,
+    no_embeddings: bool,
+    no_merge: bool,
+    dedup_threshold: float,
+    min_confidence: float,
+    taxonomy: Optional[str],
+    taxonomy_threshold: float,
+    device: str,
+    verbose: bool,
+    quiet: bool,
+):
+    """
+    Extract structured statements from text.
+    TEXT can be provided as an argument, read from a file with -f, or piped via stdin.
+    \b
+    Examples:
+        corp-extractor "Apple announced a new iPhone."
+        corp-extractor -f article.txt --json
+        corp-extractor -f article.txt -o json --beams 8
+        cat article.txt | corp-extractor -
+        echo "Tim Cook is CEO of Apple." | corp-extractor - --verbose
+    \b
+    Output formats:
+        table  Human-readable table (default)
+        json   JSON with full metadata
+        xml    Raw XML from model
+    """
+    # Determine output format
+    if output_json:
+        output = "json"
+    elif output_xml:
+        output = "xml"
+    # Get input text
+    input_text = _get_input_text(text, input_file)
+    if not input_text:
+        raise click.UsageError(
+            "No input provided. Use: statement-extractor \"text\", "
+            "statement-extractor -f file.txt, or pipe via stdin."
+        )
+    if not quiet:
+        click.echo(f"Processing {len(input_text)} characters...", err=True)
+    # Load taxonomy if provided
+    predicate_taxonomy = None
+    if taxonomy:
+        predicate_taxonomy = PredicateTaxonomy.from_file(taxonomy)
+        if not quiet:
+            click.echo(f"Loaded taxonomy with {len(predicate_taxonomy.predicates)} predicates", err=True)
+    # Configure predicate comparison
+    predicate_config = PredicateComparisonConfig(
+        similarity_threshold=taxonomy_threshold,
+        dedup_threshold=dedup_threshold,
+    )
+    # Configure scoring
+    scoring_config = ScoringConfig(min_confidence=min_confidence)
+    # Configure extraction options
+    options = ExtractionOptions(
+        num_beams=beams,
+        diversity_penalty=diversity,
+        max_new_tokens=max_tokens,
+        deduplicate=not no_dedup,
+        embedding_dedup=not no_embeddings,
+        merge_beams=not no_merge,
+        predicate_taxonomy=predicate_taxonomy,
+        predicate_config=predicate_config,
+        scoring_config=scoring_config,
+    )
+    # Import here to allow --help without loading torch
+    from .extractor import StatementExtractor
+    # Create extractor with specified device
+    device_arg = None if device == "auto" else device
+    extractor = StatementExtractor(device=device_arg)
+    if not quiet:
+        click.echo(f"Using device: {extractor.device}", err=True)
+    # Run extraction
+    try:
+        if output == "xml":
+            result = extractor.extract_as_xml(input_text, options)
+            click.echo(result)
+        elif output == "json":
+            result = extractor.extract_as_json(input_text, options)
+            click.echo(result)
+        else:
+            # Table format
+            result = extractor.extract(input_text, options)
+            _print_table(result, verbose)
+    except Exception as e:
+        raise click.ClickException(f"Extraction failed: {e}")
+def _get_input_text(text: Optional[str], input_file: Optional[str]) -> Optional[str]:
+    """Get input text from argument, file, or stdin."""
+    if text == "-" or (text is None and input_file is None and not sys.stdin.isatty()):
+        # Read from stdin
+        return sys.stdin.read().strip()
+    elif input_file:
+        # Read from file
+        with open(input_file, "r", encoding="utf-8") as f:
+            return f.read().strip()
+    elif text:
+        return text.strip()
+    return None
+def _print_table(result, verbose: bool):
+    """Print statements in a human-readable table format."""
+    if not result.statements:
+        click.echo("No statements extracted.")
+        return
+    click.echo(f"\nExtracted {len(result.statements)} statement(s):\n")
+    click.echo("-" * 80)
+    for i, stmt in enumerate(result.statements, 1):
+        subject_type = f" ({stmt.subject.type.value})" if stmt.subject.type.value != "UNKNOWN" else ""
+        object_type = f" ({stmt.object.type.value})" if stmt.object.type.value != "UNKNOWN" else ""
+        click.echo(f"{i}. {stmt.subject.text}{subject_type}")
+        click.echo(f"   --[{stmt.predicate}]-->")
+        click.echo(f"   {stmt.object.text}{object_type}")
+        if verbose:
+            if stmt.confidence_score is not None:
+                click.echo(f"   Confidence: {stmt.confidence_score:.2f}")
+            if stmt.canonical_predicate:
+                click.echo(f"   Canonical: {stmt.canonical_predicate}")
+            if stmt.was_reversed:
+                click.echo(f"   (subject/object were swapped)")
+            if stmt.source_text:
+                source = stmt.source_text[:60] + "..." if len(stmt.source_text) > 60 else stmt.source_text
+                click.echo(f"   Source: \"{source}\"")
+        click.echo("-" * 80)
+if __name__ == "__main__":
+    main()
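
Note on precedence in the new CLI: `_get_input_text` checks stdin first (an explicit `-`, or no argument and no file with piped stdin), then `-f`, then the positional TEXT, so `-f` wins when both a file and TEXT are given. Likewise `--json` wins over `--xml`, and both override `-o`. The sketch below restates those two rules with injectable inputs so they can be exercised without a terminal; `resolve_input` and `resolve_format` are hypothetical names, and the file is represented by its contents rather than a path.

```python
import io
from typing import Optional, TextIO


def resolve_input(
    text: Optional[str],
    file_text: Optional[str],
    stdin: TextIO,
    stdin_is_tty: bool,
) -> Optional[str]:
    """Restate the input precedence of _get_input_text (illustrative only)."""
    # 1. Explicit "-", or nothing given and stdin is piped: read stdin.
    if text == "-" or (text is None and file_text is None and not stdin_is_tty):
        return stdin.read().strip()
    # 2. A file takes precedence over a positional TEXT argument.
    if file_text is not None:
        return file_text.strip()
    # 3. Fall back to the positional argument.
    if text:
        return text.strip()
    return None


def resolve_format(output: str, output_json: bool, output_xml: bool) -> str:
    """--json beats --xml, and either shortcut overrides -o/--output."""
    if output_json:
        return "json"
    if output_xml:
        return "xml"
    return output


# "-" forces stdin even when file contents are also supplied.
print(resolve_input("-", "from file", io.StringIO(" piped \n"), True))
print(resolve_format("table", output_json=True, output_xml=True))
```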

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/extractor.py RENAMED

@@ -80,11 +80,16 @@ class StatementExtractor:
         # Auto-detect device
         if device is None:
-            self.device = "cuda" if torch.cuda.is_available() else "cpu"
+            if torch.cuda.is_available():
+                self.device = "cuda"
+            elif torch.backends.mps.is_available():
+                self.device = "mps"
+            else:
+                self.device = "cpu"
         else:
             self.device = device
-        # Auto-detect dtype
+        # Auto-detect dtype (bfloat16 only for CUDA, float32 for MPS/CPU)
         if torch_dtype is None:
             self.torch_dtype = torch.bfloat16 if self.device == "cuda" else torch.float32
         else:
@@ -350,12 +355,16 @@ class StatementExtractor:
             outputs = self.model.generate(
                 **inputs,
                 max_new_tokens=options.max_new_tokens,
+                max_length=None,  # Override model default, use max_new_tokens only
                 num_beams=num_seqs,
                 num_beam_groups=num_seqs,
                 num_return_sequences=num_seqs,
                 diversity_penalty=options.diversity_penalty,
                 do_sample=False,
+                top_p=None,  # Override model config to suppress warning
+                top_k=None,  # Override model config to suppress warning
                 trust_remote_code=True,
+                custom_generate="transformers-community/group-beam-search",
             )
         # Decode and process candidates

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/predicate_comparer.py RENAMED

@@ -83,7 +83,12 @@ class PredicateComparer:
         # Auto-detect device
         if device is None:
             import torch
-            self.device = "cuda" if torch.cuda.is_available() else "cpu"
+            if torch.cuda.is_available():
+                self.device = "cuda"
+            elif torch.backends.mps.is_available():
+                self.device = "mps"
+            else:
+                self.device = "cpu"
         else:
             self.device = device
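
Note on device selection: the same CUDA, then MPS, then CPU fallback now appears in both extractor.py and predicate_comparer.py, and extractor.py keeps bfloat16 for CUDA only. A minimal sketch of that rule as a shared pure function follows. The names `pick_device` and `pick_dtype_name` are hypothetical, and availability is passed in as booleans so the sketch runs without torch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then CPU (the order used in the hunks above)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def pick_dtype_name(device: str) -> str:
    """bfloat16 only on CUDA; float32 on MPS and CPU, as in extractor.py."""
    return "bfloat16" if device == "cuda" else "float32"


# With torch installed, the flags would presumably come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device = pick_device(cuda_available=False, mps_available=True)
print(device, pick_dtype_name(device))
```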

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/.gitignore RENAMED

File without changes

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/canonicalization.py RENAMED

File without changes

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/models.py RENAMED

File without changes

{corp_extractor-0.2.3 → corp_extractor-0.2.8}/src/statement_extractor/scoring.py RENAMED

File without changes
