PyPI - corp-extractor - Versions diffs - 0.2.3__py3-none-any.whl → 0.2.5__py3-none-any.whl - Mend

corp-extractor 0.2.3__py3-none-any.whl → 0.2.5__py3-none-any.whl

Files changed (6)

{corp_extractor-0.2.3.dist-info → corp_extractor-0.2.5.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: corp-extractor
-Version: 0.2.3
+Version: 0.2.5
 Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
 Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
 Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -23,10 +23,11 @@ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
 Classifier: Topic :: Scientific/Engineering :: Information Analysis
 Classifier: Topic :: Text Processing :: Linguistic
 Requires-Python: >=3.10
+Requires-Dist: click>=8.0.0
 Requires-Dist: numpy>=1.24.0
 Requires-Dist: pydantic>=2.0.0
 Requires-Dist: torch>=2.0.0
-Requires-Dist: transformers>=4.35.0
+Requires-Dist: transformers>=5.0.0
 Provides-Extra: all
 Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
 Provides-Extra: dev
@@ -57,6 +58,7 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
 - **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
 - **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 ## Installation
@@ -69,7 +71,9 @@ pip install corp-extractor[embeddings]
 pip install corp-extractor
 ```
-**Note**: For GPU support, install PyTorch with CUDA first:
+**Note**: This package requires the development version of `transformers` from GitHub (for T5-Gemma2 support). This is handled automatically during installation.
+**For GPU support**, install PyTorch with CUDA first:
 ```bash
 pip install torch --index-url https://download.pytorch.org/whl/cu121
 pip install corp-extractor[embeddings]
@@ -91,6 +95,96 @@ for stmt in result:
     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
 ```
+## Command Line Interface
+The library includes a CLI for quick extraction from the terminal.
+### Install Globally (Recommended)
+For best results, install globally first:
+```bash
+# Using uv (recommended)
+uv tool install corp-extractor[embeddings]
+# Using pipx
+pipx install corp-extractor[embeddings]
+# Using pip
+pip install corp-extractor[embeddings]
+# Then use anywhere
+corp-extractor "Your text here"
+```
+### Quick Run with uvx
+Run directly without installing using [uv](https://docs.astral.sh/uv/):
+```bash
+uvx corp-extractor "Apple announced a new iPhone."
+```
+**Note**: uvx runs may be slower on first use as it installs transformers from git.
+### Usage Examples
+```bash
+# Extract from text argument
+corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+# Extract from file
+corp-extractor -f article.txt
+# Pipe from stdin
+cat article.txt | corp-extractor -
+# Output as JSON
+corp-extractor "Tim Cook is CEO of Apple." --json
+# Output as XML
+corp-extractor -f article.txt --xml
+# Verbose output with confidence scores
+corp-extractor -f article.txt --verbose
+# Use more beams for better quality
+corp-extractor -f article.txt --beams 8
+# Use custom predicate taxonomy
+corp-extractor -f article.txt --taxonomy predicates.txt
+# Use GPU explicitly
+corp-extractor -f article.txt --device cuda
+```
+### CLI Options
+```
+Usage: corp-extractor [OPTIONS] [TEXT]
+Options:
+  -f, --file PATH              Read input from file
+  -o, --output [table|json|xml] Output format (default: table)
+  --json                       Output as JSON (shortcut)
+  --xml                        Output as XML (shortcut)
+  -b, --beams INTEGER          Number of beams (default: 4)
+  --diversity FLOAT            Diversity penalty (default: 1.0)
+  --max-tokens INTEGER         Max tokens to generate (default: 2048)
+  --no-dedup                   Disable deduplication
+  --no-embeddings              Disable embedding-based dedup (faster)
+  --no-merge                   Disable beam merging
+  --dedup-threshold FLOAT      Deduplication threshold (default: 0.65)
+  --min-confidence FLOAT       Min confidence filter (default: 0)
+  --taxonomy PATH              Load predicate taxonomy from file
+  --taxonomy-threshold FLOAT   Taxonomy matching threshold (default: 0.5)
+  --device [auto|cuda|cpu]     Device to use (default: auto)
+  -v, --verbose                Show confidence scores and metadata
+  -q, --quiet                  Suppress progress messages
+  --version                    Show version
+  --help                       Show this message
+```
 ## New in v0.2.0: Quality Scoring & Beam Merging
 By default, the library now:
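
The dependency changes in the METADATA hunk above (a new `click` requirement and a raised `transformers` floor) can also be read programmatically. The following is a minimal sketch, assuming only the header lines visible in this diff; the `OLD` and `NEW` strings are abbreviated copies and the helper name `requirements` is hypothetical. It relies on the fact that wheel METADATA uses RFC 822 style headers, which the standard library `email` parser can read.

```python
from email.parser import Parser


def requirements(metadata_text: str) -> set[str]:
    """Return the unconditional and conditional Requires-Dist entries."""
    message = Parser().parsestr(metadata_text)
    # get_all returns None when the header is absent, hence the fallback.
    return set(message.get_all("Requires-Dist") or [])


# Abbreviated header blocks copied from the two sides of the diff.
OLD = """\
Metadata-Version: 2.4
Name: corp-extractor
Version: 0.2.3
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.35.0
"""

NEW = """\
Metadata-Version: 2.4
Name: corp-extractor
Version: 0.2.5
Requires-Dist: click>=8.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=5.0.0
"""

added = sorted(requirements(NEW) - requirements(OLD))
removed = sorted(requirements(OLD) - requirements(NEW))

print("added:  ", added)
print("removed:", removed)
```

A set difference is enough here because each requirement string carries its own version specifier, so a changed floor shows up as one removal plus one addition.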

{corp_extractor-0.2.3.dist-info → corp_extractor-0.2.5.dist-info}/RECORD RENAMED

@@ -1,9 +1,11 @@
-statement_extractor/__init__.py,sha256=4Ht8GJdgik_iti7zpG71Oi5EEAnck6AYDvy7soRqIOg,2967
+statement_extractor/__init__.py,sha256=MIZgn-lD9-XGJapzdyYxMhEJFRrTzftbRklrhwA4e8w,2967
 statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
+statement_extractor/cli.py,sha256=kJnZm_mbq4np1vTxSjczMZM5zGuDlC8Z5xLJd8O3xZ4,7605
 statement_extractor/extractor.py,sha256=PX0SiJnYUnh06seyH5W77FcPpcvLXwEM8IGsuVuRh0Q,22158
 statement_extractor/models.py,sha256=xDF3pDPhIiqiMwFMPV94aBEgZGbSe-x2TkshahOiCog,10739
 statement_extractor/predicate_comparer.py,sha256=iwBfNJFNOFv8ODKN9F9EtmknpCeSThOpnu6P_PJSmgE,24898
 statement_extractor/scoring.py,sha256=Wa1BW6jXtHD7dZkUXwdwE39hwFo2ko6BuIogBc4E2Lk,14493
-corp_extractor-0.2.3.dist-info/METADATA,sha256=dCJbLWIj7hgzpkC4zYvNmnEAhNnizUEq_caea6AamIU,10724
-corp_extractor-0.2.3.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
-corp_extractor-0.2.3.dist-info/RECORD,,
+corp_extractor-0.2.5.dist-info/METADATA,sha256=iN_MPbqHhizaFAGJKzR5JNSbDivrS133oSTiYWrFht4,13552
+corp_extractor-0.2.5.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+corp_extractor-0.2.5.dist-info/entry_points.txt,sha256=i0iKFqPIusvb-QTQ1zNnFgAqatgVah-jIhahbs5TToQ,115
+corp_extractor-0.2.5.dist-info/RECORD,,
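
Each RECORD line has the form `path,sha256=<digest>,<size>`, where the digest is the SHA-256 of the file encoded as URL-safe base64 with the trailing `=` padding removed. That explains why `statement_extractor/__init__.py` keeps its size of 2967 bytes while its hash changes: the version string edit replaces one character. The sketch below shows how such a line could be computed; the function name `record_entry` is hypothetical and not part of the package.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style line for a file with the given contents."""
    digest = hashlib.sha256(data).digest()
    # URL-safe base64, with '=' padding stripped as in wheel RECORD files.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Two payloads of equal length that differ in a single character,
# mirroring the __version__ edit in this release.
before = b'__version__ = "0.2.2"\n'
after = b'__version__ = "0.2.5"\n'

print(record_entry("statement_extractor/__init__.py", before))
print(record_entry("statement_extractor/__init__.py", after))
```

To verify an installed file against its RECORD line, read the file in binary mode and compare the output of this function with the recorded line.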

corp_extractor-0.2.5.dist-info/entry_points.txt ADDED

@@ -0,0 +1,3 @@
+[console_scripts]
+corp-extractor = statement_extractor.cli:main
+statement-extractor = statement_extractor.cli:main
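
Both console script names point at the same callable, so `corp-extractor` and `statement-extractor` should behave identically once installed. As a rough illustration of how an installer interprets this file, the sketch below parses the INI-style text with `configparser` and splits each target into a module path and an attribute name. The helper `console_scripts` is hypothetical; real installers use their own entry point machinery.

```python
from configparser import ConfigParser

# Contents of entry_points.txt as added in 0.2.5.
ENTRY_POINTS = """\
[console_scripts]
corp-extractor = statement_extractor.cli:main
statement-extractor = statement_extractor.cli:main
"""


def console_scripts(text: str) -> dict[str, tuple[str, str]]:
    """Map each script name to its (module, attribute) target."""
    parser = ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts


for name, (module, attr) in console_scripts(ENTRY_POINTS).items():
    print(f"{name} -> from {module} import {attr}")
```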

statement_extractor/__init__.py CHANGED

@@ -29,7 +29,7 @@ Example:
     >>> data = extract_statements_as_dict("Some text...")
 """
-__version__ = "0.2.2"
+__version__ = "0.2.5"
 # Core models
 from .models import (

statement_extractor/cli.py ADDED

@@ -0,0 +1,215 @@
+"""
+Command-line interface for statement extraction.
+Usage:
+    corp-extractor "Your text here"
+    corp-extractor -f input.txt
+    cat input.txt | corp-extractor -
+"""
+import sys
+from typing import Optional
+import click
+from . import __version__
+from .models import (
+    ExtractionOptions,
+    PredicateComparisonConfig,
+    PredicateTaxonomy,
+    ScoringConfig,
+)
+@click.command()
+@click.argument("text", required=False)
+@click.option("-f", "--file", "input_file", type=click.Path(exists=True), help="Read input from file")
+@click.option(
+    "-o", "--output",
+    type=click.Choice(["table", "json", "xml"], case_sensitive=False),
+    default="table",
+    help="Output format (default: table)"
+)
+@click.option("--json", "output_json", is_flag=True, help="Output as JSON (shortcut for -o json)")
+@click.option("--xml", "output_xml", is_flag=True, help="Output as XML (shortcut for -o xml)")
+# Beam search options
+@click.option("-b", "--beams", type=int, default=4, help="Number of beams for diverse beam search (default: 4)")
+@click.option("--diversity", type=float, default=1.0, help="Diversity penalty for beam search (default: 1.0)")
+@click.option("--max-tokens", type=int, default=2048, help="Maximum tokens to generate (default: 2048)")
+# Deduplication options
+@click.option("--no-dedup", is_flag=True, help="Disable deduplication")
+@click.option("--no-embeddings", is_flag=True, help="Disable embedding-based deduplication (faster)")
+@click.option("--no-merge", is_flag=True, help="Disable beam merging (select single best beam)")
+@click.option("--dedup-threshold", type=float, default=0.65, help="Similarity threshold for deduplication (default: 0.65)")
+# Quality options
+@click.option("--min-confidence", type=float, default=0.0, help="Minimum confidence threshold 0-1 (default: 0)")
+# Taxonomy options
+@click.option("--taxonomy", type=click.Path(exists=True), help="Load predicate taxonomy from file (one per line)")
+@click.option("--taxonomy-threshold", type=float, default=0.5, help="Similarity threshold for taxonomy matching (default: 0.5)")
+# Device options
+@click.option("--device", type=click.Choice(["auto", "cuda", "cpu"]), default="auto", help="Device to use (default: auto)")
+# Output options
+@click.option("-v", "--verbose", is_flag=True, help="Show verbose output with confidence scores")
+@click.option("-q", "--quiet", is_flag=True, help="Suppress progress messages")
+@click.version_option(version=__version__)
+def main(
+    text: Optional[str],
+    input_file: Optional[str],
+    output: str,
+    output_json: bool,
+    output_xml: bool,
+    beams: int,
+    diversity: float,
+    max_tokens: int,
+    no_dedup: bool,
+    no_embeddings: bool,
+    no_merge: bool,
+    dedup_threshold: float,
+    min_confidence: float,
+    taxonomy: Optional[str],
+    taxonomy_threshold: float,
+    device: str,
+    verbose: bool,
+    quiet: bool,
+):
+    """
+    Extract structured statements from text.
+    TEXT can be provided as an argument, read from a file with -f, or piped via stdin.
+    \b
+    Examples:
+        corp-extractor "Apple announced a new iPhone."
+        corp-extractor -f article.txt --json
+        corp-extractor -f article.txt -o json --beams 8
+        cat article.txt | corp-extractor -
+        echo "Tim Cook is CEO of Apple." | corp-extractor - --verbose
+    \b
+    Output formats:
+        table  Human-readable table (default)
+        json   JSON with full metadata
+        xml    Raw XML from model
+    """
+    # Determine output format
+    if output_json:
+        output = "json"
+    elif output_xml:
+        output = "xml"
+    # Get input text
+    input_text = _get_input_text(text, input_file)
+    if not input_text:
+        raise click.UsageError(
+            "No input provided. Use: statement-extractor \"text\", "
+            "statement-extractor -f file.txt, or pipe via stdin."
+        )
+    if not quiet:
+        click.echo(f"Processing {len(input_text)} characters...", err=True)
+    # Load taxonomy if provided
+    predicate_taxonomy = None
+    if taxonomy:
+        predicate_taxonomy = PredicateTaxonomy.from_file(taxonomy)
+        if not quiet:
+            click.echo(f"Loaded taxonomy with {len(predicate_taxonomy.predicates)} predicates", err=True)
+    # Configure predicate comparison
+    predicate_config = PredicateComparisonConfig(
+        similarity_threshold=taxonomy_threshold,
+        dedup_threshold=dedup_threshold,
+    )
+    # Configure scoring
+    scoring_config = ScoringConfig(min_confidence=min_confidence)
+    # Configure extraction options
+    options = ExtractionOptions(
+        num_beams=beams,
+        diversity_penalty=diversity,
+        max_new_tokens=max_tokens,
+        deduplicate=not no_dedup,
+        embedding_dedup=not no_embeddings,
+        merge_beams=not no_merge,
+        predicate_taxonomy=predicate_taxonomy,
+        predicate_config=predicate_config,
+        scoring_config=scoring_config,
+    )
+    # Import here to allow --help without loading torch
+    from .extractor import StatementExtractor
+    # Create extractor with specified device
+    device_arg = None if device == "auto" else device
+    extractor = StatementExtractor(device=device_arg)
+    if not quiet:
+        click.echo(f"Using device: {extractor.device}", err=True)
+    # Run extraction
+    try:
+        if output == "xml":
+            result = extractor.extract_as_xml(input_text, options)
+            click.echo(result)
+        elif output == "json":
+            result = extractor.extract_as_json(input_text, options)
+            click.echo(result)
+        else:
+            # Table format
+            result = extractor.extract(input_text, options)
+            _print_table(result, verbose)
+    except Exception as e:
+        raise click.ClickException(f"Extraction failed: {e}")
+def _get_input_text(text: Optional[str], input_file: Optional[str]) -> Optional[str]:
+    """Get input text from argument, file, or stdin."""
+    if text == "-" or (text is None and input_file is None and not sys.stdin.isatty()):
+        # Read from stdin
+        return sys.stdin.read().strip()
+    elif input_file:
+        # Read from file
+        with open(input_file, "r", encoding="utf-8") as f:
+            return f.read().strip()
+    elif text:
+        return text.strip()
+    return None
+def _print_table(result, verbose: bool):
+    """Print statements in a human-readable table format."""
+    if not result.statements:
+        click.echo("No statements extracted.")
+        return
+    click.echo(f"\nExtracted {len(result.statements)} statement(s):\n")
+    click.echo("-" * 80)
+    for i, stmt in enumerate(result.statements, 1):
+        subject_type = f" ({stmt.subject.type.value})" if stmt.subject.type.value != "UNKNOWN" else ""
+        object_type = f" ({stmt.object.type.value})" if stmt.object.type.value != "UNKNOWN" else ""
+        click.echo(f"{i}. {stmt.subject.text}{subject_type}")
+        click.echo(f"   --[{stmt.predicate}]-->")
+        click.echo(f"   {stmt.object.text}{object_type}")
+        if verbose:
+            if stmt.confidence_score is not None:
+                click.echo(f"   Confidence: {stmt.confidence_score:.2f}")
+            if stmt.canonical_predicate:
+                click.echo(f"   Canonical: {stmt.canonical_predicate}")
+            if stmt.was_reversed:
+                click.echo(f"   (subject/object were swapped)")
+            if stmt.source_text:
+                source = stmt.source_text[:60] + "..." if len(stmt.source_text) > 60 else stmt.source_text
+                click.echo(f"   Source: \"{source}\"")
+        click.echo("-" * 80)
+if __name__ == "__main__":
+    main()
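
The order of checks in `_get_input_text` determines which input source is used when several are available: an explicit `-` always reads stdin, a file given with `-f` beats a positional TEXT argument, and stdin is used implicitly only when neither is given and stdin is not a terminal. The sketch below mirrors that branch order so it can be exercised without click or a real terminal. The function `pick_input` and its parameters are hypothetical stand-ins; in particular it takes already-read file contents rather than a path.

```python
import io
from typing import Optional, TextIO


def pick_input(
    text: Optional[str],
    file_text: Optional[str],  # stand-in for the contents of the -f file
    stdin: TextIO,
    stdin_is_tty: bool,
) -> Optional[str]:
    """Mirror the branch order of _get_input_text with injected inputs."""
    if text == "-" or (text is None and file_text is None and not stdin_is_tty):
        return stdin.read().strip()
    if file_text is not None:
        return file_text.strip()
    if text:
        return text.strip()
    return None


# An explicit "-" reads stdin even when a file is also supplied.
print(pick_input("-", "from file", io.StringIO(" piped \n"), True))
# A file takes priority over a positional argument.
print(pick_input("from arg", "from file\n", io.StringIO(""), True))
# With no argument and no file, piped stdin is used implicitly.
print(pick_input(None, None, io.StringIO("piped"), False))
```

Passing the stream and the terminal flag as arguments is a choice made for this sketch only; the released code reads `sys.stdin` directly.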

{corp_extractor-0.2.3.dist-info → corp_extractor-0.2.5.dist-info}/WHEEL RENAMED

File without changes
