corp-extractor 0.2.5__py3-none-any.whl → 0.3.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: corp-extractor
- Version: 0.2.5
+ Version: 0.3.0
  Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
  Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
  Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -26,17 +26,15 @@ Requires-Python: >=3.10
  Requires-Dist: click>=8.0.0
  Requires-Dist: numpy>=1.24.0
  Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: sentence-transformers>=2.2.0
+ Requires-Dist: spacy>=3.5.0
  Requires-Dist: torch>=2.0.0
- Requires-Dist: transformers>=5.0.0
- Provides-Extra: all
- Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
+ Requires-Dist: transformers>=5.0.0rc3
  Provides-Extra: dev
  Requires-Dist: mypy>=1.0.0; extra == 'dev'
  Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
  Requires-Dist: pytest>=7.0.0; extra == 'dev'
  Requires-Dist: ruff>=0.1.0; extra == 'dev'
- Provides-Extra: embeddings
- Requires-Dist: sentence-transformers>=2.2.0; extra == 'embeddings'
  Description-Content-Type: text/markdown

  # Corp Extractor
@@ -51,7 +49,11 @@ Extract structured subject-predicate-object statements from unstructured text us

  - **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
  - **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
- - **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness (0-1) based on source text
+ - **Combined Quality Scoring** *(v0.3.0)*: Confidence combines semantic similarity (50%) + subject/object noun scores (25% each)
+ - **spaCy-First Predicates** *(v0.3.0)*: Always uses spaCy for predicate extraction (model predicates are unreliable)
+ - **Multi-Candidate Extraction** *(v0.3.0)*: Generates 3 candidates per statement (hybrid, spaCy-only, predicate-split)
+ - **Best Triple Selection** *(v0.3.0)*: Keeps only highest-scoring triple per source (use `--all-triples` to keep all)
+ - **Extraction Method Tracking** *(v0.3.0)*: Each statement includes `extraction_method` field (hybrid, spacy, split, model)
  - **Beam Merging** *(v0.2.0)*: Combines top beams for better coverage instead of picking one
  - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
  - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
@@ -64,19 +66,22 @@ Extract structured subject-predicate-object statements from unstructured text us
  ## Installation

  ```bash
- # Recommended: include embedding support for smart deduplication
- pip install corp-extractor[embeddings]
-
- # Minimal installation (no embedding features)
  pip install corp-extractor
  ```

- **Note**: This package requires the development version of `transformers` from GitHub (for T5-Gemma2 support). This is handled automatically during installation.
+ The spaCy model for predicate inference is downloaded automatically on first use.
+
+ **Note**: This package requires `transformers>=5.0.0` for T5-Gemma2 model support.

  **For GPU support**, install PyTorch with CUDA first:
  ```bash
  pip install torch --index-url https://download.pytorch.org/whl/cu121
- pip install corp-extractor[embeddings]
+ pip install corp-extractor
+ ```
+
+ **For Apple Silicon (M1/M2/M3)**, MPS acceleration is automatically detected:
+ ```bash
+ pip install corp-extractor  # MPS used automatically
  ```

  ## Quick Start
@@ -105,13 +110,13 @@ For best results, install globally first:

  ```bash
  # Using uv (recommended)
- uv tool install corp-extractor[embeddings]
+ uv tool install "corp-extractor[embeddings]"

  # Using pipx
- pipx install corp-extractor[embeddings]
+ pipx install "corp-extractor[embeddings]"

  # Using pip
- pip install corp-extractor[embeddings]
+ pip install "corp-extractor[embeddings]"

  # Then use anywhere
  corp-extractor "Your text here"
@@ -125,7 +130,7 @@ Run directly without installing using [uv](https://docs.astral.sh/uv/):
  uvx corp-extractor "Apple announced a new iPhone."
  ```

- **Note**: uvx runs may be slower on first use as it installs transformers from git.
+ **Note**: First run downloads the model (~1.5GB) which may take a few minutes.

  ### Usage Examples

@@ -174,11 +179,13 @@ Options:
  --no-dedup                  Disable deduplication
  --no-embeddings             Disable embedding-based dedup (faster)
  --no-merge                  Disable beam merging
+ --no-spacy                  Disable spaCy extraction (use raw model output)
+ --all-triples               Keep all candidate triples (default: best per source)
  --dedup-threshold FLOAT     Deduplication threshold (default: 0.65)
  --min-confidence FLOAT      Min confidence filter (default: 0)
  --taxonomy PATH             Load predicate taxonomy from file
  --taxonomy-threshold FLOAT  Taxonomy matching threshold (default: 0.5)
- --device [auto|cuda|cpu]    Device to use (default: auto)
+ --device [auto|cuda|mps|cpu]  Device to use (default: auto)
  -v, --verbose               Show confidence scores and metadata
  -q, --quiet                 Suppress progress messages
  --version                   Show version
@@ -276,7 +283,91 @@ for stmt in fixed_statements:

  During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.

- ## Disable Embeddings (Faster, No Extra Dependencies)
+ ## New in v0.3.0: spaCy-First Extraction & Semantic Scoring
+
+ v0.3.0 introduces significant improvements to extraction quality:
+
+ ### spaCy-First Predicate Extraction
+
+ The T5-Gemma model is excellent at:
+ - **Triple isolation** - identifying that a relationship exists
+ - **Coreference resolution** - resolving pronouns to named entities
+
+ But unreliable at:
+ - **Predicate extraction** - often returns empty or wrong predicates
+
+ **Solution:** v0.3.0 always uses spaCy for predicate extraction. The model provides subject, object, entity types, and source text; spaCy provides the predicate.
+
+ ### Three Candidate Extraction Methods
+
+ For each statement, three candidates are generated and the best is selected:
+
+ | Method | Description |
+ |--------|-------------|
+ | `hybrid` | Model subject/object + spaCy predicate |
+ | `spacy` | All components from spaCy dependency parsing |
+ | `split` | Source text split around the predicate |
+
+ ```python
+ for stmt in result:
+     print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
+     print(f"  Method: {stmt.extraction_method}")  # hybrid, spacy, split, or model
+     print(f"  Confidence: {stmt.confidence_score:.2f}")
+ ```
+
+ ### Combined Quality Scoring
+
+ Confidence scores combine **semantic similarity** and **grammatical accuracy**:
+
+ | Component | Weight | Description |
+ |-----------|--------|-------------|
+ | Semantic similarity | 50% | Cosine similarity between source text and reassembled triple |
+ | Subject noun score | 25% | How noun-like the subject is |
+ | Object noun score | 25% | How noun-like the object is |
+
+ **Noun scoring:**
+ - Proper noun(s) only: 1.0
+ - Common noun(s) only: 0.8
+ - Contains noun + other words: 0.4-0.8 (based on ratio)
+ - No nouns: 0.2
+
+ This ensures extracted subjects and objects are grammatically valid entities, not fragments or verb phrases.
+
+ ### Extraction Method Tracking
+
+ Each statement now includes an `extraction_method` field:
+ - `hybrid` - Model subject/object + spaCy predicate
+ - `spacy` - All components from spaCy dependency parsing
+ - `split` - Subject/object from splitting source text around predicate
+ - `model` - All components from T5-Gemma model (only when `--no-spacy`)
+
+ ### Best Triple Selection
+
+ By default, only the **highest-scoring triple** is kept for each source sentence. This ensures clean output without redundant candidates.
+
+ To keep all candidate triples (for debugging or analysis):
+ ```python
+ options = ExtractionOptions(all_triples=True)
+ result = extract_statements(text, options)
+ ```
+
+ Or via CLI:
+ ```bash
+ corp-extractor "Your text" --all-triples --verbose
+ ```
+
+ **Disable spaCy extraction** to use only model output:
+ ```python
+ options = ExtractionOptions(use_spacy_extraction=False)
+ result = extract_statements(text, options)
+ ```
+
+ Or via CLI:
+ ```bash
+ corp-extractor "Your text" --no-spacy
+ ```
+
+ ## Disable Embeddings

  ```python
  options = ExtractionOptions(
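As an aside (not part of the package diff): the noun-scoring rubric described in the hunk above (proper nouns 1.0, common nouns 0.8, mixed 0.4-0.8 by ratio, no nouns 0.2) could be sketched roughly as below. The function, its `(text, pos)` token format, and the linear ratio mapping are illustrative assumptions, not the package's actual implementation.

```python
def noun_score(tokens: list[tuple[str, str]]) -> float:
    """Sketch of the noun-likeness rubric from the v0.3.0 notes, assuming
    tokens are (text, pos) pairs with spaCy-style coarse POS tags.
    Hypothetical helper; not taken from the package source."""
    pos = [p for _, p in tokens]
    if pos and all(p == "PROPN" for p in pos):
        return 1.0  # proper noun(s) only
    if pos and all(p in ("NOUN", "PROPN") for p in pos):
        return 0.8  # common noun(s) only (possibly mixed with proper nouns)
    ratio = sum(p in ("NOUN", "PROPN") for p in pos) / len(pos) if pos else 0.0
    if ratio > 0:
        return 0.4 + 0.4 * ratio  # noun + other words: scales within 0.4-0.8
    return 0.2  # no nouns at all

score = noun_score([("Apple", "PROPN")])  # → 1.0
```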
@@ -314,7 +405,7 @@ dict_output = extract_statements_as_dict(text)
  ```python
  from statement_extractor import StatementExtractor

- extractor = StatementExtractor(device="cuda")  # or "cpu"
+ extractor = StatementExtractor(device="cuda")  # or "mps" (Apple Silicon) or "cpu"

  texts = ["Text 1...", "Text 2...", "Text 3..."]
  for text in texts:
@@ -352,14 +443,16 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
  6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
  7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
  8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
+ 9. **Hybrid spaCy** *(v0.2.12)*: spaCy candidates added to pool alongside model output for better coverage

  ## Requirements

  - Python 3.10+
  - PyTorch 2.0+
- - Transformers 4.35+
+ - Transformers 5.0+
  - Pydantic 2.0+
- - sentence-transformers 2.2+ *(optional, for embedding features)*
+ - sentence-transformers 2.2+
+ - spaCy 3.5+ (model downloaded automatically on first use)
  - ~2GB VRAM (GPU) or ~4GB RAM (CPU)

  ## Links
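As an aside (not part of the package diff): the `--device auto` default, combined with the new `cuda|mps|cpu` choices added throughout this diff, implies a resolution order of CUDA, then Apple MPS, then CPU. A minimal sketch of that policy, with availability passed in as flags so it runs standalone (the function and its signature are hypothetical, not the package's code):

```python
def pick_device(requested: str = "auto", cuda_available: bool = False,
                mps_available: bool = False) -> str:
    """Resolve a --device choice of auto|cuda|mps|cpu the way the README
    implies: prefer CUDA, then Apple MPS, else CPU. Illustrative sketch;
    the real package would query torch for availability."""
    if requested != "auto":
        return requested  # explicit choice always wins
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

device = pick_device(mps_available=True)  # → "mps" on Apple Silicon
```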
@@ -0,0 +1,12 @@
+ statement_extractor/__init__.py,sha256=KwZfWnTB9oevTLw0TrNlYFu67qIYO-34JqDtcpjOhZI,3013
+ statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
+ statement_extractor/cli.py,sha256=JMEXiT2xwmW1J8JmJliQh32AT-7bTAtAscPx1AGRfPg,9054
+ statement_extractor/extractor.py,sha256=vS8UCgE8uITt_28PwCh4WCqOjWLpfrJcN3fh1YPBcjA,39657
+ statement_extractor/models.py,sha256=FxLj2fIodX317XVIJLZ0GFNahm_VV07KzdoLSSjoVD4,11952
+ statement_extractor/predicate_comparer.py,sha256=jcuaBi5BYqD3TKoyj3pR9dxtX5ihfDJvjdhEd2LHCwc,26184
+ statement_extractor/scoring.py,sha256=pdNgyLHmlk-npISzm4nycK9G4wM2nztg5KTG7piFACI,18135
+ statement_extractor/spacy_extraction.py,sha256=ACvIB-Ag7H7h_Gb0cdypIr8fnf3A-UjyJnqqjWD5Ccs,12320
+ corp_extractor-0.3.0.dist-info/METADATA,sha256=eu8b7R_FQxFyc_9FSocy078TTyB7BwvGX-YAS79hKgg,17042
+ corp_extractor-0.3.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+ corp_extractor-0.3.0.dist-info/entry_points.txt,sha256=i0iKFqPIusvb-QTQ1zNnFgAqatgVah-jIhahbs5TToQ,115
+ corp_extractor-0.3.0.dist-info/RECORD,,
@@ -29,12 +29,13 @@ Example:
      >>> data = extract_statements_as_dict("Some text...")
  """

- __version__ = "0.2.5"
+ __version__ = "0.3.0"

  # Core models
  from .models import (
      Entity,
      EntityType,
+     ExtractionMethod,
      ExtractionOptions,
      ExtractionResult,
      Statement,
@@ -73,6 +74,7 @@ __all__ = [
      # Core models
      "Entity",
      "EntityType",
+     "ExtractionMethod",
      "ExtractionOptions",
      "ExtractionResult",
      "Statement",
@@ -7,11 +7,37 @@ Usage:
      cat input.txt | corp-extractor -
  """

+ import logging
  import sys
  from typing import Optional

  import click

+
+ def _configure_logging(verbose: bool) -> None:
+     """Configure logging for the extraction pipeline."""
+     level = logging.DEBUG if verbose else logging.WARNING
+
+     # Configure root logger for statement_extractor package
+     logging.basicConfig(
+         level=level,
+         format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
+         datefmt="%H:%M:%S",
+         stream=sys.stderr,
+         force=True,
+     )
+
+     # Set level for all statement_extractor loggers
+     for logger_name in [
+         "statement_extractor",
+         "statement_extractor.extractor",
+         "statement_extractor.scoring",
+         "statement_extractor.predicate_comparer",
+         "statement_extractor.canonicalization",
+         "statement_extractor.spacy_extraction",
+     ]:
+         logging.getLogger(logger_name).setLevel(level)
+
  from . import __version__
  from .models import (
      ExtractionOptions,
@@ -40,6 +66,8 @@ from .models import (
40
66
  @click.option("--no-dedup", is_flag=True, help="Disable deduplication")
41
67
  @click.option("--no-embeddings", is_flag=True, help="Disable embedding-based deduplication (faster)")
42
68
  @click.option("--no-merge", is_flag=True, help="Disable beam merging (select single best beam)")
69
+ @click.option("--no-spacy", is_flag=True, help="Disable spaCy extraction (use raw model output)")
70
+ @click.option("--all-triples", is_flag=True, help="Keep all candidate triples instead of selecting best per source")
43
71
  @click.option("--dedup-threshold", type=float, default=0.65, help="Similarity threshold for deduplication (default: 0.65)")
44
72
  # Quality options
45
73
  @click.option("--min-confidence", type=float, default=0.0, help="Minimum confidence threshold 0-1 (default: 0)")
@@ -47,7 +75,7 @@ from .models import (
47
75
  @click.option("--taxonomy", type=click.Path(exists=True), help="Load predicate taxonomy from file (one per line)")
48
76
  @click.option("--taxonomy-threshold", type=float, default=0.5, help="Similarity threshold for taxonomy matching (default: 0.5)")
49
77
  # Device options
50
- @click.option("--device", type=click.Choice(["auto", "cuda", "cpu"]), default="auto", help="Device to use (default: auto)")
78
+ @click.option("--device", type=click.Choice(["auto", "cuda", "mps", "cpu"]), default="auto", help="Device to use (default: auto)")
51
79
  # Output options
52
80
  @click.option("-v", "--verbose", is_flag=True, help="Show verbose output with confidence scores")
53
81
  @click.option("-q", "--quiet", is_flag=True, help="Suppress progress messages")
@@ -64,6 +92,8 @@ def main(
      no_dedup: bool,
      no_embeddings: bool,
      no_merge: bool,
+     no_spacy: bool,
+     all_triples: bool,
      dedup_threshold: float,
      min_confidence: float,
      taxonomy: Optional[str],
@@ -91,6 +121,9 @@ def main(
          json     JSON with full metadata
          xml      Raw XML from model
      """
+     # Configure logging based on verbose flag
+     _configure_logging(verbose)
+
      # Determine output format
      if output_json:
          output = "json"
@@ -132,9 +165,12 @@ def main(
          deduplicate=not no_dedup,
          embedding_dedup=not no_embeddings,
          merge_beams=not no_merge,
+         use_spacy_extraction=not no_spacy,
+         all_triples=all_triples,
          predicate_taxonomy=predicate_taxonomy,
          predicate_config=predicate_config,
          scoring_config=scoring_config,
+         verbose=verbose,
      )

      # Import here to allow --help without loading torch
@@ -160,6 +196,7 @@ def main(
          result = extractor.extract(input_text, options)
          _print_table(result, verbose)
      except Exception as e:
+         logging.exception("Error extracting statements:")
          raise click.ClickException(f"Extraction failed: {e}")


@@ -195,6 +232,9 @@ def _print_table(result, verbose: bool):
          click.echo(f"  {stmt.object.text}{object_type}")

          if verbose:
+             # Always show extraction method
+             click.echo(f"  Method: {stmt.extraction_method.value}")
+
              if stmt.confidence_score is not None:
                  click.echo(f"  Confidence: {stmt.confidence_score:.2f}")