corp-extractor 0.2.3__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,489 @@
+ Metadata-Version: 2.4
+ Name: corp-extractor
+ Version: 0.4.0
+ Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
+ Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
+ Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
+ Project-URL: Repository, https://github.com/corp-o-rate/statement-extractor
+ Project-URL: Issues, https://github.com/corp-o-rate/statement-extractor/issues
+ Author-email: Corp-o-Rate <neil@corp-o-rate.com>
+ Maintainer-email: Corp-o-Rate <neil@corp-o-rate.com>
+ License: MIT
+ Keywords: diverse-beam-search,embeddings,gemma,information-extraction,knowledge-graph,nlp,statement-extraction,subject-predicate-object,t5,transformers,triples
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: Topic :: Text Processing :: Linguistic
+ Requires-Python: >=3.10
+ Requires-Dist: click>=8.0.0
+ Requires-Dist: gliner2
+ Requires-Dist: numpy>=1.24.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: sentence-transformers>=2.2.0
+ Requires-Dist: torch>=2.0.0
+ Requires-Dist: transformers>=5.0.0rc3
+ Provides-Extra: dev
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
+ Requires-Dist: ruff>=0.1.0; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # Corp Extractor
+
+ Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.
+
+ [![PyPI version](https://img.shields.io/pypi/v/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
+ [![Python 3.10+](https://img.shields.io/pypi/pyversions/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ ## Features
+
+ - **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
+ - **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
+ - **GLiNER2 Integration** *(v0.4.0)*: Uses GLiNER2 (205M params) for entity recognition and relation extraction
+ - **Predefined Predicates** *(v0.4.0)*: Optional `--predicates` list for GLiNER2 relation extraction mode
+ - **Entity-based Scoring** *(v0.4.0)*: Confidence combines semantic similarity (50%) + entity recognition scores (25% each)
+ - **Multi-Candidate Extraction**: Generates 3 candidates per statement (hybrid, GLiNER2-only, predicate-split)
+ - **Best Triple Selection**: Keeps only the highest-scoring triple per source (use `--all-triples` to keep all)
+ - **Extraction Method Tracking**: Each statement includes an `extraction_method` field (hybrid, gliner, split, model)
+ - **Beam Merging**: Combines top beams for better coverage instead of picking one
+ - **Embedding-based Dedup**: Uses semantic similarity to detect near-duplicate predicates
+ - **Predicate Taxonomies**: Normalize predicates to canonical forms via embeddings
+ - **Contextualized Matching**: Compares the full "Subject Predicate Object" string against source text for better accuracy
+ - **Entity Type Merging**: Automatically merges UNKNOWN entity types with specific types during deduplication
+ - **Reversal Detection**: Detects and corrects subject-object reversals using embedding comparison
+ - **Command Line Interface**: Full-featured CLI for terminal usage
+ - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
+
+ ## Installation
+
+ ```bash
+ pip install corp-extractor
+ ```
+
+ The GLiNER2 model (205M params) is downloaded automatically on first use.
+
+ **Note**: This package requires `transformers>=5.0.0rc3` for T5-Gemma 2 model support.
+
+ **For GPU support**, install PyTorch with CUDA first:
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+ pip install corp-extractor
+ ```
+
+ **For Apple Silicon (M1/M2/M3)**, MPS acceleration is automatically detected:
+ ```bash
+ pip install corp-extractor  # MPS used automatically
+ ```
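+
+ If you want to confirm what `--device auto` is likely to pick on your machine, a standalone sketch with PyTorch (already a dependency) is shown below. It mirrors the common CUDA → MPS → CPU fallback order and is not the library's own detection code:
+
+ ```python
+ import torch
+
+ # Typical auto-detection order: CUDA first, then Apple MPS, then CPU.
+ if torch.cuda.is_available():
+     device = "cuda"
+ elif torch.backends.mps.is_available():
+     device = "mps"
+ else:
+     device = "cpu"
+
+ print(f"Would run on: {device}")
+ ```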
+
+ ## Quick Start
+
+ ```python
+ from statement_extractor import extract_statements
+
+ result = extract_statements("""
+ Apple Inc. announced the iPhone 15 at their September event.
+ Tim Cook presented the new features to customers worldwide.
+ """)
+
+ for stmt in result:
+     print(f"{stmt.subject.text} ({stmt.subject.type})")
+     print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
+     print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
+ ```
+
+ ## Command Line Interface
+
+ The library includes a CLI for quick extraction from the terminal.
+
+ ### Install Globally (Recommended)
+
+ For best results, install globally first:
+
+ ```bash
+ # Using uv (recommended)
+ uv tool install corp-extractor
+
+ # Using pipx
+ pipx install corp-extractor
+
+ # Using pip
+ pip install corp-extractor
+
+ # Then use anywhere
+ corp-extractor "Your text here"
+ ```
+
+ ### Quick Run with uvx
+
+ Run directly without installing using [uv](https://docs.astral.sh/uv/):
+
+ ```bash
+ uvx corp-extractor "Apple announced a new iPhone."
+ ```
+
+ **Note**: The first run downloads the model (~1.5 GB), which may take a few minutes.
+
+ ### Usage Examples
+
+ ```bash
+ # Extract from text argument
+ corp-extractor "Apple Inc. announced the iPhone 15 at their September event."
+
+ # Extract from file
+ corp-extractor -f article.txt
+
+ # Pipe from stdin
+ cat article.txt | corp-extractor -
+
+ # Output as JSON
+ corp-extractor "Tim Cook is CEO of Apple." --json
+
+ # Output as XML
+ corp-extractor -f article.txt --xml
+
+ # Verbose output with confidence scores
+ corp-extractor -f article.txt --verbose
+
+ # Use more beams for better quality
+ corp-extractor -f article.txt --beams 8
+
+ # Use custom predicate taxonomy
+ corp-extractor -f article.txt --taxonomy predicates.txt
+
+ # Use GPU explicitly
+ corp-extractor -f article.txt --device cuda
+ ```
+
+ ### CLI Options
+
+ ```
+ Usage: corp-extractor [OPTIONS] [TEXT]
+
+ Options:
+   -f, --file PATH                Read input from file
+   -o, --output [table|json|xml]  Output format (default: table)
+   --json                         Output as JSON (shortcut)
+   --xml                          Output as XML (shortcut)
+   -b, --beams INTEGER            Number of beams (default: 4)
+   --diversity FLOAT              Diversity penalty (default: 1.0)
+   --max-tokens INTEGER           Max tokens to generate (default: 2048)
+   --no-dedup                     Disable deduplication
+   --no-embeddings                Disable embedding-based dedup (faster)
+   --no-merge                     Disable beam merging
+   --no-gliner                    Disable GLiNER2 extraction (use raw model output)
+   --predicates TEXT              Comma-separated predicate types for GLiNER2 relation extraction
+   --all-triples                  Keep all candidate triples (default: best per source)
+   --dedup-threshold FLOAT        Deduplication threshold (default: 0.65)
+   --min-confidence FLOAT         Min confidence filter (default: 0)
+   --taxonomy PATH                Load predicate taxonomy from file
+   --taxonomy-threshold FLOAT     Taxonomy matching threshold (default: 0.5)
+   --device [auto|cuda|mps|cpu]   Device to use (default: auto)
+   -v, --verbose                  Show confidence scores and metadata
+   -q, --quiet                    Suppress progress messages
+   --version                      Show version
+   --help                         Show this message
+ ```
+
+ ## New in v0.2.0: Quality Scoring & Beam Merging
+
+ By default, the library now:
+ - **Scores each triple** for groundedness based on whether entities appear in source text
+ - **Merges top beams** instead of selecting one, improving coverage
+ - **Uses embeddings** to detect semantically similar predicates ("bought" ≈ "acquired")
+
+ ```python
+ from statement_extractor import ExtractionOptions, ScoringConfig
+
+ # Precision mode - filter low-confidence triples
+ scoring = ScoringConfig(min_confidence=0.7)
+ options = ExtractionOptions(scoring_config=scoring)
+ result = extract_statements(text, options)
+
+ # Access confidence scores
+ for stmt in result:
+     print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")
+ ```
+
+ ## New in v0.2.0: Predicate Taxonomies
+
+ Normalize predicates to canonical forms using embedding similarity:
+
+ ```python
+ from statement_extractor import PredicateTaxonomy, ExtractionOptions
+
+ taxonomy = PredicateTaxonomy(predicates=[
+     "acquired", "founded", "works_for", "announced",
+     "invested_in", "partnered_with"
+ ])
+
+ options = ExtractionOptions(predicate_taxonomy=taxonomy)
+ result = extract_statements(text, options)
+
+ # "bought" -> "acquired" via embedding similarity
+ for stmt in result:
+     if stmt.canonical_predicate:
+         print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
+ ```
+
+ ## New in v0.2.2: Contextualized Matching
+
+ Predicate canonicalization and deduplication now use **contextualized matching**:
+ - Compares full "Subject Predicate Object" strings against source text
+ - Better accuracy because predicates are evaluated in context
+ - When duplicates are found, keeps the statement with the best match to source text
+
+ This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
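+
+ A minimal sketch of the idea using `sentence-transformers` directly (the embedding model name here is illustrative, not necessarily the one the library uses internally):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
+
+ source = "Apple bought Beats in 2014."
+ candidates = ["Apple bought Beats", "Apple acquired Beats"]
+
+ # Compare each full "Subject Predicate Object" string against the source text.
+ source_emb = model.encode(source)
+ for candidate in candidates:
+     score = util.cos_sim(source_emb, model.encode(candidate)).item()
+     print(f"{candidate!r}: similarity to source = {score:.3f}")
+ ```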
246
+
247
+ ## New in v0.2.3: Entity Type Merging & Reversal Detection
248
+
249
+ ### Entity Type Merging
250
+
251
+ When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
252
+
253
+ ```python
254
+ # Before deduplication:
255
+ # Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
256
+ # Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
257
+
258
+ # After deduplication:
259
+ # Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
260
+ ```
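+
+ The merge rule itself is simple. A standalone sketch of the idea (not the library's internal implementation):
+
+ ```python
+ def merge_entity_type(type_a: str, type_b: str) -> str:
+     """Prefer a specific entity type over UNKNOWN when merging duplicate statements."""
+     if type_a == "UNKNOWN":
+         return type_b
+     if type_b == "UNKNOWN":
+         return type_a
+     return type_a  # both already specific; duplicates normally agree
+
+ assert merge_entity_type("UNKNOWN", "ORG") == "ORG"
+ assert merge_entity_type("PERSON", "UNKNOWN") == "PERSON"
+ ```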
+
+ ### Subject-Object Reversal Detection
+
+ The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:
+
+ ```python
+ from statement_extractor import PredicateComparer
+
+ comparer = PredicateComparer()
+
+ # Automatically detect and fix reversals
+ fixed_statements = comparer.detect_and_fix_reversals(statements)
+
+ for stmt in fixed_statements:
+     if stmt.was_reversed:
+         print(f"Fixed reversal: {stmt}")
+ ```
+
+ **How it works:**
+ 1. For each statement with source text, compares:
+    - "Subject Predicate Object" embedding vs source text
+    - "Object Predicate Subject" embedding vs source text
+ 2. If the reversed form has higher similarity, swaps subject and object
+ 3. Sets `was_reversed=True` to indicate the correction
+
+ During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
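+
+ The comparison in steps 1-2 can be sketched with `sentence-transformers` directly (illustrative model and sentences; the library's `PredicateComparer` performs this kind of check internally):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
+
+ source = "Beats was acquired by Apple in 2014."
+ forward = "Beats acquired Apple"   # "Subject Predicate Object" as extracted
+ reverse = "Apple acquired Beats"   # "Object Predicate Subject"
+
+ src, fwd, rev = model.encode([source, forward, reverse])
+ if util.cos_sim(src, rev).item() > util.cos_sim(src, fwd).item():
+     print("Reversed form matches the source better -> swap subject and object")
+ ```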
+
+ ## New in v0.4.0: GLiNER2 Integration
+
+ v0.4.0 replaces spaCy with **GLiNER2** (205M params) for entity recognition and relation extraction. GLiNER2 is a unified model that handles NER, text classification, structured data extraction, and relation extraction with CPU-optimized inference.
+
+ ### Why GLiNER2?
+
+ The T5-Gemma model excels at:
+ - **Triple isolation** - identifying that a relationship exists
+ - **Coreference resolution** - resolving pronouns to named entities
+
+ GLiNER2 now handles:
+ - **Entity recognition** - refining subject/object boundaries
+ - **Relation extraction** - when predefined predicates are provided
+ - **Entity scoring** - scoring how "entity-like" subjects/objects are
+
+ ### Two Extraction Modes
+
+ **Mode 1: With Predicate List** (GLiNER2 relation extraction)
+ ```python
+ from statement_extractor import extract_statements, ExtractionOptions
+
+ options = ExtractionOptions(predicates=["works_for", "founded", "acquired", "headquartered_in"])
+ result = extract_statements("John works for Apple Inc. in Cupertino.", options)
+ ```
+
+ Or via CLI:
+ ```bash
+ corp-extractor "John works for Apple Inc." --predicates "works_for,founded,acquired"
+ ```
+
+ **Mode 2: Without Predicate List** (entity-refined extraction)
+ ```python
+ result = extract_statements("Apple announced a new iPhone.")
+ # Uses GLiNER2 for entity extraction to refine boundaries
+ # Extracts predicate from source text using T5-Gemma's hint
+ ```
+
+ ### Three Candidate Extraction Methods
+
+ For each statement, three candidates are generated and the best is selected:
+
+ | Method | Description |
+ |--------|-------------|
+ | `hybrid` | Model subject/object + GLiNER2/extracted predicate |
+ | `gliner` | All components refined by GLiNER2 |
+ | `split` | Source text split around the predicate |
+
+ ```python
+ for stmt in result:
+     print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
+     print(f"  Method: {stmt.extraction_method}")  # hybrid, gliner, split, or model
+     print(f"  Confidence: {stmt.confidence_score:.2f}")
+ ```
+
+ ### Combined Quality Scoring
+
+ Confidence scores combine **semantic similarity** and **entity recognition** (a weighted-sum sketch follows below):
+
+ | Component | Weight | Description |
+ |-----------|--------|-------------|
+ | Semantic similarity | 50% | Cosine similarity between source text and reassembled triple |
+ | Subject entity score | 25% | How entity-like the subject is (via GLiNER2) |
+ | Object entity score | 25% | How entity-like the object is (via GLiNER2) |
+
+ **Entity scoring (via GLiNER2):**
+ - Recognized entity with high confidence: 1.0
+ - Recognized entity with moderate confidence: 0.8
+ - Partially recognized: 0.6
+ - Not recognized: 0.2
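+
+ As a concrete illustration of the weighting above (plain arithmetic, not the library's scoring code):
+
+ ```python
+ def combined_confidence(semantic_sim: float, subject_score: float, object_score: float) -> float:
+     """50% semantic similarity + 25% subject entity score + 25% object entity score."""
+     return 0.5 * semantic_sim + 0.25 * subject_score + 0.25 * object_score
+
+ # A triple whose reassembled text closely matches the source (0.9), with a
+ # confidently recognized subject (1.0) and a partially recognized object (0.6):
+ print(round(combined_confidence(0.9, 1.0, 0.6), 2))  # 0.85
+ ```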
+
+ ### Extraction Method Tracking
+
+ Each statement includes an `extraction_method` field (a quick tally example follows this list):
+ - `hybrid` - Model subject/object + GLiNER2 predicate
+ - `gliner` - All components refined by GLiNER2
+ - `split` - Subject/object from splitting source text around predicate
+ - `model` - All components from T5-Gemma model (only when `--no-gliner`)
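+
+ For example, to see which methods produced the statements in a run (assuming `result` comes from `extract_statements` as in the examples above):
+
+ ```python
+ from collections import Counter
+
+ # Tally how each kept statement was produced.
+ method_counts = Counter(stmt.extraction_method for stmt in result)
+ print(dict(method_counts))  # e.g. {'hybrid': 5, 'split': 2, 'gliner': 1}
+ ```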
+
+ ### Best Triple Selection
+
+ By default, only the **highest-scoring triple** is kept for each source sentence.
+
+ To keep all candidate triples:
+ ```python
+ options = ExtractionOptions(all_triples=True)
+ result = extract_statements(text, options)
+ ```
+
+ Or via CLI:
+ ```bash
+ corp-extractor "Your text" --all-triples --verbose
+ ```
+
+ **Disable GLiNER2 extraction** to use only model output:
+ ```python
+ options = ExtractionOptions(use_gliner_extraction=False)
+ result = extract_statements(text, options)
+ ```
+
+ Or via CLI:
+ ```bash
+ corp-extractor "Your text" --no-gliner
+ ```
+
+ ## Disable Embeddings
+
+ ```python
+ options = ExtractionOptions(
+     embedding_dedup=False,  # Use exact text matching
+     merge_beams=False,      # Select single best beam
+ )
+ result = extract_statements(text, options)
+ ```
+
+ ## Output Formats
+
+ ```python
+ from statement_extractor import (
+     extract_statements,
+     extract_statements_as_json,
+     extract_statements_as_xml,
+     extract_statements_as_dict,
+ )
+
+ # Pydantic models (default)
+ result = extract_statements(text)
+
+ # JSON string
+ json_output = extract_statements_as_json(text)
+
+ # Raw XML (model's native format)
+ xml_output = extract_statements_as_xml(text)
+
+ # Python dictionary
+ dict_output = extract_statements_as_dict(text)
+ ```
+
+ ## Batch Processing
+
+ ```python
+ from statement_extractor import StatementExtractor
+
+ extractor = StatementExtractor(device="cuda")  # or "mps" (Apple Silicon) or "cpu"
+
+ texts = ["Text 1...", "Text 2...", "Text 3..."]
+ for text in texts:
+     result = extractor.extract(text)
+     print(f"Found {len(result)} statements")
+ ```
+
+ ## Entity Types
+
+ | Type | Description | Example |
+ |------|-------------|---------|
+ | `ORG` | Organizations | Apple Inc., United Nations |
+ | `PERSON` | People | Tim Cook, Elon Musk |
+ | `GPE` | Geopolitical entities | USA, California, Paris |
+ | `LOC` | Non-GPE locations | Mount Everest, Pacific Ocean |
+ | `PRODUCT` | Products | iPhone, Model S |
+ | `EVENT` | Events | World Cup, CES 2024 |
+ | `WORK_OF_ART` | Creative works | Mona Lisa, Game of Thrones |
+ | `LAW` | Legal documents | GDPR, Clean Air Act |
+ | `DATE` | Dates | 2024, January 15 |
+ | `MONEY` | Monetary values | $50 million, €100 |
+ | `PERCENT` | Percentages | 25%, 0.5% |
+ | `QUANTITY` | Quantities | 500 employees, 1.5 tons |
+ | `UNKNOWN` | Unrecognized | (fallback) |
+
+ ## How It Works
+
+ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam Search** ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424)):
+
+ 1. **Diverse Beam Search**: Generates 4+ candidate outputs using beam groups with diversity penalty (see the sketch after this list)
+ 2. **Quality Scoring**: Each triple scored for groundedness in source text
+ 3. **Beam Merging**: Top beams combined for better coverage
+ 4. **Embedding Dedup**: Semantic similarity removes near-duplicate predicates
+ 5. **Predicate Normalization**: Optional taxonomy matching via embeddings
+ 6. **Contextualized Matching**: Full statement context used for canonicalization and dedup
+ 7. **Entity Type Merging**: UNKNOWN types merged with specific types during dedup
+ 8. **Reversal Detection**: Subject-object reversals detected and corrected via embedding comparison
+ 9. **GLiNER2 Extraction** *(v0.4.0)*: Entity recognition and relation extraction for improved accuracy
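+
+ Step 1 corresponds to grouped beam search in Hugging Face `generate()`. The sketch below shows how the CLI defaults (`--beams 4`, `--diversity 1.0`, `--max-tokens 2048`) map onto generation parameters; the model id is taken from the Links section, and the choice of `AutoModelForSeq2SeqLM` plus these exact parameter names on transformers 5.x are assumptions rather than the library's actual loading code:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model_id = "Corp-o-Rate-Community/statement-extractor"  # from the Links section
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+
+ inputs = tokenizer("Apple Inc. announced the iPhone 15.", return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     num_beams=4,             # --beams
+     num_beam_groups=4,       # one beam per group for maximum diversity
+     diversity_penalty=1.0,   # --diversity
+     num_return_sequences=4,  # keep every beam so they can be merged downstream
+     max_new_tokens=2048,     # --max-tokens
+ )
+ for seq in outputs:
+     print(tokenizer.decode(seq, skip_special_tokens=True))
+ ```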
469
+
470
+ ## Requirements
471
+
472
+ - Python 3.10+
473
+ - PyTorch 2.0+
474
+ - Transformers 5.0+
475
+ - Pydantic 2.0+
476
+ - sentence-transformers 2.2+
477
+ - GLiNER2 (model downloaded automatically on first use)
478
+ - ~2GB VRAM (GPU) or ~4GB RAM (CPU)
479
+
480
+ ## Links
481
+
482
+ - [Model on HuggingFace](https://huggingface.co/Corp-o-Rate-Community/statement-extractor)
483
+ - [Web Demo](https://statement-extractor.corp-o-rate.com)
484
+ - [Diverse Beam Search Paper](https://arxiv.org/abs/1610.02424)
485
+ - [Corp-o-Rate](https://corp-o-rate.com)
486
+
487
+ ## License
488
+
489
+ MIT License - see LICENSE file for details.