corp-extractor 0.2.7 (py3-none-any.whl)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


@@ -0,0 +1,377 @@
Metadata-Version: 2.4
Name: corp-extractor
Version: 0.2.7
Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
Project-URL: Repository, https://github.com/corp-o-rate/statement-extractor
Project-URL: Issues, https://github.com/corp-o-rate/statement-extractor/issues
Author-email: Corp-o-Rate <neil@corp-o-rate.com>
Maintainer-email: Corp-o-Rate <neil@corp-o-rate.com>
License: MIT
Keywords: diverse-beam-search,embeddings,gemma,information-extraction,knowledge-graph,nlp,statement-extraction,subject-predicate-object,t5,transformers,triples
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: click>=8.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=5.0.0rc3
Provides-Extra: all
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == 'embeddings'
Description-Content-Type: text/markdown

# Corp Extractor

Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.

[![PyPI version](https://img.shields.io/pypi/v/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
[![Python 3.10+](https://img.shields.io/pypi/pyversions/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
- **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
- **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness (0-1) based on source text
- **Beam Merging** *(v0.2.0)*: Combines top beams for better coverage instead of picking one
- **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
- **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
- **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
- **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
- **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
- **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
- **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries

## Installation

```bash
# Recommended: include embedding support for smart deduplication
pip install "corp-extractor[embeddings]"

# Minimal installation (no embedding features)
pip install corp-extractor
```

**Note**: This package requires `transformers>=5.0.0` (pre-release) for T5-Gemma 2 model support. Install with the `--pre` flag if needed:
```bash
pip install --pre "corp-extractor[embeddings]"
```

**For GPU support**, install PyTorch with CUDA first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "corp-extractor[embeddings]"
```

## Quick Start

```python
from statement_extractor import extract_statements

result = extract_statements("""
Apple Inc. announced the iPhone 15 at their September event.
Tim Cook presented the new features to customers worldwide.
""")

for stmt in result:
    print(f"{stmt.subject.text} ({stmt.subject.type})")
    print(f"  --[{stmt.predicate}]--> {stmt.object.text}")
    print(f"  Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
```

## Command Line Interface

The library includes a CLI for quick extraction from the terminal.

### Install Globally (Recommended)

For best results, install globally first:

```bash
# Using uv (recommended)
uv tool install "corp-extractor[embeddings]"

# Using pipx
pipx install "corp-extractor[embeddings]"

# Using pip
pip install "corp-extractor[embeddings]"

# Then use anywhere
corp-extractor "Your text here"
```

### Quick Run with uvx

Run directly without installing using [uv](https://docs.astral.sh/uv/):

```bash
uvx corp-extractor "Apple announced a new iPhone."
```

**Note**: The first run downloads the model (~1.5 GB), which may take a few minutes.

### Usage Examples

```bash
# Extract from text argument
corp-extractor "Apple Inc. announced the iPhone 15 at their September event."

# Extract from file
corp-extractor -f article.txt

# Pipe from stdin
cat article.txt | corp-extractor -

# Output as JSON
corp-extractor "Tim Cook is CEO of Apple." --json

# Output as XML
corp-extractor -f article.txt --xml

# Verbose output with confidence scores
corp-extractor -f article.txt --verbose

# Use more beams for better quality
corp-extractor -f article.txt --beams 8

# Use custom predicate taxonomy
corp-extractor -f article.txt --taxonomy predicates.txt

# Use GPU explicitly
corp-extractor -f article.txt --device cuda
```

### CLI Options

```
Usage: corp-extractor [OPTIONS] [TEXT]

Options:
  -f, --file PATH                Read input from file
  -o, --output [table|json|xml]  Output format (default: table)
  --json                         Output as JSON (shortcut)
  --xml                          Output as XML (shortcut)
  -b, --beams INTEGER            Number of beams (default: 4)
  --diversity FLOAT              Diversity penalty (default: 1.0)
  --max-tokens INTEGER           Max tokens to generate (default: 2048)
  --no-dedup                     Disable deduplication
  --no-embeddings                Disable embedding-based dedup (faster)
  --no-merge                     Disable beam merging
  --dedup-threshold FLOAT        Deduplication threshold (default: 0.65)
  --min-confidence FLOAT         Min confidence filter (default: 0)
  --taxonomy PATH                Load predicate taxonomy from file
  --taxonomy-threshold FLOAT     Taxonomy matching threshold (default: 0.5)
  --device [auto|cuda|cpu]       Device to use (default: auto)
  -v, --verbose                  Show confidence scores and metadata
  -q, --quiet                    Suppress progress messages
  --version                      Show version
  --help                         Show this message
```

## New in v0.2.0: Quality Scoring & Beam Merging

By default, the library now:
- **Scores each triple** for groundedness based on whether entities appear in the source text
- **Merges top beams** instead of selecting one, improving coverage
- **Uses embeddings** to detect semantically similar predicates ("bought" ≈ "acquired")

```python
from statement_extractor import ExtractionOptions, ScoringConfig, extract_statements

# Precision mode - filter low-confidence triples
scoring = ScoringConfig(min_confidence=0.7)
options = ExtractionOptions(scoring_config=scoring)
result = extract_statements(text, options)

# Access confidence scores
for stmt in result:
    print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")
```
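
To make the idea concrete, here is a minimal sketch of what a groundedness check measures. This is *not* the library's actual scorer (which lives in `statement_extractor/scoring.py`); it simply computes the fraction of a triple's parts that appear verbatim in the source text:

```python
def groundedness(subject: str, predicate: str, obj: str, source: str) -> float:
    """Toy groundedness score: the fraction of the triple's parts that
    appear verbatim (case-insensitively) in the source text."""
    src = source.lower()
    parts = [subject, predicate, obj]
    return sum(part.lower() in src for part in parts) / len(parts)
```

A hallucinated object (say, a product never mentioned in the text) drags the score down, which is the kind of signal a `min_confidence` filter then acts on.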

## New in v0.2.0: Predicate Taxonomies

Normalize predicates to canonical forms using embedding similarity:

```python
from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements

taxonomy = PredicateTaxonomy(predicates=[
    "acquired", "founded", "works_for", "announced",
    "invested_in", "partnered_with",
])

options = ExtractionOptions(predicate_taxonomy=taxonomy)
result = extract_statements(text, options)

# "bought" -> "acquired" via embedding similarity
for stmt in result:
    if stmt.canonical_predicate:
        print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
```
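
The matching step itself is just nearest-neighbour search under cosine similarity with a threshold. The sketch below illustrates that mechanic with hand-made 3-d vectors (`EMB` is hypothetical; the library derives real embeddings via sentence-transformers):

```python
import math

# Hypothetical toy embeddings standing in for sentence-transformer vectors.
EMB = {
    "bought":    [0.90, 0.10, 0.00],
    "acquired":  [0.85, 0.15, 0.05],
    "founded":   [0.10, 0.90, 0.20],
    "announced": [0.00, 0.20, 0.95],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def normalize(predicate: str, taxonomy: list[str], threshold: float = 0.5):
    """Map a predicate to the most similar canonical form,
    or None if nothing clears the threshold."""
    best, best_sim = None, threshold
    for canonical in taxonomy:
        sim = cosine(EMB[predicate], EMB[canonical])
        if sim > best_sim:
            best, best_sim = canonical, sim
    return best
```

With these vectors, "bought" maps to "acquired", while an unrelated taxonomy leaves it unmapped - mirroring the `--taxonomy-threshold` behaviour in the CLI.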

## New in v0.2.2: Contextualized Matching

Predicate canonicalization and deduplication now use **contextualized matching**:
- Compares full "Subject Predicate Object" strings against source text
- Better accuracy because predicates are evaluated in context
- When duplicates are found, keeps the statement with the best match to source text

This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
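
The selection rule can be sketched as follows. Token-set overlap is used here as a cheap stand-in for embedding similarity (the library uses sentence-transformer embeddings, not Jaccard):

```python
def jaccard(a: str, b: str) -> float:
    """Cheap stand-in for embedding similarity: token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def pick_best(duplicates: list[tuple[str, str, str]], source: str) -> tuple[str, str, str]:
    """Among duplicate triples, keep the one whose full
    'Subject Predicate Object' string best matches the source."""
    return max(duplicates, key=lambda t: jaccard(" ".join(t), source))
```

Because the whole statement is compared, the variant whose predicate actually occurs in the source wins, even when the predicates alone are near-synonyms.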

## New in v0.2.3: Entity Type Merging & Reversal Detection

### Entity Type Merging

When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:

```python
# Before deduplication:
# Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
# Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)

# After deduplication:
# Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
```
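
The merge rule reduces to "a specific type beats `UNKNOWN`". A minimal sketch (the tie-break between two *different* specific types is an assumption here - keeping the first - not something the README specifies):

```python
def merge_type(a: str, b: str) -> str:
    """Prefer a specific entity type over UNKNOWN when merging duplicates.
    If both are specific, keep the first (assumed tie-break)."""
    return b if a == "UNKNOWN" else a
```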

### Subject-Object Reversal Detection

The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:

```python
from statement_extractor import PredicateComparer

comparer = PredicateComparer()

# Automatically detect and fix reversals
# (statements: a list of Statement objects from a prior extraction)
fixed_statements = comparer.detect_and_fix_reversals(statements)

for stmt in fixed_statements:
    if stmt.was_reversed:
        print(f"Fixed reversal: {stmt}")
```

**How it works:**
1. For each statement with source text, compares:
   - "Subject Predicate Object" embedding vs source text
   - "Object Predicate Subject" embedding vs source text
2. If the reversed form has higher similarity, swaps subject and object
3. Sets `was_reversed=True` to indicate the correction

During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
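
The three steps above can be sketched in miniature. This toy version uses string similarity (`difflib`) in place of the embedding comparison the library actually performs:

```python
from difflib import SequenceMatcher

def _sim(a: str, b: str) -> float:
    """Stand-in for embedding similarity: character-level string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fix_reversal(subject: str, predicate: str, obj: str, source: str):
    """Swap subject and object if the reversed reading matches the source
    better; the final flag plays the role of was_reversed."""
    forward = _sim(f"{subject} {predicate} {obj}", source)
    backward = _sim(f"{obj} {predicate} {subject}", source)
    if backward > forward:
        return obj, subject, True
    return subject, obj, False
```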

## Disable Embeddings (Faster, No Extra Dependencies)

```python
from statement_extractor import ExtractionOptions, extract_statements

options = ExtractionOptions(
    embedding_dedup=False,  # Use exact text matching
    merge_beams=False,      # Select single best beam
)
result = extract_statements(text, options)
```

## Output Formats

```python
from statement_extractor import (
    extract_statements,
    extract_statements_as_json,
    extract_statements_as_xml,
    extract_statements_as_dict,
)

# Pydantic models (default)
result = extract_statements(text)

# JSON string
json_output = extract_statements_as_json(text)

# Raw XML (model's native format)
xml_output = extract_statements_as_xml(text)

# Python dictionary
dict_output = extract_statements_as_dict(text)
```

## Batch Processing

```python
from statement_extractor import StatementExtractor

extractor = StatementExtractor(device="cuda")  # or "cpu"

texts = ["Text 1...", "Text 2...", "Text 3..."]
for text in texts:
    result = extractor.extract(text)
    print(f"Found {len(result)} statements")
```

## Entity Types

| Type | Description | Example |
|------|-------------|---------|
| `ORG` | Organizations | Apple Inc., United Nations |
| `PERSON` | People | Tim Cook, Elon Musk |
| `GPE` | Geopolitical entities | USA, California, Paris |
| `LOC` | Non-GPE locations | Mount Everest, Pacific Ocean |
| `PRODUCT` | Products | iPhone, Model S |
| `EVENT` | Events | World Cup, CES 2024 |
| `WORK_OF_ART` | Creative works | Mona Lisa, Game of Thrones |
| `LAW` | Legal documents | GDPR, Clean Air Act |
| `DATE` | Dates | 2024, January 15 |
| `MONEY` | Monetary values | $50 million, €100 |
| `PERCENT` | Percentages | 25%, 0.5% |
| `QUANTITY` | Quantities | 500 employees, 1.5 tons |
| `UNKNOWN` | Unrecognized | (fallback) |

## How It Works

This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam Search** ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424)):

1. **Diverse Beam Search**: Generates 4+ candidate outputs using beam groups with diversity penalty
2. **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness in source text
3. **Beam Merging** *(v0.2.0)*: Top beams combined for better coverage
4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
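
The diversity mechanism in step 1 can be shown in miniature. This is not the library's generation code (which delegates to `transformers` beam search over T5-Gemma 2); it is a toy with one greedy "beam" per group, where each group is penalized for tokens that earlier groups already chose at the same step:

```python
def diverse_groups(score, vocab, steps, num_groups=4, diversity_penalty=1.0):
    """Toy diverse decoding: one greedy sequence per group; later groups
    subtract a penalty for tokens earlier groups picked at the same step."""
    seqs = [[] for _ in range(num_groups)]
    for _ in range(steps):
        used = {}  # token -> times chosen at this step by earlier groups
        for g in range(num_groups):
            best = max(vocab, key=lambda tok:
                       score(seqs[g], tok) - diversity_penalty * used.get(tok, 0))
            seqs[g].append(best)
            used[best] = used.get(best, 0) + 1
    return seqs

# A scorer that always slightly prefers "a": without the penalty every group
# would emit the same sequence; with it, the groups are pushed apart.
prefs = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.2}
def score(prefix, tok):
    return prefs[tok]

beams = diverse_groups(score, ["a", "b", "c", "d"], steps=2)
```

The `--diversity` CLI flag corresponds to this penalty strength: at 0 all groups collapse onto the single best sequence; larger values force the candidate set apart, which is what makes beam merging (step 3) worthwhile.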

## Requirements

- Python 3.10+
- PyTorch 2.0+
- Transformers 5.0+ (pre-release)
- Pydantic 2.0+
- sentence-transformers 2.2+ *(optional, for embedding features)*
- ~2GB VRAM (GPU) or ~4GB RAM (CPU)

## Links

- [Model on HuggingFace](https://huggingface.co/Corp-o-Rate-Community/statement-extractor)
- [Web Demo](https://statement-extractor.corp-o-rate.com)
- [Diverse Beam Search Paper](https://arxiv.org/abs/1610.02424)
- [Corp-o-Rate](https://corp-o-rate.com)

## License

MIT License - see LICENSE file for details.
@@ -0,0 +1,11 @@
statement_extractor/__init__.py,sha256=MIZgn-lD9-XGJapzdyYxMhEJFRrTzftbRklrhwA4e8w,2967
statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
statement_extractor/cli.py,sha256=kJnZm_mbq4np1vTxSjczMZM5zGuDlC8Z5xLJd8O3xZ4,7605
statement_extractor/extractor.py,sha256=PX0SiJnYUnh06seyH5W77FcPpcvLXwEM8IGsuVuRh0Q,22158
statement_extractor/models.py,sha256=xDF3pDPhIiqiMwFMPV94aBEgZGbSe-x2TkshahOiCog,10739
statement_extractor/predicate_comparer.py,sha256=iwBfNJFNOFv8ODKN9F9EtmknpCeSThOpnu6P_PJSmgE,24898
statement_extractor/scoring.py,sha256=Wa1BW6jXtHD7dZkUXwdwE39hwFo2ko6BuIogBc4E2Lk,14493
corp_extractor-0.2.7.dist-info/METADATA,sha256=ldfaiJqQEkCF8vlVq2H3t5MJOH9_C1QSAz80s33nSVU,13591
corp_extractor-0.2.7.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
corp_extractor-0.2.7.dist-info/entry_points.txt,sha256=i0iKFqPIusvb-QTQ1zNnFgAqatgVah-jIhahbs5TToQ,115
corp_extractor-0.2.7.dist-info/RECORD,,
@@ -0,0 +1,4 @@
Wheel-Version: 1.0
Generator: hatchling 1.28.0
Root-Is-Purelib: true
Tag: py3-none-any
@@ -0,0 +1,3 @@
[console_scripts]
corp-extractor = statement_extractor.cli:main
statement-extractor = statement_extractor.cli:main
@@ -0,0 +1,110 @@
"""
Statement Extractor - Extract structured statements from text using T5-Gemma 2.

A Python library for extracting subject-predicate-object triples from unstructured text.
Uses Diverse Beam Search (Vijayakumar et al., 2016) for high-quality extraction.

Paper: https://arxiv.org/abs/1610.02424

Features:
- Quality-based beam scoring and merging
- Embedding-based predicate comparison for smart deduplication
- Configurable precision/recall tradeoff
- Support for predicate taxonomies

Example:
    >>> from statement_extractor import extract_statements
    >>> result = extract_statements("Apple Inc. announced a new iPhone today.")
    >>> for stmt in result:
    ...     print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
    Apple Inc. -> announced -> a new iPhone

    >>> # Access confidence scores
    >>> for stmt in result:
    ...     print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")

    >>> # Get as different formats
    >>> xml = extract_statements_as_xml("Some text...")
    >>> json_str = extract_statements_as_json("Some text...")
    >>> data = extract_statements_as_dict("Some text...")
"""

__version__ = "0.2.5"

# Core models
from .models import (
    Entity,
    EntityType,
    ExtractionOptions,
    ExtractionResult,
    Statement,
    # New in 0.2.0
    PredicateMatch,
    PredicateTaxonomy,
    PredicateComparisonConfig,
    ScoringConfig,
)

# Main extractor
from .extractor import (
    StatementExtractor,
    extract_statements,
    extract_statements_as_dict,
    extract_statements_as_json,
    extract_statements_as_xml,
)

# Canonicalization utilities
from .canonicalization import (
    Canonicalizer,
    default_entity_canonicalizer,
    deduplicate_statements_exact,
)

# Scoring utilities
from .scoring import (
    BeamScorer,
    TripleScorer,
)

__all__ = [
    # Version
    "__version__",
    # Core models
    "Entity",
    "EntityType",
    "ExtractionOptions",
    "ExtractionResult",
    "Statement",
    # Configuration models (new in 0.2.0)
    "PredicateMatch",
    "PredicateTaxonomy",
    "PredicateComparisonConfig",
    "ScoringConfig",
    # Extractor class
    "StatementExtractor",
    # Convenience functions
    "extract_statements",
    "extract_statements_as_dict",
    "extract_statements_as_json",
    "extract_statements_as_xml",
    # Canonicalization
    "Canonicalizer",
    "default_entity_canonicalizer",
    "deduplicate_statements_exact",
    # Scoring
    "BeamScorer",
    "TripleScorer",
]


# Lazy imports for optional dependencies
def __getattr__(name: str):
    """Lazy import for optional modules."""
    if name == "PredicateComparer":
        from .predicate_comparer import PredicateComparer
        return PredicateComparer
    if name == "EmbeddingDependencyError":
        from .predicate_comparer import EmbeddingDependencyError
        return EmbeddingDependencyError
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")