corp_extractor-0.2.3-py3-none-any.whl

This diff shows the content of a publicly available package version released to one of the supported registries. It is provided for informational purposes only and reflects the package contents as they appear in the public registry.
@@ -0,0 +1,280 @@
+ Metadata-Version: 2.4
+ Name: corp-extractor
+ Version: 0.2.3
+ Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
+ Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
+ Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
+ Project-URL: Repository, https://github.com/corp-o-rate/statement-extractor
+ Project-URL: Issues, https://github.com/corp-o-rate/statement-extractor/issues
+ Author-email: Corp-o-Rate <neil@corp-o-rate.com>
+ Maintainer-email: Corp-o-Rate <neil@corp-o-rate.com>
+ License: MIT
+ Keywords: diverse-beam-search,embeddings,gemma,information-extraction,knowledge-graph,nlp,statement-extraction,subject-predicate-object,t5,transformers,triples
+ Classifier: Development Status :: 4 - Beta
+ Classifier: Intended Audience :: Developers
+ Classifier: Intended Audience :: Science/Research
+ Classifier: License :: OSI Approved :: MIT License
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Programming Language :: Python :: 3.10
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
+ Classifier: Topic :: Text Processing :: Linguistic
+ Requires-Python: >=3.10
+ Requires-Dist: numpy>=1.24.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: torch>=2.0.0
+ Requires-Dist: transformers>=4.35.0
+ Provides-Extra: all
+ Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
+ Provides-Extra: dev
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
+ Requires-Dist: ruff>=0.1.0; extra == 'dev'
+ Provides-Extra: embeddings
+ Requires-Dist: sentence-transformers>=2.2.0; extra == 'embeddings'
+ Description-Content-Type: text/markdown
+ 
+ # Corp Extractor
+ 
+ Extract structured subject-predicate-object statements from unstructured text using the T5-Gemma 2 model.
+ 
+ [![PyPI version](https://img.shields.io/pypi/v/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
+ [![Python 3.10+](https://img.shields.io/pypi/pyversions/corp-extractor.svg)](https://pypi.org/project/corp-extractor/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+ 
+ ## Features
+ 
+ - **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
+ - **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
+ - **Quality Scoring** *(v0.2.0)*: Scores each triple for groundedness (0-1) against the source text
+ - **Beam Merging** *(v0.2.0)*: Combines the top beams for better coverage instead of picking a single one
+ - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
+ - **Predicate Taxonomies** *(v0.2.0)*: Normalizes predicates to canonical forms via embeddings
+ - **Contextualized Matching** *(v0.2.2)*: Compares the full "Subject Predicate Object" string against the source text for better accuracy
+ - **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
+ - **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
+ - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
+ 
+ ## Installation
+ 
+ ```bash
+ # Recommended: include embedding support for smart deduplication
+ pip install "corp-extractor[embeddings]"
+ 
+ # Minimal installation (no embedding features)
+ pip install corp-extractor
+ ```
+ 
+ **Note**: For GPU support, install PyTorch with CUDA first:
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cu121
+ pip install "corp-extractor[embeddings]"
+ ```
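+ 
+ To confirm the CUDA build is actually in use, a quick sanity check (plain PyTorch, nothing package-specific):
+ 
+ ```python
+ import torch
+ 
+ # True means extraction can run on the GPU; False falls back to CPU.
+ print(torch.cuda.is_available())
+ ```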
+ 
+ ## Quick Start
+ 
+ ```python
+ from statement_extractor import extract_statements
+ 
+ result = extract_statements("""
+ Apple Inc. announced the iPhone 15 at their September event.
+ Tim Cook presented the new features to customers worldwide.
+ """)
+ 
+ for stmt in result:
+     print(f"{stmt.subject.text} ({stmt.subject.type})")
+     print(f" --[{stmt.predicate}]--> {stmt.object.text}")
+     print(f" Confidence: {stmt.confidence_score:.2f}")  # NEW in v0.2.0
+ ```
+ 
+ ## New in v0.2.0: Quality Scoring & Beam Merging
+ 
+ By default, the library now:
+ - **Scores each triple** for groundedness, based on whether its entities appear in the source text
+ - **Merges top beams** instead of selecting a single one, improving coverage
+ - **Uses embeddings** to detect semantically similar predicates ("bought" ≈ "acquired")
+ 
+ ```python
+ from statement_extractor import ExtractionOptions, ScoringConfig, extract_statements
+ 
+ # Precision mode - filter out low-confidence triples
+ scoring = ScoringConfig(min_confidence=0.7)
+ options = ExtractionOptions(scoring_config=scoring)
+ result = extract_statements(text, options)
+ 
+ # Access confidence scores
+ for stmt in result:
+     print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")
+ ```
+ 
+ ## New in v0.2.0: Predicate Taxonomies
+ 
+ Normalize predicates to canonical forms using embedding similarity:
+ 
+ ```python
+ from statement_extractor import PredicateTaxonomy, ExtractionOptions, extract_statements
+ 
+ taxonomy = PredicateTaxonomy(predicates=[
+     "acquired", "founded", "works_for", "announced",
+     "invested_in", "partnered_with",
+ ])
+ 
+ options = ExtractionOptions(predicate_taxonomy=taxonomy)
+ result = extract_statements(text, options)
+ 
+ # "bought" -> "acquired" via embedding similarity
+ for stmt in result:
+     if stmt.canonical_predicate:
+         print(f"{stmt.predicate} -> {stmt.canonical_predicate}")
+ ```
+ 
+ ## New in v0.2.2: Contextualized Matching
+ 
+ Predicate canonicalization and deduplication now use **contextualized matching**:
+ - Compares full "Subject Predicate Object" strings against the source text
+ - Improves accuracy because predicates are evaluated in context
+ - When duplicates are found, keeps the statement that best matches the source text
+ 
+ This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
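+ 
+ As a minimal sketch of the idea (illustrative only - this is not the library's internal implementation, and the embedding model name is an assumption):
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ 
+ source = "Apple acquired Beats Electronics for $3 billion."
+ candidates = ["Apple bought Beats Electronics", "Apple acquired Beats Electronics"]
+ 
+ # Embed the full statement strings and the source text, then compare.
+ src_emb = model.encode(source, convert_to_tensor=True)
+ cand_embs = model.encode(candidates, convert_to_tensor=True)
+ scores = util.cos_sim(cand_embs, src_emb).squeeze(-1)
+ 
+ # Among duplicates, the statement closest to the source wins.
+ print(candidates[int(scores.argmax())])
+ ```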
+ 
+ ## New in v0.2.3: Entity Type Merging & Reversal Detection
+ 
+ ### Entity Type Merging
+ 
+ When deduplicating statements, entity types are now merged automatically. If one statement has the `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
+ 
+ ```python
+ # Before deduplication:
+ #   Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
+ #   Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+ 
+ # After deduplication:
+ #   Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+ ```
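+ 
+ The precedence rule itself is simple. A sketch (the packaged logic lives in `Statement.merge_entity_types_from`; the enum member names below are assumptions based on the entity type table):
+ 
+ ```python
+ from statement_extractor import EntityType
+ 
+ def merge_types(a: EntityType, b: EntityType) -> EntityType:
+     """Prefer a specific type over UNKNOWN; otherwise keep the first seen."""
+     if a == EntityType.UNKNOWN and b != EntityType.UNKNOWN:
+         return b
+     return a
+ ```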
+ 
+ ### Subject-Object Reversal Detection
+ 
+ The library now detects when the subject and object may have been extracted in the wrong order by comparing embeddings against the source text:
+ 
+ ```python
+ from statement_extractor import PredicateComparer
+ 
+ comparer = PredicateComparer()
+ 
+ # Automatically detect and fix reversals
+ fixed_statements = comparer.detect_and_fix_reversals(statements)
+ 
+ for stmt in fixed_statements:
+     if stmt.was_reversed:
+         print(f"Fixed reversal: {stmt}")
+ ```
175
+
176
+ **How it works:**
177
+ 1. For each statement with source text, compares:
178
+ - "Subject Predicate Object" embedding vs source text
179
+ - "Object Predicate Subject" embedding vs source text
180
+ 2. If the reversed form has higher similarity, swaps subject and object
181
+ 3. Sets `was_reversed=True` to indicate the correction
182
+
183
+ During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
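+ 
+ A stripped-down sketch of that comparison (illustrative; `PredicateComparer` handles model loading, batching, and thresholds internally, and the model name here is an assumption):
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+ 
+ model = SentenceTransformer("all-MiniLM-L6-v2")
+ 
+ def looks_reversed(subj: str, pred: str, obj: str, source: str) -> bool:
+     """True when the reversed ordering matches the source text better."""
+     embs = model.encode(
+         [f"{subj} {pred} {obj}", f"{obj} {pred} {subj}", source],
+         convert_to_tensor=True,
+     )
+     forward_sim = util.cos_sim(embs[0], embs[2]).item()
+     reversed_sim = util.cos_sim(embs[1], embs[2]).item()
+     return reversed_sim > forward_sim
+ 
+ print(looks_reversed("Beats", "acquired", "Apple",
+                      "Apple acquired Beats Electronics in 2014."))
+ ```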
184
+
185
+ ## Disable Embeddings (Faster, No Extra Dependencies)
186
+
187
+ ```python
188
+ options = ExtractionOptions(
189
+ embedding_dedup=False, # Use exact text matching
190
+ merge_beams=False, # Select single best beam
191
+ )
192
+ result = extract_statements(text, options)
193
+ ```
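+ 
+ With embeddings disabled, deduplication falls back to the exported `deduplicate_statements_exact`, which canonicalizes entity text (trimming, lowercasing, dropping leading determiners) before comparing. A sketch (the constructor arguments are assumptions about the Pydantic models):
+ 
+ ```python
+ from statement_extractor import Entity, EntityType, Statement, deduplicate_statements_exact
+ 
+ s1 = Statement(
+     subject=Entity(text="The Apple Inc.", type=EntityType.UNKNOWN),
+     predicate="acquired",
+     object=Entity(text="Beats", type=EntityType.ORG),
+ )
+ s2 = Statement(
+     subject=Entity(text="apple inc.", type=EntityType.ORG),
+     predicate="acquired",
+     object=Entity(text="Beats", type=EntityType.ORG),
+ )
+ 
+ # Both canonicalize to the same (subject, predicate, object) key,
+ # so they merge into one statement and ORG wins over UNKNOWN.
+ unique = deduplicate_statements_exact([s1, s2])
+ assert len(unique) == 1
+ ```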
+ 
+ ## Output Formats
+ 
+ ```python
+ from statement_extractor import (
+     extract_statements,
+     extract_statements_as_json,
+     extract_statements_as_xml,
+     extract_statements_as_dict,
+ )
+ 
+ # Pydantic models (default)
+ result = extract_statements(text)
+ 
+ # JSON string
+ json_output = extract_statements_as_json(text)
+ 
+ # Raw XML (the model's native format)
+ xml_output = extract_statements_as_xml(text)
+ 
+ # Python dictionary
+ dict_output = extract_statements_as_dict(text)
+ ```
+ 
+ ## Batch Processing
+ 
+ ```python
+ from statement_extractor import StatementExtractor
+ 
+ # Load the model once and reuse it across texts
+ extractor = StatementExtractor(device="cuda")  # or "cpu"
+ 
+ texts = ["Text 1...", "Text 2...", "Text 3..."]
+ for text in texts:
+     result = extractor.extract(text)
+     print(f"Found {len(result)} statements")
+ ```
+ 
+ ## Entity Types
+ 
+ | Type | Description | Example |
+ |------|-------------|---------|
+ | `ORG` | Organizations | Apple Inc., United Nations |
+ | `PERSON` | People | Tim Cook, Elon Musk |
+ | `GPE` | Geopolitical entities | USA, California, Paris |
+ | `LOC` | Non-GPE locations | Mount Everest, Pacific Ocean |
+ | `PRODUCT` | Products | iPhone, Model S |
+ | `EVENT` | Events | World Cup, CES 2024 |
+ | `WORK_OF_ART` | Creative works | Mona Lisa, Game of Thrones |
+ | `LAW` | Legal documents | GDPR, Clean Air Act |
+ | `DATE` | Dates | 2024, January 15 |
+ | `MONEY` | Monetary values | $50 million, €100 |
+ | `PERCENT` | Percentages | 25%, 0.5% |
+ | `QUANTITY` | Quantities | 500 employees, 1.5 tons |
+ | `UNKNOWN` | Unrecognized | (fallback) |
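+ 
+ Entity types are exposed on each statement, so results can be filtered directly. A small sketch (it assumes `EntityType` members match the table above):
+ 
+ ```python
+ from statement_extractor import EntityType, extract_statements
+ 
+ result = extract_statements("Tim Cook leads Apple Inc. from Cupertino.")
+ 
+ # Keep only statements whose subject is an organization.
+ org_statements = [s for s in result if s.subject.type == EntityType.ORG]
+ ```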
+ 
+ ## How It Works
+ 
+ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam Search** ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424)); a minimal generation sketch follows the list:
+ 
+ 1. **Diverse Beam Search**: Generates 4+ candidate outputs using beam groups with a diversity penalty
+ 2. **Quality Scoring** *(v0.2.0)*: Scores each triple for groundedness in the source text
+ 3. **Beam Merging** *(v0.2.0)*: Combines the top beams for better coverage
+ 4. **Embedding Dedup** *(v0.2.0)*: Removes near-duplicate predicates via semantic similarity
+ 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
+ 6. **Contextualized Matching** *(v0.2.2)*: Uses full statement context for canonicalization and dedup
+ 7. **Entity Type Merging** *(v0.2.3)*: Merges UNKNOWN types with specific types during dedup
+ 8. **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals via embedding comparison
+ 
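+ Step 1 corresponds to Hugging Face `generate()` with beam groups. A rough sketch, assuming the published checkpoint loads as a standard seq2seq model (parameter values are illustrative, not the library's defaults):
+ 
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ 
+ model_id = "Corp-o-Rate-Community/statement-extractor"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+ 
+ inputs = tokenizer("Apple Inc. announced the iPhone 15.", return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     num_beams=8,             # must be divisible by num_beam_groups
+     num_beam_groups=4,       # beam groups are what make the search *diverse*
+     diversity_penalty=1.0,   # penalizes groups for repeating each other
+     num_return_sequences=4,  # candidate beams to score, dedup, and merge
+     max_new_tokens=256,
+ )
+ for seq in outputs:
+     print(tokenizer.decode(seq, skip_special_tokens=True))
+ ```
+ 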
+ ## Requirements
+ 
+ - Python 3.10+
+ - PyTorch 2.0+
+ - Transformers 4.35+
+ - Pydantic 2.0+
+ - sentence-transformers 2.2+ *(optional, for embedding features)*
+ - ~2GB VRAM (GPU) or ~4GB RAM (CPU)
+ 
+ ## Links
+ 
+ - [Model on HuggingFace](https://huggingface.co/Corp-o-Rate-Community/statement-extractor)
+ - [Web Demo](https://statement-extractor.corp-o-rate.com)
+ - [Diverse Beam Search Paper](https://arxiv.org/abs/1610.02424)
+ - [Corp-o-Rate](https://corp-o-rate.com)
+ 
+ ## License
+ 
+ MIT License - see LICENSE file for details.
@@ -0,0 +1,9 @@
+ statement_extractor/__init__.py,sha256=4Ht8GJdgik_iti7zpG71Oi5EEAnck6AYDvy7soRqIOg,2967
+ statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
+ statement_extractor/extractor.py,sha256=PX0SiJnYUnh06seyH5W77FcPpcvLXwEM8IGsuVuRh0Q,22158
+ statement_extractor/models.py,sha256=xDF3pDPhIiqiMwFMPV94aBEgZGbSe-x2TkshahOiCog,10739
+ statement_extractor/predicate_comparer.py,sha256=iwBfNJFNOFv8ODKN9F9EtmknpCeSThOpnu6P_PJSmgE,24898
+ statement_extractor/scoring.py,sha256=Wa1BW6jXtHD7dZkUXwdwE39hwFo2ko6BuIogBc4E2Lk,14493
+ corp_extractor-0.2.3.dist-info/METADATA,sha256=dCJbLWIj7hgzpkC4zYvNmnEAhNnizUEq_caea6AamIU,10724
+ corp_extractor-0.2.3.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
+ corp_extractor-0.2.3.dist-info/RECORD,,
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: hatchling 1.28.0
+ Root-Is-Purelib: true
+ Tag: py3-none-any
@@ -0,0 +1,110 @@
+ """
+ Statement Extractor - Extract structured statements from text using T5-Gemma 2.
+ 
+ A Python library for extracting subject-predicate-object triples from unstructured text.
+ Uses Diverse Beam Search (Vijayakumar et al., 2016) for high-quality extraction.
+ 
+ Paper: https://arxiv.org/abs/1610.02424
+ 
+ Features:
+     - Quality-based beam scoring and merging
+     - Embedding-based predicate comparison for smart deduplication
+     - Configurable precision/recall tradeoff
+     - Support for predicate taxonomies
+ 
+ Example:
+     >>> from statement_extractor import extract_statements
+     >>> result = extract_statements("Apple Inc. announced a new iPhone today.")
+     >>> for stmt in result:
+     ...     print(f"{stmt.subject.text} -> {stmt.predicate} -> {stmt.object.text}")
+     Apple Inc. -> announced -> a new iPhone
+ 
+     >>> # Access confidence scores
+     >>> for stmt in result:
+     ...     print(f"{stmt} (confidence: {stmt.confidence_score:.2f})")
+ 
+     >>> # Get output in different formats
+     >>> xml = extract_statements_as_xml("Some text...")
+     >>> json_str = extract_statements_as_json("Some text...")
+     >>> data = extract_statements_as_dict("Some text...")
+ """
+ 
+ __version__ = "0.2.3"
+ 
+ # Core models
+ from .models import (
+     Entity,
+     EntityType,
+     ExtractionOptions,
+     ExtractionResult,
+     Statement,
+     # New in 0.2.0
+     PredicateMatch,
+     PredicateTaxonomy,
+     PredicateComparisonConfig,
+     ScoringConfig,
+ )
+ 
+ # Main extractor
+ from .extractor import (
+     StatementExtractor,
+     extract_statements,
+     extract_statements_as_dict,
+     extract_statements_as_json,
+     extract_statements_as_xml,
+ )
+ 
+ # Canonicalization utilities
+ from .canonicalization import (
+     Canonicalizer,
+     default_entity_canonicalizer,
+     deduplicate_statements_exact,
+ )
+ 
+ # Scoring utilities
+ from .scoring import (
+     BeamScorer,
+     TripleScorer,
+ )
+ 
+ __all__ = [
+     # Version
+     "__version__",
+     # Core models
+     "Entity",
+     "EntityType",
+     "ExtractionOptions",
+     "ExtractionResult",
+     "Statement",
+     # Configuration models (new in 0.2.0)
+     "PredicateMatch",
+     "PredicateTaxonomy",
+     "PredicateComparisonConfig",
+     "ScoringConfig",
+     # Extractor class
+     "StatementExtractor",
+     # Convenience functions
+     "extract_statements",
+     "extract_statements_as_dict",
+     "extract_statements_as_json",
+     "extract_statements_as_xml",
+     # Canonicalization
+     "Canonicalizer",
+     "default_entity_canonicalizer",
+     "deduplicate_statements_exact",
+     # Scoring
+     "BeamScorer",
+     "TripleScorer",
+ ]
+ 
+ 
+ # Lazy imports for optional dependencies
+ def __getattr__(name: str):
+     """Lazy import for optional modules."""
+     if name == "PredicateComparer":
+         from .predicate_comparer import PredicateComparer
+         return PredicateComparer
+     if name == "EmbeddingDependencyError":
+         from .predicate_comparer import EmbeddingDependencyError
+         return EmbeddingDependencyError
+     raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
@@ -0,0 +1,196 @@
+ """
+ Entity canonicalization for statement deduplication.
+ 
+ Provides default canonicalization functions and a Canonicalizer class
+ for normalizing entity text before comparison.
+ """
+ 
+ import re
+ from typing import Callable, Optional
+ 
+ from .models import Statement
+ 
+ 
+ # Common determiners to remove from the start of entity text
+ DETERMINERS = frozenset(["the", "a", "an", "this", "that", "these", "those"])
+ 
+ 
+ def default_entity_canonicalizer(text: str) -> str:
+     """
+     Default entity canonicalization function.
+ 
+     Transformations:
+     - Trim leading/trailing whitespace
+     - Convert to lowercase
+     - Remove leading determiners (the, a, an, etc.)
+     - Normalize internal whitespace (multiple spaces -> single)
+ 
+     Args:
+         text: The entity text to canonicalize
+ 
+     Returns:
+         Canonicalized text
+ 
+     Example:
+         >>> default_entity_canonicalizer(" The Apple Inc. ")
+         'apple inc.'
+         >>> default_entity_canonicalizer("A new product")
+         'new product'
+     """
+     # Trim and lowercase
+     result = text.strip().lower()
+ 
+     # Normalize internal whitespace
+     result = re.sub(r'\s+', ' ', result)
+ 
+     # Remove leading determiners
+     words = result.split()
+     if words and words[0] in DETERMINERS:
+         result = ' '.join(words[1:])
+ 
+     return result.strip()
+ 
+ 
+ class Canonicalizer:
+     """
+     Canonicalize entities for deduplication.
+ 
+     Supports custom canonicalization functions for entities.
+     Predicate comparison uses embeddings (see PredicateComparer).
+ 
+     Example:
+         >>> canon = Canonicalizer()
+         >>> canon.canonicalize_entity("The Apple Inc.")
+         'apple inc.'
+ 
+         >>> # With custom function
+         >>> canon = Canonicalizer(entity_fn=lambda x: x.upper())
+         >>> canon.canonicalize_entity("Apple Inc.")
+         'APPLE INC.'
+     """
+ 
+     def __init__(
+         self,
+         entity_fn: Optional[Callable[[str], str]] = None,
+     ):
+         """
+         Initialize the canonicalizer.
+ 
+         Args:
+             entity_fn: Custom function to canonicalize entity text.
+                 If None, uses default_entity_canonicalizer.
+         """
+         self.entity_fn = entity_fn or default_entity_canonicalizer
+ 
+     def canonicalize_entity(self, text: str) -> str:
+         """
+         Canonicalize an entity string.
+ 
+         Args:
+             text: Entity text to canonicalize
+ 
+         Returns:
+             Canonicalized text
+         """
+         return self.entity_fn(text)
+ 
+     def canonicalize_statement_entities(
+         self,
+         statement: Statement
+     ) -> tuple[str, str]:
+         """
+         Return canonicalized (subject, object) tuple.
+ 
+         Note: Predicate comparison uses embeddings, not text canonicalization.
+ 
+         Args:
+             statement: Statement to canonicalize
+ 
+         Returns:
+             Tuple of (canonicalized_subject, canonicalized_object)
+         """
+         return (
+             self.canonicalize_entity(statement.subject.text),
+             self.canonicalize_entity(statement.object.text),
+         )
+ 
+     def create_dedup_key(
+         self,
+         statement: Statement,
+         predicate_canonical: Optional[str] = None
+     ) -> tuple[str, str, str]:
+         """
+         Create a deduplication key for a statement.
+ 
+         For exact-match deduplication (when not using embedding-based comparison).
+ 
+         Args:
+             statement: Statement to create key for
+             predicate_canonical: Optional canonical predicate (if taxonomy was used)
+ 
+         Returns:
+             Tuple of (subject, predicate, object) for deduplication
+         """
+         subj = self.canonicalize_entity(statement.subject.text)
+         obj = self.canonicalize_entity(statement.object.text)
+         pred = predicate_canonical or statement.predicate.lower().strip()
+         return (subj, pred, obj)
+ 
+ 
+ def deduplicate_statements_exact(
+     statements: list[Statement],
+     entity_canonicalizer: Optional[Callable[[str], str]] = None,
+     detect_reversals: bool = True,
+ ) -> list[Statement]:
+     """
+     Deduplicate statements using exact text matching.
+ 
+     Use this when embedding-based deduplication is disabled.
+     When duplicates are found, entity types are merged - specific types
+     (ORG, PERSON, etc.) take precedence over UNKNOWN.
+ 
+     When detect_reversals=True, also detects reversed duplicates where
+     subject and object are swapped. The first occurrence determines the
+     canonical orientation.
+ 
+     Args:
+         statements: List of statements to deduplicate
+         entity_canonicalizer: Optional custom canonicalization function
+         detect_reversals: Whether to detect reversed duplicates (default True)
+ 
+     Returns:
+         Deduplicated list with merged entity types
+     """
+     if len(statements) <= 1:
+         return statements
+ 
+     canonicalizer = Canonicalizer(entity_fn=entity_canonicalizer)
+ 
+     # Map from dedup key to index in the unique list
+     seen: dict[tuple[str, str, str], int] = {}
+     unique: list[Statement] = []
+ 
+     for stmt in statements:
+         key = canonicalizer.create_dedup_key(stmt)
+         # Also compute the reversed key (object, predicate, subject)
+         reversed_key = (key[2], key[1], key[0])
+ 
+         if key in seen:
+             # Direct duplicate found - merge entity types
+             existing_idx = seen[key]
+             existing_stmt = unique[existing_idx]
+             merged_stmt = existing_stmt.merge_entity_types_from(stmt)
+             unique[existing_idx] = merged_stmt
+         elif detect_reversals and reversed_key in seen:
+             # Reversed duplicate found - merge entity types (accounting for the reversal)
+             existing_idx = seen[reversed_key]
+             existing_stmt = unique[existing_idx]
+             # Merge types from the reversed statement
+             merged_stmt = existing_stmt.merge_entity_types_from(stmt.reversed())
+             unique[existing_idx] = merged_stmt
+         else:
+             # New unique statement
+             seen[key] = len(unique)
+             unique.append(stmt)
+ 
+     return unique