PyPI - corp-extractor - Versions diffs - 0.2.2__tar.gz → 0.2.3__tar.gz - Mend

corp-extractor 0.2.2__tar.gz → 0.2.3__tar.gz


Files changed (10)

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: corp-extractor
-Version: 0.2.2
+Version: 0.2.3
 Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
 Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
 Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -55,6 +55,8 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
 - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
+- **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
+- **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 ## Installation
@@ -139,6 +141,47 @@ Predicate canonicalization and deduplication now use **contextualized matching**
 This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
+## New in v0.2.3: Entity Type Merging & Reversal Detection
+### Entity Type Merging
+When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
+```python
+# Before deduplication:
+# Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
+# Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+# After deduplication:
+# Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+```
+### Subject-Object Reversal Detection
+The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:
+```python
+from statement_extractor import PredicateComparer
+comparer = PredicateComparer()
+# Automatically detect and fix reversals
+fixed_statements = comparer.detect_and_fix_reversals(statements)
+for stmt in fixed_statements:
+    if stmt.was_reversed:
+        print(f"Fixed reversal: {stmt}")
+```
+**How it works:**
+1. For each statement with source text, compares:
+   - "Subject Predicate Object" embedding vs source text
+   - "Object Predicate Subject" embedding vs source text
+2. If the reversed form has higher similarity, swaps subject and object
+3. Sets `was_reversed=True` to indicate the correction
+During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
 ## Disable Embeddings (Faster, No Extra Dependencies)
 ```python
@@ -213,6 +256,8 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
 4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
 6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
+7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
+8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
 ## Requirements

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/README.md RENAMED

@@ -15,6 +15,8 @@ Extract structured subject-predicate-object statements from unstructured text us
 - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
 - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
 - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
+- **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
+- **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
 - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
 ## Installation
@@ -99,6 +101,47 @@ Predicate canonicalization and deduplication now use **contextualized matching**
 This means "Apple bought Beats" vs "Apple acquired Beats" are compared holistically, not just "bought" vs "acquired".
+## New in v0.2.3: Entity Type Merging & Reversal Detection
+### Entity Type Merging
+When deduplicating statements, entity types are now automatically merged. If one statement has `UNKNOWN` type and a duplicate has a specific type (like `ORG` or `PERSON`), the specific type is preserved:
+```python
+# Before deduplication:
+# Statement 1: AtlasBio Labs (UNKNOWN) --sued by--> CuraPharm (ORG)
+# Statement 2: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+# After deduplication:
+# Single statement: AtlasBio Labs (ORG) --sued by--> CuraPharm (ORG)
+```
+### Subject-Object Reversal Detection
+The library now detects when subject and object may have been extracted in the wrong order by comparing embeddings against source text:
+```python
+from statement_extractor import PredicateComparer
+comparer = PredicateComparer()
+# Automatically detect and fix reversals
+fixed_statements = comparer.detect_and_fix_reversals(statements)
+for stmt in fixed_statements:
+    if stmt.was_reversed:
+        print(f"Fixed reversal: {stmt}")
+```
+**How it works:**
+1. For each statement with source text, compares:
+   - "Subject Predicate Object" embedding vs source text
+   - "Object Predicate Subject" embedding vs source text
+2. If the reversed form has higher similarity, swaps subject and object
+3. Sets `was_reversed=True` to indicate the correction
+During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
 ## Disable Embeddings (Faster, No Extra Dependencies)
 ```python
@@ -173,6 +216,8 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
 4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
 6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
+7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
+8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
 ## Requirements
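The "How it works" steps in the README text above can be sketched without the real embedding model. This is a minimal illustration, not the library's implementation: `toy_embed`, `cosine`, and `should_reverse` are hypothetical names, and the word-bigram counter is only an order-sensitive stand-in for the sentence-transformers embeddings the package uses.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Order-sensitive stand-in for a sentence embedding: word-bigram counts."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def should_reverse(subject: str, predicate: str, obj: str, source_text: str):
    """Compare 'S P O' and 'O P S' against the source text (steps 1 and 2)."""
    source = toy_embed(source_text)
    normal_sim = cosine(toy_embed(f"{subject} {predicate} {obj}"), source)
    reversed_sim = cosine(toy_embed(f"{obj} {predicate} {subject}"), source)
    return reversed_sim > normal_sim, normal_sim, reversed_sim


# The extracted triple has subject and object in the wrong order.
flip, normal_sim, reversed_sim = should_reverse(
    "AtlasBio", "sued", "CuraPharm", "CuraPharm sued AtlasBio in March"
)
print(flip, round(normal_sim, 3), round(reversed_sim, 3))
```

With this toy embedding the reversed form shares two bigrams with the source and the normal form shares none, so `flip` is `True`. Real sentence embeddings are far less sensitive to word order, so the gap between the two scores would usually be smaller.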

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "corp-extractor"
-version = "0.2.2"
+version = "0.2.3"
 description = "Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search"
 readme = "README.md"
 requires-python = ">=3.10"

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/canonicalization.py RENAMED

@@ -139,32 +139,58 @@ class Canonicalizer:
 def deduplicate_statements_exact(
     statements: list[Statement],
-    entity_canonicalizer: Optional[Callable[[str], str]] = None
+    entity_canonicalizer: Optional[Callable[[str], str]] = None,
+    detect_reversals: bool = True,
 ) -> list[Statement]:
     """
     Deduplicate statements using exact text matching.
     Use this when embedding-based deduplication is disabled.
+    When duplicates are found, entity types are merged - specific types
+    (ORG, PERSON, etc.) take precedence over UNKNOWN.
+    When detect_reversals=True, also detects reversed duplicates where
+    subject and object are swapped. The first occurrence determines the
+    canonical orientation.
     Args:
         statements: List of statements to deduplicate
         entity_canonicalizer: Optional custom canonicalization function
+        detect_reversals: Whether to detect reversed duplicates (default True)
     Returns:
-        Deduplicated list (keeps first occurrence)
+        Deduplicated list with merged entity types
     """
     if len(statements) <= 1:
         return statements
     canonicalizer = Canonicalizer(entity_fn=entity_canonicalizer)
-    seen: set[tuple[str, str, str]] = set()
+    # Map from dedup key to index in unique list
+    seen: dict[tuple[str, str, str], int] = {}
     unique: list[Statement] = []
     for stmt in statements:
         key = canonicalizer.create_dedup_key(stmt)
-        if key not in seen:
-            seen.add(key)
+        # Also compute reversed key (object, predicate, subject)
+        reversed_key = (key[2], key[1], key[0])
+        if key in seen:
+            # Direct duplicate found - merge entity types
+            existing_idx = seen[key]
+            existing_stmt = unique[existing_idx]
+            merged_stmt = existing_stmt.merge_entity_types_from(stmt)
+            unique[existing_idx] = merged_stmt
+        elif detect_reversals and reversed_key in seen:
+            # Reversed duplicate found - merge entity types (accounting for reversal)
+            existing_idx = seen[reversed_key]
+            existing_stmt = unique[existing_idx]
+            # Merge types from the reversed statement
+            merged_stmt = existing_stmt.merge_entity_types_from(stmt.reversed())
+            unique[existing_idx] = merged_stmt
+        else:
+            # New unique statement
+            seen[key] = len(unique)
             unique.append(stmt)
     return unique
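The key-and-reversed-key lookup in this hunk can be tried in isolation. The sketch below is an assumption-laden reduction: statements are plain `(subject, predicate, object)` tuples, lowercasing stands in for `Canonicalizer.create_dedup_key`, and entity type merging is left out. `dedup_exact` is a hypothetical name.

```python
def dedup_exact(triples, detect_reversals=True):
    """Keep the first occurrence of each triple, optionally folding in reversed ones."""
    seen = {}    # dedup key -> index in `unique`
    unique = []
    for subject, predicate, obj in triples:
        # Assumption: lowercasing stands in for the library's canonicalization.
        key = (subject.lower(), predicate.lower(), obj.lower())
        reversed_key = (key[2], key[1], key[0])
        if key in seen:
            continue  # direct duplicate
        if detect_reversals and reversed_key in seen:
            continue  # reversed duplicate: the first occurrence's orientation wins
        seen[key] = len(unique)
        unique.append((subject, predicate, obj))
    return unique


triples = [
    ("Apple", "acquired", "Beats"),
    ("Beats", "acquired", "Apple"),   # reversed duplicate of the first
    ("apple", "acquired", "beats"),   # direct duplicate after canonicalization
]
print(dedup_exact(triples))
print(dedup_exact(triples, detect_reversals=False))
```

Because the exact-match path has no source-text scoring, a reversed duplicate is dropped even when its orientation is the correct one. Callers who need orientation checked against the source text would use the embedding-based path instead.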

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/extractor.py RENAMED

@@ -143,7 +143,7 @@ class StatementExtractor:
         taxonomy = options.predicate_taxonomy or self._predicate_taxonomy
         config = options.predicate_config or self._predicate_config or PredicateComparisonConfig()
-        return PredicateComparer(taxonomy=taxonomy, config=config)
+        return PredicateComparer(taxonomy=taxonomy, config=config, device=self.device)
     @property
     def model(self) -> AutoModelForSeq2SeqLM:
@@ -362,18 +362,25 @@ class StatementExtractor:
         end_tag = "</statements>"
         candidates: list[str] = []
-        for output in outputs:
+        for i, output in enumerate(outputs):
             decoded = self.tokenizer.decode(output, skip_special_tokens=True)
+            output_len = len(output)
             # Truncate at </statements>
             if end_tag in decoded:
                 end_pos = decoded.find(end_tag) + len(end_tag)
                 decoded = decoded[:end_pos]
                 candidates.append(decoded)
+                logger.debug(f"Beam {i}: {output_len} tokens, found end tag, {len(decoded)} chars")
+            else:
+                # Log the issue - likely truncated
+                logger.warning(f"Beam {i}: {output_len} tokens, NO end tag found (truncated?)")
+                logger.warning(f"Beam {i} full output ({len(decoded)} chars):\n{decoded}")
         # Include fallback if no valid candidates
         if not candidates and len(outputs) > 0:
             fallback = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
+            logger.warning(f"Using fallback beam (no valid candidates found), {len(fallback)} chars")
             candidates.append(fallback)
         return candidates
@@ -467,8 +474,10 @@ class StatementExtractor:
         try:
             root = ET.fromstring(xml_output)
-        except ET.ParseError:
-            logger.warning("Failed to parse XML output")
+        except ET.ParseError as e:
+            # Log full output for debugging
+            logger.warning(f"Failed to parse XML output: {e}")
+            logger.warning(f"Full XML output ({len(xml_output)} chars):\n{xml_output}")
             return statements
         if root.tag != 'statements':
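The candidate selection that the new log lines instrument can be sketched as a pure function over already-decoded beam strings. `select_candidates` is a hypothetical name and the tokenizer step is omitted. The behavior follows the hunk above: truncate at `</statements>`, skip beams without the tag, and fall back to the first beam if none qualify.

```python
END_TAG = "</statements>"


def select_candidates(decoded_beams):
    """Return beams truncated at the end tag, or the first beam as a fallback."""
    candidates = []
    for text in decoded_beams:
        pos = text.find(END_TAG)
        if pos != -1:
            # Keep everything up to and including the closing tag.
            candidates.append(text[: pos + len(END_TAG)])
        # Beams without the tag are skipped; 0.2.3 logs them at WARNING level.
    if not candidates and decoded_beams:
        # No beam closed the document: fall back to the first beam unmodified.
        candidates.append(decoded_beams[0])
    return candidates


beams = ["<statements><stmt/></statements> trailing text", "<statements><stmt>"]
print(select_candidates(beams))
```

The fallback beam is by construction missing its end tag, so it is likely to hit the `ET.ParseError` branch further down, which 0.2.3 now logs with the full output.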

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/models.py RENAMED

@@ -32,6 +32,18 @@ class Entity(BaseModel):
     def __str__(self) -> str:
         return f"{self.text} ({self.type.value})"
+    def merge_type_from(self, other: "Entity") -> "Entity":
+        """
+        Return a new Entity with the more specific type.
+        If this entity has UNKNOWN type and other has a specific type,
+        returns a new entity with this text but other's type.
+        Otherwise returns self unchanged.
+        """
+        if self.type == EntityType.UNKNOWN and other.type != EntityType.UNKNOWN:
+            return Entity(text=self.text, type=other.type)
+        return self
 class Statement(BaseModel):
     """A single extracted statement (subject-predicate-object triple)."""
@@ -55,6 +67,10 @@ class Statement(BaseModel):
         None,
         description="Canonical form of the predicate if taxonomy matching was used"
     )
+    was_reversed: bool = Field(
+        default=False,
+        description="True if subject/object were swapped during reversal detection"
+    )
     def __str__(self) -> str:
         return f"{self.subject.text} -- {self.predicate} --> {self.object.text}"
@@ -63,6 +79,49 @@ class Statement(BaseModel):
         """Return as a simple (subject, predicate, object) tuple."""
         return (self.subject.text, self.predicate, self.object.text)
+    def merge_entity_types_from(self, other: "Statement") -> "Statement":
+        """
+        Return a new Statement with more specific entity types merged from other.
+        If this statement has UNKNOWN entity types and other has specific types,
+        the returned statement will use the specific types from other.
+        All other fields come from self.
+        """
+        merged_subject = self.subject.merge_type_from(other.subject)
+        merged_object = self.object.merge_type_from(other.object)
+        # Only create new statement if something changed
+        if merged_subject is self.subject and merged_object is self.object:
+            return self
+        return Statement(
+            subject=merged_subject,
+            object=merged_object,
+            predicate=self.predicate,
+            source_text=self.source_text,
+            confidence_score=self.confidence_score,
+            evidence_span=self.evidence_span,
+            canonical_predicate=self.canonical_predicate,
+            was_reversed=self.was_reversed,
+        )
+    def reversed(self) -> "Statement":
+        """
+        Return a new Statement with subject and object swapped.
+        Sets was_reversed=True to indicate the swap occurred.
+        """
+        return Statement(
+            subject=self.object,
+            object=self.subject,
+            predicate=self.predicate,
+            source_text=self.source_text,
+            confidence_score=self.confidence_score,
+            evidence_span=self.evidence_span,
+            canonical_predicate=self.canonical_predicate,
+            was_reversed=True,
+        )
 class ExtractionResult(BaseModel):
     """The result of statement extraction from text."""

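The merge and reversal helpers added to `models.py` can be exercised with a stdlib-only mirror. This sketch assumes frozen dataclasses in place of the Pydantic models, a reduced `EntityType` enum whose members and values are illustrative, and only the fields needed for the demonstration.

```python
from dataclasses import dataclass, replace
from enum import Enum


class EntityType(Enum):
    # Assumption: a reduced stand-in for the library's EntityType enum.
    ORG = "ORG"
    PERSON = "PERSON"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class Entity:
    text: str
    type: EntityType = EntityType.UNKNOWN

    def merge_type_from(self, other: "Entity") -> "Entity":
        """Take other's type only when ours is UNKNOWN and theirs is specific."""
        if self.type is EntityType.UNKNOWN and other.type is not EntityType.UNKNOWN:
            return replace(self, type=other.type)
        return self


@dataclass(frozen=True)
class Statement:
    subject: Entity
    predicate: str
    object: Entity
    was_reversed: bool = False

    def merge_entity_types_from(self, other: "Statement") -> "Statement":
        """Merge entity types position by position; other fields come from self."""
        return replace(
            self,
            subject=self.subject.merge_type_from(other.subject),
            object=self.object.merge_type_from(other.object),
        )

    def reversed(self) -> "Statement":
        """Swap subject and object and mark the statement as reversed."""
        return replace(self, subject=self.object, object=self.subject, was_reversed=True)


a = Statement(Entity("AtlasBio Labs"), "sued by", Entity("CuraPharm", EntityType.ORG))
b = Statement(Entity("AtlasBio Labs", EntityType.ORG), "sued by", Entity("CuraPharm", EntityType.ORG))
merged = a.merge_entity_types_from(b)
twice = a.reversed().reversed()
print(merged.subject.type, twice.was_reversed)
```

Here `merged.subject.type` becomes `EntityType.ORG`, matching the README example. Note that `reversed()` always sets the flag, so reversing twice restores the original orientation but leaves `was_reversed=True`. The same holds for the method as written in the diff.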
{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/predicate_comparer.py RENAMED

@@ -62,6 +62,7 @@ class PredicateComparer:
         self,
         taxonomy: Optional[PredicateTaxonomy] = None,
         config: Optional[PredicateComparisonConfig] = None,
+        device: Optional[str] = None,
     ):
         """
         Initialize the predicate comparer.
@@ -69,6 +70,7 @@ class PredicateComparer:
         Args:
             taxonomy: Optional canonical predicate taxonomy for normalization
             config: Comparison configuration (uses defaults if not provided)
+            device: Device to use ('cuda', 'cpu', or None for auto-detect)
         Raises:
             EmbeddingDependencyError: If sentence-transformers is not installed
@@ -78,6 +80,13 @@ class PredicateComparer:
         self.taxonomy = taxonomy
         self.config = config or PredicateComparisonConfig()
+        # Auto-detect device
+        if device is None:
+            import torch
+            self.device = "cuda" if torch.cuda.is_available() else "cpu"
+        else:
+            self.device = device
         # Lazy-loaded resources
         self._model = None
         self._taxonomy_embeddings: Optional[np.ndarray] = None
@@ -89,9 +98,9 @@ class PredicateComparer:
         from sentence_transformers import SentenceTransformer
-        logger.info(f"Loading embedding model: {self.config.embedding_model}")
-        self._model = SentenceTransformer(self.config.embedding_model)
-        logger.info("Embedding model loaded")
+        logger.info(f"Loading embedding model: {self.config.embedding_model} on {self.device}")
+        self._model = SentenceTransformer(self.config.embedding_model, device=self.device)
+        logger.info(f"Embedding model loaded on {self.device}")
     def _normalize_text(self, text: str) -> str:
         """Normalize text before embedding."""
@@ -256,21 +265,26 @@ class PredicateComparer:
         self,
         statements: list[Statement],
         entity_canonicalizer: Optional[callable] = None,
+        detect_reversals: bool = True,
     ) -> list[Statement]:
         """
         Remove duplicate statements using embedding-based predicate comparison.
         Two statements are considered duplicates if:
-        - Canonicalized subjects match
+        - Canonicalized subjects match AND canonicalized objects match, OR
+        - Canonicalized subjects match objects (reversed) when detect_reversals=True
         - Predicates are similar (embedding-based)
-        - Canonicalized objects match
         When duplicates are found, keeps the statement with better contextualized
         match (comparing "Subject Predicate Object" against source text).
+        For reversed duplicates, the correct orientation is determined by comparing
+        both "S P O" and "O P S" against source text.
         Args:
             statements: List of Statement objects
             entity_canonicalizer: Optional function to canonicalize entity text
+            detect_reversals: Whether to detect reversed duplicates (default True)
         Returns:
             Deduplicated list of statements (keeps best contextualized match)
@@ -293,6 +307,12 @@ class PredicateComparer:
         ]
         contextualized_embeddings = self._compute_embeddings(contextualized_texts)
+        # Compute reversed contextualized embeddings: "Object Predicate Subject"
+        reversed_texts = [
+            f"{s.object.text} {s.predicate} {s.subject.text}" for s in statements
+        ]
+        reversed_embeddings = self._compute_embeddings(reversed_texts)
         # Compute source text embeddings for scoring which duplicate to keep
         source_embeddings = []
         for stmt in statements:
@@ -302,6 +322,7 @@ class PredicateComparer:
         unique_statements: list[Statement] = []
         unique_pred_embeddings: list[np.ndarray] = []
         unique_context_embeddings: list[np.ndarray] = []
+        unique_reversed_embeddings: list[np.ndarray] = []
         unique_source_embeddings: list[np.ndarray] = []
         unique_indices: list[int] = []
@@ -310,13 +331,23 @@ class PredicateComparer:
             obj_canon = canonicalize(stmt.object.text)
             duplicate_idx = None
+            is_reversed_match = False
             for j, unique_stmt in enumerate(unique_statements):
                 unique_subj = canonicalize(unique_stmt.subject.text)
                 unique_obj = canonicalize(unique_stmt.object.text)
-                # Check subject and object match
-                if subj_canon != unique_subj or obj_canon != unique_obj:
+                # Check direct match: subject->subject, object->object
+                direct_match = (subj_canon == unique_subj and obj_canon == unique_obj)
+                # Check reversed match: subject->object, object->subject
+                reversed_match = (
+                    detect_reversals and
+                    subj_canon == unique_obj and
+                    obj_canon == unique_subj
+                )
+                if not direct_match and not reversed_match:
                     continue
                 # Check predicate similarity
@@ -326,6 +357,7 @@ class PredicateComparer:
                 )
                 if similarity >= self.config.dedup_threshold:
                     duplicate_idx = j
+                    is_reversed_match = reversed_match and not direct_match
                     break
             if duplicate_idx is None:
@@ -333,27 +365,91 @@ class PredicateComparer:
                 unique_statements.append(stmt)
                 unique_pred_embeddings.append(pred_embeddings[i])
                 unique_context_embeddings.append(contextualized_embeddings[i])
+                unique_reversed_embeddings.append(reversed_embeddings[i])
                 unique_source_embeddings.append(source_embeddings[i])
                 unique_indices.append(i)
             else:
-                # Duplicate found - keep the one with better contextualized match
-                # Compare "Subject Predicate Object" against source text
-                current_score = self._cosine_similarity(
-                    contextualized_embeddings[i],
-                    source_embeddings[i]
-                )
-                existing_score = self._cosine_similarity(
-                    unique_context_embeddings[duplicate_idx],
-                    unique_source_embeddings[duplicate_idx]
-                )
-                if current_score > existing_score:
-                    # Current statement is a better match - replace
-                    unique_statements[duplicate_idx] = stmt
-                    unique_pred_embeddings[duplicate_idx] = pred_embeddings[i]
-                    unique_context_embeddings[duplicate_idx] = contextualized_embeddings[i]
-                    unique_source_embeddings[duplicate_idx] = source_embeddings[i]
-                    unique_indices[duplicate_idx] = i
+                existing_stmt = unique_statements[duplicate_idx]
+                if is_reversed_match:
+                    # Reversed duplicate - determine correct orientation using source text
+                    # Compare current's normal vs reversed against its source
+                    current_normal_score = self._cosine_similarity(
+                        contextualized_embeddings[i], source_embeddings[i]
+                    )
+                    current_reversed_score = self._cosine_similarity(
+                        reversed_embeddings[i], source_embeddings[i]
+                    )
+                    # Compare existing's normal vs reversed against its source
+                    existing_normal_score = self._cosine_similarity(
+                        unique_context_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+                    existing_reversed_score = self._cosine_similarity(
+                        unique_reversed_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+                    # Determine best orientation for current
+                    current_best = max(current_normal_score, current_reversed_score)
+                    current_should_reverse = current_reversed_score > current_normal_score
+                    # Determine best orientation for existing
+                    existing_best = max(existing_normal_score, existing_reversed_score)
+                    existing_should_reverse = existing_reversed_score > existing_normal_score
+                    if current_best > existing_best:
+                        # Current is better - use it (possibly reversed)
+                        if current_should_reverse:
+                            best_stmt = stmt.reversed()
+                        else:
+                            best_stmt = stmt
+                        # Merge entity types from existing (accounting for reversal)
+                        if existing_should_reverse:
+                            best_stmt = best_stmt.merge_entity_types_from(existing_stmt.reversed())
+                        else:
+                            best_stmt = best_stmt.merge_entity_types_from(existing_stmt)
+                        unique_statements[duplicate_idx] = best_stmt
+                        unique_pred_embeddings[duplicate_idx] = pred_embeddings[i]
+                        unique_context_embeddings[duplicate_idx] = contextualized_embeddings[i]
+                        unique_reversed_embeddings[duplicate_idx] = reversed_embeddings[i]
+                        unique_source_embeddings[duplicate_idx] = source_embeddings[i]
+                        unique_indices[duplicate_idx] = i
+                    else:
+                        # Existing is better - possibly fix its orientation
+                        if existing_should_reverse and not existing_stmt.was_reversed:
+                            best_stmt = existing_stmt.reversed()
+                        else:
+                            best_stmt = existing_stmt
+                        # Merge entity types from current (accounting for reversal)
+                        if current_should_reverse:
+                            best_stmt = best_stmt.merge_entity_types_from(stmt.reversed())
+                        else:
+                            best_stmt = best_stmt.merge_entity_types_from(stmt)
+                        unique_statements[duplicate_idx] = best_stmt
+                else:
+                    # Direct duplicate - keep the one with better contextualized match
+                    current_score = self._cosine_similarity(
+                        contextualized_embeddings[i], source_embeddings[i]
+                    )
+                    existing_score = self._cosine_similarity(
+                        unique_context_embeddings[duplicate_idx],
+                        unique_source_embeddings[duplicate_idx]
+                    )
+                    if current_score > existing_score:
+                        # Current statement is a better match - replace
+                        merged_stmt = stmt.merge_entity_types_from(existing_stmt)
+                        unique_statements[duplicate_idx] = merged_stmt
+                        unique_pred_embeddings[duplicate_idx] = pred_embeddings[i]
+                        unique_context_embeddings[duplicate_idx] = contextualized_embeddings[i]
+                        unique_reversed_embeddings[duplicate_idx] = reversed_embeddings[i]
+                        unique_source_embeddings[duplicate_idx] = source_embeddings[i]
+                        unique_indices[duplicate_idx] = i
+                    else:
+                        # Existing statement is better - merge entity types from current
+                        merged_stmt = existing_stmt.merge_entity_types_from(stmt)
+                        unique_statements[duplicate_idx] = merged_stmt
         return unique_statements
@@ -437,3 +533,79 @@ class PredicateComparer:
                 similarity=best_score,
                 matched=False,
             )
+    def detect_and_fix_reversals(
+        self,
+        statements: list[Statement],
+        threshold: float = 0.05,
+    ) -> list[Statement]:
+        """
+        Detect and fix subject-object reversals using embedding comparison.
+        For each statement, compares:
+        - "Subject Predicate Object" embedding against source_text
+        - "Object Predicate Subject" embedding against source_text
+        If the reversed version has significantly higher similarity to the source,
+        the subject and object are swapped and was_reversed is set to True.
+        Args:
+            statements: List of Statement objects
+            threshold: Minimum similarity difference to trigger reversal (default 0.05)
+        Returns:
+            List of statements with reversals corrected
+        """
+        if not statements:
+            return statements
+        result = []
+        for stmt in statements:
+            # Skip if no source_text to compare against
+            if not stmt.source_text:
+                result.append(stmt)
+                continue
+            # Build normal and reversed triple strings
+            normal_text = f"{stmt.subject.text} {stmt.predicate} {stmt.object.text}"
+            reversed_text = f"{stmt.object.text} {stmt.predicate} {stmt.subject.text}"
+            # Compute embeddings for normal, reversed, and source
+            embeddings = self._compute_embeddings([normal_text, reversed_text, stmt.source_text])
+            normal_emb, reversed_emb, source_emb = embeddings[0], embeddings[1], embeddings[2]
+            # Compute similarities to source
+            normal_sim = self._cosine_similarity(normal_emb, source_emb)
+            reversed_sim = self._cosine_similarity(reversed_emb, source_emb)
+            # If reversed is significantly better, swap subject and object
+            if reversed_sim > normal_sim + threshold:
+                result.append(stmt.reversed())
+            else:
+                result.append(stmt)
+        return result
+    def check_reversal(self, statement: Statement) -> tuple[bool, float, float]:
+        """
+        Check if a single statement should be reversed.
+        Args:
+            statement: Statement to check
+        Returns:
+            Tuple of (should_reverse, normal_similarity, reversed_similarity)
+        """
+        if not statement.source_text:
+            return (False, 0.0, 0.0)
+        normal_text = f"{statement.subject.text} {statement.predicate} {statement.object.text}"
+        reversed_text = f"{statement.object.text} {statement.predicate} {statement.subject.text}"
+        embeddings = self._compute_embeddings([normal_text, reversed_text, statement.source_text])
+        normal_emb, reversed_emb, source_emb = embeddings[0], embeddings[1], embeddings[2]
+        normal_sim = self._cosine_similarity(normal_emb, source_emb)
+        reversed_sim = self._cosine_similarity(reversed_emb, source_emb)
+        return (reversed_sim > normal_sim, normal_sim, reversed_sim)
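The two new methods use different decision rules. `detect_and_fix_reversals` requires the reversed form to beat the normal form by `threshold` (default 0.05), while `check_reversal` uses a strict comparison with no margin. The sketch below isolates the two comparisons on precomputed similarity values; the function names are hypothetical.

```python
def swap_with_margin(normal_sim: float, reversed_sim: float, threshold: float = 0.05) -> bool:
    """Rule used by detect_and_fix_reversals: reversed must win by a margin."""
    return reversed_sim > normal_sim + threshold


def swap_strict(normal_sim: float, reversed_sim: float) -> bool:
    """Rule used by check_reversal: any improvement counts."""
    return reversed_sim > normal_sim


# A narrow win for the reversed form: the two rules disagree.
print(swap_with_margin(0.80, 0.83), swap_strict(0.80, 0.83))
# A clear win: both rules agree.
print(swap_with_margin(0.80, 0.90), swap_strict(0.80, 0.90))
```

As a consequence, `check_reversal` can report `should_reverse=True` for a statement that `detect_and_fix_reversals` leaves unchanged. Comparing the two returned similarities against the caller's own margin avoids relying on the boolean alone.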

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/.gitignore RENAMED

File without changes

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/__init__.py RENAMED

File without changes

{corp_extractor-0.2.2 → corp_extractor-0.2.3}/src/statement_extractor/scoring.py RENAMED

File without changes
