corp-extractor 0.3.0-py3-none-any.whl → 0.4.0-py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: corp-extractor
3
- Version: 0.3.0
3
+ Version: 0.4.0
4
4
  Summary: Extract structured statements from text using T5-Gemma 2 and Diverse Beam Search
5
5
  Project-URL: Homepage, https://github.com/corp-o-rate/statement-extractor
6
6
  Project-URL: Documentation, https://github.com/corp-o-rate/statement-extractor#readme
@@ -24,10 +24,10 @@ Classifier: Topic :: Scientific/Engineering :: Information Analysis
24
24
  Classifier: Topic :: Text Processing :: Linguistic
25
25
  Requires-Python: >=3.10
26
26
  Requires-Dist: click>=8.0.0
27
+ Requires-Dist: gliner2
27
28
  Requires-Dist: numpy>=1.24.0
28
29
  Requires-Dist: pydantic>=2.0.0
29
30
  Requires-Dist: sentence-transformers>=2.2.0
30
- Requires-Dist: spacy>=3.5.0
31
31
  Requires-Dist: torch>=2.0.0
32
32
  Requires-Dist: transformers>=5.0.0rc3
33
33
  Provides-Extra: dev
@@ -49,18 +49,19 @@ Extract structured subject-predicate-object statements from unstructured text us
49
49
 
50
50
  - **Structured Extraction**: Converts unstructured text into subject-predicate-object triples
51
51
  - **Entity Type Recognition**: Identifies 12 entity types (ORG, PERSON, GPE, LOC, PRODUCT, EVENT, etc.)
52
- - **Combined Quality Scoring** *(v0.3.0)*: Confidence combines semantic similarity (50%) + subject/object noun scores (25% each)
53
- - **spaCy-First Predicates** *(v0.3.0)*: Always uses spaCy for predicate extraction (model predicates are unreliable)
54
- - **Multi-Candidate Extraction** *(v0.3.0)*: Generates 3 candidates per statement (hybrid, spaCy-only, predicate-split)
55
- - **Best Triple Selection** *(v0.3.0)*: Keeps only highest-scoring triple per source (use `--all-triples` to keep all)
56
- - **Extraction Method Tracking** *(v0.3.0)*: Each statement includes `extraction_method` field (hybrid, spacy, split, model)
57
- - **Beam Merging** *(v0.2.0)*: Combines top beams for better coverage instead of picking one
58
- - **Embedding-based Dedup** *(v0.2.0)*: Uses semantic similarity to detect near-duplicate predicates
59
- - **Predicate Taxonomies** *(v0.2.0)*: Normalize predicates to canonical forms via embeddings
60
- - **Contextualized Matching** *(v0.2.2)*: Compares full "Subject Predicate Object" against source text for better accuracy
61
- - **Entity Type Merging** *(v0.2.3)*: Automatically merges UNKNOWN entity types with specific types during deduplication
62
- - **Reversal Detection** *(v0.2.3)*: Detects and corrects subject-object reversals using embedding comparison
63
- - **Command Line Interface** *(v0.2.4)*: Full-featured CLI for terminal usage
52
+ - **GLiNER2 Integration** *(v0.4.0)*: Uses GLiNER2 (205M params) for entity recognition and relation extraction
53
+ - **Predefined Predicates** *(v0.4.0)*: Optional `--predicates` list for GLiNER2 relation extraction mode
54
+ - **Entity-based Scoring** *(v0.4.0)*: Confidence combines semantic similarity (50%) + entity recognition scores (25% each)
55
+ - **Multi-Candidate Extraction**: Generates 3 candidates per statement (hybrid, GLiNER2-only, predicate-split)
56
+ - **Best Triple Selection**: Keeps only highest-scoring triple per source (use `--all-triples` to keep all)
57
+ - **Extraction Method Tracking**: Each statement includes `extraction_method` field (hybrid, gliner, split, model)
58
+ - **Beam Merging**: Combines top beams for better coverage instead of picking one
59
+ - **Embedding-based Dedup**: Uses semantic similarity to detect near-duplicate predicates
60
+ - **Predicate Taxonomies**: Normalize predicates to canonical forms via embeddings
61
+ - **Contextualized Matching**: Compares full "Subject Predicate Object" against source text for better accuracy
62
+ - **Entity Type Merging**: Automatically merges UNKNOWN entity types with specific types during deduplication
63
+ - **Reversal Detection**: Detects and corrects subject-object reversals using embedding comparison
64
+ - **Command Line Interface**: Full-featured CLI for terminal usage
64
65
  - **Multiple Output Formats**: Get results as Pydantic models, JSON, XML, or dictionaries
65
66
 
66
67
  ## Installation
@@ -69,7 +70,7 @@ Extract structured subject-predicate-object statements from unstructured text us
69
70
  pip install corp-extractor
70
71
  ```
71
72
 
72
- The spaCy model for predicate inference is downloaded automatically on first use.
73
+ The GLiNER2 model (205M params) is downloaded automatically on first use.
73
74
 
74
75
  **Note**: This package requires `transformers>=5.0.0` for T5-Gemma2 model support.
75
76
 
@@ -179,7 +180,8 @@ Options:
179
180
  --no-dedup Disable deduplication
180
181
  --no-embeddings Disable embedding-based dedup (faster)
181
182
  --no-merge Disable beam merging
182
- --no-spacy Disable spaCy extraction (use raw model output)
183
+ --no-gliner Disable GLiNER2 extraction (use raw model output)
184
+ --predicates TEXT Comma-separated predicate types for GLiNER2 relation extraction
183
185
  --all-triples Keep all candidate triples (default: best per source)
184
186
  --dedup-threshold FLOAT Deduplication threshold (default: 0.65)
185
187
  --min-confidence FLOAT Min confidence filter (default: 0)
@@ -283,20 +285,42 @@ for stmt in fixed_statements:
283
285
 
284
286
  During deduplication, reversed duplicates (e.g., "A -> P -> B" and "B -> P -> A") are now detected and merged, with the correct orientation determined by source text similarity.
285
287
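As a rough illustration of the idea (a toy sketch, not the package's actual implementation; `pick_orientation` and `toy_similarity` are hypothetical names), reversal correction can be thought of as reassembling both orientations and keeping the one closer to the source text:

```python
def pick_orientation(subject, predicate, obj, source_text, similarity):
    """Return the triple orientation whose reassembled sentence is more
    similar to the source text. `similarity` is any callable returning a
    score in [0, 1]; the library compares embeddings for this."""
    forward = f"{subject} {predicate} {obj}"
    backward = f"{obj} {predicate} {subject}"
    if similarity(backward, source_text) > similarity(forward, source_text):
        return (obj, predicate, subject)
    return (subject, predicate, obj)

def toy_similarity(a, b):
    # Order-sensitive word-bigram overlap; a crude stand-in for embeddings.
    bigrams = lambda s: set(zip(s.lower().split(), s.lower().split()[1:]))
    inter, union = bigrams(a) & bigrams(b), bigrams(a) | bigrams(b)
    return len(inter) / len(union) if union else 0.0
```

With a reversed input such as `("GitHub", "acquired", "Microsoft")` against the source "Microsoft acquired GitHub in 2018", the backward reading scores higher and the corrected orientation is returned.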
 
286
- ## New in v0.3.0: spaCy-First Extraction & Semantic Scoring
288
+ ## New in v0.4.0: GLiNER2 Integration
287
289
 
288
- v0.3.0 introduces significant improvements to extraction quality:
290
+ v0.4.0 replaces spaCy with **GLiNER2** (205M params) for entity recognition and relation extraction. GLiNER2 is a unified model that handles NER, text classification, structured data extraction, and relation extraction with CPU-optimized inference.
289
291
 
290
- ### spaCy-First Predicate Extraction
292
+ ### Why GLiNER2?
291
293
 
292
- The T5-Gemma model is excellent at:
294
+ The T5-Gemma model excels at:
293
295
  - **Triple isolation** - identifying that a relationship exists
294
296
  - **Coreference resolution** - resolving pronouns to named entities
295
297
 
296
- But unreliable at:
297
- - **Predicate extraction** - often returns empty or wrong predicates
298
+ GLiNER2 now handles:
299
+ - **Entity recognition** - refining subject/object boundaries
300
+ - **Relation extraction** - when predefined predicates are provided
301
+ - **Entity scoring** - scoring how "entity-like" subjects/objects are
298
302
 
299
- **Solution:** v0.3.0 always uses spaCy for predicate extraction. The model provides subject, object, entity types, and source text; spaCy provides the predicate.
303
+ ### Two Extraction Modes
304
+
305
+ **Mode 1: With Predicate List** (GLiNER2 relation extraction)
306
+ ```python
307
+ from statement_extractor import extract_statements, ExtractionOptions
308
+
309
+ options = ExtractionOptions(predicates=["works_for", "founded", "acquired", "headquartered_in"])
310
+ result = extract_statements("John works for Apple Inc. in Cupertino.", options)
311
+ ```
312
+
313
+ Or via CLI:
314
+ ```bash
315
+ corp-extractor "John works for Apple Inc." --predicates "works_for,founded,acquired"
316
+ ```
317
+
318
+ **Mode 2: Without Predicate List** (entity-refined extraction)
319
+ ```python
320
+ result = extract_statements("Apple announced a new iPhone.")
321
+ # Uses GLiNER2 for entity extraction to refine boundaries
322
+ # Extracts predicate from source text using T5-Gemma's hint
323
+ ```
300
324
 
301
325
  ### Three Candidate Extraction Methods
302
326
 
@@ -304,48 +328,46 @@ For each statement, three candidates are generated and the best is selected:
304
328
 
305
329
  | Method | Description |
306
330
  |--------|-------------|
307
- | `hybrid` | Model subject/object + spaCy predicate |
308
- | `spacy` | All components from spaCy dependency parsing |
331
+ | `hybrid` | Model subject/object + GLiNER2 or source-extracted predicate |
332
+ | `gliner` | All components refined by GLiNER2 |
309
333
  | `split` | Source text split around the predicate |
310
334
 
311
335
  ```python
312
336
  for stmt in result:
313
337
  print(f"{stmt.subject.text} --[{stmt.predicate}]--> {stmt.object.text}")
314
- print(f" Method: {stmt.extraction_method}") # hybrid, spacy, split, or model
338
+ print(f" Method: {stmt.extraction_method}") # hybrid, gliner, split, or model
315
339
  print(f" Confidence: {stmt.confidence_score:.2f}")
316
340
  ```
317
341
 
318
342
  ### Combined Quality Scoring
319
343
 
320
- Confidence scores combine **semantic similarity** and **grammatical accuracy**:
344
+ Confidence scores combine **semantic similarity** and **entity recognition**:
321
345
 
322
346
  | Component | Weight | Description |
323
347
  |-----------|--------|-------------|
324
348
  | Semantic similarity | 50% | Cosine similarity between source text and reassembled triple |
325
- | Subject noun score | 25% | How noun-like the subject is |
326
- | Object noun score | 25% | How noun-like the object is |
327
-
328
- **Noun scoring:**
329
- - Proper noun(s) only: 1.0
330
- - Common noun(s) only: 0.8
331
- - Contains noun + other words: 0.4-0.8 (based on ratio)
332
- - No nouns: 0.2
349
+ | Subject entity score | 25% | How entity-like the subject is (via GLiNER2) |
350
+ | Object entity score | 25% | How entity-like the object is (via GLiNER2) |
333
351
 
334
- This ensures extracted subjects and objects are grammatically valid entities, not fragments or verb phrases.
352
+ **Entity scoring (via GLiNER2):**
353
+ - Recognized entity with high confidence: 1.0
354
+ - Recognized entity with moderate confidence: 0.8
355
+ - Partially recognized: 0.6
356
+ - Not recognized: 0.2
335
357
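Putting the weights together, the combined score reduces to a simple weighted sum. A minimal sketch, assuming the three component scores have already been computed (`combined_confidence` is an illustrative name, not part of the package's API):

```python
def combined_confidence(semantic_sim, subject_entity_score, object_entity_score):
    """Weighted combination: 50% semantic similarity,
    25% each for the subject and object entity scores."""
    return (0.5 * semantic_sim
            + 0.25 * subject_entity_score
            + 0.25 * object_entity_score)

# e.g. strong similarity, fully recognized subject, unrecognized object:
score = combined_confidence(0.9, 1.0, 0.2)  # 0.75
```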
 
336
358
  ### Extraction Method Tracking
337
359
 
338
- Each statement now includes an `extraction_method` field:
339
- - `hybrid` - Model subject/object + spaCy predicate
340
- - `spacy` - All components from spaCy dependency parsing
360
+ Each statement includes an `extraction_method` field:
361
+ - `hybrid` - Model subject/object + GLiNER2 predicate
362
+ - `gliner` - All components refined by GLiNER2
341
363
  - `split` - Subject/object from splitting source text around predicate
342
- - `model` - All components from T5-Gemma model (only when `--no-spacy`)
364
+ - `model` - All components from T5-Gemma model (only when `--no-gliner`)
343
365
 
344
366
  ### Best Triple Selection
345
367
 
346
- By default, only the **highest-scoring triple** is kept for each source sentence. This ensures clean output without redundant candidates.
368
+ By default, only the **highest-scoring triple** is kept for each source sentence.
347
369
 
348
- To keep all candidate triples (for debugging or analysis):
370
+ To keep all candidate triples:
349
371
  ```python
350
372
  options = ExtractionOptions(all_triples=True)
351
373
  result = extract_statements(text, options)
@@ -356,15 +378,15 @@ Or via CLI:
356
378
  corp-extractor "Your text" --all-triples --verbose
357
379
  ```
358
380
 
359
- **Disable spaCy extraction** to use only model output:
381
+ **Disable GLiNER2 extraction** to use only model output:
360
382
  ```python
361
- options = ExtractionOptions(use_spacy_extraction=False)
383
+ options = ExtractionOptions(use_gliner_extraction=False)
362
384
  result = extract_statements(text, options)
363
385
  ```
364
386
 
365
387
  Or via CLI:
366
388
  ```bash
367
- corp-extractor "Your text" --no-spacy
389
+ corp-extractor "Your text" --no-gliner
368
390
  ```
369
391
 
370
392
  ## Disable Embeddings
@@ -436,14 +458,14 @@ for text in texts:
436
458
  This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam Search** ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424)):
437
459
 
438
460
  1. **Diverse Beam Search**: Generates 4+ candidate outputs using beam groups with diversity penalty
439
- 2. **Quality Scoring** *(v0.2.0)*: Each triple scored for groundedness in source text
440
- 3. **Beam Merging** *(v0.2.0)*: Top beams combined for better coverage
441
- 4. **Embedding Dedup** *(v0.2.0)*: Semantic similarity removes near-duplicate predicates
442
- 5. **Predicate Normalization** *(v0.2.0)*: Optional taxonomy matching via embeddings
443
- 6. **Contextualized Matching** *(v0.2.2)*: Full statement context used for canonicalization and dedup
444
- 7. **Entity Type Merging** *(v0.2.3)*: UNKNOWN types merged with specific types during dedup
445
- 8. **Reversal Detection** *(v0.2.3)*: Subject-object reversals detected and corrected via embedding comparison
446
- 9. **Hybrid spaCy** *(v0.2.12)*: spaCy candidates added to pool alongside model output for better coverage
461
+ 2. **Quality Scoring**: Each triple scored for groundedness in source text
462
+ 3. **Beam Merging**: Top beams combined for better coverage
463
+ 4. **Embedding Dedup**: Semantic similarity removes near-duplicate predicates
464
+ 5. **Predicate Normalization**: Optional taxonomy matching via embeddings
465
+ 6. **Contextualized Matching**: Full statement context used for canonicalization and dedup
466
+ 7. **Entity Type Merging**: UNKNOWN types merged with specific types during dedup
467
+ 8. **Reversal Detection**: Subject-object reversals detected and corrected via embedding comparison
468
+ 9. **GLiNER2 Extraction** *(v0.4.0)*: Entity recognition and relation extraction for improved accuracy
447
469
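For step 4, embedding dedup can be sketched as a greedy pass under the default 0.65 similarity threshold (a simplified illustration; `dedup_by_embedding` is a hypothetical name, and the real implementation works on sentence-transformer vectors over contextualized statements):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup_by_embedding(items, embed, threshold=0.65):
    """Greedy dedup: keep an item only if it is not within `threshold`
    cosine similarity of an already-kept item. Assumes `items` is sorted
    best-first so the highest-scoring duplicate survives."""
    kept, kept_vecs = [], []
    for item in items:
        v = embed(item)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(v)
    return kept
```

With toy vectors where "acquired" and "bought" point in nearly the same direction, the near-duplicate is dropped while the unrelated predicate survives.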
 
448
470
  ## Requirements
449
471
 
@@ -452,7 +474,7 @@ This library uses the T5-Gemma 2 statement extraction model with **Diverse Beam
452
474
  - Transformers 5.0+
453
475
  - Pydantic 2.0+
454
476
  - sentence-transformers 2.2+
455
- - spaCy 3.5+ (model downloaded automatically on first use)
477
+ - GLiNER2 (model downloaded automatically on first use)
456
478
  - ~2GB VRAM (GPU) or ~4GB RAM (CPU)
457
479
 
458
480
  ## Links
@@ -0,0 +1,12 @@
1
+ statement_extractor/__init__.py,sha256=KwZfWnTB9oevTLw0TrNlYFu67qIYO-34JqDtcpjOhZI,3013
2
+ statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
3
+ statement_extractor/cli.py,sha256=FOkkihVfoROc-Biu8ICCzlLJeDScYYNLLJHnv0GCGGM,9507
4
+ statement_extractor/extractor.py,sha256=d0HnCeCPybw-4jDxH_ffZ4LY9Klvqnza_wa90Bd4Q18,40074
5
+ statement_extractor/gliner_extraction.py,sha256=KNs3n5-fnoUwY1wvbPwZL8j-3YVstmioJlcjp2k1FmY,10491
6
+ statement_extractor/models.py,sha256=cyCQc3vlYB3qlg6-uL5Vt4odIiulKtHzz1Cyrf0lEAU,12198
7
+ statement_extractor/predicate_comparer.py,sha256=jcuaBi5BYqD3TKoyj3pR9dxtX5ihfDJvjdhEd2LHCwc,26184
8
+ statement_extractor/scoring.py,sha256=s_8nhavBNzPPFmGf2FyBummH4tgP7YGpXoMhl2Jh3Xw,16650
9
+ corp_extractor-0.4.0.dist-info/METADATA,sha256=8f2CDtZG757kaB6XMfbBVdNSRMyS5-4Lflc_LoZCC_8,17725
10
+ corp_extractor-0.4.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
11
+ corp_extractor-0.4.0.dist-info/entry_points.txt,sha256=i0iKFqPIusvb-QTQ1zNnFgAqatgVah-jIhahbs5TToQ,115
12
+ corp_extractor-0.4.0.dist-info/RECORD,,
@@ -34,7 +34,7 @@ def _configure_logging(verbose: bool) -> None:
34
34
  "statement_extractor.scoring",
35
35
  "statement_extractor.predicate_comparer",
36
36
  "statement_extractor.canonicalization",
37
- "statement_extractor.spacy_extraction",
37
+ "statement_extractor.gliner_extraction",
38
38
  ]:
39
39
  logging.getLogger(logger_name).setLevel(level)
40
40
 
@@ -66,7 +66,8 @@ from .models import (
66
66
  @click.option("--no-dedup", is_flag=True, help="Disable deduplication")
67
67
  @click.option("--no-embeddings", is_flag=True, help="Disable embedding-based deduplication (faster)")
68
68
  @click.option("--no-merge", is_flag=True, help="Disable beam merging (select single best beam)")
69
- @click.option("--no-spacy", is_flag=True, help="Disable spaCy extraction (use raw model output)")
69
+ @click.option("--no-gliner", is_flag=True, help="Disable GLiNER2 extraction (use raw model output)")
70
+ @click.option("--predicates", type=str, help="Comma-separated list of predicate types for GLiNER2 relation extraction")
70
71
  @click.option("--all-triples", is_flag=True, help="Keep all candidate triples instead of selecting best per source")
71
72
  @click.option("--dedup-threshold", type=float, default=0.65, help="Similarity threshold for deduplication (default: 0.65)")
72
73
  # Quality options
@@ -92,7 +93,8 @@ def main(
92
93
  no_dedup: bool,
93
94
  no_embeddings: bool,
94
95
  no_merge: bool,
95
- no_spacy: bool,
96
+ no_gliner: bool,
97
+ predicates: Optional[str],
96
98
  all_triples: bool,
97
99
  dedup_threshold: float,
98
100
  min_confidence: float,
@@ -157,6 +159,13 @@ def main(
157
159
  # Configure scoring
158
160
  scoring_config = ScoringConfig(min_confidence=min_confidence)
159
161
 
162
+ # Parse predicates if provided
163
+ predicate_list = None
164
+ if predicates:
165
+ predicate_list = [p.strip() for p in predicates.split(",") if p.strip()]
166
+ if not quiet:
167
+ click.echo(f"Using predicate list: {predicate_list}", err=True)
168
+
160
169
  # Configure extraction options
161
170
  options = ExtractionOptions(
162
171
  num_beams=beams,
@@ -165,7 +174,8 @@ def main(
165
174
  deduplicate=not no_dedup,
166
175
  embedding_dedup=not no_embeddings,
167
176
  merge_beams=not no_merge,
168
- use_spacy_extraction=not no_spacy,
177
+ use_gliner_extraction=not no_gliner,
178
+ predicates=predicate_list,
169
179
  all_triples=all_triples,
170
180
  predicate_taxonomy=predicate_taxonomy,
171
181
  predicate_config=predicate_config,
@@ -721,16 +721,17 @@ class StatementExtractor:
721
721
  Parse XML output into Statement objects.
722
722
 
723
723
  Uses model for subject, object, entity types, and source_text.
724
- Always uses spaCy for predicate extraction (model predicates are unreliable).
724
+ Uses GLiNER2 to refine predicate extraction; the model predicate is passed as a hint/fallback.
725
725
 
726
726
  Produces three candidates for each statement:
727
- 1. Hybrid: model subject/object + spaCy predicate
728
- 2. spaCy-only: all components from spaCy
727
+ 1. Hybrid: model subject/object + GLiNER2 predicate
728
+ 2. GLiNER2-only: all components from GLiNER2
+ 3. Predicate-split: source text split around the predicate
729
729
 
730
730
  Both go into the candidate pool; scoring/dedup picks the best.
731
731
  """
732
732
  statements: list[Statement] = []
733
- use_spacy_extraction = options.use_spacy_extraction if options else True
733
+ use_gliner_extraction = options.use_gliner_extraction if options else True
734
+ predicates = options.predicates if options else None
734
735
 
735
736
  try:
736
737
  root = ET.fromstring(xml_output)
@@ -780,57 +781,63 @@ class StatementExtractor:
780
781
  logger.debug(f"Skipping statement: missing subject or object from model")
781
782
  continue
782
783
 
783
- if use_spacy_extraction and source_text:
784
+ if use_gliner_extraction and source_text:
784
785
  try:
785
- from .spacy_extraction import extract_triple_from_text, extract_triple_by_predicate_split
786
- spacy_result = extract_triple_from_text(
786
+ from .gliner_extraction import extract_triple_from_text, extract_triple_by_predicate_split
787
+
788
+ # Get model predicate for fallback/refinement
789
+ predicate_elem = stmt_elem.find('predicate')
790
+ model_predicate = predicate_elem.text.strip() if predicate_elem is not None and predicate_elem.text else ""
791
+
792
+ gliner_result = extract_triple_from_text(
787
793
  source_text=source_text,
788
794
  model_subject=subject_text,
789
795
  model_object=object_text,
790
- model_predicate="", # Don't pass model predicate
796
+ model_predicate=model_predicate,
797
+ predicates=predicates,
791
798
  )
792
- if spacy_result:
793
- spacy_subj, spacy_pred, spacy_obj = spacy_result
799
+ if gliner_result:
800
+ gliner_subj, gliner_pred, gliner_obj = gliner_result
794
801
 
795
- if spacy_pred:
796
- # Candidate 1: Hybrid (model subject/object + spaCy predicate)
802
+ if gliner_pred:
803
+ # Candidate 1: Hybrid (model subject/object + GLiNER2 predicate)
797
804
  logger.debug(
798
- f"Adding hybrid candidate: '{subject_text}' --[{spacy_pred}]--> '{object_text}'"
805
+ f"Adding hybrid candidate: '{subject_text}' --[{gliner_pred}]--> '{object_text}'"
799
806
  )
800
807
  statements.append(Statement(
801
808
  subject=Entity(text=subject_text, type=subject_type),
802
- predicate=spacy_pred,
809
+ predicate=gliner_pred,
803
810
  object=Entity(text=object_text, type=object_type),
804
811
  source_text=source_text,
805
812
  extraction_method=ExtractionMethod.HYBRID,
806
813
  ))
807
814
 
808
- # Candidate 2: spaCy-only (if different from hybrid)
809
- if spacy_subj and spacy_obj:
810
- is_different = (spacy_subj != subject_text or spacy_obj != object_text)
815
+ # Candidate 2: GLiNER2-only (if different from hybrid)
816
+ if gliner_subj and gliner_obj:
817
+ is_different = (gliner_subj != subject_text or gliner_obj != object_text)
811
818
  if is_different:
812
819
  logger.debug(
813
- f"Adding spaCy-only candidate: '{spacy_subj}' --[{spacy_pred}]--> '{spacy_obj}'"
820
+ f"Adding GLiNER2-only candidate: '{gliner_subj}' --[{gliner_pred}]--> '{gliner_obj}'"
814
821
  )
815
822
  statements.append(Statement(
816
- subject=Entity(text=spacy_subj, type=subject_type),
817
- predicate=spacy_pred,
818
- object=Entity(text=spacy_obj, type=object_type),
823
+ subject=Entity(text=gliner_subj, type=subject_type),
824
+ predicate=gliner_pred,
825
+ object=Entity(text=gliner_obj, type=object_type),
819
826
  source_text=source_text,
820
- extraction_method=ExtractionMethod.SPACY,
827
+ extraction_method=ExtractionMethod.GLINER,
821
828
  ))
822
829
 
823
830
  # Candidate 3: Predicate-split (split source text around predicate)
824
831
  split_result = extract_triple_by_predicate_split(
825
832
  source_text=source_text,
826
- predicate=spacy_pred,
833
+ predicate=gliner_pred,
827
834
  )
828
835
  if split_result:
829
836
  split_subj, split_pred, split_obj = split_result
830
837
  # Only add if different from previous candidates
831
838
  is_different_from_hybrid = (split_subj != subject_text or split_obj != object_text)
832
- is_different_from_spacy = (split_subj != spacy_subj or split_obj != spacy_obj)
833
- if is_different_from_hybrid and is_different_from_spacy:
839
+ is_different_from_gliner = (split_subj != gliner_subj or split_obj != gliner_obj)
840
+ if is_different_from_hybrid and is_different_from_gliner:
834
841
  logger.debug(
835
842
  f"Adding predicate-split candidate: '{split_subj}' --[{split_pred}]--> '{split_obj}'"
836
843
  )
@@ -843,12 +850,12 @@ class StatementExtractor:
843
850
  ))
844
851
  else:
845
852
  logger.debug(
846
- f"spaCy found no predicate for: '{subject_text}' --> '{object_text}'"
853
+ f"GLiNER2 found no predicate for: '{subject_text}' --> '{object_text}'"
847
854
  )
848
855
  except Exception as e:
849
- logger.debug(f"spaCy extraction failed: {e}")
856
+ logger.debug(f"GLiNER2 extraction failed: {e}")
850
857
  else:
851
- # spaCy disabled - fall back to model predicate
858
+ # GLiNER2 disabled - fall back to model predicate
852
859
  predicate_elem = stmt_elem.find('predicate')
853
860
  model_predicate = predicate_elem.text.strip() if predicate_elem is not None and predicate_elem.text else ""
854
861
 
@@ -0,0 +1,288 @@
1
+ """
2
+ GLiNER2-based triple extraction.
3
+
4
+ Uses GLiNER2 for relation extraction and entity recognition to extract
5
+ subject, predicate, and object from source text. T5-Gemma model provides
6
+ triple structure and coreference resolution, while GLiNER2 handles
7
+ linguistic analysis.
8
+
9
+ The GLiNER2 model is loaded automatically on first use.
10
+ """
11
+
12
+ import logging
13
+ from typing import Optional
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+ # Lazy-loaded GLiNER2 model
18
+ _model = None
19
+
20
+
21
+ def _get_model():
22
+ """
23
+ Lazy-load the GLiNER2 model.
24
+
25
+ Uses the base model (205M parameters) which is CPU-optimized.
26
+ """
27
+ global _model
28
+ if _model is None:
29
+ from gliner2 import GLiNER2
30
+
31
+ logger.info("Loading GLiNER2 model 'fastino/gliner2-base-v1'...")
32
+ _model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
33
+ logger.debug("GLiNER2 model loaded")
34
+ return _model
35
+
36
+
37
+ def extract_triple_from_text(
38
+ source_text: str,
39
+ model_subject: str,
40
+ model_object: str,
41
+ model_predicate: str,
42
+ predicates: Optional[list[str]] = None,
43
+ ) -> tuple[str, str, str] | None:
44
+ """
45
+ Extract subject, predicate, object from source text using GLiNER2.
46
+
47
+ Returns a GLiNER2-based triple that can be added to the candidate pool
48
+ alongside the model's triple. The existing scoring/dedup logic will
49
+ pick the best one.
50
+
51
+ Args:
52
+ source_text: The source sentence to analyze
53
+ model_subject: Subject from T5-Gemma (used for matching and fallback)
54
+ model_object: Object from T5-Gemma (used for matching and fallback)
55
+ model_predicate: Predicate from T5-Gemma (used when no predicates provided)
56
+ predicates: Optional list of predefined relation types to extract
57
+
58
+ Returns:
59
+ Tuple of (subject, predicate, object) from GLiNER2, or None if extraction fails
60
+ """
61
+ if not source_text:
62
+ return None
63
+
64
+ try:
65
+ model = _get_model()
66
+
67
+ if predicates:
68
+ # Use relation extraction with predefined predicates
69
+ result = model.extract_relations(source_text, predicates)
70
+
71
+ # Find best matching relation
72
+ relation_data = result.get("relation_extraction", {})
73
+ best_match = None
74
+ best_confidence = 0.0
75
+
76
+ for rel_type, relations in relation_data.items():
77
+ for rel in relations:
78
+ # Handle both tuple format and dict format
79
+ if isinstance(rel, tuple):
80
+ head, tail = rel
81
+ confidence = 1.0
82
+ else:
83
+ head = rel.get("head", {}).get("text", "")
84
+ tail = rel.get("tail", {}).get("text", "")
85
+ confidence = min(
86
+ rel.get("head", {}).get("confidence", 0.5),
87
+ rel.get("tail", {}).get("confidence", 0.5)
88
+ )
89
+
90
+ # Score based on match with model hints
91
+ score = confidence
92
+ if model_subject.lower() in head.lower() or head.lower() in model_subject.lower():
93
+ score += 0.2
94
+ if model_object.lower() in tail.lower() or tail.lower() in model_object.lower():
95
+ score += 0.2
96
+
97
+ if score > best_confidence:
98
+ best_confidence = score
99
+ best_match = (head, rel_type, tail)
100
+
101
+ if best_match:
102
+ logger.debug(
103
+ f"GLiNER2 extracted (relation): subj='{best_match[0]}', pred='{best_match[1]}', obj='{best_match[2]}'"
104
+ )
105
+ return best_match
106
+
107
+ else:
108
+ # No predicate list provided - use GLiNER2 for entity extraction
109
+ # and extract predicate from source text using the model's hint
110
+
111
+ # Extract entities to refine subject/object boundaries
112
+ entity_types = [
113
+ "person", "organization", "company", "location", "city", "country",
114
+ "product", "event", "date", "money", "quantity"
115
+ ]
116
+ result = model.extract_entities(source_text, entity_types)
117
+ entities = result.get("entities", {})
118
+
119
+ # Find entities that match model subject/object
120
+ refined_subject = model_subject
121
+ refined_object = model_object
122
+
123
+ for entity_type, entity_list in entities.items():
124
+ for entity in entity_list:
125
+ entity_lower = entity.lower()
126
+ # Check if this entity matches or contains the model's subject/object
127
+ if model_subject.lower() in entity_lower or entity_lower in model_subject.lower():
128
+ # Use the entity text if it's more complete
129
+ if len(entity) >= len(refined_subject):
130
+ refined_subject = entity
131
+ if model_object.lower() in entity_lower or entity_lower in model_object.lower():
132
+ if len(entity) >= len(refined_object):
133
+ refined_object = entity
134
+
135
+ # Extract predicate from source text using predicate split
136
+ predicate_result = extract_triple_by_predicate_split(source_text, model_predicate)
137
+ if predicate_result:
138
+ _, extracted_predicate, _ = predicate_result
139
+ else:
140
+ extracted_predicate = model_predicate
141
+
142
+ if extracted_predicate:
143
+ logger.debug(
144
+ f"GLiNER2 extracted (entity-refined): subj='{refined_subject}', pred='{extracted_predicate}', obj='{refined_object}'"
145
+ )
146
+ return (refined_subject, extracted_predicate, refined_object)
147
+
148
+ return None
149
+
150
+ except ImportError as e:
151
+ logger.warning(f"GLiNER2 not installed: {e}")
152
+ return None
153
+ except Exception as e:
154
+ logger.debug(f"GLiNER2 extraction failed: {e}")
155
+ return None
156
+
157
+
158
+ def extract_triple_by_predicate_split(
159
+ source_text: str,
160
+ predicate: str,
161
+ ) -> tuple[str, str, str] | None:
162
+ """
163
+ Extract subject and object by splitting the source text around the predicate.
164
+
165
+ This is useful when the predicate is known but subject/object boundaries
166
+ are uncertain. Uses the predicate as an anchor point.
167
+
168
+ Args:
169
+ source_text: The source sentence
170
+ predicate: The predicate (verb phrase) to split on
171
+
172
+ Returns:
173
+ Tuple of (subject, predicate, object) or None if split fails
174
+ """
175
+ if not source_text or not predicate:
176
+ return None
177
+
178
+ # Find the predicate in the source text (case-insensitive)
179
+ source_lower = source_text.lower()
180
+ pred_lower = predicate.lower()
181
+
182
+ pred_pos = source_lower.find(pred_lower)
183
+ if pred_pos < 0:
184
+ # Try finding just the main verb (first word of predicate)
185
+ main_verb = pred_lower.split()[0] if pred_lower.split() else ""
186
+ if main_verb and len(main_verb) > 2:
187
+ pred_pos = source_lower.find(main_verb)
188
+ if pred_pos >= 0:
189
+ # Adjust to use the actual predicate length for splitting
190
+ predicate = main_verb
191
+
192
+ if pred_pos < 0:
193
+ return None
194
+
195
+ # Extract subject (text before predicate, trimmed)
196
+ subject = source_text[:pred_pos].strip()
197
+
198
+ # Extract object (text after predicate, trimmed)
199
+ pred_end = pred_pos + len(predicate)
200
+ obj = source_text[pred_end:].strip()
201
+
202
+ # Clean up: remove trailing punctuation from object
203
+ obj = obj.rstrip('.,;:!?')
204
+
205
+ # Clean up: remove leading articles/prepositions from object if very short
206
+ obj_words = obj.split()
207
+ if obj_words and obj_words[0].lower() in ('a', 'an', 'the', 'to', 'of', 'for'):
208
+ if len(obj_words) > 1:
209
+ obj = ' '.join(obj_words[1:])
210
+
211
+ # Validate: both subject and object should have meaningful content
212
+ if len(subject) < 2 or len(obj) < 2:
213
+ return None
214
+
215
+ logger.debug(
216
+ f"Predicate-split extracted: subj='{subject}', pred='{predicate}', obj='{obj}'"
217
+ )
218
+
219
+ return (subject, predicate, obj)
220
+
221
+
222
+ def score_entity_content(text: str) -> float:
223
+ """
224
+ Score how entity-like a text is using GLiNER2 entity recognition.
225
+
226
+ Returns:
227
+ 1.0 - Recognized as a named entity with high confidence
228
+ 0.8 - Recognized as an entity with moderate confidence
229
+ 0.6 - Partially recognized or contains entity-like content
230
+ 0.2 - Not recognized as any entity type
231
+ """
232
+ if not text or not text.strip():
233
+ return 0.2
234
+
235
+ try:
236
+ model = _get_model()
237
+
238
+ # Check if text is recognized as common entity types
239
+ entity_types = [
240
+ "person", "organization", "company", "location", "city", "country",
241
+ "product", "event", "date", "money", "quantity"
242
+ ]
243
+
244
+ result = model.extract_entities(
245
+ text,
246
+ entity_types,
247
+ include_confidence=True
248
+ )
249
+
250
+ # Result format: {'entities': {'person': [{'text': '...', 'confidence': 0.99}], ...}}
251
+ entities_dict = result.get("entities", {})
252
+
253
+ # Find best matching entity across all types
254
+ best_confidence = 0.0
255
+ text_lower = text.lower().strip()
256
+
257
+ for entity_type, entity_list in entities_dict.items():
258
+ for entity in entity_list:
259
+ if isinstance(entity, dict):
260
+ entity_text = entity.get("text", "").lower().strip()
261
+ confidence = entity.get("confidence", 0.5)
262
+ else:
263
+ # Fallback for string format
264
+ entity_text = str(entity).lower().strip()
265
+ confidence = 0.8
266
+
267
+ # Check if entity covers most of the input text
268
+ if entity_text == text_lower:
269
+ # Exact match
270
+ best_confidence = max(best_confidence, confidence)
271
+ elif entity_text in text_lower or text_lower in entity_text:
272
+ # Partial match - reduce confidence
273
+ best_confidence = max(best_confidence, confidence * 0.8)
274
+
275
+ if best_confidence >= 0.9:
276
+ return 1.0
277
+ elif best_confidence >= 0.7:
278
+ return 0.8
279
+ elif best_confidence >= 0.5:
280
+ return 0.6
281
+ elif best_confidence > 0:
282
+ return 0.4
283
+ else:
284
+ return 0.2
285
+
286
+ except Exception as e:
287
+ logger.debug(f"Entity scoring failed for '{text}': {e}")
288
+ return 0.5 # Neutral score on error
@@ -26,10 +26,10 @@ class EntityType(str, Enum):
26
26
 
27
27
  class ExtractionMethod(str, Enum):
28
28
  """Method used to extract the triple components."""
29
- HYBRID = "hybrid" # Model subject/object + spaCy predicate
30
- SPACY = "spacy" # All components from spaCy dependency parsing
29
+ HYBRID = "hybrid" # Model subject/object + GLiNER2 predicate
30
+ GLINER = "gliner" # All components from GLiNER2 extraction
31
31
  SPLIT = "split" # Subject/object from splitting source text around predicate
32
- MODEL = "model" # All components from T5-Gemma model (when spaCy disabled)
32
+ MODEL = "model" # All components from T5-Gemma model (when GLiNER2 disabled)
33
33
 
34
34
 
35
35
  class Entity(BaseModel):
@@ -295,9 +295,15 @@ class ExtractionOptions(BaseModel):
295
295
  default=True,
296
296
  description="Use embedding similarity for predicate deduplication"
297
297
  )
298
- use_spacy_extraction: bool = Field(
298
+ use_gliner_extraction: bool = Field(
299
299
  default=True,
300
- description="Use spaCy for predicate/subject/object extraction (model provides structure + coreference)"
300
+ description="Use GLiNER2 for predicate/subject/object extraction (model provides structure + coreference)"
301
+ )
302
+
303
+ # GLiNER2 predicate configuration
304
+ predicates: Optional[list[str]] = Field(
305
+ default=None,
306
+ description="Optional list of predefined predicate types for GLiNER2 relation extraction (e.g., ['works_for', 'founded'])"
301
307
  )
302
308
 
303
309
  # Verbose logging
@@ -15,41 +15,21 @@ from .models import ScoringConfig, Statement
15
15
 
16
16
  logger = logging.getLogger(__name__)
17
17
 
18
- # Lazy-loaded spaCy model for grammatical analysis
19
- _nlp = None
20
-
21
-
22
- def _get_nlp():
23
- """Lazy-load spaCy model for POS tagging."""
24
- global _nlp
25
- if _nlp is None:
26
- import spacy
27
- try:
28
- _nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
29
- except OSError:
30
- # Model not found, try to download
31
- from .spacy_extraction import _download_model
32
- if _download_model():
33
- _nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
34
- else:
35
- raise
36
- return _nlp
37
-
38
18
 
39
19
  class TripleScorer:
40
20
  """
41
- Score individual triples combining semantic similarity and grammatical accuracy.
21
+ Score individual triples combining semantic similarity and entity recognition.
42
22
 
43
23
  The score is a weighted combination of:
44
24
  - Semantic similarity (50%): Cosine similarity between source text and reassembled triple
45
- - Subject noun score (25%): How noun-like the subject is
46
- - Object noun score (25%): How noun-like the object is
47
-
48
- Noun scoring:
49
- - Proper noun only (PROPN): 1.0
50
- - Common noun only (NOUN): 0.8
51
- - Contains noun + other words: 0.6
52
- - No noun: 0.2
25
+ - Subject entity score (25%): How entity-like the subject is (via GLiNER2)
26
+ - Object entity score (25%): How entity-like the object is (via GLiNER2)
27
+
28
+ Entity scoring (via GLiNER2):
29
+ - Recognized entity with high confidence: 1.0
30
+ - Recognized entity with moderate confidence: 0.8
31
+ - Partially recognized: 0.6
32
+ - Not recognized: 0.2 (low-confidence matches score 0.4)
53
33
  """
54
34
 
55
35
  def __init__(
@@ -102,54 +82,22 @@ class TripleScorer:
102
82
 
103
83
  def _score_noun_content(self, text: str) -> float:
104
84
  """
105
- Score how noun-like a text is.
85
+ Score how entity-like a text is using GLiNER2 entity recognition.
106
86
 
107
87
  Returns:
108
- 1.0 - Entirely proper noun(s)
109
- 0.8 - Entirely common noun(s)
110
- 0.6 - Contains noun(s) but also other words
111
- 0.2 - No nouns found
88
+ 1.0 - Recognized as a named entity with high confidence
89
+ 0.8 - Recognized as an entity with moderate confidence
90
+ 0.6 - Partially recognized or contains entity-like content
91
+ 0.4 - Recognized only with low confidence (below 0.5)
+ 0.2 - Not recognized as any entity type
112
92
  """
113
93
  if not text or not text.strip():
114
94
  return 0.2
115
95
 
116
96
  try:
117
- nlp = _get_nlp()
118
- doc = nlp(text)
119
-
120
- # Count token types (excluding punctuation and spaces)
121
- tokens = [t for t in doc if not t.is_punct and not t.is_space]
122
- if not tokens:
123
- return 0.2
124
-
125
- proper_nouns = sum(1 for t in tokens if t.pos_ == "PROPN")
126
- common_nouns = sum(1 for t in tokens if t.pos_ == "NOUN")
127
- total_nouns = proper_nouns + common_nouns
128
- total_tokens = len(tokens)
129
-
130
- if total_nouns == 0:
131
- # No nouns at all
132
- return 0.2
133
-
134
- if total_nouns == total_tokens:
135
- # Entirely nouns
136
- if proper_nouns == total_tokens:
137
- # All proper nouns
138
- return 1.0
139
- elif common_nouns == total_tokens:
140
- # All common nouns
141
- return 0.8
142
- else:
143
- # Mix of proper and common nouns
144
- return 0.9
145
-
146
- # Contains nouns but also other words
147
- # Score based on noun ratio
148
- noun_ratio = total_nouns / total_tokens
149
- return 0.4 + (noun_ratio * 0.4) # Range: 0.4 to 0.8
150
-
97
+ from .gliner_extraction import score_entity_content
98
+ return score_entity_content(text)
151
99
  except Exception as e:
152
- logger.debug(f"Noun scoring failed for '{text}': {e}")
100
+ logger.debug(f"Entity scoring failed for '{text}': {e}")
153
101
  return 0.5 # Neutral score on error
154
102
 
155
103
  def score_triple(self, statement: Statement, source_text: str) -> float:
@@ -1,12 +0,0 @@
1
- statement_extractor/__init__.py,sha256=KwZfWnTB9oevTLw0TrNlYFu67qIYO-34JqDtcpjOhZI,3013
2
- statement_extractor/canonicalization.py,sha256=ZMLs6RLWJa_rOJ8XZ7PoHFU13-zeJkOMDnvK-ZaFa5s,5991
3
- statement_extractor/cli.py,sha256=JMEXiT2xwmW1J8JmJliQh32AT-7bTAtAscPx1AGRfPg,9054
4
- statement_extractor/extractor.py,sha256=vS8UCgE8uITt_28PwCh4WCqOjWLpfrJcN3fh1YPBcjA,39657
5
- statement_extractor/models.py,sha256=FxLj2fIodX317XVIJLZ0GFNahm_VV07KzdoLSSjoVD4,11952
6
- statement_extractor/predicate_comparer.py,sha256=jcuaBi5BYqD3TKoyj3pR9dxtX5ihfDJvjdhEd2LHCwc,26184
7
- statement_extractor/scoring.py,sha256=pdNgyLHmlk-npISzm4nycK9G4wM2nztg5KTG7piFACI,18135
8
- statement_extractor/spacy_extraction.py,sha256=ACvIB-Ag7H7h_Gb0cdypIr8fnf3A-UjyJnqqjWD5Ccs,12320
9
- corp_extractor-0.3.0.dist-info/METADATA,sha256=eu8b7R_FQxFyc_9FSocy078TTyB7BwvGX-YAS79hKgg,17042
10
- corp_extractor-0.3.0.dist-info/WHEEL,sha256=WLgqFyCfm_KASv4WHyYy0P3pM_m7J5L9k2skdKLirC8,87
11
- corp_extractor-0.3.0.dist-info/entry_points.txt,sha256=i0iKFqPIusvb-QTQ1zNnFgAqatgVah-jIhahbs5TToQ,115
12
- corp_extractor-0.3.0.dist-info/RECORD,,
@@ -1,386 +0,0 @@
1
- """
2
- spaCy-based triple extraction.
3
-
4
- Uses spaCy dependency parsing to extract subject, predicate, and object
5
- from source text. T5-Gemma model provides triple structure and coreference
6
- resolution, while spaCy handles linguistic analysis.
7
-
8
- The spaCy model is downloaded automatically on first use.
9
- """
10
-
11
- import logging
12
- from typing import Optional
13
-
14
- logger = logging.getLogger(__name__)
15
-
16
- # Lazy-loaded spaCy model
17
- _nlp = None
18
-
19
-
20
- def _download_model():
21
- """Download the spaCy model if not present."""
22
- import shutil
23
- import subprocess
24
- import sys
25
-
26
- # Direct URL to the spaCy model wheel
27
- MODEL_URL = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl"
28
-
29
- logger.info("Downloading spaCy model 'en_core_web_sm'...")
30
-
31
- # Try uv first (for uv-managed environments)
32
- uv_path = shutil.which("uv")
33
- if uv_path:
34
- try:
35
- result = subprocess.run(
36
- [uv_path, "pip", "install", MODEL_URL],
37
- capture_output=True,
38
- text=True,
39
- )
40
- if result.returncode == 0:
41
- logger.info("Successfully downloaded spaCy model via uv")
42
- return True
43
- logger.debug(f"uv pip install failed: {result.stderr}")
44
- except Exception as e:
45
- logger.debug(f"uv pip install failed: {e}")
46
-
47
- # Try pip directly
48
- try:
49
- result = subprocess.run(
50
- [sys.executable, "-m", "pip", "install", MODEL_URL],
51
- capture_output=True,
52
- text=True,
53
- )
54
- if result.returncode == 0:
55
- logger.info("Successfully downloaded spaCy model via pip")
56
- return True
57
- logger.debug(f"pip install failed: {result.stderr}")
58
- except Exception as e:
59
- logger.debug(f"pip install failed: {e}")
60
-
61
- # Try spacy's download as last resort
62
- try:
63
- from spacy.cli import download
64
- download("en_core_web_sm")
65
- # Check if it actually worked
66
- import spacy
67
- spacy.load("en_core_web_sm")
68
- logger.info("Successfully downloaded spaCy model via spacy")
69
- return True
70
- except Exception:
71
- pass
72
-
73
- logger.warning(
74
- "Failed to download spaCy model automatically. "
75
- "Please run: uv pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl"
76
- )
77
- return False
78
-
79
-
80
- def _get_nlp():
81
- """
82
- Lazy-load the spaCy model.
83
-
84
- Disables NER and lemmatizer for faster processing since we only
85
- need dependency parsing. Automatically downloads the model if not present.
86
- """
87
- global _nlp
88
- if _nlp is None:
89
- import spacy
90
-
91
- # Try to load the model, download if not present
92
- try:
93
- _nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
94
- logger.debug("Loaded spaCy model for extraction")
95
- except OSError:
96
- # Model not found, try to download it
97
- if _download_model():
98
- _nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
99
- logger.debug("Loaded spaCy model after download")
100
- else:
101
- raise OSError(
102
- "spaCy model not found and automatic download failed. "
103
- "Please run: python -m spacy download en_core_web_sm"
104
- )
105
- return _nlp
106
-
107
-
108
- def _get_full_noun_phrase(token) -> str:
109
- """
110
- Get the full noun phrase for a token, including compounds and modifiers.
111
- """
112
- # Get all tokens in the subtree that form the noun phrase
113
- phrase_tokens = []
114
-
115
- # Collect compound modifiers and the token itself
116
- for t in token.subtree:
117
- # Include compounds, adjectives, determiners, and the head noun
118
- if t.dep_ in ("compound", "amod", "det", "poss", "nummod", "nmod") or t == token:
119
- phrase_tokens.append(t)
120
-
121
- # Sort by position and join
122
- phrase_tokens.sort(key=lambda x: x.i)
123
- return " ".join([t.text for t in phrase_tokens])
124
-
125
-
126
- def _extract_verb_phrase(verb_token) -> str:
127
- """
128
- Extract the full verb phrase including auxiliaries and particles.
129
- """
130
- parts = []
131
-
132
- # Collect auxiliaries that come before the verb
133
- for child in verb_token.children:
134
- if child.dep_ in ("aux", "auxpass") and child.i < verb_token.i:
135
- parts.append((child.i, child.text))
136
-
137
- # Add the main verb
138
- parts.append((verb_token.i, verb_token.text))
139
-
140
- # Collect particles and prepositions that are part of phrasal verbs
141
- for child in verb_token.children:
142
- if child.dep_ == "prt" and child.i > verb_token.i:
143
- parts.append((child.i, child.text))
144
- # Include prepositions for phrasal verbs like "announced by"
145
- elif child.dep_ == "agent" and child.i > verb_token.i:
146
- # For passive constructions, include "by"
147
- parts.append((child.i, child.text))
148
-
149
- # Sort by position and join
150
- parts.sort(key=lambda x: x[0])
151
- return " ".join([p[1] for p in parts])
152
-
153
-
154
- def _match_entity_boundaries(
155
- spacy_text: str,
156
- model_text: str,
157
- source_text: str,
158
- ) -> str:
159
- """
160
- Match entity boundaries between spaCy extraction and model hint.
161
-
162
- If model text is a superset that includes spaCy text, use model text
163
- for better entity boundaries (e.g., "Apple" -> "Apple Inc.").
164
- """
165
- spacy_lower = spacy_text.lower()
166
- model_lower = model_text.lower()
167
-
168
- # If model text contains spaCy text, prefer model text
169
- if spacy_lower in model_lower:
170
- return model_text
171
-
172
- # If spaCy text contains model text, prefer spaCy text
173
- if model_lower in spacy_lower:
174
- return spacy_text
175
-
176
- # If they overlap significantly, prefer the one that appears in source
177
- if spacy_text in source_text:
178
- return spacy_text
179
- if model_text in source_text:
180
- return model_text
181
-
182
- # Default to spaCy extraction
183
- return spacy_text
184
-
185
-
186
- def _extract_spacy_triple(doc, model_subject: str, model_object: str, source_text: str) -> tuple[str | None, str | None, str | None]:
187
- """Extract subject, predicate, object from spaCy doc."""
188
- # Find the root verb
189
- root = None
190
- for token in doc:
191
- if token.dep_ == "ROOT":
192
- root = token
193
- break
194
-
195
- if root is None:
196
- return None, None, None
197
-
198
- # Extract predicate from root verb
199
- predicate = None
200
- if root.pos_ == "VERB":
201
- predicate = _extract_verb_phrase(root)
202
- elif root.pos_ == "AUX":
203
- predicate = root.text
204
-
205
- # Extract subject (nsubj, nsubjpass)
206
- subject = None
207
- for child in root.children:
208
- if child.dep_ in ("nsubj", "nsubjpass"):
209
- subject = _get_full_noun_phrase(child)
210
- break
211
-
212
- # If no direct subject, check parent
213
- if subject is None and root.head != root:
214
- for child in root.head.children:
215
- if child.dep_ in ("nsubj", "nsubjpass"):
216
- subject = _get_full_noun_phrase(child)
217
- break
218
-
219
- # Extract object (dobj, pobj, attr, oprd)
220
- obj = None
221
- for child in root.children:
222
- if child.dep_ in ("dobj", "attr", "oprd"):
223
- obj = _get_full_noun_phrase(child)
224
- break
225
- elif child.dep_ == "prep":
226
- for pchild in child.children:
227
- if pchild.dep_ == "pobj":
228
- obj = _get_full_noun_phrase(pchild)
229
- break
230
- if obj:
231
- break
232
- elif child.dep_ == "agent":
233
- for pchild in child.children:
234
- if pchild.dep_ == "pobj":
235
- obj = _get_full_noun_phrase(pchild)
236
- break
237
- if obj:
238
- break
239
-
240
- # Match against model values for better entity boundaries
241
- if subject:
242
- subject = _match_entity_boundaries(subject, model_subject, source_text)
243
- if obj:
244
- obj = _match_entity_boundaries(obj, model_object, source_text)
245
-
246
- return subject, predicate, obj
247
-
248
-
249
- def extract_triple_from_text(
250
- source_text: str,
251
- model_subject: str,
252
- model_object: str,
253
- model_predicate: str,
254
- ) -> tuple[str, str, str] | None:
255
- """
256
- Extract subject, predicate, object from source text using spaCy.
257
-
258
- Returns a spaCy-based triple that can be added to the candidate pool
259
- alongside the model's triple. The existing scoring/dedup logic will
260
- pick the best one.
261
-
262
- Args:
263
- source_text: The source sentence to analyze
264
- model_subject: Subject from T5-Gemma (used for entity boundary matching)
265
- model_object: Object from T5-Gemma (used for entity boundary matching)
266
- model_predicate: Predicate from T5-Gemma (unused, kept for API compat)
267
-
268
- Returns:
269
- Tuple of (subject, predicate, object) from spaCy, or None if extraction fails
270
- """
271
- if not source_text:
272
- return None
273
-
274
- try:
275
- nlp = _get_nlp()
276
- doc = nlp(source_text)
277
- spacy_subject, spacy_predicate, spacy_object = _extract_spacy_triple(
278
- doc, model_subject, model_object, source_text
279
- )
280
-
281
- # Only return if we got at least a predicate
282
- if spacy_predicate:
283
- logger.debug(
284
- f"spaCy extracted: subj='{spacy_subject}', pred='{spacy_predicate}', obj='{spacy_object}'"
285
- )
286
- return (
287
- spacy_subject or model_subject,
288
- spacy_predicate,
289
- spacy_object or model_object,
290
- )
291
-
292
- return None
293
-
294
- except OSError as e:
295
- logger.debug(f"Cannot load spaCy model: {e}")
296
- return None
297
- except Exception as e:
298
- logger.debug(f"spaCy extraction failed: {e}")
299
- return None
300
-
301
-
302
- def extract_triple_by_predicate_split(
303
- source_text: str,
304
- predicate: str,
305
- ) -> tuple[str, str, str] | None:
306
- """
307
- Extract subject and object by splitting the source text around the predicate.
308
-
309
- This is useful when the predicate is known but subject/object boundaries
310
- are uncertain. Uses the predicate as an anchor point.
311
-
312
- Args:
313
- source_text: The source sentence
314
- predicate: The predicate (verb phrase) to split on
315
-
316
- Returns:
317
- Tuple of (subject, predicate, object) or None if split fails
318
- """
319
- if not source_text or not predicate:
320
- return None
321
-
322
- # Find the predicate in the source text (case-insensitive)
323
- source_lower = source_text.lower()
324
- pred_lower = predicate.lower()
325
-
326
- pred_pos = source_lower.find(pred_lower)
327
- if pred_pos < 0:
328
- # Try finding just the main verb (first word of predicate)
329
- main_verb = pred_lower.split()[0] if pred_lower.split() else ""
330
- if main_verb and len(main_verb) > 2:
331
- pred_pos = source_lower.find(main_verb)
332
- if pred_pos >= 0:
333
- # Adjust to use the actual predicate length for splitting
334
- predicate = main_verb
335
-
336
- if pred_pos < 0:
337
- return None
338
-
339
- # Extract subject (text before predicate, trimmed)
340
- subject = source_text[:pred_pos].strip()
341
-
342
- # Extract object (text after predicate, trimmed)
343
- pred_end = pred_pos + len(predicate)
344
- obj = source_text[pred_end:].strip()
345
-
346
- # Clean up: remove trailing punctuation from object
347
- obj = obj.rstrip('.,;:!?')
348
-
349
- # Clean up: remove leading articles/prepositions from object if very short
350
- obj_words = obj.split()
351
- if obj_words and obj_words[0].lower() in ('a', 'an', 'the', 'to', 'of', 'for'):
352
- if len(obj_words) > 1:
353
- obj = ' '.join(obj_words[1:])
354
-
355
- # Validate: both subject and object should have meaningful content
356
- if len(subject) < 2 or len(obj) < 2:
357
- return None
358
-
359
- logger.debug(
360
- f"Predicate-split extracted: subj='{subject}', pred='{predicate}', obj='{obj}'"
361
- )
362
-
363
- return (subject, predicate, obj)
364
-
365
-
366
- # Keep old function for backwards compatibility
367
- def infer_predicate(
368
- subject: str,
369
- obj: str,
370
- source_text: str,
371
- ) -> Optional[str]:
372
- """
373
- Infer the predicate from source text using dependency parsing.
374
-
375
- DEPRECATED: Use extract_triple_from_text instead.
376
- """
377
- result = extract_triple_from_text(
378
- source_text=source_text,
379
- model_subject=subject,
380
- model_object=obj,
381
- model_predicate="",
382
- )
383
- if result:
384
- _, predicate, _ = result
385
- return predicate if predicate else None
386
- return None