llm-ie 0.1.7__tar.gz → 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {llm_ie-0.1.7 → llm_ie-0.2.0}/PKG-INFO +212 -5
- {llm_ie-0.1.7 → llm_ie-0.2.0}/README.md +212 -4
- {llm_ie-0.1.7 → llm_ie-0.2.0}/pyproject.toml +1 -1
- llm_ie-0.2.0/src/llm_ie/asset/prompt_guide/BinaryRelationExtractor_prompt_guide.txt +38 -0
- llm_ie-0.2.0/src/llm_ie/asset/prompt_guide/MultiClassRelationExtractor_prompt_guide.txt +46 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/data_types.py +131 -18
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/extractors.py +436 -13
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/__init__.py +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/asset/PromptEditor_prompts/comment.txt +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/asset/PromptEditor_prompts/rewrite.txt +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/asset/prompt_guide/BasicFrameExtractor_prompt_guide.txt +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/asset/prompt_guide/ReviewFrameExtractor_prompt_guide.txt +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/asset/prompt_guide/SentenceFrameExtractor_prompt_guide.txt +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/engines.py +0 -0
- {llm_ie-0.1.7 → llm_ie-0.2.0}/src/llm_ie/prompt_editor.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: llm-ie
-Version: 0.1.7
+Version: 0.2.0
 Summary: An LLM-powered tool that transforms everyday language into robust information extraction pipelines.
 License: MIT
 Author: Enshuo (David) Hsu
@@ -25,11 +25,14 @@ An LLM-powered tool that transforms everyday language into robust information ex
 - [Prerequisite](#prerequisite)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
+- [Examples](#examples)
 - [User Guide](#user-guide)
 - [LLM Inference Engine](#llm-inference-engine)
 - [Prompt Template](#prompt-template)
 - [Prompt Editor](#prompt-editor)
 - [Extractor](#extractor)
+- [FrameExtractor](#frameextractor)
+- [RelationExtractor](#relationextractor)
 
 ## Overview
 LLM-IE is a toolkit that provides robust information extraction utilities for frame-based information extraction. Since prompt design has a significant impact on generative information extraction with LLMs, it also provides a built-in LLM editor to help with prompt writing. The flowchart below demonstrates the workflow starting from a casual language request.
@@ -206,6 +209,10 @@ for frame in frames:
 doc.save("<your filename>.llmie")
 ```
 
+## Examples
+- [Write prompt templates with AI editors](demo/prompt_template_writing.ipynb)
+- [NER + RE for Drug, Strength, Frequency](demo/medication_relation_extraction.ipynb)
+
 ## User Guide
 This package is comprised of some key classes:
 - LLM Inference Engine
@@ -547,12 +554,25 @@ Recommendations:
 After a few iterations of revision, we will have a high-quality prompt template for the information extraction pipeline.
 
 ### Extractor
-An extractor implements a prompting method for information extraction.
+An extractor implements a prompting method for information extraction. There are two extractor families: ```FrameExtractor``` and ```RelationExtractor```.
+The ```FrameExtractor``` extracts named entities and their attributes ("frames"). The ```RelationExtractor``` extracts relations (and relation types) between frames.
+
+#### FrameExtractor
+The ```BasicFrameExtractor``` directly prompts the LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame. The ```ReviewFrameExtractor``` builds on the ```BasicFrameExtractor``` but adds a review step after the initial extraction to boost sensitivity and improve performance. The ```SentenceFrameExtractor``` gives the LLM the entire document upfront as a reference, then prompts it sentence by sentence and collects the per-sentence outputs. To learn about an extractor, use the class method ```get_prompt_guide()``` to print out its prompt guide.
 
 <details>
 <summary>BasicFrameExtractor</summary>
 
-The ```BasicFrameExtractor``` directly prompts LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame.
+The ```BasicFrameExtractor``` directly prompts the LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame. The ```text_content``` parameter holds the input text as a string, or as a dictionary (if the prompt template has multiple input placeholders). The ```entity_key``` parameter defines which JSON key holds the entity text; it must be consistent with the prompt template.
+
+```python
+from llm_ie.extractors import BasicFrameExtractor
+
+extractor = BasicFrameExtractor(llm, prompt_temp)
+frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
+```
+
+Use the ```get_prompt_guide()``` method to inspect the prompt template guideline for ```BasicFrameExtractor```.
 
 ```python
 from llm_ie.extractors import BasicFrameExtractor
@@ -630,15 +650,202 @@ frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", str
 <details>
 <summary>SentenceFrameExtractor</summary>
 
-The ```SentenceFrameExtractor``` instructs the LLM to extract sentence by sentence. The reason is to ensure the accuracy of frame spans. It also prevents LLMs from overseeing sections/ sentences. Empirically, this extractor results in better
+The ```SentenceFrameExtractor``` instructs the LLM to extract sentence by sentence. This ensures the accuracy of frame spans and prevents the LLM from overlooking sections or sentences. Empirically, this extractor achieves better recall than the ```BasicFrameExtractor``` on complex tasks.
+
+The ```multi_turn``` parameter enables multi-turn conversation prompting. If True, previous sentences and LLM outputs are appended to the input messages and carried over. If False, only the current sentence is prompted. For LLM inference engines that support prompt caching (e.g., Llama.Cpp, Ollama), multi-turn prompting makes better use of the KV cache and results in faster inference. For vLLM with [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html), multi-turn conversation is not necessary.
 
 ```python
 from llm_ie.extractors import SentenceFrameExtractor
 
 extractor = SentenceFrameExtractor(llm, prompt_temp)
-frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
+frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", multi_turn=True, stream=True)
 ```
 </details>
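The per-sentence strategy depends on mapping each sentence back to its character offsets in the original document, so that extracted frame spans stay document-level. A minimal, self-contained sketch of that bookkeeping (illustrative only, not llm-ie's actual implementation):

```python
import re

def sentence_spans(text: str):
    """Split text into sentences and return (start, end, sentence) tuples
    with character offsets into the original document."""
    spans = []
    cursor = 0
    # naive sentence splitter: break after ., !, ? followed by whitespace
    for sent in re.split(r"(?<=[.!?])\s+", text):
        start = text.index(sent, cursor)  # locate the sentence in the document
        end = start + len(sent)
        spans.append((start, end, sent))
        cursor = end
    return spans

doc = "Patient has diabetes. Metformin 500 mg was started."
for start, end, sent in sentence_spans(doc):
    assert doc[start:end] == sent  # offsets are valid at the document level
```

Because each sentence carries its document-level offsets, any span the LLM reports within a sentence can be shifted by that sentence's start to yield a correct document span.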
 
+#### RelationExtractor
+Relation extractors prompt the LLM with pairs of frames from a document (```LLMInformationExtractionDocument```) and extract relations.
+The ```BinaryRelationExtractor``` extracts binary relations (yes/no) between two frames. The ```MultiClassRelationExtractor``` extracts relations and assigns relation types ("multi-class").
+
+An important feature of the relation extractors is that users are required to define a ```possible_relation_func``` or ```possible_relation_types_func``` for the extractor. The reason is that there are many possible pairs of frames (N choose 2 combinations). The ```possible_relation_func``` rules out impossible pairs and thereby reduces the LLM inference burden.
+
+<details>
+<summary>BinaryRelationExtractor</summary>
+
+Use the ```get_prompt_guide()``` method to inspect the prompt template guideline for ```BinaryRelationExtractor```.
+```python
+from llm_ie.extractors import BinaryRelationExtractor
+
+print(BinaryRelationExtractor.get_prompt_guide())
+```
+
+```
+Prompt template design:
+1. Task description (mention binary relation extraction and ROI)
+2. Schema definition (defines relation)
+3. Output format definition (must use the key "Relation")
+4. Hints
+5. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)
 
 
+Example:
+
+# Task description
+This is a binary relation extraction task. Given a region of interest (ROI) text and two entities from a medical note, indicate the relation existence between the two entities.
+
+# Schema definition
+True: if there is a relationship between a medication name (one of the entities) and its strength or frequency (the other entity).
+False: Otherwise.
+
+# Output format definition
+Your output should follow the JSON format:
+{"Relation": "<True or False>"}
+
+I am only interested in the content between []. Do not explain your answer.
+
+# Hints
+1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
+2. Pay attention to the medication entity and see if the strength or frequency is for it.
+3. If the strength or frequency is for another medication, output False.
+4. If the strength or frequency is for the same medication but at a different location (span), output False.
+
+# Input placeholders
+ROI Text with the two entities annotated with <entity_1> and <entity_2>:
+"{{roi_text}}"
+
+Entity 1 full information:
+{{frame_1}}
+
+Entity 2 full information:
+{{frame_2}}
+```
+
+As an example, we define the ```possible_relation_func``` function:
+- if the two frames are > 500 characters apart, we assume no relation (False)
+- if the two frames are "Medication" and "Strength", or "Medication" and "Frequency", there could be relations (True)
+
+```python
+def possible_relation_func(frame_1, frame_2) -> bool:
+    """
+    This function pre-processes two frames and outputs a bool indicating whether the two frames could be related.
+    """
+    # if the distance between the two frames is > 500 characters, assume no relation.
+    if abs(frame_1.start - frame_2.start) > 500:
+        return False
+
+    # if the entity types are "Medication" and "Strength", there could be relations.
+    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Strength") or \
+       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Strength"):
+        return True
+
+    # if the entity types are "Medication" and "Frequency", there could be relations.
+    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Frequency") or \
+       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Frequency"):
+        return True
+
+    # Otherwise, no relation.
+    return False
+```
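The pruning step can be sketched end-to-end with plain Python. In this sketch the ```Frame``` dataclass is a hypothetical stand-in for llm-ie's frame objects (llm-ie applies the filter internally), and the set-based comparison is a compact equivalent of the conditions above:

```python
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class Frame:  # hypothetical stand-in for an llm-ie frame
    start: int
    attr: dict = field(default_factory=dict)

def possible_relation_func(frame_1, frame_2) -> bool:
    # frames more than 500 characters apart are assumed unrelated
    if abs(frame_1.start - frame_2.start) > 500:
        return False
    # only Medication-Strength and Medication-Frequency pairs are plausible
    types = {frame_1.attr["entity_type"], frame_2.attr["entity_type"]}
    return types in ({"Medication", "Strength"}, {"Medication", "Frequency"})

frames = [
    Frame(0,   {"entity_type": "Medication"}),
    Frame(15,  {"entity_type": "Strength"}),
    Frame(30,  {"entity_type": "Frequency"}),
    Frame(900, {"entity_type": "Strength"}),  # too far away
]
# only pairs passing the filter are sent to the LLM
candidates = [p for p in combinations(frames, 2) if possible_relation_func(*p)]
# here the 6 possible pairs are reduced to 2 candidate pairs
```

With 4 frames there are 6 possible pairs; the filter keeps only the two Medication pairs within range, so the LLM is prompted 2 times instead of 6.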
+
+In the ```BinaryRelationExtractor``` constructor, we pass in the prompt template and ```possible_relation_func```.
+
+```python
+from llm_ie.extractors import BinaryRelationExtractor
+
+extractor = BinaryRelationExtractor(llm, prompt_template=prompt_template, possible_relation_func=possible_relation_func)
+relations = extractor.extract_relations(doc, stream=True)
+```
+
+</details>
+
+
+<details>
+<summary>MultiClassRelationExtractor</summary>
+
+The main difference from the ```BinaryRelationExtractor``` is that the ```MultiClassRelationExtractor``` allows specifying relation types. The prompt template guideline has an additional placeholder for possible relation types, ```{{pos_rel_types}}```.
+
+```python
+print(MultiClassRelationExtractor.get_prompt_guide())
+```
+
+```
+Prompt template design:
+1. Task description (mention multi-class relation extraction and ROI)
+2. Schema definition (defines relation types)
+3. Output format definition (must use the key "RelationType")
+4. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)
+
+
+Example:
+
+# Task description
+This is a multi-class relation extraction task. Given a region of interest (ROI) text and two frames from a medical note, classify the relation types between the two frames.
+
+# Schema definition
+Strength-Drug: this is a relationship between the drug strength and its name.
+Dosage-Drug: this is a relationship between the drug dosage and its name.
+Duration-Drug: this is a relationship between a drug duration and its name.
+Frequency-Drug: this is a relationship between a drug frequency and its name.
+Form-Drug: this is a relationship between a drug form and its name.
+Route-Drug: this is a relationship between the route of administration for a drug and its name.
+Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
+ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.
+
+# Output format definition
+Choose one of the relation types listed below or choose "No Relation":
+{{pos_rel_types}}
+
+Your output should follow the JSON format:
+{"RelationType": "<relation type or No Relation>"}
+
+I am only interested in the content between []. Do not explain your answer.
+
+# Hints
+1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
+2. Pay attention to the medication entity and see if the strength or frequency is for it.
+3. If the strength or frequency is for another medication, output "No Relation".
+4. If the strength or frequency is for the same medication but at a different location (span), output "No Relation".
+
+# Input placeholders
+ROI Text with the two entities annotated with <entity_1> and <entity_2>:
+"{{roi_text}}"
+
+Entity 1 full information:
+{{frame_1}}
+
+Entity 2 full information:
+{{frame_2}}
+```
+
+As an example, we define the ```possible_relation_types_func```:
+- if the two frames are > 500 characters apart, we assume "No Relation" (output [])
+- if the two frames are "Medication" and "Strength", the only possible relation types are "Strength-Drug" or "No Relation"
+- if the two frames are "Medication" and "Frequency", the only possible relation types are "Frequency-Drug" or "No Relation"
+
+```python
+from typing import List
+
+def possible_relation_types_func(frame_1, frame_2) -> List[str]:
+    # If the two frames are > 500 characters apart, we assume "No Relation"
+    if abs(frame_1.start - frame_2.start) > 500:
+        return []
+
+    # If the two frames are "Medication" and "Strength", the only possible relation types are "Strength-Drug" or "No Relation"
+    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Strength") or \
+       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Strength"):
+        return ['Strength-Drug']
+
+    # If the two frames are "Medication" and "Frequency", the only possible relation types are "Frequency-Drug" or "No Relation"
+    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Frequency") or \
+       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Frequency"):
+        return ['Frequency-Drug']
+
+    return []
+```
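A quick sanity check of such a filter with mock frames; the ```Frame``` dataclass is a hypothetical stand-in for llm-ie's frame objects, and the set-based comparison is a compact rewrite of the conditions above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:  # hypothetical stand-in for an llm-ie frame
    start: int
    attr: dict = field(default_factory=dict)

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # frames more than 500 characters apart get an empty list (pair is skipped)
    if abs(frame_1.start - frame_2.start) > 500:
        return []
    types = {frame_1.attr["entity_type"], frame_2.attr["entity_type"]}
    if types == {"Medication", "Strength"}:
        return ["Strength-Drug"]
    if types == {"Medication", "Frequency"}:
        return ["Frequency-Drug"]
    return []

med = Frame(0, {"entity_type": "Medication"})
freq = Frame(42, {"entity_type": "Frequency"})
pos_rel_types = possible_relation_types_func(med, freq)  # ["Frequency-Drug"]
```

Pairs that return an empty list are never prompted; for the rest, the returned list fills the ```{{pos_rel_types}}``` placeholder so the LLM only chooses among plausible types (plus "No Relation").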
+
+
+```python
+from llm_ie.extractors import MultiClassRelationExtractor
+
+extractor = MultiClassRelationExtractor(llm, prompt_template=re_prompt_template, possible_relation_types_func=possible_relation_types_func)
+relations = extractor.extract_relations(doc, stream=True)
+```
+
+</details>
@@ -11,11 +11,14 @@ An LLM-powered tool that transforms everyday language into robust information ex
|
|
|
11
11
|
- [Prerequisite](#prerequisite)
|
|
12
12
|
- [Installation](#installation)
|
|
13
13
|
- [Quick Start](#quick-start)
|
|
14
|
+
- [Examples](#examples)
|
|
14
15
|
- [User Guide](#user-guide)
|
|
15
16
|
- [LLM Inference Engine](#llm-inference-engine)
|
|
16
17
|
- [Prompt Template](#prompt-template)
|
|
17
18
|
- [Prompt Editor](#prompt-editor)
|
|
18
19
|
- [Extractor](#extractor)
|
|
20
|
+
- [FrameExtractor](#frameextractor)
|
|
21
|
+
- [RelationExtractor](#relationextractor)
|
|
19
22
|
|
|
20
23
|
## Overview
|
|
21
24
|
LLM-IE is a toolkit that provides robust information extraction utilities for frame-based information extraction. Since prompt design has a significant impact on generative information extraction with LLMs, it also provides a built-in LLM editor to help with prompt writing. The flowchart below demonstrates the workflow starting from a casual language request.
|
|
@@ -192,6 +195,10 @@ for frame in frames:
|
|
|
192
195
|
doc.save("<your filename>.llmie")
|
|
193
196
|
```
|
|
194
197
|
|
|
198
|
+
## Examples
|
|
199
|
+
- [Write prompt templates with AI editors](demo/prompt_template_writing.ipynb)
|
|
200
|
+
- [NER + RE for Drug, Strength, Frequency](demo/medication_relation_extraction.ipynb)
|
|
201
|
+
|
|
195
202
|
## User Guide
|
|
196
203
|
This package is comprised of some key classes:
|
|
197
204
|
- LLM Inference Engine
|
|
@@ -533,12 +540,25 @@ Recommendations:
|
|
|
533
540
|
After a few iterations of revision, we will have a high-quality prompt template for the information extraction pipeline.
|
|
534
541
|
|
|
535
542
|
### Extractor
|
|
536
|
-
An extractor implements a prompting method for information extraction.
|
|
543
|
+
An extractor implements a prompting method for information extraction. There are two extractor families: ```FrameExtractor``` and ```RelationExtractor```.
|
|
544
|
+
The ```FrameExtractor``` extracts named entities and entity attributes ("frame"). The ```RelationExtractor``` extracts the relation (and relation types) between frames.
|
|
545
|
+
|
|
546
|
+
#### FrameExtractor
|
|
547
|
+
The ```BasicFrameExtractor``` directly prompts LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame. The ```ReviewFrameExtractor``` is based on the ```BasicFrameExtractor``` but adds a review step after the initial extraction to boost sensitivity and improve performance. ```SentenceFrameExtractor``` gives LLM the entire document upfront as a reference, then prompts LLM sentence by sentence and collects per-sentence outputs. To learn about an extractor, use the class method ```get_prompt_guide()``` to print out the prompt guide.
|
|
537
548
|
|
|
538
549
|
<details>
|
|
539
550
|
<summary>BasicFrameExtractor</summary>
|
|
540
551
|
|
|
541
|
-
The ```BasicFrameExtractor``` directly prompts LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame.
|
|
552
|
+
The ```BasicFrameExtractor``` directly prompts LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame. The ```text_content``` holds the input text as a string, or as a dictionary (if prompt template has multiple input placeholders). The ```entity_key``` defines which JSON key should be used as entity text. It must be consistent with the prompt template.
|
|
553
|
+
|
|
554
|
+
```python
|
|
555
|
+
from llm_ie.extractors import BasicFrameExtractor
|
|
556
|
+
|
|
557
|
+
extractor = BasicFrameExtractor(llm, prompt_temp)
|
|
558
|
+
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
|
|
559
|
+
```
|
|
560
|
+
|
|
561
|
+
Use the ```get_prompt_guide()``` method to inspect the prompt template guideline for ```BasicFrameExtractor```.
|
|
542
562
|
|
|
543
563
|
```python
|
|
544
564
|
from llm_ie.extractors import BasicFrameExtractor
|
|
@@ -616,14 +636,202 @@ frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", str
|
|
|
616
636
|
<details>
|
|
617
637
|
<summary>SentenceFrameExtractor</summary>
|
|
618
638
|
|
|
619
|
-
The ```SentenceFrameExtractor``` instructs the LLM to extract sentence by sentence. The reason is to ensure the accuracy of frame spans. It also prevents LLMs from overseeing sections/ sentences. Empirically, this extractor results in better
|
|
639
|
+
The ```SentenceFrameExtractor``` instructs the LLM to extract sentence by sentence. The reason is to ensure the accuracy of frame spans. It also prevents LLMs from overseeing sections/ sentences. Empirically, this extractor results in better recall than the ```BasicFrameExtractor``` in complex tasks.
|
|
640
|
+
|
|
641
|
+
The ```multi_turn``` parameter specifies multi-turn conversation for prompting. If True, sentences and LLM outputs will be appended to the input message and carry-over. If False, only the current sentence is prompted. For LLM inference engines that supports prompt cache (e.g., Llama.Cpp, Ollama), use multi-turn conversation prompting can better utilize the KV caching and results in faster inferencing. But for vLLM with [Automatic Prefix Caching (APC)](https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html), multi-turn conversation is not necessary.
|
|
620
642
|
|
|
621
643
|
```python
|
|
622
644
|
from llm_ie.extractors import SentenceFrameExtractor
|
|
623
645
|
|
|
624
646
|
extractor = SentenceFrameExtractor(llm, prompt_temp)
|
|
625
|
-
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
|
|
647
|
+
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", multi_turn=True, stream=True)
|
|
626
648
|
```
|
|
627
649
|
</details>
|
|
628
650
|
|
|
651
|
+
#### RelationExtractor
|
|
652
|
+
Relation extractors prompt LLM with combinations of two frames from a document (```LLMInformationExtractionDocument```) and extract relations.
|
|
653
|
+
The ```BinaryRelationExtractor``` extracts binary relations (yes/no) between two frames. The ```MultiClassRelationExtractor``` extracts relations and assign relation types ("multi-class").
|
|
654
|
+
|
|
655
|
+
An important feature of the relation extractors is that users are required to define a ```possible_relation_func``` or ```possible_relation_types_func``` function for the extractors. The reason is, there are too many possible combinations of two frames (N choose 2 combinations). The ```possible_relation_func``` helps rule out impossible combinations and therefore, reduce the LLM inferencing burden.
|
|
656
|
+
|
|
657
|
+
<details>
|
|
658
|
+
<summary>BinaryRelationExtractor</summary>
|
|
659
|
+
|
|
660
|
+
Use the get_prompt_guide() method to inspect the prompt template guideline for BinaryRelationExtractor.
|
|
661
|
+
```python
|
|
662
|
+
from llm_ie.extractors import BinaryRelationExtractor
|
|
663
|
+
|
|
664
|
+
print(BinaryRelationExtractor.get_prompt_guide())
|
|
665
|
+
```
|
|
666
|
+
|
|
667
|
+
```
|
|
668
|
+
Prompt template design:
|
|
669
|
+
1. Task description (mention binary relation extraction and ROI)
|
|
670
|
+
2. Schema definition (defines relation)
|
|
671
|
+
3. Output format definition (must use the key "Relation")
|
|
672
|
+
4. Hints
|
|
673
|
+
5. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)
|
|
674
|
+
|
|
675
|
+
|
|
676
|
+
Example:
|
|
677
|
+
|
|
678
|
+
# Task description
|
|
679
|
+
This is a binary relation extraction task. Given a region of interest (ROI) text and two entities from a medical note, indicate the relation existence between the two entities.
|
|
680
|
+
|
|
681
|
+
# Schema definition
|
|
682
|
+
True: if there is a relationship between a medication name (one of the entities) and its strength or frequency (the other entity).
|
|
683
|
+
False: Otherwise.
|
|
684
|
+
|
|
685
|
+
# Output format definition
|
|
686
|
+
Your output should follow the JSON format:
|
|
687
|
+
{"Relation": "<True or False>"}
|
|
688
|
+
|
|
689
|
+
I am only interested in the content between []. Do not explain your answer.
|
|
690
|
+
|
|
691
|
+
# Hints
|
|
692
|
+
1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
|
|
693
|
+
2. Pay attention to the medication entity and see if the strength or frequency is for it.
|
|
694
|
+
3. If the strength or frequency is for another medication, output False.
|
|
695
|
+
4. If the strength or frequency is for the same medication but at a different location (span), output False.
|
|
696
|
+
|
|
697
|
+
# Input placeholders
|
|
698
|
+
ROI Text with the two entities annotated with <entity_1> and <entity_2>:
|
|
699
|
+
"{{roi_text}}"
|
|
700
|
+
|
|
701
|
+
Entity 1 full information:
|
|
702
|
+
{{frame_1}}
|
|
703
|
+
|
|
704
|
+
Entity 2 full information:
|
|
705
|
+
{{frame_2}}
|
|
706
|
+
```
|
|
707
|
+
|
|
708
|
+
As an example, we define the ```possible_relation_func``` function:
|
|
709
|
+
- if the two frames are > 500 characters apart, we assume no relation (False)
|
|
710
|
+
- if the two frames are "Medication" and "Strength", or "Medication" and "Frequency", there could be relations (True)
|
|
711
|
+
|
|
712
|
+
```python
|
|
713
|
+
def possible_relation_func(frame_1, frame_2) -> bool:
|
|
714
|
+
"""
|
|
715
|
+
This function pre-process two frames and outputs a bool indicating whether the two frames could be related.
|
|
716
|
+
"""
|
|
717
|
+
# if the distance between the two frames are > 500 characters, assume no relation.
|
|
718
|
+
if abs(frame_1.start - frame_2.start) > 500:
|
|
719
|
+
return False
|
|
720
|
+
|
|
721
|
+
# if the entity types are "Medication" and "Strength", there could be relations.
|
|
722
|
+
if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Strength") or \
|
|
723
|
+
(frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Strength"):
|
|
724
|
+
return True
|
|
725
|
+
|
|
726
|
+
# if the entity types are "Medication" and "Frequency", there could be relations.
|
|
727
|
+
if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Frequency") or \
|
|
728
|
+
(frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Frequency"):
|
|
729
|
+
return True
|
|
730
|
+
|
|
731
|
+
# Otherwise, no relation.
|
|
732
|
+
return False
|
|
733
|
+
```
|
|
734
|
+
|
|
735
|
+
In the ```BinaryRelationExtractor``` constructor, we pass in the prompt template and ```possible_relation_func```.
|
|
736
|
+
|
|
737
|
+
```python
|
|
738
|
+
from llm_ie.extractors import BinaryRelationExtractor
|
|
739
|
+
|
|
740
|
+
extractor = BinaryRelationExtractor(llm, prompt_template=prompt_template, possible_relation_func=possible_relation_func)
|
|
741
|
+
relations = extractor.extract_relations(doc, stream=True)
|
|
742
|
+
```
|
|
743
|
+
|
|
744
|
+
</details>
|
|
745
|
+
|
|
746
|
+
|
|
747
|
+
<details>
|
|
748
|
+
<summary>MultiClassRelationExtractor</summary>
|
|
749
|
+
|
|
750
|
+
The main difference from ```BinaryRelationExtractor``` is that the ```MultiClassRelationExtractor``` allows specifying relation types. The prompt template guideline has an additional placeholder for possible relation types ```{{pos_rel_types}}```.
|
|
751
|
+
|
|
752
|
+
```python
|
|
753
|
+
print(MultiClassRelationExtractor.get_prompt_guide())
|
|
754
|
+
```
|
|
755
|
+
|
|
756
|
+
```
|
|
757
|
+
Prompt template design:
|
|
758
|
+
1. Task description (mention multi-class relation extraction and ROI)
|
|
759
|
+
2. Schema definition (defines relation types)
|
|
760
|
+
3. Output format definition (must use the key "RelationType")
|
|
761
|
+
4. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)
|
|
762
|
+
|
|
763
|
+
|
|
764
|
+
Example:
|
|
765
|
+
|
|
766
|
+
# Task description
|
|
767
|
+
This is a multi-class relation extraction task. Given a region of interest (ROI) text and two frames from a medical note, classify the relation types between the two frames.

# Schema definition
Strength-Drug: this is a relationship between the drug strength and its name.
Dosage-Drug: this is a relationship between the drug dosage and its name.
Duration-Drug: this is a relationship between a drug duration and its name.
Frequency-Drug: this is a relationship between a drug frequency and its name.
Form-Drug: this is a relationship between a drug form and its name.
Route-Drug: this is a relationship between the route of administration for a drug and its name.
Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.

# Output format definition
Choose one of the relation types listed below or choose "No Relation":
{{pos_rel_types}}

Your output should follow the JSON format:
{"RelationType": "<relation type or No Relation>"}

I am only interested in the content between []. Do not explain your answer.

# Hints
1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
2. Pay attention to the medication entity and see if the strength or frequency is for it.
3. If the strength or frequency is for another medication, output "No Relation".
4. If the strength or frequency is for the same medication but at a different location (span), output "No Relation".

# Input placeholders
ROI Text with the two entities annotated with <entity_1> and <entity_2>:
"{{roi_text}}"

Entity 1 full information:
{{frame_1}}

Entity 2 full information:
{{frame_2}}
```
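For reference, the `{{...}}` placeholders in the template are substituted per frame pair before the prompt is sent to the LLM. Below is a minimal sketch of that substitution, assuming plain `{{key}}` string replacement; `fill_placeholders` and the miniature template are hypothetical illustrations, and llm-ie's actual templating may differ:

```python
import re

# Hypothetical miniature template; the real prompt is the full template above.
template = 'ROI Text: "{{roi_text}}"\nEntity 1: {{frame_1}}\nEntity 2: {{frame_2}}'

def fill_placeholders(template: str, values: dict) -> str:
    # Replace each {{key}} with the corresponding value from `values`.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), template)

prompt = fill_placeholders(template, {
    "roi_text": "Aspirin <entity_1>81 mg</entity_1> daily",
    "frame_1": '{"entity_text": "81 mg", "attr": {"entity_type": "Strength"}}',
    "frame_2": '{"entity_text": "Aspirin", "attr": {"entity_type": "Medication"}}',
})
```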
As an example, we define the `possible_relation_types_func`:
- if the two frames are > 500 characters apart, we assume "No Relation" (output [])
- if the two frames are "Medication" and "Strength", the only possible relation types are "Strength-Drug" or "No Relation"
- if the two frames are "Medication" and "Frequency", the only possible relation types are "Frequency-Drug" or "No Relation"

```python
from typing import List

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # If the two frames are > 500 characters apart, we assume "No Relation"
    if abs(frame_1.start - frame_2.start) > 500:
        return []

    # If the two frames are "Medication" and "Strength", the only possible relation types are "Strength-Drug" or "No Relation"
    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Strength") or \
       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Strength"):
        return ['Strength-Drug']

    # If the two frames are "Medication" and "Frequency", the only possible relation types are "Frequency-Drug" or "No Relation"
    if (frame_1.attr["entity_type"] == "Medication" and frame_2.attr["entity_type"] == "Frequency") or \
       (frame_2.attr["entity_type"] == "Medication" and frame_1.attr["entity_type"] == "Frequency"):
        return ['Frequency-Drug']

    return []
```

```python
from llm_ie.extractors import MultiClassRelationExtractor

extractor = MultiClassRelationExtractor(llm, prompt_template=re_prompt_template, possible_relation_types_func=possible_relation_types_func)
relations = extractor.extract_relations(doc, stream=True)
```

</details>
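To sanity-check the gating logic without calling an LLM, the filter can be exercised directly on mock inputs. In this sketch, `Frame` is a hypothetical stand-in for llm-ie's frame objects (only the `start` and `attr` fields used above), and the filter is an equivalent compact rewrite of the `possible_relation_types_func` shown earlier:

```python
from typing import List

class Frame:
    """Hypothetical stand-in for an llm-ie frame: a start offset and an attr dict."""
    def __init__(self, start: int, entity_type: str):
        self.start = start
        self.attr = {"entity_type": entity_type}

def possible_relation_types_func(frame_1, frame_2) -> List[str]:
    # Frames more than 500 characters apart are assumed unrelated.
    if abs(frame_1.start - frame_2.start) > 500:
        return []
    # Compare the unordered pair of entity types.
    types = {frame_1.attr["entity_type"], frame_2.attr["entity_type"]}
    if types == {"Medication", "Strength"}:
        return ["Strength-Drug"]
    if types == {"Medication", "Frequency"}:
        return ["Frequency-Drug"]
    return []

med = Frame(start=10, entity_type="Medication")
strength = Frame(start=25, entity_type="Strength")
far_freq = Frame(start=900, entity_type="Frequency")

print(possible_relation_types_func(med, strength))   # ['Strength-Drug']
print(possible_relation_types_func(med, far_freq))   # []
```

Pairs that survive the filter are the only ones sent to the LLM for classification, which keeps the number of prompts roughly linear in the number of plausible pairs rather than quadratic in the number of frames.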
@@ -0,0 +1,38 @@
Prompt template design:
1. Task description (mention binary relation extraction and ROI)
2. Schema definition (defines relation)
3. Output format definition (must use the key "Relation")
4. Hints
5. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)


Example:

# Task description
This is a binary relation extraction task. Given a region of interest (ROI) text and two entities from a medical note, indicate whether a relation exists between the two entities.

# Schema definition
True: if there is a relationship between a medication name (one of the entities) and its strength or frequency (the other entity).
False: Otherwise.

# Output format definition
Your output should follow the JSON format:
{"Relation": "<True or False>"}

I am only interested in the content between []. Do not explain your answer.

# Hints
1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
2. Pay attention to the medication entity and see if the strength or frequency is for it.
3. If the strength or frequency is for another medication, output False.
4. If the strength or frequency is for the same medication but at a different location (span), output False.

# Input placeholders
ROI Text with the two entities annotated with <entity_1> and <entity_2>:
"{{roi_text}}"

Entity 1 full information:
{{frame_1}}

Entity 2 full information:
{{frame_2}}
@@ -0,0 +1,46 @@
Prompt template design:
1. Task description (mention multi-class relation extraction and ROI)
2. Schema definition (defines relation types)
3. Output format definition (must use the key "RelationType")
4. Input placeholders (must include "roi_text", "frame_1", and "frame_2" placeholders)


Example:

# Task description
This is a multi-class relation extraction task. Given a region of interest (ROI) text and two frames from a medical note, classify the relation types between the two frames.

# Schema definition
Strength-Drug: this is a relationship between the drug strength and its name.
Dosage-Drug: this is a relationship between the drug dosage and its name.
Duration-Drug: this is a relationship between a drug duration and its name.
Frequency-Drug: this is a relationship between a drug frequency and its name.
Form-Drug: this is a relationship between a drug form and its name.
Route-Drug: this is a relationship between the route of administration for a drug and its name.
Reason-Drug: this is a relationship between the reason for which a drug was administered (e.g., symptoms, diseases, etc.) and a drug name.
ADE-Drug: this is a relationship between an adverse drug event (ADE) and a drug name.

# Output format definition
Choose one of the relation types listed below or choose "No Relation":
{{pos_rel_types}}

Your output should follow the JSON format:
{"RelationType": "<relation type or No Relation>"}

I am only interested in the content between []. Do not explain your answer.

# Hints
1. Your input always contains one medication entity and 1) one strength entity or 2) one frequency entity.
2. Pay attention to the medication entity and see if the strength or frequency is for it.
3. If the strength or frequency is for another medication, output "No Relation".
4. If the strength or frequency is for the same medication but at a different location (span), output "No Relation".

# Input placeholders
ROI Text with the two entities annotated with <entity_1> and <entity_2>:
"{{roi_text}}"

Entity 1 full information:
{{frame_1}}

Entity 2 full information:
{{frame_2}}
|