llm-ie 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- llm_ie/__init__.py +0 -0
- llm_ie/data_types.py +167 -0
- llm_ie/engines.py +166 -0
- llm_ie/extractors.py +496 -0
- llm_ie/prompt_editor.py +26 -0
- llm_ie-0.1.0.dist-info/METADATA +552 -0
- llm_ie-0.1.0.dist-info/RECORD +8 -0
- llm_ie-0.1.0.dist-info/WHEEL +4 -0
@@ -0,0 +1,552 @@
Metadata-Version: 2.1
Name: llm-ie
Version: 0.1.0
Summary: An LLM-powered tool that transforms everyday language into robust information extraction pipelines.
License: MIT
Author: Enshuo (David) Hsu
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Description-Content-Type: text/markdown

<div align="center"><img src=asset/LLM-IE.png width=500 ></div>

An LLM-powered tool that transforms everyday language into robust information extraction pipelines.

## Table of Contents
- [Overview](#overview)
- [Prerequisite](#prerequisite)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [User Guide](#user-guide)
  - [LLM Inference Engine](#llm-inference-engine)
  - [Prompt Template](#prompt-template)
  - [Prompt Editor](#prompt-editor)
  - [Extractor](#extractor)

## Overview
LLM-IE is a toolkit that provides robust utilities for frame-based information extraction. Since prompt design has a significant impact on generative information extraction with LLMs, it also provides a built-in LLM-powered prompt editor to help with prompt writing. The flowchart below demonstrates the workflow starting from a casual language request.

<div align="center"><img src="asset/LLM-IE flowchart.png" width=800 ></div>

## Prerequisite
At least one LLM inference engine is required. We provide built-in support for 🦙 [Llama-cpp-python](https://github.com/abetlen/llama-cpp-python) and <img src="https://avatars.githubusercontent.com/u/151674099?s=48&v=4" alt="Icon" width="20"/> [Ollama](https://github.com/ollama/ollama). For installation guides, please refer to those projects. Other inference engines can be configured through the [InferenceEngine](src/llm_ie/engines.py) abstract class. See the [LLM Inference Engine](#llm-inference-engine) section below.

## Installation
The Python package is available on PyPI.
```
pip install llm-ie
```
Note that this package neither checks for nor installs LLM inference engines. See the [prerequisite](#prerequisite) section for details.

## Quick Start
We use a [synthesized medical note](demo/document/synthesized_note.txt) generated by ChatGPT to demo the information extraction process. Our task is to extract diagnosis names, spans, and corresponding attributes (i.e., diagnosis datetime and status).

#### Choose an LLM inference engine
We use one of the built-in engines.

<details>
<summary><img src="https://avatars.githubusercontent.com/u/151674099?s=48&v=4" alt="Icon" width="20"/> Ollama</summary>

```python
from llm_ie.engines import OllamaInferenceEngine

llm = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
```
</details>
<details>
<summary>🦙 Llama-cpp-python</summary>

```python
from llm_ie.engines import LlamaCppInferenceEngine

llm = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
                              gguf_filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
```
</details>

#### Casual language as prompt
We start with a casual description:

*"Extract diagnosis from the clinical note. Make sure to include diagnosis date and status."*

The ```PromptEditor``` rewrites it following the schema required by the ```BasicFrameExtractor```.

```python
from llm_ie.extractors import BasicFrameExtractor
from llm_ie.prompt_editor import PromptEditor

# Describe the task in casual language
prompt_draft = "Extract diagnosis from the clinical note. Make sure to include diagnosis date and status."

# Use the LLM editor to generate a formal prompt template with the standard extraction schema
editor = PromptEditor(llm, BasicFrameExtractor)
prompt_template = editor.rewrite(prompt_draft)
```

The editor generates a prompt template as below:
```
# Task description
The paragraph below contains a clinical note with diagnoses listed. Please carefully review it and extract the diagnoses, including the diagnosis date and status.

# Schema definition
Your output should contain:
    "Diagnosis" which is the name of the diagnosis,
    "Date" which is the date when the diagnosis was made,
    "Status" which is the current status of the diagnosis (e.g. active, resolved, etc.)

# Output format definition
Your output should follow JSON format, for example:
[
    {"Diagnosis": "<Diagnosis text>", "Date": "<date in YYYY-MM-DD format>", "Status": "<status>"},
    {"Diagnosis": "<Diagnosis text>", "Date": "<date in YYYY-MM-DD format>", "Status": "<status>"}
]

# Additional hints
Your output should be 100% based on the provided content. DO NOT output fake information.
If there is no specific date or status, just omit those keys.

# Input placeholder
Below is the clinical note:
{{input}}
```
#### Information extraction pipeline
Now we apply the prompt template to build an information extraction pipeline.

```python
# Load the synthesized medical note
with open("./demo/document/synthesized_note.txt", 'r') as f:
    note_text = f.read()

# Define the extractor
extractor = BasicFrameExtractor(llm, prompt_template)

# Extract
frames = extractor.extract_frames(note_text, entity_key="Diagnosis", stream=True)

# Check extractions
for frame in frames:
    print(frame.to_dict())
```
The output is a list of frames. Each frame has an ```entity_text```, ```start```, ```end```, and a dictionary of ```attr```.

```python
{'frame_id': '0', 'start': 537, 'end': 549, 'entity_text': 'Hypertension', 'attr': {'Datetime': '2010', 'Status': 'history'}}
{'frame_id': '1', 'start': 551, 'end': 565, 'entity_text': 'Hyperlipidemia', 'attr': {'Datetime': '2015', 'Status': 'history'}}
{'frame_id': '2', 'start': 571, 'end': 595, 'entity_text': 'Type 2 Diabetes Mellitus', 'attr': {'Datetime': '2018', 'Status': 'history'}}
{'frame_id': '3', 'start': 2402, 'end': 2431, 'entity_text': 'Acute Coronary Syndrome (ACS)', 'attr': {'Datetime': 'July 20, 2024', 'Status': 'present'}}
```
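Because ```to_dict()``` returns plain dictionaries, downstream filtering is straightforward. The sketch below is not part of the package API; the ```attr``` key names are assumptions based on the example output above.

```python
# Keep only frames whose extracted status is "present".
# The "Status" key follows the prompt schema used in this demo.
present_frames = [
    frame for frame in frames
    if frame.to_dict().get("attr", {}).get("Status") == "present"
]
print(f"{len(present_frames)} of {len(frames)} frames are present diagnoses")
```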

We can save the frames to a document object for better management. The document holds the ```text``` and ```frames```. The ```add_frame()``` method performs validation and, if it passes, adds a frame to the document.
The ```valid_mode``` controls how frame validation is performed. For example, ```valid_mode="span"``` prevents a new frame from being added if a frame with the same span (```start```, ```end```) already exists. ```create_id=True``` lets the document assign unique frame IDs.

```python
from llm_ie.data_types import LLMInformationExtractionDocument

# Define document
doc = LLMInformationExtractionDocument(doc_id="Synthesized medical note",
                                       text=note_text)
# Add frames to the document
for frame in frames:
    doc.add_frame(frame, valid_mode="span", create_id=True)

# Save document to file (.llmie)
doc.save("<your filename>.llmie")
```
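As a quick sanity check before saving, the stored content can be inspected through the document's ```text``` and ```frames``` attributes described above (a sketch; the exact output depends on your extraction results):

```python
# The document holds the note text and the validated frames.
print(len(doc.frames), "frames stored")
print(doc.text[:100])  # first 100 characters of the note
```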

## User Guide
This package comprises a few key classes:
- LLM Inference Engine
- Prompt Template
- Prompt Editor
- Extractors

### LLM Inference Engine
Provides an interface for different LLM inference engines to work in the information extraction workflow. The built-in engines are ```LlamaCppInferenceEngine``` and ```OllamaInferenceEngine```.

#### 🦙 Llama-cpp-python
The ```repo_id``` and ```gguf_filename``` must match the ones on the Hugging Face repo to ensure the correct model is loaded. ```n_ctx``` determines the context length the LLM will consider during text generation. Empirically, a longer context length gives better performance, while consuming more memory and increasing computation. Note that when ```n_ctx``` is less than the prompt length, Llama.cpp throws an exception. ```n_gpu_layers``` indicates the number of model layers to offload to the GPU. The default is -1, which offloads all layers (the entire LLM). Flash attention (```flash_attn```) is supported by Llama.cpp. ```verbose``` indicates whether model information should be displayed. For more input parameters, see 🦙 [Llama-cpp-python](https://github.com/abetlen/llama-cpp-python).

```python
from llm_ie.engines import LlamaCppInferenceEngine

llama_cpp = LlamaCppInferenceEngine(repo_id="bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
                                    gguf_filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
                                    n_ctx=4096,
                                    n_gpu_layers=-1,
                                    flash_attn=True,
                                    verbose=False)
```
#### <img src="https://avatars.githubusercontent.com/u/151674099?s=48&v=4" alt="Icon" width="20"/> Ollama
The ```model_name``` must match the names on the [Ollama library](https://ollama.com/library). Use the command line ```ollama ls``` to check your local model list. ```num_ctx``` determines the context length the LLM will consider during text generation. Empirically, a longer context length gives better performance, while consuming more memory and increasing computation. ```keep_alive``` regulates the lifespan of the LLM in memory. It indicates the number of seconds the model stays loaded after the last API call. The default is 5 minutes (300 seconds).

```python
from llm_ie.engines import OllamaInferenceEngine

ollama = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0",
                               num_ctx=4096,
                               keep_alive=300)
```

#### Test inference engine configuration
To test the inference engine, use the ```chat()``` method.

```python
from llm_ie.engines import OllamaInferenceEngine

ollama = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
ollama.chat(messages=[{"role": "user", "content": "Hi"}], stream=True)
```
The output should be something like (it might vary by LLM and version):

```python
'How can I help you today?'
```

#### Customize inference engine
The abstract class ```InferenceEngine``` defines the interface and the required ```chat()``` method. Inherit from this class to plug in a custom API.
```python
class InferenceEngine:
    @abc.abstractmethod
    def __init__(self):
        """
        This is an abstract class to provide interfaces for LLM inference engines.
        Children classes that inherit this class can be used in extractors. Must implement the chat() method.
        """
        return NotImplemented

    @abc.abstractmethod
    def chat(self, messages:List[Dict[str,str]], max_new_tokens:int=2048, temperature:float=0.0, stream:bool=False, **kwrs) -> str:
        """
        This method inputs chat messages and outputs LLM generated text.

        Parameters:
        ----------
        messages : List[Dict[str,str]]
            a list of dict with role and content. role must be one of {"system", "user", "assistant"}
        max_new_tokens : int, Optional
            the max number of new tokens LLM can generate.
        temperature : float, Optional
            the temperature for token sampling.
        stream : bool, Optional
            if True, LLM generated text will be printed in terminal in real-time.
        """
        return NotImplemented
```

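For example, a subclass could wrap an OpenAI-compatible chat API. The sketch below is illustrative only: it is not part of llm-ie, it assumes the ```openai``` Python package is installed and an ```OPENAI_API_KEY``` is set, and the class and argument names are hypothetical.

```python
from typing import List, Dict
from openai import OpenAI
from llm_ie.engines import InferenceEngine


class OpenAIChatInferenceEngine(InferenceEngine):
    """A hypothetical engine that forwards chat() calls to an OpenAI-compatible API."""
    def __init__(self, model: str, **client_kwargs):
        self.model = model
        self.client = OpenAI(**client_kwargs)  # reads OPENAI_API_KEY from the environment

    def chat(self, messages: List[Dict[str, str]], max_new_tokens: int = 2048,
             temperature: float = 0.0, stream: bool = False, **kwrs) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=max_new_tokens,
            temperature=temperature,
            stream=stream,
        )
        if stream:
            # Print chunks in real time and return the concatenated text.
            pieces = []
            for chunk in response:
                delta = chunk.choices[0].delta.content or ""
                print(delta, end="", flush=True)
                pieces.append(delta)
            return "".join(pieces)
        return response.choices[0].message.content
```
Any object exposing this ```chat()``` interface can then be passed to the extractors in place of the built-in engines.
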
### Prompt Template
A prompt template is a string with one or many placeholders ```{{<placeholder_name>}}```. When input to an extractor, the ```text_content``` will be inserted into the placeholders to construct a prompt. Below is a demo:

```python
prompt_template = """
Below is a medical note. Your task is to extract diagnosis information.
Your output should include:
    "Diagnosis": extract diagnosis names,
    "Datetime": date/time of diagnosis,
    "Status": status of present, history, or family history

Your output should follow a JSON format:
[
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    ...
]

Below is the medical note:
"{{input}}"
"""
# Define an inference engine
ollama = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")

# Define an extractor
extractor = BasicFrameExtractor(ollama, prompt_template)

# Apply text content to the prompt template
prompt_text = extractor._get_user_prompt(text_content="<some text...>")
print(prompt_text)
```

The ```prompt_text``` is the prompt template with the text content filled into the placeholder:

```
Below is a medical note. Your task is to extract diagnosis information.
Your output should include:
    "Diagnosis": extract diagnosis names,
    "Datetime": date/time of diagnosis,
    "Status": status of present, history, or family history
Your output should follow a JSON format:
[
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    ...
]
Below is the medical note:
"<some text...>"
```

#### Placeholder
When only one placeholder is defined in the prompt template, the ```text_content``` can be a string or a dictionary with a single key (regardless of the key name). When multiple placeholders are defined in the prompt template, the ```text_content``` must be a dictionary of the form:

```python
{"<placeholder 1>": "<some text>", "<placeholder 2>": "<some text>"...}
```
For example,

```python
prompt_template = """
Below is a medical note. Your task is to extract diagnosis information.

# Background knowledge
{{knowledge}}
Your output should include:
    "Diagnosis": extract diagnosis names,
    "Datetime": date/time of diagnosis,
    "Status": status of present, history, or family history

Your output should follow a JSON format:
[
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    {"Diagnosis": <exact words as in the document>, "Datetime": <diagnosis datetime>, "Status": <one of "present", "history">},
    ...
]

Below is the medical note:
"{{note}}"
"""
ollama = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")
extractor = BasicFrameExtractor(ollama, prompt_template)
prompt_text = extractor._get_user_prompt(text_content={"knowledge": "<some text...>",
                                                       "note": "<some text...>"})
print(prompt_text)
```
Note that the keys in ```text_content``` must match the placeholder names defined in ```{{}}```.

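For the single-placeholder case described above, both calling conventions produce the same prompt. A small sketch, assuming ```single_ph_extractor``` was built with the one-placeholder (```{{input}}```) template from the demo at the top of this section:

```python
# With a single placeholder, a plain string and a one-key dictionary are equivalent;
# per the note above, the key name does not need to match the placeholder name.
prompt_text_a = single_ph_extractor._get_user_prompt(text_content="<some text...>")
prompt_text_b = single_ph_extractor._get_user_prompt(text_content={"any_key": "<some text...>"})
```
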
#### Prompt writing guide
The quality of the prompt template can significantly impact the performance of information extraction. The schema defined in a prompt template also depends on the choice of extractor, so when designing a prompt template schema, it is important to consider which extractor will be used.

The ```Extractor``` class provides documentation and examples for prompt template writing.

```python
from llm_ie.extractors import BasicFrameExtractor

print(BasicFrameExtractor.get_prompt_guide())
```

### Prompt Editor
The prompt editor is an LLM agent that reviews, comments on, and rewrites a prompt following the schema defined by each extractor. It is recommended to use the prompt editor iteratively:
1. start with a casual description of the task
2. have the prompt editor generate a prompt template as the starting point
3. manually revise the prompt template
4. have the prompt editor comment on/rewrite it

```python
from llm_ie.prompt_editor import PromptEditor
from llm_ie.extractors import BasicFrameExtractor
from llm_ie.engines import OllamaInferenceEngine

# Define an LLM inference engine
ollama = OllamaInferenceEngine(model_name="llama3.1:8b-instruct-q8_0")

# Define editor
editor = PromptEditor(ollama, BasicFrameExtractor)

# Have the editor generate an initial prompt template
initial_version = editor.rewrite("Extract treatment events from the discharge summary.")
print(initial_version)
```
The editor generates an ```initial_version``` as below:

```
# Task description
The paragraph below contains information about treatment events in a patient's discharge summary. Please carefully review it and extract the treatment events, including any relevant details such as medications or procedures. Note that each treatment event may be nested under a specific section of the discharge summary.

# Schema definition
Your output should contain:
    "TreatmentEvent" which is the name of the treatment,
    If applicable, "Medication" which is the medication used for the treatment,
    If applicable, "Procedure" which is the procedure performed during the treatment,
    "Evidence" which is the EXACT sentence in the text where you found the TreatmentEvent from

# Output format definition
Your output should follow JSON format, for example:
[
    {"TreatmentEvent": "<Treatment event name>", "Medication": "<name of medication>", "Procedure": "<name of procedure>", "Evidence": "<exact sentence from the text>"},
    {"TreatmentEvent": "<Treatment event name>", "Medication": "<name of medication>", "Procedure": "<name of procedure>", "Evidence": "<exact sentence from the text>"}
]

# Additional hints
Your output should be 100% based on the provided content. DO NOT output fake information.
If there is no specific medication or procedure, just omit the corresponding key.

# Input placeholder
Below is the discharge summary:
{{input}}
```
After manually reviewing it and considering our needs, we found a few issues:
1. The task description is not specific enough. This is expected since the editor does not have access to the real document.
2. Depending on the project, we might not need the evidence text. Outputting it consumes more output tokens.

Therefore, we manually revised the prompt template as below:

```python
manually_revised = """
# Task description
The paragraph below is a patient's discharge summary. Please carefully review it and extract the treatment events, including any relevant details such as medications or procedures. Note that each treatment event may be nested under a specific section of the discharge summary.

# Schema definition
Your output should contain:
    "TreatmentEvent" which is the name of the treatment,
    If applicable, "Medication" which is the medication used for the treatment,
    If applicable, "Procedure" which is the procedure performed during the treatment

# Output format definition
Your output should follow JSON format, for example:
[
    {"TreatmentEvent": "<Treatment event name>", "Medication": "<name of medication>", "Procedure": "<name of procedure>"},
    {"TreatmentEvent": "<Treatment event name>", "Medication": "<name of medication>", "Procedure": "<name of procedure>"}
]

# Additional hints
Your output should be 100% based on the provided content. DO NOT output fake information.
If there is no specific medication or procedure, just omit the corresponding key.

# Input placeholder
Below is the discharge summary:
{{input}}
"""
```

Now we have the editor comment on our revision.

```python
comment = editor.comment(manually_revised)

print(comment)
```
The comments from the editor are:

```
Here's an analysis of the draft prompt based on the prompt guideline:

**1. Task description**: The task description is clear and concise, explaining that the goal is to extract treatment events from a patient's discharge summary.

**2. Schema definition**: The schema definition is well-defined, specifying three keys: "TreatmentEvent", "Medication", and "Procedure". However, it would be helpful to provide more context on what constitutes a "treatment event" and how to distinguish between medication and procedure information.

**3. Output format definition**: The output format is correctly specified as JSON, with an example provided for clarity.

**4. Additional hints**: The additional hints are clear and concise, emphasizing the importance of extracting only real information from the text and omitting fake data.

**5. Input placeholder**: The input placeholder is present, but it would be helpful to provide a more detailed description of what type of discharge summary is expected (e.g., medical history, treatment plan, etc.).

Overall, the draft prompt is well-structured and easy to follow. However, providing more context and clarity on certain aspects, such as the definition of "treatment event" and the distinction between medication and procedure information, would make it even more effective.

Rating: 8/10

Recommendations:

* Provide a more detailed description of what constitutes a "treatment event".
* Clarify how to distinguish between medication and procedure information.
* Consider adding an example of a discharge summary to help illustrate the task.
```

After a few iterations of revision, we will have a high-quality prompt template for the information extraction pipeline.

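Each iteration is just another ```comment()``` or ```rewrite()``` round over the current draft; for example, one more rewrite pass over the manually revised template:

```python
# Optional: ask the editor to rewrite the manually revised draft once more.
next_version = editor.rewrite(manually_revised)
print(next_version)
```
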
### Extractor
An extractor implements a prompting method for information extraction. The ```BasicFrameExtractor``` directly prompts the LLM to generate a list of dictionaries; each dictionary is then post-processed into a frame. The ```ReviewFrameExtractor``` is based on the ```BasicFrameExtractor``` but adds a review step after the initial extraction to boost sensitivity and improve performance. The ```SentenceFrameExtractor``` gives the LLM the entire document upfront as a reference, then prompts it sentence by sentence and collects the per-sentence outputs. To learn about an extractor, use the class method ```get_prompt_guide()``` to print out the prompt guide.

<details>
<summary>BasicFrameExtractor</summary>

The ```BasicFrameExtractor``` directly prompts the LLM to generate a list of dictionaries. Each dictionary is then post-processed into a frame.

```python
from llm_ie.extractors import BasicFrameExtractor

print(BasicFrameExtractor.get_prompt_guide())
```

```
Prompt template design:
1. Task description
2. Schema definition
3. Output format definition
4. Additional hints
5. Input placeholder

Example:

# Task description
The paragraph below is from the Food and Drug Administration (FDA) Clinical Pharmacology Section of Labeling for Human Prescription Drug and Biological Products, Adverse reactions section. Please carefully review it and extract the adverse reactions and percentages. Note that each adverse reaction is nested under a clinical trial and potentially an arm. Your output should take that into consideration.

# Schema definition
Your output should contain:
    "ClinicalTrial" which is the name of the trial,
    If applicable, "Arm" which is the arm within the clinical trial,
    "AdverseReaction" which is the name of the adverse reaction,
    If applicable, "Percentage" which is the occurrence of the adverse reaction within the trial and arm,
    "Evidence" which is the EXACT sentence in the text where you found the AdverseReaction from

# Output format definition
Your output should follow JSON format, for example:
[
    {"ClinicalTrial": "<Clinical trial name or number>", "Arm": "<name of arm>", "AdverseReaction": "<Adverse reaction text>", "Percentage": "<a percent>", "Evidence": "<exact sentence from the text>"},
    {"ClinicalTrial": "<Clinical trial name or number>", "Arm": "<name of arm>", "AdverseReaction": "<Adverse reaction text>", "Percentage": "<a percent>", "Evidence": "<exact sentence from the text>"}
]

# Additional hints
Your output should be 100% based on the provided content. DO NOT output fake numbers.
If there is no specific arm, just omit the "Arm" key. If the percentage is not reported, just omit the "Percentage" key. The "Evidence" should always be provided.

# Input placeholder
Below is the Adverse reactions section:
{{input}}
```
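To pair the guide with actual usage, a minimal sketch follows; the names ```llm```, ```prompt_temp```, and ```text``` are assumed to be defined as in the examples elsewhere in this guide.

```python
# Build the extractor with a prompt template that follows the guide above,
# then extract frames keyed on the "AdverseReaction" entity.
extractor = BasicFrameExtractor(llm, prompt_temp)
frames = extractor.extract_frames(text_content=text, entity_key="AdverseReaction", stream=True)
```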
</details>

<details>
<summary>ReviewFrameExtractor</summary>

The ```ReviewFrameExtractor``` is based on the ```BasicFrameExtractor``` but adds a review step after the initial extraction to boost sensitivity and improve performance. The ```review_prompt``` and ```review_mode``` are required when constructing the ```ReviewFrameExtractor```.

There are two review modes:
1. **Addition mode**: add more frames while keeping the current ones. This is efficient for boosting recall.
2. **Revision mode**: regenerate frames (add new and delete existing).

Under the **Addition mode**, the ```review_prompt``` needs to instruct the LLM not to regenerate existing extractions:

*... You should ONLY add new diagnoses. DO NOT regenerate the entire answer.*

The ```review_mode``` should be set to ```review_mode="addition"``` (see the sketch after the revision example below).

Under the **Revision mode**, the ```review_prompt``` needs to instruct the LLM to regenerate:

*... Regenerate your output.*

The ```review_mode``` should be set to ```review_mode="revision"```.

```python
review_prompt = "Review the input and your output again. If you find some diagnosis was missed, add them to your output. Regenerate your output."

extractor = ReviewFrameExtractor(llm, prompt_temp, review_prompt, review_mode="revision")
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
```
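For completeness, a corresponding sketch for the addition mode; the review prompt wording here is illustrative.

```python
# Addition mode: keep the frames from the first pass and only add missed ones.
addition_review_prompt = ("Review the input and your output again. If you find some diagnosis "
                          "was missed, add it. You should ONLY add new diagnoses. "
                          "DO NOT regenerate the entire answer.")

extractor = ReviewFrameExtractor(llm, prompt_temp, addition_review_prompt, review_mode="addition")
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
```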
</details>

<details>
<summary>SentenceFrameExtractor</summary>

The ```SentenceFrameExtractor``` instructs the LLM to extract sentence by sentence. This helps ensure the accuracy of frame spans and prevents the LLM from overlooking sections or sentences. Empirically, this extractor results in better sensitivity than the ```BasicFrameExtractor``` on complex tasks.

```python
from llm_ie.extractors import SentenceFrameExtractor

extractor = SentenceFrameExtractor(llm, prompt_temp)
frames = extractor.extract_frames(text_content=text, entity_key="Diagnosis", stream=True)
```
</details>

@@ -0,0 +1,8 @@
llm_ie/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
llm_ie/data_types.py,sha256=AxqgfmPkYySDz7VuTWh8yDWofvZgdjgFiW9hihqInHc,6605
llm_ie/engines.py,sha256=AKxv6iPz_vUTv72srQYNIkmSBCOsvi3Yh6a9BVGbC4Y,6134
llm_ie/extractors.py,sha256=94uPhEtpYeingMY4WVLc8F6vw8hnSS8Wt-TMr5B5flg,22315
llm_ie/prompt_editor.py,sha256=doPjy5HFoZvP5Y1x_rcA_-wSQfqHkwKfETQd3uIh0GA,1212
llm_ie-0.1.0.dist-info/METADATA,sha256=E4IdxUaxCDmU-vGX2tqDhp4nDjt5aeArFxAagWkun5o,25687
llm_ie-0.1.0.dist-info/WHEEL,sha256=sP946D7jFCHeNz5Iq4fL4Lu-PrWrFsgfLXbbkciIZwg,88
llm_ie-0.1.0.dist-info/RECORD,,