tipeft-0.0.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- tipeft-0.0.1/Figure_1.jpg +0 -0
- tipeft-0.0.1/MANIFEST.in +3 -0
- tipeft-0.0.1/PKG-INFO +168 -0
- tipeft-0.0.1/README.md +140 -0
- tipeft-0.0.1/license.txt +7 -0
- tipeft-0.0.1/setup.cfg +4 -0
- tipeft-0.0.1/setup.py +45 -0
- tipeft-0.0.1/tipeft/IA3.py +464 -0
- tipeft-0.0.1/tipeft/__init__.py +1 -0
- tipeft-0.0.1/tipeft.egg-info/PKG-INFO +168 -0
- tipeft-0.0.1/tipeft.egg-info/SOURCES.txt +12 -0
- tipeft-0.0.1/tipeft.egg-info/dependency_links.txt +1 -0
- tipeft-0.0.1/tipeft.egg-info/requires.txt +10 -0
- tipeft-0.0.1/tipeft.egg-info/top_level.txt +1 -0
tipeft-0.0.1/Figure_1.jpg
ADDED

Binary file
tipeft-0.0.1/MANIFEST.in
ADDED
tipeft-0.0.1/PKG-INFO
ADDED
@@ -0,0 +1,168 @@

Metadata-Version: 2.1
Name: tipeft
Version: 0.0.1
Summary: Tabular-Infused Parameter Efficient Finetuning (tipeft)
Author: Charles Alba
Author-email: alba@wustl.edu
Keywords: Parameter Efficient Finetuning,PEFT,AI in Medicine,AI in Healthcare,Postoperative Risk Prediction,IA3,LORA
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: license.txt
Requires-Dist: numpy>=2.0.2
Requires-Dist: pandas>=2.2.2
Requires-Dist: scikit-learn>=1.5
Requires-Dist: tqdm>=4.67
Requires-Dist: torch==2.8.0
Requires-Dist: transformers==4.57.0
Requires-Dist: peft==0.17.1
Requires-Dist: accelerate==1.10.1
Requires-Dist: evaluate==0.4.2
Requires-Dist: datasets==2.21.0

# tipeft

**T**abular-**i**nfused **P**arameter **E**fficient **F**ine**t**uning (tipeft) is a novel PEFT method designed to infuse tabular features into the initialization of re-parameterization parameter-efficient finetuning (PEFT) methods. This gives the newly introduced PEFT parameters a well-informed, representation-rich starting point, whereas such parameters are usually introduced and initialized independently of the data.

![tipeft overview](Figure_1.jpg)

It is specifically designed for postoperative predictions in clinical care, where predictive and valuable preoperative tabular features are often under-utilized in language model finetuning. For now, it supports both `LoRA` and `IA3`.
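
In outline, the initialization works like this: one IA3 adapter is first finetuned per tabular feature, and the learned IA3 scaling vectors are then averaged to initialize the adapter that is finetuned on the clinical outcome. Below is a minimal sketch of the averaging step (illustrative only; `average_ia3_vectors` and `feature_adapters` are not part of the `tipeft` API):

```python
import torch

def average_ia3_vectors(feature_adapters):
    """feature_adapters: list of dicts mapping module name -> learned IA3 scaling vector."""
    averaged = {}
    for name in feature_adapters[0]:
        # Stack the same module's vector from every feature adapter and take the mean.
        averaged[name] = torch.stack([a[name] for a in feature_adapters]).mean(dim=0)
    return averaged
```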

## Requirements

### Dependencies

The following Python packages are required for `tipeft`:

- `torch`
- `transformers`
- `peft`
- `accelerate`
- `numpy`
- `pandas`
- `scikit-learn`
- `tqdm`

Install dependencies with:

```bash
pip install torch transformers peft accelerate numpy pandas scikit-learn tqdm
```

#### Note on PyTorch installation

Because PyTorch wheels vary by CUDA version and hardware, it is recommended to install PyTorch manually following the instructions at https://pytorch.org/.

### System Requirements

`tipeft` has been tested and verified on the following configuration:

| Component | Tested Version |
|-----------|----------------|
| OS | Windows 10 |
| Python | 3.9.19 |
| CUDA | 12.6 |

#### Important Notes

- **Environment**: Must be run in a Jupyter notebook. Running as a standalone Python script may cause multiprocessing issues.
- **CPU cores**: At least 10 CPU cores recommended (uses `Pool(processes=10)` internally).
- **GPU**: CUDA-compatible GPU required.
- **OS**: Tested on Windows. Linux/Mac compatibility not yet verified.

#### Known Compatibility Limitations

1. **Jupyter only** - Uses `tqdm.notebook`, which may not display correctly outside Jupyter.
2. **Multiprocessing** - May behave differently on Linux/Mac due to different multiprocessing backends.

If you encounter issues on a different setup, please open an issue with your system info.

#### GPU requirements

`tipeft` is designed for GPU acceleration.

- At least 1 GPU is recommended
- Suggested minimum: 16 GB VRAM
- Memory usage depends on:
  - sequence length
  - model size
  - batch size
  - PEFT configuration

## Installation

To install, simply run:

```bash
pip install tipeft
```

## Usage

### `train_tabular_infused_IA3`

Trains a tabular-infused IA3 model for binary classification.

```python
from tipeft import train_tabular_infused_IA3

model, tokenizer = train_tabular_infused_IA3(
    train=train_df,
    val=val_df,
    pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
    label_col="in_hospital_mortality",
    text_col="clinical_notes",
    columns_unique_labels_of_tabular_features={
        "gender": 2,
        "insurance": 3,
        "marital_status": 4,
        "anchor_age": 1,
        "anchor_year": 1
    },
    lr=0.001,
    num_epochs=5,
    lr_of_tabular_infused_features=0.0001
)
```

#### Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `train` | pandas.DataFrame | Training dataframe containing text, label, and tabular feature columns |
| `val` | pandas.DataFrame | Validation dataframe with the same structure as `train` |
| `pretrained_model_name` | str | Base model to fine-tune. Currently supports `"emilyalsentzer/Bio_ClinicalBERT"` or `"microsoft/biogpt"` |
| `label_col` | str | Column name of the binary outcome label (must contain `True`/`False` values) |
| `text_col` | str | Column name containing the clinical text |
| `columns_unique_labels_of_tabular_features` | dict | Dictionary mapping tabular feature names to their number of unique values. Use `1` for continuous features and `>1` for categorical features |
| `lr` | float | Learning rate for final model training (default: `0.001`) |
| `num_epochs` | int | Number of training epochs for the final model (default: `5`) |
| `lr_of_tabular_infused_features` | float | Learning rate for tabular feature pre-training (default: `0.0001`) |

#### Returns

| Return | Type | Description |
|--------|------|-------------|
| `model` | PeftModel | The trained IA3 model |
| `tokenizer` | AutoTokenizer | The tokenizer for the model |

#### Notes

- The `label_col` must contain boolean values (`True`/`False`)
- Categorical features should have `>1` unique labels in `columns_unique_labels_of_tabular_features`
- Continuous/numerical features should have `1` as their value in `columns_unique_labels_of_tabular_features` (a helper for building this dictionary is sketched after this list)
- Ensure all unique values in categorical columns appear in both the train and val sets
- The trained model is saved to `trained_models/IA3_{pretrained_model_name}_{label_col}`
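
A helper like the following can derive `columns_unique_labels_of_tabular_features` directly from a dataframe (an illustrative sketch; `build_feature_dict`, `categorical_cols`, and `numerical_cols` are not part of the `tipeft` API). Note that `nunique()` only sees the dataframe you pass, so compute it on data covering every category:

```python
import pandas as pd

def build_feature_dict(df: pd.DataFrame, categorical_cols, numerical_cols):
    """Map each categorical column to its unique-value count and each numerical column to 1."""
    feature_dict = {col: int(df[col].nunique()) for col in categorical_cols}
    feature_dict.update({col: 1 for col in numerical_cols})
    return feature_dict

# e.g. build_feature_dict(train_df,
#                         ["gender", "insurance", "marital_status"],
#                         ["anchor_age", "anchor_year"])
```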

## Questions?

Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)
tipeft-0.0.1/README.md
ADDED

@@ -0,0 +1,140 @@

(Content identical to the long description embedded in tipeft-0.0.1/PKG-INFO above.)
tipeft-0.0.1/license.txt
ADDED

@@ -0,0 +1,7 @@

Copyright 2026 Charles Alba

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
tipeft-0.0.1/setup.cfg
ADDED
tipeft-0.0.1/setup.py
ADDED

@@ -0,0 +1,45 @@

from setuptools import setup, find_packages
import codecs
import os

here = os.path.abspath(os.path.dirname(__file__))
readme_path = os.path.join(here, "README.md")
with codecs.open(readme_path, encoding="utf-8") as fh:
    long_description = fh.read()

VERSION = '0.0.1'
DESCRIPTION = 'Tabular-Infused Parameter Efficient Finetuning (tipeft)'
LONG_DESCRIPTION = "Tabular-Infused Parameter Efficient Finetuning (tipeft), specifically designed for postoperative risk prediction using clinical notes and complementary preoperative tabular features. Available for re-parameterization methods (LoRA and IA3)."

setup(
    name="tipeft",
    version=VERSION,
    author="Charles Alba",
    author_email="alba@wustl.edu",
    description=DESCRIPTION,
    long_description_content_type="text/markdown",
    long_description=long_description,
    packages=find_packages(),
    install_requires=[
        "numpy>=2.0.2",
        "pandas>=2.2.2",
        "scikit-learn>=1.5",
        "tqdm>=4.67",
        "torch==2.8.0",
        "transformers==4.57.0",
        "peft==0.17.1",
        "accelerate==1.10.1",
        "evaluate==0.4.2",
        "datasets==2.21.0",
    ],
    python_requires=">=3.9",
    keywords=["Parameter Efficient Finetuning", "PEFT", "AI in Medicine", "AI in Healthcare", "Postoperative Risk Prediction", "IA3", "LORA"],
    classifiers=[
        "Development Status :: 1 - Planning",
        "Intended Audience :: Education",
        "Intended Audience :: Science/Research",
        "Programming Language :: Python :: 3",
        "Operating System :: Unix",
        "Operating System :: MacOS :: MacOS X",
        "Operating System :: Microsoft :: Windows",
    ]
)

tipeft-0.0.1/tipeft/IA3.py
ADDED

@@ -0,0 +1,464 @@

import argparse
import os
import pandas as pd
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    IA3Config,
    IA3Model,
    PeftType)
from evaluate import load
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed
from tqdm.notebook import tqdm_notebook
import gc
from sklearn.metrics import (
    roc_auc_score, average_precision_score, roc_curve, precision_recall_curve,
    accuracy_score, precision_score, recall_score, f1_score
)
from multiprocessing import Pool
from functools import lru_cache
import json
import numpy as np
import shutil


def train_tabular_classification(data, label, model_name_or_path, lr=0.0001):
    """Finetune an IA3 adapter to predict a categorical tabular feature from the text column."""
    data = data.copy()

    class ClinicalDatasetForTabularPreopClassification(Dataset):
        def __init__(self, dataframe, tokenizer, label_to_id):
            self.dataframe = dataframe
            self.tokenizer = tokenizer
            self.label_to_id = label_to_id

        def __len__(self):
            return len(self.dataframe)

        def __getitem__(self, idx):
            sentence = self.dataframe.iloc[idx]['text']
            label = self.dataframe.iloc[idx]['label']

            # Tokenize the sentence
            inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
            inputs = {key: val.squeeze(0) for key, val in inputs.items()}

            # Convert label to a numeric format
            label_id = self.label_to_id[label]
            inputs['labels'] = torch.tensor(label_id, dtype=torch.long)

            return inputs

    torch.cuda.empty_cache()
    gc.collect()
    data["label"] = data[label]
    train = data[["text", "label"]]
    labels = list(set(list(data[label])))
    peft_type = PeftType.IA3
    device = "cuda"
    num_epochs = 2
    if model_name_or_path == "microsoft/biogpt":
        peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
    else:
        peft_config = IA3Config(task_type="SEQ_CLS")

    padding_side = "right"
    if model_name_or_path == "microsoft/biogpt":
        batch_size = 8
    else:
        batch_size = 16

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)
    label_to_id = {label: index for index, label in enumerate(labels)}
    # Model
    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=len(label_to_id), use_auth_token=False)

    train_dataset = ClinicalDatasetForTabularPreopClassification(train, tokenizer, label_to_id)

    def collate_fn(examples):
        return tokenizer.pad(examples, padding="longest", return_tensors="pt")

    train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
    model = get_peft_model(model, peft_config)
    optimizer = AdamW(params=model.parameters(), lr=lr)
    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
        num_training_steps=(len(train_dataloader) * num_epochs),
    )
    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(tqdm_notebook(train_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
    mmm = (model_name_or_path.split("/"))[-1]
    model.save_pretrained(f"pretrained_model/{mmm}/{label}")
    tokenizer.save_pretrained(f"pretrained_model/{mmm}/{label}")
    del label
    return model
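
# Each feature-trained adapter above is written to pretrained_model/{base}/{feature};
# parallel_advanced_initialization (below) reads these checkpoints back to build
# the averaged IA3 initialization for the outcome model.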

def train_tabular_regression(data, label, model_name_or_path, lr=0.0001):
    """Finetune an IA3 adapter to predict a continuous tabular feature from the text column."""
    data = data.copy()
    data[label] = data[label].astype(float)

    class ClinicalDatasetForTabularPreopReg(Dataset):
        def __init__(self, dataframe, tokenizer):
            self.dataframe = dataframe
            self.tokenizer = tokenizer

        def __len__(self):
            return len(self.dataframe)

        def __getitem__(self, idx):
            sentence = self.dataframe.iloc[idx]['text']
            # Label is now a continuous value
            label = self.dataframe.iloc[idx]['label']

            inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
            inputs = {key: val.squeeze(0) for key, val in inputs.items()}
            inputs['labels'] = torch.tensor(label, dtype=torch.float)

            return inputs

    torch.cuda.empty_cache()
    gc.collect()
    data["label"] = data[label]
    train = data[["text", "label"]]
    peft_type = PeftType.IA3
    device = "cuda"
    num_epochs = 2
    if model_name_or_path == "microsoft/biogpt":
        peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
    else:
        peft_config = IA3Config(task_type="SEQ_CLS")

    padding_side = "right"

    if model_name_or_path == "microsoft/biogpt":
        batch_size = 8
    else:
        batch_size = 16
    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)
    # Model (num_labels=1 makes the sequence-classification head perform regression)
    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=1, use_auth_token=False)

    train_dataset = ClinicalDatasetForTabularPreopReg(train, tokenizer)

    def collate_fn(examples):
        return tokenizer.pad(examples, padding="longest", return_tensors="pt")

    train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
    model = get_peft_model(model, peft_config)
    optimizer = AdamW(params=model.parameters(), lr=lr)
    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
        num_training_steps=(len(train_dataloader) * num_epochs),
    )
    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(tqdm_notebook(train_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

    mmm = (model_name_or_path.split("/"))[-1]
    model.save_pretrained(f"pretrained_model/{mmm}/{label}")
    tokenizer.save_pretrained(f"pretrained_model/{mmm}/{label}")
    del label
    return model


def load_and_accumulate(model_path, name, key, columns_unique_labels):
    """
    Loads IA3 modules that have been trained with respect to the tabular features to prepare for initialization.
    Designed to accommodate parallel loading and pooling.

    Parameters:
    - model_path (str): the file path (i.e. folder name) where the IA3-trained models (w.r.t. the tabular features) are stored.
    - name (str): fully qualified name of the module in the model being initialized; matched against the feature-trained model's modules.
    - key (str): the specific module entry we wish to extract from the IA3-trained model. We gather this entry across all models trained w.r.t. the tabular features
      and then use them to initialize the model that will be finetuned w.r.t. the outcome of interest.
    - columns_unique_labels (dict): maps each tabular feature to its number of unique values; used to reload a model whose classification head size differs.

    Returns:
    The module's weights from the specified feature-trained model and 1 if successful, or None and 0 if not
    (this is later used to average out all the weights of the specified module as part of the initialization process).
    """
    # This function is intended to be run in a separate process
    # Load model and accumulate parameter data
    try:
        # note: if this does not work, add an if-else condition that determines whether the model is BERT, GPT, or Llama (may require an additional parameter to be accepted);
        # it would then use BertForSequenceClassification, BioGptForSequenceClassification, or LlamaForSequenceClassification
        model = AutoModelForSequenceClassification.from_pretrained(model_path, ignore_mismatched_sizes=True, output_attentions=False, output_hidden_states=False, use_auth_token=False)
    except RuntimeError:
        path_base = os.path.basename(model_path)
        num_labels = columns_unique_labels[path_base]
        model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_labels, use_auth_token=False)

    standardized_name = name.replace("base_model.model.", "")
    corresponding_module = dict(model.named_modules()).get(standardized_name)

    if corresponding_module and hasattr(corresponding_module, 'ia3_l'):
        param = corresponding_module.ia3_l[key]
        return param.data.clone(), 1  # Return parameter data and count
    else:
        print("Warning: model of a different architecture")
        return None, 0

def parallel_advanced_initialization(model1, columns_unique_labels, name_of_tabular_feature_based_model):
    """
    Initializes the trainable IA3 modules with respect to the tabular features.
    It extracts and pools the trainable modules from the IA3-trained models (that have been trained w.r.t. the tabular features)
    to prepare the model (that will be tuned w.r.t. the outcome) for IA3 training.
    Designed to be executed in parallel.

    Parameters:
    - model1 (huggingface pretrained model): the model that is ready for IA3 training, but whose trainable parameters are defaulted to 1 per IA3.
    - columns_unique_labels (dict): dictionary of tabular feature names and their respective number of unique values if the column is categorical; if it is continuous, the value is 1.
      Needed to load the respective model that has been trained with respect to the tabular feature.
    - name_of_tabular_feature_based_model (str): base model of our model of interest. Should be one of three values: Bio_ClinicalBERT, biogpt, or BioMedGPT-LM-7B.

    Returns:
    The model with its trainable weights initialized with respect to the tabular features.
    """
    model_dir = f'pretrained_model/{name_of_tabular_feature_based_model}'
    list_of_model_names = [name for name in os.listdir(model_dir) if os.path.isdir(os.path.join(model_dir, name))]
    # list_of_model_names = [item for item in list_of_model_names if item not in ['adv_init_model']]
    models_to_average_paths = [os.path.join(model_dir, name) for name in list_of_model_names]

    for name, module in tqdm_notebook(model1.named_modules()):
        if hasattr(module, 'ia3_l'):
            ia3_l_dict = module.ia3_l

            # Prepare arguments for parallel processing
            args = [(model_path, name, key, columns_unique_labels) for model_path in models_to_average_paths for key, _ in ia3_l_dict.items()]

            with Pool(processes=10) as pool:  # edit based on the number of CPUs available
                results = pool.starmap(load_and_accumulate, args)

            # Aggregate results
            params_sum = None
            models_count = 0
            for result, count in tqdm_notebook(results):
                if result is not None:
                    if params_sum is None:
                        params_sum = result
                    else:
                        params_sum += result
                    models_count += count

            # Update parameters: copy the averaged IA3 vector into this module
            if models_count > 0:
                with torch.no_grad():
                    average_param = params_sum / models_count
                    for _, param1 in ia3_l_dict.items():
                        param1.data.copy_(average_param)

    return model1
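
# Worked example of the pooling above (illustrative numbers): if two feature-trained
# adapters hold ia3_l vectors [1.0, 1.2] and [0.8, 1.0] for the same module, the
# outcome model's corresponding ia3_l vector is initialized to their mean,
# [0.9, 1.1], rather than the IA3 default of all ones.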

def train_tabular_infused_IA3(train, val, pretrained_model_name, label_col, text_col, columns_unique_labels_of_tabular_features, lr=0.001, num_epochs=5, lr_of_tabular_infused_features=0.0001):
    """
    Trains a tabular-infused IA3 model for binary classification.

    First finetunes one IA3 adapter per tabular feature (regression for continuous
    features, classification for categorical ones), then averages those adapters'
    IA3 vectors to initialize the adapter that is finetuned on the outcome label.

    Parameters:
    - train (pandas.DataFrame): training dataframe containing text, label, and tabular feature columns.
    - val (pandas.DataFrame): validation dataframe with the same structure as train.
    - pretrained_model_name (str): base model to finetune (e.g. "emilyalsentzer/Bio_ClinicalBERT" or "microsoft/biogpt").
    - label_col (str): column name of the binary outcome label (True/False values).
    - text_col (str): column name containing the clinical text.
    - columns_unique_labels_of_tabular_features (dict): maps tabular feature names to their number of unique values (1 for continuous features, >1 for categorical).
    - lr (float): learning rate for final model training.
    - num_epochs (int): number of training epochs for the final model.
    - lr_of_tabular_infused_features (float): learning rate for the tabular feature pre-training.

    Returns:
    The trained PEFT model and its tokenizer.
    """
    train = train.copy()
    val = val.copy()

    train["label"] = train[label_col]
    train["text"] = train[text_col]
    val["label"] = val[label_col]
    val["text"] = val[text_col]

    train = train.drop(columns=[label_col, text_col])
    val = val.drop(columns=[label_col, text_col])

    list_numerical = [k for k, v in columns_unique_labels_of_tabular_features.items() if v == 1]
    list_categorical = [k for k, v in columns_unique_labels_of_tabular_features.items() if v > 1]

    for i in tqdm_notebook(list(list_numerical), desc="training numerical tabular-infused features"):
        train_tabular_regression(train, i, pretrained_model_name, lr_of_tabular_infused_features)

    for i in tqdm_notebook(list(list_categorical), desc="training categorical tabular-infused features"):
        train_tabular_classification(train, i, pretrained_model_name, lr_of_tabular_infused_features)

    new_model_name = f"IA3_{pretrained_model_name}_{label_col}"

    torch.manual_seed(42)

    label_to_id = {False: 0, True: 1}
    padding_side = "right"
    batch_size = 16
    local_dir = f"trained_models/{new_model_name}"

    class clinicalDataset(Dataset):
        def __init__(self, dataframe, tokenizer):
            self.dataframe = dataframe
            self.tokenizer = tokenizer

        def __len__(self):
            return len(self.dataframe)

        def __getitem__(self, idx):
            sentence = self.dataframe.iloc[idx]['text']
            label = self.dataframe.iloc[idx]['label']

            # Tokenize the sentence
            inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
            inputs = {key: val.squeeze(0) for key, val in inputs.items()}

            # Convert label to a numeric format
            label_id = label_to_id[label]
            inputs['labels'] = torch.tensor(label_id, dtype=torch.long)

            return inputs

    def collate_fn(examples):
        return tokenizer.pad(examples, padding="longest", return_tensors="pt")

    model_name_or_path = pretrained_model_name
    peft_type = PeftType.IA3
    device = "cuda"
    if pretrained_model_name == "microsoft/biogpt":
        peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
    else:
        peft_config = IA3Config(task_type="SEQ_CLS")

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=len(label_to_id), use_auth_token=False)
    train_dataset = clinicalDataset(train, tokenizer)
    val_dataset = clinicalDataset(val, tokenizer)
    # Use the collate_fn in the DataLoaders
    train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
    val_dataloader = DataLoader(val_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    mmm = (pretrained_model_name.split("/"))[-1]
    print("Initializing the PEFT parameters!")
    model = parallel_advanced_initialization(model, columns_unique_labels_of_tabular_features, mmm)
    model.save_pretrained(f"init_models/{mmm}/adv_init_model")
    torch.cuda.empty_cache()
    gc.collect()

    optimizer = AdamW(params=model.parameters(), lr=lr, weight_decay=0.1)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )
    model.to(device)
    f1_metric = load('f1', config_name='multiclass', average='weighted')
    accuracy_metric = load('accuracy')
    precision_metric = load('precision')
    recall_metric = load('recall')
    # Binary classification: accumulate predictions and true labels
    all_predictions = []
    all_references = []
    all_scores = []  # For AUROC and AUPRC
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(tqdm_notebook(train_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        for step, batch in enumerate(tqdm_notebook(val_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)

            # outputs.logits are raw scores for each class
            scores = torch.nn.functional.softmax(outputs.logits, dim=-1)[:, 1].cpu().numpy()  # Probability of class '1'
            predictions = outputs.logits.argmax(dim=-1).cpu().numpy()
            references = batch["labels"].cpu().numpy()

            all_scores.extend(scores)
            all_predictions.extend(predictions)
            all_references.extend(references)

            # Metric updates
            accuracy_metric.add_batch(predictions=predictions, references=references)
            f1_metric.add_batch(predictions=predictions, references=references)
            recall_metric.add_batch(predictions=predictions, references=references)
            precision_metric.add_batch(predictions=predictions, references=references)

    # Compute final metric values
    final_accuracy = accuracy_metric.compute()
    final_f1 = f1_metric.compute()
    final_recall = recall_metric.compute()
    final_precision = precision_metric.compute()

    # Calculate AUROC and AUPRC
    final_auroc = roc_auc_score(all_references, all_scores)
    final_auprc = average_precision_score(all_references, all_scores)

    # Output the metrics
    print("=" * 20)
    print("VALIDATION METRICS:")
    print(f"Accuracy: {final_accuracy['accuracy']}")
    print(f"Precision: {final_precision['precision']}")
    print(f"Recall: {final_recall['recall']}")
    print(f"F1 Score: {final_f1['f1']}")
    print(f"AUROC: {final_auroc}")
    print(f"AUPRC: {final_auprc}")

    # Save the model and tokenizer
    model.save_pretrained(f"trained_models/{new_model_name}")
    tokenizer.save_pretrained(f"trained_models/{new_model_name}")

    shutil.rmtree("pretrained_model", ignore_errors=True)

    return model, tokenizer
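
# Usage sketch (illustrative, not part of this module): reloading the adapter
# saved by train_tabular_infused_IA3 for inference. The base-model name and the
# output path below are assumptions that mirror the save logic above.
#
#   from peft import PeftModel
#   base = AutoModelForSequenceClassification.from_pretrained(
#       "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)
#   model = PeftModel.from_pretrained(
#       base, "trained_models/IA3_emilyalsentzer/Bio_ClinicalBERT_in_hospital_mortality")
#   tokenizer = AutoTokenizer.from_pretrained(
#       "trained_models/IA3_emilyalsentzer/Bio_ClinicalBERT_in_hospital_mortality")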

tipeft-0.0.1/tipeft/__init__.py
ADDED

@@ -0,0 +1 @@
from .IA3 import train_tabular_infused_IA3

tipeft-0.0.1/tipeft.egg-info/PKG-INFO
ADDED

@@ -0,0 +1,168 @@

(Content identical to tipeft-0.0.1/PKG-INFO above.)

tipeft-0.0.1/tipeft.egg-info/SOURCES.txt
ADDED

@@ -0,0 +1,12 @@
Figure_1.jpg
MANIFEST.in
README.md
license.txt
setup.py
tipeft/IA3.py
tipeft/__init__.py
tipeft.egg-info/PKG-INFO
tipeft.egg-info/SOURCES.txt
tipeft.egg-info/dependency_links.txt
tipeft.egg-info/requires.txt
tipeft.egg-info/top_level.txt

tipeft-0.0.1/tipeft.egg-info/dependency_links.txt
ADDED

@@ -0,0 +1 @@

tipeft-0.0.1/tipeft.egg-info/top_level.txt
ADDED

@@ -0,0 +1 @@
tipeft