tipeft-0.0.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
tipeft-0.0.1/Figure_1.jpg ADDED
Binary file
tipeft-0.0.1/MANIFEST.in ADDED
@@ -0,0 +1,3 @@
+ include README.md
+ include Figure_1.jpg
+ global-include *.py
tipeft-0.0.1/PKG-INFO ADDED
@@ -0,0 +1,168 @@
+ Metadata-Version: 2.1
+ Name: tipeft
+ Version: 0.0.1
+ Summary: Tabular-Infused Parameter Efficient Finetuning (tipeft)
+ Author: Charles Alba
+ Author-email: alba@wustl.edu
+ Keywords: Parameter Efficient Finetuning,PEFT,AI in Medicine,AI in Healthcare,Postoperative Risk Prediction,IA3,LORA
+ Classifier: Development Status :: 1 - Planning
+ Classifier: Intended Audience :: Education
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Operating System :: Unix
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Operating System :: Microsoft :: Windows
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ License-File: license.txt
+ Requires-Dist: numpy>=2.0.2
+ Requires-Dist: pandas>=2.2.2
+ Requires-Dist: scikit-learn>=1.5
+ Requires-Dist: tqdm>=4.67
+ Requires-Dist: torch==2.8.0
+ Requires-Dist: transformers==4.57.0
+ Requires-Dist: peft==0.17.1
+ Requires-Dist: accelerate==1.10.1
+ Requires-Dist: evaluate==0.4.2
+ Requires-Dist: datasets==2.21.0
+
+
+ # tipeft
+
+ **T**abular-**i**nfused **P**arameter **E**fficient **F**ine**t**uning (tipeft) is a novel PEFT method designed to infuse tabular features into the initialization of re-parameterization parameter-efficient finetuning (PEFT) methods. This gives the newly introduced PEFT parameters a well-informed, representative starting point; such parameters are otherwise typically introduced and initialized independently of the data.
+
+ ![Overview of tipeft framework](Figure_1.jpg)
+
+ It is specifically designed for postoperative prediction in clinical care, where predictive and valuable preoperative tabular features are often under-utilized in language model finetuning. For now, it supports both `LoRA` and `IA3`.
+
+
+ ## Requirements
+ ### Dependencies
+
+
+ The following Python packages are required for `tipeft`:
+
+ - `torch`
+ - `transformers`
+ - `peft`
+ - `accelerate`
+ - `numpy`
+ - `pandas`
+ - `scikit-learn`
+ - `tqdm`
+
+ Install dependencies with:
+
+ ```bash
+ pip install torch transformers peft accelerate numpy pandas scikit-learn tqdm
+ ```
+
+ #### Note on PyTorch installation
+ Because PyTorch wheels vary by CUDA version and hardware, it is recommended to install PyTorch manually following the instructions at:
+ https://pytorch.org/
+
+ ### System Requirements
+
+ `tipeft` has been tested and verified on the following configuration:
+
+ | Component | Tested Version |
+ |-----------|----------------|
+ | OS | Windows 10 |
+ | Python | 3.9.19 |
+ | CUDA | 12.6 |
+
+ #### Important Notes
+
+ - **Environment**: Must be run in a Jupyter notebook. Running as a standalone Python script may cause multiprocessing issues.
+ - **CPU cores**: At least 10 CPU cores recommended (uses `Pool(processes=10)` internally).
+ - **GPU**: CUDA-compatible GPU required.
+ - **OS**: Tested on Windows. Linux/Mac compatibility not yet verified.
+
+ #### Known Compatibility Limitations
+
+ 1. **Jupyter only** - Uses `tqdm.notebook`, which may not display correctly outside Jupyter.
+ 2. **Multiprocessing** - May behave differently on Linux/Mac due to different multiprocessing backends.
+
+ If you encounter issues on a different setup, please open an issue with your system info.
+
+ #### GPU requirements
+
+ `tipeft` is designed for GPU acceleration.
+ - At least 1 GPU is recommended
+ - Suggested minimum: 16GB VRAM
+ - Memory usage depends on:
+   - sequence length
+   - model size
+   - batch size
+   - PEFT configuration
+
+
+
+ ## Installation
+ To install from PyPI, run:
+ ```bash
+ pip install tipeft
+ ```
+
+
+ ## Usage
+
+ ### `train_tabular_infused_IA3`
+
+ Trains a tabular-infused IA3 model for binary classification.
+
+ ```python
+ from tipeft import train_tabular_infused_IA3
+
+ model, tokenizer = train_tabular_infused_IA3(
+     train=train_df,
+     val=val_df,
+     pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
+     label_col="in_hospital_mortality",
+     text_col="clinical_notes",
+     columns_unique_labels_of_tabular_features={
+         "gender": 2,
+         "insurance": 3,
+         "marital_status": 4,
+         "anchor_age": 1,
+         "anchor_year": 1
+     },
+     lr=0.001,
+     num_epochs=5,
+     lr_of_tabular_infused_features=0.0001
+ )
+ ```
+
+ #### Parameters
+
+ | Parameter | Type | Description |
+ |-----------|------|-------------|
+ | `train` | pandas.DataFrame | Training dataframe containing text, label, and tabular feature columns |
+ | `val` | pandas.DataFrame | Validation dataframe with the same structure as `train` |
+ | `pretrained_model_name` | str | Base model to fine-tune. Currently supports: `"emilyalsentzer/Bio_ClinicalBERT"` or `"microsoft/biogpt"` |
+ | `label_col` | str | Column name of the binary outcome label (must contain `True`/`False` values) |
+ | `text_col` | str | Column name containing the clinical text |
+ | `columns_unique_labels_of_tabular_features` | dict | Dictionary mapping tabular feature names to their number of unique values. Use `1` for continuous features, `>1` for categorical features |
+ | `lr` | float | Learning rate for final model training (default: `0.001`) |
+ | `num_epochs` | int | Number of training epochs for the final model (default: `5`) |
+ | `lr_of_tabular_infused_features` | float | Learning rate for tabular feature pre-training (default: `0.0001`) |
+
+ #### Returns
+
+ | Return | Type | Description |
+ |--------|------|-------------|
+ | `model` | PeftModel | The trained IA3 model |
+ | `tokenizer` | AutoTokenizer | The tokenizer for the model |
+
+ #### Notes
+
+ - The `label_col` must contain boolean values (`True`/`False`)
+ - Categorical features should have `>1` unique labels in `columns_unique_labels_of_tabular_features`
+ - Continuous/numerical features should have `1` as their value in `columns_unique_labels_of_tabular_features`
+ - Ensure all unique values in categorical columns appear in both the train and val sets
+ - The trained model is saved to `trained_models/IA3_{pretrained_model_name}_{label_col}`
+
+
+ ## Questions?
+
+ Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)
tipeft-0.0.1/README.md ADDED
@@ -0,0 +1,140 @@
+
+ # tipeft
+
+ **T**abular-**i**nfused **P**arameter **E**fficient **F**ine**t**uning (tipeft) is a novel PEFT method designed to infuse tabular features into the initialization of re-parameterization parameter-efficient finetuning (PEFT) methods. This gives the newly introduced PEFT parameters a well-informed, representative starting point; such parameters are otherwise typically introduced and initialized independently of the data.
+
+ ![Overview of tipeft framework](Figure_1.jpg)
+
+ It is specifically designed for postoperative prediction in clinical care, where predictive and valuable preoperative tabular features are often under-utilized in language model finetuning. For now, it supports both `LoRA` and `IA3`.
+
+
+ ## Requirements
+ ### Dependencies
+
+
+ The following Python packages are required for `tipeft`:
+
+ - `torch`
+ - `transformers`
+ - `peft`
+ - `accelerate`
+ - `numpy`
+ - `pandas`
+ - `scikit-learn`
+ - `tqdm`
+
+ Install dependencies with:
+
+ ```bash
+ pip install torch transformers peft accelerate numpy pandas scikit-learn tqdm
+ ```
+
+ #### Note on PyTorch installation
+ Because PyTorch wheels vary by CUDA version and hardware, it is recommended to install PyTorch manually following the instructions at:
+ https://pytorch.org/
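+
+ For example, for CUDA 12.6 (the tested version below), a matching wheel can usually be installed from PyTorch's CUDA-specific index; treat the exact index URL as an assumption and confirm it with the selector on pytorch.org:
+
+ ```bash
+ # assumed index URL for CUDA 12.6 wheels; verify at https://pytorch.org/
+ pip install torch --index-url https://download.pytorch.org/whl/cu126
+ ```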
+
+ ### System Requirements
+
+ `tipeft` has been tested and verified on the following configuration:
+
+ | Component | Tested Version |
+ |-----------|----------------|
+ | OS | Windows 10 |
+ | Python | 3.9.19 |
+ | CUDA | 12.6 |
+
+ #### Important Notes
+
+ - **Environment**: Must be run in a Jupyter notebook. Running as a standalone Python script may cause multiprocessing issues (see the sketch after this list).
+ - **CPU cores**: At least 10 CPU cores recommended (uses `Pool(processes=10)` internally).
+ - **GPU**: CUDA-compatible GPU required.
+ - **OS**: Tested on Windows. Linux/Mac compatibility not yet verified.
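+
+ If you do try to run it as a script (untested, per the note above), the standard Windows-safe pattern is to keep the call under the main guard so `multiprocessing` can spawn workers; a minimal, hypothetical sketch:
+
+ ```python
+ from tipeft import train_tabular_infused_IA3
+
+ if __name__ == "__main__":  # required on Windows for multiprocessing.Pool
+     model, tokenizer = train_tabular_infused_IA3(
+         train=train_df,  # placeholder dataframes; see Usage below
+         val=val_df,
+         pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
+         label_col="in_hospital_mortality",
+         text_col="clinical_notes",
+         columns_unique_labels_of_tabular_features={"gender": 2, "anchor_age": 1},
+     )
+ ```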
+
+ #### Known Compatibility Limitations
+
+ 1. **Jupyter only** - Uses `tqdm.notebook`, which may not display correctly outside Jupyter.
+ 2. **Multiprocessing** - May behave differently on Linux/Mac due to different multiprocessing backends.
+
+ If you encounter issues on a different setup, please open an issue with your system info.
+
+ #### GPU requirements
+
+ `tipeft` is designed for GPU acceleration.
+ - At least 1 GPU is recommended
+ - Suggested minimum: 16GB VRAM
+ - Memory usage depends on:
+   - sequence length
+   - model size
+   - batch size
+   - PEFT configuration
+
+
+
+ ## Installation
+ To install from PyPI, run:
+ ```bash
+ pip install tipeft
+ ```
+
+
+ ## Usage
+
+ ### `train_tabular_infused_IA3`
+
+ Trains a tabular-infused IA3 model for binary classification.
+
+ ```python
+ from tipeft import train_tabular_infused_IA3
+
+ model, tokenizer = train_tabular_infused_IA3(
+     train=train_df,
+     val=val_df,
+     pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
+     label_col="in_hospital_mortality",
+     text_col="clinical_notes",
+     columns_unique_labels_of_tabular_features={
+         "gender": 2,
+         "insurance": 3,
+         "marital_status": 4,
+         "anchor_age": 1,
+         "anchor_year": 1
+     },
+     lr=0.001,
+     num_epochs=5,
+     lr_of_tabular_infused_features=0.0001
+ )
+ ```
+
+ #### Parameters
+
+ | Parameter | Type | Description |
+ |-----------|------|-------------|
+ | `train` | pandas.DataFrame | Training dataframe containing text, label, and tabular feature columns |
+ | `val` | pandas.DataFrame | Validation dataframe with the same structure as `train` |
+ | `pretrained_model_name` | str | Base model to fine-tune. Currently supports: `"emilyalsentzer/Bio_ClinicalBERT"` or `"microsoft/biogpt"` |
+ | `label_col` | str | Column name of the binary outcome label (must contain `True`/`False` values) |
+ | `text_col` | str | Column name containing the clinical text |
+ | `columns_unique_labels_of_tabular_features` | dict | Dictionary mapping tabular feature names to their number of unique values. Use `1` for continuous features, `>1` for categorical features |
+ | `lr` | float | Learning rate for final model training (default: `0.001`) |
+ | `num_epochs` | int | Number of training epochs for the final model (default: `5`) |
+ | `lr_of_tabular_infused_features` | float | Learning rate for tabular feature pre-training (default: `0.0001`) |
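+
+ As a convenience, this dictionary can be derived from the dataframe itself. A minimal sketch; the two column lists are illustrative placeholders for your own features:
+
+ ```python
+ categorical_cols = ["gender", "insurance", "marital_status"]
+ continuous_cols = ["anchor_age", "anchor_year"]
+ # categorical features map to their number of unique levels, continuous features to 1
+ feature_dict = {c: int(train_df[c].nunique()) for c in categorical_cols}
+ feature_dict.update({c: 1 for c in continuous_cols})
+ ```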
+
+ #### Returns
+
+ | Return | Type | Description |
+ |--------|------|-------------|
+ | `model` | PeftModel | The trained IA3 model |
+ | `tokenizer` | AutoTokenizer | The tokenizer for the model |
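+
+ The returned `model` and `tokenizer` can be used for inference right away. A minimal sketch, assuming a CUDA device and a single clinical note:
+
+ ```python
+ import torch
+
+ note = "Sample clinical note text"  # placeholder input
+ inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt").to("cuda")
+ model.eval()
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ prob_true = torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 corresponds to the True label
+ ```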
+
+ #### Notes
+
+ - The `label_col` must contain boolean values (`True`/`False`)
+ - Categorical features should have `>1` unique labels in `columns_unique_labels_of_tabular_features`
+ - Continuous/numerical features should have `1` as their value in `columns_unique_labels_of_tabular_features`
+ - Ensure all unique values in categorical columns appear in both the train and val sets
+ - The trained model is saved to `trained_models/IA3_{pretrained_model_name}_{label_col}` and can be reloaded later (see the sketch after this list)
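+
+ A minimal reload sketch, assuming `peft`'s `AutoPeftModelForSequenceClassification` and the default save location (the example path below is hypothetical and simply follows the pattern above):
+
+ ```python
+ from peft import AutoPeftModelForSequenceClassification
+ from transformers import AutoTokenizer
+
+ path = "trained_models/IA3_emilyalsentzer/Bio_ClinicalBERT_in_hospital_mortality"
+ model = AutoPeftModelForSequenceClassification.from_pretrained(path, num_labels=2)
+ tokenizer = AutoTokenizer.from_pretrained(path)
+ ```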
+
+
+ ## Questions?
+
+ Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)
tipeft-0.0.1/license.txt ADDED
@@ -0,0 +1,7 @@
+ Copyright 2026 Charles Alba
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
tipeft-0.0.1/setup.cfg ADDED
@@ -0,0 +1,4 @@
+ [egg_info]
+ tag_build =
+ tag_date = 0
+
tipeft-0.0.1/setup.py ADDED
@@ -0,0 +1,45 @@
+ from setuptools import setup, find_packages
+ import codecs
+ import os
+
+ here = os.path.abspath(os.path.dirname(__file__))
+ readme_path = os.path.join(here, "README.md")
+ with codecs.open(readme_path, encoding="utf-8") as fh:
+     long_description = fh.read()
+
+ VERSION = '0.0.1'
+ DESCRIPTION = 'Tabular-Infused Parameter Efficient Finetuning (tipeft)'
+ # Note: LONG_DESCRIPTION is currently unused; the README contents above are passed to setup() as long_description.
+ LONG_DESCRIPTION = "Tabular-Infused Parameter Efficient Finetuning (tipeft), specifically designed for postoperative risk prediction using clinical notes and complementary preoperative tabular features. Available for re-parameterization methods (LoRA and IA3)."
+
+ setup(
+     name="tipeft",
+     version=VERSION,
+     author="Charles Alba",
+     author_email="alba@wustl.edu",
+     description=DESCRIPTION,
+     long_description_content_type="text/markdown",
+     long_description=long_description,
+     packages=find_packages(),
+     install_requires=[
+         "numpy>=2.0.2",
+         "pandas>=2.2.2",
+         "scikit-learn>=1.5",
+         "tqdm>=4.67",
+         "torch==2.8.0",
+         "transformers==4.57.0",
+         "peft==0.17.1",
+         "accelerate==1.10.1",
+         "evaluate==0.4.2",
+         "datasets==2.21.0",
+     ],
+     python_requires=">=3.9",
+     keywords=["Parameter Efficient Finetuning", "PEFT", "AI in Medicine", "AI in Healthcare", "Postoperative Risk Prediction", "IA3", "LORA"],
+     classifiers=[
+         "Development Status :: 1 - Planning",
+         "Intended Audience :: Education",
+         "Intended Audience :: Science/Research",
+         "Programming Language :: Python :: 3",
+         "Operating System :: Unix",
+         "Operating System :: MacOS :: MacOS X",
+         "Operating System :: Microsoft :: Windows",
+     ]
+ )
tipeft-0.0.1/tipeft/IA3.py ADDED
@@ -0,0 +1,464 @@
+ import gc
+ import os
+ import shutil
+ from multiprocessing import Pool
+
+ import torch
+ from torch.optim import AdamW
+ from torch.utils.data import DataLoader, Dataset
+ from peft import get_peft_model, IA3Config
+ from evaluate import load
+ from transformers import (
+     AutoModelForSequenceClassification,
+     AutoTokenizer,
+     get_linear_schedule_with_warmup,
+ )
+ from tqdm.notebook import tqdm_notebook
+ from sklearn.metrics import roc_auc_score, average_precision_score
+
+
+ def train_tabular_classification(data, label, model_name_or_path, lr=0.0001):
+     """Finetune an IA3 adapter to predict a single categorical tabular feature from the text."""
+     data = data.copy()
+
+     class ClinicalDatasetForTabularPreopClassification(Dataset):
+         def __init__(self, dataframe, tokenizer, label_to_id):
+             self.dataframe = dataframe
+             self.tokenizer = tokenizer
+             self.label_to_id = label_to_id
+
+         def __len__(self):
+             return len(self.dataframe)
+
+         def __getitem__(self, idx):
+             sentence = self.dataframe.iloc[idx]['text']
+             label = self.dataframe.iloc[idx]['label']
+
+             # Tokenize the sentence
+             inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
+             inputs = {key: val.squeeze(0) for key, val in inputs.items()}
+
+             # Convert the label to its numeric id
+             label_id = self.label_to_id[label]
+             inputs['labels'] = torch.tensor(label_id, dtype=torch.long)
+
+             return inputs
+
+     torch.cuda.empty_cache()
+     gc.collect()
+     data["label"] = data[label]
+     train = data[["text", "label"]]
+     labels = list(set(data[label]))
+     device = "cuda"
+     num_epochs = 2
+     if model_name_or_path == "microsoft/biogpt":
+         peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
+         batch_size = 8
+     else:
+         peft_config = IA3Config(task_type="SEQ_CLS")
+         batch_size = 16
+     padding_side = "right"
+
+     # Initialize the tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)
+     label_to_id = {label: index for index, label in enumerate(labels)}
+     # Model
+     model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=len(label_to_id), use_auth_token=False)
+
+     train_dataset = ClinicalDatasetForTabularPreopClassification(train, tokenizer, label_to_id)
+
+     def collate_fn(examples):
+         return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+     train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
+     model = get_peft_model(model, peft_config)
+     optimizer = AdamW(params=model.parameters(), lr=lr)
+     # Instantiate scheduler
+     lr_scheduler = get_linear_schedule_with_warmup(
+         optimizer=optimizer,
+         num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
+         num_training_steps=(len(train_dataloader) * num_epochs),
+     )
+     model.to(device)
+     for epoch in range(num_epochs):
+         model.train()
+         for step, batch in enumerate(tqdm_notebook(train_dataloader)):
+             batch = {k: v.to(device) for k, v in batch.items()}
+             outputs = model(**batch)
+             loss = outputs.loss
+             loss.backward()
+             optimizer.step()
+             lr_scheduler.step()
+             optimizer.zero_grad()
+
+     # Save the feature-trained adapter; it is read back during initialization
+     mmm = (model_name_or_path.split("/"))[-1]
+     model.save_pretrained(f"pretrained_model/{mmm}/{label}")
+     tokenizer.save_pretrained(f"pretrained_model/{mmm}/{label}")
+     return model
+
+
+ def train_tabular_regression(data, label, model_name_or_path, lr=0.0001):
+     """Finetune an IA3 adapter to predict a single continuous tabular feature from the text."""
+     data = data.copy()
+     data[label] = data[label].astype(float)
+
+     class ClinicalDatasetForTabularPreopReg(Dataset):
+         def __init__(self, dataframe, tokenizer):
+             self.dataframe = dataframe
+             self.tokenizer = tokenizer
+
+         def __len__(self):
+             return len(self.dataframe)
+
+         def __getitem__(self, idx):
+             sentence = self.dataframe.iloc[idx]['text']
+             # Label is a continuous value
+             label = self.dataframe.iloc[idx]['label']
+
+             inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
+             inputs = {key: val.squeeze(0) for key, val in inputs.items()}
+             inputs['labels'] = torch.tensor(label, dtype=torch.float)
+
+             return inputs
+
+     torch.cuda.empty_cache()
+     gc.collect()
+     data["label"] = data[label]
+     train = data[["text", "label"]]
+     device = "cuda"
+     num_epochs = 2
+     if model_name_or_path == "microsoft/biogpt":
+         peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
+         batch_size = 8
+     else:
+         peft_config = IA3Config(task_type="SEQ_CLS")
+         batch_size = 16
+     padding_side = "right"
+
+     # Initialize the tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)
+     # Model: num_labels=1 turns the sequence-classification head into a regression head
+     model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=1, use_auth_token=False)
+
+     train_dataset = ClinicalDatasetForTabularPreopReg(train, tokenizer)
+
+     def collate_fn(examples):
+         return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+     train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
+     model = get_peft_model(model, peft_config)
+     optimizer = AdamW(params=model.parameters(), lr=lr)
+     # Instantiate scheduler
+     lr_scheduler = get_linear_schedule_with_warmup(
+         optimizer=optimizer,
+         num_warmup_steps=0.06 * (len(train_dataloader) * num_epochs),
+         num_training_steps=(len(train_dataloader) * num_epochs),
+     )
+     model.to(device)
+     for epoch in range(num_epochs):
+         model.train()
+         for step, batch in enumerate(tqdm_notebook(train_dataloader)):
+             batch = {k: v.to(device) for k, v in batch.items()}
+             outputs = model(**batch)
+             loss = outputs.loss
+             loss.backward()
+             optimizer.step()
+             lr_scheduler.step()
+             optimizer.zero_grad()
+
+     # Save the feature-trained adapter; it is read back during initialization
+     mmm = (model_name_or_path.split("/"))[-1]
+     model.save_pretrained(f"pretrained_model/{mmm}/{label}")
+     tokenizer.save_pretrained(f"pretrained_model/{mmm}/{label}")
+     return model
+
+
+ def load_and_accumulate(model_path, name, key, columns_unique_labels):
+     """
+     Loads IA3 modules that have been trained with respect to the tabular features, to prepare for initialization.
+     Designed to accommodate parallel loading and pooling.
+
+     Parameters:
+     - model_path (str): the file path (i.e. folder name) where the IA3-trained models (w.r.t. the tabular features) are stored.
+     - name (str): name of the tabular feature (which is also the name of the saved IA3-trained model).
+     - key (str): the specific module we wish to extract from the IA3-trained model. This module is gathered across all models
+       trained w.r.t. the tabular features and then used to initialize the model that will be finetuned w.r.t. the outcome of interest.
+     - columns_unique_labels (dict): mapping from tabular feature name to its number of unique labels, used to reload a model
+       whose classification-head size cannot be inferred.
+
+     Returns:
+     the module's weights from the specified feature-trained model, and 1 if successful or 0 if not
+     (this is later used to average the weights of the specified module as part of the initialization process).
+     """
+     # This function is intended to be run in a separate process.
+     # Load the feature-trained model and extract the requested IA3 parameter.
+     try:
+         # note: if this does not work, add an if-else that determines whether the model is BERT, GPT, or Llama (may require an additional parameter)
+         # and uses BertForSequenceClassification, BioGptForSequenceClassification, or LlamaForSequenceClassification accordingly
+         model = AutoModelForSequenceClassification.from_pretrained(model_path, ignore_mismatched_sizes=True, output_attentions=False, output_hidden_states=False, use_auth_token=False)
+     except RuntimeError:
+         path_base = os.path.basename(model_path)
+         num_labels = columns_unique_labels[path_base]
+         model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_labels, use_auth_token=False)
+
+     standardized_name = name.replace("base_model.model.", "")
+     corresponding_module = dict(model.named_modules()).get(standardized_name)
+
+     if corresponding_module is not None and hasattr(corresponding_module, 'ia3_l'):
+         param = corresponding_module.ia3_l[key]
+         return param.data.clone(), 1  # Return parameter data and count
+     else:
+         print("Warning: model of a different architecture")
+         return None, 0
+
+
+ def parallel_advanced_initialization(model1, columns_unique_labels, name_of_tabular_feature_based_model):
+     """
+     Initializes the trainable IA3 modules with respect to the tabular features.
+     It extracts and pools the trainable modules from the IA3-trained models (that have been trained w.r.t. the tabular features)
+     to prepare the model (that will be tuned w.r.t. the outcome) for IA3 training.
+     Designed to be executed in parallel.
+
+     Parameters:
+     - model1 (Hugging Face pretrained model): the model that is ready for IA3 training, with its trainable parameters still at the IA3 default of 1.
+     - columns_unique_labels (dict): dictionary mapping tabular feature names to the number of unique labels if the column is categorical, or 1 if it is continuous.
+       Needed to load the respective model that has been trained with respect to the tabular feature.
+     - name_of_tabular_feature_based_model (str): base model of our model of interest; should be one of Bio_ClinicalBERT, biogpt, or BioMedGPT-LM-7B.
+
+     Returns:
+     the model with its trainable weights initialized with respect to the tabular features.
+     """
+     model_dir = f'pretrained_model/{name_of_tabular_feature_based_model}'
+     list_of_model_names = [name for name in os.listdir(model_dir) if os.path.isdir(os.path.join(model_dir, name))]
+     models_to_average_paths = [os.path.join(model_dir, name) for name in list_of_model_names]
+
+     for name, module in tqdm_notebook(model1.named_modules()):
+         if hasattr(module, 'ia3_l'):
+             ia3_l_dict = module.ia3_l
+
+             # Prepare arguments for parallel processing: one task per (feature model, IA3 key) pair
+             args = [(model_path, name, key, columns_unique_labels) for model_path in models_to_average_paths for key, _ in ia3_l_dict.items()]
+
+             with Pool(processes=10) as pool:  # edit based on the number of CPUs available
+                 results = pool.starmap(load_and_accumulate, args)
+
+             # Aggregate results
+             params_sum = None
+             models_count = 0
+             for result, count in tqdm_notebook(results):
+                 if result is not None:
+                     if params_sum is None:
+                         params_sum = result
+                     else:
+                         params_sum += result
+                     models_count += count
+
+             # Update parameters: copy the element-wise average of the feature-trained IA3 vectors into this module
+             if models_count > 0:
+                 with torch.no_grad():
+                     average_param = params_sum / models_count
+                     for _, param1 in ia3_l_dict.items():
+                         param1.data.copy_(average_param)
+
+     return model1
+
+
+ def train_tabular_infused_IA3(train, val, pretrained_model_name, label_col, text_col, columns_unique_labels_of_tabular_features, lr=0.001, num_epochs=5, lr_of_tabular_infused_features=0.0001):
+     """
+     Trains a tabular-infused IA3 model for binary classification.
+
+     First finetunes one IA3 adapter per tabular feature (regression for continuous features,
+     classification for categorical ones), then averages those adapters' IA3 vectors to initialize
+     the final model, which is finetuned on the binary outcome in `label_col`.
+
+     Parameters:
+     - train (pandas.DataFrame): training dataframe containing text, label, and tabular feature columns.
+     - val (pandas.DataFrame): validation dataframe with the same structure as train.
+     - pretrained_model_name (str): base model to finetune, e.g. "emilyalsentzer/Bio_ClinicalBERT" or "microsoft/biogpt".
+     - label_col (str): column name of the binary outcome label (must contain True/False values).
+     - text_col (str): column name containing the clinical text.
+     - columns_unique_labels_of_tabular_features (dict): mapping from tabular feature names to their number of unique values (1 for continuous features, >1 for categorical features).
+     - lr (float): learning rate for the final model training.
+     - num_epochs (int): number of training epochs for the final model.
+     - lr_of_tabular_infused_features (float): learning rate for the tabular feature pre-training.
+
+     Returns:
+     (model, tokenizer): the trained IA3 model and its tokenizer.
+     """
+     train = train.copy()
+     val = val.copy()
+
+     train["label"] = train[label_col]
+     train["text"] = train[text_col]
+     val["label"] = val[label_col]
+     val["text"] = val[text_col]
+
+     train = train.drop(columns=[label_col, text_col])
+     val = val.drop(columns=[label_col, text_col])
+
+     list_numerical = [k for k, v in columns_unique_labels_of_tabular_features.items() if v == 1]
+     list_categorical = [k for k, v in columns_unique_labels_of_tabular_features.items() if v > 1]
+
+     for i in tqdm_notebook(list(list_numerical), desc="training numerical tabular-infused features"):
+         train_tabular_regression(train, i, pretrained_model_name, lr_of_tabular_infused_features)
+
+     for i in tqdm_notebook(list(list_categorical), desc="training categorical tabular-infused features"):
+         train_tabular_classification(train, i, pretrained_model_name, lr_of_tabular_infused_features)
+
+     new_model_name = f"IA3_{pretrained_model_name}_{label_col}"
+
+     torch.manual_seed(42)
+
+     class clinicalDataset(Dataset):
+         def __init__(self, dataframe, tokenizer):
+             self.dataframe = dataframe
+             self.tokenizer = tokenizer
+
+         def __len__(self):
+             return len(self.dataframe)
+
+         def __getitem__(self, idx):
+             sentence = self.dataframe.iloc[idx]['text']
+             label = self.dataframe.iloc[idx]['label']
+
+             # Tokenize the sentence
+             inputs = self.tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
+             inputs = {key: val.squeeze(0) for key, val in inputs.items()}
+
+             # Convert the boolean label to a numeric id
+             label_to_id = {False: 0, True: 1}
+             label_id = label_to_id[label]
+             inputs['labels'] = torch.tensor(label_id, dtype=torch.long)
+
+             return inputs
+
+     def collate_fn(examples):
+         return tokenizer.pad(examples, padding="longest", return_tensors="pt")
+
+     padding_side = "right"
+     batch_size = 16
+     model_name_or_path = pretrained_model_name
+     device = "cuda"
+     if pretrained_model_name == "microsoft/biogpt":
+         peft_config = IA3Config(task_type="SEQ_CLS", target_modules=["k_proj", "v_proj", "fc1", "fc2"], feedforward_modules=["fc1", "fc2"])
+     else:
+         peft_config = IA3Config(task_type="SEQ_CLS")
+
+     # Initialize the tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side=padding_side, use_auth_token=False)
+
+     label_to_id = {False: 0, True: 1}
+     # Model
+     model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=len(label_to_id), use_auth_token=False)
+     train_dataset = clinicalDataset(train, tokenizer)
+     val_dataset = clinicalDataset(val, tokenizer)
+     # Use the collate_fn in the DataLoaders
+     train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
+     val_dataloader = DataLoader(val_dataset, shuffle=True, collate_fn=collate_fn, batch_size=batch_size)
+     model = get_peft_model(model, peft_config)
+     model.print_trainable_parameters()
+     mmm = (pretrained_model_name.split("/"))[-1]
+     print("Initializing the PEFT parameters!")
+     model = parallel_advanced_initialization(model, columns_unique_labels_of_tabular_features, mmm)
+     model.save_pretrained(f"init_models/{mmm}/adv_init_model")
+     torch.cuda.empty_cache()
+     gc.collect()
+
+     optimizer = AdamW(params=model.parameters(), lr=lr, weight_decay=0.1)
+     lr_scheduler = get_linear_schedule_with_warmup(
+         optimizer=optimizer,
+         num_warmup_steps=0,
+         num_training_steps=(len(train_dataloader) * num_epochs),
+     )
+     model.to(device)
+     f1_metric = load('f1', config_name='multiclass', average='weighted')
+     accuracy_metric = load('accuracy')
+     precision_metric = load('precision')
+     recall_metric = load('recall')
+     # Binary classification: accumulate predictions, scores, and true labels.
+     # Note: these lists are not reset between epochs, so the metrics reported
+     # below pool the validation passes of every epoch.
+     all_predictions = []
+     all_references = []
+     all_scores = []  # For AUROC and AUPRC
+     for epoch in range(num_epochs):
+         model.train()
+         for step, batch in enumerate(tqdm_notebook(train_dataloader)):
+             batch = {k: v.to(device) for k, v in batch.items()}
+             outputs = model(**batch)
+             loss = outputs.loss
+             loss.backward()
+             optimizer.step()
+             lr_scheduler.step()
+             optimizer.zero_grad()
+
+         model.eval()
+         for step, batch in enumerate(tqdm_notebook(val_dataloader)):
+             batch = {k: v.to(device) for k, v in batch.items()}
+             with torch.no_grad():
+                 outputs = model(**batch)
+
+             # outputs.logits are raw scores for each class
+             scores = torch.nn.functional.softmax(outputs.logits, dim=-1)[:, 1].cpu().numpy()  # probability of class '1'
+             predictions = outputs.logits.argmax(dim=-1).cpu().numpy()
+             references = batch["labels"].cpu().numpy()
+
+             all_scores.extend(scores)
+             all_predictions.extend(predictions)
+             all_references.extend(references)
+
+             # Metric updates
+             accuracy_metric.add_batch(predictions=predictions, references=references)
+             f1_metric.add_batch(predictions=predictions, references=references)
+             recall_metric.add_batch(predictions=predictions, references=references)
+             precision_metric.add_batch(predictions=predictions, references=references)
+
+     # Compute final metric values
+     final_accuracy = accuracy_metric.compute()
+     final_f1 = f1_metric.compute()
+     final_recall = recall_metric.compute()
+     final_precision = precision_metric.compute()
+
+     # Calculate AUROC and AUPRC
+     final_auroc = roc_auc_score(all_references, all_scores)
+     final_auprc = average_precision_score(all_references, all_scores)
+
+     # Output the metrics
+     print("=" * 20)
+     print("VALIDATION METRICS:")
+     print(f"Accuracy: {final_accuracy['accuracy']}")
+     print(f"Precision: {final_precision['precision']}")
+     print(f"Recall: {final_recall['recall']}")
+     print(f"F1 Score: {final_f1['f1']}")
+     print(f"AUROC: {final_auroc}")
+     print(f"AUPRC: {final_auprc}")
+
+     # Save the model and tokenizer
+     model.save_pretrained(f"trained_models/{new_model_name}")
+     tokenizer.save_pretrained(f"trained_models/{new_model_name}")
+
+     # Clean up the intermediate feature-trained adapters
+     shutil.rmtree("pretrained_model", ignore_errors=True)
+
+     return model, tokenizer
tipeft-0.0.1/tipeft/__init__.py ADDED
@@ -0,0 +1 @@
+ from .IA3 import train_tabular_infused_IA3
tipeft-0.0.1/tipeft.egg-info/PKG-INFO ADDED
@@ -0,0 +1,168 @@
+ Metadata-Version: 2.1
+ Name: tipeft
+ Version: 0.0.1
+ Summary: Tabular-Infused Parameter Efficient Finetuning (tipeft)
+ Author: Charles Alba
+ Author-email: alba@wustl.edu
+ Keywords: Parameter Efficient Finetuning,PEFT,AI in Medicine,AI in Healthcare,Postoperative Risk Prediction,IA3,LORA
+ Classifier: Development Status :: 1 - Planning
+ Classifier: Intended Audience :: Education
+ Classifier: Intended Audience :: Science/Research
+ Classifier: Programming Language :: Python :: 3
+ Classifier: Operating System :: Unix
+ Classifier: Operating System :: MacOS :: MacOS X
+ Classifier: Operating System :: Microsoft :: Windows
+ Requires-Python: >=3.9
+ Description-Content-Type: text/markdown
+ License-File: license.txt
+ Requires-Dist: numpy>=2.0.2
+ Requires-Dist: pandas>=2.2.2
+ Requires-Dist: scikit-learn>=1.5
+ Requires-Dist: tqdm>=4.67
+ Requires-Dist: torch==2.8.0
+ Requires-Dist: transformers==4.57.0
+ Requires-Dist: peft==0.17.1
+ Requires-Dist: accelerate==1.10.1
+ Requires-Dist: evaluate==0.4.2
+ Requires-Dist: datasets==2.21.0
+
+
+ # tipeft
+
+ **T**abular-**i**nfused **P**arameter **E**fficient **F**ine**t**uning (tipeft) is a novel PEFT method designed to infuse tabular features into the initialization of re-parameterization parameter-efficient finetuning (PEFT) methods. This gives the newly introduced PEFT parameters a well-informed, representative starting point; such parameters are otherwise typically introduced and initialized independently of the data.
+
+ ![Overview of tipeft framework](Figure_1.jpg)
+
+ It is specifically designed for postoperative prediction in clinical care, where predictive and valuable preoperative tabular features are often under-utilized in language model finetuning. For now, it supports both `LoRA` and `IA3`.
+
+
+ ## Requirements
+ ### Dependencies
+
+
+ The following Python packages are required for `tipeft`:
+
+ - `torch`
+ - `transformers`
+ - `peft`
+ - `accelerate`
+ - `numpy`
+ - `pandas`
+ - `scikit-learn`
+ - `tqdm`
+
+ Install dependencies with:
+
+ ```bash
+ pip install torch transformers peft accelerate numpy pandas scikit-learn tqdm
+ ```
+
+ #### Note on PyTorch installation
+ Because PyTorch wheels vary by CUDA version and hardware, it is recommended to install PyTorch manually following the instructions at:
+ https://pytorch.org/
+
+ ### System Requirements
+
+ `tipeft` has been tested and verified on the following configuration:
+
+ | Component | Tested Version |
+ |-----------|----------------|
+ | OS | Windows 10 |
+ | Python | 3.9.19 |
+ | CUDA | 12.6 |
+
+ #### Important Notes
+
+ - **Environment**: Must be run in a Jupyter notebook. Running as a standalone Python script may cause multiprocessing issues.
+ - **CPU cores**: At least 10 CPU cores recommended (uses `Pool(processes=10)` internally).
+ - **GPU**: CUDA-compatible GPU required.
+ - **OS**: Tested on Windows. Linux/Mac compatibility not yet verified.
+
+ #### Known Compatibility Limitations
+
+ 1. **Jupyter only** - Uses `tqdm.notebook`, which may not display correctly outside Jupyter.
+ 2. **Multiprocessing** - May behave differently on Linux/Mac due to different multiprocessing backends.
+
+ If you encounter issues on a different setup, please open an issue with your system info.
+
+ #### GPU requirements
+
+ `tipeft` is designed for GPU acceleration.
+ - At least 1 GPU is recommended
+ - Suggested minimum: 16GB VRAM
+ - Memory usage depends on:
+   - sequence length
+   - model size
+   - batch size
+   - PEFT configuration
+
+
+
+ ## Installation
+ To install from PyPI, run:
+ ```bash
+ pip install tipeft
+ ```
+
+
+ ## Usage
+
+ ### `train_tabular_infused_IA3`
+
+ Trains a tabular-infused IA3 model for binary classification.
+
+ ```python
+ from tipeft import train_tabular_infused_IA3
+
+ model, tokenizer = train_tabular_infused_IA3(
+     train=train_df,
+     val=val_df,
+     pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
+     label_col="in_hospital_mortality",
+     text_col="clinical_notes",
+     columns_unique_labels_of_tabular_features={
+         "gender": 2,
+         "insurance": 3,
+         "marital_status": 4,
+         "anchor_age": 1,
+         "anchor_year": 1
+     },
+     lr=0.001,
+     num_epochs=5,
+     lr_of_tabular_infused_features=0.0001
+ )
+ ```
+
+ #### Parameters
+
+ | Parameter | Type | Description |
+ |-----------|------|-------------|
+ | `train` | pandas.DataFrame | Training dataframe containing text, label, and tabular feature columns |
+ | `val` | pandas.DataFrame | Validation dataframe with the same structure as `train` |
+ | `pretrained_model_name` | str | Base model to fine-tune. Currently supports: `"emilyalsentzer/Bio_ClinicalBERT"` or `"microsoft/biogpt"` |
+ | `label_col` | str | Column name of the binary outcome label (must contain `True`/`False` values) |
+ | `text_col` | str | Column name containing the clinical text |
+ | `columns_unique_labels_of_tabular_features` | dict | Dictionary mapping tabular feature names to their number of unique values. Use `1` for continuous features, `>1` for categorical features |
+ | `lr` | float | Learning rate for final model training (default: `0.001`) |
+ | `num_epochs` | int | Number of training epochs for the final model (default: `5`) |
+ | `lr_of_tabular_infused_features` | float | Learning rate for tabular feature pre-training (default: `0.0001`) |
+
+ #### Returns
+
+ | Return | Type | Description |
+ |--------|------|-------------|
+ | `model` | PeftModel | The trained IA3 model |
+ | `tokenizer` | AutoTokenizer | The tokenizer for the model |
+
+ #### Notes
+
+ - The `label_col` must contain boolean values (`True`/`False`)
+ - Categorical features should have `>1` unique labels in `columns_unique_labels_of_tabular_features`
+ - Continuous/numerical features should have `1` as their value in `columns_unique_labels_of_tabular_features`
+ - Ensure all unique values in categorical columns appear in both the train and val sets
+ - The trained model is saved to `trained_models/IA3_{pretrained_model_name}_{label_col}`
+
+
+ ## Questions?
+
+ Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)
tipeft-0.0.1/tipeft.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,12 @@
+ Figure_1.jpg
+ MANIFEST.in
+ README.md
+ license.txt
+ setup.py
+ tipeft/IA3.py
+ tipeft/__init__.py
+ tipeft.egg-info/PKG-INFO
+ tipeft.egg-info/SOURCES.txt
+ tipeft.egg-info/dependency_links.txt
+ tipeft.egg-info/requires.txt
+ tipeft.egg-info/top_level.txt
tipeft-0.0.1/tipeft.egg-info/requires.txt ADDED
@@ -0,0 +1,10 @@
+ numpy>=2.0.2
+ pandas>=2.2.2
+ scikit-learn>=1.5
+ tqdm>=4.67
+ torch==2.8.0
+ transformers==4.57.0
+ peft==0.17.1
+ accelerate==1.10.1
+ evaluate==0.4.2
+ datasets==2.21.0
tipeft-0.0.1/tipeft.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ tipeft