cehrgpt 0.1.1__py3-none-any.whl → 0.1.3__py3-none-any.whl
- cehrgpt/analysis/htn_treatment_pathway.py +546 -0
- cehrgpt/analysis/treatment_pathway/__init__.py +0 -0
- cehrgpt/analysis/treatment_pathway/depression_treatment_pathway.py +94 -0
- cehrgpt/analysis/treatment_pathway/diabetes_treatment_pathway.py +94 -0
- cehrgpt/analysis/treatment_pathway/htn_treatment_pathway.py +94 -0
- cehrgpt/analysis/treatment_pathway/treatment_pathway.py +631 -0
- cehrgpt/data/cehrgpt_data_processor.py +549 -0
- cehrgpt/data/hf_cehrgpt_dataset.py +4 -0
- cehrgpt/data/hf_cehrgpt_dataset_collator.py +286 -629
- cehrgpt/data/hf_cehrgpt_dataset_mapping.py +60 -14
- cehrgpt/generation/cehrgpt_conditional_generation.py +316 -0
- cehrgpt/generation/generate_batch_hf_gpt_sequence.py +35 -15
- cehrgpt/generation/omop_converter_batch.py +11 -4
- cehrgpt/gpt_utils.py +73 -3
- cehrgpt/models/activations.py +27 -0
- cehrgpt/models/config.py +6 -2
- cehrgpt/models/gpt2.py +560 -0
- cehrgpt/models/hf_cehrgpt.py +193 -459
- cehrgpt/models/tokenization_hf_cehrgpt.py +380 -50
- cehrgpt/omop/ontology.py +154 -0
- cehrgpt/runners/data_utils.py +17 -6
- cehrgpt/runners/hf_cehrgpt_finetune_runner.py +33 -79
- cehrgpt/runners/hf_cehrgpt_pretrain_runner.py +48 -44
- cehrgpt/runners/hf_gpt_runner_argument_dataclass.py +58 -34
- cehrgpt/runners/hyperparameter_search_util.py +180 -69
- cehrgpt/runners/sample_packing_trainer.py +11 -2
- cehrgpt/tools/linear_prob/compute_cehrgpt_features.py +27 -31
- cehrgpt-0.1.3.dist-info/METADATA +238 -0
- {cehrgpt-0.1.1.dist-info → cehrgpt-0.1.3.dist-info}/RECORD +33 -22
- cehrgpt-0.1.1.dist-info/METADATA +0 -115
- /cehrgpt/tools/{merge_synthetic_real_dataasets.py → merge_synthetic_real_datasets.py} +0 -0
- {cehrgpt-0.1.1.dist-info → cehrgpt-0.1.3.dist-info}/WHEEL +0 -0
- {cehrgpt-0.1.1.dist-info → cehrgpt-0.1.3.dist-info}/licenses/LICENSE +0 -0
- {cehrgpt-0.1.1.dist-info → cehrgpt-0.1.3.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,238 @@
+Metadata-Version: 2.4
+Name: cehrgpt
+Version: 0.1.3
+Summary: CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
+Author-email: Chao Pang <chaopang229@gmail.com>, Xinzhuo Jiang <xj2193@cumc.columbia.edu>, Krishna Kalluri <kk3326@cumc.columbia.edu>, Elise Minto <em3697@cumc.columbia.edu>, Jason Patterson <jp3477@cumc.columbia.edu>, Nishanth Parameshwar Pavinkurve <np2689@cumc.columbia.edu>, Karthik Natarajan <kn2174@cumc.columbia.edu>
+License: MIT License
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.10.0
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: cehrbert>=1.4.8
+Requires-Dist: cehrbert_data==0.0.11
+Requires-Dist: openai==1.54.3
+Requires-Dist: optuna==4.0.0
+Requires-Dist: transformers==4.44.1
+Requires-Dist: tokenizers==0.19.0
+Requires-Dist: peft==0.10.0
+Requires-Dist: lightgbm
+Requires-Dist: polars
+Provides-Extra: dev
+Requires-Dist: pre-commit; extra == "dev"
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Requires-Dist: pytest-subtests; extra == "dev"
+Requires-Dist: rootutils; extra == "dev"
+Requires-Dist: hypothesis; extra == "dev"
+Requires-Dist: black; extra == "dev"
+Provides-Extra: flash-attn
+Requires-Dist: flash_attn; extra == "flash-attn"
+Dynamic: license-file
+
+# CEHRGPT
+
+[](https://pypi.org/project/cehrgpt/)
+
+[](https://github.com/knatarajan-lab/cehrgpt/actions/workflows/tests.yaml)
+[](https://github.com/knatarajan-lab/cehrgpt/blob/main/LICENSE)
+[](https://github.com/knatarajan-lab/cehrgpt/graphs/contributors)
+
+CEHRGPT is a multi-task foundation model for structured electronic health records (EHR) data that supports three capabilities: feature representation, zero-shot prediction, and synthetic data generation.
+
+## 🎯 Key Capabilities
+
+### Feature Representation
+Extract meaningful patient embeddings from sequences of medical events using **linear probing** techniques for downstream tasks such as disease prediction, patient clustering, and risk stratification.
+
+### Zero-Shot Prediction
+Generate outcome predictions directly from prompts without requiring task-specific training, enabling rapid evaluation in low-label clinical settings.
+
+### Synthetic Data Generation
+Generate comprehensive patient profiles including demographics, medical history, treatment courses, and outcomes while implementing advanced privacy-preserving techniques to ensure generated data contains no identifiable information.
+The platform is fully compatible with the OMOP Common Data Model for seamless integration with existing healthcare systems.
+## 🚀 Installation
+
+Clone the repository and install dependencies:
+
+```bash
+git clone https://github.com/knatarajan-lab/cehrgpt.git
+cd cehrgpt
+pip install .
+```
+
+## 📋 Prerequisites
+
+Before getting started, set up the required environment variables:
+
+```bash
+export CEHRGPT_HOME=$(git rev-parse --show-toplevel)
+export OMOP_DIR="" # Path to your OMOP data
+export CEHR_GPT_DATA_DIR="" # Path for processed data storage
+export CEHR_GPT_MODEL_DIR="" # Path for model storage
+```
+
+Create the dataset cache directory:
+```bash
+mkdir $CEHR_GPT_DATA_DIR/dataset_prepared
+```
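Reviewer sketch (not part of the package): the Prerequisites block in the added README exports the path variables as empty strings, so a later `mkdir` or training run can fail in confusing ways if one is left blank. A minimal pre-flight check, assuming only the variable names shown in the README, could look like the following. The helper names `missing_vars` and `ensure_dataset_cache` are hypothetical and do not exist in `cehrgpt`.

```python
from pathlib import Path

# Variable names taken from the README's Prerequisites block.
REQUIRED_VARS = ("OMOP_DIR", "CEHR_GPT_DATA_DIR", "CEHR_GPT_MODEL_DIR")


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def ensure_dataset_cache(env):
    """Create <CEHR_GPT_DATA_DIR>/dataset_prepared (like `mkdir -p`) and return its path."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError(f"Unset or empty environment variables: {', '.join(missing)}")
    cache_dir = Path(env["CEHR_GPT_DATA_DIR"]) / "dataset_prepared"
    # parents/exist_ok make the call safe to repeat, unlike a bare `mkdir`.
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir
```

In a real session one would pass `os.environ` as `env`; taking the mapping as a parameter keeps the check easy to exercise without touching the process environment.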
+
+## 🏗️ Model Training
+
+### Step 1: Generate Pre-training Data from OMOP
+
+Generate the training data following the [Data Generation Instruction](./data_generation.md).
+
+### Step 2: Pre-train CEHR-GPT
+
+Train the foundation model:
+
+```bash
+python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner \
+--model_name_or_path $CEHR_GPT_MODEL_DIR \
+--tokenizer_name_or_path $CEHR_GPT_MODEL_DIR \
+--output_dir $CEHR_GPT_MODEL_DIR \
+--data_folder "$CEHR_GPT_DATA_DIR/patient_sequence/train" \
+--dataset_prepared_path "$CEHR_GPT_DATA_DIR/dataset_prepared" \
+--do_train true --seed 42 \
+--dataloader_num_workers 16 --dataloader_prefetch_factor 8 \
+--hidden_size 768 --num_hidden_layers 14 --max_position_embeddings 4096 \
+--evaluation_strategy epoch --save_strategy epoch \
+--sample_packing --max_tokens_per_batch 16384 \
+--warmup_ratio 0.01 --weight_decay 0.01 \
+--num_train_epochs 50 --learning_rate 0.0002 \
+--use_early_stopping --early_stopping_threshold 0.001
+```
+
+> **Tip**: Increase `max_position_embeddings` for longer context windows based on your use case.
+
+## 🎯 Feature Representation
+
+CEHR-GPT enables extraction of meaningful patient embeddings from medical event sequences using **linear probing** techniques for downstream prediction tasks. The feature representation pipeline includes label generation, patient sequence extraction, and linear regression model training on the extracted representations.
+
+For detailed instructions including cohort creation, patient feature extraction, and linear probing evaluation, please follow the [Feature Representation Guide](./feature_representation.md).
+
+## 🔮 Zero-Shot Prediction
+
+CEHR-GPT can generate outcome predictions directly from clinical prompts without requiring task-specific training, making it ideal for rapid evaluation in low-label clinical settings. The zero-shot prediction capability performs time-to-event analysis by processing patient sequences and generating risk predictions based on learned medical patterns.
+
+For complete setup instructions including label generation, sequence preparation, and prediction execution, please follow the [Zero-Shot Prediction Guide](./zero_shot_prediction.md).
+
+## 🧬 Synthetic Data Generation
+
+CEHR-GPT generates comprehensive synthetic patient profiles including demographics, medical history, treatment courses, and outcomes while implementing advanced privacy-preserving techniques. The synthetic data maintains statistical fidelity to real patient populations without containing identifiable information, and outputs are fully compatible with the OMOP Common Data Model.
+
+For step-by-step instructions on generating synthetic sequences and converting them to OMOP format, please follow the [Synthetic Data Generation Guide](./synthetic_data_generation.md).
+
+## 📊 MEDS Support
+
+CEHR-GPT supports the Medical Event Data Standard (MEDS) format for enhanced interoperability.
+
+### Prerequisites
+
+Configure MEDS-specific environment variables:
+
+```bash
+export CEHR_GPT_MODEL_DIR="" # CEHR-GPT model directory
+export MEDS_DIR="" # MEDS data directory
+export MEDS_READER_DIR="" # MEDS reader output directory
+```
+
+### Step 1: Create MIMIC MEDS Data
+
+Transform MIMIC files to MEDS format following the [MEDS_transforms](https://github.com/mmcdermott/MEDS_transforms/) repository instructions.
+
+### Step 2: Prepare MEDS Reader
+
+Convert MEDS data for CEHR-GPT compatibility:
+
+```bash
+meds_reader_convert $MEDS_DIR $MEDS_READER_DIR --num_threads 10
+```
+
+### Step 3: Pre-train with MEDS Data
+
+Execute pre-training using MEDS format:
+
+```bash
+python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner \
+--model_name_or_path $CEHR_GPT_MODEL_DIR \
+--tokenizer_name_or_path $CEHR_GPT_MODEL_DIR \
+--output_dir $CEHR_GPT_MODEL_DIR \
+--data_folder $MEDS_READER_DIR \
+--dataset_prepared_path "$CEHR_GPT_MODEL_DIR/dataset_prepared" \
+--do_train true --seed 42 \
+--dataloader_num_workers 16 --dataloader_prefetch_factor 8 \
+--hidden_size 768 --num_hidden_layers 14 --max_position_embeddings 8192 \
+--evaluation_strategy epoch --save_strategy epoch \
+--sample_packing --max_tokens_per_batch 16384 \
+--warmup_steps 500 --weight_decay 0.01 \
+--num_train_epochs 50 --learning_rate 0.0002 \
+--use_early_stopping --early_stopping_threshold 0.001 \
+--is_data_in_meds --inpatient_att_function_type day \
+--att_function_type day --include_inpatient_hour_token \
+--include_auxiliary_token --include_demographic_prompt \
+--meds_to_cehrbert_conversion_type "MedsToBertMimic4"
+```
+
+### Step 4: Generate MEDS Trajectories
+
+#### Environment Setup
+
+Configure trajectory generation environment:
+
+```bash
+export MEDS_LABEL_COHORT_DIR="" # Cohort labels directory (parquet files)
+export MEDS_TRAJECTORY_DIR="" # Trajectory output directory
+```
+
+#### Generate Synthetic Trajectories
+
+Create patient trajectories with the trained model:
+
+```bash
+python -u -m cehrgpt.generation.cehrgpt_conditional_generation \
+--cohort_folder $MEDS_LABEL_COHORT_DIR \
+--data_folder $MEDS_READER_DIR \
+--dataset_prepared_path "$CEHR_GPT_MODEL_DIR/dataset_prepared" \
+--model_name_or_path $CEHR_GPT_MODEL_DIR \
+--tokenizer_name_or_path $CEHR_GPT_MODEL_DIR \
+--output_dir $MEDS_TRAJECTORY_DIR \
+--per_device_eval_batch_size 16 \
+--num_of_trajectories_per_sample 2 \
+--generation_input_length 4096 \
+--generation_max_new_tokens 4096 \
+--is_data_in_meds \
+--att_function_type day --inpatient_att_function_type day \
+--meds_to_cehrbert_conversion_type MedsToBertMimic4 \
+--include_auxiliary_token --include_demographic_prompt \
+--include_inpatient_hour_token
+```
+
+> **Important**: Ensure `generation_input_length` + `generation_max_new_tokens` ≤ `max_position_embeddings` (8192).
+
+#### Parameter Reference
+
+- `generation_input_length`: Input context length for generation
+- `generation_max_new_tokens`: Maximum new tokens to generate
+- `num_of_trajectories_per_sample`: Number of trajectories per patient sample
+
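Reviewer sketch (not part of the package): the "Important" note in the added README states that `generation_input_length` + `generation_max_new_tokens` must not exceed `max_position_embeddings`. The arithmetic can be checked before launching a long generation job. The function name `check_generation_budget` is hypothetical; only the three parameter names come from the README.

```python
def check_generation_budget(input_length, max_new_tokens, max_position_embeddings):
    """Return the unused positions, or raise if prompt + generated tokens exceed the limit."""
    total = input_length + max_new_tokens
    if total > max_position_embeddings:
        raise ValueError(
            f"generation_input_length ({input_length}) + generation_max_new_tokens "
            f"({max_new_tokens}) = {total} exceeds max_position_embeddings "
            f"({max_position_embeddings})"
        )
    return max_position_embeddings - total


# Values from the README commands: 4096 + 4096 against a model pre-trained with 8192 positions.
print(check_generation_budget(4096, 4096, 8192))  # 0
```

With the README's values the budget is used exactly, so raising either generation flag would require pre-training with a larger `max_position_embeddings`.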
+## 📖 Citation
+
+If you use CEHRGPT in your research, please cite:
+
+```bibtex
+@article{cehrgpt2024,
+title={CEHRGPT: Synthetic Data Generation for Electronic Health Records},
+author={Natarajan, K and others},
+journal={arXiv preprint arXiv:2402.04400},
+year={2024}
+}
+```
+
+## 📄 License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
@@ -1,8 +1,9 @@
 __init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/cehrgpt_args.py,sha256=zPLp9Qjlq5PapWx3R15BNnyaX8zV3dxr4PuWj71r0Lg,3516
-cehrgpt/gpt_utils.py,sha256=
+cehrgpt/gpt_utils.py,sha256=gMPqHpOS7_6N81r7t_p6bGJ0FFVK5AgtEIMYLYKb9iA,13746
 cehrgpt/analysis/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+cehrgpt/analysis/htn_treatment_pathway.py,sha256=KMjSEdIFNr2bSAyw1W6_bh59aV067-ZhT-AymiKCyr8,21961
 cehrgpt/analysis/irregularity.py,sha256=Rfl_daMvSh9cZ68vUwfmuH-JYCFXdAph2ITHHffYC0Y,1047
 cehrgpt/analysis/privacy/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/analysis/privacy/attribute_inference.py,sha256=0ANVW0I5uvOl6IxQ15-vMVQd0mugOgSGReBUQQESImg,9368
@@ -11,40 +12,50 @@ cehrgpt/analysis/privacy/member_inference.py,sha256=a_-4rkYYffYl0ucnjK6uYy8jesup
 cehrgpt/analysis/privacy/nearest_neighbor_inference.py,sha256=qoJgWW7VsUMzjMGpTaK84iY_QLOuF3HCYXAEKLZOZsU,6391
 cehrgpt/analysis/privacy/reid_inference.py,sha256=Pypd3QJXQNY8VljpnIEa5zeAbTZHMjQOazaL-9VsBGw,13955
 cehrgpt/analysis/privacy/utils.py,sha256=CRA4H9mPLBjMQGKzZ_x_3ro3tMap-NjsMDVqSOjHSVQ,8226
+cehrgpt/analysis/treatment_pathway/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+cehrgpt/analysis/treatment_pathway/depression_treatment_pathway.py,sha256=7mrzaMBv09Gn6I5OM86f7gNfPvncVVKg2C3jZo0bmsU,3024
+cehrgpt/analysis/treatment_pathway/diabetes_treatment_pathway.py,sha256=qwAtJ3KVesvqvR22Tbk19k35sDL-sGlRZo2sjJNo3yQ,2962
+cehrgpt/analysis/treatment_pathway/htn_treatment_pathway.py,sha256=0bsEE1VFIxzU33bSipM30p2fnHsWjGWWcu59y_38K3c,2870
+cehrgpt/analysis/treatment_pathway/treatment_pathway.py,sha256=SCWphYH9ARa4ZKB9fgBYM9RC2Hc8PDwtoHHCX7th16Q,25496
 cehrgpt/data/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-cehrgpt/data/
-cehrgpt/data/
-cehrgpt/data/
+cehrgpt/data/cehrgpt_data_processor.py,sha256=0Y6GPWu6fRBLemXJu5IxuOPbF2wmSrX-18uyofTeUzk,23096
+cehrgpt/data/hf_cehrgpt_dataset.py,sha256=uz05TG5QCl3_Ybn9zZyWRg0pEbiAvL1yPWXK3BGsj0Q,3815
+cehrgpt/data/hf_cehrgpt_dataset_collator.py,sha256=2UcYB241dWhvS-mV0ZTbCJdjlgPrVjZOAh3V8EWFfCg,27930
+cehrgpt/data/hf_cehrgpt_dataset_mapping.py,sha256=-Igd-P-yvYlJXGZSGlYHRnez464NCkZIko3boQDYS1E,27638
 cehrgpt/data/sample_packing_sampler.py,sha256=vovGMtmhG70DRkSCeiaDEJ_rjKZ38y-YLaI1kkhFEkI,6747
 cehrgpt/generation/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
+cehrgpt/generation/cehrgpt_conditional_generation.py,sha256=6I4tI-cCQ6QdFxhDAkhu0ZNo57DINjD-NncxMbyUwgg,12032
 cehrgpt/generation/chatgpt_generation.py,sha256=SrnLwHLdNtnAOEg36gNjqfoT9yd12iyPgpZffL2AFJo,4428
-cehrgpt/generation/generate_batch_hf_gpt_sequence.py,sha256=
-cehrgpt/generation/omop_converter_batch.py,sha256=
+cehrgpt/generation/generate_batch_hf_gpt_sequence.py,sha256=lpKEvJ2hhB8bwS06c5jEAksFUrGKCUv6t7hXrsMj-Ns,12284
+cehrgpt/generation/omop_converter_batch.py,sha256=h4dg9fc23w6i82KMrOQFM-KxD6iuLnJfrv7YISc0dMw,26620
 cehrgpt/generation/omop_entity.py,sha256=Q5Sr0AlyuPAm1FRPfnJO13q-u1fqRgYVHXruZ9g4xNE,19400
 cehrgpt/models/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-cehrgpt/models/
-cehrgpt/models/
+cehrgpt/models/activations.py,sha256=crVPS-cZpUGrvLD7xhNjGmGr9S4e4LEfNmgIEsiuQ88,981
+cehrgpt/models/config.py,sha256=SwsHVXzsgDmFSfrzv90lZBePenoHv-fIGGSLdxAIiu8,11193
+cehrgpt/models/gpt2.py,sha256=4H9sFzf_qFGY-Bk0mfztxlKJXxvA0kTKwKiWFbqJLrQ,22079
+cehrgpt/models/hf_cehrgpt.py,sha256=YTZtY1p-M-utQa6iJvDXFOjgc1SDdL3ZcWuy_-ZN41g,81167
 cehrgpt/models/hf_modeling_outputs.py,sha256=5X4WEYKqT37phv_e5ZAv3A_N0wqdAUJLJRm6TxS6dDQ,10356
 cehrgpt/models/pretrained_embeddings.py,sha256=vLLVs17TLpXRqCVEWQxGGwPHkUJUO7laNTeBuyBK_yk,3238
 cehrgpt/models/special_tokens.py,sha256=lrw45B4tea4Dsajn09Cz6w5D2TfHmYXikZkgwnstu_o,521
-cehrgpt/models/tokenization_hf_cehrgpt.py,sha256=
+cehrgpt/models/tokenization_hf_cehrgpt.py,sha256=yHuNXvLznaSjwxVJsq7r9bZLi4msM8n4LVrzHINqsgY,66225
 cehrgpt/omop/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/omop/condition_era.py,sha256=hPZALz2XaWnro_1bwIYNkI48foOJjueyg3CZ1BliCno,626
 cehrgpt/omop/observation_period.py,sha256=TRMgv5Ya2RaS2im7oQ6BLC_5JL9EJYNYR62ApxIuHvg,1211
 cehrgpt/omop/omop_argparse.py,sha256=WI_-vZGfPdZ8atIeB-CrpaPdkv07kDBabyEpaRZfl64,998
 cehrgpt/omop/omop_table_builder.py,sha256=6K_YYKyayDUBwxUdwaliI5tufpfIQqByDY5HeBbjHok,2742
+cehrgpt/omop/ontology.py,sha256=LZIp0X3gY_VDZqIl6gTwGq7ZwV1nb0raPLTQAbJm6nM,5683
 cehrgpt/omop/sample_omop_tables.py,sha256=2JZ8BNSvssceinwFanvuCRh-YlKrKn25U9w1pL79kQ0,2300
 cehrgpt/omop/queries/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/omop/queries/condition_era.py,sha256=LFB6vBAvshHJxtYIRkl7cfrF0kf7ay0piBKpmHBwrpE,2578
 cehrgpt/omop/queries/observation_period.py,sha256=fpzr5DMNw-QLoSwp2Iatfch88E3hyhZ75usiIdG3A0U,6410
 cehrgpt/runners/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-cehrgpt/runners/data_utils.py,sha256=
+cehrgpt/runners/data_utils.py,sha256=i-krtBx_6rvPYtdLdDoWwOTtJcaovd0wH8gBYmgN2l4,16013
 cehrgpt/runners/gpt_runner_util.py,sha256=YJQSRW9Mo4TjXSOUOTf6BUFcs1MGFiXU5T4ztKZcYhU,3485
-cehrgpt/runners/hf_cehrgpt_finetune_runner.py,sha256=
-cehrgpt/runners/hf_cehrgpt_pretrain_runner.py,sha256=
-cehrgpt/runners/hf_gpt_runner_argument_dataclass.py,sha256=
-cehrgpt/runners/hyperparameter_search_util.py,sha256=
-cehrgpt/runners/sample_packing_trainer.py,sha256=
+cehrgpt/runners/hf_cehrgpt_finetune_runner.py,sha256=AY9QxH4WupfWpLm9rjeSMOzedmw_03kTWuhncVRuhqs,26032
+cehrgpt/runners/hf_cehrgpt_pretrain_runner.py,sha256=I_fuuKNzWx6yZiDcAAZdQtyxUEgNKLygQyS-SyQpptY,26840
+cehrgpt/runners/hf_gpt_runner_argument_dataclass.py,sha256=8qHVUp-hx7xKozaE_EaEJphrs1QfRSXx0P6YMByK9Ww,9981
+cehrgpt/runners/hyperparameter_search_util.py,sha256=SD02j1D8IBtIOG41dh7VgmVT2SWCF-VPZ7zVHlEIN70,12801
+cehrgpt/runners/sample_packing_trainer.py,sha256=HfxHCIGBXb1RbN7nbU6jmSy_Zzwx_joj-UoYqbKl5-0,8375
 cehrgpt/simulations/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/simulations/generate_plots.py,sha256=BTZ71r8Kah0PMorkiO3vw55_p_9U1Z8KiD3GsPfaV0s,2520
 cehrgpt/simulations/run_simulation.sh,sha256=DcJ6B19jIteUO0pZ0Tc21876lB9XxQHFAxlre7MtAzk,795
@@ -62,13 +73,13 @@ cehrgpt/tools/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 cehrgpt/tools/ehrshot_benchmark.py,sha256=E-m_5srlYEw7Y7i9twIJWDvrkwNlop-6yZB-80FZid0,2667
 cehrgpt/tools/generate_causal_patient_split_by_age.py,sha256=dmHiPAL_kR1WrhRteIiHH9dwMtMi3PVl8jXm2O06_gI,4177
 cehrgpt/tools/generate_pretrained_embeddings.py,sha256=lhFSacGv8bMld6qigKZN8Op8eXpFi0DsJuQbWKOWXqI,4160
-cehrgpt/tools/
+cehrgpt/tools/merge_synthetic_real_datasets.py,sha256=O1dbQ32Le0t15fwymwAh9mfNVLEWuFwW53DNvESrWbY,7589
 cehrgpt/tools/upload_omop_tables.py,sha256=vdBAbkeAsGPA4NsyhNjelPVj3gS8yzmS1sKNM1Qk96g,3791
 cehrgpt/tools/linear_prob/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-cehrgpt/tools/linear_prob/compute_cehrgpt_features.py,sha256=
+cehrgpt/tools/linear_prob/compute_cehrgpt_features.py,sha256=0i34zAwePG0hZK2HSDaUlO-Fzyb5K4LqRuhrCVWivxA,19906
 cehrgpt/tools/linear_prob/train_with_cehrgpt_features.py,sha256=w0UvzMKYGenN_KDVnbzutmy8IPLUxW5hPvpKKxDSL5U,5820
-cehrgpt-0.1.
-cehrgpt-0.1.
-cehrgpt-0.1.
-cehrgpt-0.1.
-cehrgpt-0.1.
+cehrgpt-0.1.3.dist-info/licenses/LICENSE,sha256=LOfC32zkfUIdGm8e_098jPbt8OHKtNWymDzxn2pA9Zk,1093
+cehrgpt-0.1.3.dist-info/METADATA,sha256=MTgv1L9ru4evziAW2yTLsd3m9d1Ept8xy85u2CpBNTM,10167
+cehrgpt-0.1.3.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
+cehrgpt-0.1.3.dist-info/top_level.txt,sha256=akNCJBbMSLV8nkOzdVzdy13hMJ5CIQURnAS_YYEDVwA,17
+cehrgpt-0.1.3.dist-info/RECORD,,
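Reviewer sketch (not part of the package): each RECORD row in the hunks above has the form `path,sha256=<digest>,<size>`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed, as defined by the wheel binary distribution format. The helper name `record_entry` is hypothetical. The example reproduces the row that RECORD lists for the empty `cehrgpt/__init__.py`.

```python
import base64
import hashlib


def record_entry(path, data):
    """Build a RECORD row: path, unpadded urlsafe-base64 SHA-256 digest, size in bytes."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# An empty file always hashes to the same digest, which is why every empty
# __init__.py in the RECORD diff carries the identical sha256 value.
print(record_entry("cehrgpt/__init__.py", b""))
# cehrgpt/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

Reading a file's bytes and comparing the result with its RECORD row is a quick way to confirm that an unpacked wheel matches what the registry published.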
cehrgpt-0.1.1.dist-info/METADATA
DELETED
@@ -1,115 +0,0 @@
-Metadata-Version: 2.4
-Name: cehrgpt
-Version: 0.1.1
-Summary: CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
-Author-email: Chao Pang <chaopang229@gmail.com>, Xinzhuo Jiang <xj2193@cumc.columbia.edu>, Krishna Kalluri <kk3326@cumc.columbia.edu>, Elise Minto <em3697@cumc.columbia.edu>, Jason Patterson <jp3477@cumc.columbia.edu>, Nishanth Parameshwar Pavinkurve <np2689@cumc.columbia.edu>, Karthik Natarajan <kn2174@cumc.columbia.edu>
-License: MIT License
-Classifier: Development Status :: 5 - Production/Stable
-Classifier: Intended Audience :: Developers
-Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: MIT License
-Classifier: Programming Language :: Python :: 3
-Requires-Python: >=3.10.0
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: cehrbert==1.4.5
-Requires-Dist: cehrbert_data==0.0.11
-Requires-Dist: openai==1.54.3
-Requires-Dist: optuna==4.0.0
-Requires-Dist: transformers==4.44.1
-Requires-Dist: tokenizers==0.19.0
-Requires-Dist: peft==0.10.0
-Requires-Dist: lightgbm
-Requires-Dist: polars
-Provides-Extra: dev
-Requires-Dist: pre-commit; extra == "dev"
-Requires-Dist: pytest; extra == "dev"
-Requires-Dist: pytest-cov; extra == "dev"
-Requires-Dist: pytest-subtests; extra == "dev"
-Requires-Dist: rootutils; extra == "dev"
-Requires-Dist: hypothesis; extra == "dev"
-Requires-Dist: black; extra == "dev"
-Provides-Extra: flash-attn
-Requires-Dist: flash_attn; extra == "flash-attn"
-Dynamic: license-file
-
-# CEHRGPT
-
-[](https://pypi.org/project/cehrgpt/)
-
-[](https://github.com/knatarajan-lab/cehrgpt/actions/workflows/tests.yaml)
-[](https://github.com/knatarajan-lab/cehrgpt/blob/main/LICENSE)
-[](https://github.com/knatarajan-lab/cehrgpt/graphs/contributors)
-
-## Description
-CEHRGPT is a synthetic data generation model developed to handle structured electronic health records (EHR) with enhanced privacy and reliability. It leverages state-of-the-art natural language processing techniques to create realistic, anonymized patient data that can be used for research and development without compromising patient privacy.
-
-## Features
-- **Synthetic Patient Data Generation**: Generates comprehensive patient profiles including demographics, medical history, treatment courses, and outcomes.
-- **Privacy-Preserving**: Implements techniques to ensure the generated data does not reveal identifiable information.
-- **Compatibility with OMOP**: Fully compatible with the OMOP common data model, allowing seamless integration with existing healthcare data systems.
-- **Extensible**: Designed to be adaptable to new datasets and different EHR systems.
-
-## Installation
-To install CEHRGPT, clone this repository and install the required dependencies.
-
-```bash
-git clone https://github.com/knatarajan-lab/cehrgpt.git
-cd cehrgpt
-pip install .
-```
-
-## Pretrain
-Pretrain cehrgpt using the Hugging Face trainer, the parameters can be found in the sample configuration yaml
-```bash
-mkdir test_results
-# This is NOT required when streaming is set to true
-mkdir test_dataset_prepared
-python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner sample_configs/cehrgpt_pretrain_sample_config.yaml
-```
-
-## Generate synthetic sequences
-Generate synthetic sequences using the trained model
-```bash
-export TRANSFORMERS_VERBOSITY=info
-export CUDA_VISIBLE_DEVICES="0"
-python -u -m cehrgpt.generation.generate_batch_hf_gpt_sequence \
---model_folder test_results \
---tokenizer_folder test_results \
---output_folder test_results \
---num_of_patients 128 \
---batch_size 32 \
---buffer_size 128 \
---context_window 1024 \
---sampling_strategy TopPStrategy \
---top_p 1.0 --temperature 1.0 --repetition_penalty 1.0 \
---epsilon_cutoff 0.00 \
---demographic_data_path sample_data/pretrain
-```
-
-## Convert synthetic sequences to OMOP
-```bash
-# omop converter requires the OHDSI vocabulary
-export OMOP_VOCAB_DIR = ""
-# the omop derived tables need to be built using pyspark
-export SPARK_WORKER_INSTANCES="1"
-export SPARK_WORKER_CORES="8"
-export SPARK_EXECUTOR_CORES="2"
-export SPARK_DRIVER_MEMORY="2g"
-export SPARK_EXECUTOR_MEMORY="2g"
-
-# Convert the sequences, create the omop derived tables
-sh scripts/omop_pipeline.sh \
-test_results/top_p10000/generated_sequences/ \
-test_results/top_p10000/restored_omop/ \
-$OMOP_VOCAB_DIR
-```
-
-## Citation
-```
-@article{cehrgpt2024,
-title={CEHRGPT: Synthetic Data Generation for Electronic Health Records},
-author={Natarajan, K and others},
-journal={arXiv preprint arXiv:2402.04400},
-year={2024}
-}
File without changes
|
File without changes
|
File without changes
|
File without changes
|