PyPI - cehrgpt - Versions diffs - 0.0.1__tar.gz → 0.0.2__tar.gz - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (106)

cehrgpt-0.0.2/.gitignore ADDED

@@ -0,0 +1,25 @@
+.DS_Store
+.idea/
+.vscode/
+venv*
+dist/*
+*ipynb_checkpoints/
+*h5
+*logs
+*nohup.out
+*ipynb
+*__pycache__/
+.eggs/
+*.dat
+.metastore_db/
+build/
+*.out
+*.egg-info/
+test_data
+test_dataset_prepared
+test*results

{cehrgpt-0.0.1/src/cehrgpt.egg-info → cehrgpt-0.0.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: cehrgpt
-Version: 0.0.1
+Version: 0.0.2
 Summary: CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
 Author-email: Chao Pang <chaopang229@gmail.com>, Xinzhuo Jiang <xj2193@cumc.columbia.edu>, Krishna Kalluri <kk3326@cumc.columbia.edu>, Elise Minto <em3697@cumc.columbia.edu>, Jason Patterson <jp3477@cumc.columbia.edu>, Nishanth Parameshwar Pavinkurve <np2689@cumc.columbia.edu>, Karthik Natarajan <kn2174@cumc.columbia.edu>
 License: MIT License
@@ -12,11 +12,12 @@ Classifier: Programming Language :: Python :: 3
 Requires-Python: >=3.10.0
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: cehrbert==1.3.3
+Requires-Dist: cehrbert==1.3.8
 Requires-Dist: openai==1.54.3
 Requires-Dist: optuna==4.0.0
 Requires-Dist: transformers==4.40.0
-Requires-Dist: tokenizers==0.19
+Requires-Dist: tokenizers==0.19.0
+Requires-Dist: peft==0.10.0
 Requires-Dist: trl==0.11.4
 Provides-Extra: dev
 Requires-Dist: pre-commit; extra == "dev"
@@ -50,11 +51,57 @@ CEHRGPT is a synthetic data generation model developed to handle structured elec
 To install CEHRGPT, clone this repository and install the required dependencies.
 ```bash
-git clone https://github.com/knatarajan-lab/cehrgpt-public.git
-cd cehrgpt-public
+git clone https://github.com/knatarajan-lab/cehrgpt.git
+cd cehrgpt
 pip install .
 ```
+## Pretrain
+Pretrain cehrgpt using the Hugging Face trainer. The parameters can be found in the sample configuration YAML.
+```bash
+mkdir test_results
+# This is NOT required when streaming is set to true
+mkdir test_dataset_prepared
+python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner sample_configs/cehrgpt_pretrain_sample_config.yaml
+```
+## Generate synthetic sequences
+Generate synthetic sequences using the trained model
+```bash
+export TRANSFORMERS_VERBOSITY=info
+export CUDA_VISIBLE_DEVICES="0"
+python -u -m cehrgpt.generation.generate_batch_hf_gpt_sequence \
+  --model_folder test_results \
+  --tokenizer_folder test_results \
+  --output_folder test_results \
+  --num_of_patients 128 \
+  --batch_size 32 \
+  --buffer_size 128 \
+  --context_window 1024 \
+  --sampling_strategy TopPStrategy \
+  --top_p 1.0 --temperature 1.0 --repetition_penalty 1.0 \
+  --epsilon_cutoff 0.00 \
+  --demographic_data_path sample_data/pretrain
+```
+## Convert synthetic sequences to OMOP
+```bash
+# omop converter requires the OHDSI vocabulary
+export OMOP_VOCAB_DIR=""
+# the omop derived tables need to be built using pyspark
+export SPARK_WORKER_INSTANCES="1"
+export SPARK_WORKER_CORES="8"
+export SPARK_EXECUTOR_CORES="2"
+export SPARK_DRIVER_MEMORY="2g"
+export SPARK_EXECUTOR_MEMORY="2g"
+# Convert the sequences, create the omop derived tables
+bash scripts/omop_pipeline.sh \
+  test_results/top_p10000/generated_sequences/ \
+  test_results/top_p10000/restored_omop/ \
+  $OMOP_VOCAB_DIR
+```
 ## Citation
 ```
 @article{cehrgpt2024,
@@ -63,4 +110,3 @@ pip install .
   journal={arXiv preprint arXiv:2402.04400},
   year={2024}
 }
-```

{cehrgpt-0.0.1 → cehrgpt-0.0.2}/README.md RENAMED

@@ -19,11 +19,57 @@ CEHRGPT is a synthetic data generation model developed to handle structured elec
 To install CEHRGPT, clone this repository and install the required dependencies.
 ```bash
-git clone https://github.com/knatarajan-lab/cehrgpt-public.git
-cd cehrgpt-public
+git clone https://github.com/knatarajan-lab/cehrgpt.git
+cd cehrgpt
 pip install .
 ```
+## Pretrain
+Pretrain cehrgpt using the Hugging Face trainer. The parameters can be found in the sample configuration YAML.
+```bash
+mkdir test_results
+# This is NOT required when streaming is set to true
+mkdir test_dataset_prepared
+python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner sample_configs/cehrgpt_pretrain_sample_config.yaml
+```
+## Generate synthetic sequences
+Generate synthetic sequences using the trained model
+```bash
+export TRANSFORMERS_VERBOSITY=info
+export CUDA_VISIBLE_DEVICES="0"
+python -u -m cehrgpt.generation.generate_batch_hf_gpt_sequence \
+  --model_folder test_results \
+  --tokenizer_folder test_results \
+  --output_folder test_results \
+  --num_of_patients 128 \
+  --batch_size 32 \
+  --buffer_size 128 \
+  --context_window 1024 \
+  --sampling_strategy TopPStrategy \
+  --top_p 1.0 --temperature 1.0 --repetition_penalty 1.0 \
+  --epsilon_cutoff 0.00 \
+  --demographic_data_path sample_data/pretrain
+```
+## Convert synthetic sequences to OMOP
+```bash
+# omop converter requires the OHDSI vocabulary
+export OMOP_VOCAB_DIR=""
+# the omop derived tables need to be built using pyspark
+export SPARK_WORKER_INSTANCES="1"
+export SPARK_WORKER_CORES="8"
+export SPARK_EXECUTOR_CORES="2"
+export SPARK_DRIVER_MEMORY="2g"
+export SPARK_EXECUTOR_MEMORY="2g"
+# Convert the sequences, create the omop derived tables
+bash scripts/omop_pipeline.sh \
+  test_results/top_p10000/generated_sequences/ \
+  test_results/top_p10000/restored_omop/ \
+  $OMOP_VOCAB_DIR
+```
 ## Citation
 ```
 @article{cehrgpt2024,
@@ -31,5 +77,4 @@ pip install .
   author={Natarajan, K and others},
   journal={arXiv preprint arXiv:2402.04400},
   year={2024}
-}
-```
+}

{cehrgpt-0.0.1 → cehrgpt-0.0.2}/pyproject.toml RENAMED

@@ -28,11 +28,12 @@ classifiers = [
 ]
 dependencies = [
-    "cehrbert==1.3.3",
+    "cehrbert==1.3.8",
     "openai==1.54.3",
     "optuna==4.0.0",
     "transformers==4.40.0",
-    "tokenizers==0.19",
+    "tokenizers==0.19.0",
+    "peft==0.10.0",
     "trl==0.11.4",
 ]

cehrgpt-0.0.2/sample_configs/cehrgpt_pretrain_sample_config.yaml ADDED

@@ -0,0 +1,51 @@
+model_name_or_path: "test_results"
+tokenizer_name_or_path: "test_results"
+data_folder: "sample_data/pretrain"
+dataset_prepared_path: "test_dataset_prepared"
+validation_split_percentage: 0.05
+validation_split_num: 10
+preprocessing_num_workers: 4
+preprocessing_batch_size: 1000
+streaming: true
+#Tokenizer
+vocab_size: 50000
+min_frequency: 0
+do_train: true
+overwrite_output_dir: false
+resume_from_checkpoint: # path to the checkpoint folder
+seed: 42
+num_hidden_layers: 6
+hidden_size: 768
+n_head: 12
+max_position_embeddings: 1024
+# torch dataloader configs
+dataloader_num_workers: 4
+dataloader_prefetch_factor: 2
+output_dir: "test_results"
+save_strategy: "steps"
+evaluation_strategy: "no"
+learning_rate: 0.00005
+per_device_train_batch_size: 4
+per_device_eval_batch_size: 4
+gradient_accumulation_steps: 1
+num_train_epochs: 1
+# When streaming is set to True, max_steps needs to be provided
+max_steps: 1000
+save_steps: 500
+warmup_steps: 100
+weight_decay: 0.01
+logging_dir: "./logs"
+logging_steps: 100
+save_total_limit: 5
+load_best_model_at_end: false
+metric_for_best_model: "eval_loss"
+greater_is_better: false
+report_to: "none"
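
Reviewer note: with `streaming: true` the config relies on `max_steps` rather than `num_train_epochs` to bound training. The sketch below works out what the sample values imply, assuming the usual Hugging Face Trainer meaning of these fields and a single device. `TrainBudget` and `num_devices` are hypothetical names introduced for illustration and do not appear in the YAML or in cehrgpt.

```python
from dataclasses import dataclass


@dataclass
class TrainBudget:
    # Defaults mirror the sample config above.
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    max_steps: int = 1000
    save_steps: int = 500
    num_devices: int = 1  # assumption: not specified in the YAML

    @property
    def sequences_per_step(self) -> int:
        """Sequences consumed by one optimizer step."""
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    @property
    def total_sequences(self) -> int:
        """Sequences consumed over the whole run."""
        return self.sequences_per_step * self.max_steps

    @property
    def checkpoint_steps(self) -> list[int]:
        """Steps at which a step-based checkpoint would be due."""
        return list(range(self.save_steps, self.max_steps + 1, self.save_steps))
```

Under these assumptions the sample run consumes 4 × 1 × 1 × 1000 = 4000 sequences and is due to checkpoint at steps 500 and 1000, which stays well under `save_total_limit: 5`.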

cehrgpt-0.0.2/scripts/omop_pipeline.sh ADDED

@@ -0,0 +1,55 @@
+#!/bin/bash
+# Exporting input arguments as environment variables
+export PATIENT_SEQUENCE_FOLDER="$1"
+export OMOP_FOLDER="$2"
+export SOURCE_OMOP_FOLDER="$3"
+export PATIENT_SPLITS_FOLDER="$SOURCE_OMOP_FOLDER/patient_splits"
+# Echoing the values of the environment variables
+echo "PATIENT_SEQUENCE_FOLDER=$PATIENT_SEQUENCE_FOLDER"
+echo "OMOP_FOLDER=$OMOP_FOLDER"
+echo "SOURCE_OMOP_FOLDER=$SOURCE_OMOP_FOLDER"
+# Ensure OMOP_FOLDER exists
+if [ ! -d "$OMOP_FOLDER" ]; then
+    echo "Creating $OMOP_FOLDER"
+    mkdir -p "$OMOP_FOLDER"
+fi
+# Removing existing OMOP tables
+rm -rf $OMOP_FOLDER/{person,visit_occurrence,condition_occurrence,procedure_occurrence,drug_exposure,death,measurement,observation_period,condition_era}
+# Removing existing OMOP concept tables
+rm -rf $OMOP_FOLDER/{concept,concept_ancestor,concept_relationship}
+# Copying OMOP concept tables if they don't already exist
+for table in concept concept_relationship concept_ancestor; do
+    if [ ! -d "$OMOP_FOLDER/$table" ]; then
+        echo "Creating $OMOP_FOLDER/$table"
+        cp -r "$SOURCE_OMOP_FOLDER/$table" "$OMOP_FOLDER/$table"
+    fi
+done
+# Reconstructing the OMOP instance from patient sequences
+echo "Reconstructing the OMOP instance from patient sequences in $OMOP_FOLDER"
+python -m cehrgpt.generation.omop_converter_batch \
+  --patient_sequence_path "$PATIENT_SEQUENCE_FOLDER" \
+  --output_folder "$OMOP_FOLDER" \
+  --concept_path "$OMOP_FOLDER/concept" \
+  --buffer_size 1280 \
+  --cpu_cores 10
+# Create observation_period
+echo "Reconstructing observation_period in $OMOP_FOLDER"
+python -u -m cehrgpt.omop.observation_period \
+  --input_folder "$OMOP_FOLDER" \
+  --output_folder "$OMOP_FOLDER" \
+  --domain_table_list "condition_occurrence drug_exposure procedure_occurrence measurement"
+# Create condition_era
+echo "Reconstructing condition_era in $OMOP_FOLDER"
+python -u -m cehrgpt.omop.condition_era \
+  --input_folder "$OMOP_FOLDER" \
+  --output_folder "$OMOP_FOLDER" \
+  --domain_table_list "condition_occurrence"
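
Reviewer note: before calling the converters, the script clears previously generated tables with brace expansion (a bash feature, so it depends on the `#!/bin/bash` interpreter) and then copies the concept tables from the source folder. The sketch below restates only that reset-and-copy step in Python so the intended end state is easier to see. `reset_omop_folder` is a hypothetical helper, not something shipped in cehrgpt, and it assumes each table is stored as a directory.

```python
import shutil
from pathlib import Path

# Table names taken from the two `rm -rf` lines in the script above.
DOMAIN_TABLES = [
    "person", "visit_occurrence", "condition_occurrence",
    "procedure_occurrence", "drug_exposure", "death",
    "measurement", "observation_period", "condition_era",
]
CONCEPT_TABLES = ["concept", "concept_relationship", "concept_ancestor"]


def reset_omop_folder(omop_folder: Path, source_omop_folder: Path) -> None:
    """Remove stale tables, then copy fresh concept tables from the source."""
    omop_folder.mkdir(parents=True, exist_ok=True)
    # Drop any table directories left over from an earlier run.
    for table in DOMAIN_TABLES + CONCEPT_TABLES:
        shutil.rmtree(omop_folder / table, ignore_errors=True)
    # Copy the concept tables only if they are not already present.
    for table in CONCEPT_TABLES:
        target = omop_folder / table
        if not target.exists():
            shutil.copytree(source_omop_folder / table, target)
```

Because the concept tables are removed just before the copy loop, the `exists()` check only matters if the removal failed, which mirrors the structure of the shell script.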

cehrgpt-0.0.2/src/cehrgpt/data/hf_cehrgpt_dataset_mapping.py ADDED

@@ -0,0 +1,382 @@
+import datetime
+from typing import Any, Dict
+import numpy as np
+import pandas as pd
+from cehrbert.data_generators.hf_data_generator.hf_dataset_mapping import (
+    ED_VISIT_TYPE_CODES,
+    INPATIENT_VISIT_TYPE_CODES,
+    INPATIENT_VISIT_TYPES,
+    DatasetMapping,
+    replace_escape_chars,
+)
+from cehrbert.runners.hf_runner_argument_dataclass import DataTrainingArguments
+from cehrbert_data.const.common import NA
+from cehrbert_data.decorators.patient_event_decorator_base import get_att_function
+from dateutil.relativedelta import relativedelta
+from cehrgpt.models.tokenization_hf_cehrgpt import (
+    NONE_BIN,
+    UNKNOWN_BIN,
+    CehrGptTokenizer,
+)
+def convert_date_to_posix_time(index_date: datetime.date) -> float:
+    return datetime.datetime.combine(
+        index_date, datetime.datetime.min.time()
+    ).timestamp()
+class MedToCehrGPTDatasetMapping(DatasetMapping):
+    def __init__(
+        self,
+        data_args: DataTrainingArguments,
+        is_pretraining: bool = True,
+        include_inpatient_hour_token: bool = True,
+    ):
+        self._time_token_function = get_att_function(data_args.att_function_type)
+        self._include_auxiliary_token = data_args.include_auxiliary_token
+        self._inpatient_time_token_function = get_att_function(
+            data_args.inpatient_att_function_type
+        )
+        self._include_demographic_prompt = data_args.include_demographic_prompt
+        self._is_pretraining = is_pretraining
+        self._include_inpatient_hour_token = include_inpatient_hour_token
+    """
+    This mapping function converts the MED (https://github.com/Medical-Event-Data-Standard/meds/tree/main) extension
+    to the CehrGPT format. We make several assumptions
+    - The first event contains the demographic information
+    - From the second event onward
+        - the time of the event is visit_start_datetime.
+        - the first measurement contains the code indicating a standard OMOP Visit concept_id (e.g. 9201, 9202)
+        - in case of inpatient visits, the last measurement is assumed to
+            contain the standard OMOP concept id for discharge facilities (e.g 8536)
+        - in case of inpatient visits, datetime_value of the last measurement stores visit_end_datetime
+    """
+    def remove_columns(self):
+        if self._is_pretraining:
+            return ["visits", "birth_datetime", "index_date"]
+        else:
+            return [
+                "visits",
+                "birth_datetime",
+                "visit_concept_ids",
+            ]
+    @staticmethod
+    def _update_cehrgpt_record(
+        cehrgpt_record: Dict[str, Any],
+        code: str,
+        concept_value_mask: int = 0,
+        number_as_value: float = 0.0,
+        concept_as_value: str = "0",
+        is_numeric_type: int = 0,
+        unit: str = NA,
+    ) -> None:
+        cehrgpt_record["concept_ids"].append(replace_escape_chars(code))
+        cehrgpt_record["concept_value_masks"].append(concept_value_mask)
+        cehrgpt_record["number_as_values"].append(number_as_value)
+        cehrgpt_record["concept_as_values"].append(concept_as_value)
+        cehrgpt_record["units"].append(unit)
+        cehrgpt_record["is_numeric_types"].append(is_numeric_type)
+    def transform(self, record: Dict[str, Any]) -> Dict[str, Any]:
+        cehrgpt_record = {
+            "person_id": record["patient_id"],
+            "concept_ids": [],
+            "concept_value_masks": [],
+            "number_as_values": [],
+            "concept_as_values": [],
+            "units": [],
+            "is_numeric_types": [],
+        }
+        # Extract the demographic information
+        birth_datetime = record["birth_datetime"]
+        if isinstance(birth_datetime, pd.Timestamp):
+            birth_datetime = birth_datetime.to_pydatetime()
+        gender = record["gender"]
+        race = record["race"]
+        # Add the demographic tokens
+        first_visit = record["visits"][0]
+        year_str = f'year:{str(first_visit["visit_start_datetime"].year)}'
+        age_str = f'age:{str(relativedelta(first_visit["visit_start_datetime"], birth_datetime).years)}'
+        self._update_cehrgpt_record(cehrgpt_record, year_str)
+        self._update_cehrgpt_record(cehrgpt_record, age_str)
+        self._update_cehrgpt_record(cehrgpt_record, gender)
+        self._update_cehrgpt_record(cehrgpt_record, race)
+        # Use a date cursor to keep track of time
+        date_cursor = None
+        # Loop through the visits in chronological order
+        for i, visit in enumerate(
+            sorted(record["visits"], key=lambda e: e["visit_start_datetime"])
+        ):
+            events = visit["events"]
+            # Skip this visit if it has no events
+            if events is None or len(events) == 0:
+                continue
+            visit_start_datetime = visit["visit_start_datetime"]
+            time_delta = (
+                (visit_start_datetime - date_cursor).days if date_cursor else None
+            )
+            date_cursor = visit_start_datetime
+            # We assume the first measurement to be the visit type of the current visit
+            visit_type = visit["visit_type"]
+            is_er_or_inpatient = (
+                visit_type in INPATIENT_VISIT_TYPES
+                or visit_type in INPATIENT_VISIT_TYPE_CODES
+                or visit_type in ED_VISIT_TYPE_CODES
+            )
+            # Add artificial time tokens to the patient timeline if timedelta exists
+            if time_delta is not None:
+                # This generates an artificial time token depending on the choice of the time token functions
+                self._update_cehrgpt_record(
+                    cehrgpt_record,
+                    code=self._time_token_function(time_delta),
+                )
+            # Calculate the week number since the epoch time
+            date = (
+                visit["visit_start_datetime"]
+                - datetime.datetime(year=1970, month=1, day=1)
+            ).days // 7
+            # Add a [VS] token
+            self._update_cehrgpt_record(
+                cehrgpt_record,
+                code="[VS]",
+            )
+            # Add a visit type token
+            self._update_cehrgpt_record(
+                cehrgpt_record,
+                code=visit_type,
+            )
+            # Keep track of the existing outpatient events, we don't want to add them again
+            existing_outpatient_events = list()
+            for e in events:
+                # If the event doesn't have a time stamp, we skip it
+                if not e["time"]:
+                    continue
+                # If numeric_value exists, this is a concept/value tuple, we indicate this using a concept_value_mask
+                numeric_value = e.get("numeric_value", None)
+                text_value = e.get("text_value", None)
+                # The unit might be populated with a None value
+                unit = e.get("unit", NA) if e.get("unit", NA) else NA
+                concept_value_mask = int(
+                    numeric_value is not None or text_value is not None
+                )
+                is_numeric_type = int(numeric_value is not None)
+                code = replace_escape_chars(e["code"])
+                # Add a medical token to the patient timeline
+                # If this is an inpatient visit, we use the event time stamps to calculate age and date
+                # because the patient can stay in the hospital for a period of time.
+                if is_er_or_inpatient:
+                    # Calculate the week number since the epoch time
+                    date = (
+                        e["time"] - datetime.datetime(year=1970, month=1, day=1)
+                    ).days // 7
+                    # Calculate the time diff in days w.r.t the previous measurement
+                    meas_time_diff = (e["time"] - date_cursor).days
+                    # Update the date_cursor if the time diff between two neighboring measurements is greater than or
+                    # equal to 1 day
+                    if meas_time_diff > 0:
+                        date_cursor = e["time"]
+                        if self._inpatient_time_token_function:
+                            # This generates an artificial time token depending on the choice of the time token functions
+                            self._update_cehrgpt_record(
+                                cehrgpt_record,
+                                code=f"i-{self._inpatient_time_token_function(meas_time_diff)}",
+                            )
+                else:
+                    # For outpatient visits, we use the visit time stamp to calculate age and time because we assume
+                    # the outpatient visits start and end on the same day.
+                    # We check whether the date/code/value combination already exists in the existing events
+                    # If they exist, we do not add them to the patient timeline for outpatient visits.
+                    if (
+                        date,
+                        code,
+                        numeric_value,
+                        text_value,
+                        concept_value_mask,
+                        numeric_value,
+                    ) in existing_outpatient_events:
+                        continue
+                self._update_cehrgpt_record(
+                    cehrgpt_record,
+                    code=code,
+                    concept_value_mask=concept_value_mask,
+                    unit=unit,
+                    number_as_value=numeric_value if numeric_value else 0.0,
+                    concept_as_value=(
+                        replace_escape_chars(text_value) if text_value else "0"
+                    ),
+                    is_numeric_type=is_numeric_type,
+                )
+                existing_outpatient_events.append(
+                    (
+                        date,
+                        code,
+                        numeric_value,
+                        text_value,
+                        concept_value_mask,
+                        numeric_value,
+                    )
+                )
+            # For inpatient or ER visits, we want to append discharge_facility to the end of the visit
+            if is_er_or_inpatient:
+                # If visit_end_datetime is populated for the inpatient visit, we update the date_cursor
+                visit_end_datetime = visit.get("visit_end_datetime", None)
+                if visit_end_datetime:
+                    date_cursor = visit_end_datetime
+                if self._include_auxiliary_token:
+                    # Reuse the age and date calculated for the last event in the patient timeline for the discharge
+                    # facility event
+                    discharge_facility = (
+                        visit["discharge_facility"]
+                        if ("discharge_facility" in visit)
+                        and visit["discharge_facility"]
+                        else "0"
+                    )
+                    self._update_cehrgpt_record(
+                        cehrgpt_record,
+                        code=discharge_facility,
+                    )
+            # Reuse the age and date calculated for the last event in the patient timeline
+            self._update_cehrgpt_record(
+                cehrgpt_record,
+                code="[VE]",
+            )
+        # Generate the orders of the concepts that the cehrbert dataset mapping function expects
+        cehrgpt_record["orders"] = list(
+            range(1, len(cehrgpt_record["concept_ids"]) + 1)
+        )
+        # Add some count information for this sequence
+        cehrgpt_record["num_of_concepts"] = len(cehrgpt_record["concept_ids"])
+        cehrgpt_record["num_of_visits"] = len(record["visits"])
+        if "label" in record:
+            cehrgpt_record["label"] = record["label"]
+        if "age_at_index" in record:
+            cehrgpt_record["age_at_index"] = record["age_at_index"]
+        return cehrgpt_record
+class HFCehrGptTokenizationMapping(DatasetMapping):
+    def __init__(
+        self,
+        concept_tokenizer: CehrGptTokenizer,
+    ):
+        self._concept_tokenizer = concept_tokenizer
+        self._lab_token_ids = self._concept_tokenizer.lab_token_ids
+    def remove_columns(self):
+        return [
+            "concept_value_masks",
+            "is_numeric_types",
+        ]
+    def transform(self, record: Dict[str, Any]) -> Dict[str, Any]:
+        # If any concept has a value associated with it, we normalize the value
+        record["input_ids"] = self._concept_tokenizer.encode(record["concept_ids"])
+        record["value_indicators"] = record["concept_value_masks"]
+        if "number_as_values" not in record or "concept_as_values" not in record:
+            record["number_as_values"] = [
+                float(value) if isinstance(value, float) else None
+                for value in record["concept_values"]
+            ]
+            record["is_numeric_types"] = [
+                int(isinstance(value, float)) for value in record["concept_values"]
+            ]
+            record["concept_as_values"] = [
+                value if isinstance(value, str) else None
+                for value in record["concept_values"]
+            ]
+        if np.any(np.asarray(record["concept_value_masks"]) > 0):
+            values = []
+            for i, (
+                concept_id,
+                unit,
+                concept_value_mask,
+                number_as_value,
+                concept_as_value,
+                is_numeric_type,
+            ) in enumerate(
+                zip(
+                    record["concept_ids"],
+                    record["units"],
+                    record["concept_value_masks"],
+                    record["number_as_values"],
+                    record["concept_as_values"],
+                    record["is_numeric_types"],
+                )
+            ):
+                if concept_value_mask == 1:
+                    value = UNKNOWN_BIN
+                    if is_numeric_type == 1:
+                        if concept_id in self._concept_tokenizer.numeric_concept_ids:
+                            value = self._concept_tokenizer.normalize(
+                                concept_id, unit, number_as_value
+                            )
+                    elif isinstance(concept_as_value, str):
+                        value = concept_as_value
+                    values.append(value)
+                else:
+                    values.append(NONE_BIN)
+            assert len(values) == len(record["input_ids"])
+            record["values"] = self._concept_tokenizer.encode_value(values)
+        else:
+            record["values"] = self._concept_tokenizer.encode_value(
+                [NONE_BIN for _ in range(len(record["concept_value_masks"]))]
+            )
+        # Delete these features because they contain null values and pyarrow cannot concatenate multiple records
+        del record["number_as_values"]
+        del record["concept_as_values"]
+        return record
+class HFFineTuningMapping(HFCehrGptTokenizationMapping):
+    """Consider removing this transformation in the future."""
+    def transform(self, record: Dict[str, Any]) -> Dict[str, Any]:
+        record = super().transform(record)
+        record.update(
+            {
+                "age_at_index": (
+                    record["age"] if "age" in record else record["age_at_index"]
+                ),
+                "classifier_label": int(record["label"] > 0),
+                "index_date": (
+                    convert_date_to_posix_time(record["index_date"])
+                    if "index_date" in record
+                    else None
+                ),
+            }
+        )
+        return record
+    def remove_columns(self):
+        columns = super().remove_columns()
+        columns.append("label")
+        return columns
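
Reviewer note: the core of `MedToCehrGPTDatasetMapping.transform` is the token layout: four demographic tokens, then for each visit an optional time token, `[VS]`, the visit type, the event codes, and `[VE]`. The sketch below reproduces only the outpatient path with the standard library so the layout can be inspected in isolation. `day_token` is a hypothetical stand-in for whatever `get_att_function` returns, `age_in_years` approximates the `relativedelta(...).years` call, and `build_timeline` is not part of cehrgpt. Values, units, inpatient time tokens, and de-duplication are left out.

```python
import datetime
from typing import Any, Dict, List

EPOCH = datetime.datetime(1970, 1, 1)


def weeks_since_epoch(ts: datetime.datetime) -> int:
    """Week index as in the diff: whole days since 1970-01-01, floor-divided by 7."""
    return (ts - EPOCH).days // 7


def age_in_years(birth: datetime.datetime, at: datetime.datetime) -> int:
    """Completed years between birth and `at` (stand-in for relativedelta)."""
    return at.year - birth.year - ((at.month, at.day) < (birth.month, birth.day))


def day_token(days: int) -> str:
    """Hypothetical time token; the real format comes from get_att_function."""
    return f"D{days}"


def build_timeline(record: Dict[str, Any]) -> List[str]:
    """Build the token sequence for a record containing outpatient visits only."""
    # As in the diff, demographics are derived from the first listed visit.
    first_start = record["visits"][0]["visit_start_datetime"]
    tokens = [
        f"year:{first_start.year}",
        f"age:{age_in_years(record['birth_datetime'], first_start)}",
        record["gender"],
        record["race"],
    ]
    date_cursor = None
    for visit in sorted(record["visits"], key=lambda v: v["visit_start_datetime"]):
        if not visit["events"]:
            continue  # visits without events are skipped entirely
        start = visit["visit_start_datetime"]
        if date_cursor is not None:
            # Gap in days since the previous visit becomes a time token.
            tokens.append(day_token((start - date_cursor).days))
        date_cursor = start
        tokens.append("[VS]")
        tokens.append(visit["visit_type"])
        tokens.extend(e["code"] for e in visit["events"])
        tokens.append("[VE]")
    return tokens
```

For a record with two outpatient visits ten days apart, the sketch yields the demographic tokens, the first visit block, a single `D10` token, and the second visit block.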
