PyPI - chebai - Versions diffs - 0.0.1.dev0__tar.gz → 0.0.2.dev0__tar.gz - Mend

chebai 0.0.1.dev0__tar.gz → 0.0.2.dev0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (116)

chebai-0.0.2.dev0/PKG-INFO ADDED

@@ -0,0 +1,52 @@
+Metadata-Version: 2.4
+Name: chebai
+Version: 0.0.2.dev0
+Home-page:
+Author: MGlauer
+Author-email: martin.glauer@ovgu.de
+Requires-Python: >=3.9, <3.13
+License-File: LICENSE
+Requires-Dist: certifi
+Requires-Dist: idna
+Requires-Dist: joblib
+Requires-Dist: networkx
+Requires-Dist: numpy<2
+Requires-Dist: pandas
+Requires-Dist: python-dateutil
+Requires-Dist: pytz
+Requires-Dist: requests
+Requires-Dist: scikit-learn
+Requires-Dist: scipy
+Requires-Dist: six
+Requires-Dist: threadpoolctl
+Requires-Dist: torch
+Requires-Dist: typing-extensions
+Requires-Dist: urllib3
+Requires-Dist: transformers
+Requires-Dist: fastobo
+Requires-Dist: pysmiles==1.1.2
+Requires-Dist: scikit-network
+Requires-Dist: svgutils
+Requires-Dist: matplotlib
+Requires-Dist: rdkit
+Requires-Dist: selfies
+Requires-Dist: lightning>=2.5
+Requires-Dist: jsonargparse[signatures]>=4.17
+Requires-Dist: omegaconf
+Requires-Dist: seaborn
+Requires-Dist: deepsmiles
+Requires-Dist: iterative-stratification
+Requires-Dist: wandb
+Requires-Dist: chardet
+Requires-Dist: pyyaml
+Requires-Dist: torchmetrics
+Provides-Extra: dev
+Requires-Dist: black; extra == "dev"
+Requires-Dist: isort; extra == "dev"
+Requires-Dist: pre-commit; extra == "dev"
+Dynamic: author
+Dynamic: author-email
+Dynamic: license-file
+Dynamic: provides-extra
+Dynamic: requires-dist
+Dynamic: requires-python

chebai-0.0.2.dev0/README.md ADDED

@@ -0,0 +1,89 @@
+# ChEBai
+ChEBai is a deep learning library designed for the integration of deep learning methods with chemical ontologies, particularly ChEBI.
+The library emphasizes the incorporation of the semantic qualities of the ontology into the learning process.
+## Installation
+To install ChEBai, follow these steps:
+1. Clone the repository:
+```
+git clone https://github.com/ChEB-AI/python-chebai.git
+```
+2. Install the package:
+```
+cd python-chebai
+pip install .
+```
+## Usage
+Training and inference are abstracted using PyTorch Lightning modules.
+Here are some CLI commands for the standard functionalities of pretraining, ontology extension, fine-tuning for toxicity and prediction.
+For further details, see the [wiki](https://github.com/ChEB-AI/python-chebai/wiki).
+If you face any problems, please open a new [issue](https://github.com/ChEB-AI/python-chebai/issues/new).
+### Pretraining
+```
+python -m chebai fit --data.class_path=chebai.preprocessing.datasets.pubchem.PubchemChem --model=configs/model/electra-for-pretraining.yml --trainer=configs/training/pretraining_trainer.yml
+```
+### Structure-based ontology extension
+```
+python -m chebai fit --trainer=configs/training/default_trainer.yml --model=configs/model/electra.yml  --model.pretrained_checkpoint=[path-to-pretrained-model] --model.load_prefix=generator. --data=[path-to-dataset-config] --model.out_dim=[number-of-labels]
+```
+A command with additional options may look like this:
+```
+python3 -m chebai fit --trainer=configs/training/default_trainer.yml --model=configs/model/electra.yml --model.train_metrics=configs/metrics/micro-macro-f1.yml --model.test_metrics=configs/metrics/micro-macro-f1.yml --model.val_metrics=configs/metrics/micro-macro-f1.yml --model.pretrained_checkpoint=electra_pretrained.ckpt --model.load_prefix=generator. --data=configs/data/chebi50.yml --model.criterion=configs/loss/bce.yml --data.init_args.batch_size=10 --trainer.logger.init_args.name=chebi50_bce_unweighted --data.init_args.num_workers=9 --model.pass_loss_kwargs=false --data.init_args.chebi_version=231 --data.init_args.data_limit=1000
+```
+### Fine-tuning for Toxicity prediction
+```
+python -m chebai fit --config=[path-to-your-tox21-config] --trainer.callbacks=configs/training/default_callbacks.yml  --model.pretrained_checkpoint=[path-to-pretrained-model]
+```
+### Predicting classes given SMILES strings
+```
+python3 -m chebai predict_from_file --model=[path-to-model-config] --checkpoint_path=[path-to-model] --input_path=[path-to-file-containing-smiles] [--classes_path=[path-to-classes-file]] [--save_to=[path-to-output]]
+```
+The input file should contain a list of line-separated SMILES strings. This generates a CSV file that contains
+one row for each SMILES string and one column for each class.
+The `classes_path` is the path to the dataset's `raw/classes.txt` file that contains the relationship between model output and ChEBI-IDs.
+## Evaluation
+An example for evaluating a model trained on the ontology extension task is given in `tutorials/eval_model_basic.ipynb`.
+It takes in the finetuned model as input for performing the evaluation.
+## Cross-validation
+You can do inner k-fold cross-validation, i.e., train models on k train-validation splits that all use the same test
+set. For that, you need to specify the total number of folds as
+```
+--data.init_args.inner_k_folds=K
+```
+and the fold to be used in the current optimisation run as
+```
+--data.init_args.fold_index=I
+```
+To train K models, you need to do K such calls, each with a different `fold_index`. On the first call with a given
+`inner_k_folds`, all folds will be created and stored in the data directory.
+## Note for developers
+If you have used ChEBai before PR #39, the file structure in which your ChEBI-data is saved has changed. This means that
+datasets will be freshly generated. The data however is the same. If you want to keep the old data (including the old
+splits), you can use a migration script. It copies the old data to the new location for a specific ChEBI class
+(including chebi version and other parameters). The script can be called by specifying the data module from a config
+```
+python chebai/preprocessing/migration/chebi_data_migration.py migrate --datamodule=[path-to-data-config]
+```
+or by specifying the class name (e.g. `ChEBIOver50`) and arguments separately
+```
+python chebai/preprocessing/migration/chebi_data_migration.py migrate --class_name=[data-class] [--chebi_version=[version]]
+```
+The new dataset will by default generate random data splits (with a given seed).
+To reuse a fixed data split, you have to provide the path of the csv file generated during the migration:
+`--data.init_args.splits_file_path=[path-to-processed_data]/splits.csv`
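The cross-validation section of the README above describes K separate `fit` calls that differ only in `fold_index`. A minimal sketch of how those calls could be generated is shown here. The helper name `build_fold_commands` is hypothetical and not part of chebai, the config paths are copied from the README's own example command, and the zero-based fold numbering is an assumption, since the README does not state where indexing starts.

```python
from typing import List


def build_fold_commands(k: int, base_args: List[str]) -> List[List[str]]:
    """Return one `chebai fit` argument list per fold.

    Hypothetical helper, not part of chebai. Assumes fold indices run
    from 0 to k-1, which the README does not confirm.
    """
    if k < 1:
        raise ValueError("k must be a positive number of folds")
    commands = []
    for fold_index in range(k):
        commands.append(
            [
                "python", "-m", "chebai", "fit",
                *base_args,
                # Both flags are taken verbatim from the README.
                f"--data.init_args.inner_k_folds={k}",
                f"--data.init_args.fold_index={fold_index}",
            ]
        )
    return commands


if __name__ == "__main__":
    base = [
        "--trainer=configs/training/default_trainer.yml",
        "--model=configs/model/electra.yml",
        "--data=configs/data/chebi50.yml",
    ]
    # Print the commands instead of running them, so nothing is trained here.
    for cmd in build_fold_commands(3, base):
        print(" ".join(cmd))
```

Each printed line could then be run by hand or passed to a job scheduler; the sketch deliberately does not launch any training itself.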

chebai-0.0.2.dev0/chebai/__init__.py ADDED

@@ -0,0 +1,30 @@
+import os
+from typing import Any
+import torch
+# Get the absolute path of the current file's directory
+MODULE_PATH = os.path.abspath(os.path.dirname(__file__))
+class CustomTensor(torch.Tensor):
+    """
+    A custom tensor class inheriting from `torch.Tensor`.
+    This class allows for the creation of tensors using the provided data.
+    Attributes:
+        data (Any): The data to be converted into a tensor.
+    """
+    def __new__(cls, data: Any) -> "CustomTensor":
+        """
+        Creates a new instance of CustomTensor.
+        Args:
+            data (Any): The data to be converted into a tensor.
+        Returns:
+            CustomTensor: A tensor containing the provided data.
+        """
+        return torch.tensor(data)

chebai-0.0.2.dev0/chebai/__main__.py ADDED

@@ -0,0 +1,10 @@
+from chebai.cli import cli
+if __name__ == "__main__":
+    """
+    Entry point for the CLI application.
+    This script calls the `cli` function from the `chebai.cli` module
+    when executed as the main program.
+    """
+    cli()

chebai-0.0.2.dev0/chebai/callbacks/__init__.py ADDED

File without changes

chebai-0.0.2.dev0/chebai/callbacks/epoch_metrics.py ADDED

@@ -0,0 +1,180 @@
+import torch
+import torchmetrics
+def custom_reduce_fx(input: torch.Tensor) -> torch.Tensor:
+    """
+    Custom reduction function for distributed training.
+    Args:
+        input (torch.Tensor): The input tensor to be reduced.
+    Returns:
+        torch.Tensor: The reduced tensor.
+    """
+    print(f"called reduce (device: {input.device})")
+    return torch.sum(input, dim=0)
+class MacroF1(torchmetrics.Metric):
+    """
+    Computes the Macro F1 score, which is the unweighted mean of F1 scores for each class.
+    This implementation differs from torchmetrics.classification.MultilabelF1Score in the behaviour for undefined
+    values (i.e., classes where TP+FN=0). The torchmetrics implementation sets these classes to a default value.
+    Here, the mean is only taken over classes which have at least one positive sample.
+    Args:
+        num_labels (int): Number of classes/labels.
+        dist_sync_on_step (bool, optional): Synchronize metric state across processes at each forward
+            before returning the value at the step. Default: False.
+        threshold (float, optional): Threshold for converting predicted probabilities to binary (0, 1) predictions.
+            Default: 0.5.
+    """
+    def __init__(
+        self, num_labels: int, dist_sync_on_step: bool = False, threshold: float = 0.5
+    ):
+        super().__init__(dist_sync_on_step=dist_sync_on_step)
+        self.add_state(
+            "true_positives",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.add_state(
+            "positive_predictions",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.add_state(
+            "positive_labels",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.threshold = threshold
+    def update(self, preds: torch.Tensor, labels: torch.Tensor) -> None:
+        """
+        Update the state (TPs, Positive Predictions, Positive labels) with the current batch of predictions and labels.
+        Args:
+            preds (torch.Tensor): Predictions from the model.
+            labels (torch.Tensor): Ground truth labels.
+        """
+        tps = torch.sum(
+            torch.logical_and(preds > self.threshold, labels.to(torch.bool)), dim=0
+        )
+        self.true_positives += tps
+        self.positive_predictions += torch.sum(preds > self.threshold, dim=0)
+        self.positive_labels += torch.sum(labels, dim=0)
+    def compute(self) -> torch.Tensor:
+        """
+        Compute the Macro F1 score.
+        Returns:
+            torch.Tensor: The computed Macro F1 score.
+        """
+        # ignore classes without positive labels
+        # classes with positive labels, but no positive predictions will get a precision of "nan" (0 divided by 0),
+        # which is propagated to the classwise_f1 and then turned into 0
+        mask = self.positive_labels != 0
+        precision = self.true_positives[mask] / self.positive_predictions[mask]
+        recall = self.true_positives[mask] / self.positive_labels[mask]
+        classwise_f1 = 2 * precision * recall / (precision + recall)
+        # if (precision and recall are 0) or (precision is nan), set f1 to 0
+        classwise_f1 = classwise_f1.nan_to_num()
+        return torch.mean(classwise_f1)
+class BalancedAccuracy(torchmetrics.Metric):
+    """
+    Computes the Balanced Accuracy, which is the average of true positive rate (TPR) and true negative rate (TNR).
+    Useful for imbalanced datasets.
+    Balanced Accuracy = (TPR + TNR)/2 = (TP/(TP + FN) + (TN)/(TN + FP))/2
+    Args:
+        num_labels (int): Number of classes/labels.
+        dist_sync_on_step (bool, optional): Synchronize metric state across processes at each forward
+            before returning the value at the step. Default: False.
+        threshold (float, optional): Threshold for converting predicted probabilities to binary (0, 1) predictions.
+            Default: 0.5.
+    """
+    def __init__(
+        self, num_labels: int, dist_sync_on_step: bool = False, threshold: float = 0.5
+    ):
+        super().__init__(dist_sync_on_step=dist_sync_on_step)
+        self.add_state(
+            "true_positives",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.add_state(
+            "false_positives",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.add_state(
+            "true_negatives",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.add_state(
+            "false_negatives",
+            default=torch.zeros(num_labels, dtype=torch.int),
+            dist_reduce_fx="sum",
+        )
+        self.threshold = threshold
+    def update(self, preds: torch.Tensor, labels: torch.Tensor) -> None:
+        """
+        Update the state (TPs, TNs, FPs, FNs) with the current batch of predictions and labels.
+        Args:
+            preds (torch.Tensor): Predictions from the model.
+            labels (torch.Tensor): Ground truth labels.
+        """
+        # Size: Batch_size x Num_of_Classes;
+        # summing over 1st dimension (dim=0), gives us the True positives per class
+        tps = torch.sum(
+            torch.logical_and(preds > self.threshold, labels.to(torch.bool)), dim=0
+        )
+        fps = torch.sum(
+            torch.logical_and(preds > self.threshold, ~labels.to(torch.bool)), dim=0
+        )
+        tns = torch.sum(
+            torch.logical_and(preds <= self.threshold, ~labels.to(torch.bool)), dim=0
+        )
+        fns = torch.sum(
+            torch.logical_and(preds <= self.threshold, labels.to(torch.bool)), dim=0
+        )
+        # Size: Num_of_Classes;
+        self.true_positives += tps
+        self.false_positives += fps
+        self.true_negatives += tns
+        self.false_negatives += fns
+    def compute(self) -> torch.Tensor:
+        """
+        Compute the Balanced Accuracy.
+        Returns:
+            torch.Tensor: The computed Balanced Accuracy.
+        """
+        tpr = self.true_positives / (self.true_positives + self.false_negatives)
+        tnr = self.true_negatives / (self.true_negatives + self.false_positives)
+        # Convert the nan values to 0
+        tpr = tpr.nan_to_num()
+        tnr = tnr.nan_to_num()
+        balanced_acc = (tpr + tnr) / 2
+        return torch.mean(balanced_acc)
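To make the handling of undefined values in `MacroF1` and `BalancedAccuracy` concrete, here is a small worked example. It is a dependency-free sketch in plain Python rather than the torch code from the diff: the function names `macro_f1`, `balanced_accuracy`, and `_safe_div` are illustrative, and `_safe_div` stands in for the `0/0 -> nan -> nan_to_num() -> 0` path of the original.

```python
from typing import List, Sequence


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 for 0/0, mirroring nan followed by nan_to_num() in the torch code."""
    return numerator / denominator if denominator else 0.0


def macro_f1(preds: Sequence[Sequence[float]],
             labels: Sequence[Sequence[int]],
             threshold: float = 0.5) -> float:
    """Mean F1 over classes that have at least one positive label."""
    num_labels = len(labels[0])
    tp = [0] * num_labels
    pos_pred = [0] * num_labels
    pos_label = [0] * num_labels
    for pred_row, label_row in zip(preds, labels):
        for j, (p, y) in enumerate(zip(pred_row, label_row)):
            predicted = p > threshold
            tp[j] += int(predicted and y == 1)
            pos_pred[j] += int(predicted)
            pos_label[j] += int(y == 1)
    f1_scores: List[float] = []
    for j in range(num_labels):
        if pos_label[j] == 0:
            continue  # class without positive samples is excluded from the mean
        precision = _safe_div(tp[j], pos_pred[j])
        recall = tp[j] / pos_label[j]
        f1_scores.append(_safe_div(2 * precision * recall, precision + recall))
    return sum(f1_scores) / len(f1_scores) if f1_scores else float("nan")


def balanced_accuracy(preds: Sequence[Sequence[float]],
                      labels: Sequence[Sequence[int]],
                      threshold: float = 0.5) -> float:
    """Mean over all classes of (TPR + TNR) / 2, with undefined rates set to 0."""
    num_labels = len(labels[0])
    per_class: List[float] = []
    for j in range(num_labels):
        tp = fp = tn = fn = 0
        for pred_row, label_row in zip(preds, labels):
            predicted = pred_row[j] > threshold
            actual = label_row[j] == 1
            tp += int(predicted and actual)
            fp += int(predicted and not actual)
            tn += int(not predicted and not actual)
            fn += int(not predicted and actual)
        per_class.append((_safe_div(tp, tp + fn) + _safe_div(tn, tn + fp)) / 2)
    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    preds = [[0.9, 0.2, 0.7], [0.8, 0.1, 0.3]]
    labels = [[1, 0, 0], [1, 0, 1]]
    print(macro_f1(preds, labels))           # class 1 has no positives and is skipped
    print(balanced_accuracy(preds, labels))  # class 1 still counts here
```

In this toy batch the second class has no positive labels, so it is dropped from the macro F1 mean but still contributes to the balanced accuracy, where its undefined true positive rate is counted as 0.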

chebai-0.0.2.dev0/chebai/callbacks/model_checkpoint.py ADDED

@@ -0,0 +1,95 @@
+import os
+from lightning.fabric.utilities.cloud_io import _is_dir
+from lightning.fabric.utilities.types import _PATH
+from lightning.pytorch import LightningModule, Trainer
+from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
+from lightning.pytorch.loggers import WandbLogger
+from lightning.pytorch.utilities.rank_zero import rank_zero_info
+from lightning_utilities.core.rank_zero import rank_zero_warn
+class CustomModelCheckpoint(ModelCheckpoint):
+    """
+    Custom checkpoint class that resolves checkpoint paths to ensure checkpoints are saved in the same directory
+    as other logs when using CustomLogger.
+    Inherits from PyTorch Lightning's ModelCheckpoint class.
+    """
+    def setup(self, trainer: Trainer, pl_module: LightningModule, stage: str) -> None:
+        """
+        Setup the directory path for saving checkpoints. If the directory path is not set, it resolves the checkpoint
+        directory using the custom logger's directory.
+        Note:
+            Same as in parent class, duplicated to be able to call self.__resolve_ckpt_dir
+        Args:
+            trainer (Trainer): The Trainer instance.
+            pl_module (LightningModule): The LightningModule instance.
+            stage (str): The stage of training (e.g., 'fit').
+        """
+        if self.dirpath is not None:
+            self.dirpath = None
+        dirpath = self.__resolve_ckpt_dir(trainer)
+        dirpath = trainer.strategy.broadcast(dirpath)
+        self.dirpath = dirpath
+        if trainer.is_global_zero and stage == "fit":
+            self.__warn_if_dir_not_empty(self.dirpath)
+    def __warn_if_dir_not_empty(self, dirpath: _PATH) -> None:
+        """
+        Warn if the checkpoint directory is not empty.
+        Note:
+            Same as in parent class, duplicated because method in parent class is not accessible
+        Args:
+            dirpath (_PATH): The path to the checkpoint directory.
+        """
+        if (
+            self.save_top_k != 0
+            and _is_dir(self._fs, dirpath, strict=True)
+            and len(self._fs.ls(dirpath)) > 0
+        ):
+            rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
+    def __resolve_ckpt_dir(self, trainer: Trainer) -> _PATH:
+        """
+        Resolve the checkpoint directory path, ensuring compatibility with WandbLogger by saving checkpoints
+        in the same directory as Wandb logs.
+        Note:
+            Overwritten for compatibility with wandb -> saves checkpoints in same dir as wandb logs
+        Args:
+            trainer (Trainer): The Trainer instance.
+        Returns:
+            _PATH: The resolved checkpoint directory path.
+        """
+        rank_zero_info("Resolving checkpoint dir (custom)")
+        if self.dirpath is not None:
+            # short circuit if dirpath was passed to ModelCheckpoint
+            return self.dirpath
+        if len(trainer.loggers) > 0:
+            if trainer.loggers[0].save_dir is not None:
+                save_dir = trainer.loggers[0].save_dir
+            else:
+                save_dir = trainer.default_root_dir
+            name = trainer.loggers[0].name
+            version = trainer.loggers[0].version
+            version = version if isinstance(version, str) else f"version_{version}"
+            logger = trainer.loggers[0]
+            if isinstance(logger, WandbLogger) and isinstance(
+                logger.experiment.dir, str
+            ):
+                ckpt_path = os.path.join(logger.experiment.dir, "checkpoints")
+            else:
+                ckpt_path = os.path.join(save_dir, str(name), version, "checkpoints")
+        else:
+            # if no loggers, use default_root_dir
+            ckpt_path = os.path.join(trainer.default_root_dir, "checkpoints")
+        rank_zero_info(f"Now using checkpoint path {ckpt_path}")
+        return ckpt_path
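The branching in `__resolve_ckpt_dir` can be summarised as a pure function. The following is a sketch under stated assumptions, not the library's code: `resolve_ckpt_dir` and all of its parameters are hypothetical stand-ins for the values the method reads from the trainer and its first logger, and `wandb_run_dir` plays the role of `logger.experiment.dir` for a `WandbLogger`.

```python
import os
from typing import Optional, Union


def resolve_ckpt_dir(
    dirpath: Optional[str],
    default_root_dir: str,
    has_logger: bool = True,
    save_dir: Optional[str] = None,
    name: Optional[str] = None,
    version: Union[int, str, None] = None,
    wandb_run_dir: Optional[str] = None,
) -> str:
    """Mirror the decision order of the method in the diff (illustrative only)."""
    if dirpath is not None:
        # Short circuit: an explicitly configured directory wins.
        return dirpath
    if not has_logger:
        # No loggers: fall back to the trainer's default root directory.
        return os.path.join(default_root_dir, "checkpoints")
    if wandb_run_dir is not None:
        # Wandb case: keep checkpoints next to the run's own files.
        return os.path.join(wandb_run_dir, "checkpoints")
    base = save_dir if save_dir is not None else default_root_dir
    # Integer versions are rendered as "version_<n>", strings are used as-is.
    version_str = version if isinstance(version, str) else f"version_{version}"
    return os.path.join(base, str(name), version_str, "checkpoints")


if __name__ == "__main__":
    print(resolve_ckpt_dir(None, "root", save_dir="logs", name="run", version=3))
```

Note that `setup` in the diff sets `self.dirpath` to `None` before calling the resolver, so the first branch does not appear to be reachable from that call path.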

chebai-0.0.2.dev0/chebai/callbacks/prediction_callback.py ADDED

@@ -0,0 +1,55 @@
+import os
+import pickle
+from typing import Any, Literal, Sequence
+import torch
+from lightning.pytorch import LightningModule, Trainer
+from lightning.pytorch.callbacks import BasePredictionWriter
+class PredictionWriter(BasePredictionWriter):
+    """
+    Custom callback for writing predictions to a file at the end of each epoch.
+    Args:
+        output_dir (str): The directory where prediction files will be saved.
+        write_interval (str): When to write predictions. Options are "batch", "epoch" or "batch_and_epoch".
+    """
+    def __init__(
+        self,
+        output_dir: str,
+        write_interval: Literal["batch", "epoch", "batch_and_epoch"],
+    ):
+        super().__init__(write_interval)
+        self.output_dir = output_dir
+        self.prediction_file_name = "predictions.pkl"
+    def write_on_epoch_end(
+        self,
+        trainer: Trainer,
+        pl_module: LightningModule,
+        predictions: Sequence[Any],
+        batch_indices: Sequence[Any],
+    ) -> None:
+        """
+        Writes the predictions to a file at the end of the epoch.
+        Args:
+            trainer (Trainer): The Trainer instance.
+            pl_module (LightningModule): The LightningModule instance.
+            predictions (Sequence[Any]): Any sequence of predictions for the epoch.
+            batch_indices (Sequence[Any]): Any sequence of batch indices.
+        """
+        results = [
+            dict(
+                ident=row["data"]["idents"][0],
+                predictions=torch.sigmoid(row["output"]["logits"]).numpy(),
+                labels=row["labels"][0].numpy() if row["labels"] is not None else None,
+            )
+            for row in predictions
+        ]
+        with open(
+            os.path.join(self.output_dir, self.prediction_file_name), "wb"
+        ) as fout:
+            pickle.dump(results, fout)
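Since `PredictionWriter` stores its results with `pickle.dump`, reading them back is a matter of `pickle.load` on the same file name. The sketch below assumes the record layout shown in the diff (`ident`, `predictions`, `labels`); the helpers `load_predictions` and `round_trip` are hypothetical, and plain lists replace the NumPy arrays so the example needs no third-party packages.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, List


def load_predictions(
    output_dir: str, file_name: str = "predictions.pkl"
) -> List[Dict[str, Any]]:
    """Load the list of prediction records written at epoch end."""
    # Only unpickle files you trust: pickle can execute code while loading.
    with open(os.path.join(output_dir, file_name), "rb") as fin:
        return pickle.load(fin)


def round_trip(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Write records the way the callback does, then read them back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with open(os.path.join(tmp_dir, "predictions.pkl"), "wb") as fout:
            pickle.dump(records, fout)
        return load_predictions(tmp_dir)


if __name__ == "__main__":
    example = [{"ident": 1, "predictions": [0.5, 0.25], "labels": None}]
    print(round_trip(example))
```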

chebai-0.0.2.dev0/chebai/callbacks.py ADDED

@@ -0,0 +1,86 @@
+import json
+import os
+from typing import Any, Dict, List, Literal, Union
+import torch
+from lightning.pytorch.callbacks import BasePredictionWriter
+class ChebaiPredictionWriter(BasePredictionWriter):
+    """
+    A custom prediction writer for saving batch and epoch predictions during model training.
+    This class inherits from `BasePredictionWriter` and is designed to save predictions
+    in a specified output directory at specified intervals.
+    Args:
+        output_dir (str): The directory where predictions will be saved.
+        write_interval (str): The interval at which predictions will be written.
+        target_file (str): The name of the file where epoch predictions will be saved (default: "predictions.json").
+    """
+    def __init__(
+        self,
+        output_dir: str,
+        write_interval: Literal["batch", "epoch", "batch_and_epoch"],
+        target_file: str = "predictions.json",
+    ) -> None:
+        super().__init__(write_interval)
+        self.output_dir = output_dir
+        self.target_file = target_file
+    def write_on_batch_end(
+        self,
+        trainer: Any,
+        pl_module: Any,
+        prediction: Union[torch.Tensor, List[torch.Tensor]],
+        batch_indices: List[int],
+        batch: Any,
+        batch_idx: int,
+        dataloader_idx: int,
+    ) -> None:
+        """
+        Saves batch predictions at the end of each batch.
+        Args:
+            trainer (Any): The trainer instance.
+            pl_module (Any): The LightningModule instance.
+            prediction (Union[torch.Tensor, List[torch.Tensor]]): The prediction output from the model.
+            batch_indices (List[int]): The indices of the batch.
+            batch (Any): The current batch.
+            batch_idx (int): The index of the batch.
+            dataloader_idx (int): The index of the dataloader.
+        """
+        outpath = os.path.join(self.output_dir, str(dataloader_idx), f"{batch_idx}.pt")
+        os.makedirs(os.path.dirname(outpath), exist_ok=True)
+        torch.save(prediction, outpath)
+    def write_on_epoch_end(
+        self,
+        trainer: Any,
+        pl_module: Any,
+        predictions: List[Dict[str, Any]],
+        batch_indices: List[int],
+    ) -> None:
+        """
+        Saves all predictions at the end of each epoch in a JSON file.
+        Args:
+            trainer (Any): The trainer instance.
+            pl_module (Any): The LightningModule instance.
+            predictions (List[Dict[str, Any]]): The list of prediction outputs from the model.
+            batch_indices (List[int]): The indices of the batches.
+        """
+        pred_list = []
+        for p in predictions:
+            idents = p["data"]["idents"]
+            labels = p["data"]["labels"]
+            if labels is not None:
+                labels = labels.tolist()
+            else:
+                labels = [None for _ in idents]
+            output = torch.sigmoid(p["output"]["logits"]).tolist()
+            for i, l, o in zip(idents, labels, output):
+                pred_list.append(dict(ident=i, labels=l, predictions=o))
+        with open(os.path.join(self.output_dir, self.target_file), "wt") as fout:
+            json.dump(pred_list, fout)
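The epoch-end logic of `ChebaiPredictionWriter` flattens a list of batch dictionaries into one record per sample. A minimal sketch of that transformation follows, using nested Python lists in place of tensors. The names `flatten_batches` and `sigmoid` are illustrative, and the input layout (`data.idents`, `data.labels`, `output.logits`) is taken from the code in the diff.

```python
import json
import math
from typing import Any, Dict, List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def flatten_batches(batches: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn per-batch outputs into one JSON-serialisable record per sample."""
    records = []
    for batch in batches:
        idents = batch["data"]["idents"]
        labels = batch["data"]["labels"]
        if labels is None:
            # Unlabelled data: pair every identifier with None.
            labels = [None for _ in idents]
        probs = [[sigmoid(v) for v in row] for row in batch["output"]["logits"]]
        for ident, label, prob in zip(idents, labels, probs):
            records.append({"ident": ident, "labels": label, "predictions": prob})
    return records


if __name__ == "__main__":
    batches = [
        {
            "data": {"idents": ["a", "b"], "labels": None},
            "output": {"logits": [[0.0, 2.0], [-2.0, 0.0]]},
        }
    ]
    print(json.dumps(flatten_batches(batches)))
```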
