dora-singlecell 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,131 @@
1
+ Metadata-Version: 2.4
2
+ Name: dora-singlecell
3
+ Version: 0.1.0
4
+ Summary: DORA: latent trajectory model for single-cell drug response (PyTorch).
5
+ Requires-Python: >=3.9
6
+ Description-Content-Type: text/markdown
7
+ Requires-Dist: numpy>=1.20
8
+ Requires-Dist: scipy>=1.7
9
+ Requires-Dist: scikit-learn>=1.2
10
+ Requires-Dist: pandas>=1.3
11
+ Requires-Dist: torch>=1.12
12
+ Requires-Dist: scanpy>=1.9
13
+ Requires-Dist: joblib>=1.2
14
+ Requires-Dist: tqdm>=4.60
15
+ Requires-Dist: torchmetrics>=0.11
16
+ Provides-Extra: dev
17
+ Requires-Dist: pytest>=7.0; extra == "dev"
18
+
19
+ # dora-singlecell
20
+
21
+ PyTorch implementation of **DORA**, a latent-trajectory model for single-cell drug response: drug and cell embeddings, dose response, and a gene decoder, with utilities for AnnData / perturbation-style datasets.
22
+
23
+ **PyPI package name:** `dora-singlecell`
24
+ **Import name:** `dora`
25
+
26
+ ## Installation
27
+
28
+ ### From PyPI (after you publish)
29
+
30
+ ```bash
31
+ pip install dora-singlecell
32
+ ```
33
+
34
+ ### From GitHub (before or instead of PyPI)
35
+
36
+
37
+ ```bash
38
+ pip install git+https://github.com/LBiophyEvo/dora-singlecell.git@main
39
+ ```
40
+
41
+ For a local editable install while developing:
42
+
43
+ ```bash
44
+ git clone https://github.com/LBiophyEvo/dora-singlecell.git
45
+ cd dora-singlecell
46
+ pip install -e .
47
+ ```
48
+
49
+ ## Quick start
50
+
51
+ - Load datasets
52
+ ```python
53
+ from dora import CustomDataset_mask
54
+
55
+ # Example: load preprocessed data (paths must match your layout; see utils.dataset_selection)
56
+ # First load the adata, the prepared dataset (arranged dose-response gene expression), drug features, cell features, and the defined dose trajectory
57
+ # For example, for the Sci-Plex dataset
58
+ dosages_standard = [0.0, 0.001, 0.01, 0.1, 1.0]
59
+ train_dataset = CustomDataset_mask(adata=adata, dataset=dataset_train, feature_dict_drug= feature_dict_drug, feature_dict_cell=feature_dict_cell, dosages_standard=dosages_standard)
60
+
61
+ ```
62
+
63
+ - Build the model
64
+ ```python
65
+ from dora import DORA
66
+ dosage_len = len(dosages_standard)
67
+ hparam = {
68
+ 'lr': 1e-2,
69
+ 'wd': 4e-5,
70
+ 'dim_hid': 32,
71
+ 'dep_hid': 3,
72
+ 'nb_layer': 5,
73
+ 'n_drugs': 188,
74
+ 'n_cells': 3,
75
+ 'n_genes': dim_cell_feature,
76
+ 'dim_drug_feature': dim_drug_feature,
77
+ 'dim_cell_feature': dim_cell_feature,
78
+ 'batch_size': 128,
79
+ 'max_epoch': 700,
80
+ 'device': device,
81
+ 'cell_dim_hid': 128,
82
+ 'module': 1,
83
+ 'drug_dose_f': False,
84
+ 'max_patience': 100,
85
+ 'last_layer': 'linear',
86
+ 'step_size_lr': 35,
87
+ 'batch_norm': True,
88
+ 'param_pen': 0,
89
+ }
90
+ model = DORA(num_genes = hparam['n_genes'],
91
+ num_drugs = hparam['n_drugs'],
92
+ num_cells = hparam['n_cells'],
93
+ genes= genes,
94
+ dosage_len = dosage_len,
95
+ hparam=hparam,
96
+
97
+ )
98
+
99
+ ```
100
+
101
+ Training and evaluation helpers live in `dora.train`, `dora.eval`, `dora.get_latent_util`, and `dora.train_clf_test_adam`.
102
+
103
+ ## Project layout
104
+
105
+ ```
106
+ .
107
+ ├── pyproject.toml # package metadata & dependencies (name: dora-singlecell)
108
+ ├── README.md
109
+ └── dora/ # importable Python package
110
+ ├── __init__.py
111
+ ├── model.py # DORA, MLP, losses, dose modules
112
+ ├── utils.py # CustomDataset_mask, data loading
113
+ ├── train.py # train the model
114
+ ├── eval.py # eval the model
115
+ ├── get_latent_util.py # extract the embeddings
116
+ └── train_clf_test_adam.py # fine tune the model
117
+ ```
118
+
119
+ ## Requirements
120
+
121
+ - Python ≥ 3.9
122
+ - PyTorch, scanpy, scikit-learn, numpy, scipy, pandas, joblib, tqdm, torchmetrics (see `pyproject.toml` for versions).
123
+
124
+
125
+ ## Citation
126
+
127
+ If you use this code in a publication, cite the associated paper (add reference when available).
128
+
129
+ ## License
130
+
131
+ MIT.
@@ -0,0 +1,113 @@
1
+ # dora-singlecell
2
+
3
+ PyTorch implementation of **DORA**, a latent-trajectory model for single-cell drug response: drug and cell embeddings, dose response, and a gene decoder, with utilities for AnnData / perturbation-style datasets.
4
+
5
+ **PyPI package name:** `dora-singlecell`
6
+ **Import name:** `dora`
7
+
8
+ ## Installation
9
+
10
+ ### From PyPI (after you publish)
11
+
12
+ ```bash
13
+ pip install dora-singlecell
14
+ ```
15
+
16
+ ### From GitHub (before or instead of PyPI)
17
+
18
+
19
+ ```bash
20
+ pip install git+https://github.com/LBiophyEvo/dora-singlecell.git@main
21
+ ```
22
+
23
+ For a local editable install while developing:
24
+
25
+ ```bash
26
+ git clone https://github.com/LBiophyEvo/dora-singlecell.git
27
+ cd dora-singlecell
28
+ pip install -e .
29
+ ```
30
+
31
+ ## Quick start
32
+
33
+ - Load datasets
34
+ ```python
35
+ from dora import CustomDataset_mask
36
+
37
+ # Example: load preprocessed data (paths must match your layout; see utils.dataset_selection)
38
+ # First load the adata, the prepared dataset (arranged dose-response gene expression), drug features, cell features, and the defined dose trajectory
39
+ # For example, for the Sci-Plex dataset
40
+ dosages_standard = [0.0, 0.001, 0.01, 0.1, 1.0]
41
+ train_dataset = CustomDataset_mask(adata=adata, dataset=dataset_train, feature_dict_drug= feature_dict_drug, feature_dict_cell=feature_dict_cell, dosages_standard=dosages_standard)
42
+
43
+ ```
44
+
45
+ - Build the model
46
+ ```python
47
+ from dora import DORA
48
+ dosage_len = len(dosages_standard)
49
+ hparam = {
50
+ 'lr': 1e-2,
51
+ 'wd': 4e-5,
52
+ 'dim_hid': 32,
53
+ 'dep_hid': 3,
54
+ 'nb_layer': 5,
55
+ 'n_drugs': 188,
56
+ 'n_cells': 3,
57
+ 'n_genes': dim_cell_feature,
58
+ 'dim_drug_feature': dim_drug_feature,
59
+ 'dim_cell_feature': dim_cell_feature,
60
+ 'batch_size': 128,
61
+ 'max_epoch': 700,
62
+ 'device': device,
63
+ 'cell_dim_hid': 128,
64
+ 'module': 1,
65
+ 'drug_dose_f': False,
66
+ 'max_patience': 100,
67
+ 'last_layer': 'linear',
68
+ 'step_size_lr': 35,
69
+ 'batch_norm': True,
70
+ 'param_pen': 0,
71
+ }
72
+ model = DORA(num_genes = hparam['n_genes'],
73
+ num_drugs = hparam['n_drugs'],
74
+ num_cells = hparam['n_cells'],
75
+ genes= genes,
76
+ dosage_len = dosage_len,
77
+ hparam=hparam,
78
+
79
+ )
80
+
81
+ ```
82
+
83
+ Training and evaluation helpers live in `dora.train`, `dora.eval`, `dora.get_latent_util`, and `dora.train_clf_test_adam`.
84
+
85
+ ## Project layout
86
+
87
+ ```
88
+ .
89
+ ├── pyproject.toml # package metadata & dependencies (name: dora-singlecell)
90
+ ├── README.md
91
+ └── dora/ # importable Python package
92
+ ├── __init__.py
93
+ ├── model.py # DORA, MLP, losses, dose modules
94
+ ├── utils.py # CustomDataset_mask, data loading
95
+ ├── train.py # train the model
96
+ ├── eval.py # eval the model
97
+ ├── get_latent_util.py # extract the embeddings
98
+ └── train_clf_test_adam.py # fine tune the model
99
+ ```
100
+
101
+ ## Requirements
102
+
103
+ - Python ≥ 3.9
104
+ - PyTorch, scanpy, scikit-learn, numpy, scipy, pandas, joblib, tqdm, torchmetrics (see `pyproject.toml` for versions).
105
+
106
+
107
+ ## Citation
108
+
109
+ If you use this code in a publication, cite the associated paper (add reference when available).
110
+
111
+ ## License
112
+
113
+ MIT.
@@ -0,0 +1,34 @@
1
+ """
2
+ DORA single-cell perturbation model and data utilities.
3
+
4
+ After ``pip install -e .`` from the project root, import with ``import dora`` or
5
+ ``from dora import DORA, CustomDataset_mask``.
6
+ """
7
+
8
+ from .model import (
9
+ Basic_ff,
10
+ DORA,
11
+ GeneralizedSigmoid,
12
+ MLP,
13
+ rrmse,
14
+ rrmse_penality,
15
+ )
16
+ from .utils import (
17
+ CustomDataset_mask,
18
+ SubDataset,
19
+ dataset_selection,
20
+ get_normaled_cell,
21
+ )
22
+
23
+ __all__ = [
24
+ "Basic_ff",
25
+ "DORA",
26
+ "GeneralizedSigmoid",
27
+ "MLP",
28
+ "CustomDataset_mask",
29
+ "SubDataset",
30
+ "dataset_selection",
31
+ "get_normaled_cell",
32
+ "rrmse",
33
+ "rrmse_penality",
34
+ ]
@@ -0,0 +1,64 @@
1
+ """
2
+ Classification-oriented metrics for binary response / phenotype prediction.
3
+
4
+ Used alongside training scripts that output probabilities; pairs with the same
5
+ ``calculate_accuracy`` helper in ``train_clf_test_adam`` when scoring phenotypic heads.
6
+ """
7
+
8
+ import numpy as np
9
+ from sklearn.metrics import (
10
+ accuracy_score,
11
+ auc,
12
+ precision_recall_curve,
13
+ precision_score,
14
+ recall_score,
15
+ roc_auc_score,
16
+ )
17
+
18
+
19
def calculate_accuracy(test_prob, test_label):
    """
    Summarize binary predictions given probabilities and ground-truth labels.

    Parameters
    ----------
    test_prob : array-like
        Predicted probabilities (or scores in [0, 1]) per sample.
    test_label : array-like
        Binary ground-truth labels (same length as ``test_prob``).

    Returns
    -------
    list of float
        ``[ROC-AUC, PR-AUC (area under precision–recall curve), accuracy,
        precision, recall]``. If there are no positive labels in ``test_label``,
        ROC/PR/precision/recall are returned as 0. If the model predicts no
        positives, precision and recall are 0 while ROC-AUC and PR-AUC are
        still computed from the raw scores.

    Notes
    -----
    Hard labels are obtained with threshold 0.5 on ``test_prob``.
    """
    # Accept plain Python lists as well as numpy arrays (.sum() below
    # requires array semantics).
    test_prob = np.asarray(test_prob)
    test_label = np.asarray(test_label)

    # Vectorized hard-label thresholding at 0.5.
    pred_label = (test_prob > 0.5).astype(int)
    ACC = accuracy_score(test_label, pred_label)

    # Without any positive ground-truth labels, ROC-AUC / PR-AUC / precision /
    # recall are undefined; report them as 0 and keep accuracy.
    if test_label.sum() == 0:
        return [0, 0, ACC, 0, 0]

    # Score-based metrics are computable whenever positives exist
    # (previously this block was duplicated across two branches).
    precision, recall, _thresholds = precision_recall_curve(test_label, test_prob)
    roc = roc_auc_score(test_label, test_prob)
    pr_auc = auc(recall, precision)

    # Threshold-based precision/recall are only meaningful when the model
    # predicts at least one positive; otherwise report them as 0.
    if pred_label.sum() > 0:
        PREC = precision_score(test_label, pred_label)
        TPR = recall_score(test_label, pred_label)
        return [roc, pr_auc, ACC, PREC, TPR]

    return [roc, pr_auc, ACC, 0, 0]
@@ -0,0 +1,155 @@
1
+ """
2
+ Extract latent embeddings from a trained DORA-style model and package them as AnnData.
3
+
4
+ ``get_latent`` runs the encoder/trajectory stack with dose information disabled
5
+ (``drugs_dose=None``), collects per-step latent vectors for masked samples, and
6
+ attaches cell / drug / dose metadata via the dataset’s sklearn encoders.
7
+ """
8
+
9
+ import numpy as np
10
+ import scanpy as sc
11
+ import torch
12
+ from scipy.stats import pearsonr
13
+ from torchmetrics import R2Score
14
+
15
+
16
def compute_r2(y_true, y_pred):
    """
    R² between two 1D tensors using torchmetrics (GPU-friendly).

    Parameters
    ----------
    y_true, y_pred : torch.Tensor
        Same shape; typically one cell’s gene vector vs reconstruction.

    Returns
    -------
    float
        R² score. Predictions are clamped to a large finite range for
        numerical stability; NaNs in ``y_pred`` are not explicitly handled
        (torchmetrics may propagate NaN).
    """
    # Guard against runaway reconstruction values before scoring.
    clamped_pred = torch.clamp(y_pred, -3e12, 3e12)
    scorer = R2Score().to(y_true.device)
    scorer.update(clamped_pred, y_true)
    return scorer.compute().item()
35
+
36
+
37
def _decode_labels(index_arr, encoder):
    """
    Map integer category codes back to their names via a fitted OneHotEncoder.

    Builds the one-hot matrix of shape ``(n_samples, n_categories)`` that the
    sklearn encoder's ``inverse_transform`` expects, then decodes it.
    Sizing by ``len(index_arr)`` fixes a latent bug where the drug decode
    allocated its one-hot matrix with the *cell* index count (the two arrays
    happen to have equal length, which masked the mistake).
    """
    onehot = np.zeros((len(index_arr), len(encoder.categories_[0])))
    for i in range(len(index_arr)):
        onehot[i, index_arr[i]] = 1
    return encoder.inverse_transform(onehot)


def get_latent(model, datasets_val, genes, train_dataset):
    """
    Iterate validation batches, collect latent states and reconstruction quality.

    Parameters
    ----------
    model : torch.nn.Module
        Trained model with ``predict(..., return_emb=True)`` (e.g. DORA).
    datasets_val : DataLoader
        Batches matching ``CustomDataset_mask`` layout (7 tensors per batch).
    genes : torch.Tensor
        Full gene matrix indexed by ``indices`` (same convention as training).
    train_dataset : CustomDataset_mask
        Provides ``encoder_cell`` / ``encoder_drug`` to map integer ids back to names.

    Returns
    -------
    anndata.AnnData
        ``.X`` = stacked latent vectors; ``.obs`` contains ``cell_id``,
        ``pert_iname``, ``pert_dose``, composite ``pert_cell_dose``, and per-row
        ``r2_score`` from gene reconstruction.
    """
    r2_loss = []
    pearson_loss = []
    latent_dict = {
        "emb_all": [],
        "cell_id": [],
        "pert_iname": [],
        "drug_dose": [],
        "response_label": [],
        "r2_score": [],
    }
    model.eval()
    for data in datasets_val:
        data = [tensor.to(model.device) for tensor in data]

        # Batch layout follows CustomDataset_mask (7 tensors, in this order).
        drugs, cells, drugs_dose, drugs_feature, cell_feature, indices, masks = data[:7]

        # Latent trajectory without dose conditioning (baseline for embedding export).
        gene_reconstructions_arr, emb_arr, _ = model.predict(
            drugs,
            drugs_feature,
            cell_feature,
            indices,
            cells=cells,
            drugs_dose=None,
            return_emb=True,
        )

        for id_g, gene_reconstructions in enumerate(gene_reconstructions_arr):
            # Trajectory step id_g maps to column id_g + 1 of indices/masks
            # (column 0 is the starting point, same convention as training).
            id_gene = indices[:, (id_g + 1)]
            mask = masks[:, (id_g + 1)]

            if mask.sum() > 0:
                gene_step = genes[id_gene][mask == 1]
                gene_reconstructions_left = gene_reconstructions[mask == 1]

                # Per-cell reconstruction quality at this dose step.
                r2_iter = [
                    compute_r2(gene_step[i, :], gene_reconstructions_left[i, :])
                    for i in range(len(gene_step))
                ]
                r2_loss += r2_iter

                y_true = gene_step.cpu().detach().numpy()
                y_pred = gene_reconstructions_left.cpu().detach().numpy()

                pearson_loss += [
                    pearsonr(y_true[i, :], y_pred[i, :])[0] for i in range(len(gene_step))
                ]
                latent_dict["emb_all"].append(emb_arr[id_g][mask == 1].cpu().detach().numpy())
                latent_dict["cell_id"].append(cells[mask == 1].cpu().detach().numpy())
                latent_dict["pert_iname"].append(drugs[mask == 1].cpu().detach().numpy())
                latent_dict["drug_dose"].append(
                    drugs_dose[:, (id_g + 1)][mask == 1].cpu().detach().numpy()
                )
                latent_dict["r2_score"].append(r2_iter)

    print("the R2 on gene reconstruction:", np.mean(r2_loss))
    print("the pearson on gene reconstruction", np.mean(pearson_loss))

    # Decode integer cell / drug indices back to string names.
    index_cell = np.vstack(latent_dict["cell_id"])
    cells_name = _decode_labels(index_cell, train_dataset.encoder_cell)

    index_drugs = np.vstack(latent_dict["pert_iname"])
    drugs_name = _decode_labels(index_drugs, train_dataset.encoder_drug)

    dosages = np.hstack(latent_dict["drug_dose"])

    adata_emb = sc.AnnData(np.vstack(latent_dict["emb_all"]))
    adata_emb.obs["cell_id"] = cells_name.flatten()
    adata_emb.obs["pert_iname"] = drugs_name.flatten()
    adata_emb.obs["pert_dose"] = [str(el) for el in dosages]

    adata_emb.obs["pert_cell_dose"] = (
        adata_emb.obs["cell_id"].astype(str)
        + "_"
        + adata_emb.obs["pert_iname"].astype(str)
        + "_"
        + adata_emb.obs["pert_dose"].astype(str)
    )
    adata_emb.obs["r2_score"] = np.hstack(latent_dict["r2_score"])
    return adata_emb