bayesianflow-for-chem 1.2.7__tar.gz → 1.4.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of bayesianflow-for-chem might be problematic.
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/PKG-INFO +4 -8
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/README.md +2 -2
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/__init__.py +3 -3
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/data.py +2 -39
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/model.py +396 -30
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/scorer.py +1 -1
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/tool.py +141 -176
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/train.py +5 -3
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/PKG-INFO +4 -8
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/requires.txt +0 -3
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/pyproject.toml +1 -1
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/setup.py +2 -3
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/LICENSE +0 -0
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/vocab.txt +0 -0
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/SOURCES.txt +0 -0
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/dependency_links.txt +0 -0
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/top_level.txt +0 -0
- {bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/setup.cfg +0 -0
{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/PKG-INFO
RENAMED

@@ -1,16 +1,15 @@
 Metadata-Version: 2.4
 Name: bayesianflow_for_chem
-Version: 1.2.7
+Version: 1.4.0
 Summary: Bayesian flow network framework for Chemistry
 Home-page: https://augus1999.github.io/bayesian-flow-network-for-chemistry/
 Author: Nianze A. Tao
 Author-email: tao-nianze@hiroshima-u.ac.jp
-License: AGPL-3.0
+License: AGPL-3.0-or-later
 Project-URL: Source, https://github.com/Augus1999/bayesian-flow-network-for-chemistry
 Keywords: Chemistry,CLM,ChemBFN
 Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: GNU Affero General Public License v3
 Classifier: Natural Language :: English
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.9
@@ -29,8 +28,6 @@ Requires-Dist: loralib>=0.1.2
 Requires-Dist: lightning>=2.2.0
 Requires-Dist: scikit-learn>=1.5.0
 Requires-Dist: typing_extensions>=4.8.0
-Provides-Extra: geo2seq
-Requires-Dist: pynauty>=2.8.8.1; extra == "geo2seq"
 Dynamic: author
 Dynamic: author-email
 Dynamic: classifier
@@ -41,7 +38,6 @@ Dynamic: keywords
 Dynamic: license
 Dynamic: license-file
 Dynamic: project-url
-Dynamic: provides-extra
 Dynamic: requires-dist
 Dynamic: requires-python
 Dynamic: summary
@@ -87,13 +83,13 @@ You can find example scripts in [📁example](./example) folder.

 ## Pre-trained Model

-You can find pretrained models
+You can find pretrained models on our [🤗Hugging Face model page](https://huggingface.co/suenoomozawa/ChemBFN).

 ## Dataset Handling

 We provide a Python class [`CSVData`](./bayesianflow_for_chem/data.py) to handle data stored in CSV or similar format containing headers to identify the entities. The following is a quickstart.

-1. Download your dataset file (e.g., ESOL
+1. Download your dataset file (e.g., ESOL from [MoleculeNet](https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv)) and split the file:
 ```python
 >>> from bayesianflow_for_chem.tool import split_data


{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/README.md
RENAMED

@@ -39,13 +39,13 @@ You can find example scripts in [📁example](./example) folder.

 ## Pre-trained Model

-You can find pretrained models
+You can find pretrained models on our [🤗Hugging Face model page](https://huggingface.co/suenoomozawa/ChemBFN).

 ## Dataset Handling

 We provide a Python class [`CSVData`](./bayesianflow_for_chem/data.py) to handle data stored in CSV or similar format containing headers to identify the entities. The following is a quickstart.

-1. Download your dataset file (e.g., ESOL
+1. Download your dataset file (e.g., ESOL from [MoleculeNet](https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv)) and split the file:
 ```python
 >>> from bayesianflow_for_chem.tool import split_data

{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/__init__.py
RENAMED

@@ -4,8 +4,8 @@
 ChemBFN package.
 """
 from . import data, tool, train, scorer
-from .model import ChemBFN, MLP
+from .model import ChemBFN, MLP, EnsembleChemBFN

-__all__ = ["data", "tool", "train", "scorer", "ChemBFN", "MLP"]
-__version__ = "1.2.7"
+__all__ = ["data", "tool", "train", "scorer", "ChemBFN", "MLP", "EnsembleChemBFN"]
+__version__ = "1.4.0"
 __author__ = "Nianze A. Tao (Omozawa Sueno)"
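After this change the ensemble model is importable from the package root alongside the existing classes. A quick sanity check (assuming the new wheel is installed):

```python
import bayesianflow_for_chem as bfc

print(bfc.__version__)      # "1.4.0"
print(bfc.EnsembleChemBFN)  # exported next to ChemBFN and MLP
```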
{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/data.py
RENAMED

@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 # Author: Nianze A. TAO (Omozawa SUENO)
 """
-Tokenise SMILES/SAFE/SELFIES/
+Tokenise SMILES/SAFE/SELFIES/protein-sequence strings.
 """
 import os
 import re
@@ -32,30 +32,14 @@ SMI_REGEX_PATTERN = (
     r"~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
 )
 SEL_REGEX_PATTERN = r"(\[[^\]]+]|\.)"
-GEO_REGEX_PATTERN = (
-    r"(H[e,f,g,s,o]?|"
-    r"L[i,v,a,r,u]|"
-    r"B[e,r,a,i,h,k]?|"
-    r"C[l,a,r,o,u,d,s,n,e,m,f]?|"
-    r"N[e,a,i,b,h,d,o,p]?|"
-    r"O[s,g]?|S[i,c,e,r,n,m,b,g]?|"
-    r"K[r]?|T[i,c,e,a,l,b,h,m,s]|"
-    r"G[a,e,d]|R[b,u,h,e,n,a,f,g]|"
-    r"Yb?|Z[n,r]|P[t,o,d,r,a,u,b,m]?|"
-    r"F[e,r,l,m]?|M[g,n,o,t,c,d]|"
-    r"A[l,r,s,g,u,t,c,m]|I[n,r]?|"
-    r"W|X[e]|E[u,r,s]|U|D[b,s,y]|"
-    r"-|.| |[0-9])"
-)
 AA_REGEX_PATTERN = r"(A|B|C|D|E|F|G|H|I|K|L|M|N|P|Q|R|S|T|V|W|Y|Z|-|.)"
 smi_regex = re.compile(SMI_REGEX_PATTERN)
 sel_regex = re.compile(SEL_REGEX_PATTERN)
-geo_regex = re.compile(GEO_REGEX_PATTERN)
 aa_regex = re.compile(AA_REGEX_PATTERN)


 def load_vocab(
-    vocab_file: Union[str, Path]
+    vocab_file: Union[str, Path],
 ) -> Dict[str, Union[int, List[str], Dict[str, int]]]:
     """
     Load vocabulary from source file.
@@ -86,9 +70,6 @@ AA_VOCAB_KEYS = (
 )
 AA_VOCAB_COUNT = len(AA_VOCAB_KEYS)
 AA_VOCAB_DICT = dict(zip(AA_VOCAB_KEYS, range(AA_VOCAB_COUNT)))
-GEO_VOCAB_KEYS = VOCAB_KEYS[0:3] + [" "] + VOCAB_KEYS[22:150] + [".", "-"]
-GEO_VOCAB_COUNT = len(GEO_VOCAB_KEYS)
-GEO_VOCAB_DICT = dict(zip(GEO_VOCAB_KEYS, range(GEO_VOCAB_COUNT)))


 def smiles2vec(smiles: str) -> List[int]:
@@ -104,19 +85,6 @@ def smiles2vec(smiles: str) -> List[int]:
     return [VOCAB_DICT[token] for token in tokens]


-def geo2vec(geo2seq: str) -> List[int]:
-    """
-    Geo2Seq tokenisation using a dataset-independent regex pattern.
-
-    :param geo2seq: Geo2Seq string
-    :type geo2seq: str
-    :return: tokens w/o `<start>` and `<end>`
-    :rtype: list
-    """
-    tokens = [token for token in geo_regex.findall(geo2seq)]
-    return [GEO_VOCAB_DICT[token] for token in tokens]
-
-
 def aa2vec(aa_seq: str) -> List[int]:
     """
     Protein sequence tokenisation using a dataset-independent regex pattern.
@@ -147,11 +115,6 @@ def smiles2token(smiles: str) -> Tensor:
     return torch.tensor([1] + smiles2vec(smiles) + [2], dtype=torch.long)


-def geo2token(geo2seq: str) -> Tensor:
-    # start token: <start> = 1; end token: <esc> = 2
-    return torch.tensor([1] + geo2vec(geo2seq) + [2], dtype=torch.long)
-
-
 def aa2token(aa_seq: str) -> Tensor:
     # start token: <start> = 1; end token: <end> = 2
     return torch.tensor([1] + aa2vec(aa_seq) + [2], dtype=torch.long)
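The tokenisers that remain after this cleanup (`smiles2vec`/`smiles2token` for SMILES-like strings, `aa2vec`/`aa2token` for protein sequences) can be exercised directly. A minimal sketch — the example strings are arbitrary, and the integer indices depend on the shipped `vocab.txt`:

```python
from bayesianflow_for_chem.data import smiles2vec, smiles2token, aa2token

# SMILES -> vocabulary indices (without <start>/<end>)
ids = smiles2vec("CCO")
# SMILES -> torch.long tensor wrapped with <start> = 1 and <end> = 2
token = smiles2token("CCO")      # shape: (len(ids) + 2,)
# protein sequences use the amino-acid tokeniser instead
aa_token = aa2token("MKT-A")
print(token.tolist(), aa_token.shape)
```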
{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/model.py
RENAMED

@@ -4,7 +4,8 @@
 Define Bayesian Flow Network for Chemistry (ChemBFN) model.
 """
 from pathlib import Path
-from typing import List, Tuple, Dict, Optional, Union
+from copy import deepcopy
+from typing import List, Tuple, Dict, Optional, Union, Callable
 import torch
 import torch.nn as nn
 from torch import Tensor
@@ -161,8 +162,8 @@ class Attention(nn.Module):
         :return: attentioned output; shape: (n_b, n_t, n_f)
         :rtype: torch.Tensor
         """
-        n_b,
-        split = (n_b,
+        n_b, n_t, _ = shape = x.shape
+        split = (n_b, n_t, self.nh, self.d)
         q, k, v = self.qkv(x).chunk(3, -1)
         q = q.view(split).permute(2, 0, 1, 3).contiguous()
         k = k.view(split).permute(2, 0, 1, 3).contiguous()
@@ -427,12 +428,12 @@ class ChemBFN(nn.Module):
         c = self.time_embed(t)
         if y is not None:
             c += y
-        pe = self.position(
+        pe = self.position(n_t)
         x = self.embedding(x)
         attn_mask: Optional[Tensor] = None
         if self.semi_autoregressive:
             attn_mask = torch.tril(
-                torch.ones((1, n_b, n_t, n_t), device=
+                torch.ones((1, n_b, n_t, n_t), device=x.device), diagonal=0
             )
         else:
             if mask is not None:
@@ -592,6 +593,13 @@ class ChemBFN(nn.Module):
         x, logits = torch.broadcast_tensors(x[..., None], logits)
         return (-logits.gather(-1, x[..., :1]).squeeze(-1)).mean()

+    @staticmethod
+    def reshape_y(y: Tensor) -> Tensor:
+        assert y.dim() <= 3  # this doesn't work if the model is frezen in JIT.
+        if y.dim() == 2:
+            return y[:, None, :]
+        return y
+
     @torch.jit.export
     def sample(
         self,
@@ -607,7 +615,7 @@ class ChemBFN(nn.Module):

         :param batch_size: batch size
         :param sequence_size: max sequence length
-        :param y: conditioning vector; shape: (n_b, 1, n_f)
+        :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f)
         :param sample_step: number of sampling steps
         :param guidance_strength: strength of conditional generation. It is not used if y is null.
         :param token_mask: token mask; shape: (1, 1, n_vocab)
@@ -626,9 +634,7 @@ class ChemBFN(nn.Module):
             / self.K
         )
         if y is not None:
-
-            if y.shape[0] == 1:
-                y = y.repeat(batch_size, 1, 1)
+            y = self.reshape_y(y)
         for i in torch.linspace(1, sample_step, sample_step, device=self.beta.device):
             t = (i - 1).view(1, 1, 1).repeat(batch_size, 1, 1) / sample_step
             p = self.discrete_output_distribution(theta, t, y, guidance_strength)
@@ -663,7 +669,7 @@ class ChemBFN(nn.Module):

         :param batch_size: batch size
         :param sequence_size: max sequence length
-        :param y: conditioning vector; shape: (n_b, 1, n_f)
+        :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f)
         :param sample_step: number of sampling steps
         :param guidance_strength: strength of conditional generation. It is not used if y is null.
         :param token_mask: token mask; shape: (1, 1, n_vocab)
@@ -681,9 +687,7 @@ class ChemBFN(nn.Module):
         """
         z = torch.zeros((batch_size, sequence_size, self.K), device=self.beta.device)
         if y is not None:
-
-            if y.shape[0] == 1:
-                y = y.repeat(batch_size, 1, 1)
+            y = self.reshape_y(y)
         for i in torch.linspace(1, sample_step, sample_step, device=self.beta.device):
             t = (i - 1).view(1, 1, 1).repeat(batch_size, 1, 1) / sample_step
             theta = torch.softmax(z, -1)
@@ -714,7 +718,7 @@ class ChemBFN(nn.Module):
         Molecule inpaint functionality.

         :param x: categorical indices of scaffold; shape: (n_b, n_t)
-        :param y: conditioning vector; shape: (n_b, 1, n_f)
+        :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f)
         :param sample_step: number of sampling steps
         :param guidance_strength: strength of conditional generation. It is not used if y is null.
         :param token_mask: token mask; shape: (1, 1, n_vocab)
@@ -733,9 +737,7 @@ class ChemBFN(nn.Module):
         x_onehot = nn.functional.one_hot(x, self.K) * mask
         theta = x_onehot + (1 - mask) * theta
         if y is not None:
-
-            if y.shape[0] == 1:
-                y = y.repeat(n_b, 1, 1)
+            y = self.reshape_y(y)
         for i in torch.linspace(1, sample_step, sample_step, device=x.device):
             t = (i - 1).view(1, 1, 1).repeat(n_b, 1, 1) / sample_step
             p = self.discrete_output_distribution(theta, t, y, guidance_strength)
@@ -769,7 +771,7 @@ class ChemBFN(nn.Module):
         ODE inpainting.

         :param x: categorical indices of scaffold; shape: (n_b, n_t)
-        :param y: conditioning vector; shape: (n_b, 1, n_f)
+        :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f)
         :param sample_step: number of sampling steps
         :param guidance_strength: strength of conditional generation. It is not used if y is null.
         :param token_mask: token mask; shape: (1, 1, n_vocab)
@@ -789,9 +791,7 @@ class ChemBFN(nn.Module):
         x_onehot = nn.functional.one_hot(x, self.K) * mask
         z = torch.zeros((n_b, n_t, self.K), device=self.beta.device)
         if y is not None:
-
-            if y.shape[0] == 1:
-                y = y.repeat(n_b, 1, 1)
+            y = self.reshape_y(y)
         for i in torch.linspace(1, sample_step, sample_step, device=self.beta.device):
             t = (i - 1).view(1, 1, 1).repeat(n_b, 1, 1) / sample_step
             theta = torch.softmax(z, -1)
@@ -847,13 +847,7 @@ class ChemBFN(nn.Module):
         with open(ckpt, "rb") as f:
             state = torch.load(f, "cpu", weights_only=True)
         nn, hparam = state["nn"], state["hparam"]
-        model = cls(
-            hparam["num_vocab"],
-            hparam["channel"],
-            hparam["num_layer"],
-            hparam["num_head"],
-            hparam["dropout"],
-        )
+        model = cls(**hparam)
         model.load_state_dict(nn, False)
         if ckpt_lora:
             with open(ckpt_lora, "rb") as g:
@@ -908,7 +902,7 @@ class MLP(nn.Module):
         if self.class_input:
             x = x.to(dtype=torch.long)
         for layer in self.layers[:-1]:
-            x = torch.selu(layer(x))
+            x = torch.selu(layer.forward(x))
         return self.layers[-1](x)

     @classmethod
@@ -926,10 +920,382 @@ class MLP(nn.Module):
         with open(ckpt, "rb") as f:
             state = torch.load(f, "cpu", weights_only=True)
         nn, hparam = state["nn"], state["hparam"]
-        model = cls(hparam
+        model = cls(**hparam)
         model.load_state_dict(nn, strict)
         return model


+class EnsembleChemBFN(ChemBFN):
+    """
+    This module does not fully support `torch.jit.script`. We have `EnsembleChemBFN.jit()`
+    method to JIT compile the submodels.
+    `torch.compile()` is a better choice to compiling the whole model.
+    """
+
+    def __init__(
+        self,
+        base_model_path: Union[str, Path],
+        lora_paths: Union[List[Union[str, Path]], Dict[str, Union[str, Path]]],
+        cond_heads: Union[List[nn.Module], Dict[str, nn.Module]],
+        adapter_weights: Optional[Union[List[float], Dict[str, float]]] = None,
+        semi_autoregressive_flags: Optional[Union[List[bool], Dict[str, bool]]] = None,
+    ) -> None:
+        """
+        Ensemble of ChemBFN models from LoRA checkpoints.
+
+        :param base_model_path: base model checkpoint file
+        :param lora_paths: a list of LoRA checkpoint files or a `dict` instance of these files
+        :param cond_heads: a list of conditioning network heads or a `dict` instance of these networks
+        :param adapter_weights: a list of weights of each LoRA finetuned model or a `dict` instance of these weights; default is equally weighted
+        :param semi_autoregressive_flags: a list of the semi-autoregressive behaviour states of each LoRA finetuned model or a `dict` instance of these states; default is all `False`
+        :type base_model_path: str | pathlib.Path
+        :type lora_paths: list | dict
+        :type cond_heads: list | dict
+        :type adapter_weights: list | dict | None
+        :type semi_autoregressive_flags: list | dict | None
+        """
+        n = len(lora_paths)
+        assert type(lora_paths) == type(
+            cond_heads
+        ), "`lora_paths` and `cond_heads` should have the same type!"
+        assert n == len(
+            cond_heads
+        ), "`lora_paths` and `cond_heads` should have the same length!"
+        if adapter_weights:
+            assert type(lora_paths) == type(
+                adapter_weights
+            ), "`lora_paths` and `adapter_weights` should have the same type!"
+            assert n == len(
+                adapter_weights
+            ), "`lora_paths` and `adapter_weights` should have the same length!"
+        if semi_autoregressive_flags:
+            assert type(lora_paths) == type(
+                semi_autoregressive_flags
+            ), "`lora_paths` and `semi_autoregressive_flags` should have the same type!"
+            assert n == len(
+                semi_autoregressive_flags
+            ), "`lora_paths` and `semi_autoregressive_flags` should have the same length!"
+        _label_is_dict = isinstance(lora_paths, dict)
+        if isinstance(lora_paths, list):
+            names = tuple(f"val_{i}" for i in range(n))
+            lora_paths = dict(zip(names, lora_paths))
+            cond_heads = dict(zip(names, cond_heads))
+            if not adapter_weights:
+                adapter_weights = (1 / n for _ in names)
+            if not semi_autoregressive_flags:
+                semi_autoregressive_flags = (False for _ in names)
+            adapter_weights = dict(zip(names, adapter_weights))
+            semi_autoregressive_flags = dict(zip(names, semi_autoregressive_flags))
+        else:
+            names = tuple(lora_paths.keys())
+            if not adapter_weights:
+                adapter_weights = dict(zip(names, (1 / n for _ in names)))
+            if not semi_autoregressive_flags:
+                semi_autoregressive_flags = dict(zip(names, (False for _ in names)))
+        base_model = ChemBFN.from_checkpoint(base_model_path)
+        models = dict(zip(names, (deepcopy(base_model.eval()) for _ in names)))
+        for k in names:
+            with open(lora_paths[k], "rb") as f:
+                state = torch.load(f, "cpu", weights_only=True)
+            lora_nn, lora_param = state["lora_nn"], state["lora_param"]
+            models[k].enable_lora(**lora_param)
+            models[k].load_state_dict(lora_nn, False)
+            models[k].semi_autoregressive = semi_autoregressive_flags[k]
+        super().__init__(**base_model.hparam)
+        self.cond_heads = nn.ModuleDict(cond_heads)
+        self.models = nn.ModuleDict(models)
+        self.adapter_weights = adapter_weights
+        self._label_is_dict = _label_is_dict  # flag
+        # ------- remove unnecessary submodules -------
+        self.embedding = None
+        self.time_embed = None
+        self.position = None
+        self.encoder_layers = None
+        self.final_layer = None
+        self.__delattr__("embedding")
+        self.__delattr__("time_embed")
+        self.__delattr__("position")
+        self.__delattr__("encoder_layers")
+        self.__delattr__("final_layer")
+        # ------- remove unused attributes -------
+        self.__delattr__("semi_autoregressive")
+        self.__delattr__("lora_enabled")
+        self.__delattr__("lora_param")
+        self.__delattr__("hparam")
+
+    def construct_y(
+        self, c: Union[List[Tensor], Dict[str, Tensor]]
+    ) -> Dict[str, Tensor]:
+        assert (
+            isinstance(c, dict) is self._label_is_dict
+        ), f"`c` should be a {'`dict` instance' if self._label_is_dict else '`list` instance'} but got {type(c)} instand."
+        out: Dict[str, Tensor] = {}
+        if isinstance(c, list):
+            c = dict(zip([f"val_{i}" for i in range(len(c))], c))
+        for name, model in self.cond_heads.items():
+            y = model.forward(c[name])
+            if y.dim() == 2:
+                y = y[:, None, :]
+            out[name] = y
+        return out
+
+    def discrete_output_distribution(
+        self, theta: Tensor, t: Tensor, y: Dict[str, Tensor], w: float
+    ) -> Tensor:
+        """
+        :param theta: input distribution; shape: (n_b, n_t, n_vocab)
+        :param t: continuous time in [0, 1]; shape: (n_b, 1, 1)
+        :param y: a dict of conditioning vectors; shape: (n_b, 1, n_f) * n_h
+        :param w: guidance strength controlling the conditional generation
+        :type theta: torch.Tensor
+        :type t: torch.Tensor
+        :type y: dict
+        :type w: float
+        :return: output distribution; shape: (n_b, n_t, n_vocab)
+        :rtype: torch.Tensor
+        """
+        theta = 2 * theta - 1  # rescale to [-1, 1]
+        p_uncond, p_cond = torch.zeros_like(theta), torch.zeros_like(theta)
+        # Q: Why not use `torch.vmap`? It's faster than doing the loop, isn't it?
+        #
+        # A: We have quite a few reasons to avoid using `vmap`:
+        #    1. JIT doesn't support vmap;
+        #    2. It's harder to switch on/off semi-autroregssive behaviours for individual
+        #       models when all models are stacked into one (we have a solution but it's not
+        #       that elegant);
+        #    3. We just found that the result from vmap was not identical to doing the loop;
+        #    4. vmap requires all models have the same size but it's not always that case
+        #       since we sometimes use different ranks of LoRA in finetuning.
+        for name, model in self.models.items():
+            p_uncond_ = model.forward(theta, t, None, None)
+            p_uncond += p_uncond_ * self.adapter_weights[name]
+            p_cond_ = model.forward(theta, t, None, y[name])
+            p_cond += p_cond_ * self.adapter_weights[name]
+        return softmax((1 + w) * p_cond - w * p_uncond, -1)
+
+    @staticmethod
+    def reshape_y(y: Dict[str, Tensor]) -> Dict[str, Tensor]:
+        for k in y:
+            assert y[k].dim() <= 3
+            if y[k].dim() == 2:
+                y[k] = y[k][:, None, :]
+        return y
+
+    @torch.inference_mode()
+    def sample(
+        self,
+        batch_size: int,
+        sequence_size: int,
+        conditions: Union[List[Tensor], Dict[str, Tensor]],
+        sample_step: int = 100,
+        guidance_strength: float = 4.0,
+        token_mask: Optional[Tensor] = None,
+    ) -> Tuple[Tensor, Tensor]:
+        """
+        Sample from a piror distribution.
+
+        :param batch_size: batch size
+        :param sequence_size: max sequence length
+        :param conditions: guidance conditions; shape: (n_b, n_c) * n_h
+        :param sample_step: number of sampling steps
+        :param guidance_strength: strength of conditional generation. It is not used if y is null.
+        :param token_mask: token mask; shape: (1, 1, n_vocab)
+        :type batch_size: int
+        :type sequence_size: int
+        :type conditions: list | dict
+        :type sample_step: int
+        :type guidance_strength: float
+        :type token_mask: torch.Tensor | None
+        :return: sampled token indices; shape: (n_b, n_t) \n
+                 entropy of the tokens; shape: (n_b)
+        :rtype: tuple
+        """
+        y = self.construct_y(conditions)
+        return super().sample(
+            batch_size, sequence_size, y, sample_step, guidance_strength, token_mask
+        )
+
+    @torch.inference_mode()
+    def ode_sample(
+        self,
+        batch_size: int,
+        sequence_size: int,
+        conditions: Union[List[Tensor], Dict[str, Tensor]],
+        sample_step: int = 100,
+        guidance_strength: float = 4.0,
+        token_mask: Optional[Tensor] = None,
+        temperature: float = 0.5,
+    ) -> Tuple[Tensor, Tensor]:
+        """
+        ODE-based sampling.
+
+        :param batch_size: batch size
+        :param sequence_size: max sequence length
+        :param conditions: conditioning vector; shape: (n_b, n_c) * n_h
+        :param sample_step: number of sampling steps
+        :param guidance_strength: strength of conditional generation. It is not used if y is null.
+        :param token_mask: token mask; shape: (1, 1, n_vocab)
+        :param temperature: sampling temperature
+        :type batch_size: int
+        :type sequence_size: int
+        :type conditions: list | dict
+        :type sample_step: int
+        :type guidance_strength: float
+        :type token_mask: torch.Tensor | None
+        :type temperature: float
+        :return: sampled token indices; shape: (n_b, n_t) \n
+                 entropy of the tokens; shape: (n_b)
+        :rtype: tuple
+        """
+        y = self.construct_y(conditions)
+        return super().ode_sample(
+            batch_size,
+            sequence_size,
+            y,
+            sample_step,
+            guidance_strength,
+            token_mask,
+            temperature,
+        )
+
+    @torch.inference_mode()
+    def inpaint(
+        self,
+        x: Tensor,
+        conditions: Union[List[Tensor], Dict[str, Tensor]],
+        sample_step: int = 100,
+        guidance_strength: float = 4.0,
+        token_mask: Optional[Tensor] = None,
+    ) -> Tuple[Tensor, Tensor]:
+        """
+        Molecule inpaint functionality.
+
+        :param x: categorical indices of scaffold; shape: (n_b, n_t)
+        :param conditions: conditioning vector; shape: (n_b, n_c) * n_h
+        :param sample_step: number of sampling steps
+        :param guidance_strength: strength of conditional generation. It is not used if y is null.
+        :param token_mask: token mask; shape: (1, 1, n_vocab)
+        :type x: torch.Tensor
+        :type conditions: list | dict
+        :type sample_step: int
+        :type guidance_strength: float
+        :type token_mask: torch.Tensor | None
+        :return: sampled token indices; shape: (n_b, n_t) \n
+                 entropy of the tokens; shape: (n_b)
+        :rtype: tuple
+        """
+        y = self.construct_y(conditions)
+        return super().inpaint(x, y, sample_step, guidance_strength, token_mask)
+
+    @torch.inference_mode()
+    def ode_inpaint(
+        self,
+        x: Tensor,
+        conditions: Union[List[Tensor], Dict[str, Tensor]],
+        sample_step: int = 100,
+        guidance_strength: float = 4.0,
+        token_mask: Optional[Tensor] = None,
+        temperature: float = 0.5,
+    ) -> Tuple[Tensor, Tensor]:
+        """
+        ODE inpainting.
+
+        :param x: categorical indices of scaffold; shape: (n_b, n_t)
+        :param conditions: conditioning vector; shape: (n_b, n_c) * n_h
+        :param sample_step: number of sampling steps
+        :param guidance_strength: strength of conditional generation. It is not used if y is null.
+        :param token_mask: token mask; shape: (1, 1, n_vocab)
+        :param temperature: sampling temperature
+        :type x: torch.Tensor
+        :type conditions: list | dict
+        :type sample_step: int
+        :type guidance_strength: float
+        :type token_mask: torch.Tensor | None
+        :type temperature: float
+        :return: sampled token indices; shape: (n_b, n_t) \n
+                 entropy of the tokens; shape: (n_b)
+        :rtype: tuple
+        """
+        y = self.construct_y(conditions)
+        return super().ode_inpaint(
+            x, y, sample_step, guidance_strength, token_mask, temperature
+        )
+
+    def quantise(
+        self, quantise_method: Optional[Callable[[ChemBFN], nn.Module]] = None
+    ) -> None:
+        """
+        Quantise the submodels. \n
+        This method should be called, if necessary, before `torch.compile()`.
+
+        :param quantise_method: quantisation method; default is `bayesianflow_for_chem.tool.quantise_model`
+        :type quantise_method: callable | None
+        :return:
+        :rtype: None
+        """
+        if quantise_method is None:
+            from bayesianflow_for_chem.tool import quantise_model
+
+            quantise_method = quantise_model
+        for k, v in self.models.items():
+            self.models[k] = quantise_method(v)
+
+    def jit(self, freeze: bool = False) -> None:
+        """
+        JIT compile the submodels. \n
+        This method should be called, if necessary, before `quantise()` method is called if applied.
+
+        :param freeze: whether to freeze the submodels; default is `False`. If set to `True` this
+                       method should be called before moving the model to a different device.
+        :type freeze: bool
+        :return:
+        :rtype: None
+        """
+        for k, v in self.models.items():
+            self.models[k] = torch.jit.script(v)
+            if freeze:
+                self.models[k] = torch.jit.freeze(
+                    self.models[k], ["semi_autoregressive"]
+                )
+
+    @torch.jit.ignore
+    def forward(self, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+    def cts_loss(self, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+    def reconstruction_loss(self, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+    def enable_lora(self, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+    def inference(self, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+    @classmethod
+    def from_checkpoint(cls, *_, **__) -> None:
+        """
+        Don't use this method!
+        """
+        raise NotImplementedError("There's nothing here!")
+
+
 if __name__ == "__main__":
     ...
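The new `EnsembleChemBFN` wires several LoRA-finetuned copies of one base ChemBFN together, mixing the guided and unguided logits of each adapter with per-adapter weights before the classifier-free-guidance softmax. A minimal usage sketch under stated assumptions: every checkpoint path below is a placeholder, the conditioning heads are assumed to be `MLP` checkpoints exported from earlier fine-tuning runs, and the property names/keys are illustrative only:

```python
import torch
from bayesianflow_for_chem.model import MLP, EnsembleChemBFN

# Placeholder file names: substitute your own base model, LoRA adapters
# and per-property conditioning heads.
ensemble = EnsembleChemBFN(
    base_model_path="model.pt",
    lora_paths={"logp": "lora_logp.pt", "tpsa": "lora_tpsa.pt"},
    cond_heads={
        "logp": MLP.from_checkpoint("readout_logp.pt"),
        "tpsa": MLP.from_checkpoint("readout_tpsa.pt"),
    },
    adapter_weights={"logp": 0.5, "tpsa": 0.5},
)
# One raw condition tensor per head, shape (n_b, n_c); keys must match lora_paths.
conditions = {"logp": torch.tensor([[2.5]]), "tpsa": torch.tensor([[60.0]])}
tokens, entropy = ensemble.sample(
    batch_size=1, sequence_size=60, conditions=conditions, guidance_strength=4.0
)
```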
{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/tool.py
RENAMED

@@ -1,11 +1,11 @@
 # -*- coding: utf-8 -*-
 # Author: Nianze A. TAO (Omozawa SUENO)
 """
-
+Essential tools.
 """
-import re
 import csv
 import random
+import warnings
 from copy import deepcopy
 from pathlib import Path
 from typing import List, Dict, Tuple, Union, Optional
@@ -16,7 +16,16 @@ from torch import cuda, Tensor, softmax
 from torch.ao import quantization
 from torch.utils.data import DataLoader
 from typing_extensions import Self
-from rdkit.Chem import
+from rdkit.Chem.rdchem import Mol, Bond
+from rdkit.Chem import (
+    rdDetermineBonds,
+    MolFromXYZBlock,
+    MolFromSmiles,
+    MolToSmiles,
+    CanonSmiles,
+    AllChem,
+    AddHs,
+)
 from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles  # type: ignore
 from sklearn.metrics import (
     roc_auc_score,
@@ -26,35 +35,8 @@ from sklearn.metrics import (
     mean_absolute_error,
     root_mean_squared_error,
 )
-
-try:
-    from pynauty import Graph, canon_label  # type: ignore
-
-    _use_pynauty = True
-except ImportError:
-    import warnings
-
-    _use_pynauty = False
-
 from .data import VOCAB_KEYS
-from .model import ChemBFN, MLP, Linear
-
-
-_atom_regex_pattern = (
-    r"(H[e,f,g,s,o]?|"
-    r"L[i,v,a,r,u]|"
-    r"B[e,r,a,i,h,k]?|"
-    r"C[l,a,r,o,u,d,s,n,e,m,f]?|"
-    r"N[e,a,i,b,h,d,o,p]?|"
-    r"O[s,g]?|S[i,c,e,r,n,m,b,g]?|"
-    r"K[r]?|T[i,c,e,a,l,b,h,m,s]|"
-    r"G[a,e,d]|R[b,u,h,e,n,a,f,g]|"
-    r"Yb?|Z[n,r]|P[t,o,d,r,a,u,b,m]?|"
-    r"F[e,r,l,m]?|M[g,n,o,t,c,d]|"
-    r"A[l,r,s,g,u,t,c,m]|I[n,r]?|"
-    r"W|X[e]|E[u,r,s]|U|D[b,s,y])"
-)
-_atom_regex = re.compile(_atom_regex_pattern)
+from .model import ChemBFN, MLP, Linear, EnsembleChemBFN


 def _find_device() -> torch.device:
@@ -65,10 +47,6 @@ def _find_device() -> torch.device:
     return torch.device("cpu")


-def _bond_pair_idx(bonds: Bond) -> List[List[int]]:
-    return [[i.GetBeginAtomIdx(), i.GetEndAtomIdx()] for i in bonds]
-
-
 @torch.no_grad()
 def test(
     model: ChemBFN,
@@ -196,11 +174,14 @@ def split_dataset(
                 "\033[0m",
                 stacklevel=2,
             )
-
-
-        scaffolds
-
-
+        try:
+            scaffold = MurckoScaffoldSmiles(d[smiles_idx[0]])
+            if scaffold in scaffolds:
+                scaffolds[scaffold].append(key)
+            else:
+                scaffolds[scaffold] = [key]
+        except ValueError:  # do nothing when SMILES is not valid
+            ...
     scaffolds = {key: sorted(value) for key, value in scaffolds.items()}
     train_set, test_set, val_set = [], [], []
     for idxs in scaffolds.values():
@@ -222,137 +203,13 @@ def split_dataset(
     writer.writerows([header] + val_set)


-def geo2seq(
-    symbols: List[str],
-    coordinates: np.ndarray,
-    decimals: int = 2,
-    angle_unit: str = "degree",
-) -> str:
-    """
-    Geometry-to-sequence function.\n
-    The algorithm follows the descriptions in paper: https://arxiv.org/abs/2408.10120.
-
-    :param symbols: a list of atomic symbols
-    :param coordinates: Cartesian coordinates; shape: (n_a, 3)
-    :param decimals: number of decimal places to round to
-    :param angle_unit: `'degree'` or `'radian'`
-    :type symbols: list
-    :type coordinates: numpy.ndarray
-    :type decimals: int
-    :type angle_unit: str
-    :return: `Geo2Seq` string
-    :rtype: str
-    """
-    assert angle_unit in ("degree", "radian")
-    angle_scale = 180 / np.pi if angle_unit == "degree" else 1.0
-    n = len(symbols)
-    if n == 1:
-        return f"{symbols[0]} {'0.0'} {'0.0'} {'0.0'}"
-    xyz_block = [str(n), ""]
-    for i, atom in enumerate(symbols):
-        xyz_block.append(
-            f"{atom} {'%.10f' % coordinates[i][0].item()} {'%.10f' % coordinates[i][1].item()} {'%.10f' % coordinates[i][2].item()}"
-        )
-    mol = MolFromXYZBlock("\n".join(xyz_block))
-    rdDetermineBonds.DetermineConnectivity(mol)
-    # ------- Canonicalization -------
-    if _use_pynauty:
-        pair_idx = np.array(_bond_pair_idx(mol.GetBonds())).T.tolist()
-        pair_dict: Dict[int, List[int]] = {}
-        for key, i in enumerate(pair_idx[0]):
-            if i not in pair_dict:
-                pair_dict[i] = [pair_idx[1][key]]
-            else:
-                pair_dict[i].append(pair_idx[1][key])
-        g = Graph(n, adjacency_dict=pair_dict)
-        cl = canon_label(g)  # type: list
-    else:
-        warnings.warn(
-            "\033[32;1m"
-            "`pynauty` is not installed."
-            " Switched to canonicalization function provided by `rdkit`."
-            " This is the expected behaviour only if you are working on Windows platform."
-            "\033[0m",
-            stacklevel=2,
-        )
-        cl = list(CanonicalRankAtoms(mol, breakTies=True))
-    symbols = np.array([[s] for s in symbols])[cl].flatten().tolist()
-    coordinates = coordinates[cl]
-    # ------- Find global coordinate frame -------
-    if n == 2:
-        d = np.round(np.linalg.norm(coordinates[0] - coordinates[1], 2), decimals)
-        return f"{symbols[0]} {'0.0'} {'0.0'} {'0.0'} {symbols[1]} {d} {'0.0'} {'0.0'}"
-    for idx_0 in range(n - 2):
-        _vec0 = coordinates[idx_0] - coordinates[idx_0 + 1]
-        _vec1 = coordinates[idx_0] - coordinates[idx_0 + 2]
-        _d1 = np.linalg.norm(_vec0, 2)
-        _d2 = np.linalg.norm(_vec1, 2)
-        if 1 - np.abs(np.dot(_vec0, _vec1) / (_d1 * _d2)) > 1e-6:
-            break
-    x = (coordinates[idx_0 + 1] - coordinates[idx_0]) / _d1
-    y = np.cross((coordinates[idx_0 + 2] - coordinates[idx_0]), x)
-    y_d = np.linalg.norm(y, 2)
-    y = y / np.ma.filled(np.ma.array(y_d, mask=y_d == 0), np.inf)
-    z = np.cross(x, y)
-    # ------- Build spherical coordinates -------
-    vec = coordinates - coordinates[idx_0]
-    d = np.linalg.norm(vec, 2, axis=-1)
-    _d = np.ma.filled(np.ma.array(d, mask=d == 0), np.inf)
-    theta = angle_scale * np.arccos(np.dot(vec, z) / _d)  # in [0, \pi]
-    phi = angle_scale * np.arctan2(np.dot(vec, y), np.dot(vec, x))  # in [-\pi, \pi]
-    info = np.vstack([d, theta, phi]).T
-    info[idx_0] = np.zeros(3)
-    info = [
-        f"{symbols[i]} {r[0]} {r[1]} {r[2]}"
-        for i, r in enumerate(np.round(info, decimals))
-    ]
-    return " ".join(info)
-
-
-def seq2geo(
-    seq: str, angle_unit: str = "degree"
-) -> Optional[Tuple[List[str], List[List[float]]]]:
-    """
-    Sequence-to-geometry function.\n
-    The method follows the descriptions in paper: https://arxiv.org/abs/2408.10120.
-
-    :param seq: `Geo2Seq` string
-    :param angle_unit: `'degree'` or `'radian'`
-    :type seq: str
-    :type angle_unit: str
-    :return: (symbols, coordinates) if `seq` is valid
-    :rtype: tuple | None
-    """
-    assert angle_unit in ("degree", "radian")
-    angle_scale = np.pi / 180 if angle_unit == "degree" else 1.0
-    tokens = seq.split()
-    if len(tokens) % 4 == 0:
-        tokens = np.array(tokens).reshape(-1, 4).tolist()
-        symbols, coordinates = [], []
-        for i in tokens:
-            symbol = i[0]
-            if len(_atom_regex.findall(symbol)) != 1:
-                return None
-            symbols.append(symbol)
-            try:
-                d, theta, phi = float(i[1]), float(i[2]), float(i[3])
-                x = d * np.sin(theta * angle_scale) * np.cos(phi * angle_scale)
-                y = d * np.sin(theta * angle_scale) * np.sin(phi * angle_scale)
-                z = d * np.cos(theta * angle_scale)
-                coordinates.append([x.item(), y.item(), z.item()])
-            except ValueError:
-                return None
-        return symbols, coordinates
-    return None
-
-
 @torch.no_grad()
 def sample(
-    model: ChemBFN,
+    model: Union[ChemBFN, EnsembleChemBFN],
     batch_size: int,
     sequence_size: int,
     sample_step: int = 100,
-    y: Optional[Tensor] = None,
+    y: Optional[Union[Tensor, Dict[str, Tensor], List[Tensor]]] = None,
     guidance_strength: float = 4.0,
     device: Union[str, torch.device, None] = None,
     vocab_keys: List[str] = VOCAB_KEYS,
@@ -368,7 +225,9 @@ def sample(
     :param batch_size: batch size
     :param sequence_size: max sequence length
     :param sample_step: number of sampling steps
-    :param y: conditioning vector;
+    :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f) \n
+        or a list/`dict` of conditions; shape: (n_b, n_c) * n_h
+
     :param guidance_strength: strength of conditional generation. It is not used if y is null.
     :param device: hardware accelerator
     :param vocab_keys: a list of (ordered) vocabulary
@@ -376,11 +235,11 @@ def sample(
     :param method: sampling method chosen from `"ODE:x"` or `"BFN"` where `x` is the value of sampling temperature; default is `"BFN"`
     :param allowed_tokens: a list of allowed tokens; default is `"all"`
     :param sort: whether to sort the samples according to entropy values; default is `False`
-    :type model: bayesianflow_for_chem.model.ChemBFN
+    :type model: bayesianflow_for_chem.model.ChemBFN | bayesianflow_for_chem.model.EnsembleChemBFN
     :type batch_size: int
     :type sequence_size: int
    :type sample_step: int
-    :type y: torch.Tensor | None
+    :type y: torch.Tensor | list | dict | None
     :type guidance_strength: float
     :type device: str | torch.device | None
     :type vocab_keys: list
@@ -392,11 +251,23 @@ def sample(
     :rtype: list
     """
     assert method.split(":")[0].lower() in ("ode", "bfn")
+    if isinstance(model, EnsembleChemBFN):
+        assert y is not None, "conditioning is required while using an ensemble model."
+        assert isinstance(y, list) or isinstance(y, dict)
+    else:
+        assert isinstance(y, Tensor) or y is None
     if device is None:
         device = _find_device()
     model.to(device).eval()
     if y is not None:
-        y
+        if isinstance(y, Tensor):
+            y = y.to(device)
+        elif isinstance(y, list):
+            y = [i.to(device) for i in y]
+        elif isinstance(y, dict):
+            y = {k: v.to(device) for k, v in y.items()}
+        else:
+            raise NotImplementedError
     if isinstance(allowed_tokens, list):
         token_mask = [0 if i in allowed_tokens else 1 for i in vocab_keys]
         token_mask = torch.tensor([[token_mask]], dtype=torch.bool).to(device)
@@ -426,10 +297,10 @@ def sample(

 @torch.no_grad()
 def inpaint(
-    model: ChemBFN,
+    model: Union[ChemBFN, EnsembleChemBFN],
     x: Tensor,
     sample_step: int = 100,
-    y: Optional[Tensor] = None,
+    y: Optional[Union[Tensor, Dict[str, Tensor], List[Tensor]]] = None,
     guidance_strength: float = 4.0,
     device: Union[str, torch.device, None] = None,
     vocab_keys: List[str] = VOCAB_KEYS,
@@ -444,7 +315,9 @@ def inpaint(
     :param model: trained ChemBFN model
     :param x: categorical indices of scaffold; shape: (n_b, n_t)
     :param sample_step: number of sampling steps
-    :param y: conditioning vector; shape: (n_b, 1, n_f)
+    :param y: conditioning vector; shape: (n_b, 1, n_f) or (n_b, n_f) \n
+        or a list/`dict` of conditions; shape: (n_b, n_c) * n_h
+
     :param guidance_strength: strength of conditional generation. It is not used if y is null.
     :param device: hardware accelerator
     :param vocab_keys: a list of (ordered) vocabulary
@@ -452,10 +325,10 @@ def inpaint(
     :param method: sampling method chosen from `"ODE:x"` or `"BFN"` where `x` is the value of sampling temperature; default is `"BFN"`
     :param allowed_tokens: a list of allowed tokens; default is `"all"`
     :param sort: whether to sort the samples according to entropy values; default is `False`
-    :type model: bayesianflow_for_chem.model.ChemBFN
+    :type model: bayesianflow_for_chem.model.ChemBFN | bayesianflow_for_chem.model.EnsembleChemBFN
     :type x: torch.Tensor
     :type sample_step: int
-    :type y: torch.Tensor | None
+    :type y: torch.Tensor | list | dict | None
     :type guidance_strength: float
     :type device: str | torch.device | None
     :type vocab_keys: list
@@ -467,12 +340,24 @@ def inpaint(
     :rtype: list
     """
     assert method.split(":")[0].lower() in ("ode", "bfn")
+    if isinstance(model, EnsembleChemBFN):
+        assert y is not None, "conditioning is required while using an ensemble model."
+        assert isinstance(y, list) or isinstance(y, dict)
+    else:
+        assert isinstance(y, Tensor) or y is None
     if device is None:
         device = _find_device()
     model.to(device).eval()
     x = x.to(device)
     if y is not None:
-        y
+        if isinstance(y, Tensor):
+            y = y.to(device)
+        elif isinstance(y, list):
+            y = [i.to(device) for i in y]
+        elif isinstance(y, dict):
+            y = {k: v.to(device) for k, v in y.items()}
+        else:
+            raise NotImplementedError
     if isinstance(allowed_tokens, list):
         token_mask = [0 if i in allowed_tokens else 1 for i in vocab_keys]
         token_mask = torch.tensor([[token_mask]], dtype=torch.bool).to(device)
@@ -585,6 +470,8 @@ def quantise_model(model: ChemBFN) -> nn.Module:
         assert hasattr(
             mod, "qconfig"
         ), "Input float module must have qconfig defined"
+        if use_precomputed_fake_quant:
+            warnings.warn("Fake quantize operator is not implemented.")
         if mod.qconfig is not None and mod.qconfig.weight is not None:
             weight_observer = mod.qconfig.weight()
         else:
@@ -637,3 +524,81 @@ def quantise_model(model: ChemBFN) -> nn.Module:
         model, {nn.Linear, Linear}, torch.qint8, mapping
     )
     return quantised_model
+
+
+class GeometryConverter:
+    """
+    Converting between different 2D/3D molecular representations.
+    """
+
+    @staticmethod
+    def _xyz2mol(symbols: List[str], coordinates: np.ndarray) -> Mol:
+        xyz_block = [str(len(symbols)), ""]
+        r = coordinates
+        for i, atom in enumerate(symbols):
+            xyz_block.append(f"{atom} {r[i][0]:.10f} {r[i][1]:.10f} {r[i][2]:.10f}")
+        return MolFromXYZBlock("\n".join(xyz_block))
+
+    @staticmethod
+    def _bond_pair_idx(bonds: Bond) -> List[List[int]]:
+        return [[i.GetBeginAtomIdx(), i.GetEndAtomIdx()] for i in bonds]
+
+    @staticmethod
+    def smiles2cartesian(
+        smiles: str, num_conformers: int = 50, random_seed: int = 42
+    ) -> Tuple[List[str], np.ndarray]:
+        """
+        Guess the 3D geometry from SMILES string via MMFF conformer search.
+
+        :param smiles: a valid SMILES string
+        :param num_conformers: number of initial conformers
+        :param random_seed: random seed used to generate conformers
+        :type smiles: str
+        :type num_conformers: int
+        :type random_seed: int
+        :return: atomic symbols \n
+                 cartesian coordinates; shape: (n_a, 3)
+        :rtype: tuple
+        """
+        mol = MolFromSmiles(smiles)
+        mol = AddHs(mol)
+        AllChem.EmbedMultipleConfs(mol, numConfs=num_conformers, randomSeed=random_seed)
+        symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
+        energies = []
+        for conf_id in range(num_conformers):
+            ff = AllChem.MMFFGetMoleculeForceField(
+                mol, AllChem.MMFFGetMoleculeProperties(mol), confId=conf_id
+            )
+            energy = ff.CalcEnergy()
+            energies.append((conf_id, energy))
+        lowest_energy_conf = min(energies, key=lambda x: x[1])
+        coordinates = mol.GetConformer(id=lowest_energy_conf[0]).GetPositions()
+        return symbols, coordinates
+
+    def cartesian2smiles(
+        self,
+        symbols: List[str],
+        coordinates: np.ndarray,
+        charge: int = 0,
+        canonical: bool = True,
+    ) -> str:
+        """
+        Transform (guess out) molecular geometry to SMILES string.
+
+        :param symbols: a list of atomic symbols
+        :param coordinates: Cartesian coordinates; shape: (n_a, 3)
+        :param charge: net charge
+        :param canonical: whether to canonicalise the SMILES
+        :type symbols: list
+        :type coordinates: numpy.ndarray
+        :type charge: int
+        :type canonical: bool
+        :return: SMILES string
+        :rtype: str
+        """
+        mol = self._xyz2mol(symbols, coordinates)
+        rdDetermineBonds.DetermineBonds(mol, charge=charge)
+        smiles = MolToSmiles(mol)
+        if canonical:
+            smiles = CanonSmiles(smiles)
+        return smiles

{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem/train.py
RENAMED

@@ -37,7 +37,8 @@ class Model(LightningModule):
     """
     A `~lightning.LightningModule` wrapper of bayesian flow network for chemistry model.\n
     This module is used in training stage only. By calling `Model(...).export_model(YOUR_WORK_DIR)` after training,
-    the model(s) will be saved to `YOUR_WORK_DIR/model.pt`
+    the model(s) will be saved to `YOUR_WORK_DIR/model.pt` (if LoRA is enabled then `YOUR_WORK_DIR/lora.pt`)
+    and (if exists) `YOUR_WORK_DIR/mlp.pt`.

     :param model: `~bayesianflow_for_chem.model.ChemBFN` instance.
     :param mlp: `~bayesianflow_for_chem.model.MLP` instance or `None`.
@@ -135,7 +136,8 @@ class Regressor(LightningModule):
     """
     A `~lightning.LightningModule` wrapper of bayesian flow network for chemistry regression model.\n
     This module is used in training stage only. By calling `Regressor(...).export_model(YOUR_WORK_DIR)` after training,
-    the models will be saved to `YOUR_WORK_DIR/
+    the models will be saved to `YOUR_WORK_DIR/model_ft.pt` (if LoRA is enabled then `YOUR_WORK_DIR/lora.pt`)
+    and `YOUR_WORK_DIR/readout.pt`.

     :param model: `~bayesianflow_for_chem.model.ChemBFN` instance.
     :param mlp: `~bayesianflow_for_chem.model.MLP` instance.
@@ -218,7 +220,7 @@ class Regressor(LightningModule):
         """
         Save the trained model.

-        :param workdir: the directory to save the
+        :param workdir: the directory to save the models
         :type workdir: pathlib.Path
         :return:
         :rtype: None
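The `GeometryConverter` added to `tool.py` replaces the removed `geo2seq`/`seq2geo` utilities with RDKit-only round trips between SMILES strings and Cartesian geometries. A small sketch, using ethanol as an arbitrary example molecule:

```python
from bayesianflow_for_chem.tool import GeometryConverter

converter = GeometryConverter()
# SMILES -> lowest-MMFF-energy conformer (atomic symbols + Cartesian coordinates)
symbols, coords = converter.smiles2cartesian("CCO", num_conformers=20)
# ...and back: perceive bonds from the geometry and emit a canonical SMILES
smiles = converter.cartesian2smiles(symbols, coords, charge=0, canonical=True)
print(smiles)
```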
{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/bayesianflow_for_chem.egg-info/PKG-INFO
RENAMED

@@ -1,16 +1,15 @@
 Metadata-Version: 2.4
 Name: bayesianflow_for_chem
-Version: 1.2.7
+Version: 1.4.0
 Summary: Bayesian flow network framework for Chemistry
 Home-page: https://augus1999.github.io/bayesian-flow-network-for-chemistry/
 Author: Nianze A. Tao
 Author-email: tao-nianze@hiroshima-u.ac.jp
-License: AGPL-3.0
+License: AGPL-3.0-or-later
 Project-URL: Source, https://github.com/Augus1999/bayesian-flow-network-for-chemistry
 Keywords: Chemistry,CLM,ChemBFN
 Classifier: Development Status :: 5 - Production/Stable
 Classifier: Intended Audience :: Science/Research
-Classifier: License :: OSI Approved :: GNU Affero General Public License v3
 Classifier: Natural Language :: English
 Classifier: Programming Language :: Python :: 3
 Classifier: Programming Language :: Python :: 3.9
@@ -29,8 +28,6 @@ Requires-Dist: loralib>=0.1.2
 Requires-Dist: lightning>=2.2.0
 Requires-Dist: scikit-learn>=1.5.0
 Requires-Dist: typing_extensions>=4.8.0
-Provides-Extra: geo2seq
-Requires-Dist: pynauty>=2.8.8.1; extra == "geo2seq"
 Dynamic: author
 Dynamic: author-email
 Dynamic: classifier
@@ -41,7 +38,6 @@ Dynamic: keywords
 Dynamic: license
 Dynamic: license-file
 Dynamic: project-url
-Dynamic: provides-extra
 Dynamic: requires-dist
 Dynamic: requires-python
 Dynamic: summary
@@ -87,13 +83,13 @@ You can find example scripts in [📁example](./example) folder.

 ## Pre-trained Model

-You can find pretrained models
+You can find pretrained models on our [🤗Hugging Face model page](https://huggingface.co/suenoomozawa/ChemBFN).

 ## Dataset Handling

 We provide a Python class [`CSVData`](./bayesianflow_for_chem/data.py) to handle data stored in CSV or similar format containing headers to identify the entities. The following is a quickstart.

-1. Download your dataset file (e.g., ESOL
+1. Download your dataset file (e.g., ESOL from [MoleculeNet](https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv)) and split the file:
 ```python
 >>> from bayesianflow_for_chem.tool import split_data

{bayesianflow_for_chem-1.2.7 → bayesianflow_for_chem-1.4.0}/setup.py
RENAMED

@@ -28,7 +28,8 @@ setup(
     description="Bayesian flow network framework for Chemistry",
     long_description=long_description,
     long_description_content_type="text/markdown",
-    license="AGPL-3.0
+    license="AGPL-3.0-or-later",
+    license_files=["LICEN[CS]E*"],
     package_dir={"bayesianflow_for_chem": "bayesianflow_for_chem"},
     package_data={"bayesianflow_for_chem": ["./*.txt", "./*.py"]},
     include_package_data=True,
@@ -45,14 +46,12 @@ setup(
         "scikit-learn>=1.5.0",
         "typing_extensions>=4.8.0",
     ],
-    extras_require={"geo2seq": ["pynauty>=2.8.8.1"]},
     project_urls={
         "Source": "https://github.com/Augus1999/bayesian-flow-network-for-chemistry"
     },
     classifiers=[
         "Development Status :: 5 - Production/Stable",
         "Intended Audience :: Science/Research",
-        "License :: OSI Approved :: GNU Affero General Public License v3",
         "Natural Language :: English",
         "Programming Language :: Python :: 3",
         "Programming Language :: Python :: 3.9",

The remaining files — LICENSE, bayesianflow_for_chem/vocab.txt, bayesianflow_for_chem.egg-info/SOURCES.txt, bayesianflow_for_chem.egg-info/dependency_links.txt, bayesianflow_for_chem.egg-info/top_level.txt, and setup.cfg — are unchanged between the two versions.