protonate-utils 0.1.0__tar.gz
- protonate_utils-0.1.0/LICENSE +21 -0
- protonate_utils-0.1.0/PKG-INFO +294 -0
- protonate_utils-0.1.0/README.md +247 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/PKG-INFO +294 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/SOURCES.txt +10 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/dependency_links.txt +1 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/entry_points.txt +2 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/requires.txt +8 -0
- protonate_utils-0.1.0/protonate_utils.egg-info/top_level.txt +1 -0
- protonate_utils-0.1.0/protonate_utils.py +827 -0
- protonate_utils-0.1.0/pyproject.toml +44 -0
- protonate_utils-0.1.0/setup.cfg +4 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Patrick Walters
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -0,0 +1,294 @@
+Metadata-Version: 2.4
+Name: protonate-utils
+Version: 0.1.0
+Summary: Add hydrogens to ligands and proteins at a target pH.
+Author-email: Patrick Walters <wpwalters@gmail.com>
+License: MIT License
+
+Copyright (c) 2026 Patrick Walters
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+Project-URL: Homepage, https://github.com/PatWalters/protonate_utils
+Project-URL: Repository, https://github.com/PatWalters/protonate_utils
+Project-URL: Issues, https://github.com/PatWalters/protonate_utils/issues
+Keywords: cheminformatics,protonation,hydrogens,rdkit,pdb
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Chemistry
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: rdkit
+Requires-Dist: dimorphite-dl
+Requires-Dist: biotite
+Requires-Dist: hydride
+Requires-Dist: numpy
+Provides-Extra: test
+Requires-Dist: pytest; extra == "test"
+Dynamic: license-file
+
+# protonate_utils
+
+A single utility for adding hydrogens to **ligands** and **proteins** at a
+target pH, for use in molecular modeling and structure-based drug design.
+
+## Why this exists
+
+Most structures you download (a ligand from a database, a protein from the
+PDB) are missing hydrogens, or carry hydrogens that don't reflect the
+protonation state at physiological pH. Getting these right matters: a
+carboxylic acid is deprotonated (`-COO⁻`) at pH 7.4, a basic amine is
+protonated (`-NH₃⁺`), and a histidine side chain can go either way. Downstream
+tasks such as docking, free-energy calculations, MD simulations, and
+electrostatics all depend on the correct charge and hydrogen placement.
+
+Ligands and proteins need different tools for this. Small molecules are best
+handled with cheminformatics pKa models; proteins need residue-aware logic and
+geometry-based hydrogen placement. `protonate_utils.py` wraps the appropriate
+specialist tool for each case behind one consistent interface, so you don't
+have to remember two separate workflows:
+
+- **Ligands** use [Dimorphite-DL](https://github.com/durrantlab/dimorphite_dl)
+  for pH-aware protonation states and [the RDKit](https://www.rdkit.org/) for
+  structure handling. When the input has 3D coordinates, the heavy-atom
+  geometry is preserved exactly: only the newly added hydrogens are given
+  computed positions.
+- **Proteins** use [Hydride](https://hydride.biotite-python.org/) for
+  geometry-based hydrogen addition and
+  [Biotite](https://www.biotite-python.org/) for PDB handling, with formal
+  charges estimated per amino acid at the requested pH.
+
+Everything is exposed both as a **command-line tool** and as an importable
+**Python API**.
+
+## Installation
+
+Clone the repo and install it with `pip`:
+
+```bash
+git clone https://github.com/PatWalters/protonate_utils
+cd protonate_utils
+pip install -e .
+```
+
+This installs the dependencies for both modes (RDKit + Dimorphite-DL for
+ligands, Biotite + Hydride + NumPy for proteins), puts a `protonate-utils`
+command on your `PATH`, and makes `import protonate_utils` available.
+
+## Command-line usage
+
+Once installed, use the `protonate-utils` command. The first argument selects
+the mode: `ligand` or `protein`. (You can also run it without installing via
+`python protonate_utils.py …` from a checkout.)
+
+### Ligands
+
+```bash
+# SDF in, SDF out (3D coordinates preserved, hydrogens placed from geometry)
+protonate-utils ligand input.sdf output.sdf
+
+# SMILES in, SMILES out, with the pH given explicitly
+protonate-utils ligand input.smi output.smi --ph 7.4
+
+# Mixed: read SDF, write SMILES
+protonate-utils ligand input.sdf output.smi
+```
+
+Input and output formats are inferred from the file extension:
+`.smi`/`.smiles` is treated as SMILES, anything else as SDF. SMILES files are
+read one molecule per line as `SMILES [optional name]`.
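
The extension rule and the `SMILES [optional name]` line layout described above can be sketched as follows. This is an illustration, not code from the package: `infer_format` and `parse_smiles_line` are hypothetical names, and treating the extension case-insensitively is an assumption.

```python
from pathlib import Path
from typing import Optional, Tuple

# Extensions the README lists as SMILES; anything else is treated as SDF.
SMILES_EXTENSIONS = {".smi", ".smiles"}


def infer_format(path: str) -> str:
    """Return "smiles" or "sdf" based only on the file extension.

    Lower-casing the suffix is an assumption made for this sketch.
    """
    suffix = Path(path).suffix.lower()
    return "smiles" if suffix in SMILES_EXTENSIONS else "sdf"


def parse_smiles_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one `SMILES [optional name]` line; return None for blank lines."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return None
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else ""  # the name may contain spaces
    return smiles, name
```

Splitting at most once keeps multi-word names such as `acetic acid` intact.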
+
+| Option | Default | Description                |
+|--------|---------|----------------------------|
+| `--ph` | `7.4`   | Target pH for protonation. |
+
+Molecules that fail to parse or protonate are skipped with a warning on
+stderr; the run reports how many were read, written, and skipped.
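
The skip-and-report behavior described above might look roughly like the loop below. It is a sketch under stated assumptions: `run_batch` and its `process` callable are hypothetical stand-ins, and the package's actual messages and control flow may differ.

```python
import sys
from typing import Callable, Iterable, List, Tuple


def run_batch(
    items: Iterable[str], process: Callable[[str], str]
) -> Tuple[List[str], int, int, int]:
    """Apply `process` to each item, skipping failures with a stderr warning.

    Returns (results, n_read, n_written, n_skipped).
    """
    results: List[str] = []
    n_read = 0
    n_skipped = 0
    for item in items:
        n_read += 1
        try:
            results.append(process(item))
        except Exception as exc:  # a failure skips this item, not the whole run
            n_skipped += 1
            print(f"warning: skipping {item!r}: {exc}", file=sys.stderr)
    n_written = len(results)
    print(f"read {n_read}, wrote {n_written}, skipped {n_skipped}", file=sys.stderr)
    return results, n_read, n_written, n_skipped
```

Catching per item rather than around the whole loop is what lets one unparseable molecule leave the rest of the file unaffected.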
+
+### Proteins
+
+```bash
+# Remove a bound ligand by residue name, then add hydrogens
+protonate-utils protein input.pdb AP5 output.pdb
+
+# Keep everything (no ligand removal)
+protonate-utils protein input.pdb none output.pdb --ph 7.0
+```
+
+The second positional argument is the residue name (3-letter CCD code) of a
+ligand to remove before protonation; pass `none` to keep all atoms. Output
+hydrogens are reordered so each one immediately follows the heavy atom it is
+bonded to.
+
+| Option       | Default | Description                                      |
+|--------------|---------|--------------------------------------------------|
+| `--ph`       | `7.0`   | pH used to estimate amino-acid formal charges.   |
+| `--no-relax` | off     | Skip dihedral relaxation of the added hydrogens. |
+
+## Python API
+
+Import the functions directly from `protonate_utils`. There are symmetric
+in-memory and file-to-file entry points for both ligands and proteins.
+
+|                | Ligands                                     | Proteins                             |
+|----------------|---------------------------------------------|--------------------------------------|
+| In-memory core | `protonate_molecule(mol, ph)`               | `protonate_structure(structure, …)`  |
+| Convenience    | `protonate_smiles_string(smiles, ph)`       | n/a                                  |
+| File → file    | `protonate_ligands(in, out, ph)`            | `prepare_structure(in, res, out, …)` |
+| I/O helpers    | `read_molecules(path)`, `make_writer(path)` | (Biotite `PDBFile`)                  |
+
+### Ligands
+
+Protonate a single SMILES string and get a SMILES string back:
+
+```python
+from protonate_utils import protonate_smiles_string
+
+protonate_smiles_string("CC(=O)O")             # 'CC(=O)[O-]'
+protonate_smiles_string("OP(=O)(O)O", ph=7.4)  # 'O=P([O-])([O-])O'
+```
+
+`protonate_smiles_string` raises `ValueError` on an unparseable SMILES; other
+failures (e.g. Dimorphite-DL cannot handle the molecule) propagate as
+exceptions.
+
+Protonate an RDKit `Mol` while preserving its 3D coordinates:
+
+```python
+from rdkit import Chem
+from protonate_utils import protonate_molecule, read_molecules
+
+mol = next(read_molecules("ligand.sdf"))
+protonated = protonate_molecule(mol, ph=7.4)  # Mol with explicit Hs + coords
+```
+
+Pass `add_coord_hs=False` to keep protonation implicit (no explicit hydrogen
+atoms added); this is appropriate when you intend to serialize to SMILES.
+
+Batch-convert a whole file (the CLI ligand path):
+
+```python
+from protonate_utils import protonate_ligands
+
+protonate_ligands("input.sdf", "output.sdf", ph=7.4)
+```
+
+### Proteins
+
+Protonate an in-memory Biotite `AtomArray` and get a hydrogenated one back:
+
+```python
+import biotite.structure.io.pdb as pdb
+from protonate_utils import protonate_structure
+
+structure = pdb.PDBFile.read("input.pdb").get_structure(model=1)
+hydrogenated = protonate_structure(
+    structure,
+    ligand_res_name="AP5",  # or None / "none" to keep all atoms
+    ph=7.0,
+    relax=True,
+)
+```
+
+`protonate_structure` raises `ValueError` if `ligand_res_name` is given but no
+atoms with that residue name exist. The returned `AtomArray` has hydrogens
+added and reordered to follow their bonded heavy atoms.
+
+Read a PDB, protonate, and write a PDB in one call (the CLI protein path):
+
+```python
+from protonate_utils import prepare_structure
+
+prepare_structure("input.pdb", "AP5", "output.pdb", ph=7.0, relax=True)
+```
+
+## How it works
+
+### Ligand protonation
+
+1. Pre-existing hydrogens are stripped; any 3D conformer on the heavy atoms is
+   kept.
+2. Dimorphite-DL enumerates candidate microstate(s) within a ±0.5 pH window.
+   One is chosen deterministically by a **site-by-site plausibility** check
+   rather than by net charge (see
+   [Correcting Dimorphite-DL microstates](#correcting-dimorphite-dl-microstates)
+   below), and any residual implausible ionization is repaired against the
+   input. The SMILES string is a final tiebreak, so re-runs are stable.
+3. The chosen template's formal charges **and** total hydrogen counts are
+   mapped back onto the original atoms via a charge-insensitive substructure
+   match (so `-COOH` still matches `-COO⁻`). Carrying the H count, not just
+   the charge, keeps the RDKit's kekulization correct on aromatic heterocycles.
+4. With 3D input, `Chem.AddHs(mol, addCoords=True)` adds hydrogens positioned
+   from the existing geometry; heavy-atom coordinates are never moved. Without
+   coordinates (SMILES), protonation stays implicit.
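
The deterministic choice in step 2 can be illustrated with a minimal selector. This is a sketch, not the package's implementation: `pick_microstate` is a hypothetical name, and scoring each candidate by a count of implausible sites is an assumption about how the plausibility check ranks candidates. Only the SMILES tiebreak is taken from the text above.

```python
from typing import Callable, Sequence


def pick_microstate(
    candidates: Sequence[str], implausible_sites: Callable[[str], int]
) -> str:
    """Pick the candidate SMILES with the fewest implausible sites.

    Ties go to the lexicographically smallest SMILES, so the result does
    not depend on the order in which candidates were enumerated.
    """
    if not candidates:
        raise ValueError("no candidate microstates")
    return min(candidates, key=lambda smi: (implausible_sites(smi), smi))
```

For example, with a toy scorer that counts `[N-]` occurrences, `pick_microstate(["CC(=O)[N-]C", "CC(=O)NC"], lambda s: s.count("[N-]"))` selects the neutral amide regardless of input order.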
+
+### Correcting Dimorphite-DL microstates
+
+Dimorphite-DL enumerates *every* microstate whose modeled pKa falls anywhere
+near the pH window, including many that are negligibly populated at pH 7.4. Left
+to a "most ionized" or "closest net charge" rule, the selector picks chemically
+wrong states: it deprotonates amides and phenols and protonates anilines. We add
+a per-atom legitimacy check (`_charge_change_is_legitimate`) that compares each
+candidate to the input atom-by-atom and accepts a formal-charge change only when
+that group genuinely ionizes near physiological pH:
+
+| Group | Typical pKa | At pH 7.4 | Dimorphite enumerates | Our choice |
+|-------|-------------|-----------|-----------------------|------------|
+| Aliphatic amine | pKaH ~10 | cation | both | **protonate** |
+| Amidine / guanidine | pKaH ~12–13 | cation | both | **protonate** |
+| Carboxylic acid | ~4 | anion | anion | **deprotonate** |
+| Sulfonic / sulfinic / phosphate / phosphonate | <2–7 | anion | anion | **deprotonate** |
+| Sulfonamide / acylsulfonamide / tetrazole | ~3–10 | anion | both | **deprotonate** |
+| Carboxamide N–H | ~17–22 | neutral | both → `[N⁻]` *or* `[NH⁺]` | **keep neutral** |
+| Aniline / amino-heteroarene | pKaH ~3–5 | neutral | both → `[NH⁺]` | **keep neutral** |
+| Cyanamide (N–C≡N) | pKaH ~0 | neutral | both → `[NH⁺]` | **keep neutral** |
+| Imidazole / pyrazole / indazole / indole / triazole N–H | ~10–17 | neutral | both → `[n⁻]` | **keep neutral** |
+| Phenol / alcohol | ~10–16 | neutral | both → `[O⁻]` | **keep neutral** |
+| Plain thiol / thione | ~7–10 | neutral | both → `[S⁻]` | **keep neutral** |
+
+Two further safeguards:
+
+- **Repair fallback.** When Dimorphite offers *only* an implausibly ionized
+  microstate (e.g. it returns just the `[N⁻]` form of an O-alkyl hydroxamate or
+  imide, with no neutral alternative to select), the offending site is reverted
+  to the input's protonation rather than emitted as-is.
+- **Input charges preserved.** A change is only judged relative to the input, so
+  charges already present in the input (quaternary ammonium salts, *N*-oxides,
+  mesoionic zwitterions) are never altered.
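
The two safeguards above amount to a per-atom comparison against the input. The sketch below shows that idea on plain lists of formal charges. It is illustrative only: `repair_charges` is a hypothetical name, and its `is_legitimate` callback merely stands in for the package's `_charge_change_is_legitimate`, whose real signature is not shown in this diff.

```python
from typing import Callable, List


def repair_charges(
    input_charges: List[int],
    candidate_charges: List[int],
    is_legitimate: Callable[[int, int, int], bool],
) -> List[int]:
    """Accept a candidate's charge at an atom only if the change is legitimate.

    `is_legitimate(atom_index, old_charge, new_charge)` is a hypothetical
    predicate. Atoms whose charge is unchanged are never questioned, so
    charges already present in the input survive untouched.
    """
    if len(input_charges) != len(candidate_charges):
        raise ValueError("atom lists must be the same length")
    repaired: List[int] = []
    for idx, (old, new) in enumerate(zip(input_charges, candidate_charges)):
        if new == old or is_legitimate(idx, old, new):
            repaired.append(new)
        else:
            repaired.append(old)  # revert the offending site to the input state
    return repaired
```

With input charges `[0, 0, 1]`, candidate charges `[-1, -1, 1]`, and a predicate that accepts changes only at atom 0, the result is `[-1, 0, 1]`: the legitimate anion is kept, the implausible one is reverted, and the pre-existing cation is left alone.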
+
+Borderline acids/bases whose pKa sits close to 7.4 (e.g. *p*-nitrophenol ~7.15,
+mercaptoazoles ~7) are deliberately defaulted to neutral; they are roughly
+50/50 at physiological pH, so this is at least as defensible as ionizing them
+and avoids mis-ionizing the far more common ordinary phenols and amides. On
+the 2,173-molecule Biogen logS set, the procedure produced no skips and no
+heavy-atom changes, and the selection was deterministic.
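
The populations behind these choices follow from the Henderson–Hasselbalch relation. The helper functions below are an illustrative sketch (their names are hypothetical, and the package itself relies on Dimorphite-DL's pKa model rather than a calculation like this).

```python
def fraction_deprotonated(pka: float, ph: float) -> float:
    """Fraction of an acid HA present as A- at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def fraction_protonated(pkah: float, ph: float) -> float:
    """Fraction of a base B present as BH+; pkah is the conjugate-acid pKa."""
    return 1.0 / (1.0 + 10.0 ** (ph - pkah))


if __name__ == "__main__":
    ph = 7.4
    # Typical pKa values from the table above, plus the borderline example.
    for name, pka in [("carboxylic acid", 4.0), ("p-nitrophenol", 7.15), ("phenol", 10.0)]:
        print(f"{name}: {fraction_deprotonated(pka, ph):.4f} deprotonated")
    print(f"aliphatic amine: {fraction_protonated(10.0, ph):.4f} protonated")
```

At pH 7.4 a carboxylic acid with pKa 4 is more than 99.9% ionized and a phenol with pKa 10 is under 1% ionized, whereas *p*-nitrophenol at pKa 7.15 comes out at about 64% ionized, which is why it counts as borderline.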
+
+### Protein protonation
+
+1. Optionally remove a ligand by residue name, then strip any existing
+   hydrogens.
+2. Assign covalent bonds from CCD residue templates
+   (`connect_via_residue_names`).
+3. Estimate per-residue formal charges for canonical amino acids at the
+   requested pH (`hydride.estimate_amino_acid_charges`).
+4. Add hydrogens with Hydride and, by default, relax their geometry.
+5. Reorder atoms so each hydrogen immediately follows the heavy atom it is
+   bonded to.
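
The reordering in step 5 can be sketched on plain Python data, without Biotite. This is an illustration under stated assumptions: `hydrogen_follow_order` is a hypothetical name, atoms are represented only by element symbols, and bonds are index pairs. The package operates on a Biotite `AtomArray` instead.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hydrogen_follow_order(
    elements: Sequence[str], bonds: Sequence[Tuple[int, int]]
) -> List[int]:
    """Return atom indices ordered so each H directly follows its heavy atom.

    Heavy atoms keep their relative order. A hydrogen with no bond to a
    heavy atom stays at its original position.
    """
    attached: Dict[int, List[int]] = {}
    placed: Set[int] = set()
    for i, j in bonds:
        # Check both orientations of the bond for an H / heavy-atom pair.
        for h, heavy in ((i, j), (j, i)):
            if elements[h] == "H" and elements[heavy] != "H" and h not in placed:
                attached.setdefault(heavy, []).append(h)
                placed.add(h)
    order: List[int] = []
    for idx, element in enumerate(elements):
        if element == "H" and idx in placed:
            continue  # emitted right after its heavy atom instead
        order.append(idx)
        order.extend(attached.get(idx, []))
    return order
```

The returned index list can then be used to permute any per-atom array; for elements `["N", "C", "O", "H", "H", "H"]` with bonds `[(0, 3), (0, 4), (1, 5), (0, 1), (1, 2)]` the order is `[0, 3, 4, 1, 5, 2]`.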