protonate-utils 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Patrick Walters
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,294 @@
1
+ Metadata-Version: 2.4
2
+ Name: protonate-utils
3
+ Version: 0.1.0
4
+ Summary: Add hydrogens to ligands and proteins at a target pH.
5
+ Author-email: Patrick Walters <wpwalters@gmail.com>
6
+ License: MIT License
7
+
8
+ Copyright (c) 2026 Patrick Walters
9
+
10
+ Permission is hereby granted, free of charge, to any person obtaining a copy
11
+ of this software and associated documentation files (the "Software"), to deal
12
+ in the Software without restriction, including without limitation the rights
13
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
14
+ copies of the Software, and to permit persons to whom the Software is
15
+ furnished to do so, subject to the following conditions:
16
+
17
+ The above copyright notice and this permission notice shall be included in all
18
+ copies or substantial portions of the Software.
19
+
20
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
21
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
22
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
23
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
24
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
25
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
26
+ SOFTWARE.
27
+
28
+ Project-URL: Homepage, https://github.com/PatWalters/protonate_utils
29
+ Project-URL: Repository, https://github.com/PatWalters/protonate_utils
30
+ Project-URL: Issues, https://github.com/PatWalters/protonate_utils/issues
31
+ Keywords: cheminformatics,protonation,hydrogens,rdkit,pdb
32
+ Classifier: Programming Language :: Python :: 3
33
+ Classifier: License :: OSI Approved :: MIT License
34
+ Classifier: Operating System :: OS Independent
35
+ Classifier: Topic :: Scientific/Engineering :: Chemistry
36
+ Requires-Python: >=3.9
37
+ Description-Content-Type: text/markdown
38
+ License-File: LICENSE
39
+ Requires-Dist: rdkit
40
+ Requires-Dist: dimorphite-dl
41
+ Requires-Dist: biotite
42
+ Requires-Dist: hydride
43
+ Requires-Dist: numpy
44
+ Provides-Extra: test
45
+ Requires-Dist: pytest; extra == "test"
46
+ Dynamic: license-file
47
+
48
+ # protonate_utils
49
+
50
+ A single utility for adding hydrogens to **ligands** and **proteins** at a
51
+ target pH, for use in molecular modeling and structure-based drug design.
52
+
53
+ ## Why this exists
54
+
55
+ Most structures you download — a ligand from a database, a protein from the
56
+ PDB — are missing hydrogens, or carry hydrogens that don't reflect the
57
+ protonation state at physiological pH. Getting these right matters: a
58
+ carboxylic acid is deprotonated (`-COO⁻`) at pH 7.4, a basic amine is
59
+ protonated (`-NH₃⁺`), and a histidine side chain can go either way. Downstream
60
+ tasks — docking, free-energy calculations, MD simulations, electrostatics —
61
+ all depend on the correct charge and hydrogen placement.
62
+
63
+ Ligands and proteins need different tools for this. Small molecules are best
64
+ handled with cheminformatics pKa models; proteins need residue-aware logic and
65
+ geometry-based hydrogen placement. `protonate_utils.py` wraps the appropriate
66
+ specialist tool for each case behind one consistent interface, so you don't
67
+ have to remember two separate workflows:
68
+
69
+ - **Ligands** use [Dimorphite-DL](https://github.com/durrantlab/dimorphite_dl)
70
+ for pH-aware protonation states and [the RDKit](https://www.rdkit.org/) for
71
+ structure handling. When the input has 3D coordinates, the heavy-atom
72
+ geometry is preserved exactly — only the newly added hydrogens are given
73
+ computed positions.
74
+ - **Proteins** use [Hydride](https://hydride.biotite-python.org/) for
75
+ geometry-based hydrogen addition and
76
+ [Biotite](https://www.biotite-python.org/) for PDB handling, with formal
77
+ charges estimated per amino acid at the requested pH.
78
+
79
+ Everything is exposed both as a **command-line tool** and as an importable
80
+ **Python API**.
81
+
82
+ ## Installation
83
+
84
+ Clone the repo and install it with `pip`:
85
+
86
+ ```bash
87
+ git clone https://github.com/PatWalters/protonate_utils
88
+ cd protonate_utils
89
+ pip install -e .
90
+ ```
91
+
92
+ This installs the dependencies for both modes (RDKit + Dimorphite-DL for
93
+ ligands, Biotite + Hydride + NumPy for proteins), puts a `protonate-utils`
94
+ command on your `PATH`, and makes `import protonate_utils` available.
95
+
96
+ ## Command-line usage
97
+
98
+ Once installed, use the `protonate-utils` command. The first argument selects
99
+ the mode: `ligand` or `protein`. (You can also run it without installing via
100
+ `python protonate_utils.py …` from a checkout.)
101
+
102
+ ### Ligands
103
+
104
+ ```bash
105
+ # SDF in, SDF out (3D coordinates preserved, hydrogens placed from geometry)
106
+ protonate-utils ligand input.sdf output.sdf
107
+
108
+ # SMILES in, SMILES out, at a custom pH
109
+ protonate-utils ligand input.smi output.smi --ph 7.4
110
+
111
+ # Mixed: read SDF, write SMILES
112
+ protonate-utils ligand input.sdf output.smi
113
+ ```
114
+
115
+ Input and output formats are inferred from the file extension:
116
+ `.smi`/`.smiles` is treated as SMILES, anything else as SDF. SMILES files are
117
+ read one molecule per line as `SMILES [optional name]`.
118
+
119
+ | Option | Default | Description |
120
+ |----------|---------|--------------------------------------|
121
+ | `--ph` | `7.4` | Target pH for protonation. |
122
+
123
+ Molecules that fail to parse or protonate are skipped with a warning on
124
+ stderr; the run reports how many were read, written, and skipped.
125
+
126
+ ### Proteins
127
+
128
+ ```bash
129
+ # Remove a bound ligand by residue name, then add hydrogens
130
+ protonate-utils protein input.pdb AP5 output.pdb
131
+
132
+ # Keep everything (no ligand removal)
133
+ protonate-utils protein input.pdb none output.pdb --ph 7.0
134
+ ```
135
+
136
+ The second positional argument is the residue name (3-letter CCD code) of a
137
+ ligand to remove before protonation — pass `none` to keep all atoms. Output
138
+ hydrogens are reordered so each one immediately follows the heavy atom it is
139
+ bonded to.
140
+
141
+ | Option | Default | Description |
142
+ |--------------|---------|-----------------------------------------------------|
143
+ | `--ph` | `7.0` | pH used to estimate amino-acid formal charges. |
144
+ | `--no-relax` | off | Skip dihedral relaxation of the added hydrogens. |
145
+
146
+ ## Python API
147
+
148
+ Import the functions directly from `protonate_utils`. There are symmetric
149
+ in-memory and file-to-file entry points for both ligands and proteins.
150
+
151
+ | | Ligands | Proteins |
152
+ |------------------|------------------------------------------|-----------------------------------|
153
+ | In-memory core | `protonate_molecule(mol, ph)` | `protonate_structure(structure, …)` |
154
+ | Convenience | `protonate_smiles_string(smiles, ph)` | — |
155
+ | File → file | `protonate_ligands(in, out, ph)` | `prepare_structure(in, res, out, …)` |
156
+ | I/O helpers | `read_molecules(path)`, `make_writer(path)` | (Biotite `PDBFile`) |
157
+
158
+ ### Ligands
159
+
160
+ Protonate a single SMILES string and get a SMILES string back:
161
+
162
+ ```python
163
+ from protonate_utils import protonate_smiles_string
164
+
165
+ protonate_smiles_string("CC(=O)O") # 'CC(=O)[O-]'
166
+ protonate_smiles_string("OP(=O)(O)O", ph=7.4) # 'O=P([O-])([O-])O'
167
+ ```
168
+
169
+ `protonate_smiles_string` raises `ValueError` on an unparseable SMILES; other
170
+ failures (e.g. Dimorphite-DL cannot handle the molecule) propagate as
171
+ exceptions.
172
+
173
+ Protonate an RDKit `Mol` while preserving its 3D coordinates:
174
+
175
+ ```python
176
+ from rdkit import Chem
177
+ from protonate_utils import protonate_molecule, read_molecules
178
+
179
+ mol = next(read_molecules("ligand.sdf"))
180
+ protonated = protonate_molecule(mol, ph=7.4) # Mol with explicit Hs + coords
181
+ ```
182
+
183
+ Pass `add_coord_hs=False` to keep protonation implicit (no explicit hydrogen
184
+ atoms added) — appropriate when you intend to serialize to SMILES.
185
+
186
+ Batch-convert a whole file (the CLI ligand path):
187
+
188
+ ```python
189
+ from protonate_utils import protonate_ligands
190
+
191
+ protonate_ligands("input.sdf", "output.sdf", ph=7.4)
192
+ ```
193
+
194
+ ### Proteins
195
+
196
+ Protonate an in-memory Biotite `AtomArray` and get a hydrogenated one back:
197
+
198
+ ```python
199
+ import biotite.structure.io.pdb as pdb
200
+ from protonate_utils import protonate_structure
201
+
202
+ structure = pdb.PDBFile.read("input.pdb").get_structure(model=1)
203
+ hydrogenated = protonate_structure(
204
+ structure,
205
+ ligand_res_name="AP5", # or None / "none" to keep all atoms
206
+ ph=7.0,
207
+ relax=True,
208
+ )
209
+ ```
210
+
211
+ `protonate_structure` raises `ValueError` if `ligand_res_name` is given but no
212
+ atoms with that residue name exist. The returned `AtomArray` has hydrogens
213
+ added and reordered to follow their bonded heavy atoms.
214
+
215
+ Read a PDB, protonate, and write a PDB in one call (the CLI protein path):
216
+
217
+ ```python
218
+ from protonate_utils import prepare_structure
219
+
220
+ prepare_structure("input.pdb", "AP5", "output.pdb", ph=7.0, relax=True)
221
+ ```
222
+
223
+ ## How it works
224
+
225
+ ### Ligand protonation
226
+
227
+ 1. Pre-existing hydrogens are stripped; any 3D conformer on the heavy atoms is
228
+ kept.
229
+ 2. Dimorphite-DL enumerates candidate microstate(s) within a ±0.5 pH window.
230
+ One is chosen deterministically by a **site-by-site plausibility** check
231
+ rather than by net charge — see
232
+ [Correcting Dimorphite-DL microstates](#correcting-dimorphite-dl-microstates)
233
+ below — and any residual implausible ionization is repaired against the
234
+ input. The SMILES string is a final tiebreak, so re-runs are stable.
235
+ 3. The chosen template's formal charges **and** total hydrogen counts are
236
+ mapped back onto the original atoms via a charge-insensitive substructure
237
+ match (so `-COOH` still matches `-COO⁻`). Carrying the H count — not just
238
+ the charge — keeps the RDKit's kekulization correct on aromatic heterocycles.
239
+ 4. With 3D input, `Chem.AddHs(addCoords=True)` adds hydrogens positioned from
240
+ the existing geometry; heavy-atom coordinates are never moved. Without
241
+ coordinates (SMILES), protonation stays implicit.
242
+
243
+ ### Correcting Dimorphite-DL microstates
244
+
245
+ Dimorphite-DL enumerates *every* microstate whose modeled pKa falls anywhere
246
+ near the pH window, including many that are negligibly populated at pH 7.4. Left
247
+ to a "most ionized" or "closest net charge" rule, the selector picks chemically
248
+ wrong states: it deprotonates amides and phenols and protonates anilines. We add
249
+ a per-atom legitimacy check (`_charge_change_is_legitimate`) that compares each
250
+ candidate to the input atom-by-atom and accepts a formal-charge change only when
251
+ that group genuinely ionizes near physiological pH:
252
+
253
+ | Group | Typical pKa | At pH 7.4 | Dimorphite enumerates | We |
254
+ |-------|-------------|-----------|-----------------------|----|
255
+ | Aliphatic amine | pKaH ~10 | cation | both | **protonate** |
256
+ | Amidine / guanidine | pKaH ~12–13 | cation | both | **protonate** |
257
+ | Carboxylic acid | ~4 | anion | anion | **deprotonate** |
258
+ | Sulfonic / sulfinic / phosphate / phosphonate | <2–7 | anion | anion | **deprotonate** |
259
+ | Sulfonamide / acylsulfonamide / tetrazole | ~3–10 | anion | both | **deprotonate** |
260
+ | Carboxamide N–H | ~17–22 | neutral | both → `[N⁻]` *or* `[NH⁺]` | **keep neutral** |
261
+ | Aniline / amino-heteroarene | pKaH ~3–5 | neutral | both → `[NH⁺]` | **keep neutral** |
262
+ | Cyanamide (N–C≡N) | pKaH ~0 | neutral | both → `[NH⁺]` | **keep neutral** |
263
+ | Imidazole / pyrazole / indazole / indole / triazole N–H | ~10–17 | neutral | both → `[n⁻]` | **keep neutral** |
264
+ | Phenol / alcohol | ~10–16 | neutral | both → `[O⁻]` | **keep neutral** |
265
+ | Plain thiol / thione | ~7–10 | neutral | both → `[S⁻]` | **keep neutral** |
266
+
267
+ Two further safeguards:
268
+
269
+ - **Repair fallback.** When Dimorphite offers *only* an implausibly ionized
270
+ microstate (e.g. it returns just the `[N⁻]` form of an O-alkyl hydroxamate or
271
+ imide, with no neutral alternative to select), the offending site is reverted
272
+ to the input's protonation rather than emitted as-is.
273
+ - **Input charges preserved.** A change is only judged relative to the input, so
274
+ charges already present in the SMILES — quaternary ammonium salts, *N*-oxides,
275
+ mesoionic zwitterions — are never altered.
276
+
277
+ Borderline acids/bases whose pKa sits close to 7.4 (e.g. *p*-nitrophenol ~7.15,
278
+ mercaptoazoles ~7) are deliberately defaulted to neutral; they are ~50/50 at
279
+ physiological pH, so this is at least as defensible as ionizing them and avoids
280
+ mis-ionizing the far more common ordinary phenols and amides. The approach was validated across
281
+ the 2,173-molecule Biogen logS set: no skips, no heavy-atom changes, and the
282
+ selection is deterministic.
283
+
284
+ ### Protein protonation
285
+
286
+ 1. Optionally remove a ligand by residue name, then strip any existing
287
+ hydrogens.
288
+ 2. Assign covalent bonds from CCD residue templates
289
+ (`connect_via_residue_names`).
290
+ 3. Estimate per-residue formal charges for canonical amino acids at the
291
+ requested pH (`hydride.estimate_amino_acid_charges`).
292
+ 4. Add hydrogens with Hydride and, by default, relax their geometry.
293
+ 5. Reorder atoms so each hydrogen immediately follows the heavy atom it is
294
+ bonded to.
@@ -0,0 +1,247 @@
1
+ # protonate_utils
2
+
3
+ A single utility for adding hydrogens to **ligands** and **proteins** at a
4
+ target pH, for use in molecular modeling and structure-based drug design.
5
+
6
+ ## Why this exists
7
+
8
+ Most structures you download — a ligand from a database, a protein from the
9
+ PDB — are missing hydrogens, or carry hydrogens that don't reflect the
10
+ protonation state at physiological pH. Getting these right matters: a
11
+ carboxylic acid is deprotonated (`-COO⁻`) at pH 7.4, a basic amine is
12
+ protonated (`-NH₃⁺`), and a histidine side chain can go either way. Downstream
13
+ tasks — docking, free-energy calculations, MD simulations, electrostatics —
14
+ all depend on the correct charge and hydrogen placement.
15
+
16
+ Ligands and proteins need different tools for this. Small molecules are best
17
+ handled with cheminformatics pKa models; proteins need residue-aware logic and
18
+ geometry-based hydrogen placement. `protonate_utils.py` wraps the appropriate
19
+ specialist tool for each case behind one consistent interface, so you don't
20
+ have to remember two separate workflows:
21
+
22
+ - **Ligands** use [Dimorphite-DL](https://github.com/durrantlab/dimorphite_dl)
23
+ for pH-aware protonation states and [the RDKit](https://www.rdkit.org/) for
24
+ structure handling. When the input has 3D coordinates, the heavy-atom
25
+ geometry is preserved exactly — only the newly added hydrogens are given
26
+ computed positions.
27
+ - **Proteins** use [Hydride](https://hydride.biotite-python.org/) for
28
+ geometry-based hydrogen addition and
29
+ [Biotite](https://www.biotite-python.org/) for PDB handling, with formal
30
+ charges estimated per amino acid at the requested pH.
31
+
32
+ Everything is exposed both as a **command-line tool** and as an importable
33
+ **Python API**.
34
+
35
+ ## Installation
36
+
37
+ Clone the repo and install it with `pip`:
38
+
39
+ ```bash
40
+ git clone https://github.com/PatWalters/protonate_utils
41
+ cd protonate_utils
42
+ pip install -e .
43
+ ```
44
+
45
+ This installs the dependencies for both modes (RDKit + Dimorphite-DL for
46
+ ligands, Biotite + Hydride + NumPy for proteins), puts a `protonate-utils`
47
+ command on your `PATH`, and makes `import protonate_utils` available.
48
+
49
+ ## Command-line usage
50
+
51
+ Once installed, use the `protonate-utils` command. The first argument selects
52
+ the mode: `ligand` or `protein`. (You can also run it without installing via
53
+ `python protonate_utils.py …` from a checkout.)
54
+
55
+ ### Ligands
56
+
57
+ ```bash
58
+ # SDF in, SDF out (3D coordinates preserved, hydrogens placed from geometry)
59
+ protonate-utils ligand input.sdf output.sdf
60
+
61
+ # SMILES in, SMILES out, at a custom pH
62
+ protonate-utils ligand input.smi output.smi --ph 7.4
63
+
64
+ # Mixed: read SDF, write SMILES
65
+ protonate-utils ligand input.sdf output.smi
66
+ ```
67
+
68
+ Input and output formats are inferred from the file extension:
69
+ `.smi`/`.smiles` is treated as SMILES, anything else as SDF. SMILES files are
70
+ read one molecule per line as `SMILES [optional name]`.
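The extension rule and the `SMILES [optional name]` line layout described above can be sketched in a few lines of plain Python. This is only an illustration, not the package's code: `infer_format` and `parse_smiles_line` are hypothetical names, and treating the extension case-insensitively is an assumption the README does not state.

```python
from pathlib import Path

# Extensions the README says are read as SMILES; anything else is SDF.
SMILES_EXTENSIONS = {".smi", ".smiles"}


def infer_format(path):
    """Return 'smiles' or 'sdf' based only on the file extension.

    Lower-casing the suffix is an assumption made for this sketch.
    """
    suffix = Path(path).suffix.lower()
    return "smiles" if suffix in SMILES_EXTENSIONS else "sdf"


def parse_smiles_line(line):
    """Split one 'SMILES [optional name]' line into (smiles, name).

    Returns None for a blank line; the name is '' when absent.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else ""
    return smiles, name
```

Splitting on the first run of whitespace only means that a name containing spaces (for example `acetic acid`) stays intact.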
71
+
72
+ | Option | Default | Description |
73
+ |----------|---------|--------------------------------------|
74
+ | `--ph` | `7.4` | Target pH for protonation. |
75
+
76
+ Molecules that fail to parse or protonate are skipped with a warning on
77
+ stderr; the run reports how many were read, written, and skipped.
78
+
79
+ ### Proteins
80
+
81
+ ```bash
82
+ # Remove a bound ligand by residue name, then add hydrogens
83
+ protonate-utils protein input.pdb AP5 output.pdb
84
+
85
+ # Keep everything (no ligand removal)
86
+ protonate-utils protein input.pdb none output.pdb --ph 7.0
87
+ ```
88
+
89
+ The second positional argument is the residue name (3-letter CCD code) of a
90
+ ligand to remove before protonation — pass `none` to keep all atoms. Output
91
+ hydrogens are reordered so each one immediately follows the heavy atom it is
92
+ bonded to.
93
+
94
+ | Option | Default | Description |
95
+ |--------------|---------|-----------------------------------------------------|
96
+ | `--ph` | `7.0` | pH used to estimate amino-acid formal charges. |
97
+ | `--no-relax` | off | Skip dihedral relaxation of the added hydrogens. |
98
+
99
+ ## Python API
100
+
101
+ Import the functions directly from `protonate_utils`. There are symmetric
102
+ in-memory and file-to-file entry points for both ligands and proteins.
103
+
104
+ | | Ligands | Proteins |
105
+ |------------------|------------------------------------------|-----------------------------------|
106
+ | In-memory core | `protonate_molecule(mol, ph)` | `protonate_structure(structure, …)` |
107
+ | Convenience | `protonate_smiles_string(smiles, ph)` | — |
108
+ | File → file | `protonate_ligands(in, out, ph)` | `prepare_structure(in, res, out, …)` |
109
+ | I/O helpers | `read_molecules(path)`, `make_writer(path)` | (Biotite `PDBFile`) |
110
+
111
+ ### Ligands
112
+
113
+ Protonate a single SMILES string and get a SMILES string back:
114
+
115
+ ```python
116
+ from protonate_utils import protonate_smiles_string
117
+
118
+ protonate_smiles_string("CC(=O)O") # 'CC(=O)[O-]'
119
+ protonate_smiles_string("OP(=O)(O)O", ph=7.4) # 'O=P([O-])([O-])O'
120
+ ```
121
+
122
+ `protonate_smiles_string` raises `ValueError` on an unparseable SMILES; other
123
+ failures (e.g. Dimorphite-DL cannot handle the molecule) propagate as
124
+ exceptions.
125
+
126
+ Protonate an RDKit `Mol` while preserving its 3D coordinates:
127
+
128
+ ```python
129
+ from rdkit import Chem
130
+ from protonate_utils import protonate_molecule, read_molecules
131
+
132
+ mol = next(read_molecules("ligand.sdf"))
133
+ protonated = protonate_molecule(mol, ph=7.4) # Mol with explicit Hs + coords
134
+ ```
135
+
136
+ Pass `add_coord_hs=False` to keep protonation implicit (no explicit hydrogen
137
+ atoms added) — appropriate when you intend to serialize to SMILES.
138
+
139
+ Batch-convert a whole file (the CLI ligand path):
140
+
141
+ ```python
142
+ from protonate_utils import protonate_ligands
143
+
144
+ protonate_ligands("input.sdf", "output.sdf", ph=7.4)
145
+ ```
146
+
147
+ ### Proteins
148
+
149
+ Protonate an in-memory Biotite `AtomArray` and get a hydrogenated one back:
150
+
151
+ ```python
152
+ import biotite.structure.io.pdb as pdb
153
+ from protonate_utils import protonate_structure
154
+
155
+ structure = pdb.PDBFile.read("input.pdb").get_structure(model=1)
156
+ hydrogenated = protonate_structure(
157
+ structure,
158
+ ligand_res_name="AP5", # or None / "none" to keep all atoms
159
+ ph=7.0,
160
+ relax=True,
161
+ )
162
+ ```
163
+
164
+ `protonate_structure` raises `ValueError` if `ligand_res_name` is given but no
165
+ atoms with that residue name exist. The returned `AtomArray` has hydrogens
166
+ added and reordered to follow their bonded heavy atoms.
167
+
168
+ Read a PDB, protonate, and write a PDB in one call (the CLI protein path):
169
+
170
+ ```python
171
+ from protonate_utils import prepare_structure
172
+
173
+ prepare_structure("input.pdb", "AP5", "output.pdb", ph=7.0, relax=True)
174
+ ```
175
+
176
+ ## How it works
177
+
178
+ ### Ligand protonation
179
+
180
+ 1. Pre-existing hydrogens are stripped; any 3D conformer on the heavy atoms is
181
+ kept.
182
+ 2. Dimorphite-DL enumerates candidate microstate(s) within a ±0.5 pH window.
183
+ One is chosen deterministically by a **site-by-site plausibility** check
184
+ rather than by net charge — see
185
+ [Correcting Dimorphite-DL microstates](#correcting-dimorphite-dl-microstates)
186
+ below — and any residual implausible ionization is repaired against the
187
+ input. The SMILES string is a final tiebreak, so re-runs are stable.
188
+ 3. The chosen template's formal charges **and** total hydrogen counts are
189
+ mapped back onto the original atoms via a charge-insensitive substructure
190
+ match (so `-COOH` still matches `-COO⁻`). Carrying the H count — not just
191
+ the charge — keeps the RDKit's kekulization correct on aromatic heterocycles.
192
+ 4. With 3D input, `Chem.AddHs(addCoords=True)` adds hydrogens positioned from
193
+ the existing geometry; heavy-atom coordinates are never moved. Without
194
+ coordinates (SMILES), protonation stays implicit.
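Step 2 says the choice among candidate microstates is deterministic, with the SMILES string as the final tiebreak. A minimal sketch of that idea follows. It is an assumption-laden illustration rather than the package's selector: `pick_microstate` is a hypothetical name, and the per-candidate count of implausible sites stands in for whatever the real plausibility check computes.

```python
def pick_microstate(candidates):
    """Pick one candidate deterministically.

    candidates: list of (smiles, implausible_sites) tuples, where
    implausible_sites is a hypothetical count of charge changes that
    failed a plausibility check. Fewer is better; ties fall back to the
    SMILES string, so the result does not depend on input order.
    """
    if not candidates:
        raise ValueError("no candidate microstates")
    # Tuples compare element by element: score first, then the string.
    return min(candidates, key=lambda c: (c[1], c[0]))[0]


# A deprotonated amide is penalized, so the neutral form wins.
print(pick_microstate([("CC(=O)[NH-]", 1), ("CC(=O)N", 0)]))
```

Because the sort key ends in the string itself, shuffling the candidate list cannot change which one is returned.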
195
+
196
+ ### Correcting Dimorphite-DL microstates
197
+
198
+ Dimorphite-DL enumerates *every* microstate whose modeled pKa falls anywhere
199
+ near the pH window, including many that are negligibly populated at pH 7.4. Left
200
+ to a "most ionized" or "closest net charge" rule, the selector picks chemically
201
+ wrong states: it deprotonates amides and phenols and protonates anilines. We add
202
+ a per-atom legitimacy check (`_charge_change_is_legitimate`) that compares each
203
+ candidate to the input atom-by-atom and accepts a formal-charge change only when
204
+ that group genuinely ionizes near physiological pH:
205
+
206
+ | Group | Typical pKa | At pH 7.4 | Dimorphite enumerates | We |
207
+ |-------|-------------|-----------|-----------------------|----|
208
+ | Aliphatic amine | pKaH ~10 | cation | both | **protonate** |
209
+ | Amidine / guanidine | pKaH ~12–13 | cation | both | **protonate** |
210
+ | Carboxylic acid | ~4 | anion | anion | **deprotonate** |
211
+ | Sulfonic / sulfinic / phosphate / phosphonate | <2–7 | anion | anion | **deprotonate** |
212
+ | Sulfonamide / acylsulfonamide / tetrazole | ~3–10 | anion | both | **deprotonate** |
213
+ | Carboxamide N–H | ~17–22 | neutral | both → `[N⁻]` *or* `[NH⁺]` | **keep neutral** |
214
+ | Aniline / amino-heteroarene | pKaH ~3–5 | neutral | both → `[NH⁺]` | **keep neutral** |
215
+ | Cyanamide (N–C≡N) | pKaH ~0 | neutral | both → `[NH⁺]` | **keep neutral** |
216
+ | Imidazole / pyrazole / indazole / indole / triazole N–H | ~10–17 | neutral | both → `[n⁻]` | **keep neutral** |
217
+ | Phenol / alcohol | ~10–16 | neutral | both → `[O⁻]` | **keep neutral** |
218
+ | Plain thiol / thione | ~7–10 | neutral | both → `[S⁻]` | **keep neutral** |
219
+
220
+ Two further safeguards:
221
+
222
+ - **Repair fallback.** When Dimorphite offers *only* an implausibly ionized
223
+ microstate (e.g. it returns just the `[N⁻]` form of an O-alkyl hydroxamate or
224
+ imide, with no neutral alternative to select), the offending site is reverted
225
+ to the input's protonation rather than emitted as-is.
226
+ - **Input charges preserved.** A change is only judged relative to the input, so
227
+ charges already present in the SMILES — quaternary ammonium salts, *N*-oxides,
228
+ mesoionic zwitterions — are never altered.
229
+
230
+ Borderline acids/bases whose pKa sits close to 7.4 (e.g. *p*-nitrophenol ~7.15,
231
+ mercaptoazoles ~7) are deliberately defaulted to neutral; they are ~50/50 at
232
+ physiological pH, so this is at least as defensible as ionizing them and avoids
233
+ mis-ionizing the far more common ordinary phenols and amides. The approach was validated across
234
+ the 2,173-molecule Biogen logS set: no skips, no heavy-atom changes, and the
235
+ selection is deterministic.
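The "At pH 7.4" column in the table and the borderline cases above follow from the Henderson-Hasselbalch relation. The sketch below is not part of the package; the function names are hypothetical and the pKa values are the approximate ones quoted in the table. With a pKa of 7.15, *p*-nitrophenol comes out at about 64% ionized at pH 7.4, while a group whose pKa is exactly 7.4 is split 50/50.

```python
def fraction_deprotonated(pka, ph=7.4):
    """Fraction of an acid HA present as A- (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def fraction_protonated(pkah, ph=7.4):
    """Fraction of a base B present as BH+, given its conjugate-acid pKa."""
    return 1.0 / (1.0 + 10.0 ** (ph - pkah))


examples = [
    ("carboxylic acid, pKa ~4 (anion)", fraction_deprotonated(4.0)),
    ("phenol, pKa ~10 (anion)", fraction_deprotonated(10.0)),
    ("p-nitrophenol, pKa ~7.15 (anion)", fraction_deprotonated(7.15)),
    ("aliphatic amine, pKaH ~10 (cation)", fraction_protonated(10.0)),
    ("aniline, pKaH ~4 (cation)", fraction_protonated(4.0)),
]
for label, fraction in examples:
    print(f"{label}: {fraction:.3f}")
```

Groups more than about two pH units away from their pKa are over 99% in one form, which is why the table can assign most of them a single state.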
236
+
237
+ ### Protein protonation
238
+
239
+ 1. Optionally remove a ligand by residue name, then strip any existing
240
+ hydrogens.
241
+ 2. Assign covalent bonds from CCD residue templates
242
+ (`connect_via_residue_names`).
243
+ 3. Estimate per-residue formal charges for canonical amino acids at the
244
+ requested pH (`hydride.estimate_amino_acid_charges`).
245
+ 4. Add hydrogens with Hydride and, by default, relax their geometry.
246
+ 5. Reorder atoms so each hydrogen immediately follows the heavy atom it is
247
+ bonded to.
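Step 5 can be illustrated without Biotite by reordering plain lists. This is a sketch under stated assumptions, not the package's implementation: `reorder_hydrogens` is a hypothetical name, atoms are represented only by element symbols, and bonds by index pairs. Placing unbonded hydrogens at the end is a choice made for this example.

```python
def reorder_hydrogens(elements, bonds):
    """Return an index order in which each H follows its bonded heavy atom.

    elements: list of element symbols, one per atom.
    bonds: iterable of (i, j) atom-index pairs.
    """
    # Collect, for every heavy atom, the hydrogens bonded to it.
    hydrogens_of = {i: [] for i, el in enumerate(elements) if el != "H"}
    placed = set()
    for i, j in bonds:
        for heavy, hyd in ((i, j), (j, i)):
            if elements[heavy] != "H" and elements[hyd] == "H" and hyd not in placed:
                hydrogens_of[heavy].append(hyd)
                placed.add(hyd)

    order = []
    for i, el in enumerate(elements):
        if el == "H":
            continue  # emitted right after its heavy atom instead
        order.append(i)
        order.extend(sorted(hydrogens_of[i]))
    # Hydrogens without a heavy-atom bond keep their relative order at the end.
    order.extend(i for i, el in enumerate(elements) if el == "H" and i not in placed)
    return order


# N(0)-C(1)-O(2) with hydrogens 3 and 4 on N and hydrogen 5 on C.
elements = ["N", "C", "O", "H", "H", "H"]
bonds = [(0, 3), (0, 4), (1, 5), (0, 1), (1, 2)]
print(reorder_hydrogens(elements, bonds))
```

The returned list is an index order; with an array-like structure it would be applied by indexing, so heavy atoms keep their original relative order.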