hofvarpnir-hcon 3.20.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- hofvarpnir_hcon-3.20.0/LICENSE +28 -0
- hofvarpnir_hcon-3.20.0/MANIFEST.in +6 -0
- hofvarpnir_hcon-3.20.0/NOTICE +11 -0
- hofvarpnir_hcon-3.20.0/PKG-INFO +275 -0
- hofvarpnir_hcon-3.20.0/README.md +236 -0
- hofvarpnir_hcon-3.20.0/THIRD-PARTY-LICENSES.txt +39 -0
- hofvarpnir_hcon-3.20.0/hofvarpnir_hcon.egg-info/PKG-INFO +275 -0
- hofvarpnir_hcon-3.20.0/hofvarpnir_hcon.egg-info/SOURCES.txt +16 -0
- hofvarpnir_hcon-3.20.0/hofvarpnir_hcon.egg-info/dependency_links.txt +1 -0
- hofvarpnir_hcon-3.20.0/hofvarpnir_hcon.egg-info/requires.txt +5 -0
- hofvarpnir_hcon-3.20.0/hofvarpnir_hcon.egg-info/top_level.txt +1 -0
- hofvarpnir_hcon-3.20.0/hofvarpnirhcon/__init__.py +82 -0
- hofvarpnir_hcon-3.20.0/hofvarpnirhcon/docs/METHODS.md +97 -0
- hofvarpnir_hcon-3.20.0/hofvarpnirhcon/predictor.py +877 -0
- hofvarpnir_hcon-3.20.0/hofvarpnirhcon/train_density.py +429 -0
- hofvarpnir_hcon-3.20.0/hofvarpnirhcon/version.py +1 -0
- hofvarpnir_hcon-3.20.0/setup.cfg +4 -0
- hofvarpnir_hcon-3.20.0/setup.py +54 -0
@@ -0,0 +1,28 @@
BSD 3-Clause License

Copyright 2026 Leonard F Haasbroek

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software without
specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.

IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,11 @@
THIRD-PARTY NOTICE

This software incorporates third-party open-source components.

A list of third-party dependencies and their associated licenses is provided
in THIRD-PARTY-LICENSES.txt.

These components are used under their respective permissive or weak-copyleft
licenses (including BSD, MIT, and MPL-2.0).

No modification has been made to third-party license terms.
@@ -0,0 +1,275 @@
Metadata-Version: 2.4
Name: hofvarpnir-hcon
Version: 3.20.0
Summary: HófvarpnirHCON - Fast dictionary-based crystal density prediction from SMILES
Author: Leonard Haasbroek
Author-email: leonardfhaasbroek@gmail.com
License: BSD-3-Clause
Project-URL: Source, https://github.com/LeonardFH/hofvarpnir-hcon
Project-URL: Bug Reports, https://github.com/LeonardFH/hofvarpnir-hcon/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Chemistry
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: BSD License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rdkit>=2023.03.1
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: scipy>=1.8.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# HófvarpnirHCON

GitHub: [github.com/LeonardFH/hofvarpnir-hcon](https://github.com/LeonardFH/hofvarpnir-hcon)
PyPI: [pypi.org/project/hofvarpnir-hcon](https://pypi.org/project/hofvarpnir-hcon/)

A modular Python framework for molecular property prediction from SMILES strings.

HófvarpnirHCON (pronounced "HOFF-varp-neer-HCON") is designed as a fast and extensible framework for predicting molecular properties of organic compounds containing C, H, O, and N.

The name comes from Hófvarpnir, the flying horse of the Norse goddess Gná, and reflects the software's intended speed and range across molecular property spaces.

## Current Status

At present, the package implements:

- Crystal density prediction for organic molecules

The framework is designed with extensibility in mind, allowing additional molecular property predictors to be added in future versions.

## A Friendly Note

Hi there,

I built HófvarpnirHCON because crystal density prediction should be fast, transparent, and accessible. I'm glad you found it.

If you need to get in touch: leonardfhaasbroek@gmail.com

## License

This project is distributed under the BSD 3-Clause License.

## Data Sources

Training data can be obtained from the following papers:

- Davis, J. V.; Marrs, F. W.; Cawkwell, M. J.; Manner, V. W. Machine Learning Models for High Explosive Crystal Density and Performance. Chem. Mater. 2024, 36, 11109–11118. DOI: 10.1021/acs.chemmater.4c01978

- Mathieu, D. Sensitivity of Energetic Materials: Theoretical Relationships to Detonation Performance and Molecular Structure. Ind. Eng. Chem. Res. 2017, 56, 8191–8201. DOI: 10.1021/acs.iecr.7b02021

These datasets are available as Supporting Information with their respective papers.
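
The Quick Start script in this README reads a file named `trainingdata.csv` with the columns `SMILES` and `Density`. Supporting Information files rarely arrive in exactly that shape, so it can help to check the file before training. The sketch below is one possible check using only the standard library. The helper name `check_training_csv` and the rule that densities must be positive numbers are assumptions made for illustration and are not part of the package.

```python
import csv

# Column names taken from the Quick Start comments in this README
REQUIRED_COLUMNS = ("SMILES", "Density")


def check_training_csv(path):
    """Return a list of problems found in a training CSV (hypothetical helper)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]
        for row in reader:
            line_no = reader.line_num  # line of the file this row ended on
            if not (row["SMILES"] or "").strip():
                problems.append(f"line {line_no}: empty SMILES")
            try:
                density = float(row["Density"])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: Density is not a number")
                continue
            # "not density > 0" also rejects NaN, which compares false to everything
            if not density > 0:
                problems.append(f"line {line_no}: Density must be positive")
    return problems
```

Calling `check_training_csv("trainingdata.csv")` returns an empty list when the file looks usable. Whether each SMILES string actually parses is a separate question that this check does not address.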

## Community Benchmarks

If you use HófvarpnirHCON on your own dataset, I invite you to share your results.

Email: **leonardfhaasbroek@gmail.com**

Please include:
- MAE, RMSE, R²
- Number of molecules
- Number of cocrystals
- Dataset description and source (if public)

Results will be posted here (with your permission).

## Documentation

For a detailed explanation of the method, see [hofvarpnirhcon/docs/METHODS.md](hofvarpnirhcon/docs/METHODS.md).

## Installation

pip install hofvarpnir-hcon

## Quick Start: Train and Predict in Thonny

```python
# Copy and paste this entire script into Thonny and run it.

from hofvarpnirhcon import train_density, predict_density, predict_density_batch
import pandas as pd
import numpy as np

# ============================================================
# STEP 1: Download a dataset from one of the papers above
# Save it as "trainingdata.csv" with columns: SMILES, Density
# ============================================================

# ============================================================
# STEP 2: Train your own weights
# ============================================================

print("Training model...")
weights = train_density(
    data_path="trainingdata.csv",
    output_path="my_weights.pkl",
    filter_cocrystals=True,  # Train on pure crystals only (recommended)
    filter_hcon=True,  # Train on H,C,O,N atoms only (recommended)
    verbose=True
)
print("Training complete! Weights saved to my_weights.pkl")

# ============================================================
# STEP 3: Load the dataset for predictions
# ============================================================

df = pd.read_csv("trainingdata.csv")
smiles_list = df["SMILES"].tolist()
actuals = df["Density"].values

# ============================================================
# STEP 4: Single molecule prediction
# ============================================================

print("\n" + "=" * 60)
print("SINGLE MOLECULE PREDICTION")
print("=" * 60)

test_smiles = smiles_list[0]
test_actual = actuals[0]
pred = predict_density(test_smiles, weights_path="my_weights.pkl")
print(f"SMILES: {test_smiles}")
print(f"Actual density: {test_actual:.4f} g/cm³")
print(f"Predicted density: {pred:.4f} g/cm³")
print(f"Error: {abs(pred - test_actual):.4f} g/cm³")

# ============================================================
# STEP 5: Batch prediction on entire dataset
# ============================================================

print("\n" + "=" * 60)
print("BATCH PREDICTION")
print("=" * 60)

print(f"Predicting {len(smiles_list)} molecules...")
predictions = predict_density_batch(
    smiles_list=smiles_list,
    weights_path="my_weights.pkl",
    verbose=True
)

# ============================================================
# STEP 6: Calculate MAE and show results (FILTER NONE VALUES)
# ============================================================

# Filter out None values (failed predictions) while keeping the SMILES,
# actual densities, and predictions aligned row by row
valid_mask = np.array([p is not None for p in predictions], dtype=bool)
valid_smiles = [s for s, ok in zip(smiles_list, valid_mask) if ok]
valid_actuals = np.asarray(actuals, dtype=float)[valid_mask]
valid_predictions = np.array([p for p in predictions if p is not None], dtype=float)

print(f"\n✅ Valid predictions: {len(valid_predictions):,} / {len(smiles_list):,}")

if len(valid_predictions) == 0:
    print("❌ No valid predictions. Check your SMILES strings.")
    raise SystemExit(1)

# Metrics use NumPy only; scikit-learn is not among the package's declared requirements
errors = valid_predictions - valid_actuals
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
# R² is reported here as the squared Pearson correlation
r2 = np.corrcoef(valid_predictions, valid_actuals)[0, 1] ** 2

print("\nModel Performance:")
print(f" MAE: {mae:.4f} g/cm³")
print(f" RMSE: {rmse:.4f} g/cm³")
print(f" R²: {r2:.4f}")

print("\nFirst 10 predictions:")
print("-" * 70)
print(f"{'SMILES':<35} {'Actual':>10} {'Predicted':>10} {'Error':>10}")
print("-" * 70)

for i in range(min(10, len(valid_predictions))):
    smiles = valid_smiles[i][:35]
    actual = valid_actuals[i]
    pred = valid_predictions[i]
    error = abs(pred - actual)
    print(f"{smiles:<35} {actual:>10.4f} {pred:>10.4f} {error:>10.4f}")

print("-" * 70)
print(f"MAE: {mae:.4f} g/cm³")
print("\n✅ All done! Weights saved to my_weights.pkl")

# ============================================================
# STEP 7: Save results to CSV
# ============================================================

results_df = pd.DataFrame({
    'SMILES': valid_smiles,
    'Actual_Density': valid_actuals,
    'Predicted_Density': valid_predictions,
    'Error': errors,
    'Abs_Error': np.abs(errors),
})

results_df.to_csv('prediction_results.csv', index=False)
print("\n💾 Results saved to: prediction_results.csv")

# ============================================================
# USAGE EXAMPLES
# ============================================================

# Single molecule prediction:
from hofvarpnirhcon import predict_density

density = predict_density("CCO", weights_path="my_weights.pkl")
print(f"{density:.3f} g/cm³")

# Batch prediction:
from hofvarpnirhcon import predict_density_batch

smiles_list = ["CCO", "CC", "c1ccccc1", "O"]
results = predict_density_batch(smiles_list, weights_path="my_weights.pkl")

for smiles, density in zip(smiles_list, results):
    if density is None:
        # Failed predictions come back as None (see STEP 6)
        print(f"{smiles}: prediction failed")
    else:
        print(f"{smiles}: {density:.3f} g/cm³")
```
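
The Quick Start script reports R² as the squared Pearson correlation (`np.corrcoef(...)[0, 1] ** 2`). That value can differ from the coefficient of determination, 1 − SS_res/SS_tot, which is what scikit-learn's `r2_score` reports, for example. A constant offset between predictions and measurements leaves the squared correlation at 1 but lowers the coefficient of determination. The sketch below computes both side by side in plain Python. The function name `regression_metrics` and the toy numbers are illustrative assumptions, not package API and not density data.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE, squared Pearson r, and the coefficient of determination."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long sequences with at least two values")

    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)      # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    ss_pred = sum((p - mean_p) ** 2 for p in predicted)
    if ss_tot == 0 or ss_pred == 0:
        raise ValueError("R² is undefined when either sequence is constant")
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))

    return {
        "mae": mae,
        "rmse": rmse,
        "pearson_r2": covariance ** 2 / (ss_tot * ss_pred),  # squared correlation
        "r2_determination": 1.0 - ss_res / ss_tot,           # penalises systematic offsets
    }


# Toy example: every prediction is 0.5 too high
print(regression_metrics([1.0, 2.0, 3.0], [1.5, 2.5, 3.5]))
# {'mae': 0.5, 'rmse': 0.5, 'pearson_r2': 1.0, 'r2_determination': 0.625}
```

When sharing benchmark results, stating which of the two definitions was used makes the numbers easier to compare.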

## Performance

- MAE: ~0.0300 g/cm³ on CHON molecules
- Speed: ~1,800 molecules/second (1 core/thread)
- Speed: ~2,700 molecules/second (2 cores/threads)
- Speed: ~3,500 molecules/second (4 cores/threads, the maximum achieved)
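
The throughput figures above imply sub-linear scaling. As a rough worked example, the snippet below turns them into speedup and parallel efficiency relative to the single-worker rate. The figures are the approximate values from the list; the function name `scaling_report` and the term "worker" for a core or thread are assumptions made for this sketch.

```python
def scaling_report(throughput):
    """Map worker count -> (speedup, efficiency) relative to the 1-worker rate."""
    baseline = throughput[1]
    report = {}
    for workers, rate in throughput.items():
        speedup = rate / baseline        # how many times faster than one worker
        efficiency = speedup / workers   # fraction of ideal linear scaling
        report[workers] = (speedup, efficiency)
    return report


# Approximate molecules/second quoted in the Performance list
quoted = {1: 1800, 2: 2700, 4: 3500}

for workers, (speedup, efficiency) in scaling_report(quoted).items():
    print(f"{workers} worker(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# 1 worker(s): speedup 1.00x, efficiency 100%
# 2 worker(s): speedup 1.50x, efficiency 75%
# 4 worker(s): speedup 1.94x, efficiency 49%
```

On these numbers, going from two to four workers adds about 30% more throughput.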

## Tips for Best Performance

For optimal accuracy, we recommend training separate dictionaries for each chemical family:

- **HCON only** (C, H, N, O): best overall performance
- **HCON + F**: fluorine-containing molecules
- **HCON + Cl**: chlorine-containing molecules
- **HCON + S**: sulfur-containing molecules
- **HCON + P**: phosphorus-containing molecules

**Avoid mixing different heteroatom types** (e.g., S and Cl together) in a single training run, as this can degrade prediction accuracy.

For molecules containing rare halogens (Br, I), we recommend using the HCON-only dictionaries, as there is insufficient data to train reliable halogen-specific overlaps.
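
One way to prepare such per-family training sets is to sort the SMILES strings by the heteroatoms they contain before training. The sketch below does this with a small regular expression so that it runs without extra dependencies. The regex is a simplification that only looks at atom symbols; RDKit, which the package already requires, would be the more robust choice. The names `elements_in`, `family_of`, and `split_by_family` are hypothetical helpers, not package API.

```python
import re
from collections import defaultdict

BASE = {"H", "C", "O", "N"}
FAMILY_HETEROATOMS = {"F", "Cl", "S", "P"}  # the families listed above

_ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"  # bracket atom, e.g. [NH4+], [nH], [13C]
    r"|(Cl|Br|[BCNOPSFI])"                     # organic-subset atoms outside brackets
    r"|([bcnops])"                             # aromatic atoms outside brackets
)


def elements_in(smiles):
    """Return the set of element symbols written in a SMILES string."""
    found = set()
    for bracket, plain, aromatic in _ATOM.findall(smiles):
        # Aromatic atoms are lowercase in SMILES; normalise to element symbols
        found.add((bracket or plain or aromatic).capitalize())
    return found


def family_of(smiles):
    """Return 'HCON', 'HCON+F', 'HCON+Cl', 'HCON+S', 'HCON+P', or None."""
    extra = elements_in(smiles) - BASE
    if not extra:
        return "HCON"
    if len(extra) == 1 and extra <= FAMILY_HETEROATOMS:
        return "HCON+" + next(iter(extra))
    return None  # mixed heteroatoms or elements outside the listed families


def split_by_family(smiles_list):
    """Group SMILES by family, dropping molecules that fit no single family."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        family = family_of(smiles)
        if family is not None:
            groups[family].append(smiles)
    return dict(groups)


print(split_by_family(["CCO", "Clc1ccccc1", "CSC", "ClCCS"]))
# {'HCON': ['CCO'], 'HCON+Cl': ['Clc1ccccc1'], 'HCON+S': ['CSC']}
```

Each group could then be saved to its own CSV and trained into its own weights file. How the `filter_hcon` option of `train_density` should be set for the heteroatom families is not covered here; METHODS.md is the place to check.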

## Important Note on Polymorphs

The model predicts a single crystal density per SMILES string. For molecules with multiple known polymorphs (e.g., ROY, carbamazepine), the prediction corresponds to a **centroid** density within the experimental range. It does **not** predict individual polymorph forms.

## Citation

If you use this software in your research, please cite:

Haasbroek, L. F. (2026). HófvarpnirHCON: Fast dictionary-based crystal density prediction.
@@ -0,0 +1,39 @@
THIRD-PARTY LICENSES

This file lists third-party dependencies used by this software.

============================================================

Package: numpy 2.2.6
License: BSD License

============================================================

Package: pandas 2.3.3
License: BSD License

============================================================

Package: scipy 1.15.3
License: BSD License

============================================================

Package: scikit-learn 1.7.1
License: BSD-3-Clause

============================================================

Package: rdkit 2025.3.6
License: BSD-3-Clause

============================================================

Package: tqdm 4.68.1
License: MPL-2.0 AND MIT

============================================================

NOTE:
Full license texts are available in the upstream package distributions
and are included when these dependencies are installed via pip.