como-ocsr 1.0.0__tar.gz

@@ -0,0 +1,56 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Zhuoqi Lyu
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
22
+
23
+
24
+ ===========================================================================
25
+ Model Weights License — Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
26
+
27
+ The pre-trained model weights (.pth files) distributed alongside this software
28
+ are licensed under CC BY-NC 4.0:
29
+
30
+ https://creativecommons.org/licenses/by-nc/4.0/
31
+
32
+ You are free to:
33
+ - Share — copy and redistribute the material in any medium or format
34
+ - Adapt — remix, transform, and build upon the material
35
+
36
+ Under the following terms:
37
+ - Attribution — You must give appropriate credit
38
+ - NonCommercial — You may not use the material for commercial purposes
39
+
40
+
41
+ ===========================================================================
42
+ Benchmark Datasets
43
+
44
+ The benchmark datasets referenced in this project are collected from existing
45
+ public OCSR benchmarks and are NOT covered by the above licenses. Please refer
46
+ to their original sources for applicable license terms and attribution:
47
+
48
+ Dataset           Source
49
+ -------           ------
50
+ USPTO, CLEF,      Rajan et al., 2020 — https://github.com/Kohulan/DECIMER-Image_Transformer
51
+ JPO, UOB,
52
+ Staker
53
+ Indigo,           Qian et al., 2023 — https://github.com/thomas0809/MolScribe
54
+ ChemDraw, ACS
55
+ USPTO-10K         Morin et al., 2023 — https://github.com/DS4SD/molgrapher
56
+ WildMol-10K       Fang et al., 2025 — https://github.com/orgs/Chem-Struct-ML/repositories
@@ -0,0 +1,363 @@
1
+ Metadata-Version: 2.4
2
+ Name: como-ocsr
3
+ Version: 1.0.0
4
+ Summary: COMO: Closed-loop Optical Molecule recOgnition with Minimum Risk Training — Optical Chemical Structure Recognition
5
+ Author: Zhuoqi Lyu
6
+ License: MIT
7
+ Project-URL: Homepage, https://huggingface.co/Keylab/COMO
8
+ Project-URL: Repository, https://github.com/lyuzhuoqi/COMO
9
+ Project-URL: Bug Tracker, https://github.com/lyuzhuoqi/COMO/issues
10
+ Keywords: cheminformatics,optical-chemical-structure-recognition,ocsr,molecule-recognition,deep-learning,transformer,rdkit
11
+ Classifier: Development Status :: 5 - Production/Stable
12
+ Classifier: Intended Audience :: Science/Research
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Programming Language :: Python :: 3.13
20
+ Classifier: Topic :: Scientific/Engineering :: Chemistry
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Scientific/Engineering :: Image Recognition
23
+ Requires-Python: >=3.10
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: torch>=2.0
27
+ Requires-Dist: torchvision>=0.15
28
+ Requires-Dist: timm>=0.9
29
+ Requires-Dist: rdkit
30
+ Requires-Dist: SmilesPE>=0.0.3
31
+ Requires-Dist: albumentations>=1.3
32
+ Requires-Dist: opencv-python-headless>=4.5
33
+ Requires-Dist: Pillow>=9.0
34
+ Requires-Dist: numpy>=1.21
35
+ Requires-Dist: pandas>=1.5
36
+ Requires-Dist: tqdm>=4.60
37
+ Requires-Dist: func-timeout>=4.3
38
+ Provides-Extra: train
39
+ Dynamic: license-file
40
+
41
+ # COMO
42
+
43
+ **COMO** (**C**losed-loop **O**ptical **M**olecule rec**O**gnition) is a deep
44
+ learning framework that recognizes chemical structure diagrams from images and
45
+ predicts SMILES strings with atom-level coordinates and bond matrices. It uses
46
+ Minimum Risk Training (MRT) to directly optimize molecular-level,
47
+ non-differentiable objectives.
48
+
49
+ ## Installation
50
+
51
+ ```bash
52
+ pip install como-ocsr
53
+ ```
54
+
55
+ ## Quick Start
56
+
57
+ ```python
58
+ import como
59
+
60
+ # Load a model checkpoint (on GPU 0)
61
+ model = como.load_model("path/to/checkpoint.pth", device="cuda:0")
62
+
63
+ # Predict SMILES from a single image
64
+ smiles = como.predict(model, "molecule.png")
65
+ print(smiles) # "CC(=O)O"
66
+
67
+ # Batch prediction on a specific GPU
68
+ smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"], device="cuda:1")
69
+
70
+ # Evaluate on a benchmark (single GPU by default)
71
+ metrics = como.evaluate(
72
+     model,
73
+     benchmark_dir="benchmark/USPTO/",
74
+     csv_path="benchmark/USPTO.csv",
75
+ )
76
+ print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
77
+
78
+ # Multi-GPU, multi-benchmark evaluation
79
+ benchmarks = [
80
+     {"name": "USPTO", "benchmark_dir": "benchmark/USPTO/",
81
+      "csv_path": "benchmark/USPTO.csv"},
82
+     {"name": "CLEF", "benchmark_dir": "benchmark/CLEF/",
83
+      "csv_path": "benchmark/CLEF_corrected.csv"},
84
+ ]
85
+ results = como.evaluate_benchmarks(model, benchmarks, gpus="0,1,2,3")
86
+ for name, m in results.items():
87
+     print(f"{name}: {m['postprocess/exact_match_acc']:.2%}")
88
+ ```
89
+
90
+ ## API Reference
91
+
92
+ ### GPU Selection
93
+
94
+ The loading and prediction functions accept a ``device`` parameter for single-GPU usage:
95
+
96
+ ```python
97
+ model = como.load_model("checkpoint.pth", device="cuda:0")
98
+ como.predict(model, "img.png", device="cuda:1")
99
+ como.predict_batch(model, [...], device="cuda:2")
100
+ ```
101
+
102
+ For **evaluation** (which uses multi-GPU internally via ``mp.spawn``), use the
103
+ ``gpus`` parameter:
104
+
105
+ | Function | GPU control |
106
+ |----------|-------------|
107
+ | ``load_model`` | ``device="cuda:0"`` |
108
+ | ``predict`` | ``device="cuda:0"`` |
109
+ | ``predict_batch`` | ``device="cuda:0"`` |
110
+ | ``evaluate`` | ``gpus="0"`` (default), ``gpus="0,1,2"``, ``gpus=None`` (all) |
111
+ | ``evaluate_benchmarks`` | ``gpus="0"`` (default), ``gpus="0,1,2"``, ``gpus=None`` (all) |
112
+
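The `gpus` values in the table follow a simple convention: a comma-separated string of device indices, or `None` for every visible device. The sketch below shows one way such a value could be interpreted. It is not COMO's own parser; `parse_gpus` and its `available` argument are hypothetical names introduced only for illustration.

```python
from typing import List, Optional


def parse_gpus(gpus: Optional[str], available: int) -> List[int]:
    """Turn a ``gpus`` spec such as "0,1,2" into a list of device indices.

    Hypothetical helper: ``available`` stands in for the number of visible
    GPUs (for example ``torch.cuda.device_count()``); ``None`` selects all.
    """
    if gpus is None:
        return list(range(available))
    ids = [int(part) for part in gpus.split(",") if part.strip()]
    for gpu_id in ids:
        if not 0 <= gpu_id < available:
            raise ValueError(f"GPU {gpu_id} is not among {available} visible devices")
    return ids


print(parse_gpus("0,1,2", available=4))
print(parse_gpus(None, available=2))
```

Validating the indices up front gives a clear error before any worker processes are spawned, which is usually easier to debug than a failure inside a subprocess.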
113
+ ---
114
+
115
+ ### `como.load_model(checkpoint_path, device="cuda", pretrained=True, **kwargs)`
116
+
117
+ Load a COMO model from a `.pth` checkpoint. Returns a `ComoModel`
118
+ instance in evaluation mode.
119
+
120
+ | Parameter | Type | Default | Description |
121
+ |-----------|------|---------|-------------|
122
+ | `checkpoint_path` | `str` | *required* | Path to `.pth` checkpoint |
123
+ | `device` | `str` | `"cuda"` | ``"cuda"``, ``"cuda:0"``, or ``"cpu"`` |
124
+ | `pretrained` | `bool` | `True` | Use ImageNet-pretrained backbone weights |
125
+
126
+ **Returns:** ``ComoModel``
127
+
128
+ ---
129
+
130
+ ### `como.predict(model, image, *, beam_size=1, max_len=500, smiles_mode="postprocess", device=None)`
131
+
132
+ Predict the SMILES string for a single molecular image.
133
+
134
+ | Parameter | Type | Default | Description |
135
+ |-----------|------|---------|-------------|
136
+ | `model` | `ComoModel` | *required* | A loaded model |
137
+ | `image` | `str` / `np.ndarray` / `PIL.Image` / `torch.Tensor` | *required* | Input image (file path, array, PIL, or preprocessed tensor) |
138
+ | `beam_size` | `int` | `1` | Beam width (1 = greedy decoding; larger values, e.g. 3, enable beam search) |
139
+ | `max_len` | `int` | `500` | Maximum number of tokens to generate |
140
+ | `smiles_mode` | `str` or `None` | `"postprocess"` | ``"postprocess"`` (best quality), ``"graph"``, ``"decoder"``, or ``None`` (raw result dict) |
141
+ | `device` | `str` or `None` | `None` | Optional device override (e.g. ``"cuda:1"``) |
142
+
143
+ **Returns:**
144
+ - `str` — predicted SMILES string (if *smiles_mode* is not ``None``)
145
+ - `dict` — full result dict with keys ``tokens``, ``symbols``, ``coords``, ``bond_mat``, ``decode_smiles``, ``success`` (if ``smiles_mode=None``)
146
+
147
+ ---
148
+
149
+ ### `como.predict_batch(model, images, *, beam_size=1, max_len=500, smiles_mode="postprocess", device=None)`
150
+
151
+ Batch prediction on a single GPU.
152
+
153
+ | Parameter | Type | Default | Description |
154
+ |-----------|------|---------|-------------|
155
+ | `model` | `ComoModel` | *required* | A loaded model |
156
+ | `images` | `list` | *required* | List of file paths, NumPy arrays, PIL Images, or tensors |
157
+ | `beam_size` | `int` | `1` | Beam width (1 = greedy, recommended for batch) |
158
+ | `max_len` | `int` | `500` | Maximum tokens per image |
159
+ | `smiles_mode` | `str` or `None` | `"postprocess"` | SMILES reconstruction mode |
160
+ | `device` | `str` or `None` | `None` | Optional device override |
161
+
162
+ **Returns:**
163
+ - `list[str]` — predicted SMILES for each image (if *smiles_mode* is not ``None``)
164
+ - `list[dict]` — raw result dicts (if ``smiles_mode=None``)
165
+
166
+ ---
167
+
168
+ ### `como.evaluate(model, benchmark_dir, csv_path, *, beam_size=1, postproc_workers=32, tautomer_standardize=True, gpus="0")`
169
+
170
+ Evaluate on a single benchmark dataset. Returns a flat dict of metrics.
171
+
172
+ | Parameter | Type | Default | Description |
173
+ |-----------|------|---------|-------------|
174
+ | `model` | `ComoModel` | *required* | A loaded model |
175
+ | `benchmark_dir` | `str` | *required* | Directory containing `.png` images |
176
+ | `csv_path` | `str` | *required* | CSV with columns ``image_id``, ``SMILES`` |
177
+ | `beam_size` | `int` | `1` | Beam width for decoding |
178
+ | `postproc_workers` | `int` | `32` | Parallel workers for SMILES post-processing |
179
+ | `tautomer_standardize` | `bool` | `True` | Include tautomer-normalized exact match |
180
+ | `gpus` | `str` or `None` | `"0"` | GPU IDs (``"0,1"``) or ``None`` for all |
181
+
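For a custom benchmark, the ground-truth file needs the two documented columns, ``image_id`` and ``SMILES``. The following is a minimal sketch of writing such a file with the standard library. The helper name `write_benchmark_csv` and the sample IDs are placeholders, and whether ``image_id`` should include the `.png` suffix is an assumption to check against one of the published benchmark CSVs.

```python
import csv
import os
import tempfile


def write_benchmark_csv(csv_path, records):
    """Write (image_id, SMILES) pairs under the documented column names."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_id", "SMILES"])
        writer.writerows(records)


# Placeholder rows; the IDs are hypothetical and carry no file extension.
records = [("mol_0001", "CC(=O)O"), ("mol_0002", "c1ccccc1")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_benchmark.csv")
    write_benchmark_csv(path, records)
    with open(path, newline="", encoding="utf-8") as handle:
        print(handle.read())
```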
182
+ **Returns:** ``dict`` with the following keys:
183
+
184
+ | Key | Type | Description |
185
+ |-----|------|-------------|
186
+ | `decoder/exact_match_acc` | `float` | Exact match accuracy (decoder mode) |
187
+ | `decoder/avg_tanimoto` | `float` | Average Tanimoto similarity (decoder) |
188
+ | `decoder/tautomer_match_acc` | `float` | Tautomer-normalized exact match (decoder, if `tautomer_standardize=True`) |
189
+ | `decoder/failed_predictions` | `int` | Number of failed predictions (decoder) |
190
+ | `decoder/valid` | `int` | Number of chemically valid predictions (decoder) |
191
+ | `decoder/total` | `int` | Total benchmark samples |
192
+ | `graph/exact_match_acc` | `float` | Exact match accuracy (graph mode) |
193
+ | `graph/avg_tanimoto` | `float` | Average Tanimoto similarity (graph) |
194
+ | `graph/tautomer_match_acc` | `float` | Tautomer-normalized exact match (graph, if `tautomer_standardize=True`) |
195
+ | `graph/failed_predictions` | `int` | Number of failed predictions (graph) |
196
+ | `graph/valid` | `int` | Number of chemically valid predictions (graph) |
197
+ | `graph/total` | `int` | Total benchmark samples |
198
+ | `postprocess/exact_match_acc` | `float` | Exact match accuracy (postprocess mode, **primary metric**) |
199
+ | `postprocess/avg_tanimoto` | `float` | Average Tanimoto similarity (postprocess) |
200
+ | `postprocess/tautomer_match_acc` | `float` | Tautomer-normalized exact match (postprocess, if `tautomer_standardize=True`) |
201
+ | `postprocess/failed_predictions` | `int` | Number of failed predictions (postprocess) |
202
+ | `postprocess/valid` | `int` | Number of chemically valid predictions (postprocess) |
203
+ | `postprocess/records_df` | `DataFrame` | Per-image results with columns ``image_id``, ``gt_smiles``, ``pred_smiles``, ``exact``, ``tautomer``, ``tanimoto`` |
204
+ | `postprocess/total` | `int` | Total benchmark samples |
205
+ | `total` | `int` | Total benchmark samples |
206
+
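Because the returned dict is flat and every per-mode key is prefixed with ``decoder/``, ``graph/``, or ``postprocess/``, it can be convenient to regroup it by mode before comparing modes side by side. This sketch assumes only the key layout listed in the table above; `group_by_mode` is a hypothetical helper, and the numbers are made up for illustration.

```python
from collections import defaultdict


def group_by_mode(metrics):
    """Split keys like "graph/exact_match_acc" into {mode: {metric: value}}."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        mode, sep, name = key.partition("/")
        if sep:  # skip un-prefixed keys such as the top-level "total"
            grouped[mode][name] = value
    return dict(grouped)


# Illustrative numbers only, not real benchmark results.
metrics = {
    "decoder/exact_match_acc": 0.90,
    "graph/exact_match_acc": 0.91,
    "postprocess/exact_match_acc": 0.93,
    "total": 1000,
}
for mode, values in group_by_mode(metrics).items():
    print(f"{mode:<12} exact={values['exact_match_acc']:.2%}")
```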
207
+ ---
208
+
209
+ ### `como.evaluate_benchmarks(model, benchmarks, *, beam_size=1, postproc_workers=32, tautomer_standardize=True, gpus="0")`
210
+
211
+ Evaluate on multiple benchmarks in one call. Returns a nested dict keyed
212
+ by benchmark name.
213
+
214
+ | Parameter | Type | Default | Description |
215
+ |-----------|------|---------|-------------|
216
+ | `model` | `ComoModel` | *required* | A loaded model |
217
+ | `benchmarks` | `list[dict]` | *required* | Each dict has keys ``"name"``, ``"benchmark_dir"``, ``"csv_path"`` |
218
+ | `beam_size` | `int` | `1` | Beam width for decoding |
219
+ | `postproc_workers` | `int` | `32` | Parallel workers for SMILES post-processing |
220
+ | `tautomer_standardize` | `bool` | `True` | Include tautomer-normalized exact match |
221
+ | `gpus` | `str` or `None` | `"0"` | GPU IDs (``"0,1"``) or ``None`` for all |
222
+
223
+ **Returns:** ``dict[str, dict]`` — mapping from benchmark name to a metrics
224
+ dict with the same structure as `evaluate`. Example:
225
+
226
+     {
227
+         "USPTO": {
228
+             "postprocess/exact_match_acc": 0.934,
229
+             "postprocess/avg_tanimoto": 0.987,
230
+             ...
231
+         },
232
+         "CLEF": {
233
+             "postprocess/exact_match_acc": 0.948,
234
+             ...
235
+         },
236
+     }
237
+
238
+ **Example:**
239
+
240
+     benchmarks = [
241
+         {"name": "USPTO", "benchmark_dir": "data/benchmark/real/USPTO",
242
+          "csv_path": "data/benchmark/real/USPTO.csv"},
243
+         {"name": "CLEF", "benchmark_dir": "data/benchmark/real/CLEF",
244
+          "csv_path": "data/benchmark/real/CLEF_corrected.csv"},
245
+     ]
246
+     results = como.evaluate_benchmarks(model, benchmarks, gpus="0,1")
247
+     for name, metrics in results.items():
248
+         acc = metrics["postprocess/exact_match_acc"]
249
+         tan = metrics["postprocess/avg_tanimoto"]
250
+         print(f"{name}: Exact={acc:.2%}, Tanimoto={tan:.4f}")
251
+
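When benchmarks differ a lot in size, a plain mean of per-benchmark accuracies can be misleading. As a rough sketch, the nested result can be pooled into one sample-weighted exact-match figure using the documented ``<mode>/exact_match_acc`` and ``<mode>/total`` keys. `pooled_exact_match` is a hypothetical helper, the benchmark names and numbers are invented, and the code assumes accuracies are fractions in [0, 1], as the ``:.2%`` formatting in the examples suggests.

```python
def pooled_exact_match(results, mode="postprocess"):
    """Sample-weighted exact match across all benchmarks in ``results``."""
    correct = 0.0
    total = 0
    for metrics in results.values():
        n = metrics[f"{mode}/total"]
        correct += metrics[f"{mode}/exact_match_acc"] * n
        total += n
    return correct / total if total else 0.0


# Invented values purely to show the arithmetic.
results = {
    "small_set": {"postprocess/exact_match_acc": 0.5, "postprocess/total": 100},
    "large_set": {"postprocess/exact_match_acc": 1.0, "postprocess/total": 300},
}
print(f"{pooled_exact_match(results):.3f}")  # (50 + 300) / 400 -> 0.875
```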
252
+ ---
253
+
254
+ ### `como.canonicalize_smiles(smiles, *, ignore_chiral=False, ignore_cistrans=False, replace_rgroup=True)`
255
+
256
+ Canonicalize a SMILES string using RDKit.
257
+
258
+ | Parameter | Type | Default | Description |
259
+ |-----------|------|---------|-------------|
260
+ | `smiles` | `str` | *required* | Input SMILES string |
261
+ | `ignore_chiral` | `bool` | `False` | Strip tetrahedral chirality before canonicalization |
262
+ | `ignore_cistrans` | `bool` | `False` | Strip cis–trans markers (``/`` and ``\``) before canonicalization |
263
+ | `replace_rgroup` | `bool` | `True` | If ``True``, replace R-group tokens (``R``, ``R1``, ``X``, ``Ar``, …) with wildcard ``*`` |
264
+
265
+ **Returns:** ``tuple[str, bool]`` — ``(canonical_smiles, ok)`` where *ok* is
266
+ ``True`` if the SMILES is chemically valid and canonicalization succeeded.
267
+
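The ``(canonical_smiles, ok)`` return shape makes it easy to compare two SMILES strings while treating invalid input as a mismatch. The sketch below takes the canonicalizer as an argument so it runs without COMO or RDKit installed. `smiles_match` and `toy_canonicalize` are hypothetical names, the toy function only strips whitespace and performs no real chemistry, and this is not necessarily how COMO computes its own exact-match metric.

```python
from typing import Callable, Tuple

Canonicalizer = Callable[[str], Tuple[str, bool]]


def smiles_match(pred: str, truth: str, canonicalize: Canonicalizer) -> bool:
    """Compare two SMILES after canonicalization; invalid inputs never match."""
    pred_can, pred_ok = canonicalize(pred)
    truth_can, truth_ok = canonicalize(truth)
    return pred_ok and truth_ok and pred_can == truth_can


def toy_canonicalize(smiles: str) -> Tuple[str, bool]:
    """Stand-in with the documented (smiles, ok) shape; NOT real chemistry."""
    cleaned = smiles.strip()
    return cleaned, bool(cleaned)


print(smiles_match(" CCO", "CCO ", toy_canonicalize))
print(smiles_match("", "CCO", toy_canonicalize))
```

With the package installed, `como.canonicalize_smiles` could be passed in place of `toy_canonicalize`, since it returns the same two-element tuple.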
268
+ ---
269
+
270
+ ### `como.canonicalize_tautomer(smiles)`
271
+
272
+ Canonicalize a SMILES string via RDKit's TautomerEnumerator, normalizing
273
+ different tautomeric forms (e.g., keto/enol, lactam/lactim) to the same
274
+ canonical representation.
275
+
276
+ | Parameter | Type | Default | Description |
277
+ |-----------|------|---------|-------------|
278
+ | `smiles` | `str` | *required* | Input SMILES string |
279
+
280
+ **Returns:** ``tuple[str, bool]`` — ``(tautomer_canonical_smiles, ok)`` where
281
+ *ok* is ``False`` if the input SMILES is invalid or tautomer enumeration fails.
282
+
283
+ ---
284
+
285
+ ### `como._result_to_smiles(result, mode="postprocess")`
286
+
287
+ Low-level: convert a raw prediction result dict (from `predict` with
288
+ ``smiles_mode=None``) to a canonical SMILES string.
289
+
290
+ | Parameter | Type | Default | Description |
291
+ |-----------|------|---------|-------------|
292
+ | `result` | `dict` | *required* | Raw prediction dict with keys ``smiles``, ``symbols``, ``coords``, ``bond_mat``, ``success`` |
293
+ | `mode` | `str` | ``"postprocess"`` | SMILES reconstruction mode |
294
+
295
+ *mode* options:
296
+
297
+ | Mode | Source | Chirality | Description |
298
+ |------|--------|-----------|-------------|
299
+ | ``"decoder"`` | Decoder token sequence | ✗ | Raw decoder SMILES, no graph info used. Fastest but lowest quality. |
300
+ | ``"graph"`` | Predicted atoms + bonds | ✓ | Reconstructs SMILES entirely from predicted atom symbols, coordinates, and bond matrix. Chirality restored via `_verify_chirality`. |
301
+ | ``"postprocess"`` | Decoder + atoms + bonds | ✓ | Starts from decoder SMILES, replaces R-groups/abbreviations, restores chirality from predicted coordinates and bond matrix, then expands functional groups back. Best quality. |
302
+
303
+ **Returns:** ``str`` or ``None`` — canonical SMILES string, or ``None`` if conversion fails.
304
+
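Since the conversion returns ``None`` on failure, one possible usage pattern is to try the modes in descending order of expected quality and keep the first that succeeds. This is a sketch under assumptions: `first_successful_smiles` and `fake_convert` are hypothetical, the converter is injected so the example runs without COMO, and it is not known from this document whether `predict` already falls back in this way internally.

```python
from typing import Callable, Optional, Sequence


def first_successful_smiles(
    result: dict,
    convert: Callable[[dict, str], Optional[str]],
    modes: Sequence[str] = ("postprocess", "graph", "decoder"),
) -> Optional[str]:
    """Return the first non-None SMILES produced by trying ``modes`` in order."""
    for mode in modes:
        smiles = convert(result, mode)
        if smiles is not None:
            return smiles
    return None


def fake_convert(result: dict, mode: str) -> Optional[str]:
    """Stand-in converter: looks the mode up in a plain dict."""
    return result.get(mode)


# Only the "decoder" entry exists, so the first two modes yield None.
print(first_successful_smiles({"decoder": "CCO"}, fake_convert))
```

With COMO installed, `como._result_to_smiles` matches the ``convert(result, mode)`` call shape, though as an underscore-prefixed function it may change between releases.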
305
+ ## Model Weights
306
+
307
+ Pre-trained model weights are available on HuggingFace:
308
+
309
+ | Checkpoint | Reward Mode | Description |
310
+ |-----------|-------------|-------------|
311
+ | `COMO_joint/tanimoto/final.pth` | Tanimoto | Joint MLE+MRT (Tanimoto reward) |
312
+ | `COMO_joint/edit_distance/final.pth` | Edit Distance | Joint MLE+MRT (Edit Distance reward) |
313
+ | `COMO_joint/visual/final.pth` | Visual | Joint MLE+MRT (Visual reward) |
314
+
315
+ Download from: **https://huggingface.co/Keylab/COMO**
316
+
317
+ ## Benchmark Datasets
318
+
319
+ Benchmark datasets (images + CSV ground truth) are available on HuggingFace Datasets:
320
+
321
+ | Dataset | Images | Type |
322
+ |---------|--------|------|
323
+ | USPTO | ~6K | Real patent images |
324
+ | USPTO-10K | ~10K | Real patent images |
325
+ | CLEF | ~5K | Real patent images |
326
+ | JPO | ~3K | Real patent images |
327
+ | UOB | ~4K | Real academic images |
328
+ | staker | ~1K | Real images |
329
+ | acs | ~2K | Real publication images |
330
+ | WildMol-10K | ~10K | Real wild images |
331
+ | indigo | ~8K | Synthetic (Indigo-rendered) |
332
+ | chemdraw | ~8K | Synthetic (ChemDraw style) |
333
+
334
+ Download from: **https://huggingface.co/Keylab/COMO** (see `benchmarks/` folder)
335
+
336
+ ## Citation
337
+
338
+ If you use COMO in your research, please cite:
339
+
340
+ ```bibtex
341
+ @article{lyu2026closed,
342
+ title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
343
+ author={Lyu, Zhuoqi and Ke, Qing},
344
+ journal={arXiv preprint arXiv:2604.23546},
345
+ year={2026}
346
+ }
347
+ ```
348
+
349
+ ## License
350
+
351
+ - **Code** (`como/` package): MIT License
352
+ - **Model Weights** (`.pth` files): CC BY-NC 4.0 (non-commercial use only)
353
+ - **Benchmark Datasets**: collected from existing public OCSR benchmarks; please refer to their
354
+ original sources for license and attribution:
355
+
356
+ | Dataset | Source |
357
+ |---------|--------|
358
+ | USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/OCSR_Review), [Xiong et al., 2023](https://github.com/jiachengxiong/alpha-Extractor) |
359
+ | Indigo, ChemDraw, ACS, Staker | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
360
+ | USPTO-10K | [Morin et al., 2023](https://huggingface.co/datasets/docling-project/USPTO-30K) |
361
+ | WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |
362
+
363
+ See [LICENSE](LICENSE) for full terms.