PyPI - como-ocsr - Versions diffs - 1.0.0__tar.gz - Mend

como-ocsr 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19)

como_ocsr-1.0.0/LICENSE +56 -0
como_ocsr-1.0.0/PKG-INFO +363 -0
como_ocsr-1.0.0/README.md +323 -0
como_ocsr-1.0.0/como/__init__.py +298 -0
como_ocsr-1.0.0/como/_chemistry.py +563 -0
como_ocsr-1.0.0/como/_constants.py +120 -0
como_ocsr-1.0.0/como/_core.py +6105 -0
como_ocsr-1.0.0/como/_drawing.py +980 -0
como_ocsr-1.0.0/como/_indigo/lib/Linux/x64/libbingo.so +0 -0
como_ocsr-1.0.0/como/_indigo/lib/Linux/x64/libindigo-inchi.so +0 -0
como_ocsr-1.0.0/como/_indigo/lib/Linux/x64/libindigo-renderer.so +0 -0
como_ocsr-1.0.0/como/_indigo/lib/Linux/x64/libindigo.so +0 -0
como_ocsr-1.0.0/como_ocsr.egg-info/PKG-INFO +363 -0
como_ocsr-1.0.0/como_ocsr.egg-info/SOURCES.txt +17 -0
como_ocsr-1.0.0/como_ocsr.egg-info/dependency_links.txt +1 -0
como_ocsr-1.0.0/como_ocsr.egg-info/requires.txt +14 -0
como_ocsr-1.0.0/como_ocsr.egg-info/top_level.txt +1 -0
como_ocsr-1.0.0/pyproject.toml +67 -0
como_ocsr-1.0.0/setup.cfg +4 -0

como_ocsr-1.0.0/LICENSE ADDED

@@ -0,0 +1,56 @@
+MIT License
+Copyright (c) 2026 Zhuoqi Lyu
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+===========================================================================
+Model Weights License — Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)
+The pre-trained model weights (.pth files) distributed alongside this software
+are licensed under CC BY-NC 4.0:
+  https://creativecommons.org/licenses/by-nc/4.0/
+You are free to:
+  - Share — copy and redistribute the material in any medium or format
+  - Adapt — remix, transform, and build upon the material
+Under the following terms:
+  - Attribution — You must give appropriate credit
+  - NonCommercial — You may not use the material for commercial purposes
+===========================================================================
+Benchmark Datasets
+The benchmark datasets referenced in this project are collected from existing
+public OCSR benchmarks and are NOT covered by the above licenses. Please refer
+to their original sources for applicable license terms and attribution:
+  Dataset       Source
+  -------       ------
+  USPTO, CLEF,  Rajan et al., 2020 — https://github.com/Kohulan/DECIMER-Image_Transformer
+  JPO, UOB,
+  Staker
+  Indigo,       Qian et al., 2023 — https://github.com/thomas0809/MolScribe
+  ChemDraw, ACS
+  USPTO-10K     Morin et al., 2023 — https://github.com/DS4SD/molgrapher
+  WildMol-10K   Fang et al., 2025 — https://github.com/orgs/Chem-Struct-ML/repositories

como_ocsr-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,363 @@
+Metadata-Version: 2.4
+Name: como-ocsr
+Version: 1.0.0
+Summary: COMO: Closed-loop Optical Molecule recOgnition with Minimum Risk Training — Optical Chemical Structure Recognition
+Author: Zhuoqi Lyu
+License: MIT
+Project-URL: Homepage, https://huggingface.co/Keylab/COMO
+Project-URL: Repository, https://github.com/lyuzhuoqi/COMO
+Project-URL: Bug Tracker, https://github.com/lyuzhuoqi/COMO/issues
+Keywords: cheminformatics,optical-chemical-structure-recognition,ocsr,molecule-recognition,deep-learning,transformer,rdkit
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Topic :: Scientific/Engineering :: Chemistry
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Scientific/Engineering :: Image Recognition
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: torch>=2.0
+Requires-Dist: torchvision>=0.15
+Requires-Dist: timm>=0.9
+Requires-Dist: rdkit
+Requires-Dist: SmilesPE>=0.0.3
+Requires-Dist: albumentations>=1.3
+Requires-Dist: opencv-python-headless>=4.5
+Requires-Dist: Pillow>=9.0
+Requires-Dist: numpy>=1.21
+Requires-Dist: pandas>=1.5
+Requires-Dist: tqdm>=4.60
+Requires-Dist: func-timeout>=4.3
+Provides-Extra: train
+Dynamic: license-file
+# COMO
+**COMO** (**C**losed-loop **O**ptical **M**olecule rec**O**gnition) is a deep
+learning framework that recognizes chemical structure diagrams from images and
+predicts SMILES strings with atom-level coordinates and bond matrices.  It uses
+Minimum Risk Training (MRT) to directly optimize molecular-level,
+non-differentiable objectives.
+## Installation
+```bash
+pip install como-ocsr
+```
+## Quick Start
+```python
+import como
+# Load a model checkpoint (on GPU 0)
+model = como.load_model("path/to/checkpoint.pth", device="cuda:0")
+# Predict SMILES from a single image
+smiles = como.predict(model, "molecule.png")
+print(smiles)  # "CC(=O)O"
+# Batch prediction on a specific GPU
+smiles_list = como.predict_batch(model, ["mol1.png", "mol2.png"], device="cuda:1")
+# Evaluate on a benchmark (single GPU by default)
+metrics = como.evaluate(
+    model,
+    benchmark_dir="benchmark/USPTO/",
+    csv_path="benchmark/USPTO.csv",
+)
+print(f"Exact Match: {metrics['postprocess/exact_match_acc']:.2%}")
+# Multi-GPU, multi-benchmark evaluation
+benchmarks = [
+    {"name": "USPTO", "benchmark_dir": "benchmark/USPTO/",
+     "csv_path": "benchmark/USPTO.csv"},
+    {"name": "CLEF",  "benchmark_dir": "benchmark/CLEF/",
+     "csv_path": "benchmark/CLEF_corrected.csv"},
+]
+results = como.evaluate_benchmarks(model, benchmarks, gpus="0,1,2,3")
+for name, m in results.items():
+    print(f"{name}: {m['postprocess/exact_match_acc']:.2%}")
+```
+## API Reference
+### GPU Selection
+All functions accept a ``device`` parameter for single-GPU usage:
+```python
+model = como.load_model("checkpoint.pth", device="cuda:0")
+como.predict(model, "img.png", device="cuda:1")
+como.predict_batch(model, [...], device="cuda:2")
+```
+For **evaluation** (which uses multi-GPU internally via ``mp.spawn``), use the
+``gpus`` parameter:
+| Function | GPU control |
+|----------|-------------|
+| ``load_model`` | ``device="cuda:0"`` |
+| ``predict`` | ``device="cuda:0"`` |
+| ``predict_batch`` | ``device="cuda:0"`` |
+| ``evaluate`` | ``gpus="0"`` (default), ``gpus="0,1,2"``, ``gpus=None`` (all) |
+| ``evaluate_benchmarks`` | ``gpus="0"`` (default), ``gpus="0,1,2"``, ``gpus=None`` (all) |
+---
+### `como.load_model(checkpoint_path, device="cuda", pretrained=True, **kwargs)`
+Load a COMO model from a `.pth` checkpoint.  Returns a :class:`ComoModel`
+instance in evaluation mode.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `checkpoint_path` | `str` | *required* | Path to `.pth` checkpoint |
+| `device` | `str` | `"cuda"` | ``"cuda"``, ``"cuda:0"``, or ``"cpu"`` |
+| `pretrained` | `bool` | `True` | Use ImageNet-pretrained backbone weights |
+**Returns:** ``ComoModel``
+---
+### `como.predict(model, image, *, beam_size=1, max_len=500, smiles_mode="postprocess", device=None)`
+Predict the SMILES string for a single molecular image.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model` | `ComoModel` | *required* | A loaded model |
+| `image` | `str` / `np.ndarray` / `PIL.Image` / `torch.Tensor` | *required* | Input image (file path, array, PIL, or preprocessed tensor) |
+| `beam_size` | `int` | `1` | Beam width (1 = greedy, 3 = beam search) |
+| `max_len` | `int` | `500` | Maximum number of tokens to generate |
+| `smiles_mode` | `str` or `None` | `"postprocess"` | ``"postprocess"`` (best quality), ``"graph"``, ``"decoder"``, or ``None`` (raw result dict) |
+| `device` | `str` or `None` | `None` | Optional device override (e.g. ``"cuda:1"``) |
+**Returns:**
+- `str` — predicted SMILES string (if *smiles_mode* is not ``None``)
+- `dict` — full result dict with keys ``tokens``, ``symbols``, ``coords``, ``bond_mat``, ``decode_smiles``, ``success`` (if ``smiles_mode=None``)
+---
+### `como.predict_batch(model, images, *, beam_size=1, max_len=500, smiles_mode="postprocess", device=None)`
+Batch prediction on a single GPU.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model` | `ComoModel` | *required* | A loaded model |
+| `images` | `list` | *required* | List of file paths, NumPy arrays, PIL Images, or tensors |
+| `beam_size` | `int` | `1` | Beam width (1 = greedy, recommended for batch) |
+| `max_len` | `int` | `500` | Maximum tokens per image |
+| `smiles_mode` | `str` or `None` | `"postprocess"` | SMILES reconstruction mode |
+| `device` | `str` or `None` | `None` | Optional device override |
+**Returns:**
+- `list[str]` — predicted SMILES for each image (if *smiles_mode* is not ``None``)
+- `list[dict]` — raw result dicts (if ``smiles_mode=None``)
+---
+### `como.evaluate(model, benchmark_dir, csv_path, *, beam_size=1, postproc_workers=32, tautomer_standardize=True, gpus="0")`
+Evaluate on a single benchmark dataset.  Returns a flat dict of metrics.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model` | `ComoModel` | *required* | A loaded model |
+| `benchmark_dir` | `str` | *required* | Directory containing `.png` images |
+| `csv_path` | `str` | *required* | CSV with columns ``image_id``, ``SMILES`` |
+| `beam_size` | `int` | `1` | Beam width for decoding |
+| `postproc_workers` | `int` | `32` | Parallel workers for SMILES post-processing |
+| `tautomer_standardize` | `bool` | `True` | Include tautomer-normalized exact match |
+| `gpus` | `str` or `None` | `"0"` | GPU IDs (``"0,1"``) or ``None`` for all |
+**Returns:** ``dict`` with the following keys:
+| Key | Type | Description |
+|-----|------|-------------|
+| `decoder/exact_match_acc` | `float` | Exact match accuracy (decoder mode) |
+| `decoder/avg_tanimoto` | `float` | Average Tanimoto similarity (decoder) |
+| `decoder/tautomer_match_acc` | `float` | Tautomer-normalized exact match (decoder, if `tautomer_standardize=True`) |
+| `decoder/failed_predictions` | `int` | Number of failed predictions (decoder) |
+| `decoder/valid` | `int` | Number of chemically valid predictions (decoder) |
+| `decoder/total` | `int` | Total benchmark samples |
+| `graph/exact_match_acc` | `float` | Exact match accuracy (graph mode) |
+| `graph/avg_tanimoto` | `float` | Average Tanimoto similarity (graph) |
+| `graph/tautomer_match_acc` | `float` | Tautomer-normalized exact match (graph, if `tautomer_standardize=True`) |
+| `graph/failed_predictions` | `int` | Number of failed predictions (graph) |
+| `graph/valid` | `int` | Number of chemically valid predictions (graph) |
+| `graph/total` | `int` | Total benchmark samples |
+| `postprocess/exact_match_acc` | `float` | Exact match accuracy (postprocess mode, **primary metric**) |
+| `postprocess/avg_tanimoto` | `float` | Average Tanimoto similarity (postprocess) |
+| `postprocess/tautomer_match_acc` | `float` | Tautomer-normalized exact match (postprocess, if `tautomer_standardize=True`) |
+| `postprocess/failed_predictions` | `int` | Number of failed predictions (postprocess) |
+| `postprocess/valid` | `int` | Number of chemically valid predictions (postprocess) |
+| `postprocess/records_df` | `DataFrame` | Per-image results with columns ``image_id``, ``gt_smiles``, ``pred_smiles``, ``exact``, ``tautomer``, ``tanimoto`` |
+| `postprocess/total` | `int` | Total benchmark samples |
+| `total` | `int` | Total benchmark samples |
+---
+### `como.evaluate_benchmarks(model, benchmarks, *, beam_size=1, postproc_workers=32, tautomer_standardize=True, gpus="0")`
+Evaluate on multiple benchmarks in one call.  Returns a nested dict keyed
+by benchmark name.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model` | `ComoModel` | *required* | A loaded model |
+| `benchmarks` | `list[dict]` | *required* | Each dict has keys ``"name"``, ``"benchmark_dir"``, ``"csv_path"`` |
+| `beam_size` | `int` | `1` | Beam width for decoding |
+| `postproc_workers` | `int` | `32` | Parallel workers for SMILES post-processing |
+| `tautomer_standardize` | `bool` | `True` | Include tautomer-normalized exact match |
+| `gpus` | `str` or `None` | `"0"` | GPU IDs (``"0,1"``) or ``None`` for all |
+**Returns:** ``dict[str, dict]`` — mapping from benchmark name to a metrics
+dict with the same structure as :func:`evaluate`.  Example::
+    {
+      "USPTO": {
+        "postprocess/exact_match_acc": 0.934,
+        "postprocess/avg_tanimoto": 0.987,
+        ...
+      },
+      "CLEF": {
+        "postprocess/exact_match_acc": 0.948,
+        ...
+      },
+    }
+**Example:**
+    benchmarks = [
+        {"name": "USPTO", "benchmark_dir": "data/benchmark/real/USPTO",
+         "csv_path": "data/benchmark/real/USPTO.csv"},
+        {"name": "CLEF",  "benchmark_dir": "data/benchmark/real/CLEF",
+         "csv_path": "data/benchmark/real/CLEF_corrected.csv"},
+    ]
+    results = como.evaluate_benchmarks(model, benchmarks, gpus="0,1")
+    for name, metrics in results.items():
+        acc = metrics["postprocess/exact_match_acc"]
+        tan = metrics["postprocess/avg_tanimoto"]
+        print(f"{name}: Exact={acc:.2%}, Tanimoto={tan:.4f}")
+---
+### `como.canonicalize_smiles(smiles, *, ignore_chiral=False, ignore_cistrans=False, replace_rgroup=True)`
+Canonicalize a SMILES string using RDKit.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `smiles` | `str` | *required* | Input SMILES string |
+| `ignore_chiral` | `bool` | `False` | Strip tetrahedral chirality before canonicalization |
+| `ignore_cistrans` | `bool` | `False` | Strip cis–trans markers (``/`` and ``\``) before canonicalization |
+| `replace_rgroup` | `bool` | `True` | If ``True``, replace R-group tokens (``R``, ``R1``, ``X``, ``Ar``, …) with wildcard ``*`` |
+**Returns:** ``tuple[str, bool]`` — ``(canonical_smiles, ok)`` where *ok* is
+``True`` if the SMILES is chemically valid and canonicalization succeeded.
+---
+### `como.canonicalize_tautomer(smiles)`
+Canonicalize a SMILES string via RDKit's TautomerEnumerator, normalizing
+different tautomeric forms (e.g., keto/enol, lactam/lactim) to the same
+canonical representation.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `smiles` | `str` | *required* | Input SMILES string |
+**Returns:** ``tuple[str, bool]`` — ``(tautomer_canonical_smiles, ok)`` where
+*ok* is ``False`` if the input SMILES is invalid or tautomer enumeration fails.
+---
+### `como._result_to_smiles(result, mode="postprocess")`
+Low-level: convert a raw prediction result dict (from :func:`predict` with
+``smiles_mode=None``) to a canonical SMILES string.
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `result` | `dict` | *required* | Raw prediction dict with keys ``smiles``, ``symbols``, ``coords``, ``bond_mat``, ``success`` |
+| `mode` | `str` | ``"postprocess"`` | SMILES reconstruction mode |
+*mode* options:
+| Mode | Source | Chirality | Description |
+|------|--------|-----------|-------------|
+| ``"decoder"`` | Decoder token sequence | ✗ | Raw decoder SMILES, no graph info used. Fastest but lowest quality. |
+| ``"graph"`` | Predicted atoms + bonds | ✓ | Reconstructs SMILES entirely from predicted atom symbols, coordinates, and bond matrix. Chirality restored via `_verify_chirality`. |
+| ``"postprocess"`` | Decoder + atoms + bonds | ✓ | Starts from decoder SMILES, replaces R-groups/abbreviations, restores chirality from predicted coordinates and bond matrix, then expands functional groups back. Best quality. |
+**Returns:** ``str`` or ``None`` — canonical SMILES string, or ``None`` if conversion fails.
+## Model Weights
+Pre-trained model weights are available on HuggingFace:
+| Checkpoint | Reward Mode | Description |
+|-----------|-------------|-------------|
+| `COMO_joint/tanimoto/final.pth` | Tanimoto | Joint MLE+MRT (Tanimoto reward) |
+| `COMO_joint/edit_distance/final.pth` | Edit Distance | Joint MLE+MRT (Edit Distance reward) |
+| `COMO_joint/visual/final.pth` | Visual | Joint MLE+MRT (Visual reward) |
+Download from: **https://huggingface.co/Keylab/COMO**
+## Benchmark Datasets
+Benchmark datasets (images + CSV ground truth) are available on HuggingFace Datasets:
+| Dataset | Images | Type |
+|---------|--------|------|
+| USPTO | ~6K | Real patent images |
+| USPTO-10K | ~10K | Real patent images |
+| CLEF | ~5K | Real patent images |
+| JPO | ~3K | Real patent images |
+| UOB | ~4K | Real academic images |
+| staker | ~1K | Real images |
+| acs | ~2K | Real publication images |
+| WildMol-10K | ~10K | Real wild images |
+| indigo | ~8K | Synthetic (Indigo-rendered) |
+| chemdraw | ~8K | Synthetic (ChemDraw style) |
+Download from: **https://huggingface.co/Keylab/COMO** (see `benchmarks/` folder)
+## Citation
+If you use COMO in your research, please cite:
+```bibtex
+@article{lyu2026closed,
+  title={COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training},
+  author={Lyu, Zhuoqi and Ke, Qing},
+  journal={arXiv preprint arXiv:2604.23546},
+  year={2026}
+}
+```
+## License
+- **Code** (`como/` package): MIT License
+- **Model Weights** (`.pth` files): CC BY-NC 4.0 (non-commercial use only)
+- **Benchmark Datasets**: collected from existing public OCSR benchmarks; please refer to their
+  original sources for license and attribution:
+| Dataset | Source |
+|---------|--------|
+| USPTO, CLEF, JPO, UOB, Staker | [Rajan et al., 2020](https://github.com/Kohulan/OCSR_Review), [Xiong et al., 2023](https://github.com/jiachengxiong/alpha-Extractor) |
+| Indigo, ChemDraw, ACS, Staker | [Qian et al., 2023](https://github.com/thomas0809/MolScribe) |
+| USPTO-10K | [Morin et al., 2023](https://huggingface.co/datasets/docling-project/USPTO-30K) |
+| WildMol-10K | [Fang et al., 2025](https://github.com/orgs/Chem-Struct-ML/repositories) |
+See [LICENSE](LICENSE) for full terms.
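
Reviewer note (not part of the package): the PKG-INFO above documents `como.evaluate` as returning a flat dict whose keys have the form `mode/metric` (for example `postprocess/exact_match_acc`), plus a top-level `total`. The sketch below shows one way such a dict could be regrouped by mode for reporting. It does not call `como` at all. The helper name `group_metrics_by_mode`, the `_global` bucket, and the sample values are assumptions made for illustration, not part of the package or real benchmark results.

```python
from collections import defaultdict


def group_metrics_by_mode(flat_metrics):
    """Regroup keys like 'postprocess/exact_match_acc' into {mode: {metric: value}}.

    Hypothetical helper: keys without a '/' (such as 'total') are placed
    under an assumed '_global' bucket.
    """
    grouped = defaultdict(dict)
    for key, value in flat_metrics.items():
        mode, sep, metric = key.partition("/")
        if sep:
            # Key carries a mode prefix, e.g. 'decoder', 'graph', 'postprocess'.
            grouped[mode][metric] = value
        else:
            # Top-level key with no mode prefix.
            grouped["_global"][key] = value
    return dict(grouped)


# Illustrative values only; these are not results from any benchmark.
example = {
    "decoder/exact_match_acc": 0.5,
    "postprocess/exact_match_acc": 0.75,
    "postprocess/total": 4,
    "total": 4,
}

by_mode = group_metrics_by_mode(example)
print(by_mode["postprocess"]["exact_match_acc"])  # prints 0.75
```

Grouping by prefix keeps the three reconstruction modes (`decoder`, `graph`, `postprocess`) side by side, which may make it easier to compare them than scanning the flat key list.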