PyPI - proteintensor - Versions diffs - 0.2.0__tar.gz → 0.3.0__tar.gz - Mend

proteintensor 0.2.0tar.gz → 0.3.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

{proteintensor-0.2.0 → proteintensor-0.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: proteintensor
-Version: 0.2.0
+Version: 0.3.0
 Summary: AI-native biomolecular tensor format for structural biology ML
 Author-email: Clayton Moore <claytonwaynemoore@gmail.com>
 License-Expression: MIT
@@ -34,6 +34,8 @@ Provides-Extra: cloud
 Requires-Dist: fsspec>=2023.1; extra == "cloud"
 Requires-Dist: s3fs>=2023.1; extra == "cloud"
 Requires-Dist: gcsfs>=2023.1; extra == "cloud"
+Provides-Extra: ligands
+Requires-Dist: rdkit>=2023.3; extra == "ligands"
 Provides-Extra: dev
 Requires-Dist: pytest>=7; extra == "dev"
 Requires-Dist: pytest-benchmark; extra == "dev"
@@ -41,7 +43,9 @@ Requires-Dist: pytest-cov; extra == "dev"
 Requires-Dist: fsspec>=2023.1; extra == "dev"
 Dynamic: license-file
-# HelixDB / ProteinTensor
+![ProteinTensor - AI-native protein data format: convert structure or sequence into cached tensors](assets/banner.png)
+# ProteinTensor Introduction
 **ProteinTensor** is an AI-native biomolecular storage format designed to eliminate
 the preprocessing bottleneck in modern structural biology machine learning pipelines.
@@ -108,20 +112,21 @@ performance format that turns a recurring computational tax into a one-time cost
 ## Benchmark: Traditional Pipeline vs ProteinTensor
-All timings are median over 30 rounds on an NVIDIA RTX 5080, CUDA 12.8, Python 3.11.
-Proteins span the full range from a 76-residue domain to a 3,525-residue CRISPR enzyme.
-Run `python boltz_benchmark.py` to reproduce.
+All timings are median over 30 rounds on a Windows workstation (RTX 5080, Python
+3.11.9); mmCIF parsing and `.ptt` reads are CPU-bound, so these reflect CPU
+performance. Proteins span the full range from a 76-residue domain to a
+3,525-residue CRISPR enzyme. Run `python boltz_benchmark.py` to reproduce.
 ### Per-structure load times
 | Structure | Method | Res | MSA seqs | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
 |---|---|---|---|---|---|---|---|---|---|
-| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.2 ms | 2.8 ms | 1.2 ms | 0.7 ms | 1.6 ms | 0.8 ms |
-| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 | 29.6 ms | 2.9 ms | 1.2 ms | 0.7 ms | 5.1 ms | 2.0 ms |
-| 4HHB - Hemoglobin | X-ray | 574 | 2,048 | 55.3 ms | 2.9 ms | 1.2 ms | 0.7 ms | 11.3 ms | 3.5 ms |
-| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 | 74.7 ms | 2.9 ms | 1.2 ms | 0.7 ms | 14.7 ms | 6.4 ms |
-| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.4 ms | 3.3 ms | 1.3 ms | 0.9 ms | 208.3 ms | 71.1 ms |
-| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 | 352.4 ms | 3.3 ms | 1.2 ms | 1.0 ms | 240.7 ms | 104.5 ms |
+| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.4 ms | 3.2 ms | 1.3 ms | 0.8 ms | 1.8 ms | 0.8 ms |
+| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 | 28.7 ms | 3.3 ms | 1.3 ms | 0.8 ms | 5.2 ms | 1.9 ms |
+| 4HHB - Hemoglobin | X-ray | 574 | 2,048 | 54.1 ms | 3.3 ms | 1.3 ms | 0.8 ms | 11.5 ms | 3.6 ms |
+| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 | 73.2 ms | 3.3 ms | 1.4 ms | 0.8 ms | 15.3 ms | 6.9 ms |
+| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.9 ms | 3.7 ms | 1.4 ms | 1.0 ms | 213.7 ms | 74.7 ms |
+| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 | 346.5 ms | 3.7 ms | 1.3 ms | 1.0 ms | 243.9 ms | 107.3 ms |
 **Column definitions**
 - `ptt: full` - `read()` - all atoms, backbone, bonds, metadata
@@ -134,32 +139,42 @@ Run `python boltz_benchmark.py` to reproduce.
 | Structure | Res | full | backbone | bonds | MSA | dist mx |
 |---|---|---|---|---|---|---|
-| 1UBQ - Ubiquitin | 76 | 3x | 6x | 11x | 4x | 9x |
-| 6LU7 - SARS-CoV-2 Mpro | 312 | 10x | 24x | 43x | 6x | 15x |
-| 4HHB - Hemoglobin | 574 | 19x | 45x | 78x | 5x | 16x |
-| 6M0J - ACE2 + RBD | 791 | 26x | 61x | 102x | 5x | 12x |
-| 6VXX - Spike trimer | 2,916 | 87x | 223x | 308x | 1x* | 4x |
-| 6OHW - Cas12a | 3,525 | 108x | 284x | 370x | 1x* | 3x |
+| 1UBQ - Ubiquitin | 76 | 2x | 6x | 10x | 4x | 9x |
+| 6LU7 - SARS-CoV-2 Mpro | 312 | 9x | 21x | 38x | 5x | 15x |
+| 4HHB - Hemoglobin | 574 | 17x | 40x | 70x | 5x | 15x |
+| 6M0J - ACE2 + RBD | 791 | 22x | 54x | 92x | 5x | 11x |
+| 6VXX - Spike trimer | 2,916 | 76x | 201x | 285x | 1x* | 4x |
+| 6OHW - Cas12a | 3,525 | 95x | 257x | 343x | 1x* | 3x |
 *MSA speedup shown as 1x vs mmCIF parse because both are in the same time range for
 large proteins - the real MSA comparison is vs JackHMMER generation (see below).
 ### Feature assembly: time to prepare all tensors for model.forward()
-Traditional = mmCIF parse + read MSA from A3M file. ProteinTensor = single .ptt read
-with all features pre-cached (sequence, backbone, bonds, MSA, distance matrix,
-ESM2 embedding).
-| Structure | Res | Traditional | ProteinTensor | Speedup |
-|---|---|---|---|---|
-| 1UBQ - Ubiquitin | 76 | 22.7 ms | 5.2 ms | 4x |
-| 6LU7 - SARS-CoV-2 Mpro | 312 | 157.3 ms | 9.9 ms | 16x |
-| 4HHB - Hemoglobin | 574 | 525.5 ms | 17.7 ms | 30x |
-| 6M0J - ACE2 + RBD | 791 | 722.7 ms | 23.9 ms | 30x |
-| 6VXX - Spike trimer | 2,916 | 9,838.5 ms | 282.7 ms | 35x |
-| 6OHW - Cas12a | 3,525 | 11,903.1 ms | 348.4 ms | **34x** |
-Average speedup across all six structures: **34x** for full feature assembly.
+Traditional = mmCIF parse + A3M MSA parse + distance-matrix compute. ProteinTensor
+= read the structure, MSA, distance matrix, and ESM2 embedding from a single
+pre-cached `.ptt`. Reproduce with `python benchmarks/assembly_benchmark.py`
+(MSA depth and embedding shape are realistic; numeric content is synthetic, so
+timing reflects tensor dimensions, not values).
+| Structure | Res | MSA depth | Traditional | ProteinTensor | Speedup |
+|---|---|---|---|---|---|
+| 1UBQ - Ubiquitin | 76 | 512 | 14.1 ms | 7.1 ms | 2.0x |
+| 6LU7 - SARS-CoV-2 Mpro | 312 | 1,024 | 48.7 ms | 13.6 ms | 3.6x |
+| 4HHB - Hemoglobin | 574 | 2,048 | 118.0 ms | 22.7 ms | 5.2x |
+| 6M0J - ACE2 + RBD | 791 | 2,048 | 196.4 ms | 38.3 ms | 5.1x |
+| 6VXX - Spike trimer | 2,916 | 8,192 | 1,395 ms | 309 ms | 4.5x |
+| 6OHW - Cas12a | 3,525 | 8,192 | 1,462 ms | 381 ms | 3.8x |
+Average speedup across all six structures: **4x** for full feature assembly
+(measured on a Windows CPU box - see
+[`benchmarks/ASSEMBLY_RESULTS.md`](benchmarks/ASSEMBLY_RESULTS.md)).
+> **On an earlier 34x figure:** prior versions reported ~34x here. That number was
+> measured against ProteinTensor's original scalar A3M parser, which dominated the
+> traditional side (~11 s to parse an 8,192-deep MSA). Vectorizing that parser in
+> v0.2.0 cut the traditional baseline ~8x, so the *fair* feature-assembly speedup
+> is now ~4x. The `.ptt` read side was unchanged - only the baseline got faster.
 ### Drug target benchmark
@@ -169,21 +184,21 @@ IgG1 antibody. Numbers are consistent with the structural biology benchmark abov
 | Target | Res | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
 |---|---|---|---|---|---|---|---|
-| 6OIM - KRAS G12C + Sotorasib | 167 | 16.6 ms | 2.8 ms | 1.2 ms | 0.7 ms | 2.8 ms | 1.1 ms |
-| 3HTB - HIV-1 protease | 163 | 16.0 ms | 2.8 ms | 1.2 ms | 0.7 ms | 2.7 ms | 1.1 ms |
-| 5WT9 - PD-L1 checkpoint | 533 | 53.8 ms | 2.9 ms | 1.2 ms | 0.7 ms | 13.1 ms | 3.3 ms |
-| 1TUP - p53 tumor suppressor | 585 | 56.5 ms | 2.8 ms | 1.2 ms | 0.7 ms | 12.4 ms | 3.4 ms |
-| 2P4E - PCSK9 | 586 | 54.7 ms | 2.8 ms | 1.2 ms | 0.7 ms | 12.1 ms | 3.4 ms |
-| 1IGT - IgG1 antibody | 1,316 | 123.4 ms | 2.9 ms | 1.2 ms | 0.8 ms | 46.8 ms | 16.4 ms |
+| 6OIM - KRAS G12C + Sotorasib | 167 | 17.1 ms | 3.4 ms | 1.3 ms | 0.8 ms | 3.0 ms | 1.3 ms |
+| 3HTB - HIV-1 protease | 163 | 16.5 ms | 3.3 ms | 1.4 ms | 0.8 ms | 2.8 ms | 1.3 ms |
+| 5WT9 - PD-L1 checkpoint | 533 | 54.8 ms | 3.8 ms | 1.4 ms | 0.8 ms | 11.9 ms | 3.8 ms |
+| 1TUP - p53 tumor suppressor | 585 | 57.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 13.0 ms | 4.0 ms |
+| 2P4E - PCSK9 | 586 | 55.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 12.8 ms | 4.1 ms |
+| 1IGT - IgG1 antibody | 1,316 | 127.3 ms | 3.5 ms | 1.4 ms | 0.8 ms | 47.1 ms | 17.9 ms |
 | Target | Res | full | backbone | bonds | MSA | dist mx |
 |---|---|---|---|---|---|---|
-| 6OIM - KRAS G12C + Sotorasib | 167 | 6x | 14x | 24x | 6x | 15x |
-| 3HTB - HIV-1 protease | 163 | 6x | 14x | 23x | 6x | 14x |
-| 5WT9 - PD-L1 checkpoint | 533 | 19x | 44x | 77x | 4x | 16x |
-| 1TUP - p53 tumor suppressor | 585 | 20x | 47x | 80x | 5x | 17x |
-| 2P4E - PCSK9 | 586 | 19x | 46x | 77x | 5x | 16x |
-| 1IGT - IgG1 antibody | 1,316 | 42x | **100x** | **162x** | 3x | 8x |
+| 6OIM - KRAS G12C + Sotorasib | 167 | 5x | 13x | 22x | 6x | 13x |
+| 3HTB - HIV-1 protease | 163 | 5x | 12x | 21x | 6x | 13x |
+| 5WT9 - PD-L1 checkpoint | 533 | 15x | 40x | 69x | 5x | 14x |
+| 1TUP - p53 tumor suppressor | 585 | 17x | 42x | 71x | 4x | 14x |
+| 2P4E - PCSK9 | 586 | 16x | 41x | 70x | 4x | 14x |
+| 1IGT - IgG1 antibody | 1,316 | 37x | **92x** | **156x** | 3x | 7x |
 ### DataLoader batch throughput
@@ -192,26 +207,38 @@ padded batches ready for `model.forward()`. Single process, no prefetch workers.
 | Batch size | ms / batch | Structures / sec |
 |---|---|---|
-| 1 | 0.01 ms | 88,106 |
-| 4 | 0.04 ms | 108,696 |
-| 8 | 0.37 ms | 21,707 |
-| 16 | 0.95 ms | 16,783 |
-| 32 | 2.0 ms | **15,854** |
+| 1 | 0.01 ms | 97,088 |
+| 4 | 0.03 ms | 116,279 |
+| 8 | 0.42 ms | 19,242 |
+| 16 | 0.97 ms | 16,412 |
+| 32 | 2.1 ms | **15,033** |
 ### Scale projection: 100,000 structures, one training epoch
+These are **projections**, extrapolated from the measured per-structure timings
+above - not end-to-end measurements at 100k scale.
 | Operation | Traditional pipeline | ProteinTensor | Speedup |
 |---|---|---|---|
-| Structure load (parse mmCIF each epoch) | 3.7 hours | 5 min | **45x** |
-| Backbone-only load (template search) | 3.7 hours | 2 min | **109x** |
-| Full feature assembly (seq + MSA + pairs + emb) | 4.5 days | 3.2 hours | **34x** |
-| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.2 hours | **1,794x** |
+| Structure load (parse mmCIF each epoch) | 3.8 hours | 6 min | **37x** |
+| Backbone-only load (template search) | 3.8 hours | 2 min | **95x** |
+| Full feature assembly (seq + MSA + pairs + emb) | 16 hours | 3.9 hours | **4x** |
+| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.7 hours | **1,477x** |
 > MSA generation assumes 2.4 min/protein on a 32-core server (PDB90 database, standard
 > AlphaFold settings). ProteinTensor generates MSAs once and loads from the `.ptt` cache
 > on every subsequent run. The 4,000-hour figure is the real cost AlphaFold2 and Boltz
 > users pay to build training datasets from scratch.
+> **Measured vs projected - read this.** The **1,477x** above is MSA *generation*
+> (building the alignment once with JackHMMER) and is a **literature-based
+> projection**, not something benchmarked here. What *is* measured on hardware is
+> the recurring per-epoch MSA **load** - reading a cached MSA from `.ptt` vs
+> re-parsing A3M text each epoch (against a vectorized A3M parser baseline):
+> **3.4x-5.9x**, growing with MSA depth. See
+> [`benchmarks/MSA_RESULTS.md`](benchmarks/MSA_RESULTS.md). These are different
+> quantities; do not read the 1,477x as a measured load speedup.
 ### Disk tradeoff
 A full-featured `.ptt` (8,192-sequence MSA + distance matrix + ESM2-650M embedding at
@@ -267,6 +294,18 @@ pt.write(data, "ubq.ptt")
 data = pt.from_fasta("complex.fasta")
 ```
+### Batch-convert a directory
+Convert an entire directory of structures in parallel, with progress reporting.
+Files that fail to parse are skipped and listed in the summary; already-converted
+outputs are skipped by default.
+```bash
+proteintensor convert-dir ./pdb_files/ ./ptt_files/            # auto worker count
+proteintensor convert-dir ./pdb_files/ ./ptt_files/ --workers 16 --recursive
+proteintensor convert-dir ./pdb_files/ ./ptt_files/ --overwrite  # rebuild existing
+```
 ### Benchmark against mmCIF
 ```bash
@@ -351,6 +390,20 @@ pt.add_pair_feature("1abc.ptt", my_array, name="template_pair",
 emb = pt.read_embedding("1abc.ptt", "esm2_t33_650M_UR50D")
 emb.data.shape      # (N_res, 1280)  float32  (upcast from float16 on load)
+# ------ Ligands / small molecules ------
+# Capture drugs, cofactors, and ions from a structure (opt-in)
+data = pt.from_mmcif("6oim.cif", include_ligands=True)
+[l.name for l in data.ligands]        # ['MG', 'GDP', 'MOV']  (MOV = sotorasib)
+ligs = pt.read_ligands("6oim.ptt")
+ligs[0].elements                      # (N_atoms,)  S2  element symbols
+ligs[0].positions                     # (N_atoms, 3)  float32
+pt.list_ligands("6oim.ptt")           # ['MG', 'GDP', 'MOV']
+# Build a ligand from SMILES (needs `pip install "proteintensor[ligands]"`)
+aspirin = pt.from_smiles("CC(=O)Oc1ccccc1C(=O)O", name="AIN")
+pt.add_ligand("target.ptt", aspirin)  # attach to an existing .ptt
 # ------ Lazy / zero-copy access ------
 positions = pt.mmap_positions("1abc.ptt")       # zarr.Array - no full load
 backbone  = pt.mmap_backbone("1abc.ptt")        # [N_res, 4, 3]
@@ -392,6 +445,7 @@ data = pt.read(
 )
 # ------ Multi-structure dataset ------
+# Structure .ptt files and sequence-only .ptt files can be mixed in one dataset.
 pt.create_dataset("training.ptt")
 for ptt_file in Path("ptt_files").glob("*.ptt"):
     pt.add_to_dataset("training.ptt", ptt_file)
@@ -407,8 +461,13 @@ loader = DataLoader(ds, batch_size=8, collate_fn=pt.ProteinDataset.collate)
 for batch in loader:
     coords  = torch.from_numpy(batch["atom_positions"])   # (B, max_atoms, 3)
     pad     = torch.from_numpy(batch["padding_mask"])     # (B, max_res)  True=real
+    has_str = torch.from_numpy(batch["has_structure"])    # (B,)  False = sequence-only
 ```
+Sequence-only entries contribute zero atoms to the batch (`n_atoms == 0`,
+`has_structure == False`), so sequence-driven and structure-based samples can be
+loaded together in one `DataLoader`.
 ---
 ## .ptt file layout
@@ -445,10 +504,16 @@ structure.ptt/                      Zarr directory store (v0.7)
 │   └── <name>/                              one sub-group per named feature
 │       ├── .zattrs                          channels, symmetric, dtype, description
 │       └── data           [N_res, N_res, C] any dtype, chunked 128x128xC
-└── embeddings/
-    └── <model>/                             one sub-group per PLM model
-        ├── .zattrs                          model, layer, dim, dtype, seq SHA-256
-        └── data           [N_res, D]        float32 or float16, chunked 256xD
+├── embeddings/
+│   └── <model>/                             one sub-group per PLM model
+│       ├── .zattrs                          model, layer, dim, dtype, seq SHA-256
+│       └── data           [N_res, D]        float32 or float16, chunked 256xD
+└── ligands/
+    └── <index>/                            one sub-group per non-polymer ligand
+        ├── .zattrs                          name (CCD), chain_id, res_num, smiles
+        ├── elements       [N_atoms]         S2       element symbols
+        ├── positions      [N_atoms, 3]      float32  Angstrom coordinates
+        └── b_factors      [N_atoms]         float32
 ```
 ### Multi-structure dataset layout
@@ -486,9 +551,9 @@ Each sub-group under `structures/` is identical to a standalone `.ptt` root, so
 pytest tests/ -v
 ```
-106 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings,
-A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming
-(memory:// fsspec - no real cloud account required).
+150 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings/ligands,
+sequence conversion, A3M parsing, Boltz adapter, multi-structure dataset, and cloud
+streaming (memory:// fsspec - no real cloud account required).
 ---
@@ -509,11 +574,11 @@ A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming
 - [ ] Chai-1 adapter
 **Data pipeline**
-- [ ] Batch convert CLI - convert entire PDB directories in parallel with progress reporting
+- [x] Batch convert CLI - convert entire PDB directories in parallel with progress reporting
 - [ ] Sequence-identity dataset splitting - MMseqs2-based cluster splits to prevent data leakage between train / val / test
 **Format extensions**
-- [ ] Ligand / small-molecule support - SMILES, CCD-based atom graphs, binding site annotations for drug-protein interaction models
+- [x] Ligand / small-molecule support - CCD-based extraction from structures, SMILES input via RDKit, element/coordinate storage (bond graphs and binding-site annotations still to come)
 - [ ] MD trajectory storage - time axis `[N_frames, N_atoms, 3]` for conformational ensembles and AlphaFold 3 diffusion trajectories
 **Performance**
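
Note on the revised benchmark figures in this file: the new feature-assembly table and the MSA-generation projection can be sanity-checked with plain arithmetic. The sketch below copies the six Traditional / ProteinTensor timings from the 0.3.0 table, recomputes each ratio, averages them, and recomputes the projected MSA-generation cost from the stated 2.4 min/protein. The names `ASSEMBLY_MS` and `speedups` are illustrative only and do not come from the package.

```python
# Timings in milliseconds, copied from the 0.3.0 feature-assembly table:
# (traditional pipeline, ProteinTensor read).
ASSEMBLY_MS = {
    "1UBQ": (14.1, 7.1),
    "6LU7": (48.7, 13.6),
    "4HHB": (118.0, 22.7),
    "6M0J": (196.4, 38.3),
    "6VXX": (1395.0, 309.0),
    "6OHW": (1462.0, 381.0),
}


def speedups(table):
    """Return the traditional / ProteinTensor ratio for each structure."""
    return {name: trad / ptt for name, (trad, ptt) in table.items()}


ratios = speedups(ASSEMBLY_MS)
mean_speedup = sum(ratios.values()) / len(ratios)

# MSA generation projection: 2.4 minutes per protein, 100,000 proteins.
msa_hours = 2.4 * 100_000 / 60

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"mean speedup: {mean_speedup:.1f}x")
print(f"MSA generation: {msa_hours:,.0f} hours")
```

Rounded to one decimal, the per-structure ratios agree with the Speedup column, and their unweighted mean rounds to the reported 4x. This checks only the internal consistency of the published numbers, not the measurements themselves.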

{proteintensor-0.2.0 → proteintensor-0.3.0}/README.md RENAMED

@@ -1,4 +1,6 @@
-# HelixDB / ProteinTensor
+![ProteinTensor - AI-native protein data format: convert structure or sequence into cached tensors](assets/banner.png)
+# ProteinTensor Introduction
 **ProteinTensor** is an AI-native biomolecular storage format designed to eliminate
 the preprocessing bottleneck in modern structural biology machine learning pipelines.
@@ -65,20 +67,21 @@ performance format that turns a recurring computational tax into a one-time cost
 ## Benchmark: Traditional Pipeline vs ProteinTensor
-All timings are median over 30 rounds on an NVIDIA RTX 5080, CUDA 12.8, Python 3.11.
-Proteins span the full range from a 76-residue domain to a 3,525-residue CRISPR enzyme.
-Run `python boltz_benchmark.py` to reproduce.
+All timings are median over 30 rounds on a Windows workstation (RTX 5080, Python
+3.11.9); mmCIF parsing and `.ptt` reads are CPU-bound, so these reflect CPU
+performance. Proteins span the full range from a 76-residue domain to a
+3,525-residue CRISPR enzyme. Run `python boltz_benchmark.py` to reproduce.
 ### Per-structure load times
 | Structure | Method | Res | MSA seqs | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
 |---|---|---|---|---|---|---|---|---|---|
-| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.2 ms | 2.8 ms | 1.2 ms | 0.7 ms | 1.6 ms | 0.8 ms |
-| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 | 29.6 ms | 2.9 ms | 1.2 ms | 0.7 ms | 5.1 ms | 2.0 ms |
-| 4HHB - Hemoglobin | X-ray | 574 | 2,048 | 55.3 ms | 2.9 ms | 1.2 ms | 0.7 ms | 11.3 ms | 3.5 ms |
-| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 | 74.7 ms | 2.9 ms | 1.2 ms | 0.7 ms | 14.7 ms | 6.4 ms |
-| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.4 ms | 3.3 ms | 1.3 ms | 0.9 ms | 208.3 ms | 71.1 ms |
-| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 | 352.4 ms | 3.3 ms | 1.2 ms | 1.0 ms | 240.7 ms | 104.5 ms |
+| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.4 ms | 3.2 ms | 1.3 ms | 0.8 ms | 1.8 ms | 0.8 ms |
+| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 | 28.7 ms | 3.3 ms | 1.3 ms | 0.8 ms | 5.2 ms | 1.9 ms |
+| 4HHB - Hemoglobin | X-ray | 574 | 2,048 | 54.1 ms | 3.3 ms | 1.3 ms | 0.8 ms | 11.5 ms | 3.6 ms |
+| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 | 73.2 ms | 3.3 ms | 1.4 ms | 0.8 ms | 15.3 ms | 6.9 ms |
+| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.9 ms | 3.7 ms | 1.4 ms | 1.0 ms | 213.7 ms | 74.7 ms |
+| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 | 346.5 ms | 3.7 ms | 1.3 ms | 1.0 ms | 243.9 ms | 107.3 ms |
 **Column definitions**
 - `ptt: full` - `read()` - all atoms, backbone, bonds, metadata
@@ -91,32 +94,42 @@ Run `python boltz_benchmark.py` to reproduce.
 | Structure | Res | full | backbone | bonds | MSA | dist mx |
 |---|---|---|---|---|---|---|
-| 1UBQ - Ubiquitin | 76 | 3x | 6x | 11x | 4x | 9x |
-| 6LU7 - SARS-CoV-2 Mpro | 312 | 10x | 24x | 43x | 6x | 15x |
-| 4HHB - Hemoglobin | 574 | 19x | 45x | 78x | 5x | 16x |
-| 6M0J - ACE2 + RBD | 791 | 26x | 61x | 102x | 5x | 12x |
-| 6VXX - Spike trimer | 2,916 | 87x | 223x | 308x | 1x* | 4x |
-| 6OHW - Cas12a | 3,525 | 108x | 284x | 370x | 1x* | 3x |
+| 1UBQ - Ubiquitin | 76 | 2x | 6x | 10x | 4x | 9x |
+| 6LU7 - SARS-CoV-2 Mpro | 312 | 9x | 21x | 38x | 5x | 15x |
+| 4HHB - Hemoglobin | 574 | 17x | 40x | 70x | 5x | 15x |
+| 6M0J - ACE2 + RBD | 791 | 22x | 54x | 92x | 5x | 11x |
+| 6VXX - Spike trimer | 2,916 | 76x | 201x | 285x | 1x* | 4x |
+| 6OHW - Cas12a | 3,525 | 95x | 257x | 343x | 1x* | 3x |
 *MSA speedup shown as 1x vs mmCIF parse because both are in the same time range for
 large proteins - the real MSA comparison is vs JackHMMER generation (see below).
 ### Feature assembly: time to prepare all tensors for model.forward()
-Traditional = mmCIF parse + read MSA from A3M file. ProteinTensor = single .ptt read
-with all features pre-cached (sequence, backbone, bonds, MSA, distance matrix,
-ESM2 embedding).
-| Structure | Res | Traditional | ProteinTensor | Speedup |
-|---|---|---|---|---|
-| 1UBQ - Ubiquitin | 76 | 22.7 ms | 5.2 ms | 4x |
-| 6LU7 - SARS-CoV-2 Mpro | 312 | 157.3 ms | 9.9 ms | 16x |
-| 4HHB - Hemoglobin | 574 | 525.5 ms | 17.7 ms | 30x |
-| 6M0J - ACE2 + RBD | 791 | 722.7 ms | 23.9 ms | 30x |
-| 6VXX - Spike trimer | 2,916 | 9,838.5 ms | 282.7 ms | 35x |
-| 6OHW - Cas12a | 3,525 | 11,903.1 ms | 348.4 ms | **34x** |
-Average speedup across all six structures: **34x** for full feature assembly.
+Traditional = mmCIF parse + A3M MSA parse + distance-matrix compute. ProteinTensor
+= read the structure, MSA, distance matrix, and ESM2 embedding from a single
+pre-cached `.ptt`. Reproduce with `python benchmarks/assembly_benchmark.py`
+(MSA depth and embedding shape are realistic; numeric content is synthetic, so
+timing reflects tensor dimensions, not values).
+| Structure | Res | MSA depth | Traditional | ProteinTensor | Speedup |
+|---|---|---|---|---|---|
+| 1UBQ - Ubiquitin | 76 | 512 | 14.1 ms | 7.1 ms | 2.0x |
+| 6LU7 - SARS-CoV-2 Mpro | 312 | 1,024 | 48.7 ms | 13.6 ms | 3.6x |
+| 4HHB - Hemoglobin | 574 | 2,048 | 118.0 ms | 22.7 ms | 5.2x |
+| 6M0J - ACE2 + RBD | 791 | 2,048 | 196.4 ms | 38.3 ms | 5.1x |
+| 6VXX - Spike trimer | 2,916 | 8,192 | 1,395 ms | 309 ms | 4.5x |
+| 6OHW - Cas12a | 3,525 | 8,192 | 1,462 ms | 381 ms | 3.8x |
+Average speedup across all six structures: **4x** for full feature assembly
+(measured on a Windows CPU box - see
+[`benchmarks/ASSEMBLY_RESULTS.md`](benchmarks/ASSEMBLY_RESULTS.md)).
+> **On an earlier 34x figure:** prior versions reported ~34x here. That number was
+> measured against ProteinTensor's original scalar A3M parser, which dominated the
+> traditional side (~11 s to parse an 8,192-deep MSA). Vectorizing that parser in
+> v0.2.0 cut the traditional baseline ~8x, so the *fair* feature-assembly speedup
+> is now ~4x. The `.ptt` read side was unchanged - only the baseline got faster.
 ### Drug target benchmark
@@ -126,21 +139,21 @@ IgG1 antibody. Numbers are consistent with the structural biology benchmark abov
 | Target | Res | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
 |---|---|---|---|---|---|---|---|
-| 6OIM - KRAS G12C + Sotorasib | 167 | 16.6 ms | 2.8 ms | 1.2 ms | 0.7 ms | 2.8 ms | 1.1 ms |
-| 3HTB - HIV-1 protease | 163 | 16.0 ms | 2.8 ms | 1.2 ms | 0.7 ms | 2.7 ms | 1.1 ms |
-| 5WT9 - PD-L1 checkpoint | 533 | 53.8 ms | 2.9 ms | 1.2 ms | 0.7 ms | 13.1 ms | 3.3 ms |
-| 1TUP - p53 tumor suppressor | 585 | 56.5 ms | 2.8 ms | 1.2 ms | 0.7 ms | 12.4 ms | 3.4 ms |
-| 2P4E - PCSK9 | 586 | 54.7 ms | 2.8 ms | 1.2 ms | 0.7 ms | 12.1 ms | 3.4 ms |
-| 1IGT - IgG1 antibody | 1,316 | 123.4 ms | 2.9 ms | 1.2 ms | 0.8 ms | 46.8 ms | 16.4 ms |
+| 6OIM - KRAS G12C + Sotorasib | 167 | 17.1 ms | 3.4 ms | 1.3 ms | 0.8 ms | 3.0 ms | 1.3 ms |
+| 3HTB - HIV-1 protease | 163 | 16.5 ms | 3.3 ms | 1.4 ms | 0.8 ms | 2.8 ms | 1.3 ms |
+| 5WT9 - PD-L1 checkpoint | 533 | 54.8 ms | 3.8 ms | 1.4 ms | 0.8 ms | 11.9 ms | 3.8 ms |
+| 1TUP - p53 tumor suppressor | 585 | 57.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 13.0 ms | 4.0 ms |
+| 2P4E - PCSK9 | 586 | 55.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 12.8 ms | 4.1 ms |
+| 1IGT - IgG1 antibody | 1,316 | 127.3 ms | 3.5 ms | 1.4 ms | 0.8 ms | 47.1 ms | 17.9 ms |
 | Target | Res | full | backbone | bonds | MSA | dist mx |
 |---|---|---|---|---|---|---|
-| 6OIM - KRAS G12C + Sotorasib | 167 | 6x | 14x | 24x | 6x | 15x |
-| 3HTB - HIV-1 protease | 163 | 6x | 14x | 23x | 6x | 14x |
-| 5WT9 - PD-L1 checkpoint | 533 | 19x | 44x | 77x | 4x | 16x |
-| 1TUP - p53 tumor suppressor | 585 | 20x | 47x | 80x | 5x | 17x |
-| 2P4E - PCSK9 | 586 | 19x | 46x | 77x | 5x | 16x |
-| 1IGT - IgG1 antibody | 1,316 | 42x | **100x** | **162x** | 3x | 8x |
+| 6OIM - KRAS G12C + Sotorasib | 167 | 5x | 13x | 22x | 6x | 13x |
+| 3HTB - HIV-1 protease | 163 | 5x | 12x | 21x | 6x | 13x |
+| 5WT9 - PD-L1 checkpoint | 533 | 15x | 40x | 69x | 5x | 14x |
+| 1TUP - p53 tumor suppressor | 585 | 17x | 42x | 71x | 4x | 14x |
+| 2P4E - PCSK9 | 586 | 16x | 41x | 70x | 4x | 14x |
+| 1IGT - IgG1 antibody | 1,316 | 37x | **92x** | **156x** | 3x | 7x |
 ### DataLoader batch throughput
@@ -149,26 +162,38 @@ padded batches ready for `model.forward()`. Single process, no prefetch workers.
 | Batch size | ms / batch | Structures / sec |
 |---|---|---|
-| 1 | 0.01 ms | 88,106 |
-| 4 | 0.04 ms | 108,696 |
-| 8 | 0.37 ms | 21,707 |
-| 16 | 0.95 ms | 16,783 |
-| 32 | 2.0 ms | **15,854** |
+| 1 | 0.01 ms | 97,088 |
+| 4 | 0.03 ms | 116,279 |
+| 8 | 0.42 ms | 19,242 |
+| 16 | 0.97 ms | 16,412 |
+| 32 | 2.1 ms | **15,033** |
 ### Scale projection: 100,000 structures, one training epoch
+These are **projections**, extrapolated from the measured per-structure timings
+above - not end-to-end measurements at 100k scale.
 | Operation | Traditional pipeline | ProteinTensor | Speedup |
 |---|---|---|---|
-| Structure load (parse mmCIF each epoch) | 3.7 hours | 5 min | **45x** |
-| Backbone-only load (template search) | 3.7 hours | 2 min | **109x** |
-| Full feature assembly (seq + MSA + pairs + emb) | 4.5 days | 3.2 hours | **34x** |
-| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.2 hours | **1,794x** |
+| Structure load (parse mmCIF each epoch) | 3.8 hours | 6 min | **37x** |
+| Backbone-only load (template search) | 3.8 hours | 2 min | **95x** |
+| Full feature assembly (seq + MSA + pairs + emb) | 16 hours | 3.9 hours | **4x** |
+| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.7 hours | **1,477x** |
 > MSA generation assumes 2.4 min/protein on a 32-core server (PDB90 database, standard
 > AlphaFold settings). ProteinTensor generates MSAs once and loads from the `.ptt` cache
 > on every subsequent run. The 4,000-hour figure is the real cost AlphaFold2 and Boltz
 > users pay to build training datasets from scratch.
+> **Measured vs projected - read this.** The **1,477x** above is MSA *generation*
+> (building the alignment once with JackHMMER) and is a **literature-based
+> projection**, not something benchmarked here. What *is* measured on hardware is
+> the recurring per-epoch MSA **load** - reading a cached MSA from `.ptt` vs
+> re-parsing A3M text each epoch (against a vectorized A3M parser baseline):
+> **3.4x-5.9x**, growing with MSA depth. See
+> [`benchmarks/MSA_RESULTS.md`](benchmarks/MSA_RESULTS.md). These are different
+> quantities; do not read the 1,477x as a measured load speedup.
 ### Disk tradeoff
 A full-featured `.ptt` (8,192-sequence MSA + distance matrix + ESM2-650M embedding at
@@ -224,6 +249,18 @@ pt.write(data, "ubq.ptt")
 data = pt.from_fasta("complex.fasta")
 ```
+### Batch-convert a directory
+Convert an entire directory of structures in parallel, with progress reporting.
+Files that fail to parse are skipped and listed in the summary; already-converted
+outputs are skipped by default.
+```bash
+proteintensor convert-dir ./pdb_files/ ./ptt_files/            # auto worker count
+proteintensor convert-dir ./pdb_files/ ./ptt_files/ --workers 16 --recursive
+proteintensor convert-dir ./pdb_files/ ./ptt_files/ --overwrite  # rebuild existing
+```
 ### Benchmark against mmCIF
 ```bash
@@ -308,6 +345,20 @@ pt.add_pair_feature("1abc.ptt", my_array, name="template_pair",
 emb = pt.read_embedding("1abc.ptt", "esm2_t33_650M_UR50D")
 emb.data.shape      # (N_res, 1280)  float32  (upcast from float16 on load)
+# ------ Ligands / small molecules ------
+# Capture drugs, cofactors, and ions from a structure (opt-in)
+data = pt.from_mmcif("6oim.cif", include_ligands=True)
+[l.name for l in data.ligands]        # ['MG', 'GDP', 'MOV']  (MOV = sotorasib)
+ligs = pt.read_ligands("6oim.ptt")
+ligs[0].elements                      # (N_atoms,)  S2  element symbols
+ligs[0].positions                     # (N_atoms, 3)  float32
+pt.list_ligands("6oim.ptt")           # ['MG', 'GDP', 'MOV']
+# Build a ligand from SMILES (needs `pip install "proteintensor[ligands]"`)
+aspirin = pt.from_smiles("CC(=O)Oc1ccccc1C(=O)O", name="AIN")
+pt.add_ligand("target.ptt", aspirin)  # attach to an existing .ptt
 # ------ Lazy / zero-copy access ------
 positions = pt.mmap_positions("1abc.ptt")       # zarr.Array - no full load
 backbone  = pt.mmap_backbone("1abc.ptt")        # [N_res, 4, 3]
@@ -349,6 +400,7 @@ data = pt.read(
 )
 # ------ Multi-structure dataset ------
+# Structure .ptt files and sequence-only .ptt files can be mixed in one dataset.
 pt.create_dataset("training.ptt")
 for ptt_file in Path("ptt_files").glob("*.ptt"):
     pt.add_to_dataset("training.ptt", ptt_file)
@@ -364,8 +416,13 @@ loader = DataLoader(ds, batch_size=8, collate_fn=pt.ProteinDataset.collate)
 for batch in loader:
     coords  = torch.from_numpy(batch["atom_positions"])   # (B, max_atoms, 3)
     pad     = torch.from_numpy(batch["padding_mask"])     # (B, max_res)  True=real
+    has_str = torch.from_numpy(batch["has_structure"])    # (B,)  False = sequence-only
 ```
+Sequence-only entries contribute zero atoms to the batch (`n_atoms == 0`,
+`has_structure == False`), so sequence-driven and structure-based samples can be
+loaded together in one `DataLoader`.
 ---
 ## .ptt file layout
@@ -402,10 +459,16 @@ structure.ptt/                      Zarr directory store (v0.7)
 │   └── <name>/                              one sub-group per named feature
 │       ├── .zattrs                          channels, symmetric, dtype, description
 │       └── data           [N_res, N_res, C] any dtype, chunked 128x128xC
-└── embeddings/
-    └── <model>/                             one sub-group per PLM model
-        ├── .zattrs                          model, layer, dim, dtype, seq SHA-256
-        └── data           [N_res, D]        float32 or float16, chunked 256xD
+├── embeddings/
+│   └── <model>/                             one sub-group per PLM model
+│       ├── .zattrs                          model, layer, dim, dtype, seq SHA-256
+│       └── data           [N_res, D]        float32 or float16, chunked 256xD
+└── ligands/
+    └── <index>/                            one sub-group per non-polymer ligand
+        ├── .zattrs                          name (CCD), chain_id, res_num, smiles
+        ├── elements       [N_atoms]         S2       element symbols
+        ├── positions      [N_atoms, 3]      float32  Angstrom coordinates
+        └── b_factors      [N_atoms]         float32
 ```
 ### Multi-structure dataset layout
@@ -443,9 +506,9 @@ Each sub-group under `structures/` is identical to a standalone `.ptt` root, so
 pytest tests/ -v
 ```
-106 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings,
-A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming
-(memory:// fsspec - no real cloud account required).
+150 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings/ligands,
+sequence conversion, A3M parsing, Boltz adapter, multi-structure dataset, and cloud
+streaming (memory:// fsspec - no real cloud account required).
 ---
@@ -466,11 +529,11 @@ A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming
 - [ ] Chai-1 adapter
 **Data pipeline**
-- [ ] Batch convert CLI - convert entire PDB directories in parallel with progress reporting
+- [x] Batch convert CLI - convert entire PDB directories in parallel with progress reporting
 - [ ] Sequence-identity dataset splitting - MMseqs2-based cluster splits to prevent data leakage between train / val / test
 **Format extensions**
-- [ ] Ligand / small-molecule support - SMILES, CCD-based atom graphs, binding site annotations for drug-protein interaction models
+- [x] Ligand / small-molecule support - CCD-based extraction from structures, SMILES input via RDKit, element/coordinate storage (bond graphs and binding-site annotations still to come)
 - [ ] MD trajectory storage - time axis `[N_frames, N_atoms, 3]` for conformational ensembles and AlphaFold 3 diffusion trajectories
 **Performance**
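
Note on the `has_structure` flag added to the DataLoader example in this file: a training loop that mixes structure and sequence-only samples will usually need to route the two kinds of entries differently. The sketch below is one way to do that, assuming only that `batch["has_structure"]` is an iterable of booleans of length `B`, as the added README comment describes. The helper `split_indices` is hypothetical and not part of proteintensor; a plain list stands in for the real batch field.

```python
def split_indices(has_structure):
    """Return (structure_idx, sequence_only_idx) for one collated batch.

    Accepts any iterable of truthy/falsy flags, e.g. a list or a 1-D
    boolean array.
    """
    structure_idx, sequence_idx = [], []
    for i, flag in enumerate(has_structure):
        if bool(flag):
            structure_idx.append(i)
        else:
            sequence_idx.append(i)
    return structure_idx, sequence_idx


# Stand-in for batch["has_structure"] from a batch of four samples.
has_structure = [True, False, True, True]
structure_idx, sequence_idx = split_indices(has_structure)
print("structure samples:", structure_idx)
print("sequence-only samples:", sequence_idx)
```

With NumPy arrays, `batch["atom_positions"][structure_idx]` would then select only the rows that carry coordinates, so a coordinate loss can skip the sequence-only entries.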

{proteintensor-0.2.0 → proteintensor-0.3.0}/proteintensor/__init__.py RENAMED

@@ -34,8 +34,10 @@ from .bonds import (
 from .dataset import ProteinDataset, create_dataset, add_to_dataset
 from .remote import consolidate
 from .converters import from_mmcif, from_sequence, from_fasta, parse_fasta
+from .ligands import read_ligands, list_ligands, add_ligand, from_smiles
+from .schema import LigandData
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 __all__ = [
     # Converters - input
@@ -51,6 +53,8 @@ __all__ = [
     "compute_and_store_distances", "compute_and_store_contacts",
     # I/O - embeddings
     "read_embedding", "add_embedding", "list_embeddings", "mmap_embedding",
+    # Ligands / small molecules
+    "read_ligands", "list_ligands", "add_ligand", "from_smiles", "LigandData",
     # Data containers
     "ProteinTensorData", "BackboneData", "BondData", "MsaData", "PairFeature", "EmbeddingData",
     # MSA utilities
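
Note on the new exports in this file: `read_ligands`, `list_ligands`, `add_ligand`, `from_smiles`, and `LigandData` first appear in 0.3.0, so code that must also run against 0.2.0 may want to gate on the installed version. The sketch below is a minimal version check under stated assumptions: `parse_version`, `has_ligand_api`, and `installed_has_ligand_api` are hypothetical helpers, and the parser handles only plain `X.Y.Z` strings (a pre-release such as `0.3.0rc1` is treated as `0.3.0`).

```python
import re
from importlib import metadata

# First release that exports the ligand functions, per this diff.
LIGAND_API_SINCE = (0, 3, 0)


def parse_version(text):
    """Naive parser: leading digits of the first three dot-separated parts."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad "1.2" to (1, 2, 0)
    return tuple(parts)


def has_ligand_api(version_text):
    """True if the given version string is at least 0.3.0."""
    return parse_version(version_text) >= LIGAND_API_SINCE


def installed_has_ligand_api():
    """Check the installed proteintensor, if any, without importing it."""
    try:
        return has_ligand_api(metadata.version("proteintensor"))
    except metadata.PackageNotFoundError:
        return False


print("ligand API available:", installed_has_ligand_api())
```

For anything beyond simple release numbers, the `packaging.version.Version` class is a more robust comparison than this hand-rolled parser.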
