proteintensor 0.1.3__tar.gz → 0.3.0__tar.gz
- {proteintensor-0.1.3 → proteintensor-0.3.0}/PKG-INFO +150 -61
- {proteintensor-0.1.3 → proteintensor-0.3.0}/README.md +147 -60
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/__init__.py +8 -1
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/cli.py +196 -2
- proteintensor-0.3.0/proteintensor/converters/__init__.py +4 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/converters/mmcif.py +15 -4
- proteintensor-0.3.0/proteintensor/converters/sequence.py +103 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/dataset.py +25 -13
- proteintensor-0.3.0/proteintensor/ligands.py +216 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/msa.py +38 -23
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/reader.py +9 -5
- proteintensor-0.3.0/proteintensor/schema.py +127 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/writer.py +15 -8
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/PKG-INFO +150 -61
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/SOURCES.txt +6 -1
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/requires.txt +3 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/pyproject.toml +6 -5
- proteintensor-0.3.0/tests/test_convert_dir.py +89 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_dataset.py +59 -0
- proteintensor-0.3.0/tests/test_ligands.py +123 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_msa.py +54 -0
- proteintensor-0.3.0/tests/test_sequence.py +150 -0
- proteintensor-0.1.3/proteintensor/converters/__init__.py +0 -3
- proteintensor-0.1.3/proteintensor/schema.py +0 -72
- {proteintensor-0.1.3 → proteintensor-0.3.0}/LICENSE +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/adapters/__init__.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/adapters/boltz.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/bonds.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/embeddings.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/pairs.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor/remote.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/dependency_links.txt +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/entry_points.txt +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/proteintensor.egg-info/top_level.txt +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/setup.cfg +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_adapters.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_embeddings.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_pairs.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_remote.py +0 -0
- {proteintensor-0.1.3 → proteintensor-0.3.0}/tests/test_roundtrip.py +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: proteintensor
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.3.0
|
|
4
4
|
Summary: AI-native biomolecular tensor format for structural biology ML
|
|
5
5
|
Author-email: Clayton Moore <claytonwaynemoore@gmail.com>
|
|
6
6
|
License-Expression: MIT
|
|
@@ -34,6 +34,8 @@ Provides-Extra: cloud
|
|
|
34
34
|
Requires-Dist: fsspec>=2023.1; extra == "cloud"
|
|
35
35
|
Requires-Dist: s3fs>=2023.1; extra == "cloud"
|
|
36
36
|
Requires-Dist: gcsfs>=2023.1; extra == "cloud"
|
|
37
|
+
Provides-Extra: ligands
|
|
38
|
+
Requires-Dist: rdkit>=2023.3; extra == "ligands"
|
|
37
39
|
Provides-Extra: dev
|
|
38
40
|
Requires-Dist: pytest>=7; extra == "dev"
|
|
39
41
|
Requires-Dist: pytest-benchmark; extra == "dev"
|
|
@@ -41,7 +43,9 @@ Requires-Dist: pytest-cov; extra == "dev"
|
|
|
41
43
|
Requires-Dist: fsspec>=2023.1; extra == "dev"
|
|
42
44
|
Dynamic: license-file
|
|
43
45
|
|
|
44
|
-
|
|
46
|
+

|
|
47
|
+
|
|
48
|
+
# ProteinTensor Introduction
|
|
45
49
|
|
|
46
50
|
**ProteinTensor** is an AI-native biomolecular storage format designed to eliminate
|
|
47
51
|
the preprocessing bottleneck in modern structural biology machine learning pipelines.
|
|
@@ -108,20 +112,21 @@ performance format that turns a recurring computational tax into a one-time cost
|
|
|
108
112
|
|
|
109
113
|
## Benchmark: Traditional Pipeline vs ProteinTensor
|
|
110
114
|
|
|
111
|
-
All timings are median over 30 rounds on
|
|
112
|
-
|
|
113
|
-
|
|
115
|
+
All timings are median over 30 rounds on a Windows workstation (RTX 5080, Python
|
|
116
|
+
3.11.9); mmCIF parsing and `.ptt` reads are CPU-bound, so these reflect CPU
|
|
117
|
+
performance. Proteins span the full range from a 76-residue domain to a
|
|
118
|
+
3,525-residue CRISPR enzyme. Run `python boltz_benchmark.py` to reproduce.
|
|
114
119
|
|
|
115
120
|
### Per-structure load times
|
|
116
121
|
|
|
117
122
|
| Structure | Method | Res | MSA seqs | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
|
|
118
123
|
|---|---|---|---|---|---|---|---|---|---|
|
|
119
|
-
| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.
|
|
120
|
-
| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 |
|
|
121
|
-
| 4HHB - Hemoglobin | X-ray | 574 | 2,048 |
|
|
122
|
-
| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 |
|
|
123
|
-
| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.
|
|
124
|
-
| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 |
|
|
124
|
+
| 1UBQ - Ubiquitin | X-ray | 76 | 512 | 7.4 ms | 3.2 ms | 1.3 ms | 0.8 ms | 1.8 ms | 0.8 ms |
|
|
125
|
+
| 6LU7 - SARS-CoV-2 Mpro | X-ray | 312 | 1,024 | 28.7 ms | 3.3 ms | 1.3 ms | 0.8 ms | 5.2 ms | 1.9 ms |
|
|
126
|
+
| 4HHB - Hemoglobin | X-ray | 574 | 2,048 | 54.1 ms | 3.3 ms | 1.3 ms | 0.8 ms | 11.5 ms | 3.6 ms |
|
|
127
|
+
| 6M0J - ACE2 + RBD | Cryo-EM | 791 | 2,048 | 73.2 ms | 3.3 ms | 1.4 ms | 0.8 ms | 15.3 ms | 6.9 ms |
|
|
128
|
+
| 6VXX - Spike trimer | Cryo-EM | 2,916 | 8,192 | 283.9 ms | 3.7 ms | 1.4 ms | 1.0 ms | 213.7 ms | 74.7 ms |
|
|
129
|
+
| 6OHW - Cas12a | Cryo-EM | 3,525 | 8,192 | 346.5 ms | 3.7 ms | 1.3 ms | 1.0 ms | 243.9 ms | 107.3 ms |
|
|
125
130
|
|
|
126
131
|
**Column definitions**
|
|
127
132
|
- `ptt: full` - `read()` - all atoms, backbone, bonds, metadata
|
|
@@ -134,32 +139,42 @@ Run `python boltz_benchmark.py` to reproduce.
|
|
|
134
139
|
|
|
135
140
|
| Structure | Res | full | backbone | bonds | MSA | dist mx |
|
|
136
141
|
|---|---|---|---|---|---|---|
|
|
137
|
-
| 1UBQ - Ubiquitin | 76 |
|
|
138
|
-
| 6LU7 - SARS-CoV-2 Mpro | 312 |
|
|
139
|
-
| 4HHB - Hemoglobin | 574 |
|
|
140
|
-
| 6M0J - ACE2 + RBD | 791 |
|
|
141
|
-
| 6VXX - Spike trimer | 2,916 |
|
|
142
|
-
| 6OHW - Cas12a | 3,525 |
|
|
142
|
+
| 1UBQ - Ubiquitin | 76 | 2x | 6x | 10x | 4x | 9x |
|
|
143
|
+
| 6LU7 - SARS-CoV-2 Mpro | 312 | 9x | 21x | 38x | 5x | 15x |
|
|
144
|
+
| 4HHB - Hemoglobin | 574 | 17x | 40x | 70x | 5x | 15x |
|
|
145
|
+
| 6M0J - ACE2 + RBD | 791 | 22x | 54x | 92x | 5x | 11x |
|
|
146
|
+
| 6VXX - Spike trimer | 2,916 | 76x | 201x | 285x | 1x* | 4x |
|
|
147
|
+
| 6OHW - Cas12a | 3,525 | 95x | 257x | 343x | 1x* | 3x |
|
|
143
148
|
|
|
144
149
|
*MSA speedup shown as 1x vs mmCIF parse because both are in the same time range for
|
|
145
150
|
large proteins - the real MSA comparison is vs JackHMMER generation (see below).
|
|
146
151
|
|
|
147
152
|
### Feature assembly: time to prepare all tensors for model.forward()
|
|
148
153
|
|
|
149
|
-
Traditional = mmCIF parse +
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
|
|
|
156
|
-
|
|
157
|
-
|
|
|
158
|
-
|
|
|
159
|
-
|
|
|
160
|
-
|
|
|
161
|
-
|
|
162
|
-
|
|
154
|
+
Traditional = mmCIF parse + A3M MSA parse + distance-matrix compute. ProteinTensor
|
|
155
|
+
= read the structure, MSA, distance matrix, and ESM2 embedding from a single
|
|
156
|
+
pre-cached `.ptt`. Reproduce with `python benchmarks/assembly_benchmark.py`
|
|
157
|
+
(MSA depth and embedding shape are realistic; numeric content is synthetic, so
|
|
158
|
+
timing reflects tensor dimensions, not values).
|
|
159
|
+
|
|
160
|
+
| Structure | Res | MSA depth | Traditional | ProteinTensor | Speedup |
|
|
161
|
+
|---|---|---|---|---|---|
|
|
162
|
+
| 1UBQ - Ubiquitin | 76 | 512 | 14.1 ms | 7.1 ms | 2.0x |
|
|
163
|
+
| 6LU7 - SARS-CoV-2 Mpro | 312 | 1,024 | 48.7 ms | 13.6 ms | 3.6x |
|
|
164
|
+
| 4HHB - Hemoglobin | 574 | 2,048 | 118.0 ms | 22.7 ms | 5.2x |
|
|
165
|
+
| 6M0J - ACE2 + RBD | 791 | 2,048 | 196.4 ms | 38.3 ms | 5.1x |
|
|
166
|
+
| 6VXX - Spike trimer | 2,916 | 8,192 | 1,395 ms | 309 ms | 4.5x |
|
|
167
|
+
| 6OHW - Cas12a | 3,525 | 8,192 | 1,462 ms | 381 ms | 3.8x |
|
|
168
|
+
|
|
169
|
+
Average speedup across all six structures: **4x** for full feature assembly
|
|
170
|
+
(measured on a Windows CPU box - see
|
|
171
|
+
[`benchmarks/ASSEMBLY_RESULTS.md`](benchmarks/ASSEMBLY_RESULTS.md)).
|
|
172
|
+
|
|
173
|
+
> **On an earlier 34x figure:** prior versions reported ~34x here. That number was
|
|
174
|
+
> measured against ProteinTensor's original scalar A3M parser, which dominated the
|
|
175
|
+
> traditional side (~11 s to parse an 8,192-deep MSA). Vectorizing that parser in
|
|
176
|
+
> v0.2.0 cut the traditional baseline ~8x, so the *fair* feature-assembly speedup
|
|
177
|
+
> is now ~4x. The `.ptt` read side was unchanged - only the baseline got faster.
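The speedup column and the quoted average can be re-derived from the rounded timings in the feature-assembly table. The sketch below is only a consistency check on the published numbers, not the project's benchmark script; the dictionary name and the choice of a plain arithmetic mean of per-structure ratios are assumptions made for illustration.

```python
# Rounded timings in milliseconds, copied from the feature-assembly table:
# (traditional pipeline, ProteinTensor) per structure.
ASSEMBLY_MS = {
    "1UBQ": (14.1, 7.1),
    "6LU7": (48.7, 13.6),
    "4HHB": (118.0, 22.7),
    "6M0J": (196.4, 38.3),
    "6VXX": (1395.0, 309.0),
    "6OHW": (1462.0, 381.0),
}


def speedup(traditional_ms: float, ptt_ms: float) -> float:
    """Ratio of traditional assembly time to ProteinTensor assembly time."""
    return traditional_ms / ptt_ms


# Per-structure speedups, rounded to one decimal like the table's last column.
speedups = {name: round(speedup(t, p), 1) for name, (t, p) in ASSEMBLY_MS.items()}

# Assumption: "average speedup" is the arithmetic mean of the per-structure ratios.
mean_speedup = sum(speedups.values()) / len(speedups)

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
print(f"mean: {mean_speedup:.1f}x")
```

With these inputs every per-structure ratio matches the Speedup column, and the mean rounds to the 4x quoted in the text.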
|
|
163
178
|
|
|
164
179
|
### Drug target benchmark
|
|
165
180
|
|
|
@@ -169,21 +184,21 @@ IgG1 antibody. Numbers are consistent with the structural biology benchmark abov
|
|
|
169
184
|
|
|
170
185
|
| Target | Res | mmCIF parse | ptt: full | ptt: backbone | ptt: bonds | ptt: MSA | ptt: dist mx |
|
|
171
186
|
|---|---|---|---|---|---|---|---|
|
|
172
|
-
| 6OIM - KRAS G12C + Sotorasib | 167 |
|
|
173
|
-
| 3HTB - HIV-1 protease | 163 | 16.
|
|
174
|
-
| 5WT9 - PD-L1 checkpoint | 533 |
|
|
175
|
-
| 1TUP - p53 tumor suppressor | 585 |
|
|
176
|
-
| 2P4E - PCSK9 | 586 |
|
|
177
|
-
| 1IGT - IgG1 antibody | 1,316 |
|
|
187
|
+
| 6OIM - KRAS G12C + Sotorasib | 167 | 17.1 ms | 3.4 ms | 1.3 ms | 0.8 ms | 3.0 ms | 1.3 ms |
|
|
188
|
+
| 3HTB - HIV-1 protease | 163 | 16.5 ms | 3.3 ms | 1.4 ms | 0.8 ms | 2.8 ms | 1.3 ms |
|
|
189
|
+
| 5WT9 - PD-L1 checkpoint | 533 | 54.8 ms | 3.8 ms | 1.4 ms | 0.8 ms | 11.9 ms | 3.8 ms |
|
|
190
|
+
| 1TUP - p53 tumor suppressor | 585 | 57.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 13.0 ms | 4.0 ms |
|
|
191
|
+
| 2P4E - PCSK9 | 586 | 55.4 ms | 3.4 ms | 1.4 ms | 0.8 ms | 12.8 ms | 4.1 ms |
|
|
192
|
+
| 1IGT - IgG1 antibody | 1,316 | 127.3 ms | 3.5 ms | 1.4 ms | 0.8 ms | 47.1 ms | 17.9 ms |
|
|
178
193
|
|
|
179
194
|
| Target | Res | full | backbone | bonds | MSA | dist mx |
|
|
180
195
|
|---|---|---|---|---|---|---|
|
|
181
|
-
| 6OIM - KRAS G12C + Sotorasib | 167 |
|
|
182
|
-
| 3HTB - HIV-1 protease | 163 |
|
|
183
|
-
| 5WT9 - PD-L1 checkpoint | 533 |
|
|
184
|
-
| 1TUP - p53 tumor suppressor | 585 |
|
|
185
|
-
| 2P4E - PCSK9 | 586 |
|
|
186
|
-
| 1IGT - IgG1 antibody | 1,316 |
|
|
196
|
+
| 6OIM - KRAS G12C + Sotorasib | 167 | 5x | 13x | 22x | 6x | 13x |
|
|
197
|
+
| 3HTB - HIV-1 protease | 163 | 5x | 12x | 21x | 6x | 13x |
|
|
198
|
+
| 5WT9 - PD-L1 checkpoint | 533 | 15x | 40x | 69x | 5x | 14x |
|
|
199
|
+
| 1TUP - p53 tumor suppressor | 585 | 17x | 42x | 71x | 4x | 14x |
|
|
200
|
+
| 2P4E - PCSK9 | 586 | 16x | 41x | 70x | 4x | 14x |
|
|
201
|
+
| 1IGT - IgG1 antibody | 1,316 | 37x | **92x** | **156x** | 3x | 7x |
|
|
187
202
|
|
|
188
203
|
### DataLoader batch throughput
|
|
189
204
|
|
|
@@ -192,26 +207,38 @@ padded batches ready for `model.forward()`. Single process, no prefetch workers.
|
|
|
192
207
|
|
|
193
208
|
| Batch size | ms / batch | Structures / sec |
|
|
194
209
|
|---|---|---|
|
|
195
|
-
| 1 | 0.01 ms |
|
|
196
|
-
| 4 | 0.
|
|
197
|
-
| 8 | 0.
|
|
198
|
-
| 16 | 0.
|
|
199
|
-
| 32 | 2.
|
|
210
|
+
| 1 | 0.01 ms | 97,088 |
|
|
211
|
+
| 4 | 0.03 ms | 116,279 |
|
|
212
|
+
| 8 | 0.42 ms | 19,242 |
|
|
213
|
+
| 16 | 0.97 ms | 16,412 |
|
|
214
|
+
| 32 | 2.1 ms | **15,033** |
|
|
200
215
|
|
|
201
216
|
### Scale projection: 100,000 structures, one training epoch
|
|
202
217
|
|
|
218
|
+
These are **projections**, extrapolated from the measured per-structure timings
|
|
219
|
+
above - not end-to-end measurements at 100k scale.
|
|
220
|
+
|
|
203
221
|
| Operation | Traditional pipeline | ProteinTensor | Speedup |
|
|
204
222
|
|---|---|---|---|
|
|
205
|
-
| Structure load (parse mmCIF each epoch) | 3.
|
|
206
|
-
| Backbone-only load (template search) | 3.
|
|
207
|
-
| Full feature assembly (seq + MSA + pairs + emb) |
|
|
208
|
-
| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.
|
|
223
|
+
| Structure load (parse mmCIF each epoch) | 3.8 hours | 6 min | **37x** |
|
|
224
|
+
| Backbone-only load (template search) | 3.8 hours | 2 min | **95x** |
|
|
225
|
+
| Full feature assembly (seq + MSA + pairs + emb) | 16 hours | 3.9 hours | **4x** |
|
|
226
|
+
| MSA generation (JackHMMER, 32-core CPU, once) | 4,000 hours | 2.7 hours | **1,477x** |
|
|
209
227
|
|
|
210
228
|
> MSA generation assumes 2.4 min/protein on a 32-core server (PDB90 database, standard
|
|
211
229
|
> AlphaFold settings). ProteinTensor generates MSAs once and loads from the `.ptt` cache
|
|
212
230
|
> on every subsequent run. The 4,000-hour figure is the real cost AlphaFold2 and Boltz
|
|
213
231
|
> users pay to build training datasets from scratch.
|
|
214
232
|
|
|
233
|
+
> **Measured vs projected - read this.** The **1,477x** above is MSA *generation*
|
|
234
|
+
> (building the alignment once with JackHMMER) and is a **literature-based
|
|
235
|
+
> projection**, not something benchmarked here. What *is* measured on hardware is
|
|
236
|
+
> the recurring per-epoch MSA **load** - reading a cached MSA from `.ptt` vs
|
|
237
|
+
> re-parsing A3M text each epoch (against a vectorized A3M parser baseline):
|
|
238
|
+
> **3.4x-5.9x**, growing with MSA depth. See
|
|
239
|
+
> [`benchmarks/MSA_RESULTS.md`](benchmarks/MSA_RESULTS.md). These are different
|
|
240
|
+
> quantities; do not read the 1,477x as a measured load speedup.
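The projection arithmetic is simple linear scaling, and a short sketch makes the units explicit. The helper names below are hypothetical and are not part of the package. The only inputs are figures quoted above: 100,000 structures, 2.4 min/protein for MSA generation, and the 3.8-hour structure-load row.

```python
N_STRUCTURES = 100_000


def project_hours(per_item_seconds: float, n_items: int) -> float:
    """Extrapolate a per-item cost to a whole dataset, in hours (linear scaling)."""
    return per_item_seconds * n_items / 3600.0


def implied_ms_per_item(total_hours: float, n_items: int) -> float:
    """Invert a projected total back into the per-item cost it implies, in ms."""
    return total_hours * 3600.0 * 1000.0 / n_items


# MSA generation: 2.4 minutes per protein, as assumed in the note above.
msa_generation_hours = project_hours(2.4 * 60, N_STRUCTURES)
print(f"MSA generation: {msa_generation_hours:,.0f} hours")

# Structure load: the 3.8-hour row implies this average mmCIF parse cost.
implied_ms = implied_ms_per_item(3.8, N_STRUCTURES)
print(f"Implied mmCIF parse cost: {implied_ms:.1f} ms per structure")
```

The first result reproduces the 4,000-hour figure. The second gives the average per-structure parse time implied by the structure-load row, which falls inside the measured 7.4 ms to 346.5 ms range.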
|
|
241
|
+
|
|
215
242
|
### Disk tradeoff
|
|
216
243
|
|
|
217
244
|
A full-featured `.ptt` (8,192-sequence MSA + distance matrix + ESM2-650M embedding at
|
|
@@ -243,6 +270,42 @@ proteintensor convert 1abc.cif 1abc.ptt
|
|
|
243
270
|
proteintensor info 1abc.ptt
|
|
244
271
|
```
|
|
245
272
|
|
|
273
|
+
### Convert a sequence (no structure required)
|
|
274
|
+
|
|
275
|
+
For sequence-driven predictors like AlphaFold and Boltz, the primary input is a
|
|
276
|
+
sequence, not a structure. ProteinTensor can build a sequence-only `.ptt` (no
|
|
277
|
+
coordinates) directly from a raw string or a FASTA file:
|
|
278
|
+
|
|
279
|
+
```bash
|
|
280
|
+
proteintensor convert-seq MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDG ubq.ptt
|
|
281
|
+
proteintensor convert-seq complex.fasta complex.ptt # multi-record FASTA -> multi-chain
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
```python
|
|
285
|
+
import proteintensor as pt
|
|
286
|
+
|
|
287
|
+
data = pt.from_sequence("MQIFVKTLTGK...", pdb_id="UBQ", chain_id="A")
|
|
288
|
+
data.has_structure # False - sequence-only entry
|
|
289
|
+
data.sequence_tokens # (N_res,) int32
|
|
290
|
+
|
|
291
|
+
pt.write(data, "ubq.ptt")
|
|
292
|
+
|
|
293
|
+
# FASTA: a single record -> one chain; multiple records -> multi-chain complex
|
|
294
|
+
data = pt.from_fasta("complex.fasta")
|
|
295
|
+
```
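To illustrate the "single record -> one chain; multiple records -> multi-chain" rule, here is a minimal standard-library sketch of FASTA parsing. It is not the implementation behind `pt.from_fasta`. The function names and the A, B, C, ... chain-naming convention are assumptions for illustration only.

```python
import string
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs, joining wrapped sequence lines."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def assign_chains(records: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map records to chain IDs in file order (hypothetical A, B, C, ... naming)."""
    if len(records) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 chains")
    return {string.ascii_uppercase[i]: seq for i, (_, seq) in enumerate(records)}


fasta_text = """>ubiquitin
MQIFVKTLTGK
>partner
GSHMQIF
"""
chains = assign_chains(parse_fasta(fasta_text.splitlines()))
print(chains)
```

One record yields a single chain and several records yield one chain per record, which is the behaviour the comment above describes.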
|
|
296
|
+
|
|
297
|
+
### Batch-convert a directory
|
|
298
|
+
|
|
299
|
+
Convert an entire directory of structures in parallel, with progress reporting.
|
|
300
|
+
Files that fail to parse are skipped and listed in the summary; already-converted
|
|
301
|
+
outputs are skipped by default.
|
|
302
|
+
|
|
303
|
+
```bash
|
|
304
|
+
proteintensor convert-dir ./pdb_files/ ./ptt_files/ # auto worker count
|
|
305
|
+
proteintensor convert-dir ./pdb_files/ ./ptt_files/ --workers 16 --recursive
|
|
306
|
+
proteintensor convert-dir ./pdb_files/ ./ptt_files/ --overwrite # rebuild existing
|
|
307
|
+
```
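The behaviour described for `convert-dir` (skip existing outputs unless `--overwrite`, optionally recurse, collect failures instead of aborting) can be sketched with the standard library. This is an illustration under stated assumptions, not the package's CLI code: `plan_jobs`, `run_jobs`, the accepted input suffixes, and the `convert` callable are all hypothetical. A thread pool keeps the sketch simple, whereas CPU-bound parsing would more likely use processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Job = Tuple[Path, Path]


def plan_jobs(src: Path, dst: Path, recursive: bool = False, overwrite: bool = False,
              suffixes: Sequence[str] = (".cif", ".pdb")) -> List[Job]:
    """List (input, output) pairs, skipping outputs that already exist."""
    pattern = "**/*" if recursive else "*"
    jobs: List[Job] = []
    for path in sorted(src.glob(pattern)):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        out = dst / path.relative_to(src).with_suffix(".ptt")
        if out.exists() and not overwrite:
            continue  # already converted; rebuilt only when overwrite=True
        jobs.append((path, out))
    return jobs


def run_jobs(jobs: List[Job], convert: Callable[[Path, Path], None],
             workers: int = 4) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Run conversions in parallel; failures are collected, not raised."""
    done: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, src, out): src for src, out in jobs}
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()
                done.append(src)
            except Exception as exc:  # a bad file must not stop the batch
                failed.append((src, str(exc)))
    return done, failed
```

A caller would pass its own conversion function as `convert`, then print `failed` as the end-of-run summary of skipped files.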
|
|
308
|
+
|
|
246
309
|
### Benchmark against mmCIF
|
|
247
310
|
|
|
248
311
|
```bash
|
|
@@ -327,6 +390,20 @@ pt.add_pair_feature("1abc.ptt", my_array, name="template_pair",
|
|
|
327
390
|
emb = pt.read_embedding("1abc.ptt", "esm2_t33_650M_UR50D")
|
|
328
391
|
emb.data.shape # (N_res, 1280) float32 (upcast from float16 on load)
|
|
329
392
|
|
|
393
|
+
# ------ Ligands / small molecules ------
|
|
394
|
+
# Capture drugs, cofactors, and ions from a structure (opt-in)
|
|
395
|
+
data = pt.from_mmcif("6oim.cif", include_ligands=True)
|
|
396
|
+
[l.name for l in data.ligands] # ['MG', 'GDP', 'MOV'] (MOV = sotorasib)
|
|
397
|
+
|
|
398
|
+
ligs = pt.read_ligands("6oim.ptt")
|
|
399
|
+
ligs[0].elements # (N_atoms,) S2 element symbols
|
|
400
|
+
ligs[0].positions # (N_atoms, 3) float32
|
|
401
|
+
pt.list_ligands("6oim.ptt") # ['MG', 'GDP', 'MOV']
|
|
402
|
+
|
|
403
|
+
# Build a ligand from SMILES (needs `pip install "proteintensor[ligands]"`)
|
|
404
|
+
aspirin = pt.from_smiles("CC(=O)Oc1ccccc1C(=O)O", name="AIN")
|
|
405
|
+
pt.add_ligand("target.ptt", aspirin) # attach to an existing .ptt
|
|
406
|
+
|
|
330
407
|
# ------ Lazy / zero-copy access ------
|
|
331
408
|
positions = pt.mmap_positions("1abc.ptt") # zarr.Array - no full load
|
|
332
409
|
backbone = pt.mmap_backbone("1abc.ptt") # [N_res, 4, 3]
|
|
@@ -368,6 +445,7 @@ data = pt.read(
|
|
|
368
445
|
)
|
|
369
446
|
|
|
370
447
|
# ------ Multi-structure dataset ------
|
|
448
|
+
# Structure .ptt files and sequence-only .ptt files can be mixed in one dataset.
|
|
371
449
|
pt.create_dataset("training.ptt")
|
|
372
450
|
for ptt_file in Path("ptt_files").glob("*.ptt"):
|
|
373
451
|
pt.add_to_dataset("training.ptt", ptt_file)
|
|
@@ -383,8 +461,13 @@ loader = DataLoader(ds, batch_size=8, collate_fn=pt.ProteinDataset.collate)
|
|
|
383
461
|
for batch in loader:
|
|
384
462
|
coords = torch.from_numpy(batch["atom_positions"]) # (B, max_atoms, 3)
|
|
385
463
|
pad = torch.from_numpy(batch["padding_mask"]) # (B, max_res) True=real
|
|
464
|
+
has_str = torch.from_numpy(batch["has_structure"]) # (B,) False = sequence-only
|
|
386
465
|
```
|
|
387
466
|
|
|
467
|
+
Sequence-only entries contribute zero atoms to the batch (`n_atoms == 0`,
|
|
468
|
+
`has_structure == False`), so sequence-driven and structure-based samples can be
|
|
469
|
+
loaded together in one `DataLoader`.
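As a rough picture of how a mixed batch could be padded, the sketch below pads plain Python lists. It is not `pt.ProteinDataset.collate`, which returns NumPy arrays. The `sequence_tokens` batch key, the zero pad value, and deriving `has_structure` from the atom count are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def collate(samples: Sequence[Dict[str, list]]) -> Dict[str, list]:
    """Pad tokens and coordinates to the longest sample in the batch."""
    max_res = max(len(s["sequence_tokens"]) for s in samples)
    max_atoms = max(len(s["atom_positions"]) for s in samples)
    batch: Dict[str, List] = {
        "sequence_tokens": [],
        "padding_mask": [],
        "atom_positions": [],
        "has_structure": [],
    }
    for s in samples:
        n_res = len(s["sequence_tokens"])
        n_atoms = len(s["atom_positions"])
        # Pad tokens with 0 (assumed pad value); the mask marks real residues as True.
        batch["sequence_tokens"].append(list(s["sequence_tokens"]) + [0] * (max_res - n_res))
        batch["padding_mask"].append([True] * n_res + [False] * (max_res - n_res))
        # Sequence-only samples have no atoms, so their rows are all padding.
        coords = [list(p) for p in s["atom_positions"]]
        coords += [[0.0, 0.0, 0.0] for _ in range(max_atoms - n_atoms)]
        batch["atom_positions"].append(coords)
        batch["has_structure"].append(n_atoms > 0)
    return batch


mixed = collate([
    {"sequence_tokens": [1, 2, 3], "atom_positions": [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]},
    {"sequence_tokens": [4, 5], "atom_positions": []},  # sequence-only entry
])
print(mixed["has_structure"])
```

Downstream code can use the `has_structure` flags to apply coordinate-based losses only to samples that actually carry coordinates.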
|
|
470
|
+
|
|
388
471
|
---
|
|
389
472
|
|
|
390
473
|
## .ptt file layout
|
|
@@ -421,10 +504,16 @@ structure.ptt/ Zarr directory store (v0.7)
|
|
|
421
504
|
│ └── <name>/ one sub-group per named feature
|
|
422
505
|
│ ├── .zattrs channels, symmetric, dtype, description
|
|
423
506
|
│ └── data [N_res, N_res, C] any dtype, chunked 128x128xC
|
|
424
|
-
|
|
425
|
-
|
|
426
|
-
|
|
427
|
-
|
|
507
|
+
├── embeddings/
|
|
508
|
+
│ └── <model>/ one sub-group per PLM model
|
|
509
|
+
│ ├── .zattrs model, layer, dim, dtype, seq SHA-256
|
|
510
|
+
│ └── data [N_res, D] float32 or float16, chunked 256xD
|
|
511
|
+
└── ligands/
|
|
512
|
+
└── <index>/ one sub-group per non-polymer ligand
|
|
513
|
+
├── .zattrs name (CCD), chain_id, res_num, smiles
|
|
514
|
+
├── elements [N_atoms] S2 element symbols
|
|
515
|
+
├── positions [N_atoms, 3] float32 Angstrom coordinates
|
|
516
|
+
└── b_factors [N_atoms] float32
|
|
428
517
|
```
|
|
429
518
|
|
|
430
519
|
### Multi-structure dataset layout
|
|
@@ -462,9 +551,9 @@ Each sub-group under `structures/` is identical to a standalone `.ptt` root, so
|
|
|
462
551
|
pytest tests/ -v
|
|
463
552
|
```
|
|
464
553
|
|
|
465
|
-
|
|
466
|
-
A3M parsing, Boltz adapter, multi-structure dataset, and cloud
|
|
467
|
-
(memory:// fsspec - no real cloud account required).
|
|
554
|
+
150 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings/ligands,
|
|
555
|
+
sequence conversion, A3M parsing, Boltz adapter, multi-structure dataset, and cloud
|
|
556
|
+
streaming (memory:// fsspec - no real cloud account required).
|
|
468
557
|
|
|
469
558
|
---
|
|
470
559
|
|
|
@@ -485,11 +574,11 @@ A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming
|
|
|
485
574
|
- [ ] Chai-1 adapter
|
|
486
575
|
|
|
487
576
|
**Data pipeline**
|
|
488
|
-
- [
|
|
577
|
+
- [x] Batch convert CLI - convert entire PDB directories in parallel with progress reporting
|
|
489
578
|
- [ ] Sequence-identity dataset splitting - MMseqs2-based cluster splits to prevent data leakage between train / val / test
|
|
490
579
|
|
|
491
580
|
**Format extensions**
|
|
492
|
-
- [
|
|
581
|
+
- [x] Ligand / small-molecule support - CCD-based extraction from structures, SMILES input via RDKit, element/coordinate storage (bond graphs and binding-site annotations still to come)
|
|
493
582
|
- [ ] MD trajectory storage - time axis `[N_frames, N_atoms, 3]` for conformational ensembles and AlphaFold 3 diffusion trajectories
|
|
494
583
|
|
|
495
584
|
**Performance**
|