sawnergy 1.0.6__tar.gz → 1.0.8__tar.gz
Potentially problematic release.
- {sawnergy-1.0.6/sawnergy.egg-info → sawnergy-1.0.8}/PKG-INFO +79 -56
- {sawnergy-1.0.6 → sawnergy-1.0.8}/README.md +78 -55
- sawnergy-1.0.8/sawnergy/embedding/SGNS_pml.py +368 -0
- sawnergy-1.0.8/sawnergy/embedding/SGNS_torch.py +364 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/embedding/__init__.py +24 -0
- sawnergy-1.0.8/sawnergy/embedding/embedder.py +714 -0
- sawnergy-1.0.8/sawnergy/embedding/visualizer.py +251 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/logging_util.py +1 -1
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/rin_builder.py +1 -1
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/visualizer.py +6 -6
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/visualizer_util.py +3 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8/sawnergy.egg-info}/PKG-INFO +79 -56
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/SOURCES.txt +2 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_embedding.py +103 -6
- sawnergy-1.0.8/tests/test_embedding_visualizer.py +58 -0
- sawnergy-1.0.6/sawnergy/embedding/SGNS_pml.py +0 -172
- sawnergy-1.0.6/sawnergy/embedding/SGNS_torch.py +0 -177
- sawnergy-1.0.6/sawnergy/embedding/embedder.py +0 -584
- {sawnergy-1.0.6 → sawnergy-1.0.8}/LICENSE +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/NOTICE +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/__init__.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/__init__.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/rin_util.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/sawnergy_util.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/__init__.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/__init__.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/walker.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/walker_util.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/dependency_links.txt +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/requires.txt +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/top_level.txt +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/setup.cfg +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_rin.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_storage.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_visual.py +0 -0
- {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_walks.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sawnergy
-Version: 1.0.
+Version: 1.0.8
 Summary: Toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations
 Home-page: https://github.com/Yehor-Mishchyriak/SAWNERGY
 Author: Yehor Mishchyriak
@@ -39,19 +39,57 @@ Dynamic: summary
 
 
 A toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations, sampling
-random and self-avoiding walks, learning node embeddings, and
+random and self-avoiding walks, learning node embeddings, and visualizing residue interaction networks (RINs). SAWNERGY
 keeps the full workflow — from `cpptraj` output to skip-gram embeddings (node2vec approach) — inside Python, backed by efficient Zarr-based archives and optional GPU acceleration.
 
 ---
 
+## Installation
+
+```bash
+pip install sawnergy
+```
+
+> **Optional:** For GPU training, install PyTorch separately (e.g., `pip install torch`).
+> **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
+> environment variable. Probably the easiest solution: install AmberTools via Conda, activate the environment, and SAWNERGY will find the cpptraj executable on its own, so just run your code and don't worry about it.
+
+---
+
+# UPDATES:
+
+## v1.0.8 — What’s new:
+- **Temporary deprecation of `SGNS_Torch`**
+  - `sawnergy.embedding.SGNS_Torch` currently produces noisy embeddings in practice. The issue likely stems from **weight initialization**, although the root cause has not yet been conclusively determined.
+  - **Action:** The class and its `__init__` docstring now carry a deprecation notice. Constructing the class emits a **`DeprecationWarning`** and logs a **warning**.
+  - **Use instead:** Prefer **`SG_Torch`** (plain Skip-Gram with full softmax) or the PureML backends **`SGNS_PureML`** / **`SG_PureML`**.
+  - **Compatibility:** No breaking API changes; imports remain stable. PureML backends are unaffected.
+- **Embedding visualizer update**
+  - Now you can L2 normalize your embeddings before display.
+- **Small improvements in the embedding module**
+  - Improved API with a lot of good defaults in place to ease usage out of the box.
+  - Small internal model tweaks.
+
+## v1.0.7 — What’s new:
+- **Added plain Skip-Gram model**
+  - Now, the user can choose if they want to apply the negative sampling technique (two binary classifiers) or train a single classifier over the vocabulary (full softmax). For more detail, see: [node2vec](https://arxiv.org/pdf/1607.00653), [word2vec](https://arxiv.org/pdf/1301.3781), and [negative_sampling](https://arxiv.org/pdf/1402.3722).
+- **Set a harsher default for low interaction energies pruning during RIN construction**
+  - Now we zero out 85% of the lowest interaction energies as opposed to the past 30% default, leading to more meaningful embeddings.
+- **BUG FIX: Visualizer**
+  - Previously, the visualizer would silently draw edges of 0 magnitude, meaning they were actually being drawn but were invisible due to full transparency and 0 width. As a result, the displayed image/animation would be very laggy. Now, this was fixed, and given the higher pruning default, the displayed interaction networks are clean and smooth under rotations, dragging, etc.
+- **New Embedding Visualizer (3D)**
+  - New lightweight viewer for per-frame embeddings that projects embeddings with PCA to a **3D** scatter. Supports the same node coloring semantics, optional node labels, and the same antialiasing/depthshade controls. Works in headless setups using the same backend guard and uses a blocking `show=True` for scripts.
+
+---
+
 ## Why SAWNERGY?
 
 - **Bridge simulations and graph ML**: Convert raw MD trajectories into residue interaction networks ready for graph
   algorithms and downstream machine learning tasks.
-- **Deterministic, shareable
-- **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without
-- **Flexible
-- **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder
+- **Deterministic, shareable artifacts**: Every stage produces compressed Zarr archives that contain both data and metadata so runs can be reproduced, shared, or inspected later.
+- **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without serialization overhead; archives are written in chunked, compressed form for fast read/write.
+- **Flexible objectives & backends**: Train Skip-Gram with **negative sampling** (`objective="sgns"`) or **plain Skip-Gram** (`objective="sg"`), using either **PureML** (default) or **PyTorch**.
+- **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder.
 
 ---
 
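Reviewer note on the v1.0.8 changelog in the hunk above: it states that constructing `SGNS_Torch` now emits a `DeprecationWarning`. The sketch below shows one way a caller could record or assert on such a warning using only the standard `warnings` module. `LegacyModel` is a hypothetical stand-in written for this illustration; it is not the SAWNERGY class and makes no claim about how the package raises its warning internally.

```python
import warnings


class LegacyModel:
    """Hypothetical stand-in for a class that is deprecated at construction time."""

    def __init__(self) -> None:
        warnings.warn(
            "LegacyModel is deprecated; prefer another backend.",
            DeprecationWarning,
            stacklevel=2,
        )


def construct_and_collect():
    """Build the object while recording every warning raised during construction."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often hidden by default filters, so enable everything.
        warnings.simplefilter("always")
        model = LegacyModel()
    return model, caught


model, caught = construct_and_collect()
deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(f"{len(deprecations)} deprecation warning(s) recorded")
```

The same pattern can be used in a test suite to confirm that a deprecated backend still constructs while announcing its status.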
@@ -91,9 +129,9 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 * Wraps the AmberTools `cpptraj` executable to:
   - compute per-frame electrostatic (EMAP) and van der Waals (VMAP) energy matrices at the atomic level,
   - project atom–atom interactions to residue–residue interactions using compositional masks,
-  - prune,
-  - compute per-residue
-* Outputs a compressed Zarr archive with transition matrices, optional
+  - prune, symmetrize, remove self-interactions, and L1-normalize the matrices,
+  - compute per-residue centers of mass (COM) over the same frames.
+* Outputs a compressed Zarr archive with transition matrices, optional pre-normalized energies, COM snapshots, and rich
   metadata (frame range, pruning quantile, molecule ID, etc.).
 * Supports parallel `cpptraj` execution, batch processing, and keeps temporary stores tidy via
   `ArrayStorage.compress_and_cleanup`.
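Reviewer note on the RINBuilder bullets in the hunk above: the post-processing chain "prune, symmetrize, remove self-interactions, and L1-normalize" can be pictured with a few NumPy operations. This is only a sketch under stated assumptions: the function name is hypothetical, the input is taken to be a non-negative (N, N) matrix of absolute energies, and pruning is assumed to zero entries below a single global quantile. The package's actual ordering and pruning rule are not shown in this diff.

```python
import numpy as np


def energies_to_transitions(energies: np.ndarray, prune_frac: float = 0.85) -> np.ndarray:
    """Turn a non-negative (N, N) interaction matrix into row-normalized transitions."""
    m = np.asarray(energies, dtype=np.float64).copy()

    # Prune (assumption: one global quantile): zero the weakest `prune_frac` of entries.
    threshold = np.quantile(m, prune_frac)
    m[m < threshold] = 0.0

    # Symmetrize so that the interaction i -> j equals j -> i.
    m = 0.5 * (m + m.T)

    # Remove self-interactions.
    np.fill_diagonal(m, 0.0)

    # L1-normalize each row; rows that were pruned to all zeros stay zero.
    row_sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)


rng = np.random.default_rng(0)
demo = rng.random((6, 6))
transitions = energies_to_transitions(demo, prune_frac=0.5)
row_totals = transitions.sum(axis=1)
print(row_totals)
```

Each non-empty row of the result sums to 1, so it can be read as a probability distribution over the next node of a walk.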
@@ -103,7 +141,7 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 * Opens RIN archives, resolves dataset names from attributes, and renders nodes plus attractive/repulsive edge bundles
   in 3D using Matplotlib.
 * Allows both static frame visualization and trajectory animation.
-* Handles backend selection (`Agg` fallback in headless environments) and offers convenient
+* Handles backend selection (`Agg` fallback in headless environments) and offers convenient color palettes via
   `visualizer_util`.
 
 ### `sawnergy.walks.Walker`
@@ -116,13 +154,10 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 
 ### `sawnergy.embedding.Embedder`
 
-* Consumes walk archives, generates skip-gram pairs, and
-*
-
-
-* Both `SGNS_PureML` and `SGNS_Torch` accept training hyperparameters such as batch_size, LR, optimizer and LR_scheduler, etc.
-* Exposes `embed_frame` (single frame) and `embed_all` (all frames, deterministic seeding per frame) which return the
-  learned input embedding matrices and write them to disk when requested.
+* Consumes walk archives, generates skip-gram pairs, and normalizes them to 0-based indices.
+* Selects skip-gram (SG / SGNS) backends dynamically via `model_base="pureml"|"torch"` with per-backend overrides supplied through `model_kwargs`.
+* Handles deterministic per-frame seeding and returns the requested embedding `kind` (`"in"`, `"out"`, or `"avg"`) from `embed_frame` and `embed_all`.
+* Persists per-frame matrices with rich provenance (walk metadata, objective, hyperparameters, RNG seeds) when `embed_all` targets an output archive.
 
 ### Supporting Utilities
 
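Reviewer note on the Embedder bullets in the hunk above: walks store 1-based node IDs, and the Embedder is described as generating skip-gram pairs and normalizing them to 0-based indices. A minimal pure-Python sketch of that idea follows. `skipgram_pairs` is a hypothetical helper written for this note, and the symmetric-window pairing is the usual word2vec-style convention rather than something this diff confirms.

```python
from typing import List, Tuple


def skipgram_pairs(walk: List[int], window_size: int) -> List[Tuple[int, int]]:
    """Return (center, context) pairs with 0-based indices from a 1-based walk."""
    # Shift node IDs so they can index rows of an embedding matrix directly.
    zero_based = [node - 1 for node in walk]
    pairs: List[Tuple[int, int]] = []
    for i, center in enumerate(zero_based):
        lo = max(0, i - window_size)
        hi = min(len(zero_based), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a node is never its own context
                pairs.append((center, zero_based[j]))
    return pairs


pairs = skipgram_pairs([3, 1, 2], window_size=1)
print(pairs)  # [(2, 0), (0, 2), (0, 1), (1, 0)]
```

With `window_size=4` and walks of length 16, as in the quick-start hunks, each interior position contributes up to eight pairs.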
@@ -140,23 +175,13 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 |---|---|---|
 | **RIN** | `ATTRACTIVE_transitions` → **(T, N, N)**, float32 • `REPULSIVE_transitions` → **(T, N, N)**, float32 (optional) • `ATTRACTIVE_energies` → **(T, N, N)**, float32 (optional) • `REPULSIVE_energies` → **(T, N, N)**, float32 (optional) • `COM` → **(T, N, 3)**, float32 | `time_created` (ISO) • `com_name` = `"COM"` • `molecule_of_interest` (int) • `frame_range` = `(start, end)` inclusive • `frame_batch_size` (int) • `prune_low_energies_frac` (float in [0,1]) • `attractive_transitions_name` / `repulsive_transitions_name` (dataset names or `None`) • `attractive_energies_name` / `repulsive_energies_name` (dataset names or `None`) |
 | **Walks** | `ATTRACTIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `REPULSIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `ATTRACTIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) • `REPULSIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) <br/>_Note:_ node IDs are **1-based**.| `time_created` (ISO) • `seed` (int) • `rng_scheme` = `"SeedSequence.spawn_per_batch_v1"` • `num_workers` (int) • `in_parallel` (bool) • `batch_size_nodes` (int) • `num_RWs` / `num_SAWs` (ints) • `node_count` (N) • `time_stamp_count` (T) • `walk_length` (L) • `walks_per_node` (int) • `attractive_RWs_name` / `repulsive_RWs_name` / `attractive_SAWs_name` / `repulsive_SAWs_name` (dataset names or `None`) • `walks_layout` = `"time_leading_3d"` |
-| **Embeddings** | `FRAME_EMBEDDINGS` → **(
+| **Embeddings** | `FRAME_EMBEDDINGS` → **(T, N, D)**, float32 | `created_at` (ISO) • `frame_embeddings_name` = `"FRAME_EMBEDDINGS"` • `time_stamp_count` = T • `node_count` = N • `embedding_dim` = D • `model_base` = `"torch"` or `"pureml"` • `embedding_kind` = `"in"|"out"|"avg"` • `objective` = `"sgns"` or `"sg"` • `negative_sampling` (bool) • `num_negative_samples` (int) • `num_epochs` (int) • `batch_size` (int) • `window_size` (int) • `alpha` (float) • `lr_step_per_batch` (bool) • `shuffle_data` (bool) • `device_hint` (str) • `model_kwargs_repr` (repr string) • `RIN_type` = `"attr"` or `"repuls"` • `using` = `"RW"|"SAW"|"merged"` • `source_WALKS_path` (str) • `walk_length` (int) • `num_RWs` / `num_SAWs` (ints) • `attractive_*_name` / `repulsive_*_name` (dataset names or `None`) • `master_seed` (int) • `per_frame_seeds` (list[int]) • `arrays_per_chunk` (int) • `compression_level` (int) |
 
 **Notes**
 
-- In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-
+- In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-normalized** absolute energies (written only when `keep_prenormalized_energies=True`), whereas `ATTRACTIVE/REPULSIVE_transitions` are the **row-wise L1-normalized** versions used for sampling.
 - All archives are Zarr v3 groups. ArrayStorage also maintains per-block metadata in root attrs: `array_chunk_size_in_block`, `array_shape_in_block`, and `array_dtype_in_block` (dicts keyed by dataset name). You’ll see these in every archive.
-
----
-
-## Installation
-
-```bash
-pip install sawnergy
-```
-
-> **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
-> environment variable.
+- In **Embeddings**, `alpha` and `num_negative_samples` apply to **SGNS** only and are ignored for `objective="sg"`.
 
 ---
 
@@ -181,10 +206,10 @@ rin_builder.build_rin(
     molecule_of_interest=1,
     frame_range=(1, 100),
     frame_batch_size=10,
-    prune_low_energies_frac=0.
+    prune_low_energies_frac=0.85,
     output_path=rin_path,
     include_attractive=True,
-    include_repulsive=False
+    include_repulsive=False
 )
 
 # 2. Sample walks from the RIN
@@ -192,52 +217,43 @@ walker = Walker(rin_path, seed=123)
 walks_path = Path("./WALKS_demo.zip")
 walker.sample_walks(
     walk_length=16,
-    walks_per_node=
+    walks_per_node=100,
     saw_frac=0.25,
     include_attractive=True,
     include_repulsive=False,
     time_aware=False,
     output_path=walks_path,
-    in_parallel=False
+    in_parallel=False
 )
 walker.close()
 
 # 3. Train embeddings per frame (PyTorch backend)
 import torch
 
-embedder = Embedder(walks_path,
+embedder = Embedder(walks_path, seed=999)
 embeddings_path = embedder.embed_all(
     RIN_type="attr",
     using="merged",
+    num_epochs=10,
+    negative_sampling=False,
     window_size=4,
-
-
-
-    dimensionality=128,
-    shuffle_data=True,
-    output_path="./EMBEDDINGS_demo.zip",
-    sgns_kwargs={
-        "optim": torch.optim.Adam,
-        "optim_kwargs": {"lr": 1e-3},
-        "lr_sched": torch.optim.lr_scheduler.LambdaLR,
-        "lr_sched_kwargs": {"lr_lambda": lambda _: 1.0},
-        "device": "cuda" if torch.cuda.is_available() else "cpu",
-    },
+    device="cuda" if torch.cuda.is_available() else "cpu",
+    model_base="torch",
+    output_path="./EMBEDDINGS_demo.zip"
 )
 print("Embeddings written to", embeddings_path)
 ```
 
-> For the PureML backend,
-> (for example `optim=pureml.optimizers.Adam`, `lr_sched=pureml.optimizers.CosineAnnealingLR`).
+> For the PureML backend, set `model_base="pureml"` and pass the optimizer / scheduler classes inside `model_kwargs`.
 
 ---
 
-##
+## Visualization
 
 ```python
 from sawnergy.visual import Visualizer
 
-v =
+v = Visualizer("./RIN_demo.zip")
 v.build_frame(1,
     node_colors="rainbow",
     displayed_nodes="ALL",
@@ -250,14 +266,20 @@ v.build_frame(1,
 
 `Visualizer` lazily loads datasets and works even in headless environments (falls back to the `Agg` backend).
 
+```python
+from sawnergy.embedding import Visualizer
+
+viz = Visualizer("./EMBEDDINGS_demo.zip", normalize_rows=True)
+viz.build_frame(1, show=True)
+```
+
 ---
 
 ## Advanced Notes
 
 - **Time-aware walks**: Set `time_aware=True`, provide `stickiness` and `on_no_options` when calling `Walker.sample_walks`.
 - **Shared memory lifecycle**: Call `Walker.close()` (or use a context manager) to release shared-memory segments.
-- **PureML vs PyTorch**:
-  constructor kwargs through `sgns_kwargs` (optimizer, scheduler, device).
+- **PureML vs PyTorch**: Select the backend at call time with `model_base="pureml"|"torch"` (defaults to `"pureml"`) and pass optimizer / scheduler overrides through `model_kwargs`.
 - **ArrayStorage utilities**: Use `ArrayStorage` directly to peek into archives, append arrays, or manage metadata.
 
 ---
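Reviewer note on the embedding `Visualizer` snippet added in the hunk above: the changelog describes it as L2-normalizing rows (`normalize_rows=True`) and projecting embeddings to 3D with PCA. The NumPy sketch below illustrates those two steps on random data. Both helper names are hypothetical, and the SVD-based PCA is a common textbook formulation, not necessarily what the package does internally.

```python
import numpy as np


def l2_normalize_rows(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale every row to unit Euclidean length (eps guards against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def pca_project(x: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project rows of x onto their top principal directions via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 16))  # stand-in for one (N, D) frame
unit_rows = l2_normalize_rows(embeddings)
coords = pca_project(unit_rows, n_components=3)
print(coords.shape)  # (20, 3)
```

Normalizing first means the scatter reflects direction (cosine similarity) rather than vector magnitude, which is presumably the motivation for the new option.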
@@ -268,8 +290,9 @@ v.build_frame(1,
 ├── sawnergy/
 │ ├── rin/ # RINBuilder and cpptraj integration helpers
 │ ├── walks/ # Walker class and shared-memory utilities
-│ ├── embedding/ # Embedder + SGNS backends (PureML / PyTorch)
+│ ├── embedding/ # Embedder + SG/SGNS backends (PureML / PyTorch)
 │ ├── visual/ # Visualizer and palette utilities
+│ │
 │ ├── logging_util.py
 │ └── sawnergy_util.py
 │
@@ -278,7 +301,7 @@ v.build_frame(1,
 
 ---
 
-##
+## Acknowledgments
 
 SAWNERGY builds on the AmberTools `cpptraj` ecosystem, NumPy, Matplotlib, Zarr, and PyTorch (for GPU acceleration if necessary; PureML is available by default).
 Big thanks to the upstream communities whose work makes this toolkit possible.
@@ -5,19 +5,57 @@
 
 
 A toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations, sampling
-random and self-avoiding walks, learning node embeddings, and
+random and self-avoiding walks, learning node embeddings, and visualizing residue interaction networks (RINs). SAWNERGY
 keeps the full workflow — from `cpptraj` output to skip-gram embeddings (node2vec approach) — inside Python, backed by efficient Zarr-based archives and optional GPU acceleration.
 
 ---
 
+## Installation
+
+```bash
+pip install sawnergy
+```
+
+> **Optional:** For GPU training, install PyTorch separately (e.g., `pip install torch`).
+> **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
+> environment variable. Probably the easiest solution: install AmberTools via Conda, activate the environment, and SAWNERGY will find the cpptraj executable on its own, so just run your code and don't worry about it.
+
+---
+
+# UPDATES:
+
+## v1.0.8 — What’s new:
+- **Temporary deprecation of `SGNS_Torch`**
+  - `sawnergy.embedding.SGNS_Torch` currently produces noisy embeddings in practice. The issue likely stems from **weight initialization**, although the root cause has not yet been conclusively determined.
+  - **Action:** The class and its `__init__` docstring now carry a deprecation notice. Constructing the class emits a **`DeprecationWarning`** and logs a **warning**.
+  - **Use instead:** Prefer **`SG_Torch`** (plain Skip-Gram with full softmax) or the PureML backends **`SGNS_PureML`** / **`SG_PureML`**.
+  - **Compatibility:** No breaking API changes; imports remain stable. PureML backends are unaffected.
+- **Embedding visualizer update**
+  - Now you can L2 normalize your embeddings before display.
+- **Small improvements in the embedding module**
+  - Improved API with a lot of good defaults in place to ease usage out of the box.
+  - Small internal model tweaks.
+
+## v1.0.7 — What’s new:
+- **Added plain Skip-Gram model**
+  - Now, the user can choose if they want to apply the negative sampling technique (two binary classifiers) or train a single classifier over the vocabulary (full softmax). For more detail, see: [node2vec](https://arxiv.org/pdf/1607.00653), [word2vec](https://arxiv.org/pdf/1301.3781), and [negative_sampling](https://arxiv.org/pdf/1402.3722).
+- **Set a harsher default for low interaction energies pruning during RIN construction**
+  - Now we zero out 85% of the lowest interaction energies as opposed to the past 30% default, leading to more meaningful embeddings.
+- **BUG FIX: Visualizer**
+  - Previously, the visualizer would silently draw edges of 0 magnitude, meaning they were actually being drawn but were invisible due to full transparency and 0 width. As a result, the displayed image/animation would be very laggy. Now, this was fixed, and given the higher pruning default, the displayed interaction networks are clean and smooth under rotations, dragging, etc.
+- **New Embedding Visualizer (3D)**
+  - New lightweight viewer for per-frame embeddings that projects embeddings with PCA to a **3D** scatter. Supports the same node coloring semantics, optional node labels, and the same antialiasing/depthshade controls. Works in headless setups using the same backend guard and uses a blocking `show=True` for scripts.
+
+---
+
 ## Why SAWNERGY?
 
 - **Bridge simulations and graph ML**: Convert raw MD trajectories into residue interaction networks ready for graph
   algorithms and downstream machine learning tasks.
-- **Deterministic, shareable
-- **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without
-- **Flexible
-- **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder
+- **Deterministic, shareable artifacts**: Every stage produces compressed Zarr archives that contain both data and metadata so runs can be reproduced, shared, or inspected later.
+- **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without serialization overhead; archives are written in chunked, compressed form for fast read/write.
+- **Flexible objectives & backends**: Train Skip-Gram with **negative sampling** (`objective="sgns"`) or **plain Skip-Gram** (`objective="sg"`), using either **PureML** (default) or **PyTorch**.
+- **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder.
 
 ---
 
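Reviewer note on the two objectives named in the hunk above (`objective="sg"` with full softmax versus `objective="sgns"` with negative sampling): the difference is easiest to see in the per-pair loss. The NumPy sketch below writes out both losses in their standard word2vec form. The function names, the `w_in` / `w_out` matrices, and the choice of negatives are all illustrative assumptions and are not taken from the SAWNERGY source.

```python
import numpy as np


def full_softmax_loss(w_in, w_out, center, context):
    """Cross-entropy of the true context under a softmax over the whole vocabulary."""
    logits = w_out @ w_in[center]
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[context]


def negative_sampling_loss(w_in, w_out, center, context, negatives):
    """Logistic loss: score the true pair high and the sampled negatives low."""
    def log_sigmoid(z):
        return -np.logaddexp(0.0, -z)  # stable log(1 / (1 + exp(-z)))

    positive_term = log_sigmoid(w_out[context] @ w_in[center])
    negative_term = log_sigmoid(-(w_out[negatives] @ w_in[center])).sum()
    return -(positive_term + negative_term)


vocab_size, dim = 5, 4
zeros_in = np.zeros((vocab_size, dim))
zeros_out = np.zeros((vocab_size, dim))

sg_loss = full_softmax_loss(zeros_in, zeros_out, center=0, context=1)
sgns_loss = negative_sampling_loss(zeros_in, zeros_out, center=0, context=1,
                                   negatives=np.array([2, 3]))
print(sg_loss, sgns_loss)
```

With all-zero weights the softmax loss equals `log(vocab_size)` and the negative-sampling loss equals `3 * log(2)` (one positive plus two negatives), which makes a convenient sanity check. The softmax variant touches every output row per pair, while negative sampling touches only the context row and the sampled negatives.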
@@ -57,9 +95,9 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 * Wraps the AmberTools `cpptraj` executable to:
   - compute per-frame electrostatic (EMAP) and van der Waals (VMAP) energy matrices at the atomic level,
   - project atom–atom interactions to residue–residue interactions using compositional masks,
-  - prune,
-  - compute per-residue
-* Outputs a compressed Zarr archive with transition matrices, optional
+  - prune, symmetrize, remove self-interactions, and L1-normalize the matrices,
+  - compute per-residue centers of mass (COM) over the same frames.
+* Outputs a compressed Zarr archive with transition matrices, optional pre-normalized energies, COM snapshots, and rich
   metadata (frame range, pruning quantile, molecule ID, etc.).
 * Supports parallel `cpptraj` execution, batch processing, and keeps temporary stores tidy via
   `ArrayStorage.compress_and_cleanup`.
@@ -69,7 +107,7 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 * Opens RIN archives, resolves dataset names from attributes, and renders nodes plus attractive/repulsive edge bundles
   in 3D using Matplotlib.
 * Allows both static frame visualization and trajectory animation.
-* Handles backend selection (`Agg` fallback in headless environments) and offers convenient
+* Handles backend selection (`Agg` fallback in headless environments) and offers convenient color palettes via
   `visualizer_util`.
 
 ### `sawnergy.walks.Walker`
@@ -82,13 +120,10 @@ node indexing, and RNG seeds stay consistent across the toolchain.
 
 ### `sawnergy.embedding.Embedder`
 
-* Consumes walk archives, generates skip-gram pairs, and
-*
-
-
-* Both `SGNS_PureML` and `SGNS_Torch` accept training hyperparameters such as batch_size, LR, optimizer and LR_scheduler, etc.
-* Exposes `embed_frame` (single frame) and `embed_all` (all frames, deterministic seeding per frame) which return the
-  learned input embedding matrices and write them to disk when requested.
+* Consumes walk archives, generates skip-gram pairs, and normalizes them to 0-based indices.
+* Selects skip-gram (SG / SGNS) backends dynamically via `model_base="pureml"|"torch"` with per-backend overrides supplied through `model_kwargs`.
+* Handles deterministic per-frame seeding and returns the requested embedding `kind` (`"in"`, `"out"`, or `"avg"`) from `embed_frame` and `embed_all`.
+* Persists per-frame matrices with rich provenance (walk metadata, objective, hyperparameters, RNG seeds) when `embed_all` targets an output archive.
 
 ### Supporting Utilities
 
@@ -106,23 +141,13 @@ node indexing, and RNG seeds stay consistent across the toolchain.
|
|
|
106
141
|
|---|---|---|
|
|
107
142
|
| **RIN** | `ATTRACTIVE_transitions` → **(T, N, N)**, float32 • `REPULSIVE_transitions` → **(T, N, N)**, float32 (optional) • `ATTRACTIVE_energies` → **(T, N, N)**, float32 (optional) • `REPULSIVE_energies` → **(T, N, N)**, float32 (optional) • `COM` → **(T, N, 3)**, float32 | `time_created` (ISO) • `com_name` = `"COM"` • `molecule_of_interest` (int) • `frame_range` = `(start, end)` inclusive • `frame_batch_size` (int) • `prune_low_energies_frac` (float in [0,1]) • `attractive_transitions_name` / `repulsive_transitions_name` (dataset names or `None`) • `attractive_energies_name` / `repulsive_energies_name` (dataset names or `None`) |
|
|
108
143
|
| **Walks** | `ATTRACTIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `REPULSIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `ATTRACTIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) • `REPULSIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) <br/>_Note:_ node IDs are **1-based**.| `time_created` (ISO) • `seed` (int) • `rng_scheme` = `"SeedSequence.spawn_per_batch_v1"` • `num_workers` (int) • `in_parallel` (bool) • `batch_size_nodes` (int) • `num_RWs` / `num_SAWs` (ints) • `node_count` (N) • `time_stamp_count` (T) • `walk_length` (L) • `walks_per_node` (int) • `attractive_RWs_name` / `repulsive_RWs_name` / `attractive_SAWs_name` / `repulsive_SAWs_name` (dataset names or `None`) • `walks_layout` = `"time_leading_3d"` |
|
|
109
|
-
| **Embeddings** | `FRAME_EMBEDDINGS` → **(
|
|
144
|
+
| **Embeddings** | `FRAME_EMBEDDINGS` → **(T, N, D)**, float32 | `created_at` (ISO) • `frame_embeddings_name` = `"FRAME_EMBEDDINGS"` • `time_stamp_count` = T • `node_count` = N • `embedding_dim` = D • `model_base` = `"torch"` or `"pureml"` • `embedding_kind` = `"in"|"out"|"avg"` • `objective` = `"sgns"` or `"sg"` • `negative_sampling` (bool) • `num_negative_samples` (int) • `num_epochs` (int) • `batch_size` (int) • `window_size` (int) • `alpha` (float) • `lr_step_per_batch` (bool) • `shuffle_data` (bool) • `device_hint` (str) • `model_kwargs_repr` (repr string) • `RIN_type` = `"attr"` or `"repuls"` • `using` = `"RW"|"SAW"|"merged"` • `source_WALKS_path` (str) • `walk_length` (int) • `num_RWs` / `num_SAWs` (ints) • `attractive_*_name` / `repulsive_*_name` (dataset names or `None`) • `master_seed` (int) • `per_frame_seeds` (list[int]) • `arrays_per_chunk` (int) • `compression_level` (int) |
 
 **Notes**
 
-- In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-
+- In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-normalized** absolute energies (written only when `keep_prenormalized_energies=True`), whereas `ATTRACTIVE/REPULSIVE_transitions` are the **row-wise L1-normalized** versions used for sampling.
 - All archives are Zarr v3 groups. ArrayStorage also maintains per-block metadata in root attrs: `array_chunk_size_in_block`, `array_shape_in_block`, and `array_dtype_in_block` (dicts keyed by dataset name). You’ll see these in every archive.
-
----
-
-## Installation
-
-```bash
-pip install sawnergy
-```
-
-> **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
-> environment variable.
+- In **Embeddings**, `alpha` and `num_negative_samples` apply to **SGNS** only and are ignored for `objective="sg"`.
 
 ---
 
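The RIN note above distinguishes pre-normalized energies from the row-wise L1-normalized transitions used for sampling. The sketch below illustrates that relationship in plain Python so it stays dependency-free. It is not taken from the sawnergy source: the helper names `l1_normalize_rows` and `sample_next_node` are hypothetical, and the handling of all-zero rows is an assumption.

```python
import random


def l1_normalize_rows(matrix):
    """Scale each row so its absolute values sum to 1.

    All-zero rows are left as zeros (an assumption; the library may
    treat isolated nodes differently).
    """
    normalized = []
    for row in matrix:
        total = sum(abs(x) for x in row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


def sample_next_node(transitions, current, rng):
    """Draw the next node index, using row `current` as sampling weights.

    Raises ValueError if the row is all zeros, because random.choices
    needs a positive total weight.
    """
    weights = transitions[current]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Toy 3-node matrix of non-negative interaction energies (illustrative values).
energies = [
    [0.0, 2.0, 6.0],
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],  # isolated node: no outgoing weight
]

transitions = l1_normalize_rows(energies)
rng = random.Random(123)
next_node = sample_next_node(transitions, current=0, rng=rng)
```

Each non-empty row of `transitions` sums to 1, so it can be read directly as a probability distribution over the next node of a walk.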
@@ -147,10 +172,10 @@ rin_builder.build_rin(
 molecule_of_interest=1,
 frame_range=(1, 100),
 frame_batch_size=10,
-prune_low_energies_frac=0.
+prune_low_energies_frac=0.85,
 output_path=rin_path,
 include_attractive=True,
-include_repulsive=False
+include_repulsive=False
 )
 
 # 2. Sample walks from the RIN
@@ -158,52 +183,43 @@ walker = Walker(rin_path, seed=123)
 walks_path = Path("./WALKS_demo.zip")
 walker.sample_walks(
 walk_length=16,
-walks_per_node=
+walks_per_node=100,
 saw_frac=0.25,
 include_attractive=True,
 include_repulsive=False,
 time_aware=False,
 output_path=walks_path,
-in_parallel=False
+in_parallel=False
 )
 walker.close()
 
 # 3. Train embeddings per frame (PyTorch backend)
 import torch
 
-embedder = Embedder(walks_path,
+embedder = Embedder(walks_path, seed=999)
 embeddings_path = embedder.embed_all(
 RIN_type="attr",
 using="merged",
+num_epochs=10,
+negative_sampling=False,
 window_size=4,
-
-
-
-dimensionality=128,
-shuffle_data=True,
-output_path="./EMBEDDINGS_demo.zip",
-sgns_kwargs={
-"optim": torch.optim.Adam,
-"optim_kwargs": {"lr": 1e-3},
-"lr_sched": torch.optim.lr_scheduler.LambdaLR,
-"lr_sched_kwargs": {"lr_lambda": lambda _: 1.0},
-"device": "cuda" if torch.cuda.is_available() else "cpu",
-},
+device="cuda" if torch.cuda.is_available() else "cpu",
+model_base="torch",
+output_path="./EMBEDDINGS_demo.zip"
 )
 print("Embeddings written to", embeddings_path)
 ```
 
-> For the PureML backend,
-> (for example `optim=pureml.optimizers.Adam`, `lr_sched=pureml.optimizers.CosineAnnealingLR`).
+> For the PureML backend, set `model_base="pureml"` and pass the optimizer / scheduler classes inside `model_kwargs`.
 
 ---
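The quick-start call uses `frame_range=(1, 100)` with `frame_batch_size=10`, and the README notes say that `T` equals the number of frame batches written. The sketch below counts those batches. The helper `frame_batches` is hypothetical, and whether sawnergy treats the end of `frame_range` as inclusive is an assumption, exposed here as a flag.

```python
def frame_batches(frame_range, frame_batch_size, inclusive_end=True):
    """Split a frame range into consecutive (first, last) batches.

    `inclusive_end` is an assumption about how the upper bound of
    `frame_range` is interpreted; it is not confirmed by the README.
    """
    start, end = frame_range
    stop = end + 1 if inclusive_end else end
    return [
        (lo, min(lo + frame_batch_size, stop) - 1)
        for lo in range(start, stop, frame_batch_size)
    ]


batches = frame_batches((1, 100), 10)
T = len(batches)  # number of time stamps expected along the first axis

# Under the exclusive-end reading the last batch is shorter, but T is unchanged.
T_exclusive = len(frame_batches((1, 100), 10, inclusive_end=False))
```

Under either reading the quick-start settings give ten batches, so the archives produced by these calls would be expected to have a leading axis of length 10.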
 
-##
+## Visualization
 
 ```python
 from sawnergy.visual import Visualizer
 
-v =
+v = Visualizer("./RIN_demo.zip")
 v.build_frame(1,
 node_colors="rainbow",
 displayed_nodes="ALL",
@@ -216,14 +232,20 @@ v.build_frame(1,
 
 `Visualizer` lazily loads datasets and works even in headless environments (falls back to the `Agg` backend).
 
+```python
+from sawnergy.embedding import Visualizer
+
+viz = Visualizer("./EMBEDDINGS_demo.zip", normalize_rows=True)
+viz.build_frame(1, show=True)
+```
+
 ---
 
 ## Advanced Notes
 
 - **Time-aware walks**: Set `time_aware=True`, provide `stickiness` and `on_no_options` when calling `Walker.sample_walks`.
 - **Shared memory lifecycle**: Call `Walker.close()` (or use a context manager) to release shared-memory segments.
-- **PureML vs PyTorch**:
-constructor kwargs through `sgns_kwargs` (optimizer, scheduler, device).
+- **PureML vs PyTorch**: Select the backend at call time with `model_base="pureml"|"torch"` (defaults to `"pureml"`) and pass optimizer / scheduler overrides through `model_kwargs`.
 - **ArrayStorage utilities**: Use `ArrayStorage` directly to peek into archives, append arrays, or manage metadata.
 
 ---
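The shared-memory note above asks for `Walker.close()` to be called even when sampling fails. One generic way to guarantee that is `contextlib.closing` from the standard library, which only needs the object to have a `close()` method. The sketch below uses a hypothetical `StandInWalker` class in place of the real `Walker`, so it runs without a RIN archive. It shows the pattern only and is not the library's own context-manager implementation.

```python
from contextlib import closing


class StandInWalker:
    """Hypothetical stand-in for Walker; it only mimics close()."""

    def __init__(self):
        self.closed = False

    def sample_walks(self):
        # Simulate a failure in the middle of sampling.
        raise RuntimeError("simulated sampling failure")

    def close(self):
        # The real Walker.close() releases shared-memory segments.
        self.closed = True


walker = StandInWalker()
try:
    with closing(walker) as w:
        w.sample_walks()
except RuntimeError:
    pass  # the error still propagates out of the with-block

# close() ran although sample_walks() raised.
released = walker.closed
```

With the real class, the same `with closing(Walker(...)) as walker:` shape should apply, assuming `close()` is safe to call after a failed `sample_walks`.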
@@ -234,8 +256,9 @@ v.build_frame(1,
 ├── sawnergy/
 │ ├── rin/ # RINBuilder and cpptraj integration helpers
 │ ├── walks/ # Walker class and shared-memory utilities
-│ ├── embedding/ # Embedder + SGNS backends (PureML / PyTorch)
+│ ├── embedding/ # Embedder + SG/SGNS backends (PureML / PyTorch)
 │ ├── visual/ # Visualizer and palette utilities
+│ │
 │ ├── logging_util.py
 │ └── sawnergy_util.py
 │
@@ -244,7 +267,7 @@ v.build_frame(1,
 
 ---
 
-##
+## Acknowledgments
 
 SAWNERGY builds on the AmberTools `cpptraj` ecosystem, NumPy, Matplotlib, Zarr, and PyTorch (for GPU acceleration if necessary; PureML is available by default).
 Big thanks to the upstream communities whose work makes this toolkit possible.
|