sawnergy 1.0.6.tar.gz → 1.0.8.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)
  1. {sawnergy-1.0.6/sawnergy.egg-info → sawnergy-1.0.8}/PKG-INFO +79 -56
  2. {sawnergy-1.0.6 → sawnergy-1.0.8}/README.md +78 -55
  3. sawnergy-1.0.8/sawnergy/embedding/SGNS_pml.py +368 -0
  4. sawnergy-1.0.8/sawnergy/embedding/SGNS_torch.py +364 -0
  5. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/embedding/__init__.py +24 -0
  6. sawnergy-1.0.8/sawnergy/embedding/embedder.py +714 -0
  7. sawnergy-1.0.8/sawnergy/embedding/visualizer.py +251 -0
  8. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/logging_util.py +1 -1
  9. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/rin_builder.py +1 -1
  10. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/visualizer.py +6 -6
  11. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/visualizer_util.py +3 -0
  12. {sawnergy-1.0.6 → sawnergy-1.0.8/sawnergy.egg-info}/PKG-INFO +79 -56
  13. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/SOURCES.txt +2 -0
  14. {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_embedding.py +103 -6
  15. sawnergy-1.0.8/tests/test_embedding_visualizer.py +58 -0
  16. sawnergy-1.0.6/sawnergy/embedding/SGNS_pml.py +0 -172
  17. sawnergy-1.0.6/sawnergy/embedding/SGNS_torch.py +0 -177
  18. sawnergy-1.0.6/sawnergy/embedding/embedder.py +0 -584
  19. {sawnergy-1.0.6 → sawnergy-1.0.8}/LICENSE +0 -0
  20. {sawnergy-1.0.6 → sawnergy-1.0.8}/NOTICE +0 -0
  21. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/__init__.py +0 -0
  22. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/__init__.py +0 -0
  23. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/rin/rin_util.py +0 -0
  24. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/sawnergy_util.py +0 -0
  25. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/visual/__init__.py +0 -0
  26. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/__init__.py +0 -0
  27. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/walker.py +0 -0
  28. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy/walks/walker_util.py +0 -0
  29. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/dependency_links.txt +0 -0
  30. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/requires.txt +0 -0
  31. {sawnergy-1.0.6 → sawnergy-1.0.8}/sawnergy.egg-info/top_level.txt +0 -0
  32. {sawnergy-1.0.6 → sawnergy-1.0.8}/setup.cfg +0 -0
  33. {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_rin.py +0 -0
  34. {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_storage.py +0 -0
  35. {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_visual.py +0 -0
  36. {sawnergy-1.0.6 → sawnergy-1.0.8}/tests/test_walks.py +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: sawnergy
- Version: 1.0.6
+ Version: 1.0.8
  Summary: Toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations
  Home-page: https://github.com/Yehor-Mishchyriak/SAWNERGY
  Author: Yehor Mishchyriak
@@ -39,19 +39,57 @@ Dynamic: summary
  ![Python](https://img.shields.io/badge/python-3.11%2B-blue)

  A toolkit for transforming molecular dynamics (MD) trajectories into rich graph representations, sampling
- random and self-avoiding walks, learning node embeddings, and visualising residue interaction networks (RINs). SAWNERGY
+ random and self-avoiding walks, learning node embeddings, and visualizing residue interaction networks (RINs). SAWNERGY
  keeps the full workflow — from `cpptraj` output to skip-gram embeddings (node2vec approach) — inside Python, backed by efficient Zarr-based archives and optional GPU acceleration.

  ---

+ ## Installation
+
+ ```bash
+ pip install sawnergy
+ ```
+
+ > **Optional:** For GPU training, install PyTorch separately (e.g., `pip install torch`).
+ > **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
+ > environment variable. The easiest route is to install AmberTools via Conda and activate that environment; SAWNERGY will then locate the `cpptraj` executable automatically.
+
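To verify that requirement up front, a small stdlib probe mirrors the documented lookup (the `CPPTRAJ` environment variable, then `$PATH`); SAWNERGY's internal discovery logic may differ in details, so treat this as a sketch.

```python
# Sketch: confirm cpptraj is discoverable as the note above describes.
# Checks the CPPTRAJ environment variable first, then $PATH (stdlib only).
import os
import shutil

cpptraj = os.environ.get("CPPTRAJ") or shutil.which("cpptraj")
if cpptraj is None:
    raise RuntimeError("cpptraj not found: install AmberTools or set CPPTRAJ")
print("Using cpptraj at:", cpptraj)
```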
+ ---
+
+ # UPDATES:
+
+ ## v1.0.8 — What’s new:
+ - **Temporary deprecation of `SGNS_Torch`**
+   - `sawnergy.embedding.SGNS_Torch` currently produces noisy embeddings in practice. The issue likely stems from **weight initialization**, although the root cause has not yet been conclusively determined.
+   - **Action:** The class and its `__init__` docstring now carry a deprecation notice. Constructing the class emits a **`DeprecationWarning`** and logs a **warning**.
+   - **Use instead:** Prefer **`SG_Torch`** (plain Skip-Gram with full softmax) or the PureML backends **`SGNS_PureML`** / **`SG_PureML`** (see the sketch after this update list).
+   - **Compatibility:** No breaking API changes; imports remain stable. PureML backends are unaffected.
+ - **Embedding visualizer update**
+   - Embeddings can now be L2-normalized before display.
+ - **Small improvements in the embedding module**
+   - Improved API with sensible defaults to ease usage out of the box.
+   - Small internal model tweaks.
+
+ ## v1.0.7 — What’s new:
+ - **Added plain Skip-Gram model**
+   - Users can now choose whether to apply the negative-sampling technique (binary classifiers over true vs. noise pairs) or train a single classifier over the full vocabulary (full softmax). For more detail, see: [node2vec](https://arxiv.org/pdf/1607.00653), [word2vec](https://arxiv.org/pdf/1301.3781), and [negative_sampling](https://arxiv.org/pdf/1402.3722).
+ - **Harsher default for pruning low interaction energies during RIN construction**
+   - The default now zeroes out the lowest 85% of interaction energies (previously 30%), yielding more meaningful embeddings.
+ - **BUG FIX: Visualizer**
+   - Previously, the visualizer silently drew zero-magnitude edges: they were rendered but invisible due to full transparency and zero width, which made the displayed image/animation very laggy. This is now fixed; combined with the higher pruning default, the displayed interaction networks stay clean and smooth under rotation, dragging, etc.
+ - **New Embedding Visualizer (3D)**
+   - A lightweight viewer for per-frame embeddings that projects them with PCA into a **3D** scatter. It supports the same node coloring semantics, optional node labels, and the same antialiasing/depthshade controls, works in headless setups using the same backend guard, and uses a blocking `show=True` for scripts.
+
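Taken together, these notes mean Torch users should prefer the plain Skip-Gram objective for now. A minimal sketch, reusing the parameter names from the per-frame training example later in this diff (`./WALKS_demo.zip` is a placeholder walks archive):

```python
# Plain Skip-Gram (SG_Torch) on the Torch backend, avoiding the deprecated
# SGNS_Torch. Parameters mirror the training example later in this README.
from sawnergy.embedding import Embedder

embedder = Embedder("./WALKS_demo.zip", seed=999)
embeddings_path = embedder.embed_all(
    RIN_type="attr",
    using="merged",
    num_epochs=10,
    negative_sampling=False,  # full softmax -> SG_Torch rather than SGNS_Torch
    window_size=4,
    model_base="torch",
    output_path="./EMBEDDINGS_demo.zip",
)
```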
+ ---
+
  ## Why SAWNERGY?

  - **Bridge simulations and graph ML**: Convert raw MD trajectories into residue interaction networks ready for graph
  algorithms and downstream machine learning tasks.
- - **Deterministic, shareable artefacts**: Every stage produces compressed Zarr archives that contain both data and metadata so runs can be reproduced, shared, or inspected later.
- - **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without serealization overhead; archives are written in chunked, compressed form for fast read/write.
- - **Flexible embedding backends**: Train skip-gram with negative sampling (SGNS) models using either PureML or PyTorch.
- - **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder
+ - **Deterministic, shareable artifacts**: Every stage produces compressed Zarr archives that contain both data and metadata so runs can be reproduced, shared, or inspected later.
+ - **High-performance data handling**: Heavy arrays live in shared memory during walk sampling to allow parallel processing without serialization overhead; archives are written in chunked, compressed form for fast read/write (see the sketch after this list).
+ - **Flexible objectives & backends**: Train Skip-Gram with **negative sampling** (`objective="sgns"`) or **plain Skip-Gram** (`objective="sg"`), using either **PureML** (default) or **PyTorch**.
+ - **Visualization out of the box**: Plot and animate residue networks without leaving Python, using the data produced by RINBuilder.

  ---

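Because heavy arrays live in shared memory during walk sampling, releasing them matters; as the Advanced Notes later in this diff mention, `Walker` can be closed explicitly or used as a context manager. A sketch using the walk-sampling example's parameters:

```python
# Context-manager sketch for the shared-memory lifecycle: segments are
# released automatically when the block exits. Parameters follow the example
# later in this README.
from sawnergy.walks import Walker

with Walker("./RIN_demo.zip", seed=123) as walker:
    walker.sample_walks(
        walk_length=16,
        walks_per_node=100,
        saw_frac=0.25,
        include_attractive=True,
        include_repulsive=False,
        time_aware=False,
        output_path="./WALKS_demo.zip",
        in_parallel=False,
    )
```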
@@ -91,9 +129,9 @@ node indexing, and RNG seeds stay consistent across the toolchain.
  * Wraps the AmberTools `cpptraj` executable to:
    - compute per-frame electrostatic (EMAP) and van der Waals (VMAP) energy matrices at the atomic level,
    - project atom–atom interactions to residue–residue interactions using compositional masks,
-   - prune, symmetrise, remove self-interactions, and L1-normalise the matrices,
-   - compute per-residue centres of mass (COM) over the same frames.
- * Outputs a compressed Zarr archive with transition matrices, optional prenormalised energies, COM snapshots, and rich
+   - prune, symmetrize, remove self-interactions, and L1-normalize the matrices,
+   - compute per-residue centers of mass (COM) over the same frames.
+ * Outputs a compressed Zarr archive with transition matrices, optional pre-normalized energies, COM snapshots, and rich
  metadata (frame range, pruning quantile, molecule ID, etc.).
  * Supports parallel `cpptraj` execution, batch processing, and keeps temporary stores tidy via
  `ArrayStorage.compress_and_cleanup`.
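To make the matrix post-processing above concrete, here is an illustrative numpy sketch of the documented steps (prune by quantile, symmetrize, drop self-interactions, row-wise L1-normalize); it is not SAWNERGY's implementation, and details such as step order may differ.

```python
# Illustrative sketch only: the documented prune -> symmetrize ->
# de-diagonal -> L1-normalize pipeline on a toy energy matrix.
import numpy as np

def to_transitions(energies: np.ndarray, prune_frac: float = 0.85) -> np.ndarray:
    A = np.abs(energies).astype(np.float64)
    A[A < np.quantile(A, prune_frac)] = 0.0   # prune the lowest energies
    A = 0.5 * (A + A.T)                       # symmetrize
    np.fill_diagonal(A, 0.0)                  # remove self-interactions
    rows = A.sum(axis=1, keepdims=True)       # row-wise L1 normalization
    return np.divide(A, rows, out=np.zeros_like(A), where=rows > 0)

P = to_transitions(np.random.default_rng(0).random((5, 5)))
print(P.sum(axis=1))  # surviving rows sum to 1; fully pruned rows stay 0
```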
@@ -103,7 +141,7 @@ node indexing, and RNG seeds stay consistent across the toolchain.
  * Opens RIN archives, resolves dataset names from attributes, and renders nodes plus attractive/repulsive edge bundles
  in 3D using Matplotlib.
  * Allows both static frame visualization and trajectory animation.
- * Handles backend selection (`Agg` fallback in headless environments) and offers convenient colour palettes via
+ * Handles backend selection (`Agg` fallback in headless environments) and offers convenient color palettes via
  `visualizer_util`.

  ### `sawnergy.walks.Walker`
@@ -116,13 +154,10 @@ node indexing, and RNG seeds stay consistent across the toolchain.

  ### `sawnergy.embedding.Embedder`

- * Consumes walk archives, generates skip-gram pairs, and normalises them to 0-based indices.
- * Provides a unified interface to SGNS implementations:
-   - **PureML backend** (`SGNS_PureML`): works with the `pureml` ecosystem, optimistic for CPU training.
-   - **PyTorch backend** (`SGNS_Torch`): uses `torch.nn.Embedding` plays nicely with GPUs.
- * Both `SGNS_PureML` and `SGNS_Torch` accept training hyperparameters such as batch_size, LR, optimizer and LR_scheduler, etc.
- * Exposes `embed_frame` (single frame) and `embed_all` (all frames, deterministic seeding per frame) which return the
- learned input embedding matrices and write them to disk when requested.
+ * Consumes walk archives, generates skip-gram pairs, and normalizes them to 0-based indices.
+ * Selects skip-gram (SG / SGNS) backends dynamically via `model_base="pureml"|"torch"` with per-backend overrides supplied through `model_kwargs`.
+ * Handles deterministic per-frame seeding and returns the requested embedding `kind` (`"in"`, `"out"`, or `"avg"`) from `embed_frame` and `embed_all`.
+ * Persists per-frame matrices with rich provenance (walk metadata, objective, hyperparameters, RNG seeds) when `embed_all` targets an output archive.

  ### Supporting Utilities

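The first Embedder bullet is easy to picture with a toy version of window-based pair generation; this is illustrative only (not SAWNERGY's code) and shows the 1-based walk IDs becoming 0-based indices.

```python
# Illustrative skip-gram pair generation from one walk: shift 1-based node
# IDs to 0-based, then pair each center with its neighbors inside the window.
import numpy as np

def skipgram_pairs(walk: np.ndarray, window_size: int) -> np.ndarray:
    walk = walk - 1  # walk archives store 1-based node IDs
    pairs = [
        (walk[i], walk[j])
        for i in range(len(walk))
        for j in range(max(0, i - window_size), min(len(walk), i + window_size + 1))
        if j != i
    ]
    return np.asarray(pairs, dtype=np.int64)

print(skipgram_pairs(np.array([3, 1, 4, 1, 5]), window_size=2))
```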
@@ -140,23 +175,13 @@ node indexing, and RNG seeds stay consistent across the toolchain.
  |---|---|---|
  | **RIN** | `ATTRACTIVE_transitions` → **(T, N, N)**, float32 • `REPULSIVE_transitions` → **(T, N, N)**, float32 (optional) • `ATTRACTIVE_energies` → **(T, N, N)**, float32 (optional) • `REPULSIVE_energies` → **(T, N, N)**, float32 (optional) • `COM` → **(T, N, 3)**, float32 | `time_created` (ISO) • `com_name` = `"COM"` • `molecule_of_interest` (int) • `frame_range` = `(start, end)` inclusive • `frame_batch_size` (int) • `prune_low_energies_frac` (float in [0,1]) • `attractive_transitions_name` / `repulsive_transitions_name` (dataset names or `None`) • `attractive_energies_name` / `repulsive_energies_name` (dataset names or `None`) |
  | **Walks** | `ATTRACTIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `REPULSIVE_RWs` → **(T, N·num_RWs, L+1)**, int32 (optional) • `ATTRACTIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) • `REPULSIVE_SAWs` → **(T, N·num_SAWs, L+1)**, int32 (optional) <br/>_Note:_ node IDs are **1-based**.| `time_created` (ISO) • `seed` (int) • `rng_scheme` = `"SeedSequence.spawn_per_batch_v1"` • `num_workers` (int) • `in_parallel` (bool) • `batch_size_nodes` (int) • `num_RWs` / `num_SAWs` (ints) • `node_count` (N) • `time_stamp_count` (T) • `walk_length` (L) • `walks_per_node` (int) • `attractive_RWs_name` / `repulsive_RWs_name` / `attractive_SAWs_name` / `repulsive_SAWs_name` (dataset names or `None`) • `walks_layout` = `"time_leading_3d"` |
- | **Embeddings** | `FRAME_EMBEDDINGS` → **(frames_written, vocab_size, D)**, typically float32 | `time_created` (ISO) • `seed` (int) • `rng_scheme` = `"SeedSequence.spawn_per_frame_v1"` • `source_walks_path` (str) • `model_base` = `"torch"` or `"pureml"` • `rin_type` = `"attr"` or `"repuls"` • `using_mode` = `"RW"|"SAW"|"merged"` • `window_size` (int) • `alpha` (float; noise exponent) • `dimensionality` = D • `num_negative_samples` (int) • `num_epochs` (int) • `batch_size` (int) • `shuffle_data` (bool) • `frames_written` (int) • `vocab_size` (int) • `frame_count` (int) • `embedding_dtype` (str) • `frame_embeddings_name` = `"FRAME_EMBEDDINGS"` • `arrays_per_chunk` (int) • `compression_level` (int) |
+ | **Embeddings** | `FRAME_EMBEDDINGS` → **(T, N, D)**, float32 | `created_at` (ISO) • `frame_embeddings_name` = `"FRAME_EMBEDDINGS"` • `time_stamp_count` = T • `node_count` = N • `embedding_dim` = D • `model_base` = `"torch"` or `"pureml"` • `embedding_kind` = `"in"|"out"|"avg"` • `objective` = `"sgns"` or `"sg"` • `negative_sampling` (bool) • `num_negative_samples` (int) • `num_epochs` (int) • `batch_size` (int) • `window_size` (int) • `alpha` (float) • `lr_step_per_batch` (bool) • `shuffle_data` (bool) • `device_hint` (str) • `model_kwargs_repr` (repr string) • `RIN_type` = `"attr"` or `"repuls"` • `using` = `"RW"|"SAW"|"merged"` • `source_WALKS_path` (str) • `walk_length` (int) • `num_RWs` / `num_SAWs` (ints) • `attractive_*_name` / `repulsive_*_name` (dataset names or `None`) • `master_seed` (int) • `per_frame_seeds` (list[int]) • `arrays_per_chunk` (int) • `compression_level` (int) |

  **Notes**

- - In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-normalised** absolute energies (written only when `keep_prenormalized_energies=True`), whereas `ATTRACTIVE/REPULSIVE_transitions` are the **row-wise L1-normalised** versions used for sampling.
+ - In **RIN**, `T` equals the number of frame **batches** written (i.e., `frame_range` swept in steps of `frame_batch_size`). `ATTRACTIVE/REPULSIVE_energies` are **pre-normalized** absolute energies (written only when `keep_prenormalized_energies=True`), whereas `ATTRACTIVE/REPULSIVE_transitions` are the **row-wise L1-normalized** versions used for sampling.
  - All archives are Zarr v3 groups. ArrayStorage also maintains per-block metadata in root attrs: `array_chunk_size_in_block`, `array_shape_in_block`, and `array_dtype_in_block` (dicts keyed by dataset name). You’ll see these in every archive.
-
- ---
-
- ## Installation
-
- ```bash
- pip install sawnergy
- ```
-
- > **Note:** RIN building requires `cpptraj` (AmberTools). Ensure it is discoverable via `$PATH` or the `CPPTRAJ`
- > environment variable.
+ - In **Embeddings**, `alpha` and `num_negative_samples` apply to **SGNS** only and are ignored for `objective="sg"`.

  ---

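Because every archive is a Zarr v3 group (per the note above), the metadata in this table can be read back directly; a sketch assuming zarr-python 3.x and an existing `./EMBEDDINGS_demo.zip`:

```python
# Sketch: inspect an embeddings archive's datasets and provenance attrs.
# Assumes zarr-python >= 3; ArrayStorage offers a higher-level route.
import zarr

store = zarr.storage.ZipStore("./EMBEDDINGS_demo.zip", mode="r")
root = zarr.open_group(store, mode="r")

print(dict(root.attrs))                          # the metadata listed above
emb = root[root.attrs["frame_embeddings_name"]]  # FRAME_EMBEDDINGS dataset
print(emb.shape, emb.dtype)                      # (T, N, D), float32
store.close()
```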
@@ -181,10 +206,10 @@ rin_builder.build_rin(
      molecule_of_interest=1,
      frame_range=(1, 100),
      frame_batch_size=10,
-     prune_low_energies_frac=0.3,
+     prune_low_energies_frac=0.85,
      output_path=rin_path,
      include_attractive=True,
-     include_repulsive=False,
+     include_repulsive=False
  )

  # 2. Sample walks from the RIN
@@ -192,52 +217,43 @@ walker = Walker(rin_path, seed=123)
  walks_path = Path("./WALKS_demo.zip")
  walker.sample_walks(
      walk_length=16,
-     walks_per_node=32,
+     walks_per_node=100,
      saw_frac=0.25,
      include_attractive=True,
      include_repulsive=False,
      time_aware=False,
      output_path=walks_path,
-     in_parallel=False,
+     in_parallel=False
  )
  walker.close()

  # 3. Train embeddings per frame (PyTorch backend)
  import torch

- embedder = Embedder(walks_path, base="torch", seed=999)
+ embedder = Embedder(walks_path, seed=999)
  embeddings_path = embedder.embed_all(
      RIN_type="attr",
      using="merged",
+     num_epochs=10,
+     negative_sampling=False,
      window_size=4,
-     num_negative_samples=5,
-     num_epochs=5,
-     batch_size=1024,
-     dimensionality=128,
-     shuffle_data=True,
-     output_path="./EMBEDDINGS_demo.zip",
-     sgns_kwargs={
-         "optim": torch.optim.Adam,
-         "optim_kwargs": {"lr": 1e-3},
-         "lr_sched": torch.optim.lr_scheduler.LambdaLR,
-         "lr_sched_kwargs": {"lr_lambda": lambda _: 1.0},
-         "device": "cuda" if torch.cuda.is_available() else "cpu",
-     },
+     device="cuda" if torch.cuda.is_available() else "cpu",
+     model_base="torch",
+     output_path="./EMBEDDINGS_demo.zip"
  )
  print("Embeddings written to", embeddings_path)
  ```

- > For the PureML backend, supply the relevant optimiser and scheduler via `sgns_kwargs`
- > (for example `optim=pureml.optimizers.Adam`, `lr_sched=pureml.optimizers.CosineAnnealingLR`).
+ > For the PureML backend, set `model_base="pureml"` and pass the optimizer / scheduler classes inside `model_kwargs`.

  ---

- ## Visualisation
+ ## Visualization

  ```python
  from sawnergy.visual import Visualizer

- v = sawnergy.visual.Visualizer("./RIN_demo.zip")
+ v = Visualizer("./RIN_demo.zip")
  v.build_frame(1,
      node_colors="rainbow",
      displayed_nodes="ALL",
@@ -250,14 +266,20 @@ v.build_frame(1,

  `Visualizer` lazily loads datasets and works even in headless environments (falls back to the `Agg` backend).

+ ```python
+ from sawnergy.embedding import Visualizer
+
+ viz = Visualizer("./EMBEDDINGS_demo.zip", normalize_rows=True)
+ viz.build_frame(1, show=True)
+ ```
+
  ---

  ## Advanced Notes

  - **Time-aware walks**: Set `time_aware=True`, provide `stickiness` and `on_no_options` when calling `Walker.sample_walks`.
  - **Shared memory lifecycle**: Call `Walker.close()` (or use a context manager) to release shared-memory segments.
- - **PureML vs PyTorch**: Choose the backend via `Embedder(..., base="pureml"|"torch")` and provide backend-specific
- constructor kwargs through `sgns_kwargs` (optimizer, scheduler, device).
+ - **PureML vs PyTorch**: Select the backend at call time with `model_base="pureml"|"torch"` (defaults to `"pureml"`) and pass optimizer / scheduler overrides through `model_kwargs` (see the sketch after this list).
  - **ArrayStorage utilities**: Use `ArrayStorage` directly to peek into archives, append arrays, or manage metadata.

  ---
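A sketch of the PureML path referenced in the note and bullet above. The `model_kwargs` key names (`optim`, `optim_kwargs`, `lr_sched`) and the `pureml.optimizers` classes are assumptions carried over from the previous revision's `sgns_kwargs` example; verify them against the installed `pureml` and the current `Embedder` docstrings.

```python
# PureML-backed sketch. Key names and optimizer/scheduler classes below are
# assumptions inherited from the old sgns_kwargs example; verify before use.
import pureml
from sawnergy.embedding import Embedder

embedder = Embedder("./WALKS_demo.zip", seed=999)
embeddings_path = embedder.embed_all(
    RIN_type="attr",
    using="merged",
    num_epochs=10,
    window_size=4,
    model_base="pureml",  # the default backend
    model_kwargs={
        "optim": pureml.optimizers.Adam,
        "optim_kwargs": {"lr": 1e-3},
        "lr_sched": pureml.optimizers.CosineAnnealingLR,
    },
    output_path="./EMBEDDINGS_pureml.zip",
)
```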
@@ -268,8 +290,9 @@ v.build_frame(1,
  ├── sawnergy/
  │   ├── rin/          # RINBuilder and cpptraj integration helpers
  │   ├── walks/        # Walker class and shared-memory utilities
- │   ├── embedding/    # Embedder + SGNS backends (PureML / PyTorch)
+ │   ├── embedding/    # Embedder + SG/SGNS backends (PureML / PyTorch)
  │   ├── visual/       # Visualizer and palette utilities
+ │   │
  │   ├── logging_util.py
  │   └── sawnergy_util.py

@@ -278,7 +301,7 @@ v.build_frame(1,

  ---

- ## Acknowledgements
+ ## Acknowledgments

  SAWNERGY builds on the AmberTools `cpptraj` ecosystem, NumPy, Matplotlib, Zarr, and PyTorch (for GPU acceleration if necessary; PureML is available by default).
  Big thanks to the upstream communities whose work makes this toolkit possible.