grasp-tool 0.1.0__tar.gz

@@ -0,0 +1,511 @@
1
+ Metadata-Version: 2.3
2
+ Name: grasp-tool
3
+ Version: 0.1.0
4
+ Summary: GRASP: a graph-based representation learning toolkit for spatial transcriptomics
5
+ Author: wzx
6
+ Author-email: unknown@example.com
7
+ Requires-Python: >=3.9,<3.13
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Python :: 3.9
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Provides-Extra: ot
14
+ Requires-Dist: POT (>=0.9,<0.10) ; extra == "ot"
15
+ Requires-Dist: joblib (>=1.3,<2.0)
16
+ Requires-Dist: leidenalg (>=0.10,<0.11)
17
+ Requires-Dist: matplotlib (>=3.7,<4.0)
18
+ Requires-Dist: networkx (>=3.2,<4.0)
19
+ Requires-Dist: numpy (>=1.26,<2.0)
20
+ Requires-Dist: pandas (>=2.0,<3.0)
21
+ Requires-Dist: python-igraph (>=0.11,<0.12)
22
+ Requires-Dist: scikit-learn (>=1.0,<2.0)
23
+ Requires-Dist: scipy (>=1.9,<2.0)
24
+ Requires-Dist: seaborn (>=0.13,<0.14)
25
+ Requires-Dist: shapely (>=2.0,<3.0)
26
+ Requires-Dist: tqdm (>=4.0,<5.0)
27
+ Requires-Dist: umap-learn (>=0.5,<0.6)
28
+ Description-Content-Type: text/markdown
29
+
30
+ # GRASP_pack
31
+
32
+ This repository is being reorganized into an installable Python package.
33
+
34
+ - Distribution name: `grasp-tool`
35
+ - Import name: `grasp_tool`
36
+
37
+ ## Repo layout
38
+
39
+ - `grasp_tool/`: GRASP package source (canonical code)
40
+ - `utils_code/`, `gnn_model/`: backward-compatible wrappers (legacy paths)
41
+ - `ELLA/`: third-party reference project (NOT part of GRASP; do not package)
42
+ - `envs/`: conda environment exports for reproducibility (runtime baseline)
43
+
44
+ ## Installation (recommended)
45
+
46
+ The recommended workflow is:
47
+
48
+ - Use conda/mamba to create a stable Python environment (and GPU toolchain if needed)
49
+ - Use pip to install the GRASP package from PyPI (`grasp-tool`)
50
+
51
+ ### 1) Create a conda environment
52
+
53
+ ```bash
54
+ conda create -n grasp python=3.9 -y
55
+ conda activate grasp
56
+ ```
57
+
58
+ ### 2) Install GRASP from PyPI
59
+
60
+ ```bash
61
+ pip install grasp-tool
62
+ ```
63
+
64
+ Smoke checks (should work without training deps):
65
+
66
+ ```bash
67
+ grasp-tool --help
68
+ grasp-tool train-moco --help
69
+ ```
70
+
71
+ If you're running from this repo checkout, you can also run a small demo:
72
+
73
+ ```bash
74
+ grasp-tool register \
75
+ --pkl_file example_pkl/simulated1_data_dict.pkl \
76
+ --output_pkl outputs/simulated1_registered.pkl
77
+ ```
78
+
79
+ ## Tiny demo (end-to-end smoke test)
80
+
81
+ If you have a repo checkout, you can run a fast end-to-end smoke test on a small
82
+ subset of `example_pkl/simulated1_data_dict.pkl`.
83
+
84
+ This demo runs:
85
+
86
+ - `register` (with `--nc_demo 4`)
87
+ - create a tiny `df_registered` subset + `pairs.csv`
88
+ - `portrait` (JS distances)
89
+ - `partition-graphs`
90
+ - `augment-graphs`
91
+ - `build-train-pkl`
92
+ - `train-moco` for 1 epoch
93
+
94
+ Run:
95
+
96
+ ```bash
97
+ # Default: uses conda env name "grasp"
98
+ bash scripts/tiny_demo_example_pkl.sh
99
+ ```
100
+
101
+ Override the conda env name (useful if you have multiple envs):
102
+
103
+ ```bash
104
+ GRASP_CONDA_ENV=<your_env_name> bash scripts/tiny_demo_example_pkl.sh
105
+ ```
106
+
107
+ Outputs are written under:
108
+
109
+ - `outputs/tiny_demo_example_pkl_<timestamp>/`
110
+
111
+ Notes:
112
+
113
+ - The `build-train-pkl` and `train-moco` steps require `torch` and `torch-geometric`.
114
+ - `example_pkl/` is excluded from PyPI release artifacts; this demo is intended for
115
+ repo checkouts.
116
+ - The `scripts/` directory is not part of the installed PyPI wheel. If you installed
117
+ via `pip install grasp-tool`, you need to clone this repo to use this demo script.
118
+
119
+ ### 3) Training dependencies (NOT installed by pip)
120
+
121
+ `pip install grasp-tool` intentionally does NOT pull in the training stack.
122
+
123
+ If you want to run training-related commands:
124
+
125
+ - `build-train-pkl` (needs `torch` + `torch-geometric`)
126
+ - `train-moco` (needs `torch` + `torch-geometric`)
127
+
128
+ Install them following the official PyTorch / PyTorch Geometric (PyG) instructions.
129
+ We recommend installing them via conda inside the same env.
130
+
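Because the training stack is installed separately, it can help to confirm that it is importable before running `build-train-pkl` or `train-moco`. The following is a minimal sketch, not part of `grasp-tool`; the helper name `missing_modules` is hypothetical, and `torch_geometric` is the import name of the `torch-geometric` distribution.

```python
import importlib.util

# Import names of the training stack described above.
TRAINING_MODULES = ("torch", "torch_geometric")


def missing_modules(names=TRAINING_MODULES):
    """Return the top-level modules in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Training dependencies not found:", ", ".join(missing))
    else:
        print("Training dependencies found.")
```

The check only locates the modules; it does not verify that the installed versions are compatible with each other or with your GPU toolchain.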
131
+ ## Development (maintainers)
132
+
133
+ Poetry is only needed for development and release.
134
+
135
+ ```bash
136
+ poetry install
137
+ poetry run grasp-tool --help
138
+ ```
139
+
140
+ ## Environment (reproducibility baseline)
141
+
142
+ These files are meant for *reproducing a known-good runtime environment*, not for
143
+ publishing minimal dependencies:
144
+
145
+ - `envs/grasp-conda.yml`: full export (large)
146
+ - `envs/grasp-conda-from-history.yml`: minimal export (often needs extra installs)
147
+
148
+ Typical usage:
149
+
150
+ ```bash
151
+ conda env create -f envs/grasp-conda.yml
152
+ conda activate grasp
153
+ ```
154
+
155
+ Optional extras (only needed for non-core utilities):
156
+
157
+ - Optimal Transport utilities (`POT` / import name `ot`): `pip install "grasp-tool[ot]"` (the quotes keep shells such as zsh from interpreting the brackets)
158
+
159
+ ## Full pipeline (from scratch)
160
+
161
+ The recommended example input is:
162
+
163
+ - `example_pkl/simulated1_data_dict.pkl`
164
+
165
+ This PKL contains the raw inputs required by `register` (and also already includes
166
+ `df_registered`; the steps below still show how to run the full pipeline end-to-end).
167
+
168
+ ### 0) Register (coordinate normalization)
169
+
170
+ Input: a PKL dict containing at least:
171
+
172
+ - `data_df` (DataFrame; must include `cell`, `type`, `centerX`, `centerY`, `x`, `y`)
173
+ - `cell_mask_df` (DataFrame; columns: `cell`, `x`, `y`)
174
+ - `nuclear_boundary` (dict: cell -> DataFrame with columns `x`, `y`)
175
+
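The input requirements above can be checked before calling `register`. This is a rough sketch under stated assumptions: `check_register_input` is a hypothetical helper, not part of `grasp-tool`, and it only checks the keys and column names listed above (anything else `register` may need is not covered).

```python
# Required keys and columns, copied from the list above.
REQUIRED_COLUMNS = {
    "data_df": ("cell", "type", "centerX", "centerY", "x", "y"),
    "cell_mask_df": ("cell", "x", "y"),
}
BOUNDARY_COLUMNS = ("x", "y")


def _columns(frame):
    """Column names of a DataFrame-like object (empty set if it has none)."""
    return set(getattr(frame, "columns", ()))


def check_register_input(data):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, required in REQUIRED_COLUMNS.items():
        if key not in data:
            problems.append(f"missing key: {key}")
            continue
        present = _columns(data[key])
        problems += [f"{key} lacks column: {c}" for c in required if c not in present]
    if "nuclear_boundary" not in data:
        problems.append("missing key: nuclear_boundary")
    elif not isinstance(data["nuclear_boundary"], dict):
        problems.append("nuclear_boundary must be a dict (cell -> DataFrame)")
    else:
        for cell, frame in data["nuclear_boundary"].items():
            present = _columns(frame)
            problems += [
                f"nuclear_boundary[{cell!r}] lacks column: {c}"
                for c in BOUNDARY_COLUMNS
                if c not in present
            ]
    return problems
```

A typical use would be to load the raw PKL with `pickle.load` and print whatever `check_register_input` returns before starting a long run.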
176
+ Run:
177
+
178
+ ```bash
179
+ python -m grasp_tool register \
180
+ --pkl_file example_pkl/simulated1_data_dict.pkl \
181
+ --output_pkl outputs/simulated1_registered.pkl
182
+ ```
183
+
184
+ Output: a PKL dict whose keys include:
185
+
186
+ - `df_registered`
187
+ - `nuclear_boundary_df_registered`
188
+ - `cell_radii`
189
+ - `cell_nuclear_stats`
190
+
191
+ ### 0.5) (Optional) Cell/gene visualization (cellplot)
192
+
193
+ This is optional and mainly used for sanity-checking transcript spatial patterns.
194
+
195
+ Notes:
196
+
197
+ - This command can generate a large number of images if you do not restrict `--cells` and `--genes`.
198
+ - The input PKL must contain the required keys for the selected `--mode`.
199
+ - `--mode raw-cell`: expects `cell_boundary` (and optionally `nuclear_boundary`) in the raw PKL.
200
+ If your raw PKL does not have `cell_boundary`, use `--mode registered-gene` instead.
201
+
202
+ Registered-gene plots (recommended; uses `df_registered`):
203
+
204
+ ```bash
205
+ python -m grasp_tool cellplot \
206
+ --mode registered-gene \
207
+ --pkl outputs/simulated1_registered.pkl \
208
+ --output_dir outputs/cellplot \
209
+ --dataset simulated1 \
210
+ --cells cell_11 \
211
+ --genes gene_6_2_1,gene_6_3_1 \
212
+ --with_nuclear 1
213
+ ```
214
+
215
+ Raw cell boundary plots (uses `cell_boundary` / `nuclear_boundary` in the raw PKL):
216
+
217
+ ```bash
218
+ python -m grasp_tool cellplot \
219
+ --mode raw-cell \
220
+ --pkl example_pkl/simulated1_data_dict.pkl \
221
+ --output_dir outputs/cellplot_raw \
222
+ --dataset simulated1
223
+ ```
224
+
225
+ ### 1) (Optional) JS distance
226
+
227
+ ```bash
228
+ python -m grasp_tool portrait \
229
+ --pkl_file outputs/simulated1_registered.pkl \
230
+ --output_dir outputs/portrait \
231
+ --use_same_r \
232
+ --visualize_top_n 0 \
233
+ --auto_params
234
+ ```
235
+
236
+ ### 2) Partition + build per-cell graphs (node/adj CSV)
237
+
238
+ ```bash
239
+ python -m grasp_tool partition-graphs \
240
+ --pkl outputs/simulated1_registered.pkl \
241
+ --graph_root outputs/graphs \
242
+ --n_sectors 20 \
243
+ --m_rings 10 \
244
+ --k_neighbor 5
245
+ ```
246
+
247
+ You can restrict scope for a quick smoke test:
248
+
249
+ ```bash
250
+ python -m grasp_tool partition-graphs \
251
+ --pkl outputs/simulated1_registered.pkl \
252
+ --graph_root outputs/graphs_demo \
253
+ --cells cell_11,cell_135 \
254
+ --genes gene_0_0_0,gene_0_1_0
255
+ ```
256
+
257
+ ### 3) Graph augmentation
258
+
259
+ ```bash
260
+ python -m grasp_tool augment-graphs \
261
+ --graph_root outputs/graphs \
262
+ --dropout_ratio 0.1 \
263
+ --seed 2025
264
+ ```
265
+
266
+ ### 4) Build training PKL
267
+
268
+ You need a `pairs.csv` with columns: `cell,gene`.
269
+
270
+ Example (generate pairs from `df_registered`):
271
+
272
+ ```bash
273
+ python -c "import pickle, pandas as pd; d=pickle.load(open('outputs/simulated1_registered.pkl','rb')); pairs=d['df_registered'][['cell','gene']].drop_duplicates(); pairs.to_csv('outputs/pairs.csv', index=False); print('wrote outputs/pairs.csv', len(pairs))"
274
+ ```
275
+
276
+ Then:
277
+
278
+ ```bash
279
+ python -m grasp_tool build-train-pkl \
280
+ --pairs_csv outputs/pairs.csv \
281
+ --graph_root outputs/graphs \
282
+ --output_pkl outputs/train.pkl \
283
+ --dataset simulated1
284
+ ```
285
+
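If `pairs.csv` comes from somewhere other than the one-liner above, a quick pre-flight check of its header can save a failed run. The sketch below uses only the standard library; `read_pairs` is a hypothetical helper, not a `grasp-tool` function, and it assumes the file has a header row.

```python
import csv


def read_pairs(path):
    """Read pairs.csv and return unique (cell, gene) tuples in file order."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in ("cell", "gene") if col not in header]
        if missing:
            raise ValueError(f"{path}: missing column(s): {', '.join(missing)}")
        unique = {}
        for row in reader:
            # dict keys keep insertion order, so duplicates are dropped in place
            unique.setdefault((row["cell"], row["gene"]), None)
    return list(unique)
```

Extra columns are ignored here; whether `build-train-pkl` tolerates them is not stated in this document.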
286
+ ### 5) Train (MoCo)
287
+
288
+ Note: this stage requires `torch` and `torch-geometric`, which are NOT installed by
289
+ `pip install grasp-tool`. Install them via conda first.
290
+
291
+ ```bash
292
+ python -m grasp_tool train-moco \
293
+ --dataset simulated1 \
294
+ --pkl outputs/train.pkl \
295
+ --js 0 \
296
+ --n 20 \
297
+ --m 10 \
298
+ --num_epoch 300 \
299
+ --batch_size 64 \
300
+ --cuda_device 0 \
301
+ --output_dir outputs/embeddings
302
+ ```
303
+
304
+ If you want to use JS for positive sampling:
305
+
306
+ ```bash
307
+ python -m grasp_tool train-moco \
308
+ --dataset simulated1 \
309
+ --pkl outputs/train.pkl \
310
+ --js 1 \
311
+ --js_file outputs/portrait/js_distances_*.csv
312
+ ```
313
+
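The `js_distances_*.csv` argument above is a glob, which the shell expands before `grasp-tool` sees it. If several `portrait` runs wrote to the same directory, it may expand to more than one path, and how `--js_file` handles that is not documented here. A small sketch (the helper name `resolve_single` is hypothetical) that resolves the pattern to exactly one file first:

```python
import glob


def resolve_single(pattern):
    """Return the only path matching `pattern`; raise if none or several match."""
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one match for {pattern!r}, found {len(matches)}: {matches}"
        )
    return matches[0]
```

The returned path can then be passed to `--js_file` explicitly instead of the wildcard.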
314
+ ## Outputs
315
+
316
+ This section summarizes what each stage writes to disk.
317
+
318
+ ### register
319
+
320
+ Command:
321
+
322
+ ```bash
323
+ python -m grasp_tool register --pkl_file <raw.pkl> --output_pkl <registered.pkl>
324
+ ```
325
+
326
+ Output (`<registered.pkl>` is a dict with):
327
+
328
+ - `df_registered` (DataFrame): normalized transcript coordinates; contains at least `cell,gene,x_c_s,y_c_s` and also keeps original columns.
329
+ - `nuclear_boundary_df_registered` (DataFrame): normalized nucleus boundary points per cell (contains `cell,x_c_s,y_c_s` plus intermediate columns).
330
+ - `cell_radii` (dict): per-cell radius used by downstream partitioning.
331
+ - `cell_nuclear_stats` (DataFrame): per-cell statistics on nucleus points that exceed the cell boundary (`exceed_percent`, `exceed_count`, `num_nuclear_points`).
332
+ - `meta` (dict): run metadata.
333
+
334
+ ### cellplot (optional)
335
+
336
+ Command:
337
+
338
+ ```bash
339
+ python -m grasp_tool cellplot --mode <raw-cell|registered-gene> --pkl <input.pkl> --output_dir <dir>
340
+ ```
341
+
342
+ Output (`<dir>`):
343
+
344
+ - Raw mode (`--mode raw-cell`): writes per-cell boundary plots under `1_<dataset>_raw_cell_plot/`.
345
+ - Registered mode (`--mode registered-gene`): writes per-cell/per-gene scatter plots under `<dataset>/registered_gene/<cell>/`.
346
+
347
+ ### portrait (optional)
348
+
349
+ Command:
350
+
351
+ ```bash
352
+ python -m grasp_tool portrait --pkl_file <registered.pkl> --output_dir <dir>
353
+ ```
354
+
355
+ Output (`<dir>`):
356
+
357
+ - `js_distances_*.csv`: JS distance table used for positive sampling (when `train-moco --js 1`).
358
+
359
+ ### partition-graphs
360
+
361
+ Command:
362
+
363
+ ```bash
364
+ python -m grasp_tool partition-graphs --pkl <registered.pkl> --graph_root <graph_root>
365
+ ```
366
+
367
+ Output directory layout (`<graph_root>`):
368
+
369
+ - `<graph_root>/<cell>/<gene>_node_matrix.csv`
370
+ - `<graph_root>/<cell>/<gene>_adj_matrix.csv`
371
+ - `<graph_root>/<cell>/<gene>_dis_matrix.csv`
372
+
373
+ These CSVs are the on-disk graph representation consumed by `build-train-pkl`.
374
+
375
+ ### augment-graphs
376
+
377
+ Command:
378
+
379
+ ```bash
380
+ python -m grasp_tool augment-graphs --graph_root <graph_root>
381
+ ```
382
+
383
+ Output directory layout:
384
+
385
+ - `<graph_root>/<cell>_aug/<gene>_node_matrix.csv`
386
+ - `<graph_root>/<cell>_aug/<gene>_adj_matrix.csv`
387
+
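Given the layouts listed for `partition-graphs` and `augment-graphs`, one can check which graph files exist for a set of `(cell, gene)` pairs before building the training PKL. This is an illustrative sketch only: `missing_graph_files` is a hypothetical helper, and it assumes every pair is expected to have all of the listed files, which may not hold if a stage skips some pairs.

```python
from pathlib import Path

# File suffixes per directory, following the layouts listed above.
ORIGINAL_SUFFIXES = ("_node_matrix.csv", "_adj_matrix.csv", "_dis_matrix.csv")
AUGMENTED_SUFFIXES = ("_node_matrix.csv", "_adj_matrix.csv")


def missing_graph_files(pairs, graph_root):
    """Return the expected graph CSV paths that do not exist on disk."""
    root = Path(graph_root)
    missing = []
    for cell, gene in pairs:
        expected = [root / cell / f"{gene}{suffix}" for suffix in ORIGINAL_SUFFIXES]
        expected += [
            root / f"{cell}_aug" / f"{gene}{suffix}" for suffix in AUGMENTED_SUFFIXES
        ]
        missing.extend(path for path in expected if not path.is_file())
    return missing
```

For example, `missing_graph_files([("cell_11", "gene_0_0_0")], "outputs/graphs")` returns an empty list when all five files for that pair are present.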
388
+ ### build-train-pkl
389
+
390
+ Command:
391
+
392
+ ```bash
393
+ python -m grasp_tool build-train-pkl --pairs_csv <pairs.csv> --graph_root <graph_root> --output_pkl <train.pkl>
394
+ ```
395
+
396
+ Output (`<train.pkl>` is a dict with):
397
+
398
+ - `original_graphs`: list of `torch_geometric.data.Data`
399
+ - `augmented_graphs`: list of `torch_geometric.data.Data`
400
+ - `gene_labels`, `cell_labels`: aligned labels for each graph
401
+ - `meta`: dataset tag + graph parameters + `pairs.csv` path
402
+
403
+ ### train-moco
404
+
405
+ Command:
406
+
407
+ ```bash
408
+ python -m grasp_tool train-moco --dataset <name> --pkl <train.pkl> --output_dir <out_root>
409
+ ```
410
+
411
+ Output directory layout (`<out_root>`):
412
+
413
+ - `<out_root>/<run_id>/1_training_config.json`: the full resolved args snapshot
414
+ - `<out_root>/<run_id>/epoch{E}_lr{LR}_embedding.csv`: **main representation output**
415
+ - columns: `feature_1..feature_d, cell, gene`
416
+ - checkpoints:
417
+ - `<out_root>/<run_id>/epoch_{E}_lr_{LR}_checkpoint.pth`
418
+ - (optional) `<out_root>/<run_id>/best_model_epoch_{E}_lr_{LR}.pth`
419
+ - best summary (only when clustering is enabled via `--num_clusters`):
420
+ - `<out_root>/<run_id>/best_metrics_lr{LR}.json`
421
+ - `<out_root>/<run_id>/best_{vis_method}_{cluster_method}_lr{LR}.png`
422
+ - evaluation / visualization (from `grasp_tool/gnn/plot_refined.py`):
423
+ - `<out_root>/<run_id>/epoch{E}_lr{LR}_metrics*.txt`
424
+ - `<out_root>/<run_id>/epoch{E}_lr{LR}_clusters*.csv`
425
+ - `<out_root>/<run_id>/epoch{E}_lr{LR}_visualization*.png`
426
+ - `<out_root>/<run_id>/ALL_COMPLETED.txt`: written after all learning rates finish
427
+
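To consume the main output, a script needs to find the embedding CSV for a run and split the `feature_*` columns from the `cell` and `gene` labels. The sketch below relies only on the file names and columns listed above; the helper names are hypothetical, and the learning-rate part of the file name is kept as text because its exact formatting is not specified here.

```python
import csv
import re
from pathlib import Path

EMBEDDING_RE = re.compile(r"^epoch(\d+)_lr(.+)_embedding\.csv$")


def run_finished(run_dir):
    """True if the ALL_COMPLETED.txt marker exists in the run directory."""
    return (Path(run_dir) / "ALL_COMPLETED.txt").is_file()


def latest_embedding(run_dir):
    """Return (epoch, lr_text, path) for the highest-epoch embedding, or None."""
    found = []
    for path in Path(run_dir).iterdir():
        match = EMBEDDING_RE.match(path.name)
        if match:
            found.append((int(match.group(1)), match.group(2), path))
    # Ties between learning rates at the same epoch are broken by lr text.
    return max(found, key=lambda item: (item[0], item[1])) if found else None


def load_embedding(path):
    """Split rows into feature vectors and (cell, gene) labels."""
    features, labels = [], []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        feature_cols = [c for c in (reader.fieldnames or []) if c.startswith("feature_")]
        for row in reader:
            features.append([float(row[c]) for c in feature_cols])
            labels.append((row["cell"], row["gene"]))
    return features, labels
```

When several learning rates were trained, you will usually want to filter by `lr_text` rather than take the overall maximum.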
428
+ ## Key parameters (quick reference)
429
+
430
+ In general, you can always inspect the full list via:
431
+
432
+ ```bash
433
+ python -m grasp_tool --help
434
+ python -m grasp_tool <command> --help
435
+ ```
436
+
437
+ ### register
438
+
439
+ - `--pkl_file`: input raw data dict PKL
440
+ - `--output_pkl`: output registered PKL (will contain `df_registered`)
441
+ - `--nc_demo`: process only first N cells (smoke test)
442
+ - `--chunk_size`: multiprocessing chunk size (speed/memory tradeoff)
443
+ - `--clip_to_cell`: `1` to clip nucleus to cell boundary; `0` to keep outside points
444
+ - `--remove_outliers`: `1` to drop nucleus points exceeding boundary
445
+ - `--epsilon`: small constant for numerical stability
446
+
447
+ ### cellplot
448
+
449
+ - `--mode`: `raw-cell` or `registered-gene`
450
+ - `--pkl` / `--pkl_file`: input PKL path
451
+ - `--output_dir`: output directory root
452
+ - `--dataset`: dataset tag used in output paths (optional)
453
+ - `--cells`: restrict to a comma-separated subset of cells (recommended)
454
+ - `--genes`: restrict to a comma-separated subset of genes (registered-gene only; recommended)
455
+ - `--with_nuclear`: `1` to plot nucleus boundary if present, `0` to disable (registered-gene only)
456
+
457
+ ### portrait
458
+
459
+ This command is a pass-through wrapper. Common knobs:
460
+
461
+ - `--auto_params`: auto-select `r_min/r_max/bin_size`
462
+ - `--use_same_r`: enforce the same `r` within each gene
463
+ - `--max_count`, `--transcript_window`: reduce compute for large datasets
464
+ - `--output_dir`: control where `js_distances_*.csv` is written
465
+
466
+ ### partition-graphs
467
+
468
+ - `--pkl`: registered PKL (must contain `df_registered`)
469
+ - `--graph_root`: output root directory
470
+ - `--n_sectors`, `--m_rings`: partition resolution
471
+ - `--k_neighbor`: kNN graph connectivity
472
+ - `--cells`, `--genes`: restrict scope (smoke test)
473
+ - `--epsilon`: boundary classification tolerance
474
+
475
+ ### augment-graphs
476
+
477
+ - `--graph_root`: directory created by `partition-graphs`
478
+ - `--dropout_ratio`: node dropout probability
479
+ - `--seed`: make augmentation deterministic
480
+ - `--angle_min`, `--angle_max`: rotation angle range (degrees)
481
+
482
+ ### build-train-pkl
483
+
484
+ - `--pairs_csv`: CSV with columns `cell,gene`
485
+ - `--graph_root`: directory created by `partition-graphs` (and augmented by `augment-graphs`)
486
+ - `--output_pkl`: training PKL consumed by `train-moco`
487
+ - `--dataset`: dataset tag stored in metadata
488
+ - `--processes`: multiprocessing workers
489
+
490
+ ### train-moco
491
+
492
+ This command runs the packaged training entrypoint (`grasp_tool.cli.train_moco`).
493
+
494
+ - `--pkl`: training PKL built by `build-train-pkl`
495
+ - `--output_dir`: output root directory
496
+ - `--lrs`: learning rate list (e.g. `--lrs 0.001` or `--lrs 0.001 0.002`)
497
+ - `--use_gradient_clipping`: `1` (default) to clip gradients, `0` to disable
498
+ - `--gradient_clip_norm`: max norm for gradient clipping
499
+ - `--js` + `--js_file`: use JS distance for positive sampling
500
+ - `--n`, `--m`: must match partition settings
501
+ - `--seed`: random seed for reproducibility
502
+ - `--num_epoch`, `--batch_size`: training schedule
503
+ - `--cuda_device`: GPU index
504
+ - `--num_clusters`: number of clusters used in clustering evaluation (for very small datasets, set it to at most the number of graphs)
505
+
506
+ ## Reproducibility tips
507
+
508
+ - Always record: `n_sectors/m_rings/k_neighbor/dropout_ratio/seed` and the exact `pairs.csv`.
509
+ - Prefer writing all outputs under `outputs/` (or a dedicated run directory).
510
+ - For large runs, use tmux/screen; training can be slow due to evaluation + visualization.
511
+
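One way to follow the first tip is to write a small manifest next to each run. The sketch below stores the graph and augmentation parameters together with a SHA-256 digest of `pairs.csv`, so the exact file can be identified later. The helper `write_run_manifest` and the manifest layout are assumptions for illustration; `grasp-tool` does not define them.

```python
import hashlib
import json
from pathlib import Path


def write_run_manifest(out_path, pairs_csv, **params):
    """Write run parameters plus a SHA-256 digest of pairs.csv as JSON."""
    digest = hashlib.sha256(Path(pairs_csv).read_bytes()).hexdigest()
    manifest = {
        "pairs_csv": str(pairs_csv),
        "pairs_csv_sha256": digest,
        "params": params,
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

With the values used in the pipeline above, a call might look like `write_run_manifest("outputs/run_manifest.json", "outputs/pairs.csv", n_sectors=20, m_rings=10, k_neighbor=5, dropout_ratio=0.1, seed=2025)`, where the manifest file name is arbitrary.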