scatrans 0.8.0.dev0__tar.gz → 0.9.1.dev0__tar.gz
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/CHANGELOG.md +40 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/MANIFEST.in +9 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/PKG-INFO +167 -34
- scatrans-0.8.0.dev0/src/scatrans.egg-info/PKG-INFO → scatrans-0.9.1.dev0/README.md +161 -74
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/pyproject.toml +4 -3
- {scatrans-0.8.0.dev0/src → scatrans-0.9.1.dev0}/scatrans.egg-info/SOURCES.txt +0 -6
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/setup.cfg +1 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/__init__.py +4 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_de.py +30 -21
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_permutation.py +43 -13
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_utils.py +178 -17
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_velocity.py +36 -7
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_version.py +3 -3
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/enrich.py +665 -9
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/pl.py +362 -44
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/tl.py +361 -128
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/tests/test_basic.py +183 -7
- scatrans-0.8.0.dev0/README.md +0 -749
- scatrans-0.8.0.dev0/src/scatrans.egg-info/dependency_links.txt +0 -1
- scatrans-0.8.0.dev0/src/scatrans.egg-info/entry_points.txt +0 -2
- scatrans-0.8.0.dev0/src/scatrans.egg-info/requires.txt +0 -31
- scatrans-0.8.0.dev0/src/scatrans.egg-info/top_level.txt +0 -1
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.github/workflows/ci.yml +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.github/workflows/publish.yml +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.gitignore +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/LICENSE +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/memento_de_example.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/real_data_template.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/synthetic_active_transcription.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_bias.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Hs_GO_Biological_Process_2026.txt +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Hs_KEGG_2026.txt +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mm_GO_Biological_Process_2026.txt +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mm_KEGG_2026.txt +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mus_musculus.GRCm39.115_gene_features.parquet +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/README.md +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/mouse_2020A_gene_features.parquet +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/generate_gene_features.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/pp_bias.py +0 -0
- {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/qc.py +0 -0
|
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
|
|
|
5
5
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
6
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
7
|
|
|
8
|
+
## [0.9.0] - 2026-06-19
|
|
9
|
+
|
|
10
|
+
### Added
|
|
11
|
+
- `run_gsea(ranked_genes, gene_sets=..., nperm=..., ...)` — pre-ranked GSEA (via gseapy.prerank wrapper).
|
|
12
|
+
Reuses the same gene-set loading, `gene_case`, diagnostics, and `.attrs` system as ORA.
|
|
13
|
+
Returns DataFrame with `NES`, `ES`, `pvalue`, `p.adjust`, `leading_edge`, etc.
|
|
14
|
+
Optional dependency: `pip install "scatrans[gsea]"`.
|
|
15
|
+
- `scat.pl.gseaplot(ranked_genes, gsea_result, term=...)` — classic GSEA running enrichment score plot.
|
|
16
|
+
Automatically uses pre-computed RES curves + hits stored in `run_gsea` results (`.attrs["gsea_details"]`).
|
|
17
|
+
- `enrich_dotplot` now auto-detects GSEA results (defaults `x="NES"`, uses diverging colormap when `color_by="NES"`).
|
|
18
|
+
- Added `gsea` extra in `pyproject.toml`.
|
|
19
|
+
|
|
20
|
+
### Changed
|
|
21
|
+
- Minor internal cleanups and test coverage for the new GSEA path.
|
|
22
|
+
- All new functions follow the existing consistent signatures (`ax=`, `use_style=`, `save_path=`, etc.).
|
|
23
|
+
|
|
8
24
|
## [0.8.0] - 2026-06-14
|
|
9
25
|
|
|
10
26
|
### Added (enrichment module — major paper-readiness upgrade)
|
|
@@ -30,6 +46,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
30
46
|
- README and docstrings extensively updated with manuscript-export examples, `run_go`, provenance details, and `adjust_across_all` guidance.
|
|
31
47
|
- Full test coverage for new paths (per-ontology attrs, within_ontology p.adjust, save+tsv+dir creation, expand with Ontology, dual-cutoff warning, etc.). All tests pass.
|
|
32
48
|
|
|
49
|
+
## [0.9.0] - 2026-06-18
|
|
50
|
+
|
|
51
|
+
### Added
|
|
52
|
+
- **Independent permutation statistics for unspliced excess**: `unspliced_excess_pval` and `unspliced_excess_fdr` (one-sided test on bias-corrected `unspliced_excess_residual`). Computed alongside existing `active_score_pval` / `active_score_fdr` when `use_permutation=True`.
|
|
53
|
+
- New parameter `unspliced_excess_fdr_cutoff` (default 0.05) for the built-in `significant` gene list and `filter_active_genes`.
|
|
54
|
+
- `filter_active_genes` parameters `unspliced_excess_residual_cutoff` and `unspliced_excess_fdr_cutoff`; heuristic/pseudobulk presets updated accordingly.
|
|
55
|
+
- `adata.uns["scatrans"]["significant_criteria"]` metadata documenting the built-in significance conjunction.
|
|
56
|
+
|
|
57
|
+
### Changed
|
|
58
|
+
- **Terminology**: primary result columns renamed from velocity to unspliced/nascent excess:
|
|
59
|
+
- `unspliced_excess_delta` (was `velocity_delta_raw`)
|
|
60
|
+
- `unspliced_excess_residual` (was `velocity_residual`)
|
|
61
|
+
- Legacy `velocity_*` columns remain in `adata.var` as deprecated aliases.
|
|
62
|
+
- **Built-in `significant` gene list** now requires:
|
|
63
|
+
- `logFC > logfc_cutoff`, `p_adj < pval_cutoff`, `unspliced_excess_residual > 0`, `unspliced_excess_fdr < unspliced_excess_fdr_cutoff`
|
|
64
|
+
- `active_score` is no longer used for significance (ranking/visualization only).
|
|
65
|
+
- Without `use_permutation=True`, the built-in `significant` list is empty (logged warning).
|
|
66
|
+
- Plotting functions accept primary or legacy column names; axis labels updated.
|
|
67
|
+
- README rewritten for the new significance model and column names.
|
|
68
|
+
|
|
69
|
+
### Deprecated
|
|
70
|
+
- `active_fdr_cutoff` (no longer used for built-in significance; use `unspliced_excess_fdr_cutoff`).
|
|
71
|
+
- `velocity_residual_cutoff` in `filter_active_genes` (use `unspliced_excess_residual_cutoff`).
|
|
72
|
+
|
|
33
73
|
## [Unreleased]
|
|
34
74
|
|
|
35
75
|
### Added
|
|
@@ -13,3 +13,12 @@ include .github/workflows/publish.yml
|
|
|
13
13
|
|
|
14
14
|
# If more workflows are added in the future, this will catch them:
|
|
15
15
|
include .github/workflows/*.yml
|
|
16
|
+
|
|
17
|
+
# Never ship generated metadata directories inside the sdist
|
|
18
|
+
recursive-exclude src/scatrans.egg-info *
|
|
19
|
+
prune build
|
|
20
|
+
prune dist
|
|
21
|
+
global-exclude *.pyc
|
|
22
|
+
global-exclude __pycache__/*
|
|
23
|
+
global-exclude *.egg-info
|
|
24
|
+
global-exclude *.egg-info/*
|
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: scatrans
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.9.1.dev0
|
|
4
4
|
Summary: Single-cell Active Transcription Analysis
|
|
5
5
|
Author: scATrans Developers
|
|
6
6
|
License: MIT
|
|
7
7
|
Project-URL: Homepage, https://github.com/scATrans/scatrans
|
|
8
|
-
Keywords: single-cell,RNA-seq,
|
|
8
|
+
Keywords: single-cell,RNA-seq,unspliced,nascent RNA,active transcription,bioinformatics
|
|
9
9
|
Classifier: Development Status :: 4 - Beta
|
|
10
10
|
Classifier: Intended Audience :: Science/Research
|
|
11
11
|
Classifier: License :: OSI Approved :: MIT License
|
|
@@ -34,10 +34,14 @@ Provides-Extra: gene-features
|
|
|
34
34
|
Requires-Dist: gtfparse>=1.3.0; extra == "gene-features"
|
|
35
35
|
Provides-Extra: memento
|
|
36
36
|
Requires-Dist: memento-de>=0.1.0; extra == "memento"
|
|
37
|
+
Provides-Extra: gsea
|
|
38
|
+
Requires-Dist: gseapy>=1.1; extra == "gsea"
|
|
37
39
|
Provides-Extra: dev
|
|
38
40
|
Requires-Dist: pytest>=7.0; extra == "dev"
|
|
39
41
|
Requires-Dist: pytest-cov>=4.0; extra == "dev"
|
|
40
42
|
Requires-Dist: ruff>=0.4.0; extra == "dev"
|
|
43
|
+
Requires-Dist: pre-commit>=3.5; extra == "dev"
|
|
44
|
+
Requires-Dist: mypy>=1.10; extra == "dev"
|
|
41
45
|
Dynamic: license-file
|
|
42
46
|
|
|
43
47
|
# scATrans
|
|
@@ -115,7 +119,7 @@ adata_res, significant, all_results = scat.active_score(
|
|
|
115
119
|
print(all_results.head())
|
|
116
120
|
```
|
|
117
121
|
|
|
118
|
-
Default parameters require no choices for bias correction, effective gamma, or mixed models. Pseudobulk mode and DE method (`de_method`) are configurable options. The built-in `significant` list
|
|
122
|
+
Default parameters require no choices for bias correction, effective gamma, or mixed models. Pseudobulk mode and DE method (`de_method`) are configurable options. The built-in `significant` list requires `use_permutation=True` (for `unspliced_excess_fdr`) and is often small or empty; use the full ranked table in `all_results`.
|
|
119
123
|
|
|
120
124
|
### Preserving raw counts and layers
|
|
121
125
|
|
|
@@ -211,17 +215,22 @@ The internal `significant` list is strict. Most users filter the full table retu
|
|
|
211
215
|
candidates = scat.filter_active_genes(
|
|
212
216
|
all_results,
|
|
213
217
|
active_score_cutoff=30,
|
|
214
|
-
|
|
218
|
+
unspliced_excess_residual_cutoff=0.5,
|
|
219
|
+
unspliced_excess_fdr_cutoff=0.05,
|
|
215
220
|
logfc_cutoff=0.3,
|
|
216
221
|
pval_cutoff=0.05,
|
|
217
222
|
)
|
|
218
223
|
|
|
219
224
|
# Or use presets that choose reasonable defaults for common analysis styles
|
|
220
225
|
candidates = scat.filter_active_genes(all_results, preset="heuristic")
|
|
226
|
+
|
|
227
|
+
# Advanced usage
|
|
228
|
+
mask = scat.filter_active_genes(all_results, return_mask=True) # boolean Series
|
|
229
|
+
filtered_inplace = scat.filter_active_genes(all_results, preset="heuristic", inplace=True)
|
|
221
230
|
# or preset="pseudobulk" after aggregation, or preset="permissive"
|
|
222
231
|
```
|
|
223
232
|
|
|
224
|
-
The helper safely ignores filters for columns that do not exist (e.g. `
|
|
233
|
+
The helper safely ignores filters for columns that do not exist (e.g. `unspliced_excess_fdr` when you did not use `use_permutation`). Legacy column names `velocity_residual` / `velocity_delta_raw` remain in `adata.var` as aliases.
|
|
225
234
|
|
|
226
235
|
### 3.3 Functional enrichment
|
|
227
236
|
|
|
@@ -317,14 +326,31 @@ print(scat.list_bundled_gene_sets())
|
|
|
317
326
|
|
|
318
327
|
**Adding your own sets**: Drop `.gmt` files into `src/scatrans/data/`. See `src/scatrans/data/README.md`.
|
|
319
328
|
|
|
320
|
-
**simplify_enrichment** (reduce redundant terms
|
|
329
|
+
**simplify_enrichment** (reduce redundant enrichment terms):
|
|
330
|
+
|
|
331
|
+
Two methods are supported:
|
|
332
|
+
|
|
333
|
+
- **`jaccard`** (default): greedy filtering by Jaccard overlap of enriched gene lists.
|
|
334
|
+
- **`pathway_denester`**: combinatorial nested-pathway test adapted from [PathwayDenester](https://github.com/Helmy-Lab/PathwayDenester). Better at removing terms that are significant only because they are nested inside a more significant parent pathway. Requires full pathway gene memberships (auto-loaded from `enrich_res.attrs` when enrichment used bundled/Enrichr libraries; pass `gene_sets=` again if you used a custom dict).
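
To make the `jaccard` option concrete, here is a minimal sketch of greedy overlap filtering. It is not the package's implementation: the helper names (`jaccard`, `greedy_jaccard_filter`), the tuple layout of `terms`, and the choice of `>=` for the cutoff comparison are assumptions made for illustration. Only the parameter names `similarity_cutoff` and `min_count` come from the README.

```python
def jaccard(a, b):
    """Jaccard index of two gene collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_jaccard_filter(terms, similarity_cutoff=0.5, min_count=3):
    """Keep the most significant terms and drop near-duplicates.

    `terms` is a list of (name, p_adjust, genes) tuples (hypothetical layout).
    """
    kept = []
    # Visit terms from most to least significant.
    for name, p_adj, genes in sorted(terms, key=lambda t: t[1]):
        if len(genes) < min_count:
            continue  # too few enriched genes to report
        # Drop the term if it overlaps strongly with any term already kept.
        if any(jaccard(genes, g) >= similarity_cutoff for _, _, g in kept):
            continue
        kept.append((name, p_adj, genes))
    return kept


terms = [
    ("cell cycle", 1e-8, ["CDK1", "CCNB1", "MKI67", "TOP2A"]),
    ("mitotic cell cycle", 1e-6, ["CDK1", "CCNB1", "MKI67"]),
    ("immune response", 1e-4, ["IL6", "TNF", "CXCL10"]),
    ("tiny term", 1e-9, ["TP53"]),
]
kept_names = [name for name, _, _ in greedy_jaccard_filter(terms)]
print(kept_names)
```

Because the pass is greedy, the result depends on the sort key: a redundant term is always removed in favour of the more significant term that was kept first.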
|
|
321
335
|
|
|
322
336
|
```python
|
|
337
|
+
# Jaccard (fast, overlap-based)
|
|
323
338
|
simplified = scat.simplify_enrichment(
|
|
324
339
|
enrich_res,
|
|
325
340
|
similarity_cutoff=0.5,
|
|
326
341
|
min_count=3,
|
|
327
|
-
method="jaccard",
|
|
342
|
+
method="jaccard",
|
|
343
|
+
)
|
|
344
|
+
|
|
345
|
+
# PathwayDenester (nested-pathway test; recommended for GO/KEGG dotplots)
|
|
346
|
+
simplified = scat.simplify_enrichment(
|
|
347
|
+
enrich_res,
|
|
348
|
+
method="pathway_denester",
|
|
349
|
+
min_count=3,
|
|
350
|
+
pval_threshold=0.05, # independence cutoff
|
|
351
|
+
to_test_threshold=0.0, # min shared-DEG fraction before testing
|
|
352
|
+
term_size_limit=0, # e.g. 500 to drop very broad terms
|
|
353
|
+
show_excluded=False, # True keeps excluded terms + Denester_* diagnostics
|
|
328
354
|
)
|
|
329
355
|
```
|
|
330
356
|
|
|
@@ -467,6 +493,31 @@ kegg_res = scat.run_kegg(
|
|
|
467
493
|
# To use the original Enrichr version instead: kegg_library="KEGG_2026"
|
|
468
494
|
)
|
|
469
495
|
|
|
496
|
+
### run_gsea (pre-ranked GSEA)
|
|
497
|
+
|
|
498
|
+
For ranked-list enrichment (the classic GSEA approach):
|
|
499
|
+
|
|
500
|
+
```python
|
|
501
|
+
# ranked list: higher = more associated with target (e.g. logFC or custom score)
|
|
502
|
+
ranked = all_results.set_index("gene")["logFC"] # or "active_score" etc.
|
|
503
|
+
|
|
504
|
+
gsea_res = scat.run_gsea(
|
|
505
|
+
ranked_genes=ranked,
|
|
506
|
+
gene_sets="GO_Biological_Process",
|
|
507
|
+
organism="mouse",
|
|
508
|
+
nperm=1000,
|
|
509
|
+
min_size=15,
|
|
510
|
+
# gsea_res is a DataFrame with NES, ES, pvalue, p.adjust, leading_edge, ...
|
|
511
|
+
)
|
|
512
|
+
print(gsea_res.head())
|
|
513
|
+
scat.pl.enrich_dotplot(gsea_res, x="NES", color_by="NES")  # x="NES" is also the auto-detected default for GSEA results
|
|
514
|
+
scat.pl.gseaplot(ranked, gsea_res, term=gsea_res.iloc[0]["Term"])
|
|
515
|
+
```
|
|
516
|
+
|
|
517
|
+
`run_gsea` stores pre-computed RES curves in `.attrs["gsea_details"]` so `gseaplot` can render the exact running sum used for the NES/p-values.
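
For readers unfamiliar with the running enrichment score (RES) that `gseaplot` draws, the following is a minimal sketch of the standard weighted running-sum statistic from pre-ranked GSEA. It is an illustration only, not the code used by `run_gsea` or gseapy; the function name `running_enrichment_score` and its `weight` argument are hypothetical.

```python
def running_enrichment_score(ranked, gene_set, weight=1.0):
    """Return (curve, es) for a list of (gene, score) pairs.

    Hits step the running sum up in proportion to |score| ** weight;
    misses step it down by a constant, so the curve ends near zero.
    """
    ranked = sorted(ranked, key=lambda gs: gs[1], reverse=True)
    gene_set = set(gene_set)
    hit_weights = [abs(s) ** weight if g in gene_set else 0.0 for g, s in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, curve = 0.0, []
    for (gene, _), w in zip(ranked, hit_weights):
        if gene in gene_set:
            running += w / total_hit   # hit: weighted step up
        else:
            running -= 1.0 / n_miss    # miss: uniform step down
        curve.append(running)
    # The enrichment score is the signed maximum deviation from zero.
    es = max(curve, key=abs)
    return curve, es


ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", -1.0), ("E", -2.0)]
curve, es = running_enrichment_score(ranked, {"A", "B"})
print(curve, es)
```

The NES reported by GSEA tools additionally divides the ES by the mean ES of permuted gene sets, which this sketch omits.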
|
|
518
|
+
|
|
519
|
+
Requires `pip install "scatrans[gsea]"` (or gseapy).
|
|
520
|
+
|
|
470
521
|
print(kegg_res[["Term", "p.adjust", "Count"]].head())
|
|
471
522
|
```
|
|
472
523
|
|
|
@@ -474,30 +525,73 @@ The `gene_set_source` parameter (default `"scatrans"`) controls which KEGG set i
|
|
|
474
525
|
See the section "Choosing gene sets explicitly with `gene_set_source`" above for full details
|
|
475
526
|
and examples for both GO and KEGG.
|
|
476
527
|
|
|
477
|
-
**simplify_enrichment** – Remove redundant terms from enrichment results
|
|
528
|
+
**simplify_enrichment** – Remove redundant terms from enrichment results:
|
|
478
529
|
|
|
479
530
|
```python
|
|
480
|
-
#
|
|
531
|
+
# Jaccard: drop terms whose enriched gene sets overlap strongly with a kept term
|
|
481
532
|
simplified = scat.simplify_enrichment(
|
|
482
|
-
kegg_res,
|
|
483
|
-
similarity_cutoff=0.5,
|
|
484
|
-
min_count=3,
|
|
485
|
-
by="p.adjust",
|
|
486
|
-
|
|
487
|
-
|
|
533
|
+
kegg_res,
|
|
534
|
+
similarity_cutoff=0.5,
|
|
535
|
+
min_count=3,
|
|
536
|
+
by="p.adjust",
|
|
537
|
+
method="jaccard",
|
|
538
|
+
)
|
|
539
|
+
|
|
540
|
+
# PathwayDenester: drop nested pathways explained by a more significant parent
|
|
541
|
+
simplified = scat.simplify_enrichment(
|
|
542
|
+
kegg_res,
|
|
543
|
+
method="pathway_denester",
|
|
544
|
+
min_count=3,
|
|
545
|
+
by="p.adjust",
|
|
546
|
+
gene_sets="KEGG", # optional if kegg_res.attrs records the library
|
|
547
|
+
pval_threshold=0.05,
|
|
548
|
+
to_test_threshold=0.0,
|
|
488
549
|
)
|
|
489
550
|
|
|
490
551
|
print(f"Reduced from {len(kegg_res)} to {len(simplified)} terms")
|
|
491
552
|
print(simplified[["Term", "p.adjust", "Count"]].head())
|
|
492
553
|
```
|
|
493
554
|
|
|
555
|
+
| Parameter | `jaccard` | `pathway_denester` |
|
|
556
|
+
|-----------|-----------|-------------------|
|
|
557
|
+
| `similarity_cutoff` | Jaccard threshold (default 0.5) | ignored |
|
|
558
|
+
| `gene_sets` | not used | GMT path, bundled name, or dict (auto from attrs when possible) |
|
|
559
|
+
| `pval_threshold` | not used | independence p-value (default 0.05) |
|
|
560
|
+
| `to_test_threshold` | not used | min shared-DEG fraction before testing (default 0) |
|
|
561
|
+
| `term_size_limit` | not used | drop pathways larger than this size (0 = keep all) |
|
|
562
|
+
| `show_excluded` | not used | if True, return excluded terms with `Denester_*` columns |
|
|
563
|
+
|
|
494
564
|
This function looks for common gene list columns (`Genes`, `Lead_genes`, etc.) automatically.
|
|
495
565
|
|
|
496
566
|
---
|
|
497
567
|
|
|
498
568
|
## Result Interpretation
|
|
499
569
|
|
|
500
|
-
|
|
570
|
+
### Column naming (v0.9+)
|
|
571
|
+
|
|
572
|
+
Primary result columns use **unspliced / nascent excess** terminology (not RNA velocity):
|
|
573
|
+
|
|
574
|
+
| Primary column | Legacy alias (deprecated) | Meaning |
|
|
575
|
+
|----------------|---------------------------|---------|
|
|
576
|
+
| `unspliced_excess_delta` | `velocity_delta_raw` | Raw U − γ_ref·S in target group |
|
|
577
|
+
| `unspliced_excess_residual` | `velocity_residual` | Bias-corrected excess residual |
|
|
578
|
+
| `unspliced_excess_pval` | — | One-sided permutation p-value on residual |
|
|
579
|
+
| `unspliced_excess_fdr` | — | BH-FDR on `unspliced_excess_pval` |
|
|
580
|
+
|
|
581
|
+
`active_score` (0–100) is a **heuristic ranking score** (weighted soft-scaled composite of logFC + unspliced excess residual + -log p_adj). It is intended **for ranking and visualization only** and should **not** be interpreted or reported as a p-value or statistical significance measure. Use the permutation-derived `unspliced_excess_fdr` (when enabled) or your own post-hoc statistics for claims.
|
|
582
|
+
|
|
583
|
+
### Built-in `significant` gene list
|
|
584
|
+
|
|
585
|
+
When `use_permutation=True`, the internal mask requires **all** of:
|
|
586
|
+
|
|
587
|
+
- `logFC > logfc_cutoff` (default 0.5)
|
|
588
|
+
- `p_adj < pval_cutoff` (default 0.05)
|
|
589
|
+
- `unspliced_excess_residual > 0`
|
|
590
|
+
- `unspliced_excess_fdr < unspliced_excess_fdr_cutoff` (default 0.05)
|
|
591
|
+
|
|
592
|
+
Without `use_permutation=True`, the built-in `significant` list is **empty** (FDR on unspliced excess cannot be computed). Use `all_results` + `filter_active_genes` for custom thresholds.
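
The conjunction above can be written out as a small predicate. This is a sketch for illustration, not the package's internal code: the function name `is_significant` and the plain-dict row layout are assumptions, while the column names and default cutoffs are the ones listed above.

```python
def is_significant(row, logfc_cutoff=0.5, pval_cutoff=0.05,
                   unspliced_excess_fdr_cutoff=0.05):
    """Apply the four-part conjunction to one result row (a plain dict here)."""
    fdr = row.get("unspliced_excess_fdr")
    if fdr is None:
        # No permutation was run, so the FDR column is missing:
        # the gene cannot enter the built-in list.
        return False
    return (
        row["logFC"] > logfc_cutoff
        and row["p_adj"] < pval_cutoff
        and row["unspliced_excess_residual"] > 0
        and fdr < unspliced_excess_fdr_cutoff
    )


rows = [
    {"gene": "g1", "logFC": 1.2, "p_adj": 0.001,
     "unspliced_excess_residual": 0.8, "unspliced_excess_fdr": 0.01},
    {"gene": "g2", "logFC": 1.5, "p_adj": 0.001,
     "unspliced_excess_residual": -0.2, "unspliced_excess_fdr": 0.01},
    {"gene": "g3", "logFC": 2.0, "p_adj": 0.001,
     "unspliced_excess_residual": 0.9},  # no permutation FDR available
]
significant_genes = [r["gene"] for r in rows if is_significant(r)]
print(significant_genes)
```

Note that `active_score` does not appear anywhere in the predicate, consistent with its role as a ranking aid only.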
|
|
593
|
+
|
|
594
|
+
On real data the built-in list often returns zero or few genes. Use the full table in `all_results`, sorted by `active_score` descending.
|
|
501
595
|
|
|
502
596
|
After each run inspect the diagnostics:
|
|
503
597
|
|
|
@@ -508,7 +602,7 @@ print(meta["diagnostics"]["bias_correction"])
|
|
|
508
602
|
print(meta.get("permutation_approximation_note"))
|
|
509
603
|
```
|
|
510
604
|
|
|
511
|
-
Global unspliced fractions above ~50% frequently indicate technical issues. Bias-correction diagnostics report the number of genes used and any fallback behavior. The permutation note records that
|
|
605
|
+
Global unspliced fractions above ~50% frequently indicate technical issues. Bias-correction diagnostics report the number of genes used and any fallback behavior. The permutation note records that unspliced/spliced layers and the reference gamma were fixed for speed while labels were shuffled.
|
|
512
606
|
|
|
513
607
|
---
|
|
514
608
|
|
|
@@ -519,6 +613,7 @@ The following flags are disabled by default and should be enabled only when requ
|
|
|
519
613
|
- `use_permutation=True`
|
|
520
614
|
- `bias_correction="none"`
|
|
521
615
|
- `show_effective_gamma=True`
|
|
616
|
+
- `gamma_method="robust_median"` (or "raw")
|
|
522
617
|
- `use_mixed_model=True`
|
|
523
618
|
- `prioritize_velocity=True`
|
|
524
619
|
|
|
@@ -528,14 +623,49 @@ Inspect the corresponding diagnostics after enabling any advanced option.
|
|
|
528
623
|
|
|
529
624
|
### use_permutation=True
|
|
530
625
|
|
|
531
|
-
|
|
626
|
+
**Required for the built-in `significant` list** (via `unspliced_excess_fdr`).
|
|
627
|
+
|
|
628
|
+
Adds:
|
|
629
|
+
|
|
630
|
+
- `unspliced_excess_pval` / `unspliced_excess_fdr` — permutation significance on the bias-corrected unspliced excess residual (one-sided, positive direction). **Use these for active-gene calls.**
|
|
631
|
+
- `active_score_pval` / `active_score_fdr` — permutation on the composite heuristic score (ranking aid only).
|
|
632
|
+
|
|
633
|
+
The permutation shuffles only group labels; unspliced/spliced layers and the reference gamma are fixed from the original labeling for speed. **This is a conditional permutation** (conditioned on the observed velocity structure and gamma). It is a speed/tractability tradeoff and **not an unconditional permutation of the full data**. With small reference groups or strong batch effects, interpret the resulting FDR with extra caution; always inspect diagnostics and consider biological replicates.
|
|
634
|
+
|
|
635
|
+
See `diagnostics["velocity"]` for the actual `gamma_method` and `prior_weight` used.
|
|
636
|
+
|
|
637
|
+
```python
|
|
638
|
+
adata_res, significant, all_results = scat.active_score(
|
|
639
|
+
adata,
|
|
640
|
+
use_permutation=True,
|
|
641
|
+
n_perm=500,
|
|
642
|
+
unspliced_excess_fdr_cutoff=0.05,
|
|
643
|
+
)
|
|
644
|
+
```
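
The statistical idea behind `unspliced_excess_pval` and `unspliced_excess_fdr` can be sketched as a label-shuffling test followed by Benjamini-Hochberg adjustment. The code below is a simplified, assumption-laden illustration: the test statistic (difference of group means on fixed per-cell values), the add-one p-value correction, and the helper names `permutation_pvalue` and `bh_fdr` are choices made for this example and are not taken from the package.

```python
import random


def permutation_pvalue(values, labels, n_perm=500, seed=0):
    """One-sided p-value for mean(target) - mean(reference) > 0.

    `values` are per-cell quantities held fixed; only `labels` (True = target)
    are shuffled, mirroring the conditional scheme described above.
    """
    rng = random.Random(seed)

    def stat(lab):
        tgt = [v for v, is_t in zip(values, lab) if is_t]
        ref = [v for v, is_t in zip(values, lab) if not is_t]
        return sum(tgt) / len(tgt) - sum(ref) / len(ref)

    observed = stat(labels)
    shuffled = list(labels)
    n_ge = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are preserved by shuffling
        if stat(shuffled) >= observed:
            n_ge += 1
    # Add-one correction keeps the p-value strictly positive.
    return (n_ge + 1) / (n_perm + 1)


def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


labels = [True] * 10 + [False] * 10
gene_up = [5.0 + 0.1 * i for i in range(10)] + [0.1 * i for i in range(10)]
gene_null = [float(i % 2) for i in range(20)]
p_up = permutation_pvalue(gene_up, labels)
p_null = permutation_pvalue(gene_null, labels)
fdr = bh_fdr([p_up, p_null])
print(p_up, p_null, fdr)
```

With `n_perm=500` the smallest attainable p-value is 1/501, which bounds how small the adjusted values can become when many genes are tested.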
|
|
532
645
|
|
|
533
646
|
### bias_correction
|
|
534
647
|
|
|
535
|
-
By default the package applies a Huber regression of the raw
|
|
648
|
+
By default the package applies a Huber regression of the raw unspliced excess delta on log(gene length) and log(intron number) and uses the residuals as `unspliced_excess_residual`. This step can be disabled by setting `bias_correction="none"`, in which case the raw (reference-gamma corrected) delta is used directly.
|
|
536
649
|
|
|
537
650
|
The correction is intended to reduce technical contributions from gene length and intron number to the unspliced excess term. Whether length or intron number carry biological signal of interest in a given dataset is a scientific judgment that the user must make; the correction is therefore optional. The `bias_diagnostic_plot` function can be used to inspect the relationship before and after correction.
|
|
538
651
|
|
|
652
|
+
### gamma_method and reference gamma robustness
|
|
653
|
+
|
|
654
|
+
The core unspliced excess uses a per-gene reference gamma = U_ref / S_ref (shrunk).
|
|
655
|
+
|
|
656
|
+
- Default: `gamma_method="heuristic_shrink"` + `prior_weight=5.0` (additive pseudo-count shrinkage toward a global ratio).
|
|
657
|
+
- For small reference groups, try `gamma_method="robust_median"`: uses the **median** ratio across reference genes as the anchor. This reduces sensitivity to a few outlier genes in the reference and can yield more stable residuals.
|
|
658
|
+
- `gamma_method="raw"` disables most shrinkage (exploratory only).
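
To illustrate what additive pseudo-count shrinkage toward an anchor ratio can look like, here is a hedged sketch. The exact formula used by the package is not shown in this diff, so the form `(U + w * anchor) / (S + w)`, the pooled ratio used as the "global" anchor, and the function names `reference_gamma` and `unspliced_excess_delta` are all assumptions. In this sketch `"raw"` applies no shrinkage at all, whereas the README says it disables "most" of it.

```python
from statistics import median


def reference_gamma(u_ref, s_ref, prior_weight=5.0, method="heuristic_shrink"):
    """Per-gene reference gamma = U_ref / S_ref, shrunk toward an anchor ratio.

    u_ref, s_ref: per-gene mean unspliced / spliced counts in the reference group.
    """
    if method == "raw":
        # No shrinkage in this sketch; genes without spliced counts get NaN.
        return [u / s if s > 0 else float("nan") for u, s in zip(u_ref, s_ref)]
    if method == "robust_median":
        # Anchor on the median per-gene ratio, which resists outlier genes.
        anchor = median(u / s for u, s in zip(u_ref, s_ref) if s > 0)
    else:
        # "heuristic_shrink": anchor on the pooled (global) ratio.
        anchor = sum(u_ref) / sum(s_ref)
    # Adding prior_weight pseudo-counts pulls low-count genes toward the anchor.
    return [(u + prior_weight * anchor) / (s + prior_weight)
            for u, s in zip(u_ref, s_ref)]


def unspliced_excess_delta(u_target, s_target, gamma):
    """Raw excess per gene: U - gamma_ref * S in the target group."""
    return [u - g * s for u, s, g in zip(u_target, s_target, gamma)]


u_ref, s_ref = [1.0, 50.0, 0.2], [10.0, 100.0, 0.1]
g_raw = reference_gamma(u_ref, s_ref, method="raw")
g_shrunk = reference_gamma(u_ref, s_ref)
g_median = reference_gamma(u_ref, s_ref, method="robust_median")
delta = unspliced_excess_delta([2.0, 60.0, 0.5], [10.0, 100.0, 0.1], g_median)
print(g_raw, g_shrunk, g_median, delta)
```

The third gene has very low counts, so its raw ratio of 2.0 is pulled strongly toward the anchor, while the well-covered second gene barely moves.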
|
|
659
|
+
|
|
660
|
+
The chosen method, `prior_weight`, and summary statistics of the realized `effective_gamma` are **always** written to diagnostics:
|
|
661
|
+
|
|
662
|
+
```python
|
|
663
|
+
v = adata_res.uns["scatrans"]["diagnostics"]["velocity"]
|
|
664
|
+
print(v["gamma_method"], v["prior_weight"], v["effective_gamma_stats"])
|
|
665
|
+
```
|
|
666
|
+
|
|
667
|
+
Shrinkage strength and stability are now visible without `show_effective_gamma`.
|
|
668
|
+
|
|
539
669
|
### show_effective_gamma=True
|
|
540
670
|
|
|
541
671
|
Adds the column `effective_gamma` (reference-group shrunk U/S ratio) to `adata.var` and to the results tables. Many genes will have similar values in pure heuristic mode; advanced (moments) mode usually shows more per-gene variation.
|
|
@@ -559,7 +689,7 @@ Requires `sample_col` (the column identifying biological replicates/individuals)
|
|
|
559
689
|
- `delta_variance` is always available in `all_results` when the flag is on; you can use it post-hoc as an additional filter.
|
|
560
690
|
- Use `use_delta_variance_pval=True` only if you want the LRT p-value to participate in the built-in `significant` mask.
|
|
561
691
|
|
|
562
|
-
**Practical note on small numbers of samples:** With very few biological replicates, pseudobulk aggregation can drive most `
|
|
692
|
+
**Practical note on small numbers of samples:** With very few biological replicates, pseudobulk aggregation can drive most `unspliced_excess_residual` values close to zero. In such regimes the cell-level mixed-model path (`use_mixed_model=True`, `use_pseudobulk=False`) often preserves more of the nascent-excess signal while still respecting sample structure.
|
|
563
693
|
|
|
564
694
|
The mixed-model settings and median `delta_variance` are recorded in diagnostics.
|
|
565
695
|
|
|
@@ -575,7 +705,7 @@ Recommended only when you have a reasonable number of cells and want noise reduc
|
|
|
575
705
|
|
|
576
706
|
The unspliced excess term is a group-contrast proxy derived from a reference-group gamma calculation. It is not a full stochastic or dynamical model.
|
|
577
707
|
|
|
578
|
-
Interpretation is simplest for clear binary contrasts. Within-group heterogeneity reduces observed signal. The permutation approximation (used when `use_permutation=True`) fixes
|
|
708
|
+
Interpretation is simplest for clear binary contrasts. Within-group heterogeneity reduces observed signal. The permutation approximation (used when `use_permutation=True`) fixes unspliced/spliced layers and the reference gamma on the original labels; the note is recorded in the results. Global unspliced fractions above ~50% are flagged as potential technical artifacts. Bias-correction quality depends on the number of genes with length and intron annotations. With few biological replicates, power for the unspliced excess term and permutation-based FDR is limited. Mixed-model statistics tend to be conservative when between-sample variation is large.
|
|
579
709
|
|
|
580
710
|
Always examine diagnostics, score distributions, and (when available) the original spliced/unspliced counts before biological interpretation.
|
|
581
711
|
|
|
@@ -607,9 +737,9 @@ These are the common "free switches" for the basic pipeline (including pseudobul
|
|
|
607
737
|
|
|
608
738
|
### Opt-in advanced / exploration parameters (see "Optional Advanced Features")
|
|
609
739
|
|
|
610
|
-
- `use_permutation`, `n_perm`, `active_fdr_cutoff`
|
|
740
|
+
- `use_permutation`, `n_perm`, `unspliced_excess_fdr_cutoff` (and deprecated `active_fdr_cutoff`)
|
|
611
741
|
- `bias_correction` ("huber_length_intron" or "none")
|
|
612
|
-
- `show_effective_gamma`
|
|
742
|
+
- `show_effective_gamma`, `gamma_method`, `prior_weight`
|
|
613
743
|
- `use_mixed_model`, `use_delta_variance_pval`, `mixed_model_pval`
|
|
614
744
|
- `mode` ("heuristic" or "advanced")
|
|
615
745
|
|
|
@@ -620,7 +750,7 @@ Full signatures and all parameters are documented in the function docstrings and
|
|
|
620
750
|
- `add_gene_features(adata, organism="mouse", ...)` — attach length/intron info
|
|
621
751
|
- `list_available_gene_features()`
|
|
622
752
|
- `diagnose_design(adata, groupby, target_group, reference_group, sample_col=None)` — analyzes cell/sample counts and global unspliced fraction; returns warnings, recommendations, and a suggested `filter_active_genes` preset. Automatically called internally when `sample_col` or `use_pseudobulk=True` is used.
|
|
623
|
-
- `run_enrichment(...)`, `run_kegg(...)`, `run_go(...)`, `simplify_enrichment(...)`, `save_enrichment_report(...)`, `expand_enrichment_genes(...)`, `list_bundled_gene_sets()`
|
|
753
|
+
- `run_enrichment(...)`, `run_kegg(...)`, `run_go(...)`, `run_gsea(...)`, `simplify_enrichment(...)`, `save_enrichment_report(...)`, `expand_enrichment_genes(...)`, `list_bundled_gene_sets()`
|
|
624
754
|
- `scat.pl.*` plotting functions (comet_plot, volcano_plot, bias_diagnostic_plot, ...)
|
|
625
755
|
- `scat.qc.unspliced_global(adata)`
|
|
626
756
|
|
|
@@ -646,18 +776,20 @@ After installing the `gene_features` extra, the `generate-gene-features` CLI is
|
|
|
646
776
|
|
|
647
777
|
```python
|
|
648
778
|
import scatrans as scat
|
|
649
|
-
scat.pl.set_style() # once
|
|
650
|
-
# or
|
|
779
|
+
scat.pl.set_style() # once early (opt-in)
|
|
780
|
+
# or (recommended to avoid globals):
|
|
651
781
|
with scat.pl.style_context(linewidth=0.8):
|
|
652
|
-
scat.pl.comet_plot(...)
|
|
782
|
+
scat.pl.comet_plot(...) # inside block or pass use_style=True
|
|
783
|
+
# Default for pl.* functions is use_style=False (prevents surprising rcParams changes in notebooks).
|
|
653
784
|
```
|
|
654
785
|
|
|
655
|
-
All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel figures)
|
|
786
|
+
All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel figures), `save_path=`, `show=`, `use_style=`, `figsize=` for consistency.
|
|
787
|
+
Most return `(fig, ax)` (or `(fig, axes_list)` for grids like phase portraits) for easy further customization or closing.
|
|
656
788
|
|
|
657
789
|
### Main Plotting Functions
|
|
658
790
|
|
|
659
791
|
- `scat.pl.comet_plot(results_df, top_n=12, point_scale=1.0, min_size=2, max_size=180, s=None, ...)`
|
|
660
|
-
Recommended: log fold change vs. bias-corrected unspliced residual (
|
|
792
|
+
Recommended: log fold change vs. bias-corrected unspliced excess residual (`unspliced_excess_residual`), sized and colored by `active_score`.
|
|
661
793
|
- `s=3` (or 1-5): force **fixed** small point size for everything (direct, simple control).
|
|
662
794
|
- `point_scale=0.2` + `min_size=1`: for variable sizing, make tiniest background points truly small.
|
|
663
795
|
|
|
@@ -674,17 +806,18 @@ All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel
|
|
|
674
806
|
- `scat.pl.volcano_3d(results_df, point_scale=..., min_size=2, s=None, ...)`
|
|
675
807
|
3D version of the volcano. Same size controls (`s` for fixed size).
|
|
676
808
|
|
|
809
|
+
- `scat.pl.enrich_dotplot(enrich_df, ...)` now also works well with GSEA results (auto defaults to `x="NES"`, diverging cmap for `color_by="NES"`).
|
|
810
|
+
- `scat.pl.gseaplot(ranked_genes, gsea_result, term=...)` — classic GSEA running-sum plot (uses precomputed curves from `run_gsea` when available).
|
|
677
811
|
- `scat.pl.enrich_dotplot(enrich_df, top_n=15, show_terms=None, x="GeneRatio", size_by="Count", color_by="Adjusted P-value", ...)`
|
|
678
812
|
Enrichment dot plot (clusterProfiler style).
|
|
679
|
-
- `x`: x-axis variable — "GeneRatio" (default), "FoldEnrichment", **"Count"**,
|
|
680
|
-
|
|
681
|
-
- `
|
|
682
|
-
- `show_terms` accepts int (top N) or list of term strings/Descriptions (exact or partial match, order preserved) —
|
|
813
|
+
- `x`: x-axis variable — "GeneRatio" (default for ORA), "FoldEnrichment", **"Count"**, "-log10(p.adj)", or "NES" (for GSEA).
|
|
814
|
+
- `size_by` (dot size, default "Count"), `color_by` (default adjusted p-value; "NES" for GSEA uses diverging colormap).
|
|
815
|
+
- `show_terms` accepts int (top N), "auto" (smart selection: p.adjust < 0.05 and Count >= 2), or list of term strings/Descriptions (exact or partial match, order preserved),
|
|
683
816
|
directly analogous to `dotplot(..., showCategory=...)`.
|
|
684
817
|
Also available as `enrich_barplot`.
|
|
685
818
|
|
|
686
819
|
- `scat.pl.volcano_3d(results_df, ...)`
|
|
687
|
-
3D volcano (logFC × -log10(p) ×
|
|
820
|
+
3D volcano (logFC × -log10(p) × unspliced_excess_residual).
|
|
688
821
|
|
|
689
822
|
- `scat.pl.active_score_rankplot(results_df, top_n=20, ...)`
|
|
690
823
|
Simple horizontal barplot of top active scores.
|