scatrans 0.8.0.dev0__tar.gz → 0.9.1.dev0__tar.gz

Files changed (40)
  1. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/CHANGELOG.md +40 -0
  2. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/MANIFEST.in +9 -0
  3. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/PKG-INFO +167 -34
  4. scatrans-0.8.0.dev0/src/scatrans.egg-info/PKG-INFO → scatrans-0.9.1.dev0/README.md +161 -74
  5. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/pyproject.toml +4 -3
  6. {scatrans-0.8.0.dev0/src → scatrans-0.9.1.dev0}/scatrans.egg-info/SOURCES.txt +0 -6
  7. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/setup.cfg +1 -0
  8. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/__init__.py +4 -0
  9. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_de.py +30 -21
  10. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_permutation.py +43 -13
  11. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_utils.py +178 -17
  12. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_velocity.py +36 -7
  13. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_version.py +3 -3
  14. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/enrich.py +665 -9
  15. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/pl.py +362 -44
  16. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/tl.py +361 -128
  17. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/tests/test_basic.py +183 -7
  18. scatrans-0.8.0.dev0/README.md +0 -749
  19. scatrans-0.8.0.dev0/src/scatrans.egg-info/dependency_links.txt +0 -1
  20. scatrans-0.8.0.dev0/src/scatrans.egg-info/entry_points.txt +0 -2
  21. scatrans-0.8.0.dev0/src/scatrans.egg-info/requires.txt +0 -31
  22. scatrans-0.8.0.dev0/src/scatrans.egg-info/top_level.txt +0 -1
  23. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.github/workflows/ci.yml +0 -0
  24. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.github/workflows/publish.yml +0 -0
  25. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/.gitignore +0 -0
  26. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/LICENSE +0 -0
  27. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/memento_de_example.py +0 -0
  28. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/real_data_template.py +0 -0
  29. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/examples/synthetic_active_transcription.py +0 -0
  30. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/_bias.py +0 -0
  31. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Hs_GO_Biological_Process_2026.txt +0 -0
  32. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Hs_KEGG_2026.txt +0 -0
  33. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mm_GO_Biological_Process_2026.txt +0 -0
  34. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mm_KEGG_2026.txt +0 -0
  35. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/Mus_musculus.GRCm39.115_gene_features.parquet +0 -0
  36. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/README.md +0 -0
  37. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/data/mouse_2020A_gene_features.parquet +0 -0
  38. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/generate_gene_features.py +0 -0
  39. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/pp_bias.py +0 -0
  40. {scatrans-0.8.0.dev0 → scatrans-0.9.1.dev0}/src/scatrans/qc.py +0 -0
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.9.0] - 2026-06-19
9
+
10
+ ### Added
11
+ - `run_gsea(ranked_genes, gene_sets=..., nperm=..., ...)` — pre-ranked GSEA (via gseapy.prerank wrapper).
12
+ Reuses the same gene-set loading, `gene_case`, diagnostics, and `.attrs` system as ORA.
13
+ Returns DataFrame with `NES`, `ES`, `pvalue`, `p.adjust`, `leading_edge`, etc.
14
+ Optional dependency: `pip install "scatrans[gsea]"`.
15
+ - `scat.pl.gseaplot(ranked_genes, gsea_result, term=...)` — classic GSEA running enrichment score plot.
16
+ Automatically uses pre-computed RES curves + hits stored in `run_gsea` results (`.attrs["gsea_details"]`).
17
+ - `enrich_dotplot` now auto-detects GSEA results (defaults `x="NES"`, uses diverging colormap when `color_by="NES"`).
18
+ - Added `gsea` extra in `pyproject.toml`.
19
+
20
+ ### Changed
21
+ - Minor internal cleanups and test coverage for the new GSEA path.
22
+ - All new functions follow the existing consistent signatures (`ax=`, `use_style=`, `save_path=`, etc.).
23
+
8
24
  ## [0.8.0] - 2026-06-14
9
25
 
10
26
  ### Added (enrichment module — major paper-readiness upgrade)
@@ -30,6 +46,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
30
46
  - README and docstrings extensively updated with manuscript-export examples, `run_go`, provenance details, and `adjust_across_all` guidance.
31
47
  - Full test coverage for new paths (per-ontology attrs, within_ontology p.adjust, save+tsv+dir creation, expand with Ontology, dual-cutoff warning, etc.). All tests pass.
32
48
 
49
+ ## [0.9.0] - 2026-06-18
50
+
51
+ ### Added
52
+ - **Independent permutation statistics for unspliced excess**: `unspliced_excess_pval` and `unspliced_excess_fdr` (one-sided test on bias-corrected `unspliced_excess_residual`). Computed alongside existing `active_score_pval` / `active_score_fdr` when `use_permutation=True`.
53
+ - New parameter `unspliced_excess_fdr_cutoff` (default 0.05) for the built-in `significant` gene list and `filter_active_genes`.
54
+ - `filter_active_genes` parameters `unspliced_excess_residual_cutoff` and `unspliced_excess_fdr_cutoff`; heuristic/pseudobulk presets updated accordingly.
55
+ - `adata.uns["scatrans"]["significant_criteria"]` metadata documenting the built-in significance conjunction.
56
+
57
+ ### Changed
58
+ - **Terminology**: primary result columns renamed from velocity to unspliced/nascent excess:
59
+ - `unspliced_excess_delta` (was `velocity_delta_raw`)
60
+ - `unspliced_excess_residual` (was `velocity_residual`)
61
+ - Legacy `velocity_*` columns remain in `adata.var` as deprecated aliases.
62
+ - **Built-in `significant` gene list** now requires:
63
+ - `logFC > logfc_cutoff`, `p_adj < pval_cutoff`, `unspliced_excess_residual > 0`, `unspliced_excess_fdr < unspliced_excess_fdr_cutoff`
64
+ - `active_score` is no longer used for significance (ranking/visualization only).
65
+ - Without `use_permutation=True`, the built-in `significant` list is empty (logged warning).
66
+ - Plotting functions accept primary or legacy column names; axis labels updated.
67
+ - README rewritten for the new significance model and column names.
68
+
69
+ ### Deprecated
70
+ - `active_fdr_cutoff` (no longer used for built-in significance; use `unspliced_excess_fdr_cutoff`).
71
+ - `velocity_residual_cutoff` in `filter_active_genes` (use `unspliced_excess_residual_cutoff`).
72
+
33
73
  ## [Unreleased]
34
74
 
35
75
  ### Added
@@ -13,3 +13,12 @@ include .github/workflows/publish.yml
13
13
 
14
14
  # If more workflows are added in the future, this will catch them:
15
15
  include .github/workflows/*.yml
16
+
17
+ # Never ship generated metadata directories inside the sdist
18
+ recursive-exclude src/scatrans.egg-info *
19
+ prune build
20
+ prune dist
21
+ global-exclude *.pyc
22
+ global-exclude __pycache__/*
23
+ global-exclude *.egg-info
24
+ global-exclude *.egg-info/*
@@ -1,11 +1,11 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: scatrans
3
- Version: 0.8.0.dev0
3
+ Version: 0.9.1.dev0
4
4
  Summary: Single-cell Active Transcription Analysis
5
5
  Author: scATrans Developers
6
6
  License: MIT
7
7
  Project-URL: Homepage, https://github.com/scATrans/scatrans
8
- Keywords: single-cell,RNA-seq,velocity,active transcription,bioinformatics
8
+ Keywords: single-cell,RNA-seq,unspliced,nascent RNA,active transcription,bioinformatics
9
9
  Classifier: Development Status :: 4 - Beta
10
10
  Classifier: Intended Audience :: Science/Research
11
11
  Classifier: License :: OSI Approved :: MIT License
@@ -34,10 +34,14 @@ Provides-Extra: gene-features
34
34
  Requires-Dist: gtfparse>=1.3.0; extra == "gene-features"
35
35
  Provides-Extra: memento
36
36
  Requires-Dist: memento-de>=0.1.0; extra == "memento"
37
+ Provides-Extra: gsea
38
+ Requires-Dist: gseapy>=1.1; extra == "gsea"
37
39
  Provides-Extra: dev
38
40
  Requires-Dist: pytest>=7.0; extra == "dev"
39
41
  Requires-Dist: pytest-cov>=4.0; extra == "dev"
40
42
  Requires-Dist: ruff>=0.4.0; extra == "dev"
43
+ Requires-Dist: pre-commit>=3.5; extra == "dev"
44
+ Requires-Dist: mypy>=1.10; extra == "dev"
41
45
  Dynamic: license-file
42
46
 
43
47
  # scATrans
@@ -115,7 +119,7 @@ adata_res, significant, all_results = scat.active_score(
115
119
  print(all_results.head())
116
120
  ```
117
121
 
118
- Default parameters require no choices for bias correction, effective gamma, or mixed models. Pseudobulk mode and DE method (`de_method`) are configurable options. The built-in `significant` list is strict and often small or empty; use the full ranked table in `all_results`.
122
+ Default parameters require no choices for bias correction, effective gamma, or mixed models. Pseudobulk mode and DE method (`de_method`) are configurable options. The built-in `significant` list requires `use_permutation=True` (for `unspliced_excess_fdr`) and is often small or empty; use the full ranked table in `all_results`.
119
123
 
120
124
  ### Preserving raw counts and layers
121
125
 
@@ -211,17 +215,22 @@ The internal `significant` list is strict. Most users filter the full table retu
211
215
  candidates = scat.filter_active_genes(
212
216
  all_results,
213
217
  active_score_cutoff=30,
214
- velocity_residual_cutoff=0.5,
218
+ unspliced_excess_residual_cutoff=0.5,
219
+ unspliced_excess_fdr_cutoff=0.05,
215
220
  logfc_cutoff=0.3,
216
221
  pval_cutoff=0.05,
217
222
  )
218
223
 
219
224
  # Or use presets that choose reasonable defaults for common analysis styles
220
225
  candidates = scat.filter_active_genes(all_results, preset="heuristic")
226
+
227
+ # Advanced usage
228
+ mask = scat.filter_active_genes(all_results, return_mask=True) # boolean Series
229
+ filtered_inplace = scat.filter_active_genes(all_results, preset="heuristic", inplace=True)
221
230
  # or preset="pseudobulk" after aggregation, or preset="permissive"
222
231
  ```
223
232
 
224
- The helper safely ignores filters for columns that do not exist (e.g. `active_score_fdr` when you did not use `use_permutation`).
233
+ The helper safely ignores filters for columns that do not exist (e.g. `unspliced_excess_fdr` when you did not use `use_permutation`). Legacy column names `velocity_residual` / `velocity_delta_raw` remain in `adata.var` as aliases.
225
234
 
226
235
  ### 3.3 Functional enrichment
227
236
 
@@ -317,14 +326,31 @@ print(scat.list_bundled_gene_sets())
317
326
 
318
327
  **Adding your own sets**: Drop `.gmt` files into `src/scatrans/data/`. See `src/scatrans/data/README.md`.
319
328
 
320
- **simplify_enrichment** (reduce redundant terms using Jaccard similarity):
329
+ **simplify_enrichment** (reduce redundant enrichment terms):
330
+
331
+ Two methods are supported:
332
+
333
+ - **`jaccard`** (default): greedy filtering by Jaccard overlap of enriched gene lists.
334
+ - **`pathway_denester`**: combinatorial nested-pathway test adapted from [PathwayDenester](https://github.com/Helmy-Lab/PathwayDenester). Better at removing terms that are significant only because they are nested inside a more significant parent pathway. Requires full pathway gene memberships (auto-loaded from `enrich_res.attrs` when enrichment used bundled/Enrichr libraries; pass `gene_sets=` again if you used a custom dict).
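The `jaccard` method described above can be pictured as a greedy pass over terms sorted by adjusted p-value. The following is a minimal sketch under stated assumptions, not the scatrans implementation: the function name `greedy_jaccard_filter`, the `(name, p_adjust, genes)` tuple layout, and the use of a strict `<` comparison against the cutoff are all hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_jaccard_filter(terms, similarity_cutoff=0.5, min_count=3):
    """Keep a term only if it is dissimilar to every more significant kept term.

    `terms` is a list of (name, p_adjust, genes) tuples (hypothetical layout).
    """
    kept = []
    # Most significant terms are considered first, so they win ties.
    for name, p_adj, genes in sorted(terms, key=lambda t: t[1]):
        genes = set(genes)
        if len(genes) < min_count:
            continue  # too few enriched genes to keep
        if all(jaccard(genes, k[2]) < similarity_cutoff for k in kept):
            kept.append((name, p_adj, genes))
    return [name for name, _, _ in kept]


terms = [
    ("cell cycle", 1e-6, {"Ccnb1", "Cdk1", "Mki67", "Top2a"}),
    ("mitotic cell cycle", 1e-5, {"Ccnb1", "Cdk1", "Mki67"}),
    ("lipid metabolism", 1e-3, {"Fasn", "Acaca", "Scd1"}),
    ("tiny term", 1e-4, {"Fasn", "Cdk1"}),
]
kept_terms = greedy_jaccard_filter(terms)
print(kept_terms)
```

In this toy input, "mitotic cell cycle" overlaps "cell cycle" with Jaccard 0.75 and is dropped, while "tiny term" fails `min_count`.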
321
335
 
322
336
  ```python
337
+ # Jaccard (fast, overlap-based)
323
338
  simplified = scat.simplify_enrichment(
324
339
  enrich_res,
325
340
  similarity_cutoff=0.5,
326
341
  min_count=3,
327
- method="jaccard", # currently the only supported method
342
+ method="jaccard",
343
+ )
344
+
345
+ # PathwayDenester (nested-pathway test; recommended for GO/KEGG dotplots)
346
+ simplified = scat.simplify_enrichment(
347
+ enrich_res,
348
+ method="pathway_denester",
349
+ min_count=3,
350
+ pval_threshold=0.05, # independence cutoff
351
+ to_test_threshold=0.0, # min shared-DEG fraction before testing
352
+ term_size_limit=0, # e.g. 500 to drop very broad terms
353
+ show_excluded=False, # True keeps excluded terms + Denester_* diagnostics
328
354
  )
329
355
  ```
330
356
 
@@ -467,6 +493,31 @@ kegg_res = scat.run_kegg(
467
493
  # To use the original Enrichr version instead: kegg_library="KEGG_2026"
468
494
  )
469
495
 
496
+ ### run_gsea (pre-ranked GSEA)
497
+
498
+ For ranked-list enrichment (the classic GSEA approach):
499
+
500
+ ```python
501
+ # ranked list: higher = more associated with target (e.g. logFC or custom score)
502
+ ranked = all_results.set_index("gene")["logFC"] # or "active_score" etc.
503
+
504
+ gsea_res = scat.run_gsea(
505
+ ranked_genes=ranked,
506
+ gene_sets="GO_Biological_Process",
507
+ organism="mouse",
508
+ nperm=1000,
509
+ min_size=15,
510
+ # gsea_res is a DataFrame with NES, ES, pvalue, p.adjust, leading_edge, ...
511
+ )
512
+ print(gsea_res.head())
513
+ scat.pl.enrich_dotplot(gsea_res, x="NES", color_by="NES") # auto-friendly
514
+ scat.pl.gseaplot(ranked, gsea_res, term=gsea_res.iloc[0]["Term"])
515
+ ```
516
+
517
+ `run_gsea` stores pre-computed RES curves in `.attrs["gsea_details"]` so `gseaplot` can render the exact running sum used for the NES/p-values.
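For readers unfamiliar with the running enrichment score (RES) curve mentioned above, here is a minimal sketch of the standard weighted running-sum statistic used by pre-ranked GSEA. It is an illustration only: `running_enrichment_score` is a hypothetical helper, it takes a plain dict rather than the structures scatrans or gseapy use internally, and it computes only ES, not the permutation-normalized NES.

```python
def running_enrichment_score(ranked, gene_set, weight=1.0):
    """Return (ES, curve) for a {gene: score} dict and a set of member genes.

    Walk down the list from highest to lowest score, stepping up at hits
    (in proportion to |score|**weight) and down by a constant at misses.
    """
    genes = sorted(ranked, key=ranked.get, reverse=True)
    hits = [g in gene_set for g in genes]
    n_miss = len(genes) - sum(hits)
    hit_total = sum(abs(ranked[g]) ** weight for g, h in zip(genes, hits) if h)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, curve = 0.0, []
    for g, h in zip(genes, hits):
        running += abs(ranked[g]) ** weight / hit_total if h else -1.0 / n_miss
        curve.append(running)
    # ES is the signed point of maximum deviation from zero.
    es = max(curve, key=abs)
    return es, curve


ranked = {"A": 3.0, "B": 2.0, "C": 1.0, "D": -1.0, "E": -2.0, "F": -3.0}
es, curve = running_enrichment_score(ranked, {"A", "B"})
print(es, curve)
```

The curve always returns to zero at the end of the list, which is a quick sanity check when comparing against a stored RES curve.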
518
+
519
+ Requires `pip install "scatrans[gsea]"` (or gseapy).
520
+
470
521
  print(kegg_res[["Term", "p.adjust", "Count"]].head())
471
522
  ```
472
523
 
@@ -474,30 +525,73 @@ The `gene_set_source` parameter (default `"scatrans"`) controls which KEGG set i
474
525
  See the section "Choosing gene sets explicitly with `gene_set_source`" above for full details
475
526
  and examples for both GO and KEGG.
476
527
 
477
- **simplify_enrichment** – Remove redundant terms from enrichment results (Jaccard-based):
528
+ **simplify_enrichment** – Remove redundant terms from enrichment results:
478
529
 
479
530
  ```python
480
- # After obtaining an enrichment result
531
+ # Jaccard: drop terms whose enriched gene sets overlap strongly with a kept term
481
532
  simplified = scat.simplify_enrichment(
482
- kegg_res, # or enrich_res from run_enrichment
483
- similarity_cutoff=0.5, # Jaccard similarity threshold
484
- min_count=3, # minimum number of genes in a term
485
- by="p.adjust", # column to sort by
486
- ascending=True,
487
- method="jaccard", # currently only "jaccard" is supported
533
+ kegg_res,
534
+ similarity_cutoff=0.5,
535
+ min_count=3,
536
+ by="p.adjust",
537
+ method="jaccard",
538
+ )
539
+
540
+ # PathwayDenester: drop nested pathways explained by a more significant parent
541
+ simplified = scat.simplify_enrichment(
542
+ kegg_res,
543
+ method="pathway_denester",
544
+ min_count=3,
545
+ by="p.adjust",
546
+ gene_sets="KEGG", # optional if kegg_res.attrs records the library
547
+ pval_threshold=0.05,
548
+ to_test_threshold=0.0,
488
549
  )
489
550
 
490
551
  print(f"Reduced from {len(kegg_res)} to {len(simplified)} terms")
491
552
  print(simplified[["Term", "p.adjust", "Count"]].head())
492
553
  ```
493
554
 
555
+ | Parameter | `jaccard` | `pathway_denester` |
556
+ |-----------|-----------|-------------------|
557
+ | `similarity_cutoff` | Jaccard threshold (default 0.5) | ignored |
558
+ | `gene_sets` | not used | GMT path, bundled name, or dict (auto from attrs when possible) |
559
+ | `pval_threshold` | not used | independence p-value (default 0.05) |
560
+ | `to_test_threshold` | not used | min shared-DEG fraction before testing (default 0) |
561
+ | `term_size_limit` | not used | drop pathways larger than this size (0 = keep all) |
562
+ | `show_excluded` | not used | if True, return excluded terms with `Denester_*` columns |
563
+
494
564
  This function looks for common gene list columns (`Genes`, `Lead_genes`, etc.) automatically.
495
565
 
496
566
  ---
497
567
 
498
568
  ## Result Interpretation
499
569
 
500
- The internal significance mask applies a strict conjunction of thresholds. On real data it often returns zero or few genes. Use the full table in `all_results`, which is sorted by `active_score` descending and retains every gene that passed initial expression filters.
570
+ ### Column naming (v0.9+)
571
+
572
+ Primary result columns use **unspliced / nascent excess** terminology (not RNA velocity):
573
+
574
+ | Primary column | Legacy alias (deprecated) | Meaning |
575
+ |----------------|---------------------------|---------|
576
+ | `unspliced_excess_delta` | `velocity_delta_raw` | Raw U − γ_ref·S in target group |
577
+ | `unspliced_excess_residual` | `velocity_residual` | Bias-corrected excess residual |
578
+ | `unspliced_excess_pval` | — | One-sided permutation p-value on residual |
579
+ | `unspliced_excess_fdr` | — | BH-FDR on `unspliced_excess_pval` |
580
+
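The table above describes `unspliced_excess_fdr` as BH-FDR applied to `unspliced_excess_pval`. As a reminder of what the Benjamini-Hochberg adjustment does, here is a small standalone sketch; `bh_fdr` is a hypothetical helper name and this is not necessarily the routine scatrans calls.

```python
def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


fdr = bh_fdr([0.01, 0.04, 0.03, 0.20])
print(fdr)
```

Note that the two middle p-values receive the same adjusted value because of the monotonicity step.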
581
+ `active_score` (0–100) is a **heuristic ranking score** (weighted soft-scaled composite of logFC + unspliced excess residual + -log p_adj). It is intended **for ranking and visualization only** and should **not** be interpreted or reported as a p-value or statistical significance measure. Use the permutation-derived `unspliced_excess_fdr` (when enabled) or your own post-hoc statistics for claims.
582
+
583
+ ### Built-in `significant` gene list
584
+
585
+ When `use_permutation=True`, the internal mask requires **all** of:
586
+
587
+ - `logFC > logfc_cutoff` (default 0.5)
588
+ - `p_adj < pval_cutoff` (default 0.05)
589
+ - `unspliced_excess_residual > 0`
590
+ - `unspliced_excess_fdr < unspliced_excess_fdr_cutoff` (default 0.05)
591
+
592
+ Without `use_permutation=True`, the built-in `significant` list is **empty** (FDR on unspliced excess cannot be computed). Use `all_results` + `filter_active_genes` for custom thresholds.
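The conjunction listed above can be written out explicitly. This is a sketch assuming results are available as a list of per-gene dicts with the column names from this README; `builtin_significant` is a hypothetical function, not part of the scatrans API, and the real mask operates on `adata.var` / `all_results`.

```python
def builtin_significant(rows, logfc_cutoff=0.5, pval_cutoff=0.05,
                        unspliced_excess_fdr_cutoff=0.05):
    """Return gene names that pass all four criteria.

    Rows without `unspliced_excess_fdr` (no permutation run) never pass,
    which mirrors the documented empty list without use_permutation=True.
    """
    selected = []
    for row in rows:
        fdr = row.get("unspliced_excess_fdr")
        if fdr is None:
            continue  # FDR on unspliced excess was not computed
        if (row["logFC"] > logfc_cutoff
                and row["p_adj"] < pval_cutoff
                and row["unspliced_excess_residual"] > 0
                and fdr < unspliced_excess_fdr_cutoff):
            selected.append(row["gene"])
    return selected


rows = [
    {"gene": "G1", "logFC": 1.2, "p_adj": 0.001,
     "unspliced_excess_residual": 0.8, "unspliced_excess_fdr": 0.01},
    {"gene": "G2", "logFC": 1.5, "p_adj": 0.001,
     "unspliced_excess_residual": -0.2, "unspliced_excess_fdr": 0.01},
    {"gene": "G3", "logFC": 0.9, "p_adj": 0.001,
     "unspliced_excess_residual": 0.5},  # no permutation columns
]
significant_genes = builtin_significant(rows)
print(significant_genes)
```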
593
+
594
+ On real data the built-in list often returns zero or few genes. Use the full table in `all_results`, sorted by `active_score` descending.
501
595
 
502
596
  After each run inspect the diagnostics:
503
597
 
@@ -508,7 +602,7 @@ print(meta["diagnostics"]["bias_correction"])
508
602
  print(meta.get("permutation_approximation_note"))
509
603
  ```
510
604
 
511
- Global unspliced fractions above ~50% frequently indicate technical issues. Bias-correction diagnostics report the number of genes used and any fallback behavior. The permutation note records that velocity layers and the reference gamma were fixed for speed.
605
+ Global unspliced fractions above ~50% frequently indicate technical issues. Bias-correction diagnostics report the number of genes used and any fallback behavior. The permutation note records that unspliced/spliced layers and the reference gamma were fixed for speed while labels were shuffled.
512
606
 
513
607
  ---
514
608
 
@@ -519,6 +613,7 @@ The following flags are disabled by default and should be enabled only when requ
519
613
  - `use_permutation=True`
520
614
  - `bias_correction="none"`
521
615
  - `show_effective_gamma=True`
616
+ - `gamma_method="robust_median"` (or "raw")
522
617
  - `use_mixed_model=True`
523
618
  - `prioritize_velocity=True`
524
619
 
@@ -528,14 +623,49 @@ Inspect the corresponding diagnostics after enabling any advanced option.
528
623
 
529
624
  ### use_permutation=True
530
625
 
531
- Adds `active_score_pval` and `active_score_fdr` columns. The permutation shuffles only group labels; velocity layers and the reference gamma are computed once on the original data for speed. This approximation is documented in `permutation_approximation_note`.
626
+ **Required for the built-in `significant` list** (via `unspliced_excess_fdr`).
627
+
628
+ Adds:
629
+
630
+ - `unspliced_excess_pval` / `unspliced_excess_fdr` — permutation significance on the bias-corrected unspliced excess residual (one-sided, positive direction). **Use these for active-gene calls.**
631
+ - `active_score_pval` / `active_score_fdr` — permutation on the composite heuristic score (ranking aid only).
632
+
633
+ The permutation shuffles only group labels; unspliced/spliced layers and the reference gamma are fixed from the original labeling for speed. **This is a conditional permutation** (conditioned on the observed velocity structure and gamma). It is a speed/tractability tradeoff and **not an unconditional permutation of the full data**. In small reference groups or strong batch effects, interpret the resulting FDR with extra caution; always inspect diagnostics and consider biological replicates.
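The conditional scheme described above (per-cell quantities fixed, only group labels shuffled) can be sketched for a single gene as follows. This is a simplified illustration, not the scatrans code: the test statistic here is the mean per-cell excess in the target group rather than the bias-corrected residual, and the add-one p-value correction and the name `one_sided_perm_pvalue` are assumptions.

```python
import random


def one_sided_perm_pvalue(excess, is_target, n_perm=500, seed=0):
    """One-sided label-permutation p-value for mean excess in the target group.

    `excess` holds per-cell values computed once (layers and gamma fixed);
    only the group labels in `is_target` are shuffled.
    """
    rng = random.Random(seed)
    n_target = sum(is_target)

    def stat(labels):
        return sum(e for e, t in zip(excess, labels) if t) / n_target

    observed = stat(is_target)
    labels = list(is_target)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if stat(labels) >= observed:  # positive direction only
            exceed += 1
    # Add-one correction keeps the p-value strictly positive (assumed convention).
    return (1 + exceed) / (1 + n_perm)


is_target = [True] * 10 + [False] * 30
p_signal = one_sided_perm_pvalue([1.0] * 10 + [0.0] * 30, is_target)
p_null = one_sided_perm_pvalue([0.5] * 40, is_target)
print(p_signal, p_null)
```

Because the per-cell excess values are never recomputed, the resulting p-value is conditional on the observed layers and reference gamma, which is the caveat the paragraph above spells out.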
634
+
635
+ See diagnostics["velocity"] for the actual gamma_method and prior_weight used.
636
+
637
+ ```python
638
+ adata_res, significant, all_results = scat.active_score(
639
+ adata,
640
+ use_permutation=True,
641
+ n_perm=500,
642
+ unspliced_excess_fdr_cutoff=0.05,
643
+ )
644
+ ```
532
645
 
533
646
  ### bias_correction
534
647
 
535
- By default the package applies a Huber regression of the raw velocity delta on log(gene length) and log(intron number) and uses the residuals as `velocity_residual`. This step can be disabled by setting `bias_correction="none"`, in which case the raw (reference-gamma corrected) delta is used directly.
648
+ By default the package applies a Huber regression of the raw unspliced excess delta on log(gene length) and log(intron number) and uses the residuals as `unspliced_excess_residual`. This step can be disabled by setting `bias_correction="none"`, in which case the raw (reference-gamma corrected) delta is used directly.
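To make the bias-correction step concrete, the sketch below fits a Huber-weighted regression by iteratively reweighted least squares and returns the residuals. It is deliberately simplified and is not the scatrans implementation: it uses a single covariate (standing in for log gene length, with log intron number omitted), a MAD-based scale, and a hypothetical function name `huber_residuals`.

```python
import statistics


def huber_residuals(x, y, delta=1.345, n_iter=50):
    """IRLS fit of y ~ a + b*x with Huber weights; returns (a, b, residuals)."""
    w = [1.0] * len(x)
    a = b = 0.0
    res = list(y)
    for _ in range(n_iter):
        # Weighted least squares for intercept and slope.
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        b = sxy / sxx
        a = my - b * mx
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute deviation.
        med = statistics.median(res)
        mad = statistics.median(abs(r - med) for r in res)
        scale = 1.4826 * mad or 1e-12
        # Huber weights: 1 inside the threshold, down-weighted outside it.
        w = [1.0 if abs(r) <= delta * scale else delta * scale / abs(r)
             for r in res]
    return a, b, res


x = [float(i) for i in range(10)]      # stand-in for log(gene length)
y = [2.0 + 0.5 * xi for xi in x]       # raw delta with a linear length trend
y[9] += 10.0                           # one gene with genuine excess
a, b, res = huber_residuals(x, y)
print(round(b, 3), round(res[9], 3))
```

The point of the robust fit is that the outlying gene keeps a large positive residual instead of dragging the trend line toward itself, as an ordinary least-squares fit would.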
536
649
 
537
650
  The correction is intended to reduce technical contributions from gene length and intron number to the unspliced excess term. Whether length or intron number carry biological signal of interest in a given dataset is a scientific judgment that the user must make; the correction is therefore optional. The `bias_diagnostic_plot` function can be used to inspect the relationship before and after correction.
538
651
 
652
+ ### gamma_method and reference gamma robustness
653
+
654
+ The core unspliced excess uses a per-gene reference gamma = U_ref / S_ref (shrunk).
655
+
656
+ - Default: `gamma_method="heuristic_shrink"` + `prior_weight=5.0` (additive pseudo-count shrinkage toward a global ratio).
657
+ - For small reference groups, try `gamma_method="robust_median"`: uses the **median** ratio across reference genes as the anchor. This reduces sensitivity to a few outlier genes in the reference and can yield more stable residuals.
658
+ - `gamma_method="raw"` disables most shrinkage (exploratory only).
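The three `gamma_method` options above can be compared on toy numbers. The formulas below are assumptions chosen to match the prose (additive pseudo-count shrinkage toward a global or median ratio); the exact expressions inside scatrans may differ, and `reference_gamma` is a hypothetical helper.

```python
import statistics


def reference_gamma(u_ref, s_ref, prior_weight=5.0, method="heuristic_shrink"):
    """Per-gene shrunk U/S ratio from reference-group totals (assumed formulas)."""
    ratios = [u / s for u, s in zip(u_ref, s_ref) if s > 0]
    if method == "robust_median":
        anchor = statistics.median(ratios)   # median ratio across genes
    else:
        anchor = sum(u_ref) / sum(s_ref)     # pooled global ratio
    if method == "raw":
        prior_weight = 0.0                   # no shrinkage
    # prior_weight acts as virtual spliced counts observed at the anchor ratio.
    return [
        (u + prior_weight * anchor) / (s + prior_weight)
        if s + prior_weight > 0 else anchor
        for u, s in zip(u_ref, s_ref)
    ]


u_ref, s_ref = [10.0, 0.0, 200.0], [100.0, 1.0, 100.0]
gamma_raw = reference_gamma(u_ref, s_ref, method="raw")
gamma_shrunk = reference_gamma(u_ref, s_ref)
gamma_median = reference_gamma(u_ref, s_ref, method="robust_median")

# Raw unspliced excess in the target group: U - gamma_ref * S
u_tgt, s_tgt = [30.0, 4.0, 210.0], [100.0, 2.0, 100.0]
delta = [u - g * s for u, g, s in zip(u_tgt, gamma_shrunk, s_tgt)]
print(gamma_raw, gamma_shrunk, gamma_median, delta)
```

The low-count second gene shows the difference most clearly: its raw gamma is 0, the pooled anchor pulls it strongly upward, and the median anchor pulls it much less because the high-ratio third gene no longer dominates.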
659
+
660
+ The chosen method, prior_weight, and summary stats of the realized effective_gamma are **always** written to diagnostics:
661
+
662
+ ```python
663
+ v = adata_res.uns["scatrans"]["diagnostics"]["velocity"]
664
+ print(v["gamma_method"], v["prior_weight"], v["effective_gamma_stats"])
665
+ ```
666
+
667
+ Shrinkage strength and stability are now visible without `show_effective_gamma`.
668
+
539
669
  ### show_effective_gamma=True
540
670
 
541
671
  Adds the column `effective_gamma` (reference-group shrunk U/S ratio) to `adata.var` and to the results tables. Many genes will have similar values in pure heuristic mode; advanced (moments) mode usually shows more per-gene variation.
@@ -559,7 +689,7 @@ Requires `sample_col` (the column identifying biological replicates/individuals)
559
689
  - `delta_variance` is always available in `all_results` when the flag is on; you can use it post-hoc as an additional filter.
560
690
  - Use `use_delta_variance_pval=True` only if you want the LRT p-value to participate in the built-in `significant` mask.
561
691
 
562
- **Practical note on small numbers of samples:** With very few biological replicates, pseudobulk aggregation can drive most `velocity_residual` values close to zero. In such regimes the cell-level mixed-model path (`use_mixed_model=True`, `use_pseudobulk=False`) often preserves more of the velocity signal while still respecting sample structure.
692
+ **Practical note on small numbers of samples:** With very few biological replicates, pseudobulk aggregation can drive most `unspliced_excess_residual` values close to zero. In such regimes the cell-level mixed-model path (`use_mixed_model=True`, `use_pseudobulk=False`) often preserves more of the nascent-excess signal while still respecting sample structure.
563
693
 
564
694
  The mixed-model settings and median `delta_variance` are recorded in diagnostics.
565
695
 
@@ -575,7 +705,7 @@ Recommended only when you have a reasonable number of cells and want noise reduc
575
705
 
576
706
  The unspliced excess term is a group-contrast proxy derived from a reference-group gamma calculation. It is not a full stochastic or dynamical model.
577
707
 
578
- Interpretation is simplest for clear binary contrasts. Within-group heterogeneity reduces observed signal. The permutation approximation (used when `use_permutation=True`) fixes velocity layers and the reference gamma on the original labels; the note is recorded in the results. Global unspliced fractions above ~50% are flagged as potential technical artifacts. Bias-correction quality depends on the number of genes with length and intron annotations. With few biological replicates, power for the velocity term and permutation-based FDR is limited. Mixed-model statistics tend to be conservative when between-sample variation is large.
708
+ Interpretation is simplest for clear binary contrasts. Within-group heterogeneity reduces observed signal. The permutation approximation (used when `use_permutation=True`) fixes unspliced/spliced layers and the reference gamma on the original labels; the note is recorded in the results. Global unspliced fractions above ~50% are flagged as potential technical artifacts. Bias-correction quality depends on the number of genes with length and intron annotations. With few biological replicates, power for the unspliced excess term and permutation-based FDR is limited. Mixed-model statistics tend to be conservative when between-sample variation is large.
579
709
 
580
710
  Always examine diagnostics, score distributions, and (when available) the original spliced/unspliced counts before biological interpretation.
581
711
 
@@ -607,9 +737,9 @@ These are the common "free switches" for the basic pipeline (including pseudobul
607
737
 
608
738
  ### Opt-in advanced / exploration parameters (see "Optional Advanced Features")
609
739
 
610
- - `use_permutation`, `n_perm`, `active_fdr_cutoff`
740
+ - `use_permutation`, `n_perm`, `unspliced_excess_fdr_cutoff` (and deprecated `active_fdr_cutoff`)
611
741
  - `bias_correction` ("huber_length_intron" or "none")
612
- - `show_effective_gamma`
742
+ - `show_effective_gamma`, `gamma_method`, `prior_weight`
613
743
  - `use_mixed_model`, `use_delta_variance_pval`, `mixed_model_pval`
614
744
  - `mode` ("heuristic" or "advanced")
615
745
 
@@ -620,7 +750,7 @@ Full signatures and all parameters are documented in the function docstrings and
620
750
  - `add_gene_features(adata, organism="mouse", ...)` — attach length/intron info
621
751
  - `list_available_gene_features()`
622
752
  - `diagnose_design(adata, groupby, target_group, reference_group, sample_col=None)` — analyzes cell/sample counts and global unspliced fraction; returns warnings, recommendations, and a suggested `filter_active_genes` preset. Automatically called internally when `sample_col` or `use_pseudobulk=True` is used.
623
- - `run_enrichment(...)`, `run_kegg(...)`, `run_go(...)`, `simplify_enrichment(...)`, `save_enrichment_report(...)`, `expand_enrichment_genes(...)`, `list_bundled_gene_sets()`
753
+ - `run_enrichment(...)`, `run_kegg(...)`, `run_go(...)`, `run_gsea(...)`, `simplify_enrichment(...)`, `save_enrichment_report(...)`, `expand_enrichment_genes(...)`, `list_bundled_gene_sets()`
624
754
  - `scat.pl.*` plotting functions (comet_plot, volcano_plot, bias_diagnostic_plot, ...)
625
755
  - `scat.qc.unspliced_global(adata)`
626
756
 
@@ -646,18 +776,20 @@ After installing the `gene_features` extra, the `generate-gene-features` CLI is
646
776
 
647
777
  ```python
648
778
  import scatrans as scat
649
- scat.pl.set_style() # once, for good defaults
650
- # or temporary:
779
+ scat.pl.set_style() # once early (opt-in)
780
+ # or (recommended to avoid globals):
651
781
  with scat.pl.style_context(linewidth=0.8):
652
- scat.pl.comet_plot(...)
782
+ scat.pl.comet_plot(...) # inside block or pass use_style=True
783
+ # Default for pl.* functions is use_style=False (prevents surprising rcParams changes in notebooks).
653
784
  ```
654
785
 
655
- All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel figures) and `save_path=` (high-quality 300 dpi output).
786
+ All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel figures), `save_path=`, `show=`, `use_style=`, `figsize=` for consistency.
787
+ Most return `(fig, ax)` (or `(fig, axes_list)` for grids like phase portraits) for easy further customization or closing.
656
788
 
657
789
  ### Main Plotting Functions
658
790
 
659
791
  - `scat.pl.comet_plot(results_df, top_n=12, point_scale=1.0, min_size=2, max_size=180, s=None, ...)`
660
- Recommended: log fold change vs. bias-corrected unspliced residual (velocity_residual), sized and colored by active_score.
792
+ Recommended: log fold change vs. bias-corrected unspliced excess residual (`unspliced_excess_residual`), sized and colored by `active_score`.
661
793
  - `s=3` (or 1-5): force **fixed** small point size for everything (direct, simple control).
662
794
  - `point_scale=0.2` + `min_size=1`: for variable sizing, make tiniest background points truly small.
663
795
 
@@ -674,17 +806,18 @@ All `scat.pl.*` functions support `ax=` / `axes=` (for embedding in multi-panel
674
806
  - `scat.pl.volcano_3d(results_df, point_scale=..., min_size=2, s=None, ...)`
675
807
  3D version of the volcano. Same size controls (`s` for fixed size).
676
808
 
809
+ - `scat.pl.enrich_dotplot(enrich_df, ...)` now also works well with GSEA results (auto defaults to `x="NES"`, diverging cmap for `color_by="NES"`).
810
+ - `scat.pl.gseaplot(ranked_genes, gsea_result, term=...)` — classic GSEA running-sum plot (uses precomputed curves from `run_gsea` when available).
677
811
  - `scat.pl.enrich_dotplot(enrich_df, top_n=15, show_terms=None, x="GeneRatio", size_by="Count", color_by="Adjusted P-value", ...)`
678
812
  Enrichment dot plot (clusterProfiler style).
679
- - `x`: x-axis variable — "GeneRatio" (default), "FoldEnrichment", **"Count"**, or "-log10(p.adj)".
680
- Pass `x="Count"` to visualize by the number of genes in the overlap (in addition to the classic GeneRatio/FoldEnrichment views).
681
- - `size_by` (dot size, default "Count"), `color_by` (default adjusted p-value).
682
- - `show_terms` accepts int (top N) or list of term strings/Descriptions (exact or partial match, order preserved) —
813
+ - `x`: x-axis variable — "GeneRatio" (default for ORA), "FoldEnrichment", **"Count"**, "-log10(p.adj)", or "NES" (for GSEA).
814
+ - `size_by` (dot size, default "Count"), `color_by` (default adjusted p-value; "NES" for GSEA uses diverging colormap).
815
+ - `show_terms` accepts int (top N), "auto" (p.adjust <0.05 + Count>=2 smart selection), or list of term strings/Descriptions (exact or partial match, order preserved)
683
816
  directly analogous to `dotplot(..., showCategory=...)`.
684
817
  Also available as `enrich_barplot`.
685
818
 
686
819
  - `scat.pl.volcano_3d(results_df, ...)`
687
- 3D volcano (logFC × -log10(p) × velocity_residual).
820
+ 3D volcano (logFC × -log10(p) × unspliced_excess_residual).
688
821
 
689
822
  - `scat.pl.active_score_rankplot(results_df, top_n=20, ...)`
690
823
  Simple horizontal barplot of top active scores.