PyPI - pycmplot - Versions diffs - 0.2.6__tar.gz → 0.2.7__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

{pycmplot-0.2.6 → pycmplot-0.2.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pycmplot
-Version: 0.2.6
+Version: 0.2.7
 Summary: Multi-track circular and linear Manhattan plot generation for GWAS summary statistics
 Author: Kevin Esoh
 Author-email: Kevin Esoh <kesohku1@jh.edu>
@@ -68,6 +68,8 @@ option of the package should be used to indicate the column and then the package
 positions in hg19 to hg38 ensuring that hits table generation and plotting are done with one unified
 coordinate system.
+# Key features
+## Column auto-detection
 A key functionality of the package is its ability to auto-detect certain columns if omitted on the
 command-line or Python API:
 - Chromosome column: `-chr, --chrom_column` or omitted
@@ -90,11 +92,54 @@ bld_candidates = [build, 'BUILD', 'Genome', 'Genome_Build', 'Genome-build']
 > NB: Upper- and lower-case forms of the candidates are also considered, so each candidate is expanded into three variants.
-Since GWAS summary stats files can be very large, to improve speed and memory efficiency, it is
-**highly recommended** to use `-tp, --trim_pval` with a value to exclude variants with p-value above a
-certain threshold, e.g. `0.01 (1e-2)` or `0.001 (1e-3)`.
+## Density-aware sub-sampling
+Another key feature is density-aware sub-sampling for Manhattan-style scatter plots.
+This was inspired by ``gwaslab``'s default behaviour (https://cloufield.github.io/gwaslab/).
+Every variant whose "interestingness" signal is at or above ``keep_threshold`` is preserved, so peaks,
+suggestive hits, genome-wide-significant hits, and extreme selection-scan values are kept verbatim.
+The dense bulk below the threshold is uniformly sub-sampled down to at most ``max_below`` rows in total.
+For a 10 M-variant scan with the defaults below, this typically cuts the plotted point count from 10 M
+to ~200 K plus a few hundred peaks: visually indistinguishable above the suggestive
+band, but two orders of magnitude faster to render.
+## Trim insignificant variants for faster plotting
+An optional parameter `-tp, --trim_pval` is provided to increase speed even further.
+Set it to a value to exclude variants with a p-value above that threshold,
+e.g. `0.01 (1e-2)` or `0.001 (1e-3)`. Applied on top of the default auto-thinning
+feature above, it significantly increases speed and reduces peak memory usage.
+See benchmark figure (manuscript in preparation).
+## Genome build conversion (liftover)
+Conversion of both hg18 and hg19 positions to their hg38 equivalents is included through
+`pyliftover.LiftOver`.
+This means you can concatenate multiple summary stats into one file and include a `BUILD`
+column to specify the genome build of each position ('hg18', 'hg19', or 'hg38') and all
+'hg18' and 'hg19' positions will be converted to 'hg38' so that all positions are plotted
+using one coordinate system. If only 'hg18' or only 'hg19' positions are present, no liftover
+is necessary. Hence, liftover is only performed in cases of mixed genome builds.
+## Nearest-gene annotation for GWAS lead SNPs
+The package bundles GFF3 files in hg19 and hg38 coordinates processed to reduce size
+for gene annotation. Also included are UCSC chain files for coordinate conversion (liftover).
+  - ``chain_hg19_hg38`` -- UCSC LiftOver chain file for hg19 to hg38
+    conversion. Resolved from ``PYCMPLOT_CHAIN_HG19_HG38`` or the bundled
+    ``hg19ToHg38.over.chain.gz``.
+  - ``chain_hg18_hg38`` -- UCSC LiftOver chain file for hg18 to hg38
+    conversion. Resolved from ``PYCMPLOT_CHAIN_HG18_HG38`` or the bundled
+    ``hg18ToHg38.over.chain.gz``. Only required when any input summary
+    statistics file carries a ``hg18`` build label.
+  - ``geneinfo_hg38`` -- Ensembl gene-info TSV for GRCh38, used for
+    nearest-gene annotation. Resolved from ``PYCMPLOT_GENEINFO_HG38`` or
+    the bundled ``Homo_sapiens.GRCh38.geneinfo.tsv.gz``.
+  - ``geneinfo_hg19`` -- Ensembl gene-info TSV for GRCh37, used when
+    input data carry a hg19 build label. Resolved from
+    ``PYCMPLOT_GENEINFO_HG19`` or the bundled
+    ``Homo_sapiens.GRCh37.geneinfo.tsv.gz``.
+# Application
 A potentially useful application is **comparative visualization** of results from multiple imputation panels,
 multiple populations, or multiple traits to observe shared genetic architecture.
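
The "Density-aware sub-sampling" section added in the hunk above can be illustrated with a minimal sketch. This is not pycmplot's implementation: the function name `thin_points`, the default values, and the use of the standard-library `random` module are assumptions made for illustration. Only the parameter names `keep_threshold` and `max_below` come from the README.

```python
import random


def thin_points(signal, keep_threshold=3.0, max_below=200_000, seed=0):
    """Return the sorted indices of the points that would be plotted.

    signal         -- per-variant scores, e.g. -log10(p)
    keep_threshold -- scores at or above this value are always kept
    max_below      -- cap on the number of points kept below the threshold
    The default values are illustrative assumptions, not pycmplot's defaults.
    Scores that are NaN fail both comparisons and are dropped in this sketch.
    """
    keep = [i for i, s in enumerate(signal) if s >= keep_threshold]
    below = [i for i, s in enumerate(signal) if s < keep_threshold]
    if len(below) > max_below:
        # Uniform sampling without replacement from the dense bulk.
        below = random.Random(seed).sample(below, max_below)
    return sorted(keep + below)


# 1000 low-scoring background points plus two peaks at the end.
scores = [0.1 * (i % 20) for i in range(1000)] + [8.5, 9.2]
idx = thin_points(scores, keep_threshold=5.0, max_below=100)
print(len(idx))  # 100 background points plus the 2 peaks
```

Fixing the seed keeps the thinned plot reproducible between runs, which matters when figures are regenerated for comparison.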

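The column auto-detection described in the PKG-INFO hunk above (a candidate list such as `bld_candidates`, with upper- and lower-case forms also considered) might look roughly like the following. The helper names `expand_candidates` and `detect_column` are hypothetical, and the matching order is an assumption; the package may resolve ties differently.

```python
def expand_candidates(candidates):
    """Expand each name to its original, upper-case and lower-case forms."""
    expanded = []
    for name in candidates:
        if name is None:  # the option was omitted by the user
            continue
        for variant in (name, name.upper(), name.lower()):
            if variant not in expanded:
                expanded.append(variant)
    return expanded


def detect_column(columns, candidates):
    """Return the first expanded candidate found among the columns, else None."""
    present = set(columns)
    for name in expand_candidates(candidates):
        if name in present:
            return name
    return None


build = None  # no build column given on the command line
bld_candidates = [build, "BUILD", "Genome", "Genome_Build", "Genome-build"]
header = ["SNP", "CHR", "BP", "P", "genome_build"]
found = detect_column(header, bld_candidates)
print(found)
```

Placing the user-supplied name first in the list means an explicit option always wins over the built-in guesses.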
{pycmplot-0.2.6 → pycmplot-0.2.7}/README.md RENAMED

@@ -29,6 +29,8 @@ option of the package should be used to indicate the column and then the package
 positions in hg19 to hg38 ensuring that hits table generation and plotting are done with one unified
 coordinate system.
+# Key features
+## Column auto-detection
 A key functionality of the package is its ability to auto-detect certain columns if omitted on the
 command-line or Python API:
 - Chromosome column: `-chr, --chrom_column` or omitted
@@ -51,11 +53,54 @@ bld_candidates = [build, 'BUILD', 'Genome', 'Genome_Build', 'Genome-build']
 > NB: Upper- and lower-case forms of the candidates are also considered, so each candidate is expanded into three variants.
-Since GWAS summary stats files can be very large, to improve speed and memory efficiency, it is
-**highly recommended** to use `-tp, --trim_pval` with a value to exclude variants with p-value above a
-certain threshold, e.g. `0.01 (1e-2)` or `0.001 (1e-3)`.
+## Density-aware sub-sampling
+Another key feature is density-aware sub-sampling for Manhattan-style scatter plots.
+This was inspired by ``gwaslab``'s default behaviour (https://cloufield.github.io/gwaslab/).
+Every variant whose "interestingness" signal is at or above ``keep_threshold`` is preserved, so peaks,
+suggestive hits, genome-wide-significant hits, and extreme selection-scan values are kept verbatim.
+The dense bulk below the threshold is uniformly sub-sampled down to at most ``max_below`` rows in total.
+For a 10 M-variant scan with the defaults below, this typically cuts the plotted point count from 10 M
+to ~200 K plus a few hundred peaks: visually indistinguishable above the suggestive
+band, but two orders of magnitude faster to render.
+## Trim insignificant variants for faster plotting
+An optional parameter `-tp, --trim_pval` is provided to increase speed even further.
+Set it to a value to exclude variants with a p-value above that threshold,
+e.g. `0.01 (1e-2)` or `0.001 (1e-3)`. Applied on top of the default auto-thinning
+feature above, it significantly increases speed and reduces peak memory usage.
+See benchmark figure (manuscript in preparation).
+## Genome build conversion (liftover)
+Conversion of both hg18 and hg19 positions to their hg38 equivalents is included through
+`pyliftover.LiftOver`.
+This means you can concatenate multiple summary stats into one file and include a `BUILD`
+column to specify the genome build of each position ('hg18', 'hg19', or 'hg38') and all
+'hg18' and 'hg19' positions will be converted to 'hg38' so that all positions are plotted
+using one coordinate system. If only 'hg18' or only 'hg19' positions are present, no liftover
+is necessary. Hence, liftover is only performed in cases of mixed genome builds.
+## Nearest-gene annotation for GWAS lead SNPs
+The package bundles GFF3 files in hg19 and hg38 coordinates processed to reduce size
+for gene annotation. Also included are UCSC chain files for coordinate conversion (liftover).
+  - ``chain_hg19_hg38`` -- UCSC LiftOver chain file for hg19 to hg38
+    conversion. Resolved from ``PYCMPLOT_CHAIN_HG19_HG38`` or the bundled
+    ``hg19ToHg38.over.chain.gz``.
+  - ``chain_hg18_hg38`` -- UCSC LiftOver chain file for hg18 to hg38
+    conversion. Resolved from ``PYCMPLOT_CHAIN_HG18_HG38`` or the bundled
+    ``hg18ToHg38.over.chain.gz``. Only required when any input summary
+    statistics file carries a ``hg18`` build label.
+  - ``geneinfo_hg38`` -- Ensembl gene-info TSV for GRCh38, used for
+    nearest-gene annotation. Resolved from ``PYCMPLOT_GENEINFO_HG38`` or
+    the bundled ``Homo_sapiens.GRCh38.geneinfo.tsv.gz``.
+  - ``geneinfo_hg19`` -- Ensembl gene-info TSV for GRCh37, used when
+    input data carry a hg19 build label. Resolved from
+    ``PYCMPLOT_GENEINFO_HG19`` or the bundled
+    ``Homo_sapiens.GRCh37.geneinfo.tsv.gz``.
+# Application
 A potentially useful application is **comparative visualization** of results from multiple imputation panels,
 multiple populations, or multiple traits to observe shared genetic architecture.
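
The mixed-build rule in the "Genome build conversion (liftover)" section above (convert only when more than one build is present) can be sketched as follows. The function `unify_builds` and the injected `convert` callable are hypothetical. In practice `convert` could wrap `pyliftover.LiftOver(...).convert_coordinate(...)`, but a stand-in with made-up offsets is used here so that the example runs without chain files. Dropping unmappable positions is likewise an assumption of this sketch.

```python
def unify_builds(rows, convert, target="hg38"):
    """Bring all rows onto one coordinate system.

    rows    -- list of dicts with 'chrom', 'pos' and 'build' keys
    convert -- callable (build, chrom, pos) -> new position, or None when
               the position cannot be mapped (hypothetical interface)
    """
    builds = {r["build"] for r in rows}
    if len(builds) <= 1:
        return rows  # a single build: nothing to convert
    unified = []
    for r in rows:
        if r["build"] == target:
            unified.append(dict(r))
            continue
        new_pos = convert(r["build"], r["chrom"], r["pos"])
        if new_pos is None:
            continue  # unmappable position, dropped in this sketch
        unified.append({**r, "pos": new_pos, "build": target})
    return unified


def fake_convert(build, chrom, pos):
    """Stand-in converter with made-up offsets; not a real liftover."""
    return pos + {"hg18": 200, "hg19": 100}[build]


rows = [
    {"chrom": "1", "pos": 1000, "build": "hg19"},
    {"chrom": "1", "pos": 2000, "build": "hg38"},
]
unified = unify_builds(rows, fake_convert)
print([(r["pos"], r["build"]) for r in unified])
```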

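The resource list in the README hunk above says each file is "resolved from" an environment variable or the bundled copy. A minimal sketch of that lookup order follows, assuming the environment variable takes precedence. The function `resolve_resource`, the `environ` argument, and the directory name `pycmplot_data` are hypothetical; only the variable and file names come from the README.

```python
import os
from pathlib import Path


def resolve_resource(env_var, bundled_name, bundled_dir, environ=None):
    """Return the override path from env_var if set, else the bundled file."""
    environ = os.environ if environ is None else environ
    override = environ.get(env_var)
    if override:
        return Path(override)
    return Path(bundled_dir) / bundled_name


BUNDLED_DIR = Path("pycmplot_data")  # hypothetical location of bundled files

# No override set: fall back to the bundled chain file.
chain = resolve_resource(
    "PYCMPLOT_CHAIN_HG19_HG38", "hg19ToHg38.over.chain.gz", BUNDLED_DIR, environ={}
)

# Override set: the user-supplied path is used instead.
custom = resolve_resource(
    "PYCMPLOT_CHAIN_HG19_HG38",
    "hg19ToHg38.over.chain.gz",
    BUNDLED_DIR,
    environ={"PYCMPLOT_CHAIN_HG19_HG38": "/data/chains/custom.chain.gz"},
)
print(chain, custom)
```

Passing the environment as an argument keeps the function testable without mutating `os.environ`.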
pycmplot-0.2.7/benchmark/bench_python.py ADDED

@@ -0,0 +1,266 @@
+#!/usr/bin/env python3
+"""
+bench_python.py
+Benchmarks Python GWAS visualization tools: pycmplot, gwaslab, qmplot.
+Usage:
+    python bench_python.py --tool pycmplot --input data/sumstats_1M.tsv \
+        --size 1M --replicates 5 --outdir results/
+Writes one CSV row per replicate to results/bench_python.csv.
+"""
+import argparse
+import csv
+import gc
+import os
+import sys
+import time
+import tracemalloc
+from pathlib import Path
+RESULT_COLS = [
+    "tool", "plot_type", "size_label", "n_variants",
+    "replicate", "wall_time_s", "peak_mem_mb", "out_file_kb"
+]
+def _record(writer, row: dict):
+    writer.writerow({k: row.get(k, "") for k in RESULT_COLS})
+# ---------------------------------------------------------------------------
+# Individual tool wrappers
+# Each wrapper must:
+#   1. Load data from disk (include I/O in timing)
+#   2. Produce a PNG to out_path
+#   3. Return nothing
+# ---------------------------------------------------------------------------
+def run_pycmplot(input_path: str, out_path: str, plot_type: str = "manhattan"):
+    """
+    pycmplot benchmark.
+    Adjust import path / function names to match your actual API.
+    """
+    import pandas as pd
+    import pycmplot
+    df = pd.read_csv(input_path, sep="\t")
+    if plot_type == "manhattan":
+        pycmplot.plot(
+            df,
+            chrom="CHR",
+            pos="BP",
+            pval="P",
+            snp="SNP",
+            plot_type="manhattan",
+            out=out_path,
+            dpi=150,
+        )
+    elif plot_type == "circular":
+        pycmplot.plot(
+            df,
+            chrom="CHR",
+            pos="BP",
+            pval="P",
+            snp="SNP",
+            plot_type="circular",
+            out=out_path,
+            dpi=150,
+        )
+    elif plot_type == "qq":
+        pycmplot.plot(
+            df,
+            chrom="CHR",
+            pos="BP",
+            pval="P",
+            plot_type="qq",
+            out=out_path,
+            dpi=150,
+        )
+def run_gwaslab(input_path: str, out_path: str, plot_type: str = "manhattan"):
+    """
+    gwaslab benchmark.
+    https://cloufield.github.io/gwaslab/
+    """
+    import pandas as pd
+    import gwaslab as gl
+    import matplotlib
+    matplotlib.use("Agg")
+    df = pd.read_csv(input_path, sep="\t")
+    mysumstats = gl.Sumstats(
+        df,
+        snpid="SNP",
+        chrom="CHR",
+        pos="BP",
+        p="P",
+        beta="BETA",
+        se="SE",
+        ea="A1",
+        nea="A2",
+    )
+    if plot_type == "manhattan":
+        mysumstats.plot_mqq(
+            mode="m",
+            save=out_path,
+            save_args={"dpi": 150},
+            verbose=False,
+        )
+    elif plot_type == "qq":
+        mysumstats.plot_mqq(
+            mode="q",
+            save=out_path,
+            save_args={"dpi": 150},
+            verbose=False,
+        )
+def run_qmplot(input_path: str, out_path: str, plot_type: str = "manhattan"):
+    """
+    qmplot benchmark.
+    https://github.com/ShujiaHuang/qmplot
+    """
+    import pandas as pd
+    import matplotlib
+    matplotlib.use("Agg")
+    import matplotlib.pyplot as plt
+    from qmplot import manhattanplot, qqplot
+    df = pd.read_csv(input_path, sep="\t")
+    fig, ax = plt.subplots(figsize=(12, 4))
+    if plot_type in ("manhattan", "circular"):
+        manhattanplot(
+            data=df,
+            chrom="CHR",
+            pos="BP",
+            pv="P",
+            snp="SNP",
+            ax=ax,
+        )
+    elif plot_type == "qq":
+        qqplot(
+            data=df["P"],
+            ax=ax,
+        )
+    fig.savefig(out_path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+# ---------------------------------------------------------------------------
+# Timing + memory harness
+# ---------------------------------------------------------------------------
+TOOL_RUNNERS = {
+    "pycmplot": run_pycmplot,
+    "gwaslab":  run_gwaslab,
+    "qmplot":   run_qmplot,
+}
+def benchmark_one(tool: str, input_path: str, out_path: str, plot_type: str):
+    """
+    Run one timed, memory-tracked benchmark call.
+    Returns
+    -------
+    tuple[float, float]
+        (wall_time_seconds, peak_memory_mb)
+    """
+    runner = TOOL_RUNNERS[tool]
+    # Force a full GC cycle before each run so prior allocations don't inflate
+    # the measured peak.
+    gc.collect()
+    tracemalloc.start()
+    try:
+        t0 = time.perf_counter()
+        runner(input_path, out_path, plot_type)
+        t1 = time.perf_counter()
+        _, peak_bytes = tracemalloc.get_traced_memory()
+    finally:
+        # Always stop tracing so a failed run cannot leak its peak into the next replicate.
+        tracemalloc.stop()
+    wall_time = t1 - t0
+    peak_mem_mb = peak_bytes / 1024 / 1024
+    return wall_time, peak_mem_mb
+def main():
+    parser = argparse.ArgumentParser(description="Benchmark Python GWAS visualization tools")
+    parser.add_argument("--tool",        required=True, choices=list(TOOL_RUNNERS.keys()))
+    parser.add_argument("--input",       required=True, help="Path to sumstats TSV")
+    parser.add_argument("--size",        required=True, help="Dataset size label (e.g. 1M)")
+    parser.add_argument("--plot-type",   default="manhattan",
+                        choices=["manhattan", "circular", "qq"],
+                        help="Plot type to benchmark")
+    parser.add_argument("--replicates",  type=int, default=5)
+    parser.add_argument("--outdir",      default="results", help="Directory for CSV results")
+    parser.add_argument("--figdir",      default="figures", help="Directory for generated figures")
+    args = parser.parse_args()
+    os.makedirs(args.outdir, exist_ok=True)
+    os.makedirs(args.figdir, exist_ok=True)
+    csv_path = os.path.join(args.outdir, "bench_python.csv")
+    write_header = not os.path.exists(csv_path)
+    # Count variants in input (data lines only; the header is subtracted)
+    with open(args.input) as src:
+        n_variants = sum(1 for _ in src) - 1
+    print(f"\n[bench] tool={args.tool}  size={args.size}  "
+          f"n={n_variants:,}  plot={args.plot_type}  reps={args.replicates}")
+    with open(csv_path, "a", newline="") as fh:
+        writer = csv.DictWriter(fh, fieldnames=RESULT_COLS)
+        if write_header:
+            writer.writeheader()
+        for rep in range(1, args.replicates + 1):
+            out_fig = os.path.join(
+                args.figdir,
+                f"{args.tool}_{args.plot_type}_{args.size}_rep{rep}.png"
+            )
+            try:
+                wall, mem = benchmark_one(args.tool, args.input, out_fig, args.plot_type)
+                out_kb = os.path.getsize(out_fig) / 1024 if os.path.exists(out_fig) else 0
+                row = dict(
+                    tool=args.tool,
+                    plot_type=args.plot_type,
+                    size_label=args.size,
+                    n_variants=n_variants,
+                    replicate=rep,
+                    wall_time_s=round(wall, 3),
+                    peak_mem_mb=round(mem, 2),
+                    out_file_kb=round(out_kb, 1),
+                )
+                writer.writerow(row)
+                fh.flush()  # write incrementally in case of OOM on large runs
+                print(f"  rep {rep}/{args.replicates}  "
+                      f"time={wall:.2f}s  mem={mem:.0f}MB  fig={out_kb:.0f}KB")
+            except Exception as e:
+                print(f"  rep {rep} FAILED: {e}", file=sys.stderr)
+                writer.writerow(dict(
+                    tool=args.tool, plot_type=args.plot_type,
+                    size_label=args.size, n_variants=n_variants,
+                    replicate=rep, wall_time_s="ERROR",
+                    peak_mem_mb="ERROR", out_file_kb="ERROR"
+                ))
+                fh.flush()
+if __name__ == "__main__":
+    main()
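
Since the script above appends one CSV row per replicate and writes `ERROR` into the numeric columns when a run fails, a small summary step is needed before the results can be compared. The sketch below is one possible way to do that and is not part of the package: the function `summarise` is hypothetical, and the sample rows are made-up numbers used only to exercise the code, not benchmark results.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarise(csv_file):
    """Median wall time and peak memory per (tool, plot_type, size_label)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["wall_time_s"] == "ERROR":
            continue  # failed replicate, excluded from the medians
        key = (row["tool"], row["plot_type"], row["size_label"])
        groups[key].append((float(row["wall_time_s"]), float(row["peak_mem_mb"])))
    return {
        key: {
            "n": len(vals),
            "median_time_s": statistics.median(t for t, _ in vals),
            "median_mem_mb": statistics.median(m for _, m in vals),
        }
        for key, vals in groups.items()
    }


# Made-up rows in the bench_python.csv layout (illustrative only).
sample = io.StringIO(
    "tool,plot_type,size_label,n_variants,replicate,wall_time_s,peak_mem_mb,out_file_kb\n"
    "pycmplot,manhattan,1M,1000000,1,2.0,100.0,50.0\n"
    "pycmplot,manhattan,1M,1000000,2,4.0,120.0,50.0\n"
    "pycmplot,manhattan,1M,1000000,3,ERROR,ERROR,ERROR\n"
    "pycmplot,manhattan,1M,1000000,4,3.0,110.0,50.0\n"
)
summary = summarise(sample)
print(summary[("pycmplot", "manhattan", "1M")])
```

The median is used rather than the mean so that a single slow replicate (for example, a cold file cache on the first run) does not dominate the summary.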
