PyPI - biopipen - Versions diffs - 0.25.4__tar.gz → 0.26.1__tar.gz - Mend

biopipen 0.25.4tar.gz → 0.26.1tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of biopipen might be problematic.

Files changed (245)

{biopipen-0.25.4 → biopipen-0.26.1}/PKG-INFO RENAMED

@@ -1,23 +1,22 @@
 Metadata-Version: 2.1
 Name: biopipen
-Version: 0.25.4
+Version: 0.26.1
 Summary: Bioinformatics processes/pipelines that can be run from `pipen run`
 License: MIT
 Author: pwwang
 Author-email: pwwang@pwwang.com
-Requires-Python: >=3.8,<4.0
+Requires-Python: >=3.9,<4.0
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Classifier: Programming Language :: Python :: 3.11
 Classifier: Programming Language :: Python :: 3.12
 Provides-Extra: runinfo
-Requires-Dist: datar[pandas] (>=0.15.4,<0.16.0)
-Requires-Dist: pipen-board[report] (>=0.14,<0.15)
-Requires-Dist: pipen-cli-run (>=0.12,<0.13)
-Requires-Dist: pipen-filters (>=0.11,<0.12)
-Requires-Dist: pipen-poplog (>=0.0.2,<0.0.3)
-Requires-Dist: pipen-runinfo (>=0.5,<0.6) ; extra == "runinfo"
-Requires-Dist: pipen-verbose (>=0.10,<0.11)
+Requires-Dist: datar[pandas] (>=0.15.5,<0.16.0)
+Requires-Dist: pipen-board[report] (>=0.15,<0.16)
+Requires-Dist: pipen-cli-run (>=0.13,<0.14)
+Requires-Dist: pipen-filters (>=0.12,<0.13)
+Requires-Dist: pipen-poplog (>=0.1,<0.2)
+Requires-Dist: pipen-runinfo (>=0.6,<0.7) ; extra == "runinfo"
+Requires-Dist: pipen-verbose (>=0.11,<0.12)
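
The metadata hunk above drops Python 3.8: `Requires-Python` moves from `>=3.8,<4.0` to `>=3.9,<4.0`. A minimal sketch of what the new range means for an interpreter version, using plain tuple comparison. The helper name `satisfies_026` is hypothetical and not part of biopipen, and real installers evaluate the full version specifier rather than a `(major, minor)` pair.

```python
import sys

# Bounds copied from the 0.26.1 PKG-INFO line: Requires-Python: >=3.9,<4.0
LOWER = (3, 9)  # inclusive lower bound
UPPER = (4, 0)  # exclusive upper bound


def satisfies_026(version=None):
    """Return True if (major, minor) lies inside >=3.9,<4.0.

    Hypothetical helper for illustration only.
    """
    if version is None:
        version = sys.version_info[:2]
    return LOWER <= tuple(version[:2]) < UPPER


if __name__ == "__main__":
    print("current interpreter accepted:", satisfies_026())
```

Under this check, an environment pinned to Python 3.8 that accepted 0.25.4 would no longer qualify for 0.26.1.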

biopipen-0.26.1/biopipen/__init__.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.26.1"

{biopipen-0.25.4 → biopipen-0.26.1}/biopipen/core/config.toml RENAMED

@@ -27,6 +27,8 @@ convert = "convert"
 wget = "wget"
 # aria2c
 aria2c = "aria2c"
+# plink
+plink = "plink"
 # tabix
 tabix = "tabix"
 # sambamba

biopipen-0.26.1/biopipen/ns/rnaseq.py ADDED

@@ -0,0 +1,158 @@
+"""RNA-seq data analysis"""
+from ..core.proc import Proc
+from ..core.config import config
+class UnitConversion(Proc):
+    """Convert expression value units back and forth
+    See <https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/>
+    and <https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#fpkm>.
+    The following conversions are supported:
+    * `count -> cpm, fpkm/rpkm, fpkmuq/rpkmuq, tpm, tmm`
+    * `fpkm/rpkm -> count, tpm, cpm`
+    * `tpm -> count, fpkm/rpkm, cpm`
+    * `cpm -> count, fpkm/rpkm, tpm`
+    NOTE that during some conversions, `sum(counts/effLen)` is approximated by
+    `sum(counts)/sum(effLen) * length(effLen)`
+    You can also use this process to just transform the expression values, e.g., take
+    log2 of the expression values. In this case, you can set `inunit` and `outunit` to
+    `count` and `log2(count + 1)` respectively.
+    Input:
+        infile: Input file containing expression values
+            The file should be a matrix with rows representing genes and columns
+            representing samples.
+            It could be an RDS file containing a data frame or a matrix, or a
+            text file containing a matrix with tab as the delimiter. The text
+            file can be gzipped.
+    Output:
+        outfile: Output file containing the converted expression values
+            The file will be a matrix with rows representing genes and columns
+            representing samples.
+    Envs:
+        inunit: The input unit of the expression values.
+            You can also use an expression to indicate the input unit, e.g.,
+            `log2(counts + 1)`. The expression should be like `A * fn(B*X + C) + D`,
+            where `A`, `B`, `C` and `D` are constants, `fn` is a function, and X is
+            the input unit.
+            Currently only `expr`, `sqrt`, `log2`, `log10` and `log` are supported as
+            functions.
+            Supported input units are:
+            * counts/count/rawcounts/rawcount: raw counts.
+            * cpm: counts per million.
+            * fpkm/rpkm: fragments per kilobase of transcript per million.
+            * fpkmuq/rpkmuq: upper quartile normalized FPKM/RPKM.
+            * tpm: transcripts per million.
+            * tmm: trimmed mean of M-values.
+        outunit: The output unit of the expression values. An expression can also be
+            used for transformation (e.g. `log2(tpm + 1)`). If `inunit` is `count`,
+            then this means we are converting raw counts to tpm, and transforming it
+            to `log2(tpm + 1)` as the output. Any expression supported by `R` can be
+            used. Same units as `inunit` are supported.
+        refexon: Path to the reference exon gff file.
+        meanfl (type=auto): A file containing the mean fragment length for each sample
+            by rows (samples as rowname), without header.
+            Or a fixed universal estimated number (1 used by TCGA).
+        nreads (type=auto): The estimated total number of reads for each sample.
+            Or you can pass a file with the number for each sample by rows
+            (samples as rowname), without header.
+            When converting `fpkm/rpkm -> count`, it should be total reads of that sample.
+            When converting `cpm -> count`: it should be total reads of that sample.
+            When converting `tpm -> count`: it should be total reads of that sample.
+            When converting `tpm -> cpm`: it should be total reads of that sample.
+            When converting `tpm -> fpkm/rpkm`: it should be `sum(fpkm)` of that sample.
+            It is not used when converting `count -> cpm, fpkm/rpkm, tpm`.
+    """  # noqa: E501
+    input = "infile:file"
+    output = "outfile:file:{{in.infile | basename}}"
+    lang = config.lang.rscript
+    envs = {
+        "inunit": None,
+        "outunit": None,
+        "refexon": config.ref.refexon,
+        "meanfl": 1,
+        "nreads": 1_000_000,
+    }
+    script = "file://../scripts/rnaseq/UnitConversion.R"
+class Simulation(Proc):
+    """Simulate RNA-seq data using ESCO/RUVcorr package
+    Input:
+        ngenes: Number of genes to simulate
+        nsamples: Number of samples to simulate
+            If you want to force the process to re-simulate for the same
+            `ngenes` and `nsamples`, you can set a different value for `envs.seed`.
+            Note that the samples will be shown as cells in the output (since
+            the simulation is designed for single-cell RNA-seq data).
+    Output:
+        outfile: Output file containing the simulated data with rows representing
+            genes and columns representing samples.
+        outdir: Output directory containing the simulated data
+            `sim.rds` and `True.rds` will be generated.
+            For `ESCO`, `sim.rds` contains the simulated data in a
+            `SingleCellExperiment` object, and `True.rds` contains the matrix of true
+            counts.
+            For `RUVcorr`, `sim.rds` contains the simulated data in a list with
+            `Truth`, A matrix containing the values of Xβ; `Y` A matrix containing the
+            values in `Y`; `Noise` A matrix containing the values in `Wα`; `Sigma`
+            A matrix containing the true gene-gene correlations, as defined by Xβ; and
+            `Info` A matrix containing some of the general information about the
+            simulation.
+            For all matrices, rows represent genes and columns represent samples.
+    Envs:
+        tool (choice): Which tool to use for simulation.
+            - ESCO: uses the [ESCO](https://github.com/JINJINT/ESCO) package.
+            - RUVcorr: uses the [RUVcorr](https://rdrr.io/bioc/RUVcorr/) package.
+        ncores (type=int): Number of cores to use.
+        seed (type=int): Random seed.
+            If not set, seed will not be set.
+        esco_args (ns): Additional arguments to pass to the simulation function.
+            - save (choice): Which type of data to save to `out.outfile`.
+                - `simulated-truth`: saves the simulated true counts.
+                - `zero-inflated`: saves the zero-inflated counts.
+                - `down-sampled`: saves the down-sampled counts.
+            - type (choice): Which type of heterogeneity to use.
+                - single: produces a single population.
+                - group: produces distinct groups.
+                - tree: produces distinct groups but admits a tree structure.
+                - traj: produces distinct groups but admits a smooth trajectory
+                    structure.
+            - <more>: See <https://rdrr.io/github/JINJINT/ESCO/man/escoParams.html>.
+        ruvcorr_args (ns): Additional arguments to pass to the simulation
+            function.
+            - <more>: See <https://rdrr.io/bioc/RUVcorr/man/simulateGEdata.html>.
+        transpose_output (flag): If set, the output will be transposed.
+        index_start (type=int): The index to start from when naming the samples.
+            Affects the sample names in `out.outfile` only.
+    """
+    input = "ngenes:var, nsamples:var"
+    output = [
+        "outfile:file:{{in.ngenes}}x{{in.nsamples}}.sim/simulated.txt",
+        "outdir:dir:{{in.ngenes}}x{{in.nsamples}}.sim",
+    ]
+    lang = config.lang.rscript
+    envs = {
+        "tool": "RUVcorr",
+        "ncores": config.misc.ncores,
+        "type": "single",
+        "esco_args": {
+            "dropout-type": "none",
+            "save": "simulated-truth",
+            "type": "single",
+        },
+        "ruvcorr_args": {},
+        "seed": None,
+        "transpose_output": False,
+        "index_start": 1,
+    }
+    script = "file://../scripts/rnaseq/Simulation.R"
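
The `UnitConversion` docstring above lists conversions between counts, CPM, FPKM/RPKM and TPM. The R script that implements them is not part of this excerpt, so the following is only a sketch of the textbook definitions for a single sample. The function names are hypothetical, and plain gene lengths stand in for the effective lengths the process derives from `refexon` and `meanfl`.

```python
def count_to_cpm(counts):
    """Counts per million: scale each count by the sample's library size."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def count_to_fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths)]


def count_to_tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths)]
    denom = sum(rates)
    return [r / denom * 1e6 for r in rates]


def fpkm_to_tpm(fpkm):
    """TPM from FPKM only needs the per-sample sum of FPKM values."""
    total = sum(fpkm)
    return [f / total * 1e6 for f in fpkm]


if __name__ == "__main__":
    counts = [10, 20, 70]          # toy counts for three genes
    lengths = [1000, 2000, 1000]   # toy gene lengths in bases
    print(count_to_cpm(counts))
    print(count_to_tpm(counts, lengths))
```

Going from counts to TPM directly and going through FPKM give the same values, because FPKM is proportional to `count / length` within a sample. That relationship is also why the docstring asks for `sum(fpkm)` as `nreads` when converting `tpm -> fpkm/rpkm`.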

{biopipen-0.25.4 → biopipen-0.26.1}/biopipen/ns/scrna.py RENAMED

@@ -483,14 +483,18 @@ class SeuratClusterStats(Proc):
             The parameters from the cases can overwrite the default parameters.
             - frac (flag): Whether to output the fraction of cells instead of number.
             - pie (flag): Also output a pie chart?
+            - circos (flag): Also output a circos plot?
             - table (flag): Whether to output a table (in tab-delimited format) and in the report.
             - frac_ofall(flag): Whether to output the fraction against all cells,
                 instead of the fraction in each group.
+                Does not work for circos plot.
                 Only works when `frac` is `True` and `group-by` is specified.
             - transpose (flag): Whether to transpose the cluster and group, that is,
                 using group as the x-axis and cluster to fill the plot.
+                For circos plot, when transposed, the arrows will be drawn from the idents (by `ident`) to
+                the groups (by `group-by`).
                 Only works when `group-by` is specified.
-            - position (choice): The position of the bars.
+            - position (choice): The position of the bars. Does not work for pie and circos plots.
                 - stack: Use `position_stack()`.
                 - fill: Use `position_fill()`.
                 - dodge: Use `position_dodge()`.
@@ -499,8 +503,13 @@ class SeuratClusterStats(Proc):
             - group-by: The column name in metadata to group the cells.
                 Does NOT support for pie charts.
             - split-by: The column name in metadata to split the cells into different plots.
+                Does NOT support for circos plots.
             - subset: An expression to subset the cells, will be passed to
                 `dplyr::filter()` on metadata.
+            - circos_devpars (ns): The device parameters for the circos plots.
+                - res (type=int): The resolution of the plots.
+                - height (type=int): The height of the plots.
+                - width (type=int): The width of the plots.
             - pie_devpars (ns): The device parameters for the pie charts.
                 - res (type=int): The resolution of the plots.
                 - height (type=int): The height of the plots.
@@ -634,6 +643,7 @@ class SeuratClusterStats(Proc):
         "stats_defaults": {
             "frac": False,
             "pie": False,
+            "circos": False,
             "table": False,
             "position": "auto",
             "frac_ofall": False,
@@ -644,6 +654,7 @@ class SeuratClusterStats(Proc):
             "subset": None,
             "devpars": {"res": 100, "height": 600, "width": 800},
             "pie_devpars": {"res": 100, "height": 600, "width": 800},
+            "circos_devpars": {"res": 100, "height": 600, "width": 600},
         },
         "stats": {
             "Number of cells in each cluster": {
@@ -882,8 +893,9 @@ class CellsDistribution(Proc):
         each: The column name in metadata to separate the cells into different plots.
         section: The section to show in the report. This allows different cases to be put in the same section in report.
             Only works when `each` is not specified.
-        overlap (list): Plot the overlap of cells in different cases under the same section.
-            The section must have at least 2 cases.
+        overlap (list): Plot the overlap of cell groups (values of `cells_by`) in different cases
+            under the same section.
+            The section must have at least 2 cases, each case should have a single `cells_by` column.
         cases (type=json;order=99): If you have multiple cases, you can specify them here.
             Keys are the names of the cases and values are the options above except `mutaters`.
             If some options are not specified, the options in `envs` will be used.
@@ -1141,6 +1153,7 @@ class TopExpressingGenes(Proc):
             markers See below for all libraries.
             <https://maayanlab.cloud/Enrichr/#libraries>
         n (type=int): The number of top expressing genes to find.
+        subset: An expression to subset the cells for each case.
         cases (type=json): If you have multiple cases, you can specify them
             here. The keys are the names of the cases and the values are the
             above options except `mutaters`. If some options are
@@ -1161,6 +1174,7 @@ class TopExpressingGenes(Proc):
         "section": "DEFAULT",
         "dbs": ["KEGG_2021_Human", "MSigDB_Hallmark_2020"],
         "n": 250,
+        "subset": None,
         "cases": {},
     }
     plugin_opts = {
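
The `SeuratClusterStats` hunks above add `circos` and `circos_devpars` to `stats_defaults`, and the docstring says parameters from individual cases can overwrite the defaults. A minimal sketch of that kind of resolution, assuming one level of key-by-key merging for nested namespaces such as `circos_devpars`. Whether biopipen merges nested values or replaces them wholesale is not shown in this diff, and `resolve_cases` and the case name are hypothetical.

```python
from copy import deepcopy

# Subset of the defaults shown in the hunk; other keys are omitted here.
STATS_DEFAULTS = {
    "frac": False,
    "pie": False,
    "circos": False,
    "table": False,
    "position": "auto",
    "circos_devpars": {"res": 100, "height": 600, "width": 600},
}


def resolve_cases(defaults, cases):
    """Fill each case with defaults, letting case values take precedence.

    Assumption: nested dicts are merged one level deep, not replaced.
    """
    resolved = {}
    for name, overrides in cases.items():
        case = deepcopy(defaults)
        for key, value in overrides.items():
            if isinstance(value, dict) and isinstance(case.get(key), dict):
                case[key] = {**case[key], **value}
            else:
                case[key] = value
        resolved[name] = case
    return resolved


if __name__ == "__main__":
    cases = {"Cells per cluster (circos)": {"circos": True,
                                            "circos_devpars": {"width": 900}}}
    print(resolve_cases(STATS_DEFAULTS, cases))
```

With this approach a case only has to state what differs, for example switching `circos` on and widening the plot, while `res` and `height` fall back to the defaults.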

biopipen-0.26.1/biopipen/ns/snp.py ADDED

@@ -0,0 +1,70 @@
+"""Plink processes"""
+from ..core.proc import Proc
+from ..core.config import config
+class PlinkSimulation(Proc):
+    """Simulate SNPs using PLINK v1.9
+    See also <https://www.cog-genomics.org/plink/1.9/input#simulate>.
+    Input:
+        nsnps: Number of SNPs to simulate
+        ncases: Number of cases to simulate
+        nctrls: Number of controls to simulate
+    Output:
+        outdir: Output directory containing the simulated data
+            `plink_sim.bed`, `plink_sim.bim`, and `plink_sim.fam` will be generated.
+        gtmat: Genotype matrix file containing the simulated data with rows representing
+            SNPs and columns representing samples.
+    Envs:
+        plink: Path to PLINK v1.9
+        seed (type=int): Random seed.
+            If not set, seed will not be set.
+        label: Prefix label for the SNPs.
+        prevalence (type=float): Disease prevalence.
+        minfreq (type=float): Minimum allele frequency.
+        maxfreq (type=float): Maximum allele frequency.
+        hetodds (type=float): Odds ratio for heterozygous genotypes.
+        homodds (type=float): Odds ratio for homozygous genotypes.
+        missing (type=float): Proportion of missing genotypes.
+        args (ns): Additional arguments to pass to PLINK.
+            - <more>: see <https://www.cog-genomics.org/plink/1.9/input#simulate>.
+        transpose_gtmat (flag): If set, the genotype matrix (`out.gtmat`) will
+            be transposed.
+        sample_prefix: Use this prefix for the sample names. If not set, the sample
+            names will be `per0_per0`, `per1_per1`, `per2_per2`, etc. If set, the
+            sample names will be `prefix0`, `prefix1`, `prefix2`, etc.
+            This only affects the sample names in the genotype matrix file
+            (`out.gtmat`).
+    """
+    input = "nsnps:var, ncases:var, nctrls:var"
+    output = [
+        (
+            "outdir:dir:{{in.nsnps | int}}_"
+            "{{in.ncases | int}}xcases_{{in.nctrls | int}}xctrls.plink_sim"
+        ),
+        (
+            "gtmat:file:{{in.nsnps | int}}_"
+            "{{in.ncases | int}}xcases_{{in.nctrls | int}}xctrls.plink_sim/gtmat.txt"
+        ),
+    ]
+    lang = config.lang.python
+    envs = {
+        "plink": config.exe.plink,
+        "seed": None,
+        "label": "SNP",
+        "prevalence": 0.01,
+        "minfreq": 0.0,
+        "maxfreq": 1.0,
+        "hetodds": 1.0,
+        "homodds": 1.0,
+        "missing": 0.0,
+        "args": {},
+        "transpose_gtmat": False,
+        "sample_prefix": None,
+    }
+    script = "file://../scripts/snp/PlinkSimulation.py"
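
The `PlinkSimulation` envs above correspond closely to the inputs of PLINK 1.9's `--simulate`. The process script `PlinkSimulation.py` is not shown in this excerpt, so the following is only a sketch of how those values could be turned into a simulation parameter line and a command line. The helper names are hypothetical, the command is built but not executed, and the flag spellings should be checked against the PLINK documentation linked in the docstring.

```python
def simulate_params_line(nsnps, label="SNP", minfreq=0.0, maxfreq=1.0,
                         hetodds=1.0, homodds=1.0):
    """One line of a --simulate parameter file:
    SNP count, label, allele frequency bounds, het and hom odds ratios."""
    return f"{int(nsnps)} {label} {minfreq} {maxfreq} {hetodds} {homodds}"


def simulate_command(params_file, out_prefix, ncases, nctrls, plink="plink",
                     prevalence=0.01, missing=0.0, seed=None):
    """Assemble an argument list; the caller decides whether to run it."""
    cmd = [
        plink,
        "--simulate", params_file,
        "--simulate-ncases", str(int(ncases)),
        "--simulate-ncontrols", str(int(nctrls)),
        "--simulate-prevalence", str(prevalence),
        "--simulate-missing", str(missing),
        "--make-bed",
        "--out", out_prefix,
    ]
    if seed is not None:
        # Mirrors the docstring: no seed is passed unless one is set.
        cmd += ["--seed", str(int(seed))]
    return cmd


if __name__ == "__main__":
    print(simulate_params_line(100))
    print(" ".join(simulate_command("sim.params", "plink_sim", 50, 50, seed=1)))
```

Returning an argument list instead of a shell string keeps quoting out of the picture if the list is later passed to `subprocess.run`.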

biopipen-0.26.1/biopipen/ns/stats.py ADDED

@@ -0,0 +1,320 @@
+"""Provides processes for statistics."""
+from ..core.proc import Proc
+from ..core.config import config
+class ChowTest(Proc):
+    """Massive Chow tests.
+    See Also https://en.wikipedia.org/wiki/Chow_test
+    Input:
+        infile: The input data file. The rows are samples and the columns are
+            features. It must be tab-delimited.
+            ```
+            Sample   F1   F2   F3   ...   Fn
+            S1       1.2  3.4  5.6        7.8
+            S2       2.3  4.5  6.7        8.9
+            ...
+            Sm       5.6  7.8  9.0        1.2
+            ```
+        groupfile: The group file. The rows are the samples and the columns
+            are the groupings. It must be tab-delimited.
+            ```
+            Sample   G1   G2   G3   ...   Gk
+            S1       0    1    0          0
+            S2       2    1    0          NA  # exclude this sample
+            ...
+            Sm       1    0    0          0
+            ```
+        fmlfile: The formula file. The first column is grouping and the
+            second column is the formula. It must be tab-delimited.
+            ```
+            Group   Formula  ...  # Other columns to be added to outfile
+            G1      Fn ~ F1 + Fx + Fy   # Fx, Fy could be covariates
+            G1      Fn ~ F2 + Fx + Fy
+            ...
+            Gk      Fn ~ F3 + Fx + Fy
+            ```
+    Output:
+        outfile: The output file. It is a tab-delimited file with the first
+            column as the grouping and the second column as the p-value.
+            ```
+            Group  Formula  ...  Pooled  Groups  SSR  SumSSR  Fstat  Pval  Padj
+            G1     Fn ~ F1       0.123   2       1    0.123   0.123  0.123 0.123
+            G1     Fn ~ F2       0.123   2       1    0.123   0.123  0.123 0.123
+            ...
+            Gk     Fn ~ F3       0.123   2       1    0.123   0.123  0.123 0.123
+            ```
+    Envs:
+        padj (choice): The method for p-value adjustment.
+            - none: No p-value adjustment (no Padj column in outfile).
+            - holm: Holm-Bonferroni method.
+            - hochberg: Hochberg method.
+            - hommel: Hommel method.
+            - bonferroni: Bonferroni method.
+            - BH: Benjamini-Hochberg method.
+            - BY: Benjamini-Yekutieli method.
+            - fdr: FDR correction method.
+        transpose_input (flag): Whether to transpose the input file.
+        transpose_group (flag): Whether to transpose the group file.
+    """
+    input = "infile:file, groupfile:file, fmlfile:file"
+    output = "outfile:file:{{in.infile | stem}}.chowtest.txt"
+    lang = config.lang.rscript
+    envs = {
+        "padj": "none",
+        "transpose_input": False,
+        "transpose_group": False,
+    }
+    script = "file://../scripts/stats/ChowTest.R"
+class LiquidAssoc(Proc):
+    """Liquid association tests.
+    See Also https://github.com/gundt/fastLiquidAssociation
+    Requires https://github.com/pwwang/fastLiquidAssociation
+    Input:
+        infile: The input data file. The rows are samples and the columns are
+            features. It must be tab-delimited.
+            ```
+            Sample   F1   F2   F3   ...   Fn
+            S1       1.2  3.4  5.6        7.8
+            S2       2.3  4.5  6.7        8.9
+            ...
+            Sm       5.6  7.8  9.0        1.2
+            ```
+            The features (columns) will be tested pairwise, which will be the X and
+            Y columns in the result of `fastMLA`
+        covfile: The covariate file. The rows are the samples and the columns
+            are the covariates. It must be tab-delimited.
+            If provided, the data in `in.infile` will be adjusted by covariates by
+            regressing out the covariates and the residuals will be used for
+            liquid association tests.
+        groupfile: The group file. The rows are the samples and the columns
+            are the groupings. It must be tab-delimited.
+            ```
+            Sample   G1   G2   G3   ...   Gk
+            S1       0    1    0          0
+            S2       2    1    0          NA  # exclude this sample
+            ...
+            Sm       1    0    0          0
+            ```
+            This will serve as the Z column in the result of `fastMLA`
+            This can be omitted. If so, `envs.nvec` should be specified, which is
+            to select column from `in.infile` as Z.
+        fmlfile: The formula file. The 3 columns are X3, X12 and X21. The results
+            will be filtered based on the formula. It must be tab-delimited without
+            header.
+    Output:
+        outfile: The output file.
+            ```
+            X12  X21  X3  rhodiff  MLA value  estimates  san.se  wald  Pval  model
+            C38  C46  C5  0.87  0.32  0.67  0.20  10.87  0  F
+            C46  C38  C5  0.87  0.32  0.67  0.20  10.87  0  F
+            C27  C39  C4  0.94  0.34  1.22  0.38  10.03  0  F
+            ```
+    Envs:
+        nvec: The column index (1-based) of Z in `in.infile`, if `in.groupfile` is
+            omitted. You can specify multiple columns by comma-separated values, or
+            a range of columns by `-`. For example, `1,3,5-7,9`. It also supports
+            column names. For example, `F1,F3`. `-` is not supported for column
+            names.
+        x: Similar to `nvec`, but limits the X group to the given features.
+            The rest of features (other than X and Z) in `in.infile` will
+            be used as Y.
+            The features in `in.infile` will still be tested pairwise, but only
+            features in X and Y will be kept.
+        topn (type=int): Number of results to return by `fastMLA`, ordered from
+            highest `|MLA|` value descending.
+            The default of the package is 2000, but here we set to 1e6 to return as
+            many results as possible (also good to do pvalue adjustment).
+        rvalue (type=float): Tolerance value for LA approximation. Lower values of
+            rvalue will cause a more thorough search, but take longer.
+        cut (type=int): Value passed to the GLA function to create buckets
+            (equal to number of buckets+1). Values placing between 15-30 samples per
+            bucket are optimal. Must be a positive integer>1. By default,
+            `max(ceiling(nrow(data)/22), 4)` is used.
+        ncores (type=int): Number of cores to use for parallelization.
+        padj (choice): The method for p-value adjustment.
+            - none: No p-value adjustment (no Padj column in outfile).
+            - holm: Holm-Bonferroni method.
+            - hochberg: Hochberg method.
+            - hommel: Hommel method.
+            - bonferroni: Bonferroni method.
+            - BH: Benjamini-Hochberg method.
+            - BY: Benjamini-Yekutieli method.
+            - fdr: FDR correction method.
+        transpose_input (flag): Whether to transpose the input file.
+        transpose_group (flag): Whether to transpose the group file.
+        transpose_cov (flag): Whether to transpose the covariate file.
+        xyz_names: The names of X12, X21 and X3 in the final output file. Separated
+            by comma. For example, `X12,X21,X3`.
+    """
+    input = "infile:file, covfile:file, groupfile:file, fmlfile:file"
+    output = "outfile:file:{{in.infile | stem}}.liquidassoc.txt"
+    lang = config.lang.rscript
+    envs = {
+        "nvec": None,
+        "x": None,
+        "topn": 1e6,
+        "rvalue": 0.5,
+        "cut": 20,
+        "ncores": config.misc.ncores,
+        "padj": "none",
+        "transpose_input": False,
+        "transpose_group": False,
+        "transpose_cov": False,
+        "xyz_names": None,
+    }
+    script = "file://../scripts/stats/LiquidAssoc.R"
+class DiffCoexpr(Proc):
+    """Differential co-expression analysis.
+    See also <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-497>
+    and <https://github.com/DavisLaboratory/dcanr/blob/8958d61788937eef3b7e2b4118651cbd7af7469d/R/inference_methods.R#L199>.
+    Input:
+        infile: The input data file. The rows are samples and the columns are
+            features. It must be tab-delimited.
+            ```
+            Sample   F1   F2   F3   ...   Fn
+            S1       1.2  3.4  5.6        7.8
+            S2       2.3  4.5  6.7        8.9
+            ...
+            Sm       5.6  7.8  9.0        1.2
+            ```
+        groupfile: The group file. The rows are the samples and the columns
+            are the groupings. It must be tab-delimited.
+            ```
+            Sample   G1   G2   G3   ...   Gk
+            S1       0    1    0          0
+            S2       2    1    0          NA  # exclude this sample
+            ...
+            Sm       1    0    0          0
+            ```
+    Output:
+        outfile: The output file. It is a tab-delimited file with the first
+            column as the feature pair and the second column as the p-value.
+            ```
+            Group  Feature1  Feature2  Pval  Padj
+            G1     F1        F2        0.123 0.123
+            G1     F1        F3        0.123 0.123
+            ...
+            ```
+    Envs:
+        method (choice): The method used to calculate the differential
+            co-expression.
+            - pearson: Pearson correlation.
+            - spearman: Spearman correlation.
+        beta: The beta value for the differential co-expression analysis.
+        padj (choice): The method for p-value adjustment.
+            - none: No p-value adjustment (no Padj column in outfile).
+            - holm: Holm-Bonferroni method.
+            - hochberg: Hochberg method.
+            - hommel: Hommel method.
+            - bonferroni: Bonferroni method.
+            - BH: Benjamini-Hochberg method.
+            - BY: Benjamini-Yekutieli method.
+            - fdr: FDR correction method.
+        perm_batch (type=int): The number of permutations to run in each batch
+        seed (type=int): The seed for random number generation
+        ncores (type=int): The number of cores to use for parallelization
+        transpose_input (flag): Whether to transpose the input file.
+        transpose_group (flag): Whether to transpose the group file.
+    """  # noqa: E501
+    input = "infile:file, groupfile:file"
+    output = "outfile:file:{{in.infile | stem}}.diffcoexpr.txt"
+    lang = config.lang.rscript
+    envs = {
+        "method": "pearson",
+        "beta": 6,
+        "padj": "none",
+        "perm_batch": 20,
+        "seed": 8525,
+        "ncores": config.misc.ncores,
+        "transpose_input": False,
+        "transpose_group": False,
+    }
+    script = "file://../scripts/stats/DiffCoexpr.R"
+class MetaPvalue(Proc):
+    """Calculation of meta p-values.
+    If there is only one input file, only the p-value adjustment will be performed.
+    Input:
+        infiles: The input files. Each file is a tab-delimited file with multiple
+            columns. There should be ID column(s) to match the rows in other files and
+            p-value column(s) to be combined. The records will be full-joined by ID.
+            When only one file is provided, only the pvalue adjustment will be
+            performed when `envs.padj` is not `none`, otherwise the input file will
+            be copied to `out.outfile`.
+    Output:
+        outfile: The output file. It is a tab-delimited file with the first column as
+            the ID and the second column as the combined p-value.
+            ```
+            ID  ID1 ...  Pval   Padj
+            a   x   ...  0.123  0.123
+            b   y   ...  0.123  0.123
+            ...
+            ```
+    Envs:
+        id_cols: The column names used in all `in.infiles` as ID columns. Multiple
+            columns can be specified by comma-separated values. For example, `ID1,ID2`.
+            If `id_exprs` is specified, this should be a single column name for the new
+            ID column in each `in.infiles` and the final `out.outfile`.
+        id_exprs: The R expressions for each `in.infiles` to get ID column(s).
+        pval_cols: The column names used in all `in.infiles` as p-value columns.
+            Different columns can be specified by comma-separated values for each
+            `in.infiles`. For example, `Pval1,Pval2`.
+        method (choice): The method used to calculate the meta-pvalue.
+            - fisher: Fisher's method.
+            - sumlog: Sum of logarithms (same as Fisher's method)
+            - logitp: Logit method.
+            - sumz: Sum of z method (Stouffer's method).
+            - meanz: Mean of z method.
+            - meanp: Mean of p method.
+            - invt: Inverse t method.
+            - sump: Sum of p method (Edgington's method).
+            - votep: Vote counting method.
+            - wilkinsonp: Wilkinson's method.
+            - invchisq: Inverse chi-square method.
+        na: The method to handle NA values. -1 to skip the record. Otherwise NA
+            will be replaced by the given value.
+        padj (choice): The method for p-value adjustment.
+            - none: No p-value adjustment (no Padj column in outfile).
+            - holm: Holm-Bonferroni method.
+            - hochberg: Hochberg method.
+            - hommel: Hommel method.
+            - bonferroni: Bonferroni method.
+            - BH: Benjamini-Hochberg method.
+            - BY: Benjamini-Yekutieli method.
+            - fdr: FDR correction method.
+    """
+    input = "infiles:files"
+    output = "outfile:file:{{in.infiles | first | stem}}.metapval.txt"
+    lang = config.lang.rscript
+    envs = {
+        "id_cols": None,
+        "id_exprs": None,
+        "pval_cols": None,
+        "method": "fisher",
+        "na": -1,
+        "padj": "none",
+    }
+    script = "file://../scripts/stats/MetaPvalue.R"
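
`MetaPvalue` defaults to `method = "fisher"` and offers `BH` among its `padj` choices. The R script behind the process is not included in this excerpt, so the following is only a standard-library sketch of the two textbook procedures: Fisher's combined probability test and Benjamini-Hochberg adjustment. The function names are hypothetical, and p-values are assumed to lie in (0, 1].

```python
import math


def fisher_combine(pvals):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df.

    For even degrees of freedom the upper tail has a closed form:
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvals)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvals[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(fisher_combine([0.05, 0.05]))
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

With a single p-value, `fisher_combine` returns that value unchanged, which is consistent with the docstring's note that one input file leads to adjustment only.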
