maradoner 0.9__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


maradoner-0.9/PKG-INFO ADDED
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.2
2
+ Name: maradoner
3
+ Version: 0.9
4
+ Summary: Variance-adjusted estimation of motif activities.
5
+ Home-page: https://github.com/autosome-ru/nemara
6
+ Author: Georgy Meshcheryakov
7
+ Author-email: iam@georgy.top
8
+ Classifier: Programming Language :: Python :: 3.9
9
+ Classifier: Programming Language :: Python :: 3.10
10
+ Classifier: Programming Language :: Python :: 3.11
11
+ Classifier: Programming Language :: Python :: 3.12
12
+ Classifier: Topic :: Scientific/Engineering
13
+ Classifier: Operating System :: OS Independent
14
+ Requires-Python: >=3.9
15
+ Description-Content-Type: text/markdown
16
+ Requires-Dist: pip>=24.0
17
+ Requires-Dist: typer>=0.13
18
+ Requires-Dist: numpy>=2.1
19
+ Requires-Dist: jax<0.5
20
+ Requires-Dist: jaxlib<0.5
21
+ Requires-Dist: matplotlib>=3.5
22
+ Requires-Dist: pandas>=2.2
23
+ Requires-Dist: scipy>=1.14
24
+ Requires-Dist: statsmodels>=0.14
25
+ Requires-Dist: datatable>=1.0.0
26
+ Requires-Dist: dill>=0.3.9
27
+ Requires-Dist: rich>=12.6.0
28
+ Dynamic: author
29
+ Dynamic: author-email
30
+ Dynamic: classifier
31
+ Dynamic: description
32
+ Dynamic: description-content-type
33
+ Dynamic: home-page
34
+ Dynamic: requires-dist
35
+ Dynamic: requires-python
36
+ Dynamic: summary
37
+
38
+
39
+ **MARADONER**
40
+
41
+ # MARADONER: Motif Activity Response Analysis Done Right
42
+
43
+ MARADONER is a tool for analyzing motif activities using promoter expression data. It provides a streamlined workflow to estimate parameters, predict deviations, and export results in a tabular form.
44
+
45
+ ## Basic Workflow
46
+
47
+
48
+ A typical MARADONER analysis session involves running commands sequentially for a given project:
49
+
50
+ 1. **`create`**: Initialize the project. This step parses your input files (promoter expression, motif loadings, optional motif expression, and sample groupings), performs initial filtering, and sets up the project's internal data structures.
51
+ ```bash
52
+ # Example: Initialize a project named 'my_project'
53
+ maradoner create my_project path/to/expression.tsv path/to/loadings.tsv --sample-groups path/to/groups.json [other options...]
54
+ ```
55
+ * Input files are typically tabular (.tsv, .csv), potentially compressed.
56
+ * You only need to provide input data files at this stage.
57
+
58
+ 2. **`fit`**: Estimate the model's variance parameters and mean motif activities using the data prepared by `create`.
59
+ ```bash
60
+ maradoner fit my_project [options...]
61
+ ```
62
+
63
+ 3. **`predict`**: Estimate the *deviations* of motif activities from their means for each sample or group, based on the parameters estimated by `fit`.
64
+ ```bash
65
+ maradoner predict my_project [options...]
66
+ ```
67
+
68
+ 4. **`export`**: Save the final results, including estimated motif activities (mean + deviations), parameter estimates, goodness-of-fit statistics, and potentially statistical test results (like ANOVA) to a specified output folder.
69
+ ```bash
70
+ maradoner export my_project path/to/output_folder [options...]
71
+ ```
72
+
73
+ ## Other Useful Commands
74
+
75
+ * **`gof`**: After `fit`, calculate Goodness-of-Fit statistics (like Fraction of Variance Explained or Correlation) to evaluate how well the model components explain the observed expression data.
76
+ ```bash
77
+ maradoner gof my_project [options...]
78
+ ```
79
+ * **`select-motifs`**: If you provided multiple loading matrices in `create` (e.g., from different databases) with unique postfixes, this command helps select the single "best" variant for each motif based on statistical criteria. The output is a list of motif names intended to be used with the `--motif-filename` option in a subsequent `create` run.
80
+ ```bash
81
+ maradoner select-motifs my_project best_motifs.txt
82
+ # Then, potentially re-run create using the generated list:
83
+ # maradoner create my_project_filtered ... --motif-filename best_motifs.txt
84
+ ```
85
+ * **`generate`**: Create a synthetic dataset with known properties for testing or demonstration purposes.
86
+ ```bash
87
+ maradoner generate path/to/synthetic_data_output [options...]
88
+ ```
89
+
90
+ ## Getting Help
91
+
92
+ Each command has various options for customization. To see the full list of commands and their detailed options, use the `--help` flag:
93
+
94
+ ```bash
95
+ maradoner --help
96
+ maradoner create --help
97
+ maradoner fit --help
98
+ # and so on for each command
@@ -0,0 +1,61 @@
1
+
2
+ **MARADONER**
3
+
4
+ # MARADONER: Motif Activity Response Analysis Done Right
5
+
6
+ MARADONER is a tool for analyzing motif activities using promoter expression data. It provides a streamlined workflow to estimate parameters, predict deviations, and export results in a tabular form.
7
+
8
+ ## Basic Workflow
9
+
10
+
11
+ A typical MARADONER analysis session involves running commands sequentially for a given project:
12
+
13
+ 1. **`create`**: Initialize the project. This step parses your input files (promoter expression, motif loadings, optional motif expression, and sample groupings), performs initial filtering, and sets up the project's internal data structures.
14
+ ```bash
15
+ # Example: Initialize a project named 'my_project'
16
+ maradoner create my_project path/to/expression.tsv path/to/loadings.tsv --sample-groups path/to/groups.json [other options...]
17
+ ```
18
+ * Input files are typically tabular (.tsv, .csv), potentially compressed.
19
+ * You only need to provide input data files at this stage.
20
+
21
+ 2. **`fit`**: Estimate the model's variance parameters and mean motif activities using the data prepared by `create`.
22
+ ```bash
23
+ maradoner fit my_project [options...]
24
+ ```
25
+
26
+ 3. **`predict`**: Estimate the *deviations* of motif activities from their means for each sample or group, based on the parameters estimated by `fit`.
27
+ ```bash
28
+ maradoner predict my_project [options...]
29
+ ```
30
+
31
+ 4. **`export`**: Save the final results, including estimated motif activities (mean + deviations), parameter estimates, goodness-of-fit statistics, and potentially statistical test results (like ANOVA) to a specified output folder.
32
+ ```bash
33
+ maradoner export my_project path/to/output_folder [options...]
34
+ ```
35
+
36
+ ## Other Useful Commands
37
+
38
+ * **`gof`**: After `fit`, calculate Goodness-of-Fit statistics (like Fraction of Variance Explained or Correlation) to evaluate how well the model components explain the observed expression data.
39
+ ```bash
40
+ maradoner gof my_project [options...]
41
+ ```
42
+ * **`select-motifs`**: If you provided multiple loading matrices in `create` (e.g., from different databases) with unique postfixes, this command helps select the single "best" variant for each motif based on statistical criteria. The output is a list of motif names intended to be used with the `--motif-filename` option in a subsequent `create` run.
43
+ ```bash
44
+ maradoner select-motifs my_project best_motifs.txt
45
+ # Then, potentially re-run create using the generated list:
46
+ # maradoner create my_project_filtered ... --motif-filename best_motifs.txt
47
+ ```
48
+ * **`generate`**: Create a synthetic dataset with known properties for testing or demonstration purposes.
49
+ ```bash
50
+ maradoner generate path/to/synthetic_data_output [options...]
51
+ ```
52
+
53
+ ## Getting Help
54
+
55
+ Each command has various options for customization. To see the full list of commands and their detailed options, use the `--help` flag:
56
+
57
+ ```bash
58
+ maradoner --help
59
+ maradoner create --help
60
+ maradoner fit --help
61
+ # and so on for each command
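The four-step workflow described in the README can also be driven from a script. The sketch below only assembles the documented command lines as argument lists. The helper name `workflow_commands` and all file paths are placeholders, not part of MARADONER, and nothing is executed unless each list is passed to `subprocess.run`.

```python
def workflow_commands(project, expression, loadings, output_dir, sample_groups=None):
    """Return the create/fit/predict/export commands as argument lists."""
    create = ["maradoner", "create", project, expression, loadings]
    if sample_groups is not None:
        # --sample-groups is the only option taken from the README example.
        create += ["--sample-groups", sample_groups]
    return [
        create,
        ["maradoner", "fit", project],
        ["maradoner", "predict", project],
        ["maradoner", "export", project, output_dir],
    ]


commands = workflow_commands(
    "my_project",
    "path/to/expression.tsv",
    "path/to/loadings.tsv",
    "path/to/output_folder",
    sample_groups="path/to/groups.json",
)
for argv in commands:
    print(" ".join(argv))
```

Keeping each command as a list rather than a single string avoids shell quoting problems if a path contains spaces.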
@@ -0,0 +1,33 @@
+ # -*- coding: utf-8 -*-
+ __version__ = '0.9'
+ import importlib
+
+
+ __min_reqs__ = [
+     'pip>=24.0',
+     'typer>=0.13',
+     'numpy>=2.1',
+     'jax<0.5',
+     'jaxlib<0.5',
+     'matplotlib>=3.5',
+     'pandas>=2.2',
+     'scipy>=1.14',
+     'statsmodels>=0.14',
+     'datatable>=1.0.0',
+     'dill>=0.3.9',
+     'rich>=12.6.0',
+ ]
+
+ def versiontuple(v):
+     return tuple(map(int, (v.split("."))))
+
+ def check_packages():
+     for req in __min_reqs__:
+         try:
+             module, ver = req.split(' @')[0].split('>=')  # str.split returns a list, so take the part before ' @' first
+             ver = versiontuple(ver)
+             v = versiontuple(importlib.import_module(module).__version__)
+         except (AttributeError, ValueError):
+             continue
+         if v < ver:
+             raise ImportError(f'Version of the {module} package should be at least {ver} (found: {v}).')
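The entries in `__min_reqs__` mix lower bounds (`>=`) with upper bounds (`<`), so a version check has to decide which entries it can compare. Here is a minimal sketch of splitting one entry into a module name and a comparable version tuple, assuming purely numeric versions such as `2.1`. `parse_min_requirement` is a hypothetical helper for illustration, not a function of the package.

```python
def versiontuple(v):
    """Turn '2.1' into (2, 1) so versions compare numerically, not as text."""
    return tuple(map(int, v.split(".")))


def parse_min_requirement(req):
    """Return (module, minimum version tuple), or None if `req` has no '>=' bound."""
    spec = req.split(" @")[0]  # drop a trailing ' @ url' part, if present
    name, sep, ver = spec.partition(">=")
    if not sep:
        return None  # e.g. 'jax<0.5' only has an upper bound
    return name.strip(), versiontuple(ver.strip())


print(parse_min_requirement("numpy>=2.1"))         # ('numpy', (2, 1))
print(parse_min_requirement("jax<0.5"))            # None
print(versiontuple("2.10") > versiontuple("2.9"))  # True
```

Comparing tuples of integers is what makes `2.10` sort after `2.9`; a plain string comparison would get this wrong. Versions with non-numeric parts (for example release candidates) would raise `ValueError` in `versiontuple` and need separate handling.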
@@ -0,0 +1,148 @@
1
+ from .utils import logger_print, openers
2
+ from .dataset_filter import filter_lowexp
3
+ import scipy.stats as st
4
+ import datatable as dt
5
+ import pandas as pd
6
+ import numpy as np
7
+ import dill
8
+ import json
9
+ import os
10
+
11
+ def transform_loadings(df, mode: str, zero_cutoff=1e-9, prom_inds=None):
12
+ if not mode or mode == 'none':
13
+ df[df < zero_cutoff] = 0
14
+ df = (df - df.min(axis=None)) / (df.max(axis=None) - df.min(axis=None))
15
+ stds = df.std()
16
+ drop_inds = (stds < 1e-16) | np.isnan(stds)
17
+ if prom_inds is not None:
18
+ df = df.loc[prom_inds, ~drop_inds]
19
+ else:
20
+ df = df.loc[:, ~drop_inds]
21
+ if mode == 'ecdf':
22
+ for j in range(len(df.columns)):
23
+ v = df.iloc[:, j]
24
+ df.iloc[:, j] = st.ecdf(v).cdf.evaluate(v)
25
+ elif mode == 'esf':
26
+ for j in range(len(df.columns)):
27
+ v = df.iloc[:, j]
28
+ v = st.ecdf(v).sf.evaluate(v)
29
+ t = np.unique(v)[1]
30
+ v[v < t] = t
31
+ df.iloc[:, j] = -np.log(v)
32
+ elif mode == 'none':
33
+ pass
34
+ elif mode:
35
+ raise Exception('Unknown transformation mode ' + str(mode))
36
+ return df
37
+
38
+ def create_project(project_name: str, promoter_expression_filename: str, loading_matrix_filenames: list[str],
39
+ motif_expression_filenames=None, loading_matrix_transformations=None, sample_groups=None, motif_postfixes=None,
40
+ promoter_filter_lowexp_cutoff=0.95, promoter_filter_plot_filename=None,
41
+ motif_names_filename=None, compression='raw', dump=True, verbose=True):
42
+ if not os.path.isfile(promoter_expression_filename):
43
+ raise FileNotFoundError(f'Promoter expression file {promoter_expression_filename} not found.')
44
+ if type(loading_matrix_filenames) is str:
45
+ loading_matrix_filenames = [loading_matrix_filenames]
46
+ for mx_name in loading_matrix_filenames:
47
+ if not os.path.isfile(mx_name):
48
+ raise FileNotFoundError(f'Loading matrix file {mx_name} not found.')
49
+ if motif_expression_filenames:
50
+ if type(motif_expression_filenames) is str:
51
+ motif_expression_filenames = [motif_expression_filenames]
52
+ for exp_name in motif_expression_filenames:
53
+ if not os.path.isfile(exp_name):
54
+ raise FileNotFoundError(f'Motif expression file {exp_name} not found.')
55
+ if type(sample_groups) is str:
56
+ with open(sample_groups, 'r') as f:
57
+ if sample_groups.endswith('.json'):
58
+ sample_groups = json.load(f)
59
+ else:
60
+ sample_groups = dict()
61
+ for line in f:
62
+ items = line.split()
63
+ sample_groups[items[0]] = items[1:]
64
+ if motif_names_filename is not None:
65
+ with open(motif_names_filename, 'r') as f:
66
+ motif_names = list()
67
+ for line in f:
68
+ line = line.strip().split()
69
+ for item in line:
70
+ if item:
71
+ motif_names.append(item)
72
+ else:
73
+ motif_names = None
74
+ logger_print('Reading dataset...', verbose)
75
+ promoter_expression = dt.fread(promoter_expression_filename).to_pandas()
76
+ promoter_expression = promoter_expression.set_index(promoter_expression.columns[0])
77
+ proms = promoter_expression.index
78
+ sample_names = promoter_expression.columns
79
+ loading_matrices = [dt.fread(f).to_pandas() for f in loading_matrix_filenames]
80
+ loading_matrices = [df.set_index(df.columns[0]).loc[proms] for df in loading_matrices]
81
+ if loading_matrix_transformations is None or type(loading_matrix_transformations) is str:
82
+ loading_matrix_transformations = [loading_matrix_transformations] * len(loading_matrices)
83
+ else:
84
+ if len(loading_matrix_transformations) == 1:
85
+ loading_matrix_transformations = [loading_matrix_transformations[0]] * len(loading_matrices)
86
+ elif len(loading_matrix_transformations) != len(loading_matrices):
87
+ raise Exception(f'Total number of loading matrices is {len(loading_matrices)}, but the number of transformations is '
88
+ f'{len(loading_matrix_transformations)}.')
89
+
90
+ logger_print('Filtering promoters of low expression...', verbose)
91
+ inds, weights = filter_lowexp(promoter_expression, cutoff=promoter_filter_lowexp_cutoff, fit_plot_filename=promoter_filter_plot_filename)
92
+ promoter_expression = promoter_expression.loc[inds]
93
+ proms = promoter_expression.index
94
+ loading_matrices = [transform_loadings(df, mode, prom_inds=inds) for df, mode in zip(loading_matrices, loading_matrix_transformations)]
95
+ if motif_postfixes is not None:
96
+ for mx, postfix in zip(loading_matrices, motif_postfixes):
97
+ mx.columns = [f'{c}_{postfix}' for c in mx.columns]
98
+ if motif_expression_filenames:
99
+ motif_expression = [dt.fread(f).to_pandas() for f in motif_expression_filenames]
100
+ motif_expression = [df.set_index(df.columns[0]) for df in motif_expression]
101
+ if motif_postfixes is not None:
102
+ for mx, postfix in zip(motif_expression, motif_postfixes):
103
+ mx.index = [f'{c}_{postfix}' for c in mx.index]
104
+ if sample_groups:
105
+ if len(set(motif_expression[0].columns) & set(sample_groups)) == len(sample_groups):
106
+ for i in range(len(motif_expression)):
107
+ mx = motif_expression[i]
108
+ for group, cols in sample_groups.items():
109
+ for col in cols:
110
+ mx[col] = mx[group]
111
+ mx = mx.drop(sorted(sample_groups), axis=1)
112
+ motif_expression = [df.loc[mx.columns, sample_names] for df, mx in zip(motif_expression, loading_matrices)]
113
+ motif_expression = pd.concat(motif_expression, axis=0)
114
+ else:
115
+ motif_expression = None
116
+ loading_matrices = pd.concat(loading_matrices, axis=1)
117
+ if motif_names is not None:
118
+ loading_matrices = loading_matrices[motif_names]
119
+ proms = list(promoter_expression.index)
120
+ sample_names = list(promoter_expression.columns)
121
+ motif_names = list(loading_matrices.columns)
122
+ loading_matrices = loading_matrices.values
123
+ promoter_expression = promoter_expression.values
124
+ if motif_expression is not None:
125
+ motif_expression = motif_expression.values
126
+ if not sample_groups:
127
+ sample_groups = {n: [i] for i, n in enumerate(sample_names)}
128
+ else:
129
+ sample_groups = {n: sorted([sample_names.index(i) for i in inds]) for n, inds in sample_groups.items()}
130
+ res = {'expression': promoter_expression,
131
+ 'loadings': loading_matrices,
132
+ 'motif_expression': motif_expression,
133
+ 'motif_postfixes': motif_postfixes,
134
+ 'promoter_names': proms,
135
+ 'sample_names': sample_names,
136
+ 'motif_names': motif_names,
137
+ 'weights': weights,
138
+ 'groups': sample_groups}
139
+ if dump:
140
+ folder = os.path.split(project_name)[0]
141
+ name = os.path.split(project_name)[-1]
142
+ for file in os.listdir(folder if folder else None):
143
+ if file.startswith(f'{name}.') and file.endswith(tuple(openers.keys())):
144
+ os.remove(os.path.join(folder, file))
145
+ logger_print('Saving project...', verbose)
146
+ with openers[compression](f'{project_name}.init.{compression}', 'wb') as f:
147
+ dill.dump(res, f)
148
+ return res
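When `sample_groups` is a path that does not end in `.json`, `create_project` reads it as plain text: the first whitespace-separated token on each line is the group name and the remaining tokens are sample names. Later the sample names are replaced by sorted column positions. The sketch below reproduces those two steps in isolation. The function names are hypothetical, and skipping blank lines is an addition of this sketch, not something the original loop does.

```python
def parse_sample_groups(lines):
    """Map each group name (first token) to the sample names that follow it."""
    groups = {}
    for line in lines:
        items = line.split()
        if not items:  # skip blank lines (an addition in this sketch)
            continue
        groups[items[0]] = items[1:]
    return groups


def groups_to_indices(groups, sample_names):
    """Replace sample names with their sorted column positions."""
    return {
        group: sorted(sample_names.index(s) for s in names)
        for group, names in groups.items()
    }


sample_names = ["s1", "s2", "s3", "s4"]
text = "control s3 s1\ntreated s2 s4\n"
groups = parse_sample_groups(text.splitlines())
indices = groups_to_indices(groups, sample_names)
print(indices)  # {'control': [0, 2], 'treated': [1, 3]}
```

A sample name that is missing from the expression table makes `list.index` raise `ValueError`, so a mismatch between the groups file and the expression header fails early.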
@@ -0,0 +1,109 @@
1
+ import jax.numpy as jnp
2
+ from jax import jit, grad
3
+ from jax.scipy.stats import norm
4
+ from jax.scipy.special import logsumexp, logit, expit
5
+ import pandas as pd
6
+ import numpy as np
7
+ from scipy.optimize import minimize
8
+ from functools import partial
9
+
10
+
11
+ def normax_logpdf(x: jnp.ndarray, mu: float, sigma: float, n: int):
12
+ x = (x - mu) / sigma
13
+ return jnp.log(n) - jnp.log(sigma) + norm.logpdf(x) + (n - 1) * norm.logcdf(x)
14
+
15
+ def logmixture(x: jnp.ndarray, mus: jnp.ndarray, sigmas: jnp.ndarray, w: float, n: int):
16
+ logpdf1 = normax_logpdf(x, mus[0], sigmas[0], n)
17
+ logpdf2 = normax_logpdf(x, mus[1], sigmas[1], n)
18
+ w = jnp.array([w, 1 - w]).reshape(-1,1)
19
+ logpdf = jnp.array([logpdf1, logpdf2])
20
+ return logsumexp(logpdf, b=w, axis=0)
21
+
22
+
23
+ def transform(params, forward=True):
24
+ mu = params[:2]
25
+ sigma = params[2:4]
26
+ w = params[-1:]
27
+ if forward:
28
+ sigma = sigma ** 2
29
+ w = expit(w)
30
+ else:
31
+ sigma = sigma ** 0.5
32
+ w = logit(w)
33
+ return jnp.concatenate([mu, sigma, w])
34
+
35
+ def loglik(params: jnp.ndarray, x: jnp.ndarray, n: int):
36
+ params = transform(params)
37
+ mu = params[:2]
38
+ sigma = params[2:4]
39
+ w = params[-1]
40
+ return -logmixture(x, mu, sigma, w, n).sum()
41
+
42
+ def filter_lowexp(expression: pd.DataFrame, cutoff=0.95, fit_plot_filename=None, plot_dpi=200):
43
+ expression = (expression - expression.mean()) / expression.std()
44
+
45
+ expression_max = expression.max(axis=1).values
46
+
47
+ mu = [-1.0, 0.0]
48
+ sigmas = [1.0, 1.0]
49
+ w = [0.5]
50
+ x0 = jnp.array(mu + sigmas + w)
51
+ x0 = transform(x0, False)
52
+ fun = jit(partial(loglik, x=expression_max, n=expression.shape[1]))
53
+ jac = jit(grad(fun))
54
+ res = minimize(fun, x0, jac=jac)
55
+
56
+ params = transform(res.x)
57
+ mu = params[:2]
58
+ sigma = params[2:4]
59
+ w = params[-1]
60
+
61
+ mode1 = minimize(lambda x: -normax_logpdf(x, mu[0], sigma[0], n=expression.shape[1]), x0=[0.0]).x
62
+ mode2 = minimize(lambda x: -normax_logpdf(x, mu[1], sigma[1], n=expression.shape[1]), x0=[0.0]).x
63
+ if mode1 > mode2:
64
+ mu = mu[::-1]
65
+ sigma = sigma[::-1]
66
+ w = 1 - w
67
+
68
+ inds = np.argsort(expression_max)
69
+ inds_inv = np.empty_like(inds, dtype=int)
70
+ inds_inv[inds] = np.arange(len(inds))
71
+ x = expression_max[inds]
72
+ logpdf1 = normax_logpdf(x, mu[0], sigma[0], n=expression.shape[1])
73
+ logpdf2 = normax_logpdf(x, mu[1], sigma[1], n=expression.shape[1])
74
+ pdf1 = jnp.exp(logpdf1)
75
+ pdf2 = jnp.exp(logpdf2)
76
+ ws = np.array(pdf1 / ((w * pdf1 + (1-w)*pdf2)) * w)
77
+ if x[ws >= 0.5].mean() < x[ws < 0.5].mean():
78
+ ws = 1 - ws
79
+ j = np.argmax(ws)
80
+ l = np.argmin(ws)
81
+ ws[j:] = 1.0
82
+ ws[:l] = 0.0
83
+ for k in range(len(ws)):
84
+ if ws[k] >= 1.0-cutoff:
85
+ break
86
+ if fit_plot_filename:
87
+ import matplotlib.pyplot as plt
88
+ from matplotlib.collections import LineCollection
89
+ import seaborn as sns
90
+ pdf = jnp.exp(logmixture(x, mu, sigma, w, n=expression.shape[1]))
91
+ points = np.array([x, pdf]).T.reshape(-1, 1, 2)
92
+ segments = np.concatenate([points[:-1], points[1:]], axis=1)
93
+ plt.figure(dpi=plot_dpi, )
94
+ sns.histplot(expression_max, stat='density', color='grey')
95
+ lc = LineCollection(segments, cmap='winter')
96
+ lc.set_array(ws)
97
+ lc.set_linewidth(3)
98
+ line = plt.gca().add_collection(lc)
99
+ plt.colorbar(line)
100
+ plt.xlabel('Standardized expression')
101
+ plt.tight_layout()
102
+ plt.savefig(fit_plot_filename)
103
+ ws = ws[inds_inv]
104
+ inds = np.ones(len(expression), dtype=bool)
105
+ inds[:k] = False
106
+ # print(inds)
107
+ # inds[:] = 1
108
+ inds = inds[inds_inv]
109
+ return inds, ws
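`normax_logpdf` has the form of the log-density of the maximum of `n` independent normal variables: if each has density `f` and distribution function `F`, the maximum has density `n * f(x) * F(x)**(n - 1)`. The sketch below re-implements that formula with the standard library only (no JAX) and checks numerically that it integrates to about 1. The `norm_logcdf` helper is a naive version that is adequate only for moderate arguments, and `integrate` is a simple trapezoid rule written for this illustration.

```python
import math


def norm_logpdf(z):
    """Log-density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(z):
    """Log of the standard normal CDF, using Phi(z) = erfc(-z / sqrt(2)) / 2."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def normax_logpdf(x, mu, sigma, n):
    """Log-density of the maximum of n i.i.d. N(mu, sigma^2) variables."""
    z = (x - mu) / sigma
    return math.log(n) - math.log(sigma) + norm_logpdf(z) + (n - 1) * norm_logcdf(z)


def integrate(f, a, b, steps=20000):
    """Composite trapezoid rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h


area = integrate(lambda x: math.exp(normax_logpdf(x, 0.0, 1.0, 5)), -10.0, 10.0)
print(area)  # close to 1 for a valid density
```

With `n = 1` the extra terms vanish and the function reduces to the ordinary normal log-density, which is a quick sanity check on the formula.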