maradoner 0.17.0__tar.gz

@@ -0,0 +1,89 @@
1
+ Metadata-Version: 2.1
2
+ Name: maradoner
3
+ Version: 0.17.0
4
+ Summary: Variance-adjusted estimation of motif activities.
5
+ Home-page: https://github.com/autosome-ru/maradoner
6
+ Author: Georgy Meshcheryakov
7
+ Author-email: iam@georgy.top
8
+ Classifier: Programming Language :: Python :: 3.11
9
+ Classifier: Topic :: Scientific/Engineering
10
+ Classifier: Operating System :: OS Independent
11
+ Requires-Python: >=3.11
12
+ Description-Content-Type: text/markdown
13
+ Requires-Dist: pip>=24.0
14
+ Requires-Dist: typer>=0.13
15
+ Requires-Dist: numpy>=2.1
16
+ Requires-Dist: jax<0.5
17
+ Requires-Dist: jaxlib<0.5
18
+ Requires-Dist: matplotlib>=3.5
19
+ Requires-Dist: pandas>=2.2
20
+ Requires-Dist: scipy>=1.14
21
+ Requires-Dist: statsmodels>=0.14
22
+ Requires-Dist: datatable>=1.0.0
23
+ Requires-Dist: dill>=0.3.9
24
+ Requires-Dist: rich>=12.6.0
25
+ Requires-Dist: tqdm>=4.0
26
+ Requires-Dist: scikit-learn>=1.6
27
+ Requires-Dist: tables>=3.10
28
+
29
+
30
+ **MARADONER**
31
+
32
+ # MARADONER: Motif Activity Response Analysis Done Right
33
+
34
+ MARADONER is a tool for analyzing motif activities using promoter expression data. It provides a streamlined workflow to estimate parameters, predict deviations, and export results in tabular form.
35
+
36
+ ## Basic Workflow
37
+
38
+
39
+ A typical MARADONER analysis session involves running commands sequentially for a given project:
40
+
41
+ 1. **`create`**: Initialize the project. This step parses your input files (promoter expression, motif loadings, optional motif expression, and sample groupings), performs initial filtering, and sets up the project's internal data structures.
42
+ ```bash
43
+ # Example: Initialize a project named 'my_project'
44
+ maradoner create my_project path/to/expression.tsv path/to/loadings.tsv --sample-groups path/to/groups.json [other options...]
45
+ ```
46
+ * Input files are typically tabular (.tsv, .csv), potentially compressed.
47
+ * You only need to provide input data files at this stage.
48
+
49
+ 2. **`fit`**: Estimate the model's variance parameters and mean motif activities using the data prepared by `create`.
50
+ ```bash
51
+ maradoner fit my_project [options...]
52
+ ```
53
+
54
+ 3. **`predict`**: Estimate the *deviations* of motif activities from their means for each sample or group, based on the parameters estimated by `fit`.
55
+ ```bash
56
+ maradoner predict my_project [options...]
57
+ ```
58
+
59
+ 4. **`export`**: Save the final results, including estimated motif activities (mean + deviations), parameter estimates, goodness-of-fit statistics, and potentially statistical test results (like ANOVA) to a specified output folder.
60
+ ```bash
61
+ maradoner export my_project path/to/output_folder [options...]
62
+ ```
63
+
64
+ ## Other Useful Commands
65
+
66
+ * **`gof`**: After `fit`, calculate Goodness-of-Fit statistics (like Fraction of Variance Explained or Correlation) to evaluate how well the model components explain the observed expression data.
67
+ ```bash
68
+ maradoner gof my_project [options...]
69
+ ```
70
+ * **`select-motifs`**: If you provided multiple loading matrices in `create` (e.g., from different databases) with unique postfixes, this command helps select the single "best" variant for each motif based on statistical criteria. The output is a list of motif names intended to be used with the `--motif-filename` option in a subsequent `create` run.
71
+ ```bash
72
+ maradoner select-motifs my_project best_motifs.txt
73
+ # Then, potentially re-run create using the generated list:
74
+ # maradoner create my_project_filtered ... --motif-filename best_motifs.txt
75
+ ```
76
+ * **`generate`**: Create a synthetic dataset with known properties for testing or demonstration purposes.
77
+ ```bash
78
+ maradoner generate path/to/synthetic_data_output [options...]
79
+ ```
80
+
81
+ ## Getting Help
82
+
83
+ Each command has various options for customization. To see the full list of commands and their detailed options, use the `--help` flag:
84
+
85
+ ```bash
86
+ maradoner --help
87
+ maradoner create --help
88
+ maradoner fit --help
89
+ # and so on for each command
@@ -0,0 +1,61 @@
1
+
2
+ **MARADONER**
3
+
4
+ # MARADONER: Motif Activity Response Analysis Done Right
5
+
6
+ MARADONER is a tool for analyzing motif activities using promoter expression data. It provides a streamlined workflow to estimate parameters, predict deviations, and export results in tabular form.
7
+
8
+ ## Basic Workflow
9
+
10
+
11
+ A typical MARADONER analysis session involves running commands sequentially for a given project:
12
+
13
+ 1. **`create`**: Initialize the project. This step parses your input files (promoter expression, motif loadings, optional motif expression, and sample groupings), performs initial filtering, and sets up the project's internal data structures.
14
+ ```bash
15
+ # Example: Initialize a project named 'my_project'
16
+ maradoner create my_project path/to/expression.tsv path/to/loadings.tsv --sample-groups path/to/groups.json [other options...]
17
+ ```
18
+ * Input files are typically tabular (.tsv, .csv), potentially compressed.
19
+ * You only need to provide input data files at this stage.
20
+
21
+ 2. **`fit`**: Estimate the model's variance parameters and mean motif activities using the data prepared by `create`.
22
+ ```bash
23
+ maradoner fit my_project [options...]
24
+ ```
25
+
26
+ 3. **`predict`**: Estimate the *deviations* of motif activities from their means for each sample or group, based on the parameters estimated by `fit`.
27
+ ```bash
28
+ maradoner predict my_project [options...]
29
+ ```
30
+
31
+ 4. **`export`**: Save the final results, including estimated motif activities (mean + deviations), parameter estimates, goodness-of-fit statistics, and potentially statistical test results (like ANOVA) to a specified output folder.
32
+ ```bash
33
+ maradoner export my_project path/to/output_folder [options...]
34
+ ```
35
+
36
+ ## Other Useful Commands
37
+
38
+ * **`gof`**: After `fit`, calculate Goodness-of-Fit statistics (like Fraction of Variance Explained or Correlation) to evaluate how well the model components explain the observed expression data.
39
+ ```bash
40
+ maradoner gof my_project [options...]
41
+ ```
42
+ * **`select-motifs`**: If you provided multiple loading matrices in `create` (e.g., from different databases) with unique postfixes, this command helps select the single "best" variant for each motif based on statistical criteria. The output is a list of motif names intended to be used with the `--motif-filename` option in a subsequent `create` run.
43
+ ```bash
44
+ maradoner select-motifs my_project best_motifs.txt
45
+ # Then, potentially re-run create using the generated list:
46
+ # maradoner create my_project_filtered ... --motif-filename best_motifs.txt
47
+ ```
48
+ * **`generate`**: Create a synthetic dataset with known properties for testing or demonstration purposes.
49
+ ```bash
50
+ maradoner generate path/to/synthetic_data_output [options...]
51
+ ```
52
+
53
+ ## Getting Help
54
+
55
+ Each command has various options for customization. To see the full list of commands and their detailed options, use the `--help` flag:
56
+
57
+ ```bash
58
+ maradoner --help
59
+ maradoner create --help
60
+ maradoner fit --help
61
+ # and so on for each command
@@ -0,0 +1,36 @@
1
+ # -*- coding: utf-8 -*-
2
+ __version__ = '0.17.0'
3
+ import importlib
4
+
5
+
6
+ __min_reqs__ = [
7
+ 'pip>=24.0',
8
+ 'typer>=0.13',
9
+ 'numpy>=2.1',
10
+ 'jax<0.5',
11
+ 'jaxlib<0.5',
12
+ 'matplotlib>=3.5',
13
+ 'pandas>=2.2',
14
+ 'scipy>=1.14',
15
+ 'statsmodels>=0.14',
16
+ 'datatable>=1.0.0' ,
17
+ 'dill>=0.3.9',
18
+ 'rich>=12.6.0',
19
+ 'tqdm>=4.0',
20
+ 'scikit-learn>=1.6',
21
+ 'tables>=3.10'
22
+ ]
23
+
24
+ def versiontuple(v):
25
+ return tuple(map(int, (v.split("."))))
26
+
27
+ def check_packages():
28
+ for req in __min_reqs__:
29
+ try:
30
+             module, ver = req.split(' @')[0].split('>=')
31
+             ver = versiontuple(ver)
32
+             v = versiontuple(importlib.import_module(module).__version__)
33
+         except (AttributeError, ValueError, ImportError):
34
+ continue
35
+ if v < ver:
36
+ raise ImportError(f'Version of the {module} package should be at least {ver} (found: {v}).')
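The requirement strings in `__min_reqs__` mix lower bounds (`>=`) with upper bounds (`<`), and some distribution names (for example `scikit-learn`) differ from their import names. The following is a minimal sketch, not the package's own code, of one way to parse only the lower bounds and compare them against installed distributions via the standard-library `importlib.metadata`. The helper names `parse_min_requirement` and `find_outdated` are hypothetical and exist only for this illustration.

```python
from importlib import metadata


def versiontuple(v):
    # '2.1.0' -> (2, 1, 0); raises ValueError for non-numeric parts such as 'rc1'
    return tuple(map(int, v.split(".")))


def parse_min_requirement(req):
    """Return (name, minimum_version_tuple), or None if req has no '>=' bound."""
    spec = req.split(" @")[0].strip()  # drop an optional ' @ url' suffix
    if ">=" not in spec:
        return None  # e.g. 'jax<0.5' only carries an upper bound
    name, ver = spec.split(">=", 1)
    return name.strip(), versiontuple(ver.strip())


def find_outdated(reqs):
    """Return [(name, required, found)] for installed distributions below their bound."""
    outdated = []
    for req in reqs:
        parsed = parse_min_requirement(req)
        if parsed is None:
            continue
        name, required = parsed
        try:
            # Looks up the installed distribution by name, without importing it
            found = versiontuple(metadata.version(name))
        except (metadata.PackageNotFoundError, ValueError):
            continue  # not installed, or a version string that is not purely numeric
        if found < required:
            outdated.append((name, required, found))
    return outdated
```

Looking versions up by distribution name avoids importing each dependency and sidesteps the name mismatch; whether that trade-off suits this package is a judgment call for its maintainers.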
@@ -0,0 +1,158 @@
1
+ from .utils import logger_print, openers
2
+ from .dataset_filter import filter_lowexp
3
+ import scipy.stats as st
4
+ import datatable as dt
5
+ import pandas as pd
6
+ import numpy as np
7
+ import dill
8
+ import json
9
+ import os
10
+
11
+ def transform_loadings(df, mode: str, zero_cutoff=1e-9, prom_inds=None):
12
+ if not mode or mode == 'none':
13
+ df[df < zero_cutoff] = 0
14
+ df = (df - df.min(axis=None)) / (df.max(axis=None) - df.min(axis=None))
15
+ stds = df.std()
16
+ drop_inds = (stds == 0) | np.isnan(stds)
17
+ if prom_inds is not None:
18
+ df = df.loc[prom_inds, ~drop_inds]
19
+ else:
20
+ df = df.loc[:, ~drop_inds]
21
+ if mode == 'ecdf':
22
+ for j in range(len(df.columns)):
23
+ v = df.iloc[:, j]
24
+ df.iloc[:, j] = st.ecdf(v).cdf.evaluate(v)
25
+ elif mode == 'esf':
26
+ for j in range(len(df.columns)):
27
+ v = df.iloc[:, j]
28
+ v = st.ecdf(v).sf.evaluate(v)
29
+ t = np.unique(v)[1]
30
+ v[v < t] = t
31
+ df.iloc[:, j] = -np.log(v)
32
+ elif mode == 'none':
33
+ pass
34
+ elif mode:
35
+ raise Exception('Unknown transformation mode ' + str(mode))
36
+ return df
37
+
38
+ def create_project(project_name: str, promoter_expression_filename: str, loading_matrix_filenames: list[str],
39
+ motif_expression_filenames=None, loading_matrix_transformations=None, sample_groups=None, motif_postfixes=None,
40
+ promoter_filter_lowexp_cutoff=0.95, promoter_filter_plot_filename=None, promoter_filter_max=True,
41
+ motif_names_filename=None, compression='raw', dump=True, verbose=True):
42
+ if not os.path.isfile(promoter_expression_filename):
43
+ raise FileNotFoundError(f'Promoter expression file {promoter_expression_filename} not found.')
44
+ if type(loading_matrix_filenames) is str:
45
+ loading_matrix_filenames = [loading_matrix_filenames]
46
+ for mx_name in loading_matrix_filenames:
47
+ if not os.path.isfile(mx_name):
48
+ raise FileNotFoundError(f'Loading matrix file {mx_name} not found.')
49
+ if motif_expression_filenames:
50
+ if type(motif_expression_filenames) is str:
51
+ motif_expression_filenames = [motif_expression_filenames]
52
+ for exp_name in motif_expression_filenames:
53
+ if not os.path.isfile(exp_name):
54
+                     raise FileNotFoundError(f'Motif expression file {exp_name} not found.')
55
+ if type(sample_groups) is str:
56
+ with open(sample_groups, 'r') as f:
57
+ if sample_groups.endswith('.json'):
58
+ sample_groups = json.load(f)
59
+ else:
60
+ sample_groups = dict()
61
+ for line in f:
62
+ items = line.split()
63
+ sample_groups[items[0]] = items[1:]
64
+ if motif_names_filename is not None:
65
+ with open(motif_names_filename, 'r') as f:
66
+ motif_names = list()
67
+ for line in f:
68
+ line = line.strip().split()
69
+ for item in line:
70
+ if item:
71
+ motif_names.append(item)
72
+ else:
73
+ motif_names = None
74
+ logger_print('Reading dataset...', verbose)
75
+ promoter_expression = dt.fread(promoter_expression_filename).to_pandas()
76
+ promoter_expression = promoter_expression.set_index(promoter_expression.columns[0])
77
+
78
+ if sample_groups:
79
+ cols = set()
80
+ for vals in sample_groups.values():
81
+ cols.update(vals)
82
+ cols = list(cols)
83
+ promoter_expression = promoter_expression[cols]
84
+
85
+ proms = promoter_expression.index
86
+ sample_names = promoter_expression.columns
87
+ loading_matrices = [dt.fread(f).to_pandas() for f in loading_matrix_filenames]
88
+ loading_matrices = [df.set_index(df.columns[0]).loc[proms] for df in loading_matrices]
89
+ if loading_matrix_transformations is None or type(loading_matrix_transformations) is str:
90
+ loading_matrix_transformations = [loading_matrix_transformations] * len(loading_matrices)
91
+ else:
92
+ if len(loading_matrix_transformations) == 1:
93
+ loading_matrix_transformations = [loading_matrix_transformations[0]] * len(loading_matrices)
94
+ elif len(loading_matrix_transformations) != len(loading_matrices):
95
+ raise Exception(f'Total number of loading matrices is {len(loading_matrices)}, but the number of transformations is '
96
+ f'{len(loading_matrix_transformations)}.')
97
+
98
+ logger_print('Filtering promoters of low expression...', verbose)
99
+ inds, weights = filter_lowexp(promoter_expression, cutoff=promoter_filter_lowexp_cutoff, fit_plot_filename=promoter_filter_plot_filename,
100
+ max_mode=promoter_filter_max)
101
+ promoter_expression = promoter_expression.loc[inds]
102
+ proms = promoter_expression.index
103
+ loading_matrices = [transform_loadings(df, mode, prom_inds=inds) for df, mode in zip(loading_matrices, loading_matrix_transformations)]
104
+ if motif_postfixes is not None:
105
+ for mx, postfix in zip(loading_matrices, motif_postfixes):
106
+ mx.columns = [f'{c}_{postfix}' for c in mx.columns]
107
+ if motif_expression_filenames:
108
+ motif_expression = [dt.fread(f).to_pandas() for f in motif_expression_filenames]
109
+ motif_expression = [df.set_index(df.columns[0]) for df in motif_expression]
110
+ if motif_postfixes is not None:
111
+ for mx, postfix in zip(motif_expression, motif_postfixes):
112
+ mx.index = [f'{c}_{postfix}' for c in mx.index]
113
+ if sample_groups:
114
+ if len(set(motif_expression[0].columns) & set(sample_groups)) == len(sample_groups):
115
+ for i in range(len(motif_expression)):
116
+ mx = motif_expression[i]
117
+ for group, cols in sample_groups.items():
118
+ for col in cols:
119
+ mx[col] = mx[group]
120
+ mx = mx.drop(sorted(sample_groups), axis=1)
121
+ motif_expression = [df.loc[mx.columns, sample_names] for df, mx in zip(motif_expression, loading_matrices)]
122
+ motif_expression = pd.concat(motif_expression, axis=0)
123
+ else:
124
+ motif_expression = None
125
+ loading_matrices = pd.concat(loading_matrices, axis=1)
126
+ if motif_names is not None:
127
+ motif_names = list(set(motif_names) & set(loading_matrices.columns))
128
+ loading_matrices = loading_matrices[motif_names]
129
+ proms = list(promoter_expression.index)
130
+ sample_names = list(promoter_expression.columns)
131
+ motif_names = list(loading_matrices.columns)
132
+ loading_matrices = loading_matrices.values
133
+ promoter_expression = promoter_expression.values
134
+ if motif_expression is not None:
135
+ motif_expression = motif_expression.values
136
+ if not sample_groups:
137
+ sample_groups = {n: [i] for i, n in enumerate(sample_names)}
138
+ else:
139
+ sample_groups = {n: sorted([sample_names.index(i) for i in inds]) for n, inds in sample_groups.items()}
140
+ res = {'expression': promoter_expression,
141
+ 'loadings': loading_matrices,
142
+ 'motif_expression': motif_expression,
143
+ 'motif_postfixes': motif_postfixes,
144
+ 'promoter_names': proms,
145
+ 'sample_names': sample_names,
146
+ 'motif_names': motif_names,
147
+ 'weights': weights,
148
+ 'groups': sample_groups}
149
+ if dump:
150
+ folder = os.path.split(project_name)[0]
151
+ name = os.path.split(project_name)[-1]
152
+ for file in os.listdir(folder if folder else None):
153
+ if file.startswith(f'{name}.') and file.endswith(tuple(openers.keys())):
154
+ os.remove(os.path.join(folder, file))
155
+ logger_print('Saving project...', verbose)
156
+ with openers[compression](f'{project_name}.init.{compression}', 'wb') as f:
157
+ dill.dump(res, f)
158
+ return res
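`create_project` above accepts sample groups either as a JSON file or as a plain-text file with one group per line (group name first, then its sample names), and later converts the sample names into column indices. The sketch below mirrors that logic on in-memory strings so the two formats can be compared side by side. The function names `parse_sample_groups` and `groups_to_indices` are hypothetical, and skipping blank lines is an addition of this sketch rather than behavior of the code above.

```python
import json


def parse_sample_groups(text, is_json=False):
    """Parse a group definition into {group_name: [sample_name, ...]}."""
    if is_json:
        return json.loads(text)
    groups = {}
    for line in text.splitlines():
        items = line.split()
        if not items:
            continue  # assumption of this sketch: blank lines are ignored
        groups[items[0]] = items[1:]
    return groups


def groups_to_indices(groups, sample_names):
    """Map each group to the sorted column indices of its samples."""
    return {name: sorted(sample_names.index(s) for s in samples)
            for name, samples in groups.items()}


# The same grouping expressed in both formats (made-up sample names)
plain = "control s1 s2\ntreated s3 s4\n"
as_json = '{"control": ["s1", "s2"], "treated": ["s3", "s4"]}'

groups = parse_sample_groups(plain)
indices = groups_to_indices(groups, ["s3", "s1", "s4", "s2"])
```

Here `indices` refers to positions in the expression table's column order, which is why the order of `sample_names` matters.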
@@ -0,0 +1,145 @@
1
+ import jax.numpy as jnp
2
+ from jax import jit, grad
3
+ from jax.scipy.stats import norm
4
+ from jax.scipy.special import logsumexp, logit, expit
5
+ import pandas as pd
6
+ import numpy as np
7
+ from scipy.optimize import minimize
8
+ from functools import partial
9
+ from sklearn.mixture import GaussianMixture
10
+
11
+ def compute_leftmost_probability(Y):
12
+ Y = Y.reshape(-1, 1)
13
+ gmm = GaussianMixture(n_components=2, random_state=0)
14
+ gmm.fit(Y)
15
+
16
+ means = gmm.means_.flatten()
17
+ leftmost_component_index = np.argmin(means)
18
+ probas = gmm.predict_proba(Y)
19
+ leftmost_probs = probas[:, leftmost_component_index]
20
+
21
+ return leftmost_probs, gmm
22
+
23
+ def normax_logpdf(x: jnp.ndarray, mu: float, sigma: float, n: int):
24
+ x = (x - mu) / sigma
25
+ return jnp.log(n) - jnp.log(sigma) + norm.logpdf(x) + (n - 1) * norm.logcdf(x)
26
+
27
+ def logmixture(x: jnp.ndarray, mus: jnp.ndarray, sigmas: jnp.ndarray, w: float, n: int):
28
+ logpdf1 = normax_logpdf(x, mus[0], sigmas[0], n)
29
+ logpdf2 = normax_logpdf(x, mus[1], sigmas[1], n)
30
+ w = jnp.array([w, 1 - w]).reshape(-1,1)
31
+ logpdf = jnp.array([logpdf1, logpdf2])
32
+ return logsumexp(logpdf, b=w, axis=0)
33
+
34
+
35
+ def transform(params, forward=True):
36
+ mu = params[:2]
37
+ sigma = params[2:4]
38
+ w = params[-1:]
39
+ if forward:
40
+ sigma = sigma ** 2
41
+ w = expit(w)
42
+ else:
43
+ sigma = sigma ** 0.5
44
+ w = logit(w)
45
+ return jnp.concatenate([mu, sigma, w])
46
+
47
+ def loglik(params: jnp.ndarray, x: jnp.ndarray, n: int):
48
+ params = transform(params)
49
+ mu = params[:2]
50
+ sigma = params[2:4]
51
+ w = params[-1]
52
+ return -logmixture(x, mu, sigma, w, n).sum()
53
+
54
+ def filter_lowexp(expression: pd.DataFrame, cutoff=0.95, max_mode=True,
55
+ fit_plot_filename=None, plot_dpi=200):
56
+ expression = (expression - expression.mean()) / expression.std()
57
+ if not max_mode:
58
+ expression = expression.mean(axis=1).values
59
+ probs, gmm = compute_leftmost_probability(expression)
60
+ inds = probs < (1-cutoff)
61
+ if fit_plot_filename:
62
+ import matplotlib.pyplot as plt
63
+ from matplotlib.collections import LineCollection
64
+ import seaborn as sns
65
+ x = np.array(sorted(expression))
66
+         pdf = np.exp(gmm.score_samples(x[:, None]))
67
+ points = np.array([x, pdf]).T.reshape(-1, 1, 2)
68
+ segments = np.concatenate([points[:-1], points[1:]], axis=1)
69
+ plt.figure(dpi=plot_dpi, )
70
+ sns.histplot(expression, stat='density', color='grey')
71
+ lc = LineCollection(segments, cmap='winter')
72
+         lc.set_array(probs[np.argsort(expression)])
73
+ lc.set_linewidth(3)
74
+ line = plt.gca().add_collection(lc)
75
+ plt.colorbar(line)
76
+ plt.xlabel('Standardized expression')
77
+ plt.tight_layout()
78
+ plt.savefig(fit_plot_filename)
79
+ return inds, probs
80
+
81
+ expression_max = expression.max(axis=1).values
82
+
83
+ mu = [-1.0, 0.0]
84
+ sigmas = [1.0, 1.0]
85
+ w = [0.5]
86
+ x0 = jnp.array(mu + sigmas + w)
87
+ x0 = transform(x0, False)
88
+ fun = jit(partial(loglik, x=expression_max, n=expression.shape[1]))
89
+ jac = jit(grad(fun))
90
+ res = minimize(fun, x0, jac=jac)
91
+
92
+ params = transform(res.x)
93
+ mu = params[:2]
94
+ sigma = params[2:4]
95
+ w = params[-1]
96
+
97
+ mode1 = minimize(lambda x: -normax_logpdf(x, mu[0], sigma[0], n=expression.shape[1]), x0=[0.0]).x
98
+ mode2 = minimize(lambda x: -normax_logpdf(x, mu[1], sigma[1], n=expression.shape[1]), x0=[0.0]).x
99
+ if mode1 > mode2:
100
+ mu = mu[::-1]
101
+ sigma = sigma[::-1]
102
+ w = 1 - w
103
+
104
+ inds = np.argsort(expression_max)
105
+ inds_inv = np.empty_like(inds, dtype=int)
106
+ inds_inv[inds] = np.arange(len(inds))
107
+ x = expression_max[inds]
108
+ logpdf1 = normax_logpdf(x, mu[0], sigma[0], n=expression.shape[1])
109
+ logpdf2 = normax_logpdf(x, mu[1], sigma[1], n=expression.shape[1])
110
+ pdf1 = jnp.exp(logpdf1)
111
+ pdf2 = jnp.exp(logpdf2)
112
+ ws = np.array(pdf1 / ((w * pdf1 + (1-w)*pdf2)) * w)
113
+ if x[ws >= 0.5].mean() < x[ws < 0.5].mean():
114
+ ws = 1 - ws
115
+ j = np.argmax(ws)
116
+ l = np.argmin(ws)
117
+ ws[j:] = 1.0
118
+ ws[:l] = 0.0
119
+ for k in range(len(ws)):
120
+ if ws[k] >= 1.0-cutoff:
121
+ break
122
+ if fit_plot_filename:
123
+ import matplotlib.pyplot as plt
124
+ from matplotlib.collections import LineCollection
125
+ import seaborn as sns
126
+ pdf = jnp.exp(logmixture(x, mu, sigma, w, n=expression.shape[1]))
127
+ points = np.array([x, pdf]).T.reshape(-1, 1, 2)
128
+ segments = np.concatenate([points[:-1], points[1:]], axis=1)
129
+ plt.figure(dpi=plot_dpi, )
130
+ sns.histplot(expression_max, stat='density', color='grey')
131
+ lc = LineCollection(segments, cmap='winter')
132
+ lc.set_array(ws)
133
+ lc.set_linewidth(3)
134
+ line = plt.gca().add_collection(lc)
135
+ plt.colorbar(line)
136
+ plt.xlabel('Standardized expression')
137
+ plt.tight_layout()
138
+ plt.savefig(fit_plot_filename)
139
+ ws = ws[inds_inv]
140
+ inds = np.ones(len(expression), dtype=bool)
141
+ inds[:k] = False
142
+ # print(inds)
143
+ # inds[:] = 1
144
+ inds = inds[inds_inv]
145
+ return inds, ws
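`normax_logpdf` above is the log-density of the maximum of `n` independent normal draws: with `z = (x - mu) / sigma`, the density is `(n / sigma) * phi(z) * Phi(z) ** (n - 1)`, where `phi` and `Phi` are the standard normal PDF and CDF. The following standard-library sketch re-implements the same formula for scalar inputs and checks numerically that it integrates to one. It is an illustration only: the helper `integrate_pdf` and the chosen parameter values are assumptions of this example, and the package itself uses the JAX version shown above.

```python
import math


def normax_logpdf(x, mu, sigma, n):
    """Log-density of the maximum of n i.i.d. N(mu, sigma^2) draws (scalar x)."""
    z = (x - mu) / sigma
    log_phi = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)  # standard normal log-PDF
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))        # standard normal CDF
    return math.log(n) - math.log(sigma) + log_phi + (n - 1) * math.log(cdf)


def integrate_pdf(mu, sigma, n, lo=-6.0, hi=10.0, steps=20000):
    """Trapezoidal integral of the density over [mu + lo*sigma, mu + hi*sigma]."""
    a, b = mu + lo * sigma, mu + hi * sigma
    h = (b - a) / steps
    ys = [math.exp(normax_logpdf(a + i * h, mu, sigma, n)) for i in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


# Arbitrary illustrative parameters
total = integrate_pdf(mu=0.5, sigma=2.0, n=5)
```

With `n = 1` the last term vanishes and the expression reduces to the ordinary normal log-density, which is a quick sanity check on the formula.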