pg-sui 1.6.14.dev9__py3-none-any.whl → 1.7.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. pg_sui-1.7.0.dist-info/METADATA +288 -0
  2. {pg_sui-1.6.14.dev9.dist-info → pg_sui-1.7.0.dist-info}/RECORD +29 -33
  3. pgsui/__init__.py +0 -8
  4. pgsui/_version.py +2 -2
  5. pgsui/cli.py +591 -126
  6. pgsui/data_processing/config.py +1 -2
  7. pgsui/data_processing/containers.py +218 -533
  8. pgsui/data_processing/transformers.py +44 -20
  9. pgsui/impute/deterministic/imputers/mode.py +475 -182
  10. pgsui/impute/deterministic/imputers/ref_allele.py +454 -147
  11. pgsui/impute/supervised/imputers/hist_gradient_boosting.py +4 -3
  12. pgsui/impute/supervised/imputers/random_forest.py +3 -2
  13. pgsui/impute/unsupervised/base.py +1268 -530
  14. pgsui/impute/unsupervised/callbacks.py +28 -33
  15. pgsui/impute/unsupervised/imputers/autoencoder.py +869 -764
  16. pgsui/impute/unsupervised/imputers/vae.py +928 -696
  17. pgsui/impute/unsupervised/loss_functions.py +156 -202
  18. pgsui/impute/unsupervised/models/autoencoder_model.py +7 -49
  19. pgsui/impute/unsupervised/models/vae_model.py +40 -221
  20. pgsui/impute/unsupervised/nn_scorers.py +53 -13
  21. pgsui/utils/classification_viz.py +240 -97
  22. pgsui/utils/misc.py +201 -3
  23. pgsui/utils/plotting.py +73 -58
  24. pgsui/utils/pretty_metrics.py +2 -6
  25. pgsui/utils/scorers.py +39 -0
  26. pg_sui-1.6.14.dev9.dist-info/METADATA +0 -344
  27. pgsui/impute/unsupervised/imputers/nlpca.py +0 -1554
  28. pgsui/impute/unsupervised/imputers/ubp.py +0 -1575
  29. pgsui/impute/unsupervised/models/nlpca_model.py +0 -206
  30. pgsui/impute/unsupervised/models/ubp_model.py +0 -200
  31. {pg_sui-1.6.14.dev9.dist-info → pg_sui-1.7.0.dist-info}/WHEEL +0 -0
  32. {pg_sui-1.6.14.dev9.dist-info → pg_sui-1.7.0.dist-info}/entry_points.txt +0 -0
  33. {pg_sui-1.6.14.dev9.dist-info → pg_sui-1.7.0.dist-info}/licenses/LICENSE +0 -0
  34. {pg_sui-1.6.14.dev9.dist-info → pg_sui-1.7.0.dist-info}/top_level.txt +0 -0
@@ -1,344 +0,0 @@
- Metadata-Version: 2.4
- Name: pg-sui
- Version: 1.6.14.dev9
- Summary: Python machine and deep learning API to impute missing genotypes
- Author-email: "Drs. Bradley T. Martin and Tyler K. Chafin" <evobio721@gmail.com>
- Maintainer-email: "Dr. Bradley T. Martin" <evobio721@gmail.com>
- License: GNU General Public License v3 (GPLv3)
- Project-URL: Homepage, https://github.com/btmartin721/PG-SUI
- Project-URL: Documentation, https://pg-sui.readthedocs.io/en/latest/
- Project-URL: Source, https://github.com/btmartin721/PG-SUI.git
- Project-URL: BugTracker, https://github.com/btmartin721/PG-SUI/issues
- Keywords: impute,imputation,AI,deep learning,machine learning,neural network,vae,autoencoder,ubp,nlpca,population genetics,unsupervised,supervised,bioinformatics,snp,genomics,genotype,missing data,data analysis,data science,statistics,data visualization,python
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Development Status :: 4 - Beta
- Classifier: Environment :: Console
- Classifier: Intended Audience :: Science/Research
- Classifier: Intended Audience :: Developers
- Classifier: Intended Audience :: Education
- Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
- Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Classifier: Topic :: Scientific/Engineering :: Information Analysis
- Classifier: Topic :: Scientific/Engineering :: Visualization
- Classifier: Operating System :: MacOS
- Classifier: Operating System :: MacOS :: MacOS X
- Classifier: Operating System :: Unix
- Classifier: Operating System :: POSIX
- Classifier: Natural Language :: English
- Requires-Python: >=3.11
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: matplotlib
- Requires-Dist: numpy>=2.1
- Requires-Dist: pandas>=2.2.2
- Requires-Dist: scikit-learn>=1.4
- Requires-Dist: scipy
- Requires-Dist: seaborn
- Requires-Dist: torch
- Requires-Dist: tqdm
- Requires-Dist: toytree
- Requires-Dist: optuna
- Requires-Dist: rich
- Requires-Dist: rich[jupyter]
- Requires-Dist: snpio
- Provides-Extra: intel
- Requires-Dist: scikit-learn-intelex; extra == "intel"
- Provides-Extra: docs
- Requires-Dist: sphinx; extra == "docs"
- Requires-Dist: sphinx-rtd-theme; extra == "docs"
- Requires-Dist: sphinx_autodoc_typehints; extra == "docs"
- Requires-Dist: sphinxcontrib-napoleon; extra == "docs"
- Requires-Dist: sphinxcontrib-programoutput; extra == "docs"
- Provides-Extra: dev
- Requires-Dist: twine; extra == "dev"
- Requires-Dist: wheel; extra == "dev"
- Requires-Dist: pytest; extra == "dev"
- Requires-Dist: sphinx; extra == "dev"
- Requires-Dist: sphinx-rtd-theme; extra == "dev"
- Requires-Dist: sphinx-autodoc-typehints; extra == "dev"
- Requires-Dist: sphinxcontrib-napoleon; extra == "dev"
- Requires-Dist: sphinxcontrib-programoutput; extra == "dev"
- Requires-Dist: requests; extra == "dev"
- Provides-Extra: optional
- Requires-Dist: PyObjC; extra == "optional"
- Provides-Extra: gui
- Requires-Dist: fastapi>=0.110; extra == "gui"
- Requires-Dist: uvicorn[standard]>=0.23; extra == "gui"
- Dynamic: license-file
-
-
- <img src="https://github.com/btmartin721/PG-SUI/blob/master/img/pgsui-logo-faded.png" alt="PG-SUI Logo" width="50%" height="50%">
-
-
- # PG-SUI
-
- Population Genomic Supervised and Unsupervised Imputation.
-
- ## About PG-SUI
-
- PG-SUI is a Python 3 API that uses machine learning to impute missing values from population genomic SNP data. There are several supervised and unsupervised machine learning algorithms available to impute missing data, as well as some non-machine learning imputers that are useful.
-
- Below is some general information and a basic tutorial. For more detailed information, see our [API Documentation](https://pg-sui.readthedocs.io/en/latest/).
-
- ### Supervised Imputation Methods
-
- Supervised methods utilze the scikit-learn's IterativeImputer, which is based on the MICE (Multivariate Imputation by Chained Equations) algorithm ([1](#1)), and iterates over each SNP site (i.e., feature) while uses the N nearest neighbor features to inform the imputation. The number of nearest features can be adjusted by users. IterativeImputer currently works with any of the following scikit-learn classifiers:
-
- + K-Nearest Neighbors
- + Random Forest
- + XGBoost
-
- See the scikit-learn documentation (https://scikit-learn.org) for more information on IterativeImputer and each of the classifiers.
-
- ### Unsupervised Imputation Methods
-
- Unsupervised imputers include three custom neural network models:
-
- + Variational Autoencoder (VAE) ([2](#2))
- + Standard Autoencoder (SAE) ([3](#3))
- + Non-linear Principal Component Analysis (NLPCA) ([4](#4))
- + Unsupervised Backpropagation (UBP) ([5](#5))
-
- VAE models train themselves to reconstruct their input (i.e., the genotypes). To use VAE for imputation, the missing values are masked and the VAE model gets trained to reconstruct only on known values. Once the model is trained, it is then used to predict the missing values.
-
- SAE is a standard autoencoder that trains the input to predict itself. As with VAE, missing values are masked and the model gets trained only on known values. Predictions are then made on the missing values.
-
- NLPCA initializes random, reduced-dimensional input, then trains itself by using the known values (i.e., genotypes) as targets and refining the random input until it accurately predicts the genotype output. The trained model can then predict the missing values.
-
- UBP is an extension of NLPCA that runs over three phases. Phase 1 refines the randomly generated, reduced-dimensional input in a single layer perceptron neural network to obtain good initial input values. Phase 2 uses the refined reduced-dimensional input from phase 1 as input into a multi-layer perceptron (MLP), but in Phase 2 only the neural network weights are refined. Phase three uses an MLP to refine both the weights and the reduced-dimensional input. Once the model is trained, it can be used to predict the missing values.
-
- ### Non-Machine Learning Methods
-
- We also include several non-machine learning options for imputing missing data, including:
-
- + Per-population mode per SNP site
- + Global mode per SNP site
- + Using a phylogeny as input to inform the imputation
- + Matrix Factorization
-
- These four "simple" imputation methods can be used as standalone imputers, as the initial imputation strategy for IterativeImputer (at least one method is required to be chosen), and to validate the accuracy of both IterativeImputer and the neural network models.
-
- ## Installing PG-SUI
-
- The easiest way to install PG-SUI is to use pip:
-
- ```
- pip install pg-sui
- ```
-
- If you have an Intel CPU and want to use the sklearn-genetic-intelex package to speed up scikit-learn computations, you can do:
-
- ```
- pip install pg-sui[intel]
- ```
-
- ### Optional GUI (Electron)
-
- PG-SUI ships an Electron GUI wrapper around the Python CLI.
-
- 1. Install the Python-side extras (FastAPI/uvicorn helper) if you want to serve from Python:
- `pip install pg-sui[gui]`
- 2. Install Node.js (https://nodejs.org) and fetch the app dependencies once:
- `pgsui-gui-setup`
- 3. Launch the GUI:
- `pgsui-gui`
-
- The GUI shells out to the same CLI underneath, so presets/overrides and YAML configs behave identically.
-
- ## Manual Installation
-
- ### Dependencies
-
- + python >= 3.11
- + pandas
- + numpy
- + scipy
- + matplotlib
- + seaborn
- + plotly
- + kaleido
- + tqdm
- + toytree
- + scikit-learn
- + xgboost
- + snpio
- + optuna
-
- #### Installation troubleshooting
-
- ##### "use_2to3 is invalid" error
-
- Users running setuptools v58 may encounter this error during the last step of installation, using pip to install sklearn-genetic-opt:
-
- ```
- ERROR: Command errored out with exit status 1:
- command: /Users/tyler/miniforge3/envs/pg-sui/bin/python3.8 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/6x/t6g4kn711z5cxmc2_tvq0mlw0000gn/T/pip-install-6y5g_mhs/deap_1d32f65d60a44056bd7031f3aad44571/setup.py'"'"'; __file__='"'"'/private/var/folders/6x/t6g4kn711z5cxmc2_tvq0mlw0000gn/T/pip-install-6y5g_mhs/deap_1d32f65d60a44056bd7031f3aad44571/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/6x/t6g4kn711z5cxmc2_tvq0mlw0000gn/T/pip-pip-egg-info-7hg3hcq2
- cwd: /private/var/folders/6x/t6g4kn711z5cxmc2_tvq0mlw0000gn/T/pip-install-6y5g_mhs/deap_1d32f65d60a44056bd7031f3aad44571/
- Complete output (1 lines):
- error in deap setup command: use_2to3 is invalid.
- ```
-
- This occurs during the installation of DEAP, one of the dependencies for sklearn-genetic-opt. As a workaround, first downgrade setuptools, and then proceed with the installation as normal:
- ```
- pip install setuptools==57
- pip install sklearn-genetic-opt[all]
-
- ```
-
- ##### Mac ARM architecture
-
- PG-SUI has been tested on the new Mac M1 chips and is working fine, but some changes to the installation process were necessary as of 9-December-21. Installation was successful using the following:
-
- ```
- ### Install Miniforge3 instead of Miniconda3
- ### Download: https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
- bash ~/Downloads/Miniforge3-MacOSX-arm64.sh
-
- # Close and re-open terminal #
-
- #create and activate conda environment
- conda create -n pg-sui python
-
- #activate environment
- conda activate pg-sui
-
- #install packages
- conda install -c conda-forge matplotlib seaborn jupyterlab scikit-learn tqdm pandas numpy scipy xgboost lightgbm tensorflow keras sklearn-genetic-opt toytree
- conda install -c bioconda pyvolve
-
- #downgrade setuptools (may or may not be necessary)
- pip install setuptools==57
-
- #install sklearn-genetic-opt and mlflow
- pip install sklearn-genetic-opt mlflow
-
- ```
-
- Any other problems we run into testing on the Mac ARM architecture will be adjusted here. Note that the step installing scikit-learn-intelex was skipped here. PG-SUI will automatically detect the CPU architecture you are running, and forgo importing this package (which will only work on Intel processors)
-
- ## Input Data
-
- You can read your input files as a GenotypeData object from the [SNPio](https://snpio.readthedocs.io/en/latest/) package:
-
- ```
-
- # Import snpio. Automatically installed with pgsui when using pip.
- from snpio import GenotypeData
-
- # Read in PHYLIP, VCF, or STRUCTURE-formatted alignments.
- data = GenotypeData(
- filename="example_data/phylip_files/phylogen_nomx.u.snps.phy",
- popmapfile="example_data/popmaps/phylogen_nomx.popmap",
- force_popmap=True,
- filetype="auto",
- qmatrix_iqtree="example_data/trees/test.qmat",
- siterates_iqtree="example_data/trees/test.rate",
- guidetree="example_data/trees/test.tre",
- include_pops=["EA", "TT", "GU"], # Only include these populations. There's also an exclude_pops option that will exclude the provided populations.
- )
- ```
-
- ## Supported Imputation Methods
-
- There are numerous supported algorithms to impute missing data. Each one can be run by calling the corresponding class. You must provide a GenotypeData instance as the first positional argument.
-
- You can import all the supported methods with:
-
- ```
- from pgsui import *
- ```
-
- Or you can import them one at a time.
-
- ```
- from pgsui import ImputeVAE
- ```
-
- ### Supervised Imputers
-
- Various supervised imputation options are supported:
-
- ```
- # Supervised IterativeImputer classifiers
- knn = ImputeKNN(data) # K-Nearest Neighbors
- rf = ImputeRandomForest(data) # Random Forest or Extra Trees
- xgb = ImputeXGBoost(data) # XGBoost
- ```
-
- ### Non-machine learning methods
-
- Use phylogeny to inform imputation:
-
- ```
- phylo = ImputePhylo(data)
- ```
-
- Use by-population or global allele frequency to inform imputation
-
- ```
- pop_af = ImputeAlleleFreq(data, by_populations=True)
- global_af = ImputeAlleleFreq(data, by_populations=False)
- ref_af = ImputeRefAllele(data)
- ```
-
- Non-matrix factorization:
-
- ```
- mf = ImputeMF(*args) # Matrix factorization
- ```
-
- ### Unsupervised Neural Networks
-
- ``` python
- vae = ImputeVAE(data) # Variational autoencoder
- nlpca = ImputeNLPCA(data) # Nonlinear PCA
- ubp = ImputeUBP(data) # Unsupervised backpropagation
- sae = ImputeStandardAutoEncoder(data) # standard autoencoder
- ```
-
- ## Command-Line Interface
-
- Run the PG-SUI CLI with ``pg-sui`` (installed alongside the library). The CLI follows the same precedence model as the Python API:
-
- ``code defaults < preset (--preset) < YAML (--config) < explicit CLI flags < --set key=value``.
-
- Recent releases add explicit switches for the simulated-missingness workflow shared by the neural and supervised models:
-
- - ``--sim-strategy`` selects one of ``random``, ``random_weighted``, ``random_weighted_inv``, ``nonrandom``, ``nonrandom_weighted``.
- - ``--sim-prop`` sets the proportion of observed calls to temporarily mask when building the evaluation set.
- - ``--simulate-missing`` disables simulated masking entirely (store-false flag); omit it to inherit preset/YAML defaults or re-enable via ``--set sim.simulate_missing=True``.
-
- Example:
-
- ```
- pg-sui \
- --vcf data.vcf.gz \
- --popmap pops.popmap \
- --models ImputeUBP ImputeVAE \
- --preset balanced \
- --sim-strategy random_weighted_inv \
- --sim-prop 0.25 \
- --set io.prefix=vae_vs_ubp
- ```
-
- CLI overrides cascade into every selected model, so a single invocation can evaluate multiple imputers with a consistent simulation strategy and output prefix.
-
- ## To-Dos
-
- - simulations
- - Documentation
-
- ## References:
-
- <a name="1">1. </a>Stef van Buuren, Karin Groothuis-Oudshoorn (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45: 1-67.
-
- <a name="2">2. </a>Kingma, D.P. & Welling, M. (2013). Auto-encoding variational bayes. In: Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1312.6114 [stat.ML].
-
- <a name="3">3. </a>Hinton, G.E., & Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.
-
- <a name="4">4. </a>Scholz, M., Kaplan, F., Guy, C. L., Kopka, J., & Selbig, J. (2005). Non-linear PCA: a missing data approach. Bioinformatics, 21(20), 3887-3895.
-
- <a name="5">5. </a>Gashler, M. S., Smith, M. R., Morris, R., & Martinez, T. (2016). Missing value imputation with unsupervised backpropagation. Computational Intelligence, 32(2), 196-215.
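
Note on the removed README's Command-Line Interface section: it states the precedence chain `code defaults < preset < YAML < explicit CLI flags < --set key=value`. The sketch below is one way such layering could work, written as an illustration only; it is not taken from `pgsui/cli.py`. The function names (`deep_merge`, `set_dotted`, `resolve_config`) and the keys `sim.strategy` and `sim.prop` are hypothetical; only `io.prefix` and the ordering of the layers come from the README text.

```python
import copy
from typing import Any, Mapping, Sequence


def deep_merge(base: Mapping, override: Mapping) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested mappings are merged key by key; everything else is replaced.
    """
    merged = copy.deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def set_dotted(cfg: dict, dotted_key: str, value: Any) -> None:
    """Assign `value` at a dotted path such as 'io.prefix', creating parents."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def resolve_config(
    defaults: Mapping,
    preset: Mapping,
    yaml_cfg: Mapping,
    cli_flags: Mapping,
    set_overrides: Sequence[str],
) -> dict:
    """Apply layers from lowest to highest precedence (hypothetical helper)."""
    cfg: dict = {}
    for layer in (defaults, preset, yaml_cfg, cli_flags):
        cfg = deep_merge(cfg, layer)
    # "--set key=value" items are applied last; values stay as raw strings here.
    for item in set_overrides:
        key, _, raw = item.partition("=")
        set_dotted(cfg, key, raw)
    return cfg


# Hypothetical layers mirroring the README's example invocation.
defaults = {"sim": {"strategy": "random", "prop": 0.1}, "io": {"prefix": "pgsui"}}
preset = {"sim": {"prop": 0.2}}
yaml_cfg = {"sim": {"strategy": "nonrandom"}}
cli_flags = {"sim": {"strategy": "random_weighted_inv", "prop": 0.25}}
set_overrides = ["io.prefix=vae_vs_ubp"]

resolved = resolve_config(defaults, preset, yaml_cfg, cli_flags, set_overrides)
print(resolved)
```

Each layer is merged into a fresh copy, so the `defaults` mapping is left unmodified and the same defaults could be reused for every model selected in a single run.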