PyPI - smftools - Versions diffs - 0.1.1__tar.gz → 0.1.3__tar.gz - Mend

smftools 0.1.1__tar.gz → 0.1.3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (182)

{smftools-0.1.1 → smftools-0.1.3}/.gitignore RENAMED

@@ -18,6 +18,10 @@ build/
 venv/
 /environment.yml
+# Tests
+/tests/_test_inputs/
+/tests/_test_outputs/
 # OS
 .DS_Store
 .LSOverride

{smftools-0.1.1 → smftools-0.1.3}/PKG-INFO RENAMED

@@ -1,8 +1,9 @@
 Metadata-Version: 2.3
 Name: smftools
-Version: 0.1.1
+Version: 0.1.3
 Summary: Single Molecule Footprinting Analysis in Python.
 Project-URL: Source, https://github.com/jkmckenna/smftools
+Project-URL: Documentation, https://smftools.readthedocs.io/
 Author: Joseph McKenna
 Maintainer-email: Joseph McKenna <jkmckenna@berkeley.edu>
 License-Expression: MIT
@@ -31,6 +32,7 @@ Requires-Dist: numpy<2,>=1.22.0
 Requires-Dist: pandas>=1.4.2
 Requires-Dist: pod5>=0.1.21
 Requires-Dist: pomegranate>1.0.0
+Requires-Dist: pyfaidx>=0.8.0
 Requires-Dist: pysam>=0.19.1
 Requires-Dist: scanpy>=1.9
 Requires-Dist: scikit-learn>=1.0.2
@@ -38,9 +40,6 @@ Requires-Dist: scipy>=1.7.3
 Requires-Dist: seaborn>=0.11
 Requires-Dist: torch>=1.9.0
 Requires-Dist: tqdm
-Provides-Extra: base-tests
-Requires-Dist: pytest; extra == 'base-tests'
-Requires-Dist: pytest-cov; extra == 'base-tests'
 Provides-Extra: docs
 Requires-Dist: ipython>=7.20; extra == 'docs'
 Requires-Dist: matplotlib!=3.6.1; extra == 'docs'
@@ -56,13 +55,16 @@ Requires-Dist: sphinx-design; extra == 'docs'
 Requires-Dist: sphinx>=7; extra == 'docs'
 Requires-Dist: sphinxcontrib-bibtex; extra == 'docs'
 Requires-Dist: sphinxext-opengraph; extra == 'docs'
+Provides-Extra: tests
+Requires-Dist: pytest; extra == 'tests'
+Requires-Dist: pytest-cov; extra == 'tests'
 Description-Content-Type: text/markdown
 [![PyPI](https://img.shields.io/pypi/v/smftools.svg)](https://pypi.org/project/smftools)
 [![Docs](https://readthedocs.org/projects/smftools/badge/?version=latest)](https://smftools.readthedocs.io/en/latest/?badge=latest)
 # smftools
-A Python tool for processing raw sequencing data derived from single molecule footprinting experiments into [anndata](https://anndata.readthedocs.io/en/latest/) objects. Additional functionality for preprocessing, analysis, and visualization. Data structures are compatible with analyses developed within the [scverse](https://github.com/scverse) project, including [scanpy](https://github.com/scverse/scanpy) and [scvi-tools](https://github.com/scverse/scvi-tools).
+A Python tool for processing raw sequencing data derived from single molecule footprinting experiments into [anndata](https://anndata.readthedocs.io/en/latest/) objects. Additional functionality for preprocessing, analysis, and visualization.
 ## Philosophy
 While most genomic data structures handle low-coverage data (<100X) along large references, smftools prioritizes high-coverage data (scalable to at least 1 million X coverage) of a few genomic loci at a time. This enables efficient data storage, rapid data operations, hierarchical metadata handling, seamless integration with various machine-learning packages, and ease of visualization. Furthermore, functionality is modularized, enabling analysis sessions to be saved, reloaded, and easily shared with collaborators. Analyses are centered around the [anndata](https://anndata.readthedocs.io/en/latest/) object, and are heavily inspired by the work conducted within the single-cell genomics community.
@@ -73,10 +75,14 @@ The following CLI tools need to be installed and configured before using the inf
 2) [Samtools](https://github.com/samtools/samtools) -> For working with SAM/BAM files
 3) [Minimap2](https://github.com/lh3/minimap2) -> The aligner used by Dorado
 4) [Modkit](https://github.com/nanoporetech/modkit) -> Extracting summary statistics and read level methylation calls from modified BAM files
+5) [Bedtools](https://github.com/arq5x/bedtools2) -> For generating Bedgraphs from BAM alignment files.
+6) [BedGraphToBigWig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) -> For converting BedGraphs to BigWig files for IGV sessions.
 ## Modules
-- Informatics: Processes raw SMF data coming from Nanopore POD5 files, BAM files, or FASTQ files and organizes it into an AnnData object.
-- Preprocessing: Filters the AnnData object on read length, total methylation, and a variety of QC metrics.
+### Informatics: Processes raw Nanopore/Illumina data from SMF experiments into an AnnData object.
+![](docs/source/_static/smftools_informatics_diagram.png)
+### Preprocessing: Appends QC metrics to the AnnData object and performs filtering.
+![](docs/source/_static/smftools_preprocessing_diagram.png)
 - Tools: Appends various analyses to the AnnData object.
 - Plotting: Visualization of analyses stored within the AnnData object.

{smftools-0.1.1 → smftools-0.1.3}/README.md RENAMED

@@ -2,7 +2,7 @@
 [![Docs](https://readthedocs.org/projects/smftools/badge/?version=latest)](https://smftools.readthedocs.io/en/latest/?badge=latest)
 # smftools
-A Python tool for processing raw sequencing data derived from single molecule footprinting experiments into [anndata](https://anndata.readthedocs.io/en/latest/) objects. Additional functionality for preprocessing, analysis, and visualization. Data structures are compatible with analyses developed within the [scverse](https://github.com/scverse) project, including [scanpy](https://github.com/scverse/scanpy) and [scvi-tools](https://github.com/scverse/scvi-tools).
+A Python tool for processing raw sequencing data derived from single molecule footprinting experiments into [anndata](https://anndata.readthedocs.io/en/latest/) objects. Additional functionality for preprocessing, analysis, and visualization.
 ## Philosophy
 While most genomic data structures handle low-coverage data (<100X) along large references, smftools prioritizes high-coverage data (scalable to at least 1 million X coverage) of a few genomic loci at a time. This enables efficient data storage, rapid data operations, hierarchical metadata handling, seamless integration with various machine-learning packages, and ease of visualization. Furthermore, functionality is modularized, enabling analysis sessions to be saved, reloaded, and easily shared with collaborators. Analyses are centered around the [anndata](https://anndata.readthedocs.io/en/latest/) object, and are heavily inspired by the work conducted within the single-cell genomics community.
@@ -13,10 +13,14 @@ The following CLI tools need to be installed and configured before using the inf
 2) [Samtools](https://github.com/samtools/samtools) -> For working with SAM/BAM files
 3) [Minimap2](https://github.com/lh3/minimap2) -> The aligner used by Dorado
 4) [Modkit](https://github.com/nanoporetech/modkit) -> Extracting summary statistics and read level methylation calls from modified BAM files
+5) [Bedtools](https://github.com/arq5x/bedtools2) -> For generating Bedgraphs from BAM alignment files.
+6) [BedGraphToBigWig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) -> For converting BedGraphs to BigWig files for IGV sessions.
 ## Modules
-- Informatics: Processes raw SMF data coming from Nanopore POD5 files, BAM files, or FASTQ files and organizes it into an AnnData object.
-- Preprocessing: Filters the AnnData object on read length, total methylation, and a variety of QC metrics.
+### Informatics: Processes raw Nanopore/Illumina data from SMF experiments into an AnnData object.
+![](docs/source/_static/smftools_informatics_diagram.png)
+### Preprocessing: Appends QC metrics to the AnnData object and performs filtering.
+![](docs/source/_static/smftools_preprocessing_diagram.png)
 - Tools: Appends various analyses to the AnnData object.
 - Plotting: Visualization of analyses stored within the AnnData object.

smftools-0.1.3/docs/source/_static/converted_BAM_to_adata.png ADDED

Binary file

smftools-0.1.3/docs/source/_static/modkit_extract_to_adata.png ADDED

Binary file

smftools-0.1.3/docs/source/_static/smftools_informatics_diagram.pdf ADDED

Binary file

smftools-0.1.3/docs/source/_static/smftools_informatics_diagram.png ADDED

Binary file

smftools-0.1.3/docs/source/_static/smftools_preprocessing_diagram.png ADDED

Binary file

smftools-0.1.3/docs/source/api/index.md ADDED

@@ -0,0 +1,26 @@
+# API
+Import smftools as:
+```
+import smftools as smf
+```
+```{toctree}
+:maxdepth: 2
+informatics
+preprocessing
+tools
+datasets
+```
+## Informatics module diagram
+```{image} ../_static/smftools_informatics_diagram.png
+:width: 800px
+```
+## Preprocessing module diagram
+```{image} ../_static/smftools_preprocessing_diagram.png
+:width: 800px
+```

smftools-0.1.3/docs/source/api/informatics.md ADDED

@@ -0,0 +1,27 @@
+## Informatics: `inform`
+## Informatics module diagram
+```{image} ../_static/smftools_informatics_diagram.png
+:width: 1000px
+```
+```{eval-rst}
+.. module:: smftools.inform
+```
+```{eval-rst}
+.. currentmodule:: smftools
+```
+Processes raw sequencing data to load an adata object.
+### Diagram of final steps of Direct SMF workflow
+```{image} ../_static/modkit_extract_to_adata.png
+:width: 1000px
+```
+### Diagram of final steps of Conversion SMF workflow
+```{image} ../_static/converted_BAM_to_adata.png
+:width: 1000px
+```

{smftools-0.1.1 → smftools-0.1.3}/docs/source/api/preprocessing.md RENAMED

@@ -1,5 +1,10 @@
 ## Preprocessing: `pp`
+## Preprocessing module diagram
+```{image} ../_static/smftools_preprocessing_diagram.png
+:width: 1000px
+```
 ```{eval-rst}
 .. module:: smftools.pp
 ```

smftools-0.1.3/docs/source/basic_usage.md ADDED

@@ -0,0 +1,75 @@
+# Basic Usage
+Import smftools:
+```
+import smftools as smf
+```
+## Informatics Module Usage
+Many use cases for smftools begin here. For most users, the call below will be sufficient to convert any raw SMF dataset to an AnnData object:
+```
+config_path = "/Path_to_experiment_config.csv"
+smf.inform.load_adata(config_path)
+```
+## Loading AnnData objects created by the informatics module
+After creating an AnnData object holding your experiment's SMF data, you can load it as follows:
+```
+import anndata as ad
+input_adata = "/Path_to_experiment_AnnData.h5ad.gz"
+adata = ad.read_h5ad(input_adata)
+adata.obs_names_make_unique()
+```
+If you don't have an AnnData object yet, but want to play with the downstream Preprocessing, Tools, and Plotting modules, you can load a pre-loaded SMF dataset.
+Currently, you can do this with our lab's in vitro dCas9 binding kinetics dataset, a Hia5 SMF dataset generated with direct m6A high-accuracy basecalls:
+```
+adata = smf.datasets.dCas9_kinetics()
+adata.obs_names_make_unique()
+```
+Alternatively, you can do this with our lab's M.CviPI SMF test data in F1-hybrid natural killer cells generated by NEB EMseq conversion followed by canonical basecalling:
+```
+adata = smf.datasets.Kissiov_and_McKenna_2025()
+adata.obs_names_make_unique()
+```
+## Writing out AnnData objects to save analysis progress
+After preprocessing and downstream analysis of the AnnData object, you can save it at any step as follows:
+```
+import anndata as ad
+import os
+output_dir = '/Path_to_output_directory'
+output_adata = 'analyzed_adata.h5ad.gz'
+final_output = os.path.join(output_dir, output_adata)
+adata.write_h5ad(final_output, compression='gzip')
+```
+## Troubleshooting
+For more advanced usage and help troubleshooting, the API and tutorials for each of the modules are still being developed.
+However, you can currently learn about the functions contained within the module by calling:
+```
+smf.inform.__all__
+```
+This lists the functions within any given module. If you want to see the associated docstring for a given function, here is an example:
+```
+print(smf.inform.load_adata.__doc__)
+```
+These docstrings will provide a brief description of the function and also tell you the input parameters and what the function returns.
+In some cases, usage examples will also be provided in the docstring in the form of doctests.
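The introspection steps described in this new page (`__all__` to list a module's public names, `__doc__` to read a docstring) can be tried without smftools installed. The sketch below builds a stand-in module with the standard library; `demo_inform` and its `load_adata` are hypothetical placeholders and do not reproduce the real `smf.inform` contents.

```python
import types

# Build a tiny stand-in module so the example runs without smftools installed.
# "demo_inform" is a hypothetical name, not a real smftools module.
demo = types.ModuleType("demo_inform")


def load_adata(config_path):
    """Stand-in sharing a name with the documented entry point; returns its input."""
    return config_path


demo.load_adata = load_adata
demo.__all__ = ["load_adata"]

# List the public names, then print the first docstring line of each.
for name in demo.__all__:
    doc = getattr(demo, name).__doc__
    print(name, "->", doc.strip().splitlines()[0])
```

The same two attributes work on any imported module, so the loop can be pointed at `smf.inform` once the package is installed.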

{smftools-0.1.1 → smftools-0.1.3}/docs/source/index.md RENAMED

@@ -42,6 +42,7 @@ smftools GitHub link
 :maxdepth: 1
 installation
+basic_usage
 tutorials/index
 api/index
 release-notes/index

smftools-0.1.3/docs/source/installation.md ADDED

@@ -0,0 +1,60 @@
+# Installation
+## PyPI version
+Pull smftools from [PyPI](https://pypi.org/project/smftools):
+```shell
+pip install smftools
+```
+It is recommended to first create and activate a conda environment before installing smftools to ensure dependencies are managed smoothly:
+```shell
+conda create -n smftools
+conda activate smftools
+pip install smftools
+```
+Ensure that you can access the dorado, samtools, modkit, bedtools, and BedGraphToBigWig executables from the terminal in this environment. These are all necessary for the functionality within the Informatics module.
+You may need to add them to $PATH if they are not globally configured.
+For example, if you want to check if dorado is executable, simply run this in the terminal:
+```shell
+dorado
+```
+On macOS, the following can be used to configure bedtools (with brew) and BedGraphToBigWig (with wget). Change the BedGraphToBigWig link to match the architecture of your machine.
+```shell
+brew install bedtools
+wget http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/bedGraphToBigWig
+chmod +x bedGraphToBigWig
+sudo mv bedGraphToBigWig /usr/local/bin/
+```
+## Development Version
+Clone smftools from source and change into the smftools directory:
+```shell
+git clone https://github.com/jkmckenna/smftools.git
+cd smftools
+```
+A virtual environment can be created for the current version within the smftools directory:
+```shell
+python -m venv venv-smftools
+source venv-smftools/bin/activate
+pip install .
+```
+To use the installed version of smftools later, change to the smftools directory and activate the venv:
+```shell
+cd smftools
+source venv-smftools/bin/activate
+```
+You can now run smftools from the terminal, an IDE, or a notebook within the virtual environment.

{smftools-0.1.1 → smftools-0.1.3}/experiment_config.csv RENAMED

@@ -1,8 +1,8 @@
 variable,value,help,options,type
 smf_modality,conversion,Modality of SMF. Can either be conversion or direct.,"conversion, direct",str
-pod5_dir,/path_to_POD5_directory,Path to directory containing input POD5 files (If doing Nanopore SMF),,str
-basecalled_path,/path_to_basecalled_HTS_file.bam,Path to directory containing input BAM file (if doing SMF from an already basecalled experiment). Can also be a path to a FASTQ for conversion SMF.,,str
+input_data_path,/path_to_POD5_directory,Path to directory/file containing input sequencing data,,str
 fasta,/path_to_fasta.fasta,Path to initial FASTA file,,str
+fasta_regions_of_interest,/path_to_bed.bed,Path to a bed file to subsample the fasta on.,,str
 output_directory,/outputs,Directory to act as root for all analysis outputs,,str
 experiment_name,,An experiment name for the final h5ad file,,str
 model,None,The dorado basecalling model to use,,str
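The config file is a five-column CSV (`variable,value,help,options,type`), and this release merges `pod5_dir` and `basecalled_path` into a single `input_data_path` row. A minimal sketch of reading such a file with the standard library follows; `read_experiment_config` is a hypothetical helper, and the way `load_adata` actually parses the file is not shown in this diff.

```python
import csv
import io

# Header plus three rows copied from the 0.1.3 experiment_config.csv.
CONFIG_TEXT = '''variable,value,help,options,type
smf_modality,conversion,Modality of SMF. Can either be conversion or direct.,"conversion, direct",str
input_data_path,/path_to_POD5_directory,Path to directory/file containing input sequencing data,,str
fasta_regions_of_interest,/path_to_bed.bed,Path to a bed file to subsample the fasta on.,,str
'''


def read_experiment_config(handle):
    """Hypothetical helper (not part of smftools): map each variable to its value string."""
    reader = csv.DictReader(handle)
    return {row["variable"]: row["value"] for row in reader}


config = read_experiment_config(io.StringIO(CONFIG_TEXT))
print(config["smf_modality"])
print(config["input_data_path"])
```

Configs written for 0.1.1 still carry `pod5_dir` and `basecalled_path`, so a lookup of `input_data_path` on an old file would fail in a parser like this one.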

smftools-0.1.3/notebooks/Kissiov_and_McKenna_2025_example_notebook.ipynb ADDED

@@ -0,0 +1,85 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import anndata as ad\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import smftools as smf\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define file paths\n",
+    "adata_path = '/Path_to_input_adata.h5ad.gz'\n",
+    "output_directory = '/Path_to_output_directory'\n",
+    "output_adata = 'analyzed_adata.h5ad.gz'\n",
+    "final_output = os.path.join(output_directory, output_adata)\n",
+    "\n",
+    "# Load adata\n",
+    "adata = ad.read_h5ad(adata_path)\n",
+    "adata.obs_names_make_unique()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define path to sample sheet and run first part of preprocessing.\n",
+    "sample_sheet_path = '/path_to_sample_sheet.csv'\n",
+    "variables = smf.pp.recipe_1_Kissiov_and_McKenna_2025(adata, sample_sheet_path, output_directory)\n",
+    "# Update global variables\n",
+    "globals().update(variables)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Filter adata based on defined read length statistics, using the plots from preprocessing part 1 to direct the input parameters here.\n",
+    "smf.pp.filter_reads_on_length(adata, filter_on_coordinates=[lower_bound, upper_bound], min_read_length=2700)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Filter adata on defined read methylation statistics\n",
+    "smf.pp.filter_converted_reads_on_methylation(adata, valid_SMF_site_threshold=0.8, min_SMF_threshold=0.025)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run second part of preprocessing\n",
+    "duplicates = smf.pp.recipe_2_Kissiov_and_McKenna_2025(adata, output_directory, binary_layers)"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

smftools-0.1.3/notebooks/Kissiov_and_McKenna_2025_sample_sheet.csv ADDED

@@ -0,0 +1,11 @@
+Sample,Sample_names,MTase,Time (sec),Group
+0,,Hia5,0,0
+1,,Hia5,15,0
+2,,Hia5,30,0
+3,,Hia5,120,0
+4,,Hia5,300,0
+5,,Hia5,0,1
+6,,Hia5,15,1
+7,,Hia5,30,1
+8,,Hia5,120,1
+9,,Hia5,300,1

{smftools-0.1.1 → smftools-0.1.3}/pyproject.toml RENAMED

@@ -48,6 +48,7 @@ dependencies = [
     "pandas>=1.4.2",
     "pod5>=0.1.21",
     "pomegranate>1.0.0",
+    "pyfaidx>=0.8.0",
     "pysam>=0.19.1",
     "scanpy>=1.9",
     "scikit-learn>=1.0.2",
@@ -60,9 +61,10 @@ dynamic = ["version"]
 [project.urls]
 Source = "https://github.com/jkmckenna/smftools"
+Documentation = "https://smftools.readthedocs.io/"
 [project.optional-dependencies]
-base_tests = [
+tests = [
     "pytest",
     "pytest-cov"
 ]
@@ -91,16 +93,16 @@ packages = ["src/smftools"]
 path = "src/smftools/_version.py"
 [tool.pytest.ini_options]
+addopts = [
+    "--import-mode=importlib",
+    "--strict-markers",
+    "--doctest-modules",
+    "--pyargs",
+]
 testpaths = ["tests"]
 pythonpath = ["src"]
 xfail_strict = true
-markers = [
-    "internet: mark tests that requires internet access",
-    "optional: mark optional tests",
-    "private: mark tests that are private",
-]
 [tool.coverage.run]
-branch = true
 source = ["smftools"]
 omit = ["tests/*"]
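With `--doctest-modules` in `addopts`, pytest also collects the interactive examples embedded in docstrings of the modules it imports. The sketch below shows what such an example looks like and checks it with the standard library `doctest` module instead of pytest; `count_methylated` is a hypothetical function written for illustration and does not appear in smftools.

```python
import doctest


def count_methylated(calls):
    """Hypothetical example function; not part of smftools.

    Count positions called as methylated (1) in a binary call vector.

    >>> count_methylated([1, 0, 1, 1])
    3
    >>> count_methylated([])
    0
    """
    return sum(1 for c in calls if c == 1)


# Parse the docstring into a DocTest and run it explicitly, so the check
# does not depend on how this file is executed.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    count_methylated.__doc__,
    {"count_methylated": count_methylated},
    "count_methylated",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

The same hunk adds `--strict-markers` while removing the registered `markers` list, so any test still decorated with one of the old custom markers would be reported as an error under this configuration.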

{smftools-0.1.1 → smftools-0.1.3}/requirements.txt RENAMED

@@ -7,6 +7,7 @@ numpy>=1.22.0,<2
 pandas>=1.4.2
 pomegranate>1.0.0
 pod5>=0.1.21
+pyfaidx>=0.8.0
 pysam>=0.19.1
 scanpy>=1.9
 scikit-learn>=1.0.2

smftools-0.1.3/sample_sheet.csv ADDED

@@ -0,0 +1,11 @@
+Sample,Sample_names,MTase,Time (sec),Group
+0,,Hia5,0,0
+1,,Hia5,15,0
+2,,Hia5,30,0
+3,,Hia5,120,0
+4,,Hia5,300,0
+5,,Hia5,0,1
+6,,Hia5,15,1
+7,,Hia5,30,1
+8,,Hia5,120,1
+9,,Hia5,300,1

{smftools-0.1.1 → smftools-0.1.3}/src/smftools/_settings.py RENAMED

@@ -1,4 +1,5 @@
 from pathlib import Path
+from typing import Union
 class SMFConfig:
     """\
@@ -8,9 +9,9 @@ class SMFConfig:
     def __init__(
         self,
         *,
-        datasetdir: Path | str = "./datasets/"
+        datasetdir: Union[Path, str] = "./datasets/"
     ):
-         self._datasetdir = Path(datasetdir) if isinstance(datasetdir, str) else datasetdir
+        self._datasetdir = Path(datasetdir) if isinstance(datasetdir, str) else datasetdir
     @property
     def datasetdir(self) -> Path:

smftools-0.1.3/src/smftools/_version.py ADDED

@@ -0,0 +1 @@
+__version__ = "0.1.3"

smftools-0.1.3/src/smftools/datasets/F1_sample_sheet.csv ADDED

@@ -0,0 +1,5 @@
+Sample,Sample_names,MTase,Time (min),Notes
+barcode0001_sorted,Neither,M.CviPI,7.5,Cultured in IL2
+barcode0002_sorted,BALBC,M.CviPI,7.5,Cultured in IL2
+barcode0003_sorted,B6,M.CviPI,7.5,Cultured in IL2
+barcode0004_sorted,Both,M.CviPI,7.5,Cultured in IL2

{smftools-0.1.1 → smftools-0.1.3}/src/smftools/datasets/datasets.py RENAMED

@@ -1,10 +1,9 @@
 ## datasets
-def import_deps():
+def import_HERE():
     """
+    Imports HERE for loading datasets
     """
-    import anndata as ad
     from pathlib import Path
     from .._settings import settings
     HERE = Path(__file__).parent
@@ -12,16 +11,18 @@ def import_deps():
 def dCas9_kinetics():
     """
+    in vitro Hia5 dCas9 kinetics SMF dataset. Nanopore HAC m6A modcalls.
     """
-    HERE = import_deps()
+    import anndata as ad
+    HERE = import_HERE()
     filepath = HERE / "dCas9_m6A_invitro_kinetics.h5ad.gz"
     return ad.read_h5ad(filepath)
 def Kissiov_and_McKenna_2025():
     """
+    F1 Hybrid M.CviPI natural killer cell SMF. Nanopore canonical calls of NEB EMseq converted SMF gDNA.
     """
-    HERE = import_deps()
+    import anndata as ad
+    HERE = import_HERE()
     filepath = HERE / "F1_hybrid_NKG2A_enhander_promoter_GpC_conversion_SMF.h5ad.gz"
     return ad.read_h5ad(filepath)

smftools-0.1.3/src/smftools/informatics/__init__.py ADDED

@@ -0,0 +1,14 @@
+from . import helpers
+from .load_adata import load_adata
+from .subsample_fasta_from_bed import subsample_fasta_from_bed
+from .subsample_pod5 import subsample_pod5
+from .fast5_to_pod5 import fast5_to_pod5
+__all__ = [
+    "load_adata",
+    "subsample_fasta_from_bed",
+    "subsample_pod5",
+    "fast5_to_pod5",
+    "helpers"
+]

{smftools-0.1.1/src/smftools/informatics → smftools-0.1.3/src/smftools/informatics/archived}/bam_conversion.py RENAMED

@@ -18,7 +18,7 @@ def bam_conversion(fasta, output_directory, conversion_types, strands, basecalle
     Returns:
         None
     """
-    from .helpers import align_and_sort_BAM, converted_BAM_to_adata, generate_converted_FASTA, split_and_index_BAM
+    from .helpers import align_and_sort_BAM, converted_BAM_to_adata, generate_converted_FASTA, split_and_index_BAM, make_dirs
     import os
     input_basecalled_basename = os.path.basename(basecalled_path)
     bam_basename = input_basecalled_basename.split(".")[0]
@@ -32,16 +32,28 @@ def bam_conversion(fasta, output_directory, conversion_types, strands, basecalle
     fasta_basename = os.path.basename(fasta)
     converted_FASTA_basename = fasta_basename.split('.fa')[0]+'_converted.fasta'
     converted_FASTA = os.path.join(output_directory, converted_FASTA_basename)
-    if os.path.exists(converted_FASTA):
+    if 'converted.fa' in fasta:
+        print(fasta + ' is already converted. Using existing converted FASTA.')
+        converted_FASTA = fasta
+    elif os.path.exists(converted_FASTA):
         print(converted_FASTA + ' already exists. Using existing converted FASTA.')
     else:
         generate_converted_FASTA(fasta, conversion_types, strands, converted_FASTA)
     # 2) Align the basecalled file to the converted reference FASTA and sort the bam on positional coordinates. Also make an index and a bed file of mapped reads
-    align_and_sort_BAM(converted_FASTA, basecalled_path, bam_suffix, output_directory)
+    aligned_output = aligned_BAM + bam_suffix
+    sorted_output = aligned_sorted_BAM + bam_suffix
+    if os.path.exists(aligned_output) and os.path.exists(sorted_output):
+        print(sorted_output + ' already exists. Using existing aligned/sorted BAM.')
+    else:
+        align_and_sort_BAM(converted_FASTA, basecalled_path, bam_suffix, output_directory)
     ### 3) Split the aligned and sorted BAM files by barcode (BC Tag) into the split_BAM directory###
-    split_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix)
+    if os.path.isdir(split_dir):
+        print(split_dir + ' already exists. Using existing aligned/sorted/split BAMs.')
+    else:
+        make_dirs([split_dir])
+        split_and_index_BAM(aligned_sorted_BAM, split_dir, bam_suffix, output_directory)
     # 4) Take the converted BAM and load it into an adata object.
     converted_BAM_to_adata(converted_FASTA, split_dir, mapping_threshold, experiment_name, conversion_types, bam_suffix)
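The hunks above repeat one pattern three times: check whether a step's outputs already exist on disk and skip the step if they do. A minimal sketch of that pattern as a reusable function follows; `run_if_missing` and `fake_align` are hypothetical names introduced here, and the archived module inlines the checks instead of using a helper.

```python
import os
import tempfile


def run_if_missing(outputs, step, label):
    """Hypothetical helper: run `step` only when an expected output is absent."""
    if all(os.path.exists(path) for path in outputs):
        print(label + ' already exists. Using existing outputs.')
        return False
    step()
    return True


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, 'aligned_sorted.bam')

    def fake_align():
        # Stand-in for an expensive step such as alignment; writes an empty file.
        with open(target, 'w') as handle:
            handle.write('')

    first = run_if_missing([target], fake_align, target)   # file absent, step runs
    second = run_if_missing([target], fake_align, target)  # file present, step skipped
    print(first, second)
```

An existence check of this kind cannot tell a complete output from one left behind by an interrupted run, so a partially written BAM or split directory would be reused as-is.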
