smftools 0.2.1__py3-none-any.whl → 0.2.3__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (96)
  1. smftools/__init__.py +2 -6
  2. smftools/_version.py +1 -1
  3. smftools/cli/__init__.py +0 -0
  4. smftools/cli/cli_flows.py +94 -0
  5. smftools/cli/hmm_adata.py +338 -0
  6. smftools/cli/load_adata.py +577 -0
  7. smftools/cli/preprocess_adata.py +363 -0
  8. smftools/cli/spatial_adata.py +564 -0
  9. smftools/cli_entry.py +435 -0
  10. smftools/config/conversion.yaml +11 -6
  11. smftools/config/deaminase.yaml +12 -7
  12. smftools/config/default.yaml +36 -25
  13. smftools/config/direct.yaml +25 -1
  14. smftools/config/discover_input_files.py +115 -0
  15. smftools/config/experiment_config.py +109 -12
  16. smftools/informatics/__init__.py +13 -7
  17. smftools/informatics/archived/fast5_to_pod5.py +43 -0
  18. smftools/informatics/archived/helpers/archived/__init__.py +71 -0
  19. smftools/informatics/archived/helpers/archived/align_and_sort_BAM.py +126 -0
  20. smftools/informatics/{helpers → archived/helpers/archived}/aligned_BAM_to_bed.py +6 -4
  21. smftools/informatics/archived/helpers/archived/bam_qc.py +213 -0
  22. smftools/informatics/archived/helpers/archived/bed_to_bigwig.py +90 -0
  23. smftools/informatics/archived/helpers/archived/concatenate_fastqs_to_bam.py +259 -0
  24. smftools/informatics/{helpers → archived/helpers/archived}/count_aligned_reads.py +2 -2
  25. smftools/informatics/{helpers → archived/helpers/archived}/demux_and_index_BAM.py +8 -10
  26. smftools/informatics/{helpers → archived/helpers/archived}/extract_base_identities.py +1 -1
  27. smftools/informatics/{helpers → archived/helpers/archived}/extract_mods.py +15 -13
  28. smftools/informatics/{helpers → archived/helpers/archived}/generate_converted_FASTA.py +2 -0
  29. smftools/informatics/{helpers → archived/helpers/archived}/get_chromosome_lengths.py +9 -8
  30. smftools/informatics/archived/helpers/archived/index_fasta.py +24 -0
  31. smftools/informatics/{helpers → archived/helpers/archived}/make_modbed.py +1 -2
  32. smftools/informatics/{helpers → archived/helpers/archived}/modQC.py +2 -2
  33. smftools/informatics/{helpers → archived/helpers/archived}/plot_bed_histograms.py +0 -19
  34. smftools/informatics/{helpers → archived/helpers/archived}/separate_bam_by_bc.py +6 -5
  35. smftools/informatics/{helpers → archived/helpers/archived}/split_and_index_BAM.py +7 -7
  36. smftools/informatics/archived/subsample_fasta_from_bed.py +49 -0
  37. smftools/informatics/bam_functions.py +812 -0
  38. smftools/informatics/basecalling.py +67 -0
  39. smftools/informatics/bed_functions.py +366 -0
  40. smftools/informatics/{helpers/converted_BAM_to_adata_II.py → converted_BAM_to_adata.py} +42 -30
  41. smftools/informatics/fasta_functions.py +255 -0
  42. smftools/informatics/h5ad_functions.py +197 -0
  43. smftools/informatics/{helpers/modkit_extract_to_adata.py → modkit_extract_to_adata.py} +142 -59
  44. smftools/informatics/modkit_functions.py +129 -0
  45. smftools/informatics/ohe.py +160 -0
  46. smftools/informatics/pod5_functions.py +224 -0
  47. smftools/informatics/{helpers/run_multiqc.py → run_multiqc.py} +5 -2
  48. smftools/plotting/autocorrelation_plotting.py +1 -3
  49. smftools/plotting/general_plotting.py +1037 -362
  50. smftools/preprocessing/__init__.py +2 -0
  51. smftools/preprocessing/append_base_context.py +3 -3
  52. smftools/preprocessing/append_binary_layer_by_base_context.py +4 -4
  53. smftools/preprocessing/binarize.py +17 -0
  54. smftools/preprocessing/binarize_on_Youden.py +2 -2
  55. smftools/preprocessing/calculate_position_Youden.py +1 -1
  56. smftools/preprocessing/calculate_read_modification_stats.py +1 -1
  57. smftools/preprocessing/filter_reads_on_modification_thresholds.py +19 -19
  58. smftools/preprocessing/flag_duplicate_reads.py +1 -1
  59. smftools/readwrite.py +266 -140
  60. {smftools-0.2.1.dist-info → smftools-0.2.3.dist-info}/METADATA +10 -9
  61. {smftools-0.2.1.dist-info → smftools-0.2.3.dist-info}/RECORD +82 -70
  62. smftools-0.2.3.dist-info/entry_points.txt +2 -0
  63. smftools/cli.py +0 -184
  64. smftools/informatics/fast5_to_pod5.py +0 -24
  65. smftools/informatics/helpers/__init__.py +0 -73
  66. smftools/informatics/helpers/align_and_sort_BAM.py +0 -86
  67. smftools/informatics/helpers/bam_qc.py +0 -66
  68. smftools/informatics/helpers/bed_to_bigwig.py +0 -39
  69. smftools/informatics/helpers/concatenate_fastqs_to_bam.py +0 -378
  70. smftools/informatics/helpers/discover_input_files.py +0 -100
  71. smftools/informatics/helpers/index_fasta.py +0 -12
  72. smftools/informatics/helpers/make_dirs.py +0 -21
  73. smftools/informatics/readwrite.py +0 -106
  74. smftools/informatics/subsample_fasta_from_bed.py +0 -47
  75. smftools/load_adata.py +0 -1346
  76. smftools-0.2.1.dist-info/entry_points.txt +0 -2
  77. /smftools/informatics/{basecall_pod5s.py → archived/basecall_pod5s.py} +0 -0
  78. /smftools/informatics/{helpers → archived/helpers/archived}/canoncall.py +0 -0
  79. /smftools/informatics/{helpers → archived/helpers/archived}/converted_BAM_to_adata.py +0 -0
  80. /smftools/informatics/{helpers → archived/helpers/archived}/extract_read_features_from_bam.py +0 -0
  81. /smftools/informatics/{helpers → archived/helpers/archived}/extract_read_lengths_from_bed.py +0 -0
  82. /smftools/informatics/{helpers → archived/helpers/archived}/extract_readnames_from_BAM.py +0 -0
  83. /smftools/informatics/{helpers → archived/helpers/archived}/find_conversion_sites.py +0 -0
  84. /smftools/informatics/{helpers → archived/helpers/archived}/get_native_references.py +0 -0
  85. /smftools/informatics/{helpers → archived/helpers}/archived/informatics.py +0 -0
  86. /smftools/informatics/{helpers → archived/helpers}/archived/load_adata.py +0 -0
  87. /smftools/informatics/{helpers → archived/helpers/archived}/modcall.py +0 -0
  88. /smftools/informatics/{helpers → archived/helpers/archived}/ohe_batching.py +0 -0
  89. /smftools/informatics/{helpers → archived/helpers/archived}/ohe_layers_decode.py +0 -0
  90. /smftools/informatics/{helpers → archived/helpers/archived}/one_hot_decode.py +0 -0
  91. /smftools/informatics/{helpers → archived/helpers/archived}/one_hot_encode.py +0 -0
  92. /smftools/informatics/{subsample_pod5.py → archived/subsample_pod5.py} +0 -0
  93. /smftools/informatics/{helpers/binarize_converted_base_identities.py → binarize_converted_base_identities.py} +0 -0
  94. /smftools/informatics/{helpers/complement_base_list.py → complement_base_list.py} +0 -0
  95. {smftools-0.2.1.dist-info → smftools-0.2.3.dist-info}/WHEEL +0 -0
  96. {smftools-0.2.1.dist-info → smftools-0.2.3.dist-info}/licenses/LICENSE +0 -0
smftools/readwrite.py CHANGED
@@ -1,4 +1,15 @@
  ## readwrite ##
+ from __future__ import annotations
+
+ from pathlib import Path
+ from typing import Union, Iterable
+
+ from pathlib import Path
+ from typing import Iterable, Sequence, Optional
+
+ import warnings
+ import pandas as pd
+ import anndata as ad
 
  ######################################################################################################
  ## Datetime functionality
@@ -21,6 +32,101 @@ def time_string():
  return current_time.strftime("%H:%M:%S")
  ######################################################################################################
 
+ ######################################################################################################
+ ## General file and directory handling
+ def make_dirs(directories: Union[str, Path, Iterable[Union[str, Path]]]) -> None:
+ """
+ Create one or multiple directories.
+
+ Parameters
+ ----------
+ directories : str | Path | list/iterable of str | Path
+ Paths of directories to create. If a file path is passed,
+ the parent directory is created.
+
+ Returns
+ -------
+ None
+ """
+
+ # allow user to pass a single string/Path
+ if isinstance(directories, (str, Path)):
+ directories = [directories]
+
+ for d in directories:
+ p = Path(d)
+
+ # If someone passes in a file path, make its parent
+ if p.suffix: # p.suffix != "" means it's a file
+ p = p.parent
+
+ p.mkdir(parents=True, exist_ok=True)
+
+ def add_or_update_column_in_csv(
+ csv_path: str | Path,
+ column_name: str,
+ values,
+ index: bool = False,
+ ):
+ """
+ Add (or overwrite) a column in a CSV file.
+ If the CSV does not exist, create it containing only that column.
+
+ Parameters
+ ----------
+ csv_path : str | Path
+ Path to the CSV file.
+ column_name : str
+ Name of the column to add or update.
+ values : list | scalar | callable
+ - If list/Series: must match the number of rows.
+ - If scalar: broadcast to all rows (or single-row CSV if new file).
+ - If callable(df): function should return the column values based on df.
+ index : bool
+ Whether to write the pandas index into the CSV. Default False.
+
+ Returns
+ -------
+ pd.DataFrame : the updated DataFrame.
+ """
+ csv_path = Path(csv_path)
+ csv_path.parent.mkdir(parents=True, exist_ok=True)
+
+ # Case 1 — CSV does not exist → create it
+ if not csv_path.exists():
+ if hasattr(values, "__len__") and not isinstance(values, str):
+ df = pd.DataFrame({column_name: list(values)})
+ else:
+ df = pd.DataFrame({column_name: [values]})
+ df.to_csv(csv_path, index=index)
+ return df
+
+ # Case 2 — CSV exists → load + modify
+ df = pd.read_csv(csv_path)
+
+ # If values is callable, call it with df
+ if callable(values):
+ values = values(df)
+
+ # Broadcast scalar
+ if not hasattr(values, "__len__") or isinstance(values, str):
+ df[column_name] = values
+ df.to_csv(csv_path, index=index)
+ return df
+
+ # Sequence case: lengths must match
+ if len(values) != len(df):
+ raise ValueError(
+ f"Length mismatch: CSV has {len(df)} rows "
+ f"but values has {len(values)} entries."
+ )
+
+ df[column_name] = list(values)
+ df.to_csv(csv_path, index=index)
+ return df
+
+ ######################################################################################################
+
 
  ######################################################################################################
  ## Numpy, Pandas, Anndata functionality
@@ -62,7 +168,6 @@ def adata_to_df(adata, layer=None):
 
  return df
 
-
  def save_matrix(matrix, save_name):
  """
  Input: A numpy matrix and a save_name
@@ -71,70 +176,173 @@ def save_matrix(matrix, save_name):
  import numpy as np
  np.savetxt(f'{save_name}.txt', matrix)
 
- def concatenate_h5ads(output_file, file_suffix='h5ad.gz', delete_inputs=True):
+ def concatenate_h5ads(
+ output_path: str | Path,
+ *,
+ input_dir: str | Path | None = None,
+ csv_path: str | Path | None = None,
+ csv_column: str = "h5ad_path",
+ file_suffixes: Sequence[str] = (".h5ad", ".h5ad.gz"),
+ delete_inputs: bool = False,
+ restore_backups: bool = True,
+ ) -> Path:
  """
- Concatenate all h5ad files in a directory and delete them after the final adata is written out.
- Input: an output file path relative to the directory in which the function is called
+ Concatenate multiple .h5ad files into one AnnData and write it safely.
+
+ Two input modes (choose ONE):
+ 1) Directory mode: use all *.h5ad / *.h5ad.gz in `input_dir`.
+ 2) CSV mode: use file paths from column `csv_column` in `csv_path`.
+
+ Parameters
+ ----------
+ output_path
+ Path to the final concatenated .h5ad (can be .h5ad or .h5ad.gz).
+ input_dir
+ Directory containing .h5ad files to concatenate. If None and csv_path
+ is also None, defaults to the current working directory.
+ csv_path
+ Path to a CSV containing file paths to concatenate (in column `csv_column`).
+ csv_column
+ Name of the column in the CSV containing .h5ad paths.
+ file_suffixes
+ Tuple of allowed suffixes (default: (".h5ad", ".h5ad.gz")).
+ delete_inputs
+ If True, delete the input .h5ad files after successful write of output.
+ restore_backups
+ Passed through to `safe_read_h5ad(restore_backups=...)`.
+
+ Returns
+ -------
+ Path
+ The path to the written concatenated .h5ad file.
+
+ Raises
+ ------
+ ValueError
+ If both `input_dir` and `csv_path` are provided, or none contain files.
+ FileNotFoundError
+ If specified CSV or directory does not exist.
  """
- import os
- import anndata as ad
- # Runtime warnings
- import warnings
- warnings.filterwarnings('ignore', category=UserWarning, module='anndata')
- warnings.filterwarnings('ignore', category=FutureWarning, module='anndata')
-
- # List all files in the directory
- files = os.listdir(os.getcwd())
- # get current working directory
- cwd = os.getcwd()
- suffix = file_suffix
- # Filter file names that contain the search string in their filename and keep them in a list
- hdfs = [hdf for hdf in files if suffix in hdf]
- # Sort file list by names and print the list of file names
- hdfs.sort()
- print('{0} sample files found: {1}'.format(len(hdfs), hdfs))
- # Iterate over all of the hdf5 files and concatenate them.
- final_adata = None
- for hdf in hdfs:
- print('{0}: Reading in {1} hdf5 file'.format(time_string(), hdf))
- temp_adata = ad.read_h5ad(hdf)
- if final_adata:
- print('{0}: Concatenating final adata object with {1} hdf5 file'.format(time_string(), hdf))
- final_adata = ad.concat([final_adata, temp_adata], join='outer', index_unique=None)
- else:
- print('{0}: Initializing final adata object with {1} hdf5 file'.format(time_string(), hdf))
- final_adata = temp_adata
- print('{0}: Writing final concatenated hdf5 file'.format(time_string()))
- final_adata.write_h5ad(output_file, compression='gzip')
 
- # Delete the individual h5ad files and only keep the final concatenated file
+ # ------------------------------------------------------------------
+ # Setup and input resolution
+ # ------------------------------------------------------------------
+ output_path = Path(output_path)
+
+ if input_dir is not None and csv_path is not None:
+ raise ValueError("Provide either `input_dir` OR `csv_path`, not both.")
+
+ if csv_path is None:
+ # Directory mode
+ input_dir = Path(input_dir) if input_dir is not None else Path.cwd()
+ if not input_dir.exists():
+ raise FileNotFoundError(f"Input directory does not exist: {input_dir}")
+ if not input_dir.is_dir():
+ raise ValueError(f"input_dir is not a directory: {input_dir}")
+
+ # collect all *.h5ad / *.h5ad.gz (or whatever file_suffixes specify)
+ suffixes_lower = tuple(s.lower() for s in file_suffixes)
+ h5_paths = sorted(
+ p for p in input_dir.iterdir()
+ if p.is_file() and p.suffix.lower() in suffixes_lower
+ )
+
+ else:
+ # CSV mode
+ csv_path = Path(csv_path)
+ if not csv_path.exists():
+ raise FileNotFoundError(f"CSV path does not exist: {csv_path}")
+
+ df = pd.read_csv(csv_path, dtype=str)
+ if csv_column not in df.columns:
+ raise ValueError(
+ f"CSV {csv_path} must contain column '{csv_column}' with .h5ad paths."
+ )
+ paths = df[csv_column].dropna().astype(str).tolist()
+ if not paths:
+ raise ValueError(f"No non-empty paths in column '{csv_column}' of {csv_path}.")
+
+ h5_paths = [Path(p).expanduser() for p in paths]
+
+ if not h5_paths:
+ raise ValueError("No input .h5ad files found to concatenate.")
+
+ # Ensure directory for output exists
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+
+ # ------------------------------------------------------------------
+ # Concatenate
+ # ------------------------------------------------------------------
+ warnings.filterwarnings("ignore", category=UserWarning, module="anndata")
+ warnings.filterwarnings("ignore", category=FutureWarning, module="anndata")
+
+ print(f"{time_string()}: Found {len(h5_paths)} input h5ad files:")
+ for p in h5_paths:
+ print(f" - {p}")
+
+ final_adata: Optional[ad.AnnData] = None
+
+ for p in h5_paths:
+ print(f"{time_string()}: Reading {p}")
+ temp_adata, read_report = safe_read_h5ad(p, restore_backups=restore_backups)
+
+ if final_adata is None:
+ print(f"{time_string()}: Initializing final AnnData with {p}")
+ final_adata = temp_adata
+ else:
+ print(f"{time_string()}: Concatenating {p} into final AnnData")
+ final_adata = ad.concat(
+ [final_adata, temp_adata],
+ join="outer",
+ merge='unique',
+ uns_merge='unique',
+ index_unique=None,
+ )
+
+ if final_adata is None:
+ raise RuntimeError("Unexpected: no AnnData objects loaded.")
+
+ print(f"{time_string()}: Writing concatenated AnnData to {output_path}")
+ safe_write_h5ad(final_adata, output_path, backup=restore_backups)
+
+ # ------------------------------------------------------------------
+ # Optional cleanup (delete inputs)
+ # ------------------------------------------------------------------
  if delete_inputs:
- files = os.listdir(os.getcwd())
- hdfs = [hdf for hdf in files if suffix in hdf]
- if output_file in hdfs:
- hdfs.remove(output_file)
- # Iterate over the files and delete them
- for hdf in hdfs:
- try:
- os.remove(hdf)
- print(f"Deleted file: {hdf}")
- except OSError as e:
- print(f"Error deleting file {hdf}: {e}")
+ out_resolved = output_path.resolve()
+ for p in h5_paths:
+ try:
+ # Don't delete the output file if it happens to be in the list
+ if p.resolve() == out_resolved:
+ continue
+ if p.exists():
+ p.unlink()
+ print(f"Deleted input file: {p}")
+ except OSError as e:
+ print(f"Error deleting file {p}: {e}")
  else:
- print('Keeping input files')
+ print("Keeping input files.")
 
- def safe_write_h5ad(adata, path, compression="gzip", backup=False, backup_dir="./uns_backups", verbose=True):
+ return output_path
+
+ def safe_write_h5ad(adata, path, compression="gzip", backup=False, backup_dir=None, verbose=True):
  """
  Save an AnnData safely by sanitizing .obs, .var, .uns, .layers, and .obsm.
 
  Returns a report dict and prints a summary of what was converted/backed up/skipped.
  """
  import os, json, pickle
+ from pathlib import Path
  import numpy as np
  import pandas as pd
  import warnings
  import anndata as _ad
 
+ path = Path(path)
+
+ if not backup_dir:
+ backup_dir = path.parent / str(path.name).split(".")[0]
+
  os.makedirs(backup_dir, exist_ok=True)
 
  # report structure
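
The hunk above replaces the old cwd-based concatenate_h5ads with a keyword-only signature that supports directory and CSV input modes. A minimal calling sketch follows; it is not part of the diff, the paths are hypothetical, and it assumes the 0.2.3 signature shown above is importable from smftools.readwrite.

# Usage sketch (not part of the diff); paths are hypothetical.
from smftools.readwrite import concatenate_h5ads

# Directory mode: concatenate the .h5ad files found in batch_h5ads/.
out = concatenate_h5ads("merged/experiment.h5ad.gz", input_dir="batch_h5ads")

# CSV mode: read input paths from the h5ad_path column of a manifest CSV.
out = concatenate_h5ads(
    "merged/experiment.h5ad.gz",
    csv_path="manifest.csv",
    csv_column="h5ad_path",
    delete_inputs=False,
)
print(out)  # Path to the written file
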
@@ -155,7 +363,7 @@ def safe_write_h5ad(adata, path, compression="gzip", backup=False, backup_dir=".
155
363
 
156
364
  def _backup(obj, name):
157
365
  """Pickle obj to backup_dir/name.pkl and return filename (or None)."""
158
- fname = os.path.join(backup_dir, f"{name}.pkl")
366
+ fname = backup_dir / f"{name}.pkl"
159
367
  try:
160
368
  with open(fname, "wb") as fh:
161
369
  pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
@@ -516,7 +724,7 @@ def safe_write_h5ad(adata, path, compression="gzip", backup=False, backup_dir=".
516
724
  print("=== end report ===\n")
517
725
  return report
518
726
 
519
- def safe_read_h5ad(path, backup_dir="./uns_backups", restore_backups=True, re_categorize=True, categorical_threshold=100, verbose=True):
727
+ def safe_read_h5ad(path, backup_dir=None, restore_backups=True, re_categorize=True, categorical_threshold=100, verbose=True):
520
728
  """
521
729
  Safely load an AnnData saved by safe_write_h5ad and attempt to restore complex objects
522
730
  from the backup_dir produced during save.
@@ -545,12 +753,18 @@ def safe_read_h5ad(path, backup_dir="./uns_backups", restore_backups=True, re_ca
545
753
  A report describing restored items, parsed JSON keys, and any failures.
546
754
  """
547
755
  import os
756
+ from pathlib import Path
548
757
  import json
549
758
  import pickle
550
759
  import numpy as np
551
760
  import pandas as pd
552
761
  import anndata as _ad
553
762
 
763
+ path = Path(path)
764
+
765
+ if not backup_dir:
766
+ backup_dir = path.parent / str(path.name).split(".")[0]
767
+
554
768
  report = {
555
769
  "restored_obs_columns": [],
556
770
  "restored_var_columns": [],
@@ -574,7 +788,6 @@ def safe_read_h5ad(path, backup_dir="./uns_backups", restore_backups=True, re_ca
574
788
  raise RuntimeError(f"Failed to read h5ad at {path}: {e}")
575
789
 
576
790
  # Ensure backup_dir exists (may be relative to cwd)
577
- backup_dir = os.path.abspath(backup_dir)
578
791
  if verbose:
579
792
  print(f"[safe_read_h5ad] looking for backups in {backup_dir}")
580
793
 
@@ -594,8 +807,8 @@ def safe_read_h5ad(path, backup_dir="./uns_backups", restore_backups=True, re_ca
594
807
  # 2) Restore obs columns
595
808
  for col in list(adata.obs.columns):
596
809
  # Look for backup with exact naming from safe_write_h5ad: "obs.<col>_backup.pkl" or "obs.<col>_categorical_backup.pkl"
597
- bname1 = os.path.join(backup_dir, f"obs.{col}_backup.pkl")
598
- bname2 = os.path.join(backup_dir, f"obs.{col}_categorical_backup.pkl")
810
+ bname1 = backup_dir / f"obs.{col}_backup.pkl"
811
+ bname2 = backup_dir / f"obs.{col}_categorical_backup.pkl"
599
812
  restored = False
600
813
 
601
814
  if restore_backups:
@@ -869,93 +1082,6 @@ def safe_read_h5ad(path, backup_dir="./uns_backups", restore_backups=True, re_ca
869
1082
 
870
1083
  return adata, report
871
1084
 
872
-
873
- # def safe_write_h5ad(adata, path, compression="gzip", backup=False, backup_dir="./", verbose=True):
874
- # """
875
- # Saves an AnnData object safely by omitting problematic columns from .obs and .var.
876
-
877
- # Parameters:
878
- # adata (AnnData): The AnnData object to save.
879
- # path (str): Output .h5ad file path.
880
- # compression (str): Compression method for h5ad file.
881
- # backup (bool): If True, saves problematic columns to CSV files.
882
- # backup_dir (str): Directory to store backups if backup=True.
883
- # """
884
- # import anndata as ad
885
- # import pandas as pd
886
- # import os
887
- # import numpy as np
888
- # import json
889
-
890
- # os.makedirs(backup_dir, exist_ok=True)
891
-
892
- # def filter_df(df, df_name):
893
- # bad_cols = []
894
- # for col in df.columns:
895
- # if df[col].dtype == 'object':
896
- # if not df[col].apply(lambda x: isinstance(x, (str, type(None)))).all():
897
- # bad_cols.append(col)
898
- # elif pd.api.types.is_categorical_dtype(df[col]):
899
- # if not all(isinstance(x, (str, type(None))) for x in df[col].cat.categories):
900
- # bad_cols.append(col)
901
- # if bad_cols and verbose:
902
- # print(f"Skipping columns from {df_name}: {bad_cols}")
903
- # if backup and bad_cols:
904
- # df[bad_cols].to_csv(os.path.join(backup_dir, f"{df_name}_skipped_columns.csv"))
905
- # if verbose:
906
- # print(f"Backed up skipped columns to {backup_dir}/{df_name}_skipped_columns.csv")
907
- # return df.drop(columns=bad_cols)
908
-
909
- # def is_serializable(val):
910
- # try:
911
- # json.dumps(val)
912
- # return True
913
- # except (TypeError, OverflowError):
914
- # return False
915
-
916
- # def clean_uns(uns_dict):
917
- # clean_uns = {}
918
- # bad_keys = []
919
- # for k, v in uns_dict.items():
920
- # if isinstance(v, (str, int, float, type(None), list, np.ndarray, pd.DataFrame, dict)):
921
- # clean_uns[k] = v
922
- # elif is_serializable(v):
923
- # clean_uns[k] = v
924
- # else:
925
- # bad_keys.append(k)
926
- # if backup:
927
- # try:
928
- # with open(os.path.join(backup_dir, f"uns_{k}_backup.txt"), "w") as f:
929
- # f.write(str(v))
930
- # except Exception:
931
- # pass
932
- # if bad_keys and verbose:
933
- # print(f"Skipping entries from .uns: {bad_keys}")
934
- # return clean_uns
935
-
936
- # # Clean obs and var and uns
937
- # obs_clean = filter_df(adata.obs, "obs")
938
- # var_clean = filter_df(adata.var, "var")
939
- # uns_clean = clean_uns(adata.uns)
940
-
941
- # # Save clean version
942
- # adata_copy = ad.AnnData(
943
- # X=adata.X,
944
- # obs=obs_clean,
945
- # var=var_clean,
946
- # layers=adata.layers,
947
- # uns=uns_clean,
948
- # obsm=adata.obsm,
949
- # varm=adata.varm
950
- # )
951
-
952
- # adata_copy.obs_names = adata_copy.obs_names.astype(str)
953
- # adata_copy.var_names = adata_copy.var_names.astype(str)
954
-
955
- # adata_copy.write_h5ad(path, compression=compression)
956
-
957
- # print(f"Saved safely to {path}")
958
-
959
1085
  def merge_barcoded_anndatas_core(adata_single, adata_double):
960
1086
  import numpy as np
961
1087
  import anndata as ad
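
The safe_write_h5ad / safe_read_h5ad changes above switch backup_dir from a fixed "./uns_backups" default to None, in which case the backup folder is derived from the h5ad file name next to the output. A sketch of that behavior, not part of the diff, using a hypothetical AnnData and hypothetical paths:

# Usage sketch (not part of the diff); object and paths are hypothetical.
import anndata as ad
import numpy as np
from smftools.readwrite import safe_write_h5ad, safe_read_h5ad

adata = ad.AnnData(X=np.zeros((3, 4)))

# With backup_dir=None (the new default), backups for results/demo.h5ad.gz
# are written under results/demo/ (the file name up to its first dot).
safe_write_h5ad(adata, "results/demo.h5ad.gz", backup=True)

# safe_read_h5ad derives the same folder and restores backed-up objects from it.
adata2, report = safe_read_h5ad("results/demo.h5ad.gz", restore_backups=True)
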
{smftools-0.2.1.dist-info → smftools-0.2.3.dist-info}/METADATA CHANGED
@@ -1,6 +1,6 @@
  Metadata-Version: 2.3
  Name: smftools
- Version: 0.2.1
+ Version: 0.2.3
  Summary: Single Molecule Footprinting Analysis in Python.
  Project-URL: Source, https://github.com/jkmckenna/smftools
  Project-URL: Documentation, https://smftools.readthedocs.io/
@@ -43,7 +43,7 @@ Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
  Classifier: Topic :: Scientific/Engineering :: Visualization
- Requires-Python: >=3.9
+ Requires-Python: <3.13,>=3.9
  Requires-Dist: anndata>=0.10.0
  Requires-Dist: biopython>=1.79
  Requires-Dist: captum
@@ -59,7 +59,8 @@ Requires-Dist: numpy<2,>=1.22.0
59
59
  Requires-Dist: omegaconf
60
60
  Requires-Dist: pandas>=1.4.2
61
61
  Requires-Dist: pod5>=0.1.21
62
- Requires-Dist: pomegranate>=1.0.0
62
+ Requires-Dist: pybedtools>=0.12.0
63
+ Requires-Dist: pybigwig>=0.3.24
63
64
  Requires-Dist: pyfaidx>=0.8.0
64
65
  Requires-Dist: pysam>=0.19.1
65
66
  Requires-Dist: scanpy>=1.9
@@ -102,12 +103,9 @@ While most genomic data structures handle low-coverage data (<100X) along large
102
103
 
103
104
  ## Dependencies
104
105
  The following CLI tools need to be installed and configured before using the informatics (smftools.inform) module of smftools:
105
- 1) [Dorado](https://github.com/nanoporetech/dorado) -> For standard/modified basecalling and alignment. Can be attained by downloading and configuring nanopore MinKnow software.
106
- 2) [Samtools](https://github.com/samtools/samtools) -> For working with SAM/BAM files
107
- 3) [Minimap2](https://github.com/lh3/minimap2) -> The aligner used by Dorado
108
- 4) [Modkit](https://github.com/nanoporetech/modkit) -> Extracting summary statistics and read level methylation calls from modified BAM files
109
- 5) [Bedtools](https://github.com/arq5x/bedtools2) -> For generating Bedgraphs from BAM alignment files.
110
- 6) [BedGraphToBigWig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) -> For converting BedGraphs to BigWig files for IGV sessions.
106
+ 1) [Dorado](https://github.com/nanoporetech/dorado) -> Basecalling, alignment, demultiplexing.
107
+ 2) [Minimap2](https://github.com/lh3/minimap2) -> Alignment if not using dorado.
108
+ 3) [Modkit](https://github.com/nanoporetech/modkit) -> Extracting read level methylation metrics from modified BAM files.
111
109
 
112
110
  ## Modules
113
111
  ### Informatics: Processes raw Nanopore/Illumina data from SMF experiments into an AnnData object.
@@ -122,6 +120,9 @@ The following CLI tools need to be installed and configured before using the inf
 
  ## Announcements
 
+ ### 11/05/25 - Version 0.2.1 is available through PyPI
+ Version 0.2.1 makes the core workflow (smftools load) a command line tool that takes in an experiment_config.csv file for input/output and parameter management.
+
  ### 05/29/25 - Version 0.1.6 is available through PyPI.
  Informatics, preprocessing, tools, plotting modules have core functionality that is approaching stability on MacOS(Intel/Silicon) and Linux(Ubuntu). I will work on improving documentation/tutorials shortly. The base PyTorch/Scikit-Learn ML-infrastructure is going through some organizational changes to work with PyTorch Lightning, Hydra, and WanDB to facilitate organizational scaling, multi-device usage, and logging.