gengeneeval 0.2.0__tar.gz → 0.2.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/PKG-INFO +37 -4
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/README.md +34 -1
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/pyproject.toml +3 -3
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/__init__.py +39 -1
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/data/__init__.py +14 -0
- gengeneeval-0.2.1/src/geneval/data/lazy_loader.py +562 -0
- gengeneeval-0.2.1/src/geneval/lazy_evaluator.py +424 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/LICENSE +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/cli.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/config.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/core.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/data/gene_expression_datamodule.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/data/loader.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/evaluator.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/evaluators/__init__.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/evaluators/base_evaluator.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/evaluators/gene_expression_evaluator.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/__init__.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/base_metric.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/correlation.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/distances.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/metrics.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/metrics/reconstruction.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/models/__init__.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/models/base_model.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/results.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/testing.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/utils/__init__.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/utils/io.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/utils/preprocessing.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/visualization/__init__.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/visualization/plots.py +0 -0
- {gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/visualization/visualizer.py +0 -0

{gengeneeval-0.2.0 → gengeneeval-0.2.1}/PKG-INFO
@@ -1,10 +1,10 @@
 Metadata-Version: 2.4
 Name: gengeneeval
-Version: 0.2.0
-Summary: Comprehensive evaluation of generated gene expression data. Computes metrics between real and generated datasets with support for condition matching, train/test splits, and publication-quality visualizations.
+Version: 0.2.1
+Summary: Comprehensive evaluation of generated gene expression data. Computes metrics between real and generated datasets with support for condition matching, train/test splits, memory-efficient lazy loading, and publication-quality visualizations.
 License: MIT
 License-File: LICENSE
-Keywords: gene expression,evaluation,metrics,single-cell,generative models,benchmarking
+Keywords: gene expression,evaluation,metrics,single-cell,generative models,benchmarking,memory-efficient
 Author: GenEval Team
 Author-email: geneval@example.com
 Requires-Python: >=3.8,<4.0
@@ -46,7 +46,7 @@ Description-Content-Type: text/markdown
 
 **Comprehensive evaluation of generated gene expression data against real datasets.**
 
-GenEval is a modular, object-oriented Python framework for computing metrics between real and generated gene expression datasets stored in AnnData (h5ad) format. It supports condition-based matching, train/test splits, and generates publication-quality visualizations.
+GenEval is a modular, object-oriented Python framework for computing metrics between real and generated gene expression datasets stored in AnnData (h5ad) format. It supports condition-based matching, train/test splits, memory-efficient lazy loading for large datasets, and generates publication-quality visualizations.
 
 ## Features
 
@@ -77,6 +77,8 @@ All metrics are computed **per-gene** (returning a vector) and **aggregated**:
 - ✅ Condition-based matching (perturbation, cell type, etc.)
 - ✅ Train/test split support
 - ✅ Per-gene and aggregate metrics
+- ✅ **Memory-efficient lazy loading** for large datasets
+- ✅ **Batched evaluation** to avoid OOM errors
 - ✅ Modular, extensible architecture
 - ✅ Command-line interface
 - ✅ Publication-quality visualizations
@@ -140,6 +142,37 @@ geneval --real real.h5ad --generated generated.h5ad \
     --output results/
 ```
 
+### Memory-Efficient Mode (for Large Datasets)
+
+For datasets too large to fit in memory, use the lazy evaluation API:
+
+```python
+from geneval import evaluate_lazy, load_data_lazy
+
+# Memory-efficient evaluation (streams data one condition at a time)
+results = evaluate_lazy(
+    real_path="large_real.h5ad",
+    generated_path="large_generated.h5ad",
+    condition_columns=["perturbation"],
+    batch_size=256,           # Process in batches
+    use_backed=True,          # Memory-mapped file access
+    output_dir="eval_output/",
+    save_per_condition=True,  # Save each condition to disk
+)
+
+# Get summary statistics
+print(results.get_summary())
+
+# Or use the lazy loader directly for custom workflows
+with load_data_lazy("real.h5ad", "gen.h5ad", ["perturbation"]) as loader:
+    print(f"Memory estimate: {loader.estimate_memory_usage()}")
+
+    # Process one condition at a time
+    for key, real, gen, info in loader.iterate_conditions():
+        # Your custom evaluation logic
+        pass
+```
+
 ## Expected Data Format
 
 GenEval expects AnnData (h5ad) files with:

{gengeneeval-0.2.0 → gengeneeval-0.2.1}/README.md
@@ -7,7 +7,7 @@
 
 **Comprehensive evaluation of generated gene expression data against real datasets.**
 
-GenEval is a modular, object-oriented Python framework for computing metrics between real and generated gene expression datasets stored in AnnData (h5ad) format. It supports condition-based matching, train/test splits, and generates publication-quality visualizations.
+GenEval is a modular, object-oriented Python framework for computing metrics between real and generated gene expression datasets stored in AnnData (h5ad) format. It supports condition-based matching, train/test splits, memory-efficient lazy loading for large datasets, and generates publication-quality visualizations.
 
 ## Features
 
@@ -38,6 +38,8 @@ All metrics are computed **per-gene** (returning a vector) and **aggregated**:
 - ✅ Condition-based matching (perturbation, cell type, etc.)
 - ✅ Train/test split support
 - ✅ Per-gene and aggregate metrics
+- ✅ **Memory-efficient lazy loading** for large datasets
+- ✅ **Batched evaluation** to avoid OOM errors
 - ✅ Modular, extensible architecture
 - ✅ Command-line interface
 - ✅ Publication-quality visualizations
@@ -101,6 +103,37 @@ geneval --real real.h5ad --generated generated.h5ad \
     --output results/
 ```
 
+### Memory-Efficient Mode (for Large Datasets)
+
+For datasets too large to fit in memory, use the lazy evaluation API:
+
+```python
+from geneval import evaluate_lazy, load_data_lazy
+
+# Memory-efficient evaluation (streams data one condition at a time)
+results = evaluate_lazy(
+    real_path="large_real.h5ad",
+    generated_path="large_generated.h5ad",
+    condition_columns=["perturbation"],
+    batch_size=256,           # Process in batches
+    use_backed=True,          # Memory-mapped file access
+    output_dir="eval_output/",
+    save_per_condition=True,  # Save each condition to disk
+)
+
+# Get summary statistics
+print(results.get_summary())
+
+# Or use the lazy loader directly for custom workflows
+with load_data_lazy("real.h5ad", "gen.h5ad", ["perturbation"]) as loader:
+    print(f"Memory estimate: {loader.estimate_memory_usage()}")
+
+    # Process one condition at a time
+    for key, real, gen, info in loader.iterate_conditions():
+        # Your custom evaluation logic
+        pass
+```
+
 ## Expected Data Format
 
 GenEval expects AnnData (h5ad) files with:

{gengeneeval-0.2.0 → gengeneeval-0.2.1}/pyproject.toml
@@ -1,13 +1,13 @@
 [tool.poetry]
 name = "gengeneeval"
-version = "0.2.0"
-description = "Comprehensive evaluation of generated gene expression data. Computes metrics between real and generated datasets with support for condition matching, train/test splits, and publication-quality visualizations."
+version = "0.2.1"
+description = "Comprehensive evaluation of generated gene expression data. Computes metrics between real and generated datasets with support for condition matching, train/test splits, memory-efficient lazy loading, and publication-quality visualizations."
 authors = ["GenEval Team <geneval@example.com>"]
 license = "MIT"
 readme = "README.md"
 homepage = "https://github.com/AndreaRubbi/GenGeneEval"
 repository = "https://github.com/AndreaRubbi/GenGeneEval"
-keywords = ["gene expression", "evaluation", "metrics", "single-cell", "generative models", "benchmarking"]
+keywords = ["gene expression", "evaluation", "metrics", "single-cell", "generative models", "benchmarking", "memory-efficient"]
 classifiers = [
     "Development Status :: 4 - Beta",
     "Intended Audience :: Science/Research",

{gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/__init__.py
@@ -8,6 +8,7 @@ Features:
 - Multiple distance and correlation metrics (per-gene and aggregate)
 - Condition-based matching (perturbation, cell type, etc.)
 - Train/test split support
+- Memory-efficient lazy loading for large datasets
 - Publication-quality visualizations
 - Command-line interface
 
@@ -20,12 +21,22 @@ Quick Start:
     ...     output_dir="output/"
     ... )
 
+Memory-Efficient Mode (for large datasets):
+    >>> from geneval import evaluate_lazy
+    >>> results = evaluate_lazy(
+    ...     real_path="real.h5ad",
+    ...     generated_path="generated.h5ad",
+    ...     condition_columns=["perturbation"],
+    ...     batch_size=256,
+    ...     use_backed=True,  # Memory-mapped access
+    ... )
+
 CLI Usage:
     $ geneval --real real.h5ad --generated generated.h5ad \\
         --conditions perturbation cell_type --output results/
 """
 
-__version__ = "0.2.0"
+__version__ = "0.2.1"
 __author__ = "GenEval Team"
 
 # Main evaluation interface
@@ -35,12 +46,26 @@ from .evaluator import (
     MetricRegistry,
 )
 
+# Memory-efficient evaluation
+from .lazy_evaluator import (
+    evaluate_lazy,
+    MemoryEfficientEvaluator,
+    StreamingEvaluationResult,
+)
+
 # Data loading
 from .data.loader import (
     GeneExpressionDataLoader,
     load_data,
 )
 
+# Memory-efficient data loading
+from .data.lazy_loader import (
+    LazyGeneExpressionDataLoader,
+    load_data_lazy,
+    ConditionBatch,
+)
+
 # Results
 from .results import (
     EvaluationResult,
@@ -99,9 +124,17 @@ __all__ = [
     "evaluate",
     "GeneEvalEvaluator",
     "MetricRegistry",
+    # Memory-efficient evaluation
+    "evaluate_lazy",
+    "MemoryEfficientEvaluator",
+    "StreamingEvaluationResult",
     # Data loading
     "GeneExpressionDataLoader",
     "load_data",
+    # Memory-efficient data loading
+    "LazyGeneExpressionDataLoader",
+    "load_data_lazy",
+    "ConditionBatch",
     # Results
     "EvaluationResult",
     "SplitResult",
@@ -123,6 +156,11 @@ __all__ = [
     "EnergyDistance",
     "MultivariateWasserstein",
     "MultivariateMMD",
+    # Reconstruction metrics
+    "MSEDistance",
+    "RMSEDistance",
+    "MAEDistance",
+    "R2Score",
     # Visualization
     "EvaluationVisualizer",
     "visualize",

{gengeneeval-0.2.0 → gengeneeval-0.2.1}/src/geneval/data/__init__.py
@@ -2,6 +2,7 @@
 Data loading module for gene expression evaluation.
 
 Provides data loaders for paired real and generated datasets.
+Includes both standard and memory-efficient lazy loading options.
 """
 
 from .loader import (
@@ -9,15 +10,28 @@ from .loader import (
     load_data,
     DataLoaderError,
 )
+from .lazy_loader import (
+    LazyGeneExpressionDataLoader,
+    load_data_lazy,
+    LazyDataLoaderError,
+    ConditionBatch,
+)
 from .gene_expression_datamodule import (
     GeneExpressionDataModule,
     DataModuleError,
 )
 
 __all__ = [
+    # Standard loader
     "GeneExpressionDataLoader",
     "load_data",
     "DataLoaderError",
+    # Lazy loader (memory-efficient)
+    "LazyGeneExpressionDataLoader",
+    "load_data_lazy",
+    "LazyDataLoaderError",
+    "ConditionBatch",
+    # DataModule
     "GeneExpressionDataModule",
     "DataModuleError",
 ]

gengeneeval-0.2.1/src/geneval/data/lazy_loader.py
@@ -0,0 +1,562 @@
+"""
+Memory-efficient lazy data loader for large-scale gene expression datasets.
+
+Provides lazy loading and batched iteration over AnnData h5ad files without
+loading entire datasets into memory. Supports backed mode for very large files.
+"""
+from __future__ import annotations
+
+from typing import Optional, List, Union, Dict, Tuple, Iterator, Generator
+from pathlib import Path
+import warnings
+import numpy as np
+import pandas as pd
+from scipy import sparse
+from dataclasses import dataclass
+
+try:
+    import anndata as ad
+    import scanpy as sc
+except ImportError:
+    raise ImportError("anndata and scanpy are required. Install with: pip install anndata scanpy")
+
+
+class LazyDataLoaderError(Exception):
+    """Custom exception for lazy data loading errors."""
+    pass
+
+
+@dataclass
+class ConditionBatch:
+    """Container for a batch of samples from a condition."""
+    condition_key: str
+    condition_info: Dict[str, str]
+    real_data: np.ndarray
+    generated_data: np.ndarray
+    batch_idx: int
+    n_batches: int
+    is_last_batch: bool
+
+
+class LazyGeneExpressionDataLoader:
+    """
+    Memory-efficient lazy data loader for paired gene expression datasets.
+
+    Unlike GeneExpressionDataLoader, this class:
+    - Uses backed mode for h5ad files to avoid loading entire datasets
+    - Supports batched iteration over conditions
+    - Only loads data into memory when explicitly requested
+    - Provides memory usage estimates
+
+    Parameters
+    ----------
+    real_path : str or Path
+        Path to real data h5ad file
+    generated_path : str or Path
+        Path to generated data h5ad file
+    condition_columns : List[str]
+        Columns to match between datasets
+    split_column : str, optional
+        Column indicating train/test split
+    batch_size : int
+        Maximum number of samples per batch when iterating
+    use_backed : bool
+        If True, use backed mode (memory-mapped). May be slower but uses minimal memory.
+        If False, loads full file but processes in batches.
+    min_samples_per_condition : int
+        Minimum samples required per condition to include
+    """
+
+    def __init__(
+        self,
+        real_path: Union[str, Path],
+        generated_path: Union[str, Path],
+        condition_columns: List[str],
+        split_column: Optional[str] = None,
+        batch_size: int = 256,
+        use_backed: bool = False,
+        min_samples_per_condition: int = 2,
+    ):
+        self.real_path = Path(real_path)
+        self.generated_path = Path(generated_path)
+        self.condition_columns = condition_columns
+        self.split_column = split_column
+        self.batch_size = batch_size
+        self.use_backed = use_backed
+        self.min_samples_per_condition = min_samples_per_condition
+
+        # Lazy-loaded references (backed or full)
+        self._real: Optional[ad.AnnData] = None
+        self._generated: Optional[ad.AnnData] = None
+
+        # Metadata (always loaded - lightweight)
+        self._real_obs: Optional[pd.DataFrame] = None
+        self._generated_obs: Optional[pd.DataFrame] = None
+        self._real_var_names: Optional[pd.Index] = None
+        self._generated_var_names: Optional[pd.Index] = None
+
+        # Gene alignment info
+        self._common_genes: Optional[List[str]] = None
+        self._real_gene_idx: Optional[np.ndarray] = None
+        self._gen_gene_idx: Optional[np.ndarray] = None
+
+        # Pre-computed condition indices for fast access
+        self._condition_indices: Optional[Dict[str, Dict[str, np.ndarray]]] = None
+
+        # State
+        self._is_initialized = False
+
+    def initialize(self) -> "LazyGeneExpressionDataLoader":
+        """
+        Initialize loader by reading metadata only (not expression data).
+
+        This loads obs DataFrames and var_names to prepare for iteration,
+        but does not load the expression matrices.
+
+        Returns
+        -------
+        self
+            For method chaining
+        """
+        if self._is_initialized:
+            return self
+
+        # Validate paths
+        if not self.real_path.exists():
+            raise LazyDataLoaderError(f"Real data file not found: {self.real_path}")
+        if not self.generated_path.exists():
+            raise LazyDataLoaderError(f"Generated data file not found: {self.generated_path}")
+
+        # Load metadata only (obs and var_names)
+        # This is much faster and lighter than loading full data
+        self._load_metadata()
+
+        # Validate columns
+        self._validate_columns()
+
+        # Compute gene alignment indices
+        self._compute_gene_alignment()
+
+        # Pre-compute condition indices
+        self._precompute_condition_indices()
+
+        self._is_initialized = True
+        return self
+
+    def _load_metadata(self):
+        """Load only metadata (obs, var_names) without expression data."""
+        # For backed mode, we open but don't load X
+        # For non-backed, we still only read metadata initially
+
+        # Try backed mode if requested
+        if self.use_backed:
+            try:
+                self._real = sc.read_h5ad(self.real_path, backed='r')
+                self._generated = sc.read_h5ad(self.generated_path, backed='r')
+            except Exception as e:
+                warnings.warn(f"Backed mode failed, falling back to standard loading: {e}")
+                self._real = None
+                self._generated = None
+
+        if self._real is None:
+            # Load only what we need: obs and var
+            # For very large files, read in low-memory mode
+            self._real = sc.read_h5ad(self.real_path)
+            self._generated = sc.read_h5ad(self.generated_path)
+
+        # Cache lightweight metadata
+        self._real_obs = self._real.obs.copy()
+        self._generated_obs = self._generated.obs.copy()
+        self._real_var_names = pd.Index(self._real.var_names.astype(str))
+        self._generated_var_names = pd.Index(self._generated.var_names.astype(str))
+
+    def _validate_columns(self):
+        """Validate that required columns exist."""
+        for col in self.condition_columns:
+            if col not in self._real_obs.columns:
+                raise LazyDataLoaderError(
+                    f"Condition column '{col}' not found in real data. "
+                    f"Available: {list(self._real_obs.columns)}"
+                )
+            if col not in self._generated_obs.columns:
+                raise LazyDataLoaderError(
+                    f"Condition column '{col}' not found in generated data. "
+                    f"Available: {list(self._generated_obs.columns)}"
+                )
+
+    def _compute_gene_alignment(self):
+        """Pre-compute gene alignment indices."""
+        common = self._real_var_names.intersection(self._generated_var_names)
+
+        if len(common) == 0:
+            raise LazyDataLoaderError(
+                "No overlapping genes between real and generated data."
+            )
+
+        self._common_genes = common.tolist()
+        self._real_gene_idx = self._real_var_names.get_indexer(common)
+        self._gen_gene_idx = self._generated_var_names.get_indexer(common)
+
+        n_real_only = len(self._real_var_names) - len(common)
+        n_gen_only = len(self._generated_var_names) - len(common)
+
+        if n_real_only > 0 or n_gen_only > 0:
+            warnings.warn(
+                f"Gene alignment: {len(common)} common genes. "
+                f"Dropped {n_real_only} from real, {n_gen_only} from generated."
+            )
+
+    def _get_condition_key(self, row: pd.Series) -> str:
+        """Generate unique key for a condition combination."""
+        return "####".join([str(row[c]) for c in self.condition_columns])
+
+    def _precompute_condition_indices(self):
+        """Pre-compute sample indices for each condition (lightweight)."""
+        self._condition_indices = {"real": {}, "generated": {}}
+
+        # Real data conditions
+        real_conditions = self._real_obs[self.condition_columns].astype(str).drop_duplicates()
+        for _, row in real_conditions.iterrows():
+            key = self._get_condition_key(row)
+            mask = np.ones(len(self._real_obs), dtype=bool)
+            for col in self.condition_columns:
+                mask &= (self._real_obs[col].astype(str) == str(row[col])).values
+
+            indices = np.where(mask)[0]
+            if len(indices) >= self.min_samples_per_condition:
+                self._condition_indices["real"][key] = indices
+
+        # Generated data conditions
+        gen_conditions = self._generated_obs[self.condition_columns].astype(str).drop_duplicates()
+        for _, row in gen_conditions.iterrows():
+            key = self._get_condition_key(row)
+            mask = np.ones(len(self._generated_obs), dtype=bool)
+            for col in self.condition_columns:
+                mask &= (self._generated_obs[col].astype(str) == str(row[col])).values
+
+            indices = np.where(mask)[0]
+            if len(indices) >= self.min_samples_per_condition:
+                self._condition_indices["generated"][key] = indices
+
+    def get_splits(self) -> List[str]:
+        """Get available splits."""
+        if not self._is_initialized:
+            self.initialize()
+
+        if self.split_column is None or self.split_column not in self._real_obs.columns:
+            return ["all"]
+
+        return list(self._real_obs[self.split_column].astype(str).unique())
+
+    def get_common_conditions(self, split: Optional[str] = None) -> List[str]:
+        """Get conditions present in both real and generated data."""
+        if not self._is_initialized:
+            self.initialize()
+
+        real_keys = set(self._condition_indices["real"].keys())
+        gen_keys = set(self._condition_indices["generated"].keys())
+
+        common = real_keys & gen_keys
+
+        # Filter by split if specified
+        if split is not None and split != "all" and self.split_column is not None:
+            filtered = set()
+            for key in common:
+                real_idx = self._condition_indices["real"][key]
+                split_vals = self._real_obs.iloc[real_idx][self.split_column].astype(str)
+                if (split_vals == split).any():
+                    filtered.add(key)
+            common = filtered
+
+        return sorted(common)
+
+    def _extract_data_subset(
+        self,
+        adata: ad.AnnData,
+        indices: np.ndarray,
+        gene_idx: np.ndarray,
+    ) -> np.ndarray:
+        """Extract and align a subset of data."""
+        # Handle backed vs loaded data
+        if hasattr(adata, 'isbacked') and adata.isbacked:
+            # Backed mode: read only what we need
+            X = adata.X[indices][:, gene_idx]
+        else:
+            # Standard mode
+            X = adata.X[indices][:, gene_idx]
+
+        # Convert to dense if sparse
+        if sparse.issparse(X):
+            X = X.toarray()
+
+        return np.asarray(X, dtype=np.float32)
+
+    def iterate_conditions(
+        self,
+        split: Optional[str] = None,
+    ) -> Generator[Tuple[str, np.ndarray, np.ndarray, Dict[str, str]], None, None]:
+        """
+        Iterate over conditions, loading one condition at a time.
+
+        This loads data for one condition, yields it, then releases memory
+        before loading the next condition.
+
+        Parameters
+        ----------
+        split : str, optional
+            Filter to this split only
+
+        Yields
+        ------
+        Tuple[str, np.ndarray, np.ndarray, Dict[str, str]]
+            (condition_key, real_data, generated_data, condition_info)
+        """
+        if not self._is_initialized:
+            self.initialize()
+
+        common_conditions = self.get_common_conditions(split)
+
+        for key in common_conditions:
+            real_indices = self._condition_indices["real"][key]
+            gen_indices = self._condition_indices["generated"][key]
+
+            # Filter by split if needed
+            if split is not None and split != "all" and self.split_column is not None:
+                split_mask = self._real_obs.iloc[real_indices][self.split_column].astype(str) == split
+                real_indices = real_indices[split_mask.values]
+
+            if len(real_indices) < self.min_samples_per_condition:
+                continue
+
+            # Load data for this condition only
+            real_data = self._extract_data_subset(
+                self._real, real_indices, self._real_gene_idx
+            )
+            gen_data = self._extract_data_subset(
+                self._generated, gen_indices, self._gen_gene_idx
+            )
+
+            # Parse condition info
+            parts = key.split("####")
+            condition_info = dict(zip(self.condition_columns, parts))
+
+            yield key, real_data, gen_data, condition_info
+
+    def iterate_conditions_batched(
+        self,
+        split: Optional[str] = None,
+        batch_size: Optional[int] = None,
+    ) -> Generator[ConditionBatch, None, None]:
+        """
+        Iterate over conditions in batches for memory efficiency.
+
+        Useful when even a single condition is too large to fit in memory.
+
+        Parameters
+        ----------
+        split : str, optional
+            Filter to this split only
+        batch_size : int, optional
+            Override default batch size
+
+        Yields
+        ------
+        ConditionBatch
+            Batch of samples from a condition
+        """
+        if not self._is_initialized:
+            self.initialize()
+
+        batch_size = batch_size or self.batch_size
+        common_conditions = self.get_common_conditions(split)
+
+        for key in common_conditions:
+            real_indices = self._condition_indices["real"][key]
+            gen_indices = self._condition_indices["generated"][key]
+
+            # Filter by split
+            if split is not None and split != "all" and self.split_column is not None:
+                split_mask = self._real_obs.iloc[real_indices][self.split_column].astype(str) == split
+                real_indices = real_indices[split_mask.values]
+
+            if len(real_indices) < self.min_samples_per_condition:
+                continue
+
+            # Parse condition info
+            parts = key.split("####")
+            condition_info = dict(zip(self.condition_columns, parts))
+
+            # Calculate number of batches (use max of real/gen for alignment)
+            n_real = len(real_indices)
+            n_gen = len(gen_indices)
+            n_batches = max(
+                (n_real + batch_size - 1) // batch_size,
+                (n_gen + batch_size - 1) // batch_size
+            )
+
+            for batch_idx in range(n_batches):
+                start_real = batch_idx * batch_size
+                end_real = min(start_real + batch_size, n_real)
+                start_gen = batch_idx * batch_size
+                end_gen = min(start_gen + batch_size, n_gen)
+
+                # Handle case where one dataset is smaller
+                if start_real >= n_real:
+                    # Wrap around for real data
+                    batch_real_idx = real_indices[start_real % n_real:end_real % n_real + 1]
+                else:
+                    batch_real_idx = real_indices[start_real:end_real]
+
+                if start_gen >= n_gen:
+                    batch_gen_idx = gen_indices[start_gen % n_gen:end_gen % n_gen + 1]
+                else:
+                    batch_gen_idx = gen_indices[start_gen:end_gen]
+
+                if len(batch_real_idx) == 0 or len(batch_gen_idx) == 0:
+                    continue
+
+                real_data = self._extract_data_subset(
+                    self._real, batch_real_idx, self._real_gene_idx
+                )
+                gen_data = self._extract_data_subset(
+                    self._generated, batch_gen_idx, self._gen_gene_idx
+                )
+
+                yield ConditionBatch(
+                    condition_key=key,
+                    condition_info=condition_info,
+                    real_data=real_data,
+                    generated_data=gen_data,
+                    batch_idx=batch_idx,
+                    n_batches=n_batches,
+                    is_last_batch=(batch_idx == n_batches - 1),
+                )
+
+    @property
+    def gene_names(self) -> List[str]:
+        """Get common gene names."""
+        if not self._is_initialized:
+            self.initialize()
+        return self._common_genes
+
+    @property
+    def n_genes(self) -> int:
+        """Number of common genes."""
+        return len(self.gene_names)
+
+    def estimate_memory_usage(self) -> Dict[str, float]:
+        """
+        Estimate memory usage in MB for different loading strategies.
+
+        Returns
+        -------
+        Dict[str, float]
+            Memory estimates in MB
+        """
+        if not self._is_initialized:
+            self.initialize()
+
+        n_real = len(self._real_obs)
+        n_gen = len(self._generated_obs)
+        n_genes = self.n_genes
+
+        # 4 bytes per float32
+        bytes_per_element = 4
+
+        full_real = n_real * n_genes * bytes_per_element / 1e6
+        full_gen = n_gen * n_genes * bytes_per_element / 1e6
+
+        # Average condition size
+        n_conditions = len(self.get_common_conditions())
+        avg_per_condition = (n_real + n_gen) / max(n_conditions, 1)
+        per_condition = avg_per_condition * n_genes * bytes_per_element / 1e6
+
+        per_batch = self.batch_size * 2 * n_genes * bytes_per_element / 1e6
+
+        return {
+            "full_load_mb": full_real + full_gen,
+            "per_condition_mb": per_condition,
+            "per_batch_mb": per_batch,
+            "metadata_mb": (n_real + n_gen) * 100 / 1e6,  # rough obs estimate
+        }
+
+    def close(self):
+        """Close backed file handles if any."""
+        if self._real is not None and hasattr(self._real, 'file'):
+            try:
+                self._real.file.close()
+            except:
+                pass
+        if self._generated is not None and hasattr(self._generated, 'file'):
+            try:
+                self._generated.file.close()
+            except:
+                pass
+
+    def __enter__(self):
+        """Context manager entry."""
+        self.initialize()
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Context manager exit."""
+        self.close()
+
+    def __repr__(self) -> str:
+        if not self._is_initialized:
+            return f"LazyGeneExpressionDataLoader(not initialized, backed={self.use_backed})"
+
+        return (
+            f"LazyGeneExpressionDataLoader("
+            f"real={len(self._real_obs)}x{len(self._real_var_names)}, "
+            f"gen={len(self._generated_obs)}x{len(self._generated_var_names)}, "
+            f"common_genes={self.n_genes}, "
+            f"batch_size={self.batch_size})"
+        )
+
+
+def load_data_lazy(
+    real_path: Union[str, Path],
+    generated_path: Union[str, Path],
+    condition_columns: List[str],
+    split_column: Optional[str] = None,
+    batch_size: int = 256,
+    use_backed: bool = False,
+    **kwargs
+) -> LazyGeneExpressionDataLoader:
+    """
+    Convenience function to create a lazy data loader.
+
+    Parameters
+    ----------
+    real_path : str or Path
+        Path to real data h5ad file
+    generated_path : str or Path
+        Path to generated data h5ad file
+    condition_columns : List[str]
+        Columns to match between datasets
+    split_column : str, optional
+        Column indicating train/test split
+    batch_size : int
+        Maximum samples per batch
+    use_backed : bool
+        Use memory-mapped access for very large files
+    **kwargs
+        Additional arguments for LazyGeneExpressionDataLoader
+
+    Returns
+    -------
+    LazyGeneExpressionDataLoader
+        Initialized lazy data loader
+    """
+    loader = LazyGeneExpressionDataLoader(
+        real_path=real_path,
+        generated_path=generated_path,
+        condition_columns=condition_columns,
+        split_column=split_column,
+        batch_size=batch_size,
+        use_backed=use_backed,
+        **kwargs
+    )
+    loader.initialize()
+    return loader
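
A minimal usage sketch of the batched iterator added above (not part of the package diff): the file names and the `perturbation` column are placeholder assumptions, and only APIs defined in the new `lazy_loader.py` (`load_data_lazy`, `iterate_conditions_batched`, the `ConditionBatch` fields) are used.

```python
# Sketch only: placeholder paths and condition column; loader API as added
# in src/geneval/data/lazy_loader.py above.
import numpy as np

from geneval.data.lazy_loader import load_data_lazy

with load_data_lazy("real.h5ad", "gen.h5ad", ["perturbation"], batch_size=128) as loader:
    sums: dict = {}
    counts: dict = {}
    for batch in loader.iterate_conditions_batched():
        key = batch.condition_key
        # Accumulate per-gene sums so a full condition never sits in memory.
        sums[key] = sums.get(key, 0.0) + batch.real_data.sum(axis=0)
        counts[key] = counts.get(key, 0) + batch.real_data.shape[0]
        if batch.is_last_batch:
            per_gene_mean = sums[key] / counts[key]
            print(key, float(np.mean(per_gene_mean)))
```

Accumulating running sums per batch keeps peak memory at roughly one `ConditionBatch`, regardless of how many samples a condition contains.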

gengeneeval-0.2.1/src/geneval/lazy_evaluator.py
@@ -0,0 +1,424 @@
+"""
+Memory-efficient evaluator for large-scale gene expression datasets.
+
+Uses lazy loading and batched processing to minimize memory footprint.
+"""
+from __future__ import annotations
+
+from typing import Dict, List, Optional, Union, Type, Any, Generator
+from pathlib import Path
+import numpy as np
+import warnings
+from dataclasses import dataclass, field
+import gc
+
+from .data.lazy_loader import (
+    LazyGeneExpressionDataLoader,
+    load_data_lazy,
+    ConditionBatch,
+)
+from .metrics.base_metric import BaseMetric, MetricResult
+from .metrics.correlation import (
+    PearsonCorrelation,
+    SpearmanCorrelation,
+    MeanPearsonCorrelation,
+    MeanSpearmanCorrelation,
+)
+from .metrics.distances import (
+    Wasserstein1Distance,
+    Wasserstein2Distance,
+    MMDDistance,
+    EnergyDistance,
+)
+from .metrics.reconstruction import (
+    MSEDistance,
+)
+
+# These multivariate metrics don't support batched computation
+from .metrics.distances import MultivariateWasserstein, MultivariateMMD
+
+
+# Metrics that support incremental/batched computation
+BATCHABLE_METRICS = [
+    MSEDistance,
+    PearsonCorrelation,
+    SpearmanCorrelation,
+]
+
+# Metrics that require full data
+NON_BATCHABLE_METRICS = [
+    Wasserstein1Distance,
+    Wasserstein2Distance,
+    MMDDistance,
+    EnergyDistance,
+    MultivariateWasserstein,
+    MultivariateMMD,
+]
+
+
+@dataclass
+class StreamingMetricAccumulator:
+    """Accumulates values for streaming mean/std computation."""
+    n: int = 0
+    sum: float = 0.0
+    sum_sq: float = 0.0
+
+    def add(self, value: float, count: int = 1):
+        """Add a value (or batch of values with same value)."""
+        self.n += count
+        self.sum += value * count
+        self.sum_sq += (value ** 2) * count
+
+    def add_batch(self, values: np.ndarray):
+        """Add multiple values."""
+        self.n += len(values)
+        self.sum += np.sum(values)
+        self.sum_sq += np.sum(values ** 2)
+
+    @property
+    def mean(self) -> float:
+        return self.sum / self.n if self.n > 0 else 0.0
+
+    @property
+    def std(self) -> float:
+        if self.n <= 1:
+            return 0.0
+        variance = (self.sum_sq / self.n) - (self.mean ** 2)
+        return np.sqrt(max(0, variance))
+
+
+@dataclass
+class StreamingConditionResult:
+    """Lightweight result for a single condition."""
+    condition_key: str
+    n_real_samples: int = 0
+    n_generated_samples: int = 0
+    metrics: Dict[str, float] = field(default_factory=dict)
+    real_mean: Optional[np.ndarray] = None
+    generated_mean: Optional[np.ndarray] = None
+
+
+@dataclass
+class StreamingEvaluationResult:
+    """Memory-efficient evaluation result that streams to disk."""
+    output_dir: Path
+    n_conditions: int = 0
+    metric_accumulators: Dict[str, StreamingMetricAccumulator] = field(default_factory=dict)
+    condition_keys: List[str] = field(default_factory=list)
+
+    def add_condition(self, result: StreamingConditionResult):
+        """Add a condition result and update accumulators."""
+        self.n_conditions += 1
+        self.condition_keys.append(result.condition_key)
+
+        for metric_name, value in result.metrics.items():
+            if metric_name not in self.metric_accumulators:
+                self.metric_accumulators[metric_name] = StreamingMetricAccumulator()
+            self.metric_accumulators[metric_name].add(value)
+
+    def get_summary(self) -> Dict[str, Dict[str, float]]:
+        """Get summary statistics."""
+        summary = {}
+        for name, acc in self.metric_accumulators.items():
+            summary[name] = {
+                "mean": acc.mean,
+                "std": acc.std,
+                "n": acc.n,
+            }
+        return summary
+
+    def save_summary(self):
+        """Save summary to output directory."""
+        import json
+
+        self.output_dir.mkdir(parents=True, exist_ok=True)
+
+        summary = {
+            "n_conditions": self.n_conditions,
+            "metrics": self.get_summary(),
+            "condition_keys": self.condition_keys,
+        }
+
+        with open(self.output_dir / "summary.json", "w") as f:
+            json.dump(summary, f, indent=2)
+
+
+class MemoryEfficientEvaluator:
+    """
+    Memory-efficient evaluator using lazy loading and batched processing.
+
+    Features:
+    - Lazy data loading (one condition at a time)
+    - Batched processing within conditions
+    - Streaming metric accumulation
+    - Periodic garbage collection
+    - Progress streaming to disk
+
+    Parameters
+    ----------
+    data_loader : LazyGeneExpressionDataLoader
+        Lazy data loader
+    metrics : List[BaseMetric], optional
+        Metrics to compute. Note: Some metrics (like MMD) may not support
+        batched computation and will use full condition data.
+    batch_size : int
+        Batch size for within-condition processing
+    gc_every_n_conditions : int
+        Run garbage collection every N conditions
+    verbose : bool
+        Print progress
+    """
+
+    def __init__(
+        self,
+        data_loader: LazyGeneExpressionDataLoader,
+        metrics: Optional[List[Union[BaseMetric, Type[BaseMetric]]]] = None,
+        batch_size: int = 256,
+        gc_every_n_conditions: int = 10,
+        verbose: bool = True,
+    ):
+        self.data_loader = data_loader
+        self.batch_size = batch_size
+        self.gc_every_n_conditions = gc_every_n_conditions
+        self.verbose = verbose
+
+        # Initialize metrics
+        self.metrics: List[BaseMetric] = []
+        metric_classes = metrics or [
+            MSEDistance,
+            PearsonCorrelation,
+            SpearmanCorrelation,
+            MeanPearsonCorrelation,
+            MeanSpearmanCorrelation,
+        ]
+
+        for m in metric_classes:
+            if isinstance(m, type):
+                self.metrics.append(m())
+            else:
+                self.metrics.append(m)
+
+    def _log(self, msg: str):
+        if self.verbose:
+            print(msg)
+
+    def evaluate(
+        self,
+        split: Optional[str] = None,
+        output_dir: Optional[Union[str, Path]] = None,
+        save_per_condition: bool = False,
+    ) -> StreamingEvaluationResult:
+        """
+        Run memory-efficient evaluation.
+
+        Parameters
+        ----------
+        split : str, optional
+            Split to evaluate
+        output_dir : str or Path, optional
+            Directory to save results. If provided, results are streamed to disk.
+        save_per_condition : bool
+            If True, save individual condition results to disk
+
+        Returns
+        -------
+        StreamingEvaluationResult
+            Evaluation result with aggregated metrics
+        """
+        if output_dir is not None:
+            output_dir = Path(output_dir)
+            output_dir.mkdir(parents=True, exist_ok=True)
+        else:
+            output_dir = Path(".")
+
+        result = StreamingEvaluationResult(output_dir=output_dir)
+
+        # Get conditions
+        conditions = self.data_loader.get_common_conditions(split)
+        self._log(f"Evaluating {len(conditions)} conditions")
+        self._log(f"Memory estimate: {self.data_loader.estimate_memory_usage()}")
+
+        # Iterate conditions (one at a time in memory)
+        for i, (cond_key, real_data, gen_data, cond_info) in enumerate(
+            self.data_loader.iterate_conditions(split)
+        ):
+            if self.verbose and (i + 1) % 10 == 0:
+                self._log(f"  Processing {i + 1}/{len(conditions)}: {cond_key}")
+
+            # Compute metrics for this condition
+            cond_result = self._evaluate_condition(
+                cond_key, real_data, gen_data, cond_info
+            )
+
+            # Add to streaming result
+            result.add_condition(cond_result)
+
+            # Optionally save per-condition result
+            if save_per_condition and output_dir:
+                self._save_condition_result(cond_result, output_dir)
+
+            # Periodic garbage collection
+            if (i + 1) % self.gc_every_n_conditions == 0:
+                gc.collect()
+
+        # Final summary
+        result.save_summary()
+
+        if self.verbose:
+            self._print_summary(result)
+
+        return result
+
+    def _evaluate_condition(
+        self,
+        cond_key: str,
+        real_data: np.ndarray,
+        gen_data: np.ndarray,
+        cond_info: Dict[str, str],
+    ) -> StreamingConditionResult:
+        """Evaluate a single condition."""
+        result = StreamingConditionResult(
+            condition_key=cond_key,
+            n_real_samples=real_data.shape[0],
+            n_generated_samples=gen_data.shape[0],
+        )
+
+        # Compute means
+        result.real_mean = real_data.mean(axis=0)
+        result.generated_mean = gen_data.mean(axis=0)
+
+        # Compute metrics
+        for metric in self.metrics:
+            try:
+                metric_result = metric.compute(
+                    real=real_data,
+                    generated=gen_data,
+                    gene_names=self.data_loader.gene_names,
+                    aggregate_method="mean",
+                    condition=cond_key,
+                )
+                result.metrics[metric.name] = metric_result.aggregate_value
+            except Exception as e:
+                warnings.warn(f"Failed to compute {metric.name} for {cond_key}: {e}")
+
+        return result
+
+    def _save_condition_result(
+        self,
+        result: StreamingConditionResult,
+        output_dir: Path,
+    ):
+        """Save a single condition result to disk."""
+        import json
+
+        condition_dir = output_dir / "conditions"
+        condition_dir.mkdir(exist_ok=True)
+
+        # Safe filename
+        safe_key = result.condition_key.replace("/", "_").replace("\\", "_")
+
+        data = {
+            "condition_key": result.condition_key,
+            "n_real": result.n_real_samples,
+            "n_generated": result.n_generated_samples,
+            "metrics": result.metrics,
+        }
+
+        with open(condition_dir / f"{safe_key}.json", "w") as f:
+            json.dump(data, f, indent=2)
+
+    def _print_summary(self, result: StreamingEvaluationResult):
+        """Print summary."""
+        self._log("\n" + "=" * 60)
+        self._log("EVALUATION SUMMARY (Memory-Efficient)")
+        self._log("=" * 60)
+        self._log(f"Conditions evaluated: {result.n_conditions}")
+        self._log("-" * 40)
+
+        for name, stats in result.get_summary().items():
+            self._log(f"  {name}: {stats['mean']:.4f} ± {stats['std']:.4f}")
+
+        self._log("=" * 60)
+
+
+def evaluate_lazy(
+    real_path: Union[str, Path],
+    generated_path: Union[str, Path],
+    condition_columns: List[str],
+    split_column: Optional[str] = None,
+    output_dir: Optional[Union[str, Path]] = None,
+    batch_size: int = 256,
+    use_backed: bool = False,
+    metrics: Optional[List[Union[BaseMetric, Type[BaseMetric]]]] = None,
+    verbose: bool = True,
+    save_per_condition: bool = False,
+    **kwargs
+) -> StreamingEvaluationResult:
+    """
+    Memory-efficient evaluation using lazy loading.
+
+    Use this function for large datasets that don't fit in memory.
+
+    Parameters
+    ----------
+    real_path : str or Path
+        Path to real data h5ad file
+    generated_path : str or Path
+        Path to generated data h5ad file
+    condition_columns : List[str]
+        Columns to match between datasets
+    split_column : str, optional
+        Column for train/test split
+    output_dir : str or Path, optional
+        Directory to save results
+    batch_size : int
+        Batch size for processing
+    use_backed : bool
+        Use memory-mapped file access (for very large files)
+    metrics : List, optional
+        Metrics to compute
+    verbose : bool
+        Print progress
+    save_per_condition : bool
+        Save individual condition results
+
+    Returns
+    -------
+    StreamingEvaluationResult
+        Aggregated evaluation results
+
+    Examples
+    --------
+    >>> # For large datasets that don't fit in memory
+    >>> results = evaluate_lazy(
+    ...     "real.h5ad",
+    ...     "generated.h5ad",
+    ...     condition_columns=["perturbation"],
+    ...     output_dir="eval_output/",
+    ...     batch_size=256,
+    ...     use_backed=True,  # Memory-mapped for very large files
+    ... )
+    >>> print(results.get_summary())
+    """
+    # Create lazy loader
+    with load_data_lazy(
+        real_path=real_path,
+        generated_path=generated_path,
+        condition_columns=condition_columns,
+        split_column=split_column,
+        batch_size=batch_size,
+        use_backed=use_backed,
+    ) as loader:
+        # Create evaluator
+        evaluator = MemoryEfficientEvaluator(
+            data_loader=loader,
+            metrics=metrics,
+            batch_size=batch_size,
+            verbose=verbose,
+        )
+
+        # Run evaluation
+        return evaluator.evaluate(
+            output_dir=output_dir,
+            save_per_condition=save_per_condition,
+        )
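
A note on the streaming statistics above: `StreamingMetricAccumulator` stores only `n`, `sum`, and `sum_sq`, recovering the population standard deviation from the identity Var(x) = E[x²] − E[x]². A self-contained check against NumPy (the sample values are invented for illustration):

```python
import numpy as np

# Hypothetical per-condition metric values, made up for this check.
values = np.array([0.82, 0.91, 0.77, 0.88, 0.95])

# What the accumulator keeps while streaming.
n = len(values)
total = values.sum()
total_sq = (values ** 2).sum()

# Mean and std recovered from the running sums, as in the class above.
mean = total / n
std = np.sqrt(max(0.0, total_sq / n - mean ** 2))

assert np.isclose(mean, values.mean())
assert np.isclose(std, values.std())  # np.std defaults to ddof=0 (population std)
print(mean, std)
```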